SpringOne 2020
“Sh*^%# on Fire, Yo!”: A True Story Inspired by Real Events
James Webb, MTS at T-Mobile
Brendan Aye, Technical Director, Platform Architecture at T-Mobile
1. “Sh!@$ on Fire, Yo!”
True Stories Inspired by Real Events
Brendan Aye
Technical Director, Platform Architecture
James Webb
Member of Technical Staff
2. 2
Platform and Infrastructure Engineering
§ 55 Team Members, including redundant and geo-distributed Joes
§ Virtual Infrastructure
  § 5,000 Virtual Hosts
  § 50,000 Virtual Machines
§ CloudFoundry
  § 30 Foundations
  § 75,000 Application Instances
§ Kubernetes
  § 90 Clusters
  § 22,000 Pods
Who We Are
4. 4
Architecting a Highly Available CloudFoundry App
[Architecture diagram: Clients look up Myapp.geo.cf.mydomain.com against the GSLB, which directs them to a load balancer in front of Foundation A, B, or C; every foundation serves the same Myapp.geo.cf.mydomain.com route. A minimal client-side sketch follows.]
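A minimal sketch of the client side of that diagram, assuming the hypothetical route Myapp.geo.cf.mydomain.com: the GSLB answers the DNS query with the virtual IP of a load balancer fronting whichever foundation is currently healthy and in load.

```java
import java.net.InetAddress;

// Sketch only: myapp.geo.cf.mydomain.com is the hypothetical GSLB-managed
// hostname from the diagram, not a real route.
public class GslbLookup {
    public static void main(String[] args) throws Exception {
        // The GSLB answers this DNS query with the VIP of a load balancer
        // in front of a healthy, in-load foundation (A, B, or C).
        for (InetAddress addr : InetAddress.getAllByName("myapp.geo.cf.mydomain.com")) {
            System.out.println("GSLB returned endpoint: " + addr.getHostAddress());
        }
    }
}
```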
5. 5
§ Platform team built a shiny GSLB-as-a-service
§ Customer team consumed the shiny GSLB-as-a-service
§ Clients queried the GSLB to determine an endpoint and established persistent HTTP connections
§ App teams took one region out of load, which correctly de-registered it from the GSLB
§ Persistent connections don't need to query the GSLB anymore, and the load balancer kept the connections alive… (sketched below)
What Went Wrong?
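A hedged illustration of that failure mode: with HTTP keep-alive, a client resolves the GSLB name once and then keeps reusing the pooled connection, so traffic continues flowing to the load balancer of the region that was taken out of load. The hostname and path are hypothetical; this is not the customer's actual client code.

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

// Sketch of the failure mode: java.net.http.HttpClient pools and reuses
// connections (keep-alive) by default, so DNS (and therefore the GSLB)
// is only consulted when a brand-new connection is opened.
public class StickyConnectionDemo {
    public static void main(String[] args) throws Exception {
        HttpClient client = HttpClient.newHttpClient();
        HttpRequest request = HttpRequest.newBuilder(
                URI.create("https://myapp.geo.cf.mydomain.com/health")) // hypothetical route
                .build();

        // Every iteration reuses the connection opened on the first call.
        // De-registering Foundation A from the GSLB does not break this
        // connection, so requests keep landing in the "removed" region.
        for (int i = 0; i < 5; i++) {
            HttpResponse<String> resp =
                    client.send(request, HttpResponse.BodyHandlers.ofString());
            System.out.println("Attempt " + i + " -> HTTP " + resp.statusCode());
            Thread.sleep(1_000);
        }
    }
}
```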
6. 6
§ Improved documentation! GSLB is only one method to load balance application traffic, so explaining its benefits and drawbacks is crucial to a successful partnership.
§ Sharing the incident post-mortem with GSLB customers so they understand what went wrong and how they can plan for expected failure.
§ Suggesting disabling HTTP keep-alive when using GSLB (see the sketch below)
§ Investing in alternative platform-supported load balancing methodologies.
How Did We Get Better?
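One way the keep-alive suggestion might be applied in a client built on HttpURLConnection: either disable keep-alive JVM-wide via the http.keepAlive system property, or send Connection: close per request, so every request opens a fresh connection and re-resolves the GSLB name. This is a sketch under those assumptions, not the exact guidance issued to teams.

```java
import java.net.HttpURLConnection;
import java.net.URL;

// Sketch: force each request onto a fresh connection so the GSLB routing
// decision is re-evaluated every time. Hostname is hypothetical.
public class NoKeepAliveClient {
    public static void main(String[] args) throws Exception {
        // Option 1: disable keep-alive for all HttpURLConnection usage in this JVM.
        System.setProperty("http.keepAlive", "false");

        URL url = new URL("https://myapp.geo.cf.mydomain.com/health");
        HttpURLConnection conn = (HttpURLConnection) url.openConnection();

        // Option 2: ask for this particular connection to be closed after the response.
        conn.setRequestProperty("Connection", "close");

        System.out.println("HTTP " + conn.getResponseCode());
        conn.disconnect();
    }
}
```

The trade-off is new connection setup (and TLS handshake) cost on every request, which is exactly the kind of drawback the first bullet says should be documented.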
7. 7
§ Homebrew Java application running on WebLogic
§ Running in a single Kubernetes cluster, but with many instances spread across multiple shared-nothing AZs
§ Application upgrades and restarts working fine and not causing any impact to service
§ Multi-tenant cluster managed by the Platform Team, with daytime upgrades planned during CloudFoundry Summit 2019
Anatomy of a Failing Kubernetes App
8. 8
§ Cluster upgrades kicked off with a max-in-flight of one
§ As nodes quickly cycled through upgrades, the application had fewer and fewer 'ready' pods
§ By the time the remaining nodes were upgraded, all customer pods were in a crashed state and failing to come back up
§ Management was displeased that our daytime upgrades ran with no Change Request, leading to a P1 Incident
What Went Wrong?
9. 9
§ Switching the application to depend on /dev/urandom instead of /dev/random
§ Customers implemented Pod Disruption Budgets (PDBs) to maintain a minimum of 66% ready pods before upgrades can proceed (see the sketch below)
§ Filing a Change Request for anything that touches a customer-facing cluster (yes, even non-production)
How Did We Get Better?
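A sketch of what the Pod Disruption Budget fix could look like, built with the official Kubernetes Java client's policy/v1 model classes (io.kubernetes:client-java, recent release assumed) to stay in the same language as the app; the app name, namespace, and labels are hypothetical. With minAvailable: 66%, node drains during cluster upgrades must wait until enough pods are ready again before evicting more. The /dev/urandom change, for a Java app, is typically a JVM flag such as -Djava.security.egd=file:/dev/./urandom rather than application code.

```java
import io.kubernetes.client.custom.IntOrString;
import io.kubernetes.client.openapi.models.V1LabelSelector;
import io.kubernetes.client.openapi.models.V1ObjectMeta;
import io.kubernetes.client.openapi.models.V1PodDisruptionBudget;
import io.kubernetes.client.openapi.models.V1PodDisruptionBudgetSpec;
import io.kubernetes.client.util.Yaml;

// Sketch of a PDB like the one the customer teams adopted. Names and labels
// are hypothetical; a client-java release with policy/v1 models is assumed.
public class MyAppPdb {
    public static void main(String[] args) {
        V1PodDisruptionBudget pdb = new V1PodDisruptionBudget()
                .apiVersion("policy/v1")
                .kind("PodDisruptionBudget")
                .metadata(new V1ObjectMeta()
                        .name("myapp-pdb")
                        .namespace("myapp"))
                .spec(new V1PodDisruptionBudgetSpec()
                        // Evictions (e.g. node drains during upgrades) must leave
                        // at least 66% of matching pods available.
                        .minAvailable(new IntOrString("66%"))
                        .selector(new V1LabelSelector()
                                .putMatchLabelsItem("app", "myapp")));

        // Print the manifest for review; it can then be applied with kubectl.
        System.out.println(Yaml.dump(pdb));
    }
}
```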
11. 11
§ Adopt a policy of radical transparency with your customers
§ Assume your customers are right until you can demonstrate otherwise
§ Avoid seeing Mean-Time-To-Blame as a useful KPI
§ When your platform is at fault, accept responsibility, fix the issue, and explain how you'll improve
§ When a customer is doing something that will lead to failure, ensure your concern is heard and partner for success
You Can't