SpringOne 2020
“Sh*^%# on Fire, Yo!”: A True Story Inspired by Real Events
James Webb, MTS at T-Mobile
Brendan Aye, Technical Director, Platform Architecture at T-Mobile
1. “Sh!@$ on Fire, Yo!”
True Stories Inspired by Real Events
Brendan Aye
Technical Director, Platform Architecture
James Webb
Member of Technical Staff
2. 2
Platform and Infrastructure Engineering
§ 55 Team Members, including redundant and geo-distributed Joes
§ Virtual Infrastructure
  § 5,000 Virtual Hosts
  § 50,000 Virtual Machines
§ CloudFoundry
  § 30 Foundations
  § 75,000 Application Instances
§ Kubernetes
  § 90 Clusters
  § 22,000 Pods
Who We Are
4. 4
Architecting a Highly Available CloudFoundry App
[Architecture diagram: Clients look up Myapp.geo.cf.mydomain.com against the GSLB, which directs them to a load balancer in front of Foundation A, B, or C; every foundation serves the same Myapp.geo.cf.mydomain.com route. A minimal client-side sketch follows.]
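A minimal sketch of the client side of that diagram, assuming the hypothetical route Myapp.geo.cf.mydomain.com: the GSLB answers the DNS query with the virtual IP of a load balancer fronting whichever foundation is currently healthy and in load.

```java
import java.net.InetAddress;

// Sketch only: myapp.geo.cf.mydomain.com is the hypothetical GSLB-managed
// hostname from the diagram, not a real route.
public class GslbLookup {
    public static void main(String[] args) throws Exception {
        // The GSLB answers this DNS query with the VIP of a load balancer
        // in front of a healthy, in-load foundation (A, B, or C).
        for (InetAddress addr : InetAddress.getAllByName("myapp.geo.cf.mydomain.com")) {
            System.out.println("GSLB returned endpoint: " + addr.getHostAddress());
        }
    }
}
```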
5. 5
§ Platform team built a shiny GSLB-as-a-service
§ Customer team consumed the shiny GSLB-as-a-service
§ Clients queried the GSLB to determine an endpoint and established persistent HTTP connections
§ App teams took one region out of load, which correctly de-registered it from the GSLB
§ Persistent connections don't need to query the GSLB anymore, and the load balancer kept the connections alive… (sketched below)
What Went Wrong?
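A hedged illustration of that failure mode: with HTTP keep-alive, a client resolves the GSLB name once and then keeps reusing the pooled connection, so traffic continues flowing to the load balancer of the region that was taken out of load. The hostname and path are hypothetical; this is not the customer's actual client code.

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

// Sketch of the failure mode: java.net.http.HttpClient pools and reuses
// connections (keep-alive) by default, so DNS (and therefore the GSLB)
// is only consulted when a brand-new connection is opened.
public class StickyConnectionDemo {
    public static void main(String[] args) throws Exception {
        HttpClient client = HttpClient.newHttpClient();
        HttpRequest request = HttpRequest.newBuilder(
                URI.create("https://myapp.geo.cf.mydomain.com/health")) // hypothetical route
                .build();

        // Every iteration reuses the connection opened on the first call.
        // De-registering Foundation A from the GSLB does not break this
        // connection, so requests keep landing in the "removed" region.
        for (int i = 0; i < 5; i++) {
            HttpResponse<String> resp =
                    client.send(request, HttpResponse.BodyHandlers.ofString());
            System.out.println("Attempt " + i + " -> HTTP " + resp.statusCode());
            Thread.sleep(1_000);
        }
    }
}
```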
6. 6
§ Improved documentation! GSLB is only one method to load balance application traffic, so explaining its benefits and drawbacks is crucial to a successful partnership.
§ Sharing the incident post-mortem with GSLB customers so they understand what went wrong and how they can plan for expected failure.
§ Suggesting disabling HTTP keep-alive when using GSLB (see the sketch below)
§ Investing in alternative platform-supported load balancing methodologies.
How Did We Get Better?
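One way the keep-alive suggestion might be applied in a client built on HttpURLConnection: either disable keep-alive JVM-wide via the http.keepAlive system property, or send Connection: close per request, so every request opens a fresh connection and re-resolves the GSLB name. This is a sketch under those assumptions, not the exact guidance issued to teams.

```java
import java.net.HttpURLConnection;
import java.net.URL;

// Sketch: force each request onto a fresh connection so the GSLB routing
// decision is re-evaluated every time. Hostname is hypothetical.
public class NoKeepAliveClient {
    public static void main(String[] args) throws Exception {
        // Option 1: disable keep-alive for all HttpURLConnection usage in this JVM.
        System.setProperty("http.keepAlive", "false");

        URL url = new URL("https://myapp.geo.cf.mydomain.com/health");
        HttpURLConnection conn = (HttpURLConnection) url.openConnection();

        // Option 2: ask for this particular connection to be closed after the response.
        conn.setRequestProperty("Connection", "close");

        System.out.println("HTTP " + conn.getResponseCode());
        conn.disconnect();
    }
}
```

The trade-off is new connection setup (and TLS handshake) cost on every request, which is exactly the kind of drawback the first bullet says should be documented.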
7. 7
§ Homebrew Java application running on WebLogic
§ Running in a single Kubernetes cluster, but with many instances spread across multiple shared-nothing AZs
§ Application upgrades and restarts working fine and not causing any impact to service
§ Multi-tenant cluster managed by the Platform Team, with daytime upgrades planned during CloudFoundry Summit 2019
Anatomy of a Failing Kubernetes App
8. 8
§ Cluster upgrades kicked off with a max-in-flight of one
§ As nodes quickly cycled through upgrades, the application had fewer and fewer 'ready' pods
§ By the time the remaining nodes were upgraded, all customer pods were in a crashed state and failing to come back up
§ Management was displeased that our daytime upgrades ran with no Change Request, leading to a P1 Incident
What Went Wrong?
9. 9
§ Switching the application to depend on /dev/urandom instead of /dev/random
§ Customers implemented Pod Disruption Budgets (PDBs) to maintain a minimum of 66% ready pods before upgrades can proceed (see the sketch below)
§ Filing a Change Request for anything that touches a customer-facing cluster (yes, even non-production)
How Did We Get Better?
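A sketch of what the Pod Disruption Budget fix could look like, built with the official Kubernetes Java client's policy/v1 model classes (io.kubernetes:client-java, recent release assumed) to stay in the same language as the app; the app name, namespace, and labels are hypothetical. With minAvailable: 66%, node drains during cluster upgrades must wait until enough pods are ready again before evicting more. The /dev/urandom change, for a Java app, is typically a JVM flag such as -Djava.security.egd=file:/dev/./urandom rather than application code.

```java
import io.kubernetes.client.custom.IntOrString;
import io.kubernetes.client.openapi.models.V1LabelSelector;
import io.kubernetes.client.openapi.models.V1ObjectMeta;
import io.kubernetes.client.openapi.models.V1PodDisruptionBudget;
import io.kubernetes.client.openapi.models.V1PodDisruptionBudgetSpec;
import io.kubernetes.client.util.Yaml;

// Sketch of a PDB like the one the customer teams adopted. Names and labels
// are hypothetical; a client-java release with policy/v1 models is assumed.
public class MyAppPdb {
    public static void main(String[] args) {
        V1PodDisruptionBudget pdb = new V1PodDisruptionBudget()
                .apiVersion("policy/v1")
                .kind("PodDisruptionBudget")
                .metadata(new V1ObjectMeta()
                        .name("myapp-pdb")
                        .namespace("myapp"))
                .spec(new V1PodDisruptionBudgetSpec()
                        // Evictions (e.g. node drains during upgrades) must leave
                        // at least 66% of matching pods available.
                        .minAvailable(new IntOrString("66%"))
                        .selector(new V1LabelSelector()
                                .putMatchLabelsItem("app", "myapp")));

        // Print the manifest for review; it can then be applied with kubectl.
        System.out.println(Yaml.dump(pdb));
    }
}
```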
11. 11
§ Adopt a policy of radical transparency with your customers
§ Assume your customers are right until you can demonstrate otherwise
§ Avoid seeing Mean-Time-To-Blame as a useful KPI
§ When your platform is at fault, accept responsibility, fix the issue, and explain how you'll improve
§ When a customer is doing something that will lead to failure, ensure your concern is heard and partner for success
You Can't