This document discusses chaos engineering for Pivotal Cloud Foundry (PCF). It introduces Ramesh Krishnaram and Karun Chennuri from the Platform Engineering team at T-Mobile. They compare chaos engineering tools such as Chaos Lemur, Gremlin, and Turbulence. They then demonstrate capabilities they added to Turbulence for simulating failures in PCF infrastructure, and a Cloud Foundry blocker add-on for Chaos Toolkit for simulating failures in applications. The document also discusses cascading failures and contributions to open source chaos engineering tools.
Ramesh.Vaithiyamkrishnaram1@T-Mobile.com,
Karun.Chennuri1@T-Mobile.com
#springone@s1p
Editor's notes
“JOKER: Introduce a little anarchy. Upset the established order and everything becomes chaos. I'm an agent of chaos. Oh, and you know the thing about chaos?”
Ramesh: Hey Karun, I recently saw the movie The Dark Knight, and I really enjoyed the Joker’s interpretation of chaos. So, do you know the thing about chaos, Karun?
Karun: Apart from what the Joker said, I have been reading about a famous metaphor that explains chaos theory, i.e. how “butterfly wings in Brazil could ultimately cause a hurricane in Texas”.
Ramesh : And ?
Karun: I think we should say “Pre-emptive chaos attack on butterflies!!!???” Not really, I’m just joking… You look like you have something to say. What is it, and how can I help?
Ramesh: I am trying to draw an analogy here. A tiny butterfly can have a huge impact on the environment, and so can a bug or failure in a system on a company and its revenue. So, let’s get started…
Ramesh: So let’s talk about us. Who are we ? We are a group of engineers that fondly like to call ourselves agents of chaos.
Ramesh: That’s right. What I mean by agents of chaos is that we like to radically transform the complexities involved in deploying software to the cloud. We have done this by delivering a platform that is simple, secure and scalable to use. Our goal is for our application workloads to be able to run anywhere, anyhow.
IT is now all about as-a-service, which means customers expect agility. How much agility you get varies broadly with the level of abstraction you choose. We are a team focused predominantly on delivering services for CaaS, PaaS and, in future, FaaS.
Karun: Ramesh, what’s the big deal? Almost every company has this, right?
Ramesh: Big deal??? Here we go… <Take to next slide… talk about metrics>
Ramesh: So Karun, you said “what’s the big deal?” So why don’t I use data to talk about the big deal!
PCF was launched at T-Mobile in early 2016, and you can quickly see how we have graduated over the last 24 months. A number of T-Mo business-critical applications (customer-facing or middleware) run on PCF. Still not convinced? In that case, let me tell you that as of this minute we have roughly 30K+ containers and 900 active users in the PCF community at T-Mo. And in FY 2018 alone, we have scaled out our PCF foundations from 2 to 10+.
If that does not cut it, let me tell you that since we moved a number of apps to a micro-service SOA, we have had shorter and fewer incidents and faster apps! And guess what, on top of this we have seen an increase in the number of changes made to these services, the vast majority of them being day-time changes.
Karun: Alright, I get it. Where are we going with this and what is your problem statement ?
Ramesh: What is this?
Karun: Don’t know. But looks like abstract Chaos.
Ramesh: What is this one?
Karun: Blue Chaos?
Ramesh: What is this one?
Karun: Green & Incomplete Chaos?
Ramesh: You are right to an extent, but let me clarify. We are engineers, we write services. In a simple web app, a client makes a connection to a server, the server talks to a backend dependency, determines what needs to be rendered to the client, and responds. But that’s one app. An SOA has thousands of these micro-services and, just like we share the world, they share an ecosystem that is complex and vulnerable to attacks.
Karun : Really, what kind of attacks are these ?
Ramesh: I like to call this death star diagram the micro-service explosion, a common theme. In summary, when we design services, we make assumptions. Assumptions go wrong or are never validated. A few common fallacies of distributed systems:
The network is reliable
Latency is zero
Bandwidth is infinite
Infinite compute resources
The network is secure
Topology doesn't change
There is one administrator
Transport cost is zero
The network is homogeneous
Chaos Engineering focuses on building confidence in your system by validating known recovery paths. When a recovery path fails, you get an opportunity to look at the results and fix why it failed.
Ramesh: Before we move on, I want to highlight that T-Mobile is not one of the mom-and-pop telecom companies out there. We are the Un-carrier. We care about our customers, so we want to build stuff that’s simple, secure and scalable. And this is not possible until you acknowledge that the only thing that is constant is failure. Learn to embrace it; failure is inevitable.
Ramesh: This is certainly not a T-Mobile datacenter, nor is it Ramesh and me. Chaos Engineering is the practice of injecting realistic failures or load that have the potential to disrupt the system, with the goal of finding issues before they happen naturally so that the system’s resilience can be improved. Think of Chaos Engineering as a fire alarm drill: you run a drill occasionally to validate your recovery path, and when the drill fails you fix it, so that when an actual catastrophe happens there is no room for failure in your escape route.
At T-Mobile we started with the two challenges below:
Platform: Hardware failures, service failures, network connectivity and connection quality issues, and limited resources (CPU/Memory/Disk).
Application: Failure of application build dependencies and random failures of application dependencies.
Karun: Hey I know what you are saying…
We are not a single-application company! We have many apps that are independent of one another.
As said in an earlier slide, we have about 4k applications running on our platform, sharing the same resources and underlying infrastructure. Platform-level attacks impact several apps at once, which is not what we want. We want more targeted attack simulations that hit specific apps running in an org and space without affecting other apps that run on the same hardware, org and space and use the same shared service instance.
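To make the difference between a platform-level and a targeted attack concrete, here is a minimal sketch. The inventory, the org, space, app and cell names, and both helper functions are hypothetical and exist only for illustration; a real tool would read this information from the platform.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AppInstance:
    org: str
    space: str
    app: str
    index: int
    cell: str  # Diego cell that hosts this instance (hypothetical names below)


def select_targets(instances, org, space, app):
    """Targeted attack: only the instances of one app in one org and space."""
    return [i for i in instances if (i.org, i.space, i.app) == (org, space, app)]


def blast_radius(instances, cell):
    """Platform-level attack: every instance that lives on the attacked cell."""
    return [i for i in instances if i.cell == cell]


# Made-up inventory, for illustration only.
INVENTORY = [
    AppInstance("retail", "prod", "concert", 0, "cell-1"),
    AppInstance("retail", "prod", "weather", 0, "cell-1"),
    AppInstance("billing", "prod", "invoice", 0, "cell-1"),
    AppInstance("retail", "prod", "concert", 1, "cell-2"),
]

targeted = select_targets(INVENTORY, "retail", "prod", "concert")
platform = blast_radius(INVENTORY, "cell-1")
print("targeted attack hits:", sorted({i.app for i in targeted}))
print("cell-level attack hits:", sorted({i.app for i in platform}))
```

Attacking the cell takes three unrelated apps down together, while the targeted selection touches only the app under test.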
Ramesh:
My question to you, Karun: is it doable? I want our team not to re-invent the wheel. Evaluate the existing tools and make a proposal for how we can deliver a tool-as-a-service with which we can build a better platform and deliver more resilient applications.
Karun: Well, I hope so. I can get back to you with my research.
Only certain features were taken for comparison for now.
Ramesh: Why Gremlin? Isn’t it a commercial offering?
Karun: Gremlin is a commercial offering whose control plane is delivered as SaaS, which means one less piece of software for the Ops team to manage. It’s a good option that comes at a cost.
Gremlin can run as a process as well as in a container. We deployed Gremlin as a runtime config on one of our test foundations.
Gremlin falls short on app knowledge. But from our recent interaction with Mr. Kolton, founder of Gremlin, it looks like they are building an app-knowledge capability.
Ramesh: Can turbulence replace Gremlin?
Karun: It’s unfair to compare commercial Gremlin with open source Turbulence. The original author, ‘cppforlife’ (I hope he is at this conference), has put together a Go package that deploys a Turbulence API server and an agent on each of the VMs. Here again, Turbulence falls short on app knowledge. We no doubt get better control by enhancing Turbulence, but Gremlin has one advantage, especially in T-Mobile’s case: since we have both K8s and PCF in our infrastructure, we can have a single control plane to plan our attacks for PCF and K8s. Turbulence, on the other hand, is PCF only.
Enter Chaos Toolkit, a nice little framework that orchestrates solutions like Gremlin, Turbulence and AWS, all at the same time. Its driver-based architecture helped us build a new capability that knows how to interact with an app instance running in the cluster. Experiments are written in JSON and need to comply with a specific grammar.
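As a rough illustration of that grammar, the sketch below builds an experiment document and prints it as JSON. The top-level keys (steady-state hypothesis, method, rollbacks) follow the general Chaos Toolkit experiment layout; the module `hypothetical_cf_driver.actions`, its `block_service` and `unblock_service` functions, their arguments and the health URL are assumptions made up for this example, not the API of any real driver.

```python
import json


def build_experiment(org, space, app, service, health_url):
    """Build a Chaos Toolkit style experiment as a plain dictionary."""
    # Arguments passed to the (hypothetical) driver functions.
    target = {"org_name": org, "space_name": space,
              "app_name": app, "service_name": service}
    return {
        "version": "1.0.0",
        "title": f"{app} keeps responding when {service} is unreachable",
        "description": "Block traffic from one app to one bound service.",
        "steady-state-hypothesis": {
            "title": "The app answers its health check",
            "probes": [{
                "type": "probe",
                "name": "app-health-check",
                "tolerance": 200,  # expected HTTP status code
                "provider": {"type": "http", "url": health_url, "timeout": 3},
            }],
        },
        "method": [{
            "type": "action",
            "name": "block-traffic-to-service",
            "provider": {
                "type": "python",
                "module": "hypothetical_cf_driver.actions",  # assumption
                "func": "block_service",                     # assumption
                "arguments": target,
            },
        }],
        "rollbacks": [{
            "type": "action",
            "name": "unblock-traffic-to-service",
            "provider": {
                "type": "python",
                "module": "hypothetical_cf_driver.actions",  # assumption
                "func": "unblock_service",                   # assumption
                "arguments": target,
            },
        }],
    }


experiment = build_experiment(
    "retail", "prod", "concert", "concert-db",
    "https://concert.example.com/health",
)
print(json.dumps(experiment, indent=2))
```

The printed JSON could be saved to a file and passed to `chaos run`, provided a driver exposing the named functions is installed.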
Karun: This is a typical PCF component diagram. Each component, or a combination of them, is a single VM or a set of processes within a VM. But at a high level, look at the different arrows and imagine one interaction going wrong, which might have a cascading effect.
Now, how do we simulate these? The good news is that Turbulence does some of the basic stuff already. We added a bunch of new features that help you perform more serious attack simulations that are close to real attacks.
E.g.: imagine you have autoscaling ON for an app. Via Turbulence, bring down the Cloud Controller (CC) for some interval of time. Autoscaling (AS) queries CC every 30 seconds to get app stats. Since CC is down and AS doesn’t have the app stat metrics, AS fails and never scales the app. At this point, introduce a heavy spike in traffic and see what happens to your app.
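The scenario can be walked through with a small simulation. This is a toy model, not the real autoscaler: the 30-second polling interval comes from the talk, while the load figures, the timings and the rule "scale straight to the required instance count" are simplifying assumptions.

```python
POLL_INTERVAL_S = 30  # polling interval mentioned in the talk


def simulate_autoscaler(duration_s, cc_down_from, cc_down_until, spike_at,
                        normal_load=1, spike_load=10):
    """Return a timeline of (time_s, required_instances, running_instances)."""
    running = normal_load
    timeline = []
    for t in range(0, duration_s + 1, POLL_INTERVAL_S):
        required = spike_load if t >= spike_at else normal_load
        cc_available = not (cc_down_from <= t < cc_down_until)
        if cc_available:
            # Stats are readable, so the simplified autoscaler scales to demand.
            running = required
        # Otherwise the poll fails and the instance count stays where it was.
        timeline.append((t, required, running))
    return timeline


timeline = simulate_autoscaler(duration_s=210, cc_down_from=60,
                               cc_down_until=180, spike_at=90)
for t, required, running in timeline:
    state = "OVERLOADED" if running < required else "ok"
    print(f"t={t:>3}s required={required:>2} running={running:>2} {state}")
```

In this model the app stays under-provisioned for every poll between the start of the spike and the recovery of the Cloud Controller, which is the window the experiment is meant to expose.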
Also imagine: what if an existing Diego cell hosting multiple app containers goes down?
Ramesh: Are we going to demo existing features of Turbulence?
Karun: Certainly not. We will show how to pause a process, say ssh, in a Diego cell. That will be the first demo tonight.
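One common way to pause a process on Linux is to send it SIGSTOP and later SIGCONT. The sketch below shows that mechanism on a throwaway child process; it is an assumption about how such a pause could be done, not a description of how Turbulence implements it, and it needs a Unix-like system.

```python
import os
import signal
import subprocess
import sys
import time


def pause_process(pid):
    """Freeze a process; SIGSTOP cannot be caught or ignored by the target."""
    os.kill(pid, signal.SIGSTOP)


def resume_process(pid):
    """Let a previously stopped process continue."""
    os.kill(pid, signal.SIGCONT)


def demo(pause_seconds=0.2):
    """Pause and resume a sleeping child; return (was_stopped, alive_after)."""
    child = subprocess.Popen(
        [sys.executable, "-c", "import time; time.sleep(30)"]
    )
    try:
        pause_process(child.pid)
        # WUNTRACED makes waitpid report a stopped child, not only an exited one.
        _, status = os.waitpid(child.pid, os.WUNTRACED)
        was_stopped = os.WIFSTOPPED(status)
        time.sleep(pause_seconds)  # the process makes no progress during this time
        resume_process(child.pid)
        alive_after = child.poll() is None
        return was_stopped, alive_after
    finally:
        child.kill()
        child.wait()


print(demo())
```

Unlike killing the process, pausing it leaves it present but unresponsive, so anything that only checks whether the process exists will not notice the failure.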
Karun: Before we jump into our next demo or talk about App Chaos Engineering, can we talk a bit about Ops world?
Ramesh: Sure…
Next slide…
Karun: Hey Ramesh what do we hear from our customers in day to day ops?
Ramesh: Of course, we are a service team. So when stuff doesn’t work, the first thing you hear is “It’s those platform guys”, and when it’s not us, the next thing you hear is “It’s the network team”.
Let’s talk about few examples.
Karun: My app isn’t picking latest configuration…
Ramesh: When bad karma hits you back, there’s not much anyone can do. Even apps don’t listen to you.
Karun : My app isn’t connecting to Cassandra cluster
Ramesh : why would it? When the cluster was decommissioned 2 weeks ago !
Karun: oh wow!
Karun: My app works locally but not on PCF!
Ramesh: Well customer misbehavior, blocked them on PCF forever.
Karun: Oh that’s fair!
Karun: My app was working well till yesterday but not today! How about that?
Ramesh: Outstanding payments due!
But jokes apart, folks, we like calling ourselves enablers. What I mean by that is, we built a platform for the community to use. We onboard customers and we get out of the way. We trust our customers to do the right things within their app architecture. But that’s not always the case, and our customers encounter problems that boil down to an app architecture issue or a cloud anti-pattern. What we want to be now is enablers and guardians, meaning we provide a self-service mechanism with which you can find loopholes in your app or deployment. The question is how we can empower our developers.
Karun: Awesome. So here comes CF App Blocker, a new CTK add-on!
Ramesh: Do we really need CTK CF Blocker when you have the Hystrix circuit breaker?
Karun: Yes, we still need it. Not all deployed apps are Spring apps. Circuit breaker is a design pattern for making apps fault tolerant, and Hystrix is one implementation of it. However, not every technology has an implementation of this pattern. We saw it in Java apps and Python apps, but we have apps on stacks other than these two. CF Blocker also complements these design patterns: if an app is bound to a Hystrix circuit breaker, running CTK CF Blocker against the app can help with failure test cases…
Ramesh: Not sure I get that. Please explain more…
Karun: No matter how well we design, and no matter whether we follow 12-factor design patterns, in the real world things fail. In this case the Weather service depends on a 3rd party. If that goes down, the Concert app fails, and eventually the web app fails too. A couple of questions to keep in mind:
How to verify app’s behavior if 3rd party goes offline?
What if Concert database goes offline?
What if Weather microservice misbehaves?
Ramesh : Why can’t we use hystrix Circuit Breaker for Weather service?
Karun: Yes, we can… and in fact we should. Having something like CF Blocker scheduled to run at a regular interval will simulate cascading failures on that schedule, and so give the circuit breaker work to do…
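To show what "work for the circuit breaker" means, here is a minimal circuit breaker sketch with a fake Weather service whose `blocked` flag stands in for CF Blocker cutting the traffic. It is a simplified illustration of the pattern, not Hystrix; the class names, thresholds and fallback text are all assumptions.

```python
import time


class CircuitBreaker:
    """Fail fast after repeated errors, then retry once the timeout elapses."""

    def __init__(self, failure_threshold=3, reset_timeout_s=30.0,
                 clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.reset_timeout_s = reset_timeout_s
        self.clock = clock
        self.failures = 0
        self.opened_at = None  # None means the circuit is closed

    def call(self, func, fallback):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_timeout_s:
                return fallback()  # open: do not touch the dependency
            # Timeout elapsed: half-open, let one trial call through.
        try:
            result = func()
        except Exception:
            self.failures += 1
            if self.failures >= self.failure_threshold:
                self.opened_at = self.clock()  # (re)open the circuit
            return fallback()
        self.failures = 0
        self.opened_at = None
        return result


class FakeWeatherService:
    """Stand-in for the 3rd-party dependency; 'blocked' simulates the attack."""

    def __init__(self):
        self.blocked = False
        self.calls = 0

    def forecast(self):
        self.calls += 1
        if self.blocked:
            raise ConnectionError("weather provider unreachable")
        return "sunny"


weather = FakeWeatherService()
breaker = CircuitBreaker(failure_threshold=3, reset_timeout_s=30.0)
weather.blocked = True  # the blocker kicks in
results = [breaker.call(weather.forecast, lambda: "forecast unavailable")
           for _ in range(5)]
print(results, "remote calls made:", weather.calls)
```

Once the threshold is reached, the breaker stops calling the blocked dependency and serves the fallback, which is the behavior a scheduled blocking experiment would verify.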
Karun : Here is more accurate interaction of microservices / spring app behavior within PCF.
You can see that the Config Server depends on the Git repo, and that the services depend on Spring Cloud Services (SCS), which include the Service Registry and Circuit Breaker. Here both Weather and Concert are bound to SCS (services internal to PCF), while the message broker and databases are external to the services.
How to target specific bound services to the app
How to disable traffic to an app
How to block traffic from a service to a backend database, yet still allow access from another service. Note the difference: we are not killing the database here, which might eventually impact other services. We are only blocking traffic from the app to the database.
We do that via iptables rules.
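A rule of that kind could look like the sketch below, which only builds the iptables command line and does not execute it. The choice of the FORWARD chain, the IP addresses and the port are assumptions for illustration; the rules the actual tool installs may differ.

```python
import ipaddress


def block_rule(app_ip, backend_ip, backend_port, chain="FORWARD", delete=False):
    """Build an iptables command that drops TCP traffic from app to backend.

    With delete=True the same rule specification is removed again (rollback).
    """
    ipaddress.ip_address(app_ip)      # raises ValueError on a malformed address
    ipaddress.ip_address(backend_ip)
    if not 0 < backend_port < 65536:
        raise ValueError(f"invalid port: {backend_port}")
    operation = "-D" if delete else "-I"  # -I inserts at the top of the chain
    return [
        "iptables", operation, chain,
        "-s", app_ip,                 # source: the app container (assumed IP)
        "-d", backend_ip,             # destination: the database (assumed IP)
        "-p", "tcp", "--dport", str(backend_port),
        "-j", "DROP",
    ]


# Hypothetical addresses: one container and one database endpoint.
attack = block_rule("10.255.10.5", "10.20.30.40", 5432)
rollback = block_rule("10.255.10.5", "10.20.30.40", 5432, delete=True)
print(" ".join(attack))
print(" ".join(rollback))
```

Because the rule matches on the source address, other apps that talk to the same database keep working, which is the difference from taking the database down.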
Ramesh : Do more OSS.
Ramesh : So what’s next ? Our high-level goals
Build confidence in our services by running game days (targeted failure attacks).
And yes, finally, we are big on contributions to the community. So we will continue to push our work out into the OSS community.