Chaos Engineering for PCF


This document discusses chaos engineering for Pivotal Cloud Foundry (PCF). It is presented by Ramesh Krishnaram and Karun Chennuri from the Platform Engineering team at T-Mobile. They explore chaos engineering tools such as Chaos Lemur, Gremlin, Turbulence, and Chaos Toolkit. They demonstrate capabilities they added to Turbulence for simulating failures in PCF infrastructure, and CTK CF Blocker, a Chaos Toolkit add-on for simulating failures that target specific applications. The document also discusses cascading failures and the team's contributions to open source chaos engineering tools.


1. Chaos Engineering for PCF. Ramesh Krishnaram: Sr. Engineering Manager. Karun Chennuri: Sr. Software Engineer. PLATFORM ENGINEERING

2. Unless otherwise indicated, these slides are © 2013-2018 Pivotal Software, Inc. and licensed under a Creative Commons Attribution-NonCommercial license: http://creativecommons.org/licenses/by-nc/3.0/ WHO ARE WE? PLATFORM ENGINEERING: “Provide simple, secure and scalable platform services that are platform- and infrastructure-agnostic.” Abstraction levels: FaaS, PaaS, CaaS, IaaS. Trade-off axis: greater flexibility and less conformance to standards versus lower dev complexity and greater operational efficiency.

3. Here is the BIG Deal…

4. “The only thing that is constant is failure. Learn to embrace it. Failure is inevitable.”

5. Problem Statement
• Platform Failures
• Application Failures
Ref: https://twitter.com/fiberstore/status/549826256338825216

6. After a few busy weeks…

7. Journey & Tools we explored…
Tools compared: Chaos Lemur, Gremlin, Turbulence, T-Mo CTK (CTK: Chaos Toolkit).
Features compared for each tool: Kill VMs, Kill Process, Latency, CPU/Memory, App Knowledge.
Note: App knowledge in Gremlin seems to be on the roadmap and may be available in future versions.

8. Chaos Engineering: Platform/Infrastructure

9. Simulating Failures in PCF. Turbulence features:
• Kill VM
• Kill Process
• Pause Process
• Stress
• Disk Corrupt
• Control Network Delay
• Limit Bandwidth
• Re-ordering Packets
• Firewall
• Targeted Blocking
• Shutdown
• Block DNS
• Duplication
• Api-server
• Agent
T-Mobile OSS contribution
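
As an illustration of how one of the Turbulence features above might be requested, here is a minimal Python sketch that builds an incident body for the Turbulence API server. The field names (`Tasks`, `Type`, `Selector`, `Deployment`, `Group`, `ID`, `Limit`) and the `kill` task type are recalled from the turbulence-release API documentation and should be treated as assumptions to verify against the release you deploy. The deployment name `cf` and group name `diego_cell` are hypothetical. The sketch only builds and prints the JSON; it does not contact any server.

```python
import json


def build_incident(task_type, deployment, group, limit):
    """Return a dict shaped like a Turbulence incident request body.

    The key names are assumptions based on the turbulence-release docs.
    """
    return {
        "Tasks": [{"Type": task_type}],
        "Selector": {
            "Deployment": {"Name": deployment},  # hypothetical deployment name
            "Group": {"Name": group},            # hypothetical instance group
            "ID": {"Limit": limit},              # share of matching VMs to hit
        },
    }


incident = build_incident("kill", "cf", "diego_cell", "10%-30%")
body = json.dumps(incident, indent=2)
print(body)
```

Keeping the selector separate from the task list means the same targeting can be reused while swapping the kind of failure being injected.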

10. Demo 1: Addons to Turbulence

11. Demo 1: Addons to Turbulence

12. Chaos Engineering: Applications

13. Somewhere in ops world…
• My app isn’t picking up the latest configurations
• My app isn’t connecting to Cassandra
• My app works locally but not on PCF! WTF with the Platform?
• My app was working well till yesterday, but not today!

14. Cascading effect. Diagram labels: Client, Web App, Concert, Weather, 3rd Party, Database, Timeout. Ref: https://github.com/michaelgruczel/microservice-architecture-by-example

15. CTK CF BLOCKER
CTK CF Blocker:
• Targets specific CF apps
• Discovers:
  • Application hosts
  • Bound services
  • Service instances
• Blocks all traffic to:
  • App instances
  • Bound services
Diagram labels: Diego Cell, Weather, Concert, Config Server, Eureka Service Discovery, Hystrix Circuit Breaker, Cloud Controller, UAA, Git Repo, Message Brokers, RMQ, Kafka, JMX, Database, Go Router, CredHub

16. Demo 2: CTK CF Blocker

18. Upstream Contribution
Demo videos:
• Platform Chaos Attack: https://www.youtube.com/watch?v=9jt8Qq6RTN8
• CF App Blocker Attack: https://www.youtube.com/watch?v=ewtzyZdb67o
https://opensource.t-mobile.com
Turbulence Release PR: https://github.com/cppforlife/turbulence-release/pull/25

19. Next Steps… Ref: http://funnypicture.org/funny-cat-games-27-cool-hd-wallpaper.html#.W6uZ_2hKiUk

20. Team PLATFORM ENGINEERING

21. LET US KNOW HOW YOU FEEL ABOUT THIS SESSION. TAKE THE SURVEY ON THE MOBILE APP! Ramesh.Vaithiyamkrishnaram1@T-Mobile.com, Karun.Chennuri1@T-Mobile.com #springone @s1p

Editor's notes

“JOKER: Introduce a little anarchy. Upset the established order and everything becomes chaos. I'm an agent of chaos. Oh, and you know the thing about chaos?”
Ramesh: Hey Karun, I recently saw the movie The Dark Knight, and I really enjoyed the Joker’s interpretation of chaos. So, do you know the thing about chaos, Karun?
Karun: Apart from what the Joker said, I have been reading about a famous metaphor that explains chaos theory: how “butterfly wings in Brazil could ultimately cause a hurricane in Texas.”
Ramesh: And?
Karun: I think we should say “Pre-emptive chaos attack on butterflies!” Not really, I am just joking. You look like you have something to say. What is it, and how can I help?
Ramesh: I am trying to draw an analogy here. A tiny butterfly can have a huge impact on the environment, and so can a bug or failure in a system on a company and its revenue. So, let’s get started…

Ramesh: So let’s talk about us. Who are we? We are a group of engineers who fondly like to call ourselves agents of chaos. What I mean by agents of chaos is that we like to radically transform the complexities involved in deploying software to the cloud. We have done this by delivering a platform that is simple, secure and scalable to use. Our goal is for our application workloads to be able to run from anywhere, anyhow. IT is now all about as-a-service, which means customers expect agility, and this varies broadly with the level of abstraction you choose. We are a team focused predominantly on delivering services for CaaS, PaaS and, in the future, FaaS.
Karun: Ramesh, what’s the big deal? Almost every company has this, right?
Ramesh: Big deal? Here we go… <Move to the next slide and talk about metrics>

Ramesh: So Karun, you said “what’s the big deal?” Why don’t I use data to talk about the big deal! PCF was launched at T-Mobile in early 2016, and you can quickly see how we have graduated over the last 24 months. A number of T-Mo’s business-critical applications (customer-facing or middleware) run on PCF. Still not convinced? In that case, let me tell you that as of this minute we have roughly 30K+ containers and 900 active users in the PCF community at T-Mo. And in FY 2018 alone, we have scaled out our PCF foundations from 2 to 10+. If that does not cut it, let me tell you that since we moved a number of apps to a microservice SOA, we have had shorter and fewer incidents and faster apps! And guess what, on top of this we have seen an increase in the number of changes made to these services, the vast majority of them daytime changes.
Karun: Alright, I get it. Where are we going with this, and what is your problem statement?

Ramesh: What is this?
Karun: Don’t know. But it looks like abstract chaos.
Ramesh: What is this one?
Karun: Blue chaos?
Ramesh: What is this one?
Karun: Green and incomplete chaos?
Ramesh: You are right to an extent, but let me clarify. We are engineers; we write services. In a simple web app, a client makes a connection to a server, and the server talks to a backend dependency, determines what needs to be rendered to the client, and responds. But that’s one app. An SOA has thousands of these microservices and, just like how we share the world, they share an ecosystem that is complex and vulnerable to attacks.
Karun: Really? What kind of attacks are these?
Ramesh: I like to call this death star diagram the microservice explosion, a common theme. In summary, when we design services, we make assumptions, and assumptions go wrong or are never validated. A few common fallacies in distributed systems:
• The network is reliable
• Latency is zero
• Bandwidth is infinite
• Infinite compute resources
• The network is secure
• Topology doesn't change
• There is one administrator
• Transport cost is zero
• The network is homogeneous
Chaos Engineering focuses on building confidence in your system by validating known recovery paths. When a recovery path fails, you get an opportunity to look at the results and fix why it failed.
Ramesh: Before we move on, I want to highlight that T-Mobile is not one of the mom-and-pop telecom companies out there. We are the Un-carrier. We care about our customers, so we want to build stuff that’s simple, secure and scalable. And this is not possible until you acknowledge that the only thing that is constant is failure. Learn to embrace it; failure is inevitable.

Ramesh: This is certainly not a T-Mobile datacenter, nor is it Ramesh and I. Chaos Engineering is the practice of injecting plausible real-world failures or load that could disrupt the system, with the goal of finding potential issues before they happen naturally, so the system’s resilience can be improved. Think of Chaos Engineering as a fire alarm drill: you run the drill occasionally to validate your recovery path, and when the drill fails you fix it, so that when an actual catastrophe happens there is no room for failure in your escape route. At T-Mobile we started with the two challenges below:
• Platform: hardware failures, service failures, network connectivity and connection quality issues, and limited resources (CPU/Memory/Disk).
• Application: failure of application build dependencies and random failures of application dependencies.
Karun: Hey, I know what you are saying… We are not a single-application company! We have many independent apps that are not inter-dependent. As said in an earlier slide, we have about 4k applications running on our platform, sharing the same resources and underlying infrastructure. Platform-level attacks impact several apps, which is not what we want. We want more targeted attack simulations that target specific apps running in an org and space without affecting other apps running on the same hardware, org and space and using the same shared instance.
Ramesh: My question to you, Karun: is it doable? I want our team not to reinvent the wheel, but to evaluate existing tools and make a proposal for how we can deliver a tool-as-a-service with which we can build a better platform and deliver more resilient applications.
Karun: Well, I hope so. I can get back to you with my research.

Only certain features were taken for comparison for now.
Ramesh: Why Gremlin? Isn’t it a commercial offering?
Karun: Gremlin is a commercial offering whose control plane is offered as SaaS, which means one less piece of software for the Ops team to manage. It’s a good option that comes at a cost. Gremlin can run as a process as well as in a container. We deployed Gremlin as a runtime config on one of our test foundations. Gremlin falls short on app knowledge, but from our recent interaction with Mr. Kolton, founder of Gremlin, it looks like they are building an app knowledge capability.
Ramesh: Can Turbulence replace Gremlin?
Karun: It’s unfair to compare commercial Gremlin with open source Turbulence. The original author, ‘cppforlife’ (I hope he is at this conference), has put together a Go package that deploys the Turbulence api-server and an agent on each of the VMs. Here again, Turbulence falls short on the app knowledge aspect. No doubt we get better control by enhancing Turbulence, but Gremlin has one advantage, especially in T-Mobile’s case: since we have K8s and PCF in our infrastructure, we can have a single control plane to plan our attacks for both PCF and K8s. Turbulence, on the other hand, is PCF only. Enter Chaos Toolkit, a nice little framework that orchestrates solutions like Gremlin, Turbulence and AWS, all at the same time. Its driver-based architecture helped us build a new capability that knows how to interact with an app instance running in the cluster. Experiments are JSON, and we need to comply with a specific grammar.
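
To make the "experiments are JSON" remark concrete, here is a minimal Python sketch that assembles a Chaos Toolkit experiment and checks it for the top-level keys the experiment format requires (`title`, `description`, `method`). The overall layout follows the public Chaos Toolkit experiment format. The driver module `cf_blocker_driver`, the function `block_services`, its arguments, and the URL are hypothetical placeholders, not the actual names used by the T-Mobile add-on.

```python
import json

REQUIRED_KEYS = ("title", "description", "method")


def missing_keys(experiment):
    """Return the required top-level keys that the experiment lacks."""
    return [key for key in REQUIRED_KEYS if key not in experiment]


experiment = {
    "version": "1.0.0",
    "title": "Weather app tolerates losing its database",
    "description": "Block traffic from one app to a bound service.",
    "steady-state-hypothesis": {
        "title": "Weather app answers health checks",
        "probes": [{
            "type": "probe",
            "name": "weather-app-is-healthy",
            "tolerance": 200,  # expected HTTP status code
            "provider": {
                "type": "http",
                "url": "https://weather.example.com/health",  # hypothetical
            },
        }],
    },
    "method": [{
        "type": "action",
        "name": "block-app-to-service-traffic",
        "provider": {
            "type": "python",
            "module": "cf_blocker_driver",  # hypothetical driver module
            "func": "block_services",       # hypothetical function name
            "arguments": {"org": "demo-org", "space": "demo-space",
                          "app": "weather"},
        },
    }],
    "rollbacks": [],
}

print(missing_keys(experiment))
print(json.dumps(experiment, indent=2))
```

A file produced this way would normally be handed to the `chaos run` command; the check here is only a quick pre-flight guard and not a substitute for the toolkit's own validation.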

Karun: This is a typical PCF component diagram. Each component, or a combination of them, is a single VM or multiple processes within a VM. At a high level, look at the different arrows and imagine an interaction going wrong, which might have a cascading effect. Now, how do we simulate these? Good news: Turbulence does some of the basic stuff already, and we added a bunch of new features that help you perform more serious attack simulations that are close to real-life attacks. For example, imagine you have Autoscaling ON for an app. Via Turbulence, bring down Cloud Controller (CC) for some interval of time. Autoscaling (AS) queries CC every 30 seconds to get app stats. Since CC is down and AS doesn’t have the app stat metrics, AS fails and thus never scales the app. At this point, introduce a heavy spike in traffic and see what happens to your app. Also imagine: what if an existing diego-cell hosting multiple app containers goes down?
Ramesh: Are we going to demo existing features of Turbulence?
Karun: Certainly not. We will show how to pause a process, say ssh in a diego cell. That will be the first demo tonight.
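
The autoscaling scenario can be reasoned about with a toy model. The Python sketch below is not how the PCF autoscaler is implemented; it only assumes a poll every 30 seconds (from the talk), made-up load numbers, and a simplistic rule that adds one instance per poll when per-instance load exceeds a threshold. Unlike the scenario described above, this toy autoscaler resumes scaling once Cloud Controller is back, so it shows only the delay introduced by the outage.

```python
POLL_INTERVAL = 30  # seconds between autoscaler polls, as stated in the talk


def simulate(cc_outage, spike_start, duration=300):
    """Return (time, cc_up, instances) tuples for a toy autoscaler."""
    outage_start, outage_end = cc_outage
    instances = 1
    timeline = []
    for t in range(0, duration + 1, POLL_INTERVAL):
        cc_up = not (outage_start <= t < outage_end)
        load = 400 if t >= spike_start else 50  # requests/s, made-up numbers
        # Scale out only when stats are available and per-instance load is high.
        if cc_up and load / instances > 100:
            instances += 1
        timeline.append((t, cc_up, instances))
    return timeline


def first_scale_out(timeline):
    """Time of the first poll at which more than one instance is running."""
    return next((t for t, _, n in timeline if n > 1), None)


healthy = simulate(cc_outage=(0, 0), spike_start=90)
degraded = simulate(cc_outage=(60, 180), spike_start=90)
print(first_scale_out(healthy), first_scale_out(degraded))
```

With these made-up numbers the healthy run scales out at the 90-second poll, while the run with Cloud Controller down from 60 to 180 seconds does not scale until the 180-second poll, so the app absorbs the spike on a single instance for that whole window.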

Karun: Before we jump into our next demo or talk about App Chaos Engineering, can we talk a bit about the Ops world?
Ramesh: Sure… Next slide…

Karun: Hey Ramesh, what do we hear from our customers in day-to-day ops?
Ramesh: Of course, we are a service team. So when stuff doesn’t work, the first thing you hear is “It’s those platform guys,” and when it’s not us, the next thing you hear is “It’s the network team.” Let’s talk about a few examples.
Karun: My app isn’t picking up the latest configuration…
Ramesh: When bad karma hits you back, there is not much anyone can do; even apps don’t listen to you.
Karun: My app isn’t connecting to the Cassandra cluster.
Ramesh: Why would it, when the cluster was decommissioned two weeks ago!
Karun: Oh wow! My app works locally but not on PCF!
Ramesh: Well, customer misbehavior. We blocked them on PCF forever.
Karun: Oh, that’s fair! My app was working well till yesterday but not today! How about that?
Ramesh: Outstanding payments due! But jokes apart, folks, we like calling ourselves enablers. What I mean by that is, we built a platform for the community to use. We onboard customers and we get out of the way; we trust our customers to do the right things within their app architecture. But that’s not always the case, and our customers encounter problems that boil down to an app architecture issue or a cloud anti-pattern. What we want to do now is be enablers and guardians, meaning we provide a self-service mechanism with which you can find loopholes in your app or deployment. The question is how we can empower our developers.
Karun: Awesome. So here comes CF App Blocker, a new CTK add-on!
Ramesh: Do we really need CTK CF Blocker when you have the Hystrix circuit breaker?
Karun: Yes, we still need it. Not all deployed apps are Spring apps. The Hystrix circuit breaker implements a design pattern that makes apps fault tolerant. However, not all technologies have an implementation of this pattern. We saw it in Java apps and Python apps, but we have apps using stacks other than these two. Also, CF Blocker complements these design patterns: if an app is bound to the Hystrix circuit breaker, running CTK CF Blocker on the app can help with failure test cases…
Ramesh: Not sure I get that. Please explain more…

Karun: No matter how well we design, and no matter whether we follow 12-factor design patterns, in the real world, as in this case, the Weather service depends on a 3rd party. If that goes down, the Concert app fails and eventually the web app fails. A couple of questions to keep in mind:
• How do we verify the app’s behavior if the 3rd party goes offline?
• What if the Concert database goes offline?
• What if the Weather microservice misbehaves?
Ramesh: Why can’t we use the Hystrix circuit breaker for the Weather service?
Karun: Yes, we can, and in fact we should. Having something like CF Blocker programmed to run at a regular interval will simulate cascading failures at each interval and thus give the circuit breaker work to do…
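
The interplay between a blocked dependency and a circuit breaker can be sketched in a few lines of Python. This is a deliberately simplified breaker, not the Hystrix API: it has no half-open state or reset timeout, and the names `CircuitBreaker`, `third_party_weather` and the fallback value are all hypothetical. The failing function stands in for the 3rd party while a tool like CF Blocker is cutting off traffic to it.

```python
class CircuitBreaker:
    """Toy breaker: opens after N consecutive failures, never resets."""

    def __init__(self, failure_threshold=3):
        self.failure_threshold = failure_threshold
        self.failures = 0

    @property
    def is_open(self):
        return self.failures >= self.failure_threshold

    def call(self, func, fallback):
        if self.is_open:
            return fallback()  # fail fast without touching the dependency
        try:
            result = func()
        except ConnectionError:
            self.failures += 1
            return fallback()
        self.failures = 0
        return result


attempts = []  # records every real call made to the dependency


def third_party_weather():
    """Stand-in for the 3rd party while its traffic is blocked."""
    attempts.append("call")
    raise ConnectionError("3rd party unreachable")


breaker = CircuitBreaker(failure_threshold=3)
results = [breaker.call(third_party_weather, lambda: "cached forecast")
           for _ in range(5)]
print(results, len(attempts), breaker.is_open)
```

Five requests all receive the fallback, but only the first three actually reach the blocked dependency; after that the breaker is open and the caller is shielded, which is the behavior a recurring blocking experiment is meant to confirm.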

Karun: Here is a more accurate picture of how microservices and Spring apps interact within PCF. You can see that Config Server depends on the Git repo, and the services depend on Spring Cloud Services (SCS), which include Service Registry and Circuit Breaker. Here both Weather and Concert are bound to SCS (internal services of PCF), while the message broker and databases are external to the services. The questions are:
• How to target specific services bound to the app
• How to disable traffic to an app
• How to block traffic from a service to a backend database, yet allow access from another service
Note the difference: we are not killing the database here, which might eventually impact other services. We are only blocking traffic from the app to the database. We do that via iptables rules.
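
The idea of blocking only one app's traffic to a database, rather than taking the database down, can be sketched as generating a narrowly scoped iptables rule. The Python snippet below only builds the command as a list of arguments and does not execute anything. The use of the `FORWARD` chain, the IP addresses, and the port are assumptions for illustration; the rules CTK CF Blocker actually installs may differ.

```python
def block_rule(app_ip, service_ip, service_port, action="-I"):
    """Build an iptables command that drops TCP traffic from one app
    container to one service endpoint.

    action "-I" inserts the rule, "-D" deletes the same rule again.
    The FORWARD chain is an assumption, not taken from the talk.
    """
    return [
        "iptables", action, "FORWARD",
        "-s", app_ip,
        "-d", service_ip,
        "-p", "tcp", "--dport", str(service_port),
        "-j", "DROP",
    ]


# Hypothetical container and database addresses.
block = block_rule("10.255.0.5", "10.0.16.20", 5432)
unblock = block_rule("10.255.0.5", "10.0.16.20", 5432, action="-D")
print(" ".join(block))
print(" ".join(unblock))
```

Because the rule matches on the source address of a single app container, other apps that use the same database keep working, and deleting the identical rule specification restores traffic at the end of the experiment.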

Ramesh: Do more OSS.

Ramesh: So what’s next? Our high-level goal is to build confidence in our services by running gamedays (targeted failure attacks). And finally, yes, we are big on contributions to the community, so we will continue to push our work out into the OSS community.

Team photograph
