Building and Monitoring
Services at Lithium
(fault tolerance, resiliency and monitoring)

Paul Cichonski, Senior Software Engineer
@paulcichonski
Services at Lithium Use:

2
Failure is a Constant, Need to
Avoid Cascading Failure

Image Source: Netflix Hystrix: https://github.com/Netflix/Hystrix/wiki

3
We All Know How to Simulate
Failure:

4
But how do we develop code to
deal with failure?

5
Need to build fault tolerant and
resilient services... How?

Clustering, for high-availability, is
not enough to protect against
cascading failure
6
#1 Fail Fast: use timeouts
aggressively
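
A minimal sketch of the fail-fast idea, assuming a caller-side deadline enforced with a thread pool. The names `call_with_timeout`, `slow_dependency`, and `fast_dependency` are hypothetical and not from the talk; in a real client the socket or HTTP library's own timeout settings would usually be the first thing to configure.

```python
import concurrent.futures
import time

# Shared pool for outbound calls (the size is an arbitrary choice for this sketch).
_POOL = concurrent.futures.ThreadPoolExecutor(max_workers=4)


def call_with_timeout(fn, timeout_s, fallback):
    """Return fn() if it finishes within timeout_s seconds, otherwise fallback."""
    future = _POOL.submit(fn)
    try:
        return future.result(timeout=timeout_s)
    except concurrent.futures.TimeoutError:
        # Best effort only: a call that has already started keeps running.
        future.cancel()
        return fallback


def slow_dependency():
    """Hypothetical stand-in for a network call that is responding slowly."""
    time.sleep(0.3)
    return "fresh"


def fast_dependency():
    """Hypothetical stand-in for a healthy network call."""
    return "fresh"
```

The caller gets an answer within a bounded time either way, so a slow dependency costs one worker thread rather than the whole request.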

7
#2 Use circuit breakers on
network calls
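
A rough sketch of the circuit-breaker pattern, not Hystrix's API. The class name, the thresholds, and the injectable `clock` are assumptions made for illustration, and the sketch is not thread-safe.

```python
import time


class CircuitOpenError(Exception):
    """Raised when the breaker rejects a call without trying the dependency."""


class CircuitBreaker:
    def __init__(self, max_failures=3, reset_after_s=30.0, clock=time.monotonic):
        self.max_failures = max_failures
        self.reset_after_s = reset_after_s
        self.clock = clock
        self.failures = 0
        self.opened_at = None  # None means the circuit is closed

    def call(self, fn):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_after_s:
                # Fail fast and give the downstream service time to heal.
                raise CircuitOpenError("dependency marked unhealthy")
            # Cool-down elapsed: half-open, let one trial call through.
        try:
            result = fn()
        except Exception:
            self.failures += 1
            if self.opened_at is not None or self.failures >= self.max_failures:
                self.opened_at = self.clock()  # open (or re-open) the circuit
            raise
        # Success closes the circuit and resets the failure count.
        self.failures = 0
        self.opened_at = None
        return result
```

While the circuit is open the client fails immediately instead of waiting on a dependency that is known to be unhealthy, which protects the client and takes load off the struggling service.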

8
#3 Use async communication
when possible

9
#4 Have well thought-out
backpressure mechanisms
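
One simple backpressure mechanism is a bounded inbox that refuses work when it is full, so the producer finds out immediately and can back off. This is a sketch under that assumption; `BoundedInbox` and its methods are hypothetical names, not something shown in the talk.

```python
import queue


class BoundedInbox:
    """A fixed-capacity work queue that signals overload instead of growing."""

    def __init__(self, capacity):
        self._items = queue.Queue(maxsize=capacity)

    def offer(self, item):
        """Return True if accepted, False if full (the caller should back off)."""
        try:
            self._items.put_nowait(item)
            return True
        except queue.Full:
            return False

    def take(self):
        """Remove and return the oldest item; raises queue.Empty if none."""
        return self._items.get_nowait()
```

In a service, a `False` from `offer` could be translated into an explicit "overloaded, retry later" response to upstream callers rather than letting requests pile up silently.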

10
#5 Use cross-region (or cross-datacenter) replication

11
#6 Failure models should be built
into the business requirements of a
service

12
Read:

13
Even with all of that, your app will
still fail, so how do you recover
quickly?

14
DevOps/CloudOps Model:
OODA

15
Observe and Orient: you need
metrics and dashboards

16
You Need Metrics
• Reduce “map/territory” confusion
• We use Yammer Metrics
– Timers
– Meters
– Histograms

• We use them a lot
– Every class has at least one metric, most
have multiple
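
To illustrate what a timer metric records, here is a small Python sketch. Yammer Metrics is a Java library and this does not reproduce its API; the `Timer` class, its `measure` method, and the injectable `clock` are assumptions made for illustration.

```python
import time
from contextlib import contextmanager


class Timer:
    """Records how long each measured block took, in seconds."""

    def __init__(self, clock=time.perf_counter):
        self._clock = clock
        self.durations = []

    @contextmanager
    def measure(self):
        start = self._clock()
        try:
            yield
        finally:
            # Record the duration even if the measured block raised.
            self.durations.append(self._clock() - start)

    @property
    def count(self):
        """Number of measured calls so far."""
        return len(self.durations)

    def max(self):
        """Longest recorded duration, or 0.0 if nothing has been measured."""
        return max(self.durations) if self.durations else 0.0
```

Wrapping a critical call in `with timer.measure():` yields both a call count and a latency distribution, which is the raw material for the dashboards on the following slides.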
17
You Need to Visualize the Metrics

18
You Need Dashboards Keyed to
Business Functionality

19
Use alerting as a last resort
(because sometimes we need to
sleep)

20
Decide and Act: you need robust
CI and fast code roll-outs

21
Rinse and Repeat

22

Editor's notes

  1. “These are the technologies we use at Lithium. You can see that we use a few different technologies for storing data, mostly for different use cases (i.e., batch vs. realtime vs. transactional). On top of all these data-storage technologies we are also building up services as we move towards a service-oriented architecture and horizontally scalable, highly available services. As we move towards SOA, and more importantly towards cloud, the design space changes and we must deal with failure more realistically... Transition: failure is constant.”
  2. - As the number of dependencies for fulfilling a request goes up, so does the probability of failure. That is fine; we just can’t have cascading failure.
  3. - Everyone has now heard of Netflix’s Simian Army for simulating failure, but we don’t always talk about the coding practices needed to withstand its wrath.
  4. This is not really a problem of “cloud”; it is a problem associated with building distributed, horizontally scalable applications. The only time you don’t have to worry about these things is if your chosen method of scaling is “up” and not “out”, but even then, does it connect to users (i.e., the wider system context is distributed)? In the past generation of “scale-up”, failure meant everything was dead; now it just means functionality gets degraded.
  5. - Especially on network calls; this is about protecting the client.
  6. - Also about protecting the client.
     - See Hystrix from Netflix; this is about protecting the client and allowing the downstream service to heal.
  7. - Beware that it is harder to reason about anything async.
  8. - This is about signaling to upstream traffic that something is wrong downstream and that callers may want to take evasive action.
  9. Or at least fail-over. Easy in cloud, harder in datacenter.
  10. - They should be explicit: how is this app going to deal with failure in these dependencies (always from the client-side perspective)?
  11. - Now that your apps have all of the previous concepts built in, bad stuff will still happen. How do you manage the service in production to know when things are going wrong?
  12. - Find the most critical calls in your app (i.e., network calls, client calls).
      - Figure out a way to visualize them to gain instant awareness of what is going wrong in prod.
      - A service should be small enough for a single engineer to gain full insight (assuming they have the baseline).
  13. - Create alerts around specific log levels (ERROR) or system usage outside of a well-known baseline.
      - An alert should mean that something needs immediate attention (i.e., keep noise to a minimum).
      - Alerts should be a last resort (because sometimes you need to sleep).
      - Alerts should not be a substitute for continuous monitoring of the service through dashboards.
  14. Or at least fail-over. Easy in cloud, harder in datacenter.