Agile Server and Data Infrastructure
TARUN RAJPUT
•  Portworx: Kubernetes storage – OnPrem, AWS, Google,
Azure, Hybrid DC
•  Cisco: Industrial IoT platform – 2 Billion Events daily
– 16 On-Prem Data Centers, AWS, Google, Azure, China Cloud,
200m devices, 17000 Customers, 100+ Releases
•  ServiceNow: ITSM World leader
•  Delphix: DB Virtualization Platform
•  Lucidera: OLAP engine – now Pentaho BI Suite
•  Syndera: – now TIBCO BI Suite
•  Celequest: In-Memory Fast Streaming DB, BI appliance –
now COGNOS
•  AT&T Bell Labs, SunSoft, HP: Unix OS QA Architect
My Past:
•  Logging, Monitoring
•  Availability
•  Latency
•  Performance
•  Security
•  Capacity Planning
•  Infrastructure Reliability, Scalability
•  Emergency Response
•  Root Cause Analysis
•  Change Management
Infrastructure Management
•  Microservices
•  MySQL, Snowflake
•  Kafka
•  Redis
•  Elastic Search
•  CentOS
•  Java, Go, Python
•  Kubernetes, Istio, Envoy, Persistent Storage
•  GCP, AWS, Azure
•  MFA, Authorization, TLS, HTTPS, Certificates
•  Volume Management, Replication
•  Load Balancers, Auto scaling, Disaster Recovery
Application Stack
•  ~70% of outages are due to changes in live production
•  All Production updates - ITIL processes
– New configuration
– New Features
– New Patch
•  Progressive rollouts
•  Accurately detect problems
•  Rolling back changes when problems arise
•  Planned downtime – Maintenance windows
Change Management
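The progressive-rollout bullets above can be sketched as a small loop: widen exposure in stages, check a health signal after each stage, and roll back on the first bad reading. This is a minimal sketch under stated assumptions, not the presenter's tooling. The stage percentages, the error-rate threshold, and the `set_traffic_percent` and `read_error_rate` callables are hypothetical stand-ins for whatever the deployment system provides.

```python
from typing import Callable, Sequence


def progressive_rollout(
    set_traffic_percent: Callable[[int], None],  # hypothetical: route N% of traffic to the new version
    read_error_rate: Callable[[], float],        # hypothetical: error rate of the new version, 0.0 to 1.0
    stages: Sequence[int] = (1, 10, 50, 100),    # assumed stage sizes
    max_error_rate: float = 0.01,                # assumed threshold
) -> bool:
    """Return True if the rollout reached the last stage, False if it was rolled back."""
    for percent in stages:
        set_traffic_percent(percent)
        if read_error_rate() > max_error_rate:
            # Problem detected: send all traffic back to the previous version.
            set_traffic_percent(0)
            return False
    return True


if __name__ == "__main__":
    applied = []
    finished = progressive_rollout(applied.append, lambda: 0.0)
    print(finished, applied)
```

Keeping the traffic switch and the health probe as injected callables lets the same loop be exercised with fakes before it is pointed at a real load balancer.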
•  Production Environments
•  Product Development
•  Quality Engineering
•  Support
•  Customers / Users
Team Interactions
•  On-call Playbooks
•  MTTR metric - mean time to repair
•  On-call team – fix production issues, handle root cause
•  May roll back to the previous version
•  Patch the production environment
•  Update configurations
•  Move load to different clusters
•  Auto-scale to address additional traffic volume
Emergency Response
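As an illustration of the MTTR metric named above, the sketch below averages the time between detection and resolution over a set of incidents. The `(detected, resolved)` tuple layout is an assumption made for this example; a real incident tracker would supply its own fields.

```python
from datetime import datetime, timedelta
from typing import Iterable, Tuple


def mean_time_to_repair(incidents: Iterable[Tuple[datetime, datetime]]) -> timedelta:
    """Average of (resolved - detected) over all incidents."""
    durations = [resolved - detected for detected, resolved in incidents]
    if not durations:
        raise ValueError("need at least one incident")
    return sum(durations, timedelta()) / len(durations)


if __name__ == "__main__":
    sample = [
        (datetime(2024, 1, 1, 10, 0), datetime(2024, 1, 1, 10, 30)),
        (datetime(2024, 1, 2, 9, 0), datetime(2024, 1, 2, 10, 30)),
    ]
    print(mean_time_to_repair(sample))
```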
Work tied to running production service
•  Manual
•  Repetitive
•  Automatable and not requiring Human Judgement
•  Interrupt driven
•  Reactive
•  No enduring value
•  Running fast to stay in the same place
TOIL everywhere!!
Automate self service tasks for running production service
•  Automation scripts
•  Creative Self Healing Autonomous engineering
•  Tools and frameworks
•  Robust Infrastructure code
•  Runbooks automation
•  Automated Configuration updates
•  Monitoring Setup, OS configuration checklist
•  Automated validations
•  Happy and Productive TEAMS!!
Kill TOIL!
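One way to read "self healing" and "runbooks automation" above is a check, remediate, re-check loop that only pages a human when the automated fix does not work. The sketch below assumes two hypothetical callables, `check` (returns True when the service is healthy) and `remediate` (runs one automated runbook step); the retry limit is also an assumption.

```python
from typing import Callable


def self_heal(
    check: Callable[[], bool],       # hypothetical health check
    remediate: Callable[[], None],   # hypothetical automated runbook step
    max_attempts: int = 3,           # assumed retry limit
) -> str:
    """Return 'healthy', 'healed', or 'escalate' (hand over to the on-call engineer)."""
    if check():
        return "healthy"
    for _ in range(max_attempts):
        remediate()
        if check():
            return "healed"
    return "escalate"
```

The explicit "escalate" outcome keeps human judgement in the loop for the cases automation cannot resolve.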
•  Resolve the crisis first, then identify and triage the root cause
•  Blame-free postmortem culture
•  Actionable
•  Learn from Failures
•  Lightweight for small/simple incidents
•  In-Depth for large/complex outages
•  Outages are an expected part of the innovation process – manage
them fearlessly!!
PostMortems - RCAs
•  Regular load testing of the system
•  Correlate raw capacity with service capacity
•  Adding additional clusters
•  More VMs to extend auto-scaling
•  Containerization
•  Updating configuration, load balancers, networking
•  Certify new capacity works
Demand Forecasting – Capacity Planning
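"Correlate raw capacity with service capacity" can be made concrete with a little arithmetic: take the throughput one instance sustained in a load test, the forecast peak, and a safety margin, and derive an instance count. The 30% headroom and the single spare instance below are assumptions for illustration, not figures from the slides.

```python
import math


def instances_needed(
    peak_rps: float,            # forecast peak demand, requests per second
    rps_per_instance: float,    # sustained throughput of one instance, from load testing
    headroom: float = 0.3,      # assumed safety margin on top of the forecast
    spare_instances: int = 1,   # assumed extra instances to survive a failure
) -> int:
    """Instances required to serve the forecast peak with headroom and spares."""
    if rps_per_instance <= 0:
        raise ValueError("rps_per_instance must be positive")
    base = math.ceil(peak_rps * (1 + headroom) / rps_per_instance)
    return base + spare_instances


if __name__ == "__main__":
    print(instances_needed(peak_rps=10_000, rps_per_instance=450))
```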
Automated CI/CD – Code, Test, Monitor, Deploy
[Pipeline diagram. Environments: Dev Lab, QA Lab, Perf Lab, Staging Lab, Prod, Cloud.
Cadence: 1 to 3 week Sprints; Nightly Build and deploy; Sprint Release (Rolling Deploy); Production Release (Rolling Deploy).
Other labels: Jenkins, AWS slaves, Test Applications, Test Datasets, Continuous Integration, Release Certification.
Lab Env: CSV, Hadoop, Google, AWS, Appliance, Database, Azure, VMware.]
•  Track System’s health and availability
•  Should address: symptom (what’s broken) and cause (why)
•  Latency, Traffic, Errors, Saturation
•  Trash what is not working, Use monitors effectively
•  Report and fix issues proactively before the errors hit
Customers!
•  Avoid staring at a Dashboard to watch for Problems! Pair
with Alerts and Logs for Historical correlation
•  Challenges in Maintaining Monitoring
Monitoring – Keep it Simple!!
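The four signals listed above (latency, traffic, errors, saturation) can be computed from one window of request records, as the sketch below shows. The `Request` record, the choice of the 95th percentile, counting only 5xx statuses as errors, and passing utilization in as a pre-measured number are all assumptions made for this example.

```python
import math
from dataclasses import dataclass
from typing import Dict, Sequence


@dataclass
class Request:
    latency_ms: float
    status: int  # HTTP status code


def golden_signals(
    requests: Sequence[Request],
    window_seconds: float,
    utilization: float,  # assumed: busiest resource, 0.0 to 1.0, measured elsewhere
) -> Dict[str, float]:
    """Summarise one time window as latency, traffic, errors, saturation."""
    if not requests:
        return {"latency_p95_ms": 0.0, "traffic_rps": 0.0,
                "error_rate": 0.0, "saturation": utilization}
    latencies = sorted(r.latency_ms for r in requests)
    rank = math.ceil(0.95 * len(latencies))  # nearest-rank 95th percentile
    errors = sum(1 for r in requests if r.status >= 500)
    return {
        "latency_p95_ms": latencies[rank - 1],
        "traffic_rps": len(requests) / window_seconds,
        "error_rate": errors / len(requests),
        "saturation": utilization,
    }
```

Alerting on these few symptom-level numbers, rather than on every internal metric, is one way to keep monitoring simple.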
QA
Change Request (CR) Approval and Tracking – Cherwell, ServiceNow
[Workflow diagram stages: Planning & Requirements; Design, Development; QA & Ops Approval; Deploy to Production; Quality Assurance]
RE/QA	
•  Incremental disruption-free rollout
•  Keep the deployment rolling: never take more than 1 host of the same type out of the
load balancer pool.
•  In case deployment results in any error:
–  Code exists on AppServer for previous release
–  Revert to Previous Release Version
–  QA runs API and UI tests on Production load balancer URL
–  Confirm Production Monitoring is all green
Rolling Deployment Model: MOP
Rollback Process
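The rolling deployment and rollback steps above can be sketched as follows: one host at a time leaves the load balancer pool, gets the new release, is validated, and rejoins; a failed validation reverts that host and every host updated before it. All five callables are hypothetical hooks into the load balancer, the deploy tool, and the QA checks; none of them come from the slides.

```python
from typing import Callable, List, Sequence

HostAction = Callable[[str], None]


def rolling_deploy(
    hosts: Sequence[str],
    remove_from_pool: HostAction,    # hypothetical load balancer hook
    deploy: HostAction,              # hypothetical: install the new release
    validate: Callable[[str], bool], # hypothetical: API/UI tests plus monitoring check
    revert: HostAction,              # hypothetical: switch back to the previous release
    add_to_pool: HostAction,         # hypothetical load balancer hook
) -> bool:
    """Return True if every host was updated, False if the release was rolled back."""
    updated: List[str] = []
    for host in hosts:
        remove_from_pool(host)  # only this one host is out of the pool
        deploy(host)
        if not validate(host):
            revert(host)
            add_to_pool(host)
            # Undo earlier hosts too, still one at a time.
            for done in reversed(updated):
                remove_from_pool(done)
                revert(done)
                add_to_pool(done)
            return False
        add_to_pool(host)
        updated.append(host)
    return True
```

Because each host is returned to the pool before the next one is removed, at most one host is out of rotation at any point, including during rollback.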
•  Monitor Infrastructure:
–  All hypervisors,
–  VMs
–  Containers
–  Kafka message queues
–  Load balancers
–  Database hosts
–  Elastic Search, Redis, RabbitMQ, network elements, switches, routers, firewalls, …
•  Monitor all applications:
–  UI, API
–  Batch servers
–  Logger apps
–  JVM monitors
–  Search app, indexing jobs
–  Database locks, Full table scans
–  Throughput, latency issues,
–  New exceptions in Splunk, Elastic Search/Kibana, expired certificates, Auto scaling issues, …
Production Monitoring 24x7x365
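Expired certificates appear in the list above, and they are one of the easier items to catch ahead of time. The sketch below turns a certificate's `notAfter` string (the format returned by Python's `ssl.SSLSocket.getpeercert()`) into days remaining. The 30-day renewal threshold is an assumption, and fetching the certificate from a live host is left out so the example stays offline.

```python
import ssl
import time
from typing import Optional


def days_until_expiry(not_after: str, now: Optional[float] = None) -> float:
    """Days left before a certificate expires; negative if already expired.

    not_after uses the certificate time format, e.g. "Jan  5 09:34:43 2018 GMT".
    """
    expires_at = ssl.cert_time_to_seconds(not_after)
    if now is None:
        now = time.time()
    return (expires_at - now) / 86400


def needs_renewal(not_after: str, threshold_days: float = 30.0,
                  now: Optional[float] = None) -> bool:
    """True when the certificate expires within threshold_days (assumed default: 30)."""
    return days_until_expiry(not_after, now) <= threshold_days
```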
– APIs are failing
– UI is not working. Unable to login, multifactor authentication is not operational
– Performance has gone down. Everything seems really slow
– Logs are showing an abundance of new exceptions
– Connectivity to external systems is broken
– Report generation is taking forever
– The search is failing after Failover – need to rebuild index
– The billing system is down
– Customers cannot provision - network APIs are failing
– Network or encryption issues – multi tenant issues
– Added new containers, microservices, MQ clusters, but horizontal scalability is not
operational
– Having RDBMS issues – Full table scans, patch adds index to very large tables
– New feature related monitors are not working
– Linux, File system or device driver has crashed!
– New release or patch is causing issues - AUTO ROLLBACK! (Kubernetes stalls a rollout
whose new pods fail readiness checks; automation can then run kubectl rollout undo)
What if - Actions
•  Monitor the state of all of the Hypervisors, VMs, Containers: Compute,
Storage, Memory, Swap, Network - private, public, hybrid cloud
•  Monitor microservices, legacy applications, log processing servers
•  Monitor the system performance – latency or throughput issues
•  Monitor the state of JVM heap
•  Monitor message queues subsystem - IBM MQ, Kafka, RabbitMQ, Kestrel. If
the queues start building up, the service may stop real soon
•  Monitor the state of frontend and backend servers
•  Monitor the state of log processing servers.
–  Splunk, Elastic Search. Monitor the exceptions in logs from various applications, microservices
or infrastructure
•  Monitor the state of MongoDB, Cassandra, Redis, Hadoop, Hive, Spark,
Nginx, Zuul.
What to Monitor?
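The warning above about queues building up suggests watching the trend in queue depth rather than a single reading. A minimal sketch, assuming queue depth (or consumer lag) is already being sampled at a fixed interval by some collector; the sample count and minimum growth per step are illustrative defaults.

```python
from typing import Sequence


def backlog_is_growing(
    depths: Sequence[int],   # queue depth samples, oldest first
    min_samples: int = 5,    # assumed: how many recent samples must agree
    min_growth: int = 1,     # assumed: minimum increase between consecutive samples
) -> bool:
    """True if each of the last min_samples readings grew by at least min_growth."""
    if len(depths) < min_samples:
        return False
    recent = depths[-min_samples:]
    return all(b - a >= min_growth for a, b in zip(recent, recent[1:]))


if __name__ == "__main__":
    print(backlog_is_growing([10, 12, 15, 20, 30]))
```

Requiring several consecutive increases avoids alerting on a single burst that consumers drain on their own.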
•  Check the throughput and latency on API, UI servers. Are there any delays?
•  Monitor the state of RDBMS. Are there any major locks or full table scans?
•  Load Balancers - Are the underlying rules and associated servers fully
operational?
•  Monitor network elements, firewalls, switches, routers for any issues
•  Monitor incremental and full backups
•  Check monitors are in place for new functionality ready to be turned on in
Production!
•  What to do if service is down? Start the automated DR immediately and
debug later!
•  Are monitors in the secondary cloud environment fully configured and
operational?
•  What's the health of the DR site before the DR occurs? Run a full set of end
to end qualification tests before declaring DR victory!
What to Monitor? …
•  Fast Releases with Features, Supported Platforms,
Performance/Security Improvements
•  99.999% Five nines SLA
•  Address customer concerns
•  Replicate Good Experience
•  Learn from Mistakes and Fix Fast
Happy Customers!
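To put the five-nines target above in concrete terms, the arithmetic below converts an availability percentage into a downtime budget. For 99.999% over a 365-day year this comes to about 5.26 minutes. The function name and the 365-day default are choices made for this example.

```python
def allowed_downtime_minutes(availability_percent: float, period_days: float = 365.0) -> float:
    """Downtime budget, in minutes, for a given availability target over a period."""
    total_minutes = period_days * 24 * 60
    return total_minutes * (1 - availability_percent / 100)


if __name__ == "__main__":
    for target in (99.9, 99.99, 99.999):
        print(target, round(allowed_downtime_minutes(target), 2))
```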
•  Team delivering Automated self service tools:
– Infrastructure configurations and updates. Kill TOIL!!
•  Monitor labs and production with automation
•  24 hours or better release cycles with no one burning out
•  Automated deploys/roll backs/validations to production
•  Everyone learning, executing, creating, achieving
Happy Teams!
•  Automated CI/CD
•  Automated Self-Service tools/labs
– Kill TOIL
•  Automated Deployments, Validations, and Rollback.
Summary
