This document discusses agile server and data infrastructure. It covers several topics including: Portworx for Kubernetes storage, Cisco's industrial IoT platform, ServiceNow as an ITSM leader, experience with logging, monitoring, availability, and more. It emphasizes automating tasks to eliminate toil, implementing monitoring, facilitating fast releases with automation and ensuring happy customers and teams.
5. • ~70% of outages are due to changes in live production
• All Production updates - ITIL processes
– New configuration
– New Features
– New Patch
• Progressive rollouts
• Accurately detect problems
• Rolling back changes when problems arise
• Planned downtime – Maintenance windows
Change Management
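The progressive rollout, problem detection, and rollback bullets above can be sketched as one loop. This is a minimal illustration, not a production rollout tool: the stage percentages, the `max_error_rate` threshold, and the `error_rate_at` callback (a stand-in for a real monitoring query) are all assumptions made for the example.

```python
from typing import Callable, Sequence, Tuple


def progressive_rollout(
    stages: Sequence[int],
    error_rate_at: Callable[[int], float],
    max_error_rate: float = 0.01,
) -> Tuple[bool, int]:
    """Widen a change stage by stage and roll back on the first bad reading.

    stages: percentage of production traffic on the new version at each step
        (hypothetical values chosen by the caller).
    error_rate_at: returns the error rate observed at a given percentage
        (hypothetical stand-in for a monitoring system).
    Returns (succeeded, percent of traffic left on the new version).
    """
    for percent in stages:
        if error_rate_at(percent) > max_error_rate:
            # Problem detected: roll the change back everywhere.
            return False, 0
    return True, (stages[-1] if stages else 0)


# A healthy change reaches 100%; a regressing one is rolled back to 0%.
print(progressive_rollout([1, 5, 25, 100], lambda percent: 0.001))
print(progressive_rollout([1, 5, 25, 100], lambda p: 0.001 if p < 25 else 0.05))
```

Checking health before each widening step keeps the blast radius of a bad change limited to the traffic share of the current stage.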
6. • Production Environments
• Product Development
• Quality Engineering
• Support
• Customers / Users
Team Interactions
7. • On-call Playbooks
• MTTR metric - mean time to repair
• On-call team – fix production issues, handle root cause
• May roll back to the previous version
• Patch the production environment
• Update configurations
• Move load to different clusters
• Auto-scale to address additional traffic volume
Emergency Response
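As an illustration of the MTTR metric above, the sketch below averages the time from detection to resolution over a set of incidents. The `(detected_at, resolved_at)` record shape and the sample timestamps are assumptions for the example; a real team would pull them from its incident tracker.

```python
from datetime import datetime, timedelta


def mean_time_to_repair(incidents):
    """Average repair time over (detected_at, resolved_at) datetime pairs."""
    if not incidents:
        raise ValueError("need at least one incident")
    total = sum(
        (resolved - detected for detected, resolved in incidents),
        timedelta(),
    )
    return total / len(incidents)


# Hypothetical incidents: one took 30 minutes to repair, one took 90 minutes.
incidents = [
    (datetime(2024, 1, 3, 9, 0), datetime(2024, 1, 3, 9, 30)),
    (datetime(2024, 1, 9, 22, 0), datetime(2024, 1, 9, 23, 30)),
]
print(mean_time_to_repair(incidents))  # 1:00:00
```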
8. Work tied to running production service
• Manual
• Repetitive
• Automatable and not requiring Human Judgement
• Interrupt driven
• Reactive
• No enduring value
• Running fast to stay in the same place
TOIL everywhere!!
9. Automate self service tasks for running production service
• Automation scripts
• Creative Self Healing Autonomous engineering
• Tools and frameworks
• Robust Infrastructure code
• Runbooks automation
• Automated Configuration updates
• Monitoring Setup, OS configuration checklist
• Automated validations
• Happy and Productive TEAMS!!
Kill TOIL!
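One small example of the "automated validations" and "OS configuration checklist" ideas above: a script that compares a host's actual settings against an expected checklist, so nobody has to eyeball them. The setting names and values are hypothetical, and how the `actual` values are collected is left out of this sketch.

```python
def validate_config(expected, actual):
    """Return a sorted list of mismatches between expected and actual settings."""
    problems = []
    for key, want in expected.items():
        if key not in actual:
            problems.append(f"{key}: missing (expected {want!r})")
        elif actual[key] != want:
            problems.append(f"{key}: expected {want!r}, found {actual[key]!r}")
    return sorted(problems)


# Hypothetical checklist and the values found on one host.
expected = {"ntp": "enabled", "swappiness": 10, "log_shipping": "enabled"}
actual = {"ntp": "enabled", "swappiness": 60}
for problem in validate_config(expected, actual):
    print(problem)
```

An empty result means the host passes the checklist, which makes the check easy to wire into a deploy or provisioning step.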
10. • Resolve the crisis first, then identify and triage the root cause
• Blame-free postmortem culture
• Actionable
• Learn from Failures
• Lightweight for small/simple incidents
• In-Depth for large/complex outages
• Outages are an expected part of the innovation process – manage them fearlessly!!
PostMortems - RCAs
11. • Regular load testing of the system
• Correlate raw capacity with service capacity
• Adding additional clusters
• More VMs to extend auto-scaling
• Containerization
• Updating configuration, load balancers, networking
• Certify new capacity works
Demand Forecasting – Capacity Planning
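"Correlate raw capacity with service capacity" can be made concrete with a little arithmetic: take the per-host throughput measured in load testing, derate it to a safe utilization, and divide the forecast peak by it. The sketch below assumes a 60% utilization target and one spare host; both figures, and the sample traffic numbers, are illustrative assumptions rather than recommendations from the slides.

```python
import math


def hosts_needed(forecast_peak_rps, per_host_rps, target_utilization=0.6, spare_hosts=1):
    """Hosts required to serve the forecast peak.

    per_host_rps: throughput one host sustained in load testing.
    target_utilization: fraction of that throughput we plan to use (assumed).
    spare_hosts: extra hosts so one can be out for a rolling deploy (assumed).
    """
    usable_rps = per_host_rps * target_utilization
    return math.ceil(forecast_peak_rps / usable_rps) + spare_hosts


# Hypothetical forecast: 12,500 requests/s peak, 500 requests/s per host.
print(hosts_needed(12_500, 500))  # 43
```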
12. Automated CI/CD – Code, Test, Monitor, Deploy
[Pipeline diagram. Labels: Dev Lab, QA Lab, Staging Lab, Perf Lab, Prod (Cloud); 1 to 3 week sprints; nightly build and deploy; sprint release with rolling deploy; production release with rolling deploy; Jenkins with AWS slaves; Continuous Integration; Release Certification; test applications and test datasets; Lab Env: CSV, Hadoop, Google, AWS, Appliance, Database, Azure, VMware]
13. • Track System’s health and availability
• Should address: Symptom (what’s broken) and Cause (why)
• Latency, Traffic, Errors, Saturation
• Trash what is not working, use monitors effectively
• Report and fix issues proactively before the errors hit Customers!
• Avoid staring at a Dashboard to watch for Problems! Pair with Alerts and Logs for Historical correlation
• Challenges in Maintaining Monitoring
Monitoring – Keep it Simple!!
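The four signals named above (latency, traffic, errors, saturation) can be computed from a window of request records. The sketch below is one simple way to do it, under several assumptions: the `Request` record shape is hypothetical, latency is summarized as a nearest-rank 95th percentile, errors are HTTP 5xx responses, and saturation is passed in as an already-measured fraction (for example CPU busy time).

```python
from dataclasses import dataclass


@dataclass
class Request:
    latency_ms: float
    status: int


def golden_signals(requests, window_seconds, saturation):
    """Summarize one monitoring window as the four golden signals."""
    if not requests:
        raise ValueError("no requests in window")
    latencies = sorted(r.latency_ms for r in requests)
    # Nearest-rank 95th percentile, using integer arithmetic.
    p95_index = (95 * len(latencies) + 99) // 100 - 1
    errors = sum(1 for r in requests if r.status >= 500)
    return {
        "latency_p95_ms": latencies[p95_index],
        "traffic_rps": len(requests) / window_seconds,
        "error_rate": errors / len(requests),
        "saturation": saturation,
    }


# Hypothetical 10-second window: 20 requests, the slowest one failed.
window = [Request(10.0 * i, 200) for i in range(1, 20)] + [Request(200.0, 500)]
print(golden_signals(window, window_seconds=10, saturation=0.7))
```

Alerting on these symptom-level numbers, rather than on every internal metric, is one way to keep monitoring simple.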
14. QA
Change Request (CR) Approval and Tracking – Cherwell, ServiceNow
[Process diagram stages: Planning & Requirements; Design, Development; QA & Ops Approval; Deploy to Production; Quality Assurance]
15. RE/QA
• Incremental disruption-free rollout
• Ensure rolling deployment by never taking more than 1 host of the same type out of the load balancer pool.
• Rollback Process (in case deployment results in any error)
– Code exists on AppServer for previous release
– Revert back to Previous Release Version
– QA runs API and UI tests on Production load balancer URL
– Confirm Production Monitoring is all green
Rolling Deployment Model: MOP
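The rolling deployment and rollback steps of this slide can be sketched as a loop that drains one host at a time and reverts everything if a health check fails. The `deploy`, `health_check`, and `rollback` callbacks and the in-memory `pool` set are hypothetical stand-ins for real deployment tooling and a real load balancer API.

```python
def rolling_deploy(hosts, pool, deploy, health_check, rollback):
    """Upgrade hosts one by one; at most one host is out of `pool` at a time.

    Returns True if every host was upgraded, False if the release was reverted.
    """
    upgraded = []
    for host in hosts:
        pool.discard(host)           # take exactly one host out of the pool
        deploy(host)
        if not health_check(host):
            rollback(host)           # previous release is still on the host
            pool.add(host)
            for done in reversed(upgraded):
                pool.discard(done)   # still only one host out at a time
                rollback(done)
                pool.add(done)
            return False
        pool.add(host)
        upgraded.append(host)
    return True


# Hypothetical run: the new version fails its health check on app2.
hosts = ["app1", "app2", "app3"]
versions = dict.fromkeys(hosts, "v1")
pool = set(hosts)
ok = rolling_deploy(
    hosts,
    pool,
    deploy=lambda h: versions.update({h: "v2"}),
    health_check=lambda h: h != "app2",
    rollback=lambda h: versions.update({h: "v1"}),
)
print(ok, versions)
```

In this run every host ends up back on the previous version and back in the pool, which matches the "revert back to previous release version" step above.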
16. • Monitor Infrastructure:
– All hypervisors,
– VMs
– Containers
– Kafka message queues
– Load balancers
– Database hosts
– Elasticsearch, Redis, RabbitMQ, network elements, switches, routers, firewalls, …
• Monitor all applications:
– UI, API
– Batch servers
– Logger apps
– JVM monitors
– Search app, indexing jobs
– Database locks, full table scans
– Throughput, latency issues
– New exceptions in Splunk, Elasticsearch/Kibana, expired certificates, auto-scaling issues, …
Production Monitoring 24x7x365
17. – APIs are failing
– UI is not working. Unable to log in, multifactor authentication is not operational
– Performance has gone down. Everything seems really slow
– Logs are showing an abundance of new exceptions
– Connectivity to external systems is broken
– Report generation is taking forever
– The search is failing after Failover – need to rebuild index
– The billing system is down
– Customers cannot provision - network APIs are failing
– Network or encryption issues – multi tenant issues
– Added new containers, microservices, MQ clusters, but horizontal scalability is not operational
– Having RDBMS issues – Full table scans, patch adds index to very large tables
– New feature related monitors are not working
– Linux, File system or device driver has crashed!
– New release or patch is causing issues - AUTO ROLLBACK! (Kubernetes can halt a failing rolling update; kubectl rollout undo reverts it)
What if - Actions
18. • Monitor the state of all of the Hypervisors, VMs, Containers: Compute, Storage, Memory, Swap, Network - private, public, hybrid cloud
• Monitor microservices, legacy applications, log processing servers
• Monitor the system performance – latency or throughput issues
• Monitor the state of JVM heap
• Monitor message queues subsystem - IBM MQ, Kafka, RabbitMQ, Kestrel. If the queues start building up, the service may stop very soon
• Monitor the state of frontend and backend servers
• Monitor the state of log processing servers.
– Splunk, Elasticsearch. Monitor the exceptions in logs from various applications, microservices or infrastructure
• Monitor the state of MongoDB, Cassandra, Redis, Hadoop, Hive, Spark, Nginx, Zuul.
What to Monitor?
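The warning above about queues building up suggests alerting on the trend in queue depth, not only on its absolute size. The sketch below flags a queue whose depth rose on every one of the last few samples; the sample count and growth threshold are arbitrary assumptions, and collecting the depth samples from a particular broker is outside the sketch.

```python
def queue_is_backing_up(depth_samples, min_samples=5, min_growth=100):
    """True if depth rose on every recent sample and grew by at least min_growth."""
    recent = depth_samples[-min_samples:]
    if len(recent) < min_samples:
        return False  # not enough history to judge a trend
    rising = all(later > earlier for earlier, later in zip(recent, recent[1:]))
    return rising and recent[-1] - recent[0] >= min_growth


# Hypothetical depth readings taken at a fixed interval.
print(queue_is_backing_up([10, 40, 90, 160, 250]))  # True: steady growth
print(queue_is_backing_up([10, 40, 30, 45, 20]))    # False: consumers keep up
```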
19. • Check the throughput and latency on API, UI servers. Are there any delays?
• Monitor the state of RDBMS. Are there any major locks or full table scans?
• Load Balancers - Are the underlying rules and associated servers fully operational?
• Monitor network elements, firewalls, switches, routers for any issues
• Monitor incremental and full backups
• Check monitors are in place for new functionality ready to be turned on in Production!
• What to do if service is down? Start the automated DR immediately and debug later!
• Are monitors in the secondary cloud environment fully configured and operational?
• What's the health of the DR site before the DR occurs? Run a full set of end to end qualification tests before declaring DR victory!
What to Monitor? …
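The last bullet above, running end to end qualification tests before declaring DR victory, can be sketched as a small runner that executes every check and reports all failures instead of stopping at the first. The check names and the lambdas below are hypothetical placeholders for real API, UI, and backup probes against the DR site.

```python
def qualify_dr_site(checks):
    """Run every named check; return (all_passed, names_of_failed_checks).

    checks: mapping of name -> zero-argument callable returning True on pass.
    A check that raises is counted as a failure.
    """
    failures = []
    for name, check in checks.items():
        try:
            passed = bool(check())
        except Exception:
            passed = False
        if not passed:
            failures.append(name)
    return (not failures, failures)


# Hypothetical qualification suite for a DR site.
checks = {
    "api_login": lambda: True,
    "search_index_rebuilt": lambda: False,
    "monitors_configured": lambda: True,
}
print(qualify_dr_site(checks))  # (False, ['search_index_rebuilt'])
```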
20. • Fast Releases with Features, Supported Platforms, Performance/Security Improvements
• 99.999% (five nines) SLA
• Address customer concerns
• Replicate Good Experience
• Learn from Mistakes and Fix Fast
Happy Customers!
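To put the five nines SLA above in concrete terms, the downtime budget is simple arithmetic: the unavailable fraction times the length of the period. The function name and the 365-day year are choices made for this example.

```python
def allowed_downtime_minutes(sla_percent, period_days=365):
    """Downtime budget, in minutes, for an availability SLA over a period."""
    unavailable_fraction = (100 - sla_percent) / 100
    return unavailable_fraction * period_days * 24 * 60


# Five nines leaves roughly 5.26 minutes of downtime per year.
print(round(allowed_downtime_minutes(99.999), 2))  # 5.26
print(round(allowed_downtime_minutes(99.9), 1))    # 525.6
```

A budget this small leaves little room for manual recovery, which is why the earlier slides stress automated rollback and DR.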
21. • Team delivering Automated self service tools:
– Infrastructure configurations and updates. Kill TOIL!!
• Monitor labs and production with automation
• 24 hours or better release cycles with no one burning out
• Automated deploys/roll backs/validations to production
• Everyone learning, executing, creating, achieving
Happy Teams!