Troubleshooting App Health and Performance with PCF Metrics 1.2

  1. PCF Metrics: Providing App Developers insight into app performance. Pieter Humphrey, Allen Duet
  2. Gartner believes that more than 80% of all mission-critical IT service outages result from people and process errors and failures, and of those outages, more than 50% result from a lack of coordination between change, release and configuration management processes. Source: "Four Steps to Optimize Configuration Management Process and Tools", Ronni J. Colville, Doc #G00258557, Oct 2013
  3. Modern infrastructure is constantly changing. Methodologies: from linear / sequential to Agile, DevOps, CI/CD pipelines. Deployment: from sparingly at designated times to ready for prod at any time. Architecture: from monolithic app to microservices / composite app. Technologies: from app server on machine to containers and public / private / hybrid cloud. Operations: from many tools and ad hoc automation to managing services, not servers.
  4. Rate of change is driving more outages
  5. Outages often preventable using automation (2015). Facebook: 1 hour, Jan 26th, config / app / net failures. Apple App Store: 11 hours, March 11th, internal DNS error. NYSE, United, WSJ: 4 hr, 1.5 hr, 1 hr, July 8th, software update, routing failure, server overload. UltraDNS: 2.5 hours, Oct 15th, configuration errors.
  6. Speed and Availability Matter. “25% of customers will abandon a web page that takes more than 4 seconds to load” “47% of consumers expect a web page to load in < 2 seconds” “Customers prefer competitors website if it is 250ms faster” “Increase revenue 1% for each 100ms improvement” Sources: Gartner, Google, Amazon, Walmart
  7. Speed, Performance and Human Perception. Delay time and user reaction: 0-100 ms, instant; 100-300 ms, feels sluggish; 300-1000 ms, machine is working; 1 second+, mental context switch; 10 seconds+, I’ll come back later. Stay under 250 ms to feel "fast". Stay under 1000 ms to keep users' attention. Source: Breaking the 1000 ms Mobile Barrier, Velocity, Google Slides
  8. Changes to a single microservice or monolithic app can impact performance of downstream apps and services, or cause breakage
  9. Troubleshooting apps and microservices is hard. Most platforms have: disparate permissions on different apps; data silos across subsystems; trouble reconciling time series data
  10. Development: multiple languages, microservices, support services, marketplace (native, user provided, partner). Operations: operating system, Cloud API, container orchestration, app deployment & management, availability, visibility & administration. Also shown: CI/CD tools, ID, security; health, metrics, patching; apps & platform dashboards
  11. 4 Levels of High Availability: (1) app instance fail, (2) process fail, (3) VM fail, (4) availability zone fail
  12. Container Scheduler Handles Workloads: 250,000 containers managed in a single environment
  13. Container Scheduler Handles Workloads: dynamic load balancing
  14. Container Scheduler Handles Workloads: dynamic load balancing; remediation and rebalance of workloads
  15. Each Layer Upgradable with No Downtime. Container layers: app runtime*, file system mapping, application, Linux host & kernel. Blue-green deploy; canary-style deploy. * e.g. embedded webserver, app configurations, JRE, agents for services packaged as buildpacks
  16. Our Charter: to provide App Devs with data points to assess overall solution performance and health. Providing App Developers insight into app performance
  17. Immediate, Integrated, Automated: • Near real-time view • Covers 80-90% of the problems • One tool correlates events, logs, metrics • Common set of facts for Dev+Ops • Designed for PCF multi-tenancy • Agentless, no install • Enabled automatically for all applications
  19. Select an app, watch streaming data
  20. PCF Metrics v1.2.1: 2 weeks of app log storage; 2 weeks of detailed container and HTTP start stop metric storage; app log distribution histogram; app event UI improvements; fault tolerance on all storage services; testing and tuning for large ingestion loads
  21. Data Correlation Demo
  22. PCF Metrics 1.2 Architecture
  23. Our Journey. PCF Metrics v1.0: aggregate container and HTTP metrics provided for apps. PCF Metrics v1.1: aggregate container and HTTP metrics + app events and logs (24 hour storage). PCF Metrics v1.2.1: aggregate container and HTTP metrics + app events and logs (2 weeks storage). PCF Metrics v1.3: aggregate container and HTTP metrics + app events and logs (2 weeks storage), TraceID capture and trace logs
  24. v1.3+ for App Developers: Spring Boot Actuator support; expanded event descriptions; additional log sources*; data exposed as API; continued UX improvements
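
Slide 9 lists "trouble reconciling time series data" as a pain point, and slide 17 describes one tool that correlates events, logs and metrics. As a rough illustration of what such correlation can mean, the sketch below groups log lines and latency samples into shared one-minute buckets. It is not how PCF Metrics is implemented: the function names `bucket_start` and `correlate`, the 60-second window, and the sample data are all assumptions made for this example.

```python
from collections import defaultdict
from datetime import datetime, timezone

BUCKET_SECONDS = 60  # assumed window size, not a PCF Metrics setting


def bucket_start(ts, size=BUCKET_SECONDS):
    """Round a timezone-aware timestamp down to the start of its bucket."""
    epoch = int(ts.timestamp())
    return datetime.fromtimestamp(epoch - epoch % size, tz=timezone.utc)


def correlate(logs, metrics, size=BUCKET_SECONDS):
    """Group log lines and latency samples that fall in the same bucket.

    logs:    iterable of (timestamp, message)
    metrics: iterable of (timestamp, latency_ms)
    Returns {bucket_start: {"logs": [...], "avg_latency_ms": float or None}},
    ordered by bucket start time.
    """
    buckets = defaultdict(lambda: {"logs": [], "latencies": []})
    for ts, message in logs:
        buckets[bucket_start(ts, size)]["logs"].append(message)
    for ts, latency_ms in metrics:
        buckets[bucket_start(ts, size)]["latencies"].append(latency_ms)

    result = {}
    for start in sorted(buckets):
        samples = buckets[start]["latencies"]
        result[start] = {
            "logs": buckets[start]["logs"],
            "avg_latency_ms": sum(samples) / len(samples) if samples else None,
        }
    return result


if __name__ == "__main__":
    # Hypothetical data: one error log line and three latency samples.
    at = lambda m, s: datetime(2016, 1, 1, 12, m, s, tzinfo=timezone.utc)
    logs = [(at(0, 5), "ERR upstream timeout")]
    metrics = [(at(0, 40), 900.0), (at(0, 50), 1100.0), (at(1, 10), 120.0)]
    for start, row in correlate(logs, metrics).items():
        print(start.isoformat(), row)
```

Putting both streams on the same bucket boundaries lets a latency spike and the log lines from the same minute be read side by side, which is the kind of "common set of facts" slide 17 refers to.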