“Without data, you’re just another person with an opinion.” W. Edwards Deming was talking about statistical quality control in manufacturing, but he could equally have been referring to managing modern iterative and automated software deployment pipelines and cloud-native infrastructure. There is certainly a wealth of open source tools to capture and visualize data. However, a data strategy isn’t solely, or even mostly, about drawing up a long list of technical measurements and instrumenting software to capture everything.
It's crucial to distinguish between metrics that relate software initiatives to positive business outcomes, the alerts needed to respond to problems now, and the data required for root cause analysis or to optimize processes over time. Not all data is equal, and most data is not a metric for measuring success.
4. “Implicit in the phrase ‘big data,’ as well as the concept of data as gold, is that more is better. But in the case of analytics, a legitimate question worth considering: Is more data really better?”
- Bob O’Donnell
5. “You can’t pick your data, but you must pick your metrics.”
- Jeff Bladt and Bob Filbin
6. “A familiar phrase on the turf is ‘horses for courses.’”
- Unknown British writer, 1898
7. “Human beings adjust behavior based on the metrics they’re held against. Anything you measure will impel a person to optimize his score on that metric. What you measure is what you’ll get. Period.”
8. THE PRINCIPLES
● You need to measure
● You need to choose relevant metrics
● Quantity may not lead to quality
● Different measurements serve different purposes
● Measurements drive behaviors
10. AUDIENCE
BUSINESS
● Customer satisfaction
● Shopping cart abandons
● Employee turnover
OPERATIONS
● Cluster health
● Utilization
● Outages
DEVELOPERS
● “Productivity”
● Test coverage
● Time to deploy
12. FUNCTIONAL GOALS (NEW RELIC)
BUSINESS SUCCESS
● Churn
● Conversion rates
● Avg revenue per user
CUSTOMER EXPERIENCE
● Customer satisfaction
● Frequency of visits
● A/B test results
APPLICATION PERFORMANCE
● Application response
● Database query time
● Uptime
SPEED
● Lead time for changes
● Code release frequency
● Mean time to resolution
QUALITY
● Deployment success rate
● Incident severity
● Outstanding bugs
14. 4 RULES FOR DATA
● Instrument (many/most of) the things
● Root cause analysis (reactive)
● Detect patterns/trends (proactive)
● Context and distributions matter
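To illustrate why distributions matter, here is a minimal sketch using only the Python standard library. The sample latencies are hypothetical, and the nearest-rank percentile is just one of several common definitions. It shows how a mean can hide a slow tail that a high percentile exposes.

```python
import statistics


def summarize(latencies_ms):
    """Return the mean, median, and 99th percentile of a list of latencies."""
    ordered = sorted(latencies_ms)
    n = len(ordered)
    # Nearest-rank percentile: ceil(0.99 * n), computed with integer arithmetic.
    rank = max(1, (99 * n + 99) // 100)
    return {
        "mean": statistics.mean(ordered),
        "median": statistics.median(ordered),
        "p99": ordered[rank - 1],
    }


# Hypothetical sample: 98 fast requests and 2 very slow ones.
samples = [100] * 98 + [5000] * 2
summary = summarize(samples)
print(summary)
```

In this made-up sample the median says a typical request is fast, the mean is pulled up by two outliers, and the 99th percentile shows what the unluckiest users actually experience.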
15. WHAT DO WE MEASURE AND STORE?
● Most things
● Unexamined data has negative ROI
● General trend toward keeping data “forever”
“Give it two years and everything will be stored.”
- Harel Kodesh, GE Digital CTO
300GB of data per engine per flight
16. SOME DIRECTIONS
● Increased use of statistics and machine learning (eyeballing dashboards doesn’t scale)
● Better understand how data interacts (latency affects page load affects customer conversion affects revenue)
● Context (seasonal patterns are OK)
● Bottom line: Find patterns that don't conform to expected behavior (anomalies 101)
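The "anomalies 101" idea, with seasonal context taken into account, might be sketched as follows. This is an illustration under stated assumptions, not a production detector: the `slot` key (for example, hour of day), the threshold of 3 standard deviations, and the sample numbers are all hypothetical.

```python
import statistics
from collections import defaultdict


def find_anomalies(history, current, threshold=3.0):
    """Flag current points that deviate from the history of the same slot.

    history and current are lists of (slot, value) pairs, where slot is a
    seasonal bucket such as hour of day. Comparing within a slot means a
    normal daily peak is not flagged just because it is high.
    """
    by_slot = defaultdict(list)
    for slot, value in history:
        by_slot[slot].append(value)

    anomalies = []
    for slot, value in current:
        past = by_slot.get(slot, [])
        if len(past) < 2:
            continue  # not enough history to estimate spread
        mean = statistics.mean(past)
        stdev = statistics.stdev(past)
        if stdev == 0:
            # No historical variation: any change at all is unexpected.
            if value != mean:
                anomalies.append((slot, value))
            continue
        if abs(value - mean) / stdev > threshold:
            anomalies.append((slot, value))
    return anomalies


# Hypothetical requests per minute at 03:00 (quiet) and 12:00 (daily peak).
history = [(3, 100), (3, 110), (3, 90), (3, 105),
           (12, 1000), (12, 1100), (12, 900), (12, 1050)]
current = [(3, 1000), (12, 1000)]
print(find_anomalies(history, current))
```

The same value of 1000 is expected at noon but anomalous at 3 a.m., which is the point of keeping seasonal context.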
17. LOGGING: EFK STACK
● Elasticsearch, Fluentd, Kibana
● Collect, index, search, and visualize log data
● Good for ad hoc analytics
● Good for post mortem forensics because of extensive log information
● Fluentd can serve as integration point between cloud native software like Kubernetes and Prometheus
18. MONITORING: PROMETHEUS
● Time series data model identified by metric name and key/value pairs
● Collection happens via a pull model over HTTP
● Prioritizes reliability, even under failure conditions, over 100% accuracy
● Most associated with web-scale DevSecOps
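The data model described above, where a time series is identified by a metric name plus key/value labels, can be sketched in plain Python. This is not the Prometheus client library; `series_key` and `render` are hypothetical helpers, the metric and label names are made up, and label-value escaping is omitted for brevity. The output only approximates the text format a scrape over HTTP would pull.

```python
def series_key(name, labels):
    """Build a series identifier from a metric name and its labels.

    Labels are sorted so the same name and label set always map to the
    same series, regardless of the order in which labels were supplied.
    """
    parts = ",".join(f'{k}="{v}"' for k, v in sorted(labels.items()))
    return f"{name}{{{parts}}}" if parts else name


def render(samples):
    """Render the latest value of each series, one 'series value' per line."""
    return "\n".join(f"{key} {value}" for key, value in sorted(samples.items()))


# Latest sample per series; different label values mean different series.
samples = {}
samples[series_key("http_requests_total", {"method": "post", "code": "200"})] = 1027
samples[series_key("http_requests_total", {"method": "post", "code": "500"})] = 3
samples[series_key("up", {})] = 1

print(render(samples))
```

Because each distinct label combination is its own series, adding a high-cardinality label (a user ID, say) multiplies the number of series to store.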
19. MONITORING: HAWKULAR
● REST API to store and retrieve availability, counter, and gauge measurements
● Visualization and alerting
● Application performance management
● Integration with ManageIQ (cloud mgmt)
● Most associated with large-scale central IT teams with lots of apps
23. WHICH OF THE FOLLOWING SHOULD WAKE UP AN EXPENSIVE ENGINEER AT 2AM?
A: Based on current trends, we need to add additional capacity within 2 weeks
B: A hardware failure led to a successful cluster failover
C: Response time has increased by 20%
D: Our customer support site is down because of an AWS-East outage
25. D: Our customer support site is down because of an AWS-East outage
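One way to encode the reasoning behind that answer is a simple routing rule: page a human only when customers are affected right now and the system has not recovered on its own. The sketch below is hypothetical; the `Event` fields and the classification of each scenario are assumptions made for illustration, not part of the original talk.

```python
from dataclasses import dataclass


@dataclass
class Event:
    name: str
    user_visible_outage: bool  # are customers unable to use the service now?
    self_healed: bool          # did automation already restore service?


def route(event):
    """Return 'page' for events needing a human now, otherwise 'ticket'."""
    if event.user_visible_outage and not event.self_healed:
        return "page"
    return "ticket"  # review during working hours


# The four scenarios from the quiz, classified under the assumptions above.
events = [
    Event("A: capacity needed within 2 weeks", False, False),
    Event("B: hardware failure, successful failover", False, True),
    Event("C: response time up 20%", False, False),
    Event("D: customer support site down", True, False),
]

for e in events:
    print(f"{e.name} -> {route(e)}")
```

Scenarios A, B, and C still produce data worth keeping and reviewing; they simply do not justify an interruption at 2 a.m.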
27. 4 RULES FOR METRICS
● What’s important to you? (Success criteria)
● Tied to business outcomes
● Traceable to root cause(s)
● Not too many!
29. SELECTED PAYPAL METRICS
WHAT | WHY
% of failed deployments | Dysfunction in deployment pipeline
Customer ticket volume | Basic customer satisfaction measure
Response time | Service operating within thresholds
Deployment frequency | Faster iterations for new code
Change volume | User stories/new lines of code
30. PUPPET LABS METRICS
● Deployment (or change) frequency
● Change lead time
● Change failure rate
● Mean Time to Recover
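As a rough illustration, the four metrics above could be computed from a list of deployment records. The `Deployment` record shape, the field names, and the sample dates are assumptions made for this sketch; real pipelines would pull equivalent timestamps from their own tooling.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Optional


@dataclass
class Deployment:
    committed_at: datetime
    deployed_at: datetime
    failed: bool = False
    restored_at: Optional[datetime] = None  # set only for failed deployments


def mean_delta(deltas):
    """Average a list of timedeltas, or return None if the list is empty."""
    return sum(deltas, timedelta()) / len(deltas) if deltas else None


def delivery_metrics(deployments, window_days):
    failures = [d for d in deployments if d.failed]
    return {
        "deployment_frequency_per_day": len(deployments) / window_days,
        "change_lead_time": mean_delta(
            [d.deployed_at - d.committed_at for d in deployments]),
        "change_failure_rate": len(failures) / len(deployments),
        "mean_time_to_recover": mean_delta(
            [d.restored_at - d.deployed_at for d in failures if d.restored_at]),
    }


# Hypothetical four-day window with one failed deployment.
deployments = [
    Deployment(datetime(2024, 1, 1, 9), datetime(2024, 1, 1, 13)),
    Deployment(datetime(2024, 1, 2, 9), datetime(2024, 1, 2, 11),
               failed=True, restored_at=datetime(2024, 1, 2, 12)),
    Deployment(datetime(2024, 1, 3, 9), datetime(2024, 1, 3, 15)),
    Deployment(datetime(2024, 1, 4, 9), datetime(2024, 1, 4, 13)),
]
metrics = delivery_metrics(deployments, window_days=4)
print(metrics)
```

All four numbers come from the same small set of timestamps, which fits the earlier point that a handful of well-chosen metrics beats a long list of measurements.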
31. RED HAT OPENSHIFT ONLINE METRICS
● Number of applications
● Efficiency (cost)
● Response time (various measures)
● Uptime
34. ANTI-PATTERN WARNING SIGNS
● Easy to collect but don’t really mean anything
● Drive lack of cooperation
● Not observable or not actionable
● Not aligned with business objectives
36. WHAT MATTERS TO YOU?
What do you want to optimize for?
Customers, cost, speed…?
38. ● Measurements matter
● They’re not metrics
● Metrics are about your success factors
● Do you need to wake someone up?
● New open source tooling (but early)