Flaky tests are a waste of time and money, and they become a nightmare when the component under test is a complex, non-deterministic decision system. We found that mocks, stubs, and Docker sidecars have their limits, so we moved more tests from CI to pre-production testing (canary on steroids).
5. Confidence is not binary, it's a spectrum
• Tests should provide confidence
  • during development
  • in new deployments
  • and continuously in production (monitoring)
• Flaky tests usually assert a binary result
• 0Flake is about the tools to create a spectrum of results
6. 0Flake Agenda
• Problem Description
• Precision and Accuracy ⇒ Flaky Unit Tests
• Non-Deterministic Results ⇒ Flaky Integration Tests
• Data Pipeline Hiccups ⇒ Flaky System Tests
7. Fraud Prevention Decision as a Service
[Diagram: billing & shipping details → Features (bill_ship_dist = 1200 miles) → Fraud Prediction (25% fraud probability) → Real-time Decision (> 20% ⇒ decline), all backed by a db]
8. Precision and Accuracy ⇒ Flaky Unit Tests
[Diagram: the unit under test is the Features stage: billing & shipping details → bill_ship_dist = 1200 miles]
13. ±1 mile is negligible in terms of fraud analysis

from pytest import approx

def test_distance_miles():
    # Newport, RI to Cleveland, OH: roughly 538 miles great-circle
    newport_ri = (41.49008, -71.312796)
    cleveland_oh = (41.499498, -81.695391)
    result = distance_miles(newport_ri, cleveland_oh)
    # ±1 mile absolute (or 1% relative) error is acceptable for fraud analysis
    assert result == approx(538.39, abs=1, rel=0.01)
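For reference, a plain haversine computation is one way distance_miles could work; this sketch is an assumption, not necessarily the implementation under test:

from math import radians, sin, cos, asin, sqrt

EARTH_RADIUS_MILES = 3958.8

def distance_miles(a, b):
    # great-circle (haversine) distance between two (lat, lon) pairs, in miles
    lat1, lon1, lat2, lon2 = map(radians, (*a, *b))
    h = sin((lat2 - lat1) / 2) ** 2 + cos(lat1) * cos(lat2) * sin((lon2 - lon1) / 2) ** 2
    return 2 * EARTH_RADIUS_MILES * asin(sqrt(h))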
14. Writing these tolerances requires fraud-analysis understanding ⇒ a test DSL for analysts
The expected result is a range, not a single value:

input = {
    "bill_addr": "Newport, Rhode Island",
    "ship_addr": "Cleveland, Ohio"
}
output = {
    "bill_ship_dist": approx(538, abs=1, rel=0.01)
}
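A runner for this DSL can stay tiny: feed the input through the feature pipeline and compare each expected key against its range. A sketch, where compute_features stands in for the production feature pipeline (the name is an assumption):

def run_dsl_case(input, output, compute_features):
    # compute_features: the production feature pipeline (hypothetical name)
    actual = compute_features(input)
    for key, expected in output.items():
        # pytest.approx makes every comparison range-based by construction
        assert actual[key] == expected, f"{key}: {actual[key]} not within {expected}"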
15. Non-Deterministic Results ⇒ Flaky Integration Tests
[Diagram: the full pipeline again: billing & shipping details → Features (bill_ship_dist = 1200 miles) → Fraud Prediction (25% fraud probability) → Real-time Decision (> 20% ⇒ decline), backed by a db]
16. Features report exceptions but use fallback logic
[Diagram: the Features stage now calls a geocoding service, among other services; the rest of the pipeline is unchanged: 25% fraud probability, > 20% ⇒ decline]
17. Features report exceptions but use fallback logic
[Diagram: when geocoding fails, bill_ship_dist falls back from 1200 miles to a coarse ">100 miles" bucket; the exceptions flow to Exception Monitoring]
18. Features report exceptions but use fallback logic
[Diagram: with the ">100 miles" fallback, the fraud probability drops from 25% to 19%]
19. Features report exceptions but use fallback logic
[Diagram: at 19%, the > 20% decline rule no longer fires; the decision flips from decline to approve, so the same input can legitimately produce different results]
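This is the pattern that makes integration tests flaky: the feature is correct either way, but its precision depends on a downstream service. A minimal sketch of the fallback, with the geocode and report_exception callables injected (names are illustrative):

class GeocodingError(Exception):
    pass

def bill_ship_dist(billing_addr, shipping_addr, geocode, report_exception):
    try:
        a = geocode(billing_addr)
        b = geocode(shipping_addr)
        return distance_miles(a, b)
    except GeocodingError as e:
        # report to exception monitoring, then degrade gracefully
        report_exception(e)
        return ">100 miles"  # coarse fallback bucket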
20. Integration Testing: Stability vs Coverage
[Spectrum: stability (stubs) ←→ integration coverage (connect to other services)]
23. "Monitor" exceptions raised during each test
[Spectrum, revisited: stubs → docker sidecar (localhost) → connect to other services; moving right adds coverage, and ignoring (some) exceptions buys back stability]
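One way to "monitor" rather than fail immediately: collect the exceptions reported during the test and assert on them afterwards, tolerating the known-noisy types. A minimal pytest sketch, reusing GeocodingError from the fallback sketch above (the collection mechanism is an assumption):

import pytest

@pytest.fixture
def reported_exceptions():
    # stand-in for however the service under test surfaces its reported exceptions
    collected = []
    yield collected
    # fail only on unexpected types; a known-flaky geocoding hiccup is tolerated
    unexpected = [e for e in collected if not isinstance(e, GeocodingError)]
    assert not unexpected, f"unexpected exceptions: {unexpected}"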
24. Don't assert on a non-deterministic service
[Spectrum, revisited: alongside ignoring (some) exceptions, relaxing asserts moves tests toward stability without giving up integration coverage]
25. Expected result is a spectrum, not a single value
[Diagram: the same input may come out as 25% ⇒ decline or, via the fallback, 19% ⇒ approve; the test must accept every legitimate outcome]
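In test code, a relaxed assert accepts any outcome that is internally consistent rather than one exact value. A sketch, with run_decision as a hypothetical entry point to the pipeline:

def test_decision_spectrum():
    result = run_decision(bill_addr="Newport, Rhode Island",
                          ship_addr="Cleveland, Ohio")
    # the probability may vary with fallbacks, but should stay in a plausible band
    assert 0.10 <= result.fraud_probability <= 0.30
    # whichever branch ran, the decision must match the > 20% threshold
    expected = "decline" if result.fraud_probability > 0.20 else "approve"
    assert result.decision == expected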
30. Why canary is not the right choice for us
T1 = time to detect the problem
T2 = time to resolve the problem
engineering_loss = avg_loss_per_tx * tx_throughput * (T1 + T2)
A canary also needs a minimum number of TXs before a problem is even detectable.
31. Why canary is not the right choice for us
With the same formula, compare the stakes per transaction:
Netflix: a movie/ad recommendation, a video stream
Forter: ~0.015 × (flight tickets, jewelry, shoes, food)
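A back-of-the-envelope run of the formula shows why even a short canary window is expensive here; every number below is made up for illustration:

# illustrative numbers only, not Forter's
avg_loss_per_tx = 0.015 * 400.0    # ~1.5% of a $400 order
tx_throughput = 50                 # canary transactions per second
t1 = 8 * 60                        # seconds to detect the problem
t2 = 2 * 60                        # seconds to resolve it
engineering_loss = avg_loss_per_tx * tx_throughput * (t1 + t2)
print(f"${engineering_loss:,.0f}")  # ~$180,000 for a single 10-minute incident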
32. C.D. deploys a new version (effectless toggled on)
[Diagram: the ELB routes real traffic to the Green env (v1, production); a Blue env (v2) comes up with effectless toggled on; both share the db]
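"Effectless" means the blue env runs the full decision path on real inputs while suppressing externally visible side effects. A minimal sketch of the idea; the flag and callable names are assumptions:

import logging

log = logging.getLogger("effectless")
EFFECTLESS = True  # toggled on for the blue env while it is under validation

def apply_or_log(tx_id, decision, apply_decision):
    # apply_decision performs the real side effect (hypothetical callable)
    if EFFECTLESS:
        # record what *would* have happened instead of doing it
        log.info("effectless: would apply %s to tx %s", decision, tx_id)
        return
    apply_decision(tx_id, decision)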
33. C.D. runs warm-up tests
[Diagram: synthetic traffic is sent to the Blue env (v2, effectless) while the ELB keeps real traffic on Green (v1, production)]
34. C.D. streams a copy of real traffic for 15 minutes
[Diagram: real traffic continues to Green (v1, production) through the ELB; a copy is streamed to Blue (v2, effectless)]
35. Machines
"Some of my answers you will understand, and some of them you will not."
Image from The Matrix
36. Fraud Analysts
"You've already made your choice. You're here to try to understand *why* you made it."
Image from The Matrix
37. #effectless Slack channel
• Decisions that diverged from existing production
• API latency
• Number / percent of exceptions below threshold
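A sketch of the kind of check that feeds this channel: compare blue and green decisions over the mirrored window and post when divergence crosses a threshold (the callable and the threshold value are assumptions):

def report_divergence(green, blue, post_to_slack, threshold=0.01):
    # green / blue: {tx_id: decision} for the same mirrored 15-minute window
    common = green.keys() & blue.keys()
    diverged = [tx for tx in common if green[tx] != blue[tx]]
    rate = len(diverged) / max(len(common), 1)
    if rate > threshold:
        post_to_slack(f"#effectless: {rate:.1%} of decisions diverged ({len(diverged)} TXs)")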
38. Developers can force the ELB switch
• Blue env is actually better
• Call an analyst to explain *why*
39. C.D. toggles effectless off and diverts ELB traffic
[Diagram: the ELB drains real traffic from Green (v1, now fallback) and routes it to Blue (v2, now production); both still share the db]
40. Continuous BI monitoring and alerts
[Diagram: real traffic flows through the ELB to Blue (v2, production); Green (v1) stays up as a fallback]
41. After 4 quiet hours, C.D. safely terminates the green env
[Diagram: only Blue (v2, production) remains behind the ELB, with the db]
42. Effectless Caveats
● 15 minutes may not be enough
  ○ Small problems slip through and accumulate
    ■ Covered by BI monitoring
  ○ Stats need to be sliced per tenant / sub-service / host
● API latencies
  ○ The 99th percentile is noisy (start with the 50th and 95th)
  ○ Watch for caching effects
● Exception thresholds must be tightened gradually
  ○ Zero exceptions is not realistic for new features
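Why p99 is noisy on a short window: with roughly 1,000 requests, p99 is decided by the 10 slowest, so a handful of outliers moves it a lot. A quick illustration on synthetic latencies (not real data):

import random

random.seed(7)
latencies = [random.lognormvariate(3, 0.5) for _ in range(1000)]  # synthetic ms

def percentile(values, p):
    # nearest-rank percentile; good enough for an illustration
    ordered = sorted(values)
    return ordered[min(int(p / 100 * len(ordered)), len(ordered) - 1)]

for p in (50, 95, 99):
    # rerun with another seed: p50 barely moves, p99 jumps around
    print(f"p{p}: {percentile(latencies, p):.1f} ms")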
45. Now we need an async data pipeline
[Diagram: Decision, Analytics, and Billing services, each with its own db, fed by an async data pipeline]
46. But each service has a different data-freshness requirement
[Diagram: the same pipeline, annotated: Decision (<1 sec), Analytics (15 secs), Billing (days)]
47. Naive system tests (sleep 60)
[Diagram: send a TX, then query each service's db for it, sleeping in between to wait for propagation]
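The naive version looks like this sketch (the send/query helpers are hypothetical). It mostly works, yet the sleep is wastefully long for the fast services and never long enough for the slow ones, which is exactly where system-test flakiness comes from:

import time

def test_tx_propagates_naively(send_tx, query_decision_db, query_analytics_db):
    tx_id = send_tx(bill_addr="Newport, Rhode Island", ship_addr="Cleveland, Ohio")
    time.sleep(60)  # hope every service has caught up by now
    assert query_decision_db(tx_id) is not None   # fresh within <1 sec anyway
    assert query_analytics_db(tx_id) is not None  # 15 secs, usually fine
    # billing freshness is measured in days; no sleep can cover it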
49. Continuous Data Reconciliation
● Compares each DB with the source-of-truth DB
  ○ missing data (by timestamp, by id)
  ○ referential integrity problems ("broken links")
● Continuous testing
  ○ Green ⇒ data in sync
  ○ Red ⇒ data sync problem
51. Continuous Data Reconciliation
● Triggers microservice/pipeline APIs to reprocess data
● Continuous testing
  ○ Green ⇒ data in sync
  ○ Yellow ⇒ data is being synced
  ○ Red ⇒ data sync problem
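A reconciliation pass can be as small as a set difference over ids in a time window, plus a reprocess trigger for the gaps; all names here are illustrative:

def reconcile(source_of_truth_ids, replica_ids, trigger_reprocess):
    # ids observed in the same time window, per DB (illustrative inputs)
    missing = source_of_truth_ids - replica_ids
    if not missing:
        return "green"             # data in sync
    trigger_reprocess(missing)     # ask the pipeline/microservice API to re-emit
    return "yellow"                # being synced; escalate to red if it persists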