Flaky tests are a waste of time and money, and they become a nightmare when the component under test is a complex, non-deterministic decision system. We found that mocks, stubs, and Docker sidecars have their limits, so we moved more tests from CI to pre-production testing (canary on steroids).
5. Confidence is not binary, it's a spectrum
• Tests should provide confidence
  • during development
  • in new deployments
  • and continuously in production (monitoring)
• Flaky tests usually assert a binary result
• 0Flake is about the tools to create a spectrum of results
6. 0Flake Agenda
• Problem Description
• Precision and Accuracy ⇒ Flaky Unit Tests
• Non-Deterministic Results ⇒ Flaky Integration Tests
• Data Pipeline Hiccups ⇒ Flaky System Tests
7. Fraud Prevention Decision as a Service
[Diagram: billing & shipping details → Features (bill_ship_dist = 1200 miles) → Fraud Prediction (25% fraud probability) → Real-time Decision (> 20% ⇒ decline), all backed by a db]
8. Precision and Accuracy ⇒ Flaky Unit Tests
[Diagram: the unit under test is the Features stage: billing & shipping details → bill_ship_dist = 1200 miles]
13. ±1 mile is negligible in terms of fraud analysis

from pytest import approx

def test_distance_miles():
    # Newport, RI to Cleveland, OH: roughly 538 miles great-circle
    newport_ri = (41.49008, -71.312796)
    cleveland_oh = (41.499498, -81.695391)
    result = distance_miles(newport_ri, cleveland_oh)
    # ±1 mile absolute (or 1% relative) error is acceptable for fraud analysis
    assert result == approx(538.39, abs=1, rel=0.01)
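For reference, a plain haversine computation is one way distance_miles could work; this sketch is an assumption, not necessarily the implementation under test:

from math import radians, sin, cos, asin, sqrt

EARTH_RADIUS_MILES = 3958.8

def distance_miles(a, b):
    # great-circle (haversine) distance between two (lat, lon) pairs, in miles
    lat1, lon1, lat2, lon2 = map(radians, (*a, *b))
    h = sin((lat2 - lat1) / 2) ** 2 + cos(lat1) * cos(lat2) * sin((lon2 - lon1) / 2) ** 2
    return 2 * EARTH_RADIUS_MILES * asin(sqrt(h))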
14. Writing these tolerances requires fraud-analysis understanding ⇒ a test DSL for analysts
The expected result is a range, not a single value:

input = {
    "bill_addr": "Newport, Rhode Island",
    "ship_addr": "Cleveland, Ohio"
}
output = {
    "bill_ship_dist": approx(538, abs=1, rel=0.01)
}
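A runner for this DSL can stay tiny: feed the input through the feature pipeline and compare each expected key against its range. A sketch, where compute_features stands in for the production feature pipeline (the name is an assumption):

def run_dsl_case(input, output, compute_features):
    # compute_features: the production feature pipeline (hypothetical name)
    actual = compute_features(input)
    for key, expected in output.items():
        # pytest.approx makes every comparison range-based by construction
        assert actual[key] == expected, f"{key}: {actual[key]} not within {expected}"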
15. Non-Deterministic Results ⇒ Flaky Integration Tests
[Diagram: the full pipeline again: billing & shipping details → Features (bill_ship_dist = 1200 miles) → Fraud Prediction (25% fraud probability) → Real-time Decision (> 20% ⇒ decline), backed by a db]
16. Features report exceptions but use fallback logic
[Diagram: the Features stage now calls a geocoding service, among other services; the rest of the pipeline is unchanged: 25% fraud probability, > 20% ⇒ decline]
17. Features report exceptions but use fallback logic
[Diagram: when geocoding fails, bill_ship_dist falls back from 1200 miles to a coarse ">100 miles" bucket; the exceptions flow to Exception Monitoring]
18. Features report exceptions but use fallback logic
[Diagram: with the ">100 miles" fallback, the fraud probability drops from 25% to 19%]
19. Features report exceptions but use fallback logic
[Diagram: at 19%, the > 20% decline rule no longer fires; the decision flips from decline to approve, so the same input can legitimately produce different results]
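This is the pattern that makes integration tests flaky: the feature is correct either way, but its precision depends on a downstream service. A minimal sketch of the fallback, with the geocode and report_exception callables injected (names are illustrative):

class GeocodingError(Exception):
    pass

def bill_ship_dist(billing_addr, shipping_addr, geocode, report_exception):
    try:
        a = geocode(billing_addr)
        b = geocode(shipping_addr)
        return distance_miles(a, b)
    except GeocodingError as e:
        # report to exception monitoring, then degrade gracefully
        report_exception(e)
        return ">100 miles"  # coarse fallback bucket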
20. Integration Testing: Stability vs Coverage
[Spectrum: stability (stubs) ←→ integration coverage (connect to other services)]
23. "Monitor" exceptions raised during each test
[Spectrum, revisited: stubs → docker sidecar (localhost) → connect to other services; moving right adds coverage, and ignoring (some) exceptions buys back stability]
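One way to "monitor" rather than fail immediately: collect the exceptions reported during the test and assert on them afterwards, tolerating the known-noisy types. A minimal pytest sketch, reusing GeocodingError from the fallback sketch above (the collection mechanism is an assumption):

import pytest

@pytest.fixture
def reported_exceptions():
    # stand-in for however the service under test surfaces its reported exceptions
    collected = []
    yield collected
    # fail only on unexpected types; a known-flaky geocoding hiccup is tolerated
    unexpected = [e for e in collected if not isinstance(e, GeocodingError)]
    assert not unexpected, f"unexpected exceptions: {unexpected}"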
24. Don't assert on a non-deterministic service
[Spectrum, revisited: alongside ignoring (some) exceptions, relaxing asserts moves tests toward stability without giving up integration coverage]
25. Expected result is a spectrum, not a single value
[Diagram: the same input may come out as 25% ⇒ decline or, via the fallback, 19% ⇒ approve; the test must accept every legitimate outcome]
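In test code, a relaxed assert accepts any outcome that is internally consistent rather than one exact value. A sketch, with run_decision as a hypothetical entry point to the pipeline:

def test_decision_spectrum():
    result = run_decision(bill_addr="Newport, Rhode Island",
                          ship_addr="Cleveland, Ohio")
    # the probability may vary with fallbacks, but should stay in a plausible band
    assert 0.10 <= result.fraud_probability <= 0.30
    # whichever branch ran, the decision must match the > 20% threshold
    expected = "decline" if result.fraud_probability > 0.20 else "approve"
    assert result.decision == expected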
30. Why canary is not the right choice for us
T1 = time to detect the problem
T2 = time to resolve the problem
engineering_loss = avg_loss_per_tx * tx_throughput * (T1 + T2)
A canary also needs a minimum number of TXs before a problem is even detectable.
31. Why canary is not the right choice for us
With the same formula, compare the stakes per transaction:
Netflix: a movie/ad recommendation, a video stream
Forter: ~0.015 × (flight tickets, jewelry, shoes, food)
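A back-of-the-envelope run of the formula shows why even a short canary window is expensive here; every number below is made up for illustration:

# illustrative numbers only, not Forter's
avg_loss_per_tx = 0.015 * 400.0    # ~1.5% of a $400 order
tx_throughput = 50                 # canary transactions per second
t1 = 8 * 60                        # seconds to detect the problem
t2 = 2 * 60                        # seconds to resolve it
engineering_loss = avg_loss_per_tx * tx_throughput * (t1 + t2)
print(f"${engineering_loss:,.0f}")  # ~$180,000 for a single 10-minute incident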
32. C.D. deploys a new version (effectless toggled on)
[Diagram: the ELB routes real traffic to the Green env (v1, production); a Blue env (v2) comes up with effectless toggled on; both share the db]
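"Effectless" means the blue env runs the full decision path on real inputs while suppressing externally visible side effects. A minimal sketch of the idea; the flag and callable names are assumptions:

import logging

log = logging.getLogger("effectless")
EFFECTLESS = True  # toggled on for the blue env while it is under validation

def apply_or_log(tx_id, decision, apply_decision):
    # apply_decision performs the real side effect (hypothetical callable)
    if EFFECTLESS:
        # record what *would* have happened instead of doing it
        log.info("effectless: would apply %s to tx %s", decision, tx_id)
        return
    apply_decision(tx_id, decision)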
33. C.D. runs warm-up tests
[Diagram: synthetic traffic is sent to the Blue env (v2, effectless) while the ELB keeps real traffic on Green (v1, production)]
34. C.D. streams a copy of real traffic for 15 minutes
[Diagram: real traffic continues to Green (v1, production) through the ELB; a copy is streamed to Blue (v2, effectless)]
35. Machines
"Some of my answers you will understand, and some of them you will not."
Image from The Matrix
36. Fraud Analysts
"You've already made your choice. You're here to try to understand *why* you made it."
Image from The Matrix
37. #effectless Slack channel
• Decisions that diverged from existing production
• API latency
• Number / percent of exceptions below threshold
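A sketch of the kind of check that feeds this channel: compare blue and green decisions over the mirrored window and post when divergence crosses a threshold (the callable and the threshold value are assumptions):

def report_divergence(green, blue, post_to_slack, threshold=0.01):
    # green / blue: {tx_id: decision} for the same mirrored 15-minute window
    common = green.keys() & blue.keys()
    diverged = [tx for tx in common if green[tx] != blue[tx]]
    rate = len(diverged) / max(len(common), 1)
    if rate > threshold:
        post_to_slack(f"#effectless: {rate:.1%} of decisions diverged ({len(diverged)} TXs)")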
38. Developers can force the ELB switch
• Blue env is actually better
• Call an analyst to explain *why*
39. C.D. toggles effectless off and diverts ELB traffic
[Diagram: the ELB drains real traffic from Green (v1, now fallback) and routes it to Blue (v2, now production); both still share the db]
40. Continuous BI monitoring and alerts
[Diagram: real traffic flows through the ELB to Blue (v2, production); Green (v1) stays up as a fallback]
41. After 4 quiet hours, C.D. safely terminates the green env
[Diagram: only Blue (v2, production) remains behind the ELB, with the db]
42. Effectless Caveats
● 15 minutes may not be enough
  ○ Small problems slip through and accumulate
    ■ Covered by BI monitoring
  ○ Stats need to be sliced per tenant / sub-service / host
● API latencies
  ○ The 99th percentile is noisy (start with the 50th and 95th)
  ○ Watch for caching effects
● Exception thresholds must be tightened gradually
  ○ Zero exceptions is not realistic for new features
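Why p99 is noisy on a short window: with roughly 1,000 requests, p99 is decided by the 10 slowest, so a handful of outliers moves it a lot. A quick illustration on synthetic latencies (not real data):

import random

random.seed(7)
latencies = [random.lognormvariate(3, 0.5) for _ in range(1000)]  # synthetic ms

def percentile(values, p):
    # nearest-rank percentile; good enough for an illustration
    ordered = sorted(values)
    return ordered[min(int(p / 100 * len(ordered)), len(ordered) - 1)]

for p in (50, 95, 99):
    # rerun with another seed: p50 barely moves, p99 jumps around
    print(f"p{p}: {percentile(latencies, p):.1f} ms")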
45. Now we need an async data pipeline
[Diagram: Decision, Analytics, and Billing services, each with its own db, fed by an async data pipeline]
46. But each service has a different data-freshness requirement
[Diagram: the same pipeline, annotated: Decision (<1 sec), Analytics (15 secs), Billing (days)]
47. Naive system tests (sleep 60)
[Diagram: send a TX, then query each service's db for it, sleeping in between to wait for propagation]
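The naive version looks like this sketch (the send/query helpers are hypothetical). It mostly works, yet the sleep is wastefully long for the fast services and never long enough for the slow ones, which is exactly where system-test flakiness comes from:

import time

def test_tx_propagates_naively(send_tx, query_decision_db, query_analytics_db):
    tx_id = send_tx(bill_addr="Newport, Rhode Island", ship_addr="Cleveland, Ohio")
    time.sleep(60)  # hope every service has caught up by now
    assert query_decision_db(tx_id) is not None   # fresh within <1 sec anyway
    assert query_analytics_db(tx_id) is not None  # 15 secs, usually fine
    # billing freshness is measured in days; no sleep can cover it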
49. Continuous Data Reconciliation
● Compares each DB with the source-of-truth DB
  ○ missing data (by timestamp, by id)
  ○ referential integrity problems ("broken links")
● Continuous testing
  ○ Green ⇒ data in sync
  ○ Red ⇒ data sync problem
51. Continuous Data Reconciliation
● Triggers microservice/pipeline APIs to reprocess data
● Continuous testing
  ○ Green ⇒ data in sync
  ○ Yellow ⇒ data is being synced
  ○ Red ⇒ data sync problem
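A reconciliation pass can be as small as a set difference over ids in a time window, plus a reprocess trigger for the gaps; all names here are illustrative:

def reconcile(source_of_truth_ids, replica_ids, trigger_reprocess):
    # ids observed in the same time window, per DB (illustrative inputs)
    missing = source_of_truth_ids - replica_ids
    if not missing:
        return "green"             # data in sync
    trigger_reprocess(missing)     # ask the pipeline/microservice API to re-emit
    return "yellow"                # being synced; escalate to red if it persists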