Engineers guide to data analysis

Engineer’s Guide to Data Analysis
Avishai Ish-Shalom
github.com/avishai-ish-shalom
@nukemberg
nukemberg@wix.com

Wix in numbers
~ 400 Engineers
~ 1400 employees
~ 100M Sites
~ 250 microservices

IaaS
(Insult as a Service)
▪ Thin API, written in Flask (Python)
▪ CouchDB
▪ Apache proxy
▪ StatsD, Graphite, ELK

Graphite
▪ Metrics collector, storage and UI
▪ Math functions
▪ Common
▪ De-facto standard

Oops, I think something is broken

What is this “metric” you speak of?

A metric is
▪ Numeric data
▪ Often with timestamp (time series)
▪ A “measurement” of something
▪ Discrete

Where do metrics come from?
▪ Events with numeric data
▪ Counting/aggregating
▪ Sampling

Events
▪ Data about something that happened
▪ Has a timestamp (time series data)
▪ Has properties, both numeric and non-numeric
{
  "timestamp": "2016-11-15T18:43:39+00:00",
  "host": "test01.example.net",
  "status": "ok",
  "latency": 14.31
}
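Counting and aggregating events like the one above is one way metrics come about. The following is a minimal sketch, not anything from the talk: the extra events and the `aggregate_per_minute` helper are hypothetical, and the one-minute window is an arbitrary choice.

```python
from collections import defaultdict
from datetime import datetime

# Hypothetical events in the same shape as the example event.
events = [
    {"timestamp": "2016-11-15T18:43:39+00:00", "host": "test01.example.net",
     "status": "ok", "latency": 14.31},
    {"timestamp": "2016-11-15T18:43:52+00:00", "host": "test01.example.net",
     "status": "ok", "latency": 9.80},
    {"timestamp": "2016-11-15T18:44:05+00:00", "host": "test01.example.net",
     "status": "error", "latency": 120.00},
]

def aggregate_per_minute(events):
    """Group events into one-minute windows and reduce each window to a few aggregates."""
    buckets = defaultdict(list)
    for event in events:
        ts = datetime.fromisoformat(event["timestamp"])
        # Truncate the timestamp to the start of its minute.
        buckets[ts.replace(second=0, microsecond=0)].append(event["latency"])
    return {
        minute.isoformat(): {
            "count": len(latencies),
            "max": max(latencies),
            "avg": sum(latencies) / len(latencies),
        }
        for minute, latencies in sorted(buckets.items())
    }

metrics = aggregate_per_minute(events)
for minute, values in metrics.items():
    print(minute, values)
```

Each window keeps only three numbers regardless of how many events fell into it; the individual events are gone once the window is reduced.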

10,000 events/sec x 0.5 KB/event = how much data?
Roughly 400 GB a day (432 GB if 1 KB = 1000 bytes)
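The arithmetic behind that figure can be checked in a few lines. This sketch assumes decimal units (1 KB = 1000 bytes, 1 GB = 10^9 bytes); the constant names are illustrative.

```python
# Back-of-the-envelope telemetry volume, using the figures from the slide.
EVENTS_PER_SEC = 10_000
BYTES_PER_EVENT = 500            # 0.5 KB, assuming 1 KB = 1000 bytes
SECONDS_PER_DAY = 24 * 60 * 60   # 86,400

bytes_per_day = EVENTS_PER_SEC * BYTES_PER_EVENT * SECONDS_PER_DAY
gb_per_day = bytes_per_day / 1e9

print(f"{gb_per_day:.0f} GB/day")
```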

Telemetry is a big data problem

Aggregates are lossy compression
We must decide in advance how we’ll use the metric

Aggregates
▪ Max, Min, Sum, Average, etc
▪ Last, random point
▪ Percentiles (quantiles)
▪ Histograms, reverse quantiles
▪ Each is suitable for a particular use case

Percentiles
p99 - the sampled value that is greater than or equal to 99% of the samples
▪ O(n) memory complexity
▪ O(n*log n) computation complexity
▪ Some shortcuts for p50 (median), p100 (max), p0 (min)
Use when clients experience individual values
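To make the cost concrete, here is a sketch of an exact percentile using the nearest-rank definition. That definition is an assumption (the slide does not say which one is meant, and libraries differ), and the `percentile` function name is illustrative.

```python
import math

def percentile(samples, p):
    """Nearest-rank percentile: smallest sample with at least p% of samples at or below it."""
    if not samples:
        raise ValueError("percentile of an empty sample is undefined")
    ordered = sorted(samples)                  # O(n log n) time, all n samples held in memory
    rank = math.ceil(p * len(ordered) / 100)   # 1-based rank
    return ordered[max(rank, 1) - 1]           # p = 0 falls back to the minimum

latencies = list(range(1, 101))  # hypothetical sample: 1..100 ms
print(percentile(latencies, 99))   # p99
print(percentile(latencies, 50))   # median
print(percentile(latencies, 100))  # same as max(latencies)
print(percentile(latencies, 0))    # same as min(latencies)
```

The shortcuts on the slide follow from this: p0 and p100 need only a running min and max, so they can be tracked in O(1) memory without sorting.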

Percentiles
▪ Percentiles are not additive
▪ You cannot average percentiles
Example:
s1 (100 points) = [0, 0, ....., 100, 100] => p99 = 100
s2 (100 points) = [0, 0, …., 50, 50] => p99 = 50
p99(s1 and s2 combined) = 50, but avg(p99(s1), p99(s2)) = 75
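The example can be reproduced with a short script. Two things here are assumptions: the elided part of each sample is taken to be 98 zeros followed by two non-zero points, and the percentile uses the nearest-rank definition.

```python
import math

def percentile(samples, p):
    """Nearest-rank percentile (one of several common definitions)."""
    ordered = sorted(samples)
    rank = max(math.ceil(p * len(ordered) / 100), 1)
    return ordered[rank - 1]

# Assumed contents of the elided samples: 98 zeros, then two high values.
s1 = [0] * 98 + [100] * 2
s2 = [0] * 98 + [50] * 2

p99_s1 = percentile(s1, 99)
p99_s2 = percentile(s2, 99)
p99_combined = percentile(s1 + s2, 99)       # percentile over the raw merged samples
avg_of_p99 = (p99_s1 + p99_s2) / 2           # what averaging the two percentiles gives

print(p99_s1, p99_s2, p99_combined, avg_of_p99)
```

The merged sample has 200 points, so its p99 is the 198th smallest value, which is 50. Averaging the two per-sample percentiles gives 75, a value that appears in neither sample.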

Histograms
Visualizes the distribution of a sample
▪ Count of events in each bin
▪ Bins are usually evenly spaced
▪ Use logarithmically spaced bins for long tails
▪ Additive
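A small sketch of why histograms are additive, using logarithmically spaced bins. The power-of-two bin schema and the `log2_bin` and `histogram` names are choices made for this illustration only.

```python
import math
from collections import Counter

def log2_bin(value):
    """Index k of the bin [2**k, 2**(k+1)) that holds a positive value."""
    if value <= 0:
        raise ValueError("logarithmic bins only cover positive values")
    return math.floor(math.log2(value))

def histogram(samples):
    """Count of samples per logarithmic bin."""
    return Counter(log2_bin(v) for v in samples)

# Hypothetical latency samples (ms) from two hosts, with a long tail on host_b.
host_a = [3, 5, 6, 12, 13]
host_b = [5, 7, 100, 900]

merged = histogram(host_a) + histogram(host_b)   # add bin counts
direct = histogram(host_a + host_b)              # histogram of the raw merged samples

print(sorted(merged.items()))
print(merged == direct)
```

Adding bin counts gives the same result as building the histogram from the pooled raw samples, provided both sides use the same bin schema. Percentiles have no equivalent merge operation.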

Histograms :-(
So why aren’t we all using this?
▪ Storage
▪ Have to decide on a bin schema
▪ Not many tools support this

Choosing the right aggregate
▪ Percentiles/histograms for latency
▪ Max/min for latency and sizes
▪ Histogram analysis for sizes and latency
▪ Sums/averages for capacity and money
▪ Aggregate per domain
▪ Look for deviations

Resolution
▪ Humans need ~5 data points to see a trend
▪ Hides faster changes
▪ Rollups/downsampling are hard
▪ Multi tier FTW!

“It ain't what you don’t know that gets you into trouble.
It's what you know for sure that just ain’t so.”

Peak Erasure/Spike erosion
■ When lowering resolution, data points are aggregated
■ Default aggregation is average
■ Peaks are erased
■ This can happen in storage or visualization

Peak Erasure/Spike erosion
■ Storage systems down-sample to save space
■ Aggregation function may be configurable
■ Metric collectors aggregate too
○ carbon-cache uses last value
○ StatsD - gauges, timers, counters
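Spike erosion is easy to reproduce. In this sketch the series, the window size, and the `downsample` helper are all hypothetical; it is not the code of any particular storage engine.

```python
def downsample(points, factor, aggregate):
    """Reduce resolution by aggregating each run of `factor` consecutive points."""
    return [aggregate(points[i:i + factor]) for i in range(0, len(points), factor)]

def mean(values):
    return sum(values) / len(values)

# Twelve points at high resolution, with a single spike.
series = [10] * 12
series[7] = 500

averaged = downsample(series, 6, mean)   # the usual default
peaks = downsample(series, 6, max)       # preserves the spike

print(averaged)
print(peaks)
```

With averaging, the 500 spike shrinks to under 100 in the second window; with `max` it survives intact. Which aggregation is right depends on what the metric is used for.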

Counters vs Gauges
Behaviour in low res time window
■ Low res sampling erases fast changes
■ “Round numbers” syndrome
■ Counters smear changes, but don’t erase them
TLDR: use counters when possible
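The difference shows up when a short burst falls between two samples. The traffic pattern and the 10-second sampling interval below are made up for illustration.

```python
# True per-second request rate over 30 seconds, with a 2-second burst.
rate = [5] * 30
rate[13] = 200
rate[14] = 200

INTERVAL = 10  # seconds between samples

# Gauge: read the instantaneous value at each sample time (t = 0, 10, 20).
gauge_samples = [rate[t] for t in range(0, len(rate), INTERVAL)]

# Counter: a running total, read at the end of each interval (t = 9, 19, 29).
running_total = []
total = 0
for r in rate:
    total += r
    running_total.append(total)
counter_samples = [running_total[t] for t in range(INTERVAL - 1, len(rate), INTERVAL)]

# Difference between consecutive counter reads = requests in that interval.
deltas = [b - a for a, b in zip([0] + counter_samples, counter_samples)]

print(gauge_samples)
print(deltas)
```

The gauge reads 5 every time and never sees the burst. The counter deltas show 440 requests in the middle interval: the burst is smeared across 10 seconds, but nothing is lost.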

Mixed modes
Aggregating multiple modes reduces usability of aggregates
■ Different transaction types differ in latencies/sizes
■ Errors, successes have very different latencies/sizes
■ Makes your graphs weird
TLDR: use separate metrics for different things
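A quick illustration of how mixing modes produces an aggregate that describes nothing. The latency values are invented: fast successes and slow errors (for example, timeouts).

```python
from statistics import mean

# Hypothetical latencies in milliseconds.
ok_latencies = [18, 20, 22, 19, 21, 20, 20, 20]
error_latencies = [2000, 2010]

combined = ok_latencies + error_latencies

print(mean(combined))          # one metric for everything
print(mean(ok_latencies))      # separate metric for successes
print(mean(error_latencies))   # separate metric for errors
```

The combined average is 417 ms, yet no request took anywhere near 417 ms. The two separate averages, 20 ms and 2005 ms, each describe their own mode well.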

Visualization
■ Timeframe
■ No more than 3 series
■ Be wary of multiple Y scales, but scale if needed
■ Only related series on the same graph
■ Never mix X scales
■ Visual references: bounds, Y min/max values, legend

Metric design
■ Choose your aggregates wisely
■ Decide on a proper resolution, sampling rate, aggregation time windows
■ Explore the distribution
■ Separate known modes to independent metrics

Separate signal from noise
■ Use low-pass filters to smooth
■ Trend changes
■ Timeshifts
■ Filter out outliers
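Two of these techniques can be sketched in a few lines: a trailing moving average as a simple low-pass filter, and a timeshift comparison against the same series some steps earlier. The function names and sample data are illustrative, and a moving average is only one of many possible low-pass filters.

```python
def moving_average(series, window):
    """Trailing moving average; early points use however many values are available."""
    smoothed = []
    for i in range(len(series)):
        chunk = series[max(0, i - window + 1): i + 1]
        smoothed.append(sum(chunk) / len(chunk))
    return smoothed

def timeshift_delta(series, shift):
    """Difference between each point and the point `shift` steps earlier."""
    return [series[i] - series[i - shift] for i in range(shift, len(series))]

noisy = [10, 12, 8, 11, 9, 10, 12, 8]
smoothed = moving_average(noisy, 4)
print(smoothed)

# Hypothetical series with a period of 3 steps and one anomaly at index 3.
periodic = [1, 2, 3, 10, 2, 3]
print(timeshift_delta(periodic, 3))
```

Smoothing narrows the spread of the noisy series, which makes a trend easier to see. The timeshift difference is zero wherever the series repeats its earlier behaviour and stands out where it does not.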

Working with clusters
■ Most-deviant/outliers
■ Max/Min
■ Sum (capacity)
■ Pre-aggregate percentiles
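A sketch of the first three cluster views, assuming one current value per host. The host names, the values, and the `most_deviant` helper are hypothetical; deviation here is measured as distance from the cluster median, which is one reasonable choice among several.

```python
from statistics import median

def most_deviant(values_by_host, n=1):
    """Return the n hosts whose value is farthest from the cluster median."""
    center = median(values_by_host.values())
    ranked = sorted(values_by_host,
                    key=lambda host: abs(values_by_host[host] - center),
                    reverse=True)
    return ranked[:n]

# Hypothetical requests/sec per host.
rps = {"web01": 20, "web02": 22, "web03": 21, "web04": 95}

print(most_deviant(rps))                    # which host to look at first
print(max(rps.values()), min(rps.values())) # extremes across the cluster
print(sum(rps.values()))                    # total capacity in use
```

With many hosts, plotting only the most deviant ones keeps the graph readable while still surfacing the host that behaves differently.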

Thank You

Questions?
