The presentation will continue from where Benchmarking Best Practices 101 left off, covering the topics of benchmark reproducibility and reporting in more depth.
An experiment is reproducible if external teams can run the same experiment over long periods of time and get comparable results. Clear, concise reporting allows others to utilise benchmark results.
4. Previously, in Benchmarking-101...
● Approach benchmarking as an experiment.
Be scientific.
● Design the experiment in light of your goal.
● Repeatability:
○ Understand and control noise.
○ Use statistical methods to find truth in noise.
5. And we briefly mentioned
● Reproducibility
● Reporting
So let’s talk some more about those.
7. Reproducibility
An experiment is reproducible if external
teams can run the same experiment over long
periods of time and get commensurate
(comparable) results.
Achieved if others can repeat what we did
and get the same results as us, within the
given confidence interval.
8. From Repeatability to Reproducibility
We must log enough information that anyone
else can use that information to repeat our
experiments.
We have achieved reproducibility if they can
get the same results, within the given
confidence interval.
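The "same results, within the given confidence interval" check can be sketched numerically. A minimal sketch, assuming roughly normal noise and a 95% interval of the mean (1.96 × standard error); the sample data is invented for illustration:

```python
# Sketch: decide whether a reproduction run agrees with the original
# "within the given confidence interval". Assumes roughly normal noise
# and a 95% interval of the mean (1.96 * standard error). All numbers
# below are invented for illustration.
from math import sqrt
from statistics import mean, stdev

def ci95(samples):
    """95% confidence interval (low, high) for the mean of samples."""
    half = 1.96 * stdev(samples) / sqrt(len(samples))
    return (mean(samples) - half, mean(samples) + half)

def intervals_overlap(a, b):
    """True if two (low, high) intervals overlap."""
    return a[0] <= b[1] and b[0] <= a[1]

original = [101.2, 99.8, 100.5, 100.9, 99.5]   # seconds per run, team A
repro    = [100.7, 101.1, 100.2, 99.9, 100.6]  # seconds per run, team B

print("reproduced:", intervals_overlap(ci95(original), ci95(repro)))
```

Overlapping intervals are a crude but readable criterion; a real harness might prefer a proper significance test.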
9. Logging: Target
● CPU/SoC/Board
○ Revision, patch level, firmware version…
● Instance of the board
○ Is board 1 really identical to board 2?
● Kernel version and configuration
● Distribution
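Much of this target information can be captured automatically. A minimal sketch using Python's platform module; board revision, firmware and kernel configuration would come from target-specific sources (vendor tools, /proc/config.gz) and are not covered here:

```python
# Sketch: capture basic target details alongside every result set.
# The platform module covers the generic fields; board instance,
# revision and firmware need target-specific collection.
import platform

def describe_target():
    return {
        "machine": platform.machine(),        # e.g. "aarch64"
        "kernel": platform.release(),         # running kernel version
        "platform": platform.platform(),      # full OS/distribution string
        "python": platform.python_version(),  # if the harness matters too
    }

print(describe_target())
```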
11. Logging: Build
● Exact toolchain version
● Exact libraries used
● Exact benchmark source
● Build system (scripts, makefiles etc)
● Full build log
Others should be able to acquire and rebuild all
of these components.
13. Logging: Run-time Environment
● Environment variables
● Command-line options passed to benchmark
● Mitigation measures taken
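These three items can be snapshotted per invocation. A minimal sketch; the mitigations entry is a free-form list the harness fills in by hand (e.g. "governor pinned to performance"), an assumption about how such measures might be recorded:

```python
# Sketch: snapshot the run-time environment next to each benchmark
# invocation. Mitigation notes are free-form strings supplied by
# whoever runs the experiment.
import os

def describe_invocation(argv, mitigations=()):
    return {
        "env": dict(os.environ),        # every environment variable
        "argv": list(argv),             # exact command line used
        "mitigations": list(mitigations),
    }

record = describe_invocation(
    ["./bench", "--size=1024", "--iters=100"],
    mitigations=["governor=performance", "irqbalance stopped"],
)
print(record["argv"])
```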
14. Logging: Other
All of the above may need modification
depending on what is being measured.
● Network-sensitive benchmarks may need
details of network configuration
● IO-sensitive benchmarks may need details
of storage devices
● And so on...
15. Long Term Storage
All results should be stored with information
required for reproducibility
Results should be kept for the long term
● Someone may ask you for some information
● You may want to do some new analysis in
the future
17. Reporting
● Clear, concise reporting allows others to
utilise benchmark results.
● Does not have to include all data required for
reproducibility.
● But that data should be available.
● Do not assume too much reader knowledge.
○ Err on the side of over-explanation
18. Reporting: Goal
Explain the goal of the experiment
● What decision will it help you to make?
● What improvement will it allow you to
deliver?
Explain the question that the experiment asks
Explain how the answer to that question helps
you to achieve the goal
19. Reporting
● Method: Sufficient high-level detail
○ Target, toolchain, build options, source, mitigation
● Limitations: Acknowledge and justify
○ What are the consequences for this experiment?
● Results: Discuss in context of goal
○ Co-locate data, graphs, discussion
○ Include units - numbers without units are useless
○ Include statistical data
○ Use the benchmark’s metrics
20. Presentation of Results
Graphs are always useful
Tables of raw data are also useful
Statistical context essential:
● Number of runs
● Which mean was used (arithmetic, geometric, …)
● Standard deviation
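That statistical context is cheap to compute. A minimal sketch of the summary a report should carry; the timings and units (seconds) are invented for the example:

```python
# Sketch: the minimum statistical context for a reported result --
# number of runs, which mean, and the standard deviation, with units.
from statistics import mean, stdev

runs_s = [12.1, 11.9, 12.4, 12.0, 12.2]  # wall-clock seconds per run

summary = {
    "runs": len(runs_s),
    "mean_s": round(mean(runs_s), 3),    # arithmetic mean, seconds
    "stdev_s": round(stdev(runs_s), 3),  # sample standard deviation, seconds
}
print(summary)
```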
21. Experimental Conditions
Precisely what to report depends on what is
relevant to the results
The following are guidelines
Of course, all the environmental data should be
logged and therefore available on request
22. Include
Highlight key information, even if it could be
derived, including:
● All toolchain options
● Noise mitigation measures
● Testing domain
● E.g. for a memory-sensitive benchmark, report
bus speed and cache hierarchy
23. Leave Out
Everything not essential to the main point
● Environment variables
● Build logs
● Firmware
● ...
All of this information should be available to be
provided on request.
38. Summary
● Log everything, in detail
● Be clear about:
○ What the goal of your experiment is
○ What your method is, and how it achieves your
purpose
● Present results
○ Unambiguously
○ With statistical context
● Relate results to your goal