High Performance Haskell

Composewell Technologies
Harendra Kumar

15 Dec 2018

Harendra Kumar
‣More than a decade of systems programming in C

‣Writing Haskell for the last three years

‣Currently focusing on streamly, an ambitious
project that aims to make programming practical
systems in Haskell a joy and to ensure C-like high
performance.

Haskell Performance
‣Can easily be off by 10x or 100x from the best

‣Refactoring can easily affect performance

‣You cannot be confident unless you measure

‣Best practices can easily get you in the ballpark

‣Squeezing the last drop may be harder

‣With some effort, can get close to C or even better

Unicode Normalization
A case study
‣Challenge: can we do Unicode normalization as fast as or
faster than the best C++ library (ICU)?

The Problem
‣A unicode character may have multiple forms
(composed/decomposed).
‣ Åström (U+00C5 U+0073 U+0074 U+0072 U+00F6 U+006D)

‣ Åström (U+0041 U+030A U+0073 U+0074 U+0072
U+006F U+0308 U+006D)

‣To compare strings we need to bring them to a common
normal form (e.g. NFC/NFD).

Normalized Form
Decomposed (NFD)
‣Sequence of chars:

‣Starter,Starter,Combining1,Combining2…Starter,Combining1…
‣Lookup character:

‣has decomposition?

‣replace with its components

‣Lookup combining class:

‣0 => Starter, Non-zero => combining

‣Reorder multiple combining chars as per combining class
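
The steps above can be sketched in the naive list-based style. This is a minimal illustration, not the unicode-transforms code: the two lookup tables are toy stand-ins holding only a handful of entries from the Unicode Character Database, and the names decompositionOf, decomposeChar, combiningClass, reorder and normalizeNFD are made up for this example. Hangul is ignored here.

```haskell
import Data.List (sortBy)
import Data.Ord (comparing)

-- Toy decomposition table (assumption: four entries only, for illustration).
decompositionOf :: Char -> Maybe String
decompositionOf c = case c of
    '\x00C5' -> Just "A\x030A"       -- A with ring above
    '\x00F6' -> Just "o\x0308"       -- o with diaeresis
    '\x1E69' -> Just "\x1E63\x0307"  -- decomposes further via U+1E63
    '\x1E63' -> Just "s\x0323"       -- s with dot below
    _        -> Nothing

-- Decomposition is recursive: components may themselves decompose.
decomposeChar :: Char -> String
decomposeChar c = maybe [c] (concatMap decomposeChar) (decompositionOf c)

-- Toy combining class table: 0 means starter, non-zero means combining.
combiningClass :: Char -> Int
combiningClass c = case c of
    '\x0323' -> 220
    '\x0307' -> 230
    '\x0308' -> 230
    '\x030A' -> 230
    _        -> 0

-- Stable-sort each run of combining chars by combining class.
reorder :: String -> String
reorder [] = []
reorder s@(c : cs)
    | combiningClass c == 0 = c : reorder cs
    | otherwise =
        let (marks, rest) = span ((/= 0) . combiningClass) s
        in sortBy (comparing combiningClass) marks ++ reorder rest

normalizeNFD :: String -> String
normalizeNFD = reorder . concatMap decomposeChar
```

With these toy tables, both spellings of Åström from the earlier slide normalize to the same decomposed string, which is what makes them comparable.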

Unicode Character
Database
‣Lookup maps:

‣Decomposition map, ~2000 entries

‣Combining class map ~1000 entries

‣Algorithmic decomposition of Hangul characters

Naive, Elegant Code
‣Normalization in ~50 lines of core code

‣Use IntMap for database lookup

‣Use Haskell lists for processing

‣Idiomatic code

Naive Implementation
Performance (C++/Haskell)

Use Pattern Match for
Lookup

IntMap vs Pattern Match

Fast Path Decomposition
Lookup

Decomposition
‣Decomposition is recursive

‣Use simple recursion instead of iterate, zip with
tail idioms to decompose recursively.

Fast Path Reordering
‣Original code:

‣split into groups, sortBy combining class (CC)

‣“SCCSCC” => [(S,0), (C,10), (C,11)], [(S,0), (C,5), (C,6)]

‣Optimized code: use custom sorting when the sort group
size is 1 or 2, and fall back to the regular list sort for
the rest.

‣Use a bitmap for a quick combining/non-combining
check; non-combining is the common case.
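
A minimal sketch of the small-group special case, assuming each group is a list of (char, combining class) pairs as in the example above. The name sortGroup is hypothetical; the point is only that groups of size 0, 1 or 2 never reach the general sort.

```haskell
import Data.List (sortBy)
import Data.Ord (comparing)

-- Sort one group by combining class, keeping the order of equal classes.
sortGroup :: [(Char, Int)] -> [(Char, Int)]
sortGroup xs = case xs of
    []  -> xs                       -- nothing to do
    [_] -> xs                       -- common case: a lone character
    [a, b]
        | snd a <= snd b -> xs      -- already in order (stable for ties)
        | otherwise      -> [b, a]  -- a single swap
    _ -> sortBy (comparing snd) xs  -- rare case: general stable sort
```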

Monolithic Decompose and
Reorder
‣ Original code: reorder . decompose

‣ Optimized code: decomposeAndReorder reorderBuffer
‣ In the common case the buffer has just one char and it gets flushed
when we get the next char.

‣ We need to sort the buffer only when there is more than one
combining char in the buffer.

‣ Use custom sorting for 2 char sorting case.

‣ Do not use string append for reorder buffer, manually deconstruct
and reconstruct the list for short common cases. (10% improvement)

Hangul Jamo Normalization
‣Use algorithmic decomposition as prescribed by the Unicode standard,
instead of a simple lookup-based approach.

‣Mark the Hangul Jamo case NOINLINE; it is not on the fast path

‣Use quot/rem instead of div/mod

‣Use quotRem instead of separate quot/rem calls

‣Use unsafeChr instead of chr

‣Use strict values in list buffers

‣Use tuples instead of lists for returning short buffers

‣Localize recursion to non-hangul case
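
A sketch of the algorithmic decomposition, using the constants given in the Unicode standard for precomposed Hangul syllables. The function names are illustrative, and the sketch uses chr for safety where the slide suggests unsafeChr. It shows quotRem and a tuple result in place of a list.

```haskell
import Data.Char (chr, ord)

-- Constants from the Unicode standard's Hangul syllable decomposition.
sBase, lBase, vBase, tBase, vCount, tCount, nCount, sCount :: Int
sBase  = 0xAC00
lBase  = 0x1100
vBase  = 0x1161
tBase  = 0x11A7
vCount = 21
tCount = 28
nCount = vCount * tCount   -- 588
sCount = 19 * nCount       -- 11172

isHangulSyllable :: Char -> Bool
isHangulSyllable c = i >= 0 && i < sCount
  where i = ord c - sBase

-- Precondition: isHangulSyllable c.
-- Returns leading consonant, vowel and an optional trailing consonant.
decomposeHangul :: Char -> (Char, Char, Maybe Char)
decomposeHangul c = (l, v, t)
  where
    sIndex           = ord c - sBase
    (lIndex, r)      = sIndex `quotRem` nCount  -- one division, two results
    (vIndex, tIndex) = r `quotRem` tCount
    l = chr (lBase + lIndex)
    v = chr (vBase + vIndex)
    t | tIndex == 0 = Nothing
      | otherwise   = Just (chr (tBase + tIndex))
```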

Where are we?
(C++/Haskell/C)

Can we do better?
‣Remember we are still using plain Haskell strings! Let’s
do some minimal experiments to test the limits:

stringOp = map (chr . (+ 1) . ord)              17 ms
textOp   = T.map (chr . (+ 1) . ord)            11 ms
textOp   = T.unstream . T.stream                4.0 ms
ICU English normalization                       2.7 ms
Fixed Text unstream code
(NOINLINE realloc code)                         1.3 ms

Let’s Apply This
‣Use Text with stream/unstream instead of strings

‣Conditional branch readjustments, for fast path.
‣Inlining
‣INLINE the isCombining check (+16%)

‣Add NOINLINE to slow path code

‣-funbox-strict-fields

Optimize Reorder Buffer
‣Instead of a list, use a custom data type optimized for
fast path cases:

data Buffer = Empty | One {-# UNPACK #-} !Char | Many [Char]
‣Use a mutable reorder buffer

‣ + 5%
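
A sketch of how such a Buffer type might be used, assuming the buffer holds only pending combining chars and is flushed when the next starter arrives. The helpers push, flushOnto, reorderMarks and the toyClass table are hypothetical and only illustrate the shape of the fast path; this is the immutable variant, not the mutable buffer.

```haskell
import Data.List (sortBy)
import Data.Ord (comparing)

data Buffer = Empty | One {-# UNPACK #-} !Char | Many [Char]

-- Toy combining class table (assumption: two marks only).
toyClass :: Char -> Int
toyClass c = case c of
    '\x0323' -> 220
    '\x0307' -> 230
    _        -> 0

-- Many keeps its chars in reverse order so that push is O(1).
push :: Buffer -> Char -> Buffer
push Empty     c = One c
push (One x)   c = Many [c, x]
push (Many xs) c = Many (c : xs)

-- Emit the buffered chars in front of the remaining output.
flushOnto :: Buffer -> String -> String
flushOnto Empty     rest = rest        -- fast path: nothing buffered
flushOnto (One c)   rest = c : rest    -- fast path: no sort needed
flushOnto (Many cs) rest = sortBy (comparing toyClass) (reverse cs) ++ rest

reorderMarks :: String -> String
reorderMarks = go Empty
  where
    go buf []       = flushOnto buf []
    go buf (c : cs)
        | toyClass c == 0 = flushOnto buf (c : go Empty cs)
        | otherwise       = go (push buf c) cs
```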

Where are we now?
(C++/Haskell)

Use llvm backend (+10%)

We can do better
‣We can use non-decomposable starter lookup for
fast path. It will cut common case lookups by half.

‣We have not tried hash lookup

‣ICU C++ library uses unicode quick check properties for
optimization, we can also do the same to further optimize
at algorithmic level.

‣Code generation by GHC can possibly be improved. I
raised a couple of tickets about it.

Lessons
‣Using Haskell we can write concise code with acceptable
performance quickly.

‣The code can be optimized to perform as well as C

‣Most of the optimizations we made were algorithmic and
logic related rather than language related. They were mostly
custom handling of the fast path.

‣The most common language-related optimizations are
INLINE annotations. The others are mostly of the last-drop
squeezing kind.

Ground Rules
‣ MEASURE, define proper benchmarks

‣ ANALYZE, benchmarks may be wrong

‣ OPTIMIZE

‣ Algorithmic optimization first

‣ Biggest gain first

‣ Optimize where it matters (fast path)

‣ DEBUG

‣ Narrow down by incremental elimination

‣ Narrow down by incremental addition

‣ RATCHET, don’t lose the hard work spent in discovering issues

The three musketeers
1. INLINE

2. SPECIALIZE

3. STRICTIFY

Inlining
‣Instead of making a function call, expand the definition of
a function at the call site.

Inlining
(Definition Site)
‣For inlining or specialization to occur in another module the
original RHS of a function must be recorded in the interface file (.hi).

‣By default GHC may or may not choose to keep the original RHS
in the interface file.

‣INLINABLE => direct the compiler to record the original RHS of
the function in interface file (.hi)

‣INLINE => Like INLINABLE, but also direct the compiler to
actually inline the function at all call sites.

‣-fexpose-all-unfoldings is a way to mark everything INLINABLE

Inlining
(Call Site)
‣Prerequisite: function’s original RHS must be available in
the interface file.

‣If the function was marked INLINE at the definition site,
then unconditionally inline it.

‣If the function was not marked INLINE, then the inline
function (from GHC.Exts) can be used to ask the compiler to
inline it unconditionally at that call site.

‣Otherwise, GHC decides whether to inline or not. See the
-funfolding-* and -fmax-inline-* options to control this.
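
A small sketch of call-site inlining with the inline function from GHC.Exts. The functions clamp and clampByte are hypothetical; clamp carries no pragma, and only the one call inside clampByte asks for inlining.

```haskell
import GHC.Exts (inline)

-- No INLINE pragma here: GHC decides per call site by default.
clamp :: Int -> Int -> Int -> Int
clamp lo hi x = max lo (min hi x)

-- Request inlining of clamp at this particular call only.
clampByte :: Int -> Int
clampByte x = inline clamp 0 255 x
```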

When inlining cannot occur
‣Function is not fully applied

‣The function is passed as an argument to a function which
itself is not inlined.

‣Function is self recursive

‣For mutually recursive functions GHC tries not to use a
function with INLINE pragma as a loop breaker.

When an INLINE is missing
func :: String -> Stream IO Int -> Benchmark
func name f = bench name $ nfIO $ S.mapM_ (\_ -> return ()) f
• Without an INLINE on func: 50 ms; with INLINE: 500 µs, i.e. 100x faster.

• Without marking func inline, f cannot be inlined and cannot fuse with
mapM_. So we need an INLINE on both func as well as f.

• Code depending on fusion is especially sensitive to inlining, because
fusion depends on inlining.

• CPS code is less sensitive to inlining. Direct style code may
perform much worse than CPS when an INLINE goes
missing. However, it can be much faster than CPS with proper inlining.

NOINLINE for better
performance!
• Many people find this counterintuitive, and even the GHC
manual says you should never need it, but it is pretty
common to get modest perf gains by using NOINLINE.

• Moving the slow path branch out of the way into a separate
function marked NOINLINE helps the fast path branch
execute more efficiently.

• The noinline function can also be used to avoid inlining a
particular call.
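
A sketch of the slow-path split, loosely modelled on a combining class lookup. The names classify and classifySlow and the two-entry table are illustrative. The cheap range check relies on the fact that no character below U+0300 has a non-zero combining class.

```haskell
import Data.Char (ord)

-- Fast path: small enough to inline everywhere.
{-# INLINE classify #-}
classify :: Char -> Int
classify c
    | ord c < 0x0300 = 0             -- common case: certainly a starter
    | otherwise      = classifySlow c

-- Slow path: kept out of line so it does not bloat the callers.
{-# NOINLINE classifySlow #-}
classifySlow :: Char -> Int
classifySlow c = case c of           -- toy table (assumption: two entries)
    '\x0323' -> 220
    '\x0307' -> 230
    _        -> 0
```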

Specializing
‣Instead of calling a polymorphic version of a function,
make a copy, specialized to less polymorphic types.

{-# SPECIALIZE consM :: IO a -> Stream IO a -> Stream IO a #-}
consM :: Monad m => m a -> Stream m a -> Stream m a
consM = consMSerial

Specializing
(Definition Site)
‣INLINABLE => direct the compiler to record the original
RHS of the function in interface file (.hi). The function can
then be specialized where it is imported using
SPECIALIZE.

‣SPECIALIZE => direct the compiler to specialize a
function at the given type and use that version wherever
applicable.

‣SPECIALIZE instance => direct the compiler to
specialize a type class instance at the given type.
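
A self-contained sketch of the definition-site pragmas on a made-up function, sumM. The INLINABLE pragma keeps the RHS available to importing modules, and the SPECIALIZE pragma asks for a copy fixed to IO and Int.

```haskell
{-# INLINABLE sumM #-}
{-# SPECIALIZE sumM :: [Int] -> IO Int #-}
sumM :: (Monad m, Num a) => [a] -> m a
sumM = go 0
  where
    go acc []       = return acc
    go acc (x : xs) =
        let acc' = acc + x
        in acc' `seq` go acc' xs   -- keep the accumulator evaluated
```

In an importing module, a further SPECIALIZE pragma naming sumM at another type would work only because the RHS was recorded in the interface file.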

Specializing
(Call Site)
‣Prerequisite: function’s original RHS must be available in
the interface file. INLINE or INLINABLE can be used to
ensure that.

‣SPECIALIZE => direct the compiler to specialize an
imported function at the given type for this module.

‣For all local functions or imported functions that have their
RHS available in the interface file, GHC may automatically
specialize them. See -fspecialise-aggressively
too.

Call Pattern Specialization
(Recursive Functions)
‣GHC option -fspec-constr specializes a recursive
function for different constructor cases of its argument.

‣Use SPEC and a strict argument to a function to direct the
compiler to perform spec-constr aggressively.
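
A sketch of the SPEC idiom on a toy loop. It assumes SPEC is imported from GHC.Types (the ghc-prim package); the function sumTo and its Maybe-encoded loop state are made up for illustration. With optimization on, the strict SPEC argument tells GHC to specialize go for the Just and Nothing cases regardless of its usual size limits.

```haskell
{-# LANGUAGE BangPatterns #-}
import GHC.Types (SPEC(..))

-- Sum 1..n with the loop state wrapped in a constructor, the kind of
-- argument that call pattern specialization can unwrap.
sumTo :: Int -> Int
sumTo n = go SPEC 0 (Just 1)
  where
    go :: SPEC -> Int -> Maybe Int -> Int
    go !_ !acc Nothing = acc
    go !_ !acc (Just i)
        | i > n     = go SPEC acc Nothing
        | otherwise = go SPEC (acc + i) (Just (i + 1))
```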

When specialization cannot
occur
‣Function is not fully applied (unsaturated calls)

‣Function calls other functions which cannot be
specialized.

‣Function uses polymorphic recursion

‣The -Wmissed-specialisations and -Wall-missed-specialisations
GHC options can be useful.

Strictness
• Do not keep lazy expressions in memory if they are going to be
reduced eventually anyway; reduce them as soon as possible.

• Holding on to unevaluated expressions may be inefficient, may consume
more memory and, more importantly, makes GC expensive.

• As a general rule be lazy for construction and transformation and
be strict for reduction. Laziness helps when you are processing
something, strictness helps when you are storing or buffering.

• Use strict accumulator for strict left folds.

• Use strict record fields for records used for buffered storage.
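
A minimal sketch of a strict reduction: a left fold whose accumulator is a record with strict fields, so no chain of thunks builds up while the list is consumed. The Acc type and the mean function are made up for this example.

```haskell
import Data.List (foldl')

-- Strict fields: both components are evaluated whenever an Acc is built.
data Acc = Acc !Int !Int   -- running sum and element count

mean :: [Int] -> Double
mean xs
    | n == 0    = 0
    | otherwise = fromIntegral s / fromIntegral n
  where
    Acc s n = foldl' step (Acc 0 0) xs
    step (Acc s' n') x = Acc (s' + x) (n' + 1)
```

Note that foldl' alone only evaluates the accumulator to its outermost constructor; the strict fields are what keep the sum and the count evaluated.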

Strictify and Unbox
• BangPatterns can be used to mark function arguments strict, and
strictness annotations (!) to mark constructor fields strict, i.e.
reduced when the function or constructor is applied.

• Strict function application $!
• Use the UNPACK pragma to keep constructor fields unboxed.

• -funbox-strict-fields is often useful
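
The three tools side by side, on made-up definitions (Span, mkSpan, spanEnd and lengthStrict are illustrative names). UNPACK takes effect only when compiling with optimization.

```haskell
{-# LANGUAGE BangPatterns #-}

-- Strict, unpacked fields: the two Ints are stored directly in the
-- constructor instead of behind pointers.
data Span = Span {-# UNPACK #-} !Int {-# UNPACK #-} !Int

-- Strict application: the end offset is evaluated before Span is applied.
mkSpan :: Int -> Int -> Span
mkSpan start len = Span start $! start + len

spanEnd :: Span -> Int
spanEnd (Span _ end) = end

-- Bang pattern: the counter is forced on every iteration.
lengthStrict :: [a] -> Int
lengthStrict = go 0
  where
    go !n []       = n
    go !n (_ : xs) = go (n + 1) xs
```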

Measurement
Focus on tests in C, benchmarks in Haskell

Benchmarking Tools
• gauge vs criterion

• Faced several benchmarking issues during streamly
and streaming-benchmarks development

• Made significant improvements to gauge to address the
issues.

• Wrote the bench-show package for robust analysis,
comparison and presentation of benchmarks

Benchmarking Pitfalls
• Benchmarking code needs to be optimized exactly the way
you would optimize the code being benchmarked.

• A missing INLINE in benchmarking code could cause a
huge difference, invalidating the results.

• Benchmarking relies on rnf implementation, if that itself
is slow (e.g. not marked INLINE) then we may get false
results. We encountered this problem at least once.

• Multiple benchmarks can interfere with each other in ways
you may not be able to detect easily.

Benchmarking Pitfalls
• You may be measuring the cost of doing nothing, even
with nfIO. We generate a random number in IO and pass
it to the computation being benchmarked to avoid the
issue.

• When measuring with nf f arg, remember we are
measuring f and not arg. arg may get evaluated once
and reused.

Gauge Improvements
• Run each benchmark in isolation, in a separate process. This
is a brute-force way to ensure that there is no interference from
other benchmarks. Correct maxrss measurement requires
this.

• Several correctness fixes to measure stats accurately.

• Use getrusage to report many other stats like maxrss, page
faults and context switches. maxrss is especially
useful to get peak memory consumption data.

• Added a --quick mode to run benchmarks quickly (10x faster)

Gauge Improvements
• Provides raw data for each iteration in a CSV file, for
external analysis. This is used by bench-show.

• Better control over measurement process from the CLI

• nfAppIO and whnfAppIO for more reliable
measurements. Contributed by rubenpieters.

Analyzing and Comparing
Performance
(bench-show)

Benchmarking Business
• streamly is a high performance monadic streaming
framework generalizing lists to monads with inherent
concurrency support.

• When a single missing INLINE can degrade performance by 100x,
how do we guarantee performance?

• Measure everything. We have hundreds of benchmarks,
each and every op is benchmarked.

• With such a large number of benchmarks, how do we
analyze the benchmarking output?

Enter bench-show
• Analyses the results using 3 statistical estimators - linear
regression, median and mean
• Finds the difference between two runs and reports the min of
the 3 estimators

• Computes the percentage regression or improvement

• Sorts and reports by the highest regression, time as well as
space.

• We can automatically report regressions on each commit, by
using a threshold.

Reporting Regressions
(% Diff)

Reporting Regressions
(Absolute Delta)

Comparing Packages
• bench-show can group benchmarks arbitrarily and
compare the groups.

• streaming-benchmarks package uses this to compare
various streaming libraries.

Monadic Streaming

Pure Streaming (Time)

Pure Streaming (Space)

References
• https://github.com/composewell/streamly

• https://github.com/composewell/streaming-benchmarks

• https://github.com/composewell/bench-show

• https://github.com/vincenthz/hs-gauge

• https://github.com/composewell/unicode-transforms

Thank You
harendra.kumar@gmail.com
@hk_hooda
