1. Microseconds matter in High Frequency
Trading
High performance trading systems in C++
Ravi Parikh TWO ROADS TRADING PVT LTD
(http://tworoads-trading.co.in/)
2. 2
Introduction
About me :
●
HFT infra developer for TWO ROADS TRADING since 2011
●
Overall close to 9 years of experience in software development
Today’s talk :
●
General Software Development vs HFT software development
●
Overview of HFT trading and why does speed matter a lot ?
●
Importance of Correctness / Robustness of the HFT systems
●
A few Techniques for C++ Optimizations for ultra low latency software development
●
Noisy neighbors
●
Measurements of performance
5. 5
HFT Trading
➔
Trading in general is about buying something and selling it, which can result in a profit or
a loss depending on the prices at which it was bought & sold.
➔
HFT is largely about market making: there is no genuine intention of buying / selling.
HFT firms aren’t speculators, they are there to provide liquidity to the market
➔
HFT makes money from very small profitable trades executed at very high frequency
( the holding time period for any open trade is very small )
➔
The other main objective is to avoid taking bad trades which can result in larger losses.
➔
So what is the role of an ultra low latency system in HFT ? It’s about spotting the
opportunities for those quick, small, profitable trades and grabbing them. At the same time
it’s about pulling out in time to avoid taking larger losses ( after all, you’ll always have
very stiff competition fighting for the same trades, given that markets are becoming more
and more efficient with each passing day )
6. 6
Role Of Latency In Grabbing The
Opportunity
- Against all odds only the fastest few will be able to book tickets successfully !!
Source : Internet
7. 7
Role of Latency In Pulling Out ! ( Avoid
bad trades )
- It’s equally important to pull out of a bad trade before someone hits you with the fill
( these are the trades where you were slow to change the price and are now forced to
take a trade which you know is a bad one, so speed matters even when you want to
avoid making a loss )
Source : Two Players Org
8. 8
So How Fast Is Fast Enough ?
- It doesn’t matter whether you’re faster by 1 sec or 1 nanosec, as long as you’re ahead of
everyone else ( unfortunately, in the HFT domain there are in most cases no silver & bronze
rewards: it’s gold or nothing, or even worse, a loss )
Source : Photo By Alvin Loke
Source : Two Players Org
10. 10
Robustness ??
●
There is always a trade-off between adding that extra if check and saving
a few CPU cycles, so robustness and optimization don’t always go well together.
●
Even though it is not put forward as the most critical feature of the system,
ahead of speed, robustness can never be compromised in HFT
●
An opportunity to make a 1 Rs profit on a buy / sell every 5 micros ( the latency
of the system ) doesn’t guarantee that profit, because we will never be able to
capture every opportunity. However, a BUG in the system ( however trivial, say
buy/sell flipped ! ) can guarantee a loss on every one of those trades:
60 s / 5 micros = 12 million trades, i.e. a loss of 12 Million INR in ONE MINUTE
●
So in HFT infra development, safety always comes first. One has to be 120% sure
that there are no bugs in the system which will run in Production, because all it
may take is a few seconds / minutes of a buggy run to make headlines
the next day.
●
So, keeping in mind that you can’t compromise on robustness, making
an application run faster becomes even more challenging and interesting.
11. 11
Optimizations ( Prerequisites )
●
Hardware selection ( CPU / RAM / CACHE )
●
Network selection ( Switches / Network Adapters )
●
Understanding of OS/Platform ( OS version, OS / kernel features, OS memory
management, Interrupts Management etc )
●
Programming Language Selection – Why C++ ?
●
Compiler / Linker ( Compiler features / compiling options / type of compiler etc )
●
External libraries ( Dependencies / Features )
●
Various Tools For Debugging / Profiling ( GDB, valgrind, cachegrind, gprof etc )
It is simply not possible to improve T2T ( tick-to-trade latency ) in HFT, even with the
logically most optimized C++ code, unless one understands the environment
in which that C++ code is eventually going to run and interact.
12. 12
External Optimizations ( Hardware )
●
CPU Processor
●
RAM
●
Different Types of Cache and Cache Sizes
●
How do you pick the correct combination ?
●
Network Adapter ( Kernel Bypassing )
●
OS Tuning ( Context Switches ? Interrupts Binding ? )
14. 14
C++ Coding Optimizations
A Few Techniques That We’ll Talk About :
●
Where do we start ? What is the hot path ?
●
Logging is essential isn’t it ? What do we do then ?
●
Dynamic Memory Management ( New / Delete )
●
Data Binding
●
Strings
●
Inline ( always_inline / noinline )
●
Branching ( What are the issues ? )
Disclaimer : I’ve not covered all typical C++ optimizations, it’s just a few quick techniques
which can make a significant difference to the performance.
15. 15
Where to start ? Hot Path
●
The “hotpath” is the full path through which execution flows to perform
the actual end transaction. In HFT it’s the T2T path
●
The “hotpath” is only exercised 0.01% of the time – the rest of the time the
system is idle or doing administrative work or is waiting for events
●
OS, Networks and Hardware are focused on throughput and fairness
●
Jitter is totally unacceptable. It is the major source of bad trades and
forces one to move to a total hardware solution, even though the median number
might actually get worse
19. 19
Logging
●
Almost all production systems will need to log some important data
●
Disk I/O is the worst of all hardware operations in terms of performance
●
If your C++ code logs too much, it spends most of its time doing disk I/O and
consuming CPU on unproductive work. First try to minimize logging as far as
possible, move it out of the hotpath, use compressed forms of data etc
What are other options ?
20. 20
Offload Logging
●
Move logging to custom handles rather than std::cout / std::cerr / printf, and introduce
buffering on your handle ( i.e. create a buffer of, say, 1024 bytes and only flush it when
required )
●
Standard streams are mostly buffered as well ( std::cerr is not ), but with custom handles
we can better control when to flush and can design them to suit the type of logging we
have
●
Get rid of logging from your production process completely to eliminate jitter. One can
write the required information in some compact format to, say, an MQ / SHM, from where
a completely separate process writes it to log files. This will improve the latency
of the production system significantly.
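The hand-off described above can be sketched as a single-producer / single-consumer ring buffer. This is only an illustration: the `LogRecord` layout, `SpscLogQueue` and `drain_to_text` are hypothetical names, and the queue here lives in ordinary process memory, whereas the slide suggests an MQ or shared memory segment read by a separate process. The point it shows is that the hotpath only copies a few raw numbers, while text formatting and I/O happen on the consumer side.

```cpp
#include <array>
#include <atomic>
#include <cstddef>
#include <cstdint>
#include <string>

// Hypothetical binary log record: the hotpath stores raw numbers only and
// leaves text formatting to the consumer.
struct LogRecord {
    std::uint64_t timestamp_ns;
    std::uint32_t event_id;
    std::int64_t  value;
};

// Single-producer / single-consumer ring buffer (capacity: a power of two).
template <std::size_t Capacity>
class SpscLogQueue {
    static_assert(Capacity != 0 && (Capacity & (Capacity - 1)) == 0,
                  "Capacity must be a power of two");
public:
    // Hotpath side: no allocation, no I/O, no locks.
    // Returns false (the record is dropped) when the queue is full.
    bool try_push(const LogRecord& rec) {
        const std::size_t head = head_.load(std::memory_order_relaxed);
        const std::size_t tail = tail_.load(std::memory_order_acquire);
        if (head - tail == Capacity) return false;
        slots_[head & (Capacity - 1)] = rec;
        head_.store(head + 1, std::memory_order_release);
        return true;
    }

    // Logger side: called off the hotpath. Returns false when empty.
    bool try_pop(LogRecord& out) {
        const std::size_t tail = tail_.load(std::memory_order_relaxed);
        const std::size_t head = head_.load(std::memory_order_acquire);
        if (tail == head) return false;
        out = slots_[tail & (Capacity - 1)];
        tail_.store(tail + 1, std::memory_order_release);
        return true;
    }

private:
    std::array<LogRecord, Capacity> slots_{};
    std::atomic<std::size_t> head_{0};  // next slot to write (producer-owned)
    std::atomic<std::size_t> tail_{0};  // next slot to read (consumer-owned)
};

// Logger side: drain everything currently queued and format it as text.
template <std::size_t Capacity>
std::string drain_to_text(SpscLogQueue<Capacity>& queue) {
    std::string text;
    LogRecord rec;
    while (queue.try_pop(rec)) {
        text += std::to_string(rec.timestamp_ns) + " event=" +
                std::to_string(rec.event_id) + " value=" +
                std::to_string(rec.value) + "\n";
    }
    return text;
}
```

In this sketch a full queue drops the record rather than blocking, on the assumption that a lost log line is preferable to a stalled hotpath. Whether that is acceptable depends on what is being logged.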
21. 21
Dynamic Memory Management
●
There will always be cases where a Production system makes use of heap memory and
creates objects on the fly ( with new and delete )
●
If your C++ code makes use of new / delete / malloc etc, what are the issues in
terms of latency ?
What are the alternatives to improve ?
22. 22
Memory Pool
●
New / delete are not cheap: they go through the C library allocator ( malloc / free ), which
may in turn make system calls into kernel space ( brk / mmap ) to grow the heap
●
The free code in glibc alone is around 400 lines of bookkeeping, which eats up a lot of CPU
cycles
●
The solution here is to develop your own C++ class which takes care of memory
management for the duration of your program. We can allocate a pool of objects up front
and, instead of using new / delete, use this class to acquire / release objects.
This way we avoid allocator and kernel space execution on the hotpath and improve
latency as well as jitter
●
Another bonus advantage is that we will reuse recently used objects very frequently,
which improves cache performance.
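A minimal sketch of such a pool is shown below, assuming a fixed capacity chosen at start-up and single-threaded use. `ObjectPool` and `Order` are hypothetical names, not taken from the talk. All storage is allocated once in the constructor, and the free list is LIFO so that the most recently released ( and most likely still cached ) slot is handed out next.

```cpp
#include <cstddef>
#include <new>
#include <utility>
#include <vector>

// Hypothetical object type used only for illustration.
struct Order {
    Order(int id_, double price_) : id(id_), price(price_) {}
    int    id;
    double price;
};

// Fixed-capacity object pool: acquire / release only push and pop a free
// list, so the hotpath never reaches the general-purpose allocator.
template <typename T>
class ObjectPool {
public:
    explicit ObjectPool(std::size_t capacity) : storage_(capacity) {
        free_list_.reserve(capacity);
        // Push in reverse so that the first acquire() hands out slot 0.
        for (std::size_t i = capacity; i > 0; --i) {
            free_list_.push_back(&storage_[i - 1]);
        }
    }

    ObjectPool(const ObjectPool&) = delete;
    ObjectPool& operator=(const ObjectPool&) = delete;

    // Constructs a T in a free slot. Returns nullptr when the pool is empty.
    template <typename... Args>
    T* acquire(Args&&... args) {
        if (free_list_.empty()) return nullptr;
        Slot* slot = free_list_.back();
        free_list_.pop_back();
        return ::new (static_cast<void*>(slot->bytes)) T(std::forward<Args>(args)...);
    }

    // Destroys the object and returns its slot to the pool.
    // obj must have come from acquire() on this pool.
    void release(T* obj) {
        obj->~T();
        free_list_.push_back(reinterpret_cast<Slot*>(obj));
    }

    std::size_t available() const { return free_list_.size(); }

private:
    struct Slot {
        alignas(T) unsigned char bytes[sizeof(T)];
    };
    std::vector<Slot>  storage_;    // allocated once, in the constructor
    std::vector<Slot*> free_list_;  // LIFO: last released is first reused
};
```

Returning nullptr on exhaustion keeps the failure visible to the caller. A production pool would also have to decide how to size itself and what to do when it runs dry, which this sketch leaves open.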
23. 23
Data Binding
●
How many bytes are read when some_function is called ?
●
What is the problem with data access here ?
How do we fix the issue here ?
24. 24
Cache Binding / Cache Line Usage
●
Binding related data closely together in memory improves cache access
●
In this case, you’d get access to the other member variables of the argument at zero
cost, since they arrive in the same cache line
●
You can design your code in a way to optimize the usage of cache lines
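As an illustration of the idea ( the code from the original slide is not available, so the structs below are hypothetical ), compare a layout where the hot fields are separated by cold data with one where they are packed into a single aligned cache line. The 64 byte line size is an assumption that holds on typical current x86-64 CPUs.

```cpp
#include <cstddef>
#include <cstdint>

// Assumed cache line size: 64 bytes is typical on current x86-64 CPUs.
constexpr std::size_t kCacheLine = 64;

// Hypothetical layout: the three hot fields are separated by cold data, so
// reading all three touches three different cache lines.
struct QuoteScattered {
    std::int64_t bid_price;         // hot
    char         description[120];  // cold
    std::int64_t ask_price;         // hot
    char         notes[120];        // cold
    std::int64_t last_size;         // hot
};

// Same hot fields packed together and aligned to a cache line boundary:
// loading one of them brings in the other two for free.
struct alignas(kCacheLine) QuoteHot {
    std::int64_t bid_price;
    std::int64_t ask_price;
    std::int64_t last_size;
};

// Cold data lives in its own struct so it never shares a line with hot data.
struct QuoteCold {
    char description[120];
    char notes[120];
};

// Index of the cache line that a byte offset falls into, for an object that
// itself starts on a cache line boundary.
constexpr std::size_t line_of(std::size_t offset) { return offset / kCacheLine; }

static_assert(sizeof(QuoteHot) == kCacheLine, "hot data occupies exactly one line");
static_assert(line_of(offsetof(QuoteHot, last_size)) == 0, "all hot fields in line 0");
```

The static_asserts document the intent and fail the build if someone later adds a field that pushes the hot data past one line.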
25. 25
Strings
●
We do like C++ strings and use them extensively. But you may be surprised to realize
how much slower they run when put under performance stress testing.
●
A lot of comparisons have been done of char arrays vs strings, and in general
strings are reported to be slower than char arrays by around 23% !!
●
Eventually the CPU / OS works best when it gets to deal with plain 1s and 0s.
When you ask it to do a string comparison or char array comparison, it does the
comparison in a generic way ( i.e. it goes on comparing each character and stops at the end
of the string / at a mismatch ). This becomes a problem for latency as it’s a linear scan, and
even if it takes 50-60 cycles in isolation for, say, a 16 char comparison, using strings at
20 places in the code will take 1200 cycles ( ~0.4 micros on 3GHz !! )
Solution ?
26. 26
Avoid String Operations When Possible
●
We can implement a simpler solution when we know that in most cases the length of the
string is fixed ( it can be extended to varying lengths as well ): compare the bytes as
integers by type casting. The latency of the comparison will drop by at least 38%
Length 8 char array comparison,
*((uint64_t*)(arr_a)) == *((uint64_t*)(arr_b))
Length 16 char array comparison can be done as below,
*((uint64_t*)(arr_a)) == *((uint64_t*)(arr_b)) &&
*((uint64_t*)(arr_a+8)) == *((uint64_t*)(arr_b+8))
This will get executed faster now because the processor is only looking to match all bits,
and on a 64 bit system it’s just a single word comparison.
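The pointer casts above assume the arrays are suitably aligned and they sidestep the aliasing rules. A more defensive way to write the same idea is sketched below, with hypothetical helper names `equal8` and `equal16`. It copies the bytes into integers with std::memcpy, which mainstream compilers normally turn into plain 64 bit loads at usual optimization levels. Both fields are assumed to be padded to the full fixed width, since every byte takes part in the comparison.

```cpp
#include <cstdint>
#include <cstring>

// Compare two 8 byte fixed-width fields with one 64 bit integer comparison.
inline bool equal8(const char* a, const char* b) {
    std::uint64_t x;
    std::uint64_t y;
    std::memcpy(&x, a, sizeof x);  // well-defined for any alignment
    std::memcpy(&y, b, sizeof y);
    return x == y;
}

// Compare two 16 byte fixed-width fields as two 64 bit words.
inline bool equal16(const char* a, const char* b) {
    return equal8(a, b) && equal8(a + 8, b + 8);
}
```

Unlike strcmp, these helpers do not stop at a terminating '\0', so the padding bytes after a shorter symbol must be identical on both sides ( for example zero filled ).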
27. 27
Inline
●
What is the inline keyword ?
●
What happens when the execution reaches a function call ?
●
Why not inline everything ?
●
Why doesn’t the compiler expand everything ?
28. 28
always_inline and noinline
●
The inline keyword has been slightly misunderstood. It mainly means multiple definitions
are permitted ( i.e. a common header with the definition is included into 2 cpp files )
●
always_inline and noinline are stronger hints to the compiler, but one has to measure the
latency impact when using them.
●
Why doesn’t the compiler expand everything in place ?
- DLL
- Virtual functions
- Recursive function call
- Bigger executable means more disk space and load time, also puts pressure on cache
In general you can hint the compiler not to inline small functions which are not doing
anything productive or which should be kept out of the hotpath.
__attribute__((noinline))
void some_function () { /* Not doing anything useful */ }
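A slightly fuller sketch of how the two attributes might be combined is given below, using the GCC / Clang spelling from the slide ( other compilers spell these differently ). The function names and the `g_rejected_orders` counter are hypothetical. The small hot helper is forced inline, while the rarely taken error path is kept out of line so it doesn’t bloat the hotpath’s instruction stream.

```cpp
#include <cstdint>

// Hypothetical counter updated only on the cold path.
std::int64_t g_rejected_orders = 0;

// Tiny and called on every order: ask the compiler to always expand it.
__attribute__((always_inline)) inline std::int64_t notional(std::int64_t price,
                                                            std::int64_t qty) {
    return price * qty;
}

// Rare error handling: keep it out of line, away from the hot instructions.
__attribute__((noinline)) std::int64_t reject_order(std::int64_t qty) {
    (void)qty;            // a real system might record the offending value
    ++g_rejected_orders;
    return -1;
}

// Hotpath function: the common case stays compact and call-free.
std::int64_t order_value(std::int64_t price, std::int64_t qty) {
    if (qty <= 0) return reject_order(qty);
    return notional(price, qty);
}
```

As the slide says, these are only hints about code layout, so the effect has to be measured in the real system rather than assumed.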
29. 29
Branching
●
Why is branching bad ?
e.g. consider that I can buy / sell something, and at multiple places in my execution code
I’ve checks like,
if( BUY == activity_type ) {
} else if( SELL == activity_type ) {
} else { /* ERROR */ }
●
What are the options we have ?
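One commonly used option, offered here as an assumption rather than as the answer the talk had in mind, is to make the side a template parameter. The buy / sell decision is then taken once at the edge of the system, and everything behind it is compiled separately for each side with no run-time check. The `Side` enum and the function names below are hypothetical.

```cpp
#include <cstdint>

enum class Side { Buy, Sell };

// Resolved at compile time: +1 for buys, -1 for sells.
template <Side S>
constexpr std::int64_t side_sign() {
    return S == Side::Buy ? 1 : -1;
}

// Compiled once per side. Neither instantiation contains a side check.
template <Side S>
std::int64_t position_after_fill(std::int64_t position, std::int64_t qty) {
    return position + side_sign<S>() * qty;
}

// The only run-time decision on the side, taken once at the entry point.
std::int64_t apply_fill(Side side, std::int64_t position, std::int64_t qty) {
    return side == Side::Buy ? position_after_fill<Side::Buy>(position, qty)
                             : position_after_fill<Side::Sell>(position, qty);
}
```

The trade-off is code size: every templated function exists twice, which puts some pressure on the instruction cache, so this too needs measuring.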
31. 31
Branch Prediction
●
Consider an if statement on data[c] : at the processor level, it’s actually a branch
instruction ! ( Assume : data[c] holds values between 0 and 255, c is a counter which is
looping over the array )
●
Processors are smart enough to prefetch a set of instructions to speed up execution
●
When your processor sees a branch it has no idea which way it will go. Without prediction
it would have to halt execution and wait until the previous instructions are complete before
it can pick the correct path. With prediction it guesses, and a wrong guess throws away the
work done on the wrong path
●
Modern processors are quite complicated and have long pipelines, so they take forever
to “warm up” and “slow down”
●
What are the alternatives ? Develop code which is friendly enough for branch prediction
to work ( i.e. if possible sort the array, which will improve branch prediction )
●
Apply some smart hacks with assumptions which are valid ( no branching with the
replacement below, and your train never has to stop here )
int t = ( data[c] - 128 ) >> 31 ;
sum += ~t & data[c] ;
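To see the hack in context, here is a sketch that puts a branchy loop next to the branchless replacement. The condition `data[c] >= 128` is an assumption ( the if statement from the original slide is not available ), chosen because it is the condition that the two lines above replace. The trick relies on right-shifting a negative int being an arithmetic shift, which is what mainstream compilers do and what C++20 guarantees.

```cpp
#include <cstddef>
#include <cstdint>

// Branchy version: one conditional branch per element.
std::int64_t sum_branchy(const std::int32_t* data, std::size_t n) {
    std::int64_t sum = 0;
    for (std::size_t c = 0; c < n; ++c) {
        if (data[c] >= 128) sum += data[c];
    }
    return sum;
}

// Branchless version: the comparison is turned into a bit mask.
std::int64_t sum_branchless(const std::int32_t* data, std::size_t n) {
    std::int64_t sum = 0;
    for (std::size_t c = 0; c < n; ++c) {
        // t is -1 (all bits set) when data[c] < 128, otherwise 0.
        const std::int32_t t = (data[c] - 128) >> 31;
        // ~t is 0 for small values and all ones for values >= 128, so the
        // AND either drops data[c] or passes it through unchanged.
        sum += ~t & data[c];
    }
    return sum;
}
```

Both functions return the same sum for inputs in the 0 to 255 range. Which one is faster depends on how predictable the data is and on what the compiler already does with the branchy loop, so it is worth checking the generated code.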
34. 34
Noisy Neighbours Solution
●
You have to be very careful in choosing which processes run on the system.
●
Identify which processes are actually sharing the L2 cache
●
Identify whether any process is thrashing the L3 cache and in turn hurting the
performance of the production application
●
One can disable cores which are not being used to sort of lock the cache, and
disable hyper-threading to ensure better use of the L2 cache
●
There are various hacks available to make some kernel modules not cache data and
rather make direct use of RAM
35. 35
Performance Measurements
Challenges :
●
How do you measure very small blocks of code, where the measurement itself may
take more time than the code ? gettimeofday on Linux with the tsc clock source itself
takes ~120-150 CPU cycles.
●
Measurements in an offline setup will be far, far away from those observed in the
Production system
●
How do you analyze which are the slow performing units of the code ?
●
Do you actually look at the assembly code, and how useful is it in
practical scenarios ?
●
How useful are the tools like cachegrind, gprof, papi libs with counters ?
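The first challenge can be made concrete with a small sketch. It uses std::chrono::steady_clock as a portable stand-in for gettimeofday or a raw TSC read, and the helper names are hypothetical. `clock_overhead_ns` estimates what one clock read costs, and `average_ns` times a block as a batch so that the two clock reads are amortised over many repetitions instead of dominating a single measurement.

```cpp
#include <chrono>
#include <cstddef>
#include <cstdint>

using Clock = std::chrono::steady_clock;

// Average cost in nanoseconds of a single Clock::now() call.
inline double clock_overhead_ns(std::size_t iterations) {
    if (iterations == 0) return 0.0;
    const Clock::time_point start = Clock::now();
    Clock::time_point last = start;
    for (std::size_t i = 0; i < iterations; ++i) {
        last = Clock::now();
    }
    const std::chrono::duration<double, std::nano> total = last - start;
    return total.count() / static_cast<double>(iterations);
}

// Average cost in nanoseconds of one call to fn, timed as a batch.
// fn is expected to return an integer so the result can be consumed.
template <typename Fn>
double average_ns(Fn&& fn, std::size_t repetitions) {
    if (repetitions == 0) return 0.0;
    volatile std::uint64_t sink = 0;  // keeps the calls from being optimised away
    const Clock::time_point start = Clock::now();
    for (std::size_t i = 0; i < repetitions; ++i) {
        sink = sink + fn();
    }
    const std::chrono::duration<double, std::nano> total = Clock::now() - start;
    return total.count() / static_cast<double>(repetitions);
}
```

A batch average hides exactly the thing HFT cares about most, the outliers, so in practice it complements per-event timestamps rather than replacing them.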
37. 37
Talk of the town – FPGA
●
This is the current area of focus for most HFT firms nowadays.
●
A pure end-to-end FPGA solution is quite complex and requires a lot of time and effort
●
Not everything can be optimized in FPGA since at present most FPGA boards operate at
around 2/2.5 GHz.
●
A lot of firms like us are trying to develop a hybrid end-to-end solution with FPGA, where
we can retain the best of software and hardware.
●
The primary motivation to move to FPGA is to remove jitter from the system: no software
solution can offer latency as stable as hardware can. The major concern here is that
no one wants to be slow even during 1% of the time in which the application is
trading. You can be fastest and make money 99% of the time, but jitter can wipe it all away
!!
Questions ??