Microseconds matter in High Frequency Trading
High performance trading systems in C++
Ravi Parikh TWO ROADS TRADING PVT LTD
(http://tworoads-trading.co.in/)
2
Introduction
About me :
● HFT infra developer at TWO ROADS TRADING since 2011
● Close to 9 years of overall experience in software development
Today’s talk :
● General software development vs HFT software development
● Overview of HFT trading and why speed matters so much
● Importance of correctness / robustness in HFT systems
● A few C++ optimization techniques for ultra-low-latency software development
● Noisy neighbors
● Measuring performance
3
General Software Development
Source : RFE Electronics
4
HFT Software Development
Source : RFE Electronics
5
HFT Trading
➔ Trading in general is about buying something and selling it; the result is a profit or a loss depending on the prices at which it was bought and sold.
➔ HFT is about market making, and there is no genuine intention of buying or selling as such. HFT firms are not speculators; they are there to provide liquidity to the market.
➔ HFT makes money from very small profitable trades executed at very high frequency ( the holding period of any open trade is very short ).
➔ The other main objective is to avoid bad trades, which can result in larger losses.
➔ So what is the role of an ultra-low-latency system in HFT ? It is about spotting the opportunities for those quick, small, profitable trades and grabbing them, and at the same time about pulling out in time to avoid larger losses. ( After all, there is always stiff competition fighting for the same trades, since markets become more efficient with each passing day. )
6
Role Of Latency In Grabbing The Opportunity
- Against all odds, only the fastest few will be able to book tickets successfully !!
Source : Internet
7
Role of Latency In Pulling Out ! ( Avoid bad trades )
- It is equally important to pull out of a bad trade before someone hits you with the fill. ( These are the trades where you were slow to change your price and were forced to take a trade you knew was a bad one, so speed matters even when you want to avoid making a loss. )
Source : Two Players Org
8
So How Fast Is Fast Enough ?
- It doesn’t matter whether you are faster by 1 second or 1 nanosecond, as long as you are ahead of everyone else. ( Unfortunately, in the HFT domain there are in most cases no silver or bronze rewards: it is gold or nothing, or even worse, a loss. )
Source : Photo By Alvin Loke
Source : Two Players Org
9
HFT System Overview ( T2T, tick-to-trade )
Software Solutions : 1-10 microseconds
Hardware Solutions : 0.5-2 microseconds
10
Robustness ??
● There is always a trade-off between adding that extra if check and saving a few CPU cycles, so robustness and optimization don’t always go well together.
● Even though it is rarely put forward ahead of speed as the most critical feature of the system, robustness can never be compromised in HFT.
● A 5-microsecond-latency system that sees a 1 Rs profit opportunity at every uniform 5-microsecond interval is not guaranteed a profit on all of them, because it will not capture every opportunity. However, a BUG in the system ( however trivial, say buy/sell flipped ! ) will guarantee a loss on every trade: 12 Million INR in ONE MINUTE.
● So in HFT infra development safety always comes first. One has to be 120% sure that there are no bugs in the system that runs in Production, because a few seconds or minutes of a buggy run are enough to make headlines the next day.
● Given that nothing can be done at the expense of robustness, making an application run faster becomes even more challenging and interesting.
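The 12 Million INR figure follows from simple arithmetic. A minimal sketch, assuming one losing trade in every 5-microsecond interval and a loss of 1 Rs per trade (both numbers come from the slide; the function name is hypothetical):

```cpp
#include <cstdint>

// Hypothetical helper: worst-case loss when every opportunity in the window
// is traded the wrong way round (e.g. buy/sell flipped).
constexpr std::int64_t worst_case_loss_inr(std::int64_t window_us,
                                           std::int64_t interval_us,
                                           std::int64_t loss_per_trade_inr) {
    return (window_us / interval_us) * loss_per_trade_inr;
}

// One minute is 60,000,000 microseconds; one trade every 5 microseconds
// gives 12,000,000 trades, each losing 1 Rs.
static_assert(worst_case_loss_inr(60000000, 5, 1) == 12000000,
              "60 s / 5 us = 12 million losing trades per minute");
```

The same function shows how quickly the number grows: ten seconds of a flipped side at this rate already costs 2 Million INR.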
11
Optimizations ( Prerequisites )
● Hardware selection ( CPU / RAM / Cache )
● Network selection ( Switches / Network Adapters )
● Understanding of the OS / platform ( OS version, OS / kernel features, OS memory management, interrupt management etc. )
● Programming language selection: why C++ ?
● Compiler / Linker ( compiler features, compile options, type of compiler etc. )
● External libraries ( dependencies / features )
● Tools for debugging / profiling ( GDB, valgrind, cachegrind, gprof etc. )
It is simply not possible to improve T2T in HFT, even with the most logically optimized C++ code, unless one understands the environment in which that C++ code will eventually run and with which it will interact.
12
External Optimizations ( Hardware )
● CPU
● RAM
● Different types of cache and cache sizes
● How do you pick the correct combination ?
● Network adapter ( kernel bypass )
● OS tuning ( Context switches ? Interrupt binding ? )
13
Fine Tuned System Performance
Source : CPPCON ( Carl Cook )
14
C++ Coding Optimizations
A Few Techniques That We’ll Talk About :
● Where do we start ? What is the hot path ?
● Logging is essential, isn’t it ? What do we do then ?
● Dynamic memory management ( new / delete )
● Data Binding
● Strings
● Inline ( always_inline / noinline )
● Branching ( What are the issues ? )
Disclaimer : I have not covered all the typical C++ optimizations; these are just a few quick techniques that can make a significant difference to performance.
15
Where to start ? Hot Path
● The “hotpath” is the full path along which execution flows to carry out the actual end transaction; in HFT it is the T2T path.
● The “hotpath” is exercised only about 0.01% of the time; the rest of the time the system is idle, doing administrative work or waiting for events.
● OS, networks and hardware are focused on throughput and fairness rather than on latency.
● Jitter is totally unacceptable. It is the major source of bad trades, and it pushes firms towards a total hardware solution even though the median latency might actually get worse.
16
Removing Jitters From Hotpath
Source : CPPCON ( Carl Cook )
17
HOTPATH in HFT System
Source : CPPCON ( Carl Cook )
18
Solution ?
Source : CPPCON ( Carl Cook )
19
Logging
● Almost all production systems need to log some important data.
● Disk I/O is among the slowest of all hardware operations.
● If your C++ code logs too much, it spends most of its time doing disk I/O and consuming CPU on unproductive work. First minimize logging as far as possible, move it out of the hotpath, use compressed forms of data etc.
What are the other options ?
20
Offload Logging
● Move logging to custom handles rather than std::cout / std::cerr / printf, and introduce buffering on your handle ( i.e. create a buffer of, say, 1024 bytes and flush it only when required ).
● Standard streams are also buffered until they are flushed ( std::cerr being the exception ), but with custom handles we control exactly when the flush happens and can design the handle around the type of logging we have.
● To eliminate jitter, remove file logging from the production process completely. The process writes the required information in some compact format to, say, a message queue ( MQ ) or shared memory ( SHM ), and a completely separate process writes it to log files. This improves the latency of the production system significantly.
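The hand-off described in the last bullet can be sketched as a single-producer / single-consumer ring buffer. This is an in-process illustration under stated assumptions, not the design used in the talk: the record layout, the class name SpscLogQueue and the 64-byte cache line size are all hypothetical, and a real deployment would place the buffer in shared memory and drain it from a separate process.

```cpp
#include <array>
#include <atomic>
#include <cstddef>
#include <cstdint>

// Hypothetical fixed-size record: raw values only, no formatting on the hotpath.
struct LogRecord {
    std::uint64_t timestamp_ns;
    std::uint32_t event_id;
    std::int64_t  value;
};

// Single-producer / single-consumer ring buffer.
template <std::size_t Capacity>
class SpscLogQueue {
    static_assert(Capacity != 0 && (Capacity & (Capacity - 1)) == 0,
                  "Capacity must be a power of two");
public:
    // Trading thread: never blocks; returns false (record dropped) when full.
    bool try_push(const LogRecord& rec) {
        const std::size_t head = head_.load(std::memory_order_relaxed);
        const std::size_t tail = tail_.load(std::memory_order_acquire);
        if (head - tail == Capacity) return false;
        slots_[head & (Capacity - 1)] = rec;
        head_.store(head + 1, std::memory_order_release);  // publish the slot
        return true;
    }

    // Logging thread: returns false when there is nothing to drain.
    bool try_pop(LogRecord& out) {
        const std::size_t tail = tail_.load(std::memory_order_relaxed);
        const std::size_t head = head_.load(std::memory_order_acquire);
        if (tail == head) return false;
        out = slots_[tail & (Capacity - 1)];
        tail_.store(tail + 1, std::memory_order_release);  // free the slot
        return true;
    }

private:
    std::array<LogRecord, Capacity> slots_{};
    // Separate cache lines so producer and consumer do not share one (assumed 64 bytes).
    alignas(64) std::atomic<std::size_t> head_{0};
    alignas(64) std::atomic<std::size_t> tail_{0};
};
```

The hotpath only copies a small struct and updates one atomic index; formatting and disk I/O happen on the consumer side. Dropping records when the queue is full is a design choice of this sketch; a production system may prefer a larger buffer or a dropped-record counter.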
21
Dynamic Memory Management
● There will always be cases where a production system uses heap memory and creates objects on the fly ( with new and delete ).
● If your C++ code uses new / delete / malloc etc., what are the issues in terms of latency ?
What are the alternatives ?
22
Memory Pool
● new / delete are library calls rather than system calls, but the allocator may have to ask the kernel for more memory ( e.g. via brk / mmap ), so control can flow through the allocator library and kernel space.
● The delete ( free ) code in glibc is roughly 400 lines of bookkeeping, which eats up a lot of CPU cycles.
● The solution is to develop your own C++ class that takes care of memory management for the duration of the program. Allocate a pool of objects up front and, instead of using new / delete, use this class to assign and release objects. This avoids kernel-space execution on the hotpath and improves both latency and jitter.
● A bonus is that recently used objects are handed out again very frequently, which improves cache performance.
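A minimal sketch of such a pool, assuming a fixed capacity known at compile time and single-threaded use. The names ObjectPool and Order are hypothetical, chosen only for illustration:

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <new>
#include <utility>

// Hypothetical order type used only for illustration.
struct Order {
    long   id;
    double price;
    int    quantity;
};

// Fixed-capacity pool: all storage is reserved up front, and acquire/release
// only move indices on a free list. Reuse is LIFO, so the most recently
// released slot (likely still in cache) is handed out first.
template <typename T, std::size_t N>
class ObjectPool {
public:
    ObjectPool() {
        for (std::size_t i = 0; i < N; ++i) free_[i] = N - 1 - i;
    }
    ObjectPool(const ObjectPool&) = delete;
    ObjectPool& operator=(const ObjectPool&) = delete;

    // Constructs a T in a free slot; returns nullptr when the pool is exhausted.
    template <typename... Args>
    T* acquire(Args&&... args) {
        if (free_count_ == 0) return nullptr;
        const std::size_t slot = free_[--free_count_];
        void* where = &storage_[slot * sizeof(T)];
        return ::new (where) T{std::forward<Args>(args)...};
    }

    // Destroys the object and returns its slot to the free list.
    void release(T* obj) {
        assert(obj != nullptr && free_count_ < N);
        const std::size_t slot = static_cast<std::size_t>(
            reinterpret_cast<unsigned char*>(obj) - storage_) / sizeof(T);
        obj->~T();
        free_[free_count_++] = slot;
    }

    std::size_t available() const { return free_count_; }

private:
    alignas(T) unsigned char storage_[N * sizeof(T)];
    std::array<std::size_t, N> free_{};
    std::size_t free_count_ = N;
};
```

Typical use is `ObjectPool<Order, 1024> pool; Order* o = pool.acquire(1L, 100.5, 10); ... pool.release(o);`. Returning nullptr on exhaustion keeps the hotpath free of exceptions and system calls; sizing the pool for the worst case is left to the caller.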
23
Data Binding
● How many bytes are read when some_function is called ?
● What is the problem with data access here ?
How do we fix the issue here ?
24
Cache Binding / Cache Line Usage
● Packing related data closely together improves cache access.
● In this case, the other member variables of the argument arrive on the same cache line, so you get access to them at essentially zero extra cost.
● You can design your code to optimize the usage of cache lines.
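The code this slide refers to is not reproduced in the text, so the following is only a sketch of the idea under assumptions: a 64-byte cache line and hypothetical struct names. It contrasts a layout where hot fields are separated by cold data with one where they share a single line.

```cpp
#include <cstddef>
#include <cstdint>

// Assumption: 64-byte cache lines.
constexpr std::size_t kCacheLine = 64;

// Scattered layout: the hot fields (price, quantity, side) are separated by
// cold data, so reading all three can touch several cache lines.
struct OrderScattered {
    double       price;
    char         client_name[64];   // cold
    std::int32_t quantity;
    char         comment[128];      // cold
    std::int8_t  side;
};

// Hot fields packed together and aligned to a cache line boundary.
struct alignas(kCacheLine) OrderHot {
    double       price;
    std::int32_t quantity;
    std::int8_t  side;
};

// Hot block first, cold data afterwards.
struct OrderPacked {
    OrderHot hot;
    char     client_name[64];
    char     comment[128];
};

static_assert(sizeof(OrderHot) == kCacheLine,
              "hot fields occupy exactly one cache line");
static_assert(offsetof(OrderPacked, hot) == 0,
              "hot block sits at the start of the object");
static_assert(offsetof(OrderScattered, side) - offsetof(OrderScattered, price) >= kCacheLine,
              "scattered hot fields cannot share one cache line");
```

With OrderPacked, loading price brings quantity and side into cache as well, which is the "zero cost" access the slide describes.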
25
Strings
● We like C++ strings and use them extensively, but you may be surprised at how much slower they are under performance stress testing.
● Many studies have compared char arrays with strings, and in general strings come out slower than char arrays, by around 23% !!
● Ultimately the CPU works best on plain machine words. A string or char array comparison is done generically ( i.e. it compares character by character and stops at the end of the string or at the first mismatch ). That is a linear scan, which is a problem for latency: even if a 16-char comparison takes only 50-60 cycles in isolation, string comparisons at 20 places in the code take up to 1200 cycles ( ~0.4 microseconds at 3 GHz !! ).
Solution ?
26
Avoid String Operations When Possible
● When the length of the string is fixed ( or varies among a few known lengths ), we can implement a simpler comparison by type casting; the latency of the comparison drops by at least 38%.
Length 8 char array comparison,
*((uint64_t*)(arr_a)) == *((uint64_t*)(arr_b))
Length 16 char array comparison can be done as below,
*((uint64_t*)(arr_a)) == *((uint64_t*)(arr_b)) &&
*((uint64_t*)(arr_a+8)) == *((uint64_t*)(arr_b+8))
This executes faster because the processor only has to match bits, and on a 64-bit system each 8-byte chunk is a single word comparison.
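Casting a char pointer to uint64_t* relies on the buffer being suitably aligned and runs into strict-aliasing rules. A hedged variant of the same idea uses std::memcpy, which compilers typically lower to a single 8-byte load. The function names equal8 and equal16 are hypothetical, and both inputs are assumed to be fixed-width, padded fields of at least 8 or 16 bytes:

```cpp
#include <cstdint>
#include <cstring>

// Compare two fixed-width 8-byte fields as one 64-bit word.
// Precondition: both pointers refer to at least 8 readable bytes.
inline bool equal8(const char* a, const char* b) {
    std::uint64_t x;
    std::uint64_t y;
    std::memcpy(&x, a, sizeof x);
    std::memcpy(&y, b, sizeof y);
    return x == y;
}

// Compare two fixed-width 16-byte fields as two 64-bit words.
// Precondition: both pointers refer to at least 16 readable bytes.
inline bool equal16(const char* a, const char* b) {
    return equal8(a, b) && equal8(a + 8, b + 8);
}
```

For example, two 16-character padded symbols that differ only in the last character compare equal under equal8 (first half) and unequal under equal16. Whether this beats the library comparison on a given compiler and CPU has to be measured.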
27
Inline
● What is the inline keyword ?
● What happens when execution reaches a function call ?
● Why not inline everything ?
● Why doesn’t the compiler expand everything ?
28
always_inline and noinline
● The inline keyword is slightly misunderstood. It mainly means that multiple definitions are permitted ( i.e. a common header containing the definition can be included in 2 cpp files ).
● always_inline and noinline are stronger hints to the compiler, but one has to measure the latency impact when using them.
● Why doesn’t the compiler expand everything in place ?
- DLL
- Virtual functions
- Recursive function calls
- A bigger executable means more disk space and load time, and also puts pressure on the cache
In general you can hint the compiler not to inline small functions that do nothing productive or that should stay out of the hotpath.
__attribute__((noinline))
void some_function () { /* Not doing anything useful */ }
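A small sketch of how the two attributes might be combined, assuming GCC or Clang (other compilers spell these differently). The functions handle_reject, notional and on_fill are hypothetical examples, not code from the talk:

```cpp
#include <cstdint>

// Cold path: kept out of line so it does not bloat the caller's
// instruction stream.
__attribute__((noinline))
void handle_reject(std::uint64_t order_id, std::uint64_t& reject_count) {
    (void)order_id;  // a real handler would record or alert on the id
    ++reject_count;
}

// Hot helper: ask the compiler to expand it at every call site.
__attribute__((always_inline))
inline std::int64_t notional(std::int64_t price, std::int64_t quantity) {
    return price * quantity;
}

// Hypothetical hotpath function combining the two.
std::int64_t on_fill(std::int64_t price, std::int64_t quantity,
                     bool rejected, std::uint64_t& reject_count) {
    if (rejected) {
        handle_reject(0, reject_count);
        return 0;
    }
    return notional(price, quantity);
}
```

Inspecting the generated assembly for on_fill is one way to confirm that the multiply is expanded in place while the reject handler remains a call; the latency effect still needs to be measured, as the slide says.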
29
Branching
● Why is branching bad ?
e.g. consider that I can buy / sell something, and at multiple places in my execution code I have checks like,
if( BUY == activity_type ) {
}else if( SELL == activity_type ) {
}else { /* ERROR */ }
● What are the options we have ?
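One possible option, sketched here under assumptions rather than taken from the slides, is to make the side a template parameter so the buy / sell decision is taken once at the edge of the system instead of at every check along the path. The names Side, signed_quantity, update_position and apply_fill are hypothetical:

```cpp
#include <cstdint>

enum class Side { Buy, Sell };

// Side is a compile-time parameter, so each instantiation contains only its
// own arithmetic; the condition below is a constant the compiler folds away.
template <Side S>
std::int64_t signed_quantity(std::int64_t quantity) {
    return S == Side::Buy ? quantity : -quantity;
}

template <Side S>
std::int64_t update_position(std::int64_t position, std::int64_t quantity) {
    return position + signed_quantity<S>(quantity);
}

// The run-time decision is taken once; everything downstream of it is
// branch-free with respect to the side.
inline std::int64_t apply_fill(Side side, std::int64_t position,
                               std::int64_t quantity) {
    return side == Side::Buy
               ? update_position<Side::Buy>(position, quantity)
               : update_position<Side::Sell>(position, quantity);
}
```

The cost is more generated code (one copy per side), which has to be weighed against instruction cache pressure.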
30
Branching Effects
Source : Image by Mecanismo ( CC-By-SA 3.0)
31
Branch Prediction
● Consider the if statement shown above : at the processor level it is actually a branch instruction ! ( Assume : data[c] holds values between 0 and 255, and c is a counter looping over the array )
● Processors are smart enough to prefetch a set of instructions to speed up execution.
● When the processor sees a branch it has no idea which way it will go. Without prediction it would have to halt execution and wait until the previous instructions complete before it can pick the correct path; with prediction it guesses, and a wrong guess means discarding the work done on the wrong path.
● Modern processors are quite complicated and have long pipelines, so they take forever to “warm up” and “slow down”.
● What are the alternatives ? Write code that is friendly to branch prediction ( e.g. if possible sort the array, which improves branch prediction ).
● Apply some smart hacks based on assumptions that are valid ( no branching with the replacement below, so your train never has to stop here ) :
int t = ( data[c] - 128 ) >> 31 ;
sum += ~t & data[c] ;
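The if statement this slide refers to is not reproduced in the text. The sketch below assumes it has the well-known form `if (data[c] >= 128) sum += data[c];` and shows the branching and branch-free versions side by side. The function names are hypothetical, and the trick additionally assumes that right-shifting a negative int is an arithmetic shift (implementation-defined before C++20, but true on mainstream compilers):

```cpp
#include <cstddef>
#include <cstdint>

// Branching version: the comparison compiles to a conditional jump whose
// outcome depends on the data.
inline std::int64_t sum_branchy(const std::int32_t* data, std::size_t n) {
    std::int64_t sum = 0;
    for (std::size_t c = 0; c < n; ++c) {
        if (data[c] >= 128) sum += data[c];
    }
    return sum;
}

// Branch-free version, valid for values in 0..255.
inline std::int64_t sum_branchless(const std::int32_t* data, std::size_t n) {
    std::int64_t sum = 0;
    for (std::size_t c = 0; c < n; ++c) {
        // t is all ones (-1) when data[c] < 128, and 0 otherwise.
        const std::int32_t t = (data[c] - 128) >> 31;
        // ~t is 0 for small values (adds nothing) and all ones for large
        // values (adds data[c] unchanged).
        sum += ~t & data[c];
    }
    return sum;
}
```

Both functions return the same result for in-range data; which one is faster depends on how predictable the data is and on what the compiler already does with the branching form, so it needs to be measured.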
32
Further Branching Improvements
Source : CPPCON ( Carl Cook )
33
Noisy Neighbours
34
Noisy Neighbours Solution
● Be very careful in choosing which processes run on the system.
● Know which processes are actually sharing the L2 cache.
● Identify whether any process is polluting the L3 cache and thereby hurting the performance of the production application.
● One can disable unused cores to more or less lock the cache, and disable hyper-threading to ensure better use of the L2 cache.
● There are various hacks to make some kernel modules bypass the cache and work directly from RAM.
35
Performance Measurements
Challenges :
● How do you measure very small blocks of code when the measurement itself may take longer than the code ? gettimeofday on Linux with the tsc clock source itself takes ~120-150 CPU cycles.
● Measurements in an offline setup will be far from those observed in the production system.
● How do you find out which units of the code are slow ?
● Do you actually look at the assembly code, and how useful is that in practical scenarios ?
● How useful are tools like cachegrind, gprof and the PAPI libraries with hardware counters ?
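One common way to deal with the first challenge is to estimate the cost of the timer itself and to time many repetitions with a single start / stop pair. The sketch below uses std::chrono::steady_clock as a portable stand-in for gettimeofday or a raw TSC read; the function names are hypothetical and the numbers it produces are machine-dependent:

```cpp
#include <chrono>
#include <cstdint>

// Average cost of one clock read, estimated from back-to-back readings.
inline double clock_overhead_ns(std::uint32_t iterations) {
    using clock = std::chrono::steady_clock;
    const clock::time_point start = clock::now();
    clock::time_point last = start;
    for (std::uint32_t i = 0; i < iterations; ++i) {
        last = clock::now();
    }
    const auto total =
        std::chrono::duration_cast<std::chrono::nanoseconds>(last - start);
    return static_cast<double>(total.count()) / iterations;
}

// Average time per call of `work`, with the two clock reads amortised over
// all iterations. The caller must make the work observable (e.g. write to a
// volatile), otherwise the compiler may remove it.
template <typename F>
double average_ns(F&& work, std::uint32_t iterations) {
    using clock = std::chrono::steady_clock;
    const clock::time_point start = clock::now();
    for (std::uint32_t i = 0; i < iterations; ++i) {
        work();
    }
    const clock::time_point stop = clock::now();
    const auto total =
        std::chrono::duration_cast<std::chrono::nanoseconds>(stop - start);
    return static_cast<double>(total.count()) / iterations;
}
```

An average hides exactly the jitter that matters in HFT, so a fuller harness would record a distribution (percentiles, worst case) rather than a mean; this sketch only addresses the timer-overhead problem.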
36
Measurements of HFT system performance
Source : CPPCON ( Carl Cook )
37
Talk of the town – FPGA
● This is the current area of focus for most HFT firms nowadays.
● A pure end-to-end FPGA solution is quite complex and requires a lot of time and effort.
● Not everything can be optimized in an FPGA, since at present most FPGA boards operate at around 2/2.5 GHz.
● Many firms like ours are trying to develop a hybrid end-to-end solution with FPGA, retaining the best of software and hardware.
● The primary motivation for moving to FPGA is to remove jitter from the system; no software solution can offer latency as stable as hardware can. The major concern is that no one wants to be slow even 1% of the time the application is trading. You can be the fastest and make money 99% of the time, but jitter can wipe it all away !!
Questions ??
38
THANK YOU !
Contact :
ravi.parikh@tworoads-trading.co.in

  • 1. Microseconds matter in High Frequency Trading High performance trading systems in C++ Ravi Parikh TWO ROADS TRADING PVT LTD (http://tworoads-trading.co.in/)
  • 2. 2 Introduction About me : ● HFT infra developer for TWO ROADS TRADING since 2011 ● Overall close to 9 years of experience in software development Today’s talk : ● General Software Development vs HFT software development ● Overview of HFT trading and why does speed matter a lot ? ● Importance of Correctness / Robustness of the HFT systems ● A few Techniques for C++ Optimizations for ultra low latency software development ● Noisy neighbors ● Measurements of performance
  • 5. 5 HFT Trading ➔ Trading in general is about buying something and selling it, can result into profit or loss based on prices at which it was bought & sold. ➔ HFT trading is about market making and there is no genuine intention of buying / selling, They aren’t speculators and they are there to provide liquidity to the market ➔ HFT makes money from very small profitable trades executed at very high frequency ( the holding time period for any open trade is very small ) ➔ Other main objective is to avoid taking bad trades which can result into larger losses. ➔ So what is the role of ultra low latency system in HFT ? It’s about spotting the opportunity for those quick small profitable trades and grabbing those, at the same time it’s about pulling it out in time to avoid taking larger losses ( After all, you’ll always have a very stiff competition fighting for the same trades given markets are becoming more and more efficient each passing day )
  • 6. 6 Role Of Latency In Grabbing The Opportunity - Against all odds only the fastest few will be able to book tickets successfully !! Source : Internet
  • 7. 7 Role of Latency In Pulling Out ! ( Avoid bad trades ) - It’s equally important to pull out of a bad trade before someone hits you with the fill ( These are the trades where you were slow to change the price and are now forced to take a trade which you know is a bad one, so speed matters even when you want to avoid making a loss ) Source : Two Players Org
  • 8. 8 So How Fast Is Fast Enough ? - It doesn’t matter whether you’re faster by 1 sec or 1 nano sec, as long as you’re ahead of everyone else ( Unfortunately, in the HFT domain there are in most cases no silver & bronze rewards: it’s gold or nothing, or even worse, a loss ) Source : Photo By Alvin Loke Source : Two Players Org
  • 9. 9 HFT System Overview ( T2T ) Software Solutions : 1-10 micros Hardware Solutions : 0.5-2 micros
  • 10. 10 Robustness ?? ● There is always a trade-off between adding that extra if check and saving a few cpu cycles, so robustness and optimizations don’t always go well together. ● Even though it is not put forward as the most critical feature of the system ahead of speed, robustness can never be compromised in HFT ● An opportunity to make a 1 Rs profit from buying / selling a stock, arising at uniform intervals in a 5 micro latency system, doesn’t guarantee that every trade is profitable, because we will not be able to capture all opportunities. However, a BUG in the system ( however trivial, say buy/sell flipped ! ) will guarantee a loss of 12 Million INR in ONE MINUTE ( presumably 60 sec / 5 micros = 12 million trades, each losing 1 Rs ) ● So in HFT infra development, Safety is always first. One has to be 120% sure that there are no bugs in the system which will run in Production, because all it may take is a few seconds / minutes of buggy run and it can make headlines the next day. ● So, keeping in mind that you can’t do anything that works against robustness, making an application run faster becomes even more challenging and interesting.
  • 11. 11 Optimizations ( Prerequisites ) ● Hardware selection ( CPU / RAM / CACHE ) ● Network selection ( Switches / Network Adapters ) ● Understanding of OS/Platform ( OS version, OS / kernel features, OS memory management, Interrupts Management etc ) ● Programming Language Selection – Why C++ ? ● Compiler / Linker ( Compiler features / compiling options / type of compiler etc ) ● External libraries ( Dependencies / Features ) ● Various Tools For Debugging / Profiling ( GDB, valgrind, cachegrind, gprof etc ) It is simply not possible to improve T2T in HFT, even with the logically most optimized C++ code, unless one understands the environment in which that C++ code is eventually going to run and interact.
  • 12. 12 External Optimizations ( Hardware ) ● CPU Processor ● RAM ● Different Types of Cache and Cache Sizes ● How do you pick the correct combination ? ● Network Adapter ( Kernel Bypassing ) ● OS Tuning ( Context Switches ? Interrupts Binding ? )
  • 13. 13 Fine Tuned System Performance Source : CPPCON ( Carl Cook )
  • 14. 14 C++ Coding Optimizations A Few Techniques That We’ll Talk About : ● Where do we start ? What is the hot path ? ● Logging is essential isn’t it ? What do we do then ? ● Dynamic Memory Management ( New / Delete ) ● Data Binding ● Strings ● Inline ( always_inline / noinline ) ● Branching ( What are the issues ? ) Disclaimer : I’ve not covered all typical C++ optimizations, it’s just a few quick techniques which can make a significant difference to the performance.
  • 15. 15 Where to start ? Hot Path ● The “hotpath” is the full path through which the execution flows and it does the actual end transaction, in HFT it’s the T2T path ● The “hotpath” is only exercised 0.01% of the time – the rest of the time the system is idle or doing administrative work or is waiting for events ● OS, Networks and Hardware are focused on throughput and fairness ● Jitter is totally unacceptable – This is the major source of bad trades and forces one to move to total hardware solution even though the Median number might actually get worse
  • 16. 16 Removing Jitters From Hotpath Source : CPPCON ( Carl Cook )
  • 17. 17 HOTPATH in HFT System Source : CPPCON ( Carl Cook )
  • 18. 18 Solution ? Source : CPPCON ( Carl Cook )
  • 19. 19 Logging ● Almost all production systems will need to log some important data ● Disk I/O is the worst of all hardware operations in terms of performance ● If your C++ code logs too much, it spends most of its time doing Disk I/O and consuming CPU on unproductive work. First try to minimize the logging as far as possible, move it out of the hotpath, use compressed forms of data etc What are other options ?
  • 20. 20 Offload Logging ● Move logging to custom handles rather than std::cout / std::cerr / printf, and introduce buffering on your handle ( i.e. create a buffer of 1024 bytes and only flush it when required ) ● std::cout and printf are also buffered unless we flush them ( std::cerr is not ), but with custom handles we can better control when to flush and can design the handle to work better with the type of logging we have ● Completely get rid of file logging in the production process to eliminate jitter. One can write the required information in some format to, say, MQ / SHM, and a completely separate process can then offload it into log files. This will improve the latency of the production system significantly.
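The offloading idea can be sketched with a small lock-free queue between the trading thread and a logger. This is only an illustration under stated assumptions: `LogRecord` and `SpscLogQueue` are hypothetical names, the queue lives inside one process and supports exactly one producer and one consumer, whereas the slide suggests MQ / SHM and a separate process. The point it shows is the same: the hot path only copies a few raw numbers, and formatting plus disk I/O happen elsewhere.

```cpp
#include <array>
#include <atomic>
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical fixed-size record: the hot path stores raw numbers only;
// turning them into text is left to the consumer side.
struct LogRecord {
    std::uint64_t timestamp_ns;
    std::uint32_t event_id;
    std::int64_t  value;
};

// Single-producer / single-consumer ring buffer (illustrative, not from the talk).
template <std::size_t Capacity>
class SpscLogQueue {
    static_assert(Capacity >= 2 && (Capacity & (Capacity - 1)) == 0,
                  "Capacity must be a power of two");

public:
    // Called from the hot path: no allocation, no locks, no system calls.
    // Returns false (the record is dropped) when the queue is full.
    bool try_push(const LogRecord& rec) {
        const std::size_t head = head_.load(std::memory_order_relaxed);
        const std::size_t tail = tail_.load(std::memory_order_acquire);
        if (head - tail == Capacity) return false;
        slots_[head & (Capacity - 1)] = rec;
        head_.store(head + 1, std::memory_order_release);
        return true;
    }

    // Called from the logging thread, which does the formatting and file I/O.
    bool try_pop(LogRecord& out) {
        const std::size_t tail = tail_.load(std::memory_order_relaxed);
        const std::size_t head = head_.load(std::memory_order_acquire);
        if (tail == head) return false;
        out = slots_[tail & (Capacity - 1)];
        tail_.store(tail + 1, std::memory_order_release);
        return true;
    }

private:
    std::array<LogRecord, Capacity> slots_{};
    std::atomic<std::size_t> head_{0};  // next slot to write (producer only)
    std::atomic<std::size_t> tail_{0};  // next slot to read (consumer only)
};
```

Dropping records when the queue is full is a design choice made here to keep the hot path from ever waiting; a real system may prefer a larger buffer or a counter of dropped records.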
  • 21. 21 Dynamic Memory Management ● There will always be cases when Production system will make use of heap memory and use objects on the fly ( with new and delete ) ● If your c++ code makes use of new / delete / malloc etc then what are the issues in terms of latency ? What are the alternatives to improve ?
  • 22. 22 Memory Pool ● new / delete are not system calls themselves, but they go through the allocator ( malloc / free in glibc ), which can enter kernel space ( brk / mmap ) when it needs more memory ● The free code in glibc is around 400 lines of bookkeeping, which will eat up a lot of CPU cycles ● The solution here would be to develop your own C++ class which takes care of memory management for the duration of your program. We can allocate a pool of objects in a class up front and, instead of using new / delete, use this class to assign / release objects. This way we avoid kernel space execution on the hotpath and improve latency as well as jitter ● Another bonus is that we will run into recently used objects very frequently and hence improve cache performance.
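A minimal sketch of such a pool, assuming a fixed capacity known at compile time and single-threaded use. `ObjectPool` and `Order` are hypothetical names used only for illustration. The free list is a stack, so the most recently released slot is handed out first, which is where the cache bonus mentioned above comes from.

```cpp
#include <cassert>
#include <cstddef>
#include <new>
#include <utility>

// Hypothetical payload type used only for the example.
struct Order {
    int    id;
    double px;
    Order(int id_, double px_) : id(id_), px(px_) {}
};

// Fixed-capacity pool: all storage is reserved when the pool is constructed,
// so acquire() / release() never call the general-purpose allocator.
// Not thread-safe; objects still live when the pool dies are not destroyed.
template <typename T, std::size_t N>
class ObjectPool {
public:
    ObjectPool() : free_count_(N) {
        for (std::size_t i = 0; i < N; ++i) free_[i] = &slots_[i];
    }
    ObjectPool(const ObjectPool&) = delete;
    ObjectPool& operator=(const ObjectPool&) = delete;

    // Constructs a T in a free slot; returns nullptr when the pool is exhausted.
    template <typename... Args>
    T* acquire(Args&&... args) {
        if (free_count_ == 0) return nullptr;
        Slot* slot = free_[--free_count_];
        return ::new (static_cast<void*>(slot->bytes)) T(std::forward<Args>(args)...);
    }

    // Destroys the object and pushes its slot back on top of the free stack.
    void release(T* obj) {
        assert(obj != nullptr && free_count_ < N);
        obj->~T();
        free_[free_count_++] = reinterpret_cast<Slot*>(obj);
    }

    std::size_t available() const { return free_count_; }

private:
    struct Slot {
        alignas(T) unsigned char bytes[sizeof(T)];
    };
    Slot        slots_[N];
    Slot*       free_[N];
    std::size_t free_count_;
};
```

Usage would look like `Order* o = pool.acquire(42, 101.5);` followed later by `pool.release(o);` in place of `new` / `delete`.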
  • 23. 23 Data Binding ● How many bytes are read when some_function is called ? ● What is the problem with data access here ? How do we fix the issue here ?
  • 24. 24 Cache Binding / Cache Line Usage ● Packing related data closely together improves cache access ● In this case, once one variable has been loaded, you get access to the other variables of the arguments on the same cache line at almost zero cost ● You can design your code in a way to optimize the usage of cache lines
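As an illustration of packing hot data into one cache line (the original slide's code is not in the transcript, so `HotQuote` and its fields are hypothetical, and 64 bytes is an assumption that holds for typical x86-64 parts):

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Assumption: 64-byte cache lines.
constexpr std::size_t kCacheLine = 64;

// Everything the hot path reads for one instrument sits in a single,
// line-aligned struct, so touching bid_px also brings in the other fields.
struct alignas(kCacheLine) HotQuote {
    std::int64_t  bid_px;
    std::int64_t  ask_px;
    std::int32_t  bid_qty;
    std::int32_t  ask_qty;
    std::uint64_t last_update_ns;
};

static_assert(sizeof(HotQuote) == kCacheLine, "HotQuote should occupy exactly one line");
static_assert(alignof(HotQuote) == kCacheLine, "HotQuote should start on a line boundary");

// Reads two fields that are guaranteed to share a cache line.
inline std::int64_t spread(const HotQuote& q) {
    return q.ask_px - q.bid_px;
}
```

The `static_assert`s turn the layout assumption into a compile-time check, so adding a field that spills into a second line is caught early rather than found in a latency profile.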
  • 25. 25 Strings ● We do like C++ strings and use them extensively, but you may be surprised at how much slower they run under performance stress testing. ● There are a lot of standard studies on char arrays vs strings, and in general strings are slower than char arrays by around 23% !! ● Eventually the CPU / OS works best when it gets to deal with only 1s and 0s. When you ask it to do a string comparison or a char array comparison, it does the comparison in a generic way ( i.e. it compares character by character and stops at the end of the string or at a mismatch ). This becomes a problem for latency because it’s a linear scan: even if it takes only 50-60 cycles in isolation for, say, a 16 char comparison, using strings at 20 places in the code will take 1200 cycles ( ~0.4 micros at 3GHz !! ) Solution ?
  • 26. 26 Avoid String Operations When Possible ● When we know that the length of the string is fixed in most cases ( the same idea can be extended to varying lengths ), we can implement a simpler comparison by type casting, and the latency of the comparison will drop by at least 38%. Length 8 char array comparison : *((uint64_t*)(arr_a)) == *((uint64_t*)(arr_b)) Length 16 char array comparison can be done as below : *((uint64_t*)(arr_a)) == *((uint64_t*)(arr_b)) && *((uint64_t*)(arr_a+8)) == *((uint64_t*)(arr_b+8)) This executes faster because the processor only has to match all the bits, and on a 64 bit system that is a single word comparison.
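Reading a char buffer through a `uint64_t*` can run into alignment and strict-aliasing rules, so here is a hedged variant of the same word-wise comparison that copies the bytes with `std::memcpy` first. Compilers typically turn these fixed-size copies into plain 8-byte loads. `equal16` is a hypothetical helper name, and both buffers are assumed to hold at least 16 valid bytes.

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>

// Compares exactly 16 bytes as two 64-bit words.
// Precondition: a and b each point to at least 16 readable bytes.
inline bool equal16(const char* a, const char* b) {
    assert(a != nullptr && b != nullptr);
    std::uint64_t a_lo, a_hi, b_lo, b_hi;
    std::memcpy(&a_lo, a, 8);
    std::memcpy(&a_hi, a + 8, 8);
    std::memcpy(&b_lo, b, 8);
    std::memcpy(&b_hi, b + 8, 8);
    // XOR is zero only when the words match; OR-ing the two results gives a
    // single test instead of a character-by-character loop.
    return ((a_lo ^ b_lo) | (a_hi ^ b_hi)) == 0;
}
```

This only works for fixed-width, padded identifiers (for example symbols padded to 16 characters); whether it is actually faster than `memcmp` on a given compiler should be measured rather than assumed.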
  • 27. 27 Inline ● What is the inline keyword ? ● What happens when the execution reaches a function call ? ● Why not inline everything ? ● Why doesn’t the compiler expand everything ?
  • 28. 28 always_inline and noinline ● The inline keyword has been slightly misunderstood – it mainly means that multiple definitions are permitted ( i.e. a common header with the definition is included into 2 cpp files ) ● always_inline and noinline are stronger hints to the compiler, but one has to measure the latency impact when using them. ● Why doesn’t the compiler expand everything in place ? - DLL - Virtual functions - Recursive function call - Bigger executable means more disk space and load time, also puts pressure on cache You can in general try to hint the compiler not to inline small functions which are not doing anything productive or which should be out of the hotpath. __attribute__((noinline)) void some_function () { /* Not doing anything useful */ }
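A small sketch of how the two attributes might be combined, assuming GCC or Clang (the macros fall back to plain `inline` / nothing elsewhere). The macro and function names are hypothetical. The idea shown: the common case stays inlined in the caller, while the rare error handling is kept out of line so it does not bloat the hot code.

```cpp
#include <cassert>

#if defined(__GNUC__) || defined(__clang__)
#define HFT_ALWAYS_INLINE inline __attribute__((always_inline))
#define HFT_NOINLINE __attribute__((noinline))
#else
#define HFT_ALWAYS_INLINE inline
#define HFT_NOINLINE
#endif

// Counter used only so the example has an observable effect.
static int g_rejected_orders = 0;

// Rare path: kept as a real function call so its body stays off the hotpath.
static HFT_NOINLINE void on_rejected_quantity(int qty) {
    (void)qty;  // a real handler would record the offending value
    ++g_rejected_orders;
}

// Hot path: small enough that forcing inlining is plausible; measure to confirm.
HFT_ALWAYS_INLINE bool quantity_ok(int qty) {
    if (qty > 0) return true;
    on_rejected_quantity(qty);
    return false;
}
```

As the slide says, these are hints with side effects on code size and cache pressure, so each use should be backed by a latency measurement.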
  • 29. 29 Branching ● Why is branching bad ? e.g. consider that I can buy / sell something and at multiple places through my execution code I have checks like, if( BUY == activity_type ) { }else if( SELL == activity_type ) { }else { /* ERROR */ } ● What are the options we have ?
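One possible option, offered as an assumption rather than as the talk's own answer: decide buy vs sell once at the boundary and make the side a template parameter, so the repeated checks inside the execution code are resolved at compile time. `Side`, `Position`, `apply_fill` and `on_fill` are hypothetical names.

```cpp
#include <cassert>
#include <cstdint>

enum class Side : std::uint8_t { Buy, Sell };

struct Position {
    std::int64_t net_qty = 0;
    std::int64_t cash    = 0;
};

// S is known at compile time, so `sign` is a constant and no runtime
// buy/sell comparison remains inside this function.
template <Side S>
inline void apply_fill(Position& pos, std::int64_t qty, std::int64_t px) {
    constexpr std::int64_t sign = (S == Side::Buy) ? 1 : -1;
    pos.net_qty += sign * qty;
    pos.cash    -= sign * qty * px;
}

// The single runtime decision, taken once where the side first becomes known.
inline void on_fill(Position& pos, Side side, std::int64_t qty, std::int64_t px) {
    switch (side) {
        case Side::Buy:  apply_fill<Side::Buy>(pos, qty, px);  return;
        case Side::Sell: apply_fill<Side::Sell>(pos, qty, px); return;
    }
    assert(false && "unknown side");
}
```

Every function further down the call chain can take the same template parameter, which trades some code size for the removal of repeated checks.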
  • 30. 30 Branching Effects Source : Image by Mecanismo ( CC-By-SA 3.0)
  • 31. 31 Branch Prediction ● Consider the if statement shown above : at the processor level, it’s actually a branch instruction ! ( Assume : data[c] holds values between 0 and 255, and c is a counter looping over the array ) ● Processors are smart enough to prefetch a set of instructions to speed up execution ● When your processor sees a branch, it has no idea which way it will go. Without a prediction it would have to halt execution and wait until the previous instructions are complete before it can pick the correct path; with a wrong prediction it has to throw away the work it did speculatively ● Modern processors are quite complicated and have long pipelines, so they take forever to “warm up” and “slow down” ● What are the alternatives ? Write code which is friendly enough for branch prediction to work ( e.g. if possible sort the array, which makes the branch predictable ) ● Apply some smart hacks whose assumptions are valid ( no branching with the replacement below, so your train never has to stop here; it assumes a 32 bit int and an arithmetic right shift ) int t = ( data[c] - 128 ) >> 31 ; sum += ~t & data[c] ;
  • 32. 32 Further Branching Improvements Source : CPPCON ( Carl Cook )
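The if statement itself was an image on the slide, so the branchy version below is a reconstruction under the assumption that it sums the elements that are 128 or larger. The branchless version uses the bit trick from the slide. Both function names are hypothetical, and the trick assumes a 32-bit `int`, values in 0..255, and an arithmetic right shift for negative numbers (guaranteed from C++20, and what mainstream compilers did before).

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Reference version: one conditional branch per element.
inline std::int64_t sum_branchy(const int* data, std::size_t n) {
    assert(data != nullptr || n == 0);
    std::int64_t sum = 0;
    for (std::size_t c = 0; c < n; ++c) {
        if (data[c] >= 128) sum += data[c];
    }
    return sum;
}

// Branchless version: t is all ones when data[c] < 128 and zero otherwise,
// so ~t & data[c] is either 0 or data[c].
inline std::int64_t sum_branchless(const int* data, std::size_t n) {
    assert(data != nullptr || n == 0);
    std::int64_t sum = 0;
    for (std::size_t c = 0; c < n; ++c) {
        const int t = (data[c] - 128) >> 31;
        sum += ~t & data[c];
    }
    return sum;
}
```

Modern compilers may already emit a conditional move or vectorised code for the first loop, so the gain from the manual trick is something to measure, not assume.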
  • 34. 34 Noisy Neighbours Solution ● You have to be very careful in choosing which processes run on the system. ● Which processes are actually sharing the L2 cache ? ● Identify whether any process is polluting the L3 cache and in turn impacting the performance of the production application ● One can actually disable cores which are not being used, to sort of lock the cache, and disable hyper-threading to ensure better use of the L2 cache ● There are various hacks available to make some kernel modules not cache data and rather use RAM directly
  • 35. 35 Performance Measurements Challenges : ● How do you measure very small blocks of code where the measurement itself may take more time than the code ? gettimeofday on linux with the tsc clock source itself takes ~120-150 cpu cycles. ● Measurements in an offline setup will be far away from those observed in the Production system ● How do you analyze which are the slow performing units of the code ? ● Do you actually try to take a look at some assembly code, and how useful is it in practical scenarios ? ● How useful are tools like cachegrind, gprof, papi libs with counters ?
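To make the first challenge concrete, one can time the timer itself and look at percentiles rather than averages, since the tail is what hurts. This sketch uses `std::chrono::steady_clock` instead of the `gettimeofday` call mentioned above, purely for portability. `timer_overhead_ns` and `percentile` are hypothetical helper names, and the numbers printed on any machine are not claimed to match the slide.

```cpp
#include <algorithm>
#include <cassert>
#include <chrono>
#include <cstddef>
#include <cstdint>
#include <vector>

// Nearest-rank percentile of a sample set; pct must be in 1..100.
inline std::int64_t percentile(std::vector<std::int64_t> samples, unsigned pct) {
    assert(!samples.empty() && pct >= 1 && pct <= 100);
    std::sort(samples.begin(), samples.end());
    const std::size_t rank = (static_cast<std::size_t>(pct) * samples.size() + 99) / 100;
    return samples[rank - 1];
}

// Measures back-to-back clock reads: the result is the cost of measuring
// "nothing", i.e. the floor below which a code block cannot be timed this way.
inline std::vector<std::int64_t> timer_overhead_ns(std::size_t iterations) {
    using clock = std::chrono::steady_clock;
    std::vector<std::int64_t> out;
    out.reserve(iterations);  // allocate before the loop, not inside it
    for (std::size_t i = 0; i < iterations; ++i) {
        const auto t0 = clock::now();
        const auto t1 = clock::now();
        out.push_back(
            std::chrono::duration_cast<std::chrono::nanoseconds>(t1 - t0).count());
    }
    return out;
}
```

Comparing `percentile(samples, 50)` with `percentile(samples, 99)` gives a first impression of jitter on the machine before any trading code is involved.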
  • 36. 36 Measurements of HFT system performance Source : CPPCON ( Carl Cook )
  • 37. 37 Talk of the town – FPGA ● This is the current area of focus for most of the HFT firms nowadays. ● A pure end – end FPGA solution is quite complex and requires a lot of time and effort ● Not everything can be optimized in FPGA since at present most FPGA boards operate at around 2/2.5 GHz. ● A lot of firms like us are trying to develop a hybrid end – end solution in FPGA where we can retain the best of software and hardware. ● The primary motivation to move to FPGA is to remove jitter from the system: no software solution can offer latency as stable as hardware can. The major concern here is that no one wants to be slow even during 1% of the time in which the application is trading. You can be the fastest and make money 99% of the time, but jitter can wipe it all away !! Questions ??
  • 38. 38 THANK YOU ! Contact : ravi.parikh@tworoads-trading.co.in