2. Goal was:
Deliver short-term tactical performance improvements
● Fix common performance bottlenecks
● Introduce incremental architectural improvements based on Proof-of-Concept
● => No business logic or radical architecture changes
● => Update software to incorporate modern programming and architectural standards
3. Where did we start?
Version 1.9.5 running 8 PG and 8 L
This version cannot run 20 PG and 15 L
Mean latency: 70+ms (with long tail)
% Messages over 100ms: 49%
20 Currency pairs meant 40 processes running
JVM overload
Inefficient processor utilization – only 40%
Message latency characteristics
● Sub 10ms: 9.64%
● Sub 20ms: 31.69%
● Sub 40ms: 64.17%
● Sub 80ms: 85.34%
Long tail on distribution
4. Where are we now?
Version PERF running 20 PG and 15 L
Mean latency: 16ms
% Messages over 100ms: 0.029%
20 Currency pairs means 20 processes running
JVM not taxed
All processors utilized: 80%
Message latency characteristics
● Sub 10ms: 15.75%
● Sub 20ms: 77.53%
● Sub 40ms: 96.48%
● Sub 80ms: 99.89%
Compare to http://www.lmax.com/execution-performance (can't guarantee latency 100%)
5. How did we test?
Instrumented code with FIX Protocol Inter-Party Latency LMPs and recorded timing info (see the sketch after this list)
Ran a simulated price feed with constant and live-like rates
19 currency pairs
20 price groups (PG) and 15 layers (L) per pair for PERF branch
8 PG and 8 L for 1.9.5.56 branch
100 spot updates/sec
20 fwd updates/sec
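For reference, a minimal sketch (Java, with hypothetical class and method names) of the kind of per-message timing capture this implies: take a nanosecond timestamp at the first measurement point and bucket the elapsed time at the last one, mirroring the sub-10/20/40/80 ms breakdown reported above. It illustrates the approach, not the actual instrumentation code.

import java.util.concurrent.atomic.AtomicLongArray;

// Hypothetical sketch of per-message latency capture: timestamp on entry,
// bucket the elapsed time on exit, mirroring the sub-10/20/40/80 ms breakdown.
public final class LatencyHistogram {

    private static final long[] BOUNDS_MS = {10, 20, 40, 80, Long.MAX_VALUE};
    private final AtomicLongArray counts = new AtomicLongArray(BOUNDS_MS.length);

    // Captured at the first measurement point (e.g. when a price enters the engine).
    public static long markEntry() {
        return System.nanoTime();
    }

    // Called at the last measurement point; increments the matching latency bucket.
    public void record(long entryNanos) {
        long elapsedMs = (System.nanoTime() - entryNanos) / 1_000_000;
        for (int i = 0; i < BOUNDS_MS.length; i++) {
            if (elapsedMs < BOUNDS_MS[i]) {
                counts.incrementAndGet(i);
                return;
            }
        }
    }

    public long bucketCount(int index) {
        return counts.get(index);
    }
}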
8. Performance Improvements
Common Improvements
● Eliminate sources of latency common to many applications
● While some may have seemed trivial, they had significant impact
Improvements based on the PoC
● Apply PoC architecture principles in key areas where latency was measured
● Only tactical changes, not strategic
● Required careful measurements: bottlenecks turned out to be in different places than previously thought
9. Common Performance Bottlenecks:
Price Object Marshalling
Replaced object marshalling
● Significant source of latency, large message sizes, and garbage (object) creation
● Serialize-Deserialize cycle was performed at least three times for every price
● Previously based on JDK serialization, replaced with custom code (see the sketch below)
● Removed one cycle (more on that later)
Optimized the Price Object data structure
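To illustrate the direction of the marshalling change (the Price fields and layout below are assumptions, since the deck does not show the actual wire format): JDK serialization via ObjectOutputStream is replaced with hand-written, field-by-field encoding, which avoids reflection, class descriptors, and most of the intermediate garbage.

import java.nio.ByteBuffer;

// Illustrative custom codec: fields written directly as primitives.
// Fixed-point longs for prices avoid boxing and BigDecimal garbage.
public final class PriceCodec {

    public static void write(ByteBuffer buf, long instrumentId, long bidMicros,
                             long askMicros, long timestampNanos) {
        buf.putLong(instrumentId);
        buf.putLong(bidMicros);
        buf.putLong(askMicros);
        buf.putLong(timestampNanos);
    }

    // Reads the four fields back in the same order they were written.
    public static long[] read(ByteBuffer buf) {
        return new long[] {buf.getLong(), buf.getLong(), buf.getLong(), buf.getLong()};
    }
}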
10. Common Performance Bottlenecks:
Logging
Price Engine logging levels were excessive
● INFO level logging was performed over 1,500 times
● Most INFO level logging was redundant
Significant performance bottlenecks
● Disk writes, thread contention, object creation (GC)
● Logs could grow to GB size in minutes
Removed all but necessary logging (see the sketch below)
● Logs will need further work in the short term...
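The deck does not name the logging framework, so the following is an illustration only: where a hot-path statement is kept rather than removed outright, demoting it to DEBUG and guarding it means no message string is built and no arguments are boxed when the level is disabled.

import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

// Illustration (SLF4J assumed): hot-path logging demoted from INFO to DEBUG
// and guarded, so disabled levels cost neither formatting nor object creation.
public final class PriceFlowLogging {

    private static final Logger LOG = LoggerFactory.getLogger(PriceFlowLogging.class);

    static void onPrice(String pair, double bid, double ask) {
        if (LOG.isDebugEnabled()) {          // guard avoids boxing bid/ask when DEBUG is off
            LOG.debug("price {} bid={} ask={}", pair, bid, ask);
        }
    }
}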
11. Code Review and Optimization
All PE code was reviewed for efficiency
● Re-work (tactical) but not re-write (strategic)
Improvements
● Timer scheduling replaced with a more efficient approach
● Replaced synchronization locks with CAS operations where possible to reduce contention (see the sketch below)
● Replaced inefficient cache access
● Numerous code tweaks
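A sketch of the lock-to-CAS change referred to above (class and field names are hypothetical): a synchronized update is replaced with a compare-and-set retry loop, so writers never block each other and readers never take a lock.

import java.util.concurrent.atomic.AtomicReference;

// Hypothetical holder of the latest price for an instrument, updated lock-free.
public final class LatestPriceHolder {

    // {timestampNanos, priceMicros}; replaced atomically as a whole.
    private final AtomicReference<long[]> latest = new AtomicReference<>(new long[] {0L, 0L});

    // Only replaces the stored price if the update is newer than the current one.
    public void update(long timestampNanos, long priceMicros) {
        long[] next = {timestampNanos, priceMicros};
        long[] current;
        do {
            current = latest.get();
            if (current[0] >= timestampNanos) {
                return;                      // stale update, drop it without retrying
            }
        } while (!latest.compareAndSet(current, next));
    }

    public long[] get() {
        return latest.get();
    }
}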
12. PoC Architectural Principles
Only distribute components when absolutely necessary
● Challenge the myth that distributed components improve throughput and latency
Parallelism (threads) may dramatically slow a system down
● Contrary to old conventional wisdom
● Mechanical Sympathy has challenged this assumption
● Data contention and context switching often lead to data duplication and GC
● A lot can be done in a single thread
13. Reduced Parallelism
Significant contention was eliminated in the Broadcast module
Excessive use of “in memory” producer/consumer queues
● Price objects put on queues for margining, forward calculation, and plugin delivery
● Multiple worker consumer threads pull from those queues and process prices
Queues written using synchronization primitives
● Very inefficient
● Contention between producers and consumers (put and take operations)
● A large number of worker threads leads to context switching
Queues were replaced with a highly efficient lock-free buffer (see the sketch below)
● Uses CAS operations instead of synchronization to dramatically reduce contention
● Only one consumer thread to reduce context switching
We attempted to eliminate buffers and queues altogether
● Make processing synchronous (and therefore remove contention)
● Turned out to be higher latency than using the lock-free buffers
– Likely because business logic is not optimised
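The deck does not name the lock-free buffer implementation (the LMAX Disruptor is the best-known example of the pattern). The following is a minimal single-producer/single-consumer ring-buffer sketch showing the shape of the replacement: ordered sequence counters instead of locked put/take, and one consumer thread draining the buffer. It assumes a single producer for brevity and a power-of-two capacity.

import java.util.concurrent.atomic.AtomicLong;

// Minimal SPSC ring buffer: producer and consumer coordinate through two
// sequence counters instead of locks, so put/take never contend on a monitor.
public final class SpscRingBuffer<T> {

    private final Object[] slots;
    private final int mask;
    private final AtomicLong head = new AtomicLong(); // next slot the consumer reads
    private final AtomicLong tail = new AtomicLong(); // next slot the producer writes

    public SpscRingBuffer(int capacityPowerOfTwo) {
        slots = new Object[capacityPowerOfTwo];
        mask = capacityPowerOfTwo - 1;
    }

    // Single producer: returns false if the buffer is full.
    public boolean offer(T value) {
        long t = tail.get();
        if (t - head.get() >= slots.length) {
            return false;
        }
        slots[(int) (t & mask)] = value;
        tail.lazySet(t + 1);                 // publish the slot to the consumer
        return true;
    }

    // Single consumer: returns null if the buffer is empty.
    @SuppressWarnings("unchecked")
    public T poll() {
        long h = head.get();
        if (h >= tail.get()) {
            return null;
        }
        int index = (int) (h & mask);
        T value = (T) slots[index];
        slots[index] = null;                 // release the reference for GC
        head.lazySet(h + 1);
        return value;
    }
}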
14. Reduced Distribution
Everyone thought the bottleneck was Broadcast
● It turned out that bottlenecks existed in Broadcast, but there were other equally significant sources of latency...
...Validator and TW
● One Validator process and one TW process per currency pair --> 20 currency pairs = 40 processes!
● Context switching, JMS latency, serialization overhead
Combined Validator and TW into a single process (see the sketch below)
● Halved the number of processes
● Removed one serialization cycle
● Greatly simplified system management
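Illustration only (the interface and class names below are assumptions, and what Validator and TW actually do is out of scope): once the two components share a process, the validated price is handed to TW by a plain method call on the same object reference, which is what removes the JMS hop and one serialize/deserialize cycle per price.

// Hypothetical combined process: Validator hands prices to TW in-memory,
// with no broker round trip and no marshalling in between.
public final class ValidatorTwProcess {

    // Stand-in for the TW component's entry point.
    interface TwStage {
        void onValidatedPrice(Object price);
    }

    private final TwStage tw;

    public ValidatorTwProcess(TwStage tw) {
        this.tw = tw;
    }

    public void onPrice(Object price) {
        if (validate(price)) {
            tw.onValidatedPrice(price);      // in-process handoff: no serialization cycle
        }
    }

    private boolean validate(Object price) {
        return price != null;                // placeholder validation
    }
}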