2011-11-02 | 05:45 PM - 06:35 PM | Victoria
The Disruptor is a new open-source concurrency framework, designed as a high-performance mechanism for inter-thread messaging. It was developed at LMAX as part of our efforts to build the world's fastest financial exchange. Using the Disruptor as an example, this talk will explain some of the more detailed and less understood areas of concurrency, such as memory barriers and cache coherency. These concepts are often regarded as scary, complex magic accessible only to wizards like Doug Lea and Cliff Click. Our talk will try to demystify them and show that concurrency can be understood by us mere mortal programmers.
3. Program Order vs. Execution Order (maybe):

Program Order:      Execution Order (maybe):
int w = 10;         int x = 20;
int x = 20;         int y = 30;
int y = 30;         int b = x * y;
int z = 40;         int w = 10;
int a = w + z;      int z = 40;
int b = x * y;      int a = w + z;
6. static long foo = 0;

   private static void increment() {
       for (long l = 0; l < 500000000L; l++) {
           foo++;
       }
   }
7. public static long foo = 0;
   // Lock is an interface; ReentrantLock is the usual concrete implementation
   public static Lock lock = new ReentrantLock();

   private static void increment() {
       for (long l = 0; l < 500000000L; l++) {
           lock.lock();
           try {
               foo++;
           } finally {
               lock.unlock();
           }
       }
   }
8. static AtomicLong foo = new AtomicLong(0);

   private static void increment() {
       for (long l = 0; l < 500000000L; l++) {
           foo.getAndIncrement();
       }
   }
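The three single-threaded variants above can be timed with a small harness along these lines. This is a sketch of our own, not the talk's benchmark: the class name and iteration count (reduced from the talk's 500,000,000 for a quick run) are ours, and absolute numbers depend entirely on the JVM and machine.

```java
import java.util.concurrent.atomic.AtomicLong;

public class ContentionCost {
    static final long ITERATIONS = 50_000_000L; // smaller than the talk's 500M

    static long plain = 0;                       // no ordering/visibility guarantees
    static volatile long vol = 0;                // visible, but ++ is still not atomic
    static final AtomicLong atomic = new AtomicLong(0);

    static long millis(Runnable body) {
        long start = System.nanoTime();
        body.run();
        return (System.nanoTime() - start) / 1_000_000;
    }

    public static void main(String[] args) {
        long tPlain = millis(() -> { for (long i = 0; i < ITERATIONS; i++) plain++; });
        long tVol   = millis(() -> { for (long i = 0; i < ITERATIONS; i++) vol++; });
        long tAtom  = millis(() -> { for (long i = 0; i < ITERATIONS; i++) atomic.getAndIncrement(); });
        System.out.println("plain   : " + tPlain + " ms");
        System.out.println("volatile: " + tVol + " ms");
        System.out.println("atomic  : " + tAtom + " ms");
    }
}
```

The relative gap between the plain loop and the volatile/atomic ones is what the next slides quantify.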
9-14. Cost of Contention
Increment a counter 500 000 000 times.
• One Thread           :     300 ms
• One Thread (volatile):   4 700 ms  (15x)
• One Thread (Atomic)  :   5 700 ms  (19x)
• One Thread (Lock)    :  10 000 ms  (33x)
• Two Threads (Atomic) :  30 000 ms (100x)
• Two Threads (Lock)   : 224 000 ms (746x) — that's ~4 minutes!
15. Parallel v. Serial - String Split
Guy Steele @ Strange Loop:
http://www.infoq.com/presentations/Thinking-Parallel-Programming
Scala implementation and brute-force version in Java:
https://github.com/mikeb01/folklore/
16. [Chart: String Split throughput (ops/sec), Parallel (Scala) vs. Serial (Java); higher is better]
27. How Fast Is It - Throughput
[Chart: throughput (ops/sec) for ABQ vs. Disruptor, in Unicast and Diamond configurations]
28. How Fast Is It - Latency

                       ABQ    Disruptor
Min                    145           29
Mean                32,757           52
99 Percentile    2,097,152          128
99.99 Percentile 4,194,304        8,192
Max              5,069,086      175,567
33. Look Ma’ No Memory Barrier

   // fields implied by the slide
   private static final int SIZE = 1024;              // ring capacity, a power of two
   private final Object[] data = new Object[SIZE];
   private final AtomicLong sequence = new AtomicLong(-1);
   private long nextValue = -1;                       // single writer only

   public void publish(Object value) {
       long index = ++nextValue;
       data[(int)(index % SIZE)] = value;
       sequence.lazySet(index);                       // ordered store, no full memory barrier
   }
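To see the pattern in context, here is a hypothetical single-producer, single-consumer wrapper around the slide's `publish()`. The class name, `take()`, and the consumer-side counter are our additions for illustration, and there is deliberately no capacity check (the real Disruptor tracks the consumer's sequence before wrapping).

```java
import java.util.concurrent.atomic.AtomicLong;

// Minimal single-producer, single-consumer handoff using lazySet,
// sketched around the slide's publish() method. Not the Disruptor's API.
public class LazySetQueue {
    static final int SIZE = 1024;                   // a power of two
    final Object[] data = new Object[SIZE];
    final AtomicLong sequence = new AtomicLong(-1); // last published index
    long nextValue = -1;                            // producer-local
    long nextToRead = 0;                            // consumer-local

    public void publish(Object value) {
        long index = ++nextValue;
        data[(int) (index % SIZE)] = value;
        sequence.lazySet(index);  // ordered store: cheaper than a volatile write
    }

    public Object take() {        // spins until the next value is visible
        while (sequence.get() < nextToRead) Thread.onSpinWait();
        return data[(int) (nextToRead++ % SIZE)];
    }
}
```

`lazySet` guarantees the array store above it is visible before the new sequence value, which is all a single consumer needs; it avoids the full fence a plain volatile write would issue on the producer's hot path.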
35. Cache Line Padding

   public class PaddedAtomicLong extends AtomicLong {
       public volatile long p1, p2, p3, p4, p5, p6 = 7L;
       //... lines omitted
       public long sumPaddingToPreventOptimisation() {
           return p1 + p2 + p3 + p4 + p5 + p6;
       }
   }
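A rough way to probe the effect of padding is to let two threads hammer two counters that are either plain `AtomicLong`s or padded ones. This harness is our own sketch: heap objects are not guaranteed to sit on adjacent cache lines, so the unpadded case only *may* exhibit false sharing, and the ratio (not the absolute numbers) is the interesting output.

```java
import java.util.concurrent.atomic.AtomicLong;

// False-sharing probe: results vary by JVM, allocator and CPU; treat as a sketch.
public class FalseSharingDemo {
    static class PaddedAtomicLong extends AtomicLong {
        public volatile long p1, p2, p3, p4, p5, p6 = 7L;
        public long sumPaddingToPreventOptimisation() {
            return p1 + p2 + p3 + p4 + p5 + p6;
        }
    }

    // Two threads each increment their own counter; contention, if any,
    // comes only from the counters sharing a cache line.
    static long run(AtomicLong a, AtomicLong b, long iterations) throws InterruptedException {
        Thread t1 = new Thread(() -> { for (long i = 0; i < iterations; i++) a.incrementAndGet(); });
        Thread t2 = new Thread(() -> { for (long i = 0; i < iterations; i++) b.incrementAndGet(); });
        long start = System.nanoTime();
        t1.start(); t2.start(); t1.join(); t2.join();
        return (System.nanoTime() - start) / 1_000_000;
    }

    public static void main(String[] args) throws InterruptedException {
        long n = 20_000_000L;
        System.out.println("unpadded: " + run(new AtomicLong(), new AtomicLong(), n) + " ms");
        System.out.println("padded  : " + run(new PaddedAtomicLong(), new PaddedAtomicLong(), n) + " ms");
    }
}
```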
36. Summary
• Concurrency is a tool
• Ordering and visibility are the key challenges
• For performance the details matter
• Don't believe everything you read
o Come up with your own theories and test them!
(Trish)

Introduce ourselves, mention the award :)

"Duke's Choice Award for Innovative Programming Framework"

Introduce what we're going to cover
 - concurrency/performance
 - deep & narrow
 - contradictory - going to argue against abstractions
(Trish)

All of the ask-the-audience questions:

- Who works with concurrent code daily?
- Who finds concurrency difficult?
- Who cares about performance?
(Trish)

Compilers and CPUs are allowed to reorder instructions as long as program semantics are maintained. Without any explicit requests, those correctness semantics are limited to observers in the same thread.

Different CPUs reorder instructions to varying degrees. E.g. Intel x86 not much, DEC Alpha lots, Intel Atom not at all.

Unless explicit instructions are used to ensure ordering, observers in another thread will see different results. I.e. a separate thread can't assume that because z = 40 is true, x = 20 is also true. It may not have happened yet.

That's if the other thread can even see the data; throw to next slide on visibility.

1) Compilers and CPUs are free to re-order instructions
2) Different CPUs reorder different amounts - Intel x86 not much
3) Unless otherwise specified, you can only guarantee ordering within the same thread
4) The other thread can't guarantee order. x is not necessarily 20, if it's even visible
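The hazard in these notes can be sketched as code. This harness is ours, not the talk's: it shows the shape of the problem, but on strongly ordered x86 hardware the anomaly may never actually fire, which is itself part of the lesson (absence of a failure is not proof of correctness).

```java
// Without synchronization, a reader observing z == 40 cannot conclude x == 20.
public class ReorderingDemo {
    static int x, z;  // deliberately NOT volatile

    public static void main(String[] args) throws InterruptedException {
        Thread writer = new Thread(() -> { x = 20; z = 40; });
        Thread reader = new Thread(() -> {
            // The writer's stores may be reordered or not yet visible here.
            if (z == 40 && x != 20) {
                System.out.println("observed z == 40 but x != 20");
            }
        });
        reader.start();
        writer.start();
        writer.join();
        reader.join();
        // Declaring x and z volatile (or guarding them with a lock)
        // restores the ordering and visibility guarantee.
    }
}
```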
(Trish)

Memory on modern CPUs consists of multiple layers of buffering and caching. Storing a value does not mean that it is immediately visible to all threads running on all cores.

Explicit instructions are needed to make data visible to other threads.

This is the crux of why concurrency is hard. The logical order of your program is not maintained when observing it from another thread (but sometimes it is).

There are tools to help reason about concurrent programs; the main one is the memory model. Memory models exist at multiple levels: languages, VMs and CPUs can all have memory models. Java fortunately has a good one which is portable. C++ only introduced one in the most recent spec; C++ programmers often had to think about the CPU's memory model instead (though there are helpful libraries and compiler intrinsics too).

Reordering and smart use of caching are the result of many years of hardware engineering applied to deal with the significant performance mismatch between the CPU and accessing memory.

However, correctness is not the only concern; a lack of understanding of the detail can lead to other problems... (throw to Mike).

1) Different layers of storage on a modern CPU (explain diagram)
2) The different levels exist because main memory is slow
3) Data for your instruction could be at any of these levels. Threads on a different CPU might not see it
4) The Java Memory Model is a good tool to reason about concurrent programming, and it's cross-platform
(Mike)

Are we taking a step back (Martijn)? Yes, necessarily so. Parallelism and concurrency are means to an end, not an end in their own right, the goal being performance. Performance generally covers throughput, latency and scalability; I'd throw in a 4th, energy efficiency. Concurrent code can bring with it a number of performance surprises; let's look at an example.
(Mike)

Up to 3 orders of magnitude difference in performance between best and worst case. It's important to understand what's happening at the lower layers to understand why code can perform poorly. Understanding the details of how concurrency works at the machine level makes understanding higher-level concurrency models (e.g. Clojure STM, queues, actors) much easier. Locks require kernel arbitration. The Disruptor looks to remove as much contention as possible.
(Mike)

Don't be scared of putting code into a single thread. Ported Guy Steele's algorithm, written in Fortress, to Scala and compared it to a brute-force single-threaded implementation in Java. Uses Scala's new parallel collections library. Not too much detail on the algorithm; it's based on a divide/conquer model to fit easily with fork/join.
(Mike)

Tested with a copy of Alice in Wonderland. Guess how many cores were used to get 440 ops/sec? 8 cores with hyper-threading: 16 concurrent threads. While eventually the Scala version would be faster given enough cores, it is horribly inefficient with its use of energy. This is likely to be more of an issue as we move into the future. Don't take this as a negative regarding Scala's parallel collections.
(Mike)

CPU performance is more complicated than a simple measurement of GHz or number of cores. Many other factors come into play: cache size, cache speed, bus architecture, data path sizes, number of caches...
(Trish)

Intro to LMAX
Real-world problems
DR / replication, high availability

(Trish)

SEDA architecture. A real enterprise solution needs more than just business logic. DR and journalling give you reliability.

This is a single service. Business logic is the interesting thing.

(Trish)

- Testing showed each queue had its own latency overhead.
- When you add it all up, even including your IO for replication and journalling, it's a big chunk of overall latency.
- Business logic is such a tiny amount of time.
(Trish)

Introduce the ring buffer; talk about how this is the basis of the Disruptor.

All event processors own their own sequence numbers. The producer writes to the ring buffer, which updates its sequence number. The consumer reads from the ring buffer and writes to its own sequence number.

1) RingBuffer/Disruptor intro
2) Producer writes, consumer reads
3) Individual sequence numbers
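The handshake described in these notes can be sketched in a few lines: each side owns its own sequence and only ever reads (never writes) the other's. All names here are our own illustration, not the Disruptor's real API, and it handles a single producer and single consumer only.

```java
import java.util.concurrent.atomic.AtomicLong;

// Toy ring buffer with per-side sequence numbers, in the spirit of the notes.
public class MiniRingBuffer {
    static final int SIZE = 8;                          // a power of two
    final long[] slots = new long[SIZE];
    final AtomicLong cursor = new AtomicLong(-1);       // producer's sequence
    final AtomicLong consumed = new AtomicLong(-1);     // consumer's sequence

    void publish(long value) {
        long next = cursor.get() + 1;                   // only the producer writes cursor
        while (next - consumed.get() > SIZE) Thread.onSpinWait(); // wait for a free slot
        slots[(int) (next & (SIZE - 1))] = value;
        cursor.lazySet(next);                           // make the slot visible
    }

    long consume() {
        long next = consumed.get() + 1;                 // only the consumer writes consumed
        while (cursor.get() < next) Thread.onSpinWait();          // wait for data
        long value = slots[(int) (next & (SIZE - 1))];
        consumed.lazySet(next);                         // hand the slot back
        return value;
    }
}
```

Because each counter has exactly one writer, there is no compare-and-swap and no lock anywhere on the path; that single-writer principle is the contention-removal idea the notes attribute to the Disruptor.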
(Trish)

Note that things can now be parallelised.
Diving in deep! If assembly code scares you, go get beer now.
(Mike)

Intel doesn't need a special instruction for volatile reads. It just ensures that the value is not stored in a register, and the write takes care of the cache invalidation. It doesn't reorder reads with respect to each other; Intel has a strong memory model. Other CPUs would require fence instructions on the read too.

*Notes from brown bag*
Friendlier comment/Java code
Talk about where cache lines are flushed
Talk about number of cycles
(Trish)
(Still not sure about this; it doesn't make clear where the sequence number is)

Create a bunch of empty longs to pad out the cache line.
Add them to a public method so they don't get optimised away.

See our shiny method names.
Concurrency is a good way to make your code slower and more complex.