Session four of my series on many cores turns to data, both big and small. It looks at MapReduce, approaching it sideways from a classic computer science perspective.
1. If the Data Cannot Come to
the Algorithm...
many cores with java
session four
data locality
copyright 2013 Robert Burrell Donkin robertburrelldonkin.name
this work is licensed under a Creative Commons Attribution 3.0 Unported License
2. Pre-emptive multi-tasking operating
systems use involuntary context switching
to provide the illusion of parallel processes
even when the hardware supports only a
single thread of execution.
Take Away from Session One
3. Even on a single core,
there's no escaping parallelism.
Take Away from Session Two
4. Take Away from Session Three
Code executing on different cores uses copies held
in registers and caches, so shared memory is likely
to be incoherent unless the program plays by the
rules of the software platform.
5. Gustafson's Law
S(p) = p − a(p − 1)
● S(p) is the speedup for p processors
● a is the non-parallelizable fraction
"in practice, the problem size scales with the number of
processors" John L. Gustafson
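A minimal sketch of Gustafson's scaled speedup in Java; the serial fraction and processor count here are illustrative values, not from the slides:

```java
// Sketch: scaled speedup under Gustafson's Law, S(p) = p - a(p - 1).
// The fraction a = 0.05 and p = 32 below are illustrative examples.
public class Gustafson {

    // Scaled speedup for p processors with non-parallelizable fraction a.
    static double speedup(int p, double a) {
        return p - a * (p - 1);
    }

    public static void main(String[] args) {
        // Even with a 5% serial fraction, 32 processors still yield ~30.45x.
        System.out.printf("%.2f%n", speedup(32, 0.05));
    }
}
```

Note how mild the penalty is compared with Amdahl's fixed-size view: because the problem grows with the machine, the serial fraction only costs a(p − 1).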
6. ● Think about Gustafson's Law...
● The quantity of data processed...
● ...scales linearly as processors are added.
● Throwing processors at the problem
works...
● ...at least sometimes.
Scales and Scaling
7. Divide and Conquer
● Back to the future
● Partition the data...
○ ...apply the same algorithm to each part and then
○ ...collate the answers.
● Natural to parallelise
● No contended shared memory
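The partition/apply/collate steps above map directly onto Java's fork/join framework (Java 7+). A sketch summing an array; the threshold and data are illustrative:

```java
import java.util.concurrent.ForkJoinPool;
import java.util.concurrent.RecursiveTask;

// Sketch: divide and conquer with fork/join.
// Partition the data, apply the same algorithm to each part, collate.
public class SumTask extends RecursiveTask<Long> {
    private static final int THRESHOLD = 1_000; // illustrative cut-off
    private final long[] data;
    private final int from, to;

    SumTask(long[] data, int from, int to) {
        this.data = data; this.from = from; this.to = to;
    }

    @Override
    protected Long compute() {
        if (to - from <= THRESHOLD) {          // small enough: solve directly
            long sum = 0;
            for (int i = from; i < to; i++) sum += data[i];
            return sum;
        }
        int mid = (from + to) >>> 1;           // partition the data...
        SumTask left = new SumTask(data, from, mid);
        SumTask right = new SumTask(data, mid, to);
        left.fork();                           // ...conquer each part in parallel...
        return right.compute() + left.join();  // ...and collate the answers
    }

    public static void main(String[] args) {
        long[] data = new long[10_000];
        for (int i = 0; i < data.length; i++) data[i] = i;
        long sum = new ForkJoinPool().invoke(new SumTask(data, 0, data.length));
        System.out.println(sum);               // 0 + 1 + ... + 9999
    }
}
```

Each subtask works only on its own index range, so there is no contended shared memory to synchronise.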
8. Data Locality
● When the algorithm is small
○ it's more efficient
■ to bring the algorithm to the data
■ than the data to the algorithm
● Whether the data is in
○ caches on cores in a many core computer, or in
○ disc storage in a distributed data store
9. Map and Reduce
● Partition the data
● The map algorithm
○ works in parallel
○ on local data
○ independently
● The reduce algorithm
○ collates output from map algorithms
● More complex systems built from these blocks
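Java 8 parallel streams express these two building blocks directly. A minimal sketch, with illustrative data:

```java
import java.util.Arrays;
import java.util.List;

// Sketch: map and reduce as building blocks, via parallel streams.
// Each map runs independently per element; reduce collates the results.
public class MapReduceSketch {
    public static void main(String[] args) {
        List<String> words = Arrays.asList("data", "locality", "matters");
        int totalLength = words.parallelStream()
                .mapToInt(String::length)   // map: independent, parallel
                .sum();                     // reduce: collate partial sums
        System.out.println(totalLength);
    }
}
```

Because each map call touches only its own element, the runtime is free to partition the work across cores without locks.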
10. Map-Reduce
As a Query Language
● NoSQL
● A popular alternative to SQL
○ for distributed data stores
● Why...?
○ Easy to
■ read and write
■ parallelize
○ Rich and full programming model
11. Map-Reduce
Crunching Big Data
● Commodity hardware
● Scales up to terabytes and petabytes
○ smoothly, by adding new nodes
● Map-Reduce platforms typically provide
○ fault tolerance, e.g. retry
○ orchestration
○ redundant data storage
● Statistical resilience
12. Take Away
When you want to be able to process big data
tomorrow by adding cores or computers, adopt
an appropriate architecture today.