4. Tasks and Deadlines
• Talk on selected paper (student 1)
– 30min with slides (+ 15min discussion)
• to be discussed with us 1 week before
– Summary (max. 500 words)
• 2 days before seminar, 11:59am
• Questions on assigned paper (student 2)
– Min. 5 questions
– 2 days before seminar, 11:59am
5. Report
Category 1: Theoretical treatment
• Focus on paper, related work, state of the art
of the field
• Detailed discussion
Category 2: Practical treatment of topic, for
instance
• Reproduce experiments/results
• Extend experiments
• Experiment with variations
6. Report
• paper summary (500 words)
• outline, content, and experiments to be
discussed with us
• Cat. 1: ca. 4000 words (excl. references)
– state of the art, context in field, and specific
technique from paper
• Cat. 2: ca. 2000 words (excl. references)
– Discuss experiments, insights gained, limitations found, etc.
Deadline: Feb. 6th
7. Consultations
• For alternative paper proposals
• To prepare presentation!
• To agree on focus of report/experiments
– For experiments mandatory
8. Grading
• Required attendance: 80% of all meetings
• 50% slides, presentation, and discussion
• 50% write-up/experiments
9. Timeline
Oct. 5th Introduction to Concurrent
Programming Models
Oct. 10th Deadline: List of ranked papers
Oct. 12th Runtime Techniques for Big Data
and Parallelism
Week 3-5 Preparations and Consultations
Week 6-12 Presentations
Feb. 6th Deadline for Report
16. Memory Wall
[Chart, log scale (1 to 10000), 1980–2005: relative performance of CPU frequency vs. DRAM speeds; the widening gap is the memory wall]
Source: Sun World Wide Analyst Conference Feb. 25, 2003
17. Multicore Transition
Work around physical limitations: the Power Wall and the Memory Wall
10/5/2016
[Diagram: several cores, each with local memory/caches, attached to shared main memory]
18. For a brief bit of history:
ENIAC’s recessive gene
Mitch Marcus and Atsushi Akera, Penn Printout (March 1996)
http://www.upenn.edu/computing/printout/archive/v12/4/pdf/gene.pdf
ENIAC's main control panel, U. S. Army Photo
22. A Rough Categorization
Marr, S. (2013), 'Supporting Concurrency Abstractions in High-level Language Virtual Machines', PhD
thesis, Software Languages Lab, Vrije Universiteit Brussel.
Data Parallelism
25. Threads
• Sequences of instructions
• Unit of scheduling
– Preemptive and concurrent
– Or parallel
26. A Snake Game
• Multiple players
• Compete for ‘apples’
• Shared board
10/5/2016 26
27. Race Conditions and Data Races
Race Condition
• Result depends on the timing of operations

Data Race
• A race condition on memory
• Synchronization absent or incomplete
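The distinction can be made concrete with a minimal sketch (a hypothetical counter, not the snake game): two threads increment a shared int. The unsynchronized counter is a data race and may lose updates; the locked counter always reaches the expected total.

```java
public class CounterRace {
    static int unsafeCount = 0;   // data race: updated without synchronization
    static int safeCount = 0;     // updated only inside a critical section
    static final Object lock = new Object();

    static int runSafe() throws InterruptedException {
        unsafeCount = 0;
        safeCount = 0;
        Runnable work = () -> {
            for (int i = 0; i < 100_000; i++) {
                unsafeCount++;                       // read-modify-write is not atomic: lost updates possible
                synchronized (lock) { safeCount++; } // mutual exclusion: no data race
            }
        };
        Thread t1 = new Thread(work), t2 = new Thread(work);
        t1.start(); t2.start();
        t1.join();  t2.join();
        return safeCount;                            // deterministically 200000
    }

    public static void main(String[] args) throws InterruptedException {
        System.out.println("safe = " + runSafe());
        // unsafeCount may be anything up to 200000, depending on timing.
    }
}
```

Note that the unsynchronized result depends on scheduling, which is exactly the "result depends on timing" definition above.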
29. Optimized Locking for more Parallelism
synchronized (board[3][3]) {
  synchronized (board[3][2]) {
    board.moveLeft(snake);
  }
}
Strategy: Lock only cells you need to update.
What could go wrong?
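One answer to "what could go wrong": two snakes locking the same pair of cells in opposite orders can deadlock. A hedged sketch of the standard fix, assuming (as the slide does) that cells are plain objects used as monitors: always acquire the two cell locks in a fixed global order, so the circular wait needed for deadlock cannot form.

```java
public class OrderedLocking {
    // Cells as monitors, mirroring the slide's board[3][3]/board[3][2] pattern.
    static final Object[][] board = new Object[4][4];
    static {
        for (int r = 0; r < 4; r++)
            for (int c = 0; c < 4; c++)
                board[r][c] = new Object();
    }

    // Lock the two cells in (row, col) order; every thread agrees on this
    // order regardless of argument order, which rules out deadlock.
    static void withCells(int r1, int c1, int r2, int c2, Runnable critical) {
        boolean firstIsLower = r1 < r2 || (r1 == r2 && c1 <= c2);
        Object first  = firstIsLower ? board[r1][c1] : board[r2][c2];
        Object second = firstIsLower ? board[r2][c2] : board[r1][c1];
        synchronized (first) {
            synchronized (second) {
                critical.run();
            }
        }
    }

    public static void main(String[] args) {
        // Opposite argument orders, same lock order underneath: no deadlock.
        withCells(3, 3, 3, 2, () -> System.out.println("move left"));
        withCells(3, 2, 3, 3, () -> System.out.println("move right"));
    }
}
```

Without such an ordering discipline, one thread holding (3,3) while waiting for (3,2), and another holding (3,2) while waiting for (3,3), block each other forever.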
30. Common Issues
• Lack of Progress
– Deadlock
– Livelock
• Race Condition
– Data race
– Atomicity violation
• Performance
– Sequential bottlenecks
– False sharing
31. Basic Concepts
Shared Memory with Threads and Locks
• Threads
• Synchronization
• No safety guarantees
– Data Races
– Deadlocks
P1.9 The Linux Scheduler: A Decade of Wasted Cores, J.-P. Lozi et al.
P2.1 Optimistic Concurrency with OPTIK, R. Guerraoui, V. Trigonakis
P2.9 OCTET: Capturing and Controlling Cross-Thread Dependences Efficiently, M. Bond et al.
P2.10 Efficient and Thread-Safe Objects for Dynamically-Typed Languages, B. Daloze et al.
Questions?
36. Transactional Memory
Simple Programming Model
• No Data Races
(within transactions)
• No Deadlocks
Issues
• Performance overhead
• Still experimental
• Livelocks
• Inter-transactional
race conditions
• I/O semantics
37. Some Issues
atomic {
  dataArray = getData();
  fork { compute(dataArray[0]); }
  compute(dataArray[1]);
}
P2.2 Transactional Tasks: Parallelism in Software Transactions, J. Swalens et al.
P1.1 Transactional Data Structure Libraries, A. Spiegelman et al.
P1.2 Type-Aware Transactions for Faster Concurrent Code, N. Herman et al.
What happens to the forked thread when the transaction aborts?
38. Channel-based Communication
coordChannel ! (#moveLeft, snake)
for i in players():
  msg ? coordChannels[i]
  match msg:
    (#moveLeft, snake):
      board[…,…] = …
[Diagram: Player Threads send on channels; the Coordinator Thread receives]
High-level communication, but no safety guarantees
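The `!` (send) and `?` (receive) pseudocode above maps directly onto a blocking queue per channel on the JVM. A sketch under that assumption (the `Move` class and its field names are illustrative, not from a real API):

```java
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;

public class ChannelDemo {
    // A message like (#moveLeft, snake) from the slide.
    static class Move {
        final String kind, snake;
        Move(String kind, String snake) { this.kind = kind; this.snake = snake; }
    }

    static String exchange() throws InterruptedException {
        BlockingQueue<Move> coordChannel = new LinkedBlockingQueue<>();

        // Player thread: coordChannel ! (#moveLeft, snake)
        Thread player = new Thread(() -> {
            try {
                coordChannel.put(new Move("moveLeft", "snake1"));
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });
        player.start();

        // Coordinator thread: msg ? coordChannel, then dispatch on the tag.
        Move msg = coordChannel.take();   // blocks until a message arrives
        player.join();
        if (msg.kind.equals("moveLeft")) {
            return msg.snake + " moves left";
        }
        return "unknown message";
    }

    public static void main(String[] args) throws InterruptedException {
        System.out.println(exchange());
    }
}
```

The queue gives the coordination, but nothing prevents a player from mutating shared state directly; the model itself adds no safety guarantees.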
39. Coordinating Threads
Transactional Memory
• Transactions
• Simple Programming Model
• Practical Issues
Channel/Message Communication
• Explicit coordination
– Channels or message sending
– Higher abstraction level
• No safety guarantees
P1.4 Why Do Scala Developers Mix the Actor Model with other Concurrency Models?, S. Tasharofi et al.
P1.6 The Asynchronous Partitioned Global Address Space Model, V. Saraswat et al. (conc-model, AMP'10)
Questions?
43. Many Many Variations
• Channel based
– Communicating Sequential Processes
• Message based
– Actor models
P1.3 43 Years of Actors: a Taxonomy of Actor Models and Their Key Properties, J. De Koster et al.
47. Message-based Communication
[Diagram: the Main Program creates a Board actor and Snake actors; Player actors asynchronously send messages to the Board actor]

Player Actor:
  board <- moveLeft(snake)

Board Actor:
  class Board {
    private array;
    public moveLeft(snake) {
      array[snake.x][snake.y] = ...
    }
  }

Main Program:
  actors.create(Board)
  actors.create(Snake)
  actors.create(Snake)
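A minimal way to get this actor-style isolation on the JVM is to confine the board's state to a single-threaded executor acting as the mailbox loop. This is a sketch only (all names are made up; real actor libraries add supervision, replies, and more):

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

public class BoardActor {
    // State is confined to the actor: only the executor's single thread touches it.
    private final int[][] array = new int[4][4];
    private final ExecutorService mailbox = Executors.newSingleThreadExecutor();

    // Asynchronous send: enqueue the message and return immediately,
    // like board <- moveLeft(snake) on the slide.
    public void moveLeft(int x, int y) {
        mailbox.execute(() -> array[x][y] = 1);   // messages processed one at a time
    }

    public void shutdown() throws InterruptedException {
        mailbox.shutdown();                        // process remaining messages, then stop
        mailbox.awaitTermination(1, TimeUnit.SECONDS);
    }

    // For inspection after shutdown only; a real actor would reply with a message.
    public int cell(int x, int y) { return array[x][y]; }

    public static void main(String[] args) throws InterruptedException {
        BoardActor board = new BoardActor();
        board.moveLeft(2, 3);   // async send
        board.shutdown();
        System.out.println(board.cell(2, 3));
    }
}
```

Because all mutations run sequentially on one thread, there are no data races on the board, which is exactly the isolation property the next slide contrasts with the remaining hazards.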
48. Communicating Isolates
Message or Channel Based
• Explicit communication
• No shared memory
• Still potential for
– Behavioral deadlocks
– Livelocks
– Bad message interleavings
– Message protocol violations
P1.3 43 Years of Actors: a Taxonomy of Actor Models and Their Key Properties, J. De Koster et al.
P1.11 Distributed Debugging for Mobile Networks, E. Gonzalez Boix et al. (tooling, JSS'14)
Questions?
51. Fork/Join with Work-Stealing
• Recursive
divide-and-conquer
• Automatic and efficient
parallel scheduling
• Widely available for C++,
Java, and .NET
Blumofe, R. D.; Joerg, C. F.; Kuszmaul, B. C.; Leiserson, C. E.; Randall, K. H. & Zhou, Y. (1995),
'Cilk: An Efficient Multithreaded Runtime System', SIGPLAN Not. 30 (8), 207-216.
52. Typical Applications
• Recursive Algorithms¹
– Mergesort
– List and tree traversals
• Parallel prefix, pack, and sorting problems²
• Irregular and unbalanced
computation
– On directed acyclic graphs
(DAGs)
– Ideally tree-shaped
1) More material can be found at: http://homes.cs.washington.edu/~djg/teachingMaterials/spac/
2) Prefix Sums and Their Applications: http://www.cs.cmu.edu/~guyb/papers/Ble93.pdf
53. Tiny Example: Summing a large Array
• Simple array with numbers
• Recursively divide
– Every divide step is a parallel fork
• Then do the additions
– Every combine step is a join
Note: This example is academic, and could be better expressed with a parallel map/reduce
library, such as Scala’s Parallel Collections, Java 8 Streams, or Microsoft’s PLINQ.
[Diagram: the array 4 5 7 2 4 9 6 5 is recursively halved (each split is a fork), then the partial sums 9 9 13 11 → 18 24 → 42 are combined (each combine is a join)]
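The diagram above maps directly onto Java's fork/join framework (see P1.5). A sketch summing the same eight numbers with a RecursiveTask; the sequential cutoff of one element is purely for illustration, real code uses much larger cutoffs:

```java
import java.util.concurrent.ForkJoinPool;
import java.util.concurrent.RecursiveTask;

public class SumTask extends RecursiveTask<Integer> {
    private final int[] a;
    private final int lo, hi;   // half-open range [lo, hi)

    SumTask(int[] a, int lo, int hi) {
        this.a = a; this.lo = lo; this.hi = hi;
    }

    @Override
    protected Integer compute() {
        if (hi - lo <= 1) return a[lo];            // base case: a single element
        int mid = (lo + hi) / 2;
        SumTask left = new SumTask(a, lo, mid);
        left.fork();                                // divide: left half may run on another worker
        int rightSum = new SumTask(a, mid, hi).compute();
        return left.join() + rightSum;              // combine: join the forked half
    }

    public static void main(String[] args) {
        int[] numbers = {4, 5, 7, 2, 4, 9, 6, 5};   // the array from the diagram
        int sum = ForkJoinPool.commonPool()
                              .invoke(new SumTask(numbers, 0, numbers.length));
        System.out.println(sum);                    // 42, matching the diagram
    }
}
```

Work-stealing in the pool balances the subtasks across workers automatically, which is why the idle-worker problem never appears in the source code.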
54. Data Parallelism with Fork/Join
• Parallel programming technique
• Recursive divide-and-conquer
• Automatic and efficient load-balancing
P1.5 A Java Fork/Join Framework, D. Lea (conc-model, runtime, Java'00)
59. Topics of Interest
• High-level language concurrency models
– Actors, Communicating Sequential Processes, STM, Stream Processing, ...
• Tooling
– Debugging
– Profiling
• Implementation and runtime systems
– Communication mechanisms
– Data/object representation
– System-level aspects
• Big Data Frameworks
– Programming models
– Runtime level problems
60. Papers without Artifacts
P1.1 Transactional Data Structure Libraries, A. Spiegelman et al. (conc-model, PLDI'16)
P1.2 Type-Aware Transactions for Faster Concurrent Code, N. Herman et al. (conc-model, runtime, EuroSys'16)
P1.3 43 Years of Actors: a Taxonomy of Actor Models and Their Key Properties, J. De Koster et al. (conc-model, Agere'16)
P1.4 Why Do Scala Developers Mix the Actor Model with other Concurrency Models?, S. Tasharofi et al. (conc-model, ECOOP'13)
P1.5 A Java Fork/Join Framework, D. Lea (conc-model, runtime, Java'00)
P1.6 The Asynchronous Partitioned Global Address Space Model, V. Saraswat et al. (conc-model, AMP'10)
61. Papers without Artifacts
P1.7 Pydron: Semi-Automatic Parallelization for Multi-Core and the Cloud, S. C. Müller et al. (conc-model, runtime, OSDI'15)
P1.8 Fast Splittable Pseudorandom Number Generators, G. L. Steele et al. (runtime, OOPSLA'14)
P1.9 The Linux Scheduler: A Decade of Wasted Cores, J.-P. Lozi et al. (runtime, EuroSys'15)
P1.10 Application-Assisted Live Migration of Virtual Machines with Java Applications, K.-Y. Hou et al. (runtime, EuroSys'15)
P1.11 Distributed Debugging for Mobile Networks, E. Gonzalez Boix et al. (tooling, JSS'14)
62. Papers with Artifacts
P2.1 Optimistic Concurrency with OPTIK, R. Guerraoui, V. Trigonakis (conc-model, PPoPP'16)
P2.2 Transactional Tasks: Parallelism in Software Transactions, J. Swalens et al. (conc-model, ECOOP'16)
P2.3 StreamJIT: A Commensal Compiler for High-Performance Stream Programming, J. Bosboom et al. (conc-model, runtime, OOPSLA'14)
P2.4 An Efficient Synchronization Mechanism for Multi-core Systems, M. Aldinucci et al. (conc-model, runtime, EuroPar'12)
P2.5 Parallel Parsing Made Practical, A. Barenghi et al. (runtime, SCP'15)
63. Papers with Artifacts
P2.6 SparkR: Scaling R Programs with Spark, S. Venkataraman et al. (conc-model, bigdata, SIGMOD'16)
P2.7 Spark SQL: Relational Data Processing in Spark, M. Armbrust et al. (bigdata, runtime, VLDB'14)
P2.8 Twitter Heron: Stream Processing at Scale, S. Kulkarni et al. (bigdata, SIGMOD'15)
P2.9 OCTET: Capturing and Controlling Cross-Thread Dependences Efficiently, M. D. Bond et al. (tooling, OOPSLA'13)
P2.10 Efficient and Thread-Safe Objects for Dynamically-Typed Languages, B. Daloze et al. (runtime, OOPSLA'16)
Editor's notes
Talk: 18min + 5min questions
Multicore is everywhere
Not just single-processor systems here; workstations usually have 2 processors, servers even more
Embedded systems already use manycore processors
If you buy a notebook/computer today, it is multicore
GHz == consumed power == produced heat
Cooling too complex; no way to put such chips in portable devices
Why do we need to? So, why manycore then?
Unfortunately, CPUs are not becoming faster anymore
Reached a peak in 2005; now CPUs are actually slower (simplifying somewhat)
Notes:
- show graph 1990, 2000, 2005, 2010 GHz count + CPUs red line power-wall
- '89: Intel486™ DX Processor: 50, 33, 25 MHz
- November 1, 1995, Intel® Pentium® Pro Processor, 200, 180, 166, 150 MHz
- November 20, 2000, Intel® Pentium® 4 Processor, 1.50 GHz, 1.40 GHz
- February, 2005: Intel® Pentium® 4 Processor Extreme Edition supporting HT Technology
3.80 GHz (570)
- 3.33 GHz (with boost to 3.6 GHz) Intel® Core™ i7-980X processor Extreme Edition
- decreasing GHz a bit and putting another core on the chip allows keeping power consumption stable
Theoretical speedup is times 1.8
but cores have lower sequential performance
ENIAC
AI players can consume as much CPU as they like
Presentation can be done on different core
- Efficient load balancing
Good fit for tree recursion and irregular computational complexity