Parallel Programming
1. Learning and Development
Presents
OPEN TALK SERIES
A series of illuminating talks and interactions that open our minds to new ideas
and concepts; that make us look for newer or better ways of doing what we do;
or that point us to exciting things we have never done before. A range of topics on
Technology, Business, Fun and Life.
Be part of the learning experience at Aditi.
Join the talks. It's free. Free as in freedom at work, not free beer.
Speak at these events. Or bring an expert/friend to talk.
Mail LEAD with topic and availability.
3. Introduction to Parallel Computing
• The challenge
– Provide the abstractions, programming paradigms, and algorithms needed
to effectively design, implement, and maintain applications that exploit
the parallelism provided by the underlying hardware in order to solve
modern problems.
6. Multi-core CPU chip
• The cores fit on a single processor socket
• Also called CMP (Chip Multi-Processor)
[Diagram: four cores (core 1 through core 4) side by side on a single chip]
7. The cores run in parallel
[Diagram: threads 1 through 4, each running on its own core (core 1 through core 4)]
8. Within each core, threads are time-sliced
(just like on a uniprocessor)
[Diagram: several threads time-sliced within each of cores 1 through 4]
9. Instruction-level parallelism
• Parallelism at the machine-instruction level
• The processor can re-order and pipeline instructions, split them into
micro-instructions, perform aggressive branch prediction, and so on
• Instruction-level parallelism enabled the rapid increases in processor
speed over the last 15 years
11. Thread-level parallelism (TLP)
• This is parallelism on a coarser scale
• A server can serve each client in a separate thread
(e.g. a web server or a database server)
• A computer game can do AI, graphics, and physics
in three separate threads
• Single-core superscalar processors cannot fully exploit TLP
• Multi-core architectures are the next step in processor
evolution: they explicitly exploit TLP
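The game example above can be sketched with one task per coarse-grained activity; on a multi-core machine each task can land on its own core. The `RunAI`/`RunPhysics`/`RunGraphics` workloads below are placeholders, not real game code:

```csharp
using System;
using System.Threading.Tasks;

class GameLoopSketch
{
    // Hypothetical per-frame workloads; names and return values are illustrative only.
    static int RunAI()       => 1;
    static int RunPhysics()  => 2;
    static int RunGraphics() => 3;

    public static int RunFrame()
    {
        // Thread-level parallelism: three independent, coarse-grained
        // activities run concurrently, one per task.
        Task<int> ai       = Task.Run(RunAI);
        Task<int> physics  = Task.Run(RunPhysics);
        Task<int> graphics = Task.Run(RunGraphics);

        Task.WaitAll(ai, physics, graphics);
        return ai.Result + physics.Result + graphics.Result;
    }

    static void Main() => Console.WriteLine(RunFrame()); // prints 6
}
```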
12. A technique complementary to multi-core:
Simultaneous multithreading
• Problem addressed: the processor pipeline can get stalled:
– Waiting for the result of a long floating point
(or integer) operation
– Waiting for data to arrive from memory
• While a thread waits, the other execution units sit unused
[Diagram: processor pipeline with L1 D-Cache and D-TLB, integer and
floating point units, schedulers, uop queues, rename/alloc, BTB, trace
cache, uCode ROM, decoder, bus, and BTB and I-TLB. Source: Intel]
13. Simultaneous multithreading (SMT)
• Permits multiple independent threads to execute
SIMULTANEOUSLY on the SAME core
• Weaving together multiple “threads”
on the same core
• Example: if one thread is waiting for a floating
point operation to complete, another thread can
use the integer units
14. Without SMT, only a single thread can
run at any given time
[Diagram: the pipeline with Thread 1 (a floating point operation) as the
only thread occupying the execution units]
15. Without SMT, only a single thread can
run at any given time
[Diagram: the same pipeline with Thread 2 (an integer operation) as the
only thread occupying the execution units]
16. SMT processor: both threads can run
concurrently
[Diagram: an SMT pipeline with Thread 1 (floating point) and Thread 2
(integer) sharing the core's execution units at the same time]
17. But: Can’t simultaneously use the
same functional unit
[Diagram: Thread 1 and Thread 2 both trying to issue to the integer unit.
This scenario is impossible with SMT on a single core (assuming a single
integer unit).]
18. SMT is not a “true” parallel processor
• Enables better thread throughput (e.g. gains of up to 30%)
• The OS and applications perceive each
simultaneous thread as a separate
“virtual processor”
• The chip has only a single copy
of each resource
• Compare to multi-core:
each core has its own copy of its resources
19. Multi-core:
threads can run on separate cores
[Diagram: two complete pipelines, one per core; Thread 1 runs on core 1
while Thread 2 runs on core 2]
20. Multi-core:
threads can run on separate cores
[Diagram: the same two cores, now running Thread 3 and Thread 4]
21. Combining Multi-core and SMT
• Cores can be SMT-enabled (or not)
• The different combinations:
– Single-core, non-SMT: standard uniprocessor
– Single-core, with SMT
– Multi-core, non-SMT
– Multi-core, with SMT: our fish machines
• The number of SMT threads:
2, 4, or sometimes 8 simultaneous threads
• Intel calls them “hyper-threads”
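In .NET, the number of logical processors the OS exposes, which counts SMT “virtual processors” as well as physical cores, can be queried at run time. A minimal sketch:

```csharp
using System;

class ProcessorInfo
{
    // Environment.ProcessorCount reports logical processors: on a
    // dual-core chip with 2-way SMT this would be 4, since the OS sees
    // each hardware thread as a separate virtual processor.
    public static int LogicalProcessors() => Environment.ProcessorCount;

    static void Main() =>
        Console.WriteLine($"Logical processors: {LogicalProcessors()}");
}
```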
22. SMT Dual-core: all four threads can run
concurrently
[Diagram: two SMT-enabled pipelines; Threads 1 and 2 share core 1 while
Threads 3 and 4 share core 2]
24. Designs with private L2 caches
[Diagram: two cache hierarchies. Left: CORE0 and CORE1 each with a
private L1 and a private L2 cache in front of memory. Right: CORE0 and
CORE1 each with private L1, L2, and L3 caches in front of memory.]
• Both L1 and L2 are private
(examples: AMD Opteron, AMD Athlon, Intel Pentium D)
• A design with L3 caches
(example: Intel Itanium 2)
26. Private vs shared caches
• Advantages of private caches:
– They are closer to the core, so access is faster
– Less contention between cores
• Advantages of shared caches:
– Threads on different cores can share the
same cached data
– More cache space is available when a single (or a
few) high-demand thread runs on the
system
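One practical consequence of per-core private caches is false sharing: two threads writing to different fields that happen to sit on the same cache line force that line to bounce between the cores' private caches. The sketch below only demonstrates the setup; the slowdown itself is machine-dependent and is not asserted here:

```csharp
using System;
using System.Diagnostics;
using System.Threading.Tasks;

class FalseSharingSketch
{
    // Two counters likely laid out on the same cache line: writes from
    // different cores ping-pong the line between their private caches.
    class SharedLine { public long a; public long b; }

    public static (long, long) Run(long iters)
    {
        var s = new SharedLine();
        var t1 = Task.Run(() => { for (long i = 0; i < iters; i++) s.a++; });
        var t2 = Task.Run(() => { for (long i = 0; i < iters; i++) s.b++; });
        Task.WaitAll(t1, t2);
        return (s.a, s.b);
    }

    static void Main()
    {
        var sw = Stopwatch.StartNew();
        var (a, b) = Run(10_000_000);
        sw.Stop();
        // Each counter is touched by only one thread, so both results are
        // correct; only performance suffers from the coherence traffic.
        Console.WriteLine($"a={a}, b={b}, elapsed={sw.ElapsedMilliseconds} ms");
    }
}
```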
28. Parallel Architectures
• SIMD
– Single instruction stream, multiple data stream
[Diagram: a single control unit issuing one instruction stream over an
interconnect to multiple processing units, each operating on its own
data stream]
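SIMD can also be used from .NET via `System.Numerics.Vector<T>`, which applies one operation to as many elements as the hardware vector width allows. A minimal element-wise addition sketch:

```csharp
using System;
using System.Numerics;

class SimdSketch
{
    // Single instruction, multiple data: one vector add processes
    // Vector<float>.Count elements at a time.
    public static float[] Add(float[] x, float[] y)
    {
        var result = new float[x.Length];
        int w = Vector<float>.Count;      // lanes per hardware vector
        int i = 0;
        for (; i <= x.Length - w; i += w)
        {
            var vx = new Vector<float>(x, i);
            var vy = new Vector<float>(y, i);
            (vx + vy).CopyTo(result, i);  // one instruction, w data elements
        }
        for (; i < x.Length; i++)         // scalar tail for leftovers
            result[i] = x[i] + y[i];
        return result;
    }

    static void Main()
    {
        var r = Add(new float[] { 1, 2, 3, 4, 5 },
                    new float[] { 10, 20, 30, 40, 50 });
        Console.WriteLine(string.Join(",", r)); // prints 11,22,33,44,55
    }
}
```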
29. Parallel Architectures
• MIMD
– Multiple instruction stream, multiple data stream
[Diagram: multiple processing/control units on an interconnect, each
fetching its own instruction stream and operating on its own data stream]
35. Task-based Programming
ThreadPool summary:
    ThreadPool.QueueUserWorkItem(…);
System.Threading.Tasks — starting:
    Task.Factory.StartNew(…);
Parent/child:
    var p = new Task(() => {
        var t = new Task(…);
    });
Continue/Wait/Cancel:
    Task t = …;
    Task p = t.ContinueWith(…);
    t.Wait(2000);
    t.Cancel();
Tasks with results:
    Task<int> f = new Task<int>(() => C());
    …
    int result = f.Result;
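A minimal runnable sketch combining these APIs — starting a task, chaining a continuation, waiting with a timeout, and reading a result. The `C()` computation here is a stand-in, not from the slides:

```csharp
using System;
using System.Threading.Tasks;

class TaskSketch
{
    static int C() => 21;  // stand-in computation

    public static int Demo()
    {
        // Start a task on the thread pool, then continue with a second
        // task that transforms the first task's result.
        Task<int> f = Task.Factory.StartNew(C);
        Task<int> doubled = f.ContinueWith(t => t.Result * 2);

        doubled.Wait(2000);      // wait, with a 2000 ms timeout
        return doubled.Result;   // blocks until the result is available
    }

    static void Main() => Console.WriteLine(Demo()); // prints 42
}
```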
36. Coordination Data Structures (1 of 3)
Concurrent collections:
• BlockingCollection<T>
• ConcurrentBag<T>
• ConcurrentDictionary<TKey,TValue>
• ConcurrentLinkedList<T>
• ConcurrentQueue<T>
• ConcurrentStack<T>
• IProducerConsumerCollection<T>
• Partitioner, Partitioner<T>, OrderablePartitioner<T>
[Diagram: producers (P) and consumers (C) around a bounded collection;
producers block if it is full, consumers block if it is empty]
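The blocking behavior in the diagram can be sketched with `BlockingCollection<T>`: with a bounded capacity, `Add` blocks when the collection is full, and `GetConsumingEnumerable` blocks when it is empty until `CompleteAdding` is called:

```csharp
using System;
using System.Collections.Concurrent;
using System.Threading.Tasks;

class ProducerConsumerSketch
{
    public static int Run()
    {
        // Bounded capacity 2: the producer blocks when the collection
        // is full; the consumer blocks when it is empty.
        using var queue = new BlockingCollection<int>(boundedCapacity: 2);

        var producer = Task.Run(() =>
        {
            for (int i = 1; i <= 5; i++) queue.Add(i);
            queue.CompleteAdding();   // signal: no more items coming
        });

        int sum = 0;
        foreach (int item in queue.GetConsumingEnumerable())
            sum += item;              // blocks until an item is available

        producer.Wait();
        return sum;                   // 1 + 2 + 3 + 4 + 5
    }

    static void Main() => Console.WriteLine(Run()); // prints 15
}
```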