Parallel Programming
1. Learning and Development
Presents
OPEN TALK SERIES
A series of illuminating talks and interactions that open our minds to new ideas
and concepts; that make us look for newer or better ways of doing what we do;
or that point us to exciting things we have never done before. A range of topics on
Technology, Business, Fun and Life.
Be part of the learning experience at Aditi.
Join the talks. It's free. Free as in freedom at work, not free beer.
Speak at these events. Or bring an expert/friend to talk.
Mail LEAD with topic and availability.
3. Introduction to Parallel Computing
• The challenge
– Provide the abstractions, programming paradigms, and algorithms needed
to effectively design, implement, and maintain applications that exploit
the parallelism provided by the underlying hardware in order to solve
modern problems.
6. Multi-core CPU chip
• The cores fit on a single processor socket
• Also called CMP (Chip Multi-Processor)
[Diagram: four cores (core 1 through core 4) side by side on a single chip]
7. The cores run in parallel
[Diagram: threads 1 through 4, each running on its own core (core 1 through core 4)]
8. Within each core, threads are time-sliced
(just like on a uniprocessor)
[Diagram: several threads time-sliced within each of cores 1 through 4]
9. Instruction-level parallelism
• Parallelism at the machine-instruction level
• The processor can re-order and pipeline instructions, split them into
micro-instructions, perform aggressive branch prediction, and so on
• Instruction-level parallelism enabled the rapid increases in processor
speed over the last 15 years
11. Thread-level parallelism (TLP)
• This is parallelism on a coarser scale
• A server can serve each client in a separate thread
(e.g. a web server or a database server)
• A computer game can do AI, graphics, and physics
in three separate threads
• Single-core superscalar processors cannot fully exploit TLP
• Multi-core architectures are the next step in processor
evolution: they explicitly exploit TLP
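The game example above can be sketched with one task per coarse-grained activity; on a multi-core machine each task can land on its own core. The `RunAI`/`RunPhysics`/`RunGraphics` workloads below are placeholders, not real game code:

```csharp
using System;
using System.Threading.Tasks;

class GameLoopSketch
{
    // Hypothetical per-frame workloads; names and return values are illustrative only.
    static int RunAI()       => 1;
    static int RunPhysics()  => 2;
    static int RunGraphics() => 3;

    public static int RunFrame()
    {
        // Thread-level parallelism: three independent, coarse-grained
        // activities run concurrently, one per task.
        Task<int> ai       = Task.Run(RunAI);
        Task<int> physics  = Task.Run(RunPhysics);
        Task<int> graphics = Task.Run(RunGraphics);

        Task.WaitAll(ai, physics, graphics);
        return ai.Result + physics.Result + graphics.Result;
    }

    static void Main() => Console.WriteLine(RunFrame()); // prints 6
}
```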
12. A technique complementary to multi-core:
Simultaneous multithreading
• Problem addressed: the processor pipeline can get stalled:
– Waiting for the result of a long floating point
(or integer) operation
– Waiting for data to arrive from memory
• While a thread waits, the other execution units sit unused
[Diagram: processor pipeline with L1 D-Cache and D-TLB, integer and
floating point units, schedulers, uop queues, rename/alloc, BTB, trace
cache, uCode ROM, decoder, bus, and BTB and I-TLB. Source: Intel]
13. Simultaneous multithreading (SMT)
• Permits multiple independent threads to execute
SIMULTANEOUSLY on the SAME core
• Weaving together multiple “threads”
on the same core
• Example: if one thread is waiting for a floating
point operation to complete, another thread can
use the integer units
14. Without SMT, only a single thread can
run at any given time
[Diagram: the pipeline with Thread 1 (a floating point operation) as the
only thread occupying the execution units]
15. Without SMT, only a single thread can
run at any given time
[Diagram: the same pipeline with Thread 2 (an integer operation) as the
only thread occupying the execution units]
16. SMT processor: both threads can run
concurrently
[Diagram: an SMT pipeline with Thread 1 (floating point) and Thread 2
(integer) sharing the core's execution units at the same time]
17. But: Can’t simultaneously use the
same functional unit
[Diagram: Thread 1 and Thread 2 both trying to issue to the integer unit.
This scenario is impossible with SMT on a single core (assuming a single
integer unit).]
18. SMT is not a “true” parallel processor
• Enables better thread throughput (e.g. gains of up to 30%)
• The OS and applications perceive each
simultaneous thread as a separate
“virtual processor”
• The chip has only a single copy
of each resource
• Compare to multi-core:
each core has its own copy of its resources
19. Multi-core:
threads can run on separate cores
[Diagram: two complete pipelines, one per core; Thread 1 runs on core 1
while Thread 2 runs on core 2]
20. Multi-core:
threads can run on separate cores
[Diagram: the same two cores, now running Thread 3 and Thread 4]
21. Combining Multi-core and SMT
• Cores can be SMT-enabled (or not)
• The different combinations:
– Single-core, non-SMT: standard uniprocessor
– Single-core, with SMT
– Multi-core, non-SMT
– Multi-core, with SMT: our fish machines
• The number of SMT threads:
2, 4, or sometimes 8 simultaneous threads
• Intel calls them “hyper-threads”
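In .NET, the number of logical processors the OS exposes, which counts SMT “virtual processors” as well as physical cores, can be queried at run time. A minimal sketch:

```csharp
using System;

class ProcessorInfo
{
    // Environment.ProcessorCount reports logical processors: on a
    // dual-core chip with 2-way SMT this would be 4, since the OS sees
    // each hardware thread as a separate virtual processor.
    public static int LogicalProcessors() => Environment.ProcessorCount;

    static void Main() =>
        Console.WriteLine($"Logical processors: {LogicalProcessors()}");
}
```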
22. SMT Dual-core: all four threads can run
concurrently
[Diagram: two SMT-enabled pipelines; Threads 1 and 2 share core 1 while
Threads 3 and 4 share core 2]
24. Designs with private L2 caches
[Diagram: two cache hierarchies. Left: CORE0 and CORE1 each with a
private L1 and a private L2 cache in front of memory. Right: CORE0 and
CORE1 each with private L1, L2, and L3 caches in front of memory.]
• Both L1 and L2 are private
(examples: AMD Opteron, AMD Athlon, Intel Pentium D)
• A design with L3 caches
(example: Intel Itanium 2)
26. Private vs shared caches
• Advantages of private caches:
– They are closer to the core, so access is faster
– Less contention between cores
• Advantages of shared caches:
– Threads on different cores can share the
same cached data
– More cache space is available when a single (or a
few) high-demand thread runs on the
system
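One practical consequence of per-core private caches is false sharing: two threads writing to different fields that happen to sit on the same cache line force that line to bounce between the cores' private caches. The sketch below only demonstrates the setup; the slowdown itself is machine-dependent and is not asserted here:

```csharp
using System;
using System.Diagnostics;
using System.Threading.Tasks;

class FalseSharingSketch
{
    // Two counters likely laid out on the same cache line: writes from
    // different cores ping-pong the line between their private caches.
    class SharedLine { public long a; public long b; }

    public static (long, long) Run(long iters)
    {
        var s = new SharedLine();
        var t1 = Task.Run(() => { for (long i = 0; i < iters; i++) s.a++; });
        var t2 = Task.Run(() => { for (long i = 0; i < iters; i++) s.b++; });
        Task.WaitAll(t1, t2);
        return (s.a, s.b);
    }

    static void Main()
    {
        var sw = Stopwatch.StartNew();
        var (a, b) = Run(10_000_000);
        sw.Stop();
        // Each counter is touched by only one thread, so both results are
        // correct; only performance suffers from the coherence traffic.
        Console.WriteLine($"a={a}, b={b}, elapsed={sw.ElapsedMilliseconds} ms");
    }
}
```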
28. Parallel Architectures
• SIMD
– Single instruction stream, multiple data stream
[Diagram: a single control unit issuing one instruction stream over an
interconnect to multiple processing units, each operating on its own
data stream]
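SIMD can also be used from .NET via `System.Numerics.Vector<T>`, which applies one operation to as many elements as the hardware vector width allows. A minimal element-wise addition sketch:

```csharp
using System;
using System.Numerics;

class SimdSketch
{
    // Single instruction, multiple data: one vector add processes
    // Vector<float>.Count elements at a time.
    public static float[] Add(float[] x, float[] y)
    {
        var result = new float[x.Length];
        int w = Vector<float>.Count;      // lanes per hardware vector
        int i = 0;
        for (; i <= x.Length - w; i += w)
        {
            var vx = new Vector<float>(x, i);
            var vy = new Vector<float>(y, i);
            (vx + vy).CopyTo(result, i);  // one instruction, w data elements
        }
        for (; i < x.Length; i++)         // scalar tail for leftovers
            result[i] = x[i] + y[i];
        return result;
    }

    static void Main()
    {
        var r = Add(new float[] { 1, 2, 3, 4, 5 },
                    new float[] { 10, 20, 30, 40, 50 });
        Console.WriteLine(string.Join(",", r)); // prints 11,22,33,44,55
    }
}
```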
29. Parallel Architectures
• MIMD
– Multiple instruction stream, multiple data stream
[Diagram: multiple processing/control units on an interconnect, each
fetching its own instruction stream and operating on its own data stream]
35. Task-based Programming
ThreadPool summary:
    ThreadPool.QueueUserWorkItem(…);
System.Threading.Tasks — starting:
    Task.Factory.StartNew(…);
Parent/child:
    var p = new Task(() => {
        var t = new Task(…);
    });
Continue/Wait/Cancel:
    Task t = …;
    Task p = t.ContinueWith(…);
    t.Wait(2000);
    t.Cancel();
Tasks with results:
    Task<int> f = new Task<int>(() => C());
    …
    int result = f.Result;
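A minimal runnable sketch combining these APIs — starting a task, chaining a continuation, waiting with a timeout, and reading a result. The `C()` computation here is a stand-in, not from the slides:

```csharp
using System;
using System.Threading.Tasks;

class TaskSketch
{
    static int C() => 21;  // stand-in computation

    public static int Demo()
    {
        // Start a task on the thread pool, then continue with a second
        // task that transforms the first task's result.
        Task<int> f = Task.Factory.StartNew(C);
        Task<int> doubled = f.ContinueWith(t => t.Result * 2);

        doubled.Wait(2000);      // wait, with a 2000 ms timeout
        return doubled.Result;   // blocks until the result is available
    }

    static void Main() => Console.WriteLine(Demo()); // prints 42
}
```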
36. Coordination Data Structures (1 of 3)
Concurrent collections:
• BlockingCollection<T>
• ConcurrentBag<T>
• ConcurrentDictionary<TKey,TValue>
• ConcurrentLinkedList<T>
• ConcurrentQueue<T>
• ConcurrentStack<T>
• IProducerConsumerCollection<T>
• Partitioner, Partitioner<T>, OrderablePartitioner<T>
[Diagram: producers (P) and consumers (C) around a bounded collection;
producers block if it is full, consumers block if it is empty]
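The blocking behavior in the diagram can be sketched with `BlockingCollection<T>`: with a bounded capacity, `Add` blocks when the collection is full, and `GetConsumingEnumerable` blocks when it is empty until `CompleteAdding` is called:

```csharp
using System;
using System.Collections.Concurrent;
using System.Threading.Tasks;

class ProducerConsumerSketch
{
    public static int Run()
    {
        // Bounded capacity 2: the producer blocks when the collection
        // is full; the consumer blocks when it is empty.
        using var queue = new BlockingCollection<int>(boundedCapacity: 2);

        var producer = Task.Run(() =>
        {
            for (int i = 1; i <= 5; i++) queue.Add(i);
            queue.CompleteAdding();   // signal: no more items coming
        });

        int sum = 0;
        foreach (int item in queue.GetConsumingEnumerable())
            sum += item;              // blocks until an item is available

        producer.Wait();
        return sum;                   // 1 + 2 + 3 + 4 + 5
    }

    static void Main() => Console.WriteLine(Run()); // prints 15
}
```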