Introduction to
Parallel Computing


     Jörn Dinkla
     http://www.dinkla.com

          Version 1.1
Dipl.-Inform. Jörn Dinkla
 Java (J2SE, JEE)
 Programming Languages
   Scala, Groovy, Haskell
 Parallel Computing
   GPU Computing
 Model driven
 Eclipse-Plugins
Overview
 Progress in computing
 Traditional Hard- and Software
 Theoretical Computer Science
   Algorithms
   Machines
   Optimization
 Parallelization
 Parallel Hard- and Software
Progress in Computing
1. New applications
   Not feasible before
   Not needed before
   Not possible before
2. Better applications
    Faster
    More data
    Better quality
      precision, accuracy, exactness
Progress in Computing
 Two ingredients
   Hardware
     Machine(s) to execute program
   Software
     Model / language to formulate program
     Libraries
     Methods
How was progress achieved?
 Hardware
   CPU, memory, disks, networks
   Faster and larger
 Software
   New and better algorithms
   Programming methods and languages
Traditional Hardware
 Von Neumann architecture

   Diagram: CPU, I/O and memory connected by a single shared bus

 John Backus 1977
   “von Neumann bottleneck”: the bus between CPU and memory
   Mitigated by a cache
Improvements
   Increasing Clock Frequency
   Memory Hierarchy / Cache
   Parallelizing ALU
   Pipelining
   Very-long Instruction Words (VLIW)
   Instruction-Level parallelism (ILP)
   Superscalar processors
   Vector data types
   Multithreaded
   Multicore / Manycore
Moore's law
 Guaranteed until 2020
Clock frequency
 No increase since 2005
Physical Limits
 Increasing the clock frequency further means
    Higher energy consumption
    More heat dissipation
 Transistor size has a lower limit

    Faster processors impossible!?
2005
“The Free Lunch Is Over:
   A Fundamental Turn Toward
   Concurrency in Software”

       Herb Sutter
       Dr. Dobb’s Journal, March 2005
Multicore
 Transistor count
    Doubles every 2-3 years
 Calculation speed
    No increase

  Multicore

 Efficient?
How to use the cores?
 Multi-Tasking OS
   Different tasks
 Speeding up same task
     Assume 2 CPUs
     Problem is divided in half
     Each CPU calculates a half
     Time taken is half of the original time?
Traditional Software
 Computation is expressed as an “algorithm”
    “a step-by-step procedure for calculations”
    algorithm = logic + control
 Example
   1.   Open file
   2.   For all records in the file
        1.   Add the salary
   3.   Close file
   4.   Print out the sum of the salaries

 Keywords
    Sequential, Serial, Deterministic
Traditional Software
 Improvements
   Better algorithms
   Programming languages (OO)
   Development methods (agile)
 Limits
   Theoretical Computer Science
   Complexity theory (NP, P, NC)
Architecture
 Simplification: Ignore the bus

   Diagram: the CPU–bus–I/O–memory picture is drawn without the explicit bus
More than one CPU?
 How should they communicate?

   Diagram: two separate CPU–I/O–memory nodes
Message Passing
 Distributed system
 Loose coupling
   Diagram: two CPU–I/O–memory nodes exchanging messages over a network
Shared Memory
 Shared Memory
 Tight coupling

   Diagram: two CPUs attached to one shared memory and I/O
Shared Memory
 Global vs. Local
 Memory hierarchy

   Diagram: two CPUs, each with local memory, connected to an additional shared memory
Overview: Memory
 Unshared Memory
   Message Passing
   Actors
 Shared Memory
   Threads
 Memory hierarchies / hybrid
   Partitioned Global Address Space (PGAS)
 Transactional Memory
Sequential Algorithms
 Random Access Machine (RAM)
   Step by step, deterministic
     PC marks the current instruction:

       int sum = 0
       for i=0 to 4
         sum += mem[i]
       mem[5] = sum

     Memory: addr 0–4 hold 3, 7, 5, 1, 2; addr 5 receives the sum 18
Sequential Algorithms
int sum = 0
for i=0 to 4
  sum += mem[i]
Addr Value   Addr Value   Addr Value   Addr Value   Addr Value   Addr Value
 0     3      0     3      0     3      0     3      0     3      0     3
 1     7      1     7      1     7      1     7      1     7      1     7
 2     5      2     5      2     5      2     5      2     5      2     5
 3     1      3     1      3     1      3     1      3     1      3     1
 4     2      4     2      4     2      4     2      4     2      4     2
 5     0      5     3      5    10      5    15      5    16      5    18
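The trace above corresponds to this runnable Java version — a minimal sketch; the class name is illustrative, and the array plays the role of the RAM's memory cells:

```java
public class SequentialSum {
    public static void main(String[] args) {
        // mem[0..4] hold the inputs, mem[5] receives the result
        int[] mem = {3, 7, 5, 1, 2, 0};
        int sum = 0;
        for (int i = 0; i <= 4; i++) {
            sum += mem[i];  // step by step, deterministic
        }
        mem[5] = sum;
        System.out.println(mem[5]); // prints 18
    }
}
```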
More than one CPU
 How many programs should run?
   One
     In lock-step
        All processors do the same
     In any order
   More than one
     Distributed system
Two Processors
PC 1   int sum = 0              int sum = 0
       for i=0 to 2      PC 2   for i=3 to 4
         sum += mem[i]            sum += mem[i]
       mem[5]= sum              mem[5]= sum
 Lockstep
 Memory access!

   Memory: addr 0–4 hold 3, 7, 5, 1, 2; both CPUs write addr 5, which should end as 18
Flynn's Taxonomy
 1966

                        Instruction
                     Single    Multiple
           Single     SISD      MISD
   Data
          Multiple   SIMD       MIMD
Flynn's Taxonomy
 SISD
   RAM, Von Neumann
 SIMD
   Lockstep, vector processor, GPU
 MISD
   Fault tolerance
 MIMD
   Distributed system
Extension MIMD
 How many programs?

 SPMD
   One program
   Not in lockstep as in SIMD
 MPMD
   Many programs
Processes & Threads
 Process
   Operating System
      Address space
      IPC
   Heavy weight
   Contains 1..* threads
 Thread
   Smallest unit of execution
   Light weight
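A minimal Java sketch of the distinction: one process (the JVM) contains several light-weight threads that share its address space. Class name is illustrative:

```java
import java.util.concurrent.atomic.AtomicInteger;

public class ThreadDemo {
    public static void main(String[] args) throws InterruptedException {
        // Two light-weight threads inside one heavy-weight process,
        // sharing the counter in the process's address space
        AtomicInteger counter = new AtomicInteger(0);
        Thread t1 = new Thread(counter::incrementAndGet);
        Thread t2 = new Thread(counter::incrementAndGet);
        t1.start();
        t2.start();
        t1.join();  // wait for both threads to finish
        t2.join();
        System.out.println(counter.get()); // prints 2
    }
}
```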
Overview: Algorithms
   Sequential
   Parallel
   Concurrent (these notions overlap)
   Distributed
   Randomized
   Quantum
Computer Science
 Theoretical Computer Science
     Studied long before 2005
     1989: Gibbons, Rytter
     1990: Ben-Ari
     1996: Lynch
Gap: Theory and Practice
 Galactic algorithms
 Written for abstract machines
   PRAM, special networks, etc.
 Simplifying assumptions
   No boundaries
   Exact arithmetic
   Infinite memory, network speed, etc.
Sequential algorithms
 Implementing a sequential algorithm
   Machine architecture
   Programming language
   Performance
     Processor, memory and cache speed
   Boundary cases
   Sometimes hard
Parallel algorithms
 Implementing a parallel algorithm
   Adapt algorithm to architecture
      No PRAM or sorting network!
   Problems with shared memory
   Synchronization
   Harder!
Parallelization
 Transforming
   a sequential
   into a parallel algorithm

 Tasks
   Adapt to architecture
   Rewrite
   Test correctness w.r.t. the “golden” sequential code
Granularity
 “Size” of the threads?
   How much computation?
 Coarse vs. fine grain
 Right choice
   Important for good performance
   Algorithm design
Computational thinking
 “… is the thought processes involved
  in formulating problems and their
  solutions so that the solutions are
  represented in a form that can be
  effectively carried out by an
  information-processing agent.”
              Cuny, Snyder, Wing 2010
Computational thinking
 “… is the new literacy of the 21st
  Century.”
               Cuny, Snyder, Wing 2010



 Expert level needed for parallelization!
Problems: Shared Memory
 Destructive updates
   i += 1
 Parallel, independent processes
   How do the others know that i increased?
   Synchronization needed
      Memory barrier
      Complicated for beginners
Problems: Shared Memory

PC 1   int sum = 0              int sum = 0
       for i=0 to 2      PC 2   for i=3 to 4
         sum += mem[i]            sum += mem[i]
       mem[5]= sum              mem[5]= sum
 Which one writes first?

   Memory: addr 0–4 hold 3, 7, 5, 1, 2; both CPUs write addr 5, so the final value is not guaranteed to be 18
Problems: Shared Memory

PC 1   int sum = 0              int sum = 0
       for i=0 to 2      PC 2   for i=3 to 4
         sum += mem[i]            sum += mem[i]
       mem[5]= sum
       sync()                   sync()
                                mem[5] += sum


 Synchronization needed
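The two-processor sum with proper synchronization might look like this in Java. An AtomicInteger stands in for the guarded mem[5] — an illustrative substitution for the sync() barrier in the pseudocode:

```java
import java.util.concurrent.atomic.AtomicInteger;

public class ParallelSum {
    public static void main(String[] args) throws InterruptedException {
        int[] mem = {3, 7, 5, 1, 2};
        AtomicInteger total = new AtomicInteger(0); // plays the role of mem[5]

        // Each thread sums its own half, then combines atomically.
        Thread t1 = new Thread(() -> {
            int sum = 0;
            for (int i = 0; i <= 2; i++) sum += mem[i];
            total.addAndGet(sum);
        });
        Thread t2 = new Thread(() -> {
            int sum = 0;
            for (int i = 3; i <= 4; i++) sum += mem[i];
            total.addAndGet(sum);
        });
        t1.start(); t2.start();
        t1.join(); t2.join();
        System.out.println(total.get()); // prints 18, regardless of ordering
    }
}
```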
Problems: Shared Memory
 The memory barrier
    When is a value actually read or written?
    Optimizing compilers reorder reads and writes

 int a = b + 5
    Read b
    Add 5, store the result in a temporary c
    Write c to a

 Solutions (Java)
    volatile
    java.util.concurrent.atomic
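A small sketch of the volatile solution: the volatile write guarantees the reader eventually observes the update. Class and field names are illustrative:

```java
public class VolatileFlag {
    // Without volatile, the reader thread might never observe the write,
    // because the optimizer may cache the field in a register.
    static volatile boolean done = false;

    public static void main(String[] args) throws InterruptedException {
        Thread reader = new Thread(() -> {
            while (!done) {
                // spin until the writer's update becomes visible
            }
            System.out.println("observed done = true");
        });
        reader.start();
        Thread.sleep(100);  // the writer does some work
        done = true;        // volatile write: visible to the reader
        reader.join();
    }
}
```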
Problems: Shared Memory
 Thread safety
 Reentrant code

  class X {
    int x;
    void inc() { x += 1; }  // read-modify-write: not atomic
  }
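A thread-safe variant of X, using java.util.concurrent.atomic as mentioned above (a sketch; class name is illustrative):

```java
import java.util.concurrent.atomic.AtomicInteger;

class SafeX {
    private final AtomicInteger x = new AtomicInteger(0);

    // incrementAndGet is a single atomic read-modify-write,
    // so concurrent calls cannot lose updates
    void inc() { x.incrementAndGet(); }

    int get() { return x.get(); }
}
```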
Problems: Threads
 Deadlock
   A wants B, B wants A, both waiting
 Starvation
   A wants B, but never gets it
 Race condition
   A writes to mem, B reads/writes mem
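The deadlock pattern ("A wants B, B wants A") is commonly avoided by acquiring locks in one global order; a sketch, with illustrative names:

```java
public class LockOrdering {
    private static final Object lockA = new Object();
    private static final Object lockB = new Object();
    private static int shared = 0;

    // Both threads acquire lockA before lockB. Because every thread uses
    // the same global lock order, "A wants B while B wants A" cannot occur.
    static void task() {
        synchronized (lockA) {
            synchronized (lockB) {
                shared += 1;  // critical section
            }
        }
    }

    public static void main(String[] args) throws InterruptedException {
        Thread t1 = new Thread(LockOrdering::task);
        Thread t2 = new Thread(LockOrdering::task);
        t1.start(); t2.start();
        t1.join(); t2.join();
        System.out.println(shared); // prints 2: both threads finished, no deadlock
    }
}
```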
Shared Mem: Solutions
 Shared mutable state
   Synchronize properly


 Isolated mutable state
   Don't share state


 Immutable or unshared
   Don't mutate state!
Solutions
 Transactional Memory
   Every access within transaction
   See databases
 Actor models
   Message passing
 Immutable state / pure functional
Speedup and Efficiency
 Running time
   T(1) with one processor
   T(n) with n processors
 Speedup
   How much faster?
   S(n) = T(1) / T(n)
Speedup and Efficiency
 Efficiency
   Are all the processors used?
   E(n) = S(n) / n = T(1) / (n * T(n))
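A worked example of the two formulas, with hypothetical timings:

```java
public class SpeedupExample {
    public static void main(String[] args) {
        double t1 = 12.0;  // seconds on 1 processor (hypothetical)
        double tn = 4.0;   // seconds on n = 4 processors (hypothetical)
        int n = 4;
        double speedup = t1 / tn;        // S(4) = 12 / 4 = 3.0
        double efficiency = speedup / n; // E(4) = 3 / 4 = 0.75
        System.out.println("S = " + speedup + ", E = " + efficiency);
    }
}
```

An efficiency below 1 means the processors are not fully used, e.g. due to synchronization overhead.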
Amdahl's Law
 (formula and speedup plot on slides)
 Corollary
   Maximize the parallel part
   Only parallelize when the parallel part is large enough
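The law itself, in its standard form (s is the serial fraction of the running time, so 1 − s can be parallelized over n processors):

```latex
S(n) = \frac{1}{s + \frac{1 - s}{n}}, \qquad \lim_{n \to \infty} S(n) = \frac{1}{s}
```

Even with unlimited processors, the speedup is bounded by 1/s — hence the corollary above.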
P-Completeness
 Is there an efficient parallel version of every algorithm?
   No! Some problems are hard to parallelize
   P-completeness
   Example: the Circuit Value Problem (CVP)
P-Completeness
 (diagram on slide)
Optimization
 What can I achieve?
 When do I stop?
 How many threads should I use?
Optimization
 I/O bound
   Thread is waiting for memory, disk, etc.
 Computation bound
   Thread is calculating the whole time

 Watch processor utilization!
Optimization
 I/O bound
   Use asynchronous/non-blocking I/O
   Increase number of threads
 Computation bound
   Number of threads = Number of cores
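Sizing a pool to the core count for computation-bound work can be sketched in Java as follows (class name is illustrative):

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

public class PoolSizing {
    public static void main(String[] args) throws InterruptedException {
        // For computation-bound work: number of threads = number of cores
        int cores = Runtime.getRuntime().availableProcessors();
        ExecutorService pool = Executors.newFixedThreadPool(cores);
        for (int i = 0; i < cores; i++) {
            pool.submit(() -> { /* compute-bound task */ });
        }
        pool.shutdown();
        pool.awaitTermination(1, TimeUnit.MINUTES);
        System.out.println("ran on " + cores + " worker threads");
    }
}
```

For I/O-bound work the pool would be larger, since threads spend most of their time waiting.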
Processors
 Multicore CPU
 Graphical Processing Unit (GPU)
 Field-Programmable Gate Array
  (FPGA)
GPU Computing
 Finer granularity than CPU
   Specialized processors
   512 cores on a Fermi
 High memory bandwidth (192 GB/s)
CPU vs. GPU
 Chart: performance comparison of CPU and GPU (Source: SGI)
FPGA
 Configurable hardware circuits
 Programmed in Verilog, VHDL
 Now: OpenCL
   Much higher level of abstraction
 Under development, promising
 No published performance results yet (2011/12)
Networks / Cluster
 Combination of
    CPU
    Memory
    Network
    GPU
    FPGA
 Vast possibilities
Example
 Two machines connected by a network
   2 CPUs each, with local caches
   Global memory per machine

   Diagram: in each machine, two CPU–cache pairs share a memory; the machines are linked by the network
Example
 1 CPU with local cache
 2 GPUs with local (“device”) memory
 Connected via shared memory

   Diagram: the CPU and both GPUs each have their own memory
Next Step: Hybrid
 Hybrid / Heterogeneous
   Multi-Core / Many-Core
   Plus special purpose hardware
     GPU
     FPGA
Optimal combination?
 Which network gives the best
  performance?
   Complicated
   Technical restrictions
      Motherboards with 4 × PCI Express x16 slots
      Power consumption
      Cooling
Example: K-Computer
   SPARC64 VIIIfx 2.0GHz
    705,024 cores
   10.51 Petaflop/s
   No GPUs

 #1 2011
Example: Tianhe-1A
    14,336 Xeon X5670
    7,168 Tesla M2050
    2,048 NUDT FT1000
    2.57 Petaflop/s

 #2 2011
Example: HPC at home
 Workstations and blades
   8 x 512 cores = 4096 cores
Frameworks: Shared Mem
 C/C++
     OpenMP
     POSIX Threads (pthreads)
      Intel Threading Building Blocks
     Windows Threads
 Java
   java.util.concurrent
Frameworks: Actors
 C/C++
   Theron
 Java / JVM
   Akka
   Scala
   GPars (Groovy)
GPU Computing
 NVIDIA CUDA
   NVIDIA
 OpenCL
     AMD
     NVIDIA
     Intel
     Altera
     Apple
 WebCL
   Nokia
   Samsung
Advanced courses
 Best practices for concurrency in Java
    Java's java.util.concurrent
   Actor models
   Transactional Memory


 See http://www.dinkla.com
Advanced courses
 GPU Computing
     NVIDIA CUDA
     OpenCL
     Using NVIDIA CUDA with Java
     Using OpenCL with Java
 See http://www.dinkla.com
References: Practice
 Mattson, Sanders, Massingill
   Patterns for
    Parallel Programming
 Breshears
   The Art of Concurrency
References: Practice
 Pacheco
   An Introduction to
    Parallel Programming
 Herlihy, Shavit
   The Art of
    Multiprocessor Programming
References: Theory
 Gibbons, Rytter
   Efficient Parallel Algorithms
 Lynch
   Distributed Algorithms
 Ben-Ari
   Principles of Concurrent and
    Distributed Programming
References: GPU Computing
 Scarpino
   OpenCL in Action


 Sanders, Kandrot
   CUDA by Example
References: Background
 Hennessy, Patterson
   Computer Architecture: A Quantitative
    Approach

 
Buchvorstellung "Libertarian Anarchy: Against the State" von Gerard Casey
Buchvorstellung "Libertarian Anarchy: Against the State" von Gerard CaseyBuchvorstellung "Libertarian Anarchy: Against the State" von Gerard Casey
Buchvorstellung "Libertarian Anarchy: Against the State" von Gerard Casey
 
Multi-GPU-Computing: Eins, zwei, drei, ganz viele
Multi-GPU-Computing: Eins, zwei, drei, ganz vieleMulti-GPU-Computing: Eins, zwei, drei, ganz viele
Multi-GPU-Computing: Eins, zwei, drei, ganz viele
 
Tipps & Tricks für den erfolgreichen Einsatz von GPU-Computing
Tipps & Tricks für den erfolgreichen Einsatz von GPU-ComputingTipps & Tricks für den erfolgreichen Einsatz von GPU-Computing
Tipps & Tricks für den erfolgreichen Einsatz von GPU-Computing
 
GPU-Computing mit CUDA und OpenCL in der Praxis
GPU-Computing mit CUDA und OpenCL in der PraxisGPU-Computing mit CUDA und OpenCL in der Praxis
GPU-Computing mit CUDA und OpenCL in der Praxis
 
Subversion Schulung
Subversion SchulungSubversion Schulung
Subversion Schulung
 
Test-Driven-Development mit JUnit 4
Test-Driven-Development mit JUnit 4Test-Driven-Development mit JUnit 4
Test-Driven-Development mit JUnit 4
 
Ant im Detail
Ant im DetailAnt im Detail
Ant im Detail
 

Último

From Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time AutomationFrom Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time AutomationSafe Software
 
Strategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
Strategize a Smooth Tenant-to-tenant Migration and Copilot TakeoffStrategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
Strategize a Smooth Tenant-to-tenant Migration and Copilot Takeoffsammart93
 
TrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law DevelopmentsTrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law DevelopmentsTrustArc
 
Data Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt RobisonData Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt RobisonAnna Loughnan Colquhoun
 
Exploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone ProcessorsExploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone Processorsdebabhi2
 
Artificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and MythsArtificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and MythsJoaquim Jorge
 
CNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of ServiceCNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of Servicegiselly40
 
A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)Gabriella Davis
 
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...Miguel Araújo
 
Understanding Discord NSFW Servers A Guide for Responsible Users.pdf
Understanding Discord NSFW Servers A Guide for Responsible Users.pdfUnderstanding Discord NSFW Servers A Guide for Responsible Users.pdf
Understanding Discord NSFW Servers A Guide for Responsible Users.pdfUK Journal
 
Tech Trends Report 2024 Future Today Institute.pdf
Tech Trends Report 2024 Future Today Institute.pdfTech Trends Report 2024 Future Today Institute.pdf
Tech Trends Report 2024 Future Today Institute.pdfhans926745
 
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdfThe Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdfEnterprise Knowledge
 
[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdf[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdfhans926745
 
Automating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps ScriptAutomating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps Scriptwesley chun
 
What Are The Drone Anti-jamming Systems Technology?
What Are The Drone Anti-jamming Systems Technology?What Are The Drone Anti-jamming Systems Technology?
What Are The Drone Anti-jamming Systems Technology?Antenna Manufacturer Coco
 
GenAI Risks & Security Meetup 01052024.pdf
GenAI Risks & Security Meetup 01052024.pdfGenAI Risks & Security Meetup 01052024.pdf
GenAI Risks & Security Meetup 01052024.pdflior mazor
 
The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024Rafal Los
 
Evaluating the top large language models.pdf
Evaluating the top large language models.pdfEvaluating the top large language models.pdf
Evaluating the top large language models.pdfChristopherTHyatt
 
GenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day PresentationGenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day PresentationMichael W. Hawkins
 
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptxEIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptxEarley Information Science
 

Último (20)

From Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time AutomationFrom Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
 
Strategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
Strategize a Smooth Tenant-to-tenant Migration and Copilot TakeoffStrategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
Strategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
 
TrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law DevelopmentsTrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
 
Data Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt RobisonData Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt Robison
 
Exploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone ProcessorsExploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone Processors
 
Artificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and MythsArtificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and Myths
 
CNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of ServiceCNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of Service
 
A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)
 
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
 
Understanding Discord NSFW Servers A Guide for Responsible Users.pdf
Understanding Discord NSFW Servers A Guide for Responsible Users.pdfUnderstanding Discord NSFW Servers A Guide for Responsible Users.pdf
Understanding Discord NSFW Servers A Guide for Responsible Users.pdf
 
Tech Trends Report 2024 Future Today Institute.pdf
Tech Trends Report 2024 Future Today Institute.pdfTech Trends Report 2024 Future Today Institute.pdf
Tech Trends Report 2024 Future Today Institute.pdf
 
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdfThe Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
 
[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdf[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdf
 
Automating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps ScriptAutomating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps Script
 
What Are The Drone Anti-jamming Systems Technology?
What Are The Drone Anti-jamming Systems Technology?What Are The Drone Anti-jamming Systems Technology?
What Are The Drone Anti-jamming Systems Technology?
 
GenAI Risks & Security Meetup 01052024.pdf
GenAI Risks & Security Meetup 01052024.pdfGenAI Risks & Security Meetup 01052024.pdf
GenAI Risks & Security Meetup 01052024.pdf
 
The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024
 
Evaluating the top large language models.pdf
Evaluating the top large language models.pdfEvaluating the top large language models.pdf
Evaluating the top large language models.pdf
 
GenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day PresentationGenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day Presentation
 
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptxEIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
 

Introduction To Parallel Computing
  • 1. Introduction to Parallel Computing Jörn Dinkla http://www.dinkla.com Version 1.1
  • 2. Dipl.-Inform. Jörn Dinkla  Java (J2SE, JEE)  Programming Languages  Scala, Groovy, Haskell  Parallel Computing  GPU Computing  Model driven  Eclipse-Plugins
  • 3. Overview  Progress in computing  Traditional Hard- and Software  Theoretical Computer Science  Algorithms  Machines  Optimization  Parallelization  Parallel Hard- and Software
  • 4. Progress in Computing 1. New applications  Not feasible before  Not needed before  Not possible before 2. Better applications  Faster  More data  Better quality  precision, accuracy, exactness
  • 5. Progress in Computing  Two ingredients  Hardware  Machine(s) to execute program  Software  Model / language to formulate program  Libraries  Methods
  • 6. How was progress achieved?  Hardware  CPU, memory, disks, networks  Faster and larger  Software  New and better algorithms  Programming methods and languages
  • 7. Traditional Hardware  Von Neumann-Architecture CPU I/O Memory Bus  John Backus 1977  “von Neumann bottleneck“ Cache
  • 8. Improvements  Increasing Clock Frequency  Memory Hierarchy / Cache  Parallelizing ALU  Pipelining  Very-long Instruction Words (VLIW)  Instruction-Level parallelism (ILP)  Superscalar processors  Vector data types  Multithreaded  Multicore / Manycore
  • 10. Clock frequency  No increase since 2005
  • 11. Physical Limits  Increase of clock frequency  >>> Energy-consumption  >>> Heat-dissipation  Limit to transistor size Faster processors impossible !?!
  • 12. 2005 “The Free Lunch Is Over: A Fundamental Turn Toward Concurrency in Software” Herb Sutter Dr. Dobb’s Journal, March 2005
  • 13. Multicore  Transistor count  Doubles every 2-3 years  Calculation speed  No increase Multicore  Efficient?
  • 14. How to use the cores?  Multi-Tasking OS  Different tasks  Speeding up same task  Assume 2 CPUs  Problem is divided in half  Each CPU calculates a half  Time taken is half of the original time?
  • 15. Traditional Software  Computation is expressed as an “algorithm”  “a step-by-step procedure for calculations”  algorithm = logic + control  Example: 1. Open file 2. For all records in the file: add the salary 3. Close file 4. Print out the sum of the salaries  Keywords  Sequential, Serial, Deterministic
  • 16. Traditional Software  Improvements  Better algorithms  Programming languages (OO)  Development methods (agile)  Limits  Theoretical Computer Science  Complexity theory (NP, P, NC)
  • 17. Architecture  Simplification: Ignore the bus (diagram: CPU, I/O and Memory on a bus vs. directly connected)
  • 18. More than one CPU?  How should they communicate? (diagram: two CPUs, each with its own I/O and memory)
  • 19. Message Passing  Distributed system  Loose coupling (diagram: two CPUs with local I/O and memory exchanging messages over a network)
  • 20. Shared Memory  Tight coupling (diagram: two CPUs sharing one memory)
  • 21. Shared Memory  Global vs. Local  Memory hierarchy (diagram: two CPUs, each with local memory, plus shared memory)
  • 22. Overview: Memory  Unshared Memory  Message Passing  Actors  Shared Memory  Threads  Memory hierarchies / hybrid  Partitioned Global Address Space (PGAS)  Transactional Memory
  • 23. Sequential Algorithms  Random Access Machine (RAM)  Step by step, deterministic
        int sum = 0
        for i=0 to 4
          sum += mem[i]
        mem[5] = sum
      Memory: Addr 0–4 hold the values 3, 7, 5, 1, 2; Addr 5 holds the result 18
  • 24. Sequential Algorithms
        int sum = 0
        for i=0 to 4
          sum += mem[i]
      Memory trace: Addr 0–4 stay 3, 7, 5, 1, 2; the value at Addr 5 grows 0 → 3 → 10 → 15 → 16 → 18
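The RAM trace on slides 23–24 can be sketched in Python (a minimal sketch; the list `mem` stands in for the machine's addressable memory, with the slides' values at addresses 0–4 and the running sum at address 5):

```python
# Sequential sum on a RAM-like memory, mirroring slides 23-24.
# mem[0..4] hold the inputs; mem[5] accumulates the result.
mem = [3, 7, 5, 1, 2, 0]

trace = [mem[5]]          # value at address 5 after each step
for i in range(5):        # "for i = 0 to 4" on the slide
    mem[5] += mem[i]
    trace.append(mem[5])

print(trace)              # -> [0, 3, 10, 15, 16, 18]
print(mem[5])             # -> 18
```

The printed trace is exactly the sequence of values shown in the Addr 5 column on slide 24.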
  • 25. More than one CPU  How many programs should run?  One  In lock-step  All processors do the same  In any order  More than one  Distributed system
  • 26. Two Processors
      PC 1: int sum = 0; for i=0 to 2: sum += mem[i]; mem[5] = sum
      PC 2: int sum = 0; for i=3 to 4: sum += mem[i]; mem[5] = sum
       Lockstep   Memory Access!
      Memory: Addr 0–4 hold 3, 7, 5, 1, 2; Addr 5 holds 18
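The two-processor sum can be approximated with two threads, one per half of the data (a Python sketch, not true lockstep execution; combining the partial sums in the parent avoids the contended write to mem[5] that the slide flags with “Memory Access!”):

```python
from concurrent.futures import ThreadPoolExecutor

mem = [3, 7, 5, 1, 2, 0]

def partial_sum(lo, hi):
    # Each "processor" sums only its own slice of memory.
    s = 0
    for i in range(lo, hi + 1):
        s += mem[i]
    return s

with ThreadPoolExecutor(max_workers=2) as pool:
    f1 = pool.submit(partial_sum, 0, 2)   # PC 1: i = 0..2
    f2 = pool.submit(partial_sum, 3, 4)   # PC 2: i = 3..4
    mem[5] = f1.result() + f2.result()    # combine instead of racing on mem[5]

print(mem[5])  # -> 18
```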
  • 27. Flynn‘s Taxonomy  1966
      Single Data, Single Instruction: SISD  Single Data, Multiple Instruction: MISD
      Multiple Data, Single Instruction: SIMD  Multiple Data, Multiple Instruction: MIMD
  • 28. Flynn‘s Taxonomy  SISD  RAM, Von Neumann  SIMD  Lockstep, vector processor, GPU  MISD  Fault tolerance  MIMD  Distributed system
  • 29. Extension MIMD  How many programs?  SPMD  One program  Not in lockstep as in SIMD  MPMD  Many programs
  • 30. Processes & Threads  Process  Operating System  Address space  IPC  Heavy weight  Contains 1..* threads  Thread  Smallest unit of execution  Light weight
  • 31. Overview: Algorithms  Sequential  Parallel  Concurrent Overlap  Distributed  Randomized  Quantum
  • 32. Computer Science  Theoretical Computer Science  A long time before 2005  1989: Gibbons, Rytter  1990: Ben-Ari  1996: Lynch
  • 33. Gap: Theory and Practice  Galactic algorithms  Written for abstract machines  PRAM, special networks, etc.  Simplifying assumptions  No boundaries  Exact arithmetic  Infinite memory, network speed, etc.
  • 34. Sequential algorithms  Implementing a sequential algorithm  Machine architecture  Programming language  Performance  Processor, memory and cache speed  Boundary cases  Sometimes hard
  • 35. Parallel algorithms  Implementing a parallel algorithm  Adapt algorithm to architecture  No PRAM or sorting network!  Problems with shared memory  Synchronization  Harder!
  • 36. Parallelization  Transforming  a sequential  into a parallel algorithm  Tasks  Adapt to architecture  Rewrite  Test correctness w.r.t. the “golden” sequential code
  • 37. Granularity  “Size” of the threads?  How much computation?  Coarse vs. fine grain  Right choice  Important for good performance  Algorithm design
  • 38. Computational thinking  “… is the thought processes involved in formulating problems and their solutions so that the solutions are represented in a form that can be effectively carried out by an information-processing agent.” Cuny, Snyder, Wing 2010
  • 39. Computational thinking  “… is the new literacy of the 21st Century.” Cuny, Snyder, Wing 2010  Expert level needed for parallelization!
  • 40. Problems: Shared Memory  Destructive updates  i += 1  Parallel, independent processes  How do the others know that i increased?  Synchronization needed  Memory barrier  Complicated for beginners
  • 41. Problems: Shared Memory
      PC 1: int sum = 0; for i=0 to 2: sum += mem[i]; mem[5] = sum
      PC 2: int sum = 0; for i=3 to 4: sum += mem[i]; mem[5] = sum
       Which one first?
      Memory: Addr 0–4 hold 3, 7, 5, 1, 2; Addr 5 holds 18
  • 42. Problems: Shared Memory
      PC 1: int sum = 0; for i=0 to 2: sum += mem[i]; mem[5] = sum; sync()
      PC 2: int sum = 0; for i=3 to 4: sum += mem[i]; sync(); mem[5] += sum
       Synchronization needed
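The role of sync() on slide 42 can be played by a lock in a Python sketch: each worker computes its partial sum privately, and only the read-modify-write of mem[5] is serialized:

```python
import threading

mem = [3, 7, 5, 1, 2, 0]
lock = threading.Lock()

def worker(lo, hi):
    s = 0
    for i in range(lo, hi + 1):
        s += mem[i]
    # Critical section: only one thread at a time may update mem[5],
    # standing in for the sync() on the slide.
    with lock:
        mem[5] += s

t1 = threading.Thread(target=worker, args=(0, 2))   # PC 1
t2 = threading.Thread(target=worker, args=(3, 4))   # PC 2
t1.start(); t2.start()
t1.join(); t2.join()

print(mem[5])  # -> 18
```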
  • 43. Problems: Shared Memory  The memory barrier  When is a value read or written?  Optimizing compilers change semantics  int a = b + 5  Read b  Add 5 to b, store temporary in c  Write c to a  Solutions (Java)  volatile  java.util.concurrent.atomic
  • 44. Problems: Shared Memory  Thread safety  Reentrant code class X { int x; void inc() { x+=1; } }
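The `class X` on slide 44 is not thread-safe because `x += 1` is a read-modify-write, not an atomic step. One way to fix it, sketched in Python (the slide's example is Java; the lock-per-object pattern shown here is a common remedy, not the slide's own code):

```python
import threading

class X:
    """Thread-safe counter; the slide's bare x += 1 is not atomic."""
    def __init__(self):
        self.x = 0
        self._lock = threading.Lock()

    def inc(self):
        with self._lock:      # read, add, write as one indivisible step
            self.x += 1

x = X()
threads = [threading.Thread(target=lambda: [x.inc() for _ in range(1000)])
           for _ in range(4)]
for t in threads: t.start()
for t in threads: t.join()
print(x.x)  # -> 4000, never less
```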
  • 45. Problems: Threads  Deadlock  A wants B, B wants A, both waiting  Starvation  A wants B, but never gets it  Race condition  A writes to mem, B reads/writes mem
  • 46. Shared Mem: Solutions  Shared mutable state  Synchronize properly  Isolated mutable state  Don‘t share state  Immutable or unshared  Don‘t mutate state!
  • 47. Solutions  Transactional Memory  Every access within transaction  See databases  Actor models  Message passing  Immutable state / pure functional
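The actor alternative on this slide keeps all mutable state inside one actor and communicates only by messages. A minimal Python sketch with a queue as the mailbox (the summing actor is a hypothetical example, not from the slides):

```python
import threading
import queue

# A minimal "actor": a thread that owns its state (total) and is driven
# only by messages on its mailbox -- no shared mutable state.
mailbox = queue.Queue()
result = queue.Queue()

def summing_actor():
    total = 0
    while True:
        msg = mailbox.get()
        if msg is None:          # poison pill: report the total and stop
            result.put(total)
            return
        total += msg

actor = threading.Thread(target=summing_actor)
actor.start()
for value in [3, 7, 5, 1, 2]:
    mailbox.put(value)
mailbox.put(None)
actor.join()

answer = result.get()
print(answer)  # -> 18
```

Because only the actor thread ever touches `total`, no locks are needed.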
  • 48. Speedup and Efficiency  Running time  T(1) with one processor  T(n) with n processors  Speedup  How much faster?  S(n) = T(1) / T(n)
  • 49. Speedup and Efficiency  Efficiency  Are all the processors used?  E(n) = S(n) / n = T(1) / (n * T(n))
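The definitions on slides 48–49 translate directly into code; the timings below (10 s sequential, 3 s on 4 processors) are made-up illustration values:

```python
def speedup(t1, tn):
    # S(n) = T(1) / T(n)
    return t1 / tn

def efficiency(t1, tn, n):
    # E(n) = S(n) / n = T(1) / (n * T(n))
    return speedup(t1, tn) / n

print(speedup(10.0, 3.0))        # ~3.33x faster
print(efficiency(10.0, 3.0, 4))  # ~0.83: the 4 processors are 83% utilized
```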
  • 52. Amdahl‘s Law  Corollary  Maximize the parallel part  Only parallelize when the parallel part is large enough
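Amdahl's law is commonly stated as S(n) = 1 / ((1 − p) + p/n), where p is the parallelizable fraction of the program. A small Python sketch showing why maximizing the parallel part matters:

```python
def amdahl_speedup(p, n):
    """Upper bound on speedup with parallel fraction p on n processors:
    S(n) = 1 / ((1 - p) + p / n)."""
    return 1.0 / ((1.0 - p) + p / n)

# Even with 95% parallel code, the speedup is capped at 1/(1-p) = 20,
# no matter how many processors are added.
for n in (2, 8, 1024):
    print(n, amdahl_speedup(0.95, n))
```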
  • 53. P-Completeness  Is there an efficient parallel version for every algorithm?  No! Hardly parallelizable problems  P-Completeness  Example Circuit-Value-Problem (CVP)
  • 55. Optimization  What can I achieve?  When do I stop?  How many threads should I use?
  • 56. Optimization  I/O bound  Thread is waiting for memory, disk, etc.  Computation bound  Thread is calculating the whole time  Watch processor utilization!
  • 57. Optimization  I/O bound  Use asynchronous/non-blocking I/O  Increase number of threads  Computation bound  Number of threads = Number of cores
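The thread-count rules on slides 56–57 can be captured in a small helper (a sketch; the `blocking_factor` for I/O-bound work is a hypothetical tuning knob, not a fixed rule from the slides):

```python
import os

def suggested_threads(io_bound, cores=None, blocking_factor=4):
    # Rule of thumb from the slides: compute-bound work wants one thread
    # per core; I/O-bound work wants more threads, since most are waiting.
    cores = cores or os.cpu_count() or 1
    return cores * blocking_factor if io_bound else cores

print(suggested_threads(io_bound=False, cores=8))  # -> 8
print(suggested_threads(io_bound=True, cores=8))   # -> 32
```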
  • 58. Processors  Multicore CPU  Graphical Processing Unit (GPU)  Field-Programmable Gate Array (FPGA)
  • 59. GPU Computing  Finer granularity than CPU  Specialized processors  512 cores on a Fermi  High memory bandwidth 192 GB/sec
  • 60. CPU vs. GPU  Source: SGI
  • 61. FPGA  Configurable hardware circuits  Programmed in Verilog, VHDL  Now: OpenCL  Much higher level of abstraction  Under development, promising  No performance test results (2011/12)
  • 62. Networks / Cluster  Combination of CPU  Memory  Network  GPU  FPGA  Vast possibilities
  • 63. Example  2 x connected by network  2 CPU each with local cache  Global memory (diagram: a network joining two nodes, each with two CPUs, local memories and a global memory)
  • 64. Example  1 CPU with local cache  Connected by shared memory  2 GPU with local memory (“device”)
  • 65. Next Step: Hybrid  Hybrid / Heterogeneous  Multi-Core / Many-Core  Plus special purpose hardware  GPU  FPGA
  • 66. Optimal combination?  Which network gives the best performance?  Complicated  Technical restrictions  4x PCI-Express 16x Motherboards  Power consumption  Cooling
  • 67. Example: K-Computer  SPARC64 VIIIfx 2.0GHz  705024 Cores  10.51 Petaflop/s  No GPUs  #1 2011
  • 68. Example: Tianhe-1A  14336 Xeon X5670  7168 Tesla M2050  2048 NUDT FT1000  2.57 petaflop/s  #2 2011
  • 69. Example: HPC at home  Workstations and blades  8 x 512 cores = 4096 cores
  • 70. Frameworks: Shared Mem  C/C++  OpenMP  POSIX Threads (pthreads)  Intel Thread Building Blocks  Windows Threads  Java  java.util.concurrent
  • 71. Frameworks: Actors  C/C++  Theron  Java / JVM  Akka  Scala  GPars (Groovy)
  • 72. GPU Computing  NVIDIA CUDA  NVIDIA  OpenCL  AMD  NVIDIA  Intel  Altera  Apple  WebCL  Nokia  Samsung
  • 73. Advanced courses  Best practices for concurrency in Java  Java‘s java.util.concurrent  Actor models  Transactional Memory  See http://www.dinkla.com
  • 74. Advanced courses  GPU Computing  NVIDIA CUDA  OpenCL  Using NVIDIA CUDA with Java  Using OpenCL with Java  See http://www.dinkla.com
  • 75. References: Practice  Mattson, Sanders, Massingill  Patterns for Parallel Programming  Breshears  The Art of Concurrency
  • 76. References: Practice  Pacheco  An Introduction to Parallel Programming  Herlihy, Shavit  The Art of Multiprocessor Programming
  • 77. References: Theory  Gibbons, Rytter  Efficient Parallel Algorithms  Lynch  Distributed Algorithms  Ben-Ari  Principles of Concurrent and Distributed Programming
  • 78. References: GPU Computing  Scarpino  OpenCL in Action  Sanders, Kandrot  CUDA by Example
  • 79. References: Background  Hennessy, Patterson  Computer Architecture: A Quantitative Approach