Introduction to
Parallel Computing


     Jörn Dinkla
     http://www.dinkla.com

          Version 1.1
Dipl.-Inform. Jörn Dinkla
 Java (J2SE, JEE)
 Programming Languages
   Scala, Groovy, Haskell
 Parallel Computing
   GPU Computing
 Model driven
 Eclipse-Plugins
Overview
 Progress in computing
 Traditional Hard- and Software
 Theoretical Computer Science
   Algorithms
   Machines
   Optimization
 Parallelization
 Parallel Hard- and Software
Progress in Computing
1. New applications
   Not feasible before
   Not needed before
   Not possible before
2. Better applications
    Faster
    More data
    Better quality
      precision, accuracy, exactness
Progress in Computing
 Two ingredients
   Hardware
     Machine(s) to execute program
   Software
     Model / language to formulate program
     Libraries
     Methods
How was progress achieved?
 Hardware
   CPU, memory, disks, networks
   Faster and larger
 Software
   New and better algorithms
   Programming methods and languages
Traditional Hardware
 Von Neumann architecture

   Diagram: CPU, I/O and memory connected by a single shared bus

 John Backus 1977
   “von Neumann bottleneck”: the bus between CPU and memory
   Mitigated by a cache
Improvements
   Increasing Clock Frequency
   Memory Hierarchy / Cache
   Parallelizing ALU
   Pipelining
   Very-long Instruction Words (VLIW)
   Instruction-Level parallelism (ILP)
   Superscalar processors
   Vector data types
   Multithreaded
   Multicore / Manycore
Moore's law
 Guaranteed until 2020
Clock frequency
 No increase since 2005
Physical Limits
 Increasing the clock frequency further means
    Higher energy consumption
    More heat dissipation
 Transistor size has a lower limit

    Faster processors impossible!?
2005
“The Free Lunch Is Over:
   A Fundamental Turn Toward
   Concurrency in Software”

       Herb Sutter
       Dr. Dobb’s Journal, March 2005
Multicore
 Transistor count
    Doubles every 2-3 years
 Calculation speed
    No increase

  Multicore

 Efficient?
How to use the cores?
 Multi-Tasking OS
   Different tasks
 Speeding up same task
     Assume 2 CPUs
     Problem is divided in half
     Each CPU calculates a half
     Time taken is half of the original time?
Traditional Software
 Computation is expressed as an “algorithm”
    “a step-by-step procedure for calculations”
    algorithm = logic + control
 Example
   1.   Open file
   2.   For all records in the file
        1.   Add the salary
   3.   Close file
   4.   Print out the sum of the salaries

 Keywords
    Sequential, Serial, Deterministic
Traditional Software
 Improvements
   Better algorithms
   Programming languages (OO)
   Development methods (agile)
 Limits
   Theoretical Computer Science
   Complexity theory (NP, P, NC)
Architecture
 Simplification: Ignore the bus

   Diagram: the CPU–bus–I/O–memory picture is drawn without the explicit bus
More than one CPU?
 How should they communicate?

   Diagram: two separate CPU–I/O–memory nodes
Message Passing
 Distributed system
 Loose coupling
   Diagram: two CPU–I/O–memory nodes exchanging messages over a network
Shared Memory
 Shared Memory
 Tight coupling

   Diagram: two CPUs attached to one shared memory and I/O
Shared Memory
 Global vs. Local
 Memory hierarchy

   Diagram: two CPUs, each with local memory, connected to an additional shared memory
Overview: Memory
 Unshared Memory
   Message Passing
   Actors
 Shared Memory
   Threads
 Memory hierarchies / hybrid
   Partitioned Global Address Space (PGAS)
 Transactional Memory
Sequential Algorithms
 Random Access Machine (RAM)
   Step by step, deterministic
     PC marks the current instruction:

       int sum = 0
       for i=0 to 4
         sum += mem[i]
       mem[5] = sum

     Memory: addr 0–4 hold 3, 7, 5, 1, 2; addr 5 receives the sum 18
Sequential Algorithms
int sum = 0
for i=0 to 4
  sum += mem[i]
Addr Value   Addr Value   Addr Value   Addr Value   Addr Value   Addr Value
 0     3      0     3      0     3      0     3      0     3      0     3
 1     7      1     7      1     7      1     7      1     7      1     7
 2     5      2     5      2     5      2     5      2     5      2     5
 3     1      3     1      3     1      3     1      3     1      3     1
 4     2      4     2      4     2      4     2      4     2      4     2
 5     0      5     3      5    10      5    15      5    16      5    18
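The trace above corresponds to this runnable Java version — a minimal sketch; the class name is illustrative, and the array plays the role of the RAM's memory cells:

```java
public class SequentialSum {
    public static void main(String[] args) {
        // mem[0..4] hold the inputs, mem[5] receives the result
        int[] mem = {3, 7, 5, 1, 2, 0};
        int sum = 0;
        for (int i = 0; i <= 4; i++) {
            sum += mem[i];  // step by step, deterministic
        }
        mem[5] = sum;
        System.out.println(mem[5]); // prints 18
    }
}
```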
More than one CPU
 How many programs should run?
   One
     In lock-step
        All processors do the same
     In any order
   More than one
     Distributed system
Two Processors
PC 1   int sum = 0              int sum = 0
       for i=0 to 2      PC 2   for i=3 to 4
         sum += mem[i]            sum += mem[i]
       mem[5]= sum              mem[5]= sum
 Lockstep
 Memory access!

   Memory: addr 0–4 hold 3, 7, 5, 1, 2; both CPUs write addr 5, which should end as 18
Flynn's Taxonomy
 1966

                        Instruction
                     Single    Multiple
           Single     SISD      MISD
   Data
          Multiple   SIMD       MIMD
Flynn's Taxonomy
 SISD
   RAM, Von Neumann
 SIMD
   Lockstep, vector processor, GPU
 MISD
   Fault tolerance
 MIMD
   Distributed system
Extension MIMD
 How many programs?

 SPMD
   One program
   Not in lockstep as in SIMD
 MPMD
   Many programs
Processes & Threads
 Process
   Operating System
      Address space
      IPC
   Heavy weight
   Contains 1..* threads
 Thread
   Smallest unit of execution
   Light weight
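A minimal Java sketch of the distinction: one process (the JVM) contains several light-weight threads that share its address space. Class name is illustrative:

```java
import java.util.concurrent.atomic.AtomicInteger;

public class ThreadDemo {
    public static void main(String[] args) throws InterruptedException {
        // Two light-weight threads inside one heavy-weight process,
        // sharing the counter in the process's address space
        AtomicInteger counter = new AtomicInteger(0);
        Thread t1 = new Thread(counter::incrementAndGet);
        Thread t2 = new Thread(counter::incrementAndGet);
        t1.start();
        t2.start();
        t1.join();  // wait for both threads to finish
        t2.join();
        System.out.println(counter.get()); // prints 2
    }
}
```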
Overview: Algorithms
   Sequential
   Parallel
   Concurrent (these notions overlap)
   Distributed
   Randomized
   Quantum
Computer Science
 Theoretical Computer Science
     Studied long before 2005
     1989: Gibbons, Rytter
     1990: Ben-Ari
     1996: Lynch
Gap: Theory and Practice
 Galactic algorithms
 Written for abstract machines
   PRAM, special networks, etc.
 Simplifying assumptions
   No boundaries
   Exact arithmetic
   Infinite memory, network speed, etc.
Sequential algorithms
 Implementing a sequential algorithm
   Machine architecture
   Programming language
   Performance
     Processor, memory and cache speed
   Boundary cases
   Sometimes hard
Parallel algorithms
 Implementing a parallel algorithm
   Adapt algorithm to architecture
      No PRAM or sorting network!
   Problems with shared memory
   Synchronization
   Harder!
Parallelization
 Transforming
   a sequential
   into a parallel algorithm

 Tasks
   Adapt to architecture
   Rewrite
   Test correctness w.r.t. the “golden” sequential code
Granularity
 “Size” of the threads?
   How much computation?
 Coarse vs. fine grain
 Right choice
   Important for good performance
   Algorithm design
Computational thinking
 “… is the thought processes involved
  in formulating problems and their
  solutions so that the solutions are
  represented in a form that can be
  effectively carried out by an
  information-processing agent.”
              Cuny, Snyder, Wing 2010
Computational thinking
 “… is the new literacy of the 21st
  Century.”
               Cuny, Snyder, Wing 2010



 Expert level needed for parallelization!
Problems: Shared Memory
 Destructive updates
   i += 1
 Parallel, independent processes
   How do the others know that i increased?
   Synchronization needed
      Memory barrier
      Complicated for beginners
Problems: Shared Memory

PC 1   int sum = 0              int sum = 0
       for i=0 to 2      PC 2   for i=3 to 4
         sum += mem[i]            sum += mem[i]
       mem[5]= sum              mem[5]= sum
 Which one writes first?

   Memory: addr 0–4 hold 3, 7, 5, 1, 2; both CPUs write addr 5, so the final value is not guaranteed to be 18
Problems: Shared Memory

PC 1   int sum = 0              int sum = 0
       for i=0 to 2      PC 2   for i=3 to 4
         sum += mem[i]            sum += mem[i]
       mem[5]= sum
       sync()                   sync()
                                mem[5] += sum


 Synchronization needed
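The two-processor sum with proper synchronization might look like this in Java. An AtomicInteger stands in for the guarded mem[5] — an illustrative substitution for the sync() barrier in the pseudocode:

```java
import java.util.concurrent.atomic.AtomicInteger;

public class ParallelSum {
    public static void main(String[] args) throws InterruptedException {
        int[] mem = {3, 7, 5, 1, 2};
        AtomicInteger total = new AtomicInteger(0); // plays the role of mem[5]

        // Each thread sums its own half, then combines atomically.
        Thread t1 = new Thread(() -> {
            int sum = 0;
            for (int i = 0; i <= 2; i++) sum += mem[i];
            total.addAndGet(sum);
        });
        Thread t2 = new Thread(() -> {
            int sum = 0;
            for (int i = 3; i <= 4; i++) sum += mem[i];
            total.addAndGet(sum);
        });
        t1.start(); t2.start();
        t1.join(); t2.join();
        System.out.println(total.get()); // prints 18, regardless of ordering
    }
}
```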
Problems: Shared Memory
 The memory barrier
    When is a value actually read or written?
    Optimizing compilers reorder reads and writes

 int a = b + 5
    Read b
    Add 5, store the result in a temporary c
    Write c to a

 Solutions (Java)
    volatile
    java.util.concurrent.atomic
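A small sketch of the volatile solution: the volatile write guarantees the reader eventually observes the update. Class and field names are illustrative:

```java
public class VolatileFlag {
    // Without volatile, the reader thread might never observe the write,
    // because the optimizer may cache the field in a register.
    static volatile boolean done = false;

    public static void main(String[] args) throws InterruptedException {
        Thread reader = new Thread(() -> {
            while (!done) {
                // spin until the writer's update becomes visible
            }
            System.out.println("observed done = true");
        });
        reader.start();
        Thread.sleep(100);  // the writer does some work
        done = true;        // volatile write: visible to the reader
        reader.join();
    }
}
```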
Problems: Shared Memory
 Thread safety
 Reentrant code

  class X {
    int x;
    void inc() { x += 1; }  // read-modify-write: not atomic
  }
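A thread-safe variant of X, using java.util.concurrent.atomic as mentioned above (a sketch; class name is illustrative):

```java
import java.util.concurrent.atomic.AtomicInteger;

class SafeX {
    private final AtomicInteger x = new AtomicInteger(0);

    // incrementAndGet is a single atomic read-modify-write,
    // so concurrent calls cannot lose updates
    void inc() { x.incrementAndGet(); }

    int get() { return x.get(); }
}
```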
Problems: Threads
 Deadlock
   A wants B, B wants A, both waiting
 Starvation
   A wants B, but never gets it
 Race condition
   A writes to mem, B reads/writes mem
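The deadlock pattern ("A wants B, B wants A") is commonly avoided by acquiring locks in one global order; a sketch, with illustrative names:

```java
public class LockOrdering {
    private static final Object lockA = new Object();
    private static final Object lockB = new Object();
    private static int shared = 0;

    // Both threads acquire lockA before lockB. Because every thread uses
    // the same global lock order, "A wants B while B wants A" cannot occur.
    static void task() {
        synchronized (lockA) {
            synchronized (lockB) {
                shared += 1;  // critical section
            }
        }
    }

    public static void main(String[] args) throws InterruptedException {
        Thread t1 = new Thread(LockOrdering::task);
        Thread t2 = new Thread(LockOrdering::task);
        t1.start(); t2.start();
        t1.join(); t2.join();
        System.out.println(shared); // prints 2: both threads finished, no deadlock
    }
}
```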
Shared Mem: Solutions
 Shared mutable state
   Synchronize properly


 Isolated mutable state
   Don't share state


 Immutable or unshared
   Don't mutate state!
Solutions
 Transactional Memory
   Every access within transaction
   See databases
 Actor models
   Message passing
 Immutable state / pure functional
Speedup and Efficiency
 Running time
   T(1) with one processor
   T(n) with n processors
 Speedup
   How much faster?
   S(n) = T(1) / T(n)
Speedup and Efficiency
 Efficiency
   Are all the processors used?
   E(n) = S(n) / n = T(1) / (n * T(n))
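A worked example of the two formulas, with hypothetical timings:

```java
public class SpeedupExample {
    public static void main(String[] args) {
        double t1 = 12.0;  // seconds on 1 processor (hypothetical)
        double tn = 4.0;   // seconds on n = 4 processors (hypothetical)
        int n = 4;
        double speedup = t1 / tn;        // S(4) = 12 / 4 = 3.0
        double efficiency = speedup / n; // E(4) = 3 / 4 = 0.75
        System.out.println("S = " + speedup + ", E = " + efficiency);
    }
}
```

An efficiency below 1 means the processors are not fully used, e.g. due to synchronization overhead.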
Amdahl's Law
 (formula and speedup plot on slides)
 Corollary
   Maximize the parallel part
   Only parallelize when the parallel part is large enough
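The law itself, in its standard form (s is the serial fraction of the running time, so 1 − s can be parallelized over n processors):

```latex
S(n) = \frac{1}{s + \frac{1 - s}{n}}, \qquad \lim_{n \to \infty} S(n) = \frac{1}{s}
```

Even with unlimited processors, the speedup is bounded by 1/s — hence the corollary above.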
P-Completeness
 Is there an efficient parallel version of every algorithm?
   No! Some problems are hard to parallelize
   P-completeness
   Example: the Circuit Value Problem (CVP)
P-Completeness
 (diagram on slide)
Optimization
 What can I achieve?
 When do I stop?
 How many threads should I use?
Optimization
 I/O bound
   Thread is waiting for memory, disk, etc.
 Computation bound
   Thread is calculating the whole time

 Watch processor utilization!
Optimization
 I/O bound
   Use asynchronous/non-blocking I/O
   Increase number of threads
 Computation bound
   Number of threads = Number of cores
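Sizing a pool to the core count for computation-bound work can be sketched in Java as follows (class name is illustrative):

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

public class PoolSizing {
    public static void main(String[] args) throws InterruptedException {
        // For computation-bound work: number of threads = number of cores
        int cores = Runtime.getRuntime().availableProcessors();
        ExecutorService pool = Executors.newFixedThreadPool(cores);
        for (int i = 0; i < cores; i++) {
            pool.submit(() -> { /* compute-bound task */ });
        }
        pool.shutdown();
        pool.awaitTermination(1, TimeUnit.MINUTES);
        System.out.println("ran on " + cores + " worker threads");
    }
}
```

For I/O-bound work the pool would be larger, since threads spend most of their time waiting.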
Processors
 Multicore CPU
 Graphical Processing Unit (GPU)
 Field-Programmable Gate Array
  (FPGA)
GPU Computing
 Finer granularity than CPU
   Specialized processors
   512 cores on a Fermi
 High memory bandwidth (192 GB/s)
CPU vs. GPU
 Chart: performance comparison of CPU and GPU (Source: SGI)
FPGA
 Configurable hardware circuits
 Programmed in Verilog, VHDL
 Now: OpenCL
   Much higher level of abstraction
 Under development, promising
 No published performance results yet (2011/12)
Networks / Cluster
 Combination of
    CPU
    Memory
    Network
    GPU
    FPGA
 Vast possibilities
Example
 Two machines connected by a network
   2 CPUs each, with local caches
   Global memory per machine

   Diagram: in each machine, two CPU–cache pairs share a memory; the machines are linked by the network
Example
 1 CPU with local cache
 2 GPUs with local (“device”) memory
 Connected via shared memory

   Diagram: the CPU and both GPUs each have their own memory
Next Step: Hybrid
 Hybrid / Heterogeneous
   Multi-Core / Many-Core
   Plus special purpose hardware
     GPU
     FPGA
Optimal combination?
 Which network gives the best
  performance?
   Complicated
   Technical restrictions
      Motherboards with 4 × PCI Express x16 slots
      Power consumption
      Cooling
Example: K-Computer
   SPARC64 VIIIfx 2.0GHz
    705,024 cores
   10.51 Petaflop/s
   No GPUs

 #1 2011
Example: Tianhe-1A
    14,336 Xeon X5670
    7,168 Tesla M2050
    2,048 NUDT FT1000
    2.57 Petaflop/s

 #2 2011
Example: HPC at home
 Workstations and blades
   8 x 512 cores = 4096 cores
Frameworks: Shared Mem
 C/C++
     OpenMP
     POSIX Threads (pthreads)
      Intel Threading Building Blocks
     Windows Threads
 Java
   java.util.concurrent
Frameworks: Actors
 C/C++
   Theron
 Java / JVM
   Akka
   Scala
   GPars (Groovy)
GPU Computing
 NVIDIA CUDA
   NVIDIA
 OpenCL
     AMD
     NVIDIA
     Intel
     Altera
     Apple
 WebCL
   Nokia
   Samsung
Advanced courses
 Best practices for concurrency in Java
    Java's java.util.concurrent
   Actor models
   Transactional Memory


 See http://www.dinkla.com
Advanced courses
 GPU Computing
     NVIDIA CUDA
     OpenCL
     Using NVIDIA CUDA with Java
     Using OpenCL with Java
 See http://www.dinkla.com
References: Practice
 Mattson, Sanders, Massingill
   Patterns for
    Parallel Programming
 Breshears
   The Art of Concurrency
References: Practice
 Pacheco
   An Introduction to
    Parallel Programming
 Herlihy, Shavit
   The Art of
    Multiprocessor Programming
References: Theory
 Gibbons, Rytter
   Efficient Parallel Algorithms
 Lynch
   Distributed Algorithms
 Ben-Ari
   Principles of Concurrent and
    Distributed Programming
References: GPU Computing
 Scarpino
   OpenCL in Action


 Sanders, Kandrot
   CUDA by Example
References: Background
 Hennessy, Patterson
   Computer Architecture: A Quantitative
    Approach

 
Buchvorstellung "Libertarian Anarchy: Against the State" von Gerard Casey
Buchvorstellung "Libertarian Anarchy: Against the State" von Gerard CaseyBuchvorstellung "Libertarian Anarchy: Against the State" von Gerard Casey
Buchvorstellung "Libertarian Anarchy: Against the State" von Gerard Casey
 
Multi-GPU-Computing: Eins, zwei, drei, ganz viele
Multi-GPU-Computing: Eins, zwei, drei, ganz vieleMulti-GPU-Computing: Eins, zwei, drei, ganz viele
Multi-GPU-Computing: Eins, zwei, drei, ganz viele
 
Tipps & Tricks für den erfolgreichen Einsatz von GPU-Computing
Tipps & Tricks für den erfolgreichen Einsatz von GPU-ComputingTipps & Tricks für den erfolgreichen Einsatz von GPU-Computing
Tipps & Tricks für den erfolgreichen Einsatz von GPU-Computing
 
GPU-Computing mit CUDA und OpenCL in der Praxis
GPU-Computing mit CUDA und OpenCL in der PraxisGPU-Computing mit CUDA und OpenCL in der Praxis
GPU-Computing mit CUDA und OpenCL in der Praxis
 
Subversion Schulung
Subversion SchulungSubversion Schulung
Subversion Schulung
 
Test-Driven-Development mit JUnit 4
Test-Driven-Development mit JUnit 4Test-Driven-Development mit JUnit 4
Test-Driven-Development mit JUnit 4
 
Ant im Detail
Ant im DetailAnt im Detail
Ant im Detail
 

Último

From Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time AutomationFrom Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time AutomationSafe Software
 
Strategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
Strategize a Smooth Tenant-to-tenant Migration and Copilot TakeoffStrategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
Strategize a Smooth Tenant-to-tenant Migration and Copilot Takeoffsammart93
 
TrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law DevelopmentsTrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law DevelopmentsTrustArc
 
Data Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt RobisonData Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt RobisonAnna Loughnan Colquhoun
 
Exploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone ProcessorsExploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone Processorsdebabhi2
 
Artificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and MythsArtificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and MythsJoaquim Jorge
 
CNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of ServiceCNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of Servicegiselly40
 
A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)Gabriella Davis
 
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...Miguel Araújo
 
Understanding Discord NSFW Servers A Guide for Responsible Users.pdf
Understanding Discord NSFW Servers A Guide for Responsible Users.pdfUnderstanding Discord NSFW Servers A Guide for Responsible Users.pdf
Understanding Discord NSFW Servers A Guide for Responsible Users.pdfUK Journal
 
Tech Trends Report 2024 Future Today Institute.pdf
Tech Trends Report 2024 Future Today Institute.pdfTech Trends Report 2024 Future Today Institute.pdf
Tech Trends Report 2024 Future Today Institute.pdfhans926745
 
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdfThe Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdfEnterprise Knowledge
 
[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdf[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdfhans926745
 
Automating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps ScriptAutomating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps Scriptwesley chun
 
What Are The Drone Anti-jamming Systems Technology?
What Are The Drone Anti-jamming Systems Technology?What Are The Drone Anti-jamming Systems Technology?
What Are The Drone Anti-jamming Systems Technology?Antenna Manufacturer Coco
 
GenAI Risks & Security Meetup 01052024.pdf
GenAI Risks & Security Meetup 01052024.pdfGenAI Risks & Security Meetup 01052024.pdf
GenAI Risks & Security Meetup 01052024.pdflior mazor
 
The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024Rafal Los
 
Evaluating the top large language models.pdf
Evaluating the top large language models.pdfEvaluating the top large language models.pdf
Evaluating the top large language models.pdfChristopherTHyatt
 
GenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day PresentationGenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day PresentationMichael W. Hawkins
 
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptxEIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptxEarley Information Science
 

Último (20)

From Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time AutomationFrom Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
 
Strategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
Strategize a Smooth Tenant-to-tenant Migration and Copilot TakeoffStrategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
Strategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
 
TrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law DevelopmentsTrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
 
Data Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt RobisonData Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt Robison
 
Exploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone ProcessorsExploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone Processors
 
Artificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and MythsArtificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and Myths
 
CNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of ServiceCNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of Service
 
A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)
 
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
 
Understanding Discord NSFW Servers A Guide for Responsible Users.pdf
Understanding Discord NSFW Servers A Guide for Responsible Users.pdfUnderstanding Discord NSFW Servers A Guide for Responsible Users.pdf
Understanding Discord NSFW Servers A Guide for Responsible Users.pdf
 
Tech Trends Report 2024 Future Today Institute.pdf
Tech Trends Report 2024 Future Today Institute.pdfTech Trends Report 2024 Future Today Institute.pdf
Tech Trends Report 2024 Future Today Institute.pdf
 
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdfThe Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
 
[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdf[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdf
 
Automating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps ScriptAutomating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps Script
 
What Are The Drone Anti-jamming Systems Technology?
What Are The Drone Anti-jamming Systems Technology?What Are The Drone Anti-jamming Systems Technology?
What Are The Drone Anti-jamming Systems Technology?
 
GenAI Risks & Security Meetup 01052024.pdf
GenAI Risks & Security Meetup 01052024.pdfGenAI Risks & Security Meetup 01052024.pdf
GenAI Risks & Security Meetup 01052024.pdf
 
The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024
 
Evaluating the top large language models.pdf
Evaluating the top large language models.pdfEvaluating the top large language models.pdf
Evaluating the top large language models.pdf
 
GenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day PresentationGenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day Presentation
 
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptxEIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
 

Introduction To Parallel Computing
  • 1. Introduction to Parallel Computing Jörn Dinkla http://www.dinkla.com Version 1.1
  • 2. Dipl.-Inform. Jörn Dinkla  Java (J2SE, JEE)  Programming Languages  Scala, Groovy, Haskell  Parallel Computing  GPU Computing  Model driven  Eclipse-Plugins
  • 3. Overview  Progress in computing  Traditional Hard- and Software  Theoretical Computer Science  Algorithms  Machines  Optimization  Parallelization  Parallel Hard- and Software
  • 4. Progress in Computing 1. New applications  Not feasible before  Not needed before  Not possible before 2. Better applications  Faster  More data  Better quality  precision, accuracy, exactness
  • 5. Progress in Computing  Two ingredients  Hardware  Machine(s) to execute program  Software  Model / language to formulate program  Libraries  Methods
  • 6. How was progress achieved?  Hardware  CPU, memory, disks, networks  Faster and larger  Software  New and better algorithms  Programming methods and languages
  • 7. Traditional Hardware  Von Neumann-Architecture CPU I/O Memory Bus  John Backus 1977  “von Neumann bottleneck“ Cache
  • 8. Improvements  Increasing Clock Frequency  Memory Hierarchy / Cache  Parallelizing ALU  Pipelining  Very-long Instruction Words (VLIW)  Instruction-Level parallelism (ILP)  Superscalar processors  Vector data types  Multithreaded  Multicore / Manycore
  • 10. Clock frequency  No increase since 2005
  • 11. Physical Limits  Increase of clock frequency  >>> Energy-consumption  >>> Heat-dissipation  Limit to transistor size Faster processors impossible !?!
  • 12. 2005 “The Free Lunch Is Over: A Fundamental Turn Toward Concurrency in Software” Herb Sutter Dr. Dobb’s Journal, March 2005
  • 13. Multicore  Transistor count  Doubles every 2-3 years  Calculation speed  No increase Multicore  Efficient?
  • 14. How to use the cores?  Multi-Tasking OS  Different tasks  Speeding up same task  Assume 2 CPUs  Problem is divided in half  Each CPU calculates a half  Time taken is half of the original time?
  • 15. Traditional Software  Computation is expressed as an “algorithm”  “a step-by-step procedure for calculations”  algorithm = logic + control  Example: 1. Open file 2. For all records in the file: add the salary 3. Close file 4. Print out the sum of the salaries  Keywords  Sequential, Serial, Deterministic
  • 16. Traditional Software  Improvements  Better algorithms  Programming languages (OO)  Development methods (agile)  Limits  Theoretical Computer Science  Complexity theory (NP, P, NC)
  • 17. Architecture  Simplification: Ignore the bus (diagram: CPU, I/O and Memory on a bus vs. directly connected)
  • 18. More than one CPU?  How should they communicate? (diagram: two CPUs, each with its own I/O and memory)
  • 19. Message Passing  Distributed system  Loose coupling (diagram: two CPUs with local I/O and memory exchanging messages over a network)
  • 20. Shared Memory  Tight coupling (diagram: two CPUs sharing one memory)
  • 21. Shared Memory  Global vs. Local  Memory hierarchy (diagram: two CPUs, each with local memory, plus shared memory)
  • 22. Overview: Memory  Unshared Memory  Message Passing  Actors  Shared Memory  Threads  Memory hierarchies / hybrid  Partitioned Global Address Space (PGAS)  Transactional Memory
  • 23. Sequential Algorithms  Random Access Machine (RAM)  Step by step, deterministic
        int sum = 0
        for i=0 to 4
          sum += mem[i]
        mem[5] = sum
      Memory: Addr 0–4 hold the values 3, 7, 5, 1, 2; Addr 5 holds the result 18
  • 24. Sequential Algorithms
        int sum = 0
        for i=0 to 4
          sum += mem[i]
      Memory trace: Addr 0–4 stay 3, 7, 5, 1, 2; the value at Addr 5 grows 0 → 3 → 10 → 15 → 16 → 18
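The RAM trace on slides 23–24 can be sketched in Python (a minimal sketch; the list `mem` stands in for the machine's addressable memory, with the slides' values at addresses 0–4 and the running sum at address 5):

```python
# Sequential sum on a RAM-like memory, mirroring slides 23-24.
# mem[0..4] hold the inputs; mem[5] accumulates the result.
mem = [3, 7, 5, 1, 2, 0]

trace = [mem[5]]          # value at address 5 after each step
for i in range(5):        # "for i = 0 to 4" on the slide
    mem[5] += mem[i]
    trace.append(mem[5])

print(trace)              # -> [0, 3, 10, 15, 16, 18]
print(mem[5])             # -> 18
```

The printed trace is exactly the sequence of values shown in the Addr 5 column on slide 24.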
  • 25. More than one CPU  How many programs should run?  One  In lock-step  All processors do the same  In any order  More than one  Distributed system
  • 26. Two Processors
      PC 1: int sum = 0; for i=0 to 2: sum += mem[i]; mem[5] = sum
      PC 2: int sum = 0; for i=3 to 4: sum += mem[i]; mem[5] = sum
       Lockstep   Memory Access!
      Memory: Addr 0–4 hold 3, 7, 5, 1, 2; Addr 5 holds 18
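The two-processor sum can be approximated with two threads, one per half of the data (a Python sketch, not true lockstep execution; combining the partial sums in the parent avoids the contended write to mem[5] that the slide flags with “Memory Access!”):

```python
from concurrent.futures import ThreadPoolExecutor

mem = [3, 7, 5, 1, 2, 0]

def partial_sum(lo, hi):
    # Each "processor" sums only its own slice of memory.
    s = 0
    for i in range(lo, hi + 1):
        s += mem[i]
    return s

with ThreadPoolExecutor(max_workers=2) as pool:
    f1 = pool.submit(partial_sum, 0, 2)   # PC 1: i = 0..2
    f2 = pool.submit(partial_sum, 3, 4)   # PC 2: i = 3..4
    mem[5] = f1.result() + f2.result()    # combine instead of racing on mem[5]

print(mem[5])  # -> 18
```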
  • 27. Flynn‘s Taxonomy  1966
      Single Data, Single Instruction: SISD  Single Data, Multiple Instruction: MISD
      Multiple Data, Single Instruction: SIMD  Multiple Data, Multiple Instruction: MIMD
  • 28. Flynn‘s Taxonomy  SISD  RAM, Von Neumann  SIMD  Lockstep, vector processor, GPU  MISD  Fault tolerance  MIMD  Distributed system
  • 29. Extension MIMD  How many programs?  SPMD  One program  Not in lockstep as in SIMD  MPMD  Many programs
  • 30. Processes & Threads  Process  Operating System  Address space  IPC  Heavy weight  Contains 1..* threads  Thread  Smallest unit of execution  Light weight
  • 31. Overview: Algorithms  Sequential  Parallel  Concurrent Overlap  Distributed  Randomized  Quantum
  • 32. Computer Science  Theoretical Computer Science  A long time before 2005  1989: Gibbons, Rytter  1990: Ben-Ari  1996: Lynch
  • 33. Gap: Theory and Practice  Galactic algorithms  Written for abstract machines  PRAM, special networks, etc.  Simplifying assumptions  No boundaries  Exact arithmetic  Infinite memory, network speed, etc.
  • 34. Sequential algorithms  Implementing a sequential algorithm  Machine architecture  Programming language  Performance  Processor, memory and cache speed  Boundary cases  Sometimes hard
  • 35. Parallel algorithms  Implementing a parallel algorithm  Adapt algorithm to architecture  No PRAM or sorting network!  Problems with shared memory  Synchronization  Harder!
  • 36. Parallelization  Transforming  a sequential  into a parallel algorithm  Tasks  Adapt to architecture  Rewrite  Test correctness w.r.t. the “golden” sequential code
  • 37. Granularity  “Size” of the threads?  How much computation?  Coarse vs. fine grain  Right choice  Important for good performance  Algorithm design
  • 38. Computational thinking  “… is the thought processes involved in formulating problems and their solutions so that the solutions are represented in a form that can be effectively carried out by an information-processing agent.” Cuny, Snyder, Wing 2010
  • 39. Computational thinking  “… is the new literacy of the 21st Century.” Cuny, Snyder, Wing 2010  Expert level needed for parallelization!
  • 40. Problems: Shared Memory  Destructive updates  i += 1  Parallel, independent processes  How do the others know that i increased?  Synchronization needed  Memory barrier  Complicated for beginners
  • 41. Problems: Shared Memory
      PC 1: int sum = 0; for i=0 to 2: sum += mem[i]; mem[5] = sum
      PC 2: int sum = 0; for i=3 to 4: sum += mem[i]; mem[5] = sum
       Which one first?
      Memory: Addr 0–4 hold 3, 7, 5, 1, 2; Addr 5 holds 18
  • 42. Problems: Shared Memory
      PC 1: int sum = 0; for i=0 to 2: sum += mem[i]; mem[5] = sum; sync()
      PC 2: int sum = 0; for i=3 to 4: sum += mem[i]; sync(); mem[5] += sum
       Synchronization needed
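The role of sync() on slide 42 can be played by a lock in a Python sketch: each worker computes its partial sum privately, and only the read-modify-write of mem[5] is serialized:

```python
import threading

mem = [3, 7, 5, 1, 2, 0]
lock = threading.Lock()

def worker(lo, hi):
    s = 0
    for i in range(lo, hi + 1):
        s += mem[i]
    # Critical section: only one thread at a time may update mem[5],
    # standing in for the sync() on the slide.
    with lock:
        mem[5] += s

t1 = threading.Thread(target=worker, args=(0, 2))   # PC 1
t2 = threading.Thread(target=worker, args=(3, 4))   # PC 2
t1.start(); t2.start()
t1.join(); t2.join()

print(mem[5])  # -> 18
```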
  • 43. Problems: Shared Memory  The memory barrier  When is a value read or written?  Optimizing compilers change semantics  int a = b + 5  Read b  Add 5 to b, store temporary in c  Write c to a  Solutions (Java)  volatile  java.util.concurrent.atomic
  • 44. Problems: Shared Memory  Thread safety  Reentrant code class X { int x; void inc() { x+=1; } }
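The `class X` on slide 44 is not thread-safe because `x += 1` is a read-modify-write, not an atomic step. One way to fix it, sketched in Python (the slide's example is Java; the lock-per-object pattern shown here is a common remedy, not the slide's own code):

```python
import threading

class X:
    """Thread-safe counter; the slide's bare x += 1 is not atomic."""
    def __init__(self):
        self.x = 0
        self._lock = threading.Lock()

    def inc(self):
        with self._lock:      # read, add, write as one indivisible step
            self.x += 1

x = X()
threads = [threading.Thread(target=lambda: [x.inc() for _ in range(1000)])
           for _ in range(4)]
for t in threads: t.start()
for t in threads: t.join()
print(x.x)  # -> 4000, never less
```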
  • 45. Problems: Threads  Deadlock  A wants B, B wants A, both waiting  Starvation  A wants B, but never gets it  Race condition  A writes to mem, B reads/writes mem
  • 46. Shared Mem: Solutions  Shared mutable state  Synchronize properly  Isolated mutable state  Don‘t share state  Immutable or unshared  Don‘t mutate state!
  • 47. Solutions  Transactional Memory  Every access within transaction  See databases  Actor models  Message passing  Immutable state / pure functional
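The actor alternative on this slide keeps all mutable state inside one actor and communicates only by messages. A minimal Python sketch with a queue as the mailbox (the summing actor is a hypothetical example, not from the slides):

```python
import threading
import queue

# A minimal "actor": a thread that owns its state (total) and is driven
# only by messages on its mailbox -- no shared mutable state.
mailbox = queue.Queue()
result = queue.Queue()

def summing_actor():
    total = 0
    while True:
        msg = mailbox.get()
        if msg is None:          # poison pill: report the total and stop
            result.put(total)
            return
        total += msg

actor = threading.Thread(target=summing_actor)
actor.start()
for value in [3, 7, 5, 1, 2]:
    mailbox.put(value)
mailbox.put(None)
actor.join()

answer = result.get()
print(answer)  # -> 18
```

Because only the actor thread ever touches `total`, no locks are needed.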
  • 48. Speedup and Efficiency  Running time  T(1) with one processor  T(n) with n processors  Speedup  How much faster?  S(n) = T(1) / T(n)
  • 49. Speedup and Efficiency  Efficiency  Are all the processors used?  E(n) = S(n) / n = T(1) / (n * T(n))
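The definitions on slides 48–49 translate directly into code; the timings below (10 s sequential, 3 s on 4 processors) are made-up illustration values:

```python
def speedup(t1, tn):
    # S(n) = T(1) / T(n)
    return t1 / tn

def efficiency(t1, tn, n):
    # E(n) = S(n) / n = T(1) / (n * T(n))
    return speedup(t1, tn) / n

print(speedup(10.0, 3.0))        # ~3.33x faster
print(efficiency(10.0, 3.0, 4))  # ~0.83: the 4 processors are 83% utilized
```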
  • 52. Amdahl‘s Law  Corollary  Maximize the parallel part  Only parallelize when the parallel part is large enough
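Amdahl's law is commonly stated as S(n) = 1 / ((1 − p) + p/n), where p is the parallelizable fraction of the program. A small Python sketch showing why maximizing the parallel part matters:

```python
def amdahl_speedup(p, n):
    """Upper bound on speedup with parallel fraction p on n processors:
    S(n) = 1 / ((1 - p) + p / n)."""
    return 1.0 / ((1.0 - p) + p / n)

# Even with 95% parallel code, the speedup is capped at 1/(1-p) = 20,
# no matter how many processors are added.
for n in (2, 8, 1024):
    print(n, amdahl_speedup(0.95, n))
```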
  • 53. P-Completeness  Is there an efficient parallel version for every algorithm?  No! Hardly parallelizable problems  P-Completeness  Example Circuit-Value-Problem (CVP)
  • 55. Optimization  What can I achieve?  When do I stop?  How many threads should I use?
  • 56. Optimization  I/O bound  Thread is waiting for memory, disk, etc.  Computation bound  Thread is calculating the whole time  Watch processor utilization!
  • 57. Optimization  I/O bound  Use asynchronous/non-blocking I/O  Increase number of threads  Computation bound  Number of threads = Number of cores
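The thread-count rules on slides 56–57 can be captured in a small helper (a sketch; the `blocking_factor` for I/O-bound work is a hypothetical tuning knob, not a fixed rule from the slides):

```python
import os

def suggested_threads(io_bound, cores=None, blocking_factor=4):
    # Rule of thumb from the slides: compute-bound work wants one thread
    # per core; I/O-bound work wants more threads, since most are waiting.
    cores = cores or os.cpu_count() or 1
    return cores * blocking_factor if io_bound else cores

print(suggested_threads(io_bound=False, cores=8))  # -> 8
print(suggested_threads(io_bound=True, cores=8))   # -> 32
```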
  • 58. Processors  Multicore CPU  Graphical Processing Unit (GPU)  Field-Programmable Gate Array (FPGA)
  • 59. GPU Computing  Finer granularity than CPU  Specialized processors  512 cores on a Fermi  High memory bandwidth 192 GB/sec
  • 60. CPU vs. GPU  Source: SGI
  • 61. FPGA  Configurable hardware circuits  Programmed in Verilog, VHDL  Now: OpenCL  Much higher level of abstraction  Under development, promising  No performance test results (2011/12)
  • 62. Networks / Cluster  Combination of CPU  Memory  Network  GPU  FPGA  Vast possibilities
  • 63. Example  2 x connected by network  2 CPU each with local cache  Global memory (diagram: a network joining two nodes, each with two CPUs, local memories and a global memory)
  • 64. Example  1 CPU with local cache  Connected by shared memory  2 GPU with local memory (“device”)
  • 65. Next Step: Hybrid  Hybrid / Heterogeneous  Multi-Core / Many-Core  Plus special purpose hardware  GPU  FPGA
  • 66. Optimal combination?  Which network gives the best performance?  Complicated  Technical restrictions  4x PCI-Express 16x Motherboards  Power consumption  Cooling
  • 67. Example: K-Computer  SPARC64 VIIIfx 2.0GHz  705024 Cores  10.51 Petaflop/s  No GPUs  #1 2011
  • 68. Example: Tianhe-1A  14336 Xeon X5670  7168 Tesla M2050  2048 NUDT FT1000  2.57 petaflop/s  #2 2011
  • 69. Example: HPC at home  Workstations and blades  8 x 512 cores = 4096 cores
  • 70. Frameworks: Shared Mem  C/C++  OpenMP  POSIX Threads (pthreads)  Intel Thread Building Blocks  Windows Threads  Java  java.util.concurrent
  • 71. Frameworks: Actors  C/C++  Theron  Java / JVM  Akka  Scala  GPars (Groovy)
  • 72. GPU Computing  NVIDIA CUDA  NVIDIA  OpenCL  AMD  NVIDIA  Intel  Altera  Apple  WebCL  Nokia  Samsung
  • 73. Advanced courses  Best practices for concurrency in Java  Java‘s java.util.concurrent  Actor models  Transactional Memory  See http://www.dinkla.com
  • 74. Advanced courses  GPU Computing  NVIDIA CUDA  OpenCL  Using NVIDIA CUDA with Java  Using OpenCL with Java  See http://www.dinkla.com
  • 75. References: Practice  Mattson, Sanders, Massingill  Patterns for Parallel Programming  Breshears  The Art of Concurrency
  • 76. References: Practice  Pacheco  An Introduction to Parallel Programming  Herlihy, Shavit  The Art of Multiprocessor Programming
  • 77. References: Theory  Gibbons, Rytter  Efficient Parallel Algorithms  Lynch  Distributed Algorithms  Ben-Ari  Principles of Concurrent and Distributed Programming
  • 78. References: GPU Computing  Scarpino  OpenCL in Action  Sanders, Kandrot  CUDA by Example
  • 79. References: Background  Hennessy, Patterson  Computer Architecture: A Quantitative Approach