Introduction to
Parallel Computing


     Jörn Dinkla
     http://www.dinkla.com

          Version 1.1
Dipl.-Inform. Jörn Dinkla
 Java (J2SE, JEE)
 Programming Languages
   Scala, Groovy, Haskell
 Parallel Computing
   GPU Computing
 Model driven
 Eclipse-Plugins
Overview
 Progress in computing
 Traditional Hard- and Software
 Theoretical Computer Science
   Algorithms
   Machines
   Optimization
 Parallelization
 Parallel Hard- and Software
Progress in Computing
1. New applications
   Not feasible before
   Not needed before
   Not possible before
2. Better applications
    Faster
    More data
    Better quality
      precision, accuracy, exactness
Progress in Computing
 Two ingredients
   Hardware
     Machine(s) to execute program
   Software
     Model / language to formulate program
     Libraries
     Methods
How was progress achieved?
 Hardware
   CPU, memory, disks, networks
   Faster and larger
 Software
   New and better algorithms
   Programming methods and languages
Traditional Hardware
 Von Neumann architecture
   [Diagram: CPU, I/O and Memory connected by a bus; a cache sits between CPU and memory]
 John Backus 1977
   “von Neumann bottleneck”
Improvements
   Increasing Clock Frequency
   Memory Hierarchy / Cache
   Parallelizing ALU
   Pipelining
   Very Long Instruction Words (VLIW)
   Instruction-Level Parallelism (ILP)
   Superscalar processors
   Vector data types
   Multithreaded
   Multicore / Manycore
Moore's law
 Guaranteed until 2020
Clock frequency
 No increase since 2005
Physical Limits
 Increase of clock frequency
   >>> Energy consumption
   >>> Heat dissipation
 Limit to transistor size

   Faster processors impossible !?!
2005
“The Free Lunch Is Over:
   A Fundamental Turn Toward
   Concurrency in Software”

       Herb Sutter
       Dr. Dobb’s Journal, March 2005
Multicore
 Transistor count
    Doubles every 2-3 years
 Calculation speed
    No increase

  Multicore

 Efficient?
How to use the cores?
 Multi-Tasking OS
   Different tasks
 Speeding up same task
     Assume 2 CPUs
     Problem is divided in half
     Each CPU calculates a half
     Time taken is half of the original time?
Traditional Software
 Computation is expressed as an “algorithm”
    “a step-by-step procedure for calculations”
    algorithm = logic + control
 Example
   1.   Open file
   2.   For all records in the file
        1.   Add the salary
   3.   Close file
   4.   Print out the sum of the salaries

 Keywords
    Sequential, Serial, Deterministic
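
For illustration, a minimal sequential Java sketch of this example (the file name and the record format are assumed here):

import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;

// Sequential version of the example: read every record, add up the salaries.
// Assumes a simple CSV-like file with the salary in the second column.
public class SalarySum {
    public static void main(String[] args) throws IOException {
        long sum = 0;
        try (BufferedReader in = new BufferedReader(new FileReader("salaries.csv"))) { // 1. open file
            String line;
            while ((line = in.readLine()) != null) {                // 2. for all records
                sum += Long.parseLong(line.split(",")[1].trim());   //    add the salary
            }
        }                                                           // 3. file closed by try-with-resources
        System.out.println("Sum of salaries: " + sum);              // 4. print the sum
    }
}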
Traditional Software
 Improvements
   Better algorithms
   Programming languages (OO)
   Development methods (agile)
 Limits
   Theoretical Computer Science
   Complexity theory (NP, P, NC)
Architecture
 Simplification: ignore the bus
   [Diagram: the Von Neumann machine (CPU, I/O, Memory, Bus) simplified to a node with CPU, I/O and Memory]
More than one CPU?
 How should they communicate?
   [Diagram: two nodes, each with its own CPU, I/O and Memory]
Message Passing
 Distributed system
 Loose coupling
   [Diagram: two nodes (CPU, I/O, Memory) exchanging messages over a network]
Shared Memory
 Shared memory
 Tight coupling
   [Diagram: two CPUs attached to the same memory, each with its own I/O]
Shared Memory
 Global vs. local
 Memory hierarchy
   [Diagram: two nodes (CPU, I/O, local memory) connected to an additional shared memory]
Overview: Memory
 Unshared Memory
   Message Passing
   Actors
 Shared Memory
   Threads
 Memory hierarchies / hybrid
    Partitioned Global Address Space (PGAS)
 Transactional Memory
Sequential Algorithms
 Random Access Machine (RAM)
   Step by step, deterministic

   PC →  int sum = 0
         for i=0 to 4
           sum += mem[i]
         mem[5] = sum

   Memory:   Addr   0  1  2  3  4  5
             Value  3  7  5  1  2  18
Sequential Algorithms
int sum = 0
for i=0 to 4
  sum += mem[i]

Memory trace: mem[0..4] = 3, 7, 5, 1, 2 stays fixed; mem[5] holds the running sum:

  after:    init  i=0  i=1  i=2  i=3  i=4
  mem[5]:    0     3   10   15   16   18
More than one CPU
 How many programs should run?
   One
     In lock-step
        All processors do the same
     In any order
   More than one
     Distributed system
Two Processors
PC 1   int sum = 0             PC 2   int sum = 0
       for i=0 to 2                   for i=3 to 4
         sum += mem[i]                  sum += mem[i]
       mem[5] = sum                   mem[5] = sum

 Lockstep
 Memory access!

   Memory:   Addr   0  1  2  3  4  5
             Value  3  7  5  1  2  18
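
A minimal Java sketch of this two-processor split (not in lockstep; the shared result is updated with an atomic add so the two writes cannot clash):

import java.util.concurrent.atomic.AtomicLong;

// Two threads each sum half of the array and add their partial result
// to a shared total.
public class TwoThreadSum {
    public static void main(String[] args) throws InterruptedException {
        int[] mem = {3, 7, 5, 1, 2};
        AtomicLong total = new AtomicLong(); // shared "mem[5]", updated atomically

        Thread t1 = new Thread(() -> {
            long sum = 0;
            for (int i = 0; i <= 2; i++) sum += mem[i]; // first half
            total.addAndGet(sum);
        });
        Thread t2 = new Thread(() -> {
            long sum = 0;
            for (int i = 3; i <= 4; i++) sum += mem[i]; // second half
            total.addAndGet(sum);
        });

        t1.start(); t2.start();
        t1.join(); t2.join();
        System.out.println(total.get()); // 18
    }
}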
Flynn's Taxonomy
 1966

                      Instruction
                  Single     Multiple
  Data  Single     SISD       MISD
        Multiple   SIMD       MIMD
Flynn's Taxonomy
 SISD
   RAM, Von Neumann
 SIMD
   Lockstep, vector processor, GPU
 MISD
   Fault tolerance
 MIMD
   Distributed system
Extension MIMD
 How many programs?

 SPMD
   One program
   Not in lockstep as in SIMD
 MPMD
   Many programs
Processes & Threads
 Process
   Operating System
      Address space
      IPC
   Heavy weight
   Contains 1..* threads
 Thread
   Smallest unit of execution
   Light weight
Overview: Algorithms
   Sequential
   Parallel
   Concurrent    Overlap
   Distributed
   Randomized
   Quantum
Computer Science
 Theoretical Computer Science
     A long time before 2005
     1989: Gibbons, Rytter
     1990: Ben-Ari
     1996: Lynch
Gap: Theory and Practice
 Galactic algorithms
 Written for abstract machines
   PRAM, special networks, etc.
 Simplifying assumptions
   No boundaries
   Exact arithmetic
   Infinite memory, network speed, etc.
Sequential algorithms
 Implementing a sequential algorithm
   Machine architecture
   Programming language
   Performance
     Processor, memory and cache speed
   Boundary cases
   Sometimes hard
Parallel algorithms
 Implementing a parallel algorithm
   Adapt algorithm to architecture
      No PRAM or sorting network!
   Problems with shared memory
   Synchronization
   Harder!
Parallelization
 Transforming
   a sequential
   into a parallel algorithm

 Tasks
   Adapt to architecture
   Rewrite
   Test correctness w.r.t. the “golden” sequential code
Granularity
 “Size” of the threads?
   How much computation?
 Coarse vs. fine grain
 Right choice
   Important for good performance
   Algorithm design
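
A sketch of how granularity appears in code: chunkSize controls the grain (large chunks are coarse-grained, small chunks fine-grained). The names and numbers are only illustrative:

import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

public class Granularity {
    // Sequential sum of a[from, to)
    static long sumRange(int[] a, int from, int to) {
        long s = 0;
        for (int i = from; i < to; i++) s += a[i];
        return s;
    }

    // One task per chunk of chunkSize elements.
    static long parallelSum(int[] a, int chunkSize) throws Exception {
        ExecutorService pool = Executors.newFixedThreadPool(
                Runtime.getRuntime().availableProcessors());
        List<Future<Long>> parts = new ArrayList<>();
        for (int from = 0; from < a.length; from += chunkSize) {
            final int f = from, t = Math.min(from + chunkSize, a.length);
            parts.add(pool.submit(() -> sumRange(a, f, t)));
        }
        long total = 0;
        for (Future<Long> p : parts) total += p.get();
        pool.shutdown();
        return total;
    }

    public static void main(String[] args) throws Exception {
        int[] data = new int[1_000_000];
        java.util.Arrays.fill(data, 1);
        System.out.println(parallelSum(data, 100_000)); // coarse grain: 10 tasks
        System.out.println(parallelSum(data, 1_000));   // fine grain: 1000 tasks
    }
}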
Computational thinking
 “… is the thought processes involved
  in formulating problems and their
  solutions so that the solutions are
  represented in a form that can be
  effectively carried out by an
  information-processing agent.”
              Cuny, Snyder, Wing 2010
Computational thinking
 “… is the new literacy of the 21st
  Century.”
               Cuny, Snyder, Wing 2010



 Expert level needed for parallelization!
Problems: Shared Memory
 Destructive updates
   i += 1
 Parallel, independent processes
    How do the others know that i increased?
   Synchronization needed
      Memory barrier
      Complicated for beginners
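
A minimal Java sketch of the problem: two threads increment a shared counter without synchronization, and updates get lost.

// Two threads each increment a shared counter 1,000,000 times.
// i += 1 is a read-modify-write; without synchronization the steps
// interleave and the final value is usually well below 2,000,000.
public class LostUpdate {
    static int i = 0;

    public static void main(String[] args) throws InterruptedException {
        Runnable work = () -> {
            for (int n = 0; n < 1_000_000; n++) {
                i += 1; // not atomic: read i, add 1, write i
            }
        };
        Thread a = new Thread(work), b = new Thread(work);
        a.start(); b.start();
        a.join(); b.join();
        System.out.println(i); // typically < 2,000,000
    }
}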
Problems: Shared Memory

PC 1   int sum = 0             PC 2   int sum = 0
       for i=0 to 2                   for i=3 to 4
         sum += mem[i]                  sum += mem[i]
       mem[5] = sum                   mem[5] = sum

 Which one first?

   Memory:   Addr   0  1  2  3  4  5
             Value  3  7  5  1  2  18
Problems: Shared Memory

PC 1   int sum = 0             PC 2   int sum = 0
       for i=0 to 2                   for i=3 to 4
         sum += mem[i]                  sum += mem[i]
       mem[5] = sum
       sync()                         sync()
                                      mem[5] += sum

 Synchronization needed
Problems: Shared Memory
 The memory barrier
    When is a value read or written?
    Optimizing compilers change semantics

 int a = b + 5
    Read b
    Add 5 to b, store temporary in c
    Write c to a

 Solutions (Java)
    volatile
    java.util.concurrent.atomic
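
A sketch of the two Java remedies named above: volatile guarantees visibility across threads (the memory-barrier problem), AtomicInteger additionally makes the read-modify-write atomic.

import java.util.concurrent.atomic.AtomicInteger;

public class VisibilityAndAtomicity {
    // volatile: every read sees the latest write,
    // but counter += 1 on a plain int would still not be atomic.
    static volatile boolean stop = false;

    // AtomicInteger: incrementAndGet() is both visible and atomic.
    static final AtomicInteger counter = new AtomicInteger();

    public static void main(String[] args) throws InterruptedException {
        Thread worker = new Thread(() -> {
            while (!stop) {               // without volatile this loop might never see stop == true
                counter.incrementAndGet();
            }
        });
        worker.start();
        Thread.sleep(100);                // let it run briefly
        stop = true;                      // visible to the worker thanks to volatile
        worker.join();
        System.out.println(counter.get());
    }
}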
Problems: Shared Memory
 Thread safety
 Reentrant code

  // Not thread safe: x += 1 is a read-modify-write, so concurrent
  // calls to inc() can lose updates.
  class X {
    int x;
    void inc() { x += 1; }
  }
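
One possible thread-safe variant (a sketch; synchronized is the simplest fix, an AtomicInteger field would be an alternative):

  class SafeX {
    private int x;
    // synchronized makes inc() atomic and provides the needed memory barrier;
    // the intrinsic lock is also reentrant, so the method can be called from
    // other synchronized methods of the same object.
    synchronized void inc() { x += 1; }
    synchronized int get() { return x; }
  }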
Problems: Threads
 Deadlock
   A wants B, B wants A, both waiting
 Starvation
   A wants B, but never gets it
 Race condition
   A writes to mem, B reads/writes mem
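
A minimal deadlock sketch for the first bullet: each thread holds one lock and waits for the other, so both wait forever.

public class DeadlockDemo {
    static final Object lockA = new Object();
    static final Object lockB = new Object();

    public static void main(String[] args) {
        new Thread(() -> {
            synchronized (lockA) {           // holds A
                sleep(100);
                synchronized (lockB) { }     // wants B -> waits forever
            }
        }).start();
        new Thread(() -> {
            synchronized (lockB) {           // holds B
                sleep(100);
                synchronized (lockA) { }     // wants A -> waits forever
            }
        }).start();
        // This program hangs by design.
    }

    static void sleep(long ms) {
        try { Thread.sleep(ms); } catch (InterruptedException e) { }
    }
}

The usual remedy is to acquire locks in one fixed global order.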
Shared Mem: Solutions
 Shared mutable state
   Synchronize properly


 Isolated mutable state
   Don't share state


 Immutable or unshared
   Don't mutate state!
Solutions
 Transactional Memory
   Every access within transaction
   See databases
 Actor models
   Message passing
 Immutable state / pure functional
Speedup and Efficiency
 Running time
   T(1) with one processor
    T(n) with n processors
 Speedup
   How much faster?
   S(n) = T(1) / T(n)
Speedup and Efficiency
 Efficiency
   Are all the processors used?
   E(n) = S(n) / n = T(1) / (n * T(n))
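
A small worked example with assumed timings: T(1) = 100 s sequentially and T(4) = 40 s on four processors give

  S(4) = T(1) / T(4) = 100 / 40 = 2.5
  E(4) = S(4) / 4 = 0.625

so only about 62.5 % of the four processors' capacity is effectively used.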
Amdahl's Law
  [Figure slides: Amdahl's Law]
Amdahl's Law
 Corollary
   Maximize the parallel part
   Only parallelize when parallel part is large
    enough
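
In formula form (the standard statement of Amdahl's Law, with p the fraction of the running time that can be parallelized and n processors):

  S(n) = 1 / ((1 - p) + p / n)
  For n → ∞:  S(n) → 1 / (1 - p)

Even with p = 0.9 the speedup can never exceed 10, which is why the parallel part must be maximized.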
P-Completeness
 Is there an efficient parallel version for
  every algorithm?
   No! Hardly parallelizable problems
   P-Completeness
   Example Circuit-Value-Problem (CVP)
P-Completeness
  [Figure slide]
Optimization
 What can I achieve?
 When do I stop?
 How many threads should I use?
Optimization
 I/O bound
   Thread is waiting for memory, disk, etc.
 Computation bound
   Thread is calculating the whole time

 Watch processor utilization!
Optimization
 I/O bound
   Use asynchronous/non-blocking I/O
   Increase number of threads
 Computation bound
   Number of threads = Number of cores
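
A Java sketch of these sizing rules (the multiplier for the I/O pool is only a rule of thumb, not a fixed constant):

import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

public class PoolSizing {
    public static void main(String[] args) {
        int cores = Runtime.getRuntime().availableProcessors();

        // Computation bound: one thread per core is usually enough.
        ExecutorService cpuPool = Executors.newFixedThreadPool(cores);

        // I/O bound: threads spend most of their time waiting, so more
        // threads than cores (or non-blocking I/O) keep the CPUs busy.
        ExecutorService ioPool = Executors.newFixedThreadPool(cores * 4);

        System.out.println("cores = " + cores);
        cpuPool.shutdown();
        ioPool.shutdown();
    }
}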
Processors
 Multicore CPU
 Graphics Processing Unit (GPU)
 Field-Programmable Gate Array
  (FPGA)
GPU Computing
 Finer granularity than CPU
   Specialized processors
   512 cores on a Fermi
 High memory bandwidth 192 GB/sec
CPU vs. GPU
  [Figure: CPU vs. GPU architecture comparison. Source: SGI]
FPGA
 Configurable hardware circuits
 Programmed in Verilog, VHDL
 Now: OpenCL
   Much higher level of abstraction
 Under development, promising
 No performance test results
   (2011/12)
Networks / Cluster
 Combination of
    CPU
    Memory
    Network
    GPU
    FPGA
   [Diagram: two nodes, each combining CPU, memory, GPU and FPGA, connected by a network]
 Vast possibilities
Example
 2 nodes connected by a network
    2 CPUs each, with local caches
    Global memory
   [Diagram: a network connecting two nodes; each node has two CPUs with local memories/caches and a shared global memory]
Example
 1 CPU with local cache
 Connected by shared memory
    2 GPUs with local (“device”) memory
   [Diagram: a CPU with its memory and two GPUs, each with its own device memory]
Next Step: Hybrid
 Hybrid / Heterogeneous
   Multi-Core / Many-Core
   Plus special purpose hardware
     GPU
     FPGA
Optimal combination?
 Which network gives the best
  performance?
   Complicated
   Technical restrictions
       Motherboards with 4 × PCI-Express x16 slots
      Power consumption
      Cooling
Example: K-Computer
    SPARC64 VIIIfx, 2.0 GHz
    705,024 cores
    10.51 petaflop/s
    No GPUs

 #1 2011
Example: Tianhe-1A
   14336 Xeon X5670
   7168 Tesla M2050
   2048 NUDT FT1000
   2.57 petaflop/s

 #2 2011
Example: HPC at home
 Workstations and blades
   8 x 512 cores = 4096 cores
Frameworks: Shared Mem
 C/C++
     OpenMP
     POSIX Threads (pthreads)
      Intel Threading Building Blocks (TBB)
     Windows Threads
 Java
   java.util.concurrent
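
As an example of java.util.concurrent, a fork/join version of the array sum used earlier (the threshold value is an arbitrary choice):

import java.util.concurrent.ForkJoinPool;
import java.util.concurrent.RecursiveTask;

// Divide-and-conquer sum with the fork/join framework from java.util.concurrent.
public class ForkJoinSum extends RecursiveTask<Long> {
    private static final int THRESHOLD = 10_000; // below this, sum sequentially
    private final int[] a;
    private final int from, to; // [from, to)

    ForkJoinSum(int[] a, int from, int to) {
        this.a = a; this.from = from; this.to = to;
    }

    @Override
    protected Long compute() {
        if (to - from <= THRESHOLD) {
            long s = 0;
            for (int i = from; i < to; i++) s += a[i];
            return s;
        }
        int mid = (from + to) / 2;
        ForkJoinSum left = new ForkJoinSum(a, from, mid);
        ForkJoinSum right = new ForkJoinSum(a, mid, to);
        left.fork();                          // run the left half asynchronously
        return right.compute() + left.join(); // compute the right half, then join
    }

    public static void main(String[] args) {
        int[] data = new int[1_000_000];
        java.util.Arrays.fill(data, 1);
        long sum = ForkJoinPool.commonPool().invoke(new ForkJoinSum(data, 0, data.length));
        System.out.println(sum); // 1000000
    }
}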
Frameworks: Actors
 C/C++
   Theron
 Java / JVM
   Akka
   Scala
   GPars (Groovy)
GPU Computing
 NVIDIA CUDA
   NVIDIA
 OpenCL
     AMD
     NVIDIA
     Intel
     Altera
     Apple
 WebCL
   Nokia
   Samsung
Advanced courses
 Best practices for concurrency in Java
    Java's java.util.concurrent
   Actor models
   Transactional Memory


 See http://www.dinkla.com
Advanced courses
 GPU Computing
     NVIDIA CUDA
     OpenCL
     Using NVIDIA CUDA with Java
     Using OpenCL with Java
 See http://www.dinkla.com
References: Practice
 Mattson, Sanders, Massingill
   Patterns for
    Parallel Programming
 Breshears
   The Art of Concurrency
References: Practice
 Pacheco
   An Introduction to
    Parallel Programming
 Herlihy, Shavit
   The Art of
    Multiprocessor Programming
References: Theory
 Gibbons, Rytter
   Efficient Parallel Algorithms
 Lynch
   Distributed Algorithms
 Ben-Ari
   Principles of Concurrent and
    Distributed Programming
References: GPU Computing
 Scarpino
   OpenCL in Action


 Sanders, Kandrot
   CUDA by Example
References: Background
 Hennessy, Patterson
   Computer Architecture: A Quantitative
    Approach
