Irregular Parallelism on the GPU:
      Algorithms and Data Structures

                       John Owens
Associate Professor, Electrical and Computer Engineering
      Institute for Data Analysis and Visualization
      SciDAC Institute for Ultrascale Visualization
              University of California, Davis
Outline
•   Irregular parallelism

    •   Get a new data structure! (sparse matrix-vector)

    •   Reuse a good data structure! (composite)

    •   Get a new algorithm! (hash tables)

    •   Convert serial to parallel! (task queue)

    •   Concentrate on communication not computation!
        (MapReduce)

•   Broader thoughts
Sparse Matrix-Vector Multiply
Sparse Matrix-Vector Multiply: What’s Hard?


 •     Dense approach is wasteful

 •     Unclear how to map work to parallel processors

 •     Irregular data access




     “Implementing Sparse Matrix-Vector Multiplication on Throughput-Oriented Processors”, Bell & Garland, SC ’09
Sparse Matrix Formats

[Figure: sparse matrix formats arranged from structured to unstructured:
(DIA) Diagonal, (ELL) ELLPACK, (CSR) Compressed Row, (HYB) Hybrid, (COO) Coordinate]
Diagonal Matrices

[Figure: a diagonal (DIA) matrix stored by diagonals, with offsets -2, 0, +1]

We’ve recently done work on tridiagonal systems—PPoPP ’10
•   Diagonals should be mostly populated

•   Map one thread per row

    •   Good parallel efficiency

    •   Good memory behavior [column-major storage]
Irregular Matrices: ELL

[Figure: ELL stores each row’s column indices in a dense rows × max-row-length
array; shorter rows are padded out to the longest row]

•   Assign one thread per row again

•   But now:

    •   Load imbalance hurts parallel efficiency (see the thread-per-row
        kernel sketch below)
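
A minimal thread-per-row ELL kernel, sketched under the assumptions above:
column-major storage of the padded arrays, column index -1 as the padding marker,
and illustrative names (ell_cols, ell_vals) that come from neither the slides nor
the Bell & Garland code.

// One thread computes one row of y = A*x for an ELL-format matrix.
__global__ void spmv_ell(int num_rows, int max_cols,
                         const int   *ell_cols,   // num_rows * max_cols, padded with -1
                         const float *ell_vals,   // num_rows * max_cols, padded with 0
                         const float *x, float *y)
{
    int row = blockIdx.x * blockDim.x + threadIdx.x;
    if (row >= num_rows) return;

    float sum = 0.0f;
    for (int j = 0; j < max_cols; ++j) {
        // Column-major layout: consecutive threads read consecutive addresses.
        int   col = ell_cols[j * num_rows + row];
        float val = ell_vals[j * num_rows + row];
        if (col >= 0) sum += val * x[col];
    }
    y[row] = sum;
}

A long row forces every other thread in the warp to iterate (and read padding) for
just as many steps, which is exactly the load-imbalance problem noted above.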
Irregular Matrices: COO

[Figure: COO stores one (row, column, value) triple per nonzero—no padding!]


•   General format; insensitive to sparsity pattern, but ~3x
    slower than ELL

•   Assign one thread per element, combine results from all
    elements in a row to get the output element

    •   Requires a segmented reduction and communication between threads
        (a simplified sketch follows below)
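
The slides combine per-element products with a segmented reduction; the sketch
below substitutes a per-element atomicAdd, which is easier to read but is not the
paper's scheme. Array names are illustrative.

// One thread per nonzero; y must be zeroed before launch.
__global__ void spmv_coo_atomic(int num_nonzeros,
                                const int   *coo_rows,
                                const int   *coo_cols,
                                const float *coo_vals,
                                const float *x, float *y)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= num_nonzeros) return;
    // Accumulate this element's contribution into its row of the output.
    atomicAdd(&y[coo_rows[i]], coo_vals[i] * x[coo_cols[i]]);
}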
Thread-per-{element,row}
Irregular Matrices: HYB

[Figure: each row’s “typical” nonzeros (up to a fixed width) go into an ELL part;
the “exceptional” overflow entries go into a COO part]
•   Combine regularity of ELL + flexibility of COO (one possible container is
    sketched below)
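
One possible container for the hybrid format, assuming the ELL and COO parts look
like the sketches above; field names are illustrative. Hybrid SpMV then simply
runs the ELL kernel on the regular part and the COO kernel on the overflow.

// A hypothetical HYB matrix: dense "typical" part plus a short "exceptional" list.
struct HybMatrix {
    int    num_rows;
    // ELL part: the typical nonzeros, padded to a fixed width per row.
    int    ell_width;
    int   *ell_cols;    // num_rows * ell_width
    float *ell_vals;
    // COO part: the exceptional entries that overflow the ELL width.
    int    num_coo;
    int   *coo_rows, *coo_cols;
    float *coo_vals;
};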
SpMV: Summary

•   Ample parallelism for large matrices

    •   Structured matrices (dense, diagonal): straightforward

    •   Sparse matrices: Tradeoff between parallel efficiency and
        load balance

•   Take-home message: Use data structure appropriate to
    your matrix

    •   Insight: Irregularity is manageable if you regularize the
        common case
Compositing
The Programmable Pipeline

Input Assembly → Tess. Shading → Vertex Shading → Geom. Shading → Raster → Frag. Shading → Compose

Split → Dice → Shading → Sampling → Composition

Ray Generation → Ray Traversal → Ray-Primitive Intersection → Shading
Composition

[Figure: Fragment Generation produces Fragments (position, depth, color);
Composite turns them into Subpixels (position, color); Filter produces the Final
pixels. Inset: a pixel with subpixel sample locations S1–S4]

Anjul Patney, Stanley Tzeng, and John D. Owens. Fragment-Parallel Composite and
Filter. Computer Graphics Forum (Proceedings of the Eurographics Symposium on
Rendering), 29(4):1251–1258, June 2010.
Why?
Pixel-Parallel Composition

[Figure: fragments grouped by pixel; within Pixel i and Pixel i+1, each pixel’s
subpixels (Subpixel 0, Subpixel 1) are composited sequentially by one thread per
pixel]
Sample-Parallel Composition

[Figure: fragments laid out per subpixel across Pixel i and Pixel i+1; a Segmented
Scan composites within each subpixel’s fragment segment, and a Segmented Reduction
produces one value per subpixel]
M randomly-distributed fragments
Nested Data Parallelism

[Figure: the COO sparse matrix and the sample-parallel composition pipeline shown
side by side: both are variable-length segments processed with a Segmented Scan
and a Segmented Reduction]
Hash Tables
Hash Tables & Sparsity




 Lefebvre and Hoppe, Siggraph 2006
Scalar Hashing

[Figure: three classic collision-resolution schemes for a key: Linear Probing,
Double Probing (a second hash function), and Chaining]
Scalar Hashing: Parallel Problems

[Figure: the same probing and chaining schemes as above]
•   Construction and Lookup

    •   Variable time/work per entry

•   Construction

    •   Synchronization / shared access to data structure
Parallel Hashing: The Problem
•   Hash tables are good for sparse data.

•   Input: Set of key-value pairs to place in the hash table

•   Output: Data structure that allows:

    •   Determining if key has been placed in hash table

    •   Given the key, fetching its value

•   Could also:

    •   Sort key-value pairs by key (construction)

    •   Binary-search sorted list (lookup)

•   Recalculate at every change
Parallel Hashing: What We Want

•   Fast construction time

•   Fast access time

    •   O(1) for any element, O(n) for n elements in parallel

•   Reasonable memory usage

•   Algorithms and data structures may sit at different places in
    this space

    •   Perfect spatial hashing has good lookup times and memory
        usage but is very slow to construct
Level 1: Distribute into buckets

[Figure: each key is hashed (h) to a bucket id; an atomic add per key yields its
local offset within the bucket and the bucket sizes (8, 5, 6, 8); a prefix sum
over the sizes gives the global offsets (0, 8, 13, 19), and the keys are scattered
into their buckets. A kernel sketch of this step follows the citation below.]
Dan A. Alcantara, Andrei Sharf, Fatemeh Abbasinejad, Shubhabrata Sengupta, Michael Mitzenmacher, John D. Owens, and
Nina Amenta. Real-Time Parallel Hashing on the GPU. Siggraph Asia 2009 (ACM TOG 28(5)), pp. 154:1–154:9, Dec. 2009.
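
A sketch of the Level-1 distribution step described above, assuming pre-zeroed
per-bucket counters and a placeholder hash; none of the names or constants come
from the paper.

// Phase 1: count keys per bucket and record each key's local offset.
__global__ void count_and_offset(int n, const unsigned *keys,
                                 int num_buckets,
                                 unsigned *bucket_id,     // per-key bucket
                                 unsigned *local_offset,  // per-key slot within bucket
                                 unsigned *bucket_size)   // per-bucket count, zeroed
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;
    unsigned b = (keys[i] * 2654435761u) % num_buckets;   // placeholder hash
    bucket_id[i]    = b;
    local_offset[i] = atomicAdd(&bucket_size[b], 1u);      // count doubles as offset
}
// A prefix sum over bucket_size then gives each bucket's global offset, and a
// second pass scatters key i to global_offset[bucket_id[i]] + local_offset[i].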
Parallel Hashing: Level 1


•   FKS (Level 1) is good for a coarse categorization

    •   Possible performance issue: atomics

•   Bad for a fine categorization

    •   Space requirement for n elements to (probabilistically) guarantee
        no collisions is O(n²)
Hashing in Parallel

[Figure: the same keys placed by two different hash functions, ha and hb, land in
different slots]
Cuckoo Hashing Construction

[Figure: each key has two candidate slots, h1 in table T1 and h2 in table T2;
an insertion may evict the slot’s current occupant, which then reinserts itself
using its other hash function]

•   Lookup procedure: in parallel, for each element:

    •   Calculate h1 & look in T1;

    •   Calculate h2 & look in T2; still O(1) lookup (a lookup kernel sketch
        follows below)
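
A minimal two-table lookup kernel matching the procedure above; the hash functions
and the per-query found flag are illustrative assumptions, not the paper's code.

__device__ unsigned h1(unsigned k, unsigned size) { return (k * 2654435761u) % size; }
__device__ unsigned h2(unsigned k, unsigned size) { return (k * 40503u + 7u)  % size; }

// One thread per query; each query inspects at most two slots.
__global__ void cuckoo_lookup(int n, const unsigned *queries,
                              const unsigned *T1, const unsigned *T2, unsigned size,
                              int *found)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;
    unsigned k = queries[i];
    found[i] = (T1[h1(k, size)] == k) || (T2[h2(k, size)] == k);
}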
Cuckoo Construction Mechanics

•   Level 1 created buckets of no more than 512 items

    •   Average: 409; probability of overflow: < 10⁻⁶

•   Level 2: Assign each bucket to a thread block, construct
    cuckoo hash per bucket entirely within shared memory

    •   Semantic: Multiple writes to same location must have one and
        only one winner

•   Our implementation uses 3 tables of 192 elements each (load
    factor: 71%)

•   What if it fails? New hash functions & start over.
Timings on random voxel data
Why did we pick this implementation?




 •   Alternative: single stage, cuckoo hash only

 •   Problem 1: Uncoalesced reads/writes to global memory
     (and may require atomics)

 •   Problem 2: Penalty for failure is large
But it turns out …

    (Thanks to expert reviewer and current collaborator Vasily Volkov!)

 •   … this simpler approach is faster.

 •   Single hash table in global memory (1.4x data size),
     4 different hash functions into it

 •   For each thread:

     •   Atomically exchange my item with item in hash table

         •   If my new item is empty, I’m done

     •   If my new item used hash function n, use hash function
         n+1 to reinsert (a sketch of this loop follows below)
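
A sketch of the single-table insertion loop above. The table size (~1.4x the data),
the hash constants, the EMPTY marker, and the iteration cap are all illustrative;
and unlike the real scheme, this sketch does not track which function placed an
evicted key (the "hash function n, then n+1" rule above), it simply cycles through
the functions.

#define EMPTY 0xFFFFFFFFu   // empty slots must be initialized to EMPTY

__device__ unsigned hash_n(unsigned k, int n, unsigned size) {
    const unsigned a[4] = { 2654435761u, 40503u, 2246822519u, 3266489917u };
    return (k * a[n] + n) % size;
}

__global__ void cuckoo_insert(int n, const unsigned *keys,
                              unsigned *table, unsigned size, int max_iters)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;
    unsigned item = keys[i];
    int fn = 0;
    for (int it = 0; it < max_iters; ++it) {
        unsigned slot = hash_n(item, fn, size);
        item = atomicExch(&table[slot], item);   // swap my item with the occupant
        if (item == EMPTY) return;               // slot was free: done
        fn = (fn + 1) % 4;                       // cycle to another function (simplification)
    }
    // Hitting max_iters would flag failure; the real scheme then rebuilds
    // with new hash functions.
}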
Hashing: Big Ideas


•   Classic serial hashing techniques are a poor fit for a GPU.

    •   Serialization, load balance

•   Solving this problem required a different algorithm

    •   Both hashing algorithms were new to the parallel literature

    •   Hybrid algorithm was entirely new
Task Queues
Representing Smooth Surfaces
•   Polygonal Meshes

     •   Lack view adaptivity

     •   Inefficient storage/transfer

•   Parametric Surfaces

     •   Collections of smooth patches

     •   Hard to animate

•   Subdivision Surfaces

     •   Widely popular, easy for modeling

     •   Flexible animation
                                             [Boubekeur and Schlick 2007]
Parallel Traversal




                     Anjul Patney and John D. Owens. Real-Time Reyes-
                     Style Adaptive Surface Subdivision. ACM Transactions
                     on Graphics, 27(5):143:1–143:8, December 2008.
Cracks & T-Junctions
      Uniform Refinement       Adaptive Refinement




                     Crack

                                                T-Junction



Previous work removed cracks as a post-process after all subdivision steps; ours
  handles them within the subdivision process itself
Recursive Subdivision is Irregular

[Figure: a view-adaptively subdivided model; neighboring patches reach different
subdivision depths (labeled 1 and 2), so the work per patch varies and is not
known in advance]

Patney, Ebeida, and Owens. “Parallel View-Dependent Tessellation of Catmull-Clark Subdivision Surfaces”. HPG ’09.
Static Task List

[Figure: each SM reads inputs from a static task list through an atomic pointer
and writes new work to an output list; when the inputs are exhausted, the kernel
is restarted with the output list as the new input]

Daniel Cederman and Philippas Tsigas. On Dynamic Load Balancing on Graphics
Processors. Graphics Hardware 2008, June 2008.
Private Work Queue Approach

•   Allocate private work queue of tasks per
    core

    •   Each core can add to or remove work from
        its local queue

•   Cores mark self as idle if {queue
    exhausts storage, queue is empty}

•   Cores periodically check global idle
    counter

•   If global idle counter reaches threshold,
    rebalance work

    gProximity: Fast Hierarchy Operations on GPU Architectures,
    Lauterbach, Mo, and Manocha, EG ’10
Work Stealing & Donating

[Figure: each SM owns an I/O deque; the deques are protected by locks so that
other SMs can steal work from them or donate work to them]

•   Cederman and Tsigas: Stealing == best performance and
    scalability (follows Arora CPU-based work)

•   We showed how to do this with multiple kernels in an
    uberkernel and persistent-thread programming style

•   We added donating to minimize memory usage
Ingredients for Our Scheme

Implementation questions that we need to address:

    What is the proper granularity for tasks?    Warp-Size Work Granularity
    How many threads to launch?                  Persistent Threads
    How to avoid global synchronizations?        Uberkernels
    How to distribute tasks evenly?              Task Donation
Warp Size Work Granularity
• Problem: We want to emulate task level parallelism on
  the GPU without loss in efficiency.
• Solution: We choose block sizes of 32 threads / block.
  • Removes messy synchronization barriers.
  • Can view each block as a MIMD thread. We call these blocks
    processors.

[Figure: several 32-thread processors (P) scheduled onto each physical processor]
Uberkernel Processor Utilization
• Problem: Want to eliminate global kernel barriers for
  better processor utilization
• Uberkernels pack multiple execution routes into one
  kernel.
[Figure: instead of running Pipeline Stage 1 and Pipeline Stage 2 as separate
kernels with a global barrier between them, an uberkernel contains both stages
and routes data between them inside one launch. A sketch follows below.]
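
A minimal uberkernel sketch, assuming each task carries a stage tag; the stage
bodies are placeholders. The point is only that both routes live in one kernel,
so no global kernel barrier separates the stages.

struct Task { int stage; int data; };

__device__ void run_stage1(Task &t) { t.data *= 2; t.stage = 2; }   // placeholder work
__device__ void run_stage2(Task &t) { t.data += 1; t.stage = 0; }   // 0 = done

// One kernel, multiple execution routes selected per task.
__global__ void uberkernel(Task *tasks, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;
    switch (tasks[i].stage) {
        case 1: run_stage1(tasks[i]); break;   // stage-1 route
        case 2: run_stage2(tasks[i]); break;   // stage-2 route
        default: break;                        // already finished
    }
}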
Persistent Thread Scheduler Emulation
• Problem: If the input is irregular, how many threads do we
  launch?
• Launch enough to fill the GPU, and keep them alive so
  they keep fetching work.

  Life of a Persistent Thread:

     Spawn → (Fetch Data → Process Data → Write Output, repeated) → Death

  How do we know when to stop?
  When there is no more work left. (A minimal persistent-thread loop is
  sketched below.)
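
A minimal persistent-thread loop, assuming a flat work array and a single global
atomic counter (initialized to zero); the real scheme fetches work per 32-thread
processor from per-processor queues, so this per-thread fetch is a deliberate
simplification.

// Launch just enough threads to fill the GPU; each one loops until the work runs out.
__global__ void persistent_worker(const int *work, int num_items,
                                  int *next_item, int *results)
{
    while (true) {
        int i = atomicAdd(next_item, 1);   // claim the next unprocessed item
        if (i >= num_items) return;        // no more work left: this thread exits
        results[i] = work[i] * 2;          // placeholder processing
    }
}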
Task Donation
• When a bin is full, a processor can give work to someone
  else.

[Figure: processors P1–P4 each own a bin; when a processor’s bin is full, it
donates its work to someone else’s bin]

    •   Smaller memory usage

    •   More complicated
Queuing Schemes

[Figure: work processed and idle time per processor for four queuing schemes:
(a) Block Queuing, (b) Distributed Queuing, (c) Task Stealing, (d) Task Donating.
Gray bars: load balance (work processed per processor). Black line: idle time.]

Stanley Tzeng, Anjul Patney, and John D. Owens. Task Management for Irregular-
Parallel Workloads on the GPU. In Proceedings of High Performance Graphics 2010,
pages 29–37, June 2010.
Smooth Surfaces, High Detail




              16 samples per pixel
    >15 frames per second on GeForce GTX280
MapReduce
MapReduce




            http://m.blog.hu/dw/dwbi/image/2009/Q4/mapreduce_small.png
Why MapReduce?

•   Simple programming model

•   Parallel programming model

•   Scalable



•   Previous GPU work: neither multi-GPU nor out-of-core
Block Diagram

[Figure: GPMR block diagram; on each GPU, input chunks flow through a Scheduler
into a Map + Partial Reduce stage, then Bin and Sort stages, and finally another
Scheduler and the Reduce stage; multiple GPUs run side by side]

Jeff A. Stuart and John D. Owens. Multi-GPU MapReduce on GPU Clusters. In
Proceedings of the 25th IEEE International Parallel and Distributed Processing
Symposium, May 2011.
Keys to Performance

•   Process data in chunks

    •   More efficient transmission & computation

    •   Also allows out of core

•   Overlap computation and communication

•   Accumulate

•   Partial Reduce
Partial Reduce

•   User-specified function combines pairs with the same key
    (a sketch of such a combine follows below)

•   Each map chunk spawns a partial reduction on that chunk

•   Only good when cost of reduction is less than cost of transmission

•   Good for ~larger number of keys

•   Reduces GPU->CPU bandwidth
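
A hedged sketch of the user-specified combine idea for a word-occurrence job,
assuming each chunk's key-value pairs are already sorted by key; the KeyValue
layout and function signatures are illustrative, not the GPMR API.

struct KeyValue { unsigned key; int value; };

// Word occurrence: counts for the same word simply add.
__device__ void combine(KeyValue &accum, const KeyValue &incoming)
{
    accum.value += incoming.value;
}

// Partial reduce over one sorted chunk: only the first occurrence of each key
// (the segment head) emits a combined pair. out_count must be zeroed first.
__global__ void partial_reduce(const KeyValue *in, int n,
                               KeyValue *out, int *out_count)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;
    if (i > 0 && in[i].key == in[i - 1].key) return;   // not a segment head
    KeyValue acc = in[i];
    for (int j = i + 1; j < n && in[j].key == acc.key; ++j)
        combine(acc, in[j]);                            // fold in duplicates of this key
    out[atomicAdd(out_count, 1)] = acc;                 // one pair per distinct key
}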
Accumulate

•   Mapper has explicit knowledge about the nature of its key-value pairs

•   Each map chunk accumulates its key-value outputs with GPU-resident
    key-value accumulated outputs

•   Good for small number of keys

•   Also reduces GPU->CPU bandwidth
Benchmarks—Which
•   Matrix Multiplication (MM)

•   Word Occurrence (WO)

•   Sparse-Integer Occurrence (SIO)

•   Linear Regression (LR)

•   K-Means Clustering (KMC)



•   (Volume Renderer—presented
    @ MapReduce ’10)
        Jeff A. Stuart, Cheng-Kai Chen, Kwan-Liu Ma, and John D. Owens. Multi-GPU Volume Rendering using MapReduce. In HPDC '10:
        Proceedings of the 19th ACM International Symposium on High Performance Distributed Computing / MAPREDUCE '10: The First
        International Workshop on MapReduce and its Applications, pages 841–848, June 2010
Benchmarks—Why

•   Needed to stress aspects of GPMR

    •   Unbalanced work (WO)

    •   Multiple emits/Non-uniform number of emits (LR, KMC, WO)

    •   Sparsity of keys (SIO)

    •   Accumulation (WO, LR, KMC)

    •   Many key-value pairs (SIO)

    •   Compute Bound Scalability (MM)
Benchmarks—Results
WO. We test GPMR against all available input sets.

Benchmarks—Results

vs. CPU (Phoenix):
                       MM        KMC       LR      SIO     WO
    1-GPU Speedup      162.712   2.991     1.296   1.450   11.080
    4-GPU Speedup      559.209   11.726    4.085   2.322   18.441

TABLE 2: Speedup for GPMR over Phoenix on our large (second-biggest) input data
from our first set. The exception is MM, for which we use our small input set
(Phoenix required almost twenty seconds to multiply two 1024 × 1024 matrices).

vs. GPU (Mars):
                       MM        KMC       WO
    1-GPU Speedup      2.695     37.344    3.098
    4-GPU Speedup      10.760    129.425   11.709

TABLE 3: Speedup for GPMR over Mars on 4096 × 4096 Matrix Multiplication, an
8M-point K-Means Clustering, and a 512 MB Word Occurrence. These sizes represent
the largest problems that can meet the in-core memory requirements of Mars.
Benchmarks—Results

[Scalability plots for the benchmarks; the “Good” arrows mark the desirable direction]
MapReduce Summary

•   Time is right to explore diverse programming models on
    GPUs

    •   Few threads -> many; heavy threads -> light

•   Lack of {bandwidth, GPU primitives for communication,
    access to NIC} are challenges

•   What happens if GPUs get a lightweight serial processor?

    •   Future hybrid hardware is exciting!
Broader Thoughts
Nested Data Parallelism

How do we generalize this? How do we abstract it? What are the design alternatives?

[Figure: the sparse-matrix and composition examples from earlier, both handled
with a Segmented Scan and a Segmented Reduction over variable-length segments]
Hashing

Tradeoffs between construction time, access time, and space?
Incremental data structures?
Work Stealing & Donating

Generalize and abstract task queues? [structural pattern?]
Persistent threads / other sticky CUDA constructs?
Whole pipelines?
Priorities / real-time constraints?

[Figure: the per-SM I/O deques with locks, as before]

•   Cederman and Tsigas: Stealing == best performance and
    scalability

•   We showed how to do this with an uberkernel and persistent
    thread programming style

•   We added donating to minimize memory usage
Horizontal vs. Vertical Development

CPU:            Applications → Higher-Level Libraries → Algorithm/Data Structure
                Libraries → Programming System Primitives → Hardware

GPU (Usual):    App 1, App 2, App 3 each sit directly on Programming System
                Primitives → Hardware. Little code reuse!

GPU (Our Goal): App 1, App 2, App 3 → Primitive Libraries (Domain-specific,
                Algorithm, Data Structure, etc.) → Programming System Primitives →
                Hardware   [CUDPP library]
Energy Cost per Operation
            Energy Cost of Operations
Operation                        Energy (pJ)
64b Floating FMA (2 ops)         100
64b Integer Add                  1
Write 64b DFF                    0.5
Read 64b Register (64 x 32 bank) 3.5
Read 64b RAM (64 x 2K)           25
Read tags (24 x 2K)              8
Move 64b 1mm                     6
Move 64b 20mm                    120
Move 64b off chip                256
Read 64b from DRAM               2000




                                               From Bill Dally talk, September 2009
Dwarfs

•   1. Dense Linear Algebra

•   2. Sparse Linear Algebra

•   3. Spectral Methods

•   4. N-Body Methods

•   5. Structured Grids

•   6. Unstructured Grids

•   7. MapReduce

•   8. Combinational Logic

•   9. Graph Traversal

•   10. Dynamic Programming

•   11. Backtrack and Branch-and-Bound

•   12. Graphical Models

•   13. Finite State Machines
The Programmable Pipeline

Bricks & mortar: how do we allow programmers to build stages without worrying
about assembling them together?

Input Assembly → Tess. Shading → Vertex Shading → Geom. Shading → Raster → Frag. Shading → Compose

Split → Dice → Shading → Sampling → Composition

Ray Generation → Ray Traversal → Ray-Primitive Intersection → Shading
The Programmable Pipeline

Input Assembly → Tess. Shading → Vertex Shading → Geom. Shading → Raster → Frag. Shading → Compose

                        [Irregular] Parallelism

Split → Dice → Shading → Sampling → Composition

                        Heterogeneity

Ray Generation → Ray Traversal → Ray-Primitive Intersection → Shading

What does it look like when you build applications with GPUs as the primary
processor?
Owens Group Interests

•   Irregular parallelism

•   Fundamental data structures and algorithms

•   Computational patterns for parallelism

•   Optimization: self-tuning and tools

•   Abstract models of GPU computation

•   Multi-GPU computing: programming models and abstractions
Thanks to ...

•   Nathan Bell, Michael Garland, David Luebke, Sumit Gupta,
    Dan Alcantara, Anjul Patney, and Stanley Tzeng for helpful
    comments and slide material.

•   Funding agencies: Department of Energy (SciDAC Institute
    for Ultrascale Visualization, Early Career Principal
    Investigator Award), NSF, Intel Science and Technology
    Center for Visual Computing, LANL, BMW, NVIDIA, HP, UC
    MICRO, Microsoft, ChevronTexaco, Rambus
“If you were plowing a field, which
would you rather use? Two strong
      oxen or 1024 chickens?”
         —Seymour Cray
Parallel Prefix Sum (Scan)
•   Given an array A = [a0, a1, …, an-1]
    and a binary associative operator ⊕ with identity I,



•   scan(A) = [I, a0, (a0 ⊕ a1), …, (a0 ⊕ a1 ⊕ … ⊕ an-2)]



•   Example: if ⊕ is addition, then scan on the set

    •   [3 1 7 0 4 1 6 3]

•   returns the set

    •   [0 3 4 11 11 15 16 22]
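
One way to reproduce the example above on the GPU is with Thrust's exclusive_scan
(Thrust ships with the CUDA toolkit); a hand-written scan kernel would expose the
same interface idea.

#include <thrust/device_vector.h>
#include <thrust/scan.h>
#include <cstdio>

int main()
{
    int data[8] = {3, 1, 7, 0, 4, 1, 6, 3};
    thrust::device_vector<int> d(data, data + 8);
    thrust::device_vector<int> out(8);

    // Identity for + is 0, so this produces [0 3 4 11 11 15 16 22].
    thrust::exclusive_scan(d.begin(), d.end(), out.begin());

    for (int i = 0; i < 8; ++i)
        printf("%d ", (int)out[i]);   // 0 3 4 11 11 15 16 22
    printf("\n");
    return 0;
}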
Segmented Scan
•   Example: if ⊕ is addition, then scan on the set

    •   [3 1 7 | 0 4 1 | 6 3]

•   returns the set

    •   [0 3 4 | 0 0 4 | 0 6]

•   Same computational complexity as scan, but additionally
    have to keep track of segments (we use head flags to mark
    which elements are segment heads)

•   Useful for nested data parallelism (quicksort)
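
A host-side reference of the head-flag semantics above, not a data-parallel
implementation; a GPU version carries the flags through the scan network, but the
input/output behavior is the same.

#include <cstdio>

// Segmented exclusive scan: the running sum resets at every segment head.
void segmented_exclusive_scan(const int *in, const int *head, int *out, int n)
{
    int sum = 0;
    for (int i = 0; i < n; ++i) {
        if (head[i]) sum = 0;   // new segment: restart at the identity
        out[i] = sum;
        sum += in[i];
    }
}

int main()
{
    int data[8] = {3, 1, 7, 0, 4, 1, 6, 3};
    int head[8] = {1, 0, 0, 1, 0, 0, 1, 0};   // segments [3 1 7 | 0 4 1 | 6 3]
    int out[8];
    segmented_exclusive_scan(data, head, out, 8);
    for (int i = 0; i < 8; ++i) printf("%d ", out[i]);   // 0 3 4 0 0 4 0 6
    printf("\n");
    return 0;
}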

[Harvard CS264] 11a - Programming the Memory Hierarchy with Sequoia (Mike Bau...npinto
 
[Harvard CS264] 07 - GPU Cluster Programming (MPI & ZeroMQ)
[Harvard CS264] 07 - GPU Cluster Programming (MPI & ZeroMQ)[Harvard CS264] 07 - GPU Cluster Programming (MPI & ZeroMQ)
[Harvard CS264] 07 - GPU Cluster Programming (MPI & ZeroMQ)npinto
 
[Harvard CS264] 08b - MapReduce and Hadoop (Zak Stone, Harvard)
[Harvard CS264] 08b - MapReduce and Hadoop (Zak Stone, Harvard)[Harvard CS264] 08b - MapReduce and Hadoop (Zak Stone, Harvard)
[Harvard CS264] 08b - MapReduce and Hadoop (Zak Stone, Harvard)npinto
 
[Harvard CS264] 09 - Machine Learning on Big Data: Lessons Learned from Googl...
[Harvard CS264] 09 - Machine Learning on Big Data: Lessons Learned from Googl...[Harvard CS264] 09 - Machine Learning on Big Data: Lessons Learned from Googl...
[Harvard CS264] 09 - Machine Learning on Big Data: Lessons Learned from Googl...npinto
 
Halvar Flake: Why Johnny can’t tell if he is compromised
Halvar Flake: Why Johnny can’t tell if he is compromisedHalvar Flake: Why Johnny can’t tell if he is compromised
Halvar Flake: Why Johnny can’t tell if he is compromisedArea41
 

Viewers also liked (16)

[Harvard CS264] 11b - Analysis-Driven Performance Optimization with CUDA (Cli...
[Harvard CS264] 11b - Analysis-Driven Performance Optimization with CUDA (Cli...[Harvard CS264] 11b - Analysis-Driven Performance Optimization with CUDA (Cli...
[Harvard CS264] 11b - Analysis-Driven Performance Optimization with CUDA (Cli...
 
[Harvard CS264] 10a - Easy, Effective, Efficient: GPU Programming in Python w...
[Harvard CS264] 10a - Easy, Effective, Efficient: GPU Programming in Python w...[Harvard CS264] 10a - Easy, Effective, Efficient: GPU Programming in Python w...
[Harvard CS264] 10a - Easy, Effective, Efficient: GPU Programming in Python w...
 
Gpu Join Presentation
Gpu Join PresentationGpu Join Presentation
Gpu Join Presentation
 
[Harvard CS264] 10b - cl.oquence: High-Level Language Abstractions for Low-Le...
[Harvard CS264] 10b - cl.oquence: High-Level Language Abstractions for Low-Le...[Harvard CS264] 10b - cl.oquence: High-Level Language Abstractions for Low-Le...
[Harvard CS264] 10b - cl.oquence: High-Level Language Abstractions for Low-Le...
 
[Harvard CS264] 14 - Dynamic Compilation for Massively Parallel Processors (G...
[Harvard CS264] 14 - Dynamic Compilation for Massively Parallel Processors (G...[Harvard CS264] 14 - Dynamic Compilation for Massively Parallel Processors (G...
[Harvard CS264] 14 - Dynamic Compilation for Massively Parallel Processors (G...
 
[Harvard CS264] 08a - Cloud Computing, Amazon EC2, MIT StarCluster (Justin Ri...
[Harvard CS264] 08a - Cloud Computing, Amazon EC2, MIT StarCluster (Justin Ri...[Harvard CS264] 08a - Cloud Computing, Amazon EC2, MIT StarCluster (Justin Ri...
[Harvard CS264] 08a - Cloud Computing, Amazon EC2, MIT StarCluster (Justin Ri...
 
[Harvard CS264] 15a - The Onset of Parallelism, Changes in Computer Architect...
[Harvard CS264] 15a - The Onset of Parallelism, Changes in Computer Architect...[Harvard CS264] 15a - The Onset of Parallelism, Changes in Computer Architect...
[Harvard CS264] 15a - The Onset of Parallelism, Changes in Computer Architect...
 
[Harvard CS264] 16 - Managing Dynamic Parallelism on GPUs: A Case Study of Hi...
[Harvard CS264] 16 - Managing Dynamic Parallelism on GPUs: A Case Study of Hi...[Harvard CS264] 16 - Managing Dynamic Parallelism on GPUs: A Case Study of Hi...
[Harvard CS264] 16 - Managing Dynamic Parallelism on GPUs: A Case Study of Hi...
 
[Harvard CS264] 15a - Jacket: Visual Computing (James Malcolm, Accelereyes)
[Harvard CS264] 15a - Jacket: Visual Computing (James Malcolm, Accelereyes)[Harvard CS264] 15a - Jacket: Visual Computing (James Malcolm, Accelereyes)
[Harvard CS264] 15a - Jacket: Visual Computing (James Malcolm, Accelereyes)
 
[Harvard CS264] 13 - The R-Stream High-Level Program Transformation Tool / Pr...
[Harvard CS264] 13 - The R-Stream High-Level Program Transformation Tool / Pr...[Harvard CS264] 13 - The R-Stream High-Level Program Transformation Tool / Pr...
[Harvard CS264] 13 - The R-Stream High-Level Program Transformation Tool / Pr...
 
High-Performance Computing Needs Machine Learning... And Vice Versa (NIPS 201...
High-Performance Computing Needs Machine Learning... And Vice Versa (NIPS 201...High-Performance Computing Needs Machine Learning... And Vice Versa (NIPS 201...
High-Performance Computing Needs Machine Learning... And Vice Versa (NIPS 201...
 
[Harvard CS264] 11a - Programming the Memory Hierarchy with Sequoia (Mike Bau...
[Harvard CS264] 11a - Programming the Memory Hierarchy with Sequoia (Mike Bau...[Harvard CS264] 11a - Programming the Memory Hierarchy with Sequoia (Mike Bau...
[Harvard CS264] 11a - Programming the Memory Hierarchy with Sequoia (Mike Bau...
 
[Harvard CS264] 07 - GPU Cluster Programming (MPI & ZeroMQ)
[Harvard CS264] 07 - GPU Cluster Programming (MPI & ZeroMQ)[Harvard CS264] 07 - GPU Cluster Programming (MPI & ZeroMQ)
[Harvard CS264] 07 - GPU Cluster Programming (MPI & ZeroMQ)
 
[Harvard CS264] 08b - MapReduce and Hadoop (Zak Stone, Harvard)
[Harvard CS264] 08b - MapReduce and Hadoop (Zak Stone, Harvard)[Harvard CS264] 08b - MapReduce and Hadoop (Zak Stone, Harvard)
[Harvard CS264] 08b - MapReduce and Hadoop (Zak Stone, Harvard)
 
[Harvard CS264] 09 - Machine Learning on Big Data: Lessons Learned from Googl...
[Harvard CS264] 09 - Machine Learning on Big Data: Lessons Learned from Googl...[Harvard CS264] 09 - Machine Learning on Big Data: Lessons Learned from Googl...
[Harvard CS264] 09 - Machine Learning on Big Data: Lessons Learned from Googl...
 
Halvar Flake: Why Johnny can’t tell if he is compromised
Halvar Flake: Why Johnny can’t tell if he is compromisedHalvar Flake: Why Johnny can’t tell if he is compromised
Halvar Flake: Why Johnny can’t tell if he is compromised
 

Similar to [Harvard CS264] 12 - Irregular Parallelism on the GPU: Algorithms and Data Structures (John Owens, UC Davis)

Geometry Processingで学ぶSparse Matrix
Geometry Processingで学ぶSparse MatrixGeometry Processingで学ぶSparse Matrix
Geometry Processingで学ぶSparse MatrixJun Saito
 
Collaborative Similarity Measure for Intra-Graph Clustering
Collaborative Similarity Measure for Intra-Graph ClusteringCollaborative Similarity Measure for Intra-Graph Clustering
Collaborative Similarity Measure for Intra-Graph ClusteringWaqas Nawaz
 
IAP09 CUDA@MIT 6.963 - Guest Lecture: CUDA Tricks and High-Performance Comput...
IAP09 CUDA@MIT 6.963 - Guest Lecture: CUDA Tricks and High-Performance Comput...IAP09 CUDA@MIT 6.963 - Guest Lecture: CUDA Tricks and High-Performance Comput...
IAP09 CUDA@MIT 6.963 - Guest Lecture: CUDA Tricks and High-Performance Comput...npinto
 
Mesh Generation and Topological Data Analysis
Mesh Generation and Topological Data AnalysisMesh Generation and Topological Data Analysis
Mesh Generation and Topological Data AnalysisDon Sheehy
 
Large Scale Parallel FDTD Simulation of Full 3D Photonic Crystal Structures
Large Scale Parallel FDTD Simulation of Full 3D Photonic Crystal StructuresLarge Scale Parallel FDTD Simulation of Full 3D Photonic Crystal Structures
Large Scale Parallel FDTD Simulation of Full 3D Photonic Crystal Structuresayubimoak
 
Graphs in the Database: Rdbms In The Social Networks Age
Graphs in the Database: Rdbms In The Social Networks AgeGraphs in the Database: Rdbms In The Social Networks Age
Graphs in the Database: Rdbms In The Social Networks AgeLorenzo Alberton
 
Solid modeling
Solid modelingSolid modeling
Solid modelingKRvEsL
 
CS 354 Acceleration Structures
CS 354 Acceleration StructuresCS 354 Acceleration Structures
CS 354 Acceleration StructuresMark Kilgard
 
GTC 2012: NVIDIA OpenGL in 2012
GTC 2012: NVIDIA OpenGL in 2012GTC 2012: NVIDIA OpenGL in 2012
GTC 2012: NVIDIA OpenGL in 2012Mark Kilgard
 
Oxford 05-oct-2012
Oxford 05-oct-2012Oxford 05-oct-2012
Oxford 05-oct-2012Ted Dunning
 
Object Recognition with Deformable Models
Object Recognition with Deformable ModelsObject Recognition with Deformable Models
Object Recognition with Deformable Modelszukun
 
Recommender Systems Tutorial (Part 2) -- Offline Components
Recommender Systems Tutorial (Part 2) -- Offline ComponentsRecommender Systems Tutorial (Part 2) -- Offline Components
Recommender Systems Tutorial (Part 2) -- Offline ComponentsBee-Chung Chen
 
Implementing 3D SPHARM Surfaces Registration on Cell B.E. Processor
Implementing 3D SPHARM Surfaces Registration on Cell B.E. ProcessorImplementing 3D SPHARM Surfaces Registration on Cell B.E. Processor
Implementing 3D SPHARM Surfaces Registration on Cell B.E. ProcessorPTIHPA
 
study Domain Transform for Edge-Aware Image and Video Processing
study Domain Transform for Edge-Aware Image and Video Processingstudy Domain Transform for Edge-Aware Image and Video Processing
study Domain Transform for Edge-Aware Image and Video ProcessingChiamin Hsu
 
A probabilistic model for recursive factorized image features ppt
A probabilistic model for recursive factorized image features pptA probabilistic model for recursive factorized image features ppt
A probabilistic model for recursive factorized image features pptirisshicat
 

Similar to [Harvard CS264] 12 - Irregular Parallelism on the GPU: Algorithms and Data Structures (John Owens, UC Davis) (20)

Geometry Processingで学ぶSparse Matrix
Geometry Processingで学ぶSparse MatrixGeometry Processingで学ぶSparse Matrix
Geometry Processingで学ぶSparse Matrix
 
Collaborative Similarity Measure for Intra-Graph Clustering
Collaborative Similarity Measure for Intra-Graph ClusteringCollaborative Similarity Measure for Intra-Graph Clustering
Collaborative Similarity Measure for Intra-Graph Clustering
 
Shuronr
ShuronrShuronr
Shuronr
 
IAP09 CUDA@MIT 6.963 - Guest Lecture: CUDA Tricks and High-Performance Comput...
IAP09 CUDA@MIT 6.963 - Guest Lecture: CUDA Tricks and High-Performance Comput...IAP09 CUDA@MIT 6.963 - Guest Lecture: CUDA Tricks and High-Performance Comput...
IAP09 CUDA@MIT 6.963 - Guest Lecture: CUDA Tricks and High-Performance Comput...
 
Mesh Generation and Topological Data Analysis
Mesh Generation and Topological Data AnalysisMesh Generation and Topological Data Analysis
Mesh Generation and Topological Data Analysis
 
Large Scale Parallel FDTD Simulation of Full 3D Photonic Crystal Structures
Large Scale Parallel FDTD Simulation of Full 3D Photonic Crystal StructuresLarge Scale Parallel FDTD Simulation of Full 3D Photonic Crystal Structures
Large Scale Parallel FDTD Simulation of Full 3D Photonic Crystal Structures
 
Graphs in the Database: Rdbms In The Social Networks Age
Graphs in the Database: Rdbms In The Social Networks AgeGraphs in the Database: Rdbms In The Social Networks Age
Graphs in the Database: Rdbms In The Social Networks Age
 
Surveys
SurveysSurveys
Surveys
 
N20190530
N20190530N20190530
N20190530
 
Solid modeling
Solid modelingSolid modeling
Solid modeling
 
CS 354 Acceleration Structures
CS 354 Acceleration StructuresCS 354 Acceleration Structures
CS 354 Acceleration Structures
 
GTC 2012: NVIDIA OpenGL in 2012
GTC 2012: NVIDIA OpenGL in 2012GTC 2012: NVIDIA OpenGL in 2012
GTC 2012: NVIDIA OpenGL in 2012
 
Oxford 05-oct-2012
Oxford 05-oct-2012Oxford 05-oct-2012
Oxford 05-oct-2012
 
Object Recognition with Deformable Models
Object Recognition with Deformable ModelsObject Recognition with Deformable Models
Object Recognition with Deformable Models
 
Recommender Systems Tutorial (Part 2) -- Offline Components
Recommender Systems Tutorial (Part 2) -- Offline ComponentsRecommender Systems Tutorial (Part 2) -- Offline Components
Recommender Systems Tutorial (Part 2) -- Offline Components
 
Two
TwoTwo
Two
 
2d3d
2d3d2d3d
2d3d
 
Implementing 3D SPHARM Surfaces Registration on Cell B.E. Processor
Implementing 3D SPHARM Surfaces Registration on Cell B.E. ProcessorImplementing 3D SPHARM Surfaces Registration on Cell B.E. Processor
Implementing 3D SPHARM Surfaces Registration on Cell B.E. Processor
 
study Domain Transform for Edge-Aware Image and Video Processing
study Domain Transform for Edge-Aware Image and Video Processingstudy Domain Transform for Edge-Aware Image and Video Processing
study Domain Transform for Edge-Aware Image and Video Processing
 
A probabilistic model for recursive factorized image features ppt
A probabilistic model for recursive factorized image features pptA probabilistic model for recursive factorized image features ppt
A probabilistic model for recursive factorized image features ppt
 

More from npinto

"AI" for Blockchain Security (Case Study: Cosmos)
"AI" for Blockchain Security (Case Study: Cosmos)"AI" for Blockchain Security (Case Study: Cosmos)
"AI" for Blockchain Security (Case Study: Cosmos)npinto
 
[Harvard CS264] 06 - CUDA Ninja Tricks: GPU Scripting, Meta-programming & Aut...
[Harvard CS264] 06 - CUDA Ninja Tricks: GPU Scripting, Meta-programming & Aut...[Harvard CS264] 06 - CUDA Ninja Tricks: GPU Scripting, Meta-programming & Aut...
[Harvard CS264] 06 - CUDA Ninja Tricks: GPU Scripting, Meta-programming & Aut...npinto
 
[Harvard CS264] 05 - Advanced-level CUDA Programming
[Harvard CS264] 05 - Advanced-level CUDA Programming[Harvard CS264] 05 - Advanced-level CUDA Programming
[Harvard CS264] 05 - Advanced-level CUDA Programmingnpinto
 
[Harvard CS264] 04 - Intermediate-level CUDA Programming
[Harvard CS264] 04 - Intermediate-level CUDA Programming[Harvard CS264] 04 - Intermediate-level CUDA Programming
[Harvard CS264] 04 - Intermediate-level CUDA Programmingnpinto
 
[Harvard CS264] 03 - Introduction to GPU Computing, CUDA Basics
[Harvard CS264] 03 - Introduction to GPU Computing, CUDA Basics[Harvard CS264] 03 - Introduction to GPU Computing, CUDA Basics
[Harvard CS264] 03 - Introduction to GPU Computing, CUDA Basicsnpinto
 
[Harvard CS264] 02 - Parallel Thinking, Architecture, Theory & Patterns
[Harvard CS264] 02 - Parallel Thinking, Architecture, Theory & Patterns[Harvard CS264] 02 - Parallel Thinking, Architecture, Theory & Patterns
[Harvard CS264] 02 - Parallel Thinking, Architecture, Theory & Patternsnpinto
 
[Harvard CS264] 01 - Introduction
[Harvard CS264] 01 - Introduction[Harvard CS264] 01 - Introduction
[Harvard CS264] 01 - Introductionnpinto
 
IAP09 CUDA@MIT 6.963 - Guest Lecture: Out-of-Core Programming with NVIDIA's C...
IAP09 CUDA@MIT 6.963 - Guest Lecture: Out-of-Core Programming with NVIDIA's C...IAP09 CUDA@MIT 6.963 - Guest Lecture: Out-of-Core Programming with NVIDIA's C...
IAP09 CUDA@MIT 6.963 - Guest Lecture: Out-of-Core Programming with NVIDIA's C...npinto
 
IAP09 CUDA@MIT 6.963 - Lecture 07: CUDA Advanced #2 (Nicolas Pinto, MIT)
IAP09 CUDA@MIT 6.963 - Lecture 07: CUDA Advanced #2 (Nicolas Pinto, MIT)IAP09 CUDA@MIT 6.963 - Lecture 07: CUDA Advanced #2 (Nicolas Pinto, MIT)
IAP09 CUDA@MIT 6.963 - Lecture 07: CUDA Advanced #2 (Nicolas Pinto, MIT)npinto
 
MIT 6.870 - Template Matching and Histograms (Nicolas Pinto, MIT)
MIT 6.870 - Template Matching and Histograms (Nicolas Pinto, MIT)MIT 6.870 - Template Matching and Histograms (Nicolas Pinto, MIT)
MIT 6.870 - Template Matching and Histograms (Nicolas Pinto, MIT)npinto
 
IAP09 CUDA@MIT 6.963 - Lecture 04: CUDA Advanced #1 (Nicolas Pinto, MIT)
IAP09 CUDA@MIT 6.963 - Lecture 04: CUDA Advanced #1 (Nicolas Pinto, MIT)IAP09 CUDA@MIT 6.963 - Lecture 04: CUDA Advanced #1 (Nicolas Pinto, MIT)
IAP09 CUDA@MIT 6.963 - Lecture 04: CUDA Advanced #1 (Nicolas Pinto, MIT)npinto
 
IAP09 CUDA@MIT 6.963 - Lecture 03: CUDA Basics #2 (Nicolas Pinto, MIT)
IAP09 CUDA@MIT 6.963 - Lecture 03: CUDA Basics #2 (Nicolas Pinto, MIT)IAP09 CUDA@MIT 6.963 - Lecture 03: CUDA Basics #2 (Nicolas Pinto, MIT)
IAP09 CUDA@MIT 6.963 - Lecture 03: CUDA Basics #2 (Nicolas Pinto, MIT)npinto
 
IAP09 CUDA@MIT 6.963 - Lecture 02: CUDA Basics #1 (Nicolas Pinto, MIT)
IAP09 CUDA@MIT 6.963 - Lecture 02: CUDA Basics #1 (Nicolas Pinto, MIT)IAP09 CUDA@MIT 6.963 - Lecture 02: CUDA Basics #1 (Nicolas Pinto, MIT)
IAP09 CUDA@MIT 6.963 - Lecture 02: CUDA Basics #1 (Nicolas Pinto, MIT)npinto
 
IAP09 CUDA@MIT 6.963 - Lecture 01: GPU Computing using CUDA (David Luebke, NV...
IAP09 CUDA@MIT 6.963 - Lecture 01: GPU Computing using CUDA (David Luebke, NV...IAP09 CUDA@MIT 6.963 - Lecture 01: GPU Computing using CUDA (David Luebke, NV...
IAP09 CUDA@MIT 6.963 - Lecture 01: GPU Computing using CUDA (David Luebke, NV...npinto
 
IAP09 CUDA@MIT 6.963 - Lecture 01: High-Throughput Scientific Computing (Hans...
IAP09 CUDA@MIT 6.963 - Lecture 01: High-Throughput Scientific Computing (Hans...IAP09 CUDA@MIT 6.963 - Lecture 01: High-Throughput Scientific Computing (Hans...
IAP09 CUDA@MIT 6.963 - Lecture 01: High-Throughput Scientific Computing (Hans...npinto
 

More from npinto (15)

"AI" for Blockchain Security (Case Study: Cosmos)
"AI" for Blockchain Security (Case Study: Cosmos)"AI" for Blockchain Security (Case Study: Cosmos)
"AI" for Blockchain Security (Case Study: Cosmos)
 
[Harvard CS264] 06 - CUDA Ninja Tricks: GPU Scripting, Meta-programming & Aut...
[Harvard CS264] 06 - CUDA Ninja Tricks: GPU Scripting, Meta-programming & Aut...[Harvard CS264] 06 - CUDA Ninja Tricks: GPU Scripting, Meta-programming & Aut...
[Harvard CS264] 06 - CUDA Ninja Tricks: GPU Scripting, Meta-programming & Aut...
 
[Harvard CS264] 05 - Advanced-level CUDA Programming
[Harvard CS264] 05 - Advanced-level CUDA Programming[Harvard CS264] 05 - Advanced-level CUDA Programming
[Harvard CS264] 05 - Advanced-level CUDA Programming
 
[Harvard CS264] 04 - Intermediate-level CUDA Programming
[Harvard CS264] 04 - Intermediate-level CUDA Programming[Harvard CS264] 04 - Intermediate-level CUDA Programming
[Harvard CS264] 04 - Intermediate-level CUDA Programming
 
[Harvard CS264] 03 - Introduction to GPU Computing, CUDA Basics
[Harvard CS264] 03 - Introduction to GPU Computing, CUDA Basics[Harvard CS264] 03 - Introduction to GPU Computing, CUDA Basics
[Harvard CS264] 03 - Introduction to GPU Computing, CUDA Basics
 
[Harvard CS264] 02 - Parallel Thinking, Architecture, Theory & Patterns
[Harvard CS264] 02 - Parallel Thinking, Architecture, Theory & Patterns[Harvard CS264] 02 - Parallel Thinking, Architecture, Theory & Patterns
[Harvard CS264] 02 - Parallel Thinking, Architecture, Theory & Patterns
 
[Harvard CS264] 01 - Introduction
[Harvard CS264] 01 - Introduction[Harvard CS264] 01 - Introduction
[Harvard CS264] 01 - Introduction
 
IAP09 CUDA@MIT 6.963 - Guest Lecture: Out-of-Core Programming with NVIDIA's C...
IAP09 CUDA@MIT 6.963 - Guest Lecture: Out-of-Core Programming with NVIDIA's C...IAP09 CUDA@MIT 6.963 - Guest Lecture: Out-of-Core Programming with NVIDIA's C...
IAP09 CUDA@MIT 6.963 - Guest Lecture: Out-of-Core Programming with NVIDIA's C...
 
IAP09 CUDA@MIT 6.963 - Lecture 07: CUDA Advanced #2 (Nicolas Pinto, MIT)
IAP09 CUDA@MIT 6.963 - Lecture 07: CUDA Advanced #2 (Nicolas Pinto, MIT)IAP09 CUDA@MIT 6.963 - Lecture 07: CUDA Advanced #2 (Nicolas Pinto, MIT)
IAP09 CUDA@MIT 6.963 - Lecture 07: CUDA Advanced #2 (Nicolas Pinto, MIT)
 
MIT 6.870 - Template Matching and Histograms (Nicolas Pinto, MIT)
MIT 6.870 - Template Matching and Histograms (Nicolas Pinto, MIT)MIT 6.870 - Template Matching and Histograms (Nicolas Pinto, MIT)
MIT 6.870 - Template Matching and Histograms (Nicolas Pinto, MIT)
 
IAP09 CUDA@MIT 6.963 - Lecture 04: CUDA Advanced #1 (Nicolas Pinto, MIT)
IAP09 CUDA@MIT 6.963 - Lecture 04: CUDA Advanced #1 (Nicolas Pinto, MIT)IAP09 CUDA@MIT 6.963 - Lecture 04: CUDA Advanced #1 (Nicolas Pinto, MIT)
IAP09 CUDA@MIT 6.963 - Lecture 04: CUDA Advanced #1 (Nicolas Pinto, MIT)
 
IAP09 CUDA@MIT 6.963 - Lecture 03: CUDA Basics #2 (Nicolas Pinto, MIT)
IAP09 CUDA@MIT 6.963 - Lecture 03: CUDA Basics #2 (Nicolas Pinto, MIT)IAP09 CUDA@MIT 6.963 - Lecture 03: CUDA Basics #2 (Nicolas Pinto, MIT)
IAP09 CUDA@MIT 6.963 - Lecture 03: CUDA Basics #2 (Nicolas Pinto, MIT)
 
IAP09 CUDA@MIT 6.963 - Lecture 02: CUDA Basics #1 (Nicolas Pinto, MIT)
IAP09 CUDA@MIT 6.963 - Lecture 02: CUDA Basics #1 (Nicolas Pinto, MIT)IAP09 CUDA@MIT 6.963 - Lecture 02: CUDA Basics #1 (Nicolas Pinto, MIT)
IAP09 CUDA@MIT 6.963 - Lecture 02: CUDA Basics #1 (Nicolas Pinto, MIT)
 
IAP09 CUDA@MIT 6.963 - Lecture 01: GPU Computing using CUDA (David Luebke, NV...
IAP09 CUDA@MIT 6.963 - Lecture 01: GPU Computing using CUDA (David Luebke, NV...IAP09 CUDA@MIT 6.963 - Lecture 01: GPU Computing using CUDA (David Luebke, NV...
IAP09 CUDA@MIT 6.963 - Lecture 01: GPU Computing using CUDA (David Luebke, NV...
 
IAP09 CUDA@MIT 6.963 - Lecture 01: High-Throughput Scientific Computing (Hans...
IAP09 CUDA@MIT 6.963 - Lecture 01: High-Throughput Scientific Computing (Hans...IAP09 CUDA@MIT 6.963 - Lecture 01: High-Throughput Scientific Computing (Hans...
IAP09 CUDA@MIT 6.963 - Lecture 01: High-Throughput Scientific Computing (Hans...
 

Recently uploaded

Protein Structure - threading Protein modelling pptx
Protein Structure - threading Protein modelling pptxProtein Structure - threading Protein modelling pptx
Protein Structure - threading Protein modelling pptxvidhisharma994099
 
Vani Magazine - Quarterly Magazine of Seshadripuram Educational Trust
Vani Magazine - Quarterly Magazine of Seshadripuram Educational TrustVani Magazine - Quarterly Magazine of Seshadripuram Educational Trust
Vani Magazine - Quarterly Magazine of Seshadripuram Educational TrustSavipriya Raghavendra
 
How to Send Emails From Odoo 17 Using Code
How to Send Emails From Odoo 17 Using CodeHow to Send Emails From Odoo 17 Using Code
How to Send Emails From Odoo 17 Using CodeCeline George
 
10 Topics For MBA Project Report [HR].pdf
10 Topics For MBA Project Report [HR].pdf10 Topics For MBA Project Report [HR].pdf
10 Topics For MBA Project Report [HR].pdfJayanti Pande
 
How to Manage Cross-Selling in Odoo 17 Sales
How to Manage Cross-Selling in Odoo 17 SalesHow to Manage Cross-Selling in Odoo 17 Sales
How to Manage Cross-Selling in Odoo 17 SalesCeline George
 
3.21.24 The Origins of Black Power.pptx
3.21.24  The Origins of Black Power.pptx3.21.24  The Origins of Black Power.pptx
3.21.24 The Origins of Black Power.pptxmary850239
 
5 charts on South Africa as a source country for international student recrui...
5 charts on South Africa as a source country for international student recrui...5 charts on South Africa as a source country for international student recrui...
5 charts on South Africa as a source country for international student recrui...CaraSkikne1
 
Patient Counselling. Definition of patient counseling; steps involved in pati...
Patient Counselling. Definition of patient counseling; steps involved in pati...Patient Counselling. Definition of patient counseling; steps involved in pati...
Patient Counselling. Definition of patient counseling; steps involved in pati...raviapr7
 
2024.03.23 What do successful readers do - Sandy Millin for PARK.pptx
2024.03.23 What do successful readers do - Sandy Millin for PARK.pptx2024.03.23 What do successful readers do - Sandy Millin for PARK.pptx
2024.03.23 What do successful readers do - Sandy Millin for PARK.pptxSandy Millin
 
How to Create a Toggle Button in Odoo 17
How to Create a Toggle Button in Odoo 17How to Create a Toggle Button in Odoo 17
How to Create a Toggle Button in Odoo 17Celine George
 
The basics of sentences session 10pptx.pptx
The basics of sentences session 10pptx.pptxThe basics of sentences session 10pptx.pptx
The basics of sentences session 10pptx.pptxheathfieldcps1
 
EBUS5423 Data Analytics and Reporting Bl
EBUS5423 Data Analytics and Reporting BlEBUS5423 Data Analytics and Reporting Bl
EBUS5423 Data Analytics and Reporting BlDr. Bruce A. Johnson
 
P4C x ELT = P4ELT: Its Theoretical Background (Kanazawa, 2024 March).pdf
P4C x ELT = P4ELT: Its Theoretical Background (Kanazawa, 2024 March).pdfP4C x ELT = P4ELT: Its Theoretical Background (Kanazawa, 2024 March).pdf
P4C x ELT = P4ELT: Its Theoretical Background (Kanazawa, 2024 March).pdfYu Kanazawa / Osaka University
 
Optical Fibre and It's Applications.pptx
Optical Fibre and It's Applications.pptxOptical Fibre and It's Applications.pptx
Optical Fibre and It's Applications.pptxPurva Nikam
 
Department of Health Compounder Question ‍Solution 2022.pdf
Department of Health Compounder Question ‍Solution 2022.pdfDepartment of Health Compounder Question ‍Solution 2022.pdf
Department of Health Compounder Question ‍Solution 2022.pdfMohonDas
 
Prescribed medication order and communication skills.pptx
Prescribed medication order and communication skills.pptxPrescribed medication order and communication skills.pptx
Prescribed medication order and communication skills.pptxraviapr7
 
Quality Assurance_GOOD LABORATORY PRACTICE
Quality Assurance_GOOD LABORATORY PRACTICEQuality Assurance_GOOD LABORATORY PRACTICE
Quality Assurance_GOOD LABORATORY PRACTICESayali Powar
 
In - Vivo and In - Vitro Correlation.pptx
In - Vivo and In - Vitro Correlation.pptxIn - Vivo and In - Vitro Correlation.pptx
In - Vivo and In - Vitro Correlation.pptxAditiChauhan701637
 
How to Make a Field read-only in Odoo 17
How to Make a Field read-only in Odoo 17How to Make a Field read-only in Odoo 17
How to Make a Field read-only in Odoo 17Celine George
 

Recently uploaded (20)

Protein Structure - threading Protein modelling pptx
Protein Structure - threading Protein modelling pptxProtein Structure - threading Protein modelling pptx
Protein Structure - threading Protein modelling pptx
 
Finals of Kant get Marx 2.0 : a general politics quiz
Finals of Kant get Marx 2.0 : a general politics quizFinals of Kant get Marx 2.0 : a general politics quiz
Finals of Kant get Marx 2.0 : a general politics quiz
 
Vani Magazine - Quarterly Magazine of Seshadripuram Educational Trust
Vani Magazine - Quarterly Magazine of Seshadripuram Educational TrustVani Magazine - Quarterly Magazine of Seshadripuram Educational Trust
Vani Magazine - Quarterly Magazine of Seshadripuram Educational Trust
 
How to Send Emails From Odoo 17 Using Code
How to Send Emails From Odoo 17 Using CodeHow to Send Emails From Odoo 17 Using Code
How to Send Emails From Odoo 17 Using Code
 
10 Topics For MBA Project Report [HR].pdf
10 Topics For MBA Project Report [HR].pdf10 Topics For MBA Project Report [HR].pdf
10 Topics For MBA Project Report [HR].pdf
 
How to Manage Cross-Selling in Odoo 17 Sales
How to Manage Cross-Selling in Odoo 17 SalesHow to Manage Cross-Selling in Odoo 17 Sales
How to Manage Cross-Selling in Odoo 17 Sales
 
3.21.24 The Origins of Black Power.pptx
3.21.24  The Origins of Black Power.pptx3.21.24  The Origins of Black Power.pptx
3.21.24 The Origins of Black Power.pptx
 
5 charts on South Africa as a source country for international student recrui...
5 charts on South Africa as a source country for international student recrui...5 charts on South Africa as a source country for international student recrui...
5 charts on South Africa as a source country for international student recrui...
 
Patient Counselling. Definition of patient counseling; steps involved in pati...
Patient Counselling. Definition of patient counseling; steps involved in pati...Patient Counselling. Definition of patient counseling; steps involved in pati...
Patient Counselling. Definition of patient counseling; steps involved in pati...
 
2024.03.23 What do successful readers do - Sandy Millin for PARK.pptx
2024.03.23 What do successful readers do - Sandy Millin for PARK.pptx2024.03.23 What do successful readers do - Sandy Millin for PARK.pptx
2024.03.23 What do successful readers do - Sandy Millin for PARK.pptx
 
How to Create a Toggle Button in Odoo 17
How to Create a Toggle Button in Odoo 17How to Create a Toggle Button in Odoo 17
How to Create a Toggle Button in Odoo 17
 
The basics of sentences session 10pptx.pptx
The basics of sentences session 10pptx.pptxThe basics of sentences session 10pptx.pptx
The basics of sentences session 10pptx.pptx
 
EBUS5423 Data Analytics and Reporting Bl
EBUS5423 Data Analytics and Reporting BlEBUS5423 Data Analytics and Reporting Bl
EBUS5423 Data Analytics and Reporting Bl
 
P4C x ELT = P4ELT: Its Theoretical Background (Kanazawa, 2024 March).pdf
P4C x ELT = P4ELT: Its Theoretical Background (Kanazawa, 2024 March).pdfP4C x ELT = P4ELT: Its Theoretical Background (Kanazawa, 2024 March).pdf
P4C x ELT = P4ELT: Its Theoretical Background (Kanazawa, 2024 March).pdf
 
Optical Fibre and It's Applications.pptx
Optical Fibre and It's Applications.pptxOptical Fibre and It's Applications.pptx
Optical Fibre and It's Applications.pptx
 
Department of Health Compounder Question ‍Solution 2022.pdf
Department of Health Compounder Question ‍Solution 2022.pdfDepartment of Health Compounder Question ‍Solution 2022.pdf
Department of Health Compounder Question ‍Solution 2022.pdf
 
Prescribed medication order and communication skills.pptx
Prescribed medication order and communication skills.pptxPrescribed medication order and communication skills.pptx
Prescribed medication order and communication skills.pptx
 
Quality Assurance_GOOD LABORATORY PRACTICE
Quality Assurance_GOOD LABORATORY PRACTICEQuality Assurance_GOOD LABORATORY PRACTICE
Quality Assurance_GOOD LABORATORY PRACTICE
 
In - Vivo and In - Vitro Correlation.pptx
In - Vivo and In - Vitro Correlation.pptxIn - Vivo and In - Vitro Correlation.pptx
In - Vivo and In - Vitro Correlation.pptx
 
How to Make a Field read-only in Odoo 17
How to Make a Field read-only in Odoo 17How to Make a Field read-only in Odoo 17
How to Make a Field read-only in Odoo 17
 

[Harvard CS264] 12 - Irregular Parallelism on the GPU: Algorithms and Data Structures (John Owens, UC Davis)

  • 1. Irregular Parallelism on the GPU: Algorithms and Data Structures John Owens Associate Professor, Electrical and Computer Engineering Institute for Data Analysis and Visualization SciDAC Institute for Ultrascale Visualization University of California, Davis
  • 2. Outline • Irregular parallelism • Get a new data structure! (sparse matrix-vector) • Reuse a good data structure! (composite) • Get a new algorithm! (hash tables) • Convert serial to parallel! (task queue) • Concentrate on communication not computation! (MapReduce) • Broader thoughts
  • 4. Sparse Matrix-Vector Multiply: What’s Hard? • Dense approach is wasteful • Unclear how to map work to parallel processors • Irregular data access “Implementing Sparse Matrix-Vector Multiplication on Throughput-Oriented Processors”, Bell & Garland, SC ’
  • 5. (DI A) Dia go na l Structured (EL L) EL LP AC K (CS R) Co m pre ss e dR ow (HY B) Hy bri d Sparse Matrix Formats (CO O) Unstructured Co o rdi na te
  • 6. Diagonal Matrices We’ve recently done -2 0 1 work on tridiagonal systems—PPoPP ’ 10 • Diagonals should be mostly populated • Map one thread per row • Good parallel efficiency • Good memory behavior [column-major storage]
  • 7. Irregular Matrices: ELL 0 2 0 1 2 3 4 5 0 2 3 0 1 2 1 2 3 4 5 5 padding • Assign one thread per row again • But now: • Load imbalance hurts parallel efficiency
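A minimal ELL SpMV sketch in CUDA, following the one-thread-per-row mapping above; it is not the kernel from the Bell & Garland paper, and the array names, the column-major layout, and the -1 padding convention are assumptions:

    __global__ void spmv_ell(int num_rows, int max_cols,
                             const int*   ell_cols,   // padded column indices, column-major
                             const float* ell_vals,   // padded values, column-major
                             const float* x, float* y)
    {
        int row = blockIdx.x * blockDim.x + threadIdx.x;
        if (row >= num_rows) return;
        float sum = 0.0f;
        for (int i = 0; i < max_cols; ++i) {
            int col = ell_cols[i * num_rows + row];    // column-major keeps loads coalesced across threads
            if (col >= 0)                              // -1 marks a padding slot (assumed convention)
                sum += ell_vals[i * num_rows + row] * x[col];
        }
        y[row] = sum;
    }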
  • 8. Irregular Matrices: COO 00 02 10 11 12 13 14 15 20 22 23 30 31 32 41 42 43 44 45 55 no padding! • General format; insensitive to sparsity pattern, but ~3x slower than ELL • Assign one thread per element, combine results from all elements in a row to get output element • Req segmented reduction, communication btwn threads
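A hedged COO sketch with one thread per nonzero, as described above. The paper combines per-row products with a segmented reduction; this simpler variant substitutes atomicAdd, which is correct but typically slower, and all names are assumptions:

    __global__ void spmv_coo_atomic(int nnz,
                                    const int* rows, const int* cols, const float* vals,
                                    const float* x, float* y)    // y must be zero-initialized
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < nnz)
            atomicAdd(&y[rows[i]], vals[i] * x[cols[i]]);        // combine results per output row
    }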
  • 10. Irregular Matrices: HYB [diagram: matrix entries split into a “Typical” portion stored in ELL and an “Exceptional” portion stored in COO] • Combine regularity of ELL + flexibility of COO
  • 11. SpMV: Summary • Ample parallelism for large matrices • Structured matrices (dense, diagonal): straightforward • Sparse matrices: Tradeoff between parallel efficiency and load balance • Take-home message: Use data structure appropriate to your matrix • Insight: Irregularity is manageable if you regularize the common case
  • 13. The Programmable Pipeline [pipeline diagrams: the raster pipeline (Input Assembly, Vertex Shading, Tess. Shading, Geom. Shading, Raster, Frag. Shading, Compose); the Reyes pipeline (Split, Dice, Shading, Sampling, Composition); the ray tracing pipeline (Ray Generation, Ray Traversal, Ray-Primitive Intersection, Shading)]
  • 14. Composition [diagram: Fragment Generation produces fragments (position, depth, color); Composite produces subpixels (position, color); Filter produces final pixels] Anjul Patney, Stanley Tzeng, and John D. Owens. Fragment-Parallel Composite and Filter. Computer Graphics Forum (Proceedings of the Eurographics Symposium on Rendering), 29(4):1251–1258, June 2010.
  • 15. Why?
  • 16. Pixel-Parallel Composition [diagram: subpixel samples (Subpixel 0, Subpixel 1) grouped under their owning pixels (Pixel i, Pixel i+1)]
  • 17. Sample-Parallel Composition [diagram: subpixel samples for Pixel i and Pixel i+1 processed by a segmented scan followed by a segmented reduction, with one segment per pixel]
  • 19. Nested Data Parallelism [diagram: the COO sparse matrix-vector multiply and the sample-parallel composite (segmented scan + segmented reduction over subpixels grouped per pixel) shown side by side as examples of nested data parallelism]
  • 21. Hash Tables & Sparsity Lefebvre and Hoppe, Siggraph 2006
  • 22. Scalar Hashing [diagram: three collision-resolution strategies for a key: Linear Probing, Double Probing, Chaining]
  • 23. Scalar Hashing: Parallel Problems • Construction and Lookup • Variable time/work per entry • Construction • Synchronization / shared access to data structure
  • 24. Parallel Hashing: The Problem • Hash tables are good for sparse data. • Input: Set of key-value pairs to place in the hash table • Output: Data structure that allows: • Determining if key has been placed in hash table • Given the key, fetching its value • Could also: • Sort key-value pairs by key (construction) • Binary-search sorted list (lookup) • Recalculate at every change
  • 25. Parallel Hashing: What We Want • Fast construction time • Fast access time • O(1) for any element, O(n) for n elements in parallel • Reasonable memory usage • Algorithms and data structures may sit at different places in this space • Perfect spatial hashing has good lookup times and memory usage but is very slow to construct
  • 26. Level 1: Distribute into buckets [diagram: keys hashed to bucket ids; an atomic add produces bucket sizes and per-key local offsets; a scan of the bucket sizes gives global offsets; data is then distributed into buckets] Dan A. Alcantara, Andrei Sharf, Fatemeh Abbasinejad, Shubhabrata Sengupta, Michael Mitzenmacher, John D. Owens, and Nina Amenta. Real-Time Parallel Hashing on the GPU. Siggraph Asia 2009 (ACM TOG 28(5)), pp. 154:1–154:9, Dec. 2009.
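A sketch of the bucket-distribution step above (assumed names, not the paper's code): each thread hashes its key and bumps the bucket's counter with an atomic add to get a local offset; a later scan of the bucket sizes produces the global offsets used to scatter the keys:

    __global__ void count_buckets(int n, const unsigned* keys, int num_buckets,
                                  unsigned* bucket_sizes,      // zero-initialized counters
                                  unsigned* bucket_of, unsigned* local_offset)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i >= n) return;
        unsigned b = keys[i] % num_buckets;                    // stand-in for the bucket hash h
        bucket_of[i]    = b;
        local_offset[i] = atomicAdd(&bucket_sizes[b], 1u);     // size counter doubles as local offset
    }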
  • 27. Parallel Hashing: Level 1 • FKS (Level 1) is good for a coarse categorization • Possible performance issue: atomics • Bad for a fine categorization • Space requirement for n elements to (probabilistically) guarantee no collisions is O(n^2)
  • 28. Hashing in Parallel [diagram: keys hashed in parallel by two hash functions, ha and hb, each key getting two candidate slots]
  • 29. Cuckoo Hashing Construction [diagram: two hash functions h1, h2 inserting into two tables T1, T2] • Lookup procedure: in parallel, for each element: • Calculate h1 & look in T1; • Calculate h2 & look in T2; still O(1) lookup
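A sketch of the two-table cuckoo lookup described on this slide. The 64-bit packing (key in the high word, value in the low word) and the hash functions are illustrative assumptions, not the paper's choices:

    __device__ unsigned h1(unsigned k) { return (k ^ 0x9E3779B9u) * 2654435761u; }
    __device__ unsigned h2(unsigned k) { return (k ^ 0x85EBCA6Bu) * 2246822519u; }

    __device__ unsigned long long cuckoo_lookup(unsigned key,
                                                const unsigned long long* T1,
                                                const unsigned long long* T2,
                                                unsigned table_size)
    {
        unsigned long long e = T1[h1(key) % table_size];         // first probe
        if ((unsigned)(e >> 32) == key) return e;
        e = T2[h2(key) % table_size];                             // second probe; still O(1)
        if ((unsigned)(e >> 32) == key) return e;
        return ~0ULL;                                             // not found
    }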
  • 30. Cuckoo Construction Mechanics • Level 1 created buckets of no more than 512 items • Average: 409; probability of overflow: < 10^-6 • Level 2: Assign each bucket to a thread block, construct cuckoo hash per bucket entirely within shared memory • Semantic: Multiple writes to same location must have one and only one winner • Our implementation uses 3 tables of 192 elements each (load factor: 71%) • What if it fails? New hash functions & start over.
  • 31. Timings on random voxel data
  • 32. Why did we pick this implementation? • Alternative: single stage, cuckoo hash only • Problem 1: Uncoalesced reads/writes to global memory (and may require atomics) • Problem 2: Penalty for failure is large
  • 33. But it turns out … Thanks to expert reviewer (and current collaborator) Vasily Volkov! • … this simpler approach is faster. • Single hash table in global memory (1.4x data size), 4 different hash functions into it • For each thread: • Atomically exchange my item with item in hash table • If my new item is empty, I’m done • If my new item used hash function n, use hash function n+1 to reinsert
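A sketch of the simpler single-table scheme described above: each thread keeps atomically exchanging its item into the table until it gets an empty slot back, switching to the next hash function for each evicted item. The packing, the hash-function family, the EMPTY sentinel, and the retry limit are all assumptions; the slide only fixes the idea of atomicExch plus four hash functions:

    __device__ unsigned hash_n(unsigned k, int n) { return (k ^ (0x9E3779B9u * (n + 1))) * 2654435761u; }

    __global__ void insert_items(int num_items, const unsigned* keys, const unsigned* vals,
                                 unsigned long long* table,   // initialized to EMPTY; keys must not be 0xFFFFFFFF
                                 unsigned table_size, int* failed)
    {
        const unsigned long long EMPTY = 0xFFFFFFFFFFFFFFFFull;
        const int NUM_FUNCS = 4, MAX_ITERS = 100;                // retry limit is an assumed value

        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i >= num_items) return;

        unsigned long long item = ((unsigned long long)keys[i] << 32) | vals[i];
        int fn = 0;
        for (int iter = 0; iter < MAX_ITERS; ++iter) {
            unsigned slot = hash_n((unsigned)(item >> 32), fn) % table_size;
            item = atomicExch(&table[slot], item);               // swap my item with the occupant
            if (item == EMPTY) return;                           // the slot was free: done
            unsigned ekey = (unsigned)(item >> 32);              // evicted item: find which function placed it
            fn = 0;
            while (fn < NUM_FUNCS - 1 && hash_n(ekey, fn) % table_size != slot) ++fn;
            fn = (fn + 1) % NUM_FUNCS;                           // reinsert it with the next hash function
        }
        *failed = 1;                                             // too many evictions: host restarts with new functions
    }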
  • 34. Hashing: Big Ideas • Classic serial hashing techniques are a poor fit for a GPU. • Serialization, load balance • Solving this problem required a different algorithm • Both hashing algorithms were new to the parallel literature • Hybrid algorithm was entirely new
  • 36. Representing Smooth Surfaces • Polygonal Meshes • Lack view adaptivity • Inefficient storage/transfer • Parametric Surfaces • Collections of smooth patches • Hard to animate • Subdivision Surfaces • Widely popular, easy for modeling • Flexible animation [Boubekeur and Schlick 2007]
  • 37. Parallel Traversal Anjul Patney and John D. Owens. Real-Time Reyes-Style Adaptive Surface Subdivision. ACM Transactions on Graphics, 27(5):143:1–143:8, December 2008.
  • 38. Cracks & T-Junctions [figure: Uniform Refinement vs. Adaptive Refinement, with a Crack and a T-Junction labeled] Previous work was post-process after all steps; ours was inherent within subdivision process
  • 39. Recursive Subdivision is Irregular [figure: patches subdivided to different depths] Patney, Ebeida, and Owens. “Parallel View-Dependent Tessellation of Catmull-Clark Subdivision Surfaces”. HPG ’ .
  • 41. Static Task List [diagram: each SM fetches input from a static task list through an atomic pointer, writes output, and the kernel restarts] Daniel Cederman and Philippas Tsigas, On Dynamic Load Balancing on Graphics Processors. Graphics Hardware 2008, June 2008.
  • 42. Private Work Queue Approach • Allocate private work queue of tasks per core • Each core can add to or remove work from its local queue • Cores mark self as idle if {queue exhausts storage, queue is empty} • Cores periodically check global idle counter • If global idle counter reaches threshold, rebalance work gProximity: Fast Hierarchy Operations on GPU Architectures, Lauterbach, Mo, and Manocha, EG ’
  • 43. Work Stealing & Donating [diagram: one lock-protected deque per SM, with I/O between the deques] • Cederman and Tsigas: Stealing == best performance and scalability (follows Arora CPU-based work) • We showed how to do this with multiple kernels in an uberkernel and persistent-thread programming style • We added donating to minimize memory usage
  • 44. Ingredients for Our Scheme Implementation questions that we need to address: What is the proper granularity for tasks? (Warp Size Work Granularity) How many threads to launch? (Persistent Threads) How to avoid global synchronizations? (Uberkernels) How to distribute tasks evenly? (Task Donation)
  • 45. Warp Size Work Granularity • Problem: We want to emulate task level parallelism on the GPU without loss in efficiency. • Solution: we choose block sizes of 32 threads / block. • Removes messy synchronization barriers. • Can view each block as a MIMD thread. We call these blocks processors. [diagram: many 32-thread processors mapped onto a few physical processors]
  • 46. Uberkernel Processor Utilization • Problem: Want to eliminate global kernel barriers for better processor utilization • Uberkernels pack multiple execution routes into one kernel. [diagram: a two-stage pipeline (Stage 1, Stage 2) folded into a single uberkernel that routes the data flow between stages]
  • 47. Persistent Thread Scheduler Emulation • Problem: If input is irregular, how many threads do we launch? • Launch enough to fill the GPU, and keep them alive so they keep fetching work. Life of a persistent thread: Spawn, then repeatedly Fetch Data, Process Data, Write Output, then Death. How do we know when to stop? When there is no more work left.
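A sketch tying the uberkernel and persistent-thread slides together (assumed task-queue layout, not the paper's code): launch only enough blocks to fill the GPU, and let each block loop, fetching tasks from a global queue and dispatching on the task type until the queue is drained. The real scheme uses per-processor deques with stealing and donating rather than this single static queue:

    struct Task { int type; int data; };

    __global__ void uberkernel(const Task* queue, int num_tasks, int* head)
    {
        __shared__ Task t;
        while (true) {
            if (threadIdx.x == 0) {                        // one thread fetches the next task index
                int idx = atomicAdd(head, 1);
                if (idx < num_tasks) t = queue[idx];
                else                 t.type = -1;          // sentinel: no work left
            }
            __syncthreads();
            if (t.type == -1) return;                      // persistent block exits only when the queue is empty
            switch (t.type) {                              // uberkernel: several execution routes in one kernel
                case 0: /* pipeline stage 1 on t.data */ break;
                case 1: /* pipeline stage 2 on t.data */ break;
            }
            __syncthreads();                               // keep t stable until everyone is done with it
        }
    }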
  • 48. Task Donation • When a bin is full, a processor can give work to someone else. [diagram: processors P1-P4; this processor's bin is full and donates its work to someone else's bin] • Smaller memory usage • More complicated
  • 49. Queuing Schemes [charts: per-processor work processed (gray bars, showing load balance) and idle iterations (black line) for four schemes: (a) Block Queuing, (b) Distributed Queuing, (c) Task Stealing, (d) Task Donating] Stanley Tzeng, Anjul Patney, and John D. Owens. Task Management for Irregular-Parallel Workloads on the GPU. In Proceedings of High Performance Graphics 2010, pages 29–37, June 2010.
  • 50. Smooth Surfaces, High Detail 16 samples per pixel >15 frames per second on GeForce GTX280
  • 52. MapReduce http://m.blog.hu/dw/dwbi/image/2009/Q4/mapreduce_small.png
  • 53. Why MapReduce? • Simple programming model • Parallel programming model • Scalable • Previous GPU work: neither multi-GPU nor out-of-core
  • 54. Block Diagram [diagram: GPMR multi-GPU MapReduce block diagram] Jeff A. Stuart and John D. Owens. Multi-GPU MapReduce on GPU Clusters. In Proceedings of the 25th IEEE International Parallel and Distributed Processing Symposium, May 2011.
  • 55. Keys to Performance [diagram: GPMR block diagram] • Process data in chunks • More efficient transmission & computation • Also allows out of core • Overlap computation and communication • Accumulate • Partial Reduce
  • 56. Partial Reduce [diagram: GPMR block diagram] • User-specified function combines pairs with the same key • Each map chunk spawns a partial reduction on that chunk • Only good when cost of reduction is less than cost of transmission • Good for ~larger number of keys • Reduces GPU->CPU bw
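A small sketch of a partial reduce over one map chunk's output using Thrust; the sort-then-reduce interface and the sum combiner are illustrative assumptions, and GPMR's actual user-specified combine function and data layout differ:

    #include <thrust/device_vector.h>
    #include <thrust/sort.h>
    #include <thrust/reduce.h>

    // Combine this chunk's pairs that share a key before shipping them off the GPU.
    void partial_reduce(thrust::device_vector<int>& keys, thrust::device_vector<int>& vals)
    {
        thrust::sort_by_key(keys.begin(), keys.end(), vals.begin());             // group equal keys
        thrust::device_vector<int> out_keys(keys.size()), out_vals(vals.size());
        auto ends = thrust::reduce_by_key(keys.begin(), keys.end(), vals.begin(),
                                          out_keys.begin(), out_vals.begin());   // sum values per key
        out_keys.resize(ends.first - out_keys.begin());
        out_vals.resize(ends.second - out_vals.begin());
        keys.swap(out_keys);                                                     // fewer pairs cross the bus
        vals.swap(out_vals);
    }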
  • 57. Accumulate [diagram: GPMR block diagram] • Mapper has explicit knowledge about nature of key-value pairs • Each map chunk accumulates its key-value outputs with GPU-resident key-value accumulated outputs • Good for small number of keys • Also reduces GPU->CPU bw
  • 58. Benchmarks—Which • Matrix Multiplication (MM) • Word Occurrence (WO) • Sparse-Integer Occurrence (SIO) • Linear Regression (LR) • K-Means Clustering (KMC) • (Volume Renderer—presented @ MapReduce ’10) Jeff A. Stuart, Cheng-Kai Chen, Kwan-Liu Ma, and John D. Owens. Multi-GPU Volume Rendering using MapReduce. In HPDC '10: Proceedings of the 19th ACM International Symposium on High Performance Distributed Computing / MAPREDUCE '10: The First International Workshop on MapReduce and its Applications, pages 841–848, June 2010
  • 59. Benchmarks—Why • Needed to stress aspects of GPMR • Unbalanced work (WO) • Multiple emits/Non-uniform number of emits (LR, KMC, WO) • Sparsity of keys (SIO) • Accumulation (WO, LR, KMC) • Many key-value pairs (SIO) • Compute Bound Scalability (MM)
  • 61. Benchmarks—Results We test GPMR against all available input sets. TABLE 2: Speedup for GPMR over Phoenix (CPU) on our large (second-biggest) input data from our first set; the exception is MM, for which we use our small input set (Phoenix required almost twenty seconds to multiply two 1024 × 1024 matrices). 1-GPU / 4-GPU speedup: MM 162.712 / 559.209; KMC 2.991 / 11.726; LR 1.296 / 4.085; SIO 1.450 / 2.322; WO 11.080 / 18.441. TABLE 3: Speedup for GPMR over Mars (GPU) on 4096 × 4096 Matrix Multiplication, an 8M-point K-Means Clustering, and a 512 MB Word Occurrence; these sizes represent the largest problems that can meet the in-core memory requirements of Mars. 1-GPU / 4-GPU speedup: MM 2.695 / 10.760; KMC 37.344 / 129.425; WO 3.098 / 11.709.
  • 65. MapReduce Summary • Time is right to explore diverse programming models on GPUs • Few threads -> many, heavy threads -> light • Lack of {bandwidth, GPU primitives for communication, access to NIC} are challenges • What happens if GPUs get a lightweight serial processor? • Future hybrid hardware is exciting!
  • 67. Nested Data Parallelism How do we generalize this? How do we abstract it? What are design alternatives? [diagram: the COO sparse matrix-vector multiply and the sample-parallel composite (segmented scan + segmented reduction over subpixels grouped per pixel), as on slide 19]
  • 68. Hashing Tradeoffs between construction time, access time, and space? Incremental data structures?
  • 69. Work Stealing & Donating Generalize and abstract task queues? [structural pattern?] Persistent threads / other sticky CUDA constructs? Whole pipelines? Priorities / real-time constraints? [diagram: one lock-protected deque per SM] • Cederman and Tsigas: Stealing == best performance and scalability • We showed how to do this with an uberkernel and persistent thread programming style • We added donating to minimize memory usage
  • 70. Horizontal vs. Vertical Development [diagram: three software stacks. CPU: Applications on Higher-Level Libraries on Primitive Libraries on Programming System Primitives on Hardware. GPU (Usual): App 1, App 2, App 3 built directly on Programming System Primitives and Hardware (little code reuse!). GPU (Our Goal): Apps built on Algorithm/Data Structure Libraries (Domain-specific, Algorithm, Data Structure, etc.) on Programming System Primitives on Hardware.] [CUDPP library]
  • 71. Energy Cost per Operation Energy Cost of Operations (energy in pJ): 64b Floating FMA (2 ops): 100; 64b Integer Add: 1; Write 64b DFF: 0.5; Read 64b Register (64 x 32 bank): 3.5; Read 64b RAM (64 x 2K): 25; Read tags (24 x 2K): 8; Move 64b 1mm: 6; Move 64b 20mm: 120; Move 64b off chip: 256; Read 64b from DRAM: 2000. From Bill Dally talk, September 2009
  • 72. Dwarfs • 1. Dense Linear Algebra • 2. Sparse Linear Algebra • 3. Spectral Methods • 4. N-Body Methods • 5. Structured Grids • 6. Unstructured Grids • 7. MapReduce • 8. Combinational Logic • 9. Graph Traversal • 10. Dynamic Programming • 11. Backtrack and Branch-and-Bound • 12. Graphical Models • 13. Finite State Machines
  • 73. Dwarfs • 1. Dense Linear Algebra • 2. Sparse Linear Algebra • 3. Spectral Methods • 4. N-Body Methods • 5. Structured Grids • 6. Unstructured Grids • 7. MapReduce • 8. Combinational Logic • 9. Graph Traversal • 10. Dynamic Programming • 11. Backtrack and Branch-and-Bound • 12. Graphical Models • 13. Finite State Machines
  • 74. The Programmable Pipeline Bricks & mortar: how do we allow programmers to build stages without worrying about assembling them together? [pipeline diagrams as on slide 13]
  • 75. The Programmable Pipeline [pipeline diagrams as on slide 13, annotated with [Irregular] Parallelism and Heterogeneity] What does it look like when you build applications with GPUs as the primary processor?
  • 76. Owens Group Interests • Irregular parallelism • Optimization: self-tuning and tools • Fundamental data structures and algorithms • Abstract models of GPU computation • Computational patterns for parallelism • Multi-GPU computing: programming models and abstractions
  • 77. Thanks to ... • Nathan Bell, Michael Garland, David Luebke, Sumit Gupta, Dan Alcantara, Anjul Patney, and Stanley Tzeng for helpful comments and slide material. • Funding agencies: Department of Energy (SciDAC Institute for Ultrascale Visualization, Early Career Principal Investigator Award), NSF, Intel Science and Technology Center for Visual Computing, LANL, BMW, NVIDIA, HP, UC MICRO, Microsoft, ChevronTexaco, Rambus
  • 78. “If you were plowing a field, which would you rather use? Two strong oxen or 1024 chickens?” —Seymour Cray
  • 80. Parallel Prefix Sum (Scan) • Given an array A = [a0, a1, …, an-1] and a binary associative operator ⊕ with identity I, • scan(A) = [I, a0, (a0 ⊕ a1), …, (a0 ⊕ a1 ⊕ … ⊕ an-2)] • Example: if ⊕ is addition, then scan on the set • [3 1 7 0 4 1 6 3] • returns the set • [0 3 4 11 11 15 16 22]
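A short Thrust check of the example above (a sketch, not part of the talk): an exclusive plus-scan of [3 1 7 0 4 1 6 3] yields [0 3 4 11 11 15 16 22]:

    #include <thrust/device_vector.h>
    #include <thrust/scan.h>
    #include <cstdio>

    int main() {
        int a[8] = {3, 1, 7, 0, 4, 1, 6, 3};
        thrust::device_vector<int> d(a, a + 8);
        thrust::exclusive_scan(d.begin(), d.end(), d.begin());   // in-place exclusive scan with + and identity 0
        for (int i = 0; i < 8; ++i) std::printf("%d ", (int)d[i]);
        std::printf("\n");                                        // prints: 0 3 4 11 11 15 16 22
        return 0;
    }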
  • 81. Segmented Scan • Example: if ⊕ is addition, then scan on the set • [3 1 7 | 0 4 1 | 6 3] • returns the set • [0 3 4 | 0 0 4 | 0 6] • Same computational complexity as scan, but additionally have to keep track of segments (we use head flags to mark which elements are segment heads) • Useful for nested data parallelism (quicksort)
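The segmented example above can be reproduced with Thrust's by-key scans (a sketch, not the talk's implementation): head flags are first turned into segment ids with an inclusive scan, then an exclusive scan runs independently inside each segment:

    #include <thrust/device_vector.h>
    #include <thrust/scan.h>
    #include <cstdio>

    int main() {
        int vals[8]  = {3, 1, 7, 0, 4, 1, 6, 3};
        int flags[8] = {1, 0, 0, 1, 0, 0, 1, 0};                  // head flags mark segment starts
        thrust::device_vector<int> v(vals, vals + 8), f(flags, flags + 8), keys(8), out(8);
        thrust::inclusive_scan(f.begin(), f.end(), keys.begin()); // flags -> segment ids 1 1 1 2 2 2 3 3
        thrust::exclusive_scan_by_key(keys.begin(), keys.end(), v.begin(), out.begin());
        for (int i = 0; i < 8; ++i) std::printf("%d ", (int)out[i]);
        std::printf("\n");                                        // prints: 0 3 4 0 0 4 0 6
        return 0;
    }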