Nested data parallelism in Haskell

Simon Peyton Jones (Microsoft)
Manuel Chakravarty, Gabriele Keller, Roman Leshchinskiy
(University of New South Wales)

Feb 2007
Road map

Multicore: parallel programming is essential.

Task parallelism
• Explicit threads
• Synchronise via locks, messages, or STM
• Modest parallelism
• Hard to program

Data parallelism
• Operate simultaneously on bulk data
• Single flow of control
• Implicit synchronisation
• Massive parallelism
• Easy to program
Data parallelism
The key to using multicores

Flat data parallel
• Apply sequential operation to bulk data
• The brand leader
• Limited applicability (dense matrix, map/reduce)
• Well developed
• Limited new opportunities
• e.g. Fortran(s), *C, MPI, map/reduce

Nested data parallel
• Apply parallel operation to bulk data
• Developed in the 90’s
• Much wider applicability (sparse matrix, graph algorithms, games etc)
• Practically un-developed
• Huge opportunity

Flat data parallel
 The brand leader: widely used, well
  understood, well supported
          foreach i in 1..N {
             ...do something to A[i]...
          }

 BUT: “something” is sequential
 Single point of concurrency
 Easy to implement:
  use “chunking”
 Good cost model
[Diagram: 1,000,000’s of (small) work items chunked across processors P1, P2, P3]
Nested data parallel
 Main idea: allow “something” to be
  parallel
             foreach i in 1..N {
                ...do something to A[i]...
             }

 Now the parallelism
  structure is recursive,
  and un-balanced
 Still good cost model
 Hard to implement!
[Diagram: still 1,000,000’s of (small) work items, now in an irregular tree]
Nested DP is great for programmers
 Fundamentally more modular
 Opens up a much wider range of applications:
   – Sparse arrays, variable grid adaptive methods
     (e.g. Barnes-Hut)
   – Divide and conquer algorithms (e.g. sort)
   – Graph algorithms (e.g. shortest path, spanning
     trees)
   – Physics engines for games, computational
     graphics (e.g. Delaunay triangulation)
   – Machine learning, optimisation, constraint
     solving
Nested DP is tough for compilers
 ...because the concurrency tree is both
  irregular and fine-grained
 But it can be done! NESL (Blelloch
  1995) is an existence proof
 Key idea: “flattening” transformation:
       nested DP ⟹ flat DP

 Great work, surprisingly ignored since
  1990’s; ripe for exploitation
Array comprehensions

[:Float:] is the type of parallel arrays of Float

 vecMul :: [:Float:] -> [:Float:] -> Float
 vecMul v1 v2 = sumP [: f1*f2 | f1 <- v1 | f2 <- v2 :]

 sumP :: [:Float:] -> Float

An array comprehension: “the array of all f1*f2 where f1 is drawn from v1 and f2 from v2”.

Operations over parallel arrays are computed in parallel; that is the only way the programmer says “do parallel stuff”.

NB: no locks!
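Parallel arrays [:a:] are not standard Haskell, so as a hedged illustration here is a purely sequential list model of vecMul; the real DPH version evaluates the comprehension in parallel, but the value computed is the same:

```haskell
-- Sequential list model of vecMul (illustration only; real DPH code
-- uses [:Float:] and evaluates the comprehension in parallel).
vecMul :: [Float] -> [Float] -> Float
vecMul v1 v2 = sum (zipWith (*) v1 v2)
```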
Sparse vector multiplication

A sparse vector is represented as a vector of (index,value) pairs.

svMul :: [:(Int,Float):] -> [:Float:] -> Float
svMul sv v = sumP [: f*(v!i) | (i,f) <- sv :]

v!i gets the i’th element of v. Parallelism is proportional to the length of the sparse vector.
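The same comprehension translates directly onto ordinary lists; this sequential sketch (with (!!) standing in for the parallel indexing operator (!)) shows the value svMul computes:

```haskell
-- Sequential list model of svMul: dot product of a sparse vector
-- (index,value pairs) with a dense vector.
svMul :: [(Int, Float)] -> [Float] -> Float
svMul sv v = sum [ f * (v !! i) | (i, f) <- sv ]
```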
Sparse matrix multiplication
   A sparse matrix is a vector of sparse
                 vectors



smMul :: [:[:(Int,Float):]:] -> [:Float:] -> Float
smMul sm v = sumP [: svMul sv v | sv <- sm :]




           Nested data parallelism here!
       We are calling a parallel operation, svMul, on
          every element of a parallel array, sm
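As a sequential sketch of the nesting, here is the list model again (svMul restated so the block is self-contained); in DPH the outer comprehension runs svMul in parallel over every row:

```haskell
-- List model of nested data parallelism: an outer "parallel" loop
-- (here just a comprehension) calling svMul on each sparse row.
svMul :: [(Int, Float)] -> [Float] -> Float
svMul sv v = sum [ f * (v !! i) | (i, f) <- sv ]

smMul :: [[(Int, Float)]] -> [Float] -> Float
smMul sm v = sum [ svMul sv v | sv <- sm ]
```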
Hard to implement well
• Evenly chunking at top level might be ill-balanced
• Top level alone might not be very parallel
The flattening transformation
• Concatenate sub-arrays into one big, flat array
• Operate in parallel on the big array
• A segment vector keeps track of where the sub-arrays are

[Diagram: nested sub-arrays flattened into one data array plus a segment vector ...etc]

• Lots of tricksy book-keeping!
• Possible to do by hand (and done in practice), but very hard to get right
• Blelloch showed it could be done systematically
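A minimal sketch of the representation, on lists: a nested array becomes a segment vector (here just the sub-array lengths) plus one flat data array, and the two directions are inverses:

```haskell
-- Flatten a nested array into (segment vector, flat data).
flatten :: [[a]] -> ([Int], [a])
flatten xss = (map length xss, concat xss)

-- Rebuild the nested array from the segment vector and flat data.
unflatten :: ([Int], [a]) -> [[a]]
unflatten ([],   _ ) = []
unflatten (n:ns, xs) = take n xs : unflatten (ns, drop n xs)
```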
Data-parallel quicksort

sort :: [:Float:] -> [:Float:]
sort a = if (length a <= 1) then a
         else sa!0 +++ eq +++ sa!1
   where
      m  = a!0
      lt = [: f | f<-a, f<m  :]    -- parallel
      eq = [: f | f<-a, f==m :]    -- filters
      gr = [: f | f<-a, f>m  :]
      sa = [: sort a | a <- [:lt,gr:] :]

2-way nested data parallelism here!
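Transcribed onto plain lists the same algorithm runs sequentially; this sketch keeps the slide's shape (two recursive sub-sorts collected in sa) so the 2-way nesting is visible:

```haskell
-- Sequential list model of the data-parallel quicksort above.
sort :: [Float] -> [Float]
sort a
  | length a <= 1 = a
  | otherwise     = (sa !! 0) ++ eq ++ (sa !! 1)
  where
    m  = a !! 0
    lt = [ f | f <- a, f <  m ]
    eq = [ f | f <- a, f == m ]
    gr = [ f | f <- a, f >  m ]
    sa = [ sort xs | xs <- [lt, gr] ]   -- both sub-sorts, in "parallel"
```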
How it works

[Diagram: step 1 — one sort; step 2 — two sub-sorts; step 3 — more sub-sorts; ...etc...]

• All sub-sorts at the same level are done in parallel
• Segment vectors track which chunk belongs to which sub-problem
• Instant insanity when done by hand
Fusion
  Flattening is not enough
vecMul :: [:Float:] -> [:Float:] -> Float
vecMul v1 v2 = sumP [: f1*f2 | f1 <- v1 | f2 <- v2 :]

  Do not
        1. Generate [: f1*f2 | f1 <- v1 | f2 <- v2 :]
                (big intermediate vector)
        2. Add up the elements of this vector
  Instead: multiply and add in the same loop
  That is, fuse the multiply loop with the add
   loop
  Very general, aggressive fusion is required
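For concreteness, here is roughly what fusion should produce for vecMul: a single strict accumulating loop that multiplies and adds as it goes, with no intermediate vector of products (a sketch, not the actual GHC output):

```haskell
{-# LANGUAGE BangPatterns #-}

-- The fused form of sumP [: f1*f2 | ... :]: one tight loop,
-- no intermediate array.
vecMulFused :: [Float] -> [Float] -> Float
vecMulFused = go 0
  where
    go !acc (x:xs) (y:ys) = go (acc + x * y) xs ys
    go !acc _      _      = acc
```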
Purity pays off
 Two key transformations:
  – Flattening
  – Fusion
 Both depend utterly on purely-
  functional semantics:
  – no assignments
  – every operation is a pure function

  The data-parallel languages of the
  future will be functional languages
What we are doing about it

NESL: a mega-breakthrough, but:
  – specialised, prototype
  – first order
  – few data types
  – no fusion
  – interpreted

Haskell:
  – broad-spectrum, widely used
  – higher order
  – very rich data types
  – aggressive fusion
  – compiled

Substantial improvement in
• Expressiveness
• Performance

• Shared memory initially
• Distributed memory eventually
• GPUs anyone?
And it goes fast too...

[Benchmark chart: the 1-processor version goes only 30% slower than C; perf win with 2 processors. (Pinch of salt.)]
Summary so far
 Data parallelism is the only way to harness
  100’s of cores
 Nested DP is great for programmers: far, far
  more flexible than flat DP
 Nested DP is tough to implement
 But we (think we) know how to do it
 Functional programming is a massive win in
  this space
 Haskell prototype in 2007



  http://haskell.org/haskellwiki/GHC/Data_Parallel_Haskell
Implementing
Data Parallel Haskell
Three forms of concurrency

 Explicit threads
   Non-deterministic by design
   Monadic: forkIO and STM

      main :: IO ()
      main = do { ch <- newChan
                ; forkIO (ioManager ch)
                ; forkIO (worker 1 ch)
                ... etc ... }

 Semi-implicit
   Deterministic
   Pure: par and seq

      f :: Int -> Int
      f x = a `par` b `seq` a + b
        where
          a = f (x-1)
          b = f (x-2)

 Data parallel
   Deterministic
   Pure: parallel arrays
   Shared memory initially; distributed memory eventually; possibly even GPUs
Main contribution: an optimising data-parallel
  compiler implemented by modest enhancements
 to a full-scale functional language implementation

Four key pieces of technology
1. Vectorisation
  –   specific to parallel arrays
2. Non-parametric data representations
  –   A generically useful new feature in GHC
3. Chunking
  –   Divide up the work evenly between processors
4. Aggressive fusion
  –   Uses “rewrite rules”, an old feature of GHC

      Not a special purpose data-parallel compiler!
      Most support is either useful for other things,
      or is in the form of library code.
Step 0: desugaring
svMul :: [:(Int,Float):] -> [:Float:] -> Float
svMul sv v = sumP [: f*(v!i) | (i,f) <- sv :]


           sumP :: Num a => [:a:] -> a
           mapP :: (a -> b) -> [:a:] -> [:b:]




svMul :: [:(Int,Float):] -> [:Float:] -> Float
svMul sv v = sumP (mapP (\(i,f) -> f * (v!i)) sv)
Step 1: Vectorisation
   svMul :: [:(Int,Float):] -> [:Float:] -> Float
   svMul sv v = sumP (mapP (\(i,f) -> f * (v!i)) sv)


         sumP      ::   Num a => [:a:] -> a
         *^        ::   Num a => [:a:] -> [:a:] -> [:a:]
         fst^      ::   [:(a,b):] -> [:a:]
         snd^      ::   [:(a,b):] -> [:b:]
         bpermuteP ::   [:a:] -> [:Int:] -> [:a:]




svMul :: [:(Int,Float):] -> [:Float:] -> Float
svMul sv v = sumP (snd^ sv *^ bpermuteP v (fst^ sv))

            Scalar operation * replaced by
                 vector operation *^
Vectorisation: the basic idea
        mapP f v   ⟹   f^ v
     f :: T1 -> T2
     f^ :: [:T1:] -> [:T2:]    -- f^ = mapP f

 For every function f, generate its
  vectorised version, namely f^
 Result: a functional program, operating over
  flat arrays, with a fixed set of primitive
  operations *^, sumP, fst^, etc.
 Lots of intermediate arrays!
Vectorisation: the basic idea
   f :: Int -> Int
   f x = x+1

   f^ :: [:Int:] -> [:Int:]
   f^ x = x +^ (replicateP (lengthP x) 1)

        This           Transforms to this
        Locals, x      x
        Globals, g     g^
        Constants, k   replicateP (lengthP x) k



                     replicateP :: Int -> a -> [:a:]
                     lengthP :: [:a:] -> Int
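On lists, the table above yields a sequential model of the vectorised f^ for f x = x+1: the local x passes through unchanged, and the constant 1 is replicated to the length of the input:

```haskell
-- List model of f^ for f x = x+1: zipWith (+) plays the role of +^,
-- and replicate/length play replicateP/lengthP.
fV :: [Int] -> [Int]
fV x = zipWith (+) x (replicate (length x) 1)
```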
Vectorisation: the key insight
f :: [:Int:] -> [:Int:]
f a = mapP g a = g^ a

f^ :: [:[:Int:]:] -> [:[:Int:]:]
f^ a = g^^ a      --???




        Yet another version of g???
Vectorisation: the key insight
  f :: [:Int:] -> [:Int:]
                                       First concatenate,
  f a = mapP g a = g^ a
                                           then map,
                                          then re-split
  f^ :: [:[:Int:]:] -> [:[:Int:]:]
  f^ a = segmentP a (g^ (concatP a))


 concatP :: [:[:a:]:] -> [:a:]
 segmentP :: [:[:a:]:] -> [:b:] -> [:[:b:]:]


[Diagram: nested data ⟷ shape + flat data]

Payoff: f and f^ are enough. No f^^
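A list-based sketch of the insight, where gV is a hypothetical stand-in for some vectorised g^: applying the flat gV once to the concatenation and re-splitting with the original shape gives the doubly-lifted behaviour, with no g^^:

```haskell
-- Hypothetical vectorised g^ (here: double every element).
gV :: [Int] -> [Int]
gV = map (* 2)

-- f^ on nested input: concatenate, apply the flat gV once, re-split.
fVV :: [[Int]] -> [[Int]]
fVV a = segment (map length a) (gV (concat a))
  where
    segment []     _  = []
    segment (n:ns) ys = take n ys : segment ns (drop n ys)
```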
Step 2: Representing arrays

[:Double:]  Arrays of pointers to boxed numbers are Much Too Slow
[:(a,b):]   Arrays of pointers to pairs are Much Too Slow

Idea! The representation of an array depends on the element type.
Step 2: Representing arrays
[POPL05], [ICFP05], [TLDI07]

data family [:a:]

data instance [:Double:] = AD ByteArray
data instance [:(a,b):]  = AP [:a:] [:b:]

 Now *^ is a fast loop
 And fst^ is constant time!

fst^ :: [:(a,b):] -> [:a:]
fst^ (AP as bs) = as
Step 2: Nested arrays

A nested array is a shape (segment vector) plus flat data:

data instance [:[:a:]:] = AN [:Int:] [:a:]

concatP :: [:[:a:]:] -> [:a:]
concatP (AN shape xs) = xs

segmentP :: [:[:a:]:] -> [:b:] -> [:[:b:]:]
segmentP (AN shape _) xs = AN shape xs

Surprise: concatP, segmentP are constant time!
Higher order complications
f   :: T1 -> T2 -> T3

f1^ :: [:T1:] -> [:T2:] -> [:T3:] -- f1^ = zipWithP f
f2^ :: [:T1:] -> [:(T2 -> T3):]   -- f2^ = mapP f


 f1^ is good for [: f a b | a <- as | b <- bs :]
 But the type transformation is not uniform
 And sooner or later we want higher-order
  functions anyway
 f2^ forces us to find a representation for
  [:(T2->T3):]. Closure conversion [PAPP06]
Step 3: chunking
svMul :: [:(Int,Float):] -> [:Float:] -> Float
svMul (AP is fs) v = sumP (fs *^ bpermuteP v is)

 Program consists of
  – Flat arrays
  – Primitive operations over them
    (*^, sumP etc)
 Can directly execute this (NESL).
  – Hand-code assembler for primitive ops
  – All the time is spent here anyway
 But:
  – intermediate arrays, and hence memory traffic
  – each intermediate array is a synchronisation point
 Idea: chunking and fusion
Step 3: Chunking
svMul :: [:(Int,Float):] -> [:Float:] -> Float
svMul (AP is fs) v = sumP (fs *^ bpermuteP v is)


1. Chunking: Divide is,fs into chunks, one
   chunk per processor
2. Fusion: Execute sumP (fs *^ bpermute
   v is) in a tight, sequential loop on each
   processor
3. Combining: Add up the results of each
   chunk
 Step 2 alone is not good for a parallel machine!
Expressing chunking
sumP :: [:Float:] -> Float
sumP xs = sumD (mapD sumS (splitD xs))


splitD   :: [:a:] -> Dist [:a:]
mapD     :: (a->b) -> Dist a -> Dist b
sumD     :: Dist Float -> Float

sumS     :: [:Float:] -> Float        -- Sequential!


 sumS is a tight sequential loop
 mapD is the true source of parallelism:
  – it starts a “gang”,
  – runs it,
  – waits for all gang members to finish
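A sequential sketch of this decomposition on lists, with an explicit (hypothetical) processor count p: splitD cuts the array into one chunk per processor, sumS reduces each chunk in a tight loop, and sumD combines the partial sums:

```haskell
-- Sequential model of chunking: one chunk per "processor".
splitD :: Int -> [Float] -> [[Float]]
splitD p xs = go xs
  where
    n = max 1 ((length xs + p - 1) `div` p)   -- chunk size
    go [] = []
    go ys = take n ys : go (drop n ys)

-- sumD . mapD sumS . splitD, with sum standing in for both
-- the sequential reduction (sumS) and the combine step (sumD).
sumP :: Int -> [Float] -> Float
sumP p = sum . map sum . splitD p
```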
Expressing chunking
*^ :: [:Float:] -> [:Float:] -> [:Float:]
*^ xs ys = joinD (mapD mulS
            (zipD (splitD xs) (splitD ys)))


splitD   ::   [:a:] -> Dist [:a:]
joinD    ::   Dist [:a:] -> [:a:]
mapD     ::   (a->b) -> Dist a -> Dist b
zipD     ::   Dist a -> Dist b -> Dist (a,b)

mulS :: ([:Float:],[:Float:]) -> [:Float:]


 Again, mulS is a tight, sequential loop
Step 4: Fusion
svMul :: [:(Int,Float):] -> [:Float:] -> Float
svMul (AP is fs) v = sumP (fs *^ bpermuteP v is)
 = sumD . mapD sumS . splitD . joinD . mapD mulS $
      zipD (splitD fs) (splitD (bpermuteP v is))

 Aha! Now use rewrite rules:
  {-# RULE
     splitD (joinD x) = x
     mapD f (mapD g x) = mapD (f.g) x #-}


svMul :: [:(Int,Float):] -> [:Float:] -> Float
svMul (AP is fs) v = sumP (fs *^ bpermuteP v is)
 = sumD . mapD (sumS . mulS) $
      zipD (splitD fs) (splitD (bpermuteP v is))
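The RULE on the slide elides details: real GHC rewrite rules each need a name and explicit forall binders. A syntactically valid version, with sequential Dist stubs (an assumption, just enough for the rules to type-check and run):

```haskell
-- Sequential stand-in for the distributed-value type.
newtype Dist a = Dist [a]

splitD :: [a] -> Dist [a]
splitD xs = Dist [xs]            -- one "chunk" in this toy model

joinD :: Dist [a] -> [a]
joinD (Dist xss) = concat xss

mapD :: (a -> b) -> Dist a -> Dist b
mapD f (Dist xs) = Dist (map f xs)

-- The slide's rules in valid GHC syntax: named, with explicit foralls.
{-# RULES
"splitD/joinD" forall x.     splitD (joinD x)  = x
"mapD/mapD"    forall f g x. mapD f (mapD g x) = mapD (f . g) x
  #-}
```

The rules only fire when compiling with optimisation, and GHC applies them as left-to-right rewrites during simplification.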
Step 4: Sequential fusion
svMul :: [:(Int,Float):] -> [:Float:] -> Float
svMul (AP is fs) v = sumP (fs *^ bpermuteP v is)
 = sumD . mapD (sumS . mulS) $
      zipD (splitD fs) (splitD (bpermuteP v is))

 Now we have a sequential fusion
  problem.
 Problem:
  – lots and lots of functions over arrays
  – we can’t have fusion rules for every pair
 New idea: stream fusion
Stream fusion for lists
          map f (filter p (map g xs))

 Problem:
  – lots and lots of functions over lists
  – and they are recursive functions
 New idea: make map, filter etc non-
  recursive, by defining them to work
  over streams
Stream fusion for lists

data Stream a where
   S :: (s -> Step s a) -> s -> Stream a

data Step s a = Done | Yield a s

toStream :: [a] -> Stream a
toStream as = S step as
  where
   step []     = Done            -- Non-recursive!
   step (a:as) = Yield a as

fromStream :: Stream a -> [a]
fromStream (S step s) = loop s
  where
   loop s = case step s of       -- Recursive
              Yield a s' -> a : loop s'
              Done       -> []
Stream fusion for lists

mapStream :: (a->b) -> Stream a -> Stream b
mapStream f (S step s) = S step' s         -- Non-recursive!
  where
   step' s = case step s of
                Done       -> Done
                Yield a s' -> Yield (f a) s'

map :: (a->b) -> [a] -> [b]
map f xs = fromStream (mapStream f (toStream xs))
Stream fusion for lists

map f (map g xs)

= fromStream (mapStream f (toStream
   (fromStream (mapStream g (toStream xs)))))

=                 -- Apply (toStream (fromStream xs) = xs)
    fromStream (mapStream f (mapStream g (toStream xs)))

=                 -- Inline mapStream, toStream
    fromStream (S step xs)
     where
       step []     = Done
       step (x:xs) = Yield (f (g x)) xs
Stream fusion for lists

      fromStream (S step xs)
           where
             step []     = Done
             step (x:xs) = Yield (f (g x)) xs

      =                       -- Inline fromStream
          loop xs
            where
                loop []     = []
                loop (x:xs) = f (g x) : loop xs


 Key idea: mapStream, filterStream etc are all
  non-recursive, and can be inlined
 Works for arrays; change only fromStream,
  toStream
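The pieces above assemble into one runnable module (the existential state in Stream needs the GADTs extension; map is renamed mapL to avoid shadowing the Prelude):

```haskell
{-# LANGUAGE GADTs #-}

-- A stream pairs a non-recursive stepper with a state of hidden type s.
data Stream a where
  S :: (s -> Step s a) -> s -> Stream a

data Step s a = Done | Yield a s

toStream :: [a] -> Stream a
toStream xs = S step xs
  where
    step []     = Done              -- non-recursive stepper
    step (y:ys) = Yield y ys

fromStream :: Stream a -> [a]
fromStream (S step s0) = loop s0
  where
    loop s = case step s of         -- the only recursion
               Done       -> []
               Yield a s' -> a : loop s'

mapStream :: (a -> b) -> Stream a -> Stream b
mapStream f (S step s) = S step' s  -- non-recursive, so inlinable
  where
    step' t = case step t of
                Done       -> Done
                Yield a t' -> Yield (f a) t'

mapL :: (a -> b) -> [a] -> [b]
mapL f = fromStream . mapStream f . toStream
```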
Summary
 Nested data parallelism is the Way of the Future.
 Key ideas: vectorisation, array representation,
  chunking, fusion
 Lots more to say about all of these; each worth
  several papers
 All have to work together to achieve fast data
  parallel computation
 But the prize is well worth having
 And only functional programmers can get near it



http://haskell.org/haskellwiki/GHC/Data_Parallel_Haskell

Adventures In Copyright ReformAdventures In Copyright Reform
Adventures In Copyright Reformoscon2007
 
Railsconf2007
Railsconf2007Railsconf2007
Railsconf2007oscon2007
 
Oscon Mitchellbaker
Oscon MitchellbakerOscon Mitchellbaker
Oscon Mitchellbakeroscon2007
 

Más de oscon2007 (20)

Solr Presentation5
Solr Presentation5Solr Presentation5
Solr Presentation5
 
Os Borger
Os BorgerOs Borger
Os Borger
 
Os Fitzpatrick Sussman Wiifm
Os Fitzpatrick Sussman WiifmOs Fitzpatrick Sussman Wiifm
Os Fitzpatrick Sussman Wiifm
 
Yuicss R7
Yuicss R7Yuicss R7
Yuicss R7
 
Performance Whack A Mole
Performance Whack A MolePerformance Whack A Mole
Performance Whack A Mole
 
Os Fogel
Os FogelOs Fogel
Os Fogel
 
Os Lanphier Brashears
Os Lanphier BrashearsOs Lanphier Brashears
Os Lanphier Brashears
 
Os Tucker
Os TuckerOs Tucker
Os Tucker
 
Os Fitzpatrick Sussman Swp
Os Fitzpatrick Sussman SwpOs Fitzpatrick Sussman Swp
Os Fitzpatrick Sussman Swp
 
Os Furlong
Os FurlongOs Furlong
Os Furlong
 
Os Berlin Dispelling Myths
Os Berlin Dispelling MythsOs Berlin Dispelling Myths
Os Berlin Dispelling Myths
 
Os Kimsal
Os KimsalOs Kimsal
Os Kimsal
 
Os Pruett
Os PruettOs Pruett
Os Pruett
 
Os Alrubaie
Os AlrubaieOs Alrubaie
Os Alrubaie
 
Os Keysholistic
Os KeysholisticOs Keysholistic
Os Keysholistic
 
Os Jonphillips
Os JonphillipsOs Jonphillips
Os Jonphillips
 
Os Urnerupdated
Os UrnerupdatedOs Urnerupdated
Os Urnerupdated
 
Adventures In Copyright Reform
Adventures In Copyright ReformAdventures In Copyright Reform
Adventures In Copyright Reform
 
Railsconf2007
Railsconf2007Railsconf2007
Railsconf2007
 
Oscon Mitchellbaker
Oscon MitchellbakerOscon Mitchellbaker
Oscon Mitchellbaker
 

Último

Organizational Structure Running A Successful Business
Organizational Structure Running A Successful BusinessOrganizational Structure Running A Successful Business
Organizational Structure Running A Successful BusinessSeta Wicaksana
 
APRIL2024_UKRAINE_xml_0000000000000 .pdf
APRIL2024_UKRAINE_xml_0000000000000 .pdfAPRIL2024_UKRAINE_xml_0000000000000 .pdf
APRIL2024_UKRAINE_xml_0000000000000 .pdfRbc Rbcua
 
Traction part 2 - EOS Model JAX Bridges.
Traction part 2 - EOS Model JAX Bridges.Traction part 2 - EOS Model JAX Bridges.
Traction part 2 - EOS Model JAX Bridges.Anamaria Contreras
 
Cybersecurity Awareness Training Presentation v2024.03
Cybersecurity Awareness Training Presentation v2024.03Cybersecurity Awareness Training Presentation v2024.03
Cybersecurity Awareness Training Presentation v2024.03DallasHaselhorst
 
Fordham -How effective decision-making is within the IT department - Analysis...
Fordham -How effective decision-making is within the IT department - Analysis...Fordham -How effective decision-making is within the IT department - Analysis...
Fordham -How effective decision-making is within the IT department - Analysis...Peter Ward
 
Darshan Hiranandani [News About Next CEO].pdf
Darshan Hiranandani [News About Next CEO].pdfDarshan Hiranandani [News About Next CEO].pdf
Darshan Hiranandani [News About Next CEO].pdfShashank Mehta
 
Kenya Coconut Production Presentation by Dr. Lalith Perera
Kenya Coconut Production Presentation by Dr. Lalith PereraKenya Coconut Production Presentation by Dr. Lalith Perera
Kenya Coconut Production Presentation by Dr. Lalith Pereraictsugar
 
Chapter 9 PPT 4th edition.pdf internal audit
Chapter 9 PPT 4th edition.pdf internal auditChapter 9 PPT 4th edition.pdf internal audit
Chapter 9 PPT 4th edition.pdf internal auditNhtLNguyn9
 
TriStar Gold Corporate Presentation - April 2024
TriStar Gold Corporate Presentation - April 2024TriStar Gold Corporate Presentation - April 2024
TriStar Gold Corporate Presentation - April 2024Adnet Communications
 
8447779800, Low rate Call girls in Shivaji Enclave Delhi NCR
8447779800, Low rate Call girls in Shivaji Enclave Delhi NCR8447779800, Low rate Call girls in Shivaji Enclave Delhi NCR
8447779800, Low rate Call girls in Shivaji Enclave Delhi NCRashishs7044
 
Flow Your Strategy at Flight Levels Day 2024
Flow Your Strategy at Flight Levels Day 2024Flow Your Strategy at Flight Levels Day 2024
Flow Your Strategy at Flight Levels Day 2024Kirill Klimov
 
International Business Environments and Operations 16th Global Edition test b...
International Business Environments and Operations 16th Global Edition test b...International Business Environments and Operations 16th Global Edition test b...
International Business Environments and Operations 16th Global Edition test b...ssuserf63bd7
 
8447779800, Low rate Call girls in Saket Delhi NCR
8447779800, Low rate Call girls in Saket Delhi NCR8447779800, Low rate Call girls in Saket Delhi NCR
8447779800, Low rate Call girls in Saket Delhi NCRashishs7044
 
Memorándum de Entendimiento (MoU) entre Codelco y SQM
Memorándum de Entendimiento (MoU) entre Codelco y SQMMemorándum de Entendimiento (MoU) entre Codelco y SQM
Memorándum de Entendimiento (MoU) entre Codelco y SQMVoces Mineras
 
1911 Gold Corporate Presentation Apr 2024.pdf
1911 Gold Corporate Presentation Apr 2024.pdf1911 Gold Corporate Presentation Apr 2024.pdf
1911 Gold Corporate Presentation Apr 2024.pdfShaun Heinrichs
 
Financial-Statement-Analysis-of-Coca-cola-Company.pptx
Financial-Statement-Analysis-of-Coca-cola-Company.pptxFinancial-Statement-Analysis-of-Coca-cola-Company.pptx
Financial-Statement-Analysis-of-Coca-cola-Company.pptxsaniyaimamuddin
 
Call Us 📲8800102216📞 Call Girls In DLF City Gurgaon
Call Us 📲8800102216📞 Call Girls In DLF City GurgaonCall Us 📲8800102216📞 Call Girls In DLF City Gurgaon
Call Us 📲8800102216📞 Call Girls In DLF City Gurgaoncallgirls2057
 

Último (20)

Organizational Structure Running A Successful Business
Organizational Structure Running A Successful BusinessOrganizational Structure Running A Successful Business
Organizational Structure Running A Successful Business
 
APRIL2024_UKRAINE_xml_0000000000000 .pdf
APRIL2024_UKRAINE_xml_0000000000000 .pdfAPRIL2024_UKRAINE_xml_0000000000000 .pdf
APRIL2024_UKRAINE_xml_0000000000000 .pdf
 
Traction part 2 - EOS Model JAX Bridges.
Traction part 2 - EOS Model JAX Bridges.Traction part 2 - EOS Model JAX Bridges.
Traction part 2 - EOS Model JAX Bridges.
 
Corporate Profile 47Billion Information Technology
Corporate Profile 47Billion Information TechnologyCorporate Profile 47Billion Information Technology
Corporate Profile 47Billion Information Technology
 
Japan IT Week 2024 Brochure by 47Billion (English)
Japan IT Week 2024 Brochure by 47Billion (English)Japan IT Week 2024 Brochure by 47Billion (English)
Japan IT Week 2024 Brochure by 47Billion (English)
 
Cybersecurity Awareness Training Presentation v2024.03
Cybersecurity Awareness Training Presentation v2024.03Cybersecurity Awareness Training Presentation v2024.03
Cybersecurity Awareness Training Presentation v2024.03
 
Fordham -How effective decision-making is within the IT department - Analysis...
Fordham -How effective decision-making is within the IT department - Analysis...Fordham -How effective decision-making is within the IT department - Analysis...
Fordham -How effective decision-making is within the IT department - Analysis...
 
Darshan Hiranandani [News About Next CEO].pdf
Darshan Hiranandani [News About Next CEO].pdfDarshan Hiranandani [News About Next CEO].pdf
Darshan Hiranandani [News About Next CEO].pdf
 
Kenya Coconut Production Presentation by Dr. Lalith Perera
Kenya Coconut Production Presentation by Dr. Lalith PereraKenya Coconut Production Presentation by Dr. Lalith Perera
Kenya Coconut Production Presentation by Dr. Lalith Perera
 
Chapter 9 PPT 4th edition.pdf internal audit
Chapter 9 PPT 4th edition.pdf internal auditChapter 9 PPT 4th edition.pdf internal audit
Chapter 9 PPT 4th edition.pdf internal audit
 
TriStar Gold Corporate Presentation - April 2024
TriStar Gold Corporate Presentation - April 2024TriStar Gold Corporate Presentation - April 2024
TriStar Gold Corporate Presentation - April 2024
 
8447779800, Low rate Call girls in Shivaji Enclave Delhi NCR
8447779800, Low rate Call girls in Shivaji Enclave Delhi NCR8447779800, Low rate Call girls in Shivaji Enclave Delhi NCR
8447779800, Low rate Call girls in Shivaji Enclave Delhi NCR
 
Flow Your Strategy at Flight Levels Day 2024
Flow Your Strategy at Flight Levels Day 2024Flow Your Strategy at Flight Levels Day 2024
Flow Your Strategy at Flight Levels Day 2024
 
International Business Environments and Operations 16th Global Edition test b...
International Business Environments and Operations 16th Global Edition test b...International Business Environments and Operations 16th Global Edition test b...
International Business Environments and Operations 16th Global Edition test b...
 
8447779800, Low rate Call girls in Saket Delhi NCR
8447779800, Low rate Call girls in Saket Delhi NCR8447779800, Low rate Call girls in Saket Delhi NCR
8447779800, Low rate Call girls in Saket Delhi NCR
 
Memorándum de Entendimiento (MoU) entre Codelco y SQM
Memorándum de Entendimiento (MoU) entre Codelco y SQMMemorándum de Entendimiento (MoU) entre Codelco y SQM
Memorándum de Entendimiento (MoU) entre Codelco y SQM
 
1911 Gold Corporate Presentation Apr 2024.pdf
1911 Gold Corporate Presentation Apr 2024.pdf1911 Gold Corporate Presentation Apr 2024.pdf
1911 Gold Corporate Presentation Apr 2024.pdf
 
Financial-Statement-Analysis-of-Coca-cola-Company.pptx
Financial-Statement-Analysis-of-Coca-cola-Company.pptxFinancial-Statement-Analysis-of-Coca-cola-Company.pptx
Financial-Statement-Analysis-of-Coca-cola-Company.pptx
 
No-1 Call Girls In Goa 93193 VIP 73153 Escort service In North Goa Panaji, Ca...
No-1 Call Girls In Goa 93193 VIP 73153 Escort service In North Goa Panaji, Ca...No-1 Call Girls In Goa 93193 VIP 73153 Escort service In North Goa Panaji, Ca...
No-1 Call Girls In Goa 93193 VIP 73153 Escort service In North Goa Panaji, Ca...
 
Call Us 📲8800102216📞 Call Girls In DLF City Gurgaon
Call Us 📲8800102216📞 Call Girls In DLF City GurgaonCall Us 📲8800102216📞 Call Girls In DLF City Gurgaon
Call Us 📲8800102216📞 Call Girls In DLF City Gurgaon
 

Ndp Slides

  • 1. Nested data parallelism in Haskell Simon Peyton Jones (Microsoft) Manuel Chakravarty, Gabriele Keller, Roman Leshchinskiy (University of New South Wales) Feb 2007
  • 2. Road map. Multicore makes parallel programming essential. Two approaches:
    Task parallelism – explicit threads, synchronised via locks, messages, or STM; modest parallelism; hard to program.
    Data parallelism – operate simultaneously on bulk data; massive parallelism; easy to program (a single flow of control, implicit synchronisation).
  • 3. Data parallelism: the key to using multicores.
    Flat data parallel – apply a sequential operation to bulk data. The brand leader; limited applicability (dense matrix, map/reduce); well developed; limited new opportunities.
    Nested data parallel – apply a parallel operation to bulk data. Developed in the 90’s; much wider applicability (sparse matrix, graph algorithms, games etc); practically undeveloped; a huge opportunity.
  • 4. Flat data parallel (e.g. Fortran(s), *C, MPI, map/reduce). The brand leader: widely used, well understood, well supported.
      foreach i in 1..N { ...do something to A[i]... }
    BUT: “something” is sequential, and there is a single point of concurrency. Easy to implement: use “chunking” (1,000,000’s of small work items divided among processors P1, P2, P3, ...). Good cost model.
  • 5. Nested data parallel. Main idea: allow “something” to be parallel.
      foreach i in 1..N { ...do something to A[i]... }
    Now the parallelism structure is recursive and un-balanced. Still a good cost model, and still 1,000,000’s of (small) work items – but hard to implement!
  • 6. Nested DP is great for programmers. Fundamentally more modular, and it opens up a much wider range of applications: – Sparse arrays, variable-grid adaptive methods (e.g. Barnes-Hut) – Divide-and-conquer algorithms (e.g. sort) – Graph algorithms (e.g. shortest path, spanning trees) – Physics engines for games, computational graphics (e.g. Delaunay triangulation) – Machine learning, optimisation, constraint solving
  • 7. Nested DP is tough for compilers ...because the concurrency tree is both irregular and fine-grained. But it can be done! NESL (Blelloch 1995) is an existence proof. Key idea: the “flattening” transformation, which turns nested DP into flat DP. Great work, surprisingly ignored since the 1990’s; ripe for exploitation.
  • 8. Array comprehensions. [:Float:] is the type of parallel arrays of Float.
      vecMul :: [:Float:] -> [:Float:] -> Float
      vecMul v1 v2 = sumP [: f1*f2 | f1 <- v1 | f2 <- v2 :]
    An array comprehension: “the array of all f1*f2 where f1 is drawn from v1 and f2 from v2”. sumP :: [:Float:] -> Float. Operations over parallel arrays are computed in parallel; that is the only way the programmer says “do parallel stuff”. NB: no locks!
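    Parallel arrays exist only in GHC's DPH extension, but the meaning of the comprehension can be sketched in plain Haskell, with ordinary lists standing in for [:Float:] (vecMul' is an illustrative sequential model, not the parallel implementation):

    ```haskell
    -- Sequential model of vecMul: ordinary lists stand in for the
    -- parallel arrays [:Float:]; sum and zip stand in for sumP and
    -- the lockstep pairing of the two generators.
    vecMul' :: [Float] -> [Float] -> Float
    vecMul' v1 v2 = sum [ f1 * f2 | (f1, f2) <- zip v1 v2 ]
    ```

    For example, vecMul' [1,2,3] [4,5,6] evaluates to 32.0.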
  • 9. Sparse vector multiplication. A sparse vector is represented as a vector of (index,value) pairs.
      svMul :: [:(Int,Float):] -> [:Float:] -> Float
      svMul sv v = sumP [: f*(v!i) | (i,f) <- sv :]
    v!i gets the i’th element of v. Parallelism is proportional to the length of the sparse vector.
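    The same list-based modelling works for svMul, with (!!) playing the role of the parallel indexing operator (again a sequential sketch, not DPH itself):

    ```haskell
    -- Sequential model of svMul: a sparse vector as a list of
    -- (index, value) pairs; v !! i stands in for v!i.
    svMul' :: [(Int, Float)] -> [Float] -> Float
    svMul' sv v = sum [ f * (v !! i) | (i, f) <- sv ]
    ```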
  • 10. Sparse matrix multiplication. A sparse matrix is a vector of sparse vectors.
      smMul :: [:[:(Int,Float):]:] -> [:Float:] -> Float
      smMul sm v = sumP [: svMul sv v | sv <- sm :]
    Nested data parallelism here! We are calling a parallel operation, svMul, on every element of a parallel array, sm.
  • 11. Hard to implement well: • Evenly chunking at the top level might be ill-balanced • The top level alone might not be very parallel
  • 12. The flattening transformation. • Concatenate sub-arrays into one big, flat array • Operate in parallel on the big array • A segment vector keeps track of where the sub-arrays are ...etc • Lots of tricksy book-keeping! • Possible to do by hand (and done in practice), but very hard to get right • Blelloch showed it could be done systematically
  • 13. Data-parallel quicksort (parallel filters; 2-way nested data parallelism here!)
      sort :: [:Float:] -> [:Float:]
      sort a = if (lengthP a <= 1) then a
               else sa!0 +++ eq +++ sa!1
        where
          m  = a!0
          lt = [: f | f <- a, f <  m :]
          eq = [: f | f <- a, f == m :]
          gr = [: f | f <- a, f >  m :]
          sa = [: sort xs | xs <- [:lt, gr:] :]
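    A sequential list version makes the structure easy to check (sort' is an illustrative stand-in; in DPH the three filters and the two recursive sorts inside sa would each run in parallel):

    ```haskell
    -- Sequential analogue of the data-parallel quicksort, with
    -- ordinary lists standing in for parallel arrays.
    sort' :: [Float] -> [Float]
    sort' a
      | length a <= 1 = a
      | otherwise     = (sa !! 0) ++ eq ++ (sa !! 1)
      where
        m  = head a
        lt = [ f | f <- a, f <  m ]
        eq = [ f | f <- a, f == m ]
        gr = [ f | f <- a, f >  m ]
        sa = [ sort' xs | xs <- [lt, gr] ]  -- both sub-sorts, together
    ```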
  • 14. How it works: at each step the sorts split into sub-sorts (sort → sort sort → ...etc). • All sub-sorts at the same level are done in parallel • Segment vectors track which chunk belongs to which sub-problem • Instant insanity when done by hand
  • 15. Fusion. Flattening is not enough.
      vecMul :: [:Float:] -> [:Float:] -> Float
      vecMul v1 v2 = sumP [: f1*f2 | f1 <- v1 | f2 <- v2 :]
    Do not (1) generate [: f1*f2 | f1 <- v1 | f2 <- v2 :] (a big intermediate vector) and then (2) add up its elements. Instead, multiply and add in the same loop – that is, fuse the multiply loop with the add loop. Very general, aggressive fusion is required.
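    What the fused code looks like can be sketched by hand: one pass that multiplies and accumulates, with no intermediate vector (vecMulFused is illustrative; the point of the compiler is to derive this form automatically):

    ```haskell
    -- Hand-fused vecMul: multiply and add in the same loop, so no
    -- intermediate list of products is ever built.
    vecMulFused :: [Float] -> [Float] -> Float
    vecMulFused = go 0
      where
        go acc (x:xs) (y:ys) = let acc' = acc + x * y
                               in acc' `seq` go acc' xs ys
        go acc _      _      = acc
    ```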
  • 16. Purity pays off. Two key transformations: – Flattening – Fusion. Both depend utterly on purely-functional semantics: – no assignments – every operation is a pure function. The data-parallel languages of the future will be functional languages.
  • 17. What we are doing about it: a substantial improvement on NESL.
    NESL: a mega-breakthrough in expressiveness, but – specialised, prototype – first order – few data types – no fusion – interpreted.
    Haskell: – broad-spectrum, widely used – higher order – very rich data types – aggressive fusion – compiled.
    Performance: shared memory initially, distributed memory eventually. GPUs anyone?
  • 18. And it goes fast too... The 1-processor version goes only 30% slower than C, and there is a performance win with 2 processors. (Pinch of salt.)
  • 19. Summary so far  Data parallelism is the only way to harness 100’s of cores  Nested DP is great for programmers: far, far more flexible than flat DP  Nested DP is tough to implement  But we (think we) know how to do it  Functional programming is a massive win in this space  Haskell prototype in 2007 http://haskell.org/haskellwiki/GHC/Data_Parallel_Haskell
  • 21. Three forms of concurrency.
    Explicit threads – non-deterministic by design; monadic: forkIO and STM.
      main :: IO ()
      main = do { ch <- newChan
                ; forkIO (ioManager ch)
                ; forkIO (worker 1 ch)
                ... etc ... }
    Semi-implicit – deterministic; pure: par and seq.
      f :: Int -> Int
      f x = a `par` b `seq` a + b
        where a = f (x-1)
              b = f (x-2)
    Data parallel – deterministic; pure: parallel arrays. Shared memory initially; distributed memory eventually; possibly even GPUs.
  • 22. Main contribution: an optimising data-parallel compiler implemented by modest enhancements to a full-scale functional language implementation. Key pieces of technology: 1. Vectorisation – specific to parallel arrays 2. Non-parametric data representations – a generically useful new feature in GHC 3. Chunking – divide up the work evenly between processors 4. Aggressive fusion – uses “rewrite rules”, an old feature of GHC. Not a special-purpose data-parallel compiler! Most support is either useful for other things, or is in the form of library code.
  • 23. Step 0: desugaring. The comprehension
      svMul :: [:(Int,Float):] -> [:Float:] -> Float
      svMul sv v = sumP [: f*(v!i) | (i,f) <- sv :]
    desugars to
      svMul sv v = sumP (mapP (\(i,f) -> f * (v!i)) sv)
    where sumP :: Num a => [:a:] -> a and mapP :: (a -> b) -> [:a:] -> [:b:].
  • 24. Step 1: Vectorisation.
      svMul :: [:(Int,Float):] -> [:Float:] -> Float
      svMul sv v = sumP (mapP (\(i,f) -> f * (v!i)) sv)
    vectorises to
      svMul sv v = sumP (snd^ sv *^ bpermuteP v (fst^ sv))
    where
      (*^)      :: Num a => [:a:] -> [:a:] -> [:a:]
      fst^      :: [:(a,b):] -> [:a:]
      snd^      :: [:(a,b):] -> [:b:]
      bpermuteP :: [:a:] -> [:Int:] -> [:a:]
    The scalar operation * is replaced by the vector operation *^.
  • 25. Vectorisation: the basic idea. mapP f v becomes f^ v, where f :: T1 -> T2 and f^ :: [:T1:] -> [:T2:] (f^ = mapP f). For every function f, generate its vectorised version, namely f^. Result: a functional program, operating over flat arrays, with a fixed set of primitive operations *^, sumP, fst^, etc. Lots of intermediate arrays!
  • 26. Vectorisation: the basic idea.
      f :: Int -> Int
      f x = x+1
    transforms to
      f^ :: [:Int:] -> [:Int:]
      f^ x = x +^ (replicateP (lengthP x) 1)
    Locals x transform to x; globals g to g^; constants k to replicateP (lengthP x) k. Here replicateP :: Int -> a -> [:a:] and lengthP :: [:a:] -> Int.
  • 27. Vectorisation: the key insight f :: [:Int:] -> [:Int:] f a = mapP g a = g^ a f^ :: [:[:Int:]:] -> [:[:Int:]:] f^ a = g^^ a --??? Yet another version of g???
  • 28. Vectorisation: the key insight. First concatenate, then map, then re-split:
      f :: [:Int:] -> [:Int:]
      f a = mapP g a = g^ a
      f^ :: [:[:Int:]:] -> [:[:Int:]:]
      f^ a = segmentP a (g^ (concatP a))
    with concatP :: [:[:a:]:] -> [:a:] (nested data to flat data) and segmentP :: [:[:a:]:] -> [:b:] -> [:[:b:]:] (shape plus flat data back to nested). Payoff: f and f^ are enough. No f^^.
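    The concatenate-map-split trick can be checked with a plain-Haskell model (concatP', segmentP', and liftedMap are illustrative list stand-ins for the DPH primitives):

    ```haskell
    -- List model of the key insight: map over a nested array by
    -- flattening, mapping once over the flat data, and re-splitting
    -- using the lengths of the original sub-arrays.
    concatP' :: [[a]] -> [a]
    concatP' = concat

    segmentP' :: [[a]] -> [b] -> [[b]]
    segmentP' shape = go (map length shape)
      where
        go []     _  = []
        go (n:ns) ys = take n ys : go ns (drop n ys)

    -- The lifted map, needing only g itself: no g^^ required.
    liftedMap :: (a -> b) -> [[a]] -> [[b]]
    liftedMap g a = segmentP' a (map g (concatP' a))
    ```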
  • 29. Step 2: Representing arrays [:Double:] Arrays of pointers to boxed numbers are Much Too Slow [:(a,b):] Arrays of pointers to pairs are Much Too Slow ... Idea! Representation of an array depends on the element type
  • 30. Step 2: Representing arrays [POPL05], [ICFP05], [TLDI07].
      data family [:a:]
      data instance [:Double:] = AD ByteArray
      data instance [:(a,b):]  = AP [:a:] [:b:]
    Now *^ is a fast loop, and fst^ is constant time!
      fst^ :: [:(a,b):] -> [:a:]
      fst^ (AP as bs) = as
  • 31. Step 2: Nested arrays. A nested array is represented as a shape vector plus flat data:
      data instance [:[:a:]:] = AN [:Int:] [:a:]
      concatP :: [:[:a:]:] -> [:a:]
      concatP (AN shape elts) = elts
      segmentP :: [:[:a:]:] -> [:b:] -> [:[:b:]:]
      segmentP (AN shape _) elts = AN shape elts
    Surprise: concatP and segmentP are constant time!
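    The shape-plus-flat-data representation can be modelled directly in plain Haskell (Nested, fromLists, concatS, segmentS are illustrative names, not GHC's API):

    ```haskell
    -- A nested array as a vector of segment lengths plus flat data.
    data Nested a = AN [Int] [a] deriving (Eq, Show)

    fromLists :: [[a]] -> Nested a
    fromLists xss = AN (map length xss) (concat xss)

    concatS :: Nested a -> [a]
    concatS (AN _ elts) = elts                  -- O(1): drop the shape

    segmentS :: Nested a -> [b] -> Nested b
    segmentS (AN shape _) elts = AN shape elts  -- O(1): reuse the shape
    ```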
  • 32. Higher order complications.
      f   :: T1 -> T2 -> T3
      f1^ :: [:T1:] -> [:T2:] -> [:T3:]   -- f1^ = zipWithP f
      f2^ :: [:T1:] -> [:(T2 -> T3):]     -- f2^ = mapP f
    f1^ is good for [: f a b | a <- as | b <- bs :], but the type transformation is not uniform, and sooner or later we want higher-order functions anyway. f2^ forces us to find a representation for [:(T2->T3):]: closure conversion [PAPP06].
  • 33. Step 3: chunking.
      svMul :: [:(Int,Float):] -> [:Float:] -> Float
      svMul (AP is fs) v = sumP (fs *^ bpermuteP v is)
    The program now consists of flat arrays and primitive operations over them (*^, sumP etc). We could directly execute this (NESL does): hand-code assembler for the primitive ops, since all the time is spent there anyway. But there are intermediate arrays, hence memory traffic, and each intermediate array is a synchronisation point. Idea: chunking and fusion.
  • 34. Step 3: Chunking.
      svMul :: [:(Int,Float):] -> [:Float:] -> Float
      svMul (AP is fs) v = sumP (fs *^ bpermuteP v is)
    1. Chunking: divide is, fs into chunks, one chunk per processor. 2. Fusion: execute sumP (fs *^ bpermuteP v is) in a tight, sequential loop on each processor. 3. Combining: add up the results of each chunk. Step 2 alone is not good for a parallel machine!
  • 35. Expressing chunking.
      sumP :: [:Float:] -> Float
      sumP xs = sumD (mapD sumS (splitD xs))
      splitD :: [:a:] -> Dist [:a:]
      mapD   :: (a->b) -> Dist a -> Dist b
      sumD   :: Dist Float -> Float
      sumS   :: [:Float:] -> Float   -- Sequential!
    sumS is a tight sequential loop. mapD is the true source of parallelism: it starts a “gang”, runs it, and waits for all gang members to finish.
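    The Dist layer can be modelled sequentially with a list of chunks (numProcs, splitD', and sumChunked are illustrative; the real mapD runs the chunks on a gang of threads):

    ```haskell
    -- Sequential model of chunking: splitD' divides the array into
    -- one chunk per (pretend) processor; map models mapD; the outer
    -- sum models sumD combining the per-chunk results.
    numProcs :: Int
    numProcs = 4

    splitD' :: [a] -> [[a]]
    splitD' xs = go xs
      where
        n = max 1 ((length xs + numProcs - 1) `div` numProcs)
        go [] = []
        go ys = take n ys : go (drop n ys)

    sumChunked :: [Float] -> Float
    sumChunked = sum . map sum . splitD'   -- sumD . mapD sumS . splitD
    ```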
  • 36. Expressing chunking.
      (*^) :: [:Float:] -> [:Float:] -> [:Float:]
      xs *^ ys = joinD (mapD mulS (zipD (splitD xs) (splitD ys)))
      splitD :: [:a:] -> Dist [:a:]
      joinD  :: Dist [:a:] -> [:a:]
      mapD   :: (a->b) -> Dist a -> Dist b
      zipD   :: Dist a -> Dist b -> Dist (a,b)
      mulS   :: ([:Float:], [:Float:]) -> [:Float:]
    Again, mulS is a tight, sequential loop.
  • 37. Step 4: Fusion.
      svMul :: [:(Int,Float):] -> [:Float:] -> Float
      svMul (AP is fs) v = sumP (fs *^ bpermuteP v is)
                         = sumD . mapD sumS . splitD . joinD . mapD mulS $ zipD (splitD fs) (splitD (bpermuteP v is))
    Aha! Now use rewrite rules:
      {-# RULES
        splitD (joinD x)  = x
        mapD f (mapD g x) = mapD (f.g) x
      #-}
    giving
      svMul (AP is fs) v = sumD . mapD (sumS . mulS) $ zipD (splitD fs) (splitD (bpermuteP v is))
  • 38. Step 4: Sequential fusion. We are left with
      svMul (AP is fs) v = sumD . mapD (sumS . mulS) $ zipD (splitD fs) (splitD (bpermuteP v is))
    Now we have a sequential fusion problem. Problem: there are lots and lots of functions over arrays, and we can’t have fusion rules for every pair. New idea: stream fusion.
  • 39. Stream fusion for lists. Consider map f (filter p (map g xs)). Problem: there are lots and lots of functions over lists, and they are recursive functions. New idea: make map, filter etc non-recursive, by defining them to work over streams.
  • 40. Stream fusion for lists.
      data Stream a where
        S :: (s -> Step s a) -> s -> Stream a
      data Step s a = Done | Yield a s

      toStream :: [a] -> Stream a      -- non-recursive!
      toStream as = S step as
        where step []     = Done
              step (a:as) = Yield a as

      fromStream :: Stream a -> [a]    -- recursive
      fromStream (S step s) = loop s
        where loop s = case step s of
                Yield a s' -> a : loop s'
                Done       -> []
  • 41. Stream fusion for lists.
      mapStream :: (a->b) -> Stream a -> Stream b   -- non-recursive!
      mapStream f (S step s) = S step' s
        where step' s = case step s of
                Done       -> Done
                Yield a s' -> Yield (f a) s'

      map :: (a->b) -> [a] -> [b]
      map f xs = fromStream (mapStream f (toStream xs))
  • 42. Stream fusion for lists.
      map f (map g xs)
        = fromStream (mapStream f (toStream (fromStream (mapStream g (toStream xs)))))
        = -- Apply the rule: toStream (fromStream xs) = xs
          fromStream (mapStream f (mapStream g (toStream xs)))
        = -- Inline mapStream, toStream
          fromStream (S step xs)
            where step []     = Done
                  step (x:xs) = Yield (f (g x)) xs
  • 43. Stream fusion for lists.
      fromStream (S step xs)
        where step []     = Done
              step (x:xs) = Yield (f (g x)) xs
        = -- Inline fromStream
          loop xs
            where loop []     = []
                  loop (x:xs) = f (g x) : loop xs
    Key idea: mapStream, filterStream etc are all non-recursive, and can be inlined. Works for arrays too; change only fromStream and toStream.
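    Putting the pieces from slides 40–41 together gives a small runnable module (the GADTs extension is needed for the Stream type; smap is a stand-in name for the redefined map, to avoid clashing with Prelude.map):

    ```haskell
    {-# LANGUAGE GADTs #-}

    -- Stream fusion for lists, assembled from the slides: toStream
    -- and mapStream are non-recursive, so the compiler can inline
    -- them; only fromStream actually loops.
    data Stream a where
      S :: (s -> Step s a) -> s -> Stream a

    data Step s a = Done | Yield a s

    toStream :: [a] -> Stream a          -- non-recursive
    toStream as = S step as
      where step []     = Done
            step (x:xs) = Yield x xs

    fromStream :: Stream a -> [a]        -- the one recursive loop
    fromStream (S step s0) = loop s0
      where loop s = case step s of
              Yield a s' -> a : loop s'
              Done       -> []

    mapStream :: (a -> b) -> Stream a -> Stream b   -- non-recursive
    mapStream f (S step s) = S step' s
      where step' t = case step t of
              Done       -> Done
              Yield a t' -> Yield (f a) t'

    smap :: (a -> b) -> [a] -> [b]
    smap f = fromStream . mapStream f . toStream
    ```

    Composing two smaps builds only one intermediate Stream value, which inlining then eliminates entirely.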
  • 44. Summary  Nested data parallelism is the Way of the Future.  Key ideas: vectorisation, array representation, chunking, fusion  Lots more to say about all of these; each worth several papers  All have to work together to achieve fast data parallel computation  But the prize is well worth having  And only functional programmers can get near it http://haskell.org/haskellwiki/GHC/Data_Parallel_Haskell