ACCELERATING MACHINE LEARNING ALGORITHMS BY INTEGRATING
                 GPUS INTO MAPREDUCE CLUSTERS


                       Sergio Herrero-Lopez
          Intelligent Engineering Systems Laboratory (IESL)


                        November 30, 2011




1                                    Accelerating ML algorithms by integrating GPUs in MR Clusters
INTRODUCTION



    ABOUT ME:

       Ph.D. (December 2011), Massachusetts Institute of Technology (USA)
       M.Sc. (2007) and B.Sc. (2005) in Electrical Engineering, University of Navarra (Spain)
       Microsoft Research (Redmond WA, 2008), Tampere University of Technology (Finland, 2005), and IKUSI (Spain, 2003)




    ABOUT PROF. WILLIAMS' RESEARCH GROUP (ENGINEERING SYSTEMS DIVISION):

                High Performance Price Analytics for the Smart Grid (2008-2009)
                Large-Scale Simulator for Global Data Infrastructure Optimization (2009-2011)
                Music Event Detection from Tweets in New York (2010-2011)
                Accelerating Machine Learning Algorithms by integrating GPUs into
                 MapReduce Clusters




AGENDA




    o    PROBLEM STATEMENT: Big Data & Need for scale and/or speed
    o    PROPOSITION: Modify MapReduce runtime to
           o Satisfy the particular requirements of ML algorithms
           o Integrate Massively Parallel Processors in the system
    o    PREVIOUS WORK: MapReduce for ML on Multicore / Single-GPU / Multi-GPU / GPU-Cluster / FPGA
    o    IMPLEMENTATION of new MR runtime using Port abstractions
    o    PERFORMANCE results running SVMs on the proposed system
    o    CONCLUSIONS: Contributions and Limitations. Lessons learned
    o    FUTURE WORK




MACHINE LEARNING PARALLELIZATION


    Problem setup:

    $\{(x_i, y_i)\},\ i = 1 \dots n,\quad x_i \in \mathbb{R}^d,\quad y_i \in Y = \{1 \dots k\}$

    Levers: n (representative sample), d (feature selection), k (consolidate classes).
    Symptoms: 1. Does not fit in resources  2. Takes too long  3. Accuracy was sacrificed

    Three levels of parallelization for Algorithm 1:

          L1   Independent runs of the algorithm on separate workers      (Cluster)
          L2   Summation form: the algorithm split across workers         (MapReduce)
          L3   Structural parallelism inside the algorithm                (MPPs)

                 Machine Learning Algorithms decomposable into MR primitives
            Naïve Bayes                                   Expectation Maximization
            K-means                                       Support Vector Machine Classification
            Neural Network                                Hidden Markov Models
            Principal Component Analysis

MAPREDUCE PRIMITIVES & RUNTIME
                                                                        Input


    $M:\ [k_1, v_1] \rightarrow [k_2, v_2]$

    $R:\ \left[k_2,\ \{v_{2,i}\}_{k = k_2}\right] \rightarrow v_3$

    Pipeline: Input → Split → Map (workers 1…M) → Sort → Reduce (workers 1…N) → Merge → Output
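The two primitives can be exercised end to end with a toy word count. This single-process sketch mirrors the Split → Map → Sort → Reduce → Merge pipeline; the names `map_fn`, `reduce_fn`, and `mapreduce` are illustrative, not the runtime's API:

```python
from collections import defaultdict

def map_fn(_, line):            # M: [k1, v1] -> [(k2, v2), ...]
    return [(w, 1) for w in line.split()]

def reduce_fn(key, values):     # R: [k2, {v2}] -> v3
    return (key, sum(values))

def mapreduce(records, map_fn, reduce_fn):
    groups = defaultdict(list)
    for k1, v1 in records:                # Map phase
        for k2, v2 in map_fn(k1, v1):
            groups[k2].append(v2)         # Sort/shuffle: group values by k2
    return dict(reduce_fn(k, vs) for k, vs in groups.items())  # Reduce phase

counts = mapreduce(enumerate(["a b a", "b c"]), map_fn, reduce_fn)
# counts == {'a': 2, 'b': 2, 'c': 1}
```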

MAPREDUCE REPRESENTATION OF K-MEANS




    $M\left[k_i^t,\ x_i\right] \rightarrow \left[k_i'^t,\ x_i\right]$

    $k_i'^t = \left\{ x_j : \left\| x_j - m_i^t \right\| \le \left\| x_j - m_{i'}^t \right\| \ \forall i' = 1 \dots k \right\}$

    $R\left[k'^t,\ \{x_i\}_{k_i^t = k'^t}\right] \rightarrow m_{k'}^{t+1}$

    $m_{k'}^{t+1} = \frac{1}{\left|\{x_i\}_{k_i^t = k'^t}\right|} \sum_{x \in \{x_i\}_{k_i^t = k'^t}} x$



MAPREDUCE REPRESENTATION OF EM FOR MIXTURE OF GAUSSIANS

    $M\left[(i,k),\ x_i\right] \rightarrow \left[(i,k),\ p_{i,k}\right]$

    $p_{i,k} = \frac{\alpha_k^t\, f(x_i \mid \mu_k^t, \Sigma_k^t)}{\sum_{k'=1}^{K} \alpha_{k'}^t\, f(x_i \mid \mu_{k'}^t, \Sigma_{k'}^t)}$

    $R\left[k,\ \{p_{i,k'}\}_{k'=k}\right] \rightarrow \alpha_k^{t+1} = \frac{\sum_{i=1}^{n} p_{i,k}}{n}$

    $R\left[k,\ \{x_i, p_{i,k'}\}_{k'=k}\right] \rightarrow \mu_k^{t+1} = \frac{\sum_{i=1}^{n} x_i\, p_{i,k}}{n\, \alpha_k^{t+1}}$

    $R\left[k,\ \{x_i, p_{i,k'}\}_{k'=k}\right] \rightarrow \Sigma_k^{t+1} = \frac{\sum_{i=1}^{n} p_{i,k}\, (x_i - \mu_k^{t+1})(x_i - \mu_k^{t+1})^T}{n\, \alpha_k^{t+1}}$
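A minimal sketch of the E-step as Map and the first two M-step reductions as Reduce. To keep the density a one-liner it assumes isotropic Gaussians with scalar variances (the formulation above uses full covariances $\Sigma_k$); all function names are illustrative:

```python
import numpy as np

def gauss(x, mu, var):
    """Isotropic Gaussian density (simplifying assumption: scalar variance)."""
    d = len(x)
    return np.exp(-np.sum((x - mu) ** 2) / (2 * var)) / (2 * np.pi * var) ** (d / 2)

def em_map(x, alpha, mu, var):
    """M: [(i,k), x_i] -> [(i,k), p_ik]; responsibilities of each component."""
    w = np.array([alpha[k] * gauss(x, mu[k], var[k]) for k in range(len(alpha))])
    return w / w.sum()

def em_reduce(X, P):
    """R: per-component reductions for alpha and mu (Sigma update omitted)."""
    n, K = P.shape
    alpha = P.sum(axis=0) / n                 # alpha_k^{t+1}
    mu = (P.T @ X) / (n * alpha[:, None])     # mu_k^{t+1}
    return alpha, mu

X = np.array([[0.0], [0.2], [5.0], [5.2]])
alpha_t = np.array([0.5, 0.5])
mu_t = np.array([[0.0], [5.0]])
var_t = np.array([1.0, 1.0])
P = np.array([em_map(x, alpha_t, mu_t, var_t) for x in X])  # E-step as Map
alpha_next, mu_next = em_reduce(X, P)                       # M-step as Reduce
```

With two well-separated clusters the responsibilities are nearly hard assignments, so `alpha_next` stays close to [0.5, 0.5] and `mu_next` moves to the per-cluster means.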


MAPREDUCE REPRESENTATION OF SVM (SMO)

    $M[i, f_i] \rightarrow [i, f_i']$

    $f_i' = f_i + \Delta\alpha_{I_{up}}\, y_{I_{up}}\, k(x_{I_{up}}, x_i) + \Delta\alpha_{I_{low}}\, y_{I_{low}}\, k(x_{I_{low}}, x_i)$

    $M[i, \alpha_i] \rightarrow [i, k_i]$

    $I_0 = \{i : y_i \in \{1,-1\},\ 0 < \alpha_i < C\}$
    $I_1 = \{i : y_i = 1,\ \alpha_i = 0\} \cup \{i : y_i = -1,\ \alpha_i = C\}$
    $I_2 = \{i : y_i = 1,\ \alpha_i = C\} \cup \{i : y_i = -1,\ \alpha_i = 0\}$
    $k_{up} = \{i \in I_0 \cup I_1\},\quad k_{low} = \{i \in I_0 \cup I_2\},\quad k_i \in \{k_{up}, k_{low}\}$

    $R\left[k,\ \{f_i\}_{k_i = k}\right] \rightarrow (b, I)$

    $b_{up} = \min\{f_i : k_i = k_{up}\},\quad I_{up} = \arg\min_{k_i = k_{up}} f_i$
    $b_{low} = \max\{f_i : k_i = k_{low}\},\quad I_{low} = \arg\max_{k_i = k_{low}} f_i$

    $M[i, \alpha_i] \rightarrow [i, \alpha_i']$

    $\alpha'_{I_{up}} = \alpha_{I_{up}} - \frac{y_{I_{up}}\,(f_{I_{low}} - f_{I_{up}})}{2k(x_{I_{low}}, x_{I_{up}}) - k(x_{I_{low}}, x_{I_{low}}) - k(x_{I_{up}}, x_{I_{up}})}$

    $\alpha'_{I_{low}} = \alpha_{I_{low}} + y_{I_{low}}\, y_{I_{up}}\, (\alpha_{I_{up}} - \alpha'_{I_{up}})$
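The reduce that selects the extreme optimality indicators $(b_{up}, I_{up})$ and $(b_{low}, I_{low})$ can be sketched as follows (illustrative names; `kset` encodes each point's $k_{up}$/$k_{low}$ membership from the previous Map):

```python
import numpy as np

def smo_select_reduce(f, kset):
    """R: [k, {f_i}] -> (b, I). Pick the extreme optimality indicators."""
    up = np.flatnonzero(kset == 'up')      # indices with k_i = k_up
    low = np.flatnonzero(kset == 'low')    # indices with k_i = k_low
    I_up = up[np.argmin(f[up])]            # argmin of f_i over k_up
    I_low = low[np.argmax(f[low])]         # argmax of f_i over k_low
    return (f[I_up], I_up), (f[I_low], I_low)

f = np.array([0.3, -1.0, 2.0, 0.5])
kset = np.array(['up', 'up', 'low', 'low'])
(b_up, I_up), (b_low, I_low) = smo_select_reduce(f, kset)
# b_up == -1.0 at index 1; b_low == 2.0 at index 2
```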

MAPREDUCE FOR ML WISHLIST
    Static vs Variable data
     Static: largest, fixed, used in every iteration (e.g. $(x_i, y_i)$, held in the DFS)
     Variable: results of each iteration, consumed in the next iteration
        (e.g. $m_k$; $(\alpha_k^{t+1}, \mu_k^{t+1}, \Sigma_k^{t+1})$; $(f_i, \alpha_i)$)

    Iterate until convergence
     Avoid reloading static data between iterations
     Utilize the memory hierarchy (MEM) as opposed to DFS or LFS

    Massively Threaded MapReduce Tasks
     Map is embarrassingly parallel
     Reduce is highly parallelizable (CPU and MPP)

    Dimensionality & Algebra
     Map Tasks may encapsulate high-dimensional matrix-vector or matrix-matrix operations
     Interleave multithreaded BLAS operations using static data
     Sparse data structures

    Example: $k(x_i, x_j) = e^{-\beta \|x_i - x_j\|^2},\quad i = 1 \dots n,\ j \in \{I_{up}, I_{low}\}$
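The kernel columns for $j \in \{I_{up}, I_{low}\}$ can be computed in one vectorized pass over the static data, in the spirit of the interleaved BLAS operations above (a NumPy sketch; `rbf_columns` is an illustrative name):

```python
import numpy as np

def rbf_columns(X, idx, beta):
    """k(x_i, x_j) = exp(-beta * ||x_i - x_j||^2) for all i and j in idx.

    One vectorized pass over the static data X, instead of n independent
    scalar kernel evaluations per selected column."""
    D = X[:, None, :] - X[None, idx, :]             # (n, |idx|, d) differences
    return np.exp(-beta * np.sum(D * D, axis=2))    # (n, |idx|) kernel columns

X = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 2.0]])
K = rbf_columns(X, [0, 1], beta=1.0)   # columns for j in {I_up, I_low}
# K[0, 0] == 1.0; K[1, 0] == exp(-1); K[2, 1] == exp(-5)
```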

COMPUTING ECOSYSTEM

    COMMODITY COMPUTING: Relational DB, BigTable, Cassandra, Dynamo, Hadoop, Dryad, GPU, 1/10 Gb Ethernet

    HIGH PERFORMANCE / SUPERCOMPUTING: Infiniband, OpenMPI, GPU

    DATA APPLIANCE / WAREHOUSE COMPUTING: FPGA, Column DB, Hadoop, SSD, 20 Gb Infiniband

MAPREDUCE CLUSTER: ARCHITECTURE


    [Diagram: a Client issues File operations to the NameNode (DFS) and Job submissions to the JobTracker (MRF); DataNodes 1-3 each hold DFS Blocks and run Tasks under an MRF TaskTracker.]

    1) Distributed File System
    -  Unstructured data
    -  Scales to thousands of nodes
    -  High reliability through replication

    2) MapReduce Framework Runtime
    -  Batch processing system
    -  Load balancing

MAPREDUCE CLUSTER: LIMITATIONS

    [Diagram: two DataNodes, each with an MRF TaskTracker, a DFS Block, a CPU running Map and Reduce Tasks, and a local HD.]

    One (or two) tasks per node

    One Task ↔ One Data Block; One Core ↔ One Thread

    Synchronization by materialization of intermediate results

    No support for iterative jobs

MASSIVELY PARALLEL PROCESSORS: NVIDIA TESLA ARCHITECTURE
    [Diagram: Host and Device. The device contains N Stream Multiprocessors; each holds M scalar processors (SP 1 … SP M) with per-processor Registers, a Shared Memory, an Instruction Unit, a Constant Cache, and a Texture Cache, backed by Constant Memory, Texture Memory, and Device Memory.]

    Approximate access latencies:
    -  Registers: 0 cycles
    -  Shared memory: 1 cycle coalesced, ~10 cycles uncoalesced
    -  Constant/Texture cache: ~10 cycles on a cache hit
    -  Device memory: ~400 cycles, 102 GB/s
    -  Host memory ↔ Device memory: PCI-E 16x (8 GB/s)

NVIDIA TESLA: REPRESENTATIONS
                Logical Representation               Physical Representation

                       Thread                                 Processor
                       Block                                  MultiProcessor
                       Grid                                   Device

    Maximum block dimensions: (512, 512, 64), but at most 512 threads per block.
    Maximum grid dimensions: (65535, 65535).

    [Diagram residue: each multiprocessor pairs Registers and Shared Memory with its processors, plus Constant and Texture caches.]

PROPOSED RUNTIME: MR + GPU
    [Diagram: per-node pipeline from DFS Blocks through the MRF Task Tracker, with host memory/state (HMem/HState) and device memory/state (DMem/DState) around each GPU stage.]

    Split
    Pre-Map         H→D transfers, BLAS
    Map             (GPU)
    Post-Map        D→H transfers
    Sort            (cross-node)
    Pre-Reduce      H→D transfers, BLAS
    Local Reduce    (GPU)
    Post-Reduce     D→H transfers
    Global Reduce   (cross-node)

    Host and device memory and state stay resident across iterations; a State Snapshot is written back to the DFS every x iterations.
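The pipeline can be condensed into an illustrative driver loop. This is a sketch of the control flow only, not the runtime's API: `to_device`/`from_device` are identity stand-ins for the H→D and D→H transfer steps, and the map/reduce callables stand in for the GPU stages.

```python
# Identity stand-ins for the H->D / D->H transfer steps of the real runtime.
to_device = from_device = lambda x: x

def run_job(static_blocks, state, map_fn, local_reduce, global_reduce,
            converged, snapshots, snapshot_every=10):
    d_static = to_device(static_blocks)               # load static data once, pre-loop
    it = 0
    while not converged(state):
        d_state = to_device(state)                    # Pre-Map: variable data only
        inter = [map_fn(b, d_state) for b in d_static]  # Map (GPU in the real runtime)
        partial = local_reduce(inter)                 # Local Reduce (GPU)
        state = global_reduce(from_device(partial))   # Post-Reduce + cross-node
        it += 1
        if it % snapshot_every == 0:
            snapshots.append((it, state))             # relaxed fault tolerance
    return state

# Toy job: iteratively re-estimate the mean of the static blocks.
blocks = [[1.0, 2.0], [3.0, 6.0]]
snaps = []
mean = run_job(
    blocks, state=0.0,
    map_fn=lambda b, s: (sum(b), len(b)),
    local_reduce=lambda parts: (sum(s for s, _ in parts), sum(c for _, c in parts)),
    global_reduce=lambda p: p[0] / p[1],
    converged=lambda s: s == 3.0,
    snapshots=snaps)
# mean == 3.0
```

The point of the structure is that `d_static` is transferred once and reused every iteration, while only the small variable state crosses the host/device boundary per round.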

PROPOSED RUNTIME: MR + GPU
    The same node layout (DFS Blocks, MRF Task Tracker, HMem/HState, DMem/DState, GPUs, cross-node exchange) provides:

    Multiple tasks per node
    Multithreaded MR Tasks
    Interleave Multithreaded BLAS
    Local/Global Reduction
    Static/Variable Data
    Long-running Iterative Jobs
    Stateful Nodes
    Shared-Memory
    Fault-Tolerance Relaxation

PREVIOUS WORK


          MAPREDUCE ON SINGLE GPU/ SINGLE FPGA                                            Interleave Multithreaded BLAS
          •Mars (He et al. PACT 2008)
          •NVIDIA (Catanzaro et al. STMCS 2008)
          •Cell (de Kruijf and Sankaralingam IBM Journal R&D 2009)
                                                                                         Massively Multithreaded MR Tasks



          MAPREDUCE ON MULTICORE                                                                    Shared-Memory
          •Phoenix (Ranger et al. HPCA 2007)
          •Phoenix 2 (Yoo et al. IISWC 2009)
          •Phoenix ++ (Talbot et al. MAPREDUCE 2011)
                                                                                             Fault-Tolerance Relaxation



          MAPREDUCE ON MULTI-GPU/GPU CLUSTERS                                               Intermediate data in-memory
          •CellMR (Rafique et al. IPDPS 2009)
          •GPMR (Stuart and Owens IPDPS 2011)
                                                                                               Local/Global Reduction

          MAPREDUCE FOR MACHINE LEARNING
          •Mahout (Apache)                                                                 Long running (iterative) Tasks
          •Multicore (Chu et al. NIPS 2006)
          •FPGA (Xu NIPS 2009)
          •Twister (Ekanayake et al. MAPREDUCE 2010)
          •SystemML (Ghoting et al. ICDE 2011)                                                 Static vs Variable Data



PORT-BASED PROGRAMMING: ABSTRACTION
    [Diagram: a Message is posted to a Port; an Arbiter and Dispatcher pull messages off the Dispatcher Queue and invoke Handlers, which run Tasks against shared State.]

    Receiver types:
     Single Item Receiver
     Multiple Item Receiver
     Join Receiver
     Choice Receiver
     Teardown / Concurrent / Exclusive
     Scatter / Gather
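A minimal stdlib sketch of the Port/Arbiter/Handler abstraction. The real runtime's dispatcher pool and receiver variants are reduced here to a single-threaded `run_once` with single-item-receiver semantics; class and method names follow the slide but are illustrative:

```python
import queue

class Port:
    """A message queue that handlers can be attached to."""
    def __init__(self):
        self.q = queue.Queue()
    def post(self, msg):
        self.q.put(msg)

class Arbiter:
    """Dispatches each message on a port to its registered handler."""
    def __init__(self):
        self.bindings = []               # (port, handler) pairs
    def receive(self, port, handler):
        self.bindings.append((port, handler))
    def run_once(self):
        for port, handler in self.bindings:
            while not port.q.empty():
                handler(port.q.get())    # single-item receiver semantics

results = []
p = Port()
a = Arbiter()
a.receive(p, lambda msg: results.append(msg * 2))
for i in range(3):
    p.post(i)
a.run_once()
# results == [0, 2, 4]
```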




SCATTER-GATHER USING GPU-PORTS

[Figure: the MRF master thread (C#/Java) receives (Task, Block, Response Port) triples from the Task Tracker and scatters them through a GPU-Port — an Arbiter feeding a Dispatcher Queue whose Handlers invoke CPU handler code and CUDA 3.2 kernels (C++) against the host state (HState) and device state (DState); results are gathered back through the response port]
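The port abstraction above can be sketched in plain Python: a master thread scatters tasks into a port's dispatcher queue, a pool of handler threads consumes them, and results are gathered on a response port. This is a minimal illustrative sketch, not the runtime's actual API; all class and method names are hypothetical.

```python
import queue
import threading

class Port:
    """Hypothetical sketch: a port couples a dispatcher queue with handlers."""
    def __init__(self, handler, n_handlers=4):
        self.dispatch_queue = queue.Queue()   # tasks scattered by the master
        self.response_port = queue.Queue()    # results gathered by the master
        self.threads = [
            threading.Thread(target=self._run, args=(handler,), daemon=True)
            for _ in range(n_handlers)
        ]
        for t in self.threads:
            t.start()

    def _run(self, handler):
        while True:
            task = self.dispatch_queue.get()
            if task is None:                  # poison pill -> stop this handler
                break
            self.response_port.put(handler(task))

    def scatter(self, tasks):
        for task in tasks:
            self.dispatch_queue.put(task)

    def gather(self, n):
        return [self.response_port.get() for _ in range(n)]

# Scatter 8 data blocks, square each in a handler, gather the results.
port = Port(handler=lambda block: block * block, n_handlers=2)
port.scatter(range(8))
results = sorted(port.gather(8))
print(results)  # [0, 1, 4, 9, 16, 25, 36, 49]
```

Gathering blocks until all eight responses arrive, so the master thread naturally synchronizes on the completion of the scattered work.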
H-DISPATCH ALTERNATIVE

[Figure: the MRF master thread scatters (Task, Block, Response Port) triples into an H-Dispatch queue, from which worker threads pull work; results are gathered back through the response port]

+ Load balancing for non-uniform workloads
+ Local variable reuse; avoids the GC blocking threads
+ Runs problems with HState > sum(DMem) (host state larger than total device memory)
−  Detaches state from port: requires DState load/unload
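The pull-based idea can be sketched as follows: workers request a block only when they become free (which balances non-uniform workloads), and each worker reuses a thread-local accumulator instead of allocating per task. This is an illustrative sketch with hypothetical names, not the H-Dispatch implementation itself.

```python
import queue
import threading

def h_dispatch(blocks, n_workers=4):
    """Pull-based dispatch: free workers pull the next block themselves."""
    work = queue.Queue()
    for b in blocks:
        work.put(b)
    partials = []
    lock = threading.Lock()

    def worker():
        local_sum = 0                       # thread-local state, reused across pulls
        while True:
            try:
                block = work.get_nowait()   # pull only when free
            except queue.Empty:
                break
            local_sum += sum(block)         # per-block cost varies
        with lock:
            partials.append(local_sum)

    threads = [threading.Thread(target=worker) for _ in range(n_workers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return sum(partials)

blocks = [list(range(n)) for n in (10, 1000, 3, 500)]  # uneven workloads
total = h_dispatch(blocks)
print(total)
```

Because slow blocks do not pin fast workers to a fixed schedule, the queue drains at the rate of the free workers rather than the slowest pre-assigned partition.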
BINARY SVM

Binary Classification:

Given l samples (x₁, y₁), …, (x_l, y_l) with x_i ∈ Rⁿ, y_i ∈ Y ∀i and Y = {−1, 1},
a binary classifier predicts the label y ∈ Y of an unseen sample x ∈ Rⁿ.

[Figure: maximum-margin separating surface f*, with margin width 1/‖f*‖]

RBF kernel:

    k(x_i, x_j) = exp(−β ‖x_i − x_j‖²)
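The RBF kernel above, written out for plain Python lists (an illustrative helper, not the runtime's kernel code):

```python
import math

def rbf_kernel(xi, xj, beta):
    """k(x_i, x_j) = exp(-beta * ||x_i - x_j||^2)."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(xi, xj))
    return math.exp(-beta * sq_dist)

print(rbf_kernel([1.0, 0.0], [1.0, 0.0], beta=0.5))  # 1.0: identical points
print(rbf_kernel([1.0, 0.0], [0.0, 0.0], beta=0.5))  # exp(-0.5), distance 1
```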
PRIMAL & DUAL FORM OF THE SVM

Find the function f that solves the following regularization problem:

    min_{f ∈ H}  C Σ_{i=1}^{l} (1 − y_i f(x_i))₊ + ½ ‖f‖²,   where (k)₊ = max(k, 0) and C > 0

Then slack variables ξ_i are introduced to classify non-separable data:

Primal form:

    min_{f ∈ H}  C Σ_{i=1}^{l} ξ_i + ½ ‖f‖²
    subject to:  y_i f(x_i) ≥ 1 − ξ_i,  ξ_i ≥ 0,  i = 1, …, l

Dual form:

    max_{α ∈ R^l}  Σ_{i=1}^{l} α_i − ½ αᵀ K α
    subject to:  Σ_{i=1}^{l} y_i α_i = 0,   0 ≤ α_i ≤ C,  i = 1, …, l

    where K_ij = y_i y_j k(x_i, x_j) is the label-augmented kernel matrix.

Solving the dual yields  f(x) = Σ_{i=1}^{l} y_i α_i k(x, x_i) + b,  where b is an unregularized bias term.
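The dual objective and the resulting decision function can be evaluated numerically as follows. This is a toy-scale sketch; the function names and data are illustrative, not part of the runtime.

```python
import math

def rbf(xi, xj, beta=1.0):
    return math.exp(-beta * sum((a - b) ** 2 for a, b in zip(xi, xj)))

def dual_objective(alpha, X, y, beta=1.0):
    """W(alpha) = sum(alpha) - 1/2 alpha^T K alpha, K_ij = y_i y_j k(x_i, x_j)."""
    l = len(X)
    quad = sum(alpha[i] * alpha[j] * y[i] * y[j] * rbf(X[i], X[j], beta)
               for i in range(l) for j in range(l))
    return sum(alpha) - 0.5 * quad

def decision(x, alpha, X, y, b, beta=1.0):
    """f(x) = sum_i y_i alpha_i k(x, x_i) + b."""
    return sum(y[i] * alpha[i] * rbf(x, X[i], beta) for i in range(len(X))) + b

X = [[0.0], [2.0]]
y = [+1, -1]
alpha = [1.0, 1.0]
print(dual_objective(alpha, X, y))           # 1 + e^-4 for this toy problem
print(decision([0.1], alpha, X, y, b=0.0) > 0)  # closer to the +1 sample
```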
MULTICLASS CLASSIFICATION

Multiclass Classification:

Given l samples (x₁, y₁), …, (x_l, y_l) with x_i ∈ Rⁿ, y_i ∈ Y ∀i and Y = {1, …, M},
a multiclass classifier predicts the label y ∈ Y of an unseen sample x ∈ Rⁿ.

Multiclass SVM: a combination of N independent binary classification tasks. The binary
tasks are defined by an output code matrix R of size M×N with R_ij ∈ {−1, 0, 1}.

    All vs All (AVA), N = M(M−1)/2:       One vs All (OVA), N = M:

        R = [  1   1   0 ]                    R = [  1  −1  −1 ]
            [ −1   0   1 ]                        [ −1   1  −1 ]
            [  0  −1  −1 ]                        [ −1  −1   1 ]
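The two output code matrices can be built programmatically. A sketch (function names are illustrative); each column is one binary task, and R[m][j] ∈ {−1, 0, 1} says how class m participates in task j:

```python
from itertools import combinations

def ova_matrix(M):
    """One vs All: N = M columns, class j against the rest."""
    return [[1 if m == j else -1 for j in range(M)] for m in range(M)]

def ava_matrix(M):
    """All vs All: N = M(M-1)/2 columns, one per class pair (a, b)."""
    pairs = list(combinations(range(M), 2))
    return [[1 if m == a else -1 if m == b else 0 for (a, b) in pairs]
            for m in range(M)]

print(ova_matrix(3))  # [[1, -1, -1], [-1, 1, -1], [-1, -1, 1]]
print(ava_matrix(3))  # [[1, 1, 0], [-1, 0, 1], [0, -1, -1]]
```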
BINARY SVM AS MAP REDUCE PRIMITIVES IN A SINGLE-GPU

GPU (processors 1…P):

    MAP:            update the optimality indicators f_i → f_i'
    MAP:            compute (a_i, k_i)
    LOCAL REDUCE:   each processor p reduces its points to a candidate (k_i, f_i')
    GLOBAL REDUCE:  select (b_up, I_up) and (b_low, I_low)
    MAP:            update α'_up and α'_low

Pre-MAP kernel evaluation:

    k(x_i, x_j) = exp(−β ‖x_i − x_j‖²),   i = 1…n,  j ∈ {I_up, I_low}

Device state:  static (x_i, y_i);  variable (f_i, α_i, k_i, b, I, K);  kernel rows held in an LRU cache.
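A CPU sketch of the MAP and REDUCE primitives above, following standard Keerthi-style SMO working-pair selection: MAP computes the optimality indicator f_i for every point, and REDUCE picks (b_up, I_up) and (b_low, I_low) from the admissible index sets. The names and toy data are illustrative, not the GPU implementation.

```python
import math

def rbf(xi, xj, beta=1.0):
    return math.exp(-beta * sum((a - b) ** 2 for a, b in zip(xi, xj)))

def map_f(X, y, alpha, beta=1.0):
    """MAP: f_i = sum_j y_j alpha_j k(x_i, x_j) - y_i, one task per point."""
    return [sum(y[j] * alpha[j] * rbf(X[i], X[j], beta) for j in range(len(X)))
            - y[i] for i in range(len(X))]

def reduce_pair(f, y, alpha, C):
    """REDUCE: b_up = min f_i over I_up, b_low = max f_i over I_low."""
    i_up = [i for i in range(len(f))
            if (y[i] == 1 and alpha[i] < C) or (y[i] == -1 and alpha[i] > 0)]
    i_low = [i for i in range(len(f))
             if (y[i] == 1 and alpha[i] > 0) or (y[i] == -1 and alpha[i] < C)]
    I_up = min(i_up, key=lambda i: f[i])
    I_low = max(i_low, key=lambda i: f[i])
    return (f[I_up], I_up), (f[I_low], I_low)

X = [[0.0], [0.5], [3.0], [3.5]]
y = [1, 1, -1, -1]
alpha = [0.0] * 4
f = map_f(X, y, alpha)
(b_up, I_up), (b_low, I_low) = reduce_pair(f, y, alpha, C=10.0)
print(b_up, b_low)  # SMO converges once b_low <= b_up + 2*epsilon
```

On the GPU, the min/max reductions run per processor first (LOCAL REDUCE) and the final selection over the per-processor candidates is the GLOBAL REDUCE.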
BINARY SVM AS MAP REDUCE PRIMITIVES IN 4 GPUS

[Figure: the same cycle partitioned across 4 GPUs — each GPU runs MAP and LOCAL REDUCE on its data partition, the master thread performs the GLOBAL REDUCE over the four local results, and the final MAP step runs on every GPU]
EXPERIMENTS AND HARDWARE

                    Host                            Device
    Platform        Ubuntu 8.10 64-bit              4x Tesla C1060
    Processor       Dual-socket Intel Xeon E5520    240 stream processors (each)
    Frequency       2.26 GHz (cores)                1.3 GHz (stream processors)
    Peak            145 GFLOPS                      933 GFLOPS
    Memory          32 GB DDR3                      4 GB DDR3
    Bandwidth       25.6 GB/s                       102 GB/s

    Host <-> Device: PCIe x16 (8 GB/s)

Runtimes compared:

    LIBSVM:      single-threaded; double precision; sparse
    Hadoop:      4 VMs with one datanode each; Pegasos SVM; double precision; sparse
    Multicore:   8 worker threads in H-Dispatch; 1 block – 1 thread; double precision; dense
    Single GPU:  1 worker thread; 1 GPU; single precision; dense-sparse
    Multi GPU:   4 worker threads; 4 GPUs; single precision; dense-sparse
PERFORMANCE RESULTS: DATASETS

SVM experiment setup:
 • Same kernel type (RBF)
 • Same regularization parameter C
 • Same stopping criterion: ε = 0.001
 • SMO-based (except the Hadoop version)
 • One vs All in multiclass problems
 • 1 GB kernel cache

    Dataset    # Training points   # Testing points   (# Features, # Classes)   (C, β)
    WEB        49749               14951              (300, 2)                  (64, 7.8125)
    MNIST      60000               10000              (780, 10)                 (10, 0.125)
    RCV1       518571              15564              (47236, 53)               (1, 0.1)
    PROTEIN    17766               6621               (357, 3)                  (10, 0.05)
    SENSIT     78823               19705              (100, 3)                  (1, 0.7)
PERFORMANCE RESULT COMPARISON

    Dataset (non-zero %)            LIBSVM     Hadoop    Multicore   Single GPU (dense)   Multi GPU (dense)

    WEB (3%)       Time (s)         2364.2     1698.7    912.81      154.3                73.6
                   Gain (x)         1.00       1.39      2.59        15.32                32.12
                   Accuracy (%)     82.69      82.69     82.69       82.69                82.69

    MNIST (19%)    Time (s)         118943.5   66753.5   22873.75    2010.3               726.9
                   Gain (x)         1.00       1.78      5.20        59.17                163.63
                   Accuracy (%)     95.76      95.76     95.76       95.76                95.76

    RCV1 (0.1%)    Time (s)         710664     231486    N/A         N/A                  N/A
                   Gain (x)         1.00       3.07      N/A         N/A                  N/A
                   Accuracy (%)     94.67      94.67     94.67       94.67                94.67

    PROTEIN (29%)  Time (s)         861        717.5     260.12      32.93                16.06
                   Gain (x)         1.00       1.20      3.31        26.15                53.61
                   Accuracy (%)     70.03      70.03     70.03       70.03                70.03

    SENSIT (100%)  Time (s)         8162       4295.78   2005.4      134.67               58.29
                   Gain (x)         1.00       1.90      4.07        60.61                140.02
                   Accuracy (%)     83.46      83.46     83.46       83.46                83.46
ELLPACK-R (Vazquez et al. IEEE CIT 2010)

    Dataset (non-zero %)            Single GPU (sparse)   Multi GPU (sparse)

    WEB (3%)       Time (s)         107.35                57.3
                   Gain (x)         22.02 (1.43)          41.26 (1.26)
                   Accuracy (%)     82.69                 82.69

    RCV1 (0.1%)    Time (s)         N/A                   3686
                   Gain (x)         N/A                   192.80
                   Accuracy (%)     94.67                 94.67

    RCV1: ~8.2 days -> ~1 hour
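A sketch of the ELLPACK-R sparse matrix-vector product: the format stores, per row, the nonzero values and their column indices padded to the width of the widest row, plus an explicit row-length array so each (GPU) thread can stop early instead of reading padding. Function names are illustrative.

```python
def ellpack_r_from_dense(A):
    """Build ELLPACK-R arrays (vals, cols, row lengths) from a dense matrix."""
    rl = [sum(1 for v in row if v != 0) for row in A]
    width = max(rl, default=0)
    vals = [[v for v in row if v != 0] + [0.0] * (width - n)
            for row, n in zip(A, rl)]
    cols = [[j for j, v in enumerate(row) if v != 0] + [0] * (width - n)
            for row, n in zip(A, rl)]
    return vals, cols, rl

def ellpack_r_spmv(vals, cols, rl, x):
    """One 'thread' per row; the rl[i] bound skips the padding entries."""
    return [sum(vals[i][k] * x[cols[i][k]] for k in range(rl[i]))
            for i in range(len(rl))]

A = [[1.0, 0.0, 2.0],
     [0.0, 3.0, 0.0],
     [4.0, 5.0, 6.0]]
vals, cols, rl = ellpack_r_from_dense(A)
print(ellpack_r_spmv(vals, cols, rl, [1.0, 1.0, 1.0]))  # [3.0, 3.0, 15.0]
```

The explicit row-length array is what distinguishes ELLPACK-R from plain ELLPACK, and it is why the format suits extremely sparse data such as RCV1 (0.1% non-zero).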
CONCLUSIONS


 CONCLUSIONS:

        Constructed a MR runtime that satisfies the requirements of many ML algorithms and integrates GPUs.
                Iterative stateful jobs
                Multithreaded BLAS to prepare Map or Reduce Tasks
                Static/Variable data

        Tested the runtime solving popular classification problems.
                Delivered up to two orders of magnitude of acceleration using 4 GPUs
                Compared different runtimes




     LIMITATIONS:

        H-Dispatch (pull) is dependent on host-to-device (H->D) state transfers

        The relaxation of fault tolerance must be acceptable

        If d >> n (far more features than samples), MapReduce will have little benefit




FUTURE WORK




     FUTURE:

        GPU Technology:
               Concurrent Kernel Execution -> maximize utilization
               GPUDirect -> facilitate the Sort operation
               Distributed memory -> intermediate results
               Shared CPU-GPU memory space

        Communication:
                Cross-node performance
                GPU-Port abstraction
                       In-node: cross-thread pointer exchange
                       Out-node: MVAPICH2 and MVAPICH2-GPU
        Algorithms:
                Requirements for incremental classification and clustering




CONCURRENT KERNEL EXECUTION

[Figure: a task queue feeds two CPU threads that launch kernels through a shared port onto one GPU, overlapping their execution]

• CUDA Compute Capability 2.0 allows up to sixteen concurrent kernels.
• Concurrent kernels need to run in the same context.
INTEGRATING THE MPP IN THE MR CLUSTER ARCHITECTURE

[Figure: on each node, DFS blocks flow through the MRF/Task Tracker into host state (HState/HMem) and are mirrored into device state (DState/DMem) on the attached GPUs; DState is exchanged cross-node, with a state snapshot written to the DFS every x iterations]

GPUDirect:
• GPU-to-GPU memory copy
• Communication with network devices

Minimal communication to HState
PIPELINING/MEMCACHED

[Figure: two DataNodes, each running an MRF/Task Tracker over DFS blocks; CPU Map tasks pass intermediate results through memory to CPU Reduce tasks, with a pool of Memcached nodes serving as the shared intermediate store]
QUESTIONS




APPLICATION I: EVENT DETECTION USING TWEETS


Sakaki et al.: detect Tweet outbreaks about large-scale and infrequent
events. Natural disasters: earthquakes, floods. Accidents: fires, road
accidents.

            INFREQUENT EVENTS
APPLICATION I: EVENT DETECTION USING TWEETS

    "Listening to the New York Philharmonic, amazing performance"

    "Lots of people trying to enter the MSG for the Alice in Chains
    concert. I wish I had tickets."

    "Nassau County Museum of Art is looking for volunteers to greet,
    work in gift shop or perform clerical support."

Goal: Detect popular events at locations with a high volume of tweets.
APPLICATION I: FEATURE VECTOR

    It/PRP is/VBZ a/DT good/JJ day/NN when/WRB the/DT CEO/NN
    of/IN a/DT multinational/JJ ,/, multi-million/JJ
    dollar/NN company/NN tells/VBZ you/PRP you/PRP 're/VBP
    a/DT genius/NN ./. :/: D/NNP

    Lots/NNS of/IN people/NNS trying/VBG to/TO enter/VB
    the/DT MSG/NNP for/IN the/DT Alice/NNP in/IN
    Chains/NNP concert/NN ./. I/PRP wish/VBP I/PRP
    had/VBD tickets/NNS ./.

Feature vectors:

    h_i(x, y) = 1 if (x, y) contains ___ , 0 otherwise

    - Has unigram with POS
    - Has bigram with POSs
    - Has trigram with POSs
    - X1 is subject of X2
    - …
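The binary indicator features above can be sketched as follows: each feature fires (value 1) if the POS-tagged tweet contains a given n-gram of (token, POS) pairs. The template list and function names are hypothetical, chosen only to illustrate the h_i(x, y) pattern.

```python
def ngrams(tagged, n):
    """All length-n runs of (token, POS) pairs in a tagged tweet."""
    return [tuple(tagged[i:i + n]) for i in range(len(tagged) - n + 1)]

def feature_vector(tagged, templates):
    """Binary indicators: 1 if the template n-gram occurs in the tweet."""
    present = set()
    for n in (1, 2, 3):                      # unigrams, bigrams, trigrams
        present.update(ngrams(tagged, n))
    return [1 if t in present else 0 for t in templates]

tweet = [("Alice", "NNP"), ("in", "IN"), ("Chains", "NNP"), ("concert", "NN")]
templates = [
    (("concert", "NN"),),                                  # unigram, present
    (("Alice", "NNP"), ("in", "IN"), ("Chains", "NNP")),   # trigram, present
    (("tickets", "NNS"),),                                 # absent -> 0
]
print(feature_vector(tweet, templates))  # [1, 1, 0]
```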
APPLICATION I: EXPERIMENT

Used the NYC.com event calendar (Oct 9-11, 2009). Extracted ~400 features.

    Title            Location                         Description
    Alice in Chains  Madison Square Garden, 2 Penn    Alice in Chains has sold more than twenty million albums in the
                     Plaza, New York, NY, 10001       United States (and an estimated 40 million worldwide), released
                                                      two number-one albums and 19 top-40 singles, and has received
                                                      six Grammy nominations…

EXPERIMENT 1:
• 2000 Tweets from the same weekend (160 (8%) "Concert", 1840 (92%) "Background")
• RBF kernel (C=10, gamma=1.0). Testing on 20% -> accuracy of 97%
• "False positives"

EXPERIMENT 2:
• 2000 Tweets from the next weekend (160 (8%) "Concert", 1840 (92%) "Background")
• RBF kernel (C=10, gamma=1.0). Testing on 100% -> accuracy of 93%
• "False positives" + "false negatives"
• After using NYC.com again -> accuracy of 96%
APPLICATION II: PRICE CALCULATIONS FOR EACH HOUSEHOLD

[Figure: 30 days x 96 intervals = 2880 price values per household]

Accelerating Machine Learning Algorithms by integrating GPUs into MapReduce Clusters

  • 1. ACCELERATING MACHINE LEARNING ALGORITHMS BY INTEGRATING GPUS INTO MAPREDUCE CLUSTERS Sergio Herrero-Lopez Intelligent Engineering Systems Laboratory (IESL) November 30, 2011 1 Accelerating ML algorithms by integrating GPUs in MR Clusters
  • 2. INTRODUCTION ABOUT ME:  Ph.D (December 2011) at Massachusetts Institute of Technology (USA)  M.Sc (2007) and B.Sc (2005) in Electrical Engineering at University of Navarra (Spain)  Microsoft Research (Redmond WA, 2008), Tampere University of Technology (Finland, 2005) and IKUSI (Spain, 2003) ABOUT PROF. WILLIAMS RESEARCH GROUP (ENGINEERING SYSTEMS DIVISION):  High Performance Price Analytics for the Smart Grid (2008-2009)  Large-Scale Simulator for Global Data Infrastructure Optimization (2009-2011)  Music Event Detection from Tweets in New York (2010-2011)  Accelerating Machine Learning Algorithms by integrating GPUs into MapReduce Clusters 2 Accelerating ML algorithms by integrating GPUs in MR Clusters
  • 3. AGENDA o PROBLEM STATEMENT: Big Data & Need for scale and/or speed o PROPOSITION: Modify MapReduce runtime to o Satisfy the particular requirements of ML algorithms o Integrate Massively Parallel Processors in the system o PREVIOUS WORK MapReduce for ML in Multicore/Single-GPU/Multi- GPU/GPU-Cluster/FPGA o IMPLEMENTATION of new MR runtime using Port abstractions o PERFORMANCE results running SVMs on the proposed system o CONCLUSIONS: Contributions and Limitations. Lessons learned o FUTURE WORK 3 Accelerating ML algorithms by integrating GPUs in MR Clusters
  • 4. MACHINE LEARNING PARALLELIZATION { xi, yi },i =1… n, "i, n -Representative sample 1. Does not fit in resources d -Feature selection 2. Takes too long xi Î R d , yi Î Y = {1… k} k -Consolidate classes 3. Accuracy was sacrificed Algorithm 1 Algorithm 1 Independent Runs L1 Worker X Worker Y (Cluster) Algorithm 1 Summation Form L2 (MapReduce) Worker X Worker Y Algorithm 1 L3 Structural Parallelism (MPPs) Machine Learning Algorithms decomposable into MR primitives Naïve Bayes K-means Expectation Maximization Neural Network Support Vector Machine Classification Principal Component Analysis Hidden Markov Models 4 Accelerating ML algorithms by integrating GPUs in MR Clusters
  • 5. MAPREDUCE PRIMITIVES & RUNTIME Input M [ k1, v1 ] ® [ k2, v2 ]  Split R ék2 , {v2,i }k ë ù® v WORKER 1 WORKER 2 WORKER M-1 WORKER M 2,i =k2 û 3  Map  Sort WORKER 1 WORKER 2 WORKER N-1 WORKER N       Reduce   Merge  Output 5 Accelerating ML algorithms by integrating GPUs in MR Clusters
  • 6. MAPREDUCE REPRESENTATION OF K-MEANS M ékit , xi ù ® éki¢t , xi ù ë û ë û kit  { ki¢t = x j : x j - mit £ x j - mit¢ "i¢ =1… k } ki¢t Rék¢t , { xi }k¢t =k¢t ù ® mk¢t ë û t+1   { xi }k¢ =k¢ i t t i å 1 mk¢t = t+1 x xi ki¢t =k ¢t x Î{ xi }k¢t =k¢t t+1 mk¢t i 6 Accelerating ML algorithms by integrating GPUs in MR Clusters
  • 7. MAPREDUCE REPRESENTATION OF EM FOR MIXTURE OF GAUSSIANS M [(i, k), xi ] ® é(i, k), pi,k ù ë û xi a f ( xi | m , S t k t k t k )  pi,k = K åa f ( x | m , S ) t k i t k t k pi,k k=1 Rék, { pi,k¢ }k¢=k ù ® ak t+1 { pi,k¢ }k¢=k ë û   n åp i,k t+1 a t+1 k = i=1 ak n Rék, { xi , pi,k¢ }k¢=k ù ® mk ë û t+1 { xi, pi,k¢ }k¢=k n   åx p i i,k m t+1 k = i=1 t+1 mk t+1 nak Rék, { xi , pi,k¢ }k¢=k ù ® St+1 ë û k   { xi, pi,k¢ }k¢=k n å p (x i - m k ) ( xi - m k ) t+1 t+1 T i,k S t+1 k = i=1 na t+1 St+1 k k 7 Accelerating ML algorithms by integrating GPUs in MR Clusters
  • 8. MAPREDUCE REPRESENTATION OF SVM (SMO) M [i, fi ] ® [i, fi¢] fi  fi¢= fi + Da Iup yIup k(x Iup , x i )+ Da Ilow yIlow k(x Ilow , x i ) fi¢ M [i, ai ] ® [i, ki ] I 0 = {i : yi = {1, -1}, 0 < a i < C} ai I1 = {i : yi = 1, a i = 0} È {i : yi = -1, a i = C }  I 2 = {i : yi = 1, ai = C} È {i : yi = -1, a i = 0} ki kup = {i Î I 0 È I1 }, klow = {i Î I 0 È I 2 } ki Î kup , klow R ék, { fi }k =k ù ® (b, I ) ë i û   { fi }k =k i bup = min{ fi : ki = kup }, Iup = argmin ki =kup fi blow = max{ fi : ki = klow }, I low = argmax ki =klow fi (b, I) M [i, ai ] ® [i, ai¢] yIup ( fIlow - fIup ) ai a¢ = aI - Iup up 2k(xIlow , xIup ) - k(xIlow , xIlow ) - k(xIup , xIup ) a ¢ = a I + yI yI (a I - a ¢ ) I low low low Iup up up a i¢ 8 Accelerating ML algorithms by integrating GPUs in MR Clusters
  • 9. MAPREDUCE FOR ML WISHLIST Static Variable mk Static vs Variable data xi x  Static: Largest, fixed, used in every iteration i (a , mk , St+1 ) t+1 t+1 ( xi, yi ) k k  Variable: Results of each iteration, consumed in the next iteration ( fi, ai ) DFS  Iterate until convergence  Avoid reloading static data between iterations MEM  Utilize memory hierarchy as opposed to DFS or LFS   DFS Massively Threaded MapReduce Tasks  Map is embarrassingly parallel CPU MPP  Reduce is highly parallelizable Dimensionality & Algebra - b xi -x j 2  Map Tasks may encapsulate high dimensional matrix-vector k(xi , x j ) = e or matrix-matrix operations  Interleave multithreaded BLAS operations using static data i = 1...n, j Î { I up, I low }  Sparse data structures 9 Accelerating ML algorithms by integrating GPUs in MR Clusters
  • 10. COMPUTING ECOSYSTEM COMMODITY HIGH PERFORMANCE/SUPER COMPUTING COMPUTING RELATIONAL DB HADOOP INFINIBAND BIGTABLE DRYAD CASSANDRA OPENMPI GPU DYNAMO GPU 1/10 GB ETHERNET FPGA COLUMN DB HADOOP 20 GB INFINIBAND SSD DATA APPLIANCE/ WAREHOUSE COMPUTING 10 Accelerating ML algorithms by integrating GPUs in MR Clusters
  • 11. MAPREDUCE CLUSTER: ARCHITECTURE Client 1) Distributed File System. - Unstructured data File Job - Scales to thousands of nodes - High reliability through NameNode replication DFS MRF 2) Map Reduce Framework Runtime JobTracker - Batch processing system - Load balancing Task Task Task DataNode 1 Block DataNode 2 DataNode 3 Block Block MRF MRF MRF TaskTracker TaskTracker TaskTracker DFS DFS DFS 11 Accelerating ML algorithms by integrating GPUs in MR Clusters
  • 12. MAPREDUCE CLUSTER: LIMITATIONS DataNode 1 DataNode 2 Task Task MRF Tracker MRF Tracker One (or two) tasks per node DFS Block DFS Block One Task  One Data Block  CPU CPU One Core  One Thread Map Map Task Task HD Block HD Block Synchronization by materialization of intermediate results CPU CPU Reduce Reduce Task Task DFS Block DFS Block No support for iterative jobs 12 Accelerating ML algorithms by integrating GPUs in MR Clusters
  • 13. MASSIVELY PARALLEL PROCESSORS: NVIDIA TESLA ARCHITECTURE Host Device Stream Multiprocessor N Stream Multiprocessor 2 Memory Shared 1 Cycle coalesced Stream Multiprocessor 1 Memory Shared ~10 Cycles uncoalesced Registers Registers Registers Shared Memory Registers Registers Registers Instruction Registers Unit ProcessorRegisters 1 Processor 2 Registers …. Processor M Instruction Unit 0 Cycles Processor 1 Processor 2 …. Processor M Instruction Constant Cache Unit SP 1 SP 2 …. SP M Constant Cache Texture Cache ~10 Cycles Cache Hit Constant Memory Texture Cache Texture Memory ~400 Cycles ~400 Cycles 102 GB/s 102 GB/s Host Memory Device Memory PCI-E 16x (8GB/s) 13 Accelerating ML algorithms by integrating GPUs in MR Clusters
  • 14. NVIDIA TESLA: REPRESENTATIONS Logical Representation Physical Representation Thread Processor Block MultiProcessor Maximum (512,512,64) But max 512 threads per block Grid Device Shared Shared Register Memor Register Register Maximum Register Memor s s Register yRegister s Processs y …. s s Process Process (65535, Process or ConstantM Process…. or 65535) or 1 2 Process or 1 or ConstantM 2 Texture or Cache Cache Cache 14 Accelerating ML algorithms by integrating GPUs in MR Clusters
  • 15. PROPOSED RUNTIME: MR + GPU Block Block DFS MRF Task Tracker HState HMem Split H->D Transfers DMem DState Pre-Map BLAS GPU Map DMem DState Post-Map D->H Transfers HState Cross-Node HMem Sort DMem DState H->D Transfers Pre-Reduce BLAS Local GPU Reduce DMem DState D->H Transfers Post-Reduce HState Cross-Node Global HMem Reduce Block Block State Snapshot every DFS x iterations 15 Accelerating ML algorithms by integrating GPUs in MR Clusters
  • 16. PROPOSED RUNTIME: MR + GPU Block Block DFS MRF Task Tracker HState HMem Multiple tasks per node DMem DState Multithreaded MR Tasks GPU Interleave Multithreaded BLAS DMem DState Local/Global Reduction HState Static/Variable Data Cross-Node HMem Long-running Iterative Jobs DMem DState Stateful Nodes Shared-Memory GPU Fault-Tolerance Relaxation DMem DState HState Cross-Node HMem Block Block DFS 16 Accelerating ML algorithms by integrating GPUs in MR Clusters
  • 17. PREVIOUS WORK MAPREDUCE ON SINGLE GPU/ SINGLE FPGA Interleave Multithreaded BLAS •Mars (He et al. PACT 2008) •NVIDIA (Catanzaro et al. STMCS 2008) •Cell (de Kruijf and Sankaralingam IBM Journal R&D 2009) Massively Multithreaded MR Tasks MAPREDUCE ON MULTICORE Shared-Memory •Phoenix (Ranger et al. HPCA 2007) •Phoenix 2 (Yoo et al. IISWC 2009) •Phoenix ++ (Talbot et al. MAPREDUCE 2011) Fault-Tolerance Relaxation MAPREDUCE ON MULTI-GPU/GPU CLUSTERS Intermediate data in-memory •CellMR (Rafique et al. IPDPS 2009) •GPMR (Stuart and Owens IPDPS 2011) Local/Global Reduction MAPREDUCE FOR MACHINE LEARNING •Mahout (Apache) Long running (iterative) Tasks •Multicore (Chu et al. NIPS 2006) •FGPA (Xu NIPS 2009) •Twister (Ekanayake et al. MAPREDUCE 2010) •SystemML (Ghoting et al. ICDE 2011) Static vs Variable Data 17 Accelerating ML algorithms by integrating GPUs in MR Clusters
  • 18. PORT-BASED PROGRAMMING: ABSTRACTION Message Port Single Item Receiver Arbiter Multiple Item Receiver Dispatcher Handler Handler Task Handler Join Receiver Dispatcher Queue Choice Receiver Teardown State Handler Concurrent Exclusive Scatter Gather 18 Accelerating ML algorithms by integrating GPUs in MR Clusters
  • 19. SCATTER-GATHER USING GPU-PORTS Task MRF Tracker (Task, Block, Port Response Port) Master Arbiter C#/Java Thread Dispatcher Handler Scatter Handler Task Handler Dispatcher Queue HState CPU Handler C++ Kernel                                                             CUDA 3.2 Gather DState 19 Accelerating ML algorithms by integrating GPUs in MR Clusters
  • 20. H-DISPATCH ALTERNATIVE Task MRF Tracker (Task, Block, Response Port) Master H-Dispatch Thread Scatter + Load Balancing for non-uniform workloads + Local variable reutilization. Avoid GC blocking threads                                              + Runs hState> sum(Dmem) Gather - Detach state and port: dState load/unload 20 Accelerating ML algorithms by integrating GPUs in MR Clusters
• 21. BINARY SVM
Binary classification: given $l$ samples $(x_1, y_1), \ldots, (x_l, y_l)$ with $x_i \in \mathbb{R}^n$, $y_i \in Y\ \forall i$ and $Y = \{-1, +1\}$, a binary classifier predicts the label $y \in Y$ of an unseen sample $x \in \mathbb{R}^n$.
RBF kernel: $k(x_i, x_j) = e^{-\beta \|x_i - x_j\|^2}$
[figure: separating function $f^*$ and its margin]
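The RBF kernel on the slide is a one-liner; a minimal sketch (function name illustrative):

```python
import math

def rbf_kernel(x_i, x_j, beta):
    """RBF kernel from the slide: k(x_i, x_j) = exp(-beta * ||x_i - x_j||^2)."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x_i, x_j))
    return math.exp(-beta * sq_dist)
```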
• 22. PRIMAL & DUAL FORM OF THE SVM
Find the function $f$ that solves the following regularization problem:
$\min_{f \in H} C \sum_{i=1}^{l} \max(1 - y_i f(x_i), 0) + \frac{1}{2}\|f\|^2$, where $C > 0$.
Then slack variables $\xi_i$ are introduced to classify non-separable data:
Primal form: $\min_{f \in H} C \sum_{i=1}^{l} \xi_i + \frac{1}{2}\|f\|^2$ subject to $y_i f(x_i) \geq 1 - \xi_i$, $\xi_i \geq 0$, $i = 1, \ldots, l$.
Dual form: $\max_{\alpha \in \mathbb{R}^l} \sum_{i=1}^{l} \alpha_i - \frac{1}{2}\alpha^T K \alpha$ subject to $\sum_{i=1}^{l} y_i \alpha_i = 0$, $0 \leq \alpha_i \leq C$, $i = 1, \ldots, l$, where $K_{ij} = y_i y_j k(x_i, x_j)$ is the kernel matrix.
Solving the dual: $f(x) = \sum_{i=1}^{l} y_i \alpha_i k(x, x_i) + b$, where $b$ is an unregularized bias term.
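The dual solution yields the decision function $f(x)$ directly; a minimal sketch, assuming an RBF kernel and an already-solved $(\alpha, b)$ (names illustrative):

```python
import math

def rbf(x_i, x_j, beta):
    return math.exp(-beta * sum((a - b) ** 2 for a, b in zip(x_i, x_j)))

def svm_decision(x, support_vectors, labels, alphas, b, beta):
    """Evaluate f(x) = sum_i y_i * alpha_i * k(x, x_i) + b from the dual solution."""
    return sum(y * a * rbf(x, sv, beta)
               for sv, y, a in zip(support_vectors, labels, alphas)) + b
```

The predicted label is the sign of `svm_decision(x, ...)`.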
• 23. MULTICLASS CLASSIFICATION
Multiclass classification: given $l$ samples $(x_1, y_1), \ldots, (x_l, y_l)$ with $x_i \in \mathbb{R}^n$, $y_i \in Y\ \forall i$ and $Y = \{1, \ldots, M\}$, a multiclass classifier predicts the label $y \in Y$ of an unseen sample $x \in \mathbb{R}^n$.
Multiclass SVM: combination of $N$ independent binary classification tasks. Binary tasks are defined by an output code matrix $R$ of size $M \times N$ with $R_{ij} \in \{-1, 0, 1\}$.
All vs All (AVA): $N = \binom{M}{2}$, e.g. for $M = 3$: $R = \begin{pmatrix} 1 & 1 & 0 \\ -1 & 0 & 1 \\ 0 & -1 & -1 \end{pmatrix}$
One vs All (OVA): $N = M$, e.g. for $M = 3$: $R = \begin{pmatrix} 1 & -1 & -1 \\ -1 & 1 & -1 \\ -1 & -1 & 1 \end{pmatrix}$
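Both code matrices can be generated mechanically; a sketch assuming one conventional column ordering for AVA (class pairs $(a, b)$ with $a < b$):

```python
from itertools import combinations

def ova_code_matrix(M):
    """One-vs-All: N = M columns; class i is +1 in column i, -1 elsewhere."""
    return [[1 if i == j else -1 for j in range(M)] for i in range(M)]

def ava_code_matrix(M):
    """All-vs-All: N = M*(M-1)/2 columns, one per class pair (a, b);
    class a is +1, class b is -1, all other classes are 0 (left out)."""
    pairs = list(combinations(range(M), 2))
    return [[1 if i == a else (-1 if i == b else 0) for (a, b) in pairs]
            for i in range(M)]
```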
• 24. BINARY SVM AS MAPREDUCE PRIMITIVES ON A SINGLE GPU
[diagram] Processors $1, \ldots, P$ of the GPU each run MAP over their samples, producing $f_i'$ and $(a_i, k_i)$; LOCAL REDUCE combines $(k_i, f_i')$ per processor; GLOBAL REDUCE yields $(b_{up}, I_{up})$ and $(b_{low}, I_{low})$; Pre-MAP computes the needed kernel rows $k(x_i, x_j) = e^{-\beta\|x_i - x_j\|^2}$ for $i = 1, \ldots, n$, $j \in \{I_{up}, I_{low}\}$.
Device state: $(x_i, y_i)$ static; $(f_i, a_i, k_i, b, I, K)$ variable; LRU cache for kernel rows.
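The MAP / LOCAL REDUCE / GLOBAL REDUCE split corresponds to SMO working-set selection; a minimal CPU-side sketch assuming the standard SMO index sets I_up/I_low (names illustrative; on the slide these phases run as CUDA kernels, and the convergence test b_up >= b_low - 2*tolerance is omitted):

```python
def map_phase(block, f, alpha, y, C):
    """Per-block MAP + LOCAL REDUCE: find local (b_up, i_up) and (b_low, i_low).
    I_up / I_low are the standard SMO candidate sets over the current f values."""
    b_up, i_up = float("inf"), -1
    b_low, i_low = float("-inf"), -1
    for i in block:
        in_up = (alpha[i] < C and y[i] == 1) or (alpha[i] > 0 and y[i] == -1)
        in_low = (alpha[i] < C and y[i] == -1) or (alpha[i] > 0 and y[i] == 1)
        if in_up and f[i] < b_up:
            b_up, i_up = f[i], i
        if in_low and f[i] > b_low:
            b_low, i_low = f[i], i
    return (b_up, i_up), (b_low, i_low)

def global_reduce(partials):
    """GLOBAL REDUCE: combine per-processor (b_up, I_up), (b_low, I_low) pairs."""
    ups, lows = zip(*partials)
    return min(ups), max(lows)
```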
• 25. BINARY SVM AS MAPREDUCE PRIMITIVES ON 4 GPUS
[diagram] A master thread coordinates GPUs 1-4: each GPU runs MAP and LOCAL REDUCE on its partition; the master performs the GLOBAL REDUCE and launches the next round of MAPs.
• 26. EXPERIMENTS AND HARDWARE
Host: Ubuntu 8.10 64-bit; dual-socket Intel Xeon E5520; cores at 2.26 GHz (145 GFlops); 32 GB DDR3; memory bandwidth 25.6 GB/s.
Device: 4x Tesla C1060; 240 stream processors per GPU at 1.3 GHz (933 GFlops); 4 GB DDR3; memory bandwidth 102 GB/s.
Host <-> Device: PCIe x16 (8 GB/s).
Runtimes compared:
- LIBSVM: single-threaded; double precision; sparse
- Hadoop: 4 VMs with one datanode each; Pegasos SVM; 1 block, 1 thread; double precision; sparse
- Multicore: 8 worker threads in H-Dispatch; double precision; dense-sparse
- Single GPU: 1 worker thread; 1 GPU; single precision; dense-sparse
- Multi GPU: 4 worker threads; 4 GPUs; single precision; dense
• 27. PERFORMANCE RESULTS: DATASETS
Dataset: # training points / # testing points / (features, classes) / (C, β)
WEB: 49749 / 14951 / (300, 2) / (64, 7.8125)
MNIST: 60000 / 10000 / (780, 10) / (10, 0.125)
RCV1: 518571 / 15564 / (47236, 53) / (1, 0.1)
PROTEIN: 17766 / 6621 / (357, 3) / (10, 0.05)
SENSIT: 78823 / 19705 / (100, 3) / (1, 0.7)
SVM experiment setup: same kernel type (RBF); same regularization parameter C; same stopping criterion (0.001); SMO-based (except the Hadoop version); One-vs-All in multiclass problems; 1 GB kernel cache.
• 28. PERFORMANCE RESULT COMPARISON
Columns: LIBSVM / Hadoop / Multicore / Single GPU (Dense) / Multi GPU (Dense). Accuracy is identical across runtimes for each dataset.
WEB (3% non-zero): time (s) 2364.2 / 1698.7 / 912.81 / 154.3 / 73.6; gain (x) 1.00 / 1.39 / 2.59 / 15.32 / 32.12; accuracy 82.69%
MNIST (19%): time (s) 118943.5 / 66753.5 / 22873.75 / 2010.3 / 726.9; gain (x) 1.00 / 1.78 / 5.20 / 59.17 / 163.63; accuracy 95.76%
RCV1 (0.1%): time (s) 710664 / 231486 / N/A / N/A / N/A; gain (x) 1.00 / 3.07 / N/A / N/A / N/A; accuracy 94.67%
PROTEIN (29%): time (s) 861 / 717.5 / 260.12 / 32.93 / 16.06; gain (x) 1.00 / 1.20 / 3.31 / 26.15 / 53.61; accuracy 70.03%
SENSIT (100%): time (s) 8162 / 4295.78 / 2005.4 / 134.67 / 58.29; gain (x) 1.00 / 1.90 / 4.07 / 60.61 / 140.02; accuracy 83.46%
• 29. ELLPACK-R (Vazquez et al. IEEE CIT 2010)
WEB (3% non-zero): Single GPU (Sparse) 107.35 s, gain 22.02x (1.43x over dense); Multi GPU (Sparse) 57.3 s, gain 41.26x (1.26x over dense); accuracy 82.69%
RCV1 (0.1%): Single GPU (Sparse) N/A; Multi GPU (Sparse) 3686 s, gain 192.80x; accuracy 94.67%
~8.2 days -> ~1 hour
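ELLPACK-R stores padded value/column arrays plus an explicit per-row length array so each GPU thread can stop at its row's end instead of scanning padding. A host-side sketch of the format and the SpMV it enables (the payoff is on the GPU, where rows map to threads; names illustrative):

```python
def to_ellpack_r(rows):
    """Build ELLPACK-R arrays from list-of-(col, val) sparse rows:
    values/columns padded to the longest row, plus a row-length array rl."""
    n = len(rows)
    max_len = max((len(r) for r in rows), default=0)
    val = [[0.0] * max_len for _ in range(n)]
    col = [[0] * max_len for _ in range(n)]
    rl = [len(r) for r in rows]
    for i, r in enumerate(rows):
        for j, (c, v) in enumerate(r):
            col[i][j] = c
            val[i][j] = v
    return val, col, rl

def ellpack_r_spmv(val, col, rl, x):
    """y = A @ x; one 'thread' per row, iterating only its rl[i] stored entries."""
    return [sum(val[i][j] * x[col[i][j]] for j in range(rl[i]))
            for i in range(len(rl))]
```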
• 30. CONCLUSIONS
CONCLUSIONS:
- Constructed an MR runtime that satisfies the requirements of many ML algorithms and integrates GPUs: iterative stateful jobs; multithreaded BLAS to prepare Map or Reduce tasks; static/variable data.
- Tested the runtime on popular classification problems.
- Delivered up to two orders of magnitude of acceleration using 4 GPUs.
- Compared different runtimes.
LIMITATIONS:
- H-Dispatch (pull) depends on host-to-device state transfers.
- Relaxation of fault tolerance must be acceptable.
- If d >> n, MapReduce will have little benefit.
• 31. FUTURE WORK
GPU technology: concurrent kernel execution to maximize utilization; GPUDirect to facilitate the Sort operation; distributed memory for intermediate results; shared CPU-GPU memory space.
Communication: cross-node performance; GPU-Port abstraction; in-node cross-thread pointer exchange; out-of-node MVAPICH2 and MVAPICH2-GPU.
Algorithms: requirements for incremental classification and clustering.
• 32. CONCURRENT KERNEL EXECUTION
[diagram: two CPU threads posting tasks through a port/queue to one GPU]
- CUDA Compute Capability 2.0 allows up to sixteen concurrent kernels.
- Concurrent kernels need to run in the same context.
• 33. INTEGRATING THE MPP IN THE MR CLUSTER ARCHITECTURE
[diagram] Each node holds DFS blocks, the MRF Task Tracker, host state (HState/HMem), and per-GPU device state (DState/DMem); a state snapshot is taken every x iterations; cross-node communication to the GPUs is kept minimal.
GPUDirect: GPU-to-GPU memory copy; communication with network devices.
• 34. PIPELINING / MEMCACHED
[diagram] Two datanodes, each with an MRF Task Tracker, DFS blocks, and CPU Map/Reduce tasks; Memcached nodes (CPU + MEM) sit between the Map and Reduce tasks to hold intermediate results.
• 35. QUESTIONS
• 36. APPLICATION I: EVENT DETECTION USING TWEETS
Sakaki et al.: detect Tweet outbreaks about large-scale, infrequent events. Natural disasters: earthquakes, floods. Accidents: fires, road accidents.
• 37. APPLICATION I: EVENT DETECTION USING TWEETS
Goal: detect popular events at locations with a high volume of tweets. Example tweets: "Listening to the New York Philharmonic, amazing performance"; "Lots of people trying to enter the MSG for the Alice in Chains concert. I wish I had tickets."; "Nassau County Museum of Art is looking for volunteers to greet, work in gift shop or perform clerical support."
• 38. APPLICATION I: FEATURE VECTOR
POS-tagged examples:
It/PRP is/VBZ a/DT good/JJ day/NN when/WRB the/DT CEO/NN of/IN a/DT multinational/JJ ,/, multi-million/JJ dollar/NN company/NN tells/VBZ you/PRP you/PRP 're/VBP a/DT genius/NN ./. :/: D/NNP
Lots/NNS of/IN people/NNS trying/VBG to/TO enter/VB the/DT MSG/NNP for/IN the/DT Alice/NNP in/IN Chains/NNP concert/NN ./. I/PRP wish/VBP I/PRP had/VBD tickets/NNS ./.
Feature vectors: binary indicators $h_i(x, y) = \begin{cases} 1 & \text{if } (x, y) \text{ contains } \_\_\_ \\ 0 & \text{otherwise} \end{cases}$
- Has unigram with POS
- Has bigram with POSs
- Has trigram with POSs
- X1 is subject of X2
- ....
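The binary indicator features can be sketched as set-membership tests over POS-tagged n-grams; a minimal version covering only the unigram and bigram templates (function name and vocabulary representation are illustrative assumptions):

```python
def pos_ngram_features(tagged_tokens, vocabulary):
    """Binary indicators: h_i(x) = 1 if the tweet contains the i-th vocabulary
    key (a (word, POS) unigram or a pair of them as a bigram), 0 otherwise."""
    ngrams = set(tagged_tokens)                                   # unigrams
    ngrams |= {tuple(tagged_tokens[i:i + 2])                      # bigrams
               for i in range(len(tagged_tokens) - 1)}
    return [1 if key in ngrams else 0 for key in vocabulary]
```

Each tagged token is a `(word, POS)` tuple, matching the `word/TAG` pairs on the slide.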
• 39. APPLICATION I: EXPERIMENT
Used the NYC.com event calendar (Oct 9-11, 2009). Extracted ~400 features.
Example entry: Title: Alice in Chains. Location: Madison Square Garden, 2 Penn Plaza, New York, NY, 10001. Description: "Alice in Chains has sold more than twenty million albums in the United States (and an estimated 40 million worldwide), released two number-one albums and 19 top 40 singles, and has received six Grammy nominations..."
EXPERIMENT 1: 2000 tweets from the same weekend (160 (8%) "Concert", 1840 (92%) "Background"); RBF kernel (C=10, gamma=1.0); testing on 20% -> accuracy of 97%; "false positives"
EXPERIMENT 2: 2000 tweets from the next weekend (160 (8%) "Concert", 1840 (92%) "Background"); RBF kernel (C=10, gamma=1.0); testing on 100% -> accuracy of 93%; "false positives" + "false negatives"; after using NYC.com again -> accuracy of 96%
• 40. APPLICATION II: PRICE CALCULATIONS FOR EACH HOUSEHOLD
[figure] 30 x 96 = 2880 values
• 41. APPLICATION II: PRICE CALCULATIONS FOR EACH HOUSEHOLD
[figure]