A Terascale Learning Algorithm

Alekh Agarwal, Olivier Chapelle, Miroslav Dudik, and John
                        Langford




           ... And the Vowpal Wabbit project.
Applying for a fellowship in 1997



   Interviewer: So, what do you want to do?
   John: I’d like to solve AI.
   I: How?
   J: I want to use parallel learning algorithms to create fantastic
   learning machines!
   I: You fool! The only thing parallel machines are good for is
   computational windtunnels!
   The worst part: he had a point. At that time, smarter learning
   algorithms always won. To win, we must master the best
   single-machine learning algorithms, then clearly beat them with a
   parallel approach.
Demonstration
Terascale Linear Learning ACDL11

   Given 2.1 Terafeatures of data, how can you learn a good linear
   predictor f_w(x) = \sum_i w_i x_i ?


   2.1T sparse features
   17B Examples
   16M parameters
   1K nodes


   70 minutes = 500M features/second: faster than the I/O
   bandwidth of a single machine ⇒ we beat all possible single-machine
   linear learning algorithms.


   (Actually, we can do even better now.)
Compare: Other Supervised Algorithms in Parallel Learning

   [Chart: learning speed in features/s (100 up to 1e+09), single vs. parallel,
   per method: RBF-SVM MPI?-500 (RCV1), Ensemble Tree MPI-128 (Synthetic),
   RBF-SVM TCP-48 (MNIST 220K), Decision Tree MapRed-200 (Ad-Bounce #),
   Boosted DT MPI-32, Ranking # Linear Threads-2 (RCV1), and Linear
   Hadoop+TCP-1000 (Ads *&).]
The tricks we use




    First Vowpal Wabbit       Newer Algorithmics         Parallel Stuff
      Feature Caching         Adaptive Learning      Parameter Averaging
      Feature Hashing        Importance Updates         Smart Averaging
       Online Learning     Dimensional Correction     Gradient Summing
                                   L-BFGS             Hadoop AllReduce
                               Hybrid Learning
   We’ll discuss Hashing, AllReduce, then how to learn.
String −> Index dictionary

   [Diagram: RAM used by the weight table, Conventional (string → index
   hashmap) vs. VW (hash function).]

Most algorithms use a hashmap to change a word into an index for
a weight.
VW uses a hash function which takes almost no RAM, is 10x faster,
and is easily parallelized.
Empirically, radical state compression tricks are possible with this.
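As a rough sketch of the hashing trick (illustrative only: the class, hash
function, and table size below are assumptions, not VW's implementation,
which hashes into a fixed 2^b weight array with its own hash function):

```cpp
#include <cstddef>
#include <functional>
#include <string>
#include <vector>

// Sketch of hashed feature lookup: a feature name maps straight to a weight
// index, so there is no string -> index dictionary to store or synchronize.
struct HashedWeights {
    explicit HashedWeights(unsigned bits)
        : mask((std::size_t(1) << bits) - 1), w(std::size_t(1) << bits, 0.0f) {}

    // Stateless hash -> trivially parallel; collisions are simply tolerated.
    std::size_t index(const std::string& feature) const {
        return std::hash<std::string>{}(feature) & mask;
    }

    float& weight(const std::string& feature) { return w[index(feature)]; }

    std::size_t mask;
    std::vector<float> w;
};

int main() {
    HashedWeights model(18);             // 2^18 = 262,144 weights
    model.weight("user_id=42") += 0.1f;  // collisions are not resolved, just absorbed
    return 0;
}
```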
MPI-style AllReduce

    Allreduce initial state
                    7

         5                    6

    1          2         3        4
   AllReduce = Reduce+Broadcast
MPI-style AllReduce

    Reducing, step 1
                    7

         8                    13

    1          2         3         4
   AllReduce = Reduce+Broadcast
MPI-style AllReduce

    Reducing, step 2
                    28

         8                    13

    1          2         3         4
   AllReduce = Reduce+Broadcast
MPI-style AllReduce

    Broadcast, step 1
                    28

         28                   28

    1          2         3         4
   AllReduce = Reduce+Broadcast
MPI-style AllReduce

    Allreduce final state
                       28

           28                      28
    28           28          28          28
   AllReduce = Reduce+Broadcast
   Properties:
    1   Easily pipelined so no latency concerns.
    2   Bandwidth ≤ 6n.
    3   No need to rewrite code!
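The toy model below shows what the tree AllReduce in these slides computes:
sums flow up to the root, then the total is broadcast back down, so every
node ends holding 28. The real code streams the buffer over TCP and
pipelines the two phases; this recursion is only a sketch of the data flow.

```cpp
#include <iostream>
#include <memory>

// Toy model of tree AllReduce = Reduce + Broadcast over a binary tree.
struct Node {
    double value = 0.0;
    std::unique_ptr<Node> left, right;
};

// Reduce phase: each node adds its children's subtree sums to its own value.
double reduce_up(Node& n) {
    if (n.left)  n.value += reduce_up(*n.left);
    if (n.right) n.value += reduce_up(*n.right);
    return n.value;
}

// Broadcast phase: the root's total is pushed back down to every node.
void broadcast_down(Node& n, double total) {
    n.value = total;
    if (n.left)  broadcast_down(*n.left, total);
    if (n.right) broadcast_down(*n.right, total);
}

void allreduce(Node& root) { broadcast_down(root, reduce_up(root)); }

int main() {
    // The 7-node tree from the slides: leaves 1 2 3 4, internal nodes 5 6, root 7.
    auto node = [](double v) { auto n = std::make_unique<Node>(); n->value = v; return n; };
    auto root = node(7);
    root->left = node(5);  root->left->left = node(1);  root->left->right = node(2);
    root->right = node(6); root->right->left = node(3); root->right->right = node(4);

    allreduce(*root);
    std::cout << root->value << "\n";  // prints 28; every node now holds 28
    return 0;
}
```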
An Example Algorithm: Weight averaging


   n = AllReduce(1)
   While (pass number < max)
     1   While (examples left)
           1   Do online update.
     2   AllReduce(weights)
     3   For each weight: w ← w/n



   Other algorithms implemented:
     1   Nonuniform averaging for online learning
     2   Conjugate Gradient
     3   LBFGS
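A minimal sketch of the weight-averaging loop above, written against an
assumed allreduce_sum callback (a stand-in, not VW's actual API) that sums a
float vector element-wise across all nodes, in place:

```cpp
#include <functional>
#include <vector>

// Typed pseudocode for the weight-averaging algorithm on this slide.
void averaged_online_learning(
    std::vector<float>& weights, int max_passes,
    const std::function<void(std::vector<float>&)>& allreduce_sum,  // assumed: in-place cross-node sum
    const std::function<bool()>& examples_left,                     // assumed: examples left in this pass?
    const std::function<void(std::vector<float>&)>& online_update)  // assumed: one online learning step
{
    // n = AllReduce(1): every node contributes a 1, yielding the node count.
    std::vector<float> one(1, 1.0f);
    allreduce_sum(one);
    const float n = one[0];

    for (int pass = 0; pass < max_passes; ++pass) {
        while (examples_left())
            online_update(weights);  // local online pass over cached examples
        allreduce_sum(weights);      // sum the weight vectors across all nodes
        for (float& w : weights)
            w /= n;                  // turn the sum into an average
    }
}
```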
What is Hadoop-Compatible AllReduce?


     1   “Map” job moves program to data.
     2   Delayed initialization: Most failures are disk failures. First
         read (and cache) all data, before initializing allreduce. Failures
         autorestart on different node with identical data.
     3   Speculative execution: In a busy cluster, one node is often
         slow. Hadoop can speculatively start additional mappers. We
         use the first to finish reading all data once.
   The net effect: Reliable execution out to perhaps 10K node-hours.
Algorithms: Preliminaries


   Optimize so few data passes required ⇒ Smart algorithms.
   Basic problem with gradient descent = confused units.
   f_w(x) = \sum_i w_i x_i
   ⇒ \frac{\partial (f_w(x) - y)^2}{\partial w_i} = 2 (f_w(x) − y) x_i, which has units of x_i.
   But w_i naturally has units of 1/x_i, since doubling x_i implies halving
   w_i to get the same prediction.
   Crude fixes:
     1   Newton: Multiply the gradient by the inverse Hessian
         \left( \frac{\partial^2}{\partial w_i \partial w_j} \right)^{-1} to get the update
         direction... but computational complexity kills you.
     2   Normalize the update so total step size is controlled... but this just
         works globally rather than per dimension.
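To spell out the dimensional argument, here is the rescaling check written in
full (this only restates the slide's claim):

```latex
% Rescaling check for the units argument on this slide.
% Send x_i -> c x_i for a constant c. The prediction f_w(x) = \sum_i w_i x_i
% is unchanged only if w_i -> w_i / c, so w_i carries units of 1/x_i.
% A plain gradient step, however, moves w_i by an amount
\[
  \Delta w_i \;\propto\; \frac{\partial\,(f_w(x)-y)^2}{\partial w_i}
            \;=\; 2\,\bigl(f_w(x)-y\bigr)\,x_i ,
\]
% which scales like x_i (by c) rather than like 1/x_i (by 1/c): the step is
% in the wrong units, which is what the crude fixes above and the adaptive,
% dimensionally correct update on the next slide try to repair.
```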
Approach Used
    1   Optimize hard so few data passes are required.
          1   L-BFGS = batch algorithm that builds up an approximate inverse
              Hessian according to \frac{\Delta w \, \Delta w^T}{\Delta w^T \Delta g},
              where \Delta w is a change in the weights w and \Delta g is a
              change in the loss gradient g.
          2   Dimensionally correct, adaptive, online gradient descent for a
              small number of passes.
                1   Online = update weights after seeing each example.
                2   Adaptive = learning rate of feature i set according to
                    \frac{1}{\sqrt{\sum g_i^2}}, where the g_i are that feature's
                    previous gradients.
                3   Dimensionally correct = still works if you double all feature
                    values.
          3   Use (2) to warmstart (1).
    2   Use map-only Hadoop for process control and error recovery.
    3   Use custom AllReduce code to sync state.
    4   Always save input examples in a cachefile to speed later passes.
    5   Use hashing trick to reduce input complexity.
  Open source in Vowpal Wabbit 6.0. Search for it.
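A rough sketch of the adaptive per-feature update in (2): an AdaGrad-style
rule with learning rate eta / sqrt(sum of squared past gradients) per feature.
The base rate eta, the class names, and the squared-loss choice are my
assumptions for illustration; VW's actual update also handles importance
weights and dimensional correction differently.

```cpp
#include <cmath>
#include <cstddef>
#include <utility>
#include <vector>

// One sparse example: (index, value) feature pairs plus a label.
struct SparseExample {
    std::vector<std::pair<std::size_t, float>> features;
    float label;
};

// Adaptive per-feature online gradient descent for squared loss.
struct AdaptiveLinear {
    AdaptiveLinear(std::size_t dim, float eta) : w(dim, 0.0f), g2(dim, 0.0f), eta(eta) {}

    float predict(const SparseExample& ex) const {
        float p = 0.0f;
        for (const auto& f : ex.features) p += w[f.first] * f.second;  // f_w(x) = sum_i w_i x_i
        return p;
    }

    void update(const SparseExample& ex) {
        const float err = predict(ex) - ex.label;            // f_w(x) - y
        for (const auto& f : ex.features) {
            const float g = 2.0f * err * f.second;           // gradient for this feature
            if (g == 0.0f) continue;                         // nothing to update (avoids 0/0)
            g2[f.first] += g * g;                            // accumulate squared gradients
            w[f.first] -= eta * g / std::sqrt(g2[f.first]);  // adaptive step for feature i
        }
    }

    std::vector<float> w, g2;  // weights and per-feature squared-gradient sums
    float eta;
};
```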
Robustness & Speedup


   [Plot: speedup vs. number of nodes (10 to 100), showing the Average_10,
   Min_10, and Max_10 curves against the ideal linear-speedup line.]
Webspam optimization


   [Plot: webspam test log loss vs. number of passes (0 to 50) for Online,
   Warmstart L-BFGS 1, L-BFGS 1, and Single machine online.]
How does it work?

   The launch sequence (on Hadoop):
     1   mapscript.sh <hdfs output> <hdfs input> <maps asked>
     2   mapscript.sh starts a spanning tree on the gateway. Each vw
         connects to the spanning tree to learn which other vw is adjacent
         in the binary tree.
     3   mapscript.sh starts hadoop streaming map-only job: runvw.sh
     4   runvw.sh calls vw twice.
     5   First vw call does online learning while saving examples in a
         cachefile. Weights are averaged before saving.
     6   Second vw uses online solution to warmstart L-BFGS.
         Gradients are shared via allreduce.
   Example: mapscript.sh outdir indir 100
   Everything also works in a non-Hadoop cluster: you just need more
   scripting to do some of the things that Hadoop does for you.
How does this compare to other cluster learning efforts?
   Mahout was founded as the MapReduce ML project, but has
   grown to include VW-style online linear learning. For linear
   learning, VW is superior:
     1   Vastly faster.
     2   Smarter Algorithms.
   Other algorithms exist in Mahout, but they are apparently buggy.
   AllReduce seems a superior paradigm in the 10K node-hours
   regime for Machine Learning.


   GraphLab (@CMU) doesn’t overlap much: it is mostly about
   graphical model evaluation and learning.


   Ultra LDA (@Y!) overlaps partially. VW has non-cluster-parallel
   LDA which is perhaps 3x more efficient.
Can allreduce be used by others?


   It might be incorporated directly into next generation Hadoop.


   But for now, it’s quite easy to use the existing code.
   void all_reduce(char* buffer, int n, std::string master_location,
   size_t unique_id, size_t total, size_t node);
   buffer = pointer to some floats
   n = number of bytes (4 * #floats)
   master_location = IP address of gateway
   unique_id = nonce (unique for different jobs)
   total = total number of nodes
   node = node id number
   The call is stateful: it initializes the topology if necessary.
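For example, averaging a local weight vector across the cluster with the call
above would look roughly like this. The wrapper and its parameter names are
mine, and it assumes (as the weight-averaging slide suggests) that all_reduce
overwrites the buffer with the element-wise sum over all nodes:

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Declaration mirroring the signature above; the implementation comes from
// the existing allreduce code.
void all_reduce(char* buffer, int n, std::string master_location,
                std::size_t unique_id, std::size_t total, std::size_t node);

// Hypothetical helper: average a local weight vector across the cluster.
void average_weights(std::vector<float>& weights, const std::string& gateway_ip,
                     std::size_t job_nonce, std::size_t total_nodes, std::size_t node_id) {
    all_reduce(reinterpret_cast<char*>(weights.data()),
               static_cast<int>(weights.size() * sizeof(float)),  // n = bytes = 4 * #floats
               gateway_ip, job_nonce, total_nodes, node_id);
    for (float& w : weights)
        w /= static_cast<float>(total_nodes);                     // sum -> average
}
```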
Further Pointers



   search://Vowpal Wabbit


   mailing list: vowpal_wabbit@yahoogroups.com


   VW tutorial (Preparallel): http://videolectures.net


   Machine Learning (Theory) blog: http://hunch.net
Bibliography: Original VW


Caching L. Bottou. Stochastic Gradient Descent Examples on Toy
        Problems, http://leon.bottou.org/projects/sgd, 2007.
Release Vowpal Wabbit open source project,
        http://github.com/JohnLangford/vowpal_wabbit/wiki,
        2007.
Hashing Q. Shi, J. Petterson, G. Dror, J. Langford, A. Smola, and
        S. V. N. Vishwanathan, Hash Kernels for Structured Data,
        AISTATS 2009.
Hashing K. Weinberger, A. Dasgupta, J. Langford, A. Smola, and J.
        Attenberg, Feature Hashing for Large Scale Multitask
        Learning, ICML 2009.
Bibliography: Algorithmics



L-BFGS J. Nocedal, Updating Quasi-Newton Matrices with Limited
       Storage, Mathematics of Computation 35:773–782, 1980.
Adaptive H. B. McMahan and M. Streeter, Adaptive Bound
         Optimization for Online Convex Optimization, COLT 2010.
Adaptive J. Duchi, E. Hazan, and Y. Singer, Adaptive Subgradient
         Methods for Online Learning and Stochastic Optimization,
         COLT 2010.
 Import. N. Karampatziakis, and J. Langford, Online Importance
         Weight Aware Updates, UAI 2011.
Bibliography: Parallel



grad sum C. Teo, Q. Le, A. Smola, and S. V. N. Vishwanathan, A Scalable
         Modular Convex Solver for Regularized Risk Minimization,
         KDD 2007.
averaging G. Mann, R. McDonald, M. Mohri, N. Silberman, and D.
          Walker, Efficient Large-Scale Distributed Training of Conditional
          Maximum Entropy Models, NIPS 2009.
averaging K. Hall, S. Gilpin, and G. Mann, MapReduce/Bigtable for
          Distributed Optimization, LCCC 2010.
     (More forthcoming, of course)

Más contenido relacionado

La actualidad más candente

Modern classification techniques
Modern classification techniquesModern classification techniques
Modern classification techniquesmark_landry
 
Tianqi Chen, PhD Student, University of Washington, at MLconf Seattle 2017
Tianqi Chen, PhD Student, University of Washington, at MLconf Seattle 2017Tianqi Chen, PhD Student, University of Washington, at MLconf Seattle 2017
Tianqi Chen, PhD Student, University of Washington, at MLconf Seattle 2017MLconf
 
Le Song, Assistant Professor, College of Computing, Georgia Institute of Tech...
Le Song, Assistant Professor, College of Computing, Georgia Institute of Tech...Le Song, Assistant Professor, College of Computing, Georgia Institute of Tech...
Le Song, Assistant Professor, College of Computing, Georgia Institute of Tech...MLconf
 
Sergei Vassilvitskii, Research Scientist, Google at MLconf NYC - 4/15/16
Sergei Vassilvitskii, Research Scientist, Google at MLconf NYC - 4/15/16Sergei Vassilvitskii, Research Scientist, Google at MLconf NYC - 4/15/16
Sergei Vassilvitskii, Research Scientist, Google at MLconf NYC - 4/15/16MLconf
 
Software Frameworks for Deep Learning (D1L7 2017 UPC Deep Learning for Comput...
Software Frameworks for Deep Learning (D1L7 2017 UPC Deep Learning for Comput...Software Frameworks for Deep Learning (D1L7 2017 UPC Deep Learning for Comput...
Software Frameworks for Deep Learning (D1L7 2017 UPC Deep Learning for Comput...Universitat Politècnica de Catalunya
 
Anima Anadkumar, Principal Scientist, Amazon Web Services, Endowed Professor,...
Anima Anadkumar, Principal Scientist, Amazon Web Services, Endowed Professor,...Anima Anadkumar, Principal Scientist, Amazon Web Services, Endowed Professor,...
Anima Anadkumar, Principal Scientist, Amazon Web Services, Endowed Professor,...MLconf
 
Narayanan Sundaram, Research Scientist, Intel Labs at MLconf SF - 11/13/15
Narayanan Sundaram, Research Scientist, Intel Labs at MLconf SF - 11/13/15Narayanan Sundaram, Research Scientist, Intel Labs at MLconf SF - 11/13/15
Narayanan Sundaram, Research Scientist, Intel Labs at MLconf SF - 11/13/15MLconf
 
Parallel External Memory Algorithms Applied to Generalized Linear Models
Parallel External Memory Algorithms Applied to Generalized Linear ModelsParallel External Memory Algorithms Applied to Generalized Linear Models
Parallel External Memory Algorithms Applied to Generalized Linear ModelsRevolution Analytics
 
Sebastian Schelter – Distributed Machine Learing with the Samsara DSL
Sebastian Schelter – Distributed Machine Learing with the Samsara DSLSebastian Schelter – Distributed Machine Learing with the Samsara DSL
Sebastian Schelter – Distributed Machine Learing with the Samsara DSLFlink Forward
 
Scaling out logistic regression with Spark
Scaling out logistic regression with SparkScaling out logistic regression with Spark
Scaling out logistic regression with SparkBarak Gitsis
 
Optimization for Deep Networks (D2L1 2017 UPC Deep Learning for Computer Vision)
Optimization for Deep Networks (D2L1 2017 UPC Deep Learning for Computer Vision)Optimization for Deep Networks (D2L1 2017 UPC Deep Learning for Computer Vision)
Optimization for Deep Networks (D2L1 2017 UPC Deep Learning for Computer Vision)Universitat Politècnica de Catalunya
 
Distributed implementation of a lstm on spark and tensorflow
Distributed implementation of a lstm on spark and tensorflowDistributed implementation of a lstm on spark and tensorflow
Distributed implementation of a lstm on spark and tensorflowEmanuel Di Nardo
 
Andrew Musselman, Committer and PMC Member, Apache Mahout, at MLconf Seattle ...
Andrew Musselman, Committer and PMC Member, Apache Mahout, at MLconf Seattle ...Andrew Musselman, Committer and PMC Member, Apache Mahout, at MLconf Seattle ...
Andrew Musselman, Committer and PMC Member, Apache Mahout, at MLconf Seattle ...MLconf
 
Braxton McKee, Founder & CEO, Ufora at MLconf SF - 11/13/15
Braxton McKee, Founder & CEO, Ufora at MLconf SF - 11/13/15Braxton McKee, Founder & CEO, Ufora at MLconf SF - 11/13/15
Braxton McKee, Founder & CEO, Ufora at MLconf SF - 11/13/15MLconf
 
Making fitting in RooFit faster
Making fitting in RooFit fasterMaking fitting in RooFit faster
Making fitting in RooFit fasterPatrick Bos
 
An Introduction to TensorFlow architecture
An Introduction to TensorFlow architectureAn Introduction to TensorFlow architecture
An Introduction to TensorFlow architectureMani Goswami
 
Deep Learning in Python with Tensorflow for Finance
Deep Learning in Python with Tensorflow for FinanceDeep Learning in Python with Tensorflow for Finance
Deep Learning in Python with Tensorflow for FinanceBen Ball
 
Large data with Scikit-learn - Boston Data Mining Meetup - Alex Perrier
Large data with Scikit-learn - Boston Data Mining Meetup  - Alex PerrierLarge data with Scikit-learn - Boston Data Mining Meetup  - Alex Perrier
Large data with Scikit-learn - Boston Data Mining Meetup - Alex PerrierAlexis Perrier
 
Tom Peters, Software Engineer, Ufora at MLconf ATL 2016
Tom Peters, Software Engineer, Ufora at MLconf ATL 2016Tom Peters, Software Engineer, Ufora at MLconf ATL 2016
Tom Peters, Software Engineer, Ufora at MLconf ATL 2016MLconf
 

La actualidad más candente (20)

Modern classification techniques
Modern classification techniquesModern classification techniques
Modern classification techniques
 
Tianqi Chen, PhD Student, University of Washington, at MLconf Seattle 2017
Tianqi Chen, PhD Student, University of Washington, at MLconf Seattle 2017Tianqi Chen, PhD Student, University of Washington, at MLconf Seattle 2017
Tianqi Chen, PhD Student, University of Washington, at MLconf Seattle 2017
 
Le Song, Assistant Professor, College of Computing, Georgia Institute of Tech...
Le Song, Assistant Professor, College of Computing, Georgia Institute of Tech...Le Song, Assistant Professor, College of Computing, Georgia Institute of Tech...
Le Song, Assistant Professor, College of Computing, Georgia Institute of Tech...
 
Sergei Vassilvitskii, Research Scientist, Google at MLconf NYC - 4/15/16
Sergei Vassilvitskii, Research Scientist, Google at MLconf NYC - 4/15/16Sergei Vassilvitskii, Research Scientist, Google at MLconf NYC - 4/15/16
Sergei Vassilvitskii, Research Scientist, Google at MLconf NYC - 4/15/16
 
Software Frameworks for Deep Learning (D1L7 2017 UPC Deep Learning for Comput...
Software Frameworks for Deep Learning (D1L7 2017 UPC Deep Learning for Comput...Software Frameworks for Deep Learning (D1L7 2017 UPC Deep Learning for Comput...
Software Frameworks for Deep Learning (D1L7 2017 UPC Deep Learning for Comput...
 
Anima Anadkumar, Principal Scientist, Amazon Web Services, Endowed Professor,...
Anima Anadkumar, Principal Scientist, Amazon Web Services, Endowed Professor,...Anima Anadkumar, Principal Scientist, Amazon Web Services, Endowed Professor,...
Anima Anadkumar, Principal Scientist, Amazon Web Services, Endowed Professor,...
 
Narayanan Sundaram, Research Scientist, Intel Labs at MLconf SF - 11/13/15
Narayanan Sundaram, Research Scientist, Intel Labs at MLconf SF - 11/13/15Narayanan Sundaram, Research Scientist, Intel Labs at MLconf SF - 11/13/15
Narayanan Sundaram, Research Scientist, Intel Labs at MLconf SF - 11/13/15
 
Parallel External Memory Algorithms Applied to Generalized Linear Models
Parallel External Memory Algorithms Applied to Generalized Linear ModelsParallel External Memory Algorithms Applied to Generalized Linear Models
Parallel External Memory Algorithms Applied to Generalized Linear Models
 
Sebastian Schelter – Distributed Machine Learing with the Samsara DSL
Sebastian Schelter – Distributed Machine Learing with the Samsara DSLSebastian Schelter – Distributed Machine Learing with the Samsara DSL
Sebastian Schelter – Distributed Machine Learing with the Samsara DSL
 
Scaling out logistic regression with Spark
Scaling out logistic regression with SparkScaling out logistic regression with Spark
Scaling out logistic regression with Spark
 
Optimization for Deep Networks (D2L1 2017 UPC Deep Learning for Computer Vision)
Optimization for Deep Networks (D2L1 2017 UPC Deep Learning for Computer Vision)Optimization for Deep Networks (D2L1 2017 UPC Deep Learning for Computer Vision)
Optimization for Deep Networks (D2L1 2017 UPC Deep Learning for Computer Vision)
 
Distributed implementation of a lstm on spark and tensorflow
Distributed implementation of a lstm on spark and tensorflowDistributed implementation of a lstm on spark and tensorflow
Distributed implementation of a lstm on spark and tensorflow
 
Andrew Musselman, Committer and PMC Member, Apache Mahout, at MLconf Seattle ...
Andrew Musselman, Committer and PMC Member, Apache Mahout, at MLconf Seattle ...Andrew Musselman, Committer and PMC Member, Apache Mahout, at MLconf Seattle ...
Andrew Musselman, Committer and PMC Member, Apache Mahout, at MLconf Seattle ...
 
Braxton McKee, Founder & CEO, Ufora at MLconf SF - 11/13/15
Braxton McKee, Founder & CEO, Ufora at MLconf SF - 11/13/15Braxton McKee, Founder & CEO, Ufora at MLconf SF - 11/13/15
Braxton McKee, Founder & CEO, Ufora at MLconf SF - 11/13/15
 
Deep Learning in theano
Deep Learning in theanoDeep Learning in theano
Deep Learning in theano
 
Making fitting in RooFit faster
Making fitting in RooFit fasterMaking fitting in RooFit faster
Making fitting in RooFit faster
 
An Introduction to TensorFlow architecture
An Introduction to TensorFlow architectureAn Introduction to TensorFlow architecture
An Introduction to TensorFlow architecture
 
Deep Learning in Python with Tensorflow for Finance
Deep Learning in Python with Tensorflow for FinanceDeep Learning in Python with Tensorflow for Finance
Deep Learning in Python with Tensorflow for Finance
 
Large data with Scikit-learn - Boston Data Mining Meetup - Alex Perrier
Large data with Scikit-learn - Boston Data Mining Meetup  - Alex PerrierLarge data with Scikit-learn - Boston Data Mining Meetup  - Alex Perrier
Large data with Scikit-learn - Boston Data Mining Meetup - Alex Perrier
 
Tom Peters, Software Engineer, Ufora at MLconf ATL 2016
Tom Peters, Software Engineer, Ufora at MLconf ATL 2016Tom Peters, Software Engineer, Ufora at MLconf ATL 2016
Tom Peters, Software Engineer, Ufora at MLconf ATL 2016
 

Destacado

Distributed machine learning examples
Distributed machine learning examplesDistributed machine learning examples
Distributed machine learning examplesStanley Wang
 
Challenges in Large Scale Machine Learning
Challenges in Large Scale  Machine LearningChallenges in Large Scale  Machine Learning
Challenges in Large Scale Machine LearningSudarsun Santhiappan
 
Neural Networks and Deep Learning for Physicists
Neural Networks and Deep Learning for PhysicistsNeural Networks and Deep Learning for Physicists
Neural Networks and Deep Learning for PhysicistsHéloïse Nonne
 
Distributed machine learning
Distributed machine learningDistributed machine learning
Distributed machine learningStanley Wang
 
H2O World - Consensus Optimization and Machine Learning - Stephen Boyd
H2O World - Consensus Optimization and Machine Learning - Stephen BoydH2O World - Consensus Optimization and Machine Learning - Stephen Boyd
H2O World - Consensus Optimization and Machine Learning - Stephen BoydSri Ambati
 
Boccia, il curriculum per il concorso
Boccia, il curriculum per il concorsoBoccia, il curriculum per il concorso
Boccia, il curriculum per il concorsoilfattoquotidiano.it
 
Boccia, il curriculum dopo la correzione
Boccia, il curriculum dopo la correzioneBoccia, il curriculum dopo la correzione
Boccia, il curriculum dopo la correzioneilfattoquotidiano.it
 
NIPS2013読み会: More Effective Distributed ML via a Stale Synchronous Parallel P...
NIPS2013読み会: More Effective Distributed ML via a Stale Synchronous Parallel P...NIPS2013読み会: More Effective Distributed ML via a Stale Synchronous Parallel P...
NIPS2013読み会: More Effective Distributed ML via a Stale Synchronous Parallel P...Shohei Hido
 
Lessons from 2MM machine learning models
Lessons from 2MM machine learning modelsLessons from 2MM machine learning models
Lessons from 2MM machine learning modelsExtract Data Conference
 
Deep Water - Bringing Tensorflow, Caffe, Mxnet to H2O
Deep Water - Bringing Tensorflow, Caffe, Mxnet to H2ODeep Water - Bringing Tensorflow, Caffe, Mxnet to H2O
Deep Water - Bringing Tensorflow, Caffe, Mxnet to H2OSri Ambati
 

Destacado (10)

Distributed machine learning examples
Distributed machine learning examplesDistributed machine learning examples
Distributed machine learning examples
 
Challenges in Large Scale Machine Learning
Challenges in Large Scale  Machine LearningChallenges in Large Scale  Machine Learning
Challenges in Large Scale Machine Learning
 
Neural Networks and Deep Learning for Physicists
Neural Networks and Deep Learning for PhysicistsNeural Networks and Deep Learning for Physicists
Neural Networks and Deep Learning for Physicists
 
Distributed machine learning
Distributed machine learningDistributed machine learning
Distributed machine learning
 
H2O World - Consensus Optimization and Machine Learning - Stephen Boyd
H2O World - Consensus Optimization and Machine Learning - Stephen BoydH2O World - Consensus Optimization and Machine Learning - Stephen Boyd
H2O World - Consensus Optimization and Machine Learning - Stephen Boyd
 
Boccia, il curriculum per il concorso
Boccia, il curriculum per il concorsoBoccia, il curriculum per il concorso
Boccia, il curriculum per il concorso
 
Boccia, il curriculum dopo la correzione
Boccia, il curriculum dopo la correzioneBoccia, il curriculum dopo la correzione
Boccia, il curriculum dopo la correzione
 
NIPS2013読み会: More Effective Distributed ML via a Stale Synchronous Parallel P...
NIPS2013読み会: More Effective Distributed ML via a Stale Synchronous Parallel P...NIPS2013読み会: More Effective Distributed ML via a Stale Synchronous Parallel P...
NIPS2013読み会: More Effective Distributed ML via a Stale Synchronous Parallel P...
 
Lessons from 2MM machine learning models
Lessons from 2MM machine learning modelsLessons from 2MM machine learning models
Lessons from 2MM machine learning models
 
Deep Water - Bringing Tensorflow, Caffe, Mxnet to H2O
Deep Water - Bringing Tensorflow, Caffe, Mxnet to H2ODeep Water - Bringing Tensorflow, Caffe, Mxnet to H2O
Deep Water - Bringing Tensorflow, Caffe, Mxnet to H2O
 

Similar a Terascale Learning

Large-Scale Training with GPUs at Facebook
Large-Scale Training with GPUs at FacebookLarge-Scale Training with GPUs at Facebook
Large-Scale Training with GPUs at FacebookFaisal Siddiqi
 
Optimization of Collective Communication in MPICH
Optimization of Collective Communication in MPICH Optimization of Collective Communication in MPICH
Optimization of Collective Communication in MPICH Lino Possamai
 
Parallel Programming on the ANDC cluster
Parallel Programming on the ANDC clusterParallel Programming on the ANDC cluster
Parallel Programming on the ANDC clusterSudhang Shankar
 
From Hours to Minutes: The Journey of Optimizing Mask-RCNN and BERT Using MXNet
From Hours to Minutes: The Journey of Optimizing Mask-RCNN and BERT Using MXNetFrom Hours to Minutes: The Journey of Optimizing Mask-RCNN and BERT Using MXNet
From Hours to Minutes: The Journey of Optimizing Mask-RCNN and BERT Using MXNetEric Haibin Lin
 
Graph processing
Graph processingGraph processing
Graph processingyeahjs
 
Seminar Presentation Hadoop
Seminar Presentation HadoopSeminar Presentation Hadoop
Seminar Presentation HadoopVarun Narang
 
Parallelism in a NumPy-based program
Parallelism in a NumPy-based programParallelism in a NumPy-based program
Parallelism in a NumPy-based programRalf Gommers
 
Hadoop fault tolerance
Hadoop  fault toleranceHadoop  fault tolerance
Hadoop fault tolerancePallav Jha
 
Generalized Pipeline Parallelism for DNN Training
Generalized Pipeline Parallelism for DNN TrainingGeneralized Pipeline Parallelism for DNN Training
Generalized Pipeline Parallelism for DNN TrainingDatabricks
 
Hadoop interview questions - Softwarequery.com
Hadoop interview questions - Softwarequery.comHadoop interview questions - Softwarequery.com
Hadoop interview questions - Softwarequery.comsoftwarequery
 
Boston Hug by Ted Dunning 2012
Boston Hug by Ted Dunning 2012Boston Hug by Ted Dunning 2012
Boston Hug by Ted Dunning 2012MapR Technologies
 
What is Distributed Computing, Why we use Apache Spark
What is Distributed Computing, Why we use Apache SparkWhat is Distributed Computing, Why we use Apache Spark
What is Distributed Computing, Why we use Apache SparkAndy Petrella
 
Hadoop training-in-hyderabad
Hadoop training-in-hyderabadHadoop training-in-hyderabad
Hadoop training-in-hyderabadsreehari orienit
 
MapReduce: teoria e prática
MapReduce: teoria e práticaMapReduce: teoria e prática
MapReduce: teoria e práticaPET Computação
 
[2C1] 아파치 피그를 위한 테즈 연산 엔진 개발하기 최종
[2C1] 아파치 피그를 위한 테즈 연산 엔진 개발하기 최종[2C1] 아파치 피그를 위한 테즈 연산 엔진 개발하기 최종
[2C1] 아파치 피그를 위한 테즈 연산 엔진 개발하기 최종NAVER D2
 
08 neural networks
08 neural networks08 neural networks
08 neural networksankit_ppt
 
Auto encoders in Deep Learning
Auto encoders in Deep LearningAuto encoders in Deep Learning
Auto encoders in Deep LearningShajun Nisha
 

Similar a Terascale Learning (20)

Ndp Slides
Ndp SlidesNdp Slides
Ndp Slides
 
Large-Scale Training with GPUs at Facebook
Large-Scale Training with GPUs at FacebookLarge-Scale Training with GPUs at Facebook
Large-Scale Training with GPUs at Facebook
 
Optimization of Collective Communication in MPICH
Optimization of Collective Communication in MPICH Optimization of Collective Communication in MPICH
Optimization of Collective Communication in MPICH
 
Parallel Programming on the ANDC cluster
Parallel Programming on the ANDC clusterParallel Programming on the ANDC cluster
Parallel Programming on the ANDC cluster
 
Hadoop ecosystem
Hadoop ecosystemHadoop ecosystem
Hadoop ecosystem
 
From Hours to Minutes: The Journey of Optimizing Mask-RCNN and BERT Using MXNet
From Hours to Minutes: The Journey of Optimizing Mask-RCNN and BERT Using MXNetFrom Hours to Minutes: The Journey of Optimizing Mask-RCNN and BERT Using MXNet
From Hours to Minutes: The Journey of Optimizing Mask-RCNN and BERT Using MXNet
 
Graph processing
Graph processingGraph processing
Graph processing
 
Seminar Presentation Hadoop
Seminar Presentation HadoopSeminar Presentation Hadoop
Seminar Presentation Hadoop
 
Parallelism in a NumPy-based program
Parallelism in a NumPy-based programParallelism in a NumPy-based program
Parallelism in a NumPy-based program
 
Hadoop fault tolerance
Hadoop  fault toleranceHadoop  fault tolerance
Hadoop fault tolerance
 
Generalized Pipeline Parallelism for DNN Training
Generalized Pipeline Parallelism for DNN TrainingGeneralized Pipeline Parallelism for DNN Training
Generalized Pipeline Parallelism for DNN Training
 
Hadoop interview questions - Softwarequery.com
Hadoop interview questions - Softwarequery.comHadoop interview questions - Softwarequery.com
Hadoop interview questions - Softwarequery.com
 
Boston Hug by Ted Dunning 2012
Boston Hug by Ted Dunning 2012Boston Hug by Ted Dunning 2012
Boston Hug by Ted Dunning 2012
 
What is Distributed Computing, Why we use Apache Spark
What is Distributed Computing, Why we use Apache SparkWhat is Distributed Computing, Why we use Apache Spark
What is Distributed Computing, Why we use Apache Spark
 
Hadoop training-in-hyderabad
Hadoop training-in-hyderabadHadoop training-in-hyderabad
Hadoop training-in-hyderabad
 
MapReduce: teoria e prática
MapReduce: teoria e práticaMapReduce: teoria e prática
MapReduce: teoria e prática
 
No stress with state
No stress with stateNo stress with state
No stress with state
 
[2C1] 아파치 피그를 위한 테즈 연산 엔진 개발하기 최종
[2C1] 아파치 피그를 위한 테즈 연산 엔진 개발하기 최종[2C1] 아파치 피그를 위한 테즈 연산 엔진 개발하기 최종
[2C1] 아파치 피그를 위한 테즈 연산 엔진 개발하기 최종
 
08 neural networks
08 neural networks08 neural networks
08 neural networks
 
Auto encoders in Deep Learning
Auto encoders in Deep LearningAuto encoders in Deep Learning
Auto encoders in Deep Learning
 

Más de pauldix

An introduction to reinforcement learning (rl)
An introduction to reinforcement learning (rl)An introduction to reinforcement learning (rl)
An introduction to reinforcement learning (rl)pauldix
 
Indexing thousands of writes per second with redis
Indexing thousands of writes per second with redisIndexing thousands of writes per second with redis
Indexing thousands of writes per second with redispauldix
 
Building Web Service Clients with ActiveModel
Building Web Service Clients with ActiveModelBuilding Web Service Clients with ActiveModel
Building Web Service Clients with ActiveModelpauldix
 
Building Web Service Clients with ActiveModel
Building Web Service Clients with ActiveModelBuilding Web Service Clients with ActiveModel
Building Web Service Clients with ActiveModelpauldix
 
Building Event-Based Systems for the Real-Time Web
Building Event-Based Systems for the Real-Time WebBuilding Event-Based Systems for the Real-Time Web
Building Event-Based Systems for the Real-Time Webpauldix
 
Synchronous Reads Asynchronous Writes RubyConf 2009
Synchronous Reads Asynchronous Writes RubyConf 2009Synchronous Reads Asynchronous Writes RubyConf 2009
Synchronous Reads Asynchronous Writes RubyConf 2009pauldix
 
Machine Learning Techniques for the Semantic Web
Machine Learning Techniques for the Semantic WebMachine Learning Techniques for the Semantic Web
Machine Learning Techniques for the Semantic Webpauldix
 

Más de pauldix (7)

An introduction to reinforcement learning (rl)
An introduction to reinforcement learning (rl)An introduction to reinforcement learning (rl)
An introduction to reinforcement learning (rl)
 
Indexing thousands of writes per second with redis
Indexing thousands of writes per second with redisIndexing thousands of writes per second with redis
Indexing thousands of writes per second with redis
 
Building Web Service Clients with ActiveModel
Building Web Service Clients with ActiveModelBuilding Web Service Clients with ActiveModel
Building Web Service Clients with ActiveModel
 
Building Web Service Clients with ActiveModel
Building Web Service Clients with ActiveModelBuilding Web Service Clients with ActiveModel
Building Web Service Clients with ActiveModel
 
Building Event-Based Systems for the Real-Time Web
Building Event-Based Systems for the Real-Time WebBuilding Event-Based Systems for the Real-Time Web
Building Event-Based Systems for the Real-Time Web
 
Synchronous Reads Asynchronous Writes RubyConf 2009
Synchronous Reads Asynchronous Writes RubyConf 2009Synchronous Reads Asynchronous Writes RubyConf 2009
Synchronous Reads Asynchronous Writes RubyConf 2009
 
Machine Learning Techniques for the Semantic Web
Machine Learning Techniques for the Semantic WebMachine Learning Techniques for the Semantic Web
Machine Learning Techniques for the Semantic Web
 

Último

Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptx
Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptxUse of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptx
Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptxLoriGlavin3
 
DSPy a system for AI to Write Prompts and Do Fine Tuning
DSPy a system for AI to Write Prompts and Do Fine TuningDSPy a system for AI to Write Prompts and Do Fine Tuning
DSPy a system for AI to Write Prompts and Do Fine TuningLars Bell
 
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptx
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptxMerck Moving Beyond Passwords: FIDO Paris Seminar.pptx
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptxLoriGlavin3
 
DevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platformsDevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platformsSergiu Bodiu
 
Moving Beyond Passwords: FIDO Paris Seminar.pdf
Moving Beyond Passwords: FIDO Paris Seminar.pdfMoving Beyond Passwords: FIDO Paris Seminar.pdf
Moving Beyond Passwords: FIDO Paris Seminar.pdfLoriGlavin3
 
Generative AI for Technical Writer or Information Developers
Generative AI for Technical Writer or Information DevelopersGenerative AI for Technical Writer or Information Developers
Generative AI for Technical Writer or Information DevelopersRaghuram Pandurangan
 
The Ultimate Guide to Choosing WordPress Pros and Cons
The Ultimate Guide to Choosing WordPress Pros and ConsThe Ultimate Guide to Choosing WordPress Pros and Cons
The Ultimate Guide to Choosing WordPress Pros and ConsPixlogix Infotech
 
Artificial intelligence in cctv survelliance.pptx
Artificial intelligence in cctv survelliance.pptxArtificial intelligence in cctv survelliance.pptx
Artificial intelligence in cctv survelliance.pptxhariprasad279825
 
Time Series Foundation Models - current state and future directions
Time Series Foundation Models - current state and future directionsTime Series Foundation Models - current state and future directions
Time Series Foundation Models - current state and future directionsNathaniel Shimoni
 
What is DBT - The Ultimate Data Build Tool.pdf
What is DBT - The Ultimate Data Build Tool.pdfWhat is DBT - The Ultimate Data Build Tool.pdf
What is DBT - The Ultimate Data Build Tool.pdfMounikaPolabathina
 
Digital Identity is Under Attack: FIDO Paris Seminar.pptx
Digital Identity is Under Attack: FIDO Paris Seminar.pptxDigital Identity is Under Attack: FIDO Paris Seminar.pptx
Digital Identity is Under Attack: FIDO Paris Seminar.pptxLoriGlavin3
 
Unraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdfUnraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdfAlex Barbosa Coqueiro
 
How AI, OpenAI, and ChatGPT impact business and software.
How AI, OpenAI, and ChatGPT impact business and software.How AI, OpenAI, and ChatGPT impact business and software.
How AI, OpenAI, and ChatGPT impact business and software.Curtis Poe
 
Advanced Computer Architecture – An Introduction
Advanced Computer Architecture – An IntroductionAdvanced Computer Architecture – An Introduction
Advanced Computer Architecture – An IntroductionDilum Bandara
 
SALESFORCE EDUCATION CLOUD | FEXLE SERVICES
SALESFORCE EDUCATION CLOUD | FEXLE SERVICESSALESFORCE EDUCATION CLOUD | FEXLE SERVICES
SALESFORCE EDUCATION CLOUD | FEXLE SERVICESmohitsingh558521
 
New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024BookNet Canada
 
Nell’iperspazio con Rocket: il Framework Web di Rust!
Nell’iperspazio con Rocket: il Framework Web di Rust!Nell’iperspazio con Rocket: il Framework Web di Rust!
Nell’iperspazio con Rocket: il Framework Web di Rust!Commit University
 
The Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptx
The Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptxThe Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptx
The Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptxLoriGlavin3
 
A Journey Into the Emotions of Software Developers
A Journey Into the Emotions of Software DevelopersA Journey Into the Emotions of Software Developers
A Journey Into the Emotions of Software DevelopersNicole Novielli
 
Anypoint Exchange: It’s Not Just a Repo!
Anypoint Exchange: It’s Not Just a Repo!Anypoint Exchange: It’s Not Just a Repo!
Anypoint Exchange: It’s Not Just a Repo!Manik S Magar
 

Último (20)

Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptx
Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptxUse of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptx
Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptx
 
DSPy a system for AI to Write Prompts and Do Fine Tuning
DSPy a system for AI to Write Prompts and Do Fine TuningDSPy a system for AI to Write Prompts and Do Fine Tuning
DSPy a system for AI to Write Prompts and Do Fine Tuning
 
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptx
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptxMerck Moving Beyond Passwords: FIDO Paris Seminar.pptx
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptx
 
DevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platformsDevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platforms
 
Moving Beyond Passwords: FIDO Paris Seminar.pdf
Moving Beyond Passwords: FIDO Paris Seminar.pdfMoving Beyond Passwords: FIDO Paris Seminar.pdf
Moving Beyond Passwords: FIDO Paris Seminar.pdf
 
Generative AI for Technical Writer or Information Developers
Generative AI for Technical Writer or Information DevelopersGenerative AI for Technical Writer or Information Developers
Generative AI for Technical Writer or Information Developers
 
The Ultimate Guide to Choosing WordPress Pros and Cons
The Ultimate Guide to Choosing WordPress Pros and ConsThe Ultimate Guide to Choosing WordPress Pros and Cons
The Ultimate Guide to Choosing WordPress Pros and Cons
 
Artificial intelligence in cctv survelliance.pptx
Artificial intelligence in cctv survelliance.pptxArtificial intelligence in cctv survelliance.pptx
Artificial intelligence in cctv survelliance.pptx
 
Time Series Foundation Models - current state and future directions
Time Series Foundation Models - current state and future directionsTime Series Foundation Models - current state and future directions
Time Series Foundation Models - current state and future directions
 
What is DBT - The Ultimate Data Build Tool.pdf
What is DBT - The Ultimate Data Build Tool.pdfWhat is DBT - The Ultimate Data Build Tool.pdf
What is DBT - The Ultimate Data Build Tool.pdf
 
Digital Identity is Under Attack: FIDO Paris Seminar.pptx
Digital Identity is Under Attack: FIDO Paris Seminar.pptxDigital Identity is Under Attack: FIDO Paris Seminar.pptx
Digital Identity is Under Attack: FIDO Paris Seminar.pptx
 
Unraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdfUnraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdf
 
How AI, OpenAI, and ChatGPT impact business and software.
How AI, OpenAI, and ChatGPT impact business and software.How AI, OpenAI, and ChatGPT impact business and software.
How AI, OpenAI, and ChatGPT impact business and software.
 
Advanced Computer Architecture – An Introduction
Advanced Computer Architecture – An IntroductionAdvanced Computer Architecture – An Introduction
Advanced Computer Architecture – An Introduction
 
SALESFORCE EDUCATION CLOUD | FEXLE SERVICES
SALESFORCE EDUCATION CLOUD | FEXLE SERVICESSALESFORCE EDUCATION CLOUD | FEXLE SERVICES
SALESFORCE EDUCATION CLOUD | FEXLE SERVICES
 
New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
 
Nell’iperspazio con Rocket: il Framework Web di Rust!
Nell’iperspazio con Rocket: il Framework Web di Rust!Nell’iperspazio con Rocket: il Framework Web di Rust!
Nell’iperspazio con Rocket: il Framework Web di Rust!
 
The Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptx
The Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptxThe Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptx
The Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptx
 
A Journey Into the Emotions of Software Developers
A Journey Into the Emotions of Software DevelopersA Journey Into the Emotions of Software Developers
A Journey Into the Emotions of Software Developers
 
Anypoint Exchange: It’s Not Just a Repo!
Anypoint Exchange: It’s Not Just a Repo!Anypoint Exchange: It’s Not Just a Repo!
Anypoint Exchange: It’s Not Just a Repo!
 

Terascale Learning

  • 1. A Terascale Learning Algorithm Alekh Agarwal, Olivier Chapelle, Miroslav Dudik, and John Langford ... And the Vowpal Wabbit project.
  • 2. Applying for a fellowship in 1997
  • 3. Applying for a fellowship in 1997 Interviewer: So, what do you want to do?
  • 4. Applying for a fellowship in 1997 Interviewer: So, what do you want to do? John: I’d like to solve AI.
  • 5. Applying for a fellowship in 1997 Interviewer: So, what do you want to do? John: I’d like to solve AI. I: How?
  • 6. Applying for a fellowship in 1997 Interviewer: So, what do you want to do? John: I’d like to solve AI. I: How? J: I want to use parallel learning algorithms to create fantastic learning machines!
  • 7. Applying for a fellowship in 1997 Interviewer: So, what do you want to do? John: I’d like to solve AI. I: How? J: I want to use parallel learning algorithms to create fantastic learning machines! I: You fool! The only thing parallel machines are good for is computational windtunnels!
  • 8. Applying for a fellowship in 1997 Interviewer: So, what do you want to do? John: I’d like to solve AI. I: How? J: I want to use parallel learning algorithms to create fantastic learning machines! I: You fool! The only thing parallel machines are good for is computational windtunnels! The worst part: he had a point. At that time, smarter learning algorithms always won. To win, we must master the best single-machine learning algorithms, then clearly beat them with a parallel approach.
  • 10. Terascale Linear Learning ACDL11 Given 2.1 Terafeatures of data, how can you learn a good linear predictor fw (x) = i wi xi ?
  • 11. Terascale Linear Learning ACDL11 Given 2.1 Terafeatures of data, how can you learn a good linear predictor fw (x) = i wi xi ? 2.1T sparse features 17B Examples 16M parameters 1K nodes
  • 12. Terascale Linear Learning ACDL11 Given 2.1 Terafeatures of data, how can you learn a good linear predictor fw (x) = i wi xi ? 2.1T sparse features 17B Examples 16M parameters 1K nodes 70 minutes = 500M features/second: faster than the IO bandwidth of a single machine⇒ we beat all possible single machine linear learning algorithms.
  • 13. Terascale Linear Learning ACDL11 Given 2.1 Terafeatures of data, how can you learn a good linear predictor fw (x) = i wi xi ? 2.1T sparse features 17B Examples 16M parameters 1K nodes 70 minutes = 500M features/second: faster than the IO bandwidth of a single machine⇒ we beat all possible single machine linear learning algorithms. (Actually, we can do even better now.)
  • 14. book Features/s 100 1000 10000 100000 1e+06 1e+07 1e+08 1e+09 RBF-SVM MPI?-500 RCV1 Ensemble Tree MPI-128 Synthetic single parallel RBF-SVM TCP-48 MNIST 220K Decision Tree MapRed-200 Ad-Bounce # Boosted DT MPI-32 Speed per method Ranking # Linear Threads-2 RCV1 Linear Hadoop+TCP-1000 Ads *& Compare: Other Supervised Algorithms in Parallel Learning
  • 15. The tricks we use First Vowpal Wabbit Newer Algorithmics Parallel Stuff Feature Caching Adaptive Learning Parameter Averaging Feature Hashing Importance Updates Smart Averaging Online Learning Dimensional Correction Gradient Summing L-BFGS Hadoop AllReduce Hybrid Learning We’ll discuss Hashing, AllReduce, then how to learn.
  • 16. Hashing. [diagram: conventional learners map String −> Index through a dictionary held in RAM, then index to weight; VW hashes the string straight to a weight index.] Most algorithms use a hashmap to change a word into an index for a weight. VW uses a hash function which takes almost no RAM, is 10x faster, and is easily parallelized. Empirically, radical state-compression tricks are possible with this.
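
As a concrete illustration, here is a minimal sketch of the hashing trick in C++. The table size, the use of std::hash, and the toy update are assumptions of this sketch, not VW's actual implementation; the point is only that a string feature is hashed straight to a weight index, so no string-to-index dictionary is kept in RAM.

    #include <cstddef>
    #include <functional>
    #include <iostream>
    #include <string>
    #include <vector>

    // Hashing trick, minimal sketch: map a string feature straight to a weight
    // index with a hash function, instead of keeping a string -> index dictionary.
    // std::hash stands in for whatever hash a real learner would use.
    int main() {
      const std::size_t num_bits = 18;                        // 2^18 weights fit easily in RAM
      const std::size_t mask = (std::size_t{1} << num_bits) - 1; // cheap modulo for a power of two
      std::vector<float> weights(std::size_t{1} << num_bits, 0.0f);

      auto index_of = [&](const std::string& feature) {
        return std::hash<std::string>{}(feature) & mask;
      };

      // A toy update: bump the weight of each feature appearing in one example.
      for (const std::string f : {"user_123", "ad_456", "user_123^ad_456"}) {
        weights[index_of(f)] += 0.1f;
      }
      std::cout << "w[user_123] = " << weights[index_of("user_123")] << "\n";
      return 0;
    }
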
  • 17. MPI-style AllReduce Allreduce initial state 7 5 6 1 2 3 4 AllReduce = Reduce+Broadcast
  • 18. MPI-style AllReduce Reducing, step 1 7 8 13 1 2 3 4 AllReduce = Reduce+Broadcast
  • 19. MPI-style AllReduce Reducing, step 2 28 8 13 1 2 3 4 AllReduce = Reduce+Broadcast
  • 20. MPI-style AllReduce Broadcast, step 1 28 28 28 1 2 3 4 AllReduce = Reduce+Broadcast
  • 21. MPI-style AllReduce Allreduce final state 28 28 28 28 28 28 28 AllReduce = Reduce+Broadcast
  • 22. MPI-style AllReduce Allreduce final state 28 28 28 28 28 28 28 AllReduce = Reduce+Broadcast Properties: 1 Easily pipelined so no latency concerns. 2 Bandwidth ≤ 6n. 3 No need to rewrite code!
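
The reduce-then-broadcast data flow can be simulated in a few lines. The sketch below just replays the slides' 7-node example (initial values 7, 5, 6, 1, 2, 3, 4; final value 28 everywhere) on a single machine with an implicit binary tree; a real implementation does the same thing with network messages between nodes.

    #include <iostream>
    #include <vector>

    // AllReduce = Reduce + Broadcast, simulated on one machine over a binary tree.
    // Node 0 is the root; node i has parent (i-1)/2 and children 2*i+1, 2*i+2.
    int main() {
      std::vector<int> value = {7, 5, 6, 1, 2, 3, 4}; // initial state from the slides
      const int n = static_cast<int>(value.size());

      // Reduce: from the deepest nodes upward, add each node's value into its parent.
      for (int i = n - 1; i > 0; --i) {
        value[(i - 1) / 2] += value[i];
      }
      // The root now holds the global sum (28 in this example).

      // Broadcast: push the root's total back down to every node.
      for (int i = 1; i < n; ++i) {
        value[i] = value[(i - 1) / 2];
      }

      for (int v : value) std::cout << v << " "; // prints "28 28 28 28 28 28 28"
      std::cout << "\n";
      return 0;
    }
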
  • 23. An Example Algorithm: Weight averaging. n = AllReduce(1). While (pass number < max): 1. While (examples left): do online update. 2. AllReduce(weights). 3. For each weight: w ← w/n.
  • 24. An Example Algorithm: Weight averaging. n = AllReduce(1). While (pass number < max): 1. While (examples left): do online update. 2. AllReduce(weights). 3. For each weight: w ← w/n. Other algorithms implemented: 1. Nonuniform averaging for online learning. 2. Conjugate Gradient. 3. L-BFGS.
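
A sketch of one averaged pass, under assumptions the slide does not make: squared loss, dense features, a fixed learning rate, and a single-node stand-in for allreduce (with one node the element-wise sum is just the local buffer). The shape of the computation follows slide 23: count nodes with AllReduce(1), train online on the local shard, sum the weight vectors, divide by n.

    #include <iostream>
    #include <vector>

    struct Example {
      std::vector<float> x; // dense features, same length as the weight vector
      float y;              // label
    };

    // Single-node stand-in for allreduce: with one node the element-wise sum
    // across nodes is just the local buffer, so this is a no-op. A real cluster
    // implementation exchanges partial sums over the network (see slide 42's API).
    void allreduce_sum(std::vector<float>& /*buf*/) {}

    // One pass of the weight-averaging scheme of slide 23: every node trains
    // online on its own shard, then weights are summed via allreduce and divided
    // by the node count n.
    void averaged_pass(std::vector<float>& w, const std::vector<Example>& shard,
                       float eta) {
      std::vector<float> count = {1.0f}; // each node contributes a 1; allreduce sums them
      allreduce_sum(count);
      const float n = count[0];

      for (const Example& e : shard) {   // online updates on local data (squared loss)
        float pred = 0.0f;
        for (std::size_t i = 0; i < w.size(); ++i) pred += w[i] * e.x[i];
        const float grad_scale = 2.0f * (pred - e.y);
        for (std::size_t i = 0; i < w.size(); ++i) w[i] -= eta * grad_scale * e.x[i];
      }

      allreduce_sum(w);                  // element-wise sum of weight vectors
      for (float& wi : w) wi /= n;       // average across nodes
    }

    int main() {
      std::vector<float> w(2, 0.0f);
      std::vector<Example> shard = {{{1.0f, 0.0f}, 1.0f}, {{0.0f, 1.0f}, -1.0f}};
      averaged_pass(w, shard, 0.1f);
      std::cout << "averaged weights: " << w[0] << " " << w[1] << "\n";
      return 0;
    }
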
  • 25. What is Hadoop-Compatible AllReduce? Program Data 1 “Map” job moves program to data.
  • 26. What is Hadoop-Compatible AllReduce? Program Data 1 “Map” job moves program to data. 2 Delayed initialization: Most failures are disk failures. First read (and cache) all data, before initializing allreduce. Failures autorestart on different node with identical data.
  • 27. What is Hadoop-Compatible AllReduce? Program Data 1 “Map” job moves program to data. 2 Delayed initialization: Most failures are disk failures. First read (and cache) all data, before initializing allreduce. Failures autorestart on different node with identical data. 3 Speculative execution: In a busy cluster, one node is often slow. Hadoop can speculatively start additional mappers. We use the first to finish reading all data once.
  • 28. What is Hadoop-Compatible AllReduce? Program Data 1 “Map” job moves program to data. 2 Delayed initialization: Most failures are disk failures. First read (and cache) all data, before initializing allreduce. Failures autorestart on different node with identical data. 3 Speculative execution: In a busy cluster, one node is often slow. Hadoop can speculatively start additional mappers. We use the first to finish reading all data once. The net effect: Reliable execution out to perhaps 10K node-hours.
  • 29. Algorithms: Preliminaries Optimize so few data passes required ⇒ Smart algorithms. Basic problem with gradient descent = confused units. f_w(x) = Σ_i w_i x_i ⇒ ∂(f_w(x) − y)²/∂w_i = 2(f_w(x) − y) x_i, which has units of feature i. But w_i naturally has units of 1/(feature i), since doubling x_i implies halving w_i to get the same prediction. Crude fixes: 1. Newton: multiply the gradient by the inverse Hessian (∂²/∂w_i ∂w_j)^−1 to get the update direction. 2. Normalize the update so the total step size is controlled.
  • 30. Algorithms: Preliminaries Optimize so few data passes required ⇒ Smart algorithms. Basic problem with gradient descent = confused units. f_w(x) = Σ_i w_i x_i ⇒ ∂(f_w(x) − y)²/∂w_i = 2(f_w(x) − y) x_i, which has units of feature i. But w_i naturally has units of 1/(feature i), since doubling x_i implies halving w_i to get the same prediction. Crude fixes: 1. Newton: multiply the gradient by the inverse Hessian (∂²/∂w_i ∂w_j)^−1 to get the update direction...but computational complexity kills you. 2. Normalize the update so the total step size is controlled...but this just works globally rather than per dimension.
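
For readers who want the unit argument spelled out, here is the slide's calculation in LaTeX, assuming squared loss as above:

    \[
      f_w(x) = \sum_i w_i x_i, \qquad
      \frac{\partial\,(f_w(x)-y)^2}{\partial w_i} \;=\; 2\,(f_w(x)-y)\,x_i .
    \]

Rescaling feature i by a factor c (x_i → c x_i) leaves the prediction unchanged if w_i → w_i/c, but the gradient coordinate scales the other way (it is multiplied by c), so a single global learning rate takes per-coordinate steps of the wrong size. That mismatch is what "confused units" means here.
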
  • 31. Approach Used 1. Optimize hard so few data passes required. 1. L-BFGS = batch algorithm that builds up an approximate inverse Hessian according to ∆w ∆w^T / (∆w^T ∆g), where ∆w is a change in weights w and ∆g is a change in the loss gradient g.
  • 32. Approach Used 1. Optimize hard so few data passes required. 1. L-BFGS = batch algorithm that builds up an approximate inverse Hessian according to ∆w ∆w^T / (∆w^T ∆g), where ∆w is a change in weights w and ∆g is a change in the loss gradient g. 2. Dimensionally correct, adaptive, online gradient descent for small-multiple passes. 1. Online = update weights after seeing each example. 2. Adaptive = learning rate of feature i according to 1/√(Σ g_i²), where g_i = previous gradients. 3. Dimensionally correct = still works if you double all feature values. 3. Use (2) to warmstart (1).
  • 33. Approach Used 1. Optimize hard so few data passes required. 1. L-BFGS = batch algorithm that builds up an approximate inverse Hessian according to ∆w ∆w^T / (∆w^T ∆g), where ∆w is a change in weights w and ∆g is a change in the loss gradient g. 2. Dimensionally correct, adaptive, online gradient descent for small-multiple passes. 1. Online = update weights after seeing each example. 2. Adaptive = learning rate of feature i according to 1/√(Σ g_i²), where g_i = previous gradients. 3. Dimensionally correct = still works if you double all feature values. 3. Use (2) to warmstart (1). 2. Use map-only Hadoop for process control and error recovery.
  • 34. Approach Used 1. Optimize hard so few data passes required. 1. L-BFGS = batch algorithm that builds up an approximate inverse Hessian according to ∆w ∆w^T / (∆w^T ∆g), where ∆w is a change in weights w and ∆g is a change in the loss gradient g. 2. Dimensionally correct, adaptive, online gradient descent for small-multiple passes. 1. Online = update weights after seeing each example. 2. Adaptive = learning rate of feature i according to 1/√(Σ g_i²), where g_i = previous gradients. 3. Dimensionally correct = still works if you double all feature values. 3. Use (2) to warmstart (1). 2. Use map-only Hadoop for process control and error recovery. 3. Use custom AllReduce code to sync state.
  • 35. Approach Used 1. Optimize hard so few data passes required. 1. L-BFGS = batch algorithm that builds up an approximate inverse Hessian according to ∆w ∆w^T / (∆w^T ∆g), where ∆w is a change in weights w and ∆g is a change in the loss gradient g. 2. Dimensionally correct, adaptive, online gradient descent for small-multiple passes. 1. Online = update weights after seeing each example. 2. Adaptive = learning rate of feature i according to 1/√(Σ g_i²), where g_i = previous gradients. 3. Dimensionally correct = still works if you double all feature values. 3. Use (2) to warmstart (1). 2. Use map-only Hadoop for process control and error recovery. 3. Use custom AllReduce code to sync state. 4. Always save input examples in a cachefile to speed later passes.
  • 36. Approach Used 1. Optimize hard so few data passes required. 1. L-BFGS = batch algorithm that builds up an approximate inverse Hessian according to ∆w ∆w^T / (∆w^T ∆g), where ∆w is a change in weights w and ∆g is a change in the loss gradient g. 2. Dimensionally correct, adaptive, online gradient descent for small-multiple passes. 1. Online = update weights after seeing each example. 2. Adaptive = learning rate of feature i according to 1/√(Σ g_i²), where g_i = previous gradients. 3. Dimensionally correct = still works if you double all feature values. 3. Use (2) to warmstart (1). 2. Use map-only Hadoop for process control and error recovery. 3. Use custom AllReduce code to sync state. 4. Always save input examples in a cachefile to speed later passes. 5. Use hashing trick to reduce input complexity.
  • 37. Approach Used 1. Optimize hard so few data passes required. 1. L-BFGS = batch algorithm that builds up an approximate inverse Hessian according to ∆w ∆w^T / (∆w^T ∆g), where ∆w is a change in weights w and ∆g is a change in the loss gradient g. 2. Dimensionally correct, adaptive, online gradient descent for small-multiple passes. 1. Online = update weights after seeing each example. 2. Adaptive = learning rate of feature i according to 1/√(Σ g_i²), where g_i = previous gradients. 3. Dimensionally correct = still works if you double all feature values. 3. Use (2) to warmstart (1). 2. Use map-only Hadoop for process control and error recovery. 3. Use custom AllReduce code to sync state. 4. Always save input examples in a cachefile to speed later passes. 5. Use hashing trick to reduce input complexity. Open source in Vowpal Wabbit 6.0. Search for it.
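
The online piece of step 2 can be sketched as follows. Squared loss, dense features, and a plain AdaGrad-style per-feature rate of eta/√(Σ g_i²) are assumptions of this sketch; VW's actual update additionally handles importance weights and fuller dimensional correction, so treat this only as an illustration of the adaptive idea.

    #include <cmath>
    #include <cstddef>
    #include <iostream>
    #include <vector>

    // Adaptive online gradient descent: per-feature learning rates of
    // eta / sqrt(sum of squared past gradients), as described on slide 32.
    struct AdaptiveSGD {
      std::vector<float> w;      // weights
      std::vector<float> g2_sum; // per-feature sum of squared past gradients
      float eta;

      explicit AdaptiveSGD(std::size_t dim, float eta_in = 0.5f)
          : w(dim, 0.0f), g2_sum(dim, 0.0f), eta(eta_in) {}

      float predict(const std::vector<float>& x) const {
        float p = 0.0f;
        for (std::size_t i = 0; i < w.size(); ++i) p += w[i] * x[i];
        return p;
      }

      void update(const std::vector<float>& x, float y) {
        const float err = predict(x) - y;          // squared-loss residual
        for (std::size_t i = 0; i < w.size(); ++i) {
          const float g = 2.0f * err * x[i];       // gradient for feature i
          if (g == 0.0f) continue;
          g2_sum[i] += g * g;
          w[i] -= eta * g / std::sqrt(g2_sum[i]);  // per-feature adaptive step
        }
      }
    };

    int main() {
      AdaptiveSGD learner(2);
      // Toy data drawn from y = x0 - x1.
      for (int pass = 0; pass < 20; ++pass) {
        learner.update({1.0f, 0.0f}, 1.0f);
        learner.update({0.0f, 1.0f}, -1.0f);
        learner.update({1.0f, 1.0f}, 0.0f);
      }
      std::cout << "w = " << learner.w[0] << ", " << learner.w[1] << "\n";
      return 0;
    }
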
  • 38. Robustness & Speedup. [plot "Speed per method": speedup (0 to 10) vs. number of nodes (10 to 100), curves Average_10, Min_10, Max_10 against a linear-speedup reference]
  • 39. Webspam optimization. [plot "Webspam test results": test log loss (0 to 0.5) vs. pass (0 to 50) for Online, Warmstart L-BFGS 1, L-BFGS 1, and Single machine online]
  • 40. How does it work? The launch sequence (on Hadoop): 1. mapscript.sh <hdfs output> <hdfs input> <maps asked>. 2. mapscript.sh starts the spanning tree on the gateway. Each vw connects to the spanning tree to learn which other vw is adjacent in the binary tree. 3. mapscript.sh starts a Hadoop streaming map-only job: runvw.sh. 4. runvw.sh calls vw twice. 5. The first vw call does online learning while saving examples in a cachefile. Weights are averaged before saving. 6. The second vw uses the online solution to warmstart L-BFGS. Gradients are shared via allreduce. Example: mapscript.sh outdir indir 100. Everything also works in a non-Hadoop cluster: you just need to script more to do some of the things that Hadoop does for you.
  • 41. How does this compare to other cluster learning efforts? Mahout was founded as the MapReduce ML project, but has grown to include VW-style online linear learning. For linear learning, VW is superior: 1. Vastly faster. 2. Smarter algorithms. Other algorithms exist in Mahout, but they are apparently buggy. AllReduce seems a superior paradigm in the 10K node-hours regime for machine learning. GraphLab (@CMU) doesn't overlap much: it is mostly about graphical model evaluation and learning. Ultra LDA (@Y!) overlaps partially; VW has a non-cluster-parallel LDA which is perhaps 3x more efficient.
  • 42. Can allreduce be used by others? It might be incorporated directly into next-generation Hadoop. But for now, it's quite easy to use the existing code. void all_reduce(char* buffer, int n, std::string master_location, size_t unique_id, size_t total, size_t node); buffer = pointer to some floats; n = number of bytes (4 * number of floats); master_location = IP address of the gateway; unique_id = nonce (unique for different jobs); total = total number of nodes; node = node id number. The call is stateful: it initializes the topology if necessary.
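
The documented prototype is enough to sketch a call. Everything beyond the prototype itself is an assumption for illustration: the helper name sum_across_nodes, the example gateway address and nonce, and the summation semantics (sum is how the earlier slides use allreduce). This compiles on its own but must be linked against the real allreduce implementation in the VW source tree; consult that code for the actual header and setup.

    #include <cstddef>
    #include <string>
    #include <vector>

    // Prototype as documented on slide 42 (size_t written as std::size_t here);
    // the implementation lives in the Vowpal Wabbit source tree and talks to the
    // spanning-tree server running on the gateway.
    void all_reduce(char* buffer, int n, std::string master_location,
                    std::size_t unique_id, std::size_t total, std::size_t node);

    // Hypothetical usage: each of 'total' processes calls this with its own node
    // id; afterwards every process holds the reduced (summed) buffer.
    void sum_across_nodes(std::vector<float>& values, std::size_t total,
                          std::size_t node) {
      all_reduce(reinterpret_cast<char*>(values.data()),
                 static_cast<int>(values.size() * sizeof(float)), // bytes = 4 * #floats
                 "10.0.0.1",  // example gateway IP running the spanning-tree server
                 42,          // example nonce, shared by all nodes of one job
                 total, node);
    }
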
  • 43. Further Pointers. search://Vowpal Wabbit. Mailing list: vowpal_wabbit@yahoogroups.com. VW tutorial (pre-parallel): http://videolectures.net. Machine Learning (Theory) blog: http://hunch.net
  • 44. Bibliography: Original VW. Caching: L. Bottou, Stochastic Gradient Descent Examples on Toy Problems, http://leon.bottou.org/projects/sgd, 2007. Release: Vowpal Wabbit open source project, http://github.com/JohnLangford/vowpal_wabbit/wiki, 2007. Hashing: Q. Shi, J. Petterson, G. Dror, J. Langford, A. Smola, and S.V.N. Vishwanathan, Hash Kernels for Structured Data, AISTATS 2009. Hashing: K. Weinberger, A. Dasgupta, J. Langford, A. Smola, and J. Attenberg, Feature Hashing for Large Scale Multitask Learning, ICML 2009.
  • 45. Bibliography: Algorithmics. L-BFGS: J. Nocedal, Updating Quasi-Newton Matrices with Limited Storage, Mathematics of Computation 35:773–782, 1980. Adaptive: H. B. McMahan and M. Streeter, Adaptive Bound Optimization for Online Convex Optimization, COLT 2010. Adaptive: J. Duchi, E. Hazan, and Y. Singer, Adaptive Subgradient Methods for Online Learning and Stochastic Optimization, COLT 2010. Importance: N. Karampatziakis and J. Langford, Online Importance Weight Aware Updates, UAI 2011.
  • 46. Bibliography: Parallel. Gradient summing: C. Teo, Q. Le, A. Smola, and S.V.N. Vishwanathan, A Scalable Modular Convex Solver for Regularized Risk Minimization, KDD 2007. Averaging: G. Mann, R. McDonald, M. Mohri, N. Silberman, and D. Walker, Efficient Large-Scale Distributed Training of Conditional Maximum Entropy Models, NIPS 2009. Averaging: K. Hall, S. Gilpin, and G. Mann, MapReduce/Bigtable for Distributed Optimization, LCCC 2010. (More forthcoming, of course.)