MapReduce for Machine Learning

Seminar report 2015, Dept. of MCA, LBSCEK
1 Introduction

Frequency scaling on silicon—the ability to drive chips at ever higher clock rates—is beginning to hit a power limit as device geometries shrink, due to leakage and simply because CMOS consumes power every time it changes state. Yet Moore's law, the doubling of circuit density every generation, is projected to last 10 to 20 more years for silicon-based circuits, and by doubling the number of processing cores on a chip one can keep power low while doubling the speed of many applications. This has forced an industry-wide shift to multicore. We thus approach an era of increasing numbers of cores per chip, yet there is as yet no good framework for machine learning to take advantage of massive numbers of cores. There are many parallel programming languages, such as Orca, Occam, ABCL, SNOW, MPI and PARLOG, but none of these approaches makes it obvious how to parallelize a particular algorithm. There is a vast literature on distributed learning and data mining, but very little of it focuses on our goal: a general means of programming machine learning on multicore. Much of this literature follows a long and distinguished tradition of developing (often ingenious) ways to speed up or parallelize individual learning algorithms, for instance cascaded parallelization techniques; pragmatically, however, such specialized implementations of popular algorithms rarely see widespread use. Some examples of more general papers are: Caragea et al. give some general data distribution conditions for parallelizing machine learning, but restrict the focus to decision trees; Jin and Agrawal give a general machine learning programming approach, but only for shared-memory machines. This does not fit the architecture of cellular or grid-type multiprocessors, where each core has a local cache, even if it can be dynamically reallocated.
In this paper, we focus on developing a general and exact technique for parallel programming of a large class of machine learning algorithms on multicore processors. The central idea is to allow a future programmer or user to speed up machine learning applications by "throwing more cores" at the problem rather than searching for specialized optimizations. This paper's contributions are: (i) we show that any algorithm fitting the Statistical Query Model may be written in a certain "summation form"; this form does not change the underlying algorithm, so it is not an approximation but an exact implementation; (ii) the summation form does not depend on, but can easily be expressed in, a map-reduce framework, which is easy to program in; (iii) this technique achieves basically linear speed-up with the number of cores.
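To make the summation form concrete, here is a minimal Python sketch (our own illustrative code, not the paper's implementation): the gradient of a least-squares objective decomposes into per-datum terms, each simulated core sums over its partition in the map step, and the partial sums are added in the reduce step. Because addition is exact, the parallel result equals the serial one.

```python
# Illustrative sketch of the "summation form": a statistic computed as a
# sum of per-datum terms can be split across cores, each map task summing
# over its partition, with a reduce step adding the partial sums.

def gradient_term(theta, x, y):
    """Per-datum gradient of the squared error for 1-D linear regression."""
    return 2 * (theta * x - y) * x

def map_partial_sum(theta, partition):
    return sum(gradient_term(theta, x, y) for x, y in partition)

def reduce_sum(partial_sums):
    return sum(partial_sums)

def parallel_gradient(theta, data, num_cores=4):
    # Split the data into num_cores partitions (simulating one per core).
    partitions = [data[i::num_cores] for i in range(num_cores)]
    return reduce_sum(map_partial_sum(theta, p) for p in partitions)

data = [(1.0, 2.0), (2.0, 4.1), (3.0, 5.9), (4.0, 8.2)]
serial = sum(gradient_term(1.5, x, y) for x, y in data)
assert abs(parallel_gradient(1.5, data) - serial) < 1e-9  # exact, not approximate
```

The same pattern applies to any statistic expressible in the Statistical Query Model: only the per-datum term changes.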
2 Machine learning

"Machine learning" sounds mysterious to most people. Indeed, only a small fraction of professionals really know what it stands for, and there is a good reason for that: the field is rather technical and difficult to explain to a layman. However, we would like to bridge this gap and explain a bit about what machine learning (ML) is and how it can be used in everyday life or business. So what is this mysterious ML? Machine learning can refer to:
• the branch of artificial intelligence;
• the methods used in this field (there are a variety of different approaches).
Regarding the latter, Tom Mitchell, author of the well-known book "Machine Learning", defines ML as "improving performance in some task with experience". This definition is quite broad, however, so we can quote another, more specific description stating that ML deals with systems that can learn from data. ML processes data to discover patterns that can later be used to analyze new data. ML usually relies on a specific representation of the data: a set of "features" that are understandable to a computer. For example, a text might be represented through the words it contains, or through other characteristics such as its length or the number of emotional words it uses. This representation depends on the task you are dealing with and is typically referred to as "feature extraction".

Types of ML
All ML tasks can be classified into several categories; the main ones are:
• supervised ML;
• unsupervised ML;
• reinforcement learning.
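As a concrete illustration of feature extraction, here is a toy Python sketch; the lexicon and feature names are our own invention, not a standard library:

```python
# Toy feature extraction: mapping raw text to features a learner can consume.
EMOTIONAL_WORDS = {"great", "terrible", "love", "hate", "awesome"}  # toy lexicon

def extract_features(text):
    words = text.lower().split()
    return {
        "length": len(text),                                   # characters
        "num_words": len(words),                               # tokens
        "num_emotional": sum(w in EMOTIONAL_WORDS for w in words),
    }

features = extract_features("I love this great phone")
assert features["num_words"] == 5 and features["num_emotional"] == 2
```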
Now let us explain in simple words the kinds of problems dealt with by each category. Supervised ML relies on data for which the true label/class is indicated. This is easiest to explain with an example. Imagine that we want to teach a computer to distinguish pictures of cats and dogs. We can ask some of our friends to send us pictures of cats and dogs with a tag 'cat' or 'dog' attached. Labeling is usually done by human annotators to ensure high-quality data. Now that we know the true labels of the pictures, we can use this data to "supervise" our algorithm in learning the right way to classify images. Once the algorithm has learned how to classify images, we can use it on new data and predict labels ('cat' or 'dog' in our case) for previously unseen images.

2.1 Applications to Machine Learning

Many standard machine learning algorithms follow one of a few canonical data processing patterns, which we outline below. A large subset of these can be phrased as MapReduce tasks, illuminating the benefits that the MapReduce framework offers to the machine learning community. In this section, we investigate the performance trade-offs of using MapReduce from an algorithm-centric perspective, considering in turn three classes of ML algorithms and the issues of adapting each to a MapReduce framework. The resulting performance depends intimately on the design choices underlying the MapReduce implementation, and on how well those choices support the data processing pattern of the ML algorithm. We conclude this section with a discussion of changes and extensions to the Hadoop MapReduce implementation that would benefit the machine learning community.

2.1.1 A Taxonomy of Standard Machine Learning Algorithms

While ML algorithms can be classified on many dimensions, the one of primary interest here is procedural character: the data processing pattern of the algorithm.
Here, we consider single-pass, iterative and query-based learning techniques, along with several example algorithms and applications.
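Before examining each category, the map/shuffle/reduce pattern itself can be illustrated with a tiny bigram-counting sketch in plain Python (single-pass learning in miniature; illustrative only, not Hadoop's API):

```python
from collections import defaultdict

def map_bigrams(sentence):
    """Map: emit ((w1, w2), 1) for every adjacent word pair in a sentence."""
    tokens = sentence.split()
    return [((tokens[i], tokens[i + 1]), 1) for i in range(len(tokens) - 1)]

def shuffle(pairs):
    """Group emitted values by key, as the framework does between phases."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped

def reduce_counts(grouped):
    """Reduce: sum the counts for each bigram."""
    return {key: sum(values) for key, values in grouped.items()}

corpus = ["the cat sat", "the cat ran", "the dog sat"]
emitted = [pair for sentence in corpus for pair in map_bigrams(sentence)]
counts = reduce_counts(shuffle(emitted))
assert counts[("the", "cat")] == 2
```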
2.1.2 Single-pass Learning

Many ML applications make only one pass through a data set, extracting relevant statistics for later use during inference. This relatively simple learning setting arises often in natural language processing, from machine translation to information extraction to spam filtering. These applications often fit perfectly into the MapReduce abstraction, encapsulating the extraction of local contributions in the map task, then combining those contributions to compute relevant statistics about the dataset as a whole. Consider the following examples, illustrating common decompositions of these statistics.

Estimating Language Model Multinomials: Extracting language models from a large corpus amounts to little more than counting n-grams, though some parameter smoothing over the statistics is also common. The map phase enumerates the n-grams in each training instance (typically a sentence or paragraph), and the reduce function counts the instances of each n-gram. (This option has been investigated as part of Alex Rasmussen's Hadoop-related CS 262 project.)

Feature Extraction for Naive Bayes Classifiers: Estimating parameters for a naive Bayes classifier, or any fully observed Bayes net, again requires counting occurrences in the training data. In this case, however, feature extraction is often computation-intensive, perhaps involving small search or optimization problems for each datum. The reduce task, however, remains a summation over each (feature, label) pair.

Syntactic Translation Modeling: Generating a syntactic model for machine translation is an example of a research-level machine learning application that involves only a single pass through a preprocessed training set.
Each training datum consists of a pair of sentences in two languages, an estimated alignment between the words in each, and an estimated syntactic parse tree for one sentence. The per-datum feature extraction encapsulated in the map phase for this task involves search over these coupled data structures.

2.1.3 Iterative Learning

The class of iterative ML algorithms – perhaps the most common within the machine learning research community – can also be expressed within the framework of MapReduce by chaining together multiple MapReduce tasks. While such algorithms vary widely in the type of operation they perform on each datum (or pair of data) in a training set, they share the common characteristic that a set of parameters is matched to the data set via iterative improvement. The update to these parameters across iterations must again decompose into per-datum contributions, which is the case for the example applications below. As with the examples discussed in the previous section, the reduce function is considerably less compute-intensive than the map tasks. In the examples below, the contribution to the parameter updates from each datum (the map function) depends in a meaningful way on the output of the previous iteration. For example, the expectation computation of EM, or the inference computation in an SVM or perceptron classifier, can reference a large portion or all of the parameters generated by the algorithm. Hence, these parameters must remain available to the map tasks in a distributed environment. The information necessary to compute the map step of each algorithm is described below; the complications that arise because this information is vital to the computation are investigated later in the paper.

Expectation Maximization (EM): The well-known EM algorithm maximizes the likelihood of a training set given a generative model with latent variables. The E-step of the algorithm computes posterior distributions over the latent variables given the current model parameters and the observed data. The M-step adjusts the model parameters to maximize the likelihood of the data, assuming that the latent variables take on their expected values.
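Concretely, one EM iteration for a two-component 1-D Gaussian mixture can be phrased as a map (the per-datum E-step) followed by a reduce (summing sufficient statistics for the M-step). The sketch below is our own illustrative Python, assuming unit variances for simplicity:

```python
import math

def map_estep(x, means, weights):
    """Per-datum E-step: posterior responsibility of each component for x,
    returned as the sufficient statistics (resp_k, resp_k * x)."""
    likes = [w * math.exp(-0.5 * (x - m) ** 2) for m, w in zip(means, weights)]
    total = sum(likes)
    return [(l / total, l / total * x) for l in likes]

def reduce_mstep(stats, n):
    """Reduce/M-step: sum per-datum sufficient statistics and renormalize."""
    k = len(stats[0])
    resp_sums = [sum(d[j][0] for d in stats) for j in range(k)]
    weighted_x = [sum(d[j][1] for d in stats) for j in range(k)]
    means = [weighted_x[j] / resp_sums[j] for j in range(k)]
    weights = [resp_sums[j] / n for j in range(k)]
    return means, weights

data = [-2.1, -1.9, -2.0, 1.8, 2.2, 2.0]
means, weights = [-1.0, 1.0], [0.5, 0.5]
for _ in range(5):  # one chained "MapReduce job" per EM iteration
    stats = [map_estep(x, means, weights) for x in data]
    means, weights = reduce_mstep(stats, len(data))
assert means[0] < 0 < means[1]  # components separate onto the two clusters
```

Each pass of the loop corresponds to one chained MapReduce job; the means and weights produced by the reduce step must be redistributed to the map tasks before the next iteration, which is exactly the parameter-broadcast issue discussed in the text.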
Projecting onto the MapReduce framework, the map task computes posterior distributions over the latent variables of a datum using the current model parameters; the maximization step is performed as a single reduction, which sums the sufficient statistics and normalizes to produce updated parameters. We consider applications to machine translation and speech recognition. For multivariate Gaussian mixture models (e.g., for speaker identification), these parameters are simply the mean vectors and covariance matrices. For HMM-GMM models (e.g., speech recognition), parameters are also needed to specify the state transition probabilities; the models, efficiently stored in binary form, occupy tens of megabytes. For word alignment models (e.g., machine translation), these parameters include word-to-word translation probabilities; these can number in the millions, even after pruning heuristics remove unnecessary parameters. (Generating appropriate training data for the syntactic translation task of the previous section itself involves several applications of these iterative learning algorithms.)

Discriminative Classification and Regression: When fitting model parameters via a perceptron, boosting, or support vector machine algorithm for classification or regression, the map stage of training involves computing inference over each training example given the current model parameters. As in the EM case, a subset of the parameters from the previous iteration must be available for inference; the reduce stage, however, typically involves summing over parameter changes. Thus, all relevant model parameters must be broadcast to each map task. In a typical featurized setting that extracts hundreds or thousands of features from each training example, the parameter space needed for inference can be quite large.

2.1.4 Query-based Learning with Distance Metrics

Finally, we consider distance-based ML applications that directly reference the training set during inference, such as the nearest-neighbor classifier. In this setting, the training data are the parameters, and a query instance must be compared to each training datum. While multiple query instances can be processed simultaneously within a MapReduce implementation of these techniques, the query set must be broadcast to all map tasks. Again, we have a need for the distribution of state information. In this case, however, the query information that must be distributed to all map tasks needn't be processed concurrently: a query set can be broken up and processed over multiple MapReduce operations. In the examples below, each query instance tends to be of a manageable size.
K-nearest Neighbors Classifier: The nearest-neighbor classifier compares each element of a query set to each element of a training set, and discovers the examples with minimal distance from each query. The map stage computes the distance metric, while the reduce stage tracks, for each label, the k examples with minimal distance to the query.
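A minimal Python sketch of this pattern (illustrative names, and simplified to keep the k globally nearest neighbours with a majority vote, rather than k per label):

```python
import heapq
from collections import Counter

def map_distances(query, training):
    """Map: emit (distance, label) for the query against every training datum."""
    return [(abs(query - x), label) for x, label in training]

def reduce_knn(pairs, k=3):
    """Reduce: keep the k nearest neighbours and return the majority label."""
    nearest = heapq.nsmallest(k, pairs)
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]

training = [(1.0, "cat"), (1.2, "cat"), (0.9, "cat"),
            (5.0, "dog"), (5.3, "dog"), (4.8, "dog")]
assert reduce_knn(map_distances(1.1, training)) == "cat"
```

Note that the training set plays the role of the model parameters here, so it is the queries, not the parameters, that must be shipped to the map tasks.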
Similarity-based Search: Finding the most similar instances to a given query has a similar character, sifting through the training set to find examples that minimize a distance metric. Computing the distance is the map stage, while minimizing it is the reduce stage.

2.2 Performance and Implementation Issues

While the algorithms discussed above can all be implemented in parallel using the MapReduce abstraction, our example applications from each category revealed a set of implementation challenges. We conducted all of our experiments on top of the Hadoop platform. In the discussion below, we address issues related both to the Hadoop implementation of MapReduce and to the MapReduce framework itself.

2.2.1 Single-pass Learning

The single-pass learning algorithms described in the previous section are clearly amenable to the MapReduce framework. We focus here on the task of generating a syntactic translation model from a set of sentence pairs, their word-level bilingual alignment, and their syntactic structure.

Fig 1: running time (seconds) versus training set size (thousands of sentence pairs) for the local reference implementation, a single-machine MapReduce simulation, and a 3-node MapReduce cluster. The benefit of distributed computation quickly outweighs the overhead of a MapReduce implementation on a 3-node cluster.

Figure 1 shows the running times for various input sizes, demonstrating the overhead of running MapReduce relative to the reference implementation. The cost of running Hadoop cancels out some of the benefit of parallelizing the code: running on 3 machines gave a speed-up of 39% over the reference implementation, while simulating a MapReduce computation on a single machine added an overhead of 51% of the compute cost of the reference implementation. Distributing the task to a large cluster would clearly justify this overhead, but parallelizing to two machines would give virtually no benefit for the largest data set size we tested. A more promising metric shows that as the size of the data scales, the distributed MapReduce implementation maintains a low per-datum cost. We can isolate the variable-cost overhead of each approach by comparing the slopes of the curves in Figure 1, which are all near-linear. The reference implementation shows a variable computation cost of 1.7 seconds per 1000 examples, while the distributed implementation across three machines shows a cost of 0.5 seconds per 1000 examples. So, the variable overhead of MapReduce is minimal for this task, while the static overhead of distributing the code base and channeling the processing through Hadoop's infrastructure is large. We would expect that substantially increasing the size of the training set would accentuate the utility of distributing the computation. Thus far, we have assumed that the training data was already written to Hadoop's distributed file system (DFS). The cost of distributing data is relevant in this setting, however, due to drawbacks of Hadoop's DFS implementation. In the simple case of text processing, a training corpus need only be distributed to Hadoop's file system once, and can then be processed by many different applications. This application, on the other hand, references four different input sources, including sentences, alignments and parse trees for each example.
When copying these resources independently to the DFS, the Hadoop implementation gives no control over how those files are mapped to remote machines. Hence, no one machine necessarily contains all of the data needed for a given example. Apache Mahout is a framework for implementing machine learning in the MapReduce paradigm.

3 Apache Mahout

Apache Mahout is a project of the Apache Software Foundation to produce free implementations of distributed or otherwise scalable machine learning algorithms, focused primarily on the areas of collaborative filtering, clustering and classification. Many of the implementations use the Apache Hadoop platform. Mahout also provides Java libraries for common maths operations (focused on linear algebra and statistics) and primitive Java collections. Mahout is a work in progress; the number of implemented algorithms has grown quickly, but various algorithms are still missing. While Mahout's core algorithms for clustering, classification and batch-based collaborative filtering are implemented on top of Apache Hadoop using the map/reduce paradigm, the project does not restrict contributions to Hadoop-based implementations: contributions that run on a single node or on a non-Hadoop cluster are also welcomed. For example, the 'Taste' collaborative-filtering recommender component of Mahout was originally a separate project and can run stand-alone without Hadoop. Integration with initiatives such as Pregel-like graph frameworks is actively under discussion. Mahout's algorithms include many new implementations built for speed on Mahout-Samsara. They run on Spark, and some on H2O, which can mean as much as a 10x speed increase. You'll find robust matrix decomposition algorithms as well as a Naive Bayes classifier and collaborative filtering. The new spark-itemsimilarity job enables the next generation of co-occurrence recommenders that can use entire user click streams and context in making recommendations.

3.1 Mahout installation

Mahout is scalable to reasonably large data sets: its core algorithms for clustering, classification and batch-based collaborative filtering are implemented on top of Apache Hadoop using the map/reduce paradigm, though contributions that run on a single node or on a non-Hadoop cluster are welcome as well. The core libraries are highly optimized to give good performance for non-distributed algorithms too.
Mahout is also scalable to support your business case: it is distributed under the commercially friendly Apache Software License. And it aims for a scalable community: the goal of Mahout is to build a vibrant, responsive, diverse community to facilitate discussions not only on the project itself but also on potential use cases. Come to the mailing lists to find out more. Currently Mahout mainly supports four use cases. Recommendation mining takes users' behavior and from that tries to find items users might like. Clustering takes, e.g., text documents and groups them into groups of topically related documents. Classification learns from existing categorized documents what documents of a specific category look like, and is able to assign unlabeled documents to the (hopefully) correct category. Frequent itemset mining takes a set of item groups (terms in a query session, shopping cart contents) and identifies which individual items usually appear together.

3.1.1 Apache Mahout Integration

The flexibility and power of Apache Mahout (http://mahout.apache.org/) in conjunction with Hadoop is invaluable. Therefore, I have packaged the most recent stable release of Mahout (0.5), and am very excited to work with the Mahout community, becoming much more involved with the project as both Mahout and Hadoop continue to grow.

3.1.2 Why we are packaging Mahout with Hadoop

Machine learning is an entire field devoted to information retrieval, statistics, linear algebra, analysis of algorithms, and many other subjects. This field allows us to examine things such as recommendation engines involving new friends, love interests, and new products. We can do incredibly advanced analysis around genetic sequencing and examination, distributed search and frequency pattern matching, as well as mathematical analysis with vectors, matrices, and singular value decomposition (SVD). Apache Mahout is an open source project of the Apache Software Foundation devoted to machine learning. Mahout can operate on top of Hadoop, which allows the user to apply a selection of machine learning algorithms to distributed computing via Hadoop. Mahout packages popular machine learning algorithms such as:
• Recommendation mining, which takes users' behavior and finds items a given user might like.
• Clustering, which takes, e.g., text documents and groups them based on related document topics.
• Classification, which learns from existing categorized documents what documents of a specific category look like and is able to assign unlabeled documents to the appropriate category.
• Frequent itemset mining, which takes a set of item groups (e.g., terms in a query session, shopping cart contents) and identifies which individual items typically appear together.
We are very excited to be working with the Apache Mahout community and highly encourage everyone currently using CDH to give Mahout a try! As always, we are open to any guests who would like to blog about their experience using Mahout with CDH.

3.2 Installing Mahout

Mahout is a collection of highly scalable machine learning algorithms for very large data sets. Although the real power of Mahout shows only on large HDFS data, Mahout also supports running algorithms on local file system data, which can help you get a feel for how to run Mahout algorithms. Before you can run any Mahout algorithm you need a Mahout installation ready on your Linux machine, which can be carried out as described below.

Step 1: Download mahout-distribution-0.x.tar.gz from the Apache download mirrors and extract its contents to a location of your choice; here we pick /usr/local. Make sure to change the owner of all the files to the hduser user and hadoop group, for example:

$ cd /usr/local
$ sudo tar xzf mahout-distribution-0.x.tar.gz
$ sudo mv mahout-distribution-0.x mahout
$ sudo chown -R hduser:hadoop mahout

This should result in a folder named /usr/local/mahout.
Now you can run any of the algorithms using the script bin/mahout in the extracted folder. To test your installation, you can also run bin/mahout without any other arguments. Next, set the path in the .bashrc file:

export MAHOUT_HOME=/usr/local/mahout
export PATH=$PATH:$MAHOUT_HOME/bin

Step 2: Create a directory where you would want to check out the Mahout code; we'll call it MAHOUT_HOME here:

$ sudo mkdir -p /app/mahout
$ sudo chown hduser:hadoop /app/mahout
# ...and if you want to tighten up security, chmod from 755 to 750...
$ sudo chmod 750 /app/mahout

Step 3: Set the Hadoop configuration path in hadoop-env.sh, adding the Mahout libraries to the classpath:

/usr/local/mahout/lib/*

Step 4: Install Maven:

$ sudo tar xzf apache-maven-2.0.9-bin.tar.gz
$ sudo mv apache-maven-2.0.9 maven
$ sudo chown -R hduser:hadoop maven

Then set the path in the .bashrc file:

export M2_HOME=/usr/local/maven
export PATH=$PATH:$M2_HOME/bin

Now we can start Mahout with the bin/mahout command; the Mahout installation is complete, and Mahout can be used with the Hadoop configuration.

Fig 2: output showing the Maven version.
4 Advantages and Disadvantages

Advantages: The paper "Map-Reduce for Machine Learning on Multicore" shows 10 machine learning algorithms that can benefit from the MapReduce model. The key point is that any algorithm fitting the Statistical Query Model may be written in a certain "summation form", and any algorithm that can be expressed in summation form can use the MapReduce programming model.

Disadvantages: MapReduce does not work well when there are computational dependencies in the data. This limitation makes it difficult to represent algorithms that operate on structured models. As a consequence, when confronted with large-scale problems, we often abandon rich structured models in favor of overly simplistic methods that are amenable to the MapReduce abstraction. In the machine learning community, numerous algorithms iteratively transform parameters during both learning and inference, e.g., belief propagation, expectation maximization, gradient descent and Gibbs sampling. These algorithms iteratively refine a set of parameters until some termination criterion is met. Invoking MapReduce in each iteration can still speed up the computation, but a better abstraction framework is desirable: one that can embrace the graphical structure of the data, express sophisticated scheduling, or automatically assess termination.
5 Conclusions

By virtue of its simplicity and fault tolerance, MapReduce proved to be an admirable gateway to parallelizing machine learning applications. The benefits of easy development and robust computation did come at a price in terms of performance, however. Furthermore, proper tuning of the system led to substantial variance in performance. Defining common settings for machine learning algorithms led us to discover the core shortcomings of Hadoop's MapReduce implementation. We were able to address the most significant of these: the need to broadcast data to map tasks. We also greatly improved the convenience of running MapReduce jobs from within existing Java applications. However, the ability to tie together the distribution of parallel files on the DFS remains an outstanding challenge. All in all, MapReduce represents a promising direction for future machine learning implementations. In continuing to develop Hadoop and the tools that surround it, we must strive to minimize the compromises between convenience and performance, providing a platform that allows for efficient processing and rapid application development.
6 References

 http://mahout.apache.org/
 https://chameerawijebandara.wordpress.com/2014/01/03/install-mahout-in-ubuntu-for-beginners/
 http://nivirao.blogspot.com/2012/04/installing-apache-mahout-on-ubuntu.html
 https://help.ubuntu.com/community/Java
 http://mahout.apache.org/developers/buildingmahout.html
