This document outlines the agenda and steps for a hands-on session on implementing linear regression with MapReduce on Hadoop. The objective is to use a dataset of fish observations to predict the number of rings on a fish's shell from its other attributes. Attendees copy the sample data to HDFS, generate a larger dataset by replicating the samples with added noise, and then run a MapReduce job to train the linear regression model.
2. Data Computing Division
Prerequisites
•Make sure you have VMware Player installed
•VMware Fusion for Mac OS X
•Copy the GPHD (Greenplum Distribution of Hadoop v1.0) virtual machine to your laptop
•Also copy the exercise.zip file to your laptop, and decompress it
Monday, February 18, 13
Setting Up
•Start the GPHD Virtual Machine
•Make sure you can log in to it
•Copy exercise.zip from your laptop to the VM, and unzip it in ~/exercise
Hands-On
•Objective: Implement Linear Regression using MapReduce, and use it to train a model
•Data Set: from Marine Resources Division, Department of Primary Industries and Fisheries, Tasmania
•4177 samples from observations
Data
•Attributes about a type of fish
•M/F, Length, Diameter, Height, Weight, Rings on shell
•Problem: To predict number of rings as a function of other attributes
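One way to write the prediction problem above as a linear model is sketched below. The numeric encoding of the M/F attribute (for example 1 for M and 0 for F) and the weight symbols are assumptions made for illustration; the slides do not specify them.

```latex
\[
\text{rings} \approx w_0 + w_1\,\text{sex} + w_2\,\text{length}
  + w_3\,\text{diameter} + w_4\,\text{height} + w_5\,\text{weight}
\]
\[
\min_{w}\ \sum_{n=1}^{N} \bigl(\text{rings}_n - w^{\top} x_n\bigr)^2,
\qquad x_n = (1,\ \text{sex}_n,\ \text{length}_n,\ \text{diameter}_n,\ \text{height}_n,\ \text{weight}_n)
\]
```

Training then amounts to choosing the weights that minimize the sum of squared errors over the N observations.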
Step 1
•Copy the small sample data set to HDFS
•See: Scripts/cp_to_grid.sh
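The contents of Scripts/cp_to_grid.sh are not shown in the slides. As a rough sketch of what copying a file to HDFS involves, the Python snippet below wraps the standard `hadoop fs -mkdir` and `hadoop fs -put` commands. The function names and the example paths are hypothetical.

```python
import subprocess


def put_command(local_path, hdfs_dir):
    """Build the `hadoop fs -put` command that uploads one local file."""
    return ["hadoop", "fs", "-put", local_path, hdfs_dir]


def copy_to_hdfs(local_path, hdfs_dir):
    """Create the target HDFS directory, then upload the file into it.

    Assumes the `hadoop` client is on PATH and that hdfs_dir does not
    exist yet (the mkdir call fails otherwise).
    """
    subprocess.check_call(["hadoop", "fs", "-mkdir", hdfs_dir])
    subprocess.check_call(put_command(local_path, hdfs_dir))


# Hypothetical usage on the VM (paths are placeholders, not from the slides):
# copy_to_hdfs("sample_data.csv", "exercise/input")
```

Running `hadoop fs -ls` on the target directory afterwards is a quick way to confirm the upload.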
Step 2
•Blow up the dataset 1000 times by adding Gaussian noise to most fields
•Output: about 4M sample observations (4177 × 1000)
•Using Hadoop Streaming
•See: Scripts/stream_replicate.sh
•Monitor this job in JobTracker UI
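The replication step can be pictured as a map-only Hadoop Streaming job whose mapper emits many noisy copies of each input line. The sketch below is not the contents of Scripts/stream_replicate.sh, which the slides do not show. It assumes comma-separated records laid out as sex,length,diameter,height,weight,rings, perturbs only the four measurements, and uses a made-up noise scale.

```python
#!/usr/bin/env python
"""Hadoop Streaming mapper sketch: emit noisy copies of each input record."""
import random
import sys

COPIES = 1000        # replication factor from the slide
NOISE_STDDEV = 0.01  # hypothetical noise scale, not taken from the source


def replicate(line, copies=COPIES, stddev=NOISE_STDDEV, rng=random):
    """Yield `copies` noisy versions of one comma-separated record.

    Assumed layout: sex,length,diameter,height,weight,rings.
    The sex and rings fields are passed through unchanged.
    """
    fields = line.strip().split(",")
    sex, measurements, rings = fields[0], fields[1:-1], fields[-1]
    for _ in range(copies):
        noisy = ["%.4f" % (float(v) + rng.gauss(0.0, stddev))
                 for v in measurements]
        yield ",".join([sex] + noisy + [rings])


def main(stdin=sys.stdin, stdout=sys.stdout):
    """Read records from stdin and write the replicated records to stdout."""
    for line in stdin:
        if line.strip():
            for record in replicate(line):
                stdout.write(record + "\n")


if __name__ == "__main__":
    main()
```

Because each input line is handled independently, no reducer is needed for this step, and the job parallelizes across however many map tasks the cluster runs.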
Step 3
•Train model based on Linear Regression
•See: Scripts/stream_train_linreg.sh
•Monitor the Job
•Copy the model to a local directory
•Check it
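The slides do not say how Scripts/stream_train_linreg.sh trains the model. One common way to fit linear regression in a single MapReduce pass, shown here as an assumption rather than as the session's method, is to have each mapper accumulate the sums X^T X and X^T y over its input split, and have one reducer add them up and solve the normal equations (X^T X) w = X^T y. The record layout and the 0/1 encoding of M/F are also assumptions, and all function names are hypothetical.

```python
"""Sketch of linear regression as one MapReduce pass over normal-equation sums."""


def features(line):
    """Parse one record into (x, y); x starts with a bias term of 1.0.

    Assumed layout: sex,length,diameter,height,weight,rings, with sex
    encoded as 1.0 for "M" and 0.0 otherwise (an illustrative choice).
    """
    fields = line.strip().split(",")
    sex = 1.0 if fields[0] == "M" else 0.0
    x = [1.0, sex] + [float(v) for v in fields[1:-1]]
    return x, float(fields[-1])


def map_partial_sums(lines):
    """Mapper: accumulate X^T X and X^T y over this mapper's input split."""
    xtx, xty = None, None
    for line in lines:
        if not line.strip():
            continue
        x, y = features(line)
        if xtx is None:
            d = len(x)
            xtx = [[0.0] * d for _ in range(d)]
            xty = [0.0] * d
        for i in range(len(x)):
            xty[i] += x[i] * y
            for j in range(len(x)):
                xtx[i][j] += x[i] * x[j]
    return xtx, xty


def reduce_sums(partials):
    """Reducer: add the partial sums produced by all mappers."""
    partials = [p for p in partials if p[0] is not None]
    d = len(partials[0][1])
    xtx = [[sum(p[0][i][j] for p in partials) for j in range(d)]
           for i in range(d)]
    xty = [sum(p[1][i] for p in partials) for i in range(d)]
    return xtx, xty


def solve(a, b):
    """Solve a * w = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    w = [0.0] * n
    for i in range(n - 1, -1, -1):
        tail = sum(m[i][j] * w[j] for j in range(i + 1, n))
        w[i] = (m[i][n] - tail) / m[i][i]
    return w
```

In a real streaming job the mapper would read its split from stdin and print its sums as one line, and the reducer would parse those lines before calling `solve`. The appeal of this formulation is that the data exchanged between mappers and the reducer has a fixed size (a 6 × 6 matrix and a 6-element vector under the layout assumed here), no matter how many observations are processed.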