The document shows how to use the R programming language and Amazon's Elastic MapReduce service to create a Hadoop cluster on Amazon Web Services in only 15 minutes. It demonstrates a stochastic simulation that estimates π by distributing 1,000 simulation runs across the Hadoop cluster and averaging the results. Running the 15-minute cluster cost only $0.15 in total, showing how inexpensive it can be to use Hadoop's capabilities.
1. useR Vignette:
R + 15 minutes =
Hadoop cluster
Greater Boston useR Group
February 2011
by
Jeffrey Breen
jbreen@cambridge.aero
2. Agenda
● What's Hadoop?
● But I don't have Big Data
● Building the cluster
● Estimating π stochastically
● Want to know more?
useR Vignette: R + 15 minutes = Hadoop Cluster Greater Boston useR Meeting, February 2011 Slide 2
3. MapReduce, Hadoop and Big Data
● Hadoop is an open source implementation of Google's MapReduce-based data processing infrastructure
● Designed to process huge data sets
– “huge” = “all of Facebook's web logs”
– Yahoo! sorted 1TB in 62 seconds in May 2009
– The HDFS distributed file system makes replication decisions based on knowledge of network topology
● Amazon Elastic MapReduce is a full Hadoop stack running on EC2
4. MapReduce = Map + shuffle + Reduce
Source: http://developer.yahoo.com/hadoop/tutorial/module4.html
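The three phases can be sketched locally in base R, without Hadoop. This is only a toy word count meant to illustrate the data flow; the names `docs`, `mapped`, `shuffled` and `reduced` are illustrative and do not come from the talk or from any Hadoop API.

```r
# Toy word count run locally to show the three MapReduce phases.
# All names here (docs, mapped, shuffled, reduced) are illustrative.
docs <- c("big data big cluster", "small data")

# Map: emit one (key, value) pair per word, here (word, 1).
mapped <- do.call(rbind, lapply(docs, function(doc) {
  words <- strsplit(doc, " ", fixed = TRUE)[[1]]
  data.frame(key = words, value = 1, stringsAsFactors = FALSE)
}))

# Shuffle: group the values by key so each key's values end up together.
shuffled <- split(mapped$value, mapped$key)

# Reduce: collapse each key's values to a single result.
reduced <- sapply(shuffled, sum)
print(reduced)
```

On a real cluster the map and reduce steps run on different nodes and the shuffle moves data between them; here everything happens in one R session.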
5. But I don't have Big Data
● Agricultural economist J.D. Long doesn't either, but he does have a bunch of simulations to run
● Had a key insight: the input can be a small amount of data (like 1:1000) that serves as random seeds for simulation code in the “mapper” function
● Enjoy Hadoop's infrastructure for job scheduling, fault tolerance, inter-node communication, etc.
● Use Amazon's cloud to scale up quickly as needed
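The seed-as-input idea can be tried on a single machine first. The sketch below is a local stand-in, assuming a made-up `simulate()` function that is not part of the talk; base R's `lapply()` plays the role that a distributed apply would play on the cluster.

```r
# Hypothetical stand-in for real simulation code: the seed is the only input.
simulate <- function(seed) {
  set.seed(seed)        # make each run reproducible from its seed alone
  mean(rnorm(1000))     # placeholder for the actual simulation
}

# The "data" is just a list of seeds, one per simulation run.
seeds <- as.list(1:10)
results <- lapply(seeds, simulate)

# Rerunning a seed gives the same answer, so a failed task can be retried
# on any node without changing the overall result.
print(identical(simulate(3), results[[3]]))
```

Because each run depends only on its seed, the runs are independent and can be handed out to workers in any order.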
6. Load the segue library
> library(segue)
Loading required package: rJava
Loading required package: caTools
Loading required package: bitops
Segue did not find your AWS credentials. Please run the setCredentials() function.
> setCredentials('YOUR_ACCESS_KEY_ID', 'YOUR_SECRET_ACCESS_KEY')
7. Start the cluster
> myCluster <- createCluster(numInstances=5)
STARTING - 2011-01-04 15:07:53
[…]
BOOTSTRAPPING - 2011-01-04 15:11:28
[…]
WAITING - 2011-01-04 15:15:35
Your Amazon EMR Hadoop Cluster is ready for action.
Remember to terminate your cluster with stopCluster().
Amazon is billing you!
8. Estimate π stochastically
> estimatePi <- function(seed){
    set.seed(seed)
    numDraws <- 1e6
    r <- .5  # radius
    x <- runif(numDraws, min=-r, max=r)
    y <- runif(numDraws, min=-r, max=r)
    inCircle <- ifelse( (x^2 + y^2)^.5 < r , 1, 0)
    return(sum(inCircle) / length(inCircle) * 4)
  }
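The factor of 4 in the last line of the function follows from comparing areas. As a short derivation, with $N$ standing for `numDraws` and $N_{\text{in}}$ for `sum(inCircle)` (symbols introduced here for illustration): points are drawn uniformly from a square of side $2r$, and the circle of radius $r$ sits inside it.

```latex
P(\text{point in circle})
  = \frac{\text{area of circle}}{\text{area of square}}
  = \frac{\pi r^{2}}{(2r)^{2}}
  = \frac{\pi}{4}
\qquad\Longrightarrow\qquad
\pi \approx 4 \cdot \frac{N_{\text{in}}}{N}
```

The radius cancels, so the choice `r <- .5` does not affect the estimate.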
9. Run the simulation
> seedList <- as.list(1:1e3)
> myEstimates <- emrlapply( myCluster, seedList, estimatePi )
RUNNING - 2011-01-04 15:22:28
[…]
WAITING - 2011-01-04 15:32:18
> myPi <- Reduce(sum, myEstimates) / length(myEstimates)
> format(myPi, digits=10)
[1] "3.141586544"
> format(pi, digits=10)
[1] "3.141592654"
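As a sanity check on the result, the gap between the estimate and `pi` can be compared with the standard error expected from this many draws. This is a sketch that assumes every draw is independent, which the talk does not claim explicitly; the variable names are illustrative.

```r
# Each draw lands in the circle with probability p = pi/4 (a Bernoulli trial).
p <- pi / 4

# 1,000 simulations of 1e6 draws each, as in the run above.
totalDraws <- 1e6 * 1e3

# Standard error of 4 * (fraction in circle), assuming independent draws.
stdError <- 4 * sqrt(p * (1 - p) / totalDraws)

# Gap between the estimate reported above and R's built-in pi.
observedError <- abs(3.141586544 - 3.141592654)

print(stdError)
print(observedError)
print(observedError / stdError)
```

The standard error comes out at roughly 5e-5, while the observed gap is about 6e-6, so the estimate is well within one standard error of the true value.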
10. Won't break the bank
● Total cost: $0.15

Standard On-Demand Instances   Amazon EC2 price per hour   Amazon Elastic MapReduce price per hour
Small (Default)                $0.085                      $0.015
Large                          $0.34                       $0.06
Extra Large                    $0.68                       $0.12
11. Want to know more?
● J.D. Long's segue package
– http://code.google.com/p/segue/
● Hadoop
– http://hadoop.apache.org/
– Book: http://oreilly.com/catalog/0636920010388
● My blog
– http://jeffreybreen.wordpress.com/2011/01/10/segue-r-to-a