16. Functions
> sq <- function(x) { x*x }
> sq(3)
[1] 9
Note:
R is a functional programming language.
Functions are first-class objects
and can be passed to other functions.
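Because functions are ordinary objects, they can be handed to other functions just like data. A minimal sketch using the sq function from the slide above (apply_twice is a hypothetical helper, not from the slides):

```r
sq <- function(x) { x * x }
apply_twice <- function(f, x) { f(f(x)) }  # calls f on its own result
apply_twice(sq, 3)  # sq(sq(3)) = 81
sapply(1:4, sq)     # pass sq itself to sapply: 1 4 9 16
```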
18. Agenda
• R Basics
• Hadoop Basics
• Data Manipulation
• Rhadoop
19. “In pioneer days they used oxen for heavy
pulling, and when one ox couldn’t budge a log,
they didn’t try to grow a larger ox”
— Grace Hopper, early advocate of distributed computing
21. Map-Reduce is the interesting bit
• Map – Apply a function to each input record
• Shuffle & Sort – Partition the map output and sort
each partition
• Reduce – Apply an aggregation function to all values
in each partition
• Map reads input from disk
• Reduce writes output to disk
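The three phases can be sketched in plain R with a toy word count (no Hadoop involved; split and sapply stand in for the shuffle and reduce machinery):

```r
input <- c("a b a", "b c")        # two input records
# Map: turn each record into (word, 1) pairs
words <- unlist(strsplit(input, " "))
ones  <- rep(1, length(words))
# Shuffle & Sort: group the 1s by word (split also sorts the keys)
groups <- split(ones, words)
# Reduce: aggregate the values in each group
counts <- sapply(groups, sum)     # a = 2, b = 2, c = 1
```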
23. Sessionize
Identify unique “sessions” of interaction with our
website
Session – for each user (IP), the set of clicks that
happened within 30 minutes of each other
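The 30-minute rule can be sketched for a single IP whose click times (in seconds) are already sorted; the timestamps here are made up for illustration:

```r
clicks <- c(0, 600, 3000, 3600, 20000)  # hypothetical sorted click times (seconds)
gaps   <- c(0, diff(clicks))            # gap to the previous click
# A new session starts whenever the gap exceeds 30 minutes
session_id <- cumsum(gaps > 30 * 60)    # 0 0 1 1 2
```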
24. Input – Apache Access Log Records
127.0.0.1 - frank
[10/Oct/2000:13:55:36 -0700]
"GET /apache_pb.gif HTTP/1.0"
200 2326
25. Output – Add Session ID
127.0.0.1 - frank
[10/Oct/2000:13:55:36 -0700]
"GET /apache_pb.gif HTTP/1.0"
200 2326 15
26. Overview
[Diagram: log lines are distributed across Map tasks; each Map emits (IP, log lines); each Reduce emits (log line, session ID).]
30. Agenda
• R Basics
• Hadoop Basics
• Data Manipulation Libraries
• Rhadoop
31. Reshape2
• Two functions:
• melt – wide format to long format
• cast (dcast/acast) – long format to wide format
• Columns: identifiers or measured variables
• Molten data:
• Unique identifiers
• New column – variable name
• New column – value
• Default – all numeric columns are treated as values
32. Melt
> tips
total_bill tip sex smoker day time size
16.99 1.01 Female No Sun Dinner 2
10.34 1.66 Male No Sun Dinner 3
21.01 3.50 Male No Sun Dinner 3
> melt(tips)
sex smoker day time variable value
Female No Sun Dinner total_bill 16.99
Female No Sun Dinner tip 1.01
Female No Sun Dinner size 2
33. Cast
> m_tips <- melt(tips)
sex smoker day time variable value
Female No Sun Dinner total_bill 16.99
Female No Sun Dinner tip 1.01
Female No Sun Dinner size 2
> dcast(m_tips,sex+time~variable,mean)
sex time total_bill tip size
Female Dinner 19.21308 3.002115 2.461538
Female Lunch 16.33914 2.582857 2.457143
Male Dinner 21.46145 3.144839 2.701613
Male Lunch 18.04848 2.882121 2.363636
34. *Apply
• apply – apply a function over the rows or columns of a matrix
• lapply – apply a function to each element of a list
• Returns a list
• sapply – like lapply, but returns a vector when possible
• tapply – apply a function to subsets of a vector, defined by a grouping factor
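A quick tour of the four, using toy inputs:

```r
m <- matrix(1:6, nrow = 2)
apply(m, 1, sum)                     # row sums: 9 12
lapply(list(1:3, 4:6), mean)         # list of means: 2, 5
sapply(list(1:3, 4:6), mean)         # same, simplified to a vector: 2 5
groups <- tapply(c(1, 2, 3, 4),
                 c("a", "b", "a", "b"), sum)  # group sums: a = 4, b = 6
```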
35. plyr
• Split – apply – combine
• ddply – data frame in, data frame out
ddply(.data, .variables, .fun = NULL, ...)
• summarize – aggregate data into a new data frame
• transform – modify an existing data frame
36. DDPLY Example
> ddply(tips,c("sex","time"),summarize,
+ mean=mean(tip),
+ sd=sd(tip),
+ ratio=mean(tip/total_bill)
+ )
sex time mean sd ratio
1 Female Dinner 3.002115 1.193483 0.1693216
2 Female Lunch 2.582857 1.075108 0.1622849
3 Male Dinner 3.144839 1.529116 0.1554065
4 Male Lunch 2.882121 1.329017 0.1660826
37. Agenda
• R Basics
• Hadoop Basics
• Data Manipulation Libraries
• Rhadoop
39. Most Important:
RMR does not parallelize algorithms.
It allows you to implement MapReduce in R.
Efficiently. That’s it.
40. What does that mean?
• Use RMR if you can break your problem down into
small pieces and apply the algorithm to each piece
• Use commercial R+Hadoop if you need a parallel
version of a well-known algorithm
• Good fit: Fit piecewise regression model for each
county in the US
• Bad fit: Fit piecewise regression model for the entire
US population
• Bad fit: Logistic regression
41. Use-case examples – Good or Bad?
1. Model power consumption per household to
determine if incentive programs work
2. Aggregate corn yield per 10x10 portion of field to
determine best seeds to use
3. Create churn models for service subscribers and
determine who is most likely to cancel
4. Determine correlation between device restarts and
support calls
42. Second Most Important:
RMR requires R, RMR and all libraries you’ll
use to be installed on all nodes and
accessible to the Hadoop user
43. RMR is different from Hadoop Streaming.
RMR mapper input:
Key, [List of Records]
This is so we can use vector operations
45. In more detail…
• Mappers get a list of values
• You need to process each one independently
• But do it for all lines at once
• Reducers work normally
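A sketch of what a vectorized mapper body looks like. Here keyval is a minimal stand-in for the rmr2 helper (so the sketch runs without Hadoop), and the batch data frame is made-up input:

```r
# Stand-in for rmr2's keyval(), just to make the sketch self-contained
keyval <- function(key, val) list(key = key, val = val)

# v arrives as a whole batch of records (a data frame), so one column
# operation processes every line at once, with no per-record loop
vec.map <- function(k, v) keyval(v$name, v$tip * 2)

batch <- data.frame(name = c("Gwen", "Jeff"), tip = c(1, 2))
out <- vec.map(NULL, batch)   # vals: 2 4
```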
48. Avg Tips per Person – Naïve Input
Gwen 1
Jeff 2
Leon 1
Gwen 2.5
Leon 3
Jeff 1
Gwen 1
Gwen 2
Jeff 1.5
49. Avg Tips per Person - Naive
avg.map <- function(k, v) { keyval(v$V1, v$V2) }
avg.reduce <- function(k, v) { keyval(k, mean(v)) }
mapreduce(input = "~/hadoop-recipes/data/tip1.txt",
          output = "~/avg.txt",
          input.format = make.input.format("csv"),
          output.format = "text",
          map = avg.map, reduce = avg.reduce)
50. Avg Tips per Person – Awesome Input
Gwen 1,2.5,1,2
Jeff 2,1,1.5
Leon 1,3
51. Avg Tips per Person - Optimized
avg2.map <- function(k, v) {
  v1 <- sapply(v$V2, function(x) { strsplit(as.character(x), " ") })
  keyval(v$V1, sapply(v1, function(x) { mean(as.numeric(x)) }))
}
mapreduce(input = "~/hadoop-recipes/data/tip2.txt",
          output = "~/avg2.txt",
          input.format = make.input.format("csv", sep = ","),
          output.format = "text", map = avg2.map)
52. Few Final RMR Tips
• Backend = "local" uses local files as input and output
• Backend = "hadoop" uses HDFS directories
• In "hadoop" mode, print(X) inside the mapper will fail
the job.
• Use: cat("ERROR!", file = stderr())
Modern CPUs are optimized with vector instructions, so many vector operations can be performed on an entire vector in a single instruction. Loops take many instructions, both for the operations themselves and for the loop bookkeeping.
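This speaker note can be seen directly in R, where the vectorized form and an explicit loop compute the same sum of squares (the data here is arbitrary):

```r
x <- as.numeric(1:1e5)
vectorized <- sum(x * x)        # one vectorized pass over the data
loop_sum <- function(v) {       # explicit loop: same math, many more instructions
  s <- 0
  for (i in v) s <- s + i * i
  s
}
looped <- loop_sum(x)           # agrees with the vectorized result
```

Timing the two versions with system.time() shows the loop is dramatically slower for large vectors.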
This quote is excerpted from the one at the beginning of Chapter 1 in Hadoop: The Definitive Guide by Tom White.
Example to illustrate MR
RevolutionR and Oracle have (expensive) packages of popular algorithms, parallelized.
Just saved you hours of debugging. You can thank me later