Scalable Analytics 
with 
R, Hadoop and RHadoop 
Gwen Shapira, Software Engineer 
@gwenshap 
gshapira@cloudera.com
#include warning.h 
Agenda 
• R Basics 
• Hadoop Basics 
• Data Manipulation 
• RHadoop
Get Started with RStudio
Basic Data Types 
• String 
• Number 
• Boolean 
• Assignment <- 
R can be a nice calculator 
> x <- 1 
> x * 2 
[1] 2 
> y <- x + 3 
> y 
[1] 4 
> log(y) 
[1] 1.386294 
> help(log) 
Complex Data Types 
• Vector
  • c, seq, rep, []
• List
• Data Frame
  • List of vectors of the same length
  • Not a matrix
Creating vectors
> (v1 <- c(1,2,3,4))
[1] 1 2 3 4
> v1 * 4
[1] 4 8 12 16
> (v4 <- c(1:5))
[1] 1 2 3 4 5
> (v2 <- seq(2,12,by=3))
[1] 2 5 8 11
> v1 * v2
[1] 2 10 24 44
> (v3 <- rep(3,4))
[1] 3 3 3 3
Accessing and filtering vectors
> (v1 <- c(2,4,6,8))
[1] 2 4 6 8 
> v1[2] 
[1] 4 
> v1[2:4] 
[1] 4 6 8 
> v1[-2] 
[1] 2 6 8 
> v1[v1>3] 
[1] 4 6 8 
Lists
> (lst <- list(1,"x",FALSE))
[[1]] 
[1] 1 
[[2]] 
[1] "x" 
[[3]] 
[1] FALSE 
> lst[1] 
[[1]] 
[1] 1 
> lst[[1]] 
[1] 1 
Data Frames 
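books <- read.csv("~/books.csv")
books[1,]                   # first row
books[,1]                   # first column
books[3:4]                  # columns 3 and 4
books$price                 # the price column, as a vector
books[books$price==6.99,]   # rows where the price is 6.99
martin_price <- books[books$author_t=="George R.R. Martin",]$price
mean(martin_price)
subset(books,select=-c(id,cat,sequence_i))   # drop three columns by name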
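The books.csv file is not part of the slides, so here is a minimal sketch of the same operations on a small in-memory data frame. The column names `id`, `author_t` and `price` follow the slide; the rows and prices are made up for illustration.

```r
# Hypothetical stand-in for books.csv (all prices are made up)
books <- data.frame(
  id       = 1:4,
  author_t = c("George R.R. Martin", "George R.R. Martin",
               "Isaac Asimov", "Rick Riordan"),
  price    = c(7.99, 6.99, 5.99, 6.99),
  stringsAsFactors = FALSE
)

cheap <- books[books$price == 6.99, ]      # logical filter on rows
martin_price <- books[books$author_t == "George R.R. Martin", ]$price
martin_mean <- mean(martin_price)          # average of 7.99 and 6.99
no_id <- subset(books, select = -c(id))    # drop a column by name
```

Row filters take a logical vector before the comma; leaving the part after the comma empty keeps every column.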
Functions 
> sq <- function(x) { x*x } 
> sq(3) 
[1] 9 
Note: 
R is a functional programming language. 
Functions are first-class objects and can be passed to other functions.
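A short sketch of what passing functions around looks like in practice. `apply_twice` is a name invented for this example; `sq` is the function from the slide.

```r
sq <- function(x) { x * x }

# A function that takes another function as an argument
apply_twice <- function(f, x) { f(f(x)) }
fourth_power <- apply_twice(sq, 3)        # sq(sq(3))

# Passing a named function and an anonymous function to sapply
squares <- sapply(1:4, sq)
plus_one <- sapply(1:4, function(x) x + 1)
```

This is the same pattern the apply family and the RMR map and reduce arguments rely on later in the deck.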
Packages
Agenda 
• R Basics 
• Hadoop Basics 
• Data Manipulation 
• RHadoop
“In pioneer days they used oxen for heavy 
pulling, and when one ox couldn’t budge a log, 
they didn’t try to grow a larger ox”
— Grace Hopper, early advocate of distributed computing
Hadoop in a Nutshell
MapReduce is the interesting bit
• Map – Apply a function to each input record
• Shuffle & Sort – Partition the map output by key and sort each partition
• Reduce – Apply an aggregation function to the values of each key in a partition
• Map reads input from disk
• Reduce writes output to disk
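The three stages can be imitated in memory with base R. This is only a toy illustration of the data flow, not how Hadoop runs it: the input records are made up, and `map_fn` is a name chosen for this sketch.

```r
# Toy input: one record per element (made-up data)
records <- c("a b a", "b c", "a")

# Map: apply a function to each record, emitting (key, value) pairs
map_fn <- function(record) {
  words <- strsplit(record, " ")[[1]]
  data.frame(key = words, value = 1, stringsAsFactors = FALSE)
}
mapped <- do.call(rbind, lapply(records, map_fn))

# Shuffle & sort: group the values by key, with keys in sorted order
shuffled <- split(mapped$value, mapped$key)

# Reduce: aggregate the values of each key
reduced <- sapply(shuffled, sum)
```

On a cluster, each stage works on a slice of the data on a different machine, and the shuffle moves all values for a key to the same reducer.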
Example – Sessionize clickstream 
Sessionize 
Identify unique “sessions” of interaction with our website
Session – for each user (IP), the set of clicks that happened within 30 minutes of each other
Input – Apache Access Log Records 
127.0.0.1 - frank [10/Oct/2000:13:55:36 -0700] "GET /apache_pb.gif HTTP/1.0" 200 2326
Output – Add Session ID 
127.0.0.1 - frank [10/Oct/2000:13:55:36 -0700] "GET /apache_pb.gif HTTP/1.0" 200 2326 15
Overview 
[Diagram: three Map tasks each read log lines and emit (IP1, log lines); two Reduce tasks emit (log line, session ID)]
Map
parsedRecord = re.search('(\d+\.\d+….', record)
IP = parsedRecord.group(1)
timestamp = parsedRecord.group(2)
print((IP, timestamp), record)
Shuffle & Sort 
Partition by: IP 
Sort by: timestamp 
Now reduce gets: 
(IP,timestamp) [record1,record2,record3….] 
Reduce
sessionID = 1
curr_timestamp = getTimestamp(records[0])
for record in records:
    # a gap of more than 30 minutes since the previous click starts a new session
    if getTimestamp(record) - curr_timestamp > 30:
        sessionID += 1
    curr_timestamp = getTimestamp(record)
    print(record + " " + str(sessionID))
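The same reduce logic can be written in vectorized R. This is a sketch under assumptions: timestamps are already parsed into minutes and sorted for one user, and `sessionize` and the click times are invented for the example.

```r
# Assign session IDs to one user's clicks.
# `minutes` holds click times in minutes, already sorted (made-up values).
sessionize <- function(minutes, gap = 30) {
  # TRUE wherever the pause since the previous click exceeds the gap
  new_session <- c(FALSE, diff(minutes) > gap)
  # Each TRUE bumps the running session counter
  1 + cumsum(new_session)
}

clicks <- c(0, 10, 35, 70, 75, 200)
session_ids <- sessionize(clicks)
```

Here the clicks at minutes 0, 10 and 35 share a session, 70 and 75 form the next one, and 200 starts a third.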
Agenda 
• R Basics 
• Hadoop Basics 
• Data Manipulation Libraries 
• RHadoop
Reshape2 
• Two functions:
  • melt – wide format to long format
  • cast – long format to wide format
• Columns: identifiers or measured variables
• Molten data:
  • Unique identifiers
  • New column – variable name
  • New column – value
• Default – all numeric columns are treated as measured values
Melt 
> tips 
total_bill tip sex smoker day time size 
16.99 1.01 Female No Sun Dinner 2 
10.34 1.66 Male No Sun Dinner 3 
21.01 3.50 Male No Sun Dinner 3 
> melt(tips) 
sex smoker day time variable value 
Female No Sun Dinner total_bill 16.99 
Female No Sun Dinner tip 1.01 
Female No Sun Dinner size 2 
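To see what melting does mechanically, here is a hand-rolled sketch in base R on the three rows shown above. It is not reshape2's implementation, just the same idea: numeric columns become (variable, value) pairs and the remaining columns are repeated as identifiers.

```r
# First three rows of the tips data shown above
tips <- data.frame(
  total_bill = c(16.99, 10.34, 21.01),
  tip        = c(1.01, 1.66, 3.50),
  sex        = c("Female", "Male", "Male"),
  smoker     = c("No", "No", "No"),
  day        = c("Sun", "Sun", "Sun"),
  time       = c("Dinner", "Dinner", "Dinner"),
  size       = c(2, 3, 3),
  stringsAsFactors = FALSE
)

# Numeric columns are measured variables; the rest are identifiers
is_measure <- sapply(tips, is.numeric)
ids <- tips[, !is_measure]

# Stack one block of rows per measured variable
molten <- do.call(rbind, lapply(names(tips)[is_measure], function(name) {
  data.frame(ids, variable = name, value = tips[[name]],
             stringsAsFactors = FALSE)
}))
```

Three input rows with three measured columns give nine molten rows.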
Cast 
> m_tips <- melt(tips) 
sex smoker day time variable value 
Female No Sun Dinner total_bill 16.99 
Female No Sun Dinner tip 1.01 
Female No Sun Dinner size 2 
> dcast(m_tips,sex+time~variable,mean) 
sex time total_bill tip size 
Female Dinner 19.21308 3.002115 2.461538 
Female Lunch 16.33914 2.582857 2.457143 
Male Dinner 21.46145 3.144839 2.701613 
Male Lunch 18.04848 2.882121 2.363636 
*Apply 
• apply – apply a function over the rows or columns of a matrix
• lapply – apply a function to each item of a list
  • Returns a list
• sapply – like lapply, but simplifies the result to a vector where possible
• tapply – apply a function to subsets of a vector, grouped by a factor
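One small call for each member of the family, using made-up inputs:

```r
m <- matrix(1:6, nrow = 2)            # 2 rows, 3 columns
row_sums <- apply(m, 1, sum)          # 1 = over rows
col_sums <- apply(m, 2, sum)          # 2 = over columns

words <- list("hello", "hadoop")
as_list <- lapply(words, nchar)       # returns a list
as_vector <- sapply(words, nchar)     # simplified to a vector

tip <- c(1, 2, 1, 2.5, 3)
person <- c("Gwen", "Jeff", "Leon", "Gwen", "Leon")
mean_tip <- tapply(tip, person, mean) # one mean per person
```

tapply is the closest of the four to a reduce step: the second argument plays the role of the key.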
plyr 
• Split – apply – combine
• ddply – data frame to data frame
  ddply(.data, .variables, .fun = NULL, ...,
• summarize – aggregate data into a new data frame
• transform – modify a data frame
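Split, apply and combine can be spelled out step by step in base R. This sketch is not how plyr is implemented; it only shows the three stages that ddply wraps into one call, on made-up data shaped like the tips data set.

```r
# Made-up data in the shape of the tips data set
tips <- data.frame(
  sex  = c("Female", "Female", "Male", "Male"),
  time = c("Dinner", "Dinner", "Dinner", "Lunch"),
  tip  = c(1, 3, 2, 4),
  stringsAsFactors = FALSE
)

# Split: one data frame per (sex, time) combination
pieces <- split(tips, list(tips$sex, tips$time), drop = TRUE)

# Apply: summarize each piece into a one-row data frame
summaries <- lapply(pieces, function(piece) {
  data.frame(sex = piece$sex[1], time = piece$time[1],
             mean = mean(piece$tip), stringsAsFactors = FALSE)
})

# Combine: bind the one-row results into a single data frame
result <- do.call(rbind, summaries)
```

The shape is the same as MapReduce: the split is the shuffle, and the function applied to each piece is the reducer.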
DDPLY Example 
> ddply(tips,c("sex","time"),summarize, 
+ mean=mean(tip), 
+ sd=sd(tip), 
+ ratio=mean(tip/total_bill) 
+ ) 
sex time mean sd ratio 
1 Female Dinner 3.002115 1.193483 0.1693216 
2 Female Lunch 2.582857 1.075108 0.1622849 
3 Male Dinner 3.144839 1.529116 0.1554065 
4 Male Lunch 2.882121 1.329017 0.1660826 
Agenda 
• R Basics 
• Hadoop Basics 
• Data Manipulation Libraries 
• RHadoop
RHadoop Projects
• RMR 
• RHDFS 
• RHBase 
• (new) PlyRMR 
Most Important: 
RMR does not parallelize algorithms. 
It allows you to implement MapReduce in R. 
Efficiently. That’s it. 
What does that mean? 
• Use RMR if you can break your problem down into small pieces and apply the algorithm to each piece
• Use a commercial R+Hadoop product if you need a parallel version of a well-known algorithm
• Good fit: Fit a piecewise regression model for each county in the US
• Bad fit: Fit a piecewise regression model for the entire US population
• Bad fit: Logistic regression
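A sketch of the "good fit" shape: one independent model per group. For brevity it fits a plain linear model with lm rather than a piecewise regression, and the two "counties" and their values are made up.

```r
# Made-up data: two "counties", each with its own linear trend
obs <- data.frame(
  county = rep(c("A", "B"), each = 4),
  x      = rep(1:4, times = 2),
  y      = c(2, 4, 6, 8,  10, 9, 8, 7)
)

# Each county is an independent piece, so each fit could run in its own
# reducer; lapply stands in for that here
fits <- lapply(split(obs, obs$county), function(piece) lm(y ~ x, data = piece))
slopes <- sapply(fits, function(fit) coef(fit)[["x"]])
```

No fit needs data from another county, which is what makes the problem easy to spread across reducers. A single model over all rows has no such split.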
Use-case examples – Good or Bad? 
1. Model power consumption per household to determine if incentive programs work
2. Aggregate corn yield per 10x10 portion of field to determine best seeds to use
3. Create churn models for service subscribers and determine who is most likely to cancel
4. Determine correlation between device restarts and support calls
Second Most Important: 
RMR requires R, RMR and all the libraries you’ll use to be installed on all nodes and accessible to the Hadoop user
RMR is different from Hadoop Streaming. 
RMR mapper input: 
Key, [List of Records] 
This is so we can use vector operations 
How to RMRify a Problem 
In more detail… 
• Mappers get a list of values
• You need to process each one independently
• But do it for all lines at once
• Reducers work normally 
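A sketch of a mapper written in this batch style, runnable without Hadoop or rmr2. `keyval_sketch` is a hypothetical stand-in that only mimics the idea of rmr2's keyval (pairing keys with values); `wc_map_sketch` is likewise a name invented here.

```r
# Hypothetical stand-in for rmr2's keyval, so the sketch runs without Hadoop
keyval_sketch <- function(key, val) list(key = key, val = val)

# The mapper receives a whole batch of lines in v, not one line per call
wc_map_sketch <- function(k, v) {
  words <- unlist(strsplit(v, " "))   # strsplit is vectorized over lines
  keyval_sketch(words, rep(1, length(words)))
}

out <- wc_map_sketch(NULL, c("hello world", "don't worry be happy"))
```

Each line is still tokenized independently, but a single vectorized call handles the whole batch.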
Demo 6
> library(rmr2)
# try the tokenizing logic locally first
t <- list("hello world","don't worry be happy")
unlist(sapply(t,function (x) {strsplit(x," ")}))
# map: emit (word, 1) for every word in the batch of lines
wc.map <- function(k,v) {
  ret_k <- unlist(sapply(v,function(x){strsplit(x," ")}))
  keyval(ret_k,1)
}
# reduce: sum the counts for each word
wc.reduce <- function(k,v) {
  keyval(k,sum(v))}
mapreduce(input="~/hadoop-recipes/data/shakespeare/Shakespeare_2.txt",
  output="~/wc.json",input.format="text",output.format="json",
  map=wc.map,reduce=wc.reduce)
Cheating in MapReduce: 
Do everything possible to have 
map only jobs 
Avg Tips per Person – Naïve Input 
Gwen 1 
Jeff 2 
Leon 1 
Gwen 2.5 
Leon 3 
Jeff 1 
Gwen 1 
Gwen 2 
Jeff 1.5 
Avg Tips per Person - Naive 
avg.map <- function(k,v){keyval(v$V1,v$V2)} 
avg.reduce <- function(k,v) {keyval(k,mean(v))} 
mapreduce(input="~/hadoop-recipes/data/tip1.txt",
  output="~/avg.txt",
  input.format=make.input.format("csv"),
  output.format="text",
  map=avg.map,reduce=avg.reduce)
Avg Tips per Person – Awesome Input 
Gwen 1,2.5,1,2 
Jeff 2,1,1.5 
Leon 1,3 
Avg Tips per Person - Optimized 
avg2.map <- function(k,v) {
  v1 <- (sapply(v$V2,function(x){strsplit(as.character(x)," ")}))
  keyval(v$V1,sapply(v1,function(x){mean(as.numeric(x))}))
}
mapreduce(input="~/hadoop-recipes/data/tip2.txt",
  output="~/avg2.txt",
  input.format=make.input.format("csv",sep=","),
  output.format="text",map=avg2.map)
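Why no reducer is needed can be checked locally in base R. This sketch assumes lines laid out as on the pre-grouped input slide (a name, a space, then comma-separated tips); the separators in the real tip2.txt may differ.

```r
# Lines shaped like the pre-grouped input shown earlier (assumed layout:
# a name, a space, then comma-separated tips)
lines <- c("Gwen 1,2.5,1,2", "Jeff 2,1,1.5", "Leon 1,3")

parts <- strsplit(lines, " ")
person <- sapply(parts, `[`, 1)
tips_txt <- sapply(parts, `[`, 2)

# One mean per line; no grouping step is needed
avg <- sapply(strsplit(tips_txt, ","), function(x) mean(as.numeric(x)))
names(avg) <- person
```

Because every value for a person already sits on one line, the mean can be computed inside the mapper, and the job skips the shuffle and reduce phases.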
A Few Final RMR Tips
• Backend = "local" has files as input and output
• Backend = "hadoop" uses HDFS directories
• In "hadoop" mode, print(X) inside the mapper will fail the job.
  • Use: cat("ERROR!", file = stderr())
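A small helper along those lines, as a sketch. `debug_log` is a name invented here, not part of rmr2; it writes to standard error and hands the value back so it can be dropped into existing code.

```r
# Write a diagnostic line to standard error and return the value unchanged
debug_log <- function(label, value) {
  cat(label, ": ", format(value), "\n", sep = "", file = stderr())
  invisible(value)
}

n_words <- debug_log("words in batch", length(c("hello", "world")))
```

Returning the value invisibly keeps the helper from printing anything to standard output.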
Recommended Reading 
• http://cran.r-project.org/doc/manuals/R-intro.html
• http://blog.revolutionanalytics.com/2013/02/10-r-packages-every-data-scientist-should-know-about.html
• http://had.co.nz/reshape/paper-dsc2005.pdf
• http://seananderson.ca/2013/12/01/plyr.html
• https://github.com/RevolutionAnalytics/rmr2/blob/master/docs/tutorial.md
• http://cran.r-project.org/web/packages/data.table/index.html
53
54

More Related Content

What's hot

Syncsort et le retour d'expérience ComScore
Syncsort et le retour d'expérience ComScoreSyncsort et le retour d'expérience ComScore
Syncsort et le retour d'expérience ComScore
Modern Data Stack France
 
Data Wrangling and Oracle Connectors for Hadoop
Data Wrangling and Oracle Connectors for HadoopData Wrangling and Oracle Connectors for Hadoop
Data Wrangling and Oracle Connectors for Hadoop
Gwen (Chen) Shapira
 
Scaling HDFS to Manage Billions of Files with Key-Value Stores
Scaling HDFS to Manage Billions of Files with Key-Value StoresScaling HDFS to Manage Billions of Files with Key-Value Stores
Scaling HDFS to Manage Billions of Files with Key-Value Stores
DataWorks Summit
 

What's hot (20)

Apache Spark Architecture
Apache Spark ArchitectureApache Spark Architecture
Apache Spark Architecture
 
Syncsort et le retour d'expérience ComScore
Syncsort et le retour d'expérience ComScoreSyncsort et le retour d'expérience ComScore
Syncsort et le retour d'expérience ComScore
 
Introduction to Hadoop and Hadoop component
Introduction to Hadoop and Hadoop component Introduction to Hadoop and Hadoop component
Introduction to Hadoop and Hadoop component
 
Yahoo's Experience Running Pig on Tez at Scale
Yahoo's Experience Running Pig on Tez at ScaleYahoo's Experience Running Pig on Tez at Scale
Yahoo's Experience Running Pig on Tez at Scale
 
Apache Drill - Why, What, How
Apache Drill - Why, What, HowApache Drill - Why, What, How
Apache Drill - Why, What, How
 
Hadoop ecosystem
Hadoop ecosystemHadoop ecosystem
Hadoop ecosystem
 
Apache Hadoop MapReduce Tutorial
Apache Hadoop MapReduce TutorialApache Hadoop MapReduce Tutorial
Apache Hadoop MapReduce Tutorial
 
Spark Summit East 2015 Advanced Devops Student Slides
Spark Summit East 2015 Advanced Devops Student SlidesSpark Summit East 2015 Advanced Devops Student Slides
Spark Summit East 2015 Advanced Devops Student Slides
 
Scalable Data Science with SparkR
Scalable Data Science with SparkRScalable Data Science with SparkR
Scalable Data Science with SparkR
 
Apache Flink Deep Dive
Apache Flink Deep DiveApache Flink Deep Dive
Apache Flink Deep Dive
 
PyCascading for Intuitive Flow Processing with Hadoop (gabor szabo)
PyCascading for Intuitive Flow Processing with Hadoop (gabor szabo)PyCascading for Intuitive Flow Processing with Hadoop (gabor szabo)
PyCascading for Intuitive Flow Processing with Hadoop (gabor szabo)
 
Free Code Friday - Spark Streaming with HBase
Free Code Friday - Spark Streaming with HBaseFree Code Friday - Spark Streaming with HBase
Free Code Friday - Spark Streaming with HBase
 
Why you should care about data layout in the file system with Cheng Lian and ...
Why you should care about data layout in the file system with Cheng Lian and ...Why you should care about data layout in the file system with Cheng Lian and ...
Why you should care about data layout in the file system with Cheng Lian and ...
 
How to overcome mysterious problems caused by large and multi-tenancy Hadoop ...
How to overcome mysterious problems caused by large and multi-tenancy Hadoop ...How to overcome mysterious problems caused by large and multi-tenancy Hadoop ...
How to overcome mysterious problems caused by large and multi-tenancy Hadoop ...
 
Data Wrangling and Oracle Connectors for Hadoop
Data Wrangling and Oracle Connectors for HadoopData Wrangling and Oracle Connectors for Hadoop
Data Wrangling and Oracle Connectors for Hadoop
 
LLAP: Sub-Second Analytical Queries in Hive
LLAP: Sub-Second Analytical Queries in HiveLLAP: Sub-Second Analytical Queries in Hive
LLAP: Sub-Second Analytical Queries in Hive
 
Real time hadoop + mapreduce intro
Real time hadoop + mapreduce introReal time hadoop + mapreduce intro
Real time hadoop + mapreduce intro
 
Scaling HDFS to Manage Billions of Files with Key-Value Stores
Scaling HDFS to Manage Billions of Files with Key-Value StoresScaling HDFS to Manage Billions of Files with Key-Value Stores
Scaling HDFS to Manage Billions of Files with Key-Value Stores
 
Hadoop MapReduce Fundamentals
Hadoop MapReduce FundamentalsHadoop MapReduce Fundamentals
Hadoop MapReduce Fundamentals
 
Introduction to Spark Streaming & Apache Kafka | Big Data Hadoop Spark Tutori...
Introduction to Spark Streaming & Apache Kafka | Big Data Hadoop Spark Tutori...Introduction to Spark Streaming & Apache Kafka | Big Data Hadoop Spark Tutori...
Introduction to Spark Streaming & Apache Kafka | Big Data Hadoop Spark Tutori...
 

Similar to R for hadoopers

AI與大數據數據處理 Spark實戰(20171216)
AI與大數據數據處理 Spark實戰(20171216)AI與大數據數據處理 Spark實戰(20171216)
AI與大數據數據處理 Spark實戰(20171216)
Paul Chao
 
Intro to Spark - for Denver Big Data Meetup
Intro to Spark - for Denver Big Data MeetupIntro to Spark - for Denver Big Data Meetup
Intro to Spark - for Denver Big Data Meetup
Gwen (Chen) Shapira
 

Similar to R for hadoopers (20)

Big datacourse
Big datacourseBig datacourse
Big datacourse
 
Hadoop Overview kdd2011
Hadoop Overview kdd2011Hadoop Overview kdd2011
Hadoop Overview kdd2011
 
Distributed Computing with Apache Hadoop. Introduction to MapReduce.
Distributed Computing with Apache Hadoop. Introduction to MapReduce.Distributed Computing with Apache Hadoop. Introduction to MapReduce.
Distributed Computing with Apache Hadoop. Introduction to MapReduce.
 
R programming & Machine Learning
R programming & Machine LearningR programming & Machine Learning
R programming & Machine Learning
 
Hadoop london
Hadoop londonHadoop london
Hadoop london
 
Step By Step Guide to Learn R
Step By Step Guide to Learn RStep By Step Guide to Learn R
Step By Step Guide to Learn R
 
AI與大數據數據處理 Spark實戰(20171216)
AI與大數據數據處理 Spark實戰(20171216)AI與大數據數據處理 Spark實戰(20171216)
AI與大數據數據處理 Spark實戰(20171216)
 
Introduction to R.pptx
Introduction to R.pptxIntroduction to R.pptx
Introduction to R.pptx
 
Apache Spark - Santa Barbara Scala Meetup Dec 18th 2014
Apache Spark - Santa Barbara Scala Meetup Dec 18th 2014Apache Spark - Santa Barbara Scala Meetup Dec 18th 2014
Apache Spark - Santa Barbara Scala Meetup Dec 18th 2014
 
Hadoop
HadoopHadoop
Hadoop
 
Scala and spark
Scala and sparkScala and spark
Scala and spark
 
Introduction to Hadoop
Introduction to HadoopIntroduction to Hadoop
Introduction to Hadoop
 
Hadoop in Data Warehousing
Hadoop in Data WarehousingHadoop in Data Warehousing
Hadoop in Data Warehousing
 
Hadoop Tutorial with @techmilind
Hadoop Tutorial with @techmilindHadoop Tutorial with @techmilind
Hadoop Tutorial with @techmilind
 
Hadoop and HBase experiences in perf log project
Hadoop and HBase experiences in perf log projectHadoop and HBase experiences in perf log project
Hadoop and HBase experiences in perf log project
 
Lecture1_R.pdf
Lecture1_R.pdfLecture1_R.pdf
Lecture1_R.pdf
 
A Workshop on R
A Workshop on RA Workshop on R
A Workshop on R
 
Intro to Spark - for Denver Big Data Meetup
Intro to Spark - for Denver Big Data MeetupIntro to Spark - for Denver Big Data Meetup
Intro to Spark - for Denver Big Data Meetup
 
Data Analytics and Simulation in Parallel with MATLAB*
Data Analytics and Simulation in Parallel with MATLAB*Data Analytics and Simulation in Parallel with MATLAB*
Data Analytics and Simulation in Parallel with MATLAB*
 
MapReduce - Basics | Big Data Hadoop Spark Tutorial | CloudxLab
MapReduce - Basics | Big Data Hadoop Spark Tutorial | CloudxLabMapReduce - Basics | Big Data Hadoop Spark Tutorial | CloudxLab
MapReduce - Basics | Big Data Hadoop Spark Tutorial | CloudxLab
 

More from Gwen (Chen) Shapira

More from Gwen (Chen) Shapira (20)

Velocity 2019 - Kafka Operations Deep Dive
Velocity 2019  - Kafka Operations Deep DiveVelocity 2019  - Kafka Operations Deep Dive
Velocity 2019 - Kafka Operations Deep Dive
 
Lies Enterprise Architects Tell - Data Day Texas 2018 Keynote
Lies Enterprise Architects Tell - Data Day Texas 2018  Keynote Lies Enterprise Architects Tell - Data Day Texas 2018  Keynote
Lies Enterprise Architects Tell - Data Day Texas 2018 Keynote
 
Gluecon - Kafka and the service mesh
Gluecon - Kafka and the service meshGluecon - Kafka and the service mesh
Gluecon - Kafka and the service mesh
 
Multi-Cluster and Failover for Apache Kafka - Kafka Summit SF 17
Multi-Cluster and Failover for Apache Kafka - Kafka Summit SF 17Multi-Cluster and Failover for Apache Kafka - Kafka Summit SF 17
Multi-Cluster and Failover for Apache Kafka - Kafka Summit SF 17
 
Papers we love realtime at facebook
Papers we love   realtime at facebookPapers we love   realtime at facebook
Papers we love realtime at facebook
 
Kafka reliability velocity 17
Kafka reliability   velocity 17Kafka reliability   velocity 17
Kafka reliability velocity 17
 
Multi-Datacenter Kafka - Strata San Jose 2017
Multi-Datacenter Kafka - Strata San Jose 2017Multi-Datacenter Kafka - Strata San Jose 2017
Multi-Datacenter Kafka - Strata San Jose 2017
 
Streaming Data Integration - For Women in Big Data Meetup
Streaming Data Integration - For Women in Big Data MeetupStreaming Data Integration - For Women in Big Data Meetup
Streaming Data Integration - For Women in Big Data Meetup
 
Kafka at scale facebook israel
Kafka at scale   facebook israelKafka at scale   facebook israel
Kafka at scale facebook israel
 
Kafka connect-london-meetup-2016
Kafka connect-london-meetup-2016Kafka connect-london-meetup-2016
Kafka connect-london-meetup-2016
 
Fraud Detection for Israel BigThings Meetup
Fraud Detection  for Israel BigThings MeetupFraud Detection  for Israel BigThings Meetup
Fraud Detection for Israel BigThings Meetup
 
Kafka Reliability - When it absolutely, positively has to be there
Kafka Reliability - When it absolutely, positively has to be thereKafka Reliability - When it absolutely, positively has to be there
Kafka Reliability - When it absolutely, positively has to be there
 
Nyc kafka meetup 2015 - when bad things happen to good kafka clusters
Nyc kafka meetup 2015 - when bad things happen to good kafka clustersNyc kafka meetup 2015 - when bad things happen to good kafka clusters
Nyc kafka meetup 2015 - when bad things happen to good kafka clusters
 
Fraud Detection Architecture
Fraud Detection ArchitectureFraud Detection Architecture
Fraud Detection Architecture
 
Have your cake and eat it too
Have your cake and eat it tooHave your cake and eat it too
Have your cake and eat it too
 
Kafka for DBAs
Kafka for DBAsKafka for DBAs
Kafka for DBAs
 
Data Architectures for Robust Decision Making
Data Architectures for Robust Decision MakingData Architectures for Robust Decision Making
Data Architectures for Robust Decision Making
 
Kafka and Hadoop at LinkedIn Meetup
Kafka and Hadoop at LinkedIn MeetupKafka and Hadoop at LinkedIn Meetup
Kafka and Hadoop at LinkedIn Meetup
 
Kafka & Hadoop - for NYC Kafka Meetup
Kafka & Hadoop - for NYC Kafka MeetupKafka & Hadoop - for NYC Kafka Meetup
Kafka & Hadoop - for NYC Kafka Meetup
 
Twitter with hadoop for oow
Twitter with hadoop for oowTwitter with hadoop for oow
Twitter with hadoop for oow
 

Recently uploaded

Al Barsha Escorts $#$ O565212860 $#$ Escort Service In Al Barsha
Al Barsha Escorts $#$ O565212860 $#$ Escort Service In Al BarshaAl Barsha Escorts $#$ O565212860 $#$ Escort Service In Al Barsha
Al Barsha Escorts $#$ O565212860 $#$ Escort Service In Al Barsha
AroojKhan71
 
Vip Mumbai Call Girls Marol Naka Call On 9920725232 With Body to body massage...
Vip Mumbai Call Girls Marol Naka Call On 9920725232 With Body to body massage...Vip Mumbai Call Girls Marol Naka Call On 9920725232 With Body to body massage...
Vip Mumbai Call Girls Marol Naka Call On 9920725232 With Body to body massage...
amitlee9823
 
Call Girls In Bellandur ☎ 7737669865 🥵 Book Your One night Stand
Call Girls In Bellandur ☎ 7737669865 🥵 Book Your One night StandCall Girls In Bellandur ☎ 7737669865 🥵 Book Your One night Stand
Call Girls In Bellandur ☎ 7737669865 🥵 Book Your One night Stand
amitlee9823
 
Call Girls Bommasandra Just Call 👗 7737669865 👗 Top Class Call Girl Service B...
Call Girls Bommasandra Just Call 👗 7737669865 👗 Top Class Call Girl Service B...Call Girls Bommasandra Just Call 👗 7737669865 👗 Top Class Call Girl Service B...
Call Girls Bommasandra Just Call 👗 7737669865 👗 Top Class Call Girl Service B...
amitlee9823
 
Call Girls Hsr Layout Just Call 👗 7737669865 👗 Top Class Call Girl Service Ba...
Call Girls Hsr Layout Just Call 👗 7737669865 👗 Top Class Call Girl Service Ba...Call Girls Hsr Layout Just Call 👗 7737669865 👗 Top Class Call Girl Service Ba...
Call Girls Hsr Layout Just Call 👗 7737669865 👗 Top Class Call Girl Service Ba...
amitlee9823
 
Chintamani Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore ...
Chintamani Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore ...Chintamani Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore ...
Chintamani Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore ...
amitlee9823
 
Jual Obat Aborsi Surabaya ( Asli No.1 ) 085657271886 Obat Penggugur Kandungan...
Jual Obat Aborsi Surabaya ( Asli No.1 ) 085657271886 Obat Penggugur Kandungan...Jual Obat Aborsi Surabaya ( Asli No.1 ) 085657271886 Obat Penggugur Kandungan...
Jual Obat Aborsi Surabaya ( Asli No.1 ) 085657271886 Obat Penggugur Kandungan...
ZurliaSoop
 
Call Girls Jalahalli Just Call 👗 7737669865 👗 Top Class Call Girl Service Ban...
Call Girls Jalahalli Just Call 👗 7737669865 👗 Top Class Call Girl Service Ban...Call Girls Jalahalli Just Call 👗 7737669865 👗 Top Class Call Girl Service Ban...
Call Girls Jalahalli Just Call 👗 7737669865 👗 Top Class Call Girl Service Ban...
amitlee9823
 
Junnasandra Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore...
Junnasandra Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore...Junnasandra Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore...
Junnasandra Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore...
amitlee9823
 

Recently uploaded (20)

Al Barsha Escorts $#$ O565212860 $#$ Escort Service In Al Barsha
Al Barsha Escorts $#$ O565212860 $#$ Escort Service In Al BarshaAl Barsha Escorts $#$ O565212860 $#$ Escort Service In Al Barsha
Al Barsha Escorts $#$ O565212860 $#$ Escort Service In Al Barsha
 
Ravak dropshipping via API with DroFx.pptx
Ravak dropshipping via API with DroFx.pptxRavak dropshipping via API with DroFx.pptx
Ravak dropshipping via API with DroFx.pptx
 
Week-01-2.ppt BBB human Computer interaction
Week-01-2.ppt BBB human Computer interactionWeek-01-2.ppt BBB human Computer interaction
Week-01-2.ppt BBB human Computer interaction
 
Edukaciniai dropshipping via API with DroFx
Edukaciniai dropshipping via API with DroFxEdukaciniai dropshipping via API with DroFx
Edukaciniai dropshipping via API with DroFx
 
Digital Advertising Lecture for Advanced Digital & Social Media Strategy at U...
Digital Advertising Lecture for Advanced Digital & Social Media Strategy at U...Digital Advertising Lecture for Advanced Digital & Social Media Strategy at U...
Digital Advertising Lecture for Advanced Digital & Social Media Strategy at U...
 
Vip Mumbai Call Girls Marol Naka Call On 9920725232 With Body to body massage...
Vip Mumbai Call Girls Marol Naka Call On 9920725232 With Body to body massage...Vip Mumbai Call Girls Marol Naka Call On 9920725232 With Body to body massage...
Vip Mumbai Call Girls Marol Naka Call On 9920725232 With Body to body massage...
 
Call Girls In Bellandur ☎ 7737669865 🥵 Book Your One night Stand
Call Girls In Bellandur ☎ 7737669865 🥵 Book Your One night StandCall Girls In Bellandur ☎ 7737669865 🥵 Book Your One night Stand
Call Girls In Bellandur ☎ 7737669865 🥵 Book Your One night Stand
 
Call Girls Bommasandra Just Call 👗 7737669865 👗 Top Class Call Girl Service B...
Call Girls Bommasandra Just Call 👗 7737669865 👗 Top Class Call Girl Service B...Call Girls Bommasandra Just Call 👗 7737669865 👗 Top Class Call Girl Service B...
Call Girls Bommasandra Just Call 👗 7737669865 👗 Top Class Call Girl Service B...
 
Call Girls in Sarai Kale Khan Delhi 💯 Call Us 🔝9205541914 🔝( Delhi) Escorts S...
Call Girls in Sarai Kale Khan Delhi 💯 Call Us 🔝9205541914 🔝( Delhi) Escorts S...Call Girls in Sarai Kale Khan Delhi 💯 Call Us 🔝9205541914 🔝( Delhi) Escorts S...
Call Girls in Sarai Kale Khan Delhi 💯 Call Us 🔝9205541914 🔝( Delhi) Escorts S...
 
VIP Model Call Girls Hinjewadi ( Pune ) Call ON 8005736733 Starting From 5K t...
VIP Model Call Girls Hinjewadi ( Pune ) Call ON 8005736733 Starting From 5K t...VIP Model Call Girls Hinjewadi ( Pune ) Call ON 8005736733 Starting From 5K t...
VIP Model Call Girls Hinjewadi ( Pune ) Call ON 8005736733 Starting From 5K t...
 
Midocean dropshipping via API with DroFx
Midocean dropshipping via API with DroFxMidocean dropshipping via API with DroFx
Midocean dropshipping via API with DroFx
 
Mature dropshipping via API with DroFx.pptx
Mature dropshipping via API with DroFx.pptxMature dropshipping via API with DroFx.pptx
Mature dropshipping via API with DroFx.pptx
 
Call Girls Hsr Layout Just Call 👗 7737669865 👗 Top Class Call Girl Service Ba...
Call Girls Hsr Layout Just Call 👗 7737669865 👗 Top Class Call Girl Service Ba...Call Girls Hsr Layout Just Call 👗 7737669865 👗 Top Class Call Girl Service Ba...
Call Girls Hsr Layout Just Call 👗 7737669865 👗 Top Class Call Girl Service Ba...
 
BigBuy dropshipping via API with DroFx.pptx
BigBuy dropshipping via API with DroFx.pptxBigBuy dropshipping via API with DroFx.pptx
BigBuy dropshipping via API with DroFx.pptx
 
Chintamani Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore ...
Chintamani Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore ...Chintamani Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore ...
Chintamani Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore ...
 
Cheap Rate Call girls Sarita Vihar Delhi 9205541914 shot 1500 night
Cheap Rate Call girls Sarita Vihar Delhi 9205541914 shot 1500 nightCheap Rate Call girls Sarita Vihar Delhi 9205541914 shot 1500 night
Cheap Rate Call girls Sarita Vihar Delhi 9205541914 shot 1500 night
 
Jual Obat Aborsi Surabaya ( Asli No.1 ) 085657271886 Obat Penggugur Kandungan...
Jual Obat Aborsi Surabaya ( Asli No.1 ) 085657271886 Obat Penggugur Kandungan...Jual Obat Aborsi Surabaya ( Asli No.1 ) 085657271886 Obat Penggugur Kandungan...
Jual Obat Aborsi Surabaya ( Asli No.1 ) 085657271886 Obat Penggugur Kandungan...
 
Anomaly detection and data imputation within time series
Anomaly detection and data imputation within time seriesAnomaly detection and data imputation within time series
Anomaly detection and data imputation within time series
 
Call Girls Jalahalli Just Call 👗 7737669865 👗 Top Class Call Girl Service Ban...
Call Girls Jalahalli Just Call 👗 7737669865 👗 Top Class Call Girl Service Ban...Call Girls Jalahalli Just Call 👗 7737669865 👗 Top Class Call Girl Service Ban...
Call Girls Jalahalli Just Call 👗 7737669865 👗 Top Class Call Girl Service Ban...
 
Junnasandra Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore...
Junnasandra Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore...Junnasandra Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore...
Junnasandra Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore...
 

R for hadoopers

  • 1. 1 Scalable Analytics with R, Hadoop and RHadoop Gwen Shapira, Software Engineer @gwenshap gshapira@cloudera.com
  • 2. 2
  • 3. 3
  • 4. 4
  • 6. Agenda • R Basics • Hadoop Basics • Data Manipulation • Rhadoop 6
  • 7. Get Started with R-Studio 7
  • 8. Basic Data Types • String • Number • Boolean • Assignment <- 8
  • 9. R can be a nice calculator > x <- 1 > x * 2 [1] 2 > y <- x + 3 > y [1] 4 > log(y) [1] 1.386294 > help(log) 9
  • 10. Complex Data Types • Vector • c, seq, rep, [] • List • Data Frame • Lists of vectors of same length • Not a matrix 10
  • 11. Creating vectors > v1 <- c(1,2,3,4) [1] 1 2 3 4 > v1 * 4 [1] 4 8 12 16 > v4 <- c(1:5) [1] 1 2 3 4 5 > v2 <- seq(2,12,by=3) [1] 2 5 8 11 > v1 * v2 [1] 2 10 24 44 > v3 <- rep(3,4) [1] 3 3 3 3 11
  • 12. Accessing and filtering vectors > v1 <- c(2,4,6,8) [1] 2 4 6 8 > v1[2] [1] 4 > v1[2:4] [1] 4 6 8 > v1[-2] [1] 2 6 8 > v1[v1>3] [1] 4 6 8 12
  • 13. Lists > lst <- list (1,"x",FALSE) [[1]] [1] 1 [[2]] [1] "x" [[3]] [1] FALSE > lst[1] [[1]] [1] 1 > lst[[1]] [1] 1 13
  • 14. Data Frames books <- read.csv("~/books.csv") books[1,] books[,1] books[3:4] books$price books[books$price==6.99,] martin_price <- books[books$author_t=="George R.R. Martin",]$price mean(martin_price) subset(books,select=-c(id,cat,sequence_i)) 14
  • 15. 15
  • 16. Functions > sq <- function(x) { x*x } > sq(3) [1] 9 16 Note: R is a functional programming language. Functions are first class objects And can be passed to other functions.
  • 18. Agenda • R Basics • Hadoop Basics • Data Manipulation • Rhadoop 18
  • 19. “In pioneer days they used oxen for heavy pulling, and when one ox couldn’t budge a log, we didn’t try to grow a larger ox” — Grace Hopper, early advocate of distributed computing
  • 20. 20 Hadoop in a Nutshell
  • 21. Map-Reduce is the interesting bit • Map – Apply a function to each input record • Shuffle & Sort – Partition the map output and sort each partition • Reduce – Apply aggregation function to all values in each partition • Map reads input from disk • Reduce writes output to disk 21
  • 22. Example – Sessionize clickstream 22
  • 23. Sessionize Identify unique “sessions” of interacting with our website Session – for each user (IP), set of clicks that happened within 30 minutes of each other 23
  • 24. Input – Apache Access Log Records 127.0.0.1 - frank [10/Oct/2000:13:55:36 -0700] "GET /apache_pb.gif HTTP/1.0" 200 2326 24
  • 25. Output – Add Session ID 127.0.0.1 - frank [10/Oct/2000:13:55:36 -0700] "GET /apache_pb.gif HTTP/1.0" 200 2326 15 25
  • 26. Overview 26 Map Map Map Reduce Reduce Log line Log line Log line IP1, log lines Log line, session ID
  • 27. Map parsedRecord = re.search(‘(d+.d+….’,record) IP = parsedRecord.group(1) timestamp = parsedRecord.group(2) print ((IP,Timestamp),record) 27
  • 28. Shuffle & Sort Partition by: IP Sort by: timestamp Now reduce gets: (IP,timestamp) [record1,record2,record3….] 28
  • 29. Reduce SessionID = 1 curr_record = records[0] Curr_timestamp = getTimestamp(curr_record) foreach record in records: if (curr_timestamp – getTimestamp(record) > 30): sessionID += 1 curr_timestamp = getTimestamp(record) print(record + “ “ + sessionID) 29
  • 30. Agenda • R Basics • Hadoop Basics • Data Manipulation Libraries • Rhadoop 30
  • 31. Reshape2 • Two functions: • Melt – wide format to long format • Cast – long format to wide format • Columns: identifiers or measured variables • Molten data: • Unique identifiers • New column – variable name • New column – value • Default – all numbers are values 31
  • 32. Melt > tips total_bill tip sex smoker day time size 16.99 1.01 Female No Sun Dinner 2 10.34 1.66 Male No Sun Dinner 3 21.01 3.50 Male No Sun Dinner 3 > melt(tips) sex smoker day time variable value Female No Sun Dinner total_bill 16.99 Female No Sun Dinner tip 1.01 Female No Sun Dinner size 2 32
  • 33. Cast > m_tips <- melt(tips) sex smoker day time variable value Female No Sun Dinner total_bill 16.99 Female No Sun Dinner tip 1.01 Female No Sun Dinner size 2 > dcast(m_tips,sex+time~variable,mean) sex time total_bill tip size Female Dinner 19.21308 3.002115 2.461538 Female Lunch 16.33914 2.582857 2.457143 Male Dinner 21.46145 3.144839 2.701613 Male Lunch 18.04848 2.882121 2.363636 33
  • 34. *Apply • apply – apply function on rows or columns of matrix • lapply – apply function on each item of list • Returns list • sapply – like lapply, but return vector • tapply – apply function to subsets of vector or lists 34
  • 35. plyr
        • Split – apply – combine
        • ddply – data frame to data frame
              ddply(.data, .variables, .fun = NULL, ...,
        • summarize – aggregate data into a new data frame
        • transform – modify a data frame
  • 36. ddply Example
        > ddply(tips,c("sex","time"),summarize,
        +   mean=mean(tip),
        +   sd=sd(tip),
        +   ratio=mean(tip/total_bill)
        + )
             sex   time     mean       sd     ratio
        1 Female Dinner 3.002115 1.193483 0.1693216
        2 Female  Lunch 2.582857 1.075108 0.1622849
        3   Male Dinner 3.144839 1.529116 0.1554065
        4   Male  Lunch 2.882121 1.329017 0.1660826
  • 37. Agenda
        • R Basics
        • Hadoop Basics
        • Data Manipulation Libraries
        • RHadoop
  • 38. RHadoop Projects
        • RMR
        • RHDFS
        • RHBase
        • (new) PlyRMR
  • 39. Most Important: RMR does not parallelize algorithms. It allows you to implement MapReduce in R. Efficiently. That’s it.
  • 40. What does that mean?
        • Use RMR if you can break your problem down into small pieces and apply the algorithm to each piece
        • Use commercial R+Hadoop if you need a parallel version of a well-known algorithm
        • Good fit: Fit a piecewise regression model for each county in the US
        • Bad fit: Fit a piecewise regression model for the entire US population
        • Bad fit: Logistic regression
  • 41. Use-case examples – Good or Bad?
        1. Model power consumption per household to determine if incentive programs work
        2. Aggregate corn yield per 10x10 portion of field to determine best seeds to use
        3. Create churn models for service subscribers and determine who is most likely to cancel
        4. Determine correlation between device restarts and support calls
  • 42. Second Most Important: RMR requires R, RMR, and all libraries you’ll use to be installed on all nodes and accessible by the Hadoop user
  • 43. RMR is different from Hadoop Streaming. RMR mapper input: Key, [List of Records]. This is so we can use vector operations.
  • 44. How to RMRify a Problem
  • 45. In more detail…
        • Mappers get a list of values
        • You need to process each one independently
        • But do it for all lines at once
        • Reducers work normally
  • 46. Demo 6
        > library(rmr2)
        # Try the splitting logic on a small local list first
        t <- list("hello world","don't worry be happy")
        unlist(sapply(t,function (x) {strsplit(x," ")}))
        # Mapper: v is a whole batch of lines; emit every word with a count of 1
        wc.map <- function(k,v) {
          ret_k <- unlist(sapply(v,function(x){strsplit(x," ")}))
          keyval(ret_k,1)
        }
        # Reducer: sum the counts for each word
        wc.reduce <- function(k,v) { keyval(k,sum(v)) }
        mapreduce(input="~/hadoop-recipes/data/shakespeare/Shakespeare_2.txt",
                  output="~/wc.json", input.format="text", output.format="json",
                  map=wc.map, reduce=wc.reduce)
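The batch-oriented mapper contract (a key plus a list of records, processed all at once) can be illustrated without a cluster. The following is a minimal Python sketch under that assumption; `wc_map`, `wc_reduce`, and `run_local` are hypothetical names, and `run_local` only imitates what the framework does between map and reduce.

```python
from collections import defaultdict

def wc_map(_key, lines):
    """Batch mapper: receives a list of lines and returns all (word, 1) pairs at once."""
    words = [w for line in lines for w in line.split(" ")]
    return [(w, 1) for w in words]

def wc_reduce(word, counts):
    """Reducer: sum the counts collected for one word."""
    return (word, sum(counts))

def run_local(batches):
    """Stand-in for the framework: map each batch, group by key, then reduce."""
    grouped = defaultdict(list)
    for batch in batches:
        for k, v in wc_map(None, batch):
            grouped[k].append(v)
    return dict(wc_reduce(k, vs) for k, vs in grouped.items())

counts = run_local([["hello world", "don't worry be happy"], ["hello again"]])
print(counts["hello"])
```

The mapper never sees one line in isolation; it is handed a batch and must still treat each line independently, which is the same shape as the R mapper in the demo.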
  • 47. Cheating in MapReduce: Do everything possible to have map only jobs
  • 48. Avg Tips per Person – Naïve Input
        Gwen 1
        Jeff 2
        Leon 1
        Gwen 2.5
        Leon 3
        Jeff 1
        Gwen 1
        Gwen 2
        Jeff 1.5
  • 49. Avg Tips per Person – Naïve
        avg.map <- function(k,v) { keyval(v$V1,v$V2) }
        avg.reduce <- function(k,v) { keyval(k,mean(v)) }
        mapreduce(input="~/hadoop-recipes/data/tip1.txt",
                  output="~/avg.txt",
                  input.format=make.input.format("csv"),
                  output.format="text",
                  map=avg.map, reduce=avg.reduce)
  • 50. Avg Tips per Person – Awesome Input
        Gwen 1,2.5,1,2
        Jeff 2,1,1.5
        Leon 1,3
  • 51. Avg Tips per Person – Optimized
        # Map-only job: each record already holds all of one person's tips
        avg2.map <- function(k,v) {
          v1 <- (sapply(v$V2,function(x){strsplit(as.character(x)," ")}))
          keyval(v$V1,sapply(v1,function(x){mean(as.numeric(x))}))
        }
        mapreduce(input="~/hadoop-recipes/data/tip2.txt",
                  output="~/avg2.txt",
                  input.format=make.input.format("csv",sep=","),
                  output.format="text", map=avg2.map)
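The difference between the naïve and the map-only version is where the grouping happens. The Python sketch below imitates both locally, using the input layouts shown on slides 48 and 50; `naive_average` and `map_only_average` are hypothetical names, and the grouping dictionary merely stands in for the shuffle.

```python
from collections import defaultdict
from statistics import mean

def naive_average(pairs):
    """One (person, tip) pair per record: a shuffle is needed to gather each person's tips."""
    grouped = defaultdict(list)
    for person, tip in pairs:        # map emits (person, tip)
        grouped[person].append(tip)  # shuffle groups values by person
    return {p: mean(t) for p, t in grouped.items()}  # reduce averages each group

def map_only_average(lines):
    """One line per person with all tips: every record can be averaged on its own."""
    result = {}
    for line in lines:
        person, tips = line.split(" ", 1)
        result[person] = mean(float(t) for t in tips.split(","))
    return result

naive_input = [("Gwen", 1.0), ("Jeff", 2.0), ("Leon", 1.0), ("Gwen", 2.5), ("Leon", 3.0),
               ("Jeff", 1.0), ("Gwen", 1.0), ("Gwen", 2.0), ("Jeff", 1.5)]
grouped_input = ["Gwen 1,2.5,1,2", "Jeff 2,1,1.5", "Leon 1,3"]

naive = naive_average(naive_input)
map_only = map_only_average(grouped_input)
print(naive == map_only)
```

When the input is already grouped by key, no data has to move between nodes, which is why the slide recommends reshaping the input so that the reduce phase can be dropped.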
  • 52. Few Final RMR Tips
        • Backend = "local" has files as input and output
        • Backend = "hadoop" uses HDFS directories
        • In "hadoop" mode, print(X) inside the mapper will fail the job.
        • Use: cat("ERROR!", file = stderr())
  • 53. Recommended Reading
        • http://cran.r-project.org/doc/manuals/R-intro.html
        • http://blog.revolutionanalytics.com/2013/02/10-r-packages-every-data-scientist-should-know-about.html
        • http://had.co.nz/reshape/paper-dsc2005.pdf
        • http://seananderson.ca/2013/12/01/plyr.html
        • https://github.com/RevolutionAnalytics/rmr2/blob/master/docs/tutorial.md
        • http://cran.r-project.org/web/packages/data.table/index.html

Editor's Notes

  1. Modern CPUs are optimized with vector instructions, so many vector operations can be done on entire vectors in one instruction. Loops take many instructions, both for the operations themselves and for running through the loop.
  2. This quote is excerpted from the one at the beginning of Chapter 1 in Hadoop: The Definitive Guide by Tom White.
  3. Example to illustrate MR
  4. RevolutionR and Oracle have (expensive) packages of popular algorithms, parallelized.
  5. Just saved you hours of debugging. You can thank me later 