SlideShare a Scribd company logo
1 of 27
Download to read offline
Parallelizing Existing R
Packages with SparkR
Hossein Falaki
@mhfalaki
About me
• Former Data Scientist at Apple Siri
• Software Engineer at Databricks
• Started using Apache Spark since version 0.6
• Developed first version of Apache Spark CSV data
source
• Worked on SparkR &Databricks R Notebook feature
• Currently focusing on R experience at Databricks
2
What is SparkR?
An R package distributed with Apache Spark (soon CRAN):
- Provides R frontend to Spark
- Exposes Spark DataFrames (inspired by R and Pandas)
- Convenient interoperability between R and Spark DataFrames
3
distributed/robust processing, data
sources, off-memory data
structures
+	 Dynamic environment, interactivity,
packages, visualization
SparkR architecture
4
Spark	Driver	
							JVM	R	
R	Backend	 JVM	
Worker	
JVM	
Worker	
Data	Sources	
	JVM
SparkR architecture (since 2.0)
5
Spark	Driver	
R	 							JVM	
R	Backend	
JVM	
Worker	
JVM	
Worker	
Data	Sources	
R	
R
Overview of SparkR API
IO
read.df / write.df /
createDataFrame / collect
Caching
cache / persist / unpersist /
cacheTable / uncacheTable
SQL 
sql / table / saveAsTable /
registerTempTable / tables
6
ML Lib
glm / kmeans / Naïve Bayes
Survival regression
DataFrame API
select / subset / groupBy / 
head / avg / column / dim
UDF functionality (since 2.0)
spark.lapply / dapply /
gapply / dapplyCollect 
http://spark.apache.org/docs/latest/api/R/
Overview of SparkR API :: Session
Spark session is your interface to Spark functionality in R
o SparkR DataFrames are implemented on top of SparkSQL tables
o All DataFrame operations go through a SQL optimizer (catalyst)
o Since 2.0 sqlContext is wrapped in a new object called SparkR Session.
7
> spark <- sparkR.session()
All SparkR functions work if you pass them a session or will
assume an existing session.
Reading/Writing data
8
R	 							JVM	
R	Backend	
JVM	
Worker	
JVM	
Worker	
HDFS/S3/…	
read.df()	
write.df()
Moving data between R and JVM
9
R	 							JVM	
R	Backend	
SparkR::collect()	
SparkR::createDataFrame()
Overview of SparkR API :: DataFrame
API
SparkR DataFrame behaves similar to R data.frames
> sparkDF$newCol <- sparkDF$col + 1
> subsetDF <- sparkDF[, c(“date”, “type”)]
> recentData <- subset(sparkDF$date == “2015-10-24”)
> firstRow <- sparkDF[[1, ]]
> names(subsetDF) <- c(“Date”, “Type”)
> dim(recentData)
> head(count(group_by(subsetDF, “Date”)))
10
Overview of SparkR API :: SQL
You can register a DataFrame as a table and query it in SQL
> logs <- read.df(“data/logs”, source = “json”)
> registerTempTable(df, “logsTable”)
> errorsByCode <- sql(“select count(*) as num, type from
logsTable where type == “error” group by code order by
date desc”)
> reviewsDF <- tableToDF(“reviewsTable”)
> registerTempTable(filter(reviewsDF, reviewsDF$rating ==
5), “fiveStars”)
11
Moving between languages
12
R Scala
Spark	
df <- read.df(...)

wiki <- filter(df, ...)

registerTempTable(wiki,
“wiki”)
val wiki = table(“wiki”)

val parsed = wiki.map {
Row(_, _, text: String,
_, _) =>text.split(‘ ’)
}

val model =
Kmeans.train(parsed)
Overview of SparkR API
IO
read.df / write.df /
createDataFrame / collect
Caching
cache / persist / unpersist /
cacheTable / uncacheTable
SQL 
sql / table / saveAsTable /
registerTempTable / tables
13
ML Lib
glm / kmeans / Naïve Bayes
Survival regression
DataFrame API
select / subset / groupBy / 
head / avg / column / dim
UDF functionality (since 2.0)
spark.lapply / dapply /
gapply / dapplyCollect 
http://spark.apache.org/docs/latest/api/R/
SparkR UDF API
14
spark.lapply
Runs a function
over a list of
elements
spark.lapply()
dapply
Applies a function
to each partition of
a SparkDataFrame
dapply()
dapplyCollect()
gapply
Applies a function
to each group
within a
SparkDataFrame
gapply()
gapplyCollect()
spark.lapply
15
Simplest SparkR UDF pattern
For each element of a list:
1.  Sends the function to an R worker
2.  Executes the function
3.  Returns the result of all workers as a list to R driver
spark.lapply(1:100, function(x) {
runBootstrap(x)
}
spark.lapply control flow
16
R	Worker	JVM	
R	Worker	JVM	
R	Worker	JVM	R	 Driver	JVM	
1.	Serialize	R	closure	
3.	Transfer	serialized	closure	over	the	network	
5.	De-serialize	closure	
4.	Transfer	over	
						local	socket	
6.	Serialize	result	
2.	Transfer	over	
					local	socket	
7.	Transfer	over	
						local	socket	9.	Transfer	over	
					local	socket	
10.	Deserialize	result	
8.	Transfer	serialized	closure	over	the	network
dapply
17
For each partition of a Spark DataFrame
1.  collects each partition as an R data.frame
2.  sends the R function to the R worker
3.  executes the function
dapply(sparkDF, func, schema)
combines results as
DataFrame with schema
dapplyCollect(sparkDF, func)
combines results as R
data.frame
dapply control & data flow
18
R	Worker	JVM	
R	Worker	JVM	
R	Worker	JVM	R	 Driver	JVM	
local socket cluster network local socket
input data
ser/de transfer
result data
ser/de transfer
dapplyCollect control & data flow
19
R	Worker	JVM	
R	Worker	JVM	
R	Worker	JVM	R	 Driver	JVM	
local socket cluster network local socket
input data
ser/de transfer
result transfer
result deser
gapply
20
Groups a Spark DataFrame on one or more columns
1.  collects each group as an R data.frame
2.  sends the R function to the R worker
3.  executes the function
gapply(sparkDF, cols, func, schema)
combines results as
DataFrame with schema
gapplyCollect(sparkDF, cols, func)
combines results as R
data.frame
gapply control & data flow
21
R	Worker	JVM	
R	Worker	JVM	
R	Worker	JVM	R	 Driver	JVM	
local socket cluster network local socket
input data
ser/de transfer
result data
ser/de transfer
data
shuffle
dapply vs. gapply
22
gapply
 dapply	
signature gapply(df, cols, func, schema)
gapply(gdf, func, schema)	
	
dapply(df, func, schema)	
user function
signature
function(key, data)	 function(data)	
data partition controlled	by	grouping	 not	controlled
Parallelizing data
• Do not use spark.lapply() to distribute large data sets
• Do not pack data in the closure
• Watch for skew in data
– Are partitions evenly sized?
• Auxiliary data
– Can be joined with input DataFrame
– Can be distributed to all the workers using FileSystem
23
Packages on workers
• SparkR closure capture does not include packages
• You need to import packages on each worker inside your
function
• If not installed install packages on workers out-of-band
• spark.lapply() can be used to install packages
24
Debugging user code
1.  Verify your code on the Driver
2.  Interactively execute the code on the cluster
–  When R worker fails, Spark Driver throws exception with the R error
text
3.  Inspect details of failure reason of failed job in spark UI
4.  Inspect stdout/stderror of workers
25
Demo
26
Notebooks available at:
•  hSp://bit.ly/2krYMwC	
•  hSp://bit.ly/2ltLVKs
Thank you!

More Related Content

What's hot

Stanford CS347 Guest Lecture: Apache Spark
Stanford CS347 Guest Lecture: Apache SparkStanford CS347 Guest Lecture: Apache Spark
Stanford CS347 Guest Lecture: Apache SparkReynold Xin
 
Easy, scalable, fault tolerant stream processing with structured streaming - ...
Easy, scalable, fault tolerant stream processing with structured streaming - ...Easy, scalable, fault tolerant stream processing with structured streaming - ...
Easy, scalable, fault tolerant stream processing with structured streaming - ...Databricks
 
Spark meetup v2.0.5
Spark meetup v2.0.5Spark meetup v2.0.5
Spark meetup v2.0.5Yan Zhou
 
Sparkcamp @ Strata CA: Intro to Apache Spark with Hands-on Tutorials
Sparkcamp @ Strata CA: Intro to Apache Spark with Hands-on TutorialsSparkcamp @ Strata CA: Intro to Apache Spark with Hands-on Tutorials
Sparkcamp @ Strata CA: Intro to Apache Spark with Hands-on TutorialsDatabricks
 
Spark SQL - 10 Things You Need to Know
Spark SQL - 10 Things You Need to KnowSpark SQL - 10 Things You Need to Know
Spark SQL - 10 Things You Need to KnowKristian Alexander
 
DataEngConf SF16 - Spark SQL Workshop
DataEngConf SF16 - Spark SQL WorkshopDataEngConf SF16 - Spark SQL Workshop
DataEngConf SF16 - Spark SQL WorkshopHakka Labs
 
SparkR: Enabling Interactive Data Science at Scale
SparkR: Enabling Interactive Data Science at ScaleSparkR: Enabling Interactive Data Science at Scale
SparkR: Enabling Interactive Data Science at Scalejeykottalam
 
SparkR - Play Spark Using R (20160909 HadoopCon)
SparkR - Play Spark Using R (20160909 HadoopCon)SparkR - Play Spark Using R (20160909 HadoopCon)
SparkR - Play Spark Using R (20160909 HadoopCon)wqchen
 
SparkSQL: A Compiler from Queries to RDDs
SparkSQL: A Compiler from Queries to RDDsSparkSQL: A Compiler from Queries to RDDs
SparkSQL: A Compiler from Queries to RDDsDatabricks
 
Everyday I'm Shuffling - Tips for Writing Better Spark Programs, Strata San J...
Everyday I'm Shuffling - Tips for Writing Better Spark Programs, Strata San J...Everyday I'm Shuffling - Tips for Writing Better Spark Programs, Strata San J...
Everyday I'm Shuffling - Tips for Writing Better Spark Programs, Strata San J...Databricks
 
Apache Spark - Dataframes & Spark SQL - Part 2 | Big Data Hadoop Spark Tutori...
Apache Spark - Dataframes & Spark SQL - Part 2 | Big Data Hadoop Spark Tutori...Apache Spark - Dataframes & Spark SQL - Part 2 | Big Data Hadoop Spark Tutori...
Apache Spark - Dataframes & Spark SQL - Part 2 | Big Data Hadoop Spark Tutori...CloudxLab
 
Apache Carbondata: An Indexed Columnar File Format for Interactive Query with...
Apache Carbondata: An Indexed Columnar File Format for Interactive Query with...Apache Carbondata: An Indexed Columnar File Format for Interactive Query with...
Apache Carbondata: An Indexed Columnar File Format for Interactive Query with...Spark Summit
 
Easy, scalable, fault tolerant stream processing with structured streaming - ...
Easy, scalable, fault tolerant stream processing with structured streaming - ...Easy, scalable, fault tolerant stream processing with structured streaming - ...
Easy, scalable, fault tolerant stream processing with structured streaming - ...Databricks
 
Spark SQL Tutorial | Spark Tutorial for Beginners | Apache Spark Training | E...
Spark SQL Tutorial | Spark Tutorial for Beginners | Apache Spark Training | E...Spark SQL Tutorial | Spark Tutorial for Beginners | Apache Spark Training | E...
Spark SQL Tutorial | Spark Tutorial for Beginners | Apache Spark Training | E...Edureka!
 
From HelloWorld to Configurable and Reusable Apache Spark Applications in Sca...
From HelloWorld to Configurable and Reusable Apache Spark Applications in Sca...From HelloWorld to Configurable and Reusable Apache Spark Applications in Sca...
From HelloWorld to Configurable and Reusable Apache Spark Applications in Sca...Databricks
 
Deep Dive into Project Tungsten: Bringing Spark Closer to Bare Metal-(Josh Ro...
Deep Dive into Project Tungsten: Bringing Spark Closer to Bare Metal-(Josh Ro...Deep Dive into Project Tungsten: Bringing Spark Closer to Bare Metal-(Josh Ro...
Deep Dive into Project Tungsten: Bringing Spark Closer to Bare Metal-(Josh Ro...Spark Summit
 

What's hot (20)

Stanford CS347 Guest Lecture: Apache Spark
Stanford CS347 Guest Lecture: Apache SparkStanford CS347 Guest Lecture: Apache Spark
Stanford CS347 Guest Lecture: Apache Spark
 
Big data analytics_beyond_hadoop_public_18_july_2013
Big data analytics_beyond_hadoop_public_18_july_2013Big data analytics_beyond_hadoop_public_18_july_2013
Big data analytics_beyond_hadoop_public_18_july_2013
 
Easy, scalable, fault tolerant stream processing with structured streaming - ...
Easy, scalable, fault tolerant stream processing with structured streaming - ...Easy, scalable, fault tolerant stream processing with structured streaming - ...
Easy, scalable, fault tolerant stream processing with structured streaming - ...
 
Spark meetup v2.0.5
Spark meetup v2.0.5Spark meetup v2.0.5
Spark meetup v2.0.5
 
Sparkcamp @ Strata CA: Intro to Apache Spark with Hands-on Tutorials
Sparkcamp @ Strata CA: Intro to Apache Spark with Hands-on TutorialsSparkcamp @ Strata CA: Intro to Apache Spark with Hands-on Tutorials
Sparkcamp @ Strata CA: Intro to Apache Spark with Hands-on Tutorials
 
Apache spark basics
Apache spark basicsApache spark basics
Apache spark basics
 
Spark SQL - 10 Things You Need to Know
Spark SQL - 10 Things You Need to KnowSpark SQL - 10 Things You Need to Know
Spark SQL - 10 Things You Need to Know
 
DataEngConf SF16 - Spark SQL Workshop
DataEngConf SF16 - Spark SQL WorkshopDataEngConf SF16 - Spark SQL Workshop
DataEngConf SF16 - Spark SQL Workshop
 
SparkR: Enabling Interactive Data Science at Scale
SparkR: Enabling Interactive Data Science at ScaleSparkR: Enabling Interactive Data Science at Scale
SparkR: Enabling Interactive Data Science at Scale
 
SparkR - Play Spark Using R (20160909 HadoopCon)
SparkR - Play Spark Using R (20160909 HadoopCon)SparkR - Play Spark Using R (20160909 HadoopCon)
SparkR - Play Spark Using R (20160909 HadoopCon)
 
SparkSQL: A Compiler from Queries to RDDs
SparkSQL: A Compiler from Queries to RDDsSparkSQL: A Compiler from Queries to RDDs
SparkSQL: A Compiler from Queries to RDDs
 
Everyday I'm Shuffling - Tips for Writing Better Spark Programs, Strata San J...
Everyday I'm Shuffling - Tips for Writing Better Spark Programs, Strata San J...Everyday I'm Shuffling - Tips for Writing Better Spark Programs, Strata San J...
Everyday I'm Shuffling - Tips for Writing Better Spark Programs, Strata San J...
 
Apache Spark - Dataframes & Spark SQL - Part 2 | Big Data Hadoop Spark Tutori...
Apache Spark - Dataframes & Spark SQL - Part 2 | Big Data Hadoop Spark Tutori...Apache Spark - Dataframes & Spark SQL - Part 2 | Big Data Hadoop Spark Tutori...
Apache Spark - Dataframes & Spark SQL - Part 2 | Big Data Hadoop Spark Tutori...
 
Apache Carbondata: An Indexed Columnar File Format for Interactive Query with...
Apache Carbondata: An Indexed Columnar File Format for Interactive Query with...Apache Carbondata: An Indexed Columnar File Format for Interactive Query with...
Apache Carbondata: An Indexed Columnar File Format for Interactive Query with...
 
Spark etl
Spark etlSpark etl
Spark etl
 
Easy, scalable, fault tolerant stream processing with structured streaming - ...
Easy, scalable, fault tolerant stream processing with structured streaming - ...Easy, scalable, fault tolerant stream processing with structured streaming - ...
Easy, scalable, fault tolerant stream processing with structured streaming - ...
 
Spark SQL Tutorial | Spark Tutorial for Beginners | Apache Spark Training | E...
Spark SQL Tutorial | Spark Tutorial for Beginners | Apache Spark Training | E...Spark SQL Tutorial | Spark Tutorial for Beginners | Apache Spark Training | E...
Spark SQL Tutorial | Spark Tutorial for Beginners | Apache Spark Training | E...
 
Road to Analytics
Road to AnalyticsRoad to Analytics
Road to Analytics
 
From HelloWorld to Configurable and Reusable Apache Spark Applications in Sca...
From HelloWorld to Configurable and Reusable Apache Spark Applications in Sca...From HelloWorld to Configurable and Reusable Apache Spark Applications in Sca...
From HelloWorld to Configurable and Reusable Apache Spark Applications in Sca...
 
Deep Dive into Project Tungsten: Bringing Spark Closer to Bare Metal-(Josh Ro...
Deep Dive into Project Tungsten: Bringing Spark Closer to Bare Metal-(Josh Ro...Deep Dive into Project Tungsten: Bringing Spark Closer to Bare Metal-(Josh Ro...
Deep Dive into Project Tungsten: Bringing Spark Closer to Bare Metal-(Josh Ro...
 

Similar to Parallelizing Existing R Packages

Parallelize R Code Using Apache Spark
Parallelize R Code Using Apache Spark Parallelize R Code Using Apache Spark
Parallelize R Code Using Apache Spark Databricks
 
Parallelizing Existing R Packages with SparkR
Parallelizing Existing R Packages with SparkRParallelizing Existing R Packages with SparkR
Parallelizing Existing R Packages with SparkRDatabricks
 
Enabling Exploratory Analysis of Large Data with Apache Spark and R
Enabling Exploratory Analysis of Large Data with Apache Spark and REnabling Exploratory Analysis of Large Data with Apache Spark and R
Enabling Exploratory Analysis of Large Data with Apache Spark and RDatabricks
 
Apache spark - Architecture , Overview & libraries
Apache spark - Architecture , Overview & librariesApache spark - Architecture , Overview & libraries
Apache spark - Architecture , Overview & librariesWalaa Hamdy Assy
 
Enabling exploratory data science with Spark and R
Enabling exploratory data science with Spark and REnabling exploratory data science with Spark and R
Enabling exploratory data science with Spark and RDatabricks
 
Apache Spark and DataStax Enablement
Apache Spark and DataStax EnablementApache Spark and DataStax Enablement
Apache Spark and DataStax EnablementVincent Poncet
 
SparkR: Enabling Interactive Data Science at Scale on Hadoop
SparkR: Enabling Interactive Data Science at Scale on HadoopSparkR: Enabling Interactive Data Science at Scale on Hadoop
SparkR: Enabling Interactive Data Science at Scale on HadoopDataWorks Summit
 
From DataFrames to Tungsten: A Peek into Spark's Future @ Spark Summit San Fr...
From DataFrames to Tungsten: A Peek into Spark's Future @ Spark Summit San Fr...From DataFrames to Tungsten: A Peek into Spark's Future @ Spark Summit San Fr...
From DataFrames to Tungsten: A Peek into Spark's Future @ Spark Summit San Fr...Databricks
 
From DataFrames to Tungsten: A Peek into Spark's Future-(Reynold Xin, Databri...
From DataFrames to Tungsten: A Peek into Spark's Future-(Reynold Xin, Databri...From DataFrames to Tungsten: A Peek into Spark's Future-(Reynold Xin, Databri...
From DataFrames to Tungsten: A Peek into Spark's Future-(Reynold Xin, Databri...Spark Summit
 
DataFrame: Spark's new abstraction for data science by Reynold Xin of Databricks
DataFrame: Spark's new abstraction for data science by Reynold Xin of DatabricksDataFrame: Spark's new abstraction for data science by Reynold Xin of Databricks
DataFrame: Spark's new abstraction for data science by Reynold Xin of DatabricksData Con LA
 
Strata NYC 2015 - Supercharging R with Apache Spark
Strata NYC 2015 - Supercharging R with Apache SparkStrata NYC 2015 - Supercharging R with Apache Spark
Strata NYC 2015 - Supercharging R with Apache SparkDatabricks
 
Spark Streaming Programming Techniques You Should Know with Gerard Maas
Spark Streaming Programming Techniques You Should Know with Gerard MaasSpark Streaming Programming Techniques You Should Know with Gerard Maas
Spark Streaming Programming Techniques You Should Know with Gerard MaasSpark Summit
 
A Tale of Two APIs: Using Spark Streaming In Production
A Tale of Two APIs: Using Spark Streaming In ProductionA Tale of Two APIs: Using Spark Streaming In Production
A Tale of Two APIs: Using Spark Streaming In ProductionLightbend
 
Spark Programming
Spark ProgrammingSpark Programming
Spark ProgrammingTaewook Eom
 
Building highly scalable data pipelines with Apache Spark
Building highly scalable data pipelines with Apache SparkBuilding highly scalable data pipelines with Apache Spark
Building highly scalable data pipelines with Apache SparkMartin Toshev
 
SparkR: The Past, the Present and the Future-(Shivaram Venkataraman and Rui S...
SparkR: The Past, the Present and the Future-(Shivaram Venkataraman and Rui S...SparkR: The Past, the Present and the Future-(Shivaram Venkataraman and Rui S...
SparkR: The Past, the Present and the Future-(Shivaram Venkataraman and Rui S...Spark Summit
 
Big data processing with Apache Spark and Oracle Database
Big data processing with Apache Spark and Oracle DatabaseBig data processing with Apache Spark and Oracle Database
Big data processing with Apache Spark and Oracle DatabaseMartin Toshev
 
Apache spark-melbourne-april-2015-meetup
Apache spark-melbourne-april-2015-meetupApache spark-melbourne-april-2015-meetup
Apache spark-melbourne-april-2015-meetupNed Shawa
 
Strata NYC 2015: What's new in Spark Streaming
Strata NYC 2015: What's new in Spark StreamingStrata NYC 2015: What's new in Spark Streaming
Strata NYC 2015: What's new in Spark StreamingDatabricks
 

Similar to Parallelizing Existing R Packages (20)

Parallelize R Code Using Apache Spark
Parallelize R Code Using Apache Spark Parallelize R Code Using Apache Spark
Parallelize R Code Using Apache Spark
 
Parallelizing Existing R Packages with SparkR
Parallelizing Existing R Packages with SparkRParallelizing Existing R Packages with SparkR
Parallelizing Existing R Packages with SparkR
 
Enabling Exploratory Analysis of Large Data with Apache Spark and R
Enabling Exploratory Analysis of Large Data with Apache Spark and REnabling Exploratory Analysis of Large Data with Apache Spark and R
Enabling Exploratory Analysis of Large Data with Apache Spark and R
 
Apache spark - Architecture , Overview & libraries
Apache spark - Architecture , Overview & librariesApache spark - Architecture , Overview & libraries
Apache spark - Architecture , Overview & libraries
 
Enabling exploratory data science with Spark and R
Enabling exploratory data science with Spark and REnabling exploratory data science with Spark and R
Enabling exploratory data science with Spark and R
 
Apache Spark and DataStax Enablement
Apache Spark and DataStax EnablementApache Spark and DataStax Enablement
Apache Spark and DataStax Enablement
 
SparkR: Enabling Interactive Data Science at Scale on Hadoop
SparkR: Enabling Interactive Data Science at Scale on HadoopSparkR: Enabling Interactive Data Science at Scale on Hadoop
SparkR: Enabling Interactive Data Science at Scale on Hadoop
 
From DataFrames to Tungsten: A Peek into Spark's Future @ Spark Summit San Fr...
From DataFrames to Tungsten: A Peek into Spark's Future @ Spark Summit San Fr...From DataFrames to Tungsten: A Peek into Spark's Future @ Spark Summit San Fr...
From DataFrames to Tungsten: A Peek into Spark's Future @ Spark Summit San Fr...
 
From DataFrames to Tungsten: A Peek into Spark's Future-(Reynold Xin, Databri...
From DataFrames to Tungsten: A Peek into Spark's Future-(Reynold Xin, Databri...From DataFrames to Tungsten: A Peek into Spark's Future-(Reynold Xin, Databri...
From DataFrames to Tungsten: A Peek into Spark's Future-(Reynold Xin, Databri...
 
DataFrame: Spark's new abstraction for data science by Reynold Xin of Databricks
DataFrame: Spark's new abstraction for data science by Reynold Xin of DatabricksDataFrame: Spark's new abstraction for data science by Reynold Xin of Databricks
DataFrame: Spark's new abstraction for data science by Reynold Xin of Databricks
 
Strata NYC 2015 - Supercharging R with Apache Spark
Strata NYC 2015 - Supercharging R with Apache SparkStrata NYC 2015 - Supercharging R with Apache Spark
Strata NYC 2015 - Supercharging R with Apache Spark
 
Spark Streaming Programming Techniques You Should Know with Gerard Maas
Spark Streaming Programming Techniques You Should Know with Gerard MaasSpark Streaming Programming Techniques You Should Know with Gerard Maas
Spark Streaming Programming Techniques You Should Know with Gerard Maas
 
Apache Spark Workshop
Apache Spark WorkshopApache Spark Workshop
Apache Spark Workshop
 
A Tale of Two APIs: Using Spark Streaming In Production
A Tale of Two APIs: Using Spark Streaming In ProductionA Tale of Two APIs: Using Spark Streaming In Production
A Tale of Two APIs: Using Spark Streaming In Production
 
Spark Programming
Spark ProgrammingSpark Programming
Spark Programming
 
Building highly scalable data pipelines with Apache Spark
Building highly scalable data pipelines with Apache SparkBuilding highly scalable data pipelines with Apache Spark
Building highly scalable data pipelines with Apache Spark
 
SparkR: The Past, the Present and the Future-(Shivaram Venkataraman and Rui S...
SparkR: The Past, the Present and the Future-(Shivaram Venkataraman and Rui S...SparkR: The Past, the Present and the Future-(Shivaram Venkataraman and Rui S...
SparkR: The Past, the Present and the Future-(Shivaram Venkataraman and Rui S...
 
Big data processing with Apache Spark and Oracle Database
Big data processing with Apache Spark and Oracle DatabaseBig data processing with Apache Spark and Oracle Database
Big data processing with Apache Spark and Oracle Database
 
Apache spark-melbourne-april-2015-meetup
Apache spark-melbourne-april-2015-meetupApache spark-melbourne-april-2015-meetup
Apache spark-melbourne-april-2015-meetup
 
Strata NYC 2015: What's new in Spark Streaming
Strata NYC 2015: What's new in Spark StreamingStrata NYC 2015: What's new in Spark Streaming
Strata NYC 2015: What's new in Spark Streaming
 

Recently uploaded

Dealing with Cultural Dispersion — Stefano Lambiase — ICSE-SEIS 2024
Dealing with Cultural Dispersion — Stefano Lambiase — ICSE-SEIS 2024Dealing with Cultural Dispersion — Stefano Lambiase — ICSE-SEIS 2024
Dealing with Cultural Dispersion — Stefano Lambiase — ICSE-SEIS 2024StefanoLambiase
 
SensoDat: Simulation-based Sensor Dataset of Self-driving Cars
SensoDat: Simulation-based Sensor Dataset of Self-driving CarsSensoDat: Simulation-based Sensor Dataset of Self-driving Cars
SensoDat: Simulation-based Sensor Dataset of Self-driving CarsChristian Birchler
 
Software Project Health Check: Best Practices and Techniques for Your Product...
Software Project Health Check: Best Practices and Techniques for Your Product...Software Project Health Check: Best Practices and Techniques for Your Product...
Software Project Health Check: Best Practices and Techniques for Your Product...Velvetech LLC
 
Recruitment Management Software Benefits (Infographic)
Recruitment Management Software Benefits (Infographic)Recruitment Management Software Benefits (Infographic)
Recruitment Management Software Benefits (Infographic)Hr365.us smith
 
cpct NetworkING BASICS AND NETWORK TOOL.ppt
cpct NetworkING BASICS AND NETWORK TOOL.pptcpct NetworkING BASICS AND NETWORK TOOL.ppt
cpct NetworkING BASICS AND NETWORK TOOL.pptrcbcrtm
 
SuccessFactors 1H 2024 Release - Sneak-Peek by Deloitte Germany
SuccessFactors 1H 2024 Release - Sneak-Peek by Deloitte GermanySuccessFactors 1H 2024 Release - Sneak-Peek by Deloitte Germany
SuccessFactors 1H 2024 Release - Sneak-Peek by Deloitte GermanyChristoph Pohl
 
Exploring Selenium_Appium Frameworks for Seamless Integration with HeadSpin.pdf
Exploring Selenium_Appium Frameworks for Seamless Integration with HeadSpin.pdfExploring Selenium_Appium Frameworks for Seamless Integration with HeadSpin.pdf
Exploring Selenium_Appium Frameworks for Seamless Integration with HeadSpin.pdfkalichargn70th171
 
React Server Component in Next.js by Hanief Utama
React Server Component in Next.js by Hanief UtamaReact Server Component in Next.js by Hanief Utama
React Server Component in Next.js by Hanief UtamaHanief Utama
 
PREDICTING RIVER WATER QUALITY ppt presentation
PREDICTING  RIVER  WATER QUALITY  ppt presentationPREDICTING  RIVER  WATER QUALITY  ppt presentation
PREDICTING RIVER WATER QUALITY ppt presentationvaddepallysandeep122
 
Unveiling the Future: Sylius 2.0 New Features
Unveiling the Future: Sylius 2.0 New FeaturesUnveiling the Future: Sylius 2.0 New Features
Unveiling the Future: Sylius 2.0 New FeaturesŁukasz Chruściel
 
Balasore Best It Company|| Top 10 IT Company || Balasore Software company Odisha
Balasore Best It Company|| Top 10 IT Company || Balasore Software company OdishaBalasore Best It Company|| Top 10 IT Company || Balasore Software company Odisha
Balasore Best It Company|| Top 10 IT Company || Balasore Software company Odishasmiwainfosol
 
Folding Cheat Sheet #4 - fourth in a series
Folding Cheat Sheet #4 - fourth in a seriesFolding Cheat Sheet #4 - fourth in a series
Folding Cheat Sheet #4 - fourth in a seriesPhilip Schwarz
 
Sending Calendar Invites on SES and Calendarsnack.pdf
Sending Calendar Invites on SES and Calendarsnack.pdfSending Calendar Invites on SES and Calendarsnack.pdf
Sending Calendar Invites on SES and Calendarsnack.pdf31events.com
 
KnowAPIs-UnknownPerf-jaxMainz-2024 (1).pptx
KnowAPIs-UnknownPerf-jaxMainz-2024 (1).pptxKnowAPIs-UnknownPerf-jaxMainz-2024 (1).pptx
KnowAPIs-UnknownPerf-jaxMainz-2024 (1).pptxTier1 app
 
Automate your Kamailio Test Calls - Kamailio World 2024
Automate your Kamailio Test Calls - Kamailio World 2024Automate your Kamailio Test Calls - Kamailio World 2024
Automate your Kamailio Test Calls - Kamailio World 2024Andreas Granig
 
GOING AOT WITH GRAALVM – DEVOXX GREECE.pdf
GOING AOT WITH GRAALVM – DEVOXX GREECE.pdfGOING AOT WITH GRAALVM – DEVOXX GREECE.pdf
GOING AOT WITH GRAALVM – DEVOXX GREECE.pdfAlina Yurenko
 
Unveiling Design Patterns: A Visual Guide with UML Diagrams
Unveiling Design Patterns: A Visual Guide with UML DiagramsUnveiling Design Patterns: A Visual Guide with UML Diagrams
Unveiling Design Patterns: A Visual Guide with UML DiagramsAhmed Mohamed
 
Tech Tuesday - Mastering Time Management Unlock the Power of OnePlan's Timesh...
Tech Tuesday - Mastering Time Management Unlock the Power of OnePlan's Timesh...Tech Tuesday - Mastering Time Management Unlock the Power of OnePlan's Timesh...
Tech Tuesday - Mastering Time Management Unlock the Power of OnePlan's Timesh...OnePlan Solutions
 
Comparing Linux OS Image Update Models - EOSS 2024.pdf
Comparing Linux OS Image Update Models - EOSS 2024.pdfComparing Linux OS Image Update Models - EOSS 2024.pdf
Comparing Linux OS Image Update Models - EOSS 2024.pdfDrew Moseley
 

Recently uploaded (20)

Dealing with Cultural Dispersion — Stefano Lambiase — ICSE-SEIS 2024
Dealing with Cultural Dispersion — Stefano Lambiase — ICSE-SEIS 2024Dealing with Cultural Dispersion — Stefano Lambiase — ICSE-SEIS 2024
Dealing with Cultural Dispersion — Stefano Lambiase — ICSE-SEIS 2024
 
SensoDat: Simulation-based Sensor Dataset of Self-driving Cars
SensoDat: Simulation-based Sensor Dataset of Self-driving CarsSensoDat: Simulation-based Sensor Dataset of Self-driving Cars
SensoDat: Simulation-based Sensor Dataset of Self-driving Cars
 
Software Project Health Check: Best Practices and Techniques for Your Product...
Software Project Health Check: Best Practices and Techniques for Your Product...Software Project Health Check: Best Practices and Techniques for Your Product...
Software Project Health Check: Best Practices and Techniques for Your Product...
 
Recruitment Management Software Benefits (Infographic)
Recruitment Management Software Benefits (Infographic)Recruitment Management Software Benefits (Infographic)
Recruitment Management Software Benefits (Infographic)
 
cpct NetworkING BASICS AND NETWORK TOOL.ppt
cpct NetworkING BASICS AND NETWORK TOOL.pptcpct NetworkING BASICS AND NETWORK TOOL.ppt
cpct NetworkING BASICS AND NETWORK TOOL.ppt
 
SuccessFactors 1H 2024 Release - Sneak-Peek by Deloitte Germany
SuccessFactors 1H 2024 Release - Sneak-Peek by Deloitte GermanySuccessFactors 1H 2024 Release - Sneak-Peek by Deloitte Germany
SuccessFactors 1H 2024 Release - Sneak-Peek by Deloitte Germany
 
Exploring Selenium_Appium Frameworks for Seamless Integration with HeadSpin.pdf
Exploring Selenium_Appium Frameworks for Seamless Integration with HeadSpin.pdfExploring Selenium_Appium Frameworks for Seamless Integration with HeadSpin.pdf
Exploring Selenium_Appium Frameworks for Seamless Integration with HeadSpin.pdf
 
React Server Component in Next.js by Hanief Utama
React Server Component in Next.js by Hanief UtamaReact Server Component in Next.js by Hanief Utama
React Server Component in Next.js by Hanief Utama
 
PREDICTING RIVER WATER QUALITY ppt presentation
PREDICTING  RIVER  WATER QUALITY  ppt presentationPREDICTING  RIVER  WATER QUALITY  ppt presentation
PREDICTING RIVER WATER QUALITY ppt presentation
 
Unveiling the Future: Sylius 2.0 New Features
Unveiling the Future: Sylius 2.0 New FeaturesUnveiling the Future: Sylius 2.0 New Features
Unveiling the Future: Sylius 2.0 New Features
 
Balasore Best It Company|| Top 10 IT Company || Balasore Software company Odisha
Balasore Best It Company|| Top 10 IT Company || Balasore Software company OdishaBalasore Best It Company|| Top 10 IT Company || Balasore Software company Odisha
Balasore Best It Company|| Top 10 IT Company || Balasore Software company Odisha
 
Folding Cheat Sheet #4 - fourth in a series
Folding Cheat Sheet #4 - fourth in a seriesFolding Cheat Sheet #4 - fourth in a series
Folding Cheat Sheet #4 - fourth in a series
 
Sending Calendar Invites on SES and Calendarsnack.pdf
Sending Calendar Invites on SES and Calendarsnack.pdfSending Calendar Invites on SES and Calendarsnack.pdf
Sending Calendar Invites on SES and Calendarsnack.pdf
 
Hot Sexy call girls in Patel Nagar🔝 9953056974 🔝 escort Service
Hot Sexy call girls in Patel Nagar🔝 9953056974 🔝 escort ServiceHot Sexy call girls in Patel Nagar🔝 9953056974 🔝 escort Service
Hot Sexy call girls in Patel Nagar🔝 9953056974 🔝 escort Service
 
KnowAPIs-UnknownPerf-jaxMainz-2024 (1).pptx
KnowAPIs-UnknownPerf-jaxMainz-2024 (1).pptxKnowAPIs-UnknownPerf-jaxMainz-2024 (1).pptx
KnowAPIs-UnknownPerf-jaxMainz-2024 (1).pptx
 
Automate your Kamailio Test Calls - Kamailio World 2024
Automate your Kamailio Test Calls - Kamailio World 2024Automate your Kamailio Test Calls - Kamailio World 2024
Automate your Kamailio Test Calls - Kamailio World 2024
 
GOING AOT WITH GRAALVM – DEVOXX GREECE.pdf
GOING AOT WITH GRAALVM – DEVOXX GREECE.pdfGOING AOT WITH GRAALVM – DEVOXX GREECE.pdf
GOING AOT WITH GRAALVM – DEVOXX GREECE.pdf
 
Unveiling Design Patterns: A Visual Guide with UML Diagrams
Unveiling Design Patterns: A Visual Guide with UML DiagramsUnveiling Design Patterns: A Visual Guide with UML Diagrams
Unveiling Design Patterns: A Visual Guide with UML Diagrams
 
Tech Tuesday - Mastering Time Management Unlock the Power of OnePlan's Timesh...
Tech Tuesday - Mastering Time Management Unlock the Power of OnePlan's Timesh...Tech Tuesday - Mastering Time Management Unlock the Power of OnePlan's Timesh...
Tech Tuesday - Mastering Time Management Unlock the Power of OnePlan's Timesh...
 
Comparing Linux OS Image Update Models - EOSS 2024.pdf
Comparing Linux OS Image Update Models - EOSS 2024.pdfComparing Linux OS Image Update Models - EOSS 2024.pdf
Comparing Linux OS Image Update Models - EOSS 2024.pdf
 

Parallelizing Existing R Packages

  • 1. Parallelizing Existing R Packages with SparkR Hossein Falaki @mhfalaki
  • 2. About me • Former Data Scientist at Apple Siri • Software Engineer at Databricks • Started using Apache Spark since version 0.6 • Developed first version of Apache Spark CSV data source • Worked on SparkR &Databricks R Notebook feature • Currently focusing on R experience at Databricks 2
  • 3. What is SparkR? An R package distributed with Apache Spark (soon CRAN): - Provides R frontend to Spark - Exposes Spark DataFrames (inspired by R and Pandas) - Convenient interoperability between R and Spark DataFrames 3 distributed/robust processing, data sources, off-memory data structures + Dynamic environment, interactivity, packages, visualization
  • 5. SparkR architecture (since 2.0) 5 Spark Driver R JVM R Backend JVM Worker JVM Worker Data Sources R R
  • 6. Overview of SparkR API IO read.df / write.df / createDataFrame / collect Caching cache / persist / unpersist / cacheTable / uncacheTable SQL sql / table / saveAsTable / registerTempTable / tables 6 ML Lib glm / kmeans / Naïve Bayes Survival regression DataFrame API select / subset / groupBy / head / avg / column / dim UDF functionality (since 2.0) spark.lapply / dapply / gapply / dapplyCollect http://spark.apache.org/docs/latest/api/R/
  • 7. Overview of SparkR API :: Session Spark session is your interface to Spark functionality in R o SparkR DataFrames are implemented on top of SparkSQL tables o All DataFrame operations go through a SQL optimizer (catalyst) o Since 2.0 sqlContext is wrapped in a new object called SparkR Session. 7 > spark <- sparkR.session() All SparkR functions work if you pass them a session or will assume an existing session.
  • 9. Moving data between R and JVM 9 R JVM R Backend SparkR::collect() SparkR::createDataFrame()
  • 10. Overview of SparkR API :: DataFrame API SparkR DataFrame behaves similar to R data.frames > sparkDF$newCol <- sparkDF$col + 1 > subsetDF <- sparkDF[, c(“date”, “type”)] > recentData <- subset(sparkDF$date == “2015-10-24”) > firstRow <- sparkDF[[1, ]] > names(subsetDF) <- c(“Date”, “Type”) > dim(recentData) > head(count(group_by(subsetDF, “Date”))) 10
  • 11. Overview of SparkR API :: SQL You can register a DataFrame as a table and query it in SQL > logs <- read.df(“data/logs”, source = “json”) > registerTempTable(df, “logsTable”) > errorsByCode <- sql(“select count(*) as num, type from logsTable where type == “error” group by code order by date desc”) > reviewsDF <- tableToDF(“reviewsTable”) > registerTempTable(filter(reviewsDF, reviewsDF$rating == 5), “fiveStars”) 11
  • 12. Moving between languages 12 R Scala Spark df <- read.df(...) wiki <- filter(df, ...) registerTempTable(wiki, “wiki”) val wiki = table(“wiki”) val parsed = wiki.map { Row(_, _, text: String, _, _) =>text.split(‘ ’) } val model = Kmeans.train(parsed)
  • 13. Overview of SparkR API IO read.df / write.df / createDataFrame / collect Caching cache / persist / unpersist / cacheTable / uncacheTable SQL sql / table / saveAsTable / registerTempTable / tables 13 ML Lib glm / kmeans / Naïve Bayes Survival regression DataFrame API select / subset / groupBy / head / avg / column / dim UDF functionality (since 2.0) spark.lapply / dapply / gapply / dapplyCollect http://spark.apache.org/docs/latest/api/R/
  • 14. SparkR UDF API 14 spark.lapply Runs a function over a list of elements spark.lapply() dapply Applies a function to each partition of a SparkDataFrame dapply() dapplyCollect() gapply Applies a function to each group within a SparkDataFrame gapply() gapplyCollect()
  • 15. spark.lapply 15 Simplest SparkR UDF pattern For each element of a list: 1.  Sends the function to an R worker 2.  Executes the function 3.  Returns the result of all workers as a list to R driver spark.lapply(1:100, function(x) { runBootstrap(x) }
  • 16. spark.lapply control flow 16 R Worker JVM R Worker JVM R Worker JVM R Driver JVM 1. Serialize R closure 3. Transfer serialized closure over the network 5. De-serialize closure 4. Transfer over local socket 6. Serialize result 2. Transfer over local socket 7. Transfer over local socket 9. Transfer over local socket 10. Deserialize result 8. Transfer serialized closure over the network
  • 17. dapply 17 For each partition of a Spark DataFrame 1.  collects each partition as an R data.frame 2.  sends the R function to the R worker 3.  executes the function dapply(sparkDF, func, schema) combines results as DataFrame with schema dapplyCollect(sparkDF, func) combines results as R data.frame
  • 18. dapply control & data flow 18 R Worker JVM R Worker JVM R Worker JVM R Driver JVM local socket cluster network local socket input data ser/de transfer result data ser/de transfer
  • 19. dapplyCollect control & data flow 19 R Worker JVM R Worker JVM R Worker JVM R Driver JVM local socket cluster network local socket input data ser/de transfer result transfer result deser
  • 20. gapply 20 Groups a Spark DataFrame on one or more columns 1.  collects each group as an R data.frame 2.  sends the R function to the R worker 3.  executes the function gapply(sparkDF, cols, func, schema) combines results as DataFrame with schema gapplyCollect(sparkDF, cols, func) combines results as R data.frame
  • 21. gapply control & data flow 21 R Worker JVM R Worker JVM R Worker JVM R Driver JVM local socket cluster network local socket input data ser/de transfer result data ser/de transfer data shuffle
  • 22. dapply vs. gapply 22 gapply dapply signature gapply(df, cols, func, schema) gapply(gdf, func, schema) dapply(df, func, schema) user function signature function(key, data) function(data) data partition controlled by grouping not controlled
  • 23. Parallelizing data • Do not use spark.lapply() to distribute large data sets • Do not pack data in the closure • Watch for skew in data – Are partitions evenly sized? • Auxiliary data – Can be joined with input DataFrame – Can be distributed to all the workers using FileSystem 23
  • 24. Packages on workers • SparkR closure capture does not include packages • You need to import packages on each worker inside your function • If not installed install packages on workers out-of-band • spark.lapply() can be used to install packages 24
  • 25. Debugging user code 1.  Verify your code on the Driver 2.  Interactively execute the code on the cluster –  When R worker fails, Spark Driver throws exception with the R error text 3.  Inspect details of failure reason of failed job in spark UI 4.  Inspect stdout/stderror of workers 25
  • 26. Demo 26 Notebooks available at: •  hSp://bit.ly/2krYMwC •  hSp://bit.ly/2ltLVKs