Sparklyr: Big Data enabler for R users

ICTeam S.p.A. – Design Division presentation
Serena Signorelli
Data Science Milan, May 15th 2017

Outline
• About me
• The Data Science process
• Package and its functionalities
• SparkR vs Sparklyr
• Demo on NYC taxi data

About me
Experience:
• Business administration and management
• Research grants in Economics Statistics
• PhD in Analytics for Economics and Business
• Traineeship at Eurostat Big Data Task Force
• Data scientist at ICTeam SpA
Why Sparklyr?
• R user
• No computer science background
• Need to handle Big Data

R language
• Open source
• 5th most popular programming language in 2016 (IEEE Spectrum ranking)
• Data analysis, statistical modelling and visualization
• Historically limited to in-memory data

Data Science process
1. Import data into memory
2. Clean and tidy the data
3. Cyclical process called "understand":
   1. making transformations to tidied data
   2. using the transformed data to fit models
   3. visualizing results
4. Communicate the results

Data Science process with big data
Problem: data is too large to download into memory
Workaround: use a very small sample or download as much data as possible
Limitations: the sample may not be representative, and every iteration of importing, exploring and modeling involves a long wait
Solution: use Sparklyr to access and analyze the data inside Spark and only bring results into R
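
A minimal sketch of this approach, assuming a local Spark installation is available (for instance one set up with spark_install()). The built-in mtcars data stands in for a large table, and the table and variable names are illustrative, not taken from the talk.

```r
library(sparklyr)
library(dplyr)

# Assumption: Spark is installed locally; a real cluster would use another master.
sc <- spark_connect(master = "local")

# Copy a small data set into Spark; real big data would already live there.
mtcars_tbl <- copy_to(sc, mtcars, "mtcars", overwrite = TRUE)

# The aggregation runs inside Spark; only the small summary is brought into R.
mpg_by_cyl <- mtcars_tbl %>%
  group_by(cyl) %>%
  summarise(n = n(), mpg_mean = mean(mpg, na.rm = TRUE)) %>%
  collect()

print(mpg_by_cyl)

spark_disconnect(sc)
```

Only mpg_by_cyl, one row per cylinder count, ever reaches the R session; the full table stays in Spark.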

Sparklyr: R interface for Apache Spark
• First release: 0.4 – September 24th, 2016
• Current release: 0.5.4 – April 25th, 2017

Dplyr
• Dplyr verbs:
  • select ~ SELECT
  • filter ~ WHERE
  • arrange ~ ORDER BY
  • summarise ~ aggregators: sum, min, sd, etc.
  • mutate ~ operators: +, *, log, etc.
• Grouping: group_by ~ GROUP BY
• Window functions: rank, dense_rank, percent_rank, ntile, row_number, cume_dist, first_value, last_value, lag, lead
• Performing joins: inner_join, semi_join, left_join, anti_join, full_join
• Sampling: sample_n, sample_frac

Dplyr
SQL translation:
• Basic math operators: +, -, *, /, %%, ^
• Math functions: abs, acos, asin, asinh, atan, atan2, ceiling, cos, cosh, exp, floor, log, log10, round, sign, sin, sinh, sqrt, tan, tanh
• Logical comparisons: <, <=, !=, >=, >, ==, %in%
• Boolean operations: &, &&, |, ||, !
• Character functions: paste, tolower, toupper, nchar
• Casting: as.double, as.integer, as.logical, as.character, as.date
• Basic aggregations: mean, sum, min, max, sd, var, cor, cov, n
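
The mapping from dplyr verbs and R functions to SQL can be inspected without a cluster. The sketch below assumes the dbplyr package and its simulate_hive() test connection, which is a stand-in for Spark SQL rather than the connection used in the talk. The trips table and its columns are made up for illustration, and the exact SQL text depends on the dbplyr version.

```r
library(dplyr)
library(dbplyr)

# Hypothetical table definition: no database or Spark session is contacted.
trips <- lazy_frame(
  pickup_zone = "A", trip_distance = 1.5, fare_amount = 10,
  con = simulate_hive()
)

query <- trips %>%
  filter(trip_distance > 1) %>%                  # becomes WHERE
  mutate(log_fare = log(fare_amount)) %>%        # math function translated to SQL
  group_by(pickup_zone) %>%                      # becomes GROUP BY
  summarise(n = n(),
            log_fare_mean = mean(log_fare, na.rm = TRUE)) %>%
  arrange(desc(n))                               # becomes ORDER BY

# Print the generated SQL and keep it as a string for inspection.
show_query(query)
sql_text <- as.character(sql_render(query))
```

Against a real Spark connection, show_query() can be called on a sparklyr table in the same way.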

Dplyr in Sparklyr
• Hive functions: many of Hive’s built-in functions (UDF) and built-in aggregate functions (UDAF) can be called inside dplyr’s mutate and summarize
• Reading and writing data: spark_read_csv, spark_read_json, spark_read_parquet, spark_write_csv, spark_write_json, spark_write_parquet
• Collecting to R: collect()
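
A short sketch of these three points together, assuming a local Spark installation. The temporary file paths and the table names "mtcars_csv" and "mtcars_parquet" are illustrative. percentile() is not an R function: it is assumed here to be passed through untranslated to Spark SQL, where it is a built-in aggregate.

```r
library(sparklyr)
library(dplyr)

sc <- spark_connect(master = "local")

# Write a small CSV file to read back; real data would already be on disk or HDFS.
csv_path <- tempfile(fileext = ".csv")
write.csv(mtcars, csv_path, row.names = FALSE)

# Reading: load the CSV into a Spark DataFrame registered as "mtcars_csv".
cars_tbl <- spark_read_csv(sc, name = "mtcars_csv", path = csv_path,
                           header = TRUE, infer_schema = TRUE)

# Hive function: percentile() is evaluated by Spark SQL, not by R.
mpg_median <- cars_tbl %>%
  summarise(mpg_median = percentile(mpg, 0.5)) %>%
  collect()   # Collecting: bring the one-row result into R

# Writing: store the table as Parquet, then read it back and count rows.
parquet_path <- tempfile()
spark_write_parquet(cars_tbl, path = parquet_path)
cars_parquet_tbl <- spark_read_parquet(sc, name = "mtcars_parquet",
                                       path = parquet_path)
n_rows <- sdf_nrow(cars_parquet_tbl)

spark_disconnect(sc)
```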

Dplyr in Sparklyr
Characteristics:
• Laziness
  • It never pulls data into R unless you explicitly ask for it
  • It delays doing any work until the last possible moment: it collects together everything you want to do and then sends it to the database in one step
• Piping %>%
  • From package magrittr
  • Provides a mechanism for chaining commands with a forward-pipe operator
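
Both characteristics can be seen in a few lines. This is a sketch assuming a local Spark installation; the table name "mtcars_lazy" and the variable names are illustrative.

```r
library(sparklyr)
library(dplyr)

sc <- spark_connect(master = "local")
mtcars_tbl <- copy_to(sc, mtcars, "mtcars_lazy", overwrite = TRUE)

# Piping: x %>% f(y) means f(x, y), so these two lines describe the same query.
heavy_piped  <- mtcars_tbl %>% filter(wt > 3) %>% select(mpg, wt)
heavy_nested <- select(filter(mtcars_tbl, wt > 3), mpg, wt)

# Laziness: heavy_piped is only a query description, no rows are in R yet.
is_lazy <- inherits(heavy_piped, "tbl_lazy")
show_query(heavy_piped)

# Data is pulled into R only when explicitly requested with collect().
piped_local  <- collect(heavy_piped)
nested_local <- collect(heavy_nested)

spark_disconnect(sc)
```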

Dplyr in Sparklyr: an example
SELECT `dropoff_ntacode`, `dropoff_ntaname`, count(*) AS `n`, AVG(`trip_time`) AS `trip_time_mean`,
AVG(`trip_distance`) AS `trip_dist_mean`, AVG(`dropoff_latitude`) AS `dropoff_latitude`,
AVG(`dropoff_longitude`) AS `dropoff_longitude`, AVG(`passenger_count`) AS `passenger_mean`,
AVG(`fare_amount`) AS `fare_amount`, AVG(`tip_amount`) AS `tip_amount`
FROM (SELECT `vendorid`, `pickup_datetime`, `dropoff_datetime`, `passenger_count`, `trip_distance`,
`pickup_longitude`, `pickup_latitude`, `ratecodeid`, `store_and_fwd_flag`, `dropoff_longitude`,
`dropoff_latitude`, `payment_type`, `fare_amount`, `extra`, `mta_tax`, `tip_amount`, `tolls_amount`,
`improvement_surcharge`, `total_amount`, `pickup_borocode`, `pickup_boroname`, `pickup_ntacode`,
`pickup_ntaname`, `dropoff_borocode`, `dropoff_boroname`, `dropoff_ntacode`, `dropoff_ntaname`,
UNIX_TIMESTAMP(`dropoff_datetime`) - UNIX_TIMESTAMP(`pickup_datetime`) AS `trip_time`
FROM (SELECT *
FROM (SELECT *
FROM `yellow_taxi_raw_data_2009_2016_june_geo_partitioned`
WHERE (`pickup_ntacode` = 'QN98')) `jxjgtsgzwv`
WHERE (NOT((`dropoff_ntacode`) IS NULL))) `cobkqjitky`) `vquwddaabv`
GROUP BY `dropoff_ntacode`, `dropoff_ntaname`
dplyr code that generates the Spark SQL above:
jfk_pickup_tbl <- yellow_taxi_raw_data_prepared_tbl %>%
  filter(pickup_ntacode == 'QN98') %>%
  filter(!is.na(dropoff_ntacode)) %>%
  mutate(trip_time = unix_timestamp(dropoff_datetime) - unix_timestamp(pickup_datetime)) %>%
  group_by(dropoff_ntacode, dropoff_ntaname) %>%
  summarize(n = n(),
            trip_time_mean = mean(trip_time),
            trip_dist_mean = mean(trip_distance),
            dropoff_latitude = mean(dropoff_latitude),
            dropoff_longitude = mean(dropoff_longitude),
            passenger_mean = mean(passenger_count),
            fare_amount = mean(fare_amount),
            tip_amount = mean(tip_amount))

Dplyr in Sparklyr: an example
SELECT *
FROM (SELECT `dropoff_ntacode`, `dropoff_ntaname`, `n`, `trip_time_mean`, `trip_dist_mean`,
`dropoff_latitude`, `dropoff_longitude`, `passenger_mean`, `fare_amount`, `tip_amount`, rank() OVER
(PARTITION BY `dropoff_ntacode` ORDER BY `n` DESC) AS `n_rank`
FROM (SELECT `dropoff_ntacode`, `dropoff_ntaname`, count(*) AS `n`, AVG(`trip_time`) AS
`trip_time_mean`, AVG(`trip_distance`) AS `trip_dist_mean`, AVG(`dropoff_latitude`) AS
`dropoff_latitude`, AVG(`dropoff_longitude`) AS `dropoff_longitude`, AVG(`passenger_count`) AS
`passenger_mean`, AVG(`fare_amount`) AS `fare_amount`, AVG(`tip_amount`) AS `tip_amount`
FROM (SELECT `vendorid`, `pickup_datetime`, `dropoff_datetime`, `passenger_count`, `trip_distance`,
`pickup_longitude`, `pickup_latitude`, `ratecodeid`, `store_and_fwd_flag`, `dropoff_longitude`,
`dropoff_latitude`, `payment_type`, `fare_amount`, `extra`, `mta_tax`, `tip_amount`, `tolls_amount`,
`improvement_surcharge`, `total_amount`, `pickup_borocode`, `pickup_boroname`, `pickup_ntacode`,
`pickup_ntaname`, `dropoff_borocode`, `dropoff_boroname`, `dropoff_ntacode`, `dropoff_ntaname`,
UNIX_TIMESTAMP(`dropoff_datetime`) - UNIX_TIMESTAMP(`pickup_datetime`) AS `trip_time`
FROM (SELECT *
FROM (SELECT *
FROM `yellow_taxi_raw_data_2009_2016_june_geo_partitioned`
WHERE (`pickup_ntacode` = 'QN98')) `zrhxchievt`
WHERE (NOT((`dropoff_ntacode`) IS NULL))) `wedupsfkki`) `hkpwsclpve`
GROUP BY `dropoff_ntacode`, `dropoff_ntaname`) `bugdzqpxlv`) `haexysqfhn`
WHERE (`n_rank` <= 25.0)
dplyr code that generates the Spark SQL above:
jfk_pickup <- jfk_pickup_tbl %>%
  mutate(n_rank = min_rank(desc(n))) %>%
  filter(n_rank <= 25)

ML in Sparklyr
Sparklyr gives access to the machine learning routines provided by the spark.ml package
Three families of functions:
• Machine learning algorithms for analyzing data (ml_*)
• Feature transformers for manipulating individual features (ft_*)
• Functions for manipulating Spark DataFrames (sdf_*)
Example:
• Perform SQL queries through the sparklyr dplyr interface
• Use the sdf_* and ft_* family of functions to generate new columns, or partition your data set
• Choose an appropriate machine learning algorithm from the ml_* family of functions to model your data
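
The three steps of the example might look as follows. This is a sketch assuming a local Spark installation and a recent sparklyr release: function and argument names such as sdf_random_split, ml_predict, input_col and output_col are assumed to match the installed version and differ from the 0.5.x API current at the time of the talk. The mtcars data, the 150 hp threshold and the model formula are arbitrary illustrations.

```r
library(sparklyr)
library(dplyr)

sc <- spark_connect(master = "local")
mtcars_tbl <- copy_to(sc, mtcars, "mtcars_ml", overwrite = TRUE)

# Step 1: prepare the data through the dplyr interface.
prepared <- mtcars_tbl %>%
  filter(!is.na(mpg)) %>%
  select(mpg, wt, cyl, hp)

# Step 2: ft_* adds a derived column, sdf_* partitions the data set.
prepared <- ft_binarizer(prepared, input_col = "hp",
                         output_col = "high_hp", threshold = 150)
splits <- sdf_random_split(prepared, training = 0.7, test = 0.3, seed = 42)

# Step 3: fit an ml_* model on the training split and score the test split.
fit <- ml_linear_regression(splits$training, mpg ~ wt + cyl + high_hp)
pred <- ml_predict(fit, splits$test) %>% collect()

spark_disconnect(sc)
```

The fitting and scoring run in Spark; only the scored test rows are collected into R.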

Extensions in Sparklyr
Extensions can be created to call the full Spark API and to provide interfaces to Spark packages
Package          Description
spark.sas7bdat   Read in SAS data in parallel into Apache Spark.
rsparkling       Extension for using H2O machine learning algorithms against Spark Data Frames.
sparkhello       Simple example of including a custom JAR file within an extension package.
rddlist          Implements some methods of an R list as a Spark RDD (resilient distributed dataset).
sparkwarc        Load WARC files into Apache Spark with sparklyr.
sparkavro        Load Avro data into Spark with sparklyr. It is a wrapper of spark-avro.

Sparklyr help: the RStudio cheat sheet

SparkR vs Sparklyr
SparkR:
• natively included in Spark after version 1.6.2
• example:
    df <- createDataFrame(flights)
    head(select(df, df$distance, df$origin))
    # or
    head(df[, c('distance', 'origin')])
    filter(df, df$distance > 3000)
• documentation through R’s help
Sparklyr:
• developed by RStudio, available on CRAN and GitHub
• allows downloading and installing Spark for development purposes
• example:
    df <- copy_to(sc2, flights)
    head(select(df, distance, origin))
    filter(df, distance > 3000)
• documentation through R’s help

SparkR vs Sparklyr
SparkR:
• spark.logit
• spark.mlp
• spark.naiveBayes
• spark.survreg
• spark.glm
• spark.gbt
• spark.randomForest
• spark.kmeans
• spark.lda
• spark.isoreg
• spark.gaussianMixture
• spark.als
• spark.kstest
• UDF functions
Sparklyr:
• ml_logistic_regression
• ml_multilayer_perceptron
• ml_naive_bayes
• ml_survival_regression
• ml_generalized_linear_regression
• ml_gradient_boosted_trees
• ml_random_forest
• ml_kmeans
• ml_lda
• ml_linear_regression
• ml_decision_tree
• ml_pca
• ml_one_vs_rest
• UDF functions (but can invoke Scala code)

SparkR vs Sparklyr in Google Trends

Demo on NYC taxi data
• 1 billion NYC taxi trip records
• Original analysis by Todd W. Schneider [1], November 2015
• RStudio webinar [2], October 2016
• 77 GB of data stored in a Hive table
[1] http://toddwschneider.com/posts/analyzing-1-1-billion-nyc-taxi-and-uber-trips-with-a-vengeance/
[2] https://www.rstudio.com/resources/webinars/using-spark-with-shiny-and-r-markdown/

serena.signorelli@icteam.it
Thank you for your attention
