Spark Summit EU 2015: Combining the Strengths of MLlib, scikit-learn, and R

This talk discusses integrating common data science tools like Python pandas, scikit-learn, and R with MLlib, Spark’s distributed Machine Learning (ML) library. Integration is simple; migration to distributed ML can be done lazily; and scaling to big data can significantly improve accuracy. We demonstrate integration with a simple data science workflow. Data scientists often encounter scaling bottlenecks with single-machine ML tools. Yet the overhead in migrating to a distributed workflow can seem daunting. In this talk, we demonstrate such a migration, taking advantage of Spark and MLlib’s integration with common ML libraries. We begin with a small dataset which runs on a single machine. Increasing the size, we hit bottlenecks in various parts of the workflow: hyperparameter tuning, then ETL, and eventually the core learning algorithm. As we hit each bottleneck, we parallelize that part of the workflow using Spark and MLlib. As we increase the dataset and model size, we can see significant gains in accuracy. We end with results demonstrating the impressive scalability of MLlib algorithms. With accuracy comparable to traditional ML libraries, combined with state-of-the-art distributed scalability, MLlib is a valuable new tool for the modern data scientist.

Combining the Strengths
of MLlib, scikit-learn, & R
Joseph K. Bradley
Spark Summit Europe
October 2015

scikit-learn & R
Great libraries
• Detailed documentation & how-to guides
• Many packages & extensions
Business investment
• Education
• Tooling & workflows

Big Data
• Scaling (trees)
• Topic model on 4.5 million Wikipedia articles
• Recommendation with 50 million users, 5 million songs, 50 billion ratings

Big Data & MLlib
• More data → higher accuracy
• Scale with business (# users, available data)
• Integrate with production systems

Bridging the gap
How do you get from a single-machine workload
to a fully distributed one?
At school: Machine Learning with R on my laptop
The Goal: Machine Learning on a huge computing cluster

Wish list
• Run original code on a production environment
• Use distributed data sources
• Distribute ML workload piece by piece
• Only distribute as needed
• Easily switch between local & distributed settings
• Use familiar APIs

Our task
Sentiment analysis
Given a review (text), predict the user’s rating.
Data from https://snap.stanford.edu/data/web-Amazon.html

Our ML workflow
Text: "This scarf I bought is very strange. When I ..."
Label: Rating = 3.0
→ Tokenizer
Words: [This, scarf, I, bought, ...]
→ Hashing Term-Freq
Features: [2.0, 0.0, 3.0, ...]
→ Linear Regression
Prediction: Rating = 2.0
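The tokenizer and hashing term-frequency steps of this workflow can be sketched in plain Python. This is an illustrative single-machine sketch, not the code from the demo: the function names, the lowercasing, the regular expression, the CRC32 hash, and the vector length of 16 are all assumptions chosen for the example.

```python
import re
import zlib

def tokenize(text):
    """Lowercase the text and split it into word tokens (assumed tokenization rule)."""
    return re.findall(r"[a-z0-9']+", text.lower())

def hashing_tf(tokens, num_features=16):
    """Map tokens to a fixed-length term-frequency vector using the hashing trick."""
    vec = [0.0] * num_features
    for tok in tokens:
        # crc32 is deterministic across runs, unlike Python's built-in hash() for str
        idx = zlib.crc32(tok.encode("utf-8")) % num_features
        vec[idx] += 1.0
    return vec

review = "This scarf I bought is very strange."
words = tokenize(review)
features = hashing_tf(words)
```

Because the index comes from a hash, no vocabulary has to be built or shared, which is what makes this step easy to run independently on each piece of a large dataset. Distinct words can collide in the same slot; a larger `num_features` makes that less likely.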

Our ML workflow
Cross Validation
• Feature Extraction
• Linear Regression
• regularization parameter: {0.0, 0.1, ...}
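As a rough illustration of tuning the regularization parameter by cross validation, here is a single-machine sketch in plain Python. It uses a one-feature ridge regression with a closed-form solution so that it stays self-contained; the synthetic data, the grid values beyond 0.0 and 0.1, and all function names are assumptions for the example, not part of the talk.

```python
import random

def fit_ridge_1d(xs, ys, reg):
    """Closed-form ridge fit for y ~ w * x (single feature, no intercept)."""
    sxy = sum(x * y for x, y in zip(xs, ys))
    sxx = sum(x * x for x in xs)
    return sxy / (sxx + reg)

def mse(w, xs, ys):
    """Mean squared error of the model y = w * x."""
    return sum((w * x - y) ** 2 for x, y in zip(xs, ys)) / len(xs)

def cross_validate(xs, ys, reg, k=3):
    """Average held-out MSE over k folds for one regularization value."""
    n = len(xs)
    total = 0.0
    for fold in range(k):
        held_out = set(range(fold, n, k))  # every k-th point forms the test fold
        train_x = [xs[i] for i in range(n) if i not in held_out]
        train_y = [ys[i] for i in range(n) if i not in held_out]
        test_x = [xs[i] for i in sorted(held_out)]
        test_y = [ys[i] for i in sorted(held_out)]
        w = fit_ridge_1d(train_x, train_y, reg)
        total += mse(w, test_x, test_y)
    return total / k

# Synthetic data (assumption): y = 2x plus a little Gaussian noise.
rng = random.Random(0)
xs = [i / 10.0 for i in range(1, 31)]
ys = [2.0 * x + rng.gauss(0.0, 0.1) for x in xs]

param_grid = [0.0, 0.1, 1.0, 10.0]
scores = {reg: cross_validate(xs, ys, reg) for reg in param_grid}
best_reg = min(scores, key=scores.get)
best_model = fit_ridge_1d(xs, ys, best_reg)  # refit on all data with the winner
```

Each entry of `scores` is computed independently of the others, which is why model selection is a natural first candidate for parallelization.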

Cross validation
[Diagram] Cross Validation: Feature Extraction → Linear Regression #1, #2, #3, ... → Best Linear Regression

Distribute cross validation
[Diagram] Cross Validation: Feature Extraction → Linear Regression #1, #2, #3, ... → Best Linear Regression
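The idea of fitting the candidate models concurrently can be sketched locally with a thread pool standing in for a cluster. This is only an analogy under stated assumptions: the toy data, the single hold-out split (used instead of full k-fold for brevity), and the `holdout_error` helper are hypothetical and are not the mechanism used in the demo.

```python
from concurrent.futures import ThreadPoolExecutor

# Toy data (assumption): y = 3x exactly, so the unregularized fit should win.
XS = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
YS = [3.0 * x for x in XS]

def holdout_error(reg):
    """Fit 1-D ridge on even-indexed points and score it on odd-indexed points."""
    pairs = list(zip(XS, YS))
    train = [p for i, p in enumerate(pairs) if i % 2 == 0]
    test = [p for i, p in enumerate(pairs) if i % 2 == 1]
    w = sum(x * y for x, y in train) / (sum(x * x for x, _ in train) + reg)
    return sum((w * x - y) ** 2 for x, y in test) / len(test)

param_grid = [0.0, 0.1, 1.0]

# Each candidate is independent of the others, so the fits can run concurrently.
with ThreadPoolExecutor(max_workers=3) as pool:
    errors = list(pool.map(holdout_error, param_grid))

best_reg = param_grid[errors.index(min(errors))]
```

The training data here is small enough to be visible to every worker, which matches the "small data, many models" stage of the migration; only the list of parameter settings is split up.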

Distribute feature extraction
[Diagram] Cross Validation: Feature Extraction #1, #2, #3, ... → Linear Regression #1, #2, #3 → Best Linear Regression
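Feature extraction parallelizes over the data rather than over the models: each document is featurized on its own, so the dataset can be split into partitions that are processed separately. A minimal local sketch follows, with a thread pool standing in for cluster workers; the partitioning helper, the whitespace tokenization, and the vector length of 8 are assumptions for illustration.

```python
from concurrent.futures import ThreadPoolExecutor
import zlib

NUM_FEATURES = 8  # assumed vector length for the example

def featurize(doc):
    """Hashing term-frequency vector for one document (whitespace tokens)."""
    vec = [0.0] * NUM_FEATURES
    for tok in doc.lower().split():
        vec[zlib.crc32(tok.encode("utf-8")) % NUM_FEATURES] += 1.0
    return vec

def featurize_partition(docs):
    """Featurize every document in one partition."""
    return [featurize(d) for d in docs]

def partition(items, num_parts):
    """Split items into at most num_parts contiguous chunks."""
    size = max(1, (len(items) + num_parts - 1) // num_parts)
    return [items[i:i + size] for i in range(0, len(items), size)]

docs = [
    "this scarf is very strange",
    "great product would buy again",
    "not what I expected",
    "arrived late but works",
    "love it",
]

parts = partition(docs, 2)
with ThreadPoolExecutor(max_workers=2) as pool:
    # pool.map preserves partition order, so rows stay aligned with docs
    features = [row for rows in pool.map(featurize_partition, parts) for row in rows]
```

No state is shared between partitions, so the result is the same as featurizing the documents one by one in order.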

Distribute learning
[Diagram] Cross Validation: Feature Extraction #1, #2, #3 → Linear Regression #1, #2, ... → Best Linear Regression
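Once a single model no longer trains fast enough on one machine, the learning algorithm itself has to be split across the data. One common pattern is data-parallel gradient descent: each partition computes a partial gradient and a driver sums them before updating the weights. The sketch below illustrates that pattern only; it is not a description of MLlib's internals, and the toy data, step size, and function names are assumptions.

```python
from concurrent.futures import ThreadPoolExecutor
from functools import partial

def partial_gradient(w, part):
    """Sum of squared-error gradients for y ~ w * x over one partition."""
    return sum(2.0 * (w * x - y) * x for x, y in part)

def train(partitions, steps=200, lr=0.01):
    """Gradient descent where each step sums per-partition gradients."""
    n = sum(len(p) for p in partitions)
    w = 0.0
    with ThreadPoolExecutor(max_workers=len(partitions)) as pool:
        for _ in range(steps):
            # Workers see only their own partition plus the current weight.
            grads = pool.map(partial(partial_gradient, w), partitions)
            w -= lr * sum(grads) / n
    return w

# Toy data (assumption): y = 2x, split across two partitions.
partitions = [
    [(1.0, 2.0), (2.0, 4.0)],
    [(3.0, 6.0), (4.0, 8.0)],
]
w = train(partitions)
```

Only the weight and the partial gradients travel between the driver and the workers; the data stays where it is, which is the property that lets this approach scale with dataset size.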

Improvements we observed
1) Faster model selection for small data
2) Faster training for large data
3) Better predictions (R^2) with more data
Also, in practice:
• More folds of Cross Validation
• Tune more parameters
• Increase model size as dataset size increases
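For reference, R^2 (the coefficient of determination) compares a model's squared error with that of always predicting the mean rating. A minimal sketch, with made-up ratings used purely for illustration:

```python
def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot.

    Undefined (division by zero) when all values in y_true are identical.
    """
    mean_y = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_y) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot

ratings = [1.0, 2.0, 3.0, 4.0, 5.0]       # hypothetical true ratings
perfect = r_squared(ratings, ratings)      # predictions match exactly
baseline = r_squared(ratings, [3.0] * 5)   # always predict the mean
```

A perfect model scores 1.0 and the predict-the-mean baseline scores 0.0, so higher is better and values can go negative for models worse than the baseline.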

Integrations
• Distributed data sources
• Conversions between pandas & Spark
• Conversions between scipy & MLlib types
• Distributed model selection
• Distributed feature extraction
• Distributed learning
• Conversions between scikit-learn & MLlib models

Integrations with R
DataFrames
• Conversions between R (local) & Spark (distributed)
• SQL queries from R
head(filter(df, df$waiting < 50))
##   eruptions waiting
## 1     1.750      47
## 2     1.750      47
## 3     1.867      48
R-like MLlib API for generalized linear models
model <- glm(Sepal_Length ~ Sepal_Width + Species,
             data = df, family = "gaussian")

Repeating this at home
This demo used:
• Spark 1.5
• The pdspark Spark Package (to be released soon!)
The code will be posted online.
Also see the sparkit-learn package.
Try it on Databricks with a free trial @ databricks.com

What’s next?
Further work on integrations
• Python: Support more models & data types
• R: Expand GLM formula (feature interactions) & other models
Match features & behavior
Get involved!
• Contribute to Spark & Spark packages
• Provide feedback

Thank you!
spark.apache.org
spark-packages.org
databricks.com

About me
• Apache Spark committer
• Software Engineer @ Databricks
• Ph.D. in Machine Learning @ Carnegie Mellon University
