Spark DataFrames
and ML Pipelines
Joseph K. Bradley
May 1, 2015
MLconf Seattle
Who am I?
Joseph K. Bradley
Ph.D. in ML from CMU, postdoc at Berkeley
Apache Spark committer
Software Engineer @ Databricks Inc.
Databricks Inc.
Founded by the creators of Spark
& driving its development
Databricks Cloud: the best place to run Spark
Guess what…we’re hiring!
databricks.com/company/careers
Spark
Distributed computing engine
•  Built for speed, ease of use, and sophisticated analytics
•  Apache open source
SparkSQL | Streaming | MLlib | GraphX
Concise APIs in Python, Java, Scala … and R in Spark 1.4!
500+ enterprises using or planning to use Spark in production (blog)
Beyond Hadoop
Early adopters
(Data) Engineers
MapReduce & functional API
Data Scientists & Statisticians
Spark for Data Science
DataFrames
Intuitive manipulation of distributed structured data
Machine Learning Pipelines
Simple construction and tuning of ML workflows
Google Trends for “dataframe”
DataFrames
dept   age   name
Bio    48    H Smith
CS     54    A Turing
Bio    43    B Jones
Chem   61    M Kennedy
Data grouped into named columns
RDD API
DataFrame API
DSL for common tasks
•  Project, filter, aggregate, join, …
•  Metadata
•  UDFs
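The DSL operations listed above can be pictured with a small plain-Python stand-in. This is only a sketch under assumptions: the helper names `select`, `where`, and `avg_by` are hypothetical and work on a local list of dicts, whereas Spark's DataFrame API offers comparable distributed operations such as `select`, `filter`, and `groupBy`. The rows are the example table from the DataFrames slide.

```python
from collections import defaultdict

# Example rows from the slide's table (dept, age, name).
rows = [
    {"dept": "Bio", "age": 48, "name": "H Smith"},
    {"dept": "CS", "age": 54, "name": "A Turing"},
    {"dept": "Bio", "age": 43, "name": "B Jones"},
    {"dept": "Chem", "age": 61, "name": "M Kennedy"},
]


def select(table, *columns):
    """Project: keep only the named columns (hypothetical helper)."""
    return [{c: row[c] for c in columns} for row in table]


def where(table, predicate):
    """Filter: keep rows for which the predicate is true (hypothetical helper)."""
    return [row for row in table if predicate(row)]


def avg_by(table, key, value):
    """Aggregate: average of `value` within each group of `key` (hypothetical helper)."""
    totals = defaultdict(lambda: [0, 0])  # group key -> [sum, count]
    for row in table:
        totals[row[key]][0] += row[value]
        totals[row[key]][1] += 1
    return {k: s / n for k, (s, n) in totals.items()}


over_50 = select(where(rows, lambda r: r["age"] > 50), "name")
avg_age = avg_by(rows, "dept", "age")
print(over_50)
print(avg_age)
```

Naming the columns is what makes these one-line operations possible; an unstructured collection of tuples would need positional indexing at every step.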
  
Spark DataFrames
API inspired by R and Python Pandas
•  Python, Scala, Java (+ R in dev)
•  Pandas integration
Distributed DataFrame
Highly optimized
Spark DataFrames are fast
[Figure: bar chart, runtime of aggregating 10 million int pairs (secs), x-axis 0 to 10, with bars for RDD Scala, RDD Python, Spark Scala DF, and Spark Python DF; an arrow marks the “better” direction]
Uses SparkSQL Catalyst optimizer
Demo: DataFrames
Spark for Data Science
DataFrames
•  Structured data
•  Familiar API based on R & Python Pandas
•  Distributed, optimized implementation
Machine Learning Pipelines
Simple construction and tuning of ML workflows
About Spark MLlib
Started @ Berkeley
•  Spark 0.8
Now (Spark 1.3)
•  Contributions from 50+ orgs, 100+ individuals
•  Growing coverage of distributed algorithms
Spark: SparkSQL | Streaming | MLlib | GraphX
About Spark MLlib
Classification
•  Logistic regression
•  Naive Bayes
•  Streaming logistic regression
•  Linear SVMs
•  Decision trees
•  Random forests
•  Gradient-boosted trees
Regression
•  Ordinary least squares
•  Ridge regression
•  Lasso
•  Isotonic regression
•  Decision trees
•  Random forests
•  Gradient-boosted trees
•  Streaming linear methods
Statistics
•  Pearson correlation
•  Spearman correlation
•  Online summarization
•  Chi-squared test
•  Kernel density estimation
Linear algebra
•  Local dense & sparse vectors & matrices
•  Distributed matrices
•  Block-partitioned matrix
•  Row matrix
•  Indexed row matrix
•  Coordinate matrix
•  Matrix decompositions
Frequent itemsets
•  FP-growth
Model import/export
Clustering
•  Gaussian mixture models
•  K-Means
•  Streaming K-Means
•  Latent Dirichlet Allocation
•  Power Iteration Clustering
Recommendation
•  Alternating Least Squares
Feature extraction & selection
•  Word2Vec
•  Chi-Squared selection
•  Hashing term frequency
•  Inverse document frequency
•  Normalizer
•  Standard scaler
•  Tokenizer
ML Workflows are complex
Image classification pipeline*
* Evan Sparks. “ML Pipelines.” amplab.cs.berkeley.edu/ml-pipelines
→ Specify pipeline
→ Inspect & debug
→ Re-run on new data
→ Tune parameters
Example: Text Classification
Goal: Given a text document, predict its topic.
Subject: Re: Lexan Polish?
Suggest McQuires #1 plastic
polish. It will help somewhat
but nothing will remove deep
scratches without making it
worse than it already is.
McQuires will do something...
Features: the document text
Label: 1 (about science) or 0 (not about science)
Dataset: “20 Newsgroups” from UCI KDD Archive
ML Workflow
Load data → Extract features → Train model → Evaluate
Load Data
Data sources for DataFrames: built-in and external (JSON, JDBC, and more …)
Current data schema
label: Int
text: String
Extract Features
Load data → Tokenizer → Hashed Term Freq. → Train model → Evaluate
Transformer: Tokenizer, Hashed Term Freq.
Current data schema
label: Int
text: String
words: Seq[String]
features: Vector
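A Transformer, as used on this slide, takes a table and returns it with an extra column. The following is a minimal plain-Python sketch of the two feature stages, not Spark's implementation: the class names `SketchTokenizer` and `SketchHashingTF` are hypothetical, rows are local dicts, the feature vector is a sparse dict of index to count, and `zlib.crc32` is an assumed stand-in for whatever hash function a real implementation uses.

```python
import zlib


class SketchTokenizer:
    """Transformer sketch: adds a `words` column derived from the `text` column."""

    def transform(self, rows):
        return [{**row, "words": row["text"].lower().split()} for row in rows]


class SketchHashingTF:
    """Transformer sketch: adds a sparse `features` column of hashed term counts."""

    def __init__(self, num_features=1000):
        self.num_features = num_features

    def transform(self, rows):
        out = []
        for row in rows:
            counts = {}
            for word in row["words"]:
                # crc32 is stable across runs, unlike Python's built-in hash() for strings.
                index = zlib.crc32(word.encode("utf-8")) % self.num_features
                counts[index] = counts.get(index, 0) + 1
            out.append({**row, "features": counts})
        return out


data = [{"label": 1, "text": "Lexan polish polish"}]
tokenized = SketchTokenizer().transform(data)
featurized = SketchHashingTF(num_features=1000).transform(tokenized)
print(featurized[0])
```

Each stage copies the incoming row and adds one key, so the schema grows from `label, text` to `label, text, words, features` while the input stays untouched.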
Train a Model
Load data → Tokenizer → Hashed Term Freq. → Logistic Regression → Evaluate
Transformer: Tokenizer, Hashed Term Freq.
Estimator: Logistic Regression
Current data schema
label: Int
text: String
words: Seq[String]
features: Vector
prediction: Int
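The Estimator role can be sketched in the same spirit: fitting an Estimator on a table yields a model, and that model is itself a Transformer that appends the `prediction` column. The code below is an illustrative plain-Python sketch, not MLlib's algorithm. The class names, the `step_size` and `num_iterations` parameters, and the use of dense list features with unregularized batch gradient descent are all assumptions made for brevity.

```python
import math


class SketchLogisticModel:
    """Fitted model: a Transformer that appends a `prediction` column."""

    def __init__(self, weights, bias):
        self.weights = weights
        self.bias = bias

    def transform(self, rows):
        out = []
        for row in rows:
            score = self.bias + sum(w * x for w, x in zip(self.weights, row["features"]))
            out.append({**row, "prediction": 1 if score > 0 else 0})
        return out


class SketchLogisticRegression:
    """Estimator: fit() learns from labeled rows and returns a SketchLogisticModel."""

    def __init__(self, step_size=0.5, num_iterations=500):
        self.step_size = step_size
        self.num_iterations = num_iterations

    def fit(self, rows):
        dim = len(rows[0]["features"])
        weights, bias = [0.0] * dim, 0.0
        n = len(rows)
        for _ in range(self.num_iterations):
            grad_w, grad_b = [0.0] * dim, 0.0
            for row in rows:
                x, y = row["features"], row["label"]
                z = bias + sum(w * xi for w, xi in zip(weights, x))
                error = 1.0 / (1.0 + math.exp(-z)) - y  # sigmoid(z) - label
                grad_b += error
                for j, xj in enumerate(x):
                    grad_w[j] += error * xj
            # Batch gradient descent step on the average log-loss gradient.
            bias -= self.step_size * grad_b / n
            weights = [w - self.step_size * g / n for w, g in zip(weights, grad_w)]
        return SketchLogisticModel(weights, bias)


train = [
    {"label": 0, "features": [0.0]},
    {"label": 0, "features": [1.0]},
    {"label": 1, "features": [2.0]},
    {"label": 1, "features": [3.0]},
]
model = SketchLogisticRegression().fit(train)
scored = model.transform([{"features": [0.0]}, {"features": [3.0]}])
print([row["prediction"] for row in scored])
```

Separating `fit` from `transform` is what lets the same fitted model be applied later to data it has never seen.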
Evaluate the Model
Load data → Tokenizer → Hashed Term Freq. → Logistic Regression → Evaluate
Transformer: Tokenizer, Hashed Term Freq.
Estimator: Logistic Regression
Evaluator: Evaluate
By default, always append new columns
→ Can go back & inspect intermediate results
→ Made efficient by DataFrame optimizations
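Because every stage appends columns rather than replacing them, an Evaluator only needs the `label` and `prediction` columns of the final table, and the earlier columns remain available for debugging. A minimal sketch, assuming rows are plain dicts and using accuracy as the metric (the function name `accuracy` and the sample rows are hypothetical):

```python
def accuracy(rows, label_col="label", prediction_col="prediction"):
    """Evaluator sketch: fraction of rows whose prediction matches the label."""
    if not rows:
        raise ValueError("cannot evaluate an empty dataset")
    correct = sum(1 for row in rows if row[label_col] == row[prediction_col])
    return correct / len(rows)


# Hypothetical output of a fitted pipeline: original and intermediate columns are kept.
scored = [
    {"label": 1, "text": "orbit of the moon", "words": ["orbit", "of", "the", "moon"], "prediction": 1},
    {"label": 0, "text": "car for sale", "words": ["car", "for", "sale"], "prediction": 0},
    {"label": 1, "text": "plastic polish", "words": ["plastic", "polish"], "prediction": 0},
    {"label": 0, "text": "used bike", "words": ["used", "bike"], "prediction": 0},
]

score = accuracy(scored)
# Inspect intermediate results for the rows the model got wrong.
misclassified = [row["text"] for row in scored if row["label"] != row["prediction"]]
print(score, misclassified)
```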
  
ML Pipelines
Pipeline
Load data → Tokenizer → Hashed Term Freq. → Logistic Regression → Evaluate
Test data → Tokenizer → Hashed Term Freq. → Logistic Regression → Evaluate
Re-run exactly the same way
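The idea of re-running test data through exactly the same stages can be sketched as a pipeline whose `fit` returns a fitted pipeline. This is a simplified plain-Python illustration, not Spark's `Pipeline` class: all class names here are hypothetical, and `SketchKeywordClassifier` is a toy estimator standing in for logistic regression so the example stays short.

```python
class SketchTokenizer:
    """Transformer: adds a `words` column."""

    def transform(self, rows):
        return [{**row, "words": row["text"].lower().split()} for row in rows]


class SketchKeywordModel:
    """Fitted model (a Transformer): predicts 1 if any learned keyword occurs."""

    def __init__(self, keywords):
        self.keywords = keywords

    def transform(self, rows):
        return [
            {**row, "prediction": int(any(w in self.keywords for w in row["words"]))}
            for row in rows
        ]


class SketchKeywordClassifier:
    """Toy Estimator: learns the words that occur only in label-1 documents."""

    def fit(self, rows):
        positive = {w for row in rows if row["label"] == 1 for w in row["words"]}
        negative = {w for row in rows if row["label"] == 0 for w in row["words"]}
        return SketchKeywordModel(positive - negative)


class SketchPipelineModel:
    """Fitted pipeline: applies every fitted stage in order."""

    def __init__(self, stages):
        self.stages = stages

    def transform(self, rows):
        for stage in self.stages:
            rows = stage.transform(rows)
        return rows


class SketchPipeline:
    """Chains stages; estimators are fitted on the output of the stages before them."""

    def __init__(self, stages):
        self.stages = stages

    def fit(self, rows):
        fitted = []
        for stage in self.stages:
            if hasattr(stage, "fit"):  # Estimator: replace it with its fitted model
                stage = stage.fit(rows)
            fitted.append(stage)
            rows = stage.transform(rows)
        return SketchPipelineModel(fitted)


train = [
    {"label": 1, "text": "orbit of the moon"},
    {"label": 0, "text": "price of the car"},
]
test = [{"text": "Moon landing"}, {"text": "car for sale"}]

model = SketchPipeline([SketchTokenizer(), SketchKeywordClassifier()]).fit(train)
test_scored = model.transform(test)
print([row["prediction"] for row in test_scored])
```

The test rows never pass through hand-written feature code: the fitted pipeline carries the tokenization and the learned model together, which is what keeps training and test processing identical.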
  
Parameter Tuning
Tokenizer → Hashed Term Freq. → Logistic Regression → Evaluate
hashingTF.numFeatures: {100, 1000, 10000}
lr.regParam: {0.01, 0.1, 0.5}
CrossValidator
Given:
•  Estimator
•  Parameter grid
•  Evaluator
Find best parameters
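The search that a cross-validator performs can be sketched as: expand the parameter grid, score each combination on k train/validation splits, and keep the combination with the best average score. The code below is a plain-Python sketch, not Spark's `CrossValidator`. The helper names are hypothetical, and `toy_score` is a made-up scoring function that trains nothing; in practice it would fit the Estimator on the training fold and apply the Evaluator to the validation fold.

```python
from itertools import product


def param_grid(grid):
    """Expand {name: [values]} into a list of {name: value} combinations."""
    names = sorted(grid)
    return [dict(zip(names, values)) for values in product(*(grid[n] for n in names))]


def k_folds(n_rows, k):
    """Yield (train_indices, validation_indices) for k contiguous folds."""
    for i in range(k):
        start, stop = i * n_rows // k, (i + 1) * n_rows // k
        valid = list(range(start, stop))
        train = [j for j in range(n_rows) if j < start or j >= stop]
        yield train, valid


def cross_validate(fit_and_score, grid, n_rows, k=3):
    """Return (best_params, best_average_score); higher scores are better."""
    best_params, best_score = None, float("-inf")
    for params in param_grid(grid):
        scores = [fit_and_score(params, tr, va) for tr, va in k_folds(n_rows, k)]
        average = sum(scores) / len(scores)
        if average > best_score:
            best_params, best_score = params, average
    return best_params, best_score


def toy_score(params, train_idx, valid_idx):
    """Hypothetical stand-in for 'fit on train_idx, evaluate on valid_idx'."""
    return (-abs(params["lr.regParam"] - 0.1)
            - abs(params["hashingTF.numFeatures"] - 1000) / 10000.0)


grid = {
    "lr.regParam": [0.01, 0.1, 0.5],
    "hashingTF.numFeatures": [100, 1000, 10000],
}
combinations = param_grid(grid)
best_params, best_score = cross_validate(toy_score, grid, n_rows=6, k=3)
print(len(combinations), best_params)
```

With the two three-value sets from the slide, the grid has nine combinations, so a 3-fold search performs 27 fit-and-evaluate runs.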
Demo: ML Pipelines
Recap
DataFrames
•  Structured data
•  Familiar API based on R & Python Pandas
•  Distributed, optimized implementation
Machine Learning Pipelines
•  Integration with DataFrames
•  Familiar API based on scikit-learn
•  Simple parameter tuning
Composable & DAG Pipelines
Schema validation
User-defined Transformers & Estimators
Looking Ahead
Collaborations with UC Berkeley & others
•  Auto-tuning models
DataFrames
•  Further optimization
•  API for R
ML Pipelines
•  More algorithms & pluggability
•  API for R
Thank you!
Spark documentation: spark.apache.org
Pipelines blog post: databricks.com/blog/2015/01/07
DataFrames blog post: databricks.com/blog/2015/02/17
Databricks Cloud Platform: databricks.com/product
Spark MOOCs on edX: Intro to Spark & ML with Spark
Spark Packages: spark-packages.org

Más contenido relacionado

La actualidad más candente

The Rise of ZStandard: Apache Spark/Parquet/ORC/Avro
The Rise of ZStandard: Apache Spark/Parquet/ORC/AvroThe Rise of ZStandard: Apache Spark/Parquet/ORC/Avro
The Rise of ZStandard: Apache Spark/Parquet/ORC/Avro
Databricks
 
What Is Apache Spark? | Introduction To Apache Spark | Apache Spark Tutorial ...
What Is Apache Spark? | Introduction To Apache Spark | Apache Spark Tutorial ...What Is Apache Spark? | Introduction To Apache Spark | Apache Spark Tutorial ...
What Is Apache Spark? | Introduction To Apache Spark | Apache Spark Tutorial ...
Simplilearn
 

La actualidad más candente (20)

Programming in Spark using PySpark
Programming in Spark using PySpark      Programming in Spark using PySpark
Programming in Spark using PySpark
 
Machine learning pipeline with spark ml
Machine learning pipeline with spark mlMachine learning pipeline with spark ml
Machine learning pipeline with spark ml
 
Apache Spark 101
Apache Spark 101Apache Spark 101
Apache Spark 101
 
The Rise of ZStandard: Apache Spark/Parquet/ORC/Avro
The Rise of ZStandard: Apache Spark/Parquet/ORC/AvroThe Rise of ZStandard: Apache Spark/Parquet/ORC/Avro
The Rise of ZStandard: Apache Spark/Parquet/ORC/Avro
 
Deep Dive into Spark SQL with Advanced Performance Tuning with Xiao Li & Wenc...
Deep Dive into Spark SQL with Advanced Performance Tuning with Xiao Li & Wenc...Deep Dive into Spark SQL with Advanced Performance Tuning with Xiao Li & Wenc...
Deep Dive into Spark SQL with Advanced Performance Tuning with Xiao Li & Wenc...
 
Deep Dive: Memory Management in Apache Spark
Deep Dive: Memory Management in Apache SparkDeep Dive: Memory Management in Apache Spark
Deep Dive: Memory Management in Apache Spark
 
Processing Large Data with Apache Spark -- HasGeek
Processing Large Data with Apache Spark -- HasGeekProcessing Large Data with Apache Spark -- HasGeek
Processing Large Data with Apache Spark -- HasGeek
 
Re-imagine Data Monitoring with whylogs and Spark
Re-imagine Data Monitoring with whylogs and SparkRe-imagine Data Monitoring with whylogs and Spark
Re-imagine Data Monitoring with whylogs and Spark
 
Apache Spark Introduction
Apache Spark IntroductionApache Spark Introduction
Apache Spark Introduction
 
Introduction to Spark with Python
Introduction to Spark with PythonIntroduction to Spark with Python
Introduction to Spark with Python
 
Spark MLlib - Training Material
Spark MLlib - Training Material Spark MLlib - Training Material
Spark MLlib - Training Material
 
Reproducible AI using MLflow and PyTorch
Reproducible AI using MLflow and PyTorchReproducible AI using MLflow and PyTorch
Reproducible AI using MLflow and PyTorch
 
A Tale of Three Apache Spark APIs: RDDs, DataFrames, and Datasets with Jules ...
A Tale of Three Apache Spark APIs: RDDs, DataFrames, and Datasets with Jules ...A Tale of Three Apache Spark APIs: RDDs, DataFrames, and Datasets with Jules ...
A Tale of Three Apache Spark APIs: RDDs, DataFrames, and Datasets with Jules ...
 
The Parquet Format and Performance Optimization Opportunities
The Parquet Format and Performance Optimization OpportunitiesThe Parquet Format and Performance Optimization Opportunities
The Parquet Format and Performance Optimization Opportunities
 
Web-Scale Graph Analytics with Apache Spark with Tim Hunter
Web-Scale Graph Analytics with Apache Spark with Tim HunterWeb-Scale Graph Analytics with Apache Spark with Tim Hunter
Web-Scale Graph Analytics with Apache Spark with Tim Hunter
 
Introduction to Apache Spark
Introduction to Apache SparkIntroduction to Apache Spark
Introduction to Apache Spark
 
Introduction to Apache Spark
Introduction to Apache SparkIntroduction to Apache Spark
Introduction to Apache Spark
 
Spark Shuffle Deep Dive (Explained In Depth) - How Shuffle Works in Spark
Spark Shuffle Deep Dive (Explained In Depth) - How Shuffle Works in SparkSpark Shuffle Deep Dive (Explained In Depth) - How Shuffle Works in Spark
Spark Shuffle Deep Dive (Explained In Depth) - How Shuffle Works in Spark
 
What Is Apache Spark? | Introduction To Apache Spark | Apache Spark Tutorial ...
What Is Apache Spark? | Introduction To Apache Spark | Apache Spark Tutorial ...What Is Apache Spark? | Introduction To Apache Spark | Apache Spark Tutorial ...
What Is Apache Spark? | Introduction To Apache Spark | Apache Spark Tutorial ...
 
Observability for Data Pipelines With OpenLineage
Observability for Data Pipelines With OpenLineageObservability for Data Pipelines With OpenLineage
Observability for Data Pipelines With OpenLineage
 

Destacado

End-to-end Data Pipeline with Apache Spark
End-to-end Data Pipeline with Apache SparkEnd-to-end Data Pipeline with Apache Spark
End-to-end Data Pipeline with Apache Spark
Databricks
 
Simplifying Big Data Analytics with Apache Spark
Simplifying Big Data Analytics with Apache SparkSimplifying Big Data Analytics with Apache Spark
Simplifying Big Data Analytics with Apache Spark
Databricks
 
Apache Spark MLlib 2.0 Preview: Data Science and Production
Apache Spark MLlib 2.0 Preview: Data Science and ProductionApache Spark MLlib 2.0 Preview: Data Science and Production
Apache Spark MLlib 2.0 Preview: Data Science and Production
Databricks
 

Destacado (20)

Practical Machine Learning Pipelines with MLlib
Practical Machine Learning Pipelines with MLlibPractical Machine Learning Pipelines with MLlib
Practical Machine Learning Pipelines with MLlib
 
Apache Spark Model Deployment
Apache Spark Model Deployment Apache Spark Model Deployment
Apache Spark Model Deployment
 
Distributed ML in Apache Spark
Distributed ML in Apache SparkDistributed ML in Apache Spark
Distributed ML in Apache Spark
 
Building Large Scale Machine Learning Applications with Pipelines-(Evan Spark...
Building Large Scale Machine Learning Applications with Pipelines-(Evan Spark...Building Large Scale Machine Learning Applications with Pipelines-(Evan Spark...
Building Large Scale Machine Learning Applications with Pipelines-(Evan Spark...
 
MLlib: Spark's Machine Learning Library
MLlib: Spark's Machine Learning LibraryMLlib: Spark's Machine Learning Library
MLlib: Spark's Machine Learning Library
 
MLlib and Machine Learning on Spark
MLlib and Machine Learning on SparkMLlib and Machine Learning on Spark
MLlib and Machine Learning on Spark
 
A full Machine learning pipeline in Scikit-learn vs in scala-Spark: pros and ...
A full Machine learning pipeline in Scikit-learn vs in scala-Spark: pros and ...A full Machine learning pipeline in Scikit-learn vs in scala-Spark: pros and ...
A full Machine learning pipeline in Scikit-learn vs in scala-Spark: pros and ...
 
End-to-end Data Pipeline with Apache Spark
End-to-end Data Pipeline with Apache SparkEnd-to-end Data Pipeline with Apache Spark
End-to-end Data Pipeline with Apache Spark
 
Introduction to Machine Learning with Spark
Introduction to Machine Learning with SparkIntroduction to Machine Learning with Spark
Introduction to Machine Learning with Spark
 
Parallelizing Existing R Packages with SparkR
Parallelizing Existing R Packages with SparkRParallelizing Existing R Packages with SparkR
Parallelizing Existing R Packages with SparkR
 
Apache® Spark™ MLlib 2.x: migrating ML workloads to DataFrames
Apache® Spark™ MLlib 2.x: migrating ML workloads to DataFramesApache® Spark™ MLlib 2.x: migrating ML workloads to DataFrames
Apache® Spark™ MLlib 2.x: migrating ML workloads to DataFrames
 
Apache ® Spark™ MLlib 2.x: How to Productionize your Machine Learning Models
Apache ® Spark™ MLlib 2.x: How to Productionize your Machine Learning ModelsApache ® Spark™ MLlib 2.x: How to Productionize your Machine Learning Models
Apache ® Spark™ MLlib 2.x: How to Productionize your Machine Learning Models
 
Combining Machine Learning Frameworks with Apache Spark
Combining Machine Learning Frameworks with Apache SparkCombining Machine Learning Frameworks with Apache Spark
Combining Machine Learning Frameworks with Apache Spark
 
Machine Learning Pipelines
Machine Learning PipelinesMachine Learning Pipelines
Machine Learning Pipelines
 
How to Apply Machine Learning with R, H20, Apache Spark MLlib or PMML to Real...
How to Apply Machine Learning with R, H20, Apache Spark MLlib or PMML to Real...How to Apply Machine Learning with R, H20, Apache Spark MLlib or PMML to Real...
How to Apply Machine Learning with R, H20, Apache Spark MLlib or PMML to Real...
 
Simplifying Big Data Analytics with Apache Spark
Simplifying Big Data Analytics with Apache SparkSimplifying Big Data Analytics with Apache Spark
Simplifying Big Data Analytics with Apache Spark
 
Machine Learning With Spark
Machine Learning With SparkMachine Learning With Spark
Machine Learning With Spark
 
Apache Spark MLlib 2.0 Preview: Data Science and Production
Apache Spark MLlib 2.0 Preview: Data Science and ProductionApache Spark MLlib 2.0 Preview: Data Science and Production
Apache Spark MLlib 2.0 Preview: Data Science and Production
 
Recommender Systems with Apache Spark's ALS Function
Recommender Systems with Apache Spark's ALS FunctionRecommender Systems with Apache Spark's ALS Function
Recommender Systems with Apache Spark's ALS Function
 
Exceptions are the Norm: Dealing with Bad Actors in ETL
Exceptions are the Norm: Dealing with Bad Actors in ETLExceptions are the Norm: Dealing with Bad Actors in ETL
Exceptions are the Norm: Dealing with Bad Actors in ETL
 

Similar a Spark DataFrames and ML Pipelines

Similar a Spark DataFrames and ML Pipelines (20)

Building, Debugging, and Tuning Spark Machine Leaning Pipelines-(Joseph Bradl...
Building, Debugging, and Tuning Spark Machine Leaning Pipelines-(Joseph Bradl...Building, Debugging, and Tuning Spark Machine Leaning Pipelines-(Joseph Bradl...
Building, Debugging, and Tuning Spark Machine Leaning Pipelines-(Joseph Bradl...
 
Joseph Bradley, Software Engineer, Databricks Inc. at MLconf SEA - 5/01/15
Joseph Bradley, Software Engineer, Databricks Inc. at MLconf SEA - 5/01/15Joseph Bradley, Software Engineer, Databricks Inc. at MLconf SEA - 5/01/15
Joseph Bradley, Software Engineer, Databricks Inc. at MLconf SEA - 5/01/15
 
Practical Distributed Machine Learning Pipelines on Hadoop
Practical Distributed Machine Learning Pipelines on HadoopPractical Distributed Machine Learning Pipelines on Hadoop
Practical Distributed Machine Learning Pipelines on Hadoop
 
Fighting Fraud with Apache Spark
Fighting Fraud with Apache SparkFighting Fraud with Apache Spark
Fighting Fraud with Apache Spark
 
Koalas: Unifying Spark and pandas APIs
Koalas: Unifying Spark and pandas APIsKoalas: Unifying Spark and pandas APIs
Koalas: Unifying Spark and pandas APIs
 
Combining Machine Learning frameworks with Apache Spark
Combining Machine Learning frameworks with Apache SparkCombining Machine Learning frameworks with Apache Spark
Combining Machine Learning frameworks with Apache Spark
 
The Developer Data Scientist – Creating New Analytics Driven Applications usi...
The Developer Data Scientist – Creating New Analytics Driven Applications usi...The Developer Data Scientist – Creating New Analytics Driven Applications usi...
The Developer Data Scientist – Creating New Analytics Driven Applications usi...
 
Jump Start with Apache Spark 2.0 on Databricks
Jump Start with Apache Spark 2.0 on DatabricksJump Start with Apache Spark 2.0 on Databricks
Jump Start with Apache Spark 2.0 on Databricks
 
The Nitty Gritty of Advanced Analytics Using Apache Spark in Python
The Nitty Gritty of Advanced Analytics Using Apache Spark in PythonThe Nitty Gritty of Advanced Analytics Using Apache Spark in Python
The Nitty Gritty of Advanced Analytics Using Apache Spark in Python
 
Dev Ops Training
Dev Ops TrainingDev Ops Training
Dev Ops Training
 
ETL to ML: Use Apache Spark as an end to end tool for Advanced Analytics
ETL to ML: Use Apache Spark as an end to end tool for Advanced AnalyticsETL to ML: Use Apache Spark as an end to end tool for Advanced Analytics
ETL to ML: Use Apache Spark as an end to end tool for Advanced Analytics
 
An Insider’s Guide to Maximizing Spark SQL Performance
 An Insider’s Guide to Maximizing Spark SQL Performance An Insider’s Guide to Maximizing Spark SQL Performance
An Insider’s Guide to Maximizing Spark SQL Performance
 
Integrating Deep Learning Libraries with Apache Spark
Integrating Deep Learning Libraries with Apache SparkIntegrating Deep Learning Libraries with Apache Spark
Integrating Deep Learning Libraries with Apache Spark
 
Spark Saturday: Spark SQL & DataFrame Workshop with Apache Spark 2.3
Spark Saturday: Spark SQL & DataFrame Workshop with Apache Spark 2.3Spark Saturday: Spark SQL & DataFrame Workshop with Apache Spark 2.3
Spark Saturday: Spark SQL & DataFrame Workshop with Apache Spark 2.3
 
Jump Start with Apache Spark 2.0 on Databricks
Jump Start with Apache Spark 2.0 on DatabricksJump Start with Apache Spark 2.0 on Databricks
Jump Start with Apache Spark 2.0 on Databricks
 
What’s New in the Berkeley Data Analytics Stack
What’s New in the Berkeley Data Analytics StackWhat’s New in the Berkeley Data Analytics Stack
What’s New in the Berkeley Data Analytics Stack
 
Apache Calcite: A Foundational Framework for Optimized Query Processing Over ...
Apache Calcite: A Foundational Framework for Optimized Query Processing Over ...Apache Calcite: A Foundational Framework for Optimized Query Processing Over ...
Apache Calcite: A Foundational Framework for Optimized Query Processing Over ...
 
What's New in Upcoming Apache Spark 2.3
What's New in Upcoming Apache Spark 2.3What's New in Upcoming Apache Spark 2.3
What's New in Upcoming Apache Spark 2.3
 
Recent Developments In SparkR For Advanced Analytics
Recent Developments In SparkR For Advanced AnalyticsRecent Developments In SparkR For Advanced Analytics
Recent Developments In SparkR For Advanced Analytics
 
Building a modern Application with DataFrames
Building a modern Application with DataFramesBuilding a modern Application with DataFrames
Building a modern Application with DataFrames
 

Más de Databricks

Democratizing Data Quality Through a Centralized Platform
Democratizing Data Quality Through a Centralized PlatformDemocratizing Data Quality Through a Centralized Platform
Democratizing Data Quality Through a Centralized Platform
Databricks
 
Stage Level Scheduling Improving Big Data and AI Integration
Stage Level Scheduling Improving Big Data and AI IntegrationStage Level Scheduling Improving Big Data and AI Integration
Stage Level Scheduling Improving Big Data and AI Integration
Databricks
 
Simplify Data Conversion from Spark to TensorFlow and PyTorch
Simplify Data Conversion from Spark to TensorFlow and PyTorchSimplify Data Conversion from Spark to TensorFlow and PyTorch
Simplify Data Conversion from Spark to TensorFlow and PyTorch
Databricks
 
Raven: End-to-end Optimization of ML Prediction Queries
Raven: End-to-end Optimization of ML Prediction QueriesRaven: End-to-end Optimization of ML Prediction Queries
Raven: End-to-end Optimization of ML Prediction Queries
Databricks
 
Processing Large Datasets for ADAS Applications using Apache Spark
Processing Large Datasets for ADAS Applications using Apache SparkProcessing Large Datasets for ADAS Applications using Apache Spark
Processing Large Datasets for ADAS Applications using Apache Spark
Databricks
 

Más de Databricks (20)

DW Migration Webinar-March 2022.pptx
DW Migration Webinar-March 2022.pptxDW Migration Webinar-March 2022.pptx
DW Migration Webinar-March 2022.pptx
 
Data Lakehouse Symposium | Day 1 | Part 1
Data Lakehouse Symposium | Day 1 | Part 1Data Lakehouse Symposium | Day 1 | Part 1
Data Lakehouse Symposium | Day 1 | Part 1
 
Data Lakehouse Symposium | Day 1 | Part 2
Data Lakehouse Symposium | Day 1 | Part 2Data Lakehouse Symposium | Day 1 | Part 2
Data Lakehouse Symposium | Day 1 | Part 2
 
Data Lakehouse Symposium | Day 2
Data Lakehouse Symposium | Day 2Data Lakehouse Symposium | Day 2
Data Lakehouse Symposium | Day 2
 
Data Lakehouse Symposium | Day 4
Data Lakehouse Symposium | Day 4Data Lakehouse Symposium | Day 4
Data Lakehouse Symposium | Day 4
 
5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop
5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop
5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop
 
Democratizing Data Quality Through a Centralized Platform
Democratizing Data Quality Through a Centralized PlatformDemocratizing Data Quality Through a Centralized Platform
Democratizing Data Quality Through a Centralized Platform
 
Learn to Use Databricks for Data Science
Learn to Use Databricks for Data ScienceLearn to Use Databricks for Data Science
Learn to Use Databricks for Data Science
 
Why APM Is Not the Same As ML Monitoring
Why APM Is Not the Same As ML MonitoringWhy APM Is Not the Same As ML Monitoring
Why APM Is Not the Same As ML Monitoring
 
The Function, the Context, and the Data—Enabling ML Ops at Stitch Fix
The Function, the Context, and the Data—Enabling ML Ops at Stitch FixThe Function, the Context, and the Data—Enabling ML Ops at Stitch Fix
The Function, the Context, and the Data—Enabling ML Ops at Stitch Fix
 
Stage Level Scheduling Improving Big Data and AI Integration
Stage Level Scheduling Improving Big Data and AI IntegrationStage Level Scheduling Improving Big Data and AI Integration
Stage Level Scheduling Improving Big Data and AI Integration
 
Simplify Data Conversion from Spark to TensorFlow and PyTorch
Simplify Data Conversion from Spark to TensorFlow and PyTorchSimplify Data Conversion from Spark to TensorFlow and PyTorch
Simplify Data Conversion from Spark to TensorFlow and PyTorch
 
Scaling your Data Pipelines with Apache Spark on Kubernetes
Scaling your Data Pipelines with Apache Spark on KubernetesScaling your Data Pipelines with Apache Spark on Kubernetes
Scaling your Data Pipelines with Apache Spark on Kubernetes
 
Scaling and Unifying SciKit Learn and Apache Spark Pipelines
Scaling and Unifying SciKit Learn and Apache Spark PipelinesScaling and Unifying SciKit Learn and Apache Spark Pipelines
Scaling and Unifying SciKit Learn and Apache Spark Pipelines
 
Sawtooth Windows for Feature Aggregations
Sawtooth Windows for Feature AggregationsSawtooth Windows for Feature Aggregations
Sawtooth Windows for Feature Aggregations
 
Redis + Apache Spark = Swiss Army Knife Meets Kitchen Sink
Redis + Apache Spark = Swiss Army Knife Meets Kitchen SinkRedis + Apache Spark = Swiss Army Knife Meets Kitchen Sink
Redis + Apache Spark = Swiss Army Knife Meets Kitchen Sink
 
Raven: End-to-end Optimization of ML Prediction Queries
Raven: End-to-end Optimization of ML Prediction QueriesRaven: End-to-end Optimization of ML Prediction Queries
Raven: End-to-end Optimization of ML Prediction Queries
 
Processing Large Datasets for ADAS Applications using Apache Spark
Processing Large Datasets for ADAS Applications using Apache SparkProcessing Large Datasets for ADAS Applications using Apache Spark
Processing Large Datasets for ADAS Applications using Apache Spark
 
Massive Data Processing in Adobe Using Delta Lake
Massive Data Processing in Adobe Using Delta LakeMassive Data Processing in Adobe Using Delta Lake
Massive Data Processing in Adobe Using Delta Lake
 
Machine Learning CI/CD for Email Attack Detection
Machine Learning CI/CD for Email Attack DetectionMachine Learning CI/CD for Email Attack Detection
Machine Learning CI/CD for Email Attack Detection
 

Último

introduction-to-automotive Andoid os-csimmonds-ndctechtown-2021.pdf
introduction-to-automotive Andoid os-csimmonds-ndctechtown-2021.pdfintroduction-to-automotive Andoid os-csimmonds-ndctechtown-2021.pdf
introduction-to-automotive Andoid os-csimmonds-ndctechtown-2021.pdf
VishalKumarJha10
 
CHEAP Call Girls in Pushp Vihar (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICE
CHEAP Call Girls in Pushp Vihar (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICECHEAP Call Girls in Pushp Vihar (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICE
CHEAP Call Girls in Pushp Vihar (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICE
9953056974 Low Rate Call Girls In Saket, Delhi NCR
 
The title is not connected to what is inside
The title is not connected to what is insideThe title is not connected to what is inside
The title is not connected to what is inside
shinachiaurasa2
 

Último (20)

Payment Gateway Testing Simplified_ A Step-by-Step Guide for Beginners.pdf
Payment Gateway Testing Simplified_ A Step-by-Step Guide for Beginners.pdfPayment Gateway Testing Simplified_ A Step-by-Step Guide for Beginners.pdf
Payment Gateway Testing Simplified_ A Step-by-Step Guide for Beginners.pdf
 
8257 interfacing 2 in microprocessor for btech students
8257 interfacing 2 in microprocessor for btech students8257 interfacing 2 in microprocessor for btech students
8257 interfacing 2 in microprocessor for btech students
 
The Ultimate Test Automation Guide_ Best Practices and Tips.pdf
The Ultimate Test Automation Guide_ Best Practices and Tips.pdfThe Ultimate Test Automation Guide_ Best Practices and Tips.pdf
The Ultimate Test Automation Guide_ Best Practices and Tips.pdf
 
%in kaalfontein+277-882-255-28 abortion pills for sale in kaalfontein
%in kaalfontein+277-882-255-28 abortion pills for sale in kaalfontein%in kaalfontein+277-882-255-28 abortion pills for sale in kaalfontein
%in kaalfontein+277-882-255-28 abortion pills for sale in kaalfontein
 
AI & Machine Learning Presentation Template
AI & Machine Learning Presentation TemplateAI & Machine Learning Presentation Template
AI & Machine Learning Presentation Template
 
call girls in Vaishali (Ghaziabad) 🔝 >༒8448380779 🔝 genuine Escort Service 🔝✔️✔️
call girls in Vaishali (Ghaziabad) 🔝 >༒8448380779 🔝 genuine Escort Service 🔝✔️✔️call girls in Vaishali (Ghaziabad) 🔝 >༒8448380779 🔝 genuine Escort Service 🔝✔️✔️
call girls in Vaishali (Ghaziabad) 🔝 >༒8448380779 🔝 genuine Escort Service 🔝✔️✔️
 
introduction-to-automotive Andoid os-csimmonds-ndctechtown-2021.pdf
introduction-to-automotive Andoid os-csimmonds-ndctechtown-2021.pdfintroduction-to-automotive Andoid os-csimmonds-ndctechtown-2021.pdf
introduction-to-automotive Andoid os-csimmonds-ndctechtown-2021.pdf
 
MarTech Trend 2024 Book : Marketing Technology Trends (2024 Edition) How Data...
MarTech Trend 2024 Book : Marketing Technology Trends (2024 Edition) How Data...MarTech Trend 2024 Book : Marketing Technology Trends (2024 Edition) How Data...
MarTech Trend 2024 Book : Marketing Technology Trends (2024 Edition) How Data...
 

Spark DataFrames and ML Pipelines

  • 1. Spark DataFrames and ML Pipelines Joseph K. Bradley May 1, 2015 MLconf Seattle
  • 2. Who am I? Joseph K. Bradley. Ph.D. in ML from CMU, postdoc at Berkeley. Apache Spark committer. Software Engineer @ Databricks Inc.
  • 3. Databricks Inc. Founded by the creators of Spark & driving its development. Databricks Cloud: the best place to run Spark. Guess what…we’re hiring! databricks.com/company/careers
  • 4. Concise APIs in Python, Java, Scala … and R in Spark 1.4! 500+ enterprises using or planning to use Spark in production (blog). Spark (SparkSQL, Streaming, MLlib, GraphX). Distributed computing engine • Built for speed, ease of use, and sophisticated analytics • Apache open source
  • 5. Beyond Hadoop: Early adopters; (Data) Engineers; MapReduce & functional API; Data Scientists & Statisticians
  • 6. Spark for Data Science. DataFrames: Intuitive manipulation of distributed structured data. Machine Learning Pipelines: Simple construction and tuning of ML workflows
  • 7. Google Trends for “dataframe”
  • 8. DataFrames. Table (dept, age, name): (Bio, 48, H Smith); (CS, 54, A Turing); (Bio, 43, B Jones); (Chem, 61, M Kennedy). RDD API; DataFrame API. Data grouped into named columns
  • 9. DataFrames. Table (dept, age, name): (Bio, 48, H Smith); (CS, 54, A Turing); (Bio, 43, B Jones); (Chem, 61, M Kennedy). Data grouped into named columns. DSL for common tasks • Project, filter, aggregate, join, … • Metadata • UDFs
  • 10. Spark DataFrames. API inspired by R and Python Pandas • Python, Scala, Java (+ R in dev) • Pandas integration. Distributed DataFrame. Highly optimized
  • 11. Spark DataFrames are fast. [Bar chart: runtime of aggregating 10 million int pairs (secs), axis 0 to 10; bars for RDD Scala, RDD Python, Spark Scala DF, Spark Python DF; arrow labeled “better”.] Uses SparkSQL Catalyst optimizer
  • 13. Spark for Data Science. DataFrames • Structured data • Familiar API based on R & Python Pandas • Distributed, optimized implementation. Machine Learning Pipelines: Simple construction and tuning of ML workflows
  • 14. About Spark MLlib. Started @ Berkeley • Spark 0.8. Now (Spark 1.3) • Contributions from 50+ orgs, 100+ individuals • Growing coverage of distributed algorithms. Spark (SparkSQL, Streaming, MLlib, GraphX)
  • 15. About Spark MLlib. Classification: Logistic regression, Naive Bayes, Streaming logistic regression, Linear SVMs, Decision trees, Random forests, Gradient-boosted trees. Regression: Ordinary least squares, Ridge regression, Lasso, Isotonic regression, Decision trees, Random forests, Gradient-boosted trees, Streaming linear methods. Statistics: Pearson correlation, Spearman correlation, Online summarization, Chi-squared test, Kernel density estimation. Linear algebra: Local dense & sparse vectors & matrices, Distributed matrices (Block-partitioned matrix, Row matrix, Indexed row matrix, Coordinate matrix), Matrix decompositions. Frequent itemsets: FP-growth. Model import/export. Clustering: Gaussian mixture models, K-Means, Streaming K-Means, Latent Dirichlet Allocation, Power Iteration Clustering. Recommendation: Alternating Least Squares. Feature extraction & selection: Word2Vec, Chi-Squared selection, Hashing term frequency, Inverse document frequency, Normalizer, Standard scaler, Tokenizer
  • 16. ML Workflows are complex. Image classification pipeline* → Specify pipeline → Inspect & debug → Re-run on new data → Tune parameters. * Evan Sparks. “ML Pipelines.” amplab.cs.berkeley.edu/ml-pipelines
  • 17. Example: Text Classification. Goal: Given a text document, predict its topic. Features: “Subject: Re: Lexan Polish? Suggest McQuires #1 plastic polish. It will help somewhat but nothing will remove deep scratches without making it worse than it already is. McQuires will do something...” Label: 1: about science; 0: not about science. Dataset: “20 Newsgroups” from UCI KDD Archive
  • 18. ML Workflow: Load data → Extract features → Train model → Evaluate
  • 19. Load Data. Workflow: Load data → Extract features → Train model → Evaluate. Data sources for DataFrames: built-in; external; { JSON }; JDBC; and more …
  • 20. Load Data. Workflow: Load data → Extract features → Train model → Evaluate. Current data schema: label: Int, text: String
  • 21. Extract Features. Workflow: Load data → Extract features → Train model → Evaluate. Current data schema: label: Int, text: String
  • 22. Extract Features. Workflow: Load data → Tokenizer → Hashed Term Freq. → Train model → Evaluate. Transformer: Tokenizer, Hashed Term Freq. Current data schema: label: Int, text: String, words: Seq[String], features: Vector
  • 23. Train a Model. Workflow: Load data → Tokenizer → Hashed Term Freq. → Logistic Regression → Evaluate. Transformer: Tokenizer, Hashed Term Freq. Estimator: Logistic Regression. Current data schema: label: Int, text: String, words: Seq[String], features: Vector, prediction: Int
  • 24. Evaluate the Model. Workflow: Load data → Tokenizer → Hashed Term Freq. → Logistic Regression → Evaluate. Transformer: Tokenizer, Hashed Term Freq. Estimator: Logistic Regression. Evaluator: Evaluate. Current data schema: label: Int, text: String, words: Seq[String], features: Vector, prediction: Int. By default, always append new columns → Can go back & inspect intermediate results → Made efficient by DataFrame optimizations
  • 25. ML Pipelines. Pipeline: Load data → Tokenizer → Hashed Term Freq. → Logistic Regression → Evaluate. Test data → Tokenizer → Hashed Term Freq. → Logistic Regression → Evaluate. Re-run exactly the same way
  • 26. Parameter Tuning. Pipeline: Tokenizer → Hashed Term Freq. → Logistic Regression → Evaluate. hashingTF.numFeatures {100, 1000, 10000}; lr.regParam {0.01, 0.1, 0.5}. CrossValidator: Given: • Estimator • Parameter grid • Evaluator. Find best parameters
  • 28. Recap. DataFrames • Structured data • Familiar API based on R & Python Pandas • Distributed, optimized implementation. Machine Learning Pipelines • Integration with DataFrames • Familiar API based on scikit-learn • Simple parameter tuning. Composable & DAG Pipelines. Schema validation. User-defined Transformers & Estimators
  • 29. Looking Ahead. Collaborations with UC Berkeley & others • Auto-tuning models. DataFrames • Further optimization • API for R. ML Pipelines • More algorithms & pluggability • API for R
  • 30. Thank you! Spark documentation: spark.apache.org; Pipelines blog post: databricks.com/blog/2015/01/07; DataFrames blog post: databricks.com/blog/2015/02/17; Databricks Cloud Platform: databricks.com/product; Spark MOOCs on edX: Intro to Spark & ML with Spark; Spark Packages: spark-packages.org
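
The text-classification workflow in slides 18 to 26 can be sketched without Spark installed. What follows is a minimal plain-Python illustration of the Transformer / Estimator / Pipeline idea, not Spark's API. The classes only borrow the names used on the slides, and rows are plain dicts standing in for DataFrame rows. The toy sentences are made up, and `zlib.crc32` is an assumed stand-in for the hashing function. A single held-out set replaces the k-fold splitting that a cross-validator would perform.

```python
import math
import zlib
from itertools import product


class Tokenizer:
    """Transformer: appends a 'words' column computed from 'text'."""
    def transform(self, rows):
        return [{**r, "words": r["text"].lower().split()} for r in rows]


class HashingTF:
    """Transformer: appends sparse term-frequency 'features' (index -> count)."""
    def __init__(self, num_features=1000):
        self.num_features = num_features

    def transform(self, rows):
        out = []
        for r in rows:
            vec = {}
            for w in r["words"]:
                i = zlib.crc32(w.encode("utf-8")) % self.num_features
                vec[i] = vec.get(i, 0.0) + 1.0
            out.append({**r, "features": vec})
        return out


class LogisticRegressionModel:
    """Fitted model; as a Transformer it appends a 'prediction' column."""
    def __init__(self, weights, bias):
        self.weights, self.bias = weights, bias

    def score(self, features):
        return self.bias + sum(self.weights.get(i, 0.0) * x for i, x in features.items())

    def transform(self, rows):
        return [{**r, "prediction": int(self.score(r["features"]) > 0)} for r in rows]


class LogisticRegression:
    """Estimator: fit() runs batch gradient descent and returns a model."""
    def __init__(self, reg_param=0.01, iters=200, step=0.5):
        self.reg_param, self.iters, self.step = reg_param, iters, step

    def fit(self, rows):
        model = LogisticRegressionModel({}, 0.0)
        for _ in range(self.iters):
            # L2 penalty gradient on the weights; the bias is not regularized.
            grad_w = {i: self.reg_param * w for i, w in model.weights.items()}
            grad_b = 0.0
            for r in rows:
                err = 1.0 / (1.0 + math.exp(-model.score(r["features"]))) - r["label"]
                for i, x in r["features"].items():
                    grad_w[i] = grad_w.get(i, 0.0) + err * x / len(rows)
                grad_b += err / len(rows)
            for i, g in grad_w.items():
                model.weights[i] = model.weights.get(i, 0.0) - self.step * g
            model.bias -= self.step * grad_b
        return model


class Pipeline:
    """Estimator over stages: each Estimator is fit on the output of earlier stages."""
    def __init__(self, stages):
        self.stages = stages

    def fit(self, rows):
        fitted = []
        for stage in self.stages:
            if hasattr(stage, "fit"):      # Estimator: replace it by its fitted model
                stage = stage.fit(rows)
            rows = stage.transform(rows)
            fitted.append(stage)
        return PipelineModel(fitted)


class PipelineModel:
    """Fitted pipeline: re-runs the same stages, in order, on new data."""
    def __init__(self, stages):
        self.stages = stages

    def transform(self, rows):
        for stage in self.stages:
            rows = stage.transform(rows)
        return rows


def accuracy(rows):
    """Evaluator: fraction of rows whose prediction matches the label."""
    return sum(r["prediction"] == r["label"] for r in rows) / len(rows)


# Made-up toy data: 1 = about science, 0 = not about science.
train = [
    {"label": 1, "text": "the physics experiment measured gravity"},
    {"label": 1, "text": "chemistry lab experiment results"},
    {"label": 0, "text": "selling my old car cheap"},
    {"label": 0, "text": "car polish for deep scratches"},
]
test = [
    {"label": 1, "text": "a new physics experiment"},
    {"label": 0, "text": "cheap car for sale"},
]

# Parameter grid from slide 26, scored on the held-out rows.
results = {}
for reg_param, num_features in product([0.01, 0.1, 0.5], [100, 1000, 10000]):
    pipeline = Pipeline([Tokenizer(), HashingTF(num_features), LogisticRegression(reg_param)])
    results[(reg_param, num_features)] = accuracy(pipeline.fit(train).transform(test))

best_params = max(results, key=results.get)
print(best_params, results[best_params])
```

Every stage returns new rows with one extra column, so the intermediate `words` and `features` values stay available for inspection after prediction. In Spark itself the corresponding pieces are classes such as `Pipeline`, `Tokenizer`, `HashingTF`, `LogisticRegression`, `ParamGridBuilder` and `CrossValidator`; consult the Spark documentation for their actual signatures.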