SlideShare una empresa de Scribd logo
1 de 17
Descargar para leer sin conexión
LEVERAGING SPARK FOR
SCALABLE DATA PREP AND
INFERENCE IN DEEP
LEARNING
James Nguyen
Data & AI Cloud Solution Architect, Microsoft
Agenda
INTRODUCTION LARGE SCALE DATA PREPARATION FOR DEEP
LEARNING
SCALABLE DEEP LEARNING INFERENCE IN
SPARK
• PANDAS UDFS
• SPARK’S BINARY AND TENSORFLOW FORMATS SUPPORT
• SCORE EXTERNALLY HOSTED ML MODEL
• LOAD AND SCORE DL ML MODEL WITHIN SPARK
Introduction
◦ Distributed Deep Learning Frameworks
scale Deep Learning well when input data
elements are independent, allowing
parallel processing to start immediately
◦ However preprocessing and featurization
steps, crucial to Deep Learning
development, might involve complex
business logic with computations across
multiple data elements
◦ In addition, support for Batch Inference is
limited compared to Online Inference.
◦ We can leverage new features in Spark 3.0
about support for binary data and new
Pandas UDFs to address these gaps
Different stages in ML development. Stages in aqua green can be offloaded to Spark
Data
transformation
Inference
Featurization
Model Ready
Data
Training and
Testing
Deployment
Collect data
Ingest and
transform data
Data source 1
Data source 2
Data source 3
Using Spark to Accelerate Data Prep and Featurization
Data prep and featurization in Deep Learning pipeline
1.Data acquisition
and initial
transformation
2. Data preparation
for ML task
3. Featurization 4. ML training
Data query/extraction
tools
Single node
Python
Pandas
Traditional way
Multi-node Spark pipeline: Query + Transformation +
Featurization
Tensorflow/Pytorch
data APIs
Combine tools and scale out with Spark
Pandas UDF
Spark’s Pandas UDFs: Parallelizing Python Computation
Input Spark DataFrame
Grouped/split into
parallel batches
Transformation/
scoring logicInput:
Pandas DF/Series
Output:
Pandas DF/Series
Output Spark DataFrame
Transformation
/scoring logic
Transformation
/scoring logic
….
Vectorized operation: Pyarrow convert from JVM data to Python DF/Series
Spark’s Pandas UDFs: Types and Performance
◦ Scalar UDFs
◦ Column values are split into batches of Pandas series
to pass to UDF
◦ UDF also returns Pandas Series
◦ Good for direct parallel column values computation
◦ Grouped map UDFs
◦ Implements split-apply-pattern: Group by each
column value to form Pandas DataFrames then pass
on to UDF
◦ Returns Pandas DataFrame
◦ All data of a group-by value is loaded into memory
◦ Scalar iterator UDFs (Spark 3.0)
◦ Same with Scala UDF except:
◦ UDF takes iterator of batches instead of single batch
◦ Return iterator of batches or yield batches
◦ Good for initializing some state (e.g. load ML model)
Pandas UDFs perform much better than row-at-a-time UDFs across the board,
ranging from 3x to over 100x (source databricks.com)
ML Training
Spark’s Binary Data and Tensorflow’s TFrecords Formats
support
Binary files (image, audio..)
spark.read.format("binaryFile“)
Import custom libraries
Transformation/scoring
Pandas UDF
• Reading binary data using Binary Files type Spark DataFrame
• Select binary content column into a UDF function to extract feature
• Select other columns such as file path into another UDF as needed (for example to create a label column from the filename)
• Inside UDF function, import needed libraries to extract features from binary data
Scaling Up Data Prep Example 1: Multivariate Time Series
Classification
• ML model to predict customer churn based on their historical interaction (events)
• Each event is a multivariate entity with attributes in categorical, numeric, embedding…
• Each training example is a fixed window of 14 days and the outcome(churn vs. stay).
Challenges:
- There can be millions of customer
- Each customer may have a long
history
- Each history need to generate
100x pairs of training examples
with computation needed to
build features
- Result is billions of records and it
would take days to run in a single
node vs. 2 hours in a 30 nodes
clusters
Data preprocessing plan
Scaling Up Data Prep Example 1: Multivariate Time Series
Classification (cont.)
Read input data
from sources
and combine
Collect event
history for each
customer
In each
customer history
generate
overlapping
windows
Within each
window,
generate and
compute
features
Output data for
training
Spark SQL to select data
from sources
Group by customer
Df.groupby(“customer”)
@pandas_udf(pandas_dec_str,
PandasUDFType.GROUPED_MAP)
output_df.
orderBy(rand()).repartition(10
0).write.format("tfrecords")
Scaling Up Data Prep Example 2: Speech Recognition
• Use deep learning to recognize speech from audio data
• Data is in the form of audio files in wav format. Large volume training requires hundred thousands clips and together with data augmentation can result
in millions of training example
• Processing is computing intensive with audio libraries
ML Training
Wave files
spark.read.format("binaryFile“)
Process core binary
content using
librosa and extract
spectrogram as
features
Pandas UDF 1
Get input file path
and extract file
name and look up
index position in a
label list
Pandas UDF 2
Using Spark for large scale batch inference
Big dataset
Distributed Data
preprocessing
Distributed scoring
Calling externally
hosted APIs
Loading model
and score
Result dataset
Spark is very good for regular map reduce style processing.
The same advantage can apply for ML batch inference
Hosted ML
Service
ML
Model
Load Model and Score within Spark
Model distribution
(sparkcontext.addfile()
or store model file at
shared storage)
Input Spark DataFrame
Pandas Scalar UDFs
Scoring Output:
PD Series
Input:
PD Series
Model
Loading
Pandas Scalar Iterator UDF
(recommended for Deep Learning)
ScoringInput:
Pandas DF/Series
Model
LoadingInput:
Iterator of
Series
Output:
Iterator of
Series or
yield Series
• Model loading can be done from model file cached at worker
machine by addfile() method or from shared cloud storage
• Pandas Scala Iterator UDF flavor reduces the frequency of loading
deep ML model which can be an expensive operation
Deep Learning model is
large in size and is not
serializable, so broadcast
won’t work
Load Model and Score within Spark- Code Example
Calling External APIs in an UDF
Input Spark DataFrame
Pandas Scalar UDF
Http Post Output:
PD Series
Input:
PD Series
Model
Loading
Hosted ML
Service
Pandas Scalar Iterator UDF
Http Post Output:
PD Series
Input:
PD Series
Model
Loading
Batch input
Batch output
Batch input
Batch output
Calling External APIs in an UDF
References
◦ Lee Yang, Jun Shi, Bobbie Chern, and Andy Feng (@afeng76), Yahoo Big ML team, Distributed Deep Learning on Big-Data Clusters,
2017
◦ Databricks,Spark Deep Learning Pipeline , 2017.
◦ Apache Spark Org, Pandas UDF, 2017
◦ Databricks & Apache Spark Org, Pandas UDF Scalar Iterator, 2019.
◦ Databricks & Apache Spark Org, Spark binaryFiles DataFrame, 2019.
◦ Tensorflow team, Spark Tensorflow connector, 2016
THANK YOU!
Your feedback is important to us.
Don’t forget to rate and review the
sessions.

Más contenido relacionado

La actualidad más candente

End to end Machine Learning using Kubeflow - Build, Train, Deploy and Manage
End to end Machine Learning using Kubeflow - Build, Train, Deploy and ManageEnd to end Machine Learning using Kubeflow - Build, Train, Deploy and Manage
End to end Machine Learning using Kubeflow - Build, Train, Deploy and ManageAnimesh Singh
 
大量のデータ処理や分析に使えるOSS Apache Sparkのご紹介(Open Source Conference 2020 Online/Kyoto ...
大量のデータ処理や分析に使えるOSS Apache Sparkのご紹介(Open Source Conference 2020 Online/Kyoto ...大量のデータ処理や分析に使えるOSS Apache Sparkのご紹介(Open Source Conference 2020 Online/Kyoto ...
大量のデータ処理や分析に使えるOSS Apache Sparkのご紹介(Open Source Conference 2020 Online/Kyoto ...NTT DATA Technology & Innovation
 
YugaByte DB Internals - Storage Engine and Transactions
YugaByte DB Internals - Storage Engine and Transactions YugaByte DB Internals - Storage Engine and Transactions
YugaByte DB Internals - Storage Engine and Transactions Yugabyte
 
Map Reduce 〜入門編:仕組みの理解とアルゴリズムデザイン〜
Map Reduce 〜入門編:仕組みの理解とアルゴリズムデザイン〜Map Reduce 〜入門編:仕組みの理解とアルゴリズムデザイン〜
Map Reduce 〜入門編:仕組みの理解とアルゴリズムデザイン〜Takahiro Inoue
 
Build Low-Latency Applications in Rust on ScyllaDB
Build Low-Latency Applications in Rust on ScyllaDBBuild Low-Latency Applications in Rust on ScyllaDB
Build Low-Latency Applications in Rust on ScyllaDBScyllaDB
 
Change Data Feed in Delta
Change Data Feed in DeltaChange Data Feed in Delta
Change Data Feed in DeltaDatabricks
 
Spark rdd vs data frame vs dataset
Spark rdd vs data frame vs datasetSpark rdd vs data frame vs dataset
Spark rdd vs data frame vs datasetAnkit Beohar
 
Make your PySpark Data Fly with Arrow!
Make your PySpark Data Fly with Arrow!Make your PySpark Data Fly with Arrow!
Make your PySpark Data Fly with Arrow!Databricks
 
NOSQLEU - Graph Databases and Neo4j
NOSQLEU - Graph Databases and Neo4jNOSQLEU - Graph Databases and Neo4j
NOSQLEU - Graph Databases and Neo4jTobias Lindaaker
 
Data Quality With or Without Apache Spark and Its Ecosystem
Data Quality With or Without Apache Spark and Its EcosystemData Quality With or Without Apache Spark and Its Ecosystem
Data Quality With or Without Apache Spark and Its EcosystemDatabricks
 
Modern ETL Pipelines with Change Data Capture
Modern ETL Pipelines with Change Data CaptureModern ETL Pipelines with Change Data Capture
Modern ETL Pipelines with Change Data CaptureDatabricks
 
A look under the hood at Apache Spark's API and engine evolutions
A look under the hood at Apache Spark's API and engine evolutionsA look under the hood at Apache Spark's API and engine evolutions
A look under the hood at Apache Spark's API and engine evolutionsDatabricks
 
Apache Arrow and Pandas UDF on Apache Spark
Apache Arrow and Pandas UDF on Apache SparkApache Arrow and Pandas UDF on Apache Spark
Apache Arrow and Pandas UDF on Apache SparkTakuya UESHIN
 
Introduction to Neo4j for the Emirates & Bahrain
Introduction to Neo4j for the Emirates & BahrainIntroduction to Neo4j for the Emirates & Bahrain
Introduction to Neo4j for the Emirates & BahrainNeo4j
 
Apache Airflow 概要(Airflowの基礎を学ぶハンズオンワークショップ 発表資料)
Apache Airflow 概要(Airflowの基礎を学ぶハンズオンワークショップ 発表資料)Apache Airflow 概要(Airflowの基礎を学ぶハンズオンワークショップ 発表資料)
Apache Airflow 概要(Airflowの基礎を学ぶハンズオンワークショップ 発表資料)NTT DATA Technology & Innovation
 
並列分散処理基盤Hadoopの紹介と、開発者が語るHadoopの使いどころ (Silicon Valley x 日本 / Tech x Business ...
並列分散処理基盤Hadoopの紹介と、開発者が語るHadoopの使いどころ (Silicon Valley x 日本 / Tech x Business ...並列分散処理基盤Hadoopの紹介と、開発者が語るHadoopの使いどころ (Silicon Valley x 日本 / Tech x Business ...
並列分散処理基盤Hadoopの紹介と、開発者が語るHadoopの使いどころ (Silicon Valley x 日本 / Tech x Business ...NTT DATA OSS Professional Services
 
A Tale of Three Apache Spark APIs: RDDs, DataFrames, and Datasets with Jules ...
A Tale of Three Apache Spark APIs: RDDs, DataFrames, and Datasets with Jules ...A Tale of Three Apache Spark APIs: RDDs, DataFrames, and Datasets with Jules ...
A Tale of Three Apache Spark APIs: RDDs, DataFrames, and Datasets with Jules ...Databricks
 
Delta Lake, un vernis pour parquet
Delta Lake, un vernis pour parquetDelta Lake, un vernis pour parquet
Delta Lake, un vernis pour parquetAlban Phélip
 

La actualidad más candente (20)

Apache Spark Architecture
Apache Spark ArchitectureApache Spark Architecture
Apache Spark Architecture
 
End to end Machine Learning using Kubeflow - Build, Train, Deploy and Manage
End to end Machine Learning using Kubeflow - Build, Train, Deploy and ManageEnd to end Machine Learning using Kubeflow - Build, Train, Deploy and Manage
End to end Machine Learning using Kubeflow - Build, Train, Deploy and Manage
 
大量のデータ処理や分析に使えるOSS Apache Sparkのご紹介(Open Source Conference 2020 Online/Kyoto ...
大量のデータ処理や分析に使えるOSS Apache Sparkのご紹介(Open Source Conference 2020 Online/Kyoto ...大量のデータ処理や分析に使えるOSS Apache Sparkのご紹介(Open Source Conference 2020 Online/Kyoto ...
大量のデータ処理や分析に使えるOSS Apache Sparkのご紹介(Open Source Conference 2020 Online/Kyoto ...
 
YugaByte DB Internals - Storage Engine and Transactions
YugaByte DB Internals - Storage Engine and Transactions YugaByte DB Internals - Storage Engine and Transactions
YugaByte DB Internals - Storage Engine and Transactions
 
Map Reduce 〜入門編:仕組みの理解とアルゴリズムデザイン〜
Map Reduce 〜入門編:仕組みの理解とアルゴリズムデザイン〜Map Reduce 〜入門編:仕組みの理解とアルゴリズムデザイン〜
Map Reduce 〜入門編:仕組みの理解とアルゴリズムデザイン〜
 
Build Low-Latency Applications in Rust on ScyllaDB
Build Low-Latency Applications in Rust on ScyllaDBBuild Low-Latency Applications in Rust on ScyllaDB
Build Low-Latency Applications in Rust on ScyllaDB
 
Change Data Feed in Delta
Change Data Feed in DeltaChange Data Feed in Delta
Change Data Feed in Delta
 
Spark rdd vs data frame vs dataset
Spark rdd vs data frame vs datasetSpark rdd vs data frame vs dataset
Spark rdd vs data frame vs dataset
 
Make your PySpark Data Fly with Arrow!
Make your PySpark Data Fly with Arrow!Make your PySpark Data Fly with Arrow!
Make your PySpark Data Fly with Arrow!
 
NOSQLEU - Graph Databases and Neo4j
NOSQLEU - Graph Databases and Neo4jNOSQLEU - Graph Databases and Neo4j
NOSQLEU - Graph Databases and Neo4j
 
Data Quality With or Without Apache Spark and Its Ecosystem
Data Quality With or Without Apache Spark and Its EcosystemData Quality With or Without Apache Spark and Its Ecosystem
Data Quality With or Without Apache Spark and Its Ecosystem
 
Modern ETL Pipelines with Change Data Capture
Modern ETL Pipelines with Change Data CaptureModern ETL Pipelines with Change Data Capture
Modern ETL Pipelines with Change Data Capture
 
A look under the hood at Apache Spark's API and engine evolutions
A look under the hood at Apache Spark's API and engine evolutionsA look under the hood at Apache Spark's API and engine evolutions
A look under the hood at Apache Spark's API and engine evolutions
 
Oracle APEX概要
Oracle APEX概要Oracle APEX概要
Oracle APEX概要
 
Apache Arrow and Pandas UDF on Apache Spark
Apache Arrow and Pandas UDF on Apache SparkApache Arrow and Pandas UDF on Apache Spark
Apache Arrow and Pandas UDF on Apache Spark
 
Introduction to Neo4j for the Emirates & Bahrain
Introduction to Neo4j for the Emirates & BahrainIntroduction to Neo4j for the Emirates & Bahrain
Introduction to Neo4j for the Emirates & Bahrain
 
Apache Airflow 概要(Airflowの基礎を学ぶハンズオンワークショップ 発表資料)
Apache Airflow 概要(Airflowの基礎を学ぶハンズオンワークショップ 発表資料)Apache Airflow 概要(Airflowの基礎を学ぶハンズオンワークショップ 発表資料)
Apache Airflow 概要(Airflowの基礎を学ぶハンズオンワークショップ 発表資料)
 
並列分散処理基盤Hadoopの紹介と、開発者が語るHadoopの使いどころ (Silicon Valley x 日本 / Tech x Business ...
並列分散処理基盤Hadoopの紹介と、開発者が語るHadoopの使いどころ (Silicon Valley x 日本 / Tech x Business ...並列分散処理基盤Hadoopの紹介と、開発者が語るHadoopの使いどころ (Silicon Valley x 日本 / Tech x Business ...
並列分散処理基盤Hadoopの紹介と、開発者が語るHadoopの使いどころ (Silicon Valley x 日本 / Tech x Business ...
 
A Tale of Three Apache Spark APIs: RDDs, DataFrames, and Datasets with Jules ...
A Tale of Three Apache Spark APIs: RDDs, DataFrames, and Datasets with Jules ...A Tale of Three Apache Spark APIs: RDDs, DataFrames, and Datasets with Jules ...
A Tale of Three Apache Spark APIs: RDDs, DataFrames, and Datasets with Jules ...
 
Delta Lake, un vernis pour parquet
Delta Lake, un vernis pour parquetDelta Lake, un vernis pour parquet
Delta Lake, un vernis pour parquet
 

Similar a Leveraging Apache Spark for Scalable Data Prep and Inference in Deep Learning

BigDL webinar - Deep Learning Library for Spark
BigDL webinar - Deep Learning Library for SparkBigDL webinar - Deep Learning Library for Spark
BigDL webinar - Deep Learning Library for SparkDESMOND YUEN
 
Building machine learning service in your business — Eric Chen (Uber) @PAPIs ...
Building machine learning service in your business — Eric Chen (Uber) @PAPIs ...Building machine learning service in your business — Eric Chen (Uber) @PAPIs ...
Building machine learning service in your business — Eric Chen (Uber) @PAPIs ...PAPIs.io
 
Unified Big Data Processing with Apache Spark
Unified Big Data Processing with Apache SparkUnified Big Data Processing with Apache Spark
Unified Big Data Processing with Apache SparkC4Media
 
Spark streaming , Spark SQL
Spark streaming , Spark SQLSpark streaming , Spark SQL
Spark streaming , Spark SQLYousun Jeong
 
From Pandas to Koalas: Reducing Time-To-Insight for Virgin Hyperloop's Data
From Pandas to Koalas: Reducing Time-To-Insight for Virgin Hyperloop's DataFrom Pandas to Koalas: Reducing Time-To-Insight for Virgin Hyperloop's Data
From Pandas to Koalas: Reducing Time-To-Insight for Virgin Hyperloop's DataDatabricks
 
Project Hydrogen: State-of-the-Art Deep Learning on Apache Spark
Project Hydrogen: State-of-the-Art Deep Learning on Apache SparkProject Hydrogen: State-of-the-Art Deep Learning on Apache Spark
Project Hydrogen: State-of-the-Art Deep Learning on Apache SparkDatabricks
 
Building a modern Application with DataFrames
Building a modern Application with DataFramesBuilding a modern Application with DataFrames
Building a modern Application with DataFramesDatabricks
 
Building a modern Application with DataFrames
Building a modern Application with DataFramesBuilding a modern Application with DataFrames
Building a modern Application with DataFramesSpark Summit
 
Building Machine Learning Applications with Sparkling Water
Building Machine Learning Applications with Sparkling WaterBuilding Machine Learning Applications with Sparkling Water
Building Machine Learning Applications with Sparkling WaterSri Ambati
 
BigDL: Bringing Ease of Use of Deep Learning for Apache Spark with Jason Dai ...
BigDL: Bringing Ease of Use of Deep Learning for Apache Spark with Jason Dai ...BigDL: Bringing Ease of Use of Deep Learning for Apache Spark with Jason Dai ...
BigDL: Bringing Ease of Use of Deep Learning for Apache Spark with Jason Dai ...Databricks
 
Deep learning and streaming in Apache Spark 2.2 by Matei Zaharia
Deep learning and streaming in Apache Spark 2.2 by Matei ZahariaDeep learning and streaming in Apache Spark 2.2 by Matei Zaharia
Deep learning and streaming in Apache Spark 2.2 by Matei ZahariaGoDataDriven
 
Very large scale distributed deep learning on BigDL
Very large scale distributed deep learning on BigDLVery large scale distributed deep learning on BigDL
Very large scale distributed deep learning on BigDLDESMOND YUEN
 
Profiling & Testing with Spark
Profiling & Testing with SparkProfiling & Testing with Spark
Profiling & Testing with SparkRoger Rafanell Mas
 
An introduction To Apache Spark
An introduction To Apache SparkAn introduction To Apache Spark
An introduction To Apache SparkAmir Sedighi
 
Koalas: Making an Easy Transition from Pandas to Apache Spark
Koalas: Making an Easy Transition from Pandas to Apache SparkKoalas: Making an Easy Transition from Pandas to Apache Spark
Koalas: Making an Easy Transition from Pandas to Apache SparkDatabricks
 
Build Large-Scale Data Analytics and AI Pipeline Using RayDP
Build Large-Scale Data Analytics and AI Pipeline Using RayDPBuild Large-Scale Data Analytics and AI Pipeline Using RayDP
Build Large-Scale Data Analytics and AI Pipeline Using RayDPDatabricks
 
Paris Data Geek - Spark Streaming
Paris Data Geek - Spark Streaming Paris Data Geek - Spark Streaming
Paris Data Geek - Spark Streaming Djamel Zouaoui
 
Introduction to Spark (Intern Event Presentation)
Introduction to Spark (Intern Event Presentation)Introduction to Spark (Intern Event Presentation)
Introduction to Spark (Intern Event Presentation)Databricks
 
Introduction to Apache Spark Developer Training
Introduction to Apache Spark Developer TrainingIntroduction to Apache Spark Developer Training
Introduction to Apache Spark Developer TrainingCloudera, Inc.
 

Similar a Leveraging Apache Spark for Scalable Data Prep and Inference in Deep Learning (20)

BigDL webinar - Deep Learning Library for Spark
BigDL webinar - Deep Learning Library for SparkBigDL webinar - Deep Learning Library for Spark
BigDL webinar - Deep Learning Library for Spark
 
Building machine learning service in your business — Eric Chen (Uber) @PAPIs ...
Building machine learning service in your business — Eric Chen (Uber) @PAPIs ...Building machine learning service in your business — Eric Chen (Uber) @PAPIs ...
Building machine learning service in your business — Eric Chen (Uber) @PAPIs ...
 
Unified Big Data Processing with Apache Spark
Unified Big Data Processing with Apache SparkUnified Big Data Processing with Apache Spark
Unified Big Data Processing with Apache Spark
 
Spark streaming , Spark SQL
Spark streaming , Spark SQLSpark streaming , Spark SQL
Spark streaming , Spark SQL
 
From Pandas to Koalas: Reducing Time-To-Insight for Virgin Hyperloop's Data
From Pandas to Koalas: Reducing Time-To-Insight for Virgin Hyperloop's DataFrom Pandas to Koalas: Reducing Time-To-Insight for Virgin Hyperloop's Data
From Pandas to Koalas: Reducing Time-To-Insight for Virgin Hyperloop's Data
 
Project Hydrogen: State-of-the-Art Deep Learning on Apache Spark
Project Hydrogen: State-of-the-Art Deep Learning on Apache SparkProject Hydrogen: State-of-the-Art Deep Learning on Apache Spark
Project Hydrogen: State-of-the-Art Deep Learning on Apache Spark
 
Building a modern Application with DataFrames
Building a modern Application with DataFramesBuilding a modern Application with DataFrames
Building a modern Application with DataFrames
 
Building a modern Application with DataFrames
Building a modern Application with DataFramesBuilding a modern Application with DataFrames
Building a modern Application with DataFrames
 
Building Machine Learning Applications with Sparkling Water
Building Machine Learning Applications with Sparkling WaterBuilding Machine Learning Applications with Sparkling Water
Building Machine Learning Applications with Sparkling Water
 
BigDL: Bringing Ease of Use of Deep Learning for Apache Spark with Jason Dai ...
BigDL: Bringing Ease of Use of Deep Learning for Apache Spark with Jason Dai ...BigDL: Bringing Ease of Use of Deep Learning for Apache Spark with Jason Dai ...
BigDL: Bringing Ease of Use of Deep Learning for Apache Spark with Jason Dai ...
 
Bds session 13 14
Bds session 13 14Bds session 13 14
Bds session 13 14
 
Deep learning and streaming in Apache Spark 2.2 by Matei Zaharia
Deep learning and streaming in Apache Spark 2.2 by Matei ZahariaDeep learning and streaming in Apache Spark 2.2 by Matei Zaharia
Deep learning and streaming in Apache Spark 2.2 by Matei Zaharia
 
Very large scale distributed deep learning on BigDL
Very large scale distributed deep learning on BigDLVery large scale distributed deep learning on BigDL
Very large scale distributed deep learning on BigDL
 
Profiling & Testing with Spark
Profiling & Testing with SparkProfiling & Testing with Spark
Profiling & Testing with Spark
 
An introduction To Apache Spark
An introduction To Apache SparkAn introduction To Apache Spark
An introduction To Apache Spark
 
Koalas: Making an Easy Transition from Pandas to Apache Spark
Koalas: Making an Easy Transition from Pandas to Apache SparkKoalas: Making an Easy Transition from Pandas to Apache Spark
Koalas: Making an Easy Transition from Pandas to Apache Spark
 
Build Large-Scale Data Analytics and AI Pipeline Using RayDP
Build Large-Scale Data Analytics and AI Pipeline Using RayDPBuild Large-Scale Data Analytics and AI Pipeline Using RayDP
Build Large-Scale Data Analytics and AI Pipeline Using RayDP
 
Paris Data Geek - Spark Streaming
Paris Data Geek - Spark Streaming Paris Data Geek - Spark Streaming
Paris Data Geek - Spark Streaming
 
Introduction to Spark (Intern Event Presentation)
Introduction to Spark (Intern Event Presentation)Introduction to Spark (Intern Event Presentation)
Introduction to Spark (Intern Event Presentation)
 
Introduction to Apache Spark Developer Training
Introduction to Apache Spark Developer TrainingIntroduction to Apache Spark Developer Training
Introduction to Apache Spark Developer Training
 

Más de Databricks

DW Migration Webinar-March 2022.pptx
DW Migration Webinar-March 2022.pptxDW Migration Webinar-March 2022.pptx
DW Migration Webinar-March 2022.pptxDatabricks
 
Data Lakehouse Symposium | Day 1 | Part 1
Data Lakehouse Symposium | Day 1 | Part 1Data Lakehouse Symposium | Day 1 | Part 1
Data Lakehouse Symposium | Day 1 | Part 1Databricks
 
Data Lakehouse Symposium | Day 1 | Part 2
Data Lakehouse Symposium | Day 1 | Part 2Data Lakehouse Symposium | Day 1 | Part 2
Data Lakehouse Symposium | Day 1 | Part 2Databricks
 
Data Lakehouse Symposium | Day 2
Data Lakehouse Symposium | Day 2Data Lakehouse Symposium | Day 2
Data Lakehouse Symposium | Day 2Databricks
 
Data Lakehouse Symposium | Day 4
Data Lakehouse Symposium | Day 4Data Lakehouse Symposium | Day 4
Data Lakehouse Symposium | Day 4Databricks
 
5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop
5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop
5 Critical Steps to Clean Your Data Swamp When Migrating Off of HadoopDatabricks
 
Democratizing Data Quality Through a Centralized Platform
Democratizing Data Quality Through a Centralized PlatformDemocratizing Data Quality Through a Centralized Platform
Democratizing Data Quality Through a Centralized PlatformDatabricks
 
Learn to Use Databricks for Data Science
Learn to Use Databricks for Data ScienceLearn to Use Databricks for Data Science
Learn to Use Databricks for Data ScienceDatabricks
 
Why APM Is Not the Same As ML Monitoring
Why APM Is Not the Same As ML MonitoringWhy APM Is Not the Same As ML Monitoring
Why APM Is Not the Same As ML MonitoringDatabricks
 
The Function, the Context, and the Data—Enabling ML Ops at Stitch Fix
The Function, the Context, and the Data—Enabling ML Ops at Stitch FixThe Function, the Context, and the Data—Enabling ML Ops at Stitch Fix
The Function, the Context, and the Data—Enabling ML Ops at Stitch FixDatabricks
 
Stage Level Scheduling Improving Big Data and AI Integration
Stage Level Scheduling Improving Big Data and AI IntegrationStage Level Scheduling Improving Big Data and AI Integration
Stage Level Scheduling Improving Big Data and AI IntegrationDatabricks
 
Simplify Data Conversion from Spark to TensorFlow and PyTorch
Simplify Data Conversion from Spark to TensorFlow and PyTorchSimplify Data Conversion from Spark to TensorFlow and PyTorch
Simplify Data Conversion from Spark to TensorFlow and PyTorchDatabricks
 
Scaling your Data Pipelines with Apache Spark on Kubernetes
Scaling your Data Pipelines with Apache Spark on KubernetesScaling your Data Pipelines with Apache Spark on Kubernetes
Scaling your Data Pipelines with Apache Spark on KubernetesDatabricks
 
Scaling and Unifying SciKit Learn and Apache Spark Pipelines
Scaling and Unifying SciKit Learn and Apache Spark PipelinesScaling and Unifying SciKit Learn and Apache Spark Pipelines
Scaling and Unifying SciKit Learn and Apache Spark PipelinesDatabricks
 
Sawtooth Windows for Feature Aggregations
Sawtooth Windows for Feature AggregationsSawtooth Windows for Feature Aggregations
Sawtooth Windows for Feature AggregationsDatabricks
 
Redis + Apache Spark = Swiss Army Knife Meets Kitchen Sink
Redis + Apache Spark = Swiss Army Knife Meets Kitchen SinkRedis + Apache Spark = Swiss Army Knife Meets Kitchen Sink
Redis + Apache Spark = Swiss Army Knife Meets Kitchen SinkDatabricks
 
Re-imagine Data Monitoring with whylogs and Spark
Re-imagine Data Monitoring with whylogs and SparkRe-imagine Data Monitoring with whylogs and Spark
Re-imagine Data Monitoring with whylogs and SparkDatabricks
 
Raven: End-to-end Optimization of ML Prediction Queries
Raven: End-to-end Optimization of ML Prediction QueriesRaven: End-to-end Optimization of ML Prediction Queries
Raven: End-to-end Optimization of ML Prediction QueriesDatabricks
 
Processing Large Datasets for ADAS Applications using Apache Spark
Processing Large Datasets for ADAS Applications using Apache SparkProcessing Large Datasets for ADAS Applications using Apache Spark
Processing Large Datasets for ADAS Applications using Apache SparkDatabricks
 
Massive Data Processing in Adobe Using Delta Lake
Massive Data Processing in Adobe Using Delta LakeMassive Data Processing in Adobe Using Delta Lake
Massive Data Processing in Adobe Using Delta LakeDatabricks
 

Más de Databricks (20)

DW Migration Webinar-March 2022.pptx
DW Migration Webinar-March 2022.pptxDW Migration Webinar-March 2022.pptx
DW Migration Webinar-March 2022.pptx
 
Data Lakehouse Symposium | Day 1 | Part 1
Data Lakehouse Symposium | Day 1 | Part 1Data Lakehouse Symposium | Day 1 | Part 1
Data Lakehouse Symposium | Day 1 | Part 1
 
Data Lakehouse Symposium | Day 1 | Part 2
Data Lakehouse Symposium | Day 1 | Part 2Data Lakehouse Symposium | Day 1 | Part 2
Data Lakehouse Symposium | Day 1 | Part 2
 
Data Lakehouse Symposium | Day 2
Data Lakehouse Symposium | Day 2Data Lakehouse Symposium | Day 2
Data Lakehouse Symposium | Day 2
 
Data Lakehouse Symposium | Day 4
Data Lakehouse Symposium | Day 4Data Lakehouse Symposium | Day 4
Data Lakehouse Symposium | Day 4
 
5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop
5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop
5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop
 
Democratizing Data Quality Through a Centralized Platform
Democratizing Data Quality Through a Centralized PlatformDemocratizing Data Quality Through a Centralized Platform
Democratizing Data Quality Through a Centralized Platform
 
Learn to Use Databricks for Data Science
Learn to Use Databricks for Data ScienceLearn to Use Databricks for Data Science
Learn to Use Databricks for Data Science
 
Why APM Is Not the Same As ML Monitoring
Why APM Is Not the Same As ML MonitoringWhy APM Is Not the Same As ML Monitoring
Why APM Is Not the Same As ML Monitoring
 
The Function, the Context, and the Data—Enabling ML Ops at Stitch Fix
The Function, the Context, and the Data—Enabling ML Ops at Stitch FixThe Function, the Context, and the Data—Enabling ML Ops at Stitch Fix
The Function, the Context, and the Data—Enabling ML Ops at Stitch Fix
 
Stage Level Scheduling Improving Big Data and AI Integration
Stage Level Scheduling Improving Big Data and AI IntegrationStage Level Scheduling Improving Big Data and AI Integration
Stage Level Scheduling Improving Big Data and AI Integration
 
Simplify Data Conversion from Spark to TensorFlow and PyTorch
Simplify Data Conversion from Spark to TensorFlow and PyTorchSimplify Data Conversion from Spark to TensorFlow and PyTorch
Simplify Data Conversion from Spark to TensorFlow and PyTorch
 
Scaling your Data Pipelines with Apache Spark on Kubernetes
Scaling your Data Pipelines with Apache Spark on KubernetesScaling your Data Pipelines with Apache Spark on Kubernetes
Scaling your Data Pipelines with Apache Spark on Kubernetes
 
Scaling and Unifying SciKit Learn and Apache Spark Pipelines
Scaling and Unifying SciKit Learn and Apache Spark PipelinesScaling and Unifying SciKit Learn and Apache Spark Pipelines
Scaling and Unifying SciKit Learn and Apache Spark Pipelines
 
Sawtooth Windows for Feature Aggregations
Sawtooth Windows for Feature AggregationsSawtooth Windows for Feature Aggregations
Sawtooth Windows for Feature Aggregations
 
Redis + Apache Spark = Swiss Army Knife Meets Kitchen Sink
Redis + Apache Spark = Swiss Army Knife Meets Kitchen SinkRedis + Apache Spark = Swiss Army Knife Meets Kitchen Sink
Redis + Apache Spark = Swiss Army Knife Meets Kitchen Sink
 
Re-imagine Data Monitoring with whylogs and Spark
Re-imagine Data Monitoring with whylogs and SparkRe-imagine Data Monitoring with whylogs and Spark
Re-imagine Data Monitoring with whylogs and Spark
 
Raven: End-to-end Optimization of ML Prediction Queries
Raven: End-to-end Optimization of ML Prediction QueriesRaven: End-to-end Optimization of ML Prediction Queries
Raven: End-to-end Optimization of ML Prediction Queries
 
Processing Large Datasets for ADAS Applications using Apache Spark
Processing Large Datasets for ADAS Applications using Apache SparkProcessing Large Datasets for ADAS Applications using Apache Spark
Processing Large Datasets for ADAS Applications using Apache Spark
 
Massive Data Processing in Adobe Using Delta Lake
Massive Data Processing in Adobe Using Delta LakeMassive Data Processing in Adobe Using Delta Lake
Massive Data Processing in Adobe Using Delta Lake
 

Último

Call Girls Bannerghatta Road Just Call 👗 7737669865 👗 Top Class Call Girl Ser...
Call Girls Bannerghatta Road Just Call 👗 7737669865 👗 Top Class Call Girl Ser...Call Girls Bannerghatta Road Just Call 👗 7737669865 👗 Top Class Call Girl Ser...
Call Girls Bannerghatta Road Just Call 👗 7737669865 👗 Top Class Call Girl Ser...amitlee9823
 
FESE Capital Markets Fact Sheet 2024 Q1.pdf
FESE Capital Markets Fact Sheet 2024 Q1.pdfFESE Capital Markets Fact Sheet 2024 Q1.pdf
FESE Capital Markets Fact Sheet 2024 Q1.pdfMarinCaroMartnezBerg
 
Probability Grade 10 Third Quarter Lessons
Probability Grade 10 Third Quarter LessonsProbability Grade 10 Third Quarter Lessons
Probability Grade 10 Third Quarter LessonsJoseMangaJr1
 
Mature dropshipping via API with DroFx.pptx
Mature dropshipping via API with DroFx.pptxMature dropshipping via API with DroFx.pptx
Mature dropshipping via API with DroFx.pptxolyaivanovalion
 
Smarteg dropshipping via API with DroFx.pptx
Smarteg dropshipping via API with DroFx.pptxSmarteg dropshipping via API with DroFx.pptx
Smarteg dropshipping via API with DroFx.pptxolyaivanovalion
 
Capstone Project on IBM Data Analytics Program
Capstone Project on IBM Data Analytics ProgramCapstone Project on IBM Data Analytics Program
Capstone Project on IBM Data Analytics ProgramMoniSankarHazra
 
Mg Road Call Girls Service: 🍓 7737669865 🍓 High Profile Model Escorts | Banga...
Mg Road Call Girls Service: 🍓 7737669865 🍓 High Profile Model Escorts | Banga...Mg Road Call Girls Service: 🍓 7737669865 🍓 High Profile Model Escorts | Banga...
Mg Road Call Girls Service: 🍓 7737669865 🍓 High Profile Model Escorts | Banga...amitlee9823
 
Edukaciniai dropshipping via API with DroFx
Edukaciniai dropshipping via API with DroFxEdukaciniai dropshipping via API with DroFx
Edukaciniai dropshipping via API with DroFxolyaivanovalion
 
BigBuy dropshipping via API with DroFx.pptx
BigBuy dropshipping via API with DroFx.pptxBigBuy dropshipping via API with DroFx.pptx
BigBuy dropshipping via API with DroFx.pptxolyaivanovalion
 
Accredited-Transport-Cooperatives-Jan-2021-Web.pdf
Accredited-Transport-Cooperatives-Jan-2021-Web.pdfAccredited-Transport-Cooperatives-Jan-2021-Web.pdf
Accredited-Transport-Cooperatives-Jan-2021-Web.pdfadriantubila
 
Chintamani Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore ...
Chintamani Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore ...Chintamani Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore ...
Chintamani Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore ...amitlee9823
 
Call Girls Jalahalli Just Call 👗 7737669865 👗 Top Class Call Girl Service Ban...
Call Girls Jalahalli Just Call 👗 7737669865 👗 Top Class Call Girl Service Ban...Call Girls Jalahalli Just Call 👗 7737669865 👗 Top Class Call Girl Service Ban...
Call Girls Jalahalli Just Call 👗 7737669865 👗 Top Class Call Girl Service Ban...amitlee9823
 
Invezz.com - Grow your wealth with trading signals
Invezz.com - Grow your wealth with trading signalsInvezz.com - Grow your wealth with trading signals
Invezz.com - Grow your wealth with trading signalsInvezz1
 
Discover Why Less is More in B2B Research
Discover Why Less is More in B2B ResearchDiscover Why Less is More in B2B Research
Discover Why Less is More in B2B Researchmichael115558
 
Halmar dropshipping via API with DroFx
Halmar  dropshipping  via API with DroFxHalmar  dropshipping  via API with DroFx
Halmar dropshipping via API with DroFxolyaivanovalion
 
VIP Model Call Girls Hinjewadi ( Pune ) Call ON 8005736733 Starting From 5K t...
VIP Model Call Girls Hinjewadi ( Pune ) Call ON 8005736733 Starting From 5K t...VIP Model Call Girls Hinjewadi ( Pune ) Call ON 8005736733 Starting From 5K t...
VIP Model Call Girls Hinjewadi ( Pune ) Call ON 8005736733 Starting From 5K t...SUHANI PANDEY
 
BDSM⚡Call Girls in Mandawali Delhi >༒8448380779 Escort Service
BDSM⚡Call Girls in Mandawali Delhi >༒8448380779 Escort ServiceBDSM⚡Call Girls in Mandawali Delhi >༒8448380779 Escort Service
BDSM⚡Call Girls in Mandawali Delhi >༒8448380779 Escort ServiceDelhi Call girls
 

Último (20)

Call Girls Bannerghatta Road Just Call 👗 7737669865 👗 Top Class Call Girl Ser...
Call Girls Bannerghatta Road Just Call 👗 7737669865 👗 Top Class Call Girl Ser...Call Girls Bannerghatta Road Just Call 👗 7737669865 👗 Top Class Call Girl Ser...
Call Girls Bannerghatta Road Just Call 👗 7737669865 👗 Top Class Call Girl Ser...
 
FESE Capital Markets Fact Sheet 2024 Q1.pdf
FESE Capital Markets Fact Sheet 2024 Q1.pdfFESE Capital Markets Fact Sheet 2024 Q1.pdf
FESE Capital Markets Fact Sheet 2024 Q1.pdf
 
Probability Grade 10 Third Quarter Lessons
Probability Grade 10 Third Quarter LessonsProbability Grade 10 Third Quarter Lessons
Probability Grade 10 Third Quarter Lessons
 
Abortion pills in Doha Qatar (+966572737505 ! Get Cytotec
Abortion pills in Doha Qatar (+966572737505 ! Get CytotecAbortion pills in Doha Qatar (+966572737505 ! Get Cytotec
Abortion pills in Doha Qatar (+966572737505 ! Get Cytotec
 
Mature dropshipping via API with DroFx.pptx
Mature dropshipping via API with DroFx.pptxMature dropshipping via API with DroFx.pptx
Mature dropshipping via API with DroFx.pptx
 
Smarteg dropshipping via API with DroFx.pptx
Smarteg dropshipping via API with DroFx.pptxSmarteg dropshipping via API with DroFx.pptx
Smarteg dropshipping via API with DroFx.pptx
 
Capstone Project on IBM Data Analytics Program
Capstone Project on IBM Data Analytics ProgramCapstone Project on IBM Data Analytics Program
Capstone Project on IBM Data Analytics Program
 
Mg Road Call Girls Service: 🍓 7737669865 🍓 High Profile Model Escorts | Banga...
Mg Road Call Girls Service: 🍓 7737669865 🍓 High Profile Model Escorts | Banga...Mg Road Call Girls Service: 🍓 7737669865 🍓 High Profile Model Escorts | Banga...
Mg Road Call Girls Service: 🍓 7737669865 🍓 High Profile Model Escorts | Banga...
 
Edukaciniai dropshipping via API with DroFx
Edukaciniai dropshipping via API with DroFxEdukaciniai dropshipping via API with DroFx
Edukaciniai dropshipping via API with DroFx
 
BigBuy dropshipping via API with DroFx.pptx
BigBuy dropshipping via API with DroFx.pptxBigBuy dropshipping via API with DroFx.pptx
BigBuy dropshipping via API with DroFx.pptx
 
CHEAP Call Girls in Saket (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICE
CHEAP Call Girls in Saket (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICECHEAP Call Girls in Saket (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICE
CHEAP Call Girls in Saket (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICE
 
Accredited-Transport-Cooperatives-Jan-2021-Web.pdf
Accredited-Transport-Cooperatives-Jan-2021-Web.pdfAccredited-Transport-Cooperatives-Jan-2021-Web.pdf
Accredited-Transport-Cooperatives-Jan-2021-Web.pdf
 
(NEHA) Call Girls Katra Call Now 8617697112 Katra Escorts 24x7
(NEHA) Call Girls Katra Call Now 8617697112 Katra Escorts 24x7(NEHA) Call Girls Katra Call Now 8617697112 Katra Escorts 24x7
(NEHA) Call Girls Katra Call Now 8617697112 Katra Escorts 24x7
 
Chintamani Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore ...
Chintamani Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore ...Chintamani Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore ...
Chintamani Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore ...
 
Call Girls Jalahalli Just Call 👗 7737669865 👗 Top Class Call Girl Service Ban...
Call Girls Jalahalli Just Call 👗 7737669865 👗 Top Class Call Girl Service Ban...Call Girls Jalahalli Just Call 👗 7737669865 👗 Top Class Call Girl Service Ban...
Call Girls Jalahalli Just Call 👗 7737669865 👗 Top Class Call Girl Service Ban...
 
Invezz.com - Grow your wealth with trading signals
Invezz.com - Grow your wealth with trading signalsInvezz.com - Grow your wealth with trading signals
Invezz.com - Grow your wealth with trading signals
 
Discover Why Less is More in B2B Research
Discover Why Less is More in B2B ResearchDiscover Why Less is More in B2B Research
Discover Why Less is More in B2B Research
 
Halmar dropshipping via API with DroFx
Halmar  dropshipping  via API with DroFxHalmar  dropshipping  via API with DroFx
Halmar dropshipping via API with DroFx
 
VIP Model Call Girls Hinjewadi ( Pune ) Call ON 8005736733 Starting From 5K t...
VIP Model Call Girls Hinjewadi ( Pune ) Call ON 8005736733 Starting From 5K t...VIP Model Call Girls Hinjewadi ( Pune ) Call ON 8005736733 Starting From 5K t...
VIP Model Call Girls Hinjewadi ( Pune ) Call ON 8005736733 Starting From 5K t...
 
BDSM⚡Call Girls in Mandawali Delhi >༒8448380779 Escort Service
BDSM⚡Call Girls in Mandawali Delhi >༒8448380779 Escort ServiceBDSM⚡Call Girls in Mandawali Delhi >༒8448380779 Escort Service
BDSM⚡Call Girls in Mandawali Delhi >༒8448380779 Escort Service
 

Leveraging Apache Spark for Scalable Data Prep and Inference in Deep Learning

  • 1. LEVERAGING SPARK FOR SCALABLE DATA PREP AND INFERENCE IN DEEP LEARNING James Nguyen Data & AI Cloud Solution Architect, Microsoft
  • 2. Agenda INTRODUCTION LARGE SCALE DATA PREPARATION FOR DEEP LEARNING SCALABLE DEEP LEARNING INFERENCE IN SPARK • PANDAS UDFS • SPARK’S BINARY AND TENSORFLOW FORMATS SUPPORT • SCORE EXTERNALLY HOSTED ML MODEL • LOAD AND SCORE DL ML MODEL WITHIN SPARK
  • 3. Introduction ◦ Distributed Deep Learning Frameworks scale Deep Learning well when input data elements are independent, allowing parallel processing to start immediately ◦ However preprocessing and featurization steps, crucial to Deep Learning development, might involve complex business logic with computations across multiple data elements ◦ In addition, support for Batch Inference is limited compared to Online Inference. ◦ We can leverage new features in Spark 3.0 about support for binary data and new Pandas UDFs to address these gaps Different stages in ML development. Stages in aqua green can be offloaded to Spark Data transformation Inference Featurization Model Ready Data Training and Testing Deployment Collect data Ingest and transform data Data source 1 Data source 2 Data source 3
  • 4. Using Spark to Accelerate Data Prep and Featurization Data prep and featurization in Deep Learning pipeline 1.Data acquisition and initial transformation 2. Data preparation for ML task 3. Featurization 4. ML training Data query/extraction tools Single node Python Pandas Traditional way Multi-node Spark pipeline: Query + Transformation + Featurization Tensorflow/Pytorch data APIs Combine tools and scale out with Spark
  • 5. Pandas UDF Spark’s Pandas UDFs: Parallelizing Python Computation Input Spark DataFrame Grouped/split into parallel batches Transformation/ scoring logicInput: Pandas DF/Series Output: Pandas DF/Series Output Spark DataFrame Transformation /scoring logic Transformation /scoring logic …. Vectorized operation: Pyarrow convert from JVM data to Python DF/Series
  • 6. Spark’s Pandas UDFs: Types and Performance ◦ Scalar UDFs ◦ Column values are split into batches of Pandas series to pass to UDF ◦ UDF also returns Pandas Series ◦ Good for direct parallel column values computation ◦ Grouped map UDFs ◦ Implements split-apply-pattern: Group by each column value to form Pandas DataFrames then pass on to UDF ◦ Returns Pandas DataFrame ◦ All data of a group-by value is loaded into memory ◦ Scalar iterator UDFs (Spark 3.0) ◦ Same with Scala UDF except: ◦ UDF takes iterator of batches instead of single batch ◦ Return iterator of batches or yield batches ◦ Good for initializing some state (e.g. load ML model) Pandas UDFs perform much better than row-at-a-time UDFs across the board, ranging from 3x to over 100x (source databricks.com)
  • 7. ML Training Spark’s Binary Data and Tensorflow’s TFrecords Formats support Binary files (image, audio..) spark.read.format("binaryFile“) Import custom libraries Transformation/scoring Pandas UDF • Reading binary data using Binary Files type Spark DataFrame • Select binary content column into a UDF function to extract feature • Select other columns such as file path into another UDF as needed (for example to create a label column from the filename) • Inside UDF function, import needed libraries to extract features from binary data
  • 8. Scaling Up Data Prep Example 1: Multivariate Time Series Classification • ML model to predict customer churn based on their historical interaction (events) • Each event is a multivariate entity with attributes in categorical, numeric, embedding… • Each training example is a fixed window of 14 days and the outcome(churn vs. stay). Challenges: - There can be millions of customer - Each customer may have a long history - Each history need to generate 100x pairs of training examples with computation needed to build features - Result is billions of records and it would take days to run in a single node vs. 2 hours in a 30 nodes clusters Data preprocessing plan
  • 9. Scaling Up Data Prep Example 1: Multivariate Time Series Classification (cont.) Read input data from sources and combine Collect event history for each customer In each customer history generate overlapping windows Within each window, generate and compute features Output data for training Spark SQL to select data from sources Group by customer Df.groupby(“customer”) @pandas_udf(pandas_dec_str, PandasUDFType.GROUPED_MAP) output_df. orderBy(rand()).repartition(10 0).write.format("tfrecords")
  • 10. Scaling Up Data Prep Example 2: Speech Recognition • Use deep learning to recognize speech from audio data • Data is in the form of audio files in wav format. Large volume training requires hundred thousands clips and together with data augmentation can result in millions of training example • Processing is computing intensive with audio libraries ML Training Wave files spark.read.format("binaryFile“) Process core binary content using librosa and extract spectrogram as features Pandas UDF 1 Get input file path and extract file name and look up index position in a label list Pandas UDF 2
  • 11. Using Spark for large scale batch inference Big dataset Distributed Data preprocessing Distributed scoring Calling externally hosted APIs Loading model and score Result dataset Spark is very good for regular map reduce style processing. The same advantage can apply for ML batch inference Hosted ML Service ML Model
  • 12. Load Model and Score within Spark Model distribution (sparkcontext.addfile() or store model file at shared storage) Input Spark DataFrame Pandas Scalar UDFs Scoring Output: PD Series Input: PD Series Model Loading Pandas Scalar Iterator UDF (recommended for Deep Learning) ScoringInput: Pandas DF/Series Model LoadingInput: Iterator of Series Output: Iterator of Series or yield Series • Model loading can be done from model file cached at worker machine by addfile() method or from shared cloud storage • Pandas Scala Iterator UDF flavor reduces the frequency of loading deep ML model which can be an expensive operation Deep Learning model is large in size and is not serializable, so broadcast won’t work
  • 13. Load Model and Score within Spark- Code Example
  • 14. Calling External APIs in an UDF Input Spark DataFrame Pandas Scalar UDF Http Post Output: PD Series Input: PD Series Model Loading Hosted ML Service Pandas Scalar Iterator UDF Http Post Output: PD Series Input: PD Series Model Loading Batch input Batch output Batch input Batch output
  • 16. References ◦ Lee Yang, Jun Shi, Bobbie Chern, and Andy Feng (@afeng76), Yahoo Big ML team, Distributed Deep Learning on Big-Data Clusters, 2017 ◦ Databricks,Spark Deep Learning Pipeline , 2017. ◦ Apache Spark Org, Pandas UDF, 2017 ◦ Databricks & Apache Spark Org, Pandas UDF Scalar Iterator, 2019. ◦ Databricks & Apache Spark Org, Spark binaryFiles DataFrame, 2019. ◦ Tensorflow team, Spark Tensorflow connector, 2016
  • 17. THANK YOU! Your feedback is important to us. Don’t forget to rate and review the sessions.