1. Leveraging Spark for Scalable Data Prep and Inference in Deep Learning
James Nguyen
Data & AI Cloud Solution Architect, Microsoft
2. Agenda
• INTRODUCTION
• LARGE SCALE DATA PREPARATION FOR DEEP LEARNING
• SCALABLE DEEP LEARNING INFERENCE IN SPARK
  • PANDAS UDFS
  • SPARK’S BINARY AND TENSORFLOW FORMATS SUPPORT
  • SCORE EXTERNALLY HOSTED ML MODEL
  • LOAD AND SCORE DL ML MODEL WITHIN SPARK
3. Introduction
◦ Distributed Deep Learning frameworks scale Deep Learning well when input data elements are independent, allowing parallel processing to start immediately
◦ However, preprocessing and featurization steps, crucial to Deep Learning development, may involve complex business logic with computations across multiple data elements
◦ In addition, support for batch inference is limited compared to online inference
◦ We can leverage new features in Spark 3.0, namely binary data support and new Pandas UDFs, to address these gaps
[Diagram] Different stages in ML development: collect data from data sources 1–3, ingest and transform data, data transformation, featurization, model-ready data, training and testing, inference, deployment. Stages in aqua green can be offloaded to Spark.
4. Using Spark to Accelerate Data Prep and Featurization
Data prep and featurization in a Deep Learning pipeline:
1. Data acquisition and initial transformation
2. Data preparation for ML task
3. Featurization
4. ML training
Traditional way: data query/extraction tools → single-node Python/Pandas → Tensorflow/Pytorch data APIs
Combine tools and scale out with Spark: multi-node Spark pipeline (query + transformation + featurization) → Tensorflow/Pytorch data APIs
6. Spark’s Pandas UDFs: Types and Performance
◦ Scalar UDFs
◦ Column values are split into batches of Pandas Series to pass to the UDF
◦ The UDF also returns a Pandas Series
◦ Good for direct parallel computation on column values
◦ Grouped map UDFs
◦ Implement the split-apply-combine pattern: group by a column value to form Pandas DataFrames, then pass them on to the UDF
◦ Return a Pandas DataFrame
◦ All data of a group-by value is loaded into memory
◦ Scalar iterator UDFs (Spark 3.0)
◦ Same as scalar UDFs except:
◦ The UDF takes an iterator of batches instead of a single batch
◦ Returns an iterator of batches, or yields batches
◦ Good for initializing some state once (e.g. loading an ML model)
Pandas UDFs perform much better than row-at-a-time UDFs across the board, ranging from 3x to over 100x (source: databricks.com)
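A minimal sketch of the scalar and scalar-iterator shapes described above, exercised with plain pandas so it runs without a Spark cluster. In Spark, each function body would be registered with `pandas_udf`; the function names and the trivial arithmetic are illustrative only.

```python
import pandas as pd

# Scalar UDF body: receives one batch of column values as a pandas Series
# and must return a Series of the same length.
def times_two(batch: pd.Series) -> pd.Series:
    return batch * 2.0

# Scalar iterator UDF body: receives an iterator of batches, so expensive
# one-time setup (e.g. loading a model) can run once per task rather than
# once per batch.
def times_weight(batches):
    weight = 2.0  # stand-in for expensive state, e.g. a loaded ML model
    for batch in batches:
        yield batch * weight

# Local demonstration on plain pandas Series (no Spark required):
out = times_two(pd.Series([1.0, 2.0, 3.0]))
it_out = list(times_weight(iter([pd.Series([1.0]), pd.Series([4.0])])))
```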
7. ML Training
Spark’s Binary Data and Tensorflow’s TFRecords Formats Support
Binary files (image, audio, ...) → spark.read.format("binaryFile") → import custom libraries → transformation/scoring with Pandas UDFs
• Read binary data as a binary-files-type Spark DataFrame
• Select the binary content column into a UDF to extract features
• Select other columns such as the file path into another UDF as needed (for example, to create a label column from the filename)
• Inside the UDF, import the needed libraries to extract features from the binary data
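The filename-to-label step can be sketched as a plain pandas function; in Spark it would be wrapped as a Pandas UDF over the `path` column of the binaryFile DataFrame. The `LABELS` vocabulary and the `label_xxxx.png` filename convention here are assumptions for illustration.

```python
import os
import pandas as pd

# Hypothetical label vocabulary; in practice it would come from the dataset.
LABELS = ["cat", "dog", "bird"]

# UDF body: map a batch of file paths to integer label indices by parsing
# the filename, assumed to look like ".../cat_0001.png".
def path_to_label(paths: pd.Series) -> pd.Series:
    names = paths.map(lambda p: os.path.basename(p).split("_")[0])
    return names.map(LABELS.index)

labels = path_to_label(pd.Series(["/data/cat_0001.png", "/data/dog_0002.png"]))
```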
8. Scaling Up Data Prep Example 1: Multivariate Time Series Classification
• ML model to predict customer churn based on historical interactions (events)
• Each event is a multivariate entity with categorical, numeric, and embedding attributes
• Each training example is a fixed window of 14 days plus the outcome (churn vs. stay)
Challenges:
- There can be millions of customers
- Each customer may have a long history
- Each history needs to generate on the order of 100 pairs of training examples, with computation needed to build features
- The result is billions of records; it would take days to run on a single node vs. 2 hours on a 30-node cluster
Data preprocessing plan
9. Scaling Up Data Prep Example 1: Multivariate Time Series Classification (cont.)
Pipeline: read input data from sources and combine → collect the event history for each customer → in each customer history, generate overlapping windows → within each window, generate and compute features → output data for training
• Spark SQL to select data from sources
• Group by customer: df.groupBy("customer")
• Windowing and featurization in a grouped map UDF: @pandas_udf(pandas_dec_str, PandasUDFType.GROUPED_MAP)
• Output: output_df.orderBy(rand()).repartition(100).write.format("tfrecords")
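The grouped-map step above can be sketched as a plain pandas function over one customer's events; in Spark it would be applied via `df.groupBy("customer").apply(...)` with a matching output schema. The column names (`customer`, `day`, `value`), the 7-day stride, and the mean-value feature are assumptions for illustration; only the 14-day window comes from the slides.

```python
import pandas as pd

WINDOW_DAYS = 14  # fixed window length from the example
STEP_DAYS = 7     # assumed stride between overlapping windows

# Grouped map UDF body: receives all events of one customer as a pandas
# DataFrame and returns one feature row per overlapping 14-day window.
def make_windows(events: pd.DataFrame) -> pd.DataFrame:
    events = events.sort_values("day")
    rows = []
    start = int(events["day"].min())
    last = int(events["day"].max())
    while start + WINDOW_DAYS <= last + 1:
        win = events[(events["day"] >= start) & (events["day"] < start + WINDOW_DAYS)]
        rows.append({
            "customer": events["customer"].iloc[0],
            "window_start": start,
            "mean_value": win["value"].mean(),  # stand-in feature
            "n_events": len(win),
        })
        start += STEP_DAYS
    return pd.DataFrame(rows)

# One customer with 21 days of events yields two overlapping windows:
demo = pd.DataFrame({
    "customer": ["c1"] * 21,
    "day": list(range(21)),
    "value": [1.0] * 21,
})
windows = make_windows(demo)
```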
10. Scaling Up Data Prep Example 2: Speech Recognition
• Use deep learning to recognize speech from audio data
• Data is in the form of audio files in WAV format. Large scale training requires hundreds of thousands of clips, which together with data augmentation can result in millions of training examples
• Processing is compute-intensive with audio libraries
Pipeline: wave files → spark.read.format("binaryFile") → Pandas UDF 1 (process the binary content using librosa and extract a spectrogram as features) → Pandas UDF 2 (get the input file path, extract the file name, and look up its index position in a label list) → ML training
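The slides use librosa for feature extraction; as a self-contained stand-in, here is a minimal magnitude-spectrogram computation with numpy over framed audio, approximating what the first UDF body would do after decoding the WAV bytes from the binaryFile `content` column. The frame and hop sizes are arbitrary choices for illustration.

```python
import numpy as np

FRAME = 256  # assumed frame length in samples
HOP = 128    # assumed hop between frames

# Simplified stand-in for the librosa-based UDF body: turn a 1-D audio
# signal into a magnitude spectrogram (frames x frequency bins).
def spectrogram(signal: np.ndarray) -> np.ndarray:
    n_frames = 1 + (len(signal) - FRAME) // HOP
    frames = np.stack([signal[i * HOP : i * HOP + FRAME] for i in range(n_frames)])
    window = np.hanning(FRAME)  # taper each frame before the FFT
    return np.abs(np.fft.rfft(frames * window, axis=1))

spec = spectrogram(np.sin(np.linspace(0, 100, 1024)))
```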
11. Using Spark for Large Scale Batch Inference
Big dataset → distributed data preprocessing → distributed scoring (either calling externally hosted APIs on a hosted ML service, or loading the ML model and scoring within Spark) → result dataset
Spark is very good at regular map-reduce-style processing. The same advantage applies to ML batch inference.
12. Load Model and Score within Spark
Model distribution: sparkContext.addFile() or store the model file on shared storage
◦ Pandas Scalar UDF: input PD Series → model loading + scoring → output PD Series
◦ Pandas Scalar Iterator UDF (recommended for Deep Learning): input iterator of Series → model loading once, then scoring of Pandas DF/Series batches → output iterator of Series (or yield Series)
• Model loading can be done from a model file cached on the worker machine by the addFile() method, or from shared cloud storage
• The Pandas Scalar Iterator UDF flavor reduces the frequency of loading the deep ML model, which can be an expensive operation
• A Deep Learning model is large in size and is not serializable, so broadcast won't work
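Why the iterator flavor loads the model only once can be seen in a small stdlib-only sketch: setup code before the loop runs a single time per task, while the per-batch work runs for every batch. `load_model` is a stub standing in for an expensive deep-learning model load.

```python
# Count how often the "expensive" load runs.
load_calls = []

def load_model():
    load_calls.append(1)
    return lambda xs: [x * 2 for x in xs]  # trivial stand-in "model"

# Scalar iterator UDF body: load once, then score batch by batch.
def predict_batches(batches):
    model = load_model()  # once per task, not once per batch
    for batch in batches:
        yield model(batch)

scores = list(predict_batches(iter([[1, 2], [3, 4], [5, 6]])))
```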
14. Calling External APIs in a UDF
Input Spark DataFrame → Pandas Scalar UDF or Pandas Scalar Iterator UDF → batch input sent as an HTTP POST to the hosted ML service → batch output → output PD Series
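A sketch of a scalar-UDF body that scores one batch against an external service. The HTTP call is injected as `post_fn` (in Spark it would be, e.g., `requests.post` against the hosted ML service's scoring URL) so the example stays self-contained; the `instances`/`predictions` JSON shape is an assumption for illustration.

```python
import json

# UDF body: serialize the batch, "POST" it, and parse the predictions.
def score_batch(values, post_fn):
    payload = json.dumps({"instances": list(values)})
    response = post_fn(payload)  # would be an HTTP POST in the real pipeline
    return json.loads(response)["predictions"]

# Stub service standing in for the hosted ML model: increments each value.
def fake_service(payload):
    instances = json.loads(payload)["instances"]
    return json.dumps({"predictions": [v + 1 for v in instances]})

preds = score_batch([1, 2, 3], fake_service)
```

Batching matters here: one POST per pandas batch amortizes the HTTP round trip over many rows, instead of one request per row as a row-at-a-time UDF would make.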