SlideShare una empresa de Scribd logo
1 de 43
Descargar para leer sin conexión
Project Hydrogen: Unifying State-of-the-art
Big Data and AI in Apache Spark
Xiangrui Meng
2018/07
Spark: unified analytics engine for big data
“Apache Spark is the Taylor Swift of big data software.” - Fortune
500,000
meetup
members
4,000
summit
attendees
AI: re-shaping the world
“Artificial Intelligence is the New Electricity.”  -  Andrew Ng
Transportation Healthcare and Genomics Internet of Things Fraud Prevention Personalization
Get started with Spark + AI?
Try Databricks Community Edition: http://databricks.com/try
Two communities: big data and AI
Significant progress has been made by both big data and AI
communities in recent years to push the cutting edge:
more complex
big data
scenarios
more complex
deep learning
scenarios
The cross?
6
Map/Reduce
CaffeOnSpark
TensorFlowOnSpark
DataFrame-based APIs
50+ Data Sources
Python/Java/R interfaces
Structured Streaming
ML Pipelines API
Continuous
Processing
RDD
Project Tungsten
Pandas UDF
TensorFrames
scikit-learn
pandas/numpy/scipy
LIBLINEAR
R
glmnet
xgboost
GraphLab
Caffe/PyTorch/MXNet
TensorFlow
Keras
Distributed
TensorFlow
Horovod
tf.data
tf.transform
AI/ML
??
TF XLA
Neither big data nor AI is easy
“Hidden Technical Debt in Machine Learning Systems,” Google NIPS 2015
Figure 1: Only a small fraction of real-world ML systems is composed of the ML code, as shown by the
small green box in the middle. The required surrounding infrastructure is vast and complex.
ML
Code
Configuration
Data Collection
Data
Verification
Feature
Extraction
Machine
Resource
Management
Analysis Tools
Process
Management Tools
Serving
Infrastructure
Monitoring
Big data for AI
A common goal of collecting, processing, and managing big data
is to extract value from it. AI/ML is the most mentioned approach
in our conversations with customers.
improved customer engagement and conversions by
improving image classification (case study)
increased customer satisfaction, retention, and lifetime value
by detecting abusive language in real-time (case study)
Big data for AI
There are many efforts from the Spark community to integrate
Spark with AI/ML frameworks:
● (Yahoo) CaffeOnSpark, TensorFlowOnSpark
● (Intel) BigDL
● (John Snow Labs) Spark-NLP
● (Databricks) spark-sklearn, tensorframes, spark-deep-learning
● … 80+ ML/AI packages on spark-packages.org
AI needs big data
One cannot apply AI techniques without data. And DL models
scale with amount of data. We have seen efforts from the AI
community to handle different data scenarios:
● tf.data, tf.Transform
● spark-tensorflow-connector
● ...
source: Andrew Ng
When AI goes distributed ...
When datasets get bigger and bigger, we see more and more
distributed training scenarios and open-source offerings, e.g.,
distributed TensorFlow, Horovod, and distributed MXNet.
This is where we see Spark and AI efforts overlap more.
The status quo: two simple stories
As a data scientist, I can:
● build a pipeline that fetches training events from a production
data warehouse and trains a DL model in parallel;
● apply a trained DL model to a distributed stream of events and
enrich it with predicted labels.
Distributed training
data warehouse load fit model
Required: Be able to read from
Databricks Delta, Parquet,
MySQL, Hive, etc.
Answer: Apache Spark
Required: distributed GPU cluster
for fast training
Answer: Horovod, Distributed
Tensorflow, etc
Databricks
Delta
Two separate data and AI clusters?
load using a Spark
cluster
fit on a GPU
cluster
model
save data
required: glue code
Streaming model inference
Kafka load predict model
required:
● save to stream sink
● GPU for fast inference
A hybrid Spark and AI cluster
load using a Spark
cluster w/ GPUs
fit a model
distributedly
on the same
cluster
model
load using a Spark
cluster w/ GPUs
predict w/ GPUs as
a Spark task
model
Unfortunately, it doesn’t work out of the box.
(Demo that a distributed DL job needs to run as a Spark job to
avoid crashing into other Spark workloads on the same cluster.)
18
Different execution models
Task 1
Task 2
Task 3
Spark
Tasks are independent of each other
Embarrassingly parallel & massively scalable
Distributed Training
Complete coordination among tasks
Optimized for communication
19
Different execution models
Task 1
Task 2
Task 3
Spark
Tasks are independent of each other
Embarrassingly parallel & massively scalable
If one crashes…
Distributed Training
Complete coordination among tasks
Optimized for communication
20
Different execution models
Task 1
Task 2
Task 3
Spark
Tasks are independent of each other
Embarrassingly parallel & massively scalable
If one crashes, rerun that one
Distributed Training
Complete coordination among tasks
Optimized for communication
If one crashes, must rerun all tasks
21
Project Hydrogen: Spark + AI
Barrier
Execution
Mode
Optimized
Data
Exchange
Accelerator
Aware
Scheduling
22
Barrier execution mode
We introduce gang scheduling to Spark on top of MapReduce
execution model. So a distributed DL job can run as a Spark job.
● It starts all tasks together.
● It provides sufficient info and tooling to run a hybrid distributed job.
● It cancels and restarts all tasks in case of failures.
Umbrella JIRA: SPARK-24374 (ETA: Spark 2.4, 3.0)
23
RDD.barrier()
RDD.barrier() tells Spark to launch the tasks together.
rdd.barrier().mapPartitions { iter =>
val context = BarrierTaskContext.get()
...
}
24
context.barrier()
context.barrier() places a global barrier and waits until all tasks in
this stage hit this barrier.
val context = BarrierTaskContext.get()
… // write partition data out
context.barrier()
25
context.getTaskInfos()
context.getTaskInfos() returns info about all tasks in this stage.
if (context.partitionId == 0) {
val hosts = context.getTaskInfos().map(_.host)
... // start a hybrid training job, e.g., via MPI
}
context.barrier() // wait until training finishes
26
Cluster manager support
YARN
SPARK-24723
Kubernetes
SPARK-24724
Mesos
SPARK-24725
In Spark standalone mode, users need passwordless SSH among
workers to run a hybrid MPI job. The community is working on the
support of other cluster managers.
27
Demo
(Demo barrier mode prototype, where a distributed DL job runs
as a Spark job and does not crash into other Spark workloads.)
28
Optimized data exchange (SPIP)
None of the integrations are possible without exchanging data
between Spark and AI frameworks. And performance matters. We
proposed using a standard interface for data exchange to simplify
the integrations without introducing much overhead.
SPIP JIRA: SPARK-24579 (pending vote, ETA 3.0)
29
Sources × destinations
Parquet
Avro
SequenceFile
MySQL
ORC
...
TFRecords
CSV
LIBSVM
HDF5
NetCDF
...
×
30
Sources + 2 + destinations
Parquet
Avro
SequenceFile
MySQL
ORC
...
TFRecords
CSV
LIBSVM
HDF5
NetCDF
...
+
Apache
Spark
+
Apache
Arrow
+
31
From source to destination
Parquet
file
Spark tf.Dataset
32
From source to destination: a long path
Parquet
file
Spark
Tungsten
format in
JVM
Arrow
buffer in
JVM
native
Parquet
reader
Arrow
buffer in
Python
data in
GPU
Pandastf.Dataset
33
Shortcuts
Parquet
file
Spark
Tungsten
format in
JVM
Arrow
buffer in
JVM
native
Parquet
reader
Arrow
buffer in
Python
data in
GPU
Pandastf.Dataset
34
Vectorized computation
Though Spark uses a columnar storage format, the public user
access interface is row-based.
df.toArrowRDD.map { batches =>
... // vectorized computation
}
35
Generalize Pandas UDF to Scala/Java
Pandas UDF was introduced in Spark 2.3, which uses Arrow for
data exchange and utilizes Pandas for vectorized computation.
36
Data pipelining
CPU GPU
t1 fetch batch #1
t2 fetch batch #2 process batch #1
t3 fetch batch #3 process batch #2
t4 process batch #3
CPU GPU
t1 fetch batch #1
t2 process batch #1
t3 fetch batch #2
t4 process batch #2
t5 fetch batch #3
t6 process batch #3
(pipelining)
37
Accelerator-aware scheduling (SPIP)
To utilize accelerators (GPUs, FPGAs) in a heterogeneous cluster
or to utilize multiple accelerators in a multi-task node, Spark
needs to understand the accelerators installed on each node.
SPIP JIRA: SPARK-24615 (pending vote, ETA: 3.0)
38
Request accelerators
With accelerator awareness, users can specify accelerators
constraints or hints (API pending discussion):
rdd.accelerated
.by(“/gpu/p100”)
.numPerTask(2)
.required // or .optional
39
Multiple tasks on the same node
When multiples tasks are scheduled on the same node with
multiple GPUs, each task knows which GPUs are assigned to avoid
crashing into each other (API pending discussion):
// inside a task closure
val gpus = context.getAcceleratorInfos()
40
Cluster manager support
In Spark standalone mode, we can list accelerators in worker conf
and aggregate the information for scheduling. We also need to
provide support for other cluster managers:
YARN Kubernetes Mesos
41
Timeline
Spark 2.4
● barrier execution mode (basic scenarios)
● (Databricks) horovod integration w/ barrier mode
Spark 3.0
● barrier execution mode
● optimized data exchange
● accelerator-aware scheduling
42
Acknowledgement
● Many ideas in Project Hydrogen are based on previous
community work: TensorFrames, BigDL, Apache Arrow,
Pandas UDF, Spark GPU support, MPI, etc.
● We would like to thank many Spark committers and
contributors who helped the project proposal, design, and
implementation.
Thank you!

Más contenido relacionado

La actualidad más candente

Performance Optimizations in Apache Impala
Performance Optimizations in Apache ImpalaPerformance Optimizations in Apache Impala
Performance Optimizations in Apache Impala
Cloudera, Inc.
 
Hive + Tez: A Performance Deep Dive
Hive + Tez: A Performance Deep DiveHive + Tez: A Performance Deep Dive
Hive + Tez: A Performance Deep Dive
DataWorks Summit
 

La actualidad más candente (20)

What is new in Apache Hive 3.0?
What is new in Apache Hive 3.0?What is new in Apache Hive 3.0?
What is new in Apache Hive 3.0?
 
Productizing Structured Streaming Jobs
Productizing Structured Streaming JobsProductizing Structured Streaming Jobs
Productizing Structured Streaming Jobs
 
Hadoop And Their Ecosystem ppt
 Hadoop And Their Ecosystem ppt Hadoop And Their Ecosystem ppt
Hadoop And Their Ecosystem ppt
 
Parquet performance tuning: the missing guide
Parquet performance tuning: the missing guideParquet performance tuning: the missing guide
Parquet performance tuning: the missing guide
 
Apache HBase™
Apache HBase™Apache HBase™
Apache HBase™
 
Performance Optimizations in Apache Impala
Performance Optimizations in Apache ImpalaPerformance Optimizations in Apache Impala
Performance Optimizations in Apache Impala
 
Hadoop MapReduce Fundamentals
Hadoop MapReduce FundamentalsHadoop MapReduce Fundamentals
Hadoop MapReduce Fundamentals
 
Pyspark Tutorial | Introduction to Apache Spark with Python | PySpark Trainin...
Pyspark Tutorial | Introduction to Apache Spark with Python | PySpark Trainin...Pyspark Tutorial | Introduction to Apache Spark with Python | PySpark Trainin...
Pyspark Tutorial | Introduction to Apache Spark with Python | PySpark Trainin...
 
Apache Spark Tutorial | Spark Tutorial for Beginners | Apache Spark Training ...
Apache Spark Tutorial | Spark Tutorial for Beginners | Apache Spark Training ...Apache Spark Tutorial | Spark Tutorial for Beginners | Apache Spark Training ...
Apache Spark Tutorial | Spark Tutorial for Beginners | Apache Spark Training ...
 
Introduction to Machine Learning with Spark
Introduction to Machine Learning with SparkIntroduction to Machine Learning with Spark
Introduction to Machine Learning with Spark
 
Introduction to Scala
Introduction to ScalaIntroduction to Scala
Introduction to Scala
 
Hive + Tez: A Performance Deep Dive
Hive + Tez: A Performance Deep DiveHive + Tez: A Performance Deep Dive
Hive + Tez: A Performance Deep Dive
 
Introduction to Apache Hive
Introduction to Apache HiveIntroduction to Apache Hive
Introduction to Apache Hive
 
Apache Spark Data Source V2 with Wenchen Fan and Gengliang Wang
Apache Spark Data Source V2 with Wenchen Fan and Gengliang WangApache Spark Data Source V2 with Wenchen Fan and Gengliang Wang
Apache Spark Data Source V2 with Wenchen Fan and Gengliang Wang
 
Hadoop and Spark
Hadoop and SparkHadoop and Spark
Hadoop and Spark
 
What is Hadoop | Introduction to Hadoop | Hadoop Tutorial | Hadoop Training |...
What is Hadoop | Introduction to Hadoop | Hadoop Tutorial | Hadoop Training |...What is Hadoop | Introduction to Hadoop | Hadoop Tutorial | Hadoop Training |...
What is Hadoop | Introduction to Hadoop | Hadoop Tutorial | Hadoop Training |...
 
What is Apache Spark | Apache Spark Tutorial For Beginners | Apache Spark Tra...
What is Apache Spark | Apache Spark Tutorial For Beginners | Apache Spark Tra...What is Apache Spark | Apache Spark Tutorial For Beginners | Apache Spark Tra...
What is Apache Spark | Apache Spark Tutorial For Beginners | Apache Spark Tra...
 
Hadoop Installation presentation
Hadoop Installation presentationHadoop Installation presentation
Hadoop Installation presentation
 
Understanding and Improving Code Generation
Understanding and Improving Code GenerationUnderstanding and Improving Code Generation
Understanding and Improving Code Generation
 
Introduction to Shell script
Introduction to Shell scriptIntroduction to Shell script
Introduction to Shell script
 

Similar a Project Hydrogen: State-of-the-Art Deep Learning on Apache Spark

Similar a Project Hydrogen: State-of-the-Art Deep Learning on Apache Spark (20)

Project Hydrogen: Unifying State-of-the-Art AI and Big Data in Apache Spark w...
Project Hydrogen: Unifying State-of-the-Art AI and Big Data in Apache Spark w...Project Hydrogen: Unifying State-of-the-Art AI and Big Data in Apache Spark w...
Project Hydrogen: Unifying State-of-the-Art AI and Big Data in Apache Spark w...
 
Big Data LDN 2018: PROJECT HYDROGEN: UNIFYING AI WITH APACHE SPARK
Big Data LDN 2018: PROJECT HYDROGEN: UNIFYING AI WITH APACHE SPARKBig Data LDN 2018: PROJECT HYDROGEN: UNIFYING AI WITH APACHE SPARK
Big Data LDN 2018: PROJECT HYDROGEN: UNIFYING AI WITH APACHE SPARK
 
Updates from Project Hydrogen: Unifying State-of-the-Art AI and Big Data in A...
Updates from Project Hydrogen: Unifying State-of-the-Art AI and Big Data in A...Updates from Project Hydrogen: Unifying State-of-the-Art AI and Big Data in A...
Updates from Project Hydrogen: Unifying State-of-the-Art AI and Big Data in A...
 
Apache Spark Introduction.pdf
Apache Spark Introduction.pdfApache Spark Introduction.pdf
Apache Spark Introduction.pdf
 
An introduction To Apache Spark
An introduction To Apache SparkAn introduction To Apache Spark
An introduction To Apache Spark
 
Updates from Project Hydrogen: Unifying State-of-the-Art AI and Big Data in A...
Updates from Project Hydrogen: Unifying State-of-the-Art AI and Big Data in A...Updates from Project Hydrogen: Unifying State-of-the-Art AI and Big Data in A...
Updates from Project Hydrogen: Unifying State-of-the-Art AI and Big Data in A...
 
Tiny Batches, in the wine: Shiny New Bits in Spark Streaming
Tiny Batches, in the wine: Shiny New Bits in Spark StreamingTiny Batches, in the wine: Shiny New Bits in Spark Streaming
Tiny Batches, in the wine: Shiny New Bits in Spark Streaming
 
Unified Big Data Processing with Apache Spark (QCON 2014)
Unified Big Data Processing with Apache Spark (QCON 2014)Unified Big Data Processing with Apache Spark (QCON 2014)
Unified Big Data Processing with Apache Spark (QCON 2014)
 
Big Data Analytics and Ubiquitous computing
Big Data Analytics and Ubiquitous computingBig Data Analytics and Ubiquitous computing
Big Data Analytics and Ubiquitous computing
 
Lightening Fast Big Data Analytics using Apache Spark
Lightening Fast Big Data Analytics using Apache SparkLightening Fast Big Data Analytics using Apache Spark
Lightening Fast Big Data Analytics using Apache Spark
 
Apache spark - History and market overview
Apache spark - History and market overviewApache spark - History and market overview
Apache spark - History and market overview
 
Unified Big Data Processing with Apache Spark
Unified Big Data Processing with Apache SparkUnified Big Data Processing with Apache Spark
Unified Big Data Processing with Apache Spark
 
Data Analytics and Machine Learning: From Node to Cluster on ARM64
Data Analytics and Machine Learning: From Node to Cluster on ARM64Data Analytics and Machine Learning: From Node to Cluster on ARM64
Data Analytics and Machine Learning: From Node to Cluster on ARM64
 
BKK16-404B Data Analytics and Machine Learning- from Node to Cluster
BKK16-404B Data Analytics and Machine Learning- from Node to ClusterBKK16-404B Data Analytics and Machine Learning- from Node to Cluster
BKK16-404B Data Analytics and Machine Learning- from Node to Cluster
 
BKK16-408B Data Analytics and Machine Learning From Node to Cluster
BKK16-408B Data Analytics and Machine Learning From Node to ClusterBKK16-408B Data Analytics and Machine Learning From Node to Cluster
BKK16-408B Data Analytics and Machine Learning From Node to Cluster
 
Project Hydrogen, HorovodRunner, and Pandas UDF: Distributed Deep Learning Tr...
Project Hydrogen, HorovodRunner, and Pandas UDF: Distributed Deep Learning Tr...Project Hydrogen, HorovodRunner, and Pandas UDF: Distributed Deep Learning Tr...
Project Hydrogen, HorovodRunner, and Pandas UDF: Distributed Deep Learning Tr...
 
Big data distributed processing: Spark introduction
Big data distributed processing: Spark introductionBig data distributed processing: Spark introduction
Big data distributed processing: Spark introduction
 
Fast Data Analytics with Spark and Python
Fast Data Analytics with Spark and PythonFast Data Analytics with Spark and Python
Fast Data Analytics with Spark and Python
 
Spark Driven Big Data Analytics
Spark Driven Big Data AnalyticsSpark Driven Big Data Analytics
Spark Driven Big Data Analytics
 
Running Emerging AI Applications on Big Data Platforms with Ray On Apache Spark
Running Emerging AI Applications on Big Data Platforms with Ray On Apache SparkRunning Emerging AI Applications on Big Data Platforms with Ray On Apache Spark
Running Emerging AI Applications on Big Data Platforms with Ray On Apache Spark
 

Más de Databricks

Democratizing Data Quality Through a Centralized Platform
Democratizing Data Quality Through a Centralized PlatformDemocratizing Data Quality Through a Centralized Platform
Democratizing Data Quality Through a Centralized Platform
Databricks
 
Stage Level Scheduling Improving Big Data and AI Integration
Stage Level Scheduling Improving Big Data and AI IntegrationStage Level Scheduling Improving Big Data and AI Integration
Stage Level Scheduling Improving Big Data and AI Integration
Databricks
 
Simplify Data Conversion from Spark to TensorFlow and PyTorch
Simplify Data Conversion from Spark to TensorFlow and PyTorchSimplify Data Conversion from Spark to TensorFlow and PyTorch
Simplify Data Conversion from Spark to TensorFlow and PyTorch
Databricks
 
Raven: End-to-end Optimization of ML Prediction Queries
Raven: End-to-end Optimization of ML Prediction QueriesRaven: End-to-end Optimization of ML Prediction Queries
Raven: End-to-end Optimization of ML Prediction Queries
Databricks
 
Processing Large Datasets for ADAS Applications using Apache Spark
Processing Large Datasets for ADAS Applications using Apache SparkProcessing Large Datasets for ADAS Applications using Apache Spark
Processing Large Datasets for ADAS Applications using Apache Spark
Databricks
 

Más de Databricks (20)

DW Migration Webinar-March 2022.pptx
DW Migration Webinar-March 2022.pptxDW Migration Webinar-March 2022.pptx
DW Migration Webinar-March 2022.pptx
 
Data Lakehouse Symposium | Day 1 | Part 1
Data Lakehouse Symposium | Day 1 | Part 1Data Lakehouse Symposium | Day 1 | Part 1
Data Lakehouse Symposium | Day 1 | Part 1
 
Data Lakehouse Symposium | Day 1 | Part 2
Data Lakehouse Symposium | Day 1 | Part 2Data Lakehouse Symposium | Day 1 | Part 2
Data Lakehouse Symposium | Day 1 | Part 2
 
Data Lakehouse Symposium | Day 2
Data Lakehouse Symposium | Day 2Data Lakehouse Symposium | Day 2
Data Lakehouse Symposium | Day 2
 
Data Lakehouse Symposium | Day 4
Data Lakehouse Symposium | Day 4Data Lakehouse Symposium | Day 4
Data Lakehouse Symposium | Day 4
 
5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop
5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop
5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop
 
Democratizing Data Quality Through a Centralized Platform
Democratizing Data Quality Through a Centralized PlatformDemocratizing Data Quality Through a Centralized Platform
Democratizing Data Quality Through a Centralized Platform
 
Learn to Use Databricks for Data Science
Learn to Use Databricks for Data ScienceLearn to Use Databricks for Data Science
Learn to Use Databricks for Data Science
 
Why APM Is Not the Same As ML Monitoring
Why APM Is Not the Same As ML MonitoringWhy APM Is Not the Same As ML Monitoring
Why APM Is Not the Same As ML Monitoring
 
The Function, the Context, and the Data—Enabling ML Ops at Stitch Fix
The Function, the Context, and the Data—Enabling ML Ops at Stitch FixThe Function, the Context, and the Data—Enabling ML Ops at Stitch Fix
The Function, the Context, and the Data—Enabling ML Ops at Stitch Fix
 
Stage Level Scheduling Improving Big Data and AI Integration
Stage Level Scheduling Improving Big Data and AI IntegrationStage Level Scheduling Improving Big Data and AI Integration
Stage Level Scheduling Improving Big Data and AI Integration
 
Simplify Data Conversion from Spark to TensorFlow and PyTorch
Simplify Data Conversion from Spark to TensorFlow and PyTorchSimplify Data Conversion from Spark to TensorFlow and PyTorch
Simplify Data Conversion from Spark to TensorFlow and PyTorch
 
Scaling your Data Pipelines with Apache Spark on Kubernetes
Scaling your Data Pipelines with Apache Spark on KubernetesScaling your Data Pipelines with Apache Spark on Kubernetes
Scaling your Data Pipelines with Apache Spark on Kubernetes
 
Scaling and Unifying SciKit Learn and Apache Spark Pipelines
Scaling and Unifying SciKit Learn and Apache Spark PipelinesScaling and Unifying SciKit Learn and Apache Spark Pipelines
Scaling and Unifying SciKit Learn and Apache Spark Pipelines
 
Sawtooth Windows for Feature Aggregations
Sawtooth Windows for Feature AggregationsSawtooth Windows for Feature Aggregations
Sawtooth Windows for Feature Aggregations
 
Redis + Apache Spark = Swiss Army Knife Meets Kitchen Sink
Redis + Apache Spark = Swiss Army Knife Meets Kitchen SinkRedis + Apache Spark = Swiss Army Knife Meets Kitchen Sink
Redis + Apache Spark = Swiss Army Knife Meets Kitchen Sink
 
Re-imagine Data Monitoring with whylogs and Spark
Re-imagine Data Monitoring with whylogs and SparkRe-imagine Data Monitoring with whylogs and Spark
Re-imagine Data Monitoring with whylogs and Spark
 
Raven: End-to-end Optimization of ML Prediction Queries
Raven: End-to-end Optimization of ML Prediction QueriesRaven: End-to-end Optimization of ML Prediction Queries
Raven: End-to-end Optimization of ML Prediction Queries
 
Processing Large Datasets for ADAS Applications using Apache Spark
Processing Large Datasets for ADAS Applications using Apache SparkProcessing Large Datasets for ADAS Applications using Apache Spark
Processing Large Datasets for ADAS Applications using Apache Spark
 
Massive Data Processing in Adobe Using Delta Lake
Massive Data Processing in Adobe Using Delta LakeMassive Data Processing in Adobe Using Delta Lake
Massive Data Processing in Adobe Using Delta Lake
 

Último

%+27788225528 love spells in Boston Psychic Readings, Attraction spells,Bring...
%+27788225528 love spells in Boston Psychic Readings, Attraction spells,Bring...%+27788225528 love spells in Boston Psychic Readings, Attraction spells,Bring...
%+27788225528 love spells in Boston Psychic Readings, Attraction spells,Bring...
masabamasaba
 
+971565801893>>SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHAB...
+971565801893>>SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHAB...+971565801893>>SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHAB...
+971565801893>>SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHAB...
Health
 
CHEAP Call Girls in Pushp Vihar (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICE
CHEAP Call Girls in Pushp Vihar (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICECHEAP Call Girls in Pushp Vihar (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICE
CHEAP Call Girls in Pushp Vihar (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICE
9953056974 Low Rate Call Girls In Saket, Delhi NCR
 
%+27788225528 love spells in new york Psychic Readings, Attraction spells,Bri...
%+27788225528 love spells in new york Psychic Readings, Attraction spells,Bri...%+27788225528 love spells in new york Psychic Readings, Attraction spells,Bri...
%+27788225528 love spells in new york Psychic Readings, Attraction spells,Bri...
masabamasaba
 
%+27788225528 love spells in Huntington Beach Psychic Readings, Attraction sp...
%+27788225528 love spells in Huntington Beach Psychic Readings, Attraction sp...%+27788225528 love spells in Huntington Beach Psychic Readings, Attraction sp...
%+27788225528 love spells in Huntington Beach Psychic Readings, Attraction sp...
masabamasaba
 
%+27788225528 love spells in Knoxville Psychic Readings, Attraction spells,Br...
%+27788225528 love spells in Knoxville Psychic Readings, Attraction spells,Br...%+27788225528 love spells in Knoxville Psychic Readings, Attraction spells,Br...
%+27788225528 love spells in Knoxville Psychic Readings, Attraction spells,Br...
masabamasaba
 

Último (20)

AI & Machine Learning Presentation Template
AI & Machine Learning Presentation TemplateAI & Machine Learning Presentation Template
AI & Machine Learning Presentation Template
 
Architecture decision records - How not to get lost in the past
Architecture decision records - How not to get lost in the pastArchitecture decision records - How not to get lost in the past
Architecture decision records - How not to get lost in the past
 
%+27788225528 love spells in Boston Psychic Readings, Attraction spells,Bring...
%+27788225528 love spells in Boston Psychic Readings, Attraction spells,Bring...%+27788225528 love spells in Boston Psychic Readings, Attraction spells,Bring...
%+27788225528 love spells in Boston Psychic Readings, Attraction spells,Bring...
 
WSO2CON 2024 - Cloud Native Middleware: Domain-Driven Design, Cell-Based Arch...
WSO2CON 2024 - Cloud Native Middleware: Domain-Driven Design, Cell-Based Arch...WSO2CON 2024 - Cloud Native Middleware: Domain-Driven Design, Cell-Based Arch...
WSO2CON 2024 - Cloud Native Middleware: Domain-Driven Design, Cell-Based Arch...
 
+971565801893>>SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHAB...
+971565801893>>SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHAB...+971565801893>>SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHAB...
+971565801893>>SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHAB...
 
%in kempton park+277-882-255-28 abortion pills for sale in kempton park
%in kempton park+277-882-255-28 abortion pills for sale in kempton park %in kempton park+277-882-255-28 abortion pills for sale in kempton park
%in kempton park+277-882-255-28 abortion pills for sale in kempton park
 
Shapes for Sharing between Graph Data Spaces - and Epistemic Querying of RDF-...
Shapes for Sharing between Graph Data Spaces - and Epistemic Querying of RDF-...Shapes for Sharing between Graph Data Spaces - and Epistemic Querying of RDF-...
Shapes for Sharing between Graph Data Spaces - and Epistemic Querying of RDF-...
 
%in Midrand+277-882-255-28 abortion pills for sale in midrand
%in Midrand+277-882-255-28 abortion pills for sale in midrand%in Midrand+277-882-255-28 abortion pills for sale in midrand
%in Midrand+277-882-255-28 abortion pills for sale in midrand
 
CHEAP Call Girls in Pushp Vihar (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICE
CHEAP Call Girls in Pushp Vihar (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICECHEAP Call Girls in Pushp Vihar (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICE
CHEAP Call Girls in Pushp Vihar (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICE
 
%+27788225528 love spells in new york Psychic Readings, Attraction spells,Bri...
%+27788225528 love spells in new york Psychic Readings, Attraction spells,Bri...%+27788225528 love spells in new york Psychic Readings, Attraction spells,Bri...
%+27788225528 love spells in new york Psychic Readings, Attraction spells,Bri...
 
8257 interfacing 2 in microprocessor for btech students
8257 interfacing 2 in microprocessor for btech students8257 interfacing 2 in microprocessor for btech students
8257 interfacing 2 in microprocessor for btech students
 
%in Stilfontein+277-882-255-28 abortion pills for sale in Stilfontein
%in Stilfontein+277-882-255-28 abortion pills for sale in Stilfontein%in Stilfontein+277-882-255-28 abortion pills for sale in Stilfontein
%in Stilfontein+277-882-255-28 abortion pills for sale in Stilfontein
 
WSO2Con2024 - From Code To Cloud: Fast Track Your Cloud Native Journey with C...
WSO2Con2024 - From Code To Cloud: Fast Track Your Cloud Native Journey with C...WSO2Con2024 - From Code To Cloud: Fast Track Your Cloud Native Journey with C...
WSO2Con2024 - From Code To Cloud: Fast Track Your Cloud Native Journey with C...
 
Crypto Cloud Review - How To Earn Up To $500 Per DAY Of Bitcoin 100% On AutoP...
Crypto Cloud Review - How To Earn Up To $500 Per DAY Of Bitcoin 100% On AutoP...Crypto Cloud Review - How To Earn Up To $500 Per DAY Of Bitcoin 100% On AutoP...
Crypto Cloud Review - How To Earn Up To $500 Per DAY Of Bitcoin 100% On AutoP...
 
%+27788225528 love spells in Huntington Beach Psychic Readings, Attraction sp...
%+27788225528 love spells in Huntington Beach Psychic Readings, Attraction sp...%+27788225528 love spells in Huntington Beach Psychic Readings, Attraction sp...
%+27788225528 love spells in Huntington Beach Psychic Readings, Attraction sp...
 
%in Soweto+277-882-255-28 abortion pills for sale in soweto
%in Soweto+277-882-255-28 abortion pills for sale in soweto%in Soweto+277-882-255-28 abortion pills for sale in soweto
%in Soweto+277-882-255-28 abortion pills for sale in soweto
 
Software Quality Assurance Interview Questions
Software Quality Assurance Interview QuestionsSoftware Quality Assurance Interview Questions
Software Quality Assurance Interview Questions
 
%in tembisa+277-882-255-28 abortion pills for sale in tembisa
%in tembisa+277-882-255-28 abortion pills for sale in tembisa%in tembisa+277-882-255-28 abortion pills for sale in tembisa
%in tembisa+277-882-255-28 abortion pills for sale in tembisa
 
%+27788225528 love spells in Knoxville Psychic Readings, Attraction spells,Br...
%+27788225528 love spells in Knoxville Psychic Readings, Attraction spells,Br...%+27788225528 love spells in Knoxville Psychic Readings, Attraction spells,Br...
%+27788225528 love spells in Knoxville Psychic Readings, Attraction spells,Br...
 
WSO2Con2024 - Enabling Transactional System's Exponential Growth With Simplicity
WSO2Con2024 - Enabling Transactional System's Exponential Growth With SimplicityWSO2Con2024 - Enabling Transactional System's Exponential Growth With Simplicity
WSO2Con2024 - Enabling Transactional System's Exponential Growth With Simplicity
 

Project Hydrogen: State-of-the-Art Deep Learning on Apache Spark

  • 1. Project Hydrogen: Unifying State-of-the-art Big Data and AI in Apache Spark Xiangrui Meng 2018/07
  • 2. Spark: unified analytics engine for big data “Apache Spark is the Taylor Swift of big data software.” - Fortune 500,000 meetup members 4,000 summit attendees
  • 3. AI: re-shaping the world “Artificial Intelligence is the New Electricity.”  -  Andrew Ng Transportation Healthcare and Genomics Internet of Things Fraud Prevention Personalization
  • 4. Get started with Spark + AI? Try Databricks Community Edition: http://databricks.com/try
  • 5. Two communities: big data and AI Significant progress has been made by both big data and AI communities in recent years to push the cutting edge: more complex big data scenarios more complex deep learning scenarios
  • 6. The cross? 6 Map/Reduce CaffeOnSpark TensorFlowOnSpark DataFrame-based APIs 50+ Data Sources Python/Java/R interfaces Structured Streaming ML Pipelines API Continuous Processing RDD Project Tungsten Pandas UDF TensorFrames scikit-learn pandas/numpy/scipy LIBLINEAR R glmnet xgboost GraphLab Caffe/PyTorch/MXNet TensorFlow Keras Distributed TensorFlow Horovod tf.data tf.transform AI/ML ?? TF XLA
  • 7. Neither big data nor AI is easy “Hidden Technical Debt in Machine Learning Systems,” Google NIPS 2015 Figure 1: Only a small fraction of real-world ML systems is composed of the ML code, as shown by the small green box in the middle. The required surrounding infrastructure is vast and complex. ML Code Configuration Data Collection Data Verification Feature Extraction Machine Resource Management Analysis Tools Process Management Tools Serving Infrastructure Monitoring
  • 8. Big data for AI A common goal of collecting, processing, and managing big data is to extract value from it. AI/ML is the most mentioned approach in our conversations with customers. improved customer engagement and conversions by improving image classification (case study) increased customer satisfaction, retention, and lifetime value by detecting abusive language in real-time (case study)
  • 9. Big data for AI There are many efforts from the Spark community to integrate Spark with AI/ML frameworks: ● (Yahoo) CaffeOnSpark, TensorFlowOnSpark ● (Intel) BigDL ● (John Snow Labs) Spark-NLP ● (Databricks) spark-sklearn, tensorframes, spark-deep-learning ● … 80+ ML/AI packages on spark-packages.org
  • 10. AI needs big data One cannot apply AI techniques without data. And DL models scale with amount of data. We have seen efforts from the AI community to handle different data scenarios: ● tf.data, tf.Transform ● spark-tensorflow-connector ● ... source: Andrew Ng
  • 11. When AI goes distributed ... When datasets get bigger and bigger, we see more and more distributed training scenarios and open-source offerings, e.g., distributed TensorFlow, Horovod, and distributed MXNet. This is where we see Spark and AI efforts overlap more.
  • 12. The status quo: two simple stories As a data scientist, I can: ● build a pipeline that fetches training events from a production data warehouse and trains a DL model in parallel; ● apply a trained DL model to a distributed stream of events and enrich it with predicted labels.
  • 13. Distributed training data warehouse load fit model Required: Be able to read from Databricks Delta, Parquet, MySQL, Hive, etc. Answer: Apache Spark Required: distributed GPU cluster for fast training Answer: Horovod, Distributed Tensorflow, etc Databricks Delta
  • 14. Two separate data and AI clusters? load using a Spark cluster fit on a GPU cluster model save data required: glue code
  • 15. Streaming model inference Kafka load predict model required: ● save to stream sink ● GPU for fast inference
  • 16. A hybrid Spark and AI cluster load using a Spark cluster w/ GPUs fit a model distributedly on the same cluster model load using a Spark cluster w/ GPUs predict w/ GPUs as a Spark task model
  • 17. Unfortunately, it doesn’t work out of the box. (Demo that a distributed DL job needs to run as a Spark job to avoid crashing into other Spark workloads on the same cluster.)
  • 18. 18 Different execution models Task 1 Task 2 Task 3 Spark Tasks are independent of each other Embarrassingly parallel & massively scalable Distributed Training Complete coordination among tasks Optimized for communication
  • 19. 19 Different execution models Task 1 Task 2 Task 3 Spark Tasks are independent of each other Embarrassingly parallel & massively scalable If one crashes… Distributed Training Complete coordination among tasks Optimized for communication
  • 20. 20 Different execution models Task 1 Task 2 Task 3 Spark Tasks are independent of each other Embarrassingly parallel & massively scalable If one crashes, rerun that one Distributed Training Complete coordination among tasks Optimized for communication If one crashes, must rerun all tasks
  • 21. 21 Project Hydrogen: Spark + AI Barrier Execution Mode Optimized Data Exchange Accelerator Aware Scheduling
  • 22. 22 Barrier execution mode We introduce gang scheduling to Spark on top of MapReduce execution model. So a distributed DL job can run as a Spark job. ● It starts all tasks together. ● It provides sufficient info and tooling to run a hybrid distributed job. ● It cancels and restarts all tasks in case of failures. Umbrella JIRA: SPARK-24374 (ETA: Spark 2.4, 3.0)
  • 23. 23 RDD.barrier() RDD.barrier() tells Spark to launch the tasks together. rdd.barrier().mapPartitions { iter => val context = BarrierTaskContext.get() ... }
  • 24. 24 context.barrier() context.barrier() places a global barrier and waits until all tasks in this stage hit this barrier. val context = BarrierTaskContext.get() … // write partition data out context.barrier()
  • 25. 25 context.getTaskInfos() context.getTaskInfos() returns info about all tasks in this stage. if (context.partitionId == 0) { val hosts = context.getTaskInfos().map(_.host) ... // start a hybrid training job, e.g., via MPI } context.barrier() // wait until training finishes
  • 26. 26 Cluster manager support YARN SPARK-24723 Kubernetes SPARK-24724 Mesos SPARK-24725 In Spark standalone mode, users need passwordless SSH among workers to run a hybrid MPI job. The community is working on the support of other cluster managers.
  • 27. 27 Demo (Demo barrier mode prototype, where a distributed DL job runs as a Spark job and does not crash into other Spark workloads.)
  • 28. 28 Optimized data exchange (SPIP) None of the integrations are possible without exchanging data between Spark and AI frameworks. And performance matters. We proposed using a standard interface for data exchange to simplify the integrations without introducing much overhead. SPIP JIRA: SPARK-24579 (pending vote, ETA 3.0)
  • 30. 30 Sources + 2 + destinations Parquet Avro SequenceFile MySQL ORC ... TFRecords CSV LIBSVM HDF5 NetCDF ... + Apache Spark + Apache Arrow +
  • 31. 31 From source to destination Parquet file Spark tf.Dataset
  • 32. 32 From source to destination: a long path Parquet file Spark Tungsten format in JVM Arrow buffer in JVM native Parquet reader Arrow buffer in Python data in GPU Pandastf.Dataset
  • 34. 34 Vectorized computation Though Spark uses a columnar storage format, the public user access interface is row-based. df.toArrowRDD.map { batches => ... // vectorized computation }
  • 35. 35 Generalize Pandas UDF to Scala/Java Pandas UDF was introduced in Spark 2.3, which uses Arrow for data exchange and utilizes Pandas for vectorized computation.
  • 36. 36 Data pipelining CPU GPU t1 fetch batch #1 t2 fetch batch #2 process batch #1 t3 fetch batch #3 process batch #2 t4 process batch #3 CPU GPU t1 fetch batch #1 t2 process batch #1 t3 fetch batch #2 t4 process batch #2 t5 fetch batch #3 t6 process batch #3 (pipelining)
  • 37. 37 Accelerator-aware scheduling (SPIP) To utilize accelerators (GPUs, FPGAs) in a heterogeneous cluster or to utilize multiple accelerators in a multi-task node, Spark needs to understand the accelerators installed on each node. SPIP JIRA: SPARK-24615 (pending vote, ETA: 3.0)
  • 38. 38 Request accelerators With accelerator awareness, users can specify accelerators constraints or hints (API pending discussion): rdd.accelerated .by(“/gpu/p100”) .numPerTask(2) .required // or .optional
  • 39. 39 Multiple tasks on the same node When multiples tasks are scheduled on the same node with multiple GPUs, each task knows which GPUs are assigned to avoid crashing into each other (API pending discussion): // inside a task closure val gpus = context.getAcceleratorInfos()
  • 40. 40 Cluster manager support In Spark standalone mode, we can list accelerators in worker conf and aggregate the information for scheduling. We also need to provide support for other cluster managers: YARN Kubernetes Mesos
  • 41. 41 Timeline Spark 2.4 ● barrier execution mode (basic scenarios) ● (Databricks) horovod integration w/ barrier mode Spark 3.0 ● barrier execution mode ● optimized data exchange ● accelerator-aware scheduling
  • 42. 42 Acknowledgement ● Many ideas in Project Hydrogen are based on previous community work: TensorFrames, BigDL, Apache Arrow, Pandas UDF, Spark GPU support, MPI, etc. ● We would like to thank many Spark committers and contributors who helped the project proposal, design, and implementation.