Tuning and Monitoring Deep
Learning on Apache Spark
Tim Hunter
Databricks, Inc.
About me
Software engineer at Databricks
Apache Spark contributor
Ph.D. in Machine Learning, UC Berkeley
(and Spark user since Spark 0.2)
2
Outline
• Lessons (and challenges) learned from the field
• Tuning Spark for Deep Learning and GPUs
• Loading data in Spark
• Monitoring
3
Deep Learning and Spark
• 2016: the year of emerging solutions for
combining Spark and Deep Learning
– 6 talks on Deep Learning, 3 talks on GPUs
• Yet, no consensus:
– Official MLlib support is limited (perceptron-like
networks)
– Each talk mentions a different framework
4
Deep Learning and Spark
• Deep learning frameworks with Spark bindings:
– Caffe (CaffeOnSpark)
– Keras (Elephas)
– mxnet
– Paddle
– TensorFlow (TensorFlow on Spark, TensorFrames)
• Running natively on Spark
– BigDL
– DeepDist
– DeepLearning4J
– MLlib
– SparkCL
– SparkNet
• Extensions to Spark for specialized hardware
– Blaze (UCLA & Falcon Computing Solutions)
– IBM Conductor with Spark
(Alphabetical order!)
5
One Framework to Rule Them All?
• Should we look for The One Deep Learning
Framework?
6
Databricks’ perspective
• Databricks: a Spark vendor on top of a public cloud
• Now provides GPU instances
• Enables customers with compute-intensive
workloads
• This talk:
– lessons learned when deploying GPU instances on a
public cloud
– what Spark could do better when running Deep Learning
applications
7
ML in a data pipeline
• ML is a small part of the full pipeline:
Hidden Technical Debt in Machine Learning Systems,
Sculley et al., NIPS 2015
8
DL in a data pipeline (1)
• Training tasks:
9
Data collection → ETL → Featurization → Deep Learning → Validation → Export, Serving
– Data collection / ETL / Featurization: IO intensive (large cluster, high memory/CPU ratio)
– Deep Learning: compute intensive (small cluster, low memory/CPU ratio)
– Validation / Export, Serving: IO intensive
DL in a data pipeline (2)
• Specialized data transform (feature extraction, …)
10
[Image: input photos mapped to output labels cat, dog, dog.
Photo: Saulius Garalevicius, CC BY-SA 3.0]
Recurring patterns
• Spark as a scheduler
– Embarrassingly parallel tasks
– Data stored outside Spark
• Embedded Deep Learning transforms
– Data stored in DataFrame/RDDs
– Follows the RDD guarantees
• Specialized computations
– Multiple passes over data
– Specific communication patterns
11
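The first pattern above, Spark as a scheduler, can be sketched as follows. `run_task` is a hypothetical stand-in for a self-contained job whose data lives outside Spark:

```python
def run_task(task_id):
    """Hypothetical stand-in for an independent job: it fetches its own
    data (from S3, a database, ...) and returns a result."""
    return task_id * task_id  # placeholder computation

# With Spark as a pure scheduler, this would be (not executed here):
#   results = sc.parallelize(range(100), 100).map(run_task).collect()
# Local stand-in for the same embarrassingly parallel pattern:
results = [run_task(t) for t in range(4)]
```

Spark only distributes the task identifiers; all heavy data movement happens inside each task.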
Using GPUs through PySpark
• A popular choice when there are many independent tasks
• Many DL packages have a Python interface:
TensorFlow, Theano, Caffe, mxnet, etc.
• Lifetime of a Python package’s state: the worker process
• Requires some configuration tweaks in Spark
12
PySpark recommendation
• spark.executor.cores = 1
– Gives the DL framework full access to all the
resources
– This may be important for frameworks that
attempt to optimize the processor pipelines
13
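As a sketch, the setting looks like this; the exact tuning depends on the framework and the instance type, so treat the value as a starting point rather than a rule:

```python
# Recommended executor configuration for DL workloads (sketch).
# With a single core per executor, Spark schedules one task at a time
# on the machine, so the DL framework owns the CPU/GPU pipeline.
recommended_conf = {
    "spark.executor.cores": "1",
}
# Typically passed at submit time:
#   spark-submit --conf spark.executor.cores=1 my_job.py
```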
Cooperative frameworks
• Use Spark for data input
• Examples:
– IBM GPU efforts
– Skymind’s DeepLearning4J
– DistML and other Parameter Server efforts
14
[Diagram: an input RDD’s partitions 1…n feed a black box,
which produces an output RDD with partitions 1…m]
Cooperative frameworks
• Bypass Spark for asynchronous / specialized
communication patterns across machines
• Lose the benefits of RDDs and DataFrames and of
reproducibility/determinism
• But these guarantees are generally not required
for deep learning (stochastic gradient descent)
• “Reproducibility is worth a factor of 2” (Léon Bottou,
quoted by John Langford)
15
Streaming data through DL
• The general choices:
– Cold layer (HDFS/S3/etc.)
– Local storage: files, Spark’s on-disk persistence layer
– In memory: Spark RDDs or Spark DataFrames
• Find out whether you are I/O-constrained or processor-constrained
– How big is your dataset? MNIST or ImageNet?
• If using PySpark:
– All frameworks are heavily optimized for disk I/O
– Use Spark’s broadcast for small datasets that fit in memory
– Reading files is fast: use local files when the data does not fit in memory
• If using a binding:
– Each binding has its own I/O layer
– In general, better performance by caching in files (same reasons)
16
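The decision above can be sketched as a simple heuristic; the 50% threshold is an assumption for illustration, not a measured rule:

```python
def choose_input_strategy(dataset_bytes, executor_memory_bytes):
    """Pick how to feed data to the DL framework (sketch).

    Small datasets (MNIST-sized) fit in memory: ship them once with
    sc.broadcast. Larger ones (ImageNet-sized) are better cached as
    local files, which the frameworks' input pipelines are heavily
    optimized for.
    """
    if dataset_bytes < 0.5 * executor_memory_bytes:  # assumed threshold
        return "broadcast"    # e.g. b = sc.broadcast(data); use b.value in tasks
    return "local_files"      # e.g. persist to local disk, read with the framework

# Example: a 100 MB dataset on a 16 GB executor
strategy = choose_input_strategy(100 * 2**20, 16 * 2**30)
```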
Detour: simplifying the life of users
• Deep Learning is commonly used with GPUs
• A lot of work has gone into Spark’s dependencies:
– Few dependencies on the local machine when compiling Spark
– The build process works well in a large number of configurations
(just Scala + Maven)
• GPUs present challenges: CUDA, support libraries, drivers,
etc.
– Deep software stack that requires careful construction (hardware +
drivers + CUDA + libraries)
– Users expect all of these to just work
– Turnkey stacks are only starting to appear
17
Simplifying the life of users
• Provide a Docker image with all the GPU SDK
• Pre-install GPU drivers on the instance
[Stack diagram, bottom to top: GPU hardware → Linux kernel + NV kernel
driver → NV kernel driver (userspace interface) → CUDA → CuBLAS / CuDNN →
deep learning libraries (TensorFlow, etc.) / JCUDA → Python / JVM clients.
Everything above the kernel driver runs in a container: nvidia-docker, lxc, etc.]
18
Monitoring
?
19
Monitoring
• How do you monitor the progress of your tasks?
• It depends on the granularity
– Around tasks
– Inside (long-running) tasks
20
Monitoring: Accumulators
• Good for checking
throughput or failure rate
• Works well from Scala
• Limited use from Python
(for now, SPARK-2868)
• No “real-time” updates

batchesAcc = sc.accumulator(0)

def processBatch(i):
    global batchesAcc
    batchesAcc += 1
    # Process image batch here

images = sc.parallelize(…)
images.map(processBatch).collect()
21
Monitoring: external system
• Plugs into an external system
• Existing solutions: Grafana, Graphite,
Prometheus, etc.
• Most flexible, but more complex to deploy
22
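A minimal way to plug long-running tasks into such a system is a fire-and-forget UDP metric, statsd-style. The host and port below are placeholders for whatever Graphite/Prometheus gateway is actually deployed:

```python
import socket

def emit_counter(name, value, host="127.0.0.1", port=8125):
    """Send a statsd-style counter over UDP (fire-and-forget, so it is
    safe to call from inside a Spark task without blocking it)."""
    payload = f"{name}:{value}|c".encode()
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.sendto(payload, (host, port))
    return payload  # returned only to make the sketch easy to inspect

# Inside a long-running task, e.g. once per processed batch:
#   emit_counter("dl.batches_processed", 1)
```

Because UDP does not wait for an acknowledgment, a missing or slow metrics backend never stalls the task itself.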
Conclusion
• Distributed deep learning: an exciting and
fast-moving space
• Most insights are specific to a task, a dataset,
and an algorithm: nothing replaces experiments
• Easy to get started with parallel jobs
• Move up in complexity only when you have enough
data and insufficient patience
23
Challenges to address
• For Spark developers
– Monitoring long-running tasks
– Presenting and introspecting intermediate results
• For DL developers:
– What boundary to put between the algorithm and
Spark?
– How to integrate with Spark at a low level?
24
Thank You.
Check our blog post on Deep Learning and GPUs!
https://docs.databricks.com/applications/deep-learning/index.html
Office hours: today 3:50 at the Databricks booth

Más contenido relacionado

La actualidad más candente

Structured-Streaming-as-a-Service with Kafka, YARN, and Tooling with Jim Dowling
Structured-Streaming-as-a-Service with Kafka, YARN, and Tooling with Jim DowlingStructured-Streaming-as-a-Service with Kafka, YARN, and Tooling with Jim Dowling
Structured-Streaming-as-a-Service with Kafka, YARN, and Tooling with Jim Dowling
Databricks
 

La actualidad más candente (20)

Problem Solving Recipes Learned from Supporting Spark: Spark Summit East talk...
Problem Solving Recipes Learned from Supporting Spark: Spark Summit East talk...Problem Solving Recipes Learned from Supporting Spark: Spark Summit East talk...
Problem Solving Recipes Learned from Supporting Spark: Spark Summit East talk...
 
High Performance Python on Apache Spark
High Performance Python on Apache SparkHigh Performance Python on Apache Spark
High Performance Python on Apache Spark
 
Spark Summit EU talk by Jakub Hava
Spark Summit EU talk by Jakub HavaSpark Summit EU talk by Jakub Hava
Spark Summit EU talk by Jakub Hava
 
Spark Summit EU talk by Jorg Schad
Spark Summit EU talk by Jorg SchadSpark Summit EU talk by Jorg Schad
Spark Summit EU talk by Jorg Schad
 
Spark Summit EU talk by Luca Canali
Spark Summit EU talk by Luca CanaliSpark Summit EU talk by Luca Canali
Spark Summit EU talk by Luca Canali
 
Build, Scale, and Deploy Deep Learning Pipelines with Ease Using Apache Spark
Build, Scale, and Deploy Deep Learning Pipelines with Ease Using Apache SparkBuild, Scale, and Deploy Deep Learning Pipelines with Ease Using Apache Spark
Build, Scale, and Deploy Deep Learning Pipelines with Ease Using Apache Spark
 
Spark Summit EU talk by Berni Schiefer
Spark Summit EU talk by Berni SchieferSpark Summit EU talk by Berni Schiefer
Spark Summit EU talk by Berni Schiefer
 
Spark Summit EU talk by Jim Dowling
Spark Summit EU talk by Jim DowlingSpark Summit EU talk by Jim Dowling
Spark Summit EU talk by Jim Dowling
 
Structured-Streaming-as-a-Service with Kafka, YARN, and Tooling with Jim Dowling
Structured-Streaming-as-a-Service with Kafka, YARN, and Tooling with Jim DowlingStructured-Streaming-as-a-Service with Kafka, YARN, and Tooling with Jim Dowling
Structured-Streaming-as-a-Service with Kafka, YARN, and Tooling with Jim Dowling
 
Introduction to Apache Spark and MLlib
Introduction to Apache Spark and MLlibIntroduction to Apache Spark and MLlib
Introduction to Apache Spark and MLlib
 
Spark r under the hood with Hossein Falaki
Spark r under the hood with Hossein FalakiSpark r under the hood with Hossein Falaki
Spark r under the hood with Hossein Falaki
 
Lessons Learned: Using Spark and Microservices
Lessons Learned: Using Spark and MicroservicesLessons Learned: Using Spark and Microservices
Lessons Learned: Using Spark and Microservices
 
High Performance Enterprise Data Processing with Apache Spark with Sandeep Va...
High Performance Enterprise Data Processing with Apache Spark with Sandeep Va...High Performance Enterprise Data Processing with Apache Spark with Sandeep Va...
High Performance Enterprise Data Processing with Apache Spark with Sandeep Va...
 
[Spark Summit EU 2017] Apache spark streaming + kafka 0.10 an integration story
[Spark Summit EU 2017] Apache spark streaming + kafka 0.10  an integration story[Spark Summit EU 2017] Apache spark streaming + kafka 0.10  an integration story
[Spark Summit EU 2017] Apache spark streaming + kafka 0.10 an integration story
 
Apache Spark and Apache Ignite: Where Fast Data Meets the IoT with Denis Magda
Apache Spark and Apache Ignite: Where Fast Data Meets the IoT with Denis MagdaApache Spark and Apache Ignite: Where Fast Data Meets the IoT with Denis Magda
Apache Spark and Apache Ignite: Where Fast Data Meets the IoT with Denis Magda
 
PySpark Best Practices
PySpark Best PracticesPySpark Best Practices
PySpark Best Practices
 
Sparking up Data Engineering: Spark Summit East talk by Rohan Sharma
Sparking up Data Engineering: Spark Summit East talk by Rohan SharmaSparking up Data Engineering: Spark Summit East talk by Rohan Sharma
Sparking up Data Engineering: Spark Summit East talk by Rohan Sharma
 
Extending the R API for Spark with sparklyr and Microsoft R Server with Ali Z...
Extending the R API for Spark with sparklyr and Microsoft R Server with Ali Z...Extending the R API for Spark with sparklyr and Microsoft R Server with Ali Z...
Extending the R API for Spark with sparklyr and Microsoft R Server with Ali Z...
 
Elasticsearch And Apache Lucene For Apache Spark And MLlib
Elasticsearch And Apache Lucene For Apache Spark And MLlibElasticsearch And Apache Lucene For Apache Spark And MLlib
Elasticsearch And Apache Lucene For Apache Spark And MLlib
 
Accelerating Shuffle: A Tailor-Made RDMA Solution for Apache Spark with Yuval...
Accelerating Shuffle: A Tailor-Made RDMA Solution for Apache Spark with Yuval...Accelerating Shuffle: A Tailor-Made RDMA Solution for Apache Spark with Yuval...
Accelerating Shuffle: A Tailor-Made RDMA Solution for Apache Spark with Yuval...
 

Destacado

Building Realtime Data Pipelines with Kafka Connect and Spark Streaming: Spar...
Building Realtime Data Pipelines with Kafka Connect and Spark Streaming: Spar...Building Realtime Data Pipelines with Kafka Connect and Spark Streaming: Spar...
Building Realtime Data Pipelines with Kafka Connect and Spark Streaming: Spar...
Spark Summit
 
Lessons Learned from Dockerizing Spark Workloads: Spark Summit East talk by T...
Lessons Learned from Dockerizing Spark Workloads: Spark Summit East talk by T...Lessons Learned from Dockerizing Spark Workloads: Spark Summit East talk by T...
Lessons Learned from Dockerizing Spark Workloads: Spark Summit East talk by T...
Spark Summit
 
Going Real-Time: Creating Frequently-Updating Datasets for Personalization: S...
Going Real-Time: Creating Frequently-Updating Datasets for Personalization: S...Going Real-Time: Creating Frequently-Updating Datasets for Personalization: S...
Going Real-Time: Creating Frequently-Updating Datasets for Personalization: S...
Spark Summit
 
Netflix's Recommendation ML Pipeline Using Apache Spark: Spark Summit East ta...
Netflix's Recommendation ML Pipeline Using Apache Spark: Spark Summit East ta...Netflix's Recommendation ML Pipeline Using Apache Spark: Spark Summit East ta...
Netflix's Recommendation ML Pipeline Using Apache Spark: Spark Summit East ta...
Spark Summit
 

Destacado (10)

A New “Sparkitecture” for Modernizing your Data Warehouse: Spark Summit East ...
A New “Sparkitecture” for Modernizing your Data Warehouse: Spark Summit East ...A New “Sparkitecture” for Modernizing your Data Warehouse: Spark Summit East ...
A New “Sparkitecture” for Modernizing your Data Warehouse: Spark Summit East ...
 
Monitoring the Dynamic Resource Usage of Scala and Python Spark Jobs in Yarn:...
Monitoring the Dynamic Resource Usage of Scala and Python Spark Jobs in Yarn:...Monitoring the Dynamic Resource Usage of Scala and Python Spark Jobs in Yarn:...
Monitoring the Dynamic Resource Usage of Scala and Python Spark Jobs in Yarn:...
 
Building a Dataset Search Engine with Spark and Elasticsearch: Spark Summit E...
Building a Dataset Search Engine with Spark and Elasticsearch: Spark Summit E...Building a Dataset Search Engine with Spark and Elasticsearch: Spark Summit E...
Building a Dataset Search Engine with Spark and Elasticsearch: Spark Summit E...
 
Building Realtime Data Pipelines with Kafka Connect and Spark Streaming: Spar...
Building Realtime Data Pipelines with Kafka Connect and Spark Streaming: Spar...Building Realtime Data Pipelines with Kafka Connect and Spark Streaming: Spar...
Building Realtime Data Pipelines with Kafka Connect and Spark Streaming: Spar...
 
What No One Tells You About Writing a Streaming App: Spark Summit East talk b...
What No One Tells You About Writing a Streaming App: Spark Summit East talk b...What No One Tells You About Writing a Streaming App: Spark Summit East talk b...
What No One Tells You About Writing a Streaming App: Spark Summit East talk b...
 
Insights Without Tradeoffs Using Structured Streaming keynote by Michael Armb...
Insights Without Tradeoffs Using Structured Streaming keynote by Michael Armb...Insights Without Tradeoffs Using Structured Streaming keynote by Michael Armb...
Insights Without Tradeoffs Using Structured Streaming keynote by Michael Armb...
 
Lessons Learned from Dockerizing Spark Workloads: Spark Summit East talk by T...
Lessons Learned from Dockerizing Spark Workloads: Spark Summit East talk by T...Lessons Learned from Dockerizing Spark Workloads: Spark Summit East talk by T...
Lessons Learned from Dockerizing Spark Workloads: Spark Summit East talk by T...
 
Going Real-Time: Creating Frequently-Updating Datasets for Personalization: S...
Going Real-Time: Creating Frequently-Updating Datasets for Personalization: S...Going Real-Time: Creating Frequently-Updating Datasets for Personalization: S...
Going Real-Time: Creating Frequently-Updating Datasets for Personalization: S...
 
Spark SQL: Another 16x Faster After Tungsten: Spark Summit East talk by Brad ...
Spark SQL: Another 16x Faster After Tungsten: Spark Summit East talk by Brad ...Spark SQL: Another 16x Faster After Tungsten: Spark Summit East talk by Brad ...
Spark SQL: Another 16x Faster After Tungsten: Spark Summit East talk by Brad ...
 
Netflix's Recommendation ML Pipeline Using Apache Spark: Spark Summit East ta...
Netflix's Recommendation ML Pipeline Using Apache Spark: Spark Summit East ta...Netflix's Recommendation ML Pipeline Using Apache Spark: Spark Summit East ta...
Netflix's Recommendation ML Pipeline Using Apache Spark: Spark Summit East ta...
 

Similar a Tuning and Monitoring Deep Learning on Apache Spark

Spark summit 2019 infrastructure for deep learning in apache spark 0425
Spark summit 2019 infrastructure for deep learning in apache spark 0425Spark summit 2019 infrastructure for deep learning in apache spark 0425
Spark summit 2019 infrastructure for deep learning in apache spark 0425
Wee Hyong Tok
 
Infrastructure for Deep Learning in Apache Spark
Infrastructure for Deep Learning in Apache SparkInfrastructure for Deep Learning in Apache Spark
Infrastructure for Deep Learning in Apache Spark
Databricks
 
Lessons learned from running Spark on Docker
Lessons learned from running Spark on DockerLessons learned from running Spark on Docker
Lessons learned from running Spark on Docker
DataWorks Summit
 

Similar a Tuning and Monitoring Deep Learning on Apache Spark (20)

Deep Learning on Apache® Spark™: Workflows and Best Practices
Deep Learning on Apache® Spark™: Workflows and Best PracticesDeep Learning on Apache® Spark™: Workflows and Best Practices
Deep Learning on Apache® Spark™: Workflows and Best Practices
 
Deep Learning on Apache® Spark™: Workflows and Best Practices
Deep Learning on Apache® Spark™: Workflows and Best PracticesDeep Learning on Apache® Spark™: Workflows and Best Practices
Deep Learning on Apache® Spark™: Workflows and Best Practices
 
Spark summit 2019 infrastructure for deep learning in apache spark 0425
Spark summit 2019 infrastructure for deep learning in apache spark 0425Spark summit 2019 infrastructure for deep learning in apache spark 0425
Spark summit 2019 infrastructure for deep learning in apache spark 0425
 
Productionizing Spark and the Spark Job Server
Productionizing Spark and the Spark Job ServerProductionizing Spark and the Spark Job Server
Productionizing Spark and the Spark Job Server
 
Migrating ETL Workflow to Apache Spark at Scale in Pinterest
Migrating ETL Workflow to Apache Spark at Scale in PinterestMigrating ETL Workflow to Apache Spark at Scale in Pinterest
Migrating ETL Workflow to Apache Spark at Scale in Pinterest
 
Infrastructure for Deep Learning in Apache Spark
Infrastructure for Deep Learning in Apache SparkInfrastructure for Deep Learning in Apache Spark
Infrastructure for Deep Learning in Apache Spark
 
Integrating Deep Learning Libraries with Apache Spark
Integrating Deep Learning Libraries with Apache SparkIntegrating Deep Learning Libraries with Apache Spark
Integrating Deep Learning Libraries with Apache Spark
 
East Bay Java User Group Oct 2014 Spark Streaming Kinesis Machine Learning
 East Bay Java User Group Oct 2014 Spark Streaming Kinesis Machine Learning East Bay Java User Group Oct 2014 Spark Streaming Kinesis Machine Learning
East Bay Java User Group Oct 2014 Spark Streaming Kinesis Machine Learning
 
Apache Spark Core
Apache Spark CoreApache Spark Core
Apache Spark Core
 
Apache Spark Fundamentals
Apache Spark FundamentalsApache Spark Fundamentals
Apache Spark Fundamentals
 
An Introduction to Apache Spark
An Introduction to Apache SparkAn Introduction to Apache Spark
An Introduction to Apache Spark
 
Spark Summit EU 2015: Lessons from 300+ production users
Spark Summit EU 2015: Lessons from 300+ production usersSpark Summit EU 2015: Lessons from 300+ production users
Spark Summit EU 2015: Lessons from 300+ production users
 
Global Big Data Conference Sept 2014 AWS Kinesis Spark Streaming Approximatio...
Global Big Data Conference Sept 2014 AWS Kinesis Spark Streaming Approximatio...Global Big Data Conference Sept 2014 AWS Kinesis Spark Streaming Approximatio...
Global Big Data Conference Sept 2014 AWS Kinesis Spark Streaming Approximatio...
 
Introduction to DPDK
Introduction to DPDKIntroduction to DPDK
Introduction to DPDK
 
Apache Spark for Beginners
Apache Spark for BeginnersApache Spark for Beginners
Apache Spark for Beginners
 
Paris Data Geek - Spark Streaming
Paris Data Geek - Spark Streaming Paris Data Geek - Spark Streaming
Paris Data Geek - Spark Streaming
 
Introduction to Apache Spark :: Lagos Scala Meetup session 2
Introduction to Apache Spark :: Lagos Scala Meetup session 2 Introduction to Apache Spark :: Lagos Scala Meetup session 2
Introduction to Apache Spark :: Lagos Scala Meetup session 2
 
Lessons learned from running Spark on Docker
Lessons learned from running Spark on DockerLessons learned from running Spark on Docker
Lessons learned from running Spark on Docker
 
Spark 101 – First Steps To Distributed Computing - Demi Ben-Ari @ Ofek Alumni
Spark 101 – First Steps To Distributed Computing - Demi Ben-Ari @ Ofek AlumniSpark 101 – First Steps To Distributed Computing - Demi Ben-Ari @ Ofek Alumni
Spark 101 – First Steps To Distributed Computing - Demi Ben-Ari @ Ofek Alumni
 
AI and Spark - IBM Community AI Day
AI and Spark - IBM Community AI DayAI and Spark - IBM Community AI Day
AI and Spark - IBM Community AI Day
 

Más de Databricks

Democratizing Data Quality Through a Centralized Platform
Democratizing Data Quality Through a Centralized PlatformDemocratizing Data Quality Through a Centralized Platform
Democratizing Data Quality Through a Centralized Platform
Databricks
 
Stage Level Scheduling Improving Big Data and AI Integration
Stage Level Scheduling Improving Big Data and AI IntegrationStage Level Scheduling Improving Big Data and AI Integration
Stage Level Scheduling Improving Big Data and AI Integration
Databricks
 
Simplify Data Conversion from Spark to TensorFlow and PyTorch
Simplify Data Conversion from Spark to TensorFlow and PyTorchSimplify Data Conversion from Spark to TensorFlow and PyTorch
Simplify Data Conversion from Spark to TensorFlow and PyTorch
Databricks
 
Raven: End-to-end Optimization of ML Prediction Queries
Raven: End-to-end Optimization of ML Prediction QueriesRaven: End-to-end Optimization of ML Prediction Queries
Raven: End-to-end Optimization of ML Prediction Queries
Databricks
 
Processing Large Datasets for ADAS Applications using Apache Spark
Processing Large Datasets for ADAS Applications using Apache SparkProcessing Large Datasets for ADAS Applications using Apache Spark
Processing Large Datasets for ADAS Applications using Apache Spark
Databricks
 

Más de Databricks (20)

DW Migration Webinar-March 2022.pptx
DW Migration Webinar-March 2022.pptxDW Migration Webinar-March 2022.pptx
DW Migration Webinar-March 2022.pptx
 
Data Lakehouse Symposium | Day 1 | Part 1
Data Lakehouse Symposium | Day 1 | Part 1Data Lakehouse Symposium | Day 1 | Part 1
Data Lakehouse Symposium | Day 1 | Part 1
 
Data Lakehouse Symposium | Day 1 | Part 2
Data Lakehouse Symposium | Day 1 | Part 2Data Lakehouse Symposium | Day 1 | Part 2
Data Lakehouse Symposium | Day 1 | Part 2
 
Data Lakehouse Symposium | Day 2
Data Lakehouse Symposium | Day 2Data Lakehouse Symposium | Day 2
Data Lakehouse Symposium | Day 2
 
Data Lakehouse Symposium | Day 4
Data Lakehouse Symposium | Day 4Data Lakehouse Symposium | Day 4
Data Lakehouse Symposium | Day 4
 
5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop
5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop
5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop
 
Democratizing Data Quality Through a Centralized Platform
Democratizing Data Quality Through a Centralized PlatformDemocratizing Data Quality Through a Centralized Platform
Democratizing Data Quality Through a Centralized Platform
 
Learn to Use Databricks for Data Science
Learn to Use Databricks for Data ScienceLearn to Use Databricks for Data Science
Learn to Use Databricks for Data Science
 
Why APM Is Not the Same As ML Monitoring
Why APM Is Not the Same As ML MonitoringWhy APM Is Not the Same As ML Monitoring
Why APM Is Not the Same As ML Monitoring
 
The Function, the Context, and the Data—Enabling ML Ops at Stitch Fix
The Function, the Context, and the Data—Enabling ML Ops at Stitch FixThe Function, the Context, and the Data—Enabling ML Ops at Stitch Fix
The Function, the Context, and the Data—Enabling ML Ops at Stitch Fix
 
Stage Level Scheduling Improving Big Data and AI Integration
Stage Level Scheduling Improving Big Data and AI IntegrationStage Level Scheduling Improving Big Data and AI Integration
Stage Level Scheduling Improving Big Data and AI Integration
 
Simplify Data Conversion from Spark to TensorFlow and PyTorch
Simplify Data Conversion from Spark to TensorFlow and PyTorchSimplify Data Conversion from Spark to TensorFlow and PyTorch
Simplify Data Conversion from Spark to TensorFlow and PyTorch
 
Scaling your Data Pipelines with Apache Spark on Kubernetes
Scaling your Data Pipelines with Apache Spark on KubernetesScaling your Data Pipelines with Apache Spark on Kubernetes
Scaling your Data Pipelines with Apache Spark on Kubernetes
 
Scaling and Unifying SciKit Learn and Apache Spark Pipelines
Scaling and Unifying SciKit Learn and Apache Spark PipelinesScaling and Unifying SciKit Learn and Apache Spark Pipelines
Scaling and Unifying SciKit Learn and Apache Spark Pipelines
 
Sawtooth Windows for Feature Aggregations
Sawtooth Windows for Feature AggregationsSawtooth Windows for Feature Aggregations
Sawtooth Windows for Feature Aggregations
 
Redis + Apache Spark = Swiss Army Knife Meets Kitchen Sink
Redis + Apache Spark = Swiss Army Knife Meets Kitchen SinkRedis + Apache Spark = Swiss Army Knife Meets Kitchen Sink
Redis + Apache Spark = Swiss Army Knife Meets Kitchen Sink
 
Re-imagine Data Monitoring with whylogs and Spark
Re-imagine Data Monitoring with whylogs and SparkRe-imagine Data Monitoring with whylogs and Spark
Re-imagine Data Monitoring with whylogs and Spark
 
Raven: End-to-end Optimization of ML Prediction Queries
Raven: End-to-end Optimization of ML Prediction QueriesRaven: End-to-end Optimization of ML Prediction Queries
Raven: End-to-end Optimization of ML Prediction Queries
 
Processing Large Datasets for ADAS Applications using Apache Spark
Processing Large Datasets for ADAS Applications using Apache SparkProcessing Large Datasets for ADAS Applications using Apache Spark
Processing Large Datasets for ADAS Applications using Apache Spark
 
Massive Data Processing in Adobe Using Delta Lake
Massive Data Processing in Adobe Using Delta LakeMassive Data Processing in Adobe Using Delta Lake
Massive Data Processing in Adobe Using Delta Lake
 

Último

CNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of ServiceCNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of Service
giselly40
 
Histor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slideHistor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slide
vu2urc
 

Último (20)

CNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of ServiceCNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of Service
 
Finology Group – Insurtech Innovation Award 2024
Finology Group – Insurtech Innovation Award 2024Finology Group – Insurtech Innovation Award 2024
Finology Group – Insurtech Innovation Award 2024
 
GenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day PresentationGenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day Presentation
 
08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking Men08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking Men
 
Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected Worker
 
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
 
TrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law DevelopmentsTrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
 
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemkeProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
 
Data Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt RobisonData Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt Robison
 
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
 
Tech Trends Report 2024 Future Today Institute.pdf
Tech Trends Report 2024 Future Today Institute.pdfTech Trends Report 2024 Future Today Institute.pdf
Tech Trends Report 2024 Future Today Institute.pdf
 
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
 
Evaluating the top large language models.pdf
Evaluating the top large language models.pdfEvaluating the top large language models.pdf
Evaluating the top large language models.pdf
 
Histor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slideHistor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slide
 
A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)
 
Strategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
Strategize a Smooth Tenant-to-tenant Migration and Copilot TakeoffStrategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
Strategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
 
Understanding Discord NSFW Servers A Guide for Responsible Users.pdf
Understanding Discord NSFW Servers A Guide for Responsible Users.pdfUnderstanding Discord NSFW Servers A Guide for Responsible Users.pdf
Understanding Discord NSFW Servers A Guide for Responsible Users.pdf
 
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
 
The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024
 

Tuning and Monitoring Deep Learning on Apache Spark

  • 1. Tuning and Monitoring Deep Learning on Apache Spark Tim Hunter Databricks, Inc.
  • 2. About me Software engineer at Databricks Apache Spark contributor Ph.D. UC Berkeley in Machine Learning (and Spark user since Spark 0.2) 2
  • 3. Outline • Lessons (and challenges) learned from the field • Tuning Spark for Deep Learning and GPUs • Loading data in Spark • Monitoring 3
  • 4. Deep Learning and Spark • 2016: the year of emerging solutions for combining Spark and Deep Learning – 6 talks on Deep Learning, 3 talks on GPUs • Yet, no consensus: – Official MLlib support is limited (perceptron-like networks) – Each talk mentions a different framework 4
  • 5. Deep Learning and Spark • Deep learning frameworks with Spark bindings: – Caffe (CaffeOnSpark) – Keras (Elephas) – mxnet – Paddle – TensorFlow (TensorFlow on Spark, TensorFrames) • Running natively on Spark – BigDL – DeepDist – DeepLearning4J – MLlib – SparkCL – SparkNet • Extensions to Spark for specialized hardware – Blaze (UCLA & Falcon Computing Solutions) – IBM Conductor with Spark (Alphabetical order!) 5
  • 6. One Framework to Rule Them All? • Should we look for The One Deep Learning Framework? 6
  • 7. Databricks’ perspective • Databricks: a Spark vendor on top of a public cloud • Now provides GPU instances • Enables customers with compute-intensive workloads • This talk: – lessons learned when deploying GPU instances on a public cloud – what Spark could do better when running Deep Learning applications 7
  • 8. ML in a data pipeline • ML is small part in the full pipeline: Hidden technical debt in Machine Learning systems Sculley et al., NIPS 2016 8
  • 9. DL in a data pipeline (1) • Training tasks: 9 [Pipeline diagram: Data collection → ETL → Featurization → Deep Learning → Validation → Export/Serving. The data collection, ETL and featurization stages are IO intensive and run on a large cluster with a high memory/CPU ratio; the deep learning stage is compute intensive and runs on a small cluster with a low memory/CPU ratio.]
  • 10. DL in a data pipeline (2) • Specialized data transform (feature extraction, …) 10 [Figure: a trained network maps input images to the labels “cat” and “dog”. Image credit: Saulius Garalevicius, CC BY-SA 3.0]
  • 11. Recurring patterns • Spark as a scheduler – Embarrassingly parallel tasks – Data stored outside Spark • Embedded Deep Learning transforms – Data stored in DataFrame/RDDs – Follows the RDD guarantees • Specialized computations – Multiple passes over data – Specific communication patterns 11
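The second pattern above (embedded DL transforms over data stored in RDDs) can be sketched without a live Spark cluster. This is a minimal sketch, assuming hypothetical stand-ins: `load_model` and the toy model it returns are placeholders for a real framework's model loading and inference. The point of the structure is that the expensive load happens once per partition, not once per record; in Spark, `score_partition` is exactly the function you would pass to `RDD.mapPartitions`.

```python
def load_model():
    # Hypothetical stand-in: in practice this would load a trained
    # network (TensorFlow, Caffe, ...) from disk, which is expensive.
    return lambda record: {"input": record,
                           "label": "cat" if "cat" in record else "dog"}

def score_partition(records):
    model = load_model()      # one expensive load per partition
    for r in records:         # stream lazily through the partition
        yield model(r)

# Local usage; on a cluster: images.mapPartitions(score_partition)
out = list(score_partition(["cat_01.jpg", "dog_07.jpg"]))
```

The generator form matters: it preserves Spark's streaming evaluation of partitions instead of materializing the whole partition in memory.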
  • 12. Using GPUs through PySpark • A popular choice when there are a lot of independent tasks • A lot of DL packages have a Python interface: TensorFlow, Theano, Caffe, mxnet, etc. • Lifetime of Python packages: the process • Requires some configuration tweaks in Spark 12
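Because the Python worker process owns the DL framework for its lifetime, a common trick on multi-GPU machines (not from the slides; an assumption based on standard CUDA behavior) is to pin each task to one device by setting `CUDA_VISIBLE_DEVICES` before the framework initializes CUDA. A sketch, with a simple round-robin assignment by partition index:

```python
import os

def pin_gpu(partition_index, num_gpus):
    # Round-robin one GPU per task. The environment variable must be
    # set BEFORE the DL framework initializes CUDA in this process;
    # afterwards the assignment cannot be changed.
    gpu = partition_index % num_gpus
    os.environ["CUDA_VISIBLE_DEVICES"] = str(gpu)
    return gpu

# In Spark this would run inside mapPartitionsWithIndex, before
# importing/initializing the DL framework. Local illustration:
assigned = [pin_gpu(i, num_gpus=4) for i in range(6)]
```

In Spark the partition index comes from `mapPartitionsWithIndex`; `num_gpus` would be read from the instance type.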
  • 13. PySpark recommendation • spark.executor.cores = 1 – Gives the DL framework full access to all the resources – This may be important for some frameworks that attempt to optimize the processor pipelines 13
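A minimal sketch of how that recommendation translates into a job submission; the script name `my_dl_job.py` is hypothetical:

```shell
# One task slot per executor, so the single running task (and the DL
# framework it hosts) can claim the whole machine's CPUs and GPU.
spark-submit \
  --executor-cores 1 \
  my_dl_job.py
```

The same setting can be placed in `spark-defaults.conf` as `spark.executor.cores 1`; memory settings still depend on the cluster manager.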
  • 14. Cooperative frameworks • Use Spark for data input • Examples: – IBM GPU efforts – Skymind’s DeepLearning4J – DistML and other Parameter Server efforts 14 [Diagram: two RDDs (partitions 1…n and 1…m) feeding data into a black-box framework]
  • 15. Cooperative frameworks • Bypass Spark for asynchronous / specific communication patterns across machines • Lose the benefit of RDDs and DataFrames, and of reproducibility/determinism • But these guarantees are usually not required anyway when doing deep learning (stochastic gradient descent) • “reproducibility is worth a factor of 2” (Leon Bottou, quoted by John Langford) 15
  • 16. Streaming data through DL • The general choice: – Cold layer (HDFS/S3/etc.) – Local storage: files, Spark’s on-disk persistence layer – In memory: Spark RDDs or Spark DataFrames • Find out if you are I/O-constrained or processor-constrained – How big is your dataset? MNIST or ImageNet? • If using PySpark: – All frameworks are heavily optimized for disk I/O – Use Spark’s broadcast for small datasets that fit in memory – Reading files is fast: use local files when the data does not fit in memory • If using a binding: – each binding has its own I/O layer – in general, better performance by caching in files (for the same reasons) 16
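Since the frameworks' readers are optimized for files, one way to implement the "use local files" advice is to materialize each partition to local temporary files and hand the paths to the framework's own reader. A minimal sketch, assuming records arrive as raw bytes (real code would also clean the files up after the framework is done):

```python
import os
import tempfile

def materialize_partition(records, suffix=".bin"):
    # Write each record of a partition to a local temp file and return
    # the paths, so a DL framework's file-based input pipeline can read
    # them with its optimized disk I/O.
    tmpdir = tempfile.mkdtemp(prefix="dl_partition_")
    paths = []
    for i, payload in enumerate(records):
        path = os.path.join(tmpdir, "%05d%s" % (i, suffix))
        with open(path, "wb") as f:
            f.write(payload)
        paths.append(path)
    return paths

# Local usage; in Spark this would run inside mapPartitions:
paths = materialize_partition([b"img0", b"img1"])
```

For small datasets the opposite direction applies: `sc.broadcast(data)` ships one read-only copy to each executor instead of touching disk at all.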
  • 17. Detour: simplifying the life of users • Deep Learning is commonly used with GPUs • A lot of work on Spark dependencies: – Few dependencies on the local machine when compiling Spark – The build process works well in a large number of configurations (just scala + maven) • GPUs present challenges: CUDA, support libraries, drivers, etc. – Deep software stack, requires careful construction (hardware + drivers + CUDA + libraries) – All of these must be in place before the user can start – Turnkey stacks are just starting to appear 17
  • 18. Simplifying the life of users • Provide a Docker image with all the GPU SDK (container: nvidia-docker, lxc, etc.) • Pre-install GPU drivers on the instance 18 [Stack diagram, bottom to top: GPU hardware → Linux kernel → NV kernel driver → CUDA + NV kernel driver (userspace interface) → CuBLAS / CuDNN → deep learning libraries (TensorFlow, etc.) / JCUDA → Python / JVM clients]
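A concrete illustration of this setup (a sketch, not from the slides): with the NVIDIA driver pre-installed on the host, the rest of the stack ships inside the image. The stock TensorFlow GPU image is used here as an example.

```shell
# nvidia-docker mounts the host driver into the container, which
# carries CUDA, CuDNN and the DL library itself.
nvidia-docker run -it tensorflow/tensorflow:latest-gpu \
    python -c "import tensorflow as tf; print(tf.test.is_gpu_available())"
```

Everything above the kernel driver is pinned by the image, so the same container runs identically across heterogeneous cluster nodes.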
  • 20. Monitoring • How do you monitor the progress of your tasks? • It depends on the granularity – Around tasks – Inside (long-running) tasks 20
  • 21. Monitoring: Accumulators • Good to check throughput or failure rate • Works for Scala • Limited use for Python (for now, SPARK-2868) • No “real-time” update 21
      acc = sc.accumulator(0)
      def processBatch(batch):
          global acc
          acc += 1
          # Process image batch here
          return batch
      images = sc.parallelize(…)
      images.map(processBatch).collect()
  • 22. Monitoring: external system • Plugs into an external system • Existing solutions: Grafana, Graphite, Prometheus, etc. • Most flexible, but more complex to deploy 22
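As a sketch of the external-system route, tasks can emit datapoints in Graphite's plaintext protocol (`<metric.path> <value> <unix_timestamp>\n` over TCP). The metric name, host and port below are hypothetical; a real reporter would batch datapoints and keep the connection open instead of reconnecting per point.

```python
import socket
import time

def graphite_line(metric, value, timestamp=None):
    # Format one datapoint in Graphite's plaintext protocol.
    ts = int(timestamp if timestamp is not None else time.time())
    return "%s %s %d\n" % (metric, value, ts)

def push_metric(host, port, metric, value):
    # One TCP connection per datapoint: simple but inefficient,
    # fine for coarse progress reporting from long-running tasks.
    line = graphite_line(metric, value)
    with socket.create_connection((host, port), timeout=5) as s:
        s.sendall(line.encode("ascii"))

# Formatting only (no server needed), with a fixed timestamp:
line = graphite_line("dl.training.batches_processed", 128,
                     timestamp=1490000000)
```

Calling `push_metric` from inside a task gives near-real-time visibility that accumulators cannot, since accumulator values only surface when tasks complete.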
  • 23. Conclusion • Distributed deep learning: an exciting and fast-moving space • Most insights are specific to a task, a dataset and an algorithm: nothing replaces experiments • Easy to get started with parallel jobs • Move up in complexity only when you have enough data and not enough patience 23
  • 24. Challenges to address • For Spark developers – Monitoring long-running tasks – Presenting and introspecting intermediate results • For DL developers: – What boundary to put between the algorithm and Spark? – How to integrate with Spark at the low-level? 24
  • 25. Thank You. Check our blog post on Deep Learning and GPUs! https://docs.databricks.com/applications/deep-learning/index.html Office hours: today 3:50 at the Databricks booth

Editor’s notes

  1. Hello everyone, welcome to this session. My name is Timothy Hunter and it is my pleasure to talk to you about deep learning with Spark. Before we start, a couple of house rules: I hope to have an interactive discussion with you. Feel free to interrupt and raise your hands if you want to share a comment or a question.
  2. But first, let me introduce myself briefly. I am a software engineer at databricks, and I have been writing machine learning algorithms with Spark since version 0.2.
  3. I expect everyone here to have heard about deep learning. These are recent techniques for training complex models structured as neural networks. They are at the origin of a lot of recent advances in computer vision, speech recognition, even gaming and robotics. If you read the mainstream press, you may even believe they have some supernatural powers and will soon replace all of us in the room. But I want to talk today about the challenges that one encounters when trying to use them in the field.
  4. what are the considerations when evaluating a DL framework, or integrating with Spark? There is performance of course, but in practice, there are many other aspects to consider. The most important is that deep learning or machine learning does not work in isolation. It needs to cooperate with a large number of other services in a production environment. (usually a compute-intensive one)
  5. What sort of tasks do you expect to see deep learning in? The first one is TRAINING. We want to build a network that solves a specific task. One of the most famous tasks is, for example, building a network that assigns labels to images. How does this fit into a full pipeline? Usually, it is a very specific part of a much larger data pipeline. This pipeline typically starts by ingesting data from multiple data sources, etc. One of the most remarkable characteristics is that the deep learning step has very different resource requirements from the other tasks.
  6. The second typical usage is to transform some data. The data has already been collected, and it may be very massive. A trained neural network is then applied, for example to build a new representation of the data, or to augment it with new information such as labels, etc. In that case, each row of data can be processed independently, and the issue is how to apply this transform at large scale and high speed. Depending on the data and the operational intensity of the transform, it is not clear whether the limiting factor is going to be the IO (ingesting the data) or the compute power (the transform itself).
  7. How do you integrate Spark with deep learning frameworks? At Databricks, we see a couple of patterns emerging. The simplest: just use Spark to schedule a large number of independent tasks. The data is stored outside Spark. Spark just sees that some computations are happening on each worker. The next level: transform data stored by Spark (RDD/Dataframe). For good performance, one should use DL framework with some bindings with Spark. The most complex: the transform does not follow the traditional divide-and-conquer Spark model and requires specific communication patterns.
  8. Because the transforms are so intense in computations, it is common to use some specialized hardware for speed. The easiest and most common solution is GPUs. At the same time, GPUs are hard to integrate for a number of reasons. … Spark gives you the promise that you can write some code on your laptop, and it will work as expected on a few thousand computers sitting on a different continent. Deep learning with GPUs breaks this assumption. This is especially a problem when working on large clusters with tens or hundreds of machines, in which there may be various hardware generations, configurations, etc.
  9. This is why a typical solution is to package all the dependencies in a way that is independent from the underlying hardware. Virtualization is not always aware of GPU requirements, so a typical solution is to use containers.
  10. We won't have time for a demo, but here are some inputs: + highlights