"We all know the unprecedented potential impact for Machine Learning. But how do you take advantage of the myriad of data and ML tools now available? How do you streamline processes, speed up discovery, share knowledge, and scale up implementations for real-life scenarios?
In this talk, we'll cover some of the latest innovations brought into the Databricks Unified Analytics Platform for Machine Learning. In particular, we will show you how to:
- Get started quickly using the Databricks Runtime for Machine Learning, which provides preconfigured Databricks clusters including the most popular ML frameworks and libraries, Conda support, performance optimizations, and more.
- Get up and running with the most popular deep learning frameworks within minutes, and go deep with state-of-the-art DL model diagnostic tools.
- Scale deep learning training workloads from a single machine to large clusters for the most demanding applications with ease, using the new HorovodRunner.
- Expose all of these ML frameworks to large, distributed datasets using Databricks Runtime for Machine Learning."
4. Broad Adoption of ML
#UnifiedAnalytics #SparkAISummit
Disruptive innovations are affecting most enterprises on the planet:
- Internet of Things
- Digital Personalization
- Healthcare and Genomics
- Fraud Prevention
…and many more customers in different industries and segments.
5. Hidden Tech Debt in ML Systems
[Diagram: the small "ML Code" box sits at the center, surrounded by much larger boxes for Configuration, Data Collection, Data Verification, Feature Extraction, Machine Resource Management, Analysis Tools, Process Management Tools, Serving Infrastructure, and Monitoring.]
Only a small fraction of a real-world ML system is the ML code itself, shown as the small green box in the middle. The required surrounding infrastructure is vast and complex.
“Hidden Technical Debt in Machine Learning Systems,” Google NIPS 2015
7. ML Runtime: Job To Be Done
• As an ML practitioner:
1. I want to start my ML project quickly.
• Today I have to spend many hours setting up environments.
2. I want a single runtime for all steps of my work.
• I don't want to move data and code around.
9. What is Databricks Runtime for ML?
A ready-to-use environment for machine learning and data science
Built on top of, and updated with, every Databricks Runtime release
APIs for distributed deep learning on Spark (HorovodRunner)
Performance improvements for popular distributed algorithms in Spark
(GraphFrames, logistic regression, and tree classifiers)
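The HorovodRunner pattern — every worker computes gradients on its own data shard, and an allreduce averages them before each parameter update — can be sketched in plain Python. This is a conceptual stand-in only (the toy model, data, and function names are made up); the real HorovodRunner wraps Horovod on Spark clusters:

```python
# Conceptual stand-in for HorovodRunner-style data-parallel training
# (toy model and names are made up; the real API wraps Horovod on Spark).
# Each "worker" computes the gradient of a squared loss on its own data
# shard, and an allreduce averages the gradients before every update.

def shard_gradient(w, shard):
    """Mean-squared-error gradient of y ~ w*x on one worker's shard."""
    return sum(2.0 * (w * x - y) * x for x, y in shard) / len(shard)

def allreduce_mean(grads):
    """Average gradients across workers (what Horovod's allreduce does)."""
    return sum(grads) / len(grads)

def train(shards, w=0.0, lr=0.01, steps=100):
    for _ in range(steps):
        grads = [shard_gradient(w, s) for s in shards]  # parallel in reality
        w -= lr * allreduce_mean(grads)
    return w

# Synthetic data y = 3x, split round-robin across 4 "workers".
data = [(float(x), 3.0 * x) for x in range(1, 9)]
shards = [data[i::4] for i in range(4)]
w = train(shards)  # converges toward 3.0
```

Because each worker sees only its shard, averaging gradients is what keeps all replicas' parameters identical after every step — the same invariant Horovod maintains at cluster scale.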
10. What is Databricks Runtime for ML?
The ML environment is set up on all cluster nodes with a single click.
11. 1. Prepare Data
Easily access, explore, and visualize data in collaborative notebooks
Prepare data sets at scale with:
o Scala / Python / R / SQL
o Optimized Apache Spark
o Structured Streaming
o Delta Lake
o Persisted data meta store
Quickly automate notebooks with jobs
12. 2. Build Models
Support for popular open-source ML frameworks:
• TensorFlow and Tensorboard
• PyTorch
• Keras
• Horovod for distributed DL
• XGBoost
• GraphFrames
• Popular single-node tools in Python and R
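As a hypothetical illustration of the "popular single-node tools" bullet, a scikit-learn model trains unchanged on an ML Runtime driver (dataset and parameters below are arbitrary choices, not from the talk):

```python
# Hypothetical single-node example: scikit-learn ships preinstalled in
# the ML Runtime, so ordinary sklearn code runs as-is on the driver.
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

clf = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
acc = clf.score(X_te, y_te)  # held-out accuracy
```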
13. 3. Productionize ML Models
Model Deployment
MLflow API for inference on third-party services such as Docker containers, AzureML on Azure, and SageMaker on AWS
Databricks Runtime for ML includes MLeap for model serialization.
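The serialize-then-serve round trip these tools enable can be sketched with the standard library as a stand-in (the toy model below is invented; real deployments would use MLflow model flavors or MLeap bundles rather than raw pickle):

```python
# Minimal stand-in for the model-serialization round trip: train, save
# an artifact, reload it in a serving process, and predict.
import pickle

class ThresholdModel:
    """Toy model: predicts 1 when the input exceeds a fixed threshold."""
    def __init__(self, threshold):
        self.threshold = threshold

    def predict(self, xs):
        return [int(x > self.threshold) for x in xs]

model = ThresholdModel(threshold=0.5)
blob = pickle.dumps(model)       # "log" the model artifact
served = pickle.loads(blob)      # load it inside the serving service
preds = served.predict([0.2, 0.9])
```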
15. Customer Case Study: Hotels.com (Vision)
Challenge
• 325,000 listed hotels and a massive volume of image files
• Apply ML to improve the match between travelers and hotels with a personalized viewing experience
Solution
• Leveraged Databricks to train DL models on 100% of image data
• Increased processing power by 20x and enabled real-time scoring
Result
• Hotels.com significantly improved customer engagement and conversions by improving personalization models
• Customer Case Study: databricks.com/customers/hotels-com
16. Customer Case Study: Riot Games (NLP)
Challenge
• More than 100 million gamers every month
• 2% of all games affected by serious toxicity
Solution
• Leveraged Databricks to apply NLP & ML to proactively identify abusive language
• Scaled training to a much larger dataset and hyperparameter tuning
Result
• Riot Games increased customer satisfaction, retention, and lifetime value by detecting abusive language in real time
• Customer Case Study: databricks.com/customers/riot-games
17. Customer Case Study: Nielsen (IoT)
Challenge
• Offer insights into what consumers buy and watch
• Scale from single-machine data science to large datasets to improve product offerings
Solution
• Leveraged Databricks to ensure collaboration across teams
• Reduced annual cost by 40% and improved model performance by one third
Result
• Nielsen improved its competitive offering by applying ML to batch & live-stream data from IoT devices
• Customer Case Study: databricks.com/customers/nielsen
19. High-level Engineering Goals
• Reproducible environments
– Package & dependency management
• Testability
– Testing & QA infrastructure and process
• Cross-compatibility
– Careful configuration of all packages to be compatible
• Performance optimization
– High-performance I/O
20. Package Management
• Package management
• Environment management
– Python 2.x & Python 3.x environments
• Environment is selected during cluster setup
• Latest stable versions from the Anaconda distribution
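As a rough illustration, the pinned environment resembles a Conda spec like the following (package names and versions here are illustrative, not the runtime's actual manifest):

```yaml
# Hypothetical sketch of a pinned Conda environment; the real ML
# Runtime manifest differs per release.
name: databricks-ml
dependencies:
  - python=3.7
  - numpy
  - pandas
  - scikit-learn
  - tensorflow
  - pip:
      - horovod
```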
21. Python Environments
• ML Runtime vs. Databricks Runtime
– Upgraded packages
– Conda vs. pip
– Additional ML packages
• MKL for CPU acceleration
• CUDA & cuDNN for GPU acceleration
22. Dependency Management
• Bazel as the build system
• Audit files for change detection
– Python: Conda
– JARs: Maven
– R: MRAN
– Native: Ubuntu APT and Docker
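The audit-file idea can be sketched as a checksum over the resolved dependency pins, so any drift in the environment fails a build-time check (a minimal sketch; Databricks' actual Bazel tooling and file formats are not shown, and the pins below are made up):

```python
# Sketch of an "audit file": a stable digest over the resolved set of
# name==version pins. If a rebuild resolves different versions, the
# digest changes and the build can flag the drift for review.
import hashlib

def audit_digest(pinned_deps):
    """Order-independent digest over name==version pins."""
    canonical = "\n".join(sorted(f"{n}=={v}" for n, v in pinned_deps.items()))
    return hashlib.sha256(canonical.encode()).hexdigest()

expected = audit_digest({"numpy": "1.16.2", "pandas": "0.24.2"})
current = audit_digest({"numpy": "1.16.3", "pandas": "0.24.2"})  # upgraded
changed = current != expected  # True: the environment drifted
```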
23. Docker Containers
• We internally use Docker to build Databricks Runtime images
– Full control over content
– Reproducible and automated
• Runtime for ML is a layer on top of DBR
– MLR benefits from all existing DBR tests and QA
– MLR gets every hotfix and patch that goes into DBR
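Conceptually, the layering looks like a Dockerfile that extends the base runtime image (image names and paths below are invented for illustration; the internal build is not public):

```dockerfile
# Hypothetical layering sketch: the ML Runtime image extends the base
# Databricks Runtime image, so every DBR fix flows into MLR on rebuild.
FROM databricks/runtime:5.3
COPY ml-environment.yml /tmp/ml-environment.yml
RUN conda env update -n base -f /tmp/ml-environment.yml
```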
24. Extensive Integration Testing
• Extensive tests for top-tier packages
• Each commit runs unit and integration tests
• Nightly tests on master and release branches
• All CPU and GPU instances on Azure & AWS
• Integration Tests:
– Launch a Docker container and run code
– Launch a cluster and execute notebooks
25. High Performance FUSE
• Why Filesystem in Userspace (FUSE)?
• We use high-throughput FUSE clients for ML/DL
– Azure Storage FUSE on Azure
– Goofys on AWS
• The mount points are pre-configured in the ML Runtime at dbfs:/ml
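Because the FUSE mount exposes cloud storage with local-file semantics, DL input pipelines can use ordinary file I/O. A minimal sketch, using a temporary directory to stand in for the dbfs:/ml mount (which appears as a local path, e.g. /dbfs/ml):

```python
# Plain-filesystem listing, as a DL data loader would do it against the
# FUSE mount. A tempfile directory stands in for /dbfs/ml here so the
# sketch is self-contained.
import os
import tempfile

def list_training_files(mount_dir, suffix=".jpg"):
    """Enumerate training files under the (FUSE-mounted) directory."""
    return sorted(
        os.path.join(mount_dir, f)
        for f in os.listdir(mount_dir)
        if f.endswith(suffix)
    )

with tempfile.TemporaryDirectory() as fake_mount:
    for name in ("a.jpg", "b.jpg", "notes.txt"):
        open(os.path.join(fake_mount, name), "w").close()
    files = list_training_files(fake_mount)
    names = [os.path.basename(p) for p in files]
```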
28. GA of Runtime for ML
• Release history:
– 4.1 Beta: June 2018
– …
– 5.3 GA: April 2019
– 5.4: May 2019
– 6.0: Second Half 2019
29. Roadmap for Environment
• DBR with Conda (Beta)
– Enables customizable environments
– Databricks Runtime & Databricks Runtime for ML will continue to be supported
• 6.0
– Unify all into single Runtime
– Considering removing Python 2.x