SlideShare a Scribd company logo
1 of 31
Download to read offline
WIFI SSID:SparkAISummit | Password: UnifiedAnalytics
Hossein Falaki & Yifan Cao, Databricks Inc.
Accelerating Machine
Learning on Databricks
Runtime
#UnifiedAnalytics #SparkAISummit
Outline
Databricks Runtime for ML
Use Case Examples
Under the Hood
Demo
What is Next
3#UnifiedAnalytics #SparkAISummit
Broad Adoption of ML
4#UnifiedAnalytics #SparkAISummit
and many more customers in different industries and segments
Internet of ThingsDigital Personalization
Disruptive innovations are affecting most enterprises on the planet
Healthcare and Genomics Fraud Prevention
Hidden Tech Debt in ML Systems
5#UnifiedAnalytics #SparkAISummit
ML
Code
Configuration
Data	Collection
Data	
Verification
Feature	
Extraction
Machine	
Resource	
Management
Analysis	Tools
Process
Management	Tools
Serving
Infrastructure
Monitoring
Small fraction of real-world ML systems is composed of the ML code, as shown by the small green
box in the middle. The required surrounding infrastructure is vast and complex.
“Hidden Technical Debt in Machine Learning Systems,” Google NIPS 2015
6#UnifiedAnalytics #SparkAISummit
ML Runtime: Job To Be Done
• As an ML practitioner
1. I want to quickly start with my ML project
• Today I have to spend many hours setting up environments
2. I want a single runtime for
• I don’t want to move data and code around
7#UnifiedAnalytics #SparkAISummit
all steps of my work
ML Project Stages
8#UnifiedAnalytics #SparkAISummit
Databricks Runtime for ML
Build
Models
ProductionizePrepare
Quality Data
What is Databricks Runtime for ML?
A ready to use environment for machine learning and data science
Built on top of and updated with every Databricks Runtime release
APIs for distributed deep learning on Spark (HorovodRunner)
Performance improvement for popular distributed algorithms in Spark
(GraphFrames, logistic regression and tree classifiers)
9#UnifiedAnalytics #SparkAISummit
What is Databricks Runtime for ML?
ML Environment is
setup on all cluster
nodes with a single
click.
10#UnifiedAnalytics #SparkAISummit
1. Prepare Data
Easily access, explore, and visualize
data in collaborative notebooks
Prepare data sets at scale with:
o Scala / Python / R / SQL
o Optimized Apache Spark
o Structured Streaming
o Delta Lake
o Persisted data meta store
Quickly automate notebooks with jobs
11#UnifiedAnalytics #SparkAISummit
2. Build Models
12#UnifiedAnalytics #SparkAISummit
Support for popular open source ML
frameworks
• TensorFlow and Tensorboard
• PyTorch
• Keras
• Horovod for distributed DL
• XGBoost
• GraphFrames
• Popular single node tools
in Python and R
3. Productionize ML Models
Model Deployment
MLflow API for inference on
third-party services like Docker
containers, AzureML on Azure,
SageMaker on AWS
Databricks Runtime for ML includes
mleap for model serialization.
13#UnifiedAnalytics #SparkAISummit
Use Case Examples
14#UnifiedAnalytics #SparkAISummit
15
Challenge
• 325,000 listed hotels, massive volume of image files
• Apply ML to improve match between traveler and hotels with personalized viewing
experience
Solution
• Leverage Databricks to train DL models on 100% of image data
• Increase processing power by 20X and enable real-time scoring
Result
• Hotels.com significantly improved customer engagement and conversions by improving
personalization models
• Customer Case Study: databricks.com/customers/hotels-com
Vision
16
Challenge
• >100 million gamers every month
• 2% of all games infected by serious toxicity
Solution
• Leveraged Databricks to apply NLP & ML to proactively identify abusive language
• Scaled training on much larger dataset and hyperparameter tuning
Result
• Riot Games increased customer satisfaction, retention, and lifetime value by detecting
abusive language in real-time
• Customer Case Study: databricks.com/customers/riot-games
NLP
17
Challenge
• Offer insights to what consumers buy and watch
• Scale from single-machine data science to large datasets to improve product offerings
Solution
• Leveraged Databricks to ensure collaboration across teams
• Reduced annual cost by 40% and improved model performance by 1/3
Result
• Nielsen improved competitive offering by applying ML to batch & live stream data from IOT
devices
• Customer Case Study: databricks.com/customers/nielsen
IOT
Under the Hood
18#UnifiedAnalytics #SparkAISummit
High-level Engineering Goals
• Reproducible environments
– Package & dependency management
• Testability
– Testing & QA infrastructure and process
• Cross-compatibility
– Careful configuration of all packages to be compatible
• Performance optimization
– High-performance I/O
19#UnifiedAnalytics #SparkAISummit
Package Management
• Package management
• Environment management
– Python 2.x & Python 3.x environments
• Environment is selected during cluster setup
• Latest stable versions from Anaconda
distribution
20#UnifiedAnalytics #SparkAISummit
Python Environments
• ML Runtime vs. Databricks Runtime
– Upgraded packages
– Conda vs. pip
– Additional ML packages
• MKL for CPU acceleration
• CUDA & cuDNN for GPU acceleration
21#UnifiedAnalytics #SparkAISummit
Dependency Management
• bazel for build system
• Audit files for change detection
– Python: Conda
– JAR: maven
– R: MRAN
– Native: Ubuntu APT and Docker
22#UnifiedAnalytics #SparkAISummit
Docker Containers
• We internally use Docker to build Databricks
Runtime images
– Full control over content
– Reproducible and automated
• Runtime for ML is a layer on top of DBR
– MLR benefits from all existing DBR tests and QA
– MLR gets every hotfix and patch that goes into DBR
23#UnifiedAnalytics #SparkAISummit
Extensive Integration Testing
• Extensive tests for top-tier packages
• Each commit runs unit and integration tests
• Nightly tests on master and released branches
• All CPU and GPU instances on Azure & AWS
• Integration Tests:
– Launch a docker container and run code
– Launch a cluster and execute notebooks
24#UnifiedAnalytics #SparkAISummit
High Performance FUSE
• Why Filesystem in userspace (FUSE)?
• We use high-throughput FUSE clients for ML/DL
– Azure Storage FUSE on Azure
– Goofys on AWS
• The mounts points are pre-configured on ML
Runtime at dfbs:/ml
25#UnifiedAnalytics #SparkAISummit
Demo
26#UnifiedAnalytics #SparkAISummit
What is Next?
27#UnifiedAnalytics #SparkAISummit
GA of Runtime for ML
• Release history:
– 4.1 Beta: June 2018
– …
– 5.3 GA: April 2019
– 5.4: May 2019
– 6.0: Second Half 2019
28#UnifiedAnalytics #SparkAISummit
Roadmap for Environment
• DBR with Conda (Beta)
– Enable customizable environment
– Databricks Runtime & Databricks Runtime for ML will
continue to be supported
• 6.0
– Unify all into single Runtime
– Considering removing Python 2.x
29#UnifiedAnalytics #SparkAISummit
Our Vision for 6.0
30#UnifiedAnalytics #SparkAISummit
DON’T FORGET TO RATE
AND REVIEW THE SESSIONS
SEARCH SPARK + AI SUMMIT

More Related Content

What's hot

Databricks: What We Have Learned by Eating Our Dog Food
Databricks: What We Have Learned by Eating Our Dog FoodDatabricks: What We Have Learned by Eating Our Dog Food
Databricks: What We Have Learned by Eating Our Dog FoodDatabricks
 
Keeping Identity Graphs In Sync With Apache Spark
Keeping Identity Graphs In Sync With Apache SparkKeeping Identity Graphs In Sync With Apache Spark
Keeping Identity Graphs In Sync With Apache SparkDatabricks
 
Apache Spark-Based Stratification Library for Machine Learning Use Cases at N...
Apache Spark-Based Stratification Library for Machine Learning Use Cases at N...Apache Spark-Based Stratification Library for Machine Learning Use Cases at N...
Apache Spark-Based Stratification Library for Machine Learning Use Cases at N...Databricks
 
How to Productionize Your Machine Learning Models Using Apache Spark MLlib 2....
How to Productionize Your Machine Learning Models Using Apache Spark MLlib 2....How to Productionize Your Machine Learning Models Using Apache Spark MLlib 2....
How to Productionize Your Machine Learning Models Using Apache Spark MLlib 2....Databricks
 
When Apache Spark Meets TiDB with Xiaoyu Ma
When Apache Spark Meets TiDB with Xiaoyu MaWhen Apache Spark Meets TiDB with Xiaoyu Ma
When Apache Spark Meets TiDB with Xiaoyu MaDatabricks
 
Apache® Spark™ MLlib: From Quick Start to Scikit-Learn
Apache® Spark™ MLlib: From Quick Start to Scikit-LearnApache® Spark™ MLlib: From Quick Start to Scikit-Learn
Apache® Spark™ MLlib: From Quick Start to Scikit-LearnDatabricks
 
Apache Spark's MLlib's Past Trajectory and new Directions
Apache Spark's MLlib's Past Trajectory and new DirectionsApache Spark's MLlib's Past Trajectory and new Directions
Apache Spark's MLlib's Past Trajectory and new DirectionsDatabricks
 
Jump Start with Apache Spark 2.0 on Databricks
Jump Start with Apache Spark 2.0 on DatabricksJump Start with Apache Spark 2.0 on Databricks
Jump Start with Apache Spark 2.0 on DatabricksAnyscale
 
Radical Speed for SQL Queries on Databricks: Photon Under the Hood
Radical Speed for SQL Queries on Databricks: Photon Under the HoodRadical Speed for SQL Queries on Databricks: Photon Under the Hood
Radical Speed for SQL Queries on Databricks: Photon Under the HoodDatabricks
 
Extending Apache Spark APIs Without Going Near Spark Source or a Compiler wi...
 Extending Apache Spark APIs Without Going Near Spark Source or a Compiler wi... Extending Apache Spark APIs Without Going Near Spark Source or a Compiler wi...
Extending Apache Spark APIs Without Going Near Spark Source or a Compiler wi...Databricks
 
Lightning-Fast Analytics for Workday Transactional Data with Pavel Hardak and...
Lightning-Fast Analytics for Workday Transactional Data with Pavel Hardak and...Lightning-Fast Analytics for Workday Transactional Data with Pavel Hardak and...
Lightning-Fast Analytics for Workday Transactional Data with Pavel Hardak and...Databricks
 
Hyperspace for Delta Lake
Hyperspace for Delta LakeHyperspace for Delta Lake
Hyperspace for Delta LakeDatabricks
 
Model Parallelism in Spark ML Cross-Validation with Nick Pentreath and Bryan ...
Model Parallelism in Spark ML Cross-Validation with Nick Pentreath and Bryan ...Model Parallelism in Spark ML Cross-Validation with Nick Pentreath and Bryan ...
Model Parallelism in Spark ML Cross-Validation with Nick Pentreath and Bryan ...Databricks
 
Semi-Supervised Learning In An Adversarial Environment
Semi-Supervised Learning In An Adversarial EnvironmentSemi-Supervised Learning In An Adversarial Environment
Semi-Supervised Learning In An Adversarial EnvironmentDataWorks Summit
 
Pandas UDF: Scalable Analysis with Python and PySpark
Pandas UDF: Scalable Analysis with Python and PySparkPandas UDF: Scalable Analysis with Python and PySpark
Pandas UDF: Scalable Analysis with Python and PySparkLi Jin
 
Vectorized Deep Learning Acceleration from Preprocessing to Inference and Tra...
Vectorized Deep Learning Acceleration from Preprocessing to Inference and Tra...Vectorized Deep Learning Acceleration from Preprocessing to Inference and Tra...
Vectorized Deep Learning Acceleration from Preprocessing to Inference and Tra...Databricks
 
A Predictive Analytics Workflow on DICOM Images using Apache Spark with Anahi...
A Predictive Analytics Workflow on DICOM Images using Apache Spark with Anahi...A Predictive Analytics Workflow on DICOM Images using Apache Spark with Anahi...
A Predictive Analytics Workflow on DICOM Images using Apache Spark with Anahi...Databricks
 
What to Expect for Big Data and Apache Spark in 2017
What to Expect for Big Data and Apache Spark in 2017 What to Expect for Big Data and Apache Spark in 2017
What to Expect for Big Data and Apache Spark in 2017 Databricks
 
Spark summit 2017- Transforming B2B sales with Spark powered sales intelligence
Spark summit 2017- Transforming B2B sales with Spark powered sales intelligenceSpark summit 2017- Transforming B2B sales with Spark powered sales intelligence
Spark summit 2017- Transforming B2B sales with Spark powered sales intelligenceWei Di
 
Apache Spark for Cyber Security in an Enterprise Company
Apache Spark for Cyber Security in an Enterprise CompanyApache Spark for Cyber Security in an Enterprise Company
Apache Spark for Cyber Security in an Enterprise CompanyDatabricks
 

What's hot (20)

Databricks: What We Have Learned by Eating Our Dog Food
Databricks: What We Have Learned by Eating Our Dog FoodDatabricks: What We Have Learned by Eating Our Dog Food
Databricks: What We Have Learned by Eating Our Dog Food
 
Keeping Identity Graphs In Sync With Apache Spark
Keeping Identity Graphs In Sync With Apache SparkKeeping Identity Graphs In Sync With Apache Spark
Keeping Identity Graphs In Sync With Apache Spark
 
Apache Spark-Based Stratification Library for Machine Learning Use Cases at N...
Apache Spark-Based Stratification Library for Machine Learning Use Cases at N...Apache Spark-Based Stratification Library for Machine Learning Use Cases at N...
Apache Spark-Based Stratification Library for Machine Learning Use Cases at N...
 
How to Productionize Your Machine Learning Models Using Apache Spark MLlib 2....
How to Productionize Your Machine Learning Models Using Apache Spark MLlib 2....How to Productionize Your Machine Learning Models Using Apache Spark MLlib 2....
How to Productionize Your Machine Learning Models Using Apache Spark MLlib 2....
 
When Apache Spark Meets TiDB with Xiaoyu Ma
When Apache Spark Meets TiDB with Xiaoyu MaWhen Apache Spark Meets TiDB with Xiaoyu Ma
When Apache Spark Meets TiDB with Xiaoyu Ma
 
Apache® Spark™ MLlib: From Quick Start to Scikit-Learn
Apache® Spark™ MLlib: From Quick Start to Scikit-LearnApache® Spark™ MLlib: From Quick Start to Scikit-Learn
Apache® Spark™ MLlib: From Quick Start to Scikit-Learn
 
Apache Spark's MLlib's Past Trajectory and new Directions
Apache Spark's MLlib's Past Trajectory and new DirectionsApache Spark's MLlib's Past Trajectory and new Directions
Apache Spark's MLlib's Past Trajectory and new Directions
 
Jump Start with Apache Spark 2.0 on Databricks
Jump Start with Apache Spark 2.0 on DatabricksJump Start with Apache Spark 2.0 on Databricks
Jump Start with Apache Spark 2.0 on Databricks
 
Radical Speed for SQL Queries on Databricks: Photon Under the Hood
Radical Speed for SQL Queries on Databricks: Photon Under the HoodRadical Speed for SQL Queries on Databricks: Photon Under the Hood
Radical Speed for SQL Queries on Databricks: Photon Under the Hood
 
Extending Apache Spark APIs Without Going Near Spark Source or a Compiler wi...
 Extending Apache Spark APIs Without Going Near Spark Source or a Compiler wi... Extending Apache Spark APIs Without Going Near Spark Source or a Compiler wi...
Extending Apache Spark APIs Without Going Near Spark Source or a Compiler wi...
 
Lightning-Fast Analytics for Workday Transactional Data with Pavel Hardak and...
Lightning-Fast Analytics for Workday Transactional Data with Pavel Hardak and...Lightning-Fast Analytics for Workday Transactional Data with Pavel Hardak and...
Lightning-Fast Analytics for Workday Transactional Data with Pavel Hardak and...
 
Hyperspace for Delta Lake
Hyperspace for Delta LakeHyperspace for Delta Lake
Hyperspace for Delta Lake
 
Model Parallelism in Spark ML Cross-Validation with Nick Pentreath and Bryan ...
Model Parallelism in Spark ML Cross-Validation with Nick Pentreath and Bryan ...Model Parallelism in Spark ML Cross-Validation with Nick Pentreath and Bryan ...
Model Parallelism in Spark ML Cross-Validation with Nick Pentreath and Bryan ...
 
Semi-Supervised Learning In An Adversarial Environment
Semi-Supervised Learning In An Adversarial EnvironmentSemi-Supervised Learning In An Adversarial Environment
Semi-Supervised Learning In An Adversarial Environment
 
Pandas UDF: Scalable Analysis with Python and PySpark
Pandas UDF: Scalable Analysis with Python and PySparkPandas UDF: Scalable Analysis with Python and PySpark
Pandas UDF: Scalable Analysis with Python and PySpark
 
Vectorized Deep Learning Acceleration from Preprocessing to Inference and Tra...
Vectorized Deep Learning Acceleration from Preprocessing to Inference and Tra...Vectorized Deep Learning Acceleration from Preprocessing to Inference and Tra...
Vectorized Deep Learning Acceleration from Preprocessing to Inference and Tra...
 
A Predictive Analytics Workflow on DICOM Images using Apache Spark with Anahi...
A Predictive Analytics Workflow on DICOM Images using Apache Spark with Anahi...A Predictive Analytics Workflow on DICOM Images using Apache Spark with Anahi...
A Predictive Analytics Workflow on DICOM Images using Apache Spark with Anahi...
 
What to Expect for Big Data and Apache Spark in 2017
What to Expect for Big Data and Apache Spark in 2017 What to Expect for Big Data and Apache Spark in 2017
What to Expect for Big Data and Apache Spark in 2017
 
Spark summit 2017- Transforming B2B sales with Spark powered sales intelligence
Spark summit 2017- Transforming B2B sales with Spark powered sales intelligenceSpark summit 2017- Transforming B2B sales with Spark powered sales intelligence
Spark summit 2017- Transforming B2B sales with Spark powered sales intelligence
 
Apache Spark for Cyber Security in an Enterprise Company
Apache Spark for Cyber Security in an Enterprise CompanyApache Spark for Cyber Security in an Enterprise Company
Apache Spark for Cyber Security in an Enterprise Company
 

Similar to Accelerating Machine Learning on Databricks Runtime

Navigating the ML Pipeline Jungle with MLflow: Notes from the Field with Thun...
Navigating the ML Pipeline Jungle with MLflow: Notes from the Field with Thun...Navigating the ML Pipeline Jungle with MLflow: Notes from the Field with Thun...
Navigating the ML Pipeline Jungle with MLflow: Notes from the Field with Thun...Databricks
 
Serverless machine learning architectures at Helixa
Serverless machine learning architectures at HelixaServerless machine learning architectures at Helixa
Serverless machine learning architectures at HelixaData Science Milan
 
Neo4j GraphTalks Oslo - Graph Your Business - Rik Van Bruggen, Neo4j
Neo4j GraphTalks Oslo - Graph Your Business - Rik Van Bruggen, Neo4jNeo4j GraphTalks Oslo - Graph Your Business - Rik Van Bruggen, Neo4j
Neo4j GraphTalks Oslo - Graph Your Business - Rik Van Bruggen, Neo4jNeo4j
 
With Automated ML, is Everyone an ML Engineer?
With Automated ML, is Everyone an ML Engineer?With Automated ML, is Everyone an ML Engineer?
With Automated ML, is Everyone an ML Engineer?Dan Sullivan, Ph.D.
 
Building a Scalable and reliable open source ML Platform with MLFlow
Building a Scalable and reliable open source ML Platform with MLFlowBuilding a Scalable and reliable open source ML Platform with MLFlow
Building a Scalable and reliable open source ML Platform with MLFlowGoDataDriven
 
Unleashing Apache Kafka and TensorFlow in the Cloud

Unleashing Apache Kafka and TensorFlow in the Cloud
Unleashing Apache Kafka and TensorFlow in the Cloud

Unleashing Apache Kafka and TensorFlow in the Cloud
Kai Wähner
 
A Collaborative Data Science Development Workflow
A Collaborative Data Science Development WorkflowA Collaborative Data Science Development Workflow
A Collaborative Data Science Development WorkflowDatabricks
 
WSO2Con USA 2015: Planning Your Cloud Strategy
WSO2Con USA 2015: Planning Your Cloud StrategyWSO2Con USA 2015: Planning Your Cloud Strategy
WSO2Con USA 2015: Planning Your Cloud StrategyWSO2
 
From Data Science to MLOps
From Data Science to MLOpsFrom Data Science to MLOps
From Data Science to MLOpsCarl W. Handlin
 
Microservices Docker Kubernetes Istio Kanban DevOps SRE
Microservices Docker Kubernetes Istio Kanban DevOps SREMicroservices Docker Kubernetes Istio Kanban DevOps SRE
Microservices Docker Kubernetes Istio Kanban DevOps SREAraf Karsh Hamid
 
Shift Remote: AI: Behind the scenes development in an AI company - Matija Ili...
Shift Remote: AI: Behind the scenes development in an AI company - Matija Ili...Shift Remote: AI: Behind the scenes development in an AI company - Matija Ili...
Shift Remote: AI: Behind the scenes development in an AI company - Matija Ili...Shift Conference
 
Shift Remote AI: Behind the Scenes Development in an AI Company - Matija Ilij...
Shift Remote AI: Behind the Scenes Development in an AI Company - Matija Ilij...Shift Remote AI: Behind the Scenes Development in an AI Company - Matija Ilij...
Shift Remote AI: Behind the Scenes Development in an AI Company - Matija Ilij...Shift Conference
 
Enterprise Trends for MongoDB as a Service
Enterprise Trends for MongoDB as a ServiceEnterprise Trends for MongoDB as a Service
Enterprise Trends for MongoDB as a ServiceMongoDB
 
Microservices at ibotta pitfalls and learnings
Microservices at ibotta pitfalls and learningsMicroservices at ibotta pitfalls and learnings
Microservices at ibotta pitfalls and learningsMatthew Reynolds
 
Cloud computing and Grid Computing
Cloud computing and Grid ComputingCloud computing and Grid Computing
Cloud computing and Grid Computingprabathsl
 
ITCamp 2019 - Mihai Tataran - Governing your Cloud Resources
ITCamp 2019 - Mihai Tataran - Governing your Cloud ResourcesITCamp 2019 - Mihai Tataran - Governing your Cloud Resources
ITCamp 2019 - Mihai Tataran - Governing your Cloud ResourcesITCamp
 
AI Scalability for the Next Decade
AI Scalability for the Next DecadeAI Scalability for the Next Decade
AI Scalability for the Next DecadePaula Koziol
 
Hybrid Transactional/Analytics Processing with Spark and IMDGs
Hybrid Transactional/Analytics Processing with Spark and IMDGsHybrid Transactional/Analytics Processing with Spark and IMDGs
Hybrid Transactional/Analytics Processing with Spark and IMDGsAli Hodroj
 
C19013010 the tutorial to build shared ai services session 1
C19013010  the tutorial to build shared ai services session 1C19013010  the tutorial to build shared ai services session 1
C19013010 the tutorial to build shared ai services session 1Bill Liu
 

Similar to Accelerating Machine Learning on Databricks Runtime (20)

Navigating the ML Pipeline Jungle with MLflow: Notes from the Field with Thun...
Navigating the ML Pipeline Jungle with MLflow: Notes from the Field with Thun...Navigating the ML Pipeline Jungle with MLflow: Notes from the Field with Thun...
Navigating the ML Pipeline Jungle with MLflow: Notes from the Field with Thun...
 
Serverless machine learning architectures at Helixa
Serverless machine learning architectures at HelixaServerless machine learning architectures at Helixa
Serverless machine learning architectures at Helixa
 
Neo4j GraphTalks Oslo - Graph Your Business - Rik Van Bruggen, Neo4j
Neo4j GraphTalks Oslo - Graph Your Business - Rik Van Bruggen, Neo4jNeo4j GraphTalks Oslo - Graph Your Business - Rik Van Bruggen, Neo4j
Neo4j GraphTalks Oslo - Graph Your Business - Rik Van Bruggen, Neo4j
 
With Automated ML, is Everyone an ML Engineer?
With Automated ML, is Everyone an ML Engineer?With Automated ML, is Everyone an ML Engineer?
With Automated ML, is Everyone an ML Engineer?
 
Building a Scalable and reliable open source ML Platform with MLFlow
Building a Scalable and reliable open source ML Platform with MLFlowBuilding a Scalable and reliable open source ML Platform with MLFlow
Building a Scalable and reliable open source ML Platform with MLFlow
 
Unleashing Apache Kafka and TensorFlow in the Cloud

Unleashing Apache Kafka and TensorFlow in the Cloud
Unleashing Apache Kafka and TensorFlow in the Cloud

Unleashing Apache Kafka and TensorFlow in the Cloud

 
IoT meets Big Data
IoT meets Big DataIoT meets Big Data
IoT meets Big Data
 
A Collaborative Data Science Development Workflow
A Collaborative Data Science Development WorkflowA Collaborative Data Science Development Workflow
A Collaborative Data Science Development Workflow
 
WSO2Con USA 2015: Planning Your Cloud Strategy
WSO2Con USA 2015: Planning Your Cloud StrategyWSO2Con USA 2015: Planning Your Cloud Strategy
WSO2Con USA 2015: Planning Your Cloud Strategy
 
From Data Science to MLOps
From Data Science to MLOpsFrom Data Science to MLOps
From Data Science to MLOps
 
Microservices Docker Kubernetes Istio Kanban DevOps SRE
Microservices Docker Kubernetes Istio Kanban DevOps SREMicroservices Docker Kubernetes Istio Kanban DevOps SRE
Microservices Docker Kubernetes Istio Kanban DevOps SRE
 
Shift Remote: AI: Behind the scenes development in an AI company - Matija Ili...
Shift Remote: AI: Behind the scenes development in an AI company - Matija Ili...Shift Remote: AI: Behind the scenes development in an AI company - Matija Ili...
Shift Remote: AI: Behind the scenes development in an AI company - Matija Ili...
 
Shift Remote AI: Behind the Scenes Development in an AI Company - Matija Ilij...
Shift Remote AI: Behind the Scenes Development in an AI Company - Matija Ilij...Shift Remote AI: Behind the Scenes Development in an AI Company - Matija Ilij...
Shift Remote AI: Behind the Scenes Development in an AI Company - Matija Ilij...
 
Enterprise Trends for MongoDB as a Service
Enterprise Trends for MongoDB as a ServiceEnterprise Trends for MongoDB as a Service
Enterprise Trends for MongoDB as a Service
 
Microservices at ibotta pitfalls and learnings
Microservices at ibotta pitfalls and learningsMicroservices at ibotta pitfalls and learnings
Microservices at ibotta pitfalls and learnings
 
Cloud computing and Grid Computing
Cloud computing and Grid ComputingCloud computing and Grid Computing
Cloud computing and Grid Computing
 
ITCamp 2019 - Mihai Tataran - Governing your Cloud Resources
ITCamp 2019 - Mihai Tataran - Governing your Cloud ResourcesITCamp 2019 - Mihai Tataran - Governing your Cloud Resources
ITCamp 2019 - Mihai Tataran - Governing your Cloud Resources
 
AI Scalability for the Next Decade
AI Scalability for the Next DecadeAI Scalability for the Next Decade
AI Scalability for the Next Decade
 
Hybrid Transactional/Analytics Processing with Spark and IMDGs
Hybrid Transactional/Analytics Processing with Spark and IMDGsHybrid Transactional/Analytics Processing with Spark and IMDGs
Hybrid Transactional/Analytics Processing with Spark and IMDGs
 
C19013010 the tutorial to build shared ai services session 1
C19013010  the tutorial to build shared ai services session 1C19013010  the tutorial to build shared ai services session 1
C19013010 the tutorial to build shared ai services session 1
 

More from Databricks

DW Migration Webinar-March 2022.pptx
DW Migration Webinar-March 2022.pptxDW Migration Webinar-March 2022.pptx
DW Migration Webinar-March 2022.pptxDatabricks
 
Data Lakehouse Symposium | Day 1 | Part 1
Data Lakehouse Symposium | Day 1 | Part 1Data Lakehouse Symposium | Day 1 | Part 1
Data Lakehouse Symposium | Day 1 | Part 1Databricks
 
Data Lakehouse Symposium | Day 1 | Part 2
Data Lakehouse Symposium | Day 1 | Part 2Data Lakehouse Symposium | Day 1 | Part 2
Data Lakehouse Symposium | Day 1 | Part 2Databricks
 
Data Lakehouse Symposium | Day 2
Data Lakehouse Symposium | Day 2Data Lakehouse Symposium | Day 2
Data Lakehouse Symposium | Day 2Databricks
 
Data Lakehouse Symposium | Day 4
Data Lakehouse Symposium | Day 4Data Lakehouse Symposium | Day 4
Data Lakehouse Symposium | Day 4Databricks
 
5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop
5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop
5 Critical Steps to Clean Your Data Swamp When Migrating Off of HadoopDatabricks
 
Democratizing Data Quality Through a Centralized Platform
Democratizing Data Quality Through a Centralized PlatformDemocratizing Data Quality Through a Centralized Platform
Democratizing Data Quality Through a Centralized PlatformDatabricks
 
Learn to Use Databricks for Data Science
Learn to Use Databricks for Data ScienceLearn to Use Databricks for Data Science
Learn to Use Databricks for Data ScienceDatabricks
 
Why APM Is Not the Same As ML Monitoring
Why APM Is Not the Same As ML MonitoringWhy APM Is Not the Same As ML Monitoring
Why APM Is Not the Same As ML MonitoringDatabricks
 
The Function, the Context, and the Data—Enabling ML Ops at Stitch Fix
The Function, the Context, and the Data—Enabling ML Ops at Stitch FixThe Function, the Context, and the Data—Enabling ML Ops at Stitch Fix
The Function, the Context, and the Data—Enabling ML Ops at Stitch FixDatabricks
 
Stage Level Scheduling Improving Big Data and AI Integration
Stage Level Scheduling Improving Big Data and AI IntegrationStage Level Scheduling Improving Big Data and AI Integration
Stage Level Scheduling Improving Big Data and AI IntegrationDatabricks
 
Simplify Data Conversion from Spark to TensorFlow and PyTorch
Simplify Data Conversion from Spark to TensorFlow and PyTorchSimplify Data Conversion from Spark to TensorFlow and PyTorch
Simplify Data Conversion from Spark to TensorFlow and PyTorchDatabricks
 
Scaling your Data Pipelines with Apache Spark on Kubernetes
Scaling your Data Pipelines with Apache Spark on KubernetesScaling your Data Pipelines with Apache Spark on Kubernetes
Scaling your Data Pipelines with Apache Spark on KubernetesDatabricks
 
Scaling and Unifying SciKit Learn and Apache Spark Pipelines
Scaling and Unifying SciKit Learn and Apache Spark PipelinesScaling and Unifying SciKit Learn and Apache Spark Pipelines
Scaling and Unifying SciKit Learn and Apache Spark PipelinesDatabricks
 
Sawtooth Windows for Feature Aggregations
Sawtooth Windows for Feature AggregationsSawtooth Windows for Feature Aggregations
Sawtooth Windows for Feature AggregationsDatabricks
 
Redis + Apache Spark = Swiss Army Knife Meets Kitchen Sink
Redis + Apache Spark = Swiss Army Knife Meets Kitchen SinkRedis + Apache Spark = Swiss Army Knife Meets Kitchen Sink
Redis + Apache Spark = Swiss Army Knife Meets Kitchen SinkDatabricks
 
Re-imagine Data Monitoring with whylogs and Spark
Re-imagine Data Monitoring with whylogs and SparkRe-imagine Data Monitoring with whylogs and Spark
Re-imagine Data Monitoring with whylogs and SparkDatabricks
 
Raven: End-to-end Optimization of ML Prediction Queries
Raven: End-to-end Optimization of ML Prediction QueriesRaven: End-to-end Optimization of ML Prediction Queries
Raven: End-to-end Optimization of ML Prediction QueriesDatabricks
 
Processing Large Datasets for ADAS Applications using Apache Spark
Processing Large Datasets for ADAS Applications using Apache SparkProcessing Large Datasets for ADAS Applications using Apache Spark
Processing Large Datasets for ADAS Applications using Apache SparkDatabricks
 
Massive Data Processing in Adobe Using Delta Lake
Massive Data Processing in Adobe Using Delta LakeMassive Data Processing in Adobe Using Delta Lake
Massive Data Processing in Adobe Using Delta LakeDatabricks
 

More from Databricks (20)

DW Migration Webinar-March 2022.pptx
DW Migration Webinar-March 2022.pptxDW Migration Webinar-March 2022.pptx
DW Migration Webinar-March 2022.pptx
 
Data Lakehouse Symposium | Day 1 | Part 1
Data Lakehouse Symposium | Day 1 | Part 1Data Lakehouse Symposium | Day 1 | Part 1
Data Lakehouse Symposium | Day 1 | Part 1
 
Data Lakehouse Symposium | Day 1 | Part 2
Data Lakehouse Symposium | Day 1 | Part 2Data Lakehouse Symposium | Day 1 | Part 2
Data Lakehouse Symposium | Day 1 | Part 2
 
Data Lakehouse Symposium | Day 2
Data Lakehouse Symposium | Day 2Data Lakehouse Symposium | Day 2
Data Lakehouse Symposium | Day 2
 
Data Lakehouse Symposium | Day 4
Data Lakehouse Symposium | Day 4Data Lakehouse Symposium | Day 4
Data Lakehouse Symposium | Day 4
 
5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop
5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop
5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop
 
Democratizing Data Quality Through a Centralized Platform
Democratizing Data Quality Through a Centralized PlatformDemocratizing Data Quality Through a Centralized Platform
Democratizing Data Quality Through a Centralized Platform
 
Learn to Use Databricks for Data Science
Learn to Use Databricks for Data ScienceLearn to Use Databricks for Data Science
Learn to Use Databricks for Data Science
 
Why APM Is Not the Same As ML Monitoring
Why APM Is Not the Same As ML MonitoringWhy APM Is Not the Same As ML Monitoring
Why APM Is Not the Same As ML Monitoring
 
The Function, the Context, and the Data—Enabling ML Ops at Stitch Fix
The Function, the Context, and the Data—Enabling ML Ops at Stitch FixThe Function, the Context, and the Data—Enabling ML Ops at Stitch Fix
The Function, the Context, and the Data—Enabling ML Ops at Stitch Fix
 
Stage Level Scheduling Improving Big Data and AI Integration
Stage Level Scheduling Improving Big Data and AI IntegrationStage Level Scheduling Improving Big Data and AI Integration
Stage Level Scheduling Improving Big Data and AI Integration
 
Simplify Data Conversion from Spark to TensorFlow and PyTorch
Simplify Data Conversion from Spark to TensorFlow and PyTorchSimplify Data Conversion from Spark to TensorFlow and PyTorch
Simplify Data Conversion from Spark to TensorFlow and PyTorch
 
Scaling your Data Pipelines with Apache Spark on Kubernetes
Scaling your Data Pipelines with Apache Spark on KubernetesScaling your Data Pipelines with Apache Spark on Kubernetes
Scaling your Data Pipelines with Apache Spark on Kubernetes
 
Scaling and Unifying SciKit Learn and Apache Spark Pipelines
Scaling and Unifying SciKit Learn and Apache Spark PipelinesScaling and Unifying SciKit Learn and Apache Spark Pipelines
Scaling and Unifying SciKit Learn and Apache Spark Pipelines
 
Sawtooth Windows for Feature Aggregations
Sawtooth Windows for Feature AggregationsSawtooth Windows for Feature Aggregations
Sawtooth Windows for Feature Aggregations
 
Redis + Apache Spark = Swiss Army Knife Meets Kitchen Sink
Redis + Apache Spark = Swiss Army Knife Meets Kitchen SinkRedis + Apache Spark = Swiss Army Knife Meets Kitchen Sink
Redis + Apache Spark = Swiss Army Knife Meets Kitchen Sink
 
Re-imagine Data Monitoring with whylogs and Spark
Re-imagine Data Monitoring with whylogs and SparkRe-imagine Data Monitoring with whylogs and Spark
Re-imagine Data Monitoring with whylogs and Spark
 
Raven: End-to-end Optimization of ML Prediction Queries
Raven: End-to-end Optimization of ML Prediction QueriesRaven: End-to-end Optimization of ML Prediction Queries
Raven: End-to-end Optimization of ML Prediction Queries
 
Processing Large Datasets for ADAS Applications using Apache Spark
Processing Large Datasets for ADAS Applications using Apache SparkProcessing Large Datasets for ADAS Applications using Apache Spark
Processing Large Datasets for ADAS Applications using Apache Spark
 
Massive Data Processing in Adobe Using Delta Lake
Massive Data Processing in Adobe Using Delta LakeMassive Data Processing in Adobe Using Delta Lake
Massive Data Processing in Adobe Using Delta Lake
 

Recently uploaded

Introduction-to-Machine-Learning (1).pptx
Introduction-to-Machine-Learning (1).pptxIntroduction-to-Machine-Learning (1).pptx
Introduction-to-Machine-Learning (1).pptxfirstjob4
 
CebaBaby dropshipping via API with DroFX.pptx
CebaBaby dropshipping via API with DroFX.pptxCebaBaby dropshipping via API with DroFX.pptx
CebaBaby dropshipping via API with DroFX.pptxolyaivanovalion
 
VIP High Profile Call Girls Amravati Aarushi 8250192130 Independent Escort Se...
VIP High Profile Call Girls Amravati Aarushi 8250192130 Independent Escort Se...VIP High Profile Call Girls Amravati Aarushi 8250192130 Independent Escort Se...
VIP High Profile Call Girls Amravati Aarushi 8250192130 Independent Escort Se...Suhani Kapoor
 
Ukraine War presentation: KNOW THE BASICS
Ukraine War presentation: KNOW THE BASICSUkraine War presentation: KNOW THE BASICS
Ukraine War presentation: KNOW THE BASICSAishani27
 
Unveiling Insights: The Role of a Data Analyst
Unveiling Insights: The Role of a Data AnalystUnveiling Insights: The Role of a Data Analyst
Unveiling Insights: The Role of a Data AnalystSamantha Rae Coolbeth
 
Low Rate Call Girls Bhilai Anika 8250192130 Independent Escort Service Bhilai
Low Rate Call Girls Bhilai Anika 8250192130 Independent Escort Service BhilaiLow Rate Call Girls Bhilai Anika 8250192130 Independent Escort Service Bhilai
Low Rate Call Girls Bhilai Anika 8250192130 Independent Escort Service BhilaiSuhani Kapoor
 
Halmar dropshipping via API with DroFx
Halmar  dropshipping  via API with DroFxHalmar  dropshipping  via API with DroFx
Halmar dropshipping via API with DroFxolyaivanovalion
 
April 2024 - Crypto Market Report's Analysis
April 2024 - Crypto Market Report's AnalysisApril 2024 - Crypto Market Report's Analysis
April 2024 - Crypto Market Report's Analysismanisha194592
 
RA-11058_IRR-COMPRESS Do 198 series of 1998
RA-11058_IRR-COMPRESS Do 198 series of 1998RA-11058_IRR-COMPRESS Do 198 series of 1998
RA-11058_IRR-COMPRESS Do 198 series of 1998YohFuh
 
Al Barsha Escorts $#$ O565212860 $#$ Escort Service In Al Barsha
Al Barsha Escorts $#$ O565212860 $#$ Escort Service In Al BarshaAl Barsha Escorts $#$ O565212860 $#$ Escort Service In Al Barsha
Al Barsha Escorts $#$ O565212860 $#$ Escort Service In Al BarshaAroojKhan71
 
BabyOno dropshipping via API with DroFx.pptx
BabyOno dropshipping via API with DroFx.pptxBabyOno dropshipping via API with DroFx.pptx
BabyOno dropshipping via API with DroFx.pptxolyaivanovalion
 
dokumen.tips_chapter-4-transient-heat-conduction-mehmet-kanoglu.ppt
dokumen.tips_chapter-4-transient-heat-conduction-mehmet-kanoglu.pptdokumen.tips_chapter-4-transient-heat-conduction-mehmet-kanoglu.ppt
dokumen.tips_chapter-4-transient-heat-conduction-mehmet-kanoglu.pptSonatrach
 
VidaXL dropshipping via API with DroFx.pptx
VidaXL dropshipping via API with DroFx.pptxVidaXL dropshipping via API with DroFx.pptx
VidaXL dropshipping via API with DroFx.pptxolyaivanovalion
 
BigBuy dropshipping via API with DroFx.pptx
BigBuy dropshipping via API with DroFx.pptxBigBuy dropshipping via API with DroFx.pptx
BigBuy dropshipping via API with DroFx.pptxolyaivanovalion
 
Dubai Call Girls Wifey O52&786472 Call Girls Dubai
Dubai Call Girls Wifey O52&786472 Call Girls DubaiDubai Call Girls Wifey O52&786472 Call Girls Dubai
Dubai Call Girls Wifey O52&786472 Call Girls Dubaihf8803863
 
定制英国白金汉大学毕业证(UCB毕业证书) 成绩单原版一比一
定制英国白金汉大学毕业证(UCB毕业证书)																			成绩单原版一比一定制英国白金汉大学毕业证(UCB毕业证书)																			成绩单原版一比一
定制英国白金汉大学毕业证(UCB毕业证书) 成绩单原版一比一ffjhghh
 
Brighton SEO | April 2024 | Data Storytelling
Brighton SEO | April 2024 | Data StorytellingBrighton SEO | April 2024 | Data Storytelling
Brighton SEO | April 2024 | Data StorytellingNeil Barnes
 
Customer Service Analytics - Make Sense of All Your Data.pptx
Customer Service Analytics - Make Sense of All Your Data.pptxCustomer Service Analytics - Make Sense of All Your Data.pptx
Customer Service Analytics - Make Sense of All Your Data.pptxEmmanuel Dauda
 
100-Concepts-of-AI by Anupama Kate .pptx
100-Concepts-of-AI by Anupama Kate .pptx100-Concepts-of-AI by Anupama Kate .pptx
100-Concepts-of-AI by Anupama Kate .pptxAnupama Kate
 

Recently uploaded (20)

Introduction-to-Machine-Learning (1).pptx
Introduction-to-Machine-Learning (1).pptxIntroduction-to-Machine-Learning (1).pptx
Introduction-to-Machine-Learning (1).pptx
 
CebaBaby dropshipping via API with DroFX.pptx
CebaBaby dropshipping via API with DroFX.pptxCebaBaby dropshipping via API with DroFX.pptx
CebaBaby dropshipping via API with DroFX.pptx
 
VIP High Profile Call Girls Amravati Aarushi 8250192130 Independent Escort Se...
VIP High Profile Call Girls Amravati Aarushi 8250192130 Independent Escort Se...VIP High Profile Call Girls Amravati Aarushi 8250192130 Independent Escort Se...
VIP High Profile Call Girls Amravati Aarushi 8250192130 Independent Escort Se...
 
Ukraine War presentation: KNOW THE BASICS
Ukraine War presentation: KNOW THE BASICSUkraine War presentation: KNOW THE BASICS
Ukraine War presentation: KNOW THE BASICS
 
Unveiling Insights: The Role of a Data Analyst
Unveiling Insights: The Role of a Data AnalystUnveiling Insights: The Role of a Data Analyst
Unveiling Insights: The Role of a Data Analyst
 
Low Rate Call Girls Bhilai Anika 8250192130 Independent Escort Service Bhilai
Low Rate Call Girls Bhilai Anika 8250192130 Independent Escort Service BhilaiLow Rate Call Girls Bhilai Anika 8250192130 Independent Escort Service Bhilai
Low Rate Call Girls Bhilai Anika 8250192130 Independent Escort Service Bhilai
 
Halmar dropshipping via API with DroFx
Halmar  dropshipping  via API with DroFxHalmar  dropshipping  via API with DroFx
Halmar dropshipping via API with DroFx
 
April 2024 - Crypto Market Report's Analysis
April 2024 - Crypto Market Report's AnalysisApril 2024 - Crypto Market Report's Analysis
April 2024 - Crypto Market Report's Analysis
 
RA-11058_IRR-COMPRESS Do 198 series of 1998
RA-11058_IRR-COMPRESS Do 198 series of 1998RA-11058_IRR-COMPRESS Do 198 series of 1998
RA-11058_IRR-COMPRESS Do 198 series of 1998
 
Al Barsha Escorts $#$ O565212860 $#$ Escort Service In Al Barsha
Al Barsha Escorts $#$ O565212860 $#$ Escort Service In Al BarshaAl Barsha Escorts $#$ O565212860 $#$ Escort Service In Al Barsha
Al Barsha Escorts $#$ O565212860 $#$ Escort Service In Al Barsha
 
BabyOno dropshipping via API with DroFx.pptx
BabyOno dropshipping via API with DroFx.pptxBabyOno dropshipping via API with DroFx.pptx
BabyOno dropshipping via API with DroFx.pptx
 
dokumen.tips_chapter-4-transient-heat-conduction-mehmet-kanoglu.ppt
dokumen.tips_chapter-4-transient-heat-conduction-mehmet-kanoglu.pptdokumen.tips_chapter-4-transient-heat-conduction-mehmet-kanoglu.ppt
dokumen.tips_chapter-4-transient-heat-conduction-mehmet-kanoglu.ppt
 
VidaXL dropshipping via API with DroFx.pptx
VidaXL dropshipping via API with DroFx.pptxVidaXL dropshipping via API with DroFx.pptx
VidaXL dropshipping via API with DroFx.pptx
 
Delhi 99530 vip 56974 Genuine Escort Service Call Girls in Kishangarh
Delhi 99530 vip 56974 Genuine Escort Service Call Girls in  KishangarhDelhi 99530 vip 56974 Genuine Escort Service Call Girls in  Kishangarh
Delhi 99530 vip 56974 Genuine Escort Service Call Girls in Kishangarh
 
BigBuy dropshipping via API with DroFx.pptx
BigBuy dropshipping via API with DroFx.pptxBigBuy dropshipping via API with DroFx.pptx
BigBuy dropshipping via API with DroFx.pptx
 
Dubai Call Girls Wifey O52&786472 Call Girls Dubai
Dubai Call Girls Wifey O52&786472 Call Girls DubaiDubai Call Girls Wifey O52&786472 Call Girls Dubai
Dubai Call Girls Wifey O52&786472 Call Girls Dubai
 
定制英国白金汉大学毕业证(UCB毕业证书) 成绩单原版一比一
定制英国白金汉大学毕业证(UCB毕业证书)																			成绩单原版一比一定制英国白金汉大学毕业证(UCB毕业证书)																			成绩单原版一比一
定制英国白金汉大学毕业证(UCB毕业证书) 成绩单原版一比一
 
Brighton SEO | April 2024 | Data Storytelling
Brighton SEO | April 2024 | Data StorytellingBrighton SEO | April 2024 | Data Storytelling
Brighton SEO | April 2024 | Data Storytelling
 
Customer Service Analytics - Make Sense of All Your Data.pptx
Customer Service Analytics - Make Sense of All Your Data.pptxCustomer Service Analytics - Make Sense of All Your Data.pptx
Customer Service Analytics - Make Sense of All Your Data.pptx
 
100-Concepts-of-AI by Anupama Kate .pptx
100-Concepts-of-AI by Anupama Kate .pptx100-Concepts-of-AI by Anupama Kate .pptx
100-Concepts-of-AI by Anupama Kate .pptx
 

Accelerating Machine Learning on Databricks Runtime

  • 1. WIFI SSID:SparkAISummit | Password: UnifiedAnalytics
  • 2. Hossein Falaki & Yifan Cao, Databricks Inc. Accelerating Machine Learning on Databricks Runtime #UnifiedAnalytics #SparkAISummit
  • 3. Outline Databricks Runtime for ML Use Case Examples Under the Hood Demo What is Next 3#UnifiedAnalytics #SparkAISummit
  • 4. Broad Adoption of ML 4#UnifiedAnalytics #SparkAISummit and many more customers in different industries and segments Internet of ThingsDigital Personalization Disruptive innovations are affecting most enterprises on the planet Healthcare and Genomics Fraud Prevention
  • 5. Hidden Tech Debt in ML Systems 5#UnifiedAnalytics #SparkAISummit ML Code Configuration Data Collection Data Verification Feature Extraction Machine Resource Management Analysis Tools Process Management Tools Serving Infrastructure Monitoring Small fraction of real-world ML systems is composed of the ML code, as shown by the small green box in the middle. The required surrounding infrastructure is vast and complex. “Hidden Technical Debt in Machine Learning Systems,” Google NIPS 2015
  • 7. ML Runtime: Job To Be Done • As an ML practitioner 1. I want to quickly start with my ML project • Today I have to spend many hours setting up environments 2. I want a single runtime for • I don’t want to move data and code around 7#UnifiedAnalytics #SparkAISummit all steps of my work
  • 8. ML Project Stages 8#UnifiedAnalytics #SparkAISummit Databricks Runtime for ML Build Models ProductionizePrepare Quality Data
  • 9. What is Databricks Runtime for ML? A ready to use environment for machine learning and data science Built on top of and updated with every Databricks Runtime release APIs for distributed deep learning on Spark (HorovodRunner) Performance improvement for popular distributed algorithms in Spark (GraphFrames, logistic regression and tree classifiers) 9#UnifiedAnalytics #SparkAISummit
  • 10. What is Databricks Runtime for ML? ML Environment is setup on all cluster nodes with a single click. 10#UnifiedAnalytics #SparkAISummit
  • 11. 1. Prepare Data Easily access, explore, and visualize data in collaborative notebooks Prepare data sets at scale with: o Scala / Python / R / SQL o Optimized Apache Spark o Structured Streaming o Delta Lake o Persisted data meta store Quickly automate notebooks with jobs 11#UnifiedAnalytics #SparkAISummit
  • 12. 2. Build Models 12#UnifiedAnalytics #SparkAISummit Support for popular open source ML frameworks • TensorFlow and Tensorboard • PyTorch • Keras • Horovod for distributed DL • XGBoost • GraphFrames • Popular single node tools in Python and R
  • 13. 3. Productionize ML Models Model Deployment MLflow API for inference on third-party services like Docker containers, AzureML on Azure, SageMaker on AWS Databricks Runtime for ML includes mleap for model serialization. 13#UnifiedAnalytics #SparkAISummit
  • 15. 15 Challenge • 325,000 listed hotels, massive volume of image files • Apply ML to improve match between traveler and hotels with personalized viewing experience Solution • Leverage Databricks to train DL models on 100% of image data • Increase processing power by 20X and enable real-time scoring Result • Hotels.com significantly improved customer engagement and conversions by improving personalization models • Customer Case Study: databricks.com/customers/hotels-com Vision
  • 16. 16 Challenge • >100 million gamers every month • 2% of all games infected by serious toxicity Solution • Leveraged Databricks to apply NLP & ML to proactively identify abusive language • Scaled training on much larger dataset and hyperparameter tuning Result • Riot Games increased customer satisfaction, retention, and lifetime value by detecting abusive language in real-time • Customer Case Study: databricks.com/customers/riot-games NLP
  • 17. 17 Challenge • Offer insights to what consumers buy and watch • Scale from single-machine data science to large datasets to improve product offerings Solution • Leveraged Databricks to ensure collaboration across teams • Reduced annual cost by 40% and improved model performance by 1/3 Result • Nielsen improved competitive offering by applying ML to batch & live stream data from IOT devices • Customer Case Study: databricks.com/customers/nielsen IOT
  • 19. High-level Engineering Goals • Reproducible environments – Package & dependency management • Testability – Testing & QA infrastructure and process • Cross-compatibility – Careful configuration of all packages to be compatible • Performance optimization – High-performance I/O 19#UnifiedAnalytics #SparkAISummit
  • 20. Package Management • Package management • Environment management – Python 2.x & Python 3.x environments • Environment is selected during cluster setup • Latest stable versions from Anaconda distribution 20#UnifiedAnalytics #SparkAISummit
  • 21. Python Environments • ML Runtime vs. Databricks Runtime – Upgraded packages – Conda vs. pip – Additional ML packages • MKL for CPU acceleration • CUDA & cuDNN for GPU acceleration 21#UnifiedAnalytics #SparkAISummit
  • 22. Dependency Management • bazel for build system • Audit files for change detection – Python: Conda – JAR: maven – R: MRAN – Native: Ubuntu APT and Docker 22#UnifiedAnalytics #SparkAISummit
  • 23. Docker Containers • We internally use Docker to build Databricks Runtime images – Full control over content – Reproducible and automated • Runtime for ML is a layer on top of DBR – MLR benefits from all existing DBR tests and QA – MLR gets every hotfix and patch that goes into DBR 23#UnifiedAnalytics #SparkAISummit
  • 24. Extensive Integration Testing • Extensive tests for top-tier packages • Each commit runs unit and integration tests • Nightly tests on master and released branches • All CPU and GPU instances on Azure & AWS • Integration Tests: – Launch a docker container and run code – Launch a cluster and execute notebooks 24#UnifiedAnalytics #SparkAISummit
  • 25. High Performance FUSE • Why Filesystem in userspace (FUSE)? • We use high-throughput FUSE clients for ML/DL – Azure Storage FUSE on Azure – Goofys on AWS • The mounts points are pre-configured on ML Runtime at dfbs:/ml 25#UnifiedAnalytics #SparkAISummit
  • 28. GA of Runtime for ML • Release history: – 4.1 Beta: June 2018 – … – 5.3 GA: April 2019 – 5.4: May 2019 – 6.0: Second Half 2019 28#UnifiedAnalytics #SparkAISummit
  • 29. Roadmap for Environment • DBR with Conda (Beta) – Enable customizable environment – Databricks Runtime & Databricks Runtime for ML will continue to be supported • 6.0 – Unify all into single Runtime – Considering removing Python 2.x 29#UnifiedAnalytics #SparkAISummit
  • 30. Our Vision for 6.0 30#UnifiedAnalytics #SparkAISummit
  • 31. DON’T FORGET TO RATE AND REVIEW THE SESSIONS SEARCH SPARK + AI SUMMIT