Dask and Machine Learning Models in Production
William Cox
PyColorado 2019
@gallamine
Background - Me
● William Cox
● North Carolina
○ twitter.com/gallamine
○ gallamine.com
● Building machine learning systems at Grubhub
○ Part of the Delivery team, which delivers food around the country
○ Previously - Internet security industry and sonar systems
Background - Grubhub
Grubhub Inc. is an American online and mobile food ordering and delivery
marketplace that connects diners with local takeout restaurants*.
* https://en.wikipedia.org/wiki/Grubhub
The Problem We’re Solving
● Every week we schedule drivers for timeslots.
● Too few drivers, and diners are unhappy because they can’t get delivery
● Too many drivers:
○ Drivers are unhappy because they’re idle and paid a base rate
○ Grubhub is unhappy because they’re paying for idle drivers
● We predict how many orders will happen for all regions so that an
appropriate number of drivers can be scheduled.
● My team designs and runs the prediction systems for Order Volume
Forecasting
Daily Prediction Cycle
[Diagram: historic order data, weather, sports, ... → train model → predict orders N weeks into the future]
How Do We Parallelize the Work?
● Long-term forecasting is a batch job (can take several hours to predict 3
weeks into the future)
● Creating multi-week predictions, for hundreds of different regions, for many
different models
● Need a system to do this in parallel across many machines
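The fan-out described above is a grid of independent (model, region) jobs. A minimal sketch of that shape, using only the standard library's `concurrent.futures`, is shown here; the model names, region names, and the `forecast_one` function are hypothetical placeholders, not Grubhub's code:

```python
from concurrent.futures import ThreadPoolExecutor, as_completed
from itertools import product

# Hypothetical identifiers, for illustration only.
MODELS = ["model_1", "model_2", "model_3"]
REGIONS = ["region_1", "region_2", "region_3", "region_4"]


def forecast_one(model, region):
    """Stand-in for an expensive forecast; returns a label, not real predictions."""
    return f"{model}/{region}"


def run_grid(models, regions, max_workers=4):
    """Run every (model, region) pair as an independent task and collect results."""
    results = {}
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # Map each future back to the (model, region) pair that produced it.
        futures = {
            pool.submit(forecast_one, m, r): (m, r)
            for m, r in product(models, regions)
        }
        for future in as_completed(futures):
            results[futures[future]] = future.result()
    return results


grid_results = run_grid(MODELS, REGIONS)
print(len(grid_results))  # one result per (model, region) pair
```

Because no task depends on another, the only coordination needed is submitting tasks and collecting results, which is what makes the problem a good fit for a pool of machines.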
[Diagram: Model 1, Model 2, Model 3, ... Model N, each run over Region 1, Region 2, ... Region M]
Design Goals
● Prefer Python(ic)
● Prefer simplicity
● Prefer local testing / distributed deployment
● Prefer minimal changes to existing (largish) codebase (that I was unfamiliar
with)
Our problem needs heavy compute but not necessarily heavy data. Most of our
data will fit comfortably in memory.
The Contenders
Dask
● Familiar API
● Scales out to clusters
● Scales down to single computers

“Dask’s ability to write down arbitrary computational graphs Celery/Luigi/Airflow-style
and yet run them with the scalability promises of Hadoop/Spark allows for a
pleasant freedom to write comfortably and yet still compute scalably.” (M. Rocklin,
creator of Dask)
Dask provides ways to scale Pandas, Scikit-Learn, and
Numpy workflows with minimal rewriting.
● Integrates with the Python ecosystem
● Supports complex applications
● Responsive feedback
Dask
Dask use cases can be roughly divided in the following two categories:
1. Large NumPy/Pandas/Lists with dask.array, dask.dataframe, dask.bag to
analyze large datasets with familiar techniques. This is similar to
Databases, Spark, or big array libraries.
2. Custom task scheduling. You submit a graph of functions that depend on
each other for custom workloads. This is similar to Azkaban, Airflow, Celery,
or Makefiles
https://docs.dask.org/en/latest/use-cases.html
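The second category, custom task scheduling, can be sketched with `dask.delayed`. This is a small illustration under stated assumptions: the `load`, `predict`, and `combine` functions are hypothetical stand-ins, and the threaded scheduler is chosen only so the example runs on one machine without a cluster.

```python
from dask import delayed


@delayed
def load(region):
    # Hypothetical stand-in for reading one region's order history.
    return list(range(region, region + 3))


@delayed
def predict(history):
    # Hypothetical "model": the mean of the history.
    return sum(history) / len(history)


@delayed
def combine(predictions):
    # Depends on every per-region prediction.
    return sum(predictions)


# Build the graph lazily; nothing runs until compute() is called.
per_region = [predict(load(region)) for region in (1, 2, 3)]
total = combine(per_region)

result = total.compute(scheduler="threads")
print(result)
```

Each `load` → `predict` chain is independent, so the scheduler is free to run the three chains in parallel before `combine` runs.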
Dask Quickstart
> pip install "dask[distributed]"
(The distributed extra provides dask.distributed, which the following slides use.)
Dask Quickstart
# random and time are imported in the driver script that follows.
def _forecast(group_name, static_param):
    if group_name == "c":
        raise ValueError("Bad group.")
    # do work here
    sleep_time = 1 + random.randint(1, 10)
    time.sleep(sleep_time)
    return sleep_time
from dask.distributed import Client, as_completed
import time
import random

if __name__ == "__main__":
    client = Client()
    predictions = []
    for group in ["a", "b", "c", "d"]:
        static_parameters = 1
        # pure=False: _forecast is not deterministic, so identical calls
        # must not be collapsed into one cached result.
        fcast_future = client.submit(_forecast, group, static_parameters, pure=False)
        predictions.append(fcast_future)
    for future in as_completed(predictions, with_results=False):
        try:
            print(f"future {future.key} returned {future.result()}")
        except ValueError as e:
            print(e)

Notes:
● _forecast is the arbitrary function we’re scheduling.
● “The concurrent.futures module provides a high-level interface for
asynchronously executing callables.” Dask implements this interface.
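Since Dask follows the `concurrent.futures` interface, the same submit / as_completed / result pattern can be tried with the standard library alone. The sketch below assumes a thread pool in place of a Dask client; `run_groups` is a hypothetical helper, and `_forecast` is simplified to return immediately rather than sleep.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed


def _forecast(group_name, static_param):
    """Same shape as the slide's function, without the sleep."""
    if group_name == "c":
        raise ValueError("Bad group.")
    return f"{group_name}:{static_param}"


def run_groups(groups):
    """Submit one task per group; collect results and errors separately."""
    results, errors = {}, {}
    with ThreadPoolExecutor(max_workers=2) as executor:
        futures = {executor.submit(_forecast, g, 1): g for g in groups}
        for future in as_completed(futures):
            group = futures[future]
            try:
                # result() re-raises any exception raised inside the task.
                results[group] = future.result()
            except ValueError as e:
                errors[group] = str(e)
    return results, errors


results, errors = run_groups(["a", "b", "c", "d"])
print(sorted(results), errors)
```

The failing group does not stop the others: its exception is stored on its future and only surfaces when `result()` is called, which is the same behavior the Dask version relies on.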
Dask Distributed - Local
from dask.distributed import Client, LocalCluster

# The upper-case names are the project's own configuration constants.
cluster = LocalCluster(
    processes=USE_DASK_LOCAL_PROCESSES,
    n_workers=1,
    threads_per_worker=DASK_THREADS_PER_WORKER,
    memory_limit='auto'
)
client = Client(cluster)
cluster.scale(DASK_LOCAL_WORKER_INSTANCES)
client.submit(…)
Show Dask UI Local/Cluster
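A minimal runnable version of a local setup, which also prints the address of the diagnostic dashboard, might look like the following. The worker and thread counts are illustrative assumptions, `square` and `run_local_demo` are hypothetical names, and `processes=False` is used only to keep everything in one process.

```python
from dask.distributed import Client, LocalCluster


def square(x):
    return x * x


def run_local_demo():
    """Start an in-process cluster, run a few tasks, and shut it down."""
    # processes=False keeps the workers as threads in this process.
    cluster = LocalCluster(processes=False, n_workers=1, threads_per_worker=2)
    client = Client(cluster)
    try:
        # Address of the Dask UI for this cluster.
        print("Dashboard:", client.dashboard_link)
        futures = client.map(square, range(5))
        return client.gather(futures)
    finally:
        client.close()
        cluster.close()


if __name__ == "__main__":
    print(run_local_demo())
```

Opening the printed address in a browser while tasks run shows the task stream and worker status.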
Dask Distributed on YARN
● Dask workers are started in YARN containers
● Lets you allocate compute/memory resources on a cluster
● Files are distributed via HDFS, which lets you distribute files across a cluster
https://hadoop.apache.org/docs/current/hadoop-yarn/hadoop-yarn-site/YARN.html
Dask works nicely with Hadoop to create and
manage Dask workers.
Lets you scale Dask to many computers on a
network.
Can also do: Kubernetes, SSH, GCP …
Programmatically Describe Service
import skein

# config is the project's own settings object (not shown in the slides).
worker = skein.Service(
    instances=config.dask_worker_instances,
    max_restarts=10,
    resources=skein.Resources(
        memory=config.dask_worker_memory,
        vcores=config.dask_worker_vcores
    ),
    files={
        './cachedvolume': skein.File(
            source=config.volume_sqlite3_filename, type='file'
        )
    },
    env={'THEANO_FLAGS': 'base_compiledir=/tmp/.theano/',
         'WORKER_VOL_LOCATION': './cachedvolume'},
    script='dask-yarn services worker',
    depends=['dask.scheduler']
)
scheduler = skein.Service(
    resources=skein.Resources(
        memory=config.dask_scheduler_memory,
        vcores=config.dask_scheduler_vcores
    ),
    script='dask-yarn services scheduler'
)
spec = skein.ApplicationSpec(
    name=yarn_app_name,
    queue='default',
    services={
        'dask.worker': worker,
        'dask.scheduler': scheduler
    }
)
Distributed Code Looks Identical to Local
for gid, url, region_ids in groups:
    futures.append(cluster_client.submit(_forecast, forecast_periods,
                                         model_id, region_ids, start_time,
                                         end_time, url, testset))
for done_forecast_job in as_completed(futures, with_results=False):
    try:
        fcast_data = done_forecast_job.result()
    except Exception as error:
        # Error handling …
Worker Logging / Observation
Cluster UI URL: cluster.application_client.ui.address
if reset_loggers:
    # When workers start, the reset logging function will be executed first.
    client.register_worker_callbacks(setup=init.reset_logger)
Stdout and stderr logs are captured by YARN.
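The talk does not show `init.reset_logger`. The following is a hypothetical stand-in, written with the standard `logging` module only, to illustrate what a worker setup callback of this kind could do: drop whatever handlers the worker process inherited and log to stdout, where YARN captures the output. The format string and default level are assumptions.

```python
import logging
import sys


def reset_logger(level=logging.INFO):
    """Hypothetical stand-in for the talk's init.reset_logger."""
    root = logging.getLogger()
    # Remove handlers inherited from the environment the worker started in.
    for handler in list(root.handlers):
        root.removeHandler(handler)
    # Log to stdout so the container's captured output contains the records.
    handler = logging.StreamHandler(sys.stdout)
    handler.setFormatter(
        logging.Formatter("%(asctime)s %(levelname)s %(name)s: %(message)s")
    )
    root.addHandler(handler)
    root.setLevel(level)


reset_logger()
logging.getLogger("forecast").info("logger reset")
# On a cluster this would be registered as on the slide:
# client.register_worker_callbacks(setup=reset_logger)
```

Running the reset as a setup callback means every worker configures logging the same way before it executes its first task.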
Helpful - Debugging Wrapper
● Wrap Dask functions so that they can be turned off for debugging code
serially
● Code in Appendix slides
Big ML
● SKLearn integration
● XGBoost / TensorFlow
● Works to hand off data to existing
distributed workflows
from dask.distributed import Client
# Current releases import parallel_backend from joblib directly; older ones
# used sklearn.externals.joblib together with `import dask_ml.joblib`.
from joblib import parallel_backend

client = Client()  # start a local Dask client

with parallel_backend('dask'):
    # Your normal scikit-learn code here
    pass

Works with joblib
Big Data
● For dealing with large tabular data, Dask has
distributed dataframes - Pandas + Dask
● For large numeric data, Dask Arrays - NumPy +
Dask
● For large unstructured data, Dask Bags:
“Pythonic version of the PySpark RDD.”
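As a small illustration of two of these collections, the sketch below builds a chunked Dask array and a Dask bag. The sizes, chunking, and sample records are arbitrary assumptions; a Dask dataframe works along the same lines but is left out to keep the example short.

```python
import dask.array as da
import dask.bag as db

# Array: a 1000 x 1000 array of ones, split into 250 x 250 chunks.
x = da.ones((1000, 1000), chunks=(250, 250))
total = x.sum().compute()

# Bag: unstructured records processed with filter/map.
records = db.from_sequence(["pizza", "sushi", "tacos", "pho"], npartitions=2)
long_names = records.filter(lambda s: len(s) > 4).map(str.upper)
names = long_names.compute(scheduler="threads")

print(total, names)
```

In both cases the operations are recorded lazily and only executed, chunk by chunk or partition by partition, when `compute()` is called.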
Takeaways
● Forecasting now scales with the number of computers in the cluster! Also a 50%
savings in single-node compute.
● For distributing work across computers, Dask is a good place to start
investigating.
● YARN complicates matters
○ But I don’t know that something else (Kubernetes) would be better
● The Dask website has good documentation.
● The Dask maintainers answer Stack Overflow questions quickly.
● Dask is a complex library with lots of different abilities. This was just one use
case among many.
● We’re hiring!
Questions?
Debugging Wrapper - Appendix
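# Imports assumed; the slides omit them.
from concurrent import futures

from dask.distributed import as_completed as dask_as_completed


class DebugClient:
    """Stand-in for the Dask client that runs each task serially."""

    def submit(self, func, *args, **kwargs):
        f = futures.Future()
        try:
            f.set_result(self._execute_function(func, *args, **kwargs))
            return f
        except Exception as e:
            f.set_exception(e)
            return f

    def _execute_function(self, func, *args, **kwargs):
        try:
            return func(*args, **kwargs)
        except Exception:
            raise


def as_completed(fcast_futures, with_results):
    # config is the project's own settings object (not shown in the slides).
    if not config.dask_debug_mode:
        return dask_as_completed(fcast_futures, with_results=with_results)
    else:
        # Note: with_results is ignored in debug mode.
        return list(fcast_futures)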
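A short usage sketch of the same idea follows. It uses a compact variant of the wrapper so that it is self-contained; `get_client` and `risky` are hypothetical names introduced for illustration, and the Dask import is deferred so the debug path needs only the standard library.

```python
from concurrent import futures


class DebugClient:
    """Runs submitted functions immediately, in the calling thread."""

    def submit(self, func, *args, **kwargs):
        f = futures.Future()
        try:
            f.set_result(func(*args, **kwargs))
        except Exception as e:
            f.set_exception(e)
        return f


def get_client(debug_mode):
    """Hypothetical factory: serial client for debugging, Dask otherwise."""
    if debug_mode:
        return DebugClient()
    from dask.distributed import Client  # needs dask[distributed]
    return Client()


def risky(x):
    if x < 0:
        raise ValueError("negative input")
    return x * 2


client = get_client(debug_mode=True)
ok = client.submit(risky, 21)
bad = client.submit(risky, -1)
print(ok.result())  # 42
print(type(bad.exception()).__name__)  # ValueError
```

Because the debug client returns ordinary, already-completed futures, calling code that uses `submit` and `result()` does not change, and a debugger can step straight into the submitted function.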
● “Dask is really just a smashing together of Python’s networking stack
with its data science stack. Most of the work was already done by the
time we got here.” - M. Rocklin
https://notamonadtutorial.com/interview-with-dasks-creator-scale-your-python-from-one-computer-to-a-thousand-b4483376f200

High Profile Call Girls Nagpur Meera Call 7001035870 Meet With Nagpur Escorts
 
Call for Papers - African Journal of Biological Sciences, E-ISSN: 2663-2187, ...
Call for Papers - African Journal of Biological Sciences, E-ISSN: 2663-2187, ...Call for Papers - African Journal of Biological Sciences, E-ISSN: 2663-2187, ...
Call for Papers - African Journal of Biological Sciences, E-ISSN: 2663-2187, ...
 
College Call Girls Nashik Nehal 7001305949 Independent Escort Service Nashik
College Call Girls Nashik Nehal 7001305949 Independent Escort Service NashikCollege Call Girls Nashik Nehal 7001305949 Independent Escort Service Nashik
College Call Girls Nashik Nehal 7001305949 Independent Escort Service Nashik
 
Extrusion Processes and Their Limitations
Extrusion Processes and Their LimitationsExtrusion Processes and Their Limitations
Extrusion Processes and Their Limitations
 
Call for Papers - Educational Administration: Theory and Practice, E-ISSN: 21...
Call for Papers - Educational Administration: Theory and Practice, E-ISSN: 21...Call for Papers - Educational Administration: Theory and Practice, E-ISSN: 21...
Call for Papers - Educational Administration: Theory and Practice, E-ISSN: 21...
 
UNIT-III FMM. DIMENSIONAL ANALYSIS
UNIT-III FMM.        DIMENSIONAL ANALYSISUNIT-III FMM.        DIMENSIONAL ANALYSIS
UNIT-III FMM. DIMENSIONAL ANALYSIS
 
Call Girls in Nagpur Suman Call 7001035870 Meet With Nagpur Escorts
Call Girls in Nagpur Suman Call 7001035870 Meet With Nagpur EscortsCall Girls in Nagpur Suman Call 7001035870 Meet With Nagpur Escorts
Call Girls in Nagpur Suman Call 7001035870 Meet With Nagpur Escorts
 
(ANJALI) Dange Chowk Call Girls Just Call 7001035870 [ Cash on Delivery ] Pun...
(ANJALI) Dange Chowk Call Girls Just Call 7001035870 [ Cash on Delivery ] Pun...(ANJALI) Dange Chowk Call Girls Just Call 7001035870 [ Cash on Delivery ] Pun...
(ANJALI) Dange Chowk Call Girls Just Call 7001035870 [ Cash on Delivery ] Pun...
 
Roadmap to Membership of RICS - Pathways and Routes
Roadmap to Membership of RICS - Pathways and RoutesRoadmap to Membership of RICS - Pathways and Routes
Roadmap to Membership of RICS - Pathways and Routes
 
Structural Analysis and Design of Foundations: A Comprehensive Handbook for S...
Structural Analysis and Design of Foundations: A Comprehensive Handbook for S...Structural Analysis and Design of Foundations: A Comprehensive Handbook for S...
Structural Analysis and Design of Foundations: A Comprehensive Handbook for S...
 

Dask and Machine Learning Models in Production - PyColorado 2019

  • 2. @gallamine Background - Me ● William Cox ● North Carolina ○ twitter.com/gallamine ○ gallamine.com ● Building machine learning systems at Grubhub ○ Part of the Delivery team, which delivers food around the country ○ Previously - Internet security industry and sonar systems
  • 3. Background - Grubhub: “Grubhub Inc. is an American online and mobile food ordering and delivery marketplace that connects diners with local takeout restaurants.” Source: https://en.wikipedia.org/wiki/Grubhub
  • 4. The Problem We’re Solving ● Every week we schedule drivers for timeslots. ● Too few drivers, and diners are unhappy because they can’t get delivery ● Too many drivers ○ Drivers are unhappy because they’re idle and paid a base rate ○ Grubhub is unhappy because it’s paying for idle drivers ● We predict how many orders will happen in every region so that an appropriate number of drivers can be scheduled. ● My team designs and runs the prediction systems for Order Volume Forecasting
  • 5. Daily Prediction Cycle (diagram): historic order data, weather, sports, ... → train model → predict orders N weeks into the future
  • 6. How Do We Parallelize the Work? ● Long-term forecasting is a batch job (can take several hours to predict 3 weeks into the future) ● Creating multi-week predictions, for hundreds of different regions, for many different models ● Need a system to do this in parallel across many machines (Diagram: Model 1 through Model N, each forecasting Region 1 through Region M.)
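The model-by-region grid on this slide amounts to a flat list of independent tasks. A minimal sketch, assuming hypothetical model and region identifiers (none of these names come from the talk):

```python
from itertools import product

# Hypothetical identifiers, for illustration only.
models = ["model_1", "model_2", "model_3"]
regions = ["region_1", "region_2"]

# Every (model, region) pair is an independent forecasting task,
# so the work list is the Cartesian product of the two axes.
tasks = list(product(models, regions))

for model, region in tasks:
    print(f"forecast {region} with {model}")
```

Because no task depends on another, any scheduler that can run a list of function calls in parallel is a candidate for this workload.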
  • 7. Design Goals ● Prefer Python(ic) ● Prefer simplicity ● Prefer local testing / distributed deployment ● Prefer minimal changes to existing (largish) codebase (that I was unfamiliar with) Our problem needs heavy compute but not necessarily heavy data. Most of our data will fit comfortably in memory.
  • 9. Dask ● Familiar API ● Scales out to clusters ● Scales down to single computers ● Integrates with the Python ecosystem ● Supports complex applications ● Responsive feedback. Dask provides ways to scale Pandas, Scikit-Learn, and Numpy workflows with minimal rewriting. “Dask’s ability to write down arbitrary computational graphs Celery/Luigi/Airflow-style and yet run them with the scalability promises of Hadoop/Spark allows for a pleasant freedom to write comfortably and yet still compute scalably.” - M. Rocklin, creator
  • 10. Dask: Dask use cases can be roughly divided into the following two categories: 1. Large NumPy/Pandas/Lists with dask.array, dask.dataframe, dask.bag to analyze large datasets with familiar techniques. This is similar to Databases, Spark, or big array libraries. 2. Custom task scheduling. You submit a graph of functions that depend on each other for custom workloads. This is similar to Azkaban, Airflow, Celery, or Makefiles. https://docs.dask.org/en/latest/use-cases.html
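To make "a graph of functions that depend on each other" concrete, here is a serial toy built on the standard library's graphlib (Python 3.9+). It is not Dask's scheduler or its graph format; the task names and functions are invented for illustration.

```python
from graphlib import TopologicalSorter  # Python 3.9+


def load():
    return [1, 2, 3]


def double(xs):
    return [2 * x for x in xs]


def total(xs):
    return sum(xs)


# Each task name maps to (function, names of the tasks it depends on).
graph = {
    "load": (load, []),
    "double": (double, ["load"]),
    "total": (total, ["double"]),
}


def run_graph(graph):
    """Run every task once, after all of its dependencies have run."""
    order = TopologicalSorter({name: deps for name, (_, deps) in graph.items()})
    results = {}
    for name in order.static_order():
        func, deps = graph[name]
        results[name] = func(*(results[d] for d in deps))
    return results


results = run_graph(graph)
print(results["total"])
```

A distributed scheduler does the same dependency bookkeeping, but it can run tasks whose inputs are ready concurrently and on different machines.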
  • 12. Dask Quickstart

      import random
      import time

      def _forecast(group_name, static_param):
          if group_name == "c":
              raise ValueError("Bad group.")
          # do work here
          sleep_time = 1 + random.randint(1, 10)
          time.sleep(sleep_time)
          return sleep_time
  • 13. Annotations: _forecast is the arbitrary function we’re scheduling. “The concurrent.futures module provides a high-level interface for asynchronously executing callables.” Dask implements this interface.

      from dask.distributed import Client, as_completed
      import time
      import random

      if __name__ == "__main__":
          client = Client()
          predictions = []
          for group in ["a", "b", "c", "d"]:
              static_parameters = 1
              fcast_future = client.submit(_forecast, group, static_parameters, pure=False)
              predictions.append(fcast_future)
          for future in as_completed(predictions, with_results=False):
              try:
                  print(f"future {future.key} returned {future.result()}")
              except ValueError as e:
                  print(e)
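For comparison, the same submit / as_completed pattern can be written against the standard library's concurrent.futures, which is the interface the slide says Dask implements. This is a sketch with a thread pool on one machine, not the talk's code; the stand-in _forecast returns a string instead of sleeping.

```python
import concurrent.futures


def _forecast(group_name, static_param):
    # Stand-in for the real work; group "c" fails as in the quickstart.
    if group_name == "c":
        raise ValueError("Bad group.")
    return f"{group_name}:{static_param}"


def run_all(groups):
    """Submit one task per group; collect results and errors as they finish."""
    results, errors = {}, {}
    with concurrent.futures.ThreadPoolExecutor(max_workers=4) as pool:
        futures = {pool.submit(_forecast, g, 1): g for g in groups}
        for future in concurrent.futures.as_completed(futures):
            group = futures[future]
            try:
                results[group] = future.result()
            except ValueError as e:
                errors[group] = str(e)
    return results, errors


results, errors = run_all(["a", "b", "c", "d"])
print(results, errors)
```

The exception raised inside a task surfaces only when result() is called, which is why the try/except sits around that call in both versions.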
  • 15. Dask Distributed - Local

      from dask.distributed import Client, LocalCluster

      cluster = LocalCluster(
          processes=USE_DASK_LOCAL_PROCESSES,
          n_workers=1,
          threads_per_worker=DASK_THREADS_PER_WORKER,
          memory_limit='auto'
      )
      client = Client(cluster)
      cluster.scale(DASK_LOCAL_WORKER_INSTANCES)
      client.submit(…)
  • 16. Show Dask UI Local/Cluster
  • 17. Dask Distributed on YARN ● Dask workers are started in YARN containers ● YARN lets you allocate compute/memory resources on a cluster ● Files are distributed via HDFS ● HDFS lets you distribute files across a cluster. Dask works nicely with Hadoop to create and manage Dask workers, which lets you scale Dask to many computers on a network. Other deployment options: Kubernetes, SSH, GCP … https://hadoop.apache.org/docs/current/hadoop-yarn/hadoop-yarn-site/YARN.html
  • 18. Programmatically Describe Service

      worker = skein.Service(
          instances=config.dask_worker_instances,
          max_restarts=10,
          resources=skein.Resources(
              memory=config.dask_worker_memory,
              vcores=config.dask_worker_vcores
          ),
          files={
              './cachedvolume': skein.File(
                  source=config.volume_sqlite3_filename,
                  type='file'
              )
          },
          env={'THEANO_FLAGS': 'base_compiledir=/tmp/.theano/',
               'WORKER_VOL_LOCATION': './cachedvolume'},
          script='dask-yarn services worker',
          depends=['dask.scheduler']
      )
  • 19. Scheduler service and application spec

      scheduler = skein.Service(
          resources=skein.Resources(
              memory=config.dask_scheduler_memory,
              vcores=config.dask_scheduler_vcores
          ),
          script='dask-yarn services scheduler'
      )
      spec = skein.ApplicationSpec(
          name=yarn_app_name,
          queue='default',
          services={
              'dask.worker': worker,
              'dask.scheduler': scheduler
          }
      )
  • 20. Distributed Code Looks Identical to Local

      for gid, url, region_ids in groups:
          futures.append(cluster_client.submit(_forecast, forecast_periods, model_id,
                                               region_ids, start_time, end_time, url, testset))
      for done_forecast_job in as_completed(futures, with_results=False):
          try:
              fcast_data = done_forecast_job.result()
          except Exception as error:
              # Error handling …
  • 21. Worker Logging / Observation ● Cluster UI URL: cluster.application_client.ui.address ● Stdout and stderr logs are captured by YARN.

      if reset_loggers:
          # When workers start, the reset logging function will be executed first.
          client.register_worker_callbacks(setup=init.reset_logger)
  • 22. Helpful - Debugging Wrapper ● Wrap Dask functions so that they can be turned off for debugging code serially ● Code in Appendix slides
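  • 23. Big ML ● SKLearn integration ● XGBoost / TensorFlow ● Works to hand off data to existing distributed workflows ● Works with joblib

      from dask.distributed import Client
      # The slide imported parallel_backend from sklearn.externals.joblib, which
      # was removed in scikit-learn 0.23; joblib is now imported directly.
      # Older releases also needed `import dask_ml.joblib` to register the backend.
      from joblib import parallel_backend

      client = Client()  # start a local Dask client

      with parallel_backend('dask'):
          ...  # Your normal scikit-learn code here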
  • 24. Big Data ● For dealing with large tabular data, Dask has distributed dataframes - Pandas + Dask ● For large numeric data, Dask Arrays - Numpy + Dask ● For large unstructured data, Dask Bags: “Pythonic version of the PySpark RDD.”
  • 25. Takeaways ● Forecasting now scales with the number of computers in the cluster! 50% savings also in single-node compute. ● For distributing work across computers, Dask is a good place to start investigating. ● YARN complicates matters ○ But I don’t know that something else (Kubernetes) would be better ○ The Dask website has good documentation ○ The Dask maintainers answer Stack Overflow questions quickly. ○ Dask is a complex library with lots of different abilities. This was just one use case among many. ○ We’re hiring!
  • 27. Debugging Wrapper - Appendix

      from concurrent import futures
      from dask.distributed import as_completed as dask_as_completed

      class DebugClient:
          def submit(self, func, *args, **kwargs):
              # Run the function immediately and wrap the outcome in a Future.
              f = futures.Future()
              try:
                  f.set_result(self._execute_function(func, *args, **kwargs))
                  return f
              except Exception as e:
                  f.set_exception(e)
                  return f

          def _execute_function(self, func, *args, **kwargs):
              try:
                  return func(*args, **kwargs)
              except Exception:
                  raise

      def as_completed(fcast_futures, with_results):
          if not config.dask_debug_mode:
              return dask_as_completed(fcast_futures, with_results=with_results)
          else:
              return list(fcast_futures)
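A short usage sketch of the serial-debugging idea, with a compact copy of the wrapper so the example runs on its own. The square and fail functions are invented for illustration.

```python
from concurrent import futures


class DebugClient:
    """Serial stand-in for a Dask client: runs the function immediately."""

    def submit(self, func, *args, **kwargs):
        f = futures.Future()
        try:
            f.set_result(func(*args, **kwargs))
        except Exception as e:
            f.set_exception(e)
        return f


def square(x):
    return x * x


def fail():
    raise ValueError("boom")


client = DebugClient()
ok = client.submit(square, 4)
bad = client.submit(fail)

print(ok.result())
print(type(bad.exception()).__name__)
```

Because everything runs in the calling process, breakpoints and stack traces behave as in ordinary serial code. One caveat of this design: keyword arguments meant for Dask itself, such as pure=False, are forwarded to the function, so callers would need to drop them in debug mode.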
  • 28. “Dask is really just a smashing together of Python’s networking stack with its data science stack. Most of the work was already done by the time we got here.” - M. Rocklin https://notamonadtutorial.com/interview-with-dasks-creator-scale-your-python-from-one-computer-to-a-thousand-b4483376f200