Dask is a lightweight, Pythonic library for distributed computation. I’ll talk about how we use it to run machine learning forecasting jobs, and how the library might benefit your machine learning or data science work.
2. @gallamine
Background - Me
● William Cox
● North Carolina
○ twitter.com/gallamine
○ gallamine.com
● Building machine learning systems at Grubhub
○ Part of the Delivery team that delivers food around the country
○ Previously - Internet security industry and sonar systems
3. @gallamine
Background - Grubhub
Grubhub Inc. is an American online and mobile food ordering and delivery
marketplace that connects diners with local takeout restaurants*.
https://en.wikipedia.org/wiki/Grubhub
4. @gallamine
The Problem We’re Solving
● Every week we schedule drivers for timeslots.
● Too few drivers and diners are unhappy because they can’t get delivery
● Too many drivers
○ Drivers are unhappy because they’re idle and paid a base rate
○ Grubhub is unhappy because they’re paying for idle drivers
● We predict how many orders will happen for all regions so that an
appropriate number of drivers can be scheduled.
● My team designs and runs the prediction systems for Order Volume
Forecasting
6. @gallamine
How Do We Parallelize the Work?
● Long-term forecasting is a batch job (can take several hours to predict 3
weeks into the future)
● Creating multi-week predictions, for hundreds of different regions, for many
different models
● Need a system to do this in parallel across many machines
[Diagram: Models 1 … N, each producing forecasts for Regions 1 … M]
7. @gallamine
Design Goals
● Prefer Python(ic)
● Prefer simplicity
● Prefer local testing / distributed deployment
● Prefer minimal changes to existing (largish) codebase (that I was unfamiliar
with)
Our problem needs heavy compute but not necessarily heavy data. Most of our
data will fit comfortably in memory.
9. @gallamine
Dask
● Familiar API
● Scales out to clusters
● Scales down to single computers
“Dask’s ability to write down arbitrary computational graphs Celery/Luigi/Airflow-style and yet run them with the scalability promises of Hadoop/Spark allows for a pleasant freedom to write comfortably and yet still compute scalably.” - M. Rocklin, creator
Dask provides ways to scale Pandas, Scikit-Learn, and
Numpy workflows with minimal rewriting.
● Integrates with the Python ecosystem
● Supports complex applications
● Responsive feedback
10. @gallamine
Dask
Dask use cases can be roughly divided into the following two categories:
1. Large NumPy/Pandas/Lists with dask.array, dask.dataframe, dask.bag to
analyze large datasets with familiar techniques. This is similar to
Databases, Spark, or big array libraries.
2. Custom task scheduling. You submit a graph of functions that depend on
each other for custom workloads. This is similar to Azkaban, Airflow, Celery,
or Makefiles.
https://docs.dask.org/en/latest/use-cases.html
12. @gallamine
Dask Quickstart
def _forecast(group_name, static_param):
    if group_name == "c":
        raise ValueError("Bad group.")
    # do work here
    sleep_time = 1 + random.randint(1, 10)
    time.sleep(sleep_time)
    return sleep_time
13. @gallamine
from dask.distributed import Client, as_completed
import time
import random

if __name__ == "__main__":
    client = Client()
    predictions = []
    for group in ["a", "b", "c", "d"]:
        static_parameters = 1
        fcast_future = client.submit(_forecast, group, static_parameters, pure=False)
        predictions.append(fcast_future)
    for future in as_completed(predictions, with_results=False):
        try:
            print(f"future {future.key} returned {future.result()}")
        except ValueError as e:
            print(e)
“The concurrent.futures module provides a high-level interface for asynchronously executing callables.” Dask implements this interface.
_forecast is the arbitrary function we’re scheduling.
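Because Dask’s Client implements the concurrent.futures interface, the same submit / as_completed pattern runs unchanged on the standard-library executors. A minimal stdlib-only sketch (the function body is shortened from the slide’s _forecast, with the sleep removed):

```python
from concurrent.futures import ThreadPoolExecutor, as_completed

def _forecast(group_name, static_param):
    # Same shape as the slide's example: group "c" fails, others return a value.
    if group_name == "c":
        raise ValueError("Bad group.")
    return len(group_name) + static_param

results, errors = {}, []
with ThreadPoolExecutor(max_workers=4) as pool:
    futures = {pool.submit(_forecast, g, 1): g for g in ["a", "b", "c", "d"]}
    for future in as_completed(futures):
        try:
            results[futures[future]] = future.result()
        except ValueError as e:
            errors.append(str(e))

print(results)  # {'a': 2, 'b': 2, 'd': 2} (completion order may vary)
print(errors)   # ['Bad group.']
```

Swapping ThreadPoolExecutor for a Dask Client moves the same code from one machine to a cluster.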
17. @gallamine
Dask Distributed on YARN
● Dask workers are started in YARN containers
● Lets you allocate compute/memory resources on a cluster
● Files are distributed via HDFS, which spreads them across the cluster
https://hadoop.apache.org/docs/current/hadoop-yarn/hadoop-yarn-site/YARN.html
Dask works nicely with Hadoop to create and
manage Dask workers.
Lets you scale Dask to many computers on a
network.
Can also do: Kubernetes, SSH, GCP …
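A configuration sketch of the YARN deployment, assuming the dask-yarn package and a conda environment archive already available to the cluster; the archive name, vcore, and memory values are illustrative, not from the deck.

```python
# Sketch only: requires a running YARN cluster and the dask-yarn package.
from dask_yarn import YarnCluster
from dask.distributed import Client

cluster = YarnCluster(
    environment="environment.tar.gz",  # packaged conda env shipped to each YARN container
    worker_vcores=2,
    worker_memory="4GiB",
)
cluster.scale(10)          # ask YARN for 10 worker containers
client = Client(cluster)   # the same Client API as the local examples
```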
20. @gallamine
Distributed Code Looks Identical to Local
for gid, url, region_ids in groups:
    futures.append(cluster_client.submit(_forecast, forecast_periods,
                                         model_id, region_ids, start_time,
                                         end_time, url, testset))

for done_forecast_job in as_completed(futures, with_results=False):
    try:
        fcast_data = done_forecast_job.result()
    except Exception as error:
        # Error handling …
21. @gallamine
Worker Logging / Observation
Cluster UI URL: cluster.application_client.ui.address
if reset_loggers:
    # When workers start, the reset-logging function will be executed first.
    client.register_worker_callbacks(setup=init.reset_logger)
Stdout and stderr logs are captured by YARN.
22. @gallamine
Helpful - Debugging Wrapper
● Wrap the Dask client so that distribution can be turned off and the code
debugged serially
● Code in Appendix slides
23. Big ML
● SKLearn integration
● XGBoost / TensorFlow
● Works to hand off data to existing
distributed workflows
from dask.distributed import Client
client = Client()  # start a local Dask client

import dask_ml.joblib
from sklearn.externals.joblib import parallel_backend

with parallel_backend('dask'):
    # Your normal scikit-learn code here
Works with joblib
24. @gallamine
Big Data
● For large tabular data, Dask has distributed DataFrames - Pandas + Dask
● For large numeric data, Dask Arrays - NumPy + Dask
● For large unstructured data, Dask Bags - a “Pythonic version of the PySpark RDD.”
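A sketch of a Dask Array, which runs NumPy-style operations over chunked data; this assumes dask is installed, and the sizes are illustrative.

```python
import dask.array as da

# One logical array split into 10 chunks of 100k elements each.
x = da.arange(1_000_000, chunks=100_000)

# Operations build a task graph lazily; .compute() executes it.
total = (x * 2).sum().compute()
print(total)  # 999999000000
```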
25. @gallamine
Takeaways
● Forecasting now scales with the number of computers in the cluster, with a
50% savings in single-node compute as well.
● For distributing work across computers, Dask is a good place to start
investigating.
● YARN complicates matters
○ But I don’t know that something else (Kubernetes) would be better
● The Dask website has good documentation
● The Dask maintainers answer Stack Overflow questions quickly
● Dask is a complex library with lots of different abilities. This was just one
use case among many.
● We’re hiring!
27. @gallamine
Debugging Wrapper - Appendix
from concurrent import futures
from dask.distributed import as_completed as dask_as_completed

class DebugClient:
    def submit(self, func, *args, **kwargs):
        f = futures.Future()
        try:
            f.set_result(self._execute_function(func, *args, **kwargs))
            return f
        except Exception as e:
            f.set_exception(e)
            return f

    def _execute_function(self, func, *args, **kwargs):
        try:
            return func(*args, **kwargs)
        except Exception:
            raise

def as_completed(fcast_futures, with_results):
    if not config.dask_debug_mode:
        return dask_as_completed(fcast_futures, with_results=with_results)
    else:
        return list(fcast_futures)
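The wrapper above can be dropped in for the real Client when debugging serially. A self-contained sketch of how it behaves (the _forecast body and return value here are illustrative, standing in for real forecast work):

```python
from concurrent import futures

class DebugClient:
    """Stands in for dask.distributed.Client, running submits serially."""
    def submit(self, func, *args, **kwargs):
        f = futures.Future()
        try:
            f.set_result(func(*args, **kwargs))
        except Exception as e:
            f.set_exception(e)
        return f

def _forecast(group_name, static_param):
    if group_name == "c":
        raise ValueError("Bad group.")
    return static_param

client = DebugClient()  # swap in for Client() when debugging
ok = client.submit(_forecast, "a", 1).result()  # runs inline, returns 1
failed = client.submit(_forecast, "c", 1)       # exception captured in the future
print(ok, type(failed.exception()).__name__)  # 1 ValueError
```

Because both clients return concurrent.futures-style futures, the calling code never needs to know which one it is using.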
28. @gallamine
● “Dask is really just a smashing together of Python’s networking stack
with its data science stack. Most of the work was already done by the
time we got here.” - M. Rocklin
https://notamonadtutorial.com/interview-with-dasks-creator-scale-your-python-from-one-computer-to-a-thousand-b4483376f200