In this talk we will explore Ray - a high-performance, low-latency distributed execution framework that lets you run your Python code on multiple cores and scale the same code from your laptop to a large cluster.
Ray uses several interesting ideas such as actors, a fast zero-copy shared-memory object store, and bottom-up scheduling. Moreover, on top of a succinct API, Ray builds tools that make your Pandas pipelines faster, tools that find the best hyper-parameters for your machine learning models or train state-of-the-art reinforcement learning algorithms, and much more. Come to the talk and learn more.
Updated the talk with Kubernetes
https://www.pydays.at/
4. Martin Fowler's First rule of distributed objects computing
Don't
Massive complexity booster
See also Common fallacies of distributed computing
5. Scale up and down on demand
Yamazaki et al. and 2048 GPUs (March 2019)
ImageNet in 74.7 seconds
6. Heterogeneous computations
[Figure: processing pipeline from a CT or MRI image to VR and 3D print. Stages: preprocess, segment, landmark estimation, view estimation, meshing. Legend: GPU-based machine learning, CPU-intensive operation, WebVR-based UI, long-running external process.]
3D printed model of your own heart
7. Concurrent world packed with real-time decisions
[Figure: Acquisition, Processing, Visualisation; result shown: OK]
Real-time cookie quality control
10. Threads
GIL, not using all cores anyway, output values...
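import threading
import numpy as np

# Example input (assumption): any NumPy array works here
im = np.random.rand(480, 640)

def analyze_image(im):
    return im.mean()

def process_image(im):
    return im * 5

# Note: Thread discards the return values of its target
t1 = threading.Thread(target=analyze_image, args=(im,))
t2 = threading.Thread(target=process_image, args=(im,))
t1.start()
t2.start()
t1.join()
t2.join()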
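The "output values" complaint refers to threading.Thread throwing away whatever the target returns. One standard-library way around it is concurrent.futures, sketched below. A tiny nested list stands in for the image (an assumption, to keep the sketch free of NumPy), and the GIL still applies.

```python
from concurrent.futures import ThreadPoolExecutor
from statistics import mean

# Assumption: a tiny nested list stands in for the image array `im`.
im = [[1, 2], [3, 4]]

def analyze_image(im):
    # Mean over all pixels.
    return mean(pixel for row in im for pixel in row)

def process_image(im):
    # Scale every pixel by 5.
    return [[pixel * 5 for pixel in row] for row in im]

with ThreadPoolExecutor(max_workers=2) as pool:
    # submit() returns a Future, so the return value is not lost.
    analysis = pool.submit(analyze_image, im)
    processed = pool.submit(process_image, im)
    mean_value = analysis.result()
    scaled = processed.result()
```

This fixes only the return-value problem; CPU-bound work still runs on one core at a time.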
11. Processes
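Sharing objects between processes - constant pickling
There is hope - https://docs.python.org/3.8/library/multiprocessing.shared_memory.html
import multiprocessing
import numpy as np

def analyze_image(im):
    return im.mean()

def process_image(im):
    return im * 5

# The guard is required on platforms that spawn child processes
if __name__ == "__main__":
    # Example input (assumption); it is pickled for each child process
    im = np.random.rand(480, 640)
    p1 = multiprocessing.Process(target=analyze_image, args=(im,))
    p2 = multiprocessing.Process(target=process_image, args=(im,))
    p1.start()
    p2.start()
    p1.join()
    p2.join()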
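The "hope" is the multiprocessing.shared_memory module (Python 3.8+), which lets processes attach to the same block of memory by name instead of pickling data back and forth. A minimal sketch follows. To keep it short, both handles live in one process; in practice the second attach would happen in a worker process. The function name share_bytes is hypothetical.

```python
from multiprocessing import shared_memory

def share_bytes(payload: bytes) -> bytes:
    # Create a named shared-memory block big enough for the (non-empty) payload.
    owner = shared_memory.SharedMemory(create=True, size=len(payload))
    try:
        owner.buf[:len(payload)] = payload
        # Attach a second handle by name; a worker process would do the same.
        reader = shared_memory.SharedMemory(name=owner.name)
        try:
            # Modify in place through the second handle: no pickling, no copy.
            reader.buf[0] = reader.buf[0] + 1
        finally:
            reader.close()
        # The change is visible through the first handle.
        return bytes(owner.buf[:len(payload)])
    finally:
        owner.close()
        owner.unlink()

result = share_bytes(b"\x01\x02\x03")
```

The block must be unlinked exactly once by its owner, otherwise it outlives the program.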
12. And we are still just running on a single machine
13. Celery
# jobs.py
from celery import Celery

app = Celery('jobs', ...)

@app.task
def compute_stuff(x, y):
    return x + y

@app.task
def another_compute_stuff(x, y):
    return x + y

# client code
from jobs import compute_stuff, another_compute_stuff

compute_stuff.delay(1, 1).get()
compute_stuff.apply_async((2, 2), link=another_compute_stuff.s(16))
compute_stuff.starmap([(2, 2), (4, 4)])
14. PySpark
Mature, excellent for ETL, simple queries
Great for homogeneous processing of data points
"BigData" ecosystem in Java
R = matrix(rand(M, F)) * matrix(rand(U, F).T)
ms = matrix(rand(M, F))
us = matrix(rand(U, F))
Rb = sc.broadcast(R)
msb = sc.broadcast(ms)
usb = sc.broadcast(us)
for i in range(ITERATIONS):
    ms = sc.parallelize(range(M), partitions) \
           .map(lambda x: update(x, usb.value, Rb.value)) \
           .collect()
    ms = matrix(np.array(ms)[:, :, 0])
    …
15. Spark barriers vs dynamic task graphs
Ray: A Distributed Execution Framework for Emerging AI Applications - Michael Jordan (UC Berkeley)
16. Dask
Much more "Pythonic" than Spark
Plays well with data science tools
Global scheduler → latency
https://dask.org/
import dask

@dask.delayed
def add(x, y):
    return x + y

x = add(1, 2)
y = add(x, 3)
y.compute()
17. Why a new system?
Play well with existing tools
Scale from a laptop to a cluster
Heterogeneous code and hardware
Real-time and low-latency
Dynamically schedule tasks
Less cognitive load
18. Ray is a general-purpose framework for parallel and distributed Python and a collection of libraries targeting data processing workflows
Developed at UC Berkeley as an attempt to replace Spark
https://github.com/ray-project/ray
19. Unique components
Stateless tasks and actors combined
Bottom-up scheduling for low latency
Shared object store with zero copy deserialization
Clean Pythonic API
20. Most* of Ray's API you will ever need
The rest is (mostly) Python as we know it
*Seriously, this is pretty much it
ray.init # connect to a Ray cluster
ray.remote # declare a task/actor & remote execution
ray.get # retrieve a Ray object and convert to a Python object
ray.put # manually place an object in the object store
ray.wait # retrieve results as they are made ready
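Of the five calls, ray.wait is the least familiar: it splits a list of futures into those that are ready and those still running. The same pattern exists in the standard library as concurrent.futures.wait, so it can be sketched without a cluster. This is an analogy, not Ray's API; slow_square and its delays are made up for the illustration.

```python
import time
from concurrent.futures import FIRST_COMPLETED, ThreadPoolExecutor, wait

def slow_square(x, delay):
    # Simulate work of varying duration.
    time.sleep(delay)
    return x * x

with ThreadPoolExecutor(max_workers=3) as pool:
    pending = {
        pool.submit(slow_square, 2, 0.3),
        pool.submit(slow_square, 3, 0.01),
        pool.submit(slow_square, 4, 0.2),
    }
    results = []
    while pending:
        # Split the futures into finished ones and ones still running.
        done, pending = wait(pending, return_when=FIRST_COMPLETED)
        # Handle whatever is ready instead of blocking on the slowest task.
        results.extend(f.result() for f in done)
```

Results arrive in completion order rather than submission order, which is the point of waiting this way.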
22. Tasks
Stateless computations
Decorate a function with ray.remote
Optionally with some extra parameters
import cv2
import ray

@ray.remote
def imread(fname):
    return cv2.imread(fname)

# Resource requirements and the number of returned objects
@ray.remote(num_cpus=1, num_gpus=0, num_return_vals=2)
def segment(image, threshold=128):
    dark = image < threshold
    bright = image > threshold
    return dark, bright
23. Execute the task on a cluster
Append .remote
Immediately returns a future and gives back control
future = imread.remote('/data/python.png')
ObjectID(0100000067dc20383d2f04ea6cfade301eef9919)
24. Get the results
Schedule a computation for execution
ray.get blocks until the computation is completed
All subsequent ray.gets return almost instantly
Use the future as many times as needed
future = heavy_computation.remote()
arr = ray.get(future)
arr0 = ray.get(future)
arr1 = ray.get(future)
thumb_future = make_thumbnail.remote(future)
landmarks_future = find_landmarks.remote(future)
25. Actors
Mutable state and unique resources
Instantiate the actor somewhere
@ray.remote
class ParameterServer(object):
    def __init__(self, keys, values):
        # Copy the values so the actor owns its own state
        values = [value.copy() for value in values]
        self.weights = dict(zip(keys, values))

    def push(self, keys, values):
        for key, value in zip(keys, values):
            self.weights[key] += value

    def pull(self, keys):
        return [self.weights[key] for key in keys]

ps = ParameterServer.remote(keys, initial_values)
26. Ray actor methods
always called sequentially
the only way to mutate a resource
simpler model without deadlocks
#LifeWithoutLocks
future0 = ps.push.remote(keys, grads0)
future1 = ps.push.remote(keys, grads1)
future2 = ps.pull.remote(keys)
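Why sequential method calls remove the need for locks can be shown with a plain-Python stand-in: one thread owns the state and works through a mailbox, one message at a time. This is a sketch of the actor idea only, not how Ray implements actors; CounterActor is a hypothetical name.

```python
import queue
import threading
from concurrent.futures import Future

class CounterActor:
    """Hypothetical stand-in for an actor: one thread owns the state."""

    def __init__(self):
        self._total = 0
        self._mailbox = queue.Queue()
        threading.Thread(target=self._run, daemon=True).start()

    def _run(self):
        # Process one message at a time, in arrival order.
        while True:
            amount, reply = self._mailbox.get()
            self._total += amount  # only this thread touches the state
            reply.set_result(self._total)

    def add(self, amount):
        # Enqueue the call and hand back a future immediately.
        reply = Future()
        self._mailbox.put((amount, reply))
        return reply

actor = CounterActor()
callers = [threading.Thread(target=actor.add, args=(1,)) for _ in range(100)]
for t in callers:
    t.start()
for t in callers:
    t.join()
# All 100 increments were queued before this call, so it sees them all.
final_total = actor.add(0).result()
```

No lock appears anywhere, yet no update is lost, because the increments never run concurrently.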
27. Actors for resources
*camlib is our custom Cython-based wrapper for a vendor-specific library. Check out the vendor-agnostic and open-source harvester.
@ray.remote
class Camera:
    def __init__(self, ref):
        # The actor holds the unique hardware resource
        self.cam = camlib.Camera(ref=ref)
        self.cam.open()
        self.num_frames = 0

    def grab(self):
        self.num_frames += 1
        return self.cam.grab_frame()

    def total_frames(self):
        return self.num_frames

cam = Camera.remote(ref='1337')
im_fut = cam.grab.remote()
28. Mix and match tasks and actors
Grab and process an image from a camera
Or run a distributed SGD training
frame_id = camera.grab.remote()
segmented_id = segment.remote(frame_id)
segmented = ray.get(segmented_id)
@ray.remote
def worker(ps):
    while True:
        # Get the latest parameters
        weights = ray.get(ps.pull.remote(keys))
        # Compute an update of the params
        # (e.g. the gradients for neural nets)
        # Push the updates to the parameter server
        ps.push.remote(keys, gradients)

worker_tasks = [worker.remote(ps) for _ in range(10)]
29. Dynamically define by run
import numpy as np
@ray.remote
def aggregate_data(x, y):
    return x + y

data = [np.random.normal(size=1000) for i in range(4)]
while len(data) > 1:
    # Futures are passed straight into the next task
    intermediate_result = aggregate_data.remote(data[0], data[1])
    data = data[2:] + [intermediate_result]
result = ray.get(data[0])
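The shape of the task graph built by this loop is easier to see when the aggregation records its calls instead of adding numbers. The sketch below runs the same loop serially in plain Python, with labels in place of arrays; aggregate and reduce_pairs are hypothetical names.

```python
def aggregate(x, y):
    # Record the call instead of adding, so the graph shape is visible.
    return f"({x}+{y})"

def reduce_pairs(items):
    data = list(items)
    while len(data) > 1:
        # Same update rule as the loop above: combine the first two items
        # and append the intermediate result to the end.
        data = data[2:] + [aggregate(data[0], data[1])]
    return data[0]

tree = reduce_pairs(["a", "b", "c", "d"])
print(tree)  # ((a+b)+(c+d))
```

Appending to the end yields a balanced tree, so (a+b) and (c+d) do not depend on each other and can run in parallel.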
34. Raylet
Local scheduler
Driver can assign a task to a worker
Bottom up scheduling with fractional resources
No more tasks in parallel than the number of CPUs
(multithreaded libs - set the number of threads to 1)
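Limiting multithreaded libraries to one thread is commonly done through environment variables. Which variable takes effect depends on the numerical backend in use, so the list below is an assumption covering OpenMP, MKL, and OpenBLAS.

```python
import os

# Set these before importing NumPy or similar libraries,
# because thread pools are usually sized when the library loads.
THREAD_VARS = ("OMP_NUM_THREADS", "MKL_NUM_THREADS", "OPENBLAS_NUM_THREADS")
for var in THREAD_VARS:
    os.environ[var] = "1"
```

With one thread per task, the scheduler's one-task-per-CPU accounting matches what the machine is actually doing.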
35. Global control state
Take all metadata and state out of the system
Centralize it in a Redis cluster
Everything else is largely stateless
Reschedule tasks on other machines
36. Fault-tolerance
Failover to other nodes based on
the global control state
Non actors - Reconstruct by lineage
Actors - Replay (experimental)
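Reconstruction by lineage means remembering how each object was produced, so that a lost object can be recomputed from its inputs. A toy single-process sketch of the idea follows; LineageStore is hypothetical and says nothing about Ray's internals.

```python
class LineageStore:
    """Toy object store that remembers how each object was produced."""

    def __init__(self):
        self._objects = {}  # object id -> value
        self._lineage = {}  # object id -> (function, ids of its inputs)

    def put(self, obj_id, value):
        self._objects[obj_id] = value

    def submit(self, obj_id, func, *input_ids):
        # Record the lineage, then compute the result.
        self._lineage[obj_id] = (func, input_ids)
        self._objects[obj_id] = func(*(self.get(i) for i in input_ids))

    def evict(self, obj_id):
        # Simulate losing the node that held this object.
        self._objects.pop(obj_id, None)

    def get(self, obj_id):
        if obj_id not in self._objects:
            # Re-run the producing task, reconstructing lost inputs first.
            func, input_ids = self._lineage[obj_id]
            self._objects[obj_id] = func(*(self.get(i) for i in input_ids))
        return self._objects[obj_id]

store = LineageStore()
store.put("x", 3)
store.submit("doubled", lambda v: v * 2, "x")
store.submit("plus_one", lambda v: v + 1, "doubled")
store.evict("doubled")
store.evict("plus_one")
recovered = store.get("plus_one")  # recomputes "doubled", then "plus_one"
```

This works only because tasks are stateless and deterministic, which is why actors need a different mechanism.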
37. Does it scale?
mujoco video
Moritz, Nishihara et al.: Ray: A Distributed Framework
for Emerging AI Applications
OpenAI Baselines: high-quality implementations of reinforcement learning algorithms
39. On-prem set-up
Start Ray head on one of the nodes
Start Ray workers on the nodes
Connect and run commands
Teardown Ray
$ ray start --head --redis-port=6379 # head IP: 192.168.1.5
$ ray start --redis-address=192.168.1.5:6379
from glob import glob
import cv2
import ray

ray.init(redis_address="192.168.1.5:6379")

@ray.remote
def imread(filename):
    return cv2.imread(filename)

ims = ray.get([imread.remote(f) for f in glob('*.png')])
$ ray stop
40. Make a private Ray cluster on the cloud
Ready-made auto-scaling scripts for AWS and GCP
Set up a Ray cluster
Tear it down
or write a custom provider
$ ray up ray/python/ray/autoscaler/aws/example-full.yaml
$ ray down ray/python/ray/autoscaler/aws/example-full.yaml
https://ray.readthedocs.io/en/latest/autoscaling.html
41. Set up Ray on any Kubernetes cluster
$ kubectl create -f ray/kubernetes/head.yaml
$ kubectl create -f ray/kubernetes/worker.yaml
https://ray.readthedocs.io/en/latest/deploy-on-kubernetes.html
42. 1. Create a Kubernetes cluster + download kubectl
Download the kubeconfig.yaml file from the UI
2. Check that the nodes are running
3. Deploy the head and the workers
4. Wait till the pods are running
$ kubectl --kubeconfig="kubeconfig.yaml" get nodes
NAME STATUS ROLES AGE VERSION
pool-6pi4ni81f-q4dn Ready <none> 87m v1.14.1
$ kubectl --kubeconfig="kubeconfig.yaml" apply -f head.yaml
$ kubectl --kubeconfig="kubeconfig.yaml" apply -f worker.yaml
$ kubectl --kubeconfig="kubeconfig.yaml" get pods
NAME READY STATUS RESTARTS AGE
ray-head-56fdb7fdd-qtgbt 1/1 Running 0 85m
ray-worker-85454649dd-5nb8k 0/1 Pending 0 13m
...
43. 5. Enter the head pod and run ipython
6. Profit from a distributed Python
$ kubectl --kubeconfig="kubeconfig.yaml" exec -it ray-head-56fdb7fdd-qtgbt -- bash
$ ipython
from collections import Counter
import time
import ray

ray.init(redis_address="localhost:6379")

@ray.remote
def get_node_ip():
    time.sleep(0.01)
    return ray.services.get_node_ip_address()

# Count how many tasks ran on each node
%time Counter(ray.get([get_node_ip.remote() for _ in range(100)]))
46. The ecosystem
Higher-level libs built on top of Ray
Tune - hyper-parameter optimization
RLlib - reinforcement learning
Modin - distributed* Pandas
and more…
*experimental
48. Tunable function
config - All tunable parameters of the function
reporter - Collector of metrics for the optimizer and for
visualization of the training in Tensorboard
def my_tunable_function(config, reporter):
    train_data, test_data = make_data_loaders(config)
    model = make_model(config)
    trainer = make_optimizer(model, config)
    for epoch in range(10):  # Could be an infinite loop too
        train(model, trainer, train_data)
        accuracy = evaluate(model, test_data)
        # Report the metric to the optimizer
        reporter(mean_accuracy=accuracy)
49. Class-based tunable API
Support for model checkpointing and restoration.
from ray.tune import Trainable

class MyTunableClass(Trainable):
    def _setup(self, config):
        self.train_data, self.test_data = make_data_loaders(config)
        self.model = make_model(config)
        self.trainer = make_optimizer(self.model, config)

    def _train(self):
        train_for_a_while(self.model, self.train_data, self.trainer)
        return {"mean_accuracy": eval_model(self.model, self.test_data)}

    def _save(self, checkpoint_dir):
        return save_model(self.model, checkpoint_dir)

    def _restore(self, checkpoint_path):
        self.model.load_state_dict(checkpoint_path)
50. Define the parameter space
Register the trainable function
Launch hyper-parameter search
Consider extracting your argparse arguments
from ray import tune

spec = {
    # Stop a trial when either condition is met
    "stop": {
        "mean_accuracy": 0.995,
        "time_total_s": 600,
    },
    "config": {
        "activation": tune.grid_search(["relu", "elu", "tanh"]),
        "learning_rate": tune.grid_search([0.001, 0.01, 0.1]),
    },
}
tune.register_trainable("train_imagenet", my_tunable_function)
tune.run("train_imagenet", name="tune_imagenet_test", **spec)
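A grid search over two parameters with three values each yields nine trials, one per combination. The expansion can be sketched with itertools.product; this only illustrates the combinatorics and is not how Tune builds its trials.

```python
from itertools import product

grid = {
    "activation": ["relu", "elu", "tanh"],
    "learning_rate": [0.001, 0.01, 0.1],
}

names = list(grid)
# One config dict per combination of parameter values.
trial_configs = [
    dict(zip(names, values))
    for values in product(*(grid[name] for name in names))
]

print(len(trial_configs))  # 9
print(trial_configs[0])    # {'activation': 'relu', 'learning_rate': 0.001}
```

Since the trial count is the product of the value counts, every extra grid parameter multiplies the cost.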
55. Modin automatically partitions and distributes your data frames
Early stage: 71% of the Pandas API covered, with fallback to Pandas for the rest
https://github.com/modin-project/modin
57. Serial | Parallel and distributed
Remember
# Serial
def heavy_computation(x):
    # do something nice here
    return x

results = [
    heavy_computation(i)
    for i in range(100)
]

# Parallel and distributed
import ray

@ray.remote
def heavy_computation(x):
    # do something nice here
    return x

ray.init()
results = ray.get([
    heavy_computation.remote(i)
    for i in range(100)
])
58. Conclusion
Simple API with tasks and actors
A sane local alternative to threads and processes
Use the same code locally and on a cluster
Growing ecosystem of libraries
Ray has fantastic docs and tutorials
pip install ray
60. References
Seven concurrency models in seven weeks - Butcher
A note on distributed computing - Waldo J. et al.
Free lunch is over - Herb Sutter
Fallacies of distrib. computing explained - Rotem-Gal-Oz
Fallacies of distrib. computing - P. Deutsch
Ray docs
Ray tutorial
Plasma store
Plasma store and Arrow
Scaling Python modules with Ray framework
61. References
Ray - a cluster computing engine for reinforcement learning applications
https://ray-project.github.io/2018/07/15/parameter-server-in-fifteen-lines.html
Ray: A Distributed Execution Framework for AI | SciPy 2018 - Robert Nishihara
Dask and Celery - M. Rocklin
Dask comparison to Spark
Ray: A Distributed System for AI