Machine learning is being deployed in a growing number of applications which demand real-time, accurate, and robust predictions under heavy serving loads. However, most machine learning frameworks and systems only address model training and not deployment.
Clipper is a general-purpose model-serving system that addresses these challenges. Interposing between applications that consume predictions and the machine-learning models that produce predictions, Clipper simplifies the model deployment process by isolating models in their own containers and communicating with them over a lightweight RPC system. This architecture allows models to be deployed for serving in the same runtime environment as that used during training. Further, it provides simple mechanisms for scaling out models to meet increased throughput demands and performing fine-grained physical resource allocation for each model.
In this talk, I will provide an overview of the Clipper serving system and then discuss how to get started using Clipper to serve Spark and TensorFlow models in a production serving environment.
10. Models getting more complex
Ø 10s of GFLOPs [1]
Ø Using specialized hardware for predictions
Deployed on the critical path
Ø Support low-latency, high-throughput serving workloads
Ø Maintain SLOs under heavy load
[1] Deep Residual Learning for Image Recognition. He et al. CVPR 2015.
11. Google Translate Serving
Ø 140 billion words a day [1]
Ø 82,000 GPUs running 24/7
Ø Invented new hardware: the Tensor Processing Unit (TPU)
[1] https://www.nytimes.com/2016/12/14/magazine/the-great-ai-awakening.html
12. Big Companies Build One-Off Systems
Problems:
Ø Expensive to build and maintain
Ø Highly specialized and require both ML and systems expertise
Ø Tightly coupled model and application
Ø Difficult to change or update model
Ø Only supports a single ML framework
14. Large and growing ecosystem of ML models and frameworks
[Figure: an application (e.g. fraud detection) must choose among many ML frameworks]
15. Large and growing ecosystem of ML models and frameworks
Ø Different programming languages:
  Ø TensorFlow: Python and C++
  Ø Spark: JVM-based
Ø Different hardware:
  Ø CPUs and GPUs
Ø Different costs:
  Ø DNNs in TensorFlow are generally much slower
16. Large and growing ecosystem of ML models and frameworks
[Figure: many applications (content rec., personal asst., robotic control, machine translation, fraud detection) paired against many frameworks (Create, VW, Caffe, ...)]
21. Use existing systems: Offline Scoring
[Figure: for low-latency serving, the application looks up each query's decision (X → Y) in a datastore]
Problems:
Ø Requires full set of queries ahead of time
  Ø Small and bounded input domain
Ø Wasted computation and space
  Ø Can render and store unneeded predictions
Ø Costly to update
  Ø Re-run batch job
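The offline-scoring pattern above can be sketched in a few lines. This is an illustrative stand-in, not Clipper code: the "model" is a made-up string-length classifier, and the datastore is just a dict. It shows both the fast lookup path and the failure mode for queries outside the precomputed set.

```python
# Illustrative sketch of offline scoring: all predictions are computed
# ahead of time by a batch job, and serving is a pure datastore lookup.
# The "model" here is a made-up stand-in (string-length parity).

def model(x):
    # Hypothetical model: classifies queries by length parity.
    return "even" if len(x) % 2 == 0 else "odd"

# Batch job: requires enumerating the FULL query set ahead of time.
known_queries = ["cat", "door", "apple"]
datastore = {x: model(x) for x in known_queries}

def serve(query):
    # Low-latency path: a single lookup, no model evaluation.
    try:
        return datastore[query]
    except KeyError:
        # Unseen query: offline scoring has no answer until the
        # batch job is re-run over an updated query set.
        return None

serve("cat")    # hit: returns the precomputed prediction
serve("zebra")  # miss: query was not in the precomputed set
```

The miss case is exactly why this approach only works for small, bounded input domains: updating the answer for "zebra" means re-running the whole batch job.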
23. Prediction-Serving Today
Ø Highly specialized systems for specific problems
Ø Offline scoring with existing frameworks and systems
Clipper aims to unify these approaches.
24. What would we like this to look like?
[Figure: many applications (content rec., fraud detection, personal asst., robotic control, machine translation) served by many frameworks (Create, VW, Caffe, ...)]
25. Encapsulate each system in an isolated environment
[Figure: each framework (Create, VW, Caffe, ...) wrapped in its own isolated container]
26. Decouple models and applications
[Figure: the applications (content rec., fraud detection, personal asst., robotic control, machine translation) separated from the containerized frameworks (Caffe, ...)]
27. Clipper Decouples Applications and Models
[Figure: applications send Predict calls to Clipper, which forwards them over RPC to model containers (MCs), e.g. a Caffe container]
28. Clipper Implementation
[Figure: applications call Predict on Clipper's Model Abstraction Layer (caching, adaptive batching), which talks over RPC to model containers (MCs)]
Core system: 35,000 lines of C++
Open source: https://github.com/ucbrise/clipper
Goal: community-owned platform for model deployment and serving
31. Getting Started with Clipper
Docker images available on Docker Hub
Clipper admin is distributed as a pip package:
pip install clipper_admin
Get up and running without cloning or compiling!
32. Technologies inside Clipper
☐ Containerized frameworks: unified abstraction and isolation
☐ Cross-framework caching and batching: optimize throughput and latency
☐ Cross-framework model composition: improved accuracy through ensembles and bandits
35. Common Interface → Simplifies Deployment
[Figure: Clipper connected over RPC to model containers (MCs), e.g. a Caffe container]
Ø Evaluate models using original code & systems
38. Container-based Model Deployment
Package the model implementation and its dependencies in a Model Container (MC):

class ModelContainer:
    def __init__(self, model_data): ...
    def predict_batch(self, inputs): ...
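A minimal concrete sketch of this two-method interface, assuming only what the slide shows. The "model" here is a hypothetical threshold classifier; a real container would deserialize framework-specific model_data instead.

```python
# Minimal sketch of the model-container interface from the slide.
# The wrapped "model" is a made-up threshold classifier; real
# containers load framework-specific model data (weights, a saved
# model file, etc.) in __init__.

class ModelContainer:
    def __init__(self, model_data):
        # model_data would normally be serialized model state;
        # here it is just a numeric decision threshold.
        self.threshold = model_data

    def predict_batch(self, inputs):
        # Score a whole batch at once so the serving system can
        # amortize per-call overhead across many queries.
        return [1 if x >= self.threshold else 0 for x in inputs]

container = ModelContainer(model_data=0.5)
preds = container.predict_batch([0.1, 0.7, 0.5])
```

Because the interface is batch-oriented, Clipper's adaptive batching (later in the talk) can hand the container variable-sized batches without the container needing to know about it.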
40. Common Interface → Simplifies Deployment
[Figure: Clipper connected over RPC to model containers (MCs), e.g. a Caffe container]
Ø Evaluate models using original code & systems
Ø Models run in separate processes as Docker containers
Ø Resource isolation: cutting-edge ML frameworks can be buggy
41. Common Interface → Simplifies Deployment
[Figure: Clipper connected over RPC to replicated model containers (MCs)]
Ø Evaluate models using original code & systems
Ø Models run in separate processes as Docker containers
Ø Resource isolation: cutting-edge ML frameworks can be buggy
Ø Scale out by adding model container replicas
43. Overhead of decoupled architecture
[Figure: throughput (QPS, higher is better) and p99 latency (ms, lower is better); model: AlexNet trained on CIFAR-10]
44. Technologies inside Clipper
☐ Containerized frameworks: unified abstraction and isolation
☐ Cross-framework caching and batching: optimize throughput and latency
☐ Cross-framework model composition: improved accuracy through ensembles and bandits
46. Batching to Improve Throughput
A single page load may generate many queries.
Ø Why batching helps: frameworks are throughput-optimized
[Figure: throughput vs. batch size]
Ø Optimal batch size depends on:
  Ø hardware configuration
  Ø model and framework
  Ø system load
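Why throughput rises with batch size can be shown with a toy cost model: each batch pays a fixed overhead (RPC, dispatch, kernel launch) plus a per-item cost, so larger batches amortize the overhead. The constants below are made-up illustrations, not measurements of any real framework.

```python
# Toy cost model for why batching helps. Each batch pays a fixed
# overhead once, plus a per-item cost, so throughput grows with
# batch size. The constants are illustrative, not measured.

FIXED_OVERHEAD_MS = 10.0   # paid once per batch (RPC, dispatch, ...)
PER_ITEM_MS = 0.5          # paid per query in the batch

def throughput_qps(batch_size):
    batch_latency_ms = FIXED_OVERHEAD_MS + PER_ITEM_MS * batch_size
    return batch_size / (batch_latency_ms / 1000.0)

for b in [1, 8, 64, 512]:
    print(b, round(throughput_qps(b)))
```

Note that batch latency also grows with batch size, which is exactly the tension the next slide's latency-aware batching resolves.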
47. Latency-aware Batching to Improve Throughput
Clipper solution: adaptively trade off latency and throughput
Ø Increase the batch size until the latency objective is exceeded (Additive Increase)
Ø If latency exceeds the SLO, cut the batch size by a fraction (Multiplicative Decrease)
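The AIMD rule above can be sketched as a small controller. The step size, backoff fraction, and the linear latency model in the simulation are illustrative choices, not Clipper's tuned values.

```python
# Sketch of the additive-increase / multiplicative-decrease (AIMD)
# batch-size control described above. The step size and backoff
# fraction are illustrative, not Clipper's actual parameters.

def aimd_step(batch_size, observed_latency_ms, slo_ms,
              additive_step=1, backoff=0.9):
    if observed_latency_ms > slo_ms:
        # Latency exceeded the SLO: multiplicative decrease.
        return max(1, int(batch_size * backoff))
    # Under the SLO: additive increase to probe for more throughput.
    return batch_size + additive_step

# Simulate against a toy latency model (0.5 ms per item in the batch).
batch_size, slo_ms = 1, 20.0
for _ in range(100):
    latency_ms = 0.5 * batch_size
    batch_size = aimd_step(batch_size, latency_ms, slo_ms)
```

Under this toy latency model the controller settles into a sawtooth just below the SLO-feasible batch size (around 40 items), probing upward and backing off, which is the intended AIMD behavior.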
49. Overhead of decoupled architecture
[Figure: applications issue Predict and Feedback calls to Clipper through an RPC/REST interface; Clipper forwards them over RPC to model containers (MCs), e.g. a Caffe container]
50. Technologies inside Clipper
☐ Containerized frameworks: unified abstraction and isolation
☐ Cross-framework caching and batching: optimize throughput and latency
☐ Cross-framework model composition: improved accuracy through ensembles and bandits
55. Model Selection Layer
Selection policy:
Ø Exploit multiple models to estimate confidence
Ø Use multi-armed bandit algorithms to learn the optimal model selection online
Ø Online personalization across ML frameworks
*See paper for details [NSDI 2017]
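To make the bandit idea concrete, here is an epsilon-greedy multi-armed bandit choosing between two models from simulated reward feedback. This is a deliberately simple stand-in for the selection policies in the Clipper paper (which uses more sophisticated bandit algorithms); the model names and accuracies are made up.

```python
import random

# Epsilon-greedy bandit for model selection: an illustrative stand-in
# for the selection policies described above. The two "models" and
# their accuracies are hypothetical; rewards are simulated feedback.

random.seed(0)
TRUE_ACCURACY = {"model_a": 0.60, "model_b": 0.85}  # hidden from the policy

counts = {m: 0 for m in TRUE_ACCURACY}   # pulls per model
values = {m: 0.0 for m in TRUE_ACCURACY} # observed mean reward per model

def select_model(epsilon=0.1):
    if random.random() < epsilon:
        return random.choice(list(TRUE_ACCURACY))  # explore
    return max(values, key=values.get)             # exploit

for _ in range(2000):
    m = select_model()
    # Simulated feedback: 1.0 if the chosen model was correct.
    reward = 1.0 if random.random() < TRUE_ACCURACY[m] else 0.0
    counts[m] += 1
    # Incremental mean update of the model's observed reward.
    values[m] += (reward - values[m]) / counts[m]

best = max(values, key=values.get)
```

After enough feedback the observed means concentrate around the true accuracies, so the policy routes most queries to the stronger model while still occasionally exploring the other.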
58. You can always implement your own model container:

class ModelContainer:
    def __init__(self, model_data): ...
    def predict_batch(self, inputs): ...
59. Clipper admin will automatically deploy Spark models
Clipper.deploy_pyspark_model(model, predict_function)
60. Clipper admin will automatically deploy Spark models
Clipper.deploy_pyspark_model(model, predict_function)
Instance of PySpark model:
pyspark.mllib.*
pyspark.ml.pipeline.PipelineModel
61. Clipper admin will automatically deploy Spark models
Clipper.deploy_pyspark_model(model, predict_function)
predict_function is a Python closure:

def predict_function(spark, model, inputs)

Params:
  spark: SparkSession
  model: ml or mllib model instance
  inputs: list of inputs to predict
Returns:
  list of str

Can be used for business logic, featurization, post-processing...
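A hypothetical predict_function matching the signature above, with simple post-processing as the slide suggests. So the sketch runs without Spark, StubModel stands in for a real PySpark model and the spark argument goes unused; in a real deployment Clipper would pass a SparkSession and your trained model. The threshold and labels are made up.

```python
# Hypothetical predict_function with the (spark, model, inputs)
# signature shown above. It must return a list of str; here the
# post-processing step maps raw scores to labels. StubModel is a
# stand-in so the example runs without Spark.

class StubModel:
    def predict(self, x):
        # Pretend "score": a real PySpark model runs inference here.
        return x * 2.0

def predict_function(spark, model, inputs):
    # spark would be a SparkSession in a real deployment; the stub
    # model does not need it, so it goes unused in this sketch.
    results = []
    for x in inputs:
        score = model.predict(x)
        # Business logic / post-processing: map scores to labels.
        results.append("high" if score > 1.0 else "low")
    return results

labels = predict_function(None, StubModel(), [0.2, 0.9])
```

Because the closure wraps the model call, any featurization before model.predict and any formatting after it ships with the model, without changes to the application.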
67. Clipper can automatically deploy other models:
Ø Arbitrary Python closures:
  Ø Scikit-Learn models
  Ø pred_function can contain any Python code that is pickle-able
Ø Deploy Spark models from Scala or Java:
  Ø Supports both MLlib and Pipelines APIs
69. Check out Clipper
Ø Alpha release (0.1.3) available
Ø First-class support for Scikit-Learn and Spark models
Ø Deploy directly from a training job without writing Docker containers
Ø Actively working with early users
Ø Post issues and questions on GitHub and subscribe to our mailing list: clipper-dev@googlegroups.com
http://clipper.ai
70. What's next
Ø Ongoing efforts to support additional frameworks:
  Ø TensorFlow (supported, but requires writing Docker containers)
  Ø R (open pull request using a Python-R interop library)
  Ø Caffe2
Ø Kubernetes integration (early internal prototype)
Ø Bandit methods, online personalization, and cascaded predictions
  Ø Studied in research prototypes but still need to be integrated into the system
Ø Actively looking for contributors: https://github.com/ucbrise/clipper