Machine learning is being deployed in a growing number of applications which demand real-time, accurate, and robust predictions under heavy serving loads. However, most machine learning frameworks and systems only address model training and not deployment.
Clipper is a general-purpose model-serving system that addresses these challenges. Interposing between applications that consume predictions and the machine-learning models that produce predictions, Clipper simplifies the model deployment process by isolating models in their own containers and communicating with them over a lightweight RPC system. This architecture allows models to be deployed for serving in the same runtime environment as that used during training. Further, it provides simple mechanisms for scaling out models to meet increased throughput demands and performing fine-grained physical resource allocation for each model.
In this talk, I will provide an overview of the Clipper serving system and then discuss how to get started using Clipper to serve Spark and TensorFlow models in a production serving environment.
10. Models getting more complex
Ø 10s of GFLOPs [1]
Ø Using specialized hardware for predictions
Deployed on the critical path
Ø Support low-latency, high-throughput serving workloads
Ø Maintain SLOs under heavy load
[1] Deep Residual Learning for Image Recognition. He et al. CVPR 2015.
11. Google Translate Serving
Ø 140 billion words a day [1]
Ø 82,000 GPUs running 24/7
Ø Invented new hardware: the Tensor Processing Unit (TPU)
[1] https://www.nytimes.com/2016/12/14/magazine/the-great-ai-awakening.html
12. Big Companies Build One-Off Systems
Problems:
Ø Expensive to build and maintain
Ø Highly specialized and require both ML and systems expertise
Ø Tightly coupled model and application
Ø Difficult to change or update model
Ø Only supports a single ML framework
14. Large and growing ecosystem of ML models and frameworks
[Figure: an application (e.g. fraud detection) must choose among many ML frameworks]
15. Large and growing ecosystem of ML models and frameworks
Ø Different programming languages:
  Ø TensorFlow: Python and C++
  Ø Spark: JVM-based
Ø Different hardware:
  Ø CPUs and GPUs
Ø Different costs:
  Ø DNNs in TensorFlow are generally much slower
16. Large and growing ecosystem of ML models and frameworks
[Figure: many applications (content rec., personal asst., robotic control, machine translation, fraud detection) paired against many frameworks (Create, VW, Caffe, ...)]
21. Use existing systems: Offline Scoring
[Figure: for low-latency serving, the application looks up each query's decision (X → Y) in a datastore]
Problems:
Ø Requires full set of queries ahead of time
  Ø Small and bounded input domain
Ø Wasted computation and space
  Ø Can render and store unneeded predictions
Ø Costly to update
  Ø Re-run batch job
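The offline-scoring pattern above can be sketched in a few lines. This is an illustrative stand-in, not Clipper code: the "model" is a made-up string-length classifier, and the datastore is just a dict. It shows both the fast lookup path and the failure mode for queries outside the precomputed set.

```python
# Illustrative sketch of offline scoring: all predictions are computed
# ahead of time by a batch job, and serving is a pure datastore lookup.
# The "model" here is a made-up stand-in (string-length parity).

def model(x):
    # Hypothetical model: classifies queries by length parity.
    return "even" if len(x) % 2 == 0 else "odd"

# Batch job: requires enumerating the FULL query set ahead of time.
known_queries = ["cat", "door", "apple"]
datastore = {x: model(x) for x in known_queries}

def serve(query):
    # Low-latency path: a single lookup, no model evaluation.
    try:
        return datastore[query]
    except KeyError:
        # Unseen query: offline scoring has no answer until the
        # batch job is re-run over an updated query set.
        return None

serve("cat")    # hit: returns the precomputed prediction
serve("zebra")  # miss: query was not in the precomputed set
```

The miss case is exactly why this approach only works for small, bounded input domains: updating the answer for "zebra" means re-running the whole batch job.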
23. Prediction-Serving Today
Ø Highly specialized systems for specific problems
Ø Offline scoring with existing frameworks and systems
Clipper aims to unify these approaches.
24. What would we like this to look like?
[Figure: many applications (content rec., fraud detection, personal asst., robotic control, machine translation) served by many frameworks (Create, VW, Caffe, ...)]
25. Encapsulate each system in an isolated environment
[Figure: each framework (Create, VW, Caffe, ...) wrapped in its own isolated container]
26. Decouple models and applications
[Figure: the applications (content rec., fraud detection, personal asst., robotic control, machine translation) separated from the containerized frameworks (Caffe, ...)]
27. Clipper Decouples Applications and Models
[Figure: applications send Predict calls to Clipper, which forwards them over RPC to model containers (MCs), e.g. a Caffe container]
28. Clipper Implementation
[Figure: applications call Predict on Clipper's Model Abstraction Layer (caching, adaptive batching), which talks over RPC to model containers (MCs)]
Core system: 35,000 lines of C++
Open source: https://github.com/ucbrise/clipper
Goal: community-owned platform for model deployment and serving
31. Getting Started with Clipper
Docker images available on Docker Hub
Clipper admin is distributed as a pip package:
pip install clipper_admin
Get up and running without cloning or compiling!
32. Technologies inside Clipper
☐ Containerized frameworks: unified abstraction and isolation
☐ Cross-framework caching and batching: optimize throughput and latency
☐ Cross-framework model composition: improved accuracy through ensembles and bandits
35. Common Interface → Simplifies Deployment
[Figure: Clipper connected over RPC to model containers (MCs), e.g. a Caffe container]
Ø Evaluate models using original code & systems
38. Container-based Model Deployment
Package the model implementation and its dependencies in a Model Container (MC):

class ModelContainer:
    def __init__(self, model_data): ...
    def predict_batch(self, inputs): ...
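A minimal concrete sketch of this two-method interface, assuming only what the slide shows. The "model" here is a hypothetical threshold classifier; a real container would deserialize framework-specific model_data instead.

```python
# Minimal sketch of the model-container interface from the slide.
# The wrapped "model" is a made-up threshold classifier; real
# containers load framework-specific model data (weights, a saved
# model file, etc.) in __init__.

class ModelContainer:
    def __init__(self, model_data):
        # model_data would normally be serialized model state;
        # here it is just a numeric decision threshold.
        self.threshold = model_data

    def predict_batch(self, inputs):
        # Score a whole batch at once so the serving system can
        # amortize per-call overhead across many queries.
        return [1 if x >= self.threshold else 0 for x in inputs]

container = ModelContainer(model_data=0.5)
preds = container.predict_batch([0.1, 0.7, 0.5])
```

Because the interface is batch-oriented, Clipper's adaptive batching (later in the talk) can hand the container variable-sized batches without the container needing to know about it.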
40. Common Interface → Simplifies Deployment
[Figure: Clipper connected over RPC to model containers (MCs), e.g. a Caffe container]
Ø Evaluate models using original code & systems
Ø Models run in separate processes as Docker containers
Ø Resource isolation: cutting-edge ML frameworks can be buggy
41. Common Interface → Simplifies Deployment
[Figure: Clipper connected over RPC to replicated model containers (MCs)]
Ø Evaluate models using original code & systems
Ø Models run in separate processes as Docker containers
Ø Resource isolation: cutting-edge ML frameworks can be buggy
Ø Scale out by adding model container replicas
43. Overhead of decoupled architecture
[Figure: throughput (QPS, higher is better) and p99 latency (ms, lower is better); model: AlexNet trained on CIFAR-10]
44. Technologies inside Clipper
☐ Containerized frameworks: unified abstraction and isolation
☐ Cross-framework caching and batching: optimize throughput and latency
☐ Cross-framework model composition: improved accuracy through ensembles and bandits
46. Batching to Improve Throughput
A single page load may generate many queries.
Ø Why batching helps: frameworks are throughput-optimized
[Figure: throughput vs. batch size]
Ø Optimal batch size depends on:
  Ø hardware configuration
  Ø model and framework
  Ø system load
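Why throughput rises with batch size can be shown with a toy cost model: each batch pays a fixed overhead (RPC, dispatch, kernel launch) plus a per-item cost, so larger batches amortize the overhead. The constants below are made-up illustrations, not measurements of any real framework.

```python
# Toy cost model for why batching helps. Each batch pays a fixed
# overhead once, plus a per-item cost, so throughput grows with
# batch size. The constants are illustrative, not measured.

FIXED_OVERHEAD_MS = 10.0   # paid once per batch (RPC, dispatch, ...)
PER_ITEM_MS = 0.5          # paid per query in the batch

def throughput_qps(batch_size):
    batch_latency_ms = FIXED_OVERHEAD_MS + PER_ITEM_MS * batch_size
    return batch_size / (batch_latency_ms / 1000.0)

for b in [1, 8, 64, 512]:
    print(b, round(throughput_qps(b)))
```

Note that batch latency also grows with batch size, which is exactly the tension the next slide's latency-aware batching resolves.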
47. Latency-aware Batching to Improve Throughput
Clipper solution: adaptively trade off latency and throughput
Ø Increase the batch size until the latency objective is exceeded (Additive Increase)
Ø If latency exceeds the SLO, cut the batch size by a fraction (Multiplicative Decrease)
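The AIMD rule above can be sketched as a small controller. The step size, backoff fraction, and the linear latency model in the simulation are illustrative choices, not Clipper's tuned values.

```python
# Sketch of the additive-increase / multiplicative-decrease (AIMD)
# batch-size control described above. The step size and backoff
# fraction are illustrative, not Clipper's actual parameters.

def aimd_step(batch_size, observed_latency_ms, slo_ms,
              additive_step=1, backoff=0.9):
    if observed_latency_ms > slo_ms:
        # Latency exceeded the SLO: multiplicative decrease.
        return max(1, int(batch_size * backoff))
    # Under the SLO: additive increase to probe for more throughput.
    return batch_size + additive_step

# Simulate against a toy latency model (0.5 ms per item in the batch).
batch_size, slo_ms = 1, 20.0
for _ in range(100):
    latency_ms = 0.5 * batch_size
    batch_size = aimd_step(batch_size, latency_ms, slo_ms)
```

Under this toy latency model the controller settles into a sawtooth just below the SLO-feasible batch size (around 40 items), probing upward and backing off, which is the intended AIMD behavior.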
49. Overhead of decoupled architecture
[Figure: applications issue Predict and Feedback calls to Clipper through an RPC/REST interface; Clipper forwards them over RPC to model containers (MCs), e.g. a Caffe container]
50. Technologies inside Clipper
☐ Containerized frameworks: unified abstraction and isolation
☐ Cross-framework caching and batching: optimize throughput and latency
☐ Cross-framework model composition: improved accuracy through ensembles and bandits
55. Model Selection Layer
Selection policy:
Ø Exploit multiple models to estimate confidence
Ø Use multi-armed bandit algorithms to learn the optimal model selection online
Ø Online personalization across ML frameworks
*See paper for details [NSDI 2017]
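To make the bandit idea concrete, here is an epsilon-greedy multi-armed bandit choosing between two models from simulated reward feedback. This is a deliberately simple stand-in for the selection policies in the Clipper paper (which uses more sophisticated bandit algorithms); the model names and accuracies are made up.

```python
import random

# Epsilon-greedy bandit for model selection: an illustrative stand-in
# for the selection policies described above. The two "models" and
# their accuracies are hypothetical; rewards are simulated feedback.

random.seed(0)
TRUE_ACCURACY = {"model_a": 0.60, "model_b": 0.85}  # hidden from the policy

counts = {m: 0 for m in TRUE_ACCURACY}   # pulls per model
values = {m: 0.0 for m in TRUE_ACCURACY} # observed mean reward per model

def select_model(epsilon=0.1):
    if random.random() < epsilon:
        return random.choice(list(TRUE_ACCURACY))  # explore
    return max(values, key=values.get)             # exploit

for _ in range(2000):
    m = select_model()
    # Simulated feedback: 1.0 if the chosen model was correct.
    reward = 1.0 if random.random() < TRUE_ACCURACY[m] else 0.0
    counts[m] += 1
    # Incremental mean update of the model's observed reward.
    values[m] += (reward - values[m]) / counts[m]

best = max(values, key=values.get)
```

After enough feedback the observed means concentrate around the true accuracies, so the policy routes most queries to the stronger model while still occasionally exploring the other.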
58. You can always implement your own model container:

class ModelContainer:
    def __init__(self, model_data): ...
    def predict_batch(self, inputs): ...
59. Clipper admin will automatically deploy Spark models
Clipper.deploy_pyspark_model(model, predict_function)
60. Clipper admin will automatically deploy Spark models
Clipper.deploy_pyspark_model(model, predict_function)
Instance of PySpark model:
pyspark.mllib.*
pyspark.ml.pipeline.PipelineModel
61. Clipper admin will automatically deploy Spark models
Clipper.deploy_pyspark_model(model, predict_function)
predict_function is a Python closure:

def predict_function(spark, model, inputs)

Params:
  spark: SparkSession
  model: ml or mllib model instance
  inputs: list of inputs to predict
Returns:
  list of str

Can be used for business logic, featurization, post-processing...
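A hypothetical predict_function matching the signature above, with simple post-processing as the slide suggests. So the sketch runs without Spark, StubModel stands in for a real PySpark model and the spark argument goes unused; in a real deployment Clipper would pass a SparkSession and your trained model. The threshold and labels are made up.

```python
# Hypothetical predict_function with the (spark, model, inputs)
# signature shown above. It must return a list of str; here the
# post-processing step maps raw scores to labels. StubModel is a
# stand-in so the example runs without Spark.

class StubModel:
    def predict(self, x):
        # Pretend "score": a real PySpark model runs inference here.
        return x * 2.0

def predict_function(spark, model, inputs):
    # spark would be a SparkSession in a real deployment; the stub
    # model does not need it, so it goes unused in this sketch.
    results = []
    for x in inputs:
        score = model.predict(x)
        # Business logic / post-processing: map scores to labels.
        results.append("high" if score > 1.0 else "low")
    return results

labels = predict_function(None, StubModel(), [0.2, 0.9])
```

Because the closure wraps the model call, any featurization before model.predict and any formatting after it ships with the model, without changes to the application.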
67. Clipper can automatically deploy other models:
Ø Arbitrary Python closures:
  Ø Scikit-Learn models
  Ø pred_function can contain any Python code that is pickle-able
Ø Deploy Spark models from Scala or Java:
  Ø Supports both MLlib and Pipelines APIs
69. Check out Clipper
Ø Alpha release (0.1.3) available
Ø First-class support for Scikit-Learn and Spark models
Ø Deploy directly from a training job without writing Docker containers
Ø Actively working with early users
Ø Post issues and questions on GitHub and subscribe to our mailing list: clipper-dev@googlegroups.com
http://clipper.ai
70. What's next
Ø Ongoing efforts to support additional frameworks:
  Ø TensorFlow (supported, but requires writing Docker containers)
  Ø R (open pull request using a Python-R interop library)
  Ø Caffe2
Ø Kubernetes integration (early internal prototype)
Ø Bandit methods, online personalization, and cascaded predictions
  Ø Studied in research prototypes but still need to be integrated into the system
Ø Actively looking for contributors: https://github.com/ucbrise/clipper