Machine learning is being deployed in a growing number of applications which demand real-time, accurate, and robust predictions under heavy query load. However, most machine learning frameworks and systems only address model training and not deployment.
In this talk, we present Clipper, a general-purpose low-latency prediction serving system. Interposing between end-user applications and a wide range of machine learning frameworks, Clipper introduces a modular architecture to simplify model deployment across frameworks. Furthermore, by introducing caching, batching, and adaptive model selection techniques, Clipper reduces prediction latency and improves prediction throughput, accuracy, and robustness without modifying the underlying machine learning frameworks. We evaluate Clipper on four common machine learning benchmark datasets and demonstrate its ability to meet the latency, accuracy, and throughput demands of online serving applications. We also compare Clipper to the TensorFlow Serving system and demonstrate comparable prediction throughput and latency on a range of models while enabling new functionality, improved accuracy, and robustness.
12. Serving Predictions Today: Offline Scoring
[Diagram: the Application sends a Query to the Online Serving System, which looks up the Decision in a KV-store of precomputed predictions (X → Y).]
13. Serving Predictions Today: Offline Scoring
Problems:
• Requires full set of queries ahead of time
• Small and bounded input domain
• Wasted computation and space
  • Can render and store unneeded predictions
• No feedback and costly to update
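To make the pattern concrete, here is a minimal sketch of offline scoring in Python; the dict stands in for the KV-store, and the model and query values are purely illustrative:

    # Offline scoring: precompute a prediction for every anticipated query,
    # then answer requests by lookup alone.
    def precompute(model, all_queries):
        return {x: model(x) for x in all_queries}  # the X -> Y table

    def serve(kv_store, query):
        # Returns None for any query that was not rendered ahead of time.
        return kv_store.get(query)

    table = precompute(lambda x: x * x, range(10))  # bounded input domain
    print(serve(table, 3))   # 9
    print(serve(table, 42))  # None: unseen queries cannot be answered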
14. Serving Predictions Today: Online Scoring
[Diagram: the Application sends a Query to the Online Serving System, which renders the Decision with the model in real time.]
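For contrast, a minimal sketch of online scoring under the same illustrative setup; the model is now invoked on the request path, so its latency becomes part of serving latency:

    # Online scoring: render the prediction with the model at request time.
    def serve_online(model, query):
        return model(query)  # any input works, but the model's latency is now ours

    print(serve_online(lambda x: x * x, 42))  # 1764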
17. Many applications and many models
[Diagram: many applications (content recommendation, fraud detection, personal assistants, robotic control, machine translation) each wired directly to machine learning frameworks such as Caffe and VW, with a "???" placeholder between them.]
18. Can we decouple models and applications?
[Diagram: same applications and frameworks as above, asking what layer ("???") could sit between them.]
19. Requirements
• The system cannot stand in the way of independent evolution of applications and models; it empowers them
  • enables separate evolution and development
• From the perspective of the data scientist
  • Ease of application evolution
    • model rollout
    • application deployment
  • Support for a wide range of frameworks, so that data scientists can
    • improve accuracy and use cutting-edge techniques and frameworks
    • experiment with models in live predictions
  • Don't have to worry about applications (performance)
• From the perspective of the frontend developer
  • Stable, reliable, performant APIs (need systems that meet their SLOs)
  • Scale system and hardware to meet application demands
  • Don't worry about models (oblivious to the underlying implementations)
20. Requirements
• Decouple applications from models and allow them to evolve independently of each other
• The Data Scientist perspective: focus on making accurate predictions
  • Support many models and frameworks
  • Simple deployment and online experimentation
  • (Mostly) oblivious to system performance and workload demands
• The Frontend Dev perspective: focus on building reliable, low-latency applications
  • Provide stable, reliable, performant APIs (need systems that meet their SLOs)
  • Scale system and hardware to meet application demands
  • Oblivious to the implementations of the underlying models
21. Prediction-Serving System: Requirements
• Decouple applications from models and allow them to evolve independently of each other
• The Frontend Dev perspective: focus on building reliable, low-latency applications
  • Provide stable, reliable, performant APIs to meet SLAs
  • Scale system and hardware to meet application demands
  • Oblivious to the implementations of the underlying models
• The Data Scientist perspective: focus on making accurate predictions
  • Support many models and frameworks simultaneously
  • Simple deployment and online experimentation
  • (Mostly) oblivious to system performance and workload demands
22. Clipper
From the Frontend Dev perspective:
[Diagram: Applications call Clipper through an RPC/REST query interface (Predict, Feedback) and a management REST API (create_application(), deploy_model(), replicate_model(), inspect_instance()).]
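As a sketch of what this looks like from an application, the snippet below queries a deployed Clipper application over REST; the port, application name, and payload shape follow the Clipper tutorial but should be checked against the docs for your version:

    import requests

    # Predict: POST a JSON input to the query frontend for the "digits"
    # application (the name and address are assumptions for illustration).
    resp = requests.post(
        "http://localhost:1337/digits/predict",
        json={"input": [0.1, 0.2, 0.3]},
    )
    print(resp.json())  # e.g. {"query_id": ..., "output": ...}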
23. From the Data Scientist perspective
Implement the Model API:

    class ModelContainer:
        def __init__(self, model_data): ...
        def predict_batch(self, inputs): ...
24. From the Data Scientist perspective
Implement the Model API:

    class ModelContainer:
        def __init__(self, model_data): ...
        def predict_batch(self, inputs): ...

• Implemented in many languages
  • Python
  • Java
  • C/C++
  • R
  • …
25. From the Data Scientist perspective
The model implementation is packaged in a container: the Model Container (MC).
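As a concrete illustration of this API, here is a hedged sketch of a container wrapping a scikit-learn model; the two-method shape follows the slides, while the pickle-based serialization is an assumption for illustration:

    import pickle

    class SklearnContainer:
        # Illustrative model container: model_data is assumed to be a
        # pickled scikit-learn estimator.
        def __init__(self, model_data):
            self.model = pickle.loads(model_data)

        def predict_batch(self, inputs):
            # One call for the whole batch lets the framework vectorize.
            return self.model.predict(inputs).tolist()

Because the container only sees opaque model_data and a batch of inputs, the same wrapper pattern works for any framework with a batch predict call.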
31. Clipper
Key Insight:
The challenges of prediction serving can be addressed between end-user applications and machine learning frameworks.
As a result, Clipper is able to:
• hide complexity
  • by providing a common prediction interface to applications
• bound latency and maximize throughput
  • through caching, adaptive batching, and model scale-out
• enable robust online learning and personalization
  • through model selection and ensemble algorithms
without modifying machine learning frameworks or end-user applications.
32. Clipper Decouples Applications and Models
33. Challenges
• Managing heterogeneity everywhere
  • Different types of models (different software, different resource requirements) in a production environment
  • Different application performance requirements (workloads, latencies)
• Scheduling (space-time resource management)
  • Where and when to send prediction queries to models
  • Latency-accuracy tradeoffs
  • Marginal utility of allocating additional resources
• How to use feedback to improve accuracy in real time
34. Clipper Architecture
[Architecture diagram: Applications issue Predict/Observe calls over an RPC/REST interface to Clipper; Clipper's layers communicate over RPC with model containers (MCs) hosting models in frameworks such as Caffe.]
Model Selection Layer: improve accuracy through bandit methods and ensembles, online learning, and personalization.
Model Abstraction Layer: provide a common interface to models while bounding latency and maximizing throughput.
36. Model Abstraction Layer
[Diagram: beneath the Correction Layer (correction policy), the Model Abstraction Layer provides Caching and Adaptive Batching and communicates over RPC with model containers (MCs) for frameworks such as Caffe, providing a common interface to models.]
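A minimal sketch of the caching idea (illustrative only; a production prediction cache must also bound memory and deduplicate in-flight requests):

    from functools import lru_cache

    def make_cached_predict(predict_fn, max_entries=4096):
        # Memoize predictions keyed on the (hashable) input, so repeated
        # queries skip the model container entirely.
        @lru_cache(maxsize=max_entries)
        def cached(x):
            return predict_fn(x)
        return cached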
37. Model Abstraction Layer
[Diagram: same as above.]
Common Interface → Simplifies Deployment:
• Evaluate models using original code & systems
• Models run in separate processes as Docker containers (a hypothetical launch sketch follows)
  • Resource isolation
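To make the container-per-model deployment concrete, a hypothetical sketch using the Docker SDK for Python; the image name is an assumption for illustration, not Clipper's actual image layout:

    import docker  # pip install docker

    # Launch two replicas of a model container image: separate processes
    # give resource isolation, and replication gives scale-out.
    client = docker.from_env()
    for _ in range(2):
        client.containers.run("my-sklearn-container:latest", detach=True)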
38. Model Abstraction Layer
[Diagram: same as above, with additional replicated model containers.]
Common Interface → Simplifies Deployment:
• Evaluate models using original code & systems
• Models run in separate processes as Docker containers
  • Resource isolation
  • Scale-out
Problem: frameworks are optimized for batch processing, not latency.
39. Adaptive Batching to Improve Throughput
A single page load may generate many queries.
Why batching helps:
• Hardware acceleration
• Amortizes system overhead
Optimal batch size depends on:
• hardware configuration
• model and framework
• system load
Clipper's solution: be as slow as allowed (sketched below)…
• Increase the batch size until the latency objective is exceeded (Additive Increase)
• If latency exceeds the SLO, cut the batch size by a fraction (Multiplicative Decrease)
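A minimal sketch of this additive-increase/multiplicative-decrease (AIMD) policy; the constants and the measure_batch_latency hook are illustrative assumptions, not Clipper's actual parameters:

    def aimd_batch_size(measure_batch_latency, slo_seconds,
                        step=1, backoff=0.9, max_batch=1024):
        # Probe upward while measured latency stays under the SLO,
        # then back off multiplicatively once it is exceeded.
        batch_size = 1
        while batch_size < max_batch:
            latency = measure_batch_latency(batch_size)  # time a real batch
            if latency > slo_seconds:
                return max(1, int(batch_size * backoff))  # multiplicative decrease
            batch_size += step  # additive increase
        return max_batch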
43. Model Selection Layer: Bring Learning into the Serving Tier
[Diagram: the Model Selection Layer applies a selection policy for real-time model selection and ensembles on top of slow-changing models in frameworks such as Caffe.]
What can we learn?
• Dynamically weight a mixture of experts
• Select the best model for each user
• Use an ensemble to estimate prediction confidence
• Don't try to retrain models
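The talk does not spell out a specific algorithm here; as one illustrative possibility, a multiplicative-weights sketch in the spirit of the bandit methods mentioned earlier, with hypothetical names and constants:

    import math, random

    class SelectionPolicy:
        # Keep one weight per model; sample models in proportion to weight
        # and down-weight a model whenever feedback shows it incurred loss.
        def __init__(self, n_models, learning_rate=0.1):
            self.weights = [1.0] * n_models
            self.eta = learning_rate

        def select(self):
            r = random.uniform(0, sum(self.weights))
            acc = 0.0
            for i, w in enumerate(self.weights):
                acc += w
                if r <= acc:
                    return i
            return len(self.weights) - 1

        def observe(self, model_idx, loss):
            # loss in [0, 1], derived from application feedback
            self.weights[model_idx] *= math.exp(-self.eta * loss)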
44. Road Map
• Open source on GitHub: https://github.com/ucbrise/clipper
  • Kick the tires and try out our tutorial
• Alpha release in mid-April
  • Focused on reliability and performance for serving single-model applications
  • First-class support for Scikit-Learn and Spark models, and arbitrary Python functions
  • Coordinating the initial set of features with RISE Lab sponsors and collaborators
• After the alpha release
  • Support for selection policies and multi-model applications
  • Model performance monitoring to detect and correct accuracy degradation
  • New task scheduler design to leverage model and resource heterogeneity
"Clipper: A Low-Latency Online Prediction Serving System" [NSDI '17]
https://arxiv.org/abs/1612.03079
crankshaw@cs.berkeley.edu