Machine learning is being deployed in a growing number of applications which demand real-time, accurate, and robust predictions under heavy query load. However, most machine learning frameworks and systems only address model training and not deployment.
In this talk, we present Clipper, a general-purpose low-latency prediction serving system. Interposing between end-user applications and a wide range of machine learning frameworks, Clipper introduces a modular architecture to simplify model deployment across frameworks. Furthermore, by introducing caching, batching, and adaptive model selection techniques, Clipper reduces prediction latency and improves prediction throughput, accuracy, and robustness without modifying the underlying machine learning frameworks. We evaluate Clipper on four common machine learning benchmark datasets and demonstrate its ability to meet the latency, accuracy, and throughput demands of online serving applications. We also compare Clipper to the TensorFlow Serving system and demonstrate comparable prediction throughput and latency on a range of models while enabling new functionality, improved accuracy, and robustness.
12. Serving Predictions Today: Offline Scoring
[Diagram: the Application sends a Query to the Online Serving System, which looks up the Decision in a KV-store of precomputed predictions (X → Y).]
13. Serving Predictions Today: Offline Scoring
Problems:
• Requires full set of queries ahead of time
• Small and bounded input domain
• Wasted computation and space
  • Can render and store unneeded predictions
• No feedback and costly to update
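To make the pattern concrete, here is a minimal sketch of offline scoring in Python; the dict stands in for the KV-store, and the model and query values are purely illustrative:

    # Offline scoring: precompute a prediction for every anticipated query,
    # then answer requests by lookup alone.
    def precompute(model, all_queries):
        return {x: model(x) for x in all_queries}  # the X -> Y table

    def serve(kv_store, query):
        # Returns None for any query that was not rendered ahead of time.
        return kv_store.get(query)

    table = precompute(lambda x: x * x, range(10))  # bounded input domain
    print(serve(table, 3))   # 9
    print(serve(table, 42))  # None: unseen queries cannot be answered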
14. Serving Predictions Today: Online Scoring
[Diagram: the Application sends a Query to the Online Serving System, which renders the Decision with the model in real time.]
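For contrast, a minimal sketch of online scoring under the same illustrative setup; the model is now invoked on the request path, so its latency becomes part of serving latency:

    # Online scoring: render the prediction with the model at request time.
    def serve_online(model, query):
        return model(query)  # any input works, but the model's latency is now ours

    print(serve_online(lambda x: x * x, 42))  # 1764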
17. Many applications and many models
[Diagram: many applications (content recommendation, fraud detection, personal assistants, robotic control, machine translation) each wired directly to machine learning frameworks such as Caffe and VW, with a "???" placeholder between them.]
18. Can we decouple models and applications?
[Diagram: same applications and frameworks as above, asking what layer ("???") could sit between them.]
19. Requirements
• The system cannot stand in the way of independent evolution of applications and models; it empowers them
  • enables separate evolution and development
• From the perspective of the data scientist
  • Ease of application evolution
    • model rollout
    • application deployment
  • Support for a wide range of frameworks, so that data scientists can
    • improve accuracy and use cutting-edge techniques and frameworks
    • experiment with models in live predictions
  • Don't have to worry about applications (performance)
• From the perspective of the frontend developer
  • Stable, reliable, performant APIs (need systems that meet their SLOs)
  • Scale system and hardware to meet application demands
  • Don't worry about models (oblivious to the underlying implementations)
20. Requirements
• Decouple applications from models and allow them to evolve independently of each other
• The Data Scientist perspective: focus on making accurate predictions
  • Support many models and frameworks
  • Simple deployment and online experimentation
  • (Mostly) oblivious to system performance and workload demands
• The Frontend Dev perspective: focus on building reliable, low-latency applications
  • Provide stable, reliable, performant APIs (need systems that meet their SLOs)
  • Scale system and hardware to meet application demands
  • Oblivious to the implementations of the underlying models
21. Prediction-Serving System: Requirements
• Decouple applications from models and allow them to evolve independently of each other
• The Frontend Dev perspective: focus on building reliable, low-latency applications
  • Provide stable, reliable, performant APIs to meet SLAs
  • Scale system and hardware to meet application demands
  • Oblivious to the implementations of the underlying models
• The Data Scientist perspective: focus on making accurate predictions
  • Support many models and frameworks simultaneously
  • Simple deployment and online experimentation
  • (Mostly) oblivious to system performance and workload demands
22. Clipper
From the Frontend Dev perspective:
[Diagram: Applications call Clipper through an RPC/REST query interface (Predict, Feedback) and a management REST API (create_application(), deploy_model(), replicate_model(), inspect_instance()).]
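As a sketch of what this looks like from an application, the snippet below queries a deployed Clipper application over REST; the port, application name, and payload shape follow the Clipper tutorial but should be checked against the docs for your version:

    import requests

    # Predict: POST a JSON input to the query frontend for the "digits"
    # application (the name and address are assumptions for illustration).
    resp = requests.post(
        "http://localhost:1337/digits/predict",
        json={"input": [0.1, 0.2, 0.3]},
    )
    print(resp.json())  # e.g. {"query_id": ..., "output": ...}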
23. From the Data Scientist perspective
Implement the Model API:

    class ModelContainer:
        def __init__(self, model_data): ...
        def predict_batch(self, inputs): ...
24. From the Data Scientist perspective
Implement the Model API:

    class ModelContainer:
        def __init__(self, model_data): ...
        def predict_batch(self, inputs): ...

• Implemented in many languages
  • Python
  • Java
  • C/C++
  • R
  • …
25. From the Data Scientist perspective
The model implementation is packaged in a container: the Model Container (MC).
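As a concrete illustration of this API, here is a hedged sketch of a container wrapping a scikit-learn model; the two-method shape follows the slides, while the pickle-based serialization is an assumption for illustration:

    import pickle

    class SklearnContainer:
        # Illustrative model container: model_data is assumed to be a
        # pickled scikit-learn estimator.
        def __init__(self, model_data):
            self.model = pickle.loads(model_data)

        def predict_batch(self, inputs):
            # One call for the whole batch lets the framework vectorize.
            return self.model.predict(inputs).tolist()

Because the container only sees opaque model_data and a batch of inputs, the same wrapper pattern works for any framework with a batch predict call.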
31. Clipper
Key Insight:
The challenges of prediction serving can be addressed between end-user applications and machine learning frameworks.
As a result, Clipper is able to:
• hide complexity
  • by providing a common prediction interface to applications
• bound latency and maximize throughput
  • through caching, adaptive batching, and model scale-out
• enable robust online learning and personalization
  • through model selection and ensemble algorithms
without modifying machine learning frameworks or end-user applications.
32. Clipper Decouples Applications and Models
33. Challenges
• Managing heterogeneity everywhere
  • Different types of models (different software, different resource requirements) in a production environment
  • Different application performance requirements (workloads, latencies)
• Scheduling (space-time resource management)
  • Where and when to send prediction queries to models
  • Latency-accuracy tradeoffs
  • Marginal utility of allocating additional resources
• How to use feedback to improve accuracy in real time
34. Clipper Architecture
[Architecture diagram: Applications issue Predict/Observe calls over an RPC/REST interface to Clipper; Clipper's layers communicate over RPC with model containers (MCs) hosting models in frameworks such as Caffe.]
Model Selection Layer: improve accuracy through bandit methods and ensembles, online learning, and personalization.
Model Abstraction Layer: provide a common interface to models while bounding latency and maximizing throughput.
36. Model Abstraction Layer
[Diagram: beneath the Correction Layer (correction policy), the Model Abstraction Layer provides Caching and Adaptive Batching and communicates over RPC with model containers (MCs) for frameworks such as Caffe, providing a common interface to models.]
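A minimal sketch of the caching idea (illustrative only; a production prediction cache must also bound memory and deduplicate in-flight requests):

    from functools import lru_cache

    def make_cached_predict(predict_fn, max_entries=4096):
        # Memoize predictions keyed on the (hashable) input, so repeated
        # queries skip the model container entirely.
        @lru_cache(maxsize=max_entries)
        def cached(x):
            return predict_fn(x)
        return cached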
37. Model Abstraction Layer
[Diagram: same as above.]
Common Interface → Simplifies Deployment:
• Evaluate models using original code & systems
• Models run in separate processes as Docker containers (a hypothetical launch sketch follows)
  • Resource isolation
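To make the container-per-model deployment concrete, a hypothetical sketch using the Docker SDK for Python; the image name is an assumption for illustration, not Clipper's actual image layout:

    import docker  # pip install docker

    # Launch two replicas of a model container image: separate processes
    # give resource isolation, and replication gives scale-out.
    client = docker.from_env()
    for _ in range(2):
        client.containers.run("my-sklearn-container:latest", detach=True)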
38. Model Abstraction Layer
[Diagram: same as above, with additional replicated model containers.]
Common Interface → Simplifies Deployment:
• Evaluate models using original code & systems
• Models run in separate processes as Docker containers
  • Resource isolation
  • Scale-out
Problem: frameworks are optimized for batch processing, not latency.
39. Adaptive Batching to Improve Throughput
A single page load may generate many queries.
Why batching helps:
• Hardware acceleration
• Amortizes system overhead
Optimal batch size depends on:
• hardware configuration
• model and framework
• system load
Clipper's solution: be as slow as allowed (sketched below)…
• Increase the batch size until the latency objective is exceeded (Additive Increase)
• If latency exceeds the SLO, cut the batch size by a fraction (Multiplicative Decrease)
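A minimal sketch of this additive-increase/multiplicative-decrease (AIMD) policy; the constants and the measure_batch_latency hook are illustrative assumptions, not Clipper's actual parameters:

    def aimd_batch_size(measure_batch_latency, slo_seconds,
                        step=1, backoff=0.9, max_batch=1024):
        # Probe upward while measured latency stays under the SLO,
        # then back off multiplicatively once it is exceeded.
        batch_size = 1
        while batch_size < max_batch:
            latency = measure_batch_latency(batch_size)  # time a real batch
            if latency > slo_seconds:
                return max(1, int(batch_size * backoff))  # multiplicative decrease
            batch_size += step  # additive increase
        return max_batch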
43. Model Selection Layer: Bring Learning into the Serving Tier
[Diagram: the Model Selection Layer applies a selection policy for real-time model selection and ensembles on top of slow-changing models in frameworks such as Caffe.]
What can we learn?
• Dynamically weight a mixture of experts
• Select the best model for each user
• Use an ensemble to estimate prediction confidence
• Don't try to retrain models
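The talk does not spell out a specific algorithm here; as one illustrative possibility, a multiplicative-weights sketch in the spirit of the bandit methods mentioned earlier, with hypothetical names and constants:

    import math, random

    class SelectionPolicy:
        # Keep one weight per model; sample models in proportion to weight
        # and down-weight a model whenever feedback shows it incurred loss.
        def __init__(self, n_models, learning_rate=0.1):
            self.weights = [1.0] * n_models
            self.eta = learning_rate

        def select(self):
            r = random.uniform(0, sum(self.weights))
            acc = 0.0
            for i, w in enumerate(self.weights):
                acc += w
                if r <= acc:
                    return i
            return len(self.weights) - 1

        def observe(self, model_idx, loss):
            # loss in [0, 1], derived from application feedback
            self.weights[model_idx] *= math.exp(-self.eta * loss)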
44. Road Map
• Open source on GitHub: https://github.com/ucbrise/clipper
  • Kick the tires and try out our tutorial
• Alpha release in mid-April
  • Focused on reliability and performance for serving single-model applications
  • First-class support for Scikit-Learn and Spark models, and arbitrary Python functions
  • Coordinating the initial set of features with RISE Lab sponsors and collaborators
• After the alpha release
  • Support for selection policies and multi-model applications
  • Model performance monitoring to detect and correct accuracy degradation
  • New task scheduler design to leverage model and resource heterogeneity
"Clipper: A Low-Latency Online Prediction Serving System" [NSDI '17]
https://arxiv.org/abs/1612.03079
crankshaw@cs.berkeley.edu