This document discusses using Apache Flink to operationalize a streaming machine learning lifecycle. It describes Comcast's need to improve customer experiences through predictive analytics over streaming data. Flink is used to orchestrate feature engineering, model training/evaluation, and real-time predictions. Key aspects of the solution include a metadata-driven pipeline, automated deployments, consistent feature stores for training and prediction, and monitoring of multiple models. The document outlines the various components of the ML lifecycle and pipeline implemented on Flink and discusses next steps around UI/UX, continuous monitoring, and supporting multiple feature stores.
Flink Forward San Francisco 2018: Dave Torok & Sameer Wadkar - "Embedding Flink Throughout an Operationalized Streaming ML Lifecycle"
1. EMBEDDING FLINK THROUGHOUT
AN OPERATIONALIZED STREAMING
ML LIFECYCLE
Dave Torok, Senior Principal Architect
Sameer Wadkar, Senior Principal Architect
10 April, 2018
2.
INTRODUCTION AND BACKGROUND
CUSTOMER EXPERIENCE TEAM
27 MILLION CUSTOMERS (HIGH SPEED DATA, VIDEO,
VOICE, HOME SECURITY, MOBILE)
INGESTING ABOUT 2 BILLION EVENTS / MONTH
SOME HIGH-VOLUME MACHINE-GENERATED EVENTS
TYPICAL STREAMING DATA ARCHITECTURE
DATA ETL, LAND IN A TIME SERIES DATA LAKE
GREW FROM A FEW DOZEN TO 150+ DATA SOURCES
/ FEEDS IN ABOUT A YEAR
Comcast collects, stores, and uses all data in accordance with our privacy disclosures to users and applicable laws.
3.
BUSINESS PROBLEM
INCREASE POSITIVE CUSTOMER EXPERIENCES
RESOLVE POTENTIAL ISSUES CORRECTLY AND
QUICKLY
PREDICT AND DIAGNOSE SERVICE TROUBLE
ACROSS MULTIPLE KNOWLEDGE DOMAINS
REDUCE COSTS THROUGH EARLIER RESOLUTION
AND BY REDUCING AVOIDABLE TECHNICIAN VISITS
4.
TECHNICAL PROBLEM
MULTIPLE PROGRAMMING AND DATA SCIENCE
ENVIRONMENTS
WIDESPREAD AND DISCORDANT DATA SOURCES
THE “DATA PLANE” PROBLEM: COMBINING DATA AT
REST AND DATA IN MOTION
ML VERSIONING: DATA, CODE, FEATURES, MODELS
6.
MACHINE LEARNING LIFECYCLE
USE CASE DEFINITION
FEATURE EXPLORATION / ENGINEERING
MODEL TRAINING
MODEL EVALUATION
MODEL ARTIFACT DELIVERY (POJO/DOCKER)
MODEL SELECTION
MODEL OPERATIONALIZATION
MODEL PERFORMANCE MONITORING ON LIVE DATA
(A/B & MULTIVARIATE TESTING)
PUSH MODEL TO PRODUCTION
RETRAIN MODEL ON NEWER DATA
7.
EXAMPLE NEAR REAL TIME
PREDICTION USE CASE
CUSTOMER RUNS A “SPEED TEST”
EVENT TRIGGERS A PREDICTION FLOW
ENRICH WITH NETWORK HEALTH AND OTHER
INDICATORS
EXECUTE ML MODEL
PREDICT WHETHER IT IS A WIFI, MODEM, OR
NETWORK ISSUE
[Flow diagram: speed-test Event → Detect (Slow Speed?) → Gather Data (Network Diagnostic Services, Additional Context Services) → Enrich → Run Prediction (ML Model) → Act / Notify → Engage Customer]
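A minimal sketch of that flow as a single Flink job; SpeedTestEvent, speedTestSource, NetworkHealthClient, WifiModel, and NotificationSink are hypothetical stand-ins for the services in the diagram:

```java
// Inside main(String[] args) throws Exception.
// Detect -> Gather Data / Enrich -> Predict -> Act / Notify, end to end.
StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

DataStream<SpeedTestEvent> events = env.addSource(speedTestSource);

events.filter(e -> e.isSlow())                    // Detect: slow speed?
      .map(e -> NetworkHealthClient.enrich(e))    // Gather data; enrich with network health
      .map(e -> WifiModel.predict(e))             // Predict: WiFi, modem, or network issue
      .addSink(new NotificationSink());           // Act / notify, engage customer

env.execute("speed-test-prediction");
```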
8.
ML PIPELINE ARCHITECTURE PRINCIPLES
• Metadata Driven: Feature/model definition, versioning, feature assembly, model deployment, and model monitoring are metadata driven
• Automation: Orchestrated deployment for new features and models
• Rapid Onboarding: Portal for model and feature management as well as model deployment
• Data Consistency: The feature store enforces a consistent data pipeline, ensuring that the data used for training is functionally identical to the data used for predictions
• Monitoring and Metrics: Ability to execute and monitor multiple models in production to enable real-time, metrics-driven model selection
• Iterative/Consistent Model Development: Multiple versions of a model can be developed iteratively while consuming from a consistent dataset (the feature store), enabling A/B and multivariate testing
9.
ML PIPELINE – ROLES & WORKFLOW
Roles: Business User, Data Scientist, ML Operations
Phases: Inception → Exploration → Model Development → Candidate Model Selection → Model Operationalization → Model Evaluation → Go Live
• Define Use Case
• Explore Features: create and publish new features
• Create & Validate Models (Model Review, Iterate)
• Model Selection
• Model Operationalization: define online feature assembly, define pipeline to collect outcomes, model deployment and monitoring
• Evaluate Live Model Performance
• Go Live with Selected Models
• Monitor Live Models; collect new data & retrain (Iterate)
10.
WHY APACHE FLINK?
UTILIZED AS ORCHESTRATION & ETL ENGINE
FIRST-CLASS STREAMING MODEL
PERFORMANCE
RICH STATEFUL SEMANTICS
TEAM EXPERIENCE
OPEN SOURCE
GROWING COMMUNITY
Apache®, Apache Flink®, and the squirrel logo are either
registered trademarks or trademarks of the Apache
Software Foundation in the United States and/or other
countries.
11.
THE “DATA PLANE” PROBLEM
[Diagram: the Streaming Compute Pipeline holds the MODEL and streaming state (sums, averages, time buckets) computed over stream data; a data-file abstraction layer lets the pipeline QUERY data sets at rest in AWS S3, HDFS, databases, and enterprise services.]
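One common way to bridge the two planes is to enrich in-motion events with lookups against data at rest behind a single abstraction. A minimal sketch, assuming hypothetical Event and DataFileStore types (the latter standing in for the data-file abstraction over S3, HDFS, and databases):

```java
import org.apache.flink.api.common.functions.RichMapFunction;
import org.apache.flink.configuration.Configuration;

// Enrich each streaming event with a query against data sets at rest.
public class EnrichWithDataAtRest extends RichMapFunction<Event, Event> {
    private transient DataFileStore store;

    @Override
    public void open(Configuration conf) {
        store = DataFileStore.connect();  // one connection per task slot
    }

    @Override
    public Event map(Event e) throws Exception {
        e.setHistory(store.query(e.getAccountNumber()));  // data at rest
        return e;                                         // back into the stream
    }
}
```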
12.
ML MODEL EXECUTION
TRIGGER → FEATURE ASSEMBLY (Model Metadata + Online Feature Store) → MODEL EXECUTION
1. Payload only contains model name & account number
2. Model metadata informs which features are needed for a model
3. Pull required features by account number
4. Pass full set of assembled features for model execution
5. Prediction
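A minimal sketch of steps 1–5; MetadataStore, OnlineFeatureStore, ModelMetadata, and Prediction are hypothetical stand-ins for the pipeline's metadata-driven assembly:

```java
import java.util.HashMap;
import java.util.Map;

// The trigger carries only a model name and account number; everything
// else comes from model metadata and the online feature store.
public class ModelExecutor {
    private final MetadataStore metadata;          // model -> required features
    private final OnlineFeatureStore featureStore; // current values by key

    public ModelExecutor(MetadataStore m, OnlineFeatureStore f) {
        this.metadata = m;
        this.featureStore = f;
    }

    public Prediction execute(String modelName, String accountNumber) {
        ModelMetadata meta = metadata.forModel(modelName);     // step 2
        Map<String, Object> features = new HashMap<>();
        for (String name : meta.requiredFeatures()) {          // step 3
            features.put(name, featureStore.get(accountNumber, name));
        }
        return meta.model().predict(features);                 // steps 4-5
    }
}
```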
13.
SOLUTION
INITIATE MODEL PREDICTION REQUEST (ASYNCHRONOUSLY)
REQUESTING APPLICATION → TRIGGER EVENT LISTENER → REST SERVICE
Inputs to REST Service:
1. Model Name
2. Account No
Select model based on rules (on-demand/streaming)
Request initiated asynchronously by pushing it to a queue/topic
14.
SOLUTION
ASSEMBLE FEATURES FOR A GIVEN MODEL
Happy Path for Model Execution – All Features Current
[Flow: the REQUESTING APPLICATION pushes a request; Feature Assembly listens, consulting Model/Feature Metadata and the Online Feature Store through the Feature Store API. Are all features current? Yes → Model Execution → Prediction Sink → Prediction/Outcome Store; current feature values also feed the Customer Context flow.]
• Assemble features based on account number as model input
• Collect predictions and outcomes to create datasets for model refinement
• Store current values of features for interactive query access
15.
SOLUTION (CONT.)
ASSEMBLE FEATURES FOR A GIVEN MODEL
Exception Path – Some/All Features Are Not Current
[Flow: Feature Assembly consults Model/Feature Metadata and the Online Feature Store through the Feature Store API. Are all features current? No → the Feature Creation Pipeline refreshes the History Feature Store and the Online Feature Store, then feature assembly resumes on the happy path.]
The History Feature Store is an append store (Ex. S3, HDFS, Redshift) for use by data scientists for model training.
16.
SOLUTION – DIGGING DEEPER
[Flow: Model Execution Requests and Request Features are keyed by Request Id (KeyBy Request Id) into a Global Window with one pane per Request Id; Model Metadata arrives on a Connected Stream.]
• Custom Trigger: arrival of each feature triggers the model execution check (onElement); a periodic event-time timer checks whether the model TTL has expired (onEventTime)
• Apply Function: execute the model or expire the request
• Custom Evictor: evict the pane if the model executed or the model request expired
• Side Outputs
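A minimal sketch of the custom trigger under these assumptions: each FeatureEvent (hypothetical type) carries its request time, the model TTL, and the required feature count from the model's metadata, and a partitioned count tracks arrivals per pane:

```java
import org.apache.flink.api.common.state.ValueState;
import org.apache.flink.api.common.state.ValueStateDescriptor;
import org.apache.flink.streaming.api.windowing.triggers.Trigger;
import org.apache.flink.streaming.api.windowing.triggers.TriggerResult;
import org.apache.flink.streaming.api.windowing.windows.GlobalWindow;

// Fires the pane when every feature required by the model's metadata
// has arrived (onElement), or when the model TTL expires (onEventTime).
public class ModelReadyTrigger extends Trigger<FeatureEvent, GlobalWindow> {

    private final ValueStateDescriptor<Integer> seenDesc =
            new ValueStateDescriptor<>("features-seen", Integer.class);

    @Override
    public TriggerResult onElement(FeatureEvent e, long ts, GlobalWindow w,
                                   TriggerContext ctx) throws Exception {
        ctx.registerEventTimeTimer(e.requestTime + e.modelTtlMillis); // TTL check
        ValueState<Integer> seen = ctx.getPartitionedState(seenDesc);
        int count = (seen.value() == null) ? 1 : seen.value() + 1;
        seen.update(count);
        return count >= e.requiredFeatureCount   // metadata-driven completeness
                ? TriggerResult.FIRE : TriggerResult.CONTINUE;
    }

    @Override
    public TriggerResult onEventTime(long time, GlobalWindow w, TriggerContext ctx) {
        return TriggerResult.FIRE;  // TTL expired: fire so the request can be expired
    }

    @Override
    public TriggerResult onProcessingTime(long time, GlobalWindow w, TriggerContext ctx) {
        return TriggerResult.CONTINUE;
    }

    @Override
    public void clear(GlobalWindow w, TriggerContext ctx) {
        ctx.getPartitionedState(seenDesc).clear();
    }
}
```

The stream is then keyed by request id over GlobalWindows.create(); the custom evictor drops the pane once the model has executed or the request has expired, and the apply function either runs the model or routes the expired request to a side output.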
17.
FEATURE STORE
TWO TYPES OF FEATURE STORES:
• Online Feature Store – Current values by key (Key/Value Store)
• History Feature Store – Append features as they are collected (Ex. HDFS, S3)
MULTIPLE ONLINE FEATURE STORES BASED ON SLAs
• A feature can be stored in multiple online feature stores to support model-specific SLAs.
TYPES OF ONLINE FEATURE STORE
• PostgreSQL (AWS RDS, Aurora DB) for low-volume on-demand model execution requests
• HBase, DynamoDB for high-volume feature ingest
• Flink Queryable State for high-volume ingest, high-velocity model execution requests
[Diagram: the Feature Creation Pipeline appends to the History Feature Store (model training phase) and overwrites the Online Feature Store (prediction phase).]
18.
FEATURE CREATION PIPELINES
FLINK AS REAL-TIME DATA STREAM CONSUMER
CUSTOM FLOWS FOR AGGREGATION FEATURES
SAME DATA FLOWS FOR PREDICTION (STREAMING)
& TRAINING (BATCH)
• PRODUCED FEATURES UPDATE ONLINE FEATURE
STORE (PREDICTION PHASE)
• PRODUCED FEATURES APPENDED TO S3 OR
HDFS FOR USE BY DATA SCIENTISTS (TRAINING
PHASE)
[Diagram: raw data streams feed aggregation features; on-demand feature requests go through an external REST API to produce on-demand features; both paths push to the feature store.]
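A minimal sketch of one creation flow feeding both stores, assuming hypothetical Feature, AggregationFeatureFunction, and OnlineFeatureStoreSink types; the history path uses Flink's bucketing file sink:

```java
// Inside the job's main(), after environment setup.
// One feature-creation flow, two destinations: overwrite the online
// store for predictions, append to S3/HDFS for training.
DataStream<Feature> features = rawEvents
        .keyBy(e -> e.getAccountNumber())
        .process(new AggregationFeatureFunction());  // custom aggregation flow

features.addSink(new OnlineFeatureStoreSink());                        // prediction phase (overwrite)
features.addSink(new BucketingSink<Feature>("s3://feature-history"));  // training phase (append)
```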
19.
STREAMING FEATURE EXAMPLE
KAFKA ERROR STREAM (~150 / SECOND)
DETECT ACCOUNTS WITH SIGNAL ERROR WITH
COUNT > 2000 IN TRAILING 24 HOURS
SOLUTION:
AVRO DESERIALIZER WITH KEY = ACCOUNT
“24 HOUR ROLLING” HASH STRUCTURE AS STATE
FILTER FUNCTION WITH SIGNAL THRESHOLD
Flink Features Used:
Kafka Source
Keyed Stream
Value State
Sliding Window
Filter Function
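A minimal sketch using the sliding-window formulation listed under "Flink Features Used"; ErrorEvent (the Avro type), avroSchema, kafkaProps, and FeatureStoreSink are hypothetical stand-ins, and the deck's rolling hash kept as value state is an alternative formulation of the same count:

```java
// Inside the job's main(); assumes event time with timestamps and
// watermarks assigned upstream.
DataStream<ErrorEvent> errors = env.addSource(
        new FlinkKafkaConsumer011<>("error-stream", avroSchema, kafkaProps));

errors.map(e -> Tuple2.of(e.getAccountNumber(), 1L))
      .returns(Types.TUPLE(Types.STRING, Types.LONG))
      .keyBy(t -> t.f0)                              // keyed by account
      .window(SlidingEventTimeWindows.of(Time.hours(24), Time.minutes(15)))
      .sum(1)                                        // trailing 24-hour error count
      .filter(t -> t.f1 > 2000)                      // signal threshold
      .addSink(new FeatureStoreSink());              // update the feature store
```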
20.
ON-DEMAND FEATURE EXAMPLE
PREMISE HEALTH TEST
• DIAGNOSTIC TELEMETRY INFORMATION FOR
EACH DEVICE FOR A GIVEN CUSTOMER
• EXPENSIVE - ONLY REQUESTED ON DEMAND
• MODELS USING SUCH A FEATURE WILL EXTRACT
SUB-ELEMENTS USING SCRIPTING CAPABILITIES
(MODEL METADATA & FEATURE ENGINEERING)
• MODEL METADATA WILL CONTAIN TTL
ATTRIBUTE FOR SUCH FEATURES INDICATING
THEIR TOLERANCE FOR STALE DATA
SOLUTION:
MAKE AN ON-DEMAND REQUEST FOR PHT
TELEMETRY DATA IF IT IS STALE OR ABSENT
FOR A GIVEN ACCOUNT
Flink Features Used:
Async Operator
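A minimal sketch of that solution with Flink's async I/O operator, assuming a hypothetical non-blocking PhtClient and TelemetryRequest/TelemetryResult types:

```java
import java.util.Collections;
import org.apache.flink.configuration.Configuration;
import org.apache.flink.streaming.api.functions.async.ResultFuture;
import org.apache.flink.streaming.api.functions.async.RichAsyncFunction;

// Fetch premise health test telemetry without blocking the pipeline.
public class PhtLookup extends RichAsyncFunction<TelemetryRequest, TelemetryResult> {
    private transient PhtClient client;

    @Override
    public void open(Configuration conf) {
        client = new PhtClient();  // non-blocking HTTP client
    }

    @Override
    public void asyncInvoke(TelemetryRequest req, ResultFuture<TelemetryResult> out) {
        client.fetch(req.getAccountNumber())  // CompletableFuture<TelemetryResult>
              .thenAccept(r -> out.complete(Collections.singleton(r)));
    }
}
```

Wired in with AsyncDataStream.unorderedWait(requests.filter(r -> r.isStaleOrAbsent()), new PhtLookup(), 30, TimeUnit.SECONDS, 100), so only stale or absent features (per the TTL in model metadata) trigger the call-out.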
21.
ML PREDICTION COMPONENT
• REST SERVICE
• H2O.ai Model Container (POJO)
• Python based service running specialized ML Models
• Any stateless REST service
• FLINK MAP OPERATOR
• H2O.ai Model Container (POJO) wrapped in a Flink
Map Operator
• Possibly support native calls via Flink Map Operators running specialized models (Ex. TensorFlow GPU-based predictions)
• Same Code Base
• Multiple Deployment Models
• REST – Low velocity, on-
demand model invocations
• Map Operators – High
velocity, streaming model
invocations
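A minimal sketch of the map-operator deployment; the hex.genmodel wrapper API is H2O's, while FeatureVector, Prediction, and the generated SpeedIssueModel class name are hypothetical:

```java
import hex.genmodel.easy.EasyPredictModelWrapper;
import hex.genmodel.easy.RowData;
import hex.genmodel.easy.prediction.BinomialModelPrediction;
import org.apache.flink.api.common.functions.RichMapFunction;
import org.apache.flink.configuration.Configuration;

// H2O.ai POJO model wrapped in a Flink map operator for high-velocity,
// streaming model invocations.
public class H2oScoringMap extends RichMapFunction<FeatureVector, Prediction> {
    private transient EasyPredictModelWrapper model;

    @Override
    public void open(Configuration conf) {
        model = new EasyPredictModelWrapper(new SpeedIssueModel());  // exported POJO
    }

    @Override
    public Prediction map(FeatureVector v) throws Exception {
        RowData row = new RowData();
        v.asMap().forEach(row::put);             // assembled features in, by name
        BinomialModelPrediction p = model.predictBinomial(row);
        return new Prediction(v.getRequestId(), p.label, p.classProbabilities[1]);
    }
}
```

The same wrapped POJO can sit behind the REST service for low-velocity, on-demand invocations, which is what keeps the two deployment models on one code base.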
22.
VERSIONING AND DEVOPS
EVERYTHING IS VERSIONED
• Feature/Model Metadata
• Feature Data & Model Execution environments
• Training, Validation datasets are versioned
• Feature creation pipelines are versioned
VERSIONING ALLOWS PROVENANCE,
AUDITABILITY, AND REPEATABILITY OF EVERY
PREDICTION
23.
FEATURES OF THE ML PIPELINE
CLOUD AGNOSTIC
• Integrates with the AWS Cloud but not
dependent on it
• Framework should be able to work in a
non-AWS distributed environment with
configuration (not code) changes
TRACEABILITY & REPEATABILITY &
AUDITABILITY
• Model to be traced back to business use cases
• Full traceability from raw data to feature
engineering to predictions
• “Everything Versioned” enables
repeatability
CI/CD SUPPORT
• Code, Metadata (Hyper-Parameters) and
Data (Training/Validation Data) are
versioned. Deployable artifacts to
integrate with CI/CD Pipeline
24.
FEATURES OF THE ML PIPELINE (CONT.)
MULTI-DEPLOYMENT OPTIONS
• Supports throughput vs. latency tradeoffs: process in stream/batch/on-demand
• Allows multiple versions of the
same/different models to be compared
with one another on live data
• A/B testing & Multivariate testing
• Live but dark deployments
• Supports integration of outcomes with
predictions to measure production
performance & support continuous model
re-training
PLUGGABLE (DATA AND COMPUTE)
ARCHITECTURE
• De-coupled architecture based on
message driven inter-component
communication.
• Failure of an isolated component does
not fail the entire platform
• Asynchronous behavior
• Micro-Services based design which
supports independent deployment of
components
25.
NEXT STEPS AND FUTURE WORK
GENERATING “FLINK NATIVE” FEATURE FLOWS
• Evaluating Uber’s “AthenaX” Project / Similar Approaches
UI PORTAL FOR
• MODEL / FEATURE AND METADATA MANAGEMENT
• CONTAINERIZATION SUPPORT FOR MODEL
EXECUTION PHASE
• WORKBENCH FOR DATA SCIENTIST
• CONTINUOUS MODEL MONITORING
QUERYABLE STATE
AUTOMATING THE RETRAINING PROCESS
SUPPORT FOR MULTIPLE/PLUGGABLE FEATURE
STORES (SLA DRIVEN)
26.
SUMMARY & LESSONS LEARNED
FLINK IS HELPING ACHIEVE OUR BUSINESS GOALS
• Near-real-time streaming context
• Container for ML Prediction Pipeline
• Stateful Feature Generation
• Multiple Solutions to the “Data Plane” Problem
• Natural Asynchronous support
• Rich windowing semantics support various aspects of
our ML Pipeline (Training/Prediction/ETL)
• Connected Streams simplify pushing metadata updates
(reduced querying load with better performance)
• Queryable State is a natural fit for high velocity and high
volume data being pushed to the online feature store
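For that last point, a minimal sketch of serving the online feature store straight from the job with queryable state, assuming a hypothetical Feature type and an upstream featureStream:

```java
// Expose the latest feature value per account as Flink queryable state,
// so high-velocity model execution requests can read current values
// without a round trip to an external store.
ValueStateDescriptor<Feature> latest =
        new ValueStateDescriptor<>("latest-feature", Feature.class);

featureStream
        .keyBy(f -> f.getAccountNumber())
        .asQueryableState("online-feature-store", latest);

// External readers fetch current values by account number with
// QueryableStateClient.getKvState(...).
```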