Metaflow was started at Netflix to answer a pressing business need: how to enable an organization of data scientists, who are not software engineers by training, to build and deploy end-to-end machine learning workflows and applications independently. We wanted to provide the best possible user experience for data scientists, allowing them to focus on the parts they like (modeling with their favorite off-the-shelf libraries) while providing robust built-in solutions for the foundational infrastructure: data, compute, orchestration, and versioning.
Today, open-source Metaflow powers hundreds of business-critical ML projects at Netflix and at other companies in fields ranging from bioinformatics to real estate.
In this talk, you will learn about:
- What to expect from a modern ML infrastructure stack.
- Using Metaflow to boost the productivity of your data science organization, based on lessons learned from Netflix.
- Deployment strategies for a full stack of ML infrastructure that plays nicely with your existing systems and policies.
https://www.aicamp.ai/event/eventdetails/W2021080510
12. The stack of ML infrastructure, from top to bottom: Model Development, Feature Engineering, Model Operations, Architecture, Versioning, Job Scheduler, Compute Resources, Data Warehouse.
(Diagram: the data scientist cares most about the top of the stack, while the bottom of the stack needs the most infrastructure.)
Key insight: ML infrastructure should help with infrastructure rather than ML.
20. Project lifecycle: Baby-steps towards production
Prototype to production: cloud-based workstation, explore with notebooks, create a workflow. Version everything.
22. Project lifecycle: Baby-steps towards production
Prototype to production: cloud-based workstation, explore with notebooks, create a workflow, scale vertically. Version everything.
24. Project lifecycle: Baby-steps towards production
Prototype to production: cloud-based workstation, explore with notebooks, create a workflow, scale vertically, scale horizontally. Version everything.
25. Use many computers
@step
def start(self):
    self.params = list(range(100))
    self.next(self.train, foreach='params')

@resources(memory=128000)
@step
def train(self):
    self.model = train(...)
    self.next(self.join)

@step
def join(self, inputs):
    ...
(Diagram: the start step fans out into many parallel train tasks, which merge back in join.)
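Outside Metaflow, the semantics of the foreach fan-out above can be sketched with the standard library alone; `train_model` here is a hypothetical stand-in for the real training code, and a thread pool stands in for the separate machines Metaflow would use:

```python
from concurrent.futures import ThreadPoolExecutor

def train_model(param):
    # Hypothetical stand-in for the real training code on the slide.
    return {"param": param, "score": param * 2}

def run_fanout():
    params = list(range(100))                 # start: values to fan out over
    with ThreadPoolExecutor() as pool:        # train: one task per value
        models = list(pool.map(train_model, params))
    return max(models, key=lambda m: m["score"])  # join: combine the results
```

In the real flow each `train` task runs as its own container or batch job, so the fan-out scales across many computers rather than threads.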
26. Project lifecycle: Baby-steps towards production
Prototype to production: cloud-based workstation, explore with notebooks, access data quickly, create a workflow, scale vertically, scale horizontally. Version everything.
27. Preprocess data in SQL...
@step
def start(self):
    import spark_client
    SQL = "CREATE TABLE mydata AS SELECT ..."
    self.table_loc = spark_client.query(SQL)
    self.next(self.load_data)

@step
def load_data(self):
    from metaflow import S3
    import pyarrow.parquet as pq
    with S3() as s3:
        parquet = s3.get(self.table_loc)
        self.table = pq.read_table(parquet.path)
(Diagram: a source table maintained by data engineering feeds the data science workflow.)
28. ...and access results efficiently
@step
def load_data(self):
    from metaflow import S3
    with S3() as s3:
        results = s3.get_many(s3_urls)
        mymodel.input_data([res.path for res in results])
(Diagram: many Parquet files are downloaded in parallel over a 10 Gbps link.)
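The effect of `s3.get_many` can be sketched with a thread pool and a stand-in fetch function; `fetch` here is a placeholder for a single S3 download, not the real client:

```python
from concurrent.futures import ThreadPoolExecutor

def fetch(url):
    # Stand-in for one S3 download; returns a local path for the object.
    return "/tmp/cache/" + url.rsplit("/", 1)[-1]

def get_many(urls, workers=16):
    # Fetch many objects concurrently; results come back in input order.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(fetch, urls))
```

The real client issues the downloads in parallel against S3, which is how it can saturate a 10 Gbps link on a large instance.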
29. Project lifecycle: Baby-steps towards production
Prototype to production: cloud-based workstation, explore with notebooks, access data quickly, create a workflow, scale vertically, scale horizontally, scheduled execution. Version everything.
30. Schedule executions with a single command:
# python myflow.py step-functions create
31. Project lifecycle: Baby-steps towards production
Prototype to production: cloud-based workstation, explore with notebooks, access data quickly, create a workflow, scale vertically, scale horizontally, scheduled execution, freeze dependencies. Version everything.
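Freezing dependencies is done declaratively in Metaflow with its conda decorators, which pin the interpreter and libraries per flow or per step; a sketch of the idea (the version numbers here are placeholders, not recommendations):

```python
from metaflow import FlowSpec, step, conda, conda_base

@conda_base(python="3.8.10")  # pin the interpreter for the whole flow
class TrainFlow(FlowSpec):

    @conda(libraries={"scikit-learn": "0.24.1"})  # pin per-step libraries
    @step
    def start(self):
        import sklearn  # resolved from the frozen environment, not the host
        self.next(self.end)

    @step
    def end(self):
        pass
```

Because the environment is resolved and stored up front, a scheduled production run executes with exactly the same dependencies months later.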
33. Project lifecycle: Baby-steps towards production
Prototype to production: cloud-based workstation, explore with notebooks, access data quickly, create a workflow, scale vertically, scale horizontally, scheduled execution, freeze dependencies, A/B test. Version everything.
34. Project: LTV
@project(name='LTV')
class TrainingFlow(FlowSpec):
    @step
    def start(self):
        ...

@project(name='LTV')
class PredictFlow(FlowSpec):
    @step
    def start(self):
        ...
(Diagram: the same TrainingFlow/PredictFlow pair is deployed twice, as Branch A and Branch B.)
Deploy isolated chains of workflows
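Conceptually, @project isolates deployments by namespacing all state by project and branch; a minimal sketch of that idea (not Metaflow internals):

```python
# Sketch: every deployment's state is keyed by (project, branch), so
# Branch A and Branch B run the same flows without ever reading or
# writing each other's artifacts.
store = {}

def publish(project, branch, artifact, value):
    store[(project, branch, artifact)] = value

def read(project, branch, artifact):
    return store[(project, branch, artifact)]

publish("LTV", "branch_a", "model", "model-a")
publish("LTV", "branch_b", "model", "model-b")
```

This is what makes A/B testing safe: a branch can be deployed, evaluated, and torn down without touching the other branch's chain of workflows.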
35. Project lifecycle: Baby-steps towards production
Prototype to production: cloud-based workstation, explore with notebooks, access data quickly, create a workflow, scale vertically, scale horizontally, scheduled execution, freeze dependencies, A/B test. Version everything.
Woohoo! Mission accomplished! 🎉
37. Project lifecycle: Baby-steps towards production (and back)
Prototype to production, and back to debug: cloud-based workstation, explore with notebooks, access data quickly, create a workflow, scale vertically, scale horizontally, scheduled execution, freeze dependencies, A/B test, debug. Version everything.
39. # python myflow.py step-functions create
...but when things go wrong, production issues can be reproduced locally:
# python myflow.py resume --origin-run-id sfn-199874
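The reason resume works is that every step's artifacts are persisted: a resumed run clones the results of steps that already succeeded in the origin run and recomputes only the rest. A conceptual sketch (not Metaflow internals):

```python
def run_flow(steps, origin_artifacts=None):
    # Sketch of resume semantics: reuse the artifacts of steps that already
    # succeeded in the origin run, and execute only the remaining steps.
    artifacts = {}
    for name, fn in steps:
        if origin_artifacts and name in origin_artifacts:
            artifacts[name] = origin_artifacts[name]   # cloned from origin run
        else:
            artifacts[name] = fn(artifacts)            # executed locally
    return artifacts
```

So if a production run failed in `train`, a local resume clones the expensive `start` artifacts and re-executes only the fixed `train` step.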
41. The path to production is incremental and iterative
Prototype to production, and back to debug: cloud-based workstation, explore with notebooks, access data quickly, create a workflow, scale vertically, scale horizontally, scheduled execution, freeze dependencies, A/B test, debug. Version everything.
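"Version everything" underpins every step of this lifecycle: each run gets an id and an immutable snapshot of its artifacts, so any past result can be inspected or compared later. A small sketch of that idea (not Metaflow's actual client or storage layer):

```python
import itertools

class FlowVersions:
    # Sketch of "version everything": every run gets an id and a snapshot
    # of its artifacts, so any past result can be looked up later.
    def __init__(self):
        self._ids = itertools.count(1)
        self.runs = {}

    def record(self, artifacts):
        run_id = next(self._ids)
        self.runs[run_id] = dict(artifacts)  # immutable snapshot per run
        return run_id

    def latest_run(self):
        return self.runs[max(self.runs)]

flow = FlowVersions()
flow.record({"model": "v1", "accuracy": 0.81})
flow.record({"model": "v2", "accuracy": 0.84})
```

In Metaflow itself this lookup is done through the Client API, which exposes flows, runs, and their artifacts by name and id.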
42. Production-readiness is a spectrum
Supporting projects at all phases of their lifecycle:
- Data scientist is prototyping a new idea locally. Failing is ok.
- Promising project. Let's scale it to all data.
- Predictions feed into a decision-support system. Humans can error-correct if needed.
- Project is being A/B tested in production with a failover path.
- Business-critical project: maximum SLA required.
43. All layers of the stack matter
The stack, from top to bottom: Model Development, Feature Engineering, Model Operations, Architecture, Versioning, Job Scheduler, Compute Resources, Data Warehouse.
(Diagram: the data scientist cares most about the top of the stack, while the bottom of the stack needs the most infrastructure.)
There's a natural division of freedom & responsibilities.
46. Curious to learn more?
"Effective Data Science Infrastructure": How to make data scientists more productive. Book in progress:
www.manning.com/books/effective-data-science-infrastructure