Metaflow was started at Netflix to answer a pressing business need: how to enable an organization of data scientists, who are not software engineers by training, to build and deploy end-to-end machine learning workflows and applications independently. We wanted to provide the best possible user experience for data scientists, allowing them to focus on the parts they like (modeling with their favorite off-the-shelf libraries) while providing robust built-in solutions for the foundational infrastructure: data, compute, orchestration, and versioning.
Today, open-source Metaflow powers hundreds of business-critical ML projects at Netflix and at other companies in fields ranging from bioinformatics to real estate.
In this talk, you will learn about:
- What to expect from a modern ML infrastructure stack.
- Using Metaflow to boost the productivity of your data science organization, based on lessons learned from Netflix.
- Deployment strategies for a full stack of ML infrastructure that plays nicely with your existing systems and policies.
https://www.aicamp.ai/event/eventdetails/W2021080510
12. The stack of ML infrastructure, from top to bottom: Model Development, Feature Engineering, Model Operations, Architecture, Versioning, Job Scheduler, Compute Resources, Data Warehouse.
(Diagram: the data scientist cares most about the top of the stack, while the bottom of the stack needs the most infrastructure.)
Key insight: ML infrastructure should help with infrastructure rather than ML.
20. Project lifecycle: Baby-steps towards production
Prototype to production: cloud-based workstation, explore with notebooks, create a workflow. Version everything.
22. Project lifecycle: Baby-steps towards production
Prototype to production: cloud-based workstation, explore with notebooks, create a workflow, scale vertically. Version everything.
24. Project lifecycle: Baby-steps towards production
Prototype to production: cloud-based workstation, explore with notebooks, create a workflow, scale vertically, scale horizontally. Version everything.
25. Use many computers
@step
def start(self):
    self.params = list(range(100))
    self.next(self.train, foreach='params')

@resources(memory=128000)
@step
def train(self):
    self.model = train(...)
    self.next(self.join)

@step
def join(self, inputs):
    ...
(Diagram: the start step fans out into many parallel train tasks, which merge back in join.)
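Outside Metaflow, the semantics of the foreach fan-out above can be sketched with the standard library alone; `train_model` here is a hypothetical stand-in for the real training code, and a thread pool stands in for the separate machines Metaflow would use:

```python
from concurrent.futures import ThreadPoolExecutor

def train_model(param):
    # Hypothetical stand-in for the real training code on the slide.
    return {"param": param, "score": param * 2}

def run_fanout():
    params = list(range(100))                 # start: values to fan out over
    with ThreadPoolExecutor() as pool:        # train: one task per value
        models = list(pool.map(train_model, params))
    return max(models, key=lambda m: m["score"])  # join: combine the results
```

In the real flow each `train` task runs as its own container or batch job, so the fan-out scales across many computers rather than threads.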
26. Project lifecycle: Baby-steps towards production
Prototype to production: cloud-based workstation, explore with notebooks, access data quickly, create a workflow, scale vertically, scale horizontally. Version everything.
27. Preprocess data in SQL...
@step
def start(self):
    import spark_client
    SQL = "CREATE TABLE mydata AS SELECT ..."
    self.table_loc = spark_client.query(SQL)
    self.next(self.load_data)

@step
def load_data(self):
    from metaflow import S3
    import pyarrow.parquet as pq
    with S3() as s3:
        parquet = s3.get(self.table_loc)
        self.table = pq.read_table(parquet.path)
(Diagram: a source table maintained by data engineering feeds the data science workflow.)
28. ...and access results efficiently
@step
def load_data(self):
    from metaflow import S3
    with S3() as s3:
        results = s3.get_many(s3_urls)
        mymodel.input_data([res.path for res in results])
(Diagram: many Parquet files are downloaded in parallel over a 10 Gbps link.)
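The effect of `s3.get_many` can be sketched with a thread pool and a stand-in fetch function; `fetch` here is a placeholder for a single S3 download, not the real client:

```python
from concurrent.futures import ThreadPoolExecutor

def fetch(url):
    # Stand-in for one S3 download; returns a local path for the object.
    return "/tmp/cache/" + url.rsplit("/", 1)[-1]

def get_many(urls, workers=16):
    # Fetch many objects concurrently; results come back in input order.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(fetch, urls))
```

The real client issues the downloads in parallel against S3, which is how it can saturate a 10 Gbps link on a large instance.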
29. Project lifecycle: Baby-steps towards production
Prototype to production: cloud-based workstation, explore with notebooks, access data quickly, create a workflow, scale vertically, scale horizontally, scheduled execution. Version everything.
30. Schedule executions with a single command:
# python myflow.py step-functions create
31. Project lifecycle: Baby-steps towards production
Prototype to production: cloud-based workstation, explore with notebooks, access data quickly, create a workflow, scale vertically, scale horizontally, scheduled execution, freeze dependencies. Version everything.
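Freezing dependencies is done declaratively in Metaflow with its conda decorators, which pin the interpreter and libraries per flow or per step; a sketch of the idea (the version numbers here are placeholders, not recommendations):

```python
from metaflow import FlowSpec, step, conda, conda_base

@conda_base(python="3.8.10")  # pin the interpreter for the whole flow
class TrainFlow(FlowSpec):

    @conda(libraries={"scikit-learn": "0.24.1"})  # pin per-step libraries
    @step
    def start(self):
        import sklearn  # resolved from the frozen environment, not the host
        self.next(self.end)

    @step
    def end(self):
        pass
```

Because the environment is resolved and stored up front, a scheduled production run executes with exactly the same dependencies months later.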
33. Project lifecycle: Baby-steps towards production
Prototype to production: cloud-based workstation, explore with notebooks, access data quickly, create a workflow, scale vertically, scale horizontally, scheduled execution, freeze dependencies, A/B test. Version everything.
34. Project: LTV
@project(name='LTV')
class TrainingFlow(FlowSpec):
    @step
    def start(self):
        ...

@project(name='LTV')
class PredictFlow(FlowSpec):
    @step
    def start(self):
        ...
(Diagram: the same TrainingFlow/PredictFlow pair is deployed twice, as Branch A and Branch B.)
Deploy isolated chains of workflows
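Conceptually, @project isolates deployments by namespacing all state by project and branch; a minimal sketch of that idea (not Metaflow internals):

```python
# Sketch: every deployment's state is keyed by (project, branch), so
# Branch A and Branch B run the same flows without ever reading or
# writing each other's artifacts.
store = {}

def publish(project, branch, artifact, value):
    store[(project, branch, artifact)] = value

def read(project, branch, artifact):
    return store[(project, branch, artifact)]

publish("LTV", "branch_a", "model", "model-a")
publish("LTV", "branch_b", "model", "model-b")
```

This is what makes A/B testing safe: a branch can be deployed, evaluated, and torn down without touching the other branch's chain of workflows.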
35. Project lifecycle: Baby-steps towards production
Prototype to production: cloud-based workstation, explore with notebooks, access data quickly, create a workflow, scale vertically, scale horizontally, scheduled execution, freeze dependencies, A/B test. Version everything.
Woohoo! Mission accomplished! 🎉
37. Project lifecycle: Baby-steps towards production (and back)
Prototype to production, and back to debug: cloud-based workstation, explore with notebooks, access data quickly, create a workflow, scale vertically, scale horizontally, scheduled execution, freeze dependencies, A/B test, debug. Version everything.
39. # python myflow.py step-functions create
...but when things go wrong, production issues can be reproduced locally:
# python myflow.py resume --origin-run-id sfn-199874
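The reason resume works is that every step's artifacts are persisted: a resumed run clones the results of steps that already succeeded in the origin run and recomputes only the rest. A conceptual sketch (not Metaflow internals):

```python
def run_flow(steps, origin_artifacts=None):
    # Sketch of resume semantics: reuse the artifacts of steps that already
    # succeeded in the origin run, and execute only the remaining steps.
    artifacts = {}
    for name, fn in steps:
        if origin_artifacts and name in origin_artifacts:
            artifacts[name] = origin_artifacts[name]   # cloned from origin run
        else:
            artifacts[name] = fn(artifacts)            # executed locally
    return artifacts
```

So if a production run failed in `train`, a local resume clones the expensive `start` artifacts and re-executes only the fixed `train` step.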
41. The path to production is incremental and iterative
Prototype to production, and back to debug: cloud-based workstation, explore with notebooks, access data quickly, create a workflow, scale vertically, scale horizontally, scheduled execution, freeze dependencies, A/B test, debug. Version everything.
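"Version everything" underpins every step of this lifecycle: each run gets an id and an immutable snapshot of its artifacts, so any past result can be inspected or compared later. A small sketch of that idea (not Metaflow's actual client or storage layer):

```python
import itertools

class FlowVersions:
    # Sketch of "version everything": every run gets an id and a snapshot
    # of its artifacts, so any past result can be looked up later.
    def __init__(self):
        self._ids = itertools.count(1)
        self.runs = {}

    def record(self, artifacts):
        run_id = next(self._ids)
        self.runs[run_id] = dict(artifacts)  # immutable snapshot per run
        return run_id

    def latest_run(self):
        return self.runs[max(self.runs)]

flow = FlowVersions()
flow.record({"model": "v1", "accuracy": 0.81})
flow.record({"model": "v2", "accuracy": 0.84})
```

In Metaflow itself this lookup is done through the Client API, which exposes flows, runs, and their artifacts by name and id.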
42. Production-readiness is a spectrum
Supporting projects at all phases of their lifecycle:
- Data scientist is prototyping a new idea locally. Failing is ok.
- Promising project. Let's scale it to all data.
- Predictions feed into a decision-support system. Humans can error-correct if needed.
- Project is being A/B tested in production with a failover path.
- Business-critical project: maximum SLA required.
43. All layers of the stack matter
The stack, from top to bottom: Model Development, Feature Engineering, Model Operations, Architecture, Versioning, Job Scheduler, Compute Resources, Data Warehouse.
(Diagram: the data scientist cares most about the top of the stack, while the bottom of the stack needs the most infrastructure.)
There's a natural division of freedom & responsibilities.
46. Curious to learn more?
"Effective Data Science Infrastructure": How to make data scientists more productive. Book in progress:
www.manning.com/books/effective-data-science-infrastructure