A talk by Emily Gorcenski and Arif Wider, presented at Strata Data Conference 2019 in London.
Abstract:
It’s already challenging to transition a machine learning model or AI system from the research space to production, and maintaining that system alongside ever-changing data is an even greater challenge. In software engineering, continuous delivery practices have been developed to ensure that developers can adapt, maintain, and update software and systems cheaply and quickly, enabling release cycles on the scale of hours or days instead of weeks or months.
Nevertheless, in the data science world, continuous delivery is rarely applied holistically—due in part to different workflows: data scientists regularly work on whole sets of hypotheses, whereas software engineers work more linearly even when evaluating multiple implementation alternatives. Therefore, existing software engineering practices cannot be applied as is to machine learning projects.
Arif Wider and Emily Gorcenski explore continuous delivery (CD) for AI/ML along with case studies for applying CD principles to data science workflows. Join in to learn how they drew on their expertise to adapt practices and tools to allow for continuous intelligence—the practice of delivering AI applications continuously.
Continuous Intelligence: Keeping your AI Application in Production
2. 5,500 technologists in 40 offices across 14 countries
Partner for technology-driven business transformation
London - Hamburg - Manchester - Berlin - Barcelona - Munich - Cologne - Madrid - Stuttgart
6. “Continuous Delivery is the ability to get changes of all types — including new features, configuration changes, bug fixes and experiments — into production, or into the hands of users, safely and quickly in a sustainable way.” — Jez Humble
12. Continuous Delivery isn’t a technology or a tool; it is a practice and a set of principles. Quality is built into software, and improvement is always possible.
But machine learning systems have unique challenges: unlike deterministic software, it is difficult—or impossible—to understand the behavior of data-driven intelligent systems. This poses a huge challenge when it comes to deploying machine learning systems in accordance with CD principles.
14. Data + Model + Code
Data:
■ Schema
■ Sampling over Time
■ Volume
■ ...
Model:
■ Research, Experiments
■ Training on New Data
■ Performance
■ ...
Code:
■ New Features
■ Bug Fixes
■ Dependencies
■ ...
Icons created by Noura Mbarki and I Putu Kharismayadi from Noun Project
20. Any data-driven service can experience version drift along any of these axes, independently or jointly:
■ Schema
■ Sampling
■ Model
■ Code
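The independence of these axes can be made concrete with a small sketch. The class and field names below are hypothetical illustrations (they are not a tool from the talk): each axis gets its own version identifier, and comparing two deployed versions reveals which axes drifted.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class ServiceVersion:
    """One version identifier per drift axis of a data-driven service."""
    schema: str    # e.g. hash of the data schema definition
    sampling: str  # e.g. identifier of the sampling window or strategy
    model: str     # e.g. trained-model artifact version
    code: str      # e.g. git commit of the serving code

def drifted_axes(old: ServiceVersion, new: ServiceVersion) -> list[str]:
    """Report which axes changed between two deployed versions."""
    return [name for name in ("schema", "sampling", "model", "code")
            if getattr(old, name) != getattr(new, name)]

v1 = ServiceVersion(schema="s1", sampling="2019-Q1", model="m7", code="abc123")
v2 = ServiceVersion(schema="s1", sampling="2019-Q2", model="m8", code="abc123")
print(drifted_axes(v1, v2))  # ['sampling', 'model']
```

Here the sampling window and model moved while schema and code stayed fixed, which is exactly the "independently or jointly" situation the slide describes.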
22. Our code, data, and models should be versioned and shareable without unnecessary work.
What are the challenges?
■ Data and models can be very large
■ Data can vary invisibly
■ Data scientists need to share work, and it must be repeatable
What does the ideal solution look like?
■ Large artifacts stored in arbitrary storage, linked to the source repo
■ Data scientists can encode their work process and repeat it in one step
What solutions are out there now?
■ Storage: S3, HDFS, etc.
■ Git LFS
■ Shell scripts
■ DVC
■ Pachyderm
■ JupyterHub
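Tools like DVC and Git LFS implement the "large artifact in arbitrary storage, small pointer in the source repo" pattern. A minimal stdlib sketch of that idea follows; the file names, manifest format, and `s3://` prefix are invented for illustration, not any tool's actual format.

```python
import hashlib
import json
from pathlib import Path

def file_digest(path: Path) -> str:
    """Content hash of a (potentially large) artifact, read in chunks."""
    h = hashlib.sha256()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    return h.hexdigest()

def write_pointer(artifact: Path, remote_prefix: str, manifest: Path) -> str:
    """Write a small pointer file (committed to git) for a large artifact
    that itself lives in arbitrary storage such as S3 or HDFS."""
    digest = file_digest(artifact)
    manifest.write_text(json.dumps({
        "name": artifact.name,
        "sha256": digest,
        "remote": f"{remote_prefix}/{digest}",
    }, indent=2))
    return digest

# Usage sketch: hash a small data file and record a pointer for it.
data = Path("train.csv")
data.write_bytes(b"id,label\n1,0\n2,1\n")
digest = write_pointer(data, "s3://my-bucket/artifacts", Path("train.csv.ptr"))
```

Because the pointer is tiny and content-addressed, the source repo stays small while any collaborator can fetch the exact artifact version a commit refers to.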
23. We should be able to scale model development to try multiple modeling approaches simultaneously.
What are the challenges?
■ Hyperparameter tuning and model selection are hard
■ Tracking performance depends on other moving parts (e.g., data)
What does the ideal solution look like?
■ Links models to specific training sets and parameter sets
■ Is differentiable
■ Allows visualization of results
What solutions are out there now?
■ DVC
■ MLflow
■ etc.
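The core record an experiment tracker such as MLflow keeps can be sketched in plain Python: each run ties a model to its parameter set and its training-data version, so runs stay comparable as both evolve. All names below are illustrative, not any tracker's real API.

```python
from dataclasses import dataclass, field

@dataclass
class Run:
    model_name: str
    data_version: str  # e.g. content hash of the training set used
    params: dict       # hyperparameters for this run
    metrics: dict = field(default_factory=dict)

class ExperimentLog:
    """Append-only log of runs; real trackers add storage, UIs, and diffing."""
    def __init__(self) -> None:
        self.runs: list[Run] = []

    def log(self, run: Run) -> None:
        self.runs.append(run)

    def best(self, metric: str) -> Run:
        return max(self.runs, key=lambda r: r.metrics[metric])

log = ExperimentLog()
log.log(Run("rf", "d41d8c", {"n_trees": 100}, {"auc": 0.81}))
log.log(Run("rf", "d41d8c", {"n_trees": 500}, {"auc": 0.84}))
print(log.best("auc").params)  # {'n_trees': 500}
```

Because every run carries its `data_version`, a metric improvement can be attributed to the model change rather than to a silent change in the data.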
24. Our code, data, and models should be versioned and shareable without unnecessary work.
What are the challenges?
■ We should be monitoring the health and quality of our production ML system
■ Deployment is usually deeply coupled to code
What does the ideal solution look like?
■ One-click deployment
■ Easy, context-aware logging
■ Performance can be benchmarked against research results
What solutions are out there now?
■ Elasticsearch
■ Kibana
■ Logstash
■ CircleCI
■ GoCD
■ etc.
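One piece of this ideal solution, benchmarking production performance against research results with context-aware logging, can be sketched as a threshold check. The function name, metrics, and tolerance below are invented for illustration; the structured log line is the kind of signal a stack like Elasticsearch/Logstash/Kibana would aggregate.

```python
import logging

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger("model-monitor")

def check_against_benchmark(production_metric: float,
                            research_metric: float,
                            tolerance: float = 0.05) -> bool:
    """Flag when the live model degrades too far below the research benchmark.

    Returns True if the model is healthy; logs the comparison with context
    either way so the signal is easy to find in a log aggregator.
    """
    degradation = research_metric - production_metric
    healthy = degradation <= tolerance
    logger.info("metric=%.3f benchmark=%.3f degradation=%.3f healthy=%s",
                production_metric, research_metric, degradation, healthy)
    return healthy

check_against_benchmark(production_metric=0.78, research_metric=0.85)
```

A check like this can run on a schedule against live traffic, turning the research-time evaluation into a continuous production health signal.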
25. Machine learning may automate things, but it does not automatically generate value.
What are the challenges?
■ It is hard to map data science investment to business value
■ We should be extracting continuous insights from data
What does the ideal solution look like?
■ KPIs can be visualized and linked to work efforts
■ Patterns should emerge
■ New questions arise
What solutions are out there now?
■ BI tools (e.g., Tableau, Chartio, SAS, etc.)
■ Work tracking and documentation tools (GitHub, JIRA, Confluence)
26. Doing CD with Machine Learning is still a hard problem:
■ Model performance assessment and tracking
■ Version control and artifact repositories
■ Model monitoring and observability
■ Discoverable and accessible data
■ Continuous delivery orchestration to combine pipelines
■ Infrastructure for multiple environments and experiments
27. The CD4ML process, end to end:
■ Data Science, Model Building: consumes Training Data and Source Code, produces Model + Parameters
■ Model Evaluation: against Test Data
■ Productionize Model: combines the model with Source Code into Executables
■ Integration Testing
■ Deployment
■ Monitoring: on Production Data
Cross-cutting foundations: CD Tools and Repositories; Discoverable and Accessible Data
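The staged flow above can be sketched as a minimal orchestration: each stage is a function that receives and returns a bag of artifacts, and the orchestrator runs them in order. All stage names, artifact keys, and numbers here are invented; real setups delegate this to CD tools like GoCD.

```python
from typing import Callable

Stage = Callable[[dict], dict]

def build_model(artifacts: dict) -> dict:
    # Data science / model building: training data + source code -> model + params
    artifacts["model"] = {"weights": [0.1, 0.2], "params": {"lr": 0.01}}
    return artifacts

def evaluate_model(artifacts: dict) -> dict:
    # Model evaluation against held-out test data
    artifacts["eval_score"] = 0.9
    return artifacts

def deploy(artifacts: dict) -> dict:
    # Productionize + integration testing + deployment, collapsed into one gate
    artifacts["deployed"] = artifacts["eval_score"] >= 0.8
    return artifacts

def run_pipeline(stages: list[Stage], artifacts: dict) -> dict:
    """Run each stage in order, threading the artifact bag through."""
    for stage in stages:
        artifacts = stage(artifacts)
    return artifacts

result = run_pipeline([build_model, evaluate_model, deploy],
                      {"training_data": "train.csv", "test_data": "test.csv"})
print(result["deployed"])  # True
```

Keeping every intermediate artifact in one explicit bag mirrors why CD4ML needs orchestration: the data, model, and code pipelines produce artifacts that later stages must be able to find and combine.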
28. There are many options for tools and technologies to implement CD4ML.