With data an increasingly valuable currency and the architecture of reliable, scalable Data Lakes and Lakehouses continuing to mature, machine learning training and deployment techniques must keep pace to realize that value. Reproducibility, efficiency, and governance in training and production environments rest on both point-in-time snapshots of the data and a governing mechanism to regulate, track, and make the best use of associated metadata.
This talk outlines the challenges and importance of building and maintaining reproducible, efficient, and governed machine learning solutions, and proposes solutions built on open source technologies, namely Delta Lake for data versioning and MLflow for efficiency and governance.
2. Mary Grace Moesta
marygrace@databricks.com
• Data Science Consultant at Databricks
• B.S. in Mathematics from Xavier University
• Working in Data Science and the Spark ecosystem for 3+ years
• Focused on customers in Retail and CPG; Databricks Labs AutoML Toolkit contributor
• Detroit, MI
3. Gray Gwizdz
gray@databricks.com
• Resident Solutions Architect at Databricks
• M.S. in Computer Information Systems (Distributed Systems, Biomedical Informatics), Grand Valley State University
• Previously @ Ford Motor Company:
– L&D Specialist
– Lead Software Developer
– Hadoop Systems Architect
6. The smaller the studies conducted in a scientific field, the less likely the research findings are to be true.
Research findings are more likely true in scientific fields that undertake large studies, such as randomized controlled trials in cardiology (several thousand subjects randomized), than in scientific fields with small studies, such as most research of molecular predictors (sample sizes 100-fold smaller).
- Ioannidis
The data is as important as the analysis itself
7. The greater the flexibility in designs, definitions, outcomes, and analytical modes in a scientific field, the less likely the research findings are to be true.
Adherence to common standards is likely to increase the proportion of true findings. The same applies to outcomes. True findings may be more common when outcomes are unequivocal and universally agreed (e.g., death) rather than when multifarious outcomes are devised (e.g., scales for schizophrenia).
- Ioannidis
If you can’t explain the analysis, you can’t trust the results
8. The greater the financial and other interests and prejudices in a scientific field, the less likely the research findings are to be true.
Conflicts of interest are common in biomedical research, and typically they are inadequately and sparsely reported.
- Ioannidis
Unverified scientific claims have real-life consequences
10. Reproducibility: A Systemic Problem in ML
▪ The 2020 State of AI Report finds that only 15% of papers publish their code
▪ This doesn’t even include data!
▪ Submit papers without code to paperswithoutcode.com to give authors the opportunity to respond
11. Four Components of Reproducible ML Applications
Data: changing data means changing results.
Code: the main point of reference for preprocessing steps and model hyperparameters.
Environment: the environment in which your code runs can affect results. “It runs fine on my machine!”
Compute: surrounding the environment is the physical hardware used to support processing.
12. Changing Data Yields Changing Results
Model Version 0: initial version of the data, initial version of the model
Model Version 1: nightly data updates, testing new model hyperparameters
Model Version 2: nightly updates, trying a new model family
Model Version 3: nightly updates, tuning hyperparameters with the new model family
How do you know if your model is performing better because of hyperparameter changes, model changes, or data changes?
13. A new standard for building data lakes
Delta Lake: an opinionated approach to building Data Lakes
■ Adds reliability, quality, and performance to Data Lakes
■ Brings the best of data warehousing and data lakes
■ Based on open source and an open format (Parquet)
15. Fixing Data Versions for Valid Control
▪ Consistency in data ensures that changes you make to the model are the only variables being tested
▪ Write out training / test sets to a persistent location
▪ Use Delta time travel to specify a version and keep the data fixed, as in the sketch below
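A minimal sketch of pinning training data with Delta time travel; the table path and version are illustrative.

from pyspark.sql import SparkSession

# Assumes a Spark session with the Delta Lake extensions configured
spark = SparkSession.builder.getOrCreate()

# Read the exact snapshot of the table that was used for training
train_df = (spark.read
    .format("delta")
    .option("versionAsOf", 12)              # pin to a fixed table version
    .load("/mnt/lake/training_data"))

# Alternatively, pin by timestamp rather than version number
train_df = (spark.read
    .format("delta")
    .option("timestampAsOf", "2021-01-15")
    .load("/mnt/lake/training_data"))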
16. Changing Code Yields Changing Results
(The same model version timeline as slide 12 applies: every version changed the code, whether new hyperparameters or a new model family, at the same time the nightly data changed.)
17. Code
▪ Organize code into pipelines:
▪ Feature engineering
▪ Training
▪ Inference
▪ Track code versions in MLflow
▪ Use git to version control code for full lineage, as in the sketch below
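A minimal sketch of tying a run to its code version with MLflow, assuming a configured tracking server; the commit hash is illustrative. When a run is launched from a git repository (e.g., via the mlflow run CLI), MLflow records the commit automatically.

import mlflow

with mlflow.start_run():
    # MLflow sets this system tag itself when the run is launched from a
    # git repo; it can also be attached explicitly for full lineage
    mlflow.set_tag("mlflow.source.git.commit", "4f2d9c1e")
    mlflow.log_param("pipeline_stage", "training")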
18. Environment
▪ What operating system did I use?
▪ What version of Python did I use?
▪ What version of Pandas did I use?
▪ What version of scikit-learn did I use in conjunction with Pandas?
▪ What environment variables did I set?
19. Environment
▪ Mirror environments across both production and development
▪ MLflow Projects make this really easy with a conda.yaml (a sketch follows):
▪ channels: all URLs under /pkgs are searched for downloads, in addition to conda-forge
▪ dependencies: installed and managed by Conda
▪ pip: installs Python-specific packages that are available through PyPI
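A minimal sketch of the conda.yaml an MLflow project might carry; the channels, packages, and pinned versions are illustrative.

channels:
  - conda-forge            # searched in addition to the default /pkgs URLs
dependencies:              # resolved and installed by Conda
  - python=3.8
  - pandas=1.2.0
  - scikit-learn=0.24.1
  - pip
  - pip:                   # Python-specific packages pulled from PyPI
      - mlflow==1.14.0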
21. Compute
▪ Compute configurations managed by tools like ARM templates, Terraform, CloudFormation, etc.
▪ Log the compute configuration to the MLflow tracking server, as in the sketch below
▪ “Reproduce Run” in Managed MLflow
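A minimal sketch of logging compute configuration to the tracking server; the cluster attributes are illustrative and, on Databricks, could be pulled from the Clusters API instead of hard-coded.

import mlflow

cluster_config = {
    "node_type_id": "i3.xlarge",               # illustrative instance type
    "num_workers": 8,
    "spark_version": "7.3.x-cpu-ml-scala2.12",
}

with mlflow.start_run():
    # Persist the configuration as a JSON artifact alongside the run
    mlflow.log_dict(cluster_config, "cluster_config.json")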
23. Let’s Build a Model
▪ Say we’re using a few data sources from Kaggle to see whether we can predict job changes among Data Scientists, with the addition of Covid-19 data
▪ Is there any relationship between data scientists changing jobs and the number of Covid-19 cases in their city?
24. Always Changing Data
▪ The CDC reports numbers daily on the status of Covid-19 across the country
▪ This data changes daily! A sketch of the resulting versioning follows.
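A sketch of how a daily feed interacts with versioning, assuming each CDC pull is appended to a Delta table; the path and sample row are illustrative.

from delta.tables import DeltaTable
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Stand-in for the day's CDC pull
daily_cases_df = spark.createDataFrame(
    [("2021-01-15", "Detroit", 1234)], ["date", "city", "cases"])

# Each daily append creates a new, recoverable version of the table
(daily_cases_df.write
    .format("delta")
    .mode("append")
    .save("/mnt/lake/covid_cases"))

# Inspect the version history that time travel can target
DeltaTable.forPath(spark, "/mnt/lake/covid_cases").history().show()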
25. Tracking Results with MLflow Tracking
▪ Keep track of all hyperparameters, metrics, and artifacts in the tracking server, as in the sketch below
▪ Limit unintentional run repetition
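A minimal sketch of logging hyperparameters, metrics, and the model artifact in one run, assuming scikit-learn and a configured tracking server; the synthetic dataset stands in for the Kaggle and Covid-19 features.

import mlflow
import mlflow.sklearn
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1000, n_features=20, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

params = {"n_estimators": 100, "max_depth": 5}
with mlflow.start_run():
    mlflow.log_params(params)                          # hyperparameters
    model = RandomForestClassifier(**params).fit(X_train, y_train)
    acc = accuracy_score(y_test, model.predict(X_test))
    mlflow.log_metric("accuracy", acc)                 # metrics
    mlflow.sklearn.log_model(model, "model")           # artifacts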
26. Reproducing Envs is Easy with MLflow
▪ MLflow currently supports 3 project environments:
▪ Conda
▪ Docker container
▪ Current system
(Diagram: an MLflow Project packages data science code in a format that reproduces runs on any platform; from the project, a container image or batch job is built and deployed for real-time or batch scoring. A sketch of launching such a run follows.)
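A minimal sketch of reproducing a packaged run; the repository URL and parameters are illustrative.

import mlflow

# Fetches the project, recreates its Conda environment, and executes its
# entry point with the supplied parameters
mlflow.projects.run(
    uri="https://github.com/example/job-change-model",  # illustrative repo
    parameters={"data_version": "12"},
)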
27. Reproducing Compute Environments
▪ Managed MLflow supports reproducing run compute with a single click
▪ Databricks cluster configurations can be tracked using MLflow
28. Bringing it all Together
▪ ACID compliance and fixed data versions with Delta Lake
▪ MLflow for tracking code, compute, and environment
▪ The four main components of reproducibility addressed with an open source tech stack