María de la Fuente (Solutions Architect Manager for IMEA) @ Databricks
While most companies understand the value that data can create and are adopting an AI strategy, only 13% of data science projects make it to production successfully.
Beyond the well-known skills gap in the market, we need to level up our end-to-end approach and cover every aspect of working with AI.
In this session, we will discuss the main obstacles to overcome and how to avoid the major pitfalls, so that our data science journey becomes a success.
Why do the majority of Data Science projects never make it to production?
1. Why do the majority of Data Science projects never make it to production?
2. María de la Fuente
Solutions Architect Manager – Israel, Middle East & Africa
@Databricks
María de la Fuente | LinkedIn
3. AI is poised to change the world
$3.9T: projected business value creation by AI in 2022
And most leaders agree: 83% of CEOs say AI is a strategic priority
But AI doesn't make it out the door at most companies: 87% of data science initiatives never make it to production
4. Q: Why are these projects struggling?
A: Mainly because of reliability and performance issues, and the lack of an end-to-end ML tracking mechanism.
7. A typical Machine Learning workflow
Data Ingestion → Data Preparation → Feature Engineering → Model Training → Model Evaluation → Model Tuning → Model Deployment → Model Consumption
(built by Data & ML Engineers, with the deployed model consumed by Users)
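The stages above can be sketched end to end. A minimal pure-Python illustration on synthetic data (the toy dataset, feature, and model below are assumptions for illustration, not part of the deck):

```python
import random
import statistics

# --- Data Ingestion: synthetic raw records (stand-in for a real source) ---
random.seed(0)
raw = [{"size": random.uniform(20, 200)} for _ in range(200)]
for r in raw:
    r["price"] = 3.0 * r["size"] + random.gauss(0, 10)

# --- Data Preparation: train/test split ---
split = int(0.8 * len(raw))
train, test = raw[:split], raw[split:]

# --- Feature Engineering: center the feature on the training mean ---
mean_size = statistics.mean(r["size"] for r in train)
def featurize(r):
    return r["size"] - mean_size

# --- Model Training: one-feature least-squares fit ---
xs = [featurize(r) for r in train]
ys = [r["price"] for r in train]
y_mean = statistics.mean(ys)
slope = sum(x * (y - y_mean) for x, y in zip(xs, ys)) / sum(x * x for x in xs)
intercept = y_mean  # with a centered feature, the intercept is the mean of y

def predict(r):
    return intercept + slope * featurize(r)

# --- Model Evaluation: mean absolute error on held-out data ---
mae = statistics.mean(abs(predict(r) - r["price"]) for r in test)
print(f"slope={slope:.2f}, MAE={mae:.2f}")
```

In a real project each stage would be a separate, versioned pipeline step; collapsing them into one script is exactly the ad hoc pattern later slides warn about.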
10. Data Lake: the data is not ready for data science & ML
The majority of these projects are failing due to unreliable data!
Data Science & ML use cases:
• Recommendation Engines
• Risk & Fraud Detection
• IoT & Predictive Maintenance
• Genomics & DNA Sequencing
11. ML Lifecycle is Manual, Inconsistent and Disconnected
Prep Data
● Low-level integrations for Data and ML
● Difficult to track the data used for a model
Build Model
● Ad hoc approach to tracking experiments
● Very hard to reproduce experiments
Deploy Model
● Multiple tightly coupled deployment options
● Different monitoring approach for each framework
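What "ad hoc, hard-to-reproduce experiments" means in practice, and the minimum fix: record every run's parameters and metrics in a queryable store. This is a toy stand-in for a real tracking tool such as MLflow (the store, run-id scheme, and parameter names below are illustrative assumptions):

```python
import hashlib
import json
import time

def log_run(params, metrics, store):
    """Record one training run so it can be found and reproduced later."""
    run = {
        # Same params -> same id, so duplicate experiments are detectable.
        "run_id": hashlib.sha1(
            json.dumps(params, sort_keys=True).encode()).hexdigest()[:8],
        "timestamp": time.time(),
        "params": params,
        "metrics": metrics,
    }
    store.append(run)
    return run["run_id"]

store = []
log_run({"lr": 0.1, "depth": 3}, {"auc": 0.81}, store)
log_run({"lr": 0.01, "depth": 5}, {"auc": 0.84}, store)

# The best run can be looked up instead of remembered.
best = max(store, key=lambda r: r["metrics"]["auc"])
print(best["params"])  # -> {'lr': 0.01, 'depth': 5}
```

A real tracking server adds artifacts, code versions, and data lineage on top of this, which is what makes experiments reproducible across a team.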
12. Nothing lasts forever
"Change is the only constant in life" – Heraclitus, Greek philosopher
One of the main assumptions when creating a model is that future data will be similar to the past data used to build the model.
HOWEVER, models exist in a dynamic, continually changing environment; when that environment changes, the performance of the model changes too.
13. This means model drift is expected!
ML models will lose their predictive power over time.
CONCEPT DRIFT: the statistical properties of the dependent (target) variable change
DATA DRIFT: the statistical properties of the independent (input) variables change
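Data drift on a single feature can be detected by comparing its live distribution against the training distribution. A minimal sketch using a hand-rolled two-sample Kolmogorov-Smirnov statistic (the synthetic feature and the 0.1 alert threshold are assumptions; in practice the threshold is tuned per feature):

```python
import bisect
import random

def ks_statistic(sample_a, sample_b):
    """Two-sample Kolmogorov-Smirnov statistic: the largest gap between
    the empirical CDFs of the two samples (0 = identical distributions)."""
    a, b = sorted(sample_a), sorted(sample_b)
    d = 0.0
    for v in a + b:
        cdf_a = bisect.bisect_right(a, v) / len(a)
        cdf_b = bisect.bisect_right(b, v) / len(b)
        d = max(d, abs(cdf_a - cdf_b))
    return d

random.seed(42)
training = [random.gauss(0, 1) for _ in range(1000)]      # feature at training time
production = [random.gauss(0.5, 1) for _ in range(1000)]  # same feature, shifted in prod

drift = ks_statistic(training, production)
no_drift = ks_statistic(training, [random.gauss(0, 1) for _ in range(1000)])
print(f"shifted: {drift:.3f}, stable: {no_drift:.3f}")
if drift > 0.1:  # alert threshold is an assumption; tune per feature
    print("data drift detected -> consider retraining")
```

Concept drift is harder to catch this way because it needs ground-truth labels; monitoring the model's live accuracy against its offline baseline is the usual complement.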
14. ML Lifecycle and Challenges
Core pipeline (batch + real-time): Raw Data → ETL (Delta) → Featurize → Train → Score/Serve → Monitor
Surrounding challenges:
● Tuning: AutoML, hyperparameter search
● Model management: experiment tracking, model exchange, feature repository
● Remote cloud execution and project management (scaling teams)
● Data drift and model drift; alerting and debugging from production logs
● Deployment: orchestration (Airflow, Jobs), A/B testing, CI/CD (Jenkins) push to prod
● Lifecycle management: retrain, update features
● A zoo of ecosystem frameworks
Cross-cutting themes: Collaboration, Scale, Governance
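The "Monitor / Alert / Retrain" loop in the diagram above can be made concrete with a small trigger rule: retrain only when a live metric stays degraded for several consecutive evaluations, so one noisy reading does not fire an alert. The metric, tolerance, and window below are illustrative assumptions:

```python
def should_retrain(baseline_auc, recent_aucs, tolerance=0.05, window=3):
    """Flag retraining when the live metric stays below baseline - tolerance
    for `window` consecutive evaluations (avoids reacting to one blip)."""
    breaches = [auc < baseline_auc - tolerance for auc in recent_aucs[-window:]]
    return len(breaches) == window and all(breaches)

history = [0.84, 0.83, 0.78, 0.77, 0.76]  # model slowly degrading in production
print(should_retrain(0.84, history))             # -> True
print(should_retrain(0.84, [0.84, 0.83, 0.82]))  # -> False
```

In a real system this predicate would sit in an orchestrated job (e.g. Airflow) that, when it returns True, kicks off the training pipeline rather than just printing.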
16. MLOps: What, why, how?
WHAT: A set of practices for collaboration and communication between data scientists and operations professionals.
WHY: It aims to improve the delivery of machine learning models by combining the processes of design, development, testing, and delivery into a single process:
● Shortening development cycles and, as a result, decreasing time to market
● Improving collaboration between teams across all levels of technical expertise
● Increasing reliability, performance, scalability, and security of ML systems
● Streamlining operational and governance processes
● Increasing the return on investment of ML projects
17. MLOps vs DevOps
SAME SAME…
…when it comes to continuous integration of source control, unit testing, integration testing, and continuous delivery of the software module or package.
…BUT DIFFERENT
Continuous Integration (CI) is no longer only about testing and validating code and components, but also about testing and validating data, data schemas, and models.
Continuous Deployment (CD) is no longer about a single software package or service, but about a system (an ML training pipeline) that should automatically deploy another service (the model prediction service) or roll back changes to a model.
Continuous Training (CT) is a new property, unique to ML systems, concerned with automatically retraining and serving the models.
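"Testing and validating data and data schemas" in CI can be as simple as a check that fails the build when a batch violates the expected schema. A minimal sketch (the column names and types are illustrative assumptions):

```python
# Expected schema for an incoming batch (hypothetical columns).
EXPECTED_SCHEMA = {"user_id": int, "amount": float, "country": str}

def validate_batch(rows, schema=EXPECTED_SCHEMA):
    """Return a list of violations; an empty list means the batch passes CI."""
    errors = []
    for i, row in enumerate(rows):
        missing = set(schema) - set(row)
        if missing:
            errors.append(f"row {i}: missing columns {sorted(missing)}")
        for col, typ in schema.items():
            if col in row and not isinstance(row[col], typ):
                errors.append(f"row {i}: {col} should be {typ.__name__}")
    return errors

good = [{"user_id": 1, "amount": 9.99, "country": "IL"}]
bad = [{"user_id": "1", "amount": 9.99}]
print(validate_batch(good))  # -> []
print(validate_batch(bad))   # -> two violations
```

Wired into a CI job (or run as a pytest assertion), this turns silent upstream schema changes into a failing build instead of a silently degraded model.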
18. Tactics for Successful & Scalable ML in production
● Align business needs & ML objectives
● Involve the right personas
● Lean into the cloud
● Break the silos & support cross-collaboration
● Architect with operations in mind
● Invest in & leverage MLOps
20. End-to-End Data Science and ML
[Platform overview] A Data Science Workspace spanning Prep Data → Build Model → Deploy/Monitor Model, with AutoML, end-to-end ML lifecycle management, ML Runtime and Environments, Batch Scoring, and Online Serving, on an open, pluggable architecture.
21. High-level Architecture
[Architecture diagram] Unified Analytics Platform:
● Data Science, Model Training, Test and Selection: APIs, Jobs, Models, Notebooks, Dashboards on the ML Runtime / Databricks Runtime
● End-to-end ML lifecycle: Tracking, Projects, Models, Registry
● Model Deployment & Monitoring: to the cloud, to the edge
● ETL / Data Processing: Bronze and Gold tables; connectors and APIs for a wide variety of different sources (File, DB/DW)
● BI tool connectors and DB Connect