KDD 2019 IADSS Workshop - Skills to Master Machine Learning and Data Science Project Cycle & Strategies to Avoid Common Pitfalls - Ming Li
1. Skills to Master Machine Learning and
Data Science Project Cycle &
Strategies to Avoid Common Pitfalls
MING LI, AMAZON
KDD 2019 WORKSHOP “INITIATIVE FOR ANALYTICS AND DATA SCIENCE STANDARDS”
2. It is rare to see machine learning and data
science presentations that focus on the
end-to-end project cycle, and rarer still to
see common project pitfalls discussed.
In this talk, we cover both!
5. Model exploring and development
Model training, validation, testing
Model selection
Breakdown
o Python + sklearn
o R & various libs
o Keras + DL libs
o Spark MLlib
o Traditional machine
learning (ML) methods
o Deep learning methods
o General model training
and tuning process
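The general training/validation/testing and model-selection flow listed above can be sketched with sklearn. This is an illustrative sketch on a synthetic dataset, not the speaker's actual setup; the parameter grid and model choice are assumptions.

```python
# Sketch of the train/tune/test flow: hold out a test set, select
# hyper-parameters by cross-validation, then score the test set once.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=1000, n_features=20, random_state=0)

# Final test set is kept aside; tuning only sees the training split.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0)

# Model selection: grid-search a regularization strength with 5-fold CV.
search = GridSearchCV(
    LogisticRegression(max_iter=1000),
    param_grid={"C": [0.01, 0.1, 1.0, 10.0]},
    cv=5)
search.fit(X_train, y_train)

# Report held-out performance once, after selection is done.
test_score = search.score(X_test, y_test)
print(f"best C={search.best_params_['C']}, test accuracy={test_score:.3f}")
```

Touching the test set only once, after tuning, is what keeps the reported number honest.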
6. Data science formulation
Data quality and availability
Data preprocessing
Feature engineering
Breakdown
o Statistical/ML thinking
o Idea abstraction
o Data generation process
o Data understanding
o Common preprocessing and
feature engineering ideas
o SQL
o Python: pandas
o R: tidyverse
o PySpark / SparkR
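A minimal pandas sketch of the preprocessing and feature-engineering ideas listed above (imputation, a derived feature, one-hot encoding); the toy table and column names are invented for illustration.

```python
# Common preprocessing steps on a small invented table.
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "age": [34, np.nan, 52, 41],
    "city": ["NYC", "SEA", None, "NYC"],
    "spend": [120.0, 80.5, 300.0, np.nan],
})

# Impute numeric columns with the median, categoricals with a sentinel.
df["age"] = df["age"].fillna(df["age"].median())
df["spend"] = df["spend"].fillna(df["spend"].median())
df["city"] = df["city"].fillna("unknown")

# One engineered feature plus one-hot encoding of the categorical column.
df["log_spend"] = np.log1p(df["spend"])
df = pd.get_dummies(df, columns=["city"], prefix="city")
print(df.columns.tolist())
```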
7. Business problem definition and understanding
Quantifying business value and defining key metrics
Computation resources assessment
Key milestones and timeline
Data security, privacy and legal review
Breakdown
o Business insight
o Collaboration
o Teamwork
o Planning
o Prototype
8. A/B testing in production system
Model deployment in production
environment
Exception management
Performance monitoring
Breakdown
o Production system
knowledge
o Launch plan
o Anomaly detection
o Dashboard
o Tableau
o R-Shiny
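Behind a monitoring dashboard there is usually some anomaly-detection rule on key metrics. A crude illustrative version (the window, threshold, and metric series are all assumptions): flag days where a metric drifts several standard deviations from its trailing baseline.

```python
# Minimal performance monitor: flag days whose metric deviates more than
# k sigma from the trailing `window`-day mean.
import statistics

def drift_alerts(daily_metric, window=7, k=3.0):
    """Return indices of days that deviate > k sigma from the trailing mean."""
    alerts = []
    for i in range(window, len(daily_metric)):
        baseline = daily_metric[i - window:i]
        mu = statistics.mean(baseline)
        sigma = statistics.stdev(baseline)
        if sigma > 0 and abs(daily_metric[i] - mu) > k * sigma:
            alerts.append(i)
    return alerts

# A stable metric with one sharp degradation on day 10.
metric = [0.91, 0.92, 0.90, 0.91, 0.93, 0.92, 0.91, 0.92, 0.90, 0.91, 0.70]
print(drift_alerts(metric))  # the day-10 drop should be flagged
```

In practice this rule would feed the exception notifications and dashboards mentioned on this slide.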
9. Model tuning and re-training
Model update and add-on
Model failure and retirement
…
Breakdown
o Quantify business dynamics
o Consider different scenarios
o Plan and set criteria
o AutoML
10. Breakdown Business problem definition and
understanding
Quantifying business value and defining
key metrics
Computation resources assessment
Key milestones and timeline
Data security, privacy and legal review
Data science formulation
Data quality and availability
Data preprocessing
Feature engineering
Model exploring and development
Model training, validation, testing
Model selection
A/B testing in production system
Model deployment in production
environment
Exception management
Performance monitoring
Model tuning and re-training
Model update and add-on
Model failure and retirement
Project Cycle: Planning → Formulation → Modeling → Production → Post-Production
11. Business teams
o Operation team
o Business analyst team
o Insight and reporting team
Technology team
o Database and data warehouse team
o Data engineering team
o Infrastructure team
o Core machine learning team
o Software development team
o Visualization dashboard team
o Production implementation
Project management team
Program management team
Product management team
Senior leadership team
Leaders across organizations
Cross Team
Collaboration
Agile-Style
Project
Management
Team Work
13. Project Planning Stage
Solving the wrong problem
o Vague description of business needs
o Misalignment across many teams (scientists, developers, operations, project managers, etc.)
o The scientist team is not actively participating in the problem formulation process
Too optimistic about the timeline
o Project managers may not have prior experience with ML and data science projects
o Many ML method-specific uncertainties are not accounted for at the planning stage
o ML and data science projects are fundamentally different from each other and from software development
projects (such as online vs. offline models, batch models, real-time training, re-training, etc.)
Over-promising on business value
o Unrealistically high expectations (i.e. advertisement vs. actual product)
o Many assumptions about the project are usually not true
o Similar projects from other teams/companies are not evaluated thoroughly to set realistic expectations for the timeline
and outcome
14. Problem Formulation Stage
Too optimistic about standard statistical and ML methods
o Extra effort is needed to abstract the business problem into a set of analytics problems
o Standard methods are usually not enough to solve the business problems
Too optimistic about data availability and quality
o "Big data" is no guarantee of good and relevant data; it is usually big and messy
o The ideal data for the business problem is almost never available
o Unexpected effort is needed to bring in the right data
o Underestimating the effort to evaluate data quality
Too optimistic about the effort needed for data preprocessing
o Table or column descriptions are not detailed enough
o Lack of in-depth understanding of the dataset
o Underestimating data preprocessing (such as dealing with missing data)
o Underestimating the effort for feature engineering
o Mismatch between different data sources (such as online vs. offline, different tables, etc.)
15. Modeling Stage
Un-representative data (such as data lacking a forward view of what
will happen in production, or biased data)
Too optimistic about model selection and hyper-parameter
tuning to reach the desired performance
Overfitting and obsession with complicated models (heavy models
may lead to poor production performance)
Taking too long to fail
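The overfitting pitfall above is easy to demonstrate: on noisy data, an unconstrained model can memorize the training set yet generalize poorly. An illustrative sketch (dataset, depths, and noise level are arbitrary assumptions, not from the talk):

```python
# An unconstrained decision tree memorizes noisy training labels,
# while a shallow tree generalizes comparably or better.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=400, n_features=20, n_informative=5,
                           flip_y=0.2, random_state=1)  # 20% label noise
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.5, random_state=1)

deep = DecisionTreeClassifier(random_state=1).fit(X_tr, y_tr)
shallow = DecisionTreeClassifier(max_depth=3, random_state=1).fit(X_tr, y_tr)

print("deep:    train=%.2f test=%.2f" % (deep.score(X_tr, y_tr),
                                         deep.score(X_te, y_te)))
print("shallow: train=%.2f test=%.2f" % (shallow.score(X_tr, y_tr),
                                         shallow.score(X_te, y_te)))
```

The gap between the deep tree's train and test scores is the warning sign to watch for before shipping a heavy model.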
16. Productionization Stage
Bad production performance
o Lack of a shadow-mode dry run
o Lack of needed A/B testing
o Data availability and stability issues in real time
o Lack of exception management for issues such as timeouts and missing data
Failure to scale in real-time applications
o Computation capacity limitations
o Real-time data storage and processing limitations
o Latency constraints
o Not enough engineering resources (i.e. SDE, DE) during implementation
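Exception management for timeouts and missing data often comes down to a guarded scoring path with a safe fallback. A minimal sketch, assuming a hypothetical feature store and scoring formula (all names and values here are invented):

```python
# Guarded real-time scoring: fall back to a default on timeout or
# missing features instead of failing the request.
import concurrent.futures

FALLBACK_SCORE = 0.5  # assumed safe default for the business use case

def fetch_features(user_id):
    # Stand-in for a feature-store call that may hang or return None.
    return {"recency": 3, "spend": 120.0} if user_id == "u1" else None

def score(features):
    return min(1.0, 0.1 * features["recency"] + 0.002 * features["spend"])

def safe_score(user_id, timeout_s=0.05):
    try:
        with concurrent.futures.ThreadPoolExecutor(max_workers=1) as ex:
            features = ex.submit(fetch_features, user_id).result(timeout=timeout_s)
        if features is None:              # missing data -> fall back
            return FALLBACK_SCORE
        return score(features)
    except (concurrent.futures.TimeoutError, KeyError):
        return FALLBACK_SCORE             # timeout or malformed payload

print(safe_score("u1"), safe_score("u2"))
```

Each fallback event would also be logged and counted, feeding the monitoring discussed on the next slide.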
17. Post Production Stage
Missing necessary checkups
o Lack of model monitoring for key metrics
o Lack of exception notifications
o Lack of model failure/timeout notifications
o Online features not stored for future analysis
Production performance degradation
o Not aware of the dynamic nature of the business problem
o Not aware of changing input data quality and availability
o Lack of a model tuning and re-training plan
o Lack of a model retirement or replacement plan
18. Strategies to Avoid Pitfalls
Be aware of these pitfalls
Proactively discuss these pitfalls across teams
Be prepared when they happen
Have an end-to-end project cycle mindset
Ensure needed engineering resources are available
True collaboration among teams
19. Data Science Family Career Path
(Figure: career paths organized by skills, career title, and education.)
If you are good at all
three areas, you
become a "full-stack"
data scientist!
Each career title has
its own growth path
and promotion cycle.
At the problem formulation stage, the biggest pitfall is "solving the wrong problem." It sounds funny, but it is sadly true. One of the main reasons behind this pitfall is that data scientists are not involved in the initial discussions. Over-promising on business value is another big problem at this stage; often, the data scientists' data-driven, fact-based voice is not heard.
At the modeling stage, the pitfalls are familiar to statisticians, such as un-representative data. Most of the time, the ideal data is not available for a specific business problem, but heavily biased data will almost certainly lead to model failure.