Overview of an 11-week Machine Learning course that I developed and taught to software engineers at Dell on their way to becoming Data Scientists. The class is an outline of Predictive Analytics methods using Python. I taught this class on 8 separate occasions over 3 years.
Internal Use - Confidential
the Context
Industry Trends for 2018 – How what we’re doing fits into the future
Trend #1 Data Science continues to develop specialties - this means the mythical ‘full stack’ data scientist will disappear
• Data Scientist
• Data Engineer
• Algorithm Coder
• Data Storyteller
Trend #2 Non-Data Scientists will perform more fairly sophisticated analytics alongside data scientists
• Data Scientist
• Algorithm Coder
• Data Science Citizens
• Advanced Analytics
• Programmers
• Statisticians
• Business Analyst
• Coders
the Course
Machine Learning Mastery
- Understand Your Data
- Create Accurate Models
- Work Projects End-To-End
• 16 weeks – May-Oct., 2017
• 20+ class hours – 20% homework, 80% live coding
• 17 notebooks – Python code templates
• 4 Prerequisites – Coding, statistics, algorithms, thirst to learn
• 1 Textbook – Machine Learning Mastery w/ Python, Dr. Jason Brownlee
• 1 Teacher – Mitch Sanders w/ Assistant – Uday Waghmare
• 14 Students – global: software engineers, adv. analysts, statisticians
• Platform – Jupyter, Python 2.7, Anaconda
• Code Repository – GitHub
• NPS Survey – Survey Monkey, LTR = 90
• Awarded – “On the Spot”
the Content
Key concepts – and flow – the 17 notebooks (#1 to #17)

Prepare & Explore
Python ML Ecosystem
• SciPy
• Scikit-learn
Crash Courses
• NumPy
• Matplotlib
• Pandas
Load Libraries & Data
Descriptive Statistics
• Attribute Data Types
• Class Distribution
• Correlation Analysis
• Skew of Univariates
Visualization
• Univariate Plots
• Multivariate Plots
Pre Processing
• Rescale
• Standardize
• Normalize
• Binarize
Feature Selection
• Tree & Univariate
• Recursive (RFE)
• Principal Component Analysis (PCA)
• Feature Importance

Model
Resampling
• Split into Train/Test
• K-fold Cross Validation
• Leave One Out
• Repeated Random
Evaluation Metrics
• For Classification
• For Regression
Spot Check Classification Algorithms
• Linear – Logistic Regression, Linear Discriminant Analysis (LDA)
• Non-linear – K-Nearest Neighbor (KNN), Naïve Bayes, Classification & Regression Trees (CART), Support Vector Machines (SVM)
Compare Algorithms
Spot Check Regression Algorithms
• Linear – LR, LASSO, ElasticNet (EN)
• Non-Linear – CART, SVR, KNN

Improve Accuracy & Finalize
Automate w/ Pipelines
• Preparation Pipelines
• Feature Extraction Pipelines
• Modeling Pipelines
Ensembles - Performance Improvements
• Boosting – AdaBoost, Gradient Boosting (GBM)
• Bagging – Random Forest, Extra Trees
• Voting
Algorithm Parameter Tuning
• Parameters
• Grid Search
• Random Search
Finalize Model
• Predict on Validation Data
• Create Standalone on Entire Data
• Save Model for Production
Case Studies #1 & #2
the Course Syllabus
Python Ecosystem for Machine Learning
• Python
• SciPy
• Scikit-learn
• Python Ecosystem Installation
• Summary
Crash Course in Python and SciPy
• Python Crash Course
• NumPy Crash Course
• Matplotlib Crash Course
• Pandas Crash Course
• Summary
How To Load Machine Learning Data
• Considerations When Loading CSV Data
• Pima Indians Dataset
• Load CSV Files with the Python Standard Library
• Load CSV Files with NumPy
• Load CSV Files with Pandas
• Summary
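A minimal sketch of the standard-library approach listed above. It is written for Python 3 (the course itself ran on Python 2.7), and the CSV text, column names and the `load_csv` helper are hypothetical stand-ins for illustration, not the Pima Indians data or the course notebooks.

```python
import csv
import io

# Hypothetical stand-in for a CSV file on disk: one header row plus numeric rows.
RAW_CSV = """feature_a,feature_b,label
6,148,1
1,85,0
8,183,1
"""


def load_csv(text_stream):
    """Return (header, rows), converting every data value to float."""
    reader = csv.reader(text_stream)
    header = next(reader)  # first row holds the column names
    rows = [[float(value) for value in row] for row in reader if row]
    return header, rows


header, rows = load_csv(io.StringIO(RAW_CSV))
print(header)
print(len(rows), "rows x", len(rows[0]), "columns")
```

For a real file, `io.StringIO(RAW_CSV)` would be replaced by an opened file object, for example `open(path, newline="")`.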
Understand Your Data With Visualization
• Univariate Plots
• Multivariate Plots
• Summary
Prepare Your Data For Machine Learning
• Need For Data Pre-processing
• Data Transforms
• Rescale Data
• Standardize Data
• Normalize Data
• Binarize Data (Make Binary)
• Summary
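The four transforms listed above could be sketched with scikit-learn's preprocessing classes as follows. This assumes Python 3 with a current NumPy and scikit-learn; the small matrix `X` and the binarize threshold are made up for illustration.

```python
import numpy as np
from sklearn.preprocessing import Binarizer, MinMaxScaler, Normalizer, StandardScaler

# Hypothetical feature matrix: 4 rows, 2 columns on very different scales.
X = np.array([[1.0, 200.0],
              [2.0, 400.0],
              [3.0, 600.0],
              [4.0, 800.0]])

# Rescale: map each column into the range [0, 1].
rescaled = MinMaxScaler(feature_range=(0, 1)).fit_transform(X)

# Standardize: shift and scale each column to mean 0 and standard deviation 1.
standardized = StandardScaler().fit_transform(X)

# Normalize: scale each row to unit (L2) length.
normalized = Normalizer().fit_transform(X)

# Binarize: values above the threshold become 1, all others become 0.
binarized = Binarizer(threshold=2.5).fit_transform(X)

np.set_printoptions(precision=3, suppress=True)
for name, data in [("rescaled", rescaled), ("standardized", standardized),
                   ("normalized", normalized), ("binarized", binarized)]:
    print(name)
    print(data)
```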
Feature Selection For Machine Learning
• Feature Selection
• Univariate Selection
• Recursive Feature Elimination
• Principal Component Analysis
• Feature Importance
• Summary
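One possible sketch of the four feature selection techniques above, using scikit-learn on synthetic data. The dataset from `make_classification`, the choice of 3 features, and the estimators passed to each selector are assumptions for illustration only.

```python
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.feature_selection import RFE, SelectKBest, f_classif
from sklearn.linear_model import LogisticRegression

# Synthetic stand-in data: 8 features, of which 3 carry signal.
X, y = make_classification(n_samples=200, n_features=8, n_informative=3,
                           n_redundant=0, random_state=7)

# Univariate selection: keep the 3 features with the best ANOVA F-scores.
univariate = SelectKBest(score_func=f_classif, k=3).fit(X, y)
X_best = univariate.transform(X)

# Recursive feature elimination: repeatedly drop the weakest feature.
rfe = RFE(estimator=LogisticRegression(max_iter=1000), n_features_to_select=3).fit(X, y)

# Principal component analysis: compress the 8 columns into 3 components.
pca = PCA(n_components=3).fit(X)

# Feature importance: scores from an ensemble of randomized trees.
trees = ExtraTreesClassifier(n_estimators=50, random_state=7).fit(X, y)

print("Univariate scores:", univariate.scores_.round(2))
print("RFE selected mask:", rfe.support_)
print("PCA explained variance ratio:", pca.explained_variance_ratio_.round(3))
print("Tree feature importances:", trees.feature_importances_.round(3))
```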
Evaluate the Performance of Machine Learning Algorithms with Resampling
• Evaluate Machine Learning Algorithms
• Split into Train and Test Sets
• K-fold Cross-Validation
• Leave One Out Cross-Validation
• Repeated Random Test-Train Splits
• What Techniques to Use When
• Summary
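The four resampling schemes above might look like this in scikit-learn. This is a sketch on synthetic data; the split sizes, the number of folds and the logistic regression model are illustrative assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import (KFold, LeaveOneOut, ShuffleSplit,
                                     cross_val_score, train_test_split)

# Synthetic stand-in data.
X, y = make_classification(n_samples=100, n_features=5, random_state=7)
model = LogisticRegression(max_iter=1000)

# 1. Single train/test split: fast, but the estimate depends on one split.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=7)
holdout_accuracy = model.fit(X_train, y_train).score(X_test, y_test)

# 2. K-fold cross-validation: every row is used for testing exactly once.
kfold_scores = cross_val_score(model, X, y,
                               cv=KFold(n_splits=10, shuffle=True, random_state=7))

# 3. Leave-one-out: one fold per row, so 100 fits for 100 rows.
loo_scores = cross_val_score(model, X, y, cv=LeaveOneOut())

# 4. Repeated random train/test splits.
shuffle_scores = cross_val_score(model, X, y,
                                 cv=ShuffleSplit(n_splits=10, test_size=0.33, random_state=7))

print("Hold-out accuracy: %.3f" % holdout_accuracy)
print("10-fold accuracy: %.3f (+/- %.3f)" % (kfold_scores.mean(), kfold_scores.std()))
print("Leave-one-out accuracy: %.3f" % loo_scores.mean())
print("Repeated random splits: %.3f (+/- %.3f)" % (shuffle_scores.mean(), shuffle_scores.std()))
```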
Machine Learning Algorithm Performance Metrics
• Algorithm Evaluation Metrics
• Classification Metrics
• Regression Metrics
• Summary
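To make the two metric families concrete, here is a small pure-Python sketch of a few common ones. The slides do not say which metrics the course covered, so accuracy, precision, recall, MAE, MSE and RMSE are an assumption, and the helper names and sample values are hypothetical.

```python
import math


def classification_metrics(y_true, y_pred):
    """Accuracy, precision and recall for binary labels (1 = positive class)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    return {
        "accuracy": (tp + tn) / len(pairs),
        "precision": tp / (tp + fp) if (tp + fp) else 0.0,
        "recall": tp / (tp + fn) if (tp + fn) else 0.0,
    }


def regression_metrics(y_true, y_pred):
    """Mean absolute error, mean squared error and root mean squared error."""
    errors = [t - p for t, p in zip(y_true, y_pred)]
    mae = sum(abs(e) for e in errors) / len(errors)
    mse = sum(e * e for e in errors) / len(errors)
    return {"mae": mae, "mse": mse, "rmse": math.sqrt(mse)}


# Made-up labels and predictions for illustration.
cls = classification_metrics([1, 0, 1, 1, 0, 0, 1, 0], [1, 0, 0, 1, 0, 1, 1, 0])
reg = regression_metrics([3.0, 5.0, 2.0, 7.0], [2.5, 5.0, 4.0, 8.0])
print(cls)  # accuracy, precision and recall are all 0.75 for this sample
print(reg)  # mae is 0.875 and mse is 1.3125 for this sample
```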
Spot-Check Classification Algorithms
• Algorithm Spot-Checking
• Algorithms Overview
• Linear Machine Learning Algorithms
• Nonlinear Machine Learning Algorithms
• Summary
Spot-Check Regression Algorithms
• Algorithms Overview
• Linear Machine Learning Algorithms
• Nonlinear Machine Learning Algorithms
• Summary
Compare Machine Learning Algorithms
• Choose The Best Machine Learning Model
• Compare Machine Learning Algorithms Consistently
• Summary
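Comparing algorithms consistently usually means scoring each one on the same folds. A minimal sketch, assuming scikit-learn, synthetic data and default hyperparameters; the six classifiers are the ones named on the content slide, but the dataset and settings are hypothetical.

```python
from sklearn.datasets import make_classification
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, cross_val_score
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data.
X, y = make_classification(n_samples=200, n_features=8, random_state=7)

models = {
    "LR": LogisticRegression(max_iter=1000),
    "LDA": LinearDiscriminantAnalysis(),
    "KNN": KNeighborsClassifier(),
    "CART": DecisionTreeClassifier(random_state=7),
    "NB": GaussianNB(),
    "SVM": SVC(),
}

results = {}
for name, model in models.items():
    # Same seed for every model, so each one is scored on identical folds.
    kfold = KFold(n_splits=10, shuffle=True, random_state=7)
    scores = cross_val_score(model, X, y, cv=kfold, scoring="accuracy")
    results[name] = (scores.mean(), scores.std())
    print("%-4s mean accuracy %.3f (std %.3f)" % (name, scores.mean(), scores.std()))
```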
Automate Machine Learning Workflows with Pipelines
• Automating Machine Learning Workflows
• Data Preparation and Modeling Pipeline
• Feature Extraction and Modeling Pipeline
• Summary
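A sketch of the two pipeline types listed above, assuming scikit-learn and synthetic data. Wrapping the transforms in a `Pipeline` means they are re-fitted on each training fold only, which helps avoid leaking test data into preparation. The step names and parameter values are illustrative.

```python
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.feature_selection import SelectKBest
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, cross_val_score
from sklearn.pipeline import FeatureUnion, Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data.
X, y = make_classification(n_samples=200, n_features=8, random_state=7)
kfold = KFold(n_splits=10, shuffle=True, random_state=7)

# Data preparation and modeling pipeline: standardize, then fit the model.
prep_pipeline = Pipeline([
    ("standardize", StandardScaler()),
    ("logistic", LogisticRegression(max_iter=1000)),
])
prep_scores = cross_val_score(prep_pipeline, X, y, cv=kfold)

# Feature extraction and modeling pipeline: combine PCA components with
# the best univariate features, then fit the model on the combined columns.
features = FeatureUnion([
    ("pca", PCA(n_components=3)),
    ("select_best", SelectKBest(k=4)),
])
feature_pipeline = Pipeline([
    ("feature_union", features),
    ("logistic", LogisticRegression(max_iter=1000)),
])
feature_scores = cross_val_score(feature_pipeline, X, y, cv=kfold)

print("Preparation pipeline accuracy: %.3f" % prep_scores.mean())
print("Feature extraction pipeline accuracy: %.3f" % feature_scores.mean())
```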
Improve Performance with Ensembles
• Combine Models Into Ensemble Predictions
• Bagging Algorithms
• Boosting Algorithms
• Voting Ensemble
• Summary
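The three ensemble families above could be spot-checked as follows. This is a sketch assuming scikit-learn and synthetic data. Random Forest stands in for bagging and Gradient Boosting for boosting, as grouped on the content slide; the members of the voting ensemble are an arbitrary choice.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import (GradientBoostingClassifier, RandomForestClassifier,
                              VotingClassifier)
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, cross_val_score
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data.
X, y = make_classification(n_samples=200, n_features=8, random_state=7)
kfold = KFold(n_splits=10, shuffle=True, random_state=7)

ensembles = {
    # Bagging: many trees fitted on bootstrap samples, predictions averaged.
    "Random Forest": RandomForestClassifier(n_estimators=50, random_state=7),
    # Boosting: trees added in sequence, each correcting earlier errors.
    "Gradient Boosting": GradientBoostingClassifier(n_estimators=50, random_state=7),
    # Voting: different model types combined by majority vote.
    "Voting": VotingClassifier(estimators=[
        ("lr", LogisticRegression(max_iter=1000)),
        ("cart", DecisionTreeClassifier(random_state=7)),
        ("nb", GaussianNB()),
    ]),
}

ensemble_results = {}
for name, model in ensembles.items():
    scores = cross_val_score(model, X, y, cv=kfold)
    ensemble_results[name] = scores.mean()
    print("%-18s mean accuracy %.3f" % (name, scores.mean()))
```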
data science student questions - 1
"So you do Data Science work. What does that really involve? And how is that different from programming, statistical work or data engineering?"
"I want to learn Data Science. Between R, Python and SAS, where should I start and what are the pros and cons of each?"
"What are OOP (object-oriented programming) and Structured Programming, and what's the difference between them?"
"What are the main differences between Python 2.7 and Python 3.x? And why do so many developers stay with Python 2.7?"
"What is the difference between Supervised Learning and Unsupervised Learning?"
"How might the graphs for a univariate analysis differ from those for a bivariate analysis? Can you graph multivariate data?"
"How do you explain machine learning to an 8-year-old child?"
"What is Gradient Descent?"
"What is multicollinearity and how can you overcome it?"
data science student questions - 2
"What is the curse of dimensionality?"
"What do you understand by Hypothesis in the context of Machine Learning?"
"What's the difference between a Test Set and a Validation Set?"
"What is cross-validation and what is it used for?"
"What's the difference between a Classification and Regression Tree algorithm and a Random Forest? And when is one better than the other?"
"What are the basic assumptions to be made for linear regression?"
"Can you explain in simple language what an Eigenvalue and an Eigenvector are?"
"Do gradient descent methods always converge to the same point?"
"What's the difference between continuous, ordinal and categorical variables?"
"What is K-means? How can you select K for K-means?"
data science student questions - 3
"Why is naive Bayes so 'naive'?"
"OLS is to linear regression as Maximum Likelihood is to logistic regression. Explain the statement."
"What do you understand by the Bias-Variance trade-off?"
"Do you suggest that treating a categorical variable as a continuous variable would result in a better predictive model?"
"When does regularization become necessary in Machine Learning?"
"Explain a model and its dimensions to an 8-year-old."
"How do you determine and deal with correlated features in your data set, and how do you reduce the dimensionality of the data?"
"During analysis, how do you treat missing values?"
"What is Regularization and what kind of problems does regularization solve?"
the Data Scientist Roles
Roles Defined by 3 different Data Science Authors
Data Scientist Core Skills / How To Build A Successful Data Science Team / The seven people you need on your Big Data team: Descriptions
Capture / Data Engineer
• Handyman: Expert in Dell EDW, D3, BO, Hana/BMS, other RDBMS, and ETL work
• Open Source Guru (plus Data Modeler): Hadoop stack, Cloudera, Linux, data structures and network
Analyze / Machine Learning Expert
• Data Modeler (plus all aspects of Data Engineer and Business Analyst): SQL, RDBMS, Teradata, Dell infrastructure
• Deep Diver: Machine Learning, R, Python, SQL, ETL work, algorithm modeling, statistics
Present / Business Analyst
• Story Teller: PowerPoint, Design, Tableau, understands customers' business and technical language, artistic eye
• Snoop (plus Handyman skills): Enthusiastic, deeply creative, super savvy in Dell environments, finds contacts and not hesitant to do work-arounds
• Privacy Wonk: Dell policy meticulous, socially aware, foresees roadblocks