1. 1 © Hortonworks Inc. 2011–2018. All rights reserved
DWS Barcelona 2019
Robert Hryniewicz
@robhryniewicz
Data Science Crash Course
2.
What is Machine Learning?
Machine Learning is programming with data (as opposed to programming with code).
Machine Learning is a way to use data to draw
meaningful conclusions including identifying
patterns, anomalies and trends that may not be
obvious to humans.
Machine learning is math, at scale.
Machine learning is learning patterns
from data, labelled or not.
Machine learning is when I explain my challenge to
the computer and it finds a way to solve it.
Machine Learning allows for emotional decisions to
become objective.
3.
Examples where Machine Learning can be applied
Healthcare
• Predict diagnosis
• Prioritize screenings
• Reduce readmission rates
Financial services
• Fraud Detection/prevention
• Predict underwriting risk
• New account risk screens
Public Sector
• Analyze public sentiment
• Optimize resource allocation
• Law enforcement & security
Retail
• Product recommendation
• Inventory management
• Price optimization
Telco/mobile
• Predict customer churn
• Predict equipment failure
• Customer behavior analysis
Oil & Gas
• Predictive maintenance
• Seismic data management
• Predict well production levels
Insurance
• Risk assessment
• Customer insights/experience
• Real-time financial analysis
Life sciences
• Genome sequencing
• Drug development
• Sensor data
4.
Machine Learning – Major Types
• Supervised Learning
• Unsupervised Learning
• Reinforcement Learning
5.
Supervised Learning
[Diagram: several inputs feeding a model that produces Output 1, Output 2, …, Output n]
Use labeled (training) datasets to learn the relationship between given inputs and outputs.
Once the model is trained, use it to predict outputs on new input data.
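A minimal sketch of the train-then-predict idea, assuming made-up data and a 1-nearest-neighbor rule (neither is from the slides): the labeled examples are the "training", and the prediction for a new input is the label of the closest known example.

```python
import math

# Labeled training data: (inputs, output). All values are hypothetical.
training_data = [
    ((1.0, 1.0), "small"),
    ((1.5, 2.0), "small"),
    ((8.0, 9.0), "large"),
    ((9.0, 8.5), "large"),
]

def predict(inputs):
    """Return the label of the training example closest to `inputs`."""
    nearest = min(training_data, key=lambda pair: math.dist(pair[0], inputs))
    return nearest[1]

# Predict the output for an input the model has not seen before.
prediction = predict((8.5, 8.0))
print(prediction)
```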
6.
Unsupervised Learning
Explore, classify & find
patterns in the input data
without being explicit
about the output.
7.
Reinforcement Learning
[Diagram: Algorithm and Environment, connected by Action, Reward and State]
The algorithm learns to maximize the rewards it
receives for its actions
(e.g. maximizing points for
investment returns).
Use it when you don’t have
a lot of training data, you
can’t clearly define the ideal
end state, or the only way
to learn is by interacting
with the environment.
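As a rough illustration of learning from rewards, here is a sketch of an epsilon-greedy agent on a multi-armed bandit. This is a deliberately simplified, stateless setting chosen for brevity. The reward values, `epsilon`, and the function name `run_bandit` are assumptions for the example.

```python
import random

def run_bandit(arm_rewards, steps=1000, epsilon=0.1, seed=0):
    """Learn by interaction which action pays the highest reward."""
    rng = random.Random(seed)
    estimates = [0.0] * len(arm_rewards)  # learned value of each action
    counts = [0] * len(arm_rewards)
    total_reward = 0.0
    for _ in range(steps):
        if rng.random() < epsilon:
            action = rng.randrange(len(arm_rewards))  # explore
        else:
            action = max(range(len(arm_rewards)), key=lambda a: estimates[a])  # exploit
        reward = arm_rewards[action]  # the environment answers with a reward
        counts[action] += 1
        # Running average of the rewards seen so far for this action.
        estimates[action] += (reward - estimates[action]) / counts[action]
        total_reward += reward
    return estimates, total_reward

# Hypothetical environment: three actions with fixed rewards.
estimates, total_reward = run_bandit([1.0, 5.0, 2.0])
best_action = max(range(len(estimates)), key=lambda a: estimates[a])
print(best_action, total_reward)
```

The agent is never told which action is best; it only sees rewards. The occasional random action is what lets it discover a better option than its first guess.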
8.
Regression
• Linear Regression
Classification
• Logistic Regression
• Support Vector Machines (SVM)
• Random Forest (RF)
• Naïve Bayes
Recommender Systems / Collaborative Filtering
• Alternating Least Squares (ALS)
Clustering
• K-Means, LDA
Dimensionality Reduction
• Principal Component Analysis (PCA)
Deep Learning
• Fully Connected Neural Nets → Tabular or Recommender Systems
• Convolutional Neural Nets (CNNs) → Images
• Recurrent Neural Nets (RNNs) → Natural Language Processing (NLP) / Text
9.
REGRESSION
Predicting a continuous-valued output
Example: Predicting house prices based on number of bedrooms and square footage
Algorithms: Linear Regression
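A small sketch of linear regression with a single feature, using the closed-form least-squares solution. The house data is made up, and only square footage is used (the slide also mentions bedrooms) to keep the arithmetic visible.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept

# Hypothetical training data: price is exactly 200 per square foot.
square_feet = [800, 1000, 1500, 2000]
prices = [160_000, 200_000, 300_000, 400_000]

slope, intercept = fit_line(square_feet, prices)
predicted_price = slope * 1200 + intercept  # continuous-valued prediction
print(slope, intercept, predicted_price)
```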
10.
CLASSIFICATION
Identifying the category to which an object belongs
Examples: spam detection, diabetes diagnosis, text labeling
Algorithms:
• Logistic Regression
• Fast training (linear model)
• Classes expressed in probabilities
• Less prone to overfitting [+]
• Can underfit, giving lower accuracy on non-linear data [-]
• Support Vector Machines (SVM)
• “Best” supervised learning algorithm, effective
• State of the art prior to Deep Learning
• More robust to outliers than Logistic Regression
• Handles non-linearity (via kernels)
• Random Forest (ensemble of Decision Trees)
• Fast training
• Handles categorical features
• Does not require feature scaling
• Captures non-linearity and feature interactions
• Implicitly performs a form of feature selection, since each split picks the most informative of the candidate features
• Naïve Bayes
• Good for text classification
• Assumes features (e.g. words) are independent of each other given the class
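To illustrate "classes expressed in probabilities", here is a sketch of logistic regression on one feature, trained with plain gradient descent. The data, learning rate and epoch count are assumptions chosen for the example.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def train_logistic(xs, ys, learning_rate=0.5, epochs=200):
    """Fit p(y=1 | x) = sigmoid(w * x + b) by gradient descent on log loss."""
    w, b = 0.0, 0.0
    n = len(xs)
    for _ in range(epochs):
        errors = [sigmoid(w * x + b) - y for x, y in zip(xs, ys)]
        grad_w = sum(e * x for e, x in zip(errors, xs)) / n
        grad_b = sum(errors) / n
        w -= learning_rate * grad_w
        b -= learning_rate * grad_b
    return w, b

# Hypothetical labeled data: negative inputs are class 0, positive are class 1.
xs = [-2.0, -1.0, 1.0, 2.0]
ys = [0, 0, 1, 1]

w, b = train_logistic(xs, ys)
p_positive = sigmoid(w * 1.5 + b)   # probability of class 1 for x = 1.5
p_negative = sigmoid(w * -1.5 + b)  # probability of class 1 for x = -1.5
print(p_positive, p_negative)
```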
11.
CLASSIFICATION
Visual Intro to Decision Trees
• http://www.r2d3.us/visual-intro-to-machine-learning-part-1
12.
CLUSTERING
Automatic grouping of similar objects into sets (clusters)
Example: market segmentation – automatically group customers into different market segments
Algorithms: K-means, LDA
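A compact sketch of K-means on two-dimensional points. The customer data and the choice of starting centroids are assumptions. Real implementations usually pick the starting centroids randomly or with a smarter seeding scheme.

```python
import math

def k_means(points, centroids, iterations=10):
    """Alternate between assigning points to the nearest centroid and
    moving each centroid to the mean of its assigned points."""
    for _ in range(iterations):
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(range(len(centroids)),
                          key=lambda i: math.dist(p, centroids[i]))
            clusters[nearest].append(p)
        centroids = [
            tuple(sum(c) / len(cluster) for c in zip(*cluster)) if cluster else centroids[i]
            for i, cluster in enumerate(clusters)
        ]
    return centroids, clusters

# Hypothetical customers: (age, purchases per month).
customers = [(20, 1), (22, 2), (25, 1), (60, 9), (62, 10), (65, 8)]
centroids, clusters = k_means(customers, centroids=[customers[0], customers[-1]])
print(clusters)
```

No labels are supplied. The two segments emerge only from how close the points are to each other.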
13.
COLLABORATIVE FILTERING
Fill in the missing entries of a user-item association matrix
Applications: Product/movie recommendation
Algorithms: Alternating Least Squares (ALS)
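The sketch below shows the idea behind ALS in its simplest form: one latent factor per user and per item. The ratings, the regularization value `lam`, and the function name `als_rank1` are assumptions for illustration; production systems use more factors and a library implementation.

```python
def als_rank1(ratings, n_users, n_items, iterations=50, lam=0.001):
    """ratings maps (user, item) -> value and holds only the observed entries."""
    user_f = [1.0] * n_users
    item_f = [1.0] * n_items
    for _ in range(iterations):
        # Hold item factors fixed and solve a least-squares problem per user.
        for u in range(n_users):
            obs = [(i, r) for (uu, i), r in ratings.items() if uu == u]
            user_f[u] = (sum(r * item_f[i] for i, r in obs)
                         / (sum(item_f[i] ** 2 for i, _ in obs) + lam))
        # Hold user factors fixed and solve a least-squares problem per item.
        for i in range(n_items):
            obs = [(u, r) for (u, ii), r in ratings.items() if ii == i]
            item_f[i] = (sum(r * user_f[u] for u, r in obs)
                         / (sum(user_f[u] ** 2 for u, _ in obs) + lam))
    return user_f, item_f

# Hypothetical 2x3 user-item matrix; entry (user 1, item 2) is missing.
ratings = {(0, 0): 1.0, (0, 1): 2.0, (0, 2): 3.0,
           (1, 0): 2.0, (1, 1): 4.0}
user_f, item_f = als_rank1(ratings, n_users=2, n_items=3)
predicted = user_f[1] * item_f[2]  # fill in the missing entry
print(round(predicted, 2))
```

User 1 rates everything twice as high as user 0 in this toy data, so the filled-in value lands close to 6.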
14.
DIMENSIONALITY REDUCTION
Reducing the number of redundant features/variables
Applications:
• Removing noise in images by selecting only
“important” features
• Removing redundant features, e.g. MPH & KPH are
linearly dependent
Algorithms: Principal Component Analysis (PCA)
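A sketch of PCA on two features, using the closed-form eigenvalue of a 2x2 covariance matrix. It picks up the MPH and KPH example: because one is a multiple of the other, a single component explains essentially all of the variance. The speed values are made up.

```python
import math

def pca_2d(xs, ys):
    """Return the first principal direction and its share of total variance."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    var_x = sum((x - mx) ** 2 for x in xs) / (n - 1)
    var_y = sum((y - my) ** 2 for y in ys) / (n - 1)
    cov_xy = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / (n - 1)
    # Largest eigenvalue of [[var_x, cov_xy], [cov_xy, var_y]].
    half_trace = (var_x + var_y) / 2
    spread = math.sqrt(((var_x - var_y) / 2) ** 2 + cov_xy ** 2)
    largest = half_trace + spread
    # Matching eigenvector (valid when cov_xy is non-zero), normalized.
    direction = (cov_xy, largest - var_x)
    norm = math.hypot(*direction)
    direction = (direction[0] / norm, direction[1] / norm)
    explained = largest / (var_x + var_y)
    return direction, explained

mph = [10.0, 20.0, 30.0, 40.0, 50.0]   # hypothetical speeds
kph = [m * 1.609344 for m in mph]      # same information in another unit
direction, explained = pca_2d(mph, kph)
print(direction, explained)
```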
16.
Simple/shallow vs Deep Neural Net
17.
Popular Neural Net Architectures
• Convolutional Neural Nets (CNNs) ← Images
• Recurrent Neural Nets (RNNs) ← Text / Language (NLP) & Time Series
• Long Short-Term Memory (LSTM) ← Text / Language (NLP) & Time Series
18.
Number Probability
0 0.03
1 0.01
2 0.04
3 0.08
4 0.05
5 0.08
6 0.07
7 0.02
8 0.54
9 0.08
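The table reads like the per-class output of a digit classifier (an assumption, since the slide's image is not part of this transcript). Under that reading, the predicted digit is simply the class with the highest probability:

```python
# Probabilities copied from the table above, keyed by digit.
probabilities = {
    0: 0.03, 1: 0.01, 2: 0.04, 3: 0.08, 4: 0.05,
    5: 0.08, 6: 0.07, 7: 0.02, 8: 0.54, 9: 0.08,
}

# The prediction is the class with the largest probability.
predicted_digit = max(probabilities, key=probabilities.get)
confidence = probabilities[predicted_digit]
print(predicted_digit, confidence)
```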
19.
Quickly Training Deep Learning Models
with Transfer Learning
20.
How to Build a Deep Learning Image Recognition System?
Step 1: Download examples to train the model with
[Images: African Bush Elephant, Indian Elephant, Sri Lankan Elephant, Borneo Pygmy Elephant]
21.
How to Build a Deep Learning Image Recognition System?
Step 2: Augment dataset to enrich training data
→ Adds 5-10x more training examples
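A sketch of what augmentation can look like, with an image represented as a nested list of pixel values. The specific transformations (flips and 90-degree rotations) are assumptions; the slide does not say which ones were used, and real pipelines typically also crop, shift or change colors.

```python
def flip_horizontal(image):
    """Mirror the image left to right."""
    return [row[::-1] for row in image]

def flip_vertical(image):
    """Mirror the image top to bottom."""
    return [list(row) for row in image[::-1]]

def rotate_90(image):
    """Rotate the image 90 degrees clockwise."""
    return [list(row) for row in zip(*image[::-1])]

def augment(image):
    """Return extra training examples derived from one original image."""
    variants = [flip_horizontal(image), flip_vertical(image)]
    rotated = image
    for _ in range(3):  # 90, 180 and 270 degrees
        rotated = rotate_90(rotated)
        variants.append(rotated)
    return variants

tiny_image = [[1, 2],
              [3, 4]]  # hypothetical 2x2 grayscale image
extra_examples = augment(tiny_image)
print(len(extra_examples))
```

Each variant keeps the original label, so one labeled photo yields several labeled training examples.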
22.
How to Build a Deep Learning Image Recognition System?
Step 3: Select and download a pre-trained model
dawn.cs.stanford.edu/benchmark
23.
Sample Architecture of a CNN
[Diagram labels: Pretrained Parameters, Random Parameters]
24.
How to Build a Deep Learning Image Recognition System?
Step 4: Apply transfer learning
[Diagram: INPUT → Pretrained Network (millions of parameters) → Random Parameters → OUTPUT (e.g. Borneo Pygmy Elephant, Indian Elephant)]
Step A: Train Parameters
Step B: Adjust Parameters
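One common reading of the two steps, offered here as an assumption about the diagram: in Step A only the new output parameters are trained while the pretrained ones stay frozen, and in Step B all parameters are adjusted with a smaller learning rate. The toy network, its "pretrained" weights and the data below are all hypothetical and only illustrate the mechanics.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def features(x, weights):
    """The 'pretrained network': one tanh layer."""
    return [math.tanh(sum(w * xi for w, xi in zip(row, x))) for row in weights]

def predict(x, weights, head):
    return sigmoid(sum(h * f for h, f in zip(head, features(x, weights))))

def loss(data, weights, head):
    total = 0.0
    for x, y in data:
        p = predict(x, weights, head)
        total -= y * math.log(p) + (1 - y) * math.log(1 - p)
    return total / len(data)

def train(data, weights, head, learning_rate, epochs, update_pretrained):
    n = len(data)
    for _ in range(epochs):
        grad_head = [0.0] * len(head)
        grad_w = [[0.0] * len(row) for row in weights]
        for x, y in data:
            f = features(x, weights)
            err = predict(x, weights, head) - y
            for j, fj in enumerate(f):
                grad_head[j] += err * fj
                for k, xk in enumerate(x):
                    grad_w[j][k] += err * head[j] * (1 - fj ** 2) * xk
        head = [h - learning_rate * g / n for h, g in zip(head, grad_head)]
        if update_pretrained:
            weights = [[w - learning_rate * g / n for w, g in zip(row, grow)]
                       for row, grow in zip(weights, grad_w)]
    return weights, head

data = [((2.0, 0.0), 1), ((1.5, 0.5), 1), ((0.0, 2.0), 0), ((0.5, 1.5), 0)]
pretrained = [[1.0, -1.0], [0.5, 0.5]]  # stands in for weights from an earlier task
new_head = [0.0, 0.0]  # the slide uses random values; zeros keep this reproducible

loss_start = loss(data, pretrained, new_head)
# Step A: train only the new output parameters, pretrained layer frozen.
weights_a, head_a = train(data, pretrained, new_head, 0.5, 100, update_pretrained=False)
loss_a = loss(data, weights_a, head_a)
# Step B: adjust all parameters with a much smaller learning rate.
weights_b, head_b = train(data, weights_a, head_a, 0.01, 50, update_pretrained=True)
loss_b = loss(data, weights_b, head_b)
print(loss_start, loss_a, loss_b)
```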
25.
How to Build a Deep Learning Image Recognition System?
Step 5: Host a trained model on a server and make it accessible via a web app
User uploads an image → Web app returns: Borneo Pygmy Elephant
26.
Data Science Journey
27.
What is data science?
The scientific exploration of data to extract meaning or
insight, using statistics and mathematical models with
the end goal of making smarter, quicker decisions.
29.
Start by Asking Relevant Questions
• Specific (can you think of a clear answer?)
• Measurable (quantifiable? data driven?)
• Actionable (if you had an answer, could you do something with it?)
• Realistic (can you get an answer with data you have?)
• Timely (answer in reasonable timeframe?)
30.
Data Preparation
1. Data analysis (audit for anomalies/errors)
2. Creating an intuitive workflow (formulate the sequence of prep operations)
3. Validation (correctness evaluated against a representative sample dataset)
4. Transformation (actual prep process takes place)
5. Backflow of cleaned data (replace original dirty data)
Approx. 80% of a Data Analyst’s job is Data Preparation!
Example of multiple values used for U.S. States → California, CA, Cal., Cal
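A sketch of the transformation step for the state example: map every spelling seen in the data to one canonical value. The alias table and the function name `normalize_state` are assumptions. In practice the table would come out of the audit step.

```python
# Known spellings (lower-cased) mapped to one canonical code.
STATE_ALIASES = {
    "california": "CA",
    "ca": "CA",
    "cal.": "CA",
    "cal": "CA",
}

def normalize_state(raw):
    """Return the canonical code, or the trimmed input if it is unknown."""
    key = raw.strip().lower()
    return STATE_ALIASES.get(key, raw.strip())

raw_values = ["California", "CA", "Cal.", "Cal", " ca "]
cleaned = [normalize_state(value) for value in raw_values]
print(cleaned)
```

Unknown values are passed through rather than guessed, so they can be flagged and reviewed in a later audit.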
31.
Feature Selection
• Also known as variable or attribute selection
• Why important?
• simplification of models → easier to interpret by researchers/users
• shorter training times
• enhanced generalization by reducing overfitting
• Dimensionality reduction vs feature selection
• Dimensionality reduction: create new combinations of attributes
• Feature selection: include/exclude attributes in data without changing them
Q: Which features should you use to create a predictive model?
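One simple way to approach the question is a filter method: keep the features whose correlation with the target is strong enough. The sketch below assumes made-up housing data and an arbitrary threshold of 0.5. It is only one of several selection strategies.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

def select_features(features, target, threshold=0.5):
    """Keep features as they are, but only those correlated with the target."""
    return [name for name, values in features.items()
            if abs(pearson(values, target)) >= threshold]

# Hypothetical data set: two useful features and one irrelevant one.
features = {
    "square_feet": [800, 1000, 1500, 2000, 2500],
    "bedrooms": [1, 2, 3, 3, 4],
    "house_number": [7, 3, 9, 1, 5],
}
price = [160, 205, 290, 410, 495]

selected = select_features(features, price)
print(selected)
```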
32.
Hyperparameters
• Define higher-level model properties, e.g. complexity or learning rate
• Cannot be learned during training → need to be predefined
• Can be decided by
• setting different values
• training different models
• choosing the values that test better
• Hyperparameter examples
• Number of leaves or depth of a tree
• Number of latent factors in a matrix factorization
• Learning rate (in many models)
• Number of hidden layers in a deep neural network
• Number of clusters in a k-means clustering
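The "set values, train models, choose what tests better" loop can be sketched as a small grid search. The example tunes `k` for a k-nearest-neighbors classifier, which is an assumption made for brevity (it is not one of the hyperparameters listed above), on made-up one-dimensional data with one deliberately noisy label.

```python
def knn_predict(train, x, k):
    """Majority vote among the k training points closest to x."""
    neighbors = sorted(train, key=lambda pair: abs(pair[0] - x))[:k]
    votes = sum(label for _, label in neighbors)
    return 1 if votes * 2 > k else 0

def accuracy(train, validation, k):
    correct = sum(knn_predict(train, x, k) == y for x, y in validation)
    return correct / len(validation)

# Hypothetical data: (feature, label). The point at 3 carries a noisy label.
train = [(1, 0), (2, 0), (3, 1), (4, 0), (6, 1), (7, 1), (8, 1), (9, 1)]
validation = [(3.2, 0), (2.4, 0), (7.4, 1), (5.6, 1)]

# Try each candidate value and keep the one that scores best on held-out data.
candidates = [1, 3, 5]
scores = {k: accuracy(train, validation, k) for k in candidates}
best_k = max(scores, key=scores.get)
print(scores, best_k)
```

With `k = 1` the model copies the noisy label; larger values smooth it out, which is why they score better on the validation set here.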
33.
• Residuals
  • The residual of an observed value is the difference between the observed value and the estimated value
• R² (R Squared), the Coefficient of Determination
  • Indicates goodness of fit
  • An R² of 1 means the regression line fits the data perfectly
• RMSE (Root Mean Square Error)
  • Measures the differences between the values predicted by a model and the values actually observed
  • A good measure of accuracy, but only for comparing the forecasting errors of different models on the same variable, because RMSE is scale-dependent
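The three measures can be written out directly from their definitions. The observed and predicted values below are made up for the example.

```python
import math

def residuals(observed, predicted):
    """Observed value minus estimated value, per data point."""
    return [o - p for o, p in zip(observed, predicted)]

def r_squared(observed, predicted):
    """1 minus (residual sum of squares / total sum of squares)."""
    mean_obs = sum(observed) / len(observed)
    ss_res = sum(r ** 2 for r in residuals(observed, predicted))
    ss_tot = sum((o - mean_obs) ** 2 for o in observed)
    return 1 - ss_res / ss_tot

def rmse(observed, predicted):
    """Square root of the mean squared residual, in the units of the target."""
    res = residuals(observed, predicted)
    return math.sqrt(sum(r ** 2 for r in res) / len(res))

observed = [3.0, 5.0, 7.0, 9.0]
predicted = [2.0, 5.0, 8.0, 9.0]

print(residuals(observed, predicted))
print(r_squared(observed, predicted))
print(rmse(observed, predicted))
```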
34.
With that in mind…
• No simple formula for “good questions”, only general guidelines
• The right data is better than lots of data
• Understanding relationships matters
35.
Enterprise Data Science @ Scale
• Enterprise-Grade: Leverage enterprise-grade security, governance and operations
• Tools: Enhance productivity by enabling data scientists to use their favorite tools, technologies and libraries
• Deployment: Compress the time to insight by deploying models into production faster
• Data: Build more robust models by using all the data in the data lake
36.
Thanks!
Robert Hryniewicz
@robhryniewicz