Data Science Crash Course
- 19. 19 © Hortonworks Inc. 2011 – 2016. All Rights Reserved
Machine Learning Use Cases
Healthcare
Predict diagnosis
Prioritize screenings
Reduce re-admittance rates
Financial services
Fraud Detection/prevention
Predict underwriting risk
New account risk screens
Public Sector
Analyze public sentiment
Optimize resource allocation
Law enforcement & security
Retail
Product recommendation
Inventory management
Price optimization
Telco/mobile
Predict customer churn
Predict equipment failure
Customer behavior analysis
Oil & Gas
Predictive maintenance
Seismic data management
Predict well production levels
- 25.
DataFrames
→ Distributed collection of data organized into named columns
→ Conceptually equivalent to a table in a relational DB or a data frame in R/Python
→ API available in Scala, Java, Python, and R
[Diagram: a DataFrame grid of rows and named columns Col1 … ColN]
Data is described as a DataFrame with rows, columns, and a schema.
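The same idea in miniature, using a pandas data frame (the R/Python analogue the slide mentions; the Spark DataFrame API is conceptually similar but distributed):

```python
import pandas as pd

# A small data frame: rows, named columns, and a schema (one dtype per column)
df = pd.DataFrame({"Col1": [1, 2, 3], "Col2": ["a", "b", "c"]})

print(list(df.columns))  # the named columns
print(df.dtypes)         # the schema
print(len(df))           # the number of rows
```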
- 37.
What is a ML Model?
→ A mathematical formula with a number of parameters that must be learned from the data; fitting a model to the data is the process known as model training
→ E.g. linear regression
– Goal: fit a line y = mx + c to data points
– After model training: y = 2x + 5
[Diagram: Input (1, 0, 7, 2, …) → Model → Output (7, 5, 19, 9, …)]
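A minimal sketch of the training step described above, using NumPy's polyfit to recover m and c from points generated by y = 2x + 5 (the data here is made up for illustration):

```python
import numpy as np

# Points that lie on the line y = 2x + 5
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = 2 * x + 5

# "Model training": fit y = m*x + c to the data points
m, c = np.polyfit(x, y, 1)
# m is close to 2 and c is close to 5 -- the learned parameters
```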
- 45.
[Flowchart: starting from START, pick an algorithm family]
Regression
• Local Regression (LOESS)
Classification
• XGBoost (Extreme Gradient Boosting)
• Classification and regression trees (CART)
Deep Learning
• Recurrent Neural Network (RNN)
• Convolutional Neural Network (CNN)
Clustering
• Yinyang K-Means
Dimensionality Reduction
• t-Distributed Stochastic Neighbor Embedding (t-SNE)
Collaborative Filtering
• Weighted Alternating Least Squares (WALS)
- 47.
Hyperparameters
→ Define higher-level model properties, e.g. complexity or learning rate
→ Cannot be learned during training → must be predefined
→ Can be chosen by
– setting different values
– training a different model for each
– choosing the values that test better
→ Hyperparameter examples
– Number of leaves or depth of a tree
– Number of latent factors in a matrix factorization
– Learning rate (in many models)
– Number of hidden layers in a deep neural network
– Number of clusters in k-means clustering
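The "set different values, train different models, choose what tests better" loop can be sketched with polynomial degree as the hyperparameter (a toy NumPy setup, not Spark's tuning API):

```python
import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(-3, 3, 60)
y = 2 * x + 5 + rng.normal(0, 0.5, x.size)  # noisy line

# Hold out every other point as a validation set
x_tr, y_tr = x[::2], y[::2]
x_va, y_va = x[1::2], y[1::2]

def val_mse(degree):
    """Train a model with this hyperparameter value, score on held-out data."""
    coeffs = np.polyfit(x_tr, y_tr, degree)
    pred = np.polyval(coeffs, x_va)
    return float(np.mean((pred - y_va) ** 2))

degrees = [1, 3, 5, 9]                     # candidate hyperparameter values
best_degree = min(degrees, key=val_mse)    # the value that "tests better"
```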
- 52.
Data Preparation
1. Data analysis (audit the data for anomalies/errors)
2. Creating an intuitive workflow (formulate a sequence of prep operations)
3. Validation (correctness evaluated against a representative sample dataset)
4. Transformation (where the actual prep work takes place)
5. Backflow of cleaned data (replace the original dirty data)
Approx. 80% of a Data Analyst's job is Data Preparation!
Example of multiple values used for U.S. states → California, CA, Cal., Cal
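The U.S. states example above can be handled with a small canonicalization map in the transformation step (the table and function names here are illustrative, not from the deck):

```python
# Illustrative alias table: every observed spelling maps to one canonical code
STATE_ALIASES = {
    "california": "CA",
    "ca": "CA",
    "cal.": "CA",
    "cal": "CA",
}

def normalize_state(value):
    """Map a free-text state value to its canonical code, if known."""
    return STATE_ALIASES.get(value.strip().lower(), value)
```

With this, "California", "CA", "Cal.", and "Cal" all normalize to "CA"; unknown values pass through unchanged rather than being silently dropped.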
- 57.
Sample Spark ML Pipeline
indexer = …
parser = …
hashingTF = …
vecAssembler = …
rf = RandomForestClassifier(numTrees=100)
pipe = Pipeline(stages=[indexer, parser, hashingTF, vecAssembler, rf])
model = pipe.fit(trainData) # Train model
results = model.transform(testData) # Test model
- 63.
→ Zeppelin → Interactive notebook
→ Spark
→ YARN → Resource Management
→ HDFS → Distributed Storage Layer
[Diagram: Spark stack. Scala, Java, Python, and R APIs over the Spark Core Engine with Spark SQL, Spark Streaming, MLlib, and GraphX, running on YARN over HDFS nodes 1 … N]
- 66.
Scatter 2D Data Visualized
scatterData ← DataFrame
+-----+--------+
|label|features|
+-----+--------+
|-12.0| [-4.9]|
| -6.0| [-4.5]|
| -7.2| [-4.1]|
| -5.0| [-3.2]|
| -2.0| [-3.0]|
| -3.1| [-2.1]|
| -4.0| [-1.5]|
| -2.2| [-1.2]|
| -2.0| [-0.7]|
| 1.0| [-0.5]|
| -0.7| [-0.2]|
...
- 69.
ML Lab
• Residuals
• the residual of an observed value is the difference between the observed value and the value estimated by the model
• R² (R Squared) – Coefficient of Determination
• indicates goodness of fit
• an R² of 1 means the regression line perfectly fits the data
• RMSE (Root Mean Square Error)
• a measure of the differences between values predicted by a model and the values actually observed
• a good measure of accuracy, but only for comparing the forecasting errors of different models on the same variable (it is scale-dependent)
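The three quantities above follow directly from their definitions; a NumPy sketch:

```python
import numpy as np

def regression_metrics(y_true, y_pred):
    """Residuals, RMSE, and R² from observed vs. predicted values."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    residuals = y_true - y_pred                       # observed - estimated
    rmse = float(np.sqrt(np.mean(residuals ** 2)))    # root mean square error
    ss_res = float(np.sum(residuals ** 2))
    ss_tot = float(np.sum((y_true - y_true.mean()) ** 2))
    r2 = 1.0 - ss_res / ss_tot                        # coefficient of determination
    return residuals, rmse, r2
```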
- 72.
Diabetes Dataset – Decision Trees / Random Forest
Labeled set with 8 features (each row: a ±1 label followed by index:value pairs)
-1 1:-0.294118 2:0.487437 3:0.180328 4:-0.292929 5:-1 6:0.00149028 7:-0.53117 8:-0.0333333
+1 1:-0.882353 2:-0.145729 3:0.0819672 4:-0.414141 5:-1 6:-0.207153 7:-0.766866 8:-0.666667
-1 1:-0.0588235 2:0.839196 3:0.0491803 4:-1 5:-1 6:-0.305514 7:-0.492741 8:-0.633333
+1 1:-0.882353 2:-0.105528 3:0.0819672 4:-0.535354 5:-0.777778 6:-0.162444 7:-0.923997 8:-1
-1 1:-1 2:0.376884 3:-0.344262 4:-0.292929 5:-0.602837 6:0.28465 7:0.887276 8:-0.6
+1 1:-0.411765 2:0.165829 3:0.213115 4:-1 5:-1 6:-0.23696 7:-0.894962 8:-0.7
-1 1:-0.647059 2:-0.21608 3:-0.180328 4:-0.353535 5:-0.791962 6:-0.0760059 7:-0.854825 8:-0.833333
...
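The rows above appear to be in LIBSVM format: a ±1 label followed by sparse index:value pairs. A minimal parser sketch (function name is illustrative):

```python
def parse_libsvm_line(line):
    """Split one LIBSVM-format row into (label, {feature_index: value})."""
    parts = line.split()
    label = float(parts[0])
    features = {int(i): float(v)
                for i, v in (p.split(":") for p in parts[1:])}
    return label, features
```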
- 78.
Feature Selection
→ Also known as variable or attribute selection
→ Why is it important?
– simpler models → easier for researchers/users to interpret
– shorter training times
– enhanced generalization by reducing overfitting
→ Dimensionality reduction vs. feature selection
– Dimensionality reduction: creates new combinations of attributes
– Feature selection: includes/excludes attributes in the data without changing them
Q: Which features should you use to create a predictive model?
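One simple include/exclude strategy is to rank features by a univariate score, e.g. absolute correlation with the target, and keep the top-scoring ones (a NumPy sketch of the idea, not a Spark API):

```python
import numpy as np

def rank_features(X, y):
    """Order feature columns by |Pearson correlation| with the target, best first."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    scores = [abs(np.corrcoef(X[:, j], y)[0, 1]) for j in range(X.shape[1])]
    return list(np.argsort(scores)[::-1])
```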
- 80.
Feature Selection Traps
→ Feature selection is another key part of the applied machine learning process, like model selection. You cannot fire and forget.
→ It is important to treat feature selection as part of the model selection process. If you do not, you may inadvertently introduce bias into your models, which can result in overfitting.
→ For example, you must include feature selection within the inner loop when you are using accuracy estimation methods such as cross-validation: feature selection is performed on the prepared fold right before the model is trained. A mistake would be to perform feature selection on the full dataset first, then perform model selection and training on the selected features.
- 81.
Feature Selection Checklist
1. Do you have domain knowledge? If yes, construct a better set of “ad hoc” features.
2. Are your features commensurate? If no, consider normalizing them.
3. Do you suspect interdependence of features? If yes, expand your feature set by constructing conjunctive features or products of features, as much as your computer resources allow.
4. Do you need to prune the input variables (e.g. for cost, speed, or data understanding reasons)? If no, construct disjunctive features or weighted sums of features.
5. Do you need to assess features individually (e.g. to understand their influence on the system, or because their number is so large that you need to do a first filtering)? If yes, use a variable ranking method; else, do it anyway to get baseline results.
6. Do you need a predictor? If no, stop.
7. Do you suspect your data is “dirty” (has a few meaningless input patterns and/or noisy outputs or wrong class labels)? If yes, detect the outlier examples using the top-ranking variables obtained in step 5 as a representation; check and/or discard them.
8. Do you know what to try first? If no, use a linear predictor. Use a forward selection method with the “probe” method as a stopping criterion, or use the 0-norm embedded method for comparison. Following the ranking of step 5, construct a sequence of predictors of the same nature using increasing subsets of features. Can you match or improve performance with a smaller subset? If yes, try a non-linear predictor with that subset.
9. Do you have new ideas, time, computational resources, and enough examples? If yes, compare several feature selection methods, including your new idea, correlation coefficients, backward selection, and embedded methods. Use linear and non-linear predictors. Select the best approach with model selection.
10. Do you want a stable solution (to improve performance and/or understanding)? If yes, subsample your data and redo your analysis for several “bootstraps”.
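Step 8's forward selection with a linear predictor can be sketched greedily. This toy version uses least-squares training error as the criterion and a fixed subset size for brevity; the checklist's "probe" stopping rule is omitted:

```python
import numpy as np

def forward_select(X, y, k):
    """Greedily add the feature that most reduces least-squares training error."""
    selected, remaining = [], list(range(X.shape[1]))
    for _ in range(k):
        def sse(j):
            cols = selected + [j]
            A = np.column_stack([X[:, cols], np.ones(len(y))])  # add intercept
            coef, *_ = np.linalg.lstsq(A, y, rcond=None)
            r = y - A @ coef
            return float(r @ r)                                  # sum of squared errors
        best = min(remaining, key=sse)
        selected.append(best)
        remaining.remove(best)
    return selected
```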
- 90.
What’s new in HDP 2.6 – Spark & Zeppelin
Spark
→ Spark 1.6.3 GA
→ Spark 2.1 GA
→ REST API (Livy) GA
→ Spark Thrift Server doAs GA
→ SparkSQL – Row/Column Security (GA)
→ Spark Streaming + Kafka over SSL
→ Multi-cluster HBase support for SHC
→ Package support in PySpark & SparkR

Zeppelin 0.7.x
→ Spark 2.x support
→ Improved Livy integration
→ No passwords in the clear
→ JDBC interpreter improvements
→ SmartSense integration
→ Knox proxying of the Zeppelin UI