Learning from data
1. Learning from Data
Busy Professional’s guide to machine learning
@govindk
http://govindkanshi.wordpress.com
2. Agenda
• What we know
• What we do not know
• Process
• What to measure
• Challenge with Model
• Challenge with Data
• Resources
• Software
• Books
3. What we know
• Reports made from data
• KPIs made of data
• Dashboards made of data
• They all measure known metrics, questions
4. What we do not know
• Will this person turn delinquent in x years based on his profile
(age/income/background…)
• Which kind of process, machine will fail
• Which people/things are similar to each other – find me a pattern
• Prevent people from readmission into Hospital
• Why: because we do not know the question in advance, and
databases/applications do not have out-of-the-box functionality for it.
5. We are already using applied ML results
• Mails get despammed
• Kinect recognizes our gestures
• Facebook recognizes our photos
• Siri/Cortana – recognize our voice commands
• Watson used some
• Search uses many
• Recommendations are everywhere you look
6. So then
• Learn from data
• How
• Create a model of the data
• Test the model for error and use it
7. Unsupervised
• Clustering
• Customer segmentation
• Topic identification
• Number of algorithms
• Hierarchical (distance as measure – generally Euclidean)
• Agglomerative (start with n single-item groups and keep merging the closest two)
• Single link (merge on the distance between the two closest members)
• Divisive (start with a single group and break it down)
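The agglomerative idea above can be sketched in a few lines of plain Python. This is a minimal illustration, not the method used in the talk: the function names and the (height, weight) sample points are made up for the example, and the single-link rule is assumed as the merge criterion.

```python
from math import dist  # Euclidean distance, Python 3.8+


def single_link_distance(cluster_a, cluster_b):
    """Distance between the closest pair of points, one from each cluster."""
    return min(dist(p, q) for p in cluster_a for q in cluster_b)


def agglomerative(points, n_clusters):
    """Start with one cluster per point; merge the two closest until n_clusters remain."""
    clusters = [[p] for p in points]
    while len(clusters) > n_clusters:
        # Find the pair of clusters with the smallest single-link distance.
        i, j = min(
            ((a, b) for a in range(len(clusters)) for b in range(a + 1, len(clusters))),
            key=lambda ab: single_link_distance(clusters[ab[0]], clusters[ab[1]]),
        )
        clusters[i].extend(clusters[j])
        del clusters[j]
    return clusters


# Hypothetical (height_cm, weight_kg) points forming two obvious groups.
people = [(150, 50), (152, 53), (151, 49), (180, 85), (183, 88)]
groups = agglomerative(people, n_clusters=2)
```

Here the number of clusters is passed in by hand; choosing it is exactly the "how many groups" challenge raised later in the deck.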
8. Simple way
• Group folks on
• Height
• What you eat
• Where you are from (state)
• Next time a new person comes in – let us predict
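One simple way to "predict" the group of a newcomer is to assign them to the group whose average member is closest. The sketch below assumes numeric features only (categorical ones such as food or state would need encoding first); the group names and data points are hypothetical.

```python
from math import dist  # Euclidean distance, Python 3.8+


def centroid(points):
    """Average point of a group."""
    n = len(points)
    return tuple(sum(coord) / n for coord in zip(*points))


def predict_group(groups, new_point):
    """Return the name of the group whose centroid is nearest to new_point."""
    centroids = {name: centroid(members) for name, members in groups.items()}
    return min(centroids, key=lambda name: dist(centroids[name], new_point))


# Hypothetical groups of (height_cm, weight_kg) that were formed earlier.
groups = {
    "group_a": [(150, 50), (152, 53), (151, 49)],
    "group_b": [(180, 85), (183, 88)],
}
newcomer_group = predict_group(groups, (178, 80))
```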
10. Challenges and next steps
• How many groups/clusters
• How many mis-groupings (Evaluation)
• Associating topics, and what to do after clustering
• Once clusters are formed – someone can name them
• Now run supervised methods on data to learn more
11. Supervised learning
• Given a label L for a set of attributes (a1, a2, a3, …)
• Learn the model which can predict the label based on attributes
12. Simple way to understand Classification
• Let us say people are labelled North Indian or South Indian
• How
• Attributes (language, food, movie language, music …)
• Basically learning the link between
• An observed data X and
• A variable y usually called target or labels.
13. Supervised
• Data
• One dataset for training which has label
• One dataset for testing
• Example
• Classification (spam, order data, disease data, Kinect gesture)
• Classification
• binary vs. multiclass
• Regression (sales)
• Ranking
• Search
• Predictive maintenance
• Recommendation
• Netflix - Netflix competition = SVD
14. Demos
• Trees
• DecisionTree – Python (show train and test, validation)
• Decision tree – R
• BigML (network dependent)
• Challenge –
• one input every time
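The train/test idea behind the tree demos can be illustrated without any library by a one-level tree (a "decision stump"). This is a simplified stand-in for the demos, not the code shown in the talk; the spam-like data and all function names are invented for the example.

```python
def fit_stump(xs, ys):
    """Pick the threshold on one numeric feature that minimizes training errors.

    The stump predicts 1 when x > threshold, else 0.
    """
    best_threshold, best_errors = None, len(xs) + 1
    for threshold in sorted(set(xs)):
        errors = sum((x > threshold) != bool(y) for x, y in zip(xs, ys))
        if errors < best_errors:
            best_threshold, best_errors = threshold, errors
    return best_threshold


def predict(threshold, x):
    return int(x > threshold)


def accuracy(threshold, xs, ys):
    return sum(predict(threshold, x) == y for x, y in zip(xs, ys)) / len(xs)


# Hypothetical data: feature = count of suspicious words, label 1 = spam.
train_x, train_y = [0, 1, 2, 3, 6, 7, 8, 9], [0, 0, 0, 0, 1, 1, 1, 1]
test_x, test_y = [1, 4, 5, 10], [0, 0, 1, 1]

threshold = fit_stump(train_x, train_y)
train_acc = accuracy(threshold, train_x, train_y)  # fits the training set perfectly
test_acc = accuracy(threshold, test_x, test_y)     # lower on data it has not seen
```

The gap between the two accuracies is the reason a separate test set is kept aside.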
15. Few more terms to overcome data issues
• Bagging (often used with tree models) (variance reduction)
• Train an ensemble of models from bootstrap samples
• Get a vote amongst models
• Class predicted by majority of the models wins
• Get an average if outputs are scores or probabilities
• * Bootstrap – a random sample of the dataset, drawn with replacement
• Boosting (mainly bias reduction)
• Like bagging, but each new model penalizes and learns from earlier misclassifications
• Challenge of assigning “weights” to misclassified instances to penalize them
• E.g. AdaBoost starts with equal weights and raises the weights of misclassified instances each round until the error comes down
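A rough sketch of bagging, assuming a very simple base learner (nearest class mean on one feature) in place of a tree so that the example stays short. The function names, the data, and the seed are illustrative assumptions.

```python
import random
from collections import Counter


def fit_nearest_mean(sample):
    """Base learner: remember the mean feature value of each class."""
    by_label = {}
    for x, y in sample:
        by_label.setdefault(y, []).append(x)
    return {label: sum(xs) / len(xs) for label, xs in by_label.items()}


def predict_nearest_mean(model, x):
    """Predict the class whose mean is closest to x."""
    return min(model, key=lambda label: abs(model[label] - x))


def bagging_fit(data, n_models, rng):
    """Train one base learner per bootstrap sample (drawn with replacement)."""
    models = []
    for _ in range(n_models):
        bootstrap = [rng.choice(data) for _ in data]
        models.append(fit_nearest_mean(bootstrap))
    return models


def bagging_predict(models, x):
    """Majority vote across the ensemble."""
    votes = Counter(predict_nearest_mean(m, x) for m in models)
    return votes.most_common(1)[0][0]


# Hypothetical (feature, label) pairs.
data = [(0, 0), (1, 0), (2, 0), (3, 0), (6, 1), (7, 1), (8, 1), (9, 1)]
ensemble = bagging_fit(data, n_models=25, rng=random.Random(0))
```

If the base learners returned scores instead of classes, `bagging_predict` would average them rather than count votes.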
16. Demo
• RandomForest
• Each tree gets a bootstrap sample of n training rows out of N; at each decision
node it randomly selects m input features from the total M (m ≈ √M) and picks
the best split among them. Finally, each tree in the forest votes for the result.
• Evaluation
• Loss function to margins (penalize mis-classification, reward +ve)
17. Regression
• Explain the relationship between variables (dependent vs
independent)
• Simple linear: y = w0 + w1x
• Estimate the weights to predict y
• Multiple (several inputs): y = w0 + w1x1 + w2x2 + …
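For the one-input case, "estimating the weights" has a closed-form least-squares answer. The sketch below assumes ordinary least squares and uses made-up data that lie exactly on a line; `fit_line` is a name chosen for the example.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = w0 + w1 * x with a single input."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    w1 = sxy / sxx               # slope
    w0 = mean_y - w1 * mean_x    # intercept
    return w0, w1


# Hypothetical data generated from y = 1 + 2x.
xs = [1, 2, 3, 4]
ys = [3, 5, 7, 9]
w0, w1 = fit_line(xs, ys)
prediction = w0 + w1 * 5  # predict y for a new x
```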
19. What to measure
• Data
• Cross Validation
• n-fold cross-validation
• Leave-one-out validation
• Hold out
• At the end of the day: how much data is enough, and is there bias in the data (only certain kinds of labels)?
• Model Results
• Contingency table (false negatives & false positives are bad)
• ROC & AUC (coverage curve) (true positive rate vs false positive rate)
• Precision/Recall (from search world)
• F-measure
• Lift (not interested in accuracy on entire dataset, want for 5%,10% of dataset)
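The cross-validation schemes listed above differ only in how the rows are split. A small sketch, assuming contiguous unshuffled folds (real use would normally shuffle first); `k_fold_indices` is a hypothetical helper name.

```python
def k_fold_indices(n_samples, k):
    """Yield (train_indices, test_indices) for each of k folds."""
    indices = list(range(n_samples))
    # Spread the remainder over the first folds so sizes differ by at most one.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test = indices[start:start + size]
        train = indices[:start] + indices[start + size:]
        yield train, test
        start += size


folds = list(k_fold_indices(10, 3))          # 3-fold cross-validation on 10 rows
leave_one_out = list(k_fold_indices(5, 5))   # k equal to n gives leave-one-out
```

A hold-out split is the special case of using just one of these (train, test) pairs.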
20. Is Model working right
              Predicted +ve   Predicted -ve   Total
Actual +ve         40              15           55
Actual -ve          5              40           45
Total              45              55          100
Precision 40/45
Recall 40/55
F measure (Harmonic mean) 2/((1/prec) + (1/rec))
Accuracy (TP(40) + TN(40)) / (40+15+5+40) = 0.80
How much accuracy is enough
Lift – How much better than random guessing
Lift and accuracy are not necessarily correlated
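The numbers on this slide can be checked with a few lines of arithmetic. The variable names are chosen for the example; the counts are the ones in the table (40 true positives, 15 false negatives, 5 false positives, 40 true negatives).

```python
# Counts from the table: rows are actual classes, columns are predictions.
tp, fn = 40, 15  # actual positive: predicted +ve, predicted -ve
fp, tn = 5, 40   # actual negative: predicted +ve, predicted -ve

precision = tp / (tp + fp)                    # 40/45, about 0.889
recall = tp / (tp + fn)                       # 40/55, about 0.727
f_measure = 2 / (1 / precision + 1 / recall)  # harmonic mean of the two
accuracy = (tp + tn) / (tp + fn + fp + tn)    # correct predictions / all predictions
```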
21. Challenge with Model
• Overfitting
• Avoid Bias and have less variance
• Use Regularization
• L1 (Lasso)
• L2 (Ridge)
• If time permits show the alpha effect
• Look for “overfitting model” , “bias and variance”
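The "alpha effect" can be seen in the simplest possible setting: ridge regression with one feature and no intercept, where the penalized weight has a closed form. This is an illustrative sketch with made-up data, not the demo from the talk; `ridge_weight` is a hypothetical name.

```python
def ridge_weight(xs, ys, alpha):
    """Closed-form ridge solution for y ~ w * x (one feature, no intercept).

    Minimizes sum((y - w*x)^2) + alpha * w^2.
    """
    sxy = sum(x * y for x, y in zip(xs, ys))
    sxx = sum(x * x for x in xs)
    return sxy / (sxx + alpha)


# Hypothetical data lying exactly on y = 2x.
xs = [1, 2, 3]
ys = [2, 4, 6]

# Larger alpha shrinks the weight toward zero.
weights = {alpha: ridge_weight(xs, ys, alpha) for alpha in (0, 14, 42)}
```

With alpha at 0 the unpenalized weight 2.0 is recovered; as alpha grows the weight shrinks, trading a little bias for less variance.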
22. Challenge with Data
• Categorical, ordinal, quantitative
• Measures – mean, median, variance, std deviation, range, shape (skewness)
• Always observe to get “feel”/smell of data
• Discretize/Thresholding (convert quantitative feature)
• Missing feature(s) –
• What do you do – median, avg
• Data encoding
• Create new from existing vs encode in different way
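Two of the data fixes above, filling missing values with the median and thresholding a quantitative feature, might look like this. The age values, the threshold of 35, and the function names are assumptions made for the example.

```python
from statistics import median


def impute_median(values):
    """Replace missing entries (None) with the median of the observed ones."""
    observed = [v for v in values if v is not None]
    fill = median(observed)
    return [fill if v is None else v for v in values]


def discretize(values, threshold):
    """Convert a quantitative feature into two categories."""
    return ["high" if v >= threshold else "low" for v in values]


# Hypothetical ages with two missing entries.
ages = [25, None, 40, 31, None, 58]
filled = impute_median(ages)
buckets = discretize(filled, threshold=35)
```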
23. Feature engineering
• Feature selection
• Intuition, testing correlation
• Subset (Start small and increase) based on some error function
• Feature extraction
• New k dimensions – as combination of older d dimensions
• Linear
• PCA (project onto the directions of largest variance; sensitive to outliers)
• LDA (supervised method for dimension reduction for classification)
• FA (Factor Analysis), Multidimensional Scaling (distance between points)
• Non-linear: IsoMap (geodesic distance) and Locally Linear Embedding (LLE)
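Feature selection by "testing correlation" can be sketched as ranking each candidate feature by the absolute Pearson correlation with the target. The feature names and values are invented, and correlation is only one of several possible selection criteria.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (spread_x * spread_y)


def rank_features(features, target):
    """Order feature names from most to least correlated with the target."""
    scores = {name: abs(pearson(column, target)) for name, column in features.items()}
    return sorted(scores, key=scores.get, reverse=True)


# Hypothetical feature columns and target.
features = {
    "noise": [3, 1, 4, 1, 5],
    "income": [1, 2, 3, 4, 5],
}
target = [2, 4, 6, 8, 10]
ranking = rank_features(features, target)
```

Starting from the top of the ranking and adding features while an error measure keeps improving is one way to do the "start small and increase" subset search.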
24. What we could not cover
• Mechanisms
• Reinforcement Learning (punishment/rewards to learn better)
• Algorithm types
• Perceptron (backpropagation, SOM, …)
• SVM
• LDA and friends for unstructured world
• Regression (OLS, logistic, stepwise, MARS)
• Regularization (ridge/lasso)
• Trees (GBM, C4.5, ID3…)
• Bayesian
• Kernel (radial)
• Deep learning (DBN, Boltzmann…)
• Clustering (Expectation Max)
• Recommendation
• Probability (distributions) & Linear Algebra
• Constraint Solving and Optimization (Solver, OpenSolver..)
27. What you will be doing
• Data
• Touch/feel (visualize),breathe it in
• Cleaning, scaling/normalization
• Selecting
• Algorithm (choose the task)
• Classification
• Regression
• Ranking (recommendation, search results)
• Amongst algorithms
• Evaluate algorithms against each other & refine/calibrate
• AUC, ROC, RMSE etc…
28. If time & net permit: Yhatr demo
• Because you need to deploy, test & use the model
• Yhatr provides good hosting (theirs, or host your own)
29. Thanks for your time
• Please fill the evaluation form
• See you next time