Learning from data

Busy Professional’s guide to machine learning
@govindk
http://govindkanshi.wordpress.com

Agenda
• What we know
• What we do not know
• Process
• What to measure
• Challenge with Model
• Challenge with Data
• Resources
• Software
• Books

What we know
• Reports made from data
• KPIs made of data
• Dashboards made of data
• They all measure known metrics, questions

What we do not know
• Will this person turn delinquent in x years, based on their profile (age/income/background…)?
• Which kind of process or machine will fail?
• Which people/things are similar to each other – find me a pattern
• Which patients are at risk of readmission to hospital, so that it can be prevented
• Why – because we do not know the question in advance, and databases/applications do not have out-of-the-box (OOB) functionality for it.

We are already using applied ML results
• Mail gets filtered for spam
• Kinect recognizes our gestures
• Facebook recognizes our photos
• Siri/Cortana – recognize our voice commands
• Watson used some ML techniques
• Search uses many
• Recommendations are right in your face

So then
• Learn from data
• How
• Create a model of the data
• Test the model for error and use it

Unsupervised
• Clustering
• Customer segmentation
• Topic identification
• A number of algorithms
• Hierarchical (distance as measure – generally Euclidean)
• Agglomerative (start with n groups, one per point, and keep merging the closest two)
• Single link (distance between two groups = distance between their closest members)
• vs. divisive (start with a single group – break it down)
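The agglomerative idea above can be sketched in a few lines. This is a minimal illustration using only the standard library, not what the demos use; the function name `single_link_clusters` and the (height, weight) data are made up for the example.

```python
from itertools import combinations
from math import dist  # Euclidean distance (Python 3.8+)


def single_link_clusters(points, k):
    """Merge the two closest clusters until only k clusters remain."""
    # Start with n groups: every point is its own cluster.
    clusters = [[p] for p in points]

    def single_link(a, b):
        # Single link: distance between the closest pair of members.
        return min(dist(p, q) for p in a for q in b)

    while len(clusters) > k:
        # Find the pair of clusters with the smallest single-link distance.
        i, j = min(
            combinations(range(len(clusters)), 2),
            key=lambda ij: single_link(clusters[ij[0]], clusters[ij[1]]),
        )
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]  # j > i, so index i is unaffected
    return clusters


# Hypothetical (height_cm, weight_kg) observations.
people = [(150, 50), (152, 53), (155, 52), (180, 80), (183, 85), (178, 79)]
groups = single_link_clusters(people, k=2)
for g in groups:
    print(sorted(g))
```

Choosing k up front is the "how many groups" question; a real library would also return the full merge tree so that k can be picked afterwards.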

Simple way
• Group folks on
• Height
• What you eat
• Where you are from (state)
• Next time a new person comes in – let us predict which group they belong to

Demos
• USArrests Data
• Wine Data

Challenges and next steps
• How many groups/clusters
• How many mis-groupings (Evaluation)
• Associating topics, and what to do after clustering
• Once clusters are formed – someone can name them
• Now run supervised methods on data to learn more

Supervised learning
• Given a label L for a set of attributes (a1, a2, a3, …)
• Learn the model which can predict the label based on attributes

Simple way to understand Classification
• Let us say we are labelled North Indian or South Indian
• How
• Attributes (language, food, movie language, music …)
• Basically learning the link between
• An observed data X and
• A variable y usually called target or labels.

Supervised
• Data
• One dataset for training which has label
• One dataset for testing
• Example
• Classification (spam, order data, disease data, Kinect gesture)
• Classification
• binary vs. multiclass
• Regression (sales)
• Ranking
• Search
• Predictive maintenance
• Recommendation
• Netflix – the Netflix Prize competition popularized SVD-style matrix factorization

Demos
• Trees
• DecisionTree – Python (show train and test, validation)
• Decision tree – R
• BigML (network dependent)
• Challenge –
• one input every time

Few more terms to overcome data issues
• Bagging – (used with tree models) (variance reduction)
• Train an ensemble of models from bootstrap samples
• Get a vote amongst models
• Class predicted by the majority of the models wins
• Get an average if outputs are scores or probabilities
• * Bootstrap – a random sample of the dataset, drawn with replacement
• Boosting (bias reduction)
• Like bagging, but penalizes & learns from misclassification
• Challenge of assigning “weights” to misclassified instances to penalize them
• Start with equal weights, then raise the weights of misclassified instances each round so the next model focuses on them
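The bagging steps above (bootstrap samples, one model per sample, majority vote) can be sketched as follows. This is only an illustration under simplifying assumptions: the "model" is a one-split decision stump on a single feature, and the data and the names `train_stump`, `bagging_fit` and `bagging_predict` are hypothetical.

```python
import random
from collections import Counter


def train_stump(sample):
    """Fit a one-split tree: predict 1 when x > threshold, else 0."""
    xs = [x for x, _ in sample]
    # Candidate thresholds: below every x (predict all 1) or at each seen x.
    candidates = [min(xs) - 1] + sorted(set(xs))

    def errors(t):
        return sum((x > t) != bool(y) for x, y in sample)

    return min(candidates, key=errors)


def bagging_fit(data, n_models, rng):
    models = []
    for _ in range(n_models):
        # Bootstrap: same size as the dataset, drawn with replacement.
        sample = [rng.choice(data) for _ in data]
        models.append(train_stump(sample))
    return models


def bagging_predict(models, x):
    # Each stump votes; the class predicted by the majority wins.
    votes = Counter(int(x > t) for t in models)
    return votes.most_common(1)[0][0]


# Hypothetical (x, label) data with one noisy label at x = 3.
data = [(1, 0), (2, 0), (3, 1), (4, 0), (5, 0),
        (6, 1), (7, 1), (8, 1), (9, 1), (10, 1)]
models = bagging_fit(data, n_models=25, rng=random.Random(0))
print([bagging_predict(models, x) for x in (1, 5.5, 10)])
```

A random forest adds one more source of randomness on top of this: each split looks at only a random subset of the features.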

Demo
• RandomForest
• Each tree is trained on a bootstrap sample of n training rows out of N; at each decision node of the tree, it randomly selects m input features from the total M input features (m ≈ √M) and picks the best split among them. Finally, each tree in the forest votes for the result.
• Evaluation
• Apply a loss function to margins (penalize misclassification, reward correct predictions)

Regression
• Explain the relationship between variables (dependent vs. independent)
• Simple linear – y = W0 + W1x1
• Estimate the weights to predict y
• Multivariate (several independent variables) – y = W0 + W1x1 + W2x2 + …
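For the one-variable case, estimating the weights has a closed-form least-squares solution. The sketch below is a minimal illustration with made-up data; `fit_simple_linear` and `rmse` are hypothetical helper names, and RMSE is just one possible loss on the residuals.

```python
from math import sqrt


def fit_simple_linear(xs, ys):
    """Least-squares estimate of (w0, w1) for y = w0 + w1 * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    w1 = sxy / sxx               # slope
    w0 = mean_y - w1 * mean_x    # intercept
    return w0, w1


def rmse(xs, ys, w0, w1):
    """Root mean squared error of the residuals y - (w0 + w1 * x)."""
    residuals = [y - (w0 + w1 * x) for x, y in zip(xs, ys)]
    return sqrt(sum(r * r for r in residuals) / len(residuals))


# Hypothetical data lying exactly on y = 1 + 2x.
xs = [1, 2, 3, 4, 5]
ys = [3, 5, 7, 9, 11]
w0, w1 = fit_simple_linear(xs, ys)
print(w0, w1, rmse(xs, ys, w0, w1))
```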

Demos
• Excel
• Simple linear – R
• RandomForest – Wine
• Evaluate by applying loss function to residuals

What to measure
• Data
• Cross Validation
• n-fold cross-validation
• Leave-one-out validation
• Hold out
• End of day – how much data is enough, and is there bias in the data (only certain kinds of labels)?
• Model Results
• Contingency table (false negatives & false positives are bad)
• ROC & AUC (coverage curve) (true positive rate vs. false positive rate)
• Precision/Recall (from search world)
• F-measure
• Lift (not interested in accuracy on entire dataset, want for 5%,10% of dataset)
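To make n-fold cross-validation concrete, here is a minimal sketch of how the row indices are split. It is an illustration only: `k_fold_indices` is a hypothetical helper, and in practice the rows are usually shuffled first.

```python
def k_fold_indices(n_samples, k):
    """Yield (train_idx, test_idx) pairs; each row is tested exactly once."""
    indices = list(range(n_samples))
    # Spread the remainder over the first folds so sizes differ by at most 1.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0)
                  for i in range(k)]
    start = 0
    for size in fold_sizes:
        test_idx = indices[start:start + size]
        train_idx = indices[:start] + indices[start + size:]
        yield train_idx, test_idx
        start += size


folds = list(k_fold_indices(10, 3))
for train_idx, test_idx in folds:
    print("train", train_idx, "test", test_idx)

# Leave-one-out is the special case k == n_samples.
loo = list(k_fold_indices(4, 4))
```

The model is trained and scored once per fold, and the fold scores are averaged; hold-out is the degenerate case of a single train/test split.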

Is Model working right
              Predicted +ve   Predicted -ve   Total
Actual +ve         40              15           55
Actual -ve          5              40           45
Total              45              55          100
Precision = TP / (TP + FP) = 40/45
Recall = TP / (TP + FN) = 40/55
F measure (harmonic mean) = 2 / ((1/prec) + (1/rec))
Accuracy = (TP (40) + TN (40)) / (40+15+5+40) = 80/100
How much accuracy is enough?
Lift – how much better than random guessing
Lift and accuracy are not necessarily correlated
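The same numbers can be worked through in code. This is a small sketch that takes the four cells of the table above (TP = 40, FN = 15, FP = 5, TN = 40); the function name `classification_metrics` is made up for the example.

```python
def classification_metrics(tp, fn, fp, tn):
    """Precision, recall, F measure and accuracy from a 2x2 contingency table."""
    precision = tp / (tp + fp)   # of everything predicted +ve, how much was right
    recall = tp / (tp + fn)      # of everything actually +ve, how much was found
    f_measure = 2 / ((1 / precision) + (1 / recall))  # harmonic mean
    accuracy = (tp + tn) / (tp + fn + fp + tn)
    return {
        "precision": precision,
        "recall": recall,
        "f_measure": f_measure,
        "accuracy": accuracy,
    }


metrics = classification_metrics(tp=40, fn=15, fp=5, tn=40)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

With these counts precision is about 0.889, recall about 0.727, and both the F measure and accuracy come out at 0.8.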

Challenge with Model
• Overfitting
• Avoid Bias and have less variance
• Use Regularization
• L1 (Lasso)
• L2 (Ridge)
• If time permits show the alpha effect
• Look for “overfitting model” , “bias and variance”
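The alpha effect can be seen even with a single feature and no intercept, where both penalties have a closed form. The sketch below assumes the objectives ½Σ(y − wx)² + (α/2)w² for ridge (L2) and ½Σ(y − wx)² + α|w| for lasso (L1); library implementations scale alpha differently, and the function names and data are hypothetical.

```python
from math import copysign


def ridge_weight(xs, ys, alpha):
    """L2 penalty: shrinks the weight towards zero but never exactly to zero."""
    sxy = sum(x * y for x, y in zip(xs, ys))
    sxx = sum(x * x for x in xs)
    return sxy / (sxx + alpha)


def lasso_weight(xs, ys, alpha):
    """L1 penalty: soft-thresholding, which can set the weight exactly to zero."""
    sxy = sum(x * y for x, y in zip(xs, ys))
    sxx = sum(x * x for x in xs)
    shrunk = max(abs(sxy) - alpha, 0.0)
    return copysign(shrunk, sxy) / sxx


# Hypothetical data lying exactly on y = 2x.
xs = [1, 2, 3]
ys = [2, 4, 6]
for alpha in (0, 14, 28, 56):
    print(alpha, ridge_weight(xs, ys, alpha), lasso_weight(xs, ys, alpha))
```

As alpha grows, both weights move away from the unpenalized value of 2; the lasso weight hits exactly zero once alpha is large enough, which is why L1 is also used for feature selection.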

Challenge with Data
• Categorical, ordinal, quantitative
• Measures – mean, median, variance, std deviation, range, shape (skewness)
• Always observe to get “feel”/smell of data
• Discretize/Thresholding (convert quantitative feature)
• Missing feature(s) –
• What do you do – impute with the median or average
• Data encoding
• Create new from existing vs encode in different way

Feature engineering
• Feature selection
• Intuition, testing correlation
• Subset (Start small and increase) based on some error function
• Feature extraction
• New k dimensions – as combination of older d dimensions
• Linear
• PCA (project onto the directions of greatest variance – sensitive to outliers)
• LDA (supervised method of dimensionality reduction for classification)
• FA (Factor Analysis), Multidimensional Scaling (preserves distances between points)
• Nonlinear – IsoMap (geodesic distance) and Locally Linear Embedding (LLE)

What we could not cover
• Mechanisms
• Reinforcement Learning (punishment/rewards to learn better)
• Algorithm types
• Perceptron (back propagation, SOM, ..)
• SVM
• LDA (Latent Dirichlet Allocation) and friends for the unstructured world
• Regression (OLS, logistic, stepwise, MARS)
• Regularization (ridge/lasso)
• Trees (GBM, C4.5, ID3…)
• Bayesian
• Kernel (radial)
• Deep learning (DBN, Boltzmann..)
• Clustering (Expectation Max)
• Recommendation
• Probability (distributions) & Linear Algebra
• Constraint Solving and Optimization (Solver, OpenSolver..)

Tools
• R
• scikit-learn
• Theano
• Weka
• KNIME
• Recommender (.net….)
• DataTau
• BigML
• WiseIO
• Skytree
• SAS/SPSS
• YHatr

Books
• Bishop
• Alpaydin
• John Foreman
• PyMC – Search query (Bayesian-Methods-for-Hackers)
• Scikit –
• jakevdp – “scikit jake 2014 tutorial”
• Olivier – “scikit olivier grisel tutorial”
• Recommender (http://mymedialite.net/) – Zeno Gantner

What you will be doing
• Data
• Touch/feel (visualize),breathe it in
• Cleaning, scaling/normalization
• Selecting
• Algorithm (choose the task)
• Classification
• Regression
• Ranking (recommendation, search results)
• Amongst
• Evaluate algorithms against each other & refine/calibrate
• AUC, ROC, RMSE etc…

If time & net permit – Yhatr demo
• Because you need to deploy, test & use the model
• Yhatr provides good hosting (theirs, or host your own)

Thanks for your time
• Please fill the evaluation form
• See you next time
