An introduction to machine learning. I gave a talk on this; the video can be found here:
http://www.techgig.com/expert-speak/Introduction-to-Machine-Learning-616
2. 2
What is learning?
Tom Mitchell: A program learns from experience E with respect to some task T and
performance measure P if its performance at T, as measured by P, improves with E.
In plain English: performing a task better with experience and training.
Key Elements:
• Remember or memorize the past experiences E
• Generalize from the experiences E
Observe how kids learn to read words: they make mistakes even when reading
previously known words, then correct themselves. This happens especially with
words that have silent letters and words ending in "-tion".
Warning: This is a highly mathematical subject!
3. 3
What is Machine Learning
How would you build a computer program which “learns” from experiences?
Generally a three-phase process:
• Express Experience E mathematically: Build a
set of features related to the experiences (Feature
Extraction from raw data)
• Memorize and Generalize: Build a mathematical
model or set of rules from the experiences (training)
• Apply the mathematical model to features of the
future tasks
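The three phases above can be sketched in a few lines of Python. This is only a toy under stated assumptions: the messages, the labels "spam" and "ham", the two features (message length and number of exclamation marks), and the nearest-mean model are all hypothetical choices made for illustration, not something from the talk.

```python
# Toy illustration of the three phases; data, features, and model are assumptions.

def extract_features(text):
    """Phase 1: express a raw experience (a message) as numbers."""
    return [len(text), text.count("!")]

def train(examples):
    """Phase 2: summarize the experiences as one mean feature vector per label."""
    sums, counts = {}, {}
    for text, label in examples:
        feats = extract_features(text)
        acc = sums.setdefault(label, [0.0] * len(feats))
        for k, value in enumerate(feats):
            acc[k] += value
        counts[label] = counts.get(label, 0) + 1
    return {label: [s / counts[label] for s in acc] for label, acc in sums.items()}

def predict(model, text):
    """Phase 3: apply the model to the features of a new task."""
    feats = extract_features(text)

    def sq_dist(center):
        return sum((f - c) ** 2 for f, c in zip(feats, center))

    # Pick the label whose mean feature vector is closest.
    return min(model, key=lambda label: sq_dist(model[label]))

examples = [
    ("Win cash now!!!", "spam"),
    ("Free prize!!! Click!!!", "spam"),
    ("See you at lunch", "ham"),
    ("Meeting moved to 3pm", "ham"),
]
model = train(examples)
print(predict(model, "Claim your reward!!!"))
```

Real systems differ mainly in how rich the features are and how expressive the model is; the shape of the pipeline stays the same.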
4. 4
Machine Learning in Action…
• Word Lens mobile app
• OCR in web pages:
http://newscarousel.herokuapp.com/scribble-js/Scribble.html
5. 5
Types of ML Systems
• Supervised Learning
• Classification
• Logistic Regression, SVM, NB, Decision Trees, ANN etc.
• Regression
• Recommender Systems*
• User-user/item-item similarity, matrix factorization etc.
• Unsupervised Learning
• Clustering
• K-means, Fuzzy K-Means, Model based (LDA) clustering etc.
• Dimensionality reduction
• Principal Component Analysis (PCA)
• Anomaly Detection
6. 6
Classification
Identify speaker’s gender from the voice spectrum
[Figure: voice samples plotted by amplitude and frequency]
• Training: Build a model using data: {(a_1, f_1, g_1), (a_2, f_2, g_2), …, (a_m, f_m, g_m)}
• Logistic Regression (LR): p(g = F | a, f; θ) = sigmoid(θ_0 + θ_1 a + θ_2 f)
• Decision Boundary: if p < 0.5, predict g = M; otherwise predict g = F
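A minimal sketch of the decision rule above, assuming the model applies the logistic (sigmoid) function to θ0 + θ1a + θ2f. The parameter values and the two inputs are hypothetical, picked only so that one sample falls on each side of the boundary.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def predict_gender(a, f, theta):
    """Return (p, label), where p is the estimated p(g = F | a, f; theta)."""
    theta0, theta1, theta2 = theta
    p = sigmoid(theta0 + theta1 * a + theta2 * f)
    return p, ("M" if p < 0.5 else "F")

# Hypothetical parameters and inputs, chosen only for illustration.
theta = (-4.0, 0.5, 0.02)
print(predict_gender(a=1.0, f=120.0, theta=theta))
print(predict_gender(a=1.0, f=220.0, theta=theta))
```

The boundary p = 0.5 corresponds to θ0 + θ1a + θ2f = 0, a straight line in the amplitude-frequency plane.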
7. 7
Logistic Regression
• If we let
• y = 1 when g = F, and y = 0 when g = M, and define the
vector x = [1, a, f] (the leading 1 pairs with the intercept θ_0)
• and define a function h_θ(x) = sigmoid(θ^T x), where
sigmoid(z) = 1/(1 + e^(-z)). It represents the probability
p(y = 1 | x; θ).
• Cost J(θ) = -Σ(y log(h_θ(x)) + (1 - y) log(1 - h_θ(x))) + λ θ^T θ,
where the sum runs over all training examples and λ ≥ 0 is the
regularization weight.
• Optimization algorithm (gradient descent): Obtain the θ which
minimizes J(θ).
• Evaluate the model θ on cross-validation data; vary λ to find
the best fit.
• Test the model θ against test data: if h_θ(x) ≥ 0.5, predict
gender = F; otherwise predict gender = M.
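The training procedure can be sketched in plain Python as batch gradient descent. Assumptions to note: the regularization term is added to the cost as a penalty (the usual L2 form), x carries a leading 1 so that θ has an intercept, the intercept is penalized too for brevity, and the four data points, the learning rate `alpha`, and the iteration count are hypothetical.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def h(theta, x):
    """h_theta(x) = sigmoid(theta^T x), the estimated p(y = 1 | x; theta)."""
    return sigmoid(sum(t * v for t, v in zip(theta, x)))

def cost(theta, X, y, lam):
    """Cross-entropy over all examples plus the L2 penalty lam * theta^T theta."""
    loss = -sum(yi * math.log(h(theta, xi)) + (1 - yi) * math.log(1 - h(theta, xi))
                for xi, yi in zip(X, y))
    return loss + lam * sum(t * t for t in theta)

def fit(X, y, lam=0.01, alpha=0.1, iters=5000):
    """Batch gradient descent: theta <- theta - alpha * dJ/dtheta."""
    theta = [0.0] * len(X[0])
    for _ in range(iters):
        grad = [2.0 * lam * t for t in theta]      # gradient of the penalty
        for xi, yi in zip(X, y):
            err = h(theta, xi) - yi                # gradient of the cross-entropy
            for k, v in enumerate(xi):
                grad[k] += err * v
        theta = [t - alpha * g for t, g in zip(theta, grad)]
    return theta

def predict(theta, x):
    return 1 if h(theta, x) >= 0.5 else 0

# Hypothetical, pre-scaled data: x = [1, a, f]; y = 1 for F, 0 for M.
X = [[1, 0.2, 0.3], [1, 0.4, 0.2], [1, 0.3, 0.8], [1, 0.5, 0.9]]
y = [0, 0, 1, 1]
theta = fit(X, y)
print([predict(theta, x) for x in X])
```

In practice one would repeat `fit` for several values of `lam` and keep the one with the lowest cost on held-out cross-validation data.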
8. 8
Recommender Systems
• Training data: user j's rating for item i, y(i,j)
• Guess ratings for the other items: the blanks
[Figure: ratings matrix of items and users with known ratings from 1 to 5; blank cells are unrated]
• Collaborative Filtering: k features of each item:
• Feature vector x^i for item i: {x^i_1, x^i_2, …, x^i_k}
• Parameter vector θ^j for user j: {θ^j_1, θ^j_2, …, θ^j_k}
• User j's estimated rating for item i: (θ^j)^T x^i
9. 9
Recommender Systems
• Learn x^i and θ^j:
• Given x^i, minimize Σ((θ^j)^T x^i - y(i,j))^2 over all i where user j
has rated item i to find the optimum θ^j.
• Given θ^j, minimize Σ((θ^j)^T x^i - y(i,j))^2 over all j where user j
has rated item i to find the optimum x^i.
• Simultaneously: minimize Σ((θ^j)^T x^i - y(i,j))^2 over all (i,j)
where user j has rated item i to find the optimum θ^j and x^i.
• Find factors X and Θ of the ratings matrix Y such that Y ≈ X Θ^T
• Other Algorithms: user-user similarity, item-item
similarity
• Useful even when users are not humans, e.g.,
Wiki documents as users and links as items.
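The simultaneous minimization can be sketched with stochastic gradient descent over the known ratings. This is one possible way to find the factors, not necessarily the method used in the talk. The ratings dictionary, the number of features `k`, the learning rate `alpha`, and the small regularization weight `lam` (which is not part of the objective on the slide) are all assumptions made for illustration.

```python
import random

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def factorize(ratings, n_items, n_users, k=2, alpha=0.02, lam=0.01,
              epochs=5000, seed=0):
    """Learn item features X and user parameters Theta from known ratings."""
    rng = random.Random(seed)
    X = [[rng.random() for _ in range(k)] for _ in range(n_items)]
    Theta = [[rng.random() for _ in range(k)] for _ in range(n_users)]
    for _ in range(epochs):
        for (i, j), y in ratings.items():
            err = dot(Theta[j], X[i]) - y          # prediction error on y(i,j)
            for f in range(k):
                x_if, t_jf = X[i][f], Theta[j][f]
                # Gradient step on the squared error, plus a small L2 penalty.
                X[i][f] -= alpha * (err * t_jf + lam * x_if)
                Theta[j][f] -= alpha * (err * x_if + lam * t_jf)
    return X, Theta

def rmse(ratings, X, Theta):
    errors = [(dot(Theta[j], X[i]) - y) ** 2 for (i, j), y in ratings.items()]
    return (sum(errors) / len(errors)) ** 0.5

# Hypothetical ratings: keys are (item i, user j); missing pairs are the blanks.
ratings = {
    (0, 0): 5, (0, 1): 4,
    (1, 0): 4, (1, 2): 1,
    (2, 1): 1, (2, 2): 5,
    (3, 0): 1, (3, 2): 4,
}
X, Theta = factorize(ratings, n_items=4, n_users=3)
print("training RMSE:", rmse(ratings, X, Theta))
print("guess for item 0, user 2:", dot(Theta[2], X[0]))
```

Once X and Θ are learnt, any blank cell (i, j) is filled in with the dot product of θj and xi, which is the Y ≈ X ΘT view of the same model.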
10. 10
Clustering
• Example: Top two occurring terms in documents
• Training set: {x1, x2, x3, …, xm}, where each xi is a feature vector
• No labels (yi) specified
[Figure: documents plotted on axes #Term 1 and #Term 2]
11. 11
Clustering: Applications
• Computer Science
• Document Clustering
• Google news: Organizing similar news from different sources
• News Categorizing
• Social networks analysis
• Features reduction: Speeding up ML pipelines
• Cluster Centroids as new features
• Image compression (reduce the number of colors): pre-processing for faster, memory-efficient
computations
• Deep Learning: Alternate supervised and unsupervised learning
• Recommender Systems
• Physics:
• Astronomy
• Particle physics
• Market segmentation
• http://en.wikipedia.org/wiki/Cluster_analysis#Applications
12. 12
K-Means Clustering
1. Randomly choose initial cluster centroids
2. Assign each training example to a cluster: Pick
closest centroid
3. Move centroids: Re-compute centroids as average
of training points assigned
4. Repeat 2, 3 for max iterations count or convergence
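The four steps can be sketched in plain Python as follows. The sample points (counts of two terms per document), the choice of squared Euclidean distance, and the rule of leaving an empty cluster's centroid where it is are assumptions made for illustration.

```python
import random

def sq_dist(p, q):
    return sum((a - b) ** 2 for a, b in zip(p, q))

def kmeans(points, k, max_iters=100, seed=0):
    rng = random.Random(seed)
    # Step 1: randomly choose initial centroids from the training examples.
    centroids = [list(p) for p in rng.sample(points, k)]
    assignment = None
    for _ in range(max_iters):                       # Step 4: repeat
        # Step 2: assign each example to its closest centroid.
        new_assignment = [
            min(range(k), key=lambda c: sq_dist(p, centroids[c]))
            for p in points
        ]
        if new_assignment == assignment:             # converged: nothing moved
            break
        assignment = new_assignment
        # Step 3: move each centroid to the average of its assigned points.
        for c in range(k):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:                              # keep old centroid if empty
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return centroids, assignment

# Hypothetical documents described by (#Term 1, #Term 2) counts.
points = [(1, 0), (1, 1), (10, 10), (10, 11)]
centroids, assignment = kmeans(points, k=2)
print(centroids, assignment)
```

Different random starts can lead to different final clusters on less clear-cut data, so it is common to run the algorithm several times and keep the result with the lowest total distance.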
13. 13
Popular Machine Learning Tools
• Apache Mahout:
• Various Recommender Systems, clustering, and
classification algorithms
• Java based, with some algorithms having Hadoop MapReduce
implementations. Recently started Spark
implementations, with a new ML DSL.
• Stable, widely used in production, community support.
• R:
• Popular in the statistics world; has its own language
• GNU license
• Spark MLlib, MLbase (http://www.mlbase.org/)
• Scala based. Runs on Spark (in memory, distributed)
14. 14
Popular Machine Learning tools
• Weka:
• Java based
• GNU License
• Vowpal Wabbit: http://hunch.net/~vw/,
https://github.com/JohnLangford/vowpal_wabbit
• Google Prediction API
• http://en.wikipedia.org/wiki/Machine_learning#Software
15. 15
Machine Learning In Action
• Mobile:
• Speech Recognition: Google Now, Siri
• Languages/NLP: Google Translate
• Vision: face recognition in cameras and online photos, OCR
• Misc: Handwriting-driven MyScript Calculator and Stylus
keyboard
• Applications
• OCR of printed documents and handwriting
• Automatic tagging of photos based on similar faces
• Biology and Medicine:
• DNA analysis for likelihood of diseases, personalized drugs
etc.
16. 16
Resources
• Online Courses:
• Coursera: Machine Learning (Andrew Ng)
• Coursera: Neural Networks for Machine Learning (Geoffrey
Hinton)
• Udacity: Intro to Artificial Intelligence (Peter Norvig, Sebastian
Thrun)
• CMU: Introduction to Machine Learning (Alex Smola)
• Berkeley: Scalable Machine Learning (Alex Smola)
• Books:
• Pattern Recognition and Machine Learning: Christopher Bishop
• Machine Learning: Tom Mitchell
• Mahout In Action
• Artificial Intelligence: A Modern Approach (http://aima.cs.berkeley.edu/)
• Machine Learning in Action
Let's look at some real-life applications. Word Lens is a very popular mobile app which performs OCR, translation, and inline display of the translated text on the app screen.
It uses a chain of ML classification algorithms: it detects areas of text in the image, performs OCR, and then translates the text.
Scribble-js performs classification of scribbled text using two pre-trained models: Logistic Regression and an Artificial Neural Network.
Applications in particle physics:
http://www.techrepublic.com/blog/european-technology/cern-where-the-big-bang-meets-big-data/
https://developers.google.com/events/io/sessions/333315382
Recommender systems are a special kind of supervised learning. Here features are learnt from the user preferences.
Apart from classical ML applications, clustering is also used in image compression.
Canopy clustering is another clustering algorithm, usually used to pick initial cluster centroids before running k-means clustering.
Step 3, moving each cluster centroid to the average of its points, minimizes the total squared distance when the distance measure is Euclidean. For non-Euclidean distance measures, the algorithm may not converge.
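A short derivation of why the average is the right choice in the Euclidean case, assuming the cost of one cluster is the sum of squared Euclidean distances from its n points to the centroid (the symbols J, μ, n, and x_i are introduced here only for illustration):

```latex
J(\mu) = \sum_{i=1}^{n} \lVert x_i - \mu \rVert^2,
\qquad
\nabla_\mu J = -2 \sum_{i=1}^{n} (x_i - \mu) = 0
\;\Longrightarrow\;
\mu = \frac{1}{n} \sum_{i=1}^{n} x_i
```

J is convex in μ, so this stationary point is the minimizer. Neither the assignment step nor the centroid step can then increase the total cost, which is why the Euclidean case settles down. With another distance measure the average need not minimize the cluster cost, so this guarantee is lost.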