Machine Learning with R

Machine Learning
with R
Barbara Fusinska
@BasiaFusinska

About me
Programmer
Machine Learning
Data Solutions Architect
@BasiaFusinska
https://github.com/BasiaFusinska/MachineLearningWithR

Agenda
• What’s Machine Learning?
• Exploratory Data Analysis
• Classification
• Clustering
• Regression

Setup
• Install R:
https://www.r-project.org/
• Install RStudio:
https://www.rstudio.com/
• GitHub repository:
https://github.com/BasiaFusinska/Ma
chineLearningWithR
• Packages

Movies Genres
Title # Kisses # Kicks Genre
Taken 3 47 Action
Love story 24 2 Romance
P.S. I love you 17 3 Romance
Rush hours 5 51 Action
Bad boys 7 42 Action
Question:
What is the genre of
Gone with the wind
?

Data-based classification
Id Feature 1 Feature 2 Class
1. 3 47 A
2. 24 2 B
3. 17 3 B
4. 5 51 A
5. 7 42 A
Question:
What is the class of the entry
with the following features:
F1: 31, F2: 4
?

Data Visualization
0
10
20
30
40
50
60
0 10 20 30 40 50
Rule 1:
If on the left side of the
line then Class = A
Rule 2:
If on the right side of the
line then Class = B
A
B

Supervised
learning
• Classification, regression
• Label, target value
• Training & Validation
phases

Unsupervised
learning
• Clustering, feature
selection
• Finding structure of data
• Statistical values
describing the data

Supervised Machine Learning workflow
Clean data Data split
Machine Learning
algorithm
Trained model Score
Preprocess
data
Training
data
Test data

Publishing the model
Machine Learning
Model
Model Training
Published
Machine Learning
Model
Prediction
Training data
Publish model
Test stream
Scores

Exploratory
Data Analysis
Demo

Classification problem
Model training
Data & Labels

Classification data
Source #Links #Characters ... Fake
TopNews 10 2750 … T
Twitter 2 120 … F
TopNews 235 502 … F
Channel X 1530 3024 … T
Twitter 24 70 … F
StoryLeaks 722 1408 … T
Facebook 98 230 … T
… … … … ...
Features
Labels

Task: Iris EDA
• Descriptive statistics (dimensions,
rows, columns, data types,
correlation)
• Data visualization (distributions,
outliers)
• Features distributions & classes
separation
• 2D visualisation
http://archive.ics.uci.edu/ml/datasets/Iris

K-Nearest Neighbours Algorithm
• Object is classified by a majority
vote
• k – algorithm parameter
• Distance metrics: Euclidean
(continuous variables), Hamming
(text)
?

Naïve Bayes classifier
𝑝 𝐶 𝑘 𝒙) =
𝑝 𝐶 𝑘 𝑝 𝒙 𝐶 𝑘)
𝑝(𝒙)
𝒙 = (𝑥1, … , 𝑥 𝑘)
𝑝 𝐶 𝑘 𝑥1, … , 𝑥 𝑘) likelihood
evidence
prior
posterior

Naïve Bayes example
Sex Height Weight Foot size
Male 6 190 11
Male 6.2 170 10
Female 5 130 6
… … … …
Sex Height Weight Foot size
? 5.9 140 8
𝑝 𝑚𝑎𝑙𝑒 𝒙 =
𝑝 𝑚𝑎𝑙𝑒 𝑝 5.9 𝑚𝑎𝑙𝑒 𝑝 140 𝑚𝑎𝑙𝑒 𝑝(8|𝑚𝑎𝑙𝑒)
𝑒𝑣𝑖𝑑𝑒𝑛𝑐𝑒
𝑒𝑣𝑖𝑑𝑒𝑛𝑐𝑒 = 𝑝 𝑚𝑎𝑙𝑒 𝑝 5.9 𝑚𝑎𝑙𝑒 𝑝 140 𝑚𝑎𝑙𝑒 𝑝 8 𝑚𝑎𝑙𝑒 +
𝑝 𝑓𝑒𝑚𝑎𝑙𝑒 𝑝 5.9 𝑓𝑒𝑚𝑎𝑙𝑒 𝑝 140 𝑓𝑒𝑚𝑎𝑙𝑒 𝑝(8|𝑓𝑒𝑚𝑎𝑙𝑒)
𝑝 𝑓𝑒𝑚𝑎𝑙𝑒 𝒙 =
𝑝 𝑓𝑒𝑚𝑎𝑙𝑒 𝑝 5.9 𝑓𝑒𝑚𝑎𝑙𝑒 𝑝 140 𝑓𝑒𝑚𝑎𝑙𝑒 𝑝(8|𝑓𝑒𝑚𝑎𝑙𝑒)
𝑒𝑣𝑖𝑑𝑒𝑛𝑐𝑒

Logistic regression
𝑧 = 𝛽0 + 𝛽1 𝑥1 + ⋯ + 𝛽 𝑘 𝑥 𝑘
𝑦 =
1 𝑓𝑜𝑟 𝑧 > 0
0 𝑓𝑜𝑟 𝑧 < 0
𝑦 =
1 𝑓𝑜𝑟 𝜙(𝑧) > 0.5
0 𝑓𝑜𝑟 𝜙(𝑧) < 0.5
Logistic function
Coefficients
Best fit of β

Evaluation methods for classification
Confusion
Matrix
Reference
Positive Negative
Prediction
Positive TP FP
Negative FN TN
Receiver Operating Characteristic
curve
Area under the curve
(AUC)
𝐴𝑐𝑐𝑢𝑟𝑎𝑐𝑦 =
#𝑐𝑜𝑟𝑟𝑒𝑐𝑡
#𝑝𝑟𝑒𝑑𝑖𝑐𝑡𝑖𝑜𝑛𝑠
=
𝑇𝑃 + 𝑇𝑁
𝑇𝑃 + 𝑇𝑁 + 𝐹𝑃 + 𝐹𝑁
𝑃𝑟𝑒𝑐𝑖𝑠𝑖𝑜𝑛 =
𝑇𝑃
𝑇𝑃 + 𝐹𝑃
𝑅𝑒𝑐𝑎𝑙𝑙 = 𝑆𝑒𝑛𝑠𝑖𝑡𝑖𝑣𝑖𝑡𝑦 =
𝑇𝑃
𝑇𝑃 + 𝐹𝑁
𝑆𝑝𝑒𝑐𝑖𝑓𝑖𝑐𝑖𝑡𝑦 =
𝑇𝑁
𝑇𝑁 + 𝐹𝑁
How good at avoiding
false alarms
How good it is at
detecting positives

Task: Iris
Classification
• Data preprocessing
• Split data for training and tests
sets
• Classification using: kNN and Naïve
Bayes
• Performance evaluation
• Results Visualisation

Task: Binary
Classification
• Only two slasses in the dataset
(versicolor & virginica)
• Classification using logistic
regression
• Results Visualisation

Task: Resampling
& Tuning
• Repeated k-fold cross validation
• Use Naïve Bayes as classification
algorithm
• Tune the parameters using specific
values

Hierarchical clustering
• Decision of where the cluster
should be split
• Metric: distance between pairs
of observation
• Linkage criterion: dissimilarity of
sets

Evaluating
methods for
clustering
• Sum of squares
• Class based measures
• Underlying true

Task: Iris
Clustering
• Clustering using k-means and
hierarchies
• Compare clusters with the original
classes assignments
• Visualise the findings

Regression problem
• Dependent value
• Predicting the real value
• Fitting the coefficients
• Analytical solutions
• Gradient descent
𝑓 𝒙 = 𝛽0 + 𝛽1 𝑥1 + ⋯ + 𝛽 𝑘 𝑥 𝑘

Ordinary linear regression
Residual sum of squares (RSS)
𝑆 𝑤 =
𝑖=1
𝑛
(𝑦𝑖 − 𝑥𝑖
𝑇
𝑤)2
= 𝑦 − 𝑋𝑤 𝑇
𝑦 − 𝑋𝑤
𝑤 = 𝑎𝑟𝑔 min
𝑤
𝑆(𝑤)

Task: Prestige EDA
• Descriptive statistics (dimensions,
rows, columns, data types,
correlation)
• Data visualization (distributions,
outliers)
• Handle missing data
• Features significance

Evaluation methods for regression
• Errors
𝑅𝑀𝑆𝐸 = 𝑖=1
𝑛
(𝑓𝑖 − 𝑦𝑖)2
𝑛
𝑅2 = 1 −
(𝑓𝑖 − 𝑦𝑖)2
( 𝑦 − 𝑦𝑖)2
• Statistics (t, ANOVA)

Residuals vs
Fitted
• Check if residuals have non-
linear patterns
• Check if the model captures
the non-linear relationship
• Should show equally spread
residuals around the
horizontal line

Normal Q-Q
• Shows if the residuals are
normally distributed
• Values should be lined on the
straight dashed line
• Check if residuals do not
deviate severely

Scale-Location
• Show if residuals are spread
equally along the ranges of
predictors
• Test the assumption of equal
variance (homoscedasticity)
• Should show horizontal line
with equally (randomly)
spread points

Residuals vs
Leverage
• Helps to find influential cases
• When outside of the Cook’s
distance the cases are
influential
• With no influential cases
Cook’s distance lines should
be barely visible

Task: Prestige
Regression
• Numeric and categorical features
• Other than linear relations
• Combining the features

Categorical data for regression
• Categories: A, B, C are coded as
dummy variables
• In general if the variable has k
categories it will be decoded into
k-1 dummy variables
Category V1 V2
A 0 0
B 1 0
C 0 0
𝑓 𝒙 = 𝛽0 + 𝛽1 𝑥1 + ⋯ + 𝛽𝑗 𝑥𝑗 + 𝛽𝑗+1 𝑣1 + ⋯ + 𝛽𝑗+𝑘−1 𝑣 𝑘

Categorical data for regression
𝑓 𝑥 = 𝛽0 + 𝛽1 𝑥 + 𝛽2 𝑣1 + ⋯ + 𝛽 𝑘 𝑣 𝑘−1 +
𝛽 𝑘+1 𝑣1 𝑥 + ⋯ + 𝛽2𝑘−1 𝑣 𝑘−1 𝑥
𝑦 ~ 𝑥 + 𝑐𝑎𝑡 + 𝑥: 𝑐𝑎𝑡

Keep in touch
BarbaraFusinska.com
@BasiaFusinska
https://github.com/BasiaFusinska/MachineLearningWithR

Machine Learning with R

Recomendados

Recomendados

Más contenido relacionado

La actualidad más candente

La actualidad más candente (20)

Similar a Machine Learning with R

Similar a Machine Learning with R (20)

Más de Barbara Fusinska

Más de Barbara Fusinska (20)

Último

Último (20)

Machine Learning with R