Machine Learning with R
Barbara Fusinska
@BasiaFusinska
About me
• Data Science Freelancer
• Machine Learning Programmer
Agenda
• Machine Learning
• R platform
• Machine Learning with R
• Classification problem
• Linear Regression
• Clustering
Machine Learning?
Movies Genres
Title             # Kisses   # Kicks   Genre
Taken             3          47        Action
Love story        24         2         Romance
P.S. I love you   17         3         Romance
Rush hours        5          51        Action
Bad boys          7          42        Action
Question: What is the genre of Gone with the wind?
Data-based classification
Id   Feature 1   Feature 2   Class
1.   3           47          A
2.   24          2           B
3.   17          3           B
4.   5           51          A
5.   7           42          A
Question: What is the class of the entry with the following features: F1: 31, F2: 4?
Data Visualization
[Scatter plot of the five entries with a line separating classes A and B]
Rule 1: If on the left side of the line then Class = A
Rule 2: If on the right side of the line then Class = B
Chick sexing
Aircraft recognition
Supervised learning
• Classification, regression
• Label, target value
• Training & Validation phases
Unsupervised learning
• Clustering, feature selection
• Finding structure of data
• Statistical values describing the data
Publishing the model
[Diagram: Training data -> Model Training -> Machine Learning Model -> Publish model -> Published Machine Learning Model; Test stream -> Prediction -> Scores]
R language
Why R?
• Ross Ihaka & Robert Gentleman
• Successor of S
• Open source
• Community driven
• #1 for statistical computing
• Exploratory Data Analysis
• Machine Learning
• Visualisation
Setup
• Install R:
• Install RStudio:
• GitHub repository:
• Packages
Supervised Machine Learning workflow
[Diagram: Preprocess data -> Clean data -> Data split -> Training data -> Machine Learning algorithm -> Trained model -> Score; Test data -> Trained model]
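The workflow above can be sketched in a few lines of base R. This is only an illustration under assumptions that are not on the slide: the built-in mtcars data stands in for the clean data, the split is 70/30, lm() plays the role of the machine learning algorithm, and RMSE is used as the score.

```r
# Minimal sketch of the supervised workflow (all choices below are assumptions)
set.seed(1)                                    # reproducible split
cleanData <- na.omit(mtcars)                   # clean data: drop incomplete rows

# Data split: 70% of the rows for training, the rest for testing
trainRows <- sample(nrow(cleanData), size = round(0.7 * nrow(cleanData)))
trainingData <- cleanData[trainRows, ]
testData <- cleanData[-trainRows, ]

# Machine learning algorithm -> trained model
trainedModel <- lm(mpg ~ wt + hp, data = trainingData)

# Score the trained model on the held-out test data
predicted <- predict(trainedModel, newdata = testData)
rmse <- sqrt(mean((predicted - testData$mpg)^2))
```

Keeping the test rows out of training is what makes the score an estimate of performance on unseen data.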
Classification problem
Model training
Data & Labels
Data preparation
32 x 32 (0-1) -> 8 x 8 (0..16)
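One way to get from a 32 x 32 bitmap with values 0-1 to an 8 x 8 grid with values 0..16 is to count the set pixels in each non-overlapping 4 x 4 block. The slide does not name the method, so the sketch below is an assumption that merely fits the stated ranges; downsample_bitmap and the random bitmap are hypothetical.

```r
# Count set pixels per block x block tile (assumed preprocessing, see lead-in)
downsample_bitmap <- function(bitmap, block = 4) {
  stopifnot(nrow(bitmap) %% block == 0, ncol(bitmap) %% block == 0)
  rowGroup <- (row(bitmap) - 1) %/% block   # tile index of every pixel's row
  colGroup <- (col(bitmap) - 1) %/% block   # tile index of every pixel's column
  tapply(bitmap, list(rowGroup, colGroup), sum)
}

set.seed(1)
bitmap <- matrix(rbinom(32 * 32, 1, 0.5), nrow = 32)   # hypothetical 0/1 image
features <- downsample_bitmap(bitmap)                  # 8 x 8 matrix of counts
```

Flattening the 8 x 8 result gives 64 features per digit, which matches the 64 feature columns used in the classification code.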
K-Nearest Neighbours Algorithm
• Object is classified by a majority vote of its k nearest neighbours
• k – algorithm parameter
• Distance metrics: Euclidean (continuous variables), Hamming (text)
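As a minimal sketch, the toy table from the data-based classification slide can be classified with a hand-written k-nearest-neighbours vote. The function name knn_classify and the choice k = 3 are assumptions made for illustration; the distance is Euclidean.

```r
# Toy training data from the "Data-based classification" slide
training <- data.frame(
  f1 = c(3, 24, 17, 5, 7),
  f2 = c(47, 2, 3, 51, 42),
  class = c("A", "B", "B", "A", "A")
)

# Hypothetical helper: majority vote among the k nearest rows
knn_classify <- function(training, query, k = 3) {
  distances <- sqrt((training$f1 - query[1])^2 + (training$f2 - query[2])^2)
  nearest <- order(distances)[seq_len(k)]    # indices of the k closest rows
  votes <- table(training$class[nearest])    # count classes among neighbours
  names(votes)[which.max(votes)]
}

predicted <- knn_classify(training, c(31, 4))   # "B"
```

The three nearest entries to (31, 4) are rows 2, 3 and 5, so two votes go to B and one to A.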
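Naïve Bayes classifier
$$p(C_k \mid \boldsymbol{x}) = \frac{p(C_k)\, p(\boldsymbol{x} \mid C_k)}{p(\boldsymbol{x})}, \qquad \boldsymbol{x} = (x_1, \dots, x_k)$$
• posterior: $p(C_k \mid x_1, \dots, x_k)$
• prior: $p(C_k)$
• likelihood: $p(\boldsymbol{x} \mid C_k)$
• evidence: $p(\boldsymbol{x})$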
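Naïve Bayes example
Training data:
Sex      Height   Weight   Foot size
Male     6        190      11
Male     6.2      170      10
Female   5        130      6
…        …        …        …
Entry to classify:
Sex      Height   Weight   Foot size
?        5.9      140      8
$$p(\text{male} \mid \boldsymbol{x}) = \frac{p(\text{male})\, p(5.9 \mid \text{male})\, p(140 \mid \text{male})\, p(8 \mid \text{male})}{\text{evidence}}$$
$$\text{evidence} = p(\text{male})\, p(5.9 \mid \text{male})\, p(140 \mid \text{male})\, p(8 \mid \text{male}) + p(\text{female})\, p(5.9 \mid \text{female})\, p(140 \mid \text{female})\, p(8 \mid \text{female})$$
$$p(\text{female} \mid \boldsymbol{x}) = \frac{p(\text{female})\, p(5.9 \mid \text{female})\, p(140 \mid \text{female})\, p(8 \mid \text{female})}{\text{evidence}}$$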
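The calculation can be sketched in R if each per-feature likelihood is modelled as a Gaussian, which is an assumption (the slide does not say how the likelihoods are estimated). Only the first three training rows come from the slide; the other rows are hypothetical and exist solely so that every class has a variance to estimate. The function name gaussian_nb is also hypothetical.

```r
# Rows 1-3 are from the slide; rows 4-6 are HYPOTHETICAL filler data
train <- data.frame(
  sex    = c("male", "male", "female", "male", "female", "female"),
  height = c(6.0, 6.2, 5.0, 5.8, 5.5, 5.4),
  weight = c(190, 170, 130, 180, 150, 120),
  foot   = c(11, 10, 6, 12, 8, 7)
)
query <- c(height = 5.9, weight = 140, foot = 8)   # entry to classify

gaussian_nb <- function(train, query, label = "sex") {
  features <- names(query)
  classes <- unique(train[[label]])
  # Unnormalised score per class: prior * product of Gaussian densities
  scores <- sapply(classes, function(cl) {
    rows <- train[train[[label]] == cl, features, drop = FALSE]
    prior <- nrow(rows) / nrow(train)
    likelihood <- prod(mapply(
      function(x, column) dnorm(x, mean = mean(column), sd = sd(column)),
      query[features], rows
    ))
    prior * likelihood
  })
  names(scores) <- classes
  scores / sum(scores)   # dividing by the evidence yields posteriors
}

posterior <- gaussian_nb(train, query)
```

Because the evidence is the same for every class, it only rescales the scores; the predicted class is the one with the largest numerator.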
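Logistic regression
$$z = \beta_0 + \beta_1 x_1 + \dots + \beta_k x_k$$
$$y = \begin{cases} 1 & \text{for } z > 0 \\ 0 & \text{for } z < 0 \end{cases}$$
$$y = \begin{cases} 1 & \text{for } \phi(z) > 0.5 \\ 0 & \text{for } \phi(z) < 0.5 \end{cases}$$
• Logistic function: $\phi(z) = \frac{1}{1 + e^{-z}}$
• Coefficients: $\beta_0, \dots, \beta_k$
• Best fit of β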
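A short sketch shows why thresholding z at 0 and thresholding the logistic function at 0.5 give the same labels. The coefficient values and the three input rows are hypothetical, chosen only for illustration.

```r
# Logistic function
logistic <- function(z) 1 / (1 + exp(-z))

beta <- c(-1.5, 0.8, 0.4)          # hypothetical beta0, beta1, beta2
x <- rbind(c(1, 1),                # three hypothetical observations
           c(3, 2),
           c(0, 0.5))

z <- beta[1] + x %*% beta[-1]      # linear predictor: -0.3, 1.7, -1.3
yFromZ <- as.integer(z > 0)                  # rule on z
yFromPhi <- as.integer(logistic(z) > 0.5)    # rule on phi(z)
```

Since the logistic function is increasing and equals 0.5 at z = 0, both rules agree; here they yield 0, 1, 0.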
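Evaluation methods for classification
Confusion Matrix
                        Reference
                        Positive   Negative
Prediction   Positive   TP         FP
             Negative   FN         TN
• Receiver Operating Characteristic curve
• Area under the curve (AUC)
$$\text{Accuracy} = \frac{\#\text{correct}}{\#\text{predictions}} = \frac{TP + TN}{TP + TN + FP + FN}$$
$$\text{Precision} = \frac{TP}{TP + FP}$$
$$\text{Recall} = \text{Sensitivity} = \frac{TP}{TP + FN}$$
How good it is at detecting positives
$$\text{Specificity} = \frac{TN}{TN + FP}$$
How good it is at avoiding false alarms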
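The four measures follow directly from the confusion matrix counts. In the sketch below, accuracy is (TP + TN) / all, precision is TP / (TP + FP), recall is TP / (TP + FN) and specificity is TN / (TN + FP); the function name and the example counts are hypothetical.

```r
# Hypothetical helper: metrics from confusion matrix counts
classification_metrics <- function(tp, fp, fn, tn) {
  c(
    accuracy    = (tp + tn) / (tp + tn + fp + fn),
    precision   = tp / (tp + fp),
    recall      = tp / (tp + fn),   # sensitivity
    specificity = tn / (tn + fp)
  )
}

# Hypothetical counts, for illustration only
metrics <- classification_metrics(tp = 40, fp = 10, fn = 5, tn = 45)
```

With these counts, accuracy is 0.85 and precision is 0.8, while recall (40/45) and specificity (45/55) differ because they use different denominators.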
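# Read data (trainingFile and testFile hold the paths to the CSV files)
trainingSet <- read.csv(trainingFile, header = FALSE)
testSet <- read.csv(testFile, header = FALSE)
# Column V65 is the label; convert it to a factor for classification
trainingSet$V65 <- factor(trainingSet$V65)
testSet$V65 <- factor(testSet$V65)
# Classify
library(caret)
# The model's variable name was lost in extraction; knn.model is an assumed name
knn.model <- knn3(V65 ~ ., data=trainingSet, k=5)
# Predict new values
pred.test <- predict(knn.model, testSet[,1:64], type="class")
# Confusion matrix
confusionMatrix(pred.test, testSet[,65])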
Regression problem
• Dependent value
• Predicting the real value
• Fitting the coefficients
• Analytical solutions
• Gradient descent
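Ordinary linear regression
Residual sum of squares (RSS)
$$S(\beta) = \sum_{i=1}^{n} (y_i - x_i^T \beta)^2 = (y - X\beta)^T (y - X\beta)$$
$$\hat{\beta} = \arg\min_{\beta} S(\beta)$$
$$f(\boldsymbol{x}) = \beta_0 + \beta_1 x_1 + \dots + \beta_k x_k$$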
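Setting the gradient of the residual sum of squares to zero gives the normal equations $X^T X \beta = X^T y$, which is the analytical solution. The sketch below solves them directly and compares the result with lm(); the built-in mtcars data and the chosen predictors are assumptions made for illustration.

```r
# Design matrix with a leading column of ones for the intercept beta0
X <- cbind(1, mtcars$wt, mtcars$hp)
y <- mtcars$mpg

# Solve the normal equations (X'X) beta = X'y
betaHat <- solve(crossprod(X), crossprod(X, y))

# Reference fit with R's built-in least squares
reference <- coef(lm(mpg ~ wt + hp, data = mtcars))
```

lm() uses a QR decomposition rather than the normal equations, which is numerically more stable, but both minimise the same sum of squares.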
Evaluation methods for regression
• Errors
$$RMSE = \sqrt{\frac{\sum_{i=1}^{n} (f_i - y_i)^2}{n}}$$
$$R^2 = 1 - \frac{\sum_{i=1}^{n} (f_i - y_i)^2}{\sum_{i=1}^{n} (\bar{y} - y_i)^2}$$
• Statistics (t, ANOVA)
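Both error measures can be computed by hand from the fitted values f and the observed values y. The sketch uses the built-in mtcars data as an assumed example and checks the hand-computed R squared against the value reported by summary().

```r
model <- lm(mpg ~ wt + hp, data = mtcars)   # assumed example model
f <- fitted(model)                          # predictions f_i
y <- mtcars$mpg                             # observed values y_i

rmse <- sqrt(mean((f - y)^2))
r2 <- 1 - sum((f - y)^2) / sum((mean(y) - y)^2)

r2FromSummary <- summary(model)$r.squared   # same quantity from lm's summary
```

RMSE is expressed in the units of the target, while R squared is unitless and compares the model against always predicting the mean.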
Prestige dataset
Feature     Data type               Description
education   continuous              Average education (years)
income      integer                 Average income (dollars)
women       continuous              Percentage of women
prestige    continuous              Pineo-Porter prestige score for occupation
census      integer                 Canadian Census occupational code
type        multi-valued discrete   Type of occupation: bc, prof, wc
# Prestige ships with the carData package, which library(car) loads
library(car)
# Pairs for the numeric data
pairs(Prestige[,-c(5,6)], pch=21, bg=Prestige$type)
# Linear regression, numerical data
num.model <- lm(prestige ~ education + log2(income) + women, Prestige)
summary(num.model)
--------------------------------------------------
Call:
lm(formula = prestige ~ education + log2(income) + women, data = Prestige)
Residuals:
    Min      1Q  Median      3Q     Max
-17.364  -4.429  -0.101   4.316  19.179
Coefficients:
               Estimate Std. Error t value Pr(>|t|)
(Intercept)   -110.9658    14.8429  -7.476 3.27e-11 ***
education        3.7305     0.3544  10.527  < 2e-16 ***
log2(income)     9.3147     1.3265   7.022 2.90e-10 ***
women            0.0469     0.0299   1.568     0.12
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 7.093 on 98 degrees of freedom
Multiple R-squared: 0.8351, Adjusted R-squared: 0.83
F-statistic: 165.4 on 3 and 98 DF, p-value: < 2.2e-16
Categorical data for regression
• Categories A, B, C are coded as dummy variables
• In general, a variable with k categories is encoded as k-1 dummy variables
Category   V1   V2
A          0    0
B          1    0
C          0    1
$$f(\boldsymbol{x}) = \beta_0 + \beta_1 x_1 + \dots + \beta_j x_j + \beta_{j+1} v_1 + \dots + \beta_{j+k-1} v_{k-1}$$
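R builds this dummy coding automatically when a factor appears in a model formula. The sketch below uses model.matrix() to show the encoding for a hypothetical three-level factor; the variable name category and its values are made up for illustration.

```r
# Hypothetical factor with k = 3 categories
category <- factor(c("A", "B", "C", "B"))

# Default treatment contrasts: the first level (A) is the baseline,
# so only k - 1 = 2 dummy columns are created next to the intercept
dummies <- model.matrix(~ category)
dummyNames <- colnames(dummies)   # "(Intercept)" "categoryB" "categoryC"
```

Rows for category A have zeros in both dummy columns, so the coefficients of the dummies measure the shift relative to A.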
# Linear regression, categorical variable
cat.model <- lm(prestige ~ education + log2(income) + type, Prestige)
summary(cat.model)
--------------------------------------------------
Coefficients:
              Estimate Std. Error t value Pr(>|t|)
(Intercept)   -81.2019    13.7431  -5.909 5.63e-08 ***
education       3.2845     0.6081   5.401 5.06e-07 ***
log2(income)    7.2694     1.1900   6.109 2.31e-08 ***
typeprof        6.7509     3.6185   1.866   0.0652 .
typewc         -1.4394     2.3780  -0.605   0.5465
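# Linear regression, categorical variable split
# (the model's variable name was lost in extraction; type.model is an assumed name)
type.model <- lm(prestige ~ type*education, Prestige)
summary(type.model)
--------------------------------------------------
Coefficients:
                    Estimate Std. Error t value Pr(>|t|)
(Intercept)          -4.2936     8.6470  -0.497    0.621
typeprof             18.8637    16.8881   1.117    0.267
typewc              -24.3833    21.7777  -1.120    0.266
education             4.7637     1.0247   4.649 1.11e-05 ***
typeprof:education   -0.9808     1.4495  -0.677    0.500
typewc:education      1.6709     2.0777   0.804    0.423
# Plot one regression line per occupation type
library(ggplot2)
cf <- type.model$coefficients
ggplot(Prestige, aes(education, prestige)) + geom_point(aes(col=type)) +
  # bc is the baseline: intercept cf[1], slope cf[4]
  geom_abline(slope=cf[4], intercept = cf[1], colour='red') +
  # prof: baseline plus the typeprof and typeprof:education terms
  geom_abline(slope=cf[4] + cf[5], intercept = cf[1] + cf[2], colour='green') +
  # wc: baseline plus the typewc and typewc:education terms
  geom_abline(slope=cf[4] + cf[6], intercept = cf[1] + cf[3], colour='blue')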
Clustering problem
K-means Algorithm
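K-means alternates two steps until the assignments stop changing: assign every point to its nearest center, then move each center to the mean of its points. The function below is a simplified sketch of that loop, not the implementation behind R's kmeans(); the name simple_kmeans, the initRows parameter and the two synthetic blobs are assumptions made for illustration.

```r
simple_kmeans <- function(points, k, initRows = sample(nrow(points), k),
                          maxIter = 100) {
  points <- as.matrix(points)
  centers <- points[initRows, , drop = FALSE]   # initial centers = chosen rows
  assignment <- rep(0L, nrow(points))
  for (iter in seq_len(maxIter)) {
    # Assignment step: squared distance of every point to every center
    dists <- sapply(seq_len(k), function(j) {
      rowSums(sweep(points, 2, centers[j, ])^2)
    })
    newAssignment <- max.col(-dists, ties.method = "first")
    if (all(newAssignment == assignment)) break   # converged
    assignment <- newAssignment
    # Update step: move each center to the mean of its members
    for (j in seq_len(k)) {
      members <- points[assignment == j, , drop = FALSE]
      if (nrow(members) > 0) centers[j, ] <- colMeans(members)
    }
  }
  list(centers = centers, cluster = assignment)
}

# Two well-separated synthetic blobs of 20 points each
set.seed(42)
blobs <- rbind(matrix(rnorm(40, mean = 0, sd = 0.5), ncol = 2),
               matrix(rnorm(40, mean = 10, sd = 0.5), ncol = 2))
# Fixed starting rows (one per blob) keep the example deterministic
result <- simple_kmeans(blobs, k = 2, initRows = c(1, 21))
```

With random starting rows the algorithm can settle in different local optima, which is why kmeans() offers the nstart argument to try several starts.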
Chicago crimes dataset
Data column    Data type
ID             Number
Case Number    String
Arrest         Boolean
Primary Type   Enum
District       Enum
Date
FBI Code       Enum
Longitude      Numeric
Latitude       Numeric
...
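# Read data (crimeFilePath holds the path to the CSV file)
crimeData <- read.csv(crimeFilePath)
# Only data with location, only Assault or Burglary types
crimeData <- crimeData[
  !is.na(crimeData$Latitude) & !is.na(crimeData$Longitude),]
# crimeTypes is not defined on the slide; it is assumed to be a vector of
# crime type names whose elements 2 and 4 are the Assault and Burglary types
selectedCrimes <- subset(crimeData,
  Primary.Type %in% c(crimeTypes[2], crimeTypes[4]))
# Visualise
library(ggplot2)
library(ggmap)
# Get map from Google (current ggmap versions need an API key,
# registered beforehand with register_google())
map_g <- get_map(location=c(lon=mean(crimeData$Longitude, na.rm=TRUE), lat=mean(
  crimeData$Latitude, na.rm=TRUE)), zoom = 11, maptype = "terrain", scale = 2)
ggmap(map_g) + geom_point(data = selectedCrimes, aes(x = Longitude, y = Latitude,
  fill = Primary.Type, alpha = 0.8), size = 1, shape = 21) +
  guides(fill="none", alpha="none", size="none")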
[Map of the selected crimes: Assault & Burglary]
# k-means clustering (k=6)
clusterResult <- kmeans(selectedCrimes[, c('Longitude', 'Latitude')], 6)
# Get the clusters information
# (the wrapping call was lost in extraction; as.data.frame is assumed
# because geom_point needs a data frame)
centers <- as.data.frame(clusterResult$centers)
clusterColours <- factor(clusterResult$cluster)
# Visualise
ggmap(map_g) +
  geom_point(data = selectedCrimes, aes(x = Longitude, y = Latitude,
    alpha = 0.8, color = clusterColours), size = 1) +
  geom_point(data = centers, aes(x = Longitude, y = Latitude,
    alpha = 0.8), size = 1.5) +
  guides(fill="none", alpha="none", size="none")