These slides are for the tutorial on how to use R language for data analysis and Machine Learning tasks.
The workshop was given at OSCON (Austin, TX), 2017
7. Movies Genres
Title # Kisses # Kicks Genre
Taken 3 47 Action
Love story 24 2 Romance
P.S. I love you 17 3 Romance
Rush hours 5 51 Action
Bad boys 7 42 Action
Question:
What is the genre of
Gone with the wind
?
8. Data-based classification
Id Feature 1 Feature 2 Class
1. 3 47 A
2. 24 2 B
3. 17 3 B
4. 5 51 A
5. 7 42 A
Question:
What is the class of the entry
with the following features:
F1: 31, F2: 4
?
13. Supervised Machine Learning workflow
Clean data Data split
Machine Learning
algorithm
Trained model Score
Preprocess
data
Training
data
Test data
14. Publishing the model
Machine Learning
Model
Model Training
Published
Machine Learning
Model
Prediction
Training data
Publish model
Test stream
Scores
25. Evaluation methods for classification
Confusion
Matrix
Reference
Positive Negative
Prediction
Positive TP FP
Negative FN TN
Receiver Operating Characteristic
curve
Area under the curve
(AUC)
𝐴𝑐𝑐𝑢𝑟𝑎𝑐𝑦 =
#𝑐𝑜𝑟𝑟𝑒𝑐𝑡
#𝑝𝑟𝑒𝑑𝑖𝑐𝑡𝑖𝑜𝑛𝑠
=
𝑇𝑃 + 𝑇𝑁
𝑇𝑃 + 𝑇𝑁 + 𝐹𝑃 + 𝐹𝑁
𝑃𝑟𝑒𝑐𝑖𝑠𝑖𝑜𝑛 =
𝑇𝑃
𝑇𝑃 + 𝐹𝑃
𝑅𝑒𝑐𝑎𝑙𝑙 = 𝑆𝑒𝑛𝑠𝑖𝑡𝑖𝑣𝑖𝑡𝑦 =
𝑇𝑃
𝑇𝑃 + 𝐹𝑁
𝑆𝑝𝑒𝑐𝑖𝑓𝑖𝑐𝑖𝑡𝑦 =
𝑇𝑁
𝑇𝑁 + 𝐹𝑁
How good at avoiding
false alarms
How good it is at
detecting positives
26. Task: Iris
Classification
• Data preprocessing
• Split data for training and tests
sets
• Classification using: kNN and Naïve
Bayes
• Performance evaluation
• Results Visualisation
27. Task: Binary
Classification
• Only two slasses in the dataset
(versicolor & virginica)
• Classification using logistic
regression
• Performance evaluation
• Results Visualisation
35. Hierarchical clustering
• Decision of where the cluster
should be split
• Metric: distance between pairs
of observation
• Linkage criterion: dissimilarity of
sets
43. Residuals vs
Fitted
• Check if residuals have non-
linear patterns
• Check if the model captures
the non-linear relationship
• Should show equally spread
residuals around the
horizontal line
44. Normal Q-Q
• Shows if the residuals are
normally distributed
• Values should be lined on the
straight dashed line
• Check if residuals do not
deviate severely
45. Scale-Location
• Show if residuals are spread
equally along the ranges of
predictors
• Test the assumption of equal
variance (homoscedasticity)
• Should show horizontal line
with equally (randomly)
spread points
46. Residuals vs
Leverage
• Helps to find influential cases
• When outside of the Cook’s
distance the cases are
influential
• With no influential cases
Cook’s distance lines should
be barely visible
49. Categorical data for regression
• Categories: A, B, C are coded as
dummy variables
• In general if the variable has k
categories it will be decoded into
k-1 dummy variables
Category V1 V2
A 0 0
B 1 0
C 0 0
𝑓 𝒙 = 𝛽0 + 𝛽1 𝑥1 + ⋯ + 𝛽𝑗 𝑥𝑗 + 𝛽𝑗+1 𝑣1 + ⋯ + 𝛽𝑗+𝑘−1 𝑣 𝑘