Introduction to R for Data Science :: Session 4
1. Introduction to R for Data Science
Lecturers
dipl. ing Branko Kovač
Data Analyst at CUBE/Data Science Mentor
at Springboard
Data Science zajednica Srbije
branko.kovac@gmail.com
dr Goran S. Milovanović
Data Scientist at DiploFoundation
Data Science zajednica Srbije
goran.s.milovanovic@gmail.com
goranm@diplomacy.edu
2. Control Flow in R
• for, while, repeat
• if, else
• switch
Intro to R for Data Science
Session 4: Control Flow
# Introduction to R for Data Science
# SESSION 4 :: 19 May, 2016
# Starting with a simple 'if'
num <- 2 # some value to test with
if (num > 0) print("num is positive")
# if the condition num > 0 holds, then print() is executed
# Sometimes 'if' has its 'else'
if (num > 0) { # test to see if it's positive
  print("num is positive") # print in case of a positive number
} else {
  print("num is not positive") # zero or negative otherwise
}
# Careful: place your else right after the end ('}') of the conditional block
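A two-branch if/else cannot tell zero apart from a negative number. A minimal sketch of a three-way test with `else if` follows; the helper name `describe_sign` is hypothetical and introduced for illustration only.

```r
# describe_sign() is a hypothetical helper, not part of the session code
describe_sign <- function(num) {
  if (num > 0) {
    "positive"
  } else if (num < 0) {
    "negative"
  } else {
    "zero"
  }
}

print(describe_sign(2))
print(describe_sign(0))
print(describe_sign(-3.5))
```

Note that each `else` sits on the same line as the closing `}` of the preceding block, as the comment above advises.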
3. Vectorized: ifelse
• for, while, repeat
• if, else, ifelse
• switch
# R is vectorized so there's vectorized if-else
simple_vect <- c(1, 3, 12, NA, 2, NA, 4) # just another num vector with NAs
ifelse(is.na(simple_vect), "nothing here", "some number")
# returns "nothing here" where the element is NA and "some number" everywhere else
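When the test itself evaluates to NA, ifelse() returns NA at that position. The sketch below shows this and one way to handle it by nesting ifelse() calls; the labels and the threshold of 3 are arbitrary choices made for illustration.

```r
simple_vect <- c(1, 3, 12, NA, 2, NA, 4)

# A condition that evaluates to NA yields NA in the result
size_label <- ifelse(simple_vect > 3, "big", "small")

# Nesting ifelse() handles the NA case explicitly
size_label_full <- ifelse(is.na(simple_vect), "missing",
                          ifelse(simple_vect > 3, "big", "small"))

print(size_label)
print(size_label_full)
```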
4. For loops: slow and slower
# A for loop always works the same way
for (i in simple_vect) print(i)
# Be aware that loops can be slow if you grow a vector inside them
vec <- numeric()
system.time(
for(i in seq_len(50000-1)) {
some_calc <- sqrt(i/10)
# this is what makes it slow:
vec <- c(vec, some_calc)
})
# This solution is slightly faster
iter <- 50000
# preallocating the result vector is what makes it faster:
vec <- numeric(length = iter)
system.time(
  for (i in seq_len(iter - 1)) {
    some_calc <- sqrt(i / 10)
    vec[i] <- some_calc # ...not the assignment by index!
  })
5. For loops: slow and slower
# This solution is even faster
iter <- 50000
vec <- numeric(length = iter) # not because of the preallocation, which is unchanged...
system.time(
  for (i in seq_len(iter - 1)) {
    vec[i] <- sqrt(i / 10) # ...but because the intermediate variable is gone!
  })
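For this particular computation the loop can be dropped altogether, because sqrt() and "/" already operate on whole vectors. A minimal sketch, assuming the same `iter` as above; the names `vec_loop` and `vec_vectorized` are introduced here for illustration.

```r
iter <- 50000

# Loop with preallocation, as in the session code
vec_loop <- numeric(length = iter)
for (i in seq_len(iter - 1)) {
  vec_loop[i] <- sqrt(i / 10)
}

# Vectorized equivalent: one call fills the first iter - 1 elements
vec_vectorized <- numeric(length = iter)
idx <- seq_len(iter - 1)
vec_vectorized[idx] <- sqrt(idx / 10)

# Both approaches produce the same values
print(all.equal(vec_loop, vec_vectorized))
```

Wrapping either version in system.time() gives a rough timing comparison on your own machine.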
6. For loops vs. vectorized functions
# Another example of how loops can be slow
# (loop vs. vectorized functions)
iter <- 50000
vec <- numeric(length = iter) # preallocate, as in the previous example
system.time(for (i in 1:iter) {
  vec[i] <- rnorm(n = 1, mean = 0, sd = 1)
  # one random draw per iteration
})
system.time(y <- rnorm(iter, 0, 1)) # but this is much, much faster
7. while, repeat…
# R also has a while loop
r <- 1 # initializing some variable
while (r < 5) { # while r < 5
print(r) # print r
r <- r + 1 # increase r by 1
}
# Nope, we didn't forget the 'repeat' loop
i <- 1
repeat { # there is no condition!
  print(i)
  i <- i + 1
  if (i == 10) break
  # ...so we have to break out of it if we
  # don't want an infinite loop
}
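Besides `break`, R loops also support `next`, which skips the rest of the current iteration. The sketch below combines both in a `while (TRUE)` loop, which behaves like `repeat`; the variable names `odds` and `k` are introduced here for illustration.

```r
# Collect the odd numbers below 10
odds <- integer(0)
k <- 0L
while (TRUE) {
  k <- k + 1L
  if (k >= 10L) break      # leave the loop entirely
  if (k %% 2L == 0L) next  # skip even values, continue with the next iteration
  odds <- c(odds, k)       # growing a vector is acceptable for a handful of elements
}
print(odds)
```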
8. switch
switch(2, "data", "science", "serbia") # choose one option based on the value: returns "science"
# More on switch:
switchIndicator <- "A"
# switchIndicator <- "switchIndicator"
# switchIndicator <- "AvAvAv" # play with these three values
# a rare situation where you do not need to enclose strings in ' ' or " ":
# the names of the alternatives below are written without quotes
switch(switchIndicator,
  A = {print(switchIndicator)},
  switchIndicator = {unlist(strsplit(switchIndicator, "h"))},
  AvAvAv = {print(nchar(switchIndicator))}
)
9. switch()
type <- 2
cc <- c("A", "B", "C")
switch(type,
  c1 = {print(cc[1])},
  c2 = {print(cc[2])},
  c3 = {print(cc[3])},
  {print("Beyond C...")} # unnamed last alternative
)
# type is numeric, so switch() ignores the names and evaluates the second alternative
# However…
10. switch()
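# Be careful with switch(): type is numeric here, so the alternatives are selected by position.
type <- 4
cc <- c("A", "B", "C")
switch(type,
  print(cc[1]),
  print(cc[2]),
  print(cc[3]),
  {print("Beyond C...")}
  # reached here only because it sits in position 4;
  # any type above 4 makes switch() return NULL invisibly.
  # An unnamed alternative acts as a default only when
  # EXPR is a character string matched against named choices!
)
# switch is faster than if… else… (!)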
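According to the R documentation for switch(), an unnamed alternative serves as the default only when EXPR is a character string; with a numeric EXPR the selection is purely positional. A minimal sketch of both cases follows; `pick_letter` is a hypothetical helper introduced for illustration.

```r
# pick_letter() is a hypothetical helper used only for this illustration
pick_letter <- function(type) {
  cc <- c("A", "B", "C")
  switch(type,
    c1 = cc[1],
    c2 = cc[2],
    c3 = cc[3],
    "Beyond C..."  # unnamed alternative: the default for a character EXPR
  )
}

print(pick_letter("c2"))  # matches the name c2
print(pick_letter("c9"))  # no name matches, so the default is returned

# With a numeric EXPR the choice is positional and there is no default:
out_of_range <- switch(5, "A", "B", "C", "Beyond C...")
print(is.null(out_of_range))
```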
11. Vectorization
### vectorization in R
dataSet <- USArrests
# dataSet$Murder, dataSet$Assault, dataSet$Rape: columns of dataSet
# in behavioral sciences (psychology or biomedical sciences, for example) we would call them:
# variables (or factors, even more often)
# in data science and machine learning, we usually call them: FEATURES
# in psychology and behavioral sciences, the usage of the term "feature" is usually constrained
# to theories of categorization and concept learning
# Task: classify the US states according to some global indicator of violent crime
# Two categories (simplification): more dangerous (T) and less dangerous (F)
# We have three features: Murder, Rape, Assault, all per 100,000 inhabitants
# The idea is to combine the three available features.
# Let's assume that we arbitrarily assign the following preference order over the features:
# Murder > Rape > Assault
# in terms of the severity of the consequences of the associated criminal acts
12. Vectorization
# Let's first isolate the features from the data.frame
featureMatrix <- as.matrix(dataSet[, c(1, 4, 2)]) # columns: Murder, Rape, Assault
# Let's WEIGHT the features in accordance with the imposed preference order:
weigthsVector <- c(3, 2, 1) # mind the order of the columns in featureMatrix
# Essentially, we want our global indicator to be a linear combination of all three selected
# features, where each feature is weighted by the corresponding element of the weigthsVector:
featureMatrix <- cbind(featureMatrix, numeric(nrow(featureMatrix)))
for (i in seq_len(nrow(featureMatrix))) {
  featureMatrix[i, 4] <- sum(weigthsVector * featureMatrix[i, 1:3])
  # don't forget: this "*" multiplication in R is vectorized and operates element-wise
  # weigthsVector and featureMatrix[i, 1:3] are both vectors of length 3, OK
  # sum() then produces the desired linear combination
}
13. Vectorization
# Classification; in the simplest case, let's simply take a look at
# the distribution of our global indicator:
hist(featureMatrix[,4],20); # it's multimodal and not too symmetric; go for median
criterion <- median(featureMatrix[,4]);
# And classify:
dataSet$Dangerous <- ifelse(featureMatrix[,4]>=criterion,T,F);
# OK. In practice you would not do this before you have a model that has actually *learned* the
# most adequate feature weights. This is an exercise only.
# ***Important***: have you seen the for loop above? Well...
# Never do that.
dataSet$Dangerous <- NULL;
14. Vectorization
# In Data Science, you will be working with huge amounts of quantitative data.
# For loops are slow. But in vector programming languages like R...
# matrix computations are seriously fast.
# What you ***want to do*** is the following:
# Let's first isolate the features from the data.frame
featureMatrix <- as.matrix(dataSet[, c(1,4,2)]);
# Let's WEIGHT the features in accordance with the imposed preference order:
weigthsVector <- c(3,2,1); # mind the order of the columns in featureMatrix
# Feature weighting:
wF <- weigthsVector %*% t(featureMatrix);
# In R, t() is for: transpose
# In R, %*% is matrix multiplication
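To confirm that the matrix product computes the same weighted sums as a row-by-row loop, the two can be compared directly. A minimal sketch; `loopScore` is a name introduced here for illustration.

```r
dataSet <- USArrests
featureMatrix <- as.matrix(dataSet[, c(1, 4, 2)])  # Murder, Rape, Assault
weigthsVector <- c(3, 2, 1)

# Loop version: one weighted sum per row
loopScore <- numeric(nrow(featureMatrix))
for (i in seq_len(nrow(featureMatrix))) {
  loopScore[i] <- sum(weigthsVector * featureMatrix[i, ])
}

# Matrix version: a single product, giving a 1 x 50 matrix
wF <- weigthsVector %*% t(featureMatrix)

print(dim(wF))
print(all.equal(loopScore, as.vector(wF)))
```

For Alabama (the first row) both versions give 3 * 13.2 + 2 * 21.2 + 1 * 236 = 318.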
15. Vectorization
# oh yes: R knows about row and column vectors - and you want to put this one
# as a COLUMN in your dataSet data.frame, while wF is currently a ROW vector, look:
wF
length(wF)
wF <- t(wF)
# and classify:
dataSet$Dangerous <- ifelse(wF>=median(wF),T,F);
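One possible alternative, offered as a sketch rather than as the lecturers' method: multiplying in the opposite order produces a column directly, and drop() reduces it to a plain vector, so the new data.frame column is an ordinary logical vector rather than a one-column matrix. The name `score` is introduced here for illustration.

```r
dataSet <- USArrests
featureMatrix <- as.matrix(dataSet[, c(1, 4, 2)])  # Murder, Rape, Assault
weigthsVector <- c(3, 2, 1)

# featureMatrix (50 x 3) times the weights gives a 50 x 1 column, no transpose needed;
# drop() removes the matrix dimensions and keeps the state names
score <- drop(featureMatrix %*% weigthsVector)

# The comparison already returns TRUE/FALSE, so ifelse() is not required
dataSet$Dangerous <- score >= median(score)

print(table(dataSet$Dangerous))
```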