R Language Introduction

Khaled El-Sham’aa

Session Road Map: First Steps · Importing Data into R · R Basics · Data Visualization · Correlation & Regression · t-Test · Chi-squared Test · ANOVA · PCA · Clustering · Time Series · Programming · Publication-Quality Output
First Steps (1)
R is one of the most popular platforms for data analysis and visualization currently available. It is free and open source software: http://www.r-project.org
Take advantage of its coverage and availability of new, cutting-edge applications and techniques. R will enable us to develop and distribute solutions to our NARS with no hidden license cost.

First Steps (2) [screenshot]

First Steps (3)
5 * 4
[1] 20
a <- (3 * 7) + 1
a
[1] 22
b <- c(1, 2, 3, 5, 8)
b * 2
[1]  2  4  6 10 16
b[4]
[1] 5
b[1:3]
[1] 1 2 3
b[c(1, 3, 5)]
[1] 1 3 8
b[b > 4]
[1] 5 8

First Steps (4)
citation()
R Development Core Team (2009). R: A language and environment for statistical computing. R Foundation for Statistical Computing, Vienna, Austria. ISBN 3-900051-07-0, URL http://www.R-project.org.

First Steps (5)
If you know the name of the function you want help with, just type a question mark ? at the command-line prompt followed by the name of the function:
?read.table

First Steps (6)
Sometimes you cannot remember the precise name of the function, but you know the subject on which you want help. Use the help.search function with your query in double quotes, like this:
help.search("data input")

First Steps (7)
To see a worked example, just type the function name:
example(mean)
mean> x <- c(0:10, 50)
mean> xm <- mean(x)
mean> c(xm, mean(x, trim = 0.10))
[1] 8.75 5.50
mean> mean(USArrests, trim = 0.2)
  Murder  Assault UrbanPop     Rape
    7.42   167.60    66.20    20.16

First Steps (8)
There are hundreds of contributed packages for R, written by many different authors (to implement specialized statistical methods). Most are available for download from CRAN (http://CRAN.R-project.org).
List all available packages: library()
Load the "ggplot2" package: library(ggplot2)
Documentation on a package: library(help=ggplot2)
Importing Data into R (1)
data <- read.table("D:/path/file.txt", header=TRUE)
data <- read.csv(file.choose(), header=TRUE, sep=";")
data <- edit(data)
fix(data)
head(data)
tail(data)
tail(data, 10)

Importing Data into R (2)
In order to refer to a vector by name within an R session, you need to attach the dataframe containing the vector. Alternatively, you can refer to the dataframe name and the vector name within it, using the element name operator $ like this: mtcars$mpg
?mtcars
attach(mtcars)
mpg

Importing Data into R (3) [screenshot]

Importing Data into R (4)
# Read data left on the clipboard
data <- read.table("clipboard", header=TRUE)
# ODBC
library(RODBC)
db1 <- odbcConnect("MY_DB", uid="usr", pwd="pwd")
raw <- sqlQuery(db1, "SELECT * FROM table1")
# XLSX
library(XLConnect)
xls <- loadWorkbook("my_file.xlsx", create=FALSE)
raw <- as.data.frame(readWorksheet(xls, sheet="Sheet1"))
R Basics (1)
max(x)      maximum value in x
min(x)      minimum value in x
mean(x)     arithmetic average of the values in x
median(x)   median value in x
var(x)      sample variance of x
sd(x)       standard deviation of x
cor(x, y)   correlation between vectors x and y
summary(x)  generic function used to produce summaries of the results of various functions

R Basics (2)
abs(x)                   absolute value of x
floor(2.718)             largest integer not greater than x
ceiling(3.142)           smallest integer not less than x
asin(x)                  inverse sine of x, in radians
round(2.718, digits=2)   returns 2.72
x <- 1:12; sample(x)     simple randomization
RCBD randomization: RCBD <- replicate(3, sample(x))

R Basics (3)
Common data transformations:
Nature of Data                          Transformation   R Syntax
Measurements (lengths, weights, etc.)   loge             log(x)
                                        log10            log(x, 10) or log10(x)
                                        log(x + 1)       log(x + 1)
Counts (number of individuals, etc.)    square root      sqrt(x)
Percentages (must be proportions)       arcsine          asin(sqrt(x)) * 180/pi
where x is the name of the vector (variable) whose values are to be transformed.

R Basics (4)
Vectorized computations: any function call or operator applied to a vector will automatically operate directly on all elements of the vector.
nchar(month.name)  # 7 8 5 5 3 4 4 6 9 7 8 8
The recycling rule: the shorter vector is replicated enough times so that the result has the length of the longer vector, then the operator is applied.
1:10 + 1:3  # 2 4 6 5 7 9 8 10 12 11

R Basics (5)
mydata <- matrix(rnorm(30), nrow=6)
mydata
# calculate the 6 row means
apply(mydata, 1, mean)
# calculate the 5 column means
apply(mydata, 2, mean)
apply(mydata, 2, mean, trim=0.2)

R Basics (6)
String functions:
substr(month.name, 2, 3)
paste("*", month.name[1:4], "*", sep=" ")
x <- toupper(dna.seq)
rna.seq <- chartr("T", "U", x)
comp.seq <- chartr("ACTG", "TGAC", dna.seq)

R Basics (7)
Surprisingly, the base installation doesn’t provide functions for skewness and kurtosis, but you can add your own:
m <- mean(x)
n <- length(x)
s <- sd(x)
skew <- sum((x - m)^3 / s^3) / n
kurt <- sum((x - m)^4 / s^4) / n - 3
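The snippet above can be wrapped into reusable functions. A minimal sketch; the names skewness and kurtosis are our own, and the moment-based formulas follow the slide, not any particular package:

```r
# Moment-based sample skewness and excess kurtosis,
# following the formulas on the slide (not from any package).
skewness <- function(x) {
  m <- mean(x); s <- sd(x); n <- length(x)
  sum((x - m)^3 / s^3) / n
}

kurtosis <- function(x) {
  m <- mean(x); s <- sd(x); n <- length(x)
  sum((x - m)^4 / s^4) / n - 3   # excess kurtosis (roughly 0 for normal data)
}

skewness(c(1, 2, 3, 4, 100))  # strongly right-skewed, so clearly positive
```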
Data Visualization (1)
pairs gives a matrix of scatter plots of every variable against every other:
?mtcars
pairs(mtcars)
Voilà!

Data Visualization (2)
pie(table(cyl))
barplot(table(cyl))

Data Visualization (3)
plot gives a scatter plot if x is continuous, and a box-and-whisker plot if x is a factor. Some people prefer the alternative syntax plot(y ~ x):
attach(mtcars)
plot(wt, mpg)
plot(cyl, mpg)
cyl <- factor(cyl)
plot(cyl, mpg)

Data Visualization (4) [figure]

Data Visualization (5)
Histograms show a frequency distribution:
hist(qsec, col="gray")

Data Visualization (6)
boxplot(qsec, col="gray")
boxplot(qsec, mpg, col="gray")

Data Visualization (7)
XY <- cbind(LAT, LONG)
plot(XY, type="l")
library(sp)
XY.poly <- Polygon(XY)
XY.pnt <- spsample(XY.poly, n=8, type="random")
XY.pnt
points(XY.pnt)

Data Visualization (8) [figure]
Correlation and Regression (1)
If you want to determine the significance of a correlation (i.e. the p-value associated with the calculated value of r), then use cor.test rather than cor.
cor(wt, mpg)
[1] -0.8676594
The value varies from -1 to +1: -1 indicates perfect negative correlation, +1 indicates perfect positive correlation, and 0 means no correlation.

Correlation and Regression (2)
cor.test(wt, qsec)
        Pearson's product-moment correlation
data: wt and qsec
t = -0.9719, df = 30, p-value = 0.3389
alternative hypothesis: true correlation is not equal to 0
95 percent confidence interval:
 -0.4933536  0.1852649
sample estimates:
       cor
-0.1747159

Correlation and Regression (3)
cor.test(wt, mpg)
        Pearson's product-moment correlation
data: wt and mpg
t = -9.559, df = 30, p-value = 1.294e-10
alternative hypothesis: true correlation is not equal to 0
95 percent confidence interval:
 -0.9338264 -0.7440872
sample estimates:
       cor
-0.8676594

Correlation and Regression (4) [figure]
Correlation and Regression (5)
lm fits a linear model with normal errors and constant variance; generally this is used for regression analysis with continuous explanatory variables.
fit <- lm(y ~ x)
summary(fit)
plot(x, y)
# Example of multiple linear regression
fit <- lm(y ~ x1 + x2 + x3)

Correlation and Regression (6)
Call:
lm(formula = mpg ~ wt)
Residuals:
    Min      1Q  Median      3Q     Max
-4.5432 -2.3647 -0.1252  1.4096  6.8727
Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)  37.2851     1.8776  19.858  < 2e-16 ***
wt           -5.3445     0.5591  -9.559 1.29e-10 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 3.046 on 30 degrees of freedom
Multiple R-squared: 0.7528, Adjusted R-squared: 0.7446
F-statistic: 91.38 on 1 and 30 DF, p-value: 1.294e-10

Correlation and Regression (7)
The great thing about graphics in R is that it is extremely straightforward to add things to your plots. In the present case, we might want to add a regression line through the cloud of data points. The function for this is abline, which can take as its argument the linear model object:
abline(fit)
Note: abline(a, b) adds a line with an intercept of a and a slope of b.

Correlation and Regression (8)
plot(wt, mpg, xlab="Weight", ylab="Miles/Gallon")
abline(fit, col="blue", lwd=2)
text(4, 25, "mpg = 37.29 - 5.34 wt")

Correlation and Regression (9)
predict is a generic built-in function for predictions from the results of various model-fitting functions:
predict(fit, list(wt = 4.5))
[1] 13.23500
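predict can also return interval estimates around a prediction. A minimal sketch on mtcars (the interval argument is part of predict.lm; the choice of wt = 4.5 mirrors the slide):

```r
data(mtcars)
fit <- lm(mpg ~ wt, data = mtcars)

# Point prediction plus a 95% confidence interval for the mean response
predict(fit, newdata = data.frame(wt = 4.5), interval = "confidence")

# A 95% prediction interval for a single new observation (always wider)
predict(fit, newdata = data.frame(wt = 4.5), interval = "prediction")
```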
Correlation and Regression (10) [figure]
Correlation and Regression (11)
What do you do if you identify problems? There are four approaches to dealing with violations of regression assumptions:
- Deleting observations
- Transforming variables
- Adding or deleting variables
- Using another regression approach

Correlation and Regression (12)
You can compare the fit of two nested models using the anova() function in the base installation. A nested model is one whose terms are completely included in the other model.
fit1 <- lm(y ~ A + B + C)
fit2 <- lm(y ~ A + C)
anova(fit1, fit2)
If the test is not significant (i.e. p > 0.05), we conclude that B does not add to the linear prediction, and we are justified in dropping it from our model.
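A concrete sketch of the nested-model comparison on mtcars (the choice of predictors is ours, purely illustrative):

```r
data(mtcars)
fit1 <- lm(mpg ~ wt + disp + qsec, data = mtcars)  # full model
fit2 <- lm(mpg ~ wt + qsec, data = mtcars)         # nested model: drops disp
cmp <- anova(fit2, fit1)
cmp  # the Pr(>F) column tests whether disp improves the fit
```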
Correlation and Regression (13)
# Bootstrap 95% CI for R-squared
library(boot)
rsq <- function(formula, data, indices) {
  fit <- lm(formula, data = data[indices, ])
  return(summary(fit)$r.square)
}
rs <- boot(data=mtcars, statistic=rsq, R=1000, formula=mpg ~ wt + disp)
boot.ci(rs, type="bca")  # try print(rs) and plot(rs)
t-Test (1)
Comparing two sample means with normal errors (Student's t-test, t.test):
t.test(a, b)
t.test(a, b, paired = TRUE)
# alternative argument options:
# "two.sided", "less", "greater"
a <- qsec[cyl == 4]
b <- qsec[cyl == 6]
c <- qsec[cyl == 8]

t-Test (2)
t.test(a, b)
        Welch Two Sample t-test
data: a and b
t = 1.4136, df = 12.781, p-value = 0.1814
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 -0.6159443  2.9362040
sample estimates:
mean of x mean of y
 19.13727  17.97714

t-Test (3)
t.test(a, c)
        Welch Two Sample t-test
data: a and c
t = 3.9446, df = 17.407, p-value = 0.001005
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 1.102361 3.627899
sample estimates:
mean of x mean of y
 19.13727  16.77214

t-Test (4)
(a) Test the equality-of-variances assumption:
ev <- var.test(a, c)$p.value
(b) Test the normality assumption:
an <- shapiro.test(a)$p.value
bn <- shapiro.test(c)$p.value
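These assumption checks can drive the choice of test. A minimal sketch on mtcars (the p > 0.05 cut-off is a common convention, not a rule from the slides):

```r
data(mtcars)
a <- mtcars$qsec[mtcars$cyl == 4]
c <- mtcars$qsec[mtcars$cyl == 8]

ev <- var.test(a, c)$p.value   # equality-of-variances check
# If the variances look equal, a pooled (Student) t-test is justified;
# otherwise keep t.test's default Welch correction.
res <- t.test(a, c, var.equal = (ev > 0.05))
res$p.value
```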
Chi-squared Test (1)
Construct hypotheses based on qualitative (categorical) data:
myTable <- table(am, cyl)
myTable
           cyl
am          4 6  8
  automatic 3 4 12
  manual    8 3  2

Chi-squared Test (2)
chisq.test(myTable)
        Pearson's Chi-squared test
data: myTable
X-squared = 8.7407, df = 2, p-value = 0.01265
The expected counts under the null hypothesis:
chisq.test(myTable)$expected
           cyl
am                4       6      8
  automatic 6.53125 4.15625 8.3125
  manual    4.46875 2.84375 5.6875
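The expected counts come straight from the row and column totals. A sketch reproducing them by hand (the contingency table is hard-coded from the slide):

```r
# Contingency table as on the slide (am = transmission, cyl = cylinders)
myTable <- matrix(c(3, 8, 4, 3, 12, 2), nrow = 2,
                  dimnames = list(am = c("automatic", "manual"),
                                  cyl = c("4", "6", "8")))

# Expected count per cell: (row total * column total) / grand total
expected <- outer(rowSums(myTable), colSums(myTable)) / sum(myTable)
expected  # matches chisq.test(myTable)$expected
```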
Chi-squared Test (3)
mosaicplot(myTable, color=rainbow(3))
ANOVA (1)
Analysis of variance is a method which partitions the total variation in the response into its components (sources of variation) in the model.
table(N, S, Rep)
N <- factor(N)
S <- factor(S)
Rep <- factor(Rep)

ANOVA (2)
The best way to understand the two significant interaction terms is to plot them using interaction.plot, like this:
interaction.plot(S, N, Yield)

ANOVA (3)
boxplot(Yield ~ N, col="gray")

ANOVA (4)
model <- aov(Yield ~ N * S)  # CRD
summary(model)
            Df Sum Sq Mean Sq F value    Pr(>F)
N            2 4.5818  2.2909 42.7469 1.230e-08 ***
S            3 0.9798  0.3266  6.0944  0.003106 **
N:S          6 0.6517  0.1086  2.0268  0.101243
Residuals   24 1.2862  0.0536
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

ANOVA (5)
par(mfrow = c(2, 2))
plot(model)
ANOVA assumptions: normality, linearity, constant variance, independence.

ANOVA (6)
model.tables(model, "means")
Tables of means
Grand mean
1.104722
 N
     0    180    230
0.6025 1.3142 1.3975
 S
     0     10     20     40
0.8289 1.1556 1.1678 1.2667
 N:S
      S
N          0     10     20     40
  0   0.5600 0.7733 0.5233 0.5533
  180 0.8933 1.2900 1.5267 1.5467
  230 1.0333 1.4033 1.4533 1.7000

ANOVA (7)
model.tables(model, se=TRUE)
.......
Standard errors for differences of means
             N      S    N:S
        0.0945 0.1091 0.1890
replic.     12      9      3
plot.design(Yield ~ N * S)

ANOVA (8)
mc <- TukeyHSD(model, "N", ordered = TRUE); mc
  Tukey multiple comparisons of means
    95% family-wise confidence level
    factor levels have been ordered
Fit: aov(formula = Yield ~ N * S)
$N
              diff        lwr       upr     p adj
180-0   0.71166667  0.4756506 0.9476827 0.0000003
230-0   0.79500000  0.5589840 1.0310160 0.0000000
230-180 0.08333333 -0.1526827 0.3193494 0.6567397
ANOVA (9)
plot(mc)
ANOVA (10)
summary(aov(Yield ~ N * S + Error(Rep)))  # RCB
Error: Rep
          Df  Sum Sq Mean Sq F value Pr(>F)
Residuals  2 0.30191 0.15095
Error: Within
            Df Sum Sq Mean Sq F value    Pr(>F)
N            2 4.5818  2.2909 51.2035 5.289e-09 ***
S            3 0.9798  0.3266  7.3001  0.001423 **
N:S          6 0.6517  0.1086  2.4277  0.059281 .
Residuals   22 0.9843  0.0447
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

ANOVA (11)
In a split-plot design, different treatments are applied to plots of different sizes. Each different plot size is associated with its own error variance. The model formula is specified as a factorial, using the asterisk notation. The error structure is defined in the Error term, with the plot sizes listed from left to right, from largest to smallest, each variable separated by the slash operator /.
model <- aov(Yield ~ N * S + Error(Rep/N))

ANOVA (12)
Error: Rep
          Df  Sum Sq Mean Sq F value Pr(>F)
Residuals  2 0.30191 0.15095
Error: Rep:N
          Df Sum Sq Mean Sq F value   Pr(>F)
N          2 4.5818 2.29088  55.583 0.001206 **
Residuals  4 0.1649 0.04122
Error: Within
            Df  Sum Sq Mean Sq F value   Pr(>F)
S            3 0.97983 0.32661  7.1744 0.002280 **
N:S          6 0.65171 0.10862  2.3860 0.071313 .
Residuals   18 0.81943 0.04552

ANOVA (13)
Analysis of covariance:
# f is the treatment factor
# x is a variate acting as covariate
model <- aov(y ~ x * f)
Split both main effects into linear and quadratic parts:
contrasts <- list(N = list(lin=1, quad=2), S = list(lin=1, quad=2))
summary(model, split=contrasts)
PCA (1)
The idea of principal components analysis (PCA) is to find a small number of linear combinations of the variables so as to capture most of the variation in the dataframe as a whole.
d2 <- cbind(wt, disp/10, hp/10, mpg, qsec)
colnames(d2) <- c("wt", "disp", "hp", "mpeg", "qsec")

PCA (2)
model <- prcomp(d2)
model
Standard deviations:
[1] 14.6949595 3.9627722 2.8306355 1.1593717
Rotation:
              PC1         PC2         PC3         PC4
wt    -0.05887539  0.05015401 -0.07513271 -0.16910728
disp  -0.83186362  0.47519625  0.28005113  0.04080894
hp    -0.40572567 -0.83180078  0.24611265 -0.28768795
mpeg   0.36888799  0.12190490  0.91398919 -0.09385946
qsec   0.06200759  0.25479354 -0.14134625 -0.93710373

PCA (3)
summary(model)
Importance of components:
                          PC1     PC2     PC3
Standard deviation     14.6950 3.96277 2.83064
Proportion of Variance  0.8957 0.06514 0.03323
Cumulative Proportion   0.8957 0.96082 0.99405

PCA (4)
plot(model)
biplot(model)
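The slide rescales disp and hp by hand so that no single variable dominates. An alternative sketch lets prcomp standardize every variable instead (scale. is a real prcomp argument; the variable selection mirrors the slide):

```r
data(mtcars)
d2 <- mtcars[, c("wt", "disp", "hp", "mpg", "qsec")]

# scale.=TRUE standardizes each column to unit variance, so no manual
# rescaling (disp/10, hp/10) is needed.
model <- prcomp(d2, scale. = TRUE)
summary(model)           # proportion of variance per component
round(model$rotation, 3) # loadings for all five components
```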
Clustering (1)
We define similarity on the basis of the distance between two samples in this n-dimensional space. Several different distance measures could be used to work out the distance from every sample to every other sample. This quantitative dissimilarity structure of the data is stored in a matrix produced by the dist function:
rownames(d2) <- rownames(mtcars)
my.dist <- dist(d2, method="euclidean")

Clustering (2)
Initially, each sample is assigned to its own cluster, and then the hclust algorithm proceeds iteratively, at each stage joining the two most similar clusters, continuing until there is just a single cluster (see ?hclust for details).
my.hc <- hclust(my.dist, "ward")

Clustering (3)
We can plot the object called my.hc, and we specify that the leaves of the hierarchy are labeled by their plot numbers:
plot(my.hc, hang=-1)
g <- rect.hclust(my.hc, k=4, border="red")
Note: when the hang argument is set to -1, all leaves end on one line and their labels hang down from 0.
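Beyond drawing rectangles, cutree returns the actual cluster membership for each sample. A minimal sketch on simulated data (the data are invented so the block is self-contained; newer R spells the Ward method "ward.D"):

```r
set.seed(42)
# Two well-separated simulated groups of 2-D points
xy <- rbind(matrix(rnorm(20), ncol = 2),
            matrix(rnorm(20, mean = 10), ncol = 2))

hc <- hclust(dist(xy), method = "ward.D")
groups <- cutree(hc, k = 2)  # integer cluster label per row
table(groups)
```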
Clustering (4) [figure]
Clustering (5)
k-means partitions the data into a number of clusters specified by the user.
gr <- kmeans(cbind(disp, hp), 2)
plot(disp, hp, col = gr$cluster, pch=19)
points(gr$centers, col = 1:2, pch = 8, cex=2)

Clustering (6) [figure]

Clustering (7)
K-means clustering with 2 clusters of sizes 18, 14
Cluster means:
      disp        hp
1 135.5389  98.05556
2 353.1000 209.21429
Clustering vector:
 [1] 1 1 1 1 2 1 2 1 1 1 1 2 2 2 2 2 2 1 1 1 1 2 2 2 2 1 1 1 2 1 2 1
Within cluster sum of squares by cluster:
[1] 58369.27 93490.74
 (between_SS / total_SS = 75.6 %)

Clustering (8)
x <- as.matrix(mtcars)
heatmap(x, scale="column")
Time Series (1)
First, make the data variable into a time-series object:
# create time-series objects
beer <- ts(beer, start=1956, freq=12)
It is useful to be able to decompose a time series into components. The function stl performs seasonal decomposition of a time series into seasonal, trend and irregular components using loess.

Time Series (2)
The remainder component is the residuals from the seasonal-plus-trend fit. The bars at the right-hand side are of equal heights (in user coordinates).
# Decompose a time series into seasonal,
# trend and irregular components using loess
ts.comp <- stl(beer, s.window="periodic")
plot(ts.comp)
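The beer series is not shipped with base R, so here is a self-contained sketch of the same decomposition on a simulated monthly series (the trend and seasonal pattern are invented purely for illustration):

```r
set.seed(1)
# Simulated monthly series: upward trend + annual cycle + noise
n <- 120
x <- 0.05 * (1:n) + 2 * sin(2 * pi * (1:n) / 12) + rnorm(n, sd = 0.3)
y <- ts(x, start = 1956, frequency = 12)

ts.comp <- stl(y, s.window = "periodic")
head(ts.comp$time.series)  # columns: seasonal, trend, remainder
```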
Time Series (3) [figure]
Programming (1)
We can extend the functionality of R by writing a function that estimates the standard error of the mean:
SEM <- function(x, na.rm = FALSE) {
  if (na.rm == TRUE) VAR <- x[!is.na(x)] else VAR <- x
  SD <- sd(VAR)
  N <- length(VAR)
  SE <- SD / sqrt(N - 1)
  return(SE)
}

Programming (2)
You can define your own operator of the form %any%, using any text string in place of "any". The function should be a function of two arguments:
"%p%" <- function(x, y) paste(x, y, sep=" ")
"Hi" %p% "Khaled"
[1] "Hi Khaled"

Programming (3)
setwd("path/to/folder")
sink("output.txt")
  cat("Intercept \t Slope")
  a <- fit$coefficients[[1]]
  b <- fit$coefficients[[2]]
  cat(paste(a, b, sep="\t"))
sink()
jpeg(filename="graph.jpg", width=600, height=600)
plot(wt, mpg); abline(fit)
dev.off()

Programming (4)
The code for R functions can be viewed, and in most cases modified if so desired, using the fix() function. You can trigger garbage collection by calling the gc() function, which will report a few memory-usage statistics. The basic tool for code timing is system.time(commands). tempfile() gives a unique file name in a temporary writable directory, deleted at the end of the session.

Programming (5)
Take control of your R code! RStudio is a free and open-source integrated development environment for R. You can run it on your desktop (Windows, Mac, or Linux):
- Syntax highlighting, code completion, etc.
- Execute R code directly from the source editor
- Workspace browser and data viewer
- Plot history, zooming, and flexible image & PDF export
- Integrated R help and documentation
- and more (http://www.rstudio.com/ide/)
Programming (6) [screenshot]
Programming (7)
If we want to evaluate the quadratic x² − 2x + 4 many times, we can write a function that evaluates it for a specific value of x:
my.f <- function(x) { x^2 - 2*x + 4 }
my.f(3)
[1] 7
plot(my.f, -10, +10)

Programming (8) [figure]

Programming (9)
We can find the minimum of the function using:
optimize(my.f, lower = -10, upper = 10)
$minimum
[1] 1
$objective
[1] 3
which says that the minimum occurs at x = 1, where the quadratic has value 3.

Programming (10)
We can integrate the function over the interval -10 to 10 using:
integrate(my.f, lower = -10, upper = 10)
746.6667 with absolute error < 4.1e-12
which gives an answer together with an estimate of the absolute error.

Programming (11)
plot(my.f, -15, +15)
v <- seq(-10, 10, 0.01)
x <- c(-10, v, 10)
y <- c(0, my.f(v), 0)
polygon(x, y, col="gray")
Publication-Quality Output (1)
Research doesn’t end when the last statistical analysis is completed. We need to include the results in a report. The xtable function converts an R object to an xtable object, which can then be printed as a LaTeX table. LaTeX is a document preparation system for high-quality typesetting (http://www.latex-project.org).
library(xtable)
print(xtable(model))

Publication-Quality Output (2)
library(xtable)
example(aov)
print(xtable(npk.aov))

Publication-Quality Output (3)
The ggplot2 package is an elegant alternative to the base graphics system; it has two complementary uses:
- Producing publication-quality graphics using very simple syntax that is similar to that of base graphics. ggplot2 tends to make smart default choices for color, scale, etc.
- Making more sophisticated/customized plots that go beyond the defaults.
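A minimal ggplot2 sketch (assuming the package is installed; it is not part of base R) that reproduces the earlier weight-vs-mileage scatter with a fitted regression line:

```r
library(ggplot2)

p <- ggplot(mtcars, aes(x = wt, y = mpg)) +
  geom_point() +
  geom_smooth(method = "lm", se = TRUE) +  # regression line + confidence band
  labs(x = "Weight", y = "Miles/Gallon")

print(p)                 # draw it
ggsave("mpg_wt.pdf", p)  # export a publication-quality PDF
```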
Publication-Quality Output (4) [figure]
Final words!
How large is your family? How many brothers and sisters are there in your family, including yourself? The average number of children in families was about 2. Can you explain the difference between this value and the class average?
Birthday problem! The problem is to compute the approximate probability that, in a room of n people, at least two have the same birthday.
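The birthday probability can be computed directly in R. A minimal sketch using the explicit product formula, checked against the pbirthday helper that ships with base R's stats package:

```r
# P(at least two of n people share a birthday), 365 equally likely days
p.birthday <- function(n) {
  1 - prod((365 - 0:(n - 1)) / 365)
}

p.birthday(23)   # the classic result: just over 0.5
pbirthday(23)    # base R's built-in gives the same answer
```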
Online Resources
http://tryr.codeschool.com
http://www.r-project.org
http://www.statmethods.net
http://www.r-bloggers.com
http://www.r-tutor.com
http://blog.revolutionanalytics.com/r

Thank You