R programming intro with examples

STATISTICAL DATA ANALYSIS USING R
Additional R Programming Examples
Dennis
Friday 2nd and Saturday 3rd May, 2014.
Dennis Introduction to R Statistical Programming.

Topics covered
Vector and Matrix operations.
File Operations.
Evaluation of Probability Density Functions.
Evaluation of Probability Distribution Functions, e.g.
Binomial, Poisson, Normal and Uniform Distributions.
Testing of Hypotheses, e.g. t, z, F, Chi-square.
Linear Regression.
Prediction.
Sum of Squared Residuals.
Residual Plots.
Multiple Regression.

Vectors I
A numeric vector is an ordered sequence of numbers.
The three common functions that are used to create vectors
in various situations are c(), seq() and rep().
The function "c()"
The letter "c" means concatenate, i.e. to join together.
> c(200, 270, 250, 300, 210) # Creating a numeric vector.
> Saturday.sales <- c(200, 300, 320)
> Sunday.sales <- c(120, 230, 200) # Saving a vector in a variable.
> weekend.sales <- c(Saturday.sales, Sunday.sales) # Combining two vectors.
> cities <- c("Kerala", "Delhi", "Chennai") # Creating a vector of characters.

Vectors II
The functions "cbind()", "rbind()" and "seq()"
> rows.sales <- cbind(Saturday.sales, Sunday.sales) # cbind() binds the vectors side by side as columns (3 x 2 matrix).
> columns.sales <- rbind(Saturday.sales, Sunday.sales) # rbind() stacks the vectors as rows (2 x 3 matrix).
> colnames(columns.sales) <- cities # Adds column titles to the matrix.
The abbreviation "seq" means sequence, i.e. it is used for
equidistant series of numbers.
> seq(4, 9) # Creates a sequence of numbers from 4 to 9 at intervals of 1.
> seq(3, 10, by = 2) # Creates a sequence of numbers starting at 3 at intervals of 2, i.e. 3, 5, 7, 9.
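A quick way to check which direction cbind() and rbind() bind in is to look at the dimensions of the result. The sketch below reuses the sales vectors from the slides; the names by.column and by.row are illustrative only.

```r
Saturday.sales <- c(200, 300, 320)
Sunday.sales <- c(120, 230, 200)

# cbind(): each vector becomes a column
by.column <- cbind(Saturday.sales, Sunday.sales)
# rbind(): each vector becomes a row
by.row <- rbind(Saturday.sales, Sunday.sales)

dim(by.column)  # 3 rows, 2 columns
dim(by.row)     # 2 rows, 3 columns

# seq() with an explicit step size
step.two <- seq(3, 10, by = 2)
step.two        # 3 5 7 9
```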

Vectors III
The function ”rep()”
The abbreviation "rep" means replicate, i.e. it is used to generate
repeated values. It is used in two variants, depending on whether the
second argument is a single number or a vector. To repeat a whole
series of numbers:
> x = c(4, 5)
> rep(x, 3) # This will repeat "x" three times: 4 5 4 5 4 5.
When the second argument is a vector, it gives the number of
repetitions of each element. Consider the following R code:
> rep(1:2, c(10, 15)) # This means repeat "1" 10 times and "2" 15 times.
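The two variants of rep() can be compared side by side. The sketch below also uses the each argument, which is not on the slide but is part of base R's rep().

```r
x <- c(4, 5)

whole <- rep(x, 3)           # repeats the whole vector: 4 5 4 5 4 5
each3 <- rep(x, each = 3)    # repeats each element in turn: 4 4 4 5 5 5
counts <- rep(1:2, c(10, 15))  # ten 1s followed by fifteen 2s

whole
each3
table(counts)                # tallies how often each value occurs
```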

Matrices I
A matrix is a two-dimensional array of numbers.
Example
> x <- 1:12
> dim(x) <- c(3, 4); x
The dim assignment function sets or changes the dimension
attribute of x, causing R to treat the vector of 12 numbers as
a 3 x 4 matrix, filled column by column.
Alternatively, the function "matrix()" can be used; with
byrow = TRUE the values are filled row by row instead:
> matrix(1:12, nrow = 3, byrow = TRUE)
Useful functions that operate on matrices include rownames(),
colnames(), and the transposition function t().
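The two ways of building a 3 x 4 matrix do not place the numbers in the same cells. A minimal sketch of the difference (by.rows and by.cols are illustrative names):

```r
x <- 1:12
dim(x) <- c(3, 4)                                # filled column by column

by.rows <- matrix(1:12, nrow = 3, byrow = TRUE)  # filled row by row
by.cols <- matrix(1:12, nrow = 3)                # default: column by column

all(x == by.cols)   # TRUE: same layout as the dim() assignment
all(x == by.rows)   # FALSE: the numbers sit in different cells

x[1, ]        # first row when filled by column: 1 4 7 10
by.rows[1, ]  # first row when filled by row:    1 2 3 4
```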

Matrices II
Example
> x <- matrix(1:12, nrow = 3, byrow = TRUE)
> rownames(x) <- LETTERS[1:3]; x
> colnames(x) <- letters[1:4]; x # Naming rows and columns (x has 4 columns).
The character vector LETTERS is a built-in variable that
contains the capital letters A-Z. Similar useful vectors are
letters, month.name and month.abb, which contain the lowercase
letters a-z, month names, and abbreviated month names
respectively.
Finding the Transpose and the Inverse of a matrix.
> y <- t(x); y # Gives the transpose of a matrix.
solve() requires a square, non-singular matrix, so it cannot be
applied to the 3 x 4 matrix x; a small 2 x 2 matrix is used instead.
> m <- matrix(c(2, 1, 1, 3), nrow = 2)
> z <- solve(m); z # Gives the inverse of a matrix.
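An inverse only exists for a square, non-singular matrix, and a simple check is that the matrix times its inverse gives the identity. The sketch below uses a hypothetical 2 x 2 matrix m chosen for illustration, and shows that solve() stops with an error on a 3 x 4 matrix.

```r
m <- matrix(c(2, 1, 1, 3), nrow = 2)   # hypothetical 2 x 2 example
m.inv <- solve(m)

# m %*% m.inv should be the 2 x 2 identity, up to rounding error
round(m %*% m.inv, 10)

# A non-square matrix has no inverse; solve() signals an error
x <- matrix(1:12, nrow = 3, byrow = TRUE)
attempt <- try(solve(x), silent = TRUE)
inherits(attempt, "try-error")         # TRUE
```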

Matrices III
Create two matrices, say, A and B.
Matrix addition, subtraction and multiplication
> C = c(rep(0, 3), seq(1, 6, 1), rep(seq(1, 6, 1), 2))
> A = matrix(sample(C, 12), nrow = 3, byrow = TRUE); A
> B = matrix(sample(C, 12), nrow = 3, byrow = TRUE); B
> D = A + B; D # Adds the matrices.
> E = A - B; E # Subtracts matrix B from matrix A.
> F = A * B; F # Multiplies matrix A by matrix B element-wise.
> G = A %*% t(B); G # Computes the matrix product of A and the transpose of B.

File operations
Reading Data.
The two common functions used in reading data are
scan() and read.table(), i.e.
> Sales = scan(file = "filepath")
> SALES = read.table(file = "filepath", header = FALSE)
To merge two data frames according to their common column
names or row names, we use the function merge(). It takes two
data frames at a time, so a third is merged in a second call;
> AB = merge(x, y) # merging x and y on their common columns.
> ABC = merge(AB, z) # merging in a third data frame.
To write an output into a file, use the function write(),
i.e.
> write(x, file = "file path to save the data", ncolumns = 1, append = FALSE)
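The reading and writing functions can be tried without any existing data file by round-tripping through a temporary file. In the sketch below, the temporary path stands in for "filepath", and the data frames x and y with their shared "city" column are hypothetical.

```r
tmp <- tempfile(fileext = ".txt")   # temporary path standing in for "filepath"

# Write one number per line, then read it back in two ways
write(c(200, 300, 320), file = tmp, ncolumns = 1, append = FALSE)
Sales <- scan(file = tmp)                        # numeric vector
SALES <- read.table(file = tmp, header = FALSE)  # data frame with column V1
unlink(tmp)                                      # remove the temporary file

# merge() joins two data frames on the columns they share (here "city")
x <- data.frame(city = c("Delhi", "Chennai"), sat = c(300, 320))
y <- data.frame(city = c("Chennai", "Delhi"), sun = c(200, 230))
AB <- merge(x, y)
AB
```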

Some probability distributions I
Normal distribution
Full list and options are found with the > help(Normal) command.
dnorm, the density function
> x <- seq(-20, 20, by = .1)
> y <- dnorm(x)
> plot(x, y)
pnorm, the cumulative distribution function
> x <- seq(-20, 20, by = .1)
> y <- pnorm(x, mean = 3, sd = 4)
> plot(x, y)

Some probability distributions II
qnorm
The next function we look at is qnorm, which is the inverse of
pnorm. The idea behind qnorm is that you give it a probability,
and it returns the number whose cumulative probability matches
it.
> x <- seq(0, 1, by = .05)
> y <- qnorm(x)
> plot(x, y)
> y <- qnorm(x, mean = 3, sd = 2)
> plot(x, y)
rnorm, which generates random numbers
> y <- rnorm(200, mean = -2)
> hist(y)
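That qnorm is the inverse of pnorm can be checked numerically: feeding the quantiles back into pnorm should return the original probabilities. A minimal sketch, with probabilities chosen arbitrarily:

```r
p <- c(0.025, 0.5, 0.975)            # arbitrary probabilities

q <- qnorm(p, mean = 3, sd = 2)      # quantiles of N(mean = 3, sd = 2)
back <- pnorm(q, mean = 3, sd = 2)   # should recover p

q
back

# For the standard normal, the 97.5% quantile is the familiar 1.96
z975 <- qnorm(0.975)
round(z975, 2)
```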

t-distribution I
dt, Help at > help(TDist)
> x <- seq(-20, 20, by = .5)
> y <- dt(x, df = 10)
pt
> x = c(-3, -4, -2, -1)
> pt((mean(x) - 2)/sd(x), df = 20)
qt
> v <- c(0.005, .025, .05)
> qt(v, df = 253)
rt
> rt(3, df = 10)

Chi-square I
dchisq, Help at > help(Chisquare)
> x <- seq(-20, 20, by = .5)
> y <- dchisq(x, df = 10) # The density is 0 for negative x.
pchisq
> x = c(2, 4, 5, 6)
> pchisq(x, df = 20)
qchisq
> v <- c(0.005, .025, .05); qchisq(v, df = 253)
rchisq
> rchisq(3, df = 20)
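The same d/p/q/r prefixes apply to the Binomial, Poisson and Uniform distributions named in the topics list. The sketch below uses parameter values picked purely for illustration.

```r
# Binomial: X ~ Binomial(size = 10, prob = 0.5)
b.point <- dbinom(3, size = 10, prob = 0.5)   # P(X = 3)
b.cum <- pbinom(3, size = 10, prob = 0.5)     # P(X <= 3)

# Poisson: X ~ Poisson(lambda = 4)
p.point <- dpois(2, lambda = 4)               # P(X = 2)
p.cum <- ppois(2, lambda = 4)                 # P(X <= 2)

# Uniform on [0, 1]
u.cum <- punif(0.3, min = 0, max = 1)         # P(X <= 0.3)
u.q <- qunif(0.3, min = 0, max = 1)           # 30% quantile

c(b.point, b.cum, p.point, p.cum, u.cum, u.q)
```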

Testing of Hypotheses I
In the one-sample t-test, we are comparing a sample mean to
a known or hypothesized population mean. The null
hypothesis is that the expected difference between the
sample mean and the population mean is zero, or in other
words, that the expected value of the sample mean is equal to
the population mean.
t-test example 1.
> rnorm1 = rnorm(50, 500, 100); rnorm1 # Generate 50 random numbers from a normal distribution with µ = 500, σ = 100.
> summary(rnorm1) # Gives the descriptive summary.
In the run shown here the sample mean was 517.3 (your own value
will differ because the numbers are random). This is higher than
µ = 500, but is it significantly higher? To test this hypothesis,
we use the function t.test().
> t.test(rnorm1, mu = 500)
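What t.test() computes for one sample can be reproduced by hand from the sample mean, the standard deviation and the pt function. The sketch below fixes an arbitrary seed (42, an assumption) so that the run is repeatable.

```r
set.seed(42)   # arbitrary seed, only to make the example repeatable
rnorm1 <- rnorm(50, mean = 500, sd = 100)

res <- t.test(rnorm1, mu = 500)

# The same test by hand
n <- length(rnorm1)
t.manual <- (mean(rnorm1) - 500) / (sd(rnorm1) / sqrt(n))  # t statistic
p.manual <- 2 * pt(-abs(t.manual), df = n - 1)             # two-sided p-value

c(by.hand = t.manual, t.test = unname(res$statistic))
c(by.hand = p.manual, t.test = res$p.value)
```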

Testing of Hypotheses II
Load the in-built library datasets.
t-test example 2.
> library(datasets)
> head(mtcars)
Quiz
Assuming that the data in mtcars follows the normal distribution,
find the 95% confidence interval estimate of the difference between
the mean gas mileage of manual and automatic transmissions.

Testing of Hypotheses III
Solution
> L = mtcars$am == 0
> mpg.auto = mtcars[L, ]$mpg; mpg.auto # automatic transmission mileage.
> mpg.manual = mtcars[!L, ]$mpg; mpg.manual # manual transmission mileage.
> t.test(mpg.auto, mpg.manual)
Answer
In mtcars, the mean mileage of automatic transmission is 17.147
mpg and that of manual transmission is 24.392 mpg. The 95%
confidence interval of the difference in mean gas mileage (manual
minus automatic) is between 3.2097 and 11.2802 mpg. t.test() prints
it as -11.2802 to -3.2097 because it computes automatic minus manual.
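The figures in the answer can be pulled out of the t.test() result instead of being read off the printed output. A minimal sketch (res and ci.manual.minus.auto are illustrative names):

```r
L <- mtcars$am == 0
res <- t.test(mtcars$mpg[L], mtcars$mpg[!L])   # automatic vs. manual

res$estimate   # the two group means (automatic, then manual)
res$conf.int   # 95% interval for automatic minus manual (negative)

# Flip the sign and order to express it as manual minus automatic
ci.manual.minus.auto <- -rev(res$conf.int)
round(ci.manual.minus.auto, 4)
```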

Testing of Hypotheses IV
Chi-squared test
Consider the following example;
> female = c(18, 102)
> male = c(10, 110)
> migraine = cbind(female, male); migraine
> chisq.test(migraine)
From the results, the chi-square test fails to reject the null
hypothesis that gender and migraine susceptibility are
independent. If the ratio of female to male
migraine sufferers is indeed 18:6, then our result is not what we had
hoped for. We may have an unusual sample or we may simply need
a larger sample to obtain statistical significance.
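The conclusion can be checked by looking inside the test result: the expected counts under independence and the p-value. In the sketch below the row labels are an assumption (the slide does not say which row holds the migraine sufferers), and note that chisq.test() applies Yates' continuity correction to 2 x 2 tables by default.

```r
female <- c(18, 102)
male <- c(10, 110)
migraine <- cbind(female, male)
rownames(migraine) <- c("migraine", "no.migraine")  # assumed labels

res <- chisq.test(migraine)   # Yates' correction is on by default for 2 x 2

res$expected    # counts expected if gender and migraine were independent
res$statistic   # chi-squared statistic
res$p.value     # above 0.05, so independence is not rejected
```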

Regression I
Simple Linear Regression
The following example uses the grade point averages (GPAs) of
20 students along with the number of hours each student studies
per week. We then use the lm() function to find the slope and
intercept terms.
> Hours <- c(10, 12, 10, 15, 14, 12, 13, 15, 16, 14, 13, 12, 11, 10, 13, 13, 14, 18, 17, 14)
> GPA <- c(3.33, 2.92, 2.56, 3.08, 3.57, 3.31, 3.45, 3.93, 3.82, 3.70, 3.26, 3.00, 2.74, 2.85, 3.33, 3.29, 3.58, 3.85, 4.00, 3.50)
> results <- lm(GPA ~ Hours); summary(results)
The correct interpretation of the regression equation is that a
student who did not study would have an estimated GPA of
1.3728, and that for every 1-hour increase in study time, the
estimated GPA would increase by 0.1489 points. The intercept is
an extrapolation, since no student in the data studied fewer
than 10 hours.
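The topics list also names the sum of squared residuals and residual plots, both of which follow directly from the fitted model. A minimal sketch using the same Hours and GPA data (ssr is an illustrative name):

```r
Hours <- c(10, 12, 10, 15, 14, 12, 13, 15, 16, 14,
           13, 12, 11, 10, 13, 13, 14, 18, 17, 14)
GPA <- c(3.33, 2.92, 2.56, 3.08, 3.57, 3.31, 3.45, 3.93, 3.82, 3.70,
         3.26, 3.00, 2.74, 2.85, 3.33, 3.29, 3.58, 3.85, 4.00, 3.50)

results <- lm(GPA ~ Hours)
coef(results)                       # intercept and slope

# Sum of squared residuals, computed directly and via deviance()
ssr <- sum(residuals(results)^2)
c(ssr, deviance(results))

# Residual plot: residuals against fitted values, with a zero line
plot(fitted(results), residuals(results),
     xlab = "Fitted GPA", ylab = "Residual")
abline(h = 0, lty = 2)
```

A residual plot with no visible curve or funnel shape around the zero line is usually taken as a sign that a straight-line model is reasonable.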

Regression II
Example
Consider the data frame on cars, "mtcars", in library(datasets).
Test whether "mpg" (miles per gallon) is affected by
"disp" (displacement) and "hp" (horsepower).
> myvariables = c("mpg", "disp", "hp")
> mycars = mtcars[myvariables]
> mymodel = lm(mpg ~ disp + hp, data = mycars)
> summary(mymodel)
From the summary of the model, both estimated coefficients are
negative: holding the other variable fixed, an increase in
either "disp" (engine displacement) or "hp" (horsepower) is
associated with a decrease in the dependent variable, miles
per gallon.
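The signs of the coefficients, and how precisely each one is estimated, can be read from the coefficient table and the confidence intervals rather than from the printed summary alone. A minimal sketch (est is an illustrative name):

```r
mymodel <- lm(mpg ~ disp + hp, data = mtcars)

# Coefficient table: estimates, standard errors, t values and p-values
est <- coef(summary(mymodel))
est[, c("Estimate", "Pr(>|t|)")]

# 95% confidence intervals; an interval that includes 0 means the
# direction of that effect is not firmly established by this data
confint(mymodel)
```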

Regression III
Another important graphical way to assess the relationship
between variables is the "conditioning plot", as shown
below.
conditioning plot
> coplot(mpg ~ disp | factor(am, levels = 0:1, labels = c("Automatic", "Manual")), data = mtcars, panel = panel.smooth, rows = 1) # data = mtcars, because mycars does not contain the "am" column.
Regression model prediction
> new = data.frame(disp = c(100, 200, 199), hp = c(121, 124, 100))
> mpg.predicted = predict(mymodel, new); mpg.predicted
> data.frame(mpg.predicted, new)
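What predict() returns is simply the fitted equation evaluated at the new values, which can be verified by hand from the coefficients. A minimal sketch (b and by.hand are illustrative names):

```r
mymodel <- lm(mpg ~ disp + hp, data = mtcars)
new <- data.frame(disp = c(100, 200, 199), hp = c(121, 124, 100))

# Prediction by hand: intercept + slope.disp * disp + slope.hp * hp
b <- coef(mymodel)
by.hand <- b["(Intercept)"] + b["disp"] * new$disp + b["hp"] * new$hp

mpg.predicted <- predict(mymodel, new)

data.frame(new, by.hand = unname(by.hand), predicted = unname(mpg.predicted))
```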
