R Language Introduction

Khaled El-Sham’aa

Session Road Map: First Steps · Importing Data into R · R Basics · Data Visualization · Correlation & Regression · t-Test · Chi-squared Test · ANOVA · PCA · Clustering · Time Series · Programming · Publication-Quality Output
First Steps (1)
R is one of the most popular platforms for data analysis and visualization currently available. It is free and open source software: http://www.r-project.org
Take advantage of its coverage and availability of new, cutting-edge applications and techniques. R will enable us to develop and distribute solutions to our NARS with no hidden license cost.

First Steps (2) [screenshot]

First Steps (3)
5 * 4
[1] 20
a <- (3 * 7) + 1
a
[1] 22
b <- c(1, 2, 3, 5, 8)
b * 2
[1]  2  4  6 10 16
b[4]
[1] 5
b[1:3]
[1] 1 2 3
b[c(1, 3, 5)]
[1] 1 3 8
b[b > 4]
[1] 5 8

First Steps (4)
citation()
R Development Core Team (2009). R: A language and environment for statistical computing. R Foundation for Statistical Computing, Vienna, Austria. ISBN 3-900051-07-0, URL http://www.R-project.org.

First Steps (5)
If you know the name of the function you want help with, just type a question mark ? at the command-line prompt followed by the name of the function:
?read.table

First Steps (6)
Sometimes you cannot remember the precise name of the function, but you know the subject on which you want help. Use the help.search function with your query in double quotes, like this:
help.search("data input")

First Steps (7)
To see a worked example, just type the function name:
example(mean)
mean> x <- c(0:10, 50)
mean> xm <- mean(x)
mean> c(xm, mean(x, trim = 0.10))
[1] 8.75 5.50
mean> mean(USArrests, trim = 0.2)
  Murder  Assault UrbanPop     Rape
    7.42   167.60    66.20    20.16

First Steps (8)
There are hundreds of contributed packages for R, written by many different authors (to implement specialized statistical methods). Most are available for download from CRAN (http://CRAN.R-project.org).
List all available packages: library()
Load the "ggplot2" package: library(ggplot2)
Documentation on a package: library(help=ggplot2)
Importing Data into R (1)
data <- read.table("D:/path/file.txt", header=TRUE)
data <- read.csv(file.choose(), header=TRUE, sep=";")
data <- edit(data)
fix(data)
head(data)
tail(data)
tail(data, 10)

Importing Data into R (2)
In order to refer to a vector by name within an R session, you need to attach the dataframe containing the vector. Alternatively, you can refer to the dataframe name and the vector name within it, using the element name operator $ like this: mtcars$mpg
?mtcars
attach(mtcars)
mpg

Importing Data into R (3) [screenshot]

Importing Data into R (4)
# Read data left on the clipboard
data <- read.table("clipboard", header=TRUE)
# ODBC
library(RODBC)
db1 <- odbcConnect("MY_DB", uid="usr", pwd="pwd")
raw <- sqlQuery(db1, "SELECT * FROM table1")
# XLSX
library(XLConnect)
xls <- loadWorkbook("my_file.xlsx", create=FALSE)
raw <- as.data.frame(readWorksheet(xls, sheet="Sheet1"))
R Basics (1)
max(x)      maximum value in x
min(x)      minimum value in x
mean(x)     arithmetic average of the values in x
median(x)   median value in x
var(x)      sample variance of x
sd(x)       standard deviation of x
cor(x, y)   correlation between vectors x and y
summary(x)  generic function used to produce summaries of the results of various functions

R Basics (2)
abs(x)                   absolute value of x
floor(2.718)             largest integer not greater than x
ceiling(3.142)           smallest integer not less than x
asin(x)                  inverse sine of x, in radians
round(2.718, digits=2)   returns 2.72
x <- 1:12; sample(x)     simple randomization
RCBD randomization: RCBD <- replicate(3, sample(x))

R Basics (3)
Common data transformations:
Nature of Data                          Transformation   R Syntax
Measurements (lengths, weights, etc.)   loge             log(x)
                                        log10            log(x, 10) or log10(x)
                                        log(x + 1)       log(x + 1)
Counts (number of individuals, etc.)    square root      sqrt(x)
Percentages (must be proportions)       arcsine          asin(sqrt(x)) * 180/pi
where x is the name of the vector (variable) whose values are to be transformed.

R Basics (4)
Vectorized computations: any function call or operator applied to a vector will automatically operate directly on all elements of the vector.
nchar(month.name)  # 7 8 5 5 3 4 4 6 9 7 8 8
The recycling rule: the shorter vector is replicated enough times so that the result has the length of the longer vector, then the operator is applied.
1:10 + 1:3  # 2 4 6 5 7 9 8 10 12 11

R Basics (5)
mydata <- matrix(rnorm(30), nrow=6)
mydata
# calculate the 6 row means
apply(mydata, 1, mean)
# calculate the 5 column means
apply(mydata, 2, mean)
apply(mydata, 2, mean, trim=0.2)

R Basics (6)
String functions:
substr(month.name, 2, 3)
paste("*", month.name[1:4], "*", sep=" ")
x <- toupper(dna.seq)
rna.seq <- chartr("T", "U", x)
comp.seq <- chartr("ACTG", "TGAC", dna.seq)

R Basics (7)
Surprisingly, the base installation doesn’t provide functions for skewness and kurtosis, but you can add your own:
m <- mean(x)
n <- length(x)
s <- sd(x)
skew <- sum((x - m)^3 / s^3) / n
kurt <- sum((x - m)^4 / s^4) / n - 3
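The snippet above can be wrapped into reusable functions. A minimal sketch; the names skewness and kurtosis are our own, and the moment-based formulas follow the slide, not any particular package:

```r
# Moment-based sample skewness and excess kurtosis,
# following the formulas on the slide (not from any package).
skewness <- function(x) {
  m <- mean(x); s <- sd(x); n <- length(x)
  sum((x - m)^3 / s^3) / n
}

kurtosis <- function(x) {
  m <- mean(x); s <- sd(x); n <- length(x)
  sum((x - m)^4 / s^4) / n - 3   # excess kurtosis (roughly 0 for normal data)
}

skewness(c(1, 2, 3, 4, 100))  # strongly right-skewed, so clearly positive
```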
Data Visualization (1)
pairs gives a matrix of scatter plots of every variable against every other:
?mtcars
pairs(mtcars)
Voilà!

Data Visualization (2)
pie(table(cyl))
barplot(table(cyl))

Data Visualization (3)
plot gives a scatter plot if x is continuous, and a box-and-whisker plot if x is a factor. Some people prefer the alternative syntax plot(y ~ x):
attach(mtcars)
plot(wt, mpg)
plot(cyl, mpg)
cyl <- factor(cyl)
plot(cyl, mpg)

Data Visualization (4) [figure]

Data Visualization (5)
Histograms show a frequency distribution:
hist(qsec, col="gray")

Data Visualization (6)
boxplot(qsec, col="gray")
boxplot(qsec, mpg, col="gray")

Data Visualization (7)
XY <- cbind(LAT, LONG)
plot(XY, type="l")
library(sp)
XY.poly <- Polygon(XY)
XY.pnt <- spsample(XY.poly, n=8, type="random")
XY.pnt
points(XY.pnt)

Data Visualization (8) [figure]
Correlation and Regression (1)
If you want to determine the significance of a correlation (i.e. the p-value associated with the calculated value of r), then use cor.test rather than cor.
cor(wt, mpg)
[1] -0.8676594
The value varies from -1 to +1: -1 indicates perfect negative correlation, +1 indicates perfect positive correlation, and 0 means no correlation.

Correlation and Regression (2)
cor.test(wt, qsec)
        Pearson's product-moment correlation
data: wt and qsec
t = -0.9719, df = 30, p-value = 0.3389
alternative hypothesis: true correlation is not equal to 0
95 percent confidence interval:
 -0.4933536  0.1852649
sample estimates:
       cor
-0.1747159

Correlation and Regression (3)
cor.test(wt, mpg)
        Pearson's product-moment correlation
data: wt and mpg
t = -9.559, df = 30, p-value = 1.294e-10
alternative hypothesis: true correlation is not equal to 0
95 percent confidence interval:
 -0.9338264 -0.7440872
sample estimates:
       cor
-0.8676594

Correlation and Regression (4) [figure]
Correlation and Regression (5)
lm fits a linear model with normal errors and constant variance; generally this is used for regression analysis with continuous explanatory variables.
fit <- lm(y ~ x)
summary(fit)
plot(x, y)
# Example of multiple linear regression
fit <- lm(y ~ x1 + x2 + x3)

Correlation and Regression (6)
Call:
lm(formula = mpg ~ wt)
Residuals:
    Min      1Q  Median      3Q     Max
-4.5432 -2.3647 -0.1252  1.4096  6.8727
Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)  37.2851     1.8776  19.858  < 2e-16 ***
wt           -5.3445     0.5591  -9.559 1.29e-10 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 3.046 on 30 degrees of freedom
Multiple R-squared: 0.7528, Adjusted R-squared: 0.7446
F-statistic: 91.38 on 1 and 30 DF, p-value: 1.294e-10

Correlation and Regression (7)
The great thing about graphics in R is that it is extremely straightforward to add things to your plots. In the present case, we might want to add a regression line through the cloud of data points. The function for this is abline, which can take as its argument the linear model object:
abline(fit)
Note: abline(a, b) adds a line with an intercept of a and a slope of b.

Correlation and Regression (8)
plot(wt, mpg, xlab="Weight", ylab="Miles/Gallon")
abline(fit, col="blue", lwd=2)
text(4, 25, "mpg = 37.29 - 5.34 wt")

Correlation and Regression (9)
predict is a generic built-in function for predictions from the results of various model-fitting functions:
predict(fit, list(wt = 4.5))
[1] 13.23500
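predict can also return interval estimates around a prediction. A minimal sketch on mtcars (the interval argument is part of predict.lm; the choice of wt = 4.5 mirrors the slide):

```r
data(mtcars)
fit <- lm(mpg ~ wt, data = mtcars)

# Point prediction plus a 95% confidence interval for the mean response
predict(fit, newdata = data.frame(wt = 4.5), interval = "confidence")

# A 95% prediction interval for a single new observation (always wider)
predict(fit, newdata = data.frame(wt = 4.5), interval = "prediction")
```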
Correlation and Regression (10) [figure]
Correlation and Regression (11)
What do you do if you identify problems? There are four approaches to dealing with violations of regression assumptions:
- Deleting observations
- Transforming variables
- Adding or deleting variables
- Using another regression approach

Correlation and Regression (12)
You can compare the fit of two nested models using the anova() function in the base installation. A nested model is one whose terms are completely included in the other model.
fit1 <- lm(y ~ A + B + C)
fit2 <- lm(y ~ A + C)
anova(fit1, fit2)
If the test is not significant (i.e. p > 0.05), we conclude that B does not add to the linear prediction, and we are justified in dropping it from our model.
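A concrete sketch of the nested-model comparison on mtcars (the choice of predictors is ours, purely illustrative):

```r
data(mtcars)
fit1 <- lm(mpg ~ wt + disp + qsec, data = mtcars)  # full model
fit2 <- lm(mpg ~ wt + qsec, data = mtcars)         # nested model: drops disp
cmp <- anova(fit2, fit1)
cmp  # the Pr(>F) column tests whether disp improves the fit
```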
Correlation and Regression (13)
# Bootstrap 95% CI for R-squared
library(boot)
rsq <- function(formula, data, indices) {
  fit <- lm(formula, data = data[indices, ])
  return(summary(fit)$r.square)
}
rs <- boot(data=mtcars, statistic=rsq, R=1000, formula=mpg ~ wt + disp)
boot.ci(rs, type="bca")  # try print(rs) and plot(rs)
t-Test (1)
Comparing two sample means with normal errors (Student's t-test, t.test):
t.test(a, b)
t.test(a, b, paired = TRUE)
# alternative argument options:
# "two.sided", "less", "greater"
a <- qsec[cyl == 4]
b <- qsec[cyl == 6]
c <- qsec[cyl == 8]

t-Test (2)
t.test(a, b)
        Welch Two Sample t-test
data: a and b
t = 1.4136, df = 12.781, p-value = 0.1814
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 -0.6159443  2.9362040
sample estimates:
mean of x mean of y
 19.13727  17.97714

t-Test (3)
t.test(a, c)
        Welch Two Sample t-test
data: a and c
t = 3.9446, df = 17.407, p-value = 0.001005
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 1.102361 3.627899
sample estimates:
mean of x mean of y
 19.13727  16.77214

t-Test (4)
(a) Test the equality-of-variances assumption:
ev <- var.test(a, c)$p.value
(b) Test the normality assumption:
an <- shapiro.test(a)$p.value
bn <- shapiro.test(c)$p.value
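These assumption checks can drive the choice of test. A minimal sketch on mtcars (the p > 0.05 cut-off is a common convention, not a rule from the slides):

```r
data(mtcars)
a <- mtcars$qsec[mtcars$cyl == 4]
c <- mtcars$qsec[mtcars$cyl == 8]

ev <- var.test(a, c)$p.value   # equality-of-variances check
# If the variances look equal, a pooled (Student) t-test is justified;
# otherwise keep t.test's default Welch correction.
res <- t.test(a, c, var.equal = (ev > 0.05))
res$p.value
```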
Chi-squared Test (1)
Construct hypotheses based on qualitative (categorical) data:
myTable <- table(am, cyl)
myTable
           cyl
am          4 6  8
  automatic 3 4 12
  manual    8 3  2

Chi-squared Test (2)
chisq.test(myTable)
        Pearson's Chi-squared test
data: myTable
X-squared = 8.7407, df = 2, p-value = 0.01265
The expected counts under the null hypothesis:
chisq.test(myTable)$expected
           cyl
am                4       6      8
  automatic 6.53125 4.15625 8.3125
  manual    4.46875 2.84375 5.6875
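The expected counts come straight from the row and column totals. A sketch reproducing them by hand (the contingency table is hard-coded from the slide):

```r
# Contingency table as on the slide (am = transmission, cyl = cylinders)
myTable <- matrix(c(3, 8, 4, 3, 12, 2), nrow = 2,
                  dimnames = list(am = c("automatic", "manual"),
                                  cyl = c("4", "6", "8")))

# Expected count per cell: (row total * column total) / grand total
expected <- outer(rowSums(myTable), colSums(myTable)) / sum(myTable)
expected  # matches chisq.test(myTable)$expected
```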
Chi-squared Test (3)
mosaicplot(myTable, color=rainbow(3))
ANOVA (1)
Analysis of variance is a method which partitions the total variation in the response into its components (sources of variation) in the model.
table(N, S, Rep)
N <- factor(N)
S <- factor(S)
Rep <- factor(Rep)

ANOVA (2)
The best way to understand the two significant interaction terms is to plot them using interaction.plot, like this:
interaction.plot(S, N, Yield)

ANOVA (3)
boxplot(Yield ~ N, col="gray")

ANOVA (4)
model <- aov(Yield ~ N * S)  # CRD
summary(model)
            Df Sum Sq Mean Sq F value    Pr(>F)
N            2 4.5818  2.2909 42.7469 1.230e-08 ***
S            3 0.9798  0.3266  6.0944  0.003106 **
N:S          6 0.6517  0.1086  2.0268  0.101243
Residuals   24 1.2862  0.0536
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

ANOVA (5)
par(mfrow = c(2, 2))
plot(model)
ANOVA assumptions: normality, linearity, constant variance, independence.

ANOVA (6)
model.tables(model, "means")
Tables of means
Grand mean
1.104722
 N
     0    180    230
0.6025 1.3142 1.3975
 S
     0     10     20     40
0.8289 1.1556 1.1678 1.2667
 N:S
      S
N          0     10     20     40
  0   0.5600 0.7733 0.5233 0.5533
  180 0.8933 1.2900 1.5267 1.5467
  230 1.0333 1.4033 1.4533 1.7000

ANOVA (7)
model.tables(model, se=TRUE)
.......
Standard errors for differences of means
             N      S    N:S
        0.0945 0.1091 0.1890
replic.     12      9      3
plot.design(Yield ~ N * S)

ANOVA (8)
mc <- TukeyHSD(model, "N", ordered = TRUE); mc
  Tukey multiple comparisons of means
    95% family-wise confidence level
    factor levels have been ordered
Fit: aov(formula = Yield ~ N * S)
$N
              diff        lwr       upr     p adj
180-0   0.71166667  0.4756506 0.9476827 0.0000003
230-0   0.79500000  0.5589840 1.0310160 0.0000000
230-180 0.08333333 -0.1526827 0.3193494 0.6567397
ANOVA (9)
plot(mc)
ANOVA (10)
summary(aov(Yield ~ N * S + Error(Rep)))  # RCB
Error: Rep
          Df  Sum Sq Mean Sq F value Pr(>F)
Residuals  2 0.30191 0.15095
Error: Within
            Df Sum Sq Mean Sq F value    Pr(>F)
N            2 4.5818  2.2909 51.2035 5.289e-09 ***
S            3 0.9798  0.3266  7.3001  0.001423 **
N:S          6 0.6517  0.1086  2.4277  0.059281 .
Residuals   22 0.9843  0.0447
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

ANOVA (11)
In a split-plot design, different treatments are applied to plots of different sizes. Each different plot size is associated with its own error variance. The model formula is specified as a factorial, using the asterisk notation. The error structure is defined in the Error term, with the plot sizes listed from left to right, from largest to smallest, each variable separated by the slash operator /.
model <- aov(Yield ~ N * S + Error(Rep/N))

ANOVA (12)
Error: Rep
          Df  Sum Sq Mean Sq F value Pr(>F)
Residuals  2 0.30191 0.15095
Error: Rep:N
          Df Sum Sq Mean Sq F value   Pr(>F)
N          2 4.5818 2.29088  55.583 0.001206 **
Residuals  4 0.1649 0.04122
Error: Within
            Df  Sum Sq Mean Sq F value   Pr(>F)
S            3 0.97983 0.32661  7.1744 0.002280 **
N:S          6 0.65171 0.10862  2.3860 0.071313 .
Residuals   18 0.81943 0.04552

ANOVA (13)
Analysis of covariance:
# f is the treatment factor
# x is a variate acting as covariate
model <- aov(y ~ x * f)
Split both main effects into linear and quadratic parts:
contrasts <- list(N = list(lin=1, quad=2), S = list(lin=1, quad=2))
summary(model, split=contrasts)
PCA (1)
The idea of principal components analysis (PCA) is to find a small number of linear combinations of the variables so as to capture most of the variation in the dataframe as a whole.
d2 <- cbind(wt, disp/10, hp/10, mpg, qsec)
colnames(d2) <- c("wt", "disp", "hp", "mpeg", "qsec")

PCA (2)
model <- prcomp(d2)
model
Standard deviations:
[1] 14.6949595 3.9627722 2.8306355 1.1593717
Rotation:
              PC1         PC2         PC3         PC4
wt    -0.05887539  0.05015401 -0.07513271 -0.16910728
disp  -0.83186362  0.47519625  0.28005113  0.04080894
hp    -0.40572567 -0.83180078  0.24611265 -0.28768795
mpeg   0.36888799  0.12190490  0.91398919 -0.09385946
qsec   0.06200759  0.25479354 -0.14134625 -0.93710373

PCA (3)
summary(model)
Importance of components:
                          PC1     PC2     PC3
Standard deviation     14.6950 3.96277 2.83064
Proportion of Variance  0.8957 0.06514 0.03323
Cumulative Proportion   0.8957 0.96082 0.99405

PCA (4)
plot(model)
biplot(model)
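The slide rescales disp and hp by hand so that no single variable dominates. An alternative sketch lets prcomp standardize every variable instead (scale. is a real prcomp argument; the variable selection mirrors the slide):

```r
data(mtcars)
d2 <- mtcars[, c("wt", "disp", "hp", "mpg", "qsec")]

# scale.=TRUE standardizes each column to unit variance, so no manual
# rescaling (disp/10, hp/10) is needed.
model <- prcomp(d2, scale. = TRUE)
summary(model)           # proportion of variance per component
round(model$rotation, 3) # loadings for all five components
```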
Clustering (1)
We define similarity on the basis of the distance between two samples in this n-dimensional space. Several different distance measures could be used to work out the distance from every sample to every other sample. This quantitative dissimilarity structure of the data is stored in a matrix produced by the dist function:
rownames(d2) <- rownames(mtcars)
my.dist <- dist(d2, method="euclidean")

Clustering (2)
Initially, each sample is assigned to its own cluster, and then the hclust algorithm proceeds iteratively, at each stage joining the two most similar clusters, continuing until there is just a single cluster (see ?hclust for details).
my.hc <- hclust(my.dist, "ward")

Clustering (3)
We can plot the object called my.hc, and we specify that the leaves of the hierarchy are labeled by their plot numbers:
plot(my.hc, hang=-1)
g <- rect.hclust(my.hc, k=4, border="red")
Note: when the hang argument is set to -1, all leaves end on one line and their labels hang down from 0.
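Beyond drawing rectangles, cutree returns the actual cluster membership for each sample. A minimal sketch on simulated data (the data are invented so the block is self-contained; newer R spells the Ward method "ward.D"):

```r
set.seed(42)
# Two well-separated simulated groups of 2-D points
xy <- rbind(matrix(rnorm(20), ncol = 2),
            matrix(rnorm(20, mean = 10), ncol = 2))

hc <- hclust(dist(xy), method = "ward.D")
groups <- cutree(hc, k = 2)  # integer cluster label per row
table(groups)
```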
Clustering (4) [figure]
Clustering (5)
k-means partitions the data into a number of clusters specified by the user.
gr <- kmeans(cbind(disp, hp), 2)
plot(disp, hp, col = gr$cluster, pch=19)
points(gr$centers, col = 1:2, pch = 8, cex=2)

Clustering (6) [figure]

Clustering (7)
K-means clustering with 2 clusters of sizes 18, 14
Cluster means:
      disp        hp
1 135.5389  98.05556
2 353.1000 209.21429
Clustering vector:
 [1] 1 1 1 1 2 1 2 1 1 1 1 2 2 2 2 2 2 1 1 1 1 2 2 2 2 1 1 1 2 1 2 1
Within cluster sum of squares by cluster:
[1] 58369.27 93490.74
 (between_SS / total_SS = 75.6 %)

Clustering (8)
x <- as.matrix(mtcars)
heatmap(x, scale="column")
Time Series (1)
First, make the data variable into a time-series object:
# create time-series objects
beer <- ts(beer, start=1956, freq=12)
It is useful to be able to decompose a time series into components. The function stl performs seasonal decomposition of a time series into seasonal, trend and irregular components using loess.

Time Series (2)
The remainder component is the residuals from the seasonal-plus-trend fit. The bars at the right-hand side are of equal heights (in user coordinates).
# Decompose a time series into seasonal,
# trend and irregular components using loess
ts.comp <- stl(beer, s.window="periodic")
plot(ts.comp)
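The beer series is not shipped with base R, so here is a self-contained sketch of the same decomposition on a simulated monthly series (the trend and seasonal pattern are invented purely for illustration):

```r
set.seed(1)
# Simulated monthly series: upward trend + annual cycle + noise
n <- 120
x <- 0.05 * (1:n) + 2 * sin(2 * pi * (1:n) / 12) + rnorm(n, sd = 0.3)
y <- ts(x, start = 1956, frequency = 12)

ts.comp <- stl(y, s.window = "periodic")
head(ts.comp$time.series)  # columns: seasonal, trend, remainder
```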
Time Series (3) [figure]
Programming (1)
We can extend the functionality of R by writing a function that estimates the standard error of the mean:
SEM <- function(x, na.rm = FALSE) {
  if (na.rm == TRUE) VAR <- x[!is.na(x)] else VAR <- x
  SD <- sd(VAR)
  N <- length(VAR)
  SE <- SD / sqrt(N - 1)
  return(SE)
}

Programming (2)
You can define your own operator of the form %any%, using any text string in place of "any". The function should be a function of two arguments:
"%p%" <- function(x, y) paste(x, y, sep=" ")
"Hi" %p% "Khaled"
[1] "Hi Khaled"

Programming (3)
setwd("path/to/folder")
sink("output.txt")
  cat("Intercept \t Slope")
  a <- fit$coefficients[[1]]
  b <- fit$coefficients[[2]]
  cat(paste(a, b, sep="\t"))
sink()
jpeg(filename="graph.jpg", width=600, height=600)
plot(wt, mpg); abline(fit)
dev.off()

Programming (4)
The code for R functions can be viewed, and in most cases modified if so desired, using the fix() function. You can trigger garbage collection by calling the gc() function, which will report a few memory-usage statistics. The basic tool for code timing is system.time(commands). tempfile() gives a unique file name in a temporary writable directory, deleted at the end of the session.

Programming (5)
Take control of your R code! RStudio is a free and open-source integrated development environment for R. You can run it on your desktop (Windows, Mac, or Linux):
- Syntax highlighting, code completion, etc.
- Execute R code directly from the source editor
- Workspace browser and data viewer
- Plot history, zooming, and flexible image & PDF export
- Integrated R help and documentation
- and more (http://www.rstudio.com/ide/)
Programming (6) [screenshot]
Programming (7)
If we want to evaluate the quadratic x² − 2x + 4 many times, we can write a function that evaluates it for a specific value of x:
my.f <- function(x) { x^2 - 2*x + 4 }
my.f(3)
[1] 7
plot(my.f, -10, +10)

Programming (8) [figure]

Programming (9)
We can find the minimum of the function using:
optimize(my.f, lower = -10, upper = 10)
$minimum
[1] 1
$objective
[1] 3
which says that the minimum occurs at x = 1, where the quadratic has value 3.

Programming (10)
We can integrate the function over the interval -10 to 10 using:
integrate(my.f, lower = -10, upper = 10)
746.6667 with absolute error < 4.1e-12
which gives an answer together with an estimate of the absolute error.

Programming (11)
plot(my.f, -15, +15)
v <- seq(-10, 10, 0.01)
x <- c(-10, v, 10)
y <- c(0, my.f(v), 0)
polygon(x, y, col="gray")
Publication-Quality Output (1)
Research doesn’t end when the last statistical analysis is completed. We need to include the results in a report. The xtable function converts an R object to an xtable object, which can then be printed as a LaTeX table. LaTeX is a document preparation system for high-quality typesetting (http://www.latex-project.org).
library(xtable)
print(xtable(model))

Publication-Quality Output (2)
library(xtable)
example(aov)
print(xtable(npk.aov))

Publication-Quality Output (3)
The ggplot2 package is an elegant alternative to the base graphics system; it has two complementary uses:
- Producing publication-quality graphics using very simple syntax that is similar to that of base graphics. ggplot2 tends to make smart default choices for color, scale, etc.
- Making more sophisticated/customized plots that go beyond the defaults.
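A minimal ggplot2 sketch (assuming the package is installed; it is not part of base R) that reproduces the earlier weight-vs-mileage scatter with a fitted regression line:

```r
library(ggplot2)

p <- ggplot(mtcars, aes(x = wt, y = mpg)) +
  geom_point() +
  geom_smooth(method = "lm", se = TRUE) +  # regression line + confidence band
  labs(x = "Weight", y = "Miles/Gallon")

print(p)                 # draw it
ggsave("mpg_wt.pdf", p)  # export a publication-quality PDF
```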
Publication-Quality Output (4) [figure]
Final words!
How large is your family? How many brothers and sisters are there in your family, including yourself? The average number of children in families was about 2. Can you explain the difference between this value and the class average?
Birthday problem! The problem is to compute the approximate probability that, in a room of n people, at least two have the same birthday.
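The birthday probability can be computed directly in R. A minimal sketch using the explicit product formula, checked against the pbirthday helper that ships with base R's stats package:

```r
# P(at least two of n people share a birthday), 365 equally likely days
p.birthday <- function(n) {
  1 - prod((365 - 0:(n - 1)) / 365)
}

p.birthday(23)   # the classic result: just over 0.5
pbirthday(23)    # base R's built-in gives the same answer
```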
Online Resources
http://tryr.codeschool.com
http://www.r-project.org
http://www.statmethods.net
http://www.r-bloggers.com
http://www.r-tutor.com
http://blog.revolutionanalytics.com/r

Thank You