SlideShare una empresa de Scribd logo
1 de 59
Munging &
Visualizing
Data with R




Michael E. Driscoll
CTO, Metamarkets
@medriscoll

Xavier Léauté
Metamarkets
@xvrl

Barret Schloerke
Metamarkets
I.	
  A	
  Tour	
  of	
  R	
  
January	
  6,	
  2009	
  
R	
  is	
  a	
  tool	
  for…	
  
Data	
  Manipula?on	
  
•  connec$ng	
  to	
  data	
  sources	
  
•  slicing	
  &	
  dicing	
  data	
  

Modeling	
  &	
  Computa?on	
  
•  sta$s$cal	
  modeling	
  
•  numerical	
  simula$on	
  

Data	
  Visualiza?on	
  
•  visualizing	
  fit	
  of	
  models	
  
•  composing	
  sta$s$cal	
  graphics	
  
R	
  is	
  an	
  environment	
  
Its	
  interface	
  is	
  plain	
  
RStudio	
  to	
  the	
  rescue	
  
## load in some Insurance Claim data
                                    library(MASS)
                                    data(Insurance)
                                    Insurance <- edit(Insurance)


Let’s	
  take	
  a	
  tour	
  
                                    head(Insurance)
                                    dim(Insurance)

                                    ## plot it nicely using the ggplot2 package


of	
  some	
  data	
  in	
  R	
  
                                    library(ggplot2)
                                    qplot(Group, Claims/Holders,
                                          data=Insurance,
                                          geom="bar",
                                          stat='identity',
                                          position="dodge",
                                          facets=District ~ .,
                                          fill=Age,
                                          ylab="Claim Propensity",
                                          xlab="Car Group")

                                    ## hypothesize a relationship between Age ~ Claim Propensity
                                    ## visualize this hypothesis with a boxplot
                                    x11()
                                    library(ggplot2)
                                    qplot(Age, Claims/Holders,
                                          data=Insurance,
                                          geom="boxplot",
                                          fill=Age)

                                    ## quantify the hypothesis with linear model
                                    m <- lm(Claims/Holders ~ Age + 0, data=Insurance)
                                    summary(m)
R	
  is	
  “an	
  overgrown	
  calculator”	
  




sum(rgamma(rpois(1,lambda=2),shape=49,scale=.2)))
R	
  is	
  “an	
  overgrown	
  calculator”	
  
•  simple	
  math	
  
        > 2+2
        4


•  storing	
  results	
  in	
  variables	
  
        > x <- 2+2        ## ‘<-’ is R syntax for ‘=’ or assignment
        > x^2
        16


•  vectorized	
  math	
  
        > weight <- c(110, 180, 240)           ## three weights
        > height <- c(5.5, 6.1, 6.2)           ## three heights
        > bmi <- (weight*4.88)/height^2        ## divides element-wise
        17.7 23.6 30.4


	
  
R	
  is	
  “an	
  overgrown	
  calculator”	
  
•  basic	
  sta$s$cs	
  
           mean(weight)         sd(weight)              sqrt(var(weight))
           176.6                65.0                    65.0 # same as sd


•  set	
  func$ons	
  
           union                 intersect               setdiff


•  advanced	
  sta$s$cs	
  
       > pbinom(40, 100, 0.5)      ##   P that a coin tossed 100 times
       0.028                       ##   will comes up less than 40 heads



       > pshare <- pbirthday(23, 365, coincident=2)	
  
       0.530 ## probability that among 23 people, two share a birthday


	
  
Try	
  It!	
  #1	
  
                     	
  Overgrown	
  Calculator	
  
•  basic	
  calcula$ons	
  
    > 2 + 2                  [Hit	
  ENTER]
    > log(100)               [Hit	
  ENTER]

	
  
•  calculate	
  the	
  value	
  of	
  $100	
  aIer	
  10	
  years	
  at	
  5%	
  
    > 100 * exp(0.05*10) [Hit	
  ENTER]


•  construct	
  a	
  vector	
  &	
  do	
  a	
  vectorized	
  calcula$on	
  
    > year <- (1,2,5,10,25) [Hit	
  ENTER]	
  	
  	
  this	
  returns	
  an	
  error.	
  	
  why?	
  
    > year <- c(1,2,5,10,25) [Hit	
  ENTER]
    > 100 * exp(0.05*year)   [Hit	
  ENTER]	
  
    	
  
    	
  
R	
  as	
  a	
  Programming	
  Language	
  
                               fibonacci <- function(n) {
                                 fib <- numeric(n)
                                 fib [1:2] <- 1
                                 for (i in 3:n) {
                                     fib[i] <- fib[i-1] + fib[i-2]
                                 }
                                 return(fib[n])
Image from cover of Abelson
& Sussman’s textThe            }
Structure and Interpretation
of Computer Languages
Func$on	
  Calls	
  
•  There	
  are	
  ~	
  1100	
  built-­‐in	
  commands	
  in	
  the	
  R	
  
         “base”	
  package,	
  which	
  can	
  be	
  executed	
  on	
  the	
  
         command-­‐line.	
  	
  The	
  basic	
  structure	
  of	
  a	
  call	
  is	
  
         thus:	
  
	
  
     	
  output <- function(arg1, arg2, …)
	
  
•  Arithme$c	
  Opera$ons	
  
       +     -      *     /      ^	
  

•  R	
  func$ons	
  are	
  typically	
  vectorized	
  
          x <- x/3
  	
  works	
  whether	
  x	
  is	
  a	
  one	
  or	
  many-­‐valued	
  vector	
  
Data	
  Structures	
  in	
  R	
  
                      numeric	
      x <- c(0,2:4)
       vectors	
  




                                     y <- c(“alpha”, “b”, “c3”, “4”)
                     Character	
  



                      logical	
      z <- c(1, 0, TRUE, FALSE)




> class(x)
[1] "numeric"
> x2 <- as.logical(x)
> class(x2)
[1] “logical”
Data	
  Structures	
  in	
  R	
  
                         lists	
         lst <- list(x,y,z)
     objects	
  




                                         M <- matrix(rep(x,3),ncol=3)
                      matrices	
  


                   data	
  frames*	
     df <- data.frame(x,y,z)




> class(df)
[1] “data.frame"
Summary	
  of	
  Data	
  Structures	
  
                Linear                 Rectangular


                               ?	
  
Homogeneous
                    vectors	
              matrices	
  


Heterogeneous
                         lists	
         data	
  frames*	
  
R	
  is	
  a	
  numerical	
  simulator	
  	
  
•  built-­‐in	
  func$ons	
  for	
  
   classical	
  probability	
  
   distribu$ons	
  

•  let’s	
  simulate	
  10,000	
  
   trials	
  of	
  100	
  coin	
  flips.	
  	
  
   what’s	
  the	
  
   distribu$on	
  of	
  heads?	
  
   	
  
                                           > heads <- rbinom(10^5,100,0.50)
                                           > hist(heads)
Func$ons	
  for	
  Probability	
  Distribu$ons	
  
                                      ddist(	
  )	
            density	
  func$on	
  (pdf)	
  
                                      pdist(	
  )	
            cumula$ve	
  density	
  func$on	
  
                                      qdist(	
  )	
            quan$le	
  func$on	
  
                                      rdist(	
  )	
            random	
  deviates	
  

 Examples	
  
 Normal	
                                 dnorm,	
  pnorm,	
  qnorm,	
  rnorm	
  
 Binomial	
                               dbinom,	
  pbinom,	
  …	
  
 Poisson	
                                dpois,	
  …	
  
 >	
  pnorm(0)	
       	
  0.05	
  	
  
 >	
  qnorm(0.9)	
     	
  1.28	
  
 >	
  rnorm(100)	
     	
  vector	
  of	
  length	
  100	
  
 	
  
Func$ons	
  for	
  Probability	
  Distribu$ons	
  
                                                    distribu?on	
                 dist	
  suffix	
  in	
  R	
  
How	
  to	
  find	
  the	
  func?ons	
  for	
        Beta	
                        -­‐beta	
  
lognormal	
  distribu?on?	
  	
  	
                 Binomial	
                    -­‐binom	
  
	
                                                  Cauchy	
                      -­‐cauchy	
  
1)	
  Use	
  the	
  double	
  ques$on	
  mark	
     Chisquare	
                   -­‐chisq	
  
                                                    Exponen?al	
                  -­‐exp	
  
‘??’	
  to	
  search	
                              F	
                           -­‐f	
  
> ??lognormal                                       Gamma	
                       -­‐gamma	
  
	
                                                  Geometric	
                   -­‐geom	
  
2)	
  Then	
  iden$fy	
  the	
  package	
           Hypergeometric	
              -­‐hyper	
  
	
  >	
  ?Lognormal	
                               Logis?c	
                     -­‐logis	
  
                                                    Lognormal	
                   -­‐lnorm	
  
	
                                                  Nega?ve	
  Binomial	
  	
     -­‐nbinom	
  
3)	
  Discover	
  the	
  dist	
  func$ons	
  	
     Normal	
                      -­‐norm	
  
dlnorm, plnorm, qlnorm,                             Poisson	
                     -­‐pois	
  
rlnorm                                              Student	
  t	
  	
            -­‐t	
  
                                                    Uniform	
                     -­‐unif	
  
                                                    Tukey	
                       -­‐tukey	
  
                                                    Weibull	
                     -­‐weib	
  
                                                    Wilcoxon	
                    -­‐wilcox	
  
Try	
  It!	
  #2	
  
                        	
  Numerical	
  Simula$on	
  
•  simulate	
  1m	
  drivers	
  from	
  which	
  we	
  expect	
  4	
  claims	
  
       > numclaims <- rpois(n, lambda)
       (hint:	
  use	
  ?rpois to	
  understand	
  the	
  parameters)	
  


•  verify	
  the	
  mean	
  &	
  variance	
  are	
  reasonable
       > mean(numclaims)
       > var(numclaims)


•  visualize	
  the	
  distribu$on	
  of	
  claim	
  counts	
  
       > hist(numclaims)



       	
  

	
  
Gehng	
  Data	
  In	
  
	
  -­‐	
  from	
  Files	
  
   > Insurance <- read.csv(“Insurance.csv”,header=TRUE)

	
  	
  	
  from	
  Databases	
  
   > con <- dbConnect(driver,user,password,host,dbname)
   > Insurance <- dbSendQuery(con, “SELECT * FROM
     claims”)

	
  	
  	
  from	
  the	
  Web	
  
   > con <- url('http://labs.dataspora.com/test.txt')
   > Insurance <- read.csv(con, header=TRUE)

	
  	
  	
  	
  from	
  R	
  data	
  objects	
  
   > load(‘Insurance.Rda’)
Gehng	
  Data	
  Out	
  
•  to	
  Files	
  
       write.csv(Insurance,file=“Insurance.csv”)



•  to	
  Databases	
  
       con <- dbConnect(dbdriver,user,password,host,dbname)
       dbWriteTable(con, “Insurance”, Insurance)




	
  	
  	
  	
  	
  to	
  R	
  Objects	
  
       save(Insurance, file=“Insurance.Rda”)
Naviga$ng	
  within	
  the	
  R	
  environment	
  
•  lis$ng	
  all	
  variables	
  
    > ls()

•  examining	
  a	
  variable	
  ‘x’	
  
    >   str(x)
    >   head(x)
    >   tail(x)
    >   class(x)

•  removing	
  variables	
  
    > rm(x)
    > rm(list=ls())             # remove everything
Try	
  It!	
  #3	
  
                              	
  Data	
  Processing	
  	
  
•  load	
  data	
  &	
  view	
  it	
  
      library(MASS)
      head(Insurance)              ## the first 7 rows
      dim(Insurance)               ## number of rows & columns

•  write	
  it	
  out	
  
      write.csv(Insurance,file=“Insurance.csv”, row.names=FALSE)
      getwd() ## where am I?

•  view	
  it	
  in	
  Excel,	
  make	
  a	
  change,	
  save	
  it	
  
      remove the first district	
  

•  load	
  it	
  back	
  in	
  to	
  R	
  &	
  plot	
  it	
  
      Insurance <- read.csv(file=“Insurance.csv”)
      plot(Claims/Holders ~ Age, data=Insurance)
A	
  Swiss-­‐Army	
  Knife	
  for	
  Data	
  
A	
  Swiss-­‐Army	
  Knife	
  for	
  Data	
  
•  Indexing	
  
•  Three	
  ways	
  to	
  index	
  into	
  a	
  data	
  frame	
  
     –  array	
  of	
  integer	
  indices	
  
     –  array	
  of	
  character	
  names	
  
     –  array	
  of	
  logical	
  Booleans	
  
•  Examples:	
  
     df[1:3,]
     df[c(“New York”, “Chicago”),]
     df[c(TRUE,FALSE,TRUE,TRUE),]
     df[df$city == “New York”,]
A	
  Swiss-­‐Army	
  Knife	
  for	
  Data	
  
•  subset	
  –	
  extract	
  subsets	
  mee$ng	
  some	
  criteria	
  
      subset(Insurance, District==1)
      subset(Insurance, Claims < 20)
•  transform	
  –	
  add	
  or	
  alter	
  a	
  column	
  of	
  a	
  data	
  frame	
  
      transform(Insurance, Propensity=Claims/Holders)
•  cut	
  –	
  cut	
  a	
  con$nuous	
  value	
  into	
  groups
     cut(Insurance$Claims, breaks=c(-1,100,Inf),
    labels=c('lo','hi'))

•  Put	
  it	
  all	
  together:	
  create	
  a	
  new,	
  transformed	
  data	
  frame	
  
   transform(subset(Insurance, District==1),
      ClaimLevel=cut(Claims, breaks=c(-1,100,Inf),
      labels=c(‘lo’,’hi’)))	
  
A	
  Swiss-­‐Army	
  Knife	
  for	
  Data	
  
•  sqldf	
  –	
  a	
  library	
  that	
  allows	
  you	
  to	
  query	
  R	
  data	
  frames	
  as	
  if	
  they	
  
    were	
  SQL	
  tables.	
  	
  Par$cularly	
  useful	
  for	
  aggrega$ons.	
  

library(sqldf)
sqldf('select country, sum(revenue) revenue
       FROM sales
       GROUP BY country')

  country revenue
1      FR 307.1157
2      UK 280.6382
3     USA 304.6860
A	
  Sta$s$cal	
  Modeler	
  
•  R’s	
  has	
  a	
  powerful	
  modeling	
  syntax	
  
•  Models	
  are	
  specified	
  with	
  formulae,	
  like	
  	
  
                y ~ x
     growth ~ sun + water
    model	
  rela$onships	
  between	
  con$nuous	
  and	
  
     categorical	
  variables.	
  
•  Models	
  are	
  also	
  guide	
  the	
  visualiza$on	
  of	
  
   rela$onships	
  in	
  a	
  graphical	
  form	
  
A	
  Sta$s$cal	
  Modeler	
  
•  Linear	
  model	
  
     m <- lm(Claims/Holders ~ Age, data=Insurance)

•  Examine	
  it	
  
    summary(m)

•  Plot	
  it	
  
    plot(m)
A	
  Sta$s$cal	
  Modeler	
  
•  Logis$c	
  model	
  
    m <- glm(Age ~ Claims/Holders, data=Insurance,
           family=binomial(“logit”))

•  Examine	
  it	
  
    summary(m)

•  Plot	
  it	
  
    plot(m)
Try	
  It!	
  #4	
  
                    	
  Sta$s$cal	
  Modeling	
  
•  fit	
  a	
  linear	
  model	
  
       m <- lm(Claims/Holders ~ Age + 0, data=Insurance)


•  examine	
  it	
  	
  
       summary(m)
	
  

•  plot	
  it	
  
       plot(m)
Visualiza$on:	
  	
  	
  
 Mul$variate	
  
 Barplot	
  




library(ggplot2)
qplot(Group, Claims/Holders,
      data=Insurance,
      geom="bar",
      stat='identity',
      position="dodge",
      facets=District ~ .,
      fill=Age)
Visualiza$on:	
  	
  Boxplots	
  




library(ggplot2)             library(lattice)
qplot(Age, Claims/Holders,   bwplot(Claims/Holders ~ Age,
  data=Insurance,              data=Insurance)
  geom="boxplot“)	
  
Visualiza$on:	
  Histograms	
  




library(ggplot2)
                                   library(lattice)
qplot(Claims/Holders,              densityplot(~ Claims/Holders | Age,
 data=Insurance,
                                   data=Insurance, layout=c(4,1)
 facets=Age ~ ., geom="density")
Try	
  It!	
  #5	
  
                     	
  Data	
  Visualiza$on	
  
•  simple	
  line	
  chart	
  
       > x <- 1:10
       > y <- x^2
       > plot(y ~ x)


•  box	
  plot	
  
       > library(lattice)
       > boxplot(Claims/Holders ~ Age, data=Insurance)

	
  

•  visualize	
  a	
  linear	
  fit	
  
       > abline(0,1)
Gehng	
  Help	
  with	
  R	
  
Help	
  within	
  R	
  itself	
  for	
  a	
  func?on	
  
   > help(func)
      > ?func
For	
  a	
  topic	
  
      > help.search(topic)
      > ??topic
      	
  
•    search.r-­‐project.org	
  
•    Google	
  Code	
  Search	
  	
  www.google.com/codesearch	
  
•    Stack	
  Overflow	
  	
  hsp://stackoverflow.com/tags/R	
  	
  
•    R-­‐help	
  list	
  hsp://www.r-­‐project.org/pos$ng-­‐guide.html	
  	
  
Six	
  Indispensable	
  Books	
  on	
  R	
  
                Learning	
  R	
  


                                 Data	
  Manipula?on	
  




                   Visualiza?on:	
  
                   	
  	
  la-ce	
  &	
  ggplot2	
  

                               Sta?s?cal	
  Modeling	
  
Extending	
  R	
  with	
  Packages	
  
Over	
  one	
  thousand	
  user-­‐contributed	
  packages	
  are	
  available	
  
                    on	
  CRAN	
  –	
  the	
  Comprehensive	
  R	
  Archive	
  Network	
  
	
  	
  	
  	
  	
  	
  hsp://cran.r-­‐project.org	
  	
  
	
  
Install	
  a	
  package	
  from	
  the	
  command-­‐line	
  
                        > install.packages(‘actuar’)

Install	
  a	
  package	
  from	
  the	
  GUI	
  menu	
  
    “Packages”--> “Install packages(s)”
Visualiza?on	
  with	
  
lagce	
  
lahce	
  =	
  trellis	
  




    (source:	
  hsp://lmdvr.r-­‐forge.r-­‐project.org	
  )	
  
list	
  of	
  	
  
lahce	
  
func$ons	
  




                     densityplot(~ speed | type, data=pitch)	
  
Visualiza?on	
  with	
  	
  
ggplot2	
  
ggplot2	
  =	
  
grammar	
  of	
  	
  
graphics	
  
ggplot2	
  =	
  
grammar	
  of	
  
graphics	
  
Visualizing	
  50,000	
  Diamonds	
  with	
  ggplot2	
  
qplot(carat, price, data = diamonds)
qplot(log(carat), log(price), data = diamonds)
qplot(log(carat), log(price), data = diamonds,
alpha = I(1/20))
qplot(log(carat), log(price), data = diamonds,
alpha = I(1/20), colour=color)
qplot(log(carat), log(price), data = diamonds,
alpha=I(1/20)) + facet_grid(. ~ color)
qplot(color, price/carat,   qplot(color, price/carat,
data = diamonds,            data = diamonds, alpha = I(1/20),
geom=“boxplot”)             geom=“jitter”)
(live	
  demo)	
  
visualizing	
  six	
  dimensions	
  
of	
  MLB	
  pitches	
  with	
  ggplot2	
  
Demo	
  with	
  MLB	
  Gameday	
  Data	
  




Code, data, and instructions at:
http://metamx-mdriscol-adhoc.s3.amazonaws.com/gameday/README.R
R Workshop for Beginners

Más contenido relacionado

La actualidad más candente

RDataMining slides-time-series-analysis
RDataMining slides-time-series-analysisRDataMining slides-time-series-analysis
RDataMining slides-time-series-analysisYanchang Zhao
 
RDataMining slides-clustering-with-r
RDataMining slides-clustering-with-rRDataMining slides-clustering-with-r
RDataMining slides-clustering-with-rYanchang Zhao
 
RDataMining slides-regression-classification
RDataMining slides-regression-classificationRDataMining slides-regression-classification
RDataMining slides-regression-classificationYanchang Zhao
 
Introduction to R Programming
Introduction to R ProgrammingIntroduction to R Programming
Introduction to R Programmingizahn
 
Data Analysis and Programming in R
Data Analysis and Programming in RData Analysis and Programming in R
Data Analysis and Programming in REshwar Sai
 
Data analysis with R
Data analysis with RData analysis with R
Data analysis with RShareThis
 
Presentation R basic teaching module
Presentation R basic teaching modulePresentation R basic teaching module
Presentation R basic teaching moduleSander Timmer
 
Big Data Mining in Indian Economic Survey 2017
Big Data Mining in Indian Economic Survey 2017Big Data Mining in Indian Economic Survey 2017
Big Data Mining in Indian Economic Survey 2017Parth Khare
 
2. R-basics, Vectors, Arrays, Matrices, Factors
2. R-basics, Vectors, Arrays, Matrices, Factors2. R-basics, Vectors, Arrays, Matrices, Factors
2. R-basics, Vectors, Arrays, Matrices, Factorskrishna singh
 
Morel, a Functional Query Language
Morel, a Functional Query LanguageMorel, a Functional Query Language
Morel, a Functional Query LanguageJulian Hyde
 
Grouping & Summarizing Data in R
Grouping & Summarizing Data in RGrouping & Summarizing Data in R
Grouping & Summarizing Data in RJeffrey Breen
 
R Programming: Importing Data In R
R Programming: Importing Data In RR Programming: Importing Data In R
R Programming: Importing Data In RRsquared Academy
 
Regression and Classification with R
Regression and Classification with RRegression and Classification with R
Regression and Classification with RYanchang Zhao
 
Time Series Analysis and Mining with R
Time Series Analysis and Mining with RTime Series Analysis and Mining with R
Time Series Analysis and Mining with RYanchang Zhao
 
Vectors data frames
Vectors data framesVectors data frames
Vectors data framesFAO
 
Next Generation Programming in R
Next Generation Programming in RNext Generation Programming in R
Next Generation Programming in RFlorian Uhlitz
 
Why async and functional programming in PHP7 suck and how to get overr it?
Why async and functional programming in PHP7 suck and how to get overr it?Why async and functional programming in PHP7 suck and how to get overr it?
Why async and functional programming in PHP7 suck and how to get overr it?Lucas Witold Adamus
 
An Introduction to Data Mining with R
An Introduction to Data Mining with RAn Introduction to Data Mining with R
An Introduction to Data Mining with RYanchang Zhao
 

La actualidad más candente (20)

RDataMining slides-time-series-analysis
RDataMining slides-time-series-analysisRDataMining slides-time-series-analysis
RDataMining slides-time-series-analysis
 
R language
R languageR language
R language
 
RDataMining slides-clustering-with-r
RDataMining slides-clustering-with-rRDataMining slides-clustering-with-r
RDataMining slides-clustering-with-r
 
RDataMining slides-regression-classification
RDataMining slides-regression-classificationRDataMining slides-regression-classification
RDataMining slides-regression-classification
 
Introduction to R Programming
Introduction to R ProgrammingIntroduction to R Programming
Introduction to R Programming
 
Data Analysis and Programming in R
Data Analysis and Programming in RData Analysis and Programming in R
Data Analysis and Programming in R
 
Data analysis with R
Data analysis with RData analysis with R
Data analysis with R
 
Presentation R basic teaching module
Presentation R basic teaching modulePresentation R basic teaching module
Presentation R basic teaching module
 
Big Data Mining in Indian Economic Survey 2017
Big Data Mining in Indian Economic Survey 2017Big Data Mining in Indian Economic Survey 2017
Big Data Mining in Indian Economic Survey 2017
 
2. R-basics, Vectors, Arrays, Matrices, Factors
2. R-basics, Vectors, Arrays, Matrices, Factors2. R-basics, Vectors, Arrays, Matrices, Factors
2. R-basics, Vectors, Arrays, Matrices, Factors
 
Morel, a Functional Query Language
Morel, a Functional Query LanguageMorel, a Functional Query Language
Morel, a Functional Query Language
 
Grouping & Summarizing Data in R
Grouping & Summarizing Data in RGrouping & Summarizing Data in R
Grouping & Summarizing Data in R
 
R Programming: Importing Data In R
R Programming: Importing Data In RR Programming: Importing Data In R
R Programming: Importing Data In R
 
Regression and Classification with R
Regression and Classification with RRegression and Classification with R
Regression and Classification with R
 
Time Series Analysis and Mining with R
Time Series Analysis and Mining with RTime Series Analysis and Mining with R
Time Series Analysis and Mining with R
 
Vectors data frames
Vectors data framesVectors data frames
Vectors data frames
 
Next Generation Programming in R
Next Generation Programming in RNext Generation Programming in R
Next Generation Programming in R
 
Why async and functional programming in PHP7 suck and how to get overr it?
Why async and functional programming in PHP7 suck and how to get overr it?Why async and functional programming in PHP7 suck and how to get overr it?
Why async and functional programming in PHP7 suck and how to get overr it?
 
An Introduction to Data Mining with R
An Introduction to Data Mining with RAn Introduction to Data Mining with R
An Introduction to Data Mining with R
 
Ggplot2 v3
Ggplot2 v3Ggplot2 v3
Ggplot2 v3
 

Destacado

Skillshare - Let's talk about R in Data Journalism
Skillshare - Let's talk about R in Data JournalismSkillshare - Let's talk about R in Data Journalism
Skillshare - Let's talk about R in Data JournalismSchool of Data
 
Calcolo fognaturetrento p_pratica
Calcolo fognaturetrento p_praticaCalcolo fognaturetrento p_pratica
Calcolo fognaturetrento p_praticaRiccardo Rigon
 
12.13 acqua neisuoli-watertableequations
12.13 acqua neisuoli-watertableequations12.13 acqua neisuoli-watertableequations
12.13 acqua neisuoli-watertableequationsRiccardo Rigon
 
12.12 acqua nei suoli-macropores
12.12 acqua nei suoli-macropores12.12 acqua nei suoli-macropores
12.12 acqua nei suoli-macroporesRiccardo Rigon
 
P jgrass-tools-SlidesUsateALezione
P jgrass-tools-SlidesUsateALezioneP jgrass-tools-SlidesUsateALezione
P jgrass-tools-SlidesUsateALezioneRiccardo Rigon
 
12.10 acqua nei suoli-richards-1d
12.10 acqua nei suoli-richards-1d12.10 acqua nei suoli-richards-1d
12.10 acqua nei suoli-richards-1dRiccardo Rigon
 
7 inferenza statisticae-statisticadescrittiva
7 inferenza statisticae-statisticadescrittiva7 inferenza statisticae-statisticadescrittiva
7 inferenza statisticae-statisticadescrittivaRiccardo Rigon
 
2 introduzione gis - La nuova versione
2   introduzione gis - La nuova versione2   introduzione gis - La nuova versione
2 introduzione gis - La nuova versioneRiccardo Rigon
 
3 e-stationarity and ergodicity
3 e-stationarity and ergodicity3 e-stationarity and ergodicity
3 e-stationarity and ergodicityRiccardo Rigon
 
3 d-test of hypothesis
3 d-test of hypothesis3 d-test of hypothesis
3 d-test of hypothesisRiccardo Rigon
 
3 b-statistics statistics
3 b-statistics statistics3 b-statistics statistics
3 b-statistics statisticsRiccardo Rigon
 
2 c-dove iniziano i canali
2 c-dove iniziano i canali2 c-dove iniziano i canali
2 c-dove iniziano i canaliRiccardo Rigon
 

Destacado (20)

Presentazione qgis
Presentazione qgisPresentazione qgis
Presentazione qgis
 
P j grass-tools
P j grass-toolsP j grass-tools
P j grass-tools
 
3 j grass-new-age
3 j grass-new-age3 j grass-new-age
3 j grass-new-age
 
Skillshare - Let's talk about R in Data Journalism
Skillshare - Let's talk about R in Data JournalismSkillshare - Let's talk about R in Data Journalism
Skillshare - Let's talk about R in Data Journalism
 
Trento_p nei Nettools
Trento_p nei NettoolsTrento_p nei Nettools
Trento_p nei Nettools
 
Calcolo fognaturetrento p_pratica
Calcolo fognaturetrento p_praticaCalcolo fognaturetrento p_pratica
Calcolo fognaturetrento p_pratica
 
12.13 acqua neisuoli-watertableequations
12.13 acqua neisuoli-watertableequations12.13 acqua neisuoli-watertableequations
12.13 acqua neisuoli-watertableequations
 
12.12 acqua nei suoli-macropores
12.12 acqua nei suoli-macropores12.12 acqua nei suoli-macropores
12.12 acqua nei suoli-macropores
 
P jgrass-tools-SlidesUsateALezione
P jgrass-tools-SlidesUsateALezioneP jgrass-tools-SlidesUsateALezione
P jgrass-tools-SlidesUsateALezione
 
12.10 acqua nei suoli-richards-1d
12.10 acqua nei suoli-richards-1d12.10 acqua nei suoli-richards-1d
12.10 acqua nei suoli-richards-1d
 
P udig
P udigP udig
P udig
 
7 inferenza statisticae-statisticadescrittiva
7 inferenza statisticae-statisticadescrittiva7 inferenza statisticae-statisticadescrittiva
7 inferenza statisticae-statisticadescrittiva
 
Nettools Epanet
Nettools   EpanetNettools   Epanet
Nettools Epanet
 
2 introduzione gis - La nuova versione
2   introduzione gis - La nuova versione2   introduzione gis - La nuova versione
2 introduzione gis - La nuova versione
 
9 introduzione r
9   introduzione r9   introduzione r
9 introduzione r
 
3 e-stationarity and ergodicity
3 e-stationarity and ergodicity3 e-stationarity and ergodicity
3 e-stationarity and ergodicity
 
3 d-test of hypothesis
3 d-test of hypothesis3 d-test of hypothesis
3 d-test of hypothesis
 
3 b-statistics statistics
3 b-statistics statistics3 b-statistics statistics
3 b-statistics statistics
 
2 c-dove iniziano i canali
2 c-dove iniziano i canali2 c-dove iniziano i canali
2 c-dove iniziano i canali
 
3 c-form andshape
3 c-form andshape3 c-form andshape
3 c-form andshape
 

Similar a R Workshop for Beginners

Seminar PSU 09.04.2013 - 10.04.2013 MiFIT, Arbuzov Vyacheslav
Seminar PSU 09.04.2013 - 10.04.2013 MiFIT, Arbuzov VyacheslavSeminar PSU 09.04.2013 - 10.04.2013 MiFIT, Arbuzov Vyacheslav
Seminar PSU 09.04.2013 - 10.04.2013 MiFIT, Arbuzov VyacheslavVyacheslav Arbuzov
 
Rcommands-for those who interested in R.
Rcommands-for those who interested in R.Rcommands-for those who interested in R.
Rcommands-for those who interested in R.Dr. Volkan OBAN
 
Programming Lisp Clojure - 2장 : 클로저 둘러보기
Programming Lisp Clojure - 2장 : 클로저 둘러보기Programming Lisp Clojure - 2장 : 클로저 둘러보기
Programming Lisp Clojure - 2장 : 클로저 둘러보기JangHyuk You
 
[1062BPY12001] Data analysis with R / week 2
[1062BPY12001] Data analysis with R / week 2[1062BPY12001] Data analysis with R / week 2
[1062BPY12001] Data analysis with R / week 2Kevin Chun-Hsien Hsu
 
app4.pptx
app4.pptxapp4.pptx
app4.pptxsg4795
 
R is a very flexible and powerful programming language, as well as a.pdf
R is a very flexible and powerful programming language, as well as a.pdfR is a very flexible and powerful programming language, as well as a.pdf
R is a very flexible and powerful programming language, as well as a.pdfannikasarees
 
Thinking Functionally In Ruby
Thinking Functionally In RubyThinking Functionally In Ruby
Thinking Functionally In RubyRoss Lawley
 
software engineering modules iii & iv.pptx
software engineering  modules iii & iv.pptxsoftware engineering  modules iii & iv.pptx
software engineering modules iii & iv.pptxrani marri
 
Functional Programming In Mathematica
Functional Programming In MathematicaFunctional Programming In Mathematica
Functional Programming In MathematicaHossam Karim
 
A quick introduction to R
A quick introduction to RA quick introduction to R
A quick introduction to RAngshuman Saha
 

Similar a R Workshop for Beginners (20)

Seminar PSU 09.04.2013 - 10.04.2013 MiFIT, Arbuzov Vyacheslav
Seminar PSU 09.04.2013 - 10.04.2013 MiFIT, Arbuzov VyacheslavSeminar PSU 09.04.2013 - 10.04.2013 MiFIT, Arbuzov Vyacheslav
Seminar PSU 09.04.2013 - 10.04.2013 MiFIT, Arbuzov Vyacheslav
 
R code for data manipulation
R code for data manipulationR code for data manipulation
R code for data manipulation
 
R code for data manipulation
R code for data manipulationR code for data manipulation
R code for data manipulation
 
Decision Tree.pptx
Decision Tree.pptxDecision Tree.pptx
Decision Tree.pptx
 
Introduction to R
Introduction to RIntroduction to R
Introduction to R
 
Rcommands-for those who interested in R.
Rcommands-for those who interested in R.Rcommands-for those who interested in R.
Rcommands-for those who interested in R.
 
NCCU: Statistics in the Criminal Justice System, R basics and Simulation - Pr...
NCCU: Statistics in the Criminal Justice System, R basics and Simulation - Pr...NCCU: Statistics in the Criminal Justice System, R basics and Simulation - Pr...
NCCU: Statistics in the Criminal Justice System, R basics and Simulation - Pr...
 
cluster(python)
cluster(python)cluster(python)
cluster(python)
 
Programming Lisp Clojure - 2장 : 클로저 둘러보기
Programming Lisp Clojure - 2장 : 클로저 둘러보기Programming Lisp Clojure - 2장 : 클로저 둘러보기
Programming Lisp Clojure - 2장 : 클로저 둘러보기
 
[1062BPY12001] Data analysis with R / week 2
[1062BPY12001] Data analysis with R / week 2[1062BPY12001] Data analysis with R / week 2
[1062BPY12001] Data analysis with R / week 2
 
R
RR
R
 
app4.pptx
app4.pptxapp4.pptx
app4.pptx
 
Python grass
Python grassPython grass
Python grass
 
Rug hogan-10-03-2012
Rug hogan-10-03-2012Rug hogan-10-03-2012
Rug hogan-10-03-2012
 
R is a very flexible and powerful programming language, as well as a.pdf
R is a very flexible and powerful programming language, as well as a.pdfR is a very flexible and powerful programming language, as well as a.pdf
R is a very flexible and powerful programming language, as well as a.pdf
 
Thinking Functionally In Ruby
Thinking Functionally In RubyThinking Functionally In Ruby
Thinking Functionally In Ruby
 
software engineering modules iii & iv.pptx
software engineering  modules iii & iv.pptxsoftware engineering  modules iii & iv.pptx
software engineering modules iii & iv.pptx
 
Functional Programming In Mathematica
Functional Programming In MathematicaFunctional Programming In Mathematica
Functional Programming In Mathematica
 
Seminar psu 20.10.2013
Seminar psu 20.10.2013Seminar psu 20.10.2013
Seminar psu 20.10.2013
 
A quick introduction to R
A quick introduction to RA quick introduction to R
A quick introduction to R
 

Último

How to write a Business Continuity Plan
How to write a Business Continuity PlanHow to write a Business Continuity Plan
How to write a Business Continuity PlanDatabarracks
 
Generative AI for Technical Writer or Information Developers
Generative AI for Technical Writer or Information DevelopersGenerative AI for Technical Writer or Information Developers
Generative AI for Technical Writer or Information DevelopersRaghuram Pandurangan
 
TeamStation AI System Report LATAM IT Salaries 2024
TeamStation AI System Report LATAM IT Salaries 2024TeamStation AI System Report LATAM IT Salaries 2024
TeamStation AI System Report LATAM IT Salaries 2024Lonnie McRorey
 
Assure Ecommerce and Retail Operations Uptime with ThousandEyes
Assure Ecommerce and Retail Operations Uptime with ThousandEyesAssure Ecommerce and Retail Operations Uptime with ThousandEyes
Assure Ecommerce and Retail Operations Uptime with ThousandEyesThousandEyes
 
From Family Reminiscence to Scholarly Archive .
From Family Reminiscence to Scholarly Archive .From Family Reminiscence to Scholarly Archive .
From Family Reminiscence to Scholarly Archive .Alan Dix
 
Connecting the Dots for Information Discovery.pdf
Connecting the Dots for Information Discovery.pdfConnecting the Dots for Information Discovery.pdf
Connecting the Dots for Information Discovery.pdfNeo4j
 
Arizona Broadband Policy Past, Present, and Future Presentation 3/25/24
Arizona Broadband Policy Past, Present, and Future Presentation 3/25/24Arizona Broadband Policy Past, Present, and Future Presentation 3/25/24
Arizona Broadband Policy Past, Present, and Future Presentation 3/25/24Mark Goldstein
 
(How to Program) Paul Deitel, Harvey Deitel-Java How to Program, Early Object...
(How to Program) Paul Deitel, Harvey Deitel-Java How to Program, Early Object...(How to Program) Paul Deitel, Harvey Deitel-Java How to Program, Early Object...
(How to Program) Paul Deitel, Harvey Deitel-Java How to Program, Early Object...AliaaTarek5
 
New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024BookNet Canada
 
Data governance with Unity Catalog Presentation
Data governance with Unity Catalog PresentationData governance with Unity Catalog Presentation
Data governance with Unity Catalog PresentationKnoldus Inc.
 
Scale your database traffic with Read & Write split using MySQL Router
Scale your database traffic with Read & Write split using MySQL RouterScale your database traffic with Read & Write split using MySQL Router
Scale your database traffic with Read & Write split using MySQL RouterMydbops
 
The Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptx
The Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptxThe Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptx
The Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptxLoriGlavin3
 
Time Series Foundation Models - current state and future directions
Time Series Foundation Models - current state and future directionsTime Series Foundation Models - current state and future directions
Time Series Foundation Models - current state and future directionsNathaniel Shimoni
 
Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024BookNet Canada
 
The Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptx
The Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptxThe Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptx
The Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptxLoriGlavin3
 
Enhancing User Experience - Exploring the Latest Features of Tallyman Axis Lo...
Enhancing User Experience - Exploring the Latest Features of Tallyman Axis Lo...Enhancing User Experience - Exploring the Latest Features of Tallyman Axis Lo...
Enhancing User Experience - Exploring the Latest Features of Tallyman Axis Lo...Scott Andery
 
Emixa Mendix Meetup 11 April 2024 about Mendix Native development
Emixa Mendix Meetup 11 April 2024 about Mendix Native developmentEmixa Mendix Meetup 11 April 2024 about Mendix Native development
Emixa Mendix Meetup 11 April 2024 about Mendix Native developmentPim van der Noll
 
TrustArc Webinar - How to Build Consumer Trust Through Data Privacy
TrustArc Webinar - How to Build Consumer Trust Through Data PrivacyTrustArc Webinar - How to Build Consumer Trust Through Data Privacy
TrustArc Webinar - How to Build Consumer Trust Through Data PrivacyTrustArc
 
Sample pptx for embedding into website for demo
Sample pptx for embedding into website for demoSample pptx for embedding into website for demo
Sample pptx for embedding into website for demoHarshalMandlekar2
 
UiPath Community: Communication Mining from Zero to Hero
UiPath Community: Communication Mining from Zero to HeroUiPath Community: Communication Mining from Zero to Hero
UiPath Community: Communication Mining from Zero to HeroUiPathCommunity
 

Último (20)

How to write a Business Continuity Plan
How to write a Business Continuity PlanHow to write a Business Continuity Plan
How to write a Business Continuity Plan
 
Generative AI for Technical Writer or Information Developers
Generative AI for Technical Writer or Information DevelopersGenerative AI for Technical Writer or Information Developers
Generative AI for Technical Writer or Information Developers
 
TeamStation AI System Report LATAM IT Salaries 2024
TeamStation AI System Report LATAM IT Salaries 2024TeamStation AI System Report LATAM IT Salaries 2024
TeamStation AI System Report LATAM IT Salaries 2024
 
Assure Ecommerce and Retail Operations Uptime with ThousandEyes
Assure Ecommerce and Retail Operations Uptime with ThousandEyesAssure Ecommerce and Retail Operations Uptime with ThousandEyes
Assure Ecommerce and Retail Operations Uptime with ThousandEyes
 
From Family Reminiscence to Scholarly Archive .
From Family Reminiscence to Scholarly Archive .From Family Reminiscence to Scholarly Archive .
From Family Reminiscence to Scholarly Archive .
 
Connecting the Dots for Information Discovery.pdf
Connecting the Dots for Information Discovery.pdfConnecting the Dots for Information Discovery.pdf
Connecting the Dots for Information Discovery.pdf
 
Arizona Broadband Policy Past, Present, and Future Presentation 3/25/24
Arizona Broadband Policy Past, Present, and Future Presentation 3/25/24Arizona Broadband Policy Past, Present, and Future Presentation 3/25/24
Arizona Broadband Policy Past, Present, and Future Presentation 3/25/24
 
(How to Program) Paul Deitel, Harvey Deitel-Java How to Program, Early Object...
(How to Program) Paul Deitel, Harvey Deitel-Java How to Program, Early Object...(How to Program) Paul Deitel, Harvey Deitel-Java How to Program, Early Object...
(How to Program) Paul Deitel, Harvey Deitel-Java How to Program, Early Object...
 
New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
 
Data governance with Unity Catalog Presentation
Data governance with Unity Catalog PresentationData governance with Unity Catalog Presentation
Data governance with Unity Catalog Presentation
 
Scale your database traffic with Read & Write split using MySQL Router
Scale your database traffic with Read & Write split using MySQL RouterScale your database traffic with Read & Write split using MySQL Router
Scale your database traffic with Read & Write split using MySQL Router
 
The Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptx
The Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptxThe Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptx
The Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptx
 
Time Series Foundation Models - current state and future directions
Time Series Foundation Models - current state and future directionsTime Series Foundation Models - current state and future directions
Time Series Foundation Models - current state and future directions
 
Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
 
The Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptx
The Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptxThe Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptx
The Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptx
 
Enhancing User Experience - Exploring the Latest Features of Tallyman Axis Lo...
Enhancing User Experience - Exploring the Latest Features of Tallyman Axis Lo...Enhancing User Experience - Exploring the Latest Features of Tallyman Axis Lo...
Enhancing User Experience - Exploring the Latest Features of Tallyman Axis Lo...
 
Emixa Mendix Meetup 11 April 2024 about Mendix Native development
Emixa Mendix Meetup 11 April 2024 about Mendix Native developmentEmixa Mendix Meetup 11 April 2024 about Mendix Native development
Emixa Mendix Meetup 11 April 2024 about Mendix Native development
 
TrustArc Webinar - How to Build Consumer Trust Through Data Privacy
TrustArc Webinar - How to Build Consumer Trust Through Data PrivacyTrustArc Webinar - How to Build Consumer Trust Through Data Privacy
TrustArc Webinar - How to Build Consumer Trust Through Data Privacy
 
Sample pptx for embedding into website for demo
Sample pptx for embedding into website for demoSample pptx for embedding into website for demo
Sample pptx for embedding into website for demo
 
UiPath Community: Communication Mining from Zero to Hero
UiPath Community: Communication Mining from Zero to HeroUiPath Community: Communication Mining from Zero to Hero
UiPath Community: Communication Mining from Zero to Hero
 

R Workshop for Beginners

  • 1. Munging & Visualizing Data with R Michael E. Driscoll CTO, Metamarkets @medriscoll Xavier Léauté Metamarkets @xvrl Barret Schloerke Metamarkets
  • 2. I.  A  Tour  of  R  
  • 4.
  • 5. R  is  a  tool  for…   Data  Manipula?on   •  connec$ng  to  data  sources   •  slicing  &  dicing  data   Modeling  &  Computa?on   •  sta$s$cal  modeling   •  numerical  simula$on   Data  Visualiza?on   •  visualizing  fit  of  models   •  composing  sta$s$cal  graphics  
  • 6. R  is  an  environment  
  • 7. Its  interface  is  plain  
  • 8. RStudio  to  the  rescue  
  • 9. ## load in some Insurance Claim data library(MASS) data(Insurance) Insurance <- edit(Insurance) Let’s  take  a  tour   head(Insurance) dim(Insurance) ## plot it nicely using the ggplot2 package of  some  data  in  R   library(ggplot2) qplot(Group, Claims/Holders, data=Insurance, geom="bar", stat='identity', position="dodge", facets=District ~ ., fill=Age, ylab="Claim Propensity", xlab="Car Group") ## hypothesize a relationship between Age ~ Claim Propensity ## visualize this hypothesis with a boxplot x11() library(ggplot2) qplot(Age, Claims/Holders, data=Insurance, geom="boxplot", fill=Age) ## quantify the hypothesis with linear model m <- lm(Claims/Holders ~ Age + 0, data=Insurance) summary(m)
  • 10. R  is  “an  overgrown  calculator”   sum(rgamma(rpois(1,lambda=2),shape=49,scale=.2)))
  • 11. R  is  “an  overgrown  calculator”   •  simple  math   > 2+2 4 •  storing  results  in  variables   > x <- 2+2 ## ‘<-’ is R syntax for ‘=’ or assignment > x^2 16 •  vectorized  math   > weight <- c(110, 180, 240) ## three weights > height <- c(5.5, 6.1, 6.2) ## three heights > bmi <- (weight*4.88)/height^2 ## divides element-wise 17.7 23.6 30.4  
  • 12. R  is  “an  overgrown  calculator”   •  basic  sta$s$cs   mean(weight) sd(weight) sqrt(var(weight)) 176.6 65.0 65.0 # same as sd •  set  func$ons   union intersect setdiff •  advanced  sta$s$cs   > pbinom(40, 100, 0.5) ## P that a coin tossed 100 times 0.028 ## will comes up less than 40 heads > pshare <- pbirthday(23, 365, coincident=2)   0.530 ## probability that among 23 people, two share a birthday  
  • 13. Try  It!  #1    Overgrown  Calculator   •  basic  calcula$ons   > 2 + 2 [Hit  ENTER] > log(100) [Hit  ENTER]   •  calculate  the  value  of  $100  aIer  10  years  at  5%   > 100 * exp(0.05*10) [Hit  ENTER] •  construct  a  vector  &  do  a  vectorized  calcula$on   > year <- (1,2,5,10,25) [Hit  ENTER]      this  returns  an  error.    why?   > year <- c(1,2,5,10,25) [Hit  ENTER] > 100 * exp(0.05*year) [Hit  ENTER]      
  • 14. R  as  a  Programming  Language   fibonacci <- function(n) { fib <- numeric(n) fib [1:2] <- 1 for (i in 3:n) { fib[i] <- fib[i-1] + fib[i-2] } return(fib[n]) Image from cover of Abelson & Sussman’s textThe } Structure and Interpretation of Computer Languages
  • 15. Func$on  Calls   •  There  are  ~  1100  built-­‐in  commands  in  the  R   “base”  package,  which  can  be  executed  on  the   command-­‐line.    The  basic  structure  of  a  call  is   thus:      output <- function(arg1, arg2, …)   •  Arithme$c  Opera$ons   + - * / ^   •  R  func$ons  are  typically  vectorized   x <- x/3  works  whether  x  is  a  one  or  many-­‐valued  vector  
  • 16. Data  Structures  in  R   numeric   x <- c(0,2:4) vectors   y <- c(“alpha”, “b”, “c3”, “4”) Character   logical   z <- c(1, 0, TRUE, FALSE) > class(x) [1] "numeric" > x2 <- as.logical(x) > class(x2) [1] “logical”
  • 17. Data  Structures  in  R   lists   lst <- list(x,y,z) objects   M <- matrix(rep(x,3),ncol=3) matrices   data  frames*   df <- data.frame(x,y,z) > class(df) [1] “data.frame"
  • 18. Summary  of  Data  Structures   Linear Rectangular ?   Homogeneous vectors   matrices   Heterogeneous lists   data  frames*  
  • 19. R  is  a  numerical  simulator     •  built-­‐in  func$ons  for   classical  probability   distribu$ons   •  let’s  simulate  10,000   trials  of  100  coin  flips.     what’s  the   distribu$on  of  heads?     > heads <- rbinom(10^5,100,0.50) > hist(heads)
  • 20. Func$ons  for  Probability  Distribu$ons   ddist(  )   density  func$on  (pdf)   pdist(  )   cumula$ve  density  func$on   qdist(  )   quan$le  func$on   rdist(  )   random  deviates   Examples   Normal   dnorm,  pnorm,  qnorm,  rnorm   Binomial   dbinom,  pbinom,  …   Poisson   dpois,  …   >  pnorm(0)    0.05     >  qnorm(0.9)    1.28   >  rnorm(100)    vector  of  length  100    
  • 21. Func$ons  for  Probability  Distribu$ons   distribu?on   dist  suffix  in  R   How  to  find  the  func?ons  for   Beta   -­‐beta   lognormal  distribu?on?       Binomial   -­‐binom     Cauchy   -­‐cauchy   1)  Use  the  double  ques$on  mark   Chisquare   -­‐chisq   Exponen?al   -­‐exp   ‘??’  to  search   F   -­‐f   > ??lognormal Gamma   -­‐gamma     Geometric   -­‐geom   2)  Then  iden$fy  the  package   Hypergeometric   -­‐hyper    >  ?Lognormal   Logis?c   -­‐logis   Lognormal   -­‐lnorm     Nega?ve  Binomial     -­‐nbinom   3)  Discover  the  dist  func$ons     Normal   -­‐norm   dlnorm, plnorm, qlnorm, Poisson   -­‐pois   rlnorm Student  t     -­‐t   Uniform   -­‐unif   Tukey   -­‐tukey   Weibull   -­‐weib   Wilcoxon   -­‐wilcox  
  • 22. Try  It!  #2    Numerical  Simula$on   •  simulate  1m  drivers  from  which  we  expect  4  claims   > numclaims <- rpois(n, lambda) (hint:  use  ?rpois to  understand  the  parameters)   •  verify  the  mean  &  variance  are  reasonable > mean(numclaims) > var(numclaims) •  visualize  the  distribu$on  of  claim  counts   > hist(numclaims)    
  • 23. Gehng  Data  In    -­‐  from  Files   > Insurance <- read.csv(“Insurance.csv”,header=TRUE)      from  Databases   > con <- dbConnect(driver,user,password,host,dbname) > Insurance <- dbSendQuery(con, “SELECT * FROM claims”)      from  the  Web   > con <- url('http://labs.dataspora.com/test.txt') > Insurance <- read.csv(con, header=TRUE)        from  R  data  objects   > load(‘Insurance.Rda’)
  • 24. Gehng  Data  Out   •  to  Files   write.csv(Insurance,file=“Insurance.csv”) •  to  Databases   con <- dbConnect(dbdriver,user,password,host,dbname) dbWriteTable(con, “Insurance”, Insurance)          to  R  Objects   save(Insurance, file=“Insurance.Rda”)
  • 25. Naviga$ng  within  the  R  environment   •  lis$ng  all  variables   > ls() •  examining  a  variable  ‘x’   > str(x) > head(x) > tail(x) > class(x) •  removing  variables   > rm(x) > rm(list=ls()) # remove everything
  • 26. Try  It!  #3    Data  Processing     •  load  data  &  view  it   library(MASS) head(Insurance) ## the first 7 rows dim(Insurance) ## number of rows & columns •  write  it  out   write.csv(Insurance,file=“Insurance.csv”, row.names=FALSE) getwd() ## where am I? •  view  it  in  Excel,  make  a  change,  save  it   remove the first district   •  load  it  back  in  to  R  &  plot  it   Insurance <- read.csv(file=“Insurance.csv”) plot(Claims/Holders ~ Age, data=Insurance)
  • 27. A  Swiss-­‐Army  Knife  for  Data  
  • 28. A  Swiss-­‐Army  Knife  for  Data   •  Indexing   •  Three  ways  to  index  into  a  data  frame   –  array  of  integer  indices   –  array  of  character  names   –  array  of  logical  Booleans   •  Examples:   df[1:3,] df[c(“New York”, “Chicago”),] df[c(TRUE,FALSE,TRUE,TRUE),] df[df$city == “New York”,]
  • 29. A  Swiss-­‐Army  Knife  for  Data   •  subset  –  extract  subsets  mee$ng  some  criteria   subset(Insurance, District==1) subset(Insurance, Claims < 20) •  transform  –  add  or  alter  a  column  of  a  data  frame   transform(Insurance, Propensity=Claims/Holders) •  cut  –  cut  a  con$nuous  value  into  groups cut(Insurance$Claims, breaks=c(-1,100,Inf), labels=c('lo','hi')) •  Put  it  all  together:  create  a  new,  transformed  data  frame   transform(subset(Insurance, District==1), ClaimLevel=cut(Claims, breaks=c(-1,100,Inf), labels=c(‘lo’,’hi’)))  
  • 30. A  Swiss-­‐Army  Knife  for  Data   •  sqldf  –  a  library  that  allows  you  to  query  R  data  frames  as  if  they   were  SQL  tables.    Par$cularly  useful  for  aggrega$ons.   library(sqldf) sqldf('select country, sum(revenue) revenue FROM sales GROUP BY country') country revenue 1 FR 307.1157 2 UK 280.6382 3 USA 304.6860
  • 31. A  Sta$s$cal  Modeler   •  R’s  has  a  powerful  modeling  syntax   •  Models  are  specified  with  formulae,  like     y ~ x growth ~ sun + water model  rela$onships  between  con$nuous  and   categorical  variables.   •  Models  are  also  guide  the  visualiza$on  of   rela$onships  in  a  graphical  form  
  • 32. A  Sta$s$cal  Modeler   •  Linear  model   m <- lm(Claims/Holders ~ Age, data=Insurance) •  Examine  it   summary(m) •  Plot  it   plot(m)
  • 33. A  Sta$s$cal  Modeler   •  Logis$c  model   m <- glm(Age ~ Claims/Holders, data=Insurance, family=binomial(“logit”)) •  Examine  it   summary(m) •  Plot  it   plot(m)
  • 34. Try  It!  #4    Sta$s$cal  Modeling   •  fit  a  linear  model   m <- lm(Claims/Holders ~ Age + 0, data=Insurance) •  examine  it     summary(m)   •  plot  it   plot(m)
  • 35. Visualiza$on:       Mul$variate   Barplot   library(ggplot2) qplot(Group, Claims/Holders, data=Insurance, geom="bar", stat='identity', position="dodge", facets=District ~ ., fill=Age)
  • 36. Visualiza$on:    Boxplots   library(ggplot2) library(lattice) qplot(Age, Claims/Holders, bwplot(Claims/Holders ~ Age, data=Insurance, data=Insurance) geom="boxplot“)  
  • 37. Visualiza$on:  Histograms   library(ggplot2) library(lattice) qplot(Claims/Holders, densityplot(~ Claims/Holders | Age, data=Insurance, data=Insurance, layout=c(4,1) facets=Age ~ ., geom="density")
  • 38. Try  It!  #5    Data  Visualiza$on   •  simple  line  chart   > x <- 1:10 > y <- x^2 > plot(y ~ x) •  box  plot   > library(lattice) > boxplot(Claims/Holders ~ Age, data=Insurance)   •  visualize  a  linear  fit   > abline(0,1)
  • 39. Gehng  Help  with  R   Help  within  R  itself  for  a  func?on   > help(func) > ?func For  a  topic   > help.search(topic) > ??topic   •  search.r-­‐project.org   •  Google  Code  Search    www.google.com/codesearch   •  Stack  Overflow    hsp://stackoverflow.com/tags/R     •  R-­‐help  list  hsp://www.r-­‐project.org/pos$ng-­‐guide.html    
  • 40. Six  Indispensable  Books  on  R   Learning  R   Data  Manipula?on   Visualiza?on:      la-ce  &  ggplot2   Sta?s?cal  Modeling  
  • 41. Extending  R  with  Packages   Over  one  thousand  user-­‐contributed  packages  are  available   on  CRAN  –  the  Comprehensive  R  Archive  Network              hsp://cran.r-­‐project.org       Install  a  package  from  the  command-­‐line   > install.packages(‘actuar’) Install  a  package  from  the  GUI  menu   “Packages”--> “Install packages(s)”
  • 43. lahce  =  trellis   (source:  hsp://lmdvr.r-­‐forge.r-­‐project.org  )  
  • 44. list  of     lahce   func$ons   densityplot(~ speed | type, data=pitch)  
  • 45. Visualiza?on  with     ggplot2  
  • 46. ggplot2  =   grammar  of     graphics  
  • 47. ggplot2  =   grammar  of   graphics  
  • 48. Visualizing  50,000  Diamonds  with  ggplot2  
  • 51. qplot(log(carat), log(price), data = diamonds, alpha = I(1/20))
  • 52. qplot(log(carat), log(price), data = diamonds, alpha = I(1/20), colour=color)
  • 53. qplot(log(carat), log(price), data = diamonds, alpha=I(1/20)) + facet_grid(. ~ color)
  • 54. qplot(color, price/carat, qplot(color, price/carat, data = diamonds, data = diamonds, alpha = I(1/20), geom=“boxplot”) geom=“jitter”)
  • 55.
  • 57. visualizing  six  dimensions   of  MLB  pitches  with  ggplot2  
  • 58. Demo  with  MLB  Gameday  Data   Code, data, and instructions at: http://metamx-mdriscol-adhoc.s3.amazonaws.com/gameday/README.R