6. R data structures

Yusuf YIGINI, PhD - FAO, Land and Water Division (CBL)
GSP - Eurasian Soil Partnership - Digital Soil Mapping and Modelling Training
Izmir, Turkey
21-25 August 2017

Acquiring R Skills:
Data Types

Basic Data Types
Everything in R is an object.
R has the following atomic vector types.
character
numeric
integer
logical
complex
By atomic, we mean the vector only holds data of a single type.
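As a small sketch of the point above (the example values are made up for illustration), the following creates one value of each listed type and then shows what happens when types are mixed inside c(): everything is coerced to one common type.

```r
# One value of each atomic type listed above (illustrative values)
values <- list(
  character = "soil",
  numeric   = 2.5,
  integer   = 3L,      # the L suffix asks for an integer
  logical   = TRUE,
  complex   = 1 + 2i
)
types <- sapply(values, class)
print(types)

# c() coerces mixed input to a single common type
mixed <- c(1, "a", TRUE)
print(class(mixed))   # "character"
```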

Basic Data Types
● Numeric
● Integer
● Logical
● Character
● Factor
> S <- 10
> S
[1] 10
> class(S)
[1] "numeric"

Basic Data Types: Character
> LC <- c("arable", "forest", "grassland")
> LC
[1] "arable"    "forest"    "grassland"
> class(LC)
[1] "character"

Basic Data Types: Factor
> LC <- c("arable", "forest", "grassland")
> LC.factor <- as.factor(LC)
> class(LC.factor)
[1] "factor"

Basic Data Types: Integer
> y <- 20.9
> y
[1] 20.9
> as.integer(y)
[1] 20
> y <- as.integer(20)
> y
[1] 20
> class(y)
[1] "integer"

Basic Data Types: Logical
> a<- TRUE
> a
[1] TRUE
> 4 < 2
[1] FALSE
> 4 < 5
[1] TRUE
> b <- 4 < 5
> b
[1] TRUE
> class(b)
[1] "logical"

Basic Data Types
R provides many functions to examine features of vectors and other
objects, for example
class() - what kind of object is it (high-level)?
typeof() - what is the object’s data type (low-level)?
length() - how long is it? What about two dimensional objects?
attributes() - does it have any metadata?
> typeof(y)
[1] "integer"
> length(y)
[1] 1
> class(y)
[1] "integer"
> str(y)
int 20
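The difference between class() and typeof(), and the question about two-dimensional objects, may be easier to see on a few small objects. This is only a sketch; the objects s, f and m are invented for the illustration.

```r
s <- 10
print(class(s))    # "numeric" (high-level)
print(typeof(s))   # "double"  (low-level storage type)

# A factor is stored as integer codes plus metadata
f <- factor(c("arable", "forest", "arable"))
print(class(f))        # "factor"
print(typeof(f))       # "integer"
print(attributes(f))   # the levels and the class

# For a two-dimensional object, length() counts all cells; dim() gives rows and columns
m <- matrix(1:6, nrow = 2)
print(length(m))   # 6
print(dim(m))      # 2 3
```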

Acquiring R Skills:
Data Structures

This section is loosely based on the R manual “An
Introduction to R”
A couple of important functions we are going to
use in this tutorial:
is : is used to get information about the type/class
of the object;
as : is used to coerce/transform the object into a
specific type/class;

What is a vector in R?
Like everything else in R, a vector is an
object that lives in your working
environment.
In short, it is a data structure, and the
simplest data structure in R.

Vectors
> v <- c(1, 43, 100, 3, 55)
> is.vector(v)
[1] TRUE
> is(v)
[1] "numeric" "vector"
> length(v)
[1] 5
> v
[1] 1 43 100 3 55

Vectors
> v <- 1
> is.vector(v)
[1] TRUE
> length(v)
[1] 1
> v
[1] 1
Note that a scalar is a vector of length 1.

Vector Arithmetics
Vector arithmetics in R has the
advantage of allowing the same
operation to be performed on all the
elements of the vector with a single
call, avoiding loops.
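To make the "avoiding loops" point concrete, here is a minimal comparison of an explicit loop with the equivalent single vectorised call. The names res_loop and res_vec are chosen for this sketch only.

```r
a <- c(1, 3, 5, 7)

# Loop version: visit each element in turn
res_loop <- numeric(length(a))
for (i in seq_along(a)) {
  res_loop[i] <- 5 * a[i]
}

# Vectorised version: one call does the same work
res_vec <- 5 * a

print(res_vec)                        # 5 15 25 35
print(identical(res_loop, res_vec))   # TRUE
```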

Vector Arithmetics
Arithmetic operations of vectors are performed member-by-
member (memberwise). For example, suppose we have two
vectors a and b.
> a = c(1, 3, 5, 7)
> b = c(1, 2, 4, 8)

Vector Arithmetics
Then, if we multiply a by 5, we would get a vector with each
of its members multiplied by 5
> a = c(1, 3, 5, 7)
> b = c(1, 2, 4, 8)
> 5 * a
[1] 5 15 25 35

Vector Arithmetics
And if we add a and b together, the sum would be a vector
whose members are the sum of the corresponding
members from a and b.
> a + b
[1] 2 5 9 15

Vector Arithmetics
Similarly for subtraction, multiplication and division, we get
new vectors via memberwise operations.
> a - b
[1] 0 1 1 -1
> a * b
[1] 1 6 20 56
> a / b
[1] 1.000 1.500 1.250 0.875

Vector Arithmetics
Recycling Rule
If two vectors are of unequal length, the shorter one will be
recycled in order to match the longer vector. For example,
the following vectors a and b have different lengths, and
their sum is computed by recycling values of the shorter
vector a.
> a = c(10, 20, 30)
> b = c(1, 2, 3, 4, 5, 6, 7, 8, 9)
> a + b
[1] 11 22 33 14 25 36 17 28 39
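In the example above the longer length (9) is a multiple of the shorter one (3). A short sketch of the other case, with made-up vectors: R still recycles, but it also issues a warning.

```r
a <- c(10, 20, 30)
b <- c(1, 2, 3, 4)

# a is recycled as 10 20 30 10; the result is returned together with a warning
res <- suppressWarnings(a + b)
print(res)   # 11 22 33 14

# Capture the warning text instead of the result
msg <- tryCatch(a + b, warning = function(w) conditionMessage(w))
print(msg)
```

Relying on recycling between vectors of unrelated lengths is usually a sign of a mistake, so the warning is worth reading.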

Vector Arithmetics
x %% y    modulus (x mod y)    5 %% 2 is 1
x %/% y   integer division     5 %/% 2 is 2
> x=c(2,4,6,8,12)
> y=c(2,1,4,7,10)
> x%%y
[1] 0 0 2 1 2
> x %/% y
[1] 1 4 1 1 1

Vector Arithmetics
A large number of operations are available.
For example check the help page ?"+".
?"+"

Vector Arithmetics
There are also operations that summarize the contents of the
vector.
> v <- c(1, 34, 100, 3, 26)
> sum(v)
[1] 164
> prod(v)
[1] 265200
> quantile(v)
0% 25% 50% 75% 100%
1 3 26 34 100

Vectors
Creating vectors / vector types
The simplest way to create a vector is with c().
Common alternatives are vector() and seq().
There are several others …
Vectors can be of several types:
- numeric
- logical
- character
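A brief sketch of the alternatives to c() mentioned above. rep() is added here as one of the "several others"; the variable names are illustrative.

```r
v0 <- vector("numeric", length = 3)   # 0 0 0
l0 <- vector("logical", length = 2)   # FALSE FALSE

s1 <- seq(0, 10, by = 2)              # 0 2 4 6 8 10
s2 <- seq(0, 1, length.out = 5)       # 0.00 0.25 0.50 0.75 1.00

r1 <- rep(c("a", "b"), times = 2)     # "a" "b" "a" "b"

print(v0); print(l0); print(s1); print(s2); print(r1)
```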

Vectors
Numeric vectors
> w <- c(10, 10.2, 34, 7.35, 0)
> is(w)
[1] "numeric" "vector"
> w <- seq(0,10,2)
> w <- 1:10
> w
[1] 1 2 3 4 5 6 7 8 9 10

Vectors
Logical vectors
> v <- c(TRUE, FALSE, TRUE, TRUE)
> is(v)
[1] "logical" "vector"

Vectors
A useful feature of logical vectors is the possibility to
coerce/transform into, and from, numeric vectors with the
as.* functions (as.numeric(), as.logical()).
> v <- c(TRUE, FALSE, TRUE, TRUE)
> as.numeric(v)
[1] 1 0 1 1
> v <- c(0, 0, 1, 1)
> as.logical(v)
[1] FALSE FALSE TRUE TRUE

Vectors
Logical vectors are the outcome of comparisons.
> v <- c(10, 10.2, 34, 7.35, 0)
> v < 5
[1] FALSE FALSE FALSE FALSE TRUE
> v >= 10
[1] TRUE TRUE TRUE FALSE FALSE
> v == 0
[1] FALSE FALSE FALSE FALSE TRUE
> v!=0
[1] TRUE TRUE TRUE TRUE FALSE
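Because TRUE counts as 1 and FALSE as 0, the result of a comparison can be summarised directly. The following sketch reuses the vector v from above and shows a few common idioms.

```r
v <- c(10, 10.2, 34, 7.35, 0)

n_big    <- sum(v >= 10)           # how many elements are >= 10
share    <- mean(v < 5)            # proportion of elements below 5
pos      <- which(v > 5 & v < 20)  # positions where both conditions hold
has_zero <- any(v == 0)            # is there at least one zero?

print(n_big)      # 3
print(share)      # 0.2
print(pos)        # 1 2 4
print(has_zero)   # TRUE
```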

Vectors
Character vectors
> v <- c("a", "b", "c", "d", "e")
> is(v)
[1] "character" "vector"
[3] "data.frameRowLabels" "SuperClassMethod"

Vectors
Character vectors
With characters, combining vectors with paste() can be useful:
> v1 <- "ASP Training Workshop"
> v2 <- "24-29 April 2017"
> paste(v1,v2)
[1] "ASP Training Workshop 24-29 April 2017"

Vectors
Character vectors
Character strings that do not spell out a number or a
logical value cannot be converted; the result is NA:
> as.numeric(v1)
[1] NA
Warning message:
NAs introduced by coercion
> as.logical(v1)
[1] NA

Vectors
Vectors can be used to create other vectors.
> v <- c(1, 34, 100)
> v1 <- c(v, 0, 0, v)
> length(v1)
[1] 8
> v
[1] 1 34 100
> v1
[1] 1 34 100 0 0 1 34 100

Exercise
Generate 3 vectors of 500 elements of a random
variable with mean 0 and standard deviation 1 (*).
Call them v1, v2, and v3.
(*) Tip: use the rnorm() function: v <- rnorm(n, mean = , sd = )

Vectors
Vector Index
We retrieve values in a vector by declaring an index inside a single
square bracket "[]" operator. For example, the following shows how
to retrieve a vector member.
> v <- c(10, 10.2, 34, 7.35, 0)
> v[1]
[1] 10
> v[c(2, 4)]
[1] 10.20 7.35
> v[c(4, 2)]
[1] 7.35 10.20
> v[-c(2:5)]
[1] 10

Vectors
Vector Index
The index vector can be of different types. Above we used a vector of
integers, but it could be a logical vector.
> v <- c(10, 10.2, 34, 7.35, 0)
> v[c(TRUE, FALSE, FALSE, FALSE, FALSE)]
[1] 10

Vectors
Vector Index
For example, when we do v > 9 we get a logical vector stating which
elements were larger than 9 but we didn't get the elements. To get
the elements we can use the logical vector to subset the vector.
v <- c(10, 10.2, 34, 7.35, 0)
# which elements are larger than 9
v > 9
[1] TRUE TRUE TRUE FALSE FALSE
# select elements larger than 9
v[v > 9]
[1] 10.0 10.2 34.0
# or
idx <- v > 9
v[idx]
[1] 10.0 10.2 34.0

Vectors
Vector Index
A particular case applies to NA elements.
v <- c(NA, 10.2, 34, NA, 0)
# select the NA elements
v==NA
[1] NA NA NA NA NA
v=="NA"
[1] NA FALSE FALSE NA FALSE
# neither works because NA is special

Vectors
Vector Index
Instead, we should use the is.na() function.
v
[1] NA 10.2 34.0 NA 0.0
is.na(v)
[1] TRUE FALSE FALSE TRUE FALSE
# now it's possible to select the NA
v[is.na(v)]
[1] NA NA

Vectors
To replace NA values, if needed, select them
with is.na() and assign with <-
v
[1] NA 10.2 34.0 NA 0.0
v[is.na(v)] <- 200
v
[1] 200.0 10.2 34.0 200.0 0.0
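Replacing NA values is not always what you want. As a sketch of the alternatives, the missing values can be counted, skipped inside a summary with na.rm = TRUE, or dropped from the vector.

```r
v <- c(NA, 10.2, 34, NA, 0)

n_missing <- sum(is.na(v))         # count the NA elements
m_all     <- mean(v)               # NA: one missing value spoils the mean
m_obs     <- mean(v, na.rm = TRUE) # mean of the observed values only
v_obs     <- v[!is.na(v)]          # keep only the observed values

print(n_missing)   # 2
print(m_all)       # NA
print(m_obs)       # about 14.73
print(v_obs)       # 10.2 34.0 0.0
```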

Plotting vectors
There's a lot to be done with graphs, which will be
demonstrated later, but for the moment check the most
common ones.
v <- rnorm(1000)
plot(v, main="My scatter plot")

Plotting vectors
hist(v, main="My histogram")

Plotting vectors
> v <- rnorm(1000, mean=40, sd=5)
> hist(v, main="My histogram")

Density Plot
> plot(density(v), main="My density plot")

Comparing 2 variables or vectors.
v1 <- rnorm(1000)
v2 <- rnorm(1000)
plot(v1, v2, main="Independent variables")

Comparing 2 variables or vectors.
v1 <- rnorm(1000)
v2 <- rnorm(1000, mean = v1)  # each v2 value is centred on the matching v1 value
plot(v1, v2, main="Dependent variables")

Naming Vectors
Adding names to the elements may be useful in
some situations.
For example, if one is dealing with model
parameters it may be easier to use the parameter
names. Names can also be used for subsetting.

Naming Vectors
v <- c(10, 3, 0, 54.2, 1)
names(v) <- letters[1:5]
v
a b c d e
10.0 3.0 0.0 54.2 1.0
v["c"]
c
0
v[3]
c
0
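As a sketch of the model-parameter use case mentioned earlier, here is a named vector holding two parameters. The names and values (intercept, slope) are hypothetical and only serve the illustration.

```r
# Hypothetical parameters of a straight-line model
params <- c(intercept = 1.2, slope = 0.5)

print(names(params))                    # "intercept" "slope"
print(params[c("slope", "intercept")])  # subset and reorder by name

# Use the parameters by name; unname() drops the inherited name
y <- unname(params["intercept"] + params["slope"] * 10)
print(y)   # 6.2
```

Referring to params["slope"] rather than params[2] keeps the code readable and still works if the order of the elements changes.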

Exercises - Vectors (30 mins)
Exercise 1
Consider two vectors, x, y
x=c(4,6,5,7,10,9,4,15)
y=c(0,10,1,8,2,3,4,1)
What is the value of: x*y

Exercises
Exercise 2
Consider two vectors, a, b
a=c(1,2,4,5,6)
b=c(3,2,4,1,9)
What is the value of: cbind(a,b)

Exercises
Exercise 3
a=c(1,5,4,3,6)
b=c(3,5,2,1,9)
What is the value of: a<=b

Exercises
Exercise 4
a=c(10,2,4,15)
b=c(3,12,4,11)
What is the value of: rbind(a,b)

Exercises
Exercise 5
x<- c(1:12)
What is the value of: dim(x)
What is the value of: length(x)

Exercises
Exercise 6
If a=c(12:5)
What is the value of: is.numeric(a)

Exercises
Exercise 7
x=c(12:4)
y=c(0,1,2,0,1,2,0,1,2)
What is the value of: which(!is.finite(x/y))

Exercises
Exercise 8
x=letters[1:10]
y=letters[15:24]
What is the value of: x<y

Exercises
Exercise 9
If x=c('blue','red','green','yellow')
What is the value of: is.character(x)

Exercises
Exercise 10
If x=c('blue',10,'green',20)
What is the value of: is.character(x)

Data Frames and Lists
In this session: one of R's most useful
object types: the data.frame.
And also: Lists which are simple but
useful.

Data Frames
A data frame is a table or a two-dimensional array-like
structure in which each column contains values of one
variable and each row contains one set of values from each
column.
Characteristics of a data frame.
• The column names should not be empty.
• The row names should be unique.
• The data stored in a data frame may be numeric, factor or
character.
• Each column contains the same number of data items.

Data Frames
For example, soil organic carbon data from the
Macedonian database:
> SOC <- read.csv("MASIS_SOC.csv")
> SOC
Id UpperDepth LowerDepth SOC Lambda tsme
1 4 0 30 12.00032455 0.01 0.003985153
2 7 0 30 3.48365276 0.01 0.002502976
3 8 0 30 2.31341405 0.01 0.002504971
4 9 0 30 1.94142743 0.01 0.002508691
5 10 0 30 1.34296903 0.01 0.002509177
6 11 0 30 2.28793284 0.01 0.002509360
7 12 0 30 2.71584298 0.01 0.002518323
8 13 0 30 4.34011158 0.01 0.002515760
...

Data Frames
summary() summarises each column
> summary(SOC)
Id UpperDepth LowerDepth SOC Lambda
Min. : 4 Min. :0 Min. :30 Min. : 0.000 Min. :0.01
1st Qu.:1878 1st Qu.:0 1st Qu.:30 1st Qu.: 1.006 1st Qu.:0.01
Median :3214 Median :0 Median :30 Median : 1.495 Median :0.01
Mean :3198 Mean :0 Mean :30 Mean : 1.916 Mean :0.01
3rd Qu.:4502 3rd Qu.:0 3rd Qu.:30 3rd Qu.: 2.268 3rd Qu.:0.01
Max. :6539 Max. :0 Max. :30 Max. :50.205 Max. :0.01
NA's :1
tsme
Min. :0.002472
1st Qu.:0.002502
Median :0.002504
Mean :0.002507
3rd Qu.:0.002507
Max. :0.003985

Data Frames
Or use head() / tail() to look at the first / last few rows
> tail(SOC)
3257 6531 0 30 0.5698581 0.01 0.002503761
3258 6532 0 30 5.7547935 0.01 0.002505020
3259 6533 0 30 1.6636972 0.01 0.002506451
3260 6535 0 30 1.9226001 0.01 0.002502052
3261 6537 0 30 1.7165334 0.01 0.002502749
3262 6539 0 30 1.3633083 0.01 0.002502855

Data Frames
We can inspect the dimensions
> dim(SOC)
[1] 3262 6

Data Frames
And dimension names
> dimnames(SOC)
[[1]]
[1] "1" "2" "3" "4" "5" "6" "7" "8" "9" "10"
[11] "11" "12" "13" "14" "15" "16" "17" "18" "19" "20"
…
[[2]]
[1] "Id" "UpperDepth" "LowerDepth" "SOC" "Lambda"
[6] "tsme"

Data Frames
Accessing values in a data.frame
There are several ways to access the data in a
data.frame.
To access a whole column you can use '$' and the
column name:
> SOC$SOC
[1] 12.00032455 3.48365276 2.31341405 1.94142743 1.34296903 2.28793284
[7] 2.71584298 4.34011158 5.77118126 4.54692240 4.63597793 2.10768409
[13] 3.96522026 4.80577783 3.08891798 4.59635072 1.51213851 1.31937774
[19] 1.64608828 1.63183332 3.93370051 1.89931487 1.70009503 1.68627536
...

Data Frames
You can also use square brackets: [row, column]
where 'row' and 'column' are index numbers or
names.
For example, to access the third row and the ‘tsme’
column only:
> SOC[3, "tsme"]
[1] 0.002504971

Data Frames
Which is equivalent to the third row and sixth
column:
> SOC[3,6]
[1] 0.002504971

Data Frames
Leaving out the row or column index selects all of
them.
All rows of the tsme column (column 6):
> SOC[, 6]
[1] 0.003985153 0.002502976 0.002504971 0.002508691 0.002509177 0.002509360
[7] 0.002518323 0.002515760 0.002509165 0.002514908 0.002517316 0.002510426
[13] 0.002506018 0.002503509 0.002505200 0.002503669 0.002505983 0.002504586
[19] 0.002504090 0.002506831 0.002507821 0.002505357 0.002505301 0.002507838
[25] 0.002507646 0.002502572 0.002504063 0.002505777 0.002506869 0.002502164
...

Data Frames
You can select multiple rows and columns using
vectors.
> SOC[1:5, c("SOC","tsme")]
SOC tsme
1 12.000325 0.003985153
2 3.483653 0.002502976
3 2.313414 0.002504971
4 1.941427 0.002508691
5 1.342969 0.002509177
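If you do not have the CSV file at hand, the access patterns above can be tried on a small stand-in. The sketch below builds a data frame called soc_demo (a name invented here) from the first five rows shown earlier, keeping only three of the columns.

```r
# Stand-in for the SOC data: first five rows, three columns only
soc_demo <- data.frame(
  Id   = c(4, 7, 8, 9, 10),
  SOC  = c(12.000325, 3.483653, 2.313414, 1.941427, 1.342969),
  tsme = c(0.003985153, 0.002502976, 0.002504971, 0.002508691, 0.002509177)
)

col_vec <- soc_demo$SOC                     # whole column, returned as a vector
one_val <- soc_demo[3, "tsme"]              # a single value
sub_df  <- soc_demo[1:2, c("SOC", "tsme")]  # several rows and columns: a data frame

print(col_vec)
print(one_val)
print(sub_df)
```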

Data Frames
Of course you can write values into a data.frame.
We make a copy of SOC (so we don't mess up
the original)
> SOCtemp <- SOC
> SOCtemp[3,"tsme"]
[1] 0.002504971
> SOCtemp[3,"tsme"] <- 1
> SOCtemp[3,"tsme"]
[1] 1

Data Frames
Recycling rules apply if fewer values are supplied
than selected
> SOCtemp[1:5,"tsme"] <- 1
> SOCtemp[1:5,]
1 4 0 30 12.000325 0.01 1
2 7 0 30 3.483653 0.01 1
3 8 0 30 2.313414 0.01 1
4 9 0 30 1.941427 0.01 1
5 10 0 30 1.342969 0.01 1

Data Frames
Comparisons work on columns too. For example,
look at the 'SOC' column. Which values are
greater than 2?
> SOC$SOC > 2
[1] TRUE TRUE TRUE FALSE FALSE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
[13] TRUE TRUE TRUE TRUE FALSE FALSE FALSE FALSE TRUE FALSE FALSE FALSE
[25] TRUE FALSE FALSE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
[37] FALSE FALSE TRUE FALSE TRUE TRUE TRUE FALSE TRUE FALSE TRUE TRUE
[49] FALSE TRUE TRUE FALSE FALSE TRUE TRUE FALSE TRUE FALSE TRUE TRUE
[61] TRUE TRUE TRUE FALSE FALSE FALSE FALSE FALSE TRUE FALSE TRUE FALSE
...

Data Frames
We can store this logical vector in a variable and
use it to access only the rows that have SOC > 2
> SOCHigh <- SOC$SOC > 2
> SOC[SOCHigh,]
1 4 0 30 12.000325 0.01 0.003985153
2 7 0 30 3.483653 0.01 0.002502976
3 8 0 30 2.313414 0.01 0.002504971
6 11 0 30 2.287933 0.01 0.002509360
7 12 0 30 2.715843 0.01 0.002518323
...

Data Frames
Or do it all at once (in pure R fashion)
> SOC[SOC$SOC > 2,]
1 4 0 30 12.000325 0.01 0.003985153
2 7 0 30 3.483653 0.01 0.002502976
3 8 0 30 2.313414 0.01 0.002504971
6 11 0 30 2.287933 0.01 0.002509360
7 12 0 30 2.715843 0.01 0.002518323
8 13 0 30 4.340112 0.01 0.002515760
...
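One caveat with logical row selection: if the column being compared contains an NA, the comparison is NA for that row and the selection returns a row filled with NA. The sketch below uses a small made-up data frame (soc_demo, with a hypothetical Id 99 that has no SOC value) to show the effect and two common ways around it.

```r
# Stand-in data with one missing SOC value (the last row is hypothetical)
soc_demo <- data.frame(
  Id  = c(4, 7, 9, 99),
  SOC = c(12.000325, 3.483653, 1.941427, NA)
)

keep <- soc_demo$SOC > 2             # TRUE TRUE FALSE NA

with_na <- soc_demo[keep, ]          # 3 rows: the NA produces an all-NA row
no_na1  <- soc_demo[which(keep), ]   # which() keeps only the TRUE positions
no_na2  <- subset(soc_demo, SOC > 2) # subset() also drops rows where the test is NA

print(with_na)
print(no_na1)
print(no_na2)
```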

Data Frames
Ordering
You can reorder the data.frame by one or more
columns using the order() function
> SOC[order(SOC$tsme),]
2268 4039 0 30 0.00000000 0.01 0.002472194
145 396 0 30 1.17823531 0.01 0.002479430
1527 2803 0 30 0.45000000 0.01 0.002482055
471 1032 0 30 0.62766244 0.01 0.002482168
1581 3101 0 30 0.92922471 0.01 0.002484241
2237 3996 0 30 1.44240409 0.01 0.002484922
2629 5048 0 30 0.85421359 0.01 0.002485544
1910 3590 0 30 1.45097492 0.01 0.002486062
1650 3234 0 30 0.00000000 0.01 0.002486621
2536 4632 0 30 1.76030792 0.01 0.002486932
...

Data Frames
Ordering
To order by more than one column, pass several
columns to order(); later columns break ties
> SOC[order(SOC$tsme, SOC$SOC),]
2268 4039 0 30 0.00000000 0.01 0.002472194
145 396 0 30 1.17823531 0.01 0.002479430
1527 2803 0 30 0.45000000 0.01 0.002482055
471 1032 0 30 0.62766244 0.01 0.002482168
1581 3101 0 30 0.92922471 0.01 0.002484241
2237 3996 0 30 1.44240409 0.01 0.002484922
2629 5048 0 30 0.85421359 0.01 0.002485544

Data Frames
Making your own data.frame is straightforward
using the data.frame() function. For example:
> year <- 2000:2010
> catch <- c(900, 1230, 1400, 930, 670, 1000, 960, 840, 900, 500,400)
> dat <- data.frame(year=year, catch=catch)
> head(dat)
year catch
1 2000 900
2 2001 1230
3 2002 1400
4 2003 930
5 2004 670
6 2005 1000

Data Frames
It's possible to add extra columns of various types
> dat$area <- c("N","S","N","S","N","S","N","S","N","S","N")
> dat$survey <- c(TRUE, FALSE, FALSE, TRUE, TRUE, TRUE, TRUE, TRUE, FALSE, TRUE,
TRUE)
> head(dat)
year catch area survey
1 2000 900 N TRUE
2 2001 1230 S FALSE
3 2002 1400 N FALSE
4 2003 930 S TRUE
5 2004 670 N TRUE
6 2005 1000 S TRUE

Data Frames
To add an extra row or rows use rbind and pass in
a data.frame with the exact same column names
and types
> dat2 <- data.frame(year = 1920, catch = 666, area = "N", survey = FALSE)
> dat <- rbind(dat, dat2)
> dat
year catch area survey
1 2000 900 N TRUE
2 2001 1230 S FALSE
3 2002 1400 N FALSE
4 2003 930 S TRUE
5 2004 670 N TRUE
6 2005 1000 S TRUE
7 2006 960 N TRUE
8 2007 840 S TRUE
9 2008 900 N FALSE
10 2009 500 S TRUE
11 2010 400 N TRUE
12 1920 666 N FALSE

EXERCISE (20 mins)
Ask at least 4 people near you and make a data.frame to
hold the following information about them: Name, hair
colour, height, shoe size, how long they can hold
their breath for.
• Reorder the data.frame by height.
• Subset the data.frame to only include people taller
than 1m 70.
• What is the mean shoe size of the people in the
data.frame?
• Whose shoe size is closest to the mean shoe size?

Data Frames
We need to talk about factors.
Macedonian Soil Data data.frame
> head(MSoil)
Id UpperDepth LowerDepth SOC Lambda tsme Region
1 4 0 30 12.000325 0.01 0.003985153 A
2 7 0 30 3.483653 0.01 0.002502976 B
3 8 0 30 2.313414 0.01 0.002504971 B
4 9 0 30 1.941427 0.01 0.002508691 B
5 10 0 30 1.342969 0.01 0.002509177 B
6 11 0 30 2.287933 0.01 0.002509360 B

Data Frames
We need to talk about factors. In the Macedonian
Soil Data data.frame, take a look at the 'Region'
column:
> head(MSoil$Region)
[1] A B B B B B
Levels: A B
They look like characters, but no quotes. There are two
“levels”: A and B. What does this mean?

Data Frames
Checking the class of the column shows that it is
a factor:
> class(MSoil$Region)
[1] "factor"

Data Frames
Factors
Factors are a way of encoding data that can be used for
grouping variables.
Values can only be one of the defined 'levels'. This
allows you to keep track of what the values could be.
They can be used to ensure that a data set is coherent.

Data Frames
For example, if we try to set a value in the
“Region” column to something other than A or B,
we get a warning
> MSoil[1,"Region"] <- "20"
Warning message:
In `[<-.factor`(`*tmp*`, iseq, value = "20") :
invalid factor level, NA generated

Data Frames
And a broken data.frame
> MSoil[1,]
1 4 0 30 12.00032 0.01 0.003985153 <NA>

Data Frames
Let's fix it :)
> MSoil[1,"Region"] <- "A"
> MSoil[1,]
1 4 0 30 12.00032 0.01 0.003985153 A

Data Frames
Factors
If you really want to change the value to
something not in the levels, you need to change
the levels too (the set of allowed values)
> levels(MSoil$Region)
[1] "A" "B"

Data Frames
Factors
> levels(MSoil$Region) <- c("A","B","C")
> levels(MSoil$Region)
[1] "A" "B" "C"
> MSoil[1, "Region"] <- "C"
> head(MSoil)
1 4 0 30 12.000325 0.01 0.003985153 C
2 7 0 30 3.483653 0.01 0.002502976 B
3 8 0 30 2.313414 0.01 0.002504971 B
4 9 0 30 1.941427 0.01 0.002508691 B
5 10 0 30 1.342969 0.01 0.002509177 B
6 11 0 30 2.287933 0.01 0.002509360 B

Data Frames
Factors
MSoil[, "Region"]
[1] C B B B B B B B B B B B B B B B B B B B B B A A A A A A A A A A A A A A
[37] A B B B B B B B B B B B B B B B B B B B B B B B B B B B B B B B B B A A
...
[973] B B B B B B B B B B B B B B B B B B B B B B B B B B B B
[ reached getOption("max.print") -- omitted 2262 entries ]
Levels: A B C
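A small self-contained sketch of the same idea, on an invented factor called region. Note that assigning to levels() is positional: it relabels the existing levels in order, so appending to the current levels is the safe way to add one.

```r
region <- factor(c("A", "B", "B"))

# Append a new level while keeping the existing ones in place
levels(region) <- c(levels(region), "C")
region[1] <- "C"      # now allowed, no warning

print(region)
print(levels(region))   # "A" "B" "C"
print(table(region))    # A is still a level, with zero observations
```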

Data Frames
Factors
Factors are used by many methods and
functions in R, such as linear modelling.

Data Frames
let's make another data set that only includes
Region == B
> MSoilB <- subset(MSoil, Region=="B")
>
> MSoilB$Region
[1] B B B B B B B B B B B B B B B B B B B B B B B B B B B B B B B B B B B B
...
[937] B B B B B B B B B B B B B B B B B B B B B B B B B B B B B B B B B B B B
[973] B B B B B B B B B B B B B B B B B B B B B B B B B B B B
[ reached getOption("max.print") -- omitted 1147 entries ]
Levels: A B C

Data Frames
You can see that we have no observations for A
but you know that there could be. This might be
important for data management. When you import
data into R, some of your columns may be read in
as factors even if you did not intend them to.
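Whether text columns are imported as factors depends on the stringsAsFactors argument of read.csv(); its default was TRUE before R 4.0.0 and is FALSE since. The sketch below sets it explicitly on a tiny inline CSV (made up for the illustration) and also shows droplevels(), which removes levels that no longer have observations.

```r
csv_text <- "Id,Region\n4,A\n7,B\n8,B"

as_chr <- read.csv(text = csv_text, stringsAsFactors = FALSE)
as_fct <- read.csv(text = csv_text, stringsAsFactors = TRUE)

print(class(as_chr$Region))   # "character"
print(class(as_fct$Region))   # "factor"

# Subsetting keeps unused levels; droplevels() removes them
only_b  <- subset(as_fct, Region == "B")
kept    <- levels(only_b$Region)               # "A" "B"
dropped <- levels(droplevels(only_b)$Region)   # "B"
print(kept)
print(dropped)
```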

Data Frames
The by() function can be used to split the data and apply
a function to each chunk. This can be very useful for
summarising the data. For example, to split the data by
the 'Region' column (one chunk per level) and take the
mean of the SOC column in each chunk you can do:
> by(MSoil$SOC, MSoil$Region, mean)
MSoil$Region: A
[1] NaN
----------------------------------------------------------
MSoil$Region: B
[1] 1.839508
----------------------------------------------------------
MSoil$Region: C
[1] 12.00032

Data Frames
aggregate() does something similar but can be
used to operate on multiple columns in a data
frame. Learning how to manipulate data frames is
a very useful skill. The plyr and reshape packages
are worth your time getting to know.
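A minimal sketch of aggregate() on several columns at once. The data frame soil_demo and its values are invented stand-ins for the soil data, and the formula interface is only one of the ways to call aggregate().

```r
# Made-up stand-in for the soil data
soil_demo <- data.frame(
  Region = c("A", "B", "B", "A"),
  SOC    = c(12.0, 3.5, 2.3, 2.0),
  tsme   = c(0.004, 0.002, 0.003, 0.002)
)

# Mean of SOC and tsme for each Region
means <- aggregate(cbind(SOC, tsme) ~ Region, data = soil_demo, FUN = mean)
print(means)
```

The result is itself a data frame with one row per region, so it can be indexed and ordered like any other.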

Exercise
What is the mean height by hair
colour of the people in your
data.frame?

Lists
A list is a very flexible container.
It's like a vector, but the elements can be
objects of any class and size - even lists
(lists of lists of lists of …).
This makes them very handy for moving big
chunks of data around (particularly returning
output from a function).

Making lists
Here we make three objects to put into a list.
> best_food <- c("cake", "banana")
> odd_numbers <- c(1,3,5,7,9)
> notes <- "Something interesting"

Making lists
To make the list, we use the list() function.
When you create a list, you should give the
elements names (they don't have to be the name
of the object).
> my_list <- list(food = best_food, numbers = odd_numbers, note = notes)
> class(my_list)
[1] "list"

Lists
Getting the length of the list and the names of the
elements is straightforward
> length(my_list)
[1] 3
> names(my_list)
[1] "food" "numbers" "note"

Lists
Elements in a list can be extracted in two
ways. The first is by name, using $ and the
element name.
> my_list$food
[1] "cake" "banana"

Lists
Accessing data in a list
The second way is using [[ and the element position or name.
> my_list[[1]]
[1] "cake" "banana"
>
> my_list[["food"]]
[1] "cake" "banana"
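Single square brackets also work on lists, but they behave differently from [[. As a sketch on a shortened copy of my_list: [ returns a smaller list, while [[ returns the element itself.

```r
my_list <- list(food = c("cake", "banana"), numbers = c(1, 3, 5, 7, 9))

as_list <- my_list["food"]     # a list of length 1 that contains the vector
as_vec  <- my_list[["food"]]   # the character vector itself

print(class(as_list))    # "list"
print(class(as_vec))     # "character"
print(length(as_list))   # 1
print(length(as_vec))    # 2
```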

Lists
Modifying lists
Lists can be easily extended - just add an extra
element.
> my_list[["new"]] <- c(1,3,5,7)
> summary(my_list)
Length Class Mode
food 2 -none- character
numbers 5 -none- numeric
note 1 -none- character
new 4 -none- numeric

Lists
Processing lists
lapply - apply the same function to each element
in a list.
> vec1 <- seq(from=1, to = 10, length = 7)
> vec2 <- seq(from=12, to = 20, length = 6)
> lst <- list(vec1 = vec1, vec2 = vec2)
> lapply(lst, sum)
$vec1
[1] 38.5
$vec2
[1] 96

Lists
Processing lists
This only makes sense if the same function can be
applied to all elements. For example, if we add a
character vector to the list, we can't use sum. But
length makes sense.
> lst$str1 <- c("a", "b", "c")  # any character vector of length 3 (contents assumed)
> lapply(lst, length)
$vec1
[1] 7
$vec2
[1] 6
$str1
[1] 3
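lapply() always returns a list. When every result is a single value, sapply() is a common alternative that simplifies the result to a named vector. A short sketch, rebuilding the two numeric vectors from above:

```r
lst <- list(
  vec1 = seq(from = 1, to = 10, length.out = 7),
  vec2 = seq(from = 12, to = 20, length.out = 6)
)

sums_list <- lapply(lst, sum)   # a list with elements vec1 and vec2
sums_vec  <- sapply(lst, sum)   # a named numeric vector

print(sums_list)
print(sums_vec)   # vec1 = 38.5, vec2 = 96
```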

Lists - Exercises
Exercise 1
If:
p <- c(2,7,8), q <- c("A", "B", "C") and
x <- list(p, q),
then what is the value of x[2]?
a. NULL
b. "A" "B" "C"
c. "7"

Lists - Exercises
Exercise 2
If:
w <- c(2, 7, 8)
v <- c("A", "B", "C")
x <- list(w, v),
then which R statement will replace "A" in x with
"K".
a. x[[2]] <- "K"
b. x[[2]][1] <- "K"
c. x[[1]][2] <- "K"

Lists - Exercises
Exercise 3
If a <- list ("x"=5, "y"=10, "z"=15), which R
statement will give the sum of all elements in a?
a. sum(a)
b. sum(list(a))
c. sum(unlist(a))

Lists - Exercises
Exercise 4
If Newlist <- list(a=1:10, b="Good morning",
c="Hi"), write an R statement that will add 1 to
each element of the first vector in Newlist.
# Exercise 4
Newlist <- list(a=1:10, b="Good morning", c="Hi")
Newlist$a <- Newlist$a + 1
Newlist
## $a
## [1] 2 3 4 5 6 7 8 9 10 11
##
## $b
## [1] "Good morning"
##
## $c
## [1] "Hi"
