Yusuf YIGINI, PhD - FAO, Land and Water Division (CBL)
GSP - Eurasian Soil Partnership
Digital Soil Mapping and Modelling Training
Izmir, Turkey
21-25 August 2017
Acquiring R Skills:
Data Types
Basic Data Types
Everything in R is an object.
R has the following atomic vector types (a sixth, raw, holds bytes and is rarely needed).
character
numeric
integer
logical
complex
By atomic, we mean the vector only holds data of a single type.
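A minimal sketch of what "a single type" means in practice: when values of different types are combined with c(), R coerces them all to one common type. The variable names are illustrative only.

```r
# Mixing types in c() forces coercion to a single common type
mixed_num_chr <- c(1, "a", TRUE)   # everything becomes character
mixed_num_lgl <- c(1.5, TRUE)      # TRUE becomes 1

typeof(mixed_num_chr)              # "character"
typeof(mixed_num_lgl)              # "double"
mixed_num_lgl                      # 1.5 1.0
```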
Basic Data Types
● Numeric
● Integer
● Logical
● Character
● Factor
> S <- 10
> S
[1] 10
> class(S)
[1] "numeric"
Basic Data Types
> LC <- c("arable", "forest", "grassland")
> LC
[1] "arable" "forest" "grassland"
> class(LC)
[1] "character"
Basic Data Types
> LC <- c("arable", "forest", "grassland")
> LC.factor <- as.factor(LC)
> class(LC.factor)
[1] "factor"
Basic Data Types
> y <- 20.9
> y
[1] 20.9
> as.integer(y)
[1] 20
> y <- as.integer(20)
> y
[1] 20
> class(y)
[1] "integer"
Basic Data Types
> a<- TRUE
> a
[1] TRUE
> 4 < 2
[1] FALSE
> 4 < 5
[1] TRUE
> b <- 4 < 5
> b
[1] TRUE
> class(b)
[1] "logical"
Basic Data Types
R provides many functions to examine features of vectors and other
objects, for example
class() - what kind of object is it (high-level)?
typeof() - what is the object’s data type (low-level)?
length() - how long is it? What about two dimensional objects?
attributes() - does it have any metadata?
> typeof(y)
[1] "integer"
> length(y)
[1] 1
> class(y)
[1] "integer"
> str(y)
int 20
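The difference between the high-level class and the low-level type is easiest to see on objects that are more than plain vectors. The sketch below uses a factor and a matrix; the object names f and m are illustrative.

```r
# class() reports the kind of object; typeof() reports how it is stored
f <- factor(c("arable", "forest", "arable"))
class(f)       # "factor"
typeof(f)      # "integer": factors are stored as integer codes
length(f)      # 3

# For a two-dimensional object, length() counts all cells; dim() gives the shape
m <- matrix(1:6, nrow = 2)
length(m)      # 6
dim(m)         # 2 3
attributes(m)  # a list holding the dim attribute
```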
Acquiring R Skills:
Data Structures
This section is loosely based on the R manual “An
Introduction to R”
A couple of important functions we are going to
use in this tutorial:
is : is used to get information about the type/class
of the object;
as : is used to coerce/transform the object into a
specific type/class;
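In everyday use these two appear as families of functions, is.*() for testing and as.*() for coercing. A small sketch with an illustrative character vector:

```r
x <- c("1", "2", "3")

is.character(x)          # TRUE
is.numeric(x)            # FALSE

x_num <- as.numeric(x)   # coerce character to numeric
is.numeric(x_num)        # TRUE
sum(x_num)               # 6
```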
What is a vector in R?
Like everything else in R, a vector is an
object that lives in your working
environment.
In short, it is the simplest data
structure in R.
Vectors
> v <- c(1, 43, 100, 3, 55)
> is.vector(v)
[1] TRUE
> is(v)
[1] "numeric" "vector"
> length(v)
[1] 5
> v
[1] 1 43 100 3 55
Vectors
> v <- 1
> is.vector(v)
[1] TRUE
> length(v)
[1] 1
> v
[1] 1
Note that a scalar is a vector of length 1.
Vector Arithmetics
Vector arithmetics in R has the
advantage of allowing the same
operation to be performed on all the
elements of the vector with a single
call, avoiding loops.
Vector Arithmetics
Arithmetic operations of vectors are performed member-by-
member (memberwise). For example, suppose we have two
vectors a and b.
Then, if we multiply a by 5, we would get a vector with each
of its members multiplied by 5
> a = c(1, 3, 5, 7)
> b = c(1, 2, 4, 8)
> 5 * a
[1] 5 15 25 35
Vector Arithmetics
And if we add a and b together, the sum would be a vector
whose members are the sum of the corresponding
members from a and b.
> a + b
[1] 2 5 9 15
Vector Arithmetics
Similarly for subtraction, multiplication and division, we get
new vectors via memberwise operations.
> a - b
[1] 0 1 1 -1
> a * b
[1] 1 6 20 56
> a / b
[1] 1.000 1.500 1.250 0.875
Vector Arithmetics
Recycling Rule
If two vectors are of unequal length, the shorter one will be
recycled in order to match the longer vector. For example,
the following vectors a and b have different lengths, and
their sum is computed by recycling values of the shorter
vector a.
> a = c(10, 20, 30)
> b = c(1, 2, 3, 4, 5, 6, 7, 8, 9)
> a + b
[1] 11 22 33 14 25 36 17 28 39
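When the longer length is not a multiple of the shorter one, R still recycles but issues a warning. The sketch below suppresses that warning only so the result can be stored cleanly; in interactive work the warning is a useful hint that the lengths were probably not intended.

```r
a <- c(10, 20, 30)
b <- c(1, 2, 3, 4)

# a is recycled as 10, 20, 30, 10; R warns that the lengths do not match up
result <- suppressWarnings(a + b)
result   # 11 22 33 14
```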
Vector Arithmetics
x %% y modulus (x mod y) 5%%2 is 1
x %/% y integer division 5%/%2 is 2
> x=c(2,4,6,8,12)
> y=c(2,1,4,7,10)
> x%%y
[1] 0 0 2 1 2
> x %/% y
[1] 1 4 1 1 1
Vector Arithmetics
A large number of operations are available.
For example check the help page ?"+".
?"+"
Vector Arithmetics
There are also operations that summarize the contents of a
vector.
> v <- c(1, 34, 100, 3, 26)
> sum(v)
[1] 164
> prod(v)
[1] 265200
> quantile(v)
0% 25% 50% 75% 100%
1 3 26 34 100
Vectors
Creating vectors / vector types
The simplest method to create vectors is to use c.
Common alternatives are vector and seq.
There are several others …
Vectors can be of several types:
-numeric
-logical
-character
Vectors
Numeric vectors
> w <- c(10, 10.2, 34, 7.35, 0)
> is(w)
[1] "numeric" "vector"
> w <- seq(0,10,2)
> w <- 1:10
> w
[1] 1 2 3 4 5 6 7 8 9 10
Vectors
Logical vectors
> v <- c(TRUE, FALSE, TRUE, TRUE)
> is(v)
[1] "logical" "vector"
Vectors
A useful feature of logical vectors is the possibility to
coerce/transform into, and from, numeric vectors with the
as method.
> v <- c(TRUE, FALSE, TRUE, TRUE)
> as.numeric(v)
[1] 1 0 1 1
> v <- c(0, 0, 1, 1)
> as.logical(v)
[1] FALSE FALSE TRUE TRUE
Vectors
Logical vectors are the outcome of comparisons.
> v <- c(10, 10.2, 34, 7.35, 0)
> v < 5
[1] FALSE FALSE FALSE FALSE TRUE
> v >= 10
[1] TRUE TRUE TRUE FALSE FALSE
> v == 0
[1] FALSE FALSE FALSE FALSE TRUE
> v!=0
[1] TRUE TRUE TRUE TRUE FALSE
Vectors
Character vectors
> v <- c("a", "b", "c", "d", "e")
> is(v)
[1] "character" "vector"
[3] "data.frameRowLabels" "SuperClassMethod"
Vectors
Character vectors
With characters the combination of vectors can be useful,
> v1 <- "ASP Training Workshop"
> v2 <- "24-29 April 2017"
> paste(v1,v2)
[1] "ASP Training Workshop 24-29 April 2017"
Vectors
Character vectors
Character strings that do not represent a number or a
logical value cannot be converted to numeric or logical;
the result is NA,
> as.numeric(v1)
[1] NA
Warning message:
NAs introduced by coercion
> as.logical(v1)
[1] NA
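Conversion does work when the string spells out a number or a logical value. A short sketch, with illustrative variable names:

```r
num_ok  <- as.numeric("3.14")   # 3.14
lgl_ok  <- as.logical("TRUE")   # TRUE
lgl_bad <- as.logical("yes")    # NA: not a spelling R recognises

num_ok
lgl_ok
lgl_bad
```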
Vectors
Vectors can be used to create other vectors.
> v <- c(1, 34, 100)
> v1 <- c(v, 0, 0, v)
> length(v1)
[1] 8
> v
[1] 1 34 100
> v1
[1] 1 34 100 0 0 1 34 100
Exercise
Generate 3 vectors of 500 elements of a random
variable with mean 0 and standard deviation 1 (*).
Call them v1, v2, and v3.
(*) Tip: use rnorm function. V <- rnorm(n, mean=, sd=)
Vectors
Vector Index
We retrieve values in a vector by declaring an index inside a single
square bracket "[]" operator. For example, the following shows how
to retrieve a vector member.
> v <- c(10, 10.2, 34, 7.35, 0)
> v[1]
[1] 10
> v[c(2, 4)]
[1] 10.20 7.35
> v[c(4, 2)]
[1] 7.35 10.20
> v[-c(2:5)]
[1] 10
Vectors
Vector Index
The index vector can be of different types. Above we used a vector of
integers, but it could be a logical vector.
> v <- c(10, 10.2, 34, 7.35, 0)
> v[c(TRUE, FALSE, FALSE, FALSE, FALSE)]
[1] 10
Vectors
Vector Index
For example, when we do v > 9 we get a logical vector stating which
elements were larger than 9 but we didn't get the elements. To get
the elements we can use the logical vector to subset the vector.
v <- c(10, 10.2, 34, 7.35, 0)
# which elements are larger than 9
v > 9
[1] TRUE TRUE TRUE FALSE FALSE
# select elements larger than 9
v[v > 9]
[1] 10.0 10.2 34.0
# or
idx <- v > 9
v[idx]
[1] 10.0 10.2 34.0
Vectors
Vector Index
A particular case applies to NA elements.
v <- c(NA, 10.2, 34, NA, 0)
# select the NA elements
v==NA
[1] NA NA NA NA NA
v=="NA"
[1] NA FALSE FALSE NA FALSE
# neither works because NA is special
Vectors
Vector Index
We should use the is.na() function
v
[1] NA 10.2 34.0 NA 0.0
is.na(v)
[1] TRUE FALSE FALSE TRUE FALSE
# now it's possible to select the NA
v[is.na(v)]
[1] NA NA
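Building on is.na(), a few common patterns for working around missing values. This is a sketch using the same example vector:

```r
v <- c(NA, 10.2, 34, NA, 0)

n_missing <- sum(is.na(v))   # TRUE counts as 1, so this counts the NAs
n_missing                    # 2

v[!is.na(v)]                 # 10.2 34.0 0.0: the non-missing elements

mean(v)                      # NA: missing values propagate
mean(v, na.rm = TRUE)        # mean of the non-missing values only
```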
Vectors
To replace NA values, if needed, assign to the
selected elements with <-
v
[1] NA 10.2 34.0 NA 0.0
v[is.na(v)] <- 200
v
[1] 200.0 10.2 34.0 200.0 0.0
Plotting vectors
There's a lot to be done with graphs, which will be
demonstrated later, but for the moment check the most
common ones.
v <- rnorm(1000)
plot(v, main="My scatter plot")
Plotting vectors
hist(v, main="My histogram")
Plotting vectors
> v <- rnorm(1000, mean=40, sd=5)
> hist(v, main="My histogram")
Density Plot
> plot(density(v), main="My density plot")
Comparing 2 variables or vectors.
v1 <- rnorm(1000)
v2 <- rnorm(1000)
plot(v1, v2, main="Independent variables")
Comparing 2 variables or vectors.
v1 <- rnorm(1000)
v2 <- rnorm(1000, v1)
plot(v1, v2, main="Dependent variables")
Naming Vectors
Naming vectors
Adding names to the elements may be useful in
some situations.
For example, if one is dealing with model
parameters it may be easier to use the parameter
names. Names can also be used for subsetting.
Naming Vectors
v <- c(10, 3, 0, 54.2, 1)
names(v) <- letters[1:5]
v
a b c d e
10.0 3.0 0.0 54.2 1.0
v["c"]
c
0
v[3]
c
0
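A sketch of the model-parameter use case mentioned earlier. The vector params and its element names are hypothetical.

```r
# Hypothetical model parameters stored as a named vector
params <- c(intercept = 1.2, slope = 0.5, sigma = 2)

names(params)                 # "intercept" "slope" "sigma"
params[c("slope", "sigma")]   # select several elements by name

params["slope"] <- 0.75       # replace one value by name
unname(params)                # 1.20 0.75 2.00
```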
Exercises - Vectors (30 mins)
Exercise 1
Consider two vectors, x, y
x=c(4,6,5,7,10,9,4,15)
y=c(0,10,1,8,2,3,4,1)
What is the value of: x*y
Exercises
Exercise 2
Consider two vectors, a, b
a=c(1,2,4,5,6)
b=c(3,2,4,1,9)
What is the value of: cbind(a,b)
Exercises
Exercise 3
Consider two vectors, a, b
a=c(1,5,4,3,6)
b=c(3,5,2,1,9)
What is the value of: a<=b
Exercises
Exercise 4
Consider two vectors, a, b
a=c(10,2,4,15)
b=c(3,12,4,11)
What is the value of: rbind(a,b)
Exercises
Exercise 5
x<- c(1:12)
What is the value of: dim(x)
What is the value of: length(x)
Exercises
Exercise 6
If a=c(12:5)
What is the value of: is.numeric(a)
Exercises
Exercise 7
Consider two vectors, x, y
x=c(12:4)
y=c(0,1,2,0,1,2,0,1,2)
What is the value of: which(!is.finite(x/y))
Exercises
Exercise 8
Consider two vectors, x, y
x=letters[1:10]
y=letters[15:24]
What is the value of: x<y
Exercises
Exercise 9
If x=c('blue','red','green','yellow')
What is the value of: is.character(x)
Exercises
Exercise 10
If x=c('blue',10,'green',20)
What is the value of: is.character(x)
DATA FRAMES, LISTS
Data Frames and Lists
In this session: one of R's most useful
object types: the data.frame.
And also: Lists which are simple but
useful.
Data Frames
A data frame is a table or a two-dimensional array-like
structure in which each column contains values of one
variable and each row contains one set of values from each
column.
Characteristics of a data frame.
• The column names should not be empty.
• The row names should be unique.
• The data stored in a data frame may be numeric, factor,
character or logical.
• Each column contains the same number of data items.
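These characteristics can be checked on a small example. The data frame soil and its columns are made up for illustration; stringsAsFactors = TRUE is set explicitly so the region column becomes a factor in any R version.

```r
# A small, made-up data frame with one column per variable
soil <- data.frame(
  id     = 1:3,
  soc    = c(2.1, 0.8, 3.4),
  region = c("A", "B", "A"),
  stringsAsFactors = TRUE
)

str(soil)                          # compact overview: one line per column
nrow(soil)                         # 3
ncol(soil)                         # 3
col_classes <- sapply(soil, class)
col_classes                        # id "integer", soc "numeric", region "factor"
```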
Data Frames
For example, Soil Organic Carbon data from the
Macedonian database.
> SOC <- read.csv("MASIS_SOC.csv")
> SOC
Id UpperDepth LowerDepth SOC Lambda tsme
1 4 0 30 12.00032455 0.01 0.003985153
2 7 0 30 3.48365276 0.01 0.002502976
3 8 0 30 2.31341405 0.01 0.002504971
4 9 0 30 1.94142743 0.01 0.002508691
5 10 0 30 1.34296903 0.01 0.002509177
6 11 0 30 2.28793284 0.01 0.002509360
7 12 0 30 2.71584298 0.01 0.002518323
8 13 0 30 4.34011158 0.01 0.002515760
...
Data Frames
summary() summarises each column
> summary(SOC)
Id UpperDepth LowerDepth SOC Lambda
Min. : 4 Min. :0 Min. :30 Min. : 0.000 Min. :0.01
1st Qu.:1878 1st Qu.:0 1st Qu.:30 1st Qu.: 1.006 1st Qu.:0.01
Median :3214 Median :0 Median :30 Median : 1.495 Median :0.01
Mean :3198 Mean :0 Mean :30 Mean : 1.916 Mean :0.01
3rd Qu.:4502 3rd Qu.:0 3rd Qu.:30 3rd Qu.: 2.268 3rd Qu.:0.01
Max. :6539 Max. :0 Max. :30 Max. :50.205 Max. :0.01
NA's :1
tsme
Min. :0.002472
1st Qu.:0.002502
Median :0.002504
Mean :0.002507
3rd Qu.:0.002507
Max. :0.003985
Data Frames
or head/tail to look at the first / last few rows
> tail(SOC)
Id UpperDepth LowerDepth SOC Lambda tsme
3257 6531 0 30 0.5698581 0.01 0.002503761
3258 6532 0 30 5.7547935 0.01 0.002505020
3259 6533 0 30 1.6636972 0.01 0.002506451
3260 6535 0 30 1.9226001 0.01 0.002502052
3261 6537 0 30 1.7165334 0.01 0.002502749
3262 6539 0 30 1.3633083 0.01 0.002502855
Data Frames
We can inspect the dimensions
> dim(SOC)
[1] 3262 6
Data Frames
And dimension names
> dimnames(SOC)
[[1]]
[1] "1" "2" "3" "4" "5" "6" "7" "8" "9" "10"
[11] "11" "12" "13" "14" "15" "16" "17" "18" "19" "20"
…
[[2]]
[1] "Id" "UpperDepth" "LowerDepth" "SOC" "Lambda"
[6] "tsme"
Data Frames
Accessing values in a data.frame
There are several ways to access the data in a
data.frame.
To access a whole column you can use '$' and the
column name
> SOC$SOC
[1] 12.00032455 3.48365276 2.31341405 1.94142743 1.34296903 2.28793284
[7] 2.71584298 4.34011158 5.77118126 4.54692240 4.63597793 2.10768409
[13] 3.96522026 4.80577783 3.08891798 4.59635072 1.51213851 1.31937774
[19] 1.64608828 1.63183332 3.93370051 1.89931487 1.70009503 1.68627536
...
Data Frames
You can also use square brackets: [row, column]
where 'row' and 'column' are index numbers or
names.
For example, to access the third row and the ‘tsme’
column only:
> SOC[3, "tsme"]
[1] 0.002504971
Data Frames
Which is equivalent to the third row and sixth
column:
> SOC[3,6]
[1] 0.002504971
Data Frames
Leaving out the row or column means access all of
them.
All rows of the tsme column (column 6)
> SOC[, 6]
[1] 0.003985153 0.002502976 0.002504971 0.002508691 0.002509177 0.002509360
[7] 0.002518323 0.002515760 0.002509165 0.002514908 0.002517316 0.002510426
[13] 0.002506018 0.002503509 0.002505200 0.002503669 0.002505983 0.002504586
[19] 0.002504090 0.002506831 0.002507821 0.002505357 0.002505301 0.002507838
[25] 0.002507646 0.002502572 0.002504063 0.002505777 0.002506869 0.002502164
...
Data Frames
You can select multiple rows and columns using
vectors.
> SOC[1:5, c("SOC","tsme")]
SOC tsme
1 12.000325 0.003985153
2 3.483653 0.002502976
3 2.313414 0.002504971
4 1.941427 0.002508691
5 1.342969 0.002509177
Data Frames
Of course you can write values into a data.frame.
We make a copy of SOC (so we don't mess up
the original)
> SOCtemp <- SOC
> SOCtemp[3,"tsme"]
[1] 0.002504971
> SOCtemp[3,"tsme"] <- 1
> SOCtemp[3,"tsme"]
[1] 1
Data Frames
Recycling rules apply if fewer values are supplied
than selected
> SOCtemp[1:5,"tsme"] <- 1
> SOCtemp[1:5,]
Id UpperDepth LowerDepth SOC Lambda tsme
1 4 0 30 12.000325 0.01 1
2 7 0 30 3.483653 0.01 1
3 8 0 30 2.313414 0.01 1
4 9 0 30 1.941427 0.01 1
5 10 0 30 1.342969 0.01 1
Data Frames
For example, look at the ‘SOC’ column. Which of
these values are higher than 2?
> SOC$SOC > 2
[1] TRUE TRUE TRUE FALSE FALSE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
[13] TRUE TRUE TRUE TRUE FALSE FALSE FALSE FALSE TRUE FALSE FALSE FALSE
[25] TRUE FALSE FALSE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
[37] FALSE FALSE TRUE FALSE TRUE TRUE TRUE FALSE TRUE FALSE TRUE TRUE
[49] FALSE TRUE TRUE FALSE FALSE TRUE TRUE FALSE TRUE FALSE TRUE TRUE
[61] TRUE TRUE TRUE FALSE FALSE FALSE FALSE FALSE TRUE FALSE TRUE FALSE
...
Data Frames
We can now use this variable to access only the
rows that have SOC > 2
> SOCHigh <- SOC$SOC > 2
> SOC[SOCHigh,]
Id UpperDepth LowerDepth SOC Lambda tsme
1 4 0 30 12.000325 0.01 0.003985153
2 7 0 30 3.483653 0.01 0.002502976
3 8 0 30 2.313414 0.01 0.002504971
6 11 0 30 2.287933 0.01 0.002509360
7 12 0 30 2.715843 0.01 0.002518323
...
Data Frames
Or do it all at once (in pure R fashion)
> SOC[SOC$SOC > 2,]
Id UpperDepth LowerDepth SOC Lambda tsme
1 4 0 30 12.000325 0.01 0.003985153
2 7 0 30 3.483653 0.01 0.002502976
3 8 0 30 2.313414 0.01 0.002504971
6 11 0 30 2.287933 0.01 0.002509360
7 12 0 30 2.715843 0.01 0.002518323
8 13 0 30 4.340112 0.01 0.002515760
...
Data Frames
Ordering
You can reorder the data.frame by one or more
columns using the order() function
> SOC[order(SOC$tsme),]
Id UpperDepth LowerDepth SOC Lambda tsme
2268 4039 0 30 0.00000000 0.01 0.002472194
145 396 0 30 1.17823531 0.01 0.002479430
1527 2803 0 30 0.45000000 0.01 0.002482055
471 1032 0 30 0.62766244 0.01 0.002482168
1581 3101 0 30 0.92922471 0.01 0.002484241
2237 3996 0 30 1.44240409 0.01 0.002484922
2629 5048 0 30 0.85421359 0.01 0.002485544
1910 3590 0 30 1.45097492 0.01 0.002486062
1650 3234 0 30 0.00000000 0.01 0.002486621
2536 4632 0 30 1.76030792 0.01 0.002486932
...
Data Frames
Ordering
You can reorder the data.frame by one or more
columns using the order() function
> SOC[order(SOC$tsme, SOC$SOC),]
Id UpperDepth LowerDepth SOC Lambda tsme
2268 4039 0 30 0.00000000 0.01 0.002472194
145 396 0 30 1.17823531 0.01 0.002479430
1527 2803 0 30 0.45000000 0.01 0.002482055
471 1032 0 30 0.62766244 0.01 0.002482168
1581 3101 0 30 0.92922471 0.01 0.002484241
2237 3996 0 30 1.44240409 0.01 0.002484922
2629 5048 0 30 0.85421359 0.01 0.002485544
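order() sorts in increasing order by default. A sketch of sorting in decreasing order follows; the data frame sites and its values are made up for illustration.

```r
sites <- data.frame(
  site = c("s1", "s2", "s3", "s4"),
  soc  = c(2.3, 0.4, 3.1, 1.1),
  stringsAsFactors = FALSE
)

# order() returns row positions; decreasing = TRUE puts the largest first
by_soc_desc <- sites[order(sites$soc, decreasing = TRUE), ]
by_soc_desc$site        # "s3" "s1" "s4" "s2"
rownames(by_soc_desc)   # "3" "1" "4" "2": original row names travel with the rows
```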
Data Frames
Making your own data.frame is straightforward
using the data.frame() function. For example:
> year <- 2000:2010
> catch <- c(900, 1230, 1400, 930, 670, 1000, 960, 840, 900, 500,400)
> dat <- data.frame(year=year, catch=catch)
> head(dat)
year catch
1 2000 900
2 2001 1230
3 2002 1400
4 2003 930
5 2004 670
6 2005 1000
Data Frames
It's possible to add extra columns of various types
> dat$area <- c("N","S","N","S","N","S","N","S","N","S","N")
> dat$survey <- c(TRUE, FALSE, FALSE, TRUE, TRUE, TRUE, TRUE, TRUE, FALSE, TRUE,
TRUE)
> head(dat)
year catch area survey
1 2000 900 N TRUE
2 2001 1230 S FALSE
3 2002 1400 N FALSE
4 2003 930 S TRUE
5 2004 670 N TRUE
6 2005 1000 S TRUE
Data Frames
To add an extra row or rows use rbind and pass in
a data.frame with the exact same column names
and types
> dat2 <- data.frame(year = 1920, catch = 666, area = "N", survey = FALSE)
> dat <- rbind(dat, dat2)
> dat
year catch area survey
1 2000 900 N TRUE
2 2001 1230 S FALSE
3 2002 1400 N FALSE
4 2003 930 S TRUE
5 2004 670 N TRUE
6 2005 1000 S TRUE
7 2006 960 N TRUE
8 2007 840 S TRUE
9 2008 900 N FALSE
10 2009 500 S TRUE
11 2010 400 N TRUE
12 1920 666 N FALSE
EXERCISE (20 mins)
Ask at least 4 people near you and make a data.frame to
hold the following information about them: Name, hair
colour, height, shoe size, how long they can hold
their breath for.
• Reorder the data.frame by height.
• Subset the data.frame to only include people taller
than 1m 70.
• What is the mean shoe size of the people in the
data.frame?
• Whose shoe size is closest to the mean shoe size?
Data Frames
We need to talk about factors.
Macedonian Soil Data data.frame
> head(MSoil)
Id UpperDepth LowerDepth SOC Lambda tsme Region
1 4 0 30 12.000325 0.01 0.003985153 A
2 7 0 30 3.483653 0.01 0.002502976 B
3 8 0 30 2.313414 0.01 0.002504971 B
4 9 0 30 1.941427 0.01 0.002508691 B
5 10 0 30 1.342969 0.01 0.002509177 B
6 11 0 30 2.287933 0.01 0.002509360 B
Data Frames
We need to talk about factors. In the Macedonian
Soil Data data.frame, take a look at the ‘Region’
column.
> head(MSoil$Region)
[1] A B B B B B
Levels: A B
They look like characters, but no quotes. There are two
“levels”: A and B. What does this mean?
Data Frames
> class(MSoil$Region)
[1] "factor"
Data Frames
Factors
Factors are a way of encoding data that can be used for
grouping variables.
Values can only be one of the defined 'levels'. This
allows you to keep track of what the values could be.
They can be used to ensure that a data set is coherent.
Data Frames
For example, if we try to set a value in the
“Region” column to something other than A or B,
we get a warning
> MSoil[1,"Region"] <- "20"
Warning message:
In `[<-.factor`(`*tmp*`, iseq, value = "20") :
invalid factor level, NA generated
Data Frames
And a broken data.frame
> MSoil[1,]
Id UpperDepth LowerDepth SOC Lambda tsme Region
1 4 0 30 12.00032 0.01 0.003985153 <NA>
Data Frames
Let's fix it :)
> MSoil[1,"Region"] <- "A"
> MSoil[1,]
Id UpperDepth LowerDepth SOC Lambda tsme Region
1 4 0 30 12.00032 0.01 0.003985153 A
Data Frames
Factors
If you really wanted to change the value to
something not in the levels you need to change
the levels too (the names of the factors)
> levels(MSoil$Region)
[1] "A" "B"
> levels(MSoil$Region) <- c("A","B","C")
> levels(MSoil$Region)
[1] "A" "B" "C"
> MSoil[1, "Region"] <- "C"
> head(MSoil)
Id UpperDepth LowerDepth SOC Lambda tsme Region
1 4 0 30 12.000325 0.01 0.003985153 C
2 7 0 30 3.483653 0.01 0.002502976 B
3 8 0 30 2.313414 0.01 0.002504971 B
4 9 0 30 1.941427 0.01 0.002508691 B
5 10 0 30 1.342969 0.01 0.002509177 B
6 11 0 30 2.287933 0.01 0.002509360 B
MSoil[, "Region"]
[1] C B B B B B B B B B B B B B B B B B B B B B A A A A A A A A A A A A A A
[37] A B B B B B B B B B B B B B B B B B B B B B B B B B B B B B B B B B A A
...
[973] B B B B B B B B B B B B B B B B B B B B B B B B B B B B
[ reached getOption("max.print") -- omitted 2262 entries ]
Levels: A B C
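Assigning to levels() relabels the existing levels by position, so a safe way to add a level is to append it to the current ones. A minimal sketch on a small made-up factor:

```r
region <- factor(c("A", "B", "B"))

# Append the new level; the existing levels keep their positions and labels
levels(region) <- c(levels(region), "C")

region[1] <- "C"       # now a valid value, so no warning and no NA
as.character(region)   # "C" "B" "B"
levels(region)         # "A" "B" "C"
```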
Data Frames
Factors
Factors are used for many methods and
functions in R, such as linear analysis.
Data Frames
Let's make another data set that only includes
rows where Region == "B"
> MSoilB <- subset(MSoil, Region=="B")
>
> MSoilB$Region
[1] B B B B B B B B B B B B B B B B B B B B B B B B B B B B B B B B B B B B
...
[937] B B B B B B B B B B B B B B B B B B B B B B B B B B B B B B B B B B B B
[973] B B B B B B B B B B B B B B B B B B B B B B B B B B B B
[ reached getOption("max.print") -- omitted 1147 entries ]
Levels: A B C
Data Frames
You can see that we have no observations for A or C,
but the levels record that there could be. This might be
important for data management. When you import
data into R, some of your columns may be read in
as factors even if you did not intend them to (before
R 4.0.0, read.csv converted character columns to
factors by default).
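If the unused levels are not wanted after subsetting, droplevels() removes them. A sketch on a small made-up factor:

```r
region <- factor(c("A", "B", "B", "A", "B"))

only_b <- region[region == "B"]
levels(only_b)                 # "A" "B": the unused level A is kept
table(only_b)                  # A has a count of 0

only_b_clean <- droplevels(only_b)
levels(only_b_clean)           # "B"
```

droplevels() also accepts a data frame, in which case it is applied to every factor column.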
Data Frames
The by() function can be used to split the data and apply
a function to each chunk. This can be very useful for
summarising the data. For example, to split the data by
the 'Region' column (one chunk per level) and take the
mean of the SOC column in each chunk you can do
> by(MSoil$SOC, MSoil$Region, mean)
MSoil$Region: A
[1] NaN
----------------------------------------------------------
MSoil$Region: B
[1] 1.839508
----------------------------------------------------------
MSoil$Region: C
[1] 12.00032
Data Frames
aggregate() does something similar but can be
used to operate on multiple columns in a data
frame. Learning how to manipulate data frames is
a very useful skill. The plyr and reshape packages
are worth your time getting to know.
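A sketch of aggregate() computing group means for two columns at once, using the formula interface. The data frame and the clay column are made up for illustration.

```r
soil <- data.frame(
  region = c("A", "A", "B", "B", "B"),
  soc    = c(2, 4, 1, 3, 5),
  clay   = c(10, 20, 30, 40, 50)
)

# Mean of soc and clay within each region
means <- aggregate(cbind(soc, clay) ~ region, data = soil, FUN = mean)
means
#   region soc clay
# 1      A   3   15
# 2      B   3   40
```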
Exercise
What is the mean height by hair
colour of the people in your
data.frame?
Lists
A list is a very flexible container.
It's like a vector, but the elements can be
objects of any class and size - even lists
(lists of lists of lists of …).
This makes them very handy for moving big
chunks of data around (particularly returning
output from a function).
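A sketch of the "returning output from a function" use case: one function hands back several results of different shapes in a single list. The function name describe is hypothetical.

```r
# Hypothetical helper that returns several summaries of a numeric vector
describe <- function(x) {
  list(n = length(x), mean = mean(x), range = range(x))
}

res <- describe(c(1, 3, 5, 7, 9))
res$n       # 5
res$mean    # 5
res$range   # 1 9
```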
Making lists
Here we make three objects to put into a list.
> best_food <- c("cake", "banana")
> odd_numbers <- c(1,3,5,7,9)
> notes <- "Something interesting"
Making lists
To make the list, we use the list() function.
When you create a list, you should give the
elements names (they don't have to be the name
of the object).
> my_list <- list(food = best_food, numbers = odd_numbers, note
= notes)
> class(my_list)
[1] "list"
Lists
Getting the length of the list and the names of the
elements is straightforward
> length(my_list)
[1] 3
> names(my_list)
[1] "food" "numbers" "note"
Lists
Elements in a list can be extracted using two
methods. By name, using $ and the element
name.
> my_list$food
[1] "cake" "banana"
Lists
Accessing data in a list
Using [[ and the element position or name.
> my_list[[1]]
[1] "cake" "banana"
>
> my_list[["food"]]
[1] "cake" "banana"
Lists
Modifying lists
Lists can be easily extended - just add an extra
element.
> my_list[["new"]] <- c(1,3,5,7)
> summary(my_list)
Length Class Mode
food 2 -none- character
numbers 5 -none- numeric
note 1 -none- character
new 4 -none- numeric
Lists
Processing lists
lapply - apply the same function to each element
in a list.
> vec1 <- seq(from=1, to = 10, length = 7)
> vec2 <- seq(from=12, to = 20, length = 6)
> lst <- list(vec1 = vec1, vec2 = vec2)
> lapply(lst, sum)
$vec1
[1] 38.5
$vec2
[1] 96
Lists
Processing lists
This only makes sense if the same function can be
applied to all elements. For example, if we add a
character vector to the list (str1 in the output below), we
can't use sum. But length makes sense.
> lapply(lst, length)
$vec1
[1] 7
$vec2
[1] 6
$str1
[1] 3
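The step that adds the character vector is not shown in the slides. The sketch below assumes a three-element vector with made-up values, and also shows sapply(), which simplifies the result to a named vector instead of a list.

```r
lst <- list(
  vec1 = seq(from = 1, to = 10, length.out = 7),
  vec2 = seq(from = 12, to = 20, length.out = 6)
)
lst$str1 <- c("a", "b", "c")          # assumed contents; only the length matters here

lengths_list <- lapply(lst, length)   # a list: one result per element
lengths_vec  <- sapply(lst, length)   # same results, simplified to a named vector
lengths_vec                           # vec1 7, vec2 6, str1 3
```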
Lists - Exercises
Exercise 1
If:
p <- c(2,7,8), q <- c("A", "B", "C") and
x <- list(p, q),
then what is the value of x[2]?
a. NULL
b. "A" "B" "C"
c. "7"
Lists - Exercises
Exercise 2
If:
w <- c(2, 7, 8)
v <- c("A", "B", "C")
x <- list(w, v),
then which R statement will replace "A" in x with
"K".
a. x[[2]] <- "K"
b. x[[2]][1] <- "K"
c. x[[1]][2] <- "K"
Lists - Exercises
Exercise 3
If a <- list ("x"=5, "y"=10, "z"=15), which R
statement will give the sum of all elements in a?
a. sum(a)
b. sum(list(a))
c. sum(unlist(a))
Lists - Exercises
Exercise 4
If Newlist <- list(a=1:10, b="Good morning",
c="Hi"), write an R statement that will add 1 to
each element of the first vector in Newlist.
# Exercise 4
Newlist <- list(a=1:10, b="Good morning", c="Hi")
Newlist$a <- Newlist$a + 1
Newlist
## $a
## [1] 2 3 4 5 6 7 8 9 10 11
##
## $b
## [1] "Good morning"
##
## $c
## [1] "Hi"
Malawi (Mbewe)
 
Malawi (Desideri)
Malawi (Desideri)Malawi (Desideri)
Malawi (Desideri)
 
Lesotho
LesothoLesotho
Lesotho
 
Kenya
KenyaKenya
Kenya
 
ICRAF: Soil-plant spectral diagnostics laboratory
ICRAF: Soil-plant spectral diagnostics laboratoryICRAF: Soil-plant spectral diagnostics laboratory
ICRAF: Soil-plant spectral diagnostics laboratory
 
Ghana
GhanaGhana
Ghana
 
Ethiopia
EthiopiaEthiopia
Ethiopia
 
Item 15
Item 15Item 15
Item 15
 
Item 14
Item 14Item 14
Item 14
 
Item 13
Item 13Item 13
Item 13
 
Item 7
Item 7Item 7
Item 7
 
Item 6
Item 6Item 6
Item 6
 
Item 3
Item 3Item 3
Item 3
 
Item 16
Item 16Item 16
Item 16
 
Item 9: Soil mapping to support sustainable agriculture
Item 9: Soil mapping to support sustainable agricultureItem 9: Soil mapping to support sustainable agriculture
Item 9: Soil mapping to support sustainable agriculture
 
Item 8: WRB, World Reference Base for Soil Resouces
Item 8: WRB, World Reference Base for Soil ResoucesItem 8: WRB, World Reference Base for Soil Resouces
Item 8: WRB, World Reference Base for Soil Resouces
 
Item 7: Progress made in Nepal
Item 7: Progress made in NepalItem 7: Progress made in Nepal
Item 7: Progress made in Nepal
 
Item 6: International Center for Biosaline Agriculture
Item 6: International Center for Biosaline AgricultureItem 6: International Center for Biosaline Agriculture
Item 6: International Center for Biosaline Agriculture
 

Último

microwave assisted reaction. General introduction
microwave assisted reaction. General introductionmicrowave assisted reaction. General introduction
microwave assisted reaction. General introductionMaksud Ahmed
 
9548086042 for call girls in Indira Nagar with room service
9548086042  for call girls in Indira Nagar  with room service9548086042  for call girls in Indira Nagar  with room service
9548086042 for call girls in Indira Nagar with room servicediscovermytutordmt
 
Grant Readiness 101 TechSoup and Remy Consulting
Grant Readiness 101 TechSoup and Remy ConsultingGrant Readiness 101 TechSoup and Remy Consulting
Grant Readiness 101 TechSoup and Remy ConsultingTechSoup
 
Introduction to Nonprofit Accounting: The Basics
Introduction to Nonprofit Accounting: The BasicsIntroduction to Nonprofit Accounting: The Basics
Introduction to Nonprofit Accounting: The BasicsTechSoup
 
social pharmacy d-pharm 1st year by Pragati K. Mahajan
social pharmacy d-pharm 1st year by Pragati K. Mahajansocial pharmacy d-pharm 1st year by Pragati K. Mahajan
social pharmacy d-pharm 1st year by Pragati K. Mahajanpragatimahajan3
 
POINT- BIOCHEMISTRY SEM 2 ENZYMES UNIT 5.pptx
POINT- BIOCHEMISTRY SEM 2 ENZYMES UNIT 5.pptxPOINT- BIOCHEMISTRY SEM 2 ENZYMES UNIT 5.pptx
POINT- BIOCHEMISTRY SEM 2 ENZYMES UNIT 5.pptxSayali Powar
 
CARE OF CHILD IN INCUBATOR..........pptx
CARE OF CHILD IN INCUBATOR..........pptxCARE OF CHILD IN INCUBATOR..........pptx
CARE OF CHILD IN INCUBATOR..........pptxGaneshChakor2
 
The basics of sentences session 2pptx copy.pptx
The basics of sentences session 2pptx copy.pptxThe basics of sentences session 2pptx copy.pptx
The basics of sentences session 2pptx copy.pptxheathfieldcps1
 
Ecosystem Interactions Class Discussion Presentation in Blue Green Lined Styl...
Ecosystem Interactions Class Discussion Presentation in Blue Green Lined Styl...Ecosystem Interactions Class Discussion Presentation in Blue Green Lined Styl...
Ecosystem Interactions Class Discussion Presentation in Blue Green Lined Styl...fonyou31
 
Beyond the EU: DORA and NIS 2 Directive's Global Impact
Beyond the EU: DORA and NIS 2 Directive's Global ImpactBeyond the EU: DORA and NIS 2 Directive's Global Impact
Beyond the EU: DORA and NIS 2 Directive's Global ImpactPECB
 
The Most Excellent Way | 1 Corinthians 13
The Most Excellent Way | 1 Corinthians 13The Most Excellent Way | 1 Corinthians 13
The Most Excellent Way | 1 Corinthians 13Steve Thomason
 
1029-Danh muc Sach Giao Khoa khoi 6.pdf
1029-Danh muc Sach Giao Khoa khoi  6.pdf1029-Danh muc Sach Giao Khoa khoi  6.pdf
1029-Danh muc Sach Giao Khoa khoi 6.pdfQucHHunhnh
 
Web & Social Media Analytics Previous Year Question Paper.pdf
Web & Social Media Analytics Previous Year Question Paper.pdfWeb & Social Media Analytics Previous Year Question Paper.pdf
Web & Social Media Analytics Previous Year Question Paper.pdfJayanti Pande
 
Measures of Dispersion and Variability: Range, QD, AD and SD
Measures of Dispersion and Variability: Range, QD, AD and SDMeasures of Dispersion and Variability: Range, QD, AD and SD
Measures of Dispersion and Variability: Range, QD, AD and SDThiyagu K
 
Accessible design: Minimum effort, maximum impact
Accessible design: Minimum effort, maximum impactAccessible design: Minimum effort, maximum impact
Accessible design: Minimum effort, maximum impactdawncurless
 
Sports & Fitness Value Added Course FY..
Sports & Fitness Value Added Course FY..Sports & Fitness Value Added Course FY..
Sports & Fitness Value Added Course FY..Disha Kariya
 
Mastering the Unannounced Regulatory Inspection
Mastering the Unannounced Regulatory InspectionMastering the Unannounced Regulatory Inspection
Mastering the Unannounced Regulatory InspectionSafetyChain Software
 
A Critique of the Proposed National Education Policy Reform
A Critique of the Proposed National Education Policy ReformA Critique of the Proposed National Education Policy Reform
A Critique of the Proposed National Education Policy ReformChameera Dedduwage
 
Q4-W6-Restating Informational Text Grade 3
Q4-W6-Restating Informational Text Grade 3Q4-W6-Restating Informational Text Grade 3
Q4-W6-Restating Informational Text Grade 3JemimahLaneBuaron
 

Último (20)

microwave assisted reaction. General introduction
microwave assisted reaction. General introductionmicrowave assisted reaction. General introduction
microwave assisted reaction. General introduction
 
9548086042 for call girls in Indira Nagar with room service
9548086042  for call girls in Indira Nagar  with room service9548086042  for call girls in Indira Nagar  with room service
9548086042 for call girls in Indira Nagar with room service
 
Grant Readiness 101 TechSoup and Remy Consulting
Grant Readiness 101 TechSoup and Remy ConsultingGrant Readiness 101 TechSoup and Remy Consulting
Grant Readiness 101 TechSoup and Remy Consulting
 
Introduction to Nonprofit Accounting: The Basics
Introduction to Nonprofit Accounting: The BasicsIntroduction to Nonprofit Accounting: The Basics
Introduction to Nonprofit Accounting: The Basics
 
social pharmacy d-pharm 1st year by Pragati K. Mahajan
social pharmacy d-pharm 1st year by Pragati K. Mahajansocial pharmacy d-pharm 1st year by Pragati K. Mahajan
social pharmacy d-pharm 1st year by Pragati K. Mahajan
 
Mattingly "AI & Prompt Design: The Basics of Prompt Design"
Mattingly "AI & Prompt Design: The Basics of Prompt Design"Mattingly "AI & Prompt Design: The Basics of Prompt Design"
Mattingly "AI & Prompt Design: The Basics of Prompt Design"
 
POINT- BIOCHEMISTRY SEM 2 ENZYMES UNIT 5.pptx
POINT- BIOCHEMISTRY SEM 2 ENZYMES UNIT 5.pptxPOINT- BIOCHEMISTRY SEM 2 ENZYMES UNIT 5.pptx
POINT- BIOCHEMISTRY SEM 2 ENZYMES UNIT 5.pptx
 
CARE OF CHILD IN INCUBATOR..........pptx
CARE OF CHILD IN INCUBATOR..........pptxCARE OF CHILD IN INCUBATOR..........pptx
CARE OF CHILD IN INCUBATOR..........pptx
 
The basics of sentences session 2pptx copy.pptx
The basics of sentences session 2pptx copy.pptxThe basics of sentences session 2pptx copy.pptx
The basics of sentences session 2pptx copy.pptx
 
Ecosystem Interactions Class Discussion Presentation in Blue Green Lined Styl...
Ecosystem Interactions Class Discussion Presentation in Blue Green Lined Styl...Ecosystem Interactions Class Discussion Presentation in Blue Green Lined Styl...
Ecosystem Interactions Class Discussion Presentation in Blue Green Lined Styl...
 
Beyond the EU: DORA and NIS 2 Directive's Global Impact
Beyond the EU: DORA and NIS 2 Directive's Global ImpactBeyond the EU: DORA and NIS 2 Directive's Global Impact
Beyond the EU: DORA and NIS 2 Directive's Global Impact
 
The Most Excellent Way | 1 Corinthians 13
The Most Excellent Way | 1 Corinthians 13The Most Excellent Way | 1 Corinthians 13
The Most Excellent Way | 1 Corinthians 13
 
1029-Danh muc Sach Giao Khoa khoi 6.pdf
1029-Danh muc Sach Giao Khoa khoi  6.pdf1029-Danh muc Sach Giao Khoa khoi  6.pdf
1029-Danh muc Sach Giao Khoa khoi 6.pdf
 
Web & Social Media Analytics Previous Year Question Paper.pdf
Web & Social Media Analytics Previous Year Question Paper.pdfWeb & Social Media Analytics Previous Year Question Paper.pdf
Web & Social Media Analytics Previous Year Question Paper.pdf
 
Measures of Dispersion and Variability: Range, QD, AD and SD
Measures of Dispersion and Variability: Range, QD, AD and SDMeasures of Dispersion and Variability: Range, QD, AD and SD
Measures of Dispersion and Variability: Range, QD, AD and SD
 
Accessible design: Minimum effort, maximum impact
Accessible design: Minimum effort, maximum impactAccessible design: Minimum effort, maximum impact
Accessible design: Minimum effort, maximum impact
 
Sports & Fitness Value Added Course FY..
Sports & Fitness Value Added Course FY..Sports & Fitness Value Added Course FY..
Sports & Fitness Value Added Course FY..
 
Mastering the Unannounced Regulatory Inspection
Mastering the Unannounced Regulatory InspectionMastering the Unannounced Regulatory Inspection
Mastering the Unannounced Regulatory Inspection
 
A Critique of the Proposed National Education Policy Reform
A Critique of the Proposed National Education Policy ReformA Critique of the Proposed National Education Policy Reform
A Critique of the Proposed National Education Policy Reform
 
Q4-W6-Restating Informational Text Grade 3
Q4-W6-Restating Informational Text Grade 3Q4-W6-Restating Informational Text Grade 3
Q4-W6-Restating Informational Text Grade 3
 

6. R data structures

  • 1. Yusuf YIGINI, PhD - FAO, Land and Water Division (CBL) GSP - Eurasian Soil Partnership - Digital Soil Mapping and Modelling Training, Izmir, Turkey, 21-25 August 2017
  • 3. Basic Data Types Everything in R is an object. R has the following atomic vector types. character numeric integer logical complex By atomic, we mean the vector only holds data of a single type.
  • 4. Basic Data Types ● Numeric ● Integer ● Logical ● Character ● Factor > S <- 10 > S [1] 10 > class(S) [1] "numeric"
  • 5. Basic Data Types ● Numeric ● Integer ● Logical ● Character ● Factor > LC <- c("arable", "forest", "grassland") > LC [1] "arable" "forest" "grassland" > class(LC) [1] "character"
  • 6. Basic Data Types ● Numeric ● Integer ● Logical ● Character ● Factor > LC <- c("arable", "forest", "grassland") > LC.factor <- as.factor(LC) > class(LC.factor) [1] "factor"
  • 7. Basic Data Types ● Numeric ● Integer ● Logical ● Character ● Factor > y <- 20.9 > y [1] 20.9 > as.integer(y) [1] 20 > y <- as.integer(20) > y [1] 20 > class(y) [1] "integer"
  • 8. Basic Data Types ● Numeric ● Integer ● Logical ● Character ● Factor > a <- TRUE > a [1] TRUE > 4 < 2 [1] FALSE > 4 < 5 [1] TRUE > b <- 4 < 5 > b [1] TRUE > class(b) [1] "logical"
  • 9. Basic Data Types R provides many functions to examine features of vectors and other objects, for example class() - what kind of object is it (high-level)? typeof() - what is the object’s data type (low-level)? length() - how long is it? What about two dimensional objects? attributes() - does it have any metadata? > typeof(y) [1] "integer" > length(y) [1] 1 > class(y) [1] "integer" > str(y) int 20
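The slide asks what happens with two-dimensional objects. A minimal sketch, assuming a small integer matrix `m` that is not part of the original slides, shows how the same inspection functions behave:

```r
# Hypothetical 2 x 3 integer matrix (illustration only, not from the slides)
m <- matrix(1:6, nrow = 2, ncol = 3)

class(m)       # high-level class: "matrix" "array" in R >= 4.0, "matrix" in older versions
typeof(m)      # low-level storage type: "integer"
length(m)      # total number of elements, not the number of rows: 6
dim(m)         # rows and columns: 2 3
attributes(m)  # the dim attribute is the only metadata on this object
```

For a matrix, `length()` counts every cell, so `dim()`, `nrow()` and `ncol()` are usually the more informative calls.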
  • 11. This section is loosely based on the R manual “An Introduction to R” A couple of important functions we are going to use in this tutorial: is : is used to get information about the type/class of the object; as : is used to coerce/transform the object into a specific type/class;
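As a rough illustration of the is/as families, the sketch below tests and coerces a made-up character vector `x` (the values are assumptions for the example):

```r
# Hypothetical character vector; "three" is deliberately not a number
x <- c("1.5", "2", "three")

is.character(x)   # TRUE
is.numeric(x)     # FALSE

# as.numeric() coerces what it can; "three" becomes NA (R also issues a warning,
# silenced here so the example runs quietly)
x_num <- suppressWarnings(as.numeric(x))
x_num             # 1.5 2.0 NA
is.numeric(x_num) # TRUE
```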
  • 12. What is a vector in R? Like all other things in R, a vector is an object that lives in your working environment. In short, it is a data structure, and the simplest data structure in R.
  • 13. Vectors > v <- c(1, 43, 100, 3, 55) > is.vector(v) [1] TRUE > is(v) [1] "numeric" "vector" > length(v) [1] 5 > v [1] 1 43 100 3 55
  • 14. Vectors > v <- 1 > is.vector(v) [1] TRUE > length(v) [1] 1 > v [1] 1 Note that a scalar is a vector of length 1.
  • 15. Vector Arithmetics Vector arithmetics in R has the advantage of allowing the same operation to be performed on all the elements of the vector with a single call, avoiding loops.
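To make the "avoiding loops" point concrete, here is a small sketch comparing an explicit loop with the equivalent vectorised call. The variable names `out_loop` and `out_vec` are chosen for this example only:

```r
a <- c(1, 3, 5, 7)

# Explicit loop: multiply each element by 5, one at a time
out_loop <- numeric(length(a))
for (i in seq_along(a)) {
  out_loop[i] <- 5 * a[i]
}

# Vectorised form: one call operates on every element
out_vec <- 5 * a

identical(out_loop, out_vec)  # TRUE
```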
  • 16. Vector Arithmetics Arithmetic operations of vectors are performed member-by- member (memberwise). For example, suppose we have two vectors a and b. Then, if we multiply a by 5, we would get a vector with each of its members multiplied by 5 > a = c(1, 3, 5, 7) > b = c(1, 2, 4, 8)
  • 17. Vector Arithmetics Then, if we multiply a by 5, we would get a vector with each of its members multiplied by 5 > a = c(1, 3, 5, 7) > b = c(1, 2, 4, 8) > 5 * a [1] 5 15 25 35
  • 18. Vector Arithmetics And if we add a and b together, the sum would be a vector whose members are the sum of the corresponding members from a and b. > a + b [1] 2 5 9 15
  • 19. Vector Arithmetics Similarly for subtraction, multiplication and division, we get new vectors via memberwise operations. > a - b [1] 0 1 1 -1 > a * b [1] 1 6 20 56 > a / b [1] 1.000 1.500 1.250 0.875
  • 20. Vector Arithmetics Recycling Rule If two vectors are of unequal length, the shorter one will be recycled in order to match the longer vector. For example, the following vectors a and b have different lengths, and their sum is computed by recycling values of the shorter vector a. > a = c(10, 20, 30) > b = c(1, 2, 3, 4, 5, 6, 7, 8, 9) > a + b [1] 11 22 33 14 25 36 17 28 39
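One detail worth seeing in practice: when the longer length is not a multiple of the shorter one, R still recycles but emits a warning. A minimal sketch with made-up vectors:

```r
a <- c(10, 20, 30)
b <- c(1, 2, 3, 4)   # length 4 is not a multiple of length 3

# a is recycled as 10, 20, 30, 10; R warns that the longer object length
# is not a multiple of the shorter one (suppressed here to keep output clean)
res <- suppressWarnings(a + b)
res   # 11 22 33 14
```

Treat that warning as a hint to check the input lengths, since partial recycling is rarely what was intended.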
  • 21. Vector Arithmetics x %% y modulus (x mod y) 5%%2 is 1 x %/% y integer division 5%/%2 is 2 > x=c(2,4,6,8,12) > y=c(2,1,4,7,10) > x%%y [1] 0 0 2 1 2 > x %/% y [1] 1 4 1 1 1
  • 22. Vector Arithmetics A large number of operations are available. For example check the help page ?"+". ?"+"
  • 23. Vector Arithmetics There are also operations that summarize the contents of the vector. > v <- c(1, 34, 100, 3, 26) > sum(v) [1] 164 > prod(v) [1] 265200 > quantile(v) 0% 25% 50% 75% 100% 1 3 26 34 100
  • 24. Vector Arithmetics Creating vectors /vector types The simplest method to create vectors is to use c. Common alternatives are vector and seq. There are several others … Vectors can be of several types: -numeric -logical -character
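The alternatives to `c` mentioned above can be sketched as follows; `seq_len` and `rep` are two of the "several others" and are added here as examples, not taken from the slides:

```r
v_empty <- vector("numeric", 3)        # 0 0 0 (a numeric vector filled with zeros)
v_seq   <- seq(0, 1, by = 0.25)        # 0.00 0.25 0.50 0.75 1.00
v_len   <- seq_len(4)                  # 1 2 3 4
v_rep   <- rep(c("a", "b"), times = 2) # "a" "b" "a" "b"
```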
  • 25. Vectors Numeric vectors > w <- c(10, 10.2, 34, 7.35, 0) > is(w) [1] "numeric" "vector" > w <- seq(0,10,2) > w <- 1:10 > w [1] 1 2 3 4 5 6 7 8 9 10
  • 26. Vectors Logical vectors > v <- c(TRUE, FALSE, TRUE, TRUE) > is(v) [1] "logical" "vector"
  • 27. Vectors A useful feature of logical vectors is the possibility to coerce/transform into, and from, numeric vectors with the as method. > v <- c(TRUE, FALSE, TRUE, TRUE) > as.numeric(v) [1] 1 0 1 1 > v <- c(0, 0, 1, 1) > as.logical(v) [1] FALSE FALSE TRUE TRUE
  • 28. Vectors Logical vectors are the outcome of comparisons. > v <- c(10, 10.2, 34, 7.35, 0) > v < 5 [1] FALSE FALSE FALSE FALSE TRUE > v >= 10 [1] TRUE TRUE TRUE FALSE FALSE > v == 0 [1] FALSE FALSE FALSE FALSE TRUE > v!=0 [1] TRUE TRUE TRUE TRUE FALSE
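Because TRUE coerces to 1 and FALSE to 0, the result of a comparison can be counted or averaged directly. A short sketch using the same vector (the name `above_nine` is an assumption for this example):

```r
v <- c(10, 10.2, 34, 7.35, 0)

above_nine <- v > 9   # TRUE TRUE TRUE FALSE FALSE
sum(above_nine)       # number of elements above 9: 3
mean(above_nine)      # proportion of elements above 9: 0.6
```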
  • 29. Vectors Character vectors > v <- c("a", "b", "c", "d", "e") > is(v) [1] "character" "vector" [3] "data.frameRowLabels" "SuperClassMethod"
  • 30. Vectors Character vectors With characters the combination of vectors can be useful, > v1 <- "ASP Training Workshop" > v2 <- "24-29 April 2017" > paste(v1,v2) [1] "ASP Training Workshop 24-29 April 2017"
  • 31. Vectors Character vectors Character strings that do not represent a number or a logical value cannot be converted to numeric or logical; the result is NA. > as.numeric(v1) [1] NA Warning message: NAs introduced by coercion > as.logical(v1) [1] NA
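For contrast, strings that do look like numbers or logical values convert without trouble. A small sketch with made-up inputs:

```r
num_ok  <- as.numeric("3.14")   # 3.14
lgl_ok  <- as.logical("TRUE")   # TRUE
lgl_abb <- as.logical("T")      # TRUE ("T", "true", "True" are also accepted)
lgl_bad <- as.logical("yes")    # NA, because "yes" is not a recognised spelling
```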
  • 32. Vectors Vectors can be used to create other vectors. > v <- c(1, 34, 100) > v1 <- c(v, 0, 0, v) > length(v1) [1] 8 > v [1] 1 34 100 > v1 [1] 1 34 100 0 0 1 34 100
  • 33. Exercise Generate 3 vectors of 500 elements of a random variable with mean 0 and standard deviation 1 (*). Call them v1, v2, and v3. (*) Tip: use rnorm function. V <- rnorm(n, mean=, sd=)
  • 34. Vectors Vector Index We retrieve values in a vector by declaring an index inside a single square bracket "[]" operator. For example, the following shows how to retrieve a vector member. > v <- c(10, 10.2, 34, 7.35, 0) > v[1] [1] 10 > v[c(2, 4)] [1] 10.20 7.35 > v[c(4, 2)] [1] 7.35 10.20 > v[-c(2:5)] [1] 10
  • 35. Vectors Vector Index The index vector can be of different types. Above we used a vector of integers, but it could be a logical vector. > v <- c(10, 10.2, 34, 7.35, 0) > v[c(TRUE, FALSE, FALSE, FALSE, FALSE)] [1] 10
  • 36. Vectors Vector Index For example, when we do v > 9 we get a logical vector stating which elements were larger than 9 but we didn't get the elements. To get the elements we can use the logical vector to subset the vector. v <- c(10, 10.2, 34, 7.35, 0) # which elements are larger than 9 v > 9 [1] TRUE TRUE TRUE FALSE FALSE # select elements larger than 9 v[v > 9] [1] 10.0 10.2 34.0 # or idx <- v > 9 v[idx] [1] 10.0 10.2 34.0
  • 37. Vectors Vector Index A particular case applies to NA elements. v <- c(NA, 10.2, 34, NA, 0) # select the NA elements v==NA [1] NA NA NA NA NA v=="NA" [1] NA FALSE FALSE NA FALSE # neither works because NA is special
  • 38. Vectors Vector Index Instead, we should use the is.na() function. v [1] NA 10.2 34.0 NA 0.0 is.na(v) [1] TRUE FALSE FALSE TRUE FALSE # now it's possible to select the NA v[is.na(v)] [1] NA NA
  • 39. Vectors To replace NA values, if needed, assign to the selected elements with <- v [1] NA 10.2 34.0 NA 0.0 v[is.na(v)] <- 200 v [1] 200.0 10.2 34.0 200.0 0.0
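Replacing NA values is not always necessary. As a sketch of an alternative (not shown in the slides), many summary functions accept `na.rm = TRUE`, and `is.na()` can be negated to drop missing values:

```r
v <- c(NA, 10.2, 34, NA, 0)

m_na    <- mean(v)                # NA, because missing values propagate
m_clean <- mean(v, na.rm = TRUE)  # mean of 10.2, 34 and 0
n_miss  <- sum(is.na(v))          # number of missing values: 2
v_clean <- v[!is.na(v)]           # 10.2 34.0 0.0
```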
  • 41. Plotting vectors There's a lot to be done with graphs, which will be demonstrated later, but for the moment check the most common ones. v <- rnorm(1000) plot(v, main="My scatter plot")
  • 42. Plotting vectors v <- rnorm(1000) plot(v, main="My scatter plot")
  • 44. Plotting vectors > v <- rnorm(1000, mean=40, sd=5) > hist(v, main="My histogram")
  • 45. Density Plot > plot(density(v), main="My density plot")
  • 46. Comparing 2 variables or vectors. v1 <- rnorm(1000) v2 <- rnorm(1000) plot(v1, v2, main="Independent variables")
  • 47. Comparing 2 variables or vectors. v1 <- rnorm(1000) v2 <- rnorm(1000, v1) plot(v1, v2, main="Dependent variables")
  • 48. Naming Vectors Adding names to the elements may be useful in some situations. For example, when dealing with model parameters it may be easier to use the parameter names. Names can also be used for subsetting.
  • 49. Naming Vectors v <- c(10, 3, 0, 54.2, 1) names(v) <- letters[1:5] v a b c d e 10.0 3.0 0.0 54.2 1.0 v["c"] c 0 v[3] c 0
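Names combine with the other indexing styles. The sketch below, using the same vector, selects several elements by name and recovers the names of elements that satisfy a condition (the result names `picked`, `big_names` and `bare` are assumptions for this example):

```r
v <- c(10, 3, 0, 54.2, 1)
names(v) <- letters[1:5]

picked    <- v[c("a", "d")]   # named elements a = 10.0 and d = 54.2
big_names <- names(v)[v > 5]  # "a" "d"
bare      <- unname(v["d"])   # 54.2, with the name stripped
```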
  • 50. Exercises - Vectors (30 mins) Exercise 1 Consider two vectors, x, y x=c(4,6,5,7,10,9,4,15) y=c(0,10,1,8,2,3,4,1) What is the value of: x*y
  • 51. Exercises Exercise 2 Consider two vectors, a, b a=c(1,2,4,5,6) b=c(3,2,4,1,9) What is the value of: cbind(a,b)
  • 52. Exercises Exercise 3 Consider two vectors, a, b a=c(1,5,4,3,6) b=c(3,5,2,1,9) What is the value of: a<=b
  • 53. Exercises Exercise 4 Consider two vectors, a, b a=c(10,2,4,15) b=c(3,12,4,11) What is the value of: rbind(a,b)
  • 54. Exercises Exercise 5 x<- c(1:12) What is the value of: dim(x) What is the value of: length(x)
  • 55. Exercises Exercise 6 If a=c(12:5) What is the value of: is.numeric(a)
  • 56. Exercises Exercise 7 Consider two vectors, x, y x=c(12:4) y=c(0,1,2,0,1,2,0,1,2) What is the value of: which(!is.finite(x/y))
  • 57. Exercises Exercise 8 Consider two vectors, x, y x=letters[1:10] y=letters[15:24] What is the value of: x<y
  • 62. Data Frames and Lists In this session: one of R's most useful object types: the data.frame. And also: Lists which are simple but useful.
  • 63. Data Frames A data frame is a table or a two-dimensional array-like structure in which each column contains values of one variable and each row contains one set of values from each column. Characteristics of a data frame: • The column names should not be empty. • The row names should be unique. • The data stored in a data frame may be of numeric, factor or character type. • Each column contains the same number of data items.
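These characteristics can be checked on a tiny example. The data frame `soil` below is invented for illustration and is not the Macedonian data set used later:

```r
# Hypothetical soil samples (values invented for illustration)
soil <- data.frame(
  Id     = c(1, 2, 3),
  SOC    = c(2.4, 1.1, 3.8),
  Region = c("A", "B", "B"),
  stringsAsFactors = FALSE   # keep Region as character in every R version
)

str(soil)            # one line per column, with its type
sapply(soil, class)  # Id and SOC are numeric, Region is character

# Columns must have the same number of items: this call fails
bad <- try(data.frame(a = 1:3, b = 1:2), silent = TRUE)
inherits(bad, "try-error")  # TRUE
```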
  • 64. Data Frames For example, 'Soil Organic Carbon' data from the Macedonian database. > SOC <- read.csv("MASIS_SOC.csv") > SOC Id UpperDepth LowerDepth SOC Lambda tsme 1 4 0 30 12.00032455 0.01 0.003985153 2 7 0 30 3.48365276 0.01 0.002502976 3 8 0 30 2.31341405 0.01 0.002504971 4 9 0 30 1.94142743 0.01 0.002508691 5 10 0 30 1.34296903 0.01 0.002509177 6 11 0 30 2.28793284 0.01 0.002509360 7 12 0 30 2.71584298 0.01 0.002518323 8 13 0 30 4.34011158 0.01 0.002515760 ...
  • 65. Data Frames summary() summarises each column > summary(SOC) Id UpperDepth LowerDepth SOC Lambda Min. : 4 Min. :0 Min. :30 Min. : 0.000 Min. :0.01 1st Qu.:1878 1st Qu.:0 1st Qu.:30 1st Qu.: 1.006 1st Qu.:0.01 Median :3214 Median :0 Median :30 Median : 1.495 Median :0.01 Mean :3198 Mean :0 Mean :30 Mean : 1.916 Mean :0.01 3rd Qu.:4502 3rd Qu.:0 3rd Qu.:30 3rd Qu.: 2.268 3rd Qu.:0.01 Max. :6539 Max. :0 Max. :30 Max. :50.205 Max. :0.01 NA's :1 tsme Min. :0.002472 1st Qu.:0.002502 Median :0.002504 Mean :0.002507 3rd Qu.:0.002507 Max. :0.003985
  • 66. Data Frames or head/tail to look at the first / last few rows > tail(SOC) Id UpperDepth LowerDepth SOC Lambda tsme 3257 6531 0 30 0.5698581 0.01 0.002503761 3258 6532 0 30 5.7547935 0.01 0.002505020 3259 6533 0 30 1.6636972 0.01 0.002506451 3260 6535 0 30 1.9226001 0.01 0.002502052 3261 6537 0 30 1.7165334 0.01 0.002502749 3262 6539 0 30 1.3633083 0.01 0.002502855
  • 67. Data Frames We can inspect the dimensions (rows, then columns) > dim(SOC) [1] 3262 6
  • 68. Data Frames And dimension names > dimnames(SOC) [[1]] [1] "1" "2" "3" "4" "5" "6" "7" "8" "9" "10" [11] "11" "12" "13" "14" "15" "16" "17" "18" "19" "20" … [[2]] [1] "Id" "UpperDepth" "LowerDepth" "SOC" "Lambda" [6] "tsme"
  • 69. Data Frames Accessing values in a data.frame There are several ways to access the data in a data.frame. To access a whole column you can use '$' and the column name: > SOC$SOC [1] 12.00032455 3.48365276 2.31341405 1.94142743 1.34296903 2.28793284 [7] 2.71584298 4.34011158 5.77118126 4.54692240 4.63597793 2.10768409 [13] 3.96522026 4.80577783 3.08891798 4.59635072 1.51213851 1.31937774 [19] 1.64608828 1.63183332 3.93370051 1.89931487 1.70009503 1.68627536 ...
  • 70. Data Frames You can also use square brackets: [row, column] where 'row' and 'column' are index numbers or names. For example, to access the third row and the ‘tsme’ column only: > SOC[3, "tsme"] [1] 0.002504971
  • 71. Data Frames Which is equivalent to the third row and sixth column: > SOC[3,6] [1] 0.002504971
  • 72. Data Frames Leaving out the row or column means access all of them. All rows of the sixth column (tsme): > SOC[, 6] [1] 0.003985153 0.002502976 0.002504971 0.002508691 0.002509177 0.002509360 [7] 0.002518323 0.002515760 0.002509165 0.002514908 0.002517316 0.002510426 [13] 0.002506018 0.002503509 0.002505200 0.002503669 0.002505983 0.002504586 [19] 0.002504090 0.002506831 0.002507821 0.002505357 0.002505301 0.002507838 [25] 0.002507646 0.002502572 0.002504063 0.002505777 0.002506869 0.002502164 ...
  • 73. Data Frames You can select multiple rows and columns using vectors. > SOC[1:5, c("SOC","tsme")] SOC tsme 1 12.000325 0.003985153 2 3.483653 0.002502976 3 2.313414 0.002504971 4 1.941427 0.002508691 5 1.342969 0.002509177
  • 74. Data Frames Of course you can write values into a data.frame. We make a copy of SOC (so we don't mess up the original) > SOCtemp <- SOC > SOCtemp[3,"tsme"] [1] 0.002504971 > SOCtemp[3,"tsme"] <- 1 > SOCtemp[3,"tsme"] [1] 1
  • 75. Data Frames Recycling rules apply if fewer values are supplied than selected > SOCtemp[1:5,"tsme"] <- 1 > SOCtemp[1:5,] Id UpperDepth LowerDepth SOC Lambda tsme 1 4 0 30 12.000325 0.01 1 2 7 0 30 3.483653 0.01 1 3 8 0 30 2.313414 0.01 1 4 9 0 30 1.941427 0.01 1 5 10 0 30 1.342969 0.01 1
  • 76. Data Frames For example, look at the ‘SOC’ column. Which of these values are higher than 2? > SOC$SOC > 2 [1] TRUE TRUE TRUE FALSE FALSE TRUE TRUE TRUE TRUE TRUE TRUE TRUE [13] TRUE TRUE TRUE TRUE FALSE FALSE FALSE FALSE TRUE FALSE FALSE FALSE [25] TRUE FALSE FALSE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE [37] FALSE FALSE TRUE FALSE TRUE TRUE TRUE FALSE TRUE FALSE TRUE TRUE [49] FALSE TRUE TRUE FALSE FALSE TRUE TRUE FALSE TRUE FALSE TRUE TRUE [61] TRUE TRUE TRUE FALSE FALSE FALSE FALSE FALSE TRUE FALSE TRUE FALSE ...
  • 77. Data Frames We can now use this variable to access only the rows that have SOC > 2 > SOCHigh <- SOC$SOC > 2 > SOC[SOCHigh,] Id UpperDepth LowerDepth SOC Lambda tsme 1 4 0 30 12.000325 0.01 0.003985153 2 7 0 30 3.483653 0.01 0.002502976 3 8 0 30 2.313414 0.01 0.002504971 6 11 0 30 2.287933 0.01 0.002509360 7 12 0 30 2.715843 0.01 0.002518323 ...
  • 78. Data Frames Or do it all at once (in pure R fashion) > SOC[SOC$SOC > 2,] Id UpperDepth LowerDepth SOC Lambda tsme 1 4 0 30 12.000325 0.01 0.003985153 2 7 0 30 3.483653 0.01 0.002502976 3 8 0 30 2.313414 0.01 0.002504971 6 11 0 30 2.287933 0.01 0.002509360 7 12 0 30 2.715843 0.01 0.002518323 8 13 0 30 4.340112 0.01 0.002515760 ...
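The summary earlier reports one NA in the SOC column, and a logical index containing NA returns a row of NAs. A minimal sketch of the effect and two common ways around it, using an invented data frame `soil` rather than the real data:

```r
# Hypothetical data with one missing SOC value (illustration only)
soil <- data.frame(Id = 1:4, SOC = c(2.4, NA, 3.8, 1.1))

with_na <- soil[soil$SOC > 2, ]         # 3 rows: the NA comparison yields an all-NA row
by_pos  <- soil[which(soil$SOC > 2), ]  # which() drops NA, leaving Ids 1 and 3
by_sub  <- subset(soil, SOC > 2)        # subset() also treats NA as FALSE
```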
  • 79. Data Frames Ordering You can reorder the data.frame by one or more columns using the order() function > SOC[order(SOC$tsme),] Id UpperDepth LowerDepth SOC Lambda tsme 2268 4039 0 30 0.00000000 0.01 0.002472194 145 396 0 30 1.17823531 0.01 0.002479430 1527 2803 0 30 0.45000000 0.01 0.002482055 471 1032 0 30 0.62766244 0.01 0.002482168 1581 3101 0 30 0.92922471 0.01 0.002484241 2237 3996 0 30 1.44240409 0.01 0.002484922 2629 5048 0 30 0.85421359 0.01 0.002485544 1910 3590 0 30 1.45097492 0.01 0.002486062 1650 3234 0 30 0.00000000 0.01 0.002486621 2536 4632 0 30 1.76030792 0.01 0.002486932 ...
  • 80. Data Frames Ordering You can reorder the data.frame by one or more columns using the order() function > SOC[order(SOC$tsme, SOC$SOC),] Id UpperDepth LowerDepth SOC Lambda tsme 2268 4039 0 30 0.00000000 0.01 0.002472194 145 396 0 30 1.17823531 0.01 0.002479430 1527 2803 0 30 0.45000000 0.01 0.002482055 471 1032 0 30 0.62766244 0.01 0.002482168 1581 3101 0 30 0.92922471 0.01 0.002484241 2237 3996 0 30 1.44240409 0.01 0.002484922 2629 5048 0 30 0.85421359 0.01 0.002485544
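As a sketch of two variations on `order()`, the example below sorts in decreasing order and mixes an ascending column with a descending one. The data frame `soil` is made up for the purpose:

```r
# Hypothetical data (illustration only)
soil <- data.frame(
  Id     = 1:4,
  Region = c("B", "A", "B", "A"),
  SOC    = c(2.4, 1.1, 3.8, 0.6),
  stringsAsFactors = FALSE
)

# Highest SOC first
desc <- soil[order(soil$SOC, decreasing = TRUE), ]

# Region ascending, then SOC descending within each region
# (negating a numeric column reverses its sort direction)
mixed <- soil[order(soil$Region, -soil$SOC), ]
```

Here `desc$Id` comes out as 3, 1, 2, 4 and `mixed$Id` as 2, 4, 3, 1.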
  • 81. Data Frames Making your own data.frame is straightforward using the data.frame() function. For example: > year <- 2000:2010 > catch <- c(900, 1230, 1400, 930, 670, 1000, 960, 840, 900, 500,400) > dat <- data.frame(year=year, catch=catch) > head(dat) year catch 1 2000 900 2 2001 1230 3 2002 1400 4 2003 930 5 2004 670 6 2005 1000
  • 82. Data Frames It's possible to add extra columns of various types > dat$area <- c("N","S","N","S","N","S","N","S","N","S","N") > dat$survey <- c(TRUE, FALSE, FALSE, TRUE, TRUE, TRUE, TRUE, TRUE, FALSE, TRUE, TRUE) > head(dat) year catch area survey 1 2000 900 N TRUE 2 2001 1230 S FALSE 3 2002 1400 N FALSE 4 2003 930 S TRUE 5 2004 670 N TRUE 6 2005 1000 S TRUE
  • 83. Data Frames To add an extra row or rows use rbind and pass in a data.frame with the exact same column names and types > dat2 <- data.frame(year = 1920, catch = 666, area = "N", survey = FALSE) > dat <- rbind(dat, dat2) > dat year catch area survey 1 2000 900 N TRUE 2 2001 1230 S FALSE 3 2002 1400 N FALSE 4 2003 930 S TRUE 5 2004 670 N TRUE 6 2005 1000 S TRUE 7 2006 960 N TRUE 8 2007 840 S TRUE 9 2008 900 N FALSE 10 2009 500 S TRUE 11 2010 400 N TRUE 12 1920 666 N FALSE
  • 84. EXERCISE (20 mins) Ask at least 4 people near you and make a data.frame to hold the following information about them: Name, hair colour, height, shoe size, how long they can hold their breath for. • Reorder the data.frame by height. • Subset the data.frame to only include people taller than 1m 70. • What is the mean shoe size of the people in the data.frame? • Whose shoe size is closest to the mean shoe size?
  • 85. Data Frames We need to talk about factors. Macedonian Soil Data data.frame > head(MSoil) Id UpperDepth LowerDepth SOC Lambda tsme Region 1 4 0 30 12.000325 0.01 0.003985153 A 2 7 0 30 3.483653 0.01 0.002502976 B 3 8 0 30 2.313414 0.01 0.002504971 B 4 9 0 30 1.941427 0.01 0.002508691 B 5 10 0 30 1.342969 0.01 0.002509177 B 6 11 0 30 2.287933 0.01 0.002509360 B
  • 86. Data Frames We need to talk about factors. In the Macedonian Soil Data data.frame Take a look at the ‘Region’ column > head(MSoil$Region) [1] A B B B B B Levels: A B They look like characters, but no quotes. There are two “levels”: A and B. What does this mean?
  • 87. Data Frames They look like characters, but no quotes. There are two “levels”: A and B. What does this mean? > class(MSoil$Region) [1] "factor"
  • 88. Data Frames Factors Factors are a way of encoding data that can be used for grouping variables. Values can only be one of the defined 'levels'. This allows you to keep track of what the values could be. They can be used to ensure that a data set is coherent.
  • 89. Data Frames For example, if we try to set a value in the “Region” column to something other than A or B, we get a warning > MSoil[1,"Region"] <- "20" Warning message: In `[<-.factor`(`*tmp*`, iseq, value = "20") : invalid factor level, NA generated
  • 90. Data Frames And a broken data.frame > MSoil[1,] Id UpperDepth LowerDepth SOC Lambda tsme Region 1 4 0 30 12.00032 0.01 0.003985153 <NA>
  • 91. Data Frames Let's fix it :) > MSoil[1,"Region"] <- "A" > MSoil[1,] Id UpperDepth LowerDepth SOC Lambda tsme Region 1 4 0 30 12.00032 0.01 0.003985153 A
  • 92. Data Frames Factors If you really want to set a value that is not among the levels, you need to add it to the levels first (the levels are the set of allowed values). The current levels are: > levels(MSoil$Region) [1] "A" "B"
  • 93. Data Frames Factors Add the new level "C", then assign it: > levels(MSoil$Region) <- c("A","B","C") > levels(MSoil$Region) [1] "A" "B" "C" > MSoil[1, "Region"] <- "C" > head(MSoil) Id UpperDepth LowerDepth SOC Lambda tsme Region 1 4 0 30 12.000325 0.01 0.003985153 C 2 7 0 30 3.483653 0.01 0.002502976 B 3 8 0 30 2.313414 0.01 0.002504971 B 4 9 0 30 1.941427 0.01 0.002508691 B 5 10 0 30 1.342969 0.01 0.002509177 B 6 11 0 30 2.287933 0.01 0.002509360 B
  • 94. Data Frames Factors Printing the whole column shows the new value and the three levels: > MSoil[, "Region"] [1] C B B B B B B B B B B B B B B B B B B B B B A A A A A A A A A A A A A A [37] A B B B B B B B B B B B B B B B B B B B B B B B B B B B B B B B B B A A ... [973] B B B B B B B B B B B B B B B B B B B B B B B B B B B B [ reached getOption("max.print") -- omitted 2262 entries ] Levels: A B C
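One possible way to add a level without retyping the existing ones is to append to `levels()`. This is a sketch on a hypothetical toy factor named `region`, not on MSoil itself:

```r
# Hypothetical toy factor with two levels
region <- factor(c("A", "B", "B"))

# Append a new level while keeping the existing ones in place
levels(region) <- c(levels(region), "C")

# "C" is now an allowed value, so this assignment gives no warning
region[1] <- "C"
region
```

Appending avoids accidentally renaming existing levels, which can happen if the replacement vector lists them in a different order.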
  • 95. Data Frames Factors Many methods and functions in R rely on factors, for example to represent categorical predictors in linear models.
  • 96. Data Frames Let's make another data set that only includes the rows where Region == "B": > MSoilB <- subset(MSoil, Region=="B") > MSoilB$Region [1] B B B B B B B B B B B B B B B B B B B B B B B B B B B B B B B B B B B B ... [937] B B B B B B B B B B B B B B B B B B B B B B B B B B B B B B B B B B B B [973] B B B B B B B B B B B B B B B B B B B B B B B B B B B B [ reached getOption("max.print") -- omitted 1147 entries ] Levels: A B C
  • 97. Data Frames You can see that MSoilB has no observations for A (or C), but the levels record that there could be. This might be important for data management. When you import data into R, some of your columns may be read in as factors even if you did not intend them to be.
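If the unused levels are not wanted, `droplevels()` removes them, and `stringsAsFactors = FALSE` keeps text columns as characters on import. The sketch below uses a hypothetical toy data frame `soil` and an inline CSV string rather than the real MSoil file:

```r
# Hypothetical toy data frame (not the real MSoil data)
soil <- data.frame(SOC = c(12.0, 3.5, 2.3),
                   Region = factor(c("A", "B", "B")))

soilB <- subset(soil, Region == "B")
levels_before <- levels(soilB$Region)     # unused level "A" is still there

soilB$Region <- droplevels(soilB$Region)  # drop levels with no observations
levels_after <- levels(soilB$Region)

# Reading text columns as plain characters instead of factors
imported <- read.csv(text = "Id,Region\n1,A\n2,B", stringsAsFactors = FALSE)
class(imported$Region)
```

Since R 4.0.0, `stringsAsFactors = FALSE` is the default for `read.csv()` and `data.frame()`; in older versions text columns became factors unless this was set explicitly.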
  • 98. Data Frames The by() function can be used to split the data and apply a function to each chunk. This can be very useful for summarising the data. For example, to split the data by the 'Region' column (one chunk per level) and take the mean of the SOC column for each chunk you can do > by(MSoil$SOC, MSoil$Region, mean) MSoil$Region: A [1] NaN ---------------------------------------------------------- MSoil$Region: B [1] 1.839508 ---------------------------------------------------------- MSoil$Region: C [1] 12.00032
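The same split-and-apply idea can be tried on a small data set where the result is easy to check by hand. The data frame `soil` and its values are invented for illustration:

```r
# Hypothetical toy data: two regions, two observations each
soil <- data.frame(
  SOC    = c(2, 4, 6, 10),
  Region = factor(c("A", "A", "B", "B"))
)

# by() prints one block per level of the grouping factor
by(soil$SOC, soil$Region, mean)

# tapply() computes the same group means and returns a named array
group_means <- tapply(soil$SOC, soil$Region, mean)
group_means
```

`tapply()` is often more convenient when the group means are needed for further calculation rather than for printing.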
  • 99. Data Frames aggregate() does something similar but can be used to operate on multiple columns in a data frame. Learning how to manipulate data frames is a very useful skill. The plyr and reshape packages are worth your time getting to know.
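A short sketch of `aggregate()` summarising two columns at once with its formula interface. The data frame `soil` and its numbers are hypothetical:

```r
# Hypothetical toy data with two numeric columns and one grouping factor
soil <- data.frame(
  SOC    = c(2, 4, 6, 10),
  Lambda = c(0.1, 0.3, 0.2, 0.4),
  Region = factor(c("A", "A", "B", "B"))
)

# Mean of SOC and Lambda for each Region; the result is a data frame
agg <- aggregate(cbind(SOC, Lambda) ~ Region, data = soil, FUN = mean)
agg
```

Unlike `by()`, the result is an ordinary data frame with one row per group, so it can be passed straight to other data frame functions.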
  • 100. Exercise What is the mean height by hair colour of the people in your data.frame?
  • 101. Lists A list is a very flexible container. It's like a vector, but the elements can be objects of any class and size - even lists (lists of lists of lists of …). This makes them very handy for moving big chunks of data around (particularly returning output from a function).
  • 102. Making lists Here we make three objects to put into a list. > best_food <- c("cake", "banana") > odd_numbers <- c(1,3,5,7,9) > notes <- "Something interesting"
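To illustrate the point about mixed and nested contents, here is a small sketch. The object `profile` and all of its values are invented for the example:

```r
# Hypothetical soil profile record mixing several types
profile <- list(
  id     = 4,
  depths = c(upper = 0, lower = 30),          # named numeric vector
  lab    = list(SOC = 12.0, notes = "toy")    # a list inside a list
)

profile$lab$SOC   # reach into the nested list
str(profile)      # compact overview of the whole structure
```

`str()` is a convenient way to inspect a list whose structure is not known in advance.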
  • 103. Making lists To make the list, we use the list() function. When you create a list, you should give the elements names (they don't have to be the name of the object). > my_list <- list(food = best_food, numbers = odd_numbers, note = notes) > class(my_list) [1] "list"
  • 104. Lists Getting the length of the list and the names of the elements is straightforward > length(my_list) [1] 3
  • 105. Lists Getting the length of the list and the names of the elements is straightforward > length(my_list) [1] 3 > names(my_list) [1] "food" "numbers" "note"
  • 106. Lists Elements in a list can be extracted using two methods. By name, using $ and the element name. > my_list$food [1] "cake" "banana"
  • 107. Lists Accessing data in a list Using [[ and the element position or name. > my_list[[1]] [1] "cake" "banana" > my_list[["food"]] [1] "cake" "banana"
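It may help to compare single and double brackets side by side. This sketch rebuilds a shortened version of `my_list` so that it runs on its own:

```r
# Shortened copy of the list from the slides, rebuilt so the block is self-contained
my_list <- list(food = c("cake", "banana"), numbers = c(1, 3, 5, 7, 9))

sub_list <- my_list["food"]     # single bracket: a list containing one element
element  <- my_list[["food"]]   # double bracket: the element itself

class(sub_list)   # "list"
class(element)    # "character"

# Double brackets can be chained with ordinary vector indexing
my_list[["numbers"]][2]   # second value of the numbers vector
```

Single brackets always return a list, so they are the right choice for keeping several elements, while `[[` or `$` is needed to work with the contents of one element.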
  • 108. Lists Modifying lists Lists can be easily extended - just add an extra element. > my_list[["new"]] <- c(1,3,5,7) > summary(my_list) Length Class Mode food 2 -none- character numbers 5 -none- numeric note 1 -none- character new 4 -none- numeric
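Elements can also be changed in place or removed. The sketch below uses a shortened copy of `my_list`; assigning `NULL` to an element is the usual way to delete it:

```r
# Shortened copy of the list from the slides
my_list <- list(food = c("cake", "banana"), note = "Something interesting")

my_list[["new"]] <- c(1, 3, 5, 7)   # add an element
my_list$new[2] <- 30                # change one value inside an element
my_list[["note"]] <- NULL           # remove an element

names(my_list)
```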
  • 109. Lists Processing lists lapply - apply the same function to each element in a list. > vec1 <- seq(from=1, to = 10, length = 7) > vec2 <- seq(from=12, to = 20, length = 6) > lst <- list(vec1 = vec1, vec2 = vec2) > lapply(lst, sum) $vec1 [1] 38.5 $vec2 [1] 96
  • 110. Lists Processing lists This only makes sense if the same function can be applied to all elements. For example, if we add a character vector to the list (here named str1), we can't use sum. But length makes sense. > lapply(lst, length) $vec1 [1] 7 $vec2 [1] 6 $str1 [1] 3
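A sketch of the situation described above. The slides show only that `str1` has length 3, so its contents here are an assumption made for the example:

```r
vec1 <- seq(from = 1, to = 10, length.out = 7)
vec2 <- seq(from = 12, to = 20, length.out = 6)
str1 <- c("a", "b", "c")   # assumed contents; only the length (3) is known

lst <- list(vec1 = vec1, vec2 = vec2, str1 = str1)

# sum() fails on the character element; tryCatch() captures the error message
sum_result <- tryCatch(lapply(lst, sum),
                       error = function(e) conditionMessage(e))

# length() works for every element; sapply() simplifies the result to a vector
lengths_vec <- sapply(lst, length)
lengths_vec
```

`sapply()` behaves like `lapply()` but tries to simplify the result, which is handy when each element yields a single number.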
  • 111. Lists - Exercises Exercise 1 If: p <- c(2,7,8), q <- c("A", "B", "C") and x <- list(p, q), then what is the value of x[2]? a. NULL b. "A" "B" "C" c. "7"
  • 112. Lists - Exercises Exercise 2 If: w <- c(2, 7, 8) v <- c("A", "B", "C") x <- list(w, v), then which R statement will replace "A" in x with "K". a. x[[2]] <- "K" b. x[[2]][1] <- "K" c. x[[1]][2] <- "K"
  • 113. Lists - Exercises Exercise 3 If a <- list ("x"=5, "y"=10, "z"=15), which R statement will give the sum of all elements in a? a. sum(a) b. sum(list(a)) c. sum(unlist(a))
  • 114. Lists - Exercises Exercise 4 If Newlist <- list(a=1:10, b="Good morning", c="Hi"), write an R statement that will add 1 to each element of the first vector in Newlist. # Exercise 4 Newlist <- list(a=1:10, b="Good morning", c="Hi") Newlist$a <- Newlist$a + 1 Newlist ## $a ## [1] 2 3 4 5 6 7 8 9 10 11 ## ## $b ## [1] "Good morning" ## ## $c ## [1] "Hi"