R training4

Introduction to Data Analysis and Graphics in R
author: Hellen Gakuruh
date: 2017-03-10
autosize: true
Slide 4: Summarizing Data
Outline
What we shall cover
• Numerical summaries for discrete variables
• Numerical summaries for continuous variables
• Tables for dichotomous variables
• Tables for categorical variables
• Tables for ordinal variables
Introduction
type: section
• A variable is a quantity whose values are not constant (they change)
• Discrete variables take countable values (obtained by counting)
• Continuous variables can take any value within a range (obtained by
measuring)
• Dichotomous variables have two values, like “Yes” and “No” or TRUE and
FALSE
• Categorical variables are qualitative variables whose values are non-numerical
(text) with no ordering, like gender: “Female” and “Male”
Introduction cont.
type: section
• Ordinal variables are qualitative variables whose values are textual
(non-numeric) with a natural ordering, like Likert scales or level of education
• There are two ways to describe a variable: numerically and graphically
• Numerical summaries comprise measures of central tendency, measures
of spread/variability and shape of distribution (the latter is often not
reported but used to guide additional analysis)
• All these variables (discrete, continuous, dichotomous, categorical, and
ordinal) can be described by these measures, but each has its own computation
and presentation

Measures of central Tendency
type: section
• There are three commonly used/reported measures of central tendency
• Mean (arithmetic)
• Median
• Mode
• Mean is average of all values, i.e. sum of observations divided by number
of observations
===============================================================
type: sub-section
• Median is central value when ordered
• Mode is most frequently occurring value
• There are at least three measures of dispersion
– Range
– Interquartile range (IQR)
– Variance and Standard deviation
===============================================================
type: sub-section
• Range is the difference between the maximum and minimum values
• IQR is the range within which the middle 50% of values lie (an order statistic)
• Standard deviation is roughly the average distance of values from the mean.
It is the square root of variance, which is the average squared distance from
the mean.
• A distinction is made between sample and population
• Measures for a population are called population parameters and they are
often unknown
• Measures for a sample are called sample statistics
• Population mean is denoted as \(\mu\) (pronounced as “mu”)
• Sample mean is denoted as \(\bar{x}\) (pronounced as “x bar”)
Computing mean
type: sub-section
• Since mean is the sum of all values divided by the number of values, population
and sample mean can be expressed respectively as:
\[\mu = \frac{\sum X}{N}, \text{ where } X \text{ are values and } N \text{ is number of values}\]
\[\bar{x} = \frac{\sum x}{n}, \text{ where } x \text{ are values and } n \text{ is number of values}\]

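A minimal R sketch of the sample mean formula, using a made-up vector `x` (not from the slides), compared against the built-in `mean()`:

```r
# Hypothetical sample used only for illustration
x <- c(4, 8, 15, 16, 23, 42)

# Sample mean from the definition: sum of values divided by number of values
n <- length(x)
x_bar <- sum(x) / n

# Built-in equivalent
x_bar_builtin <- mean(x)

print(x_bar)          # 18
print(x_bar_builtin)  # 18
```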
Locating median
type: sub-section
• Median depends on whether the number of observations is odd or even
• For an odd number of values, median is the middle value, like 3 in data set
{1,2,3,4,5}
• For an even number of values, median is the average of the two middle values,
like the average of 3 and 4 for data set {1,2,3,4,5,6}, which is 3.5.
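The odd/even rule can be sketched in R as follows. `median_by_hand()` is a hypothetical helper written for illustration; in practice the built-in `median()` does the same job.

```r
# Locate the median by hand: sort, then pick the middle value (odd n)
# or average the two middle values (even n).
median_by_hand <- function(x) {
  x <- sort(x)
  n <- length(x)
  if (n %% 2 == 1) {
    x[(n + 1) / 2]
  } else {
    (x[n / 2] + x[n / 2 + 1]) / 2
  }
}

odd_set  <- c(1, 2, 3, 4, 5)
even_set <- c(1, 2, 3, 4, 5, 6)

median_by_hand(odd_set)   # 3
median_by_hand(even_set)  # 3.5
```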
Determining mode
type: sub-section
• Mode is most frequently occurring value (observation)
• To get the mode, count the number of occurrences of each unique value
(observation) and select the one with the highest number of occurrences
• Number of occurrences is called frequency
• Mode for data set {1, 2, 1, 1, 3, 3} is 1 (it occurs three times)
• Mode is the only measure of central tendency for which a data set can have
0, 1, 2, or more than 2 values (no mode, uni-modal, bi-modal, or multi-modal)
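Base R has no built-in function for the statistical mode (`mode()` returns an object's storage type). A minimal sketch, assuming a hypothetical helper named `modes()` that counts frequencies with `table()` and returns every value tied for the highest count. Note that when all values are equally frequent it returns all of them, which a reader may prefer to report as "no mode".

```r
# Return every value that attains the highest frequency
# (one value if uni-modal, several if bi- or multi-modal).
modes <- function(x) {
  counts <- table(x)                           # frequency of each unique value
  top <- names(counts)[counts == max(counts)]  # values with the top frequency
  # table() stores values as names (character); convert back for numeric input
  if (is.numeric(x)) as.numeric(top) else top
}

modes(c(1, 2, 1, 1, 3, 3))   # 1 (occurs three times)
modes(c(1, 1, 2, 2, 3))      # 1 and 2 (bi-modal)
```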
Standard deviation (SD)
type: sub-section
• Used to determine how spread out values are from their average (mean)
• A small SD means values are clustered around the mean and a big SD
means values are spread out
• Computed by first subtracting the mean from each value to get deviations.
The deviations are squared before summing, as summing the raw deviations
would result in 0. The sum is then divided by the number of values, giving the
variance. Since this is a squared deviation, a square root is taken.
• For samples from populations with unknown parameters, dividing by the
number of observations has been shown to underestimate variance, hence
division by “n-1”, i.e. \(s = \sqrt{\sum(x-\bar{x})^2/(n-1)}\)
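The steps can be traced in R on a small made-up sample (the vector `x` is an assumption for illustration). The result is checked against the built-in `var()` and `sd()`, which also divide by n - 1.

```r
x <- c(2, 4, 4, 4, 5, 5, 7, 9)     # hypothetical sample

n <- length(x)
deviations <- x - mean(x)          # distance of each value from the mean
sum(deviations)                    # 0, which is why deviations are squared

sample_var <- sum(deviations^2) / (n - 1)   # sample variance
sample_sd  <- sqrt(sample_var)              # sample standard deviation

# Built-in functions use the same n - 1 divisor
all.equal(sample_var, var(x))
all.equal(sample_sd, sd(x))
```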
Skewness
type: sub-section
• Skewness measures symmetry of values around their mean
• If values are symmetrical, i.e. the left and right sides of the average are
mirror images, the distribution is said to have “no skewness”
• If the bulk of values is to the left with a trail of values to the right, then it’s
positively skewed
• If the bulk of values is to the right with a trail of values to the left, then
it’s negatively skewed
• Measurement involves balancing values on both sides of the mean; if the
difference is zero, then it’s symmetrical, else +ve or -ve
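One common way to put a number on this is the standardised third central moment. The sketch below is an assumption about how such a measure can be computed; `skew_sketch()` is a hypothetical name, and the `m3_std()` function sourced later in this deck is not shown, so it may differ in detail.

```r
# Standardised third central moment: mean cubed deviation divided by
# the cube of the (divide-by-n) standard deviation.
skew_sketch <- function(x) {
  d  <- x - mean(x)
  m2 <- mean(d^2)   # second central moment
  m3 <- mean(d^3)   # third central moment
  m3 / m2^(3 / 2)
}

skew_sketch(c(1, 2, 3, 4, 5))        # symmetric: 0
skew_sketch(c(1, 1, 2, 2, 3, 10))    # trail to the right: positive
skew_sketch(c(1, 8, 9, 9, 10, 10))   # trail to the left: negative
```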
Kurtosis
type: sub-section
• A measure of tailedness: fat/thin or long/short tails
• Not a measure of “peakedness” as often discussed in older texts
• Reason: the measure gives more weight to values far away from the average,
so it reflects how far, and by how much, tail values lie from the average
• Kurtosis is described as being “mesokurtic”, “leptokurtic” or “platykurtic”.
===============================================================
type: sub-section
• Mesokurtic means tails are similar to those of a normal distribution;
“leptokurtic” means the distribution is “slender” with fatter tails, and it has
a greater kurtosis than a mesokurtic distribution
• Platykurtic means it has a lesser kurtosis than a mesokurtic distribution and
it’s broad with thinner tails
• The normal distribution is taken as the reference, hence kurtosis is measured
relative to it; a normal distribution has a kurtosis of 3
• Kurtosis reported relative to this reference value of 3 (kurtosis minus 3) is
referred to as Excess Kurtosis
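A minimal sketch of excess kurtosis as the standardised fourth central moment minus 3. `excess_kurt_sketch()` is a hypothetical name; the `excess_kurt()` function sourced later in this deck is not shown and may use a different estimator.

```r
# Standardised fourth central moment minus 3 (the reference value)
excess_kurt_sketch <- function(x) {
  d  <- x - mean(x)
  m2 <- mean(d^2)   # second central moment
  m4 <- mean(d^4)   # fourth central moment
  m4 / m2^2 - 3
}

# Evenly spread values, thin tails: negative (platykurtic)
excess_kurt_sketch(c(1, 2, 3, 4, 5))          # -1.3

# Most values at the centre, two far-out values: positive (leptokurtic)
excess_kurt_sketch(c(rep(0, 8), -10, 10))     # 2
```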
Numerical summaries for discrete variables
type: section
• Can be described by mean or median as its average
• If data is skewed, median is appropriate, otherwise compute mean
• If average is mean, then dispersion is reported as standard deviation. If
average is median, then dispersion should be IQR
• Shape of distribution as measured by skewness and kurtosis can inform on
which average (mean or median) to use. It also guides inferential statistics
• Example: Hypothetical random numbers of students’ scores

==============================================================
type: sub-section
# Data
set.seed(4)
scores <- as.integer(round(rnorm(50, 78, 1)))
# Source own function for printing frequency tables
source("~/R/Scripts/desc-statistics.R")
# Frequency table
freq(scores)
  Values Freq Perc
1     76    2    4
2     77    8   16
3     78   19   38
4     79   17   34
5     80    4    8
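The `freq()` helper above is sourced from a script that is not shown in these slides. As a stand-in, a similar table can be built in base R with `table()`. The function name `freq_sketch()` and its rounding are assumptions, not the author's implementation.

```r
# Hypothetical stand-in for the sourced freq() helper
freq_sketch <- function(x) {
  counts <- table(x)                      # frequency of each unique value
  data.frame(
    Values = names(counts),
    Freq   = as.vector(counts),
    Perc   = round(100 * as.vector(counts) / length(x))
  )
}

tab_example <- freq_sketch(c("b", "a", "b", "c", "b"))
tab_example
```

Calling `freq_sketch(scores)` on the vector created above should give a table of the same shape as the one shown.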
===============================================================
type: sub-section
# Mean
mean(scores)
[1] 78.26
# Median
median(scores)
[1] 78
# Range
cat("Range for this distribution is", diff(range(scores)), paste0("(", paste(range(scores),
Range for this distribution is 4 (76, 80)
===============================================================
type: sub-section
# Where 50% of values lie
cat("50% of values lie between score of about", round(quantile(scores, 0.25)), "and", paste0
50% of values lie between score of about 78 and 79: an IQR of about 1
# Standard deviation (spread of values around mean)
sd(scores)
[1] 0.964894
===============================================================
type: sub-section
# Functions developed to measure and interpret skewness and kurtosis
source("~/R/Scripts/skewness-kurtosis-fun.R")
# Skewness
m3_std(scores)
[1] -0.2551918
skewness_interpreter(m3_std(scores))
[1] "approximately symmetric"
# Kurtosis
excess_kurt(scores)
[1] -0.365273
excess_interpreter(excess_kurt(scores))
[1] "approximately mesokurtic"
Conclusion (discrete numerical measures)
type: sub-section
• From skewness and kurtosis we can tell this data set is roughly symmetric
around its mean, hence mean is an appropriate representative value (a
value to describe data)
• Since mean is our representative value, standard deviation is the
appropriate measure of dispersion
• SD of 0.964894 indicates values are not widely dispersed (typically within
about one score of the mean)
• Display-wise, we expect to see an almost symmetric distribution
Numerical summaries for continuous variables
type: section
• Continuous variables have the same numerical summaries as discrete
variables
• The exception is how to locate the mode, since a continuous variable can take
an infinite number of values within a range
• Locating the mode then involves grouping values into useful intervals,
sometimes called breaks. This process is called “discretization”
• The number of breaks usually ranges from 2 to 10, most often about five (the
data determines)
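Discretization and the modal class can be sketched in base R with `cut()` and `table()`. The vector `x` and the half-unit breaks are assumptions chosen for illustration.

```r
# Hypothetical measurements used only for illustration
x <- c(4.6, 5.1, 5.2, 5.3, 5.4, 5.8, 6.1, 6.2, 6.7, 7.0)

# Group values into half-unit intervals; each interval is closed on the right
breaks  <- seq(4.5, 7.0, by = 0.5)
classes <- cut(x, breaks = breaks)

class_counts <- table(classes)   # frequency of each interval
class_counts

# Modal class: the interval holding the most values
modal_class <- names(class_counts)[which.max(class_counts)]
modal_class   # "(5,5.5]"
```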
============================================================
type: sub-section
• Example data: Random hypothetical sample of human height in feet
# Example data
set.seed(4)
height <- round(rnorm(50, 5.4), 2)
sort(height)
[1] 3.60 3.71 3.92 4.12 4.47 4.54 4.58 4.65 4.76 4.86 4.93 5.00 5.02 5.12
[15] 5.12 5.17 5.19 5.30 5.35 5.36 5.42 5.43 5.50 5.55 5.57 5.57 5.58 5.62
[29] 5.78 5.97 5.99 6.00 6.09 6.12 6.26 6.29 6.31 6.33 6.45 6.57 6.64 6.66
[43] 6.69 6.69 6.71 6.74 6.94 7.04 7.18 7.30
===============================================================
type: sub-section
# Average
mean(height)
[1] 5.6352
median(height)
[1] 5.57
# Dispersion
sd(height)
[1] 0.9184931
diff(range(height)); range(height)
[1] 3.7
[1] 3.6 7.3
===============================================================
type: sub-section
IQR(height)
[1] 1.28
# Modal Class (interval)
tab <- freq_continuous(height)
as.vector(tab[which.max(tab$Perc), 1])
[1] "(5,5.5]"
# Functions for generating frequency tables
freq_continuous(height)
   Values Freq Perc
1 (3.5,4]    3    6
2 (4,4.5]    2    4
3 (4.5,5]    7   14
4 (5,5.5]   11   22
5 (5.5,6]    9   18
6 (6,6.5]    7   14
7 (6.5,7]    8   16
8 (7,7.5]    3    6
============================================================
type: sub-section
# Skewness
m3_std(height)
[1] -0.2186212
skewness_interpreter(m3_std(height))
[1] "approximately symmetric"
# Kurtosis
excess_kurt(height)
[1] -0.7024805
excess_interpreter(excess_kurt(height))
[1] "moderately platykurtic"
Tables for dichotomous variables
type: section
• Have two values e.g. “Yes” & “No”
• Best presented in frequency tables
set.seed(4)
dichot <- sample(c("Yes", "No"), 100, replace = TRUE)
freq(dichot)
  Values Freq Perc
1     No   57   57
2    Yes   43   43
Tables for categorical variables
type: section
• Just like dichotomous variables (which are categorical), these can be
displayed in a frequency table if univariate and contingency tables for
bi-variate relationships
==========================================================
type: sub-section
groups <- rep(c("a", "b", "c"), 200)
set.seed(4)
outcome <- sample(c("improved", "same", "decreased"), length(groups), replace = TRUE, prob =
freq(groups)
  Values Freq Perc
1      a  200   33
2      b  200   33
3      c  200   33
freq(outcome)
     Values Freq Perc
1 decreased   66   11
2  improved  418   70
3      same  116   19
Contingency table
type: sub-section
source("~/R/Scripts/desc-statistics.R")
contigency_tab(groups, outcome)
      outcome
groups decreased perc improved perc same perc
     a        22   33      136   33   42   36
     b        23   35      140   33   37   32
     c        21   32      142   34   37   32
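The percentages in the table above appear to be column percentages (for example, 22 of the 66 "decreased" outcomes is 33%). Since `contigency_tab()` comes from a script that is not shown, here is a base-R sketch of the same idea using `table()` and `prop.table()`. The vectors `grp` and `res` are made-up data.

```r
# Hypothetical small data set
grp <- c("a", "a", "a", "b", "b", "b", "c", "c")
res <- c("improved", "same", "improved", "improved",
         "decreased", "improved", "same", "improved")

counts <- table(grp, res)   # cross-tabulated counts
counts

# Percentages within each outcome column (margin = 2), rounded
col_perc <- round(100 * prop.table(counts, margin = 2))
col_perc
```

Using `margin = 1` instead would give percentages within each group (row), which answers a different question: how outcomes are distributed inside each group.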