Introduction to Data Analysis and Graphics in R
author: Hellen Gakuruh
date: 2017-03-10
autosize: true
Slide 4: Summarizing Data
Outline
What we shall cover
• Numerical summaries for discrete variables
• Numerical summaries for continuous variables
• Tables for dichotomous variables
• Tables for categorical variables
• Tables for ordinal variables
Introduction
type: section
• A variable is a quantity whose values are not constant (they change)
• Discrete variables take countable values (obtained by counting)
• Continuous variables can take any value within a range (obtained by
measuring)
• A dichotomous variable has two values, like “Yes” and “No” or TRUE and
FALSE
• Categorical variables are qualitative variables whose values are
non-numerical (text) with no ordering, like gender: “Female” and “Male”
Introduction cont.
type: section
• Ordinal variables are qualitative variables whose values are textual
(non-numeric) with a natural ordering, like Likert scales or level of education
• There are two ways to describe a variable: numerically and graphically
• Numerical summaries comprise measures of central tendency, measures
of spread/variability and shape of distribution (the latter is often not reported,
but is used to guide additional analysis)
• All these variables (discrete, continuous, dichotomous, categorical, and
ordinal) can be described by these measures, but each has its own computation
and presentation
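As a rough illustration of how these variable types are commonly stored in R, here is a minimal sketch. All values and object names below are made up for this example and do not come from the slides.

```r
# Hypothetical example values, chosen only for illustration
children  <- c(2L, 0L, 3L, 1L)                      # discrete: obtained by counting
height_cm <- c(162.5, 171.2, 158.0, 180.3)          # continuous: obtained by measuring
employed  <- c(TRUE, FALSE, TRUE, TRUE)             # dichotomous: two possible values
gender    <- factor(c("Female", "Male", "Female", "Male"))   # categorical: no ordering
education <- factor(c("Primary", "Tertiary", "Secondary", "Primary"),
                    levels = c("Primary", "Secondary", "Tertiary"),
                    ordered = TRUE)                 # ordinal: levels have a natural order

class(children)               # "integer"
class(height_cm)              # "numeric"
class(employed)               # "logical"
is.ordered(gender)            # FALSE
is.ordered(education)         # TRUE
education[1] < education[2]   # TRUE: "Primary" ranks below "Tertiary"
```

Storing ordinal data as an ordered factor keeps the level order available to functions such as table() and sort().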
Measures of central tendency
type: section
• There are three most often used/reported measures of central tendency
• Mean (arithmetic)
• Median
• Mode
• Mean is average of all values, i.e. sum of observations divided by number
of observations
===============================================================
type: sub-section
• Median is central value when ordered
• Mode is most frequently occurring value
• There are at least three measures of dispersion
– Range
– Interquartile range (IQR)
– Variance and Standard deviation
===============================================================
type: sub-section
• Range is the difference between the maximum and minimum values
• IQR is the range within which the middle 50% of values lie (an order statistic)
• Standard deviation is roughly the average distance of values from the mean. It is
computed as the square root of variance, which is the average squared distance from the mean.
• Distinction is made between sample and population
• Measures for a population are called population parameters and they are
often unknown
• Measures for a sample are called sample statistics
• Population mean is denoted as \(\mu\) (pronounced as “mu”)
• Sample mean is denoted as \(\bar{x}\) (pronounced as “x bar”)
Computing mean
=========================================================
type: sub-section
• Since mean is the sum of all values divided by the number of values, population
and sample mean can be expressed respectively as:
\[\mu = \frac{\sum{X}}{N}\] where X are values and N is number of values
\[\bar{x} = \frac{\sum{x}}{n}\] where x are values and n is number of values
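The sample mean formula above can be sketched directly in R and compared with the built-in mean(). The vector x is a made-up sample used only for illustration.

```r
x <- c(1, 2, 3, 4, 5)    # small made-up sample
n <- length(x)           # number of values
x_bar <- sum(x) / n      # sum of values divided by number of values
x_bar                    # 3
mean(x)                  # 3, the built-in function gives the same result
```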
Locating median
type: sub-section
• Locating the median depends on whether the number of observations is odd or even
• For an odd number of values, median is the middle value, like 3 in data set
{1,2,3,4,5}
• For an even number of values, median is the average of the two middle values, like
the average of 3 and 4 for data set {1,2,3,4,5,6}, which is 3.5.
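The two cases can be sketched as a small function. locate_median is a hypothetical helper written for this illustration; in practice the built-in median() does the same job.

```r
# Hypothetical helper (not part of base R): locate the median by hand
locate_median <- function(x) {
  x <- sort(x)                       # median is defined on ordered values
  n <- length(x)
  if (n %% 2 == 1) {
    x[(n + 1) / 2]                   # odd n: the single middle value
  } else {
    mean(x[c(n / 2, n / 2 + 1)])     # even n: average of the two middle values
  }
}

locate_median(c(1, 2, 3, 4, 5))      # 3
locate_median(c(1, 2, 3, 4, 5, 6))   # 3.5
```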
Determining mode
type: sub-section
• Mode is most frequently occurring value (observation)
• To get mode, count the number of occurrences of each unique value (observation)
and select the one with the most occurrences
• Number of occurrences is called frequency
• Mode for data set {1, 2, 1, 1, 3, 3} is 1 (it occurs three times)
• Mode is the only measure of central tendency for which a data set can have 0, 1, 2
or more than 2 values (no mode, uni-modal, bi-modal, or multi-modal)
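Base R has no function for the statistical mode (its mode() returns an object's storage mode), so the counting procedure can be sketched by hand. find_modes is a hypothetical helper, and treating "all values equally frequent" as "no mode" is a convention assumed here, not something the slides specify.

```r
# Hypothetical helper: return every value that has the highest frequency
find_modes <- function(x) {
  values <- sort(unique(x))
  counts <- tabulate(match(x, values))   # frequency of each unique value
  if (length(values) > 1 && all(counts == counts[1])) {
    return(values[0])                    # assumed convention: equal frequencies mean no mode
  }
  values[counts == max(counts)]
}

find_modes(c(2, 4, 4, 5, 7, 4))   # 4      (uni-modal)
find_modes(c(1, 1, 2, 2, 3))      # 1 2    (bi-modal)
find_modes(c(1, 2, 3))            # numeric(0), no mode
```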
Standard deviation (SD)
type: sub-section
• Used to determine how spread out values are from their average (mean)
• A small SD means values are clustered around their mean and a big SD
means values are spread out
• Computed by first subtracting the mean from each value to get deviations. The
deviations are squared before summing, since summing raw deviations would
result in 0. The sum of squared deviations is divided by the number of values to
give the variance. Since variance is in squared units, its square root is taken to give SD.
• For samples, where population parameters are unknown, dividing by the number
of observations has been shown to underestimate variance, hence the sum is divided
by “n-1”, i.e. \(s = \sqrt{\sum(x-\bar{x})^2/(n-1)}\)
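The steps above can be sketched in R as follows. The vector x is made up for illustration; note that the built-in sd() uses the "n-1" divisor.

```r
x <- c(2, 4, 4, 4, 5, 5, 7, 9)       # made-up values with mean 5
n <- length(x)

deviations <- x - mean(x)            # distance of each value from the mean
sum(deviations)                      # 0: raw deviations cancel out, hence the squaring

pop_var <- sum(deviations^2) / n     # population variance: divide by n
pop_sd  <- sqrt(pop_var)             # population SD: 2

samp_sd <- sqrt(sum(deviations^2) / (n - 1))   # sample SD: divide by n - 1
samp_sd                              # slightly larger than pop_sd
sd(x)                                # same as samp_sd
```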
Skewness
type: sub-section
• Skewness measures symmetry of values around their mean
• If values are symmetrical (the left and right sides of the average are mirror
images), the distribution is said to have “no skewness”
• If bulk of values is to the left with a trail of values to the right, then it’s
positively skewed
• If bulk of values is to the right with a trail of values to the left, then
it’s negatively skewed
• Measurement involves balancing values on both sides of the mean; if the
difference is zero, then it’s symmetrical, else +ve or -ve
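One common way to put a number on this balance is the standardized third moment: the average cubed deviation divided by the cubed standard deviation. The function below is a hypothetical sketch, not the author's m3_std() (whose source is not shown in these slides), and the small data sets are made up.

```r
# Hypothetical sketch of a standardized third moment
skew_sketch <- function(x) {
  mean((x - mean(x))^3) / sd(x)^3   # cubing keeps the sign of each deviation
}

skew_sketch(c(1, 2, 3, 4, 5))        # 0: symmetric
skew_sketch(c(1, 1, 2, 2, 3, 10))    # positive: trail of values to the right
skew_sketch(c(-10, -3, -2, -2, -1, -1))  # negative: trail of values to the left

# Scores rebuilt from the frequency table shown later in these slides
scores_tab <- rep(c(76, 77, 78, 79, 80), times = c(2, 8, 19, 17, 4))
skew_sketch(scores_tab)
```

Applied to the rebuilt scores, this definition gives about -0.255, which agrees with the skewness reported for the scores example in the slides.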
Kurtosis
type: sub-section
• A measure of tailedness: fat/thin or long/short tails
• Not a measure of “peakedness” as often discussed in older texts
• Reason: the measure gives more weight to values far away from the average, so
it mainly reflects how far, and how often, values fall from the average
• Kurtosis is described as “mesokurtic”, “leptokurtic” or “platykurtic”.
===============================================================
type: sub-section
• Mesokurtic means tails are similar to those of a normal distribution; “leptokurtic”
means it is “slender” and has fatter tails, with a greater kurtosis than a
mesokurtic (normal) distribution
• Platykurtic means it has a lesser kurtosis than a normal distribution and
it’s broad with thinner tails
• The normal distribution is taken as the reference, hence kurtosis is measured
relative to it; the normal distribution has a kurtosis of 3
• Kurtosis reported as the difference from this reference value of 3 is referred to
as excess kurtosis
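Excess kurtosis can be sketched as the standardized fourth moment minus the reference value of 3. The function below is a hypothetical illustration, not the author's excess_kurt() (whose source is not shown), and the small data sets are made up.

```r
# Hypothetical sketch of excess kurtosis
excess_kurt_sketch <- function(x) {
  mean((x - mean(x))^4) / sd(x)^4 - 3   # fourth power weights far-away values heavily
}

heavy_tails <- c(rep(0, 18), -10, 10)   # most values at the centre, two far out
flat        <- 1:10                     # evenly spread values, no long tails
excess_kurt_sketch(heavy_tails)         # positive: leptokurtic
excess_kurt_sketch(flat)                # negative: platykurtic

# Scores rebuilt from the frequency table shown later in these slides
scores_tab <- rep(c(76, 77, 78, 79, 80), times = c(2, 8, 19, 17, 4))
excess_kurt_sketch(scores_tab)
```

Applied to the rebuilt scores, this definition gives about -0.365, which agrees with the excess kurtosis reported for the scores example in the slides.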
Numerical summaries for discrete variables
type: section
• Can be described by mean or median as its average
• If data is skewed, median is appropriate, otherwise compute mean
• If average is mean, then dispersion is reported as standard deviation. If
average is median, then dispersion should be IQR
• Shape of distribution as measured by skewness and kurtosis can inform on
which average (mean or median) to use. It also guides inferential statistics
• Example: Hypothetical random sample of students' scores
==============================================================
type: sub-section
# Data
set.seed(4)
scores <- as.integer(round(rnorm(50, 78, 1)))
# Source own function for printing frequency tables
source("~/R/Scripts/desc-statistics.R")
# Frequency table
freq(scores)
Values Freq Perc
1 76 2 4
2 77 8 16
3 78 19 38
4 79 17 34
5 80 4 8
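The freq() function is sourced from the author's own script, which is not shown. A minimal stand-in with the same column layout could look like the sketch below; freq_sketch is a hypothetical name, and the scores are rebuilt from the table above rather than regenerated with rnorm().

```r
# Hypothetical stand-in for the author's freq(); the real implementation
# lives in desc-statistics.R and may differ
freq_sketch <- function(x) {
  counts <- table(x)                          # frequency of each unique value
  data.frame(
    Values = names(counts),
    Freq   = as.vector(counts),
    Perc   = round(100 * as.vector(counts) / length(x)),
    stringsAsFactors = FALSE
  )
}

# Scores rebuilt from the frequency table above
scores_tab <- rep(c(76, 77, 78, 79, 80), times = c(2, 8, 19, 17, 4))
freq_out <- freq_sketch(scores_tab)
freq_out
```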
===============================================================
type: sub-section
# Mean
mean(scores)
[1] 78.26
# Median
median(scores)
[1] 78
# Range
cat("Range for this distribution is", diff(range(scores)), paste0("(", paste(range(scores), collapse = ", "), ")"))
Range for this distribution is 4 (76, 80)
===============================================================
type: sub-section
# Where 50% of values lie
cat("50% of values lie between score of about", round(quantile(scores, 0.25)), "and", paste0(round(quantile(scores, 0.75)), ":"), "an IQR of about", round(IQR(scores)))
50% of values lie between score of about 78 and 79: an IQR of about 1
# Standard deviation (spread of values around mean)
sd(scores)
[1] 0.964894
===============================================================
type: sub-section
# Functions developed to measure and interpret skewness and kurtosis
source("~/R/Scripts/skewness-kurtosis-fun.R")
# Skewness
m3_std(scores)
[1] -0.2551918
skewness_interpreter(m3_std(scores))
[1] "approximately symmetric"
# Kurtosis
excess_kurt(scores)
[1] -0.365273
excess_interpreter(excess_kurt(scores))
[1] "approximately mesokurtic"
Conclusion (discrete numerical measures)
type: sub-section
• From skewness and kurtosis we can tell this data set is almost symmetric
around its mean, hence mean is an appropriate representative value (a
value to describe data)
• Since mean is our representative value, then standard deviation is the
appropriate measure for dispersion
• SD of 0.964894 indicates values are clustered close to the mean
• Display-wise, we expect to see an almost symmetric distribution
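The decision rule used in this conclusion (mean with SD when roughly symmetric, median with IQR when skewed) can be sketched as a small helper. choose_summary is a hypothetical function, and the 0.5 cutoff is an assumed rule of thumb rather than a value taken from the slides.

```r
# Hypothetical helper; the cutoff of 0.5 is an assumption, not from the slides
choose_summary <- function(x, cutoff = 0.5) {
  skew <- mean((x - mean(x))^3) / sd(x)^3     # standardized third moment
  if (abs(skew) < cutoff) {
    list(average = "mean", value = mean(x), dispersion = "sd", spread = sd(x))
  } else {
    list(average = "median", value = median(x), dispersion = "IQR", spread = IQR(x))
  }
}

# Scores rebuilt from the frequency table shown earlier
scores_tab <- rep(c(76, 77, 78, 79, 80), times = c(2, 8, 19, 17, 4))
near_symmetric <- choose_summary(scores_tab)      # picks mean and SD
skewed <- choose_summary(c(1, 1, 2, 2, 3, 40))    # picks median and IQR
```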
Numerical summaries for continuous variables
type: section
• Continuous variables have the same numerical summaries as discrete
variables
• Exception is how to locate their mode, since a continuous variable can take an
infinite number of values within a range
• Locating the mode then involves grouping values into useful intervals (classes),
whose boundaries are called breaks. This process is called “discretization”
• Number of breaks can range from 2 to 10, most often five (the data
determines the choice)
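Discretization can be sketched with base R's cut(), which assigns each value to a right-closed interval. The values and the choice of breaks below are made up for illustration and are not the author's freq_continuous() function.

```r
x <- c(4.6, 5.1, 5.2, 5.3, 5.4, 5.8, 6.1, 6.7)   # made-up continuous values
breaks <- seq(4.5, 7, by = 0.5)                    # assumed interval boundaries

classes <- cut(x, breaks = breaks)     # intervals of the form (a, b]
class_counts <- table(classes)         # frequency of each interval
class_counts

modal_class <- names(class_counts)[which.max(class_counts)]
modal_class                            # "(5,5.5]": the interval with most values
```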
============================================================
type: sub-section
• Example data: Random hypothetical sample of human height in feet
# Example data
set.seed(4)
height <- round(rnorm(50, 5.4), 2)
sort(height)
[1] 3.60 3.71 3.92 4.12 4.47 4.54 4.58 4.65 4.76 4.86 4.93 5.00 5.02 5.12
[15] 5.12 5.17 5.19 5.30 5.35 5.36 5.42 5.43 5.50 5.55 5.57 5.57 5.58 5.62
[29] 5.78 5.97 5.99 6.00 6.09 6.12 6.26 6.29 6.31 6.33 6.45 6.57 6.64 6.66
[43] 6.69 6.69 6.71 6.74 6.94 7.04 7.18 7.30
===============================================================
type: sub-section
# Average
mean(height)
[1] 5.6352
median(height)
[1] 5.57
# Dispersion
sd(height)
[1] 0.9184931
diff(range(height)); range(height)
[1] 3.7
[1] 3.6 7.3
===============================================================
type: sub-section
IQR(height)
[1] 1.28
# Modal Class (interval)
tab <- freq_continuous(height)
as.vector(tab[which.max(tab$Perc), 1])
[1] "(5,5.5]"
# Functions for generating frequency tables
freq_continuous(height)
Values Freq Perc
1 (3.5,4] 3 6
2 (4,4.5] 2 4
3 (4.5,5] 7 14
4 (5,5.5] 11 22
5 (5.5,6] 9 18
6 (6,6.5] 7 14
7 (6.5,7] 8 16
8 (7,7.5] 3 6
============================================================
type: sub-section
# Skewness
m3_std(height)
[1] -0.2186212
skewness_interpreter(m3_std(height))
[1] "approximately symmetric"
# Kurtosis
excess_kurt(height)
[1] -0.7024805
excess_interpreter(excess_kurt(height))
[1] "moderately platykurtic"
Tables for dichotomous variables
type: section
• Have two values e.g. “Yes” & “No”
• Best presented in frequency tables
set.seed(4)
dichot <- sample(c("Yes", "No"), 100, replace = TRUE)
freq(dichot)
Values Freq Perc
1 No 57 57
2 Yes 43 43
Tables for categorical variables
type: section
• Just like dichotomous variables (which are categorical), these can be
displayed in a frequency table if univariate and contingency tables for
bi-variate relationships
==========================================================
type: sub-section
groups <- rep(c("a", "b", "c"), 200)
set.seed(4)
outcome <- sample(c("improved", "same", "decreased"), length(groups), replace = TRUE, prob =
freq(groups)
Values Freq Perc
1 a 200 33
2 b 200 33
3 c 200 33
freq(outcome)
Values Freq Perc
1 decreased 66 11
2 improved 418 70
3 same 116 19
Contingency table
type: sub-section
source("~/R/Scripts/desc-statistics.R")
contigency_tab(groups, outcome)
        outcome
groups  decreased perc improved perc same perc
a              22   33      136   33   42   36
b              23   35      140   33   37   32
c              21   32      142   34   37   32
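The "perc" columns above appear to be column percentages (each count as a share of its outcome total). A sketch with base R's prop.table() reproduces them from the counts; this is an illustration, not the author's contigency_tab() function, and the counts are copied from the table above.

```r
# Counts copied from the contingency table above
counts <- matrix(c(22, 136, 42,
                   23, 140, 37,
                   21, 142, 37),
                 nrow = 3, byrow = TRUE,
                 dimnames = list(groups  = c("a", "b", "c"),
                                 outcome = c("decreased", "improved", "same")))
tab <- as.table(counts)   # with raw vectors this would be table(groups, outcome)

col_perc <- round(100 * prop.table(tab, margin = 2))   # percentages within each outcome
col_perc

addmargins(tab)           # counts with row and column totals added
```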
9

Más contenido relacionado

La actualidad más candente

Descriptive statistics
Descriptive statisticsDescriptive statistics
Descriptive statisticsmercy rani
 
descriptive and inferential statistics
descriptive and inferential statisticsdescriptive and inferential statistics
descriptive and inferential statisticsMona Sajid
 
Descriptive Statistics and Data Visualization
Descriptive Statistics and Data VisualizationDescriptive Statistics and Data Visualization
Descriptive Statistics and Data VisualizationDouglas Joubert
 
2. chapter ii(analyz)
2. chapter ii(analyz)2. chapter ii(analyz)
2. chapter ii(analyz)Chhom Karath
 
Descriptive statistics
Descriptive statisticsDescriptive statistics
Descriptive statisticsSarfraz Ahmad
 
Biostatistics basics-biostatistics4734
Biostatistics basics-biostatistics4734Biostatistics basics-biostatistics4734
Biostatistics basics-biostatistics4734AbhishekDas15
 
Descriptive statistics
Descriptive statisticsDescriptive statistics
Descriptive statisticsBurak Mızrak
 
Descriptive Statistics - Thiyagu K
Descriptive Statistics - Thiyagu KDescriptive Statistics - Thiyagu K
Descriptive Statistics - Thiyagu KThiyagu K
 
Frequency Measures for Healthcare Professioanls
Frequency Measures for Healthcare ProfessioanlsFrequency Measures for Healthcare Professioanls
Frequency Measures for Healthcare Professioanlsalberpaules
 
Descriptive Analysis in Statistics
Descriptive Analysis in StatisticsDescriptive Analysis in Statistics
Descriptive Analysis in StatisticsAzmi Mohd Tamil
 
Descriptive statistics
Descriptive statisticsDescriptive statistics
Descriptive statisticsMmedsc Hahm
 
Statistics for machine learning shifa noorulain
Statistics for machine learning   shifa noorulainStatistics for machine learning   shifa noorulain
Statistics for machine learning shifa noorulainShifaNoorUlAin1
 
Introduction to statistics...ppt rahul
Introduction to statistics...ppt rahulIntroduction to statistics...ppt rahul
Introduction to statistics...ppt rahulRahul Dhaker
 

La actualidad más candente (17)

Descriptive statistics ppt
Descriptive statistics pptDescriptive statistics ppt
Descriptive statistics ppt
 
Descriptive statistics
Descriptive statisticsDescriptive statistics
Descriptive statistics
 
descriptive and inferential statistics
descriptive and inferential statisticsdescriptive and inferential statistics
descriptive and inferential statistics
 
Descriptive Statistics and Data Visualization
Descriptive Statistics and Data VisualizationDescriptive Statistics and Data Visualization
Descriptive Statistics and Data Visualization
 
statistics in nursing
 statistics in nursing statistics in nursing
statistics in nursing
 
2. chapter ii(analyz)
2. chapter ii(analyz)2. chapter ii(analyz)
2. chapter ii(analyz)
 
Descriptive statistics
Descriptive statisticsDescriptive statistics
Descriptive statistics
 
Biostatistics basics-biostatistics4734
Biostatistics basics-biostatistics4734Biostatistics basics-biostatistics4734
Biostatistics basics-biostatistics4734
 
Descriptive statistics
Descriptive statisticsDescriptive statistics
Descriptive statistics
 
Descriptive statistics -review(2)
Descriptive statistics -review(2)Descriptive statistics -review(2)
Descriptive statistics -review(2)
 
Descriptive Statistics - Thiyagu K
Descriptive Statistics - Thiyagu KDescriptive Statistics - Thiyagu K
Descriptive Statistics - Thiyagu K
 
Descriptive statistics ii
Descriptive statistics iiDescriptive statistics ii
Descriptive statistics ii
 
Frequency Measures for Healthcare Professioanls
Frequency Measures for Healthcare ProfessioanlsFrequency Measures for Healthcare Professioanls
Frequency Measures for Healthcare Professioanls
 
Descriptive Analysis in Statistics
Descriptive Analysis in StatisticsDescriptive Analysis in Statistics
Descriptive Analysis in Statistics
 
Descriptive statistics
Descriptive statisticsDescriptive statistics
Descriptive statistics
 
Statistics for machine learning shifa noorulain
Statistics for machine learning   shifa noorulainStatistics for machine learning   shifa noorulain
Statistics for machine learning shifa noorulain
 
Introduction to statistics...ppt rahul
Introduction to statistics...ppt rahulIntroduction to statistics...ppt rahul
Introduction to statistics...ppt rahul
 

Similar a R training4

Descriptive Statistics.pptx
Descriptive Statistics.pptxDescriptive Statistics.pptx
Descriptive Statistics.pptxtest215275
 
Ch5-quantitative-data analysis.pptx
Ch5-quantitative-data analysis.pptxCh5-quantitative-data analysis.pptx
Ch5-quantitative-data analysis.pptxzerihunnana
 
descriptive data analysis
 descriptive data analysis descriptive data analysis
descriptive data analysisgnanasarita1
 
3. Statistical Analysis.pptx
3. Statistical Analysis.pptx3. Statistical Analysis.pptx
3. Statistical Analysis.pptxjeyanthisivakumar
 
Chapter 12 Data Analysis Descriptive Methods and Index Numbers
Chapter 12 Data Analysis Descriptive Methods and Index NumbersChapter 12 Data Analysis Descriptive Methods and Index Numbers
Chapter 12 Data Analysis Descriptive Methods and Index NumbersInternational advisers
 
Biostatistics mean median mode unit 1.pptx
Biostatistics mean median mode unit 1.pptxBiostatistics mean median mode unit 1.pptx
Biostatistics mean median mode unit 1.pptxSailajaReddyGunnam
 
Statistical Methods in Research
Statistical Methods in ResearchStatistical Methods in Research
Statistical Methods in ResearchManoj Sharma
 
Ch2 Data Description
Ch2 Data DescriptionCh2 Data Description
Ch2 Data DescriptionFarhan Alfin
 
Lect 3 background mathematics
Lect 3 background mathematicsLect 3 background mathematics
Lect 3 background mathematicshktripathy
 
Measures of central tendancy
Measures of central tendancy Measures of central tendancy
Measures of central tendancy Pranav Krishna
 
State presentation2
State presentation2State presentation2
State presentation2Lata Bhatta
 
Basic Statistical Descriptions of Data.pptx
Basic Statistical Descriptions of Data.pptxBasic Statistical Descriptions of Data.pptx
Basic Statistical Descriptions of Data.pptxAnusuya123
 
MEASURE-OF-VARIABILITY- for students. Ppt
MEASURE-OF-VARIABILITY- for students. PptMEASURE-OF-VARIABILITY- for students. Ppt
MEASURE-OF-VARIABILITY- for students. PptPrincessjaynoviaKali
 
measures of central tendency in statistics which is essential for business ma...
measures of central tendency in statistics which is essential for business ma...measures of central tendency in statistics which is essential for business ma...
measures of central tendency in statistics which is essential for business ma...SoujanyaLk1
 

Similar a R training4 (20)

Descriptive Statistics.pptx
Descriptive Statistics.pptxDescriptive Statistics.pptx
Descriptive Statistics.pptx
 
Ch5-quantitative-data analysis.pptx
Ch5-quantitative-data analysis.pptxCh5-quantitative-data analysis.pptx
Ch5-quantitative-data analysis.pptx
 
Statr sessions 4 to 6
Statr sessions 4 to 6Statr sessions 4 to 6
Statr sessions 4 to 6
 
descriptive data analysis
 descriptive data analysis descriptive data analysis
descriptive data analysis
 
3. Statistical Analysis.pptx
3. Statistical Analysis.pptx3. Statistical Analysis.pptx
3. Statistical Analysis.pptx
 
Statistics
StatisticsStatistics
Statistics
 
Descriptive Analysis.pptx
Descriptive Analysis.pptxDescriptive Analysis.pptx
Descriptive Analysis.pptx
 
Basic statisctis -Anandh Shankar
Basic statisctis -Anandh ShankarBasic statisctis -Anandh Shankar
Basic statisctis -Anandh Shankar
 
Chapter 12 Data Analysis Descriptive Methods and Index Numbers
Chapter 12 Data Analysis Descriptive Methods and Index NumbersChapter 12 Data Analysis Descriptive Methods and Index Numbers
Chapter 12 Data Analysis Descriptive Methods and Index Numbers
 
Biostatistics mean median mode unit 1.pptx
Biostatistics mean median mode unit 1.pptxBiostatistics mean median mode unit 1.pptx
Biostatistics mean median mode unit 1.pptx
 
Statistical Methods in Research
Statistical Methods in ResearchStatistical Methods in Research
Statistical Methods in Research
 
Ch2 Data Description
Ch2 Data DescriptionCh2 Data Description
Ch2 Data Description
 
determinatiion of
determinatiion of determinatiion of
determinatiion of
 
Lect 3 background mathematics
Lect 3 background mathematicsLect 3 background mathematics
Lect 3 background mathematics
 
Measures of central tendancy
Measures of central tendancy Measures of central tendancy
Measures of central tendancy
 
1 introduction to psychological statistics
1 introduction to psychological statistics1 introduction to psychological statistics
1 introduction to psychological statistics
 
State presentation2
State presentation2State presentation2
State presentation2
 
Basic Statistical Descriptions of Data.pptx
Basic Statistical Descriptions of Data.pptxBasic Statistical Descriptions of Data.pptx
Basic Statistical Descriptions of Data.pptx
 
MEASURE-OF-VARIABILITY- for students. Ppt
MEASURE-OF-VARIABILITY- for students. PptMEASURE-OF-VARIABILITY- for students. Ppt
MEASURE-OF-VARIABILITY- for students. Ppt
 
measures of central tendency in statistics which is essential for business ma...
measures of central tendency in statistics which is essential for business ma...measures of central tendency in statistics which is essential for business ma...
measures of central tendency in statistics which is essential for business ma...
 

Más de Hellen Gakuruh

Prelude to level_three
Prelude to level_threePrelude to level_three
Prelude to level_threeHellen Gakuruh
 
SessionThree_IntroductionToVersionControlSystems
SessionThree_IntroductionToVersionControlSystemsSessionThree_IntroductionToVersionControlSystems
SessionThree_IntroductionToVersionControlSystemsHellen Gakuruh
 
Introduction_to_Regular_Expressions_in_R
Introduction_to_Regular_Expressions_in_RIntroduction_to_Regular_Expressions_in_R
Introduction_to_Regular_Expressions_in_RHellen Gakuruh
 
SessionTen_CaseStudies
SessionTen_CaseStudiesSessionTen_CaseStudies
SessionTen_CaseStudiesHellen Gakuruh
 
SessionNine_HowandWheretoGetHelp
SessionNine_HowandWheretoGetHelpSessionNine_HowandWheretoGetHelp
SessionNine_HowandWheretoGetHelpHellen Gakuruh
 
SessionEight_PlottingInBaseR
SessionEight_PlottingInBaseRSessionEight_PlottingInBaseR
SessionEight_PlottingInBaseRHellen Gakuruh
 
SessionSeven_WorkingWithDatesandTime
SessionSeven_WorkingWithDatesandTimeSessionSeven_WorkingWithDatesandTime
SessionSeven_WorkingWithDatesandTimeHellen Gakuruh
 
SessionSix_TransformingManipulatingDataObjects
SessionSix_TransformingManipulatingDataObjectsSessionSix_TransformingManipulatingDataObjects
SessionSix_TransformingManipulatingDataObjectsHellen Gakuruh
 
SessionFive_ImportingandExportingData
SessionFive_ImportingandExportingDataSessionFive_ImportingandExportingData
SessionFive_ImportingandExportingDataHellen Gakuruh
 
SessionFour_DataTypesandObjects
SessionFour_DataTypesandObjectsSessionFour_DataTypesandObjects
SessionFour_DataTypesandObjectsHellen Gakuruh
 

Más de Hellen Gakuruh (20)

R training2
R training2R training2
R training2
 
R training6
R training6R training6
R training6
 
R training5
R training5R training5
R training5
 
R training3
R training3R training3
R training3
 
R training
R trainingR training
R training
 
Prelude to level_three
Prelude to level_threePrelude to level_three
Prelude to level_three
 
Prelude to level_two
Prelude to level_twoPrelude to level_two
Prelude to level_two
 
SessionThree_IntroductionToVersionControlSystems
SessionThree_IntroductionToVersionControlSystemsSessionThree_IntroductionToVersionControlSystems
SessionThree_IntroductionToVersionControlSystems
 
Day 2
Day 2Day 2
Day 2
 
Day 1
Day 1Day 1
Day 1
 
Introduction_to_Regular_Expressions_in_R
Introduction_to_Regular_Expressions_in_RIntroduction_to_Regular_Expressions_in_R
Introduction_to_Regular_Expressions_in_R
 
SessionTen_CaseStudies
SessionTen_CaseStudiesSessionTen_CaseStudies
SessionTen_CaseStudies
 
webScrapingFunctions
webScrapingFunctionswebScrapingFunctions
webScrapingFunctions
 
SessionNine_HowandWheretoGetHelp
SessionNine_HowandWheretoGetHelpSessionNine_HowandWheretoGetHelp
SessionNine_HowandWheretoGetHelp
 
SessionEight_PlottingInBaseR
SessionEight_PlottingInBaseRSessionEight_PlottingInBaseR
SessionEight_PlottingInBaseR
 
SessionSeven_WorkingWithDatesandTime
SessionSeven_WorkingWithDatesandTimeSessionSeven_WorkingWithDatesandTime
SessionSeven_WorkingWithDatesandTime
 
SessionSix_TransformingManipulatingDataObjects
SessionSix_TransformingManipulatingDataObjectsSessionSix_TransformingManipulatingDataObjects
SessionSix_TransformingManipulatingDataObjects
 
Files
FilesFiles
Files
 
SessionFive_ImportingandExportingData
SessionFive_ImportingandExportingDataSessionFive_ImportingandExportingData
SessionFive_ImportingandExportingData
 
SessionFour_DataTypesandObjects
SessionFour_DataTypesandObjectsSessionFour_DataTypesandObjects
SessionFour_DataTypesandObjects
 

Último

Multiple time frame trading analysis -brianshannon.pdf
Multiple time frame trading analysis -brianshannon.pdfMultiple time frame trading analysis -brianshannon.pdf
Multiple time frame trading analysis -brianshannon.pdfchwongval
 
Thiophen Mechanism khhjjjjjjjhhhhhhhhhhh
Thiophen Mechanism khhjjjjjjjhhhhhhhhhhhThiophen Mechanism khhjjjjjjjhhhhhhhhhhh
Thiophen Mechanism khhjjjjjjjhhhhhhhhhhhYasamin16
 
Vision, Mission, Goals and Objectives ppt..pptx
Vision, Mission, Goals and Objectives ppt..pptxVision, Mission, Goals and Objectives ppt..pptx
Vision, Mission, Goals and Objectives ppt..pptxellehsormae
 
detection and classification of knee osteoarthritis.pptx
detection and classification of knee osteoarthritis.pptxdetection and classification of knee osteoarthritis.pptx
detection and classification of knee osteoarthritis.pptxAleenaJamil4
 
毕业文凭制作#回国入职#diploma#degree澳洲中央昆士兰大学毕业证成绩单pdf电子版制作修改#毕业文凭制作#回国入职#diploma#degree
毕业文凭制作#回国入职#diploma#degree澳洲中央昆士兰大学毕业证成绩单pdf电子版制作修改#毕业文凭制作#回国入职#diploma#degree毕业文凭制作#回国入职#diploma#degree澳洲中央昆士兰大学毕业证成绩单pdf电子版制作修改#毕业文凭制作#回国入职#diploma#degree
毕业文凭制作#回国入职#diploma#degree澳洲中央昆士兰大学毕业证成绩单pdf电子版制作修改#毕业文凭制作#回国入职#diploma#degreeyuu sss
 
Biometric Authentication: The Evolution, Applications, Benefits and Challenge...
Biometric Authentication: The Evolution, Applications, Benefits and Challenge...Biometric Authentication: The Evolution, Applications, Benefits and Challenge...
Biometric Authentication: The Evolution, Applications, Benefits and Challenge...GQ Research
 
DBA Basics: Getting Started with Performance Tuning.pdf
DBA Basics: Getting Started with Performance Tuning.pdfDBA Basics: Getting Started with Performance Tuning.pdf
DBA Basics: Getting Started with Performance Tuning.pdfJohn Sterrett
 
NLP Data Science Project Presentation:Predicting Heart Disease with NLP Data ...
NLP Data Science Project Presentation:Predicting Heart Disease with NLP Data ...NLP Data Science Project Presentation:Predicting Heart Disease with NLP Data ...
NLP Data Science Project Presentation:Predicting Heart Disease with NLP Data ...Boston Institute of Analytics
 
NO1 Certified Black Magic Specialist Expert Amil baba in Lahore Islamabad Raw...
NO1 Certified Black Magic Specialist Expert Amil baba in Lahore Islamabad Raw...NO1 Certified Black Magic Specialist Expert Amil baba in Lahore Islamabad Raw...
NO1 Certified Black Magic Specialist Expert Amil baba in Lahore Islamabad Raw...Amil Baba Dawood bangali
 
Generative AI for Social Good at Open Data Science East 2024
Generative AI for Social Good at Open Data Science East 2024Generative AI for Social Good at Open Data Science East 2024
Generative AI for Social Good at Open Data Science East 2024Colleen Farrelly
 
科罗拉多大学波尔得分校毕业证学位证成绩单-可办理
科罗拉多大学波尔得分校毕业证学位证成绩单-可办理科罗拉多大学波尔得分校毕业证学位证成绩单-可办理
科罗拉多大学波尔得分校毕业证学位证成绩单-可办理e4aez8ss
 
办美国阿肯色大学小石城分校毕业证成绩单pdf电子版制作修改#真实留信入库#永久存档#真实可查#diploma#degree
办美国阿肯色大学小石城分校毕业证成绩单pdf电子版制作修改#真实留信入库#永久存档#真实可查#diploma#degree办美国阿肯色大学小石城分校毕业证成绩单pdf电子版制作修改#真实留信入库#永久存档#真实可查#diploma#degree
办美国阿肯色大学小石城分校毕业证成绩单pdf电子版制作修改#真实留信入库#永久存档#真实可查#diploma#degreeyuu sss
 
Student profile product demonstration on grades, ability, well-being and mind...
Student profile product demonstration on grades, ability, well-being and mind...Student profile product demonstration on grades, ability, well-being and mind...
Student profile product demonstration on grades, ability, well-being and mind...Seán Kennedy
 
RS 9000 Call In girls Dwarka Mor (DELHI)⇛9711147426🔝Delhi
RS 9000 Call In girls Dwarka Mor (DELHI)⇛9711147426🔝DelhiRS 9000 Call In girls Dwarka Mor (DELHI)⇛9711147426🔝Delhi
RS 9000 Call In girls Dwarka Mor (DELHI)⇛9711147426🔝Delhijennyeacort
 
20240419 - Measurecamp Amsterdam - SAM.pdf
20240419 - Measurecamp Amsterdam - SAM.pdf20240419 - Measurecamp Amsterdam - SAM.pdf
20240419 - Measurecamp Amsterdam - SAM.pdfHuman37
 
办理学位证中佛罗里达大学毕业证,UCF成绩单原版一比一
办理学位证中佛罗里达大学毕业证,UCF成绩单原版一比一办理学位证中佛罗里达大学毕业证,UCF成绩单原版一比一
办理学位证中佛罗里达大学毕业证,UCF成绩单原版一比一F sss
 
Statistics, Data Analysis, and Decision Modeling, 5th edition by James R. Eva...
Statistics, Data Analysis, and Decision Modeling, 5th edition by James R. Eva...Statistics, Data Analysis, and Decision Modeling, 5th edition by James R. Eva...
Statistics, Data Analysis, and Decision Modeling, 5th edition by James R. Eva...ssuserf63bd7
 
Data Factory in Microsoft Fabric (MsBIP #82)
Data Factory in Microsoft Fabric (MsBIP #82)Data Factory in Microsoft Fabric (MsBIP #82)
Data Factory in Microsoft Fabric (MsBIP #82)Cathrine Wilhelmsen
 
Call Us ➥97111√47426🤳Call Girls in Aerocity (Delhi NCR)
Call Us ➥97111√47426🤳Call Girls in Aerocity (Delhi NCR)Call Us ➥97111√47426🤳Call Girls in Aerocity (Delhi NCR)
Call Us ➥97111√47426🤳Call Girls in Aerocity (Delhi NCR)jennyeacort
 
Student Profile Sample report on improving academic performance by uniting gr...
Student Profile Sample report on improving academic performance by uniting gr...Student Profile Sample report on improving academic performance by uniting gr...
Student Profile Sample report on improving academic performance by uniting gr...Seán Kennedy
 

Último (20)

Multiple time frame trading analysis -brianshannon.pdf
Multiple time frame trading analysis -brianshannon.pdfMultiple time frame trading analysis -brianshannon.pdf
Multiple time frame trading analysis -brianshannon.pdf
 
Thiophen Mechanism khhjjjjjjjhhhhhhhhhhh
Thiophen Mechanism khhjjjjjjjhhhhhhhhhhhThiophen Mechanism khhjjjjjjjhhhhhhhhhhh
Thiophen Mechanism khhjjjjjjjhhhhhhhhhhh
 
Vision, Mission, Goals and Objectives ppt..pptx
Vision, Mission, Goals and Objectives ppt..pptxVision, Mission, Goals and Objectives ppt..pptx
Vision, Mission, Goals and Objectives ppt..pptx
 
detection and classification of knee osteoarthritis.pptx
detection and classification of knee osteoarthritis.pptxdetection and classification of knee osteoarthritis.pptx
detection and classification of knee osteoarthritis.pptx
 
毕业文凭制作#回国入职#diploma#degree澳洲中央昆士兰大学毕业证成绩单pdf电子版制作修改#毕业文凭制作#回国入职#diploma#degree
R training4

  • 1. Introduction to Data Analysis and Graphics in R
author: Hellen Gakuruh date: 2017-03-10 autosize: true
Slide 4: Summarizing Data
Outline
What we shall cover
• Numerical summaries for discrete variables
• Numerical summaries for continuous variables
• Tables for dichotomous variables
• Tables for categorical variables
• Tables for ordinal variables
Introduction
type: section
• A variable is a quantity whose values are not constant (they change)
• Discrete variables take countable values (obtained by counting)
• Continuous variables can take any value within a range (obtained by measuring)
• A dichotomous variable has two values, like “Yes” and “No” or TRUE and FALSE
• Categorical variables are qualitative variables whose values are non-numerical (text) with no ordering, like gender: “Female” and “Male”
Introduction cont.
type: section
• Ordinal variables are qualitative variables whose values are textual (non-numeric) with a natural ordering, like Likert scales or level of education
• There are two ways to describe a variable: numerically and graphically
• Numerical summaries comprise measures of central tendency, measures of spread/variability, and shape of distribution (the latter is often not reported but is used to guide additional analysis)
• All these variables (discrete, continuous, dichotomous, categorical, and ordinal) can be described by these measures, but each has its own computation and presentation
  • 2. Measures of central tendency
type: section
• There are three commonly used/reported measures of central tendency:
• Mean (arithmetic)
• Median
• Mode
• Mean is the average of all values, i.e. the sum of observations divided by the number of observations
===============================================================
type: sub-section
• Median is the central value when the values are ordered
• Mode is the most frequently occurring value
• There are at least three measures of dispersion
– Range
– Interquartile range (IQR)
– Variance and standard deviation
===============================================================
type: sub-section
• Range is the difference between the maximum and minimum values, often reported as the pair (minimum, maximum)
• IQR is the range within which the middle 50% of values lie (an order statistic)
• Standard deviation is roughly the average distance of values from the mean. It is the square root of the variance, which is the average squared distance from the mean.
• A distinction is made between sample and population
• Measures for a population are called population parameters, and they are often unknown
• Measures for a sample are called sample statistics
• Population mean is denoted as \(\mu\) (pronounced “mu”)
• Sample mean is denoted as \(\bar{x}\) (pronounced “x bar”)
Computing mean
type: sub-section
• Since the mean is the sum of all values divided by the number of values, the population and sample means can be expressed, respectively, as:
\[\mu = \frac{\sum{X}}{N}\] where X are the values and N is the number of values
\[\bar{x} = \frac{\sum{x}}{n}\] where x are the values and n is the number of values
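The definitions above can be checked by hand on a tiny data set. The following is a minimal sketch in base R; the data vector `x` is made up for illustration and does not come from the slides. The divisor `n - 1` is used for the variance because that is what R's built-in `var()` and `sd()` use.

```r
# Small made-up data set (an assumption, not from the slides)
x <- c(2, 4, 4, 5, 7, 9, 11)
n <- length(x)

# Sample mean: sum of values divided by number of values
x_bar <- sum(x) / n

# Range: the (minimum, maximum) pair and the distance between them
x_range <- range(x)
x_width <- diff(x_range)

# IQR: distance between the 25th and 75th percentiles
x_iqr <- IQR(x)

# Sample variance and standard deviation (divisor n - 1, as in var() and sd())
x_var <- sum((x - x_bar)^2) / (n - 1)
x_sd <- sqrt(x_var)

# The hand-computed values should agree with the built-in functions
all.equal(x_bar, mean(x))
all.equal(x_var, var(x))
all.equal(x_sd, sd(x))
```

Computing each quantity both ways is a quick check that the verbal definitions and the built-in functions describe the same thing.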
  • 3. Locating median
type: sub-section
• Median depends on whether the number of observations is odd or even
• For an odd number of values, the median is the middle value, like 3 in data set {1,2,3,4,5}
• For an even number of values, the median is the average of the two middle values, like the average of 3 and 4 for data set {1,2,3,4,5,6}, which is 3.5
Determining mode
type: sub-section
• Mode is the most frequently occurring value (observation)
• To get the mode, count the number of occurrences of each unique value (observation) and select the one with the most occurrences
• Number of occurrences is called frequency
• Mode for data set {1, 2, 1, 1, 3, 3} is 1 (it occurs three times)
• Mode is the only measure of central tendency that can have 0, 1, 2, or more than 2 values (no mode, uni-modal, bi-modal, or multi-modal)
Standard deviation (SD)
type: sub-section
• Used to determine how spread out values are from their average (mean)
• A small SD means values are clustered around their mean, and a big SD means values are spread out
• Computed by first subtracting the mean from each value to get the deviations. The deviations are squared before summing, because the raw deviations always sum to 0. The sum of squared deviations is then divided by the number of values to give the variance. Since the variance is in squared units, its square root is taken to give the SD.
• For a sample drawn from a population with unknown parameters, dividing by the number of observations underestimates the variance, so the sum is divided by “n-1” instead, i.e. \(s = \sqrt{\sum(x-\bar{x})^2/(n-1)}\)
Skewness
type: sub-section
• Skewness measures the symmetry of values around their mean
  • 4.
• If values are symmetrical (the left and right sides of the average are mirror images), the distribution is said to have “no skewness”
• If the bulk of values is to the left with a trail of values to the right, it is positively skewed
• If the bulk of values is to the right with a trail of values to the left, it is negatively skewed
• Measurement involves balancing values on both sides of the mean: if the difference is zero, then it is symmetrical, else it is positive (+ve) or negative (-ve)
Kurtosis
type: sub-section
• A measure of tailedness: fat/thin or long/short tails
• Not a measure of “peakedness”, as often discussed in older texts
• Reason: the measure gives more weight to values far away from the average, so it reflects how far, and by how much, values lie from the average
• Kurtosis is described as “mesokurtic”, “leptokurtic” or “platykurtic”
===============================================================
type: sub-section
• Mesokurtic means the tails are similar to those of a normal distribution; leptokurtic means the distribution is “slender” with fatter tails and has a greater kurtosis than a mesokurtic distribution
• Platykurtic means it has a lesser kurtosis than a mesokurtic distribution; it is broad with thinner tails
• The normal distribution, which has a kurtosis of 3, is the usual reference for kurtosis
• Kurtosis reported relative to this reference value of 3 (kurtosis minus 3) is referred to as excess kurtosis
Numerical summaries for discrete variables
type: section
• Can be described by mean or median as the average
• If data are skewed, the median is appropriate; otherwise compute the mean
• If the average is the mean, then dispersion is reported as standard deviation. If the average is the median, then dispersion should be reported as IQR
• Shape of distribution, as measured by skewness and kurtosis, can inform which average (mean or median) to use. It also guides inferential statistics
• Example: Hypothetical random numbers of students' scores
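Skewness and excess kurtosis as described above are commonly computed from central moments. The sketch below is one common moment-based definition in base R, with divisor n throughout. The helper names `central_moment`, `skew_sketch` and `excess_kurt_sketch` are hypothetical, the two data sets are made up, and nothing here is claimed to match the author's own functions.

```r
# Hypothetical helpers for illustration only
# k-th central moment: average of the k-th powers of deviations from the mean
central_moment <- function(x, k) mean((x - mean(x))^k)

# Standardized third moment: 0 for symmetric data, > 0 for a right tail
skew_sketch <- function(x) {
  central_moment(x, 3) / central_moment(x, 2)^(3 / 2)
}

# Standardized fourth moment minus 3 (3 is the kurtosis of a normal distribution)
excess_kurt_sketch <- function(x) {
  central_moment(x, 4) / central_moment(x, 2)^2 - 3
}

# Made-up data sets (assumptions, not from the slides)
symmetric <- c(1, 2, 3, 4, 5)
right_tail <- c(1, 1, 2, 2, 3, 10)

skew_sketch(symmetric)         # zero: both sides of the mean balance
skew_sketch(right_tail)        # positive: trail of values to the right
excess_kurt_sketch(symmetric)  # negative: thinner tails than the normal
```

Other definitions apply small-sample corrections, so values from different packages can differ slightly for the same data.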
  • 5.
==============================================================
type: sub-section
# Data
set.seed(4)
scores <- as.integer(round(rnorm(50, 78, 1)))
# Source own function for printing frequency tables
source("~/R/Scripts/desc-statistics.R")
# Frequency table
freq(scores)
  Values Freq Perc
1     76    2    4
2     77    8   16
3     78   19   38
4     79   17   34
5     80    4    8
===============================================================
type: sub-section
# Mean
mean(scores)
[1] 78.26
# Median
median(scores)
[1] 78
# Range
cat("Range for this distribution is", diff(range(scores)), paste0("(", paste(range(scores),
Range for this distribution is 4 (76, 80)
===============================================================
type: sub-section
# Where 50% of values lie
cat("50% of values lie between score of about", round(quantile(scores, 0.25)), "and", paste0
50% of values lie between score of about 78 and 79: an IQR of about 1
# Standard deviation (spread of values around mean)
sd(scores)
[1] 0.964894
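The summaries above rely on a sourced `freq()` helper whose code is not shown, and they do not report the mode. The sketch below shows one way a similar frequency table and the mode could be obtained with base R's `table()`. The names `freq_sketch` and `modes_sketch` are hypothetical, the `toy` vector is made up, and the author's actual `freq()` may work differently.

```r
# Hypothetical stand-in for the author's freq(); the real function is not shown
freq_sketch <- function(x) {
  counts <- table(x)
  data.frame(
    Values = names(counts),
    Freq = as.vector(counts),
    Perc = round(100 * as.vector(counts) / length(x)),
    stringsAsFactors = FALSE
  )
}

# Hypothetical helper: every value that attains the highest frequency
# (returned as character, because table() stores values as names)
modes_sketch <- function(x) {
  counts <- table(x)
  names(counts)[counts == max(counts)]
}

# Made-up scores (an assumption, not the simulated data above)
toy <- c(77, 78, 78, 79, 78, 80)

freq_sketch(toy)
modes_sketch(toy)            # one value: uni-modal
modes_sketch(c(1, 1, 2, 2))  # two values: bi-modal
```

Returning all values tied for the highest frequency keeps bi-modal and multi-modal cases visible instead of silently picking the first one.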
  • 6.
===============================================================
type: sub-section
# Functions developed to measure and interpret skewness and kurtosis
source("~/R/Scripts/skewness-kurtosis-fun.R")
# Skewness
m3_std(scores)
[1] -0.2551918
skewness_interpreter(m3_std(scores))
[1] "approximately symmetric"
# Kurtosis
excess_kurt(scores)
[1] -0.365273
excess_interpreter(excess_kurt(scores))
[1] "approximately mesokurtic"
Conclusion (discrete numerical measures)
type: sub-section
• From skewness and kurtosis we can tell this data set is almost centered around its mean, hence the mean is an appropriate representative value (a value to describe the data)
• Since the mean is our representative value, standard deviation is the appropriate measure of dispersion
• An SD of 0.964894 indicates values are not widely dispersed
• Display-wise, we expect to see an almost symmetric distribution
Numerical summaries for continuous variables
type: section
• Continuous variables have the same numerical summaries as discrete variables
• The exception is how to locate the mode, since a continuous variable can take on an infinite number of values within a range
• Finding the mode then involves grouping values into useful intervals, sometimes called breaks. This process is called “discretization”
  • 7.
• Breaks can range from 2 to 10, but most often five intervals are used (the data determine this)
============================================================
type: sub-section
• Example data: Random hypothetical sample of human height in feet
# Example data
set.seed(4)
height <- round(rnorm(50, 5.4), 2)
sort(height)
 [1] 3.60 3.71 3.92 4.12 4.47 4.54 4.58 4.65 4.76 4.86 4.93 5.00 5.02 5.12
[15] 5.12 5.17 5.19 5.30 5.35 5.36 5.42 5.43 5.50 5.55 5.57 5.57 5.58 5.62
[29] 5.78 5.97 5.99 6.00 6.09 6.12 6.26 6.29 6.31 6.33 6.45 6.57 6.64 6.66
[43] 6.69 6.69 6.71 6.74 6.94 7.04 7.18 7.30
===============================================================
type: sub-section
# Average
mean(height)
[1] 5.6352
median(height)
[1] 5.57
# Dispersion
sd(height)
[1] 0.9184931
diff(range(height)); range(height)
[1] 3.7
[1] 3.6 7.3
===============================================================
type: sub-section
IQR(height)
[1] 1.28
# Modal Class (interval)
tab <- freq_continuous(height)
as.vector(tab[which.max(tab$Perc), 1])
[1] "(5,5.5]"
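The `freq_continuous()` helper used for the modal class is sourced from a script that is not shown. The sketch below shows how discretization and a modal class could be obtained with base R's `cut()` and `table()`. The name `freq_continuous_sketch`, its `width` argument and the data vector `x` are all assumptions for illustration; the author's function may choose breaks differently.

```r
# Hypothetical stand-in for the author's freq_continuous(); the real one is not shown
freq_continuous_sketch <- function(x, width = 0.5) {
  # Breaks at multiples of `width` that cover the whole range of x
  breaks <- seq(floor(min(x) / width) * width,
                ceiling(max(x) / width) * width,
                by = width)
  # Right-closed intervals such as (5,5.5]; include.lowest keeps a value
  # that falls exactly on the first break
  classes <- cut(x, breaks = breaks, include.lowest = TRUE)
  counts <- table(classes)
  data.frame(
    Values = names(counts),
    Freq = as.vector(counts),
    Perc = round(100 * as.vector(counts) / length(x)),
    stringsAsFactors = FALSE
  )
}

# Made-up values (an assumption), not the height sample from the slides
x <- c(4.6, 5.1, 5.2, 5.3, 5.4, 5.6, 5.9, 6.2)
tab <- freq_continuous_sketch(x)
tab

# Modal class: the interval with the highest frequency
tab$Values[which.max(tab$Freq)]
```

The interval width drives the result: wider or narrower breaks can move the modal class, which is why the choice of breaks should follow the data.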
  • 8.
# Functions for generating frequency tables
freq_continuous(height)
   Values Freq Perc
1 (3.5,4]    3    6
2 (4,4.5]    2    4
3 (4.5,5]    7   14
4 (5,5.5]   11   22
5 (5.5,6]    9   18
6 (6,6.5]    7   14
7 (6.5,7]    8   16
8 (7,7.5]    3    6
============================================================
type: sub-section
# Skewness
m3_std(height)
[1] -0.2186212
skewness_interpreter(m3_std(height))
[1] "approximately symmetric"
# Kurtosis
excess_kurt(height)
[1] -0.7024805
excess_interpreter(excess_kurt(height))
[1] "moderately platykurtic"
Tables for dichotomous variables
type: section
• Have two values, e.g. “Yes” & “No”
• Best presented in frequency tables
set.seed(4)
dichot <- sample(c("Yes", "No"), 100, replace = TRUE)
freq(dichot)
  • 9.
  Values Freq Perc
1     No   57   57
2    Yes   43   43
Tables for categorical variables
type: section
• Just like dichotomous variables (which are categorical), these can be displayed in a frequency table when univariate and in a contingency table for bivariate relationships
==========================================================
type: sub-section
groups <- rep(c("a", "b", "c"), 200)
set.seed(4)
outcome <- sample(c("improved", "same", "decreased"), length(groups), replace = TRUE, prob =
freq(groups)
  Values Freq Perc
1      a  200   33
2      b  200   33
3      c  200   33
freq(outcome)
     Values Freq Perc
1 decreased   66   11
2  improved  418   70
3      same  116   19
Contingency table
type: sub-section
source("~/R/Scripts/desc-statistics.R")
contigency_tab(groups, outcome)
        outcome
groups   decreased perc improved perc same perc
  a             22   33      136   33   42   36
  b             23   35      140   33   37   32
  c             21   32      142   34   37   32
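The contingency table above comes from a sourced helper whose code is not shown. A similar table of counts with column percentages can be built with base R's `table()` and `prop.table()`, as in this minimal sketch. The data are made up for illustration, and reading the `perc` columns above as column percentages is an inference from the printed numbers, not something the slides state.

```r
# Made-up data (an assumption), much smaller than the example in the slides
groups <- c("a", "a", "a", "b", "b", "b", "c", "c")
outcome <- c("improved", "improved", "same", "improved",
             "same", "same", "improved", "decreased")

# Counts of each group/outcome combination
counts <- table(groups, outcome)

# Column percentages: within each outcome, the share contributed by each group
col_perc <- round(100 * prop.table(counts, margin = 2))

counts
col_perc
```

Using `margin = 1` instead would give row percentages, i.e. the distribution of outcomes within each group, which is often the more natural comparison when the groups are the explanatory variable.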