G7-quantitative
1. INSTRUCTOR: DR. TUNG NGUYE
GROUP 7
MEMBERS:
Ly Ngoc Tra An
Ngo Huong Giang
Tran Nhu Hanh
Tran Thi My Hanh
Nguyen Thi Hong Tham
Nguyen Thi Thao Tien
2. OUTLINE
Data analysis
Central Tendency : Mean,Median,Mode
Spread of distribution : Range, Variance,
Standard Deviation
Experimental :
Paired T-Test
Anova
3. CENTRAL TENDENCY
The term central tendency refers to the "middle" value or perhaps a
typical value of the data, and is measured using the mean, median,
or mode. Each of these measures is calculated differently, and the
one that is best to use depends upon the situation.
In statistics, the term central tendency relates to the way in which
quantitative data tend to cluster around some value
In the simplest cases, the measure of central tendency is an average of
a set of measurements, the word average being variously construed
as mean, median, or other measure of location, depending on the
context.
Both "central tendency" and "measure of central tendency" apply to
either statistical populations or to samples from a population.
4. MEASURES OF CENTRAL
TENDENCY
Arithmetic mean: (or simply, mean) – the sum of all
measurements divided by the number of observations in
the data set
The mean is the most commonly-used measure of central tendency.
When we talk about an "average", we usually are referring to
the mean. The mean is simply the sum of the values divided by
the total number of items in the set. The result is referred to as
the arithmetic mean. Sometimes it is useful to give more
weighting to certain data points, in which case the result is
called the weighted arithmetic mean.
The mean is valid only for interval data or ratio data. Since it uses
the values of all of the data points in the population or sample,
the mean is influenced by outliers that may be at the extremes of
the data set.
5. MEDIAN: THE MIDDLE VALUE THAT SEPARATES THE HIGHER HALF FROM THE LOWER HALF OF THE DATA SET
The median is determined by sorting the data from lowest to highest values and taking the data
point in the middle of the sequence. There is an equal number of points above and below the
median. For example, in the data set {1,2,3,4,5} the median is 3; there are two data points
greater than this value and two data points less than this value. In this case, the median is
equal to the mean. But consider the data set {1,2,3,4,10}. In this dataset, the median still is
three, but the mean is equal to 4. If there is an even number of data points in the set, then there
is no single point at the middle and the median is calculated by taking the mean of the two
middle points.
The median can be determined for ordinal data as well as interval and ratio data. Unlike the mean,
the median is not influenced by outliers at the extremes of the data set. For this reason, the
median often is used when there are a few extreme values that could greatly influence the
mean and distort what might be considered typical. This often is the case with home prices
and with income data for a group of people, which often is very skewed. For such data, the
median often is reported instead of the mean. For example, in a group of people, if the salary
of one person is 10 times the mean, the mean salary of the group will be higher because of the
unusually large salary. In this case, the median may better represent the typical salary level of
the group.
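The two data sets discussed above can be checked with a short Python sketch using the standard `statistics` module. The third list, with an even number of points, is a hypothetical example added only to illustrate averaging the two middle values.

```python
from statistics import mean, median

symmetric = [1, 2, 3, 4, 5]
with_outlier = [1, 2, 3, 4, 10]
even_count = [1, 2, 3, 4]  # hypothetical set, not from the slides

# The outlier 10 pulls the mean up to 4 but leaves the median at 3.
print(mean(symmetric), median(symmetric))        # 3 3
print(mean(with_outlier), median(with_outlier))  # 4 3

# With an even number of points, the median is the mean of the two middle values.
print(median(even_count))                        # 2.5
```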
6. MODE (STATISTICS): THE MOST
FREQUENT VALUE IN THE DATA
SET
The mode is the most frequently occurring value in the data set.
For example, in the data set {1,2,3,4,4}, the mode is equal to
4. A data set can have more than a single mode, in which case
it is multimodal. In the data set {1,1,2,3,3} there are two
modes: 1 and 3.
The mode can be very useful for dealing with categorical data.
For example, if a sandwich shop sells 10 different types of
sandwiches, the mode would represent the most popular
sandwich. The mode also can be used with ordinal, interval,
and ratio data. However, in interval and ratio scales, the data
may be spread thinly with no data points having the same
value. In such cases, the mode may not exist or may not be
very meaningful.
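A minimal sketch of the mode, using `statistics.multimode` (Python 3.8 or later), which returns every value tied for most frequent. The sandwich names and the list of distinct measurements are hypothetical illustrations.

```python
from statistics import multimode

print(multimode([1, 2, 3, 4, 4]))  # [4]
print(multimode([1, 1, 2, 3, 3]))  # [1, 3]  (two modes: multimodal)

# Categorical data: the mode is the most popular category (hypothetical orders).
orders = ["ham", "tuna", "ham", "egg"]
print(multimode(orders))           # ['ham']

# Thinly spread interval/ratio data: every value ties, so the mode is not meaningful.
measurements = [1.2, 3.4, 5.6]
print(multimode(measurements))     # [1.2, 3.4, 5.6]
```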
7. WHEN TO USE MEAN, MEDIAN, AND MODE
Measurement Scale        Best Measure of the "Middle"
Nominal (Categorical)    Mode
Ordinal                  Median
Interval                 Symmetrical data: Mean; Skewed data: Median
Ratio                    Symmetrical data: Mean; Skewed data: Median
8. A RANGE, A VARIANCE, AND A
STANDARD DEVIATION
RANGE
Range = The range indicates the distance
between the two most extreme scores in a
distribution
>>> Range = highest score – lowest score
9. VARIANCE AND STANDARD DEVIATION
• The variance and standard deviation are two
measures of variability that indicate how
much the scores are spread out around the
mean
• We use the mean as our reference point since
it is at the center of the distribution
10. Variance = the average squared distance
of the numbers from the mean
Standard Deviation = the square root of the
variance, loosely defined as the typical amount
a number differs from the mean
11. We will use the following sample data
set to explain the range, variance, and
standard deviation:
4, 6, 3, 7, 9, 4, 2, 1, 4, 2
12. SAMPLE DATA : 4, 6, 3, 7, 9, 4, 2, 1, 4, 2
Range:
R = maximum score - minimum score
In order to figure out the range, A) arrange your data
set in order from lowest to highest and B) subtract
the lowest number from the highest number.
A) When arranged in order, 4, 6, 3, 7, 9, 4, 2, 1, 4,
2 becomes: 1, 2, 2, 3, 4, 4, 4, 6, 7, 9
B) The lowest number is 1 and the highest number
is 9. Therefore, R = 9-1 = 8
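Steps A) and B) can be sketched in a few lines of Python using the sample data from the slide:

```python
data = [4, 6, 3, 7, 9, 4, 2, 1, 4, 2]

# A) arrange the data from lowest to highest
ordered = sorted(data)
print(ordered)      # [1, 2, 2, 3, 4, 4, 4, 6, 7, 9]

# B) subtract the lowest number from the highest number
data_range = ordered[-1] - ordered[0]
print(data_range)   # 8
```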
13. SAMPLE DATA : 4, 6, 3, 7, 9, 4, 2, 1, 4, 2
The Computational Formula:
$S^2 = \dfrac{\sum X^2 - \frac{(\sum X)^2}{N}}{N}$
In the above formula:
$S^2$ = variance
Σ = sigma = the sum of (add up all the numbers)
X = the numbers from your data set
$X^2$ = the numbers from your data set squared
N = the total number of numbers you have in your data
set
14. SAMPLE DATA : 4, 6, 3, 7, 9, 4, 2, 1, 4, 2
The easiest way to compute variance with the
computational formula is as follows:
A) List each of the numbers in your data set
vertically & get the sum of that column
B) Figure out N (count how many numbers you
have in your data set)
C) Square each number in your data set and get
the sum of that column
A): X     C): X^2
4         4^2 = 16
6         6^2 = 36
3         3^2 = 9
7         7^2 = 49
9         9^2 = 81
4         4^2 = 16
2         2^2 = 4
1         1^2 = 1
4         4^2 = 16
2         2^2 = 4
Σ = 42    Σ = 232
B): N = 10
15. SAMPLE DATA : 4, 6, 3, 7, 9, 4, 2, 1, 4, 2
Now use the sums from parts A) and C), as
well as the value for N which you
found in part B), to fill in the formula:
$S^2 = \dfrac{232 - \frac{42^2}{10}}{10} = \dfrac{232 - 176.4}{10} = \dfrac{55.6}{10}$
Do the math and $S^2$ = 5.56
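The computational steps can be reproduced with a short Python sketch. It assumes the formula divides by N (the population form), which is what the slide's result of 5.56 implies; dividing by N − 1 instead would give the sample variance, which is a different number.

```python
from statistics import pvariance

data = [4, 6, 3, 7, 9, 4, 2, 1, 4, 2]

n = len(data)                        # B) N = 10
sum_x = sum(data)                    # A) sum of X = 42
sum_x2 = sum(x * x for x in data)    # C) sum of X squared = 232

# Computational formula, dividing by N (assumed from the slide's result).
variance = (sum_x2 - sum_x ** 2 / n) / n
print(round(variance, 2))            # 5.56

# Cross-check against the standard library's population variance.
print(round(pvariance(data), 2))     # 5.56
```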
16. SAMPLE DATA : 4, 6, 3, 7, 9, 4, 2, 1, 4, 2
The Conceptual Formula:
$S^2 = \dfrac{\sum (X - M)^2}{N}$
In the above formula:
$S^2$ = variance
Σ = sigma = the sum of (add up all the numbers)
X = the numbers from your data set
M = the mean
N = the total number of numbers you have in your data
set
17. SAMPLE DATA : 4, 6, 3, 7, 9, 4, 2, 1, 4, 2
The easiest way to compute variance with the
conceptual formula is as follows:
A) List each of the numbers in your data set
vertically & get the sum of that column
B) Figure out N (count how many numbers you
have in your data set)
C) Figure out M
D) Subtract M from each number in your data
set (Notice how the sum is zero)
E) Square the numbers you got for part D) and get
the sum of that column
A): X     D): (X-M)           E): (X-M)^2
4         (4-4.2) = -0.2      (-0.2)^2 = 0.04
6         (6-4.2) = 1.8       (1.8)^2 = 3.24
3         (3-4.2) = -1.2      (-1.2)^2 = 1.44
7         (7-4.2) = 2.8       (2.8)^2 = 7.84
9         (9-4.2) = 4.8       (4.8)^2 = 23.04
4         (4-4.2) = -0.2      (-0.2)^2 = 0.04
2         (2-4.2) = -2.2      (-2.2)^2 = 4.84
1         (1-4.2) = -3.2      (-3.2)^2 = 10.24
4         (4-4.2) = -0.2      (-0.2)^2 = 0.04
2         (2-4.2) = -2.2      (-2.2)^2 = 4.84
Σ = 42    Σ = 0               Σ = 55.6
B): N = 10
C): M = 42/10 = 4.2
18. Now use the sum for part E), as well as
the value for N which you found in
part B), to fill in the formula:
$S^2 = \dfrac{55.6}{10}$
Do the math and $S^2$ = 5.56
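A minimal Python sketch of the conceptual route (steps A to E), again assuming division by N as the slide's result of 5.56 implies:

```python
data = [4, 6, 3, 7, 9, 4, 2, 1, 4, 2]

n = len(data)                                 # B) N = 10
m = sum(data) / n                             # C) M = 4.2
deviations = [x - m for x in data]            # D) X - M (these sum to zero)
squared = [d * d for d in deviations]         # E) (X - M)^2

sum_of_squares = sum(squared)                 # about 55.6
variance = sum_of_squares / n
print(round(sum(deviations), 10))             # 0.0 up to rounding error
print(round(variance, 2))                     # 5.56
```

Both routes give the same variance; the conceptual one makes the "squared distance from the mean" idea visible, while the computational one needs only two running sums.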
19. STANDARD DEVIATION:
Standard deviation is simply the square root
of the variance. Therefore, it does not
matter if you use the computational formula
or the conceptual formula to compute
variance.
For our sample data set, our variance came
out to be 5.56, regardless of the formula
used. The standard deviation for our data
set then becomes: $S = \sqrt{5.56} \approx 2.36$
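As a quick check, the square root can be taken directly, or the standard library's `pstdev` (population standard deviation, dividing by N as in the variance above) can be applied to the raw data:

```python
import math
from statistics import pstdev

data = [4, 6, 3, 7, 9, 4, 2, 1, 4, 2]

sd_from_variance = math.sqrt(5.56)   # square root of the variance found earlier
sd_from_data = pstdev(data)          # same quantity computed from the raw data

print(round(sd_from_variance, 2))    # 2.36
print(round(sd_from_data, 2))        # 2.36
```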
20. INDEPENDENT SAMPLES
• The independent samples t-test is used when two separate
sets of independent and identically distributed samples are
obtained, one from each of the two populations being
compared.
• E.g: suppose we are evaluating the effect of a medical
treatment, and we enroll 100 subjects into our study, then
randomize 50 subjects to the treatment group and 50 subjects
to the control group. In this case, we have two independent
samples and would use the unpaired form of the t-test. The
randomization is not essential here—if we contacted 100
people by phone and obtained each person's age and gender,
and then used a two-sample t-test to see whether the mean
ages differ by gender, this would also be an independent
samples t-test, even though the data are observational.
22. A. EQUAL SAMPLE SIZES, EQUAL VARIANCE
This test is only used when both:
the two sample sizes (that is, the
number, n, of participants of each
group) are equal;
it can be assumed that the two
distributions have the same variance.
23. B. UNEQUAL SAMPLE SIZES, EQUAL VARIANCE
This test is used only when it can be
assumed that the two distributions
have the same variance.
24. C. UNEQUAL SAMPLE SIZES, UNEQUAL
VARIANCE
This test, also known as Welch's t-test,
is used only when the two population
variances are assumed to be different
(the two sample sizes may or may not
be equal) and hence must be
estimated separately.
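A minimal sketch of the unequal-variance case (Welch's t-test): each sample variance is estimated separately, and the degrees of freedom come from the Welch-Satterthwaite approximation. The function name `welch_t` and the two samples are hypothetical illustrations, not taken from the slides.

```python
import math
from statistics import mean, variance

def welch_t(sample_a, sample_b):
    """Return (t, df) for Welch's unequal-variance t-test."""
    n_a, n_b = len(sample_a), len(sample_b)
    # Squared standard error contributed by each group (sample variance / n).
    v_a = variance(sample_a) / n_a
    v_b = variance(sample_b) / n_b
    t = (mean(sample_a) - mean(sample_b)) / math.sqrt(v_a + v_b)
    # Welch-Satterthwaite approximation to the degrees of freedom.
    df = (v_a + v_b) ** 2 / (v_a ** 2 / (n_a - 1) + v_b ** 2 / (n_b - 1))
    return t, df

# Hypothetical samples with unequal sizes and clearly unequal spread.
group_a = [10, 12, 14, 16, 18]
group_b = [11, 11, 12, 12]
t_stat, df = welch_t(group_a, group_b)
print(round(t_stat, 3), round(df, 2))
```

The resulting df is generally not a whole number and lies between the smaller group's n − 1 and the pooled value n1 + n2 − 2.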
25. WORKED EXAMPLE
• A study of the effect of caffeine on muscle metabolism used
eighteen male volunteers who each underwent arm exercise
tests. Nine of the men were randomly selected to take a
capsule containing pure caffeine one hour before the test.
The other men received a placebo capsule. During each
exercise the subject's respiratory exchange ratio (RER) was
measured. (RER is the ratio of CO2 produced to O2
consumed and is an indicator of whether energy is being
obtained from carbohydrates or fats).
• Question: whether, on average, caffeine changes RER.
• Populations: "men who have not taken caffeine" and "men
who have taken caffeine". (If caffeine has no effect on RER
the two sets of data can be regarded as having come from
the same population.)
26. Placebo         Caffeine
105             96
119             99
100             94
97              89
96              96
101             93
94              88
95              105
98              88
Mean = 100.56   Mean = 94.22
SD = 7.70       SD = 5.61
• The means show that, on average, caffeine appears to have altered RER from
about 100.6% to 94.2%, a change of 6.4%.
• However, there is a great deal of variation between the data values in both
samples and considerable overlap between them.
• Is the difference between the two means simply due to sampling variation, or do the
data provide evidence that caffeine does, on average, reduce RER? The p-value
answers this question.
• The t-test tests the null hypothesis that the mean of the caffeine treatment equals the
mean of the placebo versus the alternative hypothesis that the mean of the caffeine
treatment is not equal to the mean of the placebo treatment.
• Computer output obtained for the RER data gives the sample means and the 95%
confidence interval for the difference between the means.
27. COMPUTER OUTPUT
The p-value is 0.063 and, therefore, the difference between the two means is not
statistically significantly different from zero at the 5% level of significance. There is
an estimated change of 6.4% (SE = 3.17%). However, there is insufficient evidence
(p = 0.063) to suggest that caffeine does change the mean RER.
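The slides take these figures from computer output. As a rough cross-check, the following standard-library sketch computes the equal-variance (pooled) t statistic from the RER data listed in the worked example; the p-value itself needs the t distribution, which is not in the Python standard library, so it is not recomputed here.

```python
import math
from statistics import mean, variance

placebo = [105, 119, 100, 97, 96, 101, 94, 95, 98]
caffeine = [96, 99, 94, 89, 96, 93, 88, 105, 88]

n1, n2 = len(placebo), len(caffeine)
diff = mean(placebo) - mean(caffeine)            # about 6.33

# Pooled variance: both groups are assumed to share one population variance.
pooled_var = ((n1 - 1) * variance(placebo)
              + (n2 - 1) * variance(caffeine)) / (n1 + n2 - 2)
se = math.sqrt(pooled_var * (1 / n1 + 1 / n2))   # standard error of the difference
t_stat = diff / se
df = n1 + n2 - 2

print(round(diff, 2), round(se, 2), round(t_stat, 2), df)
```

The standard error comes out at about 3.17 and t at about 1.99 on 16 degrees of freedom, in line with the values quoted in the slides.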
28. Alternative suggestion
It could be argued, however, that the researcher might only be interested in whether
'caffeine reduces RER'. That is, the researcher is looking for a specific direction for
the difference between the two population means. This is an example of a one-tailed
t-test as opposed to a two-tailed t-test outlined above.
SPSS only performs a 2-tailed test (the non-directional alternative hypothesis). To
obtain the p-value for the directional alternative hypothesis (one-tailed test), the p-value
should be halved, provided the observed difference lies in the hypothesized direction.
Hence, in this example, p = 0.032.
Report: The mean RER in the caffeine group (94.2 ± 1.9) was significantly lower (t =
1.99, 16 df, one-tailed t-test, p = 0.032) than the mean of the placebo group (100.6 ±
2.6).
Note: It is important to decide whether a one- or two-tailed test is being carried out
before analysis takes place.
Otherwise it might be tempting to see what the p-value is before making your
decision!
29. A suitable null hypothesis in both cases is
H0: On average, caffeine has no effect on
RER, with an alternative (or
experimental) hypothesis,
H1: On average, caffeine changes RER (2-
tail test), or H1: On average, caffeine
reduces RER (1-tail case).
30. 2. ONE SAMPLE T-TEST
Compare the mean score of a sample to
a known value. Usually, the known
value is a population mean.
Assumption:
The dependent variable is
normally distributed.
31. In testing the null hypothesis that the
population mean is equal to a specified
value μ, use the statistic:
$t = \dfrac{\bar{x} - \mu}{S / \sqrt{n}}$
$\bar{x}$: sample mean   S: sample standard deviation   n: sample size
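A minimal sketch of the one-sample t statistic, assuming the usual form t = (sample mean − μ) / (S / √n) with n − 1 degrees of freedom. The function name `one_sample_t`, the scores, and the reference value are hypothetical.

```python
import math
from statistics import mean, stdev

def one_sample_t(sample, mu):
    """Return (t, df) for testing whether the population mean equals mu."""
    n = len(sample)
    standard_error = stdev(sample) / math.sqrt(n)   # S / sqrt(n)
    return (mean(sample) - mu) / standard_error, n - 1

# Hypothetical sample compared with a hypothetical known population mean of 5.0.
scores = [5.1, 4.9, 5.6, 5.8, 5.0, 5.3]
t_stat, df = one_sample_t(scores, 5.0)
print(round(t_stat, 3), df)
```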
32. 2. PAIRED SAMPLES T-TEST
What it does:
compare the means of two variables
compute the difference between the two variables
for each case, and test to see if the average
difference is significantly different from zero
Assumption:
Both variables should be normally distributed.
33. Hypothesis:
Null: There is no significant difference
between the means of the two variables.
Alternate: There is a significant difference
between the means of the two variables.
34. Difference between a paired samples t-test
and an independent samples t-test?
Both tests are used to find significant differences between
groups, but the independent samples t-test assumes the
groups are not related to each other, while the dependent
samples t-test or paired samples t-test assumes the groups
are related to each other.
A dependent samples t-test or paired samples t-test would
be used to find differences within groups, while the
independent samples t-test would be used to find
differences between groups.
35. Independent variable and dependent
variable:
The independent variable and the dependent
variable are the same in both the dependent
samples t-test and the independent samples t-test.
The measured variable of interest is the
dependent variable and the
grouping variable is the independent variable.
36. The most common use of the dependent samples t-test
is in a pretreatment vs. posttreatment scenario where the
researcher wants to test the effectiveness of a treatment.
1. The participants are tested pretreatment, to establish
some kind of a baseline measure
2. The participants are then exposed to some kind of
treatment
3. The participants are then tested posttreatment, for the
purposes of comparison with the pretreatment scores
37. $t = \dfrac{\bar{d}}{s_d / \sqrt{n}}$
For this equation, the differences between all pairs
must be calculated. The pairs are either one person's
pre-test and post-test scores or pairs of
persons matched into meaningful groups. The average
($\bar{d}$) and standard deviation ($s_d$) of those differences are
used in the equation, and n is the number of pairs.
The degrees of freedom used are n − 1.
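The procedure can be sketched as follows, assuming the usual paired statistic: the mean of the differences divided by their standard error, on n − 1 degrees of freedom. The function name `paired_t` and the pre/post scores are hypothetical, not the data behind the SPSS output in these slides.

```python
import math
from statistics import mean, stdev

def paired_t(before, after):
    """Return (t, df) for a paired samples t-test on matched observations."""
    if len(before) != len(after):
        raise ValueError("paired samples must have the same length")
    diffs = [a - b for b, a in zip(before, after)]   # one difference per pair
    n = len(diffs)
    t = mean(diffs) / (stdev(diffs) / math.sqrt(n))
    return t, n - 1

# Hypothetical pre-test and post-test scores for five participants.
pre = [60, 72, 55, 80, 68]
post = [65, 70, 61, 84, 75]
t_stat, df = paired_t(pre, post)
print(round(t_stat, 3), df)
```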
38. EXAMPLE: SPSS OUTPUT
We compared the mean test scores before
(pre-test) and after (post-test) the subjects
completed a test preparation course.
We want to see if our test preparation
course improved people's score on the
test
40. There is a strong positive correlation.
People who did well on the pre-test also
did well on the post-test.
41. Remember, this test is based on the difference
between the two variables. Under "Paired
Differences" we see the descriptive statistics for the
difference between the two variables.
42. The T value = -2.171
We have 11 degrees of freedom
Our significance is .053
44. If the significance value is less than .05,
there is a significant difference.
If the significance value is greater than .05,
there is no significant difference.
Conclusion: With a significance of .053, there is no
statistically significant difference between
pre- and post-test scores at the .05 level. We have no
evidence that our test preparation course helped.
46. WHAT IS ANOVA?
• ANOVA is an analysis of the variation present
in an experiment. It is a test of the hypothesis
that the variation in an experiment is no greater
than that due to normal variation of individuals'
characteristics and error in their measurement.
• ANOVA is a technique from statistical
inference that allows us to deal with several
populations
48. ONE-WAY ANOVA DEFINITION
• A One-way ANOVA is used when
comparing two or more group means on
a continuous dependent variable. In other
words, one-way ANOVA techniques can
be used to study the effect of k(>2) levels
of a single factor.
• The independent T-Test is a special case
of the One-way ANOVA for situations
where there are only two group means
49. MAJOR CONCEPTS:
1. CALCULATING SUMS OF SQUARES
• The One-way ANOVA separates the total variance
in the continuous dependent variable into two
components: Variability between the groups and
Variability within the groups
• Variability between the groups is calculated by
first obtaining the sums of squares between groups
(SSb), or the sum of the squared differences between
each individual group mean and the grand mean,
with each group weighted by its size
• Variability within the groups is calculated by
first obtaining the sums of squares within groups
(SSw), or the sum of the squared differences
between each individual score and that individual’s
group mean.
50. TYPES OF VARIABLES
FOR ONE-WAY ANOVA
• The IV (Independent Variable) is
categorical. The categorical IV can
be two groups or it can have more
than two groups.
• The DV (Dependent Variable) is
continuous
• Data are collected on both variables
for each person in the study.
51. EXAMPLES OF RESEARCH QUESTIONS FOR
ONE-WAY ANOVA
1. Is there a significant difference in student attitudes
toward the course between students who pass or fail a
course?
• Student attitude is continuous
• Passing a course is categorical (pass/fail)
Because the IV has only 2 groups, we can use
independent T-Test
2. Does student satisfaction significantly differ by location
of institution (rural, urban, suburban)?
• Student satisfaction is continuous
• Institution location is categorical
52. The linear model, conceptually, is:
SSt = SSb + SSw
SSt: total sums of squares
SSb: sums of squares between groups
SSw: sums of squares within groups
53. ONE-WAY ANOVA AS A RATIO OF VARIANCES:
Formula for variance:
Numerator: a sum of squared values (or a
sums of squares)
Denominator: degrees of freedom
54. • The ANOVA analyzes the ratio of the
variance between groups to the variance
within the groups
• In ANOVA, these variances, formerly
known to us as $S^2$, are referred to as mean
squares (MS). Mean squares are
calculated by dividing each sum of
squares by the degrees of freedom
associated with it.
55. • Thus, a mean square between is simply the
variance between groups, obtained as a sum of
squares divided by its degrees of freedom
• Likewise, a mean square within is simply the
variance within groups, obtained as a sum of
squares divided by its degrees of freedom
56. FACTORS THAT AFFECT SIGNIFICANCE
F-ratio: the variation due to an experimental treatment or
effect divided by the variation due to experimental error. The
null hypothesis is that this ratio equals 1.0, or that the treatment
effect is the same as the experimental error. This hypothesis is
rejected if the F-ratio is large enough that the probability of
obtaining a value at least that large under the null hypothesis is
smaller than some pre-assigned criterion such as 0.05 (one in twenty).
The MSb is divided by the MSw to obtain the F ratio
for hypothesis testing
57. DISTRIBUTION OF F - RATIO
• F distribution is positively skewed
• If F statistic falls near 1.0, then
most likely the null is true
• If F statistic is large, expect null is
false. Thus, significant F ratios will
be in the tail of the F distribution
58. P VALUE
In statistical hypothesis testing, the p-
value is the probability of obtaining
a test statistic at least as extreme as
the one that was actually observed,
assuming that the null hypothesis is
true. One often "rejects the null
hypothesis" when the p-value is less
than the significance level α, which is
often 0.05 or 0.01.
60. $t^2 = F$
• The larger the absolute value of t, the more likely we are
to find significant results
• The t-test is a special case of ANOVA when only two
groups comprise the independent variable
• We’re familiar with the t distribution as
approximately normal (for large df), with
positive and negative values. The F statistic,
on the other hand, is positively skewed, and is
composed of squared values. Thus, for any two-group
situation, $t^2 = F$
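The two-group identity can be checked numerically. This sketch reuses the placebo and caffeine RER values from the worked example and assumes the equal-variance (pooled) t-test, which is the version that matches one-way ANOVA.

```python
import math
from statistics import mean, variance

placebo = [105, 119, 100, 97, 96, 101, 94, 95, 98]
caffeine = [96, 99, 94, 89, 96, 93, 88, 105, 88]
n1, n2 = len(placebo), len(caffeine)

# Pooled-variance t statistic.
pooled_var = ((n1 - 1) * variance(placebo)
              + (n2 - 1) * variance(caffeine)) / (n1 + n2 - 2)
t_stat = (mean(placebo) - mean(caffeine)) / math.sqrt(pooled_var * (1 / n1 + 1 / n2))

# One-way ANOVA with k = 2 groups: dfb = 1, so MSb equals SSb.
grand_mean = mean(placebo + caffeine)
ms_between = (n1 * (mean(placebo) - grand_mean) ** 2
              + n2 * (mean(caffeine) - grand_mean) ** 2)
ms_within = pooled_var            # SSw / (N - 2) is the pooled variance
f_ratio = ms_between / ms_within

print(round(t_stat ** 2, 3), round(f_ratio, 3))   # the two values agree
```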
61. CALCULATIONS
• dfb = k - 1 (k: number of samples/groups/levels)
• dfw = N - k (N: total number of individuals in all groups)
• dfT = N - 1
• MSb = SSb/ dfb
• MSw = SSw/ dfw
• F = MSb/ MSw
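The calculations listed above can be put together in a small Python sketch. The function name `one_way_anova` and the three groups of scores are hypothetical; SSb weights each squared group-mean deviation by the group size, and SSw sums squared deviations of scores from their own group mean.

```python
from statistics import mean

def one_way_anova(groups):
    """Return sums of squares, degrees of freedom, mean squares and F."""
    k = len(groups)
    n_total = sum(len(g) for g in groups)
    grand_mean = sum(sum(g) for g in groups) / n_total

    ss_b = sum(len(g) * (mean(g) - grand_mean) ** 2 for g in groups)
    ss_w = sum((x - mean(g)) ** 2 for g in groups for x in g)

    df_b, df_w = k - 1, n_total - k
    ms_b, ms_w = ss_b / df_b, ss_w / df_w
    return {"SSb": ss_b, "SSw": ss_w, "dfb": df_b, "dfw": df_w,
            "MSb": ms_b, "MSw": ms_w, "F": ms_b / ms_w}

# Hypothetical scores for three groups.
groups = [[1, 2, 3], [2, 3, 4], [5, 6, 7]]
result = one_way_anova(groups)
print(result)
```

For these made-up scores, SSb is 26 and SSw is 6, so F = (26/2) / (6/6) = 13 on (2, 6) degrees of freedom; turning F into a p-value requires the F distribution, which is left to statistical software.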
62. STEPS IN ONE-WAY ANOVA
STEP 1: STATE HYPOTHESES
To determine if different levels of factor affect measured
observations differently, the following hypotheses are tested.
• Null hypothesis (H0): There is no significant difference among groups in variable X
• Alternative hypothesis (H1): There is a significant difference between at least two of the
groups in the variable X. In other words, at least one mean will
significantly differ.
66. STEP 5: MAKE DECISION
• Fail to reject the null hypothesis and
conclude that there is no significant
difference among the groups, F(dfb, dfw) =
insert F statistic, p > insert α
• Reject the null hypothesis and
conclude that there is a significant
difference among the groups, F(dfb, dfw)
= insert F statistic, p < insert α
68. ONE-WAY ANOVA
• One-Way ANOVA has one independent
variable (1 factor) with > 2 conditions
– conditions = levels = treatments
– e.g., for a brand of cola factor, the
levels are:
Coke, Pepsi, RC Cola
• Independent variables = factors
69. TWO-WAY ANOVA
• Two-Way ANOVA has 2 independent variables
(factors)
– each can have multiple conditions
Example
• Two Independent Variables (IV’s)
– IV1: Brand; and IV2: Calories
– Three levels of Brand:
• Coke, Pepsi, RC Cola
- Two levels of Calories:
• Regular, Diet
70. WHEN TO USE
• One-way ANOVA: you have more than two levels
(conditions) of a single IV
– EXAMPLE: studying effectiveness of three types of
pain reliever
aspirin vs. tylenol vs. ibuprofen
• Two-way ANOVA: you have more than one IV
(factor)
– EXAMPLE: studying pain relief based on pain
reliever and type of pain
• Factor A: Pain reliever (aspirin vs. tylenol)
• Factor B: type of pain (headache vs. back pain)
71. NOTATION
Factor A
Factor B
a : the number of categories of Factor A,
b : the number of categories of Factor B.
The total number of groups is ab.
The total number of observations is N.
The response/dependent variable value for each observation is $Y_{ijk}$,
where i is the subject’s category for Factor A, and j is the subject’s
category for Factor B. Then i and j together identify a group, and k
denotes which individual we’re talking about within this
particular group.
The number of observations in each group is n, and N = abn.
72. How the number of hours of TV people
watch per week depends on two variables:
gender and age. Each person is classified
according to gender (male, female) and age
(18–24, 25–54,55+).
There are six groups—one for each
combination of gender and age. We
randomly sample five people from each
group, and each person reports the time, in
hours, that he or she watches TV per week.
The data is shown in
75. There are three main questions that we might ask in two-way ANOVA:
• Does the response variable depend on Factor A?
• Does the response variable depend on Factor B?
• Does the response variable depend on Factor A differently for different
values of Factor B, and vice versa?
In the TV example, the first two questions ask whether TV viewing time
depends on age and gender.
The third question asks whether TV viewing time depends on gender
differently for people of different ages, or whether TV viewing time depends
on age differently for men than for women. (For example, perhaps it’s
true that women 55+ watch more TV than men 55+, but women 18–24
watch less TV than men 18–24.)
76. 1.Sums of Squares
Two-way ANOVA involves five different
sums of squares:
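• The total sum of squares, SS Tot , measures the total
variability in the response variable values. Its formula is
$SS_{Tot} = \sum_{i=1}^{a} \sum_{j=1}^{b} \sum_{k=1}^{n} (Y_{ijk} - \bar{Y})^2$
• The Factor A sum of squares, SS A, measures the
variability that can be explained by differences in Factor
A. Its formula is
$SS_A = bn \sum_{i=1}^{a} (\bar{Y}_{i\cdot} - \bar{Y})^2$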
77. $\bar{Y}_{ij}$ represents the sample mean of the group in category i of
Factor A and category j of Factor B (always an average of n
observations).
$\bar{Y}_{i\cdot}$ represents the sample mean of all the data in category i of
Factor A combined (always an average of bn observations).
$\bar{Y}_{\cdot j}$ represents the sample mean of all the data in category j of
Factor B combined (always an average of an observations).
$\bar{Y}$ represents the overall sample mean of all the data from all
groups combined (always an average of all abn = N
observations).
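78. • The Factor B sum of squares, SS B , measures the
variability that can be explained by differences in Factor
B. Its formula is
$SS_B = an \sum_{j=1}^{b} (\bar{Y}_{\cdot j} - \bar{Y})^2$
• The interaction sum of squares, SS AB , measures the
variability that can be explained by interaction between the
effects of Factors A and B. (We’ll talk more about what this
means later.) Its formula is
$SS_{AB} = n \sum_{i=1}^{a} \sum_{j=1}^{b} (\bar{Y}_{ij} - \bar{Y}_{i\cdot} - \bar{Y}_{\cdot j} + \bar{Y})^2$
• The error sum of squares, SS E , measures the variability
of the observations around their group sample means. Its
formula is
$SS_E = \sum_{i=1}^{a} \sum_{j=1}^{b} \sum_{k=1}^{n} (Y_{ijk} - \bar{Y}_{ij})^2$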
79. • If we call the sample standard deviation within
each group $s_{ij}$, then another formula for SS E is
$SS_E = (n - 1) \sum_{i=1}^{a} \sum_{j=1}^{b} s_{ij}^2$
84. TWO-WAY ANOVA HYPOTHESIS TESTS
Main effects:
• Does the response variable depend on Factor A?
• Does the response variable depend on Factor B?
Interaction:
• Does the response variable depend on Factor A differently for
different values of Factor B, and vice versa?
85. Interaction :
We say that there is interaction if Y
depends on Factor A differently for
different values of Factor B, and vice
versa.
Similarly, we say that there is NO
interaction if Y depends on Factor A
in the same way for all values of Factor
B, and vice versa.
86. HYPOTHESES
In the test for interaction, the null hypothesis
(Ho) is that there is no interaction, while the
alternative hypothesis (Ha) is that there is
interaction.
87. There is no interaction on the left. For each age group, women
average watching five more hours of TV per week than men. For
each gender, the middle age group averages watching six
more hours of TV per week than the youngest age group, and
the oldest age group averages watching nine more hours of TV
per week than the middle age group.
• There is interaction on the right. For each age group, women
average watching more TV than men, but how much more varies
for the different age groups. Also, for each gender, older people
average watching more TV, but how much more varies by gender.
88. ASSUMPTIONS
The assumptions for the two-way
ANOVA F test for interaction are
exactly the same as those of the one-way
ANOVA F test, with one additional
requirement: the number of observations
should be the same for all groups.
92. • If we believe there is interaction, then we don’t bother
to ask whether the response depends on Factor A or
Factor B separately—the fact that there is interaction
means that the response depends on Factor A
differently for different values of Factor B, and vice
versa. So we stop here and do not perform the tests
for main effects (which we’ll talk about in the next
subsection).
• If we believe it’s reasonable that there is no interaction,
then that means we can look at the effects of Factor A
and Factor B separately, so we proceed to the tests
for main effects.