2. Introduction
Statistical methodology
Step of scientific research
Important parametric tests
Important nonparametric tests
Example using Excel program
Using Excel for Statistics in Gateway
Cases – Office 2007
Elementary statistics
3. Most people become familiar with probability and
statistics through radio, television, newspapers, and
magazines. For example, the following statements
were found in newspapers:
Eating 10 grams (g) of fiber a day reduces the risk
of heart attack by 14%.
Thirty minutes of exercise two or three times
each week can raise HDLs 10 to 15%.
4. Statistics is used to analyze the results of
surveys and as a tool in scientific research to
make decisions based on controlled
experiments.
Other uses of statistics include operations
research, quality control, estimation and
prediction.
6. Statistical methods, as the basis of data analysis, are concerned with two
basic types of problems:
(1) summarizing, describing, and exploring the data.
This problem is covered by descriptive statistics.
(2) using sampled data to infer the nature of the
process which produced the data.
This problem is covered by inferential statistics.
7. Statistics plays an important role in the
description of mass phenomena.
Data are organized and summarized for clear
presentation and ease of communication.
Data may come from studies of populations
or samples.
Descriptive statistics offers methods to summarize a collection
of data. These methods may be numerical or
graphical, both of which have their own
advantages and disadvantages.
8. Inferential statistics is used to draw
conclusions about a data set.
Usually this means drawing inferences about
a population from a sample either by
estimating some relationships or by testing
some hypothesis.
A population is the set of all possible states of a random
variable. The size of the population may be either infinite or
finite.
A sample is a subset of the population; its size is always finite.
9. Descriptive Statistics
  Graphical
    Arrange data in tables
    Bar graphs and pie charts
  Numerical
    Percentages
    Averages
    Range
  Relationships
    Correlation coefficient
    Regression analysis
Inferential Statistics
  Confidence interval
  Compare means of two samples
    t Test
    F-Test
  Compare means from three samples
    Pre/post (LSD, DMRT)
    ANOVA = analysis of variance
    F-Test
10. Another important aspect of data analysis is the data,
which can be of two different types:
qualitative data, e.g. sex, color, smell, taste.
quantitative data, e.g. height, weight, percentage.
Qualitative data does not contain quantitative
information.
Qualitative data can be classified into categories.
11. Type of Scale | Possible Statements | Allowed Operators | Examples
nominal scale | identity, countable | =, ≠ | colors, phone numbers, feelings
ordinal scale | identity, less than/greater than relations, countable | =, ≠, <, > | soccer league table, military ranks, energy efficiency classes
interval scale | identity, less than/greater than relations, equality of differences | =, ≠, <, >, +, - | dates (years), temperature in Celsius, IQ scale
ratio scale | identity, less than/greater than relations, equality of differences, equality of ratios, zero point | =, ≠, <, >, +, -, ×, ÷ | velocities, lengths, temperature in Kelvin, age
12. Collecting the necessary facts
Analyzing the facts (Descriptive Statistics, Inferential Statistics)
Assessing the results
Making decisions
Carrying out decisions
13. Mode = the most frequent value
Median = the value of the middle point of the ordered
measurements
Mean = the average (balancing point in the distribution)
Variance = the average of the squared deviations of all
the population measurements from the
population mean
Standard deviation = the square root of the variance
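The five measures above can be sketched with Python's standard `statistics` module. The values below are hypothetical and are treated as a complete population, so the population forms `pvariance` and `pstdev` are used to match the definition of variance given on this slide.

```python
import statistics

# Hypothetical measurements, treated here as a complete population.
values = [2, 3, 3, 5, 7, 10]

mode = statistics.mode(values)            # most frequent value
median = statistics.median(values)        # middle point of the ordered values
mean = statistics.mean(values)            # balancing point of the distribution
variance = statistics.pvariance(values)   # average squared deviation from the mean
std_dev = statistics.pstdev(values)       # square root of the variance

print(mode, median, mean, round(variance, 3), round(std_dev, 3))
```

For a sample rather than a whole population, `statistics.variance` and `statistics.stdev` divide by n - 1 instead of n.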
16. Hypothesis = an assumption or supposition
to be proved or disproved, for example
“automobile A is performing as well as automobile B.”
17. Null hypothesis (H0) = expresses no difference
Often said “H naught”
H0: μ = 0 (or any number)
Later: H0: μ1 = μ2
Alternative hypothesis (H1, also written HA)
H0: μ = 0; Null Hypothesis
HA: μ ≠ 0; Alternative Hypothesis
18. Type I error (α): reject H0 | H0 true
Type II error (β): accept H0 | H1 true
19. Calculated F value is greater than the critical F value:
significant, reject H0
Calculated F value is lower than the critical F value:
not significant, accept H0 (fail to reject H0)
20. Decision (from data) | Truth: H0 correct | Truth: HA correct
Decide H0, “fail to reject H0” | 1 - α, True Negative | β, False Negative
Decide HA, “reject H0” | α, False Positive | 1 - β, True Positive
α = significance level
1 - β = power
22. Z-test
is based on the normal probability
distribution and is used for judging the
significance of several statistical
measures, particularly the mean.
The z-test is generally used for comparing the mean of a sample to
some hypothesized mean for the population in case of a large sample (n > 30).
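A minimal sketch of a one-sample z-test, assuming the standard deviation is known or well estimated from a large sample. The function name `z_test` and all numbers are hypothetical and are not taken from the slides.

```python
import math
from statistics import NormalDist

def z_test(sample_mean, mu0, sigma, n):
    """Return the z statistic and two-sided p-value for H0: mu = mu0."""
    z = (sample_mean - mu0) / (sigma / math.sqrt(n))
    p_two_sided = 2 * (1 - NormalDist().cdf(abs(z)))
    return z, p_two_sided

# Hypothetical numbers: 36 observations with mean 52, tested against
# mu0 = 50 with a standard deviation of 6.
z, p = z_test(sample_mean=52.0, mu0=50.0, sigma=6.0, n=36)
print(f"z = {z:.2f}, p = {p:.4f}")
```

With these numbers z is 2.0 and the two-sided p-value is a little under 0.05, so H0 would be rejected at the 5% level.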
23. T – test
is based on t-distribution and is considered an appropriate
test for judging the significance of a sample mean or for
judging the significance of difference between the means
of two samples in case of small sample(s) when population
variance is not known (in which case we use variance of
the sample as an estimate of the population variance).
t-test applies only in case of small sample(s)
when population variance is unknown.
Unknown variance: under H0,
$$ t = \frac{\bar{X} - \mu_0}{s/\sqrt{n}} \sim t(n-1) $$
Critical values: statistics books or computer
The t-distribution is approximately normal for degrees of freedom (df) > 30
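The t statistic for a single small sample can be computed directly from its definition. This is a sketch on hypothetical data; the helper name `one_sample_t` is an assumption, and the critical value is hard-coded from a t table rather than computed.

```python
import math
import statistics

def one_sample_t(sample, mu0):
    """t statistic and degrees of freedom for H0: mu = mu0, variance unknown."""
    n = len(sample)
    s = statistics.stdev(sample)               # sample standard deviation
    t = (statistics.mean(sample) - mu0) / (s / math.sqrt(n))
    return t, n - 1

# Hypothetical sample, not taken from the slides.
sample = [9, 10, 11, 12, 13]
t, df = one_sample_t(sample, mu0=10)

# Two-tailed 5% critical value for df = 4, taken from a t table.
t_crit = 2.776
print(f"t = {t:.3f}, df = {df}, reject H0: {abs(t) > t_crit}")
```

Here |t| is below the critical value, so the sample gives insufficient evidence against H0.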
24. F-test
is based on the F-distribution and is used to compare the variances of
two independent samples. This test is also used in the context
of analysis of variance (ANOVA) for judging the significance of
more than two sample means at one and the same time.
Test statistic, F, is calculated and compared with its probable value
(to be seen in the F-ratio tables for different degrees of freedom for
greater and smaller variances at specified level of significance) for
accepting or rejecting the null hypothesis.
25. ANOVA tables:
for a one-way ANOVA with N observations and T treatments.
Source | df | SS | MS | F
treatment | (T-1) | SStrt | MStrt = SStrt/(T-1) | MStrt/MSerr
error | by subtraction | SSerr | MSerr = SSerr/dferr |
Total | (N-1) | | |
Finally, you (or the PC) consult tables or otherwise obtain the probability of
obtaining this F value given the df for treatment and error.
26. 1: Calculate N, Σx, Σx² for the whole dataset.
2: Find the correction factor
CF = (Σx * Σx) / N
3: Find the total sum of squares for the data
SStotal = Σ(xi²) – CF
4: Add up the totals for each treatment in turn (Xt.), then
calculate the treatment sum of squares
SStrt = Σt(Xt.*Xt.)/r – CF
where Xt. = sum of all values within treatment t, and r is
the number of observations that went into that total.
5: Draw up the ANOVA table, getting error terms by subtraction.
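The steps above can be sketched in plain Python. The data are the orange weights for suppliers A, B, and C from the case study later in this deck; the variable names are illustrative only.

```python
# One-way ANOVA via the correction-factor method.
groups = {
    "A": [150, 151, 152, 153, 154],
    "B": [148, 150, 152, 154, 156],
    "C": [146, 148, 150, 152, 154],
}

all_values = [x for values in groups.values() for x in values]
n_total = len(all_values)          # N
n_treatments = len(groups)         # T

# Steps 1-2: grand total and correction factor.
cf = sum(all_values) ** 2 / n_total

# Step 3: total sum of squares.
ss_total = sum(x * x for x in all_values) - cf

# Step 4: treatment sum of squares (each total squared, divided by its r).
ss_trt = sum(sum(v) ** 2 / len(v) for v in groups.values()) - cf

# Step 5: error terms by subtraction, then mean squares and F.
ss_err = ss_total - ss_trt
df_trt = n_treatments - 1
df_err = (n_total - 1) - df_trt
ms_trt = ss_trt / df_trt
ms_err = ss_err / df_err
f_value = ms_trt / ms_err

print(f"SStrt={ss_trt:.3f} SSerr={ss_err:.3f} F={f_value:.3f}")
```

This gives F of about 0.889 on 2 and 12 degrees of freedom, in agreement with the Excel result reported for the orange example.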
27. Completely Randomized Design (CRD)
Randomized Complete Block Design (RBD)
Latin Square (LQ)
LSD: Least Significant Difference
DMRT: Duncan’s New Multiple Range Test
Treatments
Replication
Degree of freedom (df)
28. Most people have difficulties in determining
whether a model is linear or non-linear.
Before discussing the issues of linear vs. nonlinear systems, let's have a short look at
some examples, displaying several types of
discrimination lines between two classes:
[Figure: three example discrimination lines between two classes, labeled nonlinear and linear]
29. Here's the answer: linear models are linear
in the parameters which have to be
estimated, but not necessarily in the
independent variables.
This explains why the middle of the three
figures above shows a linear discrimination
line between the two classes, although the
line is not linear in the sense of a straight
line.
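A small sketch of this idea: the model y = b0 + b1·x² is curved in x but linear in the parameters b0 and b1, so ordinary least squares still applies once x² is treated as the regressor. The data and the helper name `fit_line` are hypothetical.

```python
def fit_line(u, y):
    """Least-squares intercept and slope for y = b0 + b1 * u."""
    n = len(u)
    mean_u = sum(u) / n
    mean_y = sum(y) / n
    s_uu = sum((ui - mean_u) ** 2 for ui in u)
    s_uy = sum((ui - mean_u) * (yi - mean_y) for ui, yi in zip(u, y))
    b1 = s_uy / s_uu
    b0 = mean_y - b1 * mean_u
    return b0, b1

# Hypothetical data generated exactly from y = 1 + 2 * x**2.
x = [0.0, 1.0, 2.0, 3.0, 4.0]
y = [1.0 + 2.0 * xi ** 2 for xi in x]

# Transform the independent variable; the parameters still enter linearly.
u = [xi ** 2 for xi in x]
b0, b1 = fit_line(u, y)
print(b0, b1)
```

The fit recovers the intercept 1 and the coefficient 2 even though the fitted curve is a parabola in x.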
30. When calculating a regression model, we are
interested in a measure of the usefulness of
the model.
There are several ways to do this, one of
them being the coefficient of determination
(also sometimes called goodness of fit).
The concept behind this coefficient is to
calculate the reduction of the error of
prediction when the information provided by
the x values is included in the calculation.
31. Thus the coefficient of determination specifies
the amount of sample variation in y explained
by x.
For simple linear regression the coefficient of
determination is simply the square of the
correlation coefficient between Y and X.
[Figure: scale of the correlation coefficient. -1: strong negative linear relationship; 0: no linear relationship; +1: strong positive linear relationship]
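32. The correlation coefficient, also
called Pearson's product moment
correlation after Karl Pearson, is calculated
by
$$ r = \frac{\sum_i (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_i (x_i - \bar{x})^2 \, \sum_i (y_i - \bar{y})^2}} $$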
The correlation coefficient may take any value between -1.0 and +1.0.
Assumptions:
linear relationship between x and y
continuous random variables
both variables must be normally distributed
x and y must be independent of each other
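As an illustration, Pearson's correlation coefficient can be computed from sums of squared deviations. The data are the storage times and off-flavor scores from the regression case study later in this deck; the function name `pearson_r` is an assumption.

```python
import math

def pearson_r(x, y):
    """Pearson product moment correlation of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    s_xy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    s_xx = sum((xi - mean_x) ** 2 for xi in x)
    s_yy = sum((yi - mean_y) ** 2 for yi in y)
    return s_xy / math.sqrt(s_xx * s_yy)

months = [0, 1, 2, 3, 4, 6]
scores = [1.5, 2, 2, 3, 2.5, 3.5]

r = pearson_r(months, scores)
print(f"r = {r:.3f}, r squared = {r * r:.3f}")
```

For these data r is about 0.92, and its square, about 0.85, is the coefficient of determination of the simple linear regression.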
34. Chi-square test
is based on the chi-square distribution and, as a parametric test,
is used for comparing a sample variance to a theoretical
population variance:
$$ \chi^2 = \frac{\sigma_s^2}{\sigma_p^2}\,(n-1) $$
where
$\sigma_s^2$ = variance of the sample;
$\sigma_p^2$ = variance of the population;
(n – 1) = degrees of freedom,
n being the number of items in the sample.
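A brief sketch of the chi-square statistic for a variance. The sample is the ten pineapple can weights from Example I in this deck; the population variance of 0.5 is a hypothetical value chosen only for illustration, and the critical value is hard-coded from a chi-square table.

```python
import statistics

# Drained weights (g) of the ten pineapple cans from Example I.
weights = [410.5, 411.4, 410.4, 412.6, 411.9,
           411.5, 412.5, 411.4, 411.5, 410.1]

n = len(weights)
sample_var = statistics.variance(weights)   # variance of the sample, divisor n - 1

# Hypothetical theoretical population variance (not from the slides).
population_var = 0.5

chi_square = sample_var / population_var * (n - 1)
df = n - 1

# Upper 5% critical value of chi-square with 9 df, from a table.
chi_crit = 16.919
print(f"chi-square = {chi_square:.3f}, df = {df}, "
      f"exceeds critical value: {chi_square > chi_crit}")
```

Under this assumed population variance the statistic is about 12.83, below the tabulated critical value.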
36. In quality control, there are situations when
we need to know whether a sample mean lies
within the confidence limits of the entire
population. This can be accomplished by
using the t-distribution to determine confidence
limits for a population mean using a selected
probability.
We will use the Excel function TINV( ) to obtain the critical value of the t-distribution.
Example I
37. Ten cans of sliced pineapple were removed at
random from a population of 1000 cans. The
drained weights of the contents were
measured as 410.5, 411.4, 410.4, 412.6,
411.9, 411.5, 412.5, 411.4, 411.5, 410.1 g.
Determine the 95% confidence limits for the
mean of the entire population.
38. We will first calculate the average of the ten
data values using the AVERAGE() function.
Next we will determine the standard
deviation of the sample using the STDEV()
function. Then we will use the following
expression to estimate the lower and upper
limits of the population mean:
$$ \bar{x} \pm t_{\alpha/2,\,n-1}\,\frac{s}{\sqrt{n}} $$
39. Discussion:
The results show that the 95% confidence lower
and upper limits for the population mean are
410.78 and 411.98, respectively.
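The same limits can be reproduced outside Excel. This sketch assumes the usual mean ± t·s/√n form of the interval and hard-codes the two-tailed critical t value for 9 degrees of freedom from a t table (the value that `TINV(0.05, 9)` is expected to return).

```python
import math
import statistics

# Drained weights (g) of the ten pineapple cans from the example.
weights = [410.5, 411.4, 410.4, 412.6, 411.9,
           411.5, 412.5, 411.4, 411.5, 410.1]

n = len(weights)
mean = statistics.mean(weights)    # Excel: AVERAGE()
s = statistics.stdev(weights)      # Excel: STDEV(), sample standard deviation

# Two-tailed critical t for 95% confidence and n - 1 = 9 degrees of
# freedom, taken from a t table; hard-coded as an assumption.
t_crit = 2.262

margin = t_crit * s / math.sqrt(n)
lower, upper = mean - margin, mean + margin
print(f"{lower:.2f} to {upper:.2f}")
```

The printed limits agree with the 410.78 and 411.98 reported above.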
40. When a sample is taken from a large
population and analyzed for selected
data, statistical analysis is helpful in
obtaining estimates for the total population
from which the sample was obtained. In this
worksheet, we will use Excel's built-in data analysis techniques to determine
various statistical descriptors for the sample and the population.
Example II
41. Case study: Color Data
A sample of 10 breads is obtained from a
conveyor belt exiting a baking oven. The
breads are analyzed for color by comparing
them with a standard color chart. The values
recorded, in customized color units, are as
follows:
34, 33, 36, 37, 31, 32, 38, 33, 34, and 35.
Estimate the mean, variance,
and standard deviation of the population.
42. We will use the Data Analysis capability of
Excel in determining the descriptive
statistics for the given data. First, you should
make sure that Data Analysis... is available
under the menu command Data. If it is not
available, then see the next slide for details on
how to add this analysis package.
43. Click the Microsoft Office Button, and then
click Excel Options.
Click Add-ins. In the Manage box, select Excel
Add-ins.
Click Go.
In the Add-Ins Available box, select the Analysis
ToolPak check box and click OK. (If ToolPak
is not listed, click Browse to locate it.)
44. Step 1 Open a new worksheet expanded to full size.
Step 2 In cells A2:A11, type the text labels and data values.
45. Step 3 Choose the menu items Data, Data Analysis ....
A dialog box will open as shown.
Step 4 Double click on Descriptive Statistics.
46. Step 5 In the edit box for Input Range:, type the range of
cells as $A$2:$A$11.
Step 6 Select the radio button Columns.
Step 7 In output range type A13. Click OK.
Step 8 Excel will calculate the descriptive statistics and
display results in cells A13:B28.
The results indicate that the
sample mean is 34.3.
The sample standard deviation, which estimates the standard deviation of
the population, is 2.214, and
the sample variance, which estimates the variance of the
population, is 4.9.
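As a cross-check on the Excel output, the same three descriptors can be computed with Python's `statistics` module. The sample forms (divisor n - 1) are used because the ten breads are a sample used to estimate the population values.

```python
import statistics

# Color scores of the ten breads from the case study.
color = [34, 33, 36, 37, 31, 32, 38, 33, 34, 35]

mean = statistics.mean(color)            # sample mean
variance = statistics.variance(color)    # sample variance, divisor n - 1
std_dev = statistics.stdev(color)        # sample standard deviation

print(f"mean={mean:.1f} variance={variance:.1f} std_dev={std_dev:.3f}")
```

The values agree with the worksheet results: 34.3, 4.9, and 2.214.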
47. t = (difference between samples) / (variability)
Excel will automatically calculate t-values to
compare:
Means of two datasets with equal variances
Means of two datasets with unequal variances
Two sets of paired data
abs(t-score) < abs(t-critical): accept H0
Insufficient evidence to prove that observed
differences reflect real, significant differences
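For the equal-variances case, the "difference over variability" ratio can be sketched as a pooled two-sample t statistic. The data and the function name `pooled_t` are hypothetical, and the critical value is taken from a t table rather than computed.

```python
import math
import statistics

def pooled_t(sample1, sample2):
    """t statistic and degrees of freedom, assuming equal variances."""
    n1, n2 = len(sample1), len(sample2)
    diff = statistics.mean(sample1) - statistics.mean(sample2)
    # Pooled variance: weighted average of the two sample variances.
    sp2 = ((n1 - 1) * statistics.variance(sample1)
           + (n2 - 1) * statistics.variance(sample2)) / (n1 + n2 - 2)
    se = math.sqrt(sp2 * (1 / n1 + 1 / n2))
    return diff / se, n1 + n2 - 2

# Hypothetical measurements, not taken from the slides.
after = [5, 6, 7, 8, 9]
before = [1, 2, 3, 4, 5]

t, df = pooled_t(after, before)
t_crit_two_tail = 2.306   # table value for df = 8 at the 5% level
print(f"t = {t:.2f}, df = {df}, reject H0: {abs(t) > t_crit_two_tail}")
```

Here abs(t) exceeds the critical value, so H0 of equal means would be rejected for this made-up data.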
48. A researcher wishes to test whether the mean heavy
metal content of soil differs after a war
threat versus before the war threat. The
expectation is that the mean after the war threat
will exceed the mean before the war threat.
Use Excel to help test the hypothesis for the difference
in population means.
Example III
49. Step 1 Open a new worksheet expanded to full size.
Step 2 In cells B5:C19, type the text labels and data values.
The null and alternative
hypotheses to be
tested are:
H0: μ1 − μ2 = 0.0
HA: μ1 − μ2 ≠ 0.0
50. Step 3 Choose the menu items Data, Data Analysis ....
A dialog box will open as shown.
Step 4 Double click on t-Test: Two-Sample Assuming Equal Variances.
52. t > tcritical (one-tail), so the
mean of sample #1 is
significantly larger than
the mean of sample #2.
Hypothesized mean difference: change this if you want to know
whether the means of the two
samples differ by at least some
specified amount.
The p value for the one-tailed
test is .003, which is
less than .05, so we
reject the null
hypothesis.
t > tcritical (two-tail), so
the mean of sample #1
is significantly
different from the mean
of sample #2.
The p value for the two-tailed test is
.007, which is less than .05, so
we reject the null hypothesis.
53. In hypothesis testing, it is sometimes not
possible to use the same judges for testing
different treatments, although it would be
desirable to use the same judges to evaluate
samples obtained from different treatments.
In such cases, we have a completely
randomized design. Using single-factor ANOVA,
we can test whether the treatments had any influence on the
judges' scores; in other words, does the mean of each treatment differ?
Example IV
54. Case study: Weight of oranges data
Consider the weights of oranges from three
different suppliers A, B, and C. Five oranges
were randomly sampled from each supplier and weighed. The
following weights were obtained:
A | B | C
150 | 148 | 146
151 | 150 | 148
152 | 152 | 150
153 | 154 | 152
154 | 156 | 154
55. For each treatment (supplier), five oranges were weighed.
Therefore, the design was completely
randomized. Calculate the F value to determine
whether the means of the three treatments are
significantly different.
56. We will use the single-factor analysis of variance
available in Excel. We will determine the F
value at the 95% confidence level (α = 0.05).
These computations will allow us to determine
if the means of the three different
treatments are significantly different.
First make sure that the Data Analysis...
command is available under menu item Data.
57. Step 1 Open a new worksheet expanded to full size.
Step 2 In cells A4:C8, type the text labels and data values.
58. Step 3 Choose the menu items Data, Data Analysis ....
A dialog box will open as shown.
Step 4 Double click on Anova: Single Factor.
59. The results show that the F value is 0.889. The critical F
value at the 5% level is F = 3.885.
This indicates that for the example problem the F value is lower than
the critical value at the 5% level. Thus, we can
say that there is no significant difference among the supplier means (P > 0.05).
60. When we are interested in evaluating samples
for sensory characteristics using the same judges
with samples obtained from multiple
treatments, analysis of variance for a two-factor design without replication is useful.
This analysis helps in determining if there are
significant differences among the various
treatments as well as if any significant
differences exist among the judges themselves.
Example V
61. Three types of ice cream were evaluated by
11 judges. The judges assigned the following
scores.
Judge | Ice Cream A | Ice Cream B | Ice Cream C
A | 16 | 14 | 15
B | 17 | 15 | 17
C | 16 | 16 | 16
D | 18 | 14 | 16
E | 16 | 14 | 14
F | 17 | 16 | 17
G | 18 | 14 | 15
H | 16 | 15 | 16
I | 17 | 14 | 14
J | 18 | 13 | 16
K | 17 | 15 | 15
62. We will use the built-in analysis pack
available in the Excel command called Data
Analysis ....
Three sets of results will be obtained for the
5% level
63. Step 1 Open a new worksheet expanded to full size.
Step 2 In cells A3:D13, type the text labels and
data values.
64. Step 3 Choose the menu items Data, Data Analysis ....
A dialog box will open.
Step 4 Double click on Anova: Two-Factor Without
Replication. A new dialog box will open.
Step 5 Type entries in edit boxes as shown.
Step 6. The results will be displayed in cells
65. For judges, the calculated F value is
1.36. This value is lower than the critical
F value of 2.35 at the 5% level.
The difference among ice cream
types is determined by examining the F
values. The F value is calculated as
19.73. This value is greater than 3.49 for
the 5% level.
66. The difference among ice cream types is
determined by examining the F values. The F
value is calculated as 19.73. This value is
greater than 3.49 for the 5% level.
The ice cream types are significantly
different at p<0.001.
For judges, the calculated F value is 1.36.
This value is lower than the critical F values
of 2.35 at the 5 % level.
The judges showed no significant difference
in their mean scores.
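The two F values can be checked with a short script that follows the same correction-factor arithmetic as a one-way ANOVA, extended to rows (judges) and columns (ice cream types). This is a sketch of the standard two-factor calculation without replication; the variable names are illustrative.

```python
# Scores from the ice cream example; rows are judges A..K,
# columns are ice creams A, B, C.
scores = [
    [16, 14, 15], [17, 15, 17], [16, 16, 16], [18, 14, 16],
    [16, 14, 14], [17, 16, 17], [18, 14, 15], [16, 15, 16],
    [17, 14, 14], [18, 13, 16], [17, 15, 15],
]

n_rows = len(scores)        # judges
n_cols = len(scores[0])     # ice cream types
grand_total = sum(sum(row) for row in scores)
cf = grand_total ** 2 / (n_rows * n_cols)      # correction factor

ss_total = sum(x * x for row in scores for x in row) - cf
ss_rows = sum(sum(row) ** 2 for row in scores) / n_cols - cf
col_totals = [sum(row[j] for row in scores) for j in range(n_cols)]
ss_cols = sum(t ** 2 for t in col_totals) / n_rows - cf
ss_err = ss_total - ss_rows - ss_cols          # error by subtraction

df_rows, df_cols = n_rows - 1, n_cols - 1
df_err = df_rows * df_cols
ms_err = ss_err / df_err

f_judges = (ss_rows / df_rows) / ms_err
f_ice_cream = (ss_cols / df_cols) / ms_err
print(f"F(judges)={f_judges:.2f} F(ice cream)={f_ice_cream:.2f}")
```

The script reproduces F = 1.36 for judges (10 and 20 df) and F = 19.73 for ice cream types (2 and 20 df).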
67. Simple regression analysis involves determining
the statistical relationship between two
variables. One of the uses of such analysis is in
predicting one variable on the basis of the
other.
We will use the regression analysis available in
the Add-in package in Excel to determine linear regression
between two variables.
Example VI
68. Case study: Sensory scores data
The case concerns off-flavor as a function of storage time in a frozen
vegetable. Sensory scores obtained at
0, 1, 2, 3, 4 and 6 month times were
1.5, 2, 2, 3,
2.5, and 3.5, respectively. Assuming that
these data can be linearly
correlated, determine the regression
coefficient and
predict the off-flavor score at 5 months of
storage.
69. We will use the package Regression available
as an Add-in item in Excel. We will use this
package to obtain required statistical
relationships. We assume that a linear
relationship exists between the off-flavor
score and time (in months) with the equation
y= mx+b,
where
y is off-flavor score, x is time in months, m is slope and
b is intercept.
70. Step 1 Open a new worksheet expanded to full size.
Step 2 In cells A4:B9, enter the text labels and data values.
71. Step 3 Choose the menu items Data, Data Analysis .... A dialog box will
open.
Step 4 Double click on Regression.
Step 5 A new dialog box will open. Enter the range of cells for Y and X as
shown. Check boxes for Residuals and Line Fit Plots. Click OK.
72. The results will be displayed.
R Square: about 85% of the variation in y is explained by
variation in x. The remainder may be
random error, or may be explained by
some factor other than x.
Fitted line: y = 0.31 x + 1.58
F: ratio of variability explained by model to leftover
variability. High number means model explains most
variation in data.
Significance F: probability of getting this value of
F by randomly sampling from a normally distributed
population. Low value means model (rather than random
variability) explains most variation in data.
P-value: probability of getting a slope or intercept this
much different from zero by randomly sampling
from a normally-distributed population.
Lower 95% and Upper 95%: confidence limits on
slope and intercept.
73. The r² value is calculated as 0.85, and the
standard error is 0.318. The intercept is 1.5786
and the slope is 0.3143.
The linear equation is y = 0.31x + 1.58. The
residual output gives the predicted values for
the off-flavor score at different time intervals.
These data are also shown in the chart.
The predicted and calculated values are shown.
The predicted value at 5 months of storage
duration is calculated as 3.13 from the rounded equation (3.15 with the unrounded slope and intercept).
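The regression results can be reproduced with the closed-form least-squares formulas for y = mx + b. This is a sketch in plain Python rather than the Excel Regression tool. With unrounded coefficients the prediction at 5 months comes out near 3.15, slightly above the 3.13 obtained from the rounded equation y = 0.31x + 1.58.

```python
import math

# Storage time (months) and off-flavor scores from the case study.
x = [0, 1, 2, 3, 4, 6]
y = [1.5, 2, 2, 3, 2.5, 3.5]

n = len(x)
mean_x, mean_y = sum(x) / n, sum(y) / n
s_xx = sum((xi - mean_x) ** 2 for xi in x)
s_yy = sum((yi - mean_y) ** 2 for yi in y)
s_xy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))

slope = s_xy / s_xx                     # m in y = m*x + b
intercept = mean_y - slope * mean_x     # b
r_squared = s_xy ** 2 / (s_xx * s_yy)

# Standard error of the regression: sqrt(residual SS / (n - 2)).
ss_resid = sum((yi - (intercept + slope * xi)) ** 2 for xi, yi in zip(x, y))
std_error = math.sqrt(ss_resid / (n - 2))

predicted_5 = intercept + slope * 5
print(f"m={slope:.4f} b={intercept:.4f} r2={r_squared:.2f} "
      f"se={std_error:.3f} y(5)={predicted_5:.2f}")
```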
77. Click the Microsoft Office Button, and then
click Excel Options.
Click Add-ins. In the Manage box, select Excel
Add-ins.
Click Go.
In the Add-Ins Available box, select the Analysis
ToolPak check box and click OK. (If ToolPak
is not listed, click Browse to locate it.)
78. Click Data/Data Analysis (Far Right)/Descriptive
Statistics & OK.
Put Checkmarks on Summary Statistics, 95% or
99% Confidence Interval, & Labels in First Row
Boxes.
Move Cursor to Input Range Window, Highlight
Data to Analyze including Labels, & Click OK.
Your Data will Appear on New Worksheet.
Widen Columns by Clicking Home/Format/AutoFit
Column Width.
79. Click Data/Data Analysis/Histogram & OK.
Put Checkmarks on Chart Output & New Worksheet
Boxes.
Move Cursor to Input Range Window, Highlight Data
Going into Histogram.
Move Cursor to Input Bin Range, Highlight Data
Showing Upper Value of Each Bin & Click OK.
Histogram will be on New Worksheet. You May
Lengthen it by Clicking Blank Space in Window, Moving
Cursor to Window Bottom Line & Holding Down Mouse
Button as You Pull Down Window.
80. Go to Sheet One.
Click Data/Data Analysis/ and the Appropriate
Statistical Test. Then Click OK.
On New Window Check Labels Box and Put
Cursor on Variable 1 Range.
Highlight Variable 1 Data Including Label.
Put Cursor on Variable 2 Range & Highlight
Variable 2 Data (Including Label). Then Click OK.
Click Home/Format/AutoFit/Column Width
81. Go to Sheet One.
Highlight Data (Be Sure X Values are in
Left Column and Y Values are in Right
Column).
Click Insert/Scatter. Pull down menu and
click Upper Left Icon.
Click a Datum Point on Chart with Right
Mouse Key, Add Trendline, & Click Linear.
82. Go to Sheet One.
Click Data/Data Analysis (On Far Right)
/Regression & Click OK.
On New Window Check Labels Box and Put
Cursor on X Range.
Highlight X Data Including Label.
Put Cursor on Y Range & Highlight Y Data
(Including Label), Then Click OK.
Click Home/Format/AutoFit Column Width.