2. Objective
24 January 2022
At the end you should be able to:
Estimate parameters
Conduct hypothesis testing
Test the associations between variables
3. Inferential Statistics
It is the process of generalizing or drawing conclusions about the target population based on the information obtained from the sample.
6. Sampling Distributions
The probability distribution of a sample statistic that is formed when repeated samples are taken from the whole population.
If we take many, many samples and get the statistic for each of those samples, the distribution of all those statistics is the sampling distribution.
The frequency distribution of the statistic over all these samples forms the sampling distribution of that sample statistic.
7. Sampling Distributions
In practice, repeated samples are not taken from the population.
We do not encounter sampling distributions empirically, but it is necessary to know their properties in order to draw statistical inferences.
Three things determine a sampling distribution:
Its mean
Its variance
Its shape
8. Properties of Sampling Distributions
The mean of the sample means will be the same as the population mean.
The standard deviation of the sample means will be equal to the population standard deviation divided by the square root of the sample size.
The standard deviation of the sample means will be smaller than the standard deviation of the population (for n > 1).
The standard deviation of the sampling distribution of a sample statistic is called the standard error.
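The properties above can be checked with a small simulation. The sketch below is only illustrative: the population (normal, mean 50, SD 10), the sample size of 25, and the 5000 repetitions are assumed values that do not come from the slides.

```python
import random
import statistics

def simulate_sample_means(mu, sigma, n, repeats, seed=0):
    """Draw `repeats` samples of size n from Normal(mu, sigma) and return their means."""
    rng = random.Random(seed)
    return [
        statistics.fmean(rng.gauss(mu, sigma) for _ in range(n))
        for _ in range(repeats)
    ]

# Hypothetical population and design (assumptions for illustration only).
mu, sigma, n = 50.0, 10.0, 25
means = simulate_sample_means(mu, sigma, n, repeats=5000)

mean_of_means = statistics.fmean(means)   # expected to be close to mu
sd_of_means = statistics.stdev(means)     # expected to be close to sigma / sqrt(n)
theoretical_se = sigma / n ** 0.5         # standard error = 10 / 5 = 2.0

print(round(mean_of_means, 2), round(sd_of_means, 2), theoretical_se)
```

The simulated SD of the sample means should land near the theoretical standard error and well below the population SD of 10.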
9. Standard deviation vs Standard error
Standard deviation
Is a measure of variability between individual observations
Descriptive index relevant to the mean
Standard error
The variability of summary statistics, e.g. the variability of the sample mean or a sample proportion
It is a measure of uncertainty in a sample statistic, i.e. the precision of the estimate given by the estimator
10. Sampling Distributions
Based on the nature of the summary statistic:
Sampling distribution of the mean
Sampling distribution of the proportion
11. Properties of sampling distribution of the means
The mean of the sampling distribution of the means is the same as the population mean ($\mu$).
The SD of the sampling distribution of the means is $\sigma/\sqrt{n}$.
The shape of the sampling distribution of means is approximately a normal curve, regardless of the population distribution, when n is large enough (Central limit theorem).
12. Properties of sampling distribution of the proportions
The sample proportion p is an estimate of the population proportion π.
The SD of the sampling distribution of the proportion is $\sqrt{\pi(1-\pi)/n}$, estimated by $\sqrt{p(1-p)/n}$.
The shape of the sampling distribution of the proportion is approximately a normal curve, regardless of the population distribution, when n is large enough (Central limit theorem).
13. Central Limit Theorem
States that, regardless of the shape of the parent population distribution, the sampling distribution of the sample mean (or sample proportion) will be normal or nearly normal if the sample size is large enough.
But the question is: how large is "large enough"?
As a rough rule of thumb:
A sample size of 30 is large enough for continuous data, and
np ≥ 5 and nq ≥ 5 for categorical data measured by a proportion (q = 1 − p)
14. Assumptions of statistical inference
To make valid inferences or conclusions the following assumptions must be satisfied:
Samples must be randomly selected
Sample size must be large enough
The population must be normally or approximately normally distributed if the sample size is less than 30.
That means the population variance should be known
What if n is not large enough and the population variance is unknown?
15. Student's t-distribution
We use Student's t distribution in statistical inference; it depends on the degrees of freedom.
The t-distribution is a theoretical probability distribution which is symmetrical, bell-shaped, and similar to the normal but more spread out.
The conditions to use the Student t distribution:
The sample is from a normally distributed population,
Population variance is unknown, and
The sample size is small, i.e. less than 30 (or np < 5 or nq < 5)
18. Student's t-distributions
The t distribution and the standard normal distribution are similar in that:
It is bell shaped.
It is symmetrical about the mean.
The mean, median, and mode are equal to 0 and located at the center.
The curve never touches the x axis.
The t distribution differs from the normal distribution:
The variance is greater than one.
The t distribution is based on degrees of freedom (DF), which is related to sample size.
As the sample size increases, the t distribution approaches the standard normal distribution (Z).
19. Parameter Estimations
We generally assume the underlying distribution of the variable of interest is adequately described by one or more unknown parameters.
But because it is usually not possible to make measurements on every individual in a population, parameters cannot usually be determined exactly.
Instead we estimate parameters by calculating the corresponding characteristics from a random sample (estimates).
20. Estimation
It is a procedure in which the information obtained from a sample is used to approximate the true population parameter.
The process of estimating population parameters by using sample statistics.
An estimator is any statistic that is used to estimate an unknown population parameter.
The value or values that the estimator assumes are called estimates.
21. Characteristics of good estimator
An estimator should be:
Unbiased: the expected value of the estimator must be equal to the parameter to be estimated.
Consistent: as the sample size increases, the value of the estimator should approach the value of the estimated parameter.
Efficient: the variance of the estimator should be the smallest.
Sufficient: the sample from which the estimator is calculated must contain the maximum possible information about the population.
23. Point Estimation
A single numerical value is used to estimate the corresponding population
parameter.
The corresponding point estimator for the parameters:
24. Point Estimation
However, there are pitfalls of point estimation.
Different samples end with different estimates of a single unknown population parameter.
A point estimate does not take sample-to-sample variability into account.
A point estimate does not give the precision of the estimate, and hence we need another method of estimation which handles these problems.
25. Interval Estimation
It is an interval computed from sample data that contains the true population parameter with a certain level of confidence.
CI = point estimate ± margin of error (reliability coefficient × standard error)
A CI consists of three parts:
The statistic,
A confidence level, and
Standard error
Interval estimators are commonly called confidence intervals.
26. Interval Estimation
Level of confidence
Is the probability of obtaining the population parameter within the error margin.
Level of confidence is denoted as (1−α)100%.
Confidence level can never be 100%!
Most commonly the 95% confidence intervals are calculated.
However, 90% and 99% confidence intervals are sometimes used.
27. Interval Estimation
A CI in general:
Considers variation in sample statistics from sample to sample
Is based on observations from one sample
Gives information about closeness to unknown population parameters
Is stated in terms of level of confidence
Interpretation of a confidence interval (e.g. a 95% CI)
If we take 100 repeated samples of size n and construct a confidence interval from each, we expect that about 95 of them will contain the true population parameter.
28. Interval Estimation
The general formula for all CIs is:
point estimate ± (reliability coefficient) × (standard error)
Point estimate: the value of the statistic in my sample (e.g., mean, proportion, mean difference, proportion difference, etc.)
Reliability coefficient (measure of how confident we want to be): from a Z table or a t table, depending on the sampling distribution of the statistic.
Standard error: the standard error of the statistic.
29. Margin of Error
It is the amount added to and subtracted from the point estimate in confidence interval estimation
It is a measure of precision
The margin of error is a product of
Reliability coefficient corresponding to the confidence level and
Standard error of the estimator.
30. Interval Estimation
The width of the confidence interval depends on:
Sample size
The larger the sample size, the narrower the confidence interval and the more precise our estimate, because as the sample size increases the standard error decreases.
That is, the sample statistic will approach the population parameter.
Standard deviation
The more the variation among the individual values, the wider the confidence interval and the less precise the estimate.
31. Interval Estimation
Confidence level
The larger the confidence level, the wider the confidence interval.
A 90% CI is narrower than a 95% CI since we are only 90% certain that the interval includes the population parameter.
A 99% CI is wider than a 95% CI; the extra width means that we can be more certain that the interval will contain the population parameter.
32. Interval Estimation
Confidence intervals can be estimated for
Single population
One population mean
One population proportion
Two populations
Two population (difference) in mean
Two population (difference) in proportion
34. CI for a Single Population Mean
When the following assumptions are fulfilled:
Population standard deviation (σ) is known
Population is normally distributed
If the population is not normal, use a large sample
A 100(1−α)% C.I. for μ is calculated by:
$\bar{x} \pm z_{\alpha/2}\,\dfrac{\sigma}{\sqrt{n}}$
α is to be chosen by the researcher; the most common values of α are 0.05, 0.01 and 0.1.
35. Confidence interval
The point estimate of μ is the sample mean $\bar{x}$
The standard error of $\bar{x}$ is $\sigma/\sqrt{n}$
Commonly used CLs are 90%, 95%, and 99%
36. Example:
1. Waiting times (in hours) at a particular hospital are believed to be approximately normally distributed with a variance of 2.25 hr².
a. A sample of 20 outpatients revealed a mean waiting time of 1.52 hours. Calculate the point estimate and construct the 95% CI.
b. Suppose that the mean of 1.52 hours had resulted from a sample of 32 patients. Calculate the point estimate and find the 95% CI.
c. What effect does a larger sample size have on the CI?
37. Solutions
A.
$1.52 \pm 1.96\sqrt{\dfrac{2.25}{20}} = 1.52 \pm 1.96(0.33) = 1.52 \pm 0.65 = (0.87,\ 2.17)$
We are 95% confident that the true mean waiting time is between 0.87 and 2.17 hrs.
Although the true mean may or may not be in this interval, 95% of the intervals formed in this manner will contain the true mean.
An incorrect interpretation is that there is a 95% probability that this interval contains the true population mean.
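The calculation for part A can be reproduced with a short sketch. The function name `z_confidence_interval` is a hypothetical helper, not something from the slides; because it keeps the standard error unrounded, its limits can differ in the second decimal from the hand calculation, which rounds the standard error to 0.33.

```python
from math import sqrt
from statistics import NormalDist

def z_confidence_interval(mean, variance, n, confidence=0.95):
    """CI for a population mean when the population variance is known."""
    # Reliability coefficient z_{alpha/2}, about 1.96 for 95% confidence.
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    margin = z * sqrt(variance / n)   # reliability coefficient x standard error
    return mean - margin, mean + margin

# Waiting-time example, part A: mean 1.52 h, variance 2.25, n = 20.
low, high = z_confidence_interval(mean=1.52, variance=2.25, n=20)
print(f"95% CI: ({low:.2f}, {high:.2f})")
```

Calling the same function with `n=32` gives the narrower interval asked for in part B.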
38. Solutions
B.
$1.52 \pm 1.96\sqrt{\dfrac{2.25}{32}} = 1.52 \pm 1.96(0.27) = 1.52 \pm 0.53 = (0.99,\ 2.05)$
The larger sample size makes the CI narrower (more precision).
When constructing CIs, it has been assumed that the standard deviation of the underlying population, σ, is known.
What if σ is not known?
39. Unknown variance (small sample size, n ≤ 30)
If σ for the underlying population is unknown and the sample size is small,
as an alternative we use Student's t distribution:
$\bar{x} \pm t_{\alpha/2,\,n-1}\,\dfrac{s}{\sqrt{n}}$
40. Degrees of Freedom (df)
df = number of observations that are allowed to vary freely after the estimator has been calculated. df = n − 1
41. Example
Compute a 95% CI for the mean birth weight based on n = 10, sample mean = 116.9 oz and s = 21.70.
From the t table, t(9, 0.975) = 2.262
Answer: (101.4, 132.4)
Interpretations?
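The answer above can be verified with a minimal sketch. The critical value 2.262 is the one quoted from the t table in the example; the helper name `t_confidence_interval` is an assumption made for illustration.

```python
from math import sqrt

def t_confidence_interval(mean, s, n, t_crit):
    """CI for a mean with unknown sigma: mean +/- t * s / sqrt(n)."""
    margin = t_crit * s / sqrt(n)
    return mean - margin, mean + margin

# Birth-weight example: n = 10, mean = 116.9 oz, s = 21.70, t(9, 0.975) = 2.262.
low, high = t_confidence_interval(mean=116.9, s=21.70, n=10, t_crit=2.262)
print(f"95% CI: ({low:.1f}, {high:.1f})")
```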
42. CIs for single population proportion, p
An interval estimate for the population proportion (π) can be calculated by adding an allowance for uncertainty to the sample proportion (p).
It is based on three elements of the CI:
Point estimate
SE of point estimate
Reliability coefficient
45. Example
A random sample of 100 people shows that 25 are left-handed.
Calculate the point estimate and form a 95% CI for the true
proportion of left-handers.
46. Example
It was found that 28.1% of 153 cervical-cancer cases had never had a Pap smear prior to the
time of case’s diagnosis. Calculate a 95% CI for the percentage of cervical-cancer cases who
never had a Pap smear.
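Both proportion examples can be worked with the same three elements (point estimate, standard error, reliability coefficient). The sketch below assumes the usual large-sample standard error $\sqrt{p(1-p)/n}$; the function name `proportion_ci` is hypothetical.

```python
from math import sqrt
from statistics import NormalDist

def proportion_ci(p, n, confidence=0.95):
    """Large-sample CI for a proportion: p +/- z * sqrt(p(1-p)/n)."""
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    se = sqrt(p * (1 - p) / n)
    return p - z * se, p + z * se

# Left-handedness example: 25 of 100 people.
left_low, left_high = proportion_ci(25 / 100, 100)
# Pap smear example: 28.1% of 153 cases.
pap_low, pap_high = proportion_ci(0.281, 153)

print(f"left-handed: ({left_low:.3f}, {left_high:.3f})")
print(f"never had Pap smear: ({pap_low:.3f}, {pap_high:.3f})")
```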
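48. CI for the difference between population means
Known variances and large sample size
When σ1 and σ2 are known and both populations are normal, or both sample sizes are at least 30
The test statistic is a z-value
The point estimate of (μ1 − μ2) is $(\bar{x}_1 - \bar{x}_2)$
The standard error of $(\bar{x}_1 - \bar{x}_2)$ is $\sqrt{\dfrac{\sigma_1^2}{n_1} + \dfrac{\sigma_2^2}{n_2}}$
Finally, the (1−α)100% CI for (μ1 − μ2) is $(\bar{x}_1 - \bar{x}_2) \pm z_{\alpha/2}\sqrt{\dfrac{\sigma_1^2}{n_1} + \dfrac{\sigma_2^2}{n_2}}$
If the population variances are unknown, they can be approximated by the sample variances $s_1^2$ and $s_2^2$ when the samples are large (n ≥ 30)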
49. Example 1
• Researchers wish to know if the data they have collected provide sufficient evidence to indicate a difference in mean serum uric acid levels between normal individuals and individuals with Down's syndrome. The data consist of serum uric acid readings on 12 individuals with Down's syndrome and 15 normal individuals. The means are $\bar{x}_1$ = 4.5 mg/100 ml and $\bar{x}_2$ = 3.4 mg/100 ml. The data constitute two independent simple random samples, each drawn from a normally distributed population with a variance equal to 1 (mg/100 ml)².
• Compute the point estimate and construct a 95% CI for the difference in mean serum uric acid levels between the two populations.
51. Example 2
Researchers are interested in the difference between serum uric acid levels in patients with and without Down's syndrome.
Patients without Down's syndrome
n = 12, sample mean = 4.5 mg/100 ml, σ² = 1.0
Patients with Down's syndrome
n = 15, sample mean = 3.4 mg/100 ml, σ² = 1.5
Calculate the 95% CI.
SE = 0.43, 95% CI = 1.1 ± 1.96(0.43) = (0.26, 1.94)
We are 95% confident that the true difference between the two population means is between 0.26 and 1.94.
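The standard error and interval in Example 2 can be checked with a sketch like the following, which assumes known population variances as stated in the example. The function name `diff_means_z_ci` is a hypothetical helper.

```python
from math import sqrt
from statistics import NormalDist

def diff_means_z_ci(x1, x2, var1, var2, n1, n2, confidence=0.95):
    """CI for mu1 - mu2 with known variances; returns (SE, (low, high))."""
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    se = sqrt(var1 / n1 + var2 / n2)
    diff = x1 - x2
    return se, (diff - z * se, diff + z * se)

# Example 2: n1 = 12, mean 4.5, variance 1.0; n2 = 15, mean 3.4, variance 1.5.
se, (low, high) = diff_means_z_ci(4.5, 3.4, 1.0, 1.5, 12, 15)
print(f"SE = {se:.2f}, 95% CI = ({low:.2f}, {high:.2f})")
```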
52. CI for the difference between population means
Unknown variances ($\sigma_1^2$ and $\sigma_2^2$) and small sample size (n < 30)
If the following assumptions are satisfied:
The two random samples are independent.
Both samples are picked from populations with normal distributions.
The population variances are unknown but are assumed to be equal.
The test statistic is a t-value with degrees of freedom = $n_1 + n_2 - 2$
The point estimate of (μ1 − μ2) is $(\bar{x}_1 - \bar{x}_2)$
The standard error of $(\bar{x}_1 - \bar{x}_2)$ is $\sqrt{S_p^2\left(\dfrac{1}{n_1} + \dfrac{1}{n_2}\right)}$, where $S_p^2$ is the pooled sample variance
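53. CI for the difference between population means
The pooled sample variance: $S_p^2 = \dfrac{(n_1-1)s_1^2 + (n_2-1)s_2^2}{n_1+n_2-2}$
Finally, the (1−α)100% confidence interval for (μ1 − μ2): $(\bar{x}_1 - \bar{x}_2) \pm t_{\alpha/2,\,n_1+n_2-2}\sqrt{S_p^2\left(\dfrac{1}{n_1}+\dfrac{1}{n_2}\right)}$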
54. Example
A research team collected serum amylase data from a sample of healthy subjects and from a sample of hospitalized subjects. They wish to know if they would be justified in concluding that the population means are different. The data consist of serum amylase determinations on $n_2$ = 15 healthy subjects and $n_1$ = 22 hospitalized subjects. The sample means and standard deviations are as follows:
$\bar{x}_1$ = 120 units/ml, $s_1$ = 40 units/ml
$\bar{x}_2$ = 96 units/ml, $s_2$ = 35 units/ml
Construct a 95% CI for the difference between the two population mean serum amylase levels.
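One way to work this example is with the pooled-variance interval, assuming equal population variances. In the sketch below the critical value t(35, 0.975) = 2.0301 is read from a t table and is an assumption, since it is not stated in the slides; the function name `pooled_t_ci` is hypothetical.

```python
from math import sqrt

def pooled_t_ci(x1, s1, n1, x2, s2, n2, t_crit):
    """CI for mu1 - mu2 assuming equal variances; returns (pooled variance, (low, high))."""
    sp2 = ((n1 - 1) * s1 ** 2 + (n2 - 1) * s2 ** 2) / (n1 + n2 - 2)
    se = sqrt(sp2 * (1 / n1 + 1 / n2))
    diff = x1 - x2
    return sp2, (diff - t_crit * se, diff + t_crit * se)

# Hospitalized: n1 = 22, mean 120, s 40. Healthy: n2 = 15, mean 96, s 35.
# t_crit for df = 22 + 15 - 2 = 35 is taken from a t table (assumed value).
sp2, (low, high) = pooled_t_ci(120, 40, 22, 96, 35, 15, t_crit=2.0301)
print(f"pooled variance = {sp2:.0f}, 95% CI = ({low:.1f}, {high:.1f})")
```

Because the resulting interval includes zero, these data alone would not justify concluding that the two population means differ at the 5% level.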
56. CI for the difference between population proportions
Suppose that n1 and n2 are large enough so that:
– $n_1p_1 \ge 5$, $n_1(1-p_1) \ge 5$, $n_2p_2 \ge 5$, and $n_2(1-p_2) \ge 5$
The point estimate for the difference of two population proportions, $\pi_1 - \pi_2$, is $P_1 - P_2$.
The standard deviation of $P_1 - P_2$ is $\sqrt{\dfrac{P_1(1-P_1)}{n_1} + \dfrac{P_2(1-P_2)}{n_2}}$
A (1−α)100% confidence interval estimate for the difference of population proportions, $\pi_1 - \pi_2$, is $(P_1 - P_2) \pm z_{\alpha/2}\sqrt{\dfrac{P_1(1-P_1)}{n_1} + \dfrac{P_2(1-P_2)}{n_2}}$
57. Example
Each of two groups consists of 100 patients who have leukemia. A new drug is given to the first group but not to the second (the control group). It is found that in the first group 75 people have remission for 2 years, but only 60 in the second group. Find 95% confidence limits for the difference in the proportion of all patients with leukemia who have remission for 2 years.
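The confidence limits for this example can be sketched as follows, using the unpooled standard error of the difference in proportions. The function name `diff_proportions_ci` is a hypothetical helper.

```python
from math import sqrt
from statistics import NormalDist

def diff_proportions_ci(p1, n1, p2, n2, confidence=0.95):
    """Large-sample CI for pi1 - pi2 using the unpooled standard error."""
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    se = sqrt(p1 * (1 - p1) / n1 + p2 * (1 - p2) / n2)
    diff = p1 - p2
    return diff - z * se, diff + z * se

# Leukemia example: 75 of 100 remissions with the drug, 60 of 100 in controls.
low, high = diff_proportions_ci(75 / 100, 100, 60 / 100, 100)
print(f"95% CI for the difference: ({low:.3f}, {high:.3f})")
```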
59. Summary
• When to use $t_{\alpha/2}$ or $z_{\alpha/2}$ for finding a confidence interval
Is σ known?
Yes: use $z_{\alpha/2}$ values no matter what the sample size is.
No: is n ≥ 30 (or np and nq ≥ 5)?
Yes: use $z_{\alpha/2}$ values and s in place of σ in the formula.
No: use $t_{\alpha/2}$ values and s in the formula.
61. Hypothesis Testing
Researchers conduct studies to answer research questions or to test hypotheses.
The best way to determine whether a hypothesis is true would be to examine the entire population.
But since this is often impractical, researchers typically examine a random sample from the population.
The purpose of any study is to collect data which will allow the researcher to test the hypothesis or answer the question.
Statistical tests can provide evidence (with a certain degree of confidence) on whether a hypothesis is true or not.
62. Hypothesis Testing
In hypothesis testing the researcher must:
- define the population under study,
- state the particular hypothesis that will be investigated,
- determine the significance level,
- select a sample from the population and collect the data, and
- perform the appropriate statistical test and reach a conclusion.
64. Hypothesis Testing
A hypothesis is a testable statement that describes the nature of a proposed relationship between two or more variables of interest.
Hypotheses are formulated, experiments are performed, and results are evaluated for their consistency with a hypothesis.
Hypothesis testing (HT) provides an objective framework for making decisions using probabilistic methods.
The purpose of HT is to aid the clinician, researcher or administrator in reaching a decision (conclusion).
65. Types of Hypothesis
The Null Hypothesis, H0
Is a statement claiming that there is no difference between the hypothesized value and the population value (parameter = hypothesized value)
It is a statement of agreement (no difference between groups, or the intervention is not effective)
States the assumption (hypothesis) to be tested
It is always about a population parameter (mean, proportion, OR, RR, etc.), not about a sample statistic
Always contains an "=", "≤" or "≥" sign
May or may not be rejected
66. Types of Hypothesis
The Alternative Hypothesis, HA
It is a statement we will believe as true if we reject the H0.
It is generally the hypothesis that is believed (or needs to be supported) by the researcher.
Is a statement that disagrees with (opposes) H0 (there is a difference between groups, or the intervention is effective)
Never contains an "=", "≤" or "≥" sign; it contains "≠", ">" or "<"
May or may not be accepted
67. Rules for Stating Statistical Hypotheses
An indication of equality (either =, ≤ or ≥) must appear in H0.
H0: μ = μ0, HA: μ ≠ μ0; when our hypothesis is expressed in terms of a population mean
H0: P = P0, HA: P ≠ P0; when our hypothesis is expressed in terms of a population proportion
Can we conclude that a certain population mean is
not 50? H0: μ = 50 and HA: μ ≠ 50
greater than 50? H0: μ ≤ 50 and HA: μ > 50
Can we conclude that the proportion of patients with leukemia who survive more than six years is not 60%?
H0: P = 0.6 and HA: P ≠ 0.6
Can we conclude that smoking is significantly associated with lung cancer?
H0: there is no association between smoking and lung cancer.
HA: there is an association between smoking and lung cancer.
68. Hypothesis testing process
Now think about how the hypothesis test should be carried out.
We draw a random sample of size n from the underlying population and calculate its sample mean ($\bar{x}$).
We compare $\bar{x}$ to the postulated mean μ0.
Is the difference between $\bar{x}$ and μ0 too large to be attributed to chance alone?
70. Steps in Hypothesis Testing
1. Formulate the appropriate statistical hypotheses clearly
Specify H0 and HA
H0: μ = μ0    H0: μ ≤ μ0    H0: μ ≥ μ0
HA: μ ≠ μ0    HA: μ > μ0    HA: μ < μ0
two-tailed    one-tailed    one-tailed
2. Decide on the appropriate test statistic for the hypothesis. E.g., for one population mean:
$z = \dfrac{\bar{x} - \mu_0}{\sigma/\sqrt{n}}$ or $t = \dfrac{\bar{x} - \mu_0}{s/\sqrt{n}}$
71. Steps in Hypothesis Testing
3. Specify the desired level of significance (α = 0.05, 0.01, etc.)
4. Determine the critical value.
5. Compute the test statistic or the p-value
6. Reach a decision and draw the conclusion
If H0 is rejected, we conclude that HA is true (or accepted).
If H0 is not rejected, we conclude that H0 may be true.
72. One tail and two tail tests
Depending on the way the hypotheses are written, hypothesis testing can be:
Two tail test
The rejection region is split into the two tails.
The alternative hypothesis takes the form "different from".
One tail test
The rejection region is at one end of the distribution or the other.
The alternative hypothesis takes the form "less than" or "greater than".
73. Level of Significance, α
Is the probability of rejecting a true H0
Defines the rejection region of the sampling distribution
The decision is made on the basis of the level of significance, designated by α.
The more frequently used values of α are 0.01, 0.05 and 0.10.
α is selected by the researcher at the beginning
74. Test statistic
Any observed differences or associations may have occurred by chance.
Because there is random variation, even an unbiased sample may not accurately represent the population as a whole.
A test statistic is a value we can compare with the known distribution of what we expect when the null hypothesis is true.
The general formula of any test statistic is:
$\text{test statistic} = \dfrac{\text{observed value} - \text{hypothesized value}}{\text{standard error}}$
Examples of test statistics are the z-test, t-test, and χ²-test.
75. Critical value
The value that separates the rejection region from the acceptance (non-rejection) region for a given level of significance
The values of the test statistic are points on the horizontal axis of the normal distribution, and the critical value separates two regions:
Rejection region, and
Non-rejection region.
The values of the test statistic forming the rejection region are less likely to occur if the H0 is true.
The values making up the acceptance (non-rejection) region are more likely to occur if the H0 is true.
76. Rejection and Non-Rejection Regions
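[Figure: standard normal curve for a two-tailed test at α = 0.05. The non-rejection region (area 0.95) lies between −1.96 and 1.96; the rejection regions (α/2 = 0.025 each) lie in the two tails.]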
77. P-value
In most applications, the outcome of performing a hypothesis test is to produce a p-value.
The P-value is the probability of obtaining a test statistic as extreme as or more extreme than the actual test statistic obtained, if the H0 is true.
• Informally, the P-value measures how likely a difference as large as the one observed would be by chance alone if H0 were true.
The larger the test statistic, the smaller the P-value: the probability of the observed value occurring just by chance is low.
The smaller the P-value, the stronger the evidence for rejecting H0.
Reject H0 if P-value < α
Do not reject H0 if P-value > α
What if P-value = α?
78. How to calculate P-value
Use statistical software like SPSS, SAS, STATA, or R, etc.
Manual calculations
Obtain the test statistic (Z calculated)
Find the probability of the test statistic from the standard normal table
Subtract the probability from 0.5 (for a table that gives the area between 0 and z)
If the test is two tailed, multiply the result by 2.
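The manual steps above can be mirrored in a short sketch that uses the standard normal CDF instead of a printed table. The function name `p_value_from_z` and the illustrative value z = −2.12 are assumptions made for this example; the one-tailed branch assumes the alternative points in the same direction as the observed z.

```python
from statistics import NormalDist

def p_value_from_z(z, two_tailed=True):
    """P-value for a z statistic from the standard normal distribution."""
    # Area in the tail beyond |z| (equivalent to 0.5 minus the area from 0 to |z|).
    tail = 1 - NormalDist().cdf(abs(z))
    # Double the tail area for a two-tailed test.
    return 2 * tail if two_tailed else tail

p_two = p_value_from_z(-2.12)                    # two-tailed
p_one = p_value_from_z(-2.12, two_tailed=False)  # one-tailed
print(round(p_two, 4), round(p_one, 4))
```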
79. Statistical Decision
Based on the computation from the data of the sample
The decision to reject or not to reject the Ho is based on
The magnitude of the test statistic.
CI
P-value
Reject Ho if the value of the test statistic is in the rejection region
Don't reject Ho if the computed value of the test statistic is one of the values in the non-rejection region.
80. Errors in hypothesis testing
Whenever we reject or fail to reject the H0 we may commit an error.
Two types of errors can be committed.
Type I Error
Type II Error
81. Type I Error
The error committed when a true H0 is rejected
Considered a serious type of error
The probability of a type I error is the probability of rejecting the H0 when it is true
The probability of a type I error is α, called the level of significance of the test
Set by the researcher in advance
82. Type II Error
The error committed when a false H0 is not rejected
The probability of a Type II Error is β
Usually unknown but larger than α
83. Power
The probability of rejecting the H0 when it is false.
Power = 1 − β = 1 − probability of type II error
We would like to maintain a low probability of a type I error (α) and a low probability of a type II error (β) [high power = 1 − β].
84. Summary
Decision (Conclusion)    Reality: H0 True                              Reality: H0 False
Do not reject Ho         Correct action (Prob. = 1−α)                  Type II error (β) (Prob. = β = 1−Power)
Reject Ho                Type I error (α) (Prob. = α = Sign. level)    Correct action (Prob. = Power = 1−β)
91. Hypothesis Testing for Known Variance
Two tailed test
$H_0: \mu = \mu_0$
$H_A: \mu \ne \mu_0$
$z_{cal} = \dfrac{\bar{x} - \mu_0}{\sigma/\sqrt{n}}$
$z_{tab} = z_{\alpha/2}$ for a two tailed test
Decision:
if $|z_{cal}| < z_{tab}$, do not reject Ho
if $|z_{cal}| \ge z_{tab}$, reject Ho
92. Example
A simple random sample of 10 people from a certain population has a mean age of 27. Can we conclude that the mean age of the population is not 30? The variance is known to be 20. Let α = .05.
Data
n = 10, sample mean = 27, σ² = 20, α = 0.05
Assumptions
Simple random sample
Normally distributed population
93. Example
A. Hypothesis
Ho: µ= 30
HA: µ≠ 30
B. Test statistic
As the population variance is known, we use Z as the test statistic
C. Determine the level of significance
α = 0.05
94. Example
D. Determine the critical value
Reject Ho if the Z value falls in the rejection region.
Don't reject Ho if the Z value falls in the non-rejection region.
Because of the structure of Ho it is a two tail test. Therefore, reject Ho if Z ≤ −1.96 or Z ≥ 1.96.
95. Example
E. Calculation of test statistic or compute CI
$z_{cal} = \dfrac{\bar{x} - \mu_0}{\sigma/\sqrt{n}} = \dfrac{27 - 30}{\sqrt{20}/\sqrt{10}} = \dfrac{-3}{1.4142} = -2.12$
F. Statistical decision
We reject the Ho because Z = −2.12 is in the rejection region, at α of 5%.
Conclusion
We conclude that µ is not 30. P-value = 0.0340
A Z value of −2.12 corresponds to an area of 0.0170. Since there are two parts to the rejection region in a two tail test, the P-value is twice this, which is 0.0340.
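The whole two-tailed test for this example (mean age 27, n = 10, variance 20, hypothesized mean 30) can be sketched in a few lines. The function name `one_sample_z_test` is a hypothetical helper, and the P-value it returns is computed from the unrounded z, so it may differ in the fourth decimal from a table lookup at −2.12.

```python
from math import sqrt
from statistics import NormalDist

def one_sample_z_test(xbar, mu0, variance, n, alpha=0.05):
    """Two-tailed z test for a mean with known population variance."""
    se = sqrt(variance / n)                        # standard error of the mean
    z = (xbar - mu0) / se                          # test statistic
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)   # about 1.96 for alpha = 0.05
    p = 2 * (1 - NormalDist().cdf(abs(z)))         # two-tailed P-value
    return z, z_crit, p, abs(z) >= z_crit

z, z_crit, p, reject = one_sample_z_test(xbar=27, mu0=30, variance=20, n=10)
print(f"z = {z:.2f}, critical = ±{z_crit:.2f}, p = {p:.4f}, reject H0: {reject}")
```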
96. Hypothesis testing using confidence interval
A problem like the above example can also be solved using a confidence interval.
A confidence interval will show that the hypothesized value of μ (30) does not fall within the boundaries of the interval. However, it will not give a probability.
Confidence interval
$CI = \bar{x} \pm z_{\alpha/2}\dfrac{\sigma}{\sqrt{n}} = 27 \pm 1.96(1.4142) = (24.228,\ 29.772)$
98. Example
A simple random sample of 10 people from a certain population has a mean age of 27. Can
we conclude that the mean age of the population is less than 30? The variance is known to be
20. Let α = 0.05.
Data
n = 10, sample mean = 27, σ² = 20, α = 0.05
Hypotheses
Ho: µ ≥ 30, HA: µ < 30
99. Example
Test statistic
$z_{cal} = \dfrac{27 - 30}{\sqrt{20}/\sqrt{10}} = -2.12$
With α = 0.05 and the inequality, we have the entire rejection region at the left. The critical value will be Z = −1.645. Reject Ho if Z < −1.645.
[Figure: lower tail test, rejection region in the left tail]
100. Example
• Statistical decision
– We reject the Ho because -2.12 < -1.645.
• Conclusion
– We conclude that µ < 30.
– p = .0170 this time because it is only a one tail test and not a
two tail test.
101. Hypothesis testing for unknown variance (n small)
In most practical applications the standard deviation of the underlying population is not known.
In this case, σ can be estimated by the sample standard deviation s.
If the underlying population is normally distributed, then the test statistic is:
$t = \dfrac{\bar{x} - \mu_0}{s/\sqrt{n}}$, with n − 1 degrees of freedom
102. Example
A simple random sample of 14 people from a certain population gives a sample mean body mass index (BMI) of 30.5 and SD of 10.64. Can we conclude that the mean BMI is not 35 at α = 5%?
Ho: µ = 35, HA: µ ≠ 35
Test statistic
If the assumptions are correct and Ho is true, the test statistic follows Student's t distribution with 13 degrees of freedom.
103. Example
Decision rule
We have a two tailed test. With α = 0.05 it means that each tail is 0.025. The critical t values with 13 df are −2.1604 and 2.1604.
We reject Ho if t ≤ −2.1604 or t ≥ 2.1604.
Do not reject Ho because −1.58 is not in the rejection region. Based on the data of the sample, it is possible that µ = 35. P-value = 0.137
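The t statistic of −1.58 in this example can be reproduced with the sketch below. The critical value 2.1604 is the one quoted in the decision rule; the function name `one_sample_t_test` is a hypothetical helper.

```python
from math import sqrt

def one_sample_t_test(xbar, mu0, s, n, t_crit):
    """Two-tailed one-sample t test; returns (t statistic, reject H0?)."""
    t = (xbar - mu0) / (s / sqrt(n))
    return t, abs(t) >= t_crit

# BMI example: n = 14, mean 30.5, SD 10.64, H0: mu = 35, t(13, 0.975) = 2.1604.
t, reject = one_sample_t_test(xbar=30.5, mu0=35, s=10.64, n=14, t_crit=2.1604)
print(f"t = {t:.2f}, reject H0: {reject}")
```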
104. Hypothesis testing for proportions
Involves categorical values
Two possible outcomes
"Success" (possesses a certain characteristic)
"Failure" (does not possess that characteristic)
The fraction or proportion of the population in the "success" category is denoted by p
106. Example
We are interested in the probability of developing asthma over a given one-year period for
children 0 to 4 years of age whose mothers smoke in the home. In the general population of 0
to 4-year-olds, the annual incidence of asthma is 1.4%. If 10 cases of asthma are observed over a
single year in a sample of 500 children whose mothers smoke, can we conclude that this is
different from the underlying probability of p0 = 0.014? α = 5%
H0 : p = 0.014
HA: p ≠ 0.014
108. Example
The critical value of Zα/2 at α = 5% is ±1.96.
Don't reject Ho since Z (= 1.14) is in the non-rejection region between ±1.96.
P-value = 0.2548
We do not have sufficient evidence to conclude that the probability of
developing asthma for children whose mothers smoke in the home is
different from the probability in the general population
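The z value of about 1.14 in the asthma example can be checked with the following sketch, which assumes the standard error is computed under H0 as $\sqrt{p_0(1-p_0)/n}$. The function name `one_proportion_z_test` is hypothetical, and the software-computed P-value may differ slightly from a table-based figure.

```python
from math import sqrt
from statistics import NormalDist

def one_proportion_z_test(successes, n, p0):
    """Two-tailed z test for one proportion, with the SE evaluated under H0."""
    p_hat = successes / n
    se = sqrt(p0 * (1 - p0) / n)
    z = (p_hat - p0) / se
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return z, p_value

# Asthma example: 10 cases among 500 children, H0: p = 0.014.
z, p_value = one_proportion_z_test(successes=10, n=500, p0=0.014)
print(f"z = {z:.2f}, p = {p_value:.3f}")
```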
109. Hypothesis testing for two samples
• Comparing Two Population Means
Independent samples: variances known
Independent samples: variances unknown
• Paired Difference Experiments
Paired/matched/repeated sampling
• Comparing Two Population Proportions
Large, independent samples case
110. Hypothesis testing for two population means
Independent samples with known variances, or both groups have large sample sizes
The steps to test the hypothesis for a difference of means are the same as for a single mean
Step 1: State the hypothesis
Ho: µ1 − µ2 = 0 vs HA: µ1 − µ2 ≠ 0, HA: µ1 − µ2 < 0, or HA: µ1 − µ2 > 0
Step 2: Significance level (α)
Step 3: Test statistic
$z_{cal} = \dfrac{(\bar{x} - \bar{y}) - (\mu_1 - \mu_2)}{\sqrt{\dfrac{\sigma_1^2}{n_1} + \dfrac{\sigma_2^2}{n_2}}}$
111. Hypothesis testing for two population means
Decision rules, with $z_{tab} = z_{\alpha/2}$ for a two tailed test and $z_{tab} = z_{\alpha}$ for a one tailed test:
For $H_A: \mu_1 - \mu_2 \ne 0$: if $|z_{cal}| < z_{tab}$, do not reject Ho; if $|z_{cal}| \ge z_{tab}$, reject Ho
For $H_A: \mu_1 - \mu_2 < 0$: if $z_{cal} > -z_{tab}$, do not reject Ho; if $z_{cal} \le -z_{tab}$, reject Ho
For $H_A: \mu_1 - \mu_2 > 0$: if $z_{cal} < z_{tab}$, do not reject Ho; if $z_{cal} \ge z_{tab}$, reject Ho
112. Example
• Researchers wish to know if the data they have collected provide sufficient evidence to indicate a difference in mean serum uric acid levels between normal individuals and individuals with Down's syndrome. The data consist of serum uric acid readings on 12 individuals with Down's syndrome and 15 normal individuals. The means are 4.5 mg/100 ml and 3.4 mg/100 ml, with population standard deviations of 2.9 and 3.5 mg/100 ml, respectively.
$H_0: \mu_1 - \mu_2 = 0$
$H_A: \mu_1 - \mu_2 \ne 0$
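A possible way to carry this test through is sketched below. The significance level α = 0.05 is an assumption, since the example does not state one, and the function name `two_sample_z_test` is a hypothetical helper.

```python
from math import sqrt
from statistics import NormalDist

def two_sample_z_test(x1, sd1, n1, x2, sd2, n2, alpha=0.05):
    """Two-tailed z test for mu1 - mu2 = 0 with known population SDs."""
    se = sqrt(sd1 ** 2 / n1 + sd2 ** 2 / n2)
    z = (x1 - x2) / se
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)
    return z, z_crit, abs(z) >= z_crit

# Down's syndrome group: n = 12, mean 4.5, SD 2.9; normal group: n = 15, mean 3.4, SD 3.5.
# alpha = 0.05 is assumed for illustration.
z, z_crit, reject = two_sample_z_test(4.5, 2.9, 12, 3.4, 3.5, 15)
print(f"z = {z:.2f}, critical = ±{z_crit:.2f}, reject H0: {reject}")
```

With these standard deviations the statistic falls inside the non-rejection region, so under the assumed α the data would not show a significant difference.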
114. Hypothesis testing for two population means
Independent samples, variances unknown
Generally, in most real life situations, the true values of the population variances $\sigma_1^2$ and $\sigma_2^2$ are not known.
They have to be estimated from the sample variances $S_1^2$ and $S_2^2$, respectively.
We also need to estimate the standard deviation of the sampling distribution of the difference in means ($\bar{X}_1 - \bar{X}_2$).
Two approaches:
1. The variances of the two populations are assumed to be equal
2. The variances of the two groups are assumed to be not equal
115. Hypothesis testing for two population means
Assume that the unknown variances are equal: $\sigma_1^2 = \sigma_2^2 = \sigma^2$
The pooled estimate of $\sigma^2$ is the weighted average of the two sample variances, $S_1^2$ and $S_2^2$
The pooled estimate of $\sigma^2$ is denoted by $S_p^2$
Standard deviation of the sampling distribution is:
$s_{\bar{x}_1 - \bar{x}_2} = \sqrt{S_p^2\left(\dfrac{1}{n_1} + \dfrac{1}{n_2}\right)}$
116. Hypothesis testing for two population means
The t-statistic will be used:
$t = \dfrac{(\bar{x}_1 - \bar{x}_2) - (\mu_1 - \mu_2)}{\sqrt{S_p^2\left(\dfrac{1}{n_1} + \dfrac{1}{n_2}\right)}}$
The df = $n_1 + n_2 - 2$
117. Hypothesis testing for two population means
Assume that the unknown variances are not equal: $\sigma_1^2 \ne \sigma_2^2$
$\sigma_1^2$ and $\sigma_2^2$ will be estimated by $S_1^2$ and $S_2^2$, respectively
Standard deviation of the sampling distribution is:
$s_{\bar{x}_1 - \bar{x}_2} = \sqrt{\dfrac{S_1^2}{n_1} + \dfrac{S_2^2}{n_2}}$
How to compute the df when the variances are unknown and assumed not to be equal? (reading assignment)
118. Example
We have 20 subjects, all males between the ages 25 and 35 who volunteer for our experiment. One half of the group will be given coffee containing caffeine; the other half will be given decaffeinated coffee as the placebo control. We measure the pulse rate after the subjects drink their coffee. The results are:
A) Test the hypothesis that caffeine has no effect on the pulse rates of young men, assuming both groups have equal variance (α = .05).
B) Find the 95% C.I. for the population mean difference.
119. SOLUTION
Hypotheses: Ho: μt = μc
HA: μt ≠ μc
where μt = population mean of treatment group, μc = population mean of control (placebo) group.
Compute the pooled (combined) variance of both groups
S² = {(10−1) × 28.67 + (10−1) × 31.11} / 18
= (258.03 + 279.99)/18 = 538.02/18 = 29.89
Therefore, t calc = (75 − 68) / √[29.89(1/10 + 1/10)] = 7/√5.978
= 2.86 (This corresponds to a P-value of less than 0.02)
t tab (α = 0.05, df = 18) = 2.10, t calc > t tab ⇒ reject Ho
• Hence, caffeinated coffee has an effect on the pulse rates of young men.
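The pooled-variance calculation above, together with the 95% C.I. asked for in part B, can be sketched from the summary figures used in the solution (means 75 and 68, variances 28.67 and 31.11, 10 subjects per group, tabulated t = 2.10). The function name `pooled_t_test` is a hypothetical helper.

```python
from math import sqrt

def pooled_t_test(x1, var1, n1, x2, var2, n2, t_crit):
    """Equal-variance two-sample t test; returns (pooled variance, t, CI for mu1 - mu2)."""
    sp2 = ((n1 - 1) * var1 + (n2 - 1) * var2) / (n1 + n2 - 2)
    se = sqrt(sp2 * (1 / n1 + 1 / n2))
    diff = x1 - x2
    t = diff / se
    return sp2, t, (diff - t_crit * se, diff + t_crit * se)

# Caffeine example, using the tabulated t(18, 0.975) = 2.10 from the solution.
sp2, t, (low, high) = pooled_t_test(75, 28.67, 10, 68, 31.11, 10, t_crit=2.10)
print(f"Sp^2 = {sp2:.2f}, t = {t:.2f}, 95% CI = ({low:.2f}, {high:.2f})")
```

The interval excludes zero, which agrees with the decision to reject Ho.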
120. Hypothesis testing for two population means
Dependent/paired/matched/repeated sampling
Arises from two different processes on the same study units (e.g. "before" and "after" treatments)
Use of the same/matched individuals eliminates any differences in the individuals themselves (confounding factors).
Inference concerning the difference between two population means is similar to that for one population mean, except that we work with the differences here.
124. Solution
$\bar{d} = \dfrac{\sum d_i}{n} = \dfrac{-242}{12} = -20.17$
$s_d^2 = \dfrac{\sum (d_i - \bar{d})^2}{n-1} = \dfrac{n\sum d_i^2 - \left(\sum d_i\right)^2}{n(n-1)} = \dfrac{12(10766) - (-242)^2}{12(11)} = 535.06$
1. State the hypothesis
Ho: The mean difference between before and after the diet-exercise program is zero
HA: The mean difference between before and after the diet-exercise program is < zero
125. Solution
2. Select the appropriate test statistic: t for paired samples
3. Select the level of significance: α = 0.05
4. Determine the critical ratio or critical value of the t test: −1.7959 (one-tailed, df = n − 1 = 11)
5. Perform the calculation for the test statistic:
$t = \frac{\bar{d} - 0}{\sqrt{s_d^2 / n}} = \frac{-20.17 - 0}{\sqrt{535.06 / 12}} = \frac{-20.17}{6.68} = -3.02$
6. Draw and state the conclusion
• Reject Ho since −3.02 < −1.7959
• Conclude that the diet-exercise program is effective.
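The paired-sample calculation can be checked from the summary quantities alone. In this sketch the inputs n = 12, a sum of differences of 242 in magnitude, and a sum of squared differences of 10766 are read from the solution slide; taking the sum as negative is an assumption made so that the result matches the reported t = −3.02.

```python
import math

n = 12
sum_d = -242.0       # sum of paired differences (sign assumed, see lead-in)
sum_d_sq = 10766.0   # sum of squared paired differences
t_crit = -1.7959     # one-tailed critical value, alpha = 0.05, df = 11

d_bar = sum_d / n                                       # mean difference
s_d_sq = (n * sum_d_sq - sum_d ** 2) / (n * (n - 1))    # variance of differences
se = math.sqrt(s_d_sq / n)                              # standard error of d_bar
t_calc = (d_bar - 0) / se                               # hypothesized mean difference is 0

reject_h0 = t_calc < t_crit                             # lower-tailed test
print(f"d_bar = {d_bar:.2f}, s_d^2 = {s_d_sq:.2f}, t = {t_calc:.2f}, reject Ho: {reject_h0}")
```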
126. Hypothesis testing for two population proportions
Suppose that n1 and n2 are large enough so that:
– $n_1 p_1 \ge 5$, $n_1(1 - p_1) \ge 5$, $n_2 p_2 \ge 5$, and $n_2(1 - p_2) \ge 5$
The standard deviation of $P_1 - P_2$ is:
$\sqrt{\frac{P_1(1 - P_1)}{n_1} + \frac{P_2(1 - P_2)}{n_2}}$
The test statistic could be:
$Z_{cal} = \frac{(P_1 - P_2) - (\pi_1 - \pi_2)}{\sqrt{\frac{P_1(1 - P_1)}{n_1} + \frac{P_2(1 - P_2)}{n_2}}}$
What if the sample size is small?
We use the t-statistic with df of $n_1 + n_2 - 2$
127. Example
A researcher is trying to study the malaria situation of Ethiopia. From the records of seasonal
blood survey (SBS) results he came to understand that the proportion of people having malaria
in Ethiopia was 3.8% in 2019 (Eth. Cal). The size of the sample considered was 15,000. He also
realized that during the year that followed (2020), blood samples were taken from 10,000
randomly selected persons. The result of the 2020 seasonal blood survey showed that 200
persons were positive for malaria.
Does the researcher conclude that the malaria situation of 2020 did not show any significant
difference from that of 2019? (Take the level of significance α = .01.)
128. Solution
Ho: P2019 = P2020 (or P2019 − P2020 = 0); HA: P2019 ≠ P2020 (or P2019 − P2020 ≠ 0)
P2019 = 0.038, n2019 = 15,000
P2020 = 0.02, n2020 = 10,000
Z tab (α = 0.01) = 2.58 (two-tailed)
$Z_{cal} = \frac{(0.038 - 0.02) - 0}{\sqrt{\frac{0.038(1 - 0.038)}{15000} + \frac{0.02(1 - 0.02)}{10000}}} \approx 8.58$, which corresponds to a P-value of less than 0.003.
Decision: reject Ho (because Z cal > Z tab); in other words, the p-value is less
than the level of significance (i.e., α = 0.01)
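As a cross-check, the unpooled two-proportion statistic from slide 126 can be evaluated directly with the survey figures. The helper name `two_proportion_z` is hypothetical. With these inputs the formula evaluates to roughly 8.58, and the decision to reject Ho at α = 0.01 is unaffected by the exact value.

```python
import math
from statistics import NormalDist

def two_proportion_z(p1, n1, p2, n2, null_diff=0.0):
    """Unpooled z statistic and two-sided p-value for p1 - p2."""
    se = math.sqrt(p1 * (1 - p1) / n1 + p2 * (1 - p2) / n2)
    z = ((p1 - p2) - null_diff) / se
    # erfc keeps precision in the far tail, where 1 - cdf would round to 0.
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value

# 2019: 3.8% of 15,000; 2020: 200 positives out of 10,000.
z, p_value = two_proportion_z(0.038, 15_000, 200 / 10_000, 10_000)
z_tab = NormalDist().inv_cdf(1 - 0.01 / 2)   # two-tailed critical value for alpha = 0.01

print(f"z = {z:.2f}, z tab = {z_tab:.2f}, reject Ho: {abs(z) > z_tab}")
```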
129. Example
A study was conducted to look at the effects of oral contraceptives (OC) on heart
disease in women 40–44 years of age. It is found that among n1 = 500 current
OC users, 13 develop a myocardial infarction (MI) over a three-year period,
while among n2 = 1000 non-OC users, seven develop an MI over a three-year
period. Can you conclude that the rate of MI is significantly greater among OC
users? (Report the P-value for your test.)
130. Measures of Association
While a test of hypothesis can be used to determine whether an
association exists between two random variables, it cannot provide a
measure of the strength of the association
• Several methods are available for estimating the magnitude of the effect
given the categorical data in a 2 × 2 contingency table
1. Chi-Square Test
2. Relative Risk (RR)
3. Odds Ratio (OR)
131. Chi-Square Test
A Chi-Square (χ2) is a probability distribution used to make statistical
inferences about categorical data (proportions) in which the number
of categories is two or more.
Widely used in the analysis of contingency tables.
Chi-Square test allows us to test for association between two
categorical variables.
Ho: No association between the variables; HA: There is association
Consequently, a significant p-value implies association.
132. X2 Distribution
Indexed by the degrees of freedom (n)
Unlike z and t distributions, which are always symmetric about 0, the
X2 distribution only takes on positive values and is always skewed to the
right.
The skewness diminishes as n increases
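[Figure: X2 distribution with 10 df, showing the acceptance region (area 0.95) to the left of the critical value 18.307 and the rejection region (area 0.05) to the right]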
133. X2 Distribution
As with t distributions, there is a different X2 distribution for each possible value of df.
X2 distributions with a small number of df are highly skewed; however, this
skewness is attenuated as the number of df increases.
The X2 distribution is concentrated over nonnegative values.
It has mean equal to its degrees of freedom (df), and its standard deviation equals
√(2df).
As df increases, the distribution concentrates around larger values and is more
spread out.
The distribution is skewed to the right, but it becomes more bell-shaped
(normal) as df increases
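The stated mean and standard deviation can be illustrated with a small simulation. This sketch relies on a standard fact that the notes do not state: a χ² variable with df degrees of freedom is the sum of df squared standard normal variables. The function name, seed, and number of draws are arbitrary choices.

```python
import math
import random
import statistics

def simulate_chi_square(df, n_draws, seed=0):
    """Draw chi-square values as sums of df squared standard normals."""
    rng = random.Random(seed)
    return [sum(rng.gauss(0, 1) ** 2 for _ in range(df)) for _ in range(n_draws)]

for df in (2, 10):
    draws = simulate_chi_square(df, n_draws=20_000)
    print(f"df = {df}: "
          f"mean = {statistics.mean(draws):.2f} (theory {df}), "
          f"sd = {statistics.stdev(draws):.2f} (theory {math.sqrt(2 * df):.2f}), "
          f"min = {min(draws):.3f}")
```

Every simulated value is nonnegative, and the sample mean and standard deviation should land close to df and √(2df).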
135. Chi-Square test
It is a statistic which measures the discrepancy between k observed frequencies
O1, O2, … Ok and the corresponding expected frequencies E1, E2, … Ek.
If the value of χ2 is zero, there is no discrepancy between the observed and the expected
frequencies.
The greater the discrepancy, the larger will be the value of χ2.
The calculated value of χ2 is compared with the tabulated value for the given df.
• Chi-Square test is based on the table of Χ2 for df. For an R × C table the df is given
by: (row − 1)(column − 1) or (R − 1)(C − 1)
136. Chi-Square Test
Counts in the Chi-Square Test of a 2x2 table are represented as "a", "b",
"c" and "d".
The general formula is:
$\chi^2 = \sum \frac{(O - E)^2}{E}$
For a 2 × 2 table we can also use:
$\chi^2 = \frac{n(ad - bc)^2}{(a + c)(b + d)(a + b)(c + d)}$
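A short sketch can confirm that the 2 × 2 shortcut gives the same value as summing (O − E)²/E over the four cells. It assumes the usual layout in which a and b form the first row and c and d the second; the counts are hypothetical and the function names are illustrative.

```python
def chi_square_general(a, b, c, d):
    """Sum of (O - E)^2 / E over the four cells of a 2x2 table."""
    n = a + b + c + d
    row_totals = (a + b, c + d)
    col_totals = (a + c, b + d)
    observed = ((a, b), (c, d))
    total = 0.0
    for i in range(2):
        for j in range(2):
            expected = row_totals[i] * col_totals[j] / n
            total += (observed[i][j] - expected) ** 2 / expected
    return total

def chi_square_shortcut(a, b, c, d):
    """n(ad - bc)^2 / ((a+c)(b+d)(a+b)(c+d)) for a 2x2 table."""
    n = a + b + c + d
    return n * (a * d - b * c) ** 2 / ((a + c) * (b + d) * (a + b) * (c + d))

# Hypothetical counts, for illustration only.
a, b, c, d = 20, 30, 10, 40
general = chi_square_general(a, b, c, d)
shortcut = chi_square_shortcut(a, b, c, d)
print(f"general = {general:.4f}, shortcut = {shortcut:.4f}")
```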
138. Chi-Square Test
Assumptions
Data must be categorical
The data should be frequency data
The numbers in each cell are 'not too small'. No expected frequency = zero
No more than 20% of the expected frequencies should be less than 5.
If this does not hold:
combine (re-categorize) row or column categories to make the expected
frequencies larger, or
use Yates' continuity correction
139. Example
A study was conducted to investigate the possible cause of a
gastroenteritis outbreak following a lunch served in a high school
cafeteria. Among the 225 students who ate the sandwiches, 109
became ill, while among the 38 students who did not eat the
sandwiches, 4 became ill.
Present the data in a 2x2 contingency table
140. Example
With this method, data are arranged in the form of a contingency
table

                   Ill    Not ill   Total
Ate sandwich       109      116      225
Did not eat          4       34       38
Total              113      150      263

This is a 2 × 2 table for two dichotomous random variables
141. Solution
We again wish to know whether the proportions of students
who became ill in each of the groups are identical
To carry out the test, we first calculate the expected counts for the
table assuming that:
H0: p1 = p2
HA: p1 ≠ p2
142. Example
The chi-square test compares the observed frequencies in
each category with the expected frequencies given that H0 is true
Are the deviations between Observed and Expected too large to
be attributed to chance?
To determine this, deviations from all 4 cells must be combined
Calculate the sum:
$\chi^2 = \sum_{i=1}^{4} \frac{(O_i - E_i)^2}{E_i}$
143. Example
The Ho is rejected at the α level if X2 is too large, in particular, if $X^2 > \chi^2_{1,\alpha}$
If α = 0.05, we would reject H0 for X2 greater than $\chi^2_{1,\alpha}$ = 3.84
Therefore, we reject the Ho
The p-value is given by the area under the X2 distribution to the right
of X2
P-value < 0.001
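The steps of this example (expected counts, the sum over the four cells, and the comparison with 3.84) can be sketched as below, using the counts given in the cafeteria example. The p-value line uses the identity that, for 1 df, the upper-tail area of χ² equals erfc(√(x/2)); that identity is a standard result rather than something taken from the notes.

```python
import math

# Observed counts: rows = ate sandwich (yes, no), columns = became ill (yes, no).
observed = [[109, 225 - 109],
            [4, 38 - 4]]

row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]
n = sum(row_totals)

# Expected count under H0: (row total * column total) / n.
expected = [[r * c / n for c in col_totals] for r in row_totals]

chi_sq = sum((o - e) ** 2 / e
             for o_row, e_row in zip(observed, expected)
             for o, e in zip(o_row, e_row))

critical = 3.84                                 # tabulated value, df = 1, alpha = 0.05
p_value = math.erfc(math.sqrt(chi_sq / 2))      # upper-tail area for 1 df

print("expected counts:", [[round(e, 1) for e in row] for row in expected])
print(f"chi-square = {chi_sq:.2f}, reject H0: {chi_sq > critical}, p-value = {p_value:.1e}")
```

All four expected counts exceed 5, so the assumption about small expected frequencies holds for this table.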
144. Example
A study was conducted to look at the effects of oral contraceptives (OC) on heart
disease in women 40 to 44 years of age. It is found that among 5000 current OC
users at baseline, 13 women develop a myocardial infarction (MI) over a 3-year
period, whereas among 10,000 non-OC users, 7 develop an MI over a 3-year
period. Compare the Chi-Square Test and the z-test.
– P1 = 0.0026, P2 = 0.0007
– Z-test = 2.77, P-value = 0.006
– There is a highly significant association between MI and OC use
Solution
Displaythe abovedatain the form ofa2x2 contingency table
OC-use group
MI statusover
3 years
Total
Yes No
OC users 13 4987 5000
Non-OC users 7 9993 10,000
Total 20 14,980 15,000
Isthe proportion ofMIthe samein OC users andnon-OC users?
What canbe saidabout the relationship between MIstatus andOC use?
146. Solutions
Compute the expected frequency for the OC-MI data
X2 ≈ 8, 0.001 < p-value < 0.005
The relationship between X2 and the Z test is X2 = Z²
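The relationship X2 = Z² can be verified numerically on the OC-MI table. The notes do not say which variant of each test was used; this sketch assumes the pooled-variance z statistic with a continuity correction and the Yates-corrected chi-square, a pairing that reproduces the reported Z = 2.77.

```python
import math

a, b = 13, 4987      # OC users: MI yes, MI no
c, d = 7, 9993       # non-OC users: MI yes, MI no
n1, n2 = a + b, c + d
n = n1 + n2

# z statistic with pooled variance and continuity correction (assumed variant).
p1, p2 = a / n1, c / n2
p_pool = (a + c) / n
se = math.sqrt(p_pool * (1 - p_pool) * (1 / n1 + 1 / n2))
z = (abs(p1 - p2) - (1 / (2 * n1) + 1 / (2 * n2))) / se

# Yates-corrected chi-square for the same 2x2 table (assumed variant).
chi_sq = (n * (abs(a * d - b * c) - n / 2) ** 2
          / ((a + c) * (b + d) * (a + b) * (c + d)))

p_value = math.erfc(z / math.sqrt(2))   # two-sided p-value, same for both tests

print(f"z = {z:.2f}, z^2 = {z ** 2:.2f}, chi-square = {chi_sq:.2f}, p-value = {p_value:.3f}")
```

Under these assumptions the chi-square value is about 7.67, which is the square of 2.77 and rounds to the X2 ≈ 8 quoted above.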
147. Summary
1. Every χ2 distribution extends indefinitely to the right from 0.
2. Every χ2 distribution has only one (right) tail.
3. As df increases, the χ2 curves get more bell shaped and approach the normal
curve in appearance (but remember that a chi square curve starts at 0, not at −∞)
4. If the value of χ2 is zero, then there is a perfect agreement between the observed
and the expected frequencies. The greater the discrepancy between the
observed and expected frequencies, the larger will be the value of χ2.