BMCU002: QUANTITATIVE METHODS NOTES
INTRODUCTION TO STATISTICS
Definition:
It is the science of collecting, organizing, presenting, analyzing and interpreting data to assist in
making more effective decisions.
Types of Statistics
(a) Descriptive statistics: tabular, graphical and numerical methods for organizing and
summarizing information clearly and effectively, relating to either a population or a sample.
(b) Inferential statistics: methods of drawing conclusions about a statistical population from
sample data, and of measuring the reliability of those conclusions.
• A population is the collection of all possible individuals, objects or measurements of
interest.
• A sample is a part or subset of the population of interest.
Variables:
A variable is a measurable characteristic that assumes different values among the subjects.
Types of variables
(a) Independent variables: It is a variable that a researcher manipulates in order to determine its
effect or influence on another variable. They predict the amount of variation that occurs in
other variables.
(b) Dependent variables: It is the variable that is measured, predicted or monitored and is
expected to be affected by manipulation of an independent variable. They attempt to indicate
the total influence arising from the effects of the independent variable. It varies as a function
of the independent variable e.g., influence of hours studied on performance in a statistical test,
influence of distance from the supply center on cost of building materials.
The above variables can be either qualitative or quantitative:
i. Qualitative variables: non-numeric variables, i.e., attributes such as gender,
religion, colour, state of birth etc.
ii. Quantitative variables: numeric variables. They can be either discrete or
continuous.
• Discrete variables: variables that can only assume certain values,
typically whole numbers. They are counted.
• Continuous variables: variables that can assume any value within
a specific range. They are measured, e.g., height, temperature, weight,
radius etc.

Levels of measurement
There are four levels of measurement; nominal, ordinal, interval and ratio.
(a) Nominal level: the observations are classified under a common characteristic e.g., sex, race,
marital status, employment status, language, religion etc. It helps in sampling.
(b) Ordinal level: items or subjects are not only grouped into categories but also ranked in
some order e.g., greater than, less than, superior, happier than, poorer, above etc. It helps in
developing a Likert scale.
(c) Interval level: numerals are assigned to each measure and ranked. The intervals between
numerals are equal. The numerals used represent meaningful quantities but the zero point is
not meaningful e.g., test scores, temperature.
(d) Ratio level: has all the characteristics of the other levels and in addition the zero point is
meaningful. Mathematical operations can be applied to yield meaningful values e.g., height,
weight, distance, age, area etc.
Characteristics of statistical data
 They are aggregate of facts e.g., total sales of a firm for one year.
 They are affected to a marked extent by a multiplicity of causes e.g., volume of wheat
production depends on rainfall, soil fertility, seeds etc
 They are numerically expressed e.g., population of Kenya increased by 4 million during the
year 2004.
 They are estimated according to a reasonable standard of accuracy e.g., 90% accuracy
 They are collected in a systematic manner.
 They are collected for a predetermined purpose
 They should be placed in relation to each other.
Uses and users of statistics
1. Government:
 Monitoring economic and social trends
 Forecasting
 Policy making
2. Individuals
 Leisure activities
 Community work
 Personal finances
 Gambling
3. Academia
 Testing hypothesis
 Developing new theories
 Consultancy services
4. Businesses
 Planning and control
 Quality control especially for the manufacturers
 Forecasting i.e., planning production schedules, advertising expenditures etc.
 Auditing
 Determining production costs e.g., by using regression and correlation, one can determine
the relationship between two variables like costs and methods of production, advertising and
sales etc.
 It gives relevant information for decision-making.
Limitations of statistics
 Deals with aggregate facts and not individual items.
 Deals mainly with quantitative characteristics and not qualitative characteristics like
honesty, efficiency etc.
 The results are only true on an average and under certain conditions.
 Statistics can be misused i.e., wrong interpretation. It requires experience and skill to draw
sensible conclusions from the data.
 Statistics may not provide the best solution under all circumstances.
DESCRIPTIVE STATISTICS
Descriptive statistics is used to summarize data and make sense out of the raw data
collected during the research.
Data collection
Data can be collected from primary and / or secondary sources.
Secondary data consists of information that already exists somewhere having been
collected for another purpose e.g., in government publications, periodicals, journals, books
etc.
Advantages: it is low in cost and readily available.
Disadvantages: the data needed might not exist, and the existing data might be
outdated, inaccurate, incomplete or unreliable.
Primary data consists of original information gathered for the specific purpose through
observation, interviews and questionnaires.
Advantages
- It is relevant
- It is accurate
Disadvantages
- It is costly
- It is time consuming
Presentation of data
Presentation of data refers to the classification and tabulation of data. Classification of data
refers to the act of arranging the data in groups or classes according to some resemblance
of the data in each group or class. Tabulation of data is the arrangement of statistical data
in columns and rows.
Frequency distribution
A frequency distribution is a grouping of data into mutually exclusive categories showing
the number of observations in each category.
Steps
 Decide on the number of classes
 Determine the class interval or width
 Set the individual class limits
 Tally the values into the classes
 Count the number of items in each class
A class interval is the difference between the lower limit of the class and the lower limit of
the next class.
A class midpoint / class mark is the middle point between the lower and the upper class
limit.
Graphical representation of a frequency distribution
1. Histogram: It is a graph in which classes are marked on the horizontal axis and the
class frequencies on the vertical axis.
The class frequencies are represented by the heights of the bars and the bars are drawn
adjacent to each other.
2. Frequency polygons: The class midpoints are connected with a line segment.
3. Cumulative frequency polygons
 Less than cumulative frequency polygons
 More than cumulative frequency polygons
4. Line charts: Show the change in a variable over time
5. Bar chart: Make use of rectangles to present the given data. Can be vertical,
horizontal or component.
6. Pie charts: different segments of a circle represent percentage contribution of various
components to the total.
7. Graphs
8. Pictograms: pictures are used to represent data.
Example
(a) The data below indicates the marks attained by students in a statistical test. Construct
a frequency distribution table with 10 classes
12, 8, 18, 5, 15, 24, 25, 25, 32, 40,
40, 42, 44, 46, 48, 50, 50, 52, 53, 55,
56, 59, 60, 66, 68, 72, 76, 83, 95, 98
(b) From the above: construct a histogram, frequency polygons and curves, cumulative
frequency curves.
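Part (a) of the example can be sketched in Python. This is only one possible answer: it assumes the classes start at 0 with a width of 10 (0-10, 10-20, ..., 90-100) and that each class includes its lower limit but excludes its upper limit. The function name `frequency_distribution` is illustrative.

```python
marks = [12, 8, 18, 5, 15, 24, 25, 25, 32, 40,
         40, 42, 44, 46, 48, 50, 50, 52, 53, 55,
         56, 59, 60, 66, 68, 72, 76, 83, 95, 98]


def frequency_distribution(values, start, width, n_classes):
    """Tally values into n_classes equal classes [lower, upper)."""
    counts = [0] * n_classes
    for v in values:
        index = int((v - start) // width)
        if not 0 <= index < n_classes:
            raise ValueError(f"{v} falls outside the class limits")
        counts[index] += 1
    # One (lower limit, upper limit, frequency) triple per class
    return [(start + i * width, start + (i + 1) * width, c)
            for i, c in enumerate(counts)]


# Assumed class scheme: 10 classes of width 10 starting at 0
table = frequency_distribution(marks, start=0, width=10, n_classes=10)
for lower, upper, freq in table:
    print(f"{lower:>3}-{upper:<3} {freq}")
```

The class frequencies returned here are the heights of the histogram bars, and their running totals give the less-than cumulative frequency curve.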
MEASURES OF CENTRAL TENDENCY
Central tendency is the tendency of observations to cluster near the central part of the
distribution. Measures of central tendency are the measures of location e.g. mean, mode
and median. They are the most representative value of the distribution.
Qualities of a good average
Should be-
 Rigidly defined
 Based on all values
 Easily understood and calculated
 Least affected by the fluctuations of sampling
 Capable of further algebraic or statistical treatment
 Least affected by extreme values
Types of averages
The following are the most important types of averages
(a) Arithmetic mean or simple average
(b) Median
(c) Mode
(d) Geometric mean
(e) Harmonic mean
THE ARITHMETIC MEAN
It is obtained by summing up the values of all the items of a series and dividing this sum
by the number of items.
Computation of the arithmetic mean for
Individual series:-
Direct method
$\bar{X} = \frac{\sum X}{n}$, where $\bar{X}$ = arithmetic mean, $\sum X$ = sum of the values, n = number of items
Grouped series
Direct method
$\bar{X} = \frac{\sum fx}{n}$, where f = frequencies, x = values (class midpoints), n = $\sum f$ = number of items
Properties of the arithmetic mean
 The product of the arithmetic mean and the number of items is equal to the sum of all
given values
 The algebraic sum of the deviations of the various values from the mean is equal to
zero
 The sum of the squares of deviations from arithmetic mean is least.
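A minimal sketch of the direct method for both kinds of series, with a check of the first two properties listed above. The function names are illustrative; the individual series is an arbitrary small data set, and the grouped series uses class midpoints 15 to 65 with frequencies chosen for illustration.

```python
def arithmetic_mean(values):
    """Individual series: sum of the values divided by their number."""
    return sum(values) / len(values)


def grouped_mean(midpoints, frequencies):
    """Grouped series: sum of f*x divided by n, where n is the sum of f."""
    n = sum(frequencies)
    return sum(f * x for x, f in zip(midpoints, frequencies)) / n


values = [2, 7, 10, 15, 10, 17, 8, 10, 2]       # illustrative data
x_bar = arithmetic_mean(values)

# Property 1: mean times number of items equals the sum of the values
product_check = x_bar * len(values)
# Property 2: deviations from the mean sum to zero
deviation_sum = sum(x - x_bar for x in values)

midpoints = [15, 25, 35, 45, 55, 65]            # illustrative class midpoints
frequencies = [10, 18, 25, 26, 17, 4]
g_mean = grouped_mean(midpoints, frequencies)

print(x_bar, product_check, deviation_sum, g_mean)
```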
Advantages of the arithmetic mean
 Can be easily understood
 Takes into account all the items of the series
 It is not necessary to arrange the data before calculating the average
 It is capable of algebraic treatment
 It is a good method of comparison
 It is not indefinite
 It is used frequently.
Disadvantages of the arithmetic mean
• It is affected by extreme values to a great extent
• It may be a figure that does not exist in a series
• It cannot be calculated if all the items of a series are not known
• It cannot be used in case of qualitative data
THE MEDIAN
The median is the middle value of a series arranged in ascending or descending order. If
there are n observations, the median is the value of the $\left(\frac{n+1}{2}\right)$th item.
Computation of the median in discrete series
 Arrange the items in descending or ascending order with their corresponding
frequencies against them.
 Compute the cumulated frequencies and then locate the middle item.
Computation of the median in Continuous series
The median has to be interpolated in the class interval containing the median using the
formula:-
$\text{Median} = L + \frac{\frac{n}{2} - B}{G} \times W$
Where:
L= lower class boundary
n= total number of values
B= cumulative frequency of the group before the median group
G= frequency of the median group
W= class width
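The interpolation formula can be sketched as a short function. The name `grouped_median` and the data are illustrative: ten classes of width 10 starting at 0, with made-up frequencies. The sketch assumes equal class widths and that lower class boundaries are supplied in ascending order.

```python
def grouped_median(lower_boundaries, frequencies, width):
    """Median = L + ((n/2 - B) / G) * W for a continuous series."""
    n = sum(frequencies)
    if n == 0:
        raise ValueError("empty distribution")
    before = 0                      # B: cumulative frequency before the class
    for lower, g in zip(lower_boundaries, frequencies):
        if before + g >= n / 2:     # first class whose cumulative frequency reaches n/2
            return lower + (n / 2 - before) / g * width
        before += g


lower_boundaries = list(range(0, 100, 10))                  # 0, 10, ..., 90
frequencies = [11, 18, 25, 28, 30, 33, 22, 15, 12, 10]      # illustrative
median = grouped_median(lower_boundaries, frequencies, width=10)
print(round(median, 2))
```

With these figures n/2 = 102, the median class is 40-50 (B = 82, G = 30), and the function evaluates 40 + (20/30) × 10.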
Properties of the Median
 It is a positional average and is influenced by the position of the items in the series
and not by the size of items
 The sum of the absolute values of deviations is least.
Advantages of the Median
• It is easy to calculate
• It is simple and is understood easily
• It is less affected by the value of extreme items
• It can be calculated by inspection in some cases
• It is useful in the study of phenomena which are of a qualitative nature
Disadvantages of the Median
 It is not a suitable representative of a series in most cases
 It is not suitable for further algebraic treatment
 It is not used frequently like arithmetic mean
 It cannot be determined exactly in the case of continuous series
Quartiles, deciles and percentiles
 Quartiles are the values of the items that divide the series into four equal parts.
 Deciles divide the series into 10 equal parts.
 Percentiles divide the series into 100 equal parts.
The 2nd quartile, 5th decile and 50th percentile are equal to the median.
THE MODE
The mode is the value, which occurs most often in the data. A distribution with one mode
is called unimodal, with two modes bimodal and with many modes, multimodal
distribution. The class mid-point of a modal class is called a crude mode.
Calculation of the mode in a continuous series
$\text{Mode} = L + \frac{f_m - f_{m-1}}{(f_m - f_{m-1}) + (f_m - f_{m+1})} \times w$
Where:
• L is the lower class boundary of the modal group
• $f_{m-1}$ is the frequency of the group before the modal group
• $f_m$ is the frequency of the modal group
• $f_{m+1}$ is the frequency of the group after the modal group
• w is the group width
Properties of the mode
 It represents the most typical value of the distribution and it should coincide with
existing items
 It is not affected by the presence of extremely large or small items
Advantages of the Mode
 It is easy to understand
 Extreme items do not affect its value
 It possesses the merit of simplicity
Disadvantages of the Mode
 It is often not clearly defined
 Exact location is often uncertain
 It is unsuitable for further algebraic treatment
 It does not take into account extreme values.
GEOMETRIC MEAN
Geometric Mean is the nth root of the product of n values, i.e. $G.M = \sqrt[n]{x_1 \times x_2 \times \dots \times x_n}$
For ungrouped data
$G.M = \text{Antilog}\left(\frac{\sum \log x}{n}\right)$
Grouped data
$G.M = \text{Antilog}\left(\frac{\sum f \log x}{n}\right)$
Merits of the Geometric mean
 It takes into account all the items in the data and condenses them into one
representative value.
 It gives more weight to smaller values than to large values.
 It is amenable to algebraic manipulations
Demerits
 It is difficult to use and compute
 It is determinate for positive values and cannot be used for negative values or zero.
HARMONIC MEAN
It is the reciprocal of the arithmetic mean of the reciprocal of a series of observations.
Ungrouped data
$H.M = \frac{n}{\sum \frac{1}{x}}$
Grouped data
$H.M = \frac{n}{\sum \frac{f}{x}}$, where n = $\sum f$
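The geometric and harmonic means can be sketched together, since both accept optional frequencies. The function names and the three data values are illustrative. The geometric mean is computed through base-10 logarithms to mirror the "antilog" form, and it rejects zero or negative values.

```python
import math


def geometric_mean(values, frequencies=None):
    """Antilog of (sum of f*log x) / n; all values must be positive."""
    freqs = frequencies if frequencies is not None else [1] * len(values)
    if any(x <= 0 for x in values):
        raise ValueError("geometric mean needs strictly positive values")
    n = sum(freqs)
    mean_log = sum(f * math.log10(x) for x, f in zip(values, freqs)) / n
    return 10 ** mean_log           # the antilog


def harmonic_mean(values, frequencies=None):
    """n divided by the sum of f/x."""
    freqs = frequencies if frequencies is not None else [1] * len(values)
    n = sum(freqs)
    return n / sum(f / x for x, f in zip(values, freqs))


data = [2, 4, 8]                    # illustrative values
am = sum(data) / len(data)
gm = geometric_mean(data)
hm = harmonic_mean(data)
print(round(am, 4), round(gm, 4), round(hm, 4))
```

For positive data that are not all equal, the harmonic mean is below the geometric mean, which is below the arithmetic mean. This is consistent with the note that both give more weight to smaller items.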
Merits of the Harmonic mean
 It takes into account all the observations in the data
 It gives more weight to smaller items
 It is amenable to algebraic manipulations
 It measures the rates of change
Demerits
 It is difficult to compute when the number of items is large
 It assigns too much weight to smaller items.
Factors to consider in the choice of an average
 The purpose for which the average is being used
 The nature, characteristics and properties of the average
 The nature and characteristics of the data.
MEASURES OF DISPERSION
Definition of dispersion
• It is the degree to which numerical data tend to spread about an average value
• It is the extent of the scatteredness of items around a measure of central tendency
Significance of measuring dispersion
 To determine the reliability of an average
 To serve as a basis for the control of the variability
 To compare two or more series with regard to their variability
 To facilitate the use of other statistical measures
Properties of a good measure of dispersion
It should be: -
 Simple to understand
 Easy to compute
 Rigidly defined
 Based on each and every item in the
distribution
 Amenable to further algebraic
calculations
 Have sampling stability
 Not be unduly affected by extreme
values
Measures of dispersion
 Range
 Quartile deviation
 Mean deviation
 Standard deviation
The Range: it is the difference between the smallest value and the largest value of a series
Advantages of the Range
 It is the simplest to understand and compute
 It takes the minimum time to calculate the value of the range
Limitations
 It is not based on each and every value of the distribution
 It is subject to fluctuations of considerable magnitude from sample to sample
 It cannot be computed in case of open-ended distributions
 It does not explain or indicate anything about the character of the distribution within the two
extreme observations.
Uses of the range
 Quality control
 Fluctuations of prices
 Weather forecast
 Finding the difference between two values e.g. wages earned by different employees.
The standard deviation
It is the square root of the arithmetic average of the squares of the deviations measured from the
mean. It measures how much “spread” or “variability” is present in the sample. A small standard
deviation means a high degree of uniformity of the observations as well as homogeneity of the
series, and vice versa.
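The definition translates directly into code: take deviations from the mean, square them, average them, and take the square root. The sketch below divides by n, as in the definition, not by n − 1. The function name and both data sets are illustrative; frequencies are optional so the same function covers grouped data.

```python
import math


def standard_deviation(values, frequencies=None):
    """Square root of the average squared deviation from the mean."""
    freqs = frequencies if frequencies is not None else [1] * len(values)
    n = sum(freqs)
    mean = sum(f * x for x, f in zip(values, freqs)) / n
    squared_deviations = sum(f * (x - mean) ** 2 for x, f in zip(values, freqs))
    return math.sqrt(squared_deviations / n)


# Ungrouped, illustrative data with mean 5
sd_raw = standard_deviation([2, 4, 4, 4, 5, 5, 7, 9])

# Grouped, illustrative data: values 1..5 with frequencies 5, 9, 12, 9, 5
sd_grouped = standard_deviation([1, 2, 3, 4, 5], [5, 9, 12, 9, 5])

print(sd_raw, round(sd_grouped, 4))
```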
Ways of computing the standard deviation
Direct method
Ungrouped data
$\sigma = \sqrt{\frac{\sum dx^2}{n}}$, where $\sum dx^2$ = sum of squares of the deviations from the arithmetic mean
Grouped data
$\sigma = \sqrt{\frac{\sum f\,dx^2}{n}}$
Advantages of the standard deviation
 It is rigidly defined and is based on all the observations of the series
 It is applied or used in other statistical techniques like correlation and regression analysis and
sampling theory.
 It is possible to calculate the combined standard deviation of two or more groups.
Disadvantages of the standard deviation
 It cannot be used for comparing the dispersion of two or more series of observations given in
different units.
 It gives more weight to extreme values.
SKEWNESS AND KURTOSIS IN STATISTICS
The average and measure of dispersion can describe the distribution but they are not sufficient to
describe the nature of the distribution. For this purpose we use other concepts known as Skewness
and Kurtosis.
[Figure: curves of symmetrical and skewed distributions]
Skewness
Skewness means lack of symmetry. A distribution is said to be symmetrical when the values are
uniformly distributed around the mean. For example, the following distribution is symmetrical
about its mean 3.
X : 1 2 3 4 5
Frequency (f): 5 9 12 9 5
In a symmetrical distribution the mean, median and mode coincide, that is, mean = median = mode.
Several measures are used to express the direction and extent of skewness of a distribution. The
important measures are those given by Pearson. The first one is the Coefficient of Skewness:
$S_k = \frac{3(\text{Mean} - \text{Median})}{\text{Standard deviation}}$
For a symmetric distribution Sk = 0. If the distribution is negatively skewed then Sk is negative and
if it is positively skewed then Sk is positive. The range for Sk is from -3 to 3.
The other measure uses the β (read ‘beta’) coefficient, which is given by $\beta_1 = \frac{\mu_3^2}{\mu_2^3}$, where μ2
and μ3 are the second and third central moments. The second central moment μ2 is nothing but
the variance. The sample estimate of this coefficient is $b_1 = \frac{m_3^2}{m_2^3}$, where m2 and m3 are
the sample central moments given by
$m_2 = \frac{1}{n}\sum (x - \bar{x})^2$ and $m_3 = \frac{1}{n}\sum (x - \bar{x})^3$
For a symmetrical distribution b1 = 0. Skewness is positive or negative depending upon whether
m3 is positive or negative.
Kurtosis
A measure of the peakedness or convexity of a curve is known as Kurtosis.
[Figure: three symmetrical curves, labelled (1), (2) and (3), with different peaks]
All three curves, (1), (2) and (3), are symmetrical about
the mean. Still they are not of the same type. Each has a different peak compared with the
others. Curve (1) is known as mesokurtic (normal curve); Curve (2) is known as leptokurtic
(peaked curve) and Curve (3) is known as platykurtic (flat curve). Kurtosis is measured by
Pearson’s coefficient, β2 (read ‘beta-two’). It is given by $\beta_2 = \frac{\mu_4}{\mu_2^2}$.
The sample estimate of this coefficient is $b_2 = \frac{m_4}{m_2^2}$, where m4 is the fourth central moment
given by $m_4 = \frac{1}{n}\sum (x - \bar{x})^4$.
The distribution is called normal if b2 = 3. When b2 is more than 3 the distribution is said to be
leptokurtic. If b2 is less than 3 the distribution is said to be platykurtic.
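As a rough check, the moment-based coefficients can be computed for the symmetrical distribution used earlier (X = 1 to 5 with frequencies 5, 9, 12, 9, 5). The sketch assumes the usual definitions b1 = m3² / m2³ and b2 = m4 / m2², with central moments divided by n. The function name `central_moment` is illustrative.

```python
def central_moment(values, frequencies, order):
    """Average of (x - mean)**order, weighted by the frequencies."""
    n = sum(frequencies)
    mean = sum(f * x for x, f in zip(values, frequencies)) / n
    return sum(f * (x - mean) ** order for x, f in zip(values, frequencies)) / n


x = [1, 2, 3, 4, 5]
f = [5, 9, 12, 9, 5]

m2 = central_moment(x, f, 2)    # the variance
m3 = central_moment(x, f, 3)
m4 = central_moment(x, f, 4)

b1 = m3 ** 2 / m2 ** 3          # skewness coefficient (assumed definition)
b2 = m4 / m2 ** 2               # kurtosis coefficient (assumed definition)

if b2 > 3:
    shape = "leptokurtic"
elif b2 < 3:
    shape = "platykurtic"
else:
    shape = "mesokurtic"
print(m2, m3, m4, b1, round(b2, 4), shape)
```

Here m3 and b1 are zero, as expected for a symmetrical distribution, and b2 is below 3, so this particular distribution is platykurtic.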
MEASURES OF CENTRAL TENDENCY
MODE
Meaning
The mode refers to that value in a distribution which occurs most frequently. It is an actual value,
which has the highest concentration of items in and around it.
Computation of the Mode
1. Ungrouped or Raw Data
For ungrouped data or a series of individual observations, mode is often found by mere inspection.
Example 1:
2, 7, 10, 15, 10, 17, 8, 10, 2
∴ Mode = M0 = 10
In some cases the mode may be absent while in some cases there may be more than one mode.
Example 2:
1) 12, 10, 15, 24, 30 (no mode)
2) 7, 10, 15, 12, 7, 14, 24, 10, 7, 20, 10
∴ The modes are 7 and 10
2. Grouped Data
a) Discrete Distribution
For a discrete distribution, find the highest frequency; the corresponding value of X is the mode. A
discrete variable is one whose outcomes are measured in fixed numbers.
b) Continuous Distribution
Find the highest frequency; the corresponding class interval is called the modal
class. Then apply the following formula:
$M_0 = l_1 + \frac{f_1 - f_0}{(f_1 - f_0) + (f_1 - f_2)} \times i$
Where: l1 = the lower limit of the class in which the mode lies
f1 = the frequency of the class in which the mode lies
f0 = the frequency of the class preceding the modal class
f2 = the frequency of the class succeeding the modal class
i = the class interval of the modal class
NOTE: While applying the above formula, we should ensure that the class-intervals are uniform
throughout. If the class-intervals are not uniform, then they should be made uniform on the
assumption that the frequencies are evenly distributed throughout the class.
Example 3:
Let us take the following frequency distribution:
Class Intervals Frequency
30−40 4
40−50 6
50−60 8
60−70 12
70−80 9
80−90 7
90−100 4
Required:
Calculate the mode in respect of this series.
Solution
$M_0 = 60 + \frac{12 - 8}{(12 - 8) + (12 - 9)} \times 10 = 60 + \frac{4}{4 + 3} \times 10 = 65.7$ approx.
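The calculation in Example 3 can be reproduced with a small function. The name `grouped_mode` is illustrative. The sketch assumes uniform class intervals and simply takes the first class with the highest frequency as the modal class, so it does not handle cases where the method of grouping is needed. A missing neighbouring class is treated as having frequency 0.

```python
def grouped_mode(lower_limits, frequencies, width):
    """Mode = l1 + (f1 - f0) / ((f1 - f0) + (f1 - f2)) * i."""
    m = max(range(len(frequencies)), key=frequencies.__getitem__)  # modal class index
    f1 = frequencies[m]
    f0 = frequencies[m - 1] if m > 0 else 0                      # preceding class
    f2 = frequencies[m + 1] if m < len(frequencies) - 1 else 0   # succeeding class
    denominator = (f1 - f0) + (f1 - f2)
    if denominator == 0:
        raise ValueError("modal class not distinct from its neighbours")
    return lower_limits[m] + (f1 - f0) / denominator * width


lower_limits = [30, 40, 50, 60, 70, 80, 90]
frequencies = [4, 6, 8, 12, 9, 7, 4]
mode = grouped_mode(lower_limits, frequencies, width=10)
print(round(mode, 1))
```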
3. Determination of Modal Class
For a frequency distribution, the modal class corresponds to the maximum frequency. But it is not
possible to identify by inspection the class where the mode lies in any one (or more) of the
following cases:
i. If the maximum frequency is repeated.
ii. If the maximum frequency occurs at the beginning or at the end of the distribution.
iii. If there are irregularities in the distribution.
In such cases, the modal class is determined by the method of grouping.
Steps for Calculation
1. Prepare a grouping table with 6 columns.
2. In column I, write down the given frequencies.
3. Column II is obtained by combining the frequencies two by two.
4. Leave the 1st frequency and combine the remaining frequencies two by two and write in column
III.
5. Column IV is obtained by combining the frequencies three by three.
6. Leave the 1st frequency and combine the remaining frequencies three by three and write in
column V.
7. Leave the 1st and 2nd frequencies and combine the remaining frequencies three by three and
write in column VI.
8. Mark the highest frequency in each column.
9. Form an analysis table to find the modal class.
10. After finding the modal class use the formula to calculate the modal value.
Example 4
Calculate the mode for the following frequency distribution.
Class Interval 0−5 5−10 10−15 15−20 20−25 25−30 30−35 35−40
Frequency 9 12 15 16 17 15 10 13
Solution
Grouping Table
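Class Interval    I     II    III   IV    V     VI
0−5               9
5−10              12    21          36
10−15             15          27          43
15−20             16    31                      48
20−25             17          33    48
25−30             15    32                42
30−35             10          25                38
35−40             13    23
(Column I holds the given frequencies. Each sum of two is written against the second class of its pair,
and each sum of three against the middle class of its group.)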
Analysis Table
Column   0−5   5−10   10−15   15−20   20−25   25−30   30−35   35−40
I                                      1
II                                     1       1
III                            1       1
IV                             1       1       1
V              1      1        1
VI                    1        1       1
Total    0     1      2        4       5       2       0       0
The maximum total occurs for the class 20−25, and hence it is the modal class.
$M_0 = 20 + \frac{17 - 16}{(17 - 16) + (17 - 15)} \times 5 = 20 + \frac{1}{1 + 2} \times 5 = 21.67$ approx.
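The grouping and analysis tables can be generated mechanically. The sketch below is one way to do it, not a prescribed procedure: for each of the six columns it forms the group sums, finds the largest, and gives one mark to every class inside a group that attains it (ties are all marked). The function name `grouping_totals` is illustrative.

```python
def grouping_totals(frequencies):
    """Return, per class, the number of columns in which it lies in a maximal group."""
    n = len(frequencies)
    totals = [0] * n
    # (group size, classes skipped at the start) for columns I to VI
    columns = [(1, 0), (2, 0), (2, 1), (3, 0), (3, 1), (3, 2)]
    for size, skip in columns:
        groups = [(sum(frequencies[i:i + size]), i)
                  for i in range(skip, n - size + 1, size)]
        if not groups:              # too few classes for this column
            continue
        best = max(total for total, _ in groups)
        for total, start in groups:
            if total == best:
                for j in range(start, start + size):
                    totals[j] += 1
    return totals


frequencies = [9, 12, 15, 16, 17, 15, 10, 13]   # Example 4
totals = grouping_totals(frequencies)
modal_index = totals.index(max(totals))
print(totals, modal_index)
```

The returned list corresponds to the "Total" row of the analysis table. The class with the largest total is taken as the modal class; if two classes tie, the modal class is not determined by this method.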
Example 5
The following table gives some frequency data:
Size of Item   Frequency   Cumulative Frequency
10−20          10          10
20−30          18          28
30−40          25          53
40−50          26          79
50−60          17          96
60−70          4           100
Total          100
Required:
Calculate the mode
Solution
Grouping Table
Class Interval    I     II    III   IV    V     VI
10−20             10
20−30             18    28          53
30−40             25          43          69
40−50             26    51                      68
50−60             17          43    47
60−70             4     21
Analysis Table
Column   10−20   20−30   30−40   40−50   50−60   60−70
I                                1
II                       1       1
III              1       1       1       1
IV       1       1       1
V                1       1       1
VI                       1       1       1
Total    1       3       5       5       2       0
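The classes 30−40 and 40−50 both have the maximum total of 5, so the modal class is not uniquely
determined. The mode is therefore estimated from the empirical relationship:
Mode = 3 median − 2 mean
Median position = $\frac{n + 1}{2} = \frac{100 + 1}{2}$ = 50.5th item
This lies in the class 30−40.
$\text{Median} = l_1 + \frac{l_2 - l_1}{f}(m - c) = 30 + \frac{40 - 30}{25}(50.5 - 28) = 30 + 9 = 39$
where l1 and l2 are the lower and upper limits of the median class, f is its frequency, m is the position
of the median item and c is the cumulative frequency of the preceding class.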
Calculation of Arithmetic Mean
Class- Interval Frequency Mid- Points d d'=d/10 fd’
10−20 10 15 −20 −2 −20
20−30 18 25 −10 −1 −18
30−40 25 35 0 0 0
40−50 26 45 10 1 26
50−60 17 55 20 2 34
60−70 4 65 30 3 12
Total 100 34
Assumed mean A = 35
$\text{Mean} = A + \frac{\sum fd'}{n} \times i = 35 + \frac{34}{100} \times 10 = 38.4$
Mode = 3 median − 2 mean = 3(39) − 2(38.4) = 117 − 76.8 = 40.2
Merits of Mode
1. It is easy to calculate and in some cases it can be located by mere inspection.
2. Mode is not at all affected by extreme values.
3. It can be calculated for open-end classes.
4. It is usually an actual value of an important part of the series.
5. In some circumstances it is the best representative of data.
Demerits of Mode
1. It is not based on all observations.
2. It is not capable of further mathematical treatment.
3. Mode is generally ill-defined; in some cases it is not possible to find the mode.
4. As compared with the mean, the mode is affected to a great extent by sampling fluctuations.
5. It is unsuitable in cases where relative importance of items has to be considered.
QUARTILES
Meaning
The quartiles divide the distribution in four parts. There are three quartiles. The second quartile
(Q2) divides the distribution into two halves and therefore is the same as the median. The first
(lower) quartile (Q1) marks off the first one-fourth, the third (upper) quartile (Q3) marks off the
three-fourth. In other words, the three quartiles Q1, Q2 and Q3 are such that 25 percent of the data
fall below Q1, 25 percent fall between Q1 and Q2, 25 percent fall between Q2 and Q3 and 25 percent
fall above Q3.
Computation of Quartiles
1. Raw or Ungrouped Data
First arrange the given data in the increasing order and use the formula for Q1 and Q3.
$Q_1 = \left(\frac{n + 1}{4}\right)$th item
$Q_3 = 3\left(\frac{n + 1}{4}\right)$th item
Example 1
Compute quartiles for the data given below:
25,18,30, 8, 15, 5, 10, 35, 40, 45
Solution
Arranged in increasing order: 5, 8, 10, 15, 18, 25, 30, 35, 40, 45
$Q_1 = \left(\frac{n + 1}{4}\right)$th item $= \left(\frac{10 + 1}{4}\right)$th item = (2.75)th item
$Q_1$ = 2nd item + $\frac{3}{4}$(3rd item − 2nd item)
$Q_1 = 8 + \frac{3}{4}(10 - 8) = 9.5$
$Q_3 = 3\left(\frac{n + 1}{4}\right)$th item = 3(2.75)th item = (8.25)th item
$Q_3$ = 8th item + $\frac{1}{4}$(9th item − 8th item)
$Q_3 = 35 + \frac{1}{4}(40 - 35) = 36.25$
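The same steps can be written as a short sketch: sort the data, find the fractional position, and interpolate between the two neighbouring items. The helper names `value_at_position` and `quartiles` are illustrative.

```python
def value_at_position(sorted_values, position):
    """Value at a 1-based, possibly fractional, position in sorted data."""
    whole = int(position)
    fraction = position - whole
    if whole < 1 or whole > len(sorted_values):
        raise ValueError("position outside the data")
    value = sorted_values[whole - 1]
    if fraction > 0:
        if whole == len(sorted_values):
            raise ValueError("position outside the data")
        # Move the given fraction of the way towards the next item
        value += fraction * (sorted_values[whole] - value)
    return value


def quartiles(values):
    """Q1 and Q3 using the (n + 1)/4 and 3(n + 1)/4 positions."""
    data = sorted(values)
    n = len(data)
    q1 = value_at_position(data, (n + 1) / 4)
    q3 = value_at_position(data, 3 * (n + 1) / 4)
    return q1, q3


q1, q3 = quartiles([25, 18, 30, 8, 15, 5, 10, 35, 40, 45])
print(q1, q3)
```

Note that statistical packages use several different quartile conventions, so library results may differ slightly from this (n + 1)/4 rule.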
2. Discrete Series
Step 1: Find cumulative frequencies.
Step 2: Find $\left(\frac{n + 1}{4}\right)$.
Step 3: In the cumulative frequencies, find the value just greater than $\left(\frac{n + 1}{4}\right)$; the corresponding
value of x is Q1.
Step 4: Find $3\left(\frac{n + 1}{4}\right)$.
Step 5: In the cumulative frequencies, find the value just greater than $3\left(\frac{n + 1}{4}\right)$; the
corresponding value of x is Q3.
Example 2
Compute quartiles for the data given below:
X 5 8 12 15 19 24 30
F 4 3 2 4 5 2 4
Solution :
X F CF
5 4 4
8 3 7
12 2 9
15 4 13
19 5 18
24 2 20
30 4 24
Total 24
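$Q_1 = \left(\frac{N + 1}{4}\right)$th item $= \left(\frac{24 + 1}{4}\right)$th item = 6.25th item
$Q_3 = 3\left(\frac{N + 1}{4}\right)$th item $= 3\left(\frac{25}{4}\right)$th item = 18.75th item
The cumulative frequency just greater than 6.25 is 7, and the one just greater than 18.75 is 20. Hence
Q1 = 8; Q3 = 24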
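For a discrete series the lookup in the cumulative frequencies can be sketched as below. The function name `discrete_quantile` is illustrative. The comparison uses "greater than or equal to" so that a position falling exactly on a cumulative frequency still maps to that value of x; this is an assumption about how ties are handled.

```python
def discrete_quantile(xs, frequencies, position):
    """Return the x whose cumulative frequency first reaches the position."""
    cumulative = 0
    for x, f in zip(xs, frequencies):
        cumulative += f
        if cumulative >= position:
            return x
    raise ValueError("position exceeds the total frequency")


xs = [5, 8, 12, 15, 19, 24, 30]
frequencies = [4, 3, 2, 4, 5, 2, 4]
n = sum(frequencies)

q1 = discrete_quantile(xs, frequencies, (n + 1) / 4)        # 6.25th item
q3 = discrete_quantile(xs, frequencies, 3 * (n + 1) / 4)    # 18.75th item
print(q1, q3)
```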
3. Continuous Series
Step 1: Find cumulative frequencies.
Step 2: Find $\frac{N}{4}$.
Step 3: In the cumulative frequencies, find the value just greater than $\frac{N}{4}$; the corresponding
class interval is called the first quartile class.
Step 4: Find $3\left(\frac{N}{4}\right)$.
Step 5: In the cumulative frequencies, find the value just greater than $3\left(\frac{N}{4}\right)$; the corresponding
class interval is called the third quartile class.
Step 6: Apply the respective formulae.
$Q_1 = l_1 + \frac{\frac{N}{4} - m_1}{f_1} \times c_1$
$Q_3 = l_3 + \frac{3\left(\frac{N}{4}\right) - m_3}{f_3} \times c_3$
Where: l1 = lower limit of the first quartile class
f1 = frequency of the first quartile class
c1 = width of the first quartile class
m1 = cf preceding the first quartile class
l3 = lower limit of the third quartile class
f3 = frequency of the third quartile class
c3 = width of the third quartile class
m3 = cf preceding the third quartile class
Example 3
The following series relates to the marks secured by students in an examination.
Marks Number of Students
0−10 11
10−20 18
20−30 25
30−40 28
40−50 30
50−60 33
60−70 22
70−80 15
80−90 12
90−100 10
Required:
Find the quartiles.
Solution:
Marks Number of Students Cumulative Frequency
0−10 11 11
10−20 18 29
20−30 25 54
30−40 28 82
40−50 30 112
50−60 33 145
60−70 22 167
70−80 15 182
80−90 12 194
90−100 10 204
Total 204
$\frac{N}{4} = \frac{204}{4} = 51$; $3\left(\frac{N}{4}\right) = 153$
$Q_1 = 20 + \frac{51 - 29}{25} \times 10 = 28.8$
$Q_3 = 60 + \frac{153 - 145}{22} \times 10 = 63.64$
PERCENTILES
The percentile values divide the distribution into 100 parts each containing 1 percent of the cases.
The percentile (Pk) is that value of the variable up to which lie exactly k% of the total number of
observations.
1. Percentile for Raw Data or Ungrouped Data
Relationship :
P25 = Q1 ; P50 = Q2 = Median and P75 = Q3
Example 4
Calculate P15 for the data given below:
5, 24 , 36 , 12 , 20 , 8
Solution:
Arranging the given values in the increasing order.
5, 8, 12, 20, 24, 36
$P_{15} = \left(\frac{15(n + 1)}{100}\right)$th item $= \left(\frac{15(6 + 1)}{100}\right)$th item $= \left(\frac{15 \times 7}{100}\right)$th item = (1.05)th item
P15 = 1st item + 0.05(2nd item − 1st item)
P15 = 5 + 0.05(8 − 5) = 5.15
2. Percentile for Grouped Data
Example 5
Find P53 for the following frequency distribution:
Class Interval 0−5 5−10 10−15 15−20 20−25 25−30 30−35 35−40
Frequency 5 8 12 16 20 10 4 3
Solution :
Class Interval   Frequency   Cumulative Frequency
0−5              5           5
5−10             8           13
10−15            12          25
15−20            16          41
20−25            20          61
25−30            10          71
30−35            4           75
35−40            3           78
Total            78
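$P_{53} = l_1 + \frac{\frac{53N}{100} - m}{f} \times c$
Here $\frac{53N}{100} = \frac{53(78)}{100} = 41.34$. The cumulative frequency just greater than 41.34 is 61, so the
percentile class is 20−25, with l1 = 20, m = 41, f = 20 and c = 5.
$P_{53} = 20 + \frac{41.34 - 41}{20} \times 5 = 20.085$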
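Quartiles and percentiles of a continuous series share one interpolation rule, so a single sketch covers both. The function name `grouped_quantile` is illustrative; it assumes equal class widths and takes the required fraction of N (0.53 for P53, 0.25 for Q1, 0.75 for Q3).

```python
def grouped_quantile(lower_limits, frequencies, width, fraction):
    """l + ((fraction * N - m) / f) * c for the class containing the target."""
    total = sum(frequencies)
    target = fraction * total
    before = 0                      # m: cumulative frequency before the class
    for lower, f in zip(lower_limits, frequencies):
        if f > 0 and before + f >= target:
            return lower + (target - before) / f * width
        before += f
    raise ValueError("fraction must lie between 0 and 1")


# Example 5: P53
limits = [0, 5, 10, 15, 20, 25, 30, 35]
freqs = [5, 8, 12, 16, 20, 10, 4, 3]
p53 = grouped_quantile(limits, freqs, width=5, fraction=0.53)

# Marks distribution from the quartiles example (N = 204)
mark_limits = list(range(0, 100, 10))
mark_freqs = [11, 18, 25, 28, 30, 33, 22, 15, 12, 10]
q1 = grouped_quantile(mark_limits, mark_freqs, width=10, fraction=0.25)
q3 = grouped_quantile(mark_limits, mark_freqs, width=10, fraction=0.75)

print(round(p53, 3), round(q1, 2), round(q3, 2))
```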
BPCU004: ADVANCED BUSINESS STATISTICS
MEASURES OF DISPERSION
MEANING
Dispersion (also known as scatter, spread or variation) measures the extent to which the items
vary from some central value.
SIGNIFICANCE OF MEASURING VARIATION
1. Measures of variation point out as to how far an average is representative of the mass.
2. Measures of dispersion determine nature and cause of variation in order to control the
variation itself.
3. Measures of dispersion enable a comparison to be made of two or more series with regard
to their variability.
4. Measures of dispersion are the basis of many powerful analytical tools in statistics such as
correlation analysis, testing of hypothesis, analysis of variance, the statistical quality control
and regression analysis.
Characteristics/Properties of a Good Measure of Dispersion
1. It should be simple to understand.
2. It should be easy to compute.
3. It should be rigidly defined.
4. It should be based on each and every item of the distribution.
5. It should be amenable to further algebraic treatment.
6. It should have sampling stability.
7. Extreme items should not unduly affect it.
ABSOLUTE AND RELATIVE MEASURES OF DISPERSION
There are two kinds of measures of dispersion, namely:
1. Absolute measure of dispersion.
2. Relative measure of dispersion.
Absolute measure of dispersion indicates the amount of variation in a set of values in terms of
units of observations. For example, when rainfalls on different days are available in mm, any
absolute measure of dispersion gives the variation in rainfall in mm. On the other hand relative
measures of dispersion are free from the units of measurements of the observations. They are
pure numbers. They are used to compare the variation in two or more sets, which are having
different units of measurements of observations.
Absolute measure          Relative measure
1. Range                  1. Co-efficient of Range
2. Quartile deviation     2. Co-efficient of Quartile deviation
3. Mean deviation         3. Co-efficient of Mean deviation
4. Standard deviation     4. Co-efficient of variation
RANGE AND COEFFICIENT OF RANGE
1. Range
This is the simplest possible measure of dispersion and is defined as the difference between the
largest and smallest values of the variable.
Range = L − S
Where: L = Largest Value
S = Smallest Value
In individual observations and discrete series, L and S are easily identified. In continuous series,
the following two methods are followed.
Method 1:
L = Upper boundary of the highest class
S = Lower boundary of the lowest class
Method 2:
L = Mid value of the highest class
S = Mid value of the lowest class
2. Co-efficient of Range
$$\text{Coefficient of Range} = \frac{L - S}{L + S}$$
Example 1
Find the value of the range and its co-efficient for the following data.
7, 9, 6, 8, 11, 10, 4
Solution:
L = 11, S = 4
Range = L − S = 11 − 4 = 7
$$\text{Coefficient of Range} = \frac{L - S}{L + S} = \frac{11 - 4}{11 + 4} = \frac{7}{15} = 0.4667$$
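Example 2:
Calculate the range and its co-efficient from the following distribution.
Size   :  60−63   63−66   66−69   69−72   72−75
Number :  5       18      42      27      8
Solution:
Using Method 1, L = 75 (upper boundary of the highest class) and S = 60 (lower boundary of the lowest class).
Range = L − S = 75 − 60 = 15
$$\text{Coefficient of Range} = \frac{L - S}{L + S} = \frac{75 - 60}{75 + 60} = \frac{15}{135} = 0.1111$$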
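The two formulas above can be checked with a few lines of Python. This is only a minimal sketch: the function name `range_and_coefficient` is an illustrative choice, not part of the notes, and it assumes L + S is not zero.

```python
def range_and_coefficient(values):
    """Return (range, coefficient of range) for a list of observations."""
    largest, smallest = max(values), min(values)
    value_range = largest - smallest                   # Range = L - S
    coefficient = value_range / (largest + smallest)   # (L - S) / (L + S)
    return value_range, coefficient

# Example 2 (Method 1): boundaries of the lowest and highest classes
r, c = range_and_coefficient([60, 75])
print(r, round(c, 4))  # 15 0.1111
```

For a continuous series, only the two extreme boundaries (or mid values, under Method 2) need to be passed in.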
Merits
1. It is simple to understand.
2. It is easy to calculate.
3. In certain types of problems, such as quality control, weather forecasts and share price analysis,
the range is widely used.
Demerits:
1. It is very much affected by the extreme items.
2. It is based on only two extreme observations.
3. It cannot be calculated from open-end class intervals.
4. It is not suitable for mathematical treatment.
5. It is a very rarely used measure.
QUARTILE DEVIATION AND CO-EFFICIENT OF QUARTILE DEVIATION
1. Quartile Deviation (Q.D)
Definition: Quartile deviation is half of the difference between the third and first quartiles.
Hence, it is also called the semi-interquartile range.
$$Q.D. = \frac{Q_3 - Q_1}{2}$$
Among the quartiles Q1, Q2 and Q3, the difference Q3 − Q1 is called the interquartile range and
(Q3 − Q1)/2 the semi-interquartile range.
2. Co-efficient of Quartile Deviation
$$\text{Co-efficient of Q.D.} = \frac{Q_3 - Q_1}{Q_3 + Q_1}$$
Example 3
Find the quartile deviation for the following data:
391, 384, 591, 407, 672, 522, 777, 733, 1490, 2488
Solution:
Arrange the given values in ascending order.
384, 391, 407, 522, 591, 672, 733, 777, 1490, 2488
$$\text{Position of } Q_1 = \frac{N + 1}{4} = \frac{10 + 1}{4} = 2.75\text{th item}$$
Q1 = 2nd item + 0.75 (3rd item − 2nd item)
Q1 = 391 + 0.75 (407 − 391) = 403
$$\text{Position of } Q_3 = 3\left(\frac{N + 1}{4}\right) = 3(2.75) = 8.25\text{th item}$$
Q3 = 8th item + 0.25 (9th item − 8th item)
Q3 = 777 + 0.25 (1490 − 777) = 955.25
$$Q.D. = \frac{Q_3 - Q_1}{2} = \frac{955.25 - 403}{2} = 276.125$$
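The positional method used in Example 3 can be sketched in Python as follows. The helper names `item_at_position` and `quartile_deviation` are illustrative assumptions, and the sketch assumes the series is long enough for both quartile positions to fall inside it.

```python
def item_at_position(sorted_values, position):
    """Value at a 1-based, possibly fractional position, by linear interpolation."""
    lower = int(position)            # whole part, e.g. 2 for the 2.75th item
    fraction = position - lower      # fractional part, e.g. 0.75
    value = sorted_values[lower - 1]
    if fraction > 0:
        value += fraction * (sorted_values[lower] - sorted_values[lower - 1])
    return value

def quartile_deviation(values):
    """Return (Q1, Q3, Q.D., coefficient of Q.D.) using the (N + 1)/4 rule."""
    data = sorted(values)
    n = len(data)
    q1 = item_at_position(data, (n + 1) / 4)
    q3 = item_at_position(data, 3 * (n + 1) / 4)
    return q1, q3, (q3 - q1) / 2, (q3 - q1) / (q3 + q1)

q1, q3, qd, coeff = quartile_deviation(
    [391, 384, 591, 407, 672, 522, 777, 733, 1490, 2488])
print(q1, q3, qd)  # 403.0 955.25 276.125
```

Other software may use a different interpolation rule for quartiles, so results from library functions need not match the (N + 1)/4 rule exactly.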
Example 4
The weekly wages of labourers are given below. Calculate the Q.D. and the coefficient of Q.D.
Weekly Wage (Kshs.)   100   200   400   500   600
No. of Weeks          5     8     21    12    6
Solution:
Weekly Wage (Kshs.)   No. of Weeks   Cum. No. of Weeks
100                   5              5
200                   8              13
400                   21             34
500                   12             46
600                   6              52
Total                 52
$$\text{Position of } Q_1 = \frac{N + 1}{4} = \frac{52 + 1}{4} = 13.25\text{th item}$$
Q1 = 13th item + 0.25 (14th item − 13th item)
Q1 = 200 + 0.25 (400 − 200) = 250
$$\text{Position of } Q_3 = 3\left(\frac{N + 1}{4}\right) = 3(13.25) = 39.75\text{th item}$$
Q3 = 39th item + 0.75 (40th item − 39th item)
The 39th and 40th items both lie in the 500 group (cumulative positions 35 to 46), so
Q3 = 500 + 0.75 (500 − 500) = 500
$$Q.D. = \frac{Q_3 - Q_1}{2} = \frac{500 - 250}{2} = 125$$
$$\text{Co-efficient of Q.D.} = \frac{Q_3 - Q_1}{Q_3 + Q_1} = \frac{500 - 250}{500 + 250} = \frac{250}{750} = 0.3333$$
Example 5
For the data given below, find the quartile deviation and the coefficient of quartile deviation.
X   351−500   501−650   651−800   801−950   951−1100
F   48        189       88        47        28
Solution:
X           True class interval   F     Cumulative frequency
351−500     350.5−500.5           48    48
501−650     500.5−650.5           189   237
651−800     650.5−800.5           88    325
801−950     800.5−950.5           47    372
951−1100    950.5−1100.5          28    400
Total                             400
Position of Q1 = N/4 = 400/4 = 100, so Q1 lies in the class 500.5−650.5.
Position of Q3 = 3(N/4) = 3(100) = 300, so Q3 lies in the class 650.5−800.5.
$$Q_1 = l_1 + \frac{\frac{N}{4} - m_1}{f_1} \times c_1 = 500.5 + \frac{100 - 48}{189} \times 150 = 541.77$$
$$Q_3 = l_3 + \frac{3\left(\frac{N}{4}\right) - m_3}{f_3} \times c_3 = 650.5 + \frac{300 - 237}{88} \times 150 = 757.89$$
Here l is the lower boundary of the quartile class, m the cumulative frequency of the preceding
class, f the frequency of the quartile class and c its width.
$$Q.D. = \frac{Q_3 - Q_1}{2} = \frac{757.89 - 541.77}{2} = 108.06$$
$$\text{Co-efficient of Q.D.} = \frac{Q_3 - Q_1}{Q_3 + Q_1} = \frac{757.89 - 541.77}{757.89 + 541.77} = 0.1663$$
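The interpolation formula for a continuous series can be sketched as below. The function `grouped_quartile` and its argument layout are assumptions made for illustration; it expects true class boundaries and walks the cumulative frequencies until the quartile class is found.

```python
def grouped_quartile(boundaries, frequencies, k):
    """k-th quartile (k = 1, 2, 3) of a grouped frequency distribution.

    boundaries  -- list of (lower, upper) true class boundaries
    frequencies -- class frequencies in the same order
    """
    n = sum(frequencies)
    target = k * n / 4                # position k(N/4)
    cumulative = 0                    # cumulative frequency of preceding classes (m)
    for (lower, upper), f in zip(boundaries, frequencies):
        if cumulative + f >= target:
            # l + ((k N/4 - m) / f) * c
            return lower + (target - cumulative) / f * (upper - lower)
        cumulative += f
    raise ValueError("target position lies beyond the last class")

classes = [(350.5, 500.5), (500.5, 650.5), (650.5, 800.5),
           (800.5, 950.5), (950.5, 1100.5)]
freqs = [48, 189, 88, 47, 28]
q1 = grouped_quartile(classes, freqs, 1)
q3 = grouped_quartile(classes, freqs, 3)
qd = (q3 - q1) / 2
coeff = (q3 - q1) / (q3 + q1)
print(round(q1, 2), round(q3, 2), round(qd, 2), round(coeff, 4))
```

Passing k = 2 gives the median by the same formula.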
Merits of Quartile Deviation
1. It is simple to understand and easy to calculate.
2. It is not affected by extreme values.
3. It can be calculated for data with open end classes also.
Demerits of Quartile Deviation
1. It is not based on all the items. It is based on two positional values Q1 and Q3 and ignores
the extreme 50% of the items.
2. It is not amenable to further mathematical treatment.
3. It is affected by sampling fluctuations.
MEAN DEVIATION AND COEFFICIENT OF MEAN DEVIATION
1. Mean Deviation
The mean deviation is a measure of dispersion based on all the items in a distribution. It is
the arithmetic mean of the absolute deviations of the items from a measure of central tendency
(the mean, median or mode); all the deviations are taken as positive, i.e., signs are
ignored. In practice, because the mean is so widely used, the mean deviation is
generally computed from the mean. Mean deviation is denoted by M.D.
2. Coefficient of mean deviation:
Mean deviation calculated by any measure of central tendency is an absolute measure. For the
purpose of comparing variation among different series, a relative mean deviation is required.
The relative mean deviation is obtained by dividing the mean deviation by the average used for
calculating mean deviation.
$$\text{Co-efficient of Mean Deviation} = \frac{\text{Mean Deviation}}{\text{Mean or Median or Mode}}$$
If the result is desired as a percentage, the coefficient of mean deviation is multiplied by 100:
$$\text{Co-efficient of Mean Deviation} = \frac{\text{Mean Deviation}}{\text{Mean or Median or Mode}} \times 100$$
COMPUTATION OF MEAN DEVIATION
1. Individual Series
a. Calculate the average (mean, median or mode) of the series.
b. Take the deviations of the items from the average, ignoring signs, and denote these deviations
by |D|.
c. Compute the total of these deviations, i.e., Σ|D|.
d. Divide this total by the number of items.
$$M.D. = \frac{\sum |D|}{n}$$
Example 6
Calculate the mean deviation from the mean and from the median for the following data: 100, 150, 200, 250,
360, 490, 500, 600, 671. Also calculate the coefficients of M.D.
Solution:
$$\text{Mean} = \frac{\sum X}{N} = \frac{3321}{9} = 369$$
Now arrange the data in ascending order:
100, 150, 200, 250, 360, 490, 500, 600, 671
$$\text{Median} = \text{Value of } \left(\frac{n + 1}{2}\right)\text{th item} = \text{Value of } \left(\frac{9 + 1}{2}\right)\text{th item} = \text{Value of 5th item} = 360$$
X        |D| = |X − Mean|   |D| = |X − Median|
100      269                260
150      219                210
200      169                160
250      119                110
360      9                  0
490      121                130
500      131                140
600      231                240
671      302                311
Total 3321   1570           1561
$$\text{M.D. from mean} = \frac{\sum |D|}{n} = \frac{1570}{9} = 174.44$$
$$\text{Co-efficient of M.D.} = \frac{\text{M.D.}}{\text{Mean}} = \frac{174.44}{369} = 0.47$$
$$\text{M.D. from median} = \frac{\sum |D|}{n} = \frac{1561}{9} = 173.44$$
$$\text{Co-efficient of M.D.} = \frac{\text{M.D.}}{\text{Median}} = \frac{173.44}{360} = 0.48$$
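As a quick check of Example 6, the mean deviation about any chosen average can be computed with Python's standard `statistics` module. The function name `mean_deviation` is an illustrative choice.

```python
from statistics import mean, median

def mean_deviation(values, centre):
    """Arithmetic mean of the absolute deviations from the chosen average."""
    return sum(abs(x - centre) for x in values) / len(values)

data = [100, 150, 200, 250, 360, 490, 500, 600, 671]

md_mean = mean_deviation(data, mean(data))        # deviations from the mean (369)
md_median = mean_deviation(data, median(data))    # deviations from the median (360)

print(round(md_mean, 2), round(md_mean / mean(data), 2))        # 174.44 0.47
print(round(md_median, 2), round(md_median / median(data), 2))  # 173.44 0.48
```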
2. Mean Deviation: Discrete Series
Step 1: Find the average (mean, median or mode).
Step 2: Find the deviations of the variable values from the average, ignoring signs, and denote
them by |D|.
Step 3: Multiply each deviation by its respective frequency and find the total Σf|D|.
Step 4: Divide Σf|D| by the total frequency N:
$$M.D. = \frac{\sum f|D|}{N}$$
Example 7
Compute the mean deviation from the mean and from the median for the following data:
Height in cm     158   159   160   161   162   163   164   165   166
No. of persons   15    20    32    35    33    22    20    10    8
Also compute the coefficient of mean deviation.
Solution:
Height (X)   No. of persons (f)   d = X − A (A = 162)   fd     |D| = |X − mean|   f|D|
158          15                   −4                    −60    3.51               52.65
159          20                   −3                    −60    2.51               50.20
160          32                   −2                    −64    1.51               48.32
161          35                   −1                    −35    0.51               17.85
162          33                   0                     0      0.49               16.17
163          22                   1                     22     1.49               32.78
164          20                   2                     40     2.49               49.80
165          10                   3                     30     3.49               34.90
166          8                    4                     32     4.49               35.92
Total        195                                        −95                       338.59
$$\text{Mean} = A + \frac{\sum fd}{N} = 162 + \frac{-95}{195} = 161.51$$
$$M.D. = \frac{\sum f|D|}{N} = \frac{338.59}{195} = 1.74$$
$$\text{Co-efficient of M.D.} = \frac{\text{M.D.}}{\text{Mean}} = \frac{1.74}{161.51} = 0.0108$$
Height (X)   No. of persons (f)   c.f.   |D| = |X − median|   f|D|
158          15                   15     3                    45
159          20                   35     2                    40
160          32                   67     1                    32
161          35                   102    0                    0
162          33                   135    1                    33
163          22                   157    2                    44
164          20                   177    3                    60
165          10                   187    4                    40
166          8                    195    5                    40
Total        195                                              334
$$\text{Median} = \text{Size of } \left(\frac{N + 1}{2}\right)\text{th item} = \text{Size of } \left(\frac{195 + 1}{2}\right)\text{th item} = \text{Size of 98th item} = 161$$
$$M.D. = \frac{\sum f|D|}{N} = \frac{334}{195} = 1.71$$
$$\text{Co-efficient of M.D.} = \frac{\text{M.D.}}{\text{Median}} = \frac{1.71}{161} = 0.0106$$
3. Mean Deviation: Continuous Series
The method of calculating the mean deviation in a continuous series is the same as in a discrete series. In
a continuous series we first find the mid-points of the various classes and then take the deviations
of these mid-points from the average selected. Thus
$$M.D. = \frac{\sum f|D|}{N}$$
Where: D = m − Average; m = mid-point of the class
Example 8:
Find the mean deviation from the mean and from the median for the following series.
Age in years   No. of persons
0−10           20
10−20          25
20−30          32
30−40          40
40−50          42
50−60          35
60−70          10
70−80          8
Also compute the co-efficient of mean deviation.
Solution:
x       m    f     d = (m − A)/c (A = 35, c = 10)   fd     |D| = |m − mean|   f|D|
0−10    5    20    −3                               −60    31.5               630.0
10−20   15   25    −2                               −50    21.5               537.5
20−30   25   32    −1                               −32    11.5               368.0
30−40   35   40    0                                0      1.5                60.0
40−50   45   42    1                                42     8.5                357.0
50−60   55   35    2                                70     18.5               647.5
60−70   65   10    3                                30     28.5               285.0
70−80   75   8     4                                32     38.5               308.0
Total        212                                    32                        3193.0
$$\text{Mean} = A + \frac{\sum fd}{N} \times c = 35 + \frac{32}{212} \times 10 = 36.5$$
$$M.D. = \frac{\sum f|D|}{N} = \frac{3193}{212} = 15.06$$
$$\text{Co-efficient of M.D.} = \frac{\text{M.D.}}{\text{Mean}} = \frac{15.06}{36.5} = 0.41$$
Calculation of Median and M.D. from Median
x       m    f     c.f.   |D| = |m − Md|   f|D|
0−10    5    20    20     32.25            645.00
10−20   15   25    45     22.25            556.25
20−30   25   32    77     12.25            392.00
30−40   35   40    117    2.25             90.00
40−50   45   42    159    7.75             325.50
50−60   55   35    194    17.75            621.25
60−70   65   10    204    27.75            277.50
70−80   75   8     212    37.75            302.00
Total        212                           3209.50
Position of the median = N/2 = 212/2 = 106th item, which lies in the class 30−40.
$$\text{Median} = l + \frac{\frac{N}{2} - m}{f} \times c = 30 + \frac{106 - 77}{40} \times 10 = 37.25$$
$$M.D. = \frac{\sum f|D|}{N} = \frac{3209.5}{212} = 15.14$$
$$\text{Co-efficient of M.D.} = \frac{\text{M.D.}}{\text{Median}} = \frac{15.14}{37.25} = 0.41$$
Merits of M.D.
1. It is simple to understand and easy to compute.
2. It is rigidly defined.
3. It is based on all items of the series.
4. It is not much affected by the fluctuations of sampling.
5. It is less affected by the extreme items.
6. It is flexible, because it can be calculated from any average.
7. It is a better measure for comparison.
Demerits of M.D.
1. It is not a very accurate measure of dispersion.
2. It is not suitable for further mathematical calculation.
3. It is rarely used. It is not as popular as standard deviation.
4. Algebraic positive and negative signs are ignored. It is mathematically unsound and
illogical.
STANDARD DEVIATION AND COEFFICIENT OF VARIATION
1. Definition
The standard deviation is defined as the positive square root of the arithmetic mean of the squared
deviations of the given observations from their arithmetic mean. The square of the standard
deviation is called the variance.
2. Calculation of Standard Deviation-Individual Series
There are two methods of calculating Standard deviation in an individual series.
a) Deviations taken from Actual mean
b) Deviation taken from Assumed mean
(a) Deviation taken from Actual mean
This method is adopted when the mean is a whole number.
Steps:
1. Find the actual mean of the series ($\bar{X}$).
2. Find the deviation of each value from the mean ($x = X - \bar{X}$).
3. Square the deviations and take the total of the squared deviations, $\sum x^2$.
4. Divide the total $\sum x^2$ by the number of observations $n$ and take the positive square root.
Formula:
$$\sigma = \sqrt{\frac{\sum x^2}{n}} = \sqrt{\frac{\sum (X - \bar{X})^2}{n}}$$
(b) Deviations Taken from Assumed Mean
This method is adopted when the arithmetic mean is a fractional value. Taking deviations from
a fractional value would be a difficult and tedious task. To save time and labour, the short-cut
method is applied. In this method, the deviations are taken from an assumed mean.
The formula is:
$$\sigma = \sqrt{\frac{\sum d^2}{N} - \left(\frac{\sum d}{N}\right)^2}$$
Where: d stands for the deviations from the assumed mean, d = X − A
Steps:
1. Assume any one of the items in the series as an average (A).
2. Find the deviations from the assumed mean, i.e., X − A, denoted by d, and also the total of
the deviations, Σd.
3. Square the deviations, i.e., d², and add up the squares of the deviations, i.e., Σd².
4. Then substitute the values in the following formula:
$$\sigma = \sqrt{\frac{\sum d^2}{N} - \left(\frac{\sum d}{N}\right)^2}$$
Note: We can also use the simplified formula for standard deviation:
$$\sigma = \frac{1}{N}\sqrt{N \sum d^2 - \left(\sum d\right)^2}$$
For a frequency distribution, with d = (X − A)/c and c the common factor (class interval):
$$\sigma = \frac{c}{N}\sqrt{N \sum fd^2 - \left(\sum fd\right)^2}$$
Example 9
Calculate the standard deviation from the following data.
14, 22, 9, 15, 20, 17, 12, 11
Solution:
Deviations from actual mean.
Values (X)   X − X̄    (X − X̄)²
14           −1       1
22           7        49
9            −6       36
15           0        0
20           5        25
17           2        4
12           −3       9
11           −4       16
Total 120             140
$$\bar{X} = \frac{120}{8} = 15$$
$$\sigma = \sqrt{\frac{\sum (X - \bar{X})^2}{n}} = \sqrt{\frac{140}{8}} = 4.18$$
Example 10
The table below gives the marks obtained by 10 students in statistics. Calculate the standard
deviation.
Student No.   1    2    3    4    5    6    7    8    9    10
Marks         43   48   65   57   31   60   37   48   78   59
Solution
Deviations from assumed mean
Student No.   Marks (X)   d = X − A (A = 57)   d²
1             43          −14                  196
2             48          −9                   81
3             65          8                    64
4             57          0                    0
5             31          −26                  676
6             60          3                    9
7             37          −20                  400
8             48          −9                   81
9             78          21                   441
10            59          2                    4
N = 10                    Σd = −44             Σd² = 1952
$$\sigma = \sqrt{\frac{\sum d^2}{N} - \left(\frac{\sum d}{N}\right)^2} = \sqrt{\frac{1952}{10} - \left(\frac{-44}{10}\right)^2} = \sqrt{195.2 - 19.36} = 13.26$$
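The assumed-mean formula can be verified numerically. The sketch below uses a hypothetical helper `sd_assumed_mean` and compares its result with `statistics.pstdev`, which computes the population standard deviation directly from the actual mean.

```python
from math import sqrt
from statistics import pstdev

def sd_assumed_mean(values, assumed_mean):
    """Standard deviation via deviations d = X - A from an assumed mean A."""
    n = len(values)
    d = [x - assumed_mean for x in values]
    sum_d = sum(d)                       # Σd
    sum_d2 = sum(v * v for v in d)       # Σd²
    return sqrt(sum_d2 / n - (sum_d / n) ** 2)

marks = [43, 48, 65, 57, 31, 60, 37, 48, 78, 59]
sigma = sd_assumed_mean(marks, 57)       # A = 57, as in Example 10
print(round(sigma, 2))                   # 13.26
print(round(pstdev(marks), 2))           # same value from the actual mean
```

The choice of A does not change the result; it only changes how convenient the hand calculation is.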
3. Calculation of Standard Deviation for Discrete Series
There are three methods for calculating the standard deviation in a discrete series:
(a) Actual mean method
(b) Assumed mean method
(c) Step-deviation method
(a) Actual mean method
Steps:
1. Calculate the mean of the series.
2. Find the deviations of the various items from the mean, i.e., $d = X - \bar{X}$.
3. Square the deviations (d²) and multiply by the respective frequencies (f) to get fd².
4. Total the products (Σfd²). Then apply the formula:
$$\sigma = \sqrt{\frac{\sum fd^2}{\sum f}}$$
If the actual mean is a fraction, the calculation takes a lot of time and labour, and as such this
method is rarely used in practice.
(b) Assumed Mean Method
Here the deviations are taken not from the actual mean but from an assumed mean. This method
is also used when the given variable values are not at equal intervals.
Steps:
1. Assume any one of the items in the series as the assumed mean and denote it by A.
2. Find the deviations from the assumed mean, i.e., X − A, and denote them by d.
3. Multiply these deviations by the respective frequencies and get Σfd.
4. Square the deviations (d²).
5. Multiply the squared deviations (d²) by the respective frequencies (f) and get Σfd².
6. Substitute the values in the following formula:
$$\sigma = \sqrt{\frac{\sum fd^2}{\sum f} - \left(\frac{\sum fd}{\sum f}\right)^2}$$
Where: d = X − A, N = Σf
Example 11:
Calculate the standard deviation from the following data.
X   20   22   25   31   35   40   42   45
f   5    12   15   20   25   14   10   6
Solution:
Deviations from assumed mean
X       f         d = X − A (A = 31)   d²    fd          fd²
20      5         −11                  121   −55         605
22      12        −9                   81    −108        972
25      15        −6                   36    −90         540
31      20        0                    0     0           0
35      25        4                    16    100         400
40      14        9                    81    126         1134
42      10        11                   121   110         1210
45      6         14                   196   84          1176
Total   N = 107                              Σfd = 167   Σfd² = 6037
$$\sigma = \sqrt{\frac{\sum fd^2}{\sum f} - \left(\frac{\sum fd}{\sum f}\right)^2} = \sqrt{\frac{6037}{107} - \left(\frac{167}{107}\right)^2} = 7.35$$
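For a discrete series the same formula applies with frequencies as weights. The following is a minimal sketch; the name `sd_discrete` is an illustrative assumption.

```python
from math import sqrt

def sd_discrete(values, frequencies, assumed_mean):
    """Standard deviation of a discrete frequency distribution (assumed-mean method)."""
    n = sum(frequencies)                                            # N = Σf
    sum_fd = sum(f * (x - assumed_mean)
                 for x, f in zip(values, frequencies))              # Σfd
    sum_fd2 = sum(f * (x - assumed_mean) ** 2
                  for x, f in zip(values, frequencies))             # Σfd²
    return sqrt(sum_fd2 / n - (sum_fd / n) ** 2)

x = [20, 22, 25, 31, 35, 40, 42, 45]
f = [5, 12, 15, 20, 25, 14, 10, 6]
sigma = sd_discrete(x, f, assumed_mean=31)
print(round(sigma, 2))  # 7.35
```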
(c) Step-deviation method:
If the variable values are at equal intervals, then we adopt this method.
Steps:
1. Assume the centre value of the series as the assumed mean A.
2. Find $d' = \frac{X - A}{C}$, where C is the interval between successive values.
3. Multiply these deviations d′ by the respective frequencies and get Σfd′.
4. Square the deviations and get d′².
5. Multiply the squared deviations (d′²) by the respective frequencies (f) and obtain the total
Σfd′².
6. Substitute the values in the following formula to get the standard deviation:
$$\sigma = \sqrt{\frac{\sum fd'^2}{\sum f} - \left(\frac{\sum fd'}{\sum f}\right)^2} \times C$$
Example 12
Compute the standard deviation from the following data.
Marks             10   20   30   40   50   60
No. of students   8    12   20   10   7    3
Solution:
Marks (X)   No. of students (f)   d′ = (X − 30)/10   d′²   fd′        fd′²
10          8                     −2                 4     −16        32
20          12                    −1                 1     −12        12
30          20                    0                  0     0          0
40          10                    1                  1     10         10
50          7                     2                  4     14         28
60          3                     3                  9     9          27
Total       N = 60                                         Σfd′ = 5   Σfd′² = 109
$$\sigma = \sqrt{\frac{\sum fd'^2}{\sum f} - \left(\frac{\sum fd'}{\sum f}\right)^2} \times C = \sqrt{\frac{109}{60} - \left(\frac{5}{60}\right)^2} \times 10 = 13.45$$
4. Calculation of Standard Deviation for Continuous series
In a continuous series the method of calculating the standard deviation is almost the same as in a
discrete series, except that the mid-values of the class intervals must first be found.
The step-deviation method is widely used.
The formula is:
$$\sigma = \sqrt{\frac{\sum fd'^2}{N} - \left(\frac{\sum fd'}{N}\right)^2} \times C$$
Where $d' = \frac{m - A}{C}$; m = mid-value of the class; C = class interval
Steps:
1. Find the mid-value of each class.
2. Assume the centre value as the assumed mean and denote it by A.
3. Find $d' = \frac{m - A}{C}$.
4. Multiply the deviations d′ by the respective frequencies and get Σfd′.
5. Square the deviations and get d′².
6. Multiply the squared deviations (d′²) by the respective frequencies and get Σfd′².
7. Substitute the values in the formula above to get the standard deviation.
Example 13:
The daily temperature recorded in a city in Russia over a year is given below.
Temperature (°C)   No. of days
−40 to −30         10
−30 to −20         18
−20 to −10         30
−10 to 0           42
0 to 10            65
10 to 20           180
20 to 30           20
Required:
Calculate the standard deviation.
Solution:
Temperature (X)   Mid-point (m)   No. of days (f)   d′ = (m − (−5))/10   d′²   fd′          fd′²
−40 to −30        −35             10                −3                   9     −30          90
−30 to −20        −25             18                −2                   4     −36          72
−20 to −10        −15             30                −1                   1     −30          30
−10 to 0          −5              42                0                    0     0            0
0 to 10           5               65                1                    1     65           65
10 to 20          15              180               2                    4     360          720
20 to 30          25              20                3                    9     60           180
Total                             N = 365                                      Σfd′ = 389   Σfd′² = 1157
$$\sigma = \sqrt{\frac{\sum fd'^2}{N} - \left(\frac{\sum fd'}{N}\right)^2} \times C = \sqrt{\frac{1157}{365} - \left(\frac{389}{365}\right)^2} \times 10 = 14.26\ ^{\circ}\text{C}$$
Merits of Standard Deviation
1. It is rigidly defined and its value is always definite and based on all the observations and
the actual signs of deviations are used.
2. As it is based on arithmetic mean, it has all the merits of arithmetic mean.
3. It is the most important and widely used measure of dispersion.
4. It is amenable to further algebraic treatment.
5. It is less affected by the fluctuations of sampling and hence stable.
6. It is the basis for measuring the coefficient of correlation and sampling.
Demerits of Standard Deviation
1. It is not easy to understand and it is difficult to calculate.
2. It gives more weight to extreme values because the values are squared up.
3. As it is an absolute measure of variability, it cannot be used for the purpose of comparison.
Coefficient of Variation
The standard deviation is an absolute measure of dispersion. It is expressed in terms of units in
which the original figures are collected and stated. The standard deviation of heights of students
cannot be compared with the standard deviation of weights of students, as both are expressed
in different units, i.e heights in centimeter and weights in kilograms. Therefore the standard
deviation must be converted into a relative measure of dispersion for the purpose of
comparison. The relative measure is known as the coefficient of variation.
The coefficient of variation is obtained by dividing the standard deviation by the mean and
multiplying the result by 100. Symbolically,
$$\text{Coefficient of Variation (C.V.)} = \frac{\sigma}{\bar{X}} \times 100$$
If we want to compare the variability of two or more series, we can use the C.V. The series or
group of data with the greater C.V. is more variable, less stable,
less uniform, less consistent or less homogeneous. The series with the smaller C.V.
is less variable, more stable, more uniform, more consistent or more homogeneous.
Example 15
In two factories A and B located in the same industrial area, the average weekly wages (in
Kshs.) and the standard deviations are as follows:
Factory   Average   Standard Deviation   No. of workers
A         34.5      5                    476
B         28.5      4.5                  524
Required:
(a) Which factory, A or B, pays out a larger amount as weekly wages?
(b) Which factory, A or B, has greater variability in individual wages?
Solution:
(a) Total wages paid by factory A = 34.5 × 476 = Kshs. 16,422
Total wages paid by factory B = 28.5 × 524 = Kshs. 14,934
Therefore factory A pays out the larger amount as weekly wages.
(b) The C.V. of the distributions of weekly wages of factories A and B are
$$CV(A) = \frac{\sigma}{\bar{X}} \times 100 = \frac{5}{34.5} \times 100 = 14.49\%$$
$$CV(B) = \frac{\sigma}{\bar{X}} \times 100 = \frac{4.5}{28.5} \times 100 = 15.79\%$$
Factory B has greater variability in individual wages, since the C.V. of factory B is greater than
the C.V. of factory A.
Example 16
Prices of a particular commodity in five years in two cities are given below:
Price in City A   Price in City B
20                10
22                20
19                18
23                12
16                15
Which city has more stable prices?
Solution:
Actual mean method (the deviations are taken from the actual means, 20 and 15)
City A                                    City B
Prices (X)   dx = X − 20   dx²            Prices (Y)   dy = Y − 15   dy²
20           0             0              10           −5            25
22           2             4              20           5             25
19           −1            1              18           3             9
23           3             9              12           −3            9
16           −4            16             15           0             0
ΣX = 100     Σdx = 0       Σdx² = 30      ΣY = 75      Σdy = 0       Σdy² = 68
City A:
$$\bar{X} = \frac{\sum X}{n} = \frac{100}{5} = 20$$
$$\sigma_A = \sqrt{\frac{\sum dx^2}{n}} = \sqrt{\frac{30}{5}} = 2.45$$
$$CV(A) = \frac{\sigma_A}{\bar{X}} \times 100 = \frac{2.45}{20} \times 100 = 12.25\%$$
City B:
$$\bar{Y} = \frac{\sum Y}{n} = \frac{75}{5} = 15$$
$$\sigma_B = \sqrt{\frac{\sum dy^2}{n}} = \sqrt{\frac{68}{5}} = 3.69$$
$$CV(B) = \frac{\sigma_B}{\bar{Y}} \times 100 = \frac{3.69}{15} \times 100 = 24.6\%$$
City A had more stable prices than City B, because the coefficient of variation is smaller for City A.
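The comparison in Example 16 can be reproduced with the standard library. This is a sketch only; `coefficient_of_variation` is an illustrative name, and it uses the population standard deviation (divisor n), matching the formula in these notes.

```python
from statistics import mean, pstdev

def coefficient_of_variation(values):
    """C.V. = (population standard deviation / mean) x 100."""
    return pstdev(values) / mean(values) * 100

city_a = [20, 22, 19, 23, 16]
city_b = [10, 20, 18, 12, 15]

cv_a = coefficient_of_variation(city_a)
cv_b = coefficient_of_variation(city_b)
more_stable = "City A" if cv_a < cv_b else "City B"
print(round(cv_a, 2), round(cv_b, 1), more_stable)  # 12.25 24.6 City A
```

Small differences in later decimal places relative to a hand calculation come from rounding the standard deviation before dividing.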
BMCU002: QUANTITATIVE METHODS NOTES
Page 1 of 19
LESSON THREE: OVERVIEW OF HYPOTHESIS TESTING
3.0 Introduction
This lesson gives an overview of the concepts in hypothesis testing. It describes the procedure
for testing a hypothesis and differentiates between one-tailed and two-tailed tests and between Type I and
Type II errors. Examples of testing a hypothesis about a single population mean, both when the
population variance is known and when it is not given, are discussed.
3.1 Lesson Objectives
By the end of the lesson, the student should be able to:
• Define the term hypothesis
• Differentiate between one-tailed and two-tailed tests
• Describe the procedure for testing a hypothesis
• Test a hypothesis about the mean when the population variance is known
• Test a hypothesis about the mean when the population variance is unknown
3.2 Definition of Hypothesis Testing
Hypothesis: A statement about a population parameter developed for the purpose of testing.
Hypothesis testing: A procedure based on sample evidence and probability theory to determine
whether the hypothesis is a reasonable statement.
3.3 Procedure for Testing a Hypothesis
The following steps are followed when testing a hypothesis:
1. State the null and alternate hypotheses.
2. Select a level of significance.
3. Identify the test statistic.
4. Formulate a decision rule and identify the rejection region.
5. Compute the value of the test statistic.
6. Make a conclusion.
State the null hypothesis (H0) and alternate hypothesis (HA)
• The null hypothesis is a statement about the value of a population parameter. It should be
stated as “There is no significant difference between ……………” and should always contain
an equal sign.
• The alternate hypothesis is a statement that is accepted if the sample data provide enough
evidence that the null hypothesis is false.
Select a Level of Significance
The level of significance is the probability of rejecting the null hypothesis when it is true. It is
designated by α and lies between 0 and 1.
Types of errors that can be committed
i. Type I error: rejecting the null hypothesis when it is true.
ii. Type II error: not rejecting the null hypothesis when it is false.
Null hypothesis   Do not reject H0   Reject H0
H0 is true        Correct decision   Type I error
H0 is false       Type II error      Correct decision
Identify the Test Statistic
A test statistic is the statistic that will be used to test the hypothesis, e.g. Z, t, F and χ² (chi-square).
Formulate a decision rule
A decision rule is a statement of the conditions under which the null hypothesis is rejected and
the conditions under which it is not rejected. The region or area of rejection defines the location of
all those values that are so large or so small that the probability of their occurrence under a true
null hypothesis is rather remote.
Compute the value of the test statistic and make a conclusion
The value of the test statistic is determined from the sample information, and is used to determine
whether to reject the null hypothesis or not.
3.4 One-Tailed and Two-Tailed Tests
• A test is one-tailed when the alternate hypothesis states a direction, e.g.
H0: The mean income of women is equal to the mean income of men
HA: The mean income of women is greater than the mean income of men
• A test is two-tailed if no direction is specified in the alternate hypothesis, e.g.
H0: There is no difference between the mean income of women and the mean income of men
HA: There is a difference between the mean income of women and the mean income of men
3.5 Testing the Population Mean when the Population Variance is Known
When the population variance is known and the population is normally distributed, the test
statistic for testing a hypothesis about μ is
$$Z = \frac{\bar{x} - \mu}{\sigma / \sqrt{n}}$$
The confidence interval estimator of μ when σ² is known is
$$\bar{x} \pm Z_{\alpha/2} \frac{\sigma}{\sqrt{n}}$$
Example One
A study by the Coca-Cola Company showed that the typical adult Kenyan consumes 18 gallons of
Coca-Cola each year. According to the same survey, the standard deviation of the number of
gallons consumed is 3.0. A random sample of 64 college students showed they consumed an
average (mean) of 17 gallons of cola last year. At the 0.05 significance level, can we conclude that
there is a significant difference between the mean consumption of college students and that of other
adults?
Solution
1. Stating the null and alternate hypotheses
$$H_0: \mu = 18 \qquad H_A: \mu \neq 18$$
2. Level of significance: α = 0.05
3. Test statistic
$$Z = \frac{\bar{X} - \mu}{\sigma / \sqrt{n}}$$
4. Rejection region
$$Z_{\alpha/2} = Z_{0.025} = 1.96$$
If Zc < −1.96 or Zc > 1.96, reject H0.
5. Value of the test statistic
$$Z_c = \frac{\bar{X} - \mu}{\sigma / \sqrt{n}} = \frac{17 - 18}{3 / \sqrt{64}} = -2.67 < -1.96$$
6. Conclusion
Reject H0. Yes, there is a significant difference between the mean consumption of college
students and that of other adults.
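The six steps of this example can be sketched in Python. The function `z_test_two_tailed` is a hypothetical helper; the critical value comes from `statistics.NormalDist().inv_cdf` (Python 3.8 or later) instead of a printed table.

```python
from math import sqrt
from statistics import NormalDist

def z_test_two_tailed(sample_mean, mu0, sigma, n, alpha):
    """Two-tailed Z test for a mean when the population standard deviation is known."""
    z = (sample_mean - mu0) / (sigma / sqrt(n))       # test statistic
    critical = NormalDist().inv_cdf(1 - alpha / 2)    # Z_{alpha/2}
    return z, critical, abs(z) > critical             # True means "reject H0"

z, critical, reject = z_test_two_tailed(
    sample_mean=17, mu0=18, sigma=3.0, n=64, alpha=0.05)
print(round(z, 2), round(critical, 2), reject)  # -2.67 1.96 True
```

For a one-tailed test the critical value would be `inv_cdf(1 - alpha)` and the comparison would use the sign of z rather than its absolute value.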
Example Two
Past experience indicates that the monthly long distance telephone bill per household in a particular
community is normally distributed, with a mean of Sh. 1012 and a standard deviation of Sh. 327.
After an advertising campaign that encouraged people to make long distance telephone calls more
frequently, a random sample of 57 households revealed that the mean monthly long distance bill
was Sh. 1098. Can we conclude at the 10% significance level that the advertising campaign was
successful?
Solution
1. Stating the null and alternate hypotheses
$$H_0: \mu = 1012 \qquad H_A: \mu > 1012$$
2. Level of significance: α = 0.1
3. Test statistic
$$Z = \frac{\bar{X} - \mu}{\sigma / \sqrt{n}}$$
4. Rejection region
$$Z_{\alpha} = Z_{0.1} = 1.28$$
If Zc > 1.28, reject H0.
5. Value of the test statistic
$$Z_c = \frac{\bar{X} - \mu}{\sigma / \sqrt{n}} = \frac{1098 - 1012}{327 / \sqrt{57}} = 1.99 > 1.28$$
6. Conclusion
Reject H0. Yes, there is sufficient evidence to conclude that the advertising campaign was
successful.
3.6 Testing the Population Mean when the Population Variance is Unknown
When the population variance is unknown and the population is normally distributed, the test
statistic for testing a hypothesis about μ is
$$t = \frac{\bar{x} - \mu}{s / \sqrt{n}}$$
which has a Student t distribution with n − 1 degrees of freedom.
We now have two different test statistics for testing the population mean. The choice of which one
to use depends on whether or not the population variance is known.
• If the population variance is known, the test statistic is
$$Z = \frac{\bar{x} - \mu}{\sigma / \sqrt{n}}$$
• If the population variance is unknown, the test statistic is
$$t = \frac{\bar{x} - \mu}{s / \sqrt{n}}, \qquad d.f. = n - 1$$
The confidence interval estimator of μ when σ² is unknown is
$$\bar{x} \pm t_{\alpha/2} \frac{s}{\sqrt{n}}, \qquad d.f. = n - 1$$
Example One
A manufacturer of automobile seats has a production line that produces an average of 100 seats
per day. Because of new government regulations, a new safety device has been installed, which
the manufacturer believes will reduce average daily output. A random sample of 15 days’ output
after the installation of the safety device is shown below:
93, 103, 95, 101, 91, 105, 96, 94, 101, 88, 98, 94, 101, 92, 95
Assuming that the daily output is normally distributed, is there sufficient evidence at the 5%
significance level to conclude that average daily output has decreased following the installation
of the safety device?
Solution
1. Stating the null and alternate hypotheses
$$H_0: \mu = 100 \qquad H_A: \mu < 100$$
2. Level of significance: α = 0.05
3. Test statistic
$$t = \frac{\bar{X} - \mu}{s / \sqrt{n}}$$
4. Rejection region
$$t_{\alpha, n-1} = t_{0.05, 14} = 1.761$$
If tc < −1.761, reject H0.
5. Value of the test statistic
$$\sum X = 1447, \qquad \sum X^2 = 139917$$
$$\bar{X} = \frac{\sum X}{n} = \frac{1447}{15} = 96.47$$
$$s = \sqrt{\frac{\sum X^2 - \frac{(\sum X)^2}{n}}{n - 1}} = \sqrt{\frac{139917 - \frac{1447^2}{15}}{14}} = 4.85$$
$$t_c = \frac{\bar{X} - \mu}{s / \sqrt{n}} = \frac{96.47 - 100}{4.85 / \sqrt{15}} = -2.82 < -1.761$$
6. Conclusion
Reject H0. Yes, there is sufficient evidence to conclude that average daily output has decreased
following the installation of the safety device.
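The computation in step 5 can be checked with the standard library. In this sketch the critical value 1.761 is simply the tabulated figure quoted in the notes, because the standard library has no t-distribution quantile function; the variable names are illustrative.

```python
from math import sqrt
from statistics import mean, stdev

output = [93, 103, 95, 101, 91, 105, 96, 94, 101, 88, 98, 94, 101, 92, 95]
mu0 = 100                      # hypothesised mean daily output
n = len(output)

x_bar = mean(output)           # sample mean
s = stdev(output)              # sample standard deviation (divisor n - 1)
t = (x_bar - mu0) / (s / sqrt(n))

t_critical = 1.761             # tabulated t(0.05, 14), taken from the notes
reject = t < -t_critical       # left-tailed test: HA is mu < 100

print(round(x_bar, 2), round(s, 2), round(t, 2), reject)  # 96.47 4.85 -2.82 True
```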
Example Two
A courier service advertises that its average delivery time is less than six hours for local deliveries.
A random sample of the amount of time this courier takes to deliver packages to an address across
town produced the following times (rounded to the nearest hour).
7, 3, 4, 6, 10, 5, 6, 4, 3, 8
Is there sufficient evidence to support the courier’s advertisement at the 5% level of significance?
Solution
1. Stating the null and alternate hypotheses
$$H_0: \mu = 6 \qquad H_A: \mu < 6$$
2. Level of significance: α = 0.05
3. Test statistic
$$t = \frac{\bar{X} - \mu}{s / \sqrt{n}}$$
4. Rejection region
$$t_{\alpha, n-1} = t_{0.05, 9} = 1.833$$
If tc < −1.833, reject H0.
5. Value of the test statistic
$$\sum X = 56, \qquad \sum X^2 = 360$$
$$\bar{X} = \frac{\sum X}{n} = \frac{56}{10} = 5.6$$
$$s = \sqrt{\frac{\sum X^2 - \frac{(\sum X)^2}{n}}{n - 1}} = \sqrt{\frac{360 - \frac{56^2}{10}}{9}} = 2.27$$
$$t_c = \frac{\bar{X} - \mu}{s / \sqrt{n}} = \frac{5.6 - 6}{2.27 / \sqrt{10}} = -0.56 > -1.833$$
6. Conclusion
Do not reject H0. No, there is not sufficient evidence to support the courier’s advertisement that
its average delivery time is less than six hours.
3.7 Chi-Square Test
A chi-squared test is any statistical hypothesis test in which the sampling distribution of the test
statistic is a chi-squared distribution when the null hypothesis is true. Also considered a chi-
squared test is a test in which this is asymptotically true, meaning that the sampling distribution (if
the null hypothesis is true) can be made to approximate a chi-squared distribution as closely as
desired by making the sample size large enough. The chi-square test is used to determine whether
there is a significant difference between the expected frequencies and the observed frequencies in
one or more categories.
3.7.1 Chi-Square Test of a Multinomial Experiment (Goodness-of-Fit Test)
A multinomial experiment is a generalized version of a binomial experiment that allows for more
than two possible outcomes on each trial of the experiment. The following are the properties of a
multinomial experiment:
• The experiment consists of a fixed number n of trials.
• The outcome of each trial can be classified into exactly one of k categories, called cells.
• The probability $P_i$ that the outcome of a trial will fall into cell i remains constant for each
trial, for i = 1, 2, 3, ..., k. Moreover, $P_1 + P_2 + \dots + P_k = 1$.
• Each trial of the experiment is independent of the other trials.
Test Statistic
$$\chi^2 = \sum_{i=1}^{k} \frac{(o_i - e_i)^2}{e_i}$$
where $o_i$ is the observed frequency and $e_i$ the expected frequency of cell i.
Rejection Region
$$\chi^2 > \chi^2_{\alpha,\, k-1}$$
Example One
Two companies A and B have recently conducted aggressive advertising campaigns in order to
maintain and possibly increase their respective shares of the market for a particular product. These
two companies enjoy a dominant position in the market. Before advertising campaigns began, the
market share for Company A was 45% while Company B had a market share of 40%. Other
competitors accounted for the remaining market share of 15%. To determine whether these market
shares changed after the advertising campaigns, a marketing analyst solicited the preferences of a
random sample of 200 consumers of this product. Of the 200 consumers, 100 indicated a
preference for Company’s A’s product, 85 preferred Company’s B product and the remainder
preferred one or another of the products distributed by other competitors. Conduct a test to
determine at the 5% level of significance, whether the market shares have changed from the levels
they were at before the advertising campaigns occurred.
Solution
1. Stating the null and alternate hypothesis
Ho: $P_1 = 0.45$, $P_2 = 0.40$, $P_3 = 0.15$
HA: At least one of the $P_i$ is not equal to its specified value.
2. Level of significance: $\alpha = 0.05$
3. Test statistic: $\chi^2 = \sum_{i=1}^{k} \frac{(o_i - e_i)^2}{e_i}$
4. Rejection region: $\chi^2 > \chi^2_{\alpha,\,k-1} = \chi^2_{0.05,\,2} = 5.99147$
5. Value of the test statistic: assuming that the null hypothesis is correct, we can calculate the
expected number of consumers who prefer A, B and others using the formula $e_i = n p_i$.
Company   Observed    Expected    $(o_i - e_i)^2$   $(o_i - e_i)^2 / e_i$
          frequency   frequency
A         100         90          100               1.11
B         85          80          25                0.31
Others    15          30          225               7.50
Total     200         200                           8.92
Therefore $\chi^2 = \sum_{i=1}^{k} \frac{(o_i - e_i)^2}{e_i} = 8.92$
6. Conclusion: Reject Ho
There is sufficient evidence at the 5% level of significance to allow us to conclude that the
market shares have changed from the levels they were at before the advertising campaigns
occurred.
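The six steps above can be sketched in a few lines of Python. This is a minimal illustration, not part of the original notes: the function name `chi_square_gof` is hypothetical, and the critical value 5.99147 is simply copied from step 4 rather than computed.

```python
def chi_square_gof(observed, probabilities):
    """Return expected frequencies and the chi-square goodness-of-fit statistic."""
    n = sum(observed)
    # Expected frequency of each cell under the null hypothesis: e_i = n * p_i
    expected = [n * p for p in probabilities]
    # Sum of (o_i - e_i)^2 / e_i over all cells
    statistic = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
    return expected, statistic


# Market-share example: companies A, B and all other competitors
observed = [100, 85, 15]
hypothesised = [0.45, 0.40, 0.15]
CRITICAL_VALUE = 5.99147  # chi-square critical value (alpha = 0.05, 2 d.f.) taken from the notes

expected, statistic = chi_square_gof(observed, hypothesised)
reject_h0 = statistic > CRITICAL_VALUE
print(round(statistic, 2), reject_h0)
```

The same function applies to the die example that follows by passing six observed counts and six probabilities of 1/6.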
Example Two
To determine whether a single die is balanced, or fair, the die was rolled 600 times. The observed
frequencies with which each of the six sides of the die turned up are recorded in the following
table: -
Face 1 2 3 4 5 6
Observed frequency 114 92 84 101 107 102
Is there sufficient evidence to conclude at the 5% level of significance, that the die is not fair?
Solution
1. Stating the null and alternate hypothesis
Ho: $p_1 = p_2 = p_3 = p_4 = p_5 = p_6 = \frac{1}{6}$
HA: At least one of the $p_i$ is not equal to its specified value
2. Level of significance: $\alpha = 0.05$
3. Test statistic: $\chi^2 = \sum_{i=1}^{k} \frac{(o_i - e_i)^2}{e_i}$
4. Decision rule: $\chi^2_{\alpha,\,k-1} = \chi^2_{0.05,\,5} = 11.0705$. If $\chi^2 > 11.0705$, reject Ho.
5. Value of the test statistic:
Assuming that the null hypothesis is correct, we can calculate the expected number of times
each face turns up using the formula $e_i = n p_i = 600 \times \frac{1}{6} = 100$.
Face    Observed    Expected    $(o_i - e_i)^2 / e_i$
        frequency   frequency
1       114         100         1.96
2       92          100         0.64
3       84          100         2.56
4       101         100         0.01
5       107         100         0.49
6       102         100         0.04
Total   600         600         5.70
Therefore $\chi^2 = \sum_{i=1}^{k} \frac{(o_i - e_i)^2}{e_i} = 5.7 < 11.0705$
6. Conclusion:
Do not reject Ho. There is not sufficient evidence at the 5% level of significance to
conclude that the die is not fair.
Rule of Five
For the discrete distribution of the test statistic $\chi^2$ to be adequately approximated by the
continuous chi-square distribution, the conventional rule is to require that the expected frequency
for each cell be at least 5. Where necessary, cells should be combined in order to satisfy this
condition. The choice of cells to be combined should be made in such a way that meaningful
categories result from the combination.
3.7.2 Chi-Square Test of a Contingency Table
A contingency table is a rectangular table in which items from a population are classified according
to two characteristics. The objective is to analyze the relationship between two qualitative
variables, i.e. to investigate whether a dependence relationship exists between the two variables or
whether the variables are statistically independent. The number of degrees of freedom for a
contingency table with $r$ rows and $c$ columns is $d.f. = (r - 1)(c - 1)$.
Example One
A sample of employees at a large chemical plant was asked to indicate a preference for one of
three pension plans. The results are given in the following table: -
              Pension Plan
Job Class     Plan A   Plan B   Plan C
Supervisor    10       13       29
Clerical      19       80       19
Laborer       81       57       22
At the 1% significance level, determine whether there is a relationship between the pension
plan selected and the job classification of employees?
Solution
              Pension Plan
Job Class     Plan A   Plan B   Plan C   Total
Supervisor    10       13       29       52
Clerical      19       80       19       118
Laborer       81       57       22       160
Total         110      150      70       330
We need to conduct a chi-square test of the contingency table to determine whether the
classifications are statistically independent.
Ho: The two classifications are independent
HA: The two classifications are dependent
Test statistic: $\chi^2 = \sum_{i=1}^{k} \frac{(o_i - e_i)^2}{e_i}$
Rejection region: $\chi^2 > \chi^2_{\alpha,\,(r-1)(c-1)} = \chi^2_{0.01,\,4} = 13.2767$
The value of the test statistic
To compute the expected value for each cell, multiply the row total by the column total and divide
by the total number of employees sampled.
Cell i   Observed frequency   Expected frequency   $(o - e)^2 / e$
         o                    e
1        10                   17.33                3.1003
2        13                   23.64                4.7889
3        29                   11.03                29.2766
4        19                   39.33                10.5087
5        80                   53.64                12.9539
6        19                   25.03                1.4527
7        81                   53.33                14.3564
8        57                   72.73                3.4021
9        22                   33.94                4.2005
Total                                              84.0401
Value of the test statistic: $\chi^2 = \sum_{i=1}^{k} \frac{(o_i - e_i)^2}{e_i} = 84.0401$
Conclusion: Reject Ho.
There is enough evidence at the 1% significance level to conclude that the two classifications are
dependent.
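As a cross-check on the hand calculation, the expected frequencies and the test statistic can be computed directly from the table. The sketch below is illustrative only; `chi_square_independence` is a hypothetical helper name, and the critical value 13.2767 is copied from the rejection region above.

```python
def chi_square_independence(table):
    """Chi-square statistic and degrees of freedom for an r x c contingency table."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    grand_total = sum(row_totals)
    statistic = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            # Expected count under independence: row total * column total / grand total
            expected = row_totals[i] * col_totals[j] / grand_total
            statistic += (observed - expected) ** 2 / expected
    dof = (len(table) - 1) * (len(table[0]) - 1)
    return statistic, dof


# Rows: Supervisor, Clerical, Laborer; columns: the three pension plans
pension = [
    [10, 13, 29],
    [19, 80, 19],
    [81, 57, 22],
]
CRITICAL_VALUE = 13.2767  # alpha = 0.01, 4 d.f., taken from the notes

statistic, dof = chi_square_independence(pension)
print(round(statistic, 2), dof, statistic > CRITICAL_VALUE)
```

Because the code does not round the expected frequencies, its result can differ from the tabulated total in the second decimal place.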
Example Two
The Coca Cola Company sells four brands of sodas in East Africa. To help determine if the same
marketing approach used in Kenya can be used in Uganda and Tanzania, one of the firm’s
marketing analysts wants to ascertain if there is an association between the brand of Soda preferred
and the nationality of the consumer. She first classifies the population according to the brand of
soda preferred i.e. Fanta, Sprite, Coke and Krest. Her second classification consists of the three
nationalities; Kenyan, Tanzanian and Ugandan. The marketing analyst then interviews a random
sample of 250 Soda drinkers from the three countries, classifies each according to the two criteria
and records the observed frequency of drinkers falling into each of the cells as shown in the table
below.
              Soda preference
Nationality   Coke   Krest   Sprite   Fanta   Total
Kenyan        72     8       12       23      115
Ugandan       26     10      16       33      85
Tanzanian     7      10      14       19      50
Total         105    28      42       75      250
Based on the above sample data, can we conclude at the 1% level of significance that there is a
relationship between the preference of the soda drinkers and their nationality?
Solution
We need to conduct a chi-square test of the contingency table to determine whether the
classifications are statistically independent.
Ho: The two classifications are independent
HA: The two classifications are dependent
Test statistic: $\chi^2 = \sum_{i=1}^{k} \frac{(o_i - e_i)^2}{e_i}$
Rejection region: $\chi^2 > \chi^2_{\alpha,\,(r-1)(c-1)} = \chi^2_{0.01,\,6} = 16.8119$
The value of the test statistic
To compute the expected value for each cell, multiply the row total by the column total
and divide by the total number of respondents sampled.
Cell i   Observed frequency   Expected frequency   $(o - e)^2 / e$
         o                    e
1        72                   48.30                11.63
2        26                   35.70                2.64
3        7                    21.00                9.33
4        8                    12.88                1.85
5        10                   9.52                 0.02
6        10                   5.60                 3.46
7        12                   19.32                2.77
8        16                   14.28                0.21
9        14                   8.40                 3.73
10       23                   34.50                3.83
11       33                   25.50                2.21
12       19                   15.00                1.07
Value of the test statistic: $\chi^2 = \sum_{i=1}^{k} \frac{(o_i - e_i)^2}{e_i} = 42.75$
Conclusion: Reject Ho.
Based on the sample data, we can conclude at the 1% significance level that there is a relationship
between preferences of soda drinkers and their nationality.
3.7.3 Chi-Square Test for Normality
The chi-square goodness of fit test for a normal distribution proceeds in essentially the same way
as the chi-square test for a multinomial population. The multinomial test dealt with a single
population of qualitative data, whereas a normal distribution involves quantitative data. Therefore,
we must begin by subdividing the range of the normal distribution into a set of intervals or
categories in order to obtain qualitative data.
Example One
A battery manufacturer wants to determine whether the lifetimes of his batteries are normally
distributed. Such information would be helpful in establishing the guarantee that should be offered.
The lifetimes of a sample of 200 batteries are measured and the resulting data are grouped into a
frequency distribution as shown in the table below. The mean and the standard deviation of the
sample lifetimes are 164 and 10 hours respectively.
Is there evidence at the 5% level of significance that the lifetimes of his batteries are normally
distributed?
Solution
1. Stating the null and alternate hypothesis
H0: The data are normally distributed
HA: The data are not normally distributed
2. Level of significance: $\alpha = 0.05$
3. Test statistic: $\chi^2 = \sum_{i=1}^{k} \frac{(o_i - e_i)^2}{e_i}$, with d.f. = $k - 3$
4. Decision rule: $\chi^2_{\alpha,\,k-3} = \chi^2_{0.05,\,2} = 5.9915$. If $\chi^2 > 5.9915$, reject Ho.
5. Value of the test statistic:
$\bar{X} = 164$, $s = 10$
$Z = \frac{140 - 164}{10} = -2.4$, $Z = \frac{150 - 164}{10} = -1.4$, $Z = \frac{160 - 164}{10} = -0.4$,
$Z = \frac{170 - 164}{10} = 0.6$, $Z = \frac{180 - 164}{10} = 1.6$, $Z = \frac{190 - 164}{10} = 2.6$
Life Time in Hours   Number of Batteries
140 up to 150        15
150 up to 160        54
160 up to 170        78
170 up to 180        42
180 up to 190        11
Total                200
Lifetime        Probability   Observed    Expected    $(o_i - e_i)^2 / e_i$
                              frequency   frequency
Less than 150   0.0808        15          16.16       0.0833
150 up to 160   0.2638        54          52.76       0.0291
160 up to 170   0.3811        78          76.22       0.0416
170 up to 180   0.2195        42          43.90       0.0822
180 or more     0.0548        11          10.96       0.0001
Total                         200         200         0.2363
Therefore $\chi^2 = \sum_{i=1}^{k} \frac{(o_i - e_i)^2}{e_i} = 0.2363 < 5.9915$
6. Conclusion:
Do not reject Ho. There is not sufficient evidence at the 5% level of significance to
conclude that the lifetimes of his batteries are not normally distributed.
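The class probabilities in this example come from the standard normal table. A minimal sketch of the same calculation, using `statistics.NormalDist` from the Python standard library in place of the table, is shown below; the helper name `normal_gof` is hypothetical, and the open-ended first and last classes follow the treatment used in the notes.

```python
from statistics import NormalDist


def normal_gof(boundaries, observed, mean, sd):
    """Chi-square statistic for grouped data against a normal distribution.

    `boundaries` are the interior class limits; the first and last classes
    are treated as open-ended ("less than" and "or more").
    """
    dist = NormalDist(mean, sd)
    # Cumulative probabilities at each class limit, padded with 0 and 1
    cdf = [0.0] + [dist.cdf(b) for b in boundaries] + [1.0]
    probabilities = [hi - lo for lo, hi in zip(cdf, cdf[1:])]
    n = sum(observed)
    expected = [n * p for p in probabilities]
    statistic = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
    return probabilities, expected, statistic


# Battery lifetimes: sample mean 164, sample standard deviation 10
boundaries = [150, 160, 170, 180]
observed = [15, 54, 78, 42, 11]
CRITICAL_VALUE = 5.9915  # alpha = 0.05, k - 3 = 2 d.f., taken from the notes

probabilities, expected, statistic = normal_gof(boundaries, observed, 164, 10)
print([round(p, 4) for p in probabilities], round(statistic, 4))
```

The statistic from the code differs slightly from the hand-computed total because the table probabilities are rounded to four decimal places; the conclusion is unaffected.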
Example Two
The instructors for an introductory accounting course attempt to construct the final examination
so that the grades are normally distributed with a mean of 65.
Grade         Frequency
30 up to 40   4
40 up to 50   17
50 up to 60   29
60 up to 70   49
70 up to 80   33
80 up to 90   18
From the sample of grades appearing in the accompanying frequency distribution table, can
you conclude that they have achieved their objective? (Use $\alpha = 0.05$)
Solution
1. Stating the null and alternate hypothesis
H0: The data are normally distributed
HA: The data are not normally distributed
2. Level of significance: $\alpha = 0.05$
3. Test statistic: $\chi^2 = \sum_{i=1}^{k} \frac{(o_i - e_i)^2}{e_i}$, with d.f. = $k - 3$
4. Decision rule: $\chi^2_{\alpha,\,k-3} = \chi^2_{0.05,\,3} = 7.81473$. If $\chi^2 > 7.81473$, reject Ho.
5. Value of the test statistic:
Grade         x    f     xf     Dx = x - 65   Dx' = Dx/10   Dx'^2   fDx'   fDx'^2
30 up to 40   35   4     140    -30           -3            9       -12    36
40 up to 50   45   17    765    -20           -2            4       -34    68
50 up to 60   55   29    1595   -10           -1            1       -29    29
60 up to 70   65   49    3185   0             0             0       0      0
70 up to 80   75   33    2475   10            1             1       33     33
80 up to 90   85   18    1530   20            2             4       36     72
Total              150   9690                                       -6     238
$\bar{x} = \frac{\sum xf}{n} = \frac{9690}{150} = 64.6$
$s = 10 \times \sqrt{\frac{238}{150} - \left(\frac{-6}{150}\right)^2} = 12.6$
$\bar{X} = 64.6$, $s = 12.6$
$Z = \frac{40 - 64.6}{12.6} = -1.95$, $Z = \frac{50 - 64.6}{12.6} = -1.16$, $Z = \frac{60 - 64.6}{12.6} = -0.37$,
$Z = \frac{70 - 64.6}{12.6} = 0.43$, $Z = \frac{80 - 64.6}{12.6} = 1.22$
Grade          Probability   Observed    Expected    $(o_i - e_i)^2 / e_i$
                             frequency   frequency
Less than 40   0.0256        4           3.84        0.0067
40 up to 50    0.0974        17          14.61       0.3910
50 up to 60    0.2290        29          34.35       0.8333
60 up to 70    0.3144        49          47.16       0.0718
70 up to 80    0.2224        33          33.36       0.0039
80 or more     0.1112        18          16.68       0.1045
Total                        150         150         1.4112
Therefore $\chi^2 = \sum_{i=1}^{k} \frac{(o_i - e_i)^2}{e_i} = 1.4112 < 7.81473$
6. Conclusion:
Do not reject Ho. There is not sufficient evidence at the 5% level of significance to conclude
that the grades are not normally distributed, so the data are consistent with the instructors
having achieved their objective.
LESSON THREE: REGRESSION ANALYSIS
3.0 Introduction
Regression involves developing a mathematical equation that analyses the relationship between
the variable to be forecast (dependent variable) and the variables that the statistician believes
are related to the forecast variable (independent variable). Regression is the estimation of
unknown values or the prediction of one variable from known values of other variables. Simple
linear regression involves a relationship between two variables only. Multiple regression
analyses or considers the relationship between three or more variables.
3.1 Lesson Objectives
By the end of the lesson, the students should be able to:
i. Formulate a simple regression model
ii. Calculate the coefficient of correlation and determination and interpret them
iii. Test hypothesis about the regression coefficients
3.2 Simple Regression
The first step in establishing the relationship between X and Y is to obtain observations on the
two variables and analyze the data using a scatter diagram to indicate whether a positive or
negative relationship exists between X and Y. The relationship can be approximated by a
straight line. Algebraically, the relationship is $Y_t = b_0 + b_1 X_t$.
The above function is deterministic since it gives an exact relationship between X and Y. When
the line is plotted, not all the points will fall on the line because of the following reasons:
 Omission of other explanatory variables from the function
 Random behavior of human beings
 Imperfect specification of the functional form of the model
 Errors of aggregation
 Errors of measurement
To account for the deviations of some points from the straight line, the error term is introduced.
The introduction of the error term makes the function stochastic: $Y_t = b_0 + b_1 X_t + e_t$. To
estimate the values of the coefficients $b_0$ and $b_1$, we need observations on Y, X and the error
term. However, the error term is not observable and therefore we make assumptions about the
error term.
3.3 Assumptions of the Error Term
The following are the assumptions of the error term
 The error term is a real random variable which has a mean of zero and constant variance
(Assumption of homoscedasticity)
 The error term is normally distributed
 The error term corresponding to different values of X for different periods are not correlated
(assumption of no autocorrelation)
 There is no relationship between the explanatory variables and the error term
 The explanatory variables are measured without error. The error absorbs the influence of
omitted variables and errors of measurement in the dependent variable.
All the above assumptions are called stochastic assumptions
Other Assumptions
 The explanatory variables are not perfectly linearly related or correlated (No
multicollinearity)
 The variables are correctly aggregated
 The relation being estimated is identified
 The relationship is correctly specified
The regression equation of Y on X
• It is used to predict the values of Y from given values of X.
• It is expressed as follows: $Y = b_0 + b_1 X$
• To determine the values of $b_0$ and $b_1$, the following two normal equations are solved
simultaneously:
$\sum Y = n b_0 + b_1 \sum X$
$\sum XY = b_0 \sum X + b_1 \sum X^2$
• Alternatively, the values of $b_0$ and $b_1$ can be obtained using the following formulas:
$b_0 = \bar{Y} - b_1 \bar{X}$
$b_1 = \frac{\sum XY - n \bar{X} \bar{Y}}{\sum X^2 - n \bar{X}^2}$
3.4 Correlation
Definition: It is the existence of some definite relationship between two or more variables.
Correlation analysis is a statistical tool used to describe the degree to which one variable is
linearly related to another variable.
Types of Correlation
Correlation may be classified in the following ways:-
(a) Positive and negative correlation.
Correlation is said to be positive if two series move in the same direction, otherwise it is
negative (opposite Direction).
(b) Linear and Non-Linear correlation
Correlation is linear if the amount of change in one variable tends to bear a constant ratio to
the amount of change in the other variable otherwise it is non-linear.
(c) Simple, partial and multiple correlation
Simple correlation is where two variables are studied while partial or multiple involves three
or more variables.
3.5 Methods of Calculating Simple Correlation
 Scatter diagram
 Karl Pearson’s coefficient of correlation
 Spearman’s rank correlation coefficient
 Method of least squares
Karl Pearson’s coefficient of correlation (Product moment coefficient of correlation)
The coefficient of correlation (r) is a measure of the strength of the linear relationship between
two variables.
$r = \frac{\sum XY - n \bar{X} \bar{Y}}{\sqrt{\left(\sum X^2 - n \bar{X}^2\right)\left(\sum Y^2 - n \bar{Y}^2\right)}}$
Interpretation of the coefficient of correlation
1. When r = +1, there is a perfect positive correlation between the variables
2. When r = -1, there is a perfect negative correlation between the variables
3. When r = 0, there is no correlation between the variables
4. The closer r is to +1 or to –1, the stronger the relationship between the variables, and the
closer r is to 0, the weaker the relationship.
5. The following table lists the interpretations for various correlation coefficients:
Value of |r|   Comment
0.8 to 1.0     Very strong
0.6 to 0.8     Strong
0.4 to 0.6     Moderate
0.2 to 0.4     Weak
0.0 to 0.2     Very weak
Method of least squares
$r = \frac{SS_{xy}}{\sqrt{SS_{xx} \cdot SS_{yy}}}$
Coefficient of determination ($r^2$)
It is the square of the correlation coefficient. It shows the proportion of the total variation in
the dependent variable Y that is explained or accounted for by the variation in the independent
variable X. For example, if r = 0.9, then $r^2$ = 0.81, which means that 81% of the variation in the
dependent variable is explained by the independent variable.
Example One
A random sample of eight auto drivers insured with a company and having similar auto
insurance policies was selected. The following table lists their driving experience (in years)
and the monthly auto insurance premium (in Sh.000) paid by them.
Driving experience (Years)                    5    2    12   9    15   6    25   16
Monthly auto insurance premium (In Sh.000)    64   87   50   71   44   56   42   60
i. Find the least squares regression line by identifying the appropriate dependent and
independent variable
ii. Interpret the meaning of the constants calculated in part (i).
iii. Compute the coefficient of correlation and coefficient of determination and interpret
them.
Solution:
i. The dependent variable $y$ is the monthly premium and the independent variable $x$ is the
driving experience.
$\hat{y} = \beta_0 + \beta_1 x$, with $\hat{\beta}_1 = \frac{SS_{xy}}{SS_{xx}}$ and $\hat{\beta}_0 = \bar{y} - \hat{\beta}_1 \bar{x}$
$\sum x = 90$, $\sum x^2 = 1396$, $\sum xy = 4739$, $\sum y = 474$, $\sum y^2 = 29642$
$SS_{xx} = \sum x^2 - \frac{(\sum x)^2}{n} = 1396 - \frac{90^2}{8} = 383.5$
$SS_{xy} = \sum xy - \frac{\sum x \sum y}{n} = 4739 - \frac{90 \times 474}{8} = -593.5$
$SS_{yy} = \sum y^2 - \frac{(\sum y)^2}{n} = 29642 - \frac{474^2}{8} = 1557.5$
$\hat{\beta}_1 = \frac{SS_{xy}}{SS_{xx}} = \frac{-593.5}{383.5} = -1.55$
$\hat{\beta}_0 = \bar{y} - \hat{\beta}_1 \bar{x} = 59.25 - (-1.55 \times 11.25) = 76.69$
$\hat{y} = \hat{\beta}_0 + \hat{\beta}_1 x = 76.69 - 1.55x$
ii. $\hat{\beta}_1 = -1.55$: it indicates the rate at which the insurance premium reduces with an
additional year of driving experience.
$\hat{\beta}_0 = 76.69$: it indicates the amount of premium that would be paid by a driver without
any years of experience.
iii. $r = \frac{SS_{xy}}{\sqrt{SS_{xx} \cdot SS_{yy}}} = \frac{-593.5}{\sqrt{383.5 \times 1557.5}} = -0.77$
There is a strong negative relationship between the years of experience and the monthly auto
insurance premiums.
$r^2 = (-0.77)^2 = 0.5929 = 59.29\%$
59.29% of the variation in the premium paid is explained by the driving experience.
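The sums-of-squares route used in this solution translates directly into code. The sketch below is an illustration rather than part of the notes: `simple_regression` is a hypothetical name, and the last premium is taken as 60, the value implied by the worked totals ($\sum y = 474$, $\sum y^2 = 29642$).

```python
from math import sqrt


def simple_regression(x, y):
    """Least-squares intercept, slope and correlation via SSxx, SSxy and SSyy."""
    n = len(x)
    ss_xx = sum(v * v for v in x) - sum(x) ** 2 / n
    ss_yy = sum(v * v for v in y) - sum(y) ** 2 / n
    ss_xy = sum(a * b for a, b in zip(x, y)) - sum(x) * sum(y) / n
    slope = ss_xy / ss_xx
    intercept = sum(y) / n - slope * sum(x) / n
    r = ss_xy / sqrt(ss_xx * ss_yy)
    return intercept, slope, r


experience = [5, 2, 12, 9, 15, 6, 25, 16]
# Assumption: final premium is 60, consistent with the sums used in the solution
premium = [64, 87, 50, 71, 44, 56, 42, 60]

b0, b1, r = simple_regression(experience, premium)
print(round(b0, 2), round(b1, 2), round(r, 2), round(r * r, 4))
```

The intercept from the code is a little lower than 76.69 because the hand calculation rounds the slope to -1.55 before substituting it.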
Example Two
A company is using a system of payment by results. The union claims that this seriously
discriminates against the workers. There is a fairly steep learning curve which workers follow,
with the apparent outcome that more experienced workers can perform the task in about half
of the time taken by a new employee. You have been asked to find out if there is any basis
for this claim. To do this, you have observed ten workers on the shop floor, timing how long it
takes them to produce an item. It was then possible for you to match these times with the length
of each worker's experience. The results obtained are shown below:
Month’s experience 2 5 3 8 5 9 12 16 1 6
Time taken 27 26 30 20 22 20 16 15 30 19
Required:
(a) Find the regression line of time taken on month’s experience
(b) Compute the coefficient of correlation and coefficient of determination and interpret them.
Solution:
$Y = b_0 + b_1 X$, with $b_1 = \frac{SS_{xy}}{SS_{xx}}$ and $b_0 = \bar{Y} - b_1 \bar{X}$
$\sum X = 67$, $\sum X^2 = 645$, $\sum XY = 1300$, $\sum Y = 225$, $\sum Y^2 = 5331$
$SS_{xx} = \sum X^2 - \frac{(\sum X)^2}{n} = 645 - \frac{67^2}{10} = 196.1$
$SS_{xy} = \sum XY - \frac{\sum X \sum Y}{n} = 1300 - \frac{67 \times 225}{10} = -207.5$
$SS_{yy} = \sum Y^2 - \frac{(\sum Y)^2}{n} = 5331 - \frac{225^2}{10} = 268.5$
$b_1 = \frac{SS_{xy}}{SS_{xx}} = \frac{-207.5}{196.1} = -1.0581$
$b_0 = \bar{Y} - b_1 \bar{X} = 22.5 - (-1.0581 \times 6.7) = 29.5893$
$Y = 29.5893 - 1.0581X$
$b_1 = -1.0581$: it indicates the rate at which the time taken reduces for every
additional month of experience.
$b_0 = 29.5893$: it indicates the time taken by an employee without any experience.
$r = \frac{SS_{xy}}{\sqrt{SS_{xx} \cdot SS_{yy}}} = \frac{-207.5}{\sqrt{196.1 \times 268.5}} = -0.9043$
There is a very strong negative correlation between the months of experience and the time
taken.
$r^2 = (-0.9043)^2 = 0.8178 = 81.78\%$
81.78% of the variation in the time taken is explained by the months of experience.
Example Three
Students in the BMS 302 class were polled by a researcher attempting to establish a relationship
between hours of study in the week immediately preceding the end of semester exam and the
marks received on the exam. The surveyor gathered the data listed in the accompanying table
Hours of study   Exam score
25               93
12               57
18               55
26               90
19               82
20               95
23               95
15               80
22               85
8                61
i. Find the least squares regression line by identifying the appropriate dependent and
independent variable.
ii. Interpret the meaning of the values of $\beta_0$ and $\beta_1$ calculated in part (i).
iii. Compute the coefficient of correlation and coefficient of determination and interpret them.
Solution
i. $\hat{y} = \beta_0 + \beta_1 x$, with $\hat{\beta}_1 = \frac{SS_{xy}}{SS_{xx}}$ and $\hat{\beta}_0 = \bar{y} - \hat{\beta}_1 \bar{x}$
$\sum x = 188$, $\sum x^2 = 3832$, $\sum xy = 15540$, $\sum y = 793$, $\sum y^2 = 65143$
$SS_{xx} = \sum x^2 - \frac{(\sum x)^2}{n} = 3832 - \frac{188^2}{10} = 297.6$
$SS_{xy} = \sum xy - \frac{\sum x \sum y}{n} = 15540 - \frac{188 \times 793}{10} = 631.6$
$SS_{yy} = \sum y^2 - \frac{(\sum y)^2}{n} = 65143 - \frac{793^2}{10} = 2258.1$
$\hat{\beta}_1 = \frac{SS_{xy}}{SS_{xx}} = \frac{631.6}{297.6} = 2.122$
$\hat{\beta}_0 = \bar{y} - \hat{\beta}_1 \bar{x} = 79.3 - (2.122 \times 18.8) = 39.4064$
$\hat{y} = \hat{\beta}_0 + \hat{\beta}_1 x = 39.4064 + 2.122x$
ii. $\hat{\beta}_1 = 2.122$: it indicates the rate at which the exam score would increase with an
additional hour of study.
$\hat{\beta}_0 = 39.41$: it indicates the exam score that would be attained by a student who does
not study in the week before the exam.
iii. $r = \frac{SS_{xy}}{\sqrt{SS_{xx} \cdot SS_{yy}}} = \frac{631.6}{\sqrt{297.6 \times 2258.1}} = 0.77$
There is a strong positive relationship between the exam score and the number of hours studied.
$r^2 = 0.77^2 = 0.5929 = 59.29\%$
59.29% of the variation in the exam score is explained by the number of hours studied.
3.6 Spearman’s Rank Correlation
 It is the correlation between the ranks assigned to individuals by two different people.
 It is a non-parametric technique for measuring strength of relationship between paired
observations of two variables when the data are in ranked form.
It is denoted by $R$ or $\rho$:
$R = 1 - \frac{6 \sum d_i^2}{N(N^2 - 1)} = 1 - \frac{6 \sum d_i^2}{N^3 - N}$
In rank correlation, there are two types of problems:-
i. Where actual ranks are given
ii. Where actual ranks are not given
Where actual ranks are given
Steps:
• Take the differences of the two ranks, i.e. $(R_1 - R_2)$, and denote these differences by $d$.
• Square these differences and obtain the total $\sum d^2$.
• Use the formula $R = 1 - \frac{6 \sum d^2}{N^3 - N}$
Example
The ranks given by two judges to 10 individuals are given below.
Individual 1 2 3 4 5 6 7 8 9 10
Judge 1(X) 1 2 7 9 8 6 4 3 10 5
Judge 2 (Y) 7 5 8 10 9 4 1 6 3 2
Calculate
(a) The spearman’s rank correlation.
(b) The Coefficient of correlation
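The notes leave this example unworked. A minimal sketch of the three steps listed above, applied to the two judges' ranks, might look like the following; the function name `spearman_from_ranks` is hypothetical and the sketch assumes there are no tied ranks.

```python
def spearman_from_ranks(rank_x, rank_y):
    """Spearman's rank correlation when ranks are given and there are no ties."""
    n = len(rank_x)
    # Steps 1 and 2: difference of each pair of ranks, squared and summed
    sum_d2 = sum((a - b) ** 2 for a, b in zip(rank_x, rank_y))
    # Step 3: R = 1 - 6 * sum(d^2) / (N^3 - N)
    return 1 - 6 * sum_d2 / (n ** 3 - n)


judge_1 = [1, 2, 7, 9, 8, 6, 4, 3, 10, 5]
judge_2 = [7, 5, 8, 10, 9, 4, 1, 6, 3, 2]

R = spearman_from_ranks(judge_1, judge_2)
print(round(R, 4))
```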
Where ranks are not given
Ranks can be assigned by taking either the highest value as 1 or the lowest value as 1. The same
method should be followed for all the variables.
Example
Calculate the rank correlation coefficient for the following data of marks given to 1st year B Com
students:
CMS 100 45 47 60 38 50
CAC 100 60 61 58 48 46
Equal Ranks or Tie in Ranks
• Where equal ranks are assigned to some entries, an adjustment in the formula for
calculating the Rank coefficient of correlation is made.
• The adjustment consists of adding $\frac{1}{12}(m^3 - m)$ to the value of $\sum d^2$, where $m$ stands
for the number of items whose ranks are common.
Example
An examination of eight applicants for a clerical post was taken by a firm. From the marks
obtained by the applicants in the accounting and statistics papers, compute the Rank coefficient
of correlation.
Applicant A B C D E F G H
Marks in accounting 15 20 28 12 40 60 20 80
Marks in statistics 40 30 50 30 20 10 30 60
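The tie adjustment can be sketched as follows. This is an illustrative implementation under stated assumptions, not the notes' own solution: the helper names are hypothetical, the highest mark is ranked 1, tied marks share the average of the ranks they occupy, and one correction term $\frac{1}{12}(m^3 - m)$ is added for each group of tied marks in either paper.

```python
from collections import Counter


def average_ranks(values):
    """Rank values with the highest as 1; tied values share their average rank."""
    ordered = sorted(values, reverse=True)
    rank_of = {}
    for v in set(values):
        first_position = ordered.index(v) + 1
        ties = ordered.count(v)
        rank_of[v] = first_position + (ties - 1) / 2
    return [rank_of[v] for v in values]


def tie_correction(values):
    """Sum of (m^3 - m) / 12 over every group of m tied values."""
    return sum((m ** 3 - m) / 12 for m in Counter(values).values() if m > 1)


def spearman_with_ties(x, y):
    n = len(x)
    rank_x, rank_y = average_ranks(x), average_ranks(y)
    sum_d2 = sum((a - b) ** 2 for a, b in zip(rank_x, rank_y))
    adjusted = sum_d2 + tie_correction(x) + tie_correction(y)
    return 1 - 6 * adjusted / (n ** 3 - n)


accounting = [15, 20, 28, 12, 40, 60, 20, 80]
statistics_marks = [40, 30, 50, 30, 20, 10, 30, 60]

R = spearman_with_ties(accounting, statistics_marks)
print(average_ranks(accounting), R)
```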
3.7 Assessing the Regression Model
3.7.1 Estimating the variance of the error variable
The sample statistic $S_e^2 = \frac{SSE}{n - 2}$ is an unbiased estimator of $\sigma_e^2$. The square root of
$S_e^2$ is called the standard error of estimate, i.e.
$S_e = \sqrt{\frac{SSE}{n - 2}}$, where $SSE = SS_{yy} - \frac{SS_{xy}^2}{SS_{xx}}$
Interpretation of the Standard Error of Estimate
• The smallest value that the standard error of estimate can assume is zero, which occurs
when SSE = 0, i.e. when all the points fall on the regression line.
• If $S_e$ is close to zero, the fit is excellent and the linear model is likely to be a useful and
effective analytical and forecasting tool.
• If $S_e$ is large, the model is a poor one and the statistician should either improve it or
discard it.
• In general, the standard error of estimate cannot be used as an absolute measure of the
model's utility. Nonetheless, it is useful in comparing models.
3.7.2 Drawing inferences about $\beta_1$
This involves determining whether a linear relationship actually exists between x and y. The
null hypothesis will always state that there is no linear relationship between the variables, i.e.
$H_0: \beta_1 = 0$. Any of the following three alternate hypotheses can be tested:
i. $H_A: \beta_1 \neq 0$ tests whether some linear relationship exists between x and y
ii. $H_A: \beta_1 > 0$ tests whether a positive linear relationship exists between x and y
iii. $H_A: \beta_1 < 0$ tests whether a negative linear relationship exists between x and y
The test statistic is $t = \frac{b_1 - \beta_1}{S_{b_1}}$, where $S_{b_1} = \frac{S_e}{\sqrt{SS_{xx}}}$
Assuming that the error variable is normally distributed, the test statistic follows a Student t
distribution with $n - 2$ degrees of freedom.
The confidence interval estimator of $\beta_1$ is $b_1 \pm t_{\alpha/2,\,n-2} \, S_{b_1}$
3.7.3 Measuring the strength of the linear relationship
Beyond testing $\beta_1$, it is useful to measure the strength of the linear relationship, particularly
when we want to compare different models to see which one fits the data better.
(a) Coefficient of Correlation
The coefficient of correlation, denoted by $\rho$ (rho), measures the similarity of the changes in the
values of x and y. Its range is $-1 \leq \rho \leq 1$. Since $\rho$ is a population parameter, its value is
estimated from the data. The sample coefficient of correlation r is defined as follows:
$r = \frac{SS_{xy}}{\sqrt{SS_{xx} \cdot SS_{yy}}}$
(b) Testing the Coefficient of Correlation
If $\rho = 0$, the values of x and y are uncorrelated and the linear model is not appropriate. We
can determine whether x and y are correlated by testing the following hypotheses:
$H_0: \rho = 0$
$H_A: \rho \neq 0$
Test statistic: $t = \frac{r - \rho}{s_r}$, where $s_r = \sqrt{\frac{1 - r^2}{n - 2}}$
The test statistic is Student t distributed with n-2 degrees of freedom if the error variable is
normally distributed.
(c) Coefficient of Determination ($r^2$)
This measures the proportion of variability in the dependent variable that is explained by
variability of the independent variable.
$r^2 = \frac{SS_{xy}^2}{SS_{xx} \cdot SS_{yy}}$
3.7.4 Predicting the particular value of y for a given x (The prediction Interval)
The prediction interval is given by:
$\hat{y} \pm t_{\alpha/2,\,n-2} \, S_e \sqrt{1 + \frac{1}{n} + \frac{(x_g - \bar{x})^2}{SS_{xx}}}$
where $x_g$ is the given value of x and $\hat{y} = b_0 + b_1 x_g$
3.7.5 Estimating the expected value of y for a given x (The confidence Interval)
The confidence interval is given by:
$\hat{y} \pm t_{\alpha/2,\,n-2} \, S_e \sqrt{\frac{1}{n} + \frac{(x_g - \bar{x})^2}{SS_{xx}}}$
where $x_g$ is the given value of x and $\hat{y} = b_0 + b_1 x_g$
Example One
A real estate agent would like to predict the selling price of single family homes. After careful
consideration, she concludes that the variable likely to be mostly closely related to the selling
price is the size of the house. As an experiment, she takes a random sample of 15 recently sold
houses and records the selling price in Sh.000's and the size in 100 ft² of each. The data is shown
in the table below:
House size (100 ft²)   Selling price (Sh'000)
20.0                   89.5
14.8                   79.9
20.5                   83.1
12.5                   56.9
18.0                   66.6
14.3                   82.5
27.5                   126.3
16.5                   79.3
24.3                   119.9
20.2                   87.6
22.0                   112.6
19.0                   120.8
12.3                   78.5
14.0                   74.3
16.7                   74.8
Required: -
(a) Find the sample regression line for the data
(b) Estimate the variance of the error variable and the standard error of estimate.
(c) Can we conclude at the 1% significance level that the size of a house is linearly related
to its selling price?
(d) Estimate the 99% confidence interval estimate of $\beta_1$
(e) Compute the coefficient of correlation and interpret its value
(f) Can we conclude at the 1% significance level that the two variables are correlated?
(g) Compute the coefficient of determination and interpret its value
(h) Predict with 95% confidence the selling price of a house that occupies 2,000 ft².
(i) In a certain part of the city, a developer built several thousand houses whose floor plans
and exteriors differ but whose sizes are all 2,000 ft². To date, they have been rented but
the builder now wants to sell them and wants to know approximately how much money
in total he can expect from the sale of the houses. Help him by estimating a 95%
confidence interval estimate of the mean selling price of the houses.
Solution
(a) Find the least squares regression line
$\hat{y} = b_0 + b_1 x$, with $b_1 = \frac{SS_{xy}}{SS_{xx}}$ and $b_0 = \bar{y} - b_1 \bar{x}$
$\sum X = 272.6$, $\sum Y = 1332.6$, $\sum XY = 25257.97$, $\sum X^2 = 5222.24$, $\sum Y^2 = 124618.42$
$SS_{xx} = \sum x^2 - \frac{(\sum x)^2}{n} = 5222.24 - \frac{272.6^2}{15} = 268.189$
$SS_{xy} = \sum xy - \frac{\sum x \sum y}{n} = 25257.97 - \frac{272.6 \times 1332.6}{15} = 1040.186$
$SS_{yy} = \sum y^2 - \frac{(\sum y)^2}{n} = 124618.42 - \frac{1332.6^2}{15} = 6230.24$
$b_1 = \frac{SS_{xy}}{SS_{xx}} = \frac{1040.186}{268.189} = 3.88$
$b_0 = \bar{y} - b_1 \bar{x} = 88.84 - (3.88 \times 18.17) = 18.34$
$\hat{y} = b_0 + b_1 x = 18.34 + 3.88x$
(b) Estimate the variance of the error variable and the standard error of estimate.
$SSE = SS_{yy} - \frac{SS_{xy}^2}{SS_{xx}} = 6230.24 - \frac{1040.18^2}{268.19} = 2195.88$
$S_e^2 = \frac{SSE}{n - 2} = \frac{2195.88}{15 - 2} \approx 169$
$S_e = \sqrt{169} = 13$
(c) Can we conclude at the 1% significance level that the size of a house is linearly related to
its selling price?
$H_0: \beta_1 = 0$
$H_A: \beta_1 \neq 0$
$\alpha = 0.01$
Test statistic: $t = \frac{b_1 - \beta_1}{S_{b_1}}$
Decision rule: $t_{\alpha/2,\,n-2} = t_{0.005,\,13} = 3.012$. If $t > 3.012$ or $t < -3.012$, reject $H_0$.
Value of the test statistic:
$S_{b_1} = \frac{S_e}{\sqrt{SS_{xx}}} = \frac{13}{\sqrt{268.19}} = 0.794$
$t = \frac{b_1}{S_{b_1}} = \frac{3.88}{0.794} = 4.89 > 3.012$
Conclusion: Reject Ho. Yes, the data provides sufficient evidence to conclude that the
house size is linearly related to its selling price.
(d) Estimate the 99% confidence interval estimate of $\beta_1$
$b_1 \pm t_{\alpha/2,\,n-2} \, S_{b_1} = 3.88 \pm (3.012 \times 0.794) = 3.88 \pm 2.39$
$1.49 \leq \beta_1 \leq 6.27$
(e) Compute the coefficient of correlation and interpret its value
$r = \frac{SS_{xy}}{\sqrt{SS_{xx} \cdot SS_{yy}}} = \frac{1040.18}{\sqrt{268.19 \times 6230.24}} = 0.805$
There is a very strong positive correlation between the size of the house and its selling
price.
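Parts (a) to (d) can be reproduced from the raw data with a short script. This is a sketch under stated assumptions: `slope_inference` is a hypothetical helper, and the critical value 3.012 is the tabulated $t$ value for 13 degrees of freedom used in the solution rather than something the code derives.

```python
from math import sqrt


def slope_inference(x, y, t_critical):
    """Slope, its standard error, the t statistic and a confidence interval."""
    n = len(x)
    ss_xx = sum(v * v for v in x) - sum(x) ** 2 / n
    ss_yy = sum(v * v for v in y) - sum(y) ** 2 / n
    ss_xy = sum(a * b for a, b in zip(x, y)) - sum(x) * sum(y) / n
    b1 = ss_xy / ss_xx
    sse = ss_yy - ss_xy ** 2 / ss_xx       # sum of squared errors
    s_e = sqrt(sse / (n - 2))              # standard error of estimate
    s_b1 = s_e / sqrt(ss_xx)               # standard error of the slope
    t = b1 / s_b1                          # tests H0: beta_1 = 0
    interval = (b1 - t_critical * s_b1, b1 + t_critical * s_b1)
    return b1, s_b1, t, interval


size = [20.0, 14.8, 20.5, 12.5, 18.0, 14.3, 27.5, 16.5, 24.3, 20.2,
        22.0, 19.0, 12.3, 14.0, 16.7]
price = [89.5, 79.9, 83.1, 56.9, 66.6, 82.5, 126.3, 79.3, 119.9, 87.6,
         112.6, 120.8, 78.5, 74.3, 74.8]
T_CRITICAL = 3.012  # two-tailed, alpha = 0.01, 13 d.f., taken from the notes

b1, s_b1, t, (low, high) = slope_inference(size, price, T_CRITICAL)
print(round(b1, 2), round(s_b1, 3), round(t, 2), round(low, 2), round(high, 2))
```

Small differences from the hand-worked figures arise because the code carries full precision instead of rounding $S_e$ to 13 and $b_1$ to 3.88.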
QUANTITATIVE METHODS NOTES.pdf

  • 1. BMCU002: QUANTITATIVE METHODS NOTES Page 1 of 3 INTRODUCTION TO STATISTICS Definition: It is the science of collecting, organizing, presenting, analyzing and interpreting data to assist in making more effective decisions. Types of Statistics (a) Descriptive statistics: it’s a tabular, graphical and numerical method for organizing and summarizing information clearly and effectively relating to either a population or sample. (b) Inferential statistics: are the methods of drawing and measuring the reliability of conclusions about a statistical population based on information from a sample data set.  A population is a collection of all possible individuals, objects or measurements of interest.  A sample part or sub set of the population of interest. Variables: A variable is a measurable characteristic that assumes different values among the subjects. Types of variables (a) Independent variables: It is a variable that a researcher manipulates in order to determine its effect or influence on another variable. They predict the amount of variation that occurs in other variables. (b) Dependent variables: It is the variable that is measured, predicted or monitored and is expected to be affected by manipulation of an independent variable. They attempt to indicate the total influence arising from the effects of the independent variable. It varies as a function of the independent variable e.g., influence of hours studied on performance in a statistical test, influence of distance from the supply center on cost of building materials. The above variables can either be qualitative or quantitative variables: - i. Qualitative variables: Are variables that are non-numeric i.e., attributes e.g., Gender, Religion, Colour, State of birth etc. ii. Quantitative variables: are numeric variables. They can either be discrete or continuous.  Discrete variables: Are variables, which can only assume certain values i.e., whole numbers. Are always counted. 
 Continuous variables: Are variables, which can assume any value within a specific range. Are always measured e.g., height, temperature, weight, radius etc.  Levels of measurement There are four levels of measurement; nominal, ordinal, interval and ratio. (a) Nominal level. The observations are classified under a common characteristic e.g., sex, race, marital status, employment status, language, religion etc. helps in sampling.
  • 2. BMCU002: QUANTITATIVE METHODS NOTES Page 2 of 3 (b) Ordinal level: items or subjects are not only grouped into categories, but they are ranked into some order e.g., greater than, less than, superior, happier than, poorer, above etc. helps in developing a likert scale. (c) Interval level: numerals are assigned to each measure and ranked. The intervals between numerals are equal. The numerals used represent meaningful quantities but the zero point is not meaningful e.g., test scores, temperature. (d) Ratio level: has all the characteristics of the other levels and in addition the zero point is meaningful. Mathematical operations can be applied to yield meaningful values e.g., height, weight, distance, age, area etc. Characteristics of statistical data  They are aggregate of facts e.g., total sales of a firm for one year.  They are affected to a marked extent by a multiplicity of causes e.g., volume of wheat production depends on rainfall, soil fertility, seeds etc  They are numerically expressed e.g., population of Kenya increased by 4 million during the year 2004.  They are estimated according to a reasonable standard of accuracy e.g., 90% accuracy  They are collected in a systematic manner.  They are collected for a predetermined purpose  They should be placed in relation to each other. Uses and users of statistics 1. Government:  Monitoring economic and social trends  Forecasting  Policy making 2. Individuals  Leisure activities  Community work  Personal finances  Gambling 3. Academia  Testing hypothesis  Developing new theories  Consultancy services 4. Businesses  Planning and control  Quality control especially for the manufacturers  Forecasting i.e., planning production schedules, advertising expenditures etc.  Auditing
  • 3. BMCU002: QUANTITATIVE METHODS NOTES Page 3 of 3  Determining production costs e.g., by using regression and correlation, one can determine the relationship between two variables like costs and methods of production, advertising and sales etc.  It gives relevant information for decision-making. Limitations of statistics  Deals with aggregate facts and not individual items.  Deals mainly with quantitative characteristics and not qualitative characteristics like honesty, efficiency etc.  The results are only true on an average and under certain conditions.  Statistics can be misused i.e., wrong interpretation. It requires experience and skill to draw sensible conclusions from the data.  Statistics may not provide the best solution under all circumstances.
  • 4. BMCU002: QUANTITATIVE METHODS NOTES Page 1 of 11 DESCRIPTIVE STATISTICS Descriptive statistics is used to summarize data and make sense out of the raw data collected during the research. Data collection Data can be collected from primary and / or secondary sources. Secondary data consists of information that already exists somewhere having been collected for another purpose e.g., in government publications, periodicals, journals, books etc. Advantages: Low in cost and Readily available Disadvantages: The data needed might not exist and The existing data might be outdated, inaccurate, incomplete and unreliable. Primary data consists of original information gathered for the specific purpose through observation, interviews and questionnaires. Advantages - It is relevant - Its accurate Disadvantages - It is costly - It is time consuming Presentation of data Presentation of data refers to the classification and tabulation of data. Classification of data refers to the act of arranging the data in groups or classes according to some resemblance of the data in each group or class. Tabulation of data is the arrangement of statistical data in columns and rows. Frequency distribution A frequency distribution is a grouping of data into mutually exclusive categories showing the number of observations in each category. Steps  Decide on the number of classes  Determine the class interval or width  Set the individual class limits  Tally the values into the classes
  • 5. - Count the number of items in each class
    A class interval is the difference between the lower limit of a class and the lower limit of the next class. A class midpoint (class mark) is the point midway between the lower and upper class limits.
    Graphical representation of a frequency distribution
    1. Histogram: A graph in which the classes are marked on the horizontal axis and the class frequencies on the vertical axis. The class frequencies are represented by the heights of the bars, and the bars are drawn adjacent to each other.
    2. Frequency polygons: The class midpoints are connected with line segments.
    3. Cumulative frequency polygons
       - Less than cumulative frequency polygons
       - More than cumulative frequency polygons
    4. Line charts: Show the change in a variable over time.
    5. Bar charts: Use rectangles to present the given data. Can be vertical, horizontal or component.
    6. Pie charts: Different segments of a circle represent the percentage contribution of various components to the total.
    7. Graphs
    8. Pictograms: Pictures are used to represent data.
    Example
    (a) The data below indicate the marks attained by students in a statistical test. Construct a frequency distribution table with 10 classes.
        12 8 18 5 15 24 25 25 32 40
        40 42 44 46 48 50 50 52 53 55
        56 59 60 66 68 72 76 83 95 98
    (b) From the above, construct a histogram, frequency polygons and curves, and cumulative frequency curves.
    MEASURES OF CENTRAL TENDENCY
    Central tendency is the tendency of observations to cluster near the central part of the distribution. Measures of central tendency are measures of location, e.g. mean, mode and median. Each gives a single value that is most representative of the distribution.
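The tallying steps for part (a) of the example can be sketched in Python. The notes do not fix the class limits, so this sketch assumes ten classes of width 10 starting at 0 (0 to under 10, 10 to under 20, and so on); the function name `frequency_table` is illustrative.

```python
# Marks from the example above.
marks = [12, 8, 18, 5, 15, 24, 25, 25, 32, 40,
         40, 42, 44, 46, 48, 50, 50, 52, 53, 55,
         56, 59, 60, 66, 68, 72, 76, 83, 95, 98]


def frequency_table(values, lower, width, n_classes):
    """Tally values into classes [lower + i*width, lower + (i+1)*width)."""
    counts = [0] * n_classes
    for v in values:
        idx = int((v - lower) // width)
        if not 0 <= idx < n_classes:
            raise ValueError(f"{v} falls outside the class limits")
        counts[idx] += 1
    return [(lower + i * width, lower + (i + 1) * width, c)
            for i, c in enumerate(counts)]


# Assumed class limits: width 10, starting at 0.
table = frequency_table(marks, lower=0, width=10, n_classes=10)
for lo, hi, freq in table:
    print(f"{lo:>3} to under {hi:<3}  {freq}")
```

A different choice of starting limit or width gives a different but equally valid table, provided the classes are mutually exclusive and cover every observation.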
  • 6. Qualities of a good average
    It should be:
    - Rigidly defined
    - Based on all values
    - Easily understood and calculated
    - Least affected by the fluctuations of sampling
    - Capable of further algebraic or statistical treatment
    - Least affected by extreme values
    Types of averages
    The following are the most important types of averages:
    (a) Arithmetic mean or simple average
    (b) Median
    (c) Mode
    (d) Geometric mean
    (e) Harmonic mean
    THE ARITHMETIC MEAN
    It is obtained by summing up the values of all the items of a series and dividing this sum by the number of items.
    Computation of the arithmetic mean
    Individual series (direct method):
    $$\bar{X} = \frac{\sum X}{n}$$
    where $\bar{X}$ = arithmetic mean and n = number of items.
    Grouped series (direct method):
    $$\bar{X} = \frac{\sum xf}{n}$$
    where f = frequencies and n = number of items ($n = \sum f$).
    Properties of the arithmetic mean
    - The product of the arithmetic mean and the number of items is equal to the sum of all given values
    - The algebraic sum of the deviations of the various values from the mean is equal to zero
    - The sum of the squares of deviations from the arithmetic mean is least.
    Advantages of the arithmetic mean
    - Can be easily understood
    - Takes into account all the items of the series
  • 7. - It is not necessary to arrange the data before calculating the average
    - It is capable of algebraic treatment
    - It is a good method of comparison
    - It is not indefinite
    - It is used frequently.
    Disadvantages of the arithmetic mean
    - It is affected by extreme values to a great extent
    - It may be a figure that does not exist in a series
    - It cannot be calculated if all the items of a series are not known
    - It cannot be used in case of qualitative data
    THE MEDIAN
    The median is the middle value of a series arranged in ascending or descending order. If there are n observations, the median is the value of the $\left(\frac{n+1}{2}\right)$th item.
    Computation of the median in discrete series
    - Arrange the items in descending or ascending order with their corresponding frequencies against them.
    - Compute the cumulative frequencies and then locate the middle item.
    Computation of the median in continuous series
    The median has to be interpolated in the class interval containing the median using the formula:
    $$\text{Median} = L + \frac{\frac{n}{2} - B}{G} \times W$$
    Where:
    L = lower class boundary of the median group
    n = total number of values
    B = cumulative frequency of the group before the median group
    G = frequency of the median group
    W = class width
    Properties of the Median
    - It is a positional average and is influenced by the position of the items in the series and not by the size of items
    - The sum of the absolute deviations from the median is least.
    Advantages of the Median
    - It is easy to calculate
    - It is simple and is understood easily
    - It is less affected by the value of extreme items
  • 8. - It can be calculated by inspection in some cases
    - It is useful in the study of phenomena which are of a qualitative nature
    Disadvantages of the Median
    - It is not a suitable representative of a series in most cases
    - It is not suitable for further algebraic treatment
    - It is not used as frequently as the arithmetic mean
    - It cannot be determined exactly in the case of continuous series
    Quartiles, deciles and percentiles
    - Quartiles are the values of the items that divide the series into four equal parts.
    - Deciles divide the series into 10 equal parts.
    - Percentiles divide the series into 100 equal parts.
    The 2nd quartile, 5th decile and 50th percentile are equal to the median.
    THE MODE
    The mode is the value which occurs most often in the data. A distribution with one mode is called unimodal, with two modes bimodal, and with many modes multimodal. The class midpoint of a modal class is called a crude mode.
    Calculation of the mode in a continuous series
    $$\text{Mode} = L + \frac{f_m - f_{m-1}}{(f_m - f_{m-1}) + (f_m - f_{m+1})} \times w$$
    Where:
    - L is the lower class boundary of the modal group
    - $f_{m-1}$ is the frequency of the group before the modal group
    - $f_m$ is the frequency of the modal group
    - $f_{m+1}$ is the frequency of the group after the modal group
    - w is the group width
    Properties of the mode
    - It represents the most typical value of the distribution and it should coincide with existing items
    - It is not affected by the presence of extremely large or small items
    Advantages of the Mode
    - It is easy to understand
    - Extreme items do not affect its value
    - It possesses the merit of simplicity
    Disadvantages of the Mode
    - It is often not clearly defined
    - Exact location is often uncertain
    - It is unsuitable for further algebraic treatment
  • 9. - It does not take into account extreme values.
    GEOMETRIC MEAN
    The geometric mean is the nth root of the product of n values, i.e.
    $$G.M = \sqrt[n]{x_1 \cdot x_2 \cdots x_n}$$
    For ungrouped data:
    $$G.M = \text{Antilog}\left(\frac{\sum \log x}{n}\right)$$
    For grouped data:
    $$G.M = \text{Antilog}\left(\frac{\sum f \log x}{n}\right)$$
    Merits of the Geometric mean
    - It takes into account all the items in the data and condenses them into one representative value.
    - It gives more weight to smaller values than to large values.
    - It is amenable to algebraic manipulations
    Demerits
    - It is difficult to use and compute
    - It is defined only for positive values and cannot be used when any value is negative or zero.
    HARMONIC MEAN
    It is the reciprocal of the arithmetic mean of the reciprocals of a series of observations.
    Ungrouped data:
    $$H.M = \frac{n}{\sum \frac{1}{x}}$$
    Grouped data:
    $$H.M = \frac{n}{\sum \frac{f}{x}}$$
    Merits of the Harmonic mean
    - It takes into account all the observations in the data
    - It gives more weight to smaller items
    - It is amenable to algebraic manipulations
    - It measures the rates of change
    Demerits
    - It is difficult to compute when the number of items is large
    - It assigns too much weight to smaller items.
    Factors to consider in the choice of an average
    - The purpose for which the average is being used
    - The nature, characteristics and properties of the average
    - The nature and characteristics of the data.
    MEASURES OF DISPERSION
    Definition of dispersion
    - It is the degree to which numerical data tends to spread about an average value
  • 10. BMCU002: QUANTITATIVE METHODS NOTES Page 7 of 11  It is the extent of the scattered ness of items around a measure of central tendency Significance of measuring dispersion  To determine the reliability of an average  To serve as a basis for the control of the variability  To compare two or more series with regard to their variability  To facilitate the use of other statistical measures Properties of a good measure of dispersion It should be: -  Simple to understand  Easy to compute  Rigidly defined  Based on each and every item in the distribution  Amenable to further algebraic calculations  Have sampling stability  Not be unduly affected by extreme values Measures of dispersion  Range  Quartile deviation  Mean deviation  Standard deviation
  • 11. The Range: It is the difference between the smallest value and the largest value of a series.
    Advantages of the Range
    - It is the simplest to understand and compute
    - It takes the minimum time to calculate the value of the range
    Limitations
    - It is not based on each and every value of the distribution
    - It is subject to fluctuations of considerable magnitude from sample to sample
    - It cannot be computed in case of open-ended distributions
    - It does not explain or indicate anything about the character of the distribution within the two extreme observations.
    Uses of the range
    - Quality control
    - Fluctuations of prices
    - Weather forecast
    - Finding the difference between two values, e.g. wages earned by different employees.
    The standard deviation
    It is the square root of the arithmetic average of the squares of the deviations measured from the mean. It measures how much "spread" or "variability" is present in the sample. A small standard deviation means a high degree of uniformity of the observations as well as homogeneity of a series, and vice versa.
    Ways of computing the standard deviation
    Direct method
    Ungrouped data:
    $$\sigma = \sqrt{\frac{\sum dx^2}{n}}$$
    where $\sum dx^2$ = sum of squares of the deviations from the arithmetic mean.
    Grouped data:
    $$\sigma = \sqrt{\frac{\sum f\,dx^2}{n}}$$
    Advantages of the standard deviation
    - It is rigidly defined and is based on all the observations of the series
    - It is applied or used in other statistical techniques like correlation and regression analysis and sampling theory.
    - It is possible to calculate the combined standard deviation of two or more groups.
    Disadvantages of the standard deviation
    - It cannot be used for comparing the dispersion of two or more series of observations given in different units.
    - It gives more weight to extreme values.
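The direct method for the standard deviation can be sketched as follows. The two data sets are hypothetical and chosen only so the arithmetic is easy to follow by hand; both functions divide by n, as in the formulas above, so they give the population form of the standard deviation.

```python
import math


def sd_ungrouped(values):
    """Square root of the mean squared deviation from the arithmetic mean."""
    n = len(values)
    mean = sum(values) / n
    return math.sqrt(sum((x - mean) ** 2 for x in values) / n)


def sd_grouped(midpoints, freqs):
    """Same measure for grouped data, weighting each squared deviation by f."""
    n = sum(freqs)
    mean = sum(x * f for x, f in zip(midpoints, freqs)) / n
    return math.sqrt(
        sum(f * (x - mean) ** 2 for x, f in zip(midpoints, freqs)) / n
    )


# Hypothetical ungrouped series: mean 5, squared deviations sum to 32.
raw = [2, 4, 4, 4, 5, 5, 7, 9]
sd_raw = sd_ungrouped(raw)

# Hypothetical grouped series: class midpoints with their frequencies.
sd_grp = sd_grouped([5, 15, 25], [2, 5, 3])

print(sd_raw, sd_grp)
```

For the ungrouped series the result agrees with `statistics.pstdev` from the Python standard library, which also divides by n.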
  • 12. BMCU002: QUANTITATIVE METHODS NOTES Page 9 of 11 SKEWNESS AND KURTOSIS IN STATISTICS The average and measure of dispersion can describe the distribution but they are not sufficient to describe the nature of the distribution. For this purpose we use other concepts known as Skewness and Kurtosis. The symmetrical and skewed distributions are shown by curves as Skewness Skewness means lack of symmetry. A distribution is said to be symmetrical when the values are uniformly distributed around the mean. For example, the following distribution is symmetrical about its mean 3. X : 1 2 3 4 5 Frequency (f): 5 9 12 9 5 In a symmetrical distribution the mean, median and mode coincide, that is, mean = median = mode. Several measures are used to express the direction and extent of skewness of a dispersion. The important measures are that given by Pearson. The first one is the Coefficient of Skewness: For a symmetric distribution Sk = 0. If the distribution is negatively skewed then Sk is negative and if it is positively skewed then Sk is positive. The range for Sk is from -3 to 3.
  • 13. The other measure uses the β (read 'beta') coefficient, which is given by
    $$\beta_1 = \frac{\mu_3^2}{\mu_2^3}$$
    where $\mu_2$ and $\mu_3$ are the second and third central moments. The second central moment $\mu_2$ is simply the variance. The sample estimate of this coefficient is
    $$b_1 = \frac{m_3^2}{m_2^3}$$
    where $m_2$ and $m_3$ are the sample central moments given by
    $$m_2 = \frac{\sum (x - \bar{x})^2}{n}, \qquad m_3 = \frac{\sum (x - \bar{x})^3}{n}$$
    For a symmetrical distribution $b_1 = 0$. Skewness is positive or negative depending upon whether $m_3$ is positive or negative.
    Kurtosis
    A measure of the peakedness or convexity of a curve is known as kurtosis.
  • 14. It is clear from the above figure that all three curves, (1), (2) and (3), are symmetrical about the mean. Still, they are not of the same type: each has a different peak. Curve (1) is known as mesokurtic (normal curve), curve (2) as leptokurtic (peaked curve) and curve (3) as platykurtic (flat curve).
    Kurtosis is measured by Pearson's coefficient $\beta_2$ (read 'beta two'). It is given by
    $$\beta_2 = \frac{\mu_4}{\mu_2^2}$$
    The sample estimate of this coefficient is
    $$b_2 = \frac{m_4}{m_2^2}$$
    where $m_4$ is the fourth central moment given by
    $$m_4 = \frac{\sum (x - \bar{x})^4}{n}$$
    The distribution is called normal if $b_2 = 3$. When $b_2$ is more than 3 the distribution is said to be leptokurtic. If $b_2$ is less than 3 the distribution is said to be platykurtic.
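The moment-based coefficients can be illustrated with the symmetrical distribution quoted earlier (X: 1 to 5 with frequencies 5, 9, 12, 9, 5). The sketch below assumes the standard Pearson definitions, b1 = m3²/m2³ and b2 = m4/m2², with each central moment computed as a frequency-weighted average of powered deviations; the helper name `central_moment` is illustrative.

```python
# Symmetrical frequency distribution from the skewness discussion.
x = [1, 2, 3, 4, 5]
f = [5, 9, 12, 9, 5]

n = sum(f)
mean = sum(xi * fi for xi, fi in zip(x, f)) / n


def central_moment(k):
    """k-th central moment: frequency-weighted mean of (x - mean)**k."""
    return sum(fi * (xi - mean) ** k for xi, fi in zip(x, f)) / n


m2 = central_moment(2)   # variance
m3 = central_moment(3)   # zero for a symmetrical distribution
m4 = central_moment(4)

b1 = m3 ** 2 / m2 ** 3   # skewness coefficient (assumed standard form)
b2 = m4 / m2 ** 2        # kurtosis coefficient (assumed standard form)

print(f"mean={mean} m2={m2} m3={m3} m4={m4}")
print(f"b1={b1} b2={b2:.4f}")
```

Here b1 comes out as zero, as expected for a symmetrical distribution, and b2 is below 3, so by the rule above this particular distribution would be classed as platykurtic.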
  • 15. MEASURES OF CENTRAL TENDENCY
    MODE
    Meaning
    The mode is the value in a distribution that occurs most frequently. It is an actual value, which has the highest concentration of items in and around it.
    Computation of the Mode
    1. Ungrouped or Raw Data
    For ungrouped data or a series of individual observations, the mode is often found by mere inspection.
    Example 1: 2, 7, 10, 15, 10, 17, 8, 10, 2
    ∴ Mode = M0 = 10
    In some cases the mode may be absent, while in others there may be more than one mode.
    Example 2:
    1) 12, 10, 15, 24, 30 (no mode)
    2) 7, 10, 15, 12, 7, 14, 24, 10, 7, 20, 10
    ∴ The modes are 7 and 10
    2. Grouped Data
    a) Discrete Distribution
    For a discrete distribution, find the highest frequency; the corresponding value of X is the mode. A discrete variable is one whose outcomes are measured in fixed numbers.
    b) Continuous Distribution
    Find the highest frequency; the corresponding class interval is called the modal class. Then apply the following formula:
  • 16. $$M_0 = l_1 + \frac{f_1 - f_0}{(f_1 - f_0) + (f_1 - f_2)} \times i$$
    Where:
    $l_1$ = the lower value of the class in which the mode lies
    $f_1$ = the frequency of the class in which the mode lies
    $f_0$ = the frequency of the class preceding the modal class
    $f_2$ = the frequency of the class succeeding the modal class
    $i$ = the class interval of the modal class
    NOTE: While applying the above formula, we should ensure that the class intervals are uniform throughout. If the class intervals are not uniform, then they should be made uniform on the assumption that the frequencies are evenly distributed throughout the class.
    Example 3: Let us take the following frequency distribution:
    Class Intervals    Frequency
    30−40              4
    40−50              6
    50−60              8
    60−70              12
    70−80              9
    80−90              7
    90−100             4
    Required: Calculate the mode in respect of this series.
    Solution
    The highest frequency is 12, so the modal class is 60−70.
    $$M_0 = 60 + \frac{12 - 8}{(12 - 8) + (12 - 9)} \times 10 = 60 + \frac{4}{4 + 3} \times 10 = 65.7 \text{ approx.}$$
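The calculation in Example 3 can be reproduced with a small function. This is a sketch that assumes uniform class widths and a modal class that is neither the first nor the last class, so that both neighbouring frequencies exist; the name `grouped_mode` is illustrative.

```python
def grouped_mode(classes):
    """Mode of a grouped distribution given (lower, upper, frequency) rows.

    Assumes equal class widths and an interior modal class.
    """
    # Index of the class with the highest frequency (first one if tied).
    m = max(range(len(classes)), key=lambda k: classes[k][2])
    if m == 0 or m == len(classes) - 1:
        raise ValueError("modal class has no neighbour on one side")
    lower, upper, f1 = classes[m]
    f0 = classes[m - 1][2]   # frequency of the preceding class
    f2 = classes[m + 1][2]   # frequency of the succeeding class
    width = upper - lower
    return lower + (f1 - f0) / ((f1 - f0) + (f1 - f2)) * width


# Frequency distribution from Example 3.
example3 = [(30, 40, 4), (40, 50, 6), (50, 60, 8), (60, 70, 12),
            (70, 80, 9), (80, 90, 7), (90, 100, 4)]

mode3 = grouped_mode(example3)
print(round(mode3, 1))
```

When the highest frequency is repeated or sits at either end of the table, inspection is not reliable and the function above should not be used as it stands.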
  • 17. 3. Determination of Modal Class
    For a frequency distribution, the modal class corresponds to the maximum frequency. But it is not possible to identify the modal class by inspection in any one (or more) of the following cases:
    i. If the maximum frequency is repeated.
    ii. If the maximum frequency occurs at the beginning or at the end of the distribution.
    iii. If there are irregularities in the distribution.
    In these cases the modal class is determined by the method of grouping.
    Steps for Calculation
    1. Prepare a grouping table with 6 columns.
    2. In column I, write down the given frequencies.
    3. Column II is obtained by combining the frequencies two by two.
    4. Leave the 1st frequency and combine the remaining frequencies two by two and write in column III.
    5. Column IV is obtained by combining the frequencies three by three.
    6. Leave the 1st frequency and combine the remaining frequencies three by three and write in column V.
    7. Leave the 1st and 2nd frequencies and combine the remaining frequencies three by three and write in column VI.
    8. Mark the highest total in each column.
    9. Form an analysis table to find the modal class.
    10. After finding the modal class, use the formula to calculate the modal value.
    Example 4
    Calculate the mode for the following frequency distribution.
        Class Interval   0−5   5−10   10−15   15−20   20−25   25−30   30−35   35−40
        Frequency         9     12     15      16      17      15      10      13
  • 18. Solution
    Grouping Table (sums of two are written against the second class of the pair, sums of three against the middle class)
        Class Interval   (1)   (2)   (3)   (4)   (5)   (6)
        0−5               9
        5−10             12    21          36
        10−15            15          27          43
        15−20            16    31                      48
        20−25            17          33    48
        25−30            15    32                42
        30−35            10          25                38
        35−40            13    23
    Analysis Table
        Column   0−5   5−10   10−15   15−20   20−25   25−30   30−35   35−40
        1                                       1
        2                                       1       1
        3                               1       1
        4                               1       1       1
        5               1       1       1
        6                       1       1       1
        Total           1       2       4       5       2
    The maximum total occurs for 20−25, and hence it is the modal class.
    $M_0 = 20 + \frac{17 - 16}{(17 - 16) + (17 - 15)} \times 5 = 20 + \frac{1}{1 + 2} \times 5 = 21.67$ approx.
    Example 5
    The following table gives some frequency data:
  • 19.     Size of Item   Frequency   Cumulative Frequency
        10−20            10          10
        20−30            18          28
        30−40            25          53
        40−50            26          79
        50−60            17          96
        60−70             4         100
        Total           100
    Required: Calculate the mode
    Solution
    Grouping Table
        Class Interval   (1)   (2)   (3)   (4)   (5)   (6)
        10−20            10
        20−30            18    28          53
        30−40            25          43          69
        40−50            26    51                      68
        50−60            17          43    47
        60−70             4    21
    Analysis Table
        Column   10−20   20−30   30−40   40−50   50−60   60−70
        1                                  1
        2                          1       1
        3                  1       1       1       1
        4          1       1       1
        5                  1       1       1
        6                          1       1       1
        Total      1       3       5       5       2
  • 20. The analysis table gives the same total (5) for the classes 30−40 and 40−50, so the modal class cannot be fixed by grouping. In such a case the mode is estimated from the empirical relationship:
    Mode = 3 Median − 2 Mean
    Median = value of the $\frac{n + 1}{2}$th item = $\frac{100 + 1}{2}$ = 50.5th item. This lies in the class 30−40.
    $\text{Median} = l_1 + \frac{l_2 - l_1}{f}(m - c) = 30 + \frac{40 - 30}{25}(50.5 - 28) = 30 + 9 = 39$
    Calculation of Arithmetic Mean
        Class Interval   Frequency (f)   Mid-point   d = m − 35   d′ = d/10   fd′
        10−20             10              15          −20          −2          −20
        20−30             18              25          −10          −1          −18
        30−40             25              35            0           0            0
        40−50             26              45           10           1           26
        50−60             17              55           20           2           34
        60−70              4              65           30           3           12
        Total            100                                                    34
    Assumed mean A = 35
    $\text{Mean} = A + \frac{\sum fd'}{n} \times i = 35 + \frac{34}{100} \times 10 = 38.4$
    Mode = 3 Median − 2 Mean = 3(39) − 2(38.4) = 117 − 76.8 = 40.2
    Merits of Mode
    1. It is easy to calculate and in some cases it can be located by mere inspection.
    2. Mode is not at all affected by extreme values.
    3. It can be calculated for open-end classes.
    4. It is usually an actual value of an important part of the series.
    5. In some circumstances it is the best representative of data.
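The empirical estimate used in Example 5 can be reproduced with a small Python sketch. It follows the notes in locating the median at the (n + 1)/2-th item and interpolating inside the median class; the helper names and the `(lower, upper, frequency)` layout are assumptions made for illustration.

```python
def grouped_mean(classes):
    """Arithmetic mean of a grouped distribution using class mid-points."""
    total = sum(f for _, _, f in classes)
    return sum(f * (lo + hi) / 2 for lo, hi, f in classes) / total


def grouped_median(classes, position):
    """Interpolate the value of the item at `position` within its class."""
    cumulative = 0  # cumulative frequency of the classes already passed
    for lo, hi, f in classes:
        if cumulative + f >= position:
            return lo + (hi - lo) / f * (position - cumulative)
        cumulative += f
    raise ValueError("position exceeds the total frequency")


# Frequency distribution from Example 5
example5 = [(10, 20, 10), (20, 30, 18), (30, 40, 25),
            (40, 50, 26), (50, 60, 17), (60, 70, 4)]
n = sum(f for _, _, f in example5)
median = grouped_median(example5, (n + 1) / 2)  # 50.5th item
mean = grouped_mean(example5)
mode = 3 * median - 2 * mean                    # empirical relationship
print(median, mean, round(mode, 1))
```

The direct mean from mid-points agrees with the assumed-mean shortcut in the table, since both are the same weighted average written differently.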
  • 21. Demerits of Mode
    1. It is not based on all observations.
    2. It is not capable of further mathematical treatment.
    3. Mode is often ill-defined; in some cases it cannot be found at all.
    4. As compared with the mean, the mode is affected to a great extent by sampling fluctuations.
    5. It is unsuitable in cases where the relative importance of items has to be considered.
    QUARTILES
    Meaning
    The quartiles divide the distribution into four parts. There are three quartiles. The second quartile (Q2) divides the distribution into two halves and is therefore the same as the median. The first (lower) quartile (Q1) marks off the first one-fourth; the third (upper) quartile (Q3) marks off the first three-fourths. In other words, the three quartiles Q1, Q2 and Q3 are such that 25 percent of the data fall below Q1, 25 percent fall between Q1 and Q2, 25 percent fall between Q2 and Q3 and 25 percent fall above Q3.
    Computation of Quartiles
    1. Raw or Ungrouped Data
    First arrange the given data in increasing order and use the formulae for Q1 and Q3:
    $Q_1$ = value of the $\left(\frac{n + 1}{4}\right)$th item
    $Q_3$ = value of the $3\left(\frac{n + 1}{4}\right)$th item
    Example 1
    Compute quartiles for the data given below: 25, 18, 30, 8, 15, 5, 10, 35, 40, 45
    Solution
    Arranged in increasing order: 5, 8, 10, 15, 18, 25, 30, 35, 40, 45
    $Q_1$ = value of the $\left(\frac{n + 1}{4}\right)$th item
  • 22. $Q_1$ = value of the $\left(\frac{10 + 1}{4}\right)$th item = value of the 2.75th item
    $Q_1$ = 2nd item + $\frac{3}{4}$(3rd item − 2nd item) = $8 + \frac{3}{4}(10 - 8) = 9.5$
    $Q_3$ = value of the $3\left(\frac{n + 1}{4}\right)$th item = value of the 3(2.75)th item = value of the 8.25th item
    $Q_3$ = 8th item + $\frac{1}{4}$(9th item − 8th item) = $35 + \frac{1}{4}(40 - 35) = 36.25$
    2. Discrete Series
    Step 1: Find the cumulative frequencies.
    Step 2: Find $\frac{n + 1}{4}$.
    Step 3: Look in the cumulative frequencies for the value just greater than $\frac{n + 1}{4}$; the corresponding value of x is Q1.
    Step 4: Find $3\left(\frac{n + 1}{4}\right)$.
    Step 5: Look in the cumulative frequencies for the value just greater than $3\left(\frac{n + 1}{4}\right)$; the corresponding value of x is Q3.
    Example 2
    Compute quartiles for the data given below:
        X   5   8   12   15   19   24   30
        F   4   3    2    4    5    2    4
  • 23. Solution:
        X       F     CF
        5       4      4
        8       3      7
        12      2      9
        15      4     13
        19      5     18
        24      2     20
        30      4     24
        Total  24
    $Q_1$ = value of the $\left(\frac{N + 1}{4}\right)$th item = $\frac{24 + 1}{4} = \frac{25}{4}$ = 6.25th item
    $Q_3$ = value of the $3\left(\frac{N + 1}{4}\right)$th item = $3\left(\frac{25}{4}\right)$ = 18.75th item
    The cumulative frequency just greater than 6.25 is 7 and the one just greater than 18.75 is 20, so Q1 = 8 and Q3 = 24.
    3. Continuous Series
    Step 1: Find the cumulative frequencies.
    Step 2: Find $\frac{N}{4}$.
    Step 3: Look in the cumulative frequencies for the value just greater than $\frac{N}{4}$; the corresponding class interval is called the first quartile class.
    Step 4: Find $3\left(\frac{N}{4}\right)$.
    Step 5: Look in the cumulative frequencies for the value just greater than $3\left(\frac{N}{4}\right)$; the corresponding class interval is called the third quartile class.
    Step 6: Apply the respective formulae.
    $Q_1 = l_1 + \left(\frac{\frac{N}{4} - m_1}{f_1}\right) \times c_1$
  • 24. $Q_3 = l_3 + \left(\frac{3\left(\frac{N}{4}\right) - m_3}{f_3}\right) \times c_3$
    Where:
    $l_1$ = lower limit of the first quartile class
    $f_1$ = frequency of the first quartile class
    $c_1$ = width of the first quartile class
    $m_1$ = c.f. preceding the first quartile class
    $l_3$ = lower limit of the third quartile class
    $f_3$ = frequency of the third quartile class
    $c_3$ = width of the third quartile class
    $m_3$ = c.f. preceding the third quartile class
    Example 3
    The following series relates to the marks secured by students in an examination.
        Marks     Number of Students
        0−10      11
        10−20     18
        20−30     25
        30−40     28
        40−50     30
        50−60     33
        60−70     22
        70−80     15
        80−90     12
        90−100    10
  • 25. Required: Find the quartiles.
    Solution:
        Marks     Number of Students   Cumulative Frequency
        0−10      11                    11
        10−20     18                    29
        20−30     25                    54
        30−40     28                    82
        40−50     30                   112
        50−60     33                   145
        60−70     22                   167
        70−80     15                   182
        80−90     12                   194
        90−100    10                   204
        Total    204
    $\frac{N}{4} = \frac{204}{4} = 51$; $3\left(\frac{N}{4}\right) = 153$
    $Q_1 = 20 + \left(\frac{51 - 29}{25}\right) \times 10 = 28.8$
    $Q_3 = 60 + \left(\frac{153 - 145}{22}\right) \times 10 = 63.64$
    PERCENTILES
    The percentile values divide the distribution into 100 parts, each containing 1 percent of the cases. The k-th percentile ($P_k$) is that value of the variable up to which lie exactly k% of the total number of observations.
    1. Percentile for Raw Data or Ungrouped Data
    Relationship: P25 = Q1; P50 = Q2 = Median and P75 = Q3
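The continuous-series quartile formulae can be sketched in Python as a single interpolation routine. The function name and the `(lower, upper, frequency)` layout are illustrative assumptions; the data are those of the marks example above.

```python
def grouped_quantile(classes, position):
    """Value at a cumulative-frequency position (e.g. N/4 or 3N/4).

    classes: list of (lower, upper, frequency).
    Implements Q = l + (position - m) / f * c, where m is the cumulative
    frequency preceding the quartile class.
    """
    cumulative = 0
    for lower, upper, freq in classes:
        # the quartile class is the first whose c.f. is just greater
        if cumulative + freq > position:
            return lower + (position - cumulative) / freq * (upper - lower)
        cumulative += freq
    raise ValueError("position exceeds the total frequency")


marks = [(0, 10, 11), (10, 20, 18), (20, 30, 25), (30, 40, 28), (40, 50, 30),
         (50, 60, 33), (60, 70, 22), (70, 80, 15), (80, 90, 12), (90, 100, 10)]
N = sum(f for _, _, f in marks)          # 204
q1 = grouped_quantile(marks, N / 4)      # position 51
q3 = grouped_quantile(marks, 3 * N / 4)  # position 153
print(round(q1, 2), round(q3, 2))
```

The same routine gives the median when called with position N/2, which is why Q2 and the median coincide.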
  • 26. Example 4
    Calculate P15 for the data given below: 5, 24, 36, 12, 20, 8
    Solution:
    Arranging the given values in increasing order: 5, 8, 12, 20, 24, 36
    $P_{15}$ = value of the $\left(\frac{15(n + 1)}{100}\right)$th item = value of the $\left(\frac{15(6 + 1)}{100}\right)$th item = value of the $\left(\frac{15 \times 7}{100}\right)$th item = value of the 1.05th item
    $P_{15}$ = 1st item + 0.05(2nd item − 1st item) = 5 + 0.05(8 − 5) = 5.15
    2. Percentile for Grouped Data
    Example 5
    Find P53 for the following frequency distribution:
        Class Interval   0−5   5−10   10−15   15−20   20−25   25−30   30−35   35−40
        Frequency         5     8      12      16      20      10      4       3
    Solution:
        Class Interval   Frequency   Cumulative Frequency
        0−5               5            5
        5−10              8           13
        10−15            12           25
        15−20            16           41
        20−25            20           61
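  • 27.     25−30            10           71
        30−35             4           75
        35−40             3           78
        Total            78
    $P_{53} = l_1 + \frac{\frac{53N}{100} - m}{f} \times c$
    Here $\frac{53N}{100} = \frac{53(78)}{100} = 41.34$, which falls in the class 20−25 (m = 41, f = 20, c = 5).
    $P_{53} = 20 + \frac{41.34 - 41}{20} \times 5 = 20.085$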
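Both percentile rules above can be sketched in Python. This is a minimal illustration that assumes the conventions used in these notes: position k(n + 1)/100 with linear interpolation for raw data, and position kN/100 interpolated inside the percentile class for grouped data. The function names are hypothetical.

```python
def raw_percentile(values, k):
    """k-th percentile of raw data using the (n + 1) positional rule."""
    data = sorted(values)
    n = len(data)
    position = k * (n + 1) / 100   # 1-based, possibly fractional
    whole = int(position)
    fraction = position - whole
    if whole < 1:                  # position falls before the first item
        return data[0]
    if whole >= n:                 # position falls at or after the last item
        return data[-1]
    return data[whole - 1] + fraction * (data[whole] - data[whole - 1])


def grouped_percentile(classes, k):
    """k-th percentile of grouped data: l + (kN/100 - m) / f * c."""
    total = sum(f for _, _, f in classes)
    position = k * total / 100
    cumulative = 0
    for lower, upper, freq in classes:
        if cumulative + freq > position:
            return lower + (position - cumulative) / freq * (upper - lower)
        cumulative += freq
    return classes[-1][1]          # k = 100: upper limit of the last class


p15 = raw_percentile([5, 24, 36, 12, 20, 8], 15)          # Example 4
distribution = [(0, 5, 5), (5, 10, 8), (10, 15, 12), (15, 20, 16),
                (20, 25, 20), (25, 30, 10), (30, 35, 4), (35, 40, 3)]
p53 = grouped_percentile(distribution, 53)                # Example 5
print(round(p15, 2), round(p53, 3))
```

Calling `raw_percentile` with k = 25 and k = 75 reproduces the quartile rule for raw data, in line with the relationship P25 = Q1 and P75 = Q3.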
  • 28. BPCU004: ADVANCED BUSINESS STATISTICS Page 1 of 23
    MEASURES OF DISPERSION
    MEANING
    Dispersion (also known as scatter, spread or variation) measures the extent to which the items vary from some central value.
    SIGNIFICANCE OF MEASURING VARIATION
    1. Measures of variation point out how far an average is representative of the mass.
    2. Measures of dispersion determine the nature and cause of variation in order to control the variation itself.
    3. Measures of dispersion enable a comparison to be made of two or more series with regard to their variability.
    4. Measures of dispersion are the basis of many powerful analytical tools in statistics such as correlation analysis, testing of hypotheses, analysis of variance, statistical quality control and regression analysis.
    Characteristics/Properties of a Good Measure of Dispersion
    1. It should be simple to understand.
    2. It should be easy to compute.
    3. It should be rigidly defined.
    4. It should be based on each and every item of the distribution.
    5. It should be amenable to further algebraic treatment.
    6. It should have sampling stability.
    7. Extreme items should not unduly affect it.
    ABSOLUTE AND RELATIVE MEASURES OF DISPERSION
    There are two kinds of measures of dispersion, namely:
    1. Absolute measure of dispersion.
    2. Relative measure of dispersion.
    An absolute measure of dispersion indicates the amount of variation in a set of values in terms of the units of the observations. For example, when rainfall on different days is available in mm, any absolute measure of dispersion gives the variation in rainfall in mm. On the other hand, relative measures of dispersion are free from the units of measurement of the observations. They are
  • 29. pure numbers. They are used to compare the variation in two or more sets that have different units of measurement.
        Absolute measure         Relative measure
        1. Range                 1. Co-efficient of range
        2. Quartile deviation    2. Co-efficient of quartile deviation
        3. Mean deviation        3. Co-efficient of mean deviation
        4. Standard deviation    4. Co-efficient of variation
    RANGE AND COEFFICIENT OF RANGE
    1. Range
    This is the simplest possible measure of dispersion and is defined as the difference between the largest and smallest values of the variable.
    Range = L − S
    Where: L = Largest value, S = Smallest value
    In individual observations and discrete series, L and S are easily identified. In continuous series, either of the following two methods is followed.
    Method 1: L = Upper boundary of the highest class; S = Lower boundary of the lowest class
    Method 2: L = Mid value of the highest class; S = Mid value of the lowest class
    2. Co-efficient of Range
    $\text{Coefficient of Range} = \frac{L - S}{L + S}$
    Example 1
    Find the value of the range and its co-efficient for the following data: 7, 9, 6, 8, 11, 10, 4
  • 30. Solution:
    Range = L − S = 11 − 4 = 7
    $\text{Coefficient of Range} = \frac{L - S}{L + S} = \frac{11 - 4}{11 + 4} = 0.4667$
    Example 2:
    Calculate the range and its co-efficient from the following distribution.
        Size:     60−63   63−66   66−69   69−72   72−75
        Number:     5      18      42      27       8
    Solution:
    Range = L − S = 75 − 60 = 15
    $\text{Coefficient of Range} = \frac{L - S}{L + S} = \frac{75 - 60}{75 + 60} = 0.1111$
    Merits
    1. It is simple to understand.
    2. It is easy to calculate.
    3. In certain types of problems like quality control, weather forecasts, share price analysis, etc., the range is most widely used.
    Demerits:
    1. It is very much affected by the extreme items.
    2. It is based on only two extreme observations.
    3. It cannot be calculated from open-end class intervals.
    4. It is not suitable for mathematical treatment.
    5. It is a very rarely used measure.
    QUARTILE DEVIATION AND CO-EFFICIENT OF QUARTILE DEVIATION
    1. Quartile Deviation (Q.D)
    Definition: Quartile Deviation is half of the difference between the first and third quartiles. Hence, it is called the Semi-Inter Quartile Range.
    $Q.D = \frac{Q_3 - Q_1}{2}$
  • 31. Among the quartiles Q1, Q2 and Q3, the range Q3 − Q1 is called the inter-quartile range and $\frac{Q_3 - Q_1}{2}$ the semi-inter-quartile range.
    2. Co-efficient of Quartile Deviation
    $\text{Co-efficient of Q.D} = \frac{Q_3 - Q_1}{Q_3 + Q_1}$
    Example 3
    Find the Quartile Deviation for the following data: 391, 384, 591, 407, 672, 522, 777, 733, 1490, 2488
    Solution:
    Arrange the given values in ascending order: 384, 391, 407, 522, 591, 672, 733, 777, 1490, 2488
    Position of Q1 is $\frac{N + 1}{4} = \frac{10 + 1}{4}$ = 2.75th item
    Q1 = 2nd item + 0.75(3rd item − 2nd item) = 391 + 0.75(407 − 391) = 403
    Position of Q3 is $3\left(\frac{N + 1}{4}\right)$ = 3(2.75) = 8.25th item
    Q3 = 8th item + 0.25(9th item − 8th item) = 777 + 0.25(1490 − 777) = 955.25
    $Q.D = \frac{955.25 - 403}{2} = 276.125$
    Example 4
    Weekly wages of labourers are given below. Calculate the Q.D and the coefficient of Q.D.
        Weekly Wage (Kshs.)   100   200   400   500   600
        No. of Weeks            5     8    21    12     6
  • 32. Solution:
        Weekly Wage (Kshs.)   No. of Weeks   Cum. No. of Weeks
        100                     5              5
        200                     8             13
        400                    21             34
        500                    12             46
        600                     6             52
        Total                  52
    Position of Q1 is $\frac{N + 1}{4} = \frac{52 + 1}{4}$ = 13.25th item
    Q1 = 13th item + 0.25(14th item − 13th item) = 200 + 0.25(400 − 200) = 250
    Position of Q3 is $3\left(\frac{N + 1}{4}\right)$ = 3(13.25) = 39.75th item
    The 35th to 46th items are all equal to 500, so both the 39th and the 40th items are 500.
    Q3 = 39th item + 0.75(40th item − 39th item) = 500 + 0.75(500 − 500) = 500
    $Q.D = \frac{500 - 250}{2} = 125$
    $\text{Co-efficient of Q.D} = \frac{Q_3 - Q_1}{Q_3 + Q_1} = \frac{500 - 250}{500 + 250} = \frac{250}{750} = 0.333$
    Example 5
    For the data given below, give the quartile deviation and the coefficient of quartile deviation.
        X   351−500   501−650   651−800   801−950   951−1100
        F     48        189        88        47         28
  • 33. Solution:
        X           True Class Intervals   F      Cumulative Frequency
        351−500     350.5−500.5             48     48
        501−650     500.5−650.5            189    237
        651−800     650.5−800.5             88    325
        801−950     800.5−950.5             47    372
        951−1100    950.5−1100.5            28    400
        Total                              400
    Position of Q1: $\frac{N}{4} = \frac{400}{4} = 100$; position of Q3: $3\left(\frac{N}{4}\right) = 3(100) = 300$
    $Q_1 = l_1 + \left(\frac{\frac{N}{4} - m_1}{f_1}\right) \times c_1 = 500.5 + \left(\frac{100 - 48}{189}\right) \times 150 = 541.77$
    $Q_3 = l_3 + \left(\frac{3\left(\frac{N}{4}\right) - m_3}{f_3}\right) \times c_3 = 650.5 + \left(\frac{300 - 237}{88}\right) \times 150 = 757.89$
    $Q.D = \frac{Q_3 - Q_1}{2} = \frac{757.89 - 541.77}{2} = 108.06$
    $\text{Co-efficient of Q.D} = \frac{Q_3 - Q_1}{Q_3 + Q_1} = \frac{757.89 - 541.77}{757.89 + 541.77} = 0.1663$
    Merits of Quartile Deviation
    1. It is simple to understand and easy to calculate.
    2. It is not affected by extreme values.
    3. It can be calculated for data with open-end classes also.
    Demerits of Quartile Deviation
    1. It is not based on all the items. It is based on two positional values, Q1 and Q3, and ignores the extreme 50% of the items.
  • 34. 2. It is not amenable to further mathematical treatment.
    3. It is affected by sampling fluctuations.
    MEAN DEVIATION AND COEFFICIENT OF MEAN DEVIATION
    1. Mean Deviation
    The mean deviation is a measure of dispersion based on all items in a distribution. It is the arithmetic mean of the deviations of a series computed from any measure of central tendency, i.e., the mean, median or mode, with all the deviations taken as positive (signs are ignored). In general practice, and owing to the wide application of the mean, the mean deviation is usually computed from the mean. M.D is used to denote mean deviation.
    2. Coefficient of Mean Deviation
    Mean deviation calculated from any measure of central tendency is an absolute measure. For the purpose of comparing variation among different series, a relative mean deviation is required. It is obtained by dividing the mean deviation by the average used for calculating it.
    $\text{Co-efficient of Mean Deviation} = \frac{\text{Mean Deviation}}{\text{Mean or Median or Mode}}$
    If the result is desired as a percentage, the coefficient of mean deviation is multiplied by 100:
    $\text{Co-efficient of Mean Deviation} = \frac{\text{Mean Deviation}}{\text{Mean or Median or Mode}} \times 100$
    COMPUTATION OF MEAN DEVIATION
    1. Individual Series
    a. Calculate the average (mean, median or mode) of the series.
    b. Take the deviations of the items from the average, ignoring signs, and denote these deviations by |D|.
    c. Compute the total of these deviations, i.e., Σ|D|.
    d. Divide this total by the number of items.
    $M.D. = \frac{\sum |D|}{n}$
  • 35. Example 6
    Calculate the mean deviation from the mean and from the median for the following data: 100, 150, 200, 250, 360, 490, 500, 600, 671. Also calculate the coefficients of M.D.
    Solution:
    $\text{Mean} = \frac{\sum X}{N} = \frac{3321}{9} = 369$
    Now arrange the data in ascending order: 100, 150, 200, 250, 360, 490, 500, 600, 671
    Median = value of the $\left(\frac{n + 1}{2}\right)$th item = value of the $\left(\frac{9 + 1}{2}\right)$th item = value of the 5th item = 360
        X       |D| = |X − Mean|   |D| = |X − Median|
        100     269                 260
        150     219                 210
        200     169                 160
        250     119                 110
        360       9                   0
        490     121                 130
        500     131                 140
        600     231                 240
        671     302                 311
        3321   1570                1561
    $\text{M.D. from mean} = \frac{\sum |D|}{n} = \frac{1570}{9} = 174.44$
    $\text{Co-efficient of M.D.} = \frac{M.D.}{\text{Mean}} = \frac{174.44}{369} = 0.47$
    $\text{M.D. from median} = \frac{\sum |D|}{n} = \frac{1561}{9} = 173.44$
    $\text{Co-efficient of M.D.} = \frac{M.D.}{\text{Median}} = \frac{173.44}{360} = 0.48$
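The individual-series procedure in Example 6 can be verified with a short Python sketch. The helper name `mean_deviation` is an illustrative choice; `mean` and `median` come from Python's standard `statistics` module.

```python
from statistics import mean, median


def mean_deviation(values, center):
    """Average absolute deviation of the values from a chosen average."""
    return sum(abs(x - center) for x in values) / len(values)


data = [100, 150, 200, 250, 360, 490, 500, 600, 671]  # Example 6

md_from_mean = mean_deviation(data, mean(data))        # 1570 / 9
md_from_median = mean_deviation(data, median(data))    # 1561 / 9

coef_from_mean = md_from_mean / mean(data)
coef_from_median = md_from_median / median(data)

print(round(md_from_mean, 2), round(coef_from_mean, 2))
print(round(md_from_median, 2), round(coef_from_median, 2))
```

The deviation taken from the median is the smaller of the two here, which is consistent with the known property that the mean deviation is least when measured from the median.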
  • 36. 2. Mean Deviation: Discrete Series
    Step 1: Find an average (mean, median or mode).
    Step 2: Find the deviations of the variable values from the average, ignoring signs, and denote them by |D|.
    Step 3: Multiply the deviation of each value by its respective frequency and find the total Σf|D|.
    Step 4: Divide Σf|D| by the total frequency N.
    Example 7
    Compute the mean deviation from the mean and from the median for the following data:
        Height in cms    158   159   160   161   162   163   164   165   166
        No. of persons    15    20    32    35    33    22    20    10     8
    Also compute the coefficient of mean deviation.
    Solution:
        Height (X)   No. of persons (f)   d = X − A (A = 162)   fd     |D| = |X − mean|   f|D|
        158           15                   −4                    −60    3.51               52.65
        159           20                   −3                    −60    2.51               50.20
        160           32                   −2                    −64    1.51               48.32
        161           35                   −1                    −35    0.51               17.85
        162           33                    0                      0    0.49               16.17
        163           22                    1                     22    1.49               32.78
        164           20                    2                     40    2.49               49.80
        165           10                    3                     30    3.49               34.90
        166            8                    4                     32    4.49               35.92
        Total        195                                         −95                      338.59
    $\text{Mean} = A + \frac{\sum fd}{N} = 162 + \frac{-95}{195} = 161.51$
    $M.D. = \frac{\sum f|D|}{N} = \frac{338.59}{195} = 1.74$
  • 37. $\text{Co-efficient of M.D.} = \frac{M.D.}{\text{Mean}} = \frac{1.74}{161.51} = 0.0108$
        Height (X)   No. of persons (f)   c.f.   |D| = |X − median|   f|D|
        158           15                   15    3                    45
        159           20                   35    2                    40
        160           32                   67    1                    32
        161           35                  102    0                     0
        162           33                  135    1                    33
        163           22                  157    2                    44
        164           20                  177    3                    60
        165           10                  187    4                    40
        166            8                  195    5                    40
        Total        195                                             334
    Median = size of the $\left(\frac{N}{2}\right)$th item = size of the $\left(\frac{195}{2}\right)$th item = size of the 98th item = 161
    $M.D. = \frac{\sum f|D|}{N} = \frac{334}{195} = 1.71$
    $\text{Co-efficient of M.D.} = \frac{M.D.}{\text{Median}} = \frac{1.71}{161} = 0.0106$
    3. Mean Deviation: Continuous Series
    The method of calculating the mean deviation in a continuous series is the same as in a discrete series. In a continuous series we have to find the mid-points of the various classes and take the deviations of these points from the average selected. Thus
    $M.D. = \frac{\sum f|D|}{N}$
    Where: D = m − Average; m = mid-point
    Example 8:
    Find the mean deviation from the mean and from the median for the following series.
        Age in years   No. of persons
        0−10           20
  • 38.     10−20          25
        20−30          32
        30−40          40
        40−50          42
        50−60          35
        60−70          10
        70−80           8
    Also compute the co-efficient of mean deviation.
    Solution:
        X       m    f     d = (m − A)/c (A = 35, c = 10)   fd    |D| = |m − mean|   f|D|
        0−10     5   20    −3                               −60   31.5               630.0
        10−20   15   25    −2                               −50   21.5               537.5
        20−30   25   32    −1                               −32   11.5               368.0
        30−40   35   40     0                                 0    1.5                60.0
        40−50   45   42     1                                42    8.5               357.0
        50−60   55   35     2                                70   18.5               647.5
        60−70   65   10     3                                30   28.5               285.0
        70−80   75    8     4                                32   38.5               308.0
        Total       212                                      32                     3193.0
    $\text{Mean} = A + \frac{\sum fd}{N} \times c = 35 + \frac{32}{212} \times 10 = 36.5$
    $M.D. = \frac{\sum f|D|}{N} = \frac{3193}{212} = 15.06$
  • 39. Calculation of Median and M.D. from Median
        X       m    f     c.f    |D| = |m − Md|   f|D|
        0−10     5   20     20    32.25            645.00
        10−20   15   25     45    22.25            556.25
        20−30   25   32     77    12.25            392.00
        30−40   35   40    117     2.25             90.00
        40−50   45   42    159     7.75            325.50
        50−60   55   35    194    17.75            621.25
        60−70   65   10    204    27.75            277.50
        70−80   75    8    212    37.75            302.00
        Total       212                           3209.50
    Position of the median = $\left(\frac{N}{2}\right)$th item = $\frac{212}{2}$ = 106th item, which lies in the class 30−40.
    $\text{Median} = l + \frac{\frac{N}{2} - m}{f} \times c = 30 + \frac{106 - 77}{40} \times 10 = 37.25$
    $M.D. = \frac{\sum f|D|}{N} = \frac{3209.5}{212} = 15.14$
    $\text{Co-efficient of M.D.} = \frac{M.D.}{\text{Median}} = \frac{15.14}{37.25} = 0.41$
    Merits of M.D.
    1. It is simple to understand and easy to compute.
    2. It is rigidly defined.
    3. It is based on all items of the series.
    4. It is not much affected by the fluctuations of sampling.
    5. It is less affected by the extreme items.
    6. It is flexible, because it can be calculated from any average.
    7. It is a better measure of comparison.
    Demerits of M.D.
    1. It is not a very accurate measure of dispersion.
    2. It is not suitable for further mathematical calculation.
    3. It is rarely used. It is not as popular as standard deviation.
  • 40. 4. Algebraic positive and negative signs are ignored. It is mathematically unsound and illogical.
    STANDARD DEVIATION AND COEFFICIENT OF VARIATION
    1. Definition
    The standard deviation is defined as the positive square root of the arithmetic mean of the squares of the deviations of the given observations from their arithmetic mean. The square of the standard deviation is called the variance.
    2. Calculation of Standard Deviation: Individual Series
    There are two methods of calculating the standard deviation in an individual series.
    a) Deviations taken from the actual mean
    b) Deviations taken from an assumed mean
    (a) Deviations taken from the Actual Mean
    This method is adopted when the mean is a whole number.
    Steps:
    1. Find the actual mean of the series ($\bar{X}$).
    2. Find the deviation of each value from the mean ($x = X - \bar{X}$).
    3. Square the deviations and take the total of the squared deviations, $\sum x^2$.
    4. Divide the total $\sum x^2$ by the number of observations, $\frac{\sum x^2}{n}$, and take the square root.
    Formula:
    $\text{Standard Deviation } (\sigma) = \sqrt{\frac{\sum x^2}{n}}$ or $\sqrt{\frac{\sum (X - \bar{X})^2}{n}}$
    (b) Deviations taken from an Assumed Mean
    This method is adopted when the arithmetic mean is a fractional value. Taking deviations from a fractional value would be a very difficult and tedious task. To save time and labour, the short-cut method is applied. In this method, the deviations are taken from an assumed mean. The formula is:
  • 41. $\sigma = \sqrt{\frac{\sum d^2}{N} - \left(\frac{\sum d}{N}\right)^2}$
    Where: d stands for the deviations from the assumed mean, d = (X − A)
    Steps:
    1. Assume any one of the items in the series as an average (A).
    2. Find the deviations from the assumed mean, i.e., X − A, denoted by d, and also the total of the deviations, Σd.
    3. Square the deviations, i.e., d², and add up the squares of the deviations, i.e., Σd².
    4. Then substitute the values in the following formula:
    $\sigma = \sqrt{\frac{\sum d^2}{N} - \left(\frac{\sum d}{N}\right)^2}$
    Note: We can also use the simplified formula for the standard deviation:
    $\sigma = \frac{1}{n}\sqrt{n \sum d^2 - \left(\sum d\right)^2}$
    For a frequency distribution (with class width c and step deviations d):
    $\sigma = \frac{c}{N}\sqrt{N \sum fd^2 - \left(\sum fd\right)^2}$
    Example 9
    Calculate the standard deviation from the following data: 14, 22, 9, 15, 20, 17, 12, 11
  • 42. Solution: Deviations from the actual mean.
        Values (X)   (X − X̄)   (X − X̄)²
        14           −1          1
        22            7         49
        9            −6         36
        15            0          0
        20            5         25
        17            2          4
        12           −3          9
        11           −4         16
        120                    140
    $\bar{X} = \frac{120}{8} = 15$
    $\sigma = \sqrt{\frac{\sum (X - \bar{X})^2}{n}} = \sqrt{\frac{140}{8}} = 4.18$
    Example 10
    The table below gives the marks obtained by 10 students in statistics. Calculate the standard deviation.
        Student No.   1    2    3    4    5    6    7    8    9    10
        Marks         43   48   65   57   31   60   37   48   78   59
    Solution: Deviations from the assumed mean
        Student No.   Marks (X)   d = X − A (A = 57)   d²
        1             43          −14                  196
        2             48           −9                   81
        3             65            8                   64
        4             57            0                    0
        5             31          −26                  676
  • 43.     6             60            3                    9
        7             37          −20                  400
        8             48           −9                   81
        9             78           21                  441
        10            59            2                    4
        N = 10                   Σd = −44            Σd² = 1952
    $\sigma = \sqrt{\frac{\sum d^2}{N} - \left(\frac{\sum d}{N}\right)^2} = \sqrt{\frac{1952}{10} - \left(\frac{-44}{10}\right)^2} = 13.26$
    3. Calculation of Standard Deviation for Discrete Series
    There are three methods for calculating the standard deviation in a discrete series:
    (a) Actual mean method
    (b) Assumed mean method
    (c) Step-deviation method
    (a) Actual Mean Method
    Steps:
    1. Calculate the mean of the series.
    2. Find the deviations of the various items from the mean, i.e., $d = X - \bar{X}$.
    3. Square the deviations (d²) and multiply by the respective frequencies (f) to get fd².
    4. Total the products (Σfd²). Then apply the formula:
    $\sigma = \sqrt{\frac{\sum fd^2}{\sum f}}$
    If the actual mean is a fraction, the calculation takes a lot of time and labour, and as such this method is rarely used in practice.
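The assumed-mean shortcut used in Example 10 can be checked in Python. The sketch below is illustrative only: the function name is hypothetical, and `pstdev` from the standard `statistics` module is used as an independent check because it also divides by N (population standard deviation), as these notes do.

```python
from math import sqrt
from statistics import pstdev


def sd_assumed_mean(values, assumed_mean):
    """Standard deviation via sigma = sqrt(sum(d^2)/N - (sum(d)/N)^2)."""
    n = len(values)
    d = [x - assumed_mean for x in values]   # deviations from A
    return sqrt(sum(v * v for v in d) / n - (sum(d) / n) ** 2)


marks = [43, 48, 65, 57, 31, 60, 37, 48, 78, 59]  # Example 10
sigma = sd_assumed_mean(marks, 57)                # A = 57 as in the notes
print(round(sigma, 2))

# The result does not depend on the choice of assumed mean.
sigma_other = sd_assumed_mean(marks, 50)
```

Because the correction term $(\sum d / N)^2$ removes the effect of the chosen A, any convenient value may be assumed; picking one close to the true mean simply keeps the arithmetic small.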
  • 44. (b) Assumed Mean Method
    Here the deviations are taken not from the actual mean but from an assumed mean. This method is also used if the given variable values are not at equal intervals.
    Steps:
    1. Assume any one of the items in the series as an assumed mean and denote it by A.
    2. Find the deviations from the assumed mean, i.e., X − A, and denote them by d.
    3. Multiply these deviations by the respective frequencies and get Σfd.
    4. Square the deviations (d²).
    5. Multiply the squared deviations (d²) by the respective frequencies (f) and get Σfd².
    6. Substitute the values in the following formula:
    $\sigma = \sqrt{\frac{\sum fd^2}{\sum f} - \left(\frac{\sum fd}{\sum f}\right)^2}$
    Where: d = X − A, N = Σf
    Example 11:
    Calculate the standard deviation from the following data.
        X   20   22   25   31   35   40   42   45
        f    5   12   15   20   25   14   10    6
    Solution: Deviations from the assumed mean
        X       f     d = X − A (A = 31)   d²    fd     fd²
        20      5     −11                  121   −55    605
        22      12     −9                   81   −108   972
        25      15     −6                   36   −90    540
        31      20      0                    0     0      0
        35      25      4                   16   100    400
        40      14      9                   81   126   1134
        42      10     11                  121   110   1210
        45      6      14                  196    84   1176
        Total   N = 107                          Σfd = 167   Σfd² = 6037
  • 45. $\sigma = \sqrt{\frac{\sum fd^2}{\sum f} - \left(\frac{\sum fd}{\sum f}\right)^2} = \sqrt{\frac{6037}{107} - \left(\frac{167}{107}\right)^2} = 7.35$
    (c) Step-Deviation Method
    If the variable values are at equal intervals, we adopt this method.
    Steps:
    1. Assume the centre value of the series as the assumed mean A.
    2. Find $d' = \frac{X - A}{C}$, where C is the interval between successive values.
    3. Multiply these deviations d′ by the respective frequencies and get Σfd′.
    4. Square the deviations and get d′².
    5. Multiply the squared deviations (d′²) by the respective frequencies (f) and obtain the total Σfd′².
    6. Substitute the values in the following formula to get the standard deviation.
    $\sigma = \sqrt{\frac{\sum fd'^2}{\sum f} - \left(\frac{\sum fd'}{\sum f}\right)^2} \times C$
    Example 12
    Compute the standard deviation from the following data.
        Marks             10   20   30   40   50   60
        No. of students    8   12   20   10    7    3
    Solution:
        Marks (X)   No. of students (f)   d′ = (X − 30)/10   d′²   fd′   fd′²
        10           8                    −2                 4     −16   32
        20          12                    −1                 1     −12   12
        30          20                     0                 0       0    0
        40          10                     1                 1      10   10
        50           7                     2                 4      14   28
        60           3                     3                 9       9   27
                    N = 60                                        Σfd′ = 5   Σfd′² = 109
  • 46. $\sigma = \sqrt{\frac{\sum fd'^2}{\sum f} - \left(\frac{\sum fd'}{\sum f}\right)^2} \times C = \sqrt{\frac{109}{60} - \left(\frac{5}{60}\right)^2} \times 10 = 13.45$
    4. Calculation of Standard Deviation for Continuous Series
    In a continuous series the method of calculating the standard deviation is almost the same as in a discrete series, but the mid-values of the class intervals have to be found first. The step-deviation method is widely used. The formula is
    $\sigma = \sqrt{\frac{\sum fd'^2}{N} - \left(\frac{\sum fd'}{N}\right)^2} \times C$
    Where $d' = \frac{m - A}{C}$; C = class interval
    Steps:
    1. Find the mid-value of each class.
    2. Assume the centre value as the assumed mean and denote it by A.
    3. Find $d' = \frac{m - A}{C}$.
    4. Multiply the deviations d′ by the respective frequencies and get Σfd′.
    5. Square the deviations and get d′².
    6. Multiply the squared deviations (d′²) by the respective frequencies and get Σfd′².
    7. Substitute the values in the following formula to get the standard deviation.
    $\sigma = \sqrt{\frac{\sum fd'^2}{N} - \left(\frac{\sum fd'}{N}\right)^2} \times C$
    Example 13:
    The daily temperature recorded in a city in Russia in a year is given below.
        Temperature (°C)   No. of days
        −40 to −30         10
        −30 to −20         18
        −20 to −10         30
        −10 to 0           42
  • 47.     0 to 10            65
        10 to 20           180
        20 to 30           20
    Required: Calculate the standard deviation.
    Solution:
        Temperature (X)   Mid-point (m)   No. of days (f)   d′ = (m − (−5))/10   d′²   fd′    fd′²
        −40 to −30        −35             10                −3                   9     −30     90
        −30 to −20        −25             18                −2                   4     −36     72
        −20 to −10        −15             30                −1                   1     −30     30
        −10 to 0           −5             42                 0                   0       0      0
        0 to 10             5             65                 1                   1      65     65
        10 to 20           15            180                 2                   4     360    720
        20 to 30           25             20                 3                   9      60    180
                                        N = 365                                 Σfd′ = 389   Σfd′² = 1157
    $\sigma = \sqrt{\frac{\sum fd'^2}{N} - \left(\frac{\sum fd'}{N}\right)^2} \times C = \sqrt{\frac{1157}{365} - \left(\frac{389}{365}\right)^2} \times 10 = 14.26\,°C$
    Merits of Standard Deviation
    1. It is rigidly defined, its value is always definite, it is based on all the observations and the actual signs of the deviations are used.
    2. As it is based on the arithmetic mean, it has all the merits of the arithmetic mean.
    3. It is the most important and widely used measure of dispersion.
    4. It is amenable to further algebraic treatment.
    5. It is less affected by the fluctuations of sampling and hence stable.
    6. It is the basis for measuring the coefficient of correlation and for sampling.
    Demerits of Standard Deviation
    1. It is not easy to understand and it is difficult to calculate.
    2. It gives more weight to extreme values because the values are squared.
  • 48. 3. As it is an absolute measure of variability, it cannot be used for the purpose of comparison.
    Coefficient of Variation
    The standard deviation is an absolute measure of dispersion. It is expressed in the units in which the original figures are collected and stated. The standard deviation of the heights of students cannot be compared with the standard deviation of their weights, as the two are expressed in different units, i.e., heights in centimetres and weights in kilograms. Therefore the standard deviation must be converted into a relative measure of dispersion for the purpose of comparison. This relative measure is known as the coefficient of variation. It is obtained by dividing the standard deviation by the mean and multiplying by 100. Symbolically,
    $\text{Coefficient of Variation (C.V.)} = \frac{\sigma}{\bar{X}} \times 100$
    If we want to compare the variability of two or more series, we can use the C.V. The series or group of data for which the C.V. is greater is more variable, less stable, less uniform, less consistent or less homogeneous. If the C.V. is smaller, the group is less variable, more stable, more uniform, more consistent or more homogeneous.
    Example 15
    In two factories A and B located in the same industrial area, the average weekly wages (in Kshs.) and the standard deviations are as follows:
        Factory   Average   Standard Deviation   No. of workers
        A         34.5      5                    476
        B         28.5      4.5                  524
    Required:
    (a) Which factory, A or B, pays out a larger amount as weekly wages?
    (b) Which factory, A or B, has greater variability in individual wages?
    Solution:
    (a) Total wages paid by factory A = 34.5 × 476 = Kshs. 16,422
    Total wages paid by factory B = 28.5 × 524 = Kshs. 14,934
  • 49. Therefore factory A pays out the larger amount as weekly wages.
    (b) The C.V. of the distribution of weekly wages of factories A and B are
    $CV(A) = \frac{\sigma}{\bar{X}} \times 100 = \frac{5}{34.5} \times 100 = 14.49\%$
    $CV(B) = \frac{\sigma}{\bar{X}} \times 100 = \frac{4.5}{28.5} \times 100 = 15.79\%$
    Factory B has greater variability in individual wages, since the C.V. of factory B is greater than the C.V. of factory A.
    Example 16
    Prices of a particular commodity in five years in two cities are given below:
        Price in City A   Price in City B
        20                10
        22                20
        19                18
        23                12
        16                15
    Which city has more stable prices?
    Solution: Actual mean method
        City A                                City B
        Prices (X)   dx = X − 20   dx²        Prices (Y)   dy = Y − 15   dy²
        20            0             0         10           −5            25
        22            2             4         20            5            25
        19           −1             1         18            3             9
        23            3             9         12           −3             9
        16           −4            16         15            0             0
        ΣX = 100     Σdx = 0      Σdx² = 30   ΣY = 75      Σdy = 0      Σdy² = 68
    City A: $\bar{X} = \frac{\sum X}{n} = \frac{100}{5} = 20$
  • 50. $\sigma_A = \sqrt{\frac{\sum dx^2}{n}} = \sqrt{\frac{30}{5}} = 2.45$
    $CV(A) = \frac{\sigma_A}{\bar{X}} \times 100 = \frac{2.45}{20} \times 100 = 12.25\%$
    City B: $\bar{Y} = \frac{\sum Y}{n} = \frac{75}{5} = 15$
    $\sigma_B = \sqrt{\frac{\sum dy^2}{n}} = \sqrt{\frac{68}{5}} = 3.69$
    $CV(B) = \frac{\sigma_B}{\bar{Y}} \times 100 = \frac{3.69}{15} \times 100 = 24.6\%$
    City A had more stable prices than City B, because the coefficient of variation is smaller in City A.
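The comparison of the two cities can be reproduced with a few lines of Python. This is a sketch under the same convention as the notes (standard deviation with divisor n, which is what `statistics.pstdev` computes); the function name is an illustrative choice.

```python
from statistics import mean, pstdev


def coefficient_of_variation(values):
    """C.V. = (standard deviation / mean) * 100, using the divisor n."""
    return pstdev(values) / mean(values) * 100


city_a = [20, 22, 19, 23, 16]   # prices in City A
city_b = [10, 20, 18, 12, 15]   # prices in City B

cv_a = coefficient_of_variation(city_a)
cv_b = coefficient_of_variation(city_b)
more_stable = "City A" if cv_a < cv_b else "City B"
print(round(cv_a, 2), round(cv_b, 2), more_stable)
```

The unrounded values differ slightly in the second decimal place from the hand calculation, because the notes round each standard deviation to two decimals before dividing by the mean; the conclusion is unaffected.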
  • 51. LESSON THREE: OVERVIEW OF HYPOTHESIS TESTING
    3.0 Introduction
    This lesson gives an overview of the concepts in hypothesis testing. It describes the procedure for testing a hypothesis and differentiates between one-tailed and two-tailed tests and between Type I and Type II errors. Examples of testing a hypothesis about a single population mean, both when the population variance is known and when it is not given, are discussed.
    3.1 Lesson Objectives
    By the end of the lesson, the students should be able to:
     Define the term hypothesis
     Differentiate between one-tailed and two-tailed tests
     Describe the procedure for testing a hypothesis
     Test a hypothesis about the mean when the population variance is known
     Test a hypothesis about the mean when the population variance is unknown
    3.2 Definition of Hypothesis Testing
    Hypothesis: A statement about a population parameter developed for the purpose of testing.
    Hypothesis testing: A procedure based on sample evidence and probability theory to determine whether the hypothesis is a reasonable statement.
    3.3 Procedure for Testing a Hypothesis
    The following steps are followed when testing a hypothesis:
    1. State the null and alternate hypotheses.
    2. Select a level of significance.
    3. Identify the test statistic.
    4. Formulate a decision rule and identify the rejection region.
    5. Compute the value of the test statistic.
    6. Make a conclusion.
  • 52. State the null hypothesis ($H_0$) and alternate hypothesis ($H_A$)
- The null hypothesis is a statement about the value of a population parameter. It should be stated as "There is no significant difference between ...". It should always contain an equal sign.
- The alternate hypothesis is a statement that is accepted if the sample data provide enough evidence that the null hypothesis is false.
Select a Level of Significance
The level of significance is the probability of rejecting the null hypothesis when it is true. It is designated by $\alpha$ and lies between 0 and 1.
Types of errors that can be committed
i. Type I error: rejecting the null hypothesis when it is true.
ii. Type II error: not rejecting the null hypothesis when it is false.
| Null hypothesis | Do not reject $H_0$ | Reject $H_0$ |
|---|---|---|
| $H_0$ is true | Correct decision | Type I error |
| $H_0$ is false | Type II error | Correct decision |
Identify the Test Statistic
A test statistic is the statistic that will be used to test the hypothesis, e.g. $Z$, $t$, $F$ and $\chi^2$ (chi-square).
Formulate a decision rule
A decision rule is a statement of the conditions under which the null hypothesis is rejected and the conditions under which it is not rejected. The region or area of rejection defines the location of all those values that are so large or so small that the probability of their occurrence under a true null hypothesis is rather remote.
Compute the value of the test statistic and make a conclusion
The value of the test statistic is determined from the sample information, and is used to determine whether to reject the null hypothesis or not.
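The steps above can be sketched in code for a two-tailed test of a mean with known population standard deviation. This is an illustrative sketch rather than part of the notes: the function name `two_tailed_z_test` is hypothetical, the statistic is the Z statistic of section 3.5, and the figures are those of the Coca-Cola example in that section.

```python
from math import sqrt
from statistics import NormalDist


def two_tailed_z_test(sample_mean, mu0, sigma, n, alpha):
    """Test H0: mu = mu0 against HA: mu != mu0 when sigma is known.

    Returns the computed statistic, the critical value and the decision.
    """
    # Step 5: value of the test statistic
    z = (sample_mean - mu0) / (sigma / sqrt(n))
    # Step 4: the rejection region is |Z| > Z_{alpha/2}
    critical = NormalDist().inv_cdf(1 - alpha / 2)
    # Step 6: conclusion
    reject = abs(z) > critical
    return z, critical, reject


# Figures from the Coca-Cola example: mu0 = 18, sigma = 3, n = 64, sample mean = 17
z, critical, reject = two_tailed_z_test(17, 18, 3, 64, alpha=0.05)
print(f"Z = {z:.2f}, critical value = {critical:.2f}, reject H0: {reject}")
```

The critical value is taken from the standard normal distribution rather than from printed tables, so it may differ from the tabulated 1.96 in later decimal places only.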
  • 53. 3.4 One-Tailed and Two-Tailed Tests
- A test is one-tailed when the alternate hypothesis states a direction, e.g.
$H_0$: The mean income of women is equal to the mean income of men
$H_A$: The mean income of women is greater than the mean income of men
- A test is two-tailed if no direction is specified in the alternate hypothesis, e.g.
$H_0$: There is no difference between the mean income of women and the mean income of men
$H_A$: There is a difference between the mean income of women and the mean income of men
3.5 Testing the Population Mean When the Population Variance is Known
When the population variance is known and the population is normally distributed, the test statistic for testing a hypothesis about $\mu$ is
$Z = \frac{\bar{x} - \mu}{\sigma / \sqrt{n}}$
The confidence interval estimator of $\mu$ when $\sigma^2$ is known is
$\bar{x} \pm Z_{\alpha/2} \frac{\sigma}{\sqrt{n}}$
Example One
A study by the Coca-Cola Company showed that the typical adult Kenyan consumes 18 gallons of Coca-Cola each year. According to the same survey, the standard deviation of the number of gallons consumed is 3.0. A random sample of 64 college students showed they consumed an average (mean) of 17 gallons of cola last year. At the 0.05 significance level, can we conclude that there is a significant difference between the mean consumption rate of college students and other adults?
Solution
1. Stating the null and alternate hypotheses
$H_0: \mu = 18$
$H_A: \mu \neq 18$
2. Level of significance: $\alpha = 0.05$
  • 54. 3. Test statistic
$Z = \frac{\bar{X} - \mu}{\sigma / \sqrt{n}}$
4. Rejection region
$Z_{\alpha/2} = Z_{0.025} = 1.96$. If $Z_c > 1.96$ or $Z_c < -1.96$, reject $H_0$.
5. Value of the test statistic
$Z_c = \frac{\bar{X} - \mu}{\sigma / \sqrt{n}} = \frac{17 - 18}{3 / \sqrt{64}} = -2.67 < -1.96$
6. Conclusion
Reject $H_0$. Yes, there is a significant difference between the mean consumption rate of college students and other adults.
Example Two
Past experience indicates that the monthly long distance telephone bill per household in a particular community is normally distributed, with a mean of Sh. 1012 and a standard deviation of Sh. 327. After an advertising campaign that encouraged people to make long distance telephone calls more frequently, a random sample of 57 households revealed that the mean monthly long distance bill was Sh. 1098. Can we conclude at the 10% significance level that the advertising campaign was successful?
Solution
1. Stating the null and alternate hypotheses
$H_0: \mu = 1012$
$H_A: \mu > 1012$
2. Level of significance: $\alpha = 0.1$
3. Test statistic
$Z = \frac{\bar{X} - \mu}{\sigma / \sqrt{n}}$
  • 55. 4. Rejection region
$Z_{\alpha} = Z_{0.1} = 1.28$. If $Z_c > 1.28$, reject $H_0$.
5. Value of the test statistic
$Z_c = \frac{\bar{X} - \mu}{\sigma / \sqrt{n}} = \frac{1098 - 1012}{327 / \sqrt{57}} = 1.99 > 1.28$
6. Conclusion
Reject $H_0$. Yes, there is sufficient evidence to conclude that the advertising campaign was successful.
3.6 Testing the Population Mean when the Population Variance is Unknown
When the population variance is unknown and the population is normally distributed, the test statistic for testing a hypothesis about $\mu$ is
$t = \frac{\bar{x} - \mu}{s / \sqrt{n}}$
which has a Student t distribution with $n - 1$ degrees of freedom.
We now have two different test statistics for testing the population mean. The choice of which one to use depends on whether or not the population variance is known.
- If the population variance is known, the test statistic is $Z = \frac{\bar{x} - \mu}{\sigma / \sqrt{n}}$
- If the population variance is unknown, the test statistic is $t = \frac{\bar{x} - \mu}{s / \sqrt{n}}$, with d.f. $= n - 1$
The confidence interval estimator of $\mu$ when $\sigma^2$ is unknown is
$\bar{x} \pm t_{\alpha/2} \frac{s}{\sqrt{n}}$, with d.f. $= n - 1$
Example One
A manufacturer of automobile seats has a production line that produces an average of 100 seats per day. Because of new government regulations, a new safety device has been installed, which the manufacturer believes will reduce average daily output. A random sample of 15 days' output after the installation of the safety device is shown below:
  • 56. 93, 103, 95, 101, 91, 105, 96, 94, 101, 88, 98, 94, 101, 92, 95
Assuming that the daily output is normally distributed, is there sufficient evidence at the 5% significance level to conclude that average daily output has decreased following the installation of the safety device?
Solution
1. Stating the null and alternate hypotheses
$H_0: \mu = 100$
$H_A: \mu < 100$
2. Level of significance: $\alpha = 0.05$
3. Test statistic
$t = \frac{\bar{X} - \mu}{s / \sqrt{n}}$
4. Rejection region
$-t_{\alpha, n-1} = -t_{0.05, 14} = -1.761$. If $t_c < -1.761$, reject $H_0$.
5. Value of the test statistic
$\sum X = 1447$, $\sum X^2 = 139917$
$\bar{X} = \frac{\sum X}{n} = \frac{1447}{15} = 96.47$
$s = \sqrt{\frac{\sum X^2 - \frac{(\sum X)^2}{n}}{n - 1}} = \sqrt{\frac{139917 - \frac{1447^2}{15}}{14}} = 4.85$
$t_c = \frac{\bar{X} - \mu}{s / \sqrt{n}} = \frac{96.47 - 100}{4.85 / \sqrt{15}} = -2.82 < -1.761$
6. Conclusion
  • 57. Reject $H_0$. Yes, there is sufficient evidence to conclude that average daily output has decreased following the installation of the safety device.
Example Two
A courier service advertises that its average delivery time is less than six hours for local deliveries. A random sample of the amount of time this courier takes to deliver packages to an address across town produced the following times (rounded to the nearest hour):
7, 3, 4, 6, 10, 5, 6, 4, 3, 8
Is there sufficient evidence to support the courier's advertisement at the 5% level of significance?
Solution
1. Stating the null and alternate hypotheses
$H_0: \mu = 6$
$H_A: \mu < 6$
2. Level of significance: $\alpha = 0.05$
3. Test statistic
$t = \frac{\bar{X} - \mu}{s / \sqrt{n}}$
4. Rejection region
$-t_{\alpha, n-1} = -t_{0.05, 9} = -1.833$. If $t_c < -1.833$, reject $H_0$.
5. Value of the test statistic
  • 58. $\sum X = 56$, $\sum X^2 = 360$
$\bar{X} = \frac{\sum X}{n} = \frac{56}{10} = 5.6$
$s = \sqrt{\frac{\sum X^2 - \frac{(\sum X)^2}{n}}{n - 1}} = \sqrt{\frac{360 - \frac{56^2}{10}}{9}} = 2.27$
$t_c = \frac{\bar{X} - \mu}{s / \sqrt{n}} = \frac{5.6 - 6}{2.27 / \sqrt{10}} = -0.56 > -1.833$
6. Conclusion
Do not reject $H_0$. No, there is not sufficient evidence to support the courier's claim that its average delivery time is less than six hours.
3.7 Chi-Square Test
A chi-squared test is any statistical hypothesis test in which the sampling distribution of the test statistic is a chi-squared distribution when the null hypothesis is true. Also considered a chi-squared test is a test in which this is asymptotically true, meaning that the sampling distribution (if the null hypothesis is true) can be made to approximate a chi-squared distribution as closely as desired by making the sample size large enough.
The chi-square test is used to determine whether there is a significant difference between the expected frequencies and the observed frequencies in one or more categories.
3.7.1 Chi-Square Test of a Multinomial Experiment (Goodness-Of-Fit Test)
A multinomial experiment is a generalized version of a binomial experiment that allows for more than two possible outcomes on each trial of the experiment. The following are the properties of a multinomial experiment:
- The experiment consists of a fixed number $n$ of trials.
- The outcome of each trial can be classified into exactly one of $k$ categories called cells.
- The probability $P_i$ that the outcome of a trial will fall into cell $i$ remains constant for each trial, for $i = 1, 2, 3, \ldots, k$; moreover, $P_1 + P_2 + \cdots + P_k = 1$.
  • 59. - Each trial of the experiment is independent of the other trials.
Test Statistic
$\chi^2 = \sum_{i=1}^{k} \frac{(o_i - e_i)^2}{e_i}$
Rejection Region
$\chi^2 > \chi^2_{\alpha, k-1}$
Example One
Two companies A and B have recently conducted aggressive advertising campaigns in order to maintain and possibly increase their respective shares of the market for a particular product. These two companies enjoy a dominant position in the market. Before the advertising campaigns began, the market share for Company A was 45% while Company B had a market share of 40%. Other competitors accounted for the remaining market share of 15%. To determine whether these market shares changed after the advertising campaigns, a marketing analyst solicited the preferences of a random sample of 200 consumers of this product. Of the 200 consumers, 100 indicated a preference for Company A's product, 85 preferred Company B's product and the remainder preferred one or another of the products distributed by other competitors. Conduct a test to determine, at the 5% level of significance, whether the market shares have changed from the levels they were at before the advertising campaigns occurred.
Solution
1. Stating the null and alternate hypotheses
$H_0$: $P_1 = 0.45$, $P_2 = 0.4$, $P_3 = 0.15$
$H_A$: At least one of the $P_i$ is not equal to its specified value.
2. Level of significance: $\alpha = 0.05$
3. Test statistic: $\chi^2 = \sum_{i=1}^{k} \frac{(o_i - e_i)^2}{e_i}$
4. Rejection region: $\chi^2 > \chi^2_{\alpha, k-1} = \chi^2_{0.05, 2} = 5.99147$
  • 60. 5. Value of the test statistic: assuming that the null hypothesis is correct, we can calculate the expected number of consumers who prefer A, B and others using the formula $e_i = n p_i$.
| Company | Observed frequency $o_i$ | Expected frequency $e_i$ | $(o_i - e_i)^2$ | $\frac{(o_i - e_i)^2}{e_i}$ |
|---|---|---|---|---|
| A | 100 | 90 | 100 | 1.11 |
| B | 85 | 80 | 25 | 0.31 |
| Others | 15 | 30 | 225 | 7.50 |
| Total | 200 | 200 | | 8.92 |
Therefore $\chi^2 = \sum_{i=1}^{k} \frac{(o_i - e_i)^2}{e_i} = 8.92$
6. Conclusion: Reject $H_0$. There is sufficient evidence at the 5% level of significance to allow us to conclude that the market shares have changed from the levels they were at before the advertising campaigns occurred.
Example Two
To determine if a single die is balanced, or fair, the die was rolled 600 times. The observed frequencies with which each of the six sides of the die turned up are recorded in the following table:
| Face | 1 | 2 | 3 | 4 | 5 | 6 |
|---|---|---|---|---|---|---|
| Observed frequency | 114 | 92 | 84 | 101 | 107 | 102 |
Is there sufficient evidence to conclude, at the 5% level of significance, that the die is not fair?
Solution
1. Stating the null and alternate hypotheses
$H_0: p_1 = p_2 = p_3 = p_4 = p_5 = p_6 = \frac{1}{6}$
$H_A$: At least one of the $P_i$ is not equal to its specified value.
2. Level of significance: $\alpha = 0.05$
  • 61. 3. Test statistic: $\chi^2 = \sum_{i=1}^{k} \frac{(o_i - e_i)^2}{e_i}$
4. Decision rule: $\chi^2_{\alpha, k-1} = \chi^2_{0.05, 5} = 11.0705$. If $\chi^2 > 11.0705$, reject $H_0$.
5. Value of the test statistic: assuming that the null hypothesis is correct, we can calculate the expected frequency of each face using the formula $e_i = n p_i$.
| Face | Observed frequency $o_i$ | Expected frequency $e_i$ | $\frac{(o_i - e_i)^2}{e_i}$ |
|---|---|---|---|
| 1 | 114 | 100 | 1.96 |
| 2 | 92 | 100 | 0.64 |
| 3 | 84 | 100 | 2.56 |
| 4 | 101 | 100 | 0.01 |
| 5 | 107 | 100 | 0.49 |
| 6 | 102 | 100 | 0.04 |
| Total | 600 | 600 | 5.70 |
Therefore $\chi^2 = \sum_{i=1}^{k} \frac{(o_i - e_i)^2}{e_i} = 5.7 < 11.0705$
6. Conclusion: Do not reject $H_0$. There is not sufficient evidence at the 5% level of significance to allow us to conclude that the die is not fair.
Rule of Five
For the discrete distribution of the test statistic $\chi^2$ to be adequately approximated by the continuous chi-square distribution, the conventional rule is to require that the expected frequency for each cell be at least 5. Where necessary, cells should be combined in order to satisfy this condition. The choice of cells to be combined should be made in such a way that meaningful categories result from the combination.
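The goodness-of-fit calculation for the die example, together with the rule-of-five check, can be sketched as follows. The function name `chi_square_gof` is made up for illustration, and the critical value 11.0705 is simply copied from the worked solution rather than computed.

```python
def chi_square_gof(observed, probabilities):
    """Return the chi-square statistic and the expected frequencies
    e_i = n * p_i for a multinomial goodness-of-fit test."""
    n = sum(observed)
    expected = [n * p for p in probabilities]
    statistic = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
    return statistic, expected


# Die example: 600 rolls, H0 says each face has probability 1/6
observed = [114, 92, 84, 101, 107, 102]
statistic, expected = chi_square_gof(observed, [1 / 6] * 6)

# Rule of five: every expected frequency should be at least 5
rule_of_five_ok = all(e >= 5 for e in expected)

critical_value = 11.0705  # chi-square, alpha = 0.05, 5 d.f. (from the notes)
reject_h0 = statistic > critical_value

print(f"chi-square = {statistic:.2f}, rule of five satisfied: {rule_of_five_ok}")
print("Reject H0" if reject_h0 else "Do not reject H0")
```

Passing the hypothesised probabilities as an argument lets the same helper be reused for the market-share example, where the probabilities are 0.45, 0.40 and 0.15.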
  • 62. 3.7.2 Chi-Square Test of a Contingency Table
A contingency table is a rectangular table in which items from a population are classified according to two characteristics. The objective is to analyze the relationship between two qualitative variables, i.e. to investigate whether a dependence relationship exists between the two variables or whether the variables are statistically independent.
The number of degrees of freedom for a contingency table with $r$ rows and $c$ columns is d.f. $= (r - 1)(c - 1)$.
Example One
A sample of employees at a large chemical plant was asked to indicate a preference for one of three pension plans. The results are given in the following table:
| Job class | Plan A | Plan B | Plan C |
|---|---|---|---|
| Supervisor | 10 | 13 | 29 |
| Clerical | 19 | 80 | 19 |
| Laborer | 81 | 57 | 22 |
At the 1% significance level, determine whether there is a relationship between the pension plan selected and the job classification of employees.
Solution
| Job class | Plan A | Plan B | Plan C | Total |
|---|---|---|---|---|
| Supervisor | 10 | 13 | 29 | 52 |
| Clerical | 19 | 80 | 19 | 118 |
| Laborer | 81 | 57 | 22 | 160 |
| Total | 110 | 150 | 70 | 330 |
We need to conduct a chi-square test of the contingency table to determine whether the classifications are statistically independent.
$H_0$: The two classifications are independent
$H_A$: The two classifications are dependent
  • 63. Test statistic: $\chi^2 = \sum_{i=1}^{k} \frac{(o_i - e_i)^2}{e_i}$
Rejection region: $\chi^2 > \chi^2_{\alpha, (r-1)(c-1)} = \chi^2_{0.01, 4} = 13.2767$
The value of the test statistic
To compute the expected value for each cell, multiply the row total by the column total and divide by the total number of employees sampled.
| Cell $i$ | Observed frequency $o$ | Expected frequency $e$ | $\frac{(o - e)^2}{e}$ |
|---|---|---|---|
| 1 | 10 | 17.33 | 3.1003 |
| 2 | 13 | 23.64 | 4.7889 |
| 3 | 29 | 11.03 | 29.2766 |
| 4 | 19 | 39.33 | 10.5087 |
| 5 | 80 | 53.64 | 12.9539 |
| 6 | 19 | 25.03 | 1.4527 |
| 7 | 81 | 53.33 | 14.3564 |
| 8 | 57 | 72.73 | 3.4021 |
| 9 | 22 | 33.94 | 4.2005 |
| Total | | | 84.0401 |
Value of the test statistic: $\chi^2 = \sum_{i=1}^{k} \frac{(o_i - e_i)^2}{e_i} = 84.0401$
Conclusion: Reject $H_0$. There is enough evidence at the 1% significance level to conclude that the two classifications are dependent.
Example Two
The Coca-Cola Company sells four brands of soda in East Africa. To help determine if the same marketing approach used in Kenya can be used in Uganda and Tanzania, one of the firm's marketing analysts wants to ascertain if there is an association between the brand of soda preferred and the nationality of the consumer. She first classifies the population according to the brand of
  • 64. soda preferred, i.e. Fanta, Sprite, Coke and Krest. Her second classification consists of the three nationalities: Kenyan, Tanzanian and Ugandan. The marketing analyst then interviews a random sample of 250 soda drinkers from the three countries, classifies each according to the two criteria and records the observed frequency of drinkers falling into each of the cells as shown in the table below.
| Nationality | Coke | Krest | Sprite | Fanta | Total |
|---|---|---|---|---|---|
| Kenyan | 72 | 8 | 12 | 23 | 115 |
| Ugandan | 26 | 10 | 16 | 33 | 85 |
| Tanzanian | 7 | 10 | 14 | 19 | 50 |
| Total | 105 | 28 | 42 | 75 | 250 |
Based on the above sample data, can we conclude at the 1% level of significance that there is a relationship between the preference of the soda drinkers and their nationality?
Solution
We need to conduct a chi-square test of the contingency table to determine whether the classifications are statistically independent.
$H_0$: The two classifications are independent
$H_A$: The two classifications are dependent
Test statistic: $\chi^2 = \sum_{i=1}^{k} \frac{(o_i - e_i)^2}{e_i}$
Rejection region: $\chi^2 > \chi^2_{\alpha, (r-1)(c-1)} = \chi^2_{0.01, 6} = 16.8119$
The value of the test statistic
To compute the expected value for each cell, multiply the row total by the column total and divide by the total number of respondents sampled.
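The rule "row total times column total divided by the grand total" can be sketched in code for the soda table. The function names `expected_frequencies` and `chi_square_independence` are hypothetical helpers, not library calls, and the table is entered with one row per nationality and columns in the order Coke, Krest, Sprite, Fanta.

```python
def expected_frequencies(table):
    """Expected cell counts under independence:
    (row total * column total) / grand total."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    grand_total = sum(row_totals)
    return [[r * c / grand_total for c in col_totals] for r in row_totals]


def chi_square_independence(table):
    """Return the chi-square statistic and its degrees of freedom (r-1)(c-1)."""
    expected = expected_frequencies(table)
    statistic = sum(
        (o - e) ** 2 / e
        for obs_row, exp_row in zip(table, expected)
        for o, e in zip(obs_row, exp_row)
    )
    dof = (len(table) - 1) * (len(table[0]) - 1)
    return statistic, dof


# Rows: Kenyan, Ugandan, Tanzanian; columns: Coke, Krest, Sprite, Fanta
soda = [
    [72, 8, 12, 23],
    [26, 10, 16, 33],
    [7, 10, 14, 19],
]

expected = expected_frequencies(soda)
statistic, dof = chi_square_independence(soda)
print(f"chi-square = {statistic:.2f} with {dof} degrees of freedom")
```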
  • 65.
| Cell $i$ | Observed frequency $o$ | Expected frequency $e$ | $\frac{(o - e)^2}{e}$ |
|---|---|---|---|
| 1 | 72 | 48.30 | 11.63 |
| 2 | 26 | 35.70 | 2.64 |
| 3 | 7 | 21.00 | 9.33 |
| 4 | 8 | 12.88 | 1.85 |
| 5 | 10 | 9.52 | 0.02 |
| 6 | 10 | 5.60 | 3.46 |
| 7 | 12 | 19.32 | 2.77 |
| 8 | 16 | 14.28 | 0.21 |
| 9 | 14 | 8.40 | 3.73 |
| 10 | 23 | 34.50 | 3.83 |
| 11 | 33 | 25.50 | 2.21 |
| 12 | 19 | 15.00 | 1.07 |
Value of the test statistic: $\chi^2 = \sum_{i=1}^{k} \frac{(o_i - e_i)^2}{e_i} = 42.75$
Conclusion: Reject $H_0$. Based on the sample data, we can conclude at the 1% significance level that there is a relationship between the preferences of soda drinkers and their nationality.
3.7.3 Chi-Square Test for Normality
The chi-square goodness-of-fit test for a normal distribution proceeds in essentially the same way as the chi-square test for a multinomial population. The multinomial test dealt with a single population of qualitative data, whereas a normal distribution involves quantitative data. Therefore, we must begin by subdividing the range of the normal distribution into a set of intervals or categories in order to obtain qualitative data.
Example One
A battery manufacturer wants to determine if the lifetimes of his batteries are normally distributed. Such information would be helpful in establishing the guarantee that should be offered. The lifetimes of a sample of 200 batteries are measured and the resulting data are grouped into a
  • 66. frequency distribution as shown in the table below. The mean and the standard deviation of the sample lifetimes are 164 and 10 respectively. Is there evidence at the 5% level of significance that the lifetimes of his batteries are normally distributed?
| Lifetime in hours | Number of batteries |
|---|---|
| 140 up to 150 | 15 |
| 150 up to 160 | 54 |
| 160 up to 170 | 78 |
| 170 up to 180 | 42 |
| 180 up to 190 | 11 |
| Total | 200 |
Solution
1. Stating the null and alternate hypotheses
$H_0$: The data are normally distributed
$H_A$: The data are not normally distributed
2. Level of significance: $\alpha = 0.05$
3. Test statistic: $\chi^2 = \sum_{i=1}^{k} \frac{(o_i - e_i)^2}{e_i}$, with d.f. $= k - 3$
4. Decision rule: $\chi^2_{\alpha, k-3} = \chi^2_{0.05, 2} = 5.9915$. If $\chi^2 > 5.9915$, reject $H_0$.
5. Value of the test statistic: with $\bar{X} = 164$ and $\sigma = 10$, the class boundaries are standardized as
$Z = \frac{140 - 164}{10} = -2.4$, $Z = \frac{150 - 164}{10} = -1.4$, $Z = \frac{160 - 164}{10} = -0.4$
$Z = \frac{170 - 164}{10} = 0.6$, $Z = \frac{180 - 164}{10} = 1.6$, $Z = \frac{190 - 164}{10} = 2.6$
  • 67.
| Lifetime | Probability | Observed frequency | Expected frequency | $\frac{(o_i - e_i)^2}{e_i}$ |
|---|---|---|---|---|
| Less than 150 | 0.0808 | 15 | 16.16 | 0.0833 |
| 150 up to 160 | 0.2638 | 54 | 52.76 | 0.0291 |
| 160 up to 170 | 0.3811 | 78 | 76.22 | 0.0416 |
| 170 up to 180 | 0.2195 | 42 | 43.90 | 0.0822 |
| 180 or more | 0.0548 | 11 | 10.96 | 0.0001 |
| Total | | 200 | 200 | 0.2363 |
Therefore $\chi^2 = \sum_{i=1}^{k} \frac{(o_i - e_i)^2}{e_i} = 0.2363 < 5.9915$
6. Conclusion: Do not reject $H_0$. There is not sufficient evidence at the 5% level of significance to allow us to conclude that the lifetimes of his batteries are not normally distributed.
Example Two
The instructors for an introductory accounting course attempt to construct the final examination so that the grades are normally distributed with a mean of 65.
| Grade | Frequency |
|---|---|
| 30 up to 40 | 4 |
| 40 up to 50 | 17 |
| 50 up to 60 | 29 |
| 60 up to 70 | 49 |
| 70 up to 80 | 33 |
| 80 up to 90 | 18 |
From the sample of grades appearing in the accompanying frequency distribution table, can you conclude that they have achieved their objective? (Use $\alpha = 0.05$)
  • 68. Solution
1. Stating the null and alternate hypotheses
$H_0$: The data are normally distributed
$H_A$: The data are not normally distributed
2. Level of significance: $\alpha = 0.05$
3. Test statistic: $\chi^2 = \sum_{i=1}^{k} \frac{(o_i - e_i)^2}{e_i}$, with d.f. $= k - 3$
4. Decision rule: $\chi^2_{\alpha, k-3} = \chi^2_{0.05, 3} = 7.81473$. If $\chi^2 > 7.81473$, reject $H_0$.
5. Value of the test statistic:
| Grade | $x$ | $f$ | $xf$ | $D_x = x - 65$ | $D_x' = D_x / 10$ | $D_x'^2$ | $f D_x'$ | $f D_x'^2$ |
|---|---|---|---|---|---|---|---|---|
| 30 up to 40 | 35 | 4 | 140 | -30 | -3 | 9 | -12 | 36 |
| 40 up to 50 | 45 | 17 | 765 | -20 | -2 | 4 | -34 | 68 |
| 50 up to 60 | 55 | 29 | 1595 | -10 | -1 | 1 | -29 | 29 |
| 60 up to 70 | 65 | 49 | 3185 | 0 | 0 | 0 | 0 | 0 |
| 70 up to 80 | 75 | 33 | 2475 | 10 | 1 | 1 | 33 | 33 |
| 80 up to 90 | 85 | 18 | 1530 | 20 | 2 | 4 | 36 | 72 |
| Total | | 150 | 9690 | | | | -6 | 238 |
$\bar{x} = \frac{\sum xf}{n} = \frac{9690}{150} = 64.6$
$\sigma = 10 \times \sqrt{\frac{238}{150} - \left(\frac{-6}{150}\right)^2} = 12.6$
$\bar{X} = 64.6$, $\sigma = 12.6$
  • 69. $Z = \frac{40 - 64.6}{12.6} = -1.95$, $Z = \frac{50 - 64.6}{12.6} = -1.16$, $Z = \frac{60 - 64.6}{12.6} = -0.37$
$Z = \frac{70 - 64.6}{12.6} = 0.43$, $Z = \frac{80 - 64.6}{12.6} = 1.22$
| Grade | Probability | Observed frequency | Expected frequency | $\frac{(o_i - e_i)^2}{e_i}$ |
|---|---|---|---|---|
| Less than 40 | 0.0256 | 4 | 3.84 | 0.0067 |
| 40 up to 50 | 0.0974 | 17 | 14.61 | 0.3910 |
| 50 up to 60 | 0.229 | 29 | 34.35 | 0.8333 |
| 60 up to 70 | 0.3144 | 49 | 47.16 | 0.0718 |
| 70 up to 80 | 0.2224 | 33 | 33.36 | 0.0039 |
| 80 or more | 0.1112 | 18 | 16.68 | 0.1045 |
| Total | | 150 | 150 | 1.4112 |
Therefore $\chi^2 = \sum_{i=1}^{k} \frac{(o_i - e_i)^2}{e_i} = 1.4112 < 7.81473$
6. Conclusion: Do not reject $H_0$. There is not sufficient evidence that the grades depart from a normal distribution, so we can conclude that the instructors have achieved their objective.
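The normality test of section 3.7.3 can be sketched in code using the battery data of Example One. The helper name `normal_cell_probabilities` is made up for illustration. Cell probabilities come from the exact normal distribution function rather than from four-figure tables, so the statistic differs slightly from the hand-computed 0.2363; the critical value 5.9915 is copied from the worked solution.

```python
from statistics import NormalDist


def normal_cell_probabilities(mean, sd, cut_points):
    """Probability of each cell when the real line is split at cut_points.

    The first cell is open below and the last cell is open above,
    matching the 'less than' and 'or more' rows of the worked table.
    """
    dist = NormalDist(mean, sd)
    cdf = [0.0] + [dist.cdf(c) for c in cut_points] + [1.0]
    return [upper - lower for lower, upper in zip(cdf, cdf[1:])]


# Battery example: sample mean 164, sample standard deviation 10
observed = [15, 54, 78, 42, 11]
probabilities = normal_cell_probabilities(164, 10, [150, 160, 170, 180])

n = sum(observed)
expected = [n * p for p in probabilities]
statistic = sum((o - e) ** 2 / e for o, e in zip(observed, expected))

# Two parameters (mean and standard deviation) were estimated: d.f. = k - 3
dof = len(observed) - 3
critical_value = 5.9915  # chi-square, alpha = 0.05, 2 d.f. (from the notes)

print(f"chi-square = {statistic:.4f}, d.f. = {dof}")
print("Reject H0" if statistic > critical_value else "Do not reject H0")
```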
  • 70. LESSON THREE: REGRESSION ANALYSIS
3.0 Introduction
Regression involves developing a mathematical equation that analyses the relationship between the variable to be forecast (the dependent variable) and the variables that the statistician believes are related to the forecast variable (the independent variables). Regression is the estimation of unknown values or the prediction of one variable from known values of other variables.
Simple linear regression involves a relationship between two variables only. Multiple regression considers the relationship between three or more variables.
3.1 Lesson Objectives
By the end of the lesson, the students should be able to:
i. Formulate a simple regression model
ii. Calculate the coefficients of correlation and determination and interpret them
iii. Test hypotheses about the regression coefficients
3.2 Simple Regression
The first step in establishing the relationship between X and Y is to obtain observations on the two variables and analyze the data using a scatter diagram to indicate whether a positive or negative relationship exists between X and Y. The relationship can be approximated by a straight line. Algebraically, the relationship is
$Y_t = b_0 + b_1 X_t$
The above function is deterministic since it gives an exact relationship between X and Y. When the line is plotted, not all the points will fall on the line, for the following reasons:
- Omission of other explanatory variables from the function
- Random behavior of human beings
- Imperfect specification of the functional form of the model
- Errors of aggregation
- Errors of measurement
To account for the deviations of some points from the straight line, the error term is introduced. The introduction of the error term makes the function stochastic:
$Y_t = b_0 + b_1 X_t + e_t$
To estimate the values of the coefficients $b_0$ and $b_1$, we need observations on Y, X and the error term. However, the error term is not observable and therefore we make assumptions about the error term.
  • 71. 3.3 Assumptions of the Error Term
The following are the assumptions of the error term:
- The error term is a real random variable which has a mean of zero and constant variance (assumption of homoscedasticity)
- The error term is normally distributed
- The error terms corresponding to different values of X for different periods are not correlated (assumption of no autocorrelation)
- There is no relationship between the explanatory variables and the error term
- The explanatory variables are measured without error. The error term absorbs the influence of omitted variables and errors of measurement in the dependent variable.
All the above assumptions are called stochastic assumptions.
Other Assumptions
- The explanatory variables are not perfectly linearly related or correlated (no multicollinearity)
- The variables are correctly aggregated
- The relation being estimated is identified
- The relationship is correctly specified
The regression equation of Y on X
- It is used to predict the values of Y from the given values of X.
- It is expressed as follows: $\hat{Y} = b_0 + b_1 X$
- To determine the values of $b_0$ and $b_1$, the following two normal equations are solved simultaneously:
$\sum Y = n b_0 + b_1 \sum X$
$\sum XY = b_0 \sum X + b_1 \sum X^2$
- Alternatively, the values of $b_0$ and $b_1$ can be obtained using the following formulas:
$b_0 = \bar{Y} - b_1 \bar{X}$
$b_1 = \frac{\sum XY - n \bar{X} \bar{Y}}{\sum X^2 - n \bar{X}^2}$
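The two formulas for $b_0$ and $b_1$ can be sketched in code. The function name `least_squares` is hypothetical, and the data are the hours of study and exam scores from Example Three of section 3.5, used here only to exercise the formulas.

```python
def least_squares(x, y):
    """Return (b0, b1) for the line Y = b0 + b1*X using
    b1 = (sum XY - n*Xbar*Ybar) / (sum X^2 - n*Xbar^2) and b0 = Ybar - b1*Xbar."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sum_xy = sum(xi * yi for xi, yi in zip(x, y))
    sum_x2 = sum(xi * xi for xi in x)
    b1 = (sum_xy - n * x_bar * y_bar) / (sum_x2 - n * x_bar ** 2)
    b0 = y_bar - b1 * x_bar
    return b0, b1


# Hours of study and exam scores (Example Three)
hours = [25, 12, 18, 26, 19, 20, 23, 15, 22, 8]
scores = [93, 57, 55, 90, 82, 95, 95, 80, 85, 61]

b0, b1 = least_squares(hours, scores)
print(f"Y = {b0:.4f} + {b1:.4f} X")

# The estimates should satisfy the first normal equation: sum Y = n*b0 + b1*sum X
residual = sum(scores) - (len(hours) * b0 + b1 * sum(hours))
```

The intercept printed here can differ from a hand calculation in the second decimal place, because hand calculations usually round the slope before computing the intercept.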
  • 72. 3.4 Correlation
Definition: It is the existence of some definite relationship between two or more variables. Correlation analysis is a statistical tool used to describe the degree to which one variable is linearly related to another variable.
Types of Correlation
Correlation may be classified in the following ways:
(a) Positive and negative correlation
Correlation is said to be positive if two series move in the same direction; otherwise it is negative (opposite direction).
(b) Linear and non-linear correlation
Correlation is linear if the amount of change in one variable tends to bear a constant ratio to the amount of change in the other variable; otherwise it is non-linear.
(c) Simple, partial and multiple correlation
Simple correlation is where two variables are studied, while partial or multiple correlation involves three or more variables.
3.5 Methods of Calculating Simple Correlation
- Scatter diagram
- Karl Pearson's coefficient of correlation
- Spearman's rank correlation coefficient
- Method of least squares
Karl Pearson's coefficient of correlation (product moment coefficient of correlation)
The coefficient of correlation (r) is a measure of the strength of the linear relationship between two variables.
$r = \frac{\sum XY - n \bar{X} \bar{Y}}{\sqrt{\sum X^2 - n \bar{X}^2} \sqrt{\sum Y^2 - n \bar{Y}^2}}$
Interpretation of the coefficient of correlation
1. When r = +1, there is a perfect positive correlation between the variables
2. When r = -1, there is a perfect negative correlation between the variables
  • 73. 3. When r = 0, there is no correlation between the variables
4. The closer r is to +1 or to -1, the stronger the relationship between the variables, and the closer r is to 0, the weaker the relationship.
5. The following table lists the interpretations for various correlation coefficients:
| Value | Comment |
|---|---|
| 0.8 to 1.0 | Very strong |
| 0.6 to 0.8 | Strong |
| 0.4 to 0.6 | Moderate |
| 0.2 to 0.4 | Weak |
| 0.0 to 0.2 | Very weak |
Method of least squares
$r = \frac{SS_{xy}}{\sqrt{SS_{xx} \times SS_{yy}}}$
Coefficient of determination ($r^2$)
It is the square of the correlation coefficient. It shows the proportion of the total variation in the dependent variable Y that is explained or accounted for by the variation in the independent variable X. For example, if r = 0.9, then $r^2$ = 0.81, which means that 81% of the variation in the dependent variable has been explained by the independent variable.
Example One
A random sample of eight auto drivers insured with a company and having similar auto insurance policies was selected. The following table lists their driving experience (in years) and the monthly auto insurance premium (in Sh. '000) paid by them.
| Driving experience (years) | 5 | 2 | 12 | 9 | 15 | 6 | 25 | 16 |
|---|---|---|---|---|---|---|---|---|
| Monthly auto insurance premium (Sh. '000) | 64 | 87 | 50 | 71 | 44 | 56 | 42 | 60 |
i. Find the least squares regression line by identifying the appropriate dependent and independent variables.
ii. Interpret the meaning of the constants calculated in part (i).
iii. Compute the coefficient of correlation and the coefficient of determination and interpret them.
  • 74. Solution:
i. $\hat{y} = \hat{\beta}_0 + \hat{\beta}_1 x$, where $\hat{\beta}_1 = \frac{SS_{xy}}{SS_{xx}}$ and $\hat{\beta}_0 = \bar{y} - \hat{\beta}_1 \bar{x}$
$\sum x = 90$, $\sum x^2 = 1396$, $\sum xy = 4739$, $\sum y = 474$, $\sum y^2 = 29642$
$SS_{xx} = \sum x^2 - \frac{(\sum x)^2}{n} = 1396 - \frac{90^2}{8} = 383.5$
$SS_{xy} = \sum xy - \frac{(\sum x)(\sum y)}{n} = 4739 - \frac{90 \times 474}{8} = -593.5$
$SS_{yy} = \sum y^2 - \frac{(\sum y)^2}{n} = 29642 - \frac{474^2}{8} = 1557.5$
$\hat{\beta}_1 = \frac{SS_{xy}}{SS_{xx}} = \frac{-593.5}{383.5} = -1.55$
$\hat{\beta}_0 = \bar{y} - \hat{\beta}_1 \bar{x} = 59.25 - (-1.55 \times 11.25) = 76.69$
$\hat{y} = \hat{\beta}_0 + \hat{\beta}_1 x = 76.69 - 1.55x$
ii. $\hat{\beta}_1 = -1.55$: it indicates the rate at which the insurance premium reduces with each additional year of driving experience.
$\hat{\beta}_0 = 76.69$: it indicates the amount of premium that would be paid by a driver without any years of experience.
iii. $r = \frac{SS_{xy}}{\sqrt{SS_{xx} \times SS_{yy}}} = \frac{-593.5}{\sqrt{383.5 \times 1557.5}} = -0.77$
There is a strong negative relationship between the years of experience and the monthly auto insurance premium.
$r^2 = (-0.77)^2 = 59.29\%$
59.29% of the variation in the premium paid is explained by the driving experience.
Example Two
A company is using a system of payment by results. The union claims that this seriously discriminates against the workers. There is a fairly steep learning curve which workers follow, with the apparent outcome that more experienced workers can perform the task in about half of the time taken by a new employee. You have been asked to find out if there is any basis
  • 75. for this claim. To do this, you have observed ten workers on the shop floor, timing how long it takes them to produce an item. It was then possible for you to match these times with the length of each worker's experience. The results obtained are shown below:
| Months' experience | 2 | 5 | 3 | 8 | 5 | 9 | 12 | 16 | 1 | 6 |
|---|---|---|---|---|---|---|---|---|---|---|
| Time taken | 27 | 26 | 30 | 20 | 22 | 20 | 16 | 15 | 30 | 19 |
Required:
(a) Find the regression line of time taken on months' experience.
(b) Compute the coefficient of correlation and the coefficient of determination and interpret them.
Solution:
(a) $Y = b_0 + b_1 X$, where $b_1 = \frac{SS_{xy}}{SS_{xx}}$ and $b_0 = \bar{Y} - b_1 \bar{X}$
$\sum X = 67$, $\sum X^2 = 645$, $\sum XY = 1300$, $\sum Y = 225$, $\sum Y^2 = 5331$
$SS_{xx} = \sum X^2 - \frac{(\sum X)^2}{n} = 645 - \frac{67^2}{10} = 196.1$
$SS_{xy} = \sum XY - \frac{(\sum X)(\sum Y)}{n} = 1300 - \frac{67 \times 225}{10} = -207.5$
$SS_{yy} = \sum Y^2 - \frac{(\sum Y)^2}{n} = 5331 - \frac{225^2}{10} = 268.5$
$b_1 = \frac{SS_{xy}}{SS_{xx}} = \frac{-207.5}{196.1} = -1.0581$
$b_0 = \bar{Y} - b_1 \bar{X} = 22.5 - (-1.0581 \times 6.7) = 29.5893$
$\hat{Y} = 29.5893 - 1.0581X$
$b_1 = -1.0581$: it indicates the rate at which the time taken reduces for every additional month of experience.
$b_0 = 29.5893$: it indicates the time taken by an employee without any experience.
(b) $r = \frac{SS_{xy}}{\sqrt{SS_{xx} \times SS_{yy}}} = \frac{-207.5}{\sqrt{196.1 \times 268.5}} = -0.9043$
There is a very strong negative correlation between the months' experience and the time taken.
  • 76. $r^2 = (-0.9043)^2 = 0.8178 \times 100 = 81.78\%$
81.78% of the variation in the time taken is explained by the months' experience.
Example Three
Students in the BMS 302 class were polled by a researcher attempting to establish a relationship between hours of study in the week immediately preceding the end of semester exam and the marks received on the exam. The surveyor gathered the data listed in the accompanying table.
| Hours of study | 25 | 12 | 18 | 26 | 19 | 20 | 23 | 15 | 22 | 8 |
|---|---|---|---|---|---|---|---|---|---|---|
| Exam score | 93 | 57 | 55 | 90 | 82 | 95 | 95 | 80 | 85 | 61 |
i. Find the least squares regression line by identifying the appropriate dependent and independent variables.
ii. Interpret the meaning of the values of $\hat{\beta}_0$ and $\hat{\beta}_1$ calculated in part (i).
iii. Compute the coefficient of correlation and the coefficient of determination and interpret them.
Solution
i. $\hat{y} = \hat{\beta}_0 + \hat{\beta}_1 x$, where $\hat{\beta}_1 = \frac{SS_{xy}}{SS_{xx}}$ and $\hat{\beta}_0 = \bar{y} - \hat{\beta}_1 \bar{x}$
$\sum x = 188$, $\sum x^2 = 3832$, $\sum xy = 15540$, $\sum y = 793$, $\sum y^2 = 65143$
$SS_{xx} = \sum x^2 - \frac{(\sum x)^2}{n} = 3832 - \frac{188^2}{10} = 297.6$
$SS_{xy} = \sum xy - \frac{(\sum x)(\sum y)}{n} = 15540 - \frac{188 \times 793}{10} = 631.6$
  • 77. $SS_{yy} = \sum y^2 - \frac{(\sum y)^2}{n} = 65143 - \frac{793^2}{10} = 2258.1$
$\hat{\beta}_1 = \frac{SS_{xy}}{SS_{xx}} = \frac{631.6}{297.6} = 2.122$
$\hat{\beta}_0 = \bar{y} - \hat{\beta}_1 \bar{x} = 79.3 - (2.122 \times 18.8) = 39.4064$
$\hat{y} = \hat{\beta}_0 + \hat{\beta}_1 x = 39.4064 + 2.122x$
ii. $\hat{\beta}_1 = 2.122$: it indicates the rate at which the exam score would increase with each additional hour of study.
$\hat{\beta}_0 = 39.41$: it indicates the exam score that would be attained by a student who does not study in the week before the exam.
iii. $r = \frac{SS_{xy}}{\sqrt{SS_{xx} \times SS_{yy}}} = \frac{631.6}{\sqrt{297.6 \times 2258.1}} = 0.77$
There is a strong positive relationship between the exam score and the number of hours studied.
$r^2 = 0.77^2 = 59.29\%$
59.29% of the variation in the exam score is explained by the number of hours studied.
3.6 Spearman's Rank Correlation
- It is the correlation between the ranks assigned to individuals by two different people.
- It is a non-parametric technique for measuring the strength of the relationship between paired observations of two variables when the data are in ranked form.
It is denoted by $R$ or $\rho$:
$R = 1 - \frac{6 \sum d_i^2}{N(N^2 - 1)} = 1 - \frac{6 \sum d_i^2}{N^3 - N}$
In rank correlation, there are two types of problems:
i. Where actual ranks are given
ii. Where actual ranks are not given
  • 78. Where actual ranks are given
Steps:
 Take the differences of the two ranks, i.e. (R1 - R2), and denote these differences by d.
 Square these differences and obtain the total $\sum d^2$.
 Use the formula
\[ R = 1 - \frac{6 \sum d^2}{N^3 - N} \]

Example
The ranks given by two judges to 10 individuals are given below.

Individual    1   2   3   4   5   6   7   8   9  10
Judge 1 (X)   1   2   7   9   8   6   4   3  10   5
Judge 2 (Y)   7   5   8  10   9   4   1   6   3   2

Calculate
(a) The Spearman’s rank correlation.
(b) The coefficient of correlation.

Where ranks are not given
Ranks can be assigned by taking either the highest value as 1 or the lowest value as 1. The same method should be followed for all the variables.

Example
Calculate the rank correlation coefficient for the following data of marks given to 1st year B Com students:

CMS 100   45  47  60  38  50
CAC 100   60  61  58  48  46

Equal Ranks or Tie in Ranks
 Where equal ranks are assigned to some entries, an adjustment is made in the formula for calculating the rank coefficient of correlation.
 The adjustment consists of adding $\frac{1}{12}(m^3 - m)$ to the value of $\sum d^2$, where m stands for the number of items whose ranks are common. The term is added once for each group of tied ranks.
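The rank correlation formula and the tie adjustment above can be sketched in Python as follows. This is an illustrative sketch rather than part of the original notes: the function names `ranks_desc` and `spearman` and the `tie_sizes` parameter are assumptions introduced here, and `ranks_desc` assumes there are no tied values.

```python
def ranks_desc(values):
    """Rank values with the highest value as rank 1 (assumes no ties)."""
    order = sorted(values, reverse=True)
    return [order.index(v) + 1 for v in values]


def spearman(rank_x, rank_y, tie_sizes=()):
    """Compute R = 1 - 6 * (sum of d^2 + tie adjustments) / (N^3 - N).

    tie_sizes holds m for each group of tied ranks; each group adds
    (m^3 - m) / 12 to the sum of squared differences.
    """
    n = len(rank_x)
    sum_d2 = sum((a - b) ** 2 for a, b in zip(rank_x, rank_y))
    sum_d2 += sum((m ** 3 - m) / 12 for m in tie_sizes)
    return 1 - 6 * sum_d2 / (n ** 3 - n)


# Ranks are given: the two judges' rankings of 10 individuals.
judge1 = [1, 2, 7, 9, 8, 6, 4, 3, 10, 5]
judge2 = [7, 5, 8, 10, 9, 4, 1, 6, 3, 2]
r_judges = spearman(judge1, judge2)

# Ranks are not given: rank the marks first (highest mark = rank 1).
cms = [45, 47, 60, 38, 50]
cac = [60, 61, 58, 48, 46]
r_marks = spearman(ranks_desc(cms), ranks_desc(cac))

print(f"Judges: R = {r_judges:.4f}")
print(f"Marks:  R = {r_marks:.4f}")
```

For data with ties, the tied items would first need to be given their shared rank, and the size of each tied group passed in `tie_sizes`; that ranking step is left out of this sketch.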
  • 79. Example
An examination of eight applicants for a clerical post was taken by a firm. From the marks obtained by the applicants in the accounting and statistics papers, compute the rank coefficient of correlation.

Applicant             A   B   C   D   E   F   G   H
Marks in accounting  15  20  28  12  40  60  20  80
Marks in statistics  40  30  50  30  20  10  30  60

3.7 Assessing the Regression Model
3.7.1 Estimating the variance of the error variable
The sample statistic
\[ S_e^2 = \frac{SSE}{n - 2} \]
is an unbiased estimator of $\sigma_e^2$. The square root of $S_e^2$ is called the standard error of estimate, i.e.
\[ S_e = \sqrt{\frac{SSE}{n - 2}}, \qquad SSE = SS_{yy} - \frac{SS_{xy}^2}{SS_{xx}} \]

Interpretation of the Standard Error of Estimate
 The smallest value that the standard error of estimate can assume is zero, which occurs when SSE = 0, i.e. when all the points fall on the regression line.
 If $S_e$ is close to zero, the fit is excellent and the linear model is likely to be a useful and effective analytical and forecasting tool.
 If $S_e$ is large, the model is a poor one and the statistician should either improve it or discard it.
 In general, the standard error of estimate cannot be used as an absolute measure of the model’s utility. Nonetheless, it is useful in comparing models.

3.7.2 Drawing inferences about $\beta_1$
This involves determining whether a linear relationship actually exists between x and y. The null hypothesis always states that there is no linear relationship between the variables, i.e. $H_0: \beta_1 = 0$. Any of the following three alternative hypotheses can be tested:
i. $H_A: \beta_1 \neq 0$ tests whether some linear relationship exists between x and y
ii. $H_A: \beta_1 > 0$ tests whether a positive linear relationship exists between x and y
  • 80. iii. $H_A: \beta_1 < 0$ tests whether a negative linear relationship exists between x and y
The test statistic is
\[ t = \frac{b_1 - \beta_1}{S_{b_1}}, \qquad \text{where } S_{b_1} = \frac{S_e}{\sqrt{SS_{xx}}} \]
Assuming that the error variable is normally distributed, the test statistic follows a Student t distribution with n - 2 degrees of freedom.
The confidence interval estimator of $\beta_1$ is
\[ b_1 \pm t_{\alpha/2,\, n-2}\, S_{b_1} \]

3.7.3 Measuring the strength of the linear relationship
The test of $\beta_1$ only establishes whether a linear relationship exists. It is also useful to measure the strength of that relationship, particularly when we want to compare different models to see which one fits the data better.
(a) Coefficient of Correlation
The coefficient of correlation, denoted by $\rho$ (rho), measures the similarity of the changes in the values of x and y. Its range is $-1 \le \rho \le 1$. Since $\rho$ is a population parameter, its value is estimated from the data. The sample coefficient of correlation r is defined as follows:
\[ r = \frac{SS_{xy}}{\sqrt{SS_{xx} \times SS_{yy}}} \]
(b) Testing the Coefficient of Correlation
If $\rho = 0$, the values of x and y are uncorrelated and the linear model is not appropriate. We can determine whether x and y are correlated by testing the following hypotheses:
\[ H_0: \rho = 0 \qquad H_A: \rho \neq 0 \]
The test statistic is
\[ t = \frac{r - \rho}{S_r}, \qquad \text{where } S_r = \sqrt{\frac{1 - r^2}{n - 2}} \]
The test statistic is Student t distributed with n - 2 degrees of freedom if the error variable is normally distributed.
(c) Coefficient of Determination ($r^2$)
This measures the proportion of variability in the dependent variable that is explained by variability of the independent variable.
  • 81. \[ r^2 = \frac{SS_{xy}^2}{SS_{xx} \times SS_{yy}} \]

3.7.4 Predicting the particular value of y for a given x (The Prediction Interval)
The prediction interval is given by:
\[ \hat{y} \pm t_{\alpha/2,\, n-2}\, S_e \sqrt{1 + \frac{1}{n} + \frac{(x_g - \bar{x})^2}{SS_{xx}}} \]
where $x_g$ is the given value of x and $\hat{y} = b_0 + b_1 x_g$.

3.7.5 Estimating the expected value of y for a given x (The Confidence Interval)
The confidence interval is given by:
\[ \hat{y} \pm t_{\alpha/2,\, n-2}\, S_e \sqrt{\frac{1}{n} + \frac{(x_g - \bar{x})^2}{SS_{xx}}} \]
where $x_g$ is the given value of x and $\hat{y} = b_0 + b_1 x_g$.

Example One
A real estate agent would like to predict the selling price of single-family homes. After careful consideration, she concludes that the variable likely to be most closely related to the selling price is the size of the house. As an experiment, she takes a random sample of 15 recently sold houses and records the selling price in Sh ’000 and the size in hundreds of square feet (100 ft²) of each. The data are shown in the table below:

House size (100 ft²)   Selling price (Sh ’000)
20.0                    89.5
14.8                    79.9
20.5                    83.1
12.5                    56.9
18.0                    66.6
14.3                    82.5
27.5                   126.3
16.5                    79.3
24.3                   119.9
20.2                    87.6
22.0                   112.6
19.0                   120.8
12.3                    78.5
14.0                    74.3
16.7                    74.8

Required:
(a) Find the sample regression line for the data.
(b) Estimate the variance of the error variable and the standard error of estimate.
  • 82. (c) Can we conclude at the 1% significance level that the size of a house is linearly related to its selling price?
(d) Find the 99% confidence interval estimate of $\beta_1$.
(e) Compute the coefficient of correlation and interpret its value.
(f) Can we conclude at the 1% significance level that the two variables are correlated?
(g) Compute the coefficient of determination and interpret its value.
(h) Predict with 95% confidence the selling price of a house that occupies 2,000 ft².
(i) In a certain part of the city, a developer built several thousand houses whose floor plans and exteriors differ but whose sizes are all 2,000 ft². To date, they have been rented, but the builder now wants to sell them and wants to know approximately how much money in total he can expect from the sale of the houses. Help him by finding a 95% confidence interval estimate of the mean selling price of the houses.

Solution
(a) Find the least squares regression line
\[ \hat{y} = b_0 + b_1 x, \qquad b_1 = \frac{SS_{xy}}{SS_{xx}}, \qquad b_0 = \bar{y} - b_1 \bar{x} \]
\[ \sum x = 272.6, \quad \sum y = 1332.6, \quad \sum xy = 25257.97, \quad \sum x^2 = 5222.24, \quad \sum y^2 = 124618.42 \]
\[ SS_{xx} = \sum x^2 - \frac{(\sum x)^2}{n} = 5222.24 - \frac{272.6^2}{15} = 268.189 \]
\[ SS_{xy} = \sum xy - \frac{\sum x \sum y}{n} = 25257.97 - \frac{272.6 \times 1332.6}{15} = 1040.186 \]
\[ SS_{yy} = \sum y^2 - \frac{(\sum y)^2}{n} = 124618.42 - \frac{1332.6^2}{15} = 6230.24 \]
\[ b_1 = \frac{SS_{xy}}{SS_{xx}} = \frac{1040.186}{268.189} = 3.88 \]
\[ b_0 = \bar{y} - b_1 \bar{x} = 88.84 - (3.88 \times 18.17) = 18.34 \]
\[ \hat{y} = b_0 + b_1 x = 18.34 + 3.88x \]
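The sums and the fitted line in part (a) can be reproduced with a short Python sketch. It is offered as a check under stated assumptions, not as part of the original notes: the (size, price) pairs are taken in the order they appear in the table for Example One, and the variable names are illustrative.

```python
# Example One data: house size in 100 ft^2 (x) and selling price in Sh '000 (y).
size = [20.0, 14.8, 20.5, 12.5, 18.0, 14.3, 27.5, 16.5, 24.3, 20.2,
        22.0, 19.0, 12.3, 14.0, 16.7]
price = [89.5, 79.9, 83.1, 56.9, 66.6, 82.5, 126.3, 79.3, 119.9, 87.6,
         112.6, 120.8, 78.5, 74.3, 74.8]

n = len(size)
sum_x, sum_y = sum(size), sum(price)
sum_xy = sum(x * y for x, y in zip(size, price))
sum_x2 = sum(x * x for x in size)
sum_y2 = sum(y * y for y in price)

# Shortcut formulas for the sums of squares.
ss_xx = sum_x2 - sum_x ** 2 / n
ss_xy = sum_xy - sum_x * sum_y / n
ss_yy = sum_y2 - sum_y ** 2 / n

b1 = ss_xy / ss_xx                  # slope
b0 = sum_y / n - b1 * sum_x / n     # intercept

print(f"SSxx={ss_xx:.3f}  SSxy={ss_xy:.3f}  SSyy={ss_yy:.2f}")
print(f"fitted line: y = {b0:.2f} + {b1:.2f} x")
```

The intercept printed by the script may differ in the second decimal place from 18.34, because the hand calculation rounds the slope and the mean of x before multiplying.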
  • 83. (b) Estimate the variance of the error variable and the standard error of estimate.
\[ SSE = SS_{yy} - \frac{SS_{xy}^2}{SS_{xx}} = 6230.24 - \frac{1040.18^2}{268.19} = 2195.88 \]
\[ S_e^2 = \frac{SSE}{n - 2} = \frac{2195.88}{15 - 2} \approx 169, \qquad S_e = \sqrt{169} = 13 \]

(c) Can we conclude at the 1% significance level that the size of a house is linearly related to its selling price?
\[ H_0: \beta_1 = 0 \qquad H_A: \beta_1 \neq 0 \qquad \alpha = 0.01 \]
Test statistic:
\[ t = \frac{b_1 - \beta_1}{S_{b_1}} \]
Decision rule:
\[ t_{\alpha/2,\, n-2} = t_{0.005,\, 13} = 3.012 \]
If $t_c > 3.012$ or $t_c < -3.012$, reject $H_0$.
Value of the test statistic:
\[ S_{b_1} = \frac{S_e}{\sqrt{SS_{xx}}} = \frac{13}{\sqrt{268.19}} = 0.794 \]
\[ t = \frac{b_1 - \beta_1}{S_{b_1}} = \frac{3.88 - 0}{0.794} = 4.89 > 3.012 \]
Conclusion: Reject $H_0$. Yes, the data provide sufficient evidence to conclude that the house size is linearly related to its selling price.

(d) Find the 99% confidence interval estimate of $\beta_1$
\[ b_1 \pm t_{\alpha/2,\, n-2}\, S_{b_1} = 3.88 \pm (3.012 \times 0.794) = 3.88 \pm 2.39 \]
\[ 1.49 \le \beta_1 \le 6.27 \]

(e) Compute the coefficient of correlation and interpret its value
\[ r = \frac{SS_{xy}}{\sqrt{SS_{xx} \times SS_{yy}}} = \frac{1040.18}{\sqrt{268.19 \times 6230.24}} = 0.805 \]
There is a very strong positive correlation between the size of the house and its selling price.
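Parts (b) to (e) follow mechanically from the three sums of squares, so they can be checked with a small Python sketch. This is an illustration under stated assumptions rather than the notes' own method: the sums of squares are copied from part (a), the critical value 3.012 is taken from the t table for 13 degrees of freedom (as used in the solution) instead of being computed, and all variable names are hypothetical.

```python
import math

# Values carried over from part (a) of Example One.
n = 15
ss_xx, ss_xy, ss_yy = 268.189, 1040.186, 6230.24
b1 = ss_xy / ss_xx

# Assumption: two-tailed critical value for alpha = 0.01 and 13 degrees
# of freedom, read from a t table rather than computed.
T_CRIT = 3.012

# (b) Error variance and standard error of estimate.
sse = ss_yy - ss_xy ** 2 / ss_xx
s_e = math.sqrt(sse / (n - 2))

# (c) Test H0: beta1 = 0 against HA: beta1 != 0.
s_b1 = s_e / math.sqrt(ss_xx)
t_stat = (b1 - 0) / s_b1
reject_h0 = abs(t_stat) > T_CRIT

# (d) 99% confidence interval for beta1.
ci_low = b1 - T_CRIT * s_b1
ci_high = b1 + T_CRIT * s_b1

# (e) Coefficient of correlation.
r = ss_xy / math.sqrt(ss_xx * ss_yy)

print(f"SSE={sse:.2f}  s_e={s_e:.2f}  s_b1={s_b1:.3f}")
print(f"t={t_stat:.2f}  reject H0: {reject_h0}")
print(f"99% CI for beta1: ({ci_low:.2f}, {ci_high:.2f})  r={r:.3f}")
```

Small differences from the hand-worked figures (for example in SSE) come from rounding intermediate values in the manual calculation.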