2. Organising and graphing quantitative data in a frequency
distribution table.
• A frequency table consists of a number of classes; each
observation is counted in the class it falls into, and the count
is recorded as the frequency of that class.
• If n observations need to be classified into a frequency table,
determine:
– Number of classes: $c = 1 + 3,3 \log n$ (log to base 10)
– Class width: $\dfrac{x_{\max} - x_{\min}}{c}$
3. Organising and graphing quantitative data in a frequency
distribution table.
Example:
The following data represents the number of telephone calls received
for two days at a municipal call centre. The data was measured per
hour.
8 11 12 20 18 10 14 18 16 9
5 7 11 12 15 14 16 9 17 11
6 18 9 15 13 12 11 6 10 8
11 13 22 11 11 14 11 10 9
19 14 17 9 3 3 16 8 2
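The class rule and the tally can be sketched in a few lines of Python. This is a minimal sketch, not part of the original slides: the function name `frequency_table` is hypothetical, and rounding both the number of classes and the class width up to the next whole number is an assumption that matches the worked exam solution later in the deck.

```python
import math

# Calls per hour over the 48 hours, as listed above.
calls = [
    8, 11, 12, 20, 18, 10, 14, 18, 16, 9,
    5, 7, 11, 12, 15, 14, 16, 9, 17, 11,
    6, 18, 9, 15, 13, 12, 11, 6, 10, 8,
    11, 13, 22, 11, 11, 14, 11, 10, 9,
    19, 14, 17, 9, 3, 3, 16, 8, 2,
]

def frequency_table(data, start=None):
    """Return (lower, upper, frequency) for classes [lower, upper)."""
    n = len(data)
    # c = 1 + 3.3 log10(n), rounded up (assumption).
    classes = math.ceil(1 + 3.3 * math.log10(n))
    # Class width = (xmax - xmin) / c, rounded up (assumption).
    width = math.ceil((max(data) - min(data)) / classes)
    lower = min(data) if start is None else start
    table = []
    for k in range(classes):
        lo = lower + k * width
        hi = lo + width
        freq = sum(1 for x in data if lo <= x < hi)
        table.append((lo, hi, freq))
    return table

for lo, hi, freq in frequency_table(calls):
    print(f"[{lo}-under {hi}): {freq}")
```

For these 48 observations the rule gives 7 classes of width 3, starting at the minimum value 2.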
14. Histograms
[Figure: histogram titled "Number of telephone calls per hour at a municipal call centre". x-axis: Number of calls (class boundaries 2, 5, 8, 11, 14, 17, 20, 23); y-axis: Number of hours (0 to 14).]
15. Definitions
Frequency Polygon
A line graph of a frequency distribution that offers a
useful alternative to a histogram. A frequency polygon is
useful for conveying the shape of the distribution.
Ogive
A graphic representation of the cumulative frequency
distribution. It is used for approximating the number of
values less than or equal to a specified value.
17. Frequency polygons
[Figure: frequency polygon titled "Number of telephone calls per hour at a municipal call centre". x-axis: Number of calls, with points plotted at the class midpoints (x) 3,5; 6,5; 9,5; 12,5; 15,5; 18,5; 21,5; y-axis: Number of hours (0 to 14). Arbitrary midpoints (0,5 and 24,5) are added at either end to close the polygon.]
19. Ogives
[Figure: ogive titled "Ogive of number of calls received at a call centre per hour". x-axis: Number of calls (2, 5, 8, 11, 14, 17, 20, 23); y-axis: % cumulative number of hours (0 to 100).]
None of the hours had fewer than 2 calls.
20. Ogives
[Figure: ogive titled "Ogive of number of calls received at a call centre per hour". x-axis: Number of calls (2, 5, 8, 11, 14, 17, 20, 23); y-axis: % cumulative number of hours (0 to 100).]
Readings from the ogive:
• 80% of the hours had fewer than 17 calls per hour.
• 20% of the hours had more than 17 calls per hour.
• 50% of the hours had fewer than 12 calls per hour.
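The points plotted on an ogive are the cumulative percentage frequencies at the upper class boundaries. The sketch below is illustrative only: it assumes the class frequencies 3, 4, 11, 13, 9, 6, 2 for the classes [2–under 5) to [20–under 23) given in the frequency table elsewhere in this deck, and the variable names are hypothetical.

```python
from itertools import accumulate

# Upper class boundaries and class frequencies (taken from the deck's frequency table).
upper_bounds = [5, 8, 11, 14, 17, 20, 23]
freqs = [3, 4, 11, 13, 9, 6, 2]

n = sum(freqs)
cumulative = list(accumulate(freqs))

# The ogive starts at 0% at the lower boundary of the first class.
ogive = [(2, 0.0)] + [(u, 100 * F / n) for u, F in zip(upper_bounds, cumulative)]

for bound, pct in ogive:
    print(f"fewer than {bound} calls: {pct:.1f}% of hours")
```

The computed value at 17 calls is about 83%, in line with the roughly 80% read off the graph; readings from an ogive are approximations.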
21. Exam question 2
A garbage removal company would like to start charging by the
weight of a customer's bin rather than by the number of bins put
out. They select a sample of 25 customers and weigh their
garbage bins. The weights in kg are given below:
14.5 5.2 16.0 14.7 15.6 18.9 13.5 24.6 24.5 7.4
13.2 23.4 13.9 12.0 22.5 31.4 16.1 10.9 25.1 22.1
14.8 15.1 4.9 17.0 10.3
1. Construct a frequency table to describe the data. Include a
frequency and relative (%) frequency column. (Hint: start the
class intervals with the whole number just smaller than the
lowest value in the dataset)
22. Procedure
1. Calculate the range of the dataset
2. Calculate the number of classes
3. Calculate the class width
4. Construct a table showing the intervals calculated in 1 to 3
5. Put in the tally for each interval and then show it as a frequency
6. Calculate the relative (%) frequency
13 marks
23. Range
$31.4 - 4.9 = 26.5$
Number of classes
K or $c = 1 + 3.3 \log n$
$n = 25$: K or $c = 1 + 3.3 \log(25) = 5.61 \approx 6$
Class width
$\dfrac{x_{\max} - x_{\min}}{c} = \dfrac{26.5}{6} = 4.42 \approx 5$ (rounded up)
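With 6 classes of width 5, the remaining steps of the procedure (tally, frequency, relative frequency) can be sketched as follows. This is an illustrative sketch: starting the first class at 4 follows the hint in the question (the whole number just smaller than the lowest value, 4.9), and the variable names are hypothetical.

```python
# Bin weights in kg for the 25 sampled customers.
weights = [
    14.5, 5.2, 16.0, 14.7, 15.6, 18.9, 13.5, 24.6, 24.5, 7.4,
    13.2, 23.4, 13.9, 12.0, 22.5, 31.4, 16.1, 10.9, 25.1, 22.1,
    14.8, 15.1, 4.9, 17.0, 10.3,
]

start, width, classes = 4, 5, 6  # start at 4 per the hint; width and classes from the calculation above

rows = []
for k in range(classes):
    lo = start + k * width
    hi = lo + width
    freq = sum(1 for w in weights if lo <= w < hi)   # tally for [lo, hi)
    rel = 100 * freq / len(weights)                  # relative (%) frequency
    rows.append((lo, hi, freq, rel))

for lo, hi, freq, rel in rows:
    print(f"{lo} - < {hi}: f = {freq}, {rel:.0f}%")
```

The output agrees with the answers on the next slide: the 14 to 19 kg interval has the highest frequency and the 29 to 34 kg interval holds 4% of the bins.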
25. Exam question 2
2. Comment on the interval containing the lowest percentage.
Answer: 4% of bins weighed between 29 and 34 kg.
3. In which interval do the data tend to cluster? Which descriptive statistics measure, can we assume, would be found in this interval?
Answer: The largest number of bins weighed between 14 and 19 kg. We assume the mode will fall in this interval (highest frequency).
4. Comment on the shape of the distribution without drawing a graph. Give reasons.
Answer: Positively skewed, as more values are located in the lower intervals.
7 MARKS
29. • QUARTILES
– Order data in ascending order.
– Divide data set into four quarters.
25% 25% 25% 25%
Min Q1 Q2 Q3 Max
30. Example – Given the following data set:
2 5 8 −3 5 2 6 5 −4
Determine Q1 for the sample of nine measurements:
• Order the measurements:
Value: −4 −3 2 2 5 5 5 6 8
Position: 1 2 3 4 5 6 7 8 9
Q1 is the $(n+1)\left(\tfrac{1}{4}\right) = (9+1)\left(\tfrac{1}{4}\right) = 2,5$th value.
Find the difference between the 2nd and 3rd values: 2 − (−3) = 5, and multiply by the decimal portion of the position: 5 × 0,5 = 2,5.
Add this to the smaller value: −3 + 2,5, so Q1 = −0,5.
31. Example – Given the following data set:
2 5 8 −3 5 2 6 5 −4
Determine Q3 for the sample of nine measurements:
−4 −3 2 2 5 5 5 6 8
1 2 3 4 5 6 7 8 9
Q3 is the $(n+1)\left(\tfrac{3}{4}\right) = (9+1)\left(\tfrac{3}{4}\right) = 7,5$th value.
Q3 = 5 + 0,5(6 − 5) = 5,5
32. Example – Given the following data set:
2 5 8 −3 5 2 6 5 −4
Interquartile range = Q3 – Q1
Q3 = 5,5
Q1 = −0,5
Interquartile range = 5,5 – (−0,5) = 6
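The position method used in these three slides can be sketched as a small Python function. This is a minimal sketch: the function name `value_at_position` is hypothetical, and returning the smallest or largest value when the position falls outside the data is an assumption the slides do not address.

```python
import math

def value_at_position(data, p):
    """Value at position (n + 1) * p of the ordered data, with linear interpolation."""
    xs = sorted(data)
    n = len(xs)
    pos = (n + 1) * p            # 1-based position, possibly fractional
    k = math.floor(pos)          # whole part of the position
    frac = pos - k               # decimal portion of the position
    if k < 1:                    # assumption: clamp to the smallest value
        return xs[0]
    if k >= n:                   # assumption: clamp to the largest value
        return xs[-1]
    # Lower value plus the decimal portion of the gap to the next value.
    return xs[k - 1] + frac * (xs[k] - xs[k - 1])

data = [2, 5, 8, -3, 5, 2, 6, 5, -4]
q1 = value_at_position(data, 0.25)
q3 = value_at_position(data, 0.75)
iqr = q3 - q1
print(q1, q3, iqr)
```

Note that statistical software often uses other interpolation rules, so quartiles from a package may differ slightly from this hand method.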
33. INTERQUARTILE RANGE (IQR)
• Difference between the third and first
quartiles
• Indicates how far apart the first and third
quartiles are
IQR = Q3 – Q1
34. BOX & WHISKER PLOT
• Provides a graphical summary of data based
on 5 summary measures or values
– First quartile, median, third quartile, lower limit,
upper limit
• A box-and-whisker plot detects outliers in a data
set
LL = Q1 – 1,5 (IQR)
UL = Q3 + 1,5 (IQR)
35. BOX-AND-WHISKER PLOT
Q1 = 9,36   Me = 12,38   Q3 = 15,67   IQR = 6,31
LL = Q1 – 1,5(IQR) = 9,36 – 1,5(6,31) = –0,11
UL = Q3 + 1,5(IQR) = 15,67 + 1,5(6,31) = 25,14
[Figure: box-and-whisker plot on an axis from 0 to 28; the box spans the IQR and a distance of 1,5(IQR) is marked on either side of it.]
• Any value smaller than −0,11 will be an outlier.
• Any value larger than 25,14 will be an outlier.
36. Exam question 3
The Tubeka brothers spent the following amounts in Rand on groceries over
the last 8 weeks:
54 56 89 67 74 57 43 51
1. Calculate a five-number summary table.
2. Construct a box-and-whisker plot for the data.
3. Determine whether there are any outliers. Show calculations.
20 MARKS
PROCEDURE
1. Reorder the data set
2. Identify maximum and minimum values in dataset
3. Calculate median
4. Calculate Q1 & Q3
5. Construct plot
6. Calculate upper & lower limits for dataset to determine if outliers present
37. 43 51 54 56 57 67 74 89
xmin = 43 xmax = 89 median = (56+57)/2 = 56.5 Q1 = 51.75 Q3 = 72.25
Position of Q1 = (n+1)(¼) = (8+1) × ¼ = 2.25th value
Between 51 & 54
54 − 51 = 3; multiply by the decimal portion of the position, 3 × 0.25 = 0.75, and add the lower value
Q1 = 51 + 0.75 = 51.75
Position of Q3 = (n+1)(¾) = (8+1) × ¾ = 6.75th value
Between 67 & 74
74 − 67 = 7; multiply by the decimal portion of the position, 7 × 0.75 = 5.25, and add the lower value
Q3 = 67 + 5.25 = 72.25
38. 43 51 54 56 57 67 74 89
xmin = 43 xmax = 89 median = (56+57)/2 = 56.5 Q1 = 51.75 Q3 = 72.25
OUTLIERS
1. Calculate upper & lower limits
LL = Q1 – 1.5(IQR)
UL = Q3 + 1.5(IQR)
IQR = 72.25 – 51.75 = 20.5
LL = 51.75 – 1.5(20.5) = 21
UL = 72.25 + 1.5(20.5) = 103
No values are smaller than 21 or greater than 103, therefore no outliers are present.
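The whole procedure for this question (five-number summary, limits, outlier check) can be sketched in Python. This is an illustrative sketch that reuses the (n + 1)p position method from the solution above; the helper name `at_position` is hypothetical and it assumes the position lies strictly inside the data.

```python
import math

def at_position(xs, pos):
    """Value at a 1-based, possibly fractional position in sorted list xs."""
    k = math.floor(pos)
    frac = pos - k
    return xs[k - 1] + frac * (xs[k] - xs[k - 1])

spend = [54, 56, 89, 67, 74, 57, 43, 51]
xs = sorted(spend)
n = len(xs)

q1 = at_position(xs, (n + 1) * 0.25)
median = at_position(xs, (n + 1) * 0.50)
q3 = at_position(xs, (n + 1) * 0.75)
five_number = (xs[0], q1, median, q3, xs[-1])

iqr = q3 - q1
lower_limit = q1 - 1.5 * iqr
upper_limit = q3 + 1.5 * iqr
outliers = [x for x in xs if x < lower_limit or x > upper_limit]

print(five_number, lower_limit, upper_limit, outliers)
```

An empty `outliers` list corresponds to the conclusion that no outliers are present.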
40. • ARITHMETIC MEAN
– Data is given in a frequency table
– Only an approximate value of the mean
$\bar{x} = \dfrac{\sum f_i x_i}{\sum f_i}$
where $f_i$ = frequency of the i-th class interval
$x_i$ = class midpoint of the i-th class interval
41. • MEDIAN
– Data is given in a frequency table.
– First cumulative frequency ≥ n/2 will indicate the
median class interval.
– Median can also be determined from the ogive.
$M_e = l_i + \dfrac{(u_i - l_i)\left(\frac{n}{2} - F_{i-1}\right)}{f_i}$
where $l_i$ = lower boundary of the median interval
$u_i$ = upper boundary of the median interval
$F_{i-1}$ = cumulative frequency of the interval preceding the
median interval
$f_i$ = frequency of the median interval
42. • MODE
– Class interval that has the largest frequency value
will contain the mode.
– Mode is the class midpoint of this class.
– Mode must be determined from the histogram.
43. Example – The following data represents the number of
telephone calls received for two days at a municipal call centre.
The data was measured per hour.
To calculate the mean for the sample of the 48 hours, determine the class midpoints:
Number of calls    Number of hours (fi)    xi
[2–under 5)        3                       3,5
[5–under 8)        4                       6,5
[8–under 11)       11                      9,5
[11–under 14)      13                      12,5
[14–under 17)      9                       15,5
[17–under 20)      6                       18,5
[20–under 23)      2                       21,5
n = 48
44. Example – The following data represents the number of
telephone calls received for two days at a municipal call centre.
The data was measured per hour.
Number of calls    Number of hours (fi)    xi
[2–under 5)        3                       3,5
[5–under 8)        4                       6,5
[8–under 11)       11                      9,5
[11–under 14)      13                      12,5
[14–under 17)      9                       15,5
[17–under 20)      6                       18,5
[20–under 23)      2                       21,5
n = 48
$\bar{x} = \dfrac{\sum f_i x_i}{\sum f_i} = \dfrac{597}{48} = 12,44$
Average number of calls per hour is 12,44.
45. Exam question 3
The number of overtime hours worked by 40 part-time employees of a
security company in 1 week is shown in the following frequency
distribution:
Hours per week    Frequency (f)
2.1 - < 2.8       12
2.8 - < 3.5       13
3.5 - < 4.2       7
4.2 - < 4.9       5
4.9 - < 5.6       2
5.6 - < 6.3       1
1. Estimate the mean number of overtime hours worked
2. What % of employees worked at least 4.2 hours overtime?
8 marks
46. Exam question 3
Procedure
1. Calculate the midpoint x for each interval:
(lower limit + upper limit)/2
2. Multiply f by the midpoint x
3. Total the fx and f columns
4. Divide ∑fx by ∑f
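The four steps translate directly into code. The sketch below is illustrative, with hypothetical names; it applies the steps to the overtime table from the question and also computes the share of employees in the classes starting at 4.2 hours or more, which is what the second part of the question asks for.

```python
def grouped_mean(classes):
    """Approximate mean from (lower, upper, frequency) class intervals."""
    total_fx = 0.0
    total_f = 0
    for lower, upper, f in classes:
        midpoint = (lower + upper) / 2      # step 1
        total_fx += f * midpoint            # steps 2 and 3
        total_f += f
    return total_fx / total_f               # step 4

# Overtime hours per week for the 40 part-time employees.
overtime = [
    (2.1, 2.8, 12),
    (2.8, 3.5, 13),
    (3.5, 4.2, 7),
    (4.2, 4.9, 5),
    (4.9, 5.6, 2),
    (5.6, 6.3, 1),
]

mean_hours = grouped_mean(overtime)
n = sum(f for _, _, f in overtime)
at_least_4_2 = sum(f for lower, _, f in overtime if lower >= 4.2)
share = 100 * at_least_4_2 / n

print(f"estimated mean: {mean_hours:.2f} hours")
print(f"worked at least 4.2 hours: {share:.0f}%")
```

Because midpoints stand in for the individual values, the result is only an estimate of the mean, as the arithmetic mean slide notes.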
49. • PERCENTILES
– Order data in ascending order.
– Divide data set into hundred parts.
10% 90%
Min P10 Max
80% 20%
Min P80 Max
50% 50%
Min P50 = Q2 Max
50. Example – Given the following data set:
2 5 8 −3 5 2 6 5 −4
Determine P20 for the sample of nine measurements:
−4 −3 2 2 5 5 5 6 8
1 2 3 4 5 6 7 8 9
P20 is the $(n+1)\left(\dfrac{p}{100}\right) = (9+1)\left(\dfrac{20}{100}\right) = 2$nd value.
P20 = −3
51. Example – The following data represents the number of
telephone calls received for two days at a municipal call
centre. The data was measured per hour.
Position of P60 = np/100 = 48(60)/100 = 28,8
The first cumulative frequency ≥ 28,8 is 31, so P60 lies in the interval [11–under 14).
Number of calls    Number of hours (fi)    F
[2–under 5)        3                       3
[5–under 8)        4                       7
[8–under 11)       11                      18
[11–under 14)      13                      31
[14–under 17)      9                       40
[17–under 20)      6                       46
[20–under 23)      2                       48
n = 48
52. Example – The following data represents the number of
telephone calls received for two days at a municipal call centre.
The data was measured per hour.
Number of calls    Number of hours (fi)    F
[2–under 5)        3                       3
[5–under 8)        4                       7
[8–under 11)       11                      18
[11–under 14)      13                      31
[14–under 17)      9                       40
[17–under 20)      6                       46
[20–under 23)      2                       48
n = 48
$P_{60} = l_p + \dfrac{(u_p - l_p)\left(\frac{np}{100} - F_{p-1}\right)}{f_p} = 11 + \dfrac{(14 - 11)(28,8 - 18)}{13} = 13,49$
60% of the time there were fewer than 13,49 calls per hour, or 40% of the time more than 13,49 calls per hour.
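The grouped-data percentile formula can be sketched as one function: find the first class whose cumulative frequency reaches np/100, then interpolate inside it. This is a minimal sketch with a hypothetical function name; skipping empty classes is an assumption added to avoid dividing by zero.

```python
def grouped_percentile(classes, p):
    """p-th percentile from (lower, upper, frequency) class intervals."""
    n = sum(f for _, _, f in classes)
    target = n * p / 100                 # position np/100
    cum_before = 0                       # F of the preceding interval
    for lower, upper, f in classes:
        # First class whose cumulative frequency is >= target.
        if f > 0 and cum_before + f >= target:
            return lower + (upper - lower) * (target - cum_before) / f
        cum_before += f
    raise ValueError("p must be between 0 and 100")

calls = [
    (2, 5, 3), (5, 8, 4), (8, 11, 11), (11, 14, 13),
    (14, 17, 9), (17, 20, 6), (20, 23, 2),
]

p60 = grouped_percentile(calls, 60)
median = grouped_percentile(calls, 50)   # the median is P50
print(round(p60, 2), round(median, 2))
```

With p = 50 the same function reproduces the grouped median formula, since n/2 = np/100 when p = 50.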
53. Exam question 3
1. John, one of the part-time workers, was told he falls on the
70th percentile. Calculate the value and explain what it
means.
PROCEDURE
1. Calculate the cumulative frequencies
2. Calculate which class the required percentile falls into by
using P = np/100
3. Once you have identified the class, use the percentile formula
given in the tables book to calculate the value. Take CARE to
order the calculation correctly.
4 MARKS
54. Exam question 3
P = np/100 = 40 × 70/100 = 28
Hours per week    Frequency (f)    Cumulative F
2.1 - < 2.8       12               12
2.8 - < 3.5       13               25
3.5 - < 4.2       7                32
4.2 - < 4.9       5                37
4.9 - < 5.6       2                39
5.6 - < 6.3       1                40
Total             40
The first cumulative frequency ≥ 28 is 32, so P70 lies in the interval 3.5 - < 4.2.
P70 = 3.5 + [(4.2 − 3.5)(28 − 25)]/7
= 3.5 + 0.3
= 3.8
70% of the workers worked fewer overtime hours than John, i.e. fewer than 3.8 hours. 30% of the workers worked more overtime hours than John, i.e. more than 3.8 hours.
56. Confidence interval
– An interval is calculated around the sample
statistic
[Figure: a confidence interval drawn around the sample statistic, with the population parameter included in the interval.]
57. Confidence interval
– An upper and lower limit within which the
population parameter is expected to lie
– Limits will vary from sample to sample
– Specify the probability that the interval will
include the parameter
– Typically used: 90%, 95%, 99%
– Probability denoted by
• (1 – α), known as the level of confidence
• α is the significance level
Example: meaning of a 90% confidence interval:
90% of all possible samples taken from the
population will produce an interval that will
include the population parameter
58. • An interval estimate consists of a range of values
with an upper & lower limit
• The population parameter is expected to lie within
this interval with a certain level of confidence
• Limits of an interval vary from sample to sample
therefore we must also specify the probability that
an interval will contain the parameter
• Ideally probability should be as high as possible
59. SO REMEMBER
• We can choose the probability
• Probability is denoted by (1 – α)
• Typical values are 0.9 (90%), 0.95 (95%) and 0.99 (99%)
• The probability is known as the LEVEL OF CONFIDENCE
• α is known as the SIGNIFICANCE LEVEL
• α corresponds to an area under a curve
• Since we take the confidence level into account when we
estimate an interval, the interval is called a CONFIDENCE
INTERVAL
60. Confidence interval for Population Mean, n ≥ 30
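- population need not be normally distributed
- sample mean will be approximately normally distributed
$CI(\mu)_{1-\alpha} = \bar{x} \pm Z_{1-\frac{\alpha}{2}} \dfrac{\sigma}{\sqrt{n}}$, if $\sigma$ is known
$CI(\mu)_{1-\alpha} = \bar{x} \pm Z_{1-\frac{\alpha}{2}} \dfrac{s}{\sqrt{n}}$, if $\sigma$ is not known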
61. Example :
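$CI(\mu)_{1-\alpha} = \bar{x} \pm Z_{1-\frac{\alpha}{2}} \dfrac{\sigma}{\sqrt{n}}$, if $\sigma$ is known
$CI(\mu)_{1-\alpha} = \bar{x} \pm Z_{1-\frac{\alpha}{2}} \dfrac{s}{\sqrt{n}}$, if $\sigma$ is not known
90% confidence interval:
$1 - \alpha = 0,90$, so $\alpha = 1 - 0,90 = 0,10$ and $\dfrac{\alpha}{2} = \dfrac{0,10}{2} = 0,05$
[Figure: normal curve centred on $\bar{x}$ with the lower and upper confidence limits marked. The central area is the confidence level $1 - \alpha = 0,90$; 90% of all sample means fall in this area. Each tail has area $\alpha/2 = 0,05$; these 2 areas added together = α, i.e. 10%.]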
63. • Confidence interval for Population Mean, n < 30
– For a small sample from a normal population where σ is
known, the normal distribution can be used.
– If σ is unknown we use s to estimate σ
– We need to replace the normal distribution with the
t-distribution
$CI(\mu)_{1-\alpha} = \bar{x} \pm t_{n-1;1-\frac{\alpha}{2}} \dfrac{s}{\sqrt{n}}$
[Figure: the standard normal curve compared with the t-distribution curve.]
65. • Example
– The manager of a small departmental store is concerned about
the decline of his weekly sales.
– He calculated the average and standard deviation of his sales for
the past 12 weeks: $\bar{x}$ = R12 400 and s = R1 346
– Estimate with 99% confidence the population mean sales of the
departmental store.
$\bar{x} \pm t_{n-1;1-\frac{\alpha}{2}} \dfrac{s}{\sqrt{n}} = 12400 \pm 3,106 \cdot \dfrac{1346}{\sqrt{12}}$, where $t_{11;0.995} = 3,106$
$= 12400 \pm 1206,86$
$= (11193,14 ; 13606,86)$
We are 99% confident the mean weekly sales will be between
R11 193,14 and R13 606,86.
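The calculation above can be checked with a few lines of Python. This is a minimal sketch with a hypothetical function name; the critical value 3,106 is taken from the t-table as in the slide and passed in by hand, because the Python standard library has no t-distribution quantile function.

```python
import math

def t_confidence_interval(sample_mean, s, n, t_crit):
    """CI for the mean: sample_mean +/- t_crit * s / sqrt(n).

    t_crit is the table value t_{n-1; 1-alpha/2}, supplied by the caller.
    """
    half_width = t_crit * s / math.sqrt(n)
    return sample_mean - half_width, sample_mean + half_width

# Weekly sales example: n = 12, 99% confidence, t_{11; 0.995} = 3.106 (from tables).
lower, upper = t_confidence_interval(12400, 1346, 12, 3.106)
print(f"R{lower:.2f} to R{upper:.2f}")
```

The same function applies to any small-sample mean once the matching table value is looked up for n − 1 degrees of freedom.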
66. • Confidence interval for Population proportion
– Each element in the population can be classified as a
success or failure
– Sample proportion: $\hat{p} = \dfrac{\text{number of successes}}{\text{sample size}} = \dfrac{x}{n}$
– Proportion always between 0 and 1
– For large samples the sample proportion $\hat{p}$ is
approximately normal
$CI(p)_{1-\alpha} = \hat{p} \pm z_{1-\frac{\alpha}{2}} \sqrt{\dfrac{\hat{p}(1-\hat{p})}{n}}$
67. Exam question 7
1. In a sample of 200 residents of Johannesburg, 120 reported
they believed the property taxes were too high. Develop a
95% confidence interval for the proportion of the
residents who believe the tax rate is too high. Interpret your
answer.
2. The time it takes a mechanic to tune an engine in a sample of
20 tune-ups is known to be normally distributed with a
sample mean of 45 minutes and a sample standard deviation
of 14 minutes. Develop a 95% confidence interval estimate
for the mean time it will take the mechanic for all engine
tune-ups. Interpret your answer.
15 MARKS
68. Exam question 7
PROCEDURE
1. Determine what measure you are looking at: mean,
proportion or standard deviation
2. Select appropriate formula based on 1. and sample size (t for
small sample sizes <30; z for larger sample sizes)
3. Put the numbers into the formula and calculate the
confidence intervals
69. Exam question 7
1. In a sample of 200 residents of Johannesburg, 120 reported
they believed the property taxes were too high. Develop a
95% confidence interval for the proportion of the residents
who believe the tax rate is too high. Interpret your answer.
Sample proportion: $\hat{p} = \dfrac{\text{number of successes}}{\text{sample size}} = \dfrac{x}{n} = \dfrac{120}{200} = 0.6$
$CI(p)_{1-\alpha} = \hat{p} \pm z_{1-\frac{\alpha}{2}} \sqrt{\dfrac{\hat{p}(1-\hat{p})}{n}}$
$z_{1-\frac{\alpha}{2}} = 1.96$
$CI = 0.6 \pm 1.96 \sqrt{\dfrac{(0.6)(0.4)}{200}}$
$CI = 0.6 \pm 0.07$
$0.53 < p < 0.67$
At a confidence level of 95%, between 53% and 67% of
residents believe the tax rate is too high.
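As a cross-check of this answer, the proportion interval can be sketched in Python. The function name is hypothetical; the z value is obtained from `statistics.NormalDist` in the standard library rather than from the printed table, so it is 1.95996... instead of the rounded 1.96.

```python
import math
from statistics import NormalDist

def proportion_confidence_interval(successes, n, confidence):
    """Large-sample CI for a proportion: p_hat +/- z * sqrt(p_hat * (1 - p_hat) / n)."""
    p_hat = successes / n
    alpha = 1 - confidence
    z = NormalDist().inv_cdf(1 - alpha / 2)   # z_{1 - alpha/2}
    half_width = z * math.sqrt(p_hat * (1 - p_hat) / n)
    return p_hat - half_width, p_hat + half_width

lower, upper = proportion_confidence_interval(120, 200, 0.95)
print(f"{lower:.2f} < p < {upper:.2f}")
```

Rounded to two decimals the limits agree with the hand calculation above.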
70. Exam question 7
2. The time it takes a mechanic to tune an engine in a sample of
20 tune-ups is known to be normally distributed with a sample
mean of 45 minutes and a sample standard deviation of 14
minutes. Develop a 95% confidence interval estimate for the
mean time it will take the mechanic for all engine tune-ups.
Interpret your answer.
$CI(\mu)_{1-\alpha} = \bar{x} \pm t_{n-1;1-\frac{\alpha}{2}} \dfrac{s}{\sqrt{n}}$
$= 45 \pm 2.093 \cdot \dfrac{14}{\sqrt{20}}$
$= 45 \pm 6.55$
$38.45 < \mu < 51.55$
At a confidence level of 95%, the population average time to
complete a tune-up is between 38.45 and 51.55 minutes.
72. STEPS OF A HYPOTHESIS TEST
Step 1 • State the null and alternative hypotheses
Step 2 • State the value of α
Step 3 • Calculate the value of the test statistic
Step 4 • Determine the critical value
Step 5 • Make a decision using decision rule or graph
Step 6 • Draw a conclusion
73. • Hypothesis test for Population Mean, n < 30
– If σ is unknown we use s to estimate σ
– We need to replace the normal distribution with
the t-distribution with (n − 1) degrees of
freedom
Testing H0: μ = μ0 for n < 30
Test statistic: $t = \dfrac{\bar{x} - \mu_0}{s/\sqrt{n}}$
Alternative hypothesis    Decision rule: Reject H0 if
H1: μ ≠ μ0                $|t| \ge t_{n-1;1-\alpha/2}$
H1: μ > μ0                $t \ge t_{n-1;1-\alpha}$
H1: μ < μ0                $t \le -t_{n-1;1-\alpha}$
74. • Hypothesis testing for Population proportion
– Sample proportion: $\hat{p} = \dfrac{\text{number of successes}}{\text{sample size}} = \dfrac{x}{n}$
– Proportion always between 0 and 1
Testing H0: p = p0 for n ≥ 30
Test statistic: $z = \dfrac{\hat{p} - p_0}{\sqrt{\dfrac{p_0(1 - p_0)}{n}}}$
Alternative hypothesis    Decision rule: Reject H0 if
H1: p ≠ p0                $|z| \ge Z_{1-\alpha/2}$
H1: p > p0                $z \ge Z_{1-\alpha}$
H1: p < p0                $z \le -Z_{1-\alpha}$
75. Exam question 8
1. Oliver Tambo airport wants to test the claim that on
average cars remain in the short term car park area longer
than 42.5 minutes. The research team drew a random
sample of 24 cars and found that the average time that
these cars remained in the short term parking area was 40
minutes with a sample standard deviation of 2 minutes.
Test the claim at 10% level of significance and interpret.
2. The Gautrain Authority will add a bus route if more than 55%
of commuters indicate they would use the route. A
sample of 70 commuters revealed that 42 would use a
route from Sandton to Auckland Park. Does this route
meet the Gautrain criteria? Use a 0.05 significance level.
16 MARKS
76. Exam question 8
Procedure
1. State H0 and Ha
2. Determine the critical value from the
appropriate test table using α, and n
3. Compute test statistic (t or z value??)
4. Draw conclusion
77. Exam question 8
Oliver Tambo airport wants to test the claim that on average cars
remain in the short term car park area longer than 42.5 minutes.
The research team drew a random sample of 24 cars and found that
the average time that these cars remained in the short term
parking area was 40 minutes with a sample standard deviation of
2 minutes. Test the claim at 10% level of significance and interpret.
State hypotheses
H0: µ = 42.5
Ha: µ > 42.5
Determine critical value
$t_{n-1;1-\alpha} = t_{23;0.90} = 1.319$
Reject H0 if the test statistic is > 1.319
Calculate test statistic
$t = \dfrac{\bar{x} - \mu_0}{s/\sqrt{n}} = \dfrac{40 - 42.5}{2/\sqrt{24}} = -6.12$
Since −6.12 is not greater than 1.319, do not reject H0: at the 10%
level of significance the sample does not support the claim that
cars stay longer than 42.5 minutes on average.
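The test statistic and decision for this question can be sketched in Python. The function name is hypothetical, and the critical value 1.319 is the table value used in the solution above, entered by hand because the standard library has no t-distribution quantile function.

```python
import math

def one_sample_t_statistic(sample_mean, mu0, s, n):
    """t = (sample_mean - mu0) / (s / sqrt(n))."""
    return (sample_mean - mu0) / (s / math.sqrt(n))

# Car park example: H0: mu = 42.5 against Ha: mu > 42.5 at alpha = 0.10.
t_stat = one_sample_t_statistic(40, 42.5, 2, 24)
t_crit = 1.319                      # t_{23; 0.90}, from tables
reject_h0 = t_stat >= t_crit        # right-tailed decision rule

print(round(t_stat, 2), "reject H0" if reject_h0 else "do not reject H0")
```

A large negative statistic can never fall in a right-tailed rejection region, which is why H0 is not rejected here even though the sample mean differs clearly from 42.5.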
78. Exam question 8
The Gautrain Authority will add a bus route if more than 55% of
commuters indicate they would use the route. A sample of 70
commuters revealed that 42 would use a route from Sandton to
Auckland Park. Does this route meet the Gautrain criteria? Use a
0.05 significance level.
State hypotheses
H0: p = 0.55
Ha: p > 0.55
Determine critical value
α = 0.05, Z = 1.64
Reject H0 if the Z test statistic > 1.64
Calculate test statistic
Sample proportion: $\hat{p} = \dfrac{\text{number of successes}}{\text{sample size}} = \dfrac{x}{n} = \dfrac{42}{70} = 0.6$
$z = \dfrac{\hat{p} - p_0}{\sqrt{\dfrac{p_0(1 - p_0)}{n}}} = \dfrac{0.6 - 0.55}{\sqrt{\dfrac{(0.55)(0.45)}{70}}} = 0.84$
Since 0.84 is not greater than 1.64, do not reject H0.
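The proportion test can be sketched the same way. The function name is hypothetical; the critical value comes from `statistics.NormalDist` in the standard library, which gives about 1.645 rather than the table value 1.64 used above, and this does not change the decision.

```python
import math
from statistics import NormalDist

def proportion_z_statistic(successes, n, p0):
    """z = (p_hat - p0) / sqrt(p0 * (1 - p0) / n)."""
    p_hat = successes / n
    return (p_hat - p0) / math.sqrt(p0 * (1 - p0) / n)

# Bus route example: H0: p = 0.55 against Ha: p > 0.55 at alpha = 0.05.
alpha = 0.05
z_stat = proportion_z_statistic(42, 70, 0.55)
z_crit = NormalDist().inv_cdf(1 - alpha)   # Z_{1 - alpha}
reject_h0 = z_stat >= z_crit               # right-tailed decision rule

print(round(z_stat, 2), round(z_crit, 3), "reject H0" if reject_h0 else "do not reject H0")
```

Not rejecting H0 means the sample does not show that more than 55% of commuters would use the route.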
80. Coefficient of correlation
• The coefficient of correlation is used to measure the
strength of association between two variables.
• The coefficient values range between -1 and 1.
– If r = -1 (negative association) or r = +1 (positive
association) every point falls on the regression
line.
– If r = 0 there is no linear pattern.
• The coefficient can be used to test for a linear
relationship between two variables.
81. [Figure: six scatter plots of Y against X illustrating different correlations: perfect positive (r = +1), high positive (r = +0,9), low positive (r = +0,3), perfect negative (r = −1), high negative (r = −0,8) and no correlation (r = 0).]
82. Exam question 10
The cost of repairing cars that were involved in accidents is one reason
that insurance premiums are so high. In an experiment 5 cars were
driven into a wall. The speeds (X) were varied between 20 km/h and
80 km/h. The costs of repair (Y) were estimated and are listed below:
SPEED (km/h) (X)    COST OF REPAIR (R'000) (Y)
20                  3
30                  5
40                  8
60                  24
80                  34
1. Use calculator to calculate coefficient of correlation. Interpret your
answer
2. Calculate and interpret the coefficient of determination for this
data
3. Use your calculator to construct regression line equation and
predict repair cost at 50km/h
10 MARKS
83. Exam question 10
1. Put data into calculator
2. Select regression function and select r
3. Calculate coefficient of determination
= r² × 100%
4. Interpret results
5. Using Y = A + BX select regression function on
calculator and determine values for A & B
6. Put x = 50 into formula and calculate result
84. Exam question 10
1. r = 0.98
There is a very strong positive linear relationship between the
repair cost and speed.
2. r² × 100% = 0.98² × 100 = 96%
96% of the variation in the cost of repair is
explained by the variation in the speed at which
the car crashed.
3. Y = −10.7 + 0.55X
For X = 50: Y = −10.7 + 0.55(50) = 16.8, i.e. a predicted
repair cost of about R16 800.
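The calculator steps can be mirrored in Python using the usual sums of squares. This is an illustrative sketch with hypothetical variable names; it uses the least-squares formulas directly rather than any particular calculator's regression mode.

```python
import math

speed = [20, 30, 40, 60, 80]   # X, km/h
cost = [3, 5, 8, 24, 34]       # Y, R'000

n = len(speed)
mean_x = sum(speed) / n
mean_y = sum(cost) / n

# Sums of squares and cross-products about the means.
sxx = sum((x - mean_x) ** 2 for x in speed)
syy = sum((y - mean_y) ** 2 for y in cost)
sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(speed, cost))

r = sxy / math.sqrt(sxx * syy)   # coefficient of correlation
b = sxy / sxx                    # slope B
a = mean_y - b * mean_x          # intercept A
prediction = a + b * 50          # predicted repair cost at 50 km/h

print(f"r = {r:.2f}, r^2 = {100 * r * r:.0f}%")
print(f"Y = {a:.1f} + {b:.2f}X, predicted cost at 50 km/h = {prediction:.1f}")
```

Working with unrounded values gives r² of about 97% and a prediction of about 17.0, slightly different from the 96% and 16.8 obtained above from the rounded r and rounded coefficients.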