Why Study Statistics Arunesh Chand Mankotia 2004

Dealing with Uncertainty

Everyday decisions are based on
incomplete information


The price of L&T stock will be higher
in six months than it is now.

versus

The price of L&T stock is likely
to be higher in six months than it
is now.

If the union budget deficit is as high as
predicted, interest rates will remain high
for the rest of the year.
versus
If the union budget deficit is as high
as predicted, it is probable that
interest rates will remain high for
the rest of the year.

Statistical Thinking
Statistical thinking is a philosophy of learning
and action based on the following fundamental
principles:
• All work occurs in a system of interconnected processes;
• Variation exists in all processes; and
• Understanding and reducing variation are the keys to success.


Systems and Processes
A system is a number of components that
are logically and sometimes physically
linked together for some purpose.


Systems and Processes
A process is a set of activities operating on a system
that transforms inputs to outputs. A business process is
a group of logically related tasks and activities that,
when performed, uses the resources of the business
to provide the definitive results required to achieve the
business objectives.

Making Decisions

Data, Information, Knowledge
• Data: specific observations of measured numbers.
• Information: processed and summarized data
yielding facts and ideas.
• Knowledge: selected and organized information
that provides understanding, recommendations, and
the basis for decisions.

Making Decisions

Descriptive and Inferential Statistics
Descriptive Statistics include graphical and
numerical procedures that summarize and
process data and are used to transform data
into information.

Making Decisions

Descriptive and Inferential Statistics
Inferential Statistics provide the bases for
predictions, forecasts, and estimates that are
used to transform information to knowledge.

The Journey to Making Decisions

Begin Here: Identify the Problem
→ Data
→ Information (via Descriptive Statistics, Probability, Computers)
→ Knowledge (via Experience, Theory, Literature, Inferential Statistics, Computers)
→ Decision

Summarizing and Describing
Data

• Tables and Graphs
• Numerical Measures

Classification of Variables

• Discrete numerical variable
• Continuous numerical variable
• Categorical variable


Discrete Numerical Variable
A variable that produces a response that
comes from a counting process.


Continuous Numerical Variable
A variable that produces a response that is
the outcome of a measurement process.


Categorical Variables
Variables that produce responses that
belong to groups (sometimes called
“classes”) or categories.

Measurement Levels

Nominal and Ordinal Levels of Measurement
refer to data obtained from categorical
questions.
• A nominal scale indicates assignments to
groups or classes.
• Ordinal data indicate rank ordering of items.

Frequency Distributions

A frequency distribution is a table used to organize data.
The left column (called classes or groups) includes
numerical intervals on a variable being studied. The
right column is a list of the frequencies, or number of
observations, for each class. Intervals are normally of
equal size, must cover the range of the sample
observations, and be non-overlapping.

Construction of a Frequency
Distribution

• Rule 1: Intervals (classes) must be inclusive and non-overlapping;
• Rule 2: Determine k, the number of classes;
• Rule 3: Intervals should be the same width, w; the width
is determined by the following:

$$w = \text{Interval Width} = \frac{\text{Largest Number} - \text{Smallest Number}}{\text{Number of Intervals}}$$

Both k and w should be rounded upward, possibly to the next largest integer.
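The three rules can be sketched in a few lines of Python. This is a minimal illustration, not a definitive procedure: the bottle weights, the starting point of 220 and the helper names `class_width` and `frequency_distribution` are all assumptions made for the example, and the starting point is chosen so that every observation falls inside one of the k classes.

```python
import math

def class_width(values, k):
    """Interval width w = (largest - smallest) / k, rounded up to the next integer."""
    return math.ceil((max(values) - min(values)) / k)

def frequency_distribution(values, start, width, k):
    """Count observations in k equal, non-overlapping classes [lower, lower + width)."""
    counts = [0] * k
    for v in values:
        index = int((v - start) // width)   # class that contains v
        counts[index] += 1
    return [(start + i * width, start + (i + 1) * width, counts[i]) for i in range(k)]

# Hypothetical bottle weights in mL (illustrative values, not taken from the slides).
weights = [221, 226, 228, 231, 233, 234, 236, 238, 239, 241, 243, 247]
k = 6                                   # fewer than 50 observations: 5 to 6 classes
w = class_width(weights, k)             # (247 - 221) / 6 = 4.33..., rounded up
table = frequency_distribution(weights, start=220, width=w, k=k)
for lower, upper, count in table:
    print(f"{lower} to less than {upper}: {count}")
```

Rounding the width upward guarantees that k classes of width w cover the whole range of the sample.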

Construction of a Frequency
Distribution

Quick Guide to Number of Classes for a Frequency Distribution

Sample Size      Number of Classes
Fewer than 50    5 – 6 classes
50 to 100        6 – 8 classes
Over 100         8 – 10 classes

Example of a Frequency Distribution

A Frequency Distribution for the Suntan Lotion Example

Weights (in mL)           Number of Bottles
220 to less than 225      1
225 to less than 230      4
230 to less than 235      29
235 to less than 240      34
240 to less than 245      26
245 to less than 250      6

Cumulative Frequency
Distributions

A cumulative frequency distribution contains the
number of observations whose values are less than the
upper limit of each interval. It is constructed by
adding the frequencies of all frequency distribution
intervals up to and including the present interval.

Relative Cumulative Frequency
Distributions

A relative cumulative frequency distribution
converts all cumulative frequencies to
cumulative percentages.

Example of a Frequency Distribution

A Cumulative Frequency Distribution for the Suntan Lotion Example

Weights (in mL)     Cumulative Number of Bottles
less than 225       1
less than 230       5
less than 235       34
less than 240       68
less than 245       94
less than 250       100
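As a short sketch, the cumulative and relative cumulative columns can be produced by a running sum. The class frequencies used here are those implied by differencing the cumulative counts in the table above; the variable names are only illustrative.

```python
from itertools import accumulate

upper_limits = [225, 230, 235, 240, 245, 250]
# Class frequencies implied by the cumulative table (for example 68 - 34 = 34).
frequencies = [1, 4, 29, 34, 26, 6]

# Add the frequencies of all intervals up to and including the present one.
cumulative = list(accumulate(frequencies))
total = cumulative[-1]

# Convert each cumulative frequency to a cumulative percentage.
relative_cumulative = [100 * c / total for c in cumulative]

for limit, c, pct in zip(upper_limits, cumulative, relative_cumulative):
    print(f"less than {limit}: {c} bottles ({pct:.0f}%)")
```

Because the sample contains exactly 100 bottles, the cumulative counts and cumulative percentages coincide numerically in this example.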

Histograms and Ogives

A histogram is a bar graph that consists of vertical bars
constructed on a horizontal line that is marked off with
intervals for the variable being displayed. The
intervals correspond to those in a frequency
distribution table. The height of each bar is
proportional to the number of observations in that
interval.

Histograms and Ogives

An ogive, sometimes called a cumulative line graph, is
a line that connects points that are the cumulative
percentage of observations below the upper limit of
each class in a cumulative frequency distribution.

Histogram and Ogive for Example 1

[Figure: Histogram of Weights with ogive. Horizontal axis: Interval Weights (mL), marked at 224.5, 229.5, 234.5, 239.5, 244.5 and 249.5; left vertical axis: Frequency, 0 to 40; right vertical axis: 0 to 100.]

Stem-and-Leaf Display

A stem-and-leaf display is an exploratory data analysis
graph that is an alternative to the histogram. Data are
grouped according to their leading digits (called the stem)
while listing the final digits (called leaves) separately for
each member of a class. The leaves are displayed
individually in ascending order after each of the stems.



Stem unit: 10

9 1 124678899
(9) 2 122246899
5 3 01234
2 4 02
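The grouping step can be sketched as follows, assuming the display above is read with a stem unit of 10 and a leaf unit of 1, so that it encodes the 25 observations listed in `data`. The function name `stem_and_leaf` is hypothetical.

```python
from collections import defaultdict

def stem_and_leaf(values, stem_unit=10):
    """Group values by leading digit (stem) and list final digits (leaves) in ascending order."""
    rows = defaultdict(list)
    for v in sorted(values):
        stem, leaf = divmod(v, stem_unit)
        rows[stem].append(leaf)
    return {stem: "".join(str(leaf) for leaf in leaves)
            for stem, leaves in sorted(rows.items())}

# The 25 observations encoded by the display (assumed stem unit 10, leaf unit 1).
data = [11, 12, 14, 16, 17, 18, 18, 19, 19,
        21, 22, 22, 22, 24, 26, 28, 29, 29,
        30, 31, 32, 33, 34, 40, 42]

display = stem_and_leaf(data)
for stem, leaves in display.items():
    print(f"{stem} | {leaves}")
```

Unlike a histogram, the display keeps every individual observation recoverable from the graph.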

Tables
- Bar and Pie Charts -
Frequency and Relative Frequency Distribution for
Top Company Employers Example

Industry          Number of Employees    Relative Frequency
Tourism           85,287                 0.35
Retail            49,424                 0.20
Health Care       39,588                 0.16
Restaurants       16,050                 0.06
Communications    11,750                 0.05
Technology        11,144                 0.05
Space             11,418                 0.05
Other             21,336                 0.08

Tables

Bar Chart for Top Company Employers Example

[Figure: bar chart, "1999 Top Company Employers in Central Florida". Horizontal axis: Industry Category (Tourism, Retail, Health Care, Restaurants, Communications, Technology, Space, Other); bar labels: 0.35, 0.2, 0.16, 0.06, 0.05, 0.05, 0.05, 0.08.]

Tables

Pie Chart for Top Company Employers Example

[Figure: pie chart, "1999 Top Company Employers in Central Florida". Slices: Tourism 35%, Retail 20%, Health Care 16%, Others 29%.]

Pareto Diagrams

A Pareto diagram is a bar chart that displays the
frequency of defect causes. The bar at the left indicates
the most frequent cause and bars to the right indicate
causes in decreasing frequency. A Pareto diagram is used
to separate the “vital few” from the “trivial many.”
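The ordering behind a Pareto diagram can be sketched as below. The defect causes and counts are hypothetical, and the cumulative percentage column is an addition that is often drawn alongside the bars rather than something the definition above requires.

```python
from itertools import accumulate

# Hypothetical defect counts by cause (illustrative only).
defects = {"Scratches": 12, "Misalignment": 45, "Wrong label": 8, "Leaks": 30, "Other": 5}

# Order causes from most to least frequent, as the bars are ordered left to right.
ordered = sorted(defects.items(), key=lambda item: item[1], reverse=True)
total = sum(defects.values())

# Running share of all defects explained by the causes so far.
cumulative_pct = [100 * c / total for c in accumulate(count for _, count in ordered)]

for (cause, count), pct in zip(ordered, cumulative_pct):
    print(f"{cause:<13}{count:>4}{pct:>8.1f}%")
```

In this made-up data the first two causes account for three quarters of all defects, which is the "vital few" pattern the diagram is meant to expose.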

Line Charts

A line chart, also called a time plot, is a series of data plotted
at various time intervals. Measuring time along the horizontal
axis and the numerical quantity of interest along the vertical
axis yields a point on the graph for each observation. Joining
points adjacent in time by straight lines produces a time plot.

Line Charts
[Figure: line chart, "Growth Trends in Internet Use by Age, 1997 to 1999". Horizontal axis: April 1997 to July 1999; vertical axis: Millions of Adults, 0 to 35; series: Age 18 to 29, Age 30 to 49, Age 50+.]

Parameters and Statistics

A statistic is a descriptive measure computed from a
sample of data. A parameter is a descriptive
measure computed from an entire population of
data.

Measures of Central Tendency
- Arithmetic Mean -

The arithmetic mean of a set of data is the
sum of the data values divided by the
number of observations.

Sample Mean

If the data set is from a sample, then the sample
mean, $\bar{X}$, is:

$$\bar{X} = \frac{\sum_{i=1}^{n} x_i}{n} = \frac{x_1 + x_2 + \cdots + x_n}{n}$$

Population Mean

If the data set is from a population, then the
population mean, $\mu$, is:

$$\mu = \frac{\sum_{i=1}^{N} x_i}{N} = \frac{x_1 + x_2 + \cdots + x_N}{N}$$

- Median -
An ordered array is an arrangement of data in either
ascending or descending order. Once the data are
arranged in ascending order, the median is the value such
that 50% of the observations are smaller and 50% of the
observations are larger.

If the sample size n is an odd number, the median,
Xm, is the middle observation. If the sample size n
is an even number, the median, Xm, is the average
of the two middle observations. The median will
be located in the 0.50(n+1)th ordered position.

- Mode -

The mode, if one exists, is the most
frequently occurring observation in the
sample or population.
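A minimal sketch of the three measures of central tendency, using Python's standard `statistics` module on a hypothetical sample of seven observations:

```python
import statistics

# Hypothetical sample of n = 7 observations (illustrative only).
sample = [3, 5, 5, 6, 8, 9, 13]

mean = sum(sample) / len(sample)        # sum of the data values divided by n
ordered = sorted(sample)                # ordered array, ascending
position = 0.50 * (len(ordered) + 1)    # 0.50(n+1)th ordered position
median = statistics.median(sample)      # middle value (n odd) or average of two middle values (n even)
mode = statistics.mode(sample)          # most frequently occurring observation

print(mean, position, median, mode)
```

With n odd, the 0.50(n+1)th position is a whole number and points directly at the median in the ordered array.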

Shape of the Distribution

The shape of the distribution is said to be
symmetric if the observations are balanced,
or evenly distributed, about the mean. In a
symmetric distribution the mean and median
are equal.

Shape of the Distribution

A distribution is skewed if the observations are not
symmetrically distributed above and below the mean.
A positively skewed (or skewed to the right)
distribution has a tail that extends to the right in the
direction of positive values. A negatively skewed (or
skewed to the left) distribution has a tail that extends
to the left in the direction of negative values.

Shapes of the Distribution
[Figure: three frequency histograms over the values 1 to 9: Symmetric Distribution, Positively Skewed Distribution, Negatively Skewed Distribution. Vertical axis: Frequency.]

- Geometric Mean -

The Geometric Mean is the nth root of the product of n
numbers:

$$\bar{X}_g = \sqrt[n]{x_1 \cdot x_2 \cdots x_n} = (x_1 \cdot x_2 \cdots x_n)^{1/n}$$

The Geometric Mean is used to obtain mean growth over
several periods given compounded growth from each
period.
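As an illustration of mean growth under compounding, the sketch below applies the formula to three hypothetical growth factors (1 plus the growth rate in each period). The figures are invented for the example.

```python
import math

# Hypothetical growth factors for three periods: +10%, +20%, -5% (illustrative only).
factors = [1.10, 1.20, 0.95]

# nth root of the product of the n growth factors.
geometric_mean = math.prod(factors) ** (1 / len(factors))
mean_growth_rate = geometric_mean - 1

# Compounding the mean rate over n periods reproduces the total growth.
assert math.isclose((1 + mean_growth_rate) ** len(factors), math.prod(factors))
print(f"mean growth per period: {mean_growth_rate:.4f}")
```

The arithmetic mean of the three rates would overstate the per-period growth, because it ignores compounding.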

Measures of Variability
- The Range -

The range of a set of data is the
difference between the largest and
smallest observations.

- Sample Variance -

The sample variance, $s^2$, is the sum of the squared
differences between each observation and the sample
mean divided by the sample size minus 1.

$$s^2 = \frac{\sum_{i=1}^{n} (x_i - \bar{X})^2}{n - 1}$$

- Short-cut Formulas for Sample
Variance -

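Short-cut formulas for the sample variance are:

$$s^2 = \frac{\sum_{i=1}^{n} x_i^2 - \frac{\left(\sum_{i=1}^{n} x_i\right)^2}{n}}{n - 1} \quad \text{or} \quad s^2 = \frac{\sum_{i=1}^{n} x_i^2 - n\bar{X}^2}{n - 1}$$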
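A quick numerical check, on a hypothetical sample, that the definition of the sample variance and the two short-cut formulas give the same result:

```python
import math
import statistics

# Hypothetical sample (illustrative only).
x = [4, 7, 8, 10, 11]
n = len(x)
mean = sum(x) / n
sum_of_squares = sum(xi ** 2 for xi in x)

# Definition: squared deviations from the sample mean, divided by n - 1.
definition = sum((xi - mean) ** 2 for xi in x) / (n - 1)

# Short-cut formulas: only the sum and the sum of squares are needed.
shortcut_1 = (sum_of_squares - sum(x) ** 2 / n) / (n - 1)
shortcut_2 = (sum_of_squares - n * mean ** 2) / (n - 1)

print(definition, shortcut_1, shortcut_2, statistics.variance(x))
```

The short-cut forms avoid computing each deviation, which was convenient for hand calculation; with floating-point arithmetic on large values the definitional form is usually the numerically safer choice.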

- Population Variance -

The population variance, $\sigma^2$, is the sum of the squared
differences between each observation and the population
mean divided by the population size, N.

$$\sigma^2 = \frac{\sum_{i=1}^{N} (x_i - \mu)^2}{N}$$

- Sample Standard Deviation -

The sample standard deviation, s, is the positive square
root of the variance, and is defined as:

$$s = \sqrt{s^2} = \sqrt{\frac{\sum_{i=1}^{n} (x_i - \bar{X})^2}{n - 1}}$$

- Population Standard Deviation-

The population standard deviation, $\sigma$, is

$$\sigma = \sqrt{\sigma^2} = \sqrt{\frac{\sum_{i=1}^{N} (x_i - \mu)^2}{N}}$$
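The only difference between the two standard deviations is the divisor. A small sketch with the standard `statistics` module, treating the same hypothetical values first as a whole population and then as a sample:

```python
import math
import statistics

# Hypothetical observations (illustrative only).
data = [4, 7, 8, 10, 11]

sigma = statistics.pstdev(data)   # population formula: divides by N
s = statistics.stdev(data)        # sample formula: divides by n - 1

print(f"population sd = {sigma:.4f}, sample sd = {s:.4f}")
```

For the same values the sample standard deviation is always the larger of the two, since it divides the same sum of squares by a smaller number.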

The Empirical Rule
(the 68%, 95%, or almost all rule)

For a set of data with a mound-shaped histogram, the Empirical
Rule is:

• approximately 68% of the observations are within one
standard deviation of the mean, µ ± 1σ;
• approximately 95% of the observations are within two
standard deviations of the mean, µ ± 2σ;
• almost all of the observations are within three
standard deviations of the mean, µ ± 3σ.
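The percentages in the rule can be checked against an exactly normal distribution, which is the idealized mound shape. This is a sketch under that assumption; real data will only match approximately.

```python
from statistics import NormalDist

# Assumption: a perfectly normal (bell-shaped) distribution with mean 0 and sd 1.
standard_normal = NormalDist(mu=0, sigma=1)

# Share of the distribution lying within k standard deviations of the mean.
coverage = {k: standard_normal.cdf(k) - standard_normal.cdf(-k) for k in (1, 2, 3)}

for k, share in coverage.items():
    print(f"within {k} standard deviation(s): {share:.4f}")
```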

Coefficient of Variation

The Coefficient of Variation, CV, is a measure of relative
dispersion that expresses the standard deviation as a
percentage of the mean (provided the mean is positive).
The sample coefficient of variation is

$$CV = \frac{s}{\bar{X}} \times 100 \quad \text{if } \bar{X} > 0$$

The population coefficient of variation is

$$CV = \frac{\sigma}{\mu} \times 100 \quad \text{if } \mu > 0$$
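A brief sketch of why relative dispersion matters: the two hypothetical price series below have the same standard deviation but very different means. The function name and the data are assumptions made for illustration.

```python
import statistics

def coefficient_of_variation(sample):
    """Sample standard deviation as a percentage of the sample mean (mean must be positive)."""
    mean = statistics.mean(sample)
    if mean <= 0:
        raise ValueError("the coefficient of variation requires a positive mean")
    return statistics.stdev(sample) / mean * 100

# Hypothetical prices (illustrative only): both series have s = 2.
stock_a = [48, 50, 52]   # mean 50
stock_b = [8, 10, 12]    # mean 10

cv_a = coefficient_of_variation(stock_a)
cv_b = coefficient_of_variation(stock_b)
print(f"CV A = {cv_a:.1f}%, CV B = {cv_b:.1f}%")
```

Although both series vary by the same absolute amount, the second is five times as variable relative to its level.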

Percentiles and Quartiles

Data must first be in ascending order. Percentiles
separate large ordered data sets into 100ths. The Pth
percentile is a number such that P percent of all the
observations are at or below that number.
Quartiles are descriptive measures that separate large
ordered data sets into four quarters.


The first quartile, Q1, is another name for the 25th
percentile. The first quartile divides the ordered data
such that 25% of the observations are at or below this
value. Q1 is located in the 0.25(n+1)th position when
the data are in ascending order. That is,

$$Q_1 = \text{the value in the } \frac{n + 1}{4}\text{th ordered position}$$


The third quartile, Q3, is another name for the 75th
percentile. The third quartile divides the ordered
data such that 75% of the observations are at or
below this value. Q3 is located in the 0.75(n+1)th
position when the data are in ascending order. That
is,

$$Q_3 = \text{the value in the } \frac{3(n + 1)}{4}\text{th ordered position}$$

Interquartile Range

The Interquartile Range (IQR) measures the spread
in the middle 50% of the data; that is the difference
between the observations at the 25th and the 75th
percentiles:

IQR = Q3 − Q1
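The position rules for Q1 and Q3 can be sketched as follows. The helper `value_at_position` is hypothetical, and its linear interpolation for fractional positions is an assumption, since the text above only states where the quartile is located. It expects a position between 1 and n.

```python
def value_at_position(ordered, position):
    """Value at a 1-based ordered position; interpolates linearly if the position is fractional."""
    lower = int(position)              # whole part of the position
    fraction = position - lower
    if fraction == 0:
        return ordered[lower - 1]
    return ordered[lower - 1] + fraction * (ordered[lower] - ordered[lower - 1])

# Hypothetical sample with n = 11 (illustrative only), sorted into ascending order first.
data = sorted([35, 12, 18, 25, 29, 14, 40, 21, 31, 16, 27])
n = len(data)

q1 = value_at_position(data, 0.25 * (n + 1))   # 3rd ordered position
q3 = value_at_position(data, 0.75 * (n + 1))   # 9th ordered position
iqr = q3 - q1

print(q1, q3, iqr)
```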

Five-Number Summary

The Five-Number Summary refers to the five
descriptive measures: minimum, first quartile,
median, third quartile, and the maximum.
$$X_{\text{minimum}} < Q_1 < \text{Median} < Q_3 < X_{\text{maximum}}$$
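A compact sketch of the five-number summary using the standard library. `statistics.quantiles` with `method="exclusive"` locates quartiles at multiples of (n+1)/4, which matches the position rule used in these slides; the data and the function name are hypothetical.

```python
import statistics

def five_number_summary(values):
    """Return (minimum, Q1, median, Q3, maximum) using (n+1)-based quartile positions."""
    q1, median, q3 = statistics.quantiles(values, n=4, method="exclusive")
    return min(values), q1, median, q3, max(values)

# Hypothetical sample with n = 11 (illustrative only).
data = [35, 12, 18, 25, 29, 14, 40, 21, 31, 16, 27]
summary = five_number_summary(data)
print(summary)
```

These five values are exactly what a box-and-whisker plot draws: the box from Q1 to Q3, a line at the median, and whiskers out to the minimum and maximum.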

Box-and-Whisker Plots

A Box-and-Whisker Plot is a graphical procedure that
uses the Five-Number Summary.
A Box-and-Whisker Plot consists of
• an inner box that shows the numbers which span the
range from Q1 to Q3.
• a line drawn through the box at the median.
The “whiskers” are lines drawn from Q1 to the minimum
value, and from Q3 to the maximum value.

Box-and-Whisker Plots (Excel)

[Figure: box-and-whisker plot produced in Excel; vertical axis from 10 to 45.]

Grouped Data Mean
For a population of N observations the mean is

$$\mu = \frac{\sum_{i=1}^{K} f_i m_i}{N}$$

For a sample of n observations, the mean is

$$\bar{X} = \frac{\sum_{i=1}^{K} f_i m_i}{n}$$

where the data set contains observation values $m_1, m_2, \ldots, m_K$ occurring with
frequencies $f_1, f_2, \ldots, f_K$, respectively.

Grouped Data Variance
For a population of N observations the variance is

$$\sigma^2 = \frac{\sum_{i=1}^{K} f_i (m_i - \mu)^2}{N} = \frac{\sum_{i=1}^{K} f_i m_i^2}{N} - \mu^2$$

For a sample of n observations, the variance is

$$s^2 = \frac{\sum_{i=1}^{K} f_i (m_i - \bar{X})^2}{n - 1} = \frac{\sum_{i=1}^{K} f_i m_i^2 - n\bar{X}^2}{n - 1}$$

where the data set contains observation values $m_1, m_2, \ldots, m_K$ occurring with
frequencies $f_1, f_2, \ldots, f_K$, respectively.
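As a worked sketch, the grouped-data formulas can be applied to the suntan lotion frequency distribution from earlier in the deck, treated as a sample of n = 100 bottles. Two assumptions are made here: the values m_i are taken to be the class midpoints, and the class frequencies are those implied by the cumulative frequency table.

```python
import math

# Class midpoints (assumed to play the role of m_i) and class frequencies f_i.
midpoints = [222.5, 227.5, 232.5, 237.5, 242.5, 247.5]
frequencies = [1, 4, 29, 34, 26, 6]
n = sum(frequencies)

# Grouped sample mean: sum of f_i * m_i divided by n.
mean = sum(f * m for f, m in zip(frequencies, midpoints)) / n

# Grouped sample variance, by the definition and by the short-cut form.
var_definition = sum(f * (m - mean) ** 2 for f, m in zip(frequencies, midpoints)) / (n - 1)
var_shortcut = (sum(f * m ** 2 for f, m in zip(frequencies, midpoints)) - n * mean ** 2) / (n - 1)

print(f"mean = {mean:.1f} mL, variance = {var_definition:.2f}, sd = {math.sqrt(var_definition):.2f} mL")
```

Because every bottle in a class is represented by one value, these results approximate the mean and variance of the raw weights rather than reproduce them exactly.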
