This document discusses various aspects of data distributions including their shape, modality, symmetry, and skewness. It provides definitions and examples of key terms such as:
- Modality, which refers to the number of peaks in a distribution. Unimodal distributions have one peak while multimodal distributions have two or more.
- Symmetry, which means a distribution could be split down the middle to form mirror images. Asymmetric or skewed distributions have an off-center peak with a tail on one side.
- Skewness, which is assessed using measures like Pearson's coefficient and Fisher's measure that quantify the degree of asymmetry. Positive skewness indicates a right tail, while negative skewness indicates a left tail.
2. Shapes of Distribution
A third important property of data, after location and dispersion, is its shape.
Distributions of quantitative variables can be described in terms of a number of features, many of which are related to the distributions' physical appearance or shape when presented graphically:
Modality
Symmetry and skewness
Degree of skewness
Kurtosis
3. Modality
The modality of a distribution concerns how many peaks or high points it has.
A distribution with a single peak (one value with a high frequency) is a unimodal distribution.
4. Modality
A distribution with two or more peaks is called a multimodal distribution.
5. Symmetry and Skewness
A distribution is symmetric if it could be split down the middle to form two halves that are mirror images of one another.
In asymmetric distributions, the peak is off center, with a bulk of scores clustering at one end and a tail trailing off at the other end. Such distributions are often described as skewed.
When the longer tail trails off to the right, this is a positively skewed distribution, e.g. annual income.
When the longer tail trails off to the left, this is called a negatively skewed distribution, e.g. age at death.
6. Symmetry and Skewness
Shape can be described by degree of asymmetry (i.e., skewness):
mean > median: positive or right skewness
mean = median: symmetric or zero skewness
mean < median: negative or left skewness
Positive skewness can arise when the mean is increased by some unusually high values.
Negative skewness can arise when the mean is decreased by some unusually low values.
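To see the mean-median rule concretely, here is a minimal NumPy sketch; the lognormal sample is a hypothetical stand-in for a right-skewed variable such as annual income:

```python
import numpy as np

rng = np.random.default_rng(42)
income = rng.lognormal(mean=10, sigma=0.8, size=10_000)  # right-skewed stand-in

# In a right-skewed sample the long upper tail pulls the mean above the median.
print(np.mean(income) > np.median(income))  # True
```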
8. Shapes of the Distribution
Three common shapes of frequency distributions:
A: Symmetrical and bell-shaped
B: Positively skewed (skewed to the right)
C: Negatively skewed (skewed to the left)
9. Shapes of the Distribution
Three less common shapes of frequency distributions:
A: Bimodal
B: Reverse J-shaped
C: Uniform
11. Degree of Skewness
A skewness index can readily be calculated by most statistical computer programs in conjunction with frequency distributions.
The index has a value of 0 for a perfectly symmetric distribution, a positive value if there is a positive skew, and a negative value if there is a negative skew.
A skewness index that is more than twice the value of its standard error can be interpreted as a departure from symmetry.
12. Measures of Skewness or Symmetry
Pearson's skewness coefficient
It is nonalgebraic and easily calculated, and is useful for quick estimates of symmetry.
It is defined as:
skewness = (mean − median) / SD
Fisher's measure of skewness
It is based on deviations from the mean to the third power.
13. Pearson's skewness coefficient
For a perfectly symmetrical distribution, the mean will equal the median, and the skewness coefficient will be zero. If the distribution is positively skewed, the mean will be greater than the median and the coefficient will be positive. If the coefficient is negative, the distribution is negatively skewed and the mean will be less than the median.
Skewness values will fall between -1 and +1 SD units. Values falling outside this range indicate a substantially skewed distribution.
Hildebrand (1986) states that skewness values above 0.2 or below -0.2 indicate severe skewness.
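As a quick illustration (a sketch, not SPSS output), Pearson's coefficient as defined on the previous slide can be computed directly:

```python
import numpy as np

def pearson_skewness(x):
    # Pearson's coefficient as defined above: (mean - median) / SD
    x = np.asarray(x, dtype=float)
    return (x.mean() - np.median(x)) / x.std(ddof=1)

print(pearson_skewness([1, 2, 2, 3, 3, 3, 4, 20]))  # positive for right-skewed data
```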
14. Assumption of Normality
Many of the statistical methods that we will
apply require the assumption that a variable or
variables are normally distributed.
With multivariate statistics, the assumption is
that the combination of variables follows a
multivariate normal distribution.
Since there is not a direct test for multivariate
normality, we generally test each variable
individually and assume that they are
multivariate normal if they are individually
normal, though this is not necessarily the case.
15. Evaluating normality
There are both graphical and statistical methods
for evaluating normality.
Graphical methods include the histogram and
normality plot.
Statistical methods include diagnostic hypothesis
tests for normality, and a rule of thumb that says
a variable is reasonably close to normal if its
skewness and kurtosis have values between –1.0
and +1.0.
None of the methods is absolutely definitive.
16. Transformations
When a variable is not normally distributed, we
can create a transformed variable and test it for
normality. If the transformed variable is normally
distributed, we can substitute it in our analysis.
Three common transformations are: the
logarithmic transformation, the square root
transformation, and the inverse transformation.
All of these change the measuring scale on the
horizontal axis of a histogram to produce a
transformed variable that is mathematically
equivalent to the original variable.
17. Types of Data Transformations
For moderate skewness, use a square root transformation.
For substantial skewness, use a log transformation.
For severe skewness, use an inverse transformation.
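A minimal sketch of the three transformations in NumPy; the data vector is hypothetical, and the +1 shifts guard against log(0) and division by zero:

```python
import numpy as np

x = np.array([1.2, 3.5, 8.0, 22.0, 95.0])  # hypothetical right-skewed values

sqrt_x = np.sqrt(x)      # moderate skewness: square root transformation
log_x = np.log10(x + 1)  # substantial skewness: log transformation
inv_x = 1.0 / (x + 1)    # severe skewness: inverse transformation
print(sqrt_x, log_x, inv_x, sep="\n")
```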
18. Computing “Explore” descriptive
statistics
To compute the statistics
needed for evaluating the
normality of a variable, select
the Explore… command from
the Descriptive Statistics
menu.
19. Adding the variable to be evaluated
First, click on the variable to be included in the analysis to highlight it.
Second, click on the right arrow button to move the highlighted variable to the Dependent List.
20. Selecting statistics to be computed
To select the statistics for the
output, click on the
Statistics… command button.
21. Including descriptive statistics
First, click on the
Descriptives checkbox
to select it. Clear the
other checkboxes.
Second, click on the
Continue button to
complete the request for
statistics.
22. Selecting charts for the output
To select the diagnostic charts
for the output, click on the
Plots… command button.
23. Including diagnostic plots and statistics
First, click on the None option button on the Boxplots panel, since boxplots are not as helpful as other charts in assessing normality.
Second, click on the Normality plots with tests checkbox to include normality plots and the hypothesis tests for normality.
Third, click on the Histogram checkbox to include a histogram in the output. You may want to examine the stem-and-leaf plot as well, though I find it less useful.
Finally, click on the Continue button to complete the request.
24. Completing the specifications for the
analysis
Click on the OK button to
complete the specifications
for the analysis and request
SPSS to produce the
output.
25. The histogram
An initial impression of the normality of the distribution can be gained by examining the histogram. In this example, the histogram shows a substantial violation of normality caused by an extremely large value in the distribution.
[Figure: histogram of TOTAL TIME SPENT ON THE INTERNET, frequency on the vertical axis; Std. Dev = 15.35, Mean = 10.7, N = 93]
26. The normality plot
The problem with the normality of this variable's distribution is reinforced by the normality plot. If the variable were normally distributed, the red dots would fit the green line very closely. In this case, the red points in the upper right of the chart indicate the severe skewing caused by the extremely large data values.
[Figure: Normal Q-Q Plot of TOTAL TIME SPENT ON THE INTERNET, observed value vs. expected normal]
27. The test of normality
Tests of Normality (Lilliefors Significance Correction) for TOTAL TIME SPENT ON THE INTERNET:
Kolmogorov-Smirnov: Statistic .246, df 93, Sig. .000
Shapiro-Wilk: Statistic .606, df 93, Sig. .000
Problem 1 asks about the results of the test of normality. Since the sample size is larger than 50, we use the Kolmogorov-Smirnov test. If the sample size were 50 or less, we would use the Shapiro-Wilk statistic instead.
The null hypothesis for the test of normality states that the actual distribution of the variable is equal to the expected distribution, i.e., the variable is normally distributed. Since the probability associated with the test of normality (< 0.001) is less than or equal to the level of significance (0.01), we reject the null hypothesis and conclude that total hours spent on the Internet is not normally distributed. (Note: we report the probability as < 0.001 instead of .000 to be clear that the probability is not really zero.)
The answer to problem 1 is false.
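Outside SPSS, the same two tests can be run in Python; a sketch using scipy and statsmodels (statsmodels supplies the Lilliefors-corrected K-S test that SPSS reports), with a hypothetical skewed sample standing in for the Internet-hours variable:

```python
import numpy as np
from scipy import stats
from statsmodels.stats.diagnostic import lilliefors

x = np.random.default_rng(0).lognormal(size=93)  # hypothetical skewed sample, n = 93

ks_stat, ks_p = lilliefors(x, dist="norm")  # Kolmogorov-Smirnov, Lilliefors correction
sw_stat, sw_p = stats.shapiro(x)            # Shapiro-Wilk, preferred when n <= 50

for name, p in (("K-S (Lilliefors)", ks_p), ("Shapiro-Wilk", sw_p)):
    print(f"{name}: p = {p:.4f} -> {'reject' if p <= 0.01 else 'retain'} normality at 0.01")
```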
28. The assumption of normality script
An SPSS script to produce all of the output that we have produced manually is available on the course web site.
After downloading the script, run it to test the assumption of normality.
Select Run Script… from the Utilities menu.
29. Selecting the assumption of
normality script
First, navigate to the folder containing your
scripts and highlight the
NormalityAssumptionAndTransformations.SBS
script.
Second, click on
the Run button to
activate the script.
30. Specifications for normality script
First, move variables from the list of variables in the data set to the Variables to Test list box.
Second, note that the default output is to do all of the transformations of the variable; to exclude some transformations from the calculations, clear the checkboxes.
Third, click on the OK button to run the script.
31. The test of normality
Tests of Normality (Lilliefors Significance Correction) for TOTAL TIME SPENT ON THE INTERNET:
Kolmogorov-Smirnov: Statistic .246, df 93, Sig. .000
Shapiro-Wilk: Statistic .606, df 93, Sig. .000
The script produces the same output that we computed manually, in this example the tests of normality.
32. When transformations do not work
When none of the transformations induces
normality in a variable, including that
variable in the analysis will reduce our
effectiveness at identifying statistical
relationships, i.e. we lose power.
We do have the option of changing the
way the information in the variable is
represented, e.g. substitute several
dichotomous variables for a single metric
variable.
33. Fisher's Measure of Skewness
The formula for Fisher's skewness statistic is based on deviations from the mean to the third power.
The measure of skewness can be interpreted in terms of the normal curve:
A symmetrical curve will result in a value of 0.
If the skewness value is positive, then the curve is skewed to the right, and vice versa for a distribution skewed to the left.
A z-score is calculated by dividing the measure of skewness by the standard error for skewness. Values above +1.96 or below -1.96 are significant at the 0.05 level because 95% of the scores in a normal distribution fall between +1.96 and -1.96 standard errors from the mean.
E.g. if Fisher's skewness = 0.195 and its standard error = 0.197, the z-score = 0.195/0.197 = 0.99.
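A sketch of the same z-score calculation in Python; note that scipy's skew is the moment-based statistic and the standard-error formula below is the usual large-sample approximation, so SPSS's adjusted values may differ slightly:

```python
import numpy as np
from scipy import stats

def skewness_z(x):
    # Divide the skewness measure by its standard error; |z| > 1.96 flags p < 0.05.
    x = np.asarray(x, dtype=float)
    n = len(x)
    g1 = stats.skew(x)  # moment-based (Fisher-Pearson) skewness
    se = np.sqrt(6.0 * n * (n - 1) / ((n - 2) * (n + 1) * (n + 3)))  # SE of skewness
    return g1 / se

print(0.195 / 0.197)  # the slide's example works out to z = 0.99
```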
34. Kurtosis
The distribution's kurtosis concerns how pointed or flat its peak is.
Two types:
Leptokurtic distribution (meaning thin or peaked).
Platykurtic distribution (meaning flat).
35. Kurtosis
There is a statistical index of kurtosis that can be computed when computer programs are instructed to produce a frequency distribution.
For the kurtosis index, a value of zero indicates a shape that is neither flat nor pointed.
Positive values on the kurtosis statistic indicate greater peakedness, and negative values indicate greater flatness.
36. Fisher's measure of Kurtosis
Fisher's measure is based on deviations from the mean to the fourth power.
A z-score is calculated by dividing the measure of kurtosis by the standard error for kurtosis.
37. Table of descriptive statistics
To answer problem 2, we look at the values for skewness and kurtosis in the Descriptives table.
Descriptives for TOTAL TIME SPENT ON THE INTERNET:
Mean                               10.731 (Std. Error 1.5918)
95% Confidence Interval for Mean   Lower Bound 7.570, Upper Bound 13.893
5% Trimmed Mean                    8.295
Median                             5.500
Variance                           235.655
Std. Deviation                     15.3511
Minimum                            .2
Maximum                            102.0
Range                              101.8
Interquartile Range                10.200
Skewness                           3.532 (Std. Error .250)
Kurtosis                           15.614 (Std. Error .495)
The skewness and kurtosis for the variable both exceed the rule-of-thumb criterion of 1.0. The variable is not normally distributed.
The answer to problem 2 is false.
38. Other problems on assumption of normality
A problem may ask about the assumption of normality for a nominal level variable. The answer will be "An inappropriate application of a statistic" since there is no expectation that a nominal variable be normal.
A problem may ask about the assumption of normality for an ordinal level variable. If the variable or transformed variable is normal, the correct answer to the question is "True with caution" since we may be required to defend treating an ordinal variable as metric.
Questions will specify a level of significance to use and the statistical evidence upon which you should base your answer.
39. Normal Distribution
Also called the bell-shaped curve, normal curve, or Gaussian distribution.
A normal distribution is one that is unimodal, symmetric, and not too peaked or flat.
It was given its name by the Belgian mathematician Quetelet who, in the early 19th century, noted that many human attributes (e.g. height, weight, intelligence) appeared to be distributed normally.
40. Normal Distribution
The normal curve is unimodal and symmetric
about its mean (µ).
In this distribution the mean, median and mode
are all identical.
The standard deviation (σ) specifies the amount
of dispersion around the mean.
The two parameters µ and σ completely define a
normal curve.
41. Also called a probability density function. The probability is interpreted as "area under the curve."
The random variable takes on an infinite number of values within a given interval.
The probability that X equals any one particular value is 0. Consequently, we talk about intervals, and the probability is equal to the area under the curve.
The area under the whole curve = 1.
43. Normal Distribution
f(x) = (1 / (σ√(2π))) · e^(−(x − μ)² / (2σ²))
where:
X is the random variable
μ is the mean value
σ is the standard deviation (SD) value
e = 2.7182818… (constant)
π = 3.1415926… (constant)
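The formula can be checked numerically against scipy's built-in density; a small sketch:

```python
import numpy as np
from scipy.stats import norm

mu, sigma = 0.0, 1.0
x = np.linspace(mu - 4 * sigma, mu + 4 * sigma, 9)

# f(x) = (1 / (sigma * sqrt(2*pi))) * e^(-(x - mu)^2 / (2 * sigma^2))
by_hand = np.exp(-((x - mu) ** 2) / (2 * sigma**2)) / (sigma * np.sqrt(2 * np.pi))
print(np.allclose(by_hand, norm.pdf(x, loc=mu, scale=sigma)))  # True
```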
44. Importance of Normal Distribution to
Statistics
Although most distributions are not exactly normal, most variables tend to have an approximately normal distribution.
Many inferential statistics assume that the
populations are distributed normally.
The normal curve is a probability
distribution and is used to answer
questions about the likelihood of getting
various particular outcomes when
sampling from a population.
45. Probabilities are obtained by getting the area under the curve inside a particular interval.
The area under the curve = the proportion of times, under identical (repeated) conditions, that a particular range of values will occur.
Characteristics of the Normal distribution:
It is symmetric about the mean μ.
Mean = median = mode. ["bell-shaped" curve]
f(X) decreases as X gets farther and farther away from the mean. It approaches the horizontal axis asymptotically: −∞ < X < +∞. This means that there is always some probability (area) for extreme values.
46. Why Do We Like The Normal Distribution So Much?
There is nothing "special" about standard normal scores.
These can be computed for observations from any sample/population of continuous data values.
The score measures how far an observation is from its mean in standard units of statistical distance.
But, if the distribution is not normal, we may not be able to use the Z-score approach.
47. Probability Distributions
Any characteristic that can be measured or
categorized is called a variable.
If the variable can assume a number of different
values such that any particular outcome is
determined by chance it is called a random
variable.
Every random variable has a corresponding
probability distribution.
The probability distribution applies the theory of
probability to describe the behavior of the
random variable.
48. Discrete Probability Distributions
Binomial distribution: the random variable can only assume 1 of 2 possible outcomes. There are a fixed number of trials and the results of the trials are independent.
E.g. flipping a coin and counting the number of heads in 10 trials.
Poisson distribution: the random variable can assume a value between 0 and infinity.
Counts usually follow a Poisson distribution (e.g. number of ambulances needed in a city in a given night).
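For concreteness, both distributions are available in scipy; a small sketch (the ambulance mean of 2.5 is a made-up figure):

```python
from scipy.stats import binom, poisson

# P(exactly 6 heads in 10 independent flips of a fair coin)
print(binom.pmf(6, n=10, p=0.5))

# P(exactly 3 ambulances needed in a night, assuming a hypothetical mean of 2.5)
print(poisson.pmf(3, mu=2.5))
```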
49. Discrete Random Variable
A discrete random variable X has a finite number of possible values. The probability distribution of X lists the values and their probabilities:
Value of X:   x1  x2  x3  …  xk
Probability:  p1  p2  p3  …  pk
1. Every probability pi is a number between 0 and 1.
2. The sum of the probabilities must be 1.
Find the probabilities of any event by adding the probabilities of the particular values that make up the event.
50. Example
The instructor in a large class gives 15% each of A's and D's, 30% each of B's and C's, and 10% F's. The student's grade on a 4-point scale is a random variable X (A=4).
Grade:        F=0    D=1    C=2    B=3    A=4
Probability:  0.10   0.15   0.30   0.30   0.15
What is the probability that a student selected at random will have a B or better?
ANSWER: P(grade of 3 or 4) = P(X=3) + P(X=4) = 0.30 + 0.15 = 0.45
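The same bookkeeping in a few lines of Python, using the distribution from this slide:

```python
grades = {0: 0.10, 1: 0.15, 2: 0.30, 3: 0.30, 4: 0.15}  # grade value -> probability

assert abs(sum(grades.values()) - 1.0) < 1e-9  # rule 2: probabilities sum to 1

# Add the probabilities of the values that make up the event "B or better".
p_b_or_better = sum(p for x, p in grades.items() if x >= 3)
print(p_b_or_better)  # 0.45
```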
51. Continuous Probability Distributions
When it follows a Binomial or a Poisson
distribution the variable is restricted to taking on
integer values only.
Between two values of a continuous random
variable we can always find a third.
A histogram is used to represent a discrete
probability distribution and a smooth curve called
the probability density is used to represent a
continuous probability distribution.
52. Normal Distribution
Q: Is every variable normally distributed?
A: Absolutely not.
Q: Then why do we spend so much time studying the normal distribution?
A: Some variables are normally distributed; a bigger reason is the "Central Limit Theorem"!
53. Central Limit Theorem
The central limit theorem describes the characteristics of the "population of the means" which has been created from the means of an infinite number of random population samples of size (N), all of them drawn from a given "parent population".
It predicts that regardless of the distribution of the parent population:
The mean of the population of means is always equal to the mean of the parent population from which the population samples were drawn.
The standard deviation of the population of means is always equal to the standard deviation of the parent population divided by the square root of the sample size (N).
The distribution of means will increasingly approximate a normal distribution as the size N of samples increases.
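All three predictions are easy to check by simulation; a sketch using a clearly non-normal (exponential) parent population:

```python
import numpy as np

rng = np.random.default_rng(1)
parent = rng.exponential(scale=2.0, size=100_000)  # non-normal parent population

N = 30
means = rng.choice(parent, size=(10_000, N)).mean(axis=1)  # 10,000 sample means

print(parent.mean(), means.mean())             # nearly equal, as the theorem predicts
print(parent.std() / np.sqrt(N), means.std())  # SD of means ~ sigma / sqrt(N)
```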
54. Central Limit Theorem
A consequence of Central Limit Theorem is that if we
average measurements of a particular quantity, the
distribution of our average tends toward a normal one.
In addition, if a measured variable is actually a
combination of several other uncorrelated variables, all of
them "contaminated" with a random error of any
distribution, our measurements tend to be contaminated
with a random error that is normally distributed as the
number of these variables increases.
Thus, the Central Limit Theorem explains the ubiquity of
the famous bell-shaped "Normal distribution" (or "Gaussian
distribution") in the measurements domain.
55. Note that the normal distribution is
defined by two parameters, μ and σ . You
can draw a normal distribution for any μ
and σ combination. There is one normal
distribution, Z, that is special. It has a μ =
0 and a σ = 1. This is the Z distribution,
also called the standard normal
distribution. It is one of trillions of normal
distributions we could have selected.
56. Standard Normal Variable
It is customary to call a standard normal random variable Z.
The outcomes of the random variable Z are denoted by z.
The table in the coming slide gives the area under the curve (probabilities) between the mean and z.
The probabilities in the table refer to the likelihood that a randomly selected value Z is equal to or less than a given value of z and greater than 0 (the mean of the standard normal).
58. The 68-95-99.7 Rule for the Normal
Distribution
68% of the observations fall within one
standard deviation of the mean
95% of the observations fall within two
standard deviations of the mean
99.7% of the observations fall within three
standard deviations of the mean
When applied to ‘real data’, these
estimates are considered approximate!
59. Remember these probabilities (percentages):
# standard deviations from the mean   Approx. area under the normal curve
±1                                    .68
±1.645                                .90
±1.96                                 .95
±2                                    .955
±2.575                                .99
±3                                    .997
Practice: Find these values yourself using the Z table.
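These table values can also be reproduced from the standard normal CDF; a quick check with scipy:

```python
from scipy.stats import norm

for z in (1.0, 1.645, 1.96, 2.0, 2.575, 3.0):
    area = norm.cdf(z) - norm.cdf(-z)  # area within +/- z SDs of the mean
    print(f"+/-{z}: {area:.4f}")
# 0.6827, 0.9000, 0.9500, 0.9545, 0.9900, 0.9973
```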
Two Sample Z Test
61. Standard Normal Distribution
[Figure: standard normal curve, with 50% of the probability (0.5) on each side of the mean]
62. Standard Normal Distribution
[Figure: standard normal curve with the central 95% of the probability marked and 2.5% of the probability in each tail]
63. Calculating Probabilities
Probability calculations are always concerned with finding the probability that the variable assumes any value in an interval between two specific points a and b.
The probability that a continuous variable assumes a value between a and b is the area under the graph of the density between a and b.
64. If the weight of males is N.D. with μ=150
and σ=10, what is the probability that a
randomly selected male will weigh between
140 lbs and 155 lbs?
[Important Note: Always remember that
the probability that X is equal to any one
particular value is zero, P(X=value) =0,
since the normal distribution is
continuous.]
65. Solution:
Z = (140 − 150)/10 = −1.00 SD from the mean; area under the curve = .3413 (from Z table)
Z = (155 − 150)/10 = +0.50 SD from the mean; area under the curve = .1915 (from Z table)
Answer: .3413 + .1915 = .5328
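The same answer can be read directly from the normal CDF without splitting the area at the mean; a scipy sketch:

```python
from scipy.stats import norm

mu, sigma = 150, 10
p = norm.cdf(155, loc=mu, scale=sigma) - norm.cdf(140, loc=mu, scale=sigma)
print(round(p, 4))  # 0.5328, matching the Z-table calculation
```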
66. Example
For example: what's the probability of getting a math SAT score of 575 or less, if µ = 500 and σ = 50?
Z = (575 − 500)/50 = 1.5
i.e., a score of 575 is 1.5 standard deviations above the mean.
P(X ≤ 575) = ∫ from −∞ to 575 of (1/(50√(2π))) · e^(−½((x − 500)/50)²) dx = ∫ from −∞ to 1.5 of (1/√(2π)) · e^(−z²/2) dz
Yikes! But looking up Z = 1.5 in a standard normal chart (or entering it into SAS) is no problem: the answer is .9332.
67. If IQ is N.D. with a mean of 100 and a S.D. of 10, what percentage of the population will have
(a) IQs ranging from 90 to 110?
(b) IQs ranging from 80 to 120?
Solution:
Z = (90 − 100)/10 = −1.00
Z = (110 − 100)/10 = +1.00
The area between 0 and 1.00 in the Z-table is .3413; the area between 0 and −1.00 is also .3413 (the Z-distribution is symmetric).
The answer to part (a) is .3413 + .3413 = .6826.
68. (b) IQs ranging from 80 to 120?
Solution:
Z = (80 − 100)/10 = −2.00
Z = (120 − 100)/10 = +2.00
The area between 0 and 2.00 in the Z-table is .4772; the area between 0 and −2.00 is also .4772 (the Z-distribution is symmetric).
The answer is .4772 + .4772 = .9544.
69. Suppose that the average salary of college
graduates is N.D. with μ=$40,000 and
σ=$10,000.
(a) What proportion of college graduates will earn
$24,800 or less?
(b) What proportion of college graduates will earn
$53,500 or more?
(c) What proportion of college graduates will earn
between $45,000 and $57,000?
(d) Calculate the 80th percentile.
(e) Calculate the 27th percentile.
70. (a) What proportion of college graduates
will earn $24,800 or less?
Solution:
Convert the $24,800 to a Z-score:
Z = ($24,800 - $40,000)/$10,000 = -1.52.
Always DRAW a picture of the distribution
to help you solve these problems.
71. [Figure: normal curve with the area .4357 shaded between $24,800 (Z = −1.52) and the mean $40,000 (Z = 0)]
First find the area between 0 and −1.52 in the Z-table. From the Z-table, that area is .4357.
Then, the area from −1.52 to −∞ is .5000 − .4357 = .0643.
Answer: 6.43% of college graduates will earn less than $24,800.
72. (b) What proportion of college graduates will earn $53,500 or more?
Solution:
Convert the $53,500 to a Z-score: Z = ($53,500 − $40,000)/$10,000 = +1.35.
Find the area between 0 and +1.35 in the Z-table: .4115 is the table value.
When you DRAW A PICTURE you see that you need the area in the tail: .5 − .4115 = .0885.
Answer: .0885. Thus, 8.85% of college graduates will earn $53,500 or more.
[Figure: .4115 between $40,000 (Z = 0) and $53,500 (Z = +1.35), with .0885 in the upper tail]
73. (c) What proportion of college graduates will earn between $45,000 and $57,000?
[Figure: .4554 between the mean $40k and $57k (Z = 1.7); .1915 between the mean and $45k (Z = .5)]
Z = ($45,000 − $40,000)/$10,000 = .50
Z = ($57,000 − $40,000)/$10,000 = 1.70
From the table, we can get the area under the curve between the mean (0) and .5; we can get the area between 0 and 1.7. From the picture we see that neither one is what we need.
What do we do here? Subtract the small piece from the big piece to get exactly what we need.
Answer: .4554 − .1915 = .2639
74. Parts (d) and (e) of this example ask you to
compute percentiles. Every Z-score is
associated with a percentile. A Z-score of 0
is the 50th percentile. This means that if you
take any test that is normally distributed
(e.g., the SAT exam), and your Z-score on
the test is 0, this means you scored at the
50th percentile. In fact, your score is the
mean, median, and mode.
75. (d) Calculate the 80th percentile.
Solution:
First, what Z-score is associated with the 80th percentile?
A Z-score of approximately +.84 will give you about .3000 of the area under the curve between the mean and Z. Also, the area under the curve between −∞ and 0 is .5000. Therefore, a Z-score of +.84 is associated with the 80th percentile.
ANSWER
Now to find the salary (X) at the 80th percentile, just solve for X: +.84 = (X − $40,000)/$10,000
X = $40,000 + $8,400 = $48,400
[Figure: .5000 below the mean $40,000 and .3000 between Z = 0 and Z = .84]
76. (e) Calculate the 27th percentile.
Solution:
First, what Z-score is associated with the 27th percentile? A Z-score of approximately −.61 will give you about .2300 of the area under the curve between the mean and Z, with .2700 in the tail. (The area under the curve between 0 and −.61 is .2291, which we are rounding to .2300.) Also, the area under the curve between 0 and ∞ is .5000. Therefore, a Z-score of −.61 is associated with the 27th percentile.
ANSWER
Now to find the salary (X) at the 27th percentile, just solve for X: −0.61 = (X − $40,000)/$10,000
X = $40,000 − $6,100 = $33,900
[Figure: .2700 in the lower tail below Z = −.61, .2300 between Z = −.61 and the mean $40,000]
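Both percentiles can be read from the inverse normal CDF; a scipy sketch (the small differences from the slide come from rounding Z to two decimals):

```python
from scipy.stats import norm

mu, sigma = 40_000, 10_000
print(norm.ppf(0.80, loc=mu, scale=sigma))  # ~48,416; the rounded z of +.84 gives $48,400
print(norm.ppf(0.27, loc=mu, scale=sigma))  # ~33,872; the rounded z of -.61 gives $33,900
```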
77. T-Distribution
Similar to the standard normal in that it is unimodal, bell-shaped and symmetric.
The tails of the distribution are "thicker" than the standard normal's.
The distribution is indexed by "degrees of freedom" (df). The degrees of freedom measure the amount of information available in the data set that can be used for estimating the population variance (df = n − 1).
The area under the curve still equals 1.
Probabilities for the t-distribution with infinite df equal those of the standard normal.
78. T-Distribution
The table of the t-distribution gives you the probability to the right of a critical value, i.e. the area in the upper tail.
We are only given the area (or probability) for a few selected critical values for each degree of freedom.
79. T-Distribution Example
For a t-curve from a sample of size 15, find the area to the left of 2.145.
Answer: df = 15 − 1 = 14
In the table of the t-distribution, the area to the right of 2.145 is 0.025.
Therefore the area to the left of 2.145 is: 1 − 0.025 = 0.975
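The same lookup with scipy's t-distribution CDF:

```python
from scipy.stats import t

print(t.cdf(2.145, df=14))  # ~0.975: the area to the left of 2.145 with 14 df
```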
80. Graphical Methods
Frequency Distribution
Histogram
Frequency Polygon
Cumulative Frequency Graph
Pie Chart.
81. Presenting Data
Table
Condenses data into a form that can make
them easier to understand;
Shows many details in summary fashion;
BUT
Since a table shows only numbers, it may not be readily understood without comparing it to other values.
82. Principles of Table Construction
Don't try to do too much in a table.
Use white space effectively to make the table layout pleasing to the eye.
Make sure tables & text refer to each other.
Use some aspect of the table to order & group rows & columns.
83. Principles of Table Construction
If appropriate, frame the table with summary statistics in rows & columns to provide a standard of comparison.
Round numbers in the table to one or two decimal places to make them easily understood.
When creating tables for publication in a manuscript, double-space them unless contraindicated by the journal.
84. Frequency Distributions
A useful way to present data when you
have a large data set is the formation of a
frequency table or frequency distribution.
Frequency – the number of observations
that fall within a certain range of the data.
85. Frequency Table
Age      Number of Deaths
<1       564
1-4      86
5-14     127
15-24    490
25-34    66
35-44    806
45-54    1,425
55-64    3,511
65-74    6,932
75-84    10,101
85+      9,825
Total    34,524
86. Frequency Table
Data Intervals   Frequency   Cumulative Frequency   Relative Frequency (%)   Cumulative Relative Frequency (%)
10-19            5           5                      8.9                      8.9
20-29            18          23                     32.1                     41.1
30-39            10          33                     17.9                     58.9
40-49            13          46                     23.2                     82.1
50-59            4           50                     7.1                      89.3
60-69            4           54                     7.1                      96.4
70-79            2           56                     3.6                      100.0
Total            56                                 100.0
87. Cumulative Relative Frequency
Cumulative Relative Frequency: the percentage of persons having a measurement less than or equal to the upper boundary of the class interval.
E.g. the cumulative relative frequency for the 3rd interval of our data example:
28.8 + 13.3 + 17.5 = 59.6%
We say that 59.6% of the children have weights below 39.5 pounds.
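Relative and cumulative relative frequencies are simple to compute from the counts; a sketch using the frequencies from the table on slide 86:

```python
import numpy as np

freq = np.array([5, 18, 10, 13, 4, 4, 2])  # counts per class interval (slide 86)
rel = 100 * freq / freq.sum()              # relative frequency (%)
cum_rel = np.cumsum(rel)                   # cumulative relative frequency (%)
print(np.round(rel, 1))      # [ 8.9 32.1 17.9 23.2  7.1  7.1  3.6]
print(np.round(cum_rel, 1))  # [  8.9  41.1  58.9  82.1  89.3  96.4 100. ]
```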
88. Number of Intervals
There is no clear-cut rule on the number
of intervals or classes that should be used.
Too many intervals – the data may not be
summarized enough for a clear
visualization of how they are distributed.
Too few intervals – the data may be over-
summarized and some of the details of the
distribution may be lost.
89. Presenting Data
Chart
- Visual representation of a frequency distribution that helps to gain insight about what the data mean.
- Built with lines, areas & text.
Ex: bar chart, pie chart
90. Bar Chart
Simplest form of chart.
Used to display nominal or ordinal data.
[Figure: bar chart of percent responses (Never, Seldom, Sometimes, Frequently) for Ethical Issues Scale item 8, "Acting against your own personal/religious views"]
91. Horizontal Bar Chart
[Figure: horizontal bar chart of percent of respondents by clinical practice area: Acute Care, Critical Care, Gerontology, Post Anesthesia, Perinatal, Clinical Research, Family Nursing, Neonatal, Psych/Mental Health, Community Health, General Practice, Orthopedics, Primary Care, Operating Room, Medical, Oncology, Other]
92. Cluster Bar Chart
[Figure: cluster bar chart of percent by RN highest education (Diploma, Associate Degree, Bachelor Degree, Post Bac) and employment status (Full time RN, Part time RN, Self employed)]
93. Pie Chart
Alternative to bar chart.
Circle partitioned into percentage distributions of qualitative variables, with total area of 100%.
[Figure: pie chart of highest degree: Doctorate NonNursing, Doctorate Nursing, MS NonNursing, MS Nursing, Juris Doctor, Diploma-Nursing, BS NonNursing, BS Nursing, AD Nursing, Missing]
94. Histogram
Appropriate for interval, ratio and
sometimes ordinal data
Similar to bar charts but bars are placed
side by side
Often used to represent both frequencies
and percentages
Most histograms have from 5 to 20 bars
95. Histogram
[Figure: histogram of SF-36 Vitality scores, frequency on the vertical axis; Std. Dev = 22.17, Mean = 61.6, N = 439]
96. Pictures of Data: Histograms
Blood pressure data on a sample of 113 men:
[Figure] Histogram of the systolic blood pressure for 113 men. Each bar spans a width of 5 mmHg on the horizontal axis. The height of each bar represents the number of individuals with SBP in that range.
97. Frequency Polygon
First place a dot at the midpoint of the upper base of each rectangular bar.
The points are connected with straight lines.
At the ends, the points are connected to the midpoints of the previous and succeeding intervals (these intervals have zero frequency).
[Figure: frequency polygon of children's weights, with class midpoints 4.5 through 84.5 on the horizontal axis]
98. Hallmarks of a Good Chart
Simple & easy to read
Placed correctly within text
Use color only when it has a purpose, not
solely for decoration
Make sure others can understand chart;
try it out on somebody first
Remember: A poor chart is worse than no
chart at all.
99. Cumulative Frequency Plot
Place a point with the horizontal axis marked at the upper class boundary and the vertical axis marked at the corresponding cumulative frequency.
Each point represents the cumulative relative frequency and the points are connected with straight lines.
The left end is connected to the lower boundary of the first interval that has data.
[Figure: cumulative frequency plot of weights of daycare children, percent of children on the vertical axis, upper class boundaries 9.5 through 89.5 on the horizontal axis]
100. Coefficient of Correlation
Measure of linear association between 2
continuous variables.
Setting:
two measurements are made for each
observation.
Sample consists of pairs of values and you
want to determine the association between the
variables.
101. Association Examples
Example 1: Association between a mother’s
weight and the birth weight of her child
2 measurements: mother’s weight and baby’s weight
Both continuous measures
Example 2: Association between a risk factor and
a disease
2 measurements: disease status and risk factor status
Both dichotomous measurements
102. Correlation Analysis
When you have 2 continuous
measurements you use correlation
analysis to determine the relationship
between the variables.
Through correlation analysis you can
calculate a number that relates to the
strength of the linear association.
103. Types of Relationships
There are 2 types of relationships:
Deterministic relationship – the values of the 2
variables are related through an exact
mathematical formula.
Statistical relationship – this is not a perfect
relationship!!!
104. Scatter Plots and Association
You can plot the 2 variables in a scatter plot (one of the types of charts in SPSS/Excel).
The pattern of the "dots" in the plot indicates the statistical relationship between the variables (the strength and the direction):
Positive relationship: the pattern goes from lower left to upper right.
Negative relationship: the pattern goes from upper left to lower right.
The more the dots cluster around a straight line, the stronger the linear relationship.
105. Birth Weight Data
x = birth weight in ounces; y = increase in weight between the 70th and 100th days of life, expressed as a percentage of birth weight.
x (oz)   y (%)
112      63
111      66
107      72
119      52
92       75
80       118
81       120
84       114
118      42
106      72
103      90
94       91
107. Calculations of Correlation Coefficient
In Excel (Analysis ToolPak):
Go to the TOOLS menu and select DATA ANALYSIS.
Highlight CORRELATION and click "OK".
Enter the INPUT RANGE (the 2 columns of data that contain "x" and "y").
Select the cells where you want the answer to be placed and click "OK".
108. Pearson Correlation Results
         x (oz)     y (%)
x (oz)   1
y (%)    -0.94629   1
Pearson Correlation Coefficient = -0.946
Interpretation:
- values near 1 indicate a strong positive linear relationship
- values near -1 indicate a strong negative linear relationship
- values near 0 indicate a weak linear association
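The coefficient is easy to verify with the birth-weight data from slide 105; a NumPy sketch:

```python
import numpy as np

x = [112, 111, 107, 119, 92, 80, 81, 84, 118, 106, 103, 94]  # birth weight (oz)
y = [63, 66, 72, 52, 75, 118, 120, 114, 42, 72, 90, 91]      # weight gain (%)

r = np.corrcoef(x, y)[0, 1]
print(round(r, 5))  # -0.94629: a strong negative linear relationship
```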
109. CAUTION!
Interpreting the correlation coefficient should be done cautiously!
A result of 0 does not mean there is NO relationship; it means there is no linear association.
There may be a perfect non-linear association.
110. The Uses of Frequency Distributions
Becoming familiar with the dataset.
Cleaning the data.
Outliers: values that lie outside the normal range of values for other cases.
Inspecting the data for missing values.
Testing assumptions for statistical tests.
An assumption is a condition that is presumed to be true and, when ignored or violated, can lead to misleading or invalid results.
When the DV is not normally distributed, researchers have to choose between three options:
Select a statistical test that does not assume a normal distribution.
Ignore the violation of the assumption.
Transform the variable to better approximate a normal distribution (please consult the various data transformations).
111. The Uses of Frequency Distributions
Obtaining information about sample characteristics.
Directly answering research questions.
112. Outliers
Outliers are values that are extreme relative to the bulk of scores in the distribution. They appear to be inconsistent with the rest of the data.
Advantages:
They may indicate characteristics of the population that would not be known in the normal course of analysis.
Disadvantages:
They do not represent the population.
They run counter to the objectives of the analysis.
They can distort statistical tests.
113. Sources of Outliers
An error in the recording of the data.
A failure of data collection, such as not following sample criteria (e.g. inadvertently admitting a disoriented patient into a study), a subject not following instructions on a questionnaire, or equipment failure.
An actual extreme value from an unusual subject.
114. Methods to Identify Outliers
The traditional way of labeling outliers is to flag any value more than 3 SD from the mean.
Values that are more than 3 IQRs from the upper or lower edge of the box plot are extreme outliers.
Values between 1.5 and 3 IQRs from the upper and lower edges of the box are minor outliers.
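A sketch implementing both rules (the 3-SD rule and the box-plot IQR fences) in NumPy; the data vector is hypothetical:

```python
import numpy as np

def flag_outliers(x):
    # Returns three masks: 3-SD outliers, minor IQR outliers, extreme IQR outliers.
    x = np.asarray(x, dtype=float)
    z = (x - x.mean()) / x.std(ddof=1)
    q1, q3 = np.percentile(x, [25, 75])
    iqr = q3 - q1
    minor_or_worse = (x < q1 - 1.5 * iqr) | (x > q3 + 1.5 * iqr)
    extreme = (x < q1 - 3.0 * iqr) | (x > q3 + 3.0 * iqr)
    return np.abs(z) > 3, minor_or_worse & ~extreme, extreme

print(flag_outliers([2, 3, 4, 5, 6, 7, 8, 9, 10, 95]))
```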
115. Handling Outliers
Analyze the data two ways:
With the outliers in the distribution.
With outliers removed.
If the results are similar, as they are likely to be if the sample size is large, then the outliers may be ignored. If the results are not similar, then a statistical analysis that is resistant to outliers can be used (e.g. median and IQR).
If you want to use a mean with outliers, then the trimmed mean is an option. It is calculated with a certain percentage of the extreme values removed from both ends of the distribution (e.g. if n = 100, then the 5% trimmed mean is the mean of the middle 90% of the observations).
117. Handling Outliers
Another alternative is a Winsorized mean. The highest and lowest extremes are replaced by the next-to-highest value and by the next-to-lowest value.
For univariate outliers, Tabachnick and Fidell (2001) suggest changing the scores on the variables for the outlying cases so they are less deviant. E.g. if the two largest scores in the distribution are 125 and 122, and the next largest score is 87, recode 122 as 88 and 125 as 89.
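Both estimators are available in scipy; a minimal sketch on a hypothetical vector with two high outliers:

```python
import numpy as np
from scipy import stats
from scipy.stats.mstats import winsorize

x = np.array([2, 3, 4, 5, 6, 7, 8, 9, 87, 125])  # hypothetical scores with outliers

print(x.mean())                                # 25.6: badly distorted by 87 and 125
print(stats.trim_mean(x, 0.1))                 # 16.125: 10% cut from each end drops 2 and 125
print(winsorize(x, limits=(0.1, 0.1)).mean())  # 21.9: extremes replaced, not removed
```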
119. Missing Data
Missing data arise from any systematic event external to the respondent (such as data entry errors or data collection problems) or action on the part of the respondent (such as refusal to answer) that leads to missing values.
It means that analyses are based on fewer study participants than were in the full study sample. This, in turn, means less statistical power, which can undermine statistical conclusion validity, i.e. the degree to which the statistical results are accurate.
Missing data can also affect internal validity, i.e. the degree to which inferences about the causal effect of the independent variable on the dependent variable are warranted, and can also affect external validity (generalizability).
120. Strategies to avoid Missing Data
Persistent follow-up
Flexibility in scheduling appointments
Paying incentives.
Using well-proven methods to track people
who have moved.
Performing a thorough review of
completed data forms prior to excusing
participants.
121. Factors to consider in designing a
missing values strategy
Extent of missing data
Pattern of missing data
Nature of missing data.
Role of the variable
Level of measurement of the variable.
122. Extent of missing data
Researchers usually handle the problem differently if there is only 1% missing data as opposed to, say, 25% missing.
123. Pattern of missing data
It is more straightforward to deal with data that are missing in a haphazard, random fashion, as opposed to a systematic fashion that typically reflects a bias.
Different patterns of missing data:
Missing completely at random (MCAR)
Missing at random (MAR)
Missing not at random (MNAR)
124. Missing Completely at Random (MCAR)
It means that the probability that an observation is missing is completely unrelated to either the value of the missing case or the value of any other variables.
Occurs when cases with missing values are just a random subsample of all cases in the sample.
When data are MCAR, analyses remain unbiased, although power is reduced.
E.g. when one participant did not show up after the intervention due to an emergency. In this situation, the missing values are not related to the main variable or to the value of other characteristics, such as the person's age, sex, or experimental group status.
125. Missing at Random (MAR)
Data are considered MAR if missingness is related to other variables, but not related to the value of the variable that has the missing values.
This pattern is perhaps the most prevalent pattern of missingness in clinical research.
E.g. men were less likely to keep their follow-up appointment; thus, missingness is related to a person's gender.
126. Missing Not at Random (MNAR)
A pattern in which the value of the variable that is missing is related to its missingness. This is often found for such variables as income (people may not tell the truth).
127. Nature of Missing Data
Sometimes values are missing for only one item in a multi-item measure.
Sometimes an entire variable is missing.
In other situations, all data are missing for some study participants.
128. Role of the variable
How one handles the missing data problem may depend on whether a variable is considered a primary outcome, a secondary outcome, an independent (predictor) variable, or a control variable (covariate).
129. Level of Measurement of the Variable
Some strategies are best applied when the
variable is measured on an interval or
ratio scale, while others only make sense
for nominal-level variables.
130. Techniques for Handling Missing Data
Deletion techniques. Involve excluding subjects with missing data from statistical calculations.
Imputation techniques. Involve calculating an estimate of each missing value and replacing, or imputing, each value by its respective estimate.
Note: techniques for handling missing data often vary in the degree to which they affect the amount of dispersion around true scores, and the degree of bias in the final results. Therefore, the selection of a data handling technique should be carefully considered.
131. Deletion Techniques
Deletion methods involve removal of cases or variables with missing data.
Listwise deletion. Also called complete case analysis. It is simply the analysis of those cases for which there are no missing data. It eliminates an entire case when any of its items/variables has a missing data point, whether or not that data point is part of the analysis. It is the default in SPSS.
Pairwise deletion. Also called available case analysis (sometimes dubbed "unwise deletion"). Involves omitting cases from the analysis on a variable-by-variable basis. It eliminates a case only when that case has missing data for the variables or items under analysis.
Note: deletion techniques are widely criticized because they assume that the data are MCAR (which is very difficult to ascertain), pose a risk for bias, and lead to reduction of sample size and power.
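A pandas sketch contrasting the two approaches on a hypothetical toy data frame:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"age":   [25, 30, np.nan, 41, 37],
                   "score": [88, np.nan, 75, 90, 82],
                   "bp":    [120, 115, 130, np.nan, 118]})

complete = df.dropna()  # listwise deletion: keeps only fully observed rows
print(len(complete))    # 2 of 5 rows survive

print(df.corr())        # correlations are computed pairwise over available cases
```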
132. Imputation Techniques
Imputation is the process of estimating missing data based on valid values of other variables or cases in the sample.
The goal of imputation is to use known relationships that can be identified in the valid values of the sample to help estimate the missing data.
133. Types of Imputation Techniques
Using prior knowledge.
Inserting mean values.
Using regression
Expectation maximization (EM).
Multiple imputation.
134. Prior Knowledge
Involves replacing a missing value with a value based on an educated guess.
It is a reasonable method if the researcher has a good working knowledge of the research domain, the sample is large, and the number of missing values is small.
135. Mean Replacement
Also called median replacement for skewed distributions.
Involves calculating mean values from available data on that variable and using them to replace missing values before analysis.
It is a conservative procedure because the mean of the distribution as a whole does not change and the researcher does not have to guess at missing values.
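A pandas sketch of mean replacement on a hypothetical series; it also shows the variance shrinkage discussed on the next slide:

```python
import numpy as np
import pandas as pd

s = pd.Series([4.0, np.nan, 6.0, 8.0, np.nan, 10.0])
imputed = s.fillna(s.mean())       # replace each missing value with the mean (7.0)

print(imputed.mean() == s.mean())  # True: the distribution mean does not change
print(s.var(), imputed.var())      # 6.67 vs 4.0: the true variance is understated
```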
136. Mean Replacement
Advantages:
Easily implemented and provides all cases with complete data.
A compromise procedure is to insert a group mean for the missing values.
Disadvantages:
It invalidates the variance estimates derived from the standard variance formulas by understating the data's true variance.
It distorts the actual distribution of values.
It depresses the observed correlation that this variable will have with other variables because all missing data have a single constant value, thus reducing the variance.
137. Using Regression
Involves using other variables in the dataset as independent variables to develop a regression equation for the variable with missing data serving as the dependent variable.
Cases with complete data are used to generate the regression equation. The equation is then used to predict missing values for incomplete cases.
More regressions are computed, using the predicted values from the previous regression to develop the next equation, until the predicted values from one step to the next are comparable. Predictions from the last regression are the ones used to replace missing values.
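A single-pass sketch of the idea with scikit-learn; the slides describe iterating until predictions stabilize, and the data and function name here are made up:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

def regression_impute_once(X, y):
    # One pass: regress y on X over complete cases, predict y where it is missing.
    y = y.astype(float).copy()
    miss = np.isnan(y)
    model = LinearRegression().fit(X[~miss], y[~miss])
    y[miss] = model.predict(X[miss])
    return y

X = np.array([[1.0], [2.0], [3.0], [4.0]])  # predictor with complete data
y = np.array([2.1, np.nan, 6.2, 8.1])       # variable with a missing value
print(regression_impute_once(X, y))         # the NaN is replaced by ~4.1
```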
138. Using Regression
Advantages:
It is more objective than the researcher's guess but not as blind as simply using the overall mean.
Disadvantages:
It reinforces the relationships already in the data, resulting in less generalizability.
The variance of the distribution is reduced because the estimate is probably too close to the mean.
It assumes that the variable with missing data is correlated substantially with the other variables in the dataset.
The regression procedure is not constrained in the estimates it makes.
139. Expectation Maximization
For randomly missing data.
It is an iterative process that proceeds in two discrete steps:
In the expectation (E) step, the conditional expected value of the complete data (such as correlations) is computed given the observed values.
In the maximization (M) step, these expected values are then substituted for the missing data, and maximum likelihood estimation is then computed as though there were no missing data.
140. Multiple Imputation
It produces several datasets and analyzes
them separately.
One set of parameters is then formed by
averaging the resulting estimates and
standard errors.
141. Multiple Imputation
Advantages:
It makes no assumptions about whether data are
randomly missing but incorporates random error
because it requires random variation in the imputation
process.
It permits use of complete-data methods for data
analysis and also includes the data collector’s
knowledge.
It permits estimates of nonlinear models.
It simulates proper inference from data and increases
efficiency of the estimates by minimizing standard
errors.
It is the method of choice for databases that are made
available for analyses outside the agency that collected
the data.
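As a rough sketch of the mechanics, scikit-learn's IterativeImputer with posterior sampling can generate the several completed datasets that MI requires; the values here are made up:

```python
import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401 (enables the API)
from sklearn.impute import IterativeImputer

X = np.array([[7.0, 2.0], [9.0, np.nan], [np.nan, 6.0], [8.0, 4.0]])

# Each random_state yields one plausibly different completed dataset; in MI the
# analysis is run on each and the resulting estimates are pooled.
datasets = [IterativeImputer(sample_posterior=True, random_state=s).fit_transform(X)
            for s in range(5)]
print(len(datasets))  # 5 completed copies of X
```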
142. Multiple Imputation
Disadvantages:
It requires conceptual intensiveness to carry out MI, including special software and model building.
It does not produce a unique answer because randomness is preserved in the MI process, making reproducibility of exact results problematic.
It requires large amounts of data storage space that often exceeds the space on personal computers' hard drives.