Descriptive Statistics-II
Dr Mahmoud Alhussami
Shapes of Distribution
   A third important property of data – after location
    and dispersion – is its shape.
   Distributions of quantitative variables can be
    described in terms of a number of features, many
    of which are related to the distributions’ physical
    appearance or shape when presented graphically.
       modality
       Symmetry and skewness
       Degree of skewness
       Kurtosis
Modality
 The modality of a distribution concerns
  how many peaks or high points there are.
 A distribution with a single peak (one
  value with a high frequency) is a unimodal
  distribution.
Modality
   A distribution with two
    or more peaks is called a
    multimodal
    distribution.
Symmetry and Skewness
   A distribution is symmetric if the distribution could be split
    down the middle to form two halves that are mirror images
    of one another.
   In asymmetric distributions, the peaks are off center, with
    a bulk of scores clustering at one end and a tail trailing off
    at the other end. Such distributions are often described as
    skewed.
       When the longer tail trails off to the right this is a positively
        skewed distribution. E.g. annual income.
       When the longer tail trails off to the left this is called
        negatively skewed distribution. E.g. age at death.
Symmetry and Skewness
   Shape can be described by degree of asymmetry (i.e.,
    skewness).
      mean > median       positive or right-skewness
      mean = median       symmetric or zero-skewness
      mean < median       negative or left-skewness
   Positive skewness can arise when the mean is
    increased by some unusually high values.
   Negative skewness can arise when the mean is
    decreased by some unusually low values.
Left skewed, right skewed, and symmetric distributions (illustrated with example histograms).
Shapes of the Distribution
   Three common shapes of frequency
    distributions:
      A. Symmetrical and bell shaped
      B. Positively skewed (skewed to the right)
      C. Negatively skewed (skewed to the left)
Shapes of the Distribution
   Three less common shapes of frequency
    distributions:
      A. Bimodal
      B. Reverse J-shaped
      C. Uniform
Degree of Skewness
   A skewness index can readily be calculated by most
    statistical computer programs in conjunction with
    frequency distributions.
   The index has a value of 0 for a perfectly
    symmetric distribution.
   It takes a positive value if there is a positive skew, and a
    negative value if there is a negative skew.
   A skewness index that is more than twice the
    value of its standard error can be interpreted as a
    departure from symmetry.
Measures of Skewness or Symmetry
   Pearson’s skewness coefficient
        It is nonalgebraic and easily calculated. It is also
         useful for quick estimates of symmetry.
        It is defined as:
           skewness = (mean - median) / SD
   Fisher’s measure of skewness.
       It is based on deviations from the mean to the
        third power.
Pearson’s skewness coefficient

     For a perfectly symmetrical distribution, the mean will
      equal the median, and the skewness coefficient will be
      zero. If the distribution is positively skewed, the mean
      will be greater than the median and the coefficient will be
      positive. If the coefficient is negative, the
      distribution is negatively skewed and the mean will be less
      than the median.
     Skewness values will fall between -1 and +1 SD units.
      Values falling outside this range indicate a substantially
      skewed distribution.
     Hildebrand (1986) states that skewness values above
      0.2 or below -0.2 indicate severe skewness.
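A minimal sketch of this coefficient in Python, using an invented sample purely for illustration:

import numpy as np

# Hypothetical scores; any numeric sample works here
scores = np.array([2, 3, 3, 4, 4, 4, 5, 5, 6, 9, 12])

pearson_skew = (scores.mean() - np.median(scores)) / scores.std(ddof=1)
print(f"Pearson skewness coefficient = {pearson_skew:.2f}")
# Values above 0.2 or below -0.2 would be flagged as severe skewness
# under the Hildebrand (1986) guideline cited above.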
Assumption of Normality
   Many of the statistical methods that we will
    apply require the assumption that a variable or
    variables are normally distributed.
   With multivariate statistics, the assumption is
    that the combination of variables follows a
    multivariate normal distribution.
   Since there is not a direct test for multivariate
    normality, we generally test each variable
    individually and assume that they are
    multivariate normal if they are individually
    normal, though this is not necessarily the case.
Evaluating normality
   There are both graphical and statistical methods
    for evaluating normality.
   Graphical methods include the histogram and
    normality plot.
   Statistical methods include diagnostic hypothesis
    tests for normality, and a rule of thumb that says
    a variable is reasonably close to normal if its
    skewness and kurtosis have values between –1.0
    and +1.0.
   None of the methods is absolutely definitive.
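A rough, non-SPSS sketch of these checks; the variable is simulated, and scipy's shapiro and kstest stand in for the diagnostic hypothesis tests:

import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
x = rng.exponential(scale=10.0, size=93)   # hypothetical, clearly non-normal variable

# Diagnostic hypothesis tests for normality
sw_stat, sw_p = stats.shapiro(x)                                          # Shapiro-Wilk
ks_stat, ks_p = stats.kstest(x, 'norm', args=(x.mean(), x.std(ddof=1)))   # K-S against a fitted normal

# Rule-of-thumb check on skewness and kurtosis
skewness = stats.skew(x, bias=False)
kurt = stats.kurtosis(x, bias=False)   # excess kurtosis; 0 for a normal distribution

print(f"Shapiro-Wilk p = {sw_p:.4f}, K-S p = {ks_p:.4f}")
print(f"skewness = {skewness:.2f}, kurtosis = {kurt:.2f}")
print("roughly normal by rule of thumb:", abs(skewness) <= 1.0 and abs(kurt) <= 1.0)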
Transformations
   When a variable is not normally distributed, we
    can create a transformed variable and test it for
    normality. If the transformed variable is normally
    distributed, we can substitute it in our analysis.
   Three common transformations are: the
    logarithmic transformation, the square root
    transformation, and the inverse transformation.
   All of these change the measuring scale on the
    horizontal axis of a histogram to produce a
    transformed variable that is mathematically
    equivalent to the original variable.
Types of Data Transformations
 For moderate skewness, use a square root
  transformation.
 For substantial skewness, use a log
  transformation.
 For severe skewness, use an inverse
  transformation.
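A brief sketch of the three transformations; the values are invented, and a constant would need to be added first if the variable contained zeros or negative values:

import numpy as np

x = np.array([0.2, 1.5, 3.0, 5.5, 8.0, 12.0, 25.0, 102.0])  # hypothetical skewed values

sqrt_x = np.sqrt(x)      # moderate skewness
log_x = np.log10(x)      # substantial skewness
inv_x = 1.0 / x          # severe skewness

for name, t in [("square root", sqrt_x), ("log10", log_x), ("inverse", inv_x)]:
    print(name, np.round(t, 3))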
Computing “Explore” descriptive
statistics



                       To compute the statistics
                       needed for evaluating the
                       normality of a variable, select
                       the Explore… command from
                       the Descriptive Statistics
                       menu.
Adding the variable to be evaluated

      First, click on the variable to be included
      in the analysis to highlight it.

      Second, click on the right arrow button to move
      the highlighted variable to the Dependent List.
Selecting statistics to be computed




                       To select the statistics for the
                       output, click on the
                       Statistics… command button.
Including descriptive statistics
         First, click on the
         Descriptives checkbox
         to select it. Clear the
         other checkboxes.




                                   Second, click on the
                                   Continue button to
                                   complete the request for
                                   statistics.
Selecting charts for the output




                         To select the diagnostic charts
                         for the output, click on the
                         Plots… command button.
Including diagnostic plots and statistics

   First, click on the None option button on the Boxplots panel,
   since boxplots are not as helpful as other charts in assessing
   normality.

   Second, click on the Normality plots with tests checkbox to
   include normality plots and the hypothesis tests for normality.

   Third, click on the Histogram checkbox to include a histogram
   in the output. You may want to examine the stem-and-leaf plot
   as well, though I find it less useful.

   Finally, click on the Continue button to complete the request.
Completing the specifications for the
analysis


                         Click on the OK button to
                         complete the specifications
                         for the analysis and request
                         SPSS to produce the
                         output.
The histogram

   An initial impression of the normality of the distribution
   can be gained by examining the histogram.

   In this example, the histogram shows a substantial violation
   of normality caused by an extremely large value in the
   distribution.

   [Histogram of TOTAL TIME SPENT ON THE INTERNET: Std. Dev = 15.35, Mean = 10.7, N = 93]
The normality plot

   The problem with the normality of this variable’s
   distribution is reinforced by the normality plot.

   If the variable were normally distributed, the red dots
   would fit the green line very closely. In this case, the red
   points in the upper right of the chart indicate the severe
   skewing caused by the extremely large data values.

   [Normal Q-Q plot of TOTAL TIME SPENT ON THE INTERNET: Expected Normal vs. Observed Value]
The test of normality

                               Tests of Normality
                                      Kolmogorov-Smirnov(a)            Shapiro-Wilk
                                      Statistic   df    Sig.     Statistic   df    Sig.
   TOTAL TIME SPENT ON THE INTERNET      .246     93    .000        .606     93    .000
   a. Lilliefors Significance Correction

     Problem 1 asks about the results of the test of normality. Since the sample
     size is larger than 50, we use the Kolmogorov-Smirnov test. If the sample
     size were 50 or less, we would use the Shapiro-Wilk statistic instead.

     The null hypothesis for the test of normality states that the actual
     distribution of the variable is equal to the expected distribution, i.e., the
     variable is normally distributed. Since the probability associated with the
     test of normality (< 0.001) is less than or equal to the level of significance
     (0.01), we reject the null hypothesis and conclude that total hours spent on
     the Internet is not normally distributed. (Note: we report the probability as
     < 0.001 instead of .000 to be clear that the probability is not really zero.)

     The answer to problem 1 is false.
The assumption of normality script

    An SPSS script to produce all of the output that we have
    produced manually is available on the course web site.

    After downloading the script, run it to test the assumption
    of normality.

    Select Run Script… from the Utilities menu.
Selecting the assumption of
normality script
                    First, navigate to the folder containing your
                    scripts and highlight the
                    NormalityAssumptionAndTransformations.SBS
                    script.




              Second, click on
              the Run button to
              activate the script.
Specifications for normality script



                                            First, move variables from
                                            the list of variables in the
                                            data set to the Variables to
                                            Test list box.




   The default output is to do all of the
   transformations of the variable. To
   exclude some transformations from the              Third, click on the OK
   calculations, clear the checkboxes.                button to run the script.
The test of normality

                               Tests of Normality
                                      Kolmogorov-Smirnov(a)            Shapiro-Wilk
                                      Statistic   df    Sig.     Statistic   df    Sig.
   TOTAL TIME SPENT ON THE INTERNET      .246     93    .000        .606     93    .000
   a. Lilliefors Significance Correction

                               The script produces the same output that we
                               computed manually, in this example, the tests
                               of normality.
When transformations do not work
 When none of the transformations induces
  normality in a variable, including that
  variable in the analysis will reduce our
  effectiveness at identifying statistical
  relationships, i.e. we lose power.
 We do have the option of changing the
  way the information in the variable is
  represented, e.g. substitute several
  dichotomous variables for a single metric
  variable.
Fisher’s Measure of Skewness
   The formula for Fisher’s skewness statistic is based on
    deviations from the mean to the third power.
   The measure of skewness can be interpreted in terms of
    the normal curve:
       A symmetrical curve will result in a value of 0.
       If the skewness value is positive, then the curve is skewed to
        the right, and vice versa for a distribution skewed to the left.
   A z-score is calculated by dividing the measure of skewness
    by the standard error for skewness. Values above +1.96 or
    below -1.96 are significant at the 0.05 level because 95%
    of the scores in a normal distribution fall within ±1.96
    standard deviations of the mean.
   E.g. if Fisher’s skewness = 0.195 and st.err. = 0.197, the z-
    score = 0.195/0.197 = 0.99, which is not significant at the 0.05 level.
Kurtosis
 The distribution’s kurtosis concerns how
  pointed or flat its peak is.
 Two types:
      Leptokurtic distribution (meaning thin, peaked).
      Platykurtic distribution (meaning flat).
Kurtosis
    There is a statistical index of kurtosis that can be
     computed when computer programs are
     instructed to produce a frequency distribution.
   For the kurtosis index, a value of zero indicates a
    shape that is neither flat nor pointed.
   Positive values on the kurtosis statistic indicate
    greater peakedness, and negative values indicate
    greater flatness.
Fisher’s Measure of Kurtosis
Fisher’s measure is based on deviations
 from the mean to the fourth power.
 A z-score is calculated by dividing the
 measure of kurtosis by the standard error
 for kurtosis.
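As an illustration outside SPSS, a sketch of Fisher-type skewness and kurtosis z-scores; the simulated sample and the large-sample standard-error formulas sqrt(6/n) and sqrt(24/n) are assumptions for this example:

import numpy as np
from scipy.stats import skew, kurtosis

rng = np.random.default_rng(1)
x = rng.normal(loc=50, scale=10, size=200)    # hypothetical, roughly normal sample

n = len(x)
g1 = skew(x, bias=False)                      # skewness (third-power deviations)
g2 = kurtosis(x, fisher=True, bias=False)     # excess kurtosis (fourth-power deviations)

se_skew = np.sqrt(6.0 / n)                    # rough large-sample standard errors
se_kurt = np.sqrt(24.0 / n)

print(f"skewness z = {g1 / se_skew:.2f}")
print(f"kurtosis z = {g2 / se_kurt:.2f}")
# |z| > 1.96 would be significant at the 0.05 level, as described above.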
Table of descriptive statistics

   Descriptives: TOTAL TIME SPENT ON THE INTERNET
                                            Statistic    Std. Error
   Mean                                        10.731        1.5918
   95% Confidence Interval for Mean
       Lower Bound                              7.570
       Upper Bound                             13.893
   5% Trimmed Mean                              8.295
   Median                                       5.500
   Variance                                   235.655
   Std. Deviation                             15.3511
   Minimum                                         .2
   Maximum                                      102.0
   Range                                        101.8
   Interquartile Range                         10.200
   Skewness                                     3.532          .250
   Kurtosis                                    15.614          .495

   To answer problem 2, we look at the values for skewness and kurtosis
   in the Descriptives table.

   The skewness and kurtosis for the variable both exceed the rule of
   thumb criteria of 1.0. The variable is not normally distributed.

   The answer to problem 2 is false.
Other problems on assumption of normality
      A problem may ask about the assumption of
       normality for a nominal level variable. The
       answer will be “An inappropriate application of a
       statistic” since there is no expectation that a
       nominal variable be normal.
      A problem may ask about the assumption of
       normality for an ordinal level variable. If the
       variable or transformed variable is normal, the
       correct answer to the question is “True with
       caution” since we may be required to defend
       treating an ordinal variable as metric.
      Questions will specify a level of significance to
       use and the statistical evidence upon which you
       should base your answer.
Normal Distribution
 Also called bell shaped curve, normal
  curve, or Gaussian distribution.
 A normal distribution is one that is
  unimodal, symmetric, and not too peaked
  or flat.
 Given its name by the Belgian
  mathematician Quetelet who, in the early
  19th century, noted that many human
  attributes, e.g. height, weight, intelligence,
  appeared to be distributed normally.
Normal Distribution
   The normal curve is unimodal and symmetric
    about its mean (µ).
   In this distribution the mean, median and mode
    are all identical.
   The standard deviation (σ) specifies the amount
    of dispersion around the mean.
   The two parameters µ and σ completely define a
    normal curve.




 Also  called a Probability density function. The
  probability is interpreted as "area under the
  curve."
 The random variable takes on an infinite # of
  values within a given interval
 The probability that X = any particular value
  is 0. Consequently, we talk about intervals.
  The probability is = to the area under the
  curve.
 The area under the whole curve = 1.




Normal Distribution

The normal probability density function is

    f(x) = (1 / (σ√(2π))) · e^(-(x - μ)² / (2σ²))

where:
 X is the random variable
 μ is the mean value
 σ is the standard deviation (std) value
 e = 2.7182818... (constant)
 π = 3.1415926... (constant)
Importance of Normal Distribution to
Statistics
 Although most distributions are not
  exactly normal, most variables tend to
  have approximately normal distribution.
 Many inferential statistics assume that the
  populations are distributed normally.
 The normal curve is a probability
  distribution and is used to answer
  questions about the likelihood of getting
  various particular outcomes when
  sampling from a population.
 Probabilities are obtained by getting the area
  under the curve inside of a particular interval.
  The area under the curve = the proportion of
  times under identical (repeated) conditions
  that a particular range of values will occur.
 Characteristics of the Normal distribution:
     It is symmetric about the mean μ.
     Mean = median = mode. [“bell-shaped” curve]
     f(X) decreases as X gets farther and farther away
      from the mean. It approaches the horizontal axis
      asymptotically: -∞ < X < +∞. This means that there is always
      some probability (area) for extreme values.
Why Do We Like The Normal
Distribution So Much?
      There is nothing “special” about standard
       normal scores
                These can be computed for observations from any
                 sample/population of continuous data values
                The score measures how far an observation is from
                 its mean in standard units of statistical distance
      But, if distribution is not normal, we may not be
       able to use Z-score approach.




Probability Distributions
   Any characteristic that can be measured or
    categorized is called a variable.
   If the variable can assume a number of different
    values such that any particular outcome is
    determined by chance it is called a random
    variable.
   Every random variable has a corresponding
    probability distribution.
   The probability distribution applies the theory of
    probability to describe the behavior of the
    random variable.

Discrete Probability Distributions
   Binomial distribution – the random variable
    can only assume 1 of 2 possible outcomes.
    There are a fixed number of trials and the
    results of the trials are independent.
                i.e. flipping a coin and counting the number of heads in
                 10 trials.
   Poisson Distribution – random variable can
    assume a value between 0 and infinity.
                Counts usually follow a Poisson distribution (i.e.
                 number of ambulances needed in a city in a given
                 night)



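A short scipy sketch of the two distributions; the coin-flip and ambulance-count settings follow the examples above, and the rate of 3 calls per night is an added assumption:

from scipy import stats

# Binomial: number of heads in 10 independent fair coin flips
p_five_heads = stats.binom.pmf(5, n=10, p=0.5)
print(f"P(exactly 5 heads in 10 flips) = {p_five_heads:.3f}")

# Poisson: counts such as the number of ambulances needed in a city in a night,
# here with an assumed average rate of 3 per night
p_at_most_two = stats.poisson.cdf(2, mu=3)
print(f"P(at most 2 ambulances needed) = {p_at_most_two:.3f}")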
Discrete Random Variable
        A discrete random variable X has a finite number of possible
         values. The probability distribution of X lists the values and
         their probabilities.

                Value of X    x1      x2       x3     …      xk
               Probability    p1      p2       p3     …      pk
1.       Every probability pi is a number between 0 and 1.
2.       The sum of the probabilities must be 1.
        Find the probabilities of any event by adding the probabilities
         of the particular values that make up the event.




Example
   The instructor in a large class gives 15% each of A’s and D’s,
    30% each of B’s and C’s, and 10% F’s. The student’s grade
    on a 4-point scale is a random variable X (A=4).

             Grade          F=0     D=1     C=2     B=3     A=4
             Probability    0.10    0.15    0.30    0.30    0.15

   What is the probability that a student selected at random will
    have a B or better?
   ANSWER: P(grade of 3 or 4) = P(X=3) + P(X=4)
                               = 0.30 + 0.15 = 0.45
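The same calculation as a short Python sketch, using the probability table from the example above:

# Probability distribution of the grade random variable X (A = 4)
grade_probs = {0: 0.10, 1: 0.15, 2: 0.30, 3: 0.30, 4: 0.15}

assert abs(sum(grade_probs.values()) - 1.0) < 1e-9   # probabilities must sum to 1

# P(B or better) = P(X = 3) + P(X = 4)
p_b_or_better = grade_probs[3] + grade_probs[4]
print(f"P(B or better) = {p_b_or_better:.2f}")        # 0.45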
Continuous Probability Distributions
   A variable that follows a Binomial or a Poisson
    distribution is restricted to taking on
    integer values only.
   Between two values of a continuous random
    variable we can always find a third.
   A histogram is used to represent a discrete
    probability distribution and a smooth curve called
    the probability density is used to represent a
    continuous probability distribution.




Normal Distribution
Q      Is every variable normally distributed?
A      Absolutely not
Q      Then why do we spend so much time
       studying the normal distribution?
A      Some variables are normally distributed;
       a bigger reason is the “Central Limit
       Theorem”!
Central Limit Theorem
   describes the characteristics of the "population of the
    means" which has been created from the means of an
    infinite number of random population samples of size (N),
    all of them drawn from a given "parent population".
   It predicts that regardless of the distribution of the parent
    population:
       The mean of the population of means is always equal to the
        mean of the parent population from which the population
        samples were drawn.
       The standard deviation of the population of means is always
        equal to the standard deviation of the parent population
        divided by the square root of the sample size (N).
       The distribution of means will increasingly approximate a
        normal distribution as the size N of samples increases.
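A quick simulation sketch of these three predictions; the parent population, sample size, and number of samples are arbitrary illustrative choices:

import numpy as np

rng = np.random.default_rng(7)

# Strongly skewed (non-normal) parent population
parent = rng.exponential(scale=2.0, size=100_000)

N = 30                  # size of each random sample
num_samples = 10_000    # number of samples drawn

sample_means = rng.choice(parent, size=(num_samples, N)).mean(axis=1)

print(f"parent mean = {parent.mean():.3f},  mean of sample means = {sample_means.mean():.3f}")
print(f"parent SD / sqrt(N) = {parent.std() / np.sqrt(N):.3f},  SD of sample means = {sample_means.std():.3f}")
# A histogram of sample_means looks increasingly normal as N grows.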
Central Limit Theorem
   A consequence of Central Limit Theorem is that if we
    average measurements of a particular quantity, the
    distribution of our average tends toward a normal one.
    In addition, if a measured variable is actually a
    combination of several other uncorrelated variables, all of
    them "contaminated" with a random error of any
    distribution, our measurements tend to be contaminated
    with a random error that is normally distributed as the
    number of these variables increases.
   Thus, the Central Limit Theorem explains the ubiquity of
    the famous bell-shaped "Normal distribution" (or "Gaussian
    distribution") in the measurements domain.
Note that the normal distribution is
defined by two parameters, μ and σ . You
can draw a normal distribution for any μ
and σ combination. There is one normal
distribution, Z, that is special. It has a μ =
0 and a σ = 1. This is the Z distribution,
also called the standard normal
distribution. It is one of trillions of normal
distributions we could have selected.

Standard Normal Variable
   It is customary to call a standard normal random
    variable Z.
   The outcomes of the random variable Z are
    denoted by z.
   The table in the coming slide gives the area
    under the curve (probabilities) between the mean and
    z.
   The probabilities in the table refer to the
    likelihood that a randomly selected value Z is
    equal to or less than a given value of z and
    greater than 0 (the mean of the standard
    normal).



[Standard normal (Z) table. Source: Levine et al, Business Statistics, Pearson.]
The 68-95-99.7 Rule for the Normal
Distribution
 68% of the observations fall within one
  standard deviation of the mean
 95% of the observations fall within two
  standard deviations of the mean
 99.7% of the observations fall within three
  standard deviations of the mean
 When applied to ‘real data’, these
  estimates are considered approximate!


Remember these probabilities (percentages):

     # standard deviations from the mean    Approx. area under the normal curve
                        ±1                                68%
                        ±1.645                            90%
                        ±1.96                             95%
                        ±2                                95.5%
                        ±2.575                            99%
                        ±3                                99.7%

Practice: Find these values yourself using the Z table.
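These areas can be verified with a short scipy sketch instead of the Z table:

from scipy.stats import norm

for z in [1, 1.645, 1.96, 2, 2.575, 3]:
    area = norm.cdf(z) - norm.cdf(-z)   # area within ±z standard deviations of the mean
    print(f"±{z}: {area:.3%}")
# Prints approximately 68%, 90%, 95%, 95.5%, 99%, and 99.7%,
# matching the 68-95-99.7 rule and the table above.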
Standard Normal Curve




Standard Normal Distribution

[Figure: standard normal curve with 50% of the probability (0.5) below the mean and 50% (0.5) above it.]
Standard Normal Distribution

[Figure: standard normal distribution with the central 95% of the probability marked and 2.5% of the probability in each tail.]
Calculating Probabilities
 Probability calculations are always
  concerned with finding the probability that
  the variable assumes any value in an
  interval between two specific points a and
  b.
 The probability that a continuous variable
  assumes a value between a and b is
  the area under the graph of the density
  between a and b.
If the weight of males is N.D. with μ=150
and σ=10, what is the probability that a
randomly selected male will weigh between
140 lbs and 155 lbs?
[Important Note: Always remember that
the probability that X is equal to any one
particular value is zero, P(X=value) =0,
since the normal distribution is
continuous.]


Solution:

[Figure: normal curve with 140, 150, and 155 marked on the X axis, corresponding to Z = -1, 0, and 0.5.]

Z = (140 – 150) / 10 = -1.00 s.d. from mean
Area under the curve = .3413 (from Z table)

Z = (155 – 150) / 10 = +0.50 s.d. from mean
Area under the curve = .1915 (from Z table)

Answer: .3413 + .1915 = .5328
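The same probability computed directly with scipy, bypassing the Z table:

from scipy.stats import norm

mu, sigma = 150, 10
p = norm.cdf(155, loc=mu, scale=sigma) - norm.cdf(140, loc=mu, scale=sigma)
print(f"P(140 < X < 155) = {p:.4f}")   # about 0.5328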
Example
   For example: What’s the probability of getting a math SAT score of
    575 or less, µ=500 and σ=50?

     Z = (575 − 500) / 50 = 1.5

i.e., a score of 575 is 1.5 standard deviations above the mean

     P(X ≤ 575) = ∫ from −∞ to 575 of (1 / (50·√(2π))) · e^(-½((x − 500)/50)²) dx
                = ∫ from −∞ to 1.5 of (1 / √(2π)) · e^(-z²/2) dz


Yikes!
But to look up Z= 1.5 in standard normal chart (or enter
into SAS) no problem! = .9332
If IQ is ND with a mean of 100 and a S.D. of
10, what percentage of the population will
have
(a)IQs ranging from 90 to 110?
(b)IQs ranging from 80 to 120?
Solution:
Z = (90 – 100)/10 = -1.00
Z = (110 -100)/ 10 = +1.00
Area between 0 and 1.00 in the Z-table is
 .3413; Area between 0 and -1.00 is also .3413
(Z-distribution is symmetric).
Answer to part (a) is .3413 + .3413 = .6826.

(b) IQs ranging from 80 to 120?
Solution:
Z = (80 – 100)/10 = -2.00
Z = (120 - 100)/10 = +2.00
Area between 0 and 2.00 in the Z-table is
 .4772; Area between 0 and -2.00 is also .4772
(Z-distribution is symmetric).
Answer is .4772 + .4772 = .9544.
Suppose that the average salary of college
graduates is N.D. with μ=$40,000 and
σ=$10,000.
 (a)   What proportion of college graduates will earn
       $24,800 or less?
 (b)   What proportion of college graduates will earn
       $53,500 or more?
 (c)   What proportion of college graduates will earn
       between $45,000 and $57,000?
 (d)   Calculate the 80th percentile.
 (e)   Calculate the 27th percentile.
(a) What proportion of college graduates
will earn $24,800 or less?
Solution:
Convert the $24,800 to a Z-score:
Z = ($24,800 - $40,000)/$10,000 = -1.52.
Always DRAW a picture of the distribution
to help you solve these problems.




[Figure: normal curve with $24,800 and $40,000 marked on the X axis, corresponding to Z = -1.52 and 0; the area between them is .4357.]

First find the area between 0 and -1.52 in the
Z-table. From the Z table, that area is .4357.
Then, the area from -1.52 to -∞ is
.5000 - .4357 = .0643.
Answer: 6.43% of college graduates will earn
less than $24,800.
(b) What proportion of college graduates will earn
$53,500 or more?

[Figure: normal curve with $40,000 and $53,500 marked, corresponding to Z = 0 and +1.35; the area between them is .4115, leaving .0885 in the upper tail.]

Solution:
Convert the $53,500 to a Z-score.
Z = ($53,500 - $40,000)/$10,000 = +1.35.
Find the area between 0 and +1.35 in the Z-
table: .4115 is the table value.
When you DRAW A PICTURE (above) you see
that you need the area in the tail: .5 - .4115
= .0885.
Answer: .0885. Thus, 8.85% of college
graduates will earn $53,500 or more.
(c) What proportion of college graduates will earn
between $45,000 and $57,000?

[Figure: normal curve with $40k, $45k, and $57k marked, corresponding to Z = 0, .5, and 1.7; the area from 0 to .5 is .1915 and from 0 to 1.7 is .4554.]

Z = ($45,000 – $40,000) / $10,000 = .50
Z = ($57,000 – $40,000) / $10,000 = 1.70

From the table, we can get the area under the
curve between the mean (0) and .5; we can get
the area between 0 and 1.7. From the picture
we see that neither one is what we need.
What do we do here? Subtract the small piece
from the big piece to get exactly what we need.
Answer: .4554 − .1915 = .2639
Parts (d) and (e) of this example ask you to
compute percentiles. Every Z-score is
associated with a percentile. A Z-score of 0
is the 50th percentile. This means that if you
take any test that is normally distributed
(e.g., the SAT exam), and your Z-score on
the test is 0, this means you scored at the
50th percentile. In fact, your score is the
mean, median, and mode.


(d) Calculate the 80th percentile.

[Figure: normal curve with .5000 of the area below the mean ($40,000, Z = 0) and .3000 between the mean and Z = .84.]

Solution:
First, what Z-score is associated with the 80th percentile?

A Z-score of approximately +.84 will give you
about .3000 of the area under the curve. Also,
the area under the curve between -∞ and 0 is
.5000. Therefore, a Z-score of +.84 is associated
with the 80th percentile.

ANSWER
Now to find the salary (X) at the 80th percentile:
Just solve for X: +.84 = (X − $40,000)/$10,000
X = $40,000 + $8,400 = $48,400
(e) Calculate the 27th percentile.

[Figure: normal curve with .2300 of the area between Z = -.61 and 0, .2700 in the lower tail, and .5000 above the mean ($40,000).]

Solution: First, what Z-score is associated
with the 27th percentile? A Z-score
of approximately -.61 will give you
about .2300 of the area under the curve, with .2700 in
the tail. (The area under the curve between 0 and -.61
is .2291, which we are rounding to .2300). Also, the
area under the curve between 0 and ∞ is .5000.
Therefore, a Z-score of
-.61 is associated with the 27th percentile.

ANSWER
Now to find the salary (X) at the 27th percentile:
Just solve for X: -0.61 = (X − $40,000)/$10,000
X = $40,000 - $6,100 = $33,900
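The percentile calculations in parts (d) and (e) can also be done with scipy's norm.ppf, the inverse of the cumulative distribution:

from scipy.stats import norm

mu, sigma = 40_000, 10_000

salary_80th = norm.ppf(0.80, loc=mu, scale=sigma)
salary_27th = norm.ppf(0.27, loc=mu, scale=sigma)

print(f"80th percentile salary ≈ ${salary_80th:,.0f}")   # about $48,400
print(f"27th percentile salary ≈ ${salary_27th:,.0f}")   # about $33,900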
T-Distribution
   Similar to the standard normal in that it is unimodal, bell-
    shaped and symmetric.
   The tails of the distribution are “thicker” than those of the
    standard normal.
   The distribution is indexed by “degrees of freedom” (df).
   The degrees of freedom measure the amount of information
    available in the data set that can be used for estimating the
    population variance (df=n-1).
   Area under the curve still equals 1.
   Probabilities for the t-distribution with infinite df equals those
    of the standard normal.




T-Distribution
   The table of t-distribution will give you the
    probability to the right of a critical value –
    i.e. area in the upper tail.

   We are only given the area (or probability)
    for a few selected critical values for each
    degree of freedom.




T-Distribution Example
   For a t-curve from a sample of size 15 find
    the area to the left of 2.145.

   Answer: df=15-1=14
         In the table of the t-distribution, the area to
          the right of 2.145 is 0.025.
         Therefore the area to the left of 2.145 is:
                1-0.025=0.975




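A short check of this example with scipy's t distribution:

from scipy.stats import t

df = 15 - 1                      # degrees of freedom for a sample of size 15
area_left = t.cdf(2.145, df)     # area to the left of 2.145

print(f"area to the left of 2.145  = {area_left:.3f}")     # about 0.975
print(f"area to the right of 2.145 = {1 - area_left:.3f}")  # about 0.025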
Graphical Methods
   Frequency Distribution
   Histogram
   Frequency Polygon
   Cumulative Frequency Graph
   Pie Chart.




Presenting Data
   Table
       Condenses data into a form that can make
        them easier to understand;
       Shows many details in summary fashion;
                           BUT
        Since a table shows only numbers, it may not be
         readily understood without comparing it to
         other values.
Principles of Table Construction
 Don’t try to do too much in a table
 Use white space effectively to make the table
  layout pleasing to the eye.
 Make sure tables & text refer to each
  other.
 Use some aspect of the table to order &
  group rows & columns.
Principles of Table Construction
 If appropriate, frame table with summary
  statistics in rows & columns to provide a
  standard of comparison.
 Round numbers in table to one or two
  decimal places to make them easily
  understood.
 When creating tables for publication in a
  manuscript, double-space them unless
  contraindicated by journal.
Frequency Distributions
   A useful way to present data when you
    have a large data set is the formation of a
    frequency table or frequency distribution.

   Frequency – the number of observations
    that fall within a certain range of the data.




Frequency Table

         Age      Number of Deaths
         <1              564
         1-4              86
         5-14            127
        15-24            490
        25-34             66
        35-44            806
        45-54          1,425
        55-64          3,511
        65-74          6,932
        75-84         10,101
         85+           9,825
         Total        34,524
Frequency Table

  Data        Frequency   Cumulative   Relative        Cumulative Relative
  Intervals               Frequency    Frequency (%)   Frequency (%)

  10-19            5           5
  20-29           18          23
  30-39           10          33
  40-49           13          46
  50-59            4          50
  60-69            4          54
  70-79            2          56
  Total
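A small sketch of how the remaining columns of such a table can be computed; the interval labels and frequencies are taken from the table above:

import numpy as np

intervals = ["10-19", "20-29", "30-39", "40-49", "50-59", "60-69", "70-79"]
freq = np.array([5, 18, 10, 13, 4, 4, 2])

cum_freq = np.cumsum(freq)              # cumulative frequency
rel_freq = 100 * freq / freq.sum()      # relative frequency (%)
cum_rel_freq = np.cumsum(rel_freq)      # cumulative relative frequency (%)

for row in zip(intervals, freq, cum_freq, rel_freq.round(1), cum_rel_freq.round(1)):
    print(row)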
Cumulative Relative Frequency
   Cumulative Relative Frequency – the
    percentage of persons having a
    measurement less than or equal to the
    upper boundary of the class interval.
         i.e. cumulative relative frequency for the 3rd
          interval of our data example:
             8.8 + 13.3 + 17.5 = 39.6%
            - We say that 39.6% of the children have weights
              below 39.5 pounds.
Number of Intervals
 There is no clear-cut rule on the number
  of intervals or classes that should be used.
 Too many intervals – the data may not be
  summarized enough for a clear
  visualization of how they are distributed.
 Too few intervals – the data may be over-
  summarized and some of the details of the
  distribution may be lost.



Presenting Data
 Chart
   -  Visual representation of a frequency distribution that
      helps to gain insight about what the data mean.
   -  Built with lines, area & text
      Ex: bar chart, pie chart
Bar Chart
   Simplest form of chart
   Used to display nominal or ordinal data

[Bar chart: ETHICAL ISSUES SCALE ITEM 8, “Acting against your own personal/religious views”; percent of responses for Never, Seldom, Sometimes, and Frequently.]
Horizontal Bar Chart

[Horizontal bar chart: CLINICAL PRACTICE AREA (Acute Care, Critical Care, Gerontology, Post Anesthesia, Perinatal, Clinical Research, Family Nursing, Neonatal, Psych/Mental Health, Community Health, General Practice, Orthopedics, Primary Care, Operating Room, Medical, Oncology, Other) by percent.]
Cluster Bar Chart

[Clustered bar chart: percent by RN HIGHEST EDUCATION (Diploma, Associate Degree, Bachelor Degree, Post Bac), clustered by employment status (Full time RN, Part time RN, Self employed).]
Pie Chart
   Alternative to bar chart
   Circle partitioned into percentage
    distributions of qualitative variables
    with total area of 100%

[Pie chart: highest degree held (BS Nursing, BS NonNursing, MS Nursing, MS NonNursing, Doctorate Nursing, Doctorate NonNursing, Juris Doctor, Diploma-Nursing, AD Nursing, Missing).]
Histogram
 Appropriate for interval, ratio and
  sometimes ordinal data
 Similar to bar charts but bars are placed
  side by side
 Often used to represent both frequencies
  and percentages
 Most histograms have from 5 to 20 bars
Histogram

[Histogram of SF-36 VITALITY SCORES: frequency by score; Std. Dev = 22.17, Mean = 61.6, N = 439.]
Pictures of Data: Histograms

Blood pressure data on a sample of 113 men.

[Histogram of the Systolic Blood Pressure (mmHg) for 113 men. Each bar spans a width of 5 mmHg on the horizontal axis. The height of each bar represents the number of individuals with SBP in that range.]
Frequency Polygon

[Frequency polygon of children's weights, drawn over the histogram.]

•First place a dot at the midpoint of the upper base of each rectangular bar.
•The points are connected with straight lines.
•At the ends, the points are connected to the midpoints of the previous and succeeding intervals (these intervals have zero frequency).
Hallmarks of a Good Chart
 Simple & easy to read
 Placed correctly within text
 Use color only when it has a purpose, not
  solely for decoration
 Make sure others can understand chart;
  try it out on somebody first
 Remember: A poor chart is worse than no
      chart at all.
Cumulative Frequency Plot

[Cumulative frequency plot: Weights of Daycare Children, percent of children by weight range.]

•Place a point with a horizontal axis marked at the upper class boundary and a vertical axis marked at the corresponding cumulative frequency.
•Each point represents the cumulative relative frequency and the points are connected with straight lines.
•The left end is connected to the lower boundary of the first interval that has data.
Coefficient of Correlation
 Measure of linear association between 2
  continuous variables.
 Setting:
         two measurements are made for each
          observation.
         Sample consists of pairs of values and you
          want to determine the association between the
          variables.




Association Examples
   Example 1: Association between a mother’s
    weight and the birth weight of her child
         2 measurements: mother’s weight and baby’s weight
                Both continuous measures
   Example 2: Association between a risk factor and
    a disease
         2 measurements: disease status and risk factor status
                Both dichotomous measurements




Correlation Analysis
 When you have 2 continuous
  measurements you use correlation
  analysis to determine the relationship
  between the variables.
 Through correlation analysis you can
  calculate a number that relates to the
  strength of the linear association.




Types of Relationships
   There are 2 types of relationships:
         Deterministic relationship – the values of the 2
          variables are related through an exact
          mathematical formula.

         Statistical relationship – this is not a perfect
          relationship!!!




Scatter Plots and Association
   You can plot the 2 variables in a scatter plot (one
    of the types of charts in SPSS/Excel).
   The pattern of the “dots” in the plot indicate the
    statistical relationship between the variables (the
    strength and the direction).
         Positive relationship – pattern goes from lower left to
          upper right.
         Negative relationship – pattern goes from upper left to
          lower right.
         The more the dots cluster around a straight line the
          stronger the linear relationship.




Birth Weight Data

     x (oz)    y (%)
       112       63
       111       66
       107       72
       119       52
        92       75
        80      118
        81      120
        84      114
       118       42
       106       72
       103       90
        94       91

x – birth weight in ounces
y – increase in weight between 70th and 100th days of life,
    expressed as a percentage of birth weight
Pearson Correlation Coefficient

[Scatter plot of the Birth Weight Data: increase in birth weight (%) versus birth weight (in ounces).]
Calculations of Correlation
Coefficient
   In Excel (Analysis ToolPak):
         Go to the TOOLS menu and select DATA
          ANALYSIS.
         Highlight CORRELATION and click “OK”.
         Enter the INPUT RANGE (2 columns of data that
          contain “x” and “y”).
         Enter the OUTPUT RANGE (the cells where you want
          the answer to be placed) and click “OK”.
Pearson Correlation Results

                    x (oz)       y (%)
   x (oz)                1
   y (%)          -0.94629           1

Pearson Correlation Coefficient = -0.946
Interpretation:
           - values near 1 indicate strong positive linear relationship
           - values near –1 indicate strong negative linear relationship
           - values near 0 indicate a weak linear association
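The same coefficient can be computed directly in Python; the data are the birth weight pairs listed earlier:

from scipy.stats import pearsonr

x = [112, 111, 107, 119, 92, 80, 81, 84, 118, 106, 103, 94]   # birth weight (oz)
y = [63, 66, 72, 52, 75, 118, 120, 114, 42, 72, 90, 91]       # % weight increase

r, p_value = pearsonr(x, y)
print(f"Pearson r = {r:.3f}, p = {p_value:.4f}")   # r is about -0.946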
CAUTION!!!!
 Interpreting the correlation coefficient
  should be done cautiously!
 A result of 0 does not mean there is NO
  relationship …. It means there is no linear
  association.
 There may be a perfect non-linear
  association.
The Uses of Frequency Distributions
   Becoming familiar with dataset.
   Cleaning the data.
       Outliers-values that lie outside the normal range of values for
        other cases.
   Inspecting the data for missing values.
   Testing assumptions for statistical tests.
       Assumption is a condition that is presumed to be true and
        when ignored or violated can lead to misleading or invalid
        results.
       When DV is not normally distributed researchers have to
        choose between three options:
            Select a statistical test that does not assume a normal distribution.
            Ignore the violation of the assumption.
             Transform the variable to better approximate a distribution that is
              normal. Please consult the various data transformations discussed earlier.
The Uses of Frequency Distributions
 Obtaining information about sample
  characteristics.
 Directly answering research questions.
Outliers
   Are values that are extreme relative to the bulk
    of scores in the distribution.
   They appear to be inconsistent with the rest of
    the data.
   Advantages:
       They may indicate characteristics of the population that
        would not be known in the normal course of analysis.
   Disadvantages:
       They do not represent the population
       Run counter to the objectives of the analysis
       Can distort statistical tests.
Sources of Outliers
 An error in the recording of the data.
 A failure of data collection, such as not
  following sample criteria (e.g.
  inadvertently admitting a disoriented
  patient into a study), a subject not
  following instructions on a questionnaire,
  or equipment failure.
 An actual extreme value from an unusual
  subject.
Methods to Identify Outliers
 The traditional way of labeling outliers is to
  flag any value more than 3 SDs from the mean.
 Values that are more than 3 IQRs from the
  upper or lower edge of the box plot are
  extreme outliers.
 Values between 1.5 and 3 IQRs from the
  upper and lower edges of the box are
  minor outliers. Both rules are sketched below.
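A minimal sketch of both labeling rules, assuming the scores are in a single NumPy array (the values are illustrative):

```python
import numpy as np

scores = np.array([12, 14, 15, 15, 16, 17, 18, 19, 20, 21, 58])  # illustrative scores

# Rule 1: any value more than 3 SDs from the mean.
z = (scores - scores.mean()) / scores.std(ddof=1)
sd_outliers = scores[np.abs(z) > 3]

# Rule 2: box-plot rule, measuring distance from the quartiles in IQR units.
q1, q3 = np.percentile(scores, [25, 75])
iqr = q3 - q1
extreme = scores[(scores < q1 - 3 * iqr) | (scores > q3 + 3 * iqr)]
minor = scores[((scores < q1 - 1.5 * iqr) | (scores > q3 + 1.5 * iqr))
               & ~((scores < q1 - 3 * iqr) | (scores > q3 + 3 * iqr))]

# Note: a single large outlier inflates the SD, so the 3-SD rule can miss a
# value (as here) that the box-plot rule still flags as an extreme outlier.
print("3-SD outliers:", sd_outliers)
print("Minor outliers (1.5-3 IQRs):", minor)
print("Extreme outliers (>3 IQRs):", extreme)
```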
Handling Outliers
   Analyze the data two ways:
       With the outliers in the distribution
       With outliers removed.
   If the results are similar, as they are likely to be if the
    sample size is large, then the outliers may be ignored.
   If the results are not similar, then a statistical analysis that
    is resistant to outliers can be used (e.g. median and IQR).
   If you want to use a mean with outliers, then the trimmed
    mean is an option. It is calculated with a certain percentage
    of the extreme values removed from both ends of the
    distribution (e.g. if n = 100, the 5% trimmed mean is the
    mean of the middle 90% of the observations). A minimal
    sketch follows.
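A minimal sketch of a trimmed mean, assuming a one-dimensional array of values; SciPy's scipy.stats.trim_mean(values, 0.05) performs the same calculation:

```python
import numpy as np

def trimmed_mean(values, proportion=0.05):
    """Mean after dropping `proportion` of the cases from each end of the distribution."""
    x = np.sort(np.asarray(values, dtype=float))
    k = int(round(proportion * x.size))   # cases removed from each tail
    return x.mean() if k == 0 else x[k:-k].mean()

# e.g. with n = 100 and proportion = 0.05, this averages the middle 90 observations.
```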
Handling Outliers
 Another alternative is a Winsorized mean.
 The highest and lowest extremes are
  replaced by the next-to-highest value and
  by the next-to-lowest value.
 For univariate outliers, Tabachnick and
  Fidell (2001) suggest changing the scores
  on the variables for the outlying cases so
  they are less deviant. E.g. if the two largest
  scores in the distribution are 125 and 122
  and the next largest score is 87, recode 122
  as 88 and 125 as 89. Both approaches are
  sketched below.
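A minimal sketch of both approaches, assuming a small NumPy array that mirrors the example above (87, 122 and 125 appear as in the slide; the other values are illustrative):

```python
import numpy as np

scores = np.array([60, 72, 75, 78, 80, 83, 85, 87, 122, 125])  # illustrative distribution

# Winsorized mean: replace the single highest and lowest values with
# their next-to-highest and next-to-lowest neighbours, then average.
x = np.sort(scores.astype(float))
x[0], x[-1] = x[1], x[-2]
winsorized_mean = x.mean()

# Recoding outliers so they stay extreme but less deviant (per the example above):
# the next largest non-outlying score is 87, so 122 becomes 88 and 125 becomes 89.
recoded = scores.copy()
recoded[recoded == 122] = 88
recoded[recoded == 125] = 89
```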
Outliers
    Steps in SPSS
1.   Analyze
2.   Descriptive Statistics
3.   Explore
4.   Statistics … (and Plots)
5.   Outliers
Missing Data
   Any systematic event external to the respondent (such as
    data entry errors or data collection problems) or action on
    the part of the respondent (such as refusal to answer) that
    leads to missing data.
   It means that analyses are based on fewer study
    participants than were in the full study sample. This, in
    turn, means less statistical power, which can undermine
    statistical conclusion validity-the degree to which the
    statistical results are accurate.
   Missing data can also affect internal validity - the degree to
    which inferences about the causal effect of the independent
    variable on the dependent variable are warranted - and can also
    affect external validity (generalizability).
Strategies to avoid Missing Data
 Persistent follow-up
 Flexibility in scheduling appointments
 Paying incentives.
 Using well-proven methods to track people
  who have moved.
 Performing a thorough review of
  completed data forms prior to excusing
  participants.
Factors to consider in designing a
missing values strategy
 Extent of missing data
 Pattern of missing data
 Nature of missing data.
 Role of the variable
 Level of measurement of the variable.
Extent of missing data
   Researchers usually handle the problem
    differently if there is only 1% missing data
     as opposed to, say, 25% missing.
Pattern of missing data
 It is more straightforward to deal with
  data that are missing in a haphazard,
  random fashion, as opposed to a
  systematic fashion that typically reflects a
  bias.
 Different patterns of missing data:
        Missing completely at random (MCAR)
        Missing at random (MAR)
        Missing not at random (MNAR).
Missing Completely at Random (MCAR)
   It means that the probability that the observation
    is missing is completely unrelated to either the
    value of the missing case or the value of any
    other variables.
   Occurs when cases with missing values are just a
    random subsample of all cases in the sample.
   When data are MCAR, analyses remain unbiased,
    although power is reduced.
   E.g. when one participant did not show up after
    the intervention because of an emergency. In this
    situation, the missing values are not related to
    the main variable or to the value of other
    characteristics, such as the person's age, sex or
    experimental group status.
Missing at Random (MAR)
 
 Data are considered MAR if missingness is related
  to other variables - but not related to the
  value of the variable that has the missing
  values.
 This pattern is perhaps the most prevalent
  pattern of missingness in clinical research.
 E.g. men were less likely to keep their
  follow-up appointment. Thus, missingness
  is related to a person’s gender.
Missing Not at Random (MNAR)
   A pattern in which the value of the
    variable that is missing is related to its
    missingness. This is often found for such
    variables as income (people may be reluctant
    to report their true income).
Nature of Missing Data
 
 Sometimes data are missing for only one item in a
  multi-item measure.
 Sometimes an entire variable is missing.
 In other situations, all data are missing for
  some study participants.
Role of the variable
   How one handles the missing data
    problem may depend on whether a
    variable is considered a primary outcome,
    a secondary outcome, an independent
    (predictor) variable, or control variable
    (covariate).
Level of Measurement of the Variable
   Some strategies are best applied when the
    variable is measured on an interval or
    ratio scale, while others only make sense
    for nominal-level variables.
Techniques for Handling Missing Data
   Deletion techniques. Involve excluding subjects
    with missing data from statistical calculations.
   Imputation techniques. Involve calculating an
    estimate of each missing value and replacing, or
    imputing, each value by its respective estimate.
   Note: techniques for handling missing data often
    vary in the degree to which they affect the
    amount of dispersion around true scores, and the
    degree of bias in the final results. Therefore, the
    selection of a data handling technique should be
    carefully considered.
Deletion Techniques
   Deletion methods involve removal of cases or variables with
    missing data.
   Listwise deletion. Also called complete case analysis. It is
    simply the analysis of those cases for which there are no
    missing data. It eliminates an entire case when any of its
    items/variables has a missing data point, whether or not
    that data point is part of the analysis. It is the default of
    the SPSS.
   Pairwise deletion. Also called available case analysis (sometimes
    dubbed "unwise deletion"). Involves omitting cases from the
    analysis on a variable-by-variable basis. It eliminates a case
    only when that case has missing data for the variables or items
    under analysis. Both approaches are contrasted in the sketch below.
   Note: deletion techniques are widely criticized because they
    assume that the data are MCAR (which is very difficult to
    ascertain), pose a risk for bias, and lead to reduction of
    sample size and power.
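A minimal pandas sketch contrasting the two deletion approaches, assuming a small data frame with scattered missing values (the variable names are illustrative):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "age":     [34, 29, np.nan, 41, 37],
    "income":  [52, np.nan, 47, 60, 55],
    "anxiety": [18, 22, 25, np.nan, 20],
})

# Listwise deletion (complete case analysis): any row with a missing
# value on any variable is dropped before analysis.
complete_cases = df.dropna()

# Pairwise deletion (available case analysis): each pairwise statistic
# uses every case with valid values on that pair of variables.
# pandas' corr() works this way by default.
pairwise_corr = df.corr()

print(len(df), "cases in total,", len(complete_cases), "complete cases")
print(pairwise_corr)
```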
Imputation Techniques
 Imputation is the process of estimating
  missing data based on valid values of
  other variables or cases in the sample.
 The goal of imputation is to use known
  relationships that can be identified in the
  valid values of the sample to help estimate
  the missing data.
Types of Imputation Techniques
 Using prior knowledge.
 Inserting mean values.
 Using regression
 Expectation maximization (EM).
 Multiple imputation.
Prior Knowledge
 Involves replacing a missing value with a
  value based on an educated guess.
 It is a reasonable method if the researcher
  has a good working knowledge of the
  research domain, the sample is large, and
  the number of missing values is small.
Mean Replacement
 Also called median replacement for
  skewed distribution.
 Involves calculating mean values from the
  available data on that variable and using
  them to replace missing values before
  analysis.
 It is a conservative procedure because the
  distribution mean as a whole does not
  change and the researcher does not have
  to guess at missing values.
Mean Replacement
   Advantages:
       Easily implemented and provides all cases with complete
        data.
       A compromise procedure is to insert a group mean for
        the missing values.
   Disadvantages:
       It invalidates the variance estimates derived from the
        standard variance formulas by understating the data's
        true variance.
       It distorts the actual distribution of values.
       It depresses the observed correlation that this variable
        will have with other variables because all missing data
        have a single constant value, thus reducing the
        variance.
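A minimal pandas sketch of mean replacement, including the group-mean compromise mentioned above; the data frame and variable names are illustrative, and .median() can be substituted for skewed distributions:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "group":   ["control", "control", "treatment", "treatment", "treatment"],
    "anxiety": [18, np.nan, 25, np.nan, 20],
})

# Overall mean replacement (use .median() instead for skewed distributions).
df["anxiety_mean_imputed"] = df["anxiety"].fillna(df["anxiety"].mean())

# Compromise: replace with the group mean rather than the overall mean.
group_means = df.groupby("group")["anxiety"].transform("mean")
df["anxiety_group_imputed"] = df["anxiety"].fillna(group_means)
```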
Using Regression
   Involves using other variables in the dataset as
    independent variables to develop a regression
    equation for the variable with missing data
    serving as the dependent variable.
   Cases with complete data are used to generate
    the regression equation.
   The equation is then used to predict missing
    values for incomplete cases.
   More regressions are computed, using the
    predicted values from the previous regression to
    develop the next equation, until the predicted
    values from one step to the next are comparable.
   Predictions from the last regression are the ones
    used to replace missing values. A single-predictor
    sketch follows.
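A single-predictor sketch of the mechanics, assuming NumPy arrays with the outcome partly missing; a real application would use several predictors and iterate as described above:

```python
import numpy as np

predictor = np.array([3.0, 5.0, 2.0, 8.0, 6.0, 4.0])
outcome   = np.array([10.0, 14.0, np.nan, 21.0, np.nan, 12.0])

observed = ~np.isnan(outcome)

# Fit the regression on the complete cases only.
slope, intercept = np.polyfit(predictor[observed], outcome[observed], 1)

# Use the fitted equation to predict (impute) the missing outcome values.
imputed = outcome.copy()
imputed[~observed] = intercept + slope * predictor[~observed]
```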
Using Regression
   Advantages:
       It is more objective than the researcher’s guess but not
        as blind as simply using the overall mean.
   Disadvantages:
       It reinforces the relationships already in the data,
        resulting in less generalizability.
       The variance of the distribution is reduced because the
        estimate is probably too close to the mean.
       It assumes that the variable with missing data is
        correlated substantially with the other variables in the
        dataset.
       The regression procedure is not constrained in the
        estimates it makes.
Expectation Maximization
   For randomly missing data.
   It is an iterative process that proceeds in two
    discrete steps:
   In the expectation (E) step, the conditional
    expected value of the complete data is computed,
    given the observed values and the current parameter
    estimates (such as correlations).
   In the maximization (M) step, these expected
    values are substituted for the missing data, and
    maximum likelihood estimates are computed
    as though there were no missing data (a toy
    sketch follows).
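A toy univariate sketch of the alternation between the two steps, assuming a normally distributed variable with a few missing values; real EM for missing data works with the full mean vector and covariance matrix, but the structure of the iteration is the same:

```python
import numpy as np

y = np.array([4.2, np.nan, 5.1, 3.8, np.nan, 4.7])
miss = np.isnan(y)
obs = y[~miss]

# Start from the complete-case estimates.
mu, var = obs.mean(), obs.var()

for _ in range(50):
    # E-step: expected values (and squares) of the missing observations,
    # given the current estimates of the mean and variance.
    expected_y = np.where(miss, mu, y)
    expected_y2 = np.where(miss, mu**2 + var, y**2)
    # M-step: maximum likelihood estimates as if the data were complete.
    mu = expected_y.mean()
    var = expected_y2.mean() - mu**2
```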
Multiple Imputation
 It produces several datasets and analyzes
  them separately.
 One set of parameters is then formed by
  averaging the resulting estimates and
  standard errors.
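A deliberately simplified sketch of the idea, assuming the missing values are filled by random draws from the observed values (a crude stand-in for a proper imputation model); the point is the loop over m imputed datasets and the averaging of the resulting estimates:

```python
import numpy as np

rng = np.random.default_rng(0)
y = np.array([12.0, np.nan, 15.0, 14.0, np.nan, 18.0, 16.0])
observed = y[~np.isnan(y)]

m = 5
estimates = []
for _ in range(m):
    filled = y.copy()
    # Fill each missing value with a random draw from the observed values.
    filled[np.isnan(filled)] = rng.choice(observed, size=np.isnan(filled).sum())
    estimates.append(filled.mean())     # the analysis of one imputed dataset

pooled_estimate = np.mean(estimates)    # average the m estimates, as described above
```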
Multiple Imputation
   Advantages:
       It makes no assumptions about whether data are
        randomly missing but incorporates random error
        because it requires random variation in the imputation
        process.
       It permits use of complete-data methods for data
        analysis and also includes the data collector’s
        knowledge.
       It permits estimates of nonlinear models.
       It simulates proper inference from data and increases
        efficiency of the estimates by minimizing standard
        errors.
       It is the method of choice for databases that are made
        available for analyses outside the agency that collected
        the data.
Multiple Imputation
   Disadvantages:
       It requires conceptual intensiveness to carry
        out MI, including special software and model
        building.
       It does not produce a unique answer because
        randomness is preserved in the MI process,
        making reproducibility of exact results
        problematic.
       It requires large amounts of data storage
        space, which may exceed the space on personal
        computers' hard drives.
Descriptive statistics ii

  • 2. Shapes of Distribution  A third important property of data – after location and dispersion - is its shape  Distributions of quantitative variables can be described in terms of a number of features, many of which are related to the distributions’ physical appearance or shape when presented graphically.  modality  Symmetry and skewness  Degree of skewness  Kurtosis
  • 3. Modality  The modality of a distribution concerns how many peaks or high points there are.  A distribution with a single peak, one value a high frequency is a unimodal distribution.
  • 4. Modality  A distribution with two or more peaks called multimodal distribution.
  • 5. Symmetry and Skewness  A distribution is symmetric if the distribution could be split down the middle to form two haves that are mirror images of one another.  In asymmetric distributions, the peaks are off center, with a bull of scores clustering at one end, and a tail trailing off at the other end. Such distributions are often describes as skewed.  When the longer tail trails off to the right this is a positively skewed distribution. E.g. annual income.  When the longer tail trails off to the left this is called negatively skewed distribution. E.g. age at death.
  • 6. Symmetry and Skewness  Shape can be described by degree of asymmetry (i.e., skewness).  mean > median positive or right-skewness  mean = median symmetric or zero-skewness  mean < median negative or left-skewness  Positive skewness can arise when the mean is increased by some unusually high values.  Negative skewness can arise when the mean is decreased by some unusually low values.
  • 7. Left skewed: Right skewed: Symmetric: 7
  • 8. Shapes of the Distribution  Three common shapes of frequency distributions: A B C Symmetrical Positively Negatively and bell skewed or skewed or shaped skewed to skewed to the right the left March 28, 2013 8
  • 9. Shapes of the Distribution  Three less common shapes of frequency distributions: A B C Bimodal Reverse Uniform J-shaped March 28, 2013 9
  • 10. This guy took a VERY long time! 10
  • 11. Degree of Skewness  A skewness index can readily be calculated most statistical computer program in conjunction with frequency distributions  The index has a value of 0 for perfectly symmetric distribution.  A positive value if there is a positive skew, and negative value if there is a negative skew.  A skewness index that is more than twice the value of its standard error can be interpreted as a departure from symmetry.
  • 12. Measures of Skewness or Symmetry  Pearson’s skewness coefficient  It is nonalgebraic and easily calculated. Also it is useful for quick estimates of symmetry .  It is defined as: skewness = mean-median/SD  Fisher’s measure of skewness.  It is based on deviations from the mean to the third power.
  • 13. Pearson’s skewness coefficient  For a perfectly symmetrical distribution, the mean will equal the median, and the skewness coefficient will be zero. If the distribution is positively skewed the mean will be more than the median and the coefficient will be the positive. If the coefficient is negative, the distribution is negatively skewed and the mean less than the median.  Skewness values will fall between -1 and +1 SD units. Values falling outside this range indicate a substantially skewed distribution.  Hildebrand (1986) states that skewness values above 0.2 or below -0.2 indicate severe skewness.
  • 14. Assumption of Normality  Many of the statistical methods that we will apply require the assumption that a variable or variables are normally distributed.  With multivariate statistics, the assumption is that the combination of variables follows a multivariate normal distribution.  Since there is not a direct test for multivariate normality, we generally test each variable individually and assume that they are multivariate normal if they are individually normal, though this is not necessarily the case.
  • 15. Evaluating normality  There are both graphical and statistical methods for evaluating normality.  Graphical methods include the histogram and normality plot.  Statistical methods include diagnostic hypothesis tests for normality, and a rule of thumb that says a variable is reasonably close to normal if its skewness and kurtosis have values between –1.0 and +1.0.  None of the methods is absolutely definitive.
  • 16. Transformations  When a variable is not normally distributed, we can create a transformed variable and test it for normality. If the transformed variable is normally distributed, we can substitute it in our analysis.  Three common transformations are: the logarithmic transformation, the square root transformation, and the inverse transformation.  All of these change the measuring scale on the horizontal axis of a histogram to produce a transformed variable that is mathematically equivalent to the original variable.
  • 17. Types of Data Transformations  for moderate skewness, use a square root transformation.  For substantial skewness, use a log transformation.  For sever skewness, use an inverse transformation.
  • 18. Computing “Explore” descriptive statistics To compute the statistics needed for evaluating the normality of a variable, select the Explore… command from the Descriptive Statistics menu.
  • 19. Adding the variable to be evaluated Second, click on right arrow button to move the highlighted variable to the Dependent List. First, click on the variable to be included in the analysis to highlight it.
  • 20. Selecting statistics to be computed To select the statistics for the output, click on the Statistics… command button.
  • 21. Including descriptive statistics First, click on the Descriptives checkbox to select it. Clear the other checkboxes. Second, click on the Continue button to complete the request for statistics.
  • 22. Selecting charts for the output To select the diagnostic charts for the output, click on the Plots… command button.
  • 23. Including diagnostic plots and statistics First, click on the None option button on the Boxplots panel since boxplots are not as helpful as other charts in assessing normality. Finally, click on the Continue button to complete the request. Second, click on the Normality plots with tests Third, click on the Histogram checkbox to include checkbox to include a normality plots and the histogram in the output. You hypothesis tests for may want to examine the normality. stem-and-leaf plot as well, though I find it less useful.
  • 24. Completing the specifications for the analysis Click on the OK button to complete the specifications for the analysis and request SPSS to produce the output.
  • 25. The histogram Histogram An initial impression of the normality of the distribution 50 can be gained by examining the histogram. 40 In this example, the histogram shows a substantial violation of normality caused 30 by a extremely large value in the distribution. 20 Frequency 10 Std. Dev = 15.35 Mean = 10.7 0 N = 93.00 0.0 20.0 40.0 60.0 80.0 100.0 10.0 30.0 50.0 70.0 90.0 TOTAL TIME SPENT ON THE INTERNET
  • 26. The normality plot Normal Q-Q Plot of TOTAL TIME SPENT ON THE INTERNET 3 2 1 0 The problem with the normality of this variable’s distribution is reinforced by the Expected Normal -1 normality plot. -2 If the variable were normally distributed, the red dots would fit the green line very closely. In this case, the red points in the -3 upper right of the chart indicate the -40 -20 0 20 40 60 80 100 120 severe skewing caused by the extremely large data values. Observed Value
  • 27. The test of normality Tests of Normality a Kolmogorov-Smirnov Shapiro-Wilk Statistic df Sig. Statistic df Sig. TOTAL TIME SPENT .246 93 .000 .606 93 .000 ON THE INTERNET a. Lilliefors Significance Correction Problem 1 asks about the results of the test of normality. Since the sample size is larger than 50, we use the Kolmogorov-Smirnov test. If the sample size were 50 or less, we would use the Shapiro-Wilk statistic instead. The null hypothesis for the test of normality states that the actual distribution of the variable is equal to the expected distribution, i.e., the variable is normally distributed. Since the probability associated with the test of normality is < 0.001 is less than or equal to the level of significance (0.01), we reject the null hypothesis and conclude that total hours spent on the Internet is not normally distributed. (Note: we report the probability as <0.001 instead of .000 to be clear that the probability is not really zero.) The answer to problem 1 is false.
  • 28. The assumption of normality script An SPSS script to produce all of the output that we have produced manually is available on the course web site. After downloading the script, run it to test the assumption of linearity. Select Run Script… from the Utilities menu.
  • 29. Selecting the assumption of normality script First, navigate to the folder containing your scripts and highlight the NormalityAssumptionAndTransformations.SBS script. Second, click on the Run button to activate the script.
  • 30. Specifications for normality script First, move variables from the list of variables in the data set to the Variables to Test list box. The default output is to do all of the transformations of the variable. To exclude some transformations from the Third, click on the OK calculations, clear the checkboxes. button to run the script.
  • 31. The test of normality Tests of Normality a Kolmogorov-Smirnov Shapiro-Wilk Statistic df Sig. Statistic df Sig. TOTAL TIME SPENT .246 93 .000 .606 93 .000 ON THE INTERNET a. Lilliefors Significance Correction The script produces the same output that we computed manually, in this example, the tests of normality.
  • 32. When transformations do not work  When none of the transformations induces normality in a variable, including that variable in the analysis will reduce our effectiveness at identifying statistical relationships, i.e. we lose power.  We do have the option of changing the way the information in the variable is represented, e.g. substitute several dichotomous variables for a single metric variable.
  • 33. Fisher’s Measure of Skewness  The formula for Fisher’s skewness statistic is based on deviations from the mean to the third power.  The measure of skewness can be interpreted in terms of the normal curve  A symmetrical curve will result in a value of 0.  If the skewness value is positive, them the curve is skewed to the right, and vice versa for a distribution skewed to the left.  A z-score is calculated by dividing the measure of skewness by the standard error for skewness. Values above +1.96 or below -1.96 are significant at the 0.05 level because 95% of the scores in a normal deviation fall between +1.96 and -1.96 from the mean.  E.g. if Fisher’s skewness= 0.195 and st.err. =0.197 the z- score = 0.195/0.197 = 0.99
  • 34. Kurtosis  The distribution’s kurtosis is concerns how pointed or flat its peak.  Two types:  Leptokurtic distribution (mean thin).  Platykurtic distribution (means flat).
  • 35. Kurtosis  There is a statistical index of kurtosis that can be computed when computer programs are instructed to produce a frequency distribution  For kurtosis index, a value of zero indicates a shape that is neither flat nor pointed.  Positive values on the kurtosis statistics indicate greater peakedness, and negative values indicate greater flatness.
  • 36. Fishers’ measure of Kurtosis Fisher’s measure is based on deviation from the mean to the fourth power.  A z-score is calculated by dividing the measure of kurtosis by the standard error for kurtosis.
  • 37. Table of descriptive statistics Descriptives Statistic Std. Error TOTAL TIME SPENT Mean 10.731 1.5918 ON THE INTERNET 95% Confidence Lower Bound 7.570 Interval for Mean Upper Bound 13.893 5% Trimmed Mean 8.295 Median 5.500 Variance 235.655 To answer problem Std. Deviation 15.3511 2, we look at the Minimum .2 values for skewness Maximum 102.0 and kurtosis in the Range 101.8 Descriptives table. Interquartile Range 10.200 Skewness 3.532 .250 Kurtosis 15.614 .495 The skewness and kurtosis for the variable both exceed the rule of thumb criteria of 1.0. The variable is not normally distributed. The answer to problem 2 if false.
  • 38. Other problems on assumption of normality may ask about the assumption of  A problem normality for a nominal level variable. The answer will be “An inappropriate application of a statistic” since there is no expectation that a nominal variable be normal.  A problem may ask about the assumption of normality for an ordinal level variable. If the variable or transformed variable is normal, the correct answer to the question is “True with caution” since we may be required to defend treating an ordinal variable as metric.  Questions will specify a level of significance to use and the statistical evidence upon which you should base your answer.
  • 39. Normal Distribution  Also called belt shaped curve, normal curve, or Gaussian distribution.  A normal distribution is one that is unimodal, symmetric, and not too peaked or flat.  Given its name by the French mathematician Quetelet who, in the early 19th century noted that many human attributes, e.g. height, weight, intelligence appeared to be distributed normally.
  • 40. Normal Distribution  The normal curve is unimodal and symmetric about its mean (µ).  In this distribution the mean, median and mode are all identical.  The standard deviation (σ) specifies the amount of dispersion around the mean.  The two parameters µ and σ completely define a normal curve. March 28, 2013 40
  • 41.  Also called a Probability density function. The probability is interpreted as "area under the curve."  The random variable takes on an infinite # of values within a given interval  The probability that X = any particular value is 0. Consequently, we talk about intervals. The probability is = to the area under the curve.  The area under the whole curve = 1. 41
  • 42. 42
  • 43. Normal Distribution .X is the random variable .μ is the mean value .σ is the standard deviation (std) value .e = 2.7182818... constant .π = 3.1415926... constant
  • 44. Importance of Normal Distribution to Statistics  Although most distributions are not exactly normal, most variables tend to have approximately normal distribution.  Many inferential statistics assume that the populations are distributed normally.  The normal curve is a probability distribution and is used to answer questions about the likelihood of getting various particular outcomes when sampling from a population.
  • 45.  Probabilitiesare obtained by getting the area under the curve inside of a particular interval. The area under the curve = the proportion of times under identical (repeated) conditions that a particular range of values will occur.  Characteristics of the Normal distribution:  It is symmetric about the mean μ.  Mean = median = mode. [“bell-shaped” curve]  f(X) decreases as X gets farther and farther away from the mean. It approaches horizontal axis asymptotically: - ∞ < X < + ∞. This means that there is always some probability (area) for extreme values. 45
  • 46. Why Do We Like The Normal ?Distribution So Much  There is nothing “special” about standard normal scores  These can be computed for observations from any sample/population of continuous data values  The score measures how far an observation is from its mean in standard units of statistical distance  But, if distribution is not normal, we may not be able to use Z-score approach. March 28, 2013 46
  • 47. Probability Distributions  Any characteristic that can be measured or categorized is called a variable.  If the variable can assume a number of different values such that any particular outcome is determined by chance it is called a random variable.  Every random variable has a corresponding probability distribution.  The probability distribution applies the theory of probability to describe the behavior of the random variable. March 28, 2013 47
  • 48. Discrete Probability Distributions  Binomial distribution – the random variable can only assume 1 of 2 possible outcomes. There are a fixed number of trials and the results of the trials are independent.  i.e. flipping a coin and counting the number of heads in 10 trials.  Poisson Distribution – random variable can assume a value between 0 and infinity.  Counts usually follow a Poisson distribution (i.e. number of ambulances needed in a city in a given night) March 28, 2013 48
  • 49. Discrete Random Variable  A discrete random variable X has a finite number of possible values. The probability distribution of X lists the values and their probabilities. Value of X x1 x2 x3 … xk Probability p1 p2 p3 … pk 1. Every probability pi is a number between 0 and 1. 2. The sum of the probabilities must be 1.  Find the probabilities of any event by adding the probabilities of the particular values that make up the event. March 28, 2013 49
  • 50. Example  The instructor in a large class gives 15% each of A’s and D’s, 30% each of B’s and C’s and 10% F’s. The student’s grade on a 4-point scale is a random variable X (A=4). Grade F=0 D=1 C=2 B=3 A=4 Probability 0.10 15. 30. 30. 15.  What is the probability that a student selected at random will have a B or better?  ANSWER: P (grade of 3 or 4)=P(X=3) + P(X=4) = 0.3 + 0.15 = 0.45 March 28, 2013 50
  • 51. Continuous Probability Distributions  When it follows a Binomial or a Poisson distribution the variable is restricted to taking on integer values only.  Between two values of a continuous random variable we can always find a third.  A histogram is used to represent a discrete probability distribution and a smooth curve called the probability density is used to represent a continuous probability distribution. March 28, 2013 51
  • 52. Normal Distribution Q Is every variable normally distributed? A Absolutely not Q Then why do we spend so much time studying the normal distribution? A Some variables are normally distributed; a bigger reason is the “Central Limit Theorem”!!!!!!!!!!!!!!!!!!!!!!!!!!!?????????? ? March 28, 2013 52
  • 53. Central Limit Theorem  describes the characteristics of the "population of the means" which has been created from the means of an infinite number of random population samples of size (N), all of them drawn from a given "parent population".  It predicts that regardless of the distribution of the parent population:  The mean of the population of means is always equal to the mean of the parent population from which the population samples were drawn.  The standard deviation of the population of means is always equal to the standard deviation of the parent population divided by the square root of the sample size (N).  The distribution of means will increasingly approximate a normal distribution as the size N of samples increases.
  • 54. Central Limit Theorem  A consequence of Central Limit Theorem is that if we average measurements of a particular quantity, the distribution of our average tends toward a normal one.  In addition, if a measured variable is actually a combination of several other uncorrelated variables, all of them "contaminated" with a random error of any distribution, our measurements tend to be contaminated with a random error that is normally distributed as the number of these variables increases.  Thus, the Central Limit Theorem explains the ubiquity of the famous bell-shaped "Normal distribution" (or "Gaussian distribution") in the measurements domain.
  • 55. Note that the normal distribution is defined by two parameters, μ and σ . You can draw a normal distribution for any μ and σ combination. There is one normal distribution, Z, that is special. It has a μ = 0 and a σ = 1. This is the Z distribution, also called the standard normal distribution. It is one of trillions of normal distributions we could have selected. 55
  • 56. Standard Normal Variable  It is customary to call a standard normal random variable Z.  The outcomes of the random variable Z are denoted by z.  The table in the coming slide give the area under the curve (probabilities) between the mean and z.  The probabilities in the table refer to the likelihood that a randomly selected value Z is equal to or less than a given value of z and greater than 0 (the mean of the standard normal). March 28, 2013 56
  • 57. Source: Levine et al, Business Statistics, Pearson. 57
  • 58. The 68-95-99.7 Rule for the Normal Distribution  68% of the observations fall within one standard deviation of the mean  95% of the observations fall within two standard deviations of the mean  99.7% of the observations fall within three standard deviations of the mean  When applied to ‘real data’, these estimates are considered approximate! March 28, 2013 58
  • 59. (:Remember these probabilities (percentages standard deviations# Approx. area under the from the mean normal curve ±1 68. ±1.645 90. ±1.96 95. ±2 955. ±2.575 99. ±3 997. Practice: Find these values yourself using the Z table. 59 Two Sample Z Test
  • 61. Standard Normal Distribution 50% of probability in 50% of probability in here –probability=0.5 here–probability=0.5 March 28, 2013 61
  • 62. Standard Normal Distribution 95% of probability in here 2.5% of probability 2.5% of probability in here in here Standard Normal Distribution with 95% area marked March 28, 2013 62
  • 63. Calculating Probabilities  Probability calculations are always concerned with finding the probability that the variable assumes any value in an interval between two specific points a and b.  The probability that a continuous variable assumes the a value between a and b is the area under the graph of the density between a and b. March 28, 2013 63
  • 64. If the weight of males is N.D. with μ=150 and σ=10, what is the probability that a randomly selected male will weigh between 140 lbs and 155 lbs? [Important Note: Always remember that the probability that X is equal to any one particular value is zero, P(X=value) =0, since the normal distribution is continuous.] Normal Distribution 64
  • 65. Solution: 140 150 155 X -1 Z 0 0.5 Z = (140 – 150)/ 10 = -1.00 s.d. from mean Area under the curve = .3413 (from Z table) Z = (155 – 150) / 10 =+.50 s.d. from mean Area under the curve = .1915 (from Z table) Answer: .3413 + .1915 = .5328 65
  • 66. Example  For example: What’s the probability of getting a math SAT score of 575 or less, µ=500 and σ=50? 575 − 500 Z= = 1. 5 50 i.e., A score of 575 is 1.5 standard deviations above the mean 575 1 x − 500 2 1.5 1 1 − ( ) 1 − Z2 ∴ P ( X ≤ 575) = ∫ (50) 200 2π ⋅ e 2 50 dx  → ∫ −∞ 2π ⋅ e 2 dz Yikes! But to look up Z= 1.5 in standard normal chart (or enter into SAS) no problem! = .9332
  • 67. If IQ is ND with a mean of 100 and a S.D. of 10, what percentage of the population will have (a)IQs ranging from 90 to 110? (b)IQs ranging from 80 to 120? Solution: Z = (90 – 100)/10 = -1.00 Z = (110 -100)/ 10 = +1.00 Area between 0 and 1.00 in the Z-table is .3413; Area between 0 and -1.00 is also .3413 (Z-distribution is symmetric). Answer to part (a) is .3413 + .3413 = .6826. 67
  • 68. (b) IQs ranging from 80 to 120? Solution: Z = (80 – 100)/10 = -2.00 Z = (120 -100)/ 10 = +2.00 Area between =0 and 2.00 in the Z-table is .4772; Area between 0 and -2.00 is also . 4772 (Z-distribution is symmetric). Answer is .4772 + .4772 = .9544. 68
  • 69. Suppose that the average salary of college graduates is N.D. with μ=$40,000 and σ=$10,000. (a) What proportion of college graduates will earn $24,800 or less? (b) What proportion of college graduates will earn $53,500 or more? (c) What proportion of college graduates will earn between $45,000 and $57,000? (d) Calculate the 80th percentile. (e) Calculate the 27th percentile. 69
  • 70. (a) What proportion of college graduates will earn $24,800 or less? Solution: Convert the $24,800 to a Z-score: Z = ($24,800 - $40,000)/$10,000 = -1.52. Always DRAW a picture of the distribution to help you solve these problems. 70
  • 71. .4357 $24,800 $40,000 X -1.52 0 Z First Find the area between 0 and -1.52 in the Z-table. From the Z table, that area is .4357. Then, the area from -1.52 to - ∞ is .5000 - .4357 = .0643. Answer: 6.43% of college graduates will earn less than $24,800. 71
  • 72. (b) What proportion of college graduates will earn .4115 $53,500 or more? .0885 Solution: $40,000 $53,500 Convert the $53,500 to a Z-score. 0 +1.35 Z Z = ($53,500 - $40,000)/$10,000 = +1.35. Find the area between 0 and +1.35 in the Z- table: .4115 is the table value. When you DRAW A PICTURE (above) you see that you need the area in the tail: .5 - .4115 - .0885. Answer: .0885. Thus, 8.85% of college graduates will earn $53,500 or more. 72
  • 73. .4 55 4 .1915 (c) What proportion of college graduates will earn between $45,000 and $57,000? $40k $45k $57k 0 .5 1.7 Z Z = $45,000 – $40,000 / $10,000 = .50 Z = $57,000 – $40,000 / $10,000 = 1.70 From the table, we can get the area under the curve between the mean (0) and .5; we can get the area between 0 and 1.7. From the picture we see that neither one is what we need. What do we do here? Subtract the small piece from the big piece to get exactly what we need. Answer: .4554 − .1915 = .2639 73
  • 74. Parts (d) and (e) of this example ask you to compute percentiles. Every Z-score is associated with a percentile. A Z-score of 0 is the 50th percentile. This means that if you take any test that is normally distributed (e.g., the SAT exam), and your Z-score on the test is 0, this means you scored at the 50th percentile. In fact, your score is the mean, median, and mode. 74
  • 75. (d) Calculate the 80th percentile. .5000 .3000 Solution: $40,000 First, what Z-score is associated 0 .84 Z with the 80 percentile? th A Z-score of approximately +.84 will give you about .3000 of the area under the curve. Also, the area under the curve between -∞ and 0 is . 5000. Therefore, a Z-score of +.84 is associated with the 80th percentile. ANSWER Now to find the salary (X) at the 80th percentile: Just solve for X: +.84 = (X−$40,000)/$10,000 75
  • 76. (e) Calculate the 27th percentile. .2300 .5000 Solution: First, what Z-score is associated .2700 with the 27th percentile? A Z-score $40,000 -.61 0 of approximately -.61will give you Z about .2300 of the area under the curve, with .2700 in the tail. (The area under the curve between 0 and -.61 is .2291 which we are rounding to .2300). Also, the area under the curve between 0 and ∞ is .5000. Therefore, a Z-score of -.61 is associated with the 27th percentile. ANSWER Now to find the salary (X) at the 27th percentile: Just solve for X: -0.61 =(X−$40,000)/$10,000 X = $40,000 - $6,100 = $33,900 76
  • 77. T-Distribution  Similar to the standard normal in that it is unimodal, bell- shaped and symmetric.  The tail on the distribution are “thicker” than the standard normal  The distribution is indexed by “degrees of freedom” (df).  The degrees of freedom measure the amount of information available in the data set that can be used for estimating the population variance (df=n-1).  Area under the curve still equals 1.  Probabilities for the t-distribution with infinite df equals those of the standard normal. March 28, 2013 77
  • 78. T-Distribution  The table of t-distribution will give you the probability to the right of a critical value – i.e. area in the upper tail.  We are only given the area (or probability) for a few selected critical values for each degree of freedom. March 28, 2013 78
• 79. T-Distribution Example  For a t-curve from a sample of size 15, find the area to the left of 2.145.  Answer: df = 15 − 1 = 14.  In the table of the t-distribution, the area to the right of 2.145 is 0.025.  Therefore the area to the left of 2.145 is: 1 − 0.025 = 0.975. March 28, 2013 79
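As a quick numerical check of this example, here is a minimal sketch using scipy.stats.t (an assumed tool; the slide itself relies on the printed t-table).

```python
from scipy.stats import t

df = 15 - 1                      # degrees of freedom for a sample of size 15
print(t.cdf(2.145, df))          # area to the LEFT of 2.145 ≈ 0.975
print(1 - t.cdf(2.145, df))      # area to the RIGHT (upper tail) ≈ 0.025
```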
  • 80. Graphical Methods  Frequency Distribution  Histogram  Frequency Polygon  Cumulative Frequency Graph  Pie Chart. March 28, 2013 80
  • 81. Presenting Data  Table  Condenses data into a form that can make them easier to understand;  Shows many details in summary fashion; BUT  Since table shows only numbers, it may not be readily understood without comparing it to other values.
• 82. Principles of Table Construction  Don’t try to do too much in a table.  Use white space effectively to make the table layout pleasing to the eye.  Make sure tables & text refer to each other.  Use some aspect of the table to order & group rows & columns.
  • 83. Principles of Table Construction  If appropriate, frame table with summary statistics in rows & columns to provide a standard of comparison.  Round numbers in table to one or two decimal places to make them easily understood.  When creating tables for publication in a manuscript, double-space them unless contraindicated by journal.
  • 84. Frequency Distributions  A useful way to present data when you have a large data set is the formation of a frequency table or frequency distribution.  Frequency – the number of observations that fall within a certain range of the data. March 28, 2013 84
• 85. Frequency Table
Age      Number of Deaths
<1       564
1-4      86
5-14     127
15-24    490
25-34    66
35-44    806
45-54    1,425
55-64    3,511
65-74    6,932
75-84    10,101
85+      9,825
Total    34,524
85
• 86. Frequency Table
Data Intervals   Frequency   Cumulative Frequency   Relative Frequency (%)   Cumulative Relative Frequency (%)
10-19            5           5
20-29            18          23
30-39            10          33
40-49            13          46
50-59            4           50
60-69            4           54
70-79            2           56
Total            56
86
  • 87. Cumulative Relative Frequency  Cumulative Relative Frequency – the percentage of persons having a measurement less than or equal to the upper boundary of the class interval.  i.e. cumulative relative frequency for the 3rd interval of our data example:  8.8+13.3+17.5 = 59.6% - We say that 59.6% of the children have weights below 39.5 pounds. March 28, 2013 87
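To illustrate how such a frequency table, with relative and cumulative relative frequencies, can be produced in software, here is a minimal sketch with pandas. The weights and class boundaries are hypothetical illustration values, not the data behind the slide.

```python
import pandas as pd

# hypothetical weights (pounds) for a small group of children
weights = pd.Series([12, 18, 22, 25, 31, 33, 38, 41, 44, 52, 55, 61, 67, 72])

# group into 10-unit class intervals and count frequencies
bins = list(range(9, 90, 10))
freq = pd.cut(weights, bins=bins).value_counts().sort_index()

table = pd.DataFrame({
    "frequency": freq,
    "cumulative frequency": freq.cumsum(),
    "relative frequency (%)": (freq / freq.sum() * 100).round(1),
    "cumulative relative frequency (%)": (freq / freq.sum() * 100).cumsum().round(1),
})
print(table)
```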
• 88. Number of Intervals  There is no clear-cut rule on the number of intervals or classes that should be used.  Too many intervals – the data may not be summarized enough for a clear visualization of how they are distributed.  Too few intervals – the data may be over-summarized and some of the details of the distribution may be lost. March 28, 2013 88
• 89. Presenting Data  Chart – Visual representation of a frequency distribution that helps to gain insight about what the data mean.  Built with lines, areas & text. Ex: bar chart, pie chart.
• 90. Bar Chart  Simplest form of chart.  Used to display nominal or ordinal data. [Figure: bar chart of percent responses (Never, Seldom, Sometimes, Frequently) for Ethical Issues Scale item 8, “Acting against your own personal/religious views”.]
• 91. Horizontal Bar Chart [Figure: horizontal bar chart of percent of respondents by clinical practice area (acute care, critical care, gerontology, post anesthesia, perinatal, clinical research, family nursing, neonatal, psych/mental health, community health, general practice, orthopedics, primary care, operating room, medical oncology, other).]
• 92. Cluster Bar Chart [Figure: clustered bar chart of percent in each employment category (full-time RN, part-time RN, self-employed) by highest education (diploma, associate degree, bachelor degree, post-bac RN).]
• 93. Pie Chart  Alternative to bar chart.  Circle partitioned into percentage distributions of qualitative variables, with total area of 100%. [Figure: pie chart of highest credential (diploma-nursing, AD nursing, BS nursing/non-nursing, MS nursing/non-nursing, doctorate nursing/non-nursing, juris doctor, missing).]
  • 94. Histogram  Appropriate for interval, ratio and sometimes ordinal data  Similar to bar charts but bars are placed side by side  Often used to represent both frequencies and percentages  Most histograms have from 5 to 20 bars
• 95. Histogram [Figure: histogram of SF-36 vitality scores (N = 439, Mean = 61.6, Std. Dev = 22.17), frequencies plotted for score intervals from 0 to 100.]
• 96. Pictures of Data: Histograms [Figure: histogram of the systolic blood pressure for a sample of 113 men. Each bar spans a width of 5 mmHg on the horizontal axis; the height of each bar represents the number of individuals with SBP in that range.] March 28, 2013 96
• 97. Frequency Polygon  First place a dot at the midpoint of the upper base of each rectangular bar.  The points are connected with straight lines.  At the ends, the points are connected to the midpoints of the previous and succeeding intervals (these intervals have zero frequency). [Figure: frequency polygon of children’s weights, with class midpoints from 4.5 to 84.5 on the horizontal axis.] March 28, 2013 97
  • 98. Hallmarks of a Good Chart  Simple & easy to read  Placed correctly within text  Use color only when it has a purpose, not solely for decoration  Make sure others can understand chart; try it out on somebody first  Remember: A poor chart is worse than no chart at all.
• 99. Cumulative Frequency Plot  Place a point with the horizontal axis marked at the upper class boundary and the vertical axis marked at the corresponding cumulative frequency.  Each point represents the cumulative relative frequency, and the points are connected with straight lines.  The left end is connected to the lower boundary of the first interval that has data. [Figure: cumulative frequency plot of weights of daycare children, percent of children (0–120%) versus weight range with boundaries 9.5 to 89.5.] March 28, 2013 99
  • 100. Coefficient of Correlation  Measure of linear association between 2 continuous variables.  Setting:  two measurements are made for each observation.  Sample consists of pairs of values and you want to determine the association between the variables. March 28, 2013 100
  • 101. Association Examples  Example 1: Association between a mother’s weight and the birth weight of her child  2 measurements: mother’s weight and baby’s weight  Both continuous measures  Example 2: Association between a risk factor and a disease  2 measurements: disease status and risk factor status  Both dichotomous measurements March 28, 2013 101
  • 102. Correlation Analysis  When you have 2 continuous measurements you use correlation analysis to determine the relationship between the variables.  Through correlation analysis you can calculate a number that relates to the strength of the linear association. March 28, 2013 102
• 103. Types of Relationships  There are 2 types of relationships:  Deterministic relationship – the values of the 2 variables are related through an exact mathematical formula.  Statistical relationship – the variables are associated, but not through an exact formula; the relationship is not perfect. March 28, 2013 103
  • 104. Scatter Plots and Association  You can plot the 2 variables in a scatter plot (one of the types of charts in SPSS/Excel).  The pattern of the “dots” in the plot indicate the statistical relationship between the variables (the strength and the direction).  Positive relationship – pattern goes from lower left to upper right.  Negative relationship – pattern goes from upper left to lower right.  The more the dots cluster around a straight line the stronger the linear relationship. March 28, 2013 104
• 105. Birth Weight Data
x – birth weight in ounces; y – increase in weight between the 70th and 100th days of life, expressed as a percentage of birth weight.
x (oz)   y (%)
112      63
111      66
107      72
119      52
92       75
80       118
81       120
84       114
118      42
106      72
103      90
94       91
105
• 106. Pearson Correlation Coefficient [Figure: scatter plot of increase in birth weight (%) versus birth weight (in ounces) for the data above; the points run from upper left to lower right, indicating a strong negative linear relationship.] March 28, 2013 106
• 107. Calculations of Correlation Coefficient  In Excel (Data Analysis ToolPak; in SPSS the equivalent is Analyze > Correlate > Bivariate):  Go to the TOOLS menu and select DATA ANALYSIS.  Highlight CORRELATION and click “OK”.  Enter the INPUT RANGE (the 2 columns of data that contain “x” and “y”).  Enter the OUTPUT RANGE (the cells where you want the answer to be placed) and click “OK”. March 28, 2013 107
• 108. Pearson Correlation Results
          x (oz)      y (%)
x (oz)    1
y (%)     -0.94629    1
Pearson Correlation Coefficient = -0.946. Interpretation: values near 1 indicate a strong positive linear relationship; values near −1 indicate a strong negative linear relationship; values near 0 indicate a weak linear association. March 28, 2013 108
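The same coefficient can be reproduced outside SPSS/Excel. Here is a minimal sketch with scipy.stats.pearsonr (an assumed tool) using the birth-weight data listed earlier; the slide reports r ≈ -0.946.

```python
from scipy.stats import pearsonr

x = [112, 111, 107, 119, 92, 80, 81, 84, 118, 106, 103, 94]   # birth weight (oz)
y = [63, 66, 72, 52, 75, 118, 120, 114, 42, 72, 90, 91]       # weight gain (%)

r, p_value = pearsonr(x, y)
print(round(r, 3))    # ≈ -0.946: a strong negative linear relationship
```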
  • 109. !!!!CAUTION  Interpreting the correlation coefficient should be done cautiously!  A result of 0 does not mean there is NO relationship …. It means there is no linear association.  There may be a perfect non-linear association. March 28, 2013 109
• 110. The Uses of Frequency Distributions  Becoming familiar with the dataset.  Cleaning the data.  Outliers – values that lie outside the normal range of values for other cases.  Inspecting the data for missing values.  Testing assumptions for statistical tests.  An assumption is a condition that is presumed to be true and, when ignored or violated, can lead to misleading or invalid results.  When the DV is not normally distributed, researchers have to choose among three options:  Select a statistical test that does not assume a normal distribution.  Ignore the violation of the assumption.  Transform the variable to better approximate a normal distribution (consult a reference on the various data transformations).
• 111. The Uses of Frequency Distributions  Obtaining information about sample characteristics.  Directly answering research questions.
  • 112. Outliers  Are values that are extreme relative to the bulk of scores in the distribution.  They appear to be inconsistent with the rest of the data.  Advantages:  They may indicate characteristics of the population that would not be known in the normal course of analysis.  Disadvantages:  They do not represent the population  Run counter to the objectives of the analysis  Can distort statistical tests.
• 113. Sources of Outliers  An error in the recording of the data.  A failure of data collection, such as not following sample criteria (e.g. inadvertently admitting a disoriented patient into a study), a subject not following instructions on a questionnaire, or equipment failure.  An actual extreme value from an unusual subject.
• 114. Methods to Identify Outliers  The traditional way of labeling outliers is to flag any value more than 3 SDs from the mean.  Values that are more than 3 IQRs from the upper or lower edge of the box plot are extreme outliers.  Values between 1.5 and 3 IQRs from the upper or lower edges of the box are minor outliers.
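A minimal sketch of both labeling rules, assuming Python/NumPy as the tool and a made-up data vector (the function name flag_outliers is hypothetical); the cutoffs follow the 3-SD and 1.5/3-IQR rules stated above.

```python
import numpy as np

def flag_outliers(values):
    """Flag values by the 3-SD rule and by the 1.5/3-IQR box-plot rules."""
    values = np.asarray(values, dtype=float)
    mean, sd = values.mean(), values.std(ddof=1)
    q1, q3 = np.percentile(values, [25, 75])
    iqr = q3 - q1

    beyond_3sd = values[np.abs(values - mean) > 3 * sd]
    minor = values[((values < q1 - 1.5 * iqr) & (values >= q1 - 3 * iqr)) |
                   ((values > q3 + 1.5 * iqr) & (values <= q3 + 3 * iqr))]
    extreme = values[(values < q1 - 3 * iqr) | (values > q3 + 3 * iqr)]
    return beyond_3sd, minor, extreme

data = [4, 5, 5, 6, 6, 7, 7, 8, 9, 40]      # one obvious extreme value
print(flag_outliers(data))
```

Note that with a small sample the 3-SD rule can miss an outlier because the outlier itself inflates the SD, which is one reason the IQR-based rules are often preferred.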
• 115. Handling Outliers  Analyze the data two ways:  With the outliers in the distribution.  With the outliers removed.  If the results are similar, as they are likely to be if the sample size is large, then the outliers may be ignored.  If the results are not similar, then a statistical analysis that is resistant to outliers can be used (e.g. median and IQR).  If you want to use a mean despite outliers, then the trimmed mean is an option. It is calculated with a certain percentage of the extreme values removed from both ends of the distribution (e.g. if n = 100, then the 5% trimmed mean is the mean of the middle 90% of the observations).
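A minimal sketch of a trimmed mean, assuming scipy as the tool and simulated data; trim_mean cuts the stated proportion from each end, so 0.05 corresponds to the 5% trimmed mean described above.

```python
import numpy as np
from scipy.stats import trim_mean

# 98 ordinary values plus two extremes, for illustration only
data = np.append(np.random.default_rng(0).normal(50, 5, 98), [5, 120])

print(np.mean(data))             # ordinary mean, pulled by the extremes
print(trim_mean(data, 0.05))     # 5% trimmed mean: mean of the middle 90%
```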
• 117. Handling Outliers  Another alternative is a Winsorized mean.  The highest and lowest extremes are replaced by the next-to-highest value and by the next-to-lowest value.  For univariate outliers, Tabachnick and Fidell (2001) suggest changing the scores on the variables for the outlying cases so that they remain deviant but less extreme. E.g. if the two largest scores in the distribution are 125 and 122 and the next largest score is 87, recode 122 as 88 and 125 as 89.
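A minimal sketch of Winsorizing, assuming scipy.stats.mstats.winsorize as the tool and made-up scores; with 10 scores, limits of 10% replace exactly one value at each end, mirroring the next-to-lowest / next-to-highest replacement described above.

```python
import numpy as np
from scipy.stats.mstats import winsorize

scores = np.array([87, 72, 65, 70, 68, 75, 80, 66, 71, 125])   # 125 is extreme
wins = winsorize(scores, limits=(0.10, 0.10))   # clamp the bottom/top 10% (1 value each here)

print(wins)            # 65 becomes 66 (next-to-lowest), 125 becomes 87 (next-to-highest)
print(wins.mean())     # Winsorized mean
```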
• 118. Outliers  Steps in SPSS: 1. Analyze 2. Descriptive Statistics 3. Explore 4. Statistics … Plots 5. Outliers
• 119. Missing Data  Any systematic event external to the respondent (such as data entry errors or data collection problems) or action on the part of the respondent (such as refusal to answer) that leads to missing data.  It means that analyses are based on fewer study participants than were in the full study sample. This, in turn, means less statistical power, which can undermine statistical conclusion validity – the degree to which the statistical results are accurate.  Missing data can also affect internal validity – the degree to which inferences about the causal effect of the independent variable on the dependent variable are warranted – and external validity – generalizability.
  • 120. Strategies to avoid Missing Data  Persistent follow-up  Flexibility in scheduling appointments  Paying incentives.  Using well-proven methods to track people who have moved.  Performing a thorough review of completed data forms prior to excusing participants.
  • 121. Factors to consider in designing a missing values strategy  Extent of missing data  Pattern of missing data  Nature of missing data.  Role of the variable  Level of measurement of the variable.
• 122. Extent of missing data  Researchers usually handle the problem differently if there is only 1% missing data as opposed to, say, 25% missing.
• 123. Pattern of missing data  It is more straightforward to deal with data that are missing in a haphazard, random fashion, as opposed to a systematic fashion that typically reflects a bias.  Different patterns of missing data:  Missing completely at random (MCAR)  Missing at random (MAR)  Missing not at random (MNAR)
• 124. Missing Completely at Random (MCAR)  It means that the probability that an observation is missing is completely unrelated to either the value of the missing case or the value of any other variables.  Occurs when cases with missing values are just a random subsample of all cases in the sample.  When data are MCAR, analyses remain unbiased, although power is reduced.  E.g. one participant did not show up after the intervention because of an emergency. In this situation, the missing values are not related to the main variable or to the value of other characteristics, such as the person’s age, sex, or experimental group status.
• 125. Missing at Random (MAR)  Data are considered MAR if missingness is related to other variables – but not related to the value of the variable that has the missing values.  This pattern is perhaps the most prevalent pattern of missingness in clinical research.  E.g. men were less likely to keep their follow-up appointment; thus, missingness is related to a person’s gender.
• 126. Missing Not at Random (MNAR)  A pattern in which the value of the variable that is missing is related to its missingness. This is often found for variables such as income (people may be reluctant to report their true income).
• 127. Nature of Missing Data  Sometimes a value is missing for only one item in a multi-item measure.  Sometimes an entire variable is missing.  In other situations, all data are missing for some study participants.
  • 128. Role of the variable  How one handles the missing data problem may depend on whether a variable is considered a primary outcome, a secondary outcome, an independent (predictor) variable, or control variable (covariate).
  • 129. Level of Measurement of the Variable  Some strategies are best applied when the variable is measured on an interval or ratio scale, while others only make sense for nominal-level variables.
  • 130. Techniques for Handling Missing Data  Deletion techniques. Involve excluding subjects with missing data from statistical calculation.  Imputation techniques. Involve calculating an estimate of each missing value and replacing, or imputing, each value by its respective estimate.  Note: techniques for handling missing data often vary in the degree to which they affect the amount of dispersion around true scores, and the degree of bias in the final results. Therefore, the selection of a data handling technique should be carefully considered.
• 131. Deletion Techniques  Deletion methods involve removal of cases or variables with missing data.  Listwise deletion. Also called complete case analysis. It is simply the analysis of those cases for which there are no missing data. It eliminates an entire case when any of its items/variables has a missing data point, whether or not that data point is part of the analysis. It is the default in SPSS.  Pairwise deletion. Also called available case analysis (sometimes dubbed “unwise” deletion). Involves omitting cases from the analysis on a variable-by-variable basis. It eliminates a case only when that case has missing data for the variables or items under analysis.  Note: deletion techniques are widely criticized because they assume that the data are MCAR (which is very difficult to ascertain), pose a risk for bias, and lead to reduction of sample size and power.
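A minimal sketch of the two deletion approaches, assuming pandas and a small made-up data frame; SPSS applies the same logic through its missing-values options.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"age":    [34, 29, np.nan, 41, 50],
                   "income": [52_000, np.nan, 61_000, 45_000, 58_000],
                   "score":  [7, 8, 6, np.nan, 9]})

# Listwise (complete-case): keep only rows with no missing values at all
listwise = df.dropna()

# Pairwise (available-case): each analysis uses whatever rows are complete
# for the variables it involves, e.g. the age-income correlation
pairwise_r = df[["age", "income"]].dropna().corr()

print(listwise)
print(pairwise_r)
```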
• 132. Imputation Techniques  Imputation is the process of estimating missing data based on valid values of other variables or cases in the sample.  The goal of imputation is to use known relationships that can be identified in the valid values of the sample to help estimate the missing data.
  • 133. Types of Imputation Techniques  Using prior knowledge.  Inserting mean values.  Using regression  Expectation maximization (EM).  Multiple imputation.
• 134. Prior Knowledge  Involves replacing a missing value with a value based on an educated guess.  It is a reasonable method if the researcher has a good working knowledge of the research domain, the sample is large, and the number of missing values is small.
• 135. Mean Replacement  Also called median replacement for skewed distributions.  Involves calculating mean values from available data on that variable and using them to replace missing values before analysis.  It is a conservative procedure because the mean of the distribution as a whole does not change and the researcher does not have to guess at missing values.
• 136. Mean Replacement  Advantages:  Easily implemented and provides all cases with complete data.  A compromise procedure is to insert a group mean for the missing values.  Disadvantages:  It invalidates the variance estimates derived from the standard variance formulas by understating the data’s true variance.  It distorts the actual distribution of values.  It depresses the observed correlation that this variable will have with other variables because all missing data have a single constant value, thus reducing the variance.
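A minimal sketch of mean replacement, assuming pandas and made-up scores; note how the variance shrinks after imputation, which is exactly the understatement problem listed above.

```python
import numpy as np
import pandas as pd

score = pd.Series([22, 25, np.nan, 30, 27, np.nan, 24, 31])

filled = score.fillna(score.mean())      # replace missing values with the observed mean

print(score.var(), filled.var())         # the variance is understated after imputation
```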
• 137. Using Regression  Involves using other variables in the dataset as independent variables to develop a regression equation for the variable with missing data serving as the dependent variable.  Cases with complete data are used to generate the regression equation.  The equation is then used to predict missing values for incomplete cases.  More regressions are computed, using the predicted values from the previous regression to develop the next equation, until the predicted values from one step to the next are comparable.  Predictions from the last regression are the ones used to replace missing values.
• 138. Using Regression  Advantages:  It is more objective than the researcher’s guess but not as blind as simply using the overall mean.  Disadvantages:  It reinforces the relationships already in the data, resulting in less generalizability.  The variance of the distribution is reduced because the estimate is probably too close to the mean.  It assumes that the variable with missing data is correlated substantially with the other variables in the dataset.  The regression procedure is not constrained in the estimates it makes.
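A minimal single-pass sketch of regression imputation, assuming scikit-learn and a made-up two-variable data frame; the iterated version described above simply repeats the fit-and-predict cycle until the imputed values stabilize.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

df = pd.DataFrame({"age":    [25, 32, 47, 51, 38, 29, 44],
                   "weight": [60, 72, np.nan, 85, 74, np.nan, 80]})

complete = df.dropna()                                # cases with full data build the equation
model = LinearRegression().fit(complete[["age"]], complete["weight"])

missing = df["weight"].isna()
df.loc[missing, "weight"] = model.predict(df.loc[missing, ["age"]])  # impute predictions
print(df)
```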
• 139. Expectation Maximization  For randomly missing data.  It is an iterative process that proceeds in two discrete steps:  In the expectation (E) step, the conditional expected value of the complete data is computed, given the observed values and the current estimates of the parameters (such as correlations).  In the maximization (M) step, these expected values are then substituted for the missing data, and maximum likelihood estimates are computed as though there were no missing data.
  • 140. Multiple Imputation  It produces several datasets and analyzes them separately.  One set of parameters is then formed by averaging the resulting estimates and standard errors.
  • 141. Multiple Imputation  Advantages:  It makes no assumptions about whether data are randomly missing but incorporates random error because it requires random variation in the imputation process.  It permits use of complete-data methods for data analysis and also includes the data collector’s knowledge.  It permits estimates of nonlinear models.  It simulates proper inference from data and increases efficiency of the estimates by minimizing standard errors.  It is the method of choice for databases that are made available for analyses outside the agency that collected the data.
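A minimal sketch of the multiple-imputation workflow, assuming scikit-learn's IterativeImputer with sample_posterior=True as the imputation engine (one of several possible tools) and made-up data; each of the m completed datasets is analyzed separately and the point estimates are then pooled by averaging, as described above.

```python
import numpy as np
import pandas as pd
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer

df = pd.DataFrame({"age":    [25, 32, 47, 51, 38, 29, 44, 36],
                   "weight": [60, 72, np.nan, 85, 74, np.nan, 80, 70]})

estimates = []
for seed in range(5):                       # m = 5 imputed datasets
    imp = IterativeImputer(sample_posterior=True, random_state=seed)
    completed = pd.DataFrame(imp.fit_transform(df), columns=df.columns)
    estimates.append(completed["weight"].mean())   # analyze each completed dataset

print(np.mean(estimates))                   # pool (average) the five estimates
```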
• 142. Multiple Imputation  Disadvantages:  It requires conceptual intensiveness to carry out MI, including special software and model building.  It does not produce a unique answer because randomness is preserved in the MI process, making reproducibility of exact results problematic.  It requires large amounts of data storage space that often exceed the space on personal computers’ hard drives.