

                                                                           January 16, 2003
                              BIVARIATE ANALYSIS:
                            ESTIMATING ASSOCIATIONS

                                   Carol S. Aneshensel
                           University of California, Los Angeles

I.    OVERVIEW
      A.  Determines whether two variables are empirically associated.
      B.  Key issue: Is the association in the form anticipated by theory?
      C.  Lays the foundation for multivariate analysis
II.   METHODS OF BIVARIATE ANALYSIS
      A.  Criteria for Selection of a Method
      B.  Proportions: Contingency Tables
          1.      Test of conditional probabilities
          2.      Used with 2 categorical variables
          3.      No distributional assumptions
          4.      Low statistical power
          5.      Calculation of Chi-squared (χ²) and its degrees of freedom
      C.  Mean Differences: Analysis of Variance
          1.      Test of mean differences relative to variance
          2.      Independent variable is categorical
          3.      Dependent variable is interval or ratio.
          4.      No assumptions about the form of the association
          5.      Calculation of F and its degrees of freedom
      D.  Correlation: Correlation Coefficients
          1.      Test of linear association
          2.      Two interval variables
          3.      Linear association
          4.      Calculation of t and its degrees of freedom

                                 BIVARIATE ANALYSIS:
                               ESTIMATING ASSOCIATIONS

                                      Carol S. Aneshensel
                              University of California, Los Angeles

        The first step in the analysis of a focal relationship is to determine whether there is an
empirical association between its two component variables. This objective is accomplished by
means of bivariate analysis. This analysis ascertains whether the values of the dependent
variable tend to coincide with those of the independent variable. In most instances, the
association between two variables is assessed with a bivariate statistical technique (see below for
exceptions). The three most commonly used techniques are contingency tables, analysis of
variance (ANOVA), and correlations. The basic bivariate analysis is then usually extended to a
multivariate form to evaluate whether the association can be interpreted as a relationship.
        Not any association will do, however: We are interested in one particular association,
that predicted by theory. If we expect to find a linear association, but find instead a U-shaped
one, then our theory is not supported even though the two variables are associated with one
another. Thus, the object of explanatory analysis is to ascertain whether the independent variable
is associated with the dependent variable in the manner predicted by theory.
        In some instances, the association between two variables is assessed with a multivariate
rather than a bivariate statistical technique. This situation arises when two or more variables are
needed to express the functional form of the association. For example, the correlation coefficient
estimates the linear association between two variables, but a nonlinear association requires a
different approach, such as the parabola specified by the terms X and X². Although two analytic
variables (X and X²) are used to operationalize the form of the association (parabola), these
variables pertain to one substantive theoretical variable (X). The two analytic variables are best
thought of as one 2-part variable that reflects the nonlinear form of the association with the
dependent variable. Thus, the analysis is bivariate even though a multivariate statistical
technique is used.
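
        A minimal sketch of this idea in Python, assuming invented data: the two analytic terms X
and X² are entered together in one least-squares fit, so a single substantive variable carries a
parabolic functional form. The variable names and simulated values are hypothetical.

```python
# Hypothetical illustration: one substantive variable X represented by the
# two analytic terms X and X^2 so that a parabolic association can be fit.
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(-3, 3, 200)
y = 1.0 - 0.8 * x**2 + 0.5 * x + rng.normal(0, 1, 200)   # simulated nonlinear outcome

design = np.column_stack([np.ones_like(x), x, x**2])      # intercept, X, X^2
coefs, *_ = np.linalg.lstsq(design, y, rcond=None)
print("intercept, X, and X^2 coefficients:", coefs)
```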
        Although this distinction may appear to be hair-splitting, it reduces confusion about the
focal relationship when more than one term is used to operationalize the independent variable.
For example, the categories of ethnicity might be converted into a set of dichotomous "dummy
variables" indicating whether the person is (1) African American, (2) Latino, (3) Asian
American, or (4) non-Latino White (or is in the excluded reference category of "Other"). A
theoretical model containing one independent variable, ethnicity, now appears to involve four
independent variables. The four "dummy variables," however, are in actuality one composite
variable with five categories. This type of hybrid variable requires a multivariate statistical
technique, such as regression, even though it represents a bivariate association.
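
        As a brief illustration of the dummy-variable coding just described (pandas assumed; the
observations are invented), the five-category ethnicity variable becomes four indicator columns
once the reference category is dropped:

```python
# Hypothetical sketch: a five-category ethnicity variable coded as four
# dummy variables, with "Other" as the excluded reference category.
import pandas as pd

ethnicity = pd.Series(["African American", "Latino", "Asian American",
                       "non-Latino White", "Other", "Latino"])
dummies = pd.get_dummies(ethnicity).drop(columns=["Other"])
# The four remaining columns jointly represent one composite variable.
print(dummies)
```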
        The importance of bivariate analysis is sometimes overlooked because it is superseded by
multivariate analysis. This misperception is reinforced by scientific journals that report bivariate
associations only in passing, if at all. This practice creates the misleading impression that
analysis begins at the multivariate level. In reality, the multiple-variable model rests upon the
foundation laid by the thorough analysis of the 2-variable model. The proper specification of the

theoretical model at the bivariate level is essential to the quality of subsequent multivariate
analysis.
         Some forms of bivariate analysis require that variables be differentiated into independent
or dependent types. For example, the analysis of group differences in means, either by t-test or
ANOVA, treats the group variable as independent, which means that the procedure is
asymmetrical: different values are obtained if the independent and dependent variables are
inverted. In contrast, the Pearson correlation coefficient, the most widely used measure of
bivariate association, yields identical values irrespective of which variable is treated as
dependent, meaning that it is symmetrical: the same coefficient and probability level are
obtained if the two variables are interchanged. Similarly, the chi-squared (χ²) test for
independence between nominal variables yields the same value irrespective of whether the
dependent variable appears in the rows or the columns of the contingency table. Although the
test of statistical significance is unchanged, switching variables yields different expressions of
the association because row and column percentages are not interchangeable. Unlike the
correlation coefficient, where both the statistic and test of statistical significance are symmetrical,
only the probability level is symmetrical in the χ² technique.
         Designating one variable as independent and the other variable as dependent is productive
even when this differentiation is not required by the statistical method. The value of this
designation lies in setting the stage for subsequent multivariate analysis where this differentiation
is required by most statistical techniques.1 This designation is helpful in the bivariate analysis of
the focal relationship because multivariate analysis ultimately seeks to determine whether the
bivariate association is indicative of a state of dependency between the two variables. This
approach makes more sense if the original association is conceptualized as a potential
relationship.

                           METHODS OF BIVARIATE ANALYSIS
Selection of a Method
        There is a multitude of statistical techniques for the assessment of bivariate associations.
This profusion of techniques reflects a key consideration in the selection of a method of analysis:
the measurement properties of the independent and dependent variables. For example,
correlational techniques are the method of choice for analysis of two interval variables (when the
association is assumed to be linear), but are not suitable to the analysis of two categorical
variables. Given that there are numerous possible combinations of measurement types, there are
numerous analytic techniques.
        A second contributor to this proliferation is sample size: some methods are applicable
only to large samples. Statistical techniques are also distinguished from one another on the basis
of assumptions about the distributional properties of the variables. For example, there are
different computational formulas for the simple t-test depending upon whether the variance of the
dependent variable is assumed to be the same in the two groups being compared. In contrast,
nonparametric techniques make no distributional assumptions.

1
       Some multivariate techniques do not require that one variable be treated
       as dependent, for example, log-linear models, but this situation is an
       exception.

        The sheer number of alternative methods can bewilder. The bulk of bivariate analysis in
the social sciences, however, is conducted with three techniques: contingency table analysis of
proportions, ANOVA assessment of mean differences between groups, and correlation
coefficients. As illustrated in Figure 1, a key consideration in the selection of a technique is the
level of measurement. The contingency table technique is used when both variables are nominal.
Means are analyzed when the independent variable is nominal and the dependent variable is interval
or ratio. Correlations are calculated when both variables are interval or ratio (and the association
is assumed to be linear).
        These three methods do not exhaust the possible combinations of independent and
dependent variables, as indicated by the blank cells in Figure 1. Although there are alternative
methods of analysis for these combinations, many researchers adapt one of the three methods
shown in this figure. For instance, if the dependent variable is nominal and the independent
variable is measured at a higher level, the independent variable is often collapsed into categorical
form. This transformation permits the use of the familiar χ² test, but wastes valuable information
about the inherent ordering of the interval variable.
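
        A brief sketch of this kind of collapsing (pandas assumed; the ages and cut points are
invented): an interval variable is binned into categories so the χ² test can be applied, at the
cost of the ordering information.

```python
# Hypothetical sketch: collapsing an interval independent variable (age in
# years) into categories for use in a chi-squared contingency table.
import pandas as pd

age = pd.Series([23, 31, 47, 52, 64, 71, 38, 59])
age_group = pd.cut(age, bins=[17, 29, 44, 59, 120],
                   labels=["18-29", "30-44", "45-59", "60+"])
print(age_group.value_counts().sort_index())
```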
        Ordinal variables are a bit troublesome because they do not satisfy the assumptions of the
interval methods of analysis, but the use of a nominal method of analysis, the practical
alternative, is wasteful because it does not make use of the ordering information. Interval
methods are often used for ordinal variables that approximate interval variables, that is, quasi-
interval variables, but strictly speaking this practice is inappropriate. If the ordinal variable does
not approximate an interval variable, then it can be treated as a nominal variable in a χ² test,
although, once again, this practice wastes information.
        Why not use a statistical technique intended specifically for the exact measurement
properties of the independent and dependent variables instead of relaxing assumptions or losing
power? Surely these techniques are more appropriate than the adaptations just described. What
accounts for the popularity of bending methods or measures to permit the use of conventional
methods of bivariate analysis? Quite simply, the conventional methods readily generalize into
familiar methods of multivariate analysis. Correlations and ANOVA form the foundation for
multiple linear regression. Similarly, logistic regression is based on the techniques used with
contingency tables. These three methods of bivariate analysis are used frequently, then, because
of their conceptual continuity with common forms of multivariate analysis.
        The reader is referred to a standard statistic text for less frequently used methods of
assessing bivariate associations. These methods are omitted here so that excessive attention to
technique does not deflect attention from the logic of analysis. A discussion of the relative
merits of various types of correlation coefficients, for example, would unnecessarily divert
attention from the question of whether a linear model, assumed in correlational techniques, is
appropriate based on theory: not theory as taught in a statistics class, but the substantive theory
directing the research. According to the gospel of statistical theory, my dismissive treatment of
technically correct methods of bivariate analysis is heretical. In defense of this stance, I note
only that I am merely calling attention to a widespread practice in applied analysis.

         In addition to level of measurement, the selection of a statistical procedure should also be
based upon the type of association one expects to find. In most applications, this issue focuses
on whether it is appropriate to use a linear model. In this context, linear means that there is a
constant rate of change in the dependent variable across all values of the independent variable.
This concept makes sense only where there are constant intervals on both variables, which means
that, strictly speaking, linearity is relevant only when both variables are measured at the interval
level.2 In other applications, the issue is specifying which groups are expected to differ from
others. An example of this approach would be hypothesizing that the prevalence of depression is
greater among women than men, as distinct from asserting that gender and depression are
associated with one another (see below).
         In practice, the expected functional form of an association is often overlooked in the
selection of an analytic technique. We tend to become preoccupied with finding a procedure that
fits the measurement characteristics of the independent and dependent variables. The validity of
the entire analysis, however, depends upon the selection of an analytic technique that matches
theory-based expectations about the form of the association. Unfortunately, theory is often mute
on this topic. Nevertheless, it is incumbent upon the analyst to translate theory into the
appropriate analytic model.

Methods: Proportions, Means and Correlations
        In this section, the three most common methods of bivariate analysis are summarized
briefly: χ² tests of proportions, ANOVA for mean differences, and correlation coefficients for
linear associations. As noted above, the selection of a method is likely to be driven by the
measurement characteristics of the independent and dependent variables. Contingency tables are
appropriate for two nominal variables; tests of mean differences are used for an interval outcome
and nominal independent variable; correlations are employed for linear associations between
interval variables (see Figure 1). If both variables are interval, but the association is expected to
be non-linear, then the correlational technique needs to be adapted to the expected form of the
association using a multiple regression format.
        Each of these forms of analysis can be performed using any major statistical software
program. The emphasis of this presentation, therefore, is not on computations, but on
interpretation. It is useful, however, to review the fundamentals of these methods of analysis to
understand their proper use and interpretation.


2
       There are a few common exceptions. For example, correlation
       coefficients are often calculated for ordinal variables that are quasi-
       interval. Also, dichotomous dependent variables are often treated as
       interval because there is only one interval.

        It should be noted that the techniques discussed here are but a few of the many options
available for bivariate analysis. Characteristics of one's data may make other approaches much
more appropriate. The three techniques described here are highlighted because of their
widespread use and because they form the basis for the most commonly used types of
multivariate analysis. The anticipation of multivariate analysis makes it logical to conduct
bivariate analysis that is consistent with the multivariate model.
        Proportions: Contingency Tables. Two properties of the χ² analysis of a contingency
table make it an especially appealing form of bivariate analysis. First, it is based on the lowest
form of measurement, two nominal variables. The absence of level-of-measurement restrictions
means that the technique may also be used with ordinal, interval or ratio data. Second, this
technique does not require assumptions about the nature of the association; in particular, it does
not assume a linear association. It is used to determine whether any association is present in the
data, without specifying in advance the expected form of this association. This flexibility is the
method's most appealing characteristic.
        These characteristics, however, also establish the limitations of the method. Using χ²
analysis with higher-order variables means that some data is transformed into a lower form of
measurement, converted to categorical form. This transformation leads to a loss of information
and a concomitant loss of statistical power. Although other statistics for contingency table
analysis take into consideration the ordinal quality of variables (e.g., Somers's d), these
techniques are not as widely used as the simple yet less powerful χ².
        Furthermore, the χ² test only tells you that some association seems to be present, without
regard to its theoretical relevance. The conclusion that an association is present is not nearly as
meaningful, compelling, or satisfying as the conclusion that the expected association is present.
The χ² test does not yield this information, although it is possible to adapt the method to this end.
        The χ² test for independence is used to determine whether there is an association between
two categorical variables. If the two variables are unrelated, then the distribution of one variable
should be the same regardless of the value of the other variable. If instead the two variables are
associated, then the distribution of the dependent variable should differ across the values of the
independent variable. The χ² test for independence does not distinguish between independent
and dependent variables. Treating one variable as independent is optional and does not alter the
value of the test statistic.
        The dependency between the variables could be stated in the reverse direction: the
distribution of the independent variable differs across categories of the dependent variable.
Although immaterial to the calculation of χ², this formulation is backwards in terms of the logic
of cause and effect. It treats the dependent variable as fixed and assesses variation in the
independent variable across these fixed values. However, a variable that depends upon another
does not have fixed values; its values vary according to the influence of the independent variable.
 For example, the association between gender and depression is best stated as differences in the
probability of being depressed between men and women, not whether the probability of being a
woman differs between depressed and not depressed persons. The proper formulation of the

association, then, is to examine variation in the distribution of the outcome variable across
categories of its presumed cause.3
        The χ² test for independence is illustrated in Figure 2. In this contingency table, the
independent variable X appears in the rows (1...i) and the dependent variable Y appears in the
columns (1...j). The analytic question is whether the distribution of Y varies across the categories
of X.




3
       Whether this distribution is calculated as row or column percentages is
       immaterial.

        The overall distribution of Y is given at the bottom of the table. For example, the
proportion in column 1 is p.1 = N.1/N; the proportion in column 2 is p.2 = N.2/N; and so on until
p.j = N.j/N. This proportional distribution should be duplicated within each row of the table if Y is
indeed independent of X. In other words, the distribution of subjects within row 1 should
resemble the distribution of subjects within row 2, and so on through row i, the last row. This
similarity means that the proportion of subjects in column j should be similar across all
categories of X (1, 2, ... i), and similar to the overall proportion of subjects in column j (p.j or
N.j/N). This equivalency should be manifest for all values of Y across all values of X.
        The null hypothesis for the χ² test for independence essentially states that identical
conditional probabilities are expected under the condition that Y does not depend upon X:

       H0:      p11 = p21 = . . . = pi1 = p.1 = N.1/N                                               (1)
                p12 = p22 = . . . = pi2 = p.2 = N.2/N
                ...
                p1j = p2j = . . . = pij = p.j = N.j/N

The null hypothesis is evaluated with the χ² test statistic. Its definitional formula and degrees of
freedom appear in Figure 2. A large χ² value relative to its degrees of freedom leads to rejection
of the null hypothesis. The null hypothesis is rejected if any pij ≠ N.j/N, that is, if any proportion
within the table deviates substantially from the marginal distributions of X and Y. This result
means that the observed covariation between X and Y is unlikely to occur if in reality X and Y are
independent.
        The key to understanding this procedure is the calculation of expected values for the cells
comprising the contingency table. These values are calculated assuming that column
distributions are independent of row distributions (and vice versa). If so, then the overall
marginal proportions for the rows and columns should be replicated within the body of the table.

         The value expected under independence is compared to the observed value for each cell.
If the assumption of independence is valid, there should be only a small difference between
expected and observed values. If instead there is a large difference for one or more cells, the χ²
statistic will be large relative to its degrees of freedom and the null hypothesis will be rejected.
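
         The logic of comparing observed to expected counts can be sketched as follows, with
invented cell frequencies and scipy assumed; the routine returns the table expected under
independence along with χ², its degrees of freedom, and the probability level.

```python
# Illustrative sketch with invented counts: chi-squared compares the observed
# table to the cell counts expected if the row and column variables were
# independent.
import numpy as np
from scipy.stats import chi2_contingency

observed = np.array([[120,  80],    # rows: categories of X
                     [ 90, 110]])   # columns: categories of Y
chi2, p, dof, expected = chi2_contingency(observed, correction=False)
print("expected counts under independence:\n", expected)
print(f"chi2 = {chi2:.2f}, df = {dof}, p = {p:.4f}")
```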
         The chief limitation of this procedure is the nonspecific nature of the null hypothesis.
When we reject it, we know that one or more cell frequencies differs markedly from the
expected, but do not know which cells are deviant. We know that an association exists, but do
not know the form of the association. It might be the one predicted by theory, but then again it
might be some other form of association. This limitation can be remedied if the expected
relationship is more precisely specified prior to analysis.
         To continue the example given earlier in this section, we might hypothesize that females
are more likely to be depressed than males. This hypothesis is considerably more precise than
the hypothesis that gender and depression are associated with one another. It could be
operationalized as an odds ratio greater than 1.00 for depression for females relative to males.
The odds ratio expresses the association between variables in a multiplicative form, meaning that
a value of 1.00 is equivalent to independence. The odds ratio for the data reported in Table 1 is
1.80. This is the exact value for the sample (N = 1,393). If we wish to extend this finding to the

population, we could calculate a confidence interval for the odds ratio and determine whether it
includes the value of 1.4 The 95% confidence interval for this example is 1.25 to 2.60. The
lower boundary is greater than 1, meaning that the odds of being depressed are significantly
higher for women than men. Note that this conclusion is more precise than the conclusion made
with χ², namely that depression is not independent of gender.
        Note also that we are interested only in the possibility that females are at greater risk. A
greater risk among males would disconfirm our theory. However, a male excess of depression
would lead to rejection of the null hypothesis in the χ² procedure. Our more specific null
hypothesis is rejected only if the entire confidence interval lies above 1; we fail to reject if the
confidence interval includes 1 or lies entirely below 1. When only an association is
hypothesized, we fail to reject only when the confidence interval includes 1, not when it lies
entirely below 1. By specifying the nature of the association we have increased the power of the
test.


4
       Agresti (1990:54-6) gives the following formula for the 100(1-α) percent
       confidence interval for log θ for large samples:

              log(θ) ± z_(α/2) σ(log θ)

       where

              θ = n11n22/n12n21

       is the sample value of the odds ratio, and

              σ(log θ) = [(1/n11) + (1/n22) + (1/n12) + (1/n21)]^(1/2)

       The confidence interval for θ is obtained by exponentiating (taking the
       antilog of) the endpoints of this interval. (Agresti, Alan. 1990.
       Categorical Data Analysis. New York: John Wiley & Sons.)
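
       A small sketch of the calculation in note 4, with invented 2 x 2 cell counts (not the data
discussed in the text); the interval is formed on the log scale and then exponentiated.

```python
# Hypothetical sketch of the large-sample confidence interval for an odds
# ratio (Agresti 1990); n11, n12, n21, n22 are invented cell frequencies.
import math

n11, n12, n21, n22 = 50, 450, 30, 470
theta = (n11 * n22) / (n12 * n21)                  # sample odds ratio
se_log = math.sqrt(1/n11 + 1/n22 + 1/n12 + 1/n21)  # standard error of log(theta)
z = 1.96                                           # z for a 95% interval
lo = math.exp(math.log(theta) - z * se_log)
hi = math.exp(math.log(theta) + z * se_log)
print(f"odds ratio = {theta:.2f}, 95% CI = ({lo:.2f}, {hi:.2f})")
```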

        The χ² test is useful for inferential purposes, but it is not informative about the magnitude
of the association. Quantifying the association requires an additional calculation. Although there
are numerous options from which to select, the odds ratio described earlier has gained
considerable popularity in recent years because of its straightforward interpretation. Somers's d
is useful as well because of its common use in the logistic regression model, which is a
multivariate generalization of the χ² procedure.
        For an example of crosstabular analysis, we can turn to a study estimating the extent to
which midlife and older parents experience the divorce of their sons and daughters (Spitze,
Logan, Deane, and Zerger 1994).5 The overall prevalence of this event was influenced by time
insofar as the oldest adults have been at risk for the longest time, whereas their younger
counterparts have had less time to have adult children marry and divorce. The bivariate analysis
shown in Table 2
takes into account the age of the parent.
        Although the experience of having an adult child is normative in all of these age groups,
barely half of the youngest adults have experienced this life course transition. At older ages,
about four in five have done so. The researchers attribute the slight dropoff in the oldest age
group to the lower fertility of those who were of childbearing age prior to the baby boom period.

       TABLE 2    Percent Distribution of Marital History of Children by Age of the Parent

                                                          Age of the Parent
       Percentage of all respondents who have had:     40-49   50-59   60-69    70+

       An adult child                                    55.1    87.6    88.6   72.2
       An ever-married child                             19.5    70.5    83.1   74.4
       An ever-divorced child                             3.0    24.0    40.6   35.8
       A currently divorced/separated child               4.4    19.6    27.6   24.4
       A remarried child                                  1.1     9.5    22.1   21.1

       N                                                  365     275     308    246

       Source: Spitze et al. (1994).

5
       Spitze, Glenna, Logan, John R., Deane, Glenn, and Zerger, Suzanne.
       (1994). Adult children's divorce and intergenerational relationships.
       Journal of Marriage and the Family, 56:279-293.

         These age trends are even more evident for having a married child. This event is
relatively rare for the youngest group, but quite common among persons over the age of 50 years.
 Age is strongly associated with having children who have divorced, including those who are
currently divorced, and with having children who have remarried.
         No tests of statistical significance are presented for the data in Table 2 in the original
report. The differences are quite large between the youngest cohort and the other cohorts. Given
the large sample size, these differences are almost certainly statistically significant.
         If χ² tests were provided, however, there would be five such tests, one for each dependent
variable. At first glance, it may not be obvious that there are five dependent variables. This
confusion arises because the table reports percentages rather than cell frequencies (compare with
Figure 2). (Cell frequencies can be extracted from Table 2 by multiplying the percentage by the
sample size, and dividing by 100.) Only the percent "yes" is given in Table 2 because the percent
"no" is implicit given that each of these variables is a dichotomy. Thus, Table 2 represents the
cross-tabulation of age (in 4 categories) by each of the following dependent variables: adult
child/no; ever-married child/no; ever-divorced child/no; currently separated-divorced child/no;
remarried child/no.
         The question of linearity is relevant to this example because the independent variable,
age, is quasi-interval and the dependent variables are all dichotomies. Two distinct nonlinear
patterns are evident in these data. The first is an inverted U-shaped curve: a sharp increase,
especially between the forties and fifties, followed by a decline among those who are seventy or
older. As noted previously, the researchers attribute the decline to a pre-baby boom cohort
effect. The first three dependent variables follow this pattern. The increase at earlier ages
reflects the combined impact of life course considerations, such as the duration of risk for the
event. These factors appear to be the primary consideration for the second pattern, an initial
increase followed by a plateau, which describes the last two entries in the table.
         Means: Analysis of Variance. The ANOVA procedure is similar in many respects to the
χ² test for independence. In both techniques, the independent variable needs to be measured only
at the nominal level. Also, the null hypothesis is structurally similar in the two procedures. In
the case of χ², we test whether proportions are constant across categories of the independent
variable and, therefore, equal to the overall marginal proportion. For ANOVA, we test whether
means are constant across categories of the independent variable and, thus, equal to the grand
mean. In both techniques, the null hypothesis is nonspecific: rejecting it is not informative about
which categories of the independent variable differ.
         The main practical difference between methods is that ANOVA requires an interval level
of measurement for the dependent variable.6 This restriction means that the ANOVA technique
is not as widely applicable as the χ² test, for which the dependent variable can be nominal. The
limitation of ANOVA to interval dependent variables, however, is a trade-off for greater
statistical power.

6
       This assumption is violated when ANOVA is used for ordinal variables
       that approximate the interval level of measurement, but this procedure
       is, strictly speaking, incorrect.

        The χ² test could be substituted for ANOVA by transforming an interval dependent
variable into categorical form. This approach is undesirable because valuable information is lost
when a range of values is collapsed into a single category. Additional information is lost because
the ordering of values is immaterial to χ². The attendant decrease in statistical power makes χ²
an unattractive alternative to ANOVA.
        The ANOVA procedure is concerned with both central tendency and spread. The
measure of central tendency is the mean. It is calculated for the dependent variable, both overall
and within groups defined by the categories of the independent variable. Specifically, the null
hypothesis is that the within-group mean is equal across groups and, therefore, equal to the grand
mean:

       H0:     µ1 = µ2 = ... = µj = µ                                                               (2)

In this equation, µ is the mean, and j is the number of groups, which is the number of categories
on the independent variable. The null hypothesis is rejected if any µj ≠ µ, that is, if any group mean
differs from the grand mean.
         The critical issue in ANOVA is not the absolute difference in means, however, but the
difference in means relative to spread or the variance of the distribution.7 This feature is
illustrated in Figure 3 for several hypothetical distributions. The first panel (a) displays large
differences in means relative to the within-group variation, a pattern that yields a large value for
F. This pattern would lead to a rejection of the null hypothesis.
         The second panel (b) displays the same absolute mean differences as the top panel, but
substantially larger within-group variation. Although the means differ from one another, there is
a large amount of overlap among the distributions. The overlap is so extensive that the
distribution with the lowest mean overlaps with the distribution with the highest mean. This
pattern would produce a low F value, leading to failure to reject the null hypothesis. This
conclusion is reached even though the absolute mean differences are the same as in panel a,
which led to rejection of the null hypothesis. The difference in conclusions for the two sets of
distributions arises from their spreads: the mean difference is large relative to the variance of the
distribution in panel a, but relatively small in panel b.
         The third panel (c) shows distributions with the same variances as the second panel (b),
but with substantially larger mean differences. As in the previous case, the variances are large.
In this instance, however, the mean differences between groups are also large. Similarly, in the
last panel (d) the variances are small, but so are the mean differences between groups. We would
fail to reject the null hypothesis despite the small variances because these variances are large

7
       ANOVA was originally developed in terms of a variance, specifically the
       hypothesis that all of the group means are equal is equivalent to the
       hypothesis that the variance of the means is zero (Darlington 1974).

relative to the mean differences. As these examples illustrate, it is not the absolute value of the
mean differences that is crucial, but the mean difference relative to the variance.

        ANOVA is based on the decomposition of variation in the dependent variable into
within- and between-group components, with the groups being categories of the independent
variable. The calculations are based on the sum of squares, that is, deviation of observations
from the mean, as shown in Figure 4.8 The test statistic for ANOVA is F, which is a ratio of
variances. If differences between the means are due to sampling error, then the F ratio should be
around 1.00. Large values of F (relative to its degrees of freedom) would lead to rejection of the
null hypothesis.
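
        The decomposition can be sketched directly, assuming invented group scores: total variation
splits into between-group and within-group sums of squares, and F is the ratio of the corresponding
mean squares.

```python
# Minimal sketch of the ANOVA decomposition with invented group data.
import numpy as np

groups = [np.array([9.0, 11.0, 10.5, 12.0]),
          np.array([14.0, 15.5, 13.0, 16.0]),
          np.array([10.0, 9.5, 11.5, 10.5])]
all_obs = np.concatenate(groups)
grand_mean = all_obs.mean()

ss_between = sum(len(g) * (g.mean() - grand_mean) ** 2 for g in groups)
ss_within = sum(((g - g.mean()) ** 2).sum() for g in groups)
df_between = len(groups) - 1
df_within = len(all_obs) - len(groups)
F = (ss_between / df_between) / (ss_within / df_within)
print(f"F({df_between}, {df_within}) = {F:.2f}")
```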
        Although ANOVA may be used when there are only 2 categories on the independent
variable, it is customary to use a t test in this situation. The definitional formula for t is:9

       t       =       (M1 - M2) / [(s1²/n1) + (s2²/n2)]^(1/2)                                      (3)

T and F are equivalent when there are only two groups: t² = F. Thus, the choice of one method
over the other is immaterial.
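
        The equivalence can be checked numerically with invented data (scipy assumed). The
pooled-variance t is used here so that t² matches F exactly; the formula in equation (3), by
contrast, allows unequal group variances (see note 9).

```python
# Sketch confirming that, with two groups, the squared t statistic equals F.
import numpy as np
from scipy.stats import ttest_ind, f_oneway

g1 = np.array([10.2, 11.5, 9.8, 12.1, 10.9])
g2 = np.array([13.0, 12.4, 14.1, 13.6, 12.9])
t, p_t = ttest_ind(g1, g2, equal_var=True)   # pooled-variance t
F, p_f = f_oneway(g1, g2)
print(f"t^2 = {t**2:.3f}, F = {F:.3f}, p: {p_t:.4f} vs {p_f:.4f}")
```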
        ANOVA is not informative about which means are unequal: it tests only whether all
means are equal to one another and to the grand mean. Rejection of the null hypothesis signifies
that the dependent variable is probably not uniformly distributed across all values of the
independent variable, but does not reveal which group means deviate from expectations. Thus, an
association appears to be present, but it is not known whether it is the one forecast by theory. As
was the case with the χ² procedure, additional steps are required to test more specific hypotheses
about the precise nature of the association. This may be done a priori or using a post hoc test
(e.g., Scheffé). Specifying contrasts in advance (on the basis of theory) is preferable to
examining all possible contrasts after the fact because the latter penalizes you for making multiple
contrasts in order to reduce the risk of capitalizing on chance.
        Depending upon one's hypothesis, it also may be desirable to test for a trend. Trend
analysis is most relevant when the independent variable is interval and linearity is at issue. In
this case, it may be desirable to partition the between-groups sum of squares into linear,
quadratic, cubic, or higher-order trends. However, visual inspection of the data is usually the

8
       There are numerous alternative specifications of the ANOVA model
       depending upon whether fixed or random effects are modeled, whether
       there are equal or unequal cell frequencies, etc. The user should
       consult a text on ANOVA for full discussion of these technical concerns.
9
       This formula for t assumes unequal variances between groups; a slightly
       different formula is used if one can assume equal variances.

most informative approach for understanding the shape of the association. Simply plotting mean
values is often more instructive than the results of sophisticated statistical tests.
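
        A minimal plotting sketch of that advice (matplotlib assumed; the category midpoints and
means are invented): graph the group means and inspect the shape of the trend by eye.

```python
# Hypothetical sketch: plot group means against category midpoints of a
# quasi-interval independent variable to inspect the shape of the trend.
import matplotlib.pyplot as plt

age_midpoints = [21.5, 30.5, 40.5, 50.5]   # invented category midpoints
mean_scores = [14.8, 11.2, 11.0, 9.3]      # invented group means

plt.plot(age_midpoints, mean_scores, marker="o")
plt.xlabel("Age (category midpoint)")
plt.ylabel("Mean score on the dependent variable")
plt.title("Group means across a quasi-interval predictor")
plt.show()
```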

         Several core aspects of ANOVA are illustrated in Table 3, which shows group variation
in levels of depressive symptoms. These data are from the survey of Toronto adults introduced
earlier in this paper (Turner & Marino 1994; see Table 1). In addition to major depressive
disorder, this study also assessed the occurrence of depressive symptoms during the previous
week. This assessment was made with the Center for Epidemiologic Studies-Depression (CES-
D) Scale, which is the summated total of 20 symptoms, each rated from (0) "0 days" through (3)
"5-7 days."
         The average symptom level varies significantly by gender, age, and marital status. The
nature of this difference is clear for gender insofar as there are only two groups: the average is
lower for men than women. For age and marital status, however, the differences are less clear
because more than two groups are being compared.
         Symptoms are most common among the youngest adults and thereafter decline with age
(at least through age 55). It is tempting to conclude that the youngest and oldest groups differ,
given that these scores are the most extreme. These large differences, however, may be offset by
exceedingly large variances. As a result, we are limited to the nonspecific conclusion that
depressive symptoms vary with age.

       Table 3    Depression by Select Characteristics

       Characteristic                  CES-Da (Mean)         N          MDDb (%)

       Gender
         Male                             10.21***          603           7.7***
         Female                           13.10             788          12.9
       Age
         18-25                            15.14***          304          18.4***
         26-35                            10.92             470           9.8
         36-45                            11.09             393           7.2
         46-55                             9.15             224           4.7
       Marital Status
         Married                           9.98***          673           6.6***
         Previously Married               14.22             171          11.5
         Never Married                    13.70             547          15.8

       Total                              11.79           1,391          10.6

       Source: Turner and Marino (1994); Table 1.
       a Depressive symptoms; Center for Epidemiologic Studies-Depression (CES-D) Scale.
       b Major Depressive Disorder; Composite International Diagnostic Interview.
       *** p < .001

        The two unmarried groups have similar levels of symptoms compared to the markedly
different scores of the currently married. In the absence of specific tests for pairs of means,
however, we can only conclude that at least one of these means differs from the grand mean.
        The far right column of Table 3 presents prevalence estimates for major depressive
disorder. These data are shown here to emphasize the similarity between the analysis of means
(ANOVA) and the analysis of proportions (χ²). The prevalence of major depression differs
significantly by gender, age, and marital status. The nature of the gender difference is again
clear, given that there are only two groups. However, the nature of the age and marital status
associations is not specified for particular subgroups. Thus, we are limited to the conclusion that
depression is associated with age and cannot conclude that depression declines with age, even
though the prevalence of depression in the youngest age group is almost twice that of any other
age group. Similarly, although it is tempting to conclude that the married are substantially less
likely to be depressed than the previously married or the never married, this information is not
given by the overall test, meaning that we can only conclude that the three groups do not have the
same rate of depression.
        Finally, a comment on linearity. Figure 5 graphs the association between depression and
age from the data in Table 3. In this example, age has been collapsed into four categories,
meaning that it is ordinal rather than interval. In reality, however, age is an interval variable,
making it reasonable to ask whether its association with depression is linear. The problem in
treating age as a quasi-interval variable is the first interval, ages 18-25, which is shorter (7 years)
than the other intervals (10 years). The problem of unequal age intervals can be circumvented,
however, by assigning each interval a value equal to its midpoint.
        The observed age trend for depressive symptoms [Figure 5(a)] is distinctly nonlinear.
Symptoms are most common among the youngest age group and least common in the oldest age
group, but do not follow a pattern of steady decline between these extremes. Instead, there is a
plateau in average symptom levels between the two middle age groups.
        The observed age trend for rates of major depressive disorder [Figure 5(b)] also is
distinctly nonlinear, although the pattern differs somewhat from the pattern for average symptom
levels. Like symptoms, disorder is most common for the youngest adults and least common for
the oldest adults. Unlike symptoms, however, the decline with age is apparent across the two
middle age groups. Despite the continuity of decline, the trend is nonlinear because the decline
during the youngest period is noticeably steeper than thereafter.
        In sum, although ANOVA does not assume linear associations, it is possible to ascertain
whether this is the case when both variables are interval (or quasi-interval). The same is true for
the analysis of proportions using the χ² test. We turn now to the correlation coefficient, which
assumes the linear form.

         Correlations: Linear Associations. Although there are several correlation coefficients,
Pearson's r is by far the most widely used. This coefficient is used when both the independent
and dependent variables are measured at the interval level.10 From the perspective of
operationalizing a theory-based relationship, the most important aspect of this technique is the
assumption that the association between the independent and dependent variables is linear.11 It is
conventional to recommend inspection of a scatterplot to ensure that there are no gross
departures from linearity. This approach is illustrated in Figure 6 for both linear and nonlinear
associations.
         Although the shape of an association is usually clear in textbook illustrations such as this
one, it is more difficult to visualize associations from scatterplots in practice. The difficulty
arises because large sample sizes generate too many data points, many of which overlap.
Computer-generated scatterplots use symbols such as letters to signify the number of
observations at a particular location, but it is difficult to mentally weigh the points in a plot
according to these symbolic tallies. It is sometimes useful to plot a small random subsample of
one's cases to circumvent this problem.
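
         A sketch of that subsampling tactic (numpy and matplotlib assumed; the data are
simulated): draw a few hundred cases at random before plotting so the scatter remains legible.

```python
# Hypothetical sketch: scatterplot of a small random subsample of a large,
# heavily overplotted dataset.
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(1)
x = rng.normal(size=20_000)
y = 0.6 * x + rng.normal(size=20_000)

idx = rng.choice(len(x), size=300, replace=False)   # small random subsample
plt.scatter(x[idx], y[idx], s=10)
plt.xlabel("X")
plt.ylabel("Y")
plt.title("Scatterplot of a 300-case random subsample")
plt.show()
```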
         Another tactic for detecting nonlinearity is to collapse the independent variable and
examine the distribution of means as one would in ANOVA. This technique is not as
informative as the scatterplot, given that many distinct values are collapsed into categories and
means, but it is helpful in detecting departures from linearity, especially in combination with a
scatterplot. Yet another strategy entails collapsing both variables into categorical form and
examining their cross-tabulation. This procedure sacrifices even more information than the
previous approach, but it may be helpful, especially if extreme scores are of special interest.
         The correlation coefficient r describes the association between two variables as the
straight line that minimizes the deviation between observed (Y) and estimated (Ŷ) values of the
dependent variable, as illustrated in Figure 7. This feature gives the method its name, the "least-
squares method." The value of r measures the association between X and Y in terms of how

10
       As we have seen repeatedly, however, ordinal variables that approximate
       the interval level of measurement are often used in practice for
       statistical techniques that require an interval level of measurement.
11
       There are other requirements as well, including normal distributions and
       homoscedasticity. The reader is referred to a text on multiple
       regression for a thorough consideration of the correlational model and
       its assumptions.

closely the data points cluster around the least-squares line. The absolute value of r is large when
the data points hover close to the least-squares line; when observations are more widely
dispersed around this line, the absolute value of r is close to zero. The values of the correlation
coefficient range from +1, which indicates perfect correspondence, through 0, which signifies a
complete lack of correspondence, to -1, which connotes perfect inverse (i.e., negative)
association (see Figure 6).

         The null hypothesis is once again that Y is independent of X, specifically, H0: r = 0. The
test statistic for generalization from the sample to the population is t, computed as shown in
Figure 7. Although a two-tailed test may be used, the direction of the association is usually
theoretically important, which makes the use of a one-tailed test appropriate.
         It is important to note that this technique assumes that the association between X and Y is
linear in form. If there is a nonlinear association, the value of r will be seriously misleading.
This problem is illustrated in Figure 8. In this example, the data are better represented by a
parabola than by a straight line. Correlational techniques for determining whether specific
nonlinear trends are present entail a multivariate model. Here it suffices to note that these
techniques are isomorphic to those described above for trends in ANOVA.
         The slope of the least-squares line is of interest because it quantifies the magnitude of the
association between the independent and dependent variables. Specifically, the slope is the
change in Y produced by a one unit increase in X. A steep slope (in either direction) indicates a
strong relationship, whereas the line for a weak relationship is nearly horizontal. The slope is not
given by r, a common misconception, but can be derived from r and information about the
distributions of X and Y. Another indicator of the strength of the association is r2, which is the
proportion of the variance in the dependent variable that is accounted for by the independent
variable.
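
         A short sketch relating r, r², and the least-squares slope (scipy assumed; the data are
simulated, not the depression measures discussed below):

```python
# Hypothetical sketch: the correlation, its square, and the least-squares
# slope for two simulated interval variables.
import numpy as np
from scipy.stats import pearsonr, linregress

rng = np.random.default_rng(2)
x = rng.normal(size=500)
y = 0.6 * x + rng.normal(size=500)

r, p = pearsonr(x, y)
fit = linregress(x, y)
print(f"r = {r:.2f}, r^2 = {r**2:.2f}, p = {p:.3g}")
print(f"slope = {fit.slope:.2f}, intercept = {fit.intercept:.2f}")
```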
         For example, two measures of depression strongly covary, specifically, the correlation
between the Child Depression Inventory (CDI)12 and the Stony Brook (SB) Child Psychiatric
Checklist13 measure of depression is quite strong (r = .59; p < .001), but well below the perfect
correspondence (r = 1.00) that would be expected if both measures were perfectly reliable and
valid. The r² value is .35, meaning that about a third of the variance in the Stony Brook is shared
with the CDI (and vice versa). Although this correspondence is strong, most of the variance in
the two measures is not shared in common. The significance test for the correlation between the
CDI and the SB (p < .001) indicates that it is extremely unlikely that this correlation would have
been observed if in truth the two variables are not correlated with one another.


12
       Kovacs, M. & Beck, A.T. (1977). An empirical-clinical approach toward a
       definition of childhood depression. In J.G. Schulterbrandt & A. Raskin
       (Eds.), Depression in Childhood: Diagnosis, Treatment, and Conceptual
       Models. NY: Raven, pp. 1-25.
13
       Gadow, K.D. and Sprafkin, J. (1987). Stony Brook Child Psychiatric
       Checklist-3R. Stony Brook, New York. Unpublished manuscript.

        The correlation coefficient is an appropriate indicator of the association between the CDI
and the SB for two reasons: (1) both variables are quasi-interval, and (2) the association can be
assumed to be linear. The latter point is particularly important, given that the functional form of
associations is often overlooked. The two variables are similar measures of the same construct,
which means that an increase in one measure should be matched by an increase in the other
measure. Moreover, this correspondence should be evident across the full span of values for both
variables. There is no reason to anticipate, for example, a plateau, or a threshold effect. Thus,
the correlation coefficient is an appropriate choice.

         The correlation between the CDI and the SB is considerably stronger than their
correlations with measures of other constructs. This pattern is expected, given that two measures
of the same construct should be more highly correlated than measures of different constructs.
         However, adolescent depression was also assessed with two other measures, SB ratings
made by the mother and by the father. These measures correlate with the CDI (.29 and .22,
respectively), and with the adolescent's self assessment on the SB (.28 and .21, respectively), but
far below the correlation between the two adolescent measures (.59). Although these correlations
are all statistically significant, the parental measures account for no more than 8.4 percent of the
variance in the adolescent self-reports. Thus, as concluded earlier, parental ratings are not
especially good measures of adolescent mood, despite the fact that such ratings have been
standard practice in psychiatric epidemiology.
         In sum, the few correlations reviewed here demonstrate the importance of considering
both the statistical significance of an association and its substantive importance.

                                             SUMMARY
        Although the specifics of bivariate analysis are unique to each statistical technique, there
is a functional similarity across these methods. As noted above, the null hypothesis for bivariate
analysis states that the values on the two variables are independent of one another. The usual
goal is to reject this hypothesis, to conclude that the variables are not independent of one another.
 In practice, this means that knowing the values on one variable is informative about the likely
values on the second variable.

 
Path analysis with manifest variables
Path analysis with manifest variablesPath analysis with manifest variables
Path analysis with manifest variables
 
Correlation Studies - Descriptive Studies
Correlation Studies - Descriptive StudiesCorrelation Studies - Descriptive Studies
Correlation Studies - Descriptive Studies
 
Parametric & non-parametric
Parametric & non-parametricParametric & non-parametric
Parametric & non-parametric
 
Correlational research
Correlational researchCorrelational research
Correlational research
 
1756-0500-3-267.pdf
1756-0500-3-267.pdf1756-0500-3-267.pdf
1756-0500-3-267.pdf
 
Discriminant analysis.pptx
Discriminant analysis.pptxDiscriminant analysis.pptx
Discriminant analysis.pptx
 
Lesson 8 Linear Correlation And Regression
Lesson 8 Linear Correlation And RegressionLesson 8 Linear Correlation And Regression
Lesson 8 Linear Correlation And Regression
 
Anova in easyest way
Anova in easyest wayAnova in easyest way
Anova in easyest way
 
STATISTICAL REGRESSION MODELS
STATISTICAL REGRESSION MODELSSTATISTICAL REGRESSION MODELS
STATISTICAL REGRESSION MODELS
 
An Overview and Application of Discriminant Analysis in Data Analysis
An Overview and Application of Discriminant Analysis in Data AnalysisAn Overview and Application of Discriminant Analysis in Data Analysis
An Overview and Application of Discriminant Analysis in Data Analysis
 
A Compact Guide to Biostatistics
A Compact Guide to BiostatisticsA Compact Guide to Biostatistics
A Compact Guide to Biostatistics
 

Más de Deven Vaijapurkar (7)

Ch 4 merit rating and job details
Ch 4 merit rating and job detailsCh 4 merit rating and job details
Ch 4 merit rating and job details
 
Ch 4 contd.promotions and transfers
Ch 4 contd.promotions and transfersCh 4 contd.promotions and transfers
Ch 4 contd.promotions and transfers
 
Ch 3 communication about disciplinary measures
Ch 3 communication about disciplinary measuresCh 3 communication about disciplinary measures
Ch 3 communication about disciplinary measures
 
Ch 2 mpm letter drafting
Ch 2 mpm letter draftingCh 2 mpm letter drafting
Ch 2 mpm letter drafting
 
Ch.1 labour welfare
Ch.1 labour welfareCh.1 labour welfare
Ch.1 labour welfare
 
M.p.m. exam. nov.2011
M.p.m. exam. nov.2011M.p.m. exam. nov.2011
M.p.m. exam. nov.2011
 
10.mpm syllabus
10.mpm syllabus10.mpm syllabus
10.mpm syllabus
 

Último

Artificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and MythsArtificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and Myths
Joaquim Jorge
 

Último (20)

GenAI Risks & Security Meetup 01052024.pdf
GenAI Risks & Security Meetup 01052024.pdfGenAI Risks & Security Meetup 01052024.pdf
GenAI Risks & Security Meetup 01052024.pdf
 
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...
Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...
 
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
 
presentation ICT roal in 21st century education
presentation ICT roal in 21st century educationpresentation ICT roal in 21st century education
presentation ICT roal in 21st century education
 
Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024
 
Scaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organizationScaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organization
 
What Are The Drone Anti-jamming Systems Technology?
What Are The Drone Anti-jamming Systems Technology?What Are The Drone Anti-jamming Systems Technology?
What Are The Drone Anti-jamming Systems Technology?
 
GenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day PresentationGenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day Presentation
 
Artificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and MythsArtificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and Myths
 
Handwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed textsHandwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed texts
 
Partners Life - Insurer Innovation Award 2024
Partners Life - Insurer Innovation Award 2024Partners Life - Insurer Innovation Award 2024
Partners Life - Insurer Innovation Award 2024
 
Boost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivityBoost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivity
 
How to convert PDF to text with Nanonets
How to convert PDF to text with NanonetsHow to convert PDF to text with Nanonets
How to convert PDF to text with Nanonets
 
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
 
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time AutomationFrom Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
 
Data Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt RobisonData Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt Robison
 
08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking Men08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking Men
 
Powerful Google developer tools for immediate impact! (2023-24 C)
Powerful Google developer tools for immediate impact! (2023-24 C)Powerful Google developer tools for immediate impact! (2023-24 C)
Powerful Google developer tools for immediate impact! (2023-24 C)
 
Presentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreterPresentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreter
 
TrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law DevelopmentsTrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
 

Biva riate analysis pdf

variables pertain to one substantive theoretical variable (X). The two analytic variables are best thought of as one 2-part variable that reflects the nonlinear form of the association with the dependent variable. Thus, the analysis is bivariate even though a multivariate statistical technique is used.
        Although this distinction may appear to be hair-splitting, it reduces confusion about the focal relationship when more than one term is used to operationalize the independent variable. For example, the categories of ethnicity might be converted into a set of dichotomous "dummy variables" indicating whether the person is (1) African American, (2) Latino, (3) Asian American, or (4) non-Latino White (or is in the excluded reference category of "Other"). A theoretical model containing one independent variable, ethnicity, now appears to involve four independent variables. The four "dummy variables," however, are in actuality one composite variable with five categories. This type of hybrid variable requires a multivariate statistical technique, such as regression, even though it represents a bivariate association (a brief coding sketch is given below).
        The importance of bivariate analysis is sometimes overlooked because it is superseded by multivariate analysis. This misperception is reinforced by scientific journals that report bivariate associations only in passing, if at all. This practice creates the misleading impression that analysis begins at the multivariate level. In reality, the multiple-variable model rests upon the foundation laid by the thorough analysis of the 2-variable model. The proper specification of the theoretical model at the bivariate level is essential to the quality of subsequent multivariate analysis.
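To make the dummy-variable example concrete, here is a minimal sketch in Python using pandas; the data frame, category labels, and column names are illustrative assumptions, not taken from the original study.

    import pandas as pd

    # Hypothetical data: one substantive variable (ethnicity) with five categories.
    df = pd.DataFrame({"ethnicity": ["African American", "Latino", "Asian American",
                                     "non-Latino White", "Other", "Latino"]})

    # Convert the single nominal variable into 0/1 indicator ("dummy") columns,
    # dropping one category ("Other") to serve as the reference group.
    dummies = pd.get_dummies(df["ethnicity"], prefix="eth").drop(columns=["eth_Other"])
    print(dummies)

Although four indicator columns would enter a regression, they jointly represent the one five-category variable, so the model still expresses a single bivariate association.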
        Some forms of bivariate analysis require that variables be differentiated into independent and dependent types. For example, the analysis of group differences in means, either by t-test or ANOVA, treats the group variable as independent, which means that the procedure is asymmetrical: different values are obtained if the independent and dependent variables are inverted. In contrast, the Pearson correlation coefficient, the most widely used measure of bivariate association, yields identical values irrespective of which variable is treated as dependent, meaning that it is symmetrical: the same coefficient and probability level are obtained if the two variables are interchanged. Similarly, the chi-squared (χ²) test for independence between nominal variables yields the same value irrespective of whether the dependent variable appears in the rows or the columns of the contingency table. Although the test of statistical significance is unchanged, switching variables yields different expressions of the association because row and column percentages are not interchangeable. Unlike the correlation coefficient, where both the statistic and the test of statistical significance are symmetrical, only the probability level is symmetrical in the χ² technique.
        Designating one variable as independent and the other variable as dependent is productive even when this differentiation is not required by the statistical method. The value of this designation lies in setting the stage for subsequent multivariate analysis, where this differentiation is required by most statistical techniques.1 This designation is helpful in the bivariate analysis of the focal relationship because multivariate analysis ultimately seeks to determine whether the bivariate association is indicative of a state of dependency between the two variables. This approach makes more sense if the original association is conceptualized as a potential relationship.

METHODS OF BIVARIATE ANALYSIS

Selection of a Method
        There is a multitude of statistical techniques for the assessment of bivariate associations. This profusion of techniques reflects a key consideration in the selection of a method of analysis: the measurement properties of the independent and dependent variables. For example, correlational techniques are the method of choice for the analysis of two interval variables (when the association is assumed to be linear), but are not suitable for the analysis of two categorical variables. Given that there are numerous possible combinations of measurement types, there are numerous analytic techniques. A second contributor to this proliferation is sample size: some methods are applicable only to large samples. Statistical techniques are also distinguished from one another on the basis of assumptions about the distributional properties of the variables. For example, there are different computational formulas for the simple t-test depending upon whether the variance of the dependent variable is assumed to be the same in the two groups being compared.

1 Some multivariate techniques do not require that one variable be treated as dependent, for example, log-linear models, but this situation is an exception.
In contrast, nonparametric techniques make no distributional assumptions.
        The sheer number of alternative methods can bewilder. The bulk of bivariate analysis in the social sciences, however, is conducted with three techniques: contingency table analysis of proportions, ANOVA assessment of mean differences between groups, and correlation coefficients. As illustrated in Figure 1, a key consideration in the selection of a technique is the level of measurement. The contingency table technique is used when both variables are nominal. Means are analyzed when the independent variable is nominal and the dependent variable is interval or ratio. Correlations are calculated when both variables are interval or ratio (and the association is assumed to be linear).
        These three methods do not exhaust the possible combinations of independent and dependent variables, as indicated by the blank cells in Figure 1. Although there are alternative methods of analysis for these combinations, many researchers adapt one of the three methods shown in this figure. For instance, if the dependent variable is nominal and the independent variable is measured at a higher level, the independent variable is often collapsed into categorical form. This transformation permits the use of the familiar χ² test, but wastes valuable information about the inherent ordering of the interval variable. Ordinal variables are a bit troublesome because they do not satisfy the assumptions of the interval methods of analysis, but the use of a nominal method of analysis, the practical alternative, is wasteful because it does not make use of the ordering information. Interval methods are often used for ordinal variables that approximate interval variables, that is, quasi-interval variables, but strictly speaking this practice is inappropriate. If the ordinal variable does not approximate an interval variable, then it can be treated as a nominal variable in a χ² test, although, once again, this practice wastes information.
        Why not use a statistical technique intended specifically for the exact measurement properties of the independent and dependent variables instead of relaxing assumptions or losing power? Surely these techniques are more appropriate than the adaptations just described. What accounts for the popularity of bending methods or measures to permit the use of conventional methods of bivariate analysis? Quite simply, the conventional methods readily generalize into familiar methods of multivariate analysis. Correlations and ANOVA form the foundation for multiple linear regression. Similarly, logistic regression is based on the techniques used with contingency tables. These three methods of bivariate analysis are used frequently, then, because of their conceptual continuity with common forms of multivariate analysis.
        The reader is referred to a standard statistics text for less frequently used methods of assessing bivariate associations. These methods are omitted here so that excessive attention to technique does not deflect attention from the logic of analysis. A discussion of the relative merits of various types of correlation coefficients, for example, would unnecessarily divert attention from the question of whether a linear model, assumed in correlational techniques, is appropriate based on theory: not theory as taught in a statistics class, but the substantive theory directing the research.*
* According to the gospel of statistical theory, my dismissive treatment of technically correct methods of bivariate analysis is heretical. In defense of this stance, I note only that I am merely calling attention to a widespread practice in applied analysis.
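The selection logic just described (contingency tables for two nominal variables, ANOVA for a nominal independent and interval dependent variable, correlations for two interval variables) can be summarized in a short illustrative sketch. The Python function below and its level names are my own shorthand, not part of the original text or of Figure 1.

    def choose_bivariate_method(iv_level, dv_level):
        """Map measurement levels to one of the three conventional methods."""
        interval_like = {"interval", "ratio"}
        if iv_level == "nominal" and dv_level == "nominal":
            return "contingency table (chi-squared test of proportions)"
        if iv_level == "nominal" and dv_level in interval_like:
            return "ANOVA (t-test if only two groups)"
        if iv_level in interval_like and dv_level in interval_like:
            return "correlation coefficient (if a linear association is expected)"
        return "adapt one of the three methods (e.g., collapse a variable) or use a specialized technique"

    print(choose_bivariate_method("nominal", "interval"))  # -> ANOVA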
        In addition to level of measurement, the selection of a statistical procedure should also be based upon the type of association one expects to find. In most applications, this issue focuses on whether it is appropriate to use a linear model. In this context, linear means that there is a constant rate of change in the dependent variable across all values of the independent variable (for example, E(Y) = a + bX, where the slope b is the same at every value of X). This concept makes sense only where there are constant intervals on both variables, which means that, strictly speaking, linearity is relevant only when both variables are measured at the interval level.2 In other applications, the issue is specifying which groups are expected to differ from others. An example of this approach would be hypothesizing that the prevalence of depression is greater among women than men, as distinct from asserting that gender and depression are associated with one another (see below).
        In practice, the expected functional form of an association is often overlooked in the selection of an analytic technique. We tend to become preoccupied with finding a procedure that fits the measurement characteristics of the independent and dependent variables. The validity of the entire analysis, however, depends upon the selection of analytic techniques that match theory-based expectations about the form of the association. Unfortunately, theory is often mute on this topic. Nevertheless, it is incumbent upon the analyst to translate theory into the appropriate analytic model.

Methods: Proportions, Means and Correlations
        In this section, the three most common methods of bivariate analysis are summarized briefly: χ² tests of proportions, ANOVA for mean differences, and correlation coefficients for linear associations. As noted above, the selection of a method is likely to be driven by the measurement characteristics of the independent and dependent variables. Contingency tables are appropriate for two nominal variables; tests of mean differences are used for an interval outcome and a nominal independent variable; correlations are employed for linear associations between interval variables (see Figure 1). If both variables are interval, but the association is expected to be nonlinear, then the correlational technique needs to be adapted to the expected form of the association using a multiple regression format. Each of these forms of analysis can be performed using any major statistical software program. The emphasis of this presentation, therefore, is not on computations, but on interpretation. It is useful, however, to review the fundamentals of these methods of analysis to understand their proper use and interpretation.

2 There are a few common exceptions. For example, correlation coefficients are often calculated for ordinal variables that are quasi-interval. Also, dichotomous dependent variables are often treated as interval because there is only one interval.
        It should be noted that the techniques discussed here are but a few of the many options available for bivariate analysis. Characteristics of one's data may make other approaches much more appropriate. The three techniques described here are highlighted because of their widespread use and because they form the basis for the most commonly used types of multivariate analysis. The anticipation of multivariate analysis makes it logical to conduct bivariate analysis that is consistent with the multivariate model.

Proportions: Contingency Tables
        Two properties of the χ² analysis of a contingency table make it an especially appealing form of bivariate analysis. First, it is based on the lowest form of measurement, two nominal variables. The absence of level-of-measurement restrictions means that the technique may also be used with ordinal, interval, or ratio data. Second, this technique does not require assumptions about the nature of the association; in particular, it does not assume a linear association. It is used to determine whether any association is present in the data, without specifying in advance the expected form of this association. This flexibility is the method's most appealing characteristic.
        These characteristics, however, also establish the limitations of the method. Using χ² analysis with higher-order variables means that some data are transformed into a lower form of measurement, converted to categorical form. This transformation leads to a loss of information and a concomitant loss of statistical power. Although other statistics for contingency table analysis take into consideration the ordinal quality of variables (e.g., Somers's D), these techniques are not as widely used as the simple yet less powerful χ². Furthermore, the χ² test only tells you that some association seems to be present, without regard to its theoretical relevance. The conclusion that an association is present is not nearly as meaningful, compelling, or satisfying as the conclusion that the expected association is present. The χ² test does not yield this information, although it is possible to adapt the method to this end.
        The χ² test for independence is used to determine whether there is an association between two categorical variables. If the two variables are unrelated, then the distribution of one variable should be the same regardless of the value of the other variable. If instead the two variables are associated, then the distribution of the dependent variable should differ across the values of the independent variable.
        The χ² test for independence does not distinguish between independent and dependent variables. Treating one variable as independent is optional and does not alter the value of the test statistic. The dependency between the variables could be stated in the reverse direction: the distribution of the independent variable differs across categories of the dependent variable. Although immaterial to the calculation of χ², this formulation is backwards in terms of the logic of cause and effect. It treats the dependent variable as fixed and assesses variation in the independent variable across these fixed values. However, a variable that depends upon another does not have fixed values; its values vary according to the influence of the independent variable.
For example, the association between gender and depression is best stated as differences in the probability of being depressed between men and women, not whether the probability of being a woman differs between depressed and not depressed persons.
The proper formulation of the association, then, is to examine variation in the distribution of the outcome variable across categories of its presumed cause.3
        The χ² test for independence is illustrated in Figure 2. In this contingency table, the independent variable X appears in the rows (1...i) and the dependent variable Y appears in the columns (1...j). The analytic question is whether the distribution of Y varies across the categories of X.

3 Whether this distribution is calculated as row or column percentages is immaterial.
        The overall distribution of Y is given at the bottom of the table. For example, the proportion in column 1 is p.1 = N.1/N; the proportion in column 2 is p.2 = N.2/N; and so on until p.j = N.j/N. This proportional distribution should be duplicated within each row of the table if Y is indeed independent of X. In other words, the distribution of subjects within row 1 should resemble the distribution of subjects within row 2, and so on through row i, the last row. This similarity means that the proportion of subjects in column j should be similar across all categories of X (1, 2, ... i), and similar to the overall proportion of subjects in column j (p.j, or N.j/N). This equivalency should be manifest for all values of Y across all values of X. The null hypothesis for the χ² test for independence essentially states that identical conditional probabilities are expected under the condition that Y does not depend upon X:

        H0:  p11 = p21 = . . . = pi1 = p.1 = N.1/N
             p12 = p22 = . . . = pi2 = p.2 = N.2/N
             . . .
             p1j = p2j = . . . = pij = p.j = N.j/N                    (1)

The null hypothesis is evaluated with the χ² test statistic. Its definitional formula and degrees of freedom appear in Figure 2. A large χ² value relative to its degrees of freedom leads to rejection of the null hypothesis. The null hypothesis is rejected if any pij ≠ p.j, that is, if any proportion within the table deviates substantially from the marginal distributions of X and Y. This result means that the observed covariation between X and Y is unlikely to occur if in reality X and Y are independent.
        The key to understanding this procedure is the calculation of expected values for the cells comprising the contingency table. These values are calculated assuming that column distributions are independent of row distributions (and vice versa). If so, then the overall marginal proportions for the rows and columns should be replicated within the body of the table. The value expected under independence is compared to the observed value for each cell. If the assumption of independence is valid, there should be only a small difference between expected and observed values. If instead there is a large difference for one or more cells, the χ² statistic will be large relative to its degrees of freedom and the null hypothesis will be rejected.
        The chief limitation of this procedure is the nonspecific nature of the null hypothesis. When we reject it, we know that one or more cell frequencies differ markedly from the expected, but we do not know which cells are deviant. We know that an association exists, but do not know the form of the association. It might be the one predicted by theory, but then again it might be some other form of association.
        This limitation can be remedied if the expected relationship is more precisely specified prior to analysis. To continue the example given earlier in this section, we might hypothesize that females are more likely to be depressed than males. This hypothesis is considerably more precise than the hypothesis that gender and depression are associated with one another. It could be operationalized as an odds ratio greater than 1.00 for depression for females relative to males. The odds ratio expresses the association between variables in a multiplicative form, meaning that a value of 1.00 is equivalent to independence. The odds ratio for the data reported in Table 1 is 1.80. This is the exact value for the sample (N = 1,393).
        If we wish to extend this finding to the population, we could calculate a confidence interval for the odds ratio and determine whether it includes the value of 1.4 The 95% confidence interval for this example is 1.25 to 2.60. The lower boundary is greater than 1, meaning that the odds of being depressed are significantly higher for women than for men. Note that this conclusion is more precise than the conclusion made with χ², namely that depression is not independent of gender. Note also that we are interested only in the possibility that females are at greater risk. A greater risk among males would disconfirm our theory. However, a male excess of depression would lead to rejection of the null hypothesis in the χ² procedure. Our more specific null hypothesis is rejected only if the entire confidence interval lies above 1; we fail to reject if the interval includes 1 or lies entirely below 1. When only an association is hypothesized, we fail to reject only when the confidence interval includes 1, not when it lies entirely below 1. By specifying the nature of the association we have increased the power of the test.

4 Agresti (1990:54-6) gives the following formula for the 100(1-α) percent confidence interval for log θ for large samples: log(θ) ± zα/2 σ(log θ), where θ = n11n22/n12n21 is the sample value of the odds ratio, and σ(log θ) = [(1/n11) + (1/n22) + (1/n12) + (1/n21)]1/2. The confidence interval for θ is obtained by exponentiating (taking the antilog of) the endpoints of this interval. (Agresti, Alan. 1990. Categorical Data Analysis. New York: John Wiley & Sons.)
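As a worked illustration of the preceding steps (expected counts, the χ² test, the odds ratio, and its large-sample confidence interval), here is a minimal Python sketch. The 2x2 counts are hypothetical values chosen only so that the odds ratio comes out near 1.8, as in the example; they are not the actual frequencies from Table 1.

    import numpy as np
    from scipy import stats

    # Hypothetical counts: rows = gender (female, male), columns = depressed (yes, no).
    observed = np.array([[102, 686],
                         [ 46, 557]])

    # Expected counts under independence: (row total x column total) / N.
    chi2, p, dof, expected = stats.chi2_contingency(observed, correction=False)
    print(expected.round(1))
    print(f"chi-squared = {chi2:.2f}, df = {dof}, p = {p:.4f}")

    # Odds ratio and its 95% confidence interval on the log scale (Agresti 1990).
    n11, n12 = observed[0]
    n21, n22 = observed[1]
    theta = (n11 * n22) / (n12 * n21)
    se_log = np.sqrt(1/n11 + 1/n12 + 1/n21 + 1/n22)
    z = stats.norm.ppf(0.975)
    lo, hi = np.exp(np.log(theta) + np.array([-1, 1]) * z * se_log)
    print(f"odds ratio = {theta:.2f}, 95% CI = ({lo:.2f}, {hi:.2f})")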
        The χ² test is useful for inferential purposes, but it is not informative about the magnitude of the association. Quantifying the association requires an additional calculation. Although there are numerous options from which to select, the odds ratio described earlier has gained considerable popularity in recent years because of its straightforward interpretation. Somers's d is useful as well because of its common use in the logistic regression model, which is a multivariate generalization of the χ² procedure.
        For an example of crosstabular analysis, we can turn to a study estimating the extent to which midlife and older parents experience the divorce of their sons and daughters (Spitze, Logan, Deane, and Zerger 1994).5 The overall prevalence of this event was influenced by time insofar as the oldest adults have been at risk for the longest time, whereas their younger counterparts have had less time to have adult children marry and divorce. The bivariate analysis shown in Table 2 takes into account the age of the parent.

TABLE 2. Percent Distribution of Marital History of Children by Age of the Parent

                                               Age of the Parent
Percentage of all respondents who have had:   40-49   50-59   60-69   70+
  An adult child                                55.1    87.6    88.6   72.2
  An ever-married child                         19.5    70.5    83.1   74.4
  An ever-divorced child                         3.0    24.0    40.6   35.8
  A currently divorced/separated child           4.4    19.6    27.6   24.4
  A remarried child                              1.1     9.5    22.1   21.1
  N                                              365     275     308    246
Source: Spitze et al. (1994).

        Although the experience of having an adult child is normative in all of these age groups, barely half of the youngest adults have experienced this life course transition. At older ages, about four in five have done so. The researchers attribute the slight dropoff in the oldest age group to the lower fertility of those who were of childbearing age prior to the baby boom period.

5 Spitze, Glenna, Logan, John R., Deane, Glenn, and Zerger, Suzanne. (1994). Adult children's divorce and intergenerational relationships. Journal of Marriage and the Family, 56:279-293.
        These age trends are even more evident for having a married child. This event is relatively rare for the youngest group, but quite common among persons over the age of 50 years. Age is strongly associated with having children who have divorced, including those who are currently divorced, and with having children who have remarried.
        No tests of statistical significance are presented for the data in Table 2 in the original report. The differences between the youngest cohort and the other cohorts are quite large. Given the large sample size, these differences are obviously statistically significant. If χ² tests were provided, however, there would be five such tests, one for each dependent variable. At first glance, it may not be obvious that there are five dependent variables. This confusion arises because the table reports percentages rather than cell frequencies (compare with Figure 2). (Cell frequencies can be extracted from Table 2 by multiplying the percentage by the sample size and dividing by 100; for example, 55.1 percent of the 365 parents aged 40-49 corresponds to roughly 201 respondents with an adult child.) Only the percent "yes" is given in Table 2 because the percent "no" is implicit, given that each of these variables is a dichotomy. Thus, Table 2 represents the cross-tabulation of age (in 4 categories) by each of the following dependent variables: adult child/no; ever-married child/no; ever-divorced child/no; currently separated-divorced child/no; remarried child/no.
        The question of linearity is relevant to this example because the independent variable, age, is quasi-interval and the dependent variables are all dichotomies. Two distinct nonlinear patterns are evident in these data. The first is an inverted U-shaped curve: a sharp increase, especially between the forties and fifties, followed by a decline among those who are seventy or older. As noted previously, the researchers attribute the decline to a pre-baby boom cohort effect. The first three dependent variables follow this pattern. The increase at earlier ages reflects the combined impact of life course considerations, such as the duration of risk for the event. These factors appear to be the primary consideration for the second pattern, an initial increase followed by a plateau, which describes the last two entries in the table.

Means: Analysis of Variance
        The ANOVA procedure is similar in many respects to the χ² test for independence. In both techniques, the independent variable needs to be measured only at the nominal level. Also, the null hypothesis is structurally similar in the two procedures. In the case of χ², we test whether proportions are constant across categories of the independent variable and, therefore, equal to the overall marginal proportion. For ANOVA, we test whether means are constant across categories of the independent variable and, thus, equal to the grand mean. In both techniques, the null hypothesis is nonspecific: rejecting it is not informative about which categories of the independent variable differ. The main practical difference between the methods is that ANOVA requires an interval level of measurement for the dependent variable.6 This restriction means that the ANOVA technique is not as widely applicable as the χ² test, for which the dependent variable can be nominal. The limitation of ANOVA to interval dependent variables, however, is a trade-off for greater statistical power.
6 This assumption is violated when ANOVA is used for ordinal variables that approximate the interval level of measurement, but this procedure is, strictly speaking, incorrect.
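The power trade-off just mentioned can be illustrated with a small simulation (a minimal Python sketch; the effect size, sample sizes, and median-split dichotomization are my own illustrative assumptions): the same modest group difference is detected more often by a t-test on the interval outcome than by a χ² test after the outcome is collapsed into two categories.

    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(0)
    n_sims, reject_t, reject_chi2 = 1000, 0, 0

    for _ in range(n_sims):
        a = rng.normal(0.0, 1.0, 50)          # group 1
        b = rng.normal(0.4, 1.0, 50)          # group 2, slightly higher mean
        reject_t += stats.ttest_ind(a, b).pvalue < 0.05

        # Collapse the interval outcome at the pooled median, then test with chi-squared.
        cut = np.median(np.concatenate([a, b]))
        table = [[np.sum(a > cut), np.sum(a <= cut)],
                 [np.sum(b > cut), np.sum(b <= cut)]]
        reject_chi2 += stats.chi2_contingency(table)[1] < 0.05

    print(f"power, t-test on interval outcome:       {reject_t / n_sims:.2f}")
    print(f"power, chi-squared on collapsed outcome: {reject_chi2 / n_sims:.2f}")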
        The χ² test could be substituted for ANOVA by transforming an interval dependent variable into categorical form. This approach is undesirable because valuable information is lost when a range of values is collapsed into a single category. Additional information is lost because the ordering of values is immaterial to χ². The attendant decrease in statistical power makes χ² an unattractive alternative to ANOVA.
        The ANOVA procedure is concerned with both central tendency and spread. The measure of central tendency is the mean. It is calculated for the dependent variable, both overall and within groups defined by the categories of the independent variable. Specifically, the null hypothesis is that the within-group mean is equal across groups and, therefore, equal to the grand mean:

        H0:  µ1 = µ2 = ... = µj = µ                    (2)

In this equation, µ is the mean, and j is the number of groups, which is the number of categories on the independent variable. The null hypothesis is rejected if any µj ≠ µ, that is, if any group mean differs from the grand mean.
        The critical issue in ANOVA is not the absolute difference in means, however, but the difference in means relative to spread, or the variance of the distribution.7 This feature is illustrated in Figure 3 for several hypothetical distributions. The first panel (a) displays large differences in means relative to the within-group variation, a pattern that yields a large value for F. This pattern would lead to a rejection of the null hypothesis. The second panel (b) displays the same absolute mean differences as the top panel, but substantially larger within-group variation. Although the means differ from one another, there is a large amount of overlap among the distributions. The overlap is so extensive that the distribution with the lowest mean extends into the distribution with the highest mean. This pattern would produce a low F value, leading to failure to reject the null hypothesis. This conclusion is reached even though the absolute mean differences are the same as in panel a, which led to rejection of the null hypothesis. The difference in conclusions for the two sets of distributions arises from their spreads: the mean difference is large relative to the variance of the distribution in panel a, but relatively small in panel b.
        The third panel (c) shows distributions with the same variances as the second panel (b), but with substantially larger mean differences. As in the previous case, the variances are large. In this instance, however, the mean differences between groups are also large, so the null hypothesis would again be rejected. Similarly, in the last panel (d) the variances are small, but so are the mean differences between groups. We would fail to reject the null hypothesis despite the small variances because these variances are large relative to the mean differences.

7 ANOVA was originally developed in terms of a variance; specifically, the hypothesis that all of the group means are equal is equivalent to the hypothesis that the variance of the means is zero (Darlington 1974).
        As these examples illustrate, it is not the absolute value of the mean differences that is crucial, but the mean difference relative to the variance.
        ANOVA is based on the decomposition of variation in the dependent variable into within- and between-group components, with the groups being categories of the independent variable. The calculations are based on sums of squares, that is, deviations of observations from the mean, as shown in Figure 4.8 The test statistic for ANOVA is F, which is a ratio of variances. If differences between the means are due to sampling error, then the F ratio should be around 1.00. Large values of F (relative to its degrees of freedom) would lead to rejection of the null hypothesis.
        Although ANOVA may be used when there are only 2 categories on the independent variable, it is customary to use a t-test in this situation. The definitional formula for t is:9

        t = (M1 - M2) / [(s1²/n1) + (s2²/n2)]1/2                    (3)

t and F are equivalent when there are only two groups: t² = F. Thus, the choice of one method over the other is immaterial.
        ANOVA is not informative about which means are unequal: it tests only whether all means are equal to one another and to the grand mean. Rejection of the null hypothesis signifies that the dependent variable is probably not uniformly distributed across all values of the independent variable, but does not reveal which cells deviate from expectations. Thus, an association appears to be present, but it is not known whether it is the one forecast by theory. As was the case with the χ² procedure, additional steps are required to test more specific hypotheses about the precise nature of the association. This may be done a priori or using a post hoc test (e.g., Scheffé). Specifying contrasts in advance (on the basis of theory) is preferable to examining all possible contrasts after the fact because the latter penalizes you for making multiple contrasts, to reduce the risk of capitalizing on chance.
        Depending upon one's hypothesis, it also may be desirable to test for a trend. Trend analysis is most relevant when the independent variable is interval and linearity is at issue. In this case, it may be desirable to partition the between-groups sum of squares into linear, quadratic, cubic, or higher-order trends. However, visual inspection of the data is usually the most informative approach for understanding the shape of the association.

8 There are numerous alternative specifications of the ANOVA model depending upon whether fixed or random effects are modeled, whether there are equal or unequal cell frequencies, etc. The user should consult a text on ANOVA for a full discussion of these technical concerns.
9 This formula for t assumes unequal variances between groups; a slightly different formula is used if one can assume equal variances.
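As a minimal computational sketch of these ideas (F as a ratio of between-group to within-group variance, and t² = F with two groups), here is an illustrative Python example with synthetic data; the groups and values are assumptions, not data from the text.

    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(1)
    # Hypothetical interval outcome for three groups of a nominal independent variable.
    g1 = rng.normal(10, 5, 120)
    g2 = rng.normal(14, 5, 120)
    g3 = rng.normal(13, 5, 120)

    # One-way ANOVA: F = (between-group mean square) / (within-group mean square).
    F, p = stats.f_oneway(g1, g2, g3)
    print(f"F = {F:.2f}, p = {p:.4f}")

    # With only two groups, the pooled-variance t-test is equivalent to ANOVA: t^2 = F.
    t, _ = stats.ttest_ind(g1, g2, equal_var=True)
    F2, _ = stats.f_oneway(g1, g2)
    print(f"t^2 = {t**2:.2f}, F = {F2:.2f}")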
        Simply plotting mean values is often more instructive than the results of sophisticated statistical tests.
        Several core aspects of ANOVA are illustrated in Table 3, which shows group variation in levels of depressive symptoms. These data are from the survey of Toronto adults introduced earlier in this paper (Turner & Marino 1994; see Table 1). In addition to major depressive disorder, this study also assessed the occurrence of depressive symptoms during the previous week. This assessment was made with the Center for Epidemiologic Studies-Depression (CES-D) Scale, which is the summated total of 20 symptoms, each rated from (0) "0 days" through (3) "5-7 days."

TABLE 3. Depression by Select Characteristics

Characteristic            CES-D (Mean)a        N      MDD (%)b
Gender
  Male                       10.21***         603       7.7***
  Female                     13.10            788      12.9
Age
  18-25                      15.14***         304      18.4***
  26-35                      10.92            470       9.8
  36-45                      11.09            393       7.2
  46-55                       9.15            224       4.7
Marital Status
  Married                     9.98***         673       6.6***
  Previously Married         14.22            171      11.5
  Never Married              13.70            547      15.8
Total                        11.79          1,391      10.6
Source: Turner and Marino (1994); Table 1.
a Depressive symptoms; Center for Epidemiologic Studies-Depression Scale.
b Major Depressive Disorder; Composite International Diagnostic Interview.
*** p < .001

        The average symptom level varies significantly by gender, age, and marital status. The nature of this difference is clear for gender insofar as there are only two groups: the average is lower for men than for women. For age and marital status, however, the differences are less clear because more than two groups are being compared. Symptoms are most common among the youngest adults and thereafter decline with age (at least through age 55). It is tempting to conclude that the youngest and oldest groups differ, given that these scores are the most extreme. These large differences, however, may be offset by exceedingly large variances. As a result, we are limited to the nonspecific conclusion that depressive symptoms vary with age.
        The two unmarried groups have similar levels of symptoms, in contrast to the markedly different score of the currently married. In the absence of specific tests for pairs of means, however, we can only conclude that at least one of these means differs from the grand mean.
        The far right column of Table 3 presents prevalence estimates for major depressive disorder. These data are shown here to emphasize the similarity between the analysis of means (ANOVA) and the analysis of proportions (χ²). The prevalence of major depression differs significantly by gender, age, and marital status. The nature of the gender difference is again clear, given that there are only two groups. However, the nature of the age and marital status associations is not specified for particular subgroups. Thus, we are limited to the conclusion that depression is associated with age and cannot conclude that depression declines with age, even though the prevalence of depression in the youngest age group is almost twice that of any other age group. Similarly, although it is tempting to conclude that the married are substantially less likely to be depressed than the previously married or the never married, this information is not given by the overall test, meaning that we can only conclude that the three groups do not have the same rate of depression.
        Finally, a comment on linearity. Figure 5 graphs the association between depression and age from the data in Table 3. In this example, age has been collapsed into four categories, meaning that it is ordinal rather than interval. In reality, however, age is an interval variable, making it reasonable to ask whether its association with depression is linear. The problem in treating age as a quasi-interval variable is the first interval, ages 18-25, which is shorter (7 years) than the other intervals (10 years). The problem of unequal age intervals can be circumvented, however, by assigning each interval a value equal to its midpoint.
        The observed age trend for depressive symptoms [Figure 5(a)] is distinctly nonlinear. Symptoms are most common among the youngest age group and least common in the oldest age group, but do not follow a pattern of steady decline between these extremes. Instead, there is a plateau in average symptom levels between the two middle age groups. The observed age trend for rates of major depressive disorder [Figure 5(b)] also is distinctly nonlinear, although the pattern differs somewhat from the pattern for average symptom levels. Like symptoms, disorder is most common for the youngest adults and least common for the oldest adults. Unlike symptoms, however, the decline with age is apparent across the two middle age groups. Despite the continuity of decline, the trend is nonlinear because the decline during the youngest period is noticeably steeper than thereafter.
        In sum, although ANOVA does not assume linear associations, it is possible to ascertain whether this is the case when both variables are interval (or quasi-interval). The same is true for the analysis of proportions using the χ² test. We turn now to the correlation coefficient, which assumes the linear form.
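Before turning to correlations, here is a minimal matplotlib sketch of the kind of plot Figure 5 describes, using the interval midpoints and the values reported in Table 3; the plotting choices are mine, and the original figure is not reproduced here.

    import matplotlib.pyplot as plt

    # Age-interval midpoints (one way to handle the shorter first interval)
    # and the corresponding values from Table 3.
    midpoints = [21.5, 30.5, 40.5, 50.5]
    cesd_means = [15.14, 10.92, 11.09, 9.15]   # mean CES-D symptom scores
    mdd_rates = [18.4, 9.8, 7.2, 4.7]          # MDD prevalence (%)

    fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(8, 3))
    ax1.plot(midpoints, cesd_means, marker="o")
    ax1.set(xlabel="Age (interval midpoint)", ylabel="Mean CES-D score")
    ax2.plot(midpoints, mdd_rates, marker="o")
    ax2.set(xlabel="Age (interval midpoint)", ylabel="MDD prevalence (%)")
    plt.tight_layout()
    plt.show()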
Correlations: Linear Associations
        Although there are several correlation coefficients, Pearson's r is by far the most widely used. This coefficient is used when both the independent10 and dependent variables are measured at the interval level. From the perspective of operationalizing a theory-based relationship, the most important aspect of this technique is the assumption that the association between the independent and dependent variables is linear.11
        It is conventional to recommend inspection of a scatterplot to ensure that there are no gross departures from linearity. This approach is illustrated in Figure 6 for both linear and nonlinear associations. Although the shape of an association is usually clear in textbook illustrations such as this one, it is more difficult to visualize associations from scatterplots in practice. The difficulty arises because large sample sizes generate too many data points, many of which overlap. Computer-generated scatterplots use symbols such as letters to signify the number of observations at a particular location, but it is difficult to mentally weigh the points in a plot according to these symbolic tallies. It is sometimes useful to select a small random subsample of one's data to circumvent this problem. Another tactic for detecting nonlinearity is to collapse the independent variable and examine the distribution of means as one would in ANOVA. This technique is not as informative as the scatterplot, given that many distinct values are collapsed into categories and means, but it is helpful in detecting departures from linearity, especially in combination with a scatterplot. Yet another strategy entails collapsing both variables into categorical form and examining their cross-tabulation. This procedure sacrifices even more information than the previous approach, but it may be helpful, especially if extreme scores are of special interest.
        The correlation coefficient r describes the association between two variables as the straight line that minimizes the deviation between observed (Y) and estimated (Ŷ) values of the dependent variable, as illustrated in Figure 7. This feature gives the method its name, the "least-squares method." The value of r measures the association between X and Y in terms of how closely the data points cluster around the least-squares line.

10 As we have seen repeatedly, however, ordinal variables that approximate the interval level of measurement are often used in practice for statistical techniques that require an interval level of measurement.
11 There are other requirements as well, including normal distributions and homoscedasticity. The reader is referred to a text on multiple regression for a thorough consideration of the correlational model and its assumptions.
        The absolute value of r is large when the data points hover close to the least-squares line; when observations are more widely dispersed around this line, the absolute value of r is close to zero. The values of the correlation coefficient range from 1, which indicates perfect correspondence, through 0, which signifies a complete lack of correspondence, to -1, which connotes perfect inverse (i.e., negative) association (see Figure 6).
        The null hypothesis is once again that Y is independent of X, specifically, H0: r = 0. The test statistic for generalization from the sample to the population is t, computed as shown in Figure 7. Although a two-tailed test may be used, the direction of the association is usually theoretically important, which makes the use of a one-tailed test appropriate.
        It is important to note that this technique assumes that the association between X and Y is linear in form. If there is a nonlinear association, the value of r will be seriously misleading. This problem is illustrated in Figure 8. In this example, the data are better represented by a parabola than by a straight line. Correlational techniques for determining whether specific nonlinear trends are present entail a multivariate model. Here it suffices to note that these techniques are isomorphic to those described above for trends in ANOVA.
        The slope of the least-squares line is of interest because it quantifies the magnitude of the association between the independent and dependent variables. Specifically, the slope is the change in Y produced by a one-unit increase in X. A steep slope (in either direction) indicates a strong relationship, whereas a nearly horizontal slope indicates a weak relationship. The slope is not given by r, a common misconception, but can be derived from r and information about the distributions of X and Y. Another indicator of the strength of the association is r², which is the proportion of the variance in the dependent variable that is accounted for by the independent variable.
        For example, two measures of depression strongly covary: specifically, the correlation between the Child Depression Inventory (CDI)12 and the Stony Brook (SB) Child Psychiatric Checklist13 measure of depression is quite strong (r = .59; p < .001), but well below the perfect correspondence (r = 1.00) that would be expected if both measures were perfectly reliable and valid. The r² value is .35, meaning that about a third of the variance in the Stony Brook is shared with the CDI (and vice versa). Although this correspondence is strong, most of the variance in the two measures is not shared in common. The significance test for the correlation between the CDI and the SB (p < .001) indicates that it is extremely unlikely that this correlation would have been observed if in truth the two variables are not correlated with one another.

12 Kovacs, M. & Beck, A.T. (1977). An empirical-clinical approach toward a definition of childhood depression. In J.G. Schulterbrandt & A. Raskin (Eds.), Depression in Childhood: Diagnosis, Treatment, and Conceptual Models. New York: Raven, pp. 1-25.
13 Gadow, K.D. & Sprafkin, J. (1987). Stony Brook Child Psychiatric Checklist-3R. Stony Brook, New York. Unpublished manuscript.
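The computation of r, its significance test, and r² can be sketched as follows (a minimal Python example with synthetic scores; the simulated variables stand in for two measures of the same construct and are not the actual CDI or SB data).

    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(2)
    # Two hypothetical measures of the same construct that share variance.
    measure_a = rng.normal(size=250)
    measure_b = 0.6 * measure_a + 0.8 * rng.normal(size=250)

    r, p = stats.pearsonr(measure_a, measure_b)
    print(f"r = {r:.2f}, p = {p:.4f}, r^2 = {r**2:.2f}")
    # r^2 is the proportion of variance in one measure shared with the other.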
        The correlation coefficient is an appropriate indicator of the association between the CDI and the SB for two reasons: (1) both variables are quasi-interval, and (2) the association can be assumed to be linear. The latter point is particularly important, given that the functional form of associations is often overlooked. The two variables are similar measures of the same construct, which means that an increase in one measure should be matched by an increase in the other measure. Moreover, this correspondence should be evident across the full span of values for both variables. There is no reason to anticipate, for example, a plateau or a threshold effect. Thus, the correlation coefficient is an appropriate choice.
        The correlation between the CDI and the SB is considerably stronger than their correlations with measures of other constructs. This pattern is expected, given that two measures of the same construct should be more highly correlated than measures of different constructs. However, adolescent depression was also assessed with two other measures, SB ratings made by the mother and by the father. These measures correlate with the CDI (.29 and .22, respectively), and with the adolescent's self-assessment on the SB (.28 and .21, respectively), but far below the correlation between the two adolescent measures (.59). Although these correlations are all statistically significant, the parental measures account for no more than 8.4 percent of the variance in the adolescent self-reports. Thus, as concluded earlier, parental ratings are not especially good measures of adolescent mood, despite the fact that such ratings have been standard practice in psychiatric epidemiology. In sum, the few correlations reviewed here demonstrate the importance of considering both the statistical significance of an association and its substantive importance.

SUMMARY

        Although the specifics of bivariate analysis are unique to each statistical technique, there is a functional similarity across these methods. As noted above, the null hypothesis for bivariate analysis states that the values on the two variables are independent of one another. The usual goal is to reject this hypothesis, to conclude that the variables are not independent of one another. In practice, this means that knowing the values on one variable is informative about the likely values on the second variable.