January 16, 2003
BIVARIATE ANALYSIS:
ESTIMATING ASSOCIATIONS
Carol S. Aneshensel
University of California, Los Angeles
I. OVERVIEW
A. Determines whether two variables are empirically associated.
B. Key issue: Is the association in the form anticipated by theory?
C. Lays the foundation for multivariate analysis
II. METHODS OF BIVARIATE ANALYSIS
A. Criteria for Selection of a Method
B. Proportions: Contingency Tables
1. Test of conditional probabilities
2. Used with 2 categorical variables
3. No distributional assumptions
4. Low statistical power
5. Calculation of chi-squared (χ²) and its degrees of freedom
C. Mean Differences: Analysis of Variance
1. Test of mean differences relative to variance
2. Independent variable is categorical
3. Dependent variable is interval or ratio.
4. No assumptions about the form of the association
5. Calculation of F and its degrees of freedom
D. Correlation: Correlation Coefficients
1. Test of linear association
2. Two interval variables
3. Linear association
4. Calculation of t and its degrees of freedom
The first step in the analysis of a focal relationship is to determine whether there is an
empirical association between its two component variables. This objective is accomplished by
means of bivariate analysis. This analysis ascertains whether the values of the dependent
variable tend to coincide with those of the independent variable. In most instances, the
association between two variables is assessed with a bivariate statistical technique (see below for
exceptions). The three most commonly used techniques are contingency tables, analysis of
variance (ANOVA), and correlations. The basic bivariate analysis is then usually extended to a
multivariate form to evaluate whether the association can be interpreted as a relationship.
Not any association will do, however: We are interested in one particular association,
that predicted by theory. If we expect to find a linear association, but find instead a U-shaped
one, then our theory is not supported even though the two variables are associated with one
another. Thus, the object of explanatory analysis is to ascertain whether the independent variable
is associated with the dependent variable in the manner predicted by theory.
In some instances, the association between two variables is assessed with a multivariate
rather than a bivariate statistical technique. This situation arises when two or more variables are
needed to express the functional form of the association. For example, the correlation coefficient
estimates the linear association between two variables, but a nonlinear association requires a
different approach, such as the parabola specified by the terms X and X². Although two analytic
variables (X and X²) are used to operationalize the form of the association (parabola), these
variables pertain to one substantive theoretical variable (X). The two analytic variables are best
thought of as one 2-part variable that reflects the nonlinear form of the association with the
dependent variable. Thus, the analysis is bivariate even though a multivariate statistical
technique is used.
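As a brief illustration (simulated data; the coefficient values are arbitrary), the one substantive variable X enters the model through two analytic variables, X and X², fit jointly by least squares:

```python
import numpy as np

# Simulated data with a parabolic association between X and Y
# (hypothetical coefficients: intercept 1.0, linear 0.5, quadratic 2.0).
rng = np.random.default_rng(0)
x = rng.uniform(-3, 3, 200)
y = 1.0 + 0.5 * x + 2.0 * x**2 + rng.normal(0, 1, 200)

# Two analytic variables (X and X^2) operationalize one substantive variable.
design = np.column_stack([np.ones_like(x), x, x**2])
coefs, *_ = np.linalg.lstsq(design, y, rcond=None)
print(coefs)  # estimates close to [1.0, 0.5, 2.0]
```

The quadratic term is not a second independent variable in the theoretical sense; both columns of the design matrix represent the single variable X.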
Although this distinction may appear to be hair-splitting, it reduces confusion about the
focal relationship when more than one term is used to operationalize the independent variable.
For example, the categories of ethnicity might be converted into a set of dichotomous "dummy
variables" indicating whether the person is (1) African American, (2) Latino, (3) Asian
American, or (4) non-Latino White (or is in the excluded reference category of "Other"). A
theoretical model containing one independent variable, ethnicity, now appears to involve four
independent variables. The four "dummy variables," however, are in actuality one composite
variable with five categories. This type of hybrid variable requires a multivariate statistical
technique, such as regression, even though it represents a bivariate association.
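The dummy-variable coding can be sketched with pandas (a hypothetical sample; the category labels follow the text):

```python
import pandas as pd

# Hypothetical observations on a five-category ethnicity variable.
ethnicity = pd.Series(["African American", "Latino", "Asian American",
                       "non-Latino White", "Other", "Latino"])

# Four dummy variables represent the one five-category composite variable;
# "Other" is the excluded reference category.
dummies = pd.get_dummies(ethnicity).drop(columns="Other")
print(list(dummies.columns))
```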
The importance of bivariate analysis is sometimes overlooked because it is superseded by
multivariate analysis. This misperception is reinforced by scientific journals that report bivariate
associations only in passing, if at all. This practice creates the misleading impression that
analysis begins at the multivariate level. In reality, the multiple-variable model rests upon the
foundation laid by the thorough analysis of the 2-variable model. The proper specification of the
theoretical model at the bivariate level is essential to the quality of subsequent multivariate
analysis.
Some forms of bivariate analysis require that variables be differentiated into independent
or dependent types. For example, the analysis of group differences in means, either by t-test or
ANOVA, treats the group variable as independent, which means that the procedure is
asymmetrical: different values are obtained if the independent and dependent variables are
inverted. In contrast, the Pearson correlation coefficient, the most widely used measure of
bivariate association, yields identical values irrespective of which variable is treated as
dependent, meaning that it is symmetrical: the same coefficient and probability level are
obtained if the two variables are interchanged. Similarly, the chi-squared (χ²) test for
independence between nominal variables yields the same value irrespective of whether the
dependent variable appears in the rows or the columns of the contingency table. Although the
test of statistical significance is unchanged, switching variables yields different expressions of
the association because row and column percentages are not interchangeable. Unlike the
correlation coefficient, where both the statistic and test of statistical significance are symmetrical,
only the probability level is symmetrical in the χ² technique.
Designating one variable as independent and the other variable as dependent is productive
even when this differentiation is not required by the statistical method. The value of this
designation lies in setting the stage for subsequent multivariate analysis where this differentiation
is required by most statistical techniques.¹ This designation is helpful in the bivariate analysis of
the focal relationship because multivariate analysis ultimately seeks to determine whether the
bivariate association is indicative of a state of dependency between the two variables. This
approach makes more sense if the original association is conceptualized as a potential
relationship.
METHODS OF BIVARIATE ANALYSIS
Selection of a Method
There is a multitude of statistical techniques for the assessment of bivariate associations.
This profusion of techniques reflects a key consideration in the selection of a method of analysis:
the measurement properties of the independent and dependent variables. For example,
correlational techniques are the method of choice for analysis of two interval variables (when the
association is assumed to be linear), but are not suitable to the analysis of two categorical
variables. Given that there are numerous possible combinations of measurement types, there are
numerous analytic techniques.
A second contributor to this proliferation is sample size: some methods are applicable
only to large samples. Statistical techniques are also distinguished from one another on the basis
of assumptions about the distributional properties of the variables. For example, there are
different computational formulas for the simple t-test depending upon whether the variance of the
1
Some multivariate techniques do not require that one variable be treated
as dependent, for example, log-linear models, but this situation is an
exception.
dependent variable is assumed to be the same in the two groups being compared. In contrast,
nonparametric techniques make no distributional assumptions.
The sheer number of alternative methods can bewilder. The bulk of bivariate analysis in
the social sciences, however, is conducted with three techniques: contingency table analysis of
proportions, ANOVA assessment of mean differences between groups, and correlation
coefficients. As illustrated in Figure 1, a key consideration in the selection of a technique is the
level of measurement. The contingency table technique is used when both variables are nominal.
Means are analyzed when the independent variable is nominal and the dependent variable is interval
or ratio. Correlations are calculated when both variables are interval or ratio (and the association
is assumed to be linear).
These three methods do not exhaust the possible combinations of independent and
dependent variables, as indicated by the blank cells in Figure 1. Although there are alternative
methods of analysis for these combinations, many researchers adapt one of the three methods
shown in this figure. For instance, if the dependent variable is nominal and the independent
variable is measured at a higher level, the independent variable is often collapsed into categorical
form. This transformation permits the use of the familiar χ² test, but wastes valuable information
about the inherent ordering of the interval variable.
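The collapsing transformation can be sketched as follows (a hypothetical interval-level income variable; the cutpoints are arbitrary):

```python
import pandas as pd

# Hypothetical interval-level variable collapsed into categories so the
# familiar chi-squared test can be applied; the ordering information
# inherent in the interval measurements is discarded in the process.
income = pd.Series([12000, 23000, 38000, 45000, 61000, 87000])
income_cat = pd.cut(income, bins=[0, 25000, 50000, float("inf")],
                    labels=["low", "middle", "high"])
print(income_cat.tolist())
```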
Ordinal variables are a bit troublesome because they do not satisfy the assumptions of the
interval methods of analysis, but the use of a nominal method of analysis, the practical
alternative, is wasteful because it does not make use of the ordering information. Interval
methods are often used for ordinal variables that approximate interval variables, that is, quasi-
interval variables, but strictly speaking this practice is inappropriate. If the ordinal variable does
not approximate an interval variable, then it can be treated as a nominal variable in χ2 test,
although, once again, this practice wastes information.
Why not use a statistical technique intended specifically for the exact measurement
properties of the independent and dependent variables instead of relaxing assumptions or losing
power? Surely these techniques are more appropriate than the adaptations just described. What
accounts for the popularity of bending methods or measures to permit the use of conventional
methods of bivariate analysis? Quite simply, the conventional methods readily generalize into
familiar methods of multivariate analysis. Correlations and ANOVA form the foundation for
multiple linear regression. Similarly, logistic regression is based on the techniques used with
contingency tables. These three methods of bivariate analysis are used frequently, then, because
of their conceptual continuity with common forms of multivariate analysis.
The reader is referred to a standard statistic text for less frequently used methods of
assessing bivariate associations. These methods are omitted here so that excessive attention to
technique does not deflect attention from the logic of analysis. A discussion of the relative
merits of various types of correlation coefficients, for example, would unnecessarily divert
attention from the question of whether a linear model, assumed in correlational techniques, is
appropriate based on theory: not theory as taught in a statistics class, but the substantive theory
directing the research. According to the gospel of statistical theory, my dismissive treatment of
technically correct methods of bivariate analysis is heretical. In defense of this stance, I note
only that I am merely calling attention to a widespread practice in applied analysis.
In addition to level of measurement, the selection of a statistical procedure should also be
based upon the type of association one expects to find. In most applications, this issue focuses
on whether it is appropriate to use a linear model. In this context, linear means that there is a
constant rate of change in the dependent variable across all values of the independent variable.
This concept makes sense only where there are constant intervals on both variables, which means
that, strictly speaking, linearity is relevant only when both variables are measured at the interval
level.² In other applications, the issue is specifying which groups are expected to differ from
others. An example of this approach would be hypothesizing that the prevalence of depression is
greater among women than men, as distinct from asserting that gender and depression are
associated with one another (see below).
In practice, the expected functional form of an association is often overlooked in the
selection of an analytic technique. We tend to become preoccupied with finding a procedure that
fits the measurement characteristics of the independent and dependent variables. The validity of
the entire analysis, however, depends upon the selection of analytic techniques that match
theory-based expectations about the form of the association. Unfortunately, theory is often mute
on this topic. Nevertheless, it is incumbent upon the analyst to translate theory into the
appropriate analytic model.
Methods: Proportions, Means and Correlations
In this section, the three most common methods of bivariate analysis are summarized
briefly: χ² tests of proportions, ANOVA for mean differences, and correlation coefficients for
linear associations. As noted above, the selection of a method is likely to be driven by the
measurement characteristics of the independent and dependent variables. Contingency tables are
appropriate for two nominal variables; tests of mean differences are used for an interval outcome
and nominal independent variable; correlations are employed for linear associations between
interval variables (see Figure 1). If both variables are interval, but the association is expected to
be non-linear, then the correlational technique needs to be adapted to the expected form of the
association using a multiple regression format.
Each of these forms of analysis can be performed using any major statistical software
program. The emphasis of this presentation, therefore, is not on computations, but on
interpretation. It is useful, however, to review the fundamentals of these methods of analysis to
understand their proper use and interpretation.
2
There are a few common exceptions. For example, correlation
coefficients are often calculated for ordinal variables that are quasi-
interval. Also, dichotomous dependent variables are often treated as
interval because there is only one interval.
It should be noted that the techniques discussed here are but a few of the many options
available for bivariate analysis. Characteristics of one's data may make other approaches much
more appropriate. The three techniques described here are highlighted because of their
widespread use and because they form the basis for the most commonly used types of
multivariate analysis. The anticipation of multivariate analysis makes it logical to conduct
bivariate analysis that is consistent with the multivariate model.
Proportions: Contingency Tables. Two properties of the χ² analysis of a contingency
table make it an especially appealing form of bivariate analysis. First, it is based on the lowest
form of measurement, two nominal variables. The absence of level-of-measurement restrictions
means that the technique may also be used with ordinal, interval or ratio data. Second, this
technique does not require assumptions about the nature of the association, in particular, it does
not assume a linear association. It is used to determine whether any association is present in the
data, without specifying in advance the expected form of this association. This flexibility is the
method's most appealing characteristic.
These characteristics, however, also establish the limitations of the method. Using χ²
analysis with higher-order variables means that some data is transformed into a lower form of
measurement, converted to categorical form. This transformation leads to a loss of information
and a concomitant loss of statistical power. Although other statistics for contingency table
analysis take into consideration the ordinal quality of variables (e.g., Somers's D), these
techniques are not as widely used as the simple yet less powerful χ².
Furthermore, the χ² test only tells you that some association seems to be present, without
regard to its theoretical relevance. The conclusion that an association is present is not nearly as
meaningful, compelling, or satisfying as the conclusion that the expected association is present.
The χ² test does not yield this information, although it is possible to adapt the method to this end.
The χ² test for independence is used to determine whether there is an association between
two categorical variables. If the two variables are unrelated, then the distribution of one variable
should be the same regardless of the value of the other variable. If instead the two variables are
associated, then the distribution of the dependent variable should differ across the values of the
independent variable. The χ² test for independence does not distinguish between independent
and dependent variables. Treating one variable as independent is optional and does not alter the
value of the test statistic.
The dependency between the variables could be stated in the reverse direction: the
distribution of the independent variable differs across categories of the dependent variable.
Although immaterial to the calculation of χ², this formulation is backwards in terms of the logic
of cause and effect. It treats the dependent variable as fixed and assesses variation in the
independent variable across these fixed values. However, a variable that depends upon another
does not have fixed values; its values vary according to the influence of the independent variable.
For example, the association between gender and depression is best stated as differences in the
probability of being depressed between men and women, not whether the probability of being a
woman differs between depressed and not depressed persons. The proper formulation of the
association, then, is to examine variation in the distribution of the outcome variable across
categories of its presumed cause.³
The χ² test for independence is illustrated in Figure 2. In this contingency table, the
independent variable X appears in the rows (1...i) and the dependent variable Y appears in the
columns (1...j). The analytic question is whether the distribution of Y varies across the categories
of X.
3
Whether this distribution is calculated as row or column percentages is
immaterial.
The overall distribution of Y is given at the bottom of the table. For example, the
proportion in column 1 is p.1 = N.1/N; the proportion in column 2 is p.2 = N.2/N; and so on until p.j =
N.j/N. This proportional distribution should be duplicated within each row of the table if Y is
indeed independent of X. In other words, the distribution of subjects within row 1 should
resemble the distribution of subjects within row 2, and so on through row i, the last row. This
similarity means that the proportion of subjects in column j should be similar across all
categories of X (1, 2, ... i), and similar to the overall proportion of subjects in column j (p.j or
N.j/N). This equivalency should be manifest for all values of Y across all values of X.
The null hypothesis for the χ² test for independence essentially states that identical
conditional probabilities are expected under the condition that Y does not depend upon X:
H0: p11 = p21 = . . . = pi1 = p.1 = N.1/N (1)
p12 = p22 = . . . = pi2 = p.2 = N.2/N
...
p1j = p2j = . . . = pij = p.j = N.j/N
The null hypothesis is evaluated with the χ² test statistic. Its definitional formula and degrees of
freedom appear in Figure 2. A large χ² value relative to its degrees of freedom leads to rejection
of the null hypothesis. The null hypothesis is rejected if any pij ≠ N.j/N, that is, if any proportion
within the table deviates substantially from the marginal distributions of X and Y. This result
means that the observed covariation between X and Y is unlikely to occur if in reality X and Y are
independent.
The key to understanding this procedure is the calculation of expected values for the cells
comprising the contingency table. These values are calculated assuming that column
distributions are independent of row distributions (and vice versa). If so, then the overall
marginal proportions for the rows and columns should be replicated within the body of the table.
The value expected under independence is compared to the observed value for each cell.
If the assumption of independence is valid, there should be only a small difference between
expected and observed values. If instead there is a large difference for one or more cells, the χ²
statistic will be large relative to its degrees of freedom and the null hypothesis will be rejected.
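The expected-value calculation can be illustrated with a small hypothetical table; scipy computes the expected cell counts from the marginals, so that expected = (row total × column total) / N for each cell:

```python
import numpy as np
from scipy.stats import chi2_contingency

# Hypothetical 2x3 contingency table of observed cell frequencies.
observed = np.array([[30, 20, 10],
                     [20, 30, 40]])

chi2, p, dof, expected = chi2_contingency(observed, correction=False)

# Expected counts under independence: (row total * column total) / N.
print(np.round(expected, 1))
print(round(chi2, 2), dof)  # a large chi-squared on df = (2-1)(3-1) = 2
```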
The chief limitation of this procedure is the nonspecific nature of the null hypothesis.
When we reject it, we know that one or more cell frequencies differs markedly from the
expected, but do not know which cells are deviant. We know that an association exists, but do
not know the form of the association. It might be the one predicted by theory, but then again it
might be some other form of association. This limitation can be remedied if the expected
relationship is more precisely specified prior to analysis.
To continue the example given earlier in this section, we might hypothesize that females
are more likely to be depressed than males. This hypothesis is considerably more precise than
the hypothesis that gender and depression are associated with one another. It could be
operationalized as an odds ratio greater than 1.00 for depression for females relative to males.
The odds ratio expresses the association between variables in a multiplicative form, meaning that
a value of 1.00 is equivalent to independence. The odds ratio for the data reported in Table 1 is
1.80. This is the exact value for the sample (N = 1,393). If we wish to extend this finding to the
population, we could calculate a confidence interval for the odds ratio and determine whether it
includes the value of 1.⁴ The 95% confidence interval for this example is 1.25 to 2.60. The
lower boundary is greater than 1, meaning that the odds of being depressed are significantly
higher for women than men. Note that this conclusion is more precise than the conclusion made
with χ², namely that depression is not independent of gender.
Note also that we are interested only in the possibility that females are at greater risk. A
greater risk among males would disconfirm our theory. However, a male excess of depression
would lead to rejection of the null hypothesis in the χ² procedure. Our more specific null
hypothesis is rejected only if the entire confidence interval lies above 1; we fail to reject if the
confidence interval includes 1 or lies below 1. When only an association is
hypothesized, we fail to reject only when the confidence interval includes 1, not when it lies
entirely below 1. By specifying the nature of the association we have increased the power of the
test.
4
Agresti (1990:54-6) gives the following formula for the 100(1-α) percent
confidence interval for log θ for large samples:
log θ ± z_α/2 σ(log θ)
where
θ = n11n22/n12n21
is the sample value of the odds ratio, and
σ(log θ) = [(1/n11) + (1/n22) + (1/n12) + (1/n21)]^1/2
The confidence interval for θ is obtained by exponentiating (taking the
antilog of) the endpoints of this interval. (Agresti, Alan. 1990.
Categorical Data Analysis. New York: John Wiley & Sons.)
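Agresti's large-sample interval can be computed directly (the 2×2 cell counts below are hypothetical, not those of Table 1):

```python
import math

# Hypothetical 2x2 table: rows are gender, columns are depressed / not.
n11, n12 = 120, 480   # female: depressed, not depressed
n21, n22 = 70, 530    # male:   depressed, not depressed

odds_ratio = (n11 * n22) / (n12 * n21)

# 95% CI on the log scale (Agresti 1990), then exponentiate the endpoints.
se_log = math.sqrt(1/n11 + 1/n12 + 1/n21 + 1/n22)
z = 1.96
lower = math.exp(math.log(odds_ratio) - z * se_log)
upper = math.exp(math.log(odds_ratio) + z * se_log)
print(round(odds_ratio, 2), round(lower, 2), round(upper, 2))
```

Because the lower bound exceeds 1, the hypothetical odds of depression are significantly higher for women than for men.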
The χ² test is useful for inferential purposes, but it is not informative about the magnitude
of the association. Quantifying the association requires an additional calculation. Although there
are numerous options from which to select, the odds ratio described earlier has gained
considerable popularity in recent years because of its straightforward interpretation. Somers's d
is useful as well because of its common use in the logistic regression model, which is a
multivariate generalization of the χ² procedure.
For an example of crosstabular analysis, we can turn to a study estimating the extent to
which midlife and older parents experience the divorce of their sons and daughters (Spitze,
Logan, Deane, and Zerger 1994).⁵ The overall prevalence of this event was influenced by time
insofar as the oldest adults have been at risk for the longest time, whereas their younger
counterparts have had less time to have adult children marry and divorce. The bivariate analysis
shown in Table 2 takes into account the age of the parent. Although the experience of having an
adult child is normative in all of these age groups, barely half of the youngest adults have
experienced this life course transition. At older ages, about four in five have done so. The
researchers attribute the slight dropoff in the oldest age group to the lower fertility of those who
were of childbearing age prior to the baby boom period.

TABLE 2. Percent Distribution of Marital History of Children by Age of the Parent

                                                     Age of the Parent
Percentage of all respondents who have had:    40-49    50-59    60-69     70+

An adult child                                  55.1     87.6     88.6    72.2
An ever-married child                           19.5     70.5     83.1    74.4
An ever-divorced child                           3.0     24.0     40.6    35.8
A currently divorced/separated child             4.4     19.6     27.6    24.4
A remarried child                                1.1      9.5     22.1    21.1

N                                                365      275      308     246

Source: Spitze et al. (1994).

5
Spitze, Glenna, Logan, John R., Deane, Glenn, and Zerger, Suzanne.
(1994). Adult children's divorce and intergenerational relationships.
Journal of Marriage and the Family, 56:279-293.
These age trends are even more evident for having a married child. This event is
relatively rare for the youngest group, but quite common among persons over the age of 50 years.
Age is strongly associated with having children who have divorced, including those who are
currently divorced, and with having children who have remarried.
No tests of statistical significance are presented for the data in Table 2 in the original
report. The differences between the youngest cohort and the other cohorts are quite large. Given
the large sample size, these differences are clearly statistically significant.
If χ² tests were provided, however, there would be five such tests, one for each dependent
variable. At first glance, it may not be obvious that there are five dependent variables. This
confusion arises because the table reports percentages rather than cell frequencies (compare with
Figure 2). (Cell frequencies can be extracted from Table 2 by multiplying the percentage by the
sample size, and dividing by 100.) Only the percent "yes" is given in Table 2 because the percent
"no" is implicit given that each of these variables is a dichotomy. Thus, Table 2 represents the
cross-tabulation of age (in 4 categories) by each of the following dependent variables: adult
child/no; ever-married child/no; ever-divorced child/no; currently separated-divorced child/no;
remarried child/no.
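The extraction of cell frequencies described in the parenthetical note can be sketched for one of the five dependent variables, the ever-divorced-child dichotomy:

```python
# Percent "yes" for "An ever-divorced child" by age group, and group Ns,
# taken from Table 2 (Spitze et al. 1994).
percents = [3.0, 24.0, 40.6, 35.8]   # ages 40-49, 50-59, 60-69, 70+
ns = [365, 275, 308, 246]

# Recover approximate cell frequencies: yes = percent * N / 100.
table = []
for pct, n in zip(percents, ns):
    yes = round(pct * n / 100)
    table.append([yes, n - yes])
print(table)  # rows of [yes, no] counts for the 4 x 2 cross-tabulation
```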
The question of linearity is relevant to this example because the independent variable,
age, is quasi-interval and the dependent variables are all dichotomies. Two distinct nonlinear
patterns are evident in these data. The first is an inverted U-shaped curve: a sharp increase,
especially between the forties and fifties, followed by a decline among those who are seventy or
older. As noted previously, the researchers attribute the decline to a pre-baby boom cohort
effect. The first three dependent variables follow this pattern. The increase at earlier ages
reflects the combined impact of life course considerations, such as the duration of risk for the
event. These factors appear to be the primary consideration for the second pattern, an initial
increase followed by a plateau, which describes the last two entries in the table.
Means: Analysis of Variance. The ANOVA procedure is similar in many respects to the
χ² test for independence. In both techniques, the independent variable needs to be measured only
at the nominal level. Also, the null hypothesis is structurally similar in the two procedures. In
the case of χ², we test whether proportions are constant across categories of the independent
variable and, therefore, equal to the overall marginal proportion. For ANOVA, we test whether
means are constant across categories of the independent variable and, thus, equal to the grand
mean. In both techniques, the null hypothesis is nonspecific: rejecting it is not informative about
which categories of the independent variable differ.
The main practical difference between methods is that ANOVA requires an interval level
of measurement for the dependent variable.⁶ This restriction means that the ANOVA technique
is not as widely applicable as the χ² test, for which the dependent variable can be nominal. The
limitation of ANOVA to interval dependent variables, however, is a trade-off for greater
statistical power.
6
This assumption is violated when ANOVA is used for ordinal variables
that approximate the interval level of measurement, but this procedure
is, strictly speaking, incorrect.
The χ² test could be substituted for ANOVA by transforming an interval dependent
variable into categorical form. This approach is undesirable because valuable information is lost
when a range of values is collapsed into a single category. Additional information is lost because
the ordering of values is immaterial to χ². The attendant decrease in statistical power makes χ²
an unattractive alternative to ANOVA.
The ANOVA procedure is concerned with both central tendency and spread. The
measure of central tendency is the mean. It is calculated for the dependent variable, both overall
and within groups defined by the categories of the independent variable. Specifically, the null
hypothesis is that the within-group mean is equal across groups and, therefore, equal to the grand
mean:
H0: µ1 = µ2 = ... = µj = µ (2)
In this equation, µ is the grand mean, and j is the number of groups, which is the number of
categories on the independent variable. The null hypothesis is rejected if any µj ≠ µ, if any group mean
differs from the grand mean.
The critical issue in ANOVA is not the absolute difference in means, however, but the
difference in means relative to spread or the variance of the distribution.⁷ This feature is
illustrated in Figure 3 for several hypothetical distributions. The first panel (a) displays large
differences in means relative to the within-group variation, a pattern that yields a large value for
F. This pattern would lead to a rejection of the null hypothesis.
The second panel (b) displays the same absolute mean differences as the top panel, but
substantially larger within-group variation. Although the means differ from one another, there is
a large amount of overlap among the distributions. The overlap is so extensive that the
distribution with the lowest mean overlaps with the distribution with the highest mean. This
pattern would produce a low F value, leading to failure to reject the null hypothesis. This
conclusion is reached even though the absolute mean differences are the same as in panel a,
which led to rejection of the null hypothesis. The difference in conclusions for the two sets of
distributions arises from their spreads: the mean difference is large relative to the variance of the
distribution in panel a, but relatively small in panel b.
The third panel (c) shows distributions with the same variances as the second panel (b),
but with substantially larger mean differences. As in the previous case, the variances are large.
In this instance, however, the mean differences between groups are also large. Similarly, in the
last panel (d) the variances are small, but so are the mean differences between groups. We would
fail to reject the null hypothesis despite the small variances because these variances are large
relative to the mean differences. As these examples illustrate, it is not the absolute value of the
mean differences that is crucial, but the mean difference relative to the variance.
7
ANOVA was originally developed in terms of variances: the hypothesis
that all of the group means are equal is equivalent to the hypothesis
that the variance of the means is zero (Darlington 1974).
ANOVA is based on the decomposition of variation in the dependent variable into
within- and between-group components, with the groups being categories of the independent
variable. The calculations are based on the sum of squares, that is, deviation of observations
from the mean, as shown in Figure 4.8 The test statistic for ANOVA is F, which is a ratio of
variances. If differences between the means are due to sampling error, then the F ratio should be
around 1.00. Large values of F (relative to its degrees of freedom) would lead to rejection of the
null hypothesis.
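The decomposition can be sketched as follows. The data are invented for illustration, and the hand computation is checked against scipy.stats.f_oneway:

```python
# Invented data: decompose total variation into between- and within-group
# sums of squares, then form F as the ratio of the resulting mean squares.
import numpy as np
from scipy.stats import f_oneway

groups = [np.array([4.0, 5.0, 6.0]),
          np.array([7.0, 8.0, 9.0]),
          np.array([1.0, 2.0, 3.0])]

allvals = np.concatenate(groups)
grand_mean = allvals.mean()

ss_between = sum(len(g) * (g.mean() - grand_mean) ** 2 for g in groups)
ss_within = sum(((g - g.mean()) ** 2).sum() for g in groups)

df_between = len(groups) - 1             # j - 1
df_within = len(allvals) - len(groups)   # N - j

F = (ss_between / df_between) / (ss_within / df_within)
print(f"F = {F:.2f}")                    # agrees with scipy.stats.f_oneway
```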
Although ANOVA may be used when there are only 2 categories on the independent
variable, it is customary to use a t test in this situation. The definitional formula for t is:9
t = (M1 − M2) / √(s1²/n1 + s2²/n2) (3)
The t and F tests are equivalent when there are only two groups: t² = F. Thus, the choice of one
method over the other is immaterial.
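A quick check of this equivalence, with invented data (the equal-variance form of the t test is used here, since the t² = F identity holds exactly for that form; the unequal-variance form in footnote 9 differs slightly):

```python
# Invented two-group data: the two-sample t test and one-way ANOVA agree.
import numpy as np
from scipy.stats import ttest_ind, f_oneway

men = np.array([10.0, 9.0, 11.0, 12.0, 8.0])
women = np.array([13.0, 14.0, 12.0, 15.0, 11.0])

t, p_t = ttest_ind(men, women)     # pooled (equal-variance) t test
F, p_F = f_oneway(men, women)      # one-way ANOVA on the same two groups

print(f"t^2 = {t**2:.4f}, F = {F:.4f}")   # identical values
```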
ANOVA is not informative about which means are unequal: it tests only whether all
means are equal to one another and to the grand mean. Rejection of the null hypothesis signifies
that the dependent variable is probably not uniformly distributed across all values of the
independent variable, but does not reveal which cells deviate from expectations. Thus, an
association appears to be present, but it is not known whether it is the one forecast by theory. As
was the case with the χ² procedure, additional steps are required to test more specific hypotheses
about the precise nature of the association. This may be done with a priori contrasts or with a post
hoc test (e.g., Scheffé). Specifying contrasts in advance (on the basis of theory) is preferable to
examining all possible contrasts after the fact, because the latter penalizes you for making
multiple contrasts in order to reduce the risk of capitalizing on chance.
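As a sketch of this post hoc step: SciPy (1.8+) ships Tukey's HSD rather than the Scheffé test, but the workflow is the same, an overall F test followed by pairwise comparisons. The data below are invented.

```python
# Invented data: overall ANOVA, then Tukey's HSD to locate which pairs
# of group means differ (SciPy provides tukey_hsd, not Scheffe's test).
import numpy as np
from scipy.stats import f_oneway, tukey_hsd

young = np.array([15.0, 16.0, 14.0, 17.0, 13.0])
middle = np.array([11.0, 10.0, 12.0, 11.0, 10.0])
older = np.array([9.0, 10.0, 8.0, 9.0, 11.0])

F, p = f_oneway(young, middle, older)    # omnibus test: are all means equal?
res = tukey_hsd(young, middle, older)    # pairwise follow-up comparisons

print(f"overall: F = {F:.1f}, p = {p:.2g}")
print(res.pvalue)                         # matrix of pairwise p values
```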
Depending upon one's hypothesis, it also may be desirable to test for a trend. Trend
analysis is most relevant when the independent variable is interval and linearity is at issue. In
this case, it may be desirable to partition the between-groups sum of squares into linear,
quadratic, cubic, or higher-order trends. However, visual inspection of the data is usually the
most informative approach for understanding the shape of the association. Simply plotting mean
values is often more instructive than the results of sophisticated statistical tests.
8
There are numerous alternative specifications of the ANOVA model
depending upon whether fixed or random effects are modeled, whether
there are equal or unequal cell frequencies, etc. The user should
consult a text on ANOVA for a full discussion of these technical
concerns.
9
This formula for t assumes unequal variances between groups; a slightly
different formula is used if one can assume equal variances.
Several core aspects of ANOVA are illustrated in Table 3, which shows group variation
in levels of depressive symptoms. These data are from the survey of Toronto adults introduced
earlier in this paper (Turner & Marino 1994; see Table 1). In addition to major depressive
disorder, this study also assessed the occurrence of depressive symptoms during the previous
week. This assessment was made with the Center for Epidemiologic Studies-Depression (CES-
D) Scale, which is the summated total of 20 symptoms, each rated from (0) "0 days" through (3)
"5-7 days."
The average symptom level varies significantly by gender, age, and marital status. The
nature of this difference is clear for gender insofar as there are only two groups: the average is
lower for men than women. For age and marital status, however, the differences are less clear
because more than two groups are being compared.

Table 3. Depression by Select Characteristics

Characteristic         CES-D (Mean)a    N       MDD (%)b
Gender
  Male                 10.21***         603     7.7***
  Female               13.10            788     12.9
Age
  18-25                15.14***         304     18.4***
  26-35                10.92            470     9.8
  36-45                11.09            393     7.2
  46-55                9.15             224     4.7
Marital Status
  Married              9.98***          673     6.6***
  Previously Married   14.22            171     11.5
  Never Married        13.70            547     15.8
Total                  11.79            1,391   10.6

Source: Turner and Marino (1994); Table 1.
a Depressive symptoms; Center for Epidemiologic Studies-Depression Scale.
b Major Depressive Disorder; Composite International Diagnostic Interview.
*** p < .001

Symptoms are most common among the youngest adults and thereafter decline with age
(at least through age 55). It is tempting to conclude that the youngest and oldest groups differ,
given that these scores are the most extreme. These large differences, however, may be offset by
exceedingly large variances. As a result, we are limited to the nonspecific conclusion that
depressive symptoms vary with age.
The two unmarried groups have similar levels of symptoms compared to the markedly
different scores of the currently married. In the absence of specific tests for pairs of means,
however, we can only conclude that at least one of these means differs from the grand mean.
The far right column of Table 3 presents prevalence estimates for major depressive
disorder. These data are shown here to emphasize the similarity between the analysis of means
(ANOVA) and the analysis of proportions (χ2). The prevalence of major depression differs
significantly by gender, age, and marital status. The nature of the gender difference is again
clear, given that there are only two groups. However, the nature of the age and marital status
associations is not specified for particular subgroups. Thus, we are limited to the conclusion that
depression is associated with age and cannot conclude that depression declines with age, even
though the prevalence of depression in the youngest age group is almost twice that of any other
age group. Similarly, although it is tempting to conclude that the married are substantially less
likely to be depressed than the previously married or the never married, this information is not
given by the overall test, meaning that we can only conclude that the three groups do not have the
same rate of depression.
Finally, a comment on linearity. Figure 5 graphs the association between depression and
age from the data in Table 3. In this example, age has been collapsed into four categories,
meaning that it is ordinal rather than interval. In reality, however, age is an interval variable,
making it reasonable to ask whether its association with depression is linear. The problem in
treating age as a quasi-interval variable is the first interval, ages 18-25, which is shorter (7 years)
than the other intervals (10 years). The problem of unequal age intervals can be circumvented,
however, by assigning each interval a value equal to its midpoint.
The observed age trend for depressive symptoms [Figure 5(a)] is distinctly nonlinear.
Symptoms are most common among the youngest age group and least common in the oldest age
group, but do not follow a pattern of steady decline between these extremes. Instead, there is a
plateau in average symptom levels between the two middle age groups.
The observed age trend for rates of major depressive disorder [Figure 5(b)] also is
distinctly nonlinear, although the pattern differs somewhat from the pattern for average symptom
levels. Like symptoms, disorder is most common for the youngest adults and least common for
the oldest adults. Unlike symptoms, however, the decline with age is apparent across the two
middle age groups. Despite the continuity of decline, the trend is nonlinear because the decline
during the youngest period is noticeably steeper than thereafter.
In sum, although ANOVA does not assume linear associations, it is possible to ascertain
whether this is the case when both variables are interval (or quasi-interval). The same is true for
the analysis of proportions using the χ2 test. We turn now to the correlation coefficient, which
assumes the linear form.
Correlations: Linear Associations. Although there are several correlation coefficients,
Pearson's r is by far the most widely used. This coefficient is used when both the independent
and dependent variables are measured at the interval level.10 From the perspective of
operationalizing a theory-based relationship, the most important aspect of this technique is the
assumption that the association between the independent and dependent variables is linear.11 It is
conventional to recommend inspection of a scatterplot to ensure that there are no gross
departures from linearity. This approach is illustrated in Figure 6 for both linear and nonlinear
associations.
Although the shape of an association is usually clear in textbook illustrations such as this
one, it is more difficult to visualize associations from scatterplots in practice. The difficulty
arises because large sample sizes generate too many data points, many of which overlap.
Computer-generated scatterplots use symbols such as letters to signify the number of
observations at a particular location, but it is difficult to mentally weigh the points in a plot
according to these symbolic tallies. It is sometimes useful to plot a small random subsample of
one's data to circumvent this problem.
Another tactic for detecting nonlinearity is to collapse the independent variable and
examine the distribution of means as one would in ANOVA. This technique is not as
informative as the scatterplot, given that many distinct values are collapsed into categories and
means, but it is helpful in detecting departures from linearity, especially in combination with a
scatterplot. Yet another strategy entails collapsing both variables into categorical form and
examining their cross-tabulation. This procedure sacrifices even more information than the
previous approach, but it may be helpful, especially if extreme scores are of special interest.
The correlation coefficient r describes the association between two variables as the
straight line that minimizes the squared deviations between observed (Y) and estimated (Ŷ)
values of the dependent variable, as illustrated in Figure 7. This feature gives the method its
name, the "least-squares method." The value of r measures the association between X and Y in
terms of how closely the data points cluster around the least-squares line. The absolute value of
r is large when the data points hover close to the least-squares line; when observations are more
widely dispersed around this line, the absolute value of r is close to zero. The values of the
correlation coefficient range from +1, which indicates perfect correspondence, through 0, which
signifies a complete lack of correspondence, to -1, which connotes perfect inverse (i.e., negative)
association (see Figure 6).
10
As we have seen repeatedly, however, ordinal variables that approximate
the interval level of measurement are often used in practice for
statistical techniques that require an interval level of measurement.
11
There are other requirements as well, including normal distributions and
homoscedasticity. The reader is referred to a text on multiple
regression for a thorough consideration of the correlational model and
its assumptions.
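These three reference values can be illustrated with small invented examples:

```python
# Invented data: Pearson's r for perfect positive, absent, and perfect
# negative linear association.
import numpy as np
from scipy.stats import pearsonr

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])

r_pos, _ = pearsonr(x, 2 * x + 1)                               # exact line, positive slope
r_zero, _ = pearsonr(x, np.array([1.0, -1.0, 1.0, -1.0, 1.0]))  # no linear association
r_neg, _ = pearsonr(x, -3 * x + 10)                             # exact line, negative slope

print(r_pos, r_zero, r_neg)
```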
The null hypothesis is once again that Y is independent of X, specifically, H0: ρ = 0, where ρ
denotes the population correlation. The test statistic for generalization from the sample to the
population is t, computed as shown in Figure 7. Although a two-tailed test may be used, the
direction of the association is usually
theoretically important, which makes the use of a one-tailed test appropriate.
It is important to note that this technique assumes that the association between X and Y is
linear in form. If there is a nonlinear association, the value of r will be seriously misleading.
This problem is illustrated in Figure 8. In this example, the data are better represented by a
parabola than by a straight line. Correlational techniques for determining whether specific
nonlinear trends are present entail a multivariate model. Here it suffices to note that these
techniques are isomorphic to those described above for trends in ANOVA.
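A minimal sketch of the problem in Figure 8, with invented data: an exact parabolic dependence yields an r near zero, badly understating the strength of the association.

```python
# Invented data: Y is a deterministic function of X (a parabola), yet
# Pearson's r is essentially zero because the relation is not linear.
import numpy as np
from scipy.stats import pearsonr

x = np.linspace(-3, 3, 61)
y = x ** 2                      # perfect U-shaped (parabolic) dependence

r, p = pearsonr(x, y)
print(f"r = {r:.3f}")           # near 0 despite exact dependence
```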
The slope of the least-squares line is of interest because it quantifies the magnitude of the
association between the independent and dependent variables. Specifically, the slope is the
change in Y produced by a one-unit increase in X. A steep slope (in either direction) indicates a
strong relationship, whereas the slope for a weak relationship is almost horizontal. The slope is not
given by r, a common misconception, but can be derived from r and information about the
distributions of X and Y. Another indicator of the strength of the association is r2, which is the
proportion of the variance in the dependent variable that is accounted for by the independent
variable.
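Both points can be sketched with invented data: the least-squares slope equals r multiplied by the ratio of the two standard deviations, and r² gives the proportion of variance accounted for.

```python
# Invented data: recover the least-squares slope from r and the SDs of
# X and Y, and compare it with a direct least-squares fit.
import numpy as np
from scipy.stats import linregress, pearsonr

rng = np.random.default_rng(1)
x = rng.normal(0.0, 2.0, size=200)
y = 1.5 * x + rng.normal(0.0, 1.0, size=200)   # true slope 1.5 plus noise

r, _ = pearsonr(x, y)
b = r * (y.std(ddof=1) / x.std(ddof=1))        # slope from r and the SDs

fit = linregress(x, y)                         # direct least-squares fit
print(f"b = {b:.3f}, slope = {fit.slope:.3f}, r^2 = {r**2:.3f}")
```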
For example, two measures of depression strongly covary, specifically, the correlation
between the Child Depression Inventory (CDI)12 and the Stony Brook (SB) Child Psychiatric
Checklist13 measure of depression is quite strong (r = .59; p < .001), but well below the perfect
correspondence (r = 1.00) that would be expected if both measures were perfectly reliable and
valid. The r2 value is .35, meaning that about a third of the variance in the Stony Brook is shared
with the CDI (and vice versa). Although this correspondence is strong, most of the variance in
the two measures is not shared in common. The significance test for the correlation between the
CDI and the SB (p < .001) indicates that it is extremely unlikely that this correlation would have
been observed if in truth the two variables are not correlated with one another.
12
Kovacs, M. & Beck, A.T. (1977). An empirical-clinical approach toward a
definition of childhood depression. In J.G. Schulterbrandt & A. Raskin
(Eds.), Depression in Childhood: Diagnosis, Treatment, and Conceptual
Models. NY: Raven, pp. 1-25.
13
Gadow, K.D. and Sprafkin, J. (1987). Stony Brook Child Psychiatric
Checklist-3R. Stony Brook, New York. Unpublished manuscript.
The correlation coefficient is an appropriate indicator of the association between the CDI
and the SB for two reasons: (1) both variables are quasi-interval, and (2) the association can be
assumed to be linear. The latter point is particularly important, given that the functional form of
associations is often overlooked. The two variables are similar measures of the same construct,
which means that an increase in one measure should be matched by an increase in the other
measure. Moreover, this correspondence should be evident across the full span of values for both
variables. There is no reason to anticipate, for example, a plateau, or a threshold effect. Thus,
the correlation coefficient is an appropriate choice.
The correlation between the CDI and the SB is considerably stronger than their
correlations with measures of other constructs. This pattern is expected, given that two measures
of the same construct should be more highly correlated than measures of different constructs.
However, adolescent depression was also assessed with two other measures, SB ratings
made by the mother and by the father. These measures correlate with the CDI (.29 and .22,
respectively), and with the adolescent's self assessment on the SB (.28 and .21, respectively), but
far below the correlation between the two adolescent measures (.59). Although these correlations
are all statistically significant, the parental measures account for no more than 8.4 percent of the
variance in the adolescent self-reports. Thus, as concluded earlier, parental ratings are not
especially good measures of adolescent mood, despite the fact that such ratings have been
standard practice in psychiatric epidemiology.
In sum, the few correlations reviewed here demonstrate the importance of considering
both the statistical significance of an association and its substantive importance.
SUMMARY
Although the specifics of bivariate analysis are unique to each statistical technique, there
is a functional similarity across these methods. As noted above, the null hypothesis for bivariate
analysis states that the values on the two variables are independent of one another. The usual
goal is to reject this hypothesis, to conclude that the variables are not independent of one another.
In practice, this means that knowing the values on one variable is informative about the likely
values on the second variable.