Chapter 4: Problem 3
1. For problem three in chapter four, a teacher wants to display
her students' number of responses for each day of the week,
and she wants to do that with a bar chart. Since she hasn't
taken a stats class, she comes to you for help. You first enter
her data into SPSS, and the results look like this-- When you
look at your data set, you'll see that it actually has the wrong
level of measurement. Notice that there's a little Venn diagram
at the top of each column, which indicates that your data have
been entered as nominal. That would be correct if you were
noting which day of the week a student participated, but since
you're noting how often a given student participated, the
correct level of measurement is scale. Go ahead and change
that. Watch how I do that. Under Variable View, under
Measure, you just want to click each one and turn it into
Scale. You can also cut and paste these, and I can show you
that in another video. Once you have them changed, go back to
Data View, and you'll see that the icon at the top has changed
into two little rulers. The next question is, how do I get SPSS
to display the average score per day rather than the total
number of individual scores, which might look like a mess--
and that's why this question is a toughie. To do that, we go
under Graphs, and you'll see that you have two options: you
can use the Chart Builder or the Legacy Dialogs. For this
question we want to use the Legacy Dialogs. We go to Bar,
and when we click that, there are two questions-- one, what
type of bar chart? We want a simple one. And two, how do
you want the data in the chart displayed? Do we want
summaries for groups of cases? We really don't. We want
summaries of separate variables, where each day of the week
is a variable. We click Define, and here you'll see every day
of the week. You want to bring those over, and you'll see that
your bars are going to represent the mean for every day of the
week. As a good habit, you want to make sure you title it; I
called it "Students' Engagement During Group Discussion,"
and the second title line is "by Day of Week." We hit
Continue, and then when we hit OK, you'll see your output
pop up. And here is our bar chart-- every day of the week
showing the average student engagement. And this is how you
answer problem 3 in chapter 4. Good luck.
2. Identify whether these distributions are negatively skewed,
positively skewed, or not skewed at all and explain why you
describe them that way.
a. This talented group of athletes scored very high on the
vertical jump task.
b. On this incredibly crummy test, everyone received the same
score.
c. On the most difficult spelling test of the year, the third
graders wept as the scores were delivered and then their parents
complained.
3. Use the data available as Chapter 4 Data Set 3 on pie
preference to create a pie chart ☺ using SPSS.
4. For each of the following, indicate whether you would use a
pie, line, or bar chart and why.
a. The proportion of freshmen, sophomores, juniors, and seniors
in a particular university
b. Change in temperature over a 24-hour period
c. Number of applicants for four different jobs
d. Percentage of test takers who passed
e. Number of people in each of 10 categories
5. Provide an example of when you might use each of the
following types of charts. For example, you would use a pie
chart to show the proportion of children who receive a reduced-
price lunch that are in Grades 1 through 6. When you are done,
draw the fictitious chart by hand.
a. Line
b. Bar
c. Scatter/dot (extra credit)
6. Go to the library or online and find a journal article in your
area of interest that contains empirical data but does not contain
any visual representation of them. Use the data to create a chart.
Be sure to specify what type of chart you are creating and why
you chose the one you did. You can create the chart manually or
using SPSS or Excel.
7. Create the worst-looking chart that you can, crowded with
chart and font junk. Nothing makes as lasting an impression as
a bad example.
8. And, finally, what is the purpose of a chart or graph?
4 CREATING GRAPHS: A PICTURE REALLY IS WORTH A
THOUSAND WORDS
4: MEDIA LIBRARY
Premium Videos
Core Concepts in Stats Video
· Examining Data: Tables and Figures
Lightboard Lecture Video
· Creating a Simple Chart
Time to Practice Video
· Chapter 4: Problem 3
Difficulty Scale
(moderately easy but not a cinch)
WHAT YOU WILL LEARN IN THIS CHAPTER
· Understanding why a picture is really worth a thousand words
· Creating a histogram and a polygon
· Understanding the different shapes of different distributions
· Using SPSS to create incredibly cool charts
· Creating different types of charts and understanding their
application and uses
WHY ILLUSTRATE DATA?
In the previous two chapters, you learned about the two most
important types of descriptive statistics—measures of central
tendency and measures of variability. Both of these provide you
with the one best number for describing a group of data (central
tendency) and a number reflecting how diverse, or different,
scores are from one another (variability).
What we did not do, and what we will do here, is examine how
differences in these two measures result in different-looking
distributions. Numbers alone (such as M = 3 and s = 3) may be
important, but a visual representation is a much more effective
way of examining the characteristics of a distribution as well as
the characteristics of any set of data.
So, in this chapter, we’ll learn how to visually represent a
distribution of scores as well as how to use different types of
graphs to represent different types of data.
CORE CONCEPTS IN STATS VIDEO
Examining Data: Tables and Figures
Examining data helps find
data entry errors, evaluate research methodology, identify
outliers, and determine the shape of a distribution in a data
set. Researchers typically examine collected data in two ways,
by creating tables and figures. Imagine you asked a group of
friends to rate a movie they've seen on a one to five scale.
A table helps identify the variable and the possible values of
the variable. The sample size, often referred to as n, is 14
because there are ratings reported from 14 people. This is how
large the total sample is. From this, we can determine how
many in the sample have each value of the variable. We can
also determine the percentage that the sample has of each
possible value. Figures display variables from the table.
Nominal and ordinal variables can be depicted with bar charts,
while interval and ratio variables can be depicted using
histograms and frequency polygons. For this data set, we can
use a bar chart. Distributions of data can be characterized
along three aspects or dimensions: modality, symmetry, and
variability. In a unimodal distribution, a small range of values
has the greatest frequency or mode of the set. However, it's
possible for a distribution to have more than one mode. For a
bimodal distribution, we see two values that seem to occur
with the greatest frequency. A distribution is symmetrical if
folding it in half makes each half mirror the other. When a
distribution isn't symmetrical, it's called asymmetrical or
skewed. This distribution is often called positively skewed
because the tail-- or narrow end-- of the distribution is on the
right end. Variability in a distribution is the amount of
spread or dispersion of values for a variable. Peaked
distributions look like a tall mountain and reflect little
variability in the scores. This means that almost everyone has
given the same score to the movie. A flat distribution has a
lot of variability, where almost everyone's scores are different.
In between is the normal distribution, which is neither peaked
nor flat, and is both unimodal and symmetrical.
TEN WAYS TO A GREAT FIGURE (EAT LESS AND
EXERCISE MORE?)
Whether you create illustrations by hand or use a computer
program, the same principles of decent design apply. Here are
10 to copy and put above your desk:
1. Minimize chart or graph junk. “Chart junk” (a close cousin to
“word junk”) happens when you use every function, every
graph, and every feature a computer program has to make your
charts busy, full, and uninformative. With graphs, more is
definitely less.
2. Plan out your chart before you start creating the final
copy. Use graph paper even if you will be using a computer
program to generate the graph. Actually, why not just use your
computer to generate and print out graph paper
(try www.printfreegraphpaper.com)?
3. Say what you mean and mean what you say—no more and no
less. There’s nothing worse than a cluttered (with too much text
and fancy features) graph to confuse the reader.
4. Label everything so nothing is left to the misunderstanding of
the audience.
5. A graph should communicate only one idea—a description of
data or a demonstration of a relationship.
6. Keep things balanced. When you construct a graph, center
titles and labels.
7. Maintain the scale in a graph. “Scale” refers to the
proportional relationship between the horizontal and vertical
axes. This ratio should be about 3 to 4, so a graph that is 3
inches wide will be about 4 inches tall.
8. Simple is best and less is more. Keep the chart simple but not
simplistic. Convey one idea as straightforwardly as possible,
with distracting information saved for the accompanying text.
Remember, a chart or graph should be able to stand alone, and
the reader should be able to understand the message.
9. Limit the number of words you use. Too many words, or
words that are too large (both in terms of physical size and
idea-wise), can detract from the visual message your chart
should convey.
10. A chart alone should convey what you want to say. If it
doesn’t, go back to your plan and try it again.
FIRST THINGS FIRST: CREATING A FREQUENCY
DISTRIBUTION
The most basic way to illustrate data is through the creation of a
frequency distribution. A frequency distribution is a method of
tallying and representing how often certain scores occur. In the
creation of a frequency distribution, scores are usually grouped
into class intervals, or ranges of numbers.
Here are 50 scores on a test of reading comprehension on which
a frequency distribution is based:
47  10  31  25  20
 2  11  31  25  21
44  14  15  26  21
For each class interval, or range of scores, there are associated frequency counts.
Class Interval    Frequency
    45–49             1
    40–44             2
    35–39             4
    30–34             8
    25–29            10
    20–24            10
    15–19             8
    10–14             4
    5–9               2
    0–4               1
People Who Loved Statistics
Helen M. Walker (1891–1983) began her college career
studying philosophy and then became a high school math
teacher. She got her master’s degree, taught mathematics at the
University of Kansas (your authors’ favorite college) where she
was tenured, and then studied the history of statistics (at least
up to 1929, when she wrote her doctoral dissertation at
Columbia). Dr. Walker’s greatest interest was in the teaching of
statistics, and many years after her death, a scholarship was
endowed in her name at Columbia for students who want to
teach statistics! Her publications included a whole book
teaching about the best way to show statistics using tables. Oh,
and along the way, she became the first woman president of the
American Statistical Association. All this achievement from
someone who actually loved teaching statistics. Just like your
professor!
The Classiest of Intervals
As you can see from the above table, a class interval is a range
of numbers, and the first step in the creation of a frequency
distribution is to define how large each interval will be. As you
can see in the frequency distribution that we created, each
interval spans five possible scores, such as 5–9 (which includes
scores 5, 6, 7, 8, and 9) and 40–44 (which includes scores 40,
41, 42, 43, and 44). How did we decide to have an interval that
includes only five scores? Why not five intervals, each
consisting of 10 scores? Or two intervals, each consisting of 25
scores?
Here are some general rules to follow in the creation of a class
interval, regardless of the size of values in the data set you are
dealing with:
1. Select a class interval that has a range of 2, 5, 10, 15, or 20
data points. In our example, we chose 5.
2. Select a class interval so that 10 to 20 such intervals cover
the entire range of data. A convenient way to do this is to
compute the range and then divide by a number that represents
the number of intervals you want to use (between 10 and 20). In
our example, there are 50 scores, and we wanted 10 intervals:
50/10 = 5, which is the size of each class interval. If you had a
set of scores ranging from 100 to 400, you could start with an
estimate of 20 intervals and see if the interval range makes
sense for your data: 300/20 = 15, so 15 would be the class
interval.
3. Begin listing the class interval with a multiple of that
interval. In our frequency distribution of reading comprehension
test scores, the class interval is 5, and we started the lowest
class interval at 0.
4. Finally, the interval made up of the largest scores goes at the
top of the frequency distribution.
There are some simple steps for creating class intervals on the
way to creating a frequency distribution. Here are six general
rules:
1. Determine the range.
2. Decide on the number of class intervals.
3. Decide on the size of the class interval.
4. Decide the starting point for the first class.
5. Create the class intervals.
6. Put the data into the class intervals.
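These six steps can be sketched in a few lines of Python (our own illustration; the chapter itself works by hand and in SPSS, and all variable names here are ours). The demo data are the 15 reading-comprehension scores shown earlier:

```python
# Steps 1-6: build a frequency distribution with class intervals of size 5.
# Demo data: the 15 reading-comprehension scores shown earlier in the chapter.
scores = [47, 10, 31, 25, 20, 2, 11, 31, 25, 21, 44, 14, 15, 26, 21]

interval_size = 5          # step 3: size of each class interval
low, high = 0, 49          # step 4: start the first interval at a multiple of 5

# Step 5: create the class intervals (0-4, 5-9, ..., 45-49).
intervals = [(start, start + interval_size - 1)
             for start in range(low, high + 1, interval_size)]

# Step 6: tally how many scores fall inside each interval.
frequency = {iv: sum(iv[0] <= s <= iv[1] for s in scores) for iv in intervals}

# List the interval holding the largest scores first, as in the text's table.
for lo, hi in reversed(intervals):
    print(f"{lo}-{hi}: {frequency[(lo, hi)]}")
```

Run on the full set of 50 scores, the same tally would reproduce the frequency column of the table above.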
Once class intervals are created, it’s time to complete the
frequency part of the frequency distribution. That’s simply
counting the number of times a score occurs in the raw data and
entering that number in each of the class intervals represented
by the count.
In the frequency distribution that we created for our reading
comprehension data, the number of scores that occur between 30
and 34 and thus are in the 30–34 class interval is 8. So, an 8
goes in the column marked Frequency. There’s your frequency
distribution. As you might realize, it is easier to do this
counting if you have your scores listed in order.
Sometimes it is a good idea to graph your data first and then do
whatever calculations or analysis is called for. By first looking
at the data, you may gain insight into the relationship between
variables, what kind of descriptive statistic is the right one to
use to describe the data, and so on. This extra step might
increase your insights and the value of what you are doing.
LIGHTBOARD LECTURE VIDEO
Creating a Simple Chart
So statisticians and professor types use graphs and charts all
the time. Let's take a second to figure out what's in a chart--
like, what are the pieces that put this all together? Here are
some scores. See here, different scores people could get, and
here's the number of people that got those scores. Very
typical data that people like to graph all the time. And when
you make a graph, you usually have these two lines, two
dimensions. And along the bottom, you often put the actual
scores. So here the actual scores are 8, 9, 10, 11, and 12. And
along the side, it's very common to put something that shows
how many people got that score, or the frequency. OK. So, for
instance, the 8 here, one person got that. And the 9, three
people got that. One, two, three. And the 10, six people got
that. I don't know if we have room for that. That's up here
somewhere. And then for the 11, two people got that. That'd
be about there. And the 12, only one person got that. So these
dots represent the different people. It's sort of hard to look at
those dots, so instead we make these bars. And the taller the
bar, you can see, the more people got it. So, like, the 10,
that's way up here around a 6. And only one person got an 8,
so that's down here around a 1. But this, now, is a picture of
what your scores look like. And if you look at just the tops of
these bars, you can see, like, where we get a normal curve.
THE PLOT THICKENS: CREATING A HISTOGRAM
Now that we’ve got a tally of how many scores fall in what
class intervals, we’ll go to the next step and create what is
called a histogram, a visual representation of the frequency
distribution where the frequencies are represented by bars.
Depending on the book or journal article or report you read and
the software you use, visual representations of data are called
graphs (such as in SPSS) or charts (such as in the Microsoft
spreadsheet Excel). It really makes no difference. All you need
to know is that a graph or a chart is the visual representation of
data.
To create a histogram, do the following:
1. Using a piece of graph paper, place values at equal distances
along the x-axis, as shown in Figure 4.1. Now, identify
the midpoint of each class interval, which is the middle point in
the interval. It’s pretty easy to just eyeball, but you can also
just add the top and bottom values of the class interval and
divide by 2. For example, the midpoint of the class interval 0–4
is the average of 0 and 4, or 4/2 = 2.
2. Draw a bar or column centered on each midpoint that
represents the entire class interval to the height representing the
frequency of that class interval. For example, in Figure 4.2, you
can see that in our first entry, the class interval of 0–4 is
represented by the frequency of 1 (representing the one time a
value between 0 and 4 occurs). Continue drawing bars or
columns until each of the frequencies for each of the class
intervals is represented. Figure 4.2 is a nice hand-drawn
(really!) histogram for the frequency distribution of the 50
scores that we have been working with so far.
Notice that each class interval is represented by a range of
scores along the x-axis.
Figure 4.1 ⬢ Class intervals along the x-axis
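The midpoint rule in step 1 (add the endpoints, divide by 2) is easy to check; here is a tiny Python illustration of ours, not from the text:

```python
# Midpoint of each class interval: add the two endpoints and divide by 2.
intervals = [(0, 4), (5, 9), (10, 14), (40, 44), (45, 49)]
midpoints = [(low + high) / 2 for (low, high) in intervals]
print(midpoints)  # [2.0, 7.0, 12.0, 42.0, 47.0]
```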
The Tallyho Method
You can see by the simple frequency distribution at the
beginning of the chapter that you already know more about the
distribution of scores than you’d learn from just a simple listing
of them. You have a good idea of what values occur with what
frequency. But another visual representation (besides a
histogram) can be done by using tallies for each of the
occurrences, as shown in Figure 4.3.
Figure 4.2 ⬢ A hand-drawn histogram
Figure 4.3 ⬢ Tallying scores
We used tallies that correspond with the frequency of scores
that occur within a certain class. This gives you an even better
visual representation of how often certain scores occur relative
to other scores.
THE NEXT STEP: A FREQUENCY POLYGON
Creating a histogram or a tally of scores wasn’t so difficult, and
the next step (and the next way of illustrating data) is even
easier. We’re going to use the same data—and, in fact, the
histogram that you just saw created—to create a frequency
polygon. (Polygon is a word for shape.) A frequency polygon is
a continuous line that represents the frequencies of scores
within a class interval, as shown in Figure 4.4.
Figure 4.4 ⬢ A hand-drawn frequency polygon
How did we draw this? Here’s how:
1. Place a midpoint at the top of each bar or column in a
histogram (see Figure 4.2).
2. Connect the lines and you’ve got it—a frequency polygon!
Note that in Figure 4.4, the histogram on which the frequency
polygon is based is drawn using vertical and horizontal lines,
and the polygon is drawn using curved lines. That’s because,
although we want you to see what a frequency polygon is based
on, you usually don’t see the underlying histogram.
Why use a frequency polygon rather than a histogram to
represent data? For two reasons. Visually, a frequency polygon
appears more dynamic than a histogram (a line that represents
change in frequency always looks neat). Also, the use of a
continuous line suggests that the variable represented by the
scores along the x-axis is also a theoretically continuous,
interval-level measurement as we talked about in Chapter 2. (To
purists, the fact that the bars touch each other in a histogram
suggests the interval-level nature of the variable, as well.)
Cumulating Frequencies
Once you have created a frequency distribution and have
visually represented those data using a histogram or a frequency
polygon, another option is to create a visual representation of
the cumulative frequency of occurrences by class intervals. This
is called a cumulative frequency distribution.
A cumulative frequency distribution is based on the same data
as a frequency distribution but with an added column
(Cumulative Frequency), as shown below.
Class Interval    Frequency    Cumulative Frequency
    45–49             1                 50
    40–44             2                 49
    35–39             4                 47
    30–34             8                 43
    25–29            10                 35
    20–24            10                 25
    15–19             8                 15
    10–14             4                  7
    5–9               2                  3
    0–4               1                  1
The cumulative frequency distribution begins with the creation
of a new column labeled “Cumulative Frequency.” Then, we add
the frequency in a class interval to all the frequencies below it.
For example, for the class interval of 0–4, there is 1 occurrence
and none below it, so the cumulative frequency is 1. For the
class interval of 5–9, there are 2 occurrences in that class
interval and one below it for a total of 3 (2 + 1) occurrences.
The last class interval (45–49) contains 1 occurrence, and there
are now a total of 50 occurrences at or below that class interval.
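The running total is mechanical enough to verify in one line of Python (our sketch, using the frequency column from the table above, listed from the lowest interval up):

```python
from itertools import accumulate

# Frequencies from the lowest class interval (0-4) up to the highest (45-49).
frequencies = [1, 2, 4, 8, 10, 10, 8, 4, 2, 1]

# Each cumulative value adds an interval's frequency to all those below it.
cumulative = list(accumulate(frequencies))
print(cumulative)  # [1, 3, 7, 15, 25, 35, 43, 47, 49, 50]
```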
Once we create the cumulative frequency distribution, then we
can plot it as a histogram or a frequency polygon. Only this
time, we’ll skip right ahead and plot the midpoint of each class
interval as a function of the cumulative frequency of that class
interval. You can see the cumulative frequency distribution
in Figure 4.5 based on these same 50 scores. Notice this
frequency polygon is shaped a little like a letter S. If the scores
in a data set are distributed the way scores typically are,
cumulative frequencies will often graph this way.
Figure 4.5 ⬢ A hand-drawn cumulative frequency distribution
Another name for a cumulative frequency polygon is an ogive.
And, if the distribution of the data is normal or bell shaped
(see Chapter 8 for more on this), then the ogive represents what
is popularly known as a bell curve or a normal distribution.
SPSS creates a really nice ogive—it’s called a P-P plot (for
probability plot) and is easy to create. See Appendix A for an
introduction to creating graphs using SPSS, as well as the
material toward the end of this chapter.
OTHER COOL WAYS TO CHART DATA
What we did so far in this chapter is take some data and show
how charts such as histograms and polygons can be used to
communicate visually. But several other types of charts are also
used in the behavioral and social sciences, and although it’s not
necessary for you to know exactly how to create them
(manually), you should at least be familiar with their names and
what they do. So here are some popular charts, what they do,
and how they do it.
There are several very good personal computer applications for
creating charts, among them the spreadsheet Excel (a Microsoft
product) and, of course, SPSS. The charts in the “Using the
Computer to Illustrate Data” section were created using SPSS as
well.
Bar Charts
A bar or column chart should be used when you want to
compare the frequencies of different categories with one
another. Categories are organized horizontally on the x-axis,
and values are shown vertically on the y-axis. Here are some
examples of when you might want to use a column chart:
· Number of participants in different water exercise activities
· The sales of three different types of products
· Number of children in each of six different grades
Figure 4.6 shows a graph of number of participants in different
water activities.
Figure 4.6 ⬢ A bar chart that compares different water activities
Column Charts
A column chart is identical to a bar chart, but in this chart,
categories are organized on the y-axis (which is the vertical
one), and values are shown on the x-axis (the horizontal one).
Line Charts
A line chart should be used when you want to show a trend in
the data at equal intervals. This sort of graph is often used when
the x-axis represents time. Here are some examples of when you
might want to use a line chart:
· Number of cases of mononucleosis (mono) per season among
college students at three state universities
· Toy sales for the T&K company over four quarters
· Number of travelers on two different airlines for each quarter
In Figure 4.7, you can see a chart of sales in units over four
quarters.
Figure 4.7 ⬢ Using a line chart to show a trend over time
Pie Charts
A pie chart should be used when you want to show the
proportion or percentage of people or things in various
categories. The rule is that the percentages in each “slice” must
add up to 100%, to make a whole pie. Here are some examples
of when you might want to use a pie chart:
· Of children living in poverty, the percentage who represent
various ethnicities
· Of students enrolled, the proportion who are in night or day
classes
· Of participants, the percentage in various age groups
Note that a pie chart describes a nominal-level variable (such as
ethnicity, time of enrollment, and age groups).
In Figure 4.8, you can see a pie chart of voter preference. And
we did a few fancy-schmancy things, such as separating and
labeling the slices.
Figure 4.8 ⬢ A pie chart illustrating the relative proportion of
one category to others
USING THE COMPUTER (SPSS, THAT IS) TO ILLUSTRATE
DATA
Now let’s use SPSS and go through the steps in creating some
of the charts that we explored in this chapter. First, here are
some general SPSS charting guidelines.
1. Although there are a couple of options, we will use the Chart
Builder option on the Graphs menu. This is the easiest way to
get started and well worth learning how to use.
2. In general, you click Graphs → Chart Builder, and you see a
dialog box from which you will select the type of graph you
want to create.
3. Click the type of graph you want to create and then select the
specific design of that type of graph.
4. Drag the variable names to the axis where each belongs.
5. Click OK, and you’ll see your graph.
Let’s practice.
Creating a Histogram
1. Enter the data you want to use to create the graph. Use those
50 scores we’ve been using in this chapter or make some up just
to practice with.
2. Click Graphs → Chart Builder and you will see the Chart
Builder dialog box, as shown in Figure 4.9. If you see any other
screen, click OK.
3. Click the Histogram option in the Choose from: list and
double-click the first image.
4. Drag the variable you wish to graph to the “x-axis?” location
in the preview window.
5. Click OK and you will see a histogram, as shown in Figure
4.10.
Figure 4.9 ⬢ The Chart Builder dialog box
The histogram in Figure 4.10 looks a bit different from the
hand-drawn one representing the same data shown earlier in this
chapter, in Figure 4.2. The difference is that SPSS defines class
intervals using its own idiosyncratic method. SPSS took as the
middle of a class interval the bottom number of the interval
(such as 10) rather than the midpoint (such as 12.5).
Consequently, scores are allocated to different groups. The
lesson here? How you group data makes a big difference in the
way they look in a histogram. And, once you get to know SPSS
well, you can make all kinds of fine-tuned adjustments to make
graphs appear exactly as you want them.
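If you want the same kind of control outside SPSS, here is a minimal matplotlib sketch (our own analogue, not an SPSS feature). The random scores stand in for the chapter's data, and passing explicit bin edges keeps the 5-point class intervals fixed, avoiding the idiosyncratic grouping just described:

```python
import numpy as np
import matplotlib
matplotlib.use("Agg")                    # draw off-screen; no display required
import matplotlib.pyplot as plt

rng = np.random.default_rng(42)
scores = rng.integers(0, 50, size=50)    # stand-in for the 50 reading scores

edges = np.arange(0, 55, 5)              # class intervals 0-4, 5-9, ..., 45-49
counts, _, _ = plt.hist(scores, bins=edges, edgecolor="black")
plt.xlabel("Reading comprehension score")
plt.ylabel("Frequency")
plt.savefig("histogram.png")
```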
Creating a Bar Graph
To create a bar graph, follow these steps:
1. Enter the data you want to use to create the graph. We used
the following data that show the number of people in a club who
belong to each of three political parties (1 = Democrat, 2 =
Republican, 3 = Independent):
1, 1, 2, 3, 2, 1, 1, 2, 1
2. Click Graphs → Chart Builder, and you will see the Chart
Builder dialog box, as shown in Figure 4.11. If you see any
other screen, click OK.
3. Click the Bar option in the Choose from: list and double-
click the first image.
4. Drag the variable named Party to the x-axis? location in the
preview window.
5. Drag the variable named Number to the Count axis.
6. Click OK and you will see the bar graph, as shown in Figure
4.12.
Figure 4.11 ⬢ The Chart Builder dialog box
Figure 4.12 ⬢ A bar graph created using the Chart Builder
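Under the hood, the bar graph is just a tally of how many cases carry each code. Here is a short Python check of ours, using the club data from step 1:

```python
from collections import Counter

# Party membership codes: 1 = Democrat, 2 = Republican, 3 = Independent.
party_codes = [1, 1, 2, 3, 2, 1, 1, 2, 1]
labels = {1: "Democrat", 2: "Republican", 3: "Independent"}

# Tally the cases per party; these counts are the bar heights.
counts = Counter(labels[code] for code in party_codes)
print(counts)  # Counter({'Democrat': 5, 'Republican': 3, 'Independent': 1})
```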
Creating a Line Graph
To create a line graph, follow these steps:
1. Enter the data you want to use to create the graph. In this
example, we will be using the percentage of the total student
body who attended the first day of classes each year over the
duration of a 10-year program. Here are the data. You can type
them into SPSS exactly as shown here, with the top row being
the names you will give the two variables (columns).
Year    Attendance
  1         87
  2         88
  3         89
  4         76
  5         80
  6         96
  7         91
  8         97
  9         89
 10         79
2. Click Graphs → Chart Builder and you will see the Chart
Builder dialog box, as shown in Figure 4.11. If you see any
other screen, click OK.
3. Click the Line option in the Choose from: list and double-
click the first image.
4. Drag the variable named Year to the x-axis? location in the
preview window.
5. Drag the variable named Attendance to the y-axis? location.
6. Click OK, and you will see the line graph, as shown in Figure
4.13. We used the SPSS Chart Editor to change the minimum
and maximum values on the y-axis.
Figure 4.13 ⬢ A line graph created using the Chart Builder
Creating a Pie Chart
To create a pie chart, follow these steps:
1. Enter the data you want to use to create the chart. In this
example, the pie chart represents the percentage of people
buying different brands of doughnuts. Here are the data:
Brand      Percentage
Krispies   55
Dunks      35
Other      10
2. Click Graphs → Chart Builder, and you will see the Chart
Builder dialog box, as shown in Figure 4.11. If you see any
other screen, click OK.
3. Click the Pie/Polar option in the Choose from: list and
double-click the only image.
4. Drag the variable named Brand to the Slice by? axis label.
5. Drag the variable named Percentage to the Angle Variable?
axis label.
6. Click OK, and you will see the pie chart, as shown in Figure
4.14.
Figure 4.14 ⬢ A pie chart created using the Chart Builder
Real-World Stats
Graphs work, and a picture really is worth more than a thousand
words.
In this article, an oldie but goodie, the researchers examined
how people perceive and process statistical graphs. Stephen
Lewandowsky and Ian Spence reviewed empirical studies
designed to explore how suitable different types of graphs are
and how what is known about human perception can have an
impact on the design and utility of these charts.
They focused on some of the theoretical explanations for why
certain elements work and don’t, the use of pictorial symbols
(like a happy face symbol, which could make up the bar in a bar
chart), and multivariate displays, where more than one set of
data needs to be represented. And, as is very often the case with
any paper, they concluded that not enough data were available
yet. Given the increasingly visual world in which we live
(emojis, anyone? ☹ ☺), this is interesting and useful reading to
gain a historical perspective on how information was (and still
is) discussed as a scientific topic.
Want to know more? Go online or to the library and find …
Lewandowsky, S., & Spence, I. (1989). The perception of
statistical graphs. Sociological Methods & Research, 18, 200–242.
Summary
There’s no question that charts are fun to create and can add
enormous understanding to what might otherwise appear to be
disorganized data. Follow our suggestions in this chapter and
use charts well but only when they enhance, not duplicate,
what’s already there.
Time to Practice
1. A data set of 50 comprehension scores (named
Comprehension Score) called Chapter 4 Data Set 1 is available
in Appendix C and on the website. Answer the following
questions and/or complete the following tasks:
a. Create a frequency distribution and a histogram for the set.
b. Why did you select the class interval you used?
2. Here is a frequency distribution. Create a histogram by hand
or by using SPSS.
Class Interval   Frequency
261–280          140
241–260          320
221–240          3,380
201–220          600
181–200          500
161–180          410
141–160          315
121–140          300
100–120          200
3. A third-grade teacher wants to improve her students’ level of
engagement during group discussions and instruction. She keeps
track of each of the 15 third graders’ number of responses every
day for 1 week, and the data are available as Chapter 4 Data Set
2. Use SPSS to create a bar chart with one bar for each day (and
warning—this may be a toughie).
Time to Practice Video
Chapter 4: Problem 3
1. For problem three in chapter four, a teacher wants to display
her students' number of responses for each day of the week.
And she wants to do that with a bar chart. Since she hasn't
taken a stats class, she comes to you for help. You first enter
her data into SPSS and the results look like this-- When you
look at your data set, you'll see that it actually has the wrong
level of measurement. Notice that there's a little Venn
diagram at the top of each column, which indicates that your
data has been entered as nominal. That would be correct if you
were noting which day of the week a student participated, but
since you're noting how often a given student participated, the
correct level of measurement is a scale. Go ahead and change
that. Watch how I do that. Under variable view, under
measure, you just want to click each one and turn it into a
scale. You can also cut and paste these, and I can show you
that in another video. Once you have them changed, go back to
data view, and you'll see that at the top each icon has changed into a
little ruler. The next question is, how do I get SPSS to
display the average score per day rather than the total number of
individual scores, which might look like a mess, and it's why
this question is a toughie. To do that we go under graphs, and
you'll see that you have two options, you can do a Chart
Builder or a Legacy Dialog. For this question we want to use
the Legacy Dialog. We go to Bar and when we click that, there
are two questions-- one, what type of bar chart? We want a
simple one. And then, how do you want the data in the chart
displayed? Do we want summaries for groups of cases? We
really don't. We want summary of separate variables where
each day of the week is a variable. We click on Define and
then here you'll see every day of the week. You want to bring
that over and you see your bar charts are going to represent the
mean for every day of the week. As a good habit you want to
make sure you title it, I called it "Students' Engagement
During Group Discussion." The second one is by day of
week. We hit Continue, and then when we hit OK, you're
going to see your output pop up. And here is our bar chart--
every day of the week showing the average student engagement.
And this is how you answer problem 3 in chapter 4. Good luck.
4. Identify whether these distributions are negatively skewed,
positively skewed, or not skewed at all and explain why you
describe them that way.
a. This talented group of athletes scored very high on the
vertical jump task.
b. On this incredibly crummy test, everyone received the same
score.
c. On the most difficult spelling test of the year, the third
graders wept as the scores were delivered and then their parents
complained.
5. Use the data available as Chapter 4 Data Set 3 on pie
preference to create a pie chart ☺ using SPSS.
6. For each of the following, indicate whether you would use a
pie, line, or bar chart and why.
a. The proportion of freshmen, sophomores, juniors, and seniors
in a particular university
b. Change in temperature over a 24-hour period
c. Number of applicants for four different jobs
d. Percentage of test takers who passed
e. Number of people in each of 10 categories
7. Provide an example of when you might use each of the
following types of charts. For example, you would use a pie
chart to show the proportion of children who receive a reduced-
price lunch that are in Grades 1 through 6. When you are done,
draw the fictitious chart by hand.
a. Line
b. Bar
c. Scatter/dot (extra credit)
8. Go to the library or online and find a journal article in your
area of interest that contains empirical data but does not contain
any visual representation of them. Use the data to create a chart.
Be sure to specify what type of chart you are creating and why
you chose the one you did. You can create the chart manually or
using SPSS or Excel.
9. Create the worst-looking chart that you can, crowded with
chart and font junk. Nothing makes as lasting an impression as
a bad example.
10. And, finally, what is the purpose of a chart or graph?
Student Study Site
Get the tools you need to sharpen your study skills!
Visit edge.sagepub.com/salkindfrey7e to access practice
quizzes, eFlashcards, original and curated videos, data sets, and
more!
5 COMPUTING CORRELATION COEFFICIENTS: ICE CREAM AND CRIME
5: MEDIA LIBRARY
Premium Videos
Core Concepts in Stats Video
· Correlation
Lightboard Lecture Video
· Partial Correlations
Time to Practice Video
· Chapter 5: Problem 6
Difficulty Scale
(moderately hard)
WHAT YOU WILL LEARN IN THIS CHAPTER
· Understanding what correlations are and how they work
· Computing a simple correlation coefficient
· Interpreting the value of the correlation coefficient
· Understanding what other types of correlations exist and when
they should be used
WHAT ARE CORRELATIONS ALL ABOUT?
Measures of central tendency and measures of variability are
not the only descriptive statistics that we are interested in using
to get a picture of what a set of scores looks like. You have
already learned that knowing the values of the one most
representative score (central tendency) and a measure of spread
or dispersion (variability) is critical for describing the
characteristics of a distribution.
However, sometimes we are as interested in the relationship
between variables—or, to be more precise, how the value of one
variable changes when the value of another variable changes.
The way we express this interest is through the computation of a
simple correlation coefficient. For example, what’s the
relationship between age and strength? Income and years of
education? Memory skills and amount of drug use? Your
political attitudes and the attitudes of your parents?
A correlation coefficient is a numerical index that reflects the
relationship or association between two variables. The value of
this descriptive statistic ranges between −1.00 and +1.00. A
correlation between two variables is sometimes referred to as
a bivariate (for two variables) correlation. Even more
specifically, the type of correlation that we will talk about in
the majority of this chapter is called the Pearson product-
moment correlation, named for its inventor, Karl Pearson.
The Pearson correlation coefficient examines the relationship
between two variables, but both of those variables are
continuous in nature. In other words, they are variables that can
assume any value along some underlying continuum; examples
include height (you really can be 5 feet 6.1938574673 inches
tall), age, test score, and income. Remember in Chapter 2, when
we talked about levels of measurement? Interval and ratio levels
of measurement are continuous. But a host of other variables are
not continuous. They’re called discrete or categorical variables,
and examples are race (such as black and white), social class
(such as high and low), and political affiliation (such as
Democrat and Republican). In Chapter 2, we called these types
of variables nominal level. You need to use other correlational
techniques, such as the phi correlation, in these cases. These
topics are for a more advanced course, but you should know
they are acceptable and very useful techniques. We mention
them briefly later on in this chapter.
Other types of correlation coefficients measure the relationship
between more than two variables, and we’ll talk about one of
these in some more advanced chapters later on (which you are
looking forward to already, right?).
Types of Correlation Coefficients: Flavor 1 and Flavor 2
A correlation reflects the dynamic quality of the relationship
between variables. In doing so, it allows us to understand
27. whether variables tend to move in the same or opposite
directions in relationship to each other. If variables change in
the same direction, the correlation is called a direct
correlation or a positive correlation. If variables change in
opposite directions, the correlation is called an indirect
correlation or a negative correlation. Table 5.1 shows a
summary of these relationships.
Table 5.1 ⬢ Types of Correlations

What Happens to Variable X | What Happens to Variable Y | Type of Correlation | Value | Example
X increases in value. | Y increases in value. | Direct or positive | Positive, ranging from .00 to +1.00 | The more time you spend studying, the higher your test score will be.
X decreases in value. | Y decreases in value. | Direct or positive | Positive, ranging from .00 to +1.00 | The less money you put in the bank, the less interest you will earn.
X increases in value. | Y decreases in value. | Indirect or negative | Negative, ranging from −1.00 to .00 | The more you exercise, the less you will weigh.
X decreases in value. | Y increases in value. | Indirect or negative | Negative, ranging from −1.00 to .00 | The less time you take to complete a test, the more items you will get wrong.
Now, keep in mind that the examples in the table reflect
generalities, for example, regarding time to complete a test and
the number of items correct on that test. In general, the less
time that is taken on a test, the lower the score. Such a
conclusion is not rocket science, because the faster one goes,
the more likely one is to make careless mistakes such as not
reading instructions correctly. But, of course, some people can
go very fast and do very well. And other people go very slowly
and don’t do well at all. The point is that we are talking about
the average performance of a group of people on two different
variables. We are computing the correlation between the two
variables for the group of people, not for any one particular
person.
There are several easy (but important) things to remember about
the correlation coefficient:
· A correlation can range in value from −1.00 to +1.00.
· The absolute value of the coefficient reflects the strength of
the correlation. So a correlation of −.70 is stronger than a
correlation of +.50. One frequently made mistake regarding
correlation coefficients occurs when students assume that a
direct or positive correlation is always stronger (i.e., “better”)
than an indirect or negative correlation because of the sign and
nothing else.
· To calculate a correlation, you need exactly two variables and
at least two people.
· Another easy mistake is to assign a value judgment to the sign
of the correlation. Many students assume that a negative
relationship is not good and a positive one is good. But think of
the example from Table 5.1 where exercise and weight have a
negative correlation. That negative correlation is a positive
thing! That’s why, instead of using the
terms negative and positive, you might prefer to use the
terms indirect and direct to communicate meaning more clearly.
· The Pearson product-moment correlation coefficient is
represented by the small letter r with a subscript representing
the variables that are being correlated. You’d think that P for
Pearson might be used as the symbol for this correlation, but the
Greek letter rho (ρ), which looks like our P, makes the English “r”
sound, so r is used. The Greek ρ stands for the theoretical
correlation in a population, so don’t feel sorry for Pearson. (If it
helps, think of r as standing for relationship.) For example,
· rxy is the correlation between variable X and variable Y.
· rweight-height is the correlation between weight and height.
· rSAT.GPA is the correlation between SAT score and grade
point average (GPA).
The correlation coefficient reflects the amount of variability
that is shared between two variables and what they have in
common. For example, you can expect an individual’s height to
be correlated with an individual’s weight because these two
variables share many of the same characteristics, such as the
individual’s nutritional and medical history, general health, and
genetics, and, of course, taller people usually have more mass.
On the other hand, if one variable does not change in value and
therefore has nothing to share, then the correlation between it
and another variable is zero. For example, if you computed the
correlation between age and number of years of school
completed, and everyone was 25 years old, there would be no
correlation between the two variables because there is literally
no information (no variability) in age available to share.
Likewise, if you constrain or restrict the range of one variable,
the correlation between that variable and another variable will
be less than if the range is not constrained. For example, if you
correlate reading comprehension and grades in school for very
high-achieving children, you’ll find the correlation to be lower
than if you computed the same correlation for children in
general. That’s because the reading comprehension score of
very high-achieving students is quite high and much less
variable than it would be for all children. The moral? When you
are interested in the relationship between two variables, try to
collect sufficiently diverse data—that way, you’ll get the truest
representative result. And how do you do that? Measure a
variable as precisely as possible (use higher, more informative
levels of measurement) and use a sample that varies greatly on
the characteristics you are interested in.
COMPUTING A SIMPLE CORRELATION COEFFICIENT
The computational formula for the simple Pearson product-
moment correlation coefficient between a variable
labeled X and a variable labeled Y is shown in Formula 5.1:
(5.1)
r_{xy} = \frac{n\sum XY - \sum X \sum Y}{\sqrt{[n\sum X^2 - (\sum X)^2][n\sum Y^2 - (\sum Y)^2]}}
where
· rxy is the correlation coefficient between X and Y;
· n is the size of the sample;
· X is each individual’s score on the X variable;
· Y is each individual’s score on the Y variable;
· XY is the product of each X score times its
corresponding Y score;
· X² is each individual’s X score, squared; and
· Y² is each individual’s Y score, squared.
Here are the data we will use in this example:

X    Y    X²    Y²    XY
2    3     4     9     6
4    2    16     4     8
5    6    25    36    30
6    5    36    25    30
4    3    16     9    12
7    6    49    36    42
8    5    64    25    40
5    4    25    16    20
6    4    36    16    24
7    5    49    25    35
Total, Sum, or ∑:
54   43   320   201   247
Before we plug the numbers in, let’s make sure you understand
what each one represents:
· ∑X, or the sum of all the X values, is 54.
· ∑Y, or the sum of all the Y values, is 43.
· ∑X², or the sum of each X value squared, is 320.
· ∑Y², or the sum of each Y value squared, is 201.
· ∑XY, or the sum of the products of X and Y, is 247.
It’s easy to confuse the sum of a set of values squared and the
sum of the squared values. The sum of a set of values squared is
taking values such as 2 and 3, summing them (to be 5), and then
squaring that (which is 25). The sum of the squared values is
taking values such as 2 and 3, squaring them (to get 4 and 9,
respectively), and then adding those together (to get 13). Just
look for the parentheses as you work.
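In Python terms (a quick sketch of our own, not something from SPSS), the difference between the two quantities looks like this:

```python
values = [2, 3]

# The sum of a set of values, squared: (2 + 3)² = 25
sum_then_square = sum(values) ** 2

# The sum of the squared values: 2² + 3² = 13
square_then_sum = sum(v ** 2 for v in values)

print(sum_then_square, square_then_sum)  # 25 13
```

Notice how the placement of the squaring operation, like the placement of the parentheses, completely changes the result.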
Here are the steps in computing the correlation coefficient:
1. List the two values for each participant. You should do this
in a column format so as not to get confused. Use graph paper if
working manually or SPSS or some other data analysis tool if
working digitally.
2. Compute the sum of all the X values and compute the sum of
all the Y values.
3. Square each of the X values and square each of the Y values.
4. Find the sum of the XY products.
These values are plugged into the equation you see in Formula
5.2:
(5.2)
r_{xy} = \frac{(10 \times 247) - (54 \times 43)}{\sqrt{[(10 \times 320) - 54^2][(10 \times 201) - 43^2]}}
Ta-da! And you can see the answer in Formula 5.3:
(5.3)
r_{xy} = \frac{148}{213.83} = .692
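If you want to double-check the arithmetic, here is a short Python sketch (ours, not SPSS output) that plugs the five sums from the table into Formula 5.1:

```python
from math import sqrt

n = 10                      # number of participants
sum_x, sum_y = 54, 43       # ∑X and ∑Y
sum_x2, sum_y2 = 320, 201   # ∑X² and ∑Y²
sum_xy = 247                # ∑XY

numerator = n * sum_xy - sum_x * sum_y              # 2470 − 2322 = 148
denominator = sqrt((n * sum_x2 - sum_x ** 2) *
                   (n * sum_y2 - sum_y ** 2))       # ≈ 213.83

r = numerator / denominator
print(round(r, 3))  # 0.692
```

The result matches Formula 5.3: a moderately strong positive correlation of about .69.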
What’s really interesting about correlations is that they measure
the amount of distance that one variable covaries in relation to
another. So, if both variables are highly variable (have lots of
wide-ranging values), the correlation between them is more
likely to be high than if not. Now, that’s not to say that lots of
variability guarantees a higher correlation, because the scores
have to vary in a systematic way. But if the variance is
constrained in one variable, then no matter how much the other
variable changes, the correlation will be lower. For example,
let’s say you are examining the correlation between academic
achievement in high school and first-year grades in college and
you look at only the top 10% of the class. Well, that top 10% is
likely to have very similar grades, introducing no variability
and no room for the one variable to vary as a function of the
other. Guess what you get when you correlate one variable with
another variable that does not change (that is, has no
variability)? rxy = 0, that’s what. The lesson here? Variability
works, and you should not artificially limit it.
The Scatterplot: A Visual Picture of a Correlation
There’s a very simple way to visually represent a correlation:
Create what is called a scatterplot, or scattergram (in SPSS
lingo it’s a scatter/dot graph). This is simply a plot of each set
of scores on separate axes.
Here are the steps to complete a scattergram like the one you
see in Figure 5.1, which plots the 10 sets of scores for which we
computed the sample correlation earlier.
Figure 5.1 ⬢ A simple scattergram
1. Draw the x-axis and the y-axis. Usually, the X variable goes
on the horizontal axis and the Y variable goes on the vertical
axis.
2. Mark both axes with the range of values that you know to be
the case for the data. For example, the value of the X variable in
our example ranges from 2 to 8, so we marked the x-axis from 0
to 9. There’s no harm in marking the axes a bit low or high—
just as long as you allow room for the values to appear. The
value of the Y variable ranges from 2 to 6, and we marked that
axis from 0 to 9. Having similarly labeled (and scaled) axes can
sometimes make the finished scatterplot easier to understand.
3. Finally, for each pair of scores (such as 2 and 3, as shown
in Figure 5.1), we entered a dot on the chart by marking the
place where 2 falls on the x-axis and 3 falls on the y-axis. The
dot represents a data point, which is the intersection of the two
values.
When all the data points are plotted, what does such an
illustration tell us about the relationship between the variables?
To begin with, the general shape of the collection of data points
indicates whether the correlation is direct (positive) or indirect
(negative).
A positive slope occurs when the data points group themselves
in a cluster from the lower left-hand corner on the x- and y-axes
through the upper right-hand corner. A negative slope occurs
when the data points group themselves in a cluster from the
upper left-hand corner on the x- and y-axes through the lower
right-hand corner.
Here are some scatterplots showing very different correlations
where you can see how the grouping of the data points reflects
the sign and strength of the correlation coefficient.
Figure 5.2 shows a perfect direct correlation, where rxy = 1.00
and all the data points are aligned along a straight line with a
positive slope.
Figure 5.2 ⬢ A perfect direct, or positive, correlation
If the correlation were perfectly indirect, the value of the
correlation coefficient would be −1.00, and the data points
would align themselves in a straight line as well but from the
upper left-hand corner of the chart to the lower right. In other
words, the line that connects the data points would have a
negative slope. And, remember, in both examples, the strength
of the association is the same; it is only the direction that is
different.
Don’t ever expect to find a perfect correlation between any two
variables in the behavioral or social sciences. Such a correlation
would say that two variables are so perfectly related, they share
everything in common. In other words, knowing one is exactly
like knowing the other. Just think about your classmates. Do
you think they all share any one thing in common that is
perfectly related to another of their characteristics across all
those different people? Probably not. In fact, r values
approaching .7 and .8 are just about the highest you’ll see.
In Figure 5.3, you can see the scatterplot for a strong (but not
perfect) direct relationship where rxy = .70. Notice that the data
points align themselves along a positive slope, although not
perfectly.
Now, we’ll show you a strong indirect, or negative, relationship
in Figure 5.4, where rxy = −.82. Notice that the data points
align themselves on a negative slope from the upper left-hand
corner of the chart to the lower right-hand corner.
That’s what different types of correlations look like, and you
can really tell the general strength and direction by examining
the way the points are grouped.
Figure 5.3 ⬢ A strong, but not perfect, direct relationship
Not all correlations are reflected by a straight line showing
the X and the Y values in a relationship called a linear
correlation (see Chapter 16 for tons of fun stuff about this). The
relationship may not be linear and may not be reflected by a
straight line. Let’s take the correlation between age and
memory. For the early years, the correlation is probably highly
positive—the older children get, the better their memory. Then,
into young and middle adulthood, there isn’t much of a change
or much of a correlation, because most young and middle adults
maintain a good (but not necessarily increasingly better)
memory. But with old age, memory begins to suffer, and there is
an indirect relationship between memory and aging in the later
years. If you take these together and look at the relationship
over the life span, you find that the correlation between memory
and age tends to look something like a curve where age
continues to grow at the same rate but memory increases at
first, levels off, and then decreases. It’s
a curvilinear relationship, and sometimes, the best description
of a relationship is that it is curvilinear.
CORE CONCEPTS IN STATS VIDEO
Correlation
People who perform well in high school tend to do well in
college-- right? Say you wanted to measure the relationship
between students' GPA in high school and their GPA in
college-- you would calculate a Pearson correlation between
the two variables. To see the relationship between two
variables, we draw a picture of the data using a scatter plot.
We plot each person's high school GPA on the x-axis and their
college GPA on the y-axis. Notice how the dots have a pattern.
Next, we compute a line called a regression line that runs
through the center of the dots. This line is computed using the
average. We use a formula to measure how much each GPA
varies independently and how much the two GPAs vary
together. The variance is the average amount that each
student's high school GPA differs from the mean of all of the
high school GPAs. Of course, each student's college GPA
also varies from the mean of the college GPAs. But high
school GPA and college GPA also vary together. People who
do well in high school tend to do well in college. Varying
together is called covariance. The Pearson correlation is
calculated by dividing the covariance of the two variables by
the product of their standard deviations. Sometimes the dots are
spread out from the line-- that means there is a lot of
independent variance. The relationship between the two
variables is weaker. Sometimes the dots are close to the line--
the relationship is stronger. Sometimes the line looks like this-
- as one goes up, the other goes up. Sometimes the line is
like this-- as one goes up, the other goes down. College GPA
is negatively related to the amount of time spent partying.
We use correlation to understand the strength and the direction
of the relationship between the variables.
The Correlation Matrix: Bunches of Correlations
What happens if you have more than two variables and you want
to see correlations among all pairs of variables? How are the
correlations illustrated? Use a correlation matrix like the one
shown in Table 5.2—a simple and elegant solution.
As you can see in these made-up data, there are four variables
in the matrix: level of income (Income), level of education
(Education), attitude toward voting (Attitude), and how sure
they are that they will vote (Vote).
Table 5.2 ⬢ Correlation Matrix

            Income   Education   Attitude   Vote
Income       1.00      .574       −.08     −.291
Education    .574     1.00       −.149     −.199
Attitude     −.08     −.149       1.00     −.169
Vote        −.291     −.199      −.169      1.00
For each pair of variables, there is a correlation coefficient. For
example, the correlation between income level and education is
.574. Similarly, the correlation between income level and how
sure people are that they will vote in the next election is −.291
(meaning that the higher the level of income, the less confident
people were that they would vote).
In such a matrix with four variables, there are really only six
correlation coefficients. Because variables correlate perfectly
with themselves (those are the 1.00s down the diagonal), and
because the correlation between Income and Vote is the same as
the correlation between Vote and Income, the matrix creates a
mirror image of itself.
You can use SPSS—or almost any other statistical analysis
package, such as Excel—to easily create a matrix like the one
you saw earlier. In applications like Excel, you can use the Data
Analysis ToolPak.
You will see such matrices (the plural of matrix) when you read
journal articles that use correlations to describe the
relationships among several variables.
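To see how such a matrix comes about, here is a short Python sketch that builds one by computing every pairwise Pearson correlation. The scores and variable names here are hypothetical examples of our own, not the data behind Table 5.2:

```python
from math import sqrt

def pearson(x, y):
    """Formula 5.1: the Pearson product-moment correlation."""
    n = len(x)
    sx, sy = sum(x), sum(y)
    num = n * sum(a * b for a, b in zip(x, y)) - sx * sy
    den = sqrt((n * sum(a * a for a in x) - sx ** 2) *
               (n * sum(b * b for b in y) - sy ** 2))
    return num / den

data = {  # hypothetical scores for five people
    "Income": [3, 5, 4, 6, 2],
    "Education": [12, 16, 13, 18, 10],
    "Attitude": [7, 4, 6, 3, 8],
}
names = list(data)

# Every cell is the correlation between a row variable and a column variable
matrix = {a: {b: round(pearson(data[a], data[b]), 3) for b in names}
          for a in names}
```

Just as in Table 5.2, the diagonal is all 1.00s (every variable correlates perfectly with itself), and the matrix is a mirror image of itself across that diagonal.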
Understanding What the Correlation Coefficient Means
Well, we have this numerical index of the relationship between
two variables, and we know that the higher the value of the
correlation (regardless of its sign), the stronger the relationship
is. But how can we interpret it and make it a more meaningful
indicator of a relationship?
Here are different ways to look at the interpretation of that
simple rxy.
Using-Your-Thumb (or Eyeball) Method
Perhaps the easiest (but not the most informative) way to
interpret the value of a correlation coefficient is by eyeballing it
and using the information in Table 5.3. This is based on
customary interpretations of the size of a correlation in the
behavioral sciences.
So, if the correlation between two variables is .3, you could
safely conclude that the relationship is a moderate one—not
strong but certainly not weak enough to say that the variables in
question don’t share anything in common.
Table 5.3 ⬢ Interpreting a Correlation Coefficient

Size of the Correlation Coefficient   General Interpretation
.5 to 1.0                             Strong relationship
.4                                    Moderate to strong relationship
.3                                    Moderate relationship
.2                                    Weak to moderate relationship
0 to .1                               Weak or no relationship
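The rule of thumb in Table 5.3 is easy to express in code. This little Python function (our own sketch, not an SPSS feature) works from the absolute value, so the sign of the correlation does not matter:

```python
def interpret(r):
    """Rough eyeball interpretation of a correlation coefficient (Table 5.3)."""
    size = abs(r)  # strength depends on the absolute value, not the sign
    if size >= .5:
        return "Strong relationship"
    if size >= .4:
        return "Moderate to strong relationship"
    if size >= .3:
        return "Moderate relationship"
    if size >= .2:
        return "Weak to moderate relationship"
    return "Weak or no relationship"

print(interpret(-.70))  # Strong relationship
print(interpret(.30))   # Moderate relationship
```

Note that −.70 comes out "strong" while +.30 is only "moderate"—exactly the point made earlier about not judging a correlation by its sign.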
This eyeball method is perfectly acceptable for a quick
assessment of the strength of the relationship between variables,
such as when you briefly evaluate data presented visually. But
because this rule of thumb depends on a subjective judgment (of
what’s “strong” or “weak”), we would like a more precise
method. That’s what we’ll look at now.
Special Effects! Correlation Coefficient
Throughout the book, we will learn about various effect sizes
and how to interpret them. An effect size is an index of the
strength of the relationship among variables, and with most
statistical procedures we learn about, there will be an associated
effect size that should be reported and interpreted. The
correlation coefficient is a perfect example of an effect size as
it quite literally is a measure of the strength of a relationship.
Thanks to Table 5.3, we already know how to interpret it.
SQUARING THE CORRELATION COEFFICIENT: A
DETERMINED EFFORT
Here’s the much more precise way to interpret the correlation
coefficient: computing the coefficient of determination.
The coefficient of determination is the percentage of variance in
one variable that is accounted for by the variance in the other
variable. Quite a mouthful, huh?
Earlier in this chapter, we pointed out how variables that share
something in common tend to be correlated with one another. If
we correlated math and language arts grades for 100 fifth-grade
students, we would find the correlation to be moderately strong,
because many of the reasons why children do well (or poorly) in
math tend to be the same reasons why they do well (or poorly)
in language arts. The number of hours they study, how bright
they are, how interested their parents are in their schoolwork,
the number of books they have at home, and more are all related
to both math and language arts performance and account for
differences between children (and that’s where the variability
comes in).
The more these two variables share in common, the more they
will be related. These two variables share variability—or the
reason why children differ from one another. And on the whole,
the brighter child who studies more will do better.
To determine exactly how much of the variance in one variable
can be accounted for by the variance in another variable, the
coefficient of determination is computed by squaring the
correlation coefficient.
For example, if the correlation between GPA and number of
hours of study time is .70 (or rGPA·time = .70), then the
coefficient of determination, represented by r²GPA·time, is
.70², or .49. This means that
49% of the variance in GPA “can be explained by” or “is shared
by” the variance in studying time. And the stronger the
correlation, the more variance can be explained (which only
makes good sense). The more two variables share in common
(such as good study habits, knowledge of what’s expected in
class, and lack of fatigue), the more information about
performance on one score can be explained by the other score.
However, if 49% of the variance can be explained, this means
that 51% cannot—so even for a very strong correlation of .70,
many of the reasons why scores on these variables tend to be
different from one another go unexplained. This amount of
unexplained variance is called the coefficient of alienation (also
called the coefficient of nondetermination). Don’t worry. No
aliens here. This isn’t X-Files or Walking Dead stuff—it’s just
the amount of variance in Y not explained by X (and, of course,
vice versa since the relationship goes both ways).
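The arithmetic behind these two coefficients is simple enough to check for yourself. Here is a minimal sketch in Python (the function names are ours, chosen just for this illustration):

```python
# Coefficient of determination: squaring the correlation coefficient gives
# the proportion of variance in one variable accounted for by the other.
def coefficient_of_determination(r):
    return r ** 2

# Coefficient of alienation (nondetermination): the proportion of variance
# left unexplained.
def coefficient_of_alienation(r):
    return 1 - r ** 2

# The GPA/study-time example from the text: r = .70.
r = 0.70
print(round(coefficient_of_determination(r), 2))  # 0.49 -> 49% shared
print(round(coefficient_of_alienation(r), 2))     # 0.51 -> 51% unexplained
```

Note that the two always sum to 1: whatever variance the correlation does not explain, the coefficient of alienation accounts for.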
How about a visual presentation of this sharing variance idea?
Okay. In Figure 5.5, you’ll find a correlation coefficient, the
corresponding coefficient of determination, and a diagram that
represents how much variance is shared between the two
variables. The larger the shaded area in each diagram (and the
more variance the two variables share), the more highly the
variables are correlated.
· The first diagram in Figure 5.5 shows two circles that do not
touch. They don’t touch because they do not share anything in
common. The correlation is zero.
· The second diagram shows two circles that overlap. With a
correlation of .5 (and r²xy = .25), they share about 25% of the
variance between them.
· Finally, the third diagram shows two circles placed almost on
top of each other. With an almost perfect correlation of rxy =
.90 (r²xy = .81), they share about 81% of the variance between
them.
Figure 5.5 ⬢ How variables share variance and the resulting correlation
As More Ice Cream Is Eaten … the Crime Rate Goes Up (or Association vs. Causality)
Now here’s the really important thing to be careful about when
computing, reading about, or interpreting correlation
coefficients.
Imagine this. In a small midwestern town, a phenomenon
occurred that defied any logic. The local police chief observed
that as ice cream consumption increased, crime rates tended to
increase as well. Quite simply, if you measured both, you would
find the relationship was direct, meaning that as people eat
more ice cream, the crime rate increases. And as you might
expect, as they eat less ice cream, the crime rate goes down.
The police chief was baffled until he recalled the Stats 1 class
he took in college and still fondly remembered. (He probably
also pulled out his copy of this book that he still owned. In fact,
it was likely one of three copies he had purchased to make sure
he always had one handy.)
He wondered how this could be turned into an aha! “Very
easily,” he thought. The two variables must share something or
have something in common with one another. Remember that it
must be something that relates to both level of ice cream
consumption and level of crime rate. Can you guess what that
is?
The outside temperature is what they both have in common.
When it gets warm outside, such as in the summertime, more
crimes are committed (it stays light longer, people leave the
windows open, bad guys and girls are out more, etc.). And
because it is warmer, people enjoy the ancient treat and art of
eating ice cream. Conversely, during the long and dark winter
months, less ice cream is consumed and fewer crimes are
committed as well.
Joe, though, recently elected as a city commissioner, learns
about these findings and has a great idea, or at least one that he
thinks his constituents will love. (Keep in mind, he skipped the
statistics offering in college.) Why not just limit the
consumption of ice cream in the summer months to reduce the
crime rate? Sounds good, right? Well, on closer inspection, it
really makes no sense at all.
That’s because of the simple principle that correlations express
the association that exists between two or more variables; they
have nothing to do with causality. In other words, just because
level of ice cream consumption and crime rate increase together
(and decrease together as well) does not mean that a change in
one results in a change in the other.
For example, if we took all the ice cream out of all the stores in
town and no more was available, do you think the crime rate
would decrease? Of course not, and it’s preposterous to think
so. But strangely enough, that’s often how associations are
interpreted—as being causal in nature—and complex issues in
the social and behavioral sciences are reduced to trivialities
because of this misunderstanding. Did long hair and hippiedom
have anything to do with the Vietnam conflict? Of course not.
Does the rise in the number of crimes committed have anything
to do with more efficient and safer cars? Of course not. But they
all happen at the same time, creating the illusion of being
associated.
People Who Loved Statistics
Katharine Coman (1857–1915) was such a kind and caring
researcher that a famous book of poetry and prose was written
about her after her death from cancer at the age of 57. Her love
for statistics was demonstrated in her belief that the study of
economics could solve social problems; she urged her college,
Wellesley, to let her teach economics and statistics. She may
have been the first woman statistics professor. Coman was a
prominent social activist in her life and in her writings, and she
frequently cited industrial and economic statistics to support her
positions, especially as they related to the labor movement and
the role of African American workers. The artistic biography
written about Professor Coman was Yellow Clover (1922), a
tribute to her by her longtime companion (and coauthor of the
song “America the Beautiful”), Katherine Lee Bates.
Using SPSS to Compute a Correlation Coefficient
Let’s use SPSS to compute a correlation coefficient. The data
set we are using is an SPSS data file named Chapter 5 Data Set
1.
There are two variables in this data set:

Variable     Definition
Income       Annual income in dollars
Education    Level of education measured in years
To compute the Pearson correlation coefficient, follow these
steps:
1. Open the file named Chapter 5 Data Set 1.
2. Click Analyze → Correlate → Bivariate, and you will see the
Bivariate Correlations dialog box, as shown in Figure 5.6.
3. Double-click on the variable named Income to move it to the
Variables: box.
4. Double-click on the variable named Education to move it to
the Variables: box. You can also hold down the Ctrl key to
select more than one variable at a time and then use the “move”
arrow in the center of the dialog box to move them both.
5. Click OK.
Figure 5.6 ⬢ The Bivariate Correlations dialog box
Understanding the SPSS Output
The output in Figure 5.7 shows the correlation coefficient to be
equal to .574. Also shown are the sample size, 20, and a
measure of the statistical significance of the correlation
coefficient (we’ll cover the topic of statistical significance
in Chapter 9).
Figure 5.7 ⬢ SPSS output for the computation of the correlation
coefficient
The SPSS output shows that the two variables are related to one
another and that as level of income increases, so does level of
education. Similarly, as level of income decreases, so does level
of education. The fact that the correlation is significant means
that this relationship is not due to chance.
As for the meaningfulness of the relationship, the coefficient of
determination is .574², which equals .329, or about .33, meaning
that 33% of the variance in one variable is accounted for by the other.
According to our eyeball strategy, this is a relatively weak
relationship. Once again, remember that low levels of income
do not cause low levels of education, nor does not finishing
high school mean that someone is destined to a life of low
income. That’s causality, not association, and correlations speak
only to association.
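SPSS does the computation for you, but the Pearson formula is easy to sketch in code. The raw income and education values from Chapter 5 Data Set 1 aren't printed in the text, so the pairs below are invented purely to exercise the function; only the raw-score formula itself is standard:

```python
import math

def pearson_r(x, y):
    """Pearson product-moment correlation via the raw-score formula."""
    n = len(x)
    sum_x, sum_y = sum(x), sum(y)
    sum_xy = sum(a * b for a, b in zip(x, y))
    sum_x2 = sum(a * a for a in x)
    sum_y2 = sum(b * b for b in y)
    num = n * sum_xy - sum_x * sum_y
    den = math.sqrt((n * sum_x2 - sum_x ** 2) * (n * sum_y2 - sum_y ** 2))
    return num / den

# Invented (income, education) pairs -- NOT the real Data Set 1 values.
income = [30_000, 42_000, 55_000, 61_000, 75_000]
education = [10, 12, 14, 16, 18]
r = pearson_r(income, education)
print(round(r, 3))  # a strong positive correlation, close to 1
```

Because correlation speaks only to association, swapping the two lists gives exactly the same r; neither variable is the "cause" in the formula.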
Creating a Scatterplot (or Scattergram or Whatever)
You can draw a scatterplot by hand, but it’s good to know how
to have SPSS do it for you as well. Let’s take the same data that
we just used to produce the correlation matrix in Figure 5.7 and
use it to create a scatterplot. Be sure that the data set
named Chapter 5 Data Set 1 is on your screen.
1. Click Graphs → Chart Builder → Scatter/Dot, and you will
see the Chart Builder dialog box shown in Figure 5.8.
2. Double-click on the first Scatter/Dot example.
3. Highlight and drag the variable named Income to the y-axis.
4. Highlight and drag the variable named Education to the x-
axis.
5. Click OK, and you’ll have a very nice, simple, and easy-to-
understand scatterplot like the one you see in Figure 5.9.
Figure 5.8 ⬢ The Chart Builder dialog box
OTHER COOL CORRELATIONS
There are different ways in which variables can be assessed. For
example, nominal-level variables are categorical in nature;
examples are race (e.g., black or white) and political affiliation
(e.g., Independent or Republican). Or, if you are measuring
income and age, you are measuring interval-level variables,
because the underlying continuum on which they are based has
equal-appearing intervals. As you continue your studies,
you’re likely to come across correlations between data that
occur at different levels of measurement. And to compute these
correlations, you need some specialized techniques. Table
5.4 summarizes what these different techniques are and how
they differ from one another.
Table 5.4 ⬢ Correlation Coefficient Shopping, Anyone? (Level of Measurement and Examples)

Variable X: Nominal (voting preference, such as Republican or Democrat)
Variable Y: Nominal (biological sex, such as male or female)
Type of Correlation: Phi coefficient
Correlation Being Computed: The correlation between voting preference and sex

Variable X: Nominal (social class, such as high, medium, or low)
Variable Y: Ordinal (rank in high school graduating class)
Type of Correlation: Rank biserial coefficient
Correlation Being Computed: The correlation between social class and rank in high school

Variable X: Nominal (family configuration, such as two-parent or single-parent)
Variable Y: Interval (grade point average)
Type of Correlation: Point biserial
Correlation Being Computed: The correlation between family configuration and grade point average

Variable X: Ordinal (height converted to rank)
Variable Y: Ordinal (weight converted to rank)
Type of Correlation: Spearman rank coefficient
Correlation Being Computed: The correlation between height and weight

Variable X: Interval (number of problems solved)
Variable Y: Interval (age in years)
Type of Correlation: Pearson correlation coefficient
Correlation Being Computed: The correlation between number of problems solved and age in years
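One entry in Table 5.4 is easy to demonstrate in code: when there are no tied scores, the Spearman rank coefficient is just the Pearson correlation computed on ranks instead of raw scores. The height and weight values below are invented for illustration:

```python
import math

def ranks(values):
    """Convert raw scores to ranks (1 = smallest); assumes no ties."""
    order = sorted(values)
    return [order.index(v) + 1 for v in values]

def pearson_r(x, y):
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    num = sum((a - mx) * (b - my) for a, b in zip(x, y))
    den = math.sqrt(sum((a - mx) ** 2 for a in x) *
                    sum((b - my) ** 2 for b in y))
    return num / den

def spearman_rho(x, y):
    # Spearman = Pearson on the ranks (exact only when there are no ties).
    return pearson_r(ranks(x), ranks(y))

# Invented heights (inches) and weights (pounds) for five people.
heights = [61, 64, 67, 70, 73]
weights = [120, 135, 150, 180, 200]
print(spearman_rho(heights, weights))  # 1.0 -- the rank orders agree perfectly
```

Because Spearman works on ranks, any monotonic relationship (even a curved one) earns a perfect coefficient, which is exactly why it suits ordinal data.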
PARTING WAYS: A BIT ABOUT PARTIAL CORRELATION
Okay, now you have the basics about simple correlation, but
there are many other correlational techniques that are
specialized tools to use when exploring relationships between
variables.
A common “extra” tool is called partial correlation, where the
relationship between two variables is explored, but the impact
of a third variable is removed from the relationship between the
two. Sometimes that third variable is called a mediating or
a confounding variable.
For example, let’s say that we are exploring the relationship
between level of depression and incidence of chronic disease
and we find that, on the whole, the relationship is positive. In
other words, the more chronic disease is evident, the higher the
likelihood that depression is present as well (and of course vice
versa). Now remember, the relationship might not be causal: one
variable might not “cause” the other, and the presence of one
does not mean that the other will be present as well. The
positive correlation is just an assessment of an association
between these two variables, the key idea being that they share
some variance in common.
And that’s exactly the point—it’s the other variables they share
in common that we want to control and, in some cases, remove
from the relationship so we can focus on the key relationship we
are interested in.
For example, how about level of family support? Nutritional
habits? Severity or length of illness? These and many more
variables can all explain the relationship between these two
variables, or they may at least account for some of the variance.
And think back a bit. That’s exactly the same argument we
made when focusing on the relationship between the
consumption of ice cream and the level of crime. Once outside
temperature (the mediating or confounding variable) is removed
from the equation … boom! The relationship between the
consumption of ice cream and the crime level plummets. Let’s
take a look.
Here are some data on the consumption of ice cream and the
crime rate for 10 cities.
                           Consumption of Ice Cream   Crime Rate
Consumption of ice cream             1.00                .743
Crime rate                                               1.00
So, the correlation between these two variables, consumption of
ice cream and crime rate, is .743. This is a pretty healthy
relationship, accounting for about 55% of the variance between
the two variables (.743² = .55, or 55%).
Now, we’ll add a third variable, average outside temperature.
Here are the Pearson correlation coefficients for the set of three
variables.
                              Consumption of   Crime   Average Outside
                              Ice Cream        Rate    Temperature
Consumption of ice cream         1.00          .743        .704
Crime rate                                     1.00        .655
Average outside temperature                                1.00
As you can see by these values, there’s a fairly strong
relationship between ice cream consumption and outside
temperature and between crime rate and outside temperature.
We’re interested in the question, “What’s the correlation
between ice cream consumption and crime rate with the effects
of outside temperature removed or partialed out?”
That’s what partial correlation does. It looks at the relationship
between two variables (in this case, consumption of ice cream
and crime rate) as it removes the influence of a third (in this
case, outside temperature).
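The text doesn't show the formula SPSS uses, but the standard first-order partial correlation can be computed directly from the three pairwise correlations. A sketch using the values from the matrix above:

```python
import math

def partial_r(r_xy, r_xz, r_yz):
    """First-order partial correlation of X and Y, controlling for Z."""
    num = r_xy - r_xz * r_yz
    den = math.sqrt((1 - r_xz ** 2) * (1 - r_yz ** 2))
    return num / den

# Pairwise correlations from the ice cream / crime / temperature matrix.
r_ice_crime = 0.743
r_ice_temp = 0.704
r_crime_temp = 0.655
print(round(partial_r(r_ice_crime, r_ice_temp, r_crime_temp), 3))  # 0.525
```

Notice that when the control variable is uncorrelated with both of the others, the formula reduces to the plain Pearson correlation: there is nothing to partial out.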
A third variable that explains the relationship between two
variables can be a mediating variable or a confounding variable.
Those are different types of variables with different definitions,
though, and are easy to confuse. In our example with
correlations, a confounding variable is something like
temperature that affects both our variables of interest and
explains the correlation between them. A mediating variable is a
variable that comes between our two variables of interest and
explains the apparent relationship. For example, if A is
correlated with B and B is correlated with C, A and C would
seem to be related but only because they are both related to B.
B is a mediating variable. Perhaps A affects B and B affects C,
so A and C are correlated.
LIGHTBOARD LECTURE VIDEO
Partial Correlations
When we talk about correlations between two variables, we
almost always think about it by drawing these two overlapping
circles. And we have variable A and variable B. And they
somehow are correlated. They overlap in some way. And they
share this information. Sometimes, though, there can be
another variable. Let's call it-- let's see, my computer brain
suggests C. That correlates with both of those. And when these
variables all correlate together, there's all these overlapping
areas. So A and B measure the same thing to some
degree. A and C measure this same thing to some degree.
And B and C measure this same thing to some degree.
Sometimes we want to know, though, what would the
relationship between A and B be if we controlled for C?
Would the correlation go up? Would it go down? And you
can even see visually, some of this correlation between A and
B is this part right here that is actually part of C as well. So
statistically, we call this correlation after controlling for
another a partial correlation. And I'm going to show you what
a partial correlation looks like. If we control for a variable, it
means we statistically remove it. It's as if it's not there. It's
as if everyone got the same score on it. And it creates now, between A and B, this
new relationship. And you can look-- literally, it is different.
It's a different type of correlation. It probably would go down,
mathematically. The other thing interesting-- when you
remove a relationship because of a partial correlation, is the
variables themselves are now a different shape because a little
bit of A used to be C. And a little bit of B used to be C. So
when you control for a third variable, you end up with a
different relationship and different variables. One way to
think about controlling for another variable is, instead of
thinking about variables, think about friends. We all have
different friends. Sometimes you're friends with someone
because you share some other friend. So imagine instead,
you've got three friends here-- 1, 2, and 3. You often see 1
and 2 together. But it's because they're with number 3.
They're with friend 3. What would happen if we remove 3, that
1 and 2 are never together unless 3's there? Let's remove all
the times where 3 is also there. Well, 1 and 2 spend a little bit
of time together. But they don't associate as closely. They're
not really close friends with each other. They just seem to be
friends because they both hang out with number 3. And
number 3 has the nice house and has all the fun parties. But if
it's just 1 and 2, they barely like each other.
Using SPSS to Compute Partial Correlations
Let’s use some data and SPSS to illustrate the computation of a
partial correlation. Here are the raw data.
City   Ice Cream Consumption   Crime Rate   Average Outside Temperature
 1             3.4                 62                  88
 2             5.4                 98                  89
 3             6.7                 76                  65
 4             2.3                 45                  44
 5             5.3                 94                  89
 6             4.4                 88                  62
 7             5.1                 90                  91
 8             2.1                 68                  33
 9             3.2                 76                  46
10             2.2                 35                  41
1. Enter the data we are using into SPSS.
2. Click Analyze → Correlate → Partial and you will see the
Partial Correlations dialog box, as shown in Figure 5.10.
3. Move Ice_Cream and Crime_Rate to the Variables: box by
dragging them or double-clicking on each one.
4. Move the variable named Outside_Temp to the Controlling
for: box.
5. Click OK and you will see the SPSS output as shown
in Figure 5.11.
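If you want to check SPSS's arithmetic with a script, the same analysis can be sketched in a few lines of Python from the raw city data above: compute the three pairwise Pearson correlations, then partial out temperature:

```python
import math

def pearson_r(x, y):
    """Pearson correlation from deviation scores."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    num = sum((a - mx) * (b - my) for a, b in zip(x, y))
    den = math.sqrt(sum((a - mx) ** 2 for a in x) *
                    sum((b - my) ** 2 for b in y))
    return num / den

def partial_r(r_xy, r_xz, r_yz):
    """First-order partial correlation of X and Y, controlling for Z."""
    return (r_xy - r_xz * r_yz) / math.sqrt((1 - r_xz ** 2) * (1 - r_yz ** 2))

# The 10 cities' raw data from the table above.
ice_cream = [3.4, 5.4, 6.7, 2.3, 5.3, 4.4, 5.1, 2.1, 3.2, 2.2]
crime     = [62, 98, 76, 45, 94, 88, 90, 68, 76, 35]
temp      = [88, 89, 65, 44, 89, 62, 91, 33, 46, 41]

r_ic = pearson_r(ice_cream, crime)
r_it = pearson_r(ice_cream, temp)
r_ct = pearson_r(crime, temp)
print(round(r_ic, 3))                         # 0.743, as in the text
print(round(partial_r(r_ic, r_it, r_ct), 3))  # 0.525, matching Figure 5.11
```

The drop from .743 to .525 is the whole story of this section: once temperature's shared variance is removed, far less of the ice cream/crime relationship remains.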
Figure 5.10 ⬢ The Partial Correlations dialog box
Understanding the SPSS Output
As you can see in Figure 5.11, the correlation between ice
cream consumption (Ice_Cream) and crime rate (Crime_Rate)
with the influence or moderation of outside temperature
(Outside_Temp) removed is .525. This is less than the simple
Pearson correlation between ice cream consumption and crime
rate (which is .743), which does not consider the influence of
outside temperature. What seemed to explain 55% of the
variance (and was what we call “significant at the .05 level”),
with the removal of Outside_Temp as a moderating variable,
now explains .525² = .28, or 28%, of the variance (and the
relationship is no longer significant).
Figure 5.11 ⬢ The completed partial correlation analysis
Our conclusion? Outside temperature accounted for enough of
the shared variance between the consumption of ice cream and
the crime rate for us to conclude that the two-variable
relationship was significant. But, with the removal of the
moderating or confounding variable outside temperature, the
relationship was no longer significant. And we don’t need to
stop selling ice cream to try to reduce crime.
Real-World Stats
This is a fun one and consistent with the increasing interest in
using statistics in various sports in various ways, a discipline
informally named sabermetrics. The term was coined by Bill
James (and his approach is represented in the movie and
book Moneyball).
Stephen Hall and his colleagues examined the link between
teams’ payrolls and the competitiveness of those teams (for both
professional baseball and soccer), and he was one of the first to
look at this from an empirical perspective. In other words, until
these data were published, most people made decisions based on
anecdotal evidence rather than quantitative assessments. Hall
looked at data on team payrolls in American Major League
Baseball and English soccer between 1980 and 2000, and he
used a model that allows for the establishment of causality (and
not just association) by looking at the time sequence of events
to examine the link.
In baseball, payroll and performance both increased
significantly in the 1990s, but there was no evidence that
causality runs in the direction from payroll to performance. In
comparison, for English soccer, the researchers did show that
higher payrolls actually were at least one cause of better
performance. Pretty cool, isn’t it, how association can be
explored to make real-world decisions?
Want to know more? Go online or to the library and find …
Hall, S., Szymanski, S., & Zimbalist, A. S. (2002). Testing
causality between team performance and payroll: The cases of
Major League Baseball and English soccer. Journal of Sports
Economics, 3, 149–168.
Summary
The idea of showing how things are related to one another and
what they have in common is a very powerful one, and the
correlation coefficient is a very useful descriptive statistic (one
used in inference as well, as we will show you later). Keep in
mind that correlations express a relationship that is associative
but not necessarily causal, and you’ll be able to understand how
this statistic gives us valuable information about relationships
between variables and how variables change or remain the same
in concert with others. Now it’s time to change speeds just a bit
and wrap up Part II with a focus on reliability and validity. You
need to know about these ideas because you’ll be learning how
to determine what differences in outcomes, such as scores and
other variables, represent.
Time to Practice
1. Use these data to answer Questions 1a and 1b. These data are
saved as Chapter 5 Data Set 2.
a. Compute the Pearson product-moment correlation coefficient
by hand and show all your work.
b. Construct a scatterplot for these 10 pairs of values by hand.
Based on the scatterplot, would you predict the correlation to be
direct or indirect? Why?
Number Correct (out of a possible 20)   Attitude (out of a possible 100)
17                                      94
13                                      73
12                                      59
15                                      80
16                                      93
14                                      85
16                                      66
16                                      79
18                                      77
19                                      91
2. Use these data to answer Questions 2a and 2b. These data are
saved as Chapter 5 Data Set 3.
Speed (to complete a 50-yard swim)   Strength (number of pounds bench-pressed)
21.6                                 135
23.4                                 213
26.5                                 243
25.5                                 167
20.8                                 120
19.5                                 134
20.9                                 209
18.7                                 176
29.8                                 156
28.7                                 177
a. Using either a calculator or a computer, compute the Pearson
correlation coefficient.
b. Interpret these data using the general range of very weak to
very strong. Also compute the coefficient of determination.
How does the subjective analysis compare with the value of r 2?
3. Rank the following correlation coefficients on strength of
their relationship (list the weakest first).
a. .71
b. +.36
c. −.45
d. .47
e. −.62
4. For the following set of scores, calculate the Pearson
correlation coefficient and interpret the outcome. These data are
saved as Chapter 5 Data Set 4.
Achievement Increase Over 12 Months   Classroom Budget Increase Over 12 Months
0.07                                  0.11
0.03                                  0.14
0.05                                  0.13
0.07                                  0.26
0.02                                  0.08
0.01                                  0.03
0.05                                  0.06
0.04                                  0.12
0.04                                  0.11
5. For the following set of data, by hand, correlate minutes of
exercise with grade point average (GPA). What do you conclude
given your analysis? These data are saved as Chapter 5 Data Set
5.
Exercise   GPA
25         3.6
30         4.0
20         3.8
60         3.0
45         3.7
90         3.9
60         3.5
 0         2.8
15         3.0
10         2.5
6. Use SPSS to determine the correlation between hours of
studying and GPA for these honor students. Why is the
correlation so low?
Hours of Studying   GPA
23                  3.95
12                  3.90
15                  4.00
14                  3.76
16                  3.97
21                  3.89
14                  3.66
11                  3.91
18                  3.80
 9                  3.89
Time to Practice Video
Chapter 5: Problem 6
Chapter 5, Problem 6 asks you to compute a correlation. They
want to assess how honor students' GPAs are correlated with
their hours of studying, and then to answer the question of
why that correlation is so low. So you see listed here are
individual students' GPAs and the average number of hours
that they study. We want to set up our SPSS data file with this
information. When we look here, you'll see the two variables.
And under Variable View, make sure that they're both set up
with scale, since they're both measured on a continuous range.
A GPA, for example, can take any value in that range. To do
this, we're going to go under Analyze and then Correlate →
Bivariate. We have two variables, so this would be a bivariate
correlation. This is straightforward here. We're going to take
both of them, move it to our variables. You notice down here
it's saying we have the Pearson, which is for bivariate
continuous data. We're going to look for a two-tailed
significance test: how are they related to each other? We are
asking it to flag our significant correlations. You always want
to look under Options, but it defaults to what we want. We
could say we want to show the means and standard deviations,
we don't need to do that for this one. So instead, let's hit OK.
And here is our information. When we look at our
significance, we need it to be lower than .05. And so, the
question is, why is this correlation so low? Well, if we
look back at our data itself, it's going to give us an
understanding. These are really high GPAs. When there's
very little variability, you're typically not going to get a strong
correlation because there's not much change. Even though
there seems to be a range of how they studied, in terms of
hours, it didn't have a big effect on their GPA because they're
all really high. And this is how you answer Chapter 5, Problem
6.
7. The coefficient of determination between two variables is
.64. Answer the following questions:
a. What is the Pearson correlation coefficient?
b. How strong is the relationship?
c. How much of the variance in the relationship between these
two variables is unaccounted for?
8. Here is a set of three variables for each of 20 participants in
a study on recovery from a head injury. Create a simple matrix
that shows the correlations between each variable. You can do
this by hand (and plan on being here for a while) or use SPSS or
any other application. These data are saved as Chapter 5 Data
Set 6.
Age at Injury   Level of Treatment   12-Month Treatment Score
25              1                    78
16              2                    66
 8              2                    78
23              1                    92
31              2                    97
53              2                    69
11              3                    79
33              2                    69
9. Look at Table 5.4. What type of correlation coefficient would
you use to examine the relationship between biological sex
(defined in this study as having only two categories: male or
female) and political affiliation? How about family
configuration (two-parent or single-parent) and high school
GPA? Explain why you selected the answers you did.
10. When two variables are correlated (such as strength and
running speed), they are associated with one another. Explain
how, even if there is a correlation between the two, one might
not cause the other.
11. Provide three examples of an association between two
variables where a causal relationship makes perfect sense
conceptually.
12. Why can’t correlations be used as a tool to prove a causal
relationship between variables rather than just an association?
13. When would you use partial correlation?
Student Study Site
Get the tools you need to sharpen your study skills!
Visit edge.sagepub.com/salkindfrey7e to access practice
quizzes, eFlashcards, original and curated videos, data sets, and
more!
6 AN INTRODUCTION TO UNDERSTANDING RELIABILITY AND VALIDITY: JUST THE TRUTH
6: MEDIA LIBRARY
Premium Videos
Lightboard Lecture Video
· Reliability
· Validity
Time to Practice Video
· Chapter 6: Problem 5
Difficulty Scale
(not so hard)
WHAT YOU WILL LEARN IN THIS CHAPTER
· Defining reliability and validity and understanding why they are important
· This is a stats class! What’s up with this measurement stuff?
· Understanding how the quality of tests is evaluated
· Computing and interpreting various types of reliability coefficients
· Computing and interpreting various types of validity coefficients
AN INTRODUCTION TO RELIABILITY AND VALIDITY
Ask any parent, teacher, pediatrician, or almost anyone in your
neighborhood what the five top concerns are about today’s
children, and there is sure to be a group who identifies obesity
as one of those concerns. Sandy Slater and her colleagues
developed and tested the reliability and validity of a self-
reported questionnaire on home, school, and neighborhood
physical activity environments for youth located in low-income
urban minority neighborhoods and rural areas. In particular, the
researchers looked at such variables as information on the
presence of electronic and play equipment in youth participants’
bedrooms and homes and outdoor play equipment at schools.
They also looked at what people close to the children thought
about being active. A total of 205 parent–child pairs completed
a 160-item take-home survey on two different occasions, a
perfect model for establishing test–retest reliability. The
researchers found that the measure had good reliability and
validity. The researchers hoped that this survey could be used to
help identify opportunities and develop strategies to encourage
underserved youth to be more physically active.
Want to know more? Go online or to the library and find …
Slater, S., Full, K., Fitzgibbon, M., & Uskali, A. (2015, June 4).
Test–retest reliability and validity results of the Youth Physical
Activity Supports Questionnaire. SAGE Open, 5(2).
doi:10.1177/2158244015586809
What’s Up With This Measurement Stuff?
An excellent question and one that you should be asking. After
all, you enrolled in a stats class, and up to now, that’s been the
focus of the material that has been covered. Now it looks like
you’re faced with a topic that belongs in a tests and
measurements class. So, what’s this material doing in a stats
book?
Well, much of what we have covered so far in Statistics for
People Who (Think They) Hate Statistics has to do with the
collection and description of data. Now we are about to begin
the journey toward analyzing and interpreting data. But before
we begin learning those skills, we want to make sure that the
data are what you think they are—that the data represent what it
is you want to know about. In other words, if you’re studying
poverty, you want to make sure that the measure you use to
assess poverty works and that it works time after time. Or, if
you are studying aggression in middle-aged males, you want to
make sure that whatever tool you use to assess aggression works
and that it works time after time.
More really good news: Should you continue in your education
and want to take a class on tests and measurements, this
introductory chapter will give you a real jump on understanding
the scope of the area and what topics you’ll be studying.
And to make sure that the entire process of collecting data and
making sense out of them works, you first have to make sure
that what you use to collect data works as well. The
fundamental questions that are answered in this chapter are
“How do I know that the test, scale, instrument, and so on that I use
produces scores that aren’t random but actually represent an
individual’s typical performance?” (that’s reliability) and “How
do I know that the test, scale, instrument, and so on that I use
measures what it is supposed to?” (that’s validity).
Anyone who does research will tell you about the importance of
establishing the reliability and validity of your measurement
tool, whether it’s a simple observational instrument of consumer
behavior or one that measures a complex psychological
construct such as depression. However, there’s another very
good reason. If the tools that you use to collect data are
unreliable or invalid, then the results of any test or any
hypothesis, and the conclusions you may reach based on those
results, are necessarily inconclusive. If you are not sure that the
test does what it is supposed to and that it does so consistently
without randomness in its scores, how do you know that the
nonsignificant results you got aren’t a function of the lousy test
tools rather than an actual reflection of reality? Want a clean
test of your hypothesis? Make reliability and validity an
important part of your research.
You may have noticed a new term at the beginning of this
chapter—dependent variable. In an experiment, this is the
outcome variable, or what the researcher looks at to see whether
any change has occurred as a function of the treatment that has
taken place. And guess what? The treatment has a name as
well—the independent variable. For example, if a researcher
examined the effect of different reading programs on
comprehension, the independent variable would be the reading
program, and the dependent or outcome variable would be
reading comprehension score. The term dependent variable is
used for the outcome variable because the hypothesis suggests
that it depends on, or is affected by, the independent variable.
Although these terms will not be used often throughout the
remainder of this book, you should have some familiarity with
them.

RELIABILITY: DOING IT AGAIN UNTIL YOU GET IT RIGHT
Reliability is pretty easy to understand and figure out. It’s
simply whether a test, or whatever you use as a measurement
tool, measures something consistently. If you administer a test
of personality type before a special treatment occurs, will the
administration of that same test 4 months later be reliable?
That, my friend, is one type of reliability—the degree to which
scores are consistent for one person measured twice. There are
other types of reliability, each of which we will get to after we
define reliability just a bit more.
Test Scores: Truth or Dare?
When you take a test in this class, you get a score, such as 89
(good for you) or 65 (back to the books!). That test score
consists of several elements, including the observed score (or
what you actually get on the test, such as 89 or 65) and a true
score (the typical score you would get if you took the same test
an infinite number of times). We can’t directly measure true
score (because we don’t have the time or energy to give
someone the same test an infinite number of times), but we can
estimate it.
Why aren’t true scores and observed scores the same? Well,
they can be if the test (and the accompanying observed score) is
a perfect (and we mean absolutely perfect) reflection of what’s
being measured.
But the bread sometimes falls on the buttered side, and
Murphy’s law tells us that the world is not perfect. So, what you
see as an observed score may come close to the true score, but
rarely are they the same. Rather, the difference you see is the
amount of error that is introduced.
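The relationship just described is often written as observed score = true score + error. A quick simulation (with an invented true score and an assumed error spread) shows why any one administration rarely hits the true score, while the average over a great many administrations homes in on it:

```python
import random

random.seed(42)  # reproducible randomness for the sketch

TRUE_SCORE = 85.0  # hypothetical true score for one test taker
ERROR_SD = 5.0     # assumed spread of the random error

def administer():
    """One administration: observed score = true score + random error."""
    return TRUE_SCORE + random.gauss(0, ERROR_SD)

one_observed = administer()                      # rarely equals 85 exactly
scores = [administer() for _ in range(100_000)]  # stand-in for "infinite" administrations
estimate = sum(scores) / len(scores)             # estimate of the true score

print(round(one_observed, 1))
print(round(estimate, 1))  # converges toward TRUE_SCORE as administrations grow
```

The averaging step is exactly the thought experiment in the text: we can never give the test an infinite number of times, but enough repetitions make the mean a good estimate of the true score.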
Notice that reliability is not the same as validity; it does not
reflect whether you are measuring what you want to. Here’s
why. True score has nothing to do with whether the construct of
interest is really being reflected. Rather, true score is
the mean score an individual would get if he or she took a test
an infinite number of times, and it represents the theoretical
typical level of performance on a given test. Now, one would
hope that the typical level of performance would reflect the
construct of interest, but that’s another question (a question of
validity). The distinction here is that a test is reliable if it
consistently produces whatever score a person would get on
average, regardless of what the test is measuring. In fact, a
perfectly reliable test might not produce a score that has
anything to do with the construct of interest, such as “what you
really know.”

LIGHTBOARD LECTURE VIDEO: Reliability
So one of the qualities of a good test is that it's reliable. And
a lot of times when you learn stuff, you hear about validity
and reliability. And you think it's all the same thing. Well,
validity is whether the score matches the thing that's supposed
to be measured. We're not talking about that. Reliability is
different. Reliability refers to whether the score you just got
on a test is the typical score you would have gotten on that
test, not whether it matches the level of the trait you're trying
to measure. That's validity. But if you took the test again
tomorrow, would you get the same score you got today? If
you'd had a better breakfast, would your score be any
different? That's what reliability is all about. And we can
think about reliability this way: the purpose of a test is for you
to get the score that represents your typical level of
performance. If you took a test an infinite number of times
and you averaged all those scores-- you alone-- that average is
the typical level of performance you would get. So every
time someone takes a test, when you take a test, you get some
score on it. And if this middle bullseye is your typical level
of performance, the test score you would get if you took the
test an infinite number of times and averaged it, that's the score
you typically get. And if you get that score, then it's a reliable
test. And if that's true for everybody, it's a reliable test. But
there's some randomness. There's always randomness. That's
the problem with social sciences. There's all this randomness
in human behavior. And so the score you get won't be your
exact typical level of performance. It will not be right here
on the bullseye. It might be out here somewhere. It might be
further out here. I mean, the further away it is from your