2. DEALING WITH DATA
CODING DATA systematically reorganizing raw data into
a format that is machine-readable (i.e. easy to analyze
using computers)
Coding can be a clerical task when the data are recorded
as numbers on well-organized recording sheets, but it is
very difficult when, for example, a researcher wants to
code answers to open-ended survey questions into
numbers in a process similar to latent content analysis.
3. DEALING WITH DATA
Researchers uses coding system and a codebook for
data coding:
Coding system set of rules stating that certain numbers are
assigned to variable attributes.
example: a researcher codes males as 1 and females as 2.
Codebook a document describing the coding system and
the location of data for variables in a format that computers
can use.
4. DEALING WITH DATA
Precoding placing the code categories on the
questionnaire.
If a researcher does not precode, his first step after
collecting data is to create a codebook. He also gives
each case an identification number to keep track of the
cases. Next he transfers the information from each
questionnaire into a format that computers can read.
5. ENTERING DATA
Most computer programs designed for data analysis
need the data in a grid format.
In the grid, each row represents a respondent, subject, or
case DATA RECORDS because each is a record of
data for a single case
Column or sets of columns represents specific variables.
6. CLEANING DATA
Accuracy is extremely important when coding data.
Errors made in coding can cause misleading results.
Highly recommended to recheck all coding.
Coding can be verified in two ways:
Possible code cleaning (wild code checking)
involves checking the categories of all variables for impossible
codes
Contingency cleaning (consistency checking)
involves cross-classifying two variables and looking for
logically impossible combinations.
7. COMPUTERS AND SOCIAL RESEARCH
Researchers use computers to perform specialized tasks
more efficiently and effectively.
Example: organize data, calculate statistics, write reports
8. COMPUTERS AND SOCIAL RESEARCH
Computer began as mechanical devices for sorting cards
that had holes punched into them.
Each hole were punched in specific locations and
represented information about a variable.
These machines organized vast amounts of information
more quickly, reliably, and efficiently than the paper-and-
pencil methods.
IBM CARDS thin cardboard cards that had 80
columns and 12 rows, or 960 spaces for information.
9. COMPUTERS AND SOCIAL RESEARCH
COMPUTER TERMINAL is a simple typewriter-like
device connected to a mainframe computer, which
people use to type data and instructions directly into the
computer.
MICROCOMPUTERS have replaced mainframes for
many tasks, to more people, and have stimulated new
uses form computers.
Three basic parts:
Monitor (CRT or VDT)
Keyboard
CPU (Central Processing Unit)
10. COMPUTERS AND SOCIAL RESEARCH
Information can get into a microcomputer in 4 ways:
Some is built into the computer memory itself
User can type it on the keyboard
It can come across a telephone line and into the computer
through a modem
It can be stored on floppy disks, or data travellers which
computers can read.
11. HOW COMPUTERS HELP THE RESEARCHER
Purpose of the computer: to write reports, to organize
large amounts of data, and to compute statistical
measures.
Computers help researchers perform tasks faster and
with greater accuracy than by hand.
Without the appropriate computer, a researcher cannot
analyze data from a large-scale research project or
calculate complicated statistics.
SPSS statistical package for the social sciences is a
statistical software most social researchers use that is
specifically designed for analyzing quantitative social
science data.
12. STATISTICS
a branch of applied mathematics used to manipulate and
summarize the features of numbers.
Descriptive statistics describe numerical data. They can
be categorized by the number of variables involved:
Univariate
Bivariate
Multivariate
13. RESULTS WITH ONE VARIABLE (UNIVARIATE)
Frequency Distributions
Measures of Central Tendency
Measures of Variation
14. RESULTS WITH ONE VARIABLE
Univariate describes one variable. The easiest way to
describe the numerical data of one variable is through
frequency distribution.
FREQUENCY DISTRIBUTION can be used with
nominal-, ordinal-, interval-, or ratio-level data and takes
many forms.
Example: A data having 400 respondents can be
summarized on the gender of the respondents with a raw
count or a percentage frequency distribution.
15. GENDER COUNT OF RESPONDENTS
RAW COUNT FREQUENCY PERCENTAGE FREQUENCY
DISTRIBUTION
Distribution
Gender Frequency Gender Frequency
Male 100 Male 25%
Female 300 Female 75%
Total 400 Total 100%
BAR CHART
Females
Column2
Column1
Males frequency distribution
0 20 40 60 80
16. RESULTS WITH ONE VARIABLE
Common types of graphic representations:
Histogram
Bar chart
Pie chart
For interval- or ratio-level data researcher groups the
information into categories; grouped categories should be
mutually exclusive
Interval- or ratio-level data are often plotted in a
frequency polygon.
17. RESULTS WITH ONE VARIABLE
MEASURES OF CENTRAL TENDENCY
Mean arithmetic average
most widely used measure of central tendency
can be used only with interval- or ratio-level data.
Median middle point
50th percentile, or the point at which half the cases
are above it and half below it
can be used with ordinal-, interval-, or ratio-level
data (not nominal-level)
Mode easiest to use and can be used with nominal, ordinal
, interval, or ratio data
most common or frequently occurring number
18. RESULTS WITH ONE VARIABLE
If the frequency distribution forms a NORMAL or BELL-
SHAPED CURVE, the three measures of central
tendency equal each other
number of cases
mean , median, mode
Lowest values of variables highest
19. RESULTS WITH ONE VARIABLE
SKEWED DISTRIBUTION measures of central
tendency are not equal
mode
median
mean
mode
median
mean
20. RESULTS WITH ONE VARIABLE
If most cases have lower scores with a few extreme high
scores, the mean will be the highest, the median in the
middle, and the mode the lowest.
If most cases have higher scores with a few extreme low
scores, the mean will be the lowest, the median in the
middle, and the mode the highest.
In general, the median is best for skewed
distributions, although the mean is used in most other
statistics.
21. RESULTS WITH ONE VARIABLE
MEASURES OF VARIATION one- number summary
of a distribution
Three basic ways:
Range
Percentile
Standard deviation
22. RESULTS WITH ONE VARIABLE
RANGE simplest
largest and smallest score
for ordinal-, interval-, and ratio-level
PERCENTILES tell the score at a specific place within
the distribution
for ordinal-, interval-, and ratio-level
STANDARD DEVIATION most difficult to compute
measure of dispersion; yet is also the most
comprehensive and widely used.
interval or ratio level
based on the mean and gives an “average
distance” between all scores and the mean
23. RESULTS WITH ONE VARIABLE
STEPS IN COMPUTING THE STANDARD DEVIATION
1. Compute the mean.
2. Subtract the mean from each score.
3. Square the resulting difference for each score.
4. Total up the squared differences to get the sum of
squares.
5. Divide the sum of squares by the number of cases to get
the variance.
6. Take the square root of the variance, which is the
standard deviation.
24. RESULTS WITH ONE VARIABLE
Standard deviation and the mean are used to create z-
scores.
Z-scores let a researcher compare two or more
distribution or groups.
also called a standardized score, expresses
points or scores on a frequency distribution in terms of a
number of standard deviations from the mean. Scores
are in terms of their relative position within a
distribution, not as absolute values.
25. EXAMPLE OF COMPUTING THE STANDARD DEVIATION
[8 RESPONDENTS, VARIABLE = YEARS OF SCHOOLING]
score score- mean squared (score-mean)
15 2.5 6.25
12 -0.5 .25
12 -0.5 .25
10 -2.5 6.25
16 3.5 12.25
18 5.5 30.25
8 -4.5 20.25
9 -3.5 12.25
Mean = 12.5 sum of squares = 88
Variance = sum of squares = 88 = 11
no. of cases 8
Standard deviation = square root of variance = √11 = 3.317 years
26. Symbols: Formula:
X = score of case Standard deviation
√ ∑(x-x )2
∑ = sigma for sum N
N = number of cases
X = mean
27. RESULTS WITH TWO VARIABLES (BIVARIATE)
The Scattergram
Cross-tabulation/ Percentaged Table
Measures of Association
28. RESULTS WITH TWO VARIABLES
Bavariate statistics let a researcher consider two
variables together and describe the relationship between
variables
Statistical relationships are based on two ideas:
Covariation things that are associated
Example: life expectancy and income
Independence no association or relationship between two
variables
Example: number of siblings and life expectancy
29. RESULTS WITH TWO VARIABLES
SCATTERGRAM a graph on which a researcher plots
each case or observation, where each axis represents
the value of one variable.
used for interval- or ratio-level data; rarely for ordinal;
never for nominal
independent variable = horizontal axis (x)
dependent variable = vertical axis (y)
30. RESULTS WITH TWO VARIABLES
Three aspects of a bivariate relationship in a scattergram:
Form
Independence random scatter with no pattern, or a straight line that
is parallel to one of the axis
Linear straight line that can be visualized in the middle of a maze of
cases running from one corner to another
Curvilinear centre of a maze of cases could form a U curve, right
side up or upside down, or an S curve
Direction
Positive and negative
Precision
Amount of spread in the points on the graph
High level occurs when the points hug the line that summarizes the
relationships
Low level occurs when the points are widely spread around the line
31. RESULTS WITH TWO VARIABLES
PERCENTAGED TABLES presents the same
information as a scattergram in a more condensed form
Cross-tabulation cases are organized in the table on
the basis of two variables at the same time.
Bavariate tables usually contain percentages.
32. RESULTS WITH TWO VARIABLES
CONSTRUCTING PERCENTAGE TABLES
1. Raw data
2. Compound frequency distribution (CFD)
Figure all possible combinations of variable categories
Make a mark next to the combination category into which each
case falls
Add up the marks for the number of cases in a combination
category
3. Set up the parts of a table (labeling rows and columns)
4. Each number from the CFD is placed in a cell in the
table that corresponds to the combination of variable
categories.
33. THE PARTS OF A TABLE
Title
Label row and column variable and give a name to each
of the variable categories.
Marginals totals of the columns and rows
Body of the table
Cell of a table
If there is missing information (cases in which a
respondent refused to answer, ended interview, said “I
don’t know”, etc.), report the number of missing cases
near the table to account for all original cases
For percentaged tables, provide the number of cases or
N on which percentages are computed in parentheses
near the total of 100%. This makes it possible to go back
and forth from a percentaged table to a raw count table
and vice versa.
34. RAW COUNT TABLE
AGE GROUP BY ATTITUDE ABOUT CHANGING THE DRINKING AGE
Age group
Attitude Under 30 30-45 46-60 61 and total
older
Agree 20 10 4 3 37
No opinion 3 (4) 10 10 2 25
Disagree 3 5 21 10 39
_____ _____ _____ _____ _____
Total 26 25 35 15 101
Missing =8
cases (6)
35. COLUMN PERCENTAGED TABLE
AGE GROUP BY ATTITUDE ABOUT CHANGING THE DRINKING AGE
Age group
Attitude Under 30 30-45 46-60 61 and Total
older
Agree 76.9 % 40% 11.14% 20% 36.6%
No opinion 11.5 % 40 % 28.6 % 13.3 % 24.8%
Disagree 11.5% 20% 60% 66.7% 38.6%
_____ ______ ______ ______ _____
Total 99.9% 100% 100% 100% 100%
(N) (26) (25) (35) (15) (101)
Missing =8
cases
36. ROW PERCENTAGED TABLE
AGE GROUP BY ATTITUDE ABOUT CHANGING THE DRINKING AGE
Age group
Attitude Under 30 30-45 46-60 61 and Total (N)
older
Agree 54.1% 27% 10.8% 8.1% 100 (37)
No 12% 40% 40% 8% 100 (25)
opinion
Disagree 7.7% 12.8% 53.8% 25.6% 99.9 (39)
_____ ______ _____ _____ _____
Total 25.7% 24.8% 34.7% 14.9% 100.1 (101)
Missing =8
cases
37. RESULTS WITH TWO VARIABLES
Three ways to percentage a table:
By row
By column
For the total
First two are most often used; last rare
Which is best?
Either can be appropriate.
38. RESULTS WITH TWO VARIABLES
MEASURES OF ASSOCIATION single number that
expresses the strength, and often the direction, of a
relationship; condenses information about a bivariate
relationship into a single number
Many measures are called by letters of the greek
alphabet (lambda, gamma, tau, chi (squared), and rho).
The emphasis is on interpreting the measures, not on
their calculation.
39. SUMMARY OF MEASURES OF ASSOCIATION
Measure Greek Type of data High Independence
symbol association
Lambda λ Nominal 1.0 0
Gamma γ Ordinal +1.0, -1.0 0
Tau τ Ordinal +1.0, -1.0 0
(Kendall’s)
Rho ρ Interval, ratio +1.0, -1.0 0
Chi-squared χ2 Nominal, Infinity 0
ordinal
40. MORE THAN TWO VARIABLES (MULTIVARIATE)
Trivariate percentaged tables
Multiple regression analysis
41. MORE THAN TWO VARIABLES
STATISTICAL CONTROL control variables
Researcher controls for alternative explanations in
multivariate analysis by introducing a third variable.
Example: Bavariate table showing that taller teens like
baseball more than shorter ones
May be spurious since male teens are taller than
females, and males tend to like baseball more than
females.
To test whether the relationship is due to sex, a
researcher must control for gender
42. MORE THAN TWO VARIABLES
TRIAVARIATE PERCENTAGED TABLES
Trivariate tables has a bivariate table of the independent
and dependent variable for each category of the control
variable. partials
Number of partials depends on the number of categories
in the control variable.
Trivariate tables have 3 limitations:
Difficult to interpret if a control variable has more than 4
categories.
Control variables can be at any level of measurement, but
interval or ratio control variables must be grouped, and how
cases are grouped can affect the interpretations of effects.
Total number of cases is a limiting factor because the cases
are divided among cells in partial.
43. MORE THAN TWO VARIABLES
P Number of cells in the partials
C number of cells in the bivariate relationship
N number of categories in the control variable
P=CxN
Example: Control variable having 3 categories, and a
bivariate table having 12 cells.
P= 12 x 3 = 36 cells
An average of 5 cases per cell is recommended, so the
researcher will need 5 x 6 = 180 cases at minimum
44. COMPOUND FREQUENCY DISTRIBUTION FOR TRIVARIATE TABLE
Males Females
Age Attitude No. of cases Age Attitude No. of cases
Under 30 Agree 10 Under 30 Agree 10
Under 30 No option 1 Under 30 No option 2
Under 30 Disagree 2 Under 30 Disagree 1
30-45 Agree 5 30-45 Agree 5
30-45 No option 5 30-45 No option 5
30-45 Disagree 2 30-45 Disagree 3
46-60 Agree 2 46-60 Agree 2
46-60 No option 5 46-60 No option 5
46-60 Disagree 11 46-60 Disagree 10
61 and older Agree 3 61 and older Agree 0
61 and older No option 0 61 and older No option 2
61 and older Disagree 5____ 61 and older Disagree 5______
subtotal 51 subtotal 50
Missing on either variable 4____ Missing on either variable 4______
No. of males 55 No. of females 54
45. PARTIAL TABLE FOR MALES
Attitude Under 30 30-45 46-60 61 and older Total
Agree 10 5 2 3 20
No option 1 5 5 0 11
Disagree 2 2 11 5 20
Total 13 12 18 8 51
Missing =4
cases
PARTIAL TABLE FOR FEMALES
Attitude Under 30 30-45 46-60 61 and older Total
Agree 10 5 2 0 17
No option 2 5 5 2 14
Disagree 1 3 10 5 19
Total 13 13 17 7
Missing =4
cases
46. MORE THAN TWO VARIABLES
Elaboration paradigm system for reading percentaged
trivariate tables
describes the pattern that emerges when a control
variable is introduced
Four terms describe how the partial tables compare to
the initial bivariate table, or how the original bivariate
relationship changes after the control variable is
considered.
47. SUMMARY OF THE ELABORATION PARADIGM
Pattern name Pattern seen when comparing partials to the
original bivariate table
Replication Same relationship in both partials as in bivariate table
Specification Bivariate relationship seen in one of the partial tables
Interpretation Bivariate relationship weakens greatly or disappears in
the partial tables (control variable intervening)
Explanation Bivariate relationship weakens greatly or disappears in
the partial tables (control variable before independent)
Suppressor variable No bivariate relationship, relationship appears in partial
tables only
48. EXAMPLES OF ELABORATION PATTERNS
REPLICATION
Bivariate table Partials
Control = low Control = high
Low High Low High Low High
Low 85% 15% Low 84% 16% 86% 14%
High 15% 85% High 16$ 84% 14% 86%
INTERPRETATION OR EXPLANATION
Bivariate table Partials
Control = low Control = high
Low High Low High Low High
Low 85% 15% Low 45% 55% 55% 45%
High 15% 85% High 55% 45% 45% 55%
49. EXAMPLES OF ELABORATION PATTERNS
SPECIFICATION
Bivariate table Partials
Control = low Control = high
Low High Low High Low High
Low 85% 15% Low 95% 5% 50% 50%
High 15% 85% High 5% 95% 50% 50%
SUPPRESSOR VARIABLE
Bivariate table Partials
Control = low Control = high
Low High Low High Low High
Low 54% 46% Low 84% 16% 14% 86%
High 46% 54% High 16% 84% 86% 14%
50. MORE THAN TWO VARIABLES
Replication pattern easiest to understand; when the
partials replicate or reproduce the same relationship that
existed in the bivariate table before considering the
control variable
control variable has no effect
Specification pattern occurs when one partial replicates
the initial bivariate relationship, but other partials do not.
researcher can specify the category of the control
variable in which the initial relationship persists
Example: college grades and automobiles
51. MORE THAN TWO VARIABLES
Interpretation pattern describes the situation in which
the control variable intervenes between the original
independent and dependent variable.
Explanation pattern looks the same as interpretation;
difference is the temporal order of the control variable
control variable comes before the independent
variable in the initial bivariate relationship
Suppressor variable pattern occurs when the bivariate
tables suggest independence but a relationship appears
in one or both of the partials
52. MORE THAN TWO VARIABLES
MULTIPLE REGRESSION statistical technique which
is quickly computed by an appropriate statistics software
Regression results measure the direction and size of the
effect of each variable on a dependent variable. The
effect is measured precisely and given a numerical
value.
The effect on the dependent variable is measured by a
standardized coefficient or the Greek letter beta (β). It is
equal to the ρ correlation coefficient.
53. MORE THAN TWO VARIABLES
Researchers use the beta regression coefficient to
determine whether control variables have an effect.
Example: the bivariate correlation between X and Y is
0.75. Researcher statistically considers four control
variables. If the beta remains at 0.75, then the four
control variables have no effect. However, if the beta for
X and Y gets smaller, it indicates that the control
variables have an effect.
54. INFERENTIAL STATISTICS
use probability theory to test hypotheses formally, permit
inferences from a sample to a population, and test
whether descriptive results are likely to be due to random
factors or to a real relationship.
are a more powerful type of statistics than descriptive
statistics
rely on principles from probability sampling, where a
researcher uses a random process to select a subset of
cases from the entire population.
used when researchers conduct various statistical test
(e.g. t-test or an F-test)
55. INFERENTIAL STATISTICS
Statistical significance results are not likely to be due
to chance factors
indicates the probability of finding a relationship in the
sample where there is none in the population
uses probability theory and specific statistical tests to
tell a researcher whether the results are likely to be
produced by random error in random sampling
56. INFERENTIAL STATISTICS
Levels of significance (usually .05, .01, or .001) is a way
of talking about the likelihood that the results are due to
chance factors –that is, a relationship appears in the
sample when there is none in the population.
If a researcher says that results are significant at the .05
level, this means:
Results like these are due to a chance factors only 5 in 100
times
There is a 95% chance that the sample results are not due to
chance factors alone, but reflect the population accurately
The odds of such results based on chance alone are .05 or 5%
One can be 95% confident that the results are due to a real
relationship in the population, not chance factors
57. INFERENTIAL STATISTICS
Type I error occurs when the researcher says that a
relationship exists when in fact none exists
falsely rejecting a null hypothesis
Type II error occurs when a researcher says that a
relationship does not exist when in fact it does.
falsely accepting a null hypothesis
True situation in the world
What the researcher says No relationship Causal relationship
No relationship No error Type II error
Causal relationship Type I error No error