2. Data analysis
Descriptive/Frequency
- Demographics (Number and/or percentage)
- Cross-tabulation (Number and/or percentage)
Goodness of Measures: Measurement Validity and Reliability
Reliability: The degree to which measures are free from random error and therefore yield
consistent results.
Inferential/Hypothesis testing
- t-test or ANOVA
- Correlation
- Regression
3. The Right Technique in Data Analysis?
What is the purpose of the analysis?
- Descriptive, compare group, relationship
What is the level of measurement?
- Parametric and Non-parametric
How many variables are involved?
- Univariate, bivariate, multivariate
What kind of tests?
Descriptive or Inferential.
If inferential, set the significance level
4. Descriptive Analysis
Purpose: To describe the distribution of the demographic variables
Frequency distribution – if 1 ordinal or nominal variable
Cross-tabulation – if 2 ordinal or nominal variables
Means – if 1 interval or ratio variable
Means by subgroup – if 1 interval or ratio variable broken down by subgroup
6. FACTOR ANALYSIS
• Go to Analyze – Dimension Reduction – Factor
- Enter the items of the IV or DV into the dialogue box
- Tick Descriptives – initial solution – coefficients – significance levels – determinant – KMO and Bartlett's test – inverse – reproduced – anti-image
- Tick Extraction – principal components
- Tick Rotation – Varimax – rotated solution – loading plot
- Tick Scores – display factor score coefficient matrix
- Tick Options – sorted by size
- Click OK
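For readers working outside SPSS, a rough Python sketch of a comparable analysis is shown below. It assumes a hypothetical pandas DataFrame of scale items and uses scikit-learn's FactorAnalysis, which fits a maximum-likelihood factor model with a varimax rotation rather than SPSS's default principal-components extraction, and it does not produce the KMO/Bartlett diagnostics ticked above.

# Rough sketch only: a factor analysis of scale items outside SPSS.
# Stand-in data; replace with your own item columns.
import numpy as np
import pandas as pd
from sklearn.decomposition import FactorAnalysis

rng = np.random.default_rng(0)
items = pd.DataFrame(rng.normal(size=(200, 6)),
                     columns=[f"item{i+1}" for i in range(6)])

fa = FactorAnalysis(n_components=3, rotation="varimax")   # 3 factors, varimax rotation
fa.fit(items.values)

# Rows = items, columns = factors; read like SPSS's rotated loadings.
loadings = pd.DataFrame(fa.components_.T, index=items.columns,
                        columns=["Factor1", "Factor2", "Factor3"])
print(loadings.round(3))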
7. FACTOR ANALYSIS … CONT.
To conduct a Factor Analysis, start from the
“Analyze” menu. This procedure is intended to
reduce the complexity in a set of data, so we
choose “Dimension Reduction” from the menu.
And the choice in this category is “Factor,” for
factor analysis.
This dataset gives children’s scores on subtests
of the Wechsler Intelligence Scale for Children
(WISC-III). The Wechsler scales are scored to
give you a “verbal” and a “performance” IQ.
The question is whether we can reproduce the
verbal vs. nonverbal distinction, with the
appropriate subtests grouping into each
category, using factor analysis.
8. FACTOR ANALYSIS … CONT.
Factor analysis has no IVs and DVs, so
everything you want to get factors for just goes
into the list labeled “variables.” In this case, it’s
all the variables. In some datasets, there is also
a dummy “subject number” variable included. Be
sure that you don’t include subject number as
one of the variables for your factor analysis!
9. FACTOR ANALYSIS … CONT.
In this dialog box, you can make a number of selections. First, I want you
to un-check the box labeled “Unrotated factor solution.” This is a
default setting for your printout, but it just gives you information that you
don’t need, and that may distract you from the real answers. So,
always go into the Extraction sub-dialog and un-check this box.
Second, check the box for a “scree plot.” This will give you a scree
diagram, which is one way to decide how many factors to extract.
Third, look at the section labeled “Extract.” As you can see, the default
setting is for SPSS to use the Kaiser stopping criterion (i.e., all factors
with eigenvalues greater than 1) to decide how
many factors to extract.
You can set a more conservative stopping criterion by requiring each
factor to have a higher eigenvalue.
Or, if you already know exactly how many factors you think
there will be, you can set the extraction method to a specific “Number of
factors,” and then put
the number into this box.
10. FACTOR ANALYSIS … CONT.
This dialog allows you to choose a “rotation method” for your
factor analysis.
A rotation method makes the factors as distinct from each other as possible, and helps you interpret the factors by loading each variable primarily on one of the factors.
However, you still need to decide whether you want an
"orthogonal" solution (factors are uncorrelated with each
other), or an "oblique" solution (factors are allowed to correlate with
one another).
If you want an oblique solution, the only choice SPSS gives
you is “Direct Oblimin.”
All of the others are orthogonal solutions—the one that you’ll
use most often from these choices is the default value,
“Varimax.” Most of the factor analyses you will see in
published articles use a Varimax rotation.
Make sure that the check-box for a "rotated solution" is ticked.
The rotated solution gives you the factor loadings for each
individual variable in your dataset, which are
what you use to interpret the meaning of (i.e., make up names
for) the different factors.
11. FACTOR ANALYSIS … CONT.
This table shows you the actual factors that were
extracted. If you look at the section labeled
“Rotation Sums of Squared Loadings,” it shows you
only those factors that met your cut-off
criterion (extraction method). In this case, there were
three factors with eigenvalues greater than
1. SPSS always extracts as many factors initially as
there are variables in the dataset, but the rest
of these didn’t make the grade. The “% of variance”
column tells you how much of the total
variability (in all of the variables together) can be
accounted for by each of these summary scales
or factors. Factor 1 accounts for 27.485% of the
variability in all 11 variables, and so on.
12. FACTOR ANALYSIS … CONT.
Finally, the Rotated Component Matrix shows you the
factor loadings for each variable. I went
across each row, and highlighted the factor that each
variable loaded most strongly on. Based on
these factor loadings, I think the factors represent:
--The first 5 subtests loaded strongly on Factor 1, which
I’ll call “Verbal IQ”
--Picture Completion through Object Assembly all loaded strongly on Factor 2, which I'll call "Performance IQ"
--Coding loaded strongly on Factor 3 (and Digit Span
loaded fairly strongly on Factor 3,
although it also loaded on Factor 1). Probably Factor 3 is
“Freedom from Distraction,” because
these are concentration-intensive tasks.
16. RELIABILITY
• Go to Analyze – Scale – Reliability Analysis
- Enter the items to be analyzed
- Tick Statistics – Descriptives for: Item – Scale – Scale if item deleted
• Verify the output
- If the scale's Cronbach's alpha > .70, the reliability
of the variable is achieved (Nunnally, 1978)
- If not, check the "Alpha if Item Deleted" column in the item-total
table to detect possible improvement.
- Drop the item indicated by "Alpha if Item Deleted" and run the
reliability analysis again.
- Compute a summated scale to form the variable.
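A minimal Python sketch of the Cronbach's alpha computation, for readers working outside SPSS; the item names and scores below are invented, not the data in the next slide.

# Minimal sketch: Cronbach's alpha for items held in a pandas DataFrame
# (one column per item, one row per respondent). Toy data, invented values.
import pandas as pd

def cronbach_alpha(items: pd.DataFrame) -> float:
    k = items.shape[1]                           # number of items
    item_vars = items.var(axis=0, ddof=1)        # variance of each item
    total_var = items.sum(axis=1).var(ddof=1)    # variance of the summated scale
    return (k / (k - 1)) * (1 - item_vars.sum() / total_var)

toy = pd.DataFrame({"item1": [4, 5, 3, 4, 2],
                    "item2": [3, 4, 3, 4, 2],
                    "item3": [3, 5, 2, 4, 3]})
print(round(cronbach_alpha(toy), 4))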
17. RELIABILITY ANALYSIS – SCALE (ALPHA)

Item statistics              Mean      Std Dev   Cases
1. BO12                      3.9580    1.0269    143.0
2. BO6                       3.4825     .9704    143.0
3. BO7                       2.9650     .8914    143.0

Statistics for SCALE:  Mean 10.4056   Variance 4.9048   Std Dev 2.2147   N of Variables 3

Item-total statistics
        Scale Mean        Scale Variance     Corrected Item-     Alpha
        if Item Deleted   if Item Deleted    Total Correlation   if Item Deleted
BO12    6.4476            2.3194             .4894               .5030
BO6     6.9231            2.4518             .4974               .4916
BO7     7.4406            2.9243             .3890               .6348

Reliability Coefficients:  N of Cases = 143.0   N of Items = 3
Alpha = .6465
18. When do we need which test? (1 Dependent Variable, 1 Independent Variable)

Dependent variable    Independent variable    Test
Binary                Metric                  Logistic regression
Binary                Non-metric              Chi-square test
Non-metric            Metric                  Logistic regression
Non-metric            Binary                  Mann-Whitney test
Metric                Binary                  t-test
Metric                Metric                  Regression analysis
Metric                Nominal                 Analysis of variance
19. When do we need which test? (1 Dependent Variable, 2 or more Independent Variables)

Dependent variable    Independent variables    Test
Non-metric            Metric                   Logistic regression
Non-metric            Non-metric               Loglinear analysis
Metric                Metric                   Multiple regression
Metric                Non-metric               Analysis of variance
20. A chi-squared test (also chi-square or χ2 test) is a statistical hypothesis test used in the analysis of contingency
tables when the sample sizes are large. The two most common chi-square tests check whether observed frequencies in one or more categories match expected frequencies.
A contingency table is a tool used to summarize and analyze the relationship between two categorical variables.
The Mann-Whitney U test is used to compare differences between two independent groups when the dependent variable is
either ordinal or continuous, but not normally distributed. For example, you could use the Mann-Whitney U test to understand
whether attitudes towards pay discrimination, measured on an ordinal scale, differ based on gender (your
dependent variable would be "attitudes towards pay discrimination" and your independent variable would be "gender", which has
two groups: "male" and "female").
A t test is a statistical test that is used to compare the means of two groups. It is often used in hypothesis testing to determine
whether a process or treatment actually has an effect on the population of interest, or whether two groups are different from one
another.
Log-linear analysis is a statistical test used to determine whether the proportions of categories in two or more group variables
differ significantly from each other. To use this test, you should have two or more group variables, each with two or more categories.
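For readers who want to try these tests outside SPSS, a minimal SciPy sketch follows; all counts and scores are invented for illustration.

# Rough SciPy illustrations of the tests described above (all data invented).
import numpy as np
from scipy import stats

# Chi-square test of independence on a 2x2 contingency table.
table = np.array([[30, 20],
                  [25, 35]])
chi2, p, dof, expected = stats.chi2_contingency(table)
print("chi-square:", round(chi2, 3), "p =", round(p, 3))

# Mann-Whitney U test: ordinal attitude scores for two independent groups.
group_a = [3, 4, 2, 5, 3, 4]
group_b = [4, 5, 4, 5, 3, 5]
u, p = stats.mannwhitneyu(group_a, group_b, alternative="two-sided")
print("Mann-Whitney U:", u, "p =", round(p, 3))

# Independent-samples t test comparing the two group means.
t, p = stats.ttest_ind(group_a, group_b)
print("t =", round(t, 3), "p =", round(p, 3))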
21. Correlation
H1: Autonomy and innovative orientation among Bumiputera
SMEs in northern Malaysia are significantly related.
Correlations (N = 210)
                                    Autonomy    Innovative
Autonomy     Pearson Correlation    1           .072
             Sig. (2-tailed)                    .297
Innovative   Pearson Correlation    .072        1
             Sig. (2-tailed)        .297
Interpretation:
(r = .072, p = .297): with the significance level set at p < .05, there is no
statistically significant correlation between autonomy
and innovativeness. Therefore, H1 is rejected.
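A minimal SciPy sketch of the same kind of Pearson correlation test is shown below; the scores are invented and do not reproduce the 210-case dataset above.

# Rough SciPy sketch of a Pearson correlation test (invented scores).
from scipy import stats

autonomy   = [3.2, 4.1, 2.8, 3.9, 4.4, 3.0, 3.6]
innovative = [3.5, 3.9, 3.1, 3.6, 4.0, 3.3, 3.8]
r, p = stats.pearsonr(autonomy, innovative)
print(f"r = {r:.3f}, p = {p:.3f}")   # compare p against the .05 significance level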
22. The purpose of regression models is to learn more about
the relationship between several independent (predictor)
variables and a dependent (criterion) variable.
The computational problem that needs to be solved in
regression analysis is to fit a straight line to a number
of points.
Y = b0 + b1x1 + b2x2 + … + bnxn + e
Regression models
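A minimal sketch of fitting this equation outside SPSS, using statsmodels OLS on synthetic data; the variables and coefficients are invented for illustration.

# Minimal sketch of the regression equation above, fitted with statsmodels OLS.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
x1 = rng.normal(size=100)
x2 = rng.normal(size=100)
y = 1.0 + 0.5 * x1 - 0.3 * x2 + rng.normal(scale=0.5, size=100)   # Y = b0 + b1x1 + b2x2 + e

X = sm.add_constant(np.column_stack([x1, x2]))   # adds the b0 intercept column
model = sm.OLS(y, X).fit()
print(model.summary())                           # coefficients, t values, F, R-squared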
23. Linear regression
1 dependent variable: continuous/scale
One or more independent variables: continuous/scale
Hierarchical regression
1 dependent variable: continuous/scale
Multiple blocks of independent variables: continuous/scale
Logistic regression
1 dependent variable: binary
One or more independent variables: continuous/scale
Types of Regression Models
25. Output of SPSS Regression Analyses
The F-test can assess the equality of variances.
26. Output of SPSS Regression Analyses
Confidence interval = sample mean ± margin of error
To obtain this confidence interval, add and subtract the margin of error from the sample mean. The results are the upper and lower limits of the confidence interval.
27. MULTIPLE REGRESSION ANALYSIS…CONT.
Consider some multiple regression assumptions (a rough check of several is sketched below):
1. Normality – verify skewness < 2.0 or inspect the histogram (skewness is a measurement of the distortion or asymmetry of a distribution in a data set).
2. Linearity – verify the P-P plot of standardized regression residuals.
3. Homoscedasticity – an assumption of equal or similar variances in the different groups being compared.
4. Independence of error terms – Durbin-Watson statistic between 1.5 and 2.5.
5. Freedom from multicollinearity – correlations between IVs < .70.
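A rough Python sketch of some of these checks (residual skewness, Durbin-Watson, and a variance inflation factor screen as an alternative to the pairwise-correlation rule) is shown below; the model and data are synthetic stand-ins, not output from the slides.

# Rough sketch of assumption checks on a fitted OLS model (synthetic data).
import numpy as np
import statsmodels.api as sm
from scipy.stats import skew
from statsmodels.stats.stattools import durbin_watson
from statsmodels.stats.outliers_influence import variance_inflation_factor

rng = np.random.default_rng(0)
X = sm.add_constant(rng.normal(size=(100, 2)))                # constant + two IVs
y = X @ np.array([1.0, 0.5, -0.3]) + rng.normal(scale=0.5, size=100)
model = sm.OLS(y, X).fit()

print("Residual skewness:", round(skew(model.resid), 2))      # normality: |skewness| < 2.0
print("Durbin-Watson:", round(durbin_watson(model.resid), 2)) # error independence: 1.5 - 2.5
vifs = [variance_inflation_factor(X, i) for i in range(1, X.shape[1])]
print("VIFs (excl. constant):", [round(v, 2) for v in vifs])  # multicollinearity screen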
28. 1. Describe Descriptive Statistics (means, st. dev.) of all variables
2. Report on testing of assumptions – especially if assumptions are violated and what
was done about it.
3. Report on model fit statistics (F, df1, df2, R2).
4. Report parameter estimates – for constant and IV
1. Standardized Beta
2. T-value and significance
3. (Confidence intervals)
Reporting Regression Analyses
29. A type of regression model where
the dependent variable is binary
[or ordinal: ordered logistic regression (e.g., 3 categories: low, medium, high)].
Checks whether we can predict in which category a case will land based on the
values of the IVs.
Essentially compares a model with predictors (BLOCK 1) against a model
without predictors (BLOCK 0):
is a prediction with our variables better than random chance?
Example: http://eprints.qut.edu.au/31606/
Logistic Regression Analysis
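A minimal sketch of a binary logistic regression outside SPSS, using statsmodels on synthetic data; the Block 0 vs. Block 1 comparison that SPSS reports is not reproduced here, only the fitted coefficients and odds ratios.

# Minimal sketch of a binary logistic regression (synthetic outcome and predictors).
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(1)
X = rng.normal(size=(200, 2))
prob = 1 / (1 + np.exp(-(0.8 * X[:, 0] - 0.5 * X[:, 1])))
y = rng.binomial(1, prob)

logit = sm.Logit(y, sm.add_constant(X)).fit()
print(logit.summary())          # coefficients and Wald tests
print(np.exp(logit.params))     # odds ratios for the constant and each IV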
33. A statistical method used to test differences between two or more means.
Inferences about means are made by analyzing variance.
Think of it as an extension of the t-test:
to two or more groups
to means plus variances, rather than means only.
In a typical ANOVA, the null hypothesis is that all groups are random samples from the same population.
For example, when studying the effect of different treatments on similar samples of patients, the null hypothesis would be that
all treatments have the same effect (perhaps none).
Rejecting the null hypothesis would imply that different treatments result in different effects.
ANOVA is often used in experimental research to study the effects of treatments.
Analysis of Variance Models
34. One-way ANOVA
used to test for differences among two or more independent groups (means).
Typically, however, the one-way ANOVA is used to test for differences among at least three groups, since the two-group case can be
covered by a t-test (when there are only two means to compare, the t-test and the ANOVA F-test are equivalent).
Factorial ANOVA
used when the experimenter wants to study the interaction effects among the treatments.
Repeated measures ANOVA
used when the same subjects are used for each treatment (e.g., in a longitudinal study).
Multivariate analysis of variance (MANOVA)
used when there is more than one dependent variable.
Analysis of covariance (ANCOVA)
blends ANOVA and regression: evaluates whether population means of a DV are equal across levels of a categorical IV [treatment],
while statistically controlling for the effects of other continuous variables that are not of primary interest [covariates].
Types of Analysis of Variance Models
35. When can we use ANOVA?
• The t-test is used to compare the means of two-groups.
• One-way ANOVA is used to compare the means of two or more
groups.
• We can use one-way ANOVA whenever the dependent variable (DV)
is numerical and the independent variable (IV) is categorical.
• The independent variable in ANOVA is also called a factor.
36. Examples
The following are situations where we can use ANOVA:
• Testing the differences in blood pressure among different groups
of people (DV is blood pressure and the group is the IV).
• Testing which type of social media affects hours of sleep (type of
social media used is the IV and hours of sleep is the DV).
37. The type of ANOVA model is highly dependent on your research design and
theory; in particular:
What are between-subject factors? How many?
What are within-subject factors? How many?
What are treatments? How many?
Which factors are theoretically relevant, which are mere controls?
ANOVA and Research Designs
38. Independence, normality and homogeneity of the variances of the
residuals
Note there are no necessary assumptions for ANOVA in its full generality,
but the F-test used for ANOVA hypothesis testing has assumptions and
practical limitations.
ANOVA Assumptions
39. One-way
= one-way between groups model
E.g., school performance between boys versus girls
Two-way
= two one-ways for each factor PLUS
interaction between two factors
E.g., school performance between boys versus girls and locals versus
internationals
Three-way
You get the idea…
One-way and two-way ANOVA
40. Injuries sustained by kids wearing superhero
costumes
Does it depend on which costume they wear?
Superman, Spiderman, Hulk, Ninja Turtle?
Adapted from
http://www.statisticshell.com/docs/onewayanova.pdf
Illustration: Analysis of Variance
41. Are injuries sustained random or significantly dependent on wearing superhero
costumes?
Is there any order of injuries sustained by type of costume?
What ANOVA could tell us
42. What ANOVA could tell us
Variance in injury severity explained by different costumes, examined with planned contrasts:
Contrast 1: flying superheroes (Superman, Spiderman) vs. non-flying superheroes (Hulk, Ninja Turtle)
Contrast 2: Superman vs. Spiderman
Contrast 3: Hulk vs. Ninja Turtle
43. Assumptions of ANOVA
• The observations in each group are normally distributed.
This can be tested by plotting the numerical variable separately for
each group and checking that they all have a bell shape.
Alternatively, you could use the Shapiro-Wilk test for normality.
44. Assumptions
• The groups have equal variances (i.e., homogeneity of variance).
You can plot each group separately and check that they exhibit similar variability.
Alternatively, you can use Levene’s test for homogeneity.
• The observations in each group are independent.
This could be assessed by common sense looking at the study design.
For example, if there is a participant in more than one group, your observations are
not independent.
45. Hypothesis Testing
ANOVA tests the null hypothesis:
H0 : The groups have equal means versus the alternative
hypothesis:
H1 : At least one group mean is different from the other
group means.
F-Test
46. ANOVA in SPSS
Example:
Is there a difference in optimism scores for young,
middle-aged and old participants?
Categorical IV - Age with 3 levels:
• 29 and younger
• Between 30 and 44
• 45 or above
Continuous DV – Optimism scores
47. ANOVA in SPSS
Interpreting the output:
1. Check that the groups have equal variances using Levene's test for
homogeneity.
• Check the significance value (Sig.) for Levene's test in the "Based on Mean" row.
• If this number is greater than .05, you have not violated the assumption of
homogeneity of variance.
48. ANOVA in SPSS
Interpreting the output:
2. Check the significance of the ANOVA.
• If the Sig. value is less than or equal to .05, there is a significant difference
somewhere among the mean scores on your dependent variable for the three
groups.
• However, this does not tell us which group is different from which other
group.
49. ANOVA in SPSS
Interpreting the output:
3. ONLY if the ANOVA is significant, check the significance of the
differences between each pair of groups in the table labelled
Multiple Comparisons.
50. ANOVA in SPSS
Calculating effect size:
• In an ANOVA, effect size will tell us how large the difference between
groups is.
• We will calculate eta squared, which is one of the most common
effect size statistics.
Eta squared = sum of squares between groups / total sum of squares
51. ANOVA in SPSS
Calculating effect size:
Eta squared = 179.07 / 8513.02 = .02
According to Cohen (1988):
Small effect: .01
Medium effect: .06
Large effect: .14
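As a rough illustration of this workflow outside SPSS (Levene's test, the one-way ANOVA F test, and eta squared), here is a minimal SciPy sketch with made-up optimism scores for three age groups; the numbers are invented, not the dataset used in the slides.

# Rough SciPy sketch: Levene's test, one-way ANOVA, and eta squared (invented data).
from scipy import stats

young  = [20, 22, 19, 21, 23, 20]
middle = [22, 24, 21, 23, 22, 25]
old    = [23, 25, 22, 24, 26, 23]

print(stats.levene(young, middle, old))          # homogeneity of variance
f, p = stats.f_oneway(young, middle, old)        # one-way ANOVA F test
print("F =", round(f, 2), "p =", round(p, 4))

# Effect size: eta squared = sum of squares between groups / total sum of squares
all_scores = young + middle + old
grand_mean = sum(all_scores) / len(all_scores)
ss_total = sum((x - grand_mean) ** 2 for x in all_scores)
ss_between = sum(len(g) * (sum(g) / len(g) - grand_mean) ** 2
                 for g in (young, middle, old))
print("eta squared =", round(ss_between / ss_total, 3))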
52. ANOVA in SPSS
Example results write-up:
A one way between-groups analysis of variance was conducted to explore the impact of
age on levels of optimism. Participants were divided into three groups according to their
age (Group 1: 29yrs or less; Group 2: 30 to 44yrs; Group 3: 45yrs and above). There was a
statistically significant difference at the p < .05 level in optimism scores for the three age
groups: F (2, 432) = 4.6, p = .01. Despite reaching statistical significance, the actual
difference in mean scores between the groups was quite small. The effect size, calculated
using eta squared, was .02. Post-hoc comparisons using the Tukey HSD test indicated that
the mean score for Group 1 (M = 21.36, SD = 4.55) was significantly different
from Group 3 (M = 22.96, SD = 4.49).
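The Tukey HSD post-hoc comparisons mentioned in the write-up can be sketched with statsmodels; this again uses invented scores and group labels, not the study data.

# Post-hoc Tukey HSD comparisons (same invented groups as the ANOVA sketch above).
import numpy as np
from statsmodels.stats.multicomp import pairwise_tukeyhsd

young  = [20, 22, 19, 21, 23, 20]
middle = [22, 24, 21, 23, 22, 25]
old    = [23, 25, 22, 24, 26, 23]
scores = np.array(young + middle + old)
groups = np.array(["29 or younger"] * 6 + ["30 to 44"] * 6 + ["45 or above"] * 6)
print(pairwise_tukeyhsd(scores, groups, alpha=0.05))   # pairwise mean differences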
54. Descriptive Statistics-Numeric Data
• After Importing your dataset, and providing names to
variables, click on:
• ANALYZE → DESCRIPTIVE STATISTICS → DESCRIPTIVES
• Choose any variables to be analyzed and place them in box on right
• Options include:
56. Descriptive Statistics-General Data
• After Importing your dataset, and providing names to variables,
click on:
• ANALYZE → DESCRIPTIVE STATISTICS → FREQUENCIES
• Choose any variables to be analyzed and place them in box on right
• Options include (For Categorical Variables):
• Frequency Tables
• Pie Charts, Bar Charts
• Options include (For Numeric Variables)
• Frequency Tables (Useful for discrete data)
• Measures of Central Tendency, Dispersion, Percentiles
• Pie Charts, Histograms
57. Example 1.4 - Smoking Status
[SPSS output not legible in the source: frequency table for smoking status.]
58. Vertical Bar Charts and Pie Charts
• After Importing your dataset, and providing names to variables, click on:
• GRAPHS → BAR… → SIMPLE (Summaries for Groups of Cases) → DEFINE
• Bars Represent N of Cases (or % of Cases)
• Put the variable of interest as the CATEGORY AXIS
• GRAPHS → PIE… (Summaries for Groups of Cases) → DEFINE
• Slices Represent N of Cases (or % of Cases)
• Put the variable of interest as the DEFINE SLICES BY
60. Histograms
• After Importing your dataset, and providing names to
variables, click on:
• GRAPHS → HISTOGRAM
• Select Variable to be plotted
• Click on DISPLAY NORMAL CURVE if you want a normal curve
superimposed (see Chapter 3).
62. Side-by-Side Bar Charts
• After Importing your dataset, and providing
names to variables, click on:
• GRAPHS → BAR… → Clustered (Summaries for Groups of Cases) → DEFINE
• Bars Represent N of Cases (or % of Cases)
• CATEGORY AXIS: Variable that represents groups to be
compared (independent variable)
• DEFINE CLUSTERS BY: Variable that represents outcomes of
interest (dependent variable)
63. Example 1.7 - Streptomycin Study
[Chart not legible in the source: clustered bar chart of Count by OUTCOME (categories 1–6), grouped by TRT (1, 2).]
64. Scatterplots
• After Importing your dataset, and providing
names to variables, click on:
• GRAPHS → SCATTER → SIMPLE → DEFINE
• For Y-AXIS, choose the Dependent (Response) Variable
• For X-AXIS, choose the Independent (Explanatory) Variable
66. Scatterplots with 2 Independent
Variables
• After Importing your dataset, and providing names to variables,
click on:
• GRAPHS → SCATTER → SIMPLE → DEFINE
• For Y-AXIS, choose the Dependent Variable
• For X-AXIS, choose the Independent Variable with the most levels
• For SET MARKERS BY, choose the Independent Variable with the fewest
levels
68. Contingency Tables for Conditional Probabilities
• After Importing your dataset, and providing names to
variables, click on:
• ANALYZE → DESCRIPTIVE STATISTICS → CROSSTABS
• For ROWS, select the variable you are conditioning on
(Independent Variable)
• For COLUMNS, select the variable you are finding the conditional
probability of (Dependent Variable)
• Click on CELLS
• Click on ROW Percentages
69. Example 1.10 - Alcohol & Mortality
[SPSS output not legible in the source: contingency table of alcohol consumption and mortality with row percentages.]
70. Independent Sample t-Test
• After Importing your dataset, and providing names to
variables, click on:
• ANALYZE → COMPARE MEANS → INDEPENDENT SAMPLES T-TEST
• For TEST VARIABLE, Select the dependent (response) variable(s)
• For GROUPING VARIABLE, Select the independent variable. Then
define the names of the 2 levels to be compared (this can be used
even when the full dataset has more than 2 levels for independent
variable).
71. Example 3.5 - Levocabastine in Renal
Patients
[SPSS output not legible in the source: group statistics and independent-samples t-test (with Levene's test for equality of variances) for the levocabastine example.]
72. Paired t-test
• After Importing your dataset, and providing
names to variables, click on:
• ANALYZE → COMPARE MEANS → PAIRED SAMPLES T-TEST
• For PAIRED VARIABLES, Select the two dependent
(response) variables (the analysis will be based on first
variable minus second variable)
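A minimal SciPy sketch of a paired t test; the paired values are invented for illustration, and the test works on the first variable minus the second.

# Rough SciPy sketch of a paired t test (invented paired measurements).
from scipy import stats

first  = [5.2, 6.1, 5.8, 7.0, 6.4]
second = [4.8, 5.9, 5.5, 6.6, 6.1]
t, p = stats.ttest_rel(first, second)
print("t =", round(t, 3), "p =", round(p, 3))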
73. Example 3.7 - Cmax in SRC&IRC Codeine
[SPSS output not legible in the source: paired-samples statistics and paired t-test for Cmax of SRC vs. IRC codeine.]
74. Chi-Square Test
• After Importing your dataset, and providing names to
variables, click on:
• ANALYZE → DESCRIPTIVE STATISTICS → CROSSTABS
• For ROWS, Select the Independent Variable
• For COLUMNS, Select the Dependent Variable
• Under STATISTICS, Click on CHI-SQUARE
• Under CELLS, Click on OBSERVED, EXPECTED, ROW PERCENTAGES, and
ADJUSTED STANDARDIZED RESIDUALS
• NOTE: Large ADJUSTED STANDARDIZED RESIDUALS (in absolute
value) show which cells are inconsistent with the null hypothesis of
independence. A common rule of thumb is to flag cells with values
greater than 3 in absolute value.
75. Example 5.8 - Marital Status &
Cancer
[SPSS output not legible in the source: marital status by cancer crosstab with counts, expected counts, row percentages, adjusted residuals, and chi-square tests.]
76. Fisher’s Exact Test
• After Importing your dataset, and providing names to variables, click
on:
• ANALYZE → DESCRIPTIVE STATISTICS → CROSSTABS
• For ROWS, Select the Independent Variable
• For COLUMNS, Select the Dependent Variable
• Under STATISTICS, Click on CHI-SQUARE
• Under CELLS, Click on OBSERVED and ROW PERCENTAGES
• NOTE: You will want to code the data so that the outcome present (Success)
category has the lower value (e.g. 1) and the outcome absent (Failure) category
has the higher value (e.g. 2). Similar for Exposure present category (e.g. 1) and
exposure absent (e.g. 2). Use Value Labels to keep output straight.
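A minimal SciPy sketch of Fisher's exact test on a 2x2 table with invented counts, coded as the note above suggests (rows = exposure present/absent, columns = outcome present/absent).

# Rough SciPy sketch of Fisher's exact test (invented 2x2 counts).
from scipy import stats

table = [[6, 1],    # exposed:   6 with outcome, 1 without
         [2, 5]]    # unexposed: 2 with outcome, 5 without
odds_ratio, p = stats.fisher_exact(table)
print("odds ratio =", round(odds_ratio, 2), "p =", round(p, 4))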
77. Example 5.5 - Antiseptic Experiment
[SPSS output not legible in the source: crosstab with chi-square and Fisher's exact tests for the antiseptic experiment.]
78. McNemar’s Test
• After Importing your dataset, and providing names to
variables, click on:
• ANALYZE → DESCRIPTIVE STATISTICS → CROSSTABS
• For ROWS, Select the outcome for condition/time 1
• For COLUMNS, Select the outcome for condition/time 2
• Under STATISTICS, Click on MCNEMAR
• Under CELLS, Click on OBSERVED and TOTAL PERCENTAGES
• NOTE: You will want to code the data so that the outcome present
(Success) category has the lower value (e.g. 1) and the outcome
absent (Failure) category has the higher value (e.g. 2). Similar for
Exposure present category (e.g. 1) and exposure absent (e.g. 2).
Use Value Labels to keep output straight.
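A minimal sketch of McNemar's test using statsmodels on an invented 2x2 table of paired outcomes (rows = condition/time 1, columns = condition/time 2).

# Rough sketch of McNemar's test (invented paired-outcome counts).
from statsmodels.stats.contingency_tables import mcnemar

table = [[30, 5],
         [12, 40]]
result = mcnemar(table, exact=True)   # exact binomial version, suits small discordant counts
print("statistic =", result.statistic, "p =", round(result.pvalue, 4))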
79. Example 5.6 - Report of Implant Leak
[SPSS output not legible in the source: paired crosstab and McNemar's test p-value for the implant leak example.]
80. Relative Risks and Odds Ratios
• After Importing your dataset, and providing names to
variables, click on:
• ANALYZE → DESCRIPTIVE STATISTICS → CROSSTABS
• For ROWS, Select the Independent Variable
• For COLUMNS, Select the Dependent Variable
• Under STATISTICS, Click on RISK
• Under CELLS, Click on OBSERVED and ROW PERCENTAGES
• NOTE: You will want to code the data so that the outcome present
(Success) category has the lower value (e.g. 1) and the outcome
absent (Failure) category has the higher value (e.g. 2). Similar for
Exposure present category (e.g. 1) and exposure absent (e.g. 2).
Use Value Labels to keep output straight.
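The risk estimates that SPSS reports can also be computed by hand; a minimal Python sketch with invented counts follows (rows = exposed/unexposed, columns = event yes/no).

# Rough sketch: relative risk and odds ratio from a 2x2 table (invented counts).
a, b = 15, 85    # exposed:   15 events, 85 non-events
c, d = 5, 95     # unexposed:  5 events, 95 non-events

risk_exposed   = a / (a + b)
risk_unexposed = c / (c + d)
print("RR =", round(risk_exposed / risk_unexposed, 2))   # relative risk
print("OR =", round((a * d) / (b * c), 2))               # odds ratio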
81. Example 5.1 - Pamidronate Study
[SPSS output not legible in the source: crosstab and risk estimates (odds ratio and relative risks) for the pamidronate study.]
82. Example 5.2 - Lip Cancer
[SPSS output not legible in the source: crosstab and risk estimates for the lip cancer example.]
83. Correlation
After Importing your dataset, and providing names
to variables, click on:
ANALYZE → CORRELATE → BIVARIATE
Select the VARIABLES
Select the PEARSON CORRELATION
Select the Two tailed test of significance
Select Flag significant correlations
86. Linear Regression
• After Importing your dataset, and providing
names to variables, click on:
• ANALYZE → REGRESSION → LINEAR
• Select the DEPENDENT VARIABLE
• Select the INDEPENDENT VARIABLE(S)
• Click on STATISTICS, then ESTIMATES, CONFIDENCE
INTERVALS, MODEL FIT
87. Examples 7.1-7.6 - Gemfibrozil Clearance
[SPSS output not legible in the source: regression coefficients table for gemfibrozil clearance.]
88. Examples 7.1-7.6 - Gemfibrozil Clearance
[SPSS output not legible in the source: model summary and ANOVA tables for the gemfibrozil clearance regression.]
89. Linear Regression
• We will introduce simple linear regression, in
particular we will:
• Learn when we can use simple linear regression
• Learn the basic workings involved in simple linear
regression
• Linear Regression in SPSS
• This presentation is intended for students in initial
stages of Statistics. No previous knowledge is
required.
90. Linear Regression
• Regression is used to study the relationship
between two variables.
• How a change in one variable (e.g., someone’s
exercise habits) can predict the outcome of another
variable (e.g., general health).
• We can use simple regression if both the
dependent variable (DV) and the independent
variable (IV) are numerical.
• If the DV is numerical but the IV is categorical, it is best to use ANOVA.
91. Examples
The following are situations where we can use
regression:
• Testing if IQ affects income (IQ is the IV and income
is the DV).
• Testing if study time affects grades (hours of study
time is the IV and average grade is the DV).
• Testing if exercise affects blood pressure (hours of
exercise is the IV and blood pressure is the DV).
92. Displaying the data
When both the DV and IV are numerical, we can
represent data in the form of a scatterplot.
93. Displaying the data
It is important to perform a scatterplot because it
helps us to see if the relationship is linear.
In this example, the
relationship between
body fat % and chance
of heart failure is not
linear and hence it is
not sensible to use
linear regression.
94.
• Straight line prediction model.
• As an independent variable
changes, what happens to the
dependent variable? I.e., as an
independent variable goes up
and down, does the dependent
variable go up and down?
• They could either move in the
same direction (positive
relationship) or opposite
direction (negative relationship)
Linear Regression
99. Assumptions of regression
• The errors E are normally distributed.
This can be tested by plotting a histogram of the regression residuals and checking that it has a bell shape.
Alternatively, you could use the Shapiro-Wilk test for normality.
100. Assumptions of regression
• There are no clear outliers.
This can be checked by examining the scatterplot; clear outliers can simply be removed from the analysis.
101. Hypothesis testing
Regression tests the null hypothesis:
H0 : There is no effect of X on Y.
versus the alternative hypothesis:
H1 : There is an effect of X on Y.
If the null hypothesis is rejected, we reject the hypothesis that
there is no relationship and hence we conclude that there is a
significant relationship between X and Y.
102. How do we know if we should reject the null
hypothesis?
We perform regression in SPSS and look at the p-value
of the coefficient b.
If the p-value is less than 0.05, we reject the null
hypothesis (the variable is significant), otherwise, we
do not reject the null hypothesis (the variable is not
significant).
Hypothesis testing
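A minimal SciPy sketch of this decision rule: fit a simple linear regression to invented data and check the p-value of the slope b.

# Rough SciPy sketch: simple linear regression and the slope's p-value (invented data).
from scipy import stats

study_hours = [2, 4, 6, 8, 10, 12]
grade       = [55, 60, 66, 71, 75, 83]
result = stats.linregress(study_hours, grade)
print("b =", round(result.slope, 2), "p =", round(result.pvalue, 4))
if result.pvalue < 0.05:
    print("Reject H0: study time significantly predicts grade.")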
103. Interpreting the output:
1. The first table that we’re interested in is the Model Summary.
• The R value represents the simple correlation. This indicates a strong degree of correlation
between our two variables.
• The R2 value indicates how much of the total variation in the dependent variable (perceived stress)
can be explained by the independent variable (mastery). In this case, 37.3% can be explained.
Regression in SPSS
https://statistics.laerd.com/spss-tutorials/linear-regression-using-spss-statistics.php
104. Interpreting the output:
2. The next table is the ANOVA table, which shows us how well the
regression equation fits the data (i.e., predicts the dependent
variable).
• The regression predicts the dependent variable significantly well (p < .001).
Regression in SPSS
https://statistics.laerd.com/spss-tutorials/linear-regression-using-spss-statistics.php
105. Interpreting the output:
3. The Coefficients table gives us the information that we need to
predict stress from mastery, as well as determine whether mastery
contributes statistically significantly to the model.
Regression in SPSS
Y = B0 + B1 * X
Total perceived stress = 46.32 + (-.9*Total Mastery)
106. Example results write-up:
A simple linear regression was carried out to test if total mastery significantly predicted total
perceived stress. The results of the regression indicated that the model explained 37.3% of the
variance and that the model was significant F (1, 431) = 257.63, p < .001. It was found that total
mastery significantly predicted total perceived stress (B1 = -.9, p < .001). The final predictive model
was:
total perceived stress = 46.32 + (-.9*total mastery)
Regression in SPSS
108. Understanding Factor Analysis
Regardless of purpose, factor analysis is used for:
the determination of a small number of factors based on a larger number of inter-related quantitative variables.
Unlike directly measured variables such as speed, height, and weight, some variables such as egoism, creativity, happiness, religiosity, and comfort are not a single measurable entity.
They are constructs that are derived from the measurement of other, directly observable variables.
109. Understanding Factor Analysis
Constructs are usually defined as unobservable latent variables. E.g.:
motivation/love/hate/care/altruism/anxiety/worry/stress/product
quality/physical aptitude/democracy /reliability/power.
Example: the construct of teaching effectiveness. Several variables are
used to allow the measurement of such construct (usually several scale
items are used) because the construct may include several dimensions.
Factor analysis measures constructs that are not directly observable by
measuring several of their underlying dimensions.
The identification of such underlying dimensions (factors) simplifies the
understanding and description of complex constructs.
110. Understanding Factor Analysis
• Generally, the number of factors is much smaller than the
number of measures.
• Therefore, the expectation is that a factor represents a set of
measures.
• From this angle, factor analysis is viewed as a data-reduction
technique as it reduces a large number of overlapping variables
to a smaller set of factors that reflect construct(s) or different
dimensions of construct(s).
111. Understanding Factor Analysis
The assumption of factor analysis is that underlying
dimensions (factors) can be used to explain complex
phenomena.
Observed correlations between variables result from their
sharing of factors.
Example: Correlations between a person’s test scores might be
linked to shared factors such as general intelligence, critical
thinking and reasoning skills, reading comprehension etc.
112. Ingredients of a Good Factor Analysis Solution
• A major goal of factor analysis is to represent
relationships among sets of variables parsimoniously
yet keeping factors meaningful.
• A good factor solution is both simple and interpretable.
• When factors can be interpreted, new insights are
possible.
113. Application of Factor Analysis
Defining indicators of constructs:
Ideally 4 or more measures should be chosen to represent each construct
of interest.
The choice of measures should, as much as possible, be guided by theory,
previous research, and logic.
114. Application of Factor Analysis
Defining dimensions for an existing measure:
In this case the variables to be analyzed are chosen by the
initial researcher and not the person conducting the analysis.
Factor analysis is performed on a predetermined set of
items/scales.
Results of factor analysis may not always be satisfactory:
The items or scales may be poor indicators of the construct or
constructs.
There may be too few items or scales to represent each underlying
dimension.
115. Application of Factor Analysis
Selecting items or scales to be included in a measure.
Factor analysis may be conducted to determine what items or
scales should be included and excluded from a measure.
Results of the analysis should not be used alone in making
decisions of inclusions or exclusions. Decisions should be taken
in conjunction with the theory and what is known about the
construct(s) that the items or scales assess.
116. Steps in Factor Analysis
• Factor analysis usually proceeds in four steps:
• 1st Step: the correlation matrix for all variables is computed
• 2nd Step: Factor extraction
• 3rd Step: Factor rotation
• 4th Step: Make final decisions about the number of
underlying factors
117. Steps in Factor Analysis:
The Correlation Matrix
• 1st Step: the correlation matrix
• Generate a correlation matrix for all variables
• Identify variables not related to other variables
• If the correlations between variables are small, it is unlikely
that they share common factors (variables must be related to
each other for the factor model to be appropriate).
• Think of correlations in absolute value.
• Correlation coefficients greater than 0.3 in absolute value are
indicative of acceptable correlations.
• Examine visually the appropriateness of the factor model.
118. Steps in Factor Analysis:
The Correlation Matrix
• Bartlett Test of Sphericity:
used to test the hypothesis that the correlation matrix is an identity matrix (all
diagonal terms are 1 and all off-diagonal terms are 0).
If the value of the test statistic for sphericity is large and the associated
significance level is small, it is unlikely that the population correlation matrix
is an identity.
• If the hypothesis that the population correlation matrix is an identity
cannot be rejected because the observed significance level is large, the
use of the factor model should be reconsidered.
119. Steps in Factor Analysis:
The Correlation Matrix
• The Kaiser-Meyer-Olkin (KMO) measure of sampling adequacy:
is an index for comparing the magnitude of the observed correlation
coefficients to the magnitude of the partial correlation coefficients.
The closer the KMO measure is to 1, the better the sampling adequacy (.8
and higher is great, .7 is acceptable, .6 is mediocre, and less than .5 is
unacceptable).
Reasonably large values are needed for a good factor analysis. Small KMO
values indicate that a factor analysis of the variables may not be a good idea.
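A rough numpy sketch of how these two diagnostics can be computed from a data matrix (rows = cases, columns = variables) using the standard formulas rather than SPSS; the random matrix at the end is only a stand-in for real item data.

# Rough numpy sketch: Bartlett's test of sphericity and the KMO measure.
import numpy as np
from scipy.stats import chi2 as chi2_dist

def bartlett_sphericity(X):
    n, p = X.shape
    R = np.corrcoef(X, rowvar=False)
    stat = -(n - 1 - (2 * p + 5) / 6) * np.log(np.linalg.det(R))
    df = p * (p - 1) / 2
    return stat, df

def kmo(X):
    R = np.corrcoef(X, rowvar=False)
    R_inv = np.linalg.inv(R)
    d = np.sqrt(np.diag(R_inv))
    partial = -R_inv / np.outer(d, d)      # partial correlations
    np.fill_diagonal(R, 0)                 # keep only off-diagonal terms
    np.fill_diagonal(partial, 0)
    return (R ** 2).sum() / ((R ** 2).sum() + (partial ** 2).sum())

X = np.random.default_rng(0).normal(size=(200, 6))   # stand-in data matrix
stat, df = bartlett_sphericity(X)
print("Bartlett chi-square =", round(stat, 2), "df =", int(df),
      "p =", round(chi2_dist.sf(stat, df), 4))
print("KMO =", round(kmo(X), 3))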
120. Steps in Factor Analysis:
Factor Extraction
2nd Step: Factor extraction
The primary objective of this stage is to determine the factors.
Initial decisions can be made here about the number of factors underlying
a set of measured variables.
Estimates of initial factors are obtained using Principal components
analysis.
Principal components analysis is the most commonly used extraction
method. Other factor extraction methods include:
Maximum likelihood method
Principal axis factoring
Alpha method
Unweighted least squares method
Generalized least square method
Image factoring.
121. Steps in Factor Analysis:
Factor Extraction
In principal components analysis, linear combinations of the
observed variables are formed.
The 1st principal component is the combination that accounts
for the largest amount of variance in the sample (1st extracted
factor).
The 2nd principal component accounts for the next largest
amount of variance and is uncorrelated with the first (2nd
extracted factor).
Successive components explain progressively smaller portions of
the total sample variance, and all are uncorrelated with each
other.
122. Steps in Factor Analysis:Factor Extraction
To decide on how many factors
we need to represent the data,
we use 2 statistical criteria:
Eigen Values, and
The Scree Plot.
The determination of the
number of factors is usually
done by considering only factors
with Eigen values greater than 1.
Factors with a variance less than
1 are no better than a single
variable, since each variable is
expected to have a variance of 1.
Total Variance Explained
             Initial Eigenvalues                 Extraction Sums of Squared Loadings
Component    Total    % of Variance   Cum. %     Total    % of Variance   Cum. %
1            3.046    30.465          30.465     3.046    30.465          30.465
2            1.801    18.011          48.476     1.801    18.011          48.476
3            1.009    10.091          58.566     1.009    10.091          58.566
4             .934     9.336          67.902
5             .840     8.404          76.307
6             .711     7.107          83.414
7             .574     5.737          89.151
8             .440     4.396          93.547
9             .337     3.368          96.915
10            .308     3.085         100.000
Extraction Method: Principal Component Analysis.
123. Steps in Factor Analysis:
Factor Extraction
The examination of the Scree plot provides a
visual of the total variance associated with each
factor.
The steep slope shows the large factors.
The gradual trailing off (scree) shows the rest of
the factors usually lower than an Eigen value of 1.
In choosing the number of factors, in addition to
the statistical criteria, one should make initial
decisions based on conceptual and theoretical
grounds.
At this stage, the decision about the number of
factors is not final.
124. Steps in Factor Analysis:
Factor Extraction
Component Matrix (a)
                                                                               Component 1   Component 2   Component 3
I discussed my frustrations and feelings with person(s) in school                  .771         -.271          .121
I tried to develop a step-by-step plan of action to remedy the problems            .545          .530          .264
I expressed my emotions to my family and close friends                             .580         -.311          .265
I read, attended workshops, or sought some other educational approach
  to correct the problem                                                            .398          .356         -.374
I tried to be emotionally honest with myself about the problems                    .436          .441         -.368
I sought advice from others on how I should solve the problems                     .705         -.362          .117
I explored the emotions caused by the problems                                     .594          .184         -.537
I took direct action to try to correct the problems                                .074          .640          .443
I told someone I could trust about how I felt about the problems                   .752         -.351          .081
I put aside other activities so that I could work to solve the problems            .225          .576          .272
Extraction Method: Principal Component Analysis.
a. 3 components extracted.
Component Matrix using Principal Component Analysis
125. Steps in Factor Analysis:
Factor Rotation
3rd Step: Factor rotation.
In this step, factors are rotated.
Un-rotated factors are typically not very interpretable (most factors
are correlated with many variables).
Factors are rotated to make them more meaningful and easier to
interpret (each variable is associated with a minimal number of
factors).
Different rotation methods may result in the identification of
somewhat different factors.
126. Steps in Factor Analysis:
Factor Rotation
The most popular rotational method is the Varimax rotation.
Varimax is an orthogonal rotation yielding uncorrelated factors/components.
Varimax attempts to minimize the number of variables that have high
loadings on a factor. This enhances the interpretability of the factors.
127. Steps in Factor Analysis:
Factor Rotation
• Other common rotational method used include Oblique rotations which yield
correlated factors.
• Oblique rotations are less frequently used because their results are more
difficult to summarize.
• Other rotational methods include:
Quartimax (Orthogonal)
Equamax (Orthogonal)
Promax (oblique)
129. Steps in Factor Analysis:
Making Final Decisions
• 4th Step: Making final decisions
• The final decision about the number of factors to choose is the number of factors
for the rotated solution that is most interpretable.
• To identify factors, group variables that have large loadings for the same factor.
• Plots of loadings provide a visual for variable clusters.
• Interpret factors according to the meaning of the variables
• This decision should be guided by:
• A priori conceptual beliefs about the number of factors from past research or
theory
• Eigen values computed in step 2.
• The relative interpretability of rotated solutions computed in step 3.
130. Assumptions Underlying Factor Analysis
• Assumptions underlying factor analysis include:
• The measured variables are linearly related to the factors + errors.
• This assumption is likely to be violated if items have limited response scales (e.g., two-point response scales such as True/False or Right/Wrong items).
• The data should have a bi-variate normal distribution for each pair of
variables.
• Observations are independent.
• The factor analysis model assumes that variables are determined by
common factors and unique factors. All unique factors are assumed to be
uncorrelated with each other and with the common factors.