Predicting the quality of a survey question from its design characteristics: SQP

Daniel Oberski
(joint work with Willem Saris)
Universitat Pompeu Fabra

[Figure (Groves et al. 2004): sources of error on the way to a survey statistic. Measurement side: Construct → Measurement → Response → Edited data, with validity, measurement error, and processing error. Representation side: Inferential population → Target population → Sampling frame → Sample → Respondents, with coverage error, sampling error, and nonresponse error. Both sides lead to the survey statistic.]

• Assume the step from construct to measurement is already acceptable.
→ Assume that the question measures an intended construct: the respondent knows the answer, can interpret the question, ...
→ The reaction of the respondent to the question depends on some unobserved value/opinion, which is in turn a measure of the construct.
• We focus only on the degree to which the response is a good measure of this unobserved score/opinion: “measurement error”.
• (NOT the degree to which the question is interpretable, measures some construct, etc.)

Reasons to study measurement error
• Reliability is an upper bound on validity; responses can never measure the underlying construct better than the single indicator.
• Unreliability increases the variance of estimators:
• $\operatorname{var}(\hat{\mu}) = \kappa^{-1}\sigma^2/n$, where $\kappa \in (0, 1)$ is the reliability and $\sigma^2$ is the variance of the error-free score.
• Unreliability reduces the apparent strength of relationships between variables:
• $\rho_{xy} = \sqrt{\kappa_x} \cdot \sqrt{\kappa_y} \cdot \rho_{XY}$, where $\rho_{XY}$ is the true correlation and $\rho_{xy}$ the observed correlation (with $\kappa$ a proportion of variance, as in the variance formula).
• Correlated measurement errors will make variables look more related than they really are; e.g. “How many minutes does it take to...” questions correlate partly because they are all asked in the same way.
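
A small worked example may make the two consequences of unreliability concrete. This is only a sketch: it assumes that κ is a proportion of variance (error-free variance divided by observed variance) and that measurement errors are uncorrelated; the function names and the numbers are invented for illustration.

```python
import math

def variance_of_mean(sigma2_true, kappa, n):
    """Sampling variance of the mean when the observed variance is sigma2_true / kappa."""
    return sigma2_true / (kappa * n)

def observed_correlation(rho_true, kappa_x, kappa_y):
    """Correlation attenuated by unreliability in both variables."""
    return math.sqrt(kappa_x * kappa_y) * rho_true

def corrected_correlation(rho_observed, kappa_x, kappa_y):
    """Undo the attenuation (only sensible if the errors are uncorrelated)."""
    return rho_observed / math.sqrt(kappa_x * kappa_y)

# Hypothetical values: true correlation 0.50, kappa of 0.64 and 0.81.
rho_obs = observed_correlation(0.50, 0.64, 0.81)
rho_back = corrected_correlation(rho_obs, 0.64, 0.81)
var_perfect = variance_of_mean(1.0, 1.0, 1000)
var_noisy = variance_of_mean(1.0, 0.8, 1000)
print(rho_obs, rho_back, var_noisy / var_perfect)
```

With these made-up numbers the observed correlation is 0.8 · 0.9 · 0.5 = 0.36, and a κ of 0.8 inflates the variance of the mean by a factor of 1.25.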

Public health ranking: Correction of regression coefficients for κ
[Figure: educational differentials in subjective health by country, with 2 s.e. intervals (axis from -0.4 to 0.0), comparing the uncorrected regression coefficient with the measurement error-corrected coefficient. Countries shown: GR, CZ, PT, SI, FI, HU, PL, SK, LU, ES, EE, DK, DE, TR, IS, NO, CH, BE, IE, FR, UA, AT, NL, SE. Values printed in the figure: 0.82, 0.85, 0.78, 0.73, 0.56, 0.75, 0.71, 0.81, 0.86, 0.85, 0.95, 0.84, 0.91, 0.70, 0.81, 0.87, 0.81, 0.82, 0.92, 0.85, 0.91, 0.81, 0.93, 0.99.]

Design characteristics of questions
• Social Desirability
• Centrality
• Reference period
• Question formulation
• WH word used
• Use of gradation
• Balance of the request
• Encouragement
• Showcards present
• Showcards have pictures
• ...
• Emphasis on subjective opinion in request
• Information about the opinion of other people
• Use of stimulus or statement in the question
• Absolute or comparative judgment
• Response scale: basic choice
• Number of categories
• Labels full, partial, or no
• Labels full sentences
• Knowledge provided
• Survey mode
• ...
• Order of the labels
• Correspondence between labels and numbers of the scale
• Theoretical range of the scale
• Neutral category
• Number of fixed reference points
• Don’t know option
• Interviewer instruction
• Respondent instruction
• Extra motivation, info or definition available?
• Agree-disagree scale
• ...
(Saris & Gallhofer 2007)

Question design choices
• There are a great number of question design
characteristics for which it has at some point been found or
suggested that they inﬂuence the response;
• Any question in a questionnaire represents a series of
choices (conscious or not) on those characteristics: a
method of asking the question;

• It is clear that what is a good method depends strongly on
the topic, for example

• The frequency and importance of an event or series of
events asked about determine: reasonable reference
periods; reasonable categories - wide or deep;
approximately or exactly (Tourangeau et al. 2000).

• But are some methods generally better than others?

• If so, what about those methods makes them better?

Talk outline
1 Question design
The influence of the method
Variation in influence of the method
2 Modeling measurement error
Definitions
Formal model and assumptions
3 Estimating measurement error
Design requirements
Estimation of the model
4 Predicting measurement error
Description of the data
Meta-analysis of the MTMM experiments
Program demonstration

The method inﬂuences the answers

European Social Survey, 2002

Method A:
ENTER START TIME:
A1 TvTot
CARD 1 On an average weekday, how much time, in total, do you spend watching television? Please use this card to answer.
00 No time at all (GO TO A3)
01 Less than ½ hour
02 ½ hour to 1 hour
03 More than 1 hour, up to 1½ hours
04 More than 1½ hours, up to 2 hours
05 More than 2 hours, up to 2½ hours
06 More than 2½ hours, up to 3 hours
07 More than 3 hours
88 (Don’t know)
(codes 01 to 88: ASK A2)
A2 TvPol
STILL CARD 1 And again on an average weekday, how much of your time watching television is spent watching news or programmes about politics and current affairs? Still use this card.

Method B:
On an average weekday, how much time, in total, do you spend watching television?
HOURS: ___ AND MINUTES: ___
On an average weekday, how much time, in total, do you spend listening to the radio?
HOURS: ___ AND MINUTES: ___
On an average weekday, how much time, in total, do you spend reading the newspapers?
HOURS: ___ AND MINUTES: ___

TV watching: method A versus method B
[Figure: hours of TV watching written in as hours and minutes (method B), plotted within each category of the categorical scale (method A): 0; h<0.5; 0.5<=h<=1; 1<h<=1.5; 1.5<h<=2; 2<h<=2.5; 2.5<h<=3; h>3. The plot contains a large number of individually drawn points.]

Radio listening: method A versus method B
[Figure: hours of radio listening written in as hours and minutes (method B), plotted within each category of the categorical scale (method A): 0; h<0.5; 0.5<=h<=1; 1<h<=1.5; 1.5<h<=2; 2<h<=2.5; 2.5<h<=3; h>3. The plot contains a large number of individually drawn points.]

Newspaper reading: method A versus method B
[Figure: hours of newspaper reading written in as hours and minutes (method B), plotted within each category of the categorical scale (method A): 0; h<0.5; 0.5<=h<=1; 1<h<=1.5; 1.5<h<=2; 2<h<=2.5; 2.5<h<=3; h>3.]

TV watching: method A versus method B
[Figure: relative frequency (0.00 to 0.20) of each category (0; h<0.5; 0.5<=h<=1; 1<h<=1.5; 1.5<h<=2; 2<h<=2.5; 2.5<h<=3; h>3), in two panels: the categorical scale, and the written-in hours and minutes recoded to the same categories.]

Radio listening: method A versus method B
[Figure: relative frequency (0.00 to 0.25) of each category (0; h<0.5; 0.5<=h<=1; 1<h<=1.5; 1.5<h<=2; 2<h<=2.5; 2.5<h<=3; h>3), in two panels: the categorical scale, and the written-in hours and minutes recoded to the same categories.]

Newspaper reading: method A versus method B
[Figure: relative frequency (0.0 to 0.4) of each category (0; h<0.5; 0.5<=h<=1; 1<h<=1.5; 1.5<h<=2; 2<h<=2.5; 2.5<h<=3; h>3), in two panels: the categorical scale, and the written-in hours and minutes recoded to the same categories.]

Do people answer methods differently?
• The numeric method clearly produces many outliers, as well as very high values that may or may not be outliers.
• To the extent that this is due to confusion of hours and minutes, version C may remedy that problem.
• The distributions of hours with method A and method B (recoded) are similar but not the same:
• Far fewer people watch very little TV with method B (9% versus 4% of 40,355 respondents);
• Numeric method B has more people who watch a lot of TV;
• Numeric method B has a spike at exactly 1 hour for radio and newspaper.
• Overall it is clear that the method has some influence on average over all 40,355 respondents.

Is the difference between methods the same for all respondents?
The same people were asked both versions. This allows us to show variation in answers to the numeric question, within categories of the categorical question.
[Figure: density of the numeric value given (0 to 4 hours), one panel per category of the categorical question: No time at all; Less than 0.5 hour; 0.5 hour to 1 hour; More than 1 hour, up to 1.5 hours; More than 1.5 hours, up to 2 hours; More than 2 hours, up to 2.5 hours; More than 2.5 hours, up to 3 hours.]

• Not only does the method inﬂuence the distribution of
answers,
• the method effect also depends on the person.

Definitions
Traits, Methods, and Persons
• Can imagine the same question (“Trait”) being asked in different ways (“Methods”);
• Can imagine the same method being used to ask different questions;
• Responses to survey questions are then different persons’ answers to Trait-Method combinations.

Definitions
Measurement error model
1 Responses are a measure of some underlying score (“trait”) so that if a person’s memory were erased and the person re-interviewed, they should give a similar answer.
2 Responses are influenced by random variation: errors, such as mistaking minutes for hours, but also variation in information retrieved from memory.
3 The method influences the answers on average, e.g. there might be more social desirability bias in one method than another, the scale may suggest some unspoken norm, etc.
4 Influence of method is different for different people: random variation in the differences between methods.

Deﬁnitions
Modeling measurement error

Definitions
Quasi-equation
Response =
Trait + Trait × Person
(Responses are a measure of some underlying score (“trait”) so that if a person’s memory were erased and the person re-interviewed, they should give a similar answer.)
+ Person × Moment
(Responses are influenced by random variation: errors, such as mistaking minutes for hours, but also variation in information retrieved from memory.)
+ Method + Method × Trait
(The method influences the answers on average, e.g. there might be more social desirability bias in one method than another, the scale may suggest some unspoken norm, etc.)
+ Method × Person
(Influence of method is different for different people: random variation in the differences between methods.)

In compact form:
Response = Trait + Method + Trait × Method + Trait × Person + Method × Person + Person × Moment

Definitions
Interpretation of the model
If persons are a random sample from a population U, consider Person a random factor.
1 The “rest” (residual) variance is called “random measurement error”.
2 The proportion of residual variance in the total is called “unreliability” ($1 - r^2$).
3 The proportion of Method×Person variance is called “common method variance” (sometimes “invalidity”), ($1 - v^2$); for item 5 to hold, this proportion is taken relative to the variance that remains once the residual is removed.
4 The proportion of Trait×Person variance in the total is called the “quality” of the question ($q^2$ or $\kappa$).
5 “Quality” ($q^2$ or $\kappa$) will equal $v^2 \cdot r^2$.
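
The relation between the three proportions can be sketched numerically. The snippet assumes, as in the usual true-score formulation of MTMM models, that v² is the trait share of the non-residual (trait plus method) variance; that reading is an assumption made here so that q² = r² · v² holds exactly. The function name and the variance components are hypothetical.

```python
import math

def quality_coefficients(trait_var, method_var, residual_var):
    """Return (r2, v2, q2) from the three variance components over persons."""
    total = trait_var + method_var + residual_var
    systematic = trait_var + method_var      # everything except the residual
    r2 = systematic / total                  # 1 - unreliability
    v2 = trait_var / systematic              # 1 - invalidity (assumed definition)
    q2 = trait_var / total                   # quality
    return r2, v2, q2

# Hypothetical components: trait 0.6, method 0.1, residual 0.3.
r2, v2, q2 = quality_coefficients(0.6, 0.1, 0.3)
print(round(r2, 3), round(v2, 3), round(q2, 3))
```

Under this definition the product r² · v² reproduces the trait share of the total variance for any choice of components.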

Equation model
$Y_{ijk} = \tau_{ijk} + \eta_{ij} + \xi_{ik} + \epsilon_{ijk}$,
where
i indexes persons;
j indexes traits;
k indexes methods;
and, matching the quasi-equation, $\eta_{ij}$ is the Trait × Person term, $\xi_{ik}$ the Method × Person term, and $\epsilon_{ijk}$ the Person × Moment (residual) term.

Equation with Trait×Method interaction with Trait×Person
$Y_{ijk} = \tau_{ijk} + \lambda_{jk}\,\eta_{ij} + \xi_{ik} + \epsilon_{ijk}$,
with i, j, and k as above.
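
To see how the terms of the model combine, one can simulate responses for a single Trait-Method combination and compare the squared correlation between the response and the trait score with the trait share of the variance. This is an illustrative sketch only: normal distributions, independent terms, the variance values, and all names are assumptions, not part of the talk.

```python
import math
import random

def simulate_responses(n_persons, lam, var_trait, var_method, var_error,
                       tau=0.0, seed=1):
    """Draw (eta, y) pairs with y = tau + lam * eta + xi + eps."""
    rng = random.Random(seed)
    pairs = []
    for _ in range(n_persons):
        eta = rng.gauss(0.0, math.sqrt(var_trait))    # Trait x Person
        xi = rng.gauss(0.0, math.sqrt(var_method))    # Method x Person
        eps = rng.gauss(0.0, math.sqrt(var_error))    # Person x Moment
        pairs.append((eta, tau + lam * eta + xi + eps))
    return pairs

def squared_correlation(pairs):
    """Squared Pearson correlation between the two elements of each pair."""
    n = len(pairs)
    mean_x = sum(x for x, _ in pairs) / n
    mean_y = sum(y for _, y in pairs) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in pairs)
    sxx = sum((x - mean_x) ** 2 for x, _ in pairs)
    syy = sum((y - mean_y) ** 2 for _, y in pairs)
    return sxy ** 2 / (sxx * syy)

pairs = simulate_responses(20000, lam=1.0, var_trait=0.6,
                           var_method=0.1, var_error=0.3)
empirical_quality = squared_correlation(pairs)
theoretical_quality = 0.6 / (0.6 + 0.1 + 0.3)   # lam^2 * var_trait / total
print(empirical_quality, theoretical_quality)
```

In real data the trait score is unobserved, which is why a design with repeated measurements is needed to estimate the same quantity.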

Assumptions in the model
1 The (interaction) effects do not depend on other Method×Trait combinations a person might receive (“no carry-over effects”, “SUTVA”, “independence assumption”).
2 There is no separate Person main effect: Trait and Method within Person already capture all within-person correlation (“method variance is the only systematic variance”, $\mathrm{COV}_U(\epsilon_{ijk}, \xi_{ik}) = 0$ and $\mathrm{COV}_U(\epsilon_{ijk}, \eta_{ij}) = 0$).
Assumption 2 can sometimes be relaxed (Oberski et al. in Salzborn, Davidov & Reinecke (eds), 2012).

The parameters of interest in the model are
• The variance over persons in the Trait effect;
• The variance over persons in the Method effect.
Expressed as proportions of the total variance over persons of $Y_{jk}$, these two quantities equal, respectively,
• The reliability $\kappa_{jk}$ of a question asking Trait j with Method k;
• The correlation between two different questions that is purely due to them being measured with the same method.

Estimation of measurement error with the MTMM design

Design requirements
What design is needed to estimate this model?
• The model suggests that a Person×Method×Trait factorial experiment would allow for the estimation of the reliability and method variance.
• The residual, or “measurement error”, term Person × Moment is estimated by the Person × Trait × Method interaction.

Design requirements
What design is needed to estimate this model?
• A Person×Method×Trait factorial experiment would ask
the same question in different ways (Methods) and use
different methods to ask the same questions, within each
person;
• Campbell and Fiske introduced such designs in 1959
under the name “Multitrait-multimethod” (MTMM)
experiment.
• Not all Trait-Method combinations are necessary, but at
least one repetition within each person is required (Saris,
Satorra & Coenders, 2004).
• Under the model and assumptions 1 and 2, the MTMM
design will provide data that allow for the estimation of the
reliability and method variance (“invalidity”).

Design requirements
Example of an MTMM experiment
On an average weekday, how much time, in total...
T = 1 ...do you spend watching television?
T = 2 ...do you spend listening to the radio?
T = 3 ...do you spend reading the newspapers?
Scales:
M = 1: 8pt (hours)
M = 2: Write in hours and minutes
M = 3: 7pts vague quantiﬁers
Each respondent answered all three questions in two different
ways.
The repetition was given at the end of the interview (after
approximately 50 minutes passed)

Estimation issues
$Y_{ijk} = \tau_{ijk} + \lambda_{jk}\,\eta_{ij} + \xi_{ik} + \epsilon_{ijk}$.
• The model can be estimated with regression (with Person a random factor);
• This is not flexible enough, however: it gives little control over the covariance structure, and $\lambda_{jk}$ cannot be estimated.
• The model can also be recognized as a factor analysis or, more generally, as a structural equation model (SEM),
• or, through transformation, as an IRT or latent class model.
• The SEM framework allows enough flexibility to estimate the parameters of interest: trait, method and residual variance, or $r^2$, $v^2$, and quality $q^2$.

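The model as a SEM (or IRT or latent class) model
[Path diagram: method factors M1, M2, M3 and trait factors T1, T2, T3, with nine observed variables y11, y21, y31, y12, y22, y32, y13, y23, y33; each y_jk is connected to trait factor T_j and method factor M_k.]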
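
The correlations such a model implies between the nine observed variables can be sketched as follows. The sketch assumes standardized variables, method factors that are uncorrelated with each other and with the traits, and a method effect of sqrt(1 − v²) for each question (the usual true-score parameterization); the function name and all coefficient values are made up for illustration.

```python
import math

def mtmm_correlations(r, v, trait_corr):
    """Model-implied correlations; r[j][k] and v[j][k] are coefficients
    for trait j measured with method k, trait_corr[j][j2] the trait correlations."""
    n_traits, n_methods = len(r), len(r[0])
    # Order the variables as in the diagram: y11, y21, y31, y12, ...
    items = [(j, k) for k in range(n_methods) for j in range(n_traits)]
    matrix = []
    for j, k in items:
        row = []
        for j2, k2 in items:
            if (j, k) == (j2, k2):
                row.append(1.0)
                continue
            trait_part = (r[j][k] * v[j][k] * trait_corr[j][j2]
                          * v[j2][k2] * r[j2][k2])
            method_part = 0.0
            if k == k2:  # shared method adds common method variance
                m1 = math.sqrt(1 - v[j][k] ** 2)
                m2 = math.sqrt(1 - v[j2][k2] ** 2)
                method_part = r[j][k] * m1 * m2 * r[j2][k2]
            row.append(trait_part + method_part)
        matrix.append(row)
    return items, matrix

# Hypothetical coefficients: r = 0.9 and v = 0.95 everywhere, traits correlate 0.5.
r = [[0.9] * 3 for _ in range(3)]
v = [[0.95] * 3 for _ in range(3)]
trait_corr = [[1.0 if a == b else 0.5 for b in range(3)] for a in range(3)]
items, implied = mtmm_correlations(r, v, trait_corr)
same_trait_other_method = implied[0][3]   # y11 with y12
other_trait_same_method = implied[0][1]   # y11 with y21
print(same_trait_other_method, other_trait_same_method)
```

Fitting the model amounts to choosing the coefficients so that this implied matrix is as close as possible to the observed one, which is what SEM software does.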

Another example
(From the article “Comparing questions with agree/disagree response options to questions with item-specific response options”)
Table 4: Experiment 2 of round 2

Main questionnaire (“A/D”)
Introduction: Using this card, please tell me how true each of the following statements is about your current job.
Statements:
- There is a lot of variety in my work
- My job is secure
- My health or safety is at risk because of my work
Answer categories: not at all true; a little true; quite true; very true

SC group 1 (IS)
Introduction: The next 3 questions are about your current job.
Statements:
- Please choose one of the following to describe how varied your work is.
- Please choose one of the following to describe how secure your job is.
- Please choose one of the following to say how much, if at all, your work puts your health and safety at risk.
Answer categories: not at all varied; a little varied; quite varied; very varied (same type of response scale using terms secure and safe instead of varied)

SC group 2 (IS)
Statements:
- Please indicate, on a scale of 0 to 10, how varied your work is, where 0 is not at all varied and 10 is very varied.
- Now please indicate, on a scale of 0 to 10, how secure your job is, where 0 is not at all secure and 10 is very secure.
- Please indicate, on a scale of 0 to 10, how much your health and safety is at risk from your work, where 0 is not at all at risk and 10 is very much at risk.
Answer categories: horizontal 11 point scale only labelled at the end points

Source: Révilla, Saris & Krosnick (2010)

Results from another example
Table 5: The mean reliability, validity and quality of the three questions of experiment 2 in Round 2 of the ESS across 10 countries for the different methods (standard deviations in brackets)

          Reliability r²                    Validity v²                       Quality q²
Method    Q1         Q2         Q3         Q1         Q2         Q3         Q1         Q2         Q3
A/D(4)    .65 (.09)  .59 (.18)  .61 (.15)  .99 (.02)  .98 (.03)  .99 (.03)  .64 (.10)  .58 (.18)  .60 (.15)
IS(4)     .80 (.14)  .80 (.13)  .80 (.14)  1 (0)      1 (0)      1 (0)      .80 (.14)  .80 (.13)  .80 (.14)
IS(11)    .81 (.09)  .83 (.11)  .77 (.12)  .98 (.03)  .98 (.03)  .98 (.04)  .80 (.10)  .82 (.12)  .76 (.14)

Excerpts from the article text visible on the slide (cropped at both ends):
“... using a truth scale with the same number of categories for all three questions (around .7 to .9 versus .5 to .6). The position of the IS scale in the supplementary questionnaire is not an issue as the better quality of the IS scale is also observed both when it comes first and when it comes later. Possibly the order of the observations with the different scale types has an impact on the size of the differences since we see fewer differences in this second experiment than in the first, but this may also be linked to the subject matter of the experiments or to other characteristics of the methods used (such as the number of points). More research is needed to determine this, however the important point here is that in different combinations, the superiority of the IS in terms of ...”
“... scale with 11 categories was also better than the IS scale with 4 categories. So, not only might the kind of scale (IS versus A/D) impact the total quality of a measure, but so might the length of the scale (number of response categories). However, it seems that this effect varies across countries.”
“Experiments in Round 3 of the ESS. In round 3 of the ESS again two SB-MTMM experiments have been done which allow the comparison of the IS scales with A/D scales. The attraction of these experiments is that ...”

Quality q²
Method    Q1         Q2         Q3
A/D(4)    .64 (.10)  .58 (.18)  .60 (.15)
IS(4)     .80 (.14)  .80 (.13)  .80 (.14)
IS(11)    .80 (.10)  .82 (.12)  .76 (.14)
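
The means reported in Table 5 can be checked against the relation q² = r² · v². The snippet below only re-uses the numbers from the table; the variable names are arbitrary. Small gaps are to be expected, since the table reports rounded averages over countries, and an average of products need not equal the product of averages.

```python
# Mean r2, v2 and q2 per method for Q1, Q2, Q3, copied from Table 5.
r2 = {"A/D(4)": [.65, .59, .61], "IS(4)": [.80, .80, .80], "IS(11)": [.81, .83, .77]}
v2 = {"A/D(4)": [.99, .98, .99], "IS(4)": [1.0, 1.0, 1.0], "IS(11)": [.98, .98, .98]}
q2 = {"A/D(4)": [.64, .58, .60], "IS(4)": [.80, .80, .80], "IS(11)": [.80, .82, .76]}

gaps = {}
for method in r2:
    # Absolute difference between r2 * v2 and the reported q2, per question.
    gaps[method] = [abs(r * v - q)
                    for r, v, q in zip(r2[method], v2[method], q2[method])]

max_gap = max(max(values) for values in gaps.values())
print(max_gap)
```

For every method and question the product agrees with the reported quality to within 0.01.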

• It looks like there is much more measurement error (residual variance) in the agree-disagree questions than there is in the item-specific scales.
• This was true over all countries (shown is the average over countries).
• It remains to be seen whether the same would be found with other topics, under other conditions, and with other combinations of methods.

Are some types of questions better than others?

• The examples given so far come from a much larger series of MTMM experiments;
• In the European Social Survey (ESS), about six MTMM experiments are done every round;
• So far there have been five rounds (2002, 2004, 2006, 2008, and 2010).
• The experiments are done in 20-30 European countries every two years;
• Effective sample size per country is at least 1500.
• Each experiment usually estimates the quality for 9 questions (Method-Trait combinations).
• The range of topics is reasonably diverse, though factual questions are underrepresented.
• In total about 5000 questions are available, but only 3000 of those will be used here for various reasons.
• In addition to the ESS, an older series of experiments also exists (F. Andrews; Költringer; Saris; Billiet, 1990s).
• These add another 1089 questions for which reliability and validity coefficients are estimated.
• Combining the two datasets (ESS question qualities and old experiment qualities), we created a database of 3011 questions with their reliability and validity estimates.

Reliability and validity estimates of 3011 questions
[Figure: two histograms (frequency): the reliability coefficient, on an axis from 0.4 to 1.0, and the validity coefficient, on an axis from 0.2 to 1.0.]

Logit transform of reliability and validity estimates
[Figure: two histograms (frequency): the logit of the reliability coefficient and the logit of the validity coefficient, both on an axis from 0 to 6.]
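
For reference, the logit transform and its inverse can be sketched as below. The sketch assumes the transform is applied to the coefficient itself, as the axis labels suggest; the function names and the example value are illustrative.

```python
import math

def logit(p):
    """Map a coefficient in (0, 1) to the whole real line."""
    return math.log(p / (1 - p))

def inv_logit(x):
    """Map a value on the logit scale back to (0, 1)."""
    return 1 / (1 + math.exp(-x))

coefficient = 0.9
on_logit_scale = logit(coefficient)
back = inv_logit(on_logit_scale)
print(on_logit_scale, back, inv_logit(2.0))
```

A value of 2 on the logit scale corresponds to a coefficient of about 0.88, so predictions made on the logit scale always map back into the admissible range.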

Coding design characteristics of the 3011 questions
• For each of the 3011 questions in all countries, a team of coders coded 40 design characteristics of the question;
• Some codes were automatically generated by Natural Language Processing software (syllables, words, etc.).
• Coders were students, assistants to the local coordinators of the ESS, and two experts;
• For the English source version, experts double-coded questions independently, then created consensus codes;
• Non-expert codes were quality-controlled by detailed comparison with consensus codes for the English source;
• In a meeting between the experts and each other coder, the discrepancies were discussed and either corrected or left in as true differences.

• absolute
• avgabs intro
• avgabs total
• avgsy total
• avgwrd intro
• avgwrd total
• balance
• centrality
• computer.assisted
• concept
• country
• domain
• dont know
• encourage
• ﬁxrefpoints
• form basic
• future
• labels
• instr interv
• instr respon
• interviewer
• intr request
• intropresent
• knowledge
• labels gramm
• labels order
• language
• motivation
• opinionother
• past
• position
• questiontype
• scal neutral
• scale basic
• scale corres
• scale trange
• scale urange
• showc boxes
• showc horiz
• showc letter
• showc over
• showc quest
• showc start
• socdesir
• stimulus
• subjectiveop
• symmetry
• used WH word
• usedshowcard
• visual
• from

Domain of question # questions
Internatl politics 64
Health 190
Living conditions 453
Other beliefs 292
Work 469
Personal relations 320
Consumer behavior 34
Leisure activts 131
National gvt 141
Institutions 284
Political parties 30
Trade unions 12
Economy 237
Other 354

Concept of question # questions
Evaluative belief 713
Feeling 903
Importance 96
Expectation 39
Facts, behavior 63
Judgement 123
Relationship 8
Evaluation 704
Norm 57
Policy 250
Right 4
Action tendency 51

Meta-analysis dataset
• For each of the 3011 questions, we have in the database:
• The estimated quality (reliability and validity coefficients)
• About 50 design characteristics (through hand- and automatic coding)
• The next step was to relate the design characteristics to the quality estimates:
• Can the quality estimates be predicted from the design characteristics?

Meta-analysis
• Prediction by random forests of regression trees (Breiman 2001);
• Two separate models: one for the validity coefficients and one for the reliability coefficients;
• Missing data are multiply imputed using the MICE algorithm (van Buuren & Groothuis-Oudshoorn 2011).

Example regression tree for logit(reliability coefficient)
[Figure: regression tree. The root node has mean 1.955 (n=1988) and first splits on domain into nodes with mean 1.724 (n=1303) and mean 2.394 (n=685). Further splits use domain, gradation, position, concept, and ncategories. Node means range from 0.4959 (n=36) to 3.364 (n=54).]

Meta-analysis with random forests
• R² based on out-of-bag (crossvalidation) mean square error is 85% for the validity coefficient and 60% for the reliability coefficient.
• Importance measures indicate domain, number of categories, concept, position in the questionnaire, number of syllables, country, number of words, fixed reference points, and other linguistic complexity measures are the most influential for reliability.
• For validity, in addition to the above, order of the labels (positive-negative), centrality of the trait and other characteristics are also important.
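
An out-of-bag R² is conventionally computed as below; whether the figures above were obtained in exactly this way is an assumption. Here $y_i$ is the (logit) coefficient of question $i$, $\bar{y}$ its mean over the $N$ questions, and $\hat{y}_i^{\mathrm{oob}}$ the average prediction of only those trees whose bootstrap sample did not contain question $i$.

```latex
R^2_{\mathrm{oob}}
  = 1 - \frac{\frac{1}{N}\sum_{i=1}^{N}\bigl(y_i - \hat{y}_i^{\mathrm{oob}}\bigr)^2}
             {\frac{1}{N}\sum_{i=1}^{N}\bigl(y_i - \bar{y}\bigr)^2}
```

Because each question is predicted only by trees that never saw it, the measure behaves like a cross-validated rather than an in-sample R².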
