Reliability and Validity
Chong Ho Yu, Ph.D.
Conventional views of reliability (AERA et al., 1985)
 Temporal stability: administering the same form of a test on two or more separate occasions
to the same group of examinees (test-retest). On many occasions this approach
is not practical, because repeated measurements are likely to change the
examinees. For example, examinees may adapt to the test format and thus tend
to score higher on later tests. Hence, careful implementation of the test-retest
approach is strongly recommended (Yu, 2005). A minimal computational sketch follows.
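As an illustration (not from the original article), the test-retest coefficient is simply the Pearson correlation between the two administrations; the scores below are hypothetical:

```python
import numpy as np

# Hypothetical scores for the same ten examinees on two occasions
time1 = np.array([72, 85, 90, 64, 78, 88, 70, 95, 60, 82])
time2 = np.array([75, 83, 92, 66, 80, 85, 73, 94, 63, 84])

# Test-retest reliability: Pearson correlation between the two occasions
r_tt = np.corrcoef(time1, time2)[0, 1]
print(f"test-retest reliability: {r_tt:.2f}")
```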
 Form equivalence: administering two different forms of a test, based on the same content, on
one occasion to the same examinees (alternate forms). Once alternate forms
have been developed, they can be used for different examinees. This is very common
in high-stakes examinations for pre-empting cheating: an examinee who took
Form A earlier cannot share the test items with another student who might
take Form B later, because the two forms have different items.
 Internal consistency: the coefficient of test scores obtained from a single test
or survey (Cronbach's alpha, KR-20, split-half). For instance, suppose
respondents are asked to rate statements in an attitude survey about computer
anxiety. One statement is "I feel very negative about computers in general."
Another is "I enjoy using computers." People who strongly agree
with the first statement should strongly disagree with the second,
and vice versa. If several respondents rate both statements high, or both low,
the responses are said to be inconsistent and patternless. The same
principle can be applied to a test: when no pattern is found in the students'
responses, the test is probably too difficult and students are simply guessing at
random. A sketch of computing Cronbach's alpha follows.
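A minimal sketch of Cronbach's alpha using hypothetical ratings (in a real survey, negatively worded items such as the first statement would be reverse-scored before computing):

```python
import numpy as np

def cronbach_alpha(items):
    """Cronbach's alpha for an (n_respondents x k_items) score matrix."""
    items = np.asarray(items, dtype=float)
    k = items.shape[1]
    item_vars = items.var(axis=0, ddof=1)      # variance of each item
    total_var = items.sum(axis=1).var(ddof=1)  # variance of total scores
    return (k / (k - 1)) * (1 - item_vars.sum() / total_var)

# Hypothetical 5-point ratings from six respondents on four items
ratings = np.array([
    [4, 5, 4, 5],
    [2, 1, 2, 1],
    [5, 4, 5, 4],
    [3, 3, 2, 3],
    [1, 2, 1, 2],
    [4, 4, 5, 5],
])
print(f"alpha = {cronbach_alpha(ratings):.2f}")
```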

 Reliability is a necessary but not sufficient condition
for validity. For instance, if the needle of the scale sits
five pounds past zero, I always over-report my
weight by five pounds. Is the measurement consistent?
Yes, but it is consistently wrong! Is the measurement
valid? No! (But if it under-reported my weight by five
pounds, I would consider it a valid measurement.) A toy simulation of this point follows.
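A toy simulation (my illustration, with made-up weights) makes the point concrete: the readings correlate perfectly with the true weights, so the measure is perfectly reliable, yet every reading is five pounds off.

```python
import numpy as np

true_weight = np.array([150.0, 162.0, 141.0, 175.0, 158.0])
reading = true_weight + 5.0  # the needle starts five pounds past zero

# Perfectly consistent: readings correlate 1.0 with the true weights
print(np.corrcoef(true_weight, reading)[0, 1])  # 1.0
# But systematically wrong: every reading is biased by +5 pounds
print((reading - true_weight).mean())           # 5.0
```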
 Performance, portfolio, and responsive evaluations, where the tasks vary
substantially from student to student and where multiple tasks may be
evaluated simultaneously, are attacked for lacking reliability. One of the
difficulties is that there is more than one source of measurement error in
performance assessment. For example, the reliability of a writing-skill test score
is affected by the raters, the mode of discourse, and several other factors
(Parkes, 2000).
 Replications as unification: Users may be confused by the diversity of
reliability indices. Nevertheless, different types of reliability measures share a
common thread: What constitutes a replication of a measurement procedure?
(Brennan, 2001). Take internal consistency as an example. This measure is popular
because the reliability index can conveniently be computed from data
collected on a single occasion. However, the ultimate inference should go beyond
one single testing occasion to others (Yu, 2005). In other words, any
procedure for estimating reliability should attempt to mirror a result based
upon full-length replications.
Conventional views of validity (Cronbach, 1971)
 Face validity: Face validity simply means validity at face value. As a check
on face validity, test/survey items are sent to teachers to obtain suggestions for
modification. Because of its vagueness and subjectivity, psychometricians
abandoned this concept long ago. However, outside the measurement
arena, face validity has come back in another form. In discussing the
validity of a theory, Lacity and Jansen (1994) define validity as making
common sense, and being persuasive and seeming right to the reader. For
Polkinghorne (1988), the validity of a theory refers to results that have the
appearance of truth or reality.
The internal structure of things may not concur with their appearance, and
professional knowledge is often counter to common sense. The criteria of validity
in research should go beyond "face," "appearance," and "common sense."
 Content validity: draw an inference from test scores to a larger domain of
items similar to those on the test. Content validity is concerned with
sample-population representativeness, i.e., the knowledge and skills covered by
the test items should be representative of the larger domain of knowledge and skills.
For example, computer literacy includes skills in operating systems, word
processing, spreadsheets, databases, graphics, the Internet, and many others.
However, it is difficult, if not impossible, to administer a test covering all
aspects of computing. Therefore, only several tasks are sampled from the
population of computer skills.
Content validity is usually established by content experts. Take computer
literacy as an example again: a test of computer literacy should be written or
reviewed by computer science professors, because it is assumed that computer
scientists know what is important in their discipline. At first glance,
this approach looks similar to the validation process of face validity, yet
there is a difference: in content validity, evidence is obtained by looking for
agreement in judgments among judges. In short, face validity can be established by
one person, but content validity should be checked by a panel.
However, this approach has some drawbacks. First, experts tend to take their
knowledge for granted and forget how little other people know; it is not
uncommon for tests written by content experts to be extremely difficult.
Second, content experts often fail to identify the learning objectives of a
subject. Take the following question in a philosophy test as an example:
What is the time period of the philosopher Epicurus?
a. 341-270 BC
b. 331-232 BC
c. 280-207 BC
d. None of the above
This type of question tests the ability to memorize historical facts, not to
philosophize. The content expert may argue that "historical facts" are
important for a student's further understanding of philosophy. Let's change the
subject to computer science and statistics and look at the following two questions:
When was William Gates III, the founder and CEO of Microsoft, born?
a. 1949
b. 1953
c. 1957
d. None of the above
Which of the following statements is true about ANOVA?
a. It was invented by R. A. Fisher in 1914
b. It was invented by R. A. Fisher in 1920
c. It was invented by Karl Pearson in 1920
d. None of the above
Any computer scientist or statistician would be hard pressed to accept that
the above questions fulfill content validity. As a matter of fact, this
memorization approach is a common practice among instructors.
Further, sampling knowledge from a larger domain of knowledge involves
subjective values. For example, a test on art history may include many
questions on oil paintings but fewer questions on watercolor paintings and
photography, because of the perceived importance of oil paintings in art history.
Content validity is sample-oriented rather than sign-oriented. A behavior is
viewed as a sample when it is a subgroup of the same kind of behaviors. On the
other hand, a behavior is considered a sign when it is an indicator or a proxy of a
construct (Goodenough, 1949). Construct validity and criterion validity, which
will be discussed later, are sign-oriented because both indicate
behaviors different from those on the test.
 Criterion: draw an inference from test scores to performance. A high score on
a valid test indicates that the test taker has met the performance criteria.
Regression analysis can be applied to establish criterion validity: an
independent variable serves as the predictor variable and a dependent
variable as the criterion variable. The correlation coefficient between them is
called the validity coefficient.
For instance, scores on a driving test by simulation are the predictor variable,
while scores on the road test are the criterion variable. It is hypothesized that if
the tester passes the simulation test, he or she should meet the criterion of being a
safe driver. In other words, if the simulation test scores can predict the road
test scores in a regression model, the simulation test is claimed to have a high
degree of criterion validity (see the sketch following this bullet).
In short, criterion validity is about prediction rather than explanation.
Prediction is concerned with non-causal or mathematical dependence, whereas
explanation pertains to causal or logical dependence. For example, one can
predict the weather from the height of the mercury inside a barometer; thus,
the height of the mercury could satisfy criterion validity as a predictor.
However, one cannot explain why the weather changes by appealing to the change in
mercury height. Because of this limitation of criterion validity, an evaluator has
to conduct construct validation.
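A sketch of criterion validation with hypothetical simulator and road-test scores; the validity coefficient is the correlation between predictor and criterion:

```python
from scipy.stats import linregress

# Hypothetical scores: driving-simulator test (predictor) and
# road test (criterion) for eight test takers
sim_scores = [55, 68, 72, 80, 60, 90, 75, 85]
road_scores = [58, 70, 69, 84, 63, 88, 78, 82]

fit = linregress(sim_scores, road_scores)
print(f"validity coefficient r = {fit.rvalue:.2f}")
print(f"predicted road score = {fit.intercept:.1f} + {fit.slope:.2f} * sim score")
```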
 Construct: draw an inference from test scores to a psychological construct.
Because it is concerned with abstract and theoretical constructs, construct
validity is also known as theoretical construct validity.
According to Hunter and Schmidt (1990), construct validity is a
quantitative question rather than a qualitative distinction such as "valid"
or "invalid"; it is a matter of degree.
Construct validity can be measured by the correlation between
the intended independent variable (the construct) and the proxy independent
variable (the indicator or sign) that is actually used.
For example, suppose an evaluator wants to study the relationship between general
cognitive ability and job performance but cannot administer a cognitive test to
every subject. In this case, a proxy variable such as "amount of education" can
serve as an indirect indicator of cognitive ability. After administering a
cognitive test to a portion of the subjects and finding a strong correlation
between general cognitive ability and amount of education, the evaluator can use
the latter for the larger group because its construct validity has been
established (a sketch follows).
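A hypothetical sketch of that validation step: correlate the proxy with the construct in the subsample that could be tested directly (all values invented for illustration).

```python
import numpy as np

# Validation subsample: cognitive test scores and years of education
# for the subjects who could be tested directly
cognitive = np.array([95, 110, 102, 120, 88, 115, 105, 99])
education = np.array([12, 16, 14, 18, 10, 17, 15, 13])

r = np.corrcoef(cognitive, education)[0, 1]
print(f"construct-proxy correlation: {r:.2f}")
# If r is strong, years of education can stand in for cognitive
# ability in the larger group where direct testing is impractical.
```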
Other authors (e.g., Angoff, 1988; Cronbach & Quirk, 1976) argue that construct
validity cannot be expressed in a single coefficient; there is no mathematical
index of construct validity. Rather, the nature of construct validity is qualitative.
There are two types of indicators:
o Reflective indicator: the effect of the construct.
o Formative indicator: the cause of the construct.
When an indicator is expressed in terms of multiple items of an
instrument, factor analysis is used for construct validation (a simulated sketch follows).
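A simulated sketch of factor-analytic construct validation (hypothetical data; scikit-learn's FactorAnalysis is one of several tools that could be used): six items generated from two latent constructs should show a clean two-factor loading pattern.

```python
import numpy as np
from sklearn.decomposition import FactorAnalysis

rng = np.random.default_rng(0)

# Simulate 200 respondents on six items driven by two latent constructs:
# items 0-2 load on factor 1, items 3-5 on factor 2 (all values hypothetical)
latent = rng.normal(size=(200, 2))
true_loadings = np.array([[0.8, 0.0], [0.7, 0.0], [0.9, 0.1],
                          [0.0, 0.8], [0.1, 0.7], [0.0, 0.9]])
items = latent @ true_loadings.T + rng.normal(scale=0.4, size=(200, 6))

# Recovering a clean two-factor loading pattern supports the intended structure
fa = FactorAnalysis(n_components=2).fit(items)
print(np.round(fa.components_, 2))
```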
Test bias is a major threat to construct validity, and therefore test bias
analyses should be employed to examine the test items (Osterlind, 1983).
The presence of test bias definitely affects the measurement of the
psychological construct. However, the absence of test bias does not guarantee
that the test possesses construct validity. In other words, the absence of test bias
is a necessary, but not a sufficient, condition.
 Construct validation as unification: The criterion and content models
tend to be empirically oriented, while the construct model is inclined to be
theoretical. Nevertheless, all models of validity require some form of
interpretation: What is the test measuring? Can it measure what it intends to
measure? In standard scientific inquiry, it is important to formulate an
interpretative (theoretical) framework clearly and then subject it to empirical
challenges. In this sense, theoretical construct validation can be considered
a unified framework for validity (Kane, 2001).
A modified view of reliability (Moss, 1994)
 There can be validity without reliability if reliability is defined as consistency
among independent measures.
 Reliability is an aspect of construct validity. As assessment becomes less
standardized, distinctions between reliability and validity blur.
 In many situations, such as searching for faculty candidates or conferring graduate
degrees, committee members are not trained to agree on a common set of criteria
and standards.
 Inconsistency in students' performance across tasks does not invalidate the
assessment. Rather it becomes an empirical puzzle to be solved by searching
for a more comprehensive interpretation.
 Initial disagreement (e.g., among students, teachers, and parents in responsive
evaluation) would not invalidate the assessment. Rather it would provide an
impetus for dialog.
Li (2003) argued that the preceding view is incorrect:
 Reliability should be defined in terms of classical test
theory: the squared correlation between observed and true scores, or the
proportion of true-score variance in obtained test scores (expressed symbolically after this list).
 Reliability is a unitless measure, and thus it is already model-free or standard-free.
 It has been a tradition that multiple factors are introduced into a test to improve
validity, even though doing so decreases internal-consistency reliability.
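In standard classical-test-theory notation (my rendering; Li's article should be consulted for his exact formulation), with observed score X = T + E and a parallel form X', the two definitions coincide:

```latex
\rho_{XX'} \;=\; \frac{\sigma_T^2}{\sigma_X^2} \;=\; \rho_{XT}^2
```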
An extended view of Moss's reliability (Mislevy, 2004)
 Inspired by Moss, Mislevy went further to ask whether there can be
reliability without reliability (indices).
 By blending psychometrics with hermeneutics, a holistic and
integrative approach that understands the whole in light of its parts,
Mislevy urged psychometricians to think about what they intend to make
inferences about.
 In many cases we don't present just one argument; rather, problem solving
involves arguments or chains of reasoning supported by a mass of evidence.
 Off-the-shelf inferential machinery (e.g., computing reliability indices) may fail if
we quantify things or tasks that we don't know much about.
 Probability-based reasoning for more complex assessments grounded in
cognitive psychology is needed.
A radical view of reliability (Thompson, 2003)
 Reliability is not a property of the test; rather, it is a property of
the data. Thus, psychometrics is "datametrics."
 Tests themselves are not reliable, so it is important to explore reliability in virtually all
studies.
 Reliability generalization, which can be used in a meta-analysis application
similar to validity generalization, should be implemented to assess variance in
measurement error across studies.
An updated perspective of reliability (Cronbach, 2004)
In a 2004's article, Lee Cronbach, the inventor of Cronbach Alpha as a way of
measuring reliability, reviewed the historical development of Cronbach Alpha. He
asserted, "I no longer regard the formula (of Cronbach Alpha) as the most appropriate
way to examine most data. Over the years, my associates and I developed the complex
generaliability (G) theory" (p. 403). Discussion of the G theory is beyond the scope of
this document. Nevertheless, Cronbach did not object use of Cronbach Alpha but he
recommended that researchers should take the following into consideration while
employing this approach:
 Standard error of measurement: This, rather than a reliability coefficient, is the
most important piece of information to report about an instrument (see the sketch after this list).
 Independence of sampling
 Heterogeneity of content
 How the measurement will be used: Decide whether future uses of the
instrument are likely to be exclusively for absolute decisions, for differential
decisions, or both.
 Number of conditions for the test
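A minimal sketch of the standard classical-test-theory formula for the standard error of measurement, SEM = SD(X) * sqrt(1 - reliability); the scores and the reliability value below are hypothetical:

```python
import numpy as np

def sem(observed_scores, reliability):
    """Standard error of measurement: SD of observed scores times
    sqrt(1 - reliability), per the standard CTT formula."""
    sd = np.std(observed_scores, ddof=1)
    return sd * np.sqrt(1.0 - reliability)

scores = [72, 85, 90, 64, 78, 88, 70, 95, 60, 82]
print(f"SEM = {sem(scores, reliability=0.85):.2f} score points")
```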
A critical view of validity (Pedhazur & Schmelkin, 1991)
 Content validity is not a type of validity at all because validity refers to
inferences made about scores, not to an assessment of the content of an
instrument.
 The very definition of a construct implies a domain of content. There is no
sharp distinction between test content and test construct.
A modified view of validity (Messick, 1995)
The conventional view (content, criterion, construct) is fragmented and incomplete,
especially because it fails to take into account both evidence of the value implications
of score meaning as a basis for action and the social consequences of score use.
Validity is not a property of the test or assessment, but rather of the meaning of the
test scores.
 Content: evidence of content relevance, representativeness, and technical
quality
 Substantive: theoretical rationale
 Structural: the fidelity of the scoring structure
 Generalizability: generalization to the population and across populations
 External: applications to multitrait-multimethod comparison
 Consequential: bias, fairness, and justice; the social consequences of the
assessment for society
Critics argued that consequences should not be a component of validity because test
developers should not be held responsible for the consequences of misuse;
accountability should lie with the misuser. Messick (1998) counter-argued that the social
consequences of score interpretation include the value implications of the construct
label, which may or may not be commensurate with the construct's trait implications, and
these need to be addressed in appraising score meaning. While test developers should not be
held accountable for the misuse of tests, they should still pay attention to the unanticipated
consequences of legitimate score interpretation.
A different view of reliability and validity (Salvucci, Walter, Conley, Fink, &
Saba, 1997)
Some scholars argue that the traditional view that "reliability is a necessary but not a
sufficient condition of validity" is incorrect. This school of thought conceptualizes
reliability as invariance and validity as unbiasedness. A sample statistic may have an
expected value over samples equal to the population parameter (unbiasedness), yet
have very high variance because of a small sample size. Conversely, a sample statistic can
have very low sampling variance but an expected value far from the
population parameter (high bias). In this view, a measure can be unreliable (high
variance) but still valid (unbiased).
[Figure: population parameter (red line) = expected value of the sample statistic (yellow line), with high sampling variance (green curve) --> unbiased: unreliable but valid.]
[Figure: population parameter (red line) <> expected value of the sample statistic (yellow line), with low sampling variance (green curve) --> biased: invalid but reliable.]
A small simulation of both cases follows.
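This sketch (my illustration, with an arbitrary population parameter of 100) reproduces the two panels above: estimator A is unbiased but noisy, estimator B is precise but biased.

```python
import numpy as np

rng = np.random.default_rng(42)
mu = 100.0  # population parameter

# Unbiased but high-variance estimator (unreliable but valid)
est_a = rng.normal(loc=mu, scale=15.0, size=10_000)
# Biased but low-variance estimator (invalid but reliable)
est_b = rng.normal(loc=mu + 10.0, scale=1.0, size=10_000)

for name, est in [("A (unreliable but valid)", est_a),
                  ("B (invalid but reliable)", est_b)]:
    print(f"{name}: bias = {est.mean() - mu:+.2f}, SD = {est.std():.2f}")
```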
Caution and advice
There is a common misconception that someone who adopts a validated instrument
does not need to check its reliability and validity with his or her own data.
Imagine this: I buy a drug that has been approved by the FDA, and when my friend asks
me whether it heals me, I reply, "I am taking a drug approved by the FDA, and
therefore I don't need to know whether it works for me!" A responsible
evaluator should still check the instrument's reliability and validity with his or her own
subjects and make modifications if necessary.
Low reliability is less detrimental in a performance pretest. In a pretest, where
subjects have not been exposed to the treatment and are thus unfamiliar with the subject
matter, low reliability caused by random guessing is expected. One easy way to
mitigate this problem is to include "I don't know" among the multiple choices. In an
experimental setting where students' responses do not affect their final grades, the
experimenter should explicitly instruct students to choose "I don't know" instead of
guessing when they really don't know the answer. Low reliability signals high
measurement error, which reflects a gap between what students actually know and
the scores they receive; the "I don't know" option helps close this gap.
Last Updated: 2008
References
American Educational Research Association, American Psychological Association, &
National Council on Measurement in Education. (1985). Standards for educational
and psychological testing. Washington, DC: Authors.
Angoff, W. H. (1988). Validity: An evolving concept. In H. Wainer & H. I. Braun
(Eds.), Test validity. Hillsdale, NJ: Lawrence Erlbaum.
Brennan, R. (2001). An essay on the history and future of reliability from the
perspective of replications. Journal of Educational Measurement, 38, 295-317.
Cronbach, L. J. (1971). Test validation. In R. L. Thorndike (Ed.). Educational
Measurement (2nd Ed.). Washington, D. C.: American Council on Education.
Cronbach, L. J. (2004). My current thoughts on Coefficient Alpha and successor
procedures. Educational and Psychological Measurement, 64, 391-418.
Cronbach, L. J. & Quirk, T. J. (1976). Test validity. In International Encyclopedia of
Education. New York: McGraw-Hill.
Goodenough, F. L. (1949). Mental testing: Its history, principles, and
applications. New York: Rinehart.
Hunter, J. E., & Schmidt, F. L. (1990). Methods of meta-analysis: Correcting error
and bias in research findings. Newbury Park, CA: Sage Publications.
Kane, M. (2001). Current concerns in validity theory. Journal of Educational
Measurement, 38, 319-342.
Lacity, M., & Jansen, M. A. (1994). Understanding qualitative data: A framework of
text analysis methods. Journal of Management Information Systems, 11, 137-160.
Li, H. (2003). The resolution of some paradoxes related to reliability and
validity. Journal of Educational and Behavioral Statistics, 28, 89-95.
Messick, S. (1995). Validity of psychological assessment: Validation of inferences
from persons' responses and performances as scientific inquiry into score
meaning. American Psychologist, 50, 741-749.
Messick, S. (1998). Test validity: A matter of consequence. Social Indicators
Research, 45, 35-44.
Mislevy, R. (2004). Can there be reliability without reliability? Journal of Educational
and Behavioral Statistics, 29, 241-244.
Moss, P. A. (1994). Can there be validity without reliability? Educational Researcher,
23, 5-12.
Osterlind, S. J. (1983). Test item bias. Newbury Park: Sage Publications.
Parkes, J. (2000). The relationship between the reliability and cost of performance
assessments. Education Policy Analysis Archives, 8. [On-line] Available
URL: http://epaa.asu.edu/epaa/v8n16/
Pedhazur, E. J., & Schmelkin, L. P. (1991). Measurement, design, and analysis: An
integrated approach. Hillsdale, NJ: Lawrence Erlbaum Associates.
Polkinghorne, D. E. (1988). Narrative knowing and the human sciences. Albany: State
University of New York Press.
Salvucci, S., Walter, E., Conley, V., Fink, S., & Saba, M. (1997). Measurement error
studies at the National Center for Education Statistics. Washington, DC: U.S.
Department of Education.
Thompson, B. (Ed.) (2003). Score reliability: Contemporary thinking on reliability
issues. Thousand Oaks: Sage.
Yu, C. H. (2005). Test-retest reliability. In K. Kempf-Leonard (Ed.). Encyclopedia of
Social Measurement, Vol. 3 (pp. 777-784). San Diego, CA: Academic Press.
Questions for discussion
Pick one of the following cases and determine whether the test or the assessment is
valid. Apply the concepts of reliability and validity to the situation. These cases may
be remote from your cultural context; you may use your own example instead.
1. In ancient China, candidates for government office had to take an
examination on literature and moral philosophy, rather than public
administration.
2. Before July 1, 1997, when Hong Kong was a British colony, Hong Kong
doctors, including specialists, who graduated from non-Commonwealth
medical schools had to take a general medical examination covering all general
areas in order to be certified.