Practical Language Testing
Fulcher (2010)
Two paradigms in educational
measurement and language testing
1) Norm-referenced testing: The meaning of a score on
a test is derived from the position of an individual in
relation to the group. It discriminates between test takers
and separates them out (i.e., distributes them) very effectively.
Decision making with norm-referenced tests involves
value judgments about the meaning of scores in terms
of the intended effect of the test.
Two paradigms in educational measurement
and language testing (Cont.)
• Criterion-referenced testing: The aim is to make a decision
about whether an individual test taker has achieved a pre-
specified criterion, or standard, that is required for a particular
decision context.
What is a standardized test?
• A standardized test is a form of NRT that
1) requires all test takers to answer the same questions, or a
selection of questions from a common bank of questions,
in the same way;
2) is scored in a “standard” or consistent manner, which
makes it possible to compare the relative performance of
individual students or groups of students.
• The term is primarily associated with large-scale tests
administered to large populations of students.
Why testing is viewed as a ‘science’
• The early scientific use of tests was initiated by the
introduction of statistical analysis into the testing field during
the First World War.
• Greenwood (1919): “When you can measure what you
are speaking about and express it in numbers, you know
something about it, but when you cannot measure it,
when you cannot express it in numbers, your knowledge
is of a meagre and unsatisfactory kind” (p. 186)
• Fulcher (2010): “tests, like scientific instruments,
provide the means by which we can observe and
measure consistencies in human ability”.
Why testing is viewed as a ‘science’ (Cont.)
Shohamy (2001): “Testing is perceived as a scientific
discipline because it is experimental, statistical and uses
numbers. It therefore enjoys the prestige granted to science
and is viewed as objective, fair, true and trustworthy” (p.
21), qualities that are key features of the “power of testing”.
Lippmann (1922): Strong trait theory is untenable. In fact,
most of the traits or constructs that we work with are
extremely difficult to define, and if we are not able to
define them, measurement is even more problematic.
The curve and score meaning
• In NRT, the meaning of a score is directly related to its place in
the curve of the distribution (or a bell curve) from which it is
drawn.
[Figure: the normal curve, marked from -3SD to +3SD around the mean at 0]
Central tendency
Central tendency: The most typical behavior of the
group
• Mode: The score that occurs most frequently
(Bimodal: two peaks; Trimodal: three peaks)
• Median: The point below which 50 percent of the
scores fall and above which 50 percent fall.
• Midpoint: The point halfway between the highest
score and the lowest score on the test ((high + low) / 2)
• Mean: The arithmetic average of the scores, Mean = ΣX / N
(The midpoint for NRT is the mean.)
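A minimal Python sketch (not from the book) of these four indicators of central tendency, using the score set from the worked example a few slides below:

```python
from statistics import mean, median, multimode

scores = [77, 75, 72, 72, 70, 65, 66]        # scores from the worked example below

mode_values = multimode(scores)              # most frequent score(s); two values => bimodal, three => trimodal
med = median(scores)                         # 50% of the scores fall below this point, 50% above
midpoint = (max(scores) + min(scores)) / 2   # (high + low) / 2
avg = mean(scores)                           # sum of all scores / number of scores

print(mode_values, med, midpoint, avg)       # [72] 72 71.5 71
```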
Dispersion
Dispersion: How the individual performances vary from the central
tendency.
• Range: The number of points between the highest score and the
lowest one, plus 1.
• Standard deviation (SD): A sort of average of the differences of
all scores from the mean (the square root of the sum of the
squared deviation scores, divided by N - 1).
Deviation score: The score obtained by subtracting the
mean from each individual score (x = X - M). (The mean of
these deviation scores is always zero.)
Dispersion (Cont.)
SD formula: SD = √( Σ(X - M)² / (N - 1) )
(N - 1 for a sample,
N for a population)
SD is better than the range since it is the result of an
averaging process and lessens the effect of extreme
scores not attributable to performance on the test.
Variance: The squared value of SD
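A short sketch (mine, not the book's) of these dispersion statistics, with the N - 1 (sample) and N (population) variants noted above; the last lines reproduce the worked example that follows:

```python
import math

def deviation_scores(scores):
    m = sum(scores) / len(scores)
    return [x - m for x in scores]                            # deviation scores always sum (and average) to zero

def standard_deviation(scores, sample=True):
    divisor = len(scores) - 1 if sample else len(scores)      # N - 1 for a sample, N for a population
    return math.sqrt(sum(d ** 2 for d in deviation_scores(scores)) / divisor)

def variance(scores, sample=True):
    return standard_deviation(scores, sample) ** 2            # variance = SD squared

def score_range(scores):
    return max(scores) - min(scores) + 1                      # highest - lowest + 1

scores = [77, 75, 72, 72, 70, 65, 66]
print(score_range(scores))                                    # 12
print(standard_deviation(scores, sample=False))               # ~4.07 (the slides round this to 4)
```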
Example
Score   Mean   X - M   (X - M)²
77      71      6       36
75      71      4       16
72      71      1        1
72      71      1        1
70      71     -1        1
65      71     -6       36
66      71     -5       25
Central tendency:
Mode = 72
Median = 72
Midpoint = (77 + 66)/2 = 71.5
Mean = (77 + 75 + 72 + 72 + 70 + 65 + 66)/7 = 71
Dispersion:
Range = 77 - 66 + 1 = 12
SD = √((36 + 16 + 1 + 1 + 1 + 36 + 25)/7) = √16.57 ≈ 4
Variance = s² = 4² = 16
Example (cont.) (with raw score)
In the normal curve, the mean, mode, midpoint, and median are all the
same.
Score 76: 50% + 34.13% = 84.13% (Percentile: the total percentage
of students who scored at or below a given point in a normal
distribution)
[Figure: the normal curve for this example, mean = 72, SD = 4, with scores 60, 64, 68, 72, 76, 80, 84 marked at one-SD intervals]
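To illustrate the percentile reading, a hedged sketch using the cumulative normal distribution (via the error function); mean = 72 and SD = 4 come from this example:

```python
import math

def percentile_in_normal(score, mean, sd):
    """Percentage of test takers scoring at or below `score`, assuming a normal distribution."""
    z = (score - mean) / sd
    return 100 * 0.5 * (1 + math.erf(z / math.sqrt(2)))    # cumulative proportion below z, as a percentage

print(round(percentile_in_normal(76, 72, 4), 2))            # 84.13, i.e. 50% + 34.13%
```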
Standardized tests: a) z scores
• A z score: The raw score expressed in standard deviations.
• z score formula: z = (X - Mean) / SD
The mean of z scores is always zero.
The SD of z scores is 1.
-3 ≤ z scores ≤ +3
Example: raw score = 70, Mean = 72, SD = 4
z = (70 - 72)/4 = -0.5
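A one-function sketch of the z-score transformation, reproducing the example on this slide (raw score 70, with mean 72 and SD 4 from the earlier example):

```python
def z_score(raw, mean, sd):
    return (raw - mean) / sd          # the raw score expressed in standard-deviation units

print(z_score(70, 72, 4))             # -0.5
```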
Standardized tests: a) z scores (Cont.)
• Three problems of z scores:
1. They are relatively small, ranging from -3 to +3.
2. They can be both negative and positive.
3. They often include several decimal places.
Reporting scores in the form of z scores can be demotivating
for students.
To overcome these problems, z scores are transformed into
standardized scales.
Standardized tests: b) T scores
Main formula of standardized scales (linear transformation
of z scores): standardized score = (new SD × z) + new mean
• T score formula: T = 10z + 50
Mean = 50, SD = 10, range = 20-80 (for z between -3 and +3)
• Example: raw score = 70, z score = -0.5
T score = 10 × -0.5 + 50 = 45
Standardized tests: c) CEEB scores
• CEEB (College Entrance Examination Board) scores are the standardised
scale used for tests such as the SAT, GRE, and TOEFL.
• CEEB formula: CEEB = 100z + 500
Mean = 500, SD = 100, range = 200-800 (for z between -3 and +3)
• Example: raw score = 70, z score = -0.5
CEEB score = 100 × -0.5 + 500 = 450
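Both T and CEEB scores are the same kind of linear transformation of z, just with different constants. A small sketch reproducing the two examples above:

```python
def z_score(raw, mean, sd):
    return (raw - mean) / sd

def t_score(z):
    return 10 * z + 50                 # mean 50, SD 10

def ceeb_score(z):
    return 100 * z + 500               # mean 500, SD 100

z = z_score(70, 72, 4)                 # -0.5
print(t_score(z), ceeb_score(z))       # 45.0 450.0
```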
Item analysis
• Item facility / item easiness / item difficulty / facility index (IF): The
statistic used to examine the proportion of students who
correctly answer a given item.
IF formula: IF = Ncorrect / Ntotal
• Item discrimination (ID): The degree to which an item
separates the students who performed well from those who
did poorly on the test as a whole.
ID formula: ID = IFupper - IFlower
        Range            Acceptable        Best
IF:     0 ≤ IF ≤ 1       .3 ≤ IF ≤ .7      IF = .5
ID:     -1 ≤ ID ≤ +1     ID ≥ .4           ID = 1
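A sketch of IF and ID for dichotomously scored (0/1) items. Splitting the group into upper and lower halves on the total test score is my assumption of one common way to form the two groups (thirds are also used):

```python
def item_facility(item_responses):
    """item_responses: 0/1 scores on one item, one per test taker."""
    return sum(item_responses) / len(item_responses)

def item_discrimination(item_responses, total_scores):
    """ID = IF(upper group) - IF(lower group), groups formed on the total test score."""
    ranked = sorted(zip(total_scores, item_responses), key=lambda pair: pair[0], reverse=True)
    k = len(ranked) // 2
    upper = [resp for _, resp in ranked[:k]]
    lower = [resp for _, resp in ranked[-k:]]
    return item_facility(upper) - item_facility(lower)

# Hypothetical data: one item answered by six test takers, plus their total test scores
item = [1, 1, 1, 0, 0, 1]
totals = [28, 25, 24, 15, 12, 10]
print(item_facility(item), item_discrimination(item, totals))   # 0.67 and 0.67 (rounded)
```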
Reliability
• Reliability: Consistency of scores under different
circumstances.
• Reliability differs from scorability.
• Reliability indicates the degree to which the observed
score and the true score match.
• The observed score (X) is made up of the ‘true’ score of
an individual’s ability on what the test measures (T),
plus error (E) that can come from a variety of
sources: X = T + E
Threats to reliability (Lado)
1. Variation in conditions of administration: Fluctuation of scores over time,
in different places or under slightly different conditions (such as a
different room, or with a different invigilator)
2. The quality of the test itself: Problems with sampling what language to
test – as we can’t test everything in a single test. If a test consists of items
that test very different things, reliability is also reduced. This is because
in standardised tests any group of items from which responses are added
together to create a single score are assumed to test the same ability, skill
or knowledge. The technical term for this is item homogeneity.
3. Variability in scoring: If humans are scoring multiple-choice items they
may become fatigued and make mistakes, or transfer marks inaccurately
from scripts to computer records. However, there is more room for
variation when humans are asked to make judgments.
Calculating reliability
• The method we use to calculate reliability depends upon
what kind of error we wish to focus on.
• The notion of correlation is at the very center of the
notion of reliability.
• A reliability coefficient is calculated that ranges from 0
(randomness) to 1, and no test is ‘perfectly’ reliable.
There is always error of measurement.
Calculating reliability
1. Variation in conditions of administration
• The statistical technique used is the Pearson product-moment
correlation.
• Assumptions: 1. Interval scale, 2. Independence: each pair of scores is
independent of all other pairs, 3. Normally distributed, 4. Linearity
• -1 ≤ r ≤ +1:
1. -1 : There is an inverse relationship between the two sets of scores
2. 0 : There is no relationship between the two sets of scores
3. +1 : The scores are exactly the same on both administrations of the test.
The closer the result is to 1, the more test–retest reliability we have.
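A minimal Pearson product-moment correlation sketch for test-retest reliability (two administrations of the same test to the same group); the scores are invented for illustration:

```python
import math

def pearson_r(x, y):
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    return cov / math.sqrt(sum((a - mx) ** 2 for a in x) * sum((b - my) ** 2 for b in y))

admin_1 = [55, 60, 62, 70, 75, 80]    # hypothetical scores, first administration
admin_2 = [58, 59, 65, 68, 77, 79]    # the same test takers, second administration

r = pearson_r(admin_1, admin_2)
print(r, r ** 2)                       # test-retest reliability and the coefficient of determination (r²)
```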
Coefficient of determination
• Statistical significance is a necessary precondition for a
meaningful correlation, but it is not sufficient in itself.
• The coefficient of determination is simply the correlation
coefficient squared (r²), and it represents the proportion of
overlapping variance between two sets of scores (i.e., as the
score on one test increases, so it increases proportionally on
the other test)
r below .60 (r² below about one third): low overlapping variance
r from .60 to .80 (r² from about one third to two thirds): moderate overlapping variance
r above .80 (r² above about two thirds): high, up to complete, overlapping variance
2. The quality of the test itself (internal
consistency)
• Reliability is addressed in terms of the homogeneity of items (they
must all be highly correlated).
• Requirements:
1. Parallelism: Two tests should be parallel (with the same means,
variances, and the same correlation with another well-established
measure of the construct)
2. Independence: The response to any specific item must be
independent of the response to any other item; put another way,
the test taker should not get one item correct because they have
got some other item correct. The technical term for this is the
stochastic independence of items.
• Statistics used: Split-half methods and methods based on item
variances
Split-half method
• Main procedure: Split the test into two equal halves and calculate the
correlation between the two halves.
1. Spearman-Brown split-half reliability estimate: Since reliability is
directly related to the length of a test, correct the correlation for
length via the Spearman-Brown correction formula (parallelism and
independence are required)
2. Guttman split-half reliability estimate (parallelism is not required,
but independence is required)
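A sketch of the split-half idea (odd items vs. even items is my choice of how to form the two halves), with the Spearman-Brown correction applied so the estimate refers to the full-length test:

```python
import math

def pearson_r(x, y):
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    return cov / math.sqrt(sum((a - mx) ** 2 for a in x) * sum((b - my) ** 2 for b in y))

def split_half_reliability(item_matrix):
    """item_matrix: one row of 0/1 item scores per test taker (columns = items)."""
    odd_totals = [sum(row[0::2]) for row in item_matrix]     # items 1, 3, 5, ...
    even_totals = [sum(row[1::2]) for row in item_matrix]    # items 2, 4, 6, ...
    r_halves = pearson_r(odd_totals, even_totals)
    return (2 * r_halves) / (1 + r_halves)                   # Spearman-Brown correction to full length
```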
Methods based on item variances
• Estimates based on item variances (parallelism and independence
are required)
1. Cronbach’s coefficient alpha
2. K-R20 / K-R21 for dichotomously scored items
(scored ‘right’ or ‘wrong’)
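A sketch of coefficient alpha computed from item variances; when every item is scored 0/1 this is algebraically the same quantity as K-R20. The variable names are mine:

```python
def coefficient_alpha(item_matrix):
    """item_matrix: one row of item scores per test taker (columns = items)."""
    n_items = len(item_matrix[0])

    def pop_variance(values):                                # population variance (divide by N)
        m = sum(values) / len(values)
        return sum((v - m) ** 2 for v in values) / len(values)

    item_vars = [pop_variance([row[i] for row in item_matrix]) for i in range(n_items)]
    total_var = pop_variance([sum(row) for row in item_matrix])
    return (n_items / (n_items - 1)) * (1 - sum(item_vars) / total_var)
```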
3. Variability in scoring (grading and
marking)
• Which rater makes the judgment should be a matter of
indifference to the test taker.
• Inter-rater reliability: Our concern is with variation between raters,
because some raters are more lenient than others, or some raters
may rate some test takers higher than others (perhaps because
they are familiar with the first language and are more sympathetic
to errors).
• Intra-rater reliability: Our concern is with variation within one
rater over time.
• Statistics: Cronbach’s alpha for partial-credit judgments
Standard Error of Measurement (SEM)
• One of the most important tools in standardised testing is the standard
error of measurement.
• While the reliability coefficient tells us how much error there might be
in the measurement, it is the standard error of measurement that tells us
what this might mean for a specific observed score, which makes it more
informative for interpreting the practical implications of reliability.
• SEM formula: SEM = SD × √(1 - r)
• Confidence interval: The SEM gives us a confidence interval around an
observed test score, which tells us by how much the true score may be
above or below the observed score that the test taker has actually got on
our test.
Example
Example: SD = 4, r = .64, SEM = 4 × √(1 - .64) = 4 × .6 = 2.4
Raw score = 74, SEM = 2.4
68% (between +1 SEM and -1 SEM): 71.6 ≤ true score ≤ 76.4
95% (between +2 SEM and -2 SEM): 69.2 ≤ true score ≤ 78.8
99% (between +3 SEM and -3 SEM): 66.8 ≤ true score ≤ 81.2
~100% (between +4 SEM and -4 SEM): 64.4 ≤ true score ≤ 83.6
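A sketch reproducing this SEM example (SD = 4, r = .64, observed score 74) and the resulting confidence bands:

```python
import math

def sem(sd, reliability):
    return sd * math.sqrt(1 - reliability)

def confidence_band(observed, sem_value, n_sems):
    return observed - n_sems * sem_value, observed + n_sems * sem_value

s = sem(4, 0.64)                       # 2.4
for n in (1, 2, 3, 4):                 # ~68%, ~95%, ~99% and ~100% bands
    print(n, confidence_band(74, s, n))
```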
Reliability and test length
• In standardised tests with many items, each item provides a piece of
information about the ability of the test taker; therefore, as we increase
the number of items, the reliability will increase.
• Formula for the relationship between reliability and test
length: A = rAA(1 - r11) / (r11(1 - rAA))
A: The number of times you would have to lengthen the test to get the desired
reliability
rAA : The desired reliability
r11 : The reliability of the current test
• However, the best way to increase reliability is to produce better items.
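A sketch of the lengthening factor implied by this relationship, using the names above (r11 = current reliability, rAA = desired reliability). For example, moving from a reliability of .70 to .90 would require a test roughly 3.9 times as long:

```python
def lengthening_factor(r_current, r_desired):
    """A: the number of times the test must be lengthened to reach the desired reliability."""
    return (r_desired * (1 - r_current)) / (r_current * (1 - r_desired))

print(lengthening_factor(0.70, 0.90))   # ~3.86
```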
Relationships with other measures
• One key part of standardised testing is the comparison of
two measures of the same construct.
• If two different measures were highly correlated, this
provided evidence of validity. This aspect of external
validity is criterion-related evidence, or evidence that
shows one test is highly correlated with a criterion that
is already known to be a valid measure of its construct
(called evidence for convergent validity).
• This is measurement as understood in Classical Test Theory.