John De Jong: Optimizing Test & Courseware Development
1. Optimizing Test & Courseware Development
Lisbon
23 April 2016
John De Jong
SVP Global Assessment Standards, Pearson
Professor of Language Testing VU University Amsterdam
4. PISA Development over time
2000: Reading, Mathematics and Science
2003: Reading, Mathematics and Science
2006: Reading, Mathematics and Science
2009: Reading, Mathematics and Science
+ Optional Electronic Reading
2012: Reading, Mathematics and Science
+ Optional Electronic Mathematics
2015: Electronic: Reading, Mathematics and Science
+ Collaborative Problem Solving
2018: Reading, Mathematics and Science
+ Global Competence
5. Lessons from PISA
Major drivers of success of countries
• Clear standards defined at national level
• High level of teacher autonomy
7. Ranking CPS in higher education and workplace
Applied Skill                          Rank Educ   Rank Work
Oral Communications                    3           1
Teamwork / Collaboration               3           1
Problem Solving                        1           2
Written Communications                 2           2
Information Technology Application     4           3
Lifelong Learning / Self Direction     2           4
Professionalism / Work Ethic           5           4
Ethics / Social Responsibility         6           4
Creativity / Innovation                3           5
Diversity                              7           6
Leadership                             7           7
8. Survey results
The CPS definition …                                   Agree %
is clearly described                                   97
matches my own understanding of CPS                    95
will help higher ed institutions to understand CPS     88
will help employers to understand CPS                  100
is what is taught in my country                        52
16. Domains of language use / Topics
Difficulty
Self / personal experience
Negotiating with others
Deal with new
Academic
Specialized
Jokes
GE: A1 A2 B1 B2 C1 C2
AE: General MBA
PE: Waiter Politician
Coherent bank of objectives
17. A General Model of Language Development
[Figure: General Cognition and Language Proficiency, with axes marked 0 1 2 3 4 5 etc. in “language age” and “cognitive age”]
• Measuring within a population of language learners measures both linguistic and general cognitive development.
• Measuring across two populations of language learners may measure cognitive development only.
• Including an appropriate native speaker population can help to measure linguistic development only.
18. The Global Scale of English
Comparison PTE Academic (GSE scale) and IELTS and TOEFL
IELTS
TOEFL iBT
21. Overview
• A vocabulary framework linked to the Global Scale of English
(GSE) and the CEFR
• Organized by topics and subtopics based on the CoE Vantage
specifications categorization
• Describing vocabulary targets for learners of general English
• A probabilistic model of productive vocabulary learning
• Based on the principle of incremental learning of word
meanings, from basic to specialized
• Including 20k+ lemmas; 37k+ meanings; 80k+ collocations;
7k+ functional units
• Helping learners, teachers, and materials designers identify
level-appropriate vocabulary
22. Methodology
Combines frequency data and teacher judgements via 4 main steps:
1. Corpus of 2.5 billion words > extraction of frequency list
2. Semantic annotation
• Manual tagging of 37k word meanings using the CoE ‘Vantage’ categorization
3. Teacher ratings
• Rating of 37k word meanings, each by 10 teachers (scale: 1 to 5 + 99)
4. Statistical analysis
• Rank word meanings by combining frequency data and teacher ratings
• Fit the data onto a model, link each meaning to the CEFR/GSE
23. Lemmas and meanings
Structure vocabulary around pedagogically relevant
sets using the CoE Vantage categorization
Example:
Specific Notions (Topics)
Fork > FOOD&DRINKS_tableware
SPORT&HOBBIES_gardening
TRAVEL_directions
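The "fork" example above can be pictured as one lemma carrying several meanings, each tagged with a topic and subtopic. The following is a minimal sketch only: the class and field names (`Meaning`, `Lemma`, `gloss`) and the glosses themselves are assumptions made for illustration, not the structure of the actual vocabulary database.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass(frozen=True)
class Meaning:
    topic: str      # e.g. "FOOD&DRINKS"
    subtopic: str   # e.g. "tableware"
    gloss: str      # illustrative gloss (assumption, not from the slides)


@dataclass
class Lemma:
    form: str
    meanings: List[Meaning] = field(default_factory=list)

    def add_tag(self, tag: str, gloss: str) -> None:
        # Tags follow the TOPIC_subtopic pattern shown in the example.
        topic, subtopic = tag.split("_", 1)
        self.meanings.append(Meaning(topic, subtopic, gloss))

    def topics(self) -> List[str]:
        return [m.topic for m in self.meanings]


fork = Lemma("fork")
fork.add_tag("FOOD&DRINKS_tableware", "utensil used for eating")
fork.add_tag("SPORT&HOBBIES_gardening", "tool used for digging")
fork.add_tag("TRAVEL_directions", "place where a road divides")

for m in fork.meanings:
    print(f"{fork.form} > {m.topic}_{m.subtopic}: {m.gloss}")
```

Keeping the meaning, not the lemma, as the unit that carries the topic tag lets each sense be rated and levelled separately.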
24. Theoretical assumptions
A model of vocabulary growth based on current literature:
• Basic (A1) > 500-1k words (500 words as minimum elementary level, Hill, 2013; 500-1k as general teaching target)
• Basic (A2) > boundary for high-frequency vocabulary set at 3k families for everyday conversation (Adolphs & Schmitt, 2003)
• Independent (B1) > 5k families to read authentic texts (Schmitt, 2007)
• Independent (B2) > minimum target of 10k lemmas at university level for Dutch (Hazenberg & Hulstijn, 1996); 8-9k families for unassisted comprehension (Nation, 2006)
• Proficient (C1 upwards) > 20k families known by educated L1 speakers (Nation, 2001); 50k words known by most L1 speakers (Crystal, 1981)
Hill, D. R. (2001). Survey: Graded readers. ELT Journal, 55(3), Oxford University Press, 300-324.
Adolphs, S. & Schmitt, N. (2003). Lexical coverage of spoken discourse. Applied Linguistics, 24(4), 425-438.
Schmitt, N. (2007). Current perspectives on vocabulary teaching and learning. In J. Cummins & C. Davison (eds.), International Handbook of English Language Teaching: Part II. NY: Springer, 827-841.
Hazenberg, S. & Hulstijn, J. H. (1996). Defining a minimal receptive second-language vocabulary for non-native university students: An empirical investigation. Applied Linguistics, 17(2), 145-163.
Nation, I. S. P. (2006). How large a vocabulary is needed for reading and listening? The Canadian Modern Language Review, 63(1), 59-82.
Nation, P. (2001). Learning vocabulary in another language. Cambridge: Cambridge University Press.
Schmitt, N. (2000). Vocabulary in language teaching. Cambridge: Cambridge University Press, pp. 7-8.
Crystal, D. (1981). Clinical Linguistics. Vienna: Springer.
26. Data modelling 1
[Figure: From GSE to ModelLem. Horizontal axis: GSE, 10 to 90; vertical axis: 0 to 60,000 (cumulative lemmas). Series: Hypothesis 'CumLem' and Model 'ModelLem']
Fitted curve: y = 0.006 · x^3.539 (R² = 0.9842)
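To make the fitted curve concrete, the sketch below evaluates it at a few points. It assumes that x is the GSE value and y the cumulative number of lemmas, which is how the chart title "From GSE to ModelLem" reads; the function name `model_lemmas` is hypothetical.

```python
def model_lemmas(gse: float, a: float = 0.006, b: float = 3.539) -> float:
    """Cumulative lemma count predicted by the power curve y = a * x**b.

    Assumes x is a GSE value and y a cumulative lemma count.
    """
    if gse <= 0:
        raise ValueError("GSE value must be positive")
    return a * gse ** b


# Evaluate the curve across the 10-90 range shown on the chart axis.
for gse in (10, 30, 50, 70, 90):
    print(f"GSE {gse}: about {model_lemmas(gse):,.0f} lemmas")
```

Under these assumptions the curve starts at a few dozen lemmas at GSE 10 and approaches 50,000 at GSE 90, which matches the upper range of the chart.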
32. The vocabulary usefulness rating
1 = Essential words learners would want to acquire first
2 = Important words that become necessary at a next stage
3 = Useful words enabling more detailed and specific
language
4 = Nice to have words to express concepts more accurately
5 = Extra words some language users will use occasionally
99 = “Escape” words which are impossible to rate - you have never heard of the word before or you cannot decide between widely different ratings
Teachers received online training and followed specific
guidelines
Each word was rated by a random 10 out of the 19 raters in an
overlapping design using a pre-defined scale of 1-5
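The rating design can be sketched in a few lines: draw 10 of the 19 raters for each word, then average the ratings that fall on the 1-5 scale. This is a sketch under assumptions: the slides do not say how the 99 "escape" code was treated, so excluding it from the average is an assumption, and the function names are hypothetical.

```python
import random
from statistics import mean

ESCAPE = 99  # "impossible to rate" code from the scale above


def assign_raters(n_raters=19, per_word=10, rng=None):
    """Pick a random subset of raters for one word (overlapping design)."""
    rng = rng or random.Random()
    return sorted(rng.sample(range(n_raters), per_word))


def rating_average(ratings):
    """Average the 1-5 ratings, leaving out escape codes (assumption)."""
    valid = [r for r in ratings if r != ESCAPE]
    if any(r < 1 or r > 5 for r in valid):
        raise ValueError("ratings must be 1-5 or the escape code 99")
    return mean(valid) if valid else None


raters = assign_raters(rng=random.Random(0))
print("raters for this word:", raters)
print("rating average:", rating_average([1, 2, 2, 99, 3, 2, 1, 2, 3, 2]))
```

Because every word is seen by a different random subset, any two raters share many words, which is what allows rater severity to be compared across the whole pool.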
33. Combine ratings and Frequency data
Combine = (Ra × rRating + Frank × (1 − rRating) + Frank) / 2
Where
Combine is the optimal combination of ratings and Frequency data
Ra is the Rating average
rRating is the Reliability of rating data
Frank is the scaled frequency rank.
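A short worked example of the formula follows. The word labels, rating averages, frequency ranks and the reliability of 0.8 are invented for illustration, and the sketch assumes Ra and Frank are expressed on the same scale.

```python
def combine(ra: float, r_rating: float, frank: float) -> float:
    """Combine = (Ra * rRating + Frank * (1 - rRating) + Frank) / 2."""
    if not 0.0 <= r_rating <= 1.0:
        raise ValueError("reliability must lie between 0 and 1")
    return (ra * r_rating + frank * (1.0 - r_rating) + frank) / 2.0


# Hypothetical word meanings: (label, rating average, scaled frequency rank).
words = [
    ("meaning_c", 4.1, 3.2),
    ("meaning_a", 1.2, 1.5),
    ("meaning_b", 2.0, 3.0),
]
R_RATING = 0.8  # assumed reliability of the rating data

# Rank word meanings from most to least essential (lowest score first).
ranked = sorted(words, key=lambda w: combine(w[1], R_RATING, w[2]))
for label, ra, frank in ranked:
    print(label, round(combine(ra, R_RATING, frank), 3))
```

Two limiting cases show what the formula does: with rRating = 0 the result is simply Frank, and with rRating = 1 it is the plain mean of Ra and Frank. The less reliable the ratings, the more the combined value leans on frequency.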