2. Examination’s Purpose
The goal of the exam development process is to
accurately measure the candidate’s ability in
the field of practice.
An examination is created to measure the ability
of the candidate based upon the knowledge
and skills represented in each test question.
An examination is NOT created to measure the
ability of the candidate to take an exam.
Joosten || 2007
3. Considerations in Criterion-referenced Testing
Exam Validity & Reliability
Content Areas
Task/Skill Areas: actions necessary for effective and efficient performance
Item Taxonomy
4. Item Taxonomy
RECALL
INTERPRETATION
PROBLEM SOLVING
5. Sample Recall Item
Which of the following describes the active growth
phase of the cycle of normal human hair
growth?
A. Anagen.
B. Betagen.
C. Catagen.
D. Telogen.
6. Sample Interpretation Item
A 23-year-old woman who is acutely febrile has had
an untreated, painful lower left third molar for 3
weeks. The patient can open her mouth only
8 mm, has some pain on swallowing, and has
moderate swelling just beneath the angle of the
mandible on the left side. The diagnosis most
likely is an abscess in which of the following
spaces?
A. lateral pharyngeal.
B. retropharyngeal.
C. submandibular.
D. masticatory.
7. Sample Problem Solving Item
A periapical roentgenogram reveals an impacted
lower third molar in an edentulous mandible. The
patient is experiencing recurrent acute and
chronic infection of the overlying soft tissue
denture base. For definitive treatment, the
surgeon should:
A. reline and relieve the denture base.
B. remove the tooth using appropriate antibiotic
control.
C. trim the swollen tissue and prescribe antibiotics.
D. advise the patient to remove the denture when
eating.
9. Multiple Choice Items
GOAL: Maintain a pool of exam items which are
appropriate to measure the knowledge and
skills necessary for safe and effective
performance in the field of practice.
Item construction affects the performance of
your exam.
A multiple choice item is a specific form of item
that is composed of a stem and options.
Parts of an item:
Stem
Distractors
Correct answer
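The parts of an item listed above can be modeled as a small data structure. This is only a sketch: the class name `Item`, its field names, and the validation rule are assumptions made for illustration, not part of the original material. The example reuses the sample recall item; the slide does not mark its key, so the key "A" (anagen is the active growth phase of hair) is supplied here.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Item:
    """Hypothetical container for one multiple choice item."""
    stem: str
    options: dict  # option letter -> option text
    key: str       # letter of the correct (best) answer

    def __post_init__(self):
        # The keyed answer has to be one of the listed options.
        if self.key not in self.options:
            raise ValueError("the key must be one of the option letters")

    @property
    def distractors(self):
        """All options except the keyed answer."""
        return {k: v for k, v in self.options.items() if k != self.key}


# The sample recall item; the key is not marked on the slide and is assumed here.
sample = Item(
    stem=("Which of the following describes the active growth phase "
          "of the cycle of normal human hair growth?"),
    options={"A": "Anagen.", "B": "Betagen.", "C": "Catagen.", "D": "Telogen."},
    key="A",
)
print(sorted(sample.distractors))
```

Keeping the key separate from the option text makes it easy to compute the distractor set and to check later rules, such as how often each position holds the correct answer.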
10. Stem
The stem of a multiple choice item may:
ask a question
Which of the following microscopic
subtypes of ameloblastoma is most
common?
give an incomplete statement
The most common microscopic
subtype of ameloblastoma is:
describe a situation (along with a question or
incomplete statement)
A 25-year-old man is brought to the
emergency room. He was found
lying unconscious on the sidewalk.
After ascertaining that the airway is
open, the next step in management
should be:
11. Item Response Options
Options are all the possible answers for a stem.
One correct (best) answer
Three distractors
The best answer is agreed upon by experts.
The distractors are logical misconceptions of
the best answer.
12. Developing Items
Items should have one best answer. Avoid
items based on opinion or for which there is
not an accepted answer.
Items must focus on a single issue, fact, or
problem in each item.
Items should test important and pertinent
material while avoiding trivial facts.
Items should be developed utilizing good
grammar, punctuation, and spelling.
Attempt to write interpretation and problem
solving items.
Use a standard number of responses.
Options should avoid “all of the above” and
“none of the above.”
13. Stem Construction
Stems should:
Avoid overly specific knowledge, excess
information, and teaching in the stem.
Include the central idea and most verbiage
in the stem.
Be stated positively and avoid negative
phrasing.
Avoid personal pronouns (e.g., you).
Use terminology common to practice and
avoid textbook verbatim phrasing.
Avoid superlatives such as “always” and
“never.”
14. Responses Construction
Responses should be:
Organized in a logical order
Independent and not overlapping
Fairly consistent in length
Homogeneous
Plausible
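The length guideline above can be screened mechanically. The sketch below flags an item whose keyed answer is strictly longer than every distractor, a pattern testwise candidates may notice. The function name `key_is_strictly_longest` and the sample options are hypothetical, and character count is only a crude stand-in for perceived length.

```python
def key_is_strictly_longest(options, key):
    """Return True when the keyed answer is longer than every distractor.

    options: dict mapping option letter -> option text.
    key:     letter of the keyed (best) answer.
    """
    key_length = len(options[key])
    return all(
        len(text) < key_length
        for letter, text in options.items()
        if letter != key
    )


# Hypothetical options, invented only to exercise the check.
uneven = {"A": "Red", "B": "Blue", "C": "A pale green that darkens with age", "D": "Gray"}
even = {"A": "Red", "B": "Blue", "C": "Teal", "D": "Gray"}

print(key_is_strictly_longest(uneven, "C"))
print(key_is_strictly_longest(even, "C"))
```

A flagged item is not necessarily flawed; the check only points a reviewer at items worth a second look.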
16. Item Evaluation
P-value: percent of candidates who selected a response.
Point Biserial Correlation: correlation between selecting
a given response and doing well on the test as a whole.
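As a rough illustration of the two statistics defined on this slide, the sketch below computes a p-value and a point biserial correlation for every option of one item. The function name `option_statistics`, the sample responses, and the total scores are hypothetical. The total score is assumed to include the item itself; some testing programs exclude it.

```python
from math import sqrt


def option_statistics(choices, totals, options="ABCD"):
    """Return {option: (p_value, point_biserial)} for one item.

    choices: the option letter each candidate selected on this item.
    totals:  each candidate's total test score, in the same order.
    """
    n = len(choices)
    mean_total = sum(totals) / n
    # Population standard deviation of the total scores.
    sd_total = sqrt(sum((t - mean_total) ** 2 for t in totals) / n)
    stats = {}
    for opt in options:
        chosen = [t for c, t in zip(choices, totals) if c == opt]
        p = len(chosen) / n
        if 0 < p < 1 and sd_total > 0:
            mean_chosen = sum(chosen) / len(chosen)
            # Point biserial: standardized gap between the mean score of those
            # who chose the option and the overall mean, scaled by sqrt(p / q).
            rpbi = (mean_chosen - mean_total) / sd_total * sqrt(p / (1 - p))
        else:
            rpbi = float("nan")  # undefined when nobody or everybody chose it
        stats[opt] = (p, rpbi)
    return stats


# Ten hypothetical candidates; A is taken to be the keyed answer.
example_choices = list("AAAAAAABBC")
example_totals = [9, 8, 8, 7, 7, 6, 6, 4, 3, 2]
example_stats = option_statistics(example_choices, example_totals)
for letter, (p, r) in example_stats.items():
    print(letter, round(p, 2), round(r, 2))
```

In this made-up data the keyed answer has a positive point biserial and the chosen distractors have negative ones, which is the pattern described for a well-performing item.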
17. Good Item
[Figure: plot with rows 1ST through 5TH against a horizontal axis from 0 to 100, with two plotted series marked # and *]
A IS THE CORRECT ANSWER
          A     B     C     D
P-VALUE   0.70  0.15  0.05  0.01
18. Good P-value:
Poor Discrimination
[Figure: plot with rows 1ST through 5TH against a horizontal axis from 0 to 100, with two plotted series marked # and *]
C IS THE CORRECT ANSWER
          A     B      C     D
P-VALUE   0.05  0.07   0.73  0.15
RPBI      0.11  -0.10  0.02  -0.02
19. Low P-value:
Low Discrimination
[Figure: plot with rows 1ST through 5TH against a horizontal axis from 0 to 100, with two plotted series marked # and *]
A IS THE CORRECT ANSWER
          A     B      C     D
P-VALUE   0.47  0.33   0.15  0.05
RPBI      0.08  -0.13  0.01  0.09
20. Evaluating Item Stems
1. Focus on a single issue, fact, or problem in each item.
2. Avoid overly specific knowledge.
3. Avoid textbook verbatim phrasing for items.
4. Avoid items based on opinion.
5. Avoid items for which there is not an accepted answer.
21. Evaluating Item Stems
6. Test important material, while avoiding trivial facts.
7. State the item positively and avoid negative phrasing.
8. Include the central idea and most verbiage in the
stem.
9. Use one best answer format.
10. Use good grammar, punctuation, and spelling.
22. Evaluating Item Stems
11. Avoid excess information in the stem as well as
teaching in the stem.
12. Avoid personal pronouns (e.g., you).
13. Attempt to write stems that require interpretation and
problem solving from the candidate (rather than recall).
23. Anatomy of Item Responses
Item responses should consist of:
1.) the best answer (agreed upon by experts).
2.) logical misconceptions of the best answer or
distractors.
24. Evaluating Item Responses
1. Use a standard number of responses.
2. Place options in a logical order.
3. Keep options independent and not overlapping.
4. Keep options homogeneous in content.
5. Keep the length of the options fairly consistent.
25. Evaluating Item Responses
6. Be sure all distractors are plausible.
7. Be sure all distractors are logical
misconceptions.
8. Avoid “all of the above” and “none of the above.”
9. Phrase options positively, not negatively.
10. Avoid use of slang.
26. Evaluating Item Responses
11. Avoid absurd or “fantastic” options.
12. Avoid giving clues through faulty grammar.
13. Make sure there is only one best answer.
14. Avoid superlatives such as “always” and “never.”
15. Evenly distribute position of the correct answer.
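Guideline 15 lends itself to a quick tally over an item pool. The sketch below counts how often each position holds the correct answer and checks that no position strays too far from an equal share. The function names and the 10-point tolerance are assumptions chosen for illustration, not a published standard.

```python
from collections import Counter


def key_position_counts(keys, options="ABCD"):
    """Count how many items have their correct answer in each position."""
    counts = Counter(keys)
    return {opt: counts.get(opt, 0) for opt in options}


def is_roughly_even(counts, tolerance=0.10):
    """True when every position's share is within `tolerance` of an equal split."""
    total = sum(counts.values())
    if total == 0:
        return True  # nothing to judge in an empty pool
    expected = 1 / len(counts)
    return all(abs(c / total - expected) <= tolerance for c in counts.values())


# Hypothetical answer keys for two small item pools.
balanced = key_position_counts(list("ABCDABCDABCA"))
skewed = key_position_counts(list("AAAAAAABCD"))

print(balanced, is_roughly_even(balanced))
print(skewed, is_roughly_even(skewed))
```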
27. General Considerations
• Does the item deal with trivial content?
• Is the answer discrimination too fine?
• Does the item stem include unrelated information?
• Is there more than one correct answer?
• Is the item highly ambiguous?
• Is the question so obvious that the best answer appears to be
the only plausible choice?
• Are some distractors ‘tip-offs’ because of the choice of words or
phrasing in the responses or stems?
• Are all of the distractors parallel?
• Are the responses of comparable plausibility?
28. In Summary
The goal of item writing or editing is to create items that
will measure the skills and abilities of the candidates.
To do that, the items must be clear, concise, and accurate,
with sound structure and pertinent content.
29. Review Item Statistics
P-value – percent of candidates who selected a response
Point Biserial Correlation – correlation between selecting
a given response and doing well on the test as a whole:
positive – correct answer
negative – distractors
30. Good Item
[Figure: plot with rows 1ST through 5TH against a horizontal axis from 0 to 100, with two plotted series marked # and *]
A IS THE CORRECT ANSWER
          A     B     C     D
P-VALUE   0.70  0.15  0.05  0.01
Editor's notes
Candidates who meet or exceed the standard of accepted ability, based on their performance on the examination, will be certified. We want to accurately measure the candidate’s ability in the field of practice based on the knowledge and skills represented in the exam.
Content areas, skill areas, and item taxonomy are three considerations in developing an exam. The classification of items assists the examination developer in monitoring the distribution of items across content and task domains as well as cognitive skill levels. Each item in the examination database is assigned a classification code, which is also used in item selection.
Two-dimensional model of competence
Dimension I: Content Areas. General subjects that represent the disciplines of field experience and imply expertise, for example: Physics, Imaging, Safety, and Physiology. Content reflects the major subject category of the item. The content classification is used in selecting items to ensure that the entire content domain is covered. It is developed, in part, from the job/task analysis prepared by “the experts” in the field of practice and administered to practitioners in the field.
Dimension II: Task or Skill Areas. Actions necessary for effective and efficient job performance, for example: Apply knowledge of…, Select appropriate…, Prepare appropriate…, Develop, Establish, Calculate, Analyze, Identify, Synthesize, Evaluate, Implement, Perform, and Use Instrumentation. The task is the skill performed (e.g., diagnosis or treatment). Structuring items to reflect different tasks generates a greater variety of items.
Taxonomy. The taxonomy level relates to the cognitive skill component in criterion-referenced testing. It is the level of mental process used by the candidate to determine the correct response to an item. The construction of the stem and responses, the use of visual materials, and the process and content of the item all contribute to the classification of an item by taxonomy level.
There are three taxonomy levels that must be considered in item writing: recall, interpretation, and problem solving.
Exam Validity and Reliability
Reliability: Is the examination consistent and dependable? Internal consistency will generally be high because items are written by specialists in the field.
Validity: Does the test measure what it is supposed to measure? The content guideline or exam blueprint is the basis for content validity.
Are we achieving our goals? The goal of the exam development process is to accurately measure the candidate’s ability in the field of practice. An examination is created to measure the ability of the candidate based upon the knowledge and skills represented in each test question.
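The classification codes and exam blueprint mentioned in these notes can be used to check coverage of the content domain. The sketch below compares the items available in each content and taxonomy cell against the number the blueprint calls for. The function name `coverage_gaps`, the pair-shaped classification code, and the sample counts are assumptions for illustration; a real code would likely carry the task dimension as well.

```python
from collections import Counter


def coverage_gaps(item_codes, blueprint):
    """Return {(content, taxonomy): shortfall} for every under-filled cell.

    item_codes: one (content area, taxonomy level) pair per available item.
    blueprint:  required number of items for each (content, taxonomy) cell.
    """
    have = Counter(item_codes)
    return {
        cell: need - have[cell]
        for cell, need in blueprint.items()
        if have[cell] < need
    }


# Hypothetical item pool and blueprint.
item_codes = [
    ("Physics", "recall"),
    ("Physics", "interpretation"),
    ("Safety", "recall"),
]
blueprint = {
    ("Physics", "recall"): 1,
    ("Physics", "interpretation"): 2,
    ("Safety", "problem solving"): 1,
}
gaps = coverage_gaps(item_codes, blueprint)
print(gaps)
```

Cells that appear in the result tell the item writers where new items are needed before an exam form can be assembled.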
Recall Items: test the candidate’s ability to recognize or recall a specific fact or concept.
Interpretation Items: require the candidate to interpret information using recall knowledge, and are presented in the form of diagnostic images, laboratory data, or patient history. Interpretation items often ask for a decision, such as a diagnosis or a prognosis.
Problem Solving Items: the candidate must use a base of knowledge to interpret data and then solve a problem or make a decision. Complications may also be added to the situation that need a remedy. Problem solving items often ask for the best management or treatment options for the patient.
Item writing is an arduous task requiring not only mastery of the subject matter, but also an understanding of the examination population and mastery of verbal communication skills. The review process ensures that the item adheres to appropriate technical and/or scientific principles (STANDARDS, 1985). Items are selected by a group of experts for inclusion in the examination databases.
Responsibilities of the item writer include, but are not limited to, the following:
Developing new items on a continuing basis, as assigned.
Reviewing and selecting items for inclusion in the written examination.
Monitoring the content, task, and cognitive skill distributions of items.
Monitoring the content quality and difficulty of each item and avoiding duplicate items on the same knowledge/skill.
Providing expert input into the criterion standard against which candidates are measured.
Reviewing the performance of each item to ascertain the quality of the content and structure of the item.
The goal is to maintain a pool of examination items which are appropriate to measure the knowledge and skills necessary for safe and effective performance in the field of practice.
A multiple choice item is designed for objective measurement and contains a STEM and four RESPONSES, one of which is the best answer. The multiple choice item is unique in that the standard by which the best answer is selected is contained in the stem. Also, the best answer does NOT have to be the one and only indisputably correct response to the item, as long as the subject matter experts agree it is the best answer of those presented. The form is flexible so that items may be based on items, situations, laboratory results, etc. The following sections outline techniques for writing and evaluating multiple choice items by considering first the stem and then the responses.
Stem: The stem of a multiple choice item may ask a question, give an incomplete statement, state an issue, describe a situation, or use any combination of the above.
Responses:
1. The "BEST" answer is the response the author and other experts consider the most appropriate answer.
2. The "DISTRACTORS" are logical misconceptions of the best answer.
The content of the stem focuses on a central theme or problem, using clear and precise language, without excessive length, which can confuse or distract candidates. The stem may ask a straightforward question, present a scenario, or describe data or laboratory results. The question or issue presented in the stem should be relevant to the knowledge and skill level of the population being evaluated. Each multiple choice item should have four mutually exclusive responses.
Sentence structure in the stem should be grammatically accurate and logically related to the responses. It should present all relevant information to ensure clarity and understanding. Although the multiple choice item format is brief, it must include sufficient information to make an interpretation, answer the question, or solve a problem. Avoid superfluous information, but be certain that all necessary details are included. Also avoid the use of personal pronouns such as "you", which are inappropriate and perhaps confusing. As a general principle, the stem should be stated in a POSITIVE form. Negative statements are not characteristic of normal thought processes, and consequently may place the candidate who is attempting to decipher the item at a disadvantage.
The plausibility of the responses is the first consideration. The best answer should be the one agreed upon by the experts; however, the three distractors should also seem plausible to candidates who have partial, incomplete, or inappropriate knowledge. The distractors may therefore be considered logical misconceptions of the best answer. The responses should be parallel in content, length, and category of information. The grammatical structure of each response should form a logical conclusion to the situation, question, or statement presented in the stem.
When writing distractors, it is wise to avoid the use of superlatives such as "always" and "never". Such words lead candidates away from the response, as they tend to be associated with suspect or exaggerated statements.
Repetitive language within the responses should be avoided. Words which are repeated in every response may be placed in the stem. The candidate then has less to read and is less likely to be confused by the structure.
The length of each response should be approximately the same. There is a tendency among item writers to make the best answer the longest answer. Testwise candidates may key on this fact and answer correctly because of the format of the response.
Each response should be mutually exclusive and not overlapping. For instance, if a series of percentage ranges is to be used for the responses, each range must be unique to its response. The following example illustrates this:
a. 10 - 20
b. 30 - 50
c. 55 - 60
d. 65 - 75
e. 76 - 100
If responses overlap, the candidate may be unable to determine the best answer, not because they do not know the answer, but because the answer is incorporated into more than one response. In addition, the candidate may be able to argue that more than one response is correct due to the overlap.
Avoid using "none of the above" as a response. This response does not test what the candidate knows, but only that he/she can recognize that the correct answer is not present. For example:
What is the capital of Texas?
1. Kansas City
2. Pasadena
3. New York
4. None of the above
(The candidate confidently selects "none of the above" because he/she thinks he/she knows that the capital of Texas is Lubbock.)
Avoid using "all of the above" as a response. Essentially, this is an overlapping response, because it requires the candidate to consider the responses in combination. Knowing that two are correct leads a clever candidate to "all of the above" without knowing the importance or correctness of the remaining responses.
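The rule that numeric response ranges must not overlap can be verified mechanically. The sketch below treats each range as inclusive and reports any pair of options that share a value. The function name `overlapping_pairs` is hypothetical; the first set of ranges is the example given in these notes, and the second set is invented to show a failure.

```python
def overlapping_pairs(ranges):
    """Return pairs of option labels whose inclusive ranges share a value.

    ranges: dict mapping option label -> (low, high).
    """
    # Sort by lower bound so each range only needs comparing with later ones.
    ordered = sorted(ranges.items(), key=lambda pair: pair[1])
    clashes = []
    for i, (label_a, (low_a, high_a)) in enumerate(ordered):
        for label_b, (low_b, high_b) in ordered[i + 1:]:
            if low_b <= high_a:  # the later range starts before this one ends
                clashes.append((label_a, label_b))
    return clashes


# The ranges used as the example in the notes: no overlap expected.
notes_ranges = {"a": (10, 20), "b": (30, 50), "c": (55, 60), "d": (65, 75), "e": (76, 100)}
# A hypothetical flawed set: 20 belongs to both a and b.
flawed_ranges = {"a": (10, 20), "b": (20, 40), "c": (45, 60)}

print(overlapping_pairs(notes_ranges))
print(overlapping_pairs(flawed_ranges))
```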
Item evaluations can be performed for items which have been previously tested and for which statistics have been received. The purpose of the item evaluation is to identify items that are not measuring as expected. Items that fail to perform properly increase the error of the exam and therefore do not contribute to the precision of the pass/fail decision made about candidates. After items are presented on a test, they are subjected to statistical as well as content analysis. The statistical analysis gives the subject matter experts clues about how well the content of the item yielded useful information about candidate ability. The purpose of deleting items from an examination is always to create more precise and fair examinations. Any item that performs poorly is flagged for possible deletion. Items may perform poorly for many reasons, many of which are related to the initial construction of the item stem and responses. Proper development of item stems and responses leads to a higher probability that the item will perform successfully.
Traditional Item Analysis
In the process of item review, the item statistics represent the performance of the item and provide guidance to the examination reviewers when revising items. Traditional item analysis consists of a p-value and a point biserial correlation (RPBI). The p-value is the percent of candidates who selected each response. Ideally, more candidates select the keyed correct response than any distractor. The point biserial correlation is the correlation between selecting a given response on the item and performance on the total test. It should be positive and highest for the keyed correct response and negative for the distractors. This pattern suggests that candidates who did well on the test tended to select the correct answer on the item.
The ideal ranges for the item statistics are as follows:
p-value: generally in the range of .30 to .80
RPBI: around .20 for the correct answer and negative for all distractors
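One way to apply the ideal ranges above is a simple screening function that flags items for expert review. This is a sketch under stated assumptions: the function name `review_flags` is hypothetical, "around .20" is treated as a minimum for the keyed answer, and a distractor RPBI of zero or more counts as a flag. The sample values are those shown on the "Good P-value: Poor Discrimination" slide.

```python
def review_flags(key, p_values, rpbis, p_range=(0.30, 0.80), min_key_rpbi=0.20):
    """Return a list of reasons an item might need review (empty if none).

    key:      letter of the keyed answer.
    p_values: dict mapping option letter -> proportion selecting it.
    rpbis:    dict mapping option letter -> point biserial correlation.
    """
    flags = []
    low, high = p_range
    if not low <= p_values[key] <= high:
        flags.append("p-value of keyed answer outside ideal range")
    # Assumption: "around .20" is read as "at least .20".
    if rpbis[key] < min_key_rpbi:
        flags.append("low discrimination on keyed answer")
    for option, r in rpbis.items():
        if option != key and r >= 0:
            flags.append(f"distractor {option} has non-negative RPBI")
    return flags


# Statistics from the "Good P-value: Poor Discrimination" slide (C is keyed).
slide_p = {"A": 0.05, "B": 0.07, "C": 0.73, "D": 0.15}
slide_rpbi = {"A": 0.11, "B": -0.10, "C": 0.02, "D": -0.02}
slide_flags = review_flags("C", slide_p, slide_rpbi)
for reason in slide_flags:
    print(reason)
```

Flags are prompts for the subject matter experts rather than automatic grounds for deletion; the content review still decides what happens to the item.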