Frequently Asked Questions About Licensing Exams
Scoring non-multiple-choice items
CLEAR Exam Review (Summer 1991)
Eric Werner, M.A.
Question: I'm familiar with item analysis statistics for multiple-choice tests, but want to know what kinds of information should be used to evaluate performance tests.
Answer: Too few state boards evaluate their performance tests properly. Sometimes boards simply assume these tests are technically sound because they involve direct evaluation of candidates by experts. Other times, boards are uncertain, as your question indicates, about what kind of performance-test information to collect and analyze. How to proceed will depend to a large degree on exactly how your performance test is structured (how many separate performances are rated, how ratings are used to determine pass/fail results, how examiners work together, etc.), but the following are a few things to consider.
If two examiners independently rate each candidate on each of several tasks, are the paired ratings similar? Unless there is substantial consistency between examiners, it is difficult to maintain that the examination assesses anything meaningful. The correlation between paired ratings is informative, as is the average of the absolute (unsigned) differences between their ratings of each candidate. If acceptable/unacceptable determinations are made or implied, you can use any of several indices of examiner agreement, but you will want to speak with a testing specialist about which is most appropriate to your situation.
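As a rough illustration of the consistency checks just described, the sketch below computes the correlation between paired ratings, the average absolute difference, and a simple percent-agreement figure for pass/fail decisions. The rating data, the 1-to-5 scale, the cut score of 3, and all function names are assumptions made up for this example. Percent agreement is only one of several possible agreement indices and may not be the one a testing specialist would recommend.

```python
from math import sqrt
from statistics import mean


def pearson(x, y):
    """Pearson correlation between two equal-length lists of ratings."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def mean_abs_diff(x, y):
    """Average unsigned difference between paired ratings."""
    return mean([abs(a - b) for a, b in zip(x, y)])


def pass_fail_agreement(x, y, cut):
    """Share of candidates on whom both examiners reach the same
    acceptable/unacceptable decision (hypothetical cut score)."""
    return mean([1 if (a >= cut) == (b >= cut) else 0 for a, b in zip(x, y)])


# Hypothetical ratings of six candidates by two examiners (1-5 scale).
examiner_a = [3, 4, 2, 5, 4, 1]
examiner_b = [3, 5, 2, 4, 4, 2]

print(round(pearson(examiner_a, examiner_b), 2))
print(mean_abs_diff(examiner_a, examiner_b))
print(pass_fail_agreement(examiner_a, examiner_b, cut=3))
```

A high correlation combined with a large average difference would suggest that the examiners rank candidates alike but apply different standards, which is why both figures are worth reviewing together.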
Are all examiners applying rating standards in essentially the same way, or are some rating more stringently than others? Where examiners independently rate the same set of candidates (or sets of candidates believed to have the same ability on whatever is rated), you should compute an average rating for each examiner. Substantial differences among these averages will alert you to potential problems. If not all examiners rated all candidates, you might also review whether any candidates failed overall after being rated by a relatively large number of especially stringent examiners.
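One way to carry out this comparison is sketched below: it averages each examiner's ratings and flags examiners whose average falls well below the average of all examiner means. The records, the examiner labels, and the 0.75-point margin are hypothetical values chosen for the example. What counts as a "substantial" difference is a judgment the board would need to make for its own rating scale.

```python
from collections import defaultdict
from statistics import mean


def examiner_means(ratings):
    """Average rating per examiner.

    ratings: iterable of (examiner, candidate, rating) tuples.
    """
    by_examiner = defaultdict(list)
    for examiner, _candidate, rating in ratings:
        by_examiner[examiner].append(rating)
    return {e: mean(r) for e, r in by_examiner.items()}


def flag_stringent(means, margin):
    """Examiners whose mean is more than `margin` below the mean of all
    examiner means. The margin is an assumed, board-chosen threshold."""
    overall = mean(list(means.values()))
    return sorted(e for e, m in means.items() if m < overall - margin)


# Hypothetical records: three examiners each rate the same three candidates.
records = [
    ("A", "c1", 4), ("A", "c2", 4), ("A", "c3", 5),
    ("B", "c1", 3), ("B", "c2", 4), ("B", "c3", 4),
    ("C", "c1", 2), ("C", "c2", 2), ("C", "c3", 3),
]

means = examiner_means(records)
print(means)
print(flag_stringent(means, margin=0.75))
```

The comparison is only fair when the examiners rated the same candidates, or groups believed to be of equal ability, as the paragraph above notes.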
Are examiners assigning a reasonable range of ratings, or are some giving about the same rating to all candidates, even though good reason exists to believe that not all candidates are equally well prepared? By studying the range of ratings assigned and by computing a measure of variation such as the standard deviation, you can identify examiners who have restricted their ratings.
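The range and standard deviation check might look like the following minimal sketch. The ratings, the examiner labels, and the 0.5 threshold for a "restricted" spread are assumptions for illustration only. An examiner with a small spread is not necessarily wrong, since the candidates seen may truly have been similar.

```python
from statistics import pstdev


def rating_spread(ratings_by_examiner):
    """Lowest rating, highest rating, and standard deviation per examiner."""
    return {
        examiner: (min(r), max(r), pstdev(r))
        for examiner, r in ratings_by_examiner.items()
    }


def restricted(spread, min_sd):
    """Examiners whose standard deviation is below an assumed threshold."""
    return sorted(e for e, (_lo, _hi, sd) in spread.items() if sd < min_sd)


# Hypothetical ratings on a 1-5 scale.
ratings_by_examiner = {
    "A": [1, 3, 5, 2, 4],  # uses the full scale
    "B": [3, 3, 3, 3, 3],  # gives every candidate the same rating
}

spread = rating_spread(ratings_by_examiner)
print(spread)
print(restricted(spread, min_sd=0.5))
```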
Some parts of your performance test are more likely than others to distinguish between more able and less able candidates. Because you are familiar with multiple-choice item analysis, you will recognize why a board might look at the relationship between the ratings on a single part of the performance test and total scores on the test. The comparable multiple-choice item analysis statistic is the discrimination coefficient.
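A simple version of this part-versus-total analysis is sketched below, using the correlation between each part's ratings and candidates' total scores. The score matrix and function names are hypothetical. Because each part is also included in the total, the correlations are somewhat inflated; correlating each part with the total of the remaining parts is a common variant, not shown here.

```python
from math import sqrt
from statistics import mean


def pearson(x, y):
    """Pearson correlation between two equal-length lists."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def part_total_correlations(scores):
    """Correlation of each part's ratings with the total score.

    scores: one row per candidate, one column per part of the test.
    """
    totals = [sum(row) for row in scores]
    n_parts = len(scores[0])
    return [
        pearson([row[j] for row in scores], totals)
        for j in range(n_parts)
    ]


# Hypothetical ratings: five candidates, three parts of the test.
scores = [
    [4, 3, 3],
    [5, 4, 2],
    [2, 2, 3],
    [3, 3, 4],
    [1, 1, 3],
]

for part, r in enumerate(part_total_correlations(scores), start=1):
    print(f"part {part}: r = {r:.2f}")
```

In this made-up data the third part correlates weakly with the total, which would prompt a closer look at whether that part distinguishes more able from less able candidates.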
These general suggestions are not a prescription for what to do on your performance test. However, they identify some kinds of analysis that, if properly adapted to your particular situation, probably will prove useful.
CLEAR Exam Review (Winter 1995)
Norman R. Hertz
Question: What are the considerations in developing procedures for scoring performance, problem, essay, or oral examinations?
Answer: There can be a substantial amount of subjectivity, variability, and bias in scoring when examinations are in a format other than multiple-choice. Subjectivity and variability can be reduced by developing scoring criteria before scoring begins. It is best to develop the correct responses at the time the examination is constructed and to have the responses reviewed by individuals other than those constructing the examination.
For scoring performance tests, checklists are often the best format. A scorer can quickly and accurately judge whether or not a task was performed. For problem or essay examination questions, a correct answer should be prepared so the scorer can easily compare it with the candidate's answer. Oral examinations should use standardized questions and objective rating scales. For all examinations other than multiple-choice, one should use multiple evaluators (scorers) to ensure the reliability of the scoring.
Oral examinations are particularly vulnerable to systematic biases that affect the accuracy of candidate evaluation because of interactions between the examiners and the candidate. The biases are listed below.
First Impressions. Information presented during the examination may be overlooked if an examiner relies on a first impression of the candidate during the first few minutes of the examination to form an overall opinion. The examiner must be aware of this tendency, gather information about all important aspects, and weigh each area separately.
Halo Effect. Halo effect simply refers to overgeneralization. If a candidate is judged strikingly good or unqualified in a particular area, the examiner may generalize the judgment to all areas. The examiner should consider performance in each area independently.
Stereotyping. Stereotyping can introduce errors in the examination process when scores are influenced by irrelevant factors (e.g., appearance, gender, ethnicity).
Similarity Effect. Candidates similar in background and training to the examiners may be evaluated higher simply because they are like the examiners. Candidates who are "different" in some way may be rated lower than their performances merit.
Contrast Effect. Contrast effect occurs when an examiner evaluates a candidate directly against other candidates rather than against an established standard.
Central Tendency. Central tendency is an inclination to rate everyone in the middle. The examiner is playing it safe and not making any favorable or unfavorable judgments.
Negative and Positive Leniency. Leniency error is the tendency of an examiner to rate candidates too low or too high on a consistent basis. An established standard must always be kept in mind.
Oral examiners should be made aware of these systematic biases so that they are avoided during examinations.
© 2002 Council on Licensure, Enforcement and Regulation