Frequently Asked Questions About Licensing Exams

Scoring non-multiple choice items

CLEAR Exam Review (Summer 1991)
Eric Werner, M.A.

Question: I'm familiar with item analysis statistics for multiple-choice tests, but want to know what kinds of information should be used to evaluate performance tests.

Answer: Too few state boards evaluate their performance tests properly. Some boards simply assume these tests are technically sound because they involve direct candidate evaluation by experts. Others are uncertain, as your question indicates, about what kind of performance-test information to collect and analyze. How to proceed will depend to a large degree on exactly how your performance test is structured (how many separate performances are rated, how ratings are used to determine pass/fail results, how examiners work together, etc.), but the following are a few things for you to consider.

If two examiners independently rate each candidate on each of several tasks, are the paired ratings similar? Unless there is substantial consistency between examiners, it is difficult to maintain that the examination assesses anything meaningful. The correlation between paired ratings is informative, as is the average of the absolute (unsigned) differences between their ratings of each candidate. If acceptable/unacceptable determinations are made or implied, you can use any of several indices of examiner agreement, but you will want to speak with a testing specialist about which is most appropriate to your situation.
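
As an illustration only, the two statistics mentioned above can be computed as in the following sketch. The ratings, the 1-to-5 scale, and the function names are hypothetical and are not drawn from any actual examination; the correlation shown is the ordinary Pearson correlation, which is one common choice rather than the only one.

```python
from math import sqrt
from statistics import mean


def pearson(x, y):
    """Pearson correlation between two equal-length lists of ratings."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def mean_abs_diff(x, y):
    """Average unsigned difference between paired ratings."""
    return mean(abs(a - b) for a, b in zip(x, y))


# Hypothetical ratings of the same six candidates by two examiners.
examiner_a = [4, 3, 5, 2, 4, 1]
examiner_b = [4, 2, 5, 3, 4, 2]

r = pearson(examiner_a, examiner_b)
mad = mean_abs_diff(examiner_a, examiner_b)
print(f"correlation = {r:.2f}, mean absolute difference = {mad:.2f}")
```

A high correlation with a large average difference would suggest that the examiners rank candidates similarly but do not use the scale in the same way, which is why it helps to look at both figures.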

Are all examiners applying rating standards in essentially the same way, or are some rating more stringently than others? Where two or more examiners independently rate the same set of candidates (or sets of candidates believed to have the same ability on whatever is rated), you should compute an average rating for each examiner. Substantial differences among these averages will alert you to potential problems. If not all examiners rated all candidates, you also might make a further review to see whether any candidates failed overall after being rated by a relatively large number of especially stringent examiners.
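
A minimal sketch of this comparison follows. The examiner names, the ratings, and the half-point cutoff used to flag an examiner are all assumptions made for illustration; what counts as a "substantial" difference is a judgment for the board.

```python
from statistics import mean

# Hypothetical ratings given by three examiners to the same five candidates.
ratings = {
    "examiner_1": [4, 3, 5, 4, 4],
    "examiner_2": [3, 3, 4, 4, 3],
    "examiner_3": [2, 2, 3, 2, 3],
}

# Average rating for each examiner, and the average of those averages.
averages = {name: mean(vals) for name, vals in ratings.items()}
overall = mean(averages.values())

# Assumed cutoff: flag examiners averaging more than half a point
# below the overall average as possibly more stringent.
CUTOFF = 0.5
stringent = [name for name, avg in averages.items() if avg < overall - CUTOFF]

for name, avg in averages.items():
    print(f"{name}: average rating {avg:.2f}")
print("possibly stringent:", stringent)
```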

Are examiners assigning a reasonable range of ratings, or are some giving about the same rating to all candidates, even though good reason exists to believe that not all candidates are equally well prepared? By studying the range of ratings assigned and by computing a measure of variation such as the standard deviation, you can identify examiners who have restricted their ratings.
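
The range and standard deviation can be tabulated for each examiner along the lines of the sketch below. The data and the 0.5 threshold are hypothetical; a small spread is a reason to look more closely, not proof that an examiner rated improperly.

```python
from statistics import stdev

# Hypothetical ratings assigned by three examiners.
ratings = {
    "examiner_a": [1, 3, 5, 2, 4],
    "examiner_b": [2, 4, 3, 5, 3],
    "examiner_c": [3, 3, 3, 3, 4],
}

# Range (highest minus lowest rating) and sample standard deviation.
spread = {
    name: {"range": max(vals) - min(vals), "sd": stdev(vals)}
    for name, vals in ratings.items()
}

# Assumed threshold: a standard deviation below 0.5 is treated as a
# restricted set of ratings worth reviewing.
SD_THRESHOLD = 0.5
restricted = [name for name, s in spread.items() if s["sd"] < SD_THRESHOLD]

for name, s in spread.items():
    print(f"{name}: range {s['range']}, standard deviation {s['sd']:.2f}")
print("restricted ratings:", restricted)
```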

Some parts of your performance test are more likely than others to distinguish between more able and less able candidates. Because you are familiar with multiple-choice item analysis, you will understand why a board might look at the relationship between the ratings on a single part of the performance test and total scores on the test. The comparable multiple-choice item analysis statistic is the discrimination coefficient.
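
One way to examine this relationship is to correlate the ratings on each part with candidates' total scores, as in the sketch below. The task names and ratings are hypothetical, the total is assumed to be a simple sum of the part ratings, and the Pearson correlation is used as an illustrative choice. Because each part is included in the total, these correlations run somewhat high; a testing specialist may prefer to remove the part from the total before correlating.

```python
from math import sqrt
from statistics import mean


def pearson(x, y):
    """Pearson correlation between two equal-length lists of scores."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


# Hypothetical ratings of six candidates on three parts of a performance test.
parts = {
    "task_1": [4, 3, 5, 2, 4, 1],
    "task_2": [3, 3, 4, 2, 5, 2],
    "task_3": [3, 4, 3, 3, 4, 3],
}

# Total score for each candidate (assumed here to be the sum of part ratings).
totals = [sum(scores) for scores in zip(*parts.values())]

# Correlation of each part with the total score.
part_total = {name: pearson(vals, totals) for name, vals in parts.items()}

for name, r in part_total.items():
    print(f"{name}: part-total correlation {r:.2f}")
```

Parts with low or negative correlations are candidates for review, in the same way that multiple-choice items with poor discrimination would be.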

These general suggestions are not a prescription for what to do on your performance test. However, they identify some kinds of analysis that, if properly adapted to your particular situation, probably will prove useful.


CLEAR Exam Review (Winter 1995)
Norman R. Hertz

Question: What are the considerations in developing procedures for scoring performance, problem, essay, or oral examinations?

Answer: Scoring can involve substantial subjectivity, variability, and bias when examinations are in a format other than multiple-choice. Subjectivity and variability can be reduced by developing scoring criteria before any scoring begins. It is best to develop the correct responses at the time the examination is constructed and to have the responses reviewed by individuals other than those constructing the examination.

For scoring performance tests, checklists are often the best format. A scorer can quickly and accurately judge whether or not a task was performed. For problem or essay examination questions, a correct answer should be prepared so the scorer can easily compare it with the candidate's answer. Oral examinations should use standardized questions and objective rating scales. For all examinations other than multiple-choice, one should use multiple evaluators (scorers) to ensure the reliability of the scoring.

Oral examinations are particularly vulnerable to systematic biases that affect the accuracy of candidate evaluation because of interactions between the examiners and the candidate. The biases are listed below.

Oral examiners should be made aware of these systematic biases so that they are avoided during examinations.



© 2002 Council on Licensure, Enforcement and Regulation