Frequently Asked Questions About Licensing Exams

Setting a passing score / Angoff ratings

CLEAR Exam Review (Winter 1990)
Eric Werner, M.A.

Question: The definition of "entry-level knowledge" applied at a pass-point setting workshop would seem to be quite important to the outcome of the effort and significantly influenced by the qualifications of subject matter experts (judges) chosen to participate in the workshop. How should such an experts' panel be assembled and managed to help ensure a technically sound and defensible passing score for a multiple-choice test?

Answer: A question very similar to this was asked at the recent "Fundamentals of Testing" symposium held at the CLEAR Annual Conference in Seattle. The question is a good one, because passing scores are so critical from a practical standpoint, and common sense suggests that judges' professional background and priorities will influence their ideas about what level of test performance can reasonably be expected of licensure applicants. In fact, research and experience have shown that the kinds of judges used in connection with a given standard-setting procedure can markedly influence the resulting standards.

There are quite a few methods for developing passing scores. My answer to your question applies especially to those that involve judgments about the questions of a test for which a standard is to be set. The methods developed by Angoff and Nedelsky, including their variations, are the most common examples. Each of these methods yields a passing score based upon an aggregation, over all test questions, of judges' evaluations of correct-answer expectations for candidates possessing just sufficient knowledge and ability to be credentialed ("entry-level knowledge," as this term is used in your question).

There are three ways in which the potential problem raised by your question can be addressed: careful selection of standard-setting participants, proper workshop procedure and orientation, and appropriate treatment of the quantitative data resulting from the workshop.
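The Angoff aggregation described above can be sketched in a few lines: each judge estimates, for every item, the probability that a just-qualified candidate answers correctly, and the passing score is the panel average of each judge's summed expectations. All panel data here are invented for illustration.

```python
def angoff_passing_score(ratings):
    """ratings: one list per judge of per-item probabilities (0.0-1.0)
    that a just-acceptable candidate answers each item correctly."""
    judge_totals = [sum(judge) for judge in ratings]   # expected raw score per judge
    return sum(judge_totals) / len(judge_totals)       # mean over the panel

# Three hypothetical judges rating a five-item test.
panel = [
    [0.8, 0.6, 0.9, 0.5, 0.7],   # judge A expects 3.5 items correct
    [0.7, 0.5, 0.8, 0.6, 0.6],   # judge B expects 3.2 items correct
    [0.9, 0.7, 0.9, 0.6, 0.8],   # judge C expects 3.9 items correct
]
cut = angoff_passing_score(panel)   # (3.5 + 3.2 + 3.9) / 3
```

On a real 150-item exam the same computation simply runs over 150 probabilities per judge; the cut score is then typically rounded to a whole raw score.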

Without careful selection of standard-setting participants, passing scores will not have credibility. Judges selected must be representative of all persons professionally qualified to determine that level of tested professional knowledge necessary for safe and effective practice of the occupation or profession. Every judge should be highly familiar with the nature of practice in general and, collectively, judges should understand practice within important specialty areas.

Judges should have a stake in the pass/fail decisions to be based on test performance. Professionals who are responsible for reviewing the work of certificants or licensees (for example, as building inspectors review the work of design and construction professionals) are likely to be enthusiastic and productive participants in a pass-point setting workshop. Finally, the panel of judges should be composed so that no one member exercises undue influence over others. The aim is to obtain independent judgments, not deferential endorsements by the majority of the views of a few.

The procedures employed at a standard-setting workshop are also a significant influence on the definition of entry-level knowledge used to guide judgments about test items. This is not the place to review the relatively technical steps involved. A good how-to discussion can be found in Passing Scores: A Manual for Setting Standards of Performance on Educational and Occupational Tests (1982) by Livingston and Zieky (available through ETS). A more technical review of procedural issues is given in Jaeger's chapter in the third edition of Educational Measurement (1989), edited by Linn (see pp. 491-496).

The important procedural point is that the work of standard-setting judges should be guided by an analysis of practice (e.g., the results of a job analysis), an awareness of knowledge and job-related skills that distinguish acceptable from unacceptable candidates, and familiarity with what the test measures. If the definition of the "just-acceptable" candidate is based on such information, extreme or unwarranted variation in views about how such a candidate should perform on a test is apt to be minimized.

After the workshop, it is appropriate to combine consideration of how particular judges performed at the session with close review of the quantitative data generated by those same judges during the workshop. Occasionally, despite careful selection and good process, "outliers" will be found who produce, without justification, extremely high or extremely low expectations of how "just-acceptable" candidates should perform on a test. In a recent example, a licensing board elected to eliminate data from such an outlier prior to computation of the passing score to be used. Obviously, this approach can be controversial and should be used with discretion. Some testing specialists would not use it at all.

CLEAR Exam Review (Summer 1990)
Eric Werner, M.A.

Question: Our board uses a 150-item multiple-choice examination representing four major content areas which collectively embrace nearly 20 subordinate content areas. We are considering changing our exam requirement so that a successful candidate would have to satisfy passing scores applied to each of the four major areas rather than just one overall passing score. Our test consultant refers to our anticipated approach as a "multiple-cutoff" method and expresses reservations. What would you suggest?

Answer: I'll respond to the general case, since I have too little information about the particular situation you describe to suggest either the status quo or the change considered. Although multiple cutoff arrangements are appropriate in some testing contexts, careful consideration of several matters should precede decisions to implement them. The issues are both psychometric and administrative.

Reliability and validity of pass/fail decisions are major issues. Assuming traditional approaches to examination development, the short tests that would result from partitioning a 150-item measure into four or more separate tests will probably not support accurate inferences about whether candidates have sufficiently mastered each of the subordinate content areas involved.
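The reliability concern above can be made concrete with the Spearman-Brown prophecy formula, which predicts how reliability changes when a test is shortened. The 0.90 full-length reliability used below is an assumed figure for illustration, not a property of any particular exam.

```python
def spearman_brown(reliability, length_ratio):
    """Predicted reliability when test length is multiplied by
    length_ratio (Spearman-Brown prophecy formula)."""
    k = length_ratio
    return k * reliability / (1 + (k - 1) * reliability)

full_test = 0.90                              # assumed reliability, 150 items
one_part = spearman_brown(full_test, 38 / 150)  # one ~38-item section: ~0.70
```

A drop from 0.90 to roughly 0.70 means each section's scores carry substantially more measurement error, so pass/fail decisions made section by section are correspondingly less accurate than a decision made on the 150-item total.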

Another issue is the intercorrelations among the subordinate areas considered. In general, it makes less sense to implement multiple cutoffs when scores on the sections are highly related than when they are relatively independent of one another. High intercorrelations are quite common. In either case, a decision to use multiple cutoffs should be based on strong evidence from analysis of licensee practice that relatively strong performance in one area cannot offset relatively weak performance in another.

Also consider the expense and workload issues involved in such a change. For example, a multiple-cutoff strategy of the sort you describe would require that four passing scores be properly set and that four test forms be equated. Then, too, there would be the administrative tasks of tracking and appropriately scheduling re-take candidates for the test parts they have yet to pass. Finally, passing rates for the examination as a whole might drop markedly as the number of separately scored parts increases.
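The pass-rate effect mentioned above is easy to see in a toy comparison: a candidate who clears the total-score cut can still fail a single section under a multiple-cutoff rule. All section scores and cut scores below are invented.

```python
# Hypothetical candidates' scores on four sections (A-D).
candidates = [
    {"A": 30, "B": 28, "C": 27, "D": 29},   # solid across the board
    {"A": 34, "B": 33, "C": 19, "D": 31},   # high total, weak in C
    {"A": 25, "B": 24, "C": 23, "D": 22},   # weak overall
]
section_cuts = {"A": 26, "B": 25, "C": 25, "D": 25}
total_cut = 105

# Single overall cutoff: pass/fail on the summed score.
single = [sum(c.values()) >= total_cut for c in candidates]

# Multiple cutoffs: every section must clear its own cut score.
multiple = [all(c[s] >= cut for s, cut in section_cuts.items())
            for c in candidates]
```

Here the second candidate passes under the single cutoff (total 117) but fails under multiple cutoffs (19 in section C), illustrating why overall pass rates tend to fall as separately scored parts are added.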

Perhaps your board should adopt multiple cutoffs, but you can see that there are many issues to analyze before deciding to do so. If you go ahead, be sure to provide candidates and training programs with plenty of advance information about the change.

CLEAR Exam Review (Summer 1993)
Eric Werner, M.A.

Question: How do you set a passing score for a performance exam? Can you apply written test methods like Angoff?

Answer: This matter is somewhat complex, and a testing agency can fruitlessly expend considerable resources if it undertakes such a project in an improper manner. So, I'll give you a short-form answer, some helpful references, and the advice that you seek assistance from a testing professional who understands your particular program thoroughly. My comments will emphasize some interrelated conceptual and technical decisions to make (only roughly in the order in which they are discussed) rather than less technical, but crucial, matters like properly selecting subject-matter experts.

First, you must answer the question "How good is good enough to practice safely and effectively?" Your decision on what constitutes minimal acceptable competence will play a central role in determining the test performance expectations that result from the standard-setting effort. But defining what is good enough can be difficult. For help in this connection, look at the article by Mills, Melican, and Ahluwalia (1991). Also, you must decide how the exam is to be scored. (This should be done when the exam is designed.) As noted in the answer to the previous question, performance tests can be graded using holistic or analytic methods. Holistic scoring is most appropriate for exams that test narrowly defined competency areas (unlike most licensing exams). It usually involves giving a single score to each candidate after considering his or her performance in relation to a model performance or a set of model performances at different points along a continuum of performance acceptability.

With analytic scoring, examiners usually function less as evaluators of candidate performance and more as observers and recorders with respect to whether the candidate performed a certain task (process) or achieved a certain result (product). Yes/no criterion checklists completed by each examiner are often used for analytic scoring. Candidates earn points to the degree that they satisfy the criteria. I assume analytic scoring in what follows.

A third decision concerns whether to set passing scores so as to (a) require at least minimal acceptable performance in each of several areas tested, (b) allow stronger performance in some areas to offset weaker performance in others, or (c) permit some limited degree of compensation -- in other words, combine the approaches in (a) and (b). This decision usually affects whether and how performance on several subjects will be combined. That is, a board may require candidates to pass each of several exam parts or it may combine part scores into a single aggregate value and make pass/fail determinations on the basis of this total score. But the decision can also affect the way pass/fail determinations for a particular exam part are made, as illustrated below.
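The three options can be expressed as scoring rules. In this sketch, all part cut scores, floors, and the total cut are invented; "limited compensation" is modeled as allowing a candidate to dip below a part's cut score but not below a lower floor, provided the total still clears the aggregate cut.

```python
part_cuts   = {"A": 70, "B": 70, "C": 70}   # conjunctive cut per part
part_floors = {"A": 60, "B": 60, "C": 60}   # absolute minimum per part
total_cut = 210                              # aggregate cut score

def conjunctive(scores):
    """(a) Minimal acceptable performance required in every part."""
    return all(scores[p] >= part_cuts[p] for p in part_cuts)

def compensatory(scores):
    """(b) Strength anywhere can offset weakness anywhere."""
    return sum(scores.values()) >= total_cut

def partial_compensation(scores):
    """(c) Compensation allowed, but no part may fall below its floor."""
    return (compensatory(scores)
            and all(scores[p] >= part_floors[p] for p in part_floors))

candidate = {"A": 80, "B": 65, "C": 72}   # strong in A, slightly weak in B
```

This candidate fails the conjunctive rule (65 < 70 in part B) but passes both compensatory rules (total 217, and no part below 60), which is exactly the kind of case a board must decide how to treat.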

Fourth, it is important to decide whether performance criteria are to carry equal or differing weights. Variation in weights will reflect testing agency decisions about the criticality of the criteria for safe, effective practice. For example, an absolutely crucial criterion might be numerically weighted so that a candidate who doesn't satisfy it will fail, regardless of his or her performance on other criteria. Thus, whether and how to weight criteria can be very much a part of the compensation issue discussed above, although a decision to allow compensation does not require differential weighting of criteria. See Gross (1993) for a discussion of criterion weighting on a national performance exam.
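One way to implement the "absolutely crucial criterion" idea above is to weight that criterion so heavily that a candidate who misses it cannot reach the passing score even with every other point earned. The criterion names, weights, and cut score below are invented for illustration.

```python
# "sanitation" is the crucial criterion; the cut score (70) exceeds the
# sum of all non-crucial weights (40), so missing it guarantees failure.
weights = {"sanitation": 60, "task1": 10, "task2": 10, "task3": 10, "task4": 10}
cut = 70

def score(criteria_met):
    """criteria_met: set of criteria the candidate satisfied."""
    return sum(w for c, w in weights.items() if c in criteria_met)

perfect_but_unsafe = score({"task1", "task2", "task3", "task4"})  # 40: fails
safe_but_sloppy = score({"sanitation", "task1", "task2"})         # 80: passes
```

The arithmetic makes the policy explicit: because 70 > 10 + 10 + 10 + 10, no combination of non-crucial points can compensate for a failed crucial criterion.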

Fifth, you must apply a method for deciding how well candidates must perform in order to pass. The Angoff approach you mention in your question is among the methods available. With Angoff, one can sum over criteria the estimated probability that a just-acceptable candidate would satisfy each of them, just as on a written test. (If criteria have been differentially weighted, the process and its interpretation become a little more complicated.) Upward or downward adjustments to the sum may be appropriate.

The Ebel method can be used, too. Take a look at the discussion of this method in Livingston and Zieky (1982). I think you'll quickly see how it can be applied to analytically scored performance exams. In Colorado, we're experimentally applying a modification of Ebel's method to our cosmetologist performance exam. Our adaptation has resulted in pass/fail decision rules of this nature: The candidate must satisfy 65 percent of the criteria designated as least critical and 75 percent of the criteria designated as most critical. This approach seems to be passing about the same percentage of candidates as our traditional approach of a single overall percentage standard. However, a somewhat different mix of candidates is passing under the new method. This is because it has become more difficult to pass by accumulating points on less critical criteria while failing to satisfy criteria more directly and significantly related to public health and safety.
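A decision rule of the kind described for the Colorado adaptation can be sketched directly from the checklist data: the candidate must satisfy 65 percent of the least-critical criteria and 75 percent of the most-critical criteria. The criterion labels and checklist results below are invented.

```python
def passes_modified_ebel(met, least_critical, most_critical):
    """met: set of criteria the candidate satisfied. The other two
    arguments are the criterion groups produced by criticality ratings."""
    least_ok = len(met & least_critical) / len(least_critical) >= 0.65
    most_ok = len(met & most_critical) / len(most_critical) >= 0.75
    return least_ok and most_ok

most = {"m1", "m2", "m3", "m4"}               # most-critical criteria
least = {"l1", "l2", "l3", "l4", "l5", "l6"}  # least-critical criteria

# This candidate satisfies 3/4 (75%) of the most-critical criteria and
# 4/6 (~67%) of the least-critical criteria, so both thresholds are met.
result = passes_modified_ebel({"m1", "m2", "m3", "l1", "l2", "l3", "l4"},
                              least, most)
```

Note how the rule embodies the point made in the text: a candidate can no longer pass purely by accumulating points on the less critical criteria, because the most-critical group carries its own, higher threshold.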

CLEAR Exam Review (Winter 1995)
Norman R. Hertz

Question: We know that the passing score for a licensing examination should discriminate between candidates who are "minimally competent" to practice at entry level and those who are not qualified. We use a basic criterion-referenced methodology (Angoff) to establish the passing score. Using this method, we are required to develop an entry-level criterion as the reference to judge the difficulty of the test questions. What guidelines can we apply in developing the entry-level criterion?

Answer: There is nothing more difficult in establishing passing scores than developing the criterion of "minimum competence." Most often the process for establishing the criterion for minimum competence is based upon the principle that "time in practice" is the controlling factor. Because of the wide variety of practice settings, length of time in practice is thought to be the best guide in developing the criterion. As a rule, practitioners have not reached journey level until after about five years of practice. The "Uniform Guidelines on Employee Selection Procedures" does not address the definition of entry level for licensing, but does address the overall definition of entry level. Section 4 (I) states that a reasonable length of time for entry level "will seldom be more than 5 years." One can extrapolate this interpretation to licensure and define entry level as no greater than the level of competence that practitioners possess after five years of licensure. The accuracy of the passing score in differentiating between candidates qualified for licensure and those who are not is enhanced when the judges clearly understand the concept of entry level. Therefore, before they rate test questions, sufficient effort should be directed to building a consensus definition of entry level.


© 2002 Council on Licensure, Enforcement and Regulation