by Norman R. Hertz and Roberta N. Chinn
Purpose of a Licensing Examination
A board must understand the purpose of a licensing examination in order to appreciate the process of developing, maintaining, and administering it. The sole purpose of a licensing examination is to identify persons who possess the minimum knowledge and experience necessary to perform tasks on the job safely and competently, not to select the "top" candidates or to ensure the success of licensed persons. Licensing examinations are therefore very different from academic or employment examinations: academic examinations assess how well a person can define and comprehend terms and concepts, while employment examinations rank-order candidates who possess the qualifications for the job.
Reliability and Validity
A good licensing examination should be reliable and valid. A licensing examination that is reliable produces consistent results from administration to administration when it is based on clearly outlined test specifications as well as established technical and professional standards.
An examination is considered content valid if it is based upon the results of a job analysis, sometimes called an occupational analysis or practice analysis. Validity is inferred if the examination tests job-related tasks or knowledge established by the results of the job analysis. Content-related validity rests on the premise that a candidate who passes a licensing examination is knowledgeable in the required content of the job.
Technical, Professional and Legal Standards
The Standards for Educational and Psychological Testing (1985) and the Principles for the Validation and Use of Personnel Selection Procedures (1987) are widely used as professional standards for examination programs. A number of statutes and guidelines affect licensure examinations, including the federal Uniform Guidelines on Employee Selection Procedures (1978), the Civil Rights Act of 1991, and the Americans with Disabilities Act of 1990.
Availability of Resources
A licensing examination requires significant amounts of time and money to develop, usually a year or more. As a general rule, each item appearing on an examination takes two to four hours to plan, write, and review before it is suitable for publication in an examination. Additional time is required to administer and maintain an examination.
Much of the planning and oversight responsibility can be delegated to an examination committee that reports to the board. The board can also contract with testing consultants or test providers who can evaluate the psychometric quality of the examination and assist with all aspects of the development process.
A board may underestimate the impact of administrative factors on the reliability and validity of even the best examinations. Administrative procedures are as critical to the reliability and validity of an examination as the examination development procedures.
There are some basic questions for a board to ask itself when developing administrative procedures:
Proctors. Are the proctors trained to know their responsibilities before, during, and after the examination? Is there a core group of trained proctors?
Proctor-candidate ratio. Are there sufficient numbers of proctors to monitor candidate activity during the examination?
Candidate registration. Are the candidates positively identified before they enter the examination room? Are there formal, standardized procedures in place to register candidates?
Seating arrangements. Are candidates seated in the examination room (e.g., pre-assigned or alphabetically arranged seats) so that proctors can prevent them from looking at, or talking with, other candidates during the examination?
Inventory of examination materials. Are the test booklets inventoried, e.g., by serial number, before and after they are delivered to the examination site? After the examination is over, are test booklets checked for missing pages?
Storage of examination materials. Are the test booklets and other examination materials stored in a secure location before, during, and after the examination?
Incident reports. Do the proctors and/or examiners write formal reports of alleged incidents that occurred during the examination?
Other factors help ensure equivalent testing conditions; without them, a candidate's success or failure may be the result of testing conditions rather than competency in job tasks and knowledge:
Candidate handbooks. Are candidates provided with advance information about the examination process, examination format, and subject matter areas covered on the examination? Are candidates familiar with what materials can and cannot be brought to the examination?
Scheduling. Are examinations scheduled at the same time of the day throughout the state to prevent candidates in one location from divulging information about the examination to candidates in another location?
Site. Is the examination administered at a site where candidates are subject to a minimum of distractions or disruptions?
Time allowed. Do candidates have sufficient time to complete the examination?
Number of questions. Are there sufficient questions on the examination to assess the content to be covered?
Some factors are particularly critical to the reliability and validity of practical and oral examinations:
Documentation of process. Is the candidate's examination documented during the examination, e.g., with formal examiner notes or audiotapes?
Standardized procedures. Are there standardized procedures to conduct the examination?
Candidate-examiner ratio. Is the candidate evaluated by at least two examiners, according to standardized procedures in oral or practical examinations?
Considerable thought and planning is needed to determine the examination format that best measures the competencies required of an entry-level, minimally competent practitioner. The examination can be structured into multiple-choice, essay, practical, or oral formats. Multiple-choice formats are less costly to score than practical or oral examinations and provide a suitable methodology for assessing even the most complex competencies.
The best approach is to measure all elements of practice in a multiple-choice examination, and use alternative forms of testing, such as an oral or practical examination, only if there are elements that cannot be measured in the multiple-choice examination. Essay, practical and oral examinations are not inherently better than multiple-choice examinations and require the use of many examiners or graders to evaluate the candidates’ performance. If essay, practical or oral formats must be used, the content of and administrative procedures for the examination must be as standardized as possible to ensure the equivalence of the examination for all candidates.
Appeals and Re-examination
Appeals are not intended to be an avenue for unsubstantiated complaints about the content of, or process of, the examination. Rather, appeals provide a means for all candidates to challenge the results of a licensing examination within the context of specific conditions. There may be specific statutes and regulations pertaining to the final filing date or the format for the appeal. The specific conditions under which an appeal should be granted and the recourse that should be implemented must be clearly specified. If the appeal process involves a review of examination materials, policies should be established to address candidate requests for attorneys or interpreters to be present during the review. Typically, the appeal should be formally stated in writing and should outline the premise and circumstances for the appeal.
Only under circumstances where it is clear that the candidate passed the examination and an error was made in evaluating the candidate’s performance, or in scoring the examination, should a candidate be issued a license or granted re-examination. Appeals should not be granted without substantiating documentation from proctors at the examination site or other evidence that the candidate was disadvantaged during the examination.
Examples of documentation include written reports of the alleged incident and audiotapes of oral examinations. Common grounds for appeals may include procedural errors such as the amount of time allotted or nonstandard test administration, e.g., power outages, misprinted examinations.
Consultants and Contractors
If an examination is purchased from a test provider, the board should review the job analysis, test specifications, test items, procedures used to establish the passing score, and the procedures used to administer the examination. The board should also make certain that the proposed contractual arrangements are consistent with board interests and with legal advice.
STEPS TO ENSURE RELIABILITY AND VALIDITY
Typically, practitioners provide job information through individual interviews and focus groups. The information is synthesized into a survey questionnaire that is sent to a large sample of practitioners. For example, respondents could be asked to rate the relative importance of job tasks and knowledge. Respondents could also be asked to rate the criticality and potential for harm if a task is performed incorrectly.
Care should be taken throughout the job analysis to include practitioners who represent the activities performed in actual practice. Thus, practitioners from diverse practice settings and geographic regions should be included in the interviews, focus groups, and survey sample. A sampling plan for the survey questionnaire should include persons from a variety of experience levels, geographic locations, and practice settings, with emphasis on persons with a modest amount of experience so as to reflect the job activities of entry-level, minimally competent practitioners.
Test specifications serve as a blueprint for examination development and a guide for candidates to prepare for the examination. The test specifications define major subject matter areas of practice in terms of the tasks and knowledge identified in the job analysis. Although the actual questions will vary from examination to examination, the number of questions for each subject matter area should remain the same. By following the test specifications, a board can be confident that a licensing examination assesses a candidate’s competency fairly and in a manner that is defensible in case of legal challenge.
Test Development Process
The test development process begins with formal training of groups of practitioners as to the technical, professional, and legal standards that serve as guidelines for test development. The content of test items (e.g., multiple-choice questions, problems, and vignettes) should deal with actual situations that a licensee will encounter on the job. The items should focus on a candidate’s ability to analyze a situation, diagnose a problem, or evaluate results, rather than memorize facts and formulas.
Several workshops may be needed to develop an examination. Collectively, the groups of practitioners in the different workshops provide the objectivity necessary to produce test items that can fairly assess a candidate's competence. At least two to four times the number of test items needed should be developed, because some items will not survive critical review and others are needed as alternatives within a pooled item bank.
A number of issues should be addressed in each item. Many of the issues pertain directly to the most common examination format: multiple-choice.
Readability. Is the item clearly stated for the intended audience, entry-level practitioners?
Difficulty and complexity. Is the difficulty or complexity of the item suited to the job task and knowledge being tested?
Amount of information presented. Do multiple-choice items describe a single problem? How much information is the candidate being asked to consider at one time?
Best answer. Is the key clearly the best answer in each multiple-choice item? Is there a finite set of answers considered correct in practical or oral examinations?
Item content. Does the item assess content that is important to current, mainstream practice?
Authoritative source. Is the item based on a finite set of authoritative reference sources, laws, and statutes that are widely available to the candidate?
Standard grammar. Does the item follow standard rules of grammar?
Leading questions or instructional material. Is the item free of instructional material or other material that leads the candidate to the correct answer?
Offensive content. Is the item free of offensive cultural, racial, ethnic, or gender-related expressions or phrases?
Clue words. Is the item free of material that provides clues to the correct answer in other items?
Negatives. Is the item stated in positive terms that reflect job situations, rather than in negative terms, e.g., not, except?
Absolutes. Have words that give verbal clues, e.g., none, always, been avoided?
Repetitious words and expressions. Are the key and distractors in multiple-choice items free of repetitious words and expressions that could be included in the item stem?
Length. Is the key approximately the same length as the distractors in multiple-choice items?
Level of detail. Are key and distractors in multiple-choice items similar in type, concept and focus?
Overlap. Are the key and distractors free of overlap, e.g., distractors that are subsets of the key? Are distractors that mean the same thing or opposite things avoided?
Plausibility. Are the distractors in multiple-choice items plausible to unprepared candidates who do not know the correct answer?
There are several commonly used methodologies to establish passing scores for multiple-choice tests, e.g., Angoff, Nedelsky, Ebel. The Angoff methodology is the most common. In the Angoff methodology, practitioners begin by developing a common concept of minimum competence, then individually rate the difficulty of each item on an examination relative to that concept. The judgments of all the practitioners are used to determine the passing score.
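As a rough sketch of the arithmetic involved (all judges, items, and ratings below are invented for illustration), the Angoff computation averages the judges' item-level probability estimates and sums those averages across items:

```python
# Hypothetical Angoff sketch: each judge estimates, for each item, the
# probability that a minimally competent candidate answers it correctly.
# All ratings are invented for illustration.

# ratings[judge][item] = judged probability of a correct answer
ratings = [
    [0.70, 0.55, 0.90, 0.60],  # judge 1
    [0.65, 0.60, 0.85, 0.55],  # judge 2
    [0.75, 0.50, 0.95, 0.65],  # judge 3
]

n_judges = len(ratings)
n_items = len(ratings[0])

# Average the judges' estimates for each item.
item_means = [
    sum(judge[i] for judge in ratings) / n_judges
    for i in range(n_items)
]

# The recommended raw passing score is the sum of the item averages.
passing_score = sum(item_means)
print(f"Recommended passing score: {passing_score:.2f} of {n_items}")
```

In practice a standard-setting panel is much larger, the ratings are discussed and often revised between rounds, and the final recommendation still goes to the board for approval; the sketch shows only the core calculation.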
Passing scores can be established for practical and oral examinations by developing a concept of minimum competence, and then, using narrative or behaviorally anchored descriptions of ratings to guide examiners’ evaluations of the candidate.
Post-Test Analysis of Results
Questionable items should be edited, revised, or removed. Examination results should not be released until content experts have reviewed the results of the item analysis. For practical or oral examinations, a statistical analysis of the results typically addresses the distribution of candidate scores, reliability of examiner ratings, percentage of agreement between examiners, and correlations among different subject matter areas of the examination.
COMMONLY ASKED QUESTIONS
Our test provider asks item writers to prepare items in advance and bring them to the item-writing workshop. Is this an acceptable procedure?
A number of test providers use such a procedure. However, there are a number of distinctions that should be drawn between large and small examination programs.
First, the procedures may be suited to a large-scale examination program because the items produced during the item-writing workshop are stored in a large item bank and probably will not be used on the same version of an examination. If the security of all the items from one item writer were compromised, the overall security of the examination would be maintained: only a few items from a single item writer are likely to have been included in the examination, and these more than likely would have been modified.
On the other hand, if an examination program does not have a large item bank or if a licensing program is new, it is conceivable that many of the compromised items would appear on the examination, thereby causing its results to be challenged. Most item writers are responsible and are interested in keeping the items secure. However, if the items are not secure, it is possible that unauthorized persons could gain access to them. Some agencies using outside item writers require that they sign security agreements. Because the integrity of the examination could be challenged, it is important to weigh the benefits of having the items prepared in advance against the potential costs of re-development.
Second, if the items are written by persons who have not been trained specifically to write items for licensing examinations, many of the items will not survive a critical review. The procedures for writing items for licensing examinations are sufficiently different from those for academic-type examinations that additional training is necessary for item writers whose experience is limited to academic examinations. If the items require significant revision, which is invariably the case, the benefit of bringing items to the workshop is marginal at best.
In our multiple-choice item writing workshops, we develop quite a few items that do not survive pre-testing based on item analysis results. What can we do to ensure that more of the items survive?
There are no shortcuts for producing quality items. Developing items is time consuming and, of course, costly; the cost per item is driven upwards because many of the items do not survive review and pre-testing. Some procedures, if followed, will improve the quality of items and increase the number of usable ones. Flawed items should be removed early in the development process to reduce the costs involved with processing them. The longer the items stay in the development process, the more likely they are to survive even if they are flawed.
One way is to develop the items according to test specifications. The test specifications provide a detailed description of the subject matter areas that should be included in the examination. Close adherence to the test specifications will increase the number of items that survive since the items are likely to be viewed as relevant. Another way is to ensure that the item writers are demographically representative of the practitioners. If the item writers represent the diversity of the practitioner population, biased items are unlikely to survive the initial item-writing workshop.
Individual item writers should understand that a critique of their items by workshop participants is essential to ensure that the items are universally true, and that they do not reflect an individual’s preference or a narrow application. The process of initial review is enhanced if the items were written in draft form with the understanding that, regardless of the strengths of item writers, items can be improved.
Finally, a group of practitioners should review the items before they are pretested to make final edits or to remove any items they judge to be flawed. Critical review is best accomplished by selecting a group of practitioners who did not participate in the development of the items.
How can one ensure that the information provided by subject matter experts who assist in examination-related activities reflects actual practice?
To ensure that examinations are related to practice, or that results from an occupational analysis actually depict practice, the subject matter experts should be selected with the following considerations in mind: specialties of practice, ethnicity, gender, tenure of licensure, geographic region, and location of work setting. It is important to involve a large percentage of subject matter experts with no more than five years' experience so that the questions are sensitive to entry-level requirements.
We are interested in adopting an examination that was developed by a professional association and is in use by some states. However, our consultant has advised us that the examination does not meet the testing standards for our state. Aren’t the testing standards consistent from state to state?
Yes, they are. The standards most often applied are the Standards for Educational and Psychological Testing. However, these standards are limited in scope when considering the entire examination program. For example, although most examination providers conduct an occupational analysis, the quality of occupational analyses varies, and the consultant may object to the methods employed. The sample size may have been too small, segments of the population may have been underrepresented, or, as often happens, the number of respondents from a state may be too small for appropriate statistical inferences. The consultant may be concerned about the manner in which the examination questions were developed. He or she may object to the quality of the questions when item and overall test statistics are not available, or if the statistics identify problems with the items. The consultant may also challenge whether the questions are actually measuring job-related competencies. Another area of concern may be the passing score workshop: if the passing score study was not carried out by applying accepted procedures (e.g., Angoff, Nedelsky, Ebel), the consultant may challenge the process. Thus, the consultant must not only evaluate whether the steps of examination development were carried out, but must also evaluate the quality of the work at each step of the process. Ultimately, the licensing agency is responsible for its licensing examinations, and the consultant's advice is usually designed to prevent challenges to the process.
Examination programs oftentimes are composed of multiple-choice questions and also contain a performance assessment (e.g., oral, practical, clinical, etc.). What are the issues in designing these tests?
The most important concept to be addressed in responding to your question is whether the performance examination measures the skill or knowledge better than some other form of assessment. According to Guion (1995), the meaning of performance assessment is more obscure and subject to irrelevant sources of variance.
One of the shortcomings of performance examinations is that the choice of tasks is a potential source of measurement error. If a few tasks are selected for measurement, the results of the assessment do not provide a clear indication of the candidate’s capacity to perform all the activities in the profession. On the other hand, if the tasks being assessed are more varied, the reliability of the assessment may be lowered. The selection of tasks becomes a problem because performance examinations require more time to administer than, for example, multiple choice examinations.
Another shortcoming is the scoring procedure. The best procedure would be to establish one passing score based upon the sum of the scores for each type of examination. One of the advantages of adding the scores is that the reliability of the examination results is maximized. It is very difficult to support the claim that each of the tests measures such different parts of practice that each examination must be passed separately.
After our testing consultant presents information (item analysis) about the results of our examinations, we are unsure about how to evaluate the quality of test questions. We understand the concept of test reliability. What do you look for in a "good" multiple-choice item?
First, an item should be neither too easy nor too difficult. The "difficulty" of a multiple-choice item is characterized by the percentage of candidates who answered the question correctly. In item analysis printouts, the correct answer is usually denoted with an asterisk or other symbol. As a frame of reference, the reliability of an examination is maximized when 50 per cent of the candidates answer the question correctly. The percentage of candidates choosing the correct answer should be no less than 35 per cent and no greater than 90 per cent. If the difficulty index is less than 35 per cent, the item is too difficult; if the index is greater than 90 per cent, the item is too easy. Items outside these limits should be removed because they lower the reliability of the examination. Easy items are sometimes included, however, because the concept is important and mastery of the concept needs to be demonstrated.
The next value that should be considered is the column with the discrimination index, or "point biserial." You do not have to understand the statistical derivation of the discrimination index in order to use it effectively. The discrimination index ranges from –1.00 to +1.00; the index for the correct answer should be positive—the larger the better, but at least +.10. The discrimination indices for incorrect responses ("distractors") should be negative—the more negative, the better. If the discrimination indices do not meet these standards, the item should be revised or removed from the test. However, item analysis results can be misleading; if the number of candidates is fewer than 100, the results are marginally meaningful.
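A minimal sketch of these two indices, using invented response data and the cutoffs described above (35-90 per cent for difficulty, at least +.10 for the discrimination of the correct answer):

```python
# Hypothetical item-analysis sketch: difficulty (proportion correct) and a
# point-biserial discrimination index for each item. The response data and
# flagging thresholds are invented for illustration.
import statistics

# responses[candidate][item] = 1 if answered correctly, 0 otherwise
responses = [
    [1, 1, 0, 1],
    [1, 0, 0, 1],
    [0, 1, 1, 1],
    [1, 1, 0, 0],
    [0, 0, 1, 1],
    [1, 1, 1, 1],
]

def point_biserial(item_scores, total_scores):
    """Pearson correlation between a 0/1 item score and the total score."""
    mean_i = statistics.mean(item_scores)
    mean_t = statistics.mean(total_scores)
    cov = sum((i - mean_i) * (t - mean_t)
              for i, t in zip(item_scores, total_scores))
    var_i = sum((i - mean_i) ** 2 for i in item_scores)
    var_t = sum((t - mean_t) ** 2 for t in total_scores)
    return cov / (var_i * var_t) ** 0.5

totals = [sum(row) for row in responses]  # each candidate's total score
for item in range(len(responses[0])):
    scores = [row[item] for row in responses]
    difficulty = sum(scores) / len(scores)           # proportion correct
    discrimination = point_biserial(scores, totals)  # should be positive
    flagged = not (0.35 <= difficulty <= 0.90) or discrimination < 0.10
    status = "review" if flagged else "ok"
    print(f"item {item + 1}: p={difficulty:.2f}, "
          f"r_pb={discrimination:.2f} ({status})")
```

With only six candidates the indices here are, of course, only illustrative; as noted above, item statistics from fewer than 100 candidates are marginally meaningful.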
The quality of the items can be quickly evaluated using only the difficulty and discrimination indices. A measurement specialist who is knowledgeable in the content can assist in identifying problem areas within the examination and can provide valuable insights during revision of an item's content.
A national association has responsibility for our examination program, including establishing the passing score. What options do we have other than accepting passing scores that have been established by the national association?
State licensing boards should strongly consider retaining the responsibility for establishing a passing score that reflects the standards required for minimal competence in their state. For many professions, the scope of practice does not vary significantly from state to state. However, performance expectations, training requirements, or extent of experience may vary and lead to higher or lower expectations in some states than in others. In such cases, the licensing board may wish to establish its own passing score.
Licensing boards also should consider obtaining an explanation from the national association when a very high percentage of candidates passes or fails the examination, and the percentage of candidates passing, or pass rate, differs substantially from the national pass rate. It would be best to work within the national association to influence the passing score process by involving practitioners from your state in the test development and passing score processes. However, if your licensing board believes that the passing score is not set at the appropriate level to provide for public protection, then it may be necessary for the board to establish its own passing score. If a trade association is responsible for developing and marketing the examination, it should not establish the passing score. In such cases, a panel of practitioners independent of the association can establish the passing score to help allay accusations that entrance into the profession is restricted.
We use a criterion-referenced methodology when establishing the passing score for our examination. The result from applying this methodology is that the actual passing score varies from examination to examination. What is the best way to explain the variation?
In theory, the variation is easy to explain. Since the difficulty of the questions selected for a given administration differs from those of another administration, the passing score varies around the fixed concept of minimal competence to account for the differences in the examination questions. That is, the examination with more difficult questions, overall, will have a lower passing score. The variation in the passing score is psychometrically sound and legally defensible. However, as you mentioned, the variation is difficult to explain to candidates who may achieve a score that fails them on one examination but would be a passing score on a subsequent examination.
A good way to reduce, or even eliminate, variations in passing scores is to assemble tests that are equal in average difficulty, i.e., the proportion of candidates answering each item correctly. The passing score is not equivalent to the average difficulty; however, if the average difficulty of the examinations is about the same, then the passing scores should not vary significantly. A limitation of this method is that the multiple-choice items in an examination must have stable statistics. This stability usually is not obtained until the items have been administered to several hundred candidates. With a small examination program, stable item statistics may not be available.
A second method that works for both large and small programs is to report scaled scores. When a scaled criterion-referenced passing score is used for multiple-choice examinations, the score required for passing remains consistent. Scaling the score does not affect the level of performance required for passing the examination. Scaling simply allows the licensing agency to report, for example, that the passing score is 70 (not to be confused with 70 per cent) while the actual passing score is free to vary according to the difficulty of the examination.
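One illustrative way to implement such scaling (an assumption for the sake of example, not a prescribed standard) is a piecewise linear mapping that anchors each form's raw cut score to a reported score of 70, so the reported passing score stays fixed while the raw cut varies with form difficulty:

```python
# Illustrative scaled-score sketch: anchor the form-specific raw passing
# score to a reported 70 and the maximum raw score to 100, interpolating
# linearly. All raw cut scores below are invented for illustration.

def scale(raw, raw_cut, raw_max, reported_cut=70, reported_max=100):
    """Map a raw score to the reported scale; the raw cut always maps to 70."""
    if raw >= raw_cut:
        # Passing region: raw_cut..raw_max maps to 70..100.
        span = (reported_max - reported_cut) / (raw_max - raw_cut)
        return reported_cut + (raw - raw_cut) * span
    # Failing region: 0..raw_cut maps to 0..70.
    return raw * reported_cut / raw_cut

# Two forms of differing difficulty: the raw cut varies, the reported cut
# does not, so candidates on both forms see a passing score of 70.
harder_form = scale(132, raw_cut=132, raw_max=200)  # 70.0
easier_form = scale(138, raw_cut=138, raw_max=200)  # 70.0
print(harder_form, easier_form)
```

The design choice here is simply that equal reported scores on different forms represent the same standing relative to the passing standard, which is what lets the agency report "70" consistently without freezing the raw cut.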
What is the role of board members and educators in the development of a valid examination program?
Board members may provide initial input for the content of the job analysis or issues to be addressed in the examination. However, the role of board members should be limited to approval and disapproval of products and recommendations from the examination committee or contractor. The rationale for limited involvement is based upon several considerations. First, boards should be able to independently and objectively evaluate the quality of the examination programs. When board members are involved in the development of an examination, the evaluation is no longer independent or objective. Second, in the case of security breaches, board members who were involved in developing the examination programs may become suspect. Third, a board member may be subject to pressure from some candidates to grant special considerations. Fourth, licensed board members are usually viewed as experts. Because of their background and board membership, their presence in test development or passing score workshops has undue influence on the decisions reached during the workshop.
The role of educators should be minimized. The purpose of licensing examinations is so different from that of academic examinations that educators have a difficult time making the transition. Educators often base their perceptions of the difficulty of test items upon whether candidates "should" know the information because it is or is not taught. Educators often feel compelled to make a licensing examination an end-of-course ("final") examination.
References
American Educational Research Association, American Psychological Association, & National Council on Measurement in Education (1985). Standards for educational and psychological testing. Washington, DC: American Psychological Association.
Bellezza, F. S., & Bellezza, S. F. (1995). Detection of copying on multiple-choice tests: An update. Teaching of Psychology, 22 (3), 180-182.
California Department of Consumer Affairs. (1995). Examination security. Technical report OER 95-01. Sacramento, CA: Office of Examination Resources.
Carlin, J. B., & Rubin, D. B. (1991). Summarizing multiple-choice tests using three informative statistics. Psychological Bulletin, 110(2), 338-349.
Case, S. M., Swanson, D. B., & Ripkey, D. R. (1994). Comparison of items in five-option and extended-matching formats for assessment of diagnostic skills. Academic Medicine, 69(10), S1-S4.
Cizek, G. J., & O’Day, D. M. (1994). Further investigation of nonfunctioning options in multiple-choice test items. Educational and Psychological Measurement, 54(4), 861-872.
Crehan, D. K., Haladyna, T. M., & Brewer, B. W. (1993). Use of an inclusive option and the optimal number of options for multiple-choice items. Educational and Psychological Measurement, 54(1), 241-247.
Ebel, R., & Frisbie, D. A. (1991). Essentials of educational measurement (5th ed). Englewood Cliffs, NJ: Prentice Hall.
Elstein, A. S. (1993). Beyond multiple-choice questions and essays: The need for a new way to assess clinical competence. Academic Medicine, 68(4), 244-249.
Gross, L. J. (1994). Logical versus empirical guidelines for writing test items: The case of "none of the above." Evaluation and the Health Professions, 17(1), 123-126.
Guion, R. M. (1995). Commentary on values and standards in performance assessment. Educational Measurement: Issues and Practice, 14(4), 25-27.
Haladyna, T. M., & Downing, S. M. (1993). How many options is enough for a multiple-choice test item? Educational and Psychological Measurement, 53(4), 999-1010.
Kolstad, R. K., & Kolstad, R. A. (1991). The effect of "none of these" on multiple-choice tests. Journal of Research and Development in Education, 24(4), 33-36.
Livingston, S. A., & Zieky, M. J. (1982). Passing scores: A manual for setting standards of performance on educational and occupational tests. Princeton, NJ: Educational Testing Service, 54-64.
Maguire, T., Skakun, E., & Harley, C. (1992). Setting standards for multiple-choice items in clinical reasoning. Evaluation and the Health Professions, 15(4), 434-452.
COPYRIGHT 2000. Rights to copy and distribute this publication are hereby granted to members of the Council on Licensure, Enforcement and Regulation (CLEAR), providing credit is given to CLEAR and copies are not distributed for profit.