Requests for proposals for testing services often state that the sponsoring organization wants a “reliable and valid test.” From a measurement perspective, this request cannot be met as worded, because reliability in its technical sense differs from the popular one: it refers to the precision or consistency of test scores, not to the test itself. For example, the reliability of scores would likely differ if calculated on two groups of examinees with different ranges of ability.
A central reliability concept is the internal consistency of scores. If a test is split into two halves (for example, by separately scoring the odd-numbered and the even-numbered items), the correlation between the two sets of half-test scores should be high, implying high internal consistency. Two reliability statistics are frequently cited, and for items scored dichotomously (one point for right, zero for wrong) they are the same statistic: KR20, a formula proposed by Kuder and Richardson some seventy years ago, and Cronbach’s alpha. Cronbach’s alpha also applies to items with weighted scoring, not just right or wrong.
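The equivalence of the two statistics for dichotomous items can be checked directly. The sketch below implements both formulas with numpy and runs them on simulated 0/1 responses; the data-generating model (a latent ability driving item responses) and all sample sizes are illustrative assumptions, not part of the original text.

```python
import numpy as np

def cronbach_alpha(scores):
    """Cronbach's alpha for an (examinees x items) score matrix."""
    scores = np.asarray(scores, dtype=float)
    k = scores.shape[1]
    item_vars = scores.var(axis=0)           # per-item score variance
    total_var = scores.sum(axis=1).var()     # variance of total test scores
    return (k / (k - 1)) * (1 - item_vars.sum() / total_var)

def kr20(scores):
    """KR20 for dichotomously (0/1) scored items."""
    scores = np.asarray(scores, dtype=float)
    k = scores.shape[1]
    p = scores.mean(axis=0)                  # proportion correct on each item
    total_var = scores.sum(axis=1).var()
    return (k / (k - 1)) * (1 - (p * (1 - p)).sum() / total_var)

# Hypothetical data: 500 examinees, 40 items, with right/wrong responses
# driven by a latent ability so the items hang together.
rng = np.random.default_rng(42)
theta = rng.normal(size=(500, 1))            # examinee ability
difficulty = rng.normal(size=40)             # item difficulty
p_correct = 1 / (1 + np.exp(-(theta - difficulty)))
responses = (rng.random((500, 40)) < p_correct).astype(int)

print(f"KR20  = {kr20(responses):.3f}")
print(f"alpha = {cronbach_alpha(responses):.3f}")
```

For 0/1 items, each item's variance is exactly p(1 − p), so the two functions return the same number.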
KR20 and alpha are conceptually the average of all possible split-half correlations. In general, the higher the correlation, the better; the reliability coefficient has a conceptual upper limit of 1.00. In practice, the value of KR20 obtained for certification and licensure examinations of around 200 items tends to be .90 or higher. Reliability statistics are affected by many factors, such as test length, the performance of individual test items, and the characteristics of the candidate sample.
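For reference, the standard KR20 formula, with k the number of items, p_i the proportion of examinees answering item i correctly, and the denominator the variance of total test scores, is:

```latex
\mathrm{KR20} \;=\; \frac{k}{k-1}\left(1 \;-\; \frac{\sum_{i=1}^{k} p_i\,(1-p_i)}{\sigma_X^{2}}\right)
```

The formula makes the factors mentioned above visible: lengthening the test increases k, and a more homogeneous candidate sample shrinks the total-score variance in the denominator, which lowers the coefficient.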
Another reliability index that has gained increasing attention is pass-fail decision consistency. Measurement can never be completely accurate and always contains an error component. The decision consistency (or classification accuracy) statistic estimates the extent to which candidates would be consistently classified as passing or failing based on a particular passing score. For example, if 96% of candidates were estimated to be classified consistently as passing or failing at a particular cutoff score, that would be considered a good outcome. For a licensure or certification test, classification accuracy may be of greater interest than a classical reliability statistic because passing or failing is the outcome of interest. In some cases, such as variable-length computerized adaptive tests, traditional reliability statistics are not available at all, but a classification accuracy statistic may be obtained using simulated data.
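The simulation idea in the last sentence can be sketched in a few lines: generate two hypothetical parallel administrations of a test from the same latent abilities, apply the cutoff to each, and report the proportion of candidates classified the same way both times. Every number here (candidate count, item count, cutoff, and the simple Rasch-style response model) is an illustrative assumption.

```python
import numpy as np

rng = np.random.default_rng(0)
n_candidates, n_items, cutoff = 2000, 200, 120   # illustrative values

# Latent ability and item difficulties (hypothetical Rasch-style model)
theta = rng.normal(size=(n_candidates, 1))
b = rng.normal(size=n_items)
p_correct = 1.0 / (1.0 + np.exp(-(theta - b)))

# Two simulated parallel administrations: same abilities, fresh response error
form1 = (rng.random((n_candidates, n_items)) < p_correct).sum(axis=1)
form2 = (rng.random((n_candidates, n_items)) < p_correct).sum(axis=1)

# Proportion of candidates receiving the same pass/fail decision both times
consistency = ((form1 >= cutoff) == (form2 >= cutoff)).mean()
print(f"Estimated decision consistency: {consistency:.3f}")
```

Candidates whose true ability places them near the cutoff are the ones at risk of flipping between pass and fail, so consistency falls as more of the candidate pool sits close to the passing score.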
ERAC's Question and Answer Series is prepared by the CLEAR Examination Resources and Advisory Committee (ERAC).
©2008 The Council on Licensure, Enforcement and Regulation (CLEAR)