Reliability is a crucial piece of evidence for any operational testing program, as laid out in the Standards for Educational and Psychological Testing. For diagnostic models, reliability has been conceptualized as classification accuracy and classification consistency. That is, a mastery or proficiency determination is made for each assessed attribute, and reliability methods focus on the accuracy and consistency of those decisions. This approach can be limiting in an operational setting, where additional results are often reported beyond the individual attribute classifications, for example, an overall performance level in a state-wide educational accountability assessment or a pass/fail determination in a certification or licensure examination. Existing measures of reliability for diagnostic assessments do not easily scale to other results that aggregate the individual attribute classifications. In this paper we describe a method of simulated retests for measuring the reliability of diagnostic assessments. As the name implies, this method simulates a retest for students using their operational assessment data and the estimated model parameters. The simulated retest is then scored using the standard assessment scoring rules, and the results from the operational assessment and the simulated retest are compared. In this way, we can examine the reliability not only of the attribute classifications but of any result that is reported. In a simulation study, we show that the reliability estimates obtained from the simulated retest method are highly consistent with standard measures of classification accuracy and consistency. We also demonstrate how this method can be used to evaluate the consistency of aggregations of the attribute classifications. Overall, the findings demonstrate the utility of the simulated retest method for assessing the reliability of diagnostic assessments in an operational setting.
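The simulate, score, and compare steps described above can be illustrated with a small sketch. It is not the procedure from the papers: it assumes a DINA measurement model with known guess and slip parameters, a 0.5 posterior cutoff for mastery, a hypothetical "pass" rule (at least two attributes mastered) as the aggregated result, toy random response data, and a single retest replication. All function and variable names are illustrative.

```python
import itertools
import numpy as np


def profile_posterior(resp, q, guess, slip, prior):
    """Posterior over all attribute profiles, assuming a DINA model."""
    n_attr = q.shape[1]
    profiles = np.array(list(itertools.product([0, 1], repeat=n_attr)))
    # eta[c, j] is True when profile c holds every attribute item j requires
    eta = (profiles @ q.T) == q.sum(axis=1)
    p_correct = np.where(eta, 1 - slip, guess)  # shape (profiles, items)
    # Log-likelihood of each student's responses under each profile
    loglik = resp @ np.log(p_correct).T + (1 - resp) @ np.log(1 - p_correct).T
    post = np.exp(loglik) * prior
    return profiles, p_correct, post / post.sum(axis=1, keepdims=True)


def score(resp, q, guess, slip, prior, cut=0.5, pass_count=2):
    """Assumed scoring rules: attribute mastery plus an aggregated pass/fail."""
    profiles, _, post = profile_posterior(resp, q, guess, slip, prior)
    attr_prob = post @ profiles            # marginal mastery probabilities
    mastery = attr_prob >= cut             # attribute-level classifications
    passed = mastery.sum(axis=1) >= pass_count  # hypothetical aggregate result
    return mastery, passed


def simulate_retest(resp, q, guess, slip, prior, rng):
    """Draw a profile from each student's posterior, then simulate responses."""
    profiles, p_correct, post = profile_posterior(resp, q, guess, slip, prior)
    drawn = np.array([rng.choice(len(profiles), p=p) for p in post])
    return (rng.random(resp.shape) < p_correct[drawn]).astype(int)


# Toy inputs (illustrative only): 6 items, 2 attributes, 200 students
rng = np.random.default_rng(0)
q = np.array([[1, 0], [0, 1], [1, 1], [1, 0], [0, 1], [1, 1]])
guess = np.full(6, 0.2)
slip = np.full(6, 0.1)
prior = np.full(4, 0.25)
resp = rng.integers(0, 2, size=(200, 6))

# Score the observed data, simulate one retest, score it the same way
obs_mastery, obs_pass = score(resp, q, guess, slip, prior)
retest = simulate_retest(resp, q, guess, slip, prior, rng)
re_mastery, re_pass = score(retest, q, guess, slip, prior)

# Agreement between observed and retest results, per attribute and in aggregate
attr_agreement = (obs_mastery == re_mastery).mean(axis=0)
pass_agreement = (obs_pass == re_pass).mean()
print(attr_agreement, pass_agreement)
```

Because the retest passes through the same scoring function as the observed data, any reported result that the scoring rules produce can be compared in the same way as the pass/fail flag here. In practice the comparison would presumably be averaged over many simulated retests rather than one.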
This talk summarizes the work in two papers: