Date of Degree
PhD (Doctor of Philosophy)
Psychological and Quantitative Foundations
Yarbrough, Donald B
Harris, Deborah J
First Committee Member
Second Committee Member
Welch, Catherine J
Third Committee Member
Cowles, Mary Kathryn
Fourth Committee Member
Donoghue, John R
Trend scoring is often used in large-scale assessments to monitor for rater drift when the same constructed response items are administered in multiple test administrations. In trend scoring, a set of responses from Time A is rescored by raters at Time B. The purpose of this study is to examine the ability of trend-monitoring statistics to detect rater effects in the context of trend scoring. The study examines the percent of exact agreement and Cohen’s kappa as interrater agreement measures, and the paired t-test and Stuart’s Q as marginal homogeneity measures. Data containing specific rater effects are simulated under two frameworks: the generalized partial credit model and the latent-class signal detection theory model.
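As an illustration of the four statistics named above, the following is a minimal sketch, not the thesis's own code. It assumes integer scores coded 0 to k-1, with `time_a` and `time_b` holding the Time A and Time B scores for the same rescored papers. All function names are hypothetical, and Stuart's Q is computed in its usual Stuart-Maxwell form from the k x k table of paired scores.

```python
import math
from statistics import mean, stdev

def contingency(time_a, time_b, k):
    """k x k table of counts: rows = Time A score, columns = Time B score."""
    table = [[0] * k for _ in range(k)]
    for x, y in zip(time_a, time_b):
        table[x][y] += 1
    return table

def exact_agreement(table):
    """Proportion of papers given the same score at both times."""
    n = sum(map(sum, table))
    return sum(table[i][i] for i in range(len(table))) / n

def cohens_kappa(table):
    """Agreement corrected for the agreement expected from the marginals."""
    k = len(table)
    n = sum(map(sum, table))
    p_obs = sum(table[i][i] for i in range(k)) / n
    row = [sum(table[i]) / n for i in range(k)]
    col = [sum(table[i][j] for i in range(k)) / n for j in range(k)]
    p_exp = sum(r * c for r, c in zip(row, col))
    return (p_obs - p_exp) / (1 - p_exp)

def paired_t(time_a, time_b):
    """Paired t statistic for the mean score difference (Time B minus Time A)."""
    d = [y - x for x, y in zip(time_a, time_b)]
    return mean(d) / (stdev(d) / math.sqrt(len(d)))

def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for c in range(n):
        p = max(range(c, n), key=lambda r: abs(M[r][c]))
        M[c], M[p] = M[p], M[c]
        for r in range(c + 1, n):
            f = M[r][c] / M[c][c]
            for j in range(c, n + 1):
                M[r][j] -= f * M[c][j]
    x = [0.0] * n
    for i in reversed(range(n)):
        x[i] = (M[i][n] - sum(M[i][j] * x[j] for j in range(i + 1, n))) / M[i][i]
    return x

def stuart_q(table):
    """Stuart-Maxwell Q = d' S^-1 d over the first k-1 score categories."""
    k = len(table)
    row = [sum(table[i]) for i in range(k)]
    col = [sum(table[i][j] for i in range(k)) for j in range(k)]
    d = [row[i] - col[i] for i in range(k - 1)]
    S = [[row[i] + col[i] - 2 * table[i][i] if i == j
          else -(table[i][j] + table[j][i])
          for j in range(k - 1)] for i in range(k - 1)]
    return sum(di * zi for di, zi in zip(d, solve(S, d)))
```

Under marginal homogeneity, Q is referred to a chi-square distribution with k-1 degrees of freedom. The sketch returns test statistics only; p-values would require a distribution function from a statistics library.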
The findings indicate that the percent of exact agreement, the paired t-test, and Stuart’s Q showed high Type I error rates under a rescore design in which half of the rescore papers have a uniform score distribution and the other half have a score distribution proportional to the population papers at Time A. These Type I error rates were reduced under a rescore design in which all rescore papers have a score distribution proportional to the population papers at Time A. For the second rescore design, results indicate that the ability of the percent of exact agreement, Cohen’s kappa, and the paired t-test to detect various effects varied across items, sample sizes, and types of rater effect. The only statistic that always detected every level of rater effect across items and frameworks was Stuart’s Q.
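The two rescore designs can be made concrete by computing how many papers at each score point a rescore set would contain. This is a rough sketch under stated assumptions, not the procedure used in the thesis. The function names are hypothetical, the rounding uses a largest-remainder rule chosen here for illustration, and it assumes enough Time A papers exist at every score point to fill the requested counts.

```python
def proportional_counts(pop_counts, n):
    """Split n papers across score points in proportion to pop_counts.

    Uses largest-remainder rounding so the counts sum exactly to n.
    """
    total = sum(pop_counts)
    raw = [n * c / total for c in pop_counts]
    counts = [int(r) for r in raw]
    leftover = n - sum(counts)
    # Hand the remaining papers to the score points with the largest fractions.
    order = sorted(range(len(raw)), key=lambda i: raw[i] - counts[i], reverse=True)
    for i in order[:leftover]:
        counts[i] += 1
    return counts

def uniform_counts(k, n):
    """Split n papers as evenly as possible across k score points."""
    base, extra = divmod(n, k)
    return [base + (1 if i < extra else 0) for i in range(k)]

def design_mixed(pop_counts, n):
    """First design: half uniform, half proportional to the Time A population."""
    half = n // 2
    uniform_part = uniform_counts(len(pop_counts), half)
    proportional_part = proportional_counts(pop_counts, n - half)
    return [u + p for u, p in zip(uniform_part, proportional_part)]

def design_proportional(pop_counts, n):
    """Second design: all rescore papers proportional to the Time A population."""
    return proportional_counts(pop_counts, n)
```

With an illustrative (made-up) Time A distribution of `[500, 300, 150, 50]` papers across four score points and 200 rescore papers, the mixed design over-represents the rare score points relative to the population, while the proportional design mirrors it.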
Although advances have been made in the automated scoring field, many testing programs still require humans to score constructed response items. Previous research indicates that rater effects are common in constructed response scoring. In testing programs that track trends in data across time, changes in scoring across time confound the measurement of change in student performance. Therefore, the study of methods that promote rating consistency across time, such as trend scoring, is needed to ensure fairness and validity.
Rater drift, Rater effects, Trend scoring, Type I error and power analysis
xviii, 178 pages
Includes bibliographical references (pages 154-160).
Copyright © 2019 Widad Abdalla
Abdalla, Widad. "Detecting rater effects in trend scoring." PhD (Doctor of Philosophy) thesis, University of Iowa, 2019.