Document Type


Date of Degree

Spring 2019

Degree Name

PhD (Doctor of Philosophy)

Degree In

Psychological and Quantitative Foundations

First Advisor

Yarbrough, Donald B

Second Advisor

Harris, Deborah J

First Committee Member

Lee, Won-Chan

Second Committee Member

Welch, Catherine J

Third Committee Member

Cowles, Mary Kathryn

Fourth Committee Member

Donoghue, John R


Trend scoring is often used in large-scale assessments to monitor for rater drift when the same constructed response items are administered in multiple test administrations. In trend scoring, a set of responses from Time A is rescored by raters at Time B. The purpose of this study is to examine the ability of trend-monitoring statistics to detect rater effects in the context of trend scoring. The present study examines the percent of exact agreement and Cohen's kappa as interrater agreement measures, and the paired t-test and Stuart's Q as marginal homogeneity measures. Data containing specific rater effects are simulated under two frameworks: the generalized partial credit model and the latent-class signal detection theory model.
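As an illustrative sketch (not from the dissertation itself), the four trend-monitoring statistics named above can be computed from paired Time A / Time B scores on a single item. The helper name `trend_stats` and the example data are hypothetical; the sketch assumes NumPy and SciPy, and implements Stuart's Q as the Stuart-Maxwell chi-square test of marginal homogeneity.

```python
import numpy as np
from scipy import stats

def trend_stats(time_a, time_b, n_cats):
    """Hypothetical helper: compute the four trend-monitoring statistics
    for paired Time A / Time B scores (integer categories 0..n_cats-1)."""
    a, b = np.asarray(time_a), np.asarray(time_b)
    n = len(a)
    # Square contingency table of (Time A score, Time B score) pairs
    table = np.zeros((n_cats, n_cats))
    np.add.at(table, (a, b), 1)
    # Percent of exact agreement: proportion of identical score pairs
    exact = np.trace(table) / n
    # Cohen's kappa: agreement corrected for chance agreement
    pe = (table.sum(axis=1) @ table.sum(axis=0)) / n**2
    kappa = (exact - pe) / (1 - pe)
    # Paired t-test: tests for a mean shift between the two occasions
    t_res = stats.ttest_rel(a, b)
    # Stuart's Q (Stuart-Maxwell): tests marginal homogeneity of the table
    d = (table.sum(axis=1) - table.sum(axis=0))[:-1]  # drop last category
    S = -(table + table.T)[:-1, :-1]
    np.fill_diagonal(S, (table.sum(axis=1) + table.sum(axis=0)
                         - 2 * np.diag(table))[:-1])
    Q = float(d @ np.linalg.solve(S, d))
    q_p = stats.chi2.sf(Q, df=n_cats - 1)
    return {"exact": exact, "kappa": kappa,
            "t": t_res.statistic, "t_p": t_res.pvalue,
            "Q": Q, "Q_p": q_p}

# Example: 12 rescored papers, 4 score categories (made-up data)
r = trend_stats([0, 0, 1, 1, 2, 2, 3, 3, 0, 1, 2, 3],
                [0, 1, 1, 2, 2, 3, 3, 3, 0, 1, 2, 3], n_cats=4)
```

Note the division of labor the abstract describes: the agreement measures (exact agreement, kappa) are sensitive to paper-by-paper disagreement, while the marginal homogeneity measures (paired t-test, Stuart's Q) ask whether the overall score distribution shifted between administrations.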

The findings indicate that the percent of exact agreement, the paired t-test, and Stuart's Q showed inflated Type I error rates under a rescore design in which half of the rescore papers have a uniform score distribution and the other half have a score distribution proportional to the population papers at Time A. These Type I error rates were reduced under a rescore design in which all rescore papers have a score distribution proportional to the population papers at Time A. For the second rescore design, results indicate that the ability of the percent of exact agreement, Cohen's kappa, and the paired t-test to detect rater effects varied across items, sample sizes, and types of rater effect. The only statistic that detected every level of rater effect across items and frameworks was Stuart's Q.

Although advances have been made in automated scoring, many testing programs still rely on human raters to score constructed response items. Previous research indicates that rater effects are common in constructed response scoring. In testing programs that track trends in data across time, changes in scoring across time confound the measurement of change in student performance. Therefore, the study of methods that ensure rating consistency across time, such as trend scoring, is important for ensuring fairness and validity.


Rater drift, Rater effects, Trend scoring, Type I error and power analysis


xviii, 178 pages


Includes bibliographical references (pages 154-160).


Copyright © 2019 Widad Abdalla