Degree Name

PhD (Doctor of Philosophy)

First Advisor

Walter Vispoel

Second Advisor

Won-Chan Lee

First Committee Member

Michael Kolen

Second Committee Member

Timothy Ansley

Third Committee Member

Richard Dykstra


The purpose of this dissertation is to investigate how different IRT calibration methods may affect the recovery of student achievement growth patterns. In this study, 96 vertical scales (4 × 2 × 2 × 2 × 3) are constructed using different combinations of IRT calibration methods (separate, pair-wise concurrent, semi-concurrent, and concurrent), lengths of common-item set (10 vs. 20 common items), types of common-item set (dichotomous only vs. dichotomous and polytomous), and numbers of polytomous items (6 vs. 12) for 3 simulated datasets that differ in sample size (500, 1,000, and 5,000 per grade). Three criteria (RMSE, SE, and bias) are used to evaluate the performance of these calibration methods on proficiency score distribution recovery over 40 replications. The results suggest that, for the data used in this study, when the parameters of interest are related to measuring students' growth (i.e., proficiency score mean and effect size), pair-wise concurrent calibration overall produced the most accurate results. When the parameters of interest are related to performance variability (i.e., standard deviation), concurrent calibration in general produced the most stable and accurate estimates. When the emphasis is on classifying students' performance accurately, pair-wise concurrent and semi-concurrent calibration, taken collectively, outperformed concurrent and separate calibration as sample size increased. Overall, pair-wise concurrent calibration was more effective than the other methods in constructing a vertical scale, and use of either separate or concurrent calibration to create a vertical scale seems least warranted.
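The simulation design and evaluation criteria summarized above can be sketched in a few lines of code. This is a minimal illustrative sketch, not the dissertation's actual code: the variable names, condition labels, and the simple list-based criterion functions are assumptions made for clarity. It shows how the 4 × 2 × 2 × 2 × 3 crossing yields 96 conditions, and how bias, SE, and RMSE relate across replications (RMSE² = bias² + SE²).

```python
import itertools
import math

# Fully crossed simulation design (labels are illustrative assumptions).
methods = ["separate", "pairwise-concurrent", "semi-concurrent", "concurrent"]
common_lengths = [10, 20]                               # common-item set length
common_types = ["dichotomous-only", "mixed-format"]     # common-item set type
n_polytomous = [6, 12]                                  # polytomous items
sample_sizes = [500, 1000, 5000]                        # examinees per grade

conditions = list(itertools.product(
    methods, common_lengths, common_types, n_polytomous, sample_sizes))
print(len(conditions))  # 96 vertical scales in total

def bias(estimates, true_value):
    """Mean signed deviation of replicated estimates from the true value."""
    return sum(e - true_value for e in estimates) / len(estimates)

def se(estimates):
    """Standard error: spread of the estimates around their own mean."""
    m = sum(estimates) / len(estimates)
    return math.sqrt(sum((e - m) ** 2 for e in estimates) / len(estimates))

def rmse(estimates, true_value):
    """Root mean squared error; satisfies RMSE^2 = bias^2 + SE^2."""
    return math.sqrt(
        sum((e - true_value) ** 2 for e in estimates) / len(estimates))
```

In the study, such criteria would be computed over the 40 replications within each of the 96 conditions, for parameters such as the proficiency score mean, standard deviation, and effect size.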

In addition, it is observed that (1) larger sample sizes stabilized estimation results and reduced error; (2) compared to tests containing 10 common items, errors and biases were in general smaller for tests with 20 common items; (3) compared to tests containing a mixed-format common-item set, errors and biases were usually smaller for tests containing a dichotomous-only common-item set; (4) for tests containing a mixed-format common-item set, errors and biases were in general smaller for tests containing more polytomous items; and (5) for tests containing a dichotomous-only common-item set, increasing the number of polytomous items neither consistently reduced nor consistently increased errors and biases.


xix, 344 pages


Includes bibliographical references (pages 204-209).


Copyright 2007 Huijuan Meng

Included in

Education Commons