Date of Degree
PhD (Doctor of Philosophy)
Psychological and Quantitative Foundations
David A. Frisbie
Michael J. Kolen
The main purpose of this study was to construct vertical scales from various combinations of calibration methods and proficiency estimators, and to investigate how those choices affect three properties of the resulting scales: grade-to-grade growth, grade-to-grade variability, and the separation of grade distributions. The calibration methods investigated were concurrent calibration, separate calibration, and fixed a, b, and c item parameters for common items with simple prior updates (FSPU). The proficiency estimators investigated were the Maximum Likelihood Estimator (MLE) with pattern scores, Expected A Posteriori (EAP) with pattern scores, pseudo-MLE with summed scores, pseudo-EAP with summed scores, and Quadrature Distribution (QD). The study used datasets from the Iowa Tests of Basic Skills (ITBS) Vocabulary, Reading Comprehension (RC), Math Problem Solving and Data Interpretation (MPD), and Science tests for grades 3 through 8.
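As a rough illustration of what the a, b, and c item parameters and an EAP pattern-score estimate refer to, the following minimal sketch evaluates a three-parameter logistic (3PL) item response function and computes an EAP estimate over a fixed grid of quadrature points. The scaling constant D = 1.7, the standard normal prior, the 41-point grid on [-4, 4], and the function names are all assumptions made for this example; the abstract does not specify the settings or software used in the study.

```python
import math

def p3pl(theta, a, b, c, D=1.7):
    """Probability of a correct response under the 3PL model.
    D = 1.7 is an assumed scaling constant, not taken from the study."""
    return c + (1.0 - c) / (1.0 + math.exp(-D * a * (theta - b)))

def eap(responses, items, n_points=41, lo=-4.0, hi=4.0):
    """EAP proficiency estimate from a 0/1 response pattern.
    items is a list of (a, b, c) tuples; the prior is an assumed N(0, 1)
    evaluated on an evenly spaced grid (hypothetical settings)."""
    step = (hi - lo) / (n_points - 1)
    num = 0.0
    den = 0.0
    for i in range(n_points):
        theta = lo + i * step
        prior = math.exp(-0.5 * theta * theta)  # unnormalized N(0, 1) density
        like = 1.0
        for u, (a, b, c) in zip(responses, items):
            p = p3pl(theta, a, b, c)
            like *= p if u == 1 else (1.0 - p)
        weight = prior * like
        num += theta * weight
        den += weight
    return num / den  # posterior mean of theta

# Hypothetical item parameters, for illustration only.
items = [(1.0, -1.0, 0.2), (1.0, 0.0, 0.2), (1.0, 1.0, 0.2)]
high = eap([1, 1, 1], items)
low = eap([0, 0, 0], items)
```

Because EAP is a posterior mean, its estimates are pulled toward the prior mean, which is one common explanation for EAP estimates showing smaller within-grade variability than MLE estimates.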
The following conclusions were drawn for each research question. With respect to the comparison of the three calibration methods, for the RC and Science tests, concurrent calibration, compared with FSPU and separate calibration, showed less growth and more slowly decreasing growth in the lower grades, less decrease in variability over grades, and less separation in the lower grades in terms of horizontal distances. For the Vocabulary and MPD tests, differences among calibration methods in both grade-to-grade growth and the separation of grade distributions were trivial. With respect to the comparison of the five proficiency estimators, for all content areas, the trend pseudo-MLE ≥ MLE > QD > EAP ≥ pseudo-EAP was found in within-grade standard deviations (SDs), and the trend pseudo-EAP ≥ EAP > QD > MLE ≥ pseudo-MLE was found in the effect sizes. However, the degree of decrease in variability over grades was similar across proficiency estimators. With respect to the comparison of the four content areas, the Vocabulary and MPD tests, compared with the RC and Science tests, showed smaller but somewhat steady growth and a smaller decrease in variability over grades. For the separation of grade distributions, the large growth suggested by the larger mean differences for the RC and Science tests was reduced when effect sizes were used to standardize the differences.
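The point about standardizing mean differences can be illustrated with a small sketch. It assumes the effect size is the difference between adjacent-grade means divided by the square root of the average of the two within-grade variances, a form commonly used in the vertical scaling literature; the abstract does not state the exact formula used, and the numbers below are hypothetical rather than results from the study.

```python
import math

def effect_size(mean_lower, sd_lower, mean_upper, sd_upper):
    """Standardized grade-to-grade difference (assumed pooled-SD form)."""
    pooled_sd = math.sqrt((sd_lower ** 2 + sd_upper ** 2) / 2.0)
    return (mean_upper - mean_lower) / pooled_sd

# Hypothetical values: the same mean difference of 1.0 scale unit
# yields a smaller effect size when within-grade SDs are larger.
narrow = effect_size(0.0, 1.0, 1.0, 1.0)  # SDs of 1.0 -> 1.0
wide = effect_size(0.0, 2.0, 1.0, 2.0)    # SDs of 2.0 -> 0.5
```

Under this assumed formula, a test with larger mean differences between grades but also larger within-grade SDs can show effect sizes similar to, or smaller than, a test with smaller mean differences.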
Copyright 2007 Jungnam Kim
Kim, Jungnam. "A comparison of calibration methods and proficiency estimators for creating IRT vertical scales." PhD dissertation, University of Iowa, 2007.