Document Type


Date of Degree

Spring 2011

Degree Name

PhD (Doctor of Philosophy)

Degree In

Psychological and Quantitative Foundations

First Advisor

Kolen, Michael J.

Second Advisor

Harris, Deborah J.

First Committee Member

Welch, Catherine

Second Committee Member

Lee, Won-Chan

Third Committee Member

Hollingworth, Liz

Abstract

This dissertation broadens the term augmented testing to include the combination of multiple measures used to assess examinee performance on a single construct, and investigates the resulting IRT item parameter and proficiency estimates. Its intent is to determine whether different IRT calibration designs produce differences in item parameter and proficiency estimates and to understand the nature of those differences.

Examinees were sampled from a testing program in which each examinee was administered three mathematics assessments measuring a broad mathematics domain at the high school level. This sample of examinees was used to perform a real data analysis to investigate the item and proficiency estimates. A simulation study was also conducted based upon the real data.

The factors investigated for the real data study included three IRT calibration designs and two IRT models. The calibration designs were: separately calibrating each assessment, calibrating all assessments in one joint calibration, and separately calibrating items in three distinct content areas. Joint calibration refers to the use of IRT methodology to calibrate together two or more tests that have been administered to a single group, so as to place all of the items on a common scale. The two IRT models were the one- and three-parameter logistic models. Also investigated were five proficiency estimators: maximum likelihood, expected a posteriori (EAP), maximum a posteriori, summed-score EAP, and test characteristic curve estimates. The simulation study included the same calibration designs and IRT models, but the data were simulated with varying levels of correlation among the proficiencies to determine the effect upon the item parameter estimates.
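For readers unfamiliar with the models and estimators named above, the following is a minimal sketch of the three-parameter logistic item response function and an EAP proficiency estimate. It is not the dissertation's calibration software. The function names, the scaling constant D = 1.7, the standard normal prior, and the evenly spaced quadrature grid are all assumptions made for illustration.

```python
import math


def p_3pl(theta, a, b, c, D=1.7):
    """Probability of a correct response under the 3PL model.

    a = discrimination, b = difficulty, c = lower asymptote.
    D = 1.7 is a conventional scaling constant (an assumption here).
    """
    return c + (1.0 - c) / (1.0 + math.exp(-D * a * (theta - b)))


def p_1pl(theta, b, a=1.0, D=1.7):
    """1PL as a special case: common discrimination, no lower asymptote."""
    return p_3pl(theta, a, b, 0.0, D)


def eap(responses, items, n_points=41, lo=-4.0, hi=4.0):
    """EAP proficiency estimate for one examinee.

    responses: list of 0/1 item scores.
    items: list of (a, b, c) tuples, one per item.
    Assumes a standard normal prior evaluated on an evenly spaced grid.
    """
    step = (hi - lo) / (n_points - 1)
    numerator = 0.0
    denominator = 0.0
    for k in range(n_points):
        theta = lo + k * step
        prior = math.exp(-0.5 * theta * theta)  # unnormalized N(0, 1) density
        likelihood = 1.0
        for u, (a, b, c) in zip(responses, items):
            p = p_3pl(theta, a, b, c)
            likelihood *= p if u == 1 else (1.0 - p)
        numerator += theta * likelihood * prior
        denominator += likelihood * prior
    return numerator / denominator
```

Because the estimate depends on the (a, b, c) values supplied, the same response pattern scored with item parameters from two different calibration designs will generally yield two different proficiency estimates.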

The main findings indicate that item parameter and proficiency estimates are affected by the IRT calibration design. The discrimination parameter estimates of the three-parameter model were larger under the joint calibration design for one assessment but not for the other two. Because equal item discrimination is an assumption of the 1-PL model, this finding raises questions about the degree of model fit when the 1-PL model is used. Items on a second assessment had lower difficulty parameter estimates under the joint calibration design, while the item parameter estimates of the other two assessments were higher. Differences in proficiency estimates between calibration designs were also found, and these resulted in examinees being inconsistently classified into performance categories. Differences were also observed with regard to the choice of IRT model. Finally, as the level of correlation among proficiencies increased in the simulated data, the differences observed in the item parameter estimates decreased.
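One simple way to quantify inconsistent classification is the proportion of examinees placed in the same performance category under two calibration designs. The sketch below illustrates that idea only. The cut scores and function names are hypothetical and are not taken from the dissertation.

```python
from bisect import bisect_right


def classify(theta, cuts):
    """Return the performance category index for a proficiency estimate.

    cuts must be sorted in ascending order; an estimate equal to a cut
    score is placed in the higher category.
    """
    return bisect_right(cuts, theta)


def classification_agreement(thetas_a, thetas_b, cuts):
    """Proportion of examinees assigned the same category by both designs."""
    if len(thetas_a) != len(thetas_b):
        raise ValueError("both designs must score the same examinees")
    same = sum(
        classify(x, cuts) == classify(y, cuts)
        for x, y in zip(thetas_a, thetas_b)
    )
    return same / len(thetas_a)
```

For example, with hypothetical cut scores of -0.5 and 0.5, an examinee estimated at 0.0 under one design and 0.6 under another would fall into different categories and count against agreement.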

Based upon the findings, IRT item parameter estimates resulting from differing calibration designs should not be used interchangeably. Practitioners who use item pools should base the pool refreshment calibration design upon the one originally used to create the pool. Limitations of this study include the use of a single dataset consisting of high school examinees in only one subject area; the findings should therefore be generalized to other content areas or grade levels with caution.

Keywords

Item Parameter Estimate Calibration, Item Response Theory, Proficiency Estimates


xvii, 256 pages


Includes bibliographical references (pages 242-246).


Copyright 2011 Nathan Lane Wall