PhD (Doctor of Philosophy)
Psychological and Quantitative Foundations
Michael J. Kolen
Mixed-format tests containing both multiple-choice and constructed-response items are widely used in educational testing. Such tests combine the broad content coverage and efficient scoring of multiple-choice items with the assessment of higher-order thinking skills thought to be provided by constructed-response items. However, the combination of both item formats on a single test complicates the use of psychometric procedures. The purpose of this dissertation was to examine how characteristics of mixed-format tests and the composition of the common-item set impact the accuracy of equating results in the common-item nonequivalent groups design.
Operational examinee item responses were used to construct two classes of data for this dissertation: (1) operational test forms and (2) pseudo-test forms assembled from portions of operational test forms. Analyses were conducted on three mixed-format tests from the Advanced Placement Examination program: English Language, Spanish Language, and Chemistry.
For the operational test form analyses, two factors were investigated: (1) the difference in proficiency between old and new form groups of examinees and (2) the relative difficulty of multiple-choice and constructed-response items. For the pseudo-test form analyses, two additional factors were investigated: (1) the format representativeness of the common-item set and (2) the statistical representativeness of the common-item set. For each study condition, two traditional equating methods, frequency estimation and chained equipercentile equating, and two item response theory (IRT) equating methods, IRT true score and IRT observed score equating, were considered.
Five main findings emerged from the operational and pseudo-test form analyses. (1) As the difference in proficiency between old and new form groups of examinees increased, bias also tended to increase. (2) Relative to the criterion equating relationship for a given equating method, increases in bias were typically largest for frequency estimation and smallest for the IRT equating methods. However, it is important to note that the criterion equating relationship was different for each equating method, and only one smoothing value was analyzed for the traditional equating methods. (3) Standard errors of equating tended to be smallest for IRT observed score equating and largest for chained equipercentile equating. (4) Results for the operational and pseudo-test analyses were similar when the pseudo-tests were constructed to be similar to the operational test forms. (5) Results were mixed regarding which common-item set composition resulted in the least bias.
Copyright 2010 Sarah Lynn Hagge