Document Type

Dissertation

Date of Degree

Fall 2014

Degree Name

PhD (Doctor of Philosophy)

Degree In

Psychological and Quantitative Foundations

First Advisor

Michael J. Kolen

Second Advisor

Deborah J. Harris

Abstract

The use of testlets in a test can cause multidimensionality and local item dependence (LID), which can result in inaccurate estimation of item parameters and, in turn, compromise the quality of item response theory (IRT) true and observed score equating of testlet-based tests. Both unidimensional and multidimensional IRT models have been developed to control local item dependence caused by testlets. The purposes of the current study were to (1) investigate how different levels of LID affect IRT true and observed score equating of testlet-based tests when the traditional three-parameter logistic (3PL) IRT model is used for calibration, and (2) compare the performance of four IRT models, namely the 3PL IRT model, the graded response model (GRM), the 3PL testlet response theory (TRT) model, and the bifactor model, in IRT true and observed score equating of testlet-based tests with various levels of local item dependence.
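The 3PL model at the center of the study gives the probability of a correct response as a function of examinee ability. A minimal sketch of the item response function (the item parameter values in the example are hypothetical, not from the dissertation):

```python
import math

def p_3pl(theta, a, b, c, D=1.7):
    """3PL item response function:
    P(theta) = c + (1 - c) / (1 + exp(-D * a * (theta - b))),
    with discrimination a, difficulty b, pseudo-guessing c,
    and the conventional scaling constant D = 1.7."""
    return c + (1.0 - c) / (1.0 + math.exp(-D * a * (theta - b)))

# Hypothetical item: a = 1.2, b = 0.0, c = 0.2.
p = p_3pl(theta=0.0, a=1.2, b=0.0, c=0.2)
# At theta == b the probability is midway between c and 1, i.e. (c + 1) / 2.
```

The function is monotonically increasing in theta, with a lower asymptote at c; unidimensional IRT true score equating relates number-correct scores on two forms through sums of such functions.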

Both real and simulated data analyses were conducted in this study. Two testlet-based tests (i.e., Test A and Test B) that differed in subject, test length, and testlet length were used in the real data analysis. For the simulated data analysis, two main factors were investigated: (1) testlet length (5 or 10), and (2) LID level within testlets, defined by the testlet effect variance (0, 0.25, 0.5625, 0.75, 1, and 1.5). For the unidimensional IRT models (i.e., the 3PL IRT model and GRM), the unidimensional IRT true score and observed score equating procedures described in Kolen and Brennan (2004) were used. For the two investigated multidimensional IRT models (i.e., the 3PL TRT model and bifactor model), the unidimensional approximation of multidimensional item response theory (MIRT) true score equating procedure and the unidimensional approximation of MIRT observed score equating procedure (Brossman & Lee, 2013) were applied. The traditional equipercentile equating method was used as the baseline for comparison in both the real data and simulated data analyses.
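In the simulation design above, the LID level is manipulated through the testlet effect variance: under a TRT-style model, all items in a testlet share a person-specific testlet effect drawn from a normal distribution with that variance. A minimal sketch of how such response data might be generated (item parameter choices here are illustrative assumptions, not the dissertation's actual generating values):

```python
import math
import random

def simulate_testlet_responses(n_examinees, n_testlets, testlet_len,
                               testlet_var, seed=1):
    """Generate 0/1 responses in which items within the same testlet share a
    person-specific testlet effect gamma ~ N(0, testlet_var); larger
    testlet_var means stronger local item dependence."""
    rng = random.Random(seed)
    D = 1.7
    # Hypothetical item parameters: a = 1.0, b spread over [-1.5, 1.5], c = 0.2.
    items = [(1.0, rng.uniform(-1.5, 1.5), 0.2)
             for _ in range(n_testlets * testlet_len)]
    data = []
    for _ in range(n_examinees):
        theta = rng.gauss(0.0, 1.0)                       # general ability
        gammas = [rng.gauss(0.0, math.sqrt(testlet_var))  # one effect per testlet
                  for _ in range(n_testlets)]
        row = []
        for j, (a, b, c) in enumerate(items):
            gamma = gammas[j // testlet_len]  # items share their testlet's effect
            p = c + (1 - c) / (1 + math.exp(-D * a * (theta + gamma - b)))
            row.append(1 if rng.random() < p else 0)
        data.append(row)
    return data

# e.g., testlet length 10 with testlet effect variance 0.5625, one of the
# studied conditions:
resp = simulate_testlet_responses(500, 4, 10, 0.5625)
```

Setting `testlet_var = 0` reduces the generator to the ordinary unidimensional 3PL model, which is the zero-LID condition in the design.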

It was found in the study that both testlet length and LID level affected the performance of the investigated models in IRT true and observed score equating of testlet-based tests. When the traditional 3PL IRT model was used for tests with long testlets, higher levels of local item dependence led to IRT equating results that deviated further from those obtained with the baseline method. However, the effect of local item dependence on IRT equating results was not prominent for tests with short testlets.

Moreover, for tests consisting of long testlets (e.g., a testlet length of 10 or more) with a very low level of local item dependence (e.g., an LID level of 0.25 or lower), and for tests consisting of short testlets (e.g., a testlet length around 5), all four investigated IRT models worked well in IRT true and observed score equating. For tests with long testlets and a relatively high level of local item dependence (e.g., an LID level of 0.5625 or higher), the GRM, bifactor, and TRT models outperformed the traditional 3PL IRT model in IRT true and observed score equating of testlet-based tests.

The study suggested that models for IRT true and observed score equating of testlet-based tests should be selected with respect to the features of the tests and the groups of examinees from which the data are collected. It is hoped that this study encourages researchers to identify differences among existing models for IRT true and observed score equating of testlet-based tests with various features, and to develop new models appropriate for testlet-based tests so as to obtain accurate IRT number-correct score equating results.

Public Abstract

Unidimensional item response theory (IRT) equating methods (Kolen & Brennan, 2004) are often used in testing programs to adjust for differences in difficulty across multiple forms of a test. When test items are organized into testlets that share a common stimulus, multidimensionality and local item dependence (LID) might be present, resulting in a secondary dimension related to the stimulus. In this case, the testlet-based test might measure constructs in addition to examinees' intended ability.

This study compares the performance of four different models in IRT true and observed score equating of testlet-based tests that incorporate different testlet lengths and LID levels. These models are the three-parameter logistic (3PL) IRT model, the graded response model (GRM), the 3PL testlet response theory (TRT) model, and the bifactor model.

The study found that both testlet length and LID level affected the performance of the investigated IRT equating methods for testlet-based tests. For tests with long testlets, higher LID levels led to 3PL IRT equating results that deviated further from those obtained with the baseline method. However, this trend was not as evident for tests containing short testlets. Moreover, for tests with long testlets and a low LID level, and for tests with short testlets, all four investigated IRT models worked well in IRT true and observed score equating. For tests with long testlets and a relatively high LID level, the GRM, bifactor, and TRT models outperformed the traditional 3PL IRT model in IRT true and observed score equating of testlet-based tests.

Pages

xiv, 150 pages

Bibliography

Includes bibliographical references (pages 103-108).

Copyright

Copyright 2014 Juan Chen
