Document Type

Dissertation

Date of Degree

Spring 2016

Degree Name

PhD (Doctor of Philosophy)

Degree In

Psychological and Quantitative Foundations

First Advisor

Won-Chan Lee

Abstract

The purpose of this study was to investigate whether different ways of treating missing responses affect IRT item parameter estimates and vertical scales. An empirical study was conducted with the verbal and quantitative tests of a large-scale ability test for grades 4 through 12.

Five commonly used methods for scoring missing responses were investigated: listwise deletion (LW), scoring as incorrect (IN), scoring as not-presented (NP), treating omitted items as incorrect and not-reached items as not-presented (INNR), and assigning a partial score (BN). In addition, three multiple imputation methods that have shown promising results outside of IRT were investigated: multiple imputation using stochastic regression with the data augmentation algorithm (MISR), multiple imputation by chained equations (MICE), and multiple imputation using two-way imputation with error (MITW). The effects of the missing data treatments were investigated with both concurrent and separate calibrations and with three proficiency estimators: EAP, MLE, and QD. The vertical scale was evaluated on three properties: grade-to-grade growth, within-grade variability, and effect size. The impact of the missing data treatments on the item parameter estimates was also examined by comparing summary statistics for item discrimination, item difficulty, and pseudo-guessing. Lastly, the practical impact was investigated by comparing raw-to-scale score conversion tables.
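To make the scoring treatments concrete, here is a minimal sketch, not taken from the dissertation, of how IN, NP, INNR, and BN would rescore a single examinee's response vector before a 3PL calibration. The function names, the use of NaN to encode "not presented," and the choice of the chance score 1/k as the BN partial score are illustrative assumptions.

import numpy as np

def p_3pl(theta, a, b, c):
    """3PL probability of a correct response, with a = discrimination,
    b = difficulty, and c = pseudo-guessing (the parameters compared in
    the study): P(theta) = c + (1 - c) / (1 + exp(-1.7 a (theta - b)))."""
    return c + (1.0 - c) / (1.0 + np.exp(-1.7 * a * (theta - b)))

def apply_treatment(responses, treatment, n_options=5):
    """Rescore missing responses (np.nan) under one treatment.

    responses : 1-D float array of 0/1 scores; np.nan marks a missing
                response
    treatment : 'IN', 'NP', 'INNR', or 'BN'
    n_options : answer choices per item, used for the BN partial score
    """
    r = responses.copy()
    missing = np.isnan(r)
    if treatment == 'IN':
        # All missing responses scored incorrect.
        r[missing] = 0.0
    elif treatment == 'NP':
        # All missing responses treated as not presented; NaN already
        # encodes "not administered" to the calibration step.
        pass
    elif treatment == 'INNR':
        # Interior omits scored incorrect; trailing not-reached items
        # left as not presented.
        last_answered = np.max(np.nonzero(~missing)[0], initial=-1)
        omitted = missing & (np.arange(r.size) <= last_answered)
        r[omitted] = 0.0
    elif treatment == 'BN':
        # Partial score for missing responses; the chance score 1/k is
        # assumed here.
        r[missing] = 1.0 / n_options
    return r

# Example: an examinee who answered items 1, 2, and 4, skipped item 3,
# and never reached items 5 and 6.
resp = np.array([1.0, 0.0, np.nan, 1.0, np.nan, np.nan])
for t in ('IN', 'NP', 'INNR', 'BN'):
    print(t, apply_treatment(resp, t))
print('P(correct) at theta=0:', p_3pl(0.0, a=1.2, b=0.5, c=0.2))

With NP and INNR, the NaN entries would be passed to the calibration program as not-administered items and drop out of that examinee's likelihood, whereas IN and BN change the observed scores themselves; this difference is one route by which the choice of treatment can propagate into the a, b, and c estimates.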

The results showed that different ways of handling missing responses affect the resulting item parameter estimates and vertical scales. In general, IN produced higher item discrimination and item difficulty parameter estimates, but lower pseudo-guessing parameter estimates, than the other missing data treatments. IN also produced higher mean theta estimates and larger growth, while MITW yielded smaller theta estimates and growth. MICE and MISR tended to perform similarly to INNR and NP. The choice of missing data treatment had a greater impact on the results with separate calibration than with concurrent calibration, and with MLE than with EAP or QD. In addition, the missing data treatments had a larger effect on low and high item difficulty estimates than on middle-range difficulty estimates, and produced differences in developmental scale scores, particularly at both ends of the score scale.
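The growth, variability, and effect-size results above are connected by a single formula. As a point of reference, a common definition of the grade-to-grade effect size in the vertical scaling literature, which may or may not be the exact form used in the dissertation, is

\[
\mathrm{ES}_{g \rightarrow g+1} = \frac{\mu_{g+1} - \mu_{g}}{\sqrt{\left(\sigma_{g}^{2} + \sigma_{g+1}^{2}\right)/2}}
\]

where \mu_g and \sigma_g^2 are the mean and variance of the developmental scale scores in grade g. The numerator is the grade-to-grade growth and the denominator pools the two within-grade standard deviations, so a treatment that inflates growth or shrinks within-grade variability will also inflate the effect size.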

Public Abstract

The purpose of this study was to examine whether different ways of handling missing data affect item parameter estimates, the construction of a vertical scale, and the consequences for interpreting students' growth. An empirical study was conducted with the verbal and quantitative tests of a large-scale ability test for grades 4 through 12.

Five commonly used methods for handling missing responses were investigated: deleting cases with missing responses (LW), scoring them as incorrect (IN), scoring them as if they were not presented (NP), treating missing responses in the middle of the test as incorrect and those at the end of the test (not reached) as not-presented (INNR), and assigning missing responses a partial score (BN). In addition, three multiple imputation methods, a more recent approach to handling missing responses, were investigated. The effects of the missing data treatments were investigated both when the calibration was conducted separately for each grade and when it was conducted concurrently across grades. In addition, the impact of missing data was examined when each of three proficiency estimators (EAP, MLE, and QD) was used to estimate students' ability. The vertical scale was evaluated based on student growth, the variability of students' ability within each grade, and standardized growth. The item parameter estimates were evaluated based on summary statistics. Lastly, the practical impact was investigated based on how students' raw scores convert to scale scores under the different treatments.

The results showed that different ways of handling missing responses affect the resulting item parameter estimates and vertical scales. In general, IN produced items with higher discrimination and difficulty but lower pseudo-guessing compared to the other missing data treatments. IN also produced higher mean ability estimates and larger growth. Two of the multiple imputation methods tended to perform similarly to INNR and NP. Missing data had a greater impact when the calibration was done separately for each grade than when it was done concurrently for all grades. Moreover, missing data had a larger effect when students' ability was estimated with MLE than with EAP or QD. The impact of missing data was larger on low and high item difficulty estimates than on middle-range difficulty estimates, and yielded differences in students' scale scores, especially for students at the extreme ends of the score scale.

Pages

xx, 280 pages

Bibliography

Includes bibliographical references (pages 139-148).

Copyright

Copyright 2016 AhYoung Shin
