Document Type


Date of Degree

Spring 2018

Access Restrictions

Access restricted until 07/03/2020

Degree Name

PhD (Doctor of Philosophy)

Degree In

Teaching and Learning

First Advisor

Plakans, Lia M.

First Committee Member

Wesely, Pamela M.

Second Committee Member

Johnson, David C.

Third Committee Member

Lee, Won-Chan

Fourth Committee Member

Gebril, Atta


In measuring second language learners’ writing proficiency, raters evaluate test takers’ performance on a particular assessment task against a set of criteria to generate writing scores. Teachers, students, and parents use the scores to make inferences about test takers’ performance levels in real-life writing situations. To examine the accuracy of these inferences, it is imperative to investigate the sources of measurement error in the writing scores. It is also important to ensure rater consistency, both within a single rater and between raters, to provide evidence that the scores are valid indicators of the tested constructs.

This mixed-methods study addressed the validity of integrated listening-to-write (L-W) scores. More specifically, it examined the generalizability of L-W scores and raters’ decision-making processes and scoring challenges. A total of 198 high school English learners in Taiwan completed up to two L-W tasks, each of which required them to listen to an academic lecture and respond to a related writing prompt in English. Nine raters who had experience teaching English evaluated each student’s written materials using a holistic scale.

This study employed a univariate two-facet random effects generalizability study (p × t × r) to investigate the effects of tasks and raters on score variance. Subsequent decision studies (p × T × R) estimated standard errors of measurement and generalizability coefficients. Post-rating stimulated recall interview data were analyzed qualitatively to explore raters’ alignment of rating scale descriptors, decision-making behaviors, and scoring challenges.
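For readers unfamiliar with the design, the quantities estimated in a fully crossed p × T × R decision study can be written out as follows. This is a sketch in standard generalizability theory notation, not a reproduction of the dissertation's own equations: the symbols n'_t and n'_r (the numbers of tasks and raters assumed in the decision study) and the variance component labels are assumptions introduced here. The abstract does not state whether relative or absolute coefficients were reported, so both are shown.

```latex
% Relative error variance and generalizability coefficient
\sigma^2_{\delta} = \frac{\sigma^2_{pt}}{n'_t} + \frac{\sigma^2_{pr}}{n'_r}
                  + \frac{\sigma^2_{ptr,e}}{n'_t \, n'_r},
\qquad
E\rho^2 = \frac{\sigma^2_{p}}{\sigma^2_{p} + \sigma^2_{\delta}}

% Absolute error variance and index of dependability
\sigma^2_{\Delta} = \sigma^2_{\delta} + \frac{\sigma^2_{t}}{n'_t}
                  + \frac{\sigma^2_{r}}{n'_r} + \frac{\sigma^2_{tr}}{n'_t \, n'_r},
\qquad
\Phi = \frac{\sigma^2_{p}}{\sigma^2_{p} + \sigma^2_{\Delta}}
```

The standard error of measurement is the square root of the corresponding error variance, so it shrinks as either n'_t or n'_r grows.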

The results indicated that the majority of score variance was explained by differences in test takers’ academic writing proficiency. The raters were similar in their stringency and did not contribute much to score variance. Because the person-by-task interaction effect was relatively large, increasing the number of tasks, rather than raters, resulted in much less error and greater score generalizability. The ideal assessment procedure to achieve an acceptable level of score generalizability would be to administer two L-W tasks scored by two raters.
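The trade-off between adding tasks and adding raters can be illustrated numerically. The sketch below is not the study's analysis: the variance components are hypothetical values chosen only to mimic the pattern described (a large person effect, a sizable person-by-task interaction, small rater effects), and the function name `d_study` is an assumption made for illustration.

```python
def d_study(vc, n_t, n_r):
    """D-study summary for a fully crossed p x T x R random effects design.

    vc maps variance component names to values; n_t and n_r are the
    numbers of tasks and raters assumed in the decision study.
    Returns (relative error variance, absolute error variance,
    generalizability coefficient, index of dependability).
    """
    # Relative error: only interactions involving persons contribute.
    rel = vc["pt"] / n_t + vc["pr"] / n_r + vc["ptr_e"] / (n_t * n_r)
    # Absolute error: add the task, rater, and task-by-rater effects.
    abs_err = rel + vc["t"] / n_t + vc["r"] / n_r + vc["tr"] / (n_t * n_r)
    g_coef = vc["p"] / (vc["p"] + rel)
    phi = vc["p"] / (vc["p"] + abs_err)
    return rel, abs_err, g_coef, phi


# HYPOTHETICAL variance components, not taken from the dissertation.
vc = {"p": 0.60, "t": 0.02, "r": 0.01,
      "pt": 0.25, "pr": 0.02, "tr": 0.01, "ptr_e": 0.15}

for n_t, n_r in [(1, 1), (1, 2), (2, 1), (2, 2)]:
    rel, abs_err, g_coef, phi = d_study(vc, n_t, n_r)
    sem = rel ** 0.5  # relative standard error of measurement
    print(f"tasks={n_t} raters={n_r} SEM={sem:.3f} "
          f"G={g_coef:.3f} Phi={phi:.3f}")
```

With components shaped like these, moving from one task to two raises the generalizability coefficient more than moving from one rater to two, because the person-by-task term is divided by the number of tasks only.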

When evaluating written materials for L-W tasks, the nine raters primarily focused on the content of the essays and paid less attention to language-related features. The raters did not give equal consideration to all of the essay features described in the holistic rubric. The most prominent scoring challenges included 1) assigning a holistic score while balancing students’ listening comprehension skills and writing proficiency and 2) assessing the degree of students’ successful reproduction of lecture content. The findings of this study have practical and theoretical implications for integrated writing assessments for high school EFL learners.


Integrated writing assessment, Rating criteria, Rating process, Score generalizability, Secondary school English learners, Validity


xiv, 225 pages


Includes bibliographical references (pages 180-191).


Copyright © 2018 Renka Ohta
