DOI

10.17077/etd.m4wm0mim

Document Type

Dissertation

Date of Degree

Spring 2018

Access Restrictions

Access restricted until 07/03/2020

Degree Name

PhD (Doctor of Philosophy)

Degree In

Teaching and Learning

First Advisor

Lia M. Plakans

First Committee Member

Pamela M. Wesely

Second Committee Member

David C. Johnson

Third Committee Member

Won-Chan Lee

Fourth Committee Member

Atta Gebril

Abstract

In measuring second language learners’ writing proficiency, test takers’ performance on a particular assessment task is evaluated by raters using a set of criteria to generate writing scores. Teachers, students, and parents then use these scores to make inferences about test takers’ performance levels in real-life writing situations. To examine the accuracy of this inference, it is imperative that we investigate the sources of measurement error involved in the writing score. It is also important to ensure rater consistency, both within a single rater and between raters, to provide evidence that the scores are valid indicators of the tested constructs.

This mixed methods research addressed the validity of integrated listening-to-write (L-W) scores. More specifically, it examined the generalizability of L-W scores and raters’ decision-making processes and scoring challenges. A total of 198 high school English learners in Taiwan completed up to two L-W tasks, each of which required them to listen to an academic lecture and respond to a related writing prompt in English. Nine raters who had experience teaching English evaluated each student’s written materials using a holistic scale.

This study employed a univariate two-facet random effects generalizability study (p × t × r) to investigate the effects of tasks and raters on score variance. Subsequent decision studies (p × T × R) estimated the standard error of measurement and generalizability coefficients. Post-rating stimulated recall interview data were analyzed qualitatively to explore raters’ alignment with rating scale descriptors, their decision-making behaviors, and their scoring challenges.
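In a p × T × R decision study of this kind, the relative error variance combines the person-by-task, person-by-rater, and residual components (each divided by the number of conditions averaged over), and the generalizability coefficient is the ratio of universe-score variance to itself plus that error. The sketch below illustrates the computation; the variance components are hypothetical illustrative values, not the estimates reported in this dissertation.

```python
# Illustrative D-study for a univariate p x t x r random-effects design.
# All variance components below are HYPOTHETICAL, chosen only to show the
# mechanics (note var_pt is set larger than var_pr, mirroring the finding
# that adding tasks reduces error more than adding raters).
var_p   = 0.50  # person (universe-score) variance
var_pt  = 0.20  # person-by-task interaction
var_pr  = 0.02  # person-by-rater interaction
var_ptr = 0.10  # residual (p x t x r, error)

def d_study(n_tasks, n_raters):
    """Return (relative error variance, G coefficient, SEM) for a design
    averaging over n_tasks tasks and n_raters raters."""
    rel_err = (var_pt / n_tasks
               + var_pr / n_raters
               + var_ptr / (n_tasks * n_raters))
    g_coef = var_p / (var_p + rel_err)
    return rel_err, g_coef, rel_err ** 0.5

for n_t in (1, 2):
    for n_r in (1, 2):
        err, g, sem = d_study(n_t, n_r)
        print(f"tasks={n_t} raters={n_r}  G={g:.2f}  SEM={sem:.2f}")
```

With these assumed components, doubling tasks raises the G coefficient more than doubling raters does, because the person-by-task component dominates the error term.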

The results indicated that the majority of score variance was explained by test takers’ ability differences in academic writing proficiency. The raters were similar in their stringency and did not contribute much to score variance. Due to the relatively large person-by-task interaction effect, increasing the number of tasks, rather than raters, resulted in a much lower degree of error and a higher degree of score generalizability. The ideal assessment procedure to achieve an acceptable level of score generalizability would be to administer two L-W tasks scored by two raters.

When evaluating written materials for the L-W tasks, the nine raters primarily focused on the content of the essays and paid less attention to language-related features. The raters did not equally consider all aspects of essay features described in the holistic rubric. The most prominent scoring challenges included 1) assigning a holistic score while balancing students’ listening comprehension skills and writing proficiency and 2) assessing the degree of students’ successful reproduction of lecture content. The findings of this study have practical and theoretical implications for integrated writing assessments for high school EFL learners.

Keywords

Integrated writing assessment, Rating criteria, Rating process, Score generalizability, Secondary school English learners, Validity

Pages

xiv, 225 pages

Bibliography

Includes bibliographical references (pages 180-191).

Copyright

Copyright © 2018 Renka Ohta

