
The Write & Improve Corpus 2024: Error-annotated and CEFR-labelled essays by learners of English


Type

Report

Authors

Nicholls, Diane 
Buttery, Paula 

Abstract

We present a new annotated corpus of written learner English, derived from essays submitted to the learning platform Write & Improve (W&I). Users of W&I receive automated scoring and feedback on grammatical errors, and are encouraged to act on that feedback by submitting multiple versions of their essays for any given prompt. We build the corpus on this interplay between users and prompts, collecting sets of essays submitted by users for a selected list of 50 popular prompts. The prompts include 20 aimed at beginner learners of English, 20 aimed at intermediate learners, and 10 aimed at advanced learners. This distribution reflects the greater use of W&I by beginner and intermediate learners of English. We ensured that the prompts were unlikely to elicit personal information and that they covered a broad range of tasks and topics. This list of prompts enabled us to identify 5050 essay sets written by 766 users, forming the basis for the Write & Improve Corpus, which is being made available for non-commercial use on the ELiT website. We describe the steps we took to ensure the corpus contains appropriate texts, does not include personal information, and will come with annotations relating to Common European Framework of Reference (CEFR) level and grammatical errors. All essays were submitted between 2020 and 2022 by registered users of W&I who supplied their first language (L1) in an optional questionnaire. In total, there are more than 23K essays containing more than 3.5 million word tokens. The final versions of the essay sets amount to 762K word tokens. There are 22 different L1s in the corpus, with the most common being Spanish, Portuguese, Japanese, Arabic and Vietnamese. We present some descriptive statistics for the corpus, and consider some research use cases.

Keywords

corpus linguistics, education technology, language learning, machine learning, natural language processing

Sponsorship

Cambridge Assessment