The Write & Improve Corpus 2024: Error-annotated and CEFR-labelled essays by learners of English
Abstract
We present a new annotated corpus of written learner English, derived from essays submitted to the learning platform Write & Improve (W&I). Users of W&I receive automated scoring and feedback on grammatical errors, and are encouraged to act on that feedback by submitting multiple versions of their essays for any given prompt. We build the corpus on this interplay between users and prompts, collecting sets of essays submitted by users for a selected list of 50 popular prompts. The prompts include 20 aimed at beginner learners of English, 20 aimed at intermediate learners, and 10 aimed at advanced learners. This distribution reflects the greater use of W&I by beginner and intermediate learners of English. We ensured that the prompts were unlikely to elicit personal information and that they covered a broad range of tasks and topics. This list of prompts enabled us to identify 5050 essay sets written by 766 users, forming the basis for the Write & Improve Corpus, which is being made available for non-commercial use on the ELiT website. We describe the steps we took to ensure that the corpus contains appropriate texts, does not include personal information, and comes with annotations relating to Common European Framework of Reference (CEFR) level and grammatical errors. All essays were submitted between 2020 and 2022 by registered users of W&I who had supplied their first language (L1) in an optional questionnaire. In total, there are more than 23K essays containing more than 3.5 million word tokens. The final versions of the essay sets together amount to 762K word tokens. There are 22 different L1s in the corpus, the most common being Spanish, Portuguese, Japanese, Arabic and Vietnamese. We present some descriptive statistics for the corpus and consider some research use cases.