dc.contributor.author: Bryant, Christopher Jack
dc.date.accessioned: 2019-06-18T11:43:09Z
dc.date.available: 2019-06-18T11:43:09Z
dc.date.issued: 2019-04-30
dc.date.submitted: 2018-12-19
dc.identifier.uri: https://www.repository.cam.ac.uk/handle/1810/293719
dc.description.abstract: Grammatical Error Correction (GEC) is the task of automatically detecting and correcting grammatical errors in text. Although previous work has focused on developing systems that target specific error types, the current state of the art uses machine translation to correct all error types simultaneously. A significant disadvantage of this approach is that machine translation does not produce annotated output, so error type information is lost. This means we can only evaluate a system in terms of overall performance and cannot carry out a more detailed analysis of different aspects of system performance. In this thesis, I develop a system to automatically annotate parallel original and corrected sentence pairs with explicit edits and error types. In particular, I first extend the Damerau-Levenshtein alignment algorithm to make use of linguistic information when aligning parallel sentences, and supplement this alignment with a set of merging rules to handle multi-token edits. The output from this algorithm surpasses other edit extraction approaches in terms of approximating human edit annotations and is the current state of the art. Having extracted the edits, I next classify them according to a new rule-based error type framework that depends only on automatically obtained linguistic properties of the data, such as part-of-speech tags. This framework was inspired by existing frameworks, and human judges rated the appropriateness of the predicted error types as ‘Good’ (85%) or ‘Acceptable’ (10%) in a random sample of 200 edits. The whole system is called the ERRor ANnotation Toolkit (ERRANT) and is the first toolkit capable of automatically annotating parallel sentences with error types. I demonstrate the value of ERRANT by applying it to the system output produced by the participants of the CoNLL-2014 shared task, and carry out a detailed error type analysis of system performance for the first time. I also develop a simple language-model-based approach to GEC that does not require annotated training data, and show how it can be improved using ERRANT error types.
dc.language.iso: en
dc.rights: All rights reserved
dc.subject: Natural Language Processing
dc.subject: Grammatical Error Correction
dc.subject: Automatic Annotation
dc.title: Automatic annotation of error types for grammatical error correction
dc.type: Thesis
dc.type.qualificationlevel: Doctoral
dc.type.qualificationname: Doctor of Philosophy (PhD)
dc.publisher.institution: University of Cambridge
dc.publisher.department: Computer Science and Technology
dc.date.updated: 2019-06-18T10:21:48Z
dc.identifier.doi: 10.17863/CAM.40832
dc.publisher.college: Churchill
dc.type.qualificationtitle: PhD in Computer Science
cam.supervisor: Briscoe, Edward
cam.thesis.funding: false

