Automatic syntactic analysis of learner English
Authors
Huang, Yan
Advisors
Korhonen, Anna
Date
2019-03-23
Awarding Institution
University of Cambridge
Author Affiliation
Faculty of Modern and Medieval Languages
Qualification
Doctor of Philosophy (PhD)
Language
English
Type
Thesis
Citation
Huang, Y. (2019). Automatic syntactic analysis of learner English (Doctoral thesis). https://doi.org/10.17863/CAM.33319
Abstract
Automatic syntactic analysis is essential for extracting useful information from large-scale learner data for linguistic research and natural language processing (NLP). Currently, researchers use standard POS taggers and parsers developed on native language to analyze learner language. Investigation of how such systems perform on learner data is needed to develop strategies for minimizing the cross-domain effects. Furthermore, POS taggers and parsers are developed for generic NLP purposes and may not be useful for identifying specific syntactic constructs such as subcategorization frames (SCFs). SCFs have attracted much research attention as they provide unique insight into the interplay between lexical and structural information. An automatic SCF identification system adapted for learner language is needed to facilitate research on L2 SCFs.
In this thesis, we first provide a comprehensive evaluation of standard POS taggers and parsers on learner and native English. We show that the common practice of constructing a gold standard by manually correcting the output of a system can introduce bias to the evaluation, and we suggest a method to control for the bias. We also quantitatively evaluate the impact of fine-grained learner errors on POS tagging and parsing, identifying the most influential learner errors. Furthermore, we show that the performance of probabilistic POS taggers and parsers on native English can predict their performance on learner English.
Secondly, we develop an SCF identification system for learner English. We train a machine learning model on both native and learner English data. The system can label individual verb occurrences in learner data for a set of 49 distinct SCFs. Our evaluation shows that the system reaches an F1 score of 84%. We then demonstrate that this level of accuracy is adequate for linguistic research. We design the first multidimensional SCF diversity metrics and investigate how SCF diversity changes with L2 proficiency on a large learner corpus. Our results show that as L2 proficiency develops, learners tend to use more diverse SCF types with greater taxonomic distance; more advanced learners also use different SCF types more evenly and locate the verb tokens of the same SCF type further away from each other. Furthermore, we demonstrate that the proposed SCF diversity metrics contribute a unique perspective to the prediction of L2 proficiency beyond existing syntactic complexity metrics.
Keywords
subcategorization identification, learner English, dependency parsing, subcategorization frame, SCF, syntactic analysis, computational linguistics, natural language processing, second language acquisition, corpus linguistics
Sponsorship
Grace and Thomas CH Chan Cambridge International Scholarship
Identifiers
This record's DOI: https://doi.org/10.17863/CAM.33319
Rights
Attribution-NonCommercial-ShareAlike 4.0 International (CC BY-NC-SA 4.0)
Licence URL: https://creativecommons.org/licenses/by-nc-sa/4.0/