dc.contributor.author: Majewska, Olga
dc.date.accessioned: 2021-10-12T20:09:11Z
dc.date.available: 2021-10-12T20:09:11Z
dc.date.submitted: 2021-02-01
dc.identifier.uri: https://www.repository.cam.ac.uk/handle/1810/329292
dc.description.abstract: Advances in representation learning have enabled natural language processing models to derive non-negligible linguistic information directly from text corpora in an unsupervised fashion. However, this signal is underused in downstream tasks, where models tend to fall back on superficial cues and heuristics to solve the problem at hand. Further progress relies on identifying and filling the gaps in the linguistic knowledge captured in model parameters. The objective of this thesis is to address these challenges, focusing on the issues of resource scarcity, interpretability, and lexical knowledge injection, with an emphasis on the category of verbs. To this end, I propose a novel paradigm for efficient acquisition of lexical knowledge, leveraging native speakers’ intuitions about verb meaning, to support the development and downstream performance of NLP models across languages. First, I investigate the potential of acquiring semantic verb classes from non-experts through manual clustering. This subsequently informs the development of a two-phase semantic dataset creation methodology, which combines semantic clustering with fine-grained semantic similarity judgments collected through spatial arrangements of lexical stimuli. The method is tested on English and then applied to a typologically diverse sample of languages to produce the first large-scale multilingual verb dataset of this kind. I demonstrate its utility as a diagnostic tool by carrying out a comprehensive evaluation of state-of-the-art NLP models, probing representation quality across languages and domains of verb meaning, and shedding light on the models’ deficiencies. Subsequently, I directly address these shortcomings by injecting lexical knowledge into large pretrained language models. I demonstrate that external, manually curated information about verbs’ lexical properties can support data-driven models in tasks where accurate verb processing is key.
Moreover, I examine the potential of extending these benefits from resource-rich to resource-poor languages through translation-based transfer. The results emphasise the usefulness of human-generated lexical knowledge in supporting NLP models and suggest that time-efficient construction of lexicons similar to those developed in this work, especially in under-resourced languages, can play an important role in boosting the models’ linguistic capacity.
dc.description.sponsorship: ESRC Doctoral Fellowship [ES/J500033/1], ERC Consolidator Grant LEXICAL [648909]
dc.rights: Attribution-NonCommercial-NoDerivatives 4.0 International (CC BY-NC-ND 4.0)
dc.rights.uri: https://creativecommons.org/licenses/by-nc-nd/4.0/
dc.subject: lexical semantics
dc.subject: verb semantics
dc.subject: language resources
dc.subject: multilingual NLP
dc.subject: semantic dataset
dc.title: Acquiring and Harnessing Verb Knowledge for Multilingual Natural Language Processing
dc.type: Thesis
dc.type.qualificationlevel: Doctoral
dc.type.qualificationname: Doctor of Philosophy (PhD)
dc.publisher.institution: University of Cambridge
dc.identifier.doi: 10.17863/CAM.76739
rioxxterms.licenseref.uri: https://creativecommons.org/licenses/by-nc-nd/4.0/
dc.contributor.orcid: Majewska, Olga [0000-0003-4509-8817]
rioxxterms.type: Thesis
dc.publisher.college: Corpus Christi
dc.type.qualificationtitle: PhD in Computational Linguistics
pubs.funder-project-id: ESRC (1804172)
pubs.funder-project-id: European Research Council (648909)
cam.supervisor: Korhonen, Anna
cam.supervisor: McCarthy, Diana



Except where otherwise noted, this item's licence is described as Attribution-NonCommercial-NoDerivatives 4.0 International (CC BY-NC-ND 4.0)