Acquiring and Harnessing Verb Knowledge for Multilingual Natural Language Processing

Majewska, Olga

doi:10.17863/CAM.76739

Repository URI

https://www.repository.cam.ac.uk/handle/1810/329292

Repository DOI

https://doi.org/10.17863/CAM.76739

Files

Thesis (7.19 MB)

Type

Thesis

Authors

Majewska, Olga

https://orcid.org/0000-0003-4509-8817

Abstract

Advances in representation learning have enabled natural language processing models to derive non-negligible linguistic information directly from text corpora in an unsupervised fashion. However, this signal is underused in downstream tasks, where models tend to fall back on superficial cues and heuristics to solve the problem at hand. Further progress relies on identifying and filling the gaps in the linguistic knowledge captured in model parameters. The objective of this thesis is to address these challenges, focusing on the issues of resource scarcity, interpretability, and lexical knowledge injection, with an emphasis on the category of verbs. To this end, I propose a novel paradigm for efficient acquisition of lexical knowledge, leveraging native speakers’ intuitions about verb meaning to support the development and downstream performance of NLP models across languages. First, I investigate the potential of acquiring semantic verb classes from non-experts through manual clustering. This subsequently informs the development of a two-phase semantic dataset creation methodology, which combines semantic clustering with fine-grained semantic similarity judgments collected through spatial arrangements of lexical stimuli. The method is tested on English and then applied to a typologically diverse sample of languages to produce the first large-scale multilingual verb dataset of this kind. I demonstrate its utility as a diagnostic tool by carrying out a comprehensive evaluation of state-of-the-art NLP models, probing representation quality across languages and domains of verb meaning, and shedding light on the models’ deficiencies. Subsequently, I directly address these shortcomings by injecting lexical knowledge into large pretrained language models. I demonstrate that external manually curated information about verbs’ lexical properties can support data-driven models in tasks where accurate verb processing is key.
Moreover, I examine the potential of extending these benefits from resource-rich to resource-poor languages through translation-based transfer. The results emphasise the usefulness of human-generated lexical knowledge in supporting NLP models and suggest that time-efficient construction of lexicons similar to those developed in this work, especially in under-resourced languages, can play an important role in boosting the linguistic capacity of these models.

Date

2021-02-01

Advisors

Korhonen, Anna
McCarthy, Diana

Keywords

lexical semantics, verb semantics, language resources, multilingual NLP, semantic dataset

Qualification

Doctor of Philosophy (PhD)

Awarding Institution

University of Cambridge

Rights and licensing

Except where otherwise noted, this item's license is described as Attribution-NonCommercial-NoDerivatives 4.0 International (CC BY-NC-ND 4.0)

Sponsorship

ESRC (1804172)
European Research Council (648909)

ESRC Doctoral Fellowship [ES/J500033/1], ERC Consolidator Grant LEXICAL [648909]

Collections

Theses - Theoretical and Applied Linguistics
