Manual Clustering and Spatial Arrangement of Verbs for Multilingual Evaluation and Typology Analysis

Majewska, Olga; Vulic, Ivan; McCarthy, Diana; Korhonen, Anna

doi:10.17863/CAM.62213

Published version

Peer-reviewed

Repository URI

https://www.repository.cam.ac.uk/handle/1810/315106

Repository DOI

https://doi.org/10.17863/CAM.62213

Files

Published version (1.07 MB)

Type

Conference Object

Authors

Majewska, Olga

https://orcid.org/0000-0003-4509-8817

Vulic, Ivan

McCarthy, Diana

Korhonen, Anna

Abstract

We present the first evaluation of the applicability of a spatial arrangement method (SpAM) to a typologically diverse language sample, and its potential to produce semantic evaluation resources to support multilingual NLP, with a focus on verb semantics. We demonstrate SpAM’s utility in allowing for quick bottom-up creation of large-scale evaluation datasets that balance cross-lingual alignment with language specificity. Starting from a shared sample of 825 English verbs, translated into Chinese, Japanese, Finnish, Polish, and Italian, we apply a two-phase annotation process which produces (i) semantic verb classes and (ii) fine-grained similarity scores for nearly 130 thousand verb pairs. We use the two types of verb data to (a) examine cross-lingual similarities and variation, and (b) evaluate the capacity of static and contextualised representation models to accurately reflect verb semantics, contrasting the performance of large language-specific pretraining models with their multilingual equivalent on semantic clustering and lexical similarity, across different domains of verb meaning. We release the data from both phases as a large-scale multilingual resource, comprising 85 verb classes and nearly 130k pairwise similarity scores, offering a wealth of possibilities for further evaluation and research on multilingual verb semantics.
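The lexical similarity evaluation mentioned in the abstract can be illustrated with a minimal sketch. It assumes that a representation model is scored by the Spearman rank correlation between its cosine similarities and the human pairwise scores, and that higher gold scores mean greater similarity; neither detail is stated in this record. The function names, the toy vectors, and the toy scores are hypothetical and are not taken from the released data.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def average_ranks(values):
    """Return 1-based ranks, giving tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with the one at position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def spearman(xs, ys):
    """Spearman correlation: Pearson correlation computed on the ranks."""
    rx, ry = average_ranks(xs), average_ranks(ys)
    mx, my = sum(rx) / len(rx), sum(ry) / len(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sx = math.sqrt(sum((a - mx) ** 2 for a in rx))
    sy = math.sqrt(sum((b - my) ** 2 for b in ry))
    return cov / (sx * sy)


def evaluate(embeddings, gold_scores):
    """Correlate model similarities with gold scores over covered pairs."""
    pairs = [p for p in gold_scores if p[0] in embeddings and p[1] in embeddings]
    model = [cosine(embeddings[a], embeddings[b]) for a, b in pairs]
    gold = [gold_scores[p] for p in pairs]
    return spearman(model, gold)


# Toy data, invented for illustration only.
toy_embeddings = {"run": [1.0, 0.0], "walk": [0.9, 0.1], "eat": [0.0, 1.0]}
toy_gold = {("run", "walk"): 0.8, ("run", "eat"): 0.1, ("walk", "eat"): 0.2}
toy_correlation = evaluate(toy_embeddings, toy_gold)
```

Rank correlation is used here because it compares only the ordering of pairs, so the model's similarity scale does not need to match the scale of the human judgements.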

Journal Title

Proceedings of the 28th International Conference on Computational Linguistics (COLING 2020)

Conference Name

28th International Conference on Computational Linguistics (COLING 2020)

Publisher

International Committee on Computational Linguistics

Publisher DOI

https://doi.org/10.17863/CAM.62213

Rights

Attribution 4.0 International

Sponsorship

European Research Council (648909)
ESRC (1804172)

Collections

University of Cambridge Research Outputs (Articles and Conferences)
