Manual Clustering and Spatial Arrangement of Verbs for Multilingual Evaluation and Typology Analysis
Authors
Majewska, Olga
Vulic, Ivan
McCarthy, Diana
Korhonen, Anna
Publication Date
2020-12
Journal Title
Proceedings of the 28th International Conference on Computational Linguistics (COLING 2020)
Conference Name
28th International Conference on Computational Linguistics (COLING 2020)
Publisher
International Committee on Computational Linguistics
Type
Conference Object
This Version
VoR
Citation
Majewska, O., Vulic, I., McCarthy, D., & Korhonen, A. (2020). Manual Clustering and Spatial Arrangement of Verbs for Multilingual Evaluation and Typology Analysis. Proceedings of the 28th International Conference on Computational Linguistics (COLING 2020). https://www.aclweb.org/anthology/2020.coling-main.423
Abstract
We present the first evaluation of the applicability of a spatial arrangement method (SpAM) to a typologically diverse language sample, and its potential to produce semantic evaluation resources to support multilingual NLP, with a focus on verb semantics. We demonstrate SpAM’s utility in allowing for quick bottom-up creation of large-scale evaluation datasets that balance cross-lingual alignment with language specificity. Starting from a shared sample of 825 English verbs, translated into Chinese, Japanese, Finnish, Polish, and Italian, we apply a two-phase annotation process which produces (i) semantic verb classes and (ii) fine-grained similarity scores for nearly 130 thousand verb pairs. We use the two types of verb data to (a) examine cross-lingual similarities and variation, and (b) evaluate the capacity of static and contextualised representation models to accurately reflect verb semantics, contrasting the performance of large language-specific pretraining models with their multilingual equivalent on semantic clustering and lexical similarity, across different domains of verb meaning. We release the data from both phases as a large-scale multilingual resource, comprising 85 verb classes and nearly 130k pairwise similarity scores, offering a wealth of possibilities for further evaluation and research on multilingual verb semantics.
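As a rough illustration of the lexical similarity evaluation mentioned in the abstract, the sketch below compares cosine similarities from a representation model against human pairwise similarity scores using Spearman's rank correlation. The choice of Spearman correlation is an assumption based on common practice, not something this record states, and the toy vectors, verb pairs, score values, and function names are all hypothetical.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def ranks(values):
    """1-based ranks; tied values share the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = average_rank
        i = j + 1
    return result

def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (sd_x * sd_y)

def spearman(x, y):
    """Spearman correlation: Pearson correlation computed on ranks."""
    return pearson(ranks(x), ranks(y))

# Hypothetical toy embeddings and human scores (not taken from the dataset).
vectors = {"run": [1.0, 0.0], "walk": [0.9, 0.1], "think": [0.0, 1.0]}
human_scores = {("run", "walk"): 0.9, ("run", "think"): 0.1, ("walk", "think"): 0.2}

pairs = list(human_scores)
model_sims = [cosine(vectors[a], vectors[b]) for a, b in pairs]
gold_sims = [human_scores[p] for p in pairs]
rho = spearman(model_sims, gold_sims)
print(f"Spearman correlation: {rho:.3f}")
```

With the released pairwise scores, the toy dictionaries would be replaced by the per-language verb pairs and the vectors of the model under evaluation.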
Sponsorship
EC H2020 European Research Council (ERC) (648909)
Identifiers
External link: https://www.aclweb.org/anthology/2020.coling-main.423
This record's URL: https://www.repository.cam.ac.uk/handle/1810/315106