Spatial multi-arrangement for clustering and multi-way similarity dataset construction

Majewska, O; McCarthy, D; van den Bosch, J; Kriegeskorte, N; Vulic, I; Korhonen, A

Published version

Peer-reviewed

Repository URI

https://www.repository.cam.ac.uk/handle/1810/306834

Repository DOI

https://doi.org/10.17863/CAM.53925

Files

Published version (772.75 KB)

Type

Conference Object

Authors

Majewska, Olga

https://orcid.org/0000-0003-4509-8817

McCarthy, D

van den Bosch, J

Kriegeskorte, N

Vulic, I

Korhonen, A

Abstract

We present a novel methodology for fast bottom-up creation of large-scale semantic similarity resources to support development and evaluation of NLP systems. Our work targets verb similarity, but the methodology is equally applicable to other parts of speech. Our approach circumvents the bottleneck of slow and expensive manual development of lexical resources by leveraging the semantic intuitions of native speakers and adapting to lexical stimuli a spatial multi-arrangement approach from cognitive neuroscience, previously used only with visual stimuli. Crucially, our approach obtains judgments of word similarity in the context of a set of related words, rather than of word pairs in isolation. We also handle lexical ambiguity as a natural consequence of a two-phase process in which verbs are placed in broad semantic classes prior to the fine-grained spatial similarity judgments. Our proposed design produces a large-scale verb resource comprising 17 relatedness-based classes and a verb similarity dataset containing similarity scores for 29,721 unique verb pairs and 825 target verbs, which we release with this paper.

Keywords

Lexicon, Lexical Database, Semantics, Crowdsourcing

Journal Title

LREC 2020 - 12th International Conference on Language Resources and Evaluation, Conference Proceedings

Conference Name

12th International Conference on Language Resources and Evaluation (LREC 2020)

Journal ISSN

2522-2686

Publisher

European Language Resources Association

Publisher DOI

https://doi.org/10.17863/CAM.53925

Rights

Attribution-NonCommercial 4.0 International

Sponsorship

European Research Council (648909)
ESRC (1804172)

Collections

Cambridge University Research Outputs