
dc.contributor.author: Vulic, Ivan (en)
dc.contributor.author: Baker, Simon (en)
dc.contributor.author: Ponti, Edoardo (en)
dc.contributor.author: Petti, Ulla (en)
dc.contributor.author: Leviant, Ira (en)
dc.contributor.author: Wing, Kelly (en)
dc.contributor.author: Majewska, Olga (en)
dc.contributor.author: Bar, Eden (en)
dc.contributor.author: Malone, Matt (en)
dc.contributor.author: Poibeau, Thierry (en)
dc.contributor.author: Reichart, Roi (en)
dc.contributor.author: Korhonen, Anna-Leena (en)
dc.date.accessioned: 2020-12-15T00:31:34Z
dc.date.available: 2020-12-15T00:31:34Z
dc.identifier.issn: 0891-2017
dc.identifier.uri: https://www.repository.cam.ac.uk/handle/1810/315099
dc.description.abstract: We introduce Multi-SimLex, a large-scale lexical resource and evaluation benchmark covering data sets for 12 typologically diverse languages, including major languages (e.g., Mandarin Chinese, Spanish, Russian) as well as less-resourced ones (e.g., Welsh, Kiswahili). Each language data set is annotated for the lexical relation of semantic similarity and contains 1,888 semantically aligned concept pairs, providing a representative coverage of word classes (nouns, verbs, adjectives, adverbs), frequency ranks, similarity intervals, lexical fields, and concreteness levels. Additionally, owing to the alignment of concepts across languages, we provide a suite of 66 cross-lingual semantic similarity data sets. Due to its extensive size and language coverage, Multi-SimLex provides entirely novel opportunities for experimental evaluation and analysis. On its monolingual and cross-lingual benchmarks, we evaluate and analyze a wide array of recent state-of-the-art monolingual and cross-lingual representation models, including static and contextualized word embeddings (such as fastText, monolingual and multilingual BERT, XLM), externally informed lexical representations, as well as fully unsupervised and (weakly) supervised cross-lingual word embeddings. We also present a step-by-step data set creation protocol for creating consistent, Multi-SimLex-style resources for additional languages. We make these contributions (the public release of Multi-SimLex data sets, their creation protocol, strong baseline results, and in-depth analyses which can be helpful in guiding future developments in multilingual lexical semantics and representation learning) available via a website which will encourage community effort in further expansion of Multi-SimLex to many more languages. Such a large-scale semantic resource could inspire significant further advances in NLP across languages.
dc.publisher: MIT Press
dc.rights: All rights reserved
dc.title: Multi-SimLex: A Large-Scale Evaluation of Multilingual and Cross-Lingual Lexical Semantic Similarity (en)
dc.type: Article
prism.publicationName: Computational Linguistics (en)
dc.identifier.doi: 10.17863/CAM.62206
dcterms.dateAccepted: 2020-10-03 (en)
rioxxterms.versionofrecord: 10.1162/coli_a_00391 (en)
rioxxterms.version: AM
rioxxterms.licenseref.uri: http://www.rioxx.net/licenses/all-rights-reserved (en)
rioxxterms.licenseref.startdate: 2020-10-03 (en)
dc.contributor.orcid: Baker, Simon [0000-0002-0998-438X]
dc.contributor.orcid: Ponti, Edoardo [0000-0002-6308-1050]
dc.contributor.orcid: Majewska, Olga [0000-0003-4509-8817]
dc.identifier.eissn: 1530-9312
rioxxterms.type: Journal Article/Review (en)
pubs.funder-project-id: ECH2020 EUROPEAN RESEARCH COUNCIL (ERC) (648909)
cam.issuedOnline: 2020-10-22 (en)
rioxxterms.freetoread.startdate: 2023-12-14

