dc.contributor.author | Chiu, Billy | en
dc.contributor.author | Pyysalo, Sampo | en
dc.contributor.author | Vulić, Ivan | en
dc.contributor.author | Korhonen, Anna-Leena | en
dc.date.accessioned | 2018-06-06T09:36:06Z
dc.date.available | 2018-06-06T09:36:06Z
dc.date.issued | 2018-02-05 | en
dc.identifier.issn | 1471-2105
dc.identifier.uri | https://www.repository.cam.ac.uk/handle/1810/276650
dc.description.abstract | Background: Word representations support a variety of Natural Language Processing (NLP) tasks. The quality of these representations is typically assessed by comparing the distances in the induced vector spaces against human similarity judgements. Whereas comprehensive evaluation resources have recently been developed for the general domain, similar resources for biomedicine currently suffer from the lack of coverage, both in terms of word types included and with respect to the semantic distinctions. Notably, verbs have been excluded, although they are essential for the interpretation of biomedical language. Further, current resources do not discern between semantic similarity and semantic relatedness, although this has been proven as an important predictor of the usefulness of word representations and their performance in downstream applications. Results: We present two novel comprehensive resources targeting the evaluation of word representations in biomedicine. These resources, Bio-SimVerb and Bio-SimLex, address the previously mentioned problems, and can be used for evaluations of verb and noun representations respectively. In our experiments, we have computed the Pearson’s correlation between performances on intrinsic and extrinsic tasks using twelve popular state-of-the-art representation models (e.g. word2vec models). The intrinsic–extrinsic correlations using our datasets are notably higher than with previous intrinsic evaluation benchmarks such as UMNSRS and MayoSRS. In addition, when evaluating representation models for their abilities to capture verb and noun semantics individually, we show a considerable variation between performances across all models. Conclusion: Bio-SimVerb and Bio-SimLex enable intrinsic evaluation of word representations. This evaluation can serve as a predictor of performance on various downstream tasks in the biomedical domain. The results on Bio-SimVerb and Bio-SimLex using standard word representation models highlight the importance of developing dedicated evaluation resources for NLP in biomedicine for particular word classes (e.g. verbs). These are needed to identify the most accurate methods for learning class-specific representations. Bio-SimVerb and Bio-SimLex are publicly available.
dc.format.medium | Electronic | en
dc.language | eng | en
dc.publisher | BioMed Central
dc.rights | Attribution 4.0 International | *
dc.rights.uri | http://creativecommons.org/licenses/by/4.0/ | *
dc.subject | Humans | en
dc.subject | Language | en
dc.subject | Biomedical Technology | en
dc.subject | Semantics | en
dc.subject | Natural Language Processing | en
dc.subject | Software | en
dc.subject | Databases as Topic | en
dc.title | Bio-SimVerb and Bio-SimLex: wide-coverage evaluation sets of word similarity in biomedicine. | en
dc.type | Article
prism.issueIdentifier | 1 | en
prism.publicationDate | 2018 | en
prism.publicationName | BMC bioinformatics | en
prism.startingPage | 33
prism.volume | 19 | en
dc.identifier.doi | 10.17863/CAM.18170
dcterms.dateAccepted | 2018-01-24 | en
rioxxterms.versionofrecord | 10.1186/s12859-018-2039-z | en
rioxxterms.version | VoR | *
rioxxterms.licenseref.uri | http://creativecommons.org/licenses/by/4.0/ | en
rioxxterms.licenseref.startdate | 2018-02-05 | en
dc.contributor.orcid | Chiu, Billy [0000-0001-6683-3249]
dc.identifier.eissn | 1471-2105
rioxxterms.type | Journal Article/Review | en
pubs.funder-project-id | Medical Research Council (MR/M013049/1)
pubs.funder-project-id | ECH2020 EUROPEAN RESEARCH COUNCIL (ERC) (648909)


Except where otherwise noted, this item's licence is described as Attribution 4.0 International