Show simple item record

dc.contributor.author    Zhu, Y    en
dc.contributor.author    Vulić, I    en
dc.contributor.author    Korhonen, Anna-Leena    en
dc.date.accessioned    2019-05-09T23:31:00Z
dc.date.available    2019-05-09T23:31:00Z
dc.date.issued    2019-01-01    en
dc.identifier.isbn    9781950737130    en
dc.identifier.uri    https://www.repository.cam.ac.uk/handle/1810/292619
dc.description.abstract    The use of subword-level information (e.g., characters, character n-grams, morphemes) has become ubiquitous in modern word representation learning. Its importance is attested especially for morphologically rich languages which generate a large number of rare words. Despite a steadily increasing interest in such subword-informed word representations, their systematic comparative analysis across typologically diverse languages and different tasks is still missing. In this work, we deliver such a study focusing on the variation of two crucial components required for subword-level integration into word representation models: 1) segmentation of words into subword units, and 2) subword composition functions to obtain final word representations. We propose a general framework for learning subword-informed word representations that allows for easy experimentation with different segmentation and composition components, also including more advanced techniques based on position embeddings and self-attention. Using the unified framework, we run experiments over a large number of subword-informed word representation configurations (60 in total) on 3 tasks (general and rare word similarity, dependency parsing, fine-grained entity typing) for 5 languages representing 3 language types. Our main results clearly indicate that there is no "one-size-fits-all" configuration, as performance is both language- and task-dependent. We also show that configurations based on unsupervised segmentation (e.g., BPE, Morfessor) are sometimes comparable to or even outperform the ones based on supervised word segmentation.
dc.rights    All rights reserved
dc.rights.uri
dc.title    A systematic study of leveraging subword information for learning word representations    en
dc.type    Conference Object
prism.endingPage    932
prism.publicationDate    2019    en
prism.publicationName    NAACL HLT 2019 - 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies - Proceedings of the Conference    en
prism.startingPage    912
prism.volume    1    en
dc.identifier.doi    10.17863/CAM.39780
dcterms.dateAccepted    2019-02-22    en
rioxxterms.version    AM
rioxxterms.licenseref.uri    http://www.rioxx.net/licenses/all-rights-reserved    en
rioxxterms.licenseref.startdate    2019-01-01    en
rioxxterms.type    Conference Paper/Proceeding/Abstract    en
pubs.funder-project-id    ECH2020 EUROPEAN RESEARCH COUNCIL (ERC) (648909)
cam.orpheus.success    Thu Nov 05 11:53:56 GMT 2020 - Embargo updated*
rioxxterms.freetoread.startdate    2020-01-01
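
The abstract above describes two components: segmentation of a word into subword units and composition of subword embeddings into a word representation. The following is a minimal, hypothetical Python sketch of that pipeline, assuming character n-gram segmentation and additive composition purely for illustration; it is not the authors' released code, and the paper's framework also covers BPE/Morfessor segmentation as well as position-embedding and self-attention based composition.

    # Illustrative sketch (not the authors' code) of subword-informed word
    # representations: (1) segment a word into subword units, (2) compose
    # the subword embeddings into a single word vector.
    import numpy as np

    EMB_DIM = 300
    rng = np.random.default_rng(0)
    subword_vectors = {}  # hypothetical lookup table: subword string -> vector


    def segment(word, n_min=3, n_max=5):
        """Segment a word into character n-grams (one possible segmentation)."""
        padded = f"<{word}>"  # boundary markers around the word
        return [padded[i:i + n]
                for n in range(n_min, n_max + 1)
                for i in range(len(padded) - n + 1)]


    def embed(subword):
        """Look up (or lazily initialise) a subword embedding."""
        if subword not in subword_vectors:
            subword_vectors[subword] = rng.normal(scale=0.1, size=EMB_DIM)
        return subword_vectors[subword]


    def compose(subwords):
        """Additive composition: sum subword embeddings into a word vector."""
        return np.sum([embed(s) for s in subwords], axis=0)


    word_vec = compose(segment("representation"))
    print(word_vec.shape)  # (300,)

Additive composition is the simplest option; the more advanced variants mentioned in the abstract weight or transform the subword embeddings (e.g., with position embeddings or self-attention) before combining them.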

