Modelling the combination of generic and target domain embeddings in a convolutional neural network for sentence classification
Abstract
Word embeddings have been successfully exploited in systems for NLP tasks, such as parsing and text classification. Intuitively, word embeddings created from a larger corpus provide better vocabulary coverage, whereas word embeddings trained on a corpus related to the given task or target domain represent the semantics of terms more effectively. However, in some emerging domains (e.g. bio-surveillance using social media data), it may be difficult to find a domain corpus large enough to create effective word embeddings. To deal with this problem, we propose novel approaches that use word embeddings created from both generic and target domain corpora. Our experimental results on sentence classification tasks show that our approaches significantly improve the performance of an existing convolutional neural network that achieved state-of-the-art results on several text classification tasks.
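To make the general idea concrete, the sketch below shows one plausible way to feed both generic and target-domain embeddings into a Kim-style CNN for sentence classification: the two pre-trained lookup tables are stacked as separate input channels before convolution and max-pooling over time. This is an illustrative assumption for exposition, not necessarily the exact combination strategy proposed in the paper, and the class name `DualEmbeddingCNN` and all hyperparameters are hypothetical.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class DualEmbeddingCNN(nn.Module):
    """CNN sentence classifier with two embedding channels: one initialised
    from generic embeddings (e.g. trained on a large general corpus) and one
    from target-domain embeddings. Channel stacking is an assumed combination
    strategy, used here only to illustrate the idea."""

    def __init__(self, generic_vectors, domain_vectors, num_classes,
                 kernel_sizes=(3, 4, 5), num_filters=100, dropout=0.5):
        super().__init__()
        # Both embedding tables must share the vocabulary and dimensionality.
        assert generic_vectors.shape == domain_vectors.shape
        vocab_size, embed_dim = generic_vectors.shape
        self.generic_emb = nn.Embedding.from_pretrained(generic_vectors, freeze=True)
        self.domain_emb = nn.Embedding.from_pretrained(domain_vectors, freeze=False)
        # One 2-channel convolution per kernel size, spanning full word vectors.
        self.convs = nn.ModuleList(
            nn.Conv2d(in_channels=2, out_channels=num_filters,
                      kernel_size=(k, embed_dim))
            for k in kernel_sizes
        )
        self.dropout = nn.Dropout(dropout)
        self.fc = nn.Linear(num_filters * len(kernel_sizes), num_classes)

    def forward(self, token_ids):
        # token_ids: (batch, seq_len)
        generic = self.generic_emb(token_ids)      # (batch, seq_len, embed_dim)
        domain = self.domain_emb(token_ids)        # (batch, seq_len, embed_dim)
        x = torch.stack([generic, domain], dim=1)  # (batch, 2, seq_len, embed_dim)
        # Convolve, apply ReLU, then max-pool over time for each kernel size.
        pooled = [F.relu(conv(x)).squeeze(3).max(dim=2).values for conv in self.convs]
        features = self.dropout(torch.cat(pooled, dim=1))
        return self.fc(features)
```

In this sketch the generic channel is frozen and the domain channel is fine-tuned during training, mirroring the intuition that the generic embeddings supply broad vocabulary coverage while the domain embeddings adapt to task-specific semantics; other combination choices (e.g. concatenating the two vectors per token) are equally compatible with the same CNN backbone.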