Comparing Data Sources and Architectures for Deep Visual Representation Learning in Semantics

Multi-modal distributional models learn grounded representations for improved performance in semantics. Deep visual representations, learned using convolutional neural networks, have been shown to achieve particularly high performance. In this study, we systematically compare deep visual representation learning techniques, experimenting with three well-known network architectures. In addition, we explore the various data sources that can be used for retrieving relevant images, showing that images from search engines perform as well as, or better than, those from manually crafted resources such as ImageNet. Furthermore, we explore the optimal number of images and the multi-lingual applicability of multi-modal semantics. We hope that these findings can serve as a guide for future research in the field.

Journal Title

Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing

Conference Name

Empirical Methods in Natural Language Processing Conference (EMNLP 2016)

Volume Title

D16

Publisher

Association for Computational Linguistics

Publisher DOI

https://doi.org/10.18653/v1/D16-1043

Rights and licensing

Except where otherwise noted, this item's license is described as http://www.rioxx.net/licenses/all-rights-reserved

Sponsorship

Anita Verő is supported by the Nuance Foundation Grant: Learning Type-Driven Distributed Representations of Language. Stephen Clark is supported by the ERC Starting Grant: DisCoTex (306920).

Collections

Scholarly Works - Computer Science and Technology
Symplectic mapped items for data match