CARD-660: Cambridge rare word dataset - A reliable benchmark for infrequent word representation models
Authors
Pilehvar, MT
Kartsaklis, D
Prokhorov, V
Collier, N
Publication Date
2018
Journal Title
Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, EMNLP 2018
Conference Name
EMNLP 2018
ISBN
9781948087841
Pages
1391-1401
Type
Conference Object
This Version
AM
Citation
Pilehvar, M., Kartsaklis, D., Prokhorov, V., & Collier, N. (2018). CARD-660: Cambridge rare word dataset - A reliable benchmark for infrequent word representation models. Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, EMNLP 2018, 1391-1401. https://doi.org/10.17863/CAM.35329
Abstract
Rare word representation has recently enjoyed a surge of interest, owing to the crucial role that effective handling of infrequent words can play in accurate semantic understanding. However, there is a paucity of reliable benchmarks for evaluation and comparison of these techniques. We show in this paper that the only existing benchmark (the Stanford Rare Word dataset) suffers from low-confidence annotations and limited vocabulary; hence, it does not constitute a solid comparison framework. In order to fill this evaluation gap, we propose the CAmbridge Rare word Dataset (CARD-660), an expert-annotated word similarity dataset which provides a highly reliable, yet challenging, benchmark for rare word representation techniques. Through a set of experiments we show that even the best mainstream word embeddings, with millions of words in their vocabularies, are unable to achieve a performance higher than 0.43 (Pearson correlation) on the dataset, compared to a human-level upper bound of 0.90. We release the dataset and the annotation materials at https://pilehvar.github.io/card-660/.
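The Pearson-correlation evaluation mentioned in the abstract can be illustrated with a minimal sketch. It is not the authors' evaluation script: the function names, the toy word pairs and vectors, and the choice to assign a fixed score to pairs containing an out-of-vocabulary word are all assumptions made for illustration, and the actual file format of CARD-660 is not described in this record.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0.0 or var_y == 0.0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / math.sqrt(var_x * var_y)


def evaluate(pairs, embeddings, oov_score=0.0):
    """Correlate model similarities with gold scores.

    pairs: iterable of (word1, word2, gold_score) tuples (hypothetical layout).
    embeddings: dict mapping a word to its vector.
    oov_score: similarity assigned when either word has no vector
               (an assumption; other back-off strategies are possible).
    """
    gold, predicted = [], []
    for w1, w2, score in pairs:
        gold.append(score)
        if w1 in embeddings and w2 in embeddings:
            predicted.append(cosine(embeddings[w1], embeddings[w2]))
        else:
            predicted.append(oov_score)
    return pearson(predicted, gold)


# Toy data, invented for this example; not taken from CARD-660.
toy_embeddings = {
    "car": [1.0, 0.0],
    "automobile": [1.0, 0.0],
    "banana": [0.0, 1.0],
}
toy_pairs = [
    ("car", "automobile", 4.0),
    ("car", "banana", 0.0),
    ("car", "skiff", 2.0),  # "skiff" has no vector, so oov_score is used
]

if __name__ == "__main__":
    print(f"Pearson r = {evaluate(toy_pairs, toy_embeddings):.3f}")
```

Scoring out-of-vocabulary pairs with a fixed value, rather than dropping them, keeps every pair in the comparison, so a model with limited vocabulary coverage is penalised instead of being evaluated on an easier subset.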
Sponsorship
Medical Research Council (MR/M025160/1)
Identifiers
External DOI: https://doi.org/10.17863/CAM.35329
This record's URL: https://www.repository.cam.ac.uk/handle/1810/288010
Rights
Licence:
http://www.rioxx.net/licenses/all-rights-reserved