Show Some Love to Your n-grams: A Bit of Progress and Stronger n-gram Language Modeling Baselines
Abstract
In recent years, neural language models (LMs) have set state-of-the-art performance on several benchmark datasets. While the reasons for their success and their computational demands are well documented, a comparison between neural models and more recent developments in n-gram models has been neglected. In this paper, we examine the recent progress in the n-gram literature, running experiments on 50 languages covering all morphological language families. Experimental results show that a simple extension of Modified Kneser-Ney outperforms an LSTM language model on 42 languages, while a word-level Bayesian n-gram LM outperforms the character-aware neural model on average across all languages, and outperforms its extension, which explicitly injects linguistic knowledge, on 8 languages. Further experiments on larger Europarl datasets for 3 languages indicate that neural architectures are able to outperform the computationally much cheaper n-gram models: n-gram training is up to 15,000 times quicker. Our experiments illustrate that standalone n-gram models lend themselves as natural choices for resource-lean or morphologically rich languages, and that recent progress has significantly improved their accuracy.
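For reference, the strongest n-gram baselines discussed in the abstract build on interpolated Modified Kneser-Ney smoothing. The display below is a sketch of the standard textbook recursion (Chen and Goodman, 1999), not the paper's specific extension; the symbols c (n-gram count), D (count-dependent discount), and \gamma (backoff mass) follow the usual conventions and are not taken from the paper itself.

\[
P_{\mathrm{MKN}}(w_i \mid w_{i-n+1}^{i-1})
  = \frac{\max\bigl(c(w_{i-n+1}^{i}) - D\bigl(c(w_{i-n+1}^{i})\bigr),\, 0\bigr)}
         {\sum_{w'} c(w_{i-n+1}^{i-1} w')}
  + \gamma(w_{i-n+1}^{i-1})\, P_{\mathrm{MKN}}(w_i \mid w_{i-n+2}^{i-1})
\]

Here D(c) takes one of three values, D_1, D_2, or D_{3+}, depending on whether the n-gram count is 1, 2, or at least 3; \gamma(\cdot) is the leftover discount mass, chosen so that each conditional distribution sums to one; and the lower-order distributions in the recursion are estimated from continuation counts rather than raw counts.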