Cross-domain paraphrasing for improving language modelling using out-of-domain data
Abstract
In natural languages, variability in the underlying linguistic generation rules significantly alters the observed surface word sequence, and thus introduces a mismatch against data generated via alternative realizations associated with, for example, a different domain. Hence, directly modelling out-of-domain data can result in poor generalization to the in-domain data of interest. To handle this problem, this paper investigates cross-domain paraphrastic language models for improving in-domain language modelling (LM) using out-of-domain data. Phrase-level paraphrase models learnt from each domain are used to generate paraphrase variants for the data of other domains. These variants both improve the context coverage of in-domain data and reduce the domain mismatch of the out-of-domain data. A significant error rate reduction of 0.6% absolute was obtained on a state-of-the-art conversational telephone speech recognition task using a cross-domain paraphrastic multi-level LM trained on a billion words of mixed conversational and broadcast news data. Consistent improvements in in-domain context coverage were also obtained.
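The core idea can be illustrated with a minimal sketch, not the authors' implementation: a toy phrase-level paraphrase table expands out-of-domain sentences into in-domain-style surface variants, and training counts on the expanded data yields bigram contexts the raw out-of-domain text never realizes. The paraphrase table, corpus, and coverage measure below are all illustrative assumptions.

```python
# Hedged sketch of phrase-level paraphrase expansion for context coverage.
# PARAPHRASES, the toy corpus, and the bigram coverage measure are
# illustrative assumptions, not the paper's actual models or data.
from collections import Counter
from itertools import product

# Toy phrase-level paraphrase table: phrase -> in-domain-style variants.
PARAPHRASES = {
    ("do", "not"): [("don't",)],
    ("going", "to"): [("gonna",)],
}

def paraphrase_variants(tokens):
    """Generate all surface variants of a token sequence by optionally
    replacing each matching phrase with one of its paraphrases."""
    variants = [[]]
    i = 0
    while i < len(tokens):
        matched = False
        for phrase, alts in PARAPHRASES.items():
            if tuple(tokens[i:i + len(phrase)]) == phrase:
                # Branch: keep the original phrase or substitute a variant.
                options = [list(phrase)] + [list(a) for a in alts]
                variants = [v + opt for v, opt in product(variants, options)]
                i += len(phrase)
                matched = True
                break
        if not matched:
            variants = [v + [tokens[i]] for v in variants]
            i += 1
    return [tuple(v) for v in variants]

def bigrams(sentences):
    """Count distinct bigram contexts observed in a list of sentences."""
    counts = Counter()
    for s in sentences:
        for a, b in zip(s, s[1:]):
            counts[(a, b)] += 1
    return counts

# Toy "out-of-domain" text, expanded into paraphrase variants.
ood = [("we", "are", "going", "to", "win"), ("do", "not", "stop")]
expanded = [v for s in ood for v in paraphrase_variants(s)]

# The expanded data contributes extra contexts such as ("gonna", "win"),
# which raw out-of-domain text never realizes on its surface.
print(len(bigrams(ood)), len(bigrams(expanded)))
```

In the paper's setting the paraphrase models are learnt from data rather than hand-written, and the expanded text feeds a multi-level LM, but the coverage effect is the same: paraphrased out-of-domain data realizes in-domain word contexts that direct modelling would miss.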
Journal ISSN
1990-9772