Cross-domain paraphrasing for improving language modelling using out-of-domain data


Type

Conference Object

Authors

Liu, X 
Gales, MJF 
Woodland, PC 

Abstract

In natural languages, variability in the underlying linguistic generation rules significantly alters the observed surface word sequences they create, and thus introduces a mismatch against other data generated via alternative realizations associated with, for example, a different domain. Hence, directly modelling out-of-domain data can result in poor generalization to the in-domain data of interest. To address this problem, this paper investigates the use of cross-domain paraphrastic language models to improve in-domain language modelling (LM) using out-of-domain data. Phrase-level paraphrase models learnt from each domain were used to generate paraphrase variants for the data of other domains. These variants were used both to improve the context coverage of the in-domain data and to reduce the domain mismatch of the out-of-domain data. A significant error rate reduction of 0.6% absolute was obtained on a state-of-the-art conversational telephone speech recognition task using a cross-domain paraphrastic multi-level LM trained on a billion words of mixed conversational and broadcast news data. Consistent improvements in the context coverage of the in-domain data were also obtained.
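
A purely illustrative sketch of the core idea follows (not the authors' implementation). Phrase-level paraphrase expansion substitutes matching phrases in an out-of-domain sentence to produce surface variants closer to the target domain. The toy phrase table, its entries, and the paraphrase_variants helper below are all hypothetical; the paper learns weighted phrase-level paraphrase models from each domain's data.

    # Illustrative sketch only: the phrase table and helper are hypothetical,
    # not the paper's actual paraphrase models.
    PARAPHRASES = {
        ("you", "know"): [("you", "see"), ("i", "mean")],
        ("a", "lot", "of"): [("lots", "of"), ("many",)],
    }

    def paraphrase_variants(sentence, table=PARAPHRASES):
        # Yield surface variants of `sentence` by substituting one matching
        # phrase at a time (a single-substitution approximation).
        words = tuple(sentence.split())
        yield sentence  # always keep the original word sequence
        for i in range(len(words)):
            for span in (2, 3):  # phrase lengths present in the toy table
                for alt in table.get(words[i:i + span], []):
                    yield " ".join(words[:i] + alt + words[i + span:])

    # Out-of-domain (e.g. broadcast news) text would be expanded this way,
    # with variants weighted by paraphrase probabilities, before being used
    # to train or interpolate an n-gram LM for the conversational domain.
    for v in paraphrase_variants("there are a lot of issues you know"):
        print(v)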

Description

Keywords

Journal Title

Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH

Conference Name

INTERSPEECH

Journal ISSN

2308-457X
1990-9772

Volume Title

Publisher

ISCA

Publisher DOI

Sponsorship

The research leading to these results was supported by EPSRC Programme Grant EP/I031022/1 (Natural Speech Technology) and by DARPA under the Broad Operational Language Translation (BOLT) program.