
Improving Cascaded Systems in Spoken Language Processing


Type

Thesis

Authors

Lu, Yiting 

Abstract

Spoken language processing encompasses a broad range of speech production and perception tasks. One of the central challenges in building spoken language systems is the lack of end-to-end training corpora. For example, in spoken language translation there is little annotated data that directly transcribes speech into a foreign language. Therefore, a cascaded structure is widely adopted in spoken language processing. This breaks the complex task down into simpler modules, so that individual modules can be trained with a sufficient amount of data from their associated domains. However, this simplified cascaded structure suffers from several issues. The upstream and downstream modules are usually connected via an intermediate variable, which does not always encapsulate all the information needed for downstream processing. For example, speech transcriptions cannot convey prosodic information, so any downstream task operating on transcripts has no access to speech prosody. The cascaded structure also forces early decisions to be made at the upstream modules, and early-stage errors can propagate through and degrade the downstream modules. Furthermore, individual modules in the cascaded system are often trained on their corresponding domains, which can differ from the target domain of the spoken language task. This mismatch between training and evaluation domains causes performance degradation at the inference stage. The focus of this thesis is therefore to investigate multimodular integration approaches that address the issues facing the simple cascaded structure, and to improve spoken language processing tasks under limited end-to-end data.

The contributions of this thesis are three-fold. The first contribution is to describe the general concept of multimodular combination. The scoring criteria are modified to enable assessment of both individual modules and complete systems, and approaches are explored to improve the vanilla cascaded structure. Three categories of spoken language systems are considered, with an increasing level of module integration: cascaded, integrated and end-to-end systems. Cascaded systems train individual modules in their corresponding domains and do not require any end-to-end corpora. Integrated systems propagate richer information across modular connections and require a small amount of end-to-end data to adapt to the target domain. End-to-end systems drop the notion of modules and require a large amount of end-to-end data to reach convergence. More tightly integrated systems generally require larger amounts of end-to-end training data. Given this trade-off between modelling power and data efficiency, different approaches are discussed that aim to strike a balance between the two. The second contribution of this thesis is to propose a general reranking framework for multimodular systems, addressing both the error propagation and information loss issues. Rerankers are commonly used for single-module sequence generation tasks, such as speech recognition and machine translation. In this work, rerankers are applied to multimodular systems, where they directly access the hypothesis space of the intermediate variables at the modular connection. Taking multiple hypotheses at the modular connection into account leads to a richer information flow across modules, and consequently helps reduce error propagation. The third contribution of this thesis is to propose the embedding passing approach. The idea is to extract continuous feature representations of the upstream context and use them as the modular connection.
The embedding connection allows richer information propagation as well as gradient backpropagation across modules, thus enabling joint optimisation of the multimodular system. Among the wide range of possible spoken language tasks, this thesis considers three example tasks with an increasing level of complexity: spoken disfluency detection (SDD), spoken language translation (SLT) and spoken grammatical error correction (SGEC). Spontaneous speech often comes with disfluencies, such as filled pauses, repetitions and false starts. As an important pre-processing step for many spoken language systems, SDD removes speech disfluencies and recovers a fluent transcription flow for downstream text processing tasks. SLT converts speech inputs into foreign text outputs, and is commonly adopted for automatic video subtitling as well as simultaneous interpreting. It is a challenging application that brings together automatic speech recognition (ASR) and neural machine translation (NMT), both of which are complex sequence-to-sequence tasks. With growing global demand for learning a second language, SGEC has become increasingly important for giving feedback on the grammatical structure of spoken language. SGEC converts non-native disfluent speech into grammatically correct fluent text, and the main challenge is to operate under extremely limited end-to-end data. The SDD and SLT systems are evaluated on the publicly available Switchboard and MuST-C datasets respectively, and the SGEC system is evaluated on a proprietary LIN corpus. The experiments demonstrate that the simple cascaded structure gives reasonable baselines for spoken language tasks, and that the proposed reranking and embedding passing approaches are both effective in propagating richer information and mitigating error propagation under limited end-to-end training corpora.
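The reranking idea summarised in the abstract can be sketched in a few lines. The hypothesis format, scores and interpolation weight below are purely illustrative (they are not the thesis's actual models or data): the point is that the reranker scores every hypothesis at the modular connection with a combination of upstream and downstream evidence, instead of committing to the upstream 1-best.

```python
# Minimal sketch of reranking an upstream module's N-best list.
# All scores, weights and hypotheses here are illustrative toy values.

def rerank(nbest, downstream_score, weight=0.5):
    """Pick the hypothesis maximising a weighted sum of the upstream score
    (e.g. an ASR log-probability) and a downstream model's score."""
    best = max(
        nbest,
        key=lambda h: (1 - weight) * h["upstream_score"]
                      + weight * downstream_score(h["text"]),
    )
    return best["text"]

# Toy downstream scorer: penalises hypotheses containing the filler "uh",
# standing in for a downstream model that prefers fluent inputs.
def toy_downstream_score(text):
    return -text.split().count("uh")

nbest = [
    {"text": "i uh want to go", "upstream_score": -1.0},  # upstream 1-best
    {"text": "i want to go", "upstream_score": -1.2},     # slightly worse upstream score
]

print(rerank(nbest, toy_downstream_score))  # the fluent hypothesis wins overall
```

Because the downstream score is consulted before the connection variable is fixed, an upstream error in the 1-best hypothesis need not propagate downstream.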
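The contrast between a discrete modular connection and an embedding connection can likewise be illustrated with a toy sketch. The vocabulary, embeddings and posteriors below are invented for illustration; in the thesis the embeddings would be learned neural representations through which gradients can flow.

```python
# Toy contrast between a discrete connection (pass only the 1-best token)
# and a continuous embedding connection (posterior-weighted embedding mix).
# Vocabulary, embeddings and posteriors are illustrative, not real model output.

TOKEN_EMBEDDINGS = {"yes": [1.0, 0.0], "no": [0.0, 1.0]}

def discrete_connection(posteriors):
    """Early hard decision: forward only the embedding of the 1-best token,
    discarding the upstream module's uncertainty."""
    best = max(posteriors, key=posteriors.get)
    return TOKEN_EMBEDDINGS[best]

def embedding_connection(posteriors):
    """Soft connection: forward a posterior-weighted mix of token embeddings,
    preserving upstream uncertainty (and, in a neural network, gradients)."""
    dim = len(next(iter(TOKEN_EMBEDDINGS.values())))
    mixed = [0.0] * dim
    for token, prob in posteriors.items():
        for i, value in enumerate(TOKEN_EMBEDDINGS[token]):
            mixed[i] += prob * value
    return mixed

posteriors = {"yes": 0.55, "no": 0.45}  # a nearly ambiguous upstream decision

print(discrete_connection(posteriors))   # [1.0, 0.0] -- uncertainty discarded
print(embedding_connection(posteriors))  # [0.55, 0.45] -- uncertainty retained
```

The soft connection is also differentiable with respect to the upstream posteriors, which is what makes joint optimisation of the cascaded modules possible.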

Date

2022-09-30

Advisors

Gales, Mark

Keywords

Cascaded systems, Language assessment, Speech, Spoken language processing

Qualification

Doctor of Philosophy (PhD)

Awarding Institution

University of Cambridge