Improving Attention-based Sequence-to-sequence Models
Attention-based models have achieved state-of-the-art performance in various sequence-to-sequence tasks, including Neural Machine Translation (NMT), Automatic Speech Recognition (ASR) and speech synthesis, also known as Text-To-Speech (TTS). These models are often autoregressive, which yields high modeling capacity but also makes training difficult. The standard training approach, teacher forcing, suffers from exposure bias: during training the model is guided with the reference output history, but at inference it must use its own generated output history. To address this issue, scheduled sampling and professor forcing guide a model with both the reference and the generated output history. To facilitate convergence, they depend on a heuristic schedule or an auxiliary classifier, respectively, which can be difficult to tune. Alternatively, sequence-level training approaches guide the model with the generated output history and optimize a sequence-level criterion. However, many tasks, such as TTS, do not have a well-established sequence-level criterion. In addition, the generation process is often sequential, which is undesirable for parallelizable models such as the Transformer.
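As an illustration, the mechanics of scheduled sampling can be sketched as follows. This is a minimal sketch, not the thesis's implementation: the function name, the token-level mixing, and the `generate_step` callback (mapping the history so far to the model's next prediction) are all assumptions for illustration.

```python
import random

def scheduled_sampling_inputs(ref_tokens, generate_step, teacher_prob):
    """Build the decoder input history for one training example.

    At each step, with probability `teacher_prob` feed the reference token
    (teacher forcing); otherwise feed the model's own prediction, exposing
    the model to its generated history. In practice `teacher_prob` is
    decayed over training according to a heuristic schedule.
    """
    history = []
    for ref in ref_tokens:
        if random.random() < teacher_prob:
            history.append(ref)          # guided by the reference output
        else:
            history.append(generate_step(history))  # guided by own output
    return history
```

With `teacher_prob=1.0` this reduces to standard teacher forcing; with `teacher_prob=0.0` it reduces to free running, and the schedule interpolates between the two.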
This thesis introduces attention forcing and deliberation networks to improve attention-based sequence-to-sequence models. Attention forcing guides a model with the generated output history and the reference attention. The training criterion is a combination of the maximum log-likelihood criterion and the KL-divergence between the reference attention and the generated attention. This approach does not rely on a heuristic schedule or a classifier, and does not require a sequence-level criterion. Variations of attention forcing are proposed for more challenging application scenarios. For tasks such as NMT, the output space is multi-modal in the sense that, given an input, the distribution of the corresponding output can be multi-modal. Hence a selection scheme is introduced to automatically turn attention forcing on and off depending on the mode of attention. For parallelizable models, an approximation scheme is proposed to run attention forcing in parallel across time.
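The attention forcing criterion described above can be sketched as follows. This is a minimal illustration, not the thesis's implementation: the function names, the per-step discrete attention distributions, and the weighting factor `gamma` are assumptions.

```python
import math

def kl_divergence(p, q, eps=1e-9):
    """KL(p || q) between two discrete attention distributions
    over input positions; `eps` avoids log(0)."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))

def attention_forcing_loss(log_probs, ref_attention, gen_attention, gamma=1.0):
    """Combined criterion: negative log-likelihood of the reference output,
    plus the KL-divergence between reference and generated attention at
    each output step, scaled by an assumed weight `gamma`."""
    nll = -sum(log_probs)  # log_probs: per-step log-likelihoods of the reference output
    kl = sum(kl_divergence(p, q) for p, q in zip(ref_attention, gen_attention))
    return nll + gamma * kl
```

When the generated attention matches the reference attention, the KL term vanishes and the criterion reduces to maximum log-likelihood; the model is still conditioned on its own generated output history, so no heuristic schedule is needed.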
Deliberation networks consist of multiple attention-based models. The output is generated in multiple passes, each one conditioned on the initial input and the free-running output of the previous pass. This thesis shows that deliberation networks can address exposure bias, which is essential for performance gains. In addition, various training approaches are discussed, and a separate training approach is proposed for its synergy with parallelizable models. Finally, for tasks where the output space is continuous, such as TTS, deliberation networks tend to ignore the free-running outputs, thus losing their benefits. To address this issue, a guided attention loss is proposed to regularize the corresponding attention, encouraging the use of the free-running outputs.
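One common form of guided attention loss penalises attention mass that falls far from the diagonal, nudging the second pass to actually attend to the first-pass outputs. The following is a minimal sketch under that assumption; the diagonal-band form and the width hyperparameter `g` are illustrative choices, not necessarily the exact loss used in the thesis.

```python
import math

def guided_attention_loss(attention, g=0.2):
    """Penalise attention weights far from the diagonal.

    `attention[t][n]` is the weight that output step t places on position n
    of the previous pass's free-running output; `g` (assumed) controls the
    width of the band around the diagonal where no penalty is applied.
    """
    T, N = len(attention), len(attention[0])
    loss = 0.0
    for t in range(T):
        for n in range(N):
            # Penalty weight grows as (n/N, t/T) moves off the diagonal.
            w = 1.0 - math.exp(-((n / N - t / T) ** 2) / (2 * g * g))
            loss += attention[t][n] * w
    return loss / (T * N)
```

A roughly diagonal attention map incurs near-zero loss, while attention that ignores the free-running outputs (or wanders off the diagonal) is penalised, which is the regularizing effect described above.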
TTS and NMT are investigated as example sequence-to-sequence tasks, and task-specific techniques are proposed, such as neural vocoder adaptation using attention forcing. The experiments demonstrate that attention forcing improves the overall performance and diversity. It is also demonstrated that deliberation networks improve the overall performance and reduce the chances of attention failure.