Efficient Uncertainty Estimation and Sequence Modelling
Abstract
Transformer-based autoregressive sequence models have revolutionised natural language processing and speech processing, achieving state-of-the-art performance on a wide range of tasks. However, their deployment in real-world scenarios, especially safety-critical applications such as autonomous systems or medical diagnosis, demands not only high accuracy but also robust and efficient methods for estimating the uncertainty associated with their predictions. Furthermore, extending these powerful, predominantly text-based models to new modalities such as audio, and addressing the inherent computational challenges posed by the quadratic complexity of self-attention over long sequences, remain significant research problems limiting their deployment in resource-constrained settings.
This thesis investigates various aspects of efficiency within deep sequence modelling, with a primary focus on reliable, efficient uncertainty estimation and sequence representation. The first part addresses the challenge of efficiently capturing predictive uncertainty, often derived from computationally expensive ensembles. We explore Ensemble Distribution Distillation (EDD) techniques to compress the distributional knowledge of an ensemble into a single, compact student model, introducing improved training objectives. We further propose Self-Distribution Distillation (S2D) and its hierarchical extension (H2D) as methods enabling a single model to implicitly capture ensemble-like diversity. Additionally, we introduce Non-Autoregressive Proxy (NAP) models – lightweight networks trained to directly predict sequence-level attributes (e.g., uncertainty, performance metrics) from encoder representations, bypassing the expensive autoregressive decoding process entirely and enabling efficient downstream applications.
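The distillation idea underlying EDD can be made concrete with a toy example: the student predicts the parameters of a Dirichlet distribution over the simplex, trained to assign high likelihood to the categorical predictions of the individual ensemble members. The function name `dirichlet_nll` and the toy ensemble below are illustrative, not taken from the thesis; this is a minimal sketch of the objective, not the full training pipeline.

```python
import math
import numpy as np

def dirichlet_nll(alpha, member_probs):
    """Negative Dirichlet log-likelihood of ensemble member predictions.

    alpha: (C,) student concentration parameters over C classes.
    member_probs: (M, C) categorical predictions of M ensemble members.
    """
    # log normaliser: log Gamma(sum alpha) - sum log Gamma(alpha_c)
    log_norm = math.lgamma(alpha.sum()) - sum(math.lgamma(a) for a in alpha)
    # per-member log density, averaged over the ensemble
    log_lik = log_norm + ((alpha - 1.0) * np.log(member_probs)).sum(axis=1)
    return -log_lik.mean()

# Toy ensemble: three members broadly agreeing on class 0.
members = np.array([[0.80, 0.15, 0.05],
                    [0.70, 0.20, 0.10],
                    [0.75, 0.15, 0.10]])

sharp_alpha = np.array([15.0, 3.0, 2.0])  # concentrated near the ensemble mean
flat_alpha = np.array([1.0, 1.0, 1.0])    # uninformative Dirichlet

assert dirichlet_nll(sharp_alpha, members) < dirichlet_nll(flat_alpha, members)
```

Minimising this loss over many inputs pushes the single student to reproduce both the ensemble mean (the location of the Dirichlet) and the ensemble spread (its concentration), which is what allows one model to stand in for many at inference time.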
The second part extends uncertainty estimation to ranking, specifically the emerging use of Large Language Models (LLMs) for assessing NLG outputs. We introduce a generalised Product-of-Experts (PoE) framework that leverages pairwise LLM judgments, enabling robust ranking even with incomplete comparison data and mitigating the quadratic computational cost of exhaustive pairwise comparison. Within this framework, we derive novel uncertainty metrics that improve both the modelling and the efficiency of ranking NLG outputs, and show how to quantify confidence in the resulting rankings.
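One simple instantiation of a product of experts over pairwise judgments treats each comparison as a Gaussian expert on a score difference, so the combined posterior is maximised by a linear least-squares solve. The sketch below, with the hypothetical helper `poe_scores`, illustrates how a sparse, incomplete set of comparisons still yields a full ranking; the thesis's generalised PoE framework is richer than this toy version.

```python
import numpy as np

def poe_scores(n, comparisons):
    """Least-squares skill estimates from sparse pairwise judgments.

    comparisons: list of (i, j, p) with p = P(item i preferred over item j).
    Each judgment acts as a Gaussian expert on the score difference
    s_i - s_j, centred at logit(p); the product of experts is maximised
    by a linear least-squares solve. Scores are identified up to a shift,
    so the mean score is pinned to zero.
    """
    rows, targets = [], []
    for i, j, p in comparisons:
        row = np.zeros(n)
        row[i], row[j] = 1.0, -1.0
        rows.append(row)
        targets.append(np.log(p / (1.0 - p)))  # logit of the preference
    rows.append(np.ones(n))   # remove the additive ambiguity
    targets.append(0.0)
    scores, *_ = np.linalg.lstsq(np.array(rows), np.array(targets), rcond=None)
    return scores

# Only 2 of the 3 possible comparisons are observed, yet all items are ranked.
scores = poe_scores(3, [(0, 1, 0.8), (1, 2, 0.7)])
assert scores[0] > scores[1] > scores[2]
```

Because every observed comparison simply adds one expert, the same solve handles any subset of the n(n-1)/2 possible pairs, which is what avoids the quadratic cost of comparing everything against everything.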
The final part addresses the computational bottleneck of self-attention over long sequences. We investigate structured recurrent models, proposing Multi-Head Structured State Space Models (MH-SSMs) with a novel inter-head gating mechanism to capture diverse temporal dynamics efficiently. Experimental results on large-scale ASR benchmarks demonstrate that these structured recurrent approaches achieve competitive and state-of-the-art performance while maintaining linear computational complexity, offering a powerful alternative for efficient long-sequence modelling.
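The linear-time property of structured state-space models comes from replacing pairwise attention with a recurrence that does constant work per step. The sketch below shows a diagonal SSM scan and a simplified multi-head combination; the sigmoid gate here is a stand-in illustration, not the inter-head gating mechanism proposed in the thesis, and the function names are invented for this example.

```python
import numpy as np

def diagonal_ssm_scan(x, a, b, c):
    """Linear-time diagonal state-space recurrence over a scalar sequence.

    x: (T,) input sequence; a, b, c: (N,) diagonal state transition,
    input and output projections. h_t = a * h_{t-1} + b * x_t, y_t = c @ h_t.
    """
    h = np.zeros_like(a)
    ys = []
    for x_t in x:            # one state update per step: O(T * N) overall
        h = a * h + b * x_t
        ys.append(c @ h)
    return np.array(ys)

def multi_head_ssm(x, heads, gate_w):
    """Run several diagonal SSM heads and mix them with a sigmoid gate.

    heads: list of (a, b, c) parameter triples, one per head;
    gate_w: (H,) mixing logits. Each head can use different decay rates a,
    letting the heads specialise to different temporal ranges.
    """
    outs = np.stack([diagonal_ssm_scan(x, *h) for h in heads])  # (H, T)
    gates = 1.0 / (1.0 + np.exp(-gate_w))                       # (H,)
    return (gates[:, None] * outs).sum(axis=0)
```

Contrast this with self-attention, which compares every step with every other step and therefore scales quadratically in T; the scan above touches each step once, which is why SSM-style models remain tractable on very long audio sequences.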

