Bayesian approaches to time-frequency inverse problems
Abstract
Although the phenomenon of sound is intrinsically associated with the propagation of acoustic waves through physical media such as gas, liquid or solid, audio signals are better defined and understood by the way their spectral composition evolves over time. Characterising the dynamics of these hidden spectral components—rather than their raw waveform—is crucial in a wide range of practical applications involving sound. However, the mathematical description of auditory phenomena possesses a high degree of ambiguity due to the countless ways a signal can be interpreted and broken down. Hence the adoption of a probabilistic modelling approach capable of handling uncertainty and incorporating prior knowledge becomes essential.
This dissertation studies the characterisation of audio signals as time-frequency dictionary representations and its application to audio reconstruction problems. To address the inherent ambiguity of overcomplete dictionaries, the assumed generative mechanism of the audio waveform is enriched with prior structures that not only serve as a regularisation device but also reflect domain-specific modelling assumptions based on the empirical observation of audio spectra. The resulting models are then treated with the well-established apparatus of Bayesian computation, which can be tailored to the specific problems investigated in this dissertation and leveraged in the formulation of efficient and high-quality audio signal restoration techniques.
A synthesis-based approach is favoured throughout this dissertation. This means that, although some prior assumptions stem from the characterisation of spectral components, regularisation schemes are imposed on the regression coefficients of the dictionary representation (termed synthesis coefficients in the context of audio signal processing) instead of the short-time Fourier transform of the corrupted input signal (analysis coefficients). Hence the restoration techniques presented in this thesis are underpinned by generative models of the raw audio waveform itself rather than its spectrogram. This approach allows for great modelling flexibility and can organically handle data corruption mechanisms occurring in the waveform domain such as missing data entries, non-linear distortion of the amplitude, and stationary and non-stationary observation noise.
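The distinction between the synthesis and analysis viewpoints can be sketched in a few lines of NumPy. The toy dictionary below (complex exponentials) merely stands in for the time-frequency dictionaries used in the dissertation, and all sizes, seeds, and corruption levels are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)
N = 64                                    # waveform length (illustrative)
K = 128                                   # overcomplete: more atoms than samples

# Stand-in time-frequency dictionary: complex exponentials at K frequencies.
freqs = np.arange(K) / K
D = np.exp(2j * np.pi * np.outer(np.arange(N), freqs)) / np.sqrt(N)

# Synthesis view: the waveform is *generated* from sparse coefficients.
c = np.zeros(K, dtype=complex)
c[[5, 40]] = [1.0, 0.5j]                  # two active atoms
x = (D @ c).real                          # clean waveform

# Corruption acts in the waveform domain: missing samples plus noise.
mask = rng.random(N) > 0.2                # False entries are missing samples
y = np.where(mask, x + 0.01 * rng.standard_normal(N), 0.0)

# Analysis view, for contrast: project the corrupted signal onto the atoms.
a = D.conj().T @ y
print(x.shape, a.shape)
```

Regularising `c` (the synthesis coefficients) rather than `a` (the analysis coefficients) is what lets waveform-domain corruption, such as the missing entries in `mask`, enter the model directly through the likelihood.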
The modelling assumptions adopted throughout this dissertation are varied. Chapters 2 to 4 explore a structured sparsity approach based on a hierarchical Bayesian variable selection scheme. This is accomplished through a spike-and-slab prior on the synthesis coefficients, which explicitly models the activation or exclusion of the basis functions of the time-frequency dictionary. Chapter 2 introduces a time-frequency dictionary based on the complex modified discrete cosine transform (CMDCT) and derives a series of efficient Markov chain Monte Carlo algorithms to estimate the synthesis coefficients. Building upon the dictionary and sampling techniques introduced in Chapter 2, Chapter 3 recasts time-frequency dictionary representations as state-space models and formulates a sequential Markov chain Monte Carlo strategy to carry out real-time audio signal restoration. Exploiting a Gaussian mixture spike-and-slab prior, Chapter 4 develops a high-performance expectation–maximisation algorithm for sparse reconstruction of corrupted audio signals.
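As a rough illustration of the variable selection scheme described above, the following sketch draws synthesis coefficients from a Gaussian-mixture spike-and-slab prior; the activation probability and variances are arbitrary choices, not values from the dissertation:

```python
import numpy as np

rng = np.random.default_rng(1)
K = 1000                       # number of dictionary atoms (illustrative)
p = 0.1                        # prior probability that an atom is active
v_spike, v_slab = 1e-6, 1.0    # near-zero "spike" and wide "slab" variances

# Binary indicators: which basis functions are included in the expansion.
gamma = rng.random(K) < p

# Gaussian-mixture spike-and-slab draw: active atoms take the slab variance,
# inactive atoms collapse to the near-zero spike.
sigma = np.where(gamma, np.sqrt(v_slab), np.sqrt(v_spike))
c = sigma * rng.standard_normal(K)

active_energy = np.sum(c[gamma] ** 2)
total_energy = np.sum(c ** 2)
print(gamma.mean(), active_energy / total_energy)
```

In the MCMC and EM algorithms of Chapters 2 to 4 the indicators `gamma` are inferred jointly with the coefficients from the corrupted signal, rather than sampled from the prior as here.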
Chapters 5 and 6 mark a departure from the sparse approximation paradigm, investigating instead the potential of low-rank latent structures in time-frequency inverse problems. Following a synthesis-oriented approach, the waveform production mechanism adopted in these chapters still relies on the expansion of a time-frequency dictionary, but the regression coefficients are instead governed by a low-rank latent variance structure. The resulting generative mechanism is termed the low-rank time-frequency synthesis (LRTFS) model and is tested on a series of audio inverse problems. Chapter 5 proposes a computationally feasible expectation–maximisation scheme to compute the latent parameters of the LRTFS model and examines its application in simultaneous reconstruction and source separation tasks. In turn, Chapter 6 focuses on the reconstruction of clipped audio signals, for which an alternating minimisation strategy is presented.
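The low-rank variance structure can be sketched as follows: the coefficient variances across frequency and time are constrained to a nonnegative rank-`R` factorisation, in the spirit of nonnegative matrix factorisation models of the power spectrogram. The dimensions, rank, and gamma-distributed factors below are illustrative assumptions only:

```python
import numpy as np

rng = np.random.default_rng(2)
F, N, R = 32, 100, 2    # frequency bins, time frames, latent rank (illustrative)

# Low-rank nonnegative variance structure V = W H:
# W holds spectral templates, H their temporal activations.
W = rng.gamma(2.0, 1.0, size=(F, R))
H = rng.gamma(2.0, 1.0, size=(R, N))
V = W @ H

# Synthesis coefficients: zero-mean Gaussians whose variances follow V,
# so the time-frequency energy pattern inherits the low-rank structure.
C = np.sqrt(V) * rng.standard_normal((F, N))

print(np.linalg.matrix_rank(V), C.shape)
```

Because each column of `V` is a nonnegative combination of the same `R` spectral templates, thresholding or partitioning the latent factors also yields a natural route to the source separation application explored in Chapter 5.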
