Advances in approximate inference: combining VI and MCMC and improving on Stein discrepancy
Machine learning, including deep learning, has become an indispensable part of many intelligent systems, automating decisions that previously required human judgement. For certain applications (e.g. health care services), reliable predictions and trustworthy decisions are crucial when deploying a machine learning system; in other words, machine learning models should be able to reason under uncertainty. Bayesian inference, built on the probabilistic framework, offers a principled way to incorporate uncertainty into the decision-making process.
The difficulty of applying Bayesian inference in practice is rooted in the intractability of the posterior distribution. Approximate inference offers a workaround by constructing a tractable estimate of the posterior. The performance of Bayesian inference, especially in Bayesian deep learning, crucially depends on the accuracy and scalability of the chosen approximate inference algorithm. In particular, variational inference (VI) and Markov chain Monte Carlo (MCMC) are the two major families of techniques, each with its own merits and limitations.
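To recall where the intractability arises, consider Bayes' rule for model parameters $\theta$ given data $\mathcal{D}$:

```latex
\underbrace{p(\theta \mid \mathcal{D})}_{\text{posterior}}
  = \frac{p(\mathcal{D} \mid \theta)\, p(\theta)}{p(\mathcal{D})},
\qquad
p(\mathcal{D}) = \int p(\mathcal{D} \mid \theta)\, p(\theta)\, \mathrm{d}\theta .
```

For flexible models such as Bayesian neural networks, the evidence $p(\mathcal{D})$ is a high-dimensional integral with no closed form, which is what forces us to approximate the posterior rather than compute it exactly.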
In the first part of the thesis, we aim to design efficient approximate inference algorithms by combining VI and MCMC (in particular, stochastic gradient MCMC (SG-MCMC)). The first proposed algorithm, called partial amortised inference, leverages SG-MCMC to improve the accuracy of VI. The uncertainty quantification it provides allows us to solve a practical problem: how to train a VAE-like generative model with insufficient training data. The second algorithm, named Meta-SGMCMC, aims at improving the efficiency of SG-MCMC by automating the design of its dynamics through meta-learning and VI.

In the second part of the thesis, we shift our focus to a promising alternative: Stein discrepancy, which greatly expands the choice of approximating distributions compared to the Kullback-Leibler (KL) divergence. We aim to improve on its scalable variant, kernelized Stein discrepancy, by addressing its well-known curse-of-dimensionality problem. Inspired by the 'slicing' idea, we propose a new discrepancy family, called sliced kernelized Stein discrepancy, that remains robust as the dimension grows, along with two theoretically verified downstream applications.
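To make the kernelized Stein discrepancy concrete, the following is a minimal NumPy sketch (not the thesis implementation) of its standard V-statistic estimator with an RBF kernel. The function name `ksd_rbf`, the fixed bandwidth, and the standard Gaussian target are illustrative assumptions; the key property shown is that the estimate only needs samples from the approximation and the score function of the target, not the target's normalizing constant.

```python
import numpy as np

def ksd_rbf(x, score, h=1.0):
    """V-statistic estimate of the kernelized Stein discrepancy between
    samples x (shape (n, d)) and a target with score function `score`,
    using an RBF kernel k(x, y) = exp(-||x - y||^2 / (2 h^2))."""
    n, d = x.shape
    s = score(x)                                  # (n, d) target score at samples
    diff = x[:, None, :] - x[None, :, :]          # (n, n, d) pairwise x_i - x_j
    sqdist = np.sum(diff ** 2, axis=-1)           # (n, n) squared distances
    k = np.exp(-sqdist / (2 * h ** 2))            # kernel matrix
    grad_x_k = -diff / h ** 2 * k[..., None]      # ∇_x k(x_i, x_j)
    grad_y_k = -grad_x_k                          # ∇_y k(x_i, x_j)
    trace_term = k * (d / h ** 2 - sqdist / h ** 4)
    # Stein kernel u_p(x_i, x_j), summed term by term
    u = (s @ s.T) * k \
        + np.einsum('id,ijd->ij', s, grad_y_k) \
        + np.einsum('jd,ijd->ij', s, grad_x_k) \
        + trace_term
    return u.mean()

rng = np.random.default_rng(0)
score = lambda x: -x                              # score of a standard Gaussian target
near = ksd_rbf(rng.standard_normal((200, 2)), score)        # samples match the target
far = ksd_rbf(rng.standard_normal((200, 2)) + 3.0, score)   # mean-shifted samples
# `near` should be close to zero, `far` clearly larger
```

The curse of dimensionality mentioned above refers to this estimator's discriminating power decaying as `d` grows with a fixed kernel; the sliced variant addresses this by comparing score projections along random one-dimensional directions.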