Automatic Model Construction with Gaussian Processes David Kristjanson Duvenaud University of Cambridge This dissertation is submitted for the degree of Doctor of Philosophy Pembroke College June 2014 Declaration I hereby declare that except where specific reference is made to the work of others, the contents of this dissertation are original and have not been submitted in whole or in part for consideration for any other degree or qualification in this, or any other University. This dissertation is the result of my own work and includes nothing which is the outcome of work done in collaboration, except where specifically indicated in the text. This dissertation contains less than 60,000 words and has less than 150 figures. David Kristjanson Duvenaud June 2014 Acknowledgements First, I would like to thank my supervisor, Carl Rasmussen, for so much advice and encouragement. It was wonderful working with someone who has spent many years thinking deeply about the business of modeling. Carl was patient while I spent the first few months of my PhD chasing half-baked ideas, and then gently suggested a series of ideas which actually worked. I’d also like to thank my advisor, Zoubin Ghahramani, for providing much encour- agement, support, feedback and advice, and for being an example. I’m also grateful for Zoubin’s efforts to constantly populate the lab with interesting visitors. Thanks to Michael Osborne, whose thesis was an inspiration early on, and to Ro- man Garnett for being a sane third party. Sinead Williamson, Peter Orbanz and John Cunningham gave me valuable advice when I was still bewildered. Ferenc Huszár and Dave Knowles made it clear that constantly asking questions was the right way to go. Tom Dean and Greg Corrado made me feel at home at Google. Philipp Hennig was a mentor to me during my time in Tübingen, and demonstrated an inspiring level of effectiveness. Who else finishes their conference papers with entire days to spare? I’d also like to thank Ryan Adams for hosting me during my visit to Harvard, for our enjoyable collaborations, and for building such an awesome group. Josh Tenenbaum’s broad perspective and enthusiasm made the projects we worked on very rewarding. The time I got to spend with Roger Grosse was mind-expanding – he constantly surprised me by pointing out basic unsolved questions about decades-old methods, and had extremely high standards for his own work. I’d like to thank Andrew McHutchon, Konstantina Palla, Alex Davies, Neil Houlsby, Koa Heaukulani, Miguel Hernández-Lobato, Yue Wu, Ryan Turner, Roger Frigola, Sae Franklin, David Lopez-Paz, Mark van der Wilk and Rich Turner for making the lab feel like a family. I’d like to thank James Lloyd for many endless and enjoyable discussions, and for keeping a level head even during the depths of deadline death-marches. Christian Steinruecken showed me that amazing things are hidden all around. Thanks to Mel for reminding me that I needed to write a thesis, for supporting me iv during the final push, and for her love. My graduate study was supported by the National Sciences and Engineering Research Council of Canada, the Cambridge Commonwealth Trust, Pembroke College, a grant from the Engineering and Physical Sciences Research Council, and a grant from Google. Abstract This thesis develops a method for automatically constructing, visualizing and describ- ing a large class of models, useful for forecasting and finding structure in domains such as time series, geological formations, and physical dynamics. 
These models, based on Gaussian processes, can capture many types of statistical structure, such as periodicity, changepoints, additivity, and symmetries. Such structure can be encoded through ker- nels, which have historically been hand-chosen by experts. We show how to automate this task, creating a system that explores an open-ended space of models and reports the structures discovered. To automatically construct Gaussian process models, we search over sums and prod- ucts of kernels, maximizing the approximate marginal likelihood. We show how any model in this class can be automatically decomposed into qualitatively different parts, and how each component can be visualized and described through text. We combine these results into a procedure that, given a dataset, automatically constructs a model along with a detailed report containing plots and generated text that illustrate the structure discovered in the data. The introductory chapters contain a tutorial showing how to express many types of structure through kernels, and how adding and multiplying different kernels combines their properties. Examples also show how symmetric kernels can produce priors over topological manifolds such as cylinders, toruses, and Möbius strips, as well as their higher-dimensional generalizations. This thesis also explores several extensions to Gaussian process models. First, build- ing on existing work that relates Gaussian processes and neural nets, we analyze natural extensions of these models to deep kernels and deep Gaussian processes. Second, we ex- amine additive Gaussian processes, showing their relation to the regularization method of dropout. Third, we combine Gaussian processes with the Dirichlet process to produce the warped mixture model: a Bayesian clustering model having nonparametric cluster shapes, and a corresponding latent space in which each cluster has an interpretable parametric form. Contents List of Figures x List of Tables xii Notation xiii 1 Introduction 1 1.1 Gaussian process models . . . . . . . . . . . . . . . . . . . . . . . . . . . 1 1.1.1 Model selection . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3 1.1.2 Prediction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3 1.1.3 Useful properties of Gaussian processes . . . . . . . . . . . . . . . 4 1.1.4 Limitations of Gaussian processes . . . . . . . . . . . . . . . . . . 5 1.2 Outline and contributions of thesis . . . . . . . . . . . . . . . . . . . . . 6 2 Expressing Structure with Kernels 8 2.1 Definition . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8 2.2 A few basic kernels . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9 2.3 Combining kernels . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10 2.3.1 Notation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10 2.3.2 Combining properties through multiplication . . . . . . . . . . . . 11 2.3.3 Building multi-dimensional models . . . . . . . . . . . . . . . . . 12 2.4 Modeling sums of functions . . . . . . . . . . . . . . . . . . . . . . . . . 13 2.4.1 Modeling noise . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14 2.4.2 Additivity across multiple dimensions . . . . . . . . . . . . . . . . 14 2.4.3 Extrapolation through additivity . . . . . . . . . . . . . . . . . . 15 2.4.4 Example: An additive model of concrete strength . . . . . . . . . 16 2.4.5 Posterior variance of additive components . . . . . . . . . . . . . 18 2.5 Changepoints . . . . . . . . . . . . . . . . . . . . . . . . 
. . . . . . . . . 20 Contents vii 2.5.1 Multiplication by a known function . . . . . . . . . . . . . . . . . 21 2.6 Feature representation of kernels . . . . . . . . . . . . . . . . . . . . . . . 21 2.6.1 Relation to linear regression . . . . . . . . . . . . . . . . . . . . . 21 2.6.2 Feature-space view of combining kernels . . . . . . . . . . . . . . 22 2.7 Expressing symmetries and invariances . . . . . . . . . . . . . . . . . . . 22 2.7.1 Three recipes for invariant priors . . . . . . . . . . . . . . . . . . 23 2.7.2 Example: Periodicity . . . . . . . . . . . . . . . . . . . . . . . . . 25 2.7.3 Example: Symmetry about zero . . . . . . . . . . . . . . . . . . . 25 2.7.4 Example: Translation invariance in images . . . . . . . . . . . . . 26 2.8 Generating topological manifolds . . . . . . . . . . . . . . . . . . . . . . 26 2.8.1 Möbius strips . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 28 2.9 Kernels on categorical variables . . . . . . . . . . . . . . . . . . . . . . . 29 2.10 Multiple outputs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 29 2.11 Building a kernel in practice . . . . . . . . . . . . . . . . . . . . . . . . . 30 3 Automatic Model Construction 31 3.1 Ingredients of an automatic statistician . . . . . . . . . . . . . . . . . . . 32 3.2 A language of regression models . . . . . . . . . . . . . . . . . . . . . . . 33 3.3 A model search procedure . . . . . . . . . . . . . . . . . . . . . . . . . . 34 3.4 A model comparison procedure . . . . . . . . . . . . . . . . . . . . . . . 37 3.5 A model description procedure . . . . . . . . . . . . . . . . . . . . . . . . 37 3.6 Structure discovery in time series . . . . . . . . . . . . . . . . . . . . . . 38 3.6.1 Mauna Loa atmospheric CO2 . . . . . . . . . . . . . . . . . . . . 38 3.6.2 Airline passenger counts . . . . . . . . . . . . . . . . . . . . . . . 39 3.7 Related work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 40 3.8 Experiments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 44 3.8.1 Interpretability versus accuracy . . . . . . . . . . . . . . . . . . . 44 3.8.2 Predictive accuracy on time series . . . . . . . . . . . . . . . . . . 44 3.8.3 Multi-dimensional prediction . . . . . . . . . . . . . . . . . . . . . 46 3.8.4 Structure recovery on synthetic data . . . . . . . . . . . . . . . . 46 3.9 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 47 4 Automatic Model Description 48 4.1 Generating descriptions of composite kernels . . . . . . . . . . . . . . . . 49 4.1.1 Simplification rules . . . . . . . . . . . . . . . . . . . . . . . . . . 49 4.1.2 Describing each part of a product of kernels . . . . . . . . . . . . 50 viii Contents 4.1.3 Combining descriptions into noun phrases . . . . . . . . . . . . . 51 4.1.4 Worked example . . . . . . . . . . . . . . . . . . . . . . . . . . . 53 4.2 Example descriptions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 53 4.2.1 Summarizing 400 years of solar activity . . . . . . . . . . . . . . . 53 4.2.2 Describing changing noise levels . . . . . . . . . . . . . . . . . . . 56 4.3 Related work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 57 4.4 Limitations of this approach . . . . . . . . . . . . . . . . . . . . . . . . . 57 4.5 Conclusions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 58 5 Deep Gaussian Processes 59 5.1 Relating deep neural networks to deep GPs . . . . . . . . . . . . . . . . . 60 5.1.1 Definition of deep GPs . . . . . 
. . . . . . . . . . . . . . . . . . . 60 5.1.2 Single-hidden-layer models . . . . . . . . . . . . . . . . . . . . . . 60 5.1.3 Multiple hidden layers . . . . . . . . . . . . . . . . . . . . . . . . 62 5.1.4 Two network architectures equivalent to deep GPs . . . . . . . . . 63 5.2 Characterizing deep Gaussian process priors . . . . . . . . . . . . . . . . 64 5.2.1 One-dimensional asymptotics . . . . . . . . . . . . . . . . . . . . 64 5.2.2 Distribution of the Jacobian . . . . . . . . . . . . . . . . . . . . . 66 5.3 Formalizing a pathology . . . . . . . . . . . . . . . . . . . . . . . . . . . 67 5.4 Fixing the pathology . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 69 5.5 Deep kernels . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 75 5.5.1 Infinitely deep kernels . . . . . . . . . . . . . . . . . . . . . . . . 76 5.5.2 When are deep kernels useful models? . . . . . . . . . . . . . . . . 77 5.6 Related work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 77 5.7 Conclusions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 80 6 Additive Gaussian Processes 81 6.1 Different types of multivariate additive structure . . . . . . . . . . . . . . 82 6.2 Defining additive kernels . . . . . . . . . . . . . . . . . . . . . . . . . . . 83 6.2.1 Weighting different orders of interaction . . . . . . . . . . . . . . 83 6.2.2 Efficiently evaluating additive kernels . . . . . . . . . . . . . . . . 84 6.3 Additive models allow non-local interactions . . . . . . . . . . . . . . . . 86 6.4 Dropout in Gaussian processes . . . . . . . . . . . . . . . . . . . . . . . . 87 6.4.1 Dropout on infinitely-wide hidden layers has no effect . . . . . . . 87 6.4.2 Dropout on inputs gives additive covariance . . . . . . . . . . . . 88 6.5 Related work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 89 Contents ix 6.6 Regression and classification experiments . . . . . . . . . . . . . . . . . . 91 6.6.1 Datasets . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 92 6.6.2 Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 93 6.7 Conclusions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 95 7 Warped Mixture Models 96 7.1 The Gaussian process latent variable model . . . . . . . . . . . . . . . . 96 7.2 The infinite warped mixture model . . . . . . . . . . . . . . . . . . . . . 98 7.3 Inference . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 99 7.4 Related work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 100 7.5 Experimental results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 102 7.5.1 Synthetic datasets . . . . . . . . . . . . . . . . . . . . . . . . . . 102 7.5.2 Clustering face images . . . . . . . . . . . . . . . . . . . . . . . . 104 7.5.3 Density estimation . . . . . . . . . . . . . . . . . . . . . . . . . . 105 7.5.4 Mixing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 105 7.5.5 Visualization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 106 7.5.6 Clustering performance . . . . . . . . . . . . . . . . . . . . . . . . 107 7.5.7 Density estimation . . . . . . . . . . . . . . . . . . . . . . . . . . 108 7.6 Conclusions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 108 7.7 Future work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 109 8 Discussion 111 8.1 Summary of contributions . . . . . . . . . . . . . . . . . . . . . . . . . . 
111 8.2 Structured versus unstructured models . . . . . . . . . . . . . . . . . . . 112 8.3 Approaches to automating model construction . . . . . . . . . . . . . . . 113 8.4 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 114 Appendix A Gaussian Conditionals 115 Appendix B Kernel Definitions 116 Appendix C Search Operators 118 Appendix D Example Automatically Generated Report 120 Appendix E Inference in the Warped Mixture Model 128 References 133 List of Figures 1.1 A one-dimensional Gaussian process posterior . . . . . . . . . . . . . . . 2 2.1 Examples of structures expressible by some basic kernels . . . . . . . . . 9 2.2 Examples of structures expressible by multiplying kernels . . . . . . . . . 11 2.3 A product of squared-exponential kernels across different dimensions . . . 12 2.4 Examples of structures expressible by adding kernels . . . . . . . . . . . 13 2.5 Additive kernels correspond to additive functions . . . . . . . . . . . . . 15 2.6 Extrapolation in functions with additive structure . . . . . . . . . . . . . 16 2.7 Decomposition of posterior into interpretable one-dimensional functions . 17 2.8 Visualizing posterior correlations between components . . . . . . . . . . . 19 2.9 Draws from changepoint priors . . . . . . . . . . . . . . . . . . . . . . . . 20 2.10 Three ways to introduce symmetry . . . . . . . . . . . . . . . . . . . . . 24 2.11 Generating 2D manifolds with different topological structures . . . . . . 27 2.12 Generating Möbius strips . . . . . . . . . . . . . . . . . . . . . . . . . . . 28 3.1 Another set of basic kernels . . . . . . . . . . . . . . . . . . . . . . . . . 33 3.2 A search tree over kernels . . . . . . . . . . . . . . . . . . . . . . . . . . 35 3.3 Progression of models as the search depth increases . . . . . . . . . . . . 36 3.4 Decomposition of the Mauna Loa model . . . . . . . . . . . . . . . . . . 39 3.5 Decomposition of airline dataset model . . . . . . . . . . . . . . . . . . . 40 3.6 Extrapolation error of all methods on 13 time-series datasets . . . . . . . 45 4.1 Solar irradiance dataset . . . . . . . . . . . . . . . . . . . . . . . . . . . 54 4.2 Automatically-generated description of the solar irradiance data set . . . 54 4.3 A component corresponding to the Maunder minimum . . . . . . . . . . 55 4.4 ABCD isolating part of the signal explained by a slowly-varying trend . . 55 4.5 Automatic description of the solar cycle . . . . . . . . . . . . . . . . . . . 55 4.6 Short descriptions of the four components of the airline model . . . . . . 56 List of Figures xi 4.7 Describing non-stationary periodicity in the airline data . . . . . . . . . . 56 4.8 Describing time-changing variance in the airline dataset . . . . . . . . . . 57 5.1 Neural network architectures giving rise to GPs . . . . . . . . . . . . . . 62 5.2 Neural network architectures giving rise to deep GPs . . . . . . . . . . . 63 5.3 A one-dimensional draw from a deep GP prior . . . . . . . . . . . . . . . 65 5.4 Desirable properties of representations of manifolds . . . . . . . . . . . . 68 5.5 Distribution of singular values of the Jacobian of a deep GP . . . . . . . 69 5.6 Points warped by a draw from a deep GP . . . . . . . . . . . . . . . . . . 70 5.7 Visualization of a feature map drawn from a deep GP . . . . . . . . . . . 71 5.8 Two different architectures for deep neural networks . . . . . . . . . . . . 
72 5.9 A draw from a 1D deep GP prior with each layer connected to the input 72 5.10 Points warped by a draw from an input-connected deep GP . . . . . . . . 73 5.11 Distribution of singular values of an input-connected deep GP . . . . . . 73 5.12 Feature map of an input-connected deep GP . . . . . . . . . . . . . . . . 74 5.13 Infinitely deep kernels . . . . . . . . . . . . . . . . . . . . . . . . . . . . 76 6.1 Isocontours of additive kernels in 3 dimensions . . . . . . . . . . . . . . . 86 6.2 A comparison of different additive model classes . . . . . . . . . . . . . . 90 7.1 One-dimensional Gaussian process latent variable model . . . . . . . . . 97 7.2 Two-dimensional Gaussian process latent variable model . . . . . . . . . 98 7.3 A draw from the infinite warped mixture model prior . . . . . . . . . . . 99 7.4 Recovering clusters on synthetic data . . . . . . . . . . . . . . . . . . . . 103 7.5 Latent clusters of face images . . . . . . . . . . . . . . . . . . . . . . . . 104 7.6 Comparing density estimates of the GP-LVM and the iWMM . . . . . . 105 7.7 Visualization of the behavior of a sampler for the iWMM . . . . . . . . . 106 7.8 Comparison of latent coordinate estimates . . . . . . . . . . . . . . . . . 106 List of Tables 3.1 Common regression models expressible in the kernel language . . . . . . 34 3.2 Kernels chosen on synthetic data . . . . . . . . . . . . . . . . . . . . . . 46 4.1 Descriptions of the effect of each kernel, written as a post-modifier . . . . 51 4.2 Noun phrase descriptions of each type of kernel . . . . . . . . . . . . . . 51 6.1 Relative variance contributed by each order of the additive model . . . . 84 6.2 Regression dataset statistics . . . . . . . . . . . . . . . . . . . . . . . . . 92 6.3 Classification dataset statistics . . . . . . . . . . . . . . . . . . . . . . . . 93 6.4 Comparison of predictive error on regression problems . . . . . . . . . . . 94 6.5 Comparison of predictive likelihood on regression problems . . . . . . . . 94 6.6 Comparison of predictive error on classification problems . . . . . . . . . 94 6.7 Comparison of predictive likelihood on classification problems . . . . . . 94 7.1 Datasets used for evaluation of the iWMM . . . . . . . . . . . . . . . . . 107 7.2 Clustering performance comparison . . . . . . . . . . . . . . . . . . . . . 107 7.3 Predictive likelihood comparison . . . . . . . . . . . . . . . . . . . . . . . 108 Notation Unbolded x represents a single number, boldface x represents a vector, and capital boldface X represents a matrix. An individual element of a vector is denoted with a subscript and without boldface. For example, the ith element of a vector x is xi. A bold lower-case letter with an index such as xj represents a particular row of matrix X. Symbol Description SE The squared-exponential kernel, also known as the radial-basis function (RBF) kernel, the Gaussian kernel, or the exponentiated quadratic. RQ The rational-quadratic kernel. Per The periodic kernel. Lin The linear kernel. WN The white-noise kernel. C The constant kernel. σ The changepoint kernel, σ(x, x′) = σ(x)σ(x′), where σ(x) is a sig- moidal function such as the logistic function. ka + kb Addition of kernels, shorthand for ka(x,x′) + kb(x,x′) ka × kb Multiplication of kernels, shorthand for ka(x,x′)× kb(x,x′) k(X,X) The Gram matrix, whose i, jth element is k(xi,xj). K Shorthand for the Gram matrix k(X,X) f(X) A vector of function values, whose ith element is given by f(xi). mod(i, j) The modulo operator, giving the remainder after dividing i by j. 
O(·) The big-O asymptotic complexity of an algorithm. Y:,d the dth column of matrix Y. Precise definitions of all kernels listed here are given in appendix B. Chapter 1 Introduction “All models are wrong, but yours are stupid too.” @ML_Hipster (2013) Prediction, extrapolation, and induction are all examples of learning a function from data. There are many ways to learn functions, but one particularly elegant way is by probabilistic inference. Probabilistic inference takes a group of hypotheses (a model), and weights those hypotheses based on how well their predictions match the data. This approach is appealing for two reasons. First, keeping all hypotheses that match the data helps to guard against over-fitting. Second, comparing how well a dataset is fit by different models gives a way of finding which sorts of structure are present in that data. This thesis focuses on constructing models of functions. Chapter 2 describes how to model functions having many different types of structure, such as additivity, symmetry, periodicity, changepoints, or combinations of these, using Gaussian processes (GPs). Chapters 3 and 4 show how such models can be automatically constructed from data, and then automatically described. Later chapters explore several extensions of these models. This short chapter introduces the basic properties of GPs, and provides an outline of the thesis. 1.1 Gaussian process models Gaussian processes are a simple and general class of models of functions. To be pre- cise, a GP is any distribution over functions such that any finite set of function values f(x1), f(x2), . . . f(xN) have a joint Gaussian distribution (Rasmussen and Williams, 2006, chapter 2). A GP model, before conditioning on data, is completely specified by 2 Introduction f(x) f(x) x x Figure 1.1: A visual representation of a Gaussian process modeling a one-dimensional function. Different shades of red correspond to deciles of the predictive density at each input location. Coloured lines show samples from the process – examples of some of the hypotheses included in the model. Top left: A GP not conditioned on any datapoints. Remaining plots: The posterior after conditioning on different amounts of data. All plots have the same axes. its mean function, E [f(x)] = µ(x) (1.1) and its covariance function, also called the kernel: Cov [f(x), f(x′)] = k(x,x′) (1.2) It is common practice to assume that the mean function is simply zero everywhere, since uncertainty about the mean function can be taken into account by adding an extra term to the kernel. After accounting for the mean, the kind of structure that can be captured by a GP model is entirely determined by its kernel. The kernel determines how the model generalizes, or extrapolates to new data. 1.1 Gaussian process models 3 There are many possible choices of covariance function, and we can specify a wide range of models just by specifying the kernel of a GP. For example, linear regression, splines, and Kalman filters are all examples of GPs with particular kernels. However, these are just a few familiar examples out of a wide range of possibilities. One of the main difficulties in using GPs is constructing a kernel which represents the particular structure present in the data being modelled. 1.1.1 Model selection The crucial property of GPs that allows us to automatically construct models is that we can compute the marginal likelihood of a dataset given a particular model, also known as the evidence (MacKay, 1992). 
The marginal likelihood allows one to compare models, balancing between the capacity of a model and its fit to the data (MacKay, 2003; Rasmussen and Ghahramani, 2001). The marginal likelihood under a GP prior of a set of function values [f(x1), f(x2), . . . , f(xN)] := f(X) at locations X is given by:

\[
p\big(\mathbf{f}(X) \mid X, \mu(\cdot), k(\cdot,\cdot)\big)
 = \mathcal{N}\big(\mathbf{f}(X) \mid \mu(X),\, k(X,X)\big) \tag{1.3}
\]
\[
 = (2\pi)^{-\frac{N}{2}}
 \times \underbrace{\big|k(X,X)\big|^{-\frac{1}{2}}}_{\text{controls model capacity}}
 \times \underbrace{\exp\Big\{-\tfrac{1}{2}\big(\mathbf{f}(X)-\mu(X)\big)^{\mathsf T}\, k(X,X)^{-1}\, \big(\mathbf{f}(X)-\mu(X)\big)\Big\}}_{\text{encourages fit with the data}}
\]

This multivariate Gaussian density is referred to as the marginal likelihood because it implicitly integrates (marginalizes) over all possible function values f(X̄), where X̄ is the set of all locations where we have not observed the function.

1.1.2 Prediction

We can ask the model which function values are likely to occur at any location, given the observations seen so far. By the formula for Gaussian conditionals (given in appendix A), the predictive distribution of a function value f(x⋆) at a test point x⋆ has the form:

\[
p\big(f(\mathbf{x}_\star) \mid \mathbf{f}(X), X, \mu(\cdot), k(\cdot,\cdot)\big)
 = \mathcal{N}\Big( f(\mathbf{x}_\star) \,\Big|\,
 \underbrace{\mu(\mathbf{x}_\star) + k(\mathbf{x}_\star, X)\, k(X,X)^{-1} \big(\mathbf{f}(X)-\mu(X)\big)}_{\text{predictive mean follows observations}},\;
 \underbrace{k(\mathbf{x}_\star,\mathbf{x}_\star) - k(\mathbf{x}_\star,X)\, k(X,X)^{-1}\, k(X,\mathbf{x}_\star)}_{\text{predictive variance shrinks given more data}}
 \Big) \tag{1.4}
\]

These expressions may look complex, but only require a few matrix operations to evaluate. Sampling a function from a GP is also straightforward: a sample from a GP at a finite set of locations is just a single draw from a multivariate Gaussian distribution, given by equation (1.4). Figure 1.1 shows prior and posterior samples from a GP, as well as contours of the predictive density. Our use of probabilities does not mean that we are assuming the function being learned is stochastic or random in any way; it is simply a consistent method of keeping track of uncertainty.

1.1.3 Useful properties of Gaussian processes

There are several reasons why GPs in particular are well-suited for building a language of regression models:

• Analytic inference. Given a kernel function and some observations, the predictive posterior distribution can be computed exactly in closed form. This is a rare property for nonparametric models to have.

• Expressivity. Through the choice of covariance function, we can express a wide range of modeling assumptions. Some examples will be shown in chapter 2.

• Integration over hypotheses. The fact that a GP posterior, given a fixed kernel, lets us integrate exactly over a wide range of hypotheses means that overfitting is less of an issue than in comparable model classes. For example, compared to neural networks, relatively few parameters need to be estimated, which lessens the need for complex optimization or regularization schemes.

• Model selection. A side benefit of being able to integrate over all hypotheses is that we can compute the marginal likelihood of the data given a model. This gives us a principled way of comparing different models.

• Closed-form predictive distribution. The predictive distribution of a GP at a set of test points is simply a multivariate Gaussian distribution. This means that GPs can easily be composed with other models or decision procedures.

• Easy to analyze. It may seem unsatisfying to restrict ourselves to a limited model class, as opposed to trying to do inference in the set of all computable functions. However, simple models can be used as well-understood building blocks for constructing more interesting models. For example, consider linear models.
Although they form an extremely limited model class, they are simple, easy to analyze, and easy to incorporate into other models or procedures. Gaussian processes can be seen as an extension of linear models which retain these attractive properties (Rasmussen and Williams, 2006, chapter 2). 1.1.4 Limitations of Gaussian processes There are several issues which make GPs sometimes difficult to use: • Slow inference. Computing the matrix inverse in equations (1.3) and (1.4) takes O(N3) time, making exact inference prohibitively slow for more than a few thou- sand datapoints. However, this problem can be addressed by approximate inference schemes (Hensman et al., 2013; Quiñonero-Candela and Rasmussen, 2005; Snelson and Ghahramani, 2006). • Light tails of the predictive distribution. The predictive distribution of a standard GP model is Gaussian. We may sometimes with to use non-Gaussian predictive likelihoods, for example in order to be robust to outliers, or to perform classification. Using non-Gaussian likelihoods requires approximate inference. For- tunately, mature software packages exist (Hensman et al., 2014b; Rasmussen and Nickisch, 2010; Vanhatalo et al., 2013) which can automatically perform approxi- mate inference for a wide variety of non-Gaussian likelihoods, and also implement sparse approximations. • The need to choose a kernel. The flexibility of GP models raises the question of which kernel to use for a given problem. Choosing a useful kernel is equivalent to learning a useful representation of the input. Kernel parameters can be set 6 Introduction automatically by maximizing the marginal likelihood, but until recently, human experts were required to choose the parametric form of the kernel. Chapter 3 will show a way in which kernels can be automatically constructed for a given dataset. 1.2 Outline and contributions of thesis The main contribution of this thesis is to develop a method to automatically model, visualize, and describe a variety of statistical structures in data, by searching through an open-ended language of regression models. This thesis also includes a set of related results showing how Gaussian processes can be extended or composed with other models. Chapter 2 is a tutorial showing how to build a wide variety of structured models of functions by constructing appropriate covariance functions. We will also show how GPs can produce nonparametric models of manifolds with diverse topological structures, such as cylinders, toruses and Möbius strips. Chapter 3 shows how to search over an open-ended language of models, built by adding and multiplying different kernels. Since we can evaluate each model by the marginal likelihood, we can automatically construct custom models for each dataset by a straightforward search procedure. We will show how the nature of GPs allow the resulting models to be visualized by decomposing them into diverse, interpretable components, each capturing a different type of structure. Our experiments show that capturing such high-level structure sometimes allows one to extrapolate beyond the range of the data. One benefit of using a compositional model class is that the resulting models are relatively interpretable. Chapter 4 demonstrates a system which automatically describes the structure implied by a given kernel on a given dataset, generating reports with graphs and English-language text describing the resulting model. 
Combined with the automatic model search developed in chapter 3, this system represents the beginnings of what could be called an “automatic statistician”, capable of some aspects of model-building and explanation currently performed by experts. Chapter 5 analyzes deep neural network models by characterizing the prior over functions obtained by composing GP priors to form deep Gaussian processes. We show that, as the number of layers increase, the amount of information retained about the original input diminishes to a single degree of freedom. A simple change to the network architecture fixes this pathology. We relate these models to neural networks, and as a side effect derive several forms of infinitely deep kernels. 1.2 Outline and contributions of thesis 7 Chapter 6 examines a more limited, but much faster way of discovering structure using GPs. Specifying a kernel having many different types of structure, we use kernel parameters to discard whichever types of structure are not found in the current dataset. The particular model class we examine is called additive Gaussian processes, a model summing over exponentially-many GPs, each depending on a different subset of the input variables. We give a polynomial-time inference algorithm for this model, and relate it to other model classes. For example, additive GPs are shown to have the same covariance as a GP that uses dropout, a recently developed regularization technique for neural networks. Chapter 7 develops a Bayesian clustering model in which the clusters have nonpara- metric shapes, called the infinite warped mixture model. The density manifolds learned by this model follow the contours of the data density, and have interpretable, parametric forms in the latent space. The marginal likelihood lets us infer the effective dimension and shape of each cluster separately, as well as the number of clusters. Chapter 2 Expressing Structure with Kernels This chapter shows how to use kernels to build models of functions with many different kinds of structure: additivity, symmetry, periodicity, interactions between variables, and changepoints. We also show several ways to encode group invariants into kernels. Combining a few simple kernels through addition and multiplication will give us a rich, open-ended language of models. The properties of kernels discussed in this chapter are mostly known in the literature. The original contribution of this chapter is to gather them into a coherent whole and to offer a tutorial showing the implications of different kernel choices, and some of the structures which can be obtained by combining them. 2.1 Definition A kernel (also called a covariance function, kernel function, or covariance kernel), is a positive-definite function of two inputs x,x′. In this chapter, x and x′ are usually vectors in a Euclidean space, but kernels can also be defined on graphs, images, discrete or categorical inputs, or even text. Gaussian process models use a kernel to define the prior covariance between any two function values: Cov [f(x), f(x′)] = k(x,x′) (2.1) Colloquially, kernels are often said to specify the similarity between two objects. This is slightly misleading in this context, since what is actually being specified is the similarity between two values of a function evaluated on each object. The kernel specifies which 2.2 A few basic kernels 9 functions are likely under the GP prior, which in turn determines the generalization properties of the model. 
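To make equations (1.4) and (2.1) concrete, the following minimal sketch shows how a kernel determines both the prior over functions and the posterior after conditioning on data. It assumes NumPy; the squared-exponential kernel used here is defined in the next section, and the observation locations, values, and parameter settings are illustrative stand-ins rather than anything taken from this chapter.

```python
import numpy as np

def se_kernel(a, b, ell=1.0, sf=1.0):
    """Squared-exponential kernel between two sets of 1-D inputs (see section 2.2)."""
    return sf**2 * np.exp(-0.5 * (a[:, None] - b[None, :])**2 / ell**2)

rng = np.random.default_rng(0)

# Prior: Cov[f(x), f(x')] = k(x, x'), as in equation (2.1).
xs = np.linspace(-5.0, 5.0, 200)                      # query locations x*
K_ss = se_kernel(xs, xs)
jitter = 1e-8 * np.eye(len(xs))                       # numerical stabilizer only
prior_draws = rng.multivariate_normal(np.zeros(len(xs)), K_ss + jitter, size=3)

# Posterior: condition on a few observed function values using equation (1.4),
# with the mean function taken to be zero.
X = np.array([-4.0, -1.5, 0.0, 2.0])                  # illustrative inputs
y = np.array([-1.0, 0.5, 1.0, -0.5])                  # illustrative observations
K   = se_kernel(X, X) + 1e-8 * np.eye(len(X))
K_s = se_kernel(X, xs)
alpha = np.linalg.solve(K, y)
post_mean = K_s.T @ alpha                             # follows the observations
post_cov  = K_ss - K_s.T @ np.linalg.solve(K, K_s)    # shrinks given more data
```

Everything reduces to a handful of dense linear-algebra operations, which is why the marginal likelihood and predictive distribution are cheap to evaluate for small datasets.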
2.2 A few basic kernels

To begin understanding the types of structures expressible by GPs, we will start by briefly examining the priors on functions encoded by some commonly used kernels: the squared-exponential (SE), periodic (Per), and linear (Lin) kernels. These kernels are defined in figure 2.1:

\[
\mathrm{SE}(x, x') = \sigma_f^2 \exp\!\left(-\frac{(x - x')^2}{2\ell^2}\right), \qquad
\mathrm{Per}(x, x') = \sigma_f^2 \exp\!\left(-\frac{2}{\ell^2}\sin^2\!\left(\pi\,\frac{x - x'}{p}\right)\right), \qquad
\mathrm{Lin}(x, x') = \sigma_f^2 (x - c)(x' - c)
\]

Figure 2.1: Examples of structures expressible by some basic kernels, showing each kernel, functions f(x) sampled from the corresponding GP prior, and the type of structure produced: local variation (SE), repeating structure (Per), and linear functions (Lin). Each covariance function corresponds to a different set of assumptions made about the function we wish to model. For example, using a squared-exp (SE) kernel implies that the function we are modeling has infinitely many derivatives. There exist many variants of “local” kernels similar to the SE kernel, each encoding slightly different assumptions about the smoothness of the function being modeled.

Kernel parameters

Each kernel has a number of parameters which specify the precise shape of the covariance function. These are sometimes referred to as hyper-parameters, since they can be viewed as specifying a distribution over function parameters, instead of being parameters which specify a function directly. An example would be the lengthscale parameter ℓ of the SE kernel, which specifies the width of the kernel and thereby the smoothness of the functions in the model.

Stationary and Non-stationary

The SE and Per kernels are stationary, meaning that their value only depends on the difference x − x′. This implies that the probability of observing a particular dataset remains the same even if we move all the x values by the same amount. In contrast, the linear kernel (Lin) is non-stationary, meaning that the corresponding GP model will produce different predictions if the data were moved while the kernel parameters were kept fixed.

2.3 Combining kernels

What if the kind of structure we need is not expressed by any known kernel? For many types of structure, it is possible to build a “made to order” kernel with the desired properties. The next few sections of this chapter will explore ways in which kernels can be combined to create new ones with different properties. This will allow us to include as much high-level structure as necessary into our models.

2.3.1 Notation

Below, we will focus on two ways of combining kernels: addition and multiplication. We will often write these operations in shorthand, without arguments:

\[
k_a + k_b = k_a(\mathbf{x}, \mathbf{x}') + k_b(\mathbf{x}, \mathbf{x}') \tag{2.2}
\]
\[
k_a \times k_b = k_a(\mathbf{x}, \mathbf{x}') \times k_b(\mathbf{x}, \mathbf{x}') \tag{2.3}
\]

All of the basic kernels we considered in section 2.2 are one-dimensional, but kernels over multi-dimensional inputs can be constructed by adding and multiplying between kernels on different dimensions. The dimension on which a kernel operates is denoted by a subscripted integer. For example, SE2 represents an SE kernel over the second dimension of vector x. To remove clutter, we will usually refer to kernels without specifying their parameters.

Figure 2.2: Examples of one-dimensional structures expressible by multiplying kernels: Lin×Lin gives quadratic functions, SE×Per gives locally periodic structure, Lin×SE gives increasing variation, and Lin×Per gives growing amplitude. Plots have same meaning as in figure 2.1.
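Since the plots of figures 2.1 and 2.2 cannot be reproduced here, the sketch below writes the three base kernels and the shorthand of equations (2.2) and (2.3) as ordinary functions, and builds some of the products shown in figure 2.2. It assumes NumPy, and the parameter defaults and helper names are arbitrary choices for illustration.

```python
import numpy as np

def se(x, xp, ell=1.0, sf=1.0):
    """Squared-exponential kernel."""
    return sf**2 * np.exp(-0.5 * (x - xp)**2 / ell**2)

def per(x, xp, ell=1.0, p=1.0, sf=1.0):
    """Periodic kernel."""
    return sf**2 * np.exp(-2.0 * np.sin(np.pi * (x - xp) / p)**2 / ell**2)

def lin(x, xp, c=0.0, sf=1.0):
    """Linear kernel."""
    return sf**2 * (x - c) * (xp - c)

# Addition and multiplication of kernels, as in equations (2.2) and (2.3):
add = lambda ka, kb: (lambda x, xp: ka(x, xp) + kb(x, xp))
mul = lambda ka, kb: (lambda x, xp: ka(x, xp) * kb(x, xp))

quadratic         = mul(lin, lin)   # Lin x Lin: prior on quadratic functions
locally_periodic  = mul(se, per)    # SE x Per: periodic structure that can drift
growing_amplitude = mul(lin, per)   # Lin x Per: periodic, linearly growing amplitude
```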
2.3.2 Combining properties through multiplication Multiplying two positive-definite kernels together always results in another positive- definite kernel. But what properties do these new kernels have? Figure 2.2 shows some kernels obtained by multiplying two basic kernels together. Working with kernels, rather than the parametric form of the function itself, allows us to express high-level properties of functions that do not necessarily have a simple parametric form. Here, we discuss a few examples: • Polynomial Regression. By multiplying together T linear kernels, we obtain a prior on polynomials of degree T . The first column of figure 2.2 shows a quadratic kernel. • Locally Periodic Functions. In univariate data, multiplying a kernel by SE gives a way of converting global structure to local structure. For example, Per corresponds to exactly periodic structure, whereas Per×SE corresponds to locally periodic structure, as shown in the second column of figure 2.2. • Functions with Growing Amplitude. Multiplying by a linear kernel means that the marginal standard deviation of the function being modeled grows linearly away from the location given by kernel parameter c. The third and fourth columns of figure 2.2 show two examples. 12 Expressing Structure with Kernels One can multiply any number of kernels together in this way to produce kernels combining several high-level properties. For example, the kernel SE×Lin×Per specifies a prior on functions which are locally periodic with linearly growing amplitude. We will see a real dataset having this kind of structure in chapter 3. 2.3.3 Building multi-dimensional models A flexible way to model functions having more than one input is to multiply together kernels defined on each individual input. For example, a product of SE kernels over different dimensions, each having a different lengthscale parameter, is called the SE-ARD kernel: SE-ARD(x,x′) = D∏ d=1 σ2d exp ( −12 (xd − x′d)2 ℓ2d ) = σ2f exp ( −12 D∑ d=1 (xd − x′d)2 ℓ2d ) (2.4) Figure 2.3 illustrates the SE-ARD kernel in two dimensions. × = → SE1(x1, x′1) SE2(x2, x′2) SE1×SE2 f(x1, x2) drawn fromGP(0, SE1×SE2) Figure 2.3: A product of two one-dimensional kernels gives rise to a prior on functions which depend on both dimensions. ARD stands for automatic relevance determination, so named because estimating the lengthscale parameters ℓ1, ℓ2, . . . , ℓD, implicitly determines the “relevance” of each dimension. Input dimensions with relatively large lengthscales imply relatively little variation along those dimensions in the function being modeled. SE-ARD kernels are the default kernel in most applications of GPs. This may be partly because they have relatively few parameters to estimate, and because those pa- rameters are relatively interpretable. In addition, there is a theoretical reason to use them: they are universal kernels (Micchelli et al., 2006), capable of learning any contin- uous function given enough data, under some conditions. 2.4 Modeling sums of functions 13 However, this flexibility means that they can sometimes be relatively slow to learn, due to the curse of dimensionality (Bellman, 1956). In general, the more structure we account for, the less data we need - the blessing of abstraction (Goodman et al., 2011) counters the curse of dimensionality. Below, we will investigate ways to encode more structure into kernels. 2.4 Modeling sums of functions An additive function is one which can be expressed as f(x) = fa(x) + fb(x). 
Additivity is a useful modeling assumption in a wide variety of contexts, especially if it allows us to make strong assumptions about the individual components which make up the sum. Restricting the flexibility of component functions often aids in building interpretable models, and sometimes enables extrapolation in high dimensions. Lin+ Per SE+ Per SE+ Lin SE(long) + SE(short) 0 0 0 0 x (with x′ = 1) x− x′ x (with x′ = 1) x− x′ ↓ ↓ ↓ ↓ periodic plus trend periodic plus noise linear plus variation slow & fast variation Figure 2.4: Examples of one-dimensional structures expressible by adding kernels. Rows have the same meaning as in figure 2.1. SE(long) denotes a SE kernel whose lengthscale is long relative to that of SE(short) It is easy to encode additivity into GP models. Suppose functions fa, fb are drawn independently from GP priors: fa ∼ GP(µa, ka) (2.5) fb ∼ GP(µb, kb) (2.6) 14 Expressing Structure with Kernels Then the distribution of the sum of those functions is simply another GP: fa + fb ∼ GP(µa + µb, ka + kb). (2.7) Kernels ka and kb can be of different types, allowing us to model the data as a sum of independent functions, each possibly representing a different type of structure. Any number of components can be summed this way. 2.4.1 Modeling noise Additive noise can be modeled as an unknown, quickly-varying function added to the signal. This structure can be incorporated into a GP model by adding a local kernel such as an SE with a short lengthscale, as in the fourth column of figure 2.4. The limit of the SE kernel as its lengthscale goes to zero is a “white noise” (WN) kernel. Function values drawn from a GP with a WN kernel are independent draws from a Gaussian random variable. Given a kernel containing both signal and noise components, we may wish to isolate only the signal components. Section 2.4.5 shows how to decompose a GP posterior into each of its additive components. In practice, there may not be a clear distinction between signal and noise. For example, section 3.6 contains examples of models having long-term, medium-term, and short-term trends. Which parts we designate as the “signal” sometimes depends on the task at hand. 2.4.2 Additivity across multiple dimensions When modeling functions of multiple dimensions, summing kernels can give rise to addi- tive structure across different dimensions. To be more precise, if the kernels being added together are each functions of only a subset of input dimensions, then the implied prior over functions decomposes in the same way. For example, f(x1, x2) ∼ GP(0, k1(x1, x′1) + k2(x2, x′2)) (2.8) 2.4 Modeling sums of functions 15 + = k1(x1, x′1) k2(x2, x′2) k1(x1, x′1) + k2(x2, x′2) ↓ ↓ ↓ + = f1(x1) ∼ GP (0, k1) f2(x2) ∼ GP (0, k2) f1(x1) + f2(x2) Figure 2.5: A sum of two orthogonal one-dimensional kernels. Top row: An additive kernel is a sum of kernels. Bottom row: A draw from an additive kernel corresponds to a sum of draws from independent GP priors, each having the corresponding kernel. is equivalent to the model f1(x1) ∼ GP(0, k1(x1, x′1)) (2.9) f2(x2) ∼ GP(0, k2(x2, x′2)) (2.10) f(x1, x2) = f1(x1) + f2(x2) . (2.11) Figure 2.5 illustrates a decomposition of this form. Note that the product of two kernels does not have an analogous interpretation as the product of two functions. 2.4.3 Extrapolation through additivity Additive structure sometimes allows us to make predictions far from the training data. 
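As a minimal illustration of the comparison that follows, the additive kernel of equation (2.8) and the corresponding product kernel can be written out directly. This sketch assumes NumPy and a one-dimensional se helper like the one sketched earlier; the function names are illustrative.

```python
import numpy as np

def se(a, b, ell=1.0, sf=1.0):
    """One-dimensional squared-exponential kernel."""
    return sf**2 * np.exp(-0.5 * (a - b)**2 / ell**2)

def additive_kernel(x, xp):
    # k1(x1, x1') + k2(x2, x2'): a prior over sums f1(x1) + f2(x2)
    return se(x[0], xp[0]) + se(x[1], xp[1])

def product_kernel(x, xp):
    # k1(x1, x1') * k2(x2, x2'): allows a different function value
    # for every combination of inputs
    return se(x[0], xp[0]) * se(x[1], xp[1])
```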
Figure 2.6 compares the extrapolations made by additive versus product-kernel GP mod- els, conditioned on data from a sum of two axis-aligned sine functions. The training points were evaluated in a small, L-shaped area. In this example, the additive model is able to correctly predict the height of the function at an unseen combinations of inputs. The product-kernel model is more flexible, and so remains uncertain about the function 16 Expressing Structure with Kernels GP mean using GP mean using True function: sum of SE kernels: product of SE kernels: f(x1, x2) = sin(x1) + sin(x2) k1(x1, x′1) + k2(x2, x′2) k1(x1, x′1)×k2(x2, x′2) Figure 2.6: Left: A function with additive structure. Center: A GP with an additive kernel can extrapolate away from the training data. Right: A GP with a product kernel allows a different function value for every combination of inputs, and so is uncertain about function values away from the training data. This causes the predictions to revert to the mean. away from the data. These types of additive models have been well-explored in the statistics literature. For example, generalized additive models (Hastie and Tibshirani, 1990) have seen wide adoption. In high dimensions, we can also consider sums of functions of multiple input dimensions. Chapter 6 considers this model class in more detail. 2.4.4 Example: An additive model of concrete strength To illustrate how additive kernels give rise to interpretable models, we built an addi- tive model of the strength of concrete as a function of the amount of seven different ingredients (cement, slag, fly ash, water, plasticizer, coarse aggregate and fine aggre- gate), and the age of the concrete (Yeh, 1998). Our simple model is a sum of 8 different one-dimensional functions, each depending on only one of these quantities: f(x) = f1(cement) + f2(slag) + f3(fly ash) + f4(water) + f5(plasticizer) + f6(coarse) + f7(fine) + f8(age) + noise (2.12) where noise iid∼ N (0, σ2n). Each of the functions f1, f2, . . . , f8 was modeled using a GP with an SE kernel. These eight SE kernels plus a white noise kernel were added together as in equation (2.8) to form a single GP model whose kernel had 9 additive components. 2.4 Modeling sums of functions 17 After learning the kernel parameters by maximizing the marginal likelihood of the data, one can visualize the predictive distribution of each component of the model. st re ng th cement (kg/m3) slag (kg/m3) fly ash (kg/m3) st re ng th water (kg/m3) plasticizer (kg/m3) coarse (kg/m3) st re ng th Data Posterior density Posterior samples fine (kg/m3) age (days) Figure 2.7: The predictive distribution of each one-dimensional function in a multi- dimensional additive model. Blue crosses indicate the original data projected on to each dimension, red indicates the marginal posterior density of each function, and colored lines are samples from the marginal posterior distribution of each one-dimensional function. The vertical axis is the same for all plots. Figure 2.7 shows the marginal posterior distribution of each of the eight one-dimensional functions in the model. The parameters controlling the variance of two of the functions, f6(coarse) and f7(fine) were set to zero, meaning that the marginal likelihood preferred a parsimonious model which did not depend on these inputs. This is an example of the automatic sparsity that arises by maximizing marginal likelihood in GP models, and is another example of automatic relevance determination (ARD) (Neal, 1995). 
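A sketch of how the kernel of equation (2.12) might be assembled is given below. It assumes NumPy; the function name, the random stand-in inputs, and the parameter values are illustrative only, and in practice the lengthscales, signal variances and noise variance would be set by maximizing the marginal likelihood as described above.

```python
import numpy as np

def additive_se_gram(X, lengthscales, signal_vars, noise_var):
    """Gram matrix of a sum of one-dimensional SE kernels, one per input
    dimension, plus a white-noise component, as in equation (2.12)."""
    N, D = X.shape
    K = np.zeros((N, N))
    for d in range(D):
        diff = X[:, d:d+1] - X[:, d:d+1].T
        K += signal_vars[d] * np.exp(-0.5 * diff**2 / lengthscales[d]**2)
    return K + noise_var * np.eye(N)

# Stand-in inputs for the eight concrete covariates (ingredients and age).
X = np.random.default_rng(0).uniform(size=(100, 8))
K = additive_se_gram(X, lengthscales=np.ones(8), signal_vars=np.ones(8), noise_var=0.1)
```

Setting one of the signal variances to zero removes the corresponding component from the model, which is how the two irrelevant inputs were effectively pruned in the example above.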
The ability to learn kernel parameters in this way is much more difficult when using non-probabilistic methods such as Support Vector Machines (Cortes and Vapnik, 1995), for which cross-validation is often the best method to select kernel parameters.

2.4.5 Posterior variance of additive components

Here we derive the posterior variance and covariance of all of the additive components of a GP. These formulas allow one to make plots such as figure 2.7.

First, we write down the joint prior distribution over two functions drawn independently from GP priors, and their sum. We distinguish between f(X) (the function values at training locations [x1, x2, . . . , xN]ᵀ := X) and f(X⋆) (the function values at some set of query locations [x⋆1, x⋆2, . . . , x⋆N]ᵀ := X⋆). Formally, if f1 and f2 are a priori independent, and f1 ∼ GP(µ1, k1) and f2 ∼ GP(µ2, k2), then

\[
\begin{bmatrix}
\mathbf{f}_1(X) \\ \mathbf{f}_1(X^\star) \\ \mathbf{f}_2(X) \\ \mathbf{f}_2(X^\star) \\ \mathbf{f}_1(X) + \mathbf{f}_2(X) \\ \mathbf{f}_1(X^\star) + \mathbf{f}_2(X^\star)
\end{bmatrix}
\sim
\mathcal{N}\left(
\begin{bmatrix}
\boldsymbol{\mu}_1 \\ \boldsymbol{\mu}_1^\star \\ \boldsymbol{\mu}_2 \\ \boldsymbol{\mu}_2^\star \\ \boldsymbol{\mu}_1 + \boldsymbol{\mu}_2 \\ \boldsymbol{\mu}_1^\star + \boldsymbol{\mu}_2^\star
\end{bmatrix},
\begin{bmatrix}
\mathbf{K}_1 & \mathbf{K}_1^\star & 0 & 0 & \mathbf{K}_1 & \mathbf{K}_1^\star \\
\mathbf{K}_1^{\star\mathsf{T}} & \mathbf{K}_1^{\star\star} & 0 & 0 & \mathbf{K}_1^{\star\mathsf{T}} & \mathbf{K}_1^{\star\star} \\
0 & 0 & \mathbf{K}_2 & \mathbf{K}_2^\star & \mathbf{K}_2 & \mathbf{K}_2^\star \\
0 & 0 & \mathbf{K}_2^{\star\mathsf{T}} & \mathbf{K}_2^{\star\star} & \mathbf{K}_2^{\star\mathsf{T}} & \mathbf{K}_2^{\star\star} \\
\mathbf{K}_1 & \mathbf{K}_1^\star & \mathbf{K}_2 & \mathbf{K}_2^\star & \mathbf{K}_1 + \mathbf{K}_2 & \mathbf{K}_1^\star + \mathbf{K}_2^\star \\
\mathbf{K}_1^{\star\mathsf{T}} & \mathbf{K}_1^{\star\star} & \mathbf{K}_2^{\star\mathsf{T}} & \mathbf{K}_2^{\star\star} & \mathbf{K}_1^{\star\mathsf{T}} + \mathbf{K}_2^{\star\mathsf{T}} & \mathbf{K}_1^{\star\star} + \mathbf{K}_2^{\star\star}
\end{bmatrix}
\right) \tag{2.13}
\]

where we represent the Gram matrices, whose i, jth entry is given by k(xi, xj), by

\[
\mathbf{K}_i = k_i(X, X) \tag{2.14}
\]
\[
\mathbf{K}_i^\star = k_i(X, X^\star) \tag{2.15}
\]
\[
\mathbf{K}_i^{\star\star} = k_i(X^\star, X^\star) \tag{2.16}
\]

The formula for Gaussian conditionals (A.2) can be used to give the conditional distribution of a GP-distributed function conditioned on its sum with another GP-distributed function:

\[
\mathbf{f}_1(X^\star) \,\Big|\, \mathbf{f}_1(X) + \mathbf{f}_2(X)
\;\sim\; \mathcal{N}\Big(
\boldsymbol{\mu}_1^\star + \mathbf{K}_1^{\star\mathsf{T}} (\mathbf{K}_1 + \mathbf{K}_2)^{-1} \big[\mathbf{f}_1(X) + \mathbf{f}_2(X) - \boldsymbol{\mu}_1 - \boldsymbol{\mu}_2\big],\;
\mathbf{K}_1^{\star\star} - \mathbf{K}_1^{\star\mathsf{T}} (\mathbf{K}_1 + \mathbf{K}_2)^{-1} \mathbf{K}_1^\star
\Big) \tag{2.17}
\]

These formulas express the model’s posterior uncertainty about the different components of the signal, integrating over the possible configurations of the other components. To extend these formulas to a sum of more than two functions, the term K1 + K2 can simply be replaced by ∑i Ki everywhere.

Figure 2.8: Posterior correlations between the heights of different one-dimensional functions in equation (2.12), whose sum models concrete strength. Red indicates high correlation, teal indicates no correlation, and blue indicates negative correlation. Plots on the diagonal show posterior correlations between different evaluations of the same function. Correlations are evaluated over the same input ranges as in figure 2.7. Correlations with f6(coarse) and f7(fine) are not shown, because their estimated variance was zero.

Posterior covariance of additive components

One can also compute the posterior covariance between the height of any two functions, conditioned on their sum:

\[
\operatorname{Cov}\big[ \mathbf{f}_1(X^\star), \mathbf{f}_2(X^\star) \,\big|\, \mathbf{f}(X) \big]
= -\mathbf{K}_1^{\star\mathsf{T}} (\mathbf{K}_1 + \mathbf{K}_2)^{-1} \mathbf{K}_2^\star \tag{2.18}
\]

If this quantity is negative, it means that there is ambiguity about which of the two functions is high or low at that location. For example, figure 2.8 shows the posterior correlation between all non-zero components of the concrete model. This figure shows that most of the correlation occurs within components, but there is also negative correlation between the height of f1(cement) and f2(slag).
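Equations (2.17) and (2.18) translate directly into a few lines of linear algebra. The sketch below assumes NumPy, zero prior means, and Gram matrices named as in equations (2.14) to (2.16); the function names and the jitter term are illustrative additions, not part of the model.

```python
import numpy as np

def component_posterior(K1, K1s, K1ss, K2, f_sum, jitter=1e-8):
    """Posterior over f1 at the query points X*, given noise-free observations
    of f1 + f2 at X (equation (2.17)), assuming zero prior means.
    K1 = k1(X, X), K1s = k1(X, X*), K1ss = k1(X*, X*), K2 = k2(X, X)."""
    K_sum = K1 + K2 + jitter * np.eye(len(f_sum))
    mean = K1s.T @ np.linalg.solve(K_sum, f_sum)
    cov = K1ss - K1s.T @ np.linalg.solve(K_sum, K1s)
    return mean, cov

def component_cross_covariance(K1, K2, K1s, K2s, jitter=1e-8):
    """Posterior covariance between f1(X*) and f2(X*) given their observed sum
    (equation (2.18))."""
    K_sum = K1 + K2 + jitter * np.eye(K1.shape[0])
    return -K1s.T @ np.linalg.solve(K_sum, K2s)
```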
2.5 Changepoints

An example of how combining kernels can give rise to more structured priors is given by changepoint kernels, which can express a change between different types of structure. Changepoint kernels can be defined through addition and multiplication with sigmoidal functions such as σ(x) = 1/(1 + exp(−x)):

\[
\mathrm{CP}(k_1, k_2)(x, x') = \sigma(x)\, k_1(x, x')\, \sigma(x') + \big(1 - \sigma(x)\big)\, k_2(x, x')\, \big(1 - \sigma(x')\big) \tag{2.19}
\]

which can be written in shorthand as

\[
\mathrm{CP}(k_1, k_2) = k_1 \times \boldsymbol{\sigma} + k_2 \times \bar{\boldsymbol{\sigma}} \tag{2.20}
\]

where σ = σ(x)σ(x′) and σ̄ = (1 − σ(x))(1 − σ(x′)). This compound kernel expresses a change from one kernel to another. The parameters of the sigmoid determine where, and how rapidly, this change occurs. Figure 2.9 shows some examples.

Figure 2.9: Draws from different priors using changepoint kernels, constructed by adding and multiplying together base kernels with sigmoidal functions. The examples shown use CP(SE, Per), CP(SE, Per), CP(SE, SE) and CP(Per, Per).

We can also build a model of functions whose structure changes only within some interval – a change-window – by replacing σ(x) with a product of two sigmoids, one increasing and one decreasing.

2.5.1 Multiplication by a known function

More generally, we can model an unknown function that has been multiplied by any fixed, known function a(x), by multiplying the kernel by a(x)a(x′). Formally,

\[
f(\mathbf{x}) = a(\mathbf{x})\, g(\mathbf{x}), \quad g \sim \mathcal{GP}\big(0, k(\mathbf{x}, \mathbf{x}')\big)
\iff
f \sim \mathcal{GP}\big(0, a(\mathbf{x})\, k(\mathbf{x}, \mathbf{x}')\, a(\mathbf{x}')\big). \tag{2.21}
\]

2.6 Feature representation of kernels

By Mercer’s theorem (Mercer, 1909), any positive-definite kernel can be represented as the inner product between a fixed set of features, evaluated at x and at x′:

\[
k(\mathbf{x}, \mathbf{x}') = \mathbf{h}(\mathbf{x})^{\mathsf T}\, \mathbf{h}(\mathbf{x}') \tag{2.22}
\]

For example, the squared-exponential kernel (SE) on the real line has a representation in terms of infinitely many radial-basis functions of the form h_i(x) ∝ exp(−(x − c_i)²/(4ℓ²)). More generally, any stationary kernel can be represented by a set of sines and cosines: a Fourier representation (Bochner, 1959). In general, any particular feature representation of a kernel is not necessarily unique (Minh et al., 2006). In some cases, the input to a kernel, x, can even be the implicit infinite-dimensional feature mapping of another kernel. Composing feature maps in this way leads to deep kernels, which are explored in section 5.5.

2.6.1 Relation to linear regression

Surprisingly, GP regression is equivalent to Bayesian linear regression on the implicit features h(x) which give rise to the kernel:

\[
f(\mathbf{x}) = \mathbf{w}^{\mathsf T} \mathbf{h}(\mathbf{x}), \quad \mathbf{w} \sim \mathcal{N}(\mathbf{0}, \mathbf{I})
\iff
f \sim \mathcal{GP}\big(0,\, \mathbf{h}(\mathbf{x})^{\mathsf T} \mathbf{h}(\mathbf{x}')\big) \tag{2.23}
\]

The link between Gaussian processes, linear regression, and neural networks is explored further in section 5.1.

2.6.2 Feature-space view of combining kernels

We can also view kernel addition and multiplication as a combination of the features of the original kernels. For example, given two kernels

\[
k_a(\mathbf{x}, \mathbf{x}') = \mathbf{a}(\mathbf{x})^{\mathsf T} \mathbf{a}(\mathbf{x}') \tag{2.24}
\]
\[
k_b(\mathbf{x}, \mathbf{x}') = \mathbf{b}(\mathbf{x})^{\mathsf T} \mathbf{b}(\mathbf{x}') \tag{2.25}
\]

their addition has the form:

\[
k_a(\mathbf{x}, \mathbf{x}') + k_b(\mathbf{x}, \mathbf{x}')
= \mathbf{a}(\mathbf{x})^{\mathsf T} \mathbf{a}(\mathbf{x}') + \mathbf{b}(\mathbf{x})^{\mathsf T} \mathbf{b}(\mathbf{x}')
= \begin{bmatrix} \mathbf{a}(\mathbf{x}) \\ \mathbf{b}(\mathbf{x}) \end{bmatrix}^{\mathsf T}
  \begin{bmatrix} \mathbf{a}(\mathbf{x}') \\ \mathbf{b}(\mathbf{x}') \end{bmatrix} \tag{2.26}
\]

meaning that the features of ka + kb are the concatenation of the features of each kernel. We can examine kernel multiplication in a similar way:

\[
k_a(\mathbf{x}, \mathbf{x}') \times k_b(\mathbf{x}, \mathbf{x}')
= \big[\mathbf{a}(\mathbf{x})^{\mathsf T} \mathbf{a}(\mathbf{x}')\big] \times \big[\mathbf{b}(\mathbf{x})^{\mathsf T} \mathbf{b}(\mathbf{x}')\big] \tag{2.27}
\]
\[
= \sum_i a_i(\mathbf{x})\, a_i(\mathbf{x}') \times \sum_j b_j(\mathbf{x})\, b_j(\mathbf{x}') \tag{2.28}
\]
\[
= \sum_{i,j} \big[a_i(\mathbf{x})\, b_j(\mathbf{x})\big]\big[a_i(\mathbf{x}')\, b_j(\mathbf{x}')\big] \tag{2.29}
\]

In words, the features of ka × kb are made up of all pairs of the original two sets of features.
For example, the features of the product of two one-dimensional SE kernels (SE1×SE2) cover the plane with two-dimensional radial-basis functions of the form: hij(x1, x2) ∝ exp ( −12 (x1 − ci)2 2ℓ21 ) exp ( −12 (x2 − cj)2 2ℓ22 ) (2.30) 2.7 Expressing symmetries and invariances When modeling functions, encoding known symmetries can improve predictive accuracy. This section looks at different ways to encode symmetries into a prior on functions. Many types of symmetry can be enforced through operations on the kernel. We will demonstrate the properties of the resulting models by sampling functions from their priors. By using these functions to define smooth mappings from R2 → R3, we will show how to build a nonparametric prior on an open-ended family of topological manifolds, such as cylinders, toruses, and Möbius strips. 2.7 Expressing symmetries and invariances 23 2.7.1 Three recipes for invariant priors Consider the scenario where we have a finite set of transformations of the input space {g1, g2, . . .} to which we wish our function to remain invariant: f(x) = f(g(x)) ∀x ∈ X , ∀g ∈ G (2.31) As an example, imagine we wish to build a model of functions invariant to swapping their inputs: f(x1, x2) = f(x2, x1), ∀x1, x2. Being invariant to a set of operations is equivalent to being invariant to all compositions of those operations, the set of which forms a group. (Armstrong et al., 1988, chapter 21). In our example, the elements of the group Gswap containing all operations the functions are invariant to has two elements: g1([x1, x2]) = [x2, x1] (swap) (2.32) g2([x1, x2]) = [x1, x2] (identity) (2.33) How can we construct a prior on functions which respect these symmetries? Gins- bourger et al. (2012) and Ginsbourger et al. (2013) showed that the only way to construct a GP prior on functions which respect a set of invariances is to construct a kernel which respects the same invariances with respect to each of its two inputs: k(x,x′) = k(g(x), g(x′)), ∀x,x′ ∈ X , ∀g, g′ ∈ G (2.34) Formally, given a finite group G whose elements are operations to which we wish our function to remain invariant, and f ∼ GP(0, k(x,x′)), then every f is invariant under G (up to a modification) if and only if k(·, ·) is argument-wise invariant under G. See Ginsbourger et al. (2013) for details. It might not always be clear how to construct a kernel respecting such argument-wise invariances. Fortunately, there are a few simple ways to do this for any finite group: 1. Sum over the orbit. The orbit of x with respect to a group G is {g(x) : g ∈ G}, the set obtained by applying each element of G to x. Ginsbourger et al. (2012) and Kondor (2008) suggest enforcing invariances through a double sum over the orbits of x and x′ with respect to G: ksum(x,x′) = ∑ g,∈G ∑ g′∈G k(g(x), g′(x′)) (2.35) 24 Expressing Structure with Kernels Additive method Projection method Product method SE(x1, x′1)×SE(x2, x′2) + SE(x1, x′2)×SE(x2, x′1) SE(min(x1, x2),min(x′1, x′2)) ×SE(max(x′1, x′2),max(x′1, x′2)) SE(x1, x′1)×SE(x2, x′2) × SE(x1, x′2)×SE(x2, x′1) Figure 2.10: Functions drawn from three distinct GP priors, each expressing symmetry about the line x1 = x2 using a different type of construction. All three methods introduce a different type of nonstationarity. 
For the group Gswap, this operation results in the kernel: kswitch(x,x′) = ∑ g∈Gswap ∑ g′∈Gswap k(g(x), g′(x′)) (2.36) = k(x1, x2, x′1, x′2) + k(x1, x2, x′2, x′1) + k(x2, x1, x′1, x′2) + k(x2, x1, x′2, x′1) (2.37) For stationary kernels, some pairs of elements in this sum will be identical, and can be ignored. Figure 2.10(left) shows a draw from a GP prior with a product of SE kernels symmetrized in this way. This construction has the property that the marginal variance is doubled near x1 = x2, which may or may not be desirable. 2. Project onto a fundamental domain. Ginsbourger et al. (2013) also explored the possibility of projecting each datapoint into a fundamental domain of the group, using a mapping AG: kproj(x,x′) = k(AG(x), AG(x′)) (2.38) For example, a fundamental domain of the group Gswap is all {x1, x2 : x1 < x2}, a set which can be mapped to using AGswap(x1, x2) = [ min(x1, x2),max(x1, x2) ] . Constructing a kernel using this method introduces a non-differentiable “seam” along x1 = x2, as shown in figure 2.10(center). 2.7 Expressing symmetries and invariances 25 3. Multiply over the orbit. Ryan P. Adams (personal communication) suggested a construction enforcing invariances through a double product over the orbits: ksum(x,x′) = ∏ g∈G ∏ g′∈G k(g(x), g′(x′)) (2.39) This method can sometimes produce GP priors with zero variance in some regions, as in figure 2.10(right). There are often many possible ways to achieve a given symmetry, but we must be careful to do so without compromising other qualities of the model we are constructing. For example, simply setting k(x,x′) = 0 gives rise to a GP prior which obeys all possible symmetries, but this is presumably not a model we wish to use. 2.7.2 Example: Periodicity Periodicity in a one-dimensional function corresponds to the invariance f(x) = f(x+ τ) (2.40) where τ is the period. The most popular method for building a periodic kernel is due to MacKay (1998), who used the projection method in combination with an SE kernel. A fundamental domain of the symmetry group is a circle, so the kernel Per(x, x′) = SE (sin(x), sin(x′))×SE (cos(x), cos(x′)) (2.41) achieves the invariance in equation (2.40). Simple algebra reduces this kernel to the form given in figure 2.1. 2.7.3 Example: Symmetry about zero Another example of an easily-enforceable symmetry is symmetry about zero: f(x) = f(−x). (2.42) This symmetry can be enforced using the sum over orbits method, by the transform kreflect(x, x′) = k(x, x′) + k(x,−x′) + k(−x, x′) + k(−x,−x′). (2.43) 26 Expressing Structure with Kernels 2.7.4 Example: Translation invariance in images Many models of images are invariant to spatial translations (LeCun and Bengio, 1995). Similarly, many models of sounds are also invariant to translation through time. Note that this sort of translation invariance is completely distinct from the station- arity of kernels such as SE or Per. A stationary kernel implies that the prior is invariant to translations of the entire training and test set. In contrast, here we use translation invariance to refer to situations where the signal has been discretized, and each pixel (or the audio equivalent) corresponds to a different input dimension. 
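Before moving on to the translation case, here is a quick numerical check of the constructions above. The sketch below (Python/NumPy; the SE base kernel and the test points are arbitrary choices) verifies argument-wise invariance for the double-sum kernel of equation (2.37) and for the reflection kernel of equation (2.43).

```python
import numpy as np

def se(x, xp, ell=1.0):
    """Squared-exponential kernel between (possibly vector-valued) inputs."""
    x, xp = np.atleast_1d(x), np.atleast_1d(xp)
    return np.exp(-0.5 * np.sum((x - xp) ** 2) / ell ** 2)

# Recipe 1 applied to G_swap: double sum over the orbit (equation 2.37).
def k_switch(x, xp):
    orbit  = [x,  x[::-1]]   # identity and swap applied to x
    orbitp = [xp, xp[::-1]]  # identity and swap applied to x'
    return sum(se(g, gp) for g in orbit for gp in orbitp)

x  = np.array([0.3, -1.2])
xp = np.array([2.0,  0.7])
assert np.isclose(k_switch(x, xp), k_switch(x[::-1], xp))  # invariant in the first argument
assert np.isclose(k_switch(x, xp), k_switch(x, xp[::-1]))  # invariant in the second argument

# Symmetry about zero in one dimension (equation 2.43).
def k_reflect(x, xp):
    return se(x, xp) + se(x, -xp) + se(-x, xp) + se(-x, -xp)

assert np.isclose(k_reflect(0.8, 1.7), k_reflect(-0.8, 1.7))
print("Both kernels are argument-wise invariant.")
```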
We are interested in creating priors on functions that are invariant to swapping pixels in a manner that corresponds to shifting the signal in some direction: f   = f   (2.44) For example, in a one-dimensional image or audio signal, translation of an input vector by i pixels can be defined as shift(x, i) = [ xmod(i+1,D), xmod(i+2,D), . . . , xmod(i+D,D) ]T (2.45) As above, translation invariance in one dimension can be achieved by a double sum over the orbit, given an initial translation-sensitive kernel between signals k: kinvariant (x,x′) = D∑ i=1 D∑ j=1 k(shift(x, i), shift(x, j)) . (2.46) The extension to two dimensions, shift(x, i, j), is straightforward, but notationally cumbersome. Kondor (2008) built a more elaborate kernel between images that was approximately invariant to both translation and rotation, using the projection method. 2.8 Generating topological manifolds In this section we give a geometric illustration of the symmetries encoded by different compositions of kernels. The work presented in this section is based on a collaboration with David Reshef, Roger Grosse, Joshua B. Tenenbaum, and Zoubin Ghahramani. The derivation of the Möbius kernel was my original contribution. Priors on functions obeying invariants can be used to create a prior on topological 2.8 Generating topological manifolds 27 Euclidean (SE1 × SE2) Cylinder (SE1 × Per2) Toroid (Per1 × Per2) Figure 2.11: Generating 2D manifolds with different topologies. By enforcing that the functions mapping from R2 to R3 obey certain symmetries, the surfaces created have corresponding topologies, ignoring self-intersections. manifolds by using such functions to warp a simply-connected surface into a higher- dimensional space. For example, one can build a prior on 2-dimensional manifolds embedded in 3-dimensional space through a prior on mappings from R2 to R3. Such mappings can be constructed using three independent functions [f1(x), f2(x), f3(x)], each mapping from R2 to R. Different GP priors on these functions will implicitly give rise to different priors on warped surfaces. Symmetries in [f1, f2, f3] can connect different parts of the manifolds, giving rise to non-trivial topologies on the sampled surfaces. Figure 2.11 shows 2D meshes warped into 3D by functions drawn from GP priors with various kernels, giving rise to a different topologies. Higher-dimensional analogues of these shapes can be constructed by increasing the latent dimension and including corresponding terms in the kernel. For example, an N -dimensional latent space using kernel Per1×Per2×. . .×PerN will give rise to a prior on manifolds having the topology of N -dimensional toruses, ignoring self-intersections. This construction is similar in spirit to the GP latent variable model (GP-LVM) of Lawrence (2005), which learns a latent embedding of the data into a low-dimensional space, using a GP prior on the mapping from the latent space to the observed space. 28 Expressing Structure with Kernels Draw from GP with kernel: Per(x1, x′1)×Per(x2, x′2) +Per(x1, x′2)×Per(x2, x′1) Möbius strip drawn from R2 → R3 GP prior Sudanese Möbius strip generated parametrically x2 x1 Figure 2.12: Generating Möbius strips. Left: A function drawn from a GP prior obeying the symmetries given by equations (2.47) to (2.49). Center: Simply-connected surfaces mapped from R2 to R3 by functions obeying those symmetries have a topology corre- sponding to a Möbius strip. 
Surfaces generated this way do not have the familiar shape of a flat surface connected to itself with a half-twist. Instead, they tend to look like Sudanese Möbius strips (Lerner and Asimov, 1984), whose edge has a circular shape. Right: A Sudanese projection of a Möbius strip. Image adapted from Wikimedia Com- mons (2005). 2.8.1 Möbius strips A space having the topology of a Möbius strip can be constructed by enforcing invariance to the following operations (Reid and Szendrői, 2005, chapter 7): gp1([x1, x2]) = [x1 + τ, x2] (periodic in x1) (2.47) gp2([x1, x2]) = [x1, x2 + τ ] (periodic in x2) (2.48) gs([x1, x2]) = [x2, x1] (symmetric about x1 = x2) (2.49) Section 2.7 already showed how to build GP priors invariant to each of these types of transformations. We’ll call a kernel which enforces these symmetries a Möbius kernel. An example of such a kernel is: k(x1, x2, x′1, x′2) = Per(x1, x′1)×Per(x2, x′2) + Per(x1, x′2)×Per(x2, x′1) (2.50) Moving along the diagonal x1 = x2 of a function drawn from the corresponding GP prior is equivalent to moving along the edge of a notional Möbius strip which has had that 2.9 Kernels on categorical variables 29 function mapped on to its surface. Figure 2.12(left) shows an example of a function drawn from such a prior. Figure 2.12(center) shows an example of a 2D mesh mapped to 3D by functions drawn from such a prior. This surface doesn’t resemble the typical representation of a Möbius strip, but instead resembles an embedding known as the Sudanese Möbius strip (Lerner and Asimov, 1984), shown in figure 2.12(right). 2.9 Kernels on categorical variables Categorical variables are variables which can take values only from a discrete, unordered set, such as {blue, green, red}. A simple way to construct a kernel over categorical variables is to represent that variable by a set of binary variables, using a one-of-k encoding. For example, if x can take one of four values, x ∈ {A, B, C, D}, then a one-of-k encoding of x will correspond to four binary inputs, and one-of-k(C) = [0, 0, 1, 0]. Given a one-of-k encoding, we can place any multi-dimensional kernel on that space, such as the SE-ARD: kcategorical(x, x′) = SE-ARD(one-of-k(x), one-of-k(x′)) (2.51) Short lengthscales on any particular dimension of the SE-ARD kernel indicate that the function value corresponding to that category is uncorrelated with the others. More flexible parameterizations are also possible (Pinheiro and Bates, 1996). 2.10 Multiple outputs Any GP prior can easily be extended to the model multiple outputs: f1(x), f2(x), . . . , fT (x). This can be done by building a model of a single-output function which has had an ex- tra input added that denotes the index of the output: fi(x) = f(x, i). This can be done by extending the original kernel k(x,x′) to have an extra discrete input dimension: k(x, i,x′, i′). A simple and flexible construction of such a kernel multiplies the original kernel k(x,x′) with a categorical kernel on the output index (Bonilla et al., 2007): k(x, i,x′, i′) = kx(x,x′)×ki(i, i′) (2.52) 30 Expressing Structure with Kernels 2.11 Building a kernel in practice This chapter outlined ways to choose the parametric form of a kernel in order to express different sorts of structure. Once the parametric form has been chosen, one still needs to choose, or integrate over, the kernel parameters. If the kernel relatively few parameters, these parameters can be estimated by maximum marginal likelihood, using gradient- based optimizers. 
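As a minimal, self-contained illustration of this kind of parameter fitting (a sketch, not the GPML code used for the experiments in this thesis), the following Python snippet maximizes the log marginal likelihood of a one-dimensional GP with an SE kernel plus observation noise. The toy data, the log-parameter transformation, and the use of scipy's general-purpose optimizer are all incidental choices.

```python
import numpy as np
from scipy.optimize import minimize

def se_kernel(x, xp, ell, sf):
    d = x[:, None] - xp[None, :]
    return sf**2 * np.exp(-0.5 * (d / ell)**2)

def neg_log_marginal_likelihood(log_params, x, y):
    ell, sf, sn = np.exp(log_params)  # lengthscale, signal std, noise std
    K = se_kernel(x, x, ell, sf) + (sn**2 + 1e-8) * np.eye(len(x))
    L = np.linalg.cholesky(K)
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y))
    # log p(y) = -1/2 y^T K^{-1} y - 1/2 log|K| - n/2 log(2 pi)
    return 0.5 * y @ alpha + np.sum(np.log(np.diag(L))) + 0.5 * len(x) * np.log(2 * np.pi)

# Toy data: a noisy smooth function.
rng = np.random.default_rng(0)
x = np.linspace(0, 10, 50)
y = np.sin(x) + 0.1 * rng.standard_normal(x.size)

result = minimize(neg_log_marginal_likelihood, x0=np.zeros(3), args=(x, y))
print("fitted (lengthscale, signal std, noise std):", np.exp(result.x))
```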
The kernel parameters estimated in sections 2.4.3 and 2.4.4 were optimized using the GPML toolbox (Rasmussen and Nickisch, 2010), available at http://www.gaussianprocess.org/gpml/code. A systematic search over kernel parameters is necessary when appropriate parameters are not known. Similarly, sometimes appropriate kernel structure is hard to guess. The next chapter will show how to perform an automatic search not just over kernel parameters, but also over an open-ended space of kernel expressions. Source code Source code to produce all figures and examples in this chapter is available at http://www.github.com/duvenaud/phd-thesis. Chapter 3 Automatic Model Construction “It would be very nice to have a formal apparatus that gives us some ‘optimal’ way of recognizing unusual phenomena and inventing new classes of hypotheses that are most likely to contain the true one; but this remains an art for the creative human mind.” – E. T. Jaynes (1985) In chapter 2, we saw that the choice of kernel determines the type of structure that can be learned by a GP model, and that a wide variety of models could be constructed by adding and multiplying a few base kernels together. However, we did not answer the difficult question of which kernel to use for a given problem. Even for experts, choosing the kernel in GP regression remains something of a black art. The contribution of this chapter is to show a way to automate the process of building kernels for GP models. We do this by defining an open-ended space of kernels built by adding and multiplying together kernels from a fixed set. We then define a procedure to search over this space to find a kernel which matches the structure in the data. Searching over such a large, structured model class has two main benefits. First, this procedure has good predictive accuracy, since it tries out a large number of different regression models. Second, this procedure can sometimes discover interpretable structure in datasets. Because GP posteriors can be decomposed (as in section 2.4.4), the resulting structures can be examined visually. In chapter 4, we also show how to automatically generate English-language descriptions of the resulting models. This chapter is based on work done in collaboration with James Robert Lloyd, Roger Grosse, Joshua B. Tenenbaum, and Zoubin Ghahramani. It was published in Duvenaud et al. (2013) and Lloyd et al. (2014). Myself, James Lloyd and Roger Grosse jointly de- veloped the idea of searching through a grammar-based language of GP models, inspired 32 Automatic Model Construction by Grosse et al. (2012), and wrote the first versions of the code together. James Lloyd ran most of the experiments, and I constructed most of the figures. 3.1 Ingredients of an automatic statistician Gelman (2013) asks: “How can an artificial intelligence do statistics? . . . It needs not just an inference engine, but also a way to construct new models and a way to check models. Currently, those steps are performed by humans, but the AI would have to do it itself”. This section will discuss the different parts we think are required to build an artificial intelligence that can do statistics. 1. An open-ended language of models. Many learning algorithms consider all models in a class of fixed size. For example, graphical model learning algorithms (Eaton and Murphy, 2007; Friedman and Koller, 2003) search over different con- nectivity graphs for a given set of nodes. 
Such methods can be powerful, but human statisticians are sometimes capable of deriving novel model classes when required. An automatic search through an open-ended class of models can achieve some of this flexibility, possibly combining existing structures in novel ways.

2. A search through model space. Every procedure which eventually considers arbitrarily-complex models must start with relatively simple models before moving on to more complex ones. Thus any search strategy capable of building arbitrarily complex models is likely to resemble an iterative model-building procedure. Just as human researchers iteratively refine their models, search procedures can propose new candidate models based on the results of previous model fits.

3. A model comparison procedure. Search strategies require an objective to optimize. In this work, we use approximate marginal likelihood to compare models, penalizing complexity using the Bayesian Information Criterion as a heuristic. More generally, an automatic statistician needs to somehow check the models it has constructed. Gelman and Shalizi (2012) review the literature on model checking.

4. A model description procedure. Part of the value of statistical models comes from helping humans to understand a dataset or a phenomenon. Furthermore, a clear description of the statistical structure found in a dataset helps a user to notice when the dataset has errors, the wrong question was asked, the model-building procedure failed to capture known structure, a relevant piece of data or constraint is missing, or when a novel statistical structure has been found.

In this chapter, we introduce a system containing simple examples of all the above ingredients. We call this system the automatic Bayesian covariance discovery (ABCD) system. The next four sections of this chapter describe the mechanisms we use to incorporate these four ingredients into a limited example of an artificial intelligence which does statistics.

3.2 A language of regression models

As shown in chapter 2, one can construct a wide variety of kernel structures by adding and multiplying a small number of base kernels. We can therefore define a language of GP regression models simply by specifying a language of kernels.

[Figure panels, one per new base kernel: Rational quadratic (RQ), k(x, x′) = σf²(1 + (x−x′)²/(2αℓ²))^(−α), giving multiscale variation; Cosine (cos), k(x, x′) = σf² cos(2π(x−x′)/p), giving sinusoidal structure; White noise (WN), k(x, x′) = σf² δ(x−x′), giving uncorrelated noise. Each panel shows a plot of the kernel and functions f(x) sampled from the corresponding GP prior.]
Figure 3.1: New base kernels introduced in this chapter, and the types of structure they encode. Other types of kernels can be constructed by adding and multiplying base kernels together.

Our language of models is specified by a set of base kernels which capture different properties of functions, and a set of rules which combine kernels to yield other valid kernels. In this chapter, we will use such base kernels as white noise (WN), constant (C), linear (Lin), squared-exponential (SE), rational-quadratic (RQ), sigmoidal (σ) and periodic (Per). We use a form of Per due to James Lloyd (personal communication) which has its constant component removed, and cos(x−x′) as a special case. Figure 3.1 shows the new kernels introduced in this chapter. For precise definitions of all kernels, see appendix B.
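For concreteness, here are one-dimensional versions of several of these base kernels written as code (Python; standard textbook forms are used, with Per shown in its usual form rather than the constant-removed variant used in the experiments, whose precise definition is in appendix B).

```python
import numpy as np

# One-dimensional base kernels as functions of scalar inputs x, x'.
def WN(x, xp, sf=1.0):                      # white noise (Kronecker version of the delta)
    return sf**2 * float(x == xp)

def C(x, xp, sf=1.0):                       # constant
    return sf**2

def Lin(x, xp, sf=1.0, c=0.0):              # linear
    return sf**2 * (x - c) * (xp - c)

def SE(x, xp, sf=1.0, ell=1.0):             # squared exponential
    return sf**2 * np.exp(-0.5 * ((x - xp) / ell)**2)

def RQ(x, xp, sf=1.0, ell=1.0, alpha=1.0):  # rational quadratic
    return sf**2 * (1 + (x - xp)**2 / (2 * alpha * ell**2))**(-alpha)

def Per(x, xp, sf=1.0, ell=1.0, p=1.0):     # periodic (standard form)
    return sf**2 * np.exp(-2 * np.sin(np.pi * (x - xp) / p)**2 / ell**2)

def Cos(x, xp, sf=1.0, p=1.0):              # cosine
    return sf**2 * np.cos(2 * np.pi * (x - xp) / p)

# Sums and products of kernels are again kernels, e.g. a locally periodic kernel plus noise:
k = lambda x, xp: SE(x, xp, ell=2.0) * Per(x, xp, p=1.0) + WN(x, xp, sf=0.1)
```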
To specify an open-ended language of structured kernels, we consider the set of all kernels that can be built by adding and multiplying these base kernels together, which we write in shorthand by: k1 + k2 = k1(x,x′) + k2(x,x′) (3.1) k1 × k2 = k1(x,x′)× k2(x,x′) (3.2) The space of kernels constructable by adding and multiplying the above set of kernels contains many existing regression models. Table 3.1 lists some of these, which are discussed in more detail in section 3.7. Regression model Kernel structure Example of related work Linear regression C + Lin + WN Polynomial regression C + ∏Lin + WN Semi-parametric Lin + SE + WN Ruppert et al. (2003) Multiple kernel learning ∑ SE + WN Gönen and Alpaydın (2011) Fourier decomposition C + ∑ cos +WN Trend, cyclical, irregular ∑ SE + ∑Per + WN Lind et al. (2006) Sparse spectrum GPs ∑ cos +WN Lázaro-Gredilla et al. (2010) Spectral mixture ∑ SE×cos +WN Wilson and Adams (2013) Changepoints e.g. CP(SE, SE) + WN Garnett et al. (2010) Time-changing variance e.g. SE + Lin×WN Interpretable + flexible ∑d SEd + ∏d SEd Plate (1999) Additive GPs e.g. ∏d(1 + SEd) Chapter 6 Table 3.1: Existing regression models expressible by sums and products of base kernels. cos(·, ·) is a special case of our reparametrized Per(·, ·). 3.3 A model search procedure We explore this open-ended space of regression models using a simple greedy search. At each stage, we choose the highest scoring kernel, and propose modifying it by applying 3.3 A model search procedure 35 No structure SE RQ SE + RQ . . . Per+ RQ SE+ Per+ RQ . . . SE×(Per+ RQ) . . . . . . . . . . . . . . . Per×RQ Lin Per Figure 3.2: An example of a search tree over kernel expressions. Figure 3.3 shows the corresponding model increasing in sophistication as the kernel expression grows. an operation to one of its parts, that combines or replaces that part with another base kernel. The basic operations we can perform on any part k of a kernel are: Replacement: k → k′ Addition: k → (k + k′) Multiplication: k → (k × k′) where k′ is a new base kernel. These operators can generate all possible algebraic expressions involving addition and multiplication of base kernels. To see this, observe that if we restricted the addition and multiplication rules to only apply to base kernels, we would obtain a grammar which generates the set of algebraic expressions. Figure 3.2 shows an example search tree followed by our algorithm. Figure 3.3 shows how the resulting model changes as the search is followed. In practice, we also include extra operators which propose commonly-occurring structures, such as changepoints. A complete list is contained in appendix C. Our search operators have rough parallels with strategies used by human researchers to construct regression models. In particular, • One can look for structure in the residuals of a model, such as periodicity, and then extend the model to capture that structure. This corresponds to adding a new kernel to the existing structure. 36 Automatic Model Construction Level 1: Level 2: Level 3: RQ Per+ RQ SE×(Per+ RQ) RQ 2000 2005 2010 −20 0 20 40 60 ( Per + RQ ) 2000 2005 2010 0 10 20 30 40 SE × ( Per + RQ ) 2000 2005 2010 10 20 30 40 50 Figure 3.3: Posterior mean and variance for different depths of kernel search on the Mauna Loa dataset, described in section 3.6.1. The dashed line marks the end of the dataset. Left: First, the function is only modeled as a locally smooth function, and the extrapolation is poor. Middle: A periodic component is added, and the extrapolation improves. 
Right: At depth 3, the kernel can capture most of the relevant structure, and is able to extrapolate reasonably. • One can start with structure which is assumed to hold globally, such as linear- ity, but find that it only holds locally. This corresponds to multiplying a kernel structure by a local kernel such as SE. • One can incorporate input dimensions incrementally, analogous to algorithms like boosting, back-fitting, or forward selection. This corresponds to adding or multi- plying with kernels on dimensions not yet included in the model. Hyperparameter initialization Unfortunately, optimizing the marginal likelihood over parameters is not a convex op- timization problem, and the space can have many local optima. For example, in data having periodic structure, integer multiples of the true period (harmonics) are often local optima. We take advantage of our search procedure to provide reasonable initializations: all parameters which were part of the previous kernel are initialized to their previous values. Newly introduced parameters are initialized randomly. In the newly proposed kernel, all parameters are then optimized using conjugate gradients. This procedure is not guaranteed to find the global optimum, but it implements the commonly used heuristic of iteratively modeling residuals. 3.4 A model comparison procedure 37 3.4 A model comparison procedure Choosing a kernel requires a method for comparing models. We choose marginal likeli- hood as our criterion, since it balances the fit and complexity of a model (Rasmussen and Ghahramani, 2001). Conditioned on kernel parameters, the marginal likelihood of a GP can be computed analytically by equation (1.3). In addition, if one compares GP models by the maximum likelihood value obtained after optimizing their kernel param- eters, then all else being equal, the model having more free parameters will be chosen. This introduces a bias in favor of more complex models. We could avoid overfitting by integrating the marginal likelihood over all free param- eters, but this integral is difficult to do in general. Instead, we loosely approximate this integral using the Bayesian information criterion (BIC) (Schwarz, 1978): BIC(M) = log p(D |M)− 12 |M | logN (3.3) where p(D|M) is the marginal likelihood of the data evaluated at the optimized kernel parameters, |M | is the number of kernel parameters, and N is the number of data points. BIC simply penalizes the marginal likelihood in proportion to how many parameters the model has. Because BIC is a function of the number of parameters in a model, we did not count kernel parameters known to not affect the model. For example, when two kernels are multiplied, one of their output variance parameters becomes redundant, and can be ignored. The assumptions made by BIC are clearly inappropriate for the model class being considered. For instance, BIC assumes that the data are i.i.d. given the model param- eters, which is not true except under a white noise kernel. Other more sophisticated approximations are possible, such as Laplace’s approximation. We chose to try BIC first because of its simplicity, and it performed reasonably well in our experiments. 3.5 A model description procedure As discussed in chapter 2, a GP whose kernel is a sum of kernels can be viewed as a sum of functions drawn from different GPs. We can always express any kernel structure as a sum of products of kernels by distributing all products of sums. For example, SE×(RQ + Lin) = SE×RQ + SE×Lin . 
(3.4)

When all kernels in a product apply to the same dimension, we can use the formulas in section 2.4.5 to visualize the marginal posterior distribution of that component. This decomposition into additive components provides a method of visualizing GP models which disentangles the different types of structure in the model. The following section shows examples of such decomposition plots. In chapter 4, we will extend this model visualization method to include automatically generated English text explaining the types of structure discovered.

3.6 Structure discovery in time series

To investigate our method's ability to discover structure, we ran the kernel search on several time-series. In the following example, the search was run to depth 10, using SE, RQ, Lin, Per and WN as base kernels.

3.6.1 Mauna Loa atmospheric CO2

First, our method analyzed records of carbon dioxide levels recorded at the Mauna Loa observatory (Tans and Keeling, accessed January 2012). Since this dataset was analyzed in detail by Rasmussen and Williams (2006, chapter 5), we can compare the kernel chosen by our method to a kernel constructed by human experts.

Figure 3.3 shows the posterior mean and variance on this dataset as the search depth increases. While the data can be smoothly interpolated by a model with only a single base kernel, the extrapolations improve dramatically as the increased search depth allows more structure to be included.

Figure 3.4 shows the final model chosen by our method together with its decomposition into additive components. The final model exhibits plausible extrapolation and interpretable components: a long-term trend, annual periodicity, and medium-term deviations. These components have similar structure to the kernel hand-constructed by Rasmussen and Williams (2006, chapter 5):

SE (long-term trend) + SE×Per (yearly periodic) + RQ (medium-term irregularities) + SE + WN (short-term noise) (3.5)

We also plot the residuals modeled by a white noise (WN) component, showing that there is little obvious structure left in the data. More generally, some components capture slowly-changing structure while others capture quickly-varying structure, but often there is no hard distinction between “signal” components and “noise” components.

[Figure panels: Complete model: Lin×SE + SE×(Per + RQ) + WN; Long-term trend: Lin×SE; Yearly periodic: SE×Per; Medium-term deviation: SE×RQ; Noise: WN (residuals).]
Figure 3.4: First row: The full posterior on the Mauna Loa dataset, after a search of depth 10. Subsequent rows: The automatic decomposition of the time series. The model is a sum of long-term, yearly periodic, medium-term components, and residual noise, respectively. The yearly periodic component has been rescaled for clarity.

3.6.2 Airline passenger counts

Figure 3.5 shows the decomposition produced by applying our method to monthly totals of international airline passengers (Box et al., 1970). We observe similar components to those in the Mauna Loa dataset: a long-term trend, annual periodicity, and medium-term deviations.
In addition, the composite kernel captures the near-linearity of the long-term trend, and the linearly growing amplitude of the annual oscillations.

[Figure panels: Complete model: SE×Lin + Per×Lin×SE + Lin×SE + WN×Lin (full model posterior with extrapolations); Long-term trend: SE×Lin; Yearly periodic: Per×Lin×SE; Medium-term deviation: SE; Growing noise: WN×Lin.]
Figure 3.5: First row: The airline dataset and posterior after a search of depth 10. Subsequent rows: Additive decomposition of posterior into long-term smooth trend, yearly variation, and short-term deviations. Due to the linear kernel, the marginal variance grows over time, making this a heteroskedastic model.

The model search can be run without modification on multi-dimensional datasets (as in sections 3.8.4 and 6.6), but the resulting structures are more difficult to visualize.

3.7 Related work

Building kernel functions by hand

Rasmussen and Williams (2006, chapter 5) devoted 4 pages to manually constructing a composite kernel to model the Mauna Loa dataset. Other examples of papers whose main contribution is to manually construct and fit a composite GP kernel are Preotiuc-Pietro and Cohn (2013), Lloyd (2013), and Klenske et al. (2013). These papers show that experts are capable of constructing kernels, in one or two dimensions, of similar complexity to the ones shown in this chapter. However, a more systematic search can consider possibilities that might otherwise be missed. For example, the kernel structure SE×Per×Lin, while appropriate for the airline dataset, had never been considered by the authors before it was chosen by the automatic search.

Nonparametric regression in high dimensions

Nonparametric regression methods such as splines, locally-weighted regression, and GP regression are capable of learning arbitrary smooth functions from data. Unfortunately, they suffer from the curse of dimensionality: it is sometimes difficult for these models to generalize well in more than a few dimensions.

Applying nonparametric methods in high-dimensional spaces can require imposing additional structure on the model. One such structure is additivity. Generalized additive models (Hastie and Tibshirani, 1990) assume the regression function is a transformed sum of functions defined on the individual dimensions: E[f(x)] = g−1(∑d=1,…,D fd(xd)). These models have a restricted form, but one which is interpretable and often generalizes well.
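To make this concrete, the sketch below (Python/NumPy; an illustrative construction, not the exact models compared in the experiments) writes down the kernel analogue of an additive model, a sum of one-dimensional SE kernels, alongside a single SE-ARD kernel that lets all dimensions interact.

```python
import numpy as np

def se_1d(xd, xpd, ell=1.0):
    """One-dimensional SE kernel acting on a single input dimension."""
    return np.exp(-0.5 * ((xd - xpd) / ell)**2)

def k_additive(x, xp, ells):
    """Sum of 1-D SE kernels: additive structure, no interactions between dimensions."""
    return sum(se_1d(x[d], xp[d], ells[d]) for d in range(len(x)))

def k_se_ard(x, xp, ells):
    """A single SE-ARD kernel over all dimensions (equal to the product of the 1-D SEs)."""
    return np.exp(-0.5 * np.sum(((x - xp) / ells)**2))

x, xp = np.array([0.2, 1.5, -0.3]), np.array([1.0, 1.4, 0.8])
ells = np.array([1.0, 2.0, 0.5])
print(k_additive(x, xp, ells), k_se_ard(x, xp, ells))
```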
Models in our grammar can capture similar structure through sums of base kernels along different dimensions, although we have not yet tried incorporating a warping function g(·). It is possible to extend additive models by adding more flexible interaction terms between dimensions. Chapter 6 considers GP models whose kernel implicitly sums over all possible interactions of input variables. Plate (1999) constructs a special case of this model class, summing an SE kernel along each dimension (for interpretability) plus a single SE-ARD kernel over all dimensions (for flexibility). Both types of model can be expressed in our grammar. A closely related procedure is smoothing-splines ANOVA (Gu, 2002; Wahba, 1990). This model is a weighted sum of splines along each input dimension, all pairs of dimen- sions, and possibly higher-dimensional combinations. Because the number of terms to consider grows exponentially with the number of dimensions included in each term, in practice, only one- and two-dimensional terms are usually considered. Semi-parametric regression (e.g. Ruppert et al., 2003) attempts to combine inter- pretability with flexibility by building a composite model out of an interpretable, para- 42 Automatic Model Construction metric part (such as linear regression) and a “catch-all” nonparametric part (such as a GP having an SE kernel). This model class can be represented through kernels such as SE + Lin. Kernel learning There is a large body of work attempting to construct rich kernels through a weighted sum of base kernels, called multiple kernel learning (MKL) (e.g. Bach et al., 2004; Gönen and Alpaydın, 2011). These approaches usually have a convex objective function. How- ever the component kernels, as well as their parameters, must be specified in advance. We compare to a Bayesian variant of MKL in section 3.8, expressed as a restriction of our language of kernels. Salakhutdinov and Hinton (2008) use a deep neural network with unsupervised pre- training to learn an embedding g(x) onto which a GP with an SE kernel is placed: Cov [f(x), f(x′)] = k(g(x), g(x′)). This is a flexible approach to kernel learning, but relies mainly on finding structure in the input density p(x). Instead, we focus on domains where most of the interesting structure is in f(x). Sparse spectrum GPs (Lázaro-Gredilla et al., 2010) approximate the spectral density of a stationary kernel function using sums of Dirac delta functions, which corresponds to kernels of the form ∑ cos. Similarly, Wilson and Adams (2013) introduced spectral mixture kernels, which approximate the spectral density using a mixture of Gaussians, corresponding to kernels of the form ∑ SE× cos. Both groups demonstrated, using Bochner’s theorem (Bochner, 1959), that these kernels can approximate any stationary covariance function. Our language of kernels includes both of these kernel classes (see table 3.1). Changepoints There is a wide body of work on changepoint modeling. Adams and MacKay (2007) developed a Bayesian online changepoint detection method which segments time-series into independent parts. This approach was extended by Saatçi et al. (2010) to Gaus- sian process models. Garnett et al. (2010) developed a family of kernels which modeled changepoints occurring abruptly at a single point. The changepoint kernel (CP) pre- sented in this work is a straightforward extension to smooth changepoints. 3.7 Related work 43 Equation learning Todorovski and Džeroski (1997), Washio et al. 
(1999) and Schmidt and Lipson (2009) learned parametric forms of functions, specifying time series or relations between quan- tities. In contrast, ABCD learns a parametric form for the covariance function, allowing it to model functions which do not have a simple parametric form but still have high- level structure. An examination of the structure discovered by the automatic equation- learning software Eureqa (Schmidt and Lipson, accessed February 2013) on the airline and Mauna Loa datasets can be found in Lloyd et al. (2014). Structure discovery through grammars Kemp and Tenenbaum (2008) learned the structural form of graphs that modeled human similarity judgements. Their grammar on graph structures includes planes, trees, and cylinders. Some of their discrete graph structures have continuous analogues in our language of models. For example, SE1×SE2 and SE1×Per2 can be seen as mapping the data onto a Euclidean surface and a cylinder, respectively. Section 2.8 examined these structures in more detail. Diosan et al. (2007) and Bing et al. (2010) learned composite kernels for support vector machines and relevance vector machines, respectively, using genetic search algo- rithms to optimize cross-validation error. Similarly, Kronberger and Kommenda (2013) searched over composite kernels for GPs using genetic programming, optimizing the un- penalized marginal likelihood. These methods explore similar languages of kernels to the one explored in this chapter. It is not clear whether the complex genetic searches used by these methods offer advantages over the straightforward but naïve greedy search used in this chapter. Our search criterion has the advantages of being both differentiable with respect to kernel parameters, and of trading off model fit and complexity automat- ically. These related works also did not explore the automatic model decomposition, summarization and description made possible by the use of GP models. Grosse et al. (2012) performed a greedy search over a compositional model class for unsupervised learning, using a grammar of matrix decomposition models, and a greedy search procedure based on held-out predictive likelihood. This model class contains many existing unsupervised models as special cases, and was able to discover diverse forms of structure, such as co-clustering or sparse latent feature models, automatically from data. Our framework takes a similar approach, but in a supervised setting. Similarly, Steinruecken (2014) showed to automatically perform inference in arbitrary 44 Automatic Model Construction compositions of discrete sequence models. More generally, Dechter et al. (2013) and Liang et al. (2010) constructed grammars over programs, and automatically searched the resulting spaces. 3.8 Experiments 3.8.1 Interpretability versus accuracy BIC trades off model fit and complexity by penalizing the number of parameters in a kernel expression. This can result in ABCD favoring kernel expressions with nested products of sums, producing descriptions involving many additive components after expanding out all terms. While these models typically have good predictive performance, their large number of components can make them less interpretable. We experimented with not allowing parentheses during the search, discouraging nested expressions. This was done by distributing all products immediately after each search operator was applied. We call this procedure ABCD-interpretability, in contrast to the unrestricted version of the search, ABCD-accuracy. 
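The expansion into a sum of products can be sketched with a few lines of code. The snippet below (Python; a simplified stand-in for the actual ABCD implementation, which manipulates parameterized kernel objects rather than strings) represents a kernel expression as nested sums and products of base-kernel names and distributes the products over the sums.

```python
from itertools import product

# A kernel expression is either a base-kernel name (str),
# ('+', [sub-expressions]) or ('*', [sub-expressions]).
def expand(expr):
    """Return the expression as a flat list of products, each a list of base kernels."""
    if isinstance(expr, str):
        return [[expr]]
    op, args = expr
    expanded = [expand(a) for a in args]
    if op == '+':   # a sum: concatenate the summands' product lists
        return [p for e in expanded for p in e]
    if op == '*':   # a product: one term per choice of factor from each operand
        return [sum(choice, []) for choice in product(*expanded)]
    raise ValueError(op)

# Example: SE x (RQ + Lin)  ->  SE x RQ  +  SE x Lin
expr = ('*', ['SE', ('+', ['RQ', 'Lin'])])
print([' x '.join(p) for p in expand(expr)])   # ['SE x RQ', 'SE x Lin']
```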
3.8.2 Predictive accuracy on time series We evaluated the performance of the algorithms listed below on 13 real time-series from various domains from the time series data library (Hyndman, accessed July 2013). The pre-processed datasets used in our experiments are available at http://github.com/jamesrobertlloyd/gpss-research/tree/master/data/tsdlr Algorithms We compare ABCD to equation learning using Eureqa (Schmidt and Lipson, accessed February 2013), as well as six other regression algorithms: linear regression, GP regres- sion with a single SE kernel (squared exponential), a Bayesian variant of multiple kernel learning (MKL) (e.g. Bach et al., 2004; Gönen and Alpaydın, 2011), changepoint model- ing (e.g. Fox and Dunson, 2013; Garnett et al., 2010; Saatçi et al., 2010), spectral mixture kernels (Wilson and Adams, 2013) (spectral kernels), and trend-cyclical-irregular models (e.g. Lind et al., 2006). We set Eureqa’s search objective to the default mean-absolute-error. All algorithms besides Eureqa can be expressed as restrictions of our modeling language (see table 3.1), 3.8 Experiments 45 so we performed inference using the same search and objective function, with appropriate restrictions to the language. We restricted our experiments to regression algorithms for comparability; we did not include models which regress on previous values of times series, such as auto-regressive or moving-average models (e.g. Box et al., 1970). Constructing a language of autoregressive time-series models would be an interesting area for future research. Extrapolation experiments To test extrapolation, we trained all algorithms on the first 90% of the data, predicted the remaining 10% and then computed the root mean squared error (RMSE). The RMSEs were then standardised by dividing by the smallest RMSE for each data set, so the best performance on each data set has a value of 1. 1. 0 1. 5 2. 0 2. 5 3. 0 3. 5 ABCD accuracy ABCD interpretability Spectral kernels Trend, cyclical irregular Bayesian MKL Eureqa Changepoints Squared Exponential Linear regression St an da rd ise d RM SE Figure 3.6: Box plot (showing median and quartiles) of standardised extrapolation RMSE (best performance = 1) on 13 time-series. Methods are ordered by median. Figure 3.6 shows the standardised RMSEs across algorithms. ABCD-accuracy usually outperformed ABCD-interpretability. Both algorithms had lower quartiles than all other methods. Overall, the model construction methods having richer languages of models per- formed better: ABCD outperformed trend-cyclical-irregular, which outperformed Bayesian MKL, which outperformed squared-exponential. Despite searching over a rich model class, Eureqa performed relatively poorly. This may be because few datasets are parsi- moniously explained by a parametric equation, or because of the limited regularization ability of this procedure. 46 Automatic Model Construction Not shown on the plot are large outliers for spectral kernels, Eureqa, squared expo- nential and linear regression with normalized RMSEs of 11, 493, 22 and 29 respectively. 3.8.3 Multi-dimensional prediction ABCD can be applied to multidimensional regression problems without modification. An experimental comparison with other methods can be found in section 6.6, where it has the best performance on every dataset. 
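Returning to the standardisation used in the extrapolation experiments above: it amounts to dividing each dataset's RMSEs by the best RMSE achieved on that dataset, so that the best method scores exactly 1. A small sketch (Python/NumPy, with a made-up error matrix standing in for the real results):

```python
import numpy as np

# rmse[i, j]: RMSE of method j on dataset i (illustrative random values only).
rng = np.random.default_rng(1)
rmse = rng.uniform(0.5, 3.0, size=(13, 9))

standardised = rmse / rmse.min(axis=1, keepdims=True)  # best method on each dataset -> 1.0
medians = np.median(standardised, axis=0)              # per-method medians, as in figure 3.6
print(np.round(medians, 2))
```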
3.8.4 Structure recovery on synthetic data The structure found in the examples above may seem reasonable, but we may wonder to what extent ABCD is consistent – that is, does it recover all the structure in any given dataset? It is difficult to tell from predictive accuracy alone if the search procedure is finding the correct structure, especially in multiple dimensions. To address this question, we tested our method’s ability to recover known structure on a set of synthetic datasets. For several composite kernel expressions, we constructed synthetic data by first sam- pling 300 locations uniformly at random, then sampling function values at those loca- tions from a GP prior. We then added i.i.d. Gaussian noise to the functions at various signal-to-noise ratios (SNR). Table 3.2: Kernels chosen by ABCD on synthetic data generated using known kernel structures. D denotes the dimension of the function being modeled. SNR indicates the signal-to-noise ratio. Dashes (–) indicate no structure was found. Each kernel implicitly has a WN kernel added to it. True kernel D SNR = 10 SNR = 1 SNR = 0.1 SE + RQ 1 SE SE×Per SE Lin×Per 1 Lin×Per Lin×Per SE SE1 + RQ2 2 SE1 + SE2 Lin1 + SE2 Lin1 SE1 + SE2×Per1 + SE3 3 SE1 + SE2×Per1 + SE3 SE2×Per1 + SE3 – SE1×SE2 4 SE1×SE2 Lin1×SE2 Lin2 SE1×SE2 + SE2×SE3 4 SE1×SE2 + SE2×SE3 SE1 + SE2×SE3 SE1 (SE1 + SE2)×(SE3 + SE4) 4 (SE1 + SE2)× . . . (SE1 + SE2)× . . . –(SE3×Lin3×Lin1 + SE4) SE3×SE4 Table 3.2 shows the results. For the highest signal-to-noise ratio, ABCD usually 3.9 Conclusion 47 recovers the correct structure. The reported additional linear structure in the last row can be explained the fact that functions sampled from SE kernels with long length-scales occasionally have near-linear trends. As the noise increases, our method generally backs off to simpler structures rather than reporting spurious structure. Source code All GP parameter optimization was performed by automated calls to the GPML tool- box (Rasmussen and Nickisch, 2010). Source code to perform all experiments is available at http://www.github.com/jamesrobertlloyd/gp-structure-search. 3.9 Conclusion This chapter presented a system which constructs a model from an open-ended language, and automatically generates plots decomposing the different types of structure present in the model. This was done by introducing a space of kernels defined by sums and products of a small number of base kernels. The set of models in this space includes many standard regression models. We proposed a search procedure for this space of kernels, and argued that this search process parallels the process of model-building by statisticians. We found that the learned structures enable relatively accurate extrapolation in time-series datasets. The learned kernels can yield decompositions of a signal into di- verse and interpretable components, enabling model-checking by humans. We hope that this procedure has the potential to make powerful statistical model-building techniques accessible to non-experts. Some discussion of the limitations of this approach to model-building can be found in section 4.4, and discussion of this approach relative to other model-building approaches can be found in section 8.3. The next chapter will show how the model components found by ABCD can be automatically described using English-language text. Chapter 4 Automatic Model Description “Not a wasted word. This has been a main point to my literary thinking all my life.” – Hunter S. 
Thompson The previous chapter showed how to automatically build structured models by search- ing through a language of kernels. It also showed how to decompose the resulting models into the different types of structure present, and how to visually illustrate the type of structure captured by each component. This chapter shows how automatically describe the resulting model structures using English text. The main idea is to describe every part of a given product of kernels as an adjective, or as a short phrase that modifies the description of a kernel. To see how this could work, recall that the model decomposition plots of chapter 3 showed that most of the structure in each component was determined by that component’s kernel. Even across different datasets, the meanings of individual parts of different kernels are consistent in some ways. For example, Per indicates repeating structure, and SE indicates smooth change over time. This chapter also presents a system that generates reports combining automatically generated text and plots which highlight interpretable features discovered in a data sets. A complete example of an automatically-generated report can be found in appendix D. The work appearing in this chapter was written in collaboration with James Robert Lloyd, Roger Grosse, Joshua B. Tenenbaum, and Zoubin Ghahramani, and was published in Lloyd et al. (2014). The procedure translating kernels into adjectives developed out of discussions between James and myself. James Lloyd wrote the code to automatically generate reports, and ran all of the experiments. The paper upon which this chapter is based was written mainly by both James Lloyd and I. 4.1 Generating descriptions of composite kernels 49 4.1 Generating descriptions of composite kernels There are two main features of our language of GP models that allow description to be performed automatically. First, any kernel expression in the language can be simplified into a sum of products. As discussed in section 2.4, a sum of kernels corresponds to a sum of functions, so each resulting product of kernels can be described separately, as part of a sum. Second, each kernel in a product modifies the resulting model in a consistent way. Therefore, one can describe a product of kernels by concatenating descriptions of the effect of each part of the product. One part of the product needs to be described using a noun, which is modified by the other parts. For example, one can describe the product of kernels Per×SE by representing Per by a noun (“a periodic function”) modified by a phrase representing the effect of the SE kernel (“whose shape varies smoothly over time”). To simplify the system, we restricted base kernels to the set {C, Lin, WN, SE, Per, and σ}. Recall that the sigmoidal kernel σ(x, x′) = σ(x)σ(x′) allows changepoints and change-windows. 4.1.1 Simplification rules In order to be able to use the same phrase to describe the effect of each base kernel in different circumstances, our system converts each kernel expression into a standard, simplified form. First, our system distributes all products of sums into sums of products. Then, it applies several simplification rules to the kernel expression: • Products of two or more SE kernels can be equivalently replaced by a single SE with different parameters. • Multiplying the white-noise kernel (WN) by any stationary kernel (C, WN, SE, or Per) gives another WN kernel. 
• Multiplying any kernel by the constant kernel (C) only changes the parameters of the original kernel, and so can be factored out of any product in which it appears. After applying these rules, any composite kernel expressible by the grammar can be written as a sum of terms of the form: K ∏ m Lin(m) ∏ n σ(n), (4.1) 50 Automatic Model Description where K is one of {WN,C, SE,∏k Per(k)} or {SE×∏k Per(k)}, and ∏i k(i) denotes a prod- uct of kernels, each having different parameters. Superscripts denote different instances of the same kernel appearing in a product: SE(1) can have different kernel parameters than SE(2). 4.1.2 Describing each part of a product of kernels Each kernel in a product modifies the resulting GP model in a consistent way. This allows one to describe the contribution of each kernel in a product as an adjective, or more generally as a modifier of a noun. We now describe how each of the kernels in our grammar modifies a GP model: • Multiplication by SE removes long range correlations from a model, since SE(x, x′) decreases monotonically to 0 as |x− x′| increases. This converts any global corre- lation structure into local correlation only. • Multiplication by Lin is equivalent to multiplying the function being modeled by a linear function. If f(x) ∼ GP(0, k), then x×f(x) ∼ GP (0,Lin×k). This causes the standard deviation of the model to vary linearly, without affecting the correlation between function values. • Multiplication by σ is equivalent to multiplying the function being modeled by a sigmoid, which means that the function goes to zero before or after some point. • Multiplication by Per removes correlation between all pairs of function values not close to one period apart, allowing variation within each period, but maintain- ing correlation between periods. • Multiplication by any kernel modifies the covariance in the same way as mul- tiplying by a function drawn from a corresponding GP prior. This follows from the fact that if f1(x) ∼ GP(0, k1) and f2(x) ∼ GP(0, k2) then Cov [ f1(x)f2(x), f1(x′)f2(x′) ] = k1(x, x′)×k2(x, x′). (4.2) Put more plainly, a GP whose covariance is a product of kernels has the same covariance as a product of two functions, each drawn from the corresponding GP prior. However, the distribution of f1×f2 is not always GP distributed – it can have third and higher central moments as well. This identity can be used to generate 4.1 Generating descriptions of composite kernels 51 a cumbersome “worst-case” description in cases where a more concise description of the effect of a kernel is not available. For example, it is used in our system to describe products of more than one periodic kernel. Table 4.1 gives the corresponding description of the effect of each type of kernel in a product, written as a post-modifier. Kernel Postmodifier phrase SE whose shape changes smoothly Per modulated by a periodic function Lin with linearly varying amplitude∏ k Lin(k) with polynomially varying amplitude∏ k σ (k) which applies until / from [changepoint] Table 4.1: Descriptions of the effect of each kernel, written as a post-modifier. Table 4.2 gives the corresponding description of each kernel before it has been mul- tiplied by any other, written as a noun phrase. Kernel Noun phrase WN uncorrelated noise C constant SE smooth function Per periodic function Lin linear function∏ k Lin(k) {quadratic, cubic, quartic, . . . } function Table 4.2: Noun phrase descriptions of each type of kernel. 
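A hedged sketch of how these rules and tables can be turned into code (Python; a simplified stand-in for the full system, which also tracks kernel parameters, products of Lin, and sigmoidal factors). A product of base-kernel names is first simplified using the rules of section 4.1.1, then rendered as a noun phrase using the tables above; the choice of head noun follows the preference order described in the next subsection.

```python
NOUN = {'WN': 'uncorrelated noise', 'C': 'constant', 'SE': 'smooth function',
        'Per': 'periodic function', 'Lin': 'linear function'}
MODIFIER = {'SE': 'whose shape changes smoothly',
            'Per': 'modulated by a periodic function',
            'Lin': 'with linearly varying amplitude'}
STATIONARY = {'C', 'WN', 'SE', 'Per'}
HEAD_ORDER = ['Per', 'WN', 'SE', 'C', 'Lin']   # preference order for the head noun

def simplify(prod):
    """Apply the simplification rules of section 4.1.1 to a product of base kernels."""
    prod = [k for k in prod if k != 'C'] or ['C']   # C only rescales the other kernels
    if 'WN' in prod:                                # WN absorbs any stationary kernels
        prod = [k for k in prod if k not in STATIONARY] + ['WN']
    while prod.count('SE') > 1:                     # SE x SE simplifies to a single SE
        prod.remove('SE')
    return prod

def describe(prod):
    """Render a product of base kernels as a noun phrase."""
    prod = simplify(prod)
    head = min(prod, key=HEAD_ORDER.index)
    mods = [MODIFIER[k] for k in prod if k != head and k in MODIFIER]
    article = '' if head == 'WN' else 'a '
    return ' '.join([article + NOUN[head]] + mods)

print(describe(['SE', 'Per', 'C']))    # a periodic function whose shape changes smoothly
print(describe(['WN', 'SE', 'Lin']))   # uncorrelated noise with linearly varying amplitude
```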
4.1.3 Combining descriptions into noun phrases In order to build a noun phrase describing a product of kernels, our system chooses one kernel to act as the head noun, which is then modified by appending descriptions of the other kernels in the product. As an example, a kernel of the form Per×Lin×σ could be described as a Per︸︷︷︸ periodic function × Lin︸︷︷︸ with linearly varying amplitude × σ︸︷︷︸ which applies until 1700. 52 Automatic Model Description where Per was chosen to be the head noun. In our system, the head noun is chosen according to the following ordering: Per,WN, SE,C, ∏ m Lin(m), ∏ n σ(n) (4.3) Combining tables 4.1 and 4.2 with ordering 4.3 provides a general method to produce descriptions of sums and products of these base kernels. Extensions and refinements In practice, the system also incorporates a number of other rules which help to make the descriptions shorter, easier to parse, or clearer: • The system adds extra adjectives depending on kernel parameters. For example, an SE with a relatively short lengthscale might be described as “a rapidly-varying smooth function” as opposed to just “a smooth function”. • Descriptions can include kernel parameters. For example, the system might write that a function is “repeating with a period of 7 days”. • Descriptions can include extra information about the model not contained in the kernel. For example, based on the posterior distribution over the function’s slope, the system might write “a linearly increasing function” as opposed to “a linear function”. • Some kernels can be described through pre-modifiers. For example, the system might write “an approximately periodic function” as opposed to “a periodic func- tion whose shape changes smoothly”. Ordering additive components The reports generated by our system attempt to present the most interesting or im- portant features of a dataset first. As a heuristic, the system orders components by always adding next the component which most reduces the 10-fold cross-validated mean absolute error. 4.2 Example descriptions 53 4.1.4 Worked example This section shows an example of our procedure describing a compound kernel containing every type of base kernel in our set: SE×(WN×Lin + CP(C,Per)). (4.4) The kernel is first converted into a sum of products, and the changepoint is converted into sigmoidal kernels (recall the definition of changepoint kernels in section 2.5): SE×WN×Lin + SE×C×σ + SE×Per×σ¯ (4.5) which is then simplified using the rules in section 4.1.1 to WN×Lin + SE×σ + SE×Per×σ¯. (4.6) To describe the first component, (WN×Lin), the head noun description for WN, “uncorrelated noise”, is concatenated with a modifier for Lin, “with linearly increasing standard deviation”. The second component, (SE×σ), is described as “A smooth function with a length- scale of [lengthscale] [units]”, corresponding to the SE, “which applies until [change- point]”. Finally, the third component, (SE×Per× σ¯), is described as “An approximately periodic function with a period of [period] [units] which applies from [changepoint]”. 4.2 Example descriptions In this section, we demonstrate the ability of our procedure, ABCD, to write intelligible descriptions of the structure present in two time series. The examples presented here describe models produced by the automatic search method presented in chapter 3. 4.2.1 Summarizing 400 years of solar activity First, we show excerpts from the report automatically generated on annual solar irradi- ation data from 1610 to 2011. This dataset is shown in figure 4.1. 
This time series has two pertinent features: First, a roughly 11-year cycle of solar activity. Second, a period lasting from 1645 to 1715 having almost no variance. This flat region is known as the Maunder minimum, a period in which sunspots were extremely rare (Lean et al., 1995). The Maunder minimum is an example of the type of structure that can be captured by change-windows.

Figure 4.1: Solar irradiance data (Lean et al., 1995).
[Figure 4.2: Automatically generated descriptions of the first four components discovered by ABCD on the solar irradiance data set. The dataset has been decomposed into diverse structures having concise descriptions.]

The first section of each report generated by ABCD is a summary of the structure found in the dataset. Figure 4.2 shows natural-language summaries of the top four components discovered by ABCD on the solar dataset. From these summaries, we can see that the system has identified the Maunder minimum (second component) and the 11-year solar cycle (fourth component). These components are visualized and described in figures 4.3 and 4.5, respectively. The third component, visualized in figure 4.4, captures the smooth variation over time of the overall level of solar activity. The complete report generated on this dataset can be found in appendix D. Each report also contains samples from the model posterior.
[Figure 4.3: Extract from an automatically-generated report describing the model component corresponding to the Maunder minimum.]
[Figure 4.4: Characterizing the medium-term smoothness of solar activity levels. By allowing other components to explain the periodicity, noise, and the Maunder minimum, ABCD can isolate the part of the signal best explained by a slowly-varying trend.]

[Figure 4.5: This part of the report isolates and describes the approximately 11-year sunspot cycle, also noting its disappearance during the Maunder minimum.]

4.2.2 Describing changing noise levels

Next, we present excerpts of the description generated by our procedure on a model of international airline passenger counts over time, shown in figure 3.5. High-level descriptions of the four components discovered are shown in figure 4.6.
[Figure 4.6: Short descriptions of the four components of a model describing the airline dataset.]
[Figure 4.7: Describing non-stationary periodicity in the airline data.]

The second component, shown in figure 4.7, is accurately described as approximately (SE) periodic (Per) with linearly growing amplitude (Lin). The description of the fourth component, shown in figure 4.8, expresses the fact that the scale of the unstructured noise in the model grows linearly with time.

The complete report generated on this dataset can be found in the supplementary material of Lloyd et al. (2014). Other example reports describing a wide variety of time series can be found at http://mlg.eng.cam.ac.uk/lloyd/abcdoutput/
[Figure 4.8: Describing time-changing variance in the airline dataset.]

4.3 Related work

To the best of our knowledge, our procedure is the first example of automatic textual description of a nonparametric statistical model. However, systems with natural language output have been developed for automatic video description (Barbu et al., 2012) and automated theorem proving (Ganesalingam and Gowers, 2013).

Although not a description procedure, Durrande et al. (2013) developed an analytic method for decomposing GP posteriors into entirely periodic and entirely non-periodic parts, even when using non-periodic kernels.

4.4 Limitations of this approach

During development, we noted several difficulties with this overall approach:

• Some kernels are hard to describe. For instance, we did not include the RQ kernel in the text-generation procedure. This was done for several reasons. First, the RQ kernel can be equivalently expressed as a scale mixture of SE kernels, making it redundant in principle. Second, it was difficult to think of a clear and concise description for the effect of the hyperparameter that controls the heaviness of the tails of the RQ kernel. Third, a product of two RQ kernels does not give another RQ kernel, which raises the question of how to concisely describe products of RQ kernels.

• Reliance on additivity. Much of the modularity of the description procedure is due to the additive decomposition. However, additivity is lost under any nonlinear transformation of the output. Such warpings can be learned (Snelson et al., 2004), but descriptions of transformations of the data may not be as clear to the end user.

• Difficulty of expressing uncertainty. A natural extension to the model search procedure would be to report a posterior distribution on structures and kernel parameters, rather than point estimates. Describing uncertainty about the hyperparameters of a particular structure may be feasible, but describing even a few most-probable structures might result in excessively long reports.

Source code

Source code to perform all experiments is available at http://www.github.com/jamesrobertlloyd/gpss-research.

4.5 Conclusions

This chapter presented a system which automatically generates detailed reports describing statistical structure captured by a GP model. The properties of GPs and the kernels being used allow a modular description, avoiding an exponential blowup in the number of special cases that need to be considered.

Combining this procedure with the model search of chapter 3 gives a system combining all the elements of an automatic statistician listed in section 3.1: an open-ended language of models, a method to search through model space, a model comparison procedure, and a model description procedure. Each particular element used in the system presented here is merely a proof-of-concept.
However, even this simple prototype demonstrated the ability to discover and describe a variety of patterns in time series.

Chapter 5

Deep Gaussian Processes

"I asked myself: On any given day, would I rather be wrestling with a sampler, or proving theorems?"
– Peter Orbanz, personal communication

For modeling high-dimensional functions, deep neural networks are a popular alternative to the Gaussian process models explored earlier in this thesis. When training deep neural networks, choosing appropriate architectures and regularization strategies is important for good predictive performance. In this chapter, we propose to study this problem by viewing deep neural networks as priors on functions. By viewing neural networks this way, one can analyze their properties without reference to any particular dataset, loss function, or training method.

As a starting point, we will relate neural networks to Gaussian processes, and examine a class of infinitely-wide, deep neural networks called deep Gaussian processes: compositions of functions drawn from GP priors. Deep GPs are an attractive model class to study for several reasons. First, Damianou and Lawrence (2013) showed that the probabilistic nature of deep GPs guards against overfitting. Second, Hensman et al. (2014a) showed that stochastic variational inference is possible in deep GPs, allowing mini-batch training on large datasets. Third, the availability of an approximation to the marginal likelihood allows one to automatically tune the model architecture without the need for cross-validation. Finally, deep GPs are attractive from a model-analysis point of view, because they remove some of the details of finite neural networks.

Our analysis will show that in standard architectures, the representational capacity of deep networks tends to decrease as the number of layers increases, retaining only a single degree of freedom in the limit. We propose an alternate network architecture, connecting the input to each layer, which does not suffer from this pathology. We also examine deep kernels, obtained by composing arbitrarily many fixed feature transforms.

The ideas contained in this chapter were developed through discussions with Oren Rippel, Ryan Adams and Zoubin Ghahramani, and appear in Duvenaud et al. (2014).

5.1 Relating deep neural networks to deep GPs

This section gives a precise definition of deep GPs, reviews the precise relationship between neural networks and Gaussian processes, and gives two equivalent ways of constructing neural networks which give rise to deep GPs.

5.1.1 Definition of deep GPs

We define a deep GP as a distribution on functions constructed by composing functions drawn from GP priors. An example of a deep GP is a composition of vector-valued functions, with each function drawn independently from GP priors:

f^{(1:L)}(x) = f^{(L)}\big(f^{(L-1)}\big(\ldots f^{(2)}\big(f^{(1)}(x)\big)\ldots\big)\big)   (5.1)

with each f_d^{(\ell)} \overset{\text{ind}}{\sim} \mathcal{GP}\big(0, k_d^{(\ell)}(x, x')\big).

Multilayer neural networks also implement compositions of vector-valued functions, one per layer. Therefore, understanding general properties of function compositions might help us to gain insight into deep neural networks.
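As a minimal numerical sketch of this construction (assuming a unit-variance SE kernel with unit lengthscale at every layer, a one-dimensional input grid, and a small amount of jitter for stable Cholesky factorization; none of these choices come from the experiments in this thesis), one can draw an approximate sample from a one-dimensional deep GP prior by evaluating each layer's GP sample at the previous layer's outputs:

import numpy as np

def se_kernel(a, b, variance=1.0, lengthscale=1.0):
    # Squared-exponential kernel matrix between 1-D input vectors a and b.
    return variance * np.exp(-0.5 * (a[:, None] - b[None, :]) ** 2 / lengthscale ** 2)

def sample_gp(x, jitter=1e-6, rng=None):
    # One draw from GP(0, SE) evaluated at the points x; jitter keeps the
    # covariance matrix positive definite when points nearly coincide.
    rng = np.random.default_rng() if rng is None else rng
    K = se_kernel(x, x) + jitter * np.eye(len(x))
    return np.linalg.cholesky(K) @ rng.standard_normal(len(x))

def sample_deep_gp(x, n_layers, rng=None):
    # Compose independent GP-distributed functions, as in equation (5.1):
    # each layer is a fresh GP sample evaluated at the previous layer's outputs.
    rng = np.random.default_rng() if rng is None else rng
    f = x.copy()
    for _ in range(n_layers):
        f = sample_gp(f, rng=rng)
    return f

x = np.linspace(-4.0, 4.0, 300)
draw = sample_deep_gp(x, n_layers=5, rng=np.random.default_rng(0))

Plotting draw against x for increasing n_layers previews the qualitative behaviour analyzed in section 5.2: the composed functions become nearly flat in most regions, with occasional rapid variation.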
5.1.2 Single-hidden-layer models

First, we relate neural networks to standard "shallow" Gaussian processes, using the standard neural network architecture known as the multi-layer perceptron (MLP) (Rosenblatt, 1962). In the typical definition of an MLP with one hidden layer, the hidden unit activations are defined as:

h(x) = \sigma\left(b + Vx\right)   (5.2)

where h are the hidden unit activations, b is a bias vector, V is a weight matrix and σ is a one-dimensional nonlinear function, usually a sigmoid, applied element-wise. The output vector f(x) is simply a weighted sum of these hidden unit activations:

f(x) = W\sigma\left(b + Vx\right) = Wh(x)   (5.3)

where W is another weight matrix.

Neal (1995, chapter 2) showed that some neural networks with infinitely many hidden units, one hidden layer, and unknown weights correspond to Gaussian processes. More precisely, for any model of the form

f(x) = \frac{1}{K} w^{\mathsf{T}} h(x) = \frac{1}{K} \sum_{i=1}^{K} w_i h_i(x),   (5.4)

with fixed¹ features [h_1(x), \ldots, h_K(x)]^{\mathsf{T}} = h(x) and i.i.d. w's with zero mean and finite variance σ², the central limit theorem implies that as the number of features K grows, any two function values f(x) and f(x') have a joint distribution approaching a Gaussian:

\lim_{K\to\infty} p\left(\begin{bmatrix} f(x) \\ f(x') \end{bmatrix}\right) = \mathcal{N}\left(\begin{bmatrix} 0 \\ 0 \end{bmatrix}, \frac{\sigma^2}{K}\begin{bmatrix} \sum_{i=1}^{K} h_i(x)h_i(x) & \sum_{i=1}^{K} h_i(x)h_i(x') \\ \sum_{i=1}^{K} h_i(x')h_i(x) & \sum_{i=1}^{K} h_i(x')h_i(x') \end{bmatrix}\right)   (5.5)

A joint Gaussian distribution between any set of function values is the definition of a Gaussian process.

The result is surprisingly general: it puts no constraints on the features (other than having uniformly bounded activation), nor does it require that the feature weights w be Gaussian distributed. An MLP with a finite number of nodes also gives rise to a GP, but only if the distribution on w is Gaussian.

One can also work backwards to derive a one-layer MLP from any GP: Mercer's theorem, discussed in section 2.6, implies that any positive-definite kernel function corresponds to an inner product of features: k(x, x') = h(x)^{\mathsf{T}} h(x').

Thus, in the one-hidden-layer case, the correspondence between MLPs and GPs is straightforward: the implicit features h(x) of the kernel correspond to hidden units of an MLP.

¹ The above derivation gives the same result if the parameters of the hidden units are random, since in the infinite limit, their distribution on outputs is always the same with probability one. However, to avoid confusion, we refer to layers with infinitely-many nodes as "fixed".

[Figure 5.1: Left: GPs can be derived as a one-hidden-layer MLP with infinitely many fixed hidden units having unknown weights. Right: Multiple layers of fixed hidden units give rise to a GP with a deep kernel, but not a deep GP.]
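A small numerical sketch of this correspondence (using tanh hidden units, uniform rather than Gaussian weights, and sizes chosen purely for illustration) checks that the covariance of the resulting random functions is proportional to the inner product of the fixed features, and is therefore a kernel:

import numpy as np

rng = np.random.default_rng(0)

K, D = 2000, 1                          # number of fixed hidden units, input dimension
b = rng.normal(size=K)                  # fixed hidden-unit parameters, drawn once and held fixed
V = rng.normal(size=(K, D))
h = lambda X: np.tanh(b + X @ V.T)      # fixed feature map h(x), shape (N, K)

X = np.array([[-1.0], [0.5]])           # two test inputs x and x'
H = h(X)                                # (2, K) feature activations

# Many output functions f(x) = (1/K) w^T h(x) with i.i.d. non-Gaussian weights
# of zero mean and variance sigma^2, in the spirit of equation (5.4).
sigma = 1.0
W = rng.uniform(-np.sqrt(3.0) * sigma, np.sqrt(3.0) * sigma, size=(5000, K))
F = W @ H.T / K                         # (n_draws, 2) values of f at x and x'

print(np.cov(F.T))                      # empirical covariance across random draws of the weights
print(sigma ** 2 * (H @ H.T) / K ** 2)  # proportional to the feature inner product h(x)^T h(x')

By the central limit theorem, the marginal distribution of each column of F is close to Gaussian even though the weights are uniform, which is the sense in which the wide limit defines a GP.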
5.1.3 Multiple hidden layers

Next, we examine infinitely-wide MLPs having multiple hidden layers. There are several ways to construct such networks, giving rise to different priors on functions.

In an MLP with multiple hidden layers, the activations of the ℓth layer's units are given by

h^{(\ell)}(x) = \sigma\left(b^{(\ell)} + V^{(\ell)} h^{(\ell-1)}(x)\right).   (5.6)

This architecture is shown on the right of figure 5.1. For example, if we extend the model given by equation (5.3) to have two layers of feature mappings, the resulting model becomes

f(x) = \frac{1}{K} w^{\mathsf{T}} h^{(2)}\left(h^{(1)}(x)\right).   (5.7)

If the features h^{(1)}(x) and h^{(2)}(x) are fixed with only the last-layer weights w unknown, this model corresponds to a shallow GP with a deep kernel, given by

k(x, x') = \left[h^{(2)}(h^{(1)}(x))\right]^{\mathsf{T}} h^{(2)}(h^{(1)}(x')).   (5.8)

Deep kernels, explored in section 5.5, imply a fixed representation as opposed to a prior over representations. Thus, unless we richly parameterize these kernels, their capacity to learn an appropriate representation will be limited in comparison to more flexible models such as deep neural networks or deep GPs.

[Figure 5.2: Two equivalent views of deep GPs as neural networks. Top: A neural network whose every other layer is a weighted sum of an infinite number of fixed hidden units, whose weights are initially unknown. Bottom: A neural network with a finite number of hidden units, each with a different unknown non-parametric activation function. The activation functions are visualized by draws from 2-dimensional GPs, although their input dimension will actually be the same as the output dimension of the previous layer.]

5.1.4 Two network architectures equivalent to deep GPs

There are two equivalent neural network architectures that correspond to deep GPs: one having fixed nonlinearities, and another having GP-distributed nonlinearities.

To construct a neural network corresponding to a deep GP using only fixed nonlinearities, one can start with the infinitely-wide deep GP shown in figure 5.1 (right), and introduce a finite set of nodes in between each infinitely-wide set of fixed basis functions. This architecture is shown in the top of figure 5.2. The D^{(\ell)} nodes f^{(\ell)}(x) in between each fixed layer are weighted sums (with random weights) of the fixed hidden units of the layer below, and the next layer's hidden units depend only on these D^{(\ell)} nodes.

This alternating-layer architecture has an interpretation as a series of linear information bottlenecks. To see this, substitute equation (5.3) into equation (5.6) to get

h^{(\ell)}(x) = \sigma\left(b^{(\ell)} + \left[V^{(\ell)} W^{(\ell-1)}\right] h^{(\ell-1)}(x)\right)   (5.9)

where W^{(\ell-1)} is the weight matrix connecting h^{(\ell-1)} to f^{(\ell-1)}. Thus, ignoring the intermediate outputs f^{(\ell)}(x), a deep GP is an infinitely-wide, deep MLP with each pair of layers connected by random, rank-D_\ell matrices given by V^{(\ell)} W^{(\ell-1)}.

The second, more direct way to construct a network architecture corresponding to a deep GP is to integrate out all W^{(\ell)}, and view deep GPs as a neural network with a finite number of nonparametric, GP-distributed basis functions at each layer, in which f^{(1:\ell)}(x) represent the output of the hidden nodes at the ℓth layer. This second view lets us compare deep GP models to standard neural net architectures more directly. Figure 5.2 (bottom) shows an example of this architecture.
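The rank-D_\ell bottleneck in equation (5.9) can be verified directly; in the following sketch the layer widths and the number of intermediate nodes are arbitrary illustrative choices:

import numpy as np

rng = np.random.default_rng(1)
K_prev, K_next, D = 500, 500, 3       # wide fixed layers, D intermediate nodes f^(l-1)

W = rng.normal(size=(D, K_prev))      # random weights from the hidden units to the D nodes
V = rng.normal(size=(K_next, D))      # weights from the D nodes into the next hidden layer

M = V @ W                             # effective hidden-to-hidden matrix appearing in equation (5.9)
print(np.linalg.matrix_rank(M))       # prints 3: the connection between layers has rank at most D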
5.2 Characterizing deep Gaussian process priors

This section develops several theoretical results characterizing the behavior of deep GPs as a function of their depth. Specifically, we show that the size of the derivative of a one-dimensional deep GP becomes log-normal distributed as the network becomes deeper. We also show that the Jacobian of a multivariate deep GP is a product of independent Gaussian matrices having independent entries. These results will allow us to identify a pathology that emerges in very deep networks in section 5.3.

5.2.1 One-dimensional asymptotics

In this section, we derive the limiting distribution of the derivative of an arbitrarily deep, one-dimensional GP having a squared-exp kernel:

\text{SE}(x, x') = \sigma^2 \exp\left(-\frac{(x - x')^2}{2w^2}\right).   (5.10)

The parameter σ² controls the variance of functions drawn from the prior, and the lengthscale parameter w controls the smoothness. The derivative of a GP with a squared-exp kernel is point-wise distributed as N(0, σ²/w²). Intuitively, a draw from a GP is likely to have large derivatives if the kernel has high variance and small lengthscales.

By the chain rule, the derivative of a one-dimensional deep GP is simply a product of the derivatives of each layer, which are drawn independently by construction. The distribution of the absolute value of this derivative is a product of half-normals, each with mean \sqrt{2\sigma^2/\pi w^2}. If one chooses kernel parameters such that σ²/w² = π/2, then the expected magnitude of the derivative remains constant regardless of the depth.

The distribution of the log of the magnitude of the derivatives has finite moments:

m_{\log} = \mathbb{E}\left[\log\left|\frac{\partial f(x)}{\partial x}\right|\right] = 2\log\left(\frac{\sigma}{w}\right) - \log 2 - \gamma

v_{\log} = \mathbb{V}\left[\log\left|\frac{\partial f(x)}{\partial x}\right|\right] = \frac{\pi^2}{4} + \frac{\log^2 2}{2} - \gamma^2 - \gamma\log 4 + 2\log\left(\frac{\sigma}{w}\right)\left[\gamma + \log 2 - \log\left(\frac{\sigma}{w}\right)\right]   (5.11)

where γ ≈ 0.5772 is Euler's constant. Since the second moment is finite, by the central limit theorem, the limiting distribution of the size of the gradient approaches a log-normal as L grows:

\log\left|\frac{\partial f^{(1:L)}(x)}{\partial x}\right| = \log\prod_{\ell=1}^{L}\left|\frac{\partial f^{(\ell)}(x)}{\partial x}\right| = \sum_{\ell=1}^{L}\log\left|\frac{\partial f^{(\ell)}(x)}{\partial x}\right| \overset{L\to\infty}{\sim} \mathcal{N}\left(L\, m_{\log}, L\, v_{\log}\right)   (5.12)

Even if the expected magnitude of the derivative remains constant, the variance of the log-normal distribution grows without bound as the depth increases. Because the log-normal distribution is heavy-tailed and its domain is bounded below by zero, the derivative will become very small almost everywhere, with rare but very large jumps.

Figure 5.3 shows this behavior in a draw from a 1D deep GP prior. This figure also shows that once the derivative in one region of the input space becomes very large or very small, it is likely to remain that way in subsequent layers.

[Figure 5.3: A function drawn from a one-dimensional deep GP prior, shown at depths of 1, 2, 5 and 10 layers. The x-axis is the same for all plots. After a few layers, the functions begin to be either nearly flat, or quickly-varying, everywhere. This is a consequence of the distribution on derivatives becoming heavy-tailed. As well, the function values at each layer tend to cluster around the same few values as the depth increases. This happens because once the function values in different regions are mapped to the same value in an intermediate layer, there is no way for them to be mapped to different values in later layers.]
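A Monte Carlo sketch of this limiting behaviour (with σ²/w² = π/2 so that the expected derivative magnitude equals one at every depth; the depth and sample size below are arbitrary choices):

import numpy as np

rng = np.random.default_rng(0)
L, n_samples = 50, 100_000
scale = np.sqrt(np.pi / 2.0)          # per-layer derivative is N(0, sigma^2 / w^2) with sigma^2/w^2 = pi/2

# By the chain rule, the deep derivative is a product of independent layer derivatives.
# E|df/dx| equals 1 exactly for this parameter choice, yet typical draws collapse.
layer_derivs = rng.normal(scale=scale, size=(n_samples, L))
log_mag = np.sum(np.log(np.abs(layer_derivs)), axis=1)

print("median |df/dx|          :", np.exp(np.median(log_mag)))                 # tiny: typical derivatives vanish
print("99th percentile / median:", np.exp(np.percentile(log_mag, 99) - np.median(log_mag)))  # enormous spread
print("std of log|df/dx|       :", log_mag.std())                              # grows like sqrt(L v_log), as in (5.12)
print("sqrt(L) * per-layer std :", np.sqrt(L) * np.log(np.abs(layer_derivs)).std())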
5.2.2 Distribution of the Jacobian

Next, we characterize the distribution on Jacobians of multivariate functions drawn from deep GP priors, finding them to be products of independent Gaussian matrices with independent entries.

Lemma 5.2.1. The partial derivatives of a function mapping R^D → R drawn from a GP prior with a product kernel are independently Gaussian distributed.

Proof. Because differentiation is a linear operator, the derivatives of a function drawn from a GP prior are also jointly Gaussian distributed. The covariance between partial derivatives with respect to input dimensions d_1 and d_2 of vector x is given by Solak et al. (2003):

\operatorname{cov}\left(\frac{\partial f(x)}{\partial x_{d_1}}, \frac{\partial f(x)}{\partial x_{d_2}}\right) = \left.\frac{\partial^2 k(x, x')}{\partial x_{d_1}\,\partial x'_{d_2}}\right|_{x = x'}   (5.13)

If our kernel is a product over individual dimensions k(x, x') = \prod_{d}^{D} k_d(x_d, x'_d), then the off-diagonal entries are zero, implying that all elements are independent.

For example, in the case of the multivariate squared-exp kernel, the covariance between derivatives has the form:

f(x) \sim \mathcal{GP}\left(0, \sigma^2 \prod_{d=1}^{D} \exp\left(-\frac{1}{2}\frac{(x_d - x'_d)^2}{w_d^2}\right)\right) \implies \operatorname{cov}\left(\frac{\partial f(x)}{\partial x_{d_1}}, \frac{\partial f(x)}{\partial x_{d_2}}\right) = \begin{cases} \dfrac{\sigma^2}{w_{d_1}^2} & \text{if } d_1 = d_2 \\ 0 & \text{if } d_1 \neq d_2 \end{cases}   (5.14)

Lemma 5.2.2. The Jacobian of a set of D functions R^D → R drawn from independent GP priors, each having product kernel, is a D×D matrix of independent Gaussian random variables.

Proof. The Jacobian of the vector-valued function f(x) is a matrix J with elements J_{ij}(x) = \partial f_i(x) / \partial x_j. Because the GPs on each output dimension f_1(x), f_2(x), \ldots, f_D(x) are independent by construction, it follows that each row of J is independent. Lemma 5.2.1 shows that the elements of each row are independent Gaussian. Thus all entries in the Jacobian of a GP-distributed transform are independent Gaussian random variables.

Theorem 5.2.3. The Jacobian of a deep GP with a product kernel is a product of independent Gaussian matrices, with each entry in each matrix being drawn independently.

Proof. When composing L different functions, we denote the immediate Jacobian of the function mapping from layer ℓ − 1 to layer ℓ as J^{(\ell)}(x), and the Jacobian of the entire composition of L functions by J^{(1:L)}(x). By the multivariate chain rule, the Jacobian of a composition of functions is given by the product of the immediate Jacobian matrices of each function. Thus the Jacobian of the composed (deep) function f^{(L)}(f^{(L-1)}(\ldots f^{(3)}(f^{(2)}(f^{(1)}(x)))\ldots)) is

J^{(1:L)}(x) = J^{(L)} J^{(L-1)} \cdots J^{(3)} J^{(2)} J^{(1)}.   (5.15)

By lemma 5.2.2, each J^{(\ell)}_{i,j} \overset{\text{ind}}{\sim} \mathcal{N}, so the complete Jacobian is a product of independent Gaussian matrices, with each entry of each matrix drawn independently.

This result allows us to analyze the representational properties of a deep Gaussian process by examining the properties of products of independent Gaussian matrices.
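Lemma 5.2.1 can also be checked numerically by finite-differencing equation (5.13) for a product of SE kernels; in the sketch below the dimension, variance and lengthscales are arbitrary:

import numpy as np

def se_ard(x, xp, sigma2, w):
    # Product of one-dimensional SE kernels, as in the prior of equation (5.14).
    return sigma2 * np.exp(-0.5 * np.sum((x - xp) ** 2 / w ** 2))

def cross_derivative(k, x, i, j, h=1e-4):
    # Central finite-difference estimate of d^2 k(x, x') / dx_i dx'_j evaluated at x = x'.
    ei, ej = np.eye(len(x))[i], np.eye(len(x))[j]
    return (k(x + h * ei, x + h * ej) - k(x + h * ei, x - h * ej)
            - k(x - h * ei, x + h * ej) + k(x - h * ei, x - h * ej)) / (4.0 * h ** 2)

D, sigma2 = 3, 2.0
w = np.array([0.5, 1.0, 2.0])
x = np.random.default_rng(0).normal(size=D)
k = lambda a, b: se_ard(a, b, sigma2, w)

C = np.array([[cross_derivative(k, x, i, j) for j in range(D)] for i in range(D)])
print(np.round(C, 4))        # approximately diag(sigma2 / w**2): the partial derivatives are uncorrelated
print(sigma2 / w ** 2)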
5.3 Formalizing a pathology

A common use of deep neural networks is building useful representations of data manifolds. What properties make a representation useful? Rifai et al. (2011a) argued that good representations of data manifolds are invariant in directions orthogonal to the data manifold. They also argued that, conversely, a good representation must also change in directions tangent to the data manifold, in order to preserve relevant information. Figure 5.4 visualizes a representation having these two properties.

[Figure 5.4: Representing a 1-D data manifold. Colors are a function of the computed representation of the input space. The representation (blue & green) changes little in directions orthogonal to the manifold (white), making it robust to noise in those directions. The representation also varies in directions tangent to the data manifold, preserving information for later layers.]

As in Rifai et al. (2011b), we characterize the representational properties of a function by the singular value spectrum of the Jacobian. The number of relatively large singular values of the Jacobian indicates the number of directions in data-space in which the representation varies significantly. Figure 5.5 shows the distribution of the singular value spectrum of draws from 5-dimensional deep GPs of different depths.² As the nets get deeper, the largest singular value tends to dominate, implying there is usually only one effective degree of freedom in the representations being computed.

[Figure 5.5: The distribution of normalized singular values of the Jacobian of a function drawn from a 5-dimensional deep GP prior 25 layers deep (Left) and 50 layers deep (Right). As nets get deeper, the largest singular value tends to become much larger than the others. This implies that with high probability, these functions vary little in all directions but one, making them unsuitable for computing representations of manifolds of more than one dimension.]

Figure 5.6 demonstrates a related pathology that arises when composing functions to produce a deep density model. The density in the observed space eventually becomes locally concentrated onto one-dimensional manifolds, or filaments. This again suggests that, when the width of the network is relatively small, deep compositions of independent functions are unsuitable for modeling manifolds whose underlying dimensionality is greater than one.

To visualize this pathology in another way, figure 5.7 illustrates a color-coding of the representation computed by a deep GP, evaluated at each point in the input space. After 10 layers, we can see that locally, there is usually only one direction that one can move in x-space in order to change the value of the computed representation, or to cross a decision boundary. This means that such representations are likely to be unsuitable for decision tasks that depend on more than one property of the input.

To what extent are these pathologies present in the types of neural networks commonly used in practice? In simulations, we found that for deep functions with a fixed hidden dimension D, the singular value spectrum remained relatively flat for hundreds of layers as long as D > 100. Thus, these pathologies are unlikely to severely affect the relatively shallow, wide networks most commonly used in practice.

² Rifai et al. (2011b) analyzed the Jacobian at the location of the training points, but because the priors we are examining are stationary, the distribution of the Jacobian is identical everywhere.
5.4 Fixing the pathology

As suggested by Neal (1995, chapter 2), we can fix the pathologies exhibited in figures 5.6 and 5.7 by simply making each layer depend not only on the output of the previous layer, but also on the original input x. We refer to these models as input-connected networks, and denote deep functions having this architecture with the subscript C, as in f_C(x). Formally, this functional dependence can be written as

f_C^{(1:L)}(x) = f^{(L)}\left(f_C^{(1:L-1)}(x),\; x\right), \quad \forall L   (5.16)

[Figure 5.6: Points warped by a function drawn from a deep GP prior. Top left: Points drawn from a 2-dimensional Gaussian distribution, color-coded by their location. Subsequent panels: Those same points, successively warped by compositions of functions drawn from a GP prior. As the number of layers increases, the density concentrates along one-dimensional filaments. Warpings using random finite neural networks exhibit the same pathology, but also tend to concentrate density into 0-dimensional manifolds (points) due to saturation of all of the hidden units.]

[Figure 5.7: A visualization of the feature map implied by a function f drawn from a deep GP. Colors are a function of the 2D representation y = f(x) that each point is mapped to. The number of directions in which the color changes rapidly corresponds to the number of large singular values in the Jacobian. Just as the densities in figure 5.6 became locally one-dimensional, there is usually only one direction that one can move x in locally to change y. This means that f is unlikely to be a suitable representation for decision tasks that depend on more than one aspect of x. Also note that the overall shape of the mapping remains the same as the number of layers increases. For example, a roughly circular shape remains in the top-left corner even after 40 independent warpings.]

Figure 5.8 shows a graphical representation of the two connectivity architectures.

[Figure 5.8: Two different architectures for deep neural networks. Left: The standard architecture connects each layer's outputs to the next layer's inputs. Right: The input-connected architecture also connects the original input x to each layer.]

Similar connections between non-adjacent layers can also be found in the primate visual cortex (Maunsell and van Essen, 1983). Visualizations of the resulting prior on functions are shown in figures 5.9, 5.10 and 5.12.

[Figure 5.9: A draw from a 1D deep GP prior having each layer also connected to the input, shown at depths of 1, 2, 5 and 10 layers. The x-axis is the same for all plots. Even after many layers, the functions remain relatively smooth in some regions, while varying rapidly in other regions. Compare to the standard-connectivity deep GP draws shown in figure 5.3.]

The Jacobian of an input-connected deep function is defined by the recurrence

J_C^{(1:L)} = J^{(L)} \begin{bmatrix} J_C^{(1:L-1)} \\ I_D \end{bmatrix}   (5.17)

where I_D is a D-dimensional identity matrix. Thus the Jacobian of an input-connected deep GP is a product of independent Gaussian matrices, each with an identity matrix appended. Figure 5.11 shows that with this architecture, even 50-layer deep GPs have well-behaved singular value spectra.
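The effect of the identity block in equation (5.17) on the singular value spectrum can be sketched with products of random Gaussian matrices standing in for the independent Gaussian Jacobians of theorem 5.2.3 (the dimension, depth and number of Monte Carlo draws below are arbitrary, and rescaling every matrix by a constant leaves the normalized spectrum unchanged):

import numpy as np

rng = np.random.default_rng(0)
D, depth, n_draws = 5, 50, 300

def normalized_spectrum(input_connected):
    J = rng.normal(size=(D, D))                       # Jacobian of the first layer
    for _ in range(depth - 1):
        if input_connected:
            # Equation (5.17): each subsequent layer multiplies a D x 2D Gaussian
            # matrix onto [previous Jacobian; identity].
            J = rng.normal(size=(D, 2 * D)) @ np.vstack([J, np.eye(D)])
        else:
            J = rng.normal(size=(D, D)) @ J           # standard connectivity, as in equation (5.15)
    s = np.linalg.svd(J, compute_uv=False)
    return s / s.max()

for connected in (False, True):
    spectrum = np.mean([normalized_spectrum(connected) for _ in range(n_draws)], axis=0)
    label = "input-connected" if connected else "standard       "
    print(label, np.round(spectrum, 3))

In the standard case only the leading normalized singular value remains appreciable at this depth, whereas the input-connected product keeps several directions of sensitivity, mirroring figures 5.5 and 5.11.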
The pathology examined in this section is an example of the sort of analysis made possible by a well-defined prior on functions. The figures and analysis done in this section could be done using Bayesian neural networks with finite numbers of nodes, but would be more difficult. In particular, care would need to be taken to ensure that the networks do not produce degenerate mappings due to saturation of the hidden units.

[Figure 5.10: Points warped by a draw from a deep GP with each layer connected to the input x. As depth increases, the density becomes more complex without concentrating only along one-dimensional filaments.]

[Figure 5.11: The distribution of singular values drawn from 5-dimensional input-connected deep GP priors, 25 layers deep (Left) and 50 layers deep (Right). Compared to the standard architecture, the singular values are more likely to remain the same size as one another, meaning that the model outputs are more often sensitive to several directions of variation in the input.]

[Figure 5.12: The feature map implied by a function f drawn from a deep GP prior with each layer also connected to the input x, visualized at various depths. Compare to the map shown in figure 5.7. In the mapping shown here there are sometimes two directions that one can move locally in x in order to change the value of f(x). This means that the input-connected prior puts significant probability mass on a greater variety of types of representations, some of which depend on all aspects of the input.]

5.5 Deep kernels

Bengio et al. (2006) showed that kernel machines have limited generalization ability when they use "local" kernels such as the squared-exp. However, as shown in chapters 2, 3 and 6, structured, non-local kernels can allow extrapolation. Another way to build non-local kernels is by composing fixed feature maps, creating deep kernels. To return to an example given in section 2.7, periodic kernels can be viewed as a 2-layer-deep kernel, in which the first layer maps x → [sin(x), cos(x)], and the second layer maps through the basis functions corresponding to the implicit feature map giving rise to an SE kernel.

This section builds on the work of Cho and Saul (2009), who derived several kinds of deep kernels by composing multiple layers of feature mappings.

In principle, one can compose the implicit feature maps of any two kernels k_a and k_b to get a new kernel, which we denote by (k_b ∘ k_a):

k_a(x, x') = h_a(x)^{\mathsf{T}} h_a(x')   (5.18)

k_b(x, x') = h_b(x)^{\mathsf{T}} h_b(x')   (5.19)

(k_b \circ k_a)(x, x') = k_b\left(h_a(x), h_a(x')\right) = \left[h_b(h_a(x))\right]^{\mathsf{T}} h_b(h_a(x'))   (5.20)

However, this composition might not always have a closed form if the number of hidden features of either kernel is infinite.

Fortunately, composing the squared-exp (SE) kernel with the implicit mapping given by any other kernel has a simple closed form. If k(x, x') = h(x)^{\mathsf{T}} h(x'), then

(\text{SE} \circ k)(x, x') = k_{\text{SE}}\left(h(x), h(x')\right)   (5.21)

= \exp\left(-\tfrac{1}{2}\left\|h(x) - h(x')\right\|_2^2\right)   (5.22)

= \exp\left(-\tfrac{1}{2}\left[h(x)^{\mathsf{T}} h(x) - 2 h(x)^{\mathsf{T}} h(x') + h(x')^{\mathsf{T}} h(x')\right]\right)   (5.23)

= \exp\left(-\tfrac{1}{2}\left[k(x, x) - 2 k(x, x') + k(x', x')\right]\right).   (5.24)

This formula expresses the composed kernel (SE ∘ k) exactly in terms of evaluations of the original kernel k.
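For instance, starting from a unit-variance one-dimensional SE kernel and applying equation (5.24) repeatedly (a minimal sketch; the inputs and the number of compositions are arbitrary choices):

import numpy as np

def se(x, xp, lengthscale=1.0):
    # Unit-variance squared-exp kernel, so k(x, x) = 1.
    return np.exp(-0.5 * (x - xp) ** 2 / lengthscale ** 2)

def compose_with_se(k):
    # The composed kernel (SE o k) of equation (5.24), built purely from evaluations of k.
    return lambda x, xp: np.exp(-0.5 * (k(x, x) - 2.0 * k(x, xp) + k(xp, xp)))

k2 = compose_with_se(se)        # SE o SE
k3 = compose_with_se(k2)        # SE o SE o SE

xs = [0.0, 1.0, 2.0, 3.0, 4.0]
print([round(se(0.0, x), 3) for x in xs])
print([round(k2(0.0, x), 3) for x in xs])
print([round(k3(0.0, x), 3) for x in xs])   # values at distant inputs drift upward with each composition

Repeating the composition drives the kernel towards a constant, which is the degeneracy examined next.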
[Figure 5.13: Left: Input-connected deep kernels of different depths. By connecting the input x to each layer, the kernel can still depend on its input even after arbitrarily many layers of composition. Right: Draws from GPs with deep input-connected kernels of 1, 2, 3 and ∞ layers.]

5.5.1 Infinitely deep kernels

What happens when one repeatedly composes feature maps many times, starting with the squared-exp kernel? If the output variance of the SE is normalized to k(x, x) = 1, then the infinite limit of composition with SE converges to (SE ∘ SE ∘ … ∘ SE)(x, x') = 1 for all pairs of inputs. A constant covariance corresponds to a prior on constant functions f(x) = c.

As above, we can overcome this degeneracy by connecting the input x to each layer. To do so, we concatenate the composed feature vector at each layer, h^{(1:\ell)}(x), with the input vector x to produce an input-connected deep kernel k_C^{(1:L)}, defined by:

k_C^{(1:\ell+1)}(x, x') = \exp\left(-\tfrac{1}{2}\left\|\begin{bmatrix} h^{(1:\ell)}(x) \\ x \end{bmatrix} - \begin{bmatrix} h^{(1:\ell)}(x') \\ x' \end{bmatrix}\right\|_2^2\right)
= \exp\left(-\tfrac{1}{2}\left[k_C^{(1:\ell)}(x, x) - 2 k_C^{(1:\ell)}(x, x') + k_C^{(1:\ell)}(x', x')\right] - \tfrac{1}{2}\left\|x - x'\right\|_2^2\right)   (5.25)

Starting with the squared-exp kernel, this repeated mapping satisfies

k_C^{(1:\infty)}(x, x') - \log\left(k_C^{(1:\infty)}(x, x')\right) = 1 + \tfrac{1}{2}\left\|x - x'\right\|_2^2.   (5.26)

The solution to this recurrence is related to the Lambert W function (Corless et al., 1996) and has no closed form. In one input dimension, it has a similar shape to the Ornstein-Uhlenbeck covariance OU(x, x') = exp(−|x − x'|) but with lighter tails. Samples from a GP prior having this kernel are not differentiable, and are locally fractal. Figure 5.13 shows this kernel at different depths, as well as samples from the corresponding GP priors.

One can also consider two related connectivity architectures: one in which each layer is connected to the output layer, and another in which every layer is connected to all subsequent layers. It is easy to show that in the limit of infinite depth of composing SE kernels, both these architectures converge to k(x, x') = δ(x, x'), the white noise kernel.
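Because the base SE kernel is normalized so that k(x, x) = 1 at every depth, each step of equation (5.25) reduces, for a fixed pair of inputs, to the scalar update v ↦ exp(v − 1 − ||x − x'||²/2). The following sketch iterates this update to convergence and checks it against the fixed-point relation (5.26); the inputs and depth are arbitrary:

import numpy as np

def deep_input_connected_kernel(x, xp, depth):
    # Iterate the recurrence (5.25) for a normalized 1-D SE base kernel.
    d2 = float(np.sum((np.atleast_1d(x) - np.atleast_1d(xp)) ** 2))
    v = np.exp(-0.5 * d2)                  # depth 1: a plain unit-variance SE kernel
    for _ in range(depth - 1):
        v = np.exp(v - 1.0 - 0.5 * d2)     # k(x, x) stays 1, so only this scalar recursion remains
    return v

for r in (0.5, 1.0, 2.0):
    k_inf = deep_input_connected_kernel(0.0, r, depth=200)
    # At the fixed point, k - log(k) should equal 1 + r^2 / 2, as in equation (5.26).
    print(r, round(k_inf, 4), round(k_inf - np.log(k_inf), 4), round(1.0 + 0.5 * r ** 2, 4))

In one dimension the resulting values decay faster with distance than the OU kernel exp(−|x − x'|), consistent with the lighter tails noted above.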
5.5.2 When are deep kernels useful models?

Kernels correspond to fixed feature maps, and so kernel learning is an example of implicit representation learning. As we saw in chapters 2 and 3, kernels can capture rich structure and can enable many types of generalization. We believe that the relatively uninteresting properties of the deep kernels derived in this section simply reflect the fact that an arbitrary computation, even if it is "deep", is not likely to give rise to a useful representation unless combined with learning. To put it another way, any fixed representation is unlikely to be useful unless it has been chosen specifically for the problem at hand.

5.6 Related work

Deep Gaussian processes

Neal (1995, chapter 2) explored properties of arbitrarily deep Bayesian neural networks, including those that would give rise to deep GPs. He noted that infinitely deep random neural networks without extra connections to the input would be equivalent to a Markov chain, and therefore would lead to degenerate priors on functions. He also suggested connecting the input to each layer in order to fix this problem. Much of the analysis in this chapter can be seen as a more detailed investigation, and vindication, of these claims.

The first instance of deep GPs being used in practice was Lawrence and Moore (2007), who presented a model called "hierarchical GP-LVMs", in which time was mapped through a composition of multiple GPs to produce observations.

The term "deep Gaussian processes" was first used by Damianou and Lawrence (2013), who developed a variational inference method, analyzed the effect of automatic relevance determination, and showed that deep GPs could learn with relatively little data. They used the term "deep GP" to refer both to supervised models (compositions of GPs) and to unsupervised models (compositions of GP-LVMs). This conflation may be reasonable, since the activations of the hidden layers are themselves latent variables, even in supervised settings: depending on kernel parameters, each latent variable may or may not depend on the layer below.

In general, supervised models can also be latent-variable models. For example, Wang and Neal (2012) investigated single-layer GP regression models that had additional latent inputs.

Nonparametric neural networks

Adams et al. (2010) proposed a prior on arbitrarily deep Bayesian networks having an unknown and unbounded number of parametric hidden units in each layer. Their architecture has connections only between adjacent layers, and so may have similar pathologies to the one discussed in this chapter as the number of layers increases.

Wilson et al. (2012) introduced Gaussian process regression networks, which are defined as a matrix product of draws from GP priors, rather than a composition. These networks have the form:

y(x) = W(x) f(x) \quad \text{where each } f_d, W_{d,j} \overset{\text{iid}}{\sim} \mathcal{GP}(0, \text{SE} + \text{WN}).   (5.27)

We can easily define a "deep" Gaussian process regression network:

y(x) = W^{(3)}(x)\, W^{(2)}(x)\, W^{(1)}(x)\, f(x)   (5.28)

which repeatedly adds and multiplies functions drawn from GPs, in contrast to deep GPs, which repeatedly compose functions. This prior on functions has a similar form to the Jacobian of a deep GP (equation (5.15)), and so might be amenable to a similar analysis to that of section 5.2.

Information-preserving architectures

Deep density networks (Rippel and Adams, 2013) are constructed through a series of parametric warpings of fixed dimension, with penalty terms encouraging the preservation of information about lower layers. This is another promising approach to fixing the pathology discussed in section 5.3.

Recurrent networks

Bengio et al. (1994) and Pascanu et al. (2012) analyzed a related problem with gradient-based learning in recurrent networks, the "exploding-gradients" problem. They noted that in recurrent neural networks, the size of the training gradient can grow or shrink exponentially as it is back-propagated, making gradient-based training difficult.

Hochreiter and Schmidhuber (1997) addressed the exploding-gradients problem by introducing hidden units designed to have stable gradients. This architecture is known as long short-term memory.

Deep kernels

The first systematic examination of deep kernels was done by Cho and Saul (2009), who derived closed-form composition rules for SE, polynomial, and arc-cosine kernels, and showed that deep arc-cosine kernels performed competitively in machine-vision applications when used in an SVM.

Hermans and Schrauwen (2012) constructed deep kernels in a time-series setting, constructing kernels corresponding to infinite-width recurrent neural networks. They also proposed concatenating the implicit feature vectors from previous time-steps with the current inputs, resulting in an architecture analogous to the input-connected architecture proposed by Neal (1995, chapter 2).
Analyses of deep learning

Montavon et al. (2010) performed a layer-wise analysis of deep networks, and noted that the performance of MLPs degrades as the number of layers with random weights increases, a result consistent with the analysis of this chapter.

The experiments of Saxe et al. (2011) suggested that most of the good performance of convolutional neural networks could be attributed to the architecture alone. Later, Saxe et al. (2013) looked at the dynamics of gradient-based training methods in deep linear networks as a tractable approximation to standard deep (nonlinear) neural networks.

Source code

Source code to produce all figures is available at http://www.github.com/duvenaud/deep-limits. This code is also capable of producing visualizations of mappings such as figures 5.7 and 5.12 using neural nets instead of GPs at each layer.

5.7 Conclusions

This chapter demonstrated that well-defined priors allow explicit examination of the assumptions being made about the functions being modeled by different neural network architectures. As an example of the sort of analysis made possible by this approach, we attempted to gain insight into the properties of deep neural networks by characterizing the sorts of functions likely to be obtained under different choices of priors on compositions of functions.

First, we identified deep Gaussian processes as an easy-to-analyze model corresponding to multi-layer perceptrons having nonparametric activation functions. We then showed that representations based on repeated composition of independent functions exhibit a pathology where the representations become invariant to all but one direction of variation. Finally, we showed that this problem could be alleviated by connecting the input to each layer. We also examined properties of deep kernels, corresponding to arbitrarily many compositions of fixed feature maps.

Much recent work on deep networks has focused on weight initialization (Martens, 2010), regularization (Lee et al., 2007) and network architecture (Gens and Domingos, 2013). If we can identify priors that give our models desirable properties, these might in turn suggest regularization, initialization, and architecture choices that also provide such properties.

Existing neural network practice also requires expensive tuning of model hyperparameters such as the number of layers, the size of each layer, and regularization penalties by cross-validation. One advantage of deep GPs is that the approximate marginal likelihood allows a principled method for automatically determining such model choices.

Chapter 6

Additive Gaussian Processes

Chapter 3 showed how to learn the structure of a kernel by building it up piece-by-piece. This chapter presents an alternative approach: starting with many different types of structure in a kernel, and adjusting kernel parameters to discard whatever structure is not present in the current dataset. The advantage of this approach is that we do not need to run an expensive discrete-and-continuous search in order to build a structured model, and implementation is simpler.

This model, which we call additive Gaussian processes, is a sum of functions of all possible combinations of input variables. This model can be specified by a weighted sum of all possible products of one-dimensional kernels.

There are 2^D combinations of D objects, so naïve computation of this kernel is intractable. Furthermore, if each term has different kernel parameters, fitting or integrating over so many parameters is difficult.
To address these problems, we introduce a restricted parameterization of the kernel which allows efficient evaluation of all interaction terms, while still allowing a different weighting of each order of interaction. Empirically, this model has good predictive performance in regression tasks, and its parameters are relatively interpretable. This model also has an interpretation as an approximation to dropout, a recently-introduced regularization method for neural networks.

The work in this chapter was done in collaboration with Hannes Nickisch and Carl Rasmussen, who derived and coded up the additive kernel. My role in the project was to examine the properties of the resulting model, to clarify the connections to existing methods, to create all figures and to run all experiments. That work was published in Duvenaud et al. (2011). The connection to dropout regularization in section 6.4 is an independent contribution.

6.1 Different types of multivariate additive structure

Chapter 2 showed how additive structure in a GP prior enabled extrapolation in multivariate regression problems. In general, models of the form

f(x) = g\big(f(x_1) + f(x_2) + \cdots + f(x_D)\big)   (6.1)

are widely used in machine learning and statistics, partly for this reason, and partly because they are relatively easy to fit and interpret. Examples include logistic regression, linear regression, generalized linear models (Nelder and Wedderburn, 1972) and generalized additive models (Hastie and Tibshirani, 1990).

At the other end of the spectrum are models which allow the response to depend on all input variables simultaneously, without any additive decomposition:

f(x) = f(x_1, x_2, \ldots, x_D)   (6.2)

An example would be a GP with an SE-ARD kernel. Such models are much more flexible than those having the form (6.1), but this flexibility can make it difficult to generalize to new combinations of input variables.

In between these extremes are function classes depending on pairs or triplets of inputs, such as

f(x_1, x_2, x_3) = f_{12}(x_1, x_2) + f_{23}(x_2, x_3) + f_{13}(x_1, x_3).   (6.3)

We call the number of input variables appearing in each term the order of that term. Models containing terms only of intermediate order, such as (6.3), allow more flexibility than models of the form (6.1) (first-order), but have more structure than those of the form (6.2) (D-th order).

Capturing the low-order additive structure present in a function can be expected to improve predictive accuracy. However, if the function being learned depends in some way on an interaction between all input variables, a D-th order term is required in order for the model to be consistent.

6.2 Defining additive kernels

To define the additive kernels introduced in this chapter, we first assign each dimension i ∈ {1, …, D} a one-dimensional base kernel k_i(x_i, x'_i). Then the first-order, second-order and n-th order additive kernels are defined as:

k_{add_1}(x, x') = \sigma_1^2 \sum_{i=1}^{D} k_i(x_i, x'_i)   (6.4)

k_{add_2}(x, x') = \sigma_2^2 \sum_{i=1}^{D}\sum_{j=i+1}^{D} k_i(x_i, x'_i)\, k_j(x_j, x'_j)   (6.5)

k_{add_n}(x, x') = \sigma_n^2 \sum_{1 \le i_1 < i_2 < \cdots < i_n \le D} \;\prod_{d=1}^{n} k_{i_d}(x_{i_d}, x'_{i_d})   (6.6)
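A direct, intentionally naive sketch of these definitions (using the same SE base kernel on every dimension for simplicity) evaluates the n-th order kernel by explicitly summing over all size-n subsets of dimensions, which is exactly the binomial-coefficient cost that the restricted parameterization introduced in this chapter is designed to avoid:

import numpy as np
from itertools import combinations

def base_kernel(xi, xpi, lengthscale=1.0):
    # One-dimensional SE base kernel k_i(x_i, x_i'); for simplicity the same
    # lengthscale is used for every dimension.
    return np.exp(-0.5 * (xi - xpi) ** 2 / lengthscale ** 2)

def additive_kernel(x, xp, order, sigma2=1.0):
    # n-th order additive kernel of equation (6.6): a sum over all subsets of
    # `order` dimensions of the product of their base kernels.
    D = len(x)
    total = 0.0
    for subset in combinations(range(D), order):
        term = 1.0
        for d in subset:
            term *= base_kernel(x[d], xp[d])
        total += term
    return sigma2 * total

x = np.array([0.3, -1.2, 0.7, 2.0])
xp = np.array([0.1, -0.5, 1.5, 1.0])
print(additive_kernel(x, xp, order=1))   # first-order kernel, equation (6.4)
print(additive_kernel(x, xp, order=2))   # second-order kernel, equation (6.5)
print(additive_kernel(x, xp, order=4))   # D-th order: the product of all base kernels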