Bayesian Methods for Gravitational Waves and Neural Networks

Philip Graff
Cavendish Astrophysics and Queens' College
University of Cambridge

A thesis submitted for the degree of Doctor of Philosophy
Submitted 06 June 2012
In final form 25 June 2012

Declaration

This dissertation is the result of work carried out in the Astrophysics Group of the Cavendish Laboratory, Cambridge, between October 2008 and June 2012. Except where explicit reference is made to the work of others, the work contained in this dissertation is my own, and is not the outcome of work done in collaboration. No part of this dissertation has been submitted for a degree, diploma or other qualification at this or any other university. The total length of this dissertation does not exceed sixty thousand words.

Philip Graff
25 June 2012

Cambridge was the place for someone from the Colonies or the Dominions to go on to, and it was to the Cavendish Laboratory that one went to do physics.
Sir Aaron Klug

I would like to dedicate this thesis to my loving parents, who have supported and encouraged me through the years.

Acknowledgements

I acknowledge financial support from the Gates Cambridge Trust, founded by the Bill and Melinda Gates Foundation, which provided me with the opportunity to study in Cambridge. The Cambridge Philosophical Society, Queens' College, and the Astrophysics Group of the Cavendish Laboratory have also provided financial support for my studies and presentations at conferences.

I am very grateful to Anthony Lasenby and Mike Hobson for their continued encouragement and help over the past three and a half years, both in my projects and in how to perform proper scientific research. I also thank them for proof-reading this thesis and providing useful feedback. I would also like to thank Farhan Feroz for all of the help and advice he has given me. Without him, much of this work could not have been completed in the time it was.
I thank David Titterington for help with local computing issues and Stuart Rankin and Andrey Kaliazin for technical support with the Darwin and COSMOS supercomputing facilities.

I am very happy to have spent the last three and a half years in the Astrophysics Group, whose members past and present have been great friends. Whether in the office, in the tea room, at the pub after a hard day's work, at a party, or elsewhere, they have been wonderful compatriots in sharing complaints, jokes, and life as an astrophysicist. The members of Queens' College MCR, and especially my flatmate of two years Robert Lowe, also deserve great thanks for making my time in Cambridge as fantastic as it has been. My other friends in and around Cambridge – fellow Gates Cambridge Scholars, the Newnham College MCR matriculation class of 2009, the Gaelic Athletics Club, the Cambridge Royals baseball team, and many more – have also been great sources of fun and support, without whom my Cambridge experience would have been much more boring. Lastly, I would like to thank my parents, siblings, and friends back in the USA for their continued support throughout my PhD.

LIGO Scientific Collaboration Acknowledgement

The second chapter of this thesis contains analysis of LIGO data performed by the author as a member of the LIGO Scientific Collaboration. The author gratefully acknowledges the support of the United States National Science Foundation for the construction and operation of the LIGO Laboratory and the Science and Technology Facilities Council of the United Kingdom, the Max-Planck-Society, and the State of Niedersachsen/Germany for support of the construction and operation of the GEO600 detector.
The authors also gratefully acknowledge the support of the research by these agencies and by the Australian Research Council, the Council of Scientific and Industrial Research of India, the Istituto Nazionale di Fisica Nucleare of Italy, the Spanish Ministerio de Educación y Ciencia, the Conselleria d'Economia, Hisenda i Innovació of the Govern de les Illes Balears, the Scottish Funding Council, the Scottish Universities Physics Alliance, the National Aeronautics and Space Administration, the Carnegie Trust, the Leverhulme Trust, the David and Lucile Packard Foundation, the Research Corporation, and the Alfred P. Sloan Foundation.

Abstract

Einstein's general theory of relativity has withstood 100 years of testing and will soon be facing one of its toughest challenges. In a few years we expect to be entering the era of the first direct observations of gravitational waves. These are tiny perturbations of space-time that are generated by accelerating matter and affect the measured distances between two points. Observations of these using laser interferometers, which are the most sensitive length-measuring devices in the world, will allow us to test models of interactions in the strong-field regime of gravity and eventually general relativity itself.

I apply the tools of Bayesian inference to the examination of gravitational wave data from the LIGO and Virgo detectors. This is used for signal detection and estimation of the source parameters. I quantify the ability of a network of ground-based detectors to localise a source position on the sky for electromagnetic follow-up. Bayesian criteria are also applied to separating real signals from glitches in the detectors. These same tools and lessons can also be applied to the type of data expected from planned space-based detectors. Using simulations from the Mock LISA Data Challenges, I analyse our ability to detect and characterise both burst and continuous signals.
The two seemingly different signal types will be overlapping and confused with one another for a space-based detector; my analysis shows that we will be able to separate and identify many of the signals present.

Data sets and astrophysical models are continuously increasing in complexity. This will create an additional computational burden for performing Bayesian inference and other types of data analysis. I investigate the application of the MOPED algorithm for faster parameter estimation and data compression. I find that its shortcomings make it a less favourable candidate for further implementation.

The framework of an artificial neural network is a simple model for the structure of a brain which can "learn" functional relationships between sets of inputs and outputs. I describe an algorithm developed for the training of feed-forward networks on pre-calculated data sets. The trained networks can then be used for fast prediction of outputs for new sets of inputs. After demonstrating its capabilities on toy data sets, I apply the network to classifying handwritten digits from the MNIST database and to measuring ellipticities of galaxies in the Mapping Dark Matter challenge.

The power of neural networks for learning and rapid prediction is also useful in Bayesian inference where the likelihood function is computationally expensive. The new BAMBI algorithm is detailed, in which our network training algorithm is combined with the nested sampling algorithm MULTINEST to provide rapid Bayesian inference. Using samples from the normal inference, a network is trained on the likelihood function and eventually used in its place. This provides a significant increase in the speed of Bayesian inference while returning identical results. The trained networks can then be used for extremely rapid follow-up analyses with different priors, obtaining speed increases of orders of magnitude.
Learning how to apply the tools of Bayesian inference for the optimal recovery of gravitational wave signals will provide the most scientific information when the first detections are made. Complementary to this, the improvement of our analysis algorithms to provide the best results in less time will make analysis of larger and more complicated models and data sets practical.

Contents

1 Introduction
  1.1 Gravitational Waves
    1.1.1 General Relativity Theory
    1.1.2 Gravitational Wave Sources
    1.1.3 Gravitational Wave Detectors
  1.2 Bayesian Inference
    1.2.1 Bayes' Theorem
    1.2.2 MCMC and Nested Sampling
  1.3 Thesis Outline
2 Analysis of LIGO Data
  2.1 Simple Inspiral Model
    2.1.1 Generating the Signal
    2.1.2 Calculating the Likelihood
    2.1.3 MULTINEST Results
  2.2 LIGO Sky Localisation
    2.2.1 The Injections
    2.2.2 Analysis and Results
  2.3 The "Big Dog" Event and Coherence Tests
  2.4 Conclusion
3 The Mock LISA Data Challenges
  3.1 Cosmic String Cusps
    3.1.1 Burst waveform models
    3.1.2 MLDC challenge 3.4 search
    3.1.3 Model selection using Bayesian evidence
    3.1.4 Discussion
  3.2 Galactic White Dwarf Binaries
    3.2.1 Data Model
    3.2.2 Results
    3.2.3 Discussion
  3.3 Cosmic Strings in the Presence of a Galactic Foreground
    3.3.1 Data and Signal Models
    3.3.2 Comparison of Results
  3.4 Discussion
4 An Investigation into the MOPED Algorithm
  4.1 The Likelihood Function
    4.1.1 A Simple Iterative Approach
  4.2 The MOPED Algorithm
    4.2.1 Data Compression with MOPED
    4.2.2 Speed Gains with MOPED
  4.3 Simple Example With One Parameter
  4.4 Recovery in a 2 Parameter Case
  4.5 Discussion
5 Artificial Neural Networks
  5.1 Network Structure
    5.1.1 Choosing the Number of Hidden Layer Nodes
  5.2 Network Training
    5.2.1 Overview
    5.2.2 Data Whitening
    5.2.3 Finding the next step
    5.2.4 Convergence
    5.2.5 Autoencoders and Pre-Training
    5.2.6 The Evidence of a Network
  5.3 Toy Examples
    5.3.1 Sinc Function with Noise
    5.3.2 Three-Way Classification
    5.3.3 Autoencoders as Non-Linear PCA
  5.4 MNIST Handwriting Recognition
  5.5 Mapping Dark Matter Challenge
    5.5.1 The Data and Model
    5.5.2 Results of Training
  5.6 Discussion
6 The BAMBI Algorithm
  6.1 The Structure of BAMBI
    6.1.1 When to Use the Trained Network
  6.2 BAMBI Toy Examples
    6.2.1 Eggbox
    6.2.2 Gaussian Shells
    6.2.3 Rosenbrock function
  6.3 Cosmological Parameter Estimation with BAMBI
    6.3.1 Updated Timing Comparisons
  6.4 Using Trained Networks for Follow-up in BAMBI
    6.4.1 Exact Error Calculation
    6.4.2 Fast Error Calculation
  6.5 Summary
7 Conclusions and Future Work
  7.1 Summary of Results
  7.2 Avenues for Future Work
References

Chapter 1
Introduction

I was born not knowing and have had only a little time to change that here and there.
Richard P. Feynman

1.1 Gravitational Waves

Einstein's general theory of relativity [1] (GR) predicts the existence of gravitational waves (GWs), perturbations to the curvature of spacetime that are generated in regions of strong gravity and will allow scientists to observe the universe through an entirely new window. The direct detection of gravitational waves will usher in a new era of astronomy. The GW spectrum is in many ways independent of and complementary to electromagnetic radiation. GWs can be used to probe astrophysical events otherwise unobservable, such as the merger of two black holes, the inner processes of supernovae, and the equation of state of neutron star matter. Projects are currently underway across the world to detect these very weak signals. Indirect evidence for the existence of gravitational waves exists from observations of binary pulsars spiralling in towards each other as the system loses energy due to the emission of GWs [2].

GR has withstood almost 100 years of testing and verification; however, it is now facing one of its toughest challenges in gravitational wave searches. Direct observations may allow for the testing of alternate theories of gravity as a test of the strong-field predictions of general relativity.

1.1.1 General Relativity Theory

In the linearised regime of general relativity, applicable for weak gravitational fields, we can assume a flat Minkowski background metric for the spacetime, given by $\eta_{ab} = \mathrm{diag}(-1,1,1,1)$. The true metric can then be defined as

$$ g_{ab} = \eta_{ab} + h_{ab}, \qquad \| h_{ab} \| \ll 1. \quad (1.1) $$

This constrains us to a weak gravitational field and we will refer to $h_{ab}$ as the metric perturbation.
We can now derive Einstein's equations in the linearised regime, where terms of order $h_{ab}^2$ and above will be ignored as they are significantly smaller than the first-order perturbation. I will only present the main results of this procedure; the full steps can be found in [3]. We calculate the Einstein tensor with this new metric, discarding small terms of $O(h_{ab}^2)$ and higher, and obtain

$$ G_{ab} = \frac{1}{2}\left( \partial^c \partial_b h_{ca} + \partial^c \partial_a h_{bc} - \Box h_{ab} - \partial_a \partial_b h - \eta_{ab}\,\partial_c \partial_d h^{cd} + \eta_{ab}\,\Box h \right). \quad (1.2) $$

We substitute in the trace-reversed perturbation, $\bar{h}_{ab} = h_{ab} - \frac{1}{2}\eta_{ab} h$ (where $h = h^a_{\ a}$), and choose to satisfy the Lorenz gauge condition $\partial^a \bar{h}_{ab} = 0$. Doing so is allowed because there is freedom to choose the gauge (coordinate system) we want to work in without affecting the physical observables of the system. Therefore, the Einstein tensor simplifies to $G_{ab} = -\frac{1}{2}\Box\bar{h}_{ab}$, where $\Box = \partial_c\partial^c = \nabla^2 - \partial_t^2$. The linearised Einstein equation is thus

$$ \Box\bar{h}_{ab} = -16\pi T_{ab}. \quad (1.3) $$

In a vacuum $T_{ab} = 0$ and this equation describes the propagation of plane waves travelling at the speed of light. These are gravitational waves. Furthermore, in addition to the Lorenz gauge we can specify the gauge to be purely spatial and traceless,

$$ h_{tt} = h_{ti} = 0 \quad \text{and} \quad h = 0. \quad (1.4) $$

This now fully specifies our coordinate system. The combination of these choices determines that the perturbation is transverse, meaning that $\partial^i h_{ij} = 0$. Therefore, we call this the transverse-traceless gauge and write it as $h^{TT}_{ij}$. Being traceless, $h^{TT}_{ij} = \bar{h}^{TT}_{ij}$. This gauge choice is very useful and leaves only two independent components of the metric perturbation. When orienting our coordinate system to describe waves travelling along the $z$ axis, these degrees of freedom are given by

$$ h^{TT}_{xx} = -h^{TT}_{yy} \equiv h_+(t-z), \quad (1.5a) $$
$$ h^{TT}_{xy} = h^{TT}_{yx} \equiv h_\times(t-z). \quad (1.5b) $$

[Figure 1.1: The effect of a passing sinusoidal gravitational wave with period T on a ring of particles. Image from www.learner.org.]
The two components can now be seen as two independent polarisations, plus and cross, whose effect on a ring of particles (for a sinusoidal GW source) can be seen in Figure 1.1. The matter source of a gravitational wave will specify the strengths of its two polarisations as a function of time.

We are able to detect gravitational waves because they affect the geodesics that free-falling bodies follow in spacetime. In the transverse-traceless gauge the spacetime coordinates of a body will remain the same, as they move with the waves, but the proper separation of two bodies will change as the metric varies with the passing wave. A straightforward analysis [3] shows that the proper separation of two bodies on the $x$ axis will oscillate with a fractional length change of

$$ \frac{\delta L}{L} \simeq \frac{1}{2} h^{TT}_{xx}(t, z=0), \quad (1.6) $$

from a wave travelling along the $z$ axis. This formula can be generalised for any wave propagation direction and any coordinate separation. Because of this effect, the amplitude of the GW signal is known as the GW strain.

A complete mathematical background and derivation of general relativity, as well as its various applications and extensions, can be found in [4].

1.1.2 Gravitational Wave Sources

If we consider a gravitational wave source that is near the origin of our coordinate system, measure the field at a distance $r$ that is large compared to the spatial extent of the source, and assume that the constituents of the source are moving slowly compared to $c$, then the solution to the linearised Einstein equation can be given by the compact-source approximation (a full derivation and further details on the following analysis can be found in [5]),

$$ \bar{h}^{\mu\nu}(ct,\vec{x}) = -\frac{4G}{c^4 r} \int T^{\mu\nu}(ct-r,\vec{y})\, d^3\vec{y}. \quad (1.7) $$

Taking the different components of $T^{\mu\nu}$ separately, $\int T^{00}\, d^3\vec{y} = Mc^2$ and $\int T^{0i}\, d^3\vec{y} = \int T^{i0}\, d^3\vec{y} = P^i c$.
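To make the strain result concrete, the following sketch (my own illustration, not from the thesis; `ring_displacement` is a hypothetical helper name) applies the fractional length change of Equation (1.6) to a ring of free test particles for a purely plus-polarised wave travelling along the $z$ axis:

```python
import numpy as np

def ring_displacement(h_plus, n_particles=8):
    """Perturbed positions of test particles on a unit ring in the z = 0 plane.

    For the plus polarisation, h_xx^TT = -h_yy^TT = h_plus, so proper
    separations along x stretch by h_plus/2 while those along y shrink by
    the same fraction (Equation (1.6) applied along each axis).
    """
    phi = np.linspace(0.0, 2.0 * np.pi, n_particles, endpoint=False)
    x, y = np.cos(phi), np.sin(phi)
    x_new = x * (1.0 + 0.5 * h_plus)
    y_new = y * (1.0 - 0.5 * h_plus)
    return (x, y), (x_new, y_new)

# Grossly exaggerated strain for illustration; physical strains are ~1e-21.
(x0, y0), (x1, y1) = ring_displacement(h_plus=0.2)
```

Evaluating this at $h_+(t)$ for a sinusoidal wave reproduces the oscillating ellipse pattern of Figure 1.1; swapping the roles of the diagonal and off-diagonal components gives the cross polarisation rotated by 45 degrees.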
These components describe the integrated energy and momentum of the system, respectively, and are therefore constant for a closed system and will not contribute to the varying gravitational wave strain. Additionally, we are able to transform our coordinates to the source's centre-of-momentum frame so that

$$ \bar{h}^{00} = -\frac{4GM}{c^2 r}, \qquad \bar{h}^{0i} = \bar{h}^{i0} = 0. \quad (1.8) $$

Using the transverse property of $T^{\mu\nu}$ and Gauss's law, we can re-write the integral over the remaining components as

$$ \int T^{ij}\, d^3\vec{y} = \frac{1}{2c^2} \frac{d^2}{dt'^2} \int T^{00} y^i y^j\, d^3\vec{y}, \quad (1.9) $$

where $ct' = ct - r$. Therefore we obtain the quadrupole formula

$$ \bar{h}^{ij}(ct,\vec{x}) = -\frac{2G}{c^6 r} \frac{d^2 I^{ij}(ct')}{dt'^2}, \quad (1.10) $$

where we define the quadrupole moment tensor of the source as

$$ I^{ij}(ct) = \int T^{00}(ct,\vec{y})\, y^i y^j\, d^3\vec{y}. \quad (1.11) $$

For a given source, this can be used to determine the far-field gravitational radiation.

As an illustration, let us consider two particles of equal mass $M$ moving non-relativistically in circular orbits of radius $a$ in the $z = 0$ plane about their common centre of mass with angular frequency $\Omega$. Since the particles are moving slowly, we may use the approximation that $T^{00} \approx c^2\rho$, where $\rho$ is the density of the source in its own reference frame. Therefore,

$$ I^{ij}(ct) = c^2 \int \rho(ct,\vec{y})\, y^i y^j\, d^3\vec{y}. \quad (1.12) $$

It is straightforward to show that for this source

$$ I^{ij}(ct) = Mc^2 a^2 \begin{pmatrix} 1+\cos(2\Omega t) & \sin(2\Omega t) & 0 \\ \sin(2\Omega t) & 1-\cos(2\Omega t) & 0 \\ 0 & 0 & 0 \end{pmatrix}. \quad (1.13) $$

Substituting this into Equation (1.10) yields

$$ \bar{h}^{ij}(ct,\vec{x}) = \frac{8GMa^2\Omega^2}{c^4 r} \begin{pmatrix} \cos(2\Omega t') & \sin(2\Omega t') & 0 \\ \sin(2\Omega t') & -\cos(2\Omega t') & 0 \\ 0 & 0 & 0 \end{pmatrix}. \quad (1.14) $$

Considering only the radiative part of $h^{\mu\nu}$ gives

$$ \bar{h}^{\mu\nu}_{\mathrm{rad}}(ct,\vec{x}) = \frac{8GMa^2\Omega^2}{c^4 r} \begin{pmatrix} 0 & 0 & 0 & 0 \\ 0 & \cos(2\Omega t') & \sin(2\Omega t') & 0 \\ 0 & \sin(2\Omega t') & -\cos(2\Omega t') & 0 \\ 0 & 0 & 0 & 0 \end{pmatrix}. \quad (1.15) $$

We should note at this point two important aspects of our solution. First, the amplitude scales as $1/r$, which means that the GW is a spherical wave. However, at large radii it can be approximated as a plane wave.
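As a numerical cross-check of this derivation (my own sketch, not from the thesis), one can differentiate the quadrupole moment (1.13) by finite differences and compare with the closed form (1.14); illustrative parameter values in units with $G = c = 1$ are used:

```python
import numpy as np

# Illustrative values in units with G = c = 1.
M, a, Omega, r = 1.0, 1.0, 0.1, 100.0

def I_xx(t):
    # xx component of the quadrupole moment, Equation (1.13).
    return M * a**2 * (1.0 + np.cos(2.0 * Omega * t))

def hbar_xx_analytic(t):
    # xx component of Equation (1.14): (8 M a^2 Omega^2 / r) cos(2 Omega t').
    return 8.0 * M * a**2 * Omega**2 / r * np.cos(2.0 * Omega * t)

def hbar_xx_numeric(t, dt=1e-3):
    # Quadrupole formula (1.10): hbar_ij = -(2/r) d^2 I_ij / dt'^2,
    # with the second derivative taken by central differences.
    d2I = (I_xx(t + dt) - 2.0 * I_xx(t) + I_xx(t - dt)) / dt**2
    return -2.0 / r * d2I
```

The two agree to the accuracy of the finite-difference stencil, which also makes the twice-orbital-frequency content of the signal explicit: the only time dependence in $I^{ij}$ is through $2\Omega t$.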
Additionally, the $1/r$ scaling implies that a factor of $X$ improvement in detector sensitivity will yield the same factor of $X$ in observable distance, and thus $X^3$ in observable volume. Our second comment on the solution is that the frequency of the GW is twice the orbital frequency of the binary; this is because the radiation is quadrupolar. Monopole and dipole radiation are zero due to the conservation of mass and momentum, respectively.

We now consider the polarisation of the GW received by an observer. For an observer on the $z$ axis, we begin by noting that Equation (1.15) is already transverse-traceless, so $\bar{h}^{\mu\nu}_{TT} = h^{\mu\nu}_{TT}$. Here $h_+(t) \propto \cos(2\Omega t)$ and $h_\times(t) \propto \sin(2\Omega t)$, corresponding to right-handed circularly polarised radiation. For an observer on the $x$ axis, however, we must first convert Equation (1.15) to the transverse-traceless gauge. We zero the non-transverse components and then subtract one-half of the trace from the remaining diagonal components to obtain

$$ (\bar{h}^{TT}_{\mathrm{rad}})^{\mu\nu}(ct,\vec{x}) = \frac{4GMa^2\Omega^2}{c^4 r} \begin{pmatrix} 0 & 0 & 0 & 0 \\ 0 & 0 & 0 & 0 \\ 0 & 0 & -\cos(2\Omega t') & 0 \\ 0 & 0 & 0 & \cos(2\Omega t') \end{pmatrix}. \quad (1.16) $$

We observe that the radiation is entirely in the plus polarisation. It is also important to note that the coefficient in front is a factor of 2 smaller. This, combined with the fact that there is radiation in only one polarisation state, implies that the GW energy emitted in the $z$ direction will be eight times larger than that in the $x$ or $y$ directions, making the emission highly anisotropic. Integrating the energy flux over all angles and averaging over multiple GW periods yields the energy lost by the system due to the emission of gravitational waves,

$$ L_{GW} = \frac{128\, G M^2 a^4 \Omega^6}{5 c^5}. \quad (1.17) $$

This loss of energy is not considered in the initial analysis, where we maintained conservation of mass-energy.
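The projection used to obtain Equation (1.16) (zero the non-transverse components, then subtract half the remaining trace) can be written generically with the transverse projector $P = I - n\,n^T$. This is my own illustrative sketch with the overall amplitude set to 1, not code from the thesis:

```python
import numpy as np

def tt_project(h, n):
    """Project a 3x3 spatial perturbation into the TT gauge for unit
    propagation direction n: make it transverse, then remove the trace."""
    P = np.eye(3) - np.outer(n, n)         # projector onto the transverse plane
    h_t = P @ h @ P                        # transverse part
    return h_t - 0.5 * P * np.trace(h_t)   # traceless part

A = 1.0                                    # stands in for 8GMa^2 Omega^2/(c^4 r)
c2, s2 = np.cos(0.3), np.sin(0.3)          # cos/sin(2 Omega t') at some instant
h = A * np.array([[c2,  s2, 0.0],
                  [s2, -c2, 0.0],
                  [0.0, 0.0, 0.0]])        # spatial part of Equation (1.15)

h_z = tt_project(h, np.array([0.0, 0.0, 1.0]))  # z-axis observer: unchanged
h_x = tt_project(h, np.array([1.0, 0.0, 0.0]))  # x-axis observer
```

Comparing `h_x` with `h` confirms the factor-of-2 amplitude reduction and the single surviving (plus) polarisation noted in the text, while `h_z` is returned unchanged because Equation (1.15) is already TT for propagation along $z$.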
Waveforms defined in this manner must use a quasi-static approximation – the evolution of the orbital radius is defined by the internal energy, and this further defines the orbital period through Kepler's third law. The loss of energy means that the system will evolve inwards (decreasing $a$) with increasing $\Omega$ until collision/merger. The strong dependence of the emitted energy on frequency results in this occurring at an increasing rate and creates a burst of radiation at the end. This is the characteristic "chirp" signal that data analysis pipelines are searching for.

For sources far from merger, the energy lost in gravitational wave emission is low enough that the frequency evolution can be modeled as an initial frequency $f$ and a frequency derivative $\dot{f}$ over a period of observation that is short compared to the time to reach merger. In general, this approximation holds for binaries with large separations. As the binary spirals inwards, further post-Newtonian [6–12] corrections must be made to the waveform models. These are corrections that expand in powers of $v/c$, where $v$ is the transverse velocity of a binary member, and provide more accurate modeling of the phase and amplitude of the GW signal. At higher post-Newtonian orders the spin of the binary members becomes important, and spin-orbit ($L \cdot S_1$ and $L \cdot S_2$ terms) and spin-spin ($S_1 \cdot S_2$ terms) interactions can have significant effects on the waveform (amplitude and frequency modulation and orbit precession) [13–18]. A sample inspiral GW signal is shown in Figure 1.2.

[Figure 1.2: A sample inspiral GW signal that LIGO might observe. Image from www.ligo.org.]

Near to the point of merger, the effective-one-body (EOB) approach to waveform modeling can provide a better template [19–23]. This method reconstitutes the problem as that of a test particle with some spin spiralling in to a Kerr black hole.
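The quasi-static evolution of $f$ and $\dot{f}$ described above can be sketched at leading (Newtonian) order with the standard chirp equation $\dot{f} = \frac{96}{5}\pi^{8/3}(G M_c/c^3)^{5/3} f^{11/3}$, where $M_c$ is the chirp mass. This equation is standard but does not appear explicitly in the text, and the masses and durations below are purely illustrative:

```python
import numpy as np

G, c, M_SUN = 6.674e-11, 2.998e8, 1.989e30   # SI values

def f_dot(f, m_chirp):
    """Leading-order GW frequency derivative for a quasi-circular binary."""
    k = (G * m_chirp / c**3) ** (5.0 / 3.0)
    return (96.0 / 5.0) * np.pi ** (8.0 / 3.0) * k * f ** (11.0 / 3.0)

def evolve(f0, m_chirp, t_end, dt=1e-3):
    """Integrate f(t) forward with simple Euler steps (adequate far from merger)."""
    f, t = f0, 0.0
    while t < t_end:
        f += f_dot(f, m_chirp) * dt
        t += dt
    return f

m_c = 1.219 * M_SUN   # chirp mass of an equal-mass 1.4 + 1.4 M_sun binary
f_10s = evolve(30.0, m_c, t_end=10.0)   # frequency after 10 s, starting at 30 Hz
```

The steep $f^{11/3}$ dependence is the chirp: $\dot{f}$ is modest at low frequency but grows rapidly as the binary approaches merger, which is why the short-observation $(f, \dot{f})$ parameterisation in the text only holds for binaries with large separations.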
Since accurate numerical simulations of Einstein's full equations have been performed [24–29], this waveform can then be attached to a fit of the signal coming from the merger of the two objects (this has been done for black hole binaries only). A ringdown waveform derived from black hole perturbation theory and numerical simulations is then used to complete the signal. This is characterised by constant modes of oscillation exponentially decreasing in amplitude over time. Combining post-Newtonian and/or EOB theory with numerical relativity simulations and black hole perturbation theory allows for the construction of complete inspiral-merger-ringdown (IMR) [30–36] hybrid waveform families that are analytic fits over specific parameter ranges. By modeling the entire signal, these allow for the extraction of the most information and signal-to-noise ratio. However, they are limited by our ability to perform numerical simulations with enough orbits prior to merger and covering a large enough area of the possible parameter space.

Similar brief derivations of the quadrupole formula for gravitational wave radiation from compact binaries can be found in [37, 38]. A more complete derivation and analysis of the gravitational waves from compact binaries is available in [39], and a review of the post-Newtonian formalism can be found in [40]. Gravitational waves are created by a variety of astrophysical sources, such as supernova explosions [41] and spinning neutron stars [42–47], but compact binaries are the most well-studied. A review of sources that may be found in high and low frequency ranges can be found in [48]. Follow-up searches to find GWs associated with observed gamma-ray bursts have been performed [49] and studies are underway to cross-correlate GW and neutrino observations [50].
Gravitational wave detections can be used to test general relativity and measure any deviations from Einstein's theory [51, 52], such as the existence of a massive graviton [53] (as GR predicts a massless graviton). Observations can also be used to test cosmological models by using binary mergers as standard sirens for luminosity distance and redshift measurements [54–58].

1.1.3 Gravitational Wave Detectors

The Laser Interferometer Gravitational-wave Observatory (LIGO) [59] is a major US project funded by the National Science Foundation for the detection of GWs. It consists of a pair of observatories located in Livingston, Louisiana and Hanford, Washington in the United States. The detectors are sensitive in the audio band (∼30 Hz to several kHz) to strain amplitudes as low as 10⁻²¹. The main research centres are located at the California Institute of Technology in Pasadena, California and the Massachusetts Institute of Technology in Cambridge, Massachusetts.

The basic geometry of each observatory is that of a 90° Michelson interferometer. However, some changes have been made so as to increase their sensitivity. Each arm consists of a resonant Fabry-Perot cavity, with the two mirrors also acting as the gravitational test masses. The resonance amplifies the effect of a passing gravitational wave on the phase difference between the two arms. Another mirror is placed between the laser source and the beamsplitter so as to create a resonant cavity that increases the total power in the interferometer by not letting light escape back towards the laser – this is called the power recycling mirror (PRM). An input mode cleaner before the PRM removes unwanted spatial modes from the laser source, leaving only the (l, m) = (0, 0) mode, a single central Gaussian peak. The Livingston observatory has an interferometer with 4 km arm lengths, denoted L1.
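Combining the arm length and strain figures above with Equation (1.6) gives a feel for the scale of the measurement; this back-of-envelope arithmetic is my own illustration:

```python
# A strain of h ~ 1e-21 across a 4 km arm corresponds to an arm-length
# change of dL = (1/2) h L.
h = 1e-21         # dimensionless strain
L = 4000.0        # arm length in metres
dL = 0.5 * h * L  # = 2e-18 m, roughly a thousandth of a proton radius
print(dL)
```

It is this extraordinarily small displacement that motivates the Fabry-Perot arm cavities and power recycling described above, which amplify the optical phase shift produced by a given strain.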
The Hanford observatory has two interferometers in the same vacuum system; H1 has 4 km arms and H2 has 2 km arms. In the final Initial LIGO science run, S5, all detectors operated at design strain sensitivity. The status of compact binary coalescence searches at that time is given in [60] and descriptions of the standard analysis pipeline are found in [61, 62]. Enhanced LIGO (eLIGO) [63] improved upon the sensitivity of LIGO by a factor of approximately two without requiring the vacuum chambers to be opened. A plot of the sensitivity measured during eLIGO's science run (S6) as a function of GW frequency can be seen in Figure 1.3. At the low frequency end, the sensitivity is limited by seismic noise. At the high frequency end, the limiting factor is quantum shot noise from the error in counting the photons that reach the detector; the fractional error in photon counts on the shorter integration timescales approaches the same fractional distance to be measured. For eLIGO, a more powerful 35 W laser was installed, the GW signal was switched to a DC readout that operates slightly off the dark fringe, and an output mode cleaner was installed to remove unwanted spatial modes that add shot noise but do not contribute to the GW signal. The S6 science run was performed over 2009–2010 in this configuration. Despite the many controls, LIGO experiences transient noise sources that detection pipelines have to rule out as possible signals [64–67].

The next stage of development is Advanced LIGO (aLIGO) [69–71], which aims to increase the sensitivity by a further factor of five. aLIGO's upgrades include adding a signal recycling mirror that both increases the broadband sensitivity and allows the sensitivity of the interferometer to be "tuned". Additionally, the test masses will be hung from triple pendulums to decrease their sensitivity to seismic noise and reduce the lower limit on observations to ∼10 Hz.
A 180 W laser will be installed and active modulators and isolators will be used on the test masses to compensate for thermal deformation. The test masses are also larger in size at 40 kg, up from 10 kg in Initial and Enhanced LIGO. The H2 interferometer will be removed and the mirrors sent to India in support of the building of a detector there by the IndIGO project [72]. A schematic of the aLIGO setup can be seen in Figure 1.4.

[Figure 1.3: A plot of the strain sensitivities of the LIGO and Virgo interferometers from the S6 and VSR2/3 science runs [68].]

In addition to the LIGO project in the United States, there is a joint Italian-French observatory, Virgo [73], located in Cascina, Italy. Virgo has a similar setup to LIGO but with 3 km arms and active seismic isolation. Virgo's instrumental noise is similar to LIGO's, as seen in Figure 1.3, but with a higher noise floor and better sensitivity at frequencies below ∼50 Hz. Virgo's second and third science runs were performed in coincidence with LIGO's S6. There is also a joint German-British observatory, GEO600 [74], near Sarstedt, Germany, that is part of the LIGO Scientific Collaboration. GEO600 is primarily used for testing technology that will be used in upcoming LIGO upgrades, but is currently on "astrowatch" while LIGO and Virgo are offline being upgraded. Due to its shorter arm lengths and lack of Fabry-Perot cavities in the arms, GEO's sensitivity is much lower than LIGO's or Virgo's. In Japan the TAMA300 [75] observatory is operating while the KAGRA detector (the Large Cryogenic Gravitational-wave Telescope, LCGT) is under construction [76, 77]. Together, LIGO, Virgo, GEO, KAGRA, and IndIGO will form a worldwide network of detectors that will allow for simultaneous detection of GWs and improve our ability to determine the parameters of a signal.

[Figure 1.4: A diagram of the aLIGO interferometer.]
ETM = end test mass, ITM = input test mass, BS = 50/50 beam splitter, PD = photodetector, PRM = power recycling mirror, SRM = signal recycling mirror, and MOD = phase modulation. Image from www.ligo.caltech.edu.

Studies are also underway for the construction of a third-generation ground-based interferometric detector, the Einstein Telescope [78, 79].

The next generation GW detector under development is ESA's New Gravitational-wave Observatory (NGO) [80, 81], a re-scoped version of the Laser Interferometer Space Antenna project (LISA) [82]. NGO will consist of three satellites (one 'mother' and two 'daughters') that will form the vertices of an equilateral triangle one million km on a side. The NGO constellation is planned to orbit the Sun in the same orbit as the Earth, but trailing behind and at an inclination of 60° to the ecliptic. NGO will form an interferometer that can observe GWs from ∼0.1 mHz up to ∼0.1 Hz. This will allow NGO to detect the mergers of supermassive black holes [83, 84], quasi-stationary signals from smaller compact binaries in the galaxy, extreme-mass-ratio inspirals, bursts from cusps in cosmic strings, and a stochastic GW background [85, 86]. In preparation for the NGO mission, LISA Pathfinder (LPF) [87] will be launched by ESA to test new technology. LPF will place test masses in near-perfect geodesic motion and will serve as a proof of principle for NGO. Pathfinder is scheduled for launch in 2014, with the NGO launch date currently undecided. A summary of the use of interferometry for the detection of gravitational waves both on the ground and in space is given by [88].

The near-perfect clock nature of pulsars also allows for the detection of very long wavelength gravitational waves. As a GW passes between a source pulsar and the Earth, it periodically distorts the spacetime that the pulsar signal must pass through. This results in periodic variations in the timing of the pulsar signal at Earth.
By measuring many pulsars' timings to high precision over several years, these variations may be cross-correlated to indicate the passing of a gravitational wave or the presence of a GW background [89–91]. The Gravitational Wave International Committee's roadmap [92] outlines plans for GW detection in the coming decades.

1.2 Bayesian Inference

Gravitational wave detection poses a major data analysis problem. Most signals in the collected data streams are expected to be of low SNR. Additionally, there may be many overlapping signals, not all of the same type, and a non-stationary noise background. Therefore, we need an efficient and robust method for identifying signals in the data and determining their physical parameters, both intrinsic and extrinsic. To do this, we turn to the methods of Bayesian statistics.

1.2.1 Bayes' Theorem

In our analysis, we wish to estimate the parameters $\Theta$ in a model or hypothesis $H$ for given data $D$. Bayes' theorem states that
\[ \Pr(\Theta|D,H) = \frac{\Pr(D|\Theta,H)\,\Pr(\Theta|H)}{\Pr(D|H)}. \tag{1.18} \]
$\Pr(\Theta|D,H) \equiv P(\Theta)$ is the posterior probability distribution, $\Pr(D|\Theta,H) \equiv \mathcal{L}(\Theta)$ is the likelihood function, $\Pr(\Theta|H) \equiv \pi(\Theta)$ is the prior distribution of the parameters, and $\Pr(D|H) \equiv \mathcal{Z}$ is the Bayesian evidence. The evidence is the factor required to normalise the posterior over $\Theta$, so we can ignore it for parameter estimation and analyse just the un-normalised posterior. It can be found, however, with the following integral (where $N$ is the dimension of the parameter space):
\[ \mathcal{Z} = \int \mathcal{L}(\Theta)\,\pi(\Theta)\,\mathrm{d}^N\Theta. \tag{1.19} \]

To compare two models, $H_0$ and $H_1$, we compare their relative probabilities with respect to given data. This is known as the odds ratio and is given by
\[ \frac{\Pr(H_1|D)}{\Pr(H_0|D)} = \frac{\Pr(D|H_1)\,\Pr(H_1)}{\Pr(D|H_0)\,\Pr(H_0)} = \frac{\mathcal{Z}_1}{\mathcal{Z}_0}\,\frac{\Pr(H_1)}{\Pr(H_0)}. \tag{1.20} \]
The prior probability ratio between the two models, $\Pr(H_1)/\Pr(H_0)$, is taken to be unity when no further information is available as to which model is more probable.
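The roles of the evidence and the odds ratio in Equations (1.18)–(1.20) can be illustrated with a toy one-dimensional model. The following Python sketch (not thesis code; the data values and helper names are invented for illustration) computes the evidence for a Gaussian likelihood under a uniform prior by direct quadrature, which is only feasible in very low dimensions:

```python
import numpy as np

def evidence_1d(loglike, prior_lo, prior_hi, n=100_000):
    """Z = integral of L(theta) * pi(theta) over a uniform prior, by quadrature.
    Hypothetical toy helper, not the thesis analysis code."""
    theta = np.linspace(prior_lo, prior_hi, n)
    prior = 1.0 / (prior_hi - prior_lo)        # uniform prior density
    like = np.exp(loglike(theta))
    return np.sum(like * prior) * (theta[1] - theta[0])

# Toy data: 5 draws near mu = 1; compare H1 (free mean, uniform prior
# on [-5, 5]) against H0 (mean fixed at 0, no free parameters).
data = np.array([0.8, 1.3, 0.5, 1.7, 0.9])

def loglike(mu):
    mu = np.atleast_1d(mu)[:, None]
    return -0.5 * np.sum((data - mu) ** 2, axis=1) - 0.5 * len(data) * np.log(2 * np.pi)

Z1 = evidence_1d(loglike, -5.0, 5.0)
Z0 = np.exp(loglike(0.0))[0]   # no free parameters: the evidence is just L(0)
odds = Z1 / Z0                 # Eq. (1.20) with unit prior odds
```

With these data the free-mean model is (mildly) favoured even after the Occam penalty of the wide prior, illustrating how the evidence automatically trades goodness of fit against model complexity.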
Thus, it can be seen from Equation (1.20) that the evidence plays a critical role in model selection as it determines the relative probabilities; $\mathcal{Z}_1/\mathcal{Z}_0$ is known as the evidence ratio. However, calculating the evidence requires the evaluation of a multi-dimensional integral which is far from trivial, especially in the multimodal and degenerate parameter spaces that we will be exploring.

In cases of both model selection and parameter estimation, we require a sampling of the parameter space that provides us with information about the shape of the likelihood function. From this information, we can numerically integrate and compare the evidence of different models or compare the relative likelihoods (and, as we will see, local evidences) of different parameter choices. A more detailed review of Bayesian statistics and its particular applications to cosmology can be found in [93].

1.2.2 MCMC and Nested Sampling

Due to the difficulty of integrating Equation (1.19) for complicated likelihood functions in many dimensions and the computational impracticality of grid approximations, Markov chain Monte Carlo (MCMC) methods have long been used for computing the integral and obtaining samples from the posterior distribution. One of the most popular implementations is the Metropolis-Hastings algorithm [94, 95]. In this algorithm, an initial point is chosen and steps to new points are chosen according to proposal distributions. New points are accepted with the following probability:
\[ \Pr(x_{t+1}|x_t) = \min\left(1,\ \frac{P(x_{t+1})\,Q(x_t|x_{t+1})}{P(x_t)\,Q(x_{t+1}|x_t)}\right), \tag{1.21} \]
where $x_t$ is the initial point, $x_{t+1}$ is the new point, and the proposal distribution $Q(x_{t+1}|x_t)$ is the probability of jumping to point $x_{t+1}$ from point $x_t$. If the point is not accepted then $x_{t+1} = x_t$. After an initial "burn-in" series of steps, the resulting distribution of samples will be equivalent to sampling from $P(x)$.
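Equation (1.21) can be made concrete with a minimal random-walk Metropolis sampler (an illustrative sketch, not thesis code). With a symmetric Gaussian proposal the $Q$ ratio cancels, leaving only the target ratio:

```python
import numpy as np

rng = np.random.default_rng(0)

def log_target(x):
    # Unnormalised log-target: a standard normal, for illustration.
    return -0.5 * x * x

def metropolis(n_steps, step=1.0, x0=0.0):
    """Random-walk Metropolis. The Gaussian proposal is symmetric,
    so Eq. (1.21) reduces to min(1, P(x_new)/P(x))."""
    x = x0
    chain = np.empty(n_steps)
    for t in range(n_steps):
        x_new = x + step * rng.standard_normal()
        # Accept with probability min(1, P(x_new)/P(x)), in log space.
        if np.log(rng.uniform()) < log_target(x_new) - log_target(x):
            x = x_new
        chain[t] = x           # on rejection, the old point is repeated
    return chain

chain = metropolis(20_000)[5_000:]   # discard an initial burn-in segment
```

After burn-in, the chain's sample mean and standard deviation approach those of the target distribution, at the cost of many correlated likelihood evaluations.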
We therefore set $P(x_t) = \mathcal{L}(x_t)\,\pi(x_t)$ to obtain samples from the posterior probability distribution. However, this methodology will require many samples and likelihood evaluations, which can be computationally expensive. Additionally, in order to obtain a calculation of the Bayesian evidence, additional techniques such as simulated annealing or parallel tempering [96, 97] are typically used, further increasing the computational cost. Model selection with MCMC may also be performed using a technique known as "reversible jump" MCMC (RJMCMC), which allows jumps between different model parameter spaces [98].

Nested sampling [99, 100] is a Monte Carlo method targeted at the efficient calculation of the evidence, but it also produces posterior inferences as a by-product. It calculates the evidence by transforming the multi-dimensional evidence integral into a one-dimensional integral that is easy to evaluate numerically. This is accomplished by defining the prior volume $X$ as $\mathrm{d}X = \pi(\Theta)\,\mathrm{d}^N\Theta$, so that
\[ X(\lambda) = \int_{\mathcal{L}(\Theta) > \lambda} \pi(\Theta)\,\mathrm{d}^N\Theta, \tag{1.22} \]
where the integral extends over the region(s) of parameter space contained within the iso-likelihood contour $\mathcal{L}(\Theta) = \lambda$. The evidence integral, Equation (1.19), can then be written as
\[ \mathcal{Z} = \int_0^1 \mathcal{L}(X)\,\mathrm{d}X, \tag{1.23} \]
where $\mathcal{L}(X)$, the inverse of Equation (1.22), is a monotonically decreasing function of $X$. Thus, if one can evaluate the likelihoods $\mathcal{L}_i = \mathcal{L}(X_i)$, where $X_i$ is a sequence of decreasing values,
\[ 0 < X_M < \cdots < X_2 < X_1 < X_0 = 1, \tag{1.24} \]
as shown schematically in Figure 1.5, the evidence can be approximated numerically using standard quadrature methods as a weighted sum
\[ \mathcal{Z} = \sum_{i=1}^{M} \mathcal{L}_i w_i, \tag{1.25} \]
where the weights $w_i$ for the simple trapezium rule are given by $w_i = \frac{1}{2}(X_{i-1} - X_{i+1})$.

Figure 1.5: Cartoon illustrating (a) the posterior of a two-dimensional problem; and (b) the transformed $\mathcal{L}(X)$ function where the prior volumes $X_i$ are associated with each likelihood $\mathcal{L}_i$. Images from [101].
An example of a posterior in two dimensions and its associated function $\mathcal{L}(X)$ is shown in Figure 1.5.

The summation in Equation (1.25) is performed as follows. The iteration counter is first set to $i = 0$ and $N$ "active" (or "live") samples are drawn from the full prior $\pi(\Theta)$, so the initial prior volume is $X_0 = 1$. The samples are then sorted in order of their likelihood and the smallest (with likelihood $\mathcal{L}_0$) is removed from the active set and replaced by a point drawn from the prior subject to the constraint that the point has a likelihood $\mathcal{L} > \mathcal{L}_0$. The prior volume contained within the iso-likelihood contour associated with the new live point will be a random variable given by $X_1 = t_1 X_0$, where $t_1$ follows the distribution $\Pr(t) = N t^{N-1}$ (i.e., the probability distribution for the largest of $N$ samples drawn uniformly from the interval $[0,1]$). At each subsequent iteration $i$, the removal of the lowest-likelihood point $\mathcal{L}_i$ in the active set, the drawing of a replacement with $\mathcal{L} > \mathcal{L}_i$, and the reduction of the corresponding prior volume $X_i = t_i X_{i-1}$ are repeated, until the entire prior volume has been traversed. The algorithm thus travels through nested shells of likelihood as the prior volume is reduced. The mean and standard deviation of $\log t$, which dominates the geometrical exploration, are:
\[ \mathrm{E}[\log t] = -1/N, \qquad \sigma[\log t] = 1/N. \tag{1.26} \]
Since each value of $\log t$ is independent, after $i$ iterations the prior volume will shrink down such that $\log X_i \approx -(i \pm \sqrt{i})/N$. Thus, one takes $X_i = \exp(-i/N)$.

The nested sampling algorithm is terminated when the evidence has been computed to a pre-specified precision. The evidence that could be contributed by the remaining live points is estimated as $\Delta\mathcal{Z}_i = \mathcal{L}_{\max} X_i$, where $\mathcal{L}_{\max}$ is the maximum likelihood value of the remaining live points and $X_i$ is the remaining prior volume. The algorithm terminates when $\Delta\mathcal{Z}_i$ is less than a user-defined value (we use 0.5 in log-evidence).
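The shrinkage estimate $X_i = \exp(-i/N)$ and the termination criterion above can be assembled into a toy nested sampler. The sketch below is illustrative only: it replaces the hard step of sampling within the likelihood constraint by brute-force rejection from the prior, which is viable only in one dimension (this is precisely the step that MULTINEST's ellipsoidal scheme makes efficient):

```python
import numpy as np

rng = np.random.default_rng(1)

def loglike(theta):
    # Gaussian likelihood of width 0.05 centred at 0.5; uniform prior on [0, 1].
    return -0.5 * ((theta - 0.5) / 0.05) ** 2

def nested_sampling(n_live=200, tol=0.5):
    live = rng.uniform(0.0, 1.0, n_live)
    live_logL = loglike(live)
    logZ, X_prev = -np.inf, 1.0
    i = 0
    while True:
        i += 1
        worst = int(np.argmin(live_logL))
        Li = live_logL[worst]
        X_i = np.exp(-i / n_live)          # deterministic shrinkage X_i = e^{-i/N}
        logZ = np.logaddexp(logZ, Li + np.log(X_prev - X_i))
        # Replace the worst point with a prior draw satisfying L > L_i
        # (brute-force rejection; fine in 1-D, hopeless in many dimensions).
        while True:
            cand = rng.uniform()
            if loglike(cand) > Li:
                break
        live[worst], live_logL[worst] = cand, loglike(cand)
        X_prev = X_i
        # Terminate once the live points cannot change ln(Z) by more than tol.
        if live_logL.max() + np.log(X_i) < logZ + np.log(np.expm1(tol)):
            break
    # Add the remaining live-point contribution before returning.
    return np.logaddexp(logZ, np.log(np.mean(np.exp(live_logL))) + np.log(X_i))

logZ = nested_sampling()
# Analytic answer for comparison: Z = 0.05 * sqrt(2*pi), i.e. ln(Z) ~ -2.08
```

The scatter of the result about the analytic value is of order $\sqrt{i}/N$ in log-evidence, as Equation (1.26) predicts.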
Once the evidence $\mathcal{Z}$ is found, posterior inferences can be easily generated using the final live points and the full sequence of discarded points from the nested sampling process, i.e., the points with the lowest likelihood value at each iteration $i$ of the algorithm. Each such point is simply assigned the probability weight
\[ p_i = \frac{\mathcal{L}_i w_i}{\mathcal{Z}}. \tag{1.27} \]
These samples can then be used to calculate posterior parameter inferences such as means, standard deviations, covariances and so on, or to construct marginalised posterior distributions.

By providing both posterior samples and the value of the evidence, nested sampling can greatly reduce the number of points at which we need to evaluate the likelihood. This provides us with what is necessary for both parameter estimation and model selection in less time than traditional MCMC methods.

1.2.2.1 The MULTINEST Algorithm

The most challenging task in implementing nested sampling is to draw samples from the prior within the hard constraint $\mathcal{L} > \mathcal{L}_i$ at each iteration $i$. The MULTINEST algorithm [101, 102] tackles this problem through an ellipsoidal rejection sampling scheme. The live point set is enclosed within a set of (possibly overlapping) ellipsoids and a new point is then drawn uniformly from the region enclosed by these ellipsoids. The ellipsoidal decomposition of the live point set is chosen to minimise the sum of the volumes of the ellipsoids. The ellipsoidal decomposition is well suited to dealing with posteriors that have curving degeneracies, and allows mode identification in multi-modal posteriors, as shown in Figure 1.6. If there are subsets of the ellipsoid set that do not overlap with the remaining ellipsoids, these are identified as a distinct mode and subsequently evolved independently.
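The posterior weighting of Equation (1.27) amounts to importance-weighting the discarded points. A minimal numerical illustration (the likelihoods, weights, and parameter values below are invented, not from any real run):

```python
import numpy as np

# Three hypothetical dead points with likelihoods L_i, trapezium weights w_i,
# and parameter values theta_i (all numbers invented for illustration).
L_vals = np.array([0.2, 1.0, 3.0])
w_vals = np.array([0.5, 0.3, 0.1])
theta_vals = np.array([0.1, 0.4, 0.5])

Z = np.sum(L_vals * w_vals)          # Eq. (1.25): Z = sum_i L_i w_i
p = L_vals * w_vals / Z              # Eq. (1.27): posterior weight of each point
post_mean = np.sum(p * theta_vals)   # weighted posterior mean
```

The weights $p_i$ sum to one by construction, so any posterior expectation is just a weighted sum over the dead points.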
The MULTINEST algorithm has proven very useful for tackling inference problems in cosmology and particle physics [103–106], typically showing two orders of magnitude improvement in efficiency over conventional techniques. More recently, it has been shown to perform well as a search tool for gravitational wave data analysis [107]. A different implementation of nested sampling for LIGO analysis is described in [108].

Figure 1.6: Examples of MULTINEST enclosing live points in sets of ellipsoids. 1000 points have been randomly sampled from (a) a torus and (b) two non-intersecting ellipsoids. Images from [102].

1.3 Thesis Outline

In Chapter 2 I describe applications of Bayesian inference to LIGO observations of binary inspirals. I begin with a simple example and proof-of-principle, followed by a systematic analysis of the ability of a LIGO-Virgo network to localise the position of a GW source on the sky. These are both performed with MULTINEST for Bayesian inference. I then consider using coherent versus incoherent model selection to distinguish a real GW signal from glitches in actual LIGO data.

Chapter 3 describes Bayesian inference for LISA/NGO sources using MLDC simulations. The first section covers parameter estimation and model selection for burst sources from cusps on cosmic strings. I then analyse continuous gravitational wave sources for LISA, where there are many overlapping signals present in the data, to the point of presenting foreground confusion noise at lower frequencies. Finally, I consider detecting burst sources in the presence of the foreground of continuous GW sources.

At this point I turn to looking at data analysis techniques in a more general sense. Chapter 4 analyses the benefits and drawbacks of the Multiple Optimised Parameter Estimation and Data compression (MOPED) algorithm.
I consider cases where it works and where it fails and attempt to set a prescription for determining the viability of a MOPED analysis.

Artificial neural networks are introduced in Chapter 5 as a tool for learning and performing regression and classification functions. I detail the structure of neural networks and the algorithm used for training them on prepared data sets. Several examples are then provided to illustrate the capabilities of the networks and the training algorithm. These include the learning of analytic functions, classification of complicated data, and reduction of the dimensionality of data.

In order to further improve tools for general Bayesian inference, neural networks are integrated into the MULTINEST algorithm to learn the structure of the likelihood function. They can then be used to make predictions of the likelihood at new points. This is especially beneficial for problems where likelihood evaluations are particularly time-consuming. In addition, the networks may be used for follow-up analyses that can now be performed up to a hundred thousand times faster. The new Blind Accelerated Multimodal Bayesian Inference (BAMBI) algorithm is demonstrated on a few toy problems and then applied to the task of cosmological parameter estimation, where significant increases in speed are demonstrated for initial analyses and exceptional speed-ups are achieved in follow-up.

Chapter 2: Analysis of LIGO Data

In theory, there is no difference between theory and practice. But, in practice, there is. (Jan L. A. van de Snepscheut)

In the next few years the Advanced LIGO and Advanced Virgo detectors will be coming online and collecting science data. During their observation period, one or more compact binary coalescence events are expected to be detected [71, 109].
These signals are well modelled by post-Newtonian theory [6–8, 40], the effective-one-body approach [19–22], and numerical relativity [24–27], giving us a variety of waveform models with which to search the data. Additionally, the sensitivity of the LIGO and Virgo detectors is designed to be optimal for these sources. In this chapter I begin by introducing a simple proof-of-principle for detecting the gravitational waveform from one of these events using Bayesian inference techniques. I then present work done as part of the LIGO Scientific Collaboration (LSC) to quantify the ability of Bayesian inference to determine the sky location of a detected event for use in electromagnetic follow-up searches. This was done in part with the Parameter Estimation sub-group of the Compact Binary Coalescence group of the LSC and will form part of a publication currently in preparation; code from the LSC Algorithm Library (LAL [110]) was used in generating and analysing this data. I conclude with my contribution to the analysis of a hardware blind injection into LIGO's sixth and Virgo's second science runs [68]. The analysis in this section was presented as a poster at the Ninth Edoardo Amaldi Conference on gravitational waves [111].

2.1 Simple Inspiral Model

Many stars form as part of a binary system [112]. Over their lifetimes, these stars will slowly spiral into each other as they lose energy from the emission of gravitational waves. These predictions have been verified by the discovery of the binary pulsar PSR B1913+16 in 1974 by Hulse and Taylor and subsequent observations [2]. As the two stars spiral in on each other, their GW signal produces a "chirp", characterised by increasing amplitude and frequency until the frequency of the last stable orbit, after which they go through merger and ringdown phases.
2.1.1 Generating the Signal

The sensitivity of the LIGO observatory is best for the inspiral phase of a compact binary coalescence for objects with masses of order a solar mass, so for this reason and for computational simplicity we consider only the inspiral. We assume the two black holes are non-spinning and have masses $m_1$ and $m_2$. We therefore define the total mass, $m = m_1 + m_2$, the symmetric mass ratio, $\eta = (m_1 m_2)/(m_1 + m_2)^2$, and the chirp mass, $\mathcal{M} = (m_1 m_2)^{3/5}(m_1 + m_2)^{-1/5} = m\eta^{3/5}$. Following the approach of Veitch and Vecchio [113], we take the leading Newtonian quadrupole order of the strain in the frequency domain. The GW signal can thus be expressed as
\[ \tilde{h}(f;\vec{\theta}) = \begin{cases} A\,\mathcal{M}^{5/6} f^{-7/6} e^{\imath\psi(f;\vec{\theta})} & f \le f_{\mathrm{LSO}} \\ 0 & f > f_{\mathrm{LSO}} \end{cases}, \tag{2.1} \]
where
\[ \psi(f;\vec{\theta}) = 2\pi f t_c - \phi_c - \frac{\pi}{4} + \frac{3}{4}(8\pi\mathcal{M} f)^{-5/3} \tag{2.2} \]
is the signal phase and $f_{\mathrm{LSO}}$ is the frequency of the last stable orbit in the Schwarzschild spacetime. For a general system this is given by
\[ f_{\mathrm{LSO}} = (6^{3/2}\,2^{-1}\,\pi m)^{-1}, \tag{2.3} \]
but for simplicity we assume $m_1 = m_2$ so we can re-write $f_{\mathrm{LSO}}$ as
\[ f_{\mathrm{LSO}} = (6^{3/2}\,2^{1/5}\,\pi\mathcal{M})^{-1}. \tag{2.4} \]
In Equations (2.1) and (2.2), $\vec{\theta}$ refers to the 4-dimensional parameter vector, where our parameters are
\[ \vec{\theta} = \{A, \mathcal{M}, t_c, \phi_c\}. \tag{2.5} \]
$A$ is the overall signal amplitude and depends on the source masses, distance, sky location, inclination angle, and polarisation; $t_c$ and $\phi_c$ are the time and phase at coalescence, respectively. In this analysis, geometrical units are used, so $c = G = 1$ and therefore $M_\odot = 4.926\times10^{-6}$ s and $A$ has units of $\mathrm{s}^{-2}$. It should be noted that only the chirp mass, $\mathcal{M}$, appears in this formula. Therefore, a degeneracy exists between the two black hole masses: any two masses that produce the same chirp mass will give the same results with this formula. A more accurate model of the signal that also uses the symmetric mass ratio, or any other independent function of the masses, would be able to break this degeneracy.
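Equations (2.1)–(2.4) are straightforward to code up. The sketch below is illustrative only (function and argument names are invented, and the amplitude normalisation is arbitrary); it evaluates the Newtonian chirp on a frequency grid, converting the chirp mass from solar masses to seconds with $M_\odot = 4.926\times10^{-6}$ s:

```python
import numpy as np

MSUN = 4.926e-6                       # one solar mass in seconds (G = c = 1)

def chirp_waveform(f, A, Mc, t_c, phi_c):
    """Frequency-domain Newtonian chirp, Eqs. (2.1)-(2.2).
    Mc is the chirp mass in solar masses. Illustrative sketch, not thesis code."""
    M = Mc * MSUN                     # chirp mass in seconds
    psi = (2 * np.pi * f * t_c - phi_c - np.pi / 4
           + 0.75 * (8 * np.pi * M * f) ** (-5.0 / 3.0))
    h = A * M ** (5.0 / 6.0) * f ** (-7.0 / 6.0) * np.exp(1j * psi)
    # Eq. (2.4): last-stable-orbit cutoff for the equal-mass case m1 = m2.
    f_lso = 1.0 / (6.0 ** 1.5 * 2.0 ** 0.2 * np.pi * M)
    return np.where(f <= f_lso, h, 0.0)

f = np.linspace(40.0, 400.0, 1000)                       # frequency grid in Hz
h = chirp_waveform(f, A=5.0, Mc=8.009, t_c=20.0, phi_c=1.0)
```

For an 8 $M_\odot$ chirp mass the cutoff frequency lands near 480 Hz, so the whole 40–400 Hz band above survives the cutoff, and the amplitude falls off as $f^{-7/6}$ as expected.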
The detector noise profile in the frequency domain is modelled as a Gaussian random process in the real and imaginary parts, with zero mean and a variance at frequency $f_k$ ($k = 1, \ldots, K$, where $K$ is the number of frequency bins) given by
\[ \sigma_k^2 = \left(\frac{f_k}{f_0}\right)^{-4} + 1 + \left(\frac{f_k}{f_0}\right)^2. \tag{2.6} \]
This is representative, within a multiplicative constant, of the LIGO interferometers, with $f_0 = 150$ Hz setting the scale for the frequency of maximum sensitivity.

2.1.2 Calculating the Likelihood

We define our likelihood function to be a multi-variate Gaussian, centred on the true parameter values:
\[ \mathcal{L}(\vec{\theta}) = \prod_{k=1}^{K} \left[ \frac{1}{2\pi\sigma_k^2} \exp\left( -\frac{|\tilde{h}_k(\vec{\theta}) - \tilde{d}_k|^2}{2\sigma_k^2} \right) \right]. \tag{2.7} \]
To simplify this, we compute the log-likelihood and subtract out any constants. This gives an evidence value that is off by a constant factor, but all local evidences will carry the same factor, which cancels in the evidence ratio. The log-likelihood is therefore
\[ \ln\mathcal{L}(\vec{\theta}) = \sum_{k=1}^{K} \left( -\frac{|\tilde{h}_k(\vec{\theta}) - \tilde{d}_k|^2}{2\sigma_k^2} \right). \tag{2.8} \]

Table 2.1: Results from the MULTINEST parameter inference on a simple inspiral chirp signal.

Parameter        Measured              Actual
A (s⁻¹/M☉)       4.809 ± 0.328         5.00
M (M☉)           8.000 ± 0.015         8.009
t_c (s)          20.0 ± 0.21×10⁻³      20.0
φ_c (rad)        1.262 ± 0.280         1.00

In order to run the nested sampling algorithm MULTINEST [101, 102] on each data set we need to define the prior volume for it to explore. In order to have the most generality, we use a uniform prior defined over the following intervals:
\[ A \in [0, 1000]\ \mathrm{s}^{-1}/M_\odot, \tag{2.9a} \]
\[ \mathcal{M} \in [1, 20]\ M_\odot, \tag{2.9b} \]
\[ t_c \in [19.5, 20.5]\ \mathrm{s}, \tag{2.9c} \]
\[ \phi_c \in [0, 2\pi)\ \mathrm{rad}. \tag{2.9d} \]
This is a significantly wider prior range than that used by Veitch and Vecchio. Their prior ranges were $A \in [0, 10]\ \mathrm{s}^{-1}/M_\odot$, $\mathcal{M} \in [7.7, 8.3]\ M_\odot$, $t_c \in [19.9, 20.1]$ s, and $\phi_c \in [0, 2\pi)$ rad.

2.1.3 MULTINEST Results

MULTINEST was used to analyse the simulated data produced by the model described in Section 2.1.1 with parameters $\vec{\theta} = \{5\ \mathrm{s}^{-1}/M_\odot,\ 8.009\ M_\odot,\ 20\ \mathrm{s},\ 1\ \mathrm{rad}\}$.
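Equations (2.6) and (2.8) transcribe directly into code. The sketch below is illustrative (the unit-amplitude phase factor standing in for the template is invented, not the chirp model) and checks that the log-likelihood favours the injected signal over the noise-only hypothesis:

```python
import numpy as np

def noise_sigma(f, f0=150.0):
    """Eq. (2.6): per-bin noise standard deviation (arbitrary overall scale)."""
    return np.sqrt((f / f0) ** -4 + 1.0 + (f / f0) ** 2)

def log_likelihood(h_model, d, sigma):
    """Eq. (2.8): Gaussian log-likelihood with the constant terms dropped."""
    return -np.sum(np.abs(h_model - d) ** 2 / (2.0 * sigma ** 2))

# Toy check on simulated data: inject a stand-in "signal" plus scaled noise.
rng = np.random.default_rng(2)
f = np.linspace(40.0, 400.0, 500)
sigma = noise_sigma(f)
signal = np.exp(2j * np.pi * f * 0.01)   # stand-in template, not the chirp
d = signal + 0.1 * sigma * (rng.standard_normal(500) + 1j * rng.standard_normal(500))

ll_signal = log_likelihood(signal, d, sigma)       # template matches injection
ll_noise = log_likelihood(np.zeros(500), d, sigma) # noise-only hypothesis
```

The difference `ll_signal - ll_noise` is (up to priors) the log Bayes factor between the two fixed-parameter hypotheses, foreshadowing the detection statistic used below.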
The results of the run are given in Table 2.1 and the final marginalised posterior distributions can be seen in Figure 2.1. These results show that MULTINEST was able to estimate all four parameter values very closely.

Figure 2.2 shows the log-likelihood function as it varies over the amplitude and chirp mass for the correct values of $t_c$ and $\phi_c$. We can clearly see a ridge at constant $\mathcal{M}$ and a background drop-off for increasing $A$ and $\mathcal{M}$. In order to ensure sampling from this ridge, whose peak coincides with the true parameters, an increased number of live points for MULTINEST needs to be used. This had the expected side effect of increasing the computation time. On a single Linux machine, MULTINEST still took less than an hour to run, even for signals with low SNRs.

Figure 2.1: Plots of the one- and two-dimensional marginalised posterior distributions obtained by MULTINEST. Vertical red lines and plus signs indicate true input values. $[A] = \mathrm{s}^{-1}/M_\odot$, $[\mathcal{M}] = M_\odot$ and $[\phi_c] = \mathrm{rad}$.

Figure 2.2: The log-likelihood function for a given range of amplitude and chirp mass values. The actual values of the signal are $A = 300\ \mathrm{s}^{-1}/M_\odot$, $\mathcal{M} = 3.52\ M_\odot$, $t_c = 20$ s, and $\phi_c = 0$ rad. $t_c$ and $\phi_c$ have been set to their actual values for this calculation.

If we were to look at the log-likelihood for the prior ranges of [113], the large background feature in $A$ and $\mathcal{M}$ is not significant, as can be seen in Figure 2.3. The range of chirp masses they explore limits this background effect and also causes the ridge to occupy a larger fraction of the sampled space. This makes the likelihood easier to sample, but limits detectability to signals from binaries whose parameters fall within a narrow range.

We can now measure how well MULTINEST can find the true signal embedded in noise. A natural measure of this is the Bayes factor, the ratio of the posterior probabilities of the signal and noise-only models.
We assume a prior odds ratio of one, so the Bayes factor is calculated simply by dividing the evidence of the signal model as measured by MULTINEST by the evidence of the noise-only model. A Bayes factor greater than one indicates that the signal model is more probable than the noise model; we ask for a somewhat larger Bayes factor before declaring a detection. As the Bayes factor increases, the confidence of our detection grows as well. To measure how detection is affected by weaker signals, I computed the Bayes factor for several signal-to-noise ratios (SNRs).

Figure 2.3: The log-likelihood function for data with an input signal of $\vec{\theta} = \{5\ \mathrm{s}^{-1}/M_\odot,\ 8\ M_\odot,\ 20\ \mathrm{s},\ 0\ \mathrm{rad}\}$, where $t_c = 20$ s and $\phi_c = 0$ rad have been set.

The SNR, denoted by $\rho$, provides a measure of the relative strength of the signal to the background noise and is calculated using the following formula:
\[ \rho = \sqrt{\sum_{k=1}^{K} \frac{|\tilde{h}_k(\vec{\theta})|^2}{\sigma_k^2}}. \tag{2.10} \]
Varying values of the SNR were obtained by changing the amplitude of the signal. The chirp mass (total mass and symmetric mass ratio), time of coalescence, and phase of coalescence were all kept equal so as to minimise any effects other than varying signal strength. The global evidence of the signal model was given by MULTINEST at the end of each run; the runs were performed using the original log-likelihood function and 1000 live points to ensure a thorough sampling of the prior volume. The null evidence (that of the noise model) was computed by fixing the amplitude to zero. Figure 2.4 shows the log Bayes factor as a function of the measured SNR over a range of SNRs. We can see that once a certain threshold has been reached by an SNR of 10, we can be very confident in our detection of a signal. Below an SNR of ∼6, however, we cannot be sure of a signal being present.

Figure 2.4: The log Bayes factor as a function of SNR. Past an SNR of ∼10, our confidence of detection increases exponentially.
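Equation (2.10), together with the statement that the SNR was varied by rescaling the amplitude, can be checked directly: the SNR is linear in the signal amplitude. An illustrative sketch with white noise ($\sigma_k = 1$) and an invented stand-in template:

```python
import numpy as np

def snr(h, sigma):
    # Eq. (2.10): rho = sqrt( sum_k |h_k|^2 / sigma_k^2 )
    return np.sqrt(np.sum(np.abs(h) ** 2 / sigma ** 2))

rng = np.random.default_rng(3)
h = rng.standard_normal(100) + 1j * rng.standard_normal(100)  # stand-in template
sigma = np.ones(100)                                          # white noise

rho = snr(h, sigma)
rho_scaled = snr(3.0 * h, sigma)    # tripling the amplitude triples the SNR
```

This linearity is why sweeping the injected amplitude, with every other parameter held fixed, sweeps the SNR cleanly along the horizontal axis of Figure 2.4.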
This clearly shows that Bayesian inference as performed by MULTINEST is a strong tool for detecting gravitational wave signals in instrument noise. Being able to make detections at SNRs below 10 is important, as the current instruments are not expected to yield many events with SNRs above that value even when a signal is in fact present. For comparison, see Veitch and Vecchio's original work on this same problem in [113, 114], which shows similar results for Bayesian evidences to those obtained here, and their use of delayed rejection in [115, 116]. MULTINEST has also been used to detect the fully parameterised waveform from non-spinning supermassive black holes in [107]. The usefulness of inspiral-only templates for detection of full inspiral-merger-ringdown waveforms is examined in [117]. The use of MULTINEST for the detection of galaxy clusters with a fully Bayesian analysis can be found in [118].

2.2 LIGO Sky Localisation

In order to confirm the first detections of gravitational waves and extract the most science from an observed event, we will want to obtain follow-up observations from electromagnetic telescopes [119–126]. For certain events, these will be able to confirm that the event occurred as well as provide additional information. For binary black hole mergers, the lack of an observed electromagnetic signal would confirm theory, while the detection of a signal could indicate the presence of an accretion disk around one or both of the black holes. To this end, a project was begun in the Bayesian parameter estimation subgroup of the CBC group in the LSC to determine our ability to localise the position of a source on the sky and to measure how quickly we would be able to accomplish this. As this was also a validation test of the Bayesian codes, we elected to use a simpler model for the inspiral signals and to generate the noise analytically according to a known power spectral density (PSD).
Studies have already been performed to analyse the ability of GW detector networks to localise a source on the sky [127–134], but this is the first systematic Bayesian study that utilises coherent detection and compares many inference algorithms to confirm results and measure speed. The full paper presenting results from the Parameter Estimation group is in preparation. Other studies, such as [135], have measured statistical uncertainties in measured parameters with the advanced detector network, but use the Fisher matrix approximation, which cannot capture the full nature of posterior degeneracies and does not apply at low SNRs. There have also been efforts to characterise the ability of a detector network to measure other source parameters [136].

2.2.1 The Injections

The injections were performed with the restricted frequency-domain post-Newtonian waveform model, TaylorF2 [10], at second order in phase (corrections of $O(v^4)$) and with zeroth-order (Newtonian) amplitude. No spin was included in the models in order to simplify the analysis, so there were 9 parameters in total: chirp mass, symmetric mass ratio, luminosity distance, inclination of the orbital plane (angular momentum) to the line of sight, polarisation, phase at coalescence, time of coalescence at the geocentre, longitude (related to right ascension), and colatitude (related to declination). The last two of these give the position of the source on the sky. The parameter vector is thus given by $\vec{\theta} = \{\mathcal{M}, \eta, d_L, \iota, \psi, \phi_0, t_c, \phi_s, \theta_s\}$. $\phi_s$ and $\theta_s$ are related to the right ascension $\alpha$ and declination $\delta$ by $\phi_s = \alpha - \mathrm{GAST}$ (GAST is the Greenwich Apparent Sidereal Time) and $\theta_s = \pi/2 - \delta$.
The waveform is given in the source frame by [10]:
\[ \tilde{h}_+(f) = \frac{\mathcal{M}^{5/6}}{d_L}\, f^{-7/6}\, (1 + \cos^2\iota)\cos(\Phi(f)), \tag{2.11a} \]
\[ \tilde{h}_\times(f) = -\frac{\mathcal{M}^{5/6}}{d_L}\, f^{-7/6}\, \cos\iota\,\sin(\Phi(f)), \tag{2.11b} \]
\[ \Phi(f) = 2\pi f t_c - \phi_0 - \frac{\pi}{4} + \frac{3}{128\eta v^5}\left[ 1 + \frac{20}{9}\left(\frac{743}{336} + \frac{11}{4}\eta\right)v^2 - 16\pi v^3 + 10\left(\frac{3058673}{1016064} + \frac{5429}{1008}\eta + \frac{617}{144}\eta^2\right)v^4 \right], \tag{2.11c} \]
where $v = (\pi m f)^{1/3}$ and $m$ is the total mass. The waveform is evaluated for frequencies up to the last stable orbit given by Equation (2.3). This is measured by each of the LIGO and Virgo detectors, whose sensitivity patterns are given by:
\[ F_+(\psi,\phi_s,\theta_s) = -\tfrac{1}{2}(1+\cos^2\theta_s)\cos(2\phi_s)\cos(2\psi) - \cos\theta_s\,\sin(2\phi_s)\sin(2\psi), \tag{2.12} \]
\[ F_\times(\psi,\phi_s,\theta_s) = \tfrac{1}{2}(1+\cos^2\theta_s)\cos(2\phi_s)\sin(2\psi) - \cos\theta_s\,\sin(2\phi_s)\cos(2\psi), \tag{2.13} \]
such that the signal measured is given by
\[ \tilde{h} = \tilde{h}_+ F_+ + \tilde{h}_\times F_\times. \tag{2.14} \]

We analyse 200 injections chosen uniformly from the distributions given in Table 2.2. Notice that the masses are sampled in $m_1$ and $m_2$ instead of $\mathcal{M}$ and $\eta$; this is to allow equal-mass binaries more easily. The additional constraints $m_1 > m_2$ and $m_1 + m_2 \le 20\,M_\odot$ are also applied. $t_T$ is the pre-determined and supplied trigger time of the signal that we use as the basis for a search interval. These same ranges are also used as the priors for the subsequent data analysis. The noise PSD was generated in the frequency domain according to the analytic formula given below, with $f_0 = 150$ Hz:
\[ S_h(f) = 9\times10^{-46}\left( \left(4.49\,\frac{f}{f_0}\right)^{-56} + 0.16\left(\frac{f}{f_0}\right)^{-4.52} + 0.52 + 0.32\left(\frac{f}{f_0}\right)^2 \right). \tag{2.15} \]

Table 2.2: Ranges for the sampling distributions of signal parameters. These are used for both signal generation and prior probabilities.

Parameter    Units   Minimum      Maximum
m₁           M☉      1            15
m₂           M☉      1            15
log(d_L)     Mpc     10           40
cos(ι)       –       −1           1
ψ            rad     0            π
φ₀           rad     0            2π
t_c          s       t_T − 0.1    t_T + 0.1
φ_s          rad     0            2π
cos(θ_s)     –       −1           1

This same function was used in the analysis for the PSD. Although this does not mirror the non-stationarity and non-Gaussianities in real LIGO noise, it does allow us to confirm the results of our analysis using statistical measures.
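Equations (2.12), (2.13), and (2.15) are simple enough to evaluate directly. The sketch below (illustrative, not the LAL analysis code) implements them and uses the polarisation-rotation identity of the antenna patterns as a quick consistency check: rotating $\psi$ by $\pi/4$ turns $F_+$ into $F_\times$ and $F_\times$ into $-F_+$:

```python
import numpy as np

def antenna_patterns(psi, phi_s, theta_s):
    """Eqs. (2.12)-(2.13): detector response to the + and x polarisations."""
    a = 0.5 * (1.0 + np.cos(theta_s) ** 2)
    ct = np.cos(theta_s)
    c2p, s2p = np.cos(2 * phi_s), np.sin(2 * phi_s)
    c2psi, s2psi = np.cos(2 * psi), np.sin(2 * psi)
    F_plus = -a * c2p * c2psi - ct * s2p * s2psi
    F_cross = a * c2p * s2psi - ct * s2p * c2psi
    return F_plus, F_cross

def psd(f, f0=150.0):
    """Eq. (2.15): analytic PSD used to generate the simulated noise."""
    x = f / f0
    return 9e-46 * ((4.49 * x) ** -56 + 0.16 * x ** -4.52 + 0.52 + 0.32 * x ** 2)

# Consistency check of the patterns under a pi/4 polarisation rotation:
# F+(psi + pi/4) = Fx(psi)  and  Fx(psi + pi/4) = -F+(psi).
Fp0, Fc0 = antenna_patterns(0.1, 0.3, 1.0)
Fp1, Fc1 = antenna_patterns(0.1 + np.pi / 4, 0.3, 1.0)
```

At $f = f_0 = 150$ Hz the bracket in Eq. (2.15) evaluates to essentially $0.16 + 0.52 + 0.32 = 1$, so the PSD there is the overall scale $9\times10^{-46}$, which marks the bottom of the noise curve.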
2.2.2 Analysis and Results

The 200 injections were analysed with MULTINEST on LIGO's Caltech cluster, from which we were able to obtain converged results on 187 (the remaining 13 were excluded due to computational issues that repeatedly prevented analysis jobs from finishing). 5000 live points were used to ensure complete sampling of the prior. Run times on a single processor varied from approximately 2 days to 2 weeks depending on the number of likelihood evaluations necessary. The loudest and quietest signals had the lowest sampling efficiency, due to sharp peaks or broad degenerate modes in the likelihood function respectively, and so took the longest to converge.

A typical posterior distribution is shown in Figure 2.5 for an event with a total network SNR of 19.6. Strong correlations can be seen between certain parameters, particularly between distance and inclination and between different pairs of mass parameters. We can also see that the phase at coalescence is not well constrained and that there is a multimodal degeneracy in the polarisation of the GW signal. It is important to fully explore these modes and degeneracies as the true signal parameters can be located anywhere within them.

Focusing on the intrinsic parameters of the system, which for this simple model are limited to $\mathcal{M}$ and $\eta$, we see in Figure 2.6 a degeneracy but accurate recovery of the true parameters within the 68% confidence interval. For the sky location, which is what we need to measure for electromagnetic follow-up, there is again a degeneracy but the location has been measured to within a fraction of a steradian, with the true value recovered just outside the 68% confidence interval. The 68% interval contains less than 50 deg² and the 95% interval less than 100 deg²; the posterior as mapped on the entire sky can be seen in Figure 2.7.
For our first test that the posteriors were recovered correctly, we look at the fraction of injections recovered at a given level of each parameter's integrated one-dimensional posterior probability. We integrate the posteriors from the minimum to the maximum of each parameter and record the cumulative distribution function (CDF) value at the point of the true injected value. We expect the distribution of these values to map directly to the posterior distribution. If we therefore plot the fraction of injections recovered below a given CDF value against CDF values, the distribution should follow the line y = x from 0 to 1. We are essentially comparing the CDF of the injections with the CDF of the measured posteriors to confirm that they are equal. In Figure 2.8 we show this for all 9 model parameters as well as for m_1 and m_2, on which our prior is defined. We find that the 1D CDFs all match the expected distributions very closely, usually to within the 2σ error. Some deviations are seen, but these are most exaggerated for poorly constrained parameters, and correlations between parameters can cause these errors to affect other measurements. To quantitatively measure the differences between the experimental and theoretical CDFs, we performed a Kolmogorov-Smirnov (K-S) test on each. This test uses the maximum difference between the two CDFs to compute a p-value for the null hypothesis that both are drawn from the same distribution. A p-value less than the threshold of 0.05 indicates that we reject the null hypothesis at 95% confidence. All parameters return p-values above 0.05, as indicated in the subplot titles. This is a good indication that the recovered posterior distributions do accurately represent our state of knowledge. A more thorough check involves measuring the posterior confidence level at which the injection was recovered in sky position.
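This calibration check (often called a P-P plot) is straightforward to sketch. The helper names below are illustrative, not taken from the analysis pipeline, and the seeded toy data stands in for the real injection set.

```python
import numpy as np
from scipy import stats

def cdf_at_truth(samples, true_value):
    """1D posterior CDF evaluated at the injected value:
    the fraction of posterior samples below the truth."""
    return float(np.mean(np.asarray(samples) < true_value))

def ks_uniform(cdf_values):
    """K-S test of the per-injection CDF values against U(0, 1);
    well-calibrated posteriors give uniformly distributed values."""
    return stats.kstest(cdf_values, "uniform")

# Toy demonstration: 200 perfectly calibrated Gaussian "posteriors".
rng = np.random.default_rng(0)
cdfs = [cdf_at_truth(rng.normal(mu, 1.0, 5000), mu + rng.normal(0.0, 1.0))
        for mu in rng.uniform(-5, 5, 200)]
stat, pvalue = ks_uniform(cdfs)
```

Sorting `cdfs` and plotting against `np.linspace(0, 1, len(cdfs))` would reproduce the y = x comparison of Figure 2.8 for this toy.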
This interval is calculated by measuring the accumulated posterior probability in regions with higher probability density than where the injection is located. We perform a K-D tree [137] binning on the two sky location variables and sum the posterior density for 'leaves' with higher density than where the injection is found. As in the 1D case, we expect to find X% of injections within the X% high probability density (HPD) region of the posterior. Figure 2.9 displays our measured recovery compared to the theoretical line.

Figure 2.5: Marginalised 2D and 1D posterior distributions for event 22 (all parameters except t_c). This event has an SNR of 9.0 in H1, 13.4 in L1, and 11.1 in V1, for a total network SNR of 19.6. Red vertical lines and crosses indicate the true values. Contours on 2D plots indicate 68% and 95% confidence intervals.

Figure 2.6: Marginalised 2D posterior distributions for event 22 in (a) chirp mass and symmetric mass ratio and (b) sky position. The red cross indicates the true value. Contours indicate 68% and 95% confidence intervals.

Figure 2.7: The marginalised posterior probability distribution for the sky position of event 22, shown across the entire sky. The white × marks the true location.

Figure 2.8: The fraction of injections recovered at or below 1D posterior CDF values. Error bars are given by √(p(1−p)/N), where N is the number of injections. K-S test p-values are given in the titles of each subplot. All parameters pass the test for a threshold of 0.05.

Figure 2.9: The fraction of injections recovered within high probability density intervals of the posterior as computed over θ_s and φ_s. The sag is due to posterior biases and K-D tree binning.
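A minimal stand-in for the K-D tree binning of [137] can be sketched as follows: split the sky-position samples at the median along alternating axes, treat each leaf as a bin of constant density, and accumulate the probability in leaves denser than the one containing the injection. This is an illustrative sketch under those assumptions, not the implementation used in the analysis.

```python
import numpy as np

def kd_bins(points, leaf_size=16, bounds=None, depth=0):
    """Recursive median splits of 2D samples along alternating axes.
    Returns a list of leaves as (count, area, bounds) tuples."""
    if bounds is None:
        bounds = [points[:, 0].min(), points[:, 0].max(),
                  points[:, 1].min(), points[:, 1].max()]
    area = (bounds[1] - bounds[0]) * (bounds[3] - bounds[2])
    if len(points) <= leaf_size:
        return [(len(points), max(area, 1e-12), bounds)]
    axis = depth % 2
    cut = np.median(points[:, axis])
    left = points[points[:, axis] <= cut]
    right = points[points[:, axis] > cut]
    if len(left) == 0 or len(right) == 0:   # degenerate split: stop here
        return [(len(points), max(area, 1e-12), bounds)]
    lb, rb = list(bounds), list(bounds)
    lb[2 * axis + 1] = cut                  # shrink upper bound of left child
    rb[2 * axis] = cut                      # shrink lower bound of right child
    return (kd_bins(left, leaf_size, lb, depth + 1)
            + kd_bins(right, leaf_size, rb, depth + 1))

def hpd_level(points, truth, leaf_size=16):
    """HPD level at which `truth` is recovered: total probability in
    leaves with higher density than the leaf containing the truth."""
    leaves = kd_bins(np.asarray(points), leaf_size)
    n = len(points)
    dens = [(c / n) / a for c, a, b in leaves]
    inside = [d for d, (c, a, b) in zip(dens, leaves)
              if b[0] <= truth[0] <= b[1] and b[2] <= truth[1] <= b[3]]
    if not inside:
        return 1.0                          # truth outside the sampled region
    d0 = max(inside)
    return sum(c / n for (c, a, b), d in zip(leaves, dens) if d > d0)
```

Passing the (θ_s, φ_s) posterior samples and the injected sky position would give one point on the x-axis of Figure 2.9 for each injection.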
There is a degree of 'sag' below the line, but this is an expected artifact of the K-D tree binning process and of Poisson error on the number of samples in a given region.

A final validation of the recovered posterior values involves direct measurement of the full nine-dimensional HPD interval at which each injection is recovered. We do this by calculating the log-prior at each posterior sample and adding it to the log-likelihood to obtain a non-normalised log-posterior density value. The same value is computed for the true parameters of the injection. We then count the fraction of posterior samples that have a higher density than the true parameters; this equals the confidence interval at which the injection was recovered. We expect the fraction of injections recovered within a given HPD to follow y = x, as in Figure 2.8, if the recovered posteriors are in fact representative of the correct priors and likelihood. This comparison is shown in Figure 2.10 along with 1σ error bars. The recovery is consistent to within 2σ of the theoretical value for all confidence intervals, but there is a systematic bias towards lower values. This is likely due to the biases seen in the one-dimensional CDFs of Figure 2.8 accumulating to push the peak away from the true value beyond the expected statistical variation. The K-S test results in a p-value of 0.179, so the null hypothesis that the CDFs are equal is not rejected.

Figure 2.10: The fraction of injections recovered within high probability density intervals of the posterior as computed from posterior probability density values directly. Error bars are given by √(p(1−p)/N), where N is the number of injections. The sag is due to accumulation of the biases seen in the 1D cases (Figure 2.8). The K-S test p-value is 0.179.

Having passed these tests of the posterior distribution, we now look at the statistics for the areas within certain intervals.
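This nine-dimensional check reduces to a few lines once the per-sample log-likelihood and log-prior are in hand. The function name is illustrative, and the seeded 1D Gaussian toy below stands in for a real posterior.

```python
import numpy as np

def recovery_level(log_like, log_prior, log_like_true, log_prior_true):
    """HPD level at which an injection is recovered: the fraction of
    posterior samples whose (unnormalised) log-posterior density,
    log L + log prior, exceeds that of the true parameters."""
    log_post = np.asarray(log_like) + np.asarray(log_prior)
    return float(np.mean(log_post > log_like_true + log_prior_true))

# Toy check with a 1D Gaussian posterior and a flat prior:
rng = np.random.default_rng(0)
x = rng.normal(0.0, 1.0, 20000)          # posterior samples
log_like = -0.5 * x**2                   # Gaussian log-likelihood
log_prior = np.zeros_like(x)             # flat prior
level_at_peak = recovery_level(log_like, log_prior, 0.0, 0.0)   # truth at the peak
level_at_1sig = recovery_level(log_like, log_prior, -0.5, 0.0)  # truth 1 sigma away
```

A truth at the density peak is recovered at level 0, and a truth one standard deviation away at level ≈ 0.68, as expected for a Gaussian.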
Our first consideration is the area contained within a particular posterior high probability density interval. Using the K-D tree binning, we calculate the sky area within the 68%, 90%, and 95% confidence intervals. In Figure 2.11 we plot the fraction of injections whose confidence intervals contain a certain amount of sky area or less. This allows us to say that if we can search over a certain amount of sky area, say 100 deg², then ∼80% of all signals will have their 90% confidence intervals contained within that search area; we therefore expect to place constraints on ∼72% of signals. We can also determine that our confidence regions increase by approximately an order of magnitude in sky area in going from 68% to 90%, and again from 90% to 95%. Using larger confidence intervals may therefore make the search regions impractically large for electromagnetic follow-up. Posterior distributions tend to be ellipses in right ascension and declination, making them slightly curved on the sky when extending further from the peak.

The next consideration is the fraction of injections that were recovered within a given sky area. For this, we use the K-D tree binning to determine the sky area contained within bins with posterior density equal to or greater than that of the bin in which the true point is located. This comparison is shown in Figure 2.12 and allows us to determine the amount of sky area we need to be able to search to find counterparts for a given fraction of signals or, conversely, the fraction of signals we can expect to find counterparts for in a given search area.

Figure 2.11: The fraction of injections whose HPD posterior confidence intervals contain the amount of sky area or less. Areas are calculated using K-D tree binning.

Figure 2.12: The fraction of injections that were recovered within the amount of sky area or less. Areas are calculated using K-D tree binning.

For example, if we can only search over an
area of 4 deg², then we can expect to locate counterparts (where they exist) for 50% of signals; if, however, we wish to identify counterparts for 90% of signals, then we will need to search 300 deg² of sky. With this information we can optimise our follow-up search strategies based on either the sky area they can map or the fraction of signals we wish to recover.

The first two analyses were averaged over all injections. However, we expect to be able to localise signals with higher SNR to smaller sky areas. Therefore, the final comparison considers the sky area as a function of signal-to-noise ratio. Figure 2.13 shows the sky area contained within the 90% posterior confidence interval as a function of SNR. It also shows the area within the interval at which each signal was recovered, though this measurement is subject to statistical variations in the relative positions of the true location and the posterior peak due to the particular noise realisation. Overall, we can see in both sets of points that increased SNR does result in smaller sky areas, with the 90% interval area scaling roughly as ∝ 1/SNR². For SNRs above 25, all 90% confidence regions contain 100 deg² or less, but for all signals with SNR below 20 this same interval contains at least 20 deg². To find electromagnetic counterparts to weaker signals, then, we will have to be able to search larger areas of the sky. Even for very loud signals, 90% confidence intervals still contain over a square degree, which is a large area for electromagnetic telescopes.

Figure 2.13: The sky area within 90% posterior confidence intervals, and the posterior interval at which an injected signal was recovered, as a function of signal-to-noise ratio. Areas are calculated using K-D tree binning. A line showing A ∝ 1/SNR² is plotted for comparison.
Since our ability to localise sources is fundamentally limited to areas of O(10) deg² for 90% confidence intervals, optimised methods for electromagnetic follow-up, such as those explored in [138], are necessary to increase the chances of observing a counterpart. Combined detections of both GWs and electromagnetic signals (and possibly neutrinos) will further confirm source models for GW observations and lead to improved science. It may be possible to discover more about the internal physics of supernovae, neutron star mergers that form black holes, and much more.

2.3 The “Big Dog” Event and Coherence Tests

Throughout 2009 and 2010, LIGO and Virgo were collecting data in their sixth and second/third science runs, respectively (called S6 and VSR2/3). One of the primary sources that these detectors could observe, and that was searched for in the data, is low-mass compact binary coalescences. These involve binaries with a total mass less than 25 M_\odot and a minimum component mass of 1 M_\odot. The CBC group of the LVC has published a report on the results of this search [68, 139], which made no detections of gravitational waves. A complementary high-mass search was also performed [140]. A similar analysis was previously performed on the joint observations during S5 and VSR1 [141–144]. A review of knowledge of event source rates following S5/VSR1 is presented in [145].

During S6 and VSR2/3, a hardware injection was made into the detectors (by actuating test mass mirror controls) without the knowledge of the data analysis teams as part of a “blind injection challenge”. The task was for the detection pipelines to find the event and then for the analysis groups to correctly identify and characterise the signal that was injected. The event came to be known as the “big dog” since initial sky location estimates placed the source in the Canis Major constellation.
The analysis in this section was presented as a poster at the Ninth Edoardo Amaldi Conference on Gravitational Waves [111] and used data from the S6 and VSR2 science runs.

One of the major considerations in determining the existence of a real signal is the possibility of false coincidences. Each of the detectors registered an event trigger at a similar time, but this could also happen if each glitched at approximately the same time. We are able to rule out coincident glitches by performing a coherent search and comparing its evidence to that of an incoherent search. In the case of just the LIGO detectors, this can be summarised as:

coherent:   Z_HL = ∫ L_HL(θ) dθ,
incoherent: Z_H Z_L = ∫ L_H(θ_1) dθ_1 × ∫ L_L(θ_2) dθ_2.

If each detector did indeed observe the same signal, then the combined likelihood function will provide a larger evidence Z_HL than the product of the evidences from the separate likelihoods, Z_H Z_L. If this was not a real GW signal, then the two detectors will favour different parameters and the coherent evidence will be severely penalised. We can consider these as two separate models and compare the Bayesian evidences: if Z_HL > Z_H Z_L we favour the coherent signal model, and if Z_HL < Z_H Z_L we favour the incoherent model, which implies no GW signal. A toy example of this application can be found in [146].

A large number of time-slides were analysed in the calculation of the false alarm rate for the “big dog” event. These involve taking data from one of the two LIGO detectors and translating it in time such that there can be no coherent GW signal present. This data is then analysed for potential triggers. Data from Virgo was not used in the analysis that generated the time-slides, and the signal in Virgo was of low SNR, so V1 data is not used here either.
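The way the evidence penalises incoherence can be seen in a toy model with analytic evidences: each detector measures a single amplitude with Gaussian noise under a wide flat prior, the coherent model shares one amplitude, and the incoherent model gives each detector its own. This is only a hypothetical illustration (the real computation integrates the full likelihood with nested sampling); all names and the chosen prior width are assumptions.

```python
import numpy as np

def log_evidence_ratio(d_h, d_l, sigma=1.0, prior_width=100.0):
    """Analytic log(Z_HL / (Z_H Z_L)) for the toy one-parameter problem:
    detectors H and L each measure an amplitude with Gaussian noise
    `sigma`, under a flat prior of width `prior_width`."""
    # Incoherent: each likelihood integrates to ~1 over a wide flat prior.
    log_z_single = -np.log(prior_width)
    # Coherent: the shared-parameter integral collapses to a Gaussian
    # in the difference of the two measurements.
    delta = d_h - d_l
    log_z_coherent = (-np.log(prior_width)
                      - 0.5 * np.log(4 * np.pi * sigma**2)
                      - delta**2 / (4 * sigma**2))
    return log_z_coherent - 2 * log_z_single
```

Consistent measurements earn a positive log-ratio (an Occam bonus for sharing one parameter), while a disagreement Δ costs Δ²/(4σ²) in log-evidence, which is the qualitative behaviour seen for the time-slide triggers in Table 2.3.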
In total, 14 triggers were found with SNR or “newSNR” (SNR modified by the χ² of the data minus the best-fit template) greater than that of the original event; all of these contained the injected signal in H1 and a glitch in L1. Details of the time-slides and the newSNR calculation can be found in [68]. These time-slides provide us with an opportunity to test the Bayesian criterion for comparing coherent and incoherent signal models. The 14 time-slide triggers and the “big dog” were each analysed with the TaylorF2 waveform model at 2, 2.5, and 3.5 post-Newtonian order in phase (0pN amplitude). Table 2.3 reports the Bayesian evidence ratios calculated for each trigger using each waveform model.

The significantly negative log-evidence ratios clearly eliminate all time-slide triggers from being coherent signals, with probabilities much less than 1%. The most significant time-slide, TS5, contained a weak signal in L1, and so was able to disagree with the signal in H1 the least. L1 on its own returned a very low evidence and would not likely have passed as a trigger by itself. The cause of the low evidence ratios for the “big dog” event itself is parameter bias arising from the fact that we were not using the correct waveform model. After the true signal was revealed, we found that the injection was calculated with a time-domain waveform and contained significant spin in the larger component, which the TaylorF2 waveform did not model. Even so, the lowest coherent probability was still 37.5%, more than enough to prevent us from eliminating the coherent model as a possibility without further investigation.

This analysis demonstrated the usefulness of Bayesian evidence criteria for model selection and detection in LIGO. When the noise is difficult to model but uncorrelated between separate detectors, comparing the Bayesian evidence of coherent and incoherent signal models can serve as a further detection threshold to eliminate coincident glitches.
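Under equal prior odds, a log-evidence ratio maps to a coherent-model probability as P = R/(1 + R), i.e. a logistic function of log R. A quick check (illustrative helper name) reproduces the 37.5% quoted above from the 2.5pN “big dog” entry of Table 2.3:

```python
import numpy as np

def coherent_probability(log_r):
    """P(coherent | data) from log R = log(Z_HL / (Z_H Z_L)),
    assuming equal prior odds on the two models: P = R / (1 + R)."""
    return 1.0 / (1.0 + np.exp(-log_r))

# "Big dog" at 2.5pN: log(R) = -0.509 gives P ~ 0.375, the 37.5% in the text.
p_bd = coherent_probability(-0.509)
```

The same mapping sends the time-slide values (log R of −3 and below) to probabilities well under 5%, consistent with their elimination.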
Event   log(R) 2pN   log(R) 2.5pN   log(R) 3.5pN
BD       1.39         −0.509          1.90
TS1    −17.8        −15.4          −15.1
TS2    −11.5        −10.2          −10.1
TS3    −12.7        −14.3          −13.4
TS4    −11.1        −12.5          −10.8
TS5     −2.96        −3.88          −2.65
TS6    −14.2        −14.2          −14.1
TS7    −14.2        −12.0          −11.7
TS8     −8.38        −8.52          −6.59
TS9     −7.74        −9.21          −9.19
TS10   −14.2        −14.7          −14.0
TS11   −15.2        −13.1          −14.2
TS12    −7.39        −8.06          −8.07
TS13   −10.4        −7.53          −9.00
TS14   −10.6        −11.8          −12.3

Table 2.3: Logs of the Bayesian evidence ratios, R = Z_HL/(Z_H Z_L), comparing coherent and incoherent signal models for the time-slide triggers (TS1–TS14) and the “big dog” event (BD). This analysis clearly rules out all time-slide triggers as signals but is not definitive with regard to the “big dog”.

2.4 Conclusion

Throughout this chapter I have shown the usefulness of Bayesian inference methods for the analysis of ground-based gravitational wave detector data. A first toy example demonstrated the ability of Bayesian techniques to provide accurate parameter estimation as well as a threshold for detection.

An analysis of many simulated signals injected into simulated noise representative of the LIGO-Virgo network further illustrates this applicability to ground-based detector data. We are able to obtain accurate posterior inferences about the source parameters, allowing us to measure intrinsic parameters of the system as well as determine the location of a source on the sky. This latter capability was quantified, as it is important for obtaining electromagnetic follow-up observations to supplement the information about the source event. We found that the LIGO-Virgo network will be able to identify 50% of signals to within 4 deg² and 90% of signals to within 300 deg². Louder signals will be more precisely measured on the sky, with the area in a given posterior confidence interval scaling roughly as ∝ 1/SNR².
Lastly, through analysis of the “big dog” blind injection event in S6 and VSR2, we were able to test a Bayesian criterion for the veracity of a detected signal. By requiring that the signal in distant detectors be from the same source, we compared coherent and incoherent signal models using the Bayesian evidence. Coincident but incoherent signals were simulated by using time-slides of detector data; the evidence ratio clearly indicated that these were all incoherent detections, while supporting the hypothesis that the injection itself was a real signal.

These two tools, the ability to distinguish signals from coincident glitches and to measure the probability distribution of source parameters, are vital to future LIGO and Virgo observations. Once detection claims are being made it is even more important to be sure that these are in fact real astrophysical signals and not conspiring glitches in the detectors. Additionally, measuring source parameters will allow for tests of general relativity and of astrophysical source and population models. Determining sky locations facilitates electromagnetic follow-up observations to obtain even more information about a detection and perform multi-messenger astronomy.

Chapter 3
The Mock LISA Data Challenges

Measure what can be measured, and make measurable what cannot be measured.
Galileo Galilei

In addition to ground-based gravitational wave searches, there are long-term plans for a space-based detector. Originally the joint NASA and ESA project, the Laser Interferometer Space Antenna (LISA) [82], it is now an ESA-only venture, the New Gravitational-wave Observatory (NGO) [80, 81]. Preparation for this has involved both technical development of the satellites and exploration of the science case. The latter has involved modelling the populations of expected sources in the sensitivity range of the detector and then preparing and analysing expected observational data [147, 148].
In this chapter I present the analysis of data from a few of the mock data challenges that were run to spur development of data analysis code and to demonstrate our ability to obtain scientific information from future observations. The work in Section 3.1 was done in collaboration with Jonathan Gair and Farhan Feroz and published in [149]; Section 3.2 was done in collaboration with F. Feroz and Stanislav Babak of the Max Planck Institute for Gravitational Physics (Albert Einstein Institute, AEI) in Potsdam, Germany; Section 3.3 was performed in collaboration with F. Feroz, S. Babak, and Antoine Petiteau, also of the AEI.

3.1 Cosmic String Cusps

In preparation for the future launch of a space-based gravitational wave detector, such as LISA or NGO, development of data analysis algorithms has been a field of active research. To this end, LISA algorithm development has been encouraged by the Mock LISA Data Challenges (MLDCs) [150]. Round 3, completed in April 2009, included data sets containing galactic white dwarf binaries (see Section 3.2), supermassive black hole binary mergers, extreme-mass-ratio inspirals, cosmic string cusps, or a stochastic background added to instrumental noise. In [149], we applied MULTINEST to the problem of detecting and characterising cosmic string cusps (MLDC 3.4). A summary of all submissions for MLDC 3, as well as plans for MLDC 4, can be found in [151]. A separate analysis of MLDC 3.4 is described in [152], where the authors focus on the degeneracies inherent in this signal type's posterior distribution.

Cosmic string cusps in the observable frequency range of LISA are a burst source with durations of hundreds of seconds or less, which is short on the timescale of a LISA mission lasting up to two years. However, they are just one type of potential source of burst signals.
Since they can be physically modelled, we can use a matched filtering search with expected waveforms. There may be bursts of gravitational waves in the LISA data stream from other sources, and the question arises as to whether we would be able to detect them and whether we can distinguish cosmic string bursts from bursts due to these other sources. The Bayesian evidence that MULTINEST computes is one tool that can be used to address model selection. Evidence has been used for gravitational wave model selection to test the theory of relativity in ground-based observations [114] and, in a LISA context, to distinguish between an empty data set and one containing a signal [153]. Calculation of an evidence ratio requires a second model for the burst as an alternative to compare against. We chose to use a sine-Gaussian waveform as a generic alternative burst model, as this is one of the burst models commonly used in LIGO data analysis (see for instance [154]). We find that the evidence is a powerful tool for characterising bursts: for the majority of detectable bursts, the evidence ratio strongly favours the true model over the alternative. While the sine-Gaussian model does not necessarily describe all possible un-modelled bursts well, these results suggest that the evidence can be used to correctly identify any cosmic string bursts that are present in the LISA data stream.

3.1.1 Burst waveform models

3.1.1.1 Cosmic Strings

Gravitational waves can be generated by cosmic strings through the formation of cusps, where portions of the string travel at nearly the speed of light [155, 156]. Such radiation is highly beamed; when viewed along the emission axis it is linearly polarised and takes a simple power-law form, h(t) ∝ |t − t_c|^{1/3} [155].
When viewed slightly off-axis, the waveform is still approximately linearly polarised, but the cusp spectrum is rounded off and decays rapidly for frequencies above f_max ∼ 2/(α³L), where α is the viewing angle and L is the dimension of the feature generating the cusp [155, 157]. The particular model for the frequency domain waveform adopted in the MLDC is given by [150]

|h(f)| = A f^{−4/3}                    for f < f_max,
|h(f)| = A f^{−4/3} exp(1 − f/f_max)   for f > f_max.   (3.1)

In addition, the MLDC waveforms include a fourth-order Butterworth filter to mitigate dynamic-range issues associated with inverse Fourier transforms. We adopt the same ansatz as the MLDC, namely that the Fourier domain waveform amplitude is given by

|h_+| = A f^{−4/3} (1 + (f_low/f)²)^{−4}                    for f < f_max,
|h_+| = A f^{−4/3} (1 + (f_low/f)²)^{−4} exp(1 − f/f_max)   for f > f_max,
|h_×| = 0,   (3.2)

and the phase by exp(2πi f t_c), where t_c is the burst time.

3.1.1.2 Sine Gaussians

A sine-Gaussian waveform is centred on a particular frequency and has exponentially suppressed power at nearby frequencies. We choose to consider a linearly polarised sine-Gaussian, for which the waveform magnitudes in the frequency domain are given by

|h_+| = \frac{A}{2}\sqrt{\frac{Q^2}{2\pi f_c^2}} \exp\left(-\frac{Q^2}{2}\left(\frac{f - f_c}{f_c}\right)^2\right),   |h_×| = 0,   (3.3)

where A is the dimensionless amplitude, f_c is the frequency of the sinusoidal oscillation, and Q is the width of the Gaussian envelope. The phase of the wave is again exp(2πi f t_c), where t_c is the central burst time. In the time domain, the sine-Gaussian is a small burst “packet” of particular frequency, f_c, with the number of cycles in the burst determined by Q.

3.1.1.3 Detector model

To include the LISA response we made use of the static LISA model as described in [158]. This approximation is valid for burst sources, as LISA does not move significantly over the typically short duration of the bursts, which is of the order of 1000 s or less. The static LISA response model was also adopted for cosmic string bursts in [157].
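The two amplitude models of Equations (3.2) and (3.3) are simple to code. The sketch below is illustrative only: the low-frequency filter value `flow` is an assumed placeholder, since the Butterworth frequency is not given in the text.

```python
import numpy as np

def cusp_amplitude(f, amp, fmax, flow=1e-4):
    """|h_+| of Eq. (3.2): an f^{-4/3} power law with a low-frequency
    Butterworth roll-off and an exponential decay above fmax.
    `flow` is an assumed value, not one from the text."""
    f = np.asarray(f, dtype=float)
    base = amp * f ** (-4.0 / 3.0) * (1 + (flow / f) ** 2) ** (-4)
    return np.where(f <= fmax, base, base * np.exp(1 - f / fmax))

def sine_gaussian_amplitude(f, amp, fc, q):
    """|h_+| of Eq. (3.3): a Gaussian envelope centred on fc,
    with width (number of cycles) set by Q."""
    f = np.asarray(f, dtype=float)
    return (amp / 2) * np.sqrt(q**2 / (2 * np.pi * fc**2)) \
        * np.exp(-0.5 * q**2 * ((f - fc) / fc) ** 2)
```

Note the two branches of the cusp model agree at f = f_max, so the spectrum is continuous across the break.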
Three optimal detection time-delay interferometry (TDI) channels, A, E and T, can be constructed from the LISA data stream [159]. The noise is uncorrelated within and across these channels. For the search of the MLDC data, the A, E and T channels were constructed as linear combinations of the three Michelson channels, X, Y and Z, that were provided in the data release:

A = \frac{1}{\sqrt{2}}(Z − X),   (3.4a)
E = \frac{1}{\sqrt{6}}(X − 2Y + Z),   (3.4b)
T = \frac{1}{\sqrt{3}}(X + Y + Z).   (3.4c)

The power spectral densities of the noise in the three channels are given by

S_A(f) = S_E(f) = 16 sin²(2πf t_L) [ 2(1 + cos(2πf t_L) + cos²(2πf t_L)) S_pm(f) + (1 + cos(2πf t_L)/2) S_sn f² ],   (3.5)
S_T(f) = 16 sin²(2πf t_L) [ 2(1 − 2cos(2πf t_L) + cos²(2πf t_L)) S_pm(f) + (1 − cos(2πf t_L)) S_sn f² ],   (3.6)
S_pm(f) = (1 + (10⁻⁴ Hz / f)²) S_acc / f²,   (3.7)

where t_L = 16.678 s is the light travel time along one arm of the LISA constellation, S_acc = 2.5×10⁻⁴⁸ Hz⁻¹ is the proof mass acceleration noise and S_sn = 1.8×10⁻³⁷ Hz⁻¹ is the shot noise.

The LISA data stream will also contain a confusion noise foreground from unresolved white dwarf binaries in our galaxy. This was not included in the noise model for MLDC round 3.4, and so we have ignored the confusion noise contribution in the current work. In practice, we found that the T channel was too noisy to be used in the MLDC search and did not contribute any significant SNR, so we used only the A and E channels for data analysis.

3.1.1.4 Likelihood evaluation

The space of gravitational waveform signals possesses a natural scalar product [160, 161],

⟨h | s⟩ = 2 ∫₀^∞ \frac{df}{S_n(f)} [ h̃(f) s̃*(f) + h̃*(f) s̃(f) ],   (3.8)

where

h̃(f) = ∫₋∞^∞ dt h(t) e^{2πi f t}   (3.9)

is the Fourier transform of the time domain waveform h(t). The quantity S_n(f) is the one-sided noise spectral density of the detector.
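The channel construction of Equation (3.4) and the noise PSDs of Equations (3.5)-(3.7) can be sketched directly; the function names are illustrative.

```python
import numpy as np

T_L = 16.678        # one-arm light travel time (s)
S_ACC = 2.5e-48     # proof-mass acceleration noise (Hz^-1)
S_SN = 1.8e-37      # shot noise (Hz^-1)

def aet_from_xyz(x, y, z):
    """Eq. (3.4): noise-orthogonal A, E, T channels from Michelson X, Y, Z."""
    a = (z - x) / np.sqrt(2)
    e = (x - 2 * y + z) / np.sqrt(6)
    t = (x + y + z) / np.sqrt(3)
    return a, e, t

def _s_pm(f):
    """Eq. (3.7): proof-mass noise contribution."""
    return (1 + (1e-4 / f) ** 2) * S_ACC / f**2

def psd_ae(f):
    """Eq. (3.5): noise PSD of the A and E channels."""
    w = 2 * np.pi * f * T_L
    return 16 * np.sin(w) ** 2 * (
        2 * (1 + np.cos(w) + np.cos(w) ** 2) * _s_pm(f)
        + (1 + np.cos(w) / 2) * S_SN * f**2)

def psd_t(f):
    """Eq. (3.6): noise PSD of the T channel."""
    w = 2 * np.pi * f * T_L
    return 16 * np.sin(w) ** 2 * (
        2 * (1 - 2 * np.cos(w) + np.cos(w) ** 2) * _s_pm(f)
        + (1 - np.cos(w)) * S_SN * f**2)
```

Because the (A, E, T) combination is an orthonormal rotation of (X, Y, Z), it preserves the total noise power sample by sample, which is what makes the three channels' noise uncorrelated.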
For a family of sources with waveforms h(t; \vec{λ}) that depend on parameters \vec{λ}, the output of the detector, s(t) = h(t; \vec{λ}_0) + n(t), consists of the true signal h(t; \vec{λ}_0) and a particular realisation of the noise, n(t). Assuming that the noise is stationary and Gaussian, the logarithm of the likelihood that the parameter values are given by \vec{λ} is

log L(\vec{λ}) = C − \frac{1}{2} ⟨ s − h(\vec{λ}) | s − h(\vec{λ}) ⟩,   (3.10)

and it is this log-likelihood that is evaluated at each point in the search. The constant C depends on the dimensionality of the search space, but its value is not important, as we are only interested in the relative likelihoods of different points. As mentioned earlier, the LISA data stream has several independent data channels. The total multi-channel likelihood is obtained by summing the scalar product in Equation (3.8) over the A and E channels that are used.

3.1.1.5 Priors

The parameter space over which to search for signals must also be specified. For the searches with the cosmic string model, we used uniform priors for each parameter, covering the signal space from which the MLDC sources were drawn. These correspond to the ranges in Table 3.1. Here, θ is the sky colatitude, φ is the sky azimuthal angle, and ψ is the polarisation. The data stream was divided into segments for the search; T_0 is the start time of the particular data segment being searched and ΔT is that segment's length.

Table 3.1: Prior probability distributions for the cosmic string cusp model.

Parameter    Minimum   Maximum
log(A)       −23.3     −21
f_max (Hz)   0.001     0.5
t_c (s)      T_0       T_0 + ΔT
sin(θ)       −1        1
φ (rad)      0         2π
ψ (rad)      0         2π

For the sine-Gaussian model, we maintained the approach of using uniform prior ranges while making sure that all signals would fall within our specified bounds. The priors used are given in Table 3.2.
T_0, ΔT, θ, φ, and ψ are the same physical quantities as in the cosmic string model; the prior on log(A) was modified in order to cover the same range of SNRs for the signals.

Table 3.2: Prior probability distributions for the sine-Gaussian model.

Parameter    Minimum   Maximum
log(A)       −22.3     −18
f_c (Hz)     0.001     0.5
Q            1         10
t_c (s)      T_0       T_0 + ΔT
sin(θ)       −1        1
φ (rad)      0         2π
ψ (rad)      0         2π

3.1.2 MLDC challenge 3.4 search

Mock LISA Data Challenge 3.4 [150] consisted of a data set of 2²¹ samples at a cadence of 1 s, containing GW burst signals from cosmic string cusps [155, 156]. The signals occurred as a Poissonian random process throughout the data set, with an expectation value of five events. To analyse the data, we divided the data stream into segments of 32768 s in length, with 2048 s of overlap between neighbouring segments. The overlap was chosen to ensure that each of the bursts, which had a maximum duration of ∼1000 s, would be entirely contained in at least one segment. To avoid artefacts in the Fourier domain, we applied Welch windowing before performing the Fast Fourier Transform (FFT) of each segment. We also analysed data segments of the same length, offset from the first set by 16384 s, to check that no signals had been missed and to verify that the detected signals were not edge artefacts. We also repeated the search with segments of 16384 s and 8192 s in length to verify our results.

MULTINEST was run with 4000 live points on each segment, using eight 3 GHz Intel Woodcrest processors in parallel; the algorithm took approximately 5 minutes to run on a single segment when no signal was present, and twice this when one was present. The search of the offset and shorter segments returned all of the same detections. We ran the search on the MLDC training data set to test the algorithm, and then on the challenge data set.
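The segmenting-and-windowing step can be sketched as follows. Segment length and overlap follow the values above (at 1 s cadence, samples equal seconds); the Welch window is implemented directly as the standard parabolic window, since the exact windowing code is not specified, and the function names are illustrative.

```python
import numpy as np

def welch_window(n):
    """Parabolic Welch window, applied before the FFT of each segment."""
    k = np.arange(n)
    half = (n - 1) / 2.0
    return 1 - ((k - half) / half) ** 2

def segments(data, seg_len=32768, overlap=2048):
    """Split a 1 s cadence data stream into overlapping segments so a
    burst of <~1000 s duration lies wholly inside at least one segment.
    Returns (start_index, segment) pairs."""
    step = seg_len - overlap
    starts = range(0, len(data) - seg_len + 1, step)
    return [(s, data[s:s + seg_len]) for s in starts]

def segment_ffts(data, seg_len=32768, overlap=2048):
    """Welch-windowed FFT of each segment (positive frequencies only)."""
    win = welch_window(seg_len)
    return [(s, np.fft.rfft(seg * win))
            for s, seg in segments(data, seg_len, overlap)]
```

Repeating the call with segments offset by 16384 s, as described above, amounts to passing `data[16384:]` through the same functions.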
To assess the quality of our solutions we use not only the recovered parameters for a source, but also the signal-to-noise ratio (SNR) of the recovered solution. The degeneracies in the parameter space mean that a waveform can match the true waveform very well with very different parameter values. The recovered SNR is a measure of how much of a residual would be left in the data stream if the recovered parameters were used to subtract a given source from the data set. If the recovered SNR is close to the true SNR, we can say that we have correctly identified the waveform of the source even if the parameters are quite different.

3.1.2.1 Challenge data

Three signals were found in the challenge data set, at t_c ∼ 6×10⁵ s, t_c ∼ 1.07×10⁶ s and t_c ∼ 1.6×10⁶ s, and this was the correct number of signals. We label these as source candidates 3, 2 and 1 respectively, for consistency with the MLDC key file. For source candidates 1 and 3, MULTINEST returned a flat distribution in f_max, leading us to conclude that the actual maximum was above the Nyquist frequency of 0.5 Hz. We identified several modes in the posterior in each case, and used the local Bayesian evidence to characterise these. We decided to submit the two modes of highest evidence; these generally corresponded to two antipodal sky solutions. However, for source candidate 3 there was a third mode of almost equal local evidence, and so we submitted that mode as well.

The middle source, 2, was found to have a break frequency of 0.0011 Hz, very close to the minimum value of the prior range, 0.001 Hz. At this low frequency, LISA is not able to resolve the sky position of a burst source. As expected, we found very broad posteriors for all parameters other than the maximum frequency. As the posterior was not well separated into modes, we chose to submit only one set of parameters in this case, which we took to be the maximum a-posteriori parameters.
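The scalar product of Equation (3.8), the log-likelihood of Equation (3.10), and the recovered SNR used as a quality measure can all be sketched on a discrete frequency grid (illustrative names; a single channel is shown, the multi-channel version sums over A and E).

```python
import numpy as np

def inner_product(a, b, psd, df):
    """Discrete form of the scalar product of Eq. (3.8):
    <a|b> = 4 df Re sum( a(f) conj(b(f)) / Sn(f) )."""
    return 4 * df * np.real(np.sum(a * np.conj(b) / psd))

def log_likelihood(data, template, psd, df):
    """Eq. (3.10) up to the constant C, for a single channel."""
    r = data - template
    return -0.5 * inner_product(r, r, psd, df)

def snr(template, psd, df):
    """Optimal SNR of a template: sqrt(<h|h>)."""
    return np.sqrt(inner_product(template, template, psd, df))
```

A recovered solution whose `snr` matches the true value leaves a residual `data - template` at the level of the instrumental noise, which is the criterion applied to Table 3.4.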
In Table 3.3 we list the true parameters for the three sources in the challenge data set (available at [162]), and the parameters recovered by MULTINEST. For simplicity we only include the parameters for the mode of highest local evidence, not all of the modes submitted. The agreement between the true and recovered parameters is broadly consistent with our expectations, and the true parameters lie within a few standard deviations of the means in most cases. For source 1, tc is quite far from the true value by this measure. However, this is due to the correlated degeneracies between tc, θ and φ. The true time of coalescence was consistent with the full recovered posterior distribution. Additionally, for source 1 the true value of fmax is considerably below Nyquist, while our analysis recovered a flat distribution indicative of a value above Nyquist. We carried out further checks, but there is no indication that the data favours a particular value of fmax in this case. In Figure 3.1 we show the true waveforms, in the A channel, for the three sources present in the data set, and overlay them with the waveforms corresponding to each of the modes that we included in our submission. We see clearly that, in all three cases, the recovered waveforms reproduce the true waveforms well, with small differences that arise from noise fluctuations in the detector. We expect the recovered SNR to be a better measure of the quality of the solution than the values of the recovered parameters. In Table 3.4 we list the true SNRs in the A and E channels, and the SNRs recovered by each of the solutions included in the submission. We have clearly recovered all of the SNR of the true signal, and the residual after subtraction would be at the level of instrumental noise. We thus regard the search as a success. In Figure 3.2 we show 1D and 2D marginalised posterior probability distributions for the second of the sources present in the challenge data set.
We see that in this case fmax can be measured very well, since it is far below Nyquist, but degeneracies in sky position and time of coalescence are clearly evident. The sky position degeneracies are particularly severe since the source is at very low frequency.

3.1.3 Model selection using Bayesian evidence

The MLDC search demonstrated that MULTINEST is able to correctly identify and characterize bursts from cosmic string cusps in LISA data. However, cosmic string cusps produce a very particular waveform signature. LISA might also detect bursts of gravitational radiation from other sources, such as supernovae of hypermassive objects, or even unmodelled sources. A search of the LISA data using a cosmic string cusp model could conceivably detect these other types of burst. In order to make scientific statements it is important to be able to say what the most likely origin for an observed burst could be. The Bayesian evidence provides a tool for carrying out model

Source         A (×10^−21)  ψ (rad)  θ (rad)   φ (rad)  fmax (Hz)   tc (×10^6 s)
1  True        0.866        3.32     0.556     3.71     0.0296      1.6022
   Recovered   1.65         2.82     0.349     6.01     0.5*        1.6030
   Mean        1.13         2.70     −0.141    4.38     0.27        1.6030
   Std. Dev.   ±0.22        ±0.14    ±0.30     ±2.6     ±0.14       ±0.000071
2  True        2.79         5.12     −0.444    3.17     0.00108     1.0727
   Recovered   2.97         0.271    −0.893    5.12     0.00108     1.0732
   Mean        2.75         0.91     −0.00804  5.31     0.00108     1.0733
   Std. Dev.   ±0.63        ±0.28    ±0.52     ±0.46    ±0.000048   ±0.00019
3  True        0.854        4.66     −0.800    0.217    6.15        0.60000
   Recovered   1.14         0.927    −0.562    5.45     0.5*        0.59991
   Mean        1.16         0.962    −0.488    5.48     0.24        0.59992
   Std. Dev.   ±0.18        ±0.16    ±0.16     ±0.18    ±0.14       ±0.000076

Table 3.3: Parameters of the three signals present in the challenge data set, and the parameters recovered by MULTINEST for the mode of highest local evidence identified by the search. The parameters are amplitude, A, polarisation, ψ, sky colatitude, θ, and longitude, φ, break frequency, fmax, and time of coalescence at the Solar System barycentre, tc.
In the row labelled "Recovered" we quote the maximum likelihood parameters of the mode with highest evidence identified by MULTINEST. In the rows labelled "Mean" and "Std. Dev." we quote the mean and standard deviation in that parameter, as computed from the recovered posterior for the mode of highest evidence. Where the value of fmax is marked by a *, the distribution in this parameter was very flat, indicating that fmax was probably above Nyquist. In these cases, we set fmax to the Nyquist frequency of 0.5 Hz for the maximum likelihood parameters, but still quote the actual recovered mean and standard deviation.

[Figure 3.1: three panels of A-channel amplitude versus time.]

Figure 3.1: Plots comparing the true waveforms with the waveforms for the various modes recovered by MULTINEST and included in our submission for the analysis of the challenge data set. (a) Source 1, (b) Source 2, and (c) Source 3.

[Figure 3.2: corner plot of 1D and 2D marginalised posteriors.]
Figure 3.2: Two-dimensional marginalised posteriors as recovered by MULTINEST in the search for the second of the cosmic string bursts in the MLDC challenge data set. The parameters, from top-to-bottom and left-to-right, are colatitude, longitude, burst time, burst amplitude, burst break frequency and waveform polarization. At the top of each column we also show the one-dimensional posterior for the column parameter.

                                Recovered SNR
Source  tc (s)     Channel  True SNR  Mode 1  Mode 2  Mode 3
1       1.6×10^6   A        41.0      41.2    41.0    N/A
                   E        14.5      14.5    14.6    N/A
2       1.1×10^6   A        30.7      30.7    N/A     N/A
                   E        13.9      13.9    N/A     N/A
3       6×10^5     A        18.8      18.9    18.5    18.4
                   E        36.9      36.7    37.1    36.8

Table 3.4: Signal-to-noise ratios for the three sources in the challenge data set. We show the true SNRs in the A and E channels, plus the SNRs for all of the modes we submitted for analysis.

selection of this type, although it relies on having two alternative hypotheses to compare. As an alternative to a cosmic string model, we consider a sine-Gaussian burst, which is a generic burst waveform commonly used in burst analysis for ground-based gravitational wave detectors [154]. We can use the evidence ratio to compare these two models when analysing a data stream containing a cosmic string burst or containing a sine-Gaussian burst. There are two stages to this analysis: 1) determining the signal-to-noise ratio at which we begin to be able to detect a signal of each type in a LISA data set; 2) determining the SNR at which the Bayesian evidence begins to favour the true model over the alternative.

3.1.3.1 Detection threshold

To determine the SNR needed for detection of a burst source with the MULTINEST algorithm, we generated a sequence of data sets containing a particular source and Gaussian instrumental noise generated using the theoretical noise spectral densities given in Equation (3.7), but varying the burst amplitude between the data sets to give a range of SNRs between 5 and 15.
Note that here and in the following we quote total SNRs for the A and E channels combined, ρ²_Total = ρ²_A + ρ²_E. Each data set was searched with MULTINEST and the global log-evidence of the data set, as computed by MULTINEST, was recorded. This was repeated for 8 different cosmic string sources and 9 different sine-Gaussian sources. The parameters for the cosmic string sources were taken from the parameters of the five training and three blind sources injected in the MLDC data sets, as these covered the parameter space of possible signals nicely. The important parameters of the sine-Gaussian model are the frequency and the burst width, so we considered three different values of each, and constructed waveforms for the nine possible combinations, while choosing a random value for the sky position and waveform polarisation in each case.

In Figure 3.3 we show the global log-evidence of the data as a function of the SNR of the injected signal for the cosmic string burst sources. Figure 3.4 shows the corresponding results for the sine-Gaussian bursts. The detection threshold for cosmic string bursts is at an SNR of 5–8 depending on the source parameters. Most of the sources have a significant log-evidence (greater than 3) at an SNR of 7, but the SNR required for detection is higher for the first training source and the second blind source. These are the two sources with the lowest values of the break frequency fmax, which makes the waveforms much smoother and simpler in the time domain. This may explain why it is more difficult to distinguish them from noise. The MLDC data sets had a prior for the source SNR, in a single Michelson channel, of ρ ∈ [10, 100], so these results suggest we would be able to detect any source drawn from the MLDC prior. The sine-Gaussian bursts require a slightly higher SNR for detection, of 6–9, but this increase in threshold is only of the order of 1 in SNR.
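The threshold determination described above amounts to reading off the injected SNR at which the global log-evidence first crosses a significance level (greater than 3 in the text); a sketch with illustrative (SNR, log-evidence) pairs:

```python
import numpy as np

def detection_threshold(snrs, log_evidences, threshold=3.0):
    """Interpolate the injected SNR at which the global log-evidence first
    exceeds the detection threshold (log-evidence > 3 in the text)."""
    snrs = np.asarray(snrs, dtype=float)
    logz = np.asarray(log_evidences, dtype=float)
    order = np.argsort(snrs)
    snrs, logz = snrs[order], logz[order]
    above = np.nonzero(logz > threshold)[0]
    if len(above) == 0:
        return None              # never detected over the scanned range
    i = above[0]
    if i == 0:
        return snrs[0]           # already detected at the lowest SNR
    frac = (threshold - logz[i - 1]) / (logz[i] - logz[i - 1])
    return snrs[i - 1] + frac * (snrs[i] - snrs[i - 1])
```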
The sine-Gaussian signals are in general much simpler in form than the cosmic strings and therefore it is perhaps unsurprising that it requires a higher SNR to distinguish them from noise. In this case, it was the sources with the highest frequency, fc = 0.49 Hz, that were most difficult to detect. However, since the Nyquist frequency of the data sets was 0.5 Hz, the difficulty of detection may have arisen because the signal was partially out of band.

3.1.3.2 Model selection

To explore model selection, we used MULTINEST to search the same data sets described above, but now using the alternative model, i.e., we searched the cosmic string data sets using the sine-Gaussian likelihood and vice versa. For sufficiently high SNR, in both cases the alternative model was able to successfully detect a signal in the data set. Typically, the best-fit sine-Gaussian signal to a cosmic string source has low Q and a frequency that matches the two lobes of the cosmic string burst. The parameter Q sets the number of cycles in the sine-Gaussian wave packet, and so a sine-Gaussian with Q ∼ 2 most closely resembles a cosmic string event, which typically has two cycles. This is illustrated for a typical case in Figure 3.5. Similarly, the best-fit cosmic string source to a sine-Gaussian signal matches the central two peaks of the sine-Gaussian waveform as well as possible. A typical case is shown in Figure 3.6.

Figure 3.3: The Bayesian log-evidence for the data set as a function of the SNR for eight different cosmic string cusp signals. These signals had the same parameters, other than amplitude, as the various sources used in the training and blind data sets of the MLDC. The labels in the key are consistent with labels in Table 3.3 and the solutions at [162].

Figure 3.4: As Fig. 3.3, but now showing results for nine different sine-Gaussian signals, with frequency, f, and width, Q, as shown.

[Figure 3.5: plot of A-channel amplitude versus time.]

Figure 3.5: Typical example of confusion when searching a cosmic string data set with the wrong model. The plot shows a comparison of the injected cosmic string signal to the best-fit signals found by MULTINEST using the cosmic string model as templates and using the sine-Gaussian model as templates.

From the results of these searches it is possible to construct the evidence ratio of the two models for each of the data sets. In Figure 3.7 we show the ratio of the Bayesian evidence for the cosmic string model to that of the sine-Gaussian model when searching the cosmic string data sets. We see that for some signals the evidence ratio starts to significantly favour the true model, i.e., the cosmic string, at an injected SNR of ∼ 5, which is the point at which we first start to be able to detect the cosmic string burst at all. However, for the two low frequency sources, training source 1 and blind source 2, the evidence ratio only starts to favour the true model at SNR ∼ 11 or higher. This partially reflects the higher SNR required to detect the source, but also includes the confusion caused by the signal very closely resembling a sine-Gaussian.
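The evidence-ratio comparison in this section reduces to differencing the global log-evidences returned under each model; a sketch, where the decision threshold of 3 in log-evidence is an illustrative choice matching the significance level used for detection above:

```python
def compare_models(logz_string, logz_sine_gaussian, strong=3.0):
    """Difference of global log-evidences gives the log Bayes factor for
    the cosmic string model over the sine-Gaussian model."""
    log_bayes = logz_string - logz_sine_gaussian
    if log_bayes > strong:
        verdict = "cosmic string favoured"
    elif log_bayes < -strong:
        verdict = "sine-Gaussian favoured"
    else:
        verdict = "inconclusive"
    return log_bayes, verdict
```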
We conclude that when a cosmic string burst is loud enough to be detected, then the evidence favours the interpretation of the event as a cosmic string burst, with lower frequency sources requiring additional SNR. Since MLDC sources have an SNR greater than 10 in each Michelson channel, they should be clearly distinguishable from sine-Gaussian bursts.

[Figure 3.6: plot of A-channel amplitude versus time.]

Figure 3.6: As Figure 3.5, but we now show a typical confusion example for searches of the sine-Gaussian data sets. We compare the injected sine-Gaussian signal to the best-fit signals recovered by MULTINEST using both the cosmic string and the sine-Gaussian models.

In Figure 3.8 we show the ratio of the evidence of the sine-Gaussian model to that of the cosmic string model when searching the data sets containing sine-Gaussian signals. These are distinguishable from cosmic strings at SNRs of ∼ 6–9. This slightly higher SNR required to correctly choose the sine-Gaussian model reflects the fact that we need a somewhat higher SNR to detect the sine-Gaussians in the first place. The only case for which the evidence ratio does not begin to favour the sine-Gaussian model at the point where the source becomes detectable is the case with f = 0.002 Hz and Q = 2. This is a sine-Gaussian signal with only two smooth oscillations, and so it does look rather like a low frequency cosmic string event. Even in that case, the evidence begins clearly to favour the correct model for SNRs of ∼ 12 and higher.

Figure 3.7: The ratio of the Bayesian evidence for the cosmic string model to that of the sine-Gaussian model when searching a data set containing a cosmic string burst source.
We show the Bayesian evidence ratio as a function of signal SNR for a variety of different cosmic string sources.

Figure 3.8: As Fig. 3.7, but now showing the ratio of the Bayesian evidence for the sine-Gaussian model to that of the cosmic string model when searching a data set containing a sine-Gaussian burst source.

3.1.4 Discussion

We have considered the use of the multi-modal nested sampling algorithm MULTINEST for detection and characterisation of cosmic string burst sources in LISA data. As a search tool, the algorithm was able successfully to find the three cosmic string bursts that were present in the MLDC challenge data set. These sources were correctly identified in the sense that the full signal-to-noise ratio of the injected source was recovered, and a posterior distribution for the parameters obtained. The maximum likelihood and maximum a-posteriori parameters were not particularly close to the true parameters of the injected signals, but this was a consequence of the intrinsic degeneracies in the cosmic string model parameter space, and in all cases the true parameters were consistent with the recovered posterior distributions.

In controlled studies, we found that the SNR threshold required for detection of the cosmic string bursts was ∼ 5–8, depending on the burst parameters. Bursts with a low break frequency require a higher SNR to detect than those with high break frequencies. We also explored the detection of sine-Gaussian bursts and in that case the SNR required for detection was slightly higher, being typically 6–9, with sources having frequency close to Nyquist being more difficult to detect.

MULTINEST is designed to evaluate the evidence of the data under a certain hypothesis, and this can be used to compare possible models for the burst sources. LISA may detect bursts from several different sources, and it is important for scientific interpretation that the nature of the burst be correctly identified.
We used the Bayesian evidence as a tool to choose between two different models for a LISA burst source: the cosmic string model and the sine-Gaussian model, which was chosen to represent a generic burst. The Bayesian evidence works very well as a discriminator between these two models. The evidence ratio begins to clearly favour the correct model over the alternative at the same SNR at which the sources become loud enough to detect in the first place.

The usefulness of MULTINEST as a search tool in this problem is a further illustration of the potential utility of this algorithm for LISA data analysis, as previously demonstrated in a search for non-spinning SMBH binaries [107]. Other algorithms based on Markov Chain Monte Carlo techniques have also been applied to the search for cosmic strings [157]. Both approaches performed equally well as search tools in the last round of the MLDC. MULTINEST was not designed primarily as a search algorithm, but as a tool for evidence evaluation, and this work has demonstrated the utility of the Bayesian evidence as a tool for model selection in a LISA context. Other problems where the evidence ratio approach could be applied include choosing between relativity and alternative theories of gravity as explanations for the gravitational waves observed by LISA, or choosing between different models for a gravitational wave background present in the LISA data set. The Bayesian evidence was previously used in a LIGO context as a tool to choose between alternative theories of gravity [114] and in a LISA context to distinguish a data set containing a source from one containing purely instrumental noise [153].

In the context of interpretation of LISA and NGO burst events, what we have considered here is only part of the picture. We have shown that we are able to correctly choose between two particular models for a burst, and this can easily be extended to include other burst models.
However, LISA/NGO might also detect bursts from unmodelled sources. In that case, algorithms such as MULTINEST which rely on matched filtering would find the best-fit parameters within the model space, but a higher intrinsic SNR of the source would be required for detection. In such a situation, we would like to be able to say that the source was probably not from a model of a particular type, e.g., not a cosmic string burst. There are several clues which would provide an indication that this was the case. The sine-Gaussian model is sufficiently generic that we would expect it, in general, to provide a better match to unmodelled bursts than the cosmic string model, which has a very specific form. Therefore, we could say that if the evidence ratio favoured the cosmic string model over the sine-Gaussian model it was highly likely that the burst was in fact a cosmic string and not something else. Similarly, if we found that several of the alternative models had almost equal evidence, but the SNR was quite high, it would be indicative that the burst was not described by any of the models. We have seen that at relatively moderate SNRs, when the signal is described by one of the models, the evidence clearly favours the true model over an alternative. If we found that two models gave almost equally good descriptions of the source, it would suggest that the burst was not fully described by either of them. A third clue would come from the shape of the posterior for the source parameters. The cosmic string waveform space contains many degeneracies, but these can be characterised theoretically for a given choice of source parameters. If the signal was not from a cosmic string, we might find that the structure of the posterior was modified. Finally, some techniques have been developed for the Bayesian reconstruction of generic bursts [163] which could also be applied in a LISA/NGO context.
The usefulness of these various approaches can be explored further by analysing data sets into which numerical supernovae burst waveforms have been injected. While the necessary mass of the progenitor is probably unphysically high for a supernova to produce a burst in the LISA/NGO frequency band, such waveforms provide examples of unmodelled burst signals on which to test analysis techniques. The final LISA/NGO analysis will employ a family of burst models to characterize any detected events. The work described here demonstrates that the Bayesian evidence will be a useful tool for choosing between such models, and MULTINEST is a useful tool for computing those evidences.

3.2 Galactic White Dwarf Binaries

The Milky Way Galaxy has a population of order tens of millions of objects that can be classified as compact binaries. These include separated binaries of two objects that interact only gravitationally and interacting binaries which have mass transfer from one object to the other. Separated binaries may consist of two white dwarfs, two neutron stars, or one of each. Types of interacting binaries include AM CVn stars (a white dwarf accreting mass from a companion) and ultra-compact X-ray binaries (a neutron star accreting mass from a companion). Mock LISA Data Challenge 3.1 [150] uses a population model which is described by [164]. It consists of ∼ 26 million detached binaries and ∼ 34 million interacting binaries, each modeled as two point masses in circular orbit with increasing or decreasing orbital frequency depending on the dominant process (gravitational radiation or mass transfer, respectively).

3.2.1 Data Model

The plus and cross polarizations of the binary signal are given by [150]

h+(t) = A (1 + cos²ι) cos[2π(f t + ḟ t²/2) + φ0],   (3.11)
h×(t) = −2A (cos ι) sin[2π(f t + ḟ t²/2) + φ0],   (3.12)
A = 4(GM)^(5/3) (π f)^(2/3) / (c⁴ D_L),   (3.13)
where M is the chirp mass as defined before, ḟ is the constant frequency derivative, φ0 is the phase at t = 0, and ι is the inclination of the binary to our line of sight. A fast-slow decomposition of the LISA orbital motion can also be used to simulate the detector phase measurements in the frequency domain, as described in [165].

This work was performed as an extension to that done in [166]. In that paper, the authors use an F-statistic to analytically maximise over most amplitude and phase factors and then perform a template bank search followed by parameter refinement with the Nelder-Mead optimisation algorithm [167]. The search is done in the four-dimensional space defined by the initial frequency, f, frequency drift, ḟ, and sky latitude and longitude, β and λ. The challenge data set was broken into 0.1 mHz frequency bands to simplify the analysis; each band was separated with Butterworth bandpass filters as described in Section V of [166]. In the initial analysis for each frequency band, the loudest signal is subtracted and then the search is performed again until no signal can be identified. This process is able to handle the large number of signals present in the data, especially below ∼ 7 mHz, but can be very time-consuming at higher frequencies where larger template banks are needed to ensure a minimum correlation with any signals present. We therefore aimed to use the ability of MULTINEST to identify multiple signals present in each band and evaluate local Bayesian evidences and a posterior maximum for each.

Each bandpassed frequency range was analysed for signals with frequencies in the centre of the band. The prior on f was U[fc − 0.06, fc + 0.06] mHz. Since the bandpasses were performed at 0.1 mHz intervals, this allowed for 0.01 mHz of overlap on either side while not searching too close to the edges of the filter.
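The signal model of Equations (3.11)-(3.13) can be sketched directly; the constants are approximate SI values and the helper names are illustrative.

```python
import numpy as np

G = 6.674e-11   # gravitational constant, m^3 kg^-1 s^-2
C = 2.998e8     # speed of light, m s^-1

def wd_polarisations(t, A, f, fdot, phi0, iota):
    """h_+ and h_x of Equations (3.11)-(3.12) for a slowly chirping binary."""
    phase = 2.0 * np.pi * (f * t + 0.5 * fdot * t ** 2) + phi0
    h_plus = A * (1.0 + np.cos(iota) ** 2) * np.cos(phase)
    h_cross = -2.0 * A * np.cos(iota) * np.sin(phase)
    return h_plus, h_cross

def amplitude(chirp_mass_kg, f, d_l_m):
    """Equation (3.13): A = 4 (G M)^(5/3) (pi f)^(2/3) / (c^4 D_L)."""
    return (4.0 * (G * chirp_mass_kg) ** (5.0 / 3.0)
            * (np.pi * f) ** (2.0 / 3.0) / (C ** 4 * d_l_m))
```

At t = 0 with φ0 = 0 and a face-on binary (ι = 0), h_+ reduces to 2A and h_× to zero, a quick sanity check on the implementation.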
The frequency drift, ḟ, for detached binaries evolving via gravitational radiation is given by

ḟ = (24 / 5π) (GM / 2c³)^(5/3) (2π f)^(11/3).   (3.14)

Therefore, in order to set prior limits on this parameter, we assumed a uniform prior over [−0.25 ḟmax, ḟmax], where ḟmax is calculated with a maximum chirp mass of 1.5 M⊙ at the highest frequency in the prior. A minimum ḟ of −1/4 times the maximum is used because, as [165, 168] describe, contact binaries with mass transfer are driven to longer periods with similar frequency evolution to Equation (3.14), but with opposite sign and slightly lower magnitude. We have the benefit of knowing the true signals present in the data, and these prior bounds were confirmed to include them all. The sky position prior is taken to be uniform over the sphere: uniform in λ ∈ [0, 2π] and uniform in cos β ∈ [−1, 1]. The noise in each TDI channel can be taken as approximately constant over the frequency band and is calculated at the central frequency using the formulas provided in [166].

3.2.2 Results

We searched 71 frequency ranges, spanning 4.99–12.11 mHz. The density of signals is much greater at lower frequencies, so we expect more modes of the posterior to be identified in those bands. Modes were clustered by starting frequency only. Once the analyses were complete, the posteriors were evaluated to determine how many signals were found and with what parameter values. This was done by collecting all posterior modes with local log-evidences above 100. If any two were found to have frequencies within 0.16 µHz and frequency derivatives within 2.6×10^−14 Hz/s, only the one with larger evidence was considered. These constraints, used to decide that two candidates are potentially the same signal, are determined by 10/Tobs and 100/Tobs^2, where Tobs is the total two-year observation time. This served to eliminate potential false detections and secondary maxima.
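The prior bounds on ḟ follow from evaluating Equation (3.14) at the maximum chirp mass of 1.5 M⊙ and the top of the frequency prior; a sketch with approximate SI constants:

```python
import math

G = 6.674e-11     # gravitational constant, m^3 kg^-1 s^-2
C = 2.998e8       # speed of light, m s^-1
MSUN = 1.989e30   # solar mass, kg

def fdot_gw(chirp_mass_kg, f):
    """Equation (3.14): fdot = (24/5pi) (G M / 2c^3)^(5/3) (2 pi f)^(11/3)."""
    return ((24.0 / (5.0 * math.pi))
            * (G * chirp_mass_kg / (2.0 * C ** 3)) ** (5.0 / 3.0)
            * (2.0 * math.pi * f) ** (11.0 / 3.0))

def fdot_prior_bounds(f_max_hz, mc_max=1.5 * MSUN):
    """Uniform prior limits [-0.25 fdot_max, fdot_max] used in the text,
    with fdot_max evaluated at the top of the frequency prior."""
    fdot_max = fdot_gw(mc_max, f_max_hz)
    return -0.25 * fdot_max, fdot_max
```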
Maximum likelihood parameter values were saved for each detected signal; very narrow modes meant that these did not differ significantly from maximum posterior values. In total, 754 signals were identified in the 4.99–12.11 mHz range this way.

Using the criteria from [166] to determine the true/false positive rate for these detections, we find that the majority of signals found are true. The frequency resolution is given by the width of bins in Fourier space, ∆f = 1/Tobs ≈ 16 nHz. The resolution in the frequency drift is the difference at which, over the total observation time, the frequency will drift by the resolution in frequency; this is given by ∆ḟ = 1/Tobs^2 ≈ 2.5×10^−16 Hz/s. We therefore require our reported solutions to be correct within the resolution possible. However, in this frequency range we find only about one quarter of the 2924 total signals present. At the higher frequency bands we find all or nearly all of the signals, but towards lower frequencies the recovery decreases as more signals add to the confusion. Figure 3.9 shows the number of true signals in the MLDC 3.1 key (bright signals that can be measured above the background) as well as the number of reported and correct detections in each frequency band.

Figure 3.9: Number of true, recovered, and correctly detected signals in each frequency band analysed.

Complementary to this, Figure 3.10 shows the purity and completeness of the reported detections from the analysis of each frequency band. The purity is the percentage of reported detections that are correct and the completeness is the percentage of all true signals that are correctly detected. The purity is roughly consistent across all frequency bands, staying above 80% most of the time and only once dipping just below 70%. Completeness is very poor at low frequencies, with values as low as 10%. However, at frequencies above 10 mHz most signals present are recovered.
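The duplicate-rejection rule described above (frequencies within 10/Tobs and frequency drifts within 100/Tobs^2, keeping the candidate of larger local evidence) can be sketched as follows; the candidate-list format is an illustrative assumption.

```python
def deduplicate(candidates, t_obs=2 * 365.25 * 86400):
    """Merge candidate detections: any two with frequencies within 10/T_obs
    and frequency drifts within 100/T_obs^2 are treated as the same signal,
    keeping only the one with the larger local log-evidence. Each candidate
    is an illustrative (f, fdot, log_evidence) tuple."""
    df_tol = 10.0 / t_obs
    dfdot_tol = 100.0 / t_obs ** 2
    kept = []
    # visit candidates in order of decreasing evidence, so the strongest
    # member of each duplicate group is always the one retained
    for f, fdot, logz in sorted(candidates, key=lambda c: -c[2]):
        if all(abs(f - kf) > df_tol or abs(fdot - kfd) > dfdot_tol
               for kf, kfd, _ in kept):
            kept.append((f, fdot, logz))
    return kept
```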
Values of f and ḟ for true signals present in the data versus those recovered are shown in Figure 3.11. The first plot shows all signals within the 4.99–12.11 mHz band, while the second and third plots show the upper and lower 0.5 mHz ranges, respectively. We can compare in these last two the relative density of signals and the effect that it has on the completeness of the detections and the rate of false positives (indicated by a red ×). In the range of less dense signals we not only recover a larger fraction of the signals present, but also with smaller errors and, in this case, no false positives.

Figure 3.10: The purity and completeness of detections for all frequency bands analysed. The purity is the percentage of reported detections that are correct and the completeness is the percentage of all true signals that are correctly detected.

For all correctly recovered signals we can consider the recovery errors. Figure 3.12 plots histograms of the difference between the reported values of the measured parameters and their true values. f and ḟ are reported in units of the resolution size. We find that f and ḟ are recovered to high precision, with most values being correct to within a fraction of the resolution. This is possible because the signals will not be entirely in one frequency bin over the entire observation period, and the relative magnitudes can provide this extra resolution. In addition, the sky location is found to good precision, with most errors much less than a radian.

3.2.3 Discussion

This analysis has shown that MULTINEST, and nested sampling in general, is a useful tool for detecting multiple signals present in a single data set. Even in the presence of a much larger number of signals causing confusion, the purity of recovered signals is not adversely affected. This will be very useful for analysis of eLISA/NGO data that will have many white dwarf binary signals present.
In order to improve the purity and completeness of these detections, it will likely be necessary to implement a multi-stage analysis where the loudest detected signals are subtracted and the residual data re-analysed for more signals. This type of pipeline has been tested for white dwarf binary sources for LISA [169]. At lower frequencies it may also be necessary to use narrower frequency bands for analysis to lower the number of signals present. The status of detection at this time is very promising, as this analysis may be performed very rapidly and in parallel to detect multiple signals at a time.

Figure 3.11: Plots comparing the values of f and ḟ for true signals present in MLDC 3.1 challenge data in the 4.99–12.11 mHz range. True positive detections are indicated by a blue +, false positives by a red ×. (a) Displays the entire frequency range, (b) the upper 0.5 mHz, and (c) the lower 0.5 mHz. Dashed green lines indicate the maximum and minimum for the ḟ prior.

Figure 3.12: The error in recovered parameters for correctly detected signals. f and ḟ are reported in terms of the resolution size.

3.3 Cosmic Strings in the Presence of a Galactic Foreground

When a space-based GW detector is eventually launched, it will observe many signals from various sources all at the same time. As considered in the previous section, 3.2, there may be a foreground of galactic white dwarf binary signals. The impact of this will depend on the final sensitivity of the eLISA/NGO detector. Here we consider the data from the fourth round of the Mock LISA Data Challenge in an attempt to recover cosmic string bursts in the presence of no galaxy, a reduced galactic component, and the full galaxy foreground. The data generation procedure is described in [151].
With the training data key file for MLDC 4, we generated data sets containing the cosmic strings, instrumental noise, and one of three options for the level of inclusion of galactic white dwarf (WD) binaries. At the lowest level, no WDs are included. At the middle level, a confusion foreground of millions of WD binaries that cannot be separated from each other is added. These have an approximate power spectral density given by Equation (3.15), which was calculated in [170] (Equation 14) for the specific model of the galactic stellar population used in MLDC 4:

\[
S_{\rm conf}(f) =
\begin{cases}
10^{-44.62}\, f^{-2.3} & 10^{-4.0} < f \le 10^{-3.0} \\
10^{-50.92}\, f^{-4.4} & 10^{-3.0} < f \le 10^{-2.7} \\
10^{-62.80}\, f^{-8.8} & 10^{-2.7} < f \le 10^{-2.4} \\
10^{-89.68}\, f^{-20.0} & 10^{-2.4} < f \le 10^{-2.0} \\
0 & \text{else}
\end{cases}
\ \ {\rm m^2\,Hz^{-1}}.
\tag{3.15}
\]

Finally, at the highest level, the full WD galaxy population, including the confusion foreground and thousands of louder, separable binaries, is added.

3.3.1 Data and Signal Models

The full MLDC 4 data spanned approximately two years and was sampled at a 1.875 s cadence, yielding 2^25 total samples. Cosmic string signals were injected at times randomly sampled from a uniform distribution over the entire span, with the number of signals drawn from a Poisson distribution with a mean of 20 (22 were actually injected). In order to search the data efficiently for these sparse signals, we segmented it into stretches 61440 s long with 15360 s of overlap (each start was 46080 s after the previous), to avoid missing signals near the edges. Additionally, we applied a Tukey window to each segment to avoid edge effects in the Fourier transform; the Tukey window was used in place of the Welch window implemented in Section 3.1.2 because, with the given overlap, it ensures that no data point is windowed by more than 50% (the minimum overlap from the Welch window was ∼ 25%). There were 1366 segments in total for each data set. The instrumental noise model was the same as in MLDC 3, so Equations (3.5), (3.6), and (3.7) were used.
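The piecewise confusion PSD of Equation (3.15), together with the unit-conversion factor of Equation (3.16) used below, can be evaluated directly. This is a Python sketch for illustration, not the analysis code used for the challenge.

```python
import math

def s_conf(f):
    """Galactic WD confusion-foreground PSD of Equation (3.15), in m^2 Hz^-1."""
    bands = [
        (10**-4.0, 10**-3.0, 10**-44.62, -2.3),
        (10**-3.0, 10**-2.7, 10**-50.92, -4.4),
        (10**-2.7, 10**-2.4, 10**-62.80, -8.8),
        (10**-2.4, 10**-2.0, 10**-89.68, -20.0),
    ]
    for lo, hi, amp, slope in bands:
        if lo < f <= hi:
            return amp * f**slope
    return 0.0   # zero outside the tabulated bands

def s_conf_prime(f, t_l):
    """Equation (3.16): convert S_conf to the units of the instrumental noise."""
    return 64.0 * math.pi**2 * f**2 * t_l**2 * math.sin(2.0 * math.pi * f * t_l)**2 * s_conf(f)
```

The band edges fall at the decade points quoted in the text, so frequencies above 10^-2 Hz or below 10^-4 Hz contribute no confusion noise.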
Additionally, for the data sets that included either the reduced or the full galaxy population, we added the galactic foreground confusion noise given by Equation (3.15). Values from Equation (3.15) were further multiplied by a factor to bring them into the same units as the instrumental noise and our data values:

\[
S'_{\rm conf}(f) = 64\pi^2 f^2 t_L^2 \sin^2(2\pi f t_L) \times S_{\rm conf}(f).
\tag{3.16}
\]

3.3.2 Comparison of Results

3.3.2.1 No Galaxy

In this data set, only instrumental noise and cosmic strings were present. MULTINEST was run on each segment of data, and modes found with a local log-evidence of more than 7 were recorded as potential detections; this limit was chosen based on the analysis in Section 3.1.3.1. Additionally, the global log-evidence from each segment was recorded. Of the 1366 segments, 1339 contained no GW signals; the 22 signals were present in 27 of the segments due to overlap. When compiling the list of detected signals, any two within 998 s of each other were considered to be the same signal; this time was chosen as it is approximately the light travel time across the diameter of the Earth's orbit, along which LISA would travel. If a signal was found twice, the mode with the higher log-evidence was chosen to give the evidence and parameters of the detection. This always selected the segment in which the signal was further from the edge and thus suffered less from windowing. Each signal was recovered to high accuracy, with log-evidences of 70 or greater. In Figure 3.13 we plot the receiver-operator curve (ROC) obtained by using the log-evidence as a detection threshold, together with the true and false positive rates as a function of this threshold. The ROC is perfect, indicating that there is a clear detection threshold for these signals in the presence of only instrumental noise and that all signals pass it.
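The rule for merging duplicate detections from overlapping segments (any two within 998 s are the same signal; keep the higher log-evidence) can be sketched as follows. This is a toy sketch; the real analysis compares full posterior modes rather than bare (time, ln Z) pairs.

```python
def merge_detections(detections, window=998.0):
    """Merge detections of the same burst found in overlapping segments.

    `detections` is a list of (time, ln_z) pairs. Any two detections within
    `window` seconds are treated as one signal, and the one with the higher
    log-evidence is kept, following the rule described in the text.
    """
    merged = []
    for t, ln_z in sorted(detections):
        if merged and t - merged[-1][0] < window:
            if ln_z > merged[-1][1]:          # keep the louder recovery
                merged[-1] = (t, ln_z)
        else:
            merged.append((t, ln_z))
    return merged
```

Keeping the louder recovery automatically prefers the segment where the signal sits further from the window's edge, as noted above.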
Figure 3.14 shows a histogram of the global log-evidences from noise-only segments and a plot of the same log-evidences for each segment versus time. The distribution of log-evidences follows an inverse-χ2 distribution and there is no time dependence, as expected. All segments that contained only noise had log-evidences of 4 or less, with 98.5% having ln(Z) < 0 and thus supporting the noise-only model.

Figure 3.13: (Top) ROC for cosmic strings in instrumental noise, showing the true positive rate vs. the false positive rate as the threshold varies. (Bottom) The true and false positive detection rates as a function of the log-evidence. ln(Z) is adjusted by adding a constant to all values so that the minimum is greater than 1.

Figure 3.14: (Top) Histogram of global log-evidences for the segments that do not contain a cosmic string GW signal. These fit an inverse-χ2 distribution and 98.5% have ln(Z) < 0. (Bottom) The log-evidence values of noise-only segments over time (segment number). No time dependence is present.

Figure 3.15: (Top) Histogram of global log-evidences for the segments that do not contain a cosmic string GW signal with the reduced galaxy population. (Bottom) The log-evidence values of noise-only segments over time (segment number). There is a clear time dependence as the detector's sensitivity sweeps over the galaxy.

3.3.2.2 Reduced Galaxy

The second data set analysed contained the reduced population of galactic binaries in addition to instrumental noise. MULTINEST was run, and we recorded detections and evidences as before. Figure 3.15 shows the histogram and time-dependent plot of log-evidences from segments that were known not to contain signals.
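The threshold sweep behind ROCs such as those of Figures 3.13 and 3.18 can be sketched directly from two lists of global log-evidences. This is a toy function for illustration, not the thesis code.

```python
def roc_points(signal_ln_z, noise_ln_z):
    """Sweep a ln(Z) detection threshold and return (FPR, TPR) pairs.

    `signal_ln_z` are global log-evidences of segments containing a signal,
    `noise_ln_z` those of noise-only segments.
    """
    thresholds = sorted(set(signal_ln_z) | set(noise_ln_z))
    points = []
    for thr in thresholds:
        tpr = sum(z >= thr for z in signal_ln_z) / len(signal_ln_z)
        fpr = sum(z >= thr for z in noise_ln_z) / len(noise_ln_z)
        points.append((fpr, tpr))
    return points
```

A "perfect" ROC in the sense used here is one that reaches the corner (FPR, TPR) = (0, 1): some threshold accepts every signal segment while rejecting every noise-only segment.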
The histogram does not follow the inverse-χ2 distribution expected if Equation (3.15) correctly accounted for the presence of the galactic foreground confusion noise. We can also see in the time-dependent plot that the distribution of ln(Z) is not time-independent. This time dependence of the magnitude of the foreground confusion noise is due to the varying sensitivity of the LISA detector to signals coming from the galaxy over the course of the simulated mission. To account for this, the foreground confusion noise was multiplied by a time-varying factor that applies the necessary modulation and increase in PSD when the detector is most sensitive to the galaxy. The new confusion noise is given, up to an unimportant normalisation factor, by

\[
S''_{\rm conf} = \left(1 + \sin^2\!\left(\frac{2\pi\, t_m}{1\ {\rm year}}\right)\right) \times S'_{\rm conf},
\tag{3.17}
\]

where t_m is the time of the mid-point of a data segment. Figure 3.16 shows the power spectral density from a data segment containing no cosmic string signal, only instrumental and reduced galactic foreground confusion noise; the instrumental and instrumental-plus-reduced-galaxy confusion noise models are plotted on top for comparison.

Figure 3.16: Power spectral density of a data segment containing only instrumental and reduced galaxy foreground confusion noise. The instrumental and instrumental plus reduced galaxy noise models are plotted on top.

The analysis was re-run with this new model for the galactic foreground, and the resulting histogram and time-dependent plot of ln(Z) from segments without a cosmic string signal now have the features we expect, as seen in Figure 3.17. The time dependence is nearly gone, and what remains does not significantly affect further results. The ROC, along with the true and false positive rates as a function of detection threshold, is shown in Figure 3.18. Again the ROC is perfect and the detection threshold set is consistent with the results from Section 3.1.3.1.

Figure 3.17: (Top) Histogram of global log-evidences for the segments that do not contain a cosmic string GW signal with the reduced galaxy population and modulated foreground noise model. These fit the correct distribution much more closely and 99.6% have ln(Z) < 0. (Bottom) The log-evidence values of noise-only segments over time (segment number). The time dependence is much less significant.

Figures 3.19 and 3.20 show the true and recovered waveforms in the A and E channels for a selection of four of the 22 signals (true in solid black, recovered in dashed red). Differences from the true signal are a result of the addition of instrumental and galactic foreground noise, as well as of windowing for signals that were nearer to the edge of a segment. Low-frequency signals, such as signals 2 and 6 in Figure 3.19, were often recovered with higher precision due to their simpler structure and the fact that the entire signal lies within the frequency sensitivity of the detector. Signals with higher fmax show more discrepancies, as can be seen for signals 3 and 14 in Figure 3.20.

Figure 3.18: (Top) ROC for cosmic strings in instrumental and galactic foreground confusion noise, showing the true positive rate vs. the false positive rate as the threshold varies. (Bottom) The true and false positive detection rates as a function of the log-evidence. ln(Z) is adjusted by adding a constant to all values so that the minimum is greater than 1.

Figure 3.19: A comparison of the true and recovered waveforms in the A and E channels for two of the 22 signals. The true signal is in solid black and the recovered in dashed red. These signals have a low value of fmax.

3.3.2.3 Full Galaxy

The final data set analysed consisted of instrumental noise, the full galaxy population of white dwarf binaries, and the cosmic strings we were searching for.
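The annual modulation of Equation (3.17) is straightforward to apply per segment. A minimal sketch, assuming a year length in seconds (the exact value used in the thesis analysis is not stated here):

```python
import math

YEAR_S = 3.15581e7  # one year in seconds (assumed value for this sketch)

def modulated_conf_psd(s_conf_prime_val, t_mid):
    """Equation (3.17): modulate the confusion PSD S'_conf by the detector's
    time-varying sensitivity to the galaxy; t_mid is the segment mid-time."""
    return (1.0 + math.sin(2.0 * math.pi * t_mid / YEAR_S) ** 2) * s_conf_prime_val
```

Because the factor is 1 + sin², it oscillates between 1 and 2 with a half-year period, boosting the assumed noise level when the detector sweeps across the galaxy.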
The modulated foreground confusion noise model given in Equation (3.17) was used in this analysis, which was performed in the same way as for the previous two data sets. In this scenario, we do not expect the noise model to account for all contaminating sources, and thus we will be presented with many false detections.

Figure 3.20: A comparison of the true and recovered waveforms in the A and E channels for two of the 22 signals. The true signal is in solid black and the recovered in dashed red. These signals have a high value of fmax.

Due to the presence of loud signals across all frequencies in each data segment, evidences for all segments were significantly increased. In Figure 3.21 we show the histogram and time scatter plot of log-evidences of segments that do not contain a cosmic string GW signal. There is clearly a much larger base level that will obscure the finding of true signals.

Figure 3.21: (Top) Histogram of global log-evidences for the segments that do not contain a cosmic string GW signal with the full galaxy population and modulated foreground noise model. (Bottom) The log-evidence values of noise-only segments over time (segment number).

The ROC, along with the true and false positive rates for detecting segments with signals present, is shown in Figure 3.22. It is clear that some level of contamination will be present if we wish to set a threshold that does not exclude true cosmic string signals.

3.3.2.4 Summary

The results presented in this section clearly show that perfect identification and excellent recovery of cosmic string burst signals is possible in the presence of instrumental noise and a reduced galactic foreground. The population of galactic WD binaries with the loudest signals removed forms a confusion foreground that, when appropriately modeled, can be accounted for and will not affect detection of cosmic string bursts. However, when the full WD population is left in the data, there is significant contamination, and keeping this to a manageable level will result in potentially missing some cosmic string signals. Spurious signals may be identified by comparison to either a sine-Gaussian or a constant-frequency model, using the Bayesian evidence to determine whether one of these is preferred over the cosmic string burst model.

Figure 3.22: (Top) ROC for cosmic strings in instrumental noise with the full galactic population, showing the true positive rate vs. the false positive rate as the threshold varies. (Bottom) The true and false positive detection rates as a function of the log-evidence. ln(Z) is adjusted by adding a constant to all values so that the minimum is greater than 1.

This demonstrates the need to identify and subtract from the data the loudest WD binaries detected by LISA, or any similar detector, in order to be able to detect other signals underneath. Doing so should not be affected by cosmic strings still present in the data, since the latter are transient while the former are present over the entire two-year data span. A main caveat is that this analysis was performed without the presence of SMBH or EMRI signals, which are transient but significantly longer (typically of duration three months or more) and potentially much louder than any cosmic string burst. SMBH signals can have SNRs in the thousands when coalescing during the observation period; EMRIs, by contrast, build up an SNR of similar magnitude to cosmic strings (typically 10 to 100), but over a duration of months. Detection and removal of these sources in the presence of all others will need to be performed as well. We have shown, however, that cosmic string bursts can be detected even in the presence of the full galactic WD population and that the Bayesian evidence can be used as a detection threshold.
Like any detection statistic, some false positives may be included and will require further analysis to be filtered out.

3.4 Discussion

The eLISA/NGO detector that is eventually launched will be able to observe multiple signals that evolve on very different time scales. From bursts that last only a few minutes, to supermassive black hole binaries that coalesce over months, to galactic white dwarf binaries that are essentially monochromatic over a two-year observation, these signals will need to be detected and characterised from a single data stream. Data analysis methods investigated here that utilise nested sampling will be essential in extracting as much information as possible from the data.

Segmenting the data to look for transients will enable us to separate these short-lived signals from the more constant sources that require time to build up a significant SNR. Modeling other sources present in the data as part of the noise estimation will further improve our ability to recover the signals we are looking for. In searching for cosmic string burst signals with a reduced WD binary foreground, the quality of parameter estimation was not adversely affected, as the additional sources were well modeled by the noise PSD and the signals of interest were of sufficiently high SNR. Analysing frequency bands for near-monochromatic signals such as white dwarf binaries can yield many detections that will not be influenced by noise at other frequencies. Subtracting the loudest signals and re-analysing should return a high rate of detection. For data with other signals present, analysing the two years of observation separately and requiring coherent detection should reduce the effect of transient sources passing through the frequency bands of interest. The source environment for eLISA/NGO will require many diverse techniques for separating out the different signal types, which can also vary over orders of magnitude in SNR.
Some of these methods have been explored here and present promising results for future data analysis pipeline development.

Chapter 4: An Investigation into the MOPED Algorithm

Science is what we understand well enough to explain to a computer. — Donald Knuth

The analysis of gravitational wave data involves measuring a signal that is given by thousands of data points but is defined by only 17 or fewer individual parameters. This large number of data values slows down likelihood calculations and requires large amounts of computer memory. Additionally, the estimation and inversion of noise covariance matrices demands increased computational resources. Simplifying assumptions are used to reduce these requirements, but removing the need for them analytically, without losing information, would be preferable. With this goal in mind, I investigated implementing the Multiple Optimised Parameter Estimation and Data compression (MOPED) algorithm for use in gravitational wave burst searches for LISA.

I begin this chapter by introducing the general likelihood function and a simple method of decreasing its complexity in Section 4.1. I then introduce MOPED in Section 4.2 and define the MOPED likelihood function, along with comments on the potential speed benefits of MOPED. In Section 4.3 I introduce my astrophysical scenario of gravitational wave burst signals, where we found that MOPED did not accurately portray the true likelihood function. In Section 4.4 I expand upon this scenario to another where MOPED is found to work, and to two other scenarios where it does not. I present a discussion of the criteria under which I believe MOPED will accurately represent the likelihood in Section 4.5, as well as a discussion of an implementation of a solution provided by [171]. Work presented in this chapter has been published in [172].

4.1 The Likelihood Function

We begin by defining our data as a vector, x.
Our model describes x as a signal plus random noise,

\[
x = u(\theta_T) + n(\theta_T),
\tag{4.1}
\]

where the signal is given by a vector u(θ) that is a function of the set of parameters θ = {θ_i} defining our model, and the true parameters are given by θ_T. The noise is assumed to be Gaussian with zero mean and noise covariance matrix N_{jk} = ⟨n_j n_k⟩, where the angle brackets indicate an ensemble average over noise realisations. In general this matrix may also be a function of the parameters θ, as in the case of the galaxy spectra explored in [173], where the noise in a frequency bin depends on the amplitude of the signal in that bin; however, in the examples explored in this chapter we will not be considering this fully general case in a GW context. The full likelihood for N data points in x is given by

\[
L(\theta) = \frac{1}{(2\pi)^{N/2}\sqrt{|\mathbf{N}(\theta)|}} \exp\left\{-\frac{1}{2}\,[x-u(\theta)]^T\, \mathbf{N}^{-1}(\theta)\, [x-u(\theta)]\right\}.
\tag{4.2}
\]

At each point, then, this requires the calculation of the determinant and inverse of an N × N matrix. Both scale as O(N³), so even for smaller data sets this can become cumbersome.

4.1.1 A Simple Iterative Approach

One can avoid performing the O(N³) matrix operations repeatedly by following the simple approximation below.

1. Choose a fiducial point from the prior, θ_F.
2. Calculate N^{-1}(θ_F) and |N(θ_F)|.
3. Locate the peak of the posterior probability distribution of θ assuming constant N^{-1}(θ_F) and |N(θ_F)| instead of evaluating these at each point θ.
4. Take the posterior maximum as the new fiducial point and repeat steps 2 and 3. Continue doing so until convergence is reached; the final posterior distribution is the solution.

This method provides a quick approximation of the true likelihood function and requires only one matrix inversion and determinant calculation per iteration.
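The iteration in steps 1–4 can be sketched as a fixed-point loop. `peak_with_fixed_cov` is a hypothetical stand-in for a full posterior maximisation carried out with the covariance frozen at the current fiducial point.

```python
def iterate_fiducial(theta_0, peak_with_fixed_cov, tol=1e-8, max_iter=100):
    """Sketch of the iterative scheme of Section 4.1.1: freeze N^{-1} and |N|
    at a fiducial point, locate the posterior peak, re-centre the fiducial
    there, and repeat until the fiducial stops moving."""
    theta = theta_0
    for _ in range(max_iter):
        new_theta = peak_with_fixed_cov(theta)   # peak with N(theta) held fixed
        if abs(new_theta - theta) < tol:
            return new_theta                     # converged
        theta = new_theta
    return theta
```

In this one-dimensional sketch, convergence requires the peak-finding map to be contractive near the solution; the same structure carries over to vector-valued θ with a norm in place of `abs`.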
4.2 The MOPED Algorithm

Multiple Optimised Parameter Estimation and Data compression (MOPED; [173]) is a patented algorithm for the compression of data and for speeding up the evaluation of likelihood functions in astronomical data analysis and beyond. It becomes particularly useful when the noise covariance matrix depends on the parameters of the model and so must be calculated and inverted at each likelihood evaluation. However, such benefits come with limitations. Since MOPED only guarantees maintaining the Fisher matrix of the likelihood at the chosen fiducial point, multimodal and some degenerate distributions will present a problem. In this chapter I report on some of the limitations of the application of the MOPED algorithm. In the cases where MOPED does accurately represent the likelihood function, however, its compression of the data, and the consequent much faster likelihood evaluation, provides orders of magnitude improvement in runtime. In [173], the authors demonstrate the method by analysing the spectra of galaxies, and in [174] they illustrate the benefits of MOPED for estimation of the CMB power spectrum. The problem of "badly behaved" likelihoods was found by [171] for the problem of light-transit analysis; nonetheless, those authors present a solution that still allows MOPED to provide a large speed increase.

4.2.1 Data Compression with MOPED

Full details of the MOPED method are given in [173]; here we present only a brief introduction.

MOPED eliminates the need for the full matrix inversion by compressing the N data points in x into M data values, one for each parameter of the model. Additionally, MOPED creates the compressed data values such that they are independent and have unit variance, further simplifying the subsequent likelihood calculation to an O(M) operation. Typically M ≪ N, so this gives a significant increase in speed.
A single compression is performed on the data, x, and then another for each point in parameter space where we wish to compute the likelihood. The compression is done by generating a set of weighting vectors, b_i(θ_F) (i = 1...M), from which we can generate a set of MOPED components from the theoretical model and the data,

\[
y_i(\theta_F) \equiv b_i(\theta_F) \cdot x = b_i^T(\theta_F)\, x.
\tag{4.3}
\]

Note that the weighting vectors must be computed at some assumed fiducial set of parameter values, θ_F. The only choice that will truly maintain the likelihood peak is the one where the fiducial parameters are the true parameters, but obviously we will not know these in advance in real analysis situations. We can therefore choose our fiducial model to be anywhere and iterate the procedure, taking the likelihood peak of one iteration as the fiducial model for the next. This process converges very quickly, and may not even be necessary in some instances. For our later examples, since we do know the true parameters, we will use these as the fiducial (θ_F = θ_T) in order to remove this as a source of confusion (all equations, however, are written for the more general case). Note that the true parameters, θ_T, will not necessarily coincide with the peak θ̂_O of the original likelihood or the peak θ̂_M of the MOPED likelihood (see below).

The weighting vectors must be generated in some order, so that each subsequent vector (after the first) can be made orthogonal to all previous ones using Gram-Schmidt orthonormalisation. We begin by writing the derivative of the model with respect to the i-th parameter as ∂u/∂θ_i |_{θ_F} = u_{,i}(θ_F). This gives a solution for the first weighting vector, properly normalised, of

\[
b_1(\theta_F) = \frac{\mathbf{N}^{-1}(\theta_F)\, u_{,1}(\theta_F)}{\sqrt{u_{,1}^T(\theta_F)\, \mathbf{N}^{-1}(\theta_F)\, u_{,1}(\theta_F)}}.
\tag{4.4}
\]

The first compressed value is y_1(θ_F) = b_1^T(θ_F) x and will weight up the combination of the data most sensitive to the first parameter. The subsequent weighting vectors are made
orthogonal by subtracting out the parts parallel to previous vectors, and are then normalised. The resulting formula for the remaining weighting vectors is

\[
b_m = \frac{\mathbf{N}^{-1} u_{,m} - \sum_{q=1}^{m-1} (u_{,m}^T b_q)\, b_q}{\sqrt{u_{,m}^T \mathbf{N}^{-1} u_{,m} - \sum_{q=1}^{m-1} (u_{,m}^T b_q)^2}},
\tag{4.5}
\]

where m = 2...M and all values are evaluated at θ_F. Weighting vectors generated with Equations (4.4) and (4.5) form an orthonormal set with respect to the noise covariance matrix, so that

\[
b_i^T(\theta_F)\, \mathbf{N}(\theta_F)\, b_j(\theta_F) = \delta_{ij}.
\tag{4.6}
\]

This means that the noise covariance matrix of the compressed values y_i is the identity, which significantly simplifies the likelihood calculation. The new likelihood function is given by

\[
L_{\rm MOPED}(\theta) = \frac{1}{(2\pi)^{M/2}} \exp\left\{-\frac{1}{2}\sum_{i=1}^{M} \bigl(y_i(\theta_F) - \langle y_i \rangle(\theta;\theta_F)\bigr)^2\right\},
\tag{4.7}
\]

where y_i(θ_F) = b_i^T(θ_F) x represents the i-th compressed data value and ⟨y_i⟩(θ; θ_F) = b_i^T(θ_F) u(θ) represents the i-th compressed signal value. This is a much easier likelihood to calculate, and is time-limited by the generation of a new signal template rather than by the inversion of the noise covariance matrix. The peak of the MOPED likelihood function is not guaranteed to be unique, as there may be other points in the original parameter space that map to the same point in the compressed parameter space; this is a characteristic we will investigate.

The MOPED likelihood will closely approximate the true likelihood function. Most importantly, the peak will remain in the same place when the true parameters (or values very close to them) are used as the fiducial. Additionally, in the limiting case where the noise covariance matrix does not depend on the parameters, the Fisher matrix describing the curvature of the likelihood surface at the peak is identical to the original. However, the properties of the likelihood away from the peak are not guaranteed to be so well behaved.
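The Gram-Schmidt construction of Equations (4.4) and (4.5) can be written out directly for the diagonal-covariance case used later in this chapter. This is an illustrative sketch in plain Python, not the analysis code.

```python
import math

def moped_vectors(derivs, noise_diag):
    """Gram-Schmidt construction of the MOPED weighting vectors,
    Equations (4.4) and (4.5), for a diagonal noise covariance.

    derivs[i] is the model derivative u_{,i} evaluated at the fiducial point;
    noise_diag holds the diagonal of N.
    """
    b_set = []
    for u_i in derivs:
        # Start from N^{-1} u_{,i} ...
        v = [u / s for u, s in zip(u_i, noise_diag)]
        # ... subtract the projections (u_{,i}^T b_q) b_q on earlier vectors ...
        for b in b_set:
            coeff = sum(u * bq for u, bq in zip(u_i, b))
            v = [vv - coeff * bq for vv, bq in zip(v, b)]
        # ... and normalise; v . u_{,i} reproduces the denominator of Eq (4.5).
        norm = math.sqrt(sum(vv * u for vv, u in zip(v, u_i)))
        b_set.append([vv / norm for vv in v])
    return b_set

def compress(b_set, x):
    """MOPED components y_i = b_i . x of Equation (4.3)."""
    return [sum(bi * xi for bi, xi in zip(b, x)) for b in b_set]
```

The defining property, Equation (4.6), can be checked numerically: b_i^T N b_j should be the identity, so the compressed values are independent with unit variance.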
4.2.2 Speed Gains with MOPED

MOPED implicitly assumes that the covariance matrix, N, is independent of the parameters θ. With this assumption, it is appropriate to compare MOPED to the approximate method described in Section 4.1.1. In that comparison, a full likelihood calculation with N data points requires an O(N²) operation at each point in parameter space (or O(N) if N is diagonal). In MOPED, however, the compression of the theoretical data is an O(NM) linear operation followed by an O(M) misfit calculation, leading to an overall complexity of O(NM) + O(M) ≈ O(NM) for each likelihood calculation. Thus one loses on speed if N is diagonal, but gains by a factor of N/M otherwise. For the data sets we will analyze, N/M > 100. We begin, though, by assuming a diagonal N for simplicity, recognizing that this will cause a speed reduction but that it is a necessary step before addressing a more complex noise model.

One can iterate the parameter estimation procedure if necessary, taking the maximum likelihood or posterior point found as the new fiducial and re-analyzing (perhaps with tighter prior constraints), as in the approximate method suggested in the previous section. This procedure is recommended for MOPED in [173], but is not always found to be necessary.

MOPED has the additional benefit that the weighting vectors, b_i, need only be computed once, provided the fiducial model parameters are kept constant over the analysis of different data sets. Computed compressed values, ⟨y_i⟩, can also be stored for reference and require significantly less memory than storing the entire theoretical data set.

4.3 Simple Example With One Parameter

In order to demonstrate some of the limitations of the applicability of the MOPED algorithm, we will consider a few test cases. These originate in the context of gravitational wave data analysis for LISA, since it is in this scenario that we first discovered such cases of failure.
The full problem is a seven-dimensional parameter estimation, but we have fixed most of these variables to their known true values in the simulated data set in order to create a lower-dimensional problem that is simpler to analyse.

We consider the case of a sine-Gaussian burst signal present in the LISA detector. The short duration of the burst with respect to the motion of LISA allows us to use the static approximation to the response. As in Section 3.1.1.2, the frequency-space waveform is described by [149]

\[
\tilde{h}(f) = \frac{AQ}{f} \exp\left\{-\frac{1}{2} Q^2 \left(\frac{f - f_c}{f_c}\right)^2\right\} \exp(2\pi \imath t_0 f).
\tag{4.8}
\]

Here A is the dimensionless amplitude factor; Q is the width of the Gaussian envelope of the burst, measured in cycles; f_c is the central frequency of the oscillation being modulated by the Gaussian envelope; and t_0 is the central time of arrival of the burst. This waveform is further multiplied by a projection factor dependent on the sky position of the burst source, θ and φ, and the burst polarisation, ψ, with respect to the detector. The one-sided noise power spectral density of the LISA detector is given by [149]

\[
S_h(f) = 16 \sin^2(2\pi f t_L) \times \Bigl( 2\bigl(1 + \cos(2\pi f t_L) + \cos^2(2\pi f t_L)\bigr) S_{pm}(f) + \bigl(1 + \cos(2\pi f t_L)/2\bigr) S_{sn} f^2 \Bigr),
\tag{4.9}
\]

\[
S_{pm}(f) = \left(1 + \left(\frac{10^{-4}\,{\rm Hz}}{f}\right)^2\right) \frac{S_{acc}}{f^2},
\tag{4.10}
\]

where t_L = 16.678 s is the light travel time along one arm of the LISA constellation, S_acc = 2.5 × 10^{-48} Hz^{-1} is the proof-mass acceleration noise, and S_sn = 1.8 × 10^{-37} Hz^{-1} is the shot noise. This is independent of the signal parameters, and all frequencies are independent of each other, so the noise covariance matrix is constant and diagonal. This less computationally expensive example allows us to demonstrate some interesting properties. We begin by taking the one-dimensional case where the only unknown parameter of the model is the central frequency of the oscillation, f_c.
We set Q = 5 and t_0 = 10^5 s; we then analyze a 2048 s segment of LISA data, beginning at t = 9.9 × 10^4 s, sampled at a 1 s cadence. For this example, the data were generated with random noise (following the LISA noise power spectrum) at an SNR of ∼ 34, with f_{c,T} = 0.1 Hz (thus f_{c,F} = 0.1 Hz for MOPED). The prior on the central frequency is uniform from 10^{-3} Hz to 0.5 Hz. 10,000 samples uniformly spaced in f_c were taken and their likelihoods calculated using both the original and the MOPED likelihood functions. The log-likelihoods are shown in Figure 4.1. Note that the absolute magnitudes are not important, but the relative values within each plot are meaningful. Both the original and MOPED likelihoods have a peak close to the input value f_{c,T}.

Figure 4.1: The original and MOPED log-likelihoods as a function of f_c for the chosen template.

We see, however, that the MOPED log-likelihood also has a second peak of equal height at an incorrect f_c. To see where this peak comes from, we look at the values of the compressed parameter ⟨y_1⟩(f_c; f_{c,F}) as it varies with f_c, shown in Figure 4.2. The true compressed-value peak occurs at f_c ≈ 0.1 Hz, where y_1(f_{c,F}) = ⟨y_1⟩(f_c; f_{c,F}). However, there is another frequency that yields this exact same value of ⟨y_1⟩(f_c; f_{c,F}); it is at this frequency that the second, incorrect peak occurs. By creating a mapping from f_c to ⟨y_1⟩(f_c; f_{c,F}) that is not one-to-one, MOPED has created the possibility of a second solution that is indistinguishable in likelihood from the correct one. This is a very serious problem for parameter estimation.

4.4 Recovery in a 2 Parameter Case

Interestingly, we find that even when MOPED fails in a one-parameter case, adding a second parameter may actually rectify the problem, although not necessarily.
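Before moving on, the one-parameter failure mode just described can be made concrete with a deliberately simple toy model. The model below is invented for illustration; it is not the sine-Gaussian burst model of Equation (4.8).

```python
import math

# Toy model u(theta) = [theta**2, theta] with unit (identity) noise
# covariance: a single parameter, two data points, one MOPED vector.

def u(theta):
    return [theta ** 2, theta]

def u_deriv(theta):
    return [2.0 * theta, 1.0]

theta_f = 1.0                                  # fiducial (= true) point
du = u_deriv(theta_f)
norm = math.sqrt(sum(d * d for d in du))       # N = identity, so Eq (4.4) simplifies
b1 = [d / norm for d in du]                    # single MOPED weighting vector

def y1(theta):
    """Compressed model value <y_1>(theta; theta_F) = b_1 . u(theta)."""
    return sum(bi * ui for bi, ui in zip(b1, u(theta)))

# The map theta -> <y_1> is quadratic and hence not one-to-one: theta = -1.5
# gives exactly the same compressed value as the fiducial theta = 1, even
# though u(-1.5) != u(1). The MOPED likelihood therefore has a spurious
# second peak at theta = -1.5 that the original likelihood does not share.
```

The same mechanism, a non-injective map from parameters to compressed values, produces the false peak seen in Figures 4.1 and 4.2.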
If we now allow the width of the burst, Q, to be a variable parameter, there are now two orthogonal MOPED weighting vectors that need to be calculated. This gives us two compressed parameters for each pair of f_c and Q. Each of these may have its own unphysical degeneracies, but in order to produce an unphysical mode of likelihood equal to the true peak, these degeneracies must coincide.

Figure 4.2: The value of the MOPED compressed parameter as a function of the original frequency parameter.

In Figure 4.3, we plot the loci in (f_c, Q) space where ⟨y_i⟩(θ; θ_F) = ⟨y_i⟩(θ̂_M; θ_F) as θ ranges over f_c and Q values. We can clearly see the degeneracies present in either variable, but since these overlap at only the one location, near to where the true peak is, there is no unphysical second mode in the MOPED likelihood. Hence, when we plot the original and MOPED log-likelihoods in Figure 4.4, although the behaviour away from the peak has changed, the peak itself remains in the same location and there is no second mode.

Figure 4.3: Loci of ⟨y_1⟩(θ; θ_F) = ⟨y_1⟩(θ̂_M; θ_F) and ⟨y_2⟩(θ; θ_F) = ⟨y_2⟩(θ̂_M; θ_F) in the parameter space θ = {f_c, Q}. The one intersection is the true maximum likelihood peak.

Adding more parameters, however, does not always improve the situation, as the signal varies in different ways for each. We now consider the case where Q is once again fixed to its true value and we instead make the polarisation of the burst, ψ, a variable parameter. There are degeneracies in both of these parameters, and in Figure 4.5 we plot the loci in (f_c, ψ)-space where the compressed values are each equal to their value at the maximum MOPED-likelihood point. These two will necessarily intersect at the maximum likelihood solution, near the true value (f_c = 0.1 Hz and ψ = 1.3 rad), but a second intersection is also apparent.
This second intersection will have the same likelihood as the maximum and be another mode of the solution. However, as we can see in the left plot of Figure 4.6, this is not a mode of the original likelihood function. MOPED has, in this case, created a second mode of equal likelihood to the true peak.

For an even more extreme scenario, we now fix ψ to its true value and allow the time of arrival of the burst, t₀, to vary (we also define Δt₀ = t₀ − t₀,T). In this scenario, the loci in (f_c, Δt₀)-space where ⟨y_i⟩(θ; θ_F) = ⟨y_i⟩(θ̂_M; θ_F) are much more complicated. Thus, we have many more intersections of the two loci than just at the likelihood peak near the true values, and MOPED creates many alternative modes of likelihood equal to that of the original peak. This is very problematic for parameter estimation. In Figure 4.7 we plot these loci so that the multiple intersections are apparent. Figure 4.8 shows the original and MOPED log-likelihoods, where we can see the single peak becoming a set of peaks.

Figure 4.4: Contours of the original and MOPED log-likelihoods (left and right, respectively). The MOPED likelihood has been multiplied by a constant factor so that its peak value is equal to the peak of the original likelihood. Contours are at 1, 2, 5, 10, 20, 30, 40, 50, 75, and 100 log-units below the peak going from the inside to the outside.

Figure 4.5: Loci of ⟨y₁⟩(θ; θ_F) = ⟨y₁⟩(θ̂; θ_F) and ⟨y₂⟩(θ; θ_F) = ⟨y₂⟩(θ̂; θ_F) in the parameter space θ = {f_c, ψ}.

4.5 Discussion

What we can determine from the previous two sections is a general rule for when MOPED will generate additional peaks in the likelihood function equal in magnitude to the true one.
For an M-dimensional model, if we consider the (M − 1)-dimensional hyper-surfaces where ⟨y_i⟩(θ; θ_F) = ⟨y_i⟩(θ̂_M; θ_F), then any point where these M hyper-surfaces intersect will yield a set of θ-parameter values with likelihood equal to that at the peak near the true values. We expect that there will be at least one intersection, at the likelihood peak corresponding approximately to the true solution. However, we have shown that other peaks can exist as well. The set of intersections of contour surfaces will determine where these additional peaks are located. This degeneracy will interact with the model's intrinsic degeneracy, as any degenerate parameters will yield the same compressed values for different original parameter values.

Unfortunately, there is no simple way to find these contours other than by mapping out the ⟨y_i⟩(θ; θ_F) values, which is equivalent in procedure to mapping the MOPED likelihood surface. The benefit comes when this procedure is significantly faster than mapping the original likelihood surface. The mapping of ⟨y_i⟩(θ; θ_F) can even be performed before data is obtained or used, if the fiducial model is chosen in advance; this allows us to analyse properties of the MOPED compression before applying it to data analysis.

Figure 4.6: Contours of the original and MOPED log-likelihoods (left and right, respectively). The MOPED likelihood has been multiplied by a constant factor so that its peak value is equal to the peak of the original likelihood. Contours are at 1, 2, 5, 10, 20, 30, 40, 50, 75, and 100 log-units below the peak going from the inside to the outside.

Figure 4.7: Loci of ⟨y₁⟩(θ; θ_F) = ⟨y₁⟩(θ̂; θ_F) and ⟨y₂⟩(θ; θ_F) = ⟨y₂⟩(θ̂; θ_F) in the parameter space θ = {f_c, Δt₀}. We can see many intersections here other than the true peak.
If the MOPED likelihood is mapped and there is only one contour intersection, then we can rely on the MOPED algorithm and will have saved a considerable amount of time, since MOPED has been demonstrated to provide speed-ups of factors of up to 10⁷ in [174]. However, if there are multiple intersections, then it is necessary to map the original likelihood to know whether they are due to degeneracy in the model or were created erroneously by MOPED. In this latter case, the time spent finding the MOPED likelihood surface can be much less than that which will be needed to map the original likelihood, so relatively little time will have been wasted. If any model degeneracies are known in advance, then we can expect to see them in the MOPED likelihood and will not need to find the original likelihood on their account.

One possible way of determining the validity of degenerate peaks in the MOPED likelihood function is to compare the original likelihoods of the peak parameter values with each other. By using the maximum MOPED likelihood point found in each mode and evaluating the original likelihood at this point, we can determine which one is correct. The correct peak and any degeneracy in the original likelihood function will yield similar values to one another, but a false peak in the MOPED likelihood will have a much lower value in the original likelihood and can be ruled out. This means that a Bayesian evidence calculation cannot be obtained from using the MOPED likelihood; however, the algorithm was not designed to be able to provide this.

Figure 4.8: Contours of the original and MOPED log-likelihoods (left and right, respectively). The MOPED likelihood has been multiplied by a constant factor so that its peak value is equal to the peak of the original likelihood. Contours are at 1, 2, 5, 10, 20, 30, 40, 50, 75, and 100 log-units below the peak going from the inside to the outside.
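The screening procedure described above — evaluate the expensive original likelihood only at the best point of each MOPED mode, and discard modes that fall far below the best — can be sketched generically. The function names and the five-log-unit threshold here are my own illustrative choices:

```python
import numpy as np

def screen_moped_peaks(peak_params, original_loglike, delta=5.0):
    """Keep only candidate MOPED peaks whose ORIGINAL log-likelihood lies
    within `delta` log-units of the best candidate; spurious MOPED modes
    score far lower under the original likelihood and are ruled out."""
    logls = np.array([original_loglike(p) for p in peak_params])
    best = logls.max()
    return [p for p, ll in zip(peak_params, logls) if ll >= best - delta]

# Toy original log-likelihood with a single true peak at theta = 0.1:
loglike = lambda theta: -0.5 * ((theta - 0.1) / 0.01) ** 2
candidates = [0.1, 0.3]   # the true peak and a spurious MOPED peak
kept = screen_moped_peaks(candidates, loglike)
```

Only one original-likelihood evaluation is needed per candidate mode, so this check is cheap compared with mapping the full original likelihood surface.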
The solution for this problem presented in [171] is to use multiple fiducial models to create multiple sets of weighting vectors. The log-likelihood is then averaged across these choices. Each different fiducial will create a set of likelihood peaks that includes the true peaks and any extraneous ones. However, the only peaks that will be consistent between fiducials are the correct ones. Therefore, the averaging maintains the true peak(s) but decreases the likelihood at incorrect values. This was tested with 20 random choices of θ_F for the two-parameter models presented and was found to leave only the true peak at the maximum likelihood value. Other, incorrect, peaks are still present, but at log-likelihood values five or more units below the true peak. When applied to the full seven-parameter model, however, the SNR threshold for signal recovery is increased significantly, from ≃10 to ≃30.

The MOPED algorithm for reducing the computational expense of likelihood functions can, in some examples, be extremely useful and provide orders of magnitude of improvement. However, as we have shown, this is not always the case, and MOPED can produce erroneous peaks in the likelihood that impede parameter estimation. It is important to identify whether or not MOPED has accurately portrayed the likelihood function before using the results it provides. Some solutions to this problem have been presented and tested, but further implementation was not pursued.

Chapter 5

Artificial Neural Networks

The computer was born to solve problems that did not exist before.
Bill Gates

Many calculations in gravitational wave physics, and astrophysics in general, are very computationally expensive. If a calculation needs to be performed a large number of times, then it can slow down simulations and analysis significantly.
One example in gravitational wave physics is the calculation of waveforms for binary black hole mergers in which the full spin dynamics of both black holes are considered. A single waveform can take on the order of seconds to calculate, or even longer in the case of numerical relativity simulations. For this and other problems, we set out to develop a generic algorithm for the training of artificial neural networks in order to facilitate and expedite these repeated, difficult computations.

Artificial neural networks (NNs) are a method of computation loosely based on the structure of a brain. They consist of a group of interconnected nodes, which process information that they receive and then pass this product along to other nodes via weighted connections. We will consider only feed-forward NNs, for which the structure is directed: an input layer of nodes passes information to an output layer via zero, one, or many "hidden" layers in between. A network is able to "learn" a relationship between inputs and outputs given a set of training data, and can then make predictions of the outputs for new input data. Further introduction can be found in [175].

In Section 5.1 I describe the general structure of NNs and how they may be trained on a specific problem. Section 5.2 will then detail the procedure used to train NNs to solve a particular task. This is then applied to some toy examples in Section 5.3. Applying NNs to learning how to classify images of handwritten digits is described in Section 5.4, and to the problem of measuring the shape of sheared, blurred, and pixelated images of galaxies in Section 5.5. Work in this chapter was performed in collaboration with Farhan Feroz.

5.1 Network Structure

A multilayer perceptron artificial neural network (NN) is the simplest type of network and consists of ordered layers of perceptron nodes that pass scalar values from one layer to the next.
The perceptron is the simplest kind of node; it maps an input vector x ∈ ℜⁿ to a scalar output f(x; w, θ) via

\[ f(\mathbf{x};\mathbf{w},\theta) = \theta + \sum_{i=1}^{n} w_i x_i, \tag{5.1} \]

where w = {w_i} and θ are the parameters of the perceptron, called the "weights" and "bias", respectively. For a 3-layer NN, which consists of an input layer, a hidden layer, and an output layer as shown in Figure 5.1, the outputs of the nodes in the hidden and output layers are given by the following equations:

\[ \text{hidden layer: } h_j = g^{(1)}\big(f_j^{(1)}\big); \quad f_j^{(1)} = \theta_j^{(1)} + \sum_l w_{jl}^{(1)} x_l, \tag{5.2} \]

\[ \text{output layer: } y_i = g^{(2)}\big(f_i^{(2)}\big); \quad f_i^{(2)} = \theta_i^{(2)} + \sum_j w_{ij}^{(2)} h_j, \tag{5.3} \]

where l runs over input nodes, j runs over hidden nodes, and i runs over output nodes. The functions g^(1) and g^(2) are called activation functions and must be bounded, smooth, and monotonic for our purposes. We use g^(1)(x) = 1/(1 + e^(−x)) = sig(x) (sigmoid) and g^(2)(x) = x; the non-linearity of g^(1) is essential to allowing the network to model non-linear functions.

The weights and biases are the values we wish to determine in our training (described in Section 5.2). As they vary, a huge range of non-linear mappings from inputs to outputs is possible. In fact, a universal approximation theorem [176] states that a NN with three or more layers can approximate any continuous function as long as the activation function is locally bounded, piecewise continuous, and not a polynomial (hence our use of the sigmoid g^(1), although other functions, such as tanh, would work just as well). By increasing the number of hidden nodes, we can achieve more accuracy at the risk of overfitting to our training data.

Figure 5.1: A 3-layer neural network with 3 inputs, 4 hidden nodes, and 2 outputs. Image courtesy of Wikimedia Commons.

To expand the NN to include more hidden layers, we mirror Equation (5.2) for each connection from one hidden layer to the next, each time using the same g^(1) activation function.
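A minimal NumPy sketch of the forward pass defined by Equations (5.2) and (5.3), with a sigmoid hidden activation and linear outputs; the random weights are illustrative placeholders, and the layer sizes match the network of Figure 5.1:

```python
import numpy as np

def sigmoid(x):
    """g^(1)(x) = 1 / (1 + e^(-x))."""
    return 1.0 / (1.0 + np.exp(-x))

def forward(x, W1, b1, W2, b2):
    """3-layer NN forward pass: Eq. (5.2) followed by Eq. (5.3)."""
    f1 = b1 + W1 @ x     # f_j^(1) = theta_j^(1) + sum_l w_jl^(1) x_l
    h = sigmoid(f1)      # hidden layer outputs, g^(1) = sigmoid
    f2 = b2 + W2 @ h     # f_i^(2) = theta_i^(2) + sum_j w_ij^(2) h_j
    return f2            # output layer, g^(2)(x) = x (linear)

rng = np.random.default_rng(0)
n_in, n_hid, n_out = 3, 4, 2   # the sizes in Figure 5.1
W1, b1 = rng.normal(size=(n_hid, n_in)), rng.normal(size=n_hid)
W2, b2 = rng.normal(size=(n_out, n_hid)), rng.normal(size=n_out)
y = forward(rng.normal(size=n_in), W1, b1, W2, b2)
```

Extra hidden layers are handled by repeating the `sigmoid(b + W @ h)` step once per layer before the final linear output.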
The final hidden layer then connects to the output layer via Equation (5.3).

5.1.1 Choosing the Number of Hidden Layer Nodes

An important choice when training a NN is the number of hidden layer nodes to use. The optimal number, and their organisation into one or more layers, is a complex function of the number of training data points, the number of inputs and outputs, and the complexity of the function to be learnt. Choosing too few nodes will mean that the NN is unable to learn the relationship to the highest possible accuracy; choosing too many will increase the risk of overfitting to the training data and will also slow down the training process. As a general rule, we find that a NN should not need more hidden nodes than the number of training data values used (each with a vector of inputs and outputs).

We choose to use 5–10 nodes in the hidden layer in our toy examples (see Section 5.3). These choices allow the network to model the complexity of the function without unnecessary work. In practice, it is better to slightly over-estimate the number of hidden nodes required. There are checks built in to prevent over-fitting (described in Section 5.2), and the additional training time will not be a large penalty if an optimal network can be obtained in an early attempt. The optimal NN structure can be determined by comparing the Bayesian evidences of different trained NNs, as calculated in Section 5.2.6.

5.2 Network Training

In training a NN, we wish to find the optimal set of network weights and biases that maximises the accuracy of predicted outputs. However, we must be careful to avoid overfitting to our training data at the expense of making predictions for input values the network has not been trained on. The NN can be trained from a random initial state or can be pre-trained, a procedure which will be discussed in Section 5.2.5. But first we will discuss the general process of NN training.
The set of training data inputs and outputs, D = {x^(k), t^(k)}, is provided by the user. Approximately 75% should be used for actual NN training and the remainder set aside as a validation set that will be used to determine convergence and to avoid overfitting. This ratio of 3:1 gives plenty of information for training but still leaves a representative subset of the data for checks to be made.

5.2.1 Overview

The weights and biases we will collectively call the network parameter vector a. We can now consider the probability that a given set of network parameters is able to reproduce the known training data outputs – representing how well our NN model of the original function reproduces the true values. For problems of regression (fitting the model to a function), this gives us a log-likelihood function for a, based on a standard χ² error function, given by

\[ \log\big(L(\mathbf{a};\boldsymbol{\sigma})\big) = -\frac{K}{2}\log(2\pi) - \sum_{i=1}^{N}\log(\sigma_i) - \frac{1}{2}\sum_{k=1}^{K}\sum_{i=1}^{N}\left[\frac{t_i^{(k)} - y_i(\mathbf{x}^{(k)};\mathbf{a})}{\sigma_i}\right]^2, \tag{5.4} \]

where N is the number of outputs, K is the number of data points, and y(x^(k); a) is the NN's predicted output vector for the input vector x^(k) and network parameters a. The values of σ = {σ_i} are hyper-parameters of the model that describe the standard deviation (error size) of each of the outputs.

For a classification network that aims to learn the probabilities that a set of inputs belongs to a set of output classes, the outputs are transformed according to the softmax procedure so that they are all non-negative and sum to one:

\[ y'_i(\mathbf{x}^{(k)};\mathbf{a}) = \frac{\exp\big(y_i(\mathbf{x}^{(k)};\mathbf{a})\big)}{\sum_{j=1}^{N}\exp\big(y_j(\mathbf{x}^{(k)};\mathbf{a})\big)}. \tag{5.5} \]

The classification log-likelihood is then given by the negative of the cross-entropy function [175],

\[ \log\big(L(\mathbf{a};\boldsymbol{\sigma})\big) = \sum_{k=1}^{K}\sum_{i=1}^{N} t_i^{(k)} \log\big(y'_i(\mathbf{x}^{(k)};\mathbf{a})\big), \tag{5.6} \]

so that maximising the log-likelihood minimises the cross-entropy. In this scenario, the true and predicted output values are probabilities (in [0,1]). In the true outputs, all are zero except for the correct output, which has a value of one.
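The softmax transform and the categorical log-likelihood of Equations (5.5) and (5.6) can be sketched directly; the max-subtraction inside the softmax is a standard numerical-stability trick added here, not part of the thesis's definition:

```python
import numpy as np

def softmax(y):
    """Eq. (5.5): map raw network outputs to class probabilities."""
    e = np.exp(y - np.max(y))   # subtracting max(y) avoids overflow
    return e / e.sum()

def class_log_like(targets, raw_outputs):
    """Per-data-point categorical log-likelihood, sum_i t_i log y'_i."""
    probs = softmax(raw_outputs)
    return float(np.sum(targets * np.log(probs)))

t = np.array([0.0, 1.0, 0.0])        # true class is class 1
y_good = np.array([0.0, 8.0, 0.0])   # confident, correct raw outputs
y_bad = np.array([8.0, 0.0, 0.0])    # confident, wrong raw outputs
```

A confident correct prediction gives a log-likelihood near zero (its maximum), while a confident wrong prediction is heavily penalised.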
For classification networks, the σ hyper-parameters do not factor into the log-likelihood.

Our algorithm considers the parameters a to be probabilistic, with a log-prior distribution given by

\[ \log\big(S(\mathbf{a};\alpha)\big) = -\frac{\alpha}{2}\sum_i a_i^2. \tag{5.7} \]

α is a hyper-parameter of the model, called the "regularisation constant", that sets the relative influence of the prior and the likelihood. The posterior probability of a set of NN parameters is thus

\[ \Pr(\mathbf{a};\alpha,\boldsymbol{\sigma}) \propto L(\mathbf{a};\boldsymbol{\sigma}) \times S(\mathbf{a};\alpha). \tag{5.8} \]

The network training begins by setting random values for the weights, sampled from a normal distribution with zero mean. The initial value of σ is set by the user and can be set on either the true output values themselves or on their whitened values (as defined in Section 5.2.2). The only difference between these two settings is the magnitude of the error used. The algorithm then calculates a large initial estimate for α,

\[ \alpha = \frac{|\nabla \log(L)|}{\sqrt{Mr}}, \tag{5.9} \]

where M is the total number of weights and biases (NN parameters) and r is a rate set by the user (0 < r ≤ 1, default r = 0.1) that defines the size of the "confidence region" for the gradient. This formula for α sets larger regularisation (a.k.a. "damping") when the magnitude of the gradient of the likelihood is larger; it relates the amount of "smoothing" required to the steepness of the function being smoothed. The rate factor in the denominator allows us to increase the damping for smaller confidence regions on the value of the gradient. This results in smaller, more conservative steps that are more likely to produce an increase in the function value, but means that more steps are required to reach the optimal weights.

The training then uses conjugate gradients to calculate a step, Δa, that should be taken (see Section 5.2.3). Following a step, adjustments to α and σ may be made before another step is calculated.
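The initial damping estimate of Equation (5.9) is simple to compute; a small sketch, with an arbitrary toy gradient vector standing in for ∇ log(L):

```python
import numpy as np

def initial_alpha(grad_loglike, n_params, rate=0.1):
    """Eq. (5.9): alpha = |grad log L| / sqrt(M * r).

    Steeper likelihoods (larger gradients) get stronger damping;
    smaller rates r give larger alpha, i.e. more conservative steps."""
    return np.linalg.norm(grad_loglike) / np.sqrt(n_params * rate)

g = np.array([3.0, -4.0])                            # toy gradient, |g| = 5
alpha_default = initial_alpha(g, n_params=2)         # r = 0.1 (default)
alpha_bold = initial_alpha(g, n_params=2, rate=1.0)  # r = 1 damps less
```

This mirrors the qualitative behaviour described above: decreasing the rate r increases α and hence the regularisation of each step.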
The methods for calculating the initial α value, and for determining subsequent adjustments of α and/or σ, are as developed for the MEMSYS software package, described in [177].

5.2.2 Data Whitening

Most of the time, it is prudent to "whiten" the data before training a network. Whitening normalises the input and/or output values according to a specified format, which makes it easier to train from the initial weights, which are small and centred on zero. The network weights in the first and last layers can then be "unwhitened" after training so that the network is able to process the original inputs and outputs.

Standard whitening transforms each input to a standard normal distribution by subtracting the mean and dividing by the standard deviation over all elements in the training data:

\[ x'^{(k)}_l = \frac{x^{(k)}_l - \bar{x}_l}{\sigma_l}, \tag{5.10a} \]

\[ \bar{x}_l = \frac{1}{K}\sum_{k=1}^{K} x^{(k)}_l, \tag{5.10b} \]

\[ \sigma_l^2 = \frac{1}{K-1}\sum_{k=1}^{K}\big(x^{(k)}_l - \bar{x}_l\big)^2. \tag{5.10c} \]

Data can also be whitened by bringing all values into [0,1]:

\[ x'^{(k)}_l = \frac{x^{(k)}_l - \min_k\big(x^{(k)}_l\big)}{\max_k\big(x^{(k)}_l\big) - \min_k\big(x^{(k)}_l\big)}. \tag{5.11} \]

These two functions are normally applied separately to each input, but can be calculated across all inputs if the inputs are related; the mean, standard deviation, minimum, or maximum would then be computed over all inputs for all data values. The same whitening functions are also used for whitening the outputs. Since both functions consist of subtracting an offset and multiplying by a scale factor, they can easily be performed and reversed. To unwhiten network weights, the inverse transform is applied, with the offset and scale determined by the source input node or target output node. Outputs for a classification network are not whitened, since they are already simple probabilities.

5.2.3 Finding the Next Step

In order to find the most efficient path to an optimal set of parameters, we perform conjugate gradients using second-order derivative information.
Newton's method gives the second-order approximation of a function,

\[ f(\mathbf{a}+\Delta\mathbf{a}) \approx f(\mathbf{a}) + \big(\nabla f(\mathbf{a})\big)^{\mathrm{T}}\Delta\mathbf{a} + \frac{1}{2}(\Delta\mathbf{a})^{\mathrm{T}}\mathbf{B}\,\Delta\mathbf{a}, \tag{5.12} \]

where B is the Hessian matrix of second derivatives of f at a. In this approximation, the maximum of f will occur when

\[ \nabla f(\mathbf{a}+\Delta\mathbf{a}) \approx \nabla f(\mathbf{a}) + \mathbf{B}\,\Delta\mathbf{a} = 0. \tag{5.13} \]

Solving this for Δa gives us

\[ \Delta\mathbf{a} = -\mathbf{B}^{-1}\nabla f(\mathbf{a}). \tag{5.14} \]

Iterating this stepping procedure will eventually bring us to a local maximum of f. For our purposes, the function f is the log-posterior distribution of the NN parameters, and hence Equation (5.12) is a Gaussian approximation to the posterior. The Hessian of the log-posterior is the regularised ("damped") Hessian of the log-likelihood function, where the prior – whose magnitude is set by α – provides the regularisation. If we define the Hessian matrix of the log-likelihood as H, then B = H + αI (I being the identity matrix). Regularisation increases the probability that the local maximum found is also the global maximum.

Using the second-order information provided by the Hessian allows more efficient steps to be made, since curvature information can extend step sizes in directions where the gradient varies less and shorten them where it varies more. Additionally, using the Hessian of the log-posterior instead of the log-likelihood adds the regularisation of the prior. As mentioned before, this helps prevent the algorithm from getting stuck in a local maximum by smoothing out the function being explored. It also aids in reducing the region of confidence for the gradient information, which makes it less likely that a step results in a worse set of parameters.

Given that the log-likelihood, Equation (5.4), is a sum of squares (plus a constant), we can also save computational expense by utilising the Gauss-Newton approximation of its Hessian, given by

\[ H_{ij} = -\sum_{k=1}^{K}\left(\frac{\partial \mathbf{r}_k}{\partial a_i}\cdot\frac{\partial \mathbf{r}_k}{\partial a_j} + \mathbf{r}_k\cdot\frac{\partial^2 \mathbf{r}_k}{\partial a_i\,\partial a_j}\right) \approx -\sum_{k=1}^{K}\frac{\partial \mathbf{r}_k}{\partial a_i}\cdot\frac{\partial \mathbf{r}_k}{\partial a_j}, \tag{5.15} \]

where

\[ \mathbf{r}_k = \frac{\mathbf{t}^{(k)} - \mathbf{y}(\mathbf{x}^{(k)};\mathbf{a})}{\boldsymbol{\sigma}}. \tag{5.16} \]
The drawback of using second-order information is that calculation of the Hessian is computationally expensive and requires large storage, especially in the many dimensions we will encounter for more complex networks. In general, the Hessian is not guaranteed to be positive semi-definite and so may not be invertible; the Gauss-Newton approximation, however, does carry this guarantee and does not require calculating second derivatives. Inversion of the very large matrix will still be computationally expensive.

As noted in [178], however, we only need products of the Hessian with a vector to compute the solution, never the full Hessian itself. To calculate these approximate Hessian-vector products, we use a fast approximate method given in [179, 180]. This method takes advantage of the approximation made in Equation (5.13). If we choose Δa = rv, where v is any vector and r is a small number, we can solve for the Hessian-vector product,

\[ \mathbf{B}\mathbf{v} \approx \frac{\nabla f(\mathbf{a}+r\mathbf{v}) - \nabla f(\mathbf{a})}{r}. \tag{5.17} \]

This is a simple approximation for the product and can be computed in about the same time as is required to find the gradient, which must be calculated anyway. In order to make this an exact method, and less susceptible to numerical errors, we take the limit as r → 0. Doing so gives us

\[ \mathbf{B}\mathbf{v} = \left.\frac{\partial}{\partial r}\nabla f(\mathbf{a}+r\mathbf{v})\right|_{r=0}. \tag{5.18} \]

We can therefore define the R_v{·} operator,

\[ R_{\mathbf{v}}\{f(\mathbf{a})\} \equiv \left.\frac{\partial}{\partial r} f(\mathbf{a}+r\mathbf{v})\right|_{r=0}, \tag{5.19} \]

such that Bv = R_v{∇f(a)}. This operator is then applied to all of the same equations and procedures used to calculate the gradient. These equivalent functions have the same cost as a gradient calculation, which is an O(M) operation for M weights. This is a vast improvement over calculating and storing the entire Hessian matrix. By combining all of these methods, second-order information becomes practical to use and significantly improves the rate of convergence of NN training.
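The finite-difference product of Equation (5.17) is easy to check against a function whose Hessian is known exactly. Here a toy quadratic log-posterior f(a) = −½ aᵀAa is used (the matrix A is purely illustrative), for which the exact Hessian-vector product is Bv = −Av:

```python
import numpy as np

A = np.array([[2.0, 0.5],
              [0.5, 1.0]])   # toy symmetric curvature matrix

def grad_f(a):
    """Gradient of f(a) = -0.5 * a^T A a, i.e. grad f(a) = -A a."""
    return -A @ a

def hessian_vector(a, v, r=1e-6):
    """Eq. (5.17): B v ~ [grad f(a + r v) - grad f(a)] / r."""
    return (grad_f(a + r * v) - grad_f(a)) / r

a = np.array([0.3, -1.2])
v = np.array([1.0, 2.0])
approx = hessian_vector(a, v)   # two gradient evaluations, no Hessian
exact = -A @ v                  # the Hessian of f is B = -A
```

For a quadratic, the finite difference is exact up to floating-point roundoff; for a general log-posterior it is accurate to O(r), and the R_v{·} operator removes even that error.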
5.2.4 Convergence

Following each step, the posterior, likelihood, correlation, and error-squared values are calculated for the training data and for the validation data that was not used in training (i.e. in calculating the steps). Convergence to a best-fit set of parameters is determined by maximising the posterior, likelihood, correlation, or negative of the error squared of the validation data, as chosen by the user. This prevents overfitting, as it provides a check that the network is still valid on points not in the training set. We use the error squared as the default function to maximise, as it does not include the model hyper-parameters in its calculation and is less prone to problems with zeros than the correlation.

5.2.5 Autoencoders and Pre-Training

Autoencoders are a specific type of neural network wherein the inputs are mapped to themselves through one or more hidden layers. These typically have a symmetric setup of several hidden layers, with a central layer containing fewer nodes than there are inputs. The NN can be split in half, with one part mapping the inputs to this central layer and the second part mapping the central layer values to the outputs (the same as the original inputs). These two parts are called the "encoder" and "decoder" respectively, and map either to or from a reduced basis set of "feature vectors" embodied in the central layer. This is akin to a non-linear principal component analysis (PCA). The non-linearity should allow for more complex relationships, and more information to be contained in the same number of components/feature vectors. The individual feature vectors can be found by decoding (1,0,0,...,0), (0,1,0,...,0), and so on. A basic diagram of an autoencoder is shown in Figure 5.2. The three inputs (x₁,x₂,x₃) are mapped to themselves via three symmetric hidden layers, with 2 nodes in the central layer.
The values of the central layer (z₁,z₂) are the feature-vector weights for the reduced non-linear basis set.

Autoencoders are notoriously difficult to train. A broad local maximum exists wherein all outputs are the average value of the inputs. Geoffrey Hinton, Simon Osindero, and Yee-Whye Teh developed a method for "pre-training" networks to obtain a set of weights near the true global maximum [181]. This method was created with symmetric autoencoders in mind. The map from the inputs to the first hidden layer and then back is treated symmetrically, and the weights are adjusted through a number of "epochs", gradually reducing the reproduction error. This is repeated for the first to second hidden layer, and so on until the central layer is reached. The network weights can then be "unfolded" by using the transpose for the symmetric connections in the decoding half, providing a decent starting point for the full training to begin. This is shown in Figure 5.2, where the W₁ and W₂ weight matrices are defined by pre-training. More details can be found in [181], and [182] has useful diagrams and explanations.

This pre-training can also be used for non-autoencoder networks. All layers of weights, except for the final one that connects the last hidden layer to the outputs, are pre-trained as if they were the first half of a symmetric autoencoder. However, the network weights are not unfolded; instead, the final layer of weights is initialised randomly, as would have been done without pre-training. In this way, the network "learns the inputs" before mapping to a set of outputs.

Figure 5.2: Schematic diagram of an autoencoder. The 3 input values are encoded to 2 feature vectors. Pre-training defines the W₁ and W₂ weight matrices to provide a starting point for fine-tuning in training.
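A sketch of the weight "unfolding" just described: the decoder reuses the encoder's weight matrices transposed. The sizes below follow the 3-input, 2-feature-vector autoencoder of Figure 5.2, but the "pre-trained" matrices are random placeholders and biases are omitted for brevity:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# Placeholder "pre-trained" encoder weights (random here, illustrative only):
rng = np.random.default_rng(1)
W1 = rng.normal(size=(4, 3))   # inputs (3) -> first hidden layer (4)
W2 = rng.normal(size=(2, 4))   # first hidden layer (4) -> central layer (2)

def autoencode(x):
    """Unfolded symmetric autoencoder: the decoding half uses the
    transposed encoder matrices, as in the pre-training scheme."""
    z = W2 @ sigmoid(W1 @ x)         # encode to 2 feature-vector values
    return W1.T @ sigmoid(W2.T @ z)  # decode with transposed weights

x = np.array([0.2, -0.7, 1.1])
x_rec = autoencode(x)   # reconstruction (poor until fine-tuned by training)
```

After unfolding, all weights (no longer tied) are fine-tuned by the full training procedure of Section 5.2.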
When an autoencoder is pre-trained, the activation function into the central hidden layer is made linear and the activation function from the final hidden layer to the outputs is made sigmoidal. Regular networks that are pre-trained continue to use the original activation functions. Examples of the use of pre-training for autoencoders and other networks will be provided in the following sections of this chapter.

5.2.6 The Evidence of a Network

As we are determining the NN's optimal weights within a Bayesian formalism, with likelihood and prior functions, we can also define the Bayesian evidence of the network model. This may be used to compare different types of networks and determine the optimal NN structure. The best NN model will have a larger evidence than other models and will, in general, be the smallest network that provides the best predictions possible for the given data. We approximate the evidence as deriving from a single Gaussian peak about the optimal weights found, as in [183]. In practice, this approximation is close enough to the true value to provide a good guide. From [183], the evidence for an optimised network is given by

\[ \log(Z_{\mathrm{network}}) = \log\big(S(\mathbf{a}_{\mathrm{MP}};\alpha)\big) + \log\big(L(\mathbf{a}_{\mathrm{MP}};\boldsymbol{\sigma})\big) - \frac{1}{2}\log\big(|\mathbf{H}_{\mathrm{MP}}|\big) + \frac{M}{2}\log(\alpha) - \frac{1}{2}\sum_{i=1}^{N}\log(\sigma_i) - \frac{N}{2}\log(2\pi), \tag{5.20} \]

where 'MP' indicates the value at the posterior maximum and M and N are defined as before (the number of NN weights and outputs, respectively). There can be significant uncertainty in this measurement due to the statistical method of calculating log(|H|), which avoids calculating H itself. However, later examples will display trends that indicate when an optimal network structure has been obtained.

5.3 Toy Examples

Figure 5.3: Comparisons of the true and predicted values for the sinc function on the training and validation data sets.

5.3.1 Sinc Function with Noise

The first toy example we examined was a simple regression problem.
In this, we generate 200 samples of x ∈ U[−5π, 5π], for which we evaluate a modified sinc function,

\[ y(x) = \frac{\sin(x)}{x} + 0.04x, \tag{5.21} \]

and then add Gaussian noise with zero mean and a standard deviation of 0.05. The noise is added to make the learning of the function more difficult and to prevent any exact solution being possible. The samples are split, using a random selection of 150 for training and 50 for validation. We use Equation (5.10) to whiten the inputs and outputs and train a network with 7 hidden layer nodes, obtaining correlations of greater than 99.3%. A comparison of the true and predicted outputs is shown in Figure 5.3.

The optimal number of hidden layer nodes can be determined by comparing the calculated evidences of networks with different numbers of hidden nodes. These results are shown in Table 5.1. The evidence clearly increases until we reach 6 hidden nodes and then levels off to within its own measurement error. We can say, then, that any increase in accuracy for larger networks is offset by the increased complexity of the network. This additional complexity does not contribute enough additional accuracy beyond 13 nodes, when the evidence begins to drop significantly again.

Nhid | Avg. Err. Sqr. | Correlation % | ln(Znet) (± 25)
3  | 7.08×10⁻³ | 98.23 | −279
4  | 6.44×10⁻³ | 98.38 | −256
5  | 3.20×10⁻³ | 99.26 | −161
6  | 2.93×10⁻³ | 99.31 | −142
7  | 2.76×10⁻³ | 99.35 | −143
8  | 2.86×10⁻³ | 99.34 | −144
9  | 2.87×10⁻³ | 99.34 | −143
10 | 2.73×10⁻³ | 99.36 | −137
11 | 2.75×10⁻³ | 99.36 | −134
12 | 2.92×10⁻³ | 99.32 | −154
13 | 2.77×10⁻³ | 99.35 | −134
14 | 2.72×10⁻³ | 99.37 | −170
15 | 2.82×10⁻³ | 99.35 | −183

Table 5.1: Results for training regression networks on a modified sinc function with noise.

5.3.2 Three-Way Classification

Radford Neal created a three-way classification data set [184] for testing his own algorithm for NN training. In this data set, four variables x₁, x₂, x₃, and x₄ are sampled from U[0,1] 1000 times each.
If the two-dimensional Euclidean distance between (x₁,x₂) and (0.4,0.5) is less than 0.35, the point is placed in class 0; otherwise, if 0.8x₁ + 1.8x₂ < 0.6, the class is set to 1; and if neither of these conditions holds, the class is set to 2. Note that the values of x₃ and x₄ play no part in the classification. Gaussian noise with standard deviation 0.1 was then added to the input values. Figure 5.4 shows the data set with classifications.

Figure 5.4: The classifications for all data points (training and validation).

Approximately 75% of the data was used for training and the remaining 25% for validation. A network was trained using 8 hidden layer nodes, with the inputs whitened using Equation (5.10). In total, 87.8% of training data points and 85.4% of validation points were correctly classified. The summary of classifications is given in Tables 5.2 and 5.3. These results compare well with Neal's own original results [184] and are similar to classifications based on applying the original criteria to the new points that have noise added.

True Class | Number | % Predicted 0 | % Predicted 1 | % Predicted 2
0 | 282 | 84.0 | 4.96 | 11.0
1 | 93  | 14.0 | 82.8 | 3.2
2 | 386 | 7.0  | 1.3  | 91.7

Table 5.2: Classifications for the toy training data set.

True Class | Number | % Predicted 0 | % Predicted 1 | % Predicted 2
0 | 99  | 75.7 | 6.1  | 18.2
1 | 19  | 21.1 | 78.9 | 0.0
2 | 121 | 5.0  | 0.8  | 94.2

Table 5.3: Classifications for the toy validation data set.

5.3.3 Autoencoders as Non-Linear PCA

As a first test, we trained an autoencoder network on the three-way classification inputs. For this we used hidden layers of 10 and then 4 units, so the final network was 4+10+4+10+4. This network is not reducing the dimensionality, but it does perform a non-linear transformation.
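Neal's classification rule described in Section 5.3.2 is straightforward to reproduce; a hedged sketch of the data generation (the function and variable names are mine, and the random draws will not reproduce Neal's exact data set):

```python
import numpy as np

def make_neal_data(n=1000, noise=0.1, seed=0):
    """Generate a three-way classification toy data set following
    Neal's rule: class 0 if (x1, x2) lies within 0.35 of (0.4, 0.5);
    else class 1 if 0.8*x1 + 1.8*x2 < 0.6; else class 2.
    x3 and x4 are irrelevant, and noise is added to the inputs only
    after the class labels are assigned."""
    rng = np.random.default_rng(seed)
    x = rng.uniform(0.0, 1.0, size=(n, 4))   # x1..x4 ~ U[0,1]
    dist = np.hypot(x[:, 0] - 0.4, x[:, 1] - 0.5)
    cls = np.where(dist < 0.35, 0,
                   np.where(0.8 * x[:, 0] + 1.8 * x[:, 1] < 0.6, 1, 2))
    x_noisy = x + rng.normal(0.0, noise, size=x.shape)
    return x_noisy, cls

X, c = make_neal_data()
```

Because the labels are computed before the noise is added, re-applying the rule to the noisy inputs misclassifies some points near the class boundaries, which is why even a perfect network cannot reach 100% accuracy on this data.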
Without pre-training, even this relatively simple autoencoder with 188 weights consistently achieves an average error squared value of 0.0950269, which corresponds to each output being equal to the average value of that particular input across all data points. However, with pre-training we can obtain an average error squared of 0.00352961, which corresponds to a correlation of 99.7%. After just one step of training, the error squared is already below the best achieved without pre-training. Reducing the central layer to 3 nodes does not affect the results without pre-training; with pre-training we are now only able to obtain an error squared of 0.00621163, which is a correlation of 92.6%. Clearly some significant information is lost, which agrees with the fact that there are four independent inputs.

To provide a quick comparison with traditional principal component analysis (PCA), we use the example of data points sampled from a multivariate Gaussian distribution. The eigenvalues and eigenvectors of this data represent the components that a PCA analysis would measure and use for data compression. In the first example, a 3D non-singular covariance matrix is used to generate samples from a multivariate Gaussian. The eigenvalues and eigenvectors of the covariance matrix calculated from these samples are then computed, to compare with the analytic values. As expected, these match very closely. Autoencoders with an increasing number of hidden-layer nodes (in a single layer) are then trained. In the simplest case of a single hidden-layer node, we expect that the one feature vector that will be represented will be the eigenvector with the largest eigenvalue, as this captures as much information as possible. For two hidden-layer nodes, the two feature vectors will now be a linear combination of the two eigenvectors with the two largest eigenvalues, so that the same plane in the 3D space is spanned. Finally, with three hidden-layer nodes the three feature vectors should span the 3D space and be able to generate a nearly 100% correlation. Pre-training was used as, even in these very small networks, it is easy to fall into the large local maximum of each output always containing the average value.

Nhid   Avg. Err. Sqr.   Correlation %   ln(Znet)
1      0.00191          94.06           5984
2      5.73×10−4        98.08           7510
3      5.93×10−5        99.82           10243
Table 5.4: Results for training autoencoder networks on a 3D multivariate Gaussian.

Table 5.4 shows the results obtained when training autoencoders with increasing numbers of hidden-layer nodes. As more nodes are added, there is a significant decrease in the average error squared as well as increasing correlation, up to 99.82%. The evidence value for the network also increases, giving another indication of a better fit. Furthermore, the feature vector from the network with a single hidden-layer node was indeed the eigenvector with the largest eigenvalue, and the feature vectors from the two-hidden-node network spanned approximately the same plane as the two eigenvectors with the largest eigenvalues.

Additionally, we tested our ability to determine the optimal number of hidden-layer nodes for an autoencoder when additional, redundant information is provided. To accomplish this, a 3D multivariate Gaussian was rotated into 5D space. Therefore, there were only three independent combinations of the five provided values. The known optimal number of hidden nodes is thus also three, and the autoencoders reflect that clearly in the evidences shown in Table 5.5. Once three hidden nodes are used, adding more neither decreases the error squared nor increases the correlation; however, the evidence does decrease as Occam's razor applies a penalty for using more parameters without improving the fit to the data.
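The PCA side of this comparison can be sketched as follows (illustrative Python/numpy; the covariance values are hypothetical, not those used in the thesis, and the autoencoder training itself is not reproduced):

```python
import numpy as np

rng = np.random.default_rng(2)

# Illustrative 3D non-singular covariance matrix (hypothetical values)
cov = np.array([[4.0, 1.2, 0.5],
                [1.2, 2.0, 0.3],
                [0.5, 0.3, 1.0]])
samples = rng.multivariate_normal(np.zeros(3), cov, size=5000)

# PCA: eigendecomposition of the sample covariance matrix
sample_cov = np.cov(samples, rowvar=False)
eigvals, eigvecs = np.linalg.eigh(sample_cov)

# Direction a single-hidden-node autoencoder is expected to recover:
# the eigenvector with the largest eigenvalue
principal = eigvecs[:, np.argmax(eigvals)]
```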
In both of these examples, it is found that with one hidden-layer node the feature vector that is represented is equivalent to the primary eigenvector of the covariance matrix of the samples, exactly consistent with a PCA analysis. Having shown that PCA-like results can be reproduced on a simple example, we can now apply autoencoders for finding non-linear features in more complicated data sets by using larger network structures.

Nhid   Avg. Err. Sqr.   Correlation %   ln(Znet)
1      0.00613          79.6            −6293
2      0.00127          96.0            12430
3      4.87×10−5        99.86           17164
4      4.87×10−5        99.86           16823
5      4.87×10−5        99.86           16859
Table 5.5: Results for training autoencoder networks on a 3D multivariate Gaussian in 5D space.

5.4 MNIST Handwriting Recognition

The MNIST database of handwritten digits is a subset of a larger set available from NIST (the National Institute of Standards and Technology). It consists of 60,000 training and 10,000 validation samples of handwritten digits. The images have been size-normalised and centred in 28×28 pixel greyscale images. They are publicly available at [185], which also provides more information on the generation of the data set and results from previous analyses by other researchers. This data set has become a standard for the testing of machine learning algorithms. The learning task is to correctly identify the digit written in each image, with the chosen digit being given by the output class (0 to 9) with the highest probability. Some sample digits are shown in Figure 5.5.

We have trained several different networks on this data set, using pre-training for any hidden layers. All networks whitened the inputs using Equation (5.11) across all inputs. Additionally, all networks had 784 inputs (one per pixel of the image) and 10 outputs (one for each possible digit). Deeper and larger networks were able to obtain the best results. Results are summarised in Table 5.6, where the error rates are those calculated on the validation set of images.
These can be compared with results referenced at [185], which obtain error rates as low as 0.35% [186] but more typically between 1% and 5% [187].

Figure 5.5: Sample handwritten digits from the MNIST database.

Hidden Layer Nodes   Error Rate (%)
0                    8.08
100                  4.15
250                  3.00
1000                 2.38
300+30               2.83
500+50               2.62
1000+300+30          2.31
500+500+2000         1.76
Table 5.6: Error rates for different networks trained to predict the MNIST data set. Training with larger networks was stopped before full convergence due to the large computational cost for marginal improvements.

True   Count   Predicted Digit (%)
               0     1     2     3     4     5     6     7     8     9
0      980     99.1  0.0   0.2   0.1   0.0   0.3   0.2   0.1   0.0   0.0
1      1135    0.0   99.0  0.1   0.3   0.0   0.3   0.1   0.1   0.2   0.0
2      1032    0.2   0.1   97.6  0.9   0.1   0.1   0.2   0.4   0.5   0.0
3      1010    0.0   0.0   0.0   98.4  0.0   0.4   0.0   0.3   0.7   0.2
4      982     0.0   0.0   0.2   0.0   98.3  0.0   0.5   0.0   0.0   1.0
5      892     0.1   0.0   0.0   1.1   0.0   98.0  0.3   0.0   0.3   0.1
6      958     0.3   0.2   0.0   0.2   0.2   0.3   98.3  0.0   0.4   0.0
7      1028    0.0   0.1   0.5   0.5   0.1   0.0   0.0   98.0  0.2   0.6
8      974     0.2   0.0   0.1   0.8   0.1   0.4   0.0   0.1   97.7  0.5
9      1009    0.2   0.1   0.0   0.5   0.5   0.2   0.1   0.4   0.2   97.8
Table 5.7: Classifications for the MNIST blind data set. (Rows may not add up to exactly 100 due to rounding.)

In Table 5.7 we provide the classification rates made by the best-performing (784+500+500+2000+10) network on the blind data set. From this we can see how the network distributes its correct and incorrect predictions. To illustrate further, in Figure 5.6 we show a sample selection of the digits predicted incorrectly. Some are indeed hard even for a human to distinguish, but for others it is unclear why they were mis-identified.

A pair of autoencoder networks were also trained on the data set, one with hidden layers of 1000+300+30+300+1000 (called AE-30) and another with hidden layers of 1000+500+50+500+1000 (called AE-50). Both had 784 inputs/outputs to match the image size.
The AE-30 network was able to obtain an average total error squared of only 4.64 on reproducing the digits, and AE-50 obtained an average total error squared of 3.29. The networks encoded each image as 30 or 50 feature vector weights, respectively, and then decoded these weights to produce an image of the digit comparable to the original. The error squared values are comparable to those obtained by Hinton and Salakhutdinov in [182]. The feature vectors from the AE-30 network are shown in Figure 5.7, each one being obtained by setting a single decoder input weight to one and all others to zero. Non-linear weighted combinations of these features are used to reproduce each of the handwriting samples to within a small error.

Figure 5.6: A sample of digits predicted incorrectly by the 784+500+500+2000+10 network. The true and predicted digit values are given above each image (true → predicted).

Figure 5.7: Feature vectors of the MNIST handwriting samples for the AE-30 network. The total error squared from these features is 4.64.

We are able to use the feature vector weights of the encoded writing samples for classification. All of the training data images were passed through the two encoders to obtain the sets of 30 or 50 feature vector weights. Networks with 30 or 50 inputs and the ten classification outputs were then trained on this compressed data set. Results of the NN training are given in Table 5.8. An autoencoder that calculated only two feature vectors using a deep network (1000+500+250+2 symmetric hidden layers) was also trained.
Although this network was significantly less able to reproduce the original digits, having an average total error squared of 31.0, we can plot the values of the two feature vector weights for each of the classes to illustrate the classification differences. We do so in Figure 5.8 for the 10,000 blind data points. There is some clear overlap between digits with similar shapes, but some digits do occupy distinct regions of the parameter space (particularly 1 in the top right, some 0s in the bottom right, and many 2s in the middle right).

Encoder   Hidden Layers   Error Rate (%)
AE-30     0               9.57
          10              6.39
          30              3.03
          100+50+10       2.55
AE-50     0               8.68
          10              6.61
          50              2.65
          100+50+10       2.71
Table 5.8: Error rates for different networks trained to predict the MNIST data set based on encoded feature vector weights.

Through this example, we have shown the NN training algorithm to be able to handle the training of both deep classification and autoencoder networks. Using pre-training and second-order gradient descent we trained several networks to classify images of handwritten digits to a high degree of accuracy. Although this may be an easy task for a human brain, it is quite difficult for a computer to learn. Additionally, we were able to train autoencoder networks with millions of weights in order to reduce the dimensionality of the information we were attempting to classify from 784 pixels to 30 or 50 non-linear feature vector weights. These reduced basis sets retained enough information about the original images to reproduce them to within small errors, thus enabling us to perform classification using these weights. This classification was nearly as accurate as our best-performing network trained on the full images.

5.5 Mapping Dark Matter Challenge

The Mapping Dark Matter Challenge was presented on www.kaggle.com as a simplified version of the GREAT10 Challenge [188]. In this problem, each data value
Figure 5.8: Scatterplot distribution of feature vector weights for blind data set digits as given from the encoding half of the 784+1000+500+250+2 symmetric autoencoder. This is comparable to Figure 3B in [182].

consists of an image of a galaxy and an image of a corresponding star. Each image is 48×48 pixels and greyscale. The galaxy has some ellipticity that has been obscured by a point-spread function and statistical noise. The star is a point source that has also been transformed by the same point-spread function and statistical noise. The trained network must take the galaxy and star images as inputs and predict the galaxy's ellipticity. The training data set contained 40,000 image pairs and a challenge (validation) data set contained 60,000 image pairs. During the challenge, the solutions for the validation data set were kept secret and participating teams submitted their predictions. Further details on the challenge and descriptions of the top results can be found in [189].

5.5.1 The Data and Model

Each galaxy image is an ellipse with a simple brightness profile. This is then convolved with a point-spread function, which blurs the ellipse in a similar way to telescope observations. There is also Poisson noise degrading the image. Accompanying each galaxy image is a star image, which is a point source convolved with the same point-spread function and similar noise. A sample galaxy and star image pair is shown in Figure 5.9.

The ellipticity of a galaxy, which is essentially an ellipse, is given by two parameters, e1 and e2. These measure the relative lengths of the major and minor axes and the angle of the ellipse, and may vary in [−1,1]. Equation (5.22) gives these definitions, with a, b, and θ shown in Figure 5.10.

e1 = [(a − b)/(a + b)] cos(2θ)   (5.22a)
e2 = [(a − b)/(a + b)] sin(2θ)   (5.22b)

Further details about the data set can be found at the challenge's webpage [190].
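Equation (5.22) is straightforward to evaluate; a minimal Python sketch (illustrative only, not part of the challenge code):

```python
import math

def ellipticity(a: float, b: float, theta: float):
    """Ellipticity components of Equation (5.22):
    e1 = (a-b)/(a+b) * cos(2*theta), e2 = (a-b)/(a+b) * sin(2*theta),
    with a, b the semi-major/minor axes and theta the position angle."""
    e = (a - b) / (a + b)
    return e * math.cos(2 * theta), e * math.sin(2 * theta)

# A circle (a == b) has zero ellipticity regardless of orientation.
print(ellipticity(1.0, 1.0, 0.7))  # (0.0, 0.0)
```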
The website also gives the unweighted quadrupole moments (UWQM) formula for calculating the ellipticity. However, as the competition organisers note, this formula will not provide very good measurements, as it does not account for the point-spread function. Our aim was to use NNs to provide a better way of measuring the ellipticity.

Figure 5.9: Example pair of (a) galaxy and (b) star images for the Mapping Dark Matter challenge. Each image is 48×48 greyscale pixels.

Figure 5.10: Definition of ellipse measurements for Equation (5.22). Image from [190].

Data Set       Hidden Layers   RMSE
Full Images    0               0.0224146
               2               0.0186944
               5               0.0184237
               10              0.0182661
Cropped Star   0               0.0175578
               2               0.0176236
               5               0.0175945
               10              0.0174997
               50              0.0172977
               50+10           0.0171719
Galaxy Only    0               0.0234740
               2               0.0234669
               5               0.0236508
               10              0.0226440
Table 5.9: Root mean square error rates on ellipticity predictions for different networks trained on the MDM data sets and evaluated on the 60,000 challenge image pairs.

5.5.2 Results of Training

The quality of a network's predictions is measured by the root mean squared error (RMSE) of its predictions of the ellipticities for the 60,000 challenge galaxies. Better predictions result in lower values of the RMSE. The magnitude of the problem and the size of the data set limited our ability to train large networks due to the immense computational cost. Therefore, for this demonstration we trained only relatively small networks, but used three different data sets: (A) the full galaxy and star images, (B) the full galaxy image and a centrally cropped star image, and (C) just the full galaxy image. 75% of the provided training data was used for the training subset and the remaining 25% was used for validation. We originally trained on just 25% of the training subset and without using pre-training. Networks were then fine-tuned using the entire training subset.
RMSE values for trained networks evaluated on the challenge set are given in Table 5.9. These scores show naive first results that already perform very well; the standard software package SourceExtractor scores 0.0862191 on this test data and UWQM scores 0.1855600. The results could potentially be further improved by fitting profiles to the images and using these fits for training, which would reduce the number of inputs by about two orders of magnitude. Additionally, an autoencoder could be trained to allow us to decompose the images into feature vectors and use the corresponding weights for training, thereby reducing the dimensionality and the impact of noise in the images.

By reducing the number of inputs without affecting the information content – in this problem by cropping the star images to the central 24×24 pixels – we were able to improve the precision of the predictions and lower our RMSE. Due to its better performance on the initial networks trained, the "Cropped Star" data set was also used to train two slightly larger networks with 50 and 50+10 hidden-layer nodes. These continued to show improving predictions, indicating that even more complex networks could further improve ellipticity measurements. The best results obtained, with an RMSE of 0.0171719, compare well with the competition winners, who had an RMSE of 0.0150907 [189, 190] and used a mixture of methods that included NNs. This score would have placed us in 32nd place out of the 66 teams that entered.

The "Galaxy Only" data set shows us that removing the star images does not allow the NN to account for the point-spread function, and therefore gives a significant increase in the RMSE, as might be expected. The RMSE on the challenge data was slightly increased relative to that obtained on the initial training data because the challenge data was generated with a non-zero mean ellipticity (actual mean of 0.01).
Since the network is only able to predict what it has been trained on, the statistical differences between the two data sets led to a slightly increased prediction error on average.

5.6 Discussion

In this chapter I have introduced the framework of artificial neural networks used for machine learning. By using an efficient and robust algorithm, we are able to train large and deep networks. We use second-order information to allow convergence in fewer steps, and prevent overfitting with prior information and checks on separate validation data. The algorithm is demonstrated on toy examples of regression, classification, and autoencoder networks. The framework was then applied to handwriting classification in determining digits from the MNIST database. Shallow and deep classification networks as well as autoencoders were used to accomplish this machine learning task. Lastly, NNs were applied to measuring the ellipticity of noisy and convolved galaxy images in the Mapping Dark Matter Challenge. NNs proved to perform well given the raw, untreated data.

The learning power of NNs combined with the training algorithm described is a powerful tool for use in further astrophysics projects, such as the generation of gravitational wave signals, as well as computational tasks of other types that involve repeated and complex functions. This includes evaluations of likelihoods for Bayesian inference, an application that is investigated in the next chapter.

Chapter 6

The BAMBI Algorithm

G-d does not care about our mathematical difficulties. He integrates empirically.
Albert Einstein

Bayesian methods of inference are widely used in astronomy and cosmology and are gaining popularity in other fields, such as particle physics. Some uses have already been described in Chapters 2 and 3.
At each point in parameter space, Bayesian methods require the evaluation of a likelihood function describing the probability of obtaining the data for a given set of model parameters. For some cosmological and particle physics problems, each such function evaluation takes up to tens of seconds. MCMC applications may require millions of these evaluations, making them prohibitively costly. MULTINEST is able to reduce the number of likelihood function calls by an order of magnitude or more, but further gains can be achieved if we are able to speed up the evaluation of the likelihood itself.

An artificial neural network is ideally suited for this task. A universal approximation theorem [176] assures us that we can accurately and precisely approximate the likelihood with a three-layer, feed-forward NN. The training of NNs is one of the most widely studied problems in machine learning, so techniques for learning the likelihood function are well established. In the Blind Accelerated Multimodal Bayesian Inference (BAMBI) algorithm [191], we combine NNs with MULTINEST. Samples from MULTINEST are used to train a NN or set of NNs on the likelihood function, which can subsequently be used to predict new likelihood values in a tiny fraction of the time originally required. We implement the algorithm described in Chapter 5 for our NN training.

In this chapter, Section 6.1 describes the structure of the new algorithm, BAMBI. Section 6.2 then demonstrates the algorithm's performance on a few toy examples. The algorithm is applied to the computationally costly problem of cosmological parameter estimation in Section 6.3. For this same problem, we then demonstrate rapid follow-up analyses in Section 6.4. Work in this chapter was performed in collaboration with Farhan Feroz and large portions are published in [191].
6.1 The Structure of BAMBI

The Blind Accelerated Multimodal Bayesian Inference (BAMBI) algorithm combines nested sampling and neural networks. After a specified number of new samples from MULTINEST have been obtained (set by the updInt run parameter), BAMBI uses these to train a regression network on the log-likelihood function. Approximately 80% of the samples are used for training and the remaining 20% for the validation set. These values differ slightly from those of the previous chapter to give more information for training on the likelihood, at the sacrifice of a smaller validation set. After convergence to the optimal NN weights, we test that the network is able to predict likelihood values to within a specified tolerance level. If not, sampling continues using the original log-likelihood until enough new samples have been made for training to be resumed. Once a network is trained that is sufficiently accurate, its predictions are used in place of the original log-likelihood function for future samples in MULTINEST. Consistency checks are made to ensure the NN is not making predictions outside the range of data on which it was trained. Using the network reduces the log-likelihood evaluation time from seconds to microseconds, allowing MULTINEST to complete its analysis much more rapidly. As a bonus, the user also obtains a network or set of networks that are trained to easily and quickly provide more log-likelihood evaluations near the peak if needed, or in subsequent analyses.

6.1.1 When to Use the Trained Network

The optimal network possible with a given set of training data may not be able to predict log-likelihood values accurately enough, so an additional criterion is placed on when to use the trained network. This requirement is that the root-mean-square error of log-likelihood evaluations by the network is less than a user-specified tolerance, tol.
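The acceptance test and the range check applied to an in-use network can be sketched as follows (a minimal Python illustration with hypothetical helper names; the real BAMBI implementation wraps MULTINEST and is not reproduced here):

```python
import math

def network_usable(predictions, truths, tol):
    """Accept the network only if its RMSE on held-out log-likelihood
    values is below the user-specified tolerance, tol."""
    rmse = math.sqrt(sum((p - t) ** 2 for p, t in zip(predictions, truths))
                     / len(truths))
    return rmse < tol

def prediction_valid(pred, train_min, train_max, tol):
    """Discard predictions outside [min(logL) - tol, max(logL) + tol] of the
    training data; such points fall back to the original likelihood."""
    return train_min - tol <= pred <= train_max + tol
```

The additional re-training trigger based on the central 95th percentile of the training log-likelihoods would sit alongside these checks.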
When the trained network does not pass this test, BAMBI continues using the original log-likelihood function to obtain updInt/2 new samples, generating a new training data set from the last updInt accepted samples. Network training then resumes, beginning with the weights found to be optimal for the previous data set. Since samples are generated from nested contours and each new data set contains half of the previous one, the saved network will already be able to produce reasonable predictions on this new data; resuming therefore saves time, as fewer steps will be required to reach the new optimal weights.

Once a NN is in use in place of the original log-likelihood function, its evaluations are taken to be the actual log-likelihoods, but checks are made to ensure that the network is maintaining its accuracy. If the network makes a prediction outside of [min_training(logL) − tol, max_training(logL) + tol], then that value is discarded and the original log-likelihood function is used for that point. Additionally, the central 95th percentile of the output log-likelihood values from the training data is calculated, and if the network is making > 95% of its predictions outside of this range then it will be re-trained. To re-train the network, BAMBI first substitutes the original log-likelihood function back in and gathers the required number of new samples from MULTINEST. Training then commences, resuming from the previously saved network. These criteria ensure that the network is not trusted too much when making predictions beyond the limits of the data it was trained on, as we cannot be sure that such predictions are accurate.

The flow of sampling and training within BAMBI is demonstrated by the flowchart given in Figure 6.1. Red boxes indicate sampling with the original likelihood function, and we want to minimise the time spent in these.
The brown box is time spent training the NN and so does not advance the Bayesian inference; time here should be minimised as well. Lastly, the green box is sampling done with a trained NN, and this is where we want to maximise usage, so that a larger percentage of the log-likelihood evaluations are done rapidly.

Figure 6.1: A flowchart depicting the transitions between sampling and NN training within BAMBI. N is given by updInt from MULTINEST.

6.2 BAMBI Toy Examples

In order to demonstrate the ability of BAMBI to learn and accurately explore multimodal and degenerate likelihood surfaces, we first tested the algorithm on a few toy examples. The eggbox likelihood has many separate peaks of equal likelihood, meaning that the network must be able to make predictions across many different areas of the prior. The Gaussian shells likelihood presents the problem of making predictions in a very narrow and curving region. Lastly, the Rosenbrock function gives a long, curving degeneracy that can be extended to higher dimensions. They all require high accuracy and precision in order to reproduce the posterior truthfully, and each presents unique challenges to the NN in learning the log-likelihood. It is important to note that running BAMBI on these problems required more time than the straightforward analysis; this was as expected, since the actual likelihood functions are simple analytic expressions that do not require much computational expense.

Figure 6.2: The eggbox log-likelihood surface, given by Equation (6.1).

6.2.1 Eggbox

This is a standard example of a very multimodal likelihood distribution in two dimensions. It has many peaks of equal value, so the network must be able to take samples from separated regions of the prior and make accurate predictions in all peaks.
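The eggbox log-likelihood of Equation (6.1) is a one-line function; a minimal Python sketch (illustrative only, not the thesis code):

```python
import math

def eggbox_loglike(x: float, y: float) -> float:
    """Eggbox log-likelihood of Equation (6.1):
    log L = (2 + cos(x/2)*cos(y/2))**5, with x, y ~ U[0, 10*pi].
    Peaks of equal height log L = 3**5 = 243 occur on a regular grid."""
    return (2.0 + math.cos(x / 2.0) * math.cos(y / 2.0)) ** 5
```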
The eggbox log-likelihood [102] is given by

log(L(x,y)) = [2 + cos(x/2) cos(y/2)]^5,   (6.1)

where we take a uniform prior U[0,10π] for both x and y. The structure of the surface can be seen in Figure 6.2.

We ran the eggbox example in both MULTINEST and BAMBI, each using 4000 live points. For BAMBI, we used 4000 samples for training a network with 50 hidden nodes. These values were chosen after some initial testing to give the NN sufficient complexity and data to learn the function. In Table 6.1 we report the evidences recovered by both methods as well as the true value obtained analytically from Equation (6.1). Both methods return evidences that agree with the analytically determined value to within the given error bounds. Figure 6.3 compares the posterior probability distributions returned by the two algorithms via the distribution of lowest-likelihood points removed at successive iterations by MULTINEST. We can see that they are identical distributions; therefore, we can say that the use of the NN did not reduce the quality of the results for either parameter estimation or model selection.

Method       log(Z)
Analytical   235.88
MULTINEST    235.859 ± 0.039
BAMBI        235.901 ± 0.039
Table 6.1: The log-evidence values of the eggbox likelihood as found analytically and with MULTINEST and BAMBI.

Figure 6.3: Points of lowest likelihood of the eggbox log-likelihood from successive iterations as given by (a) MULTINEST and (b) BAMBI.

During the BAMBI analysis, 51.3% of the log-likelihood function evaluations (∼70,000 total) were done using the NN; if this were a more computationally expensive function, significant speed gains would have been realised.

6.2.2 Gaussian Shells

The Gaussian shells likelihood function has low values over most of the prior, except for thin circular shells that have Gaussian cross-sections. We use two separate Gaussian shells of equal magnitude, so that this is also a multimodal inference problem.
Therefore, our Gaussian shells likelihood is

L(x) = circ(x; c1, r1, w1) + circ(x; c2, r2, w2),   (6.2)

where each shell is defined by

circ(x; c, r, w) = [1/√(2πw²)] exp[ −(|x − c| − r)² / (2w²) ].   (6.3)

We used values of r1 = r2 = 2, w1 = w2 = 0.1, c1 = (−3.5, 0), and c2 = (3.5, 0). This is shown in Figure 6.4.

Figure 6.4: The Gaussian shell likelihood surface, given by Equations (6.2) and (6.3).

As with the eggbox problem, we analysed the Gaussian shells likelihood with both MULTINEST and BAMBI, using uniform priors U[−6,6] on both dimensions of x. MULTINEST sampled with 1000 live points and BAMBI used 2000 samples for training a network with 100 hidden nodes (again, tests were performed to find suitable values for NN learning). In Table 6.2 we report the evidences recovered by both methods as well as the true value obtained analytically from Equations (6.2) and (6.3). The evidences are both consistent with the true value. Figure 6.5 compares the posterior probability distributions returned by the two algorithms (in the same manner as for the eggbox example). Again, we see that the distribution of returned values is nearly identical when using the NN, which BAMBI used for 18.2% of its log-likelihood function evaluations (∼10,000 total). This is a significant fraction, especially since they are all at the end of the analysis, when exploring the peaks of the distribution.

Method       log(Z)
Analytical   −1.75
MULTINEST    −1.768 ± 0.052
BAMBI        −1.757 ± 0.052
Table 6.2: The log-evidence values of the Gaussian shell likelihood as found analytically and with MULTINEST and BAMBI.

Figure 6.5: Points of lowest likelihood of the Gaussian shell likelihood from successive iterations as given by (a) MULTINEST and (b) BAMBI.

Figure 6.6: The Rosenbrock log-likelihood surface given by Equation (6.4) with N = 2.
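The two-shell likelihood of Equations (6.2) and (6.3), with the parameter values quoted above, can be sketched as follows (illustrative Python, not the thesis code):

```python
import math

def circ(x, c, r, w):
    """Single Gaussian shell, Equation (6.3): a ring of radius r and width w
    centred at c, with a Gaussian profile in the radial direction."""
    d = math.hypot(x[0] - c[0], x[1] - c[1])
    return math.exp(-((d - r) ** 2) / (2.0 * w ** 2)) / math.sqrt(2.0 * math.pi * w ** 2)

def shells_like(x, c1=(-3.5, 0.0), c2=(3.5, 0.0), r=2.0, w=0.1):
    """Two-shell likelihood of Equation (6.2) with the values used in the text."""
    return circ(x, c1, r, w) + circ(x, c2, r, w)
```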
6.2.3 Rosenbrock Function

The Rosenbrock function is a standard example used for testing optimisation algorithms, as it presents a long, curved degeneracy through all dimensions. For our NN training, it presents the difficulty of learning the likelihood function over a large, curving region of the prior. We use the Rosenbrock function to define the negative log-likelihood, so the log-likelihood function is given in N dimensions by

log(L(x)) = − Σ_{i=1}^{N−1} [ (1 − x_i)² + 100 (x_{i+1} − x_i²)² ].   (6.4)

Figure 6.6 shows how this appears for N = 2.

We set uniform priors of U[−5,5] in all dimensions and performed the analysis with both MULTINEST and BAMBI for N = 2 and N = 10. For N = 2, MULTINEST sampled with 2000 live points and BAMBI used 2000 samples for training a NN with 50 hidden-layer nodes. With N = 10, we used 2000 live points, 6000 samples for network training, and 50 hidden nodes. Tests ensured that these settings allowed good sampling of the prior and ample information and NN structure to allow training to a high enough accuracy. Table 6.3 gives the calculated evidences returned by both algorithms as well as the analytically calculated value from Equation (6.4) (there is no analytical solution for the 10D case, so this is not included). Figure 6.7 compares the posterior probability distributions returned by the two algorithms for N = 2. For N = 10, we show in Figure 6.8 comparisons of the marginalised two-dimensional posterior distributions for 12 variable pairs. We see that MULTINEST and BAMBI return nearly identical posterior distributions as well as consistent estimates of the evidence.

Method           log(Z)
Analytical 2D    −5.804
MULTINEST 2D     −5.799 ± 0.049
BAMBI 2D         −5.757 ± 0.049
MULTINEST 10D    −41.54 ± 0.13
BAMBI 10D        −41.53 ± 0.13
Table 6.3: The log-evidence values of the Rosenbrock likelihood as found analytically and with MULTINEST and BAMBI.
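The N-dimensional log-likelihood of Equation (6.4) can be sketched as follows (illustrative Python, not the thesis code):

```python
def rosenbrock_loglike(x):
    """Log-likelihood of Equation (6.4): the negative N-dimensional
    Rosenbrock function, with global peak log L = 0 at x = (1, ..., 1)."""
    return -sum((1.0 - x[i]) ** 2 + 100.0 * (x[i + 1] - x[i] ** 2) ** 2
                for i in range(len(x) - 1))
```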
For N = 2 and N = 10, BAMBI was able to use a NN for 64.7% and 30.5% of its log-likelihood evaluations respectively (∼ 40,000 total for N = 2 and ∼ 1.3 × 10^6 for N = 10). Even when factoring in the time required to train the NN, this would have resulted in large decreases in running time for a more computationally expensive likelihood function. In the N = 2 case the network was trained twice, with re-training required because the first training did not achieve sufficient accuracy. In the N = 10 case the network was trained 12 times: the first 10 trainings did not achieve enough accuracy, and the first network used needed re-training once the sampling moved further up in log-likelihood contours and the distribution of predictions no longer resembled the data on which it was trained.

6.3 Cosmological Parameter Estimation with BAMBI

While likelihood functions resembling our previous toy examples do exist in real physical models, we would also like to demonstrate the usefulness of BAMBI on simpler likelihood surfaces where the time of evaluation is the critical limiting factor. One such example in astrophysics is cosmological parameter estimation and model selection.

We implement BAMBI within the standard COSMOMC code [192], which by default uses MCMC sampling. This allows us to compare the performance of BAMBI with other methods, such as MULTINEST, COSMONET, INTERPMC, PICO, Particle Swarm Optimisation, and others [102, 193–198].

Figure 6.7: Points of lowest likelihood of the Rosenbrock likelihood for N = 2 from successive iterations as given by (a) MULTINEST and (b) BAMBI.

Figure 6.8: Marginalised 2D posteriors for the Rosenbrock function with N = 10. The 12 most correlated pairs are shown. MULTINEST is in solid black, BAMBI in dashed blue. Inner and outer contours represent 68% and 95% confidence levels, respectively.
In this work we report only the performances of BAMBI and MULTINEST, but these can be compared with the reported performance of the other methods.

Bayesian parameter estimation in cosmology requires evaluation of theoretical temperature and polarisation CMB power spectra (C_l values) using cosmological evolution codes such as CAMB [199]. These evaluations can take on the order of tens of seconds, depending on the cosmological model. The C_l spectra are then compared to WMAP and other observations to form the likelihood function. Considering that tens of thousands of these evaluations will be required, this is a computationally expensive step and a limiting factor in the speed of any Bayesian analysis. With BAMBI, we use samples to train a NN on the combined likelihood function, which allows us to avoid evaluating the full power spectra. This has the benefit of not requiring a pre-computed sample of points as in COSMONET and PICO, which is particularly important when including new parameters or new physics in the model. In these cases we will not know in advance where the peak of the likelihood will be, and it is around this location that the most samples are needed for accurate results.

The set of cosmological parameters that we use as variables and their prior ranges are given in Table 6.4; the parameters have their usual meanings in cosmology [200]. The prior probability distributions are uniform over the ranges given. A non-flat cosmological model incorporates all of these parameters, while we set Ω_K = 0 for a flat model. We use w = −1 in both cases. The flat model thus represents the standard ΛCDM cosmology. We use two different data sets for analysis: (1) CMB observations alone, and (2) CMB observations plus Hubble Space Telescope constraints on H_0, large-scale structure constraints from the luminous red galaxy subset of the SDSS and the 2dF survey, and Type Ia supernovae data.
The CMB data set consists of WMAP seven-year data [200] and higher-resolution observations from the ACBAR, CBI, and BOOMERanG experiments. The COSMOMC website [192] provides full references for the most recent sources of these data.

Analyses with MULTINEST and BAMBI were run on all four combinations of models and data sets. MULTINEST sampled with 1000 live points and an efficiency of 0.5, both on its own and within BAMBI; BAMBI used 2000 samples for training a NN on the likelihood, with 50 hidden-layer nodes for both the flat and non-flat models. These values were chosen based on the toy examples, and analysis runs confirmed their viability.

Table 6.4: The cosmological parameters and their minimum and maximum values. Uniform priors were used on all variables. Ω_K was set to 0 for the flat model.

    Parameter        Min     Max    Description
    Ω_b h^2          0.018   0.032  Physical baryon density
    Ω_DM h^2         0.04    0.16   Physical cold dark matter density
    θ                0.98    1.1    100 × ratio of the sound horizon to the angular diameter distance at last scattering
    τ                0.01    0.5    Reionisation optical depth
    Ω_K              −0.1    0.1    Spatial curvature
    n_s              0.8     1.2    Spectral index of density perturbations
    log[10^10 A_s]   2.7     4      Amplitude of the primordial spectrum of curvature perturbations
    A_SZ             0       2      Amplitude of the Sunyaev-Zel'dovich spectrum

Table 6.5 reports the recovered evidences from the two algorithms for both models and both data sets. It can be seen that the two algorithms report equivalent values to within statistical error for all four combinations. In Figures 6.9 and 6.10 we show the recovered one- and two-dimensional marginalised posterior probability distributions for the non-flat model using the CMB-only data set. Figures 6.11 and 6.12 show the same for the non-flat model using the complete data set. We see very close agreement between MULTINEST (in solid black) and BAMBI (in dashed blue) across all parameters.
The only exception is A_SZ, which is unconstrained by these models and data and is thus subject to large amounts of variation in sampling. The posterior probability distributions for the flat model with either data set are extremely similar to those of the non-flat model with Ω_K = 0, as expected, so we do not show them here.

Table 6.5: Evidences calculated by MULTINEST and BAMBI for the flat (ΛCDM) and non-flat (ΛCDM+Ω_K) models using the CMB-only and complete data sets. The two algorithms are in close agreement in all cases.

    Algorithm   Model      Data set  log(Z)
    MULTINEST   ΛCDM       CMB only  −3754.58 ± 0.12
    BAMBI       ΛCDM       CMB only  −3754.57 ± 0.12
    MULTINEST   ΛCDM       all       −4124.40 ± 0.12
    BAMBI       ΛCDM       all       −4124.11 ± 0.12
    MULTINEST   ΛCDM+Ω_K   CMB only  −3755.26 ± 0.12
    BAMBI       ΛCDM+Ω_K   CMB only  −3755.57 ± 0.12
    MULTINEST   ΛCDM+Ω_K   all       −4126.54 ± 0.13
    BAMBI       ΛCDM+Ω_K   all       −4126.35 ± 0.13

Figure 6.9: Marginalised 1D posteriors for the non-flat model (ΛCDM+Ω_K) using only CMB data. MULTINEST is in solid black, BAMBI in dashed blue.

Figure 6.10: Marginalised 2D posteriors for the non-flat model (ΛCDM+Ω_K) using only CMB data. The 12 most correlated pairs are shown. MULTINEST is in solid black, BAMBI in dashed blue. Inner and outer contours represent 68% and 95% confidence levels, respectively.

Figure 6.11: Marginalised 1D posteriors for the non-flat model (ΛCDM+Ω_K) using the complete data set. MULTINEST is in solid black, BAMBI in dashed blue.

Figure 6.12: Marginalised 2D posteriors for the non-flat model (ΛCDM+Ω_K) using the complete data set. The 12 most correlated pairs are shown. MULTINEST is in solid black, BAMBI in dashed blue. Inner and outer contours represent 68% and 95% confidence levels, respectively.

A by-product of running BAMBI is that we now have a network that is trained to predict likelihood values near the peak of the distribution. To see how accurate this network is, in Figure 6.13 we plot the error in the prediction (∆log(L) = log(L_predicted) − log(L_true)) against the true log-likelihood value for the different sets of training and validation points that were used. What we show are results for networks that were trained to sufficient accuracy to be used for making likelihood predictions; this results in two networks for each case shown. We can see that although the flat model used the same number of hidden-layer nodes, the simpler physical model allowed for smaller errors in the likelihood predictions. Both final networks (one for each model) are significantly more accurate than the specified tolerance of a maximum average error of 0.5. In fact, for the flat model, all but one of the 2000 points have an error of less than 0.06 log-units. The two networks trained in each case overlap in the range of log-likelihood values on which they trained. The first network, although trained to lower accuracy, is valid over a much larger range of log-likelihoods. The accuracy of each network increases with increasing true log-likelihood, and the second network, trained on higher log-likelihood values, is significantly more accurate than the first.

Figure 6.13: The error in the predicted likelihood (∆log(L) = log(L_predicted) − log(L_true)) for the BAMBI networks trained on the flat (top row) and non-flat (bottom row) models using the complete data set. The left column shows predictions from the first NN trained to sufficient accuracy; the right column shows results from the second, and final, NN trained in each case. The flat and non-flat models both used 50 hidden-layer nodes.

The final comparison, and perhaps the most important, is the running time required. The analyses were run using MPI parallelisation on 48 processors. We recorded the time required for the complete analysis, not including any data initialisation prior to initial sampling. We then divide this time by the number of likelihood evaluations performed to obtain an average time per likelihood, t_L = t_wall clock × N_CPUs / N_log(L) evals. Time required to train the NN is therefore still counted as a penalty factor: if a NN takes more time to train, this will hurt the average time, but obtaining a usable NN sooner and with fewer training calls will give a better time, since more likelihoods will be evaluated by the NN. The resulting average times per likelihood and speed increases are given in Table 6.6. Although the speed increases appear modest, one must remember that these include the time taken to train the NNs, during which no likelihoods were evaluated. This can be seen in that although 30–40% of likelihoods are evaluated with a NN, as reported in Table 6.7, we do not obtain the full equivalent speed increase. We are still able to obtain a significant decrease in running time while gaining the bonus of a NN trained on the likelihood function.

6.3.1 Updated Timing Comparisons

The NN training algorithm code was updated to improve memory efficiency and to use single-precision variables by default. Following these updates, we decided to re-run the analyses on the cosmological parameter estimation to confirm that results were

Model     Data set  MULTINEST t_L (s)  BAMBI t_L (s)  Speed factor
ΛCDM      CMB only  2.394              1.902          1.26
ΛCDM      all       3.323              2.472          1.34
ΛCDM+Ω_K  CMB only  12.744             9.006          1.42
ΛCDM+Ω_K  all       12.629             10.651         1.19

Table 6.6: Time per likelihood evaluation and factor of speed increase from MULTINEST to BAMBI (t_MN/t_BAMBI).

Model     Data set  %log(L) with NN  Equivalent speed factor  Actual speed factor
ΛCDM      CMB only  40.5             1.68                     1.26
ΛCDM      all       40.2             1.67                     1.34
ΛCDM+Ω_K  CMB only  34.2             1.52                     1.42
ΛCDM+Ω_K  all       30.0             1.43                     1.19

Table 6.7: Percentage of likelihood evaluations performed with a NN, equivalent speed factor, and actual factor of speed increase.
not adversely affected by the use of single precision. As before, a single hidden layer with 50 nodes was used for all four pairings of model and data, and a tolerance of 0.5 was set for accepting a network as sufficiently accurate. MULTINEST used 1000 live points and 2000 points were used for network training. After completion, all calculated evidences were confirmed to agree between MULTINEST, BAMBI, and the previous results. Table 6.8 is the analogue of Table 6.6 for the new runs. The reduction in time spent training a network is reflected in the speed factors, which have in all cases increased from the initial comparisons.

Table 6.8: Time per likelihood evaluation and factor of speed increase from MULTINEST to BAMBI (t_MN/t_BAMBI).

    Model      Data set  MULTINEST t_L (s)  BAMBI t_L (s)  Speed factor
    ΛCDM       CMB only  2.70               2.02           1.34
    ΛCDM       all       3.69               2.46           1.50
    ΛCDM+Ω_K   CMB only  13.56              9.33           1.45
    ΛCDM+Ω_K   all       13.85              10.36          1.34

6.4 Using Trained Networks for Follow-up in BAMBI

A major benefit of BAMBI is that following an initial run the user is provided with a trained NN, or set of NNs, that model the log-likelihood function. These can be used in a subsequent analysis with different priors to obtain much faster results. This analysis is comparable to that of COSMONET [193, 194], except that the NNs here are a product of an initial Bayesian analysis in which the peak of the distribution was not previously known. No prior knowledge of the structure of the likelihood surface was used to generate the networks that can now be re-used. In this section we examine two methods for calculating the error in a network's prediction and provide results for performing a follow-up analysis on the cosmological parameter estimation problems.

6.4.1 Exact Error Calculation

When multiple NNs are trained and used in the initial BAMBI analysis, we must determine which network's prediction to use in the follow-up.
The approximate uncertainty in a NN's prediction of the value y(x; a) (x denoting the input parameters, a the NN weights and biases as before) that models the log-likelihood function is given by [201] as

    \sigma^2 = \sigma_{pred}^2 + \sigma_\nu^2,   (6.5)

where

    \sigma_{pred}^2 = g^T B^{-1} g   (6.6)

and σ_ν^2 is the variance of the noise on the output from the network training (the network hyper-parameter σ^2 on the single output defined in Equation (5.4)). In Equation (6.6), B is the Hessian of the log-posterior as before, and g is the gradient of the NN's prediction with respect to the weights about their maximum posterior values,

    g = \left. \frac{\partial y(x; a)}{\partial a} \right|_{x, a_{MP}}.   (6.7)

This uncertainty of the prediction is calculated for each training and validation data point used in the initial training of each saved NN. The threshold for accepting a predicted log-likelihood from a network is then set to 1.2 times the maximum uncertainty value found for points in its training and validation data sets. We can therefore ignore the value of σ_ν^2 in Equation (6.5), as it is constant for all inputs and we only ever compare predictions of a network to previous predictions of that same network.

A log-likelihood is calculated by first making a prediction with the final NN to be trained and saved, and then calculating the error on this prediction. If the error, σ_pred, is greater than that NN's threshold, then we consider the previously trained NN. We again calculate the predicted log-likelihood value and error to compare with its threshold. This continues back to the first NN saved, until a NN makes a sufficiently confident prediction. If no NN can make a confident enough prediction, then we set log(L) = −∞. This is justified because the NNs are trained to predict values in the highest-likelihood regions of parameter space; if a set of parameters lies outside their collective region of validity, then it must not be within the region of highest likelihoods.
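The quantities in Equation (6.6) can be sketched for a tiny network. The following is an illustrative numpy sketch only, not the BAMBI implementation: the weights are random stand-ins, and the Hessian B is replaced by a Gauss-Newton approximation (prior term α I plus a data term built from the gradients g), with α and β hypothetical hyper-parameters.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical single-hidden-layer network y(x; a) = v.tanh(Wx + b) + c,
# with the weight vector a stacking (W, b, v, c); weights here are random.
n_in, n_hid = 2, 5
W = rng.normal(size=(n_hid, n_in))
b = rng.normal(size=n_hid)
v = rng.normal(size=n_hid)

def grad_weights(x):
    """g = dy(x; a)/da at the current weights (Equation (6.7)), analytically."""
    h = np.tanh(W @ x + b)
    dh = 1.0 - h**2                       # derivative of tanh
    return np.concatenate([
        np.outer(v * dh, x).ravel(),      # dy/dW
        v * dh,                           # dy/db
        h,                                # dy/dv
        [1.0],                            # dy/dc
    ])

# Gauss-Newton stand-in for the Hessian B of the log-posterior:
# B ~ alpha*I (weight prior) + beta * sum_m g(x_m) g(x_m)^T (data term).
alpha, beta = 1e-2, 1.0
train_x = rng.uniform(-1.0, 1.0, size=(50, n_in))
G = np.array([grad_weights(x) for x in train_x])
B = alpha * np.eye(G.shape[1]) + beta * (G.T @ G)

def sigma_pred(x):
    """Prediction uncertainty sigma_pred = sqrt(g^T B^{-1} g) (Equation (6.6))."""
    g = grad_weights(x)
    return float(np.sqrt(g @ np.linalg.solve(B, g)))

print(sigma_pred(np.zeros(n_in)))
```

Since B is symmetric positive definite, the quadratic form g^T B^{-1} g is always positive, so σ_pred is a well-defined error bar for every input.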
Setting log(L) = −∞ in this way effectively limits the prior, and so will bias evidences upwards, but it does not significantly affect parameter estimation.

To demonstrate the speed-up potential of using the NNs, we first ran an analysis of the cosmological parameter estimation using both models and both data sets. This time, however, we set the tolerance of the NNs to 1.0 instead of 0.5, so that they would be valid over a larger range of log-likelihoods and pass the accuracy criterion sooner. Each analysis produced two trained NNs. We then repeated each of the four analyses, but set the prior ranges to be uniform over the region defined by x_max(log(L)) ± 4σ, where σ is the vector of standard deviations of the marginalised one-dimensional posterior probabilities.

In Figure 6.14 we show predictions from the two trained NNs on the two sets of validation data points in the case of the non-flat model using the complete data set. In the left-hand column, we can see that the first NN trained is able to make reasonable predictions on its own validation data as well as on the second set of points, in red crosses, from the second NN's validation data. The final NN is able to make more precise predictions on its own data set than the initial NN, but is unable to make accurate predictions on the first NN's data points. The right-hand column shows the error bar sizes for each of the points shown. For both NNs, the errors decrease with increasing log-likelihood. The final NN has significantly lower uncertainty on predictions for its own validation data, which enables us to set the threshold for when we can trust its prediction. The cases for the other three combinations of cosmological model and data set are very similar to this one. This demonstrates the need to use the uncertainty measurement in determining which NN prediction to use, if any.
Always using the final NN would produce poor predictions away from the peak, while the initial NN does not have sufficient precision near the peak to properly measure the best-fit cosmological parameters. But by choosing which NN's prediction to accept, as we have shown, we can quickly and accurately reproduce the log-likelihood surface for sampling. Furthermore, if one were interested only in performing a re-analysis about the peak, then one could use just the final NN, thereby omitting the computational overhead associated with choosing the appropriate network.

For this same case, we plot the one- and two-dimensional marginalised posterior probabilities in Figures 6.15 and 6.16, respectively. Although the priors do not cover exactly the same ranges, we expect very similar posterior distributions, since the priors are sufficiently wide as to encompass nearly all of the posterior probability. We see this very close agreement in all cases.

Figure 6.14: Predictions with uncertainty error bars for NNs saved by BAMBI when analysing the non-flat model using the complete data set. The left-hand side shows predictions with errors for the two NNs on their own and the other's validation data sets. Each network's own points are in blue plusses, the other NN's points are in red crosses. As many error bars are too small to be seen, the right-hand side uses the same colour and label scheme to show the magnitudes of the error bars from each NN on the predictions.

Figure 6.15: Marginalised 1D posteriors for the non-flat model (ΛCDM+Ω_K) using the complete data set. BAMBI's initial run is in solid black, the follow-up analysis in dashed blue.

Calculating the uncertainty error requires approximate inverse Hessian-vector products, which slow down the process. We sacrifice a large factor of speed increase in order to maintain the robustness of our predictions.
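The network-selection cascade described in Section 6.4.1 can be sketched as follows. This is a hypothetical standalone sketch, not the BAMBI code: each network is represented as a (predict, sigma_pred, threshold) triple, with the thresholds set to 1.2 times the largest σ_pred seen on that network's own training and validation points.

```python
import math

def cascade_loglike(x, networks):
    """Return a trusted NN prediction of log(L) at x, or -inf if none qualifies.

    `networks` is a list of (predict, sigma_pred, threshold) triples, ordered
    from first-trained to last-trained.
    """
    # Try the most recently trained network first, then fall back to earlier ones.
    for predict, sigma_pred, threshold in reversed(networks):
        if sigma_pred(x) <= threshold:
            return predict(x)
    # No network is confident here: treat the point as outside the
    # high-likelihood region (this truncates the effective prior).
    return -math.inf

# Toy demonstration with stand-in "networks" valid over different ranges:
wide = (lambda x: -x * x, lambda x: 0.5, 0.6)           # low precision, wide validity
narrow = (lambda x: -x * x, lambda x: 0.1 + abs(x), 0.4)  # precise only near the peak
nets = [wide, narrow]
print(cascade_loglike(0.1, nets))   # narrow (final) net accepted near the peak
print(cascade_loglike(3.0, nets))   # falls back to the wide (initial) net
```

The toy networks make the trade-off discussed above concrete: the final network is trusted only near the peak, the initial network everywhere its looser threshold permits, and points outside both regions are assigned log(L) = −∞.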
Using the same method as before, we computed the time per likelihood calculation for the initial BAMBI run as well as the follow-up; these are compared in Table 6.9. We can see that in addition to the initial speed-up obtained with BAMBI, this follow-up analysis obtains an even larger speed-up in time per likelihood calculation. This speed-up is especially large for the non-flat model, where CAMB takes longer to compute the CMB spectra. The speed factor also increases when using the complete data set, as the original likelihood calculation takes longer than for the CMB-only data set; NN predictions take equal time regardless of the data set.

Figure 6.16: Marginalised 2D posteriors for the non-flat model (ΛCDM+Ω_K) using the complete data set. The 12 most correlated pairs are shown. BAMBI's initial run is in solid black, the follow-up analysis in dashed blue. Inner and outer contours represent 68% and 95% confidence levels, respectively.

Table 6.9: Time per likelihood evaluation and factor of speed increase from BAMBI's initial run to a follow-up analysis.

    Model      Data set  Initial t_L (s)  Follow-up t_L (s)  Speed factor
    ΛCDM       CMB only  1.635            0.393              4.16
    ΛCDM       all       2.356            0.449              5.25
    ΛCDM+Ω_K   CMB only  9.520            0.341              27.9
    ΛCDM+Ω_K   all       8.640            0.170              50.8

6.4.2 Fast Error Calculation

One possible way to avoid the computational cost of computing error bars on the predictions is that suggested by [201] (see Section 9). One can take the NN training data, add Gaussian noise, and train a new NN, using the old weights as a starting point. Performing many realisations of this will quickly provide multiple NNs whose average prediction will be a good fit to the original data and whose standard deviation about this mean will measure the error in the prediction. This will reduce the time needed to compute an error bar, since multiple NN predictions are faster than a single inverse Hessian-vector product.
The steps are given as:

1. Start with the converged network with weights w* trained on the true data set D* = {x^(m), t^(m)}. Estimate the Gaussian noise level of the residuals using \sigma^2 = \frac{1}{N} \sum_m \left( t^{(m)} - y(x^{(m)}, w^*) \right)^2.

2. Define a new data set D_1 by adding Gaussian noise of zero mean and variance σ^2 to the outputs in D*.

3. Train a NN on D_1 using w* as a starting point. Training should converge rapidly, as the new data set is only slightly different from the original. Call the new network w_1.

4. Repeat Steps 2 and 3 multiple times to find an ensemble of networks w_j.

5. Use each of the networks w_j to make a prediction for a given set of inputs. The error on the predicted value can be estimated as the standard deviation of this set of values.

In addition to these steps, we include the option for the user to add a random Gaussian offset to the saved weights read in before training is performed on the new data set (Step 3). This offset aids the training in moving the optimisation away from a potential local maximum in the posterior distribution of the network weights, but its size must be chosen for each problem. We are thus adding noise to both the training data and the saved network weights before training a new network whose posterior maximum will be near to, but not exactly the same as, the original network's.

Figure 6.17: Comparison of NN prediction error calculations for the 2D Gaussian shells likelihood from the exact, slow method (Equation (6.6)) and the fast method with different weight offsets applied. Fast method calculations employ the original network and networks from 29 realisations of Gaussian noise added to the training data outputs, for a total of 30 networks used in the error estimate.

The training of the additional networks is performed by a separate program, and the resulting error estimates can be compared with those from the exact, slow calculation (Equation (6.6)) described in Section 6.4.1.
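The five steps above can be sketched compactly. For brevity, this illustrative sketch replaces the full NN training with a least-squares fit on fixed tanh random features (so "retraining" is a cheap solve); the data, feature centres, and offset size are all hypothetical stand-ins.

```python
import numpy as np

rng = np.random.default_rng(1)

# Stand-in for a trained NN: linear regression on fixed tanh features.
X = rng.uniform(-2.0, 2.0, size=200)
t = np.sin(2.0 * X) + 0.05 * rng.normal(size=200)   # noisy training targets
centres = rng.uniform(-2.0, 2.0, size=10)

def features(x):
    return np.tanh(np.subtract.outer(x, centres))    # (n, 10) design matrix

def fit(targets, offset=0.0):
    # Step 3 stand-in: "retrain" on perturbed targets; the optional random
    # offset mimics perturbing the saved starting weights.
    w = np.linalg.lstsq(features(X), targets, rcond=None)[0]
    return w + offset * rng.normal(size=len(w))

# Step 1: converged weights w* and the residual noise level sigma^2.
w_star = fit(t)
sigma2 = np.mean((t - features(X) @ w_star) ** 2)

# Steps 2-4: an ensemble trained on noise-perturbed copies of the data.
ensemble = [w_star] + [
    fit(t + rng.normal(scale=np.sqrt(sigma2), size=len(t)), offset=0.01)
    for _ in range(29)
]

# Step 5: mean prediction and its error bar at a query point.
preds = np.array([features(np.array([0.5])) @ w for w in ensemble]).ravel()
print(f"prediction = {preds.mean():.3f} +/- {preds.std():.3f}")
```

As in the text, the 30 predictions are cheap to evaluate, so the spread of the ensemble provides an error bar at a far lower cost than an inverse Hessian-vector product.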
Time taken for this additional training of networks depends only on the size of the networks and the training data sets, not on the complexity of the likelihood function being modelled. The method for estimating the error was initially tested on two of the problems already analysed with BAMBI: the 2D Gaussian shells and the non-flat cosmology using the complete data set. Comparisons of the error estimates from different choices of the random offset applied to the saved weights, and from the exact, slow method, are shown in Figures 6.17 and 6.18. The estimates from the fast method use the original network plus 29 different realisations of Gaussian noise added to the outputs, giving a total of 30 networks; this results in a total training time of less than one hour when run on 24 parallel CPUs.

From Figures 6.17 and 6.18 we can see that the value of the offset applied has a significant effect on the size of the resulting error estimates. In general, larger offsets result in larger error estimates. However, the general trend of increasing error for lower log-likelihoods (further from the peak) is followed in all cases. We can determine from Figure 6.17 that an offset with standard deviation s ∼ 0.01 is best for the 2D Gaussian shells, and from Figure 6.18 that an offset with s ∼ 0.0 is best for the non-flat cosmology with the complete data set.

Figure 6.18: Comparison of NN prediction error calculations for the non-flat cosmology with complete data set likelihood from the exact, slow method (Equation (6.6)) and the fast method with different weight offsets applied (Gaussian random variable with mean 0 and standard deviation as noted). Fast method calculations employ the original network and networks from 29 realisations of Gaussian noise added to the training data outputs, for a total of 30 networks used in the error estimate.
Therefore, we find that the value of the best offset is roughly inversely proportional to the prior weight from the best-fit network w*, s ∝ 1/α. In the 2D Gaussian shells example α ≃ 5 and in the non-flat cosmology example α ≃ 54000, so the non-flat cosmology requires a significantly smaller offset to reproduce the prediction error values accurately. Since it is the relative magnitudes of the errors, rather than their absolute values, that determine the quality of a fit, the fast calculations do not need to be too precise. We suggest using an offset with a standard deviation of s ≲ 1/α, as this will yield usable error estimates.

An additional option has been included at this stage: instead of discarding points whose NN predictions fail the error check, the original likelihood function can be evaluated at these points. Discarding the points (assigning them log(L) = −∞) essentially removes them from the prior and will only affect points further from the peak, where the networks are not all well trained. As such, it will bias the evidence upwards but should not generally affect parameter estimation, as the discarded points will be of significantly lower likelihood and so lie in the far tails of the distribution. These points will also be heavily weighted towards early stages of the analysis, before the likelihood contours move well within the NNs' region of confident predictions.

To observe the potential speed increases with this method of error estimation, we re-analysed the cosmological models and data sets and then performed both the slow and fast methods of error calculation in separate follow-up analyses on restricted prior ranges. Our initial analyses were performed with the updated version of the NN training algorithm mentioned in Section 6.3.1. The follow-up analysis limited the prior to x_max(log(L)) ± 4σ, where σ is the vector of standard deviations of the marginalised one-dimensional posterior probabilities, as in the previous section.
When training networks for the fast error approximation, an offset with standard deviation s ≈ 1/α was used for an ensemble of 20 networks per saved network. Table 6.10 shows the average time taken per log-likelihood function evaluation when this follow-up was performed with MULTINEST and with BAMBI using both the slow and fast methods of error calculation. This timing includes any MULTINEST operations as well as loading the saved networks at the beginning and preparing to use them (it does not include COSMOMC data initialisation).

Table 6.10: Time per likelihood evaluation in a follow-up analysis and speed factors with respect to MULTINEST. Average times without discarding bad predictions are limited by the number of calls to the original likelihood function.

    Method      Model      Data  t_L (ms)  t_L (ms, discard)  Speed factor  Speed factor (discard)
    MULTINEST   ΛCDM       CMB   2775      —                  —             —
    MULTINEST   ΛCDM       all   3813      —                  —             —
    MULTINEST   ΛCDM+Ω_K   CMB   12830     —                  —             —
    MULTINEST   ΛCDM+Ω_K   all   10980     —                  —             —
    BAMBI slow  ΛCDM       CMB   558.7     378.1              4.96          7.33
    BAMBI slow  ΛCDM       all   639.5     393.4              5.96          9.68
    BAMBI slow  ΛCDM+Ω_K   CMB   1823.8    97.67              7.03          131.4
    BAMBI slow  ΛCDM+Ω_K   all   1181.5    219.8              10.86         49.95
    BAMBI fast  ΛCDM       CMB   3.883     0.2077             714.7         13360
    BAMBI fast  ΛCDM       all   28.01     0.2146             136.1         17770
    BAMBI fast  ΛCDM+Ω_K   CMB   1105      0.08449            11.61         151900
    BAMBI fast  ΛCDM+Ω_K   all   402.9     0.2032             27.25         54040

In all cases where points were not discarded, the BAMBI evidences matched the MULTINEST values to within statistical error. When discarding points, only the non-flat model with CMB-only data calculated an evidence that did not agree; this was because it was the only case with a single trained NN. In the other three cases, two trained networks were available. These three cases therefore required fewer points to be calculated with the original likelihood or discarded (< 2% for the non-flat model and < 0.2% for the flat model), but took additional time to load and find the best network prediction.
The analysis using the non-flat model and CMB-only data, which had only a single network, runs more rapidly when points are discarded, but requires more likelihood samples and picks up a small error in the evidence calculation of approximately one log-unit (compared to an evidence uncertainty of approximately a tenth of a log-unit). When not discarding, this model and data combination required ∼ 10% of points to have their log-likelihood computed with the original function.

Overall, the massive gains in speed, of O(10^4) to O(10^5), from using the fast method along with discarding points that the networks cannot predict precisely enough make it worthwhile to use this as an initial follow-up procedure. Should there be too many discarded points, the analysis may be performed again without discarding, computing a more reliable evidence value while still obtaining a speed increase of O(10) to O(100). We can thus obtain accurate posterior distributions and evidence calculations orders of magnitude faster than originally possible. Follow-up without discarding is limited by the number of calls to the original likelihood function, while follow-up with discarding is limited by the number of stored networks being used.

6.5 Summary

We have introduced and demonstrated a new algorithm for rapid Bayesian data analysis. The Blind Accelerated Multimodal Bayesian Inference algorithm combines the sampling efficiency of MULTINEST with the predictive power of artificial neural networks to significantly reduce the running time for computationally expensive problems.

The first applications we demonstrated are toy examples that show the ability of the NN to learn complicated likelihood surfaces and produce accurate evidences and posterior probability distributions. The eggbox, Gaussian shells, and Rosenbrock functions each present difficulties for Monte Carlo sampling as well as for the training of a NN.
With the use of enough hidden-layer nodes and training points, we have demonstrated that a NN can learn to predict log-likelihood function values accurately.

We then applied BAMBI to the problem of cosmological parameter estimation and model selection. We performed this using flat and non-flat cosmological models, incorporating either CMB data alone or a more extensive data set. In all cases, the NN is able to learn the likelihood function to sufficient accuracy after training on early nested samples and then predict values thereafter. By calculating a significant fraction of the likelihood values with the NN instead of the full function, we are able to reduce the running time by a factor of up to 1.50, and potentially more. This is in comparison to use of MULTINEST alone, which already provides significant speed-ups in comparison to traditional MCMC methods [102].

Through all of these examples we have shown the capability of BAMBI to increase the speed at which Bayesian inference can be performed. This is a fully general method; one need only change the settings for MULTINEST and the network training in order to apply it to different likelihood functions. For computationally expensive likelihood functions, the network training takes less time than is required to sample enough training points, and sampling a point using the network is extremely rapid, as it is a simple analytic function. Therefore, the main computational expense of BAMBI is calculating training points as the sampling evolves, until the network is able to reproduce the likelihood accurately enough.

With the trained NNs, we can perform additional analyses using the same likelihood function but different priors, saving large amounts of time in sampling points with the original likelihood and in training a NN.
Follow-up analyses using already trained NNs provide much larger speed increases, with factors of 4 to 11 obtained for cosmological parameter estimation relative to BAMBI speeds when not discarding poorly estimated points, and 7 to 130 when discarding using the slow exact method. Having trained an ensemble of networks to provide error estimates, the possible speed-ups increase drastically, to potentially O(10^2) times faster when not discarding and O(10^4) to O(10^5) times faster when discarding. The limiting factor in these runs is the fraction of points early in the analysis for which the NNs are unable to make sufficiently accurate predictions, and which thus require the use of the original likelihood function or re-sampling. The calculation of the error of predictions is a flat cost based on the size of the NN and the data set, regardless of the original likelihood function, so the more computationally expensive the original likelihood function is, the more benefit is gained from using NNs.

The NNs trained by BAMBI for cosmology cover a larger range of log-likelihoods than the one trained for COSMONET. This allows us to use a wider range of priors for subsequent analysis and not be limited to the four-sigma region around the maximum-likelihood point. By setting the tolerance for BAMBI's NNs to a larger value, fewer NNs with larger likelihood ranges can be trained, albeit with larger errors on the predictions. Allowing for larger priors requires us to test the validity of our NNs' approximations, which ends up slowing the overall analysis when no network can provide a precise enough prediction.

Since BAMBI uses a NN to calculate the likelihood at later times in the analysis, where we typically also suffer from lower sampling efficiency (it is harder to find a new point with higher likelihood than the most recently removed point), we are more easily able to implement Hamiltonian Monte Carlo [202, 203] for finding a proposed sample.
This method uses gradient information to make better proposals for the next point to be sampled. Calculating the gradient is usually a difficult task, but with the NN approximation it is very fast and simple. This improvement may be investigated in future work.

As larger data sets and more complicated models are used in cosmology, particle physics, and other fields, the computational cost of Bayesian inference will increase. The BAMBI algorithm can, without any pre-processing, significantly reduce the required running time for these inference problems. In addition to providing accurate posterior probability distributions and evidence calculations, the user also obtains NNs trained to produce likelihood values near the peak(s) of the distribution, which can be used in extremely rapid follow-up analysis.

Chapter 7

Conclusions and Future Work

Somewhere, something incredible is waiting to be known.
Carl Sagan

7.1 Summary of Results

Einstein's general theory of relativity predicts the existence of gravitational waves: perturbations to the curvature of spacetime, caused by the acceleration of matter, that propagate at the speed of light. The periodic strain signal they carry affects measured distances between points in space, enabling their passing to be detected by laser interferometers. These signals originate in the regimes of strongest gravity, allowing for one of the first tests of GR where the full non-linear equations must be used. Binary star systems emit gravitational radiation and thus lose energy and fall in toward one another in a characteristic inspiral, generating a “chirp” signal observable by ground-based detectors such as LIGO and Virgo. By approximating the evolution of these systems we can obtain waveform functions that model the signal we expect to observe. Using the tools of Bayesian inference, we are then able to measure probability distributions for source parameters given these waveform models and detector data.
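The Bayesian machinery referred to here rests on Bayes' theorem. In standard notation (the symbols may differ slightly from those used earlier in the thesis), the posterior for parameters Θ under a model H given data D is

```latex
\Pr(\Theta \mid D, H)
  \;=\; \frac{\Pr(D \mid \Theta, H)\,\Pr(\Theta \mid H)}{\Pr(D \mid H)}
  \;\equiv\; \frac{\mathcal{L}(\Theta)\,\pi(\Theta)}{\mathcal{Z}},
\qquad
\mathcal{Z} \;=\; \int \mathcal{L}(\Theta)\,\pi(\Theta)\,\mathrm{d}\Theta .
```

The normalising constant Z, the Bayesian evidence, is what nested sampling computes; ratios of evidences between competing models provide the Bayes factors used for model selection.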
In Chapter 2 we consider these binary systems and test the posterior probability distributions given by the MULTINEST algorithm. We then use these posteriors to quantitatively analyse the ability of a network of ground-based detectors to localise the position of a source on the sky. We do this for a sample of waveforms injected into simulated noise, from which we can infer the area of sky that will need to be observed in electromagnetic follow-ups in order to find a counterpart signal. We then use Bayesian model selection criteria for ruling out incoherent signals in LIGO data. This is performed in the context of the “big dog” blind hardware injection, where the coherent signal model was shown to be favoured over the incoherent one; the same statistic ruled out manufactured incoherent signals, demonstrating its ability to differentiate between real gravitational waves and detector glitches.

Further advances in gravitational wave detection will be realised by the launch of a space-based laser interferometer (formerly LISA and now NGO). This will allow for the detection of GWs at much lower frequencies and with much higher SNRs. Simulated data was generated in the MLDCs in order to encourage the development of data analysis codes for LISA and demonstrate the community's ability to extract scientific information from the kind of data that was expected. In Chapter 3 I analyse some of this data, initially looking for burst signals in long stretches of instrument noise. We are able to use Bayesian criteria to detect and characterise signals and demonstrate the ability to perform model selection to determine the correct injected signal waveform. A similar analysis is then performed on LISA data containing a large number of continuous sources.
Despite the confusion, we are able to accurately find many signals at once, with near-constant purity across a large range of frequencies and high completeness at higher frequencies where there are fewer signals present. Lastly, we combine these two in order to attempt to detect burst signals in the presence of many continuous sources. Again, Bayesian criteria prove to be powerful in separating true from false detections.

As more detailed signal models and larger data sets are analysed, performing Bayesian inference will become a more cumbersome task. To assist in simplifying our data analysis procedures, I investigated the application of the MOPED (Multiple Optimised Parameter Estimation and Data compression) algorithm. I found, as detailed in Chapter 4, that in some cases the algorithm will work, but as the analysis involves more parameters, there is the potential for additional modes in the likelihood to be falsely created, each with likelihood equal to that of the peak. A solution is tested, but its total overhead is found not to warrant further implementation.

An increasingly popular tool for handling large computations is the artificial neural network. These mimic the structure of a brain by passing information through a set of interconnected nodes, “learning” a function by adjusting connection weights. In Chapter 5 I consider the application of feed-forward networks, where nodes are arranged in ordered layers. These can be trained on a pre-calculated set of data points from the function to be learned, which can involve either regression or classification. We implement a Bayesian optimisation algorithm that uses second-order information on the log-posterior function in order to find the optimal set of weights in as few steps as possible. We also prevent over-fitting to the particular training data set provided by using regularisation and a validation data set.
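The role of the validation set can be illustrated with a deliberately simple feed-forward regressor. Plain gradient descent stands in here for the thesis's second-order Bayesian optimiser, and all sizes, rates, and names are arbitrary illustrative choices:

```python
import numpy as np

# A single-hidden-layer feed-forward regressor (tanh hidden units,
# linear output) trained by full-batch gradient descent, with a
# held-out validation set used to decide when to stop training.

def train_ffnn(X, y, Xval, yval, hidden=8, lr=0.05, max_epochs=2000, patience=50):
    rng = np.random.default_rng(1)
    W1 = rng.normal(0.0, 0.5, (X.shape[1], hidden)); b1 = np.zeros(hidden)
    W2 = rng.normal(0.0, 0.5, hidden); b2 = 0.0
    best = (np.inf, None)   # (best validation loss, weights at that point)
    bad = 0                 # epochs since the validation loss last improved
    for _ in range(max_epochs):
        # forward pass
        H = np.tanh(X @ W1 + b1)
        err = H @ W2 + b2 - y
        # backward pass (gradients of the mean squared error)
        gW2 = H.T @ err / len(y); gb2 = err.mean()
        dH = np.outer(err, W2) * (1.0 - H ** 2)
        gW1 = X.T @ dH / len(y); gb1 = dH.mean(axis=0)
        W1 -= lr * gW1; b1 -= lr * gb1; W2 -= lr * gW2; b2 -= lr * gb2
        # early stopping on the held-out validation error
        Hv = np.tanh(Xval @ W1 + b1)
        vloss = np.mean((Hv @ W2 + b2 - yval) ** 2)
        if vloss < best[0]:
            best = (vloss, (W1.copy(), b1.copy(), W2.copy(), b2))
            bad = 0
        else:
            bad += 1
            if bad > patience:
                break
    return best

def predict(params, X):
    W1, b1, W2, b2 = params
    return np.tanh(X @ W1 + b1) @ W2 + b2
```

Keeping the weights that minimise the validation loss, rather than the training loss, is what guards against over-fitting; the thesis's algorithm additionally applies regularisation to the weights themselves.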
The capabilities of the network are first demonstrated on a few toy problems. The network training algorithm is then applied to two larger and more complex examples. In the first, we identify handwritten digits from the MNIST database, which involves finding patterns in images to make ideal classifications. In the second, we measure the ellipticities of sheared and convolved galaxies, a difficult task addressed by the Mapping Dark Matter challenge. As its name suggests, if this procedure can be done accurately and rapidly it may be applied to large surveys of galaxies and can help in mapping dark matter distributions throughout the universe and testing cosmological models. Our neural network training algorithm is able to learn both of these data sets very well; prediction on a further sample of handwritten digits produces an error rate of less than 2%, and the error in galaxy ellipticity measurements is reduced by a factor of ∼5 compared to previous standard methods.

In Chapter 6 I apply neural networks directly to Bayesian inference. The Blind Accelerated Multimodal Bayesian Inference (BAMBI) algorithm is described, wherein samples of the log-likelihood function from MULTINEST are used to train a neural network. This network can then be used in place of the original log-likelihood once it is able to make sufficiently accurate predictions. I first test BAMBI on a few toy likelihoods to ensure that the neural network is able to learn the function from a reasonable number of samples before MULTINEST's sampling converges. We then address the task of performing cosmological parameter estimation. This is a problem with a well-defined likelihood function and common data sets, available through the CosmoMC software package. Likelihood function calls normally take on the order of seconds each, so an analysis may take many CPU days.
With BAMBI, however, we are able to demonstrate speed increases of 50%, indicating that ∼33% of likelihood samples were computed using a trained neural network. Even more impressive gains in speed come from follow-up analyses with different priors. We can re-use the trained NNs, thereby avoiding the original likelihood function entirely. Doing so with a rapid method of calculating the error in NN predictions yields posterior inferences O(10^4) to O(10^5) times faster than using only MULTINEST. BAMBI's ability to decrease the computational expense of initial Bayesian inference and to perform incredibly rapid follow-up in a fully general way means it can be applied to all types of parameter estimation and model selection problems.

7.2 Avenues for Future Work

Analysis of data from ground-based GW detectors in the advanced era will need to utilise more advanced waveform families. Inspirals that include the full effects of spin in the phase and amplitude corrections will provide the best measurements of the compact binary signals we hope to detect. Additionally, the inclusion of merger and ringdown stages to create a hybrid waveform should provide the most information about a GW source. Creating these waveform models and analysing their different parameter estimation biases is important before and during the advanced detector era, when detections are expected. Model selection will prove to be important when determining which waveform type best fits an observed signal and for filtering out detector glitches.

As the LISA mission has been reconfigured to NGO, we must reconsider the scientific potential of the new detector. Data analysis pipelines will need to work even harder, as there will be lower SNRs for NGO-observed signals than for LISA. Neural networks and the BAMBI algorithm can play a large role in these projects. NNs can be applied to computing very difficult gravitational waveforms in order to model systems more rapidly.
BAMBI will greatly increase the speed at which Bayesian analysis may be performed on GW signals, providing sky location and other parameter estimates with significantly less lag time from the initial trigger. We are always looking to improve our NN training and BAMBI, and will continue to develop these two algorithms to add more functionality and improve the quality and speed of results.

There is an optimistic future for the search for gravitational waves and the scientific potential they hold. With the right tools we can efficiently and rapidly analyse detector data to find and characterise any signals present. These tools are not limited to just gravitational wave astronomy, but may be applied to solve computational or data analysis problems in a wide number of fields.

References

[1] A. Einstein, Relativity: The special and general theory. New York, NY, USA: Henry Holt and Co., 1920. 1

[2] J. M. Weisberg, D. J. Nice, and J. H. Taylor, “Timing Measurements of the Relativistic Binary Pulsar PSR B1913+16,” The Astrophysical Journal, vol. 722, pp. 1030–1034, Oct. 2010, arXiv:1011.0718. 1, 22

[3] É. É. Flanagan and S. A. Hughes, “The basics of gravitational wave theory,” New Journal of Physics, vol. 7, p. 204, Sept. 2005, arXiv:gr-qc/0501041. 2, 3

[4] C. W. Misner, K. S. Thorne, and J. A. Wheeler, Gravitation. San Francisco, CA, USA: W. H. Freeman and Company, 1973. 4

[5] M. P. Hobson, G. P. Efstathiou, and A. N. Lasenby, General Relativity: An Introduction for Physicists. Cambridge, UK: Cambridge University Press, 2006. 4

[6] L. Blanchet, “Post-Newtonian theory and the two-body problem,” ArXiv e-prints, July 2009, arXiv:0907.3596. 6, 21

[7] M. Sasaki and H. Tagoshi, “Analytic black hole perturbation approach to gravitational radiation,” Living Reviews in Relativity, vol. 6, no. 6, 2003.

[8] T. Futamase and Y.
Itoh, “The post-Newtonian approximation for relativistic compact binaries,” Living Reviews in Relativity, vol. 10, no. 2, 2007. 21

[9] L. Blanchet, G. Faye, B. R. Iyer, and B. Joguet, “Gravitational-wave inspiral of compact binary systems to 7/2 post-Newtonian order,” Physical Review D, vol. 65, p. 061501, Mar. 2002, arXiv:gr-qc/0105099.

[10] A. Buonanno, B. R. Iyer, E. Ochsner, Y. Pan, and B. S. Sathyaprakash, “Comparison of post-Newtonian templates for compact binary inspiral signals in gravitational-wave detectors,” Phys. Rev. D, vol. 80, p. 084043, Oct 2009. 29, 30

[11] L. E. Kidder, “Coalescing binary systems of compact objects to (post)5/2-Newtonian order. V. Spin effects,” Physical Review D, vol. 52, pp. 821–847, July 1995, arXiv:gr-qc/9506022.

[12] C. M. Will and A. G. Wiseman, “Gravitational radiation from compact binary systems: Gravitational waveforms and energy loss to second post-Newtonian order,” Phys. Rev. D, vol. 54, pp. 4813–4848, Oct 1996. 6

[13] C. Van Den Broeck and A. S. Sengupta, “Phenomenology of amplitude-corrected post-Newtonian gravitational waveforms for compact binary inspiral: I. Signal-to-noise ratios,” Classical and Quantum Gravity, vol. 24, pp. 155–176, Jan. 2007, arXiv:gr-qc/0607092. 7

[14] K. G. Arun, A. Buonanno, G. Faye, and E. Ochsner, “Higher-order spin effects in the amplitude and phase of gravitational waveforms emitted by inspiraling compact binaries: Ready-to-use gravitational waveforms,” Physical Review D, vol. 79, p. 104023, May 2009, arXiv:0810.5336.

[15] P. Ajith, “Addressing the spin question in gravitational-wave searches: Waveform templates for inspiralling compact binaries with nonprecessing spins,” Physical Review D, vol. 84, p. 084037, Oct. 2011, arXiv:1107.1267.

[16] D. A. Brown, A. Lundgren, and R. O’Shaughnessy, “Nonspinning searches for spinning binaries in ground-based detector data: Amplitude and mismatch predictions in the constant precession cone approximation,” ArXiv e-prints, Mar.
2012, arXiv:1203.6060.

[17] G. Faye, L. Blanchet, and A. Buonanno, “Higher-order spin effects in the dynamics of compact binaries. I. Equations of motion,” Physical Review D, vol. 74, p. 104033, Nov. 2006, arXiv:gr-qc/0605139.

[18] L. Blanchet, A. Buonanno, and G. Faye, “Higher-order spin effects in the dynamics of compact binaries. II. Radiation field,” Physical Review D, vol. 74, p. 104034, Nov. 2006, arXiv:gr-qc/0605140. 7

[19] Y. Pan, A. Buonanno, M. Boyle, L. T. Buchman, L. E. Kidder, H. P. Pfeiffer, and M. A. Scheel, “Inspiral-merger-ringdown multipolar waveforms of nonspinning black-hole binaries using the effective-one-body formalism,” Phys. Rev. D, vol. 84, p. 124052, Dec 2011. 7, 21

[20] A. Buonanno and T. Damour, “Transition from inspiral to plunge in binary black hole coalescences,” Phys. Rev. D, vol. 62, p. 064015, Aug 2000.

[21] A. Buonanno and T. Damour, “Effective one-body approach to general relativistic two-body dynamics,” Phys. Rev. D, vol. 59, p. 084006, Mar 1999.

[22] T. Damour, “Coalescence of two spinning black holes: An effective one-body approach,” Phys. Rev. D, vol. 64, p. 124013, Nov 2001. 21

[23] Y. Pan, A. Buonanno, L. T. Buchman, T. Chu, L. E. Kidder, H. P. Pfeiffer, and M. A. Scheel, “Effective-one-body waveforms calibrated to numerical relativity simulations: Coalescence of nonprecessing, spinning, equal-mass black holes,” Physical Review D, vol. 81, p. 084041, Apr. 2010, arXiv:0912.3466. 7

[24] F. Pretorius, “Evolution of binary black-hole spacetimes,” Phys. Rev. Lett., vol. 95, p. 121101, Sep 2005. 7, 21

[25] J. G. Baker, J. Centrella, D.-I. Choi, M. Koppitz, and J. van Meter, “Gravitational-wave extraction from an inspiraling configuration of merging black holes,” Phys. Rev. Lett., vol. 96, p. 111102, Mar 2006.

[26] M. Campanelli, C. O. Lousto, P. Marronetti, and Y. Zlochower, “Accurate evolutions of orbiting black-hole binaries without excision,” Phys. Rev. Lett., vol. 96, p. 111101, Mar 2006.
[27] S. T. McWilliams, “The status of black-hole binary merger simulations with numerical relativity,” Classical and Quantum Gravity, vol. 28, p. 134001, July 2011, arXiv:1012.2872. 21

[28] J. Centrella, J. G. Baker, B. J. Kelly, and J. R. van Meter, “The Final Merger of Black-Hole Binaries,” Annual Review of Nuclear and Particle Science, vol. 60, pp. 75–100, Nov. 2010, arXiv:1010.2165.

[29] J. Centrella, J. G. Baker, B. J. Kelly, and J. R. van Meter, “Black-hole binaries, gravitational waves, and numerical relativity,” Reviews of Modern Physics, vol. 82, pp. 3069–3119, Oct. 2010, arXiv:1010.5260. 7

[30] F. Ohme, “Analytical meets numerical relativity - status of complete gravitational waveform models for binary black holes,” ArXiv e-prints, Nov. 2011, arXiv:1111.3737. 7

[31] A. Taracchini, Y. Pan, A. Buonanno, E. Barausse, M. Boyle, T. Chu, G. Lovelace, H. P. Pfeiffer, and M. A. Scheel, “A prototype effective-one-body model for non-precessing spinning inspiral-merger-ringdown waveforms,” ArXiv e-prints, Feb. 2012, arXiv:1202.0790.

[32] P. Ajith, S. Babak, Y. Chen, M. Hewitson, B. Krishnan, J. T. Whelan, B. Brügmann, P. Diener, J. Gonzalez, M. Hannam, S. Husa, M. Koppitz, D. Pollney, L. Rezzolla, L. Santamaría, A. M. Sintes, U. Sperhake, and J. Thornburg, “A phenomenological template family for black-hole coalescence waveforms,” Classical and Quantum Gravity, vol. 24, p. 689, Oct. 2007, arXiv:0704.3764.

[33] P. Ajith, M. Hannam, S. Husa, Y. Chen, B. Brügmann, N. Dorband, D. Müller, F. Ohme, D. Pollney, C. Reisswig, L. Santamaría, and J. Seiler, “Inspiral-Merger-Ringdown Waveforms for Black-Hole Binaries with Nonprecessing Spins,” Physical Review Letters, vol. 106, p. 241101, June 2011, arXiv:0909.2867.

[34] L. Santamaría, F. Ohme, P. Ajith, B. Brügmann, N. Dorband, M. Hannam, S. Husa, P. Mösta, D. Pollney, C. Reisswig, E. L. Robinson, J. Seiler, and B.
Krishnan, “Matching post-Newtonian and numerical relativity waveforms: Systematic errors and a new phenomenological model for nonprecessing black hole binaries,” Physical Review D, vol. 82, p. 064016, Sept. 2010, arXiv:1005.3306.

[35] I. MacDonald, S. Nissanke, and H. P. Pfeiffer, “Suitability of post-Newtonian/numerical-relativity hybrid waveforms for gravitational wave detectors,” Classical and Quantum Gravity, vol. 28, p. 134002, July 2011, arXiv:1102.5128.

[36] R. Sturani, S. Fischetti, L. Cadonati, G. M. Guidi, J. Healy, D. Shoemaker, and A. Vicere’, “Phenomenological gravitational waveforms from spinning coalescing binaries,” ArXiv e-prints, Dec. 2010, arXiv:1012.5172. 7

[37] P. C. Peters and J. Mathews, “Gravitational radiation from point masses in a Keplerian orbit,” Physical Review, vol. 131, no. 1, pp. 435–440, 1963. 8

[38] B. F. Schutz, “Gravitational waves on the back of an envelope,” American Journal of Physics, vol. 52, pp. 412–419, May 1984. 8

[39] S. A. Hughes, “Gravitational Waves from Merging Compact Binaries,” Annual Review of Astronomy & Astrophysics, vol. 47, pp. 107–157, Sept. 2009, arXiv:0903.4877. 8

[40] L. Blanchet, “Gravitational radiation from post-Newtonian sources and inspiralling compact binaries,” Living Reviews in Relativity, vol. 9, no. 4, 2006. 8, 21

[41] K. N. Yakunin, P. Marronetti, A. Mezzacappa, S. W. Bruenn, C.-T. Lee, M. A. Chertkow, W. R. Hix, J. M. Blondin, E. J. Lentz, O. E. Bronson Messer, and S. Yoshida, “Gravitational waves from core collapse supernovae,” Classical and Quantum Gravity, vol. 27, p. 194005, Oct. 2010, arXiv:1005.0779. 8

[42] P. Jaranowski, A. Królak, and B. F. Schutz, “Data analysis of gravitational-wave signals from spinning neutron stars: The signal and its detection,” Phys. Rev. D, vol. 58, p. 063001, Aug 1998. 8

[43] E. Goetz and K.
Riles, “An all-sky search algorithm for continuous gravitational waves from spinning neutron stars in binary systems,” Classical and Quantum Gravity, vol. 28, p. 215006, Nov. 2011, arXiv:1103.1301.

[44] M. Pitkin, “Prospects of observing continuous gravitational waves from known pulsars,” Monthly Notices of the Royal Astronomical Society, vol. 415, pp. 1849–1863, Aug. 2011, arXiv:1103.5867.

[45] R. Prix, S. Giampanis, and C. Messenger, “Search method for long-duration gravitational-wave transients from neutron stars,” Physical Review D, vol. 84, p. 023007, July 2011, arXiv:1104.1704.

[46] J. Abadie, B. P. Abbott, R. Abbott, M. Abernathy, T. Accadia, F. Acernese, C. Adams, R. Adhikari, C. Affeldt, B. Allen, et al., “Beating the Spin-down Limit on Gravitational Wave Emission from the Vela Pulsar,” The Astrophysical Journal, vol. 737, p. 93, Aug. 2011, arXiv:1104.2712.

[47] C. Cutler, “An improved, ‘phase-relaxed’ F-statistic for gravitational-wave data analysis,” ArXiv e-prints, Apr. 2011, arXiv:1104.2938. 8

[48] C. Cutler and K. S. Thorne, “An overview of gravitational-wave sources,” in Proceedings of GR16 (N. T. Bishop and S. D. Maharaj, eds.), Singapore: World Scientific, 2002, gr-qc/0204090. 8

[49] M. S. Briggs, V. Connaughton, K. C. Hurley, P. A. Jenke, A. von Kienlin, A. Rau, X.-L. Zhang, The LIGO Scientific Collaboration, Virgo Collaboration: J. Abadie, B. P. Abbott, et al., “Search for gravitational waves associated with gamma-ray bursts during LIGO science run 6 and Virgo science runs 2 and 3,” ArXiv e-prints, May 2012, arXiv:1205.2216. 8

[50] S. Rosswog, T. Piran, and E. Nakar, “The multi-messenger picture of compact object encounters: binary mergers versus dynamical collisions,” ArXiv e-prints, Apr. 2012, arXiv:1204.6240. 8

[51] T. G. F. Li, W. Del Pozzo, S. Vitale, C. Van Den Broeck, M. Agathos, J. Veitch, K. Grover, T. Sidery, R. Sturani, and A.
Vecchio, “Towards a generic test of the strong field dynamics of general relativity using compact binary coalescence,” ArXiv e-prints, Oct. 2011, arXiv:1110.0530. 8

[52] S. Gossan, J. Veitch, and B. S. Sathyaprakash, “Bayesian model selection for testing the no-hair theorem with black hole ringdowns,” ArXiv e-prints, Nov. 2011, arXiv:1111.5819. 8

[53] W. Del Pozzo, J. Veitch, and A. Vecchio, “Testing general relativity using Bayesian model selection: Applications to observations of gravitational waves from compact binary systems,” Physical Review D, vol. 83, p. 082002, Apr. 2011, arXiv:1101.1391. 8

[54] S. R. Taylor and J. R. Gair, “Cosmology with the lights off: standard sirens in the Einstein Telescope era,” ArXiv e-prints, Apr. 2012, arXiv:1204.6739. 8

[55] S. R. Taylor, J. R. Gair, and I. Mandel, “Cosmology using advanced gravitational-wave detectors alone,” Physical Review D, vol. 85, p. 023535, Jan. 2012, arXiv:1108.5161.

[56] C. Messenger and J. Read, “Measuring a Cosmological Distance-Redshift Relationship Using Only Gravitational Wave Observations of Binary Neutron Star Coalescences,” Physical Review Letters, vol. 108, p. 091101, Mar. 2012, arXiv:1107.5725.

[57] W. Del Pozzo, “Cosmology with Gravitational Waves: statistical inference of the Hubble constant,” ArXiv e-prints, Aug. 2011, arXiv:1108.1317.

[58] S. Nissanke, D. E. Holz, S. A. Hughes, N. Dalal, and J. L. Sievers, “Exploring short gamma-ray bursts as gravitational-wave standard sirens,” The Astrophysical Journal, vol. 725, no. 1, p. 496, 2010. 8

[59] B. P. Abbott, R. Abbott, R. Adhikari, P. Ajith, B. Allen, G. Allen, R. S. Amin, S. B. Anderson, W. G. Anderson, M. A. Arain, et al., “LIGO: the Laser Interferometer Gravitational-Wave Observatory,” Reports on Progress in Physics, vol. 72, p. 076901, July 2009, arXiv:0711.3041. 8

[60] A. S.
Sengupta, LIGO Scientific Collaboration, and Virgo Collaboration, “LIGO-Virgo searches for gravitational waves from coalescing binaries: A status update,” Journal of Physics Conference Series, vol. 228, p. 012002, May 2010, arXiv:0911.2738. 9

[61] B. Allen, W. G. Anderson, P. R. Brady, D. A. Brown, and J. D. E. Creighton, “FINDCHIRP: an algorithm for detection of gravitational waves from inspiraling compact binaries,” ArXiv General Relativity and Quantum Cosmology e-prints, Sept. 2005, arXiv:gr-qc/0509116. 9

[62] S. Bose, T. Dayanga, S. Ghosh, and D. Talukder, “A blind hierarchical coherent search for gravitational-wave signals from coalescing compact binaries in a network of interferometric detectors,” Classical and Quantum Gravity, vol. 28, p. 134009, July 2011, arXiv:1104.2650. 9

[63] S. Ballmer et al., “Enhancements to the Laser Interferometer Gravitational-wave Observatory (LIGO),” Bulletin of the American Astronomical Society, vol. 41, p. 443, 2009. 9

[64] L. Blackburn et al., “The LSC glitch group: Monitoring noise transients during the fifth LIGO science run,” Classical and Quantum Gravity, vol. 25, p. 184004, 2008. 9

[65] T. B. Littenberg and N. J. Cornish, “Separating gravitational wave signals from instrument artifacts,” Physical Review D, vol. 82, p. 103007, Nov. 2010, arXiv:1008.1577.

[66] V. Raymond, M. V. van der Sluys, I. Mandel, V. Kalogera, C. Röver, and N. Christensen, “The effects of LIGO detector noise on a 15-dimensional Markov-chain Monte Carlo analysis of gravitational-wave signals,” Classical and Quantum Gravity, vol. 27, p. 114009, June 2010, arXiv:0912.3746.

[67] T. Prestegard, E. Thrane, N. L. Christensen, M. W. Coughlin, B. Hubbert, S. Kandhasamy, E. MacAyeal, and V. Mandic, “Identification of noise artifacts in searches for long-duration gravitational-wave transients,” Classical and Quantum Gravity, vol. 29, p. 095018, May 2012, arXiv:1111.1631. 9

[68] J.
Abadie et al., “Search for gravitational waves from low mass compact binary coalescence in LIGO’s sixth science run and Virgo’s science runs 2 and 3,” Physical Review D, vol. 85, p. 082002, Apr 2012. 10, 21, 41, 42

[69] G. M. Harry and LIGO Scientific Collaboration, “Advanced LIGO: the next generation of gravitational wave detectors,” Classical and Quantum Gravity, vol. 27, p. 084006, Apr. 2010. 9

[70] S. J. Waldman, “The Advanced LIGO Gravitational Wave Detector,” ArXiv e-prints, Mar. 2011, arXiv:1103.2728.

[71] A. J. Weinstein, for the LIGO Scientific Collaboration and the Virgo Collaboration, “Astronomy and astrophysics with gravitational waves in the Advanced Detector Era,” ArXiv e-prints, Dec. 2011, arXiv:1112.1057. 9, 21

[72] IndIGO, “Indian initiative in gravitational-wave observation,” May 2012. http://www.gw-indigo.org. 9

[73] F. Acernese, M. Alshourbagy, P. Amico, F. Antonucci, S. Aoudia, K. G. Arun, P. Astone, S. Avino, et al., “Virgo status,” Classical and Quantum Gravity, vol. 25, p. 184001, Sept. 2008. 10

[74] H. Grote et al., “The status of GEO 600,” Classical and Quantum Gravity, vol. 25, no. 11, p. 114043, 2008. 10

[75] M. Ando, “Current status of the TAMA300 gravitational-wave detector,” Classical and Quantum Gravity, vol. 22, no. 18, pp. 881–889, 2005. 10

[76] K. Kuroda et al., “Status of LCGT,” Classical and Quantum Gravity, vol. 27, p. 084004, 2010. 10

[77] K. Somiya, for the KAGRA Collaboration, “Detector configuration of KAGRA - the Japanese cryogenic gravitational-wave detector,” ArXiv e-prints, Nov. 2011, arXiv:1111.7185. 10

[78] M. Punturo et al., “The Einstein Telescope: a third-generation gravitational wave observatory,” Classical and Quantum Gravity, vol. 27, p. 194002, Oct. 2010. 12

[79] B. Sathyaprakash et al., “Scientific Potential of Einstein Telescope,” ArXiv e-prints, Aug. 2011, arXiv:1108.1423. 12

[80] P. Amaro-Seoane, S. Aoudia, S. Babak, P. Binétruy, E.
Berti, A. Bohé, C. Caprini, M. Colpi, N. J. Cornish, K. Danzmann, J.-F. Dufaux, J. Gair, O. Jennrich, P. Jetzer, A. Klein, R. N. Lang, A. Lobo, T. Littenberg, S. T. McWilliams, G. Nelemans, A. Petiteau, E. K. Porter, B. F. Schutz, A. Sesana, R. Stebbins, T. Sumner, M. Vallisneri, S. Vitale, M. Volonteri, and H. Ward, “eLISA: Astrophysics and cosmology in the millihertz regime,” ArXiv e-prints, Jan. 2012, arXiv:1201.3621. 12, 45

[81] P. Amaro-Seoane, S. Aoudia, S. Babak, P. Binétruy, E. Berti, A. Bohé, C. Caprini, M. Colpi, N. J. Cornish, K. Danzmann, J.-F. Dufaux, J. Gair, O. Jennrich, P. Jetzer, A. Klein, R. N. Lang, A. Lobo, T. Littenberg, S. T. McWilliams, G. Nelemans, A. Petiteau, E. K. Porter, B. F. Schutz, A. Sesana, R. Stebbins, T. Sumner, M. Vallisneri, S. Vitale, M. Volonteri, and H. Ward, “Low-frequency gravitational-wave science with eLISA/NGO,” ArXiv e-prints, Feb. 2012, arXiv:1202.0839. 12, 45

[82] O. Jennrich, “LISA – a mission overview,” 37th COSPAR Scientific Assembly, vol. 37, p. 1367, 2008. 12, 45

[83] J. S. Key and N. J. Cornish, “Characterizing spinning black hole binaries in eccentric orbits with LISA,” Physical Review D, vol. 83, p. 083001, Apr. 2011, arXiv:1006.3759. 12

[84] A. Petiteau, S. Babak, and A. Sesana, “Constraining the Dark Energy Equation of State Using LISA Observations of Spinning Massive Black Hole Binaries,” The Astrophysical Journal, vol. 732, p. 82, May 2011, arXiv:1102.0769. 12

[85] T. Regimbau, “The astrophysical gravitational wave stochastic background,” Research in Astronomy and Astrophysics, vol. 11, pp. 369–390, Apr. 2011, arXiv:1101.2762. 12

[86] P. A. Rosado, “Gravitational wave background from binary systems,” Physical Review D, vol. 84, p. 084004, Oct. 2011, arXiv:1106.5795. 12

[87] M. Armano et al., “LISA Pathfinder: the experiment and the route to LISA,” Classical and Quantum Gravity, vol. 26, p. 094001, 2009. 12

[88] M. Pitkin, S. Reid, S. Rowan, and J.
Hough, “Gravitational Wave Detection by Interferometry (Ground and Space),” Living Reviews in Relativity, vol. 14, p. 5, July 2011, arXiv:1102.3355. 12

[89] K. J. Lee, N. Wex, M. Kramer, B. W. Stappers, C. G. Bassa, G. H. Janssen, R. Karuppusamy, and R. Smits, “Gravitational wave astronomy of single sources with a pulsar timing array,” Monthly Notices of the Royal Astronomical Society, vol. 414, pp. 3251–3264, July 2011, arXiv:1103.0115. 12

[90] A. Sesana and A. Vecchio, “Measuring the parameters of massive black hole binary systems with pulsar timing array observations of gravitational waves,” Physical Review D, vol. 81, p. 104008, May 2010, arXiv:1003.0677.

[91] S. J. Chamberlin and X. Siemens, “Stochastic backgrounds in alternative theories of gravity: Overlap reduction functions for pulsar timing arrays,” Physical Review D, vol. 85, p. 082001, Apr. 2012, arXiv:1111.5661. 12

[92] J. Marx, K. Danzmann, J. Hough, K. Kuroda, D. McClelland, B. Mours, S. Phinney, S. Rowan, B. Sathyaprakash, F. Vetrano, S. Vitale, S. Whitcomb, and C. Will, “The Gravitational Wave International Committee Roadmap: The future of gravitational wave astronomy,” ArXiv e-prints, Nov. 2011, arXiv:1111.5825. 12

[93] R. Trotta, “Bayes in the sky: Bayesian inference and model selection in cosmology,” Contemporary Physics, vol. 49, pp. 71–104, 2008, arXiv:0803.4089. 14

[94] N. Metropolis, A. W. Rosenbluth, M. N. Rosenbluth, A. H. Teller, and E. Teller, “Equation of state calculations by fast computing machines,” J. Chem. Phys., vol. 21, no. 6, p. 1087, 1953. 14

[95] W. K. Hastings, “Monte Carlo sampling methods using Markov chains and their applications,” Biometrika, vol. 57, no. 1, pp. 97–109, 1970. 14

[96] S. Kirkpatrick, C. D. Gelatt, and M. P. Vecchi, “Optimization by simulated annealing,” Science, vol. 220, p. 671, 1983. 14

[97] D. J. Earl and M. W.
Deem, “Parallel tempering: Theory, applications, and new perspectives,” Physical Chemistry Chemical Physics (Incorporating Faraday Transactions), vol. 7, p. 3910, 2005, arXiv:physics/0508111. 14 [98] W. M. Farr and I. Mandel, “An Efficient Interpolation Technique for Jump Proposals in Reversible-Jump Markov Chain Monte Carlo Calculations,” ArXiv e-prints, Apr. 2011, arXiv:1104.0984. 14 [99] J. Skilling, “Nested sampling,” in 24th International Workshop on Bayesian Inference and Maximum Entropy Methods in Science and Engineering (R. Fischer, R. Preuss, & U. V. Toussaint, ed.), vol. 735 of American Institute of Physics Conference Proceedings, pp. 395–405, American Institute of Physics, 2004. 14 [100] J. Skilling, “Nested sampling for general Bayesian computation,” Bayesian Analysis, vol. 1, no. 4, pp. 833–860, 2006. 14 [101] F. Feroz and M. P. Hobson, “Multimodal nested sampling: an efficient and robust alternative to Markov chain Monte Carlo methods for astronomical data analyses,” Monthly Notices of the Royal Astronomical Society, vol. 384, no. 2, pp. 449–463, 2008, arXiv:0704.3704. 15, 17, 24 [102] F. Feroz, M. P. Hobson, and M. Bridges, “MultiNest: an efficient and robust Bayesian inference tool for cosmology and particle physics,” Monthly Notices of the Royal Astronomical Society, vol. 398, no. 4, pp. 1601–1614, 2009, arXiv:0809.3437. 17, 18, 24, 137, 144, 162 [103] F. Feroz, P. J. Marshall, and M. P. Hobson, “Cluster detection in weak lensing surveys,” ArXiv e-prints, Oct. 2008, arXiv:0810.0781. 17 [104] F. Feroz, M. P. Hobson, J. T. L. Zwart, R. D. E. Saunders, and K. J. B. Grainge, “Bayesian modelling of clusters of galaxies from multifrequency-pointed Sunyaev-Zel’dovich observations,” Monthly Notices of the Royal Astronomical Society, vol. 398, pp. 2049–2060, Oct. 2009, arXiv:0811.1199. [105] F. Feroz, B. C. Allanach, M. Hobson, S. S. Abdus Salam, R. Trotta, and A. M.
Weber, “Bayesian selection of sign µ within mSUGRA in global fits including WMAP5 results,” Journal of High Energy Physics, vol. 10, p. 64, Oct. 2008, arXiv:0807.4512. [106] R. Trotta, F. Feroz, M. Hobson, L. Roszkowski, and R. Ruiz de Austri, “The impact of priors and observables on parameter inferences in the constrained MSSM,” Journal of High Energy Physics, vol. 12, p. 24, Dec. 2008, arXiv:0809.3792. 17 [107] F. Feroz, J. Gair, M. P. Hobson, and E. K. Porter, “Use of the MultiNest algorithm for gravitational wave data analysis,” Classical and Quantum Gravity, vol. 26, no. 21, p. 215003, 2009, arXiv:0904.1544. 17, 28, 63 [108] J. Veitch and A. Vecchio, “Bayesian coherent analysis of in-spiral gravitational wave signals with a detector network,” Physical Review D, vol. 81, p. 062003, Mar. 2010, arXiv:0911.3820. 17 [109] J. Abadie, B. P. Abbott, R. Abbott, M. Abernathy, T. Accadia, F. Acernese, C. Adams, R. Adhikari, P. Ajith, B. Allen, and et al., “TOPICAL REVIEW: Predictions for the rates of compact binary coalescences observable by ground-based gravitational-wave detectors,” Classical and Quantum Gravity, vol. 27, p. 173001, Sept. 2010, arXiv:1003.2480. 21 [110] LIGO Scientific Collaboration, “LALSuite home page,” May 2012. https://www.lsc-group.phys.uwm.edu/daswg/projects/lalsuite.html. 21 [111] P. Graff, “A coherent Bayesian method tested on hardware injection time-slide triggers,” 2011. LIGO-G1100599. 21, 41 [112] C. J. Lada, “Stellar Multiplicity and the Initial Mass Function: Most Stars Are Single,” The Astrophysical Journal, vol. 640, pp. L63–L66, Mar. 2006, arXiv:astro-ph/0601375. 22 [113] J. Veitch and A. Vecchio, “Bayesian approach to the follow-up of candidate gravitational wave signals,” Physical Review D, vol. 78, no. 2, p. 022001, 2008, arXiv:0801.4313. 22, 26, 28 [114] J. Veitch and A. Vecchio, “Assigning confidence to inspiral gravitational wave candidates with Bayesian model selection,” Classical and Quantum Gravity, vol.
25, no. 18, p. 184010, 2008, arXiv:0807.4483. 28, 46, 64 [115] M. Trias, A. Vecchio, and J. Veitch, “Delayed rejection schemes for efficient Markov-chain Monte-Carlo sampling of multimodal distributions,” ArXiv e-prints, 2009, arXiv:0904.2207. 28 [116] M. Trias, A. Vecchio, and J. Veitch, “Studying stellar binary systems with the Laser Interferometer Space Antenna using delayed rejection Markov chain Monte Carlo methods,” Classical and Quantum Gravity, vol. 26, no. 20, p. 204024, 2009, arXiv:0905.2976. 28 [117] B. Farr, S. Fairhurst, and B. S. Sathyaprakash, “Searching for binary coalescences with inspiral templates: Detection and parameter estimation,” Classical and Quantum Gravity, vol. 26, no. 11, p. 114009, 2009, arXiv:0902.0307. 28 [118] F. Feroz, P. J. Marshall, and M. P. Hobson, “Cluster detection in weak lensing surveys,” ArXiv e-prints, 2008, arXiv:0810.0781. 28 [119] J. S. Bloom, D. E. Holz, S. A. Hughes, K. Menou, A. Adams, S. F. Anderson, A. Becker, G. C. Bower, N. Brandt, B. Cobb, K. Cook, A. Corsi, S. Covino, D. Fox, A. Fruchter, C. Fryer, J. Grindlay, D. Hartmann, Z. Haiman, B. Kocsis, L. Jones, A. Loeb, S. Márka, B. Metzger, E. Nakar, S. Nissanke, D. A. Perley, T. Piran, D. Poznanski, T. Prince, J. Schnittman, A. Soderberg, M. Strauss, P. S. Shawhan, D. H. Shoemaker, J. Sievers, C. Stubbs, G. Tagliaferri, P. Ubertini, and P. Wozniak, “Astro2010 Decadal Survey Whitepaper: Coordinated Science in the Gravitational and Electromagnetic Skies,” ArXiv e-prints, Feb. 2009, arXiv:0902.1527. 28 [120] The LIGO Scientific Collaboration, Virgo Collaboration: J. Abadie, B. P. Abbott, R. Abbott, T. D. Abbott, M. Abernathy, T. Accadia, F. Acernese, C. Adams, R. Adhikari, and et al., “Implementation and testing of the first prompt search for gravitational wave transients with electromagnetic counterparts,” ArXiv e-prints, Sept. 2011, arXiv:1109.3498. [121] J. Kanner, T. L. Huard, S. Márka, D. C. Murphy, J. Piscionere, M. Reed, and P.
Shawhan, “LOOC UP: locating and observing optical counterparts to gravitational wave bursts,” Classical and Quantum Gravity, vol. 25, p. 184034, Sept. 2008, arXiv:0803.0312. [122] D. Buskulic for the LIGO and Virgo Collaborations, “Very low latency search pipeline for low mass compact binary coalescences in the LIGO S6 and Virgo VSR2 data,” Classical and Quantum Gravity, vol. 27, no. 19, p. 194013, 2010. [123] D. M. Coward, B. Gendre, P. J. Sutton, E. J. Howell, T. Regimbau, M. Laas-Bourez, A. Klotz, M. Boër, and M. Branchesi, “Towards an optimal search strategy of optical and gravitational wave emissions from binary neutron star coalescence,” Monthly Notices of the Royal Astronomical Society, vol. 415, pp. L26–L30, July 2011, arXiv:1104.5552. [124] P. A. Evans, J. K. Fridriksson, N. Gehrels, J. Homan, J. P. Osborne, M. Siegel, A. Beardmore, P. Handbauer, J. Gelbord, J. A. Kennea, and et al., “Swift follow-up observations of candidate gravitational-wave transient events,” ArXiv e-prints, May 2012, arXiv:1205.1124. [125] T. Piran, E. Nakar, and S. Rosswog, “The Electromagnetic Signals of Compact Binary Mergers,” ArXiv e-prints, Apr. 2012, arXiv:1204.6242. [126] J. A. Rueda and R. Ruffini, “Gravitational Waves versus Electromagnetic Emission in Gamma-Ray Bursts,” ArXiv e-prints, May 2012, arXiv:1205.6915. 28 [127] L. Wen and Y. Chen, “Geometrical expression for the angular resolution of a network of gravitational-wave detectors,” Physical Review D, vol. 81, p. 082001, Apr. 2010, arXiv:1003.2504. 29 [128] S. Nissanke, J. Sievers, N. Dalal, and D. Holz, “Localizing Compact Binary Inspirals on the Sky Using Ground-based Gravitational Wave Interferometers,” The Astrophysical Journal, vol. 739, p. 99, Oct. 2011, arXiv:1105.3184. [129] S. Klimenko, G. Vedovato, M. Drago, G. Mazzolo, G. Mitselmakher, C. Pankow, G. Prodi, V. Re, F. Salemi, and I. Yakushin, “Localization of gravitational wave sources with networks of advanced detectors,” Physical Review D, vol. 83, p.
102001, May 2011, arXiv:1101.5408. [130] A. Manzotti and A. Dietz, “Prospects for early localization of gravitational-wave signals from compact binary coalescences with advanced detectors,” ArXiv e-prints, Feb. 2012, arXiv:1202.4031. [131] I. Mandel, L. Z. Kelley, and E. Ramirez-Ruiz, “Towards Improving the Prospects for Coordinated Gravitational-Wave and Electromagnetic Observations,” in IAU Symposium, vol. 285 of IAU Symposium, pp. 358–360, Apr. 2012, arXiv:1111.0005. [132] L. K. Nuttall and P. J. Sutton, “Identifying the host galaxy of gravitational wave signals,” Physical Review D, vol. 82, p. 102002, Nov. 2010, arXiv:1009.1791. [133] S. Fairhurst, “Source localization with an advanced gravitational wave detector network,” Classical and Quantum Gravity, vol. 28, no. 10, p. 105021, 2011. [134] S. Fairhurst, “Improved source localization with LIGO India,” ArXiv e-prints, May 2012, arXiv:1205.6611. 29 [135] P. Ajith and S. Bose, “Estimating the parameters of nonspinning binary black holes using ground-based gravitational-wave detectors: Statistical errors,” Physical Review D, vol. 79, p. 084032, Apr. 2009, arXiv:0901.4936. 29 [136] C. Cutler and É. E. Flanagan, “Gravitational waves from merging compact binaries: How accurately can one extract the binary’s parameters from the inspiral waveform?,” Physical Review D, vol. 49, pp. 2658–2697, Mar. 1994, arXiv:gr-qc/9402014. 29 [137] J. L. Bentley, “Multidimensional binary search trees used for associative searching,” Communications of the ACM, vol. 18, no. 9, pp. 509–517, 1975. 36 [138] L. Singer, L. Price, and A. Speranza, “Optimizing optical follow-up of gravitational-wave candidates,” ArXiv e-prints, Apr. 2012, arXiv:1204.4510. 40 [139] LIGO Scientific Collaboration, “The science of LSC research,” May 2012. http://www.ligo.org/science/GW100916/. 41 [140] D.
Buskulic, Virgo Collaboration, and LIGO Scientific Collaboration, “Very low latency search pipeline for low mass compact binary coalescences in the LIGO S6 and Virgo VSR2 data,” Classical and Quantum Gravity, vol. 27, p. 194013, Oct. 2010. 41 [141] J. Abadie, B. P. Abbott, R. Abbott, M. Abernathy, T. Accadia, F. Acernese, C. Adams, R. Adhikari, P. Ajith, B. Allen, and et al., “Search for gravitational waves from compact binary coalescence in LIGO and Virgo data from S5 and VSR1,” Physical Review D, vol. 82, p. 102001, Nov. 2010, arXiv:1005.4655. 41 [142] The LIGO Scientific Collaboration, the Virgo Collaboration: J. Abadie, B. P. Abbott, R. Abbott, M. Abernathy, T. Accadia, F. Acernese, C. Adams, R. Adhikari, P. Ajith, and et al., “Sensitivity to Gravitational Waves from Compact Binary Coalescences Achieved during LIGO’s Fifth and Virgo’s First Science Run,” ArXiv e-prints, Mar. 2010, arXiv:1003.2481. [143] J. Abadie, B. P. Abbott, R. Abbott, T. Accadia, F. Acernese, R. Adhikari, P. Ajith, B. Allen, G. Allen, E. Amador Ceron, and et al., “All-sky search for gravitational-wave bursts in the first joint LIGO-GEO-Virgo run,” Physical Review D, vol. 81, p. 102001, May 2010, arXiv:1002.1036. [144] J. Abadie, B. P. Abbott, R. Abbott, T. D. Abbott, M. Abernathy, T. Accadia, F. Acernese, C. Adams, R. Adhikari, C. Affeldt, and et al., “All-sky search for periodic gravitational waves in the full S5 LIGO data,” Physical Review D, vol. 85, p. 022001, Jan. 2012, arXiv:1110.0208. 41 [145] I. Mandel and R. O’Shaughnessy, “Compact binary coalescences in the band of ground-based gravitational-wave detectors,” Classical and Quantum Gravity, vol. 27, p. 114007, June 2010, arXiv:0912.1074. 41 [146] F. Feroz, B. C. Allanach, M. P. Hobson, S. S. A. Salam, R. Trotta, and A. M. Weber, “Bayesian selection of sign µ within mSUGRA in global fits including WMAP5 results,” Journal of High Energy Physics, vol. 10, p. 64, 2008, arXiv:0807.4512. 42 [147] L. Barack and C.
Cutler, “LISA capture sources: Approximate waveforms, signal-to-noise ratios, and parameter estimation accuracy,” Physical Review D, vol. 69, p. 082005, Apr. 2004, arXiv:gr-qc/0310125. 45 [148] J. R. Gair, A. Sesana, E. Berti, and M. Volonteri, “Constraining properties of the black hole population using LISA,” Classical and Quantum Gravity, vol. 28, p. 094018, May 2011, arXiv:1009.6172. 45 [149] F. Feroz, J. Gair, P. Graff, M. P. Hobson, and A. Lasenby, “Classifying LISA gravitational wave burst signals using Bayesian evidence,” Classical and Quantum Gravity, vol. 27, no. 7, p. 075010, 2010, arXiv:0911.0288. 45, 46, 93 [150] S. Babak and et al., “The Mock LISA Data Challenges: from Challenge 1B to Challenge 3,” Classical and Quantum Gravity, vol. 25, no. 18, p. 184026, 2008, arXiv:0806.2110. 46, 47, 51, 65 [151] S. Babak and et al., “The Mock LISA Data Challenges: from Challenge 3 to Challenge 4,” Classical and Quantum Gravity, vol. 27, no. 8, p. 084009, 2010, arXiv:0912.0548. 46, 73 [152] M. I. Cohen, C. Cutler, and M. Vallisneri, “Searches for cosmic-string gravitational-wave bursts in mock LISA data,” Classical and Quantum Gravity, vol. 27, no. 18, p. 185012, 2010, arXiv:1002.4153. 46 [153] T. Littenberg and N. Cornish, “Bayesian approach to the detection problem in gravitational wave astronomy,” Physical Review D, vol. 80, no. 6, p. 063007, 2009, arXiv:0902.0368. 46, 64 [154] F. Beauville and et al., “A first comparison of search methods for gravitational wave bursts using LIGO and Virgo simulated data,” Classical and Quantum Gravity, vol. 22, no. 18, pp. S1293–S1301, 2005. 46, 57 [155] T. Damour and A. Vilenkin, “Gravitational wave bursts from cusps and kinks on cosmic strings,” Physical Review D, vol. 64, no. 6, p. 064008, 2001, arXiv:gr-qc/0104026. 47, 51 [156] X. Siemens and K. D. Olum, “Cosmic string cusps with small-scale structure: Their forms and gravitational waveforms,” Physical Review D, vol. 68, no. 8, p.
085017, 2003, arXiv:gr-qc/0307113. 47, 51 [157] J. S. Key and N. J. Cornish, “Characterizing the gravitational wave signature from cosmic string cusps,” Physical Review D, vol. 79, no. 4, p. 043014, 2009, arXiv:0812.1590. 47, 48, 63 [158] L. J. Rubbo, N. J. Cornish, and O. Poujade, “Forward modeling of space-borne gravitational wave detectors,” Physical Review D, vol. 69, no. 8, p. 082003, 2004, arXiv:gr-qc/0311069. 48 [159] M. Tinto and S. V. Dhurandhar, “Time-delay interferometry,” Living Reviews in Relativity, vol. 8, no. 4, 2005. 48 [160] S. W. Helstrom, Statistical Theory of Signal Detection. London, UK: Pergamon, 1968. 49 [161] B. J. Owen, “Search templates for gravitational waves from inspiraling binaries: Choice of template spacing,” Physical Review D, vol. 53, no. 12, pp. 6749–6761, 1996, arXiv:gr-qc/9511032. 49 [162] “Mock LISA Data Challenge,” Jan. 2012. http://astrogravs.nasa.gov/docs/mldc/. 52, 59 [163] C. Röver, M.-A. Bizouard, N. Christensen, H. Dimmelmeier, I. S. Heng, and R. Meyer, “Bayesian reconstruction of gravitational wave burst signals from simulations of rotating stellar core collapse and bounce,” Physical Review D, vol. 80, no. 10, p. 102004, 2009, arXiv:0909.1093. 65 [164] G. Nelemans, “The galactic gravitational wave foreground,” Classical and Quantum Gravity, vol. 26, no. 9, p. 094030, 2009, arXiv:0901.1778. 65 [165] N. J. Cornish and T. Littenberg, “Tests of Bayesian model selection techniques for gravitational wave astronomy,” Physical Review D, vol. 76, no. 8, p. 083006, 2007, arXiv:0704.1808. 66 [166] A. Błaut, S. Babak, and A. Królak, “Mock LISA data challenge for the galactic white dwarf binaries,” Physical Review D, vol. 81, p. 063008, 2010, arXiv:0911.3020. 66, 67 [167] J. A. Nelder and R. Mead, “A simplex method for function minimization,” Computer Journal, vol. 7, pp. 308–313, 1965. 66 [168] G. Nelemans, L. R. Yungelson, and S. F.
Portegies Zwart, “Short-period AM CVn systems as optical, X-ray and gravitational-wave sources,” Monthly Notices of the Royal Astronomical Society, vol. 349, pp. 181–192, Mar. 2004, arXiv:astro-ph/0312193. 66 [169] T. B. Littenberg, “Detection pipeline for Galactic binaries in LISA data,” Physical Review D, vol. 84, p. 063009, Sept. 2011, arXiv:1106.6355. 70 [170] S. E. Timpano, L. J. Rubbo, and N. J. Cornish, “Characterizing the galactic gravitational wave background with LISA,” Physical Review D, vol. 73, p. 122001, 2006, arXiv:gr-qc/0504071. 73 [171] P. Protopapas, R. Jimenez, and C. Alcock, “Fast identification of transits from light-curves,” Monthly Notices of the Royal Astronomical Society, vol. 362, no. 2, pp. 460–468, 2005, arXiv:astro-ph/0502301. 88, 89, 102 [172] P. Graff, M. P. Hobson, and A. Lasenby, “An investigation into the multiple optimised parameter estimation and data compression algorithm,” Monthly Notices of the Royal Astronomical Society: Letters, vol. 413, no. 1, pp. L66–L70, 2011, arXiv:1010.5907. 88 [173] A. Heavens, R. Jimenez, and O. Lahav, “Massive lossless data compression and multiple parameter estimation from galaxy spectra,” Monthly Notices of the Royal Astronomical Society, vol. 317, no. 4, pp. 965–972, 2000, arXiv:astro-ph/9911102. 88, 89, 92 [174] S. Gupta and A. Heavens, “Fast parameter estimation from the cosmic microwave background power spectrum,” Monthly Notices of the Royal Astronomical Society, vol. 334, no. 1, pp. 167–172, 2002, arXiv:astro-ph/0108315. 89, 100 [175] D. J. C. MacKay, Information Theory, Inference and Learning Algorithms. Cambridge, UK: Cambridge University Press, 2003. 103, 107 [176] K. Hornik, M. Stinchcombe, and H. White, “Universal approximation of an unknown mapping and its derivatives using multilayer feedforward networks,” Neural Networks, vol. 3, p. 359, 1990.
104, 133 [177] S. F. Gull and J. Skilling, Quantified Maximum Entropy: MemSys 5 Users’ Manual. Bury St. Edmunds, UK: Maximum Entropy Data Consults, Ltd., 1999. 108 [178] J. Martens, “Deep learning via Hessian-free optimization,” in Proceedings of the 27th International Conference on Machine Learning (J. Fürnkranz and T. Joachims, eds.), (Haifa, Israel), pp. 735–742, Omnipress, 2010. 110 [179] N. N. Schraudolph, “Fast curvature matrix-vector products for second-order gradient descent,” Neural Computation, vol. 14, no. 7, pp. 1723–1738, 2002. 110 [180] B. A. Pearlmutter, “Fast exact multiplication by the Hessian,” Neural Computation, vol. 6, no. 1, pp. 147–160, 1994. 110 [181] G. E. Hinton, S. Osindero, and Y.-W. Teh, “A fast learning algorithm for deep belief nets,” Neural Computation, vol. 18, pp. 1527–1554, 2006. 112 [182] G. E. Hinton and R. R. Salakhutdinov, “Reducing the dimensionality of data with neural networks,” Science, vol. 313, pp. 504–507, 2006. 112, 122, 126 [183] D. J. C. MacKay, “Bayesian interpolation,” Neural Computation, vol. 4, no. 3, pp. 415–447, 1992. 114 [184] R. Neal, “A three-way classification problem,” Jan. 2012. http://www.cs.toronto.edu/∼radford/fbm.2004-11-10.doc/Ex-netgp-c.html. 116, 117 [185] Y. LeCun and C. Cortes, “MNIST handwritten digit database,” Jan. 2012. http://yann.lecun.com/exdb/mnist/. 120 [186] D. C. Ciresan, U. Meier, L. M. Gambardella, and J. Schmidhuber, “Deep big simple neural nets excel on handwritten digit recognition,” Neural Computation, vol. 22, no. 12, pp. 3207–3220, 2010, arXiv:1003.0358. 120 [187] Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner, “Gradient-based learning applied to document recognition,” Proceedings of the IEEE, vol. 86, no. 11, pp. 2278–2324, 1998. 120 [188] T. Kitching, S. Balan, G. Bernstein, M. Bethge, S. Bridle, F. Courbin, M. Gentile, A. Heavens, M. Hirsch, R. Hosseini, A. Kiessling, A. Amara, D. Kirk, K. Kuijken, R. Mandelbaum, B. Moghaddam, G. Nurbaeva, S.
Paulin-Henriksson, A. Rassat, J. Rhodes, B. Schölkopf, J. Shawe-Taylor, M. Gill, M. Shmakova, A. Taylor, M. Velander, L. van Waerbeke, D. Witherick, D. Wittman, S. Harmeling, C. Heymans, R. Massey, B. Rowe, T. Schrabback, and L. Voigt, “Gravitational Lensing Accuracy Testing 2010 (GREAT10) Challenge Handbook,” ArXiv e-prints, Sept. 2010, arXiv:1009.0779. 125 [189] T. D. Kitching, J. Rhodes, R. M. C. Heymans, Q. Liu, M. Cobzarenco, B. L. Cragin, A. Hassaine, D. Kirkby, E. J. Lok, D. Margala, J. Moser, M. O’Leary, A. M. Pires, and S. Yurgenson, “Image analysis for cosmology: Shape measurement challenge review and results from the Mapping Dark Matter challenge,” New Astronomy Reviews (submitted), 2012, arXiv:1204.4096. 127, 130 [190] Kaggle Inc., “Mapping Dark Matter challenge,” Jan. 2012. http://www.kaggle.com/c/mdm/. 127, 128, 130 [191] P. Graff, F. Feroz, M. P. Hobson, and A. Lasenby, “BAMBI: blind accelerated multimodal Bayesian inference,” Monthly Notices of the Royal Astronomical Society, vol. 421, pp. 169–180, Mar. 2012, arXiv:1110.2997. 133, 134 [192] A. Lewis and S. Bridle, “Cosmological parameters from CMB and other data: A Monte Carlo approach,” Physical Review D, vol. 66, no. 10, p. 103511, 2002, arXiv:astro-ph/0205436. 142, 144 [193] T. Auld, M. Bridges, M. P. Hobson, and S. F. Gull, “Fast cosmological parameter estimation using neural networks,” Monthly Notices of the Royal Astronomical Society: Letters, vol. 376, no. 1, pp. L11–L15, 2007, arXiv:astro-ph/0608174. 144, 151 [194] T. Auld, M. Bridges, and M. P. Hobson, “CosmoNet: fast cosmological parameter estimation in non-flat models using neural networks,” Monthly Notices of the Royal Astronomical Society, vol. 387, no. 4, pp. 1575–1582, 2008, arXiv:astro-ph/0703445. 151 [195] A. Bouland, R. Easther, and K. Rosenfeld, “Caching and interpolated likelihoods: accelerating cosmological Monte Carlo Markov chains,” Journal of Cosmology and Astroparticle Physics, vol. 5, p.
16, 2011, arXiv:1012.5299. [196] W. A. Fendt and B. D. Wandelt, “Pico: Parameters for the impatient cosmologist,” The Astrophysical Journal, vol. 654, no. 1, pp. 2–11, 2007, arXiv:astro-ph/0606709. [197] J. Prasad and T. Souradeep, “Cosmological parameter estimation using particle swarm optimization (PSO),” ArXiv e-prints, 2011, arXiv:1108.5600. [198] S. F. Daniel, A. J. Connolly, and J. Schneider, “An Efficient Parameter Space Search as an Alternative to Markov Chain Monte Carlo,” ArXiv e-prints, May 2012, arXiv:1205.2708. 144 [199] A. Lewis, A. Challinor, and A. Lasenby, “Efficient computation of cosmic microwave background anisotropies in closed Friedmann-Robertson-Walker models,” The Astrophysical Journal, vol. 538, no. 2, pp. 473–476, 2000, arXiv:astro-ph/9911177. 144 [200] D. Larson, J. Dunkley, G. Hinshaw, E. Komatsu, M. R. Nolta, C. L. Bennett, B. Gold, M. Halpern, R. S. Hill, N. Jarosik, A. Kogut, M. Limon, S. S. Meyer, N. Odegard, L. Page, K. M. Smith, D. N. Spergel, G. S. Tucker, J. L. Weiland, E. Wollack, and E. L. Wright, “Seven-year Wilkinson Microwave Anisotropy Probe (WMAP) observations: Power spectra and WMAP-derived parameters,” The Astrophysical Journal Supplement, vol. 192, no. 2, p. 16, 2011, arXiv:1001.4635. 144 [201] D. J. C. MacKay, “Probable networks and plausible predictions - a review of practical Bayesian methods for supervised neural networks,” Network: Computation in Neural Systems, vol. 6, p. 469, 1995. 152, 157 [202] M. Betancourt, “Nested sampling with constrained Hamiltonian Monte Carlo,” in American Institute of Physics Conference Series (A. Mohammad-Djafari, J.-F. Bercher, & P. Bessière, ed.), vol. 1305 of American Institute of Physics Conference Series, (New York, New York), pp. 165–172, American Institute of Physics, 2011, arXiv:1005.0157. 164 [203] J. Sohl-Dickstein, “Hamiltonian Monte Carlo with Reduced Momentum Flips,” ArXiv e-prints, May 2012, arXiv:1205.1939. 164 The End.