Bayesian Methods in Music Modelling

Paul Halliday Peeling
Clare College
December 19, 2010

Declaration

This dissertation is submitted for the degree of Doctor of Philosophy. This dissertation is the result of my own work and includes nothing which is the outcome of work done in collaboration except where specifically indicated in the text. This dissertation does not exceed 50,000 words and 50 figures.

Paul H. Peeling

Abstract

This thesis presents several hierarchical generative Bayesian models of musical signals designed to improve the accuracy of existing multiple pitch detection systems and other musical signal processing applications whilst remaining feasible for real-time computation. At the lowest level the signal is modelled as a set of overlapping sinusoidal basis functions. The parameters of these basis functions are built into a prior framework based on principles known from musical theory and the physics of musical instruments. The model of a musical note optionally includes phenomena such as frequency and amplitude modulations, damping, volume, timbre and inharmonicity. The occurrence of note onsets in a performance of a piece of music is controlled by an underlying tempo process and the alignment of the timings to the underlying score of the music.

A variety of applications are presented for these models under differing inference constraints. Where full Bayesian inference is possible, reversible-jump Markov Chain Monte Carlo is employed to estimate the number of notes and partial frequency components in each frame of music. We also use approximate techniques such as model selection criteria and variational Bayes methods for inference in situations where computation time is limited or the amount of data to be processed is large. For the higher level score parameters, greedy search and conditional modes algorithms are found to be sufficiently accurate.
We emphasize the links between the models and inference algorithms developed in this thesis and those in existing and parallel work, and demonstrate the effects of making modifications to these models both theoretically and by means of experimental results.

Acknowledgments

First of all I would like to thank my supervisor Prof. Simon Godsill, whose insight and direction have guided much of the work in this thesis, and whose attention to detail has been invaluable. Very special thanks to Dr. Taylan Cemgil, for being a constant source of inspiration and encouragement on this project and for providing so many examples and resources which made this possible. I am very grateful to all those with whom I have also collaborated at various points over the last four years: Drs. Sumeetpal Singh, Nick Whiteley, Daniel Clark, Onur Dikmen and Chung-fai Li.

Thanks to all the staff and students in the Signal Processing Laboratory, especially Henry, Jonathan and Simon for on- and off-topic coffee room discussions, and to Janet and Rachel for putting in so much effort to support us all. Thanks also to the folks at Featurespace, especially Bill, Dave and Kirsty, for encouraging me along the way.

My final thanks go to my family and friends, who haven't needed to understand what this thesis is about to guide and support me: Matthew, Glen, Denise, Glen, Elliot, Theodore, Bob, Jon, Ding, Yuwei, Ted, Olly, Thomas. Thanks to Mum, Dad and Katie for their love and support throughout the university years. Most of all, thanks to Angel for caring and loving every time. This thesis is dedicated to our new arrival.
Notation

$y \sim p(y)$                          $y$ is sampled from the probability distribution $p(y)$
$p(y \mid \theta)$                     The conditional probability density of $y$ given $\theta$
$\mathcal{N}(y; \mu, \sigma^2)$        $y$ is normally distributed with mean $\mu$ and variance $\sigma^2$
$\mathcal{N}_C(y; \mu, \sigma^2)$      Complex normal distribution
$A_{m,n}$                              The element of matrix $A$ in the $m$th row and $n$th column
$E_{p(y)}[f(y)]$                       The expectation of the function $f(y)$ under the probability distribution $p(y)$
$\langle f(y) \rangle_{p(y)}$          The expectation of the function $f(y)$ under the probability distribution $p(y)$
$f_0$                                  The fundamental frequency of a musical note
$\operatorname{Tr} A$                  The trace of matrix $A$
$A^{\dagger}$                          Pseudo-inverse of matrix $A$
A440                                   The note A, which has a pitch of 440 Hz
$\hat{\theta}$                         A point estimate of the parameters $\theta$
$y_{1:K}$                              The set of observations $\{y_1, \ldots, y_K\}$

Acronyms

DFT      Discrete Fourier transform
STFT     Short-time Fourier transform (spectrogram)
(M)DCT   (Modified) Discrete Cosine transform
MIDI     Musical Instrument Digital Interface
MIREX    Music Information Retrieval Evaluation eXchange
NMF      Non-negative Matrix Factorization
HMM      Hidden Markov Model
ML       Maximum-likelihood parameter estimate
MAP      Maximum a posteriori parameter estimate
EM       Expectation-maximization algorithm
MCMC     Markov Chain Monte Carlo
MH       Metropolis-Hastings
GMM      Gaussian mixture model
ACF      Autocorrelation function
SNR      Signal-to-noise ratio

Contents

1 Introduction
  1.1 Background
  1.2 Scope of Work
    1.2.1 Psychoacoustics and auditory modelling
    1.2.2 Machine learning techniques
    1.2.3 Generative models
  1.3 Motivation
  1.4 Outline of Thesis
2 Structure of Musical Audio
  2.1 Introduction
  2.2 Perception of Musical Audio
    2.2.1 Dynamics (Loudness)
    2.2.2 Pitch
    2.2.3 Timbre
  2.3 Oscillators
    2.3.1 Terminology
    2.3.2 Stringed Instruments
    2.3.3 Wind Instruments
    2.3.4 Vibrato
  2.4 Harmony, Chords and Key
  2.5 Tempo
  2.6 Conclusion
3 Literature Review
  3.1 Introduction
  3.2 Multipitch Analysis
    3.2.1 Auditory Models
    3.2.2 Spectrum Estimation
    3.2.3 Harmonicity
    3.2.4 Spectral Envelope
    3.2.5 Transform Decomposition Methods
  3.3 Polyphony Tracking
    3.3.1 Attack-Sustain-Decay-Release
    3.3.2 Pitch Evolution
  3.4 Tempo Tracking
  3.5 Conclusion
4 Bayesian Methods for Signal Processing
  4.1 Bayesian Modelling Methods
    4.1.1 Bayes Rule and Model Comparison
      4.1.1.1 Parameter Estimation using Bayes Rule
      4.1.1.2 Marginal Likelihood for Model Comparison
      4.1.1.3 Generative and Discriminative Models
      4.1.1.4 Hierarchical Models
      4.1.1.5 Conjugate Priors
    4.1.2 Bayesian networks
    4.1.3 Hidden Markov Models
    4.1.4 Exponential Family of Probability Distributions
  4.2 Inference Algorithms
    4.2.1 Exact Inference
    4.2.2 Monte Carlo Methods
      4.2.2.1 Markov Chain Monte Carlo
      4.2.2.2 Importance Sampling and Sequential Monte Carlo
    4.2.3 Variational Methods
5 A Signal Model for Pitched Musical Instruments
  5.1 Contributions
  5.2 Model for an Isolated Partial
    5.2.1 Motivation
    5.2.2 Amplitude and Phase Modulation
    5.2.3 Analytic Representation of Sinusoidal and Noise Signals
    5.2.4 State-Space Formulation
    5.2.5 Gabor Model
  5.3 Probabilistic Model for Multiple Partials
    5.3.1 Background
    5.3.2 Noise Model
    5.3.3 State-Space Formulation
    5.3.4 Gabor Model
  5.4 Bayesian Inference using Reversible Jump MCMC
    5.4.1 Proposals and Acceptance Ratios
    5.4.2 Prior Model
    5.4.3 Examples of Reversible Moves
      5.4.3.1 n-increase/decrease
      5.4.3.2 double/halve frequency
      5.4.3.3 note birth/death
  5.5 Results
    5.5.1 Performance on Monophonic Extracts
    5.5.2 Multiple F0 Estimation
  5.6 Conclusion
6 Multiple Pitch Estimation using Non-homogeneous Poisson Processes
  6.1 Introduction
  6.2 Non-homogeneous Poisson Processes
    6.2.1 Frequency-Domain Process
    6.2.2 Superposition
    6.2.3 Evaluation of Likelihood
      6.2.3.1 Exact Calculation
      6.2.3.2 Binning
      6.2.3.3 Censored Frequencies
  6.3 Bayesian Priors
    6.3.1 Fixed Bins
    6.3.2 Gaussian Mixture Model
    6.3.3 Model for mixture weights
  6.4 Signal Model Based Partial Estimation
    6.4.1 Overview
    6.4.2 Bayesian Model Selection Criterion
    6.4.3 Zero-Padding
    6.4.4 Likelihood-Search
  6.5 Polyphonic Pitch Estimation
    6.5.1 Greedy Search
    6.5.2 Estimation of number of notes
    6.5.3 Comparison with State-of-the-Art
  6.6 Conclusion
7 Gaussian Variance Generative Matrix Factorization Models
  7.1 Introduction
  7.2 Gaussian Variance Matrix Factorization Model
    7.2.1 Maximum-likelihood and the EM algorithm
    7.2.2 Expectation Step
    7.2.3 Maximization Step
  7.3 Bayesian Hierarchical Model
    7.3.1 Inference by Variational Bayes
      7.3.1.1 Variational update equations and sufficient statistics
      7.3.1.2 The Variational Bound
      7.3.1.3 Hyperparameter Optimization
    7.3.2 Markov Chain Monte-Carlo
      7.3.2.1 Gibbs Sampler
      7.3.2.2 Metropolis-Hastings
      7.3.2.3 Hyperparameter Optimization
    7.3.3 Importance Sampling
    7.3.4 Consistency of Marginal Likelihood Estimates
  7.4 Musical Audio Analysis
    7.4.1 Model Training
    7.4.2 Source Separation and Visualization
    7.4.3 Model Selection
  7.5 Prior Model for Polyphonic Piano Music
    7.5.1 Model Description
    7.5.2 Algorithm
  7.6 Results
    7.6.1 Comparison
    7.6.2 Implementation
    7.6.3 Evaluation
  7.7 Conclusion
8 A Probabilistic Framework for Inferring Temporal Structure in Music
  8.1 Audio Matching using Generative Models
    8.1.1 Existing Dynamic Time Warping Approach
    8.1.2 Model Statement
    8.1.3 Interpretation of Dynamic Time Warping
  8.2 Score Alignment
    8.2.1 Treatment of Score Events
    8.2.2 Dynamic Time Warping Cost Function
    8.2.3 Hidden Markov Model Formulation
    8.2.4 Inference
    8.2.5 Results
  8.3 Event Based Inference
    8.3.1 Counting of Temporal Events
    8.3.2 Clutter and Missed Detections
    8.3.3 Query-by-Tapping Results
  8.4 Conclusion
9 Conclusion
  9.1 Summary
  9.2 Discussion
  9.3 Further Research
    9.3.1 Improvements to the Gaussian Variance Model
    9.3.2 Frame Boundaries
    9.3.3 Note Envelopes
    9.3.4 High Level Score Priors
Bibliography
A Probability Distributions
  A.1 Normal Distribution
  A.2 Gamma Distribution
  A.3 Inverse-Gamma Distribution
  A.4 Beta Distribution
B Derivation of Results
  B.1 Mode of Posterior Distribution of Signal-to-noise Parameter
  B.2 Posterior over Latent Sources in Gaussian Variance Matrix Factorization Model

List of Figures

2.1 Flow of information in musical production and the auditory system
2.2 Impulse and frequency response for a second-order gammatone filter
2.3 Volume curves used by some Midi implementations
2.4 Comparison of the Mel frequency scale and the Midi definition of pitch
2.5 Frequency response and autocorrelation of a comb filter
3.1 Comparison of Fourier and summary spectra
3.2 Comparison of salience functions obtained from spectra
3.3 Bayesian network representations of tempo models
4.1 Hidden Markov model. Observed random variables have doubled lines
5.1 Comparison of sinc and Hamming basis functions for modelling sinusoids in noise
5.2 Convergence of the model parameters using MCMC
5.3 Comparison of the residual using periodogram and maximum likelihood estimates
6.1 Probability mass function for the Poisson distribution
6.2 Prior on expected number of partials and marginal distribution of number of partials
6.3 Poisson intensity function using a Gaussian mixture model
6.4 Partial estimation results for zero padding method
6.5 Partial estimation results and periodogram estimate for a polyphonic mixture of four notes
7.1 Representations of the single-channel source separation model as a matrix factorization problem
7.2 The inverse-gamma distribution, p(r) = IG(r; a, a) for different a, and scale parameter b = 1
7.3 Template hyperparameters for single source models of piano notes
7.4 Transcription using the Gaussian variance matrix factorization model
7.5 Optimal number of sources for a set of piano notes
7.6 Parameter estimates for the Gaussian variance model from training data
7.7 Parameter estimates for the Poisson model from training data
7.8 Transcription using a priori independent frames
7.9 Transcription using Markov transition probabilities between frames
7.10 Ground truth for the transcription results
7.11 Detection assessment
7.12 Number of errors for the Gaussian variance Markov model by number of notes and error type
8.1 Audio alignment using DTW with note onset costs
8.2 Score alignment using Gaussian variance model
8.3 Inter-onset timings in a query-by-tapping problem

List of Tables

2.1 Intervals and harmonics
5.1 Partial estimation results for real and analytical representation
5.2 Partial estimation results for different basis function and model choices
5.3 Polyphonic pitch estimation using the Bayesian harmonic model
6.1 Polyphonic pitch estimation using the Poisson process model
6.2 Precision and recall using the Poisson process model
6.3 F-measure of multiple pitch estimation on woodwind data
7.1 Frame-level transcription accuracy
7.2 Frame-level transcription results
8.1 Score alignment: median alignment in milliseconds

List of Algorithms

4.1 Generic MCMC
4.2 Metropolis-Hastings Kernel
4.3 Bootstrap Particle Filter
5.1 Gibbs sampler for the state-space model
5.2 Metropolis-Hastings for the Gabor model
6.1 Partial estimation scheme for a frame of audio y with N samples
7.1 Variational Bayes for the Gaussian variance model, with hyperparameter optimization
7.2 Gaussian Variance: algorithm for polyphonic transcription
7.3 Poisson Intensity: algorithm for polyphonic transcription

Chapter 1 Introduction

1.1 Background

This thesis is concerned with Bayesian methods for the modelling of musical signals. Musical signals are rich in structure, and are capable of evoking emotional and aesthetic response in the listener. When processing music, much of this structure is known in advance. Thus we may use Bayesian methods to infer some of the unknown aspects of the musical signal in a signal processing setting.

Bayesian methods center around the application of Bayes' rule to statistical models of observed data $y$. The unknown information about the musical signal $y$ is encapsulated in a set of parameters $\theta$. We define a statistical model $p(y \mid \theta)$, the likelihood, which describes how the signal is related to these parameters. To complete the picture, we define a prior $p(\theta)$ which describes what we know about the parameters before we observe any data.
A Bayesian model may thus be represented by its joint probability distribution

\[ p(y, \theta) = p(y \mid \theta)\, p(\theta). \]

Using the prior and the likelihood, the posterior distribution $p(\theta \mid y)$ of the parameters after we observe the signal is given by Bayes' rule

\[ p(\theta \mid y) = \frac{p(y \mid \theta)\, p(\theta)}{p(y)} \propto p(y \mid \theta)\, p(\theta). \tag{1.1} \]

The quantity $p(y)$ in (1.1) is both the normalization constant of the posterior $p(\theta \mid y)$ and the marginal distribution of the observed data for our particular choice of model $p(y, \theta)$:

\[ p(y) = \int p(y, \theta)\, d\theta. \]

The quantity $p(y)$ is therefore known as the marginal likelihood or the evidence of the data, and may be used to evaluate which model from a range of potential models best explains the data observed.

From a musical signal processing perspective, the unknown parameters $\theta$ represent a hierarchy of musical information. At the highest level we have the cognitive concepts of genre, mood, style and so forth, which may be used to group different pieces of music together. Within a piece of music we have music theoretic constructs such as key, tempo and meter, which are defined globally, and time localized structure such as pitch, tone, harmony and rhythm. At this local level, we can begin to represent the structure in terms of physical entities, such as frequencies, transients and noise.

A complete representation of musical information can be naturally expressed as a hierarchical Bayes model [Gelman, 2004]. For example, we could express the three levels of musical information described above as $\theta_1, \theta_2, \theta_3$, progressively incorporating higher level information, and write the prior as

\[ p(\theta) \equiv p(\theta_1, \theta_2, \theta_3) = p(\theta_1 \mid \theta_2, \theta_3)\, p(\theta_2 \mid \theta_3)\, p(\theta_3). \]

When we are able to draw random samples from the likelihood and prior distributions we may simulate data $y$ under various conditions and assumptions. Such a scheme is known as generative modelling.
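The quantities introduced above (prior, likelihood, posterior, evidence, and simulation from the model) can be illustrated numerically. The following is a minimal sketch only, not a model used in this thesis: it assumes a scalar parameter with a Gaussian prior and a Gaussian likelihood, and the function names `normal_pdf`, `grid_posterior` and `simulate`, as well as all numerical values, are hypothetical choices made for illustration.

```python
import math
import random


def normal_pdf(x, mean, var):
    """Density of the normal distribution N(x; mean, var)."""
    return math.exp(-0.5 * (x - mean) ** 2 / var) / math.sqrt(2.0 * math.pi * var)


def grid_posterior(y, prior_mean=0.0, prior_var=1.0, noise_var=0.5,
                   lo=-10.0, hi=10.0, n=4001):
    """Apply Bayes' rule on a uniform grid of theta values (illustrative assumption).

    Returns the grid, the posterior density on the grid, and the evidence p(y).
    """
    step = (hi - lo) / (n - 1)
    grid = [lo + i * step for i in range(n)]
    # Joint p(y, theta) = likelihood p(y | theta) times prior p(theta).
    joint = [normal_pdf(y, th, noise_var) * normal_pdf(th, prior_mean, prior_var)
             for th in grid]
    # Evidence p(y): integrate the joint over theta with a Riemann sum.
    evidence = sum(joint) * step
    # Posterior p(theta | y): the joint divided by the evidence.
    posterior = [j / evidence for j in joint]
    return grid, posterior, evidence


def simulate(n_samples, prior_mean=0.0, prior_var=1.0, noise_var=0.5, seed=0):
    """Generative modelling: draw theta from the prior, then y from the likelihood."""
    rng = random.Random(seed)
    samples = []
    for _ in range(n_samples):
        theta = rng.gauss(prior_mean, math.sqrt(prior_var))
        y = rng.gauss(theta, math.sqrt(noise_var))
        samples.append((theta, y))
    return samples


y_obs = 1.2
grid, post, evidence = grid_posterior(y_obs)
step = grid[1] - grid[0]
post_mean = sum(t * p for t, p in zip(grid, post)) * step
```

For this conjugate Gaussian case the grid results can be checked against closed forms: the evidence is the density of `y_obs` under a normal distribution with variance `prior_var + noise_var`, and the posterior mean is `y_obs * prior_var / (prior_var + noise_var)`. In the models considered later no such closed form is generally available, which is why sampling and variational methods are needed.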
Generative modelling allows us to evaluate the modelling assumptions not only using mathematical considerations, but also perceptually, by listening to the generated data and assessing it qualitatively. For instance, a generative model for the frequencies contained in a model of musical pitch (which is a perceived characteristic of a musical note) would include definitions of fundamental frequency and harmonicity. As inharmonicity can be substantial in some musical instruments, we would be tempted to put a vague model on how the frequencies of individual harmonics are related to the fundamental. If the model is too vague, we would be able to hear, by listening to simulated data, that the perceived pitch is no longer equal to the desired pitch.

1.2 Scope of Work

The signal processing and modelling of musical signals is a diverse field, with a multitude of approaches, philosophies and goals. Here we present a brief overview of popular approaches, and how they relate to the methods we develop in this thesis.

1.2.1 Psychoacoustics and auditory modelling

These approaches characterize and model the human auditory system. Principles of psychoacoustics underlie present audio coding standards [Ahlzen and Song, 2003]. Music is not merely the production of sound by physical processes, but also includes how these sounds are perceived. Hence any system which processes musical signals, even when based solely on physical models, should consider the perception of musical sounds. We might ask, for instance, whether an automatic music transcription system should transcribe notes which are not audible. Principles known from psychoacoustics include:

• The frequency and dynamic (intensity of sound) response of the ear, including the range of sounds that can be detected and the resolution with which they are perceived.
  Physical measurements of different parts of the ear (from canal to neurones) have been modelled using filter banks and other non-linear signal processing techniques, to compute a summary spectrum, which maps higher order harmonics to lower order harmonics, partially accounting for the ear's ability to perceive pitch. In some current Bayesian systems, the periodogram is used as a spectrum estimator to guide Bayesian inference towards the more likely frequencies and pitches in the signal, serving as a proposal distribution in an MCMC setting (see Section 4.2.2). It is reasonable to suggest that a spectrum estimator based on an auditory model and pitch perception could improve existing methods.

• Masking: a psychoacoustical effect when tones (for example, musical pitches) are presented in such a way that one tone renders the other(s) inaudible [Gelfand, 2004]. For example, a loud tone can mask a quiet tone with a similar frequency or pitch. A related phenomenon is the fusing of two similar sounds into one, such as two tones with equal loudness and similar frequency.

• Perception of the parameters of musical notes such as pitch, loudness and harmonicity. Pitch and loudness, for example, are not perceived linearly, or independently of one another. The Bark scale [Zwicker, 1961], a critical-band frequency scale, is used in models of perceived loudness, and the Mel scale [Stevens et al., 1937] is a perceptual scale of pitch. A well known phenomenon in the perception of harmonicity is the ability of the human ear to reconstruct missing fundamental frequencies [Todd and Loy, 1991], for example when listening to a bassoon.

1.2.2 Machine learning techniques

We may describe an example of this type of approach in general terms as a two-stage procedure:

1. Extracting salient features from frames of audio. The method used to extract the feature depends on the application. Applications which aim to extract frequency related information, such as pitch or harmony, from the music, may start by computing the Fourier transform.
A popular feature computed via the DFT is a chroma vector [Wakefield, 1999], which describes the energy distribution between the 12 notes of the Western pitch scale. This information can then be used in chord recognition [Papadopoulos and Peeters, 2007] or key recognition [Peeters, 2006] applications, by reasoning that notes with higher energy distributed to them will often form a root, third or fifth of the chord. Feature extraction methods which prove successful and useful for a musical signal processing application can often inspire or be adapted in a generative model approach. Considering the example of chroma vectors, the expected distribution of energy across different note groups can be treated as a Bayesian prior for chords and keys in music.

2. Learning and classification algorithms. These algorithms map the features as inputs to the information to be extracted from the music as an output. The majority of these algorithms were not designed specifically for musical signals, but rather map real-valued vectors to labeled classes. The distinction between the feature extraction method and the machine learning algorithm may be considered analogous to the separation of a probabilistic model and the inference algorithm, although machine learning algorithms do assume models of both the features and the labels. Some popular algorithms and their uses include support vector machines for classification [Burges, 1998] and dynamic programming [Rabiner and Juang, 1993] for aligning sequences of music.

1.2.3 Generative models

The scope of a generative model for music, i.e., the actual data being modelled, varies. A model which describes the production of each sample and channel of a digital audio signal may be desirable for high fidelity signal processing; however, sample rates of 44.1 kHz (CD quality) or 48 kHz (digital audio tape) may be unmanageable in terms of computation and inference.
Instead, a generative model is usually defined on a simpler domain, e.g., overlapping frames of audio, and additionally some preprocessing and downsampling may be applied using the Fourier transform. These effects however are often non-invertible, for example taking the magnitude of the Fourier spectrum, and losing the phase continuity between frames of audio. For the signal to be reproduced without audible artefacts then requires additional postprocessing such as using a phase vocoder [Flanagan et al., 1965]. The modified discrete cosine transform (MDCT) which has a 50% overlap between frames retains the phase continuity between frames [Princen et al., 1987] hence its inclusion in the MP3 standard. In this thesis we have chosen to build upon approaches that utilize Bayesian and generative modelling ideas for music. Crucially at the lowest level we restrict ourselves to models proved to be valid for audio signal processing, for in a hierarchical Bayes model musical information is represented by additional levels of hierarchy above that existing for audio signals, because musical signals are a subset of audio signals. For this reason we have adopted a bottom-up approach to the design for our models, and the structure of this thesis is similar. Most Bayesian analysis is carried out in a similar manner: beginning with the definition of the likelihood, and adding levels of hierarchy until the model is deemed sufficient for the purpose. The models that we propose and develop here are not a complete representation of music information. We have focused here particularly on the modelling of pitch and temporal structure, such that we can infer the positions of musical notes both in time and frequency, and their respective dynamics. The information we obtain may be stored in the intermediate Midi format for musical events. Midi (Musical Instrument Digital Interface) is a widely adopted standard available at http://www.midi.org for controlling electronic musical instruments. 
The data transmitted through a Midi interface does not contain the recorded waveforms used to synthesize the music into audio, and therefore compactly represents musical content.

1.3 Motivation

The motivation for this work is not directly related to the classification of music and music recommendation systems. The technology behind these systems is advanced and has led to the development of commercial applications, for example www.Last.FM which is an Internet radio, social network and music recommendation service. That technology does not necessarily involve signal processing or machine listening, as there is abundant information supplied by human listeners, for example in the form of tagging. Listeners are often very willing to provide such information and even gain utility from it, which contributes to the success of these systems. However there is interest in the automatic generation of tags, see for example Eck et al. [2007].

The classification and identification of structure within a piece of music is perceived to be much more laborious, time consuming, subjective and frustrating. Such is the task of a trained musician in music transcription: to write the musical score after listening to extracts of the audio multiple times. Unsurprisingly there are few sources of musical audio labeled with a corresponding accurate Midi transcription readily available. Transcription is not necessarily the final goal: being able to align a Midi file with a performance in the audio is also appropriate for the applications we envisage, as there is a large quantity of Midi files related to the score of a piece of music, but divorced from a real performance of that piece.

Content-based music information retrieval (MIR) is possible using the models presented here.
Some examples of how a system may be used include:

• Notation of improvised music, for example in Western Jazz.

• Preservation and study of music from traditions without a notation system, which instead rely on oral transmission.

• Detailed performance analysis for musicology [Scheirer, 1998, Seashore, 1936]. One example of this is the Mazurka project¹, which has made detailed annotations of the tempo and dynamics of recordings of Chopin's Mazurkas by several performers. These annotations have been used to compare and contrast different styles and approaches to playing the same piece, capturing performance effects such as phrase arching [Cook] and accentuation patterns. This approach has been used to identify historical recordings which have influenced today's performers, and even cases of copyright violation where existing performances have been replayed.

• Visualization of musical structure in a music media player framework. The Sonic Visualizer [Cannam et al., 2006] allows musicians to listen to and study recordings, offering synchronized playback and display of annotations (including Midi) and frequency spectra. This tool is already integrated with an audio alignment system called Match [Dixon and Widmer, 2005].

• Score alignment - matching events in a score with corresponding audio cues in a recording. The two applications mentioned above would be improved with automated score alignment, reducing much of the manual labour involved.

• Score following - tracking the position of a live performance in a score. One motivation for this application arises in present day music compositions which rely on the synchronization of human performers with prerecorded or synthesized electronic music. The current score follower at Ircam is trained during rehearsal time to improve its performance, see for example Schwarz et al. [2005].
Current development is centered on anticipatory score following, see Cont [2009], permitting a higher level of interaction between the performance and the score follower. A similar development is the Music Plus One system [Raphael, 2006] which uses score following to accompany and anticipate a performer. The system is used for example to control the playback of a Music Minus One² recording to assist musicians when practicing a piece of music without live accompaniment.

As we are using generative models, we have also the following applications:

• Source separation. Score guided source separation may be used to produce the Music Minus One recording automatically from a favoured historical recording [Raphael, 2008], a process known as de-soloing. Another use of source separation for musical signals is the separation of polyphonic instruments (for example piano and guitar) into separate channels based on pitch or register, which in a recording studio would be typically recorded on a single channel. This functionality has been demonstrated in Melodyne's Direct Note Access technology³.

¹ http://www.mazurka.org.uk
² A recording of a piece of music for soloist and accompaniment where only the accompaniment is recorded
³ www.celomony.com

• Object coding based music compression and synthesis. As discussed in Plumbley et al. [2002] the recent Mpeg-4 standard for structured audio [Vercoe et al., 1998] provides for an advanced form of parametric coding of musical signals. The labeling required for the coding could be produced with automated music transcription, and the decomposition and synthesis tasks could also be automated using source separation.

• Morphing, reconstruction and digital effects. Bayesian models have been used for audio reconstruction [Godsill, 1997, Cemgil and Godsill, 2005], noise estimation [Godsill, 2009] and enhancement [Wolfe et al., 2003].
These approaches can be readily modified by extending the prior models for general audio to account for the higher level of structure present in music (for example harmonicity). These extensions form a major part of this thesis.

1.4 Outline of Thesis

This introduction has described the motivation and approach to the research covered in this thesis. We have demonstrated our reasons for adopting a generative modelling approach using Bayesian inference, but have reflected on how complementary approaches from psychoacoustic modelling and machine learning can be adopted.

The three following chapters provide the background and foundation for our research. Chapter 2 provides an overview of our current knowledge of the structure of musical audio signals, drawing from music theory, the physics of musical instruments and the propagation of sounds, and psychoacoustics. The material covered in this chapter is the basis and reasoning for having selected the particular Bayesian priors that we use in this research.

Chapter 3 reviews the literature for machine listening of musical signals. Since 2004 a community driven evaluation of machine listening systems for the important applications in this field, known as Mirex⁴ [Downie, 2008], has become prominent. Previously there had been no overall consensus on standardized data sets or evaluation metrics to use for comparing the performance of different systems. In Chapter 3 we use this resource to learn how this area of research has developed and has been influenced over the last few years, and make an objective comparison of the current state-of-the-art technologies.

Chapter 4 introduces the field and methods of Bayesian signal processing, reviewing models and inference methods. Models and inference algorithms used in this thesis will be covered in greater detail in this chapter so that these are collected into one place and are referenced throughout this thesis where they are used.
The remaining chapters describe the new research carried out. Each chapter describes firstly a novel Bayesian model for some level of musical structure in audio. Secondly, a variety of inference methods for these models are developed, and finally the applications of each model are described and results presented using data-sets from the literature review in Chapter 3. The chapters are not fully self contained, as the models used are extensions of models already described in the thesis or in the literature, but the focus of each chapter is on one particular level in a hierarchical Bayes model of musical audio, and the chapters are ordered accordingly: partial frequencies and amplitudes in Chapter 5; the grouping of partials into musical notes in Chapter 6; the harmonic and temporal distributions of time-frequency basis coefficients, together with prior distributions for the volume of notes and their continuity, in Chapter 7; and finally dynamic models for tempo and how the performance of a piece of music moves through the score of that piece in Chapter 8.

⁴ http://www.music-ir.org/MIREX

Chapter 5 describes a generative model for musical audio using the analytic representation of a signal. This model is based on existing Bayesian models for musical audio using sinusoidal and Gabor bases, but modelling the analytic representation has a number of implications for both inference and the prior structure. We also consider common musical effects such as frequency modulations (vibrato) and amplitude modulations (tremolo) and model these in musical notes as multiple partials occupying the same harmonic position. We then use the model and inference methods developed to perform spectrum estimation and polyphonic pitch transcription in musical signals, demonstrating its ability to model vibrato and detect higher order partials.

Chapter 6 describes a Poisson point process model for inferring musical notes and chords from partial frequency estimates.
The model allows for multiple and missed partial detections, and can be applied to the model described in Chapter 5 and also other spectral estimation schemes, both Bayesian and heuristic. The primary advantage of this model is that calculating the likelihood function is computationally inexpensive and inference is straightforward. A simple and intuitive prior is used with a partial estimation scheme using Bayesian model selection on the model in Chapter 5 to produce an effective system for inferring polyphony in short frames of music. Chapter 7 describes modelling the coefficients of a time-frequency transform of a musical signal by variance parameters. The variances are grouped into a matrix, which is composed of harmonic factors across frequency and excitation factors across time, analogous to non-negative matrix factorization (NMF). A number of Bayesian inference procedures are proposed for rapid and efficient inference, which is necessary for processing large amounts of audio data. A number of applications for general musical signal processing are illustrated using these models. We extend the model with prior structures for the volume of a note, and its onset and offset. This allows the inference of a Midi transcription from musical audio. The transcription performance of the model is compared to existing systems using a large selection of synthesized classical piano music. Chapter 8 develops two models for inferring the motion of a hypothetical 'score pointer' through the performance of a piece of music. The first model is a hidden Markov model with a tempo variable describing the probability of moving from one position in the score to the next. The second is a Poisson model counting the expected number of note onset and offsets occurring at each point in the music. Results are presented for score following and query-by-tapping music retrieval applications. 
Chapter 2

Structure of Musical Audio

2.1 Introduction

In this chapter we develop some understanding of musical audio which will aid us considerably in defining models. This chapter covers known and experimentally derived results from physics, psychoacoustics and musical theory, which are a necessary foundation for the rest of this thesis. In Chapter 3 we cover the progress made in using these models for the applications grouped under the encompassing term machine listening.

In analyzing music audio, we need to address both the physical production of the sound and how it is perceived. Most systems for the analysis of musical audio model either the musical instrument or the auditory system in isolation. In reality neither exists in isolation. Sound is produced by the instrument and received by the sensory system (which includes but is not restricted to the auditory system, as the existence of deaf musicians proves). Feedback may exist in the form of performance, as indicated in Figure 2.1.

Physical models of musical instruments are well studied, and although this area of research is by no means dormant or redundant, much of what has already been discovered is of value to us. Fletcher and Rossing [1998] review the physics of musical instruments, and their work forms the basis of our description of pitched musical instruments in Section 2.3. Models of the perception of sound focus firstly on the auditory periphery and secondly on psychoacoustical studies. We will not describe in detail the research in these areas, pointing the reader to Klapuri [2006] for an excellent introduction, but focus on the models that have been developed as a result of this research.

2.2 Perception of Musical Audio

The process of perceiving audio can be divided into two stages. The first, physical, stage converts the pressure waves that transmit sound through the air into electrical signals in the brain.
Central to this is the action of the basilar membrane, which is a stiff structure in the inner ear separating two fluids: the endolymph and the perilymph. One function of the basilar membrane is frequency dispersion: the membrane varies in thickness and stiffness, thus it responds to different frequencies across its length. The variation of location with frequency is described by the Greenwood function [Greenwood, 1961].

Figure 2.1: The flow of information in musical production and the auditory system. Sound is produced by a musical instrument and received by the sensory system. Performance is a feedback route from the sensory system back to the instrument. The flow of information through the auditory system is shown here as a block diagram. In reality many more channels exist than are shown here. Block labels: Musical Instrument, Production of Sound, Acoustics, Sensory System, Performance, Signal, Auditory Filterbank, Neural Transducers, Psychoacoustics, Perception.

The auditory periphery may be modelled by a filter bank, splitting the received signal into a number of channels (Figure 2.1). Each filter has a bandwidth selected to model the frequency selectivity of various parts of the basilar membrane. Experimental results [Patterson et al., 1992] suggest that gammatone filters are a good approximation to the frequency response. Figure 2.2 shows the impulse and frequency response of a second-order gammatone filter. Each channel is then subject to dynamic level compression, half-wave rectification and low pass filtering. These processes are designed to model neural transduction. The dynamic level compression models the loudness response of the ear which is relevant in our discussion of dynamics in Section 2.2.1. The frequency content is then typically analyzed by computing the summary spectrum which is the summation of the spectrum magnitudes across the channel outputs after the low pass filtering operation.
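As an illustration of the front end just described, the following Python sketch generates a gammatone impulse response with the settings quoted for Figure 2.2 (second order, 440Hz centre frequency, 8000Hz sampling rate) and applies the three per-channel operations in the order listed above. It is a minimal sketch under stated assumptions rather than the implementation used in any of the cited systems: the bandwidth rule (1.019 times the Glasberg and Moore equivalent rectangular bandwidth), the power-law compression exponent, the one-pole low pass filter with its 1kHz cutoff, and all function names are illustrative choices of our own.

```python
import math


def erb_bandwidth(fc_hz):
    """Equivalent rectangular bandwidth in Hz (Glasberg-Moore rule, assumed here)."""
    return 24.7 * (4.37 * fc_hz / 1000.0 + 1.0)


def gammatone_impulse_response(fc_hz, fs_hz, order=2, duration_s=0.04):
    """Sampled gammatone impulse response t^(n-1) exp(-2 pi b t) cos(2 pi fc t).

    The bandwidth parameter b = 1.019 * ERB(fc) is an assumption of this sketch.
    The response is normalised so that its largest absolute value is 1.
    """
    b = 1.019 * erb_bandwidth(fc_hz)
    n_samples = int(round(duration_s * fs_hz))
    h = []
    for k in range(n_samples):
        t = k / fs_hz
        envelope = t ** (order - 1) * math.exp(-2.0 * math.pi * b * t)
        h.append(envelope * math.cos(2.0 * math.pi * fc_hz * t))
    peak = max(abs(x) for x in h)
    return [x / peak for x in h]


def neural_transduction(channel, fs_hz, cutoff_hz=1000.0, exponent=0.33):
    """Compress, half-wave rectify and low pass filter one filter bank channel.

    The exponent and cutoff are hypothetical values chosen for illustration.
    """
    alpha = math.exp(-2.0 * math.pi * cutoff_hz / fs_hz)  # one-pole smoothing factor
    out, state = [], 0.0
    for x in channel:
        compressed = math.copysign(abs(x) ** exponent, x)  # dynamic level compression
        rectified = max(compressed, 0.0)                   # half-wave rectification
        state = (1.0 - alpha) * rectified + alpha * state  # low pass filtering
        out.append(state)
    return out
```

Calling `gammatone_impulse_response(440.0, 8000.0)` returns the 320 samples covering 40ms, and passing that list to `neural_transduction` gives a non-negative, smoothed channel output. A summary spectrum would then sum spectrum magnitudes of such outputs over many centre frequencies, a step omitted here.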
A side effect of the half-wave rectification combined with low pass filtering is that the harmonic complexity of a musical signal is reduced: higher order partials are mapped onto lower order partials (see Section 2.3 for an explanation of these terms) which may explain why humans are good at perceiving multiple pitches. One difficulty with applying standard spectral analysis even for monophonic signals is that the fundamental frequency does not necessarily have the largest amplitude. However when the mapping of partials takes place and we observe the summary spectrum (see Figure 3.1b for example), the fundamental does then have the largest amplitude and can be easily identified. A complete auditory model based on the process outlined above appears in Meddis and O'Mard [1997].

The second stage of perception is psychological, occurring within the brain. The first research on the frequency response of the ear was carried out in Fletcher and Munson [1933] where contours of equal subjective loudness (measured in phons) are obtained for different frequencies and sound pressure levels.

Figure 2.2: Impulse and frequency response for a second-order gammatone filter with centre frequency 440Hz and sampling rate 8000Hz. (a) Impulse response and amplitude envelope, amplitude against time in ms. (b) Magnitude in dB and phase in rads of the frequency response, against frequency in Hz.

Figure 2.3 (amplitude scaling against MIDI velocity): Volume curves (2.1) used by some Midi implementations.
β = 1 gives a linear response, Roland's GS standard uses β = 2, a square-law relationship, and β ≈ 1.661 is derived from the rule of thumb that loudness doubles when the sound intensity is increased by a factor of ten.

2.2.1 Dynamics (Loudness)

Dynamics refers to the volume of a musical note, which may have stylistic interpretation, but primarily refers to the note-on velocity Midi event. Typically a particular dynamic notated in a musical score, e.g., forte (loud), is mapped to a range of velocities. The mapping of velocity to a scaling of the note in amplitude is relevant to underlying signal models. Based on the human ear having a perceptual logarithmic response to amplitude variations, GM (General Midi) synthesizers use the following volume curve

$$a = \left( \frac{v}{127} \right)^{\beta} \tag{2.1}$$

which expresses the amplitude scaling a in terms of the note velocity v as a fraction of the maximum velocity 127 allowed by the Midi standard, and a logarithmic response scaling term β. Figure 2.3 plots volume curves for the different values of β for implementations of the Midi standard.

The RWC database [Goto et al., 2003, Goto, 2004] also contains samples at three levels of dynamics: forte, mezzo, piano. Studying the musical instrument samples here can give us a granular yet instructive set of volume curves for different instruments.

The dynamic of a note is not necessarily constant across the duration of the note. Crescendo (gradually getting louder) and diminuendo (gradually getting softer) are typically modelled by synthesizers as linear trajectories in note velocity. Tremolo is a periodic variation of volume, which occurs often with vibrato (Section 2.3.4), the pitch analog of volume oscillations.
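The volume curve (2.1) and the origin of the value β ≈ 1.661 can be checked with a few lines of Python. The sketch below is illustrative only, and the function names are our own. The derivation of β rests on two assumptions that are our reading of the rule of thumb quoted in the caption of Figure 2.3: that sound intensity is proportional to the square of the amplitude, and that the velocity fraction v/127 is treated as proportional to perceived loudness. If loudness doubles for every tenfold increase in intensity, then loudness is proportional to intensity raised to the power log10(2), hence to amplitude raised to the power 2 log10(2), which gives β = 1 / (2 log10(2)).

```python
import math

MAX_VELOCITY = 127  # largest note-on velocity allowed by the Midi standard


def amplitude_scaling(velocity, beta):
    """Volume curve (2.1): amplitude scaling a = (v / 127) ** beta."""
    if not 0 <= velocity <= MAX_VELOCITY:
        raise ValueError("velocity must lie between 0 and 127")
    return (velocity / MAX_VELOCITY) ** beta


def loudness_doubling_beta():
    """Exponent implied by 'loudness doubles per tenfold intensity increase'.

    Assumes intensity is proportional to amplitude squared and that the
    velocity fraction is proportional to loudness (assumptions of this sketch).
    """
    return 1.0 / (2.0 * math.log10(2.0))


if __name__ == "__main__":
    for beta in (1.0, loudness_doubling_beta(), 2.0):
        curve = [round(amplitude_scaling(v, beta), 3) for v in (32, 64, 96, 127)]
        print(f"beta = {beta:.3f}: {curve}")
```

For any β greater than one the curve lies below the linear response at intermediate velocities, which is the behaviour visible in Figure 2.3.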
Figure 2.4: Comparison of the Mel frequency scale (2.2) and the Midi definition of pitch (2.3). Both Mel frequency and MIDI pitch are plotted against frequency in Hz.

2.2.2 Pitch

Pitch is the perceived fundamental frequency of a musical note. This perception can deviate substantially from physical reality, especially when the sounds are not quasiperiodic. Experimental studies have shown that missing fundamentals and harmonics are tolerated, as is a substantial amount of inharmonicity [Plack et al., 2005]. Hence we typically distinguish between pitch estimation and fundamental frequency estimation. Pitch estimation may not therefore directly rely on physical signal models, but rather on a more approximate representation of a fundamental frequency.

One strong argument in favour of this is that a listener is able to group musical notes by pitch independently of the timbre of that note. A listener is able to identify a piano, a bell, a tympanum and filtered white noise as having a common pitch. Moorer [1977] defines the pitch of a sound as some property that allows it to be matched to a sine wave with a particular frequency. A sine wave itself has no harmonic structure whatsoever, yet it has an identifiable pitch. Hence although the harmonic structure of a musical note may be invaluable to identifying the pitch of the note because the instrument is approximately a harmonic oscillator (Section 2.3), harmonicity is not an adequate description of pitch itself.

Pitch perception is not independent of the volume either. The Mel frequency scale [Stevens et al., 1937], which was arrived at by comparing perceived intervals of equal pitch with frequency intervals, illustrates that the perception of pitch is non-linear. A general rule is that pitch decreases with increased loudness.
One formulation of the Mel frequency scale is

$$m = 1127 \log_e \left( \frac{f}{700} + 1 \right) \tag{2.2}$$

As with many parametric models in signal processing, pitch estimation may be approached in the time domain or in the frequency domain. Autocorrelation function (ACF) based methods detect periodicity in a non-linear transformation of the Fourier spectrum of a signal. As such, regularly spaced partial frequencies in the spectrum are detected by an increase in the energy distribution around those frequencies. Comb filters add a delayed copy of a signal to itself, causing periodicity in the signal to be constructive when the delay is correctly specified. The lag at which the maximum of the ACF or the comb filter response occurs corresponds to a dominant fundamental frequency in the signal, and is used as an estimate of the pitch, as illustrated in Figure 2.5. Klapuri [2004] shows that comb filter solutions exhibit both better SNR and pitch detection range than ACF methods, but at the cost of additional computation.

Despite our above discussion concerning the imprecise relationship between pitch and fundamental frequency, we still need to mention that the MIDI standard has adopted the following formula to map between Western tonal musical pitches and fundamental frequency:

$$p = 69 + 12 \log_2 \left( \frac{f_0}{440} \right) \tag{2.3}$$

where p is a pitch such that there are 12 integer pitches per octave, and p = 69 corresponds to A at 440Hz, the standardized reference pitch. A semitone is the difference between two adjacent integer pitches, and a cent is 1/100 of this interval.

2.2.3 Timbre

Timbre is another grouping category of music perception which can stand independent of dynamics and pitch. In terms of frequency analysis, because pitch refers to a specific frequency, timbre may only refer to the overall spectral profile. This is backed up by studies which have shown that timbre is related to the relative amplitudes of the partials. Inharmonicity in the partials also affects timbre.
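As an aside to the two frequency mappings of Section 2.2.2, the Mel formulation (2.2) and the MIDI pitch formula (2.3) are simple enough to check numerically. The Python sketch below implements both mappings, their inverses, and the interval in cents between two frequencies; the function names are our own and the code is illustrative rather than part of any cited system.

```python
import math


def hz_to_mel(f_hz):
    """Mel frequency (2.2): m = 1127 ln(f / 700 + 1)."""
    return 1127.0 * math.log(f_hz / 700.0 + 1.0)


def mel_to_hz(m):
    """Inverse of (2.2)."""
    return 700.0 * (math.exp(m / 1127.0) - 1.0)


def hz_to_midi_pitch(f0_hz):
    """MIDI pitch (2.3): p = 69 + 12 log2(f0 / 440). Non-integer values are allowed."""
    return 69.0 + 12.0 * math.log2(f0_hz / 440.0)


def midi_pitch_to_hz(p):
    """Inverse of (2.3)."""
    return 440.0 * 2.0 ** ((p - 69.0) / 12.0)


def cents_between(f_a_hz, f_b_hz):
    """Interval from f_a to f_b in cents (100 cents per semitone)."""
    return 1200.0 * math.log2(f_b_hz / f_a_hz)


if __name__ == "__main__":
    for p in (57, 69, 81):
        f = midi_pitch_to_hz(p)
        print(f"MIDI pitch {p}: {f:.1f} Hz, {hz_to_mel(f):.1f} mel")
```

Each octave adds exactly 12 to the MIDI pitch, whereas the corresponding step in Mel frequency changes with register, which is the non-linearity that Figure 2.4 compares.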
Away from frequency domain considerations, the timbre also can be characterized by the onset of the note, which is typically unpitched due to the non-linear activation mechanism to generate the sound.

2.3 Oscillators

The perception of a pitched sound requires some periodicity in the signal. The most common means of producing periodicity is by a mechanical system known as an oscillator. Oscillators are used in a great variety of musical instruments.

2.3.1 Terminology

The fundamental frequency often corresponds to the lowest partial frequency in a signal, which may be the oscillation across the entire string length for a string instrument, or the air column for a wind instrument. Other partial frequencies occur at the resonant frequencies of the instrument. Because of the relative spacing of the resonant frequencies, partial frequencies are often integer multiples of the fundamental frequency. Inharmonicity is often measured as the deviation in cents of a partial from the closest ideal harmonic of the fundamental frequency.

Figure 2.5: Frequency response and autocorrelation of a comb filter with 10 samples delay. The maximum of the autocorrelation occurs at zero lag, but the peak at 10 samples corresponds to the fundamental frequency of the signal. (a) Magnitude response of the comb filter, magnitude in dB against normalised frequency (×π rads / sample). (b) Autocorrelation of a white noise signal filtered by the comb filter, against lag in samples.

2.3.2 Stringed Instruments

For a string to be a harmonic oscillator the requirement is that it must be homogeneous, infinitely thin and flexible. In this case, a string will exhibit a fundamental frequency related to its length and partial frequencies at exactly integer multiples of the fundamental.
This is rarely the method used to construct stringed instruments, simply because the space required for the string lengths and their transverse vibrations is impractical for low pitches. In these cases, the stiffness of the string is increased, so that the string vibrates more slowly. The result of this however is that inharmonicity is introduced. Stiffness is increased by using thicker strings, or winding strings, or increasing the tension applied to the string. Hence for stringed instruments such as the piano, guitar, harp and members of the violin family played pizzicato (plucking the string), inharmonicity is present. For a piano, the following formula relates the harmonic frequencies $f_h$ to the fundamental $f_0$ [Fletcher and Rossing, 1998]

$$f_h = h f_0 \sqrt{1 + h^2 B} \tag{2.4}$$

where h is the harmonic number and B is the inharmonicity coefficient.

In the case of a bowed violin family instrument, the non-linear action of the bow on the string negates the effect of the inharmonicity. The partial frequencies are driven to be multiples of the fundamental frequency. The fundamental however is not precisely determined by the length of the string, a fact which is taken into account when a violinist plays the instrument.

2.3.3 Wind Instruments

The resonator in a wind instrument is an air column, the length of which is controlled by various means. A partially open pipe has end effects where the acoustic length of the pipe differs from the geometric length. However the effect varies with frequency, giving rise to inharmonicity. There are exceptions: flutes and clarinets have been experimentally determined to have harmonic partial frequencies. This is due to the sound-generating mechanism not being linear whilst the resonator itself is (approximately) so. The system exhibits mode locking behaviour which forces partial frequencies to assume their ideal positions.
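Returning briefly to the stiff-string relation (2.4) of Section 2.3.2, the size of the inharmonicity it implies is easiest to appreciate in cents. The Python sketch below evaluates (2.4) and the deviation of partial h from the ideal harmonic h f0, which simplifies to 600 log2(1 + h²B) cents. The value of B used in the usage example is hypothetical, chosen only for illustration and not measured from any instrument, and the function names are our own. Note that the deviation is measured from the harmonic with the same index h, which coincides with the closest ideal harmonic only while the deviation remains small.

```python
import math


def partial_frequency(h, f0_hz, inharmonicity_b):
    """Stiff-string partial frequency (2.4): f_h = h f0 sqrt(1 + h^2 B)."""
    return h * f0_hz * math.sqrt(1.0 + h * h * inharmonicity_b)


def inharmonicity_cents(h, inharmonicity_b):
    """Deviation of partial h from the ideal harmonic h f0, in cents.

    1200 log2(f_h / (h f0)) reduces to 600 log2(1 + h^2 B).
    """
    return 600.0 * math.log2(1.0 + h * h * inharmonicity_b)


if __name__ == "__main__":
    f0 = 220.0
    b = 4e-4  # hypothetical inharmonicity coefficient, for illustration only
    for h in range(1, 11):
        print(f"h = {h:2d}: {partial_frequency(h, f0, b):8.2f} Hz, "
              f"{inharmonicity_cents(h, b):6.2f} cents sharp")
```

Because the deviation grows with h², the higher partials are displaced the most, which is one reason a prior on partial frequencies that is too tight around exact integer multiples can miss them.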
Other interesting points about wind instruments are that the fundamental frequency of a bassoon is nearly always absent; and even-numbered harmonics of a clarinet are suppressed because of the cylindrical shape of the resonator [Barthet et al., 2005]. The saxophone, which is related to the clarinet, having a conical resonator, does not suppress the even harmonics.

2.3.4 Vibrato

Vibrato refers to the periodic variation in pitch around a note, and as such is characterized by its depth: the amplitude of the pitch variation, and the speed of the variation. Brown and Vaughn [1996] have determined experimentally that the perceived pitch centre of a note with vibrato is the mean value of the perceived frequency of the sound. Vibrato is not purely a frequency modulation, as it is difficult to perform this effect without some degree of amplitude modulation, as shown by Arroabarren et al. [2003]. Deliberate amplitude modulation in the absence of vibrato in a musical performance is known as tremolo.

Ratio   Harmonic   Name             Example
2:1     2          Octave           C
3:2     3          Perfect Fifth    G
4:3     -          Perfect Fourth   F
5:4     5          Major Third      E
6:5     6          Minor Third      E♭
9:8     9          Major Second     D

Table 2.1: Intervals and harmonics

Vibrato is subject to the mechanical limitations of the instrument, physical limitations of the player, and stylistic rules. For some instruments vibrato is not possible to produce. For the violin, Geringer and Allen [2004] have investigated vibrato performance across students, finding that the speed of vibrato was approximately 5.5Hz and independent of the experience of the performer. A more detailed study by MacLeod [2008] suggests that vibrato speed is a function of the pitch and dynamic, whilst the depth is a function of the dynamic only. Similar empirical studies have been carried out for wind instruments and the human voice [Prame, 1994].

2.4 Harmony, Chords and Key

Chords are a grouping of simultaneous pitches, and harmony is the relationship between these pitches to create chords.
The dual use of the word harmony in describing the relationship between partial frequencies and also notes within a chord is deliberate. Table 2.1 is based on the Pythagorean scale of just intonation. In practice, Western music has generally adopted a tempered system, where the pitch ratios are approximate, being logarithmically spaced with 12 semitones per octave. The pitch ratios in Table 2.1 can be achieved on instruments which do not have fixed tuning. The seventh harmonic is omitted here as the interval suggested is dissonant (the same is true for the eleventh harmonic). This particular overtone is often avoided in the production of sounds, as it is not sufficiently close to the pitch B♭ related to C.

Consonance is the pleasantness with which different pitches sound together, and it is usually attributed to notes sharing harmonics. One rule for defining consonance is to consider notes in the chord pairwise. If the lowest overlapping harmonic is the eighth or less, the chord is consonant. This accounts for the prevalence of major and minor triads. A chord possessing a semitone interval is always dissonant. The number of possible chords in Western music is reduced massively when semitone intervals are rejected. Overlapping harmonics are thus common in polyphonic music, and are also one of the difficulties that signal processing algorithms must overcome when attempting to resolve separate notes from within a chord.

Favoured chord progressions are those which involve the least number of modifications to pitches present within a chord. For example, chord progressions involving I, IV and V chords are extremely common, each only involving one note out of three to be modified. The concept of a key derives from the chord progressions.

2.5 Tempo

The rhythm of Western music is often felt as a regularly occurring pulse, known also as the beat or tactus. The rate at which beats occur in music is known as the tempo, measured in beats per minute (BPM).
Consecutive groups of typically two, three or four beats are known as measures or bars. The properties of this grouping are known as the meter of the music. Subdivisions of the tactus, corresponding to the shortest musical note durations (semiquavers for example), are referred to as the tatum. Generative models of tempo are presented in Section 3.4.

2.6 Conclusion

In this chapter we have briefly covered several topics in psychoacoustics and music theory. The goal of this review has been to describe a priori structure in musical audio. The technology we review in Chapter 3 and the methods we develop in this thesis all make use of this prior knowledge to infer musical information from audio. In Chapter 3 we will study the applications and implementations of the models presented here for inferring notes and tempo in musical audio.

Some of the results in this chapter may be applied to musical signal processing directly. The auditory model described in Section 2.2 is based on physical evidence and experimental results. When this model is applied in the literature, there are some variations in its implementation, such as the choice of the number of gammatone filters to use in the filter bank, or how the summary spectrum is computed. These variations are motivated by the need to trade computational efficiency, when processing multiple output channels from the filter bank, against the faithfulness of the model to the ear. On the other hand, only general principles and theories are available from psychological evidence when we consider how the brain perceives pitch. The pitch estimation stages of systems in the literature therefore vary widely in both the model they use and their implementations: for example, compare the use of comb filters and the autocorrelation function and the different models they imply. Ultimately we must compare these systems by measuring and evaluating their performance against that of a human listener.
A similar situation, where we have both reliable experimental knowledge and only general principles, arises when modelling the sound of a musical instrument. For an inharmonic stringed instrument, we have a model of the harmonic positions of the partial frequencies given by (2.4), with one unknown parameter which can be estimated experimentally [Godsill and Davy, 2005]. However the sound of a note from this instrument may come with a performance effect like vibrato. As described in Section 2.3.4, experimental studies of vibrato are limited and have only provided us with some limitations on the depth and speed of vibrato, rather than precise measures.

In conclusion we have a set of models describing both the structure of musical audio and how it is perceived and interpreted. These models at present have different levels of certainty because of the process by which they were arrived at. Although we expect models to improve in accuracy with further investigation and experimentation, there will always be uncertainty due to the human element in the performance and reception of music. A flexible musical signal processing system needs to take into account the knowledge covered in this chapter and the certainty we have about it. Moreover, the system should be able to incorporate existing information where it is available. For example, if the key of a piece of music is known beforehand, say as determined by a human listener, then the system may take advantage of this prior knowledge to estimate chords and chord progressions and consequently improve transcription performance.

Chapter 3 Literature Review

3.1 Introduction

Machine listening of music audio [Scheirer, 2000] refers to the processing of digital audio signals to extract information within a music information retrieval (MIR) context. It differs from, and complements, semantic information which relies on information supplied by human listeners.
As the previous chapter has shown, the modelling of musical audio is not a straightforward task. Physical models of musical instruments are approximate at best, and evaluation using perceptual models is at present still focused on reducing computational expense whilst remaining realistic. Physical models for example may have large numbers of parameters for each instant in time: partial frequencies, amplitudes, noise and so on, making inference using these models expensive until recently. As a result, many authors have resorted to approximations of these models, which have resulted in a wide array of algorithms in the literature. Moreover, evaluation of machine listening systems is not necessarily straightforward. Evaluation is crucial to performing comparative research and understanding which models and techniques are most appropriate in given situations. Often the presentation of results in publications can be biased by the researcher's particular choice of evaluation criteria, and the conclusions therefore misleading. Taking the example of music transcription, a suitable method of evaluation would involve taking a set of trained musicians and asking them to subjectively assess and rank the transcription outputs of different systems. This type of evaluation involves considerable expense and time, and is not feasible for studies on large libraries of music. An automated approach to evaluation requires choosing libraries of music for which there is a ground truth, such as a reliable Midi file with its events aligned in time to the audio. Assuming such a ground truth exists, we need to consider how the performance of a transcription system can be measured. Poliner and Ellis [2007] state that even simple measures such as frame-level transcription accuracy can be biased by reporting too many notes. An event based metric, such as edit distance, or the measure used by the authors mentioned, is more appropriate given that music may be regarded as a stream of note events. 
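To make the idea of frame-level scoring concrete, the sketch below counts true positives, false positives and false negatives over frames, each frame being a set of note numbers, and derives precision, recall and one common overall accuracy ratio. This is an illustrative assumption about how such counts may be formed, not the specific metric of Poliner and Ellis [2007]; all names are hypothetical.

```python
def frame_counts(reference: list[set[int]], estimate: list[set[int]]) -> tuple[int, int, int]:
    """Total (true positives, false positives, false negatives) over aligned frames."""
    tp = fp = fn = 0
    for ref, est in zip(reference, estimate):
        tp += len(ref & est)   # notes reported and present
        fp += len(est - ref)   # notes reported but absent
        fn += len(ref - est)   # notes present but missed
    return tp, fp, fn


def frame_scores(reference: list[set[int]], estimate: list[set[int]]) -> dict[str, float]:
    """Precision, recall and TP / (TP + FP + FN) computed from the frame counts."""
    tp, fp, fn = frame_counts(reference, estimate)
    return {
        "precision": tp / (tp + fp) if tp + fp else 0.0,
        "recall": tp / (tp + fn) if tp + fn else 0.0,
        "accuracy": tp / (tp + fp + fn) if tp + fp + fn else 0.0,
    }


if __name__ == "__main__":
    reference = [{60, 64}, {60, 64}, {62}]
    over_reporting = [{60, 62, 64}] * 3  # reports every candidate note in every frame
    print(frame_scores(reference, over_reporting))
```

In this toy case the over-reporting system achieves perfect recall while its precision drops, which illustrates why a single frame-level figure can flatter a system that reports too many notes.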
Poliner and Ellis [2007] also note that, at least in the case of polyphonic piano music which they study, the position of the note onsets is more important than the duration and release of the notes.

Some form of consensus is being arrived at in the form of the annual Music Information Retrieval Evaluation eXchange (Mirex)¹ [Downie, 2008] contest, which consists of a community-driven open evaluation of research systems on a selection of evaluation tasks deemed to be of interest. Mirex is gaining more credibility in the research community: a good or improved performance at one of the evaluation tasks is now considered acceptable as comparative research. This chapter will focus on those areas of interest highlighted at Mirex which are relevant to this thesis, and will describe the models and algorithms that have been shown to be suitable to the processing tasks they are designed for. We will demonstrate that the methods which have their basis and justification in the concepts described in Chapter 2 are also the methods which perform best in these comparative evaluations.

3.2 Multipitch Analysis

The goal of automatic music transcription systems is to correctly infer the notes being played in a polyphonic piece of music, producing an intermediate representation, such as Midi, from an audio track. The inference of polyphonic music may be broken into two tasks:

1. Estimating multiple pitches / fundamental frequencies in individual frames of music. We focus on this task in this section.
2. Tracking the pitches as note contours over consecutive frames, also filtering and smoothing the estimates in the frames. This aspect of multipitch analysis is the focus of Section 3.3.

In Mirex the multiple fundamental frequency estimation and tracking task evaluates frequencies as correct when they are within a semitone of the ground truth.
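The tolerance rule just stated can be sketched as follows. The tolerance is exposed as a parameter, and treating "within a semitone" as an absolute distance of at most one semitone on a logarithmic scale is an assumption of this sketch; the helper that snaps a frequency to the nearest equal-tempered note (with A4 = 440 Hz as Midi note 69) is likewise illustrative.

```python
import math


def semitone_distance(f_est: float, f_ref: float) -> float:
    """Absolute distance between two frequencies in equal-tempered semitones."""
    return abs(12.0 * math.log2(f_est / f_ref))


def is_within_tolerance(f_est: float, f_ref: float, tolerance_semitones: float = 1.0) -> bool:
    """True if the estimate lies within the given tolerance of the reference."""
    return semitone_distance(f_est, f_ref) <= tolerance_semitones


def nearest_note_frequency(f: float, a4: float = 440.0) -> float:
    """Frequency of the equal-tempered note closest to f (A4 is note number 69)."""
    note = round(69 + 12.0 * math.log2(f / a4))
    return a4 * 2.0 ** ((note - 69) / 12.0)


if __name__ == "__main__":
    truth = 445.0  # hypothetical, slightly sharp A4
    fine = 444.8   # a finely estimated fundamental
    coarse = nearest_note_frequency(truth)  # an ideal note frequency
    print(is_within_tolerance(fine, truth), is_within_tolerance(coarse, truth))
```

Both the finely estimated and the quantized frequency pass the test, so a tolerance of this size cannot distinguish between them.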
The accuracy of the estimation of the fundamental frequencies is not evaluated: a system which reports ideal fundamental frequencies corresponding to the MIDI standard will be ranked equally with one which finely estimates the fundamental frequencies. For physical models to perform well on such a task requires that the model of the audio fits the observed data very well, and such tractable models have only become feasible in recent years. Auditory models on the other hand will in general perform quite well even with simple implementations, because the definition of pitch is looser than that of fundamental frequency.

¹ http://www.music-ir.org/MIREX

3.2.1 Auditory Models

The best performing auditory model, both at Mirex and in other comparisons such as Poliner and Ellis [2007], is that of Klapuri [2008]. The unitary model of pitch perception described in Section 2.2 was first used by Tolonen and Karjalainen [2000] for multipitch analysis using an autocorrelation method. Klapuri [2008] uses a comb filter which analyzes the periodicity of the summary spectrum (see Section 2.2). The output of the comb filter is used to select fundamental frequency candidates. The comb filter weights partial $m$ with fundamental frequency $f_0$ using the formula

$$\frac{f_0 + \epsilon_1}{m f_0 + \epsilon_2} \tag{3.1}$$

where $\epsilon_1 \approx 20$ Hz and $\epsilon_2 \approx 320$ Hz, and the number of partials is limited to 20. This weighting ascribes more importance to partials with lower harmonic index than higher, as many of these partials have already been mapped to lower frequencies due to the auditory model processing.

In Figure 3.1b we compare the summary spectrum which is obtained from the output of the auditory model with the power spectrum obtained using Fourier analysis in Figure 3.1a. In Figure 3.2b and Figure 3.2a we compare the salience functions which are obtained from the periodicity analysis by the comb filter. There are pronounced peaks in the salience functions of both spectra at the correct pitches.
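The weighting (3.1) can be sketched as below. The weight function follows the formula directly; the salience function, which sums weighted spectrum values at the bins nearest to each harmonic position, is a simplifying assumption of this sketch rather than the exact procedure of Klapuri [2008], and the names `eps1`, `eps2` and `bin_hz` are illustrative.

```python
def partial_weight(m: int, f0: float, eps1: float = 20.0, eps2: float = 320.0) -> float:
    """Weight of partial m for fundamental f0 (Hz), as in (3.1)."""
    return (f0 + eps1) / (m * f0 + eps2)


def salience(f0: float, spectrum: list[float], bin_hz: float, n_partials: int = 20) -> float:
    """Weighted sum of spectrum values at the harmonic positions of f0.

    Assumption: spectrum[k] holds the magnitude at frequency k * bin_hz.
    """
    total = 0.0
    for m in range(1, n_partials + 1):
        k = round(m * f0 / bin_hz)  # nearest bin to the m-th harmonic
        if k >= len(spectrum):
            break
        total += partial_weight(m, f0) * spectrum[k]
    return total


if __name__ == "__main__":
    bin_hz = 10.0
    spectrum = [0.0] * 200
    for k in (20, 40, 60):  # toy note at 200 Hz with three equal partials
        spectrum[k] = 1.0
    for f0 in (100.0, 200.0, 300.0):
        print(f0, round(salience(f0, spectrum, bin_hz), 3))
```

In the toy spectrum the candidate at 200 Hz scores higher than the sub-octave candidate at 100 Hz, because the weights decay with the harmonic index $m$.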
However, only the salience function derived from the auditory model ranks the true pitches in first and second position; the salience function derived from the power spectrum ranks the third harmonic of one of the notes above one of the true pitches. This improvement in multiple pitch estimation is due to the aforementioned higher-order partial suppression which is a property of the auditory model.

The estimation of multiple pitches is iterative. At each iteration the pitch with the highest salience is chosen and then the partials corresponding to this frequency are subtracted from the summary spectrum using (3.3). A heuristic formula is used for evaluating the point at which the algorithm should stop, outputting the number of notes that have been detected.

The above model is an example of an iterative estimation algorithm, where a preferred predominant fundamental frequency candidate is determined, and then its contribution to the spectrum is cancelled. The algorithm iterates until a stopping criterion is met. This scheme requires less effort than jointly estimating fundamental frequency candidates and also has justification from a psychoacoustical perspective [Bregman, 1990, Hartmann, 1996]. Klapuri [2003] refers to this as predominant F0 estimation.

3.2.2 Spectrum Estimation

In this section we look at algorithms based on sinusoidal models. In the recent Mirex evaluations of 2008 and 2009 sinusoidal models have shown the best performance, whereas in previous years auditory models outperformed other schemes. Sinusoidal models are used to identify multiple frequencies in musical signals, and the algorithm identifies which of those frequencies are the fundamentals. In early work such as Maher and Beauchamp [1995], estimating multiple frequencies, i.e., spectrum estimation, was performed in isolation from the task of detecting harmonic structure in the frequency estimates and thus identifying the fundamental frequencies.
Spectrum estimation has been a popular area of research in the signal processing community for years, and there are many algorithms to choose from. However their ability to detect all of the frequencies of interest in a complex polyphonic signal is limited. The successful approaches that we cover in this section incorporate known priors of musical signals such as harmonicity (Section 3.2.3) and spectral smoothness into their spectrum estimation algorithms.

After obtaining the sinusoidal representation of the signal, the algorithm must then determine which partial frequencies belong to which pitch, and how many pitches are in the frame. The problem of labelling each frequency is combinatorial in nature, especially when the number of pitches is not known, as is normally the case. Hence the algorithm may only perform a limited search through possible combinations of fundamental frequencies.

Spectrum estimation may be carried out in the time domain or the frequency domain. The motivation for frequency domain approaches arises from the result by Bretthorst [1989] that the maximum of the periodogram spectrum estimator gives the frequency of a single sinusoid embedded in white noise. Although the result does not hold for multiple sinusoids, a popular approach picks local maxima of the periodogram above a noise floor threshold. One approach for estimating the noise floor in this context is to compute the statistics of the local minima of the periodogram [Martin, 2001], as the minima are expected to be noise components of the signal rather than the sinusoidal components.

[Figure 3.1: Comparison of Fourier and summary spectra. Axes: magnitude / dB against frequency / Hz. (a) Fourier spectrum for a mixture of two notes with pitch 1 and 5 relative to A440. (b) Summary spectrum obtained using an auditory model from the same signal as Figure 3.1a. The higher order partial frequencies have been suppressed due to the half-wave rectification and low pass filtering applied by the model.]

[Figure 3.2: Comparison of salience functions obtained from spectra. Axes: salience against pitch relative to A440. (a) Salience function obtained from Fourier spectrum in Figure 3.1a. Pitch 5 is correctly identified with the highest salience, however pitch 20 (the third harmonic of pitch 5) is ranked with a higher salience than the correct pitch 1. (b) Salience function obtained from summary spectrum. Pitches 1 and 5 are correctly ranked with the highest salience, even before the contribution of pitch 1 is removed from the spectrum in an iterative F0 estimation procedure.]

Bayesian spectrum estimation on the other hand makes use of an explicit signal model, i.e., sum of sinusoids, and a noise model. The principles used to design the systems are similar:

1. Perform spectrum estimation: i.e., identify a number of frequencies present in the noise-corrupted signal. This may be carried out using an explicit signal model, or by extraction from the coefficients in a transform domain of the signal.
2. Identify and label fundamental frequencies and corresponding partial frequencies. A set of possible candidates is generated in this step.
3. Select the candidate set with the simplest explanation.

Stages 2 and 3 may be combined into a single step as the principles used to identify the frequencies are typically rules based on harmonicity and correlation between the amplitudes of the detected partials. For example, Yeh et al. [2005] assume a sinusoidal model with multiplicative detuning parameters for each partial frequency. The authors prefer candidates with smaller detuning parameters.
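Stage 1 in its transform-domain form, picking local maxima of the periodogram above a noise floor threshold, can be sketched as follows. The direct DFT is used only to keep the example free of dependencies, and the threshold is passed in by hand; estimating it from the local minima, as in Martin [2001], is not attempted here. All names are illustrative.

```python
import cmath
import math


def periodogram(x: list[float]) -> list[float]:
    """Periodogram |X_k|^2 / N for bins 0..N/2, computed with a direct DFT."""
    n = len(x)
    power = []
    for k in range(n // 2 + 1):
        xk = sum(x[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n))
        power.append(abs(xk) ** 2 / n)
    return power


def pick_peaks(power: list[float], threshold: float) -> list[int]:
    """Indices of strict local maxima of the periodogram that exceed the threshold."""
    return [
        k
        for k in range(1, len(power) - 1)
        if power[k] > threshold and power[k] > power[k - 1] and power[k] > power[k + 1]
    ]


if __name__ == "__main__":
    n = 64
    # Toy signal: two sinusoids centred exactly on bins 5 and 12.
    x = [
        math.sin(2 * math.pi * 5 * t / n) + 0.5 * math.sin(2 * math.pi * 12 * t / n)
        for t in range(n)
    ]
    print(pick_peaks(periodogram(x), threshold=1.0))
```

The detected bin indices would then be passed to stages 2 and 3 for labelling and candidate selection.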
In a Bayesian setting, a preference for smaller detuning parameters corresponds to a prior which assigns greater probability mass to smaller detuning parameters. Pertusa and Inesta [2008] apply the principle of Gaussian smoothness to the partial amplitudes, which is similar to treating the evolution of the amplitudes as a linear dynamical system.

The preference for the simplest explanation for an observation is encapsulated by Occam's razor, and is naturally applied by means of Bayesian model selection (Section 4.1). This principle can be used to guide the design of the models for musical signals, and may be used to select between competing models in a mathematically rigorous manner.

It is encouraging to see that signal models are indeed able to outperform auditory models for multiple pitch recognition. One goal of this thesis will be to set the signal models described above in a fully Bayesian framework, using the principles described in Chapter 2. In so doing, we rely heavily on the work carried out for Bayesian spectrum estimation in Andrieu and Doucet [1999] and its extension to polyphonic music in Davy et al. [2006].

3.2.3 Harmonicity

An explicit model for the inharmonicity of stringed instruments is given by (2.4). The inharmonicity parameter is estimated as part of a Bayesian modelling scheme by Godsill and Davy [2005], and an estimation scheme is given as part of the multi-pitch extraction system of Klapuri [2003]. In addition, there exist models for inharmonicity in generic musical instruments, which we review in this section. These models implicitly prefer a lower amount of inharmonicity, that is, partial frequencies are expected to lie close to integer multiples of the fundamental frequency. Another property of the models is that higher frequency partials are allowed to deviate more from their ideal harmonic positions than lower frequency partials, as in (2.4).

Yeh et al. [2005] describe the degree of deviation $d$ of an observed partial $f$ from its expected harmonic position $h f_0$ as

$$d = \begin{cases} \dfrac{|f - h f_0|}{\alpha h f_0} & \text{if } |f - h f_0| < \alpha h f_0 \\ 1 & \text{otherwise} \end{cases} \tag{3.2}$$

where $\alpha$ is a tolerance on the amount of inharmonicity allowed. Godsill and Davy [2005] introduce a generative model for inharmonicity via detuning parameters. The detuning of a single partial $\delta$ is defined as the multiplicative error term that maps the expected harmonic $h f_0$ to the observed partial $f$, i.e.,

$$f = (1 + \delta) h f_0 \tag{3.3}$$

Each $\delta$ in the model has a zero-mean Gaussian prior with a constant variance $\sigma_\delta^2 = 3 \times 10^{-8}$. Thus smaller absolute values of the detuning parameters have a higher probability.

3.2.4 Spectral Envelope

In this section we cover how the spectral envelope of musical signals has been modelled in the literature. The relative partial energies of the harmonic series of a musical instrument make an important contribution to its timbre. The partial energies of a musical instrument decay relatively smoothly with increasing frequency, so that the majority of the signal energy is concentrated within the lower harmonics.

A simple Bayesian model for the decay of the amplitudes with increasing frequency is given in Godsill and Davy [2005]. The prior for the amplitude $b_m$ of the $m$th partial is a zero mean Gaussian:

$$p(b_m) = \mathcal{N}\left(0, \sigma_n^2 \xi k_m\right), \qquad k_m = \frac{1}{1 + (T m)^{\nu}}$$

$\sigma_n^2$ is the variance of the ambient noise, and $\xi$ represents the overall signal-to-noise ratio (see Section 5.3 for details of this Bayesian model). The scaling $k_m$ controls the decay of the amplitudes according to a low-pass filter with cut-off frequency $T$ and decay constant $\nu$.

The use of low pass filters for modelling instrument timbre is common, for example the system of Karjalainen and Laine [1991]. A fractional delay filter provides the periodicity defining the fundamental frequency of the note, and a low pass filter in the feedback loop causes higher harmonics to decay more rapidly.
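The quantities defined in Sections 3.2.3 and 3.2.4 are simple enough to evaluate directly. The sketch below computes the deviation $d$ of (3.2), the detuning $\delta$ implied by (3.3), the log-density of the zero-mean Gaussian prior on $\delta$, and the amplitude scaling $k_m$. The function names and the example values of $\alpha$, $T$ and $\nu$ are illustrative assumptions.

```python
import math


def harmonic_deviation(f: float, h: int, f0: float, alpha: float) -> float:
    """Degree of deviation d of partial f from h * f0, following (3.2)."""
    ideal = h * f0
    err = abs(f - ideal)
    return err / (alpha * ideal) if err < alpha * ideal else 1.0


def detuning(f: float, h: int, f0: float) -> float:
    """Detuning delta solving f = (1 + delta) * h * f0, following (3.3)."""
    return f / (h * f0) - 1.0


def detuning_log_prior(delta: float, variance: float = 3e-8) -> float:
    """Log-density of the zero-mean Gaussian prior on a detuning parameter."""
    return -0.5 * math.log(2.0 * math.pi * variance) - delta * delta / (2.0 * variance)


def amplitude_scale(m: int, T: float, nu: float) -> float:
    """Scaling k_m = 1 / (1 + (T * m)^nu) of the amplitude prior variance."""
    return 1.0 / (1.0 + (T * m) ** nu)


if __name__ == "__main__":
    f0, alpha = 220.0, 0.03  # hypothetical fundamental (Hz) and tolerance
    print(harmonic_deviation(441.0, 2, f0, alpha), detuning(441.0, 2, f0))
    print([round(amplitude_scale(m, T=0.2, nu=2.0), 3) for m in range(1, 9)])
```

The prior variance of each amplitude is then $\sigma_n^2 \xi k_m$, so the scaling directly sets how quickly the expected partial energy falls away with the harmonic index.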
When excited with a short burst of white noise, the output of the Karjalainen and Laine system has a realistic plucked string sound.

3.2.5 Transform Decomposition Methods

The last class of methods we will consider for multi-pitch analysis are those which decompose a linear basis transform, such as the Fourier transform, of a musical signal into separate harmonic contributions from each pitch. The method was pioneered for polyphonic music transcription by Smaragdis and Brown [2003]. In their method, the spectrogram is formed into a matrix $X$ with the Fourier transform of each overlapping frame of music forming the columns of this matrix, i.e.,

$$X_{\omega,t} = |\mathrm{STFT}(t, \omega)|^2$$

The matrix $X$ is then decomposed by non-negative matrix factorization [Lee and Seung, 2000] into two factors

$$X \approx W H$$

where the number of columns of $W$ and the number of rows of $H$ are both equal to the number of pitches that are modelled in this segment of music. Each column of $W$ models the harmonic profile of a note with a certain pitch. $W$ may be trained or constrained to some parametric form. Each element of $H$ gives the weight, or the energy, of that note in a frame. $H$, when the rows are arranged in ascending pitch order, has the appearance of a piano roll (see Figure 7.4 for an example), and may be used as a starting point for a full transcription.

3.3 Polyphony Tracking

The task of tracking multiple pitches present in a piece of musical audio over time is normally evaluated separately from the task of estimating multiple pitches in a single frame (or otherwise stationary section of audio in terms of pitch). This is useful because it is possible to combine different pitch analysis and tracking systems together, and evaluate the performance of each separately. Two applications related to tracking polyphonic signals are:

1. Melody extraction. This refers to extracting a melodic line from a polyphonic piece. What constitutes a melodic line must be answered using music theory, but a practical working definition is that the melodic line corresponds to the predominant pitch (Section 3.2.1) in each frame. We will therefore view melody extraction as a sub-problem within the general polyphonic tracking problem, and do not directly address it here. However, we note that many of the melody extraction systems that perform well in the Mirex task use similar models and algorithms to systems which carry out full multiple-pitch estimation and tracking, rather than being specifically designed for the purpose of melodic extraction.
2. Score alignment. This refers to tracking polyphony through audio when the ordering of the pitches present is known to some extent, but the tempo is not. Hence polyphonic tracking systems should have one model to describe the evolution of the pitches, and another model to describe the evolution of the tempo. We cover tempo models in Section 3.4 as a separate application.

In this section we discuss models that describe how the pitches within a polyphonic piece change over time. Polyphony tracking may be carried out in a real-time online manner, or offline. Offline approaches tend to outperform online approaches, as expected and reported in Robertson and Plumbley [2009], who develop a real-time system based on Puckette et al. [1998].

3.3.1 Attack-Sustain-Decay-Release

An Attack-Sustain-Decay-Release (ASDR) envelope is a function of time present in modern synthesizers used to modulate the loudness of the note. Dodge and Jerse [1997] state that the shape of the ASDR envelope is the dominant factor in the perception of instrument timbre. Thus this relatively simple model is used frequently in the literature for an isolated musical note spanning multiple frames of audio data.

Orio and Déchelle [2001] model each note event in a score using a three state hidden Markov model with states attack, sustain and rest.
The duration of each note is governed by the transition probability from the note being in the sustain state in one frame to the note remaining in the sustain phase in the succeeding frame. The duration of each note thus follows a negative binomial law. A similar technique is used by Devaney et al. [2009] where transient and sustained sections of sung notes are modelled using a three state hidden Markov model.

Cemgil et al. [2006] use a state space model for the evolution of sinusoids through time. The beginning of each note is modelled by the state being drawn from a zero mean Gaussian with covariance matrix modelling the distribution of energy across partials. The sustain/decay phase is treated by each harmonic sinusoid $h$ having a damping ratio $\rho_h = \rho_d^h$, where $\rho_d$ is the damping ratio of the fundamental. In the rest phase, the damping ratio is increased to $\rho_r$, so that the note rapidly becomes inaudible.

3.3.2 Pitch Evolution

A popular method for polyphonic tracking in the literature is a hierarchical hidden Markov model (HMM). Individual notes are modelled as in Section 3.3.1. The transition probabilities between notes of different pitches are estimated from large databases of music, such as the work carried out by Ryynänen and Klapuri [2004]. The key of the music is provided as prior information to improve the relevance of the model, and most databases can be shifted cyclically so that data for one key can be applied to another key to improve the diversity of training data. Key detection systems are global, and may rely on coarse feature vectors such as chroma.

3.4 Tempo Tracking

Listeners without musical training are capable of tapping a rhythm corresponding to the pulse of the music, and are able to adapt to changing tempos. With some training, listeners are able to infer the meter (Section 2.5). One goal of machine listening therefore is to be able to infer the same basic rhythmic structure.
This task is collectively known as beat tracking, although the emphasis may be more on tracking tempo rather than precise beat locations.

Score following is a closely related application to beat tracking. In score following, knowledge of the score automatically gives us the deterministic relationship between the tatum, beat and metrical structures. Inferring the tatum is adequate for this task. Other than this, the onset detection model of a beat tracker may need to be modified, at least to allow for misses and false alarms. It is also typical in the literature to extend the onset detection model so that it incorporates knowledge of the underlying score, particularly note pitches and volumes. It is an open question whether such extensions generally improve the quality of score following or whether there might be a degradation in robustness to variations in timbre etc. The seemingly satisfactory performance of query by tapping (QBT) systems [Jang et al., 2001] for querying musical databases adds some weight to this question. Moreover, a perceptual evaluation of the alignment of the audio to the score may be based on the alignment of onsets in the music.

In the literature, beat tracking is often coupled with audio onset detection to form a complete listening system. However it is instructive to study these separately, as the onsets may already be available to us in some electronic format, for example Midi events, and also we may wish to evaluate the performance of different models in a modular system. We will also indicate how these models may be modified as part of a score following system.

Cemgil and Kappen [2003] and Raphael [2001] model the observed onset times $y_k$ as actual onsets $\tau_k$ with some added noise $\epsilon_k$ per onset. The difference between successive actual onsets $\tau_{k-1}$ and $\tau_k$ is given by the expected inter-onset interval (IOI) $\gamma_k$ (as a multiple of the tatum) scaled by the current value of the tempo $\Delta_{k-1}$, with some additional process noise $\sigma_\tau$.
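The onset recursion just described, combined with a random-walk tempo as in the state-space form (3.4), can be simulated with the minimal sketch below. Gaussian noise terms, the interpretation of $\Delta$ as a tatum period in seconds, and all parameter names are assumptions of this sketch.

```python
import random


def simulate_onsets(
    gammas: list[float],
    tau0: float,
    delta0: float,
    sd_tau: float,
    sd_delta: float,
    sd_obs: float,
    seed: int = 0,
) -> tuple[list[float], list[float], list[float]]:
    """Simulate actual onsets tau_k, tempo values delta_k and observed onsets y_k.

    gammas holds the expected inter-onset intervals as multiples of the tatum.
    """
    rng = random.Random(seed)
    tau, delta = tau0, delta0
    taus, deltas, ys = [], [], []
    for gamma in gammas:
        # tau_k = tau_{k-1} + gamma_k * Delta_{k-1} + process noise
        tau = tau + gamma * delta + rng.gauss(0.0, sd_tau)
        # Delta_k = Delta_{k-1} + process noise (random-walk tempo)
        delta = delta + rng.gauss(0.0, sd_delta)
        taus.append(tau)
        deltas.append(delta)
        # y_k = tau_k + observation noise
        ys.append(tau + rng.gauss(0.0, sd_obs))
    return taus, deltas, ys


if __name__ == "__main__":
    taus, deltas, ys = simulate_onsets(
        [1, 1, 2, 1, 1, 2], tau0=0.0, delta0=0.5, sd_tau=0.005, sd_delta=0.01, sd_obs=0.01
    )
    print([round(y, 3) for y in ys])
```

Setting all three standard deviations to zero recovers a strictly metronomic performance, which is a convenient check on the recursion.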
The tempo evolves as a random process $\Delta_{k-1}, \Delta_k, \ldots$ with process noise $\sigma_\Delta$. The model can be written in state space form:

$$\begin{bmatrix} \tau_k \\ \Delta_k \end{bmatrix} = \begin{bmatrix} 1 & \gamma_k \\ 0 & 1 \end{bmatrix} \begin{bmatrix} \tau_{k-1} \\ \Delta_{k-1} \end{bmatrix} + \begin{bmatrix} \sigma_\tau \\ \sigma_\Delta \end{bmatrix}, \qquad y_k = \tau_k + \epsilon_k \tag{3.4}$$

These models solely relate the observed IOIs to the expected IOIs. Beat and metrical information is supplied a priori as $p(\gamma_k)$. Figure 3.3a presents the model as a Bayesian network. This model is appropriate for rhythm-based quantization of MIDI events, however we must generalize the model for some detection probability $p(y_k \mid \tau_k)$ when using audio onset detection. For score following we could extend this model to have a probability distribution over the observed audio $p(s_{y_k:y_{k+1}} \mid \gamma_k)$ between the recorded onset times. Raphael [2004] uses a simple generative model for the spectrum of each frame given the score $\gamma_k$.

Klapuri et al. [2006] assume a frame-based onset detector, and include metrical structure in their generative Markov model

$$\begin{aligned} p\left(s_k, \tau^{\text{tatum}}_{k-1:k}, \tau^{\text{beat}}_{k-1:k}, \tau^{\text{measure}}_{k-1:k}\right) = {} & p\left(s_k \mid \tau^{\text{tatum}}_k, \tau^{\text{beat}}_k, \tau^{\text{measure}}_k\right) \\ & \times p\left(\tau^{\text{tatum}}_k \mid \tau^{\text{beat}}_k, \tau^{\text{tatum}}_{k-1}\right) \\ & \times p\left(\tau^{\text{measure}}_k \mid \tau^{\text{beat}}_k, \tau^{\text{measure}}_{k-1}\right) \\ & \times p\left(\tau^{\text{beat}}_k \mid \tau^{\text{beat}}_{k-1}\right) \end{aligned} \tag{3.5}$$

As can be seen from the structure of this model, the fundamental state is the beat or pulse. The metrical structure is derived from the evolution of the beat state. The onset detector itself assigns a likelihood $p\left(s_k \mid \tau^{\text{tatum}}_k, \tau^{\text{beat}}_k, \tau^{\text{measure}}_k\right)$ for a feature vector $s_k$ based on the frame of audio. Figure 3.3b presents the model as a Bayesian network.

Whiteley et al. [2006] adopt a different definition of tempo as the velocity $n_k$ of the tatum $m_k$ over feature vectors $s_k$ of frames of music.

$$\begin{aligned} p\left(s_k, m_{k+1}, n_{k+1}, \theta_{k+1} \mid m_k, n_k, \theta_k\right) = {} & p\left(s_k \mid m_k, \theta_k\right) \\ & \times p\left(\theta_{k+1} \mid \theta_k, m_{k+1}, m_k\right) \\ & \times p\left(m_{k+1} \mid m_k, n_k, \theta_k\right) \\ & \times p\left(n_{k+1} \mid n_k\right) \end{aligned}$$

The $\theta_k = \{M_k, r_k\}$ denotes changes in meter $M_k$ (the number of tatum positions in the bar) and also rhythmical pattern indicators $r_k$ within a bar, hence its inclusion in the onset detector $p(s_k \mid m_k, \theta_k) = p(s_k \mid m_k, r_k)$.
In the model of Whiteley et al. [2006], the tatum moves deterministically according to

$$m_{k+1} = (m_k + n_k - 1) \bmod M_k + 1$$

[Figure 3.3: Bayesian network representations of tempo models. (a) State-space model of tempo $\Delta_k$ with actual onsets $\tau_k$, observed onsets $y_k$ and expected inter-onset intervals $\gamma_k$ (3.4) for a single time slice $k$. (b) Markov model of metrical structure (3.5) for a single time slice $k$.]

Both of the above models can be extended to score following by adding a probability distribution of the form $p(s_{m_k:m_{k+1}})$ on the frames of audio between tatum positions. Peeling et al. [2007a] extend the model of Klapuri et al. [2006] to score following using a generative model for the spectrogram coefficients.

3.5 Conclusion

Music is inherently hierarchical in structure, and in this chapter we have seen that the state-of-the-art systems for extracting pitches, notes and tempo from musical signals reflect this hierarchy. To extract multiple pitches from a signal, individual partial frequencies are estimated and grouped into the smallest set of pitches that explain the harmonicity and timbre of the signal. Pitch estimates in consecutive frames of music are then linked together as notes. Missing pitch estimates between a note's onset and offset are filled in, and spurious pitches are discarded as noise. Notes in polyphonic music tend to arrive in groups with regularly spaced intervals, giving rise to the perception of beat and tempo in music. Concurrently sounding notes also tend to have extended harmonic relationships, giving rise to chords, chord sequences and the key of the music.

It is clear from the above description that a single pass through a musical signal, first extracting pitches, then smoothing pitch tracks to obtain notes, and estimating metrical structure and key, would be incomplete.
For example, when spurious pitch detections are discarded as noise, this gives us more information about the structure of the noise in that frame, which can therefore be used to improve the original system that extracted multiple pitches from the frame. An iterative approach thus seems appropriate; however, computational performance can be an issue, particularly if the pitch extraction algorithm only works on a single frame at a time and frames are then processed in order. Single-pass systems are therefore often trained experimentally offline to determine the optimum set of parameters for pitch extraction and note smoothing. However, today's music has a large variety of genres, instruments, styles and so forth, and selecting an appropriate training set which will generalize well is difficult.

The motivation for an iterative approach has given rise to algorithms based on matrix factorization of time-frequency coefficients, which in single steps can process multiple frames of music, and are computationally viable alternatives (Section 3.2.5). These methods have yet to be shown to be similar in performance to multiple pitch extraction schemes based on auditory or sinusoidal models. Reasons for this include that the models are not physically realizable, as they rely on the spectrogram being additive; the difficulty of determining the number of pitches in each frame; and the difficulty of linking the frames together to track polyphony. These three reasons are the motivation for our research in Chapter 7 and Chapter 8, where we attempt to address each issue without sacrificing computational efficiency.

In this chapter we have also seen models for various musical phenomena, such as harmonicity, timbre and the envelope of the note energy. Some of these models were derived from studying the physics of the musical instruments that produce the sound, whilst others were developed to capture more general characteristics, such as low inharmonicity and the spectral smoothness of partial amplitudes.
The application of these models has been shown in the MIREX evaluations to be effective at extracting multiple pitches from frames of music, and may be cast in a Bayesian setting. In Chapter 5 and Chapter 6 we will consider additional parameters for frequency and amplitude modulations in a note, and investigate efficient schemes for inference in models containing large numbers of parameters.

Combining a generative multiple pitch model in a frame with a hierarchical hidden Markov model for tracking polyphony and tempo is attractive for musical signal analysis, as the entire model is suitable for Bayesian inference, prior information can be incorporated in a transparent way, and the model is useful for a number of applications. In Chapter 8 we consider models for tracking the movement of a score pointer in a performance, and apply the model to score following and query by tapping.

Chapter 4 Bayesian Methods for Signal Processing

In this chapter we describe the basic Bayesian methods for modelling and inference on which the models developed in this thesis rely. In Section 4.1 we discuss modelling concepts using Bayes' rule, and introduce some popular models and representations of models. In Section 4.2 we describe the inference algorithms applied to these models that are used elsewhere in this thesis.

4.1 Bayesian Modelling Methods

In this section we introduce some probabilistic modelling techniques. Section 4.1.1 provides definitions and concepts relating to Bayes' rule and to comparing probabilistic models as explanations of observed data. In Section 4.1.2 we show how graph structures can represent complex probabilistic models. We then define the hidden Markov model in Section 4.1.3, which can be used to model causal and dynamical systems. Finally, in Section 4.1.4 we discuss the exponential family of probability distributions, which is often used in Bayesian approaches because of its useful properties.
4.1.1 Bayes Rule and Model Comparison

4.1.1.1 Parameter Estimation using Bayes Rule

In frequentist statistics, we have observations $y$ produced by a set of unknown model parameters $\theta$ according to a likelihood function $p(y|\theta)$. One estimate of the model parameters given the data is the maximum likelihood (ML) estimate

$$\hat{\theta}_{ML} = \arg\max_{\theta} p(y|\theta)$$

However, we often have some prior knowledge concerning the parameters, expressed as a Bayesian belief $p(\theta)$, and $p(y|\theta)$ is interpreted as the information provided by observations $y$ conditioned on the parameters. Bayes' rule states that

$$p(\theta|y) = \frac{p(y|\theta)\,p(\theta)}{p(y)} \tag{4.1}$$

$$\text{posterior} = \frac{\text{likelihood} \times \text{prior}}{\text{evidence}}$$

which can be viewed as weighting our prior belief with the likelihood of the data to give a posterior estimate of the parameters. The prior captures our belief about the model parameters before we have observed any data. The ML estimate is now replaced with the maximum a posteriori (MAP) estimator

$$\hat{\theta}_{MAP} = \arg\max_{\theta} p(\theta|y) \tag{4.2}$$

4.1.1.2 Marginal Likelihood for Model Comparison

The evidence or marginal likelihood $p(y)$ is the normalizing term in (4.1). The term marginal likelihood arises because $p(y)$ is the likelihood of the observations $y$ after marginalizing the model parameters $\theta$:

$$p(y) = \int_{\theta \in \Theta} p(y|\theta)\,p(\theta)\,d\theta \tag{4.3}$$

The marginal likelihood is important for Bayesian model comparison, as the only remaining unknown is the identity of the probabilistic model for $y$ itself. A higher marginal likelihood for a model indicates that the model is a better explanation of the observed data. Calculating the marginal likelihood is an application of Occam's razor, which states that a simpler explanation for an observation is to be preferred. Here, simpler implies a model with fewer parameters. Both ML and MAP estimators tend to prefer a model with more parameters, a problem known as over-fitting. We can select the best model for $y$ by computing the marginal likelihood $p(y)$ for every model in the set of models we are comparing.
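As a small worked illustration of comparing marginal likelihoods, the sketch below evaluates (4.3) in closed form for two hypothetical models of a binary sequence and takes the ratio of the two evidences. The coin-flip setting, the function names and the uniform Beta(1, 1) prior are assumptions chosen for illustration and do not appear elsewhere in this thesis. M1 has no free parameter (the probability of a head is fixed at 0.5), whereas M2 has one free parameter with a uniform prior.

```python
from math import exp, lgamma, log


def log_beta(a, b):
    """Logarithm of the Beta function B(a, b)."""
    return lgamma(a) + lgamma(b) - lgamma(a + b)


def log_evidence_fixed(heads, tails, theta=0.5):
    """log p(y|M1): no free parameter, so the evidence is just the likelihood."""
    return heads * log(theta) + tails * log(1.0 - theta)


def log_evidence_beta(heads, tails, a=1.0, b=1.0):
    """log p(y|M2): theta ~ Beta(a, b) is integrated out analytically as in (4.3)."""
    return log_beta(heads + a, tails + b) - log_beta(a, b)


def evidence_ratio(heads, tails):
    """Ratio p(y|M1) / p(y|M2); values above 1 favour the simpler model M1."""
    return exp(log_evidence_fixed(heads, tails) - log_evidence_beta(heads, tails))


balanced = evidence_ratio(5, 5)    # data consistent with a fair coin
one_sided = evidence_ratio(10, 0)  # data that a fair coin explains poorly
```

With balanced data the ratio exceeds 1, so the model without a free parameter is preferred even though M2 can fit the data equally well at its best parameter setting; with one-sided data the ratio falls well below 1. This is the Occam's razor effect described above.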
We will illustrate this here with two probabilistic models $M_1$ and $M_2$ which are considered possible explanations for $y$. The two models may have different numbers of parameters. We compute the marginal likelihood for each model, which we denote as $p(y|M_1)$ and $p(y|M_2)$. One method of comparing the models is the Bayes factor [Kass and Raftery, 1995]

$$\frac{p(y|M_1)}{p(y|M_2)}$$

which assesses the evidence for $M_1$ against $M_2$. A Bayes factor greater than 1 indicates that $M_1$ should be preferred.

In full Bayesian inference we do not compute point estimates of the model parameters $\theta$ such as the MAP estimate. Rather, we are interested in inferring the posterior probability distribution $p(\theta|y)$, of which the MAP estimate is the mode. Inferring the posterior distribution allows us to compute expectations such as the variance, which gives us some idea of the uncertainty in our parameter estimates, and is important for making decisions based on such inference. Note that as the likelihood $p(y|\theta)$ and the prior $p(\theta)$ are known, computing the marginal likelihood $p(y)$ is equivalent to computing the normalization constant of the posterior distribution, and thus to computing the distribution itself (4.1).

4.1.1.3 Generative and Discriminative Models

A generative model for $y$ is a probabilistic model $p(y, \theta) = p(y|\theta)\,p(\theta)$ for which we can randomly sample a set of parameters $\theta' \sim p(\theta)$ and then sample or generate an observation $y' \sim p(y|\theta')$.
When choosing a generative model to study a signal, we desire that

• Statistical properties of the observed signal should match the properties of data generated by the model
• The parameters of the model $\theta$ include information which is to be extracted from the signal
• The prior $p(\theta)$ and the likelihood $p(y|\theta)$ of the model are based on the known physical processes which drive the signal

The alternative to a generative model is a discriminative model, which directly defines a probability distribution $p(\theta|y)$ that can then be used to obtain the MAP estimate (4.2). This often means that the original likelihood and prior in (4.1) may not be available in closed form and cannot be used as a generative model. A discriminative model may be designed to give accurate results for the task it is designed for, such as multiple pitch detection, but needs to go through cross validation to ensure that the model will generalize well to unseen signals. For generative models we are able to use model comparison to select the most appropriate model, and model selection can be carried out not only offline on training data but also at the time when the signal is observed.

4.1.1.4 Hierarchical Models

A hierarchical model enforces conditional independencies between the model parameters. There is often good justification in a signal processing application for making this assumption. For example, in a beat tracking application, where we want to jointly track the tempo and the position of beats in a drum track, the beats themselves appear as sudden bursts of energy in the signal, but the tempo is not directly relevant to the observed signal. Rather, the tempo controls the rate at which the beats occur.
If we define the parameters related to the beats as $\theta_b$ and the tempo parameters as $\theta_t$, then the model

$$p(y, \theta_b, \theta_t) = p(y|\theta_b)\,p(\theta_b|\theta_t)\,p(\theta_t)$$

is not only justifiable from a theoretical point of view, but can also be used as a generative model: first sample the tempo according to the prior $p(\theta_t)$, then sample the beats, and finally the signal. Hierarchical generative models are often represented graphically as Bayesian networks (Section 4.1.2). The graph can then be used to determine how to sample the parameters in turn to generate data from the model, and it also guides inference algorithms that estimate unknown parameters given observed data.

4.1.1.5 Conjugate Priors

In this section we define a useful property of certain families of probability distributions which simplifies inference and computation. A family of probability distributions is a set of distributions which share the same functional form but have different parameters. For example, the normal distribution may be parametrized by mean $\mu$ and standard deviation $\sigma$:

$$\mathcal{N}(y; \mu, \sigma) \equiv \frac{1}{\sqrt{2\pi}\,\sigma} \exp\left(-\frac{(y-\mu)^2}{2\sigma^2}\right)$$

The mean $\mu$ is a location parameter for the normal distribution. A family of distributions with a location parameter has the functional form

$$f_\mu(y) = f(y - \mu)$$

The standard deviation is a scale parameter for the normal distribution. A family of distributions with a scale parameter has the functional form

$$f_\sigma(y) = f(y/\sigma)/\sigma$$

Now assume the mean $\mu$ of some observed data $y$ is unknown, but the standard deviation $\sigma$ is known. The posterior of $\mu$ is

$$p(\mu|y, \sigma) \propto p(y|\mu, \sigma)\,p(\mu)$$

If the prior $p(\mu)$ is also chosen to be a normal distribution, then it can be shown that the posterior $p(\mu|y, \sigma)$ is also a normal distribution, where the mean and standard deviation of the posterior are given by standard rules (Section A.1).
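The conjugate update for a normal mean with known standard deviation can be sketched as follows. The formulas are the standard textbook precision-weighted rules, which we assume correspond to those collected in Section A.1; the function name and the example numbers are hypothetical.

```python
from math import sqrt


def posterior_of_mean(ys, sigma, prior_mean, prior_sd):
    """Posterior N(mu; post_mean, post_sd) for y_i ~ N(mu, sigma), mu ~ N(prior_mean, prior_sd).

    Precisions (inverse variances) add, and the posterior mean is the
    precision-weighted combination of the prior mean and the observations.
    """
    prior_precision = 1.0 / prior_sd ** 2
    data_precision = len(ys) / sigma ** 2
    post_precision = prior_precision + data_precision
    post_mean = (prior_precision * prior_mean + sum(ys) / sigma ** 2) / post_precision
    return post_mean, sqrt(1.0 / post_precision)


# Two observations with unit noise and a standard normal prior on the mean.
post_mean, post_sd = posterior_of_mean([1.0, 3.0], sigma=1.0, prior_mean=0.0, prior_sd=1.0)
```

The posterior mean lies between the prior mean and the sample mean, and the posterior standard deviation is smaller than both the prior standard deviation and the noise level, reflecting the information gained from the data.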
The normal distribution prior $p(\mu)$ is a conjugate prior for the mean parameter of the normal distribution, because the posterior is in the same family of probability distributions as the prior. In general, a conjugate prior is a choice of prior for a particular parameter of a likelihood function such that the posterior of this parameter is in the same family as the prior.

4.1.2 Bayesian Networks

Graphical models are a method of visualizing the structure of probabilistic models by diagrammatically representing probability distributions. Inference algorithms can be viewed and defined as messages being passed between nodes of the graphical model [Bishop, 2006]. Directed acyclic graphs, also known as Bayesian networks, are useful for constructing models via conditional probability distributions. Figure 3.3 provides two examples of Bayesian networks. Circular nodes represent random variables, such as the unknown parameters and observed data, and edges represent statistical dependencies between the random variables.

Bayesian networks intuitively represent causality. An edge is directed from A to B if B is conditionally dependent on A. We call A a parent of B, and B a child of A. The acyclic property makes these graphs suitable for working with generative probabilistic models, where we can generate synthetic data by sampling the distributions of nodes with no parents, and then moving through the network along the directed edges, sampling each child node dependent on its parents. This is known as ancestral sampling [Bishop, 2006]. The probability distribution, and the method of generating samples from it, are represented by the following factorization:

$$p(x_1, \ldots, x_N) = \prod_{i=1}^{N} p(x_i|\mathrm{Par}(x_i)) \tag{4.4}$$

where $\mathrm{Par}(x_i)$ denotes the set of parent nodes of $x_i$.

[Figure 4.1: Hidden Markov model, with hidden states $\theta_0, \theta_1, \ldots, \theta_N$ and observations $y_1, \ldots, y_N$. Observed random variables have doubled lines.]

One weakness of Bayesian networks is that although conditional dependencies are expressed directly, determining whether one variable is conditionally independent of another is not so clearly evident. We denote the Markov blanket of a node as the set of nodes for which, given the values of these nodes, the node is conditionally independent of all other nodes in the network. For a Bayesian network, the Markov blanket of $x_i$ is the union $\{\mathrm{Par}(x_i) \cup \mathrm{Chl}(x_i) \cup \mathrm{Par}(\mathrm{Chl}(x_i))\}$, where $\mathrm{Chl}(x_i)$ denotes the set of child nodes of $x_i$.

4.1.3 Hidden Markov Models

A hidden Markov model (HMM) is a probabilistic model with an underlying Markov process: the conditional probability distribution of future states of the process, given the present and all past states, depends only on the present state. The state of the process is assumed to be hidden, and the task is to infer the sequence of states over time given some observations, each of which depends only on the current value of the state. Usually the hidden Markov model is viewed as a special case of a general state-space probabilistic model, where the state space $\Theta$ is discrete. Hidden Markov models are often used for modelling dynamical systems.

We will use the following notation, which is represented as a Bayesian network in Figure 4.1: the hidden state sequence is $\theta_{0:K}$ and $y_{1:K}$ is the sequence of observations produced according to the state likelihoods (also known as emission probabilities) $p(y_k|\theta_k)$, over discrete times $k = 1, \ldots, K$. The state sequence evolves as a Markov chain, that is, it has an initial probability distribution $p(\theta_0)$ and a set of transition probability distributions $p(\theta_k|\theta_{0:k-1}) = p(\theta_k|\theta_{k-1})$, $k = 1, \ldots, K$.
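Ancestral sampling applied to the hidden Markov model just defined can be sketched as follows: the parentless node $\theta_0$ is drawn first, then each state is drawn given its parent, and each observation given its state. This is a minimal sketch for discrete states and discrete observations; the function names and the table-based parametrization are hypothetical choices made for illustration.

```python
import random


def sample_categorical(probs, rng):
    """Draw an index from a discrete distribution given as a list of probabilities."""
    u = rng.random()
    cumulative = 0.0
    for index, p in enumerate(probs):
        cumulative += p
        if u < cumulative:
            return index
    return len(probs) - 1  # guard against rounding error in the cumulative sum


def sample_hmm(initial, transition, emission, K, seed=0):
    """Ancestral sampling of theta_{0:K} and y_{1:K}.

    initial[i]       = p(theta_0 = i)
    transition[i][j] = p(theta_k = j | theta_{k-1} = i)
    emission[j][y]   = p(y_k = y | theta_k = j)
    """
    rng = random.Random(seed)
    states = [sample_categorical(initial, rng)]  # theta_0 has no parents
    observations = []
    for _ in range(K):
        state = sample_categorical(transition[states[-1]], rng)  # child of theta_{k-1}
        states.append(state)
        observations.append(sample_categorical(emission[state], rng))  # child of theta_k
    return states, observations


states, observations = sample_hmm(
    initial=[0.5, 0.5],
    transition=[[0.9, 0.1], [0.1, 0.9]],
    emission=[[0.8, 0.2], [0.2, 0.8]],
    K=10,
)
```

The returned state list has one more element than the observation list, matching the convention that the chain starts at $\theta_0$ while the first observation is $y_1$.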
4.1.4 Exponential Family of Probability Distributions

The exponential family of probability distributions is a class of probability distributions having the form

$$p(y|\theta) = \frac{1}{Z_\theta} e^{-\langle \theta, T(y) \rangle} \tag{4.5}$$

The normalizing factor $Z_\theta$ is given by $\int e^{-\langle \theta, T(y) \rangle}\,dy$. This family has a number of useful properties which make it important for Bayesian inference:

• The finite vector $T(y)$ is the collection of sufficient statistics, which capture all the possible information about $\theta$ that is represented by observations $y$, that is: $p(y|T(y), \theta) = p(y|T(y))\ \forall\theta$. For inference only the sufficient statistics $T(y)$ are required, not the entire data.
• The maximum likelihood parameters $\hat{\theta}_{ML}$ that maximize (4.5) are those for which the observed values of the sufficient statistics equal their expected values: $T(y) = \langle T \rangle_\theta$.
• For a likelihood function $p(y|\theta)$ in the exponential family, there exists a conjugate prior (Section 4.1.1.5) $p(\theta)$, often itself in the exponential family, for which the posterior $p(\theta|y)$ is in the same class of distribution as the prior. This is useful in variational methods (Section 4.2.3), as the update equations require only updating the sufficient statistics of a factor rather than performing a computationally intensive calculation over the whole parameter space of a probability distribution.

4.2 Inference Algorithms

In this section we cover inference algorithms falling into three main categories: exact inference (Section 4.2.1) for situations where it is possible to compute the posterior for all possible parameter settings, sampling methods (Section 4.2.2) which aim to generate samples from the posterior and compute expectations via Monte Carlo integration, and variational methods (Section 4.2.3) which approximate the full posterior with distributions for which the integrals can be computed.

4.2.1 Exact Inference

Exact inference refers to being able to compute posterior quantities precisely.
Often this requires marginalizing over parameter spaces, which in turn requires either that the integrals are analytic (such as in the case of models with Gaussian conditional probabilities) or that the parameters only assume discrete values. The Kalman filter is an exact inference algorithm for linear dynamical systems with Gaussian transition and observation noise.

For a hidden Markov model (Section 4.1.3) with a discrete and finite state space, such that $\theta_k$ can assume one of $E$ possible values, and for which the conditional probability of the observation given the state is computable, there exists a group of message-passing inference algorithms with complexity $O(E^2 K)$ for various inference tasks. Typically we may wish to determine the probability of states at time $k$ given past observations, $p(\theta_k|y_{1:k})$, which is known as filtering and can be carried out recursively on-line; or including all future observations up to time $K$, $p(\theta_k|y_{1:K})$, which is known as smoothing and must be carried out offline; or including $N$ future observations, $p(\theta_k|y_{1:k+N})$, which is known as fixed-lag smoothing and is practical if a certain amount of latency in the inference is acceptable. We may also wish to predict future states $p(\theta_{k+N}|y_{1:k})$, or infer the most likely sequence of states $\arg\max_{\theta_{0:K}} p(\theta_{0:K}|y_{1:K})$, which is known as the Viterbi path. The computations required for all these related but distinct queries can be viewed in terms of message passing algorithms. Both the Kalman filter and the following HMM algorithms are special cases of more general inference algorithms, such as the sum-product algorithm (for filtering and smoothing), that act over general Bayesian networks or factor graphs; see Bishop [2006] for details. Rabiner [1989] provides a tutorial on HMMs, also describing the Baum-Welch algorithm, which is an expectation-maximization (EM) algorithm for learning the unknown parameters of the transition and observation processes.
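The filtering and smoothing queries described above can be sketched for a discrete HMM as follows. This is a minimal illustration of the alpha and beta recursions formalized in (4.7) to (4.13), with one assumption that differs from those equations: each message is renormalized at every step to avoid numerical underflow, so the returned quantities are the normalized posteriors rather than the unnormalized messages. The function name and the table-based parametrization are hypothetical.

```python
def forward_backward(initial, transition, emission, observations):
    """Filtering and smoothing for a discrete HMM with E states in O(E^2 K) time.

    initial[i]       = p(theta_0 = i)
    transition[i][j] = p(theta_k = j | theta_{k-1} = i)
    emission[j][y]   = p(y_k = y | theta_k = j)
    observations     = [y_1, ..., y_K], so y_k is observations[k - 1]
    Returns (filtered, smoothed): p(theta_k | y_{1:k}) and p(theta_k | y_{1:K}) for k = 0..K.
    """
    E, K = len(initial), len(observations)

    # Forward pass: predict with the transition model, then weight by the likelihood.
    filtered = [list(initial)]
    for y in observations:
        previous = filtered[-1]
        predicted = [sum(previous[i] * transition[i][j] for i in range(E)) for j in range(E)]
        updated = [emission[j][y] * predicted[j] for j in range(E)]
        norm = sum(updated)
        filtered.append([u / norm for u in updated])

    # Backward pass: beta[k][i] is proportional to p(y_{k+1:K} | theta_k = i).
    beta = [[1.0] * E for _ in range(K + 1)]
    for k in range(K - 1, -1, -1):
        y_next = observations[k]  # this is y_{k+1}
        raw = [
            sum(transition[i][j] * emission[j][y_next] * beta[k + 1][j] for j in range(E))
            for i in range(E)
        ]
        norm = sum(raw)
        beta[k] = [r / norm for r in raw]

    # Combine the two messages and normalize to obtain the smoothed marginals.
    smoothed = []
    for k in range(K + 1):
        product = [filtered[k][i] * beta[k][i] for i in range(E)]
        norm = sum(product)
        smoothed.append([p / norm for p in product])
    return filtered, smoothed


filtered, smoothed = forward_backward(
    initial=[0.5, 0.5],
    transition=[[0.9, 0.1], [0.1, 0.9]],
    emission=[[0.8, 0.2], [0.2, 0.8]],
    observations=[0],
)
```

In this two-state example a single observation of symbol 0 moves the filtered estimate at time 1 towards state 0, and the smoothed estimate of the initial state also shifts towards state 0 even though no observation is attached to $\theta_0$ itself. The sketch assumes every observation has non-zero probability under at least one reachable state.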
Let $\theta_{0:K}$ be the unknown state sequence in a hidden Markov model, and let $y_{1:K}$ be the sequence of observations generated. By Bayes' theorem, the posterior distribution over all possible state sequences is given by

$$p(\theta_{0:K}|y_{1:K}) = \frac{p(y_{1:K}|\theta_{0:K})\,p(\theta_{0:K})}{p(y_{1:K})} \tag{4.6}$$

The marginal filtering density $p(\theta_k|y_{1:k})$ can be computed up to the normalizing constant $p(y_{1:k})$ by passing `alpha' messages $\alpha_{k|k}(\theta_k) \equiv p(\theta_k|y_{1:k})\,p(y_{1:k})$ between neighbouring frames:

$$\alpha_{0|0}(\theta_0) = p(\theta_0) \tag{4.7}$$

$$\alpha_{k|k-1}(\theta_k) = \sum_{\theta_{k-1}} p(\theta_k|\theta_{k-1})\,\alpha_{k-1|k-1}(\theta_{k-1}) \tag{4.8}$$

$$\alpha_{k|k}(\theta_k) = p(y_k|\theta_k)\,\alpha_{k|k-1}(\theta_k) \tag{4.9}$$

The marginal smoothing density $p(\theta_k|y_{1:K})$ is computed offline by passing `beta' messages $\beta_{k|k+1}(\theta_k) \equiv p(y_{k+1:K}|\theta_k)$ as follows:

$$\beta_{K|K+1}(\theta_K) = 1 \tag{4.10}$$

$$\beta_{k|k}(\theta_k) = p(y_k|\theta_k)\,\beta_{k|k+1}(\theta_k) \tag{4.11}$$

$$\beta_{k|k+1}(\theta_k) = \sum_{\theta_{k+1}} p(\theta_{k+1}|\theta_k)\,\beta_{k+1|k+1}(\theta_{k+1}) \tag{4.12}$$

$$p(\theta_k|y_{1:K}) \propto \alpha_{k|k}(\theta_k)\,\beta_{k|k+1}(\theta_k) \tag{4.13}$$

The Viterbi path is computed in an analogous manner, where messages from neighbouring frames and observations are combined by taking the maximum rather than summing, i.e.,

$$\alpha_{k|k-1}(\theta_k) = \max_{\theta_{k-1}} p(\theta_k|\theta_{k-1})\,\alpha_{k-1|k-1}(\theta_{k-1}) \tag{4.14}$$

$$\beta_{k|k+1}(\theta_k) = \max_{\theta_{k+1}} p(\theta_{k+1}|\theta_k)\,\beta_{k+1|k+1}(\theta_{k+1}) \tag{4.15}$$

$$\arg\max_{\theta_{0:K}} p(\theta_{0:K}|y_{1:K}) = \left\{ \arg\max_{\theta_k} \alpha_{k|k}(\theta_k)\,\beta_{k|k+1}(\theta_k) \right\}_{k=0:K} \tag{4.16}$$

Algorithm 4.1 Generic MCMC
• Sample $\tilde{\theta}^{(0)} \sim \pi_0(\theta)$
• For $i = 1, \ldots, M$
  - Sample $\tilde{\theta}^{(i)} \sim K(\theta|\tilde{\theta}^{(i-1)})$

4.2.2 Monte Carlo Methods

Bayesian inference problems frequently involve computing high-dimensional integrals. For example, if we are interested in the mean of the posterior $p(\theta|y)$, we are required to compute the following integral over the space of possible parameter settings $\Theta$:

$$\mathbb{E}_{p(\theta|y)}[\theta] = \int_{\Theta} \theta\,p(\theta|y)\,d\theta \tag{4.17}$$

In many cases, the dimension of $\Theta$ is very large and the problem is intractable. We must then resort to a Monte Carlo estimate, which is a stochastic numerical integration method. Monte Carlo methods [Gilks and Spiegelhalter, 1996, Liu, 2003] are a general class of methods to compute expectations of random variables.
We denote the target probability density function (pdf) of interest as $\pi(\theta)$, and assume we have a set of random samples $\tilde{\theta}^{(i)}$, $i = 1, \ldots, N$ drawn from $\pi(\theta)$, which are called Monte Carlo samples. Now, for a general function $h(\theta)$ over the parameter space, we have, by the law of large numbers, the following approximation to the general integration problem:

$$\int_{\Theta} h(\theta)\,\pi(\theta)\,d\theta \approx \frac{1}{N} \sum_{i=1}^{N} h(\tilde{\theta}^{(i)}) \tag{4.18}$$

However, $\pi(\theta)$ can be of large dimension and have an unknown normalizing constant, and may therefore be difficult to sample from. See Robert and Casella [2004] for a full overview of Monte Carlo techniques. Here we will mention two methods in particular which are useful to us for generating random samples from such a pdf.

4.2.2.1 Markov Chain Monte Carlo

Markov Chain Monte Carlo (MCMC) methods construct a Markov chain with stationary distribution equal to the target pdf $\pi(\theta)$ which we wish to sample from. A Markov chain is specified by an initial sample distribution $\pi_0(\theta)$ and a transition kernel $K(\theta|\theta')$, which is a probability density. For the stationary distribution of the Markov chain to be equal to the target pdf $\pi(\theta)$, the transition kernel must obey

$$\int_{\Theta} K(\theta|\theta')\,\pi(\theta')\,d\theta' = \pi(\theta) \quad \forall\theta \in \Theta \tag{4.19}$$

Given such a kernel, the Monte Carlo samples $\tilde{\theta}^{(i)}$ are generated from the Markov chain as in Algorithm 4.1.

A general method of constructing the kernel $K(\theta|\theta')$ is provided by the Metropolis-Hastings (MH) algorithm. Suppose we have a proposal pdf $q(\theta|\theta')$ which we can directly sample from, with $q(\theta|\theta') > 0$ wherever $\pi(\theta) > 0$. The step in Algorithm 4.1 which involves drawing samples from the kernel is expanded in Algorithm 4.2, where candidate samples $\theta^\star$ are drawn from the proposal pdf $q(\theta|\theta')$, and then accepted or rejected according to an acceptance ratio or probability $\alpha_{MH}$.

Algorithm 4.2 Metropolis-Hastings Kernel
• Sample $\theta^\star \sim q(\theta|\tilde{\theta}^{(i-1)})$
• Compute the acceptance ratio $\alpha_{MH} = \min\left[1, \frac{\pi(\theta^\star)}{\pi(\tilde{\theta}^{(i-1)})} \frac{q(\tilde{\theta}^{(i-1)}|\theta^\star)}{q(\theta^\star|\tilde{\theta}^{(i-1)})}\right]$
• Accept: with probability $\alpha_{MH}$ set $\tilde{\theta}^{(i)} \leftarrow \theta^\star$
• Reject: otherwise set $\tilde{\theta}^{(i)} \leftarrow \tilde{\theta}^{(i-1)}$

It is not necessary for the entire parameter $\theta$ to be sampled in each step. Let $\theta$ be partitioned into $D$ disjoint components $\{\theta_1, \theta_2, \ldots, \theta_D\}$, which may be groups (blocks) of parameters. It is permissible at each iteration for the target pdf to be one of $p(\theta_d|\theta_{\neg d})$, $d = 1, \ldots, D$, provided every component $\theta_d$ has some probability of being chosen at each iteration. For the special case when we are able to sample directly from $p(\theta_d|\theta_{\neg d})$, then $\alpha_{MH} = 1$, that is, every candidate is accepted. This is known as the Gibbs sampler. This typically occurs in Bayesian networks where the conditional dependencies between the nodes are defined as standard probability distributions. The algorithm can then be viewed in terms of passing messages between nodes, and has been implemented in software packages such as WinBUGS and OpenBUGS.

Commonly the Markov chain is allowed to run for a burn-in period before computing statistics of the Monte Carlo samples, as it is a commonly observed phenomenon that the Markov chain `converges' to likely parameter settings after such a period.

Green [1995] extends MCMC to model selection problems, where the model parameters have different dimensions. The scheme is known as Reversible Jump MCMC. An MH kernel is required to move between models $M_j$ and $M_{j'}$ of differing dimension. The jump must be reversible in that for every accepted move $\theta_j \to \theta_{j'}$, the reverse move $\theta_{j'} \to \theta_j$ must have positive probability of acceptance. The MH proposals are normally designed so that $\theta_j$ and $\theta_{j'}$ share most of the parameter settings between them, and therefore a sensible proposal normally requires that the difference between the dimensions of $\theta_j$ and $\theta_{j'}$ is small.
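As a fixed-dimension illustration of the Metropolis-Hastings kernel of Algorithm 4.2 (reversible jumps between models are not attempted here), the sketch below uses a Gaussian random-walk proposal. Because this proposal is symmetric, the ratio of proposal densities cancels and only the ratio of target densities remains. The target, the step size, the burn-in length and all function names are hypothetical choices for illustration; the target is deliberately left unnormalized to show that the normalizing constant is not required.

```python
import math
import random


def metropolis_hastings(log_target, theta0, n_samples, step=1.0, seed=0):
    """Random-walk Metropolis-Hastings for a scalar parameter.

    log_target returns the log of an unnormalized target density pi(theta).
    """
    rng = random.Random(seed)
    theta = theta0
    samples = []
    for _ in range(n_samples):
        candidate = theta + rng.gauss(0.0, step)  # symmetric proposal q(candidate | theta)
        log_ratio = log_target(candidate) - log_target(theta)
        # Accept with probability min(1, pi(candidate) / pi(theta)).
        if log_ratio >= 0.0 or rng.random() < math.exp(log_ratio):
            theta = candidate
        samples.append(theta)  # on rejection the previous value is repeated
    return samples


def log_unnormalized_normal(theta, mean=2.0, sd=1.0):
    """Log density of N(mean, sd) with the normalizing constant omitted."""
    return -0.5 * ((theta - mean) / sd) ** 2


chain = metropolis_hastings(log_unnormalized_normal, theta0=0.0, n_samples=20000, seed=1)
kept = chain[2000:]  # discard a burn-in period
estimated_mean = sum(kept) / len(kept)
```

The Monte Carlo estimate of the mean is the average (4.18) taken over the samples retained after burn-in. Rejected candidates still contribute a repeated sample, which is necessary for the chain to have the correct stationary distribution.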
The MH proposal pdf is augmented to become $q(\theta_{j'}, u_{j'}|\theta_j, u_j)$, where $u_{j'}$ and $u_j$ are auxiliary random variables that keep the dimensions of the augmented parameter spaces equal: $D(\theta_{j'}) + D(u_{j'}) = D(\theta_j) + D(u_j)$, where $D(\cdot)$ is the dimension of the random variable. The acceptance probability in Algorithm 4.2 is replaced by

$$\alpha_{MH} = \min\left[1, \frac{\pi(\theta^\star_{j'})}{\pi(\tilde{\theta}_j)} \frac{q(\tilde{\theta}_j, u_j|\theta^\star_{j'}, u_{j'})}{q(\theta^\star_{j'}, u_{j'}|\tilde{\theta}_j, u_j)} \left| \frac{\partial(\theta^\star_{j'}, u_{j'})}{\partial(\tilde{\theta}_j, u_j)} \right| \right] \tag{4.20}$$

4.2.2.2 Importance Sampling and Sequential Monte Carlo

The Metropolis-Hastings algorithm derives from acceptance-rejection sampling. Another sampling method, importance sampling, leads to a class of algorithms known as sequential Monte Carlo, so called because they are suitable for the sequential inference problems that often appear in dynamical systems and hidden Markov models.

Importance sampling requires another proposal pdf $q(\theta)$, known as the importance pdf, with $q(\theta) > 0$ wherever $\pi(\theta) > 0$. Now, if we have a set of Monte Carlo samples $\tilde{\theta}^{(i)}$, $i = 1, \ldots, N$ drawn from $q(\theta)$, we can write the general integration problem as

$$\int_{\Theta} h(\theta)\,\pi(\theta)\,d\theta = \int_{\Theta} h(\theta)\,\frac{\pi(\theta)}{q(\theta)}\,q(\theta)\,d\theta \approx \frac{1}{N} \sum_{i=1}^{N} \tilde{\omega}^{(i)} h(\tilde{\theta}^{(i)}) \tag{4.21}$$

Each Monte Carlo sample $\tilde{\theta}^{(i)}$ is weighted by an importance weight $\tilde{\omega}^{(i)}$ which corrects for the difference between $q(\theta)$ and $\pi(\theta)$. The importance weights are given by the importance ratio

$$\tilde{\omega}^{(i)} = \frac{\pi(\tilde{\theta}^{(i)})}{q(\tilde{\theta}^{(i)})} \tag{4.22}$$

The variance of the estimate depends strongly on how close $q(\theta)$ is to $\pi(\theta)$, which is usually evaluated as the Kullback-Leibler divergence $KL(q(\theta)\|\pi(\theta))$. The optimal importance pdf that minimizes the variance of the estimate is $q(\theta) \propto |h(\theta)|\,\pi(\theta)$, which however usually cannot be sampled from.

Sequential Monte Carlo methods [Doucet et al., 2001], also known as particle filtering, are algorithms for approximate inference, typically on dynamical systems, using importance sampling. At each time $k$ we are interested in generating Monte Carlo samples $\tilde{\theta}^{(i)}_{1:k}$ of the state trajectories, also known as particles.
The target posterior pdf $p(\theta_{1:k}|y_{1:k})$ can be written sequentially as

$$p(\theta_{1:k}|y_{1:k}) = \frac{p(\theta_{1:k-1}|y_{1:k-1})\,p(\theta_k|\theta_{k-1})\,p(y_k|\theta_k)}{p(y_k|y_{1:k-1})} \tag{4.23}$$

The normalization constant $p(y_k|y_{1:k-1})$ does not need to be computed. We also select a sequential importance pdf

$$q_k(\theta_{1:k}) = q_{k-1}(\theta_{1:k-1})\,q_k(\theta_k|\theta_{k-1}) = q_0(\theta_0) \prod_{k'=1}^{k} q_{k'}(\theta_{k'}|\theta_{k'-1}) \tag{4.24}$$

so that we can sample the new state at time $k$ from the proposal $q_k(\theta_k|\theta_{k-1})$ using the state trajectories found up until time $k-1$. The choice of the proposal pdf affects the resulting variance of the particle estimate. The optimal proposal uses the new observation, $q_k(\theta_k|\theta_{k-1}) = p(\theta_k|\theta_{k-1}, y_k)$, but can be impossible to sample from. Local Gaussian approximations to the optimal proposal lead to proposals based on the unscented and extended Kalman filters. The simplest proposal is the transition pdf $q_k(\theta_k|\theta_{k-1}) = p(\theta_k|\theta_{k-1})$, which results in the algorithm called the bootstrap filter (Algorithm 4.3).

The importance weight of each particle is given by the ratio

$$\frac{p(\theta_{1:k}|y_{1:k})}{q_k(\theta_{1:k})} \tag{4.26}$$

However, after a few iterations, most of these weights become close to zero, and the solution is degenerate, being represented by only a few particles. The solution is to resample the particles: particles with low weights are discarded and replaced by copies of particles with higher weights. A resampling step takes place when the efficiency number $I_{\text{eff}} = \left[\sum_{i=1}^{I} (\tilde{w}^{(i)}_k)^2\right]^{-1}$ is lower than some threshold. A simple scheme known as stratified sampling samples such that the expected number of particles at $\tilde{\theta}^{(i)}_k$ following the resampling step is equal to $I\tilde{w}^{(i)}_k$. See Doucet et al. [2001] for further details and alternative resampling schemes.

Algorithm 4.3 Bootstrap Particle Filter
• Initialize $\tilde{\theta}^{(i)}_0 \sim q_0(\theta_0)$, $\tilde{w}^{(i)}_0 = p_0(\tilde{\theta}^{(i)}_0)/q_0(\tilde{\theta}^{(i)}_0)$ for each particle
• For $k = 1, \ldots, K$
  - Update particle trajectories $\tilde{\theta}^{(i)}_k \sim q_k(\theta_k|\tilde{\theta}^{(i)}_{k-1})$
  - Compute particle weights
    $$\tilde{w}^{(i)}_k \propto \tilde{w}^{(i)}_{k-1} \frac{p(\tilde{\theta}^{(i)}_k|\tilde{\theta}^{(i)}_{k-1})\,p(y_k|\tilde{\theta}^{(i)}_k)}{q_k(\tilde{\theta}^{(i)}_k|\tilde{\theta}^{(i)}_{k-1})} \tag{4.25}$$
    and normalize: $\tilde{w}^{(i)}_k \leftarrow \tilde{w}^{(i)}_k / \sum_{i'=1}^{I} \tilde{w}^{(i')}_k$
  - Resample if necessary

4.2.3 Variational Methods

Variational methods are an alternative, deterministic approach to making approximate posterior estimates. Define $p(y, \theta)$ as the joint distribution when $y$ is observed, with normalizing constant $Z_y$. We wish to compute the posterior

$$p(\theta|y) = \frac{1}{Z_y}\,p(y, \theta) \tag{4.27}$$

$$Z_y = \int p(y, \theta)\,d\theta \tag{4.28}$$

The Monte Carlo techniques described in Section 4.2.2 can be utilized to draw samples from $p(y, \theta)/Z_y$. However, in practice, the integral can often be approximated more quickly by the structured mean field method, also known as variational Bayes. The target $P = p(y, \theta)/Z_y$ is approximated with a simpler distribution $Q$ such that the integral in (4.28) is tractable. A common factorization involves partitioning the parameters into $D$ disjoint components $\theta_1, \ldots, \theta_D$ such that

$$Q = \prod_{d=1}^{D} Q_d(\theta_d) \tag{4.29}$$

The mean field method minimizes the KL divergence $KL(Q\|P)$. Due to the non-negativity of the KL measure, we then obtain a lower bound on the log normalizing constant

$$\log Z_y \geq \langle \log p(y, \theta) \rangle_Q - \langle \log Q \rangle_Q \tag{4.30}$$

The second term of (4.30) is the entropy: $H[Q] \equiv -\langle \log Q \rangle_Q$. See Chapter A for expressions for the entropy of probability distributions in the exponential family. The factors $Q_d$ obey the fixed point equation

$$Q_d \propto \exp\left(\langle \log p(y, \theta) \rangle_{Q_{\neg d}}\right) \tag{4.31}$$

which can be computed easily if all the factor distributions are chosen to be in a conjugate-exponential family. For example, if a random variable $\sigma^2$ has sufficient statistics $\langle 1/\sigma^2 \rangle_{IG}$ and $\langle \log \sigma^2 \rangle_{IG}$ under an Inverse-Gamma distribution, and $y \sim \mathcal{N}(0, \sigma^2)$, then

$$Q_y \propto \exp\left(-\tfrac{1}{2} \langle 1/\sigma^2 \rangle\,y^2 - \tfrac{1}{2} \log 2\pi - \tfrac{1}{2} \langle \log \sigma^2 \rangle\right) \tag{4.32}$$

and the VB update is $\langle y \rangle_{\mathcal{N}} = 0$, $\langle y^2 \rangle_{\mathcal{N}} = \langle 1/\sigma^2 \rangle^{-1}$, as expected.
See Chapter A for expressions of the sufficient statistics of some probability distributions in the exponential family. The fixed point equation (4.31) has the property that at every iteration the lower bound (4.30) is guaranteed not to decrease.

Variational Bayes and the Gibbs sampler have been compared for inference in audio signal models in Godsill et al. [2007] and Cemgil et al. [2007]. Qualitatively, VB methods tend to converge more quickly than the Gibbs sampler, but may result in a poorer solution because only a lower bound on the marginal likelihood is being optimized.

Chapter 5 A Signal Model for Pitched Musical Instruments

We consider the modelling of pitched musical instruments as a summation of several sinusoids with correlated frequency, amplitude and phase. Background noise and modelling error are treated as Gaussian noise, which makes a probabilistic treatment desirable. We introduce appropriate priors for the model parameters, which are chosen to reflect prior knowledge of the structure of pitched musical note signals, and to allow effective numerical Bayesian inference. The model we introduce is shown to be capable of modelling both frequency and amplitude modulations, which are characteristic of the sound of many musical instruments. The use of a Bayesian methodology allows model selection to be carried out implicitly, so that the number of sinusoids necessary to model the signal appropriately may be determined automatically.

5.1 Contributions

The motivation for this chapter is to extend and further develop promising Bayesian generative models based on a sinusoidal representation for each partial frequency, and to present the developments in such a way that they can be incorporated easily into, and compared with, existing approaches.

In Section 5.2.2 we formulate the mathematical representation of a sinusoid for which the amplitude and phase are permitted to vary slowly in comparison to the central frequency.
This motivates the use of the analytic representation in Section 5.2.3, which eliminates ambiguity in the sinusoidal representation, and we show that this analytic representation can appropriately be applied to existing Bayesian models using sinusoidal representations. We then describe a state-space representation with constant damping ratio in Section 5.2.4, showing that the analytic representation can result in a closed form posterior distribution for the damping ratio and frequency, which does not typically arise in the literature, and which may be used to linearize the posterior distribution under arbitrary choices of frequency priors. In Section 5.2.5 we apply the analytic representation to Gabor models, and additionally motivate the use of sinc basis functions to specify and control the bandwidth of frequency and amplitude modulations in the signal. The use of sinc basis functions may be incorporated into the existing methods independently of whether the analytic representation is used or not.

In Section 5.3 we describe the literature on Bayesian inference in sinusoidal models and a noise model that may be used. We then derive MCMC algorithms for a fixed number of partials with an arbitrary prior on partial frequencies for both the state-space and Gabor models. For the state-space model in particular, we derive the posterior distribution of the frequencies and damping ratios under a normally distributed prior, and show how this can be adapted as an efficient, model-based proposal distribution when the prior is not a normal distribution.

5.2 Model for an Isolated Partial

In this section we consider how to model the signal of an isolated partial frequency. The focus of this section is to introduce various representations of a sinusoidal signal and to show how the amplitude envelope and modulations around the central frequency can be handled.
In the following section, we will consider the superposition of multiple sinusoids and embed the entire model in a probabilistic framework for Bayesian inference.

5.2.1 Motivation

An isolated partial is a minimal description of a musical note. When we listen to an isolated partial, we have the clear perception of the pitch and volume of a musical note. The pitch is related to the frequency of the partial, although the perception of pitch itself is non-linear (2.2.2). Some degree of frequency modulation is tolerated, being perceived as vibrato. The timbre of an isolated partial is perceived as the purest tone, as there is no harmonic structure and therefore no possibility of inharmonicity. Musical instruments which may be modelled by isolated partials include whistles, tuning forks and rubbed crystal glasses.

The model we introduce is parametric, in that the sinusoid is completely described by its frequency, phase and amplitude envelope, and linear, so that we may superimpose multiple sinusoids in order to generate more complex musical tones. We require a comprehensive understanding of the parameters of even such a simple model, because we will pursue the intuition that the perceptual grouping of partials into musical notes (rather than being perceived as separate frequencies) is due to the group of partials having a shared set of parameters.

5.2.2 Amplitude and Phase Modulation

In this chapter we will consider a segment of audio data with $N$ samples and time indices $t = 0, \dots, N-1$, where it is assumed that a set of multiple pitches are sounding throughout the length of the segment. We will begin by modelling an isolated partial as a sinusoid $x[t]$ having a constant angular frequency $\omega$, amplitude envelope $c[t]$ and time-varying phase $\phi[t]$:

$$x[t] = c[t]\cos[\omega t + \phi[t]] \tag{5.1}$$

For this model to be realistic, we require constraints on the amplitude envelope and phase modulation.
The amplitude envelope is used to model changes in the perceived volume of the partial. Hence the bandwidth of the envelope should be restricted to below the lower limit of hearing (20 Hz), otherwise the frequency content of the envelope will be perceived as an additional pitch. The ear is relatively insensitive to the phase of a pitched note, but phase modulations may be perceived as frequency modulations, as the modulation in frequency around $\omega$ is given by the time derivative of $\phi[t]$.

Vibrato (see 2.3.4) is common in many musical genres and instruments, and results in both frequency and amplitude modulations of the note. Experimental studies by Brown and Vaughn [1996] have shown that the perceived pitch centre of a vibrato note is still equivalent to the centre frequency $\omega$. The permissible amount of frequency modulation is governed by stylistic rules; however, a useful guideline is that the depth of vibrato should not cross the frequency boundary of the pitch, which in Western music is a semitone. The speed and depth of vibrato are also limited by the mechanical process which creates the vibrato effect, for example, in the case of a violin, the rocking of the violinist's finger on the string.

5.2.3 Analytic Representation of Sinusoidal and Noise Signals

In (5.1) there is ambiguity in the definitions of $c[t]$ and $\phi[t]$: different choices can result in the identical signal $x[t]$. To overcome this, Gabor [1946] defines the instantaneous amplitude and instantaneous phase using the analytic representation of a signal. This has become the conventional definition, and reduces the ambiguity (see Cohen et al. [1999] for cases where the ambiguity still exists).

An analytic signal is a complex valued signal with no negative frequency components in its Fourier spectrum. The analytic representation of a real valued signal is produced by discarding the negative frequency components of the Fourier transform.
There is no loss of information as the Fourier spectrum of a real signal has Hermitian symmetry around zero frequency. The analytic representation $x_a[t]$ of a real-valued signal $x[t]$ is given by

$$x_a[t] \equiv x[t] + i\,\mathcal{H}[x[t]] \tag{5.2}$$

where $\mathcal{H}$ denotes the Hilbert transform. The Hilbert transform shifts the phase of negative frequency components of the Fourier spectrum by $+\pi/2$ and the positive frequency components by $-\pi/2$. Thus the operation in (5.2) discards the negative frequency components. The original signal may be simply recovered from the real part of the analytic signal:

$$x[t] = \Re\left(x_a[t]\right)$$

The instantaneous amplitude of the analytic representation is defined as $|x_a[t]|$ and the instantaneous phase is defined as $\arg x_a[t]$. For the model of the isolated sinusoid (5.1) we have

$$x_a[t] = c[t]\cos[\omega t + \phi[t]] + i\,c[t]\sin[\omega t + \phi[t]] = c[t]\exp i[\omega t + \phi[t]]$$

from which we can see that the instantaneous amplitude is $c[t]$ and the instantaneous phase is given by $\omega t + \phi[t]$. We will find it convenient to use the analytic representation for our models, and will particularly use the following form

$$x_a[t] = c[t]\exp[i\phi[t]]\exp[i\omega t] \tag{5.3}$$

as the three parameters of the sinusoid are thus separated. The analytic signal is composed of three separate signals multiplied (modulated) together. Using communications terminology, $c[t]$ is the amplitude waveform, $\exp[i\phi[t]]$ is the phase waveform, and $\exp[i\omega t]$ is the carrier signal with frequency $\omega$. To be able to transmit and recover the original waveforms from the modulated signal, it is necessary for the bandwidths of the amplitude and phase waveforms to be much smaller than $\omega$. We will use the same concept for modelling musical signals, as extracting and storing these low bandwidth waveforms is attractive for compression, reconstruction and synthesis.
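As a minimal sketch of the separation in (5.3), the following Python fragment builds the analytic signal of a single partial directly from an amplitude waveform, a phase waveform and a carrier, and then reads back the instantaneous amplitude and the phase deviation. The sampling rate, the 440 Hz carrier and the 5 Hz modulation settings are illustrative assumptions, not values taken from the text, and the function name `analytic_partial` is hypothetical.

```python
import cmath
import math

def analytic_partial(c, phi, omega):
    """Build x_a[t] = c[t] * exp(i*phi[t]) * exp(i*omega*t), the form of (5.3)."""
    return [c_t * cmath.exp(1j * p_t) * cmath.exp(1j * omega * t)
            for t, (c_t, p_t) in enumerate(zip(c, phi))]

# Assumed test settings: 8 kHz sampling, 440 Hz carrier, 5 Hz modulations.
fs = 8000.0
N = 800
omega = 2 * math.pi * 440.0 / fs
c = [1.0 + 0.2 * math.sin(2 * math.pi * 5.0 * t / fs) for t in range(N)]
phi = [0.5 * math.sin(2 * math.pi * 5.0 * t / fs) for t in range(N)]

xa = analytic_partial(c, phi, omega)

# The real part recovers the real-valued sinusoid of (5.1).
x = [z.real for z in xa]

# Instantaneous amplitude |x_a[t]|, and the phase left after removing the carrier.
inst_amp = [abs(z) for z in xa]
phase_dev = [cmath.phase(z * cmath.exp(-1j * omega * t)) for t, z in enumerate(xa)]
```

Because the envelope here is strictly positive and the phase deviation stays well inside $(-\pi, \pi)$, `inst_amp` and `phase_dev` reproduce `c` and `phi` up to rounding error. For a recorded real signal the analytic representation would instead have to be computed through (5.2).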
The model which we develop in this chapter is based on existing work on methods for sinusoidal models [Serra, 1997, Walmsley et al., 1999, Davy and Godsill, 2003, etc.], and it is necessary for us to confirm that the properties of these models are consistent with the analytic representation of the signal. The Hilbert transform is a linear operator, so the frequencies and amplitudes of the sinusoids are preserved. Moreover, Picinbono and Bondon [1997] show that the analytic representation of a wide-sense stationary real signal is proper, or circularly symmetric [Neeser and Massey, 1993]. Hence the analytic representation of a white noise process is a complex Gaussian random variable, and the properties of an autoregressive (AR) process used to model coloured noise are retained in the analytic representation.

In 5.2.4 and 5.2.5, we consider two common formulations of the sinusoidal signal (5.3) which are used in state-of-the-art Bayesian harmonic models. Our contribution here is to apply these formulations to the analytic representation of the signal, and to demonstrate how bandwidth constraints on frequency and amplitude modulations may be naturally and practically applied.

5.2.4 State-Space Formulation

The first formulation treats the sinusoid as a rotating phasor, and is motivated by the work of Cemgil et al. [2006], who use a state-space approach to model the rotation of a real-valued sinusoid from one sample to the next. This approach was used in a polyphonic transcription system capable of resolving note onsets and offsets to sample resolution. Moreover, as the notes are processed sample by sample and not on a frame by frame basis, there are no artifacts arising from reconstruction and synthesis due to phase discontinuities and discrepancies at frame boundaries.
In our model, the relationship between one sample of the sinusoid and the next is given by

\begin{align}
x_a[t+1] &= c[t+1]\exp[i\phi[t+1]]\exp[i\omega(t+1)] \notag \\
&= c[t+1]\exp[i\phi[t+1]]\exp[i\omega t]\exp[i\omega] \notag \\
&= \frac{c[t+1]}{c[t]}\,\frac{\exp[i\phi[t+1]]}{\exp[i\phi[t]]}\exp[i\omega]\,c[t]\exp[i\phi[t]]\exp[i\omega t] \notag \\
&= \frac{c[t+1]}{c[t]}\exp[i\phi[t+1] - i\phi[t]]\exp[i\omega]\,x_a[t] \tag{5.4}
\end{align}

The $\frac{c[t+1]}{c[t]}$ term in (5.4) gives the rise or decay in the amplitude envelope between $t$ and $t+1$. Following Cemgil et al. [2006] we refer to this as the damping ratio, and define $\rho[t] \equiv \frac{c[t+1]}{c[t]}$. We will apply a constraint on the amplitude envelope by choosing to make this damping ratio constant throughout the segment of audio: $\rho[t] = \rho$ for all $t = 0, \dots, N-1$. This is appropriate for musical instruments such as the piano and guitar, where after the onset of the note (when the string is struck by a hammer or plucked) the decay of the energy in each partial can be approximately described as exponential, $0 < \rho < 1$. It is also appropriate for notes held at a constant volume, where $\rho = 1$. For other situations where other shapes of amplitude envelope would be expected, the Gabor model described in the next section is more appropriate.

The $\exp[i\phi[t+1] - i\phi[t]]$ term in (5.4) is the difference in the phase modulations, which can be seen as an approximation to the frequency modulation at $t$. Realistic frequency modulations should be small, hence we approximate using the Taylor expansion:

$$\exp[i\phi[t+1] - i\phi[t]] \approx 1 + \left(i\phi[t+1] - i\phi[t]\right) \tag{5.5}$$

The second term in (5.5), which we will denote as $f[t] \equiv i\phi[t+1] - i\phi[t]$, is small, and is purely imaginary, as the phase $\phi[t]$ is real for all $t$. However, for convenience, we will model this frequency modulation term as a zero mean complex Gaussian random variable with small variance $\sigma_f^2$, i.e.,

$$p(f[t]) = \mathcal{N}_C\left(0, \sigma_f^2\right)$$

The consequence of $f[t]$ having a real part is that small amplitude modulations in addition to the damping ratio $\rho$ are permitted.
As stated in 5.2.2, frequency modulations in a musical note are often accompanied by amplitude modulations, hence we do not consider this inconsistency in our model a disadvantage. When we incorporate the above constraints into (5.4) we have

$$x_a[t+1] = \rho\exp[i\omega]\,x_a[t] + f[t] \tag{5.6}$$

or alternatively, expressed as a conditional probability distribution:

$$p\left(x_a[t+1] \mid x_a[t], \rho, \exp[i\omega], \sigma_f^2\right) = \mathcal{N}_C\left(\rho\exp[i\omega]\,x_a[t], \sigma_f^2\right) \tag{5.7}$$

(5.6) and (5.7) show us that $x_a[t]$ can be regarded as the internal state of a linear dynamic system. This fact was used in Cemgil et al. [2006], where the parameters $\rho$, $\omega$ and $\sigma_f^2$ were known, and the Kalman filter was used to infer the sinusoid in the presence of observation noise. In 5.3.3 we will show that these parameters may be treated as unknown and that Bayesian inference can be used to estimate them.

5.2.5 Gabor Model

A model for slowly varying partial amplitudes was introduced by Godsill and Davy [2002] as an extension of the existing harmonic model of Walmsley et al. [1999], which assumed that the amplitude of each partial is constant throughout the note segment. Each partial is projected onto a set of Gabor functions $\psi_{i,\omega}[t]$, each of which has a fixed real-valued envelope $\psi[t]$, symmetric around $t = 0$ and having a finite region of support, shifted in time by $i\Delta$ and modulated by frequency $\omega$ equal to the frequency of the partial:

$$\psi_{i,\omega}[t] = \psi[t - i\Delta]\exp[i\omega t]$$

Here the subscript $i$ and the shift $i\Delta$ use $i$ as the index of the basis function, whereas in $\exp[i\omega t]$ it is the imaginary unit. The constant $\Delta$ is the difference, in samples, between the centres of adjacent basis functions and controls the spacing along the time axis between neighbouring Gabor atoms. $\Delta$ is chosen such that the support of each function overlaps with the next, thus ensuring that the amplitude envelope varies smoothly throughout the length of the note segment.
The Gabor model applied to the analytic representation of a signal is the projection of $x_a[t]$ onto $I + 1$ Gabor functions, where $\Delta(I + 1)$ is equal to the length $N$ of $x_a[t]$:

$$x_a[t] = \sum_{i=0}^{I} b_i\,\psi[t - i\Delta]\exp[i\omega t] \tag{5.8}$$

The complex-valued basis coefficients $b_i$ may be viewed as the amplitude of each Gabor function $\psi[t - i\Delta]\exp[i\omega t]$. The amplitude envelope as modelled, comparing with (5.3), is given by

$$c[t]\exp[i\phi[t]] = \sum_{i=-\infty}^{\infty} b_i\,\psi[t - i\Delta] \tag{5.9}$$

and is also used to account for frequency modulations in our model. From a signal processing perspective, the envelope would be obtained by low-pass filtering the partial to remove the frequency component at $\omega$ of the spectrum. If we were to select the envelope of the Gabor function as the sinc function

$$\psi[t] = \frac{\sin[2\pi t/\Delta]}{2\pi t/\Delta} \tag{5.10}$$

then the bandwidth of the amplitude envelope (5.9) is constrained to $1/\Delta$. This result is based on the use of the sinc filter for perfect reconstruction of bandlimited signals [Shannon, 1998].

Godsill and Davy [2002] use a Hamming window as the envelope for the Gabor basis functions. When we use a sinc function (5.10) to model sinusoids with periodic amplitude and frequency modulations embedded in white noise, we have found that the residual of the modelling is smaller than when using Hamming windows, and the reconstruction sounds better. Figure 5.1 compares the reconstructions obtained of a sinusoid by sinc and Hamming basis functions. The sinusoid has a central frequency of 440 Hz, with a frequency modulation of depth 5 Hz and speed 5 Hz, and amplitude modulation of magnitude 0.2 and speed 5 Hz. Ten basis functions were used to cover the entire signal length of 1 second, hence modulations up to 10 Hz can be captured. Both basis functions model the spectrum well around the central frequency, but the sinc basis model has a smaller residual and fewer reconstruction artefacts away from the central frequency.
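The synthesis equation (5.8) with the sinc envelope (5.10) can be sketched in a few lines of Python. This is only an illustration under assumed settings: the coefficient values, the spacing `delta` and the frequency are hypothetical, and the function names are not from the text. One property worth noting is that, with this envelope, $\psi[k\Delta] = 0$ for every non-zero integer $k$, so the modelled envelope passes exactly through the coefficient $b_j$ at $t = j\Delta$.

```python
import cmath
import math

def sinc_envelope(t, delta):
    """psi[t] = sin(2*pi*t/delta) / (2*pi*t/delta), with psi[0] = 1, as in (5.10)."""
    if t == 0:
        return 1.0
    u = 2 * math.pi * t / delta
    return math.sin(u) / u

def gabor_synthesis(b, omega, delta, n_samples):
    """Evaluate x_a[t] = sum_i b_i * psi[t - i*delta] * exp(i*omega*t), as in (5.8)."""
    out = []
    for t in range(n_samples):
        env = sum(b_i * sinc_envelope(t - i * delta, delta)
                  for i, b_i in enumerate(b))
        out.append(env * cmath.exp(1j * omega * t))
    return out

# Assumed settings: ten basis functions, 100 samples apart, so N = delta * (I + 1).
delta = 100
b = [cmath.rect(1.0 + 0.1 * math.cos(k), 0.2 * math.sin(k)) for k in range(10)]
omega = 2 * math.pi * 0.055
xa = gabor_synthesis(b, omega, delta, delta * len(b))

# Envelope sampled at the centre of each basis function (carrier removed).
centres = [xa[j * delta] * cmath.exp(-1j * omega * j * delta) for j in range(len(b))]
```

Here `centres` agrees with `b` up to rounding error, which gives a quick sanity check on an implementation before any inference is attempted.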
In practice we limit the support of the sinc function to $4\Delta$, i.e.,

$$\psi[t] = \begin{cases} \dfrac{\sin[2\pi t/\Delta]}{2\pi t/\Delta} & |t| \le 4\Delta \\ 0 & |t| > 4\Delta \end{cases}$$

as the amplitude of the envelope is small outside the central region.

Figure 5.1: Comparison of the reconstructions of a violin note with fundamental frequency 440 Hz and played with vibrato, using sinc and Hamming basis functions. The sinc basis reconstruction has a smooth spectral shape matching the original signal, whereas the reconstruction using the Hamming basis has a periodic artefact resulting from difficulties modelling the frequency and amplitude modulations in the signal. (Plot of log spectrum against frequency in Hz, from 400 Hz to 500 Hz, with curves for the original signal, the sinc basis reconstruction and the Hamming basis reconstruction.)

5.3 Probabilistic Model for Multiple Partials

In this section we will combine multiple instances of the models for isolated partials described in 5.2.2 and embed them in observation noise. The combined model may then be used for estimating the spectrum of musical signals. We will adopt a Bayesian approach throughout, and our goal is to jointly infer the number of partials and their frequency and amplitudes through Bayesian model selection.

5.3.1 Background

Full Bayesian inference of a sinusoidal model with noise was first carried out by Andrieu and Doucet [1999] using a reversible jump MCMC scheme (4.2.2.1). A short frame of samples is modelled by a set of constant amplitude sinusoids in white noise, and the inference scheme is shown to correctly and robustly estimate the number of sinusoids present even at low signal-to-noise ratios. The conditional distribution of the frequencies of the sinusoids is not, however, of a form which can be sampled easily. By this we mean that $p(\omega \mid y, \theta)$, where $\omega$ is the set of sinusoid frequencies, $y$ is the observed data, and $\theta$ are the remainder of the model parameters, is not of a standard form for which a sampling algorithm is known.
Two Metropolis-Hastings proposal schemes are suggested for updating the frequencies of the sinusoids from $\omega^{(i)}$ in iteration $i$ of the algorithm to $\omega^{(i+1)}$. The first is a local proposal which generates candidates $\omega'$ from a Gaussian distribution with mean $\omega^{(i)}$ and small variance. This allows the frequencies to be estimated to a high precision. The second is a global proposal which generates candidates $\omega'$ independently from $\omega^{(i)}$, with probability proportional to the Fourier spectrum. This allows the Markov chain to explore new regions of the spectrum. Andrieu and Doucet [1999] provide the probabilities at which to accept each proposal, and also describe birth and death moves for sinusoids in the reversible jump framework, so that the number of sinusoids can be estimated.

Walmsley et al. [1999] extend the above model to harmonic signals, where the frequencies of each partial are set to integer multiples of the fundamental frequency. Notes in the model are turned on and off using binary indicator variables. The global proposal is facilitated by a harmonic transform which functions in a similar way to a comb filter (2.2.2), incorporating the energy of the higher-order harmonics into the fundamental and low-order harmonics. An additional proposal is designed to allow the inference algorithm to explore octave errors.

Godsill and Davy [2002] further extend the model so that each sinusoid is modelled by a set of Gabor basis functions, as outlined in 5.2.5. Inharmonicity is introduced into the model, originally as an additive term, then as a multiplicative term in Godsill and Davy [2005] (3.3). Reversible jump MCMC is again used, and a range of moves are proposed to explore the high dimensional model space fully: note births and deaths, adding and subtracting variable numbers of harmonics from each note, and multiplying or dividing the fundamental frequency by a factor of two to explore octave errors.

Cemgil et al. [2006] use the state-space representation outlined in 5.2.4 in a polyphonic transcription system. The partial frequencies are fixed to MIDI specification frequencies (2.3) and the amplitude envelopes of the notes are fixed. The note onsets and offsets are inferred using a pruning algorithm.

In this section we will use much of the prior structure that has been developed by the above authors, and apply it to the analytic representation of the signal with the constraints on the amplitude and frequency modulation as described in Section 5.2. The contributions made in this section are improvements to, and developments of, the state-of-the-art inference algorithms for these models. For the state-space model, the posterior distribution of the frequency parameters under a normal distribution prior is available in closed form, and this fact is used to derive a Gibbs sampler for the normal distribution prior. For an arbitrary prior distribution on the frequency parameters, a Metropolis-Hastings MCMC algorithm is derived using a linearization of the posterior distribution as an efficient proposal distribution. This allows the state-space representation to be used to accurately infer frequencies using a rich prior model for inharmonicity, as will be demonstrated in Section 5.5. Prior to this work, inference on the state-space model was restricted to a fixed grid of frequencies [Cemgil et al., 2006]. For the Gabor model, the contribution is the derivation of the posterior mode of a signal-to-noise ratio hyperparameter, allowing this parameter to be inferred from a marginalized distribution, thus improving estimation and eliminating the computation required for simulating latent parameters.

5.3.2 Noise Model

In this chapter we have chosen to use a white noise model. As the sinusoidal models we consider here are linear, it is straightforward to model coloured noise sources using an autoregressive (AR) process.
For the state-space model, the extension required is straightforward, as the AR model is itself commonly expressed as a state-space model. For the extensions required to the Gabor model, see Godsill and Davy [2002].

For the remainder of this chapter, we drop the subscript $a$ denoting that $x_a[t]$ is an analytic representation, and work with $M$ partials which we denote $x_m[t]$. The white noise process is denoted $n[t]$ and has variance $\sigma_n^2$. The signal we observe is denoted $y[t]$ and is given by

$$y[t] = \sum_{m=1}^{M} x_m[t] + n[t] \tag{5.11}$$

We choose the prior distribution of $\sigma_n^2$ to be inverse-Gamma, $p(\sigma_n^2) = \mathcal{IG}(\sigma_n^2; \alpha_n, \beta_n)$, such that the conditional distribution of $\sigma_n^2$, given $\mathbf{n} \equiv [n[0], \dots, n[N-1]]^\top$, is

$$p\left(\sigma_n^2 \mid \mathbf{n}\right) = \mathcal{IG}\left(\sigma_n^2; \alpha_n + \frac{N}{2}, \beta_n + \frac{1}{2}\mathbf{n}^\top\mathbf{n}\right) \tag{5.12}$$

A common setting is $\alpha_n = \beta_n = 0$, such that

$$p\left(\sigma_n^2\right) \propto \frac{1}{\sigma_n^2}$$

This prior is invariant to arbitrary scaling of the observed signal, and has a maximum entropy interpretation [Jeffreys, 1946]. The structure of the signal model depends on the parametrization that we have chosen.

5.3.3 State-Space Formulation

For the state-space formulation, each partial $x_m[t]$ has an unknown damping ratio $\rho_m$ and frequency $\omega_m$. From (5.7) and (5.11) the model is

\begin{align}
p\left(x_m[t+1] \mid x_m[t], \rho_m, \omega_m, \sigma_f^2\right) &= \mathcal{N}_C\left(x_m[t+1]; \rho_m\exp[i\omega_m]\,x_m[t], \sigma_f^2\right) \tag{5.13} \\
p\left(y[t] \mid x_1[t], \dots, x_M[t], M, \sigma_n^2\right) &= \mathcal{N}_C\left(y[t]; \sum_{m=1}^{M} x_m[t], \sigma_n^2\right) \notag
\end{align}

The model (5.13) is a linear dynamical system, with an unobserved state vector $\mathbf{x}_t = [x_1[t], \dots, x_M[t]]^\top$ at time $t$, diagonal $M \times M$ state transition matrix $\mathbf{A}$ with elements $\rho_m\exp[i\omega_m]$ along the diagonal, process noise covariance matrix $\sigma_f^2\mathbf{I}_M$, observation model $\mathbf{H}$ which is a $1 \times M$ vector with all elements equal to one, and observation noise variance $\sigma_n^2$:

\begin{align}
p\left(\mathbf{x}_{t+1} \mid \mathbf{x}_t, \mathbf{A}, \sigma_f^2\right) &= \mathcal{N}\left(\mathbf{x}_{t+1}; \mathbf{A}\mathbf{x}_t, \sigma_f^2\mathbf{I}_M\right) \tag{5.14} \\
p\left(y[t] \mid \mathbf{x}_t, \sigma_n^2\right) &= \mathcal{N}\left(y[t]; \mathbf{H}\mathbf{x}_t, \sigma_n^2\right) \notag
\end{align}

This is in a standard form for inferring the marginal distribution of the state vector $\mathbf{x}_t$ at each time $t$ given the entire signal $y[0], \dots, y[N-1]$:

$$p\left(\mathbf{x}_t \mid y[0], \dots, y[N-1], \{\rho_m, \omega_m\}_{m=1,\dots,M}, \sigma_f^2, \sigma_n^2\right)$$

using the Kalman filtering and smoothing recursions. The only remaining requirement is that a multivariate normal prior $p(\mathbf{x}_0)$ be specified as the initial condition of the state vector.

The unknown parameters for the state-space model appear together as a complex number $a_m \equiv \rho_m\exp[i\omega_m]$. We show that the posterior distribution of the unknown parameters $a_m$ is a normal distribution if a normal prior

$$p\left(a_m \mid \mu_m, \sigma_m^2\right) = \mathcal{N}\left(a_m; \mu_m, \sigma_m^2\right) \tag{5.15}$$

is used:

\begin{align}
p\left(a_m \mid x_m[0], \dots, x_m[N-1], \mu_m, \sigma_m^2, \sigma_f^2\right) &= \frac{1}{Z_x}\,p\left(x_m[0], \dots, x_m[N-1], a_m, \mu_m, \sigma_m^2, \sigma_f^2\right) \tag{5.16} \\
&= \frac{1}{Z_x}\,p\left(a_m, \mu_m, \sigma_m^2\right)\prod_{t=1}^{N-1} p\left(x_m[t] \mid x_m[t-1], a_m, \sigma_f^2\right) \tag{5.17} \\
&= \frac{1}{Z_x}\exp\left(-\frac{a_m^2}{2\sigma_m^2} + \frac{a_m\mu_m}{\sigma_m^2} + \frac{a_m}{\sigma_f^2}\sum_{t=1}^{N-1} x_m[t]\,x_m[t-1] - \frac{a_m^2}{2\sigma_f^2}\sum_{t=1}^{N-1} x_m^2[t-1]\right) \notag \\
&= \frac{1}{Z_x}\exp\left(-\frac{1}{2}\left(\frac{1}{\sigma_m^2} + \frac{1}{\sigma_f^2}\sum_{t=1}^{N-1} x_m^2[t-1]\right)a_m^2 + \left(\frac{\mu_m}{\sigma_m^2} + \frac{1}{\sigma_f^2}\sum_{t=1}^{N-1} x_m[t]\,x_m[t-1]\right)a_m\right) \tag{5.18}
\end{align}

where $Z_x$ is the normalizing constant of the posterior distribution of $a_m$:

$$Z_x = p\left(x_m[0], \dots, x_m[N-1], \mu_m, \sigma_m^2, \sigma_f^2\right)$$

From this we see that the variance of the posterior distribution of $a_m$ is

$$\left(\frac{1}{\sigma_m^2} + \frac{1}{\sigma_f^2}\sum_{t=1}^{N-1} x_m^2[t-1]\right)^{-1}$$

and the mean is

$$\left(\frac{1}{\sigma_m^2} + \frac{1}{\sigma_f^2}\sum_{t=1}^{N-1} x_m^2[t-1]\right)^{-1}\left(\frac{\mu_m}{\sigma_m^2} + \frac{1}{\sigma_f^2}\sum_{t=1}^{N-1} x_m[t]\,x_m[t-1]\right)$$

For a known number of partials $M$ we have derived an MCMC scheme, presented as Algorithm 5.1, to infer the posterior distribution

$$p\left(a_1, \dots, a_M, \mathbf{x}_0, \dots, \mathbf{x}_{N-1}, \sigma_n^2 \mid y[0], \dots, y[N-1], \sigma_f^2, \{\mu_m, \sigma_m^2\}_{m=1,\dots,M}\right)$$

This is a simple and straightforward MCMC scheme for spectrum estimation, and it requires no additional tuning parameters. However, the requirement that the prior $p(a_m)$ on the damping ratio and partial frequencies must be a normal distribution is very restrictive. We do not foresee that Bayesian hierarchical models for musical structure such as key and chords will impose normally distributed priors on the partial frequencies.
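To make the closed-form update concrete, the following Python sketch simulates a single damped partial from the recursion (5.6) and then evaluates the posterior mean and variance of $a_m$ in the form of (5.18). It is a sketch under stated assumptions rather than the implementation used in this work: the sums are written with a complex conjugate on $x_m[t-1]$ and with $|x_m[t-1]|^2$, which is one way of reading the expressions for a complex-valued state, and the parameter values, the seed and the function names are hypothetical.

```python
import cmath
import math
import random

def simulate_partial(a, x0, sigma_f2, n_samples, rng):
    """Simulate x[t+1] = a*x[t] + f[t], with f[t] circular complex Gaussian, cf. (5.6)."""
    s = math.sqrt(sigma_f2 / 2.0)  # standard deviation of the real and imaginary parts
    x = [x0]
    for _ in range(n_samples - 1):
        f = complex(rng.gauss(0.0, s), rng.gauss(0.0, s))
        x.append(a * x[-1] + f)
    return x

def posterior_a(x, mu, sigma_m2, sigma_f2):
    """Posterior mean and variance of a, following the form of (5.18).

    Assumption: complex states are handled with conjugates, i.e. the sums are
    sum |x[t-1]|^2 and sum x[t] * conj(x[t-1]).
    """
    sxx = sum(abs(x[t - 1]) ** 2 for t in range(1, len(x)))
    sxy = sum(x[t] * x[t - 1].conjugate() for t in range(1, len(x)))
    var = 1.0 / (1.0 / sigma_m2 + sxx / sigma_f2)
    mean = var * (mu / sigma_m2 + sxy / sigma_f2)
    return mean, var

rng = random.Random(0)
a_true = 0.999 * cmath.exp(1j * 0.3)       # damping ratio 0.999, frequency 0.3 rad/sample
x = simulate_partial(a_true, 1.0 + 0.0j, 1e-6, 500, rng)
mean, var = posterior_a(x, 0.0, 10.0, 1e-6)
rho_hat, omega_hat = abs(mean), cmath.phase(mean)
```

With a broad prior and small process noise the posterior mean sits close to the simulated value, and its modulus and argument can be read as estimates of the damping ratio and the frequency. In a Gibbs sweep the state sequence `x` would be a sample from the Kalman smoother rather than the simulated truth.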
At this stage we may instead allow an arbitrary prior $p(a_1, \dots, a_M)$, but the final sampling step of Algorithm 5.1 must then be replaced with a Metropolis-Hastings step. Our contribution here is to suggest a proposal distribution strongly based on the underlying model of the signal, which has been found in practice to have a high acceptance rate whilst reaching the mode of the posterior distribution of the damping ratio and partial frequencies rapidly. This contrasts with global proposals based on the periodogram estimate and local random-walk proposals normally required to effectively explore the non-linear posterior (see 5.3.4).

Algorithm 5.1 Gibbs sampler for the state-space model
• Initialization
  - For $m = 1, \dots, M$ sample the diagonal elements of $\mathbf{A}^{(0)}$: $a_m^{(0)} \sim p(a_m \mid \mu_m, \sigma_m^2)$ (5.15)
  - Sample $\mathbf{x}_0^{(0)} \sim p(\mathbf{x}_0)$
  - For $t = 1, \dots, N-1$ sample $\mathbf{x}_t^{(0)} \sim p(\mathbf{x}_t \mid \mathbf{x}_{t-1}^{(0)}, \mathbf{A}^{(0)}, \sigma_f^2)$ (5.14)
• Iterations, $i = 1, 2, \dots$
  - Compute $n^{(i-1)}[t] = y[t] - \sum_{m=1}^{M} x_m^{(i-1)}[t]$ and sample $\sigma_n^{2(i)} \sim p(\sigma_n^2 \mid \mathbf{n}^{(i-1)})$ (5.12)
  - For $t = 1, \dots, N-1$ sample $\mathbf{x}_t^{(i)} \sim p(\mathbf{x}_t \mid y[0], \dots, y[N-1], \mathbf{A}^{(i-1)}, \sigma_f^2, \sigma_n^{2(i)})$, computed using Kalman filter and smoother recursions
  - For $m = 1, \dots, M$ sample $a_m^{(i)} \sim p(a_m \mid x_m^{(i)}[0], \dots, x_m^{(i)}[N-1], \mu_m, \sigma_m^2, \sigma_f^2)$ (5.18)

In this scheme, for each $m = 1, \dots, M$ we use a proposal distribution

$$Q\left(a_m; x_m^{(i)}[0], \dots, x_m^{(i)}[N-1], \sigma_f^2, a_{1:m-1}^{(i-1)}, a_{m+1:M}^{(i)}\right)$$

which is of the same form as (5.18) but substituting

$$\mu_m = \langle a_m \rangle_{p\left(a_m \mid a_{1:m-1}^{(i-1)}, a_{m+1:M}^{(i)}\right)} \quad\text{and}\quad \sigma_m^2 = \langle a_m^2 \rangle_{p\left(a_m \mid a_{1:m-1}^{(i-1)}, a_{m+1:M}^{(i)}\right)} - \mu_m^2$$

so that the proposal distribution would be equal to the posterior if the prior were a normal distribution. The acceptance probability of the proposed candidate

$$a_m' \sim Q\left(a_m; x_m^{(i)}[0], \dots, x_m^{(i)}[N-1], \sigma_f^2, a_{1:m-1}^{(i-1)}, a_{m+1:M}^{(i)}\right)$$

is the minimum of 1 and

$$\frac{\prod_{t=1}^{N-1}\mathcal{N}\left(x_m^{(i-1)}[t]; a_m' x_m^{(i-1)}[t-1], \sigma_f^2\right)\,p\left(a_m' \mid a_{1:m-1}^{(i-1)}, a_{m+1:M}^{(i)}\right)}{\prod_{t=1}^{N-1}\mathcal{N}\left(x_m^{(i-1)}[t]; a_m^{(i-1)} x_m^{(i-1)}[t-1], \sigma_f^2\right)\,p\left(a_m^{(i-1)} \mid a_{1:m-1}^{(i-1)}, a_{m+1:M}^{(i)}\right)} \times \frac{Q\left(a_m^{(i-1)}; x_m^{(i)}[0], \dots, x_m^{(i)}[N-1], \sigma_f^2, a_{1:m-1}^{(i-1)}, a_{m+1:M}^{(i)}\right)}{Q\left(a_m'; x_m^{(i)}[0], \dots, x_m^{(i)}[N-1], \sigma_f^2, a_{1:m-1}^{(i-1)}, a_{m+1:M}^{(i)}\right)}$$

5.3.4 Gabor Model

The Gabor formulation of the model is given by (5.8):

$$y[t] = \sum_{m=1}^{M}\sum_{i=0}^{I} b_{i,m}\,\psi[t - i\Delta]\exp[i\omega_m t] + n[t] \tag{5.19}$$

For convenience, we rewrite (5.19) in matrix form, by stacking the amplitudes $b_{i,m}$ into a column vector $\mathbf{b}$ of length $(I+1)M$, with elements

$$b_{(m-1)(I+1)+i} = b_{i,m}$$

and the Gabor basis functions into an $N \times (I+1)M$ matrix $\mathbf{D}$ with elements

$$D_{t,(m-1)(I+1)+i} = \psi[t - i\Delta]\exp[i\omega_m t] \tag{5.20}$$

Writing $\mathbf{y} = [y[0], \dots, y[N-1]]^\top$ and $\mathbf{n} \equiv [n[0], \dots, n[N-1]]^\top$ as before, (5.19) becomes

$$\mathbf{y} = \mathbf{D}\mathbf{b} + \mathbf{n}$$

The approaches described in 5.3.1 related to this model all adopt the g-prior [Zellner, 1986], which is chosen for its properties in Bayesian model selection. The g-prior is a zero mean multivariate normal prior distribution for $p(\mathbf{b} \mid \mathbf{D}, \sigma_n^2)$ with covariance matrix $\sigma_n^2\,\xi\left(\mathbf{D}^\top\mathbf{D}\right)^{-1}$ and an additional parameter $\xi$. As $\xi$ scales the amplitudes with respect to the noise level, it can be interpreted as a prior signal-to-noise ratio. In the case where we treat $\xi$ as unknown and wish to additionally infer it in a Bayesian setting, we again follow the literature in 5.3.1¹ and assign an inverse-gamma prior: $p(\xi) = \mathcal{IG}(\xi; \alpha_\xi, \beta_\xi)$.

The probabilistic model described so far, including the additional parameter $\xi$ which was introduced by adopting the g-prior, is

$$p\left(\mathbf{y}, \mathbf{b}, \mathbf{D}, \sigma_n^2, \xi\right) = p\left(\mathbf{y} \mid \mathbf{D}\mathbf{b}, \sigma_n^2\right)p\left(\mathbf{b} \mid \mathbf{D}, \sigma_n^2, \xi\right)p\left(\sigma_n^2\right)p(\xi) \tag{5.21}$$

We have yet to discuss a prior $p(\mathbf{D})$ for $\mathbf{D}$. From (5.20) this is a prior $p(\omega_1, \dots, \omega_M)$ on the partial frequencies, which are the remaining unknowns in this model.
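In the remainder of this section, we show a result for this model that has not been referred to in the literature we have reviewed. The model parameters $\mathbf{b}$ and $\sigma_n^2$ may be integrated out, giving the following marginal distribution:

$$p(\mathbf{y} \mid \mathbf{D}, \xi)\,p(\xi) \propto \left(\mathbf{y}^\top\mathbf{P}\mathbf{y} + \beta_n\right)^{-(N+\alpha_n)/2}\,\xi^{-(\alpha_\xi+1)}\exp\left(-\frac{\beta_\xi}{\xi}\right) \tag{5.22}$$

where the following definitions are used, as in the literature:

\begin{align}
\mathbf{S}^{-1} &= \mathbf{D}^\top\mathbf{D} + \frac{1}{\xi}\mathbf{D}^\top\mathbf{D} = \frac{\xi+1}{\xi}\mathbf{D}^\top\mathbf{D} \notag \\
\mathbf{P} &= \mathbf{I}_N - \mathbf{D}\mathbf{S}\mathbf{D}^\top \tag{5.23} \\
&= \mathbf{I}_N - \frac{\xi}{\xi+1}\mathbf{D}\left(\mathbf{D}^\top\mathbf{D}\right)^{-1}\mathbf{D}^\top \notag \\
&= \mathbf{I}_N - \frac{\xi}{\xi+1}\mathbf{D}\mathbf{D}^\dagger \notag
\end{align}

where $\mathbf{D}^\dagger = \left(\mathbf{D}^\top\mathbf{D}\right)^{-1}\mathbf{D}^\top$ has been used to make the notation a little more concise.

¹ For clarity of notation we denote the scaling hyperparameter as $\xi$. This is equivalent to $\delta^2$ used in Andrieu and Doucet [1999] and $\xi^2$ used in Davy et al. [2006].

It would be useful for Bayesian inference to be able to sample from the conditional distribution $p(\xi \mid \mathbf{y}, \mathbf{D})$, but this is not a standard distribution. Instead, previous approaches have used the following distribution

$$p\left(\xi \mid \mathbf{D}, \mathbf{b}, \sigma_n^2\right) = \mathcal{IG}\left(\alpha_\xi + (I+1)M,\; \frac{1}{2\sigma_n^2}\mathbf{b}^\top\mathbf{D}^\top\mathbf{D}\mathbf{b} + \beta_\xi\right)$$

which is a standard distribution, but requires that $\mathbf{b}$ and $\sigma_n^2$ be available. This may not be satisfactory: we chose to integrate these parameters out analytically in (5.22) in order to improve the Monte Carlo estimates of the remaining parameters, and significant additional computation is required to simulate them. Alternatively, it is possible to integrate $p(\xi \mid \mathbf{y}, \mathbf{D})$ numerically, as Richardson and Green [1997] have chosen to do.

The result we present here is that although $p(\xi \mid \mathbf{y}, \mathbf{D})$ is not a standard distribution, its mode, or MAP estimate, is available as the solution of the quadratic equation (see Section B.1 for the full derivation)

$$\xi^2\left(\frac{N+\alpha_n}{2} + (\alpha_\xi+1)\right)\left(\mathbf{y}^\top\mathbf{D}\mathbf{D}^\dagger\mathbf{y}\right) - \xi\left((\alpha_\xi+1)\,\mathbf{y}^\top\mathbf{y} + \beta_\xi\,\mathbf{y}^\top\mathbf{D}\mathbf{D}^\dagger\mathbf{y}\right) + \beta_\xi\,\mathbf{y}^\top\mathbf{y} = 0 \tag{5.24}$$

The positive root of (5.24) is the mode; the other root is negative and is disallowed by the prior $p(\xi)$. Using (5.24) to estimate the mode $\xi^*$ reduces the computation required for each iteration of the MCMC algorithm, and slightly reduces the number of iterations required for convergence.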
This is illustrated in Figure 5.2 for a single violin note with fundamental frequency 440 Hz, which is part of the data set used later in this chapter in 5.5.1 to demonstrate the ability of these algorithms to infer partial frequencies in the signal.

This result is useful in an iterative MAP estimation scheme. It can also be used as the basis of a proposal distribution for a Metropolis-Hastings MCMC kernel. As the prior $p(\xi)$ and conditional posterior distribution $p(\xi \mid \mathbf{D}, \mathbf{b}, \sigma_n^2)$ are both inverse gamma, it seems plausible to construct a proposal distribution which is also inverse gamma, with its mode $\xi^*$ given by (5.24). We suggest setting the shape parameter of the proposal distribution to $\alpha_q$ and the scale parameter to $(\alpha_q + 1)\,\xi^*$. The parameter $\alpha_q$ controls the variance of the proposal and should be tuned for a suitable acceptance rate.

A Metropolis-Hastings kernel is also necessary for simulating the frequencies $\omega_1, \dots, \omega_M$, as neither the joint distribution $p(\omega_1, \dots, \omega_M \mid \xi, \mathbf{y})$ nor any of the individual distributions $p(\omega_m \mid \omega_1, \dots, \omega_{m-1}, \omega_{m+1}, \dots, \omega_M, \xi, \mathbf{y})$ is a standard distribution. We are able to impose any prior distribution $p(\omega_{1:M})$ using the Gabor model. Here we suggest two proposal distributions based on computing the residual of the signal excluding the partial frequency of interest. Denote $\mathbf{D}_{\neg m}$ as the $N \times (I+1)(M-1)$ matrix where the columns related to the $m$th partial have been omitted. The least squares reconstruction of the signal excluding the $m$th partial is given by $\mathbf{D}_{\neg m}\mathbf{D}_{\neg m}^\dagger\mathbf{y}$, and the residual signal, which is expected to contain the $m$th partial and also noise, is given by $\mathbf{x}_m = \mathbf{y} - \mathbf{D}_{\neg m}\mathbf{D}_{\neg m}^\dagger\mathbf{y}$. Note that computing the residual avoids simulating $\mathbf{b}$ and $\sigma_n^2$. The proposal distributions are thus of the form $Q(\omega_m; \mathbf{y}, \omega_{1:M\neg m}, \xi)$, where $\omega_{1:M\neg m}$ denotes the set of partial frequencies excluding $\omega_m$. We are taking advantage of being able to design custom Metropolis-Hastings kernels in order to reduce computation whilst performing full Bayesian inference.
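Returning briefly to the inverse-gamma proposal for $\xi$ suggested above, the choice of scale $(\alpha_q + 1)\,\xi^*$ follows from the fact that an inverse-gamma density with shape $\alpha$ and scale $\beta$ has its mode at $\beta/(\alpha+1)$. The short Python sketch below draws from such a proposal and evaluates its log density, which is what a Metropolis-Hastings ratio would need. The numerical values of `xi_star` and `alpha_q` and the function names are illustrative assumptions, and $\xi^*$ is taken as given rather than computed from (5.24).

```python
import math
import random

def inverse_gamma_mode(shape, scale):
    """Mode of an inverse-gamma density with the given shape and scale."""
    return scale / (shape + 1.0)

def inverse_gamma_logpdf(x, shape, scale):
    """Log density of IG(x; shape, scale) for x > 0."""
    return (shape * math.log(scale) - math.lgamma(shape)
            - (shape + 1.0) * math.log(x) - scale / x)

def propose_xi(xi_star, alpha_q, rng):
    """Draw xi' ~ IG(alpha_q, (alpha_q + 1) * xi_star), whose mode is xi_star."""
    scale = (alpha_q + 1.0) * xi_star
    # If G ~ Gamma(shape=alpha_q, scale=1) then scale / G is inverse-gamma distributed.
    return scale / rng.gammavariate(alpha_q, 1.0)

# Assumed values for illustration only.
xi_star, alpha_q = 12.0, 5.0
scale_q = (alpha_q + 1.0) * xi_star
rng = random.Random(1)
samples = [propose_xi(xi_star, alpha_q, rng) for _ in range(1000)]
```

Larger values of `alpha_q` concentrate the proposal around `xi_star`, which raises the acceptance rate at the cost of smaller moves, in line with the tuning remark above.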
The first frequency proposal involves computing a $K$-point DFT of $\mathbf{x}_m$ and defining a probability distribution function proportional to the magnitude in each frequency bin. This allows the algorithm to explore many parts of the spectrum rapidly. For the proposal to be reversible, the proposed frequency is assumed to be uniformly distributed within the range of frequencies $|p/K - \omega_m| < 1/2$ for the frequency bin $p$ with centre frequency $p/K$, i.e.,

$$Q\left(\omega_m; \mathbf{y}, \omega_{1:M\neg m}, \xi\right) \propto \prod_{p=0}^{K-1}\begin{cases} \left|\mathrm{DFT}[\mathbf{x}_m]\right|_p & \left|\frac{p}{K} - \omega_m\right| < \frac{1}{2} \\ 0 & \text{otherwise} \end{cases} \tag{5.25}$$

Figure 5.2: The result presented in (5.24) allows the conditional mode of the signal-to-noise parameter $\xi$ to be calculated without requiring inference of the amplitude parameters $\mathbf{b}$ or the noise variance $\sigma^2$, which reduces the computation required for each iteration of the MCMC algorithm, and slightly reduces the number of iterations required for convergence.
(a) The amplitudes of the Gabor basis functions must be simulated in order to sample from the posterior of the signal-to-noise ratio parameter. This panel presents the convergence of two amplitude parameters of the fundamental frequency, $b_1$ and $b_2$, over 200 MCMC iterations. The Markov chain has converged to the true posterior distribution in approximately 100 iterations. The convergence of the corresponding noise variance and signal-to-noise ratio parameters for this Markov chain is shown in Figure 5.2b.
(b) Convergence of the noise variance $\sigma^2$ and a comparison of the convergence of $\xi$ when inferred from the posterior distribution, or set to the mode $\xi^*$ (curves of $\log\sigma^2$, $\log\xi$ and $\log\xi^*$ over 200 MCMC iterations). As in Figure 5.2a, the simulated parameters converge in approximately 100 iterations, whereas the convergence to the mode is complete in around 75 iterations.

The second proposal is based on fitting a damped sinusoid to the residual signal, discarding the damping ratio and using the frequency obtained.
This proposal is used to make small adjustments to the partial frequencies to search for local maxima in the posterior distribution. We use the posterior distribution (5.18) which we derived for a damped sinusoid with process noise. In this MCMC context, we can use $\sigma_f^2$, the process noise variance, to control the acceptance rate, and the damping ratio $\rho_m$ is discarded. For simplicity, we also remove the prior parameters $\mu_m, \sigma_m^2$, giving

\[
Q(\omega_m; y, \omega_{1:M \neg m}, \xi) = \mathcal{N}\left( \omega_m;\;
\left( \sum_{t=1}^{N-1} x_m^2[t-1] \right)^{-1} \left( \sum_{t=1}^{N-1} x_m[t]\, x_m[t-1] \right),\;
\sigma_f^2 \left( \sum_{t=1}^{N-1} x_m^2[t-1] \right)^{-1} \right)
\tag{5.26}
\]

Alternatively we may use a standard random-walk proposal to search for a local maximum, especially in cases where the above damped sinusoid model is not appropriate, such as for sustained note instruments with significant amplitude and frequency modulations. The proposal distribution is a Gaussian distribution centred on the existing frequency with a small random-walk variance $\sigma_{RW}^2$, which Godsill and Davy [2005] suggest setting to $10^{-3}$:

\[
Q(\omega; \omega_m) = \mathcal{N}\left( \omega; \omega_m, \sigma_{RW}^2 \right)
\tag{5.27}
\]

5.4 Bayesian Inference using Reversible Jump MCMC

In the previous section we have developed two probabilistic signal models for musical signals. The state-space model is considered useful for musical instruments which may be approximately modelled by damped oscillators. The Gabor basis model is considered useful for bandlimited frequency and amplitude modulations such as those arising from vibrato. MCMC algorithms (Algorithms 5.1 and 5.2) have been derived for a fixed number $M$ of partial frequencies with conditionally independent priors $p(\omega_{1:M} \mid M)$. However, there are few situations where the number of partials and their frequencies are known a priori. In this section we consider how to infer the number of partials jointly with their frequencies. The most flexible method of Bayesian inference for this type of problem is reversible jump MCMC, which was first applied to Bayesian sinusoidal models in Andrieu and Doucet [1999].
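Before moving on to trans-dimensional moves, the two residual-based frequency proposals of the previous section, (5.25) and (5.26), can be sketched as follows. This is a rough sketch under several assumptions that are not taken from the thesis: frequencies in the DFT proposal are in cycles per sample and each bin is given half-width $1/(2K)$ so that the bins tile the axis; the damped sinusoid fit is applied to a complex (analytic) residual, so the products in (5.26) become conjugate products and the frequency is read off as the angle of the fitted coefficient; and the variance of (5.26) is reused directly on the angle scale. The function names are hypothetical.

```python
import numpy as np


def dft_bin_proposal(x_m, K, rng):
    """Global proposal in the spirit of (5.25).

    Picks DFT bin p with probability proportional to |DFT[x_m]|_p, then
    draws a frequency uniformly within that bin. Returns the proposed
    frequency (cycles per sample) and the bin probabilities.
    """
    mag = np.abs(np.fft.fft(x_m, n=K))
    probs = mag / mag.sum()
    p = rng.choice(K, p=probs)
    # Assumption: bin half-width of 1/(2K) around the centre p/K.
    return (p + rng.uniform(-0.5, 0.5)) / K, probs


def damped_sinusoid_proposal(x_m, sigma_f2, rng):
    """Local proposal in the spirit of (5.26) for a complex residual.

    Fits x[t] = a x[t-1] by least squares, discards the damping |a| and
    proposes a frequency near angle(a), in radians per sample.
    """
    prev, curr = x_m[:-1], x_m[1:]
    energy = np.sum(np.abs(prev) ** 2)
    a_hat = np.sum(curr * np.conj(prev)) / energy
    omega_hat = np.angle(a_hat)
    # Assumption: the variance of (5.26) is applied on the angle scale.
    omega_new = omega_hat + np.sqrt(sigma_f2 / energy) * rng.standard_normal()
    return omega_new, omega_hat
```

In a sampler the two would typically be mixed, with the DFT proposal making occasional large jumps and the damped sinusoid or random-walk proposal (5.27) refining the current frequency.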
Firstly we must specify a prior $p(M)$ on the overall number of partials. In the literature the prior on the number of partials is typically split into a hierarchical model, with a prior on the number of notes and a prior on the number of partials in each note. The contribution here is our general presentation of how to propose and accept changes to the numbers of partials and their frequencies, without referring to any specific move, and also applying reversible jump MCMC to the state-space model. Being able to propose changes to multiple partials is necessary for being able to rapidly explore the numerous combinations of harmonics and notes that arise in musical signals.

Algorithm 5.2 Metropolis-Hastings for the Gabor model
• Initialization
  - For $m = 1, \ldots, M$ sample $\omega_m^{(0)} \sim p(\omega_m)$
  - Sample $\xi^{(0)} \sim \mathcal{IG}(\alpha_\xi, \beta_\xi)$
• Iterations $i = 1, 2, \ldots$
  - Choose $m$ from $1, \ldots, M$ with equal probability $1/M$
  - Compute $x_m = y - D_{\neg m}^{(i-1)} D_{\neg m}^{\dagger (i-1)} y$
  - Select a proposal $Q(\omega_m; y, \omega_{1:M}^{(i-1)}, \xi^{(i-1)})$ from (5.25) to (5.27) and sample $\omega_m'$ from it
  - Draw $u$ from $\mathcal{U}(0, 1)$ and set $\omega_m^{(i)} \leftarrow \omega_m'$ if
    \[ u < \frac{p(y, D', \xi^{(i-1)})}{p(y, D^{(i-1)}, \xi^{(i-1)})} \, \frac{Q(\omega_m^{(i-1)}; y, \omega_{1:M}^{(i-1)}, \xi^{(i-1)})}{Q(\omega_m'; y, \omega_{1:M}^{(i-1)}, \xi^{(i-1)})} \]
  - Otherwise set $\omega_m^{(i)} \leftarrow \omega_m^{(i-1)}$
  - Compute $\xi^*$ using (5.24)
  - Sample $\xi'$ from $Q(\xi; D^{(i)}, y) \equiv \mathcal{IG}(\xi; \alpha_q, (\alpha_q + 1)\,\xi^*)$
  - Draw $v$ from $\mathcal{U}(0, 1)$ and set $\xi^{(i)} \leftarrow \xi'$ if
    \[ v < \frac{p(y \mid D^{(i)}, \xi')\, p(\xi')}{p(y \mid D^{(i)}, \xi^{(i-1)})\, p(\xi^{(i-1)})} \, \frac{Q(\xi^{(i-1)}; D^{(i)}, y)}{Q(\xi'; D^{(i)}, y)} \]
  - Otherwise set $\xi^{(i)} \leftarrow \xi^{(i-1)}$

5.4.1 Proposals and Acceptance Ratios

In the following expressions in this section, $\omega_{1:M}$ should be substituted with $a_{1:M}$ when using the state-space model. When using the Gabor model, all of the joint and proposal distributions are dependent on the current value of $\xi^{(i)}$. A proposal distribution for changing the number of partials and their frequencies would be expressed as follows

\[ Q\left( \omega_{1:M}, M \mid y, \omega_{1:M}^{(i-1)}, M^{(i-1)} \right) \]

however this proposal distribution is not reversible as the dimension of the model parameters has changed.
Instead we define a birth proposal, where the number of partials has increased, but the frequencies of the original partials have not changed, i.e., $M^* > M^{(i-1)}$ and

\[ \left\{ \omega^*_{(M^{(i-1)}+1):M^*}, M^* \right\} \sim Q_B\left( \omega_{(M^{(i-1)}+1):M}, M \mid y, \omega^{(i-1)}_{1:M^{(i-1)}}, M^{(i-1)} \right) \]

such that if the proposal is accepted, $M^* \to M^{(i)}$ and $\left\{ \omega^{(i-1)}_{1:M^{(i-1)}}, \omega^*_{(M^{(i-1)}+1):M^*} \right\} \to \omega^{(i)}_{1:M^{(i)}}$. This proposal must be accompanied by a death proposal, where a number of the original partials have been removed, and the remaining partial frequencies are unchanged, i.e., $M^* < M^{(i-1)}$ and

\[ M^* \sim Q_D\left( M \mid y, \omega^{(i-1)}_{1:M^{(i-1)}}, M^{(i-1)} \right) \]

such that if the proposal is accepted, $M^* \to M^{(i)}$ and $\omega^{(i-1)}_{1:M^{(i-1)}} \to \omega^{(i)}_{1:M^{(i)}}$. The Jacobian of the transform for the birth and death proposals equals one. The acceptance ratio for a birth proposal is given by the minimum of 1 and

\[ \frac{p\left( y, \omega^{(i-1)}_{1:M^{(i-1)}}, \omega^*_{(M^{(i-1)}+1):M^*}, M^* \right)}{p\left( y, \omega^{(i-1)}_{1:M^{(i-1)}}, M^{(i-1)} \right)} \, \frac{Q_D\left( M^{(i-1)} \mid y, \omega^{(i-1)}_{1:M^{(i-1)}}, \omega^*_{(M^{(i-1)}+1):M^*}, M^* \right)}{Q_B\left( \omega^*_{(M^{(i-1)}+1):M^*}, M^* \mid y, \omega^{(i-1)}_{1:M^{(i-1)}}, M^{(i-1)} \right)} \]

Similarly, the acceptance ratio for a death proposal is given by the minimum of 1 and the reciprocal of the above ratio:

\[ \frac{p\left( y, \omega^{(i-1)}_{1:M^*}, M^* \right)}{p\left( y, \omega^{(i-1)}_{1:M^{(i-1)}}, M^{(i-1)} \right)} \, \frac{Q_B\left( \omega_{(M^*+1):M^{(i-1)}}, M^{(i-1)} \mid y, \omega^{(i-1)}_{1:M^*}, M^* \right)}{Q_D\left( M^* \mid y, \omega^{(i-1)}_{1:M^{(i-1)}}, M^{(i-1)} \right)} \]

Before or after any MCMC move, a complete reordering of the partial frequencies $\omega_{1:M}$ is permitted, so that different combinations of partials may be created or destroyed.

5.4.2 Prior Model

For the remainder of this chapter we adopt the hierarchical prior used by Godsill and Davy [2005] for the partial frequency structure of the music. A short segment of music has $K$ notes, where the unknown number of notes is distributed according to a truncated Poisson distribution:

\[ p(K = k \mid \lambda_K) = \frac{\lambda_K^k / k!}{\sum_{k'=K_{MIN}}^{K_{MAX}} \lambda_K^{k'} / k'!} \tag{5.28} \]

The hyperparameter $\lambda_K$ is distributed according to a Gamma distribution $p(\lambda_K) = \mathcal{G}(\lambda_K; \alpha_K, \beta_K)$ where $\alpha_K = 1$ and $\beta_K = 2$ are set such that $p(\lambda_K)$ has infinite variance.
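As a small worked illustration of the truncated Poisson prior (5.28), the following sketch evaluates the probability mass for a given rate and support. The function name and the example values of the rate and limits are illustrative assumptions, not values taken from the thesis.

```python
import math


def truncated_poisson_pmf(k, lam, k_min, k_max):
    """Evaluate (5.28): a Poisson pmf renormalised to k_min..k_max."""
    if k < k_min or k > k_max:
        return 0.0
    # The common factor exp(-lam) cancels between numerator and
    # denominator, so only lam**j / j! is needed.
    weights = [lam ** j / math.factorial(j) for j in range(k_min, k_max + 1)]
    return (lam ** k / math.factorial(k)) / sum(weights)


# Example: between 1 and 4 simultaneous notes with rate 2 (assumed values).
probabilities = [truncated_poisson_pmf(k, 2.0, 1, 4) for k in range(0, 6)]
```

With these example values the unnormalised weights for $k = 1, \ldots, 4$ are $2, 2, 4/3, 2/3$, which sum to 6, so one note and two notes each receive probability $1/3$.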
It is often reasonable within the context of polyphonic music transcription to assume that the minimum and maximum number of notes sounding at the same time is known in advance (for example a four-part fugue or chorale). Each note $k = 1, \ldots, K$ has $M_k$ partials, where $M_k$ is distributed according to the same truncated Poisson distribution (5.28). The minimum number of partials is set to 2 in the prior work, and the maximum number of partials is set to either 30 or the limit $\omega_s / 2\omega_0$, which is the number of partials permitted by the sampling frequency. Here we have set the minimum number of partials to 1 and allowed as many partials as the sampling frequency permits. The sampling of the hyperparameters for the posterior distributions of the number of notes and partials is identical to that of Andrieu and Doucet [1999]. The prior distribution for frequencies $p(\omega_{1:M_k} \mid M_k)$ is given by the following hierarchical model

\[ p(\omega_{1:M_k} \mid M_k) = p(\omega_0) \prod_{k=1}^{M_k} p(\delta_k), \qquad \omega_k = k\,\omega_0\,(1 + \delta_k), \qquad p(\delta_k) = \mathcal{N}\left( 0, \sigma_\delta^2 \right) \]

with $\sigma_\delta^2 = 3 \times 10^{-8}$. $p(\omega_0)$ is set to be uniform, with limits set to be a semitone below and above the minimum and maximum MIDI frequencies.

5.4.3 Examples of Reversible Moves

The following moves, designed for inferring the harmonic structure of polyphonic music, are taken from Godsill and Davy [2005] for the prior model described in the previous section.

5.4.3.1 n-increase/decrease

This pair of moves is designed to estimate the number of harmonics present in the signal for a single note. For the n-increase move, a new set of $n$ partial frequencies is proposed in harmonic positions above the highest existing harmonic for that note. For the n-decrease move, the highest $n$ partial frequencies for that note are deleted. The number of possible harmonics is limited to between 1 and $\omega_s / 2\omega_0$, where $\omega_s$ is the sampling frequency and $\omega_0$ is the fundamental frequency of the note.
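The hierarchical frequency prior of Section 5.4.2 and the harmonic limit $\omega_s / 2\omega_0$ used by the n-increase and n-decrease moves can be illustrated with a short sketch. The function names are hypothetical, frequencies are expressed in Hz for convenience, and rounding the limit down to an integer is an assumption of this sketch.

```python
import numpy as np

SIGMA_DELTA2 = 3e-8  # variance of the relative detuning, as quoted in the text


def max_harmonics(f_s, f_0):
    """Largest harmonic number below the Nyquist frequency, f_s / (2 f_0)."""
    return int(np.floor(f_s / (2.0 * f_0)))


def sample_partials(f_0, n_partials, rng, sigma_delta2=SIGMA_DELTA2):
    """Draw partial frequencies k * f_0 * (1 + delta_k), delta_k ~ N(0, sigma^2)."""
    k = np.arange(1, n_partials + 1)
    delta = rng.normal(0.0, np.sqrt(sigma_delta2), size=n_partials)
    return k * f_0 * (1.0 + delta)


# Example: a 440 Hz note at the 11,025 Hz sampling rate used in Section 5.5.
rng = np.random.default_rng(0)
n_max = max_harmonics(11025.0, 440.0)
partials = sample_partials(440.0, n_max, rng)
```

Because $\sigma_\delta$ is roughly $1.7 \times 10^{-4}$, sampled partials sit within a fraction of a hertz of exact harmonic positions for low harmonics, and the absolute spread grows linearly with the harmonic number.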
5.4.3.2 double/halve frequency

This pair of moves is designed to explore octave ambiguities and errors in the pattern of harmonics in the signal. The double move doubles the fundamental frequency of a note by removing the odd-numbered partials of that note. The halve move halves the fundamental frequency of a note, keeping the existing partial frequencies, but assigning them to twice the original harmonic position. The missing partials are added in the odd-numbered harmonic positions.

5.4.3.3 note birth/death

This pair of moves is designed to estimate the number of notes in the signal. The birth move adds a new note, with a fundamental frequency and a new set of harmonics. The death move deletes a note with all of its partial frequencies.

5.5 Results

In this chapter we have extended and developed two existing models for musical signal analysis, and described a reversible jump MCMC framework to infer the number of partials in a small segment of music, and their respective partial frequencies. To put this work into perspective, in this section we use the Bayesian prior model of Davy et al. [2006] for polyphony and harmonic notes to infer the number of notes and their fundamental frequencies in short segments of musical chords. The objective of this section is to quantitatively evaluate the effect of the model enhancements suggested in this chapter on polyphonic music transcription, using the prior work as a baseline. In Davy et al. [2006], multiple F0 estimation results are presented for a set of recorded note mixtures of a variety of musical instruments from the McGill database (see footnote 2). Each signal is downsampled from 44,100 Hz to 11,025 Hz, and the first 544 ms is used to estimate the fundamental frequencies present, assuming the number of notes is known a priori.
We use the same experimental setup, running the reversible jump MCMC algorithm once for 800 iterations, and using only the final 100 samples to estimate the fundamental frequencies and the frequencies of the partials.

5.5.1 Performance on Monophonic Extracts

When evaluating model-based polyphonic music transcription systems, it is implicitly expected that the system should accurately detect the fundamental frequency of an isolated note in a monophonic extract, and be resilient to octave errors, as this capability already exists in simpler detection systems based on autocorrelation function or harmonic transform analysis. We have found it extremely useful to study how the partial frequencies are detected by a model, and have found that significant differences exist in the performance of the different models suggested in this chapter. We choose to study monophonic extracts, as the actual number of partials existing in the signal can be estimated easily by hand by studying the periodogram of the entire monophonic extract. If a model is able to successfully detect a higher number of partials, this indicates that the model is able to capture more of the harmonic structure of a note, and should therefore be able to distinguish better between ambiguous combinations of notes that degrade the performance of polyphonic music transcription systems. In this section, we present partial estimation results for different model choices. In each case, we have modified one aspect of the model, and ensured that all other model and algorithm parameters are constant.
A ground truth for the number of partials was prepared by a human observer from the periodogram with prior knowledge of the instrument and the fundamental frequency of the note. For each extract, we record the number of partials detected by the model compared to the ground truth number of partials, and express the ratio as a percentage. The partial estimation results are grouped by musical instrument, and the percentage of partials detected is averaged over the extracts for that instrument (see footnote 3). The results are presented in Table 5.1 on page 74 and Table 5.2 on page 75 for a variety of instruments over a range of pitches from the McGill database.

The first set of results compares choosing to apply the model to the original real signal or to the Hilbert transform of the signal. In Table 5.1 on page 74 we use the Gabor model with Hamming window basis functions, and compare the number of partials modelled using the original signal and using its Hilbert transform. It is clear from these results that applying the Hilbert transform, which reduces some of the modelling ambiguity in the sinusoidal representation of the signal, increases the number of partials correctly estimated from the note.

                                  Average percentage of partials detected
Instrument     Number of Extracts   Real Signal   Analytical Representation
Violin         31                   37.3          73.3
French Horn    36                   50.5          82.5
Oboe           40                   48.2          80.8
Flute          29                   47.5          77.5
Trumpet        26                   51.0          81.0
Clarinet       34                   47.7          78.5
Viola          32                   40.0          66.7
Piano          64                   47.2          74.1
Guitar         48                   44.7          76.6
TOTAL          340                  46.1          76.6

Table 5.1: Partial estimation results, comparing the average percentage of partials detected from the real signal and the analytical representation using the Gabor model with Hamming basis functions

Footnote 2: http://www.music.mcgill.ca/resources/mums/html/MUMS_dvd.htm

The second comparison we make is between the choice of basis function, or the use of the state space model, when applied to the analytical representation of the signal.
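For readers unfamiliar with the analytical representation compared in Table 5.1, one common way to compute it is sketched below. This is the standard FFT construction (zeroing the negative-frequency half of the spectrum and doubling the positive half); it is offered as an assumption about how such a representation can be obtained, not as the implementation used for the reported results, and the function name is hypothetical.

```python
import numpy as np


def analytic_signal(x):
    """Return the complex analytic signal of a real sequence x.

    The real part equals x and the imaginary part is its discrete
    Hilbert transform.
    """
    n = len(x)
    spectrum = np.fft.fft(x)
    gain = np.zeros(n)
    gain[0] = 1.0                      # keep the DC term once
    if n % 2 == 0:
        gain[n // 2] = 1.0             # keep the Nyquist term once
        gain[1:n // 2] = 2.0           # double the positive frequencies
    else:
        gain[1:(n + 1) // 2] = 2.0
    return np.fft.ifft(spectrum * gain)


# Example: a cosine with a whole number of cycles maps to a complex exponential.
t = np.arange(256)
z = analytic_signal(np.cos(2 * np.pi * 8 * t / 256))
```

Each real sinusoid then appears as a single complex exponential rather than a pair of conjugate components, which is the reduction in modelling ambiguity referred to above.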
Partial estimation results are presented in Table 5.2 on page 75, comparing the number of partials modelled using the Hamming, Gaussian and sinc basis functions and the state-space model. The difference in the number of partials correctly estimated is overall much less dramatic. The Gaussian basis and the state-space model perform slightly worse than the Hamming and sinc bases, which themselves mostly return a consistent number of partials. However, as motivated in Section 5.2, the performance for certain types of instruments can be improved by the choice of model. For violin and viola notes, which are played with substantial vibrato, the sinc basis model detects more partials than the other models. The higher order partials hence have a high spread in spectral energy in frequency, and it appears that only the sinc basis model is correctly detecting and modelling these frequency modulations. This can be explained by the explicit bandwidth constraint on frequency and amplitude modulations when using a sinc basis.
For piano and guitar notes, which are played percussively and allowed to decay over time, the state-space model, with a constant damping ratio across the length of the note, detects nearly all of the partials of these notes.

                                  Average percentage of partials detected
Instrument     Number of Extracts   Hamming   Sinc   Gaussian (α = 2.5)   State-space
Violin         31                   73.3      84.4   59.3                 59.8
French Horn    36                   82.5      83.9   69.2                 70.2
Oboe           40                   80.8      74.7   69.6                 68.1
Flute          29                   77.5      79.9   67.9                 70.9
Trumpet        26                   81.0      76.5   70.3                 60.8
Clarinet       34                   78.5      80.0   68.4                 71.9
Viola          32                   66.7      78.3   61.7                 61.3
Piano          64                   74.1      73.2   70.3                 90.7
Guitar         48                   76.6      75.4   69.2                 90.4
TOTAL          340                  76.6      77.8   67.7                 74.4

Table 5.2: Partial estimation results for different instruments, comparing Hamming, sinc and Gaussian basis functions and the state-space model with constant damping ratio, on the analytical representation of the signal

Footnote 3: As an illustration, if we have two oboe extracts, the first having 10 partials of which 6 were detected by the model (60%), and the second having 12 partials of which 9 were detected by the model (75%), then we record the average percentage of partials detected for this model and instrument as 67.5%.

To conclude, we have found that using the Hilbert transform to calculate the analytical representation of a signal, and using this representation for modelling, increases the number of partials that are detected and modelled correctly in single harmonic notes. The additional computation required to compute the transform is small, and we expect this to increase polyphonic music transcription accuracy by capturing more higher order structure in harmonic notes. We have also shown that in most cases there is little difference in the partial estimation performance of different basis functions, or of the state-space model. Therefore, we suggest that, unless there is appreciable vibrato or other modulations, Hamming basis functions should be used, as they have finite support and it is therefore more efficient to compute and invert the basis.
In cases where the modulation bandwidth is known a priori and is sufficiently large, sinc basis functions are attractive as we have observed that higher order partials are modelled correctly, whereas other basis functions will typically either split a higher order partial into two adjacent frequencies, or fail to detect the frequency. For instruments with an approximately constant damping ratio over the length of the note, we have shown that the state-space model defined in this chapter is appropriate for modelling the higher order partials of the signal.

5.5.2 Multiple F0 Estimation

In this section, we present results for polyphonic transcription where the number of notes is known a priori. This is a limited application, as this situation is rare in practice. Similar implementations of Bayesian inference for these harmonic models overestimate the number of notes, and have to resort to a somewhat unsatisfactory heuristic to determine whether the energy of a modelled note is too low to be significant. In our research with these models, we have experienced the same problem; however, we have also determined that the difficulty does not lie with the model, but with the estimation accuracy of the partial frequencies. We have made the following observations:

1. The local maximum of the likelihood of a single partial frequency also minimizes the residual signal in the vicinity of that frequency. Correctly minimizing the residual of the signal around a partial frequency reduces the chance of spurious partial detections to either side of the original frequency, which would otherwise lead to duplicate notes being estimated.

2. Due to frequency and amplitude modulations in the partial, the maximum likelihood frequency may differ by up to 3 Hz from the ideal harmonic frequency for theoretically harmonic instruments, and by a comparable amount from the local maxima of the periodogram estimator. This means that global frequency proposals based on these may produce inaccurate initial frequency estimates.

3. The difference between an initial frequency estimate and the maximum likelihood frequency is too large for the constant variance random walk MH proposal distribution used in our algorithm to both arrive near the optimum and accurately estimate its value.

We illustrate these observations in Figure 5.3 on page 77 for a single partial frequency estimated firstly from the local maxima of the periodogram, and then by minimizing the residual of the signal. We obtained some improvements by adapting the random walk MH proposal to progressively reduce its variance, so that larger jumps in frequency were followed by smaller steps to explore the local maxima. It was also necessary to reduce the ratio of global to local MH proposals so that enough time was given to search for the maximum before losing the current frequency found in a global jump. However, this also increased the computation time, and restricted the algorithm's ability to explore the entire spectrum using global proposals. In the next chapter, we show the benefits of estimating the partial frequencies to high accuracy, and how it can be used to effectively estimate the number of notes.

                                          Notes in Mixture
Evaluation Metric   Model                 1     2      3      4
% octave error      Gabor                 0     2.8    11.1   10.2
                    Davy et al. [2006]    0     10.3   17.8   9.3
                    Klapuri [2008]        0     13.6   19.0   22.2
% pitch error       Gabor                 0     8.3    15.6   18.6
                    Davy et al. [2006]    0     5.1    7.2    19.7
                    Klapuri [2008]        0     1.4    6.0    10.3
% total error       Gabor                 0     11.1   26.7   28.8
                    Davy et al. [2006]    0     15.4   25.0   29.0
                    Klapuri [2008]        0     15.0   25.0   32.5

Table 5.3: Polyphonic pitch estimation using the Bayesian harmonic model. The total error for the system presented in this chapter is split into errors when the pitch estimated is an octave above or below the ground truth, and errors when the estimated pitch is not an octave error.
As future lines of research using these Bayesian harmonic models with reversible-jump MCMC, we suggest either using a Hamiltonian Monte Carlo scheme, which uses derivative information in the likelihood to arrive at the local maxima more quickly, or using the numerical methods described in Chapter 6 as the basis of a proposal distribution, balancing this correctly with global proposals.

[Figure 5.3(a): periodogram of the signal, the partial estimate and the residual, plotted against frequency from 400 to 600 Hz.]
(a) Estimate of the partial frequency from the maxima of the periodogram. The residual of the resulting signal has two prominent peaks on either side of the detected partial frequency. Subsequent iterations of the algorithm will identify these peaks as additional partials, whereas these peaks are in reality due to amplitude and frequency modulations in the original partial.

[Figure 5.3(b): periodogram of the signal, the partial estimate and the residual, plotted against frequency from 400 to 600 Hz.]
(b) Maximum likelihood estimation of a partial frequency using the signal model. This estimate minimizes the residual of the signal. The deviation between the peak of the periodogram and the maximum likelihood estimate is 2.2 Hz.

Figure 5.3: Comparison of the residual signal when using a periodogram estimate and maximum likelihood estimate of the frequency of a partial of a musical note. The estimated frequency is marked on the periodogram estimate of the signal.

To complete this chapter we present the estimation results where the number of notes is known, so that they can be compared with other transcription systems using this metric. To compare objectively against prior work, we use the data set of Davy et al. [2006], which consists of 20 short monophonic signals and 20 short polyphonic signals for each of 2, 3 and 4 note mixtures, taken from a limited set of instruments from the McGill database.
As this set of notes is only for sustained note instruments and does not include piano or guitar notes, we only use the Gabor model with Hamming basis functions for our comparison. Additionally we have prepared an implementation of the auditory model based system of Klapuri [2008] as a state-of-the-art polyphonic transcription system to compare against. We present our results in Table 5.3 on page 76. The results show an improvement in transcription accuracy for two note mixtures when using the analytical representation, and we suggest that this is due to the improved estimation of higher partial frequency structure demonstrated in the previous section. However, for three and four note mixtures, the performance is not appreciably different to the prior work of Davy et al. [2006], although there is an improvement in accuracy over the auditory model system for 4 note mixtures. We suggest that, as the inference algorithm is mostly identical, it is the inference algorithm that is limiting the performance in these cases. During this work, we took the opportunity to study the reversible jump MCMC algorithm in progress, observing the current state of the fitted model after each iteration. We observed that in many of the situations where transcription errors occurred, the algorithm did reach the correct configuration of notes at some point, but was not able to sustain this state due to inaccuracies in the frequency estimates.

5.6 Conclusion

In this chapter we have described and developed signal models for pitched, harmonic instruments from first principles. Our primary motivation has been to investigate the modelling of different forms of frequency and amplitude variations throughout the length of a musical note with a nominally constant frequency. A damped amplitude envelope, where the oscillations decay exponentially in time, was found to fit naturally in a state-space formulation, appropriate for percussive instruments such as the piano and guitar.
A limit on the bandwidth of frequency and amplitude modulations of the note was modelled intuitively as a Gabor basis, where the window function was sinc in shape, appropriate for held-note instruments such as the bowed string and woodwind families.

The formulation of these models was deliberately chosen so that well-known and understood Bayesian priors and inference algorithms could be applied, complementing and building on existing work. The improvements in this chapter, as they are in the context of these models, may therefore be conversely applied to the original models, algorithms and applications which inspired them. For example, choosing to model the analytical representation of the observed signal revealed some improvements to the inference algorithms for these models. The damped envelope model, which can be treated as a linear dynamical system when adding Gaussian noise to the state and the observation processes, allows the posterior distribution of the partial frequencies and damping ratio to be computed in closed form if the prior is Gaussian, or otherwise a good proposal distribution to be constructed for an MCMC inference scheme. The Gabor basis model is a general linear model, with a well studied prior structure for the basis coefficients given the frequencies. We were able to derive the mode of the signal-to-noise ratio parameter in this model, which means that the amplitudes and noise variance parameters do not need to be simulated in an MCMC scheme. This reduces the computational cost and complexity of the algorithm required, and also reduces the dimensionality of the target posterior distribution.

The models and the inference algorithms developed for them were then applied to monophonic and polyphonic transcription problems. In the case of single notes playing, we focused on how many of the harmonic partial frequencies present were detected and modelled.
We saw that using the analytic representation of the signal resulted in significantly more partials being detected, and conclude that the reduction in the ambiguity of instantaneous phase and amplitude afforded by this representation is beneficial for signal model based inference methods. We also present polyphonic transcription results for the case where the number of notes is known, and compare with prior work. We found that the new models improve transcription performance for two-note mixtures when compared to prior work, but the performance for more complicated mixtures is limited by the inference algorithm's ability to estimate the partial frequencies accurately enough that spurious partial detections are avoided. The benefits of increasing the accuracy of partial frequency estimation are shown in the next chapter, where we simplify the polyphonic inference algorithm to a two-stage process, estimating the partial frequencies first and inferring the harmonic structure second, to improve transcription accuracy for higher numbers of notes, and also to correctly estimate the number of notes playing in the mixture.

Chapter 6 Multiple Pitch Estimation using Non-homogeneous Poisson Processes

Point estimates of the parameters of partial frequencies of a musical note are modelled as realizations from a non-homogeneous Poisson process defined on the frequency axis. When several notes are combined, the processes for the individual notes combine to give a new Poisson process whose likelihood is easy to compute. This model avoids the data association step of linking the harmonics of each note with the corresponding partials and is ideal for efficient Bayesian inference of unknown multiple fundamental frequencies in a polyphonic mixture of notes.
6.1 Introduction

By observing the periodogram of a polyphonic mixture of notes, a trained observer can estimate the partial frequencies present in the signal from the localized peaks in the spectrum, and then suggest fundamental frequencies by observing that some of the partial frequencies are regularly spaced along the frequency axis. For example, peaks in the spectrum at 440, 880, 1320 Hz and so on suggest a fundamental frequency of 440 Hz. In the author's experience, using the periodogram to transcribe mixtures of notes is more reliable and quicker than listening to the mixture. This method also outperforms automated transcription systems such as the signal models described in the previous chapter and state-of-the-art auditory systems, especially in avoiding the octave errors which plague other systems. One of the goals of this chapter is to investigate and propose models for this method in order to improve the accuracy of polyphonic transcription.

Two assumptions about the transcription process are made. The first is that the observer does not change his or her estimates of the partial frequencies when attempting to find a set of notes which fits the observations. In plain terms, the observer is trying to fit a model to the observations, incorporating errors in the partial frequency estimates into the prior, rather than fitting the observations to the model. This motivates a two-stage process where the partial frequencies are estimated first, and then a harmonic model is fitted to the frequencies. A prior model on the partial frequencies is still required however, as the observer may know, for example, the range of fundamental frequencies that can be produced by the instrument, but this prior must also be defined when the number of notes in the mixture is not known.
The second assumption is that the spectral shape in the vicinity of a peak is important to the estimation of partial frequencies, whereas only the frequencies and sometimes the amplitudes of the partials are required for transcription. The spectral shape sometimes allows us to distinguish between merged harmonics of two or more notes. There are various cases where simply picking peaks of the spectrum above an adaptive noise floor is inadequate, and these cases are often the cause of transcription errors. The notes of chords in music often have overlapping harmonics, which may not be manifested as separate peaks but to the observer are obvious because of differences in spectral shape. The spectral shape also helps distinguish between noise or artifacts in the signal and genuine partial frequencies, reducing spurious detections of partials which can lead to over- or under-reporting of the number of notes playing. We will use an explicit signal model with a prior on the expected spectral shape of harmonic notes to accurately estimate partial frequencies.

We do not assume that the partial estimation procedure is perfect however, and therefore need a transcription system which is capable of dealing both with missed and duplicated partial detections. The solution we present in this chapter is to use an iterative algorithm based on the signal model presented in the previous chapter to provide high quality estimates of the partial frequencies, and to model the prior on the frequency estimates as a non-homogeneous Poisson process. Choosing to use a signal model rather than a heuristic estimation scheme for the partial frequency estimation is advantageous as present and future improvements to that model will also benefit the estimation procedure here. However, it is also permissible to use other methods to estimate the partial frequencies, as was carried out previously using periodogram peak picking [Peeling et al., 2007b] and subspace methods [Peeling et al., 2007a].
In these cases, the prior on the frequencies needs to reflect the estimation procedure, for example including a uniform clutter process across the frequency axis if many spurious partials are detected.

The structure of this chapter is as follows. In Section 6.2 we introduce the properties of non-homogeneous Poisson processes and how to calculate the likelihood given a set of observed frequencies. In Section 6.3 priors for harmonic models are discussed, and suggestions for how these priors should be modified for different partial estimation methods are given. In Section 6.4 a general method for making partial estimates from a signal model is presented. Transcription results for polyphonic mixtures of notes are presented in Section 6.5 and are compared with the previous chapter and prior work. Conclusions and suggestions for future research are given in Section 6.6.

6.2 Non-homogeneous Poisson Processes

In this section we define and describe a non-homogeneous Poisson process model [Cox and Isham, 1980]. A homogeneous Poisson process is a stochastic process, defined usually over time, where the number of events $N(b) - N(a)$ occurring between time $a$ and time $b$ has a Poisson distribution with rate parameter $\lambda$:

\[ P\left( N(b) - N(a) = k \mid \lambda \right) = \frac{\exp\left( -\lambda (b - a) \right) \left( \lambda (b - a) \right)^k}{k!} \]

$\lambda$ is the expected number of events per unit of time, and is constant for a homogeneous Poisson process. A non-homogeneous Poisson process generalizes this by allowing the rate parameter to vary with time. The principal ideas behind the model are explained by considering a model based solely on the frequency estimates of multiple partials in a musical signal.

[Figure 6.1: plot of $P(k)$ against $k$ for $k$ = 0 to 10.]
Figure 6.1: Probability mass function for the Poisson distribution

6.2.1 Frequency-Domain Process

We define the number of partial estimates as a non-homogeneous Poisson process on the frequency axis. Let $N(f)$ be the number of partial estimates observed in the frequency range $(0, f]$.
We assume that the number of partial estimates in a particular interval $(a, b]$ of the frequency axis has a Poisson distribution with parameter $\lambda_{a,b}$. The number of partials is given by $N(b) - N(a)$ and has probability distribution

$$P\left(N(b) - N(a) = k \mid \lambda_{a,b}\right) = \frac{\exp\left(-\lambda_{a,b}\right) \left(\lambda_{a,b}\right)^{k}}{k!} \qquad (6.1)$$

We interpret $\lambda_{a,b}$ as the expected number of partials occurring in $(a, b]$. Figure 6.1 on page 82 shows the probability mass function for the Poisson distribution in (6.1) with $\lambda_{a,b} = 1$, so that we expect to observe one partial in the region $(a, b]$. This region could for instance be a DFT bin at a harmonic frequency of a musical note. The probability masses for observing zero partials and one partial in the region are equal when $\lambda_{a,b} = 1$.

Under the assumptions of a Poisson process, we write $\lambda_{a,b}$ in terms of a continuous rate function $\lambda(f)$:

$$\lambda_{a,b} \equiv \int_{a}^{b} \lambda(f)\, df$$

The rate function $\lambda(f)$ of the Poisson process describes the expected concentration of partial frequencies along the frequency axis. For a harmonic musical note, we would expect the rate function to be large around the fundamental frequency and harmonics of the note, and small but non-zero elsewhere to allow for spurious partial detections and transient effects.

For (6.1) to be valid for any values of $a$ and $b$, there are two requirements. First, no two estimates may have exactly the same frequency. The signal model or partial estimation scheme used to observe partial positions should not provide estimates with exactly the same frequency; there must be a non-zero interval between successive frequencies. It is a property of the signal models we use in Section 6.4 that two partials will never be estimated with exactly the same frequency, as this would lead to the basis functions being linearly dependent. Two basis functions with the same frequency may always be combined into a single basis function.
The second requirement is that the process is memoryless: the probability of a number of partials occurring in any region of the frequency axis must be independent of the occurrence of partials in any other region disjoint with that region (see footnote 1). This requires that $\lambda(f)$ contains all of the prior information about the occurrence of partials.

Modelling the occurrence of partials as a Poisson process makes the model robust to missing or duplicate partial detections. Harmonic models such as those described in 5.4.2 require the existence of a single partial frequency in every harmonic position modelled, and therefore an entire note may not be detected due to a single missing partial frequency.

6.2.2 Superposition

One of the key attractions of using a Poisson process to model partial estimates is that the observation of multiple Poisson processes superimposed on the same axis is also a Poisson process. Moreover, the rate function of the combined process is the sum of the individual rate functions. Formally we have $M$ Poisson processes $N_1(f), \ldots, N_M(f)$ with rate functions $\lambda_1(f), \ldots, \lambda_M(f)$, and we observe $N_{1:M}(f) = \sum_{m=1}^{M} N_m(f)$. Then

$$P\left(N_{1:M}(b) - N_{1:M}(a) = k \mid \lambda^{(1:M)}_{a,b}\right) = \frac{\exp\left(-\lambda^{(1:M)}_{a,b}\right) \left(\lambda^{(1:M)}_{a,b}\right)^{k}}{k!}, \qquad \lambda^{(1:M)}_{a,b} = \int_{a}^{b} \sum_{m=1}^{M} \lambda_m(f)\, df \qquad (6.2)$$

Note that in observing $N_{1:M}(f)$ we lose labeling information, i.e., which Poisson process $m$ each partial was generated by. This makes the likelihood (6.2) easy to compute. Inferring the actual labels of the partials, for example in a source separation setting, cannot be carried out using the superimposed process alone; however, the labels may be inferred in a probabilistic manner using a likelihood function based on the individual rate functions for each note.

6.2.3 Evaluation of Likelihood

In this section we consider how to evaluate the likelihood of the occurrence of the entire set of observed partial positions.
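As a preliminary to the likelihood calculations, the interval probability (6.2) for superimposed processes can be sketched numerically. The following is a minimal illustration only: the rate functions `note_rate` and `clutter_rate`, their parameter values, and the trapezoidal integrator are hypothetical choices made for this example and are not taken from the models used later in the chapter.

```python
import math

def integrate(rate, a, b, steps=2000):
    """Trapezoidal approximation to the integral of rate(f) over (a, b]."""
    h = (b - a) / steps
    total = 0.5 * (rate(a) + rate(b))
    for i in range(1, steps):
        total += rate(a + i * h)
    return total * h

def poisson_pmf(k, lam):
    """P(N = k) for a Poisson count with expected value lam, as in (6.1)."""
    return math.exp(-lam) * lam ** k / math.factorial(k)

def superposed_interval_prob(k, rates, a, b):
    """Probability (6.2) of k partials in (a, b] for superimposed processes."""
    lam = sum(integrate(r, a, b) for r in rates)  # rates add under superposition
    return poisson_pmf(k, lam)

# Hypothetical rate functions: a Gaussian bump near 440 Hz plus uniform clutter.
def note_rate(f):
    sd = 5.0
    return 0.9 * math.exp(-0.5 * ((f - 440.0) / sd) ** 2) / (sd * math.sqrt(2.0 * math.pi))

def clutter_rate(f):
    return 1e-3

p0 = superposed_interval_prob(0, [note_rate, clutter_rate], 400.0, 480.0)
p1 = superposed_interval_prob(1, [note_rate, clutter_rate], 400.0, 480.0)
```

Here the expected count in the interval is the sum of the two integrated rates, so no labels identifying which process produced each partial are needed to evaluate the probability.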
Although we would naturally try to calculate the likelihood exactly, the method we choose depends on how we observe the Poisson process. In this section, three methods are given for evaluating the likelihood. The exact method in 6.2.3.1 should be applied when a signal model is used to estimate the partial frequencies. The binning method in 6.2.3.2 is suitable when a periodogram peak picking method is employed. If the peak picking method by design only detects zero or one peaks in each frequency bin, the calculation should be modified to allow for the possibility that more than one partial frequency was present in the bin. In this case, the method in 6.2.3.3 is appropriate.

Footnote 1: This does not contradict the previous requirement that partial occurrences may not have the same frequency, as identical partial frequencies cannot be mapped to disjoint regions of the frequency axis.

6.2.3.1 Exact Calculation

When the partial estimates are known with sufficient accuracy, and their frequencies are distinct, the likelihood of the occurrence of frequencies $f_1, f_2, \ldots, f_N$ under a non-homogeneous Poisson process $\lambda(f)$ on the frequency axis between $0$ and $f_s/2$, where $f_s$ is the sampling frequency, is given by Crowder et al. [1991], Meeker and Escobar [1998] as

$$p\left(f_1, f_2, \ldots, f_N, N \mid \lambda(f)\right) = \exp\left(-\int_{0}^{f_s/2} \lambda(f)\, df\right) \prod_{n=1}^{N} \lambda(f_n) \qquad (6.3)$$

The derivation of the above likelihood is informally obtained firstly by noting that in the interval between observed frequencies $f_n$ and $f_{n+1}$ there are no observations. Hence, using (6.2) and substituting $k = 0$, each such interval has probability $\exp\left(-\lambda_{f_n, f_{n+1}}\right) = \exp\left(-\int_{f_n}^{f_{n+1}} \lambda(f)\, df\right)$. At each observed frequency $f_n$, the probability, using (6.2), of observing $k = 1$ is given by $\lambda(f_n)$. We also take into account that no frequencies were observed in the intervals $[0, f_1)$ and $(f_N, f_s/2]$.
As a Poisson process requires that the observations in disjoint intervals of the frequency axis must be independent, we simply combine the probabilities of these observations by multiplying them, thus:

$$p\left(f_1, f_2, \ldots, f_N, N \mid \lambda(f)\right) = \exp\left(-\int_{0}^{f_1} \lambda(f)\, df\right) \exp\left(-\int_{f_N}^{f_s/2} \lambda(f)\, df\right) \times \prod_{n=1}^{N-1} \exp\left(-\int_{f_n}^{f_{n+1}} \lambda(f)\, df\right) \times \prod_{n=1}^{N} \lambda(f_n)$$

$$= \exp\left(-\int_{0}^{f_s/2} \lambda(f)\, df\right) \prod_{n=1}^{N} \lambda(f_n)$$

6.2.3.2 Binning

The likelihood when observations are grouped into non-overlapping regions (bins) of the frequency axis may be calculated as follows. Assume we have $F$ such bins, spanning frequency intervals $A_1, \ldots, A_F$, and denote the number of observations in each bin by $N_f$. We then have, by the independence of intervals in a Poisson process,

$$P\left(N_1, \ldots, N_F \mid \lambda_1, \ldots, \lambda_F\right) = \prod_{f=1}^{F} P\left(N_f \mid \lambda_f\right) = \prod_{f=1}^{F} \frac{\exp\left(-\lambda_f\right) \left(\lambda_f\right)^{N_f}}{N_f!} \qquad (6.4)$$

$$\lambda_f = \int_{A_f} \lambda(f)\, df$$

The advantage of this method over the exact calculation method is that the rate parameter $\lambda_f$ may be computed in advance for each bin $f$ before the partial frequencies are estimated, which reduces the computation required when evaluating the likelihood for multiple frames of music. Often the bins will coincide with the frequencies of the DFT used to estimate the partial frequencies.

6.2.3.3 Censored Frequencies

The partial estimation method used may only indicate whether or not there is a partial in a frequency bin. An example is a single step peak picking scheme, which selects all the spectrum bins with amplitudes larger than neighbouring bins and above a noise threshold. It is possible that multiple frequencies are present within the region of the frequency axis covered by a single observation bin, for example in the case of overlapping harmonics. Although we have only `observed' at most one frequency per observation bin, we wish to allow for the possibility that more than one frequency could be present in each bin.
This is useful in practice when the rate function of the Poisson process is a superposition of the rate functions of harmonically related notes. For every harmonic that overlaps within the region of a single bin, we would expect two or more partial frequencies to occur within that bin. Thus we are asserting that an observed peak in the spectrum implies the existence of one or more partial frequencies in that bin, and no observed peak implies that no partial frequencies were present in the bin.

For the observations to be valid as a Poisson process, when a peak is detected in a bin, we calculate the probability that one or more frequencies were observed in that bin, i.e., $p(N_f \geq 1) = 1 - p(N_f = 0) = 1 - \exp(-\lambda_f)$. When a peak is not detected in a bin, the probability is given by $p(N_f = 0) = \exp(-\lambda_f)$. The likelihood over all the frequency bins is thus given by

$$\prod_{f=1}^{F} \begin{cases} 1 - \exp\left(-\lambda_f\right) & \text{peak observed in bin } f \\ \exp\left(-\lambda_f\right) & \text{no peak observed in bin } f \end{cases} \qquad (6.5)$$

The likelihood calculation in this case is the same as for a set of independent Bernoulli trials with success probabilities $1 - \exp(-\lambda_f)$.

6.3 Bayesian Priors

Bayesian inference for the non-homogeneous Poisson process involves treating the rate function $\lambda(f)$, or $\lambda(x, f)$ for a vector valued process, as unknown and placing a prior distribution on the rate function. A suitable choice of prior depends greatly on the observation method. If we are binning the observations into fixed, a priori intervals, then we can see from (6.4) that we need to infer each of the unknown parameters $\lambda_f$ of the model rather than the full rate function $\lambda(f)$. However, if we consider each observation and evaluate intervals between partials, then our target for inference is the rate function $\lambda(f)$.

A full non-parametric Bayesian inference of the rate function of a Poisson process is carried out in Adams et al. [2009]. A Gaussian process is used as a prior, which is transformed into a rate function using a sigmoid function.
The inference is tractable; however, it is not immediately clear how higher-level information, such as partials occurring at harmonic positions, could be structured into such a prior.

An interesting alternative would be to use a periodic Poisson process [Dimitrov et al., 2004]. In this model, the rate function is a periodic function along the axis. For our model, the period of the rate function would be the fundamental frequency of the musical note. Here however we pursue two designs of Bayesian prior which may be inferred tractably and are amenable to additional, higher level, prior structure.

6.3.1 Fixed Bins

When observations are grouped into fixed bins, the model parameters are a finite set of positive values $\lambda_f$. Each $\lambda_f$ is the intensity parameter of a Poisson distribution, for which the conjugate prior choice is the Gamma distribution:

$$p\left(\lambda_f\right) = \mathcal{G}\left(\alpha_f, \beta_f\right) \qquad (6.6)$$

The posterior distribution when observing $N_f$ is

$$p\left(\lambda_f \mid N_f\right) = \mathcal{G}\left(\alpha_f + N_f, \beta_f + 1\right)$$

We can integrate out the unknown $\lambda_f$ to obtain a negative binomial (Pascal) distribution (Figure 6.2 on page 87):

$$p\left(N_f\right) = \frac{\Gamma\left(\alpha_f + N_f\right)}{N_f!\, \Gamma\left(\alpha_f\right)}\, p_f^{\alpha_f} \left(1 - p_f\right)^{N_f} \qquad (6.7)$$

$$p_f = \frac{\beta_f}{\beta_f + 1}$$

Here $\beta_f$ is treated as a rate parameter, consistent with the posterior update $\beta_f \rightarrow \beta_f + 1$ above; for $\beta_f = 1$ this gives $p_f = 1/2$. Figure 6.2 on page 87 shows the prior distribution on the expected number of partials $\lambda_f$ (6.6) with $\alpha_f = 2$, $\beta_f = 1$ and the corresponding marginal distribution (6.7) on the observed number of partials $N_f$.

The hyperparameters may be optimized for the purposes of training. For example, to train the hyperparameters of the rate function for a particular musical instrument and pitch, we would use $I$ example frames of data and estimate the partial frequencies in each frame, obtaining a set of observations $N^{(1)}_f, \ldots, N^{(I)}_f$ for each frequency bin. The posterior of the rate function given these observations is

$$p\left(\lambda_f \mid N^{(1)}_f, \ldots, N^{(I)}_f\right) = \mathcal{G}\left(\alpha_f + \sum_{i=1}^{I} N^{(i)}_f,\ \beta_f + I\right) \qquad (6.8)$$

and the hyperparameters can thus be set to new values: $\alpha_f \rightarrow \alpha_f + \sum_{i=1}^{I} N^{(i)}_f$ and $\beta_f \rightarrow \beta_f + I$.
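The hyperparameter update in (6.8) can be illustrated with a few lines of code. This is a sketch for a single frequency bin, assuming the second Gamma parameter acts as a rate (so that the posterior mean is the ratio of the two hyperparameters); the function names and the example counts are hypothetical.

```python
def update_gamma(alpha, beta, counts):
    """Hyperparameters of the Gamma posterior (6.8) after observing the
    partial counts of several training frames in one frequency bin."""
    return alpha + sum(counts), beta + len(counts)

def posterior_mean(alpha, beta):
    """Mean of a Gamma distribution, assuming beta is a rate parameter."""
    return alpha / beta

# Prior from Figure 6.2 (alpha = 2, beta = 1); the counts are made-up data.
alpha0, beta0 = 2.0, 1.0
training_counts = [1, 0, 2]
alpha1, beta1 = update_gamma(alpha0, beta0, training_counts)

# Updating one frame at a time gives the same hyperparameters as the batch update.
a, b = alpha0, beta0
for n in training_counts:
    a, b = update_gamma(a, b, [n])
```

Because the update is additive, training frames can be processed in any order, or one at a time as they arrive.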
The new values of the hyperparameters can now be used as the prior (6.6) when new frames of data are observed, thus transparently incorporating training data into the Bayesian model.

[Figure 6.2: Prior on expected number of partials, $p(\lambda_f \mid \alpha_f, \beta_f)$, and marginal distribution of number of partials, $p(N_f \mid \alpha_f, \beta_f)$.]

6.3.2 Gaussian Mixture Model

In this section we model the entire rate function as a Gaussian mixture model (GMM). Modelling the rate function as a GMM is a convenient way to use prior information concerning the partial frequencies of harmonic instruments. The rate function is shaped by the probability density function of a Gaussian mixture model:

$$\lambda(f) = \sum_{h=1}^{H} c_h\, \mathcal{N}\left(\mu_h, h\sigma^{2}\right) \qquad (6.9)$$

$$c_h \geq 0 \quad \forall h \qquad (6.10)$$

$H$ denotes the number of mixture components. A meaningful interpretation of the above model is that $H$ is the number of harmonic positions for a note with fundamental frequency $f_0$, and that a single component of the mixture corresponds to a single harmonic $h$. We assign $H = \left\lfloor \frac{f_s}{2 f_0} \right\rfloor$, where $f_s$ is the sampling frequency. The means of the components are set to the expected harmonic positions $\mu_h \approx h f_0$ and are allowed to deviate from their ideal positions to account for inharmonicity. $\sigma^{2}$ allows for further spread around the harmonics, which may occur with split peaks or modulations in the signal. Finally $c_h$ weights each harmonic, and we expect that low frequency partials have a higher probability of being detected and hence have higher values of the weighting $c_h$.

Inference of the unknown parameters in a Gaussian mixture model involves introducing labels for each observation and using Expectation Maximization (EM).
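To make the shape of the rate function (6.9) concrete, the sketch below evaluates it for a single note. It is illustrative only: the weights $c_h = 1/h$, the value of `sigma2` (taken here in Hz squared), the sampling frequency and the function names are assumptions made for this example, and the means are fixed at $\mu_h = h f_0$ with no inharmonicity.

```python
import math

def num_harmonics(f0, fs):
    """H = floor(fs / (2 f0)): number of harmonics below the Nyquist frequency."""
    return int(fs // (2.0 * f0))

def gmm_rate(f, f0, fs, sigma2):
    """Rate function (6.9) with mu_h = h * f0, variance h * sigma2 and
    illustrative weights c_h = 1 / h (an assumption of this sketch)."""
    total = 0.0
    for h in range(1, num_harmonics(f0, fs) + 1):
        var = h * sigma2
        density = math.exp(-0.5 * (f - h * f0) ** 2 / var) / math.sqrt(2.0 * math.pi * var)
        total += density / h
    return total

# Hypothetical settings: a 440 Hz note sampled at 8 kHz, sigma2 = 25 Hz^2.
f0, fs, sigma2 = 440.0, 8000.0, 25.0
rate_at_fundamental = gmm_rate(440.0, f0, fs, sigma2)
rate_between_harmonics = gmm_rate(660.0, f0, fs, sigma2)
```

With these settings the rate is concentrated at the harmonic positions and is effectively zero midway between them, which is why a separate clutter term is needed if spurious detections are to have non-zero probability.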
When we train our model by fitting the parameters to the estimated partial frequencies of a set of frames of audio from a harmonic musical instrument with a particular pitch, the values for $\sigma^{2}$ are typically small, so that there is negligible overlap between the mixture components for different harmonic positions. Figure 6.3 on page 88 is provided as an example of this, where we have chosen $\sigma^{2} = 10^{-4}$, $\mu_h = h f_0$ and $c_h = 1/h$ as illustrative parameters. This assumption is also used in [Davy et al., 2006] for the model of the detuning parameters for each partial frequency. In practice, the Expectation step for training the GMM can be replaced with a K-means clustering step, where the component means are set to the expected harmonic positions, which reduces the amount of computation required for inferring the unknown parameters compared to the full Expectation step.

[Figure 6.3: Intensity function for a musical note with fundamental frequency 440 Hz, parameterized using a Gaussian mixture model with H = 9 harmonics. Horizontal axis: f (0 to 4000); vertical axis: $\lambda(f)$.]

6.3.3 Model for mixture weights

The Gaussian mixture prior in the previous section allows for inharmonicity through the variance of each mixture component. Rather than inferring the unknown parameters of the mixture model, we can also set the parameters to particular values which match existing generative models of partial frequencies. The prior model of Godsill and Davy [2005], which is also used in Chapter 5, can easily be adapted as a Poisson process by interpreting the prior probability distribution over the number of partials and their frequencies as a counting process of the number of partials along the frequency axis. The number of partials $H$ per note is modelled as Poisson distributed,

$$p(H) = \mathrm{Po}\left(H \mid \Lambda\right) = \frac{\Lambda^{H} e^{-\Lambda}}{H!}$$

where $\Lambda$ is the expected number of partials.
The position of each partial frequency is normally distributed around $h f_0$, where $h$ is the harmonic number and $f_0$ is the fundamental frequency of the note. To convert this to a Gaussian mixture model of the form (6.9), we note that each mixture weight $c_h$ gives the expected number of partials around $h f_0$. We interpret this as the probability under the generative model that the number of partials is greater than or equal to $h$; hence, following this model, the mixture weights in (6.9) are given by

$$c_h = \left(1 - \sum_{m=1}^{h} \mathrm{Po}\left(m - 1 \mid \Lambda\right)\right) p_{\text{note}} \qquad (6.11)$$

$p_{\text{note}}$ is the prior probability that the note is playing in the mixture, and is applied as a scaling to all of the mixture weights for that model. $\sum_{m=1}^{h} \mathrm{Po}\left(m - 1 \mid \Lambda\right)$ is the cumulative Poisson distribution of observing up to $h - 1$ partials. $c_h$, when calculated by (6.11), gives the probability of observing a partial at that frequency under the prior model. When we set $\mu_h = h f_0$ and $\sigma^{2}$ to $3 \times 10^{-8}$ we obtain the multiplicative inharmonicity model suggested in Godsill and Davy [2005].

6.4 Signal Model Based Partial Estimation

In previous work using non-homogeneous Poisson processes for polyphonic transcription, we have used several methods for extracting the partial frequencies as a preprocessing step. In Peeling et al. [2007b] a heuristic scheme for selecting peaks in the Fourier spectrum above an adaptive noise threshold was considered. Such a scheme is quick and can detect the partial frequencies of high amplitude notes without difficulty. However, without an explicit signal model this scheme cannot differentiate between genuine partials and transient features of the noise floor, and its performance is therefore limited. In Peeling et al. [2007a] a matrix pencil scheme for estimating damped sinusoids was used to provide frequency and amplitude data for partials. Matrix pencil schemes use eigenvalue analysis to decompose the signal into a number of sinusoids, from which partial-frequency estimates can be obtained.
This is an improvement as an explicit signal model is being used, but the number of sinusoids needs to be supplied to the estimation scheme. As this is not known a priori, the number of sinusoids has to be estimated separately. Underestimating or overestimating the number of sinusoids can result in frequency estimates that differ greatly from the true frequencies, leading to transcription errors, as the algorithm attempts to fit an incorrect number of sinusoids to explain the data and adapts the frequencies accordingly.

The most satisfying partial estimation scheme we have investigated, and which we present here, is to apply an existing probabilistic signal model and to iteratively estimate the number of partials and their frequencies, using a Bayesian model selection criterion for that signal model to determine when a suitable number of partials have been detected. The scheme is similar in structure to Matching Pursuit [Mallat and Zhang, 1993], which at each iteration selects a basis function from an over-complete dictionary of basis functions, such as Gabor functions (5.2.5), to match the residual of the signal. The procedure here differs from Matching Pursuit in the following aspects: a set of basis functions with the same frequency is selected per iteration; a local search is employed to identify a suitable frequency at each iteration, using the periodogram as an initial estimator (6.4.3) and additionally numerical optimization to minimize the residual (6.4.4), rather than a global search over the dictionary; and the g-prior is incorporated in the signal model, which improves the correct selection of the number of partials when using Bayesian model selection.

The frequency estimates obtained are much less sensitive to the modelled number of partials than for the matrix pencil scheme. Moreover, future improvements in these signal models will also improve the quality of the frequency estimates.
Using a signal model selection criterion requires much more computation than other schemes, but substantial improvements in transcription accuracy are obtained as a result.

Although we describe the algorithm making reference to the probabilistic signal model developed in Chapter 5, in practice any suitable signal model with an accompanying model selection criterion can be used for the partial estimation scheme component of this system.

6.4.1 Overview

To estimate partial frequencies using a signal model, we use a simple iterative approach. At the beginning of an iteration, we have a set of partial frequencies, and using the signal model, we can assign some of the signal to the model, and the remainder to a residual. The residual contains the remaining partials that we are yet to detect. A model selection criterion is used to calculate how well the current set of partial frequencies explains the observed signal. We then select a new frequency from the residual, and add this to the set of partial frequencies. When the model selection criterion fails to improve by adding partial frequencies, the algorithm is stopped.

In general, the best frequency to select from the residual is that which accounts for the most energy in the residual. This approach will naturally tend to reduce the number of partials that are ultimately selected. Selecting the maximum of the Fourier spectrum is clearly a good estimate of this frequency. We also investigated using the maximum as a starting point, and using numerical methods to maximize the energy removed from the residual. For a probabilistic signal model, this is equivalent to finding the MAP frequency estimate [Bretthorst, 1989].

6.4.2 Bayesian Model Selection Criterion

To exactly estimate the number of partials using a probabilistic signal model requires a reversible jump MCMC scheme, as covered in the previous chapter.
However, we have found that using a Bayesian model selection criterion, which is an approximate means to compare models of different dimensions, produces acceptable estimates of the number of partials. In Djuric [1996, 1993] a model selection criterion for the estimation of complex valued sinusoids in white noise was derived:

$$N \log\left(y^{\top} P y\right) + \frac{5k}{2} \log N + \log p(\theta) \qquad (6.12)$$

where $k$ is the number of sinusoids and $P$ has the same definition as in (5.23). $p(\theta)$ is the prior probability of any additional model parameters. The appropriate number of sinusoids is that which minimizes the above expression.

In the remainder of this chapter, we will use the Gabor model, using sinc basis functions, as described in 5.3.4. The only additional model parameter that it is necessary to infer is the expected signal-to-noise ratio $\xi$, which is assigned an inverse-Gamma prior $p(\xi) \sim \mathcal{IG}\left(\xi; \alpha_\xi, \beta_\xi\right)$, where $\alpha_\xi = 2$ and $\beta_\xi = 1$ are chosen to give this prior infinite variance and thus be uninformative.

We divide the signal into frames with 50% overlapping samples. The partial frequencies are estimated in each frame separately. At each iteration we select and add a frequency to the set of partial frequencies already estimated for that frame. If this addition increases the model selection criterion (6.12), then we terminate at this point. Otherwise we update all of the estimated frequencies, and continue. The scheme is described in more detail in Algorithm 6.1.

Algorithm 6.1 Partial estimation scheme for a frame of audio $y$ with $N$ samples
• Initialize: $k \leftarrow 0$, $M_0 = \frac{N}{2} \log\left(y^{\top} y\right)$, $r = y$
• Iterate until $M_k \geq M_{k-1}$:
  ◦ $k \leftarrow k + 1$
  ◦ Estimate new partial frequency $\omega_k$ from the residual $r$ by the method in 6.4.3 or 6.4.4
  ◦ Form the basis matrix: $D_{t,k} = \exp\left(i \omega_k t\right)$, $t = 1, \ldots, N$
  ◦ Estimate signal to noise ratio, $\xi^{*} = \arg\max_{\xi} p\left(\xi \mid y\right)$, and $P$
  ◦ $M_k = \frac{N}{2} \log\left(y^{\top} P y\right) + \frac{5}{2} k \log N + \log p\left(\xi^{*}\right)$
  ◦ Calculate partial amplitudes: $b = \left(1 + \frac{1}{\xi^{*}}\right)^{-1} \left(D^{\top} D\right)^{-1} D^{\top} y$
  ◦ Update residual of signal: $r \leftarrow y - D b$

An important step in Algorithm 6.1 is the method by which the frequency value of a new partial is estimated from the residual of the signal. In the remainder of this section, we investigate two methods for producing an accurate frequency estimate from the residual of the signal at each iteration of the algorithm. In Section 6.1 we stated that the prior on the partial frequency estimates depends in practice on the estimation scheme. For each method therefore, we present a rate function for single harmonic notes. For polyphonic mixtures of harmonic notes, the rate functions can be superimposed (see 6.2.2). In Section 6.5 we describe inference of polyphony using these rate functions, and compare the accuracy of both methods within the partial estimation scheme for polyphonic music transcription.

6.4.3 Zero-Padding

The first method we investigated to estimate the value of a partial frequency in the residual was to zero pad the residual to a length of $4N$ and find the maximum of the DFT spectrum. Zero padding interpolates the DFT spectrum, increasing the number of discrete frequencies at which the partial frequency can be found.

An example of the output of the partial estimation scheme in Algorithm 6.1 for a polyphonic note mixture is provided in Figure 6.5 on page 94. We see that many of the partials visible in the spectrum of the signal are detected over multiple frames, with only a small number of additional spurious detections.

Based on our observation of the partial estimation results, we propose a novel parametric rate function which is designed to robustly infer multiple fundamental frequencies when estimating partial frequencies iteratively from the DFT spectrum.
As the DFT spectrum gives discrete estimates of the frequency in bins, the likelihood function of the observed partial frequency estimates should be calculated using the method described in 6.2.3.2. The rate function we propose has the following form:

$$\lambda_f\left(f_0\right) = \begin{cases} \lambda_{\text{Note}} & \left| f/f_0 - \left[f/f_0\right] \right| < \epsilon \\ \lambda_{\text{Clutter}} & \text{otherwise} \end{cases} \qquad (6.13)$$

where $\epsilon$ is the maximum allowed inharmonicity and $f$ is the central frequency of the DFT bin. The notation $\left[f/f_0\right]$ denotes rounding to the nearest integer, and hence gives us the position of the closest harmonic of $f_0$ to the central frequency $f$. $\epsilon$ allows for multiple partials per harmonic, which can occur because of modulations in the signal, or because of modelling and estimation error. This inharmonicity model is multiplicative, allowing for a larger deviation from the ideal position for the higher frequency harmonics. The rate function bears some resemblance to the scoring function (3.2) for inharmonicity used by Yeh et al. [2005].

[Figure 6.4: Partial estimation results for zero padding method. Upper panel: Periodogram Power Spectral Density Estimate (log Power Spectral Density against Frequency / Hz). Lower panel: Partial-Frequency Estimation output (Number of Partial-Frequency Estimates against Frequency / Hz).]

This intensity function approximately resembles histogram data of partial estimates for the low harmonics; see for example Figure 6.4 on page 92. The partial estimation scheme described in Algorithm 6.1 returns clusters of partials around the harmonic positions, and a few spurious estimates. We capture this behaviour with the parametrized intensity function (6.13). Our expectation is that one partial in the correct position would be detected in each frame of data, so we set $\lambda_{\text{Note}}$ equal to the number of frames. $\lambda_{\text{Clutter}}$ models additional detections and must be greater than zero in each frequency bin.
$\lambda_{\text{Clutter}}$ should be independent of the actual notes being played, and may be estimated or inferred. However, we found that the model is quite robust to a range of clutter values, and we set $\lambda_{\text{Clutter}} = 0.1$ in the following experiments.

The inharmonicity parameter $\epsilon$ is also unknown, and depends on the collection of notes being played. Instruments with high modulations (such as vibrato) will have large values of $\epsilon$, as will inharmonic instruments. We assign a uniform prior $p(\epsilon) = \mathcal{U}\left(\frac{1}{100}, \frac{1}{10}\right)$ and infer $\epsilon$ with each observation. Note that $\lambda_f$ has a functional dependence on $\epsilon$ and hence is unknown, so we use the Bayesian prior described in 6.3.1 to correctly calculate the appropriate marginal likelihood value.

6.4.4 Likelihood-Search

The second method we investigate is to directly maximize the likelihood of the Gabor signal model when adding a new frequency estimate to the set of partial frequencies estimated in the residual. This is motivated by our observations in 5.5.2, where we found that there could be significant deviation between the maximum of the DFT spectrum and the maximum likelihood frequency estimate, and that the residual of the signal after subtracting the detected frequency could contain additional peaks which can result in spurious partial frequency detections, as shown in Figure 5.3 on page 77.

Maximization of the likelihood, which is equivalent to minimizing the energy of the residual signal, was carried out numerically using the golden section search method. The steps for calculating the residual are shown in Algorithm 6.1. This method requires an upper and lower bound for the frequency, between which the maximum is estimated. First we chose the maximum of the Fourier spectrum as an initial estimate, as in 6.4.3 but without any zero padding. The bounds for the golden section search were chosen to be ±10 Hz of the initial estimate, based on our observations of the discrepancy between the Fourier spectrum and the signal model maxima in 5.5.2.
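The search described above can be sketched as follows. This is a simplified stand-in rather than the Gabor model likelihood: it assumes a single real-valued sinusoid fitted by ordinary least squares, and takes the residual energy as the quantity to minimize. The function names, the test signal and the tolerance are hypothetical; only the structure (DFT peak as initial estimate, golden section search within ±10 Hz) follows the description in the text.

```python
import math

def residual_energy(y, freq, fs):
    """Energy left in y after least-squares removal of a sinusoid at freq (Hz)."""
    w = 2.0 * math.pi * freq / fs
    c = [math.cos(w * t) for t in range(len(y))]
    s = [math.sin(w * t) for t in range(len(y))]
    cc, ss = sum(v * v for v in c), sum(v * v for v in s)
    cs = sum(p * q for p, q in zip(c, s))
    cy, sy = sum(p * q for p, q in zip(c, y)), sum(p * q for p, q in zip(s, y))
    det = cc * ss - cs * cs                      # 2x2 normal equations
    a, b = (cy * ss - sy * cs) / det, (sy * cc - cy * cs) / det
    return sum((yi - a * ci - b * si) ** 2 for yi, ci, si in zip(y, c, s))

def golden_section_min(func, lo, hi, tol=1e-3):
    """Minimize a unimodal function on [lo, hi] by golden section search."""
    inv_phi = (math.sqrt(5.0) - 1.0) / 2.0
    x1, x2 = hi - inv_phi * (hi - lo), lo + inv_phi * (hi - lo)
    f1, f2 = func(x1), func(x2)
    while hi - lo > tol:
        if f1 < f2:                              # minimum lies in [lo, x2]
            hi, x2, f2 = x2, x1, f1
            x1 = hi - inv_phi * (hi - lo)
            f1 = func(x1)
        else:                                    # minimum lies in [x1, hi]
            lo, x1, f1 = x1, x2, f2
            x2 = lo + inv_phi * (hi - lo)
            f2 = func(x2)
    return 0.5 * (lo + hi)

def dft_peak_frequency(y, fs):
    """Frequency (Hz) of the largest DFT bin, with no zero padding."""
    n, best_k, best_power = len(y), 1, -1.0
    for k in range(1, n // 2):
        re = sum(y[t] * math.cos(2.0 * math.pi * k * t / n) for t in range(n))
        im = sum(y[t] * math.sin(2.0 * math.pi * k * t / n) for t in range(n))
        if re * re + im * im > best_power:
            best_k, best_power = k, re * re + im * im
    return best_k * fs / n

def refine_frequency(y, fs, half_width=10.0):
    """Initial DFT estimate refined by golden section search within +/- half_width Hz."""
    f_init = dft_peak_frequency(y, fs)
    return golden_section_min(lambda f: residual_energy(y, f, fs),
                              f_init - half_width, f_init + half_width)

# Hypothetical noise-free test frame: 512 samples of a 441 Hz sinusoid at 8 kHz.
fs = 8000.0
y = [math.cos(2.0 * math.pi * 441.0 * t / fs + 0.3) for t in range(512)]
estimate = refine_frequency(y, fs)
```

In this example the DFT bin spacing is about 15.6 Hz, so the ±10 Hz bracket always contains the true frequency; with shorter frames the bracket would need to be widened accordingly.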
An example of the partial estimation results is shown in Figure 6.5 on page 94. These results show that the partial estimates are very good, in that many of the partials are detected, and there are very few duplicate or spurious estimates. As these results match the actual partials in a harmonic note, we simply use the rate function described in 6.3.3, which was derived from a Bayesian harmonic model.

This method requires more computation per iteration than the method in 6.4.3, but as fewer spurious and duplicate partials are detected, the overall number of iterations is smaller, and the cost of computing the model selection criterion is also reduced. We found that the likelihood search method was quicker overall than the zero padding method. As it also produces better partial estimation results, which can then be analyzed using an acceptable Bayesian model and compared easily with other inference schemes, we recommend the likelihood-search over the zero-padding method from a modelling and practical point of view. As we shall see in Section 6.5, there are also mild improvements in polyphonic transcription performance.

6.5 Polyphonic Pitch Estimation

The Poisson process model is useful for polyphonic pitch estimation because the likelihood is quick to compute, and it is therefore feasible to perform searches over pitch candidates, exhaustively testing every single note and also pairs of note pitches, at each stage selecting the single note or pair of notes that results in the highest likelihood. The performance of the model is very much dependent on the quality of the partial frequency estimates, although the Poisson process model allows for more clutter, errors and inaccuracy than direct inference using a poor signal model would.

6.5.1 Greedy Search

In this section we present results for a maximum marginal likelihood approach comparing the two partial estimation schemes described in Section 6.4.
Following other inexpensive multiple pitch estimation schemes, we search in a greedy manner, adding one note at a time to the mixture, selecting the maximum likelihood solution at each point. For the zero-padding method, the observations are observed in bins, hence the likelihood function is given by (6.4), whereas for the likelihood-search method, the observed frequencies are known precisely, hence the likelihood function is given by (6.3).

[Figure 6.5: Partial estimation results and periodogram estimate for a polyphonic mixture of four notes. Horizontal axis: Frequency / Hz; vertical axis: log Power Spectral Density; legend: Signal, Partial-Frequency Estimate.]

We evaluate the model using the dataset in Davy et al. [2006], described also in 5.5.2, and compare with the results obtained there. The polyphonic note mixtures are buffered into frames of length 1024 samples with 50% overlap. Table 6.1 on page 95 presents our results for when the number of notes is known in advance, compared to the results in Table 5.3 on page 76 on the same data set, showing that we are able to achieve a comparable and even superior level of performance to a full Bayesian model and also a state-of-the-art auditory model. The likelihood search method in general makes fewer transcription errors than the zero-padding method.

In these results, we observe that the Poisson process model is capable of correctly inferring multiple notes with the same pitch. Due to the superposition property of Poisson processes, it is straightforward to test whether a greedy search performs worse than a more exhaustive search which also considers adding pairs of notes to the solution at each iteration. We can also check whether the search leads to a higher likelihood solution than the likelihood of the true pitches under this model.
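The greedy search with a known number of notes can be sketched using the exact likelihood (6.3) and a superposition of per-note rate functions. This is an illustration under stated assumptions, not the configuration used for the reported results: each note uses a Gaussian mixture rate with weights $c_h = 1/h$ and variance $h\sigma^2$, a uniform clutter rate `CLUTTER` is added, the Gaussian mass falling outside the band is neglected in the integral, and the observed partials and candidate pitches are made-up data.

```python
import math

CLUTTER = 1e-4   # hypothetical uniform clutter rate per Hz
SIGMA2 = 4.0     # hypothetical spread parameter in Hz^2

def harmonics(f0, fs):
    """Harmonic numbers 1..H with H = floor(fs / (2 f0))."""
    return range(1, int(fs // (2.0 * f0)) + 1)

def note_rate(f, f0, fs):
    """Gaussian mixture rate for one note with illustrative weights 1/h."""
    total = 0.0
    for h in harmonics(f0, fs):
        var = h * SIGMA2
        total += math.exp(-0.5 * (f - h * f0) ** 2 / var) / (h * math.sqrt(2.0 * math.pi * var))
    return total

def expected_count(f0, fs):
    """Approximate integral of the note rate: the sum of the mixture weights."""
    return sum(1.0 / h for h in harmonics(f0, fs))

def log_likelihood(partials, pitches, fs):
    """Logarithm of (6.3) for the superimposed rate of the given pitches."""
    integral = CLUTTER * fs / 2.0 + sum(expected_count(p, fs) for p in pitches)
    log_rates = sum(math.log(CLUTTER + sum(note_rate(f, p, fs) for p in pitches))
                    for f in partials)
    return log_rates - integral

def greedy_search(partials, candidates, n_notes, fs):
    """Add one note at a time, keeping the candidate with the highest likelihood."""
    chosen = []
    for _ in range(n_notes):
        best = max(candidates, key=lambda c: log_likelihood(partials, chosen + [c], fs))
        chosen.append(best)
    return chosen

# Made-up partial estimates for notes at 220 Hz and 330 Hz (sharing harmonics near 660 Hz).
partials = [220.0, 330.0, 440.0, 660.0, 662.0, 880.0, 990.0, 1100.0, 1320.0, 1650.0]
result = greedy_search(partials, [220.0, 262.0, 330.0, 440.0], 2, 8000.0)
```

Since candidates remain available after selection, the same pitch may be chosen twice, in keeping with the observation above that the model can infer multiple notes with the same pitch.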
For the results in Table 5.3 on page 76 the greedy search returns the same sets of notes as the method which adds pairs of notes, and the performance of the two search methods is identical. Thus the greedy algorithm is shown to be sufficient for finding the maximum likelihood solution, and we suggest that this is a consequence of the superposition property.

Table 6.1: Polyphonic pitch estimation using the Poisson process model, comparing the zero-padding and likelihood-search methods for estimating the frequency of the partial at each iteration of the partial estimation scheme. The results are compared to estimation results for the Gabor model developed in Chapter 5, the Bayesian harmonic model of Davy et al. [2006], and a state-of-the-art auditory model for polyphonic transcription [Klapuri, 2008].

                                          Notes in mixture
Evaluation metric   Model                 1      2      3      4
% octave error      Zero-padding          0      4.9    11.9   8.9
                    Likelihood-search     0      7.0    6.0    7.5
                    Gabor                 0      2.8    11.1   10.2
                    Davy et al. [2006]    0      10.3   17.8   9.3
                    Klapuri [2008]        0      13.6   19.0   22.2
% pitch error       Zero-padding          0      6.7    9.0    17.0
                    Likelihood-search     0      5.0    10.7   15.5
                    Gabor                 0      8.3    15.6   18.6
                    Davy et al. [2006]    0      5.1    7.2    19.7
                    Klapuri [2008]        0      1.4    6.0    10.3
% total error       Zero-padding          0      11.6   20.9   25.9
                    Likelihood-search     0      12.0   16.7   23.0
                    Gabor                 0      11.1   26.7   28.8
                    Davy et al. [2006]    0      15.4   25.0   29.0
                    Klapuri [2008]        0      15.0   25.0   32.5

6.5.2 Estimation of number of notes

The greedy search method may also be used to estimate the number of notes. Once the partial frequencies have been estimated, as a result of the superposition property there are no remaining parameters in the model. There is no danger of fitting too many notes using the Poisson process model once the partial frequencies have been estimated, as the expected number of partials in the signal is implicitly defined by the integral of the rate function of the Poisson process.
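The role of the integral of the rate function in the greedy search can be illustrated with a small sketch. Everything specific in it is an assumption made for illustration and is not taken from this thesis: the rate function is a hypothetical sum of Gaussian bumps at the harmonics of each note plus a constant clutter level (it is not the rate function of 6.3.3), the functions `note_rate`, `mixture_rate`, `log_likelihood` and `greedy_search` are invented names, and the observed frequencies are synthetic. The only ingredients relied upon are the standard log-likelihood of a non-homogeneous Poisson process (the sum of the log rate at the observed points minus the integral of the rate) and the superposition property (the rates of the notes add).

```python
import numpy as np

def note_rate(freqs, f0, n_partials=10, width=5.0, weight=0.9):
    """Hypothetical rate for one note: Gaussian bumps at the harmonics of f0.
    Each bump integrates to `weight`, the expected number of detections."""
    harmonics = f0 * np.arange(1, n_partials + 1)
    z = (freqs[:, None] - harmonics[None, :]) / width
    return weight * np.exp(-0.5 * z**2).sum(axis=1) / (width * np.sqrt(2 * np.pi))

def mixture_rate(freqs, notes, clutter=1e-4):
    """Superposition: the rate of a mixture is the clutter rate plus the note rates."""
    lam = np.full(freqs.shape, clutter)
    for f0 in notes:
        lam = lam + note_rate(freqs, f0)
    return lam

def log_likelihood(observed, notes, grid):
    """Poisson process log-likelihood: sum of log rates minus the integrated rate."""
    df = grid[1] - grid[0]
    expected_count = mixture_rate(grid, notes).sum() * df  # Riemann sum of the rate
    return np.log(mixture_rate(observed, notes)).sum() - expected_count

def greedy_search(observed, candidates, grid):
    """Add one note at a time; stop when no candidate increases the likelihood."""
    notes, best = [], log_likelihood(observed, [], grid)
    while True:
        scores = [log_likelihood(observed, notes + [c], grid) for c in candidates]
        j = int(np.argmax(scores))
        if scores[j] <= best:
            return notes, best
        notes.append(candidates[j])
        best = scores[j]

# Synthetic partial frequencies: eight harmonics each of 220 Hz and 330 Hz.
observed = np.concatenate([220.0 * np.arange(1, 9), 330.0 * np.arange(1, 9)])
grid = np.arange(0.0, 4000.0, 1.0)
candidates = [110.0, 165.0, 220.0, 277.18, 330.0, 440.0]
notes, loglik = greedy_search(observed, candidates, grid)
print(sorted(notes))
```

In this toy setting a third note can only explain partials that are already accounted for, so the small gain in the log-rate terms is outweighed by the increase in the integrated rate and the search stops at two notes.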
Attempting to fit too many notes will result in a much higher expected number of partials than actually observed, which is penalized by the likelihood of the Poisson process. This is advantageous, as there is no need for an explicit penalization term to avoid overfitting of the model. The likelihood itself is therefore sufficient for determining the number of notes. Notes are added to the candidate set until this fails to increase the likelihood.

We evaluate the estimation of the number of notes by calculating the average precision and recall for different numbers of notes and overall, and present our results in Table 6.2 on page 96. The precision $P$ is defined as the number of correct notes detected in the mixture divided by the number of notes estimated by the system. The recall $R$ is defined as the number of correct notes detected in the mixture divided by the actual number of notes in the mixture. Both precision and recall are commonly expressed as percentages. The F-Score $F$, given by

\[ F = \frac{2PR}{P + R} \]

is in excess of 80% overall for both methods, showing that the number of notes is being estimated accurately. Overall the two partial estimation schemes, with their associated rate functions, have essentially the same level of accuracy for polyphonic transcription.

Table 6.2: Precision and recall for estimating the number of notes and their pitches using the Poisson process model and the two frequency estimation schemes described in this chapter. The results are compared to a state-of-the-art auditory model.

                                          Notes in mixture
Evaluation metric   Model                 1       2      3      4      Overall
Precision %         Zero-padding          93.8    85.4   72.9   68.4   75.7
                    Likelihood-search     98.0    85.0   76.7   71.0   78.2
                    Klapuri [2008]        87.0    73.9   65.2   58.7   66.2
Recall %            Zero-padding          100.0   91.8   86.7   87.7   88.5
                    Likelihood-search     100.0   92.4   85.8   87.1   85.9
                    Klapuri [2008]        100.0   85.0   75.0   67.5   76.5
F-Score %           Zero-padding          96.8    88.5   79.2   76.9   81.6
                    Likelihood-search     99.0    88.5   80.1   78.2   81.9
                    Klapuri [2008]        93.0    79.0   69.8   62.8   71.0

The zero-padding method overall has a higher recall than the likelihood-search method, which is expected as the zero-padding method returns more partial frequencies, hence potentially detecting more notes. In our opinion and experience, a higher recall is preferable for an offline polyphonic transcription system, where the results can be verified by a trained musician, as it is perceptually simpler to notice and delete a spurious, incorrectly transcribed note than it is to determine and add a missing note. However, we would still prefer to use the likelihood-search method as it requires less computation in the partial estimation stage. To increase the recall of the likelihood-search method, the $5k/2$ penalization term in (6.12) can be reduced, so that more partials are detected in the estimation scheme, and thus more notes are detected. Even when the penalization term is reduced, the number of partials returned by the likelihood-search method is substantially smaller than that returned by the zero-padding method, and the computational savings are retained.

6.5.3 Comparison with State-of-the-Art

In this section we compare the performance of the system developed in this chapter with other state-of-the-art multiple pitch estimation systems that compute estimates on a frame-by-frame basis. The results we compare with are taken from Vincent et al. [2010] using the MIREX 2007 woodwind training dataset². Test excerpts are generated by successively summing together the first 30 seconds of the flute, clarinet, bassoon, horn and oboe tracks in order. To use the same evaluation criterion, we use overlapping frames of 46ms length spaced 10ms apart. The results are presented in Table 6.3 on page 97 for the best performing algorithms evaluated in Vincent et al. [2010], and also the likelihood-search and zero-padding methods described here. The results demonstrate that the two multiple pitch estimation schemes of Vincent et al.
[2010] outperform the likelihood-search method on this woodwind dataset, although the performance of the likelihood-search method is comparable with state-of-the-art systems. There are also significant differences between the results on this real dataset and the results on the isolated examples presented in Table 6.2 on page 96, where the likelihood-search and zero-padding methods have a similar level of performance, show significant improvement over the system of Klapuri [2008] and have a higher F-measure overall than in Table 6.3 on page 97.

² www.music-ir.org/mirex2007/

Table 6.3: Comparison of the F-measure of multiple pitch estimation for the MIREX 2007 woodwind training dataset. The `Constrained NMF' algorithm is known as `NMF under harmonicity and spectral smoothness constraints' in Vincent et al. [2010].

                                                Polyphony
Algorithm                                       2      3      4      5
Unconstrained NMF, Vincent et al. [2010]        79.9   56.3   62.1   61.9
Constrained NMF, Vincent et al. [2010]          76.5   64.7   67.5   62.5
Klapuri [2008]                                  73.4   59.1   63.5   59.9
Likelihood-search                               75.0   66.1   55.0   59.6
Zero-padding                                    66.2   59.4   51.2   56.3

These differences are clearly due to the number of frames used per note estimate. The isolated examples, which have several frames per note, are realistic only when concurrent sounding note segments are extracted from the audio beforehand, whereas each frame in the woodwind dataset contains much less spectral information. The zero-padding method suffers especially for short frames as the frequency estimates are obtained from the spectral information alone, whereas the likelihood-search method is able to obtain more accurate estimates by fitting a signal model to the audio. Additional improvements could be made by jointly inferring the parameters of the prior harmonic model described in 6.3.3, as the parameters were chosen in that model to be suitable for longer notes.
Alternatively, larger improvements could be made by extending the signal model with dependencies between neighbouring frames, rather than independently inferring isolated frequencies and notes in each frame. This would effectively increase the length of the frame available for estimating the frequencies, thus improving the accuracy of the estimates in line with the results presented in Table 6.2 on page 96.

6.6 Conclusion

In this chapter we have motivated and implemented a two-stage process for polyphonic pitch detection. The first stage is to accurately detect partial frequencies in a short segment of the signal. The second stage is to infer the notes using a harmonic model of the expected frequencies. The approach we have used is to adapt existing Bayesian signal and prior harmonic models for musical notes and design simple, approximate inference schemes for these models which are computationally inexpensive and allow for the exploration of many more combinations of notes than may be feasible using full Bayesian models. As a result we have shown that a higher level of transcription accuracy can be achieved even with a simple algorithm design. Moreover we are able to benefit from present and future advances of the Bayesian models to improve partial frequency and multiple pitch estimation.

The partial frequency estimation stage is carried out by progressively fitting a sinusoidal basis to the observed signal, halting the process when a Bayesian model selection criterion fails to improve by adding more partial frequencies. There are two requirements on the signal model for this method to be feasible: firstly, that the optimum frequency to be added per iteration can be found quickly, and secondly, that adding additional frequencies to the model does not require that the existing frequencies be modified.
These requirements are met by the Gabor basis models discussed in Chapter 5 but are not met by matrix pencil methods, which were used in previous work [Peeling et al., 2007a]. The primary benefit of using explicit signal and noise models over heuristic peak-detection schemes is that the spectral shape of partials and the level of the noise floor can be used to distinguish between actual partials and spurious artefacts, and also to detect partials with small amplitudes which are otherwise masked by nearby partials with larger amplitudes.

The estimated partial frequencies are then tested against a series of multiple pitch hypotheses by evaluating the likelihood of the frequency positions under the assumption of a non-homogeneous Poisson process. We choose to use a Poisson process for the following reasons: it is a generative model for which Bayesian priors for different harmonic models and partial frequency estimation schemes can be created easily; the likelihood can be calculated exactly without the use of an iterative algorithm; and the superposition property means that multiple notes can be inferred using a greedy search scheme where previously found notes are consistent with the current hypothesis, and the number of notes can be estimated correctly.

The resulting transcription scheme is accurate for isolated notes, although the accuracy when used to transcribe music frame-by-frame for short frame lengths is much lower than for isolated notes. Future work should concentrate on extending the signal model to include dependencies between neighbouring frames, to allow the frequencies to be estimated in each frame with much higher accuracy than is possible for short frame lengths. The relationship between the pitches in neighbouring frames, although not explored in this chapter, can be expressed through prior, generative models of pitch trajectories and note durations.
For example, the transcription system here could be used to evaluate the likelihood of observed partial frequencies in each frame for different note combinations, and the Viterbi algorithm used to infer the most likely transcription across multiple frames using a hidden Markov model prior. Multiple passes would be used to add additional notes to each frame, similar to the greedy search procedure presented here. These ideas are developed further in Chapter 8.

Although the method we have presented here is accurate and flexible, it is not suitable for large scale processing of musical data as the algorithms for both frequency and pitch estimation are iterative in nature, and their computation cannot be easily parallelised to make use of optimized software libraries or parallel processing in hardware. In the next chapter we develop and apply alternative Bayesian generative models where the signal is projected onto a fixed set of harmonic bases, one per pitch, rather than attempting to infer the basis of the signal model. The relative amounts of energy used in each projection are used to create a transcription of long passages of polyphonic music. This approach allows multiple frames and the dependencies between them to be processed in parallel.

Chapter 7

Gaussian Variance Generative Matrix Factorization Models

In this chapter we develop a generative Bayesian model for matrix valued observations, where each element of the matrix is assumed distributed zero mean Gaussian, and the variances of the elements are factorized into two positive-valued matrices with smaller common dimension than those of the observed matrix. The models represent the observed data as the superposition of statistically independent sources.
These models can be used for dimensionality reduction, modelling and compression of real-valued data, such as the short-time Fourier transform (STFT) of an audio signal, where it can be applied to source separation and music transcription by jointly modelling spectral characteristics such as harmonicity and temporal activations or excitations. Suitable priors are chosen so that the appropriate modelling dimensionality is selected, and the variance parameters can be inferred using efficient matrix update equations, allowing large amounts of data to be processed efficiently. The algorithm is adapted for the task of polyphonic transcription of music using labeled training data. The performance of the system is compared to that of existing discriminative and model-based approaches on a dataset of classical piano music.

7.1 Introduction

Tools for multivariate data analysis, processing and compression include principal components analysis (PCA) and non-negative matrix factorization (NMF), which perform dimensionality reduction. Recently these tools have been made more flexible and powerful by description in a probabilistic, statistical framework, using a generative model. Existing algorithms can then typically be described as performing maximum likelihood estimation of the parameters. Tipping and Bishop [1999] describe PCA in a probabilistic framework as a Gaussian latent variable model, and use the Expectation-Maximization (EM) algorithm to iteratively reach a solution. Virtanen et al. [2008] describe NMF using a Poisson source model, and obtain the iterative update equations for the information-divergence measure given by Lee and Seung [2000]. The advantages cited by adopting a probabilistic signal model are the ability to incorporate prior information via Bayesian methods, and a consistent approach to dealing with multiple observations and missing data. See also Cemgil [2008] for a full report on a Bayesian NMF model with applications to missing data.
Non-negative matrix factorization has been used with some success in the modelling of time-frequency energy distributions in audio and musical signal applications, such as drum transcription [Paulus and Virtanen, 2006], source separation [Wang and Plumbley, 2005, Virtanen, 2007], and polyphonic music transcription [Smaragdis and Brown, 2003, Bertin et al., 2009a, Abdallah and Plumbley, 2004]. To illustrate the principal concept, consider a matrix $\mathbf{X} = \left|\{x^2_{\nu,\tau}\}\right|$ formed of the coefficients of the short-time Fourier transform of an audio signal (see Section 7.4) with frequency indices $\nu = 1, \ldots, F$ and time indices $\tau = 1, \ldots, T$. This non-negative matrix can be approximately decomposed into two non-negative matrices

\[ \mathbf{X} \approx \mathbf{T}\mathbf{V} \]

where $\mathbf{T}$ is an $F \times I$ matrix which is typically interpreted as a set of harmonic template vectors $[\mathbf{t}_1, \ldots, \mathbf{t}_I]$ along the frequency axis, and $\mathbf{V}$ is an $I \times T$ matrix, interpreted as a set of activation or excitation vectors $[\mathbf{v}_1, \ldots, \mathbf{v}_I]^{\top}$ in time. In a single channel source separation situation, the observed matrix $\mathbf{X}$ can be written as the superposition of $I$ sources:

\[ \mathbf{X} \approx \sum_{i=1}^{I} \mathbf{t}_i \mathbf{v}_i^{\top} \approx \sum_{i=1}^{I} \mathbf{S}_i, \qquad \mathbf{S}_i = \mathbf{t}_i \mathbf{v}_i^{\top} \]

The representation of $\mathbf{X}$ as the sum of a set of single rank matrices is shown schematically in Figure 7.1 on page 101. However, this model is physically unrealistic: because energy is a quadratic quantity, the energy for two sources is not additive, i.e.,

\[ |s_1 + s_2|^2 \neq s_1^2 + s_2^2 \]

This is typical of spectral modelling where superposition is not addressed. Here we describe a probabilistic model based on the transform coefficients themselves rather than a positive energy representation. The source model is conditionally zero-mean Gaussian, motivated by the underlying physics, which obeys the superposition property desired. The model is valid for real-valued observations which arise from the discrete cosine transform (DCT) for instance, and also complex-valued coefficients from the discrete Fourier transform (DFT).
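The two statements above, that energies do not add sample by sample whereas the variances of independent zero-mean Gaussian sources do, can be checked numerically. The following is only an illustrative sketch: the variances 3.0 and 1.5, the sample size, the seed and the variable names are arbitrary choices and do not come from this thesis.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200_000
var1, var2 = 3.0, 1.5  # hypothetical source variances

# Two independent zero-mean Gaussian sources and their superposition.
s1 = rng.normal(0.0, np.sqrt(var1), n)
s2 = rng.normal(0.0, np.sqrt(var2), n)
x = s1 + s2

# Sample by sample, |s1 + s2|^2 differs from s1^2 + s2^2 by the cross term 2 s1 s2.
energy_gap = np.abs(x**2 - (s1**2 + s2**2))

# The variance of the sum is nevertheless close to the sum of the variances.
var_x = x.var()
print(energy_gap.mean(), var_x, var1 + var2)
```

The cross term vanishes only in expectation, which is why the model is stated in terms of the variances of the coefficients rather than in terms of their observed energies.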
The statistical perspective readily admits a Bayesian framework in the form of prior distributions placed on the variance parameter of the Gaussian. A review of existing prior structures and related inference methods is in Godsill et al. [2007]. This chapter is an expansion of Peeling et al. [2010], which develops an Expectation-Maximization (EM) algorithm for the Gaussian variance model. Here we additionally discuss variational Bayes and Monte-Carlo inference techniques in Section 7.3, and demonstrate additional applications to musical audio processing in Section 7.4 other than polyphonic music transcription.

In this chapter, we use the Gaussian variance model in a matrix factorization framework. We first express the model and obtain the maximum likelihood estimation of the factorization as an iterative EM algorithm, and describe the implementation as a pair of matrix-update equations to demonstrate that the Gaussian model shares the same computational attractiveness as related NMF approaches (Section 7.2). We then place conjugate priors on the elements of the factor matrices, and describe a number of Bayesian inference techniques: variational Bayes and a set of Monte-Carlo techniques (Section 7.3). We present demonstrations of applications for this model in the field of musical audio processing in Section 7.4, and by placing a prior model for MIDI transcription (Section 7.5) we develop a system for polyphonic transcription of piano music. We present comparative results on a large dataset of music in Section 7.6.

[Figure 7.1: Representations of the single-channel source separation model as a matrix factorization problem. The diagram relates the observations $x_{\nu,\tau}$ to the latent sources $s_{\nu,i,\tau}$, the templates $t_{\nu,i}$ and the excitations $v_{i,\tau}$.]
7.2 Gaussian Variance Matrix Factorization Model

In this section we describe a matrix factorization model in which the observed matrix coefficients have a zero-mean Gaussian distribution, and the variance of each coefficient is obtained from the matrix product of the template and excitation matrices. This model was used in single channel audio source separation by Benaroya et al. [2003] and polyphonic music transcription by Abdallah and Plumbley [2004], and was linked with the Itakura-Saito (IS) divergence between the observed matrix coefficients and the underlying variances by Févotte et al. [2009].

We initially express the model as a probability distribution over individual sources.

\[ s_{\nu,i,\tau} \sim \mathcal{N}(s_{\nu,i,\tau}; 0, t_{\nu,i} v_{i,\tau}) \tag{7.1} \]

\[ x_{\nu,\tau} = \sum_i s_{\nu,i,\tau} \]

The $s$ variables represent the individual latent sources, and the $x$ variables are the observations, formed from the superposition of the sources. When the latent source variables and the observed coefficients are complex valued, the distribution in (7.1) is the circular symmetric complex normal distribution (Section A.1), i.e., the real and imaginary parts are uncorrelated and have equal variance.

The matrix representation of the superposition is

\[ \mathbf{X} = \mathbf{S}_1 + \cdots + \mathbf{S}_I = \sum_{i=1}^{I} \mathbf{S}_i \]

where $\mathbf{X}, \mathbf{S}_i \in \mathbb{R}^{F \times T}$, $i = 1, \ldots, I$, have elements $x_{\nu,\tau}$, $s_{\nu,i,\tau}$, for $\nu = 1, \ldots, F$, $\tau = 1, \ldots, T$ respectively. Marginalizing out the latent sources $\mathbf{S} = \{\mathbf{S}_1, \ldots, \mathbf{S}_I\}$ gives

\[ p(\mathbf{X}|\mathbf{T},\mathbf{V}) = \int d\mathbf{S}\; p(\mathbf{X}|\mathbf{S})\, p(\mathbf{S}|\mathbf{T},\mathbf{V}) = \prod_{\nu,\tau} \mathcal{N}\Big( x_{\nu,\tau}; 0, \sum_i t_{\nu,i} v_{i,\tau} \Big) \]

due to the superposition property of normal random variables, that is: when

\[ s_i \sim \mathcal{N}(s_i; 0, \sigma_i^2), \qquad x = s_1 + \cdots + s_I \]

then the marginal probability is given by

\[ p(x) = \mathcal{N}\Big(x; 0, \sum_i \sigma_i^2\Big) \]

For real $x$, the marginal log-likelihood of a single observation is given by:

\[ \log p(\mathbf{X}|\mathbf{T},\mathbf{V}) = \sum_{\nu} \sum_{\tau} \left( -\frac{1}{2\sigma^2_{\nu,\tau}} x^2_{\nu,\tau} - \frac{1}{2} \log 2\pi\sigma^2_{\nu,\tau} \right) \tag{7.2} \]

and for complex $x$

\[ \log p(\mathbf{X}|\mathbf{T},\mathbf{V}) = \sum_{\nu} \sum_{\tau} \left( -\frac{1}{\sigma^2_{\nu,\tau}} |x_{\nu,\tau}|^2 - \log \pi\sigma^2_{\nu,\tau} \right) \tag{7.3} \]

where

\[ \sigma^2_{\nu,\tau} = \sum_i t_{\nu,i} v_{i,\tau} = [\mathbf{T}\mathbf{V}]_{\nu,\tau} \]

The derivation for the real and complex valued models is so similar that when it is required to specify which observation model is being used, we let $D = 1$ for the real valued model, and $D = 2$ for the complex valued model. This is motivated by viewing the complex normal distribution as a two-dimensional normal distribution with equal variance on the real and imaginary axes. The marginal log-likelihood in its general form, which reduces to (7.2) for $D = 1$ and to (7.3) for $D = 2$, is thus

\[ \log p(\mathbf{X}|\mathbf{T},\mathbf{V}) = \sum_{\nu} \sum_{\tau} \left( -\frac{D}{2\sigma^2_{\nu,\tau}} |x_{\nu,\tau}|^2 - \frac{D}{2} \log \frac{2\pi\sigma^2_{\nu,\tau}}{D} \right) \tag{7.4} \]

As observed in Févotte et al. [2009], maximizing the log-likelihood is equivalent to minimizing the IS divergence

\[ d_{IS}(z|\sigma^2) = \frac{z}{\sigma^2} - \log z + \log \sigma^2 - 1 \]

between $z = |x^2|$ and $\sigma^2$. This can be seen by comparing (7.3) and (7.4) and ignoring the elements of both equations that do not depend on the variances $\sigma^2$.

7.2.1 Maximum-likelihood and the EM algorithm

The EM algorithm for maximum-likelihood estimation of parameters in the Gaussian variance model was independently derived by Févotte et al. [2009]. Maximizing the likelihood of the Gaussian variance model is equivalent to minimizing the Itakura-Saito distance [Itakura and Saito, 1968] between the observed matrix $\mathbf{X}$ and its reconstruction $\mathbf{T}\mathbf{V}$.

The log likelihood of observed datum $\mathbf{X}$ can be written as

\[ \mathcal{L}_X \equiv \log \int d\mathbf{S}\; p(\mathbf{X},\mathbf{S}|\mathbf{T},\mathbf{V}) = \log \int d\mathbf{S}\; q(\mathbf{S}) \frac{p(\mathbf{X},\mathbf{S}|\mathbf{T},\mathbf{V})}{q(\mathbf{S})} \geq \int d\mathbf{S}\; q(\mathbf{S}) \log \frac{p(\mathbf{X},\mathbf{S}|\mathbf{T},\mathbf{V})}{q(\mathbf{S})} \equiv \mathcal{B}[q(\mathbf{S})] \]

by Jensen's inequality [Bishop, 2006], defining a lower bound on the log likelihood.
Here, $q(\mathbf{S})$ is an instrumental distribution over the set of latent sources, with the property that $q(\mathbf{S}) = 0$ if and only if $p(\mathbf{X},\mathbf{S}|\mathbf{T},\mathbf{V}) = 0$. The lower bound is tight when the instrumental distribution is the posterior of the latent sources:

\[ \arg\max_{q(\mathbf{S})} \mathcal{B}[q(\mathbf{S})] = p(\mathbf{S}|\mathbf{X},\mathbf{T},\mathbf{V}) \]

Hence we can use an iterative coordinate ascent scheme to maximize the log likelihood. The first step, called the expectation (E) step, is to compute the posterior distribution, which we will see has the form of a multivariate normal. Because this is an exponential family, we only need to compute the sufficient statistics, which is why we call this the expectation step. The second step is called the maximization (M) step because we find the maximum likelihood $\mathbf{T}$ and $\mathbf{V}$ holding $q(\mathbf{S})$ fixed. The two steps of the expectation maximization algorithm (EM) are summarized as:

E-step: \[ q(\mathbf{S})^{(n)} = p(\mathbf{S}|\mathbf{X},\mathbf{T}^{(n-1)},\mathbf{V}^{(n-1)}) \]

M-step: \[ \{\mathbf{T}^{(n)},\mathbf{V}^{(n)}\} = \arg\max_{\mathbf{T},\mathbf{V}} \left\langle \log p(\mathbf{S},\mathbf{X}|\mathbf{T},\mathbf{V}) \right\rangle_{q(\mathbf{S})} \]

7.2.2 Expectation Step

In this section we derive the posterior of the latent sources

\[ p(\mathbf{S}|\mathbf{X},\mathbf{T},\mathbf{V}) = \frac{p(\mathbf{S},\mathbf{X}|\mathbf{T},\mathbf{V})}{p(\mathbf{X}|\mathbf{T},\mathbf{V})} \]

The terms in the expression for the log probability density of the posterior are given by

\[ \log p(\mathbf{S}|\mathbf{X},\mathbf{T},\mathbf{V}) = \sum_{\nu,\tau} \left( \sum_i \left( -\frac{D}{2} \frac{|s_{\nu,i,\tau}|^2}{t_{\nu,i} v_{i,\tau}} \right) + \frac{D}{2} \frac{\left|\sum_i s_{\nu,i,\tau}\right|^2}{\sum_i t_{\nu,i} v_{i,\tau}} \right) + \cdots \tag{7.5} \]

This defines a multivariate normal distribution over the latent sources, for which we will adopt the following notation: $\mathbf{s}_{\nu,\tau} = [s_{\nu,1,\tau}, \ldots, s_{\nu,I,\tau}]$, and $\mathbf{1}$ is an $I$-element row vector of ones, so that we can write $\sum_i s_{\nu,i,\tau} = \mathbf{1}\mathbf{s}_{\nu,\tau}$. Let $\mathbf{A}_{\nu,\tau}$ be an $I \times I$ diagonal (covariance) matrix with $i$th element $t_{\nu,i} v_{i,\tau}$.
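The above expression is rewritten as

\[ \log p(\mathbf{S}|\mathbf{X},\mathbf{T},\mathbf{V}) = \sum_{\nu,\tau} \left( -\frac{D}{2} \operatorname{Tr} \mathbf{A}^{-1}_{\nu,\tau} \mathbf{s}_{\nu,\tau} \mathbf{s}^{H}_{\nu,\tau} + \frac{D}{2} \operatorname{Tr} \frac{\mathbf{1}^{\top}\mathbf{1}\, \mathbf{s}_{\nu,\tau} \mathbf{s}^{H}_{\nu,\tau}}{\mathbf{1}\mathbf{A}_{\nu,\tau}\mathbf{1}^{\top}} \right) + \cdots \]

which, after some manipulations (see Section B.2), becomes

\[ = \sum_{\nu,\tau} -\frac{D}{2} \operatorname{Tr} \left( \mathbf{s}_{\nu,\tau} - \frac{\mathbf{A}_{\nu,\tau}\mathbf{1}^{\top}\mathbf{1}\mathbf{s}_{\nu,\tau}}{\mathbf{1}\mathbf{A}_{\nu,\tau}\mathbf{1}^{\top}} \right)^{H} \left( \mathbf{A}_{\nu,\tau} - \frac{\mathbf{A}_{\nu,\tau}\mathbf{1}^{\top}\mathbf{1}\mathbf{A}_{\nu,\tau}}{\mathbf{1}\mathbf{A}_{\nu,\tau}\mathbf{1}^{\top}} \right) \left( \mathbf{s}_{\nu,\tau} - \frac{\mathbf{A}_{\nu,\tau}\mathbf{1}^{\top}\mathbf{1}\mathbf{s}_{\nu,\tau}}{\mathbf{1}\mathbf{A}_{\nu,\tau}\mathbf{1}^{\top}} \right) + \cdots \]

\[ = \sum_{\nu,\tau} -\frac{D}{2} \operatorname{Tr} \left( \mathbf{s}_{\nu,\tau} - \frac{\mathbf{A}_{\nu,\tau}\mathbf{1}^{\top} x_{\nu,\tau}}{\mathbf{1}\mathbf{A}_{\nu,\tau}\mathbf{1}^{\top}} \right)^{H} \left( \mathbf{A}_{\nu,\tau} - \frac{\mathbf{A}_{\nu,\tau}\mathbf{1}^{\top}\mathbf{1}\mathbf{A}_{\nu,\tau}}{\mathbf{1}\mathbf{A}_{\nu,\tau}\mathbf{1}^{\top}} \right) \left( \mathbf{s}_{\nu,\tau} - \frac{\mathbf{A}_{\nu,\tau}\mathbf{1}^{\top} x_{\nu,\tau}}{\mathbf{1}\mathbf{A}_{\nu,\tau}\mathbf{1}^{\top}} \right) + \cdots \]

This is a multivariate normal distribution, as we can write

\[ p(\mathbf{S}|\mathbf{X},\mathbf{T},\mathbf{V}) = \prod_{\nu,\tau} \mathcal{N}\left( \mathbf{s}_{\nu,\tau}; \frac{\mathbf{A}_{\nu,\tau}\mathbf{1}^{\top} x_{\nu,\tau}}{\mathbf{1}\mathbf{A}_{\nu,\tau}\mathbf{1}^{\top}}, \; \mathbf{A}_{\nu,\tau} - \frac{\mathbf{A}_{\nu,\tau}\mathbf{1}^{\top}\mathbf{1}\mathbf{A}_{\nu,\tau}}{\mathbf{1}\mathbf{A}_{\nu,\tau}\mathbf{1}^{\top}} \right) = \prod_{\nu,\tau} \mathcal{N}(\mathbf{s}_{\nu,\tau}; \boldsymbol{\mu}_{\nu,\tau}, \boldsymbol{\Sigma}_{\nu,\tau}) \]

and the standard results for the sufficient statistics of the posterior are:

\[ \langle \mathbf{s}_{\nu,\tau} \rangle = \boldsymbol{\mu}_{\nu,\tau}, \qquad \langle \mathbf{s}_{\nu,\tau} \mathbf{s}^{H}_{\nu,\tau} \rangle = \boldsymbol{\mu}_{\nu,\tau} \boldsymbol{\mu}^{H}_{\nu,\tau} + \boldsymbol{\Sigma}_{\nu,\tau} \]

By defining a positive quantity, called the responsibility by Cemgil and Dikmen [2007],

\[ \kappa_{\nu,i,\tau} = \frac{t_{\nu,i} v_{i,\tau}}{\sum_{i'} t_{\nu,i'} v_{i',\tau}} \]

we can write the correlations, which are the diagonal elements of $\boldsymbol{\mu}_{\nu,\tau} \boldsymbol{\mu}^{H}_{\nu,\tau} + \boldsymbol{\Sigma}_{\nu,\tau}$, as

\[ \langle |s_{\nu,i,\tau}|^2 \rangle = t_{\nu,i} v_{i,\tau} (1 - \kappa_{\nu,i,\tau}) + \kappa^2_{\nu,i,\tau} |x_{\nu,\tau}|^2 \]

7.2.3 Maximization Step

In this section, we will present the M step as a coordinate ascent in $\mathbf{T}$ and $\mathbf{V}$. Other schemes such as gradient descent and Hessian based approaches are possible (see for example Dhillon and Sra [2006]), however they involve more computation and storage requirements.

The update rule for the templates is given by

\[ \frac{\partial}{\partial t_{\nu,i}} \langle \log p(\mathbf{X},\mathbf{S}|\mathbf{T},\mathbf{V}) \rangle = \frac{D}{2} \sum_{\tau} \left( \frac{\langle |s_{\nu,i,\tau}|^2 \rangle}{t^2_{\nu,i} v_{i,\tau}} - \frac{1}{t_{\nu,i}} \right) = 0 \]

\[ t_{\nu,i} = \frac{1}{T} \sum_{\tau} \frac{\langle |s_{\nu,i,\tau}|^2 \rangle}{v_{i,\tau}} \]

and the update rule for the excitations is given by:

\[ \frac{\partial}{\partial v_{i,\tau}} \langle \log p(\mathbf{X},\mathbf{S}|\mathbf{T},\mathbf{V}) \rangle = \frac{D}{2} \sum_{\nu} \left( \frac{\langle |s_{\nu,i,\tau}|^2 \rangle}{t_{\nu,i} v^2_{i,\tau}} - \frac{1}{v_{i,\tau}} \right) = 0 \]

\[ v_{i,\tau} = \frac{1}{F} \sum_{\nu} \frac{\langle |s_{\nu,i,\tau}|^2 \rangle}{t_{\nu,i}} \]

The summations in the update rules can be carried out by means of efficient matrix multiplications. Note that it is unnecessary (and also expensive) to calculate the complete sufficient statistics of the posterior over the latent sources. All that is required is the summations over frequency and time of the individual correlations (the diagonal elements of the covariance matrix).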
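The E and M steps of Sections 7.2.2 and 7.2.3 can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions rather than the implementation used for the experiments: the function names `log_likelihood` and `em_step`, the synthetic data, the matrix sizes and the number of iterations are all hypothetical, the number of frames is called `N` in the code to avoid a clash with the template matrix `T`, and for readability the correlations are stored in a full F × I × N array instead of being accumulated through matrix products as recommended above.

```python
import numpy as np

def log_likelihood(X2, T, V, D=1):
    """Marginal log-likelihood for squared magnitudes X2 and variances TV
    (D = 1 for real data, D = 2 for complex data)."""
    sigma2 = T @ V
    return np.sum(-0.5 * D * X2 / sigma2 - 0.5 * D * np.log(2 * np.pi * sigma2 / D))

def em_step(X2, T, V):
    """One EM iteration: E-step correlations, then coordinate updates of T and V."""
    P = T[:, :, None] * V[None, :, :]            # t_{nu,i} v_{i,tau}, shape (F, I, N)
    kappa = P / P.sum(axis=1, keepdims=True)     # responsibilities
    S2 = P * (1.0 - kappa) + kappa**2 * X2[:, None, :]   # <|s|^2>
    T_new = (S2 / V[None, :, :]).mean(axis=2)            # average over time
    V_new = (S2 / T_new[:, :, None]).mean(axis=0)        # average over frequency
    return T_new, V_new

# Synthetic real-valued data drawn from the model itself (illustration only).
rng = np.random.default_rng(1)
F, I, N = 6, 2, 40
T_true = rng.gamma(2.0, 1.0, (F, I))
V_true = rng.gamma(2.0, 1.0, (I, N))
X2 = rng.normal(0.0, np.sqrt(T_true @ V_true)) ** 2

T = rng.gamma(2.0, 1.0, (F, I)) + 0.1
V = rng.gamma(2.0, 1.0, (I, N)) + 0.1
lls = []
for _ in range(30):
    lls.append(log_likelihood(X2, T, V))
    T, V = em_step(X2, T, V)
print(lls[0], lls[-1])
```

Because each coordinate update maximizes the expected complete-data log-likelihood with the other factor held fixed, the recorded log-likelihood values should not decrease from one iteration to the next.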
7.3 Bayesian Hierarchical Model

To exploit the power of Bayesian inference, we place conjugate priors on the templates and excitations in the following hierarchical model.

\[
\begin{aligned}
t_{\nu,i} &\sim \mathcal{IG}(t_{\nu,i}; a^t_{\nu,i}, b^t_{\nu,i} a^t_{\nu,i}) \\
v_{i,\tau} &\sim \mathcal{IG}(v_{i,\tau}; a^v_{i,\tau}, b^v_{i,\tau} a^v_{i,\tau}) \\
s_{\nu,i,\tau} &\sim \mathcal{N}(s_{\nu,i,\tau}; 0, t_{\nu,i} v_{i,\tau}) \\
x_{\nu,\tau} &= \sum_i s_{\nu,i,\tau}
\end{aligned}
\tag{7.6}
\]

The inverse-gamma distribution (Figure 7.2 on page 106) is a conjugate prior to the variance of the normal distribution. This particular parametrization has the following interpretation: $\langle 1/t_{\nu,i} \rangle = 1/b^t_{\nu,i}$ and $\langle 1/v_{i,\tau} \rangle = 1/b^v_{i,\tau}$ under the prior, so the scale parameters approximately give the expected values of the templates and excitations. The standard deviation, in units of the scale parameter and for $a > 2$, is given by $\frac{a}{(a-1)\sqrt{a-2}}$, which decreases with $a$; hence the shape parameter $a$ can represent the sparsity of the representation. A high value of $a$ means a low standard deviation from the scale parameter, hence most of the coefficients have similar magnitudes, implying a full representation. A low value of $a$ means a high standard deviation from the scale parameter, hence most of the coefficients of the representation will be close to zero as shown in Figure 7.2 on page 106, favouring a sparse representation.

[Figure 7.2: The inverse-gamma distribution, $p(r) = \mathcal{IG}(r; a, a)$ for different $a$ ($a = 1, 2, 3$), and scale parameter $b = 1$. Horizontal axis: $r$; vertical axis: $P(r)$.]

The joint probability distribution of this model is given by

\[ p(\mathbf{X},\mathbf{S},\mathbf{T},\mathbf{V}) = p(\mathbf{X},\mathbf{S}|\mathbf{T},\mathbf{V})\, p(\mathbf{T})\, p(\mathbf{V}) \]

from which we can consider a number of inference tasks. These typically involve calculating the posterior $p(\mathbf{S},\mathbf{T},\mathbf{V}|\mathbf{X})$ and the marginal likelihood $p(\mathbf{X})$ given the hyperparameters (also known as the evidence).

7.3.1 Inference by Variational Bayes

The development of the Variational Bayes inference algorithm [Bishop, 2006, Ghahramani and Beal, 2001] is similar to the EM algorithm. Again we approximate the log marginal likelihood by means of an instrumental distribution:

\[ \mathcal{L}_X \equiv \log p(\mathbf{X}) = \log \int d\mathbf{S}\, d\mathbf{T}\, d\mathbf{V}\; p(\mathbf{X},\mathbf{S},\mathbf{T},\mathbf{V}) = \log \int d\mathbf{S}\, d\mathbf{T}\, d\mathbf{V}\; q(\mathbf{S},\mathbf{T},\mathbf{V}) \frac{p(\mathbf{X},\mathbf{S},\mathbf{T},\mathbf{V})}{q(\mathbf{S},\mathbf{T},\mathbf{V})} \]

\[ \geq \int d\mathbf{S}\, d\mathbf{T}\, d\mathbf{V}\; q(\mathbf{S},\mathbf{T},\mathbf{V}) \log \frac{p(\mathbf{X},\mathbf{S},\mathbf{T},\mathbf{V})}{q(\mathbf{S},\mathbf{T},\mathbf{V})} \equiv \mathcal{B}[q(\mathbf{S},\mathbf{T},\mathbf{V})] \]

The bound is tight when the instrumental distribution is equal to the posterior:

\[ q(\mathbf{S},\mathbf{T},\mathbf{V}) = p(\mathbf{S},\mathbf{T},\mathbf{V}|\mathbf{X}) \]

However the posterior distribution is intractable, so instead we assume a factorized form

\[ q(\mathbf{S},\mathbf{T},\mathbf{V}) = q(\mathbf{S})\, q(\mathbf{T})\, q(\mathbf{V}) = \Big( \prod_{\nu,\tau} q(\mathbf{s}_{\nu,\tau}) \Big) \Big( \prod_{\nu,i} q(t_{\nu,i}) \Big) \Big( \prod_{i,\tau} q(v_{i,\tau}) \Big) \]

This particular approximation is known as the mean field approximation. It can be shown that updating the sufficient statistics of $q(\mathbf{S})$, $q(\mathbf{T})$ or $q(\mathbf{V})$, holding both of the other distributions constant, leads to a monotonically increasing bound with each iteration: $\mathcal{B}[q(\mathbf{S},\mathbf{T},\mathbf{V})^{(n+1)}] \geq \mathcal{B}[q(\mathbf{S},\mathbf{T},\mathbf{V})^{(n)}]$.

To illustrate the similarity between VB and EM, we will choose the following ordering of update rules. The approximate E-step is:

\[ q(\mathbf{S})^{(n)} \propto \exp\left( \langle \log p(\mathbf{X},\mathbf{S},\mathbf{T},\mathbf{V}) \rangle_{q(\mathbf{T})^{(n-1)} q(\mathbf{V})^{(n-1)}} \right) \]

then the approximate M-step involves iterating

\[ q(\mathbf{T})^{(n+k)} \propto \exp\left( \langle \log p(\mathbf{X},\mathbf{S},\mathbf{T},\mathbf{V}) \rangle_{q(\mathbf{S})^{(n)} q(\mathbf{V})^{(n+k-1)}} \right) \]

\[ q(\mathbf{V})^{(n+k)} \propto \exp\left( \langle \log p(\mathbf{X},\mathbf{S},\mathbf{T},\mathbf{V}) \rangle_{q(\mathbf{S})^{(n)} q(\mathbf{T})^{(n+k)}} \right) \]

for $k = 1, \ldots, K$ until convergence as defined in 7.3.1.2.

An alternative means of Bayesian model selection for the Gaussian variance model with half-normal priors on the factor matrices is investigated by Févotte [2010] using a model selection criterion. In Févotte and Cemgil [2009] Bayesian model selection using both Variational Bayes and MCMC is sketched for a number of NMF models, including the Gaussian variance model.
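7.3.1.1 Variational update equations and sufficient statistics

The update rule for the expectation step follows from (7.5):

\[ q(\mathbf{s}_{\nu,\tau}) \propto \exp\left( \sum_i \left( -\frac{D}{2} \langle t^{-1}_{\nu,i} \rangle \langle v^{-1}_{i,\tau} \rangle |s_{\nu,i,\tau}|^2 \right) + \frac{D}{2} \frac{\left|\sum_i s_{\nu,i,\tau}\right|^2}{\sum_i \left( \langle t^{-1}_{\nu,i} \rangle \langle v^{-1}_{i,\tau} \rangle \right)^{-1}} \right) \]

The calculation is the same as for the E-step of the EM algorithm, but the covariance matrix $\mathbf{A}_{\nu,\tau}$ has elements $\left( \langle t^{-1}_{\nu,i} \rangle \langle v^{-1}_{i,\tau} \rangle \right)^{-1}$ along the diagonal. The responsibilities, formed from these diagonal elements in the same way as in the EM algorithm, are

\[ \kappa_{\nu,i,\tau} = \frac{\left( \langle t^{-1}_{\nu,i} \rangle \langle v^{-1}_{i,\tau} \rangle \right)^{-1}}{\sum_{i'} \left( \langle t^{-1}_{\nu,i'} \rangle \langle v^{-1}_{i',\tau} \rangle \right)^{-1}} \]

and the correlations that we need for the M-step are

\[ \langle |s_{\nu,i,\tau}|^2 \rangle = \left( \langle t^{-1}_{\nu,i} \rangle \langle v^{-1}_{i,\tau} \rangle \right)^{-1} (1 - \kappa_{\nu,i,\tau}) + \kappa^2_{\nu,i,\tau} |x_{\nu,\tau}|^2 \tag{7.7} \]

The update equations and sufficient statistics for the templates and excitations follow from the properties of the inverse-gamma distribution:

\[
\begin{aligned}
q(t_{\nu,i}) &\propto \exp\left( -\Big(a^t_{\nu,i} + \frac{DT}{2} + 1\Big) \log t_{\nu,i} - \Big(a^t_{\nu,i} b^t_{\nu,i} + \frac{D}{2} \sum_{\tau} \langle |s_{\nu,i,\tau}|^2 \rangle \langle v^{-1}_{i,\tau} \rangle\Big)\, t^{-1}_{\nu,i} \right) \\
&\propto \mathcal{IG}(t_{\nu,i}; \alpha^t_{\nu,i}, \beta^t_{\nu,i}) \\
\alpha^t_{\nu,i} &= a^t_{\nu,i} + \frac{DT}{2}, \qquad \beta^t_{\nu,i} = a^t_{\nu,i} b^t_{\nu,i} + \frac{D}{2} \sum_{\tau} \langle |s_{\nu,i,\tau}|^2 \rangle \langle v^{-1}_{i,\tau} \rangle \\
\langle t^{-1}_{\nu,i} \rangle &= \frac{\alpha^t_{\nu,i}}{\beta^t_{\nu,i}}, \qquad \langle \log t_{\nu,i} \rangle = -\Psi(\alpha^t_{\nu,i}) + \log \beta^t_{\nu,i}
\end{aligned}
\tag{7.8}
\]

\[
\begin{aligned}
q(v_{i,\tau}) &\propto \exp\left( -\Big(a^v_{i,\tau} + \frac{DF}{2} + 1\Big) \log v_{i,\tau} - \Big(a^v_{i,\tau} b^v_{i,\tau} + \frac{D}{2} \sum_{\nu} \langle |s_{\nu,i,\tau}|^2 \rangle \langle t^{-1}_{\nu,i} \rangle\Big)\, v^{-1}_{i,\tau} \right) \\
&\propto \mathcal{IG}(v_{i,\tau}; \alpha^v_{i,\tau}, \beta^v_{i,\tau}) \\
\alpha^v_{i,\tau} &= a^v_{i,\tau} + \frac{DF}{2}, \qquad \beta^v_{i,\tau} = a^v_{i,\tau} b^v_{i,\tau} + \frac{D}{2} \sum_{\nu} \langle |s_{\nu,i,\tau}|^2 \rangle \langle t^{-1}_{\nu,i} \rangle \\
\langle v^{-1}_{i,\tau} \rangle &= \frac{\alpha^v_{i,\tau}}{\beta^v_{i,\tau}}, \qquad \langle \log v_{i,\tau} \rangle = -\Psi(\alpha^v_{i,\tau}) + \log \beta^v_{i,\tau}
\end{aligned}
\tag{7.9}
\]

We retain the attractiveness of being able to perform these update equations as matrix operations. Note that expensive evaluations of the digamma function $\Psi(\alpha)$ can be precomputed, as the posterior shape parameters are constant.

7.3.1.2 The Variational Bound

The variational bound is a lower bound on the marginal log likelihood and can also be used to define convergence for the E and M steps.

\[ \mathcal{B}[q(\mathbf{S},\mathbf{T},\mathbf{V})] = \langle \log p(\mathbf{X},\mathbf{S},\mathbf{T},\mathbf{V}) \rangle_q + H[q(\mathbf{S},\mathbf{T},\mathbf{V})] \]

$H[q(\mathbf{S},\mathbf{T},\mathbf{V})]$ denotes the entropy of the variational distribution $q$, which is defined as $-\langle \log q(\mathbf{S},\mathbf{T},\mathbf{V}) \rangle_q$. The following is the complete expression for the variational bound immediately after the E-step.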
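The first two rows are the combined energy and entropy of the latent sources; the label at the end of each row names the term that the row contributes.

\[
\begin{aligned}
\mathcal{B}[q(\mathbf{S},\mathbf{T},\mathbf{V})] = {}
& \sum_{\nu,\tau} \left( -\frac{D}{2} \Big( \sum_i \langle t^{-1}_{\nu,i} \rangle^{-1} \langle v^{-1}_{i,\tau} \rangle^{-1} \Big)^{-1} |x_{\nu,\tau}|^2 - \frac{D}{2} \log \frac{2\pi}{D} - \frac{D}{2} \log \Big( \sum_i \langle t^{-1}_{\nu,i} \rangle^{-1} \langle v^{-1}_{i,\tau} \rangle^{-1} \Big) \right) \\
& + \frac{D}{2} \sum_{\nu,i,\tau} \left( -\langle \log t_{\nu,i} \rangle - \log \langle t^{-1}_{\nu,i} \rangle - \langle \log v_{i,\tau} \rangle - \log \langle v^{-1}_{i,\tau} \rangle \right) && F[\mathbf{S}] + H[q(\mathbf{S})] \\
& + \sum_{\nu,i} \left( -(a^t_{\nu,i} + 1) \langle \log t_{\nu,i} \rangle - a^t_{\nu,i} b^t_{\nu,i} \langle t^{-1}_{\nu,i} \rangle + a^t_{\nu,i} \log(a^t_{\nu,i} b^t_{\nu,i}) - \log \Gamma(a^t_{\nu,i}) \right) && F[\mathbf{T}] \\
& - \sum_{\nu,i} \left( -(\alpha^t_{\nu,i} + 1) \langle \log t_{\nu,i} \rangle - \beta^t_{\nu,i} \langle t^{-1}_{\nu,i} \rangle + \alpha^t_{\nu,i} \log \beta^t_{\nu,i} - \log \Gamma(\alpha^t_{\nu,i}) \right) && H[q(\mathbf{T})] \\
& + \sum_{\tau,i} \left( -(a^v_{i,\tau} + 1) \langle \log v_{i,\tau} \rangle - a^v_{i,\tau} b^v_{i,\tau} \langle v^{-1}_{i,\tau} \rangle + a^v_{i,\tau} \log(a^v_{i,\tau} b^v_{i,\tau}) - \log \Gamma(a^v_{i,\tau}) \right) && F[\mathbf{V}] \\
& - \sum_{\tau,i} \left( -(\alpha^v_{i,\tau} + 1) \langle \log v_{i,\tau} \rangle - \beta^v_{i,\tau} \langle v^{-1}_{i,\tau} \rangle + \alpha^v_{i,\tau} \log \beta^v_{i,\tau} - \log \Gamma(\alpha^v_{i,\tau}) \right) && H[q(\mathbf{V})]
\end{aligned}
\tag{7.10}
\]

The `energy' notation used to label the summations here is:

\[ F[\mathbf{S}] = \langle \log p(\mathbf{X},\mathbf{S}|\mathbf{T},\mathbf{V}) \rangle_q, \qquad F[\mathbf{T}] = \langle \log p(\mathbf{T}) \rangle_q, \qquad F[\mathbf{V}] = \langle \log p(\mathbf{V}) \rangle_q \]

The calculation of the variational bound can be implemented efficiently using matrix operations, just as with the update equations. Expensive evaluations of $\log \Gamma(\alpha)$ can be precomputed.

Once the templates or excitations are updated in the M-step, the variational bound cannot be derived in closed form. However, during the M-step, $H[q(\mathbf{S})]$ remains constant, so we do not need to take it into account when determining whether the M-step iterations have converged. This is convenient because $H[q(\mathbf{S})]$ is not straightforward to derive in isolation from $F[\mathbf{S}]$. We therefore need to confirm after each update in the M-step that the quantity

\[ F[\mathbf{S}] + F[\mathbf{T}] + H[q(\mathbf{T})] + F[\mathbf{V}] + H[q(\mathbf{V})] \tag{7.11} \]

increases, otherwise we perform another E-step at this stage (see Algorithm 7.1). The values for $F[\mathbf{T}]$, $H[q(\mathbf{T})]$, $F[\mathbf{V}]$ and $H[q(\mathbf{V})]$ are as in (7.10), however $F[\mathbf{S}]$ during the M-step is given by

\[ F[\mathbf{S}] = \sum_{\nu} \sum_{\tau} \sum_i \left( -\frac{D}{2} \langle t^{-1}_{\nu,i} \rangle \langle v^{-1}_{i,\tau} \rangle \langle |s_{\nu,i,\tau}|^2 \rangle - \frac{D}{2} \log \frac{2\pi}{D} - \frac{D}{2} \langle \log t_{\nu,i} \rangle - \frac{D}{2} \langle \log v_{i,\tau} \rangle \right) \]

where $\langle |s_{\nu,i,\tau}|^2 \rangle$ is given by (7.7).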
$F[\mathbf{S}]$ may be calculated efficiently using matrix update equations from the sufficient statistics of $\mathbf{T}$ and $\mathbf{V}$ immediately prior to their being updated during the M-step.

7.3.1.3 Hyperparameter Optimization

Hyperparameter optimization involves maximizing the variational bound with respect to the hyperparameters. The components of the variational bound that are relevant to this maximization are $F[\mathbf{T}]$ for the template hyperparameters, and $F[\mathbf{V}]$ for the excitation hyperparameters.¹ The resulting expressions for optimization are very similar to expressions for finding maximum likelihood estimates of the parameters of an inverse-gamma distribution.

Hyperparameter optimization is used for training the priors of the template and excitation matrices using labelled data. The hyperparameters will be tied over some subset of the elements of the template or excitation factorization matrices. For example, we typically do not know a priori the length of time over which we will observe data for the model. For the model to be identifiable, we are forced to tie the excitation hyperparameters across the rows of the excitation matrix.

Because of the numerous possibilities for how we set up the optimization, we will only outline the process here for a single set of parameters $\mathcal{M}$ with variances $\{r_m\}$ over which either the shape parameter or the scale parameter is tied. This allows the shape parameters and the scale parameters to be tied over different subsets of the variances. For example, we might want to have a global shape parameter for all of the template variances, but a separate scale parameter for each column of the template matrix.

¹ In (7.10) the entropy expression $H[q(\mathbf{T})]$ is not directly dependent on the values of the hyperparameters, but is dependent only through the variational distribution $q(\mathbf{T})$, which is updated during the M-step for the templates. The same is true for the excitation hyperparameters.
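Therefore the entropy expressions are not used to optimize the hyperparameters in this section.

To optimize a single scale parameter $b_{\mathcal{M}}$ which is tied over the variances $\{r_m\}$ with corresponding shape parameters $\{a_m\}$, we maximize the following expression

$$\mathcal{B}(b_{\mathcal{M}}) = \sum_{m \in \mathcal{M}} \left( -(a_m + 1) \langle \log r_m \rangle - a_m b_{\mathcal{M}} \langle r_m^{-1} \rangle + a_m \log(a_m b_{\mathcal{M}}) - \log \Gamma(a_m) \right)$$

by setting the derivative to zero, i.e.

$$\frac{\partial \mathcal{B}}{\partial b_{\mathcal{M}}} = \sum_{m \in \mathcal{M}} \left( -a_m \langle r_m^{-1} \rangle + \frac{a_m}{b_{\mathcal{M}}} \right) = 0 \tag{7.12}$$

giving an update rule:

$$b_{\mathcal{M}} \leftarrow \frac{\sum_{\mathcal{M}} a_m}{\sum_{\mathcal{M}} a_m \langle r_m^{-1} \rangle} \tag{7.13}$$

To optimize a single shape parameter $a_{\mathcal{M}}$ which is tied over the variances $\{r_m\}$ with corresponding scale parameters $\{b_m\}$, we maximize the following expression

$$\mathcal{B}(a_{\mathcal{M}}) = \sum_{m \in \mathcal{M}} \left( -(a_{\mathcal{M}} + 1) \langle \log r_m \rangle - a_{\mathcal{M}} b_m \langle r_m^{-1} \rangle + a_{\mathcal{M}} \log(a_{\mathcal{M}} b_m) - \log \Gamma(a_{\mathcal{M}}) \right)$$

by setting the derivative to zero, i.e.

$$\frac{\partial \mathcal{B}}{\partial a_{\mathcal{M}}} = \sum_{m \in \mathcal{M}} \left( -\langle \log r_m \rangle - b_m \langle r_m^{-1} \rangle + \log a_{\mathcal{M}} + 1 + \log b_m - \Psi(a_{\mathcal{M}}) \right) = 0$$

giving the following expression:

$$\log a_{\mathcal{M}} - \Psi(a_{\mathcal{M}}) = \frac{1}{|\mathcal{M}|} \sum_{m \in \mathcal{M}} \left( \langle \log r_m \rangle + b_m \langle r_m^{-1} \rangle - 1 - \log b_m \right)$$

Equations of this form appear in the maximum likelihood estimate of Gamma distributions. $a_{\mathcal{M}}$ can be found by Newton iterations. Denote the right hand side as $c$; then the following is the Newton-Raphson update:

$$a_{\mathcal{M}} \leftarrow a_{\mathcal{M}} - \frac{\log a_{\mathcal{M}} - \Psi(a_{\mathcal{M}}) - c}{1/a_{\mathcal{M}} - \Psi'(a_{\mathcal{M}})} \tag{7.14}$$

Hyperparameter optimization is performed after the variational bound no longer increases through the E-step and M-step updates. The entire structure of the algorithm is described in Algorithm 7.1.

Algorithm 7.1 Variational Bayes for the Gaussian variance model, with hyperparameter optimization

• Sample point estimates $t_{\nu,i}$ and $v_{i,\tau}$ from the priors (7.6)
• Initialize sufficient statistics $\langle t_{\nu,i}^{-1} \rangle \leftarrow (t_{\nu,i})^{-1}$, $\langle \log t_{\nu,i} \rangle \leftarrow \log t_{\nu,i}$, $\langle v_{i,\tau}^{-1} \rangle \leftarrow (v_{i,\tau})^{-1}$, $\langle \log v_{i,\tau} \rangle \leftarrow \log v_{i,\tau}$ from the point estimates
• Calculate the variational bound $\mathcal{B}[q(\mathbf{S},\mathbf{T},\mathbf{V})]^{(0)}$ from (7.10)
• for n = 1, . . .
  - for k = 1, . . .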
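    ∗ Update the template sufficient statistics $\langle t_{\nu,i}^{-1} \rangle$, $\langle \log t_{\nu,i} \rangle$ (7.8)
    ∗ Update the excitation sufficient statistics $\langle v_{i,\tau}^{-1} \rangle$, $\langle \log v_{i,\tau} \rangle$ (7.9)
    ∗ If the quantity (7.11) increases, then continue, otherwise break
  - Calculate the variational bound $\mathcal{B}[q(\mathbf{S},\mathbf{T},\mathbf{V})]^{(n)}$
  - If $\mathcal{B}[q(\mathbf{S},\mathbf{T},\mathbf{V})]^{(n)} = \mathcal{B}[q(\mathbf{S},\mathbf{T},\mathbf{V})]^{(n-1)}$ then
    ∗ update each tied shape parameter using (7.14)
    ∗ update each tied scale parameter using (7.13)
    ∗ If the updates result in no changes to the hyperparameters, exit the algorithm

7.3.2 Markov Chain Monte-Carlo

7.3.2.1 Gibbs Sampler

A possible Gibbs sampler for the model involves sampling the blocks:

$$\begin{aligned}
\mathbf{S}^{(n+1)} &\sim p(\mathbf{S}|\mathbf{X},\mathbf{T}^{(n)},\mathbf{V}^{(n)}) \\
\mathbf{T}^{(n+1)} &\sim p(\mathbf{T}|\mathbf{S}^{(n+1)},\mathbf{V}^{(n)}) \\
\mathbf{V}^{(n+1)} &\sim p(\mathbf{V}|\mathbf{S}^{(n+1)},\mathbf{T}^{(n+1)})
\end{aligned}$$

The marginal likelihood is estimated by Chib's method [Chib, 1995] around the mode $\{\mathbf{S}^*,\mathbf{T}^*,\mathbf{V}^*\}$ found by running the Gibbs sampler. The marginal likelihood is decomposed as:

$$\log p(\mathbf{X}) = \log p(\mathbf{X},\mathbf{S}^*,\mathbf{T}^*,\mathbf{V}^*) - \log p(\mathbf{S}^*|\mathbf{X}) - \log p(\mathbf{T}^*|\mathbf{S}^*) - \log p(\mathbf{V}^*|\mathbf{S}^*,\mathbf{T}^*)$$

The second term is found by the Monte-Carlo estimate:

$$p(\mathbf{S}^*|\mathbf{X}) \approx \frac{1}{N} \sum_{n=1}^{N} p(\mathbf{S}^*|\mathbf{X},\mathbf{T}^{(n)},\mathbf{V}^{(n)})$$

with $\{\mathbf{T}^{(n)},\mathbf{V}^{(n)}\}$ returned by the Gibbs sampler. The third term is found by the Monte-Carlo estimate:

$$p(\mathbf{T}^*|\mathbf{S}^*) \approx \frac{1}{M} \sum_{m=1}^{M} p(\mathbf{T}^*|\mathbf{S}^*,\mathbf{V}^{(m)})$$

requiring a further $M$ samples from the reduced Gibbs sampler, clamping $\mathbf{S} = \mathbf{S}^*$:

$$\begin{aligned}
\mathbf{T}^{(m+1)} &\sim p(\mathbf{T}|\mathbf{S}^*,\mathbf{V}^{(m)}) \\
\mathbf{V}^{(m+1)} &\sim p(\mathbf{V}|\mathbf{S}^*,\mathbf{T}^{(m+1)})
\end{aligned}$$

Sampling Latent Sources

The distribution $p(s_{\nu,1:I,\tau}|x_{\nu,\tau}, t_{\nu,1:I}, v_{1:I,\tau})$ is a degenerate multivariate normal distribution, because $p(x_{\nu,\tau}|s_{\nu,1:I,\tau})$ is itself degenerate. As the covariance matrix is not positive definite, we cannot form the Cholesky factor and therefore cannot sample from this distribution directly. Instead, we sample from the reduced distribution $p(s_{\nu,2:I,\tau}|x_{\nu,\tau}, t_{\nu,2:I}, v_{2:I,\tau})$ and then set $s_{\nu,1,\tau} = x_{\nu,\tau} - \sum_{i=2}^{I} s_{\nu,i,\tau}$. In the rest of this section, we drop the subscripts $\nu, \tau$ for brevity. First observe that

$$p(x|s_{2:I}) = \mathcal{N}(x; \mathbf{1} s_{2:I}, t_1 v_1)$$

where $\mathbf{1}$ is an $(I - 1)$ row vector of ones.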
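It follows that the posterior is

$$\begin{aligned}
\log p(s_{2:I}|x, t_{2:I}, v_{2:I}) &= -\frac{D}{2} \frac{|x - \mathbf{1} s_{2:I}|^2}{t_1 v_1} - \frac{D}{2} s_{2:I}^H A^{-1} s_{2:I} + \dots \\
&= \frac{D}{t_1 v_1} s_{2:I}^H \mathbf{1}^\top x - \frac{D}{2} \mathrm{Tr}\left( \left( \frac{1}{t_1 v_1} \mathbf{1}^\top \mathbf{1} + A^{-1} \right) s_{2:I} s_{2:I}^H \right) + \dots
\end{aligned}$$

where $A$ is a diagonal matrix with elements $t_i v_i$, $i = 2, \dots, I$. The posterior is therefore a multivariate normal with covariance matrix and mean

$$\Sigma = \left( \frac{1}{t_1 v_1} \mathbf{1}^\top \mathbf{1} + A^{-1} \right)^{-1} = A - \frac{A \mathbf{1}^\top \mathbf{1} A}{t_1 v_1 + \mathbf{1} A \mathbf{1}^\top} \qquad \mu = \frac{1}{t_1 v_1} \Sigma \mathbf{1}^\top x$$

Note that the covariance matrix is formed by downdating the diagonal matrix $A$ with the vector $\mathbf{1}A$ scaled by $(t_1 v_1 + \mathbf{1} A \mathbf{1}^\top)^{-1}$. The Cholesky factorization of the covariance matrix can be computed more efficiently by downdating the Cholesky factor of $A$ than by calculating the full factorization. The Cholesky factor of $A$ is a diagonal matrix with elements $\sqrt{t_i v_i}$, $i = 2, \dots, I$.

Single Source

The following is a description of the Gibbs sampler for the single source case, where the source is directly observed: $\mathbf{S} = \mathbf{X}$. We mention this as a special case because the previous expressions for estimating the marginal likelihood using Chib's method do not apply here. The algorithm iterates

$$\begin{aligned}
\mathbf{T}^{(n+1)} &\sim p(\mathbf{T}|\mathbf{X},\mathbf{V}^{(n)}) \\
\mathbf{V}^{(n+1)} &\sim p(\mathbf{V}|\mathbf{X},\mathbf{T}^{(n+1)})
\end{aligned}$$

and the marginal likelihood at the mode $\{\mathbf{T}^*,\mathbf{V}^*\}$ found from the above run is:

$$\log p(\mathbf{X}) = \log p(\mathbf{X},\mathbf{T}^*,\mathbf{V}^*) - \log p(\mathbf{T}^*|\mathbf{X}) - \log p(\mathbf{V}^*|\mathbf{X},\mathbf{T}^*)$$

and the second term is found by the Monte-Carlo estimate

$$p(\mathbf{T}^*|\mathbf{X}) \approx \frac{1}{N} \sum_{n=1}^{N} p(\mathbf{T}^*|\mathbf{X},\mathbf{V}^{(n)})$$

7.3.2.2 Metropolis-Hastings

The particular scheme we propose here marginalizes the latent sources, and attempts to draw samples from the posterior $p(\mathbf{T},\mathbf{V}|\mathbf{X})$ directly by constructing a Markov chain which draws samples from $p(\mathbf{T}|\mathbf{X},\mathbf{V})$ and $p(\mathbf{V}|\mathbf{X},\mathbf{T})$ in sequence. However, these distributions cannot be sampled directly, so we resort to a Metropolis-Hastings (MH) algorithm. Note that the posterior can be written as:

$$p(\mathbf{T},\mathbf{V}|\mathbf{X}) = \frac{1}{p(\mathbf{X})} p(\mathbf{X},\mathbf{T},\mathbf{V}) \propto p(\mathbf{X}|\mathbf{T},\mathbf{V}) p(\mathbf{T}) p(\mathbf{V})$$

$p(\mathbf{X}|\mathbf{T},\mathbf{V})$ is already given in (7.2). The MH algorithm requires a proposal density with the same support as the posterior distribution being sampled.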
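We suggest using the inverse-gamma distributions $q(\mathbf{T})$ and $q(\mathbf{V})$ in (7.8) and (7.9) as proposal distributions for the template and excitation parameters respectively, substituting the sufficient statistics with the current point estimates. These are suitable for inferring the posterior $p(\mathbf{T},\mathbf{V}|\mathbf{X})$ as the method used to derive them also involved marginalizing the latent sources (the E-step of the Variational Bayes algorithm). For the MH algorithm we denote these proposal densities as $q(\mathbf{T},\mathbf{T}'|\mathbf{X},\mathbf{V})$, the probability of moving from $\mathbf{T}$ to $\mathbf{T}'$. The MH algorithm simply involves iterating between the moves

$$\begin{aligned}
\mathbf{T}' &\sim q(\mathbf{T}^{(n)},\mathbf{T}'|\mathbf{X},\mathbf{V}^{(n)}) &
\mathbf{T}^{(n+1)} &= \begin{cases} \mathbf{T}' & \text{if } u \le \alpha(\mathbf{T}^{(n)},\mathbf{T}'|\mathbf{X},\mathbf{V}^{(n)}),\ u \sim U(0,1) \\ \mathbf{T}^{(n)} & \text{otherwise} \end{cases} \\
\mathbf{V}' &\sim q(\mathbf{V}^{(n)},\mathbf{V}'|\mathbf{X},\mathbf{T}^{(n+1)}) &
\mathbf{V}^{(n+1)} &= \begin{cases} \mathbf{V}' & \text{if } u \le \alpha(\mathbf{V}^{(n)},\mathbf{V}'|\mathbf{X},\mathbf{T}^{(n+1)}),\ u \sim U(0,1) \\ \mathbf{V}^{(n)} & \text{otherwise} \end{cases}
\end{aligned}$$

where the acceptance ratios are given by

$$\begin{aligned}
\alpha(\mathbf{T},\mathbf{T}'|\mathbf{X},\mathbf{V}) &= \min\left\{ 1, \frac{p(\mathbf{X},\mathbf{T}',\mathbf{V})}{p(\mathbf{X},\mathbf{T},\mathbf{V})} \frac{q(\mathbf{T}',\mathbf{T}|\mathbf{X},\mathbf{V})}{q(\mathbf{T},\mathbf{T}'|\mathbf{X},\mathbf{V})} \right\} \\
\alpha(\mathbf{V},\mathbf{V}'|\mathbf{X},\mathbf{T}) &= \min\left\{ 1, \frac{p(\mathbf{X},\mathbf{T},\mathbf{V}')}{p(\mathbf{X},\mathbf{T},\mathbf{V})} \frac{q(\mathbf{V}',\mathbf{V}|\mathbf{X},\mathbf{T})}{q(\mathbf{V},\mathbf{V}'|\mathbf{X},\mathbf{T})} \right\}
\end{aligned}$$

We have found in practice that the acceptance rate using these well-formed proposal distributions is very high (approximately 90%) when sampling the entire template and excitation matrices.

An extension of Chib's method for evaluating the marginal likelihood from the MH output [Chib and Jeliazkov, 2001] is outlined below. At any point $\{\mathbf{T}^*,\mathbf{V}^*\}$ the following holds:

$$\log p(\mathbf{X}) = \log p(\mathbf{X},\mathbf{T}^*,\mathbf{V}^*) - \log p(\mathbf{T}^*|\mathbf{X}) - \log p(\mathbf{V}^*|\mathbf{X},\mathbf{T}^*)$$

The posterior ordinates are given by

$$\begin{aligned}
p(\mathbf{T}^*|\mathbf{X}) &= \frac{\langle \alpha(\mathbf{T},\mathbf{T}^*|\mathbf{X},\mathbf{V}) \, q(\mathbf{T},\mathbf{T}^*|\mathbf{X},\mathbf{V}) \rangle_{p(\mathbf{T},\mathbf{V}|\mathbf{X})}}{\langle \alpha(\mathbf{T}^*,\mathbf{T}|\mathbf{X},\mathbf{V}) \rangle_{p(\mathbf{V}|\mathbf{X},\mathbf{T}^*) \, q(\mathbf{T}^*,\mathbf{T}|\mathbf{X},\mathbf{V})}} \\
p(\mathbf{V}^*|\mathbf{X},\mathbf{T}^*) &= \frac{\langle \alpha(\mathbf{V},\mathbf{V}^*|\mathbf{X},\mathbf{T}^*) \, q(\mathbf{V},\mathbf{V}^*|\mathbf{X},\mathbf{T}^*) \rangle_{p(\mathbf{V}|\mathbf{X},\mathbf{T}^*)}}{\langle \alpha(\mathbf{V}^*,\mathbf{V}|\mathbf{X},\mathbf{T}^*) \rangle_{q(\mathbf{V}^*,\mathbf{V}|\mathbf{X},\mathbf{T}^*)}}
\end{aligned}$$

for which Monte-Carlo estimates are given by

$$\begin{aligned}
p(\mathbf{T}^*|\mathbf{X}) &\approx \frac{M^{-1} \sum_{m=1}^{M} \alpha(\mathbf{T}^{(m)},\mathbf{T}^*|\mathbf{X},\mathbf{V}^{(m)}) \, q(\mathbf{T}^{(m)},\mathbf{T}^*|\mathbf{X},\mathbf{V}^{(m)})}{J^{-1} \sum_{j=1}^{J} \alpha(\mathbf{T}^*,\mathbf{T}^{(j)}|\mathbf{X},\mathbf{V}^{(j)})} \\
p(\mathbf{V}^*|\mathbf{X},\mathbf{T}^*) &\approx \frac{J^{-1} \sum_{j=1}^{J} \alpha(\mathbf{V}^{(j)},\mathbf{V}^*|\mathbf{X},\mathbf{T}^*) \, q(\mathbf{V}^{(j)},\mathbf{V}^*|\mathbf{X},\mathbf{T}^*)}{G^{-1} \sum_{g=1}^{G} \alpha(\mathbf{V}^*,\mathbf{V}^{(g)}|\mathbf{X},\mathbf{T}^*)}
\end{aligned}$$

To calculate these ordinates requires three sampling runs:

1. Sample $\{\mathbf{T}^{(m)},\mathbf{V}^{(m)}\} \sim p(\mathbf{T},\mathbf{V}|\mathbf{X})$, $m = 1, \dots, M$ by the MH algorithm.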
Select $\{\mathbf{T}^*,\mathbf{V}^*\}$ as the mode found by this sampling run for the best estimate of the marginal likelihood.
2. Sample $\{\mathbf{V}^{(j)}\} \sim p(\mathbf{V}|\mathbf{X},\mathbf{T}^*)$, $j = 1, \dots, J$, also generating $\{\mathbf{T}^{(j)}\} \sim q(\mathbf{T}^*,\mathbf{T}^{(j)}|\mathbf{X},\mathbf{V}^{(j)})$ after each step. This is carried out by running the MH algorithm, but rejecting all moves $\mathbf{T}^* \to \mathbf{T}^{(j)}$.
3. Sample $\mathbf{V}^{(g)} \sim q(\mathbf{V}^*,\mathbf{V}^{(g)}|\mathbf{X},\mathbf{T}^*)$, $g = 1, \dots, G$.

The above Metropolis-Hastings algorithm is also straightforward to apply to the Bayesian NMF model in Cemgil [2008], circumventing the multinomial sampling required for the Gibbs sampler.

7.3.2.3 Hyperparameter Optimization

Hyperparameter optimization using MCMC schemes can be carried out in a number of ways. In analogy to maximizing the variational bound as considered in Section 7.3.1.3, the Markov chain can be run for a number of iterations, and then the hyperparameters estimated from the sample statistics of the chain. However, after this step, the normalization constant $p(\mathbf{X})$ has increased, and the chain is not valid for the new hyperparameters. Either the chain has to be discarded, which is wasteful, or the samples have to be re-weighted. Re-weighting using reverse logistic regression is discussed in Geyer [1991].

A simpler method involves extending the MCMC scheme to sample the hyperparameters based on their likelihood function, i.e., we sample from the posterior of the hyperparameters assuming flat priors, which was the case with the variational procedure. The use of flat (improper) priors does not create problems, because the marginal likelihood is calculated conditional on the hyperparameters; we are not integrating them out. We use the same notation as in Section 7.3.1.3 to denote any tying structure on the hyperparameters.

The posterior distribution of the scale parameter (see (7.12)) is Gamma (Section A.2):

$$p(b_{\mathcal{M}}|\{a_m, r_m : m \in \mathcal{M}\}) \propto \exp\left( -b_{\mathcal{M}} \sum_{m \in \mathcal{M}} \frac{a_m}{r_m} + \log b_{\mathcal{M}} \sum_{m \in \mathcal{M}} a_m \right) = \mathcal{G}\left( b_{\mathcal{M}}; 1 + \sum_{m \in \mathcal{M}} a_m, \sum_{m \in \mathcal{M}} \frac{a_m}{r_m} \right)$$

which means a Gibbs sampler step can be used to update the scale parameters.
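The Gibbs step for a tied scale parameter can be sketched in a few lines of Python. This is a minimal illustration rather than the implementation used in this work: the function names are hypothetical, and we assume that the second argument of the Gamma density above is a rate (inverse scale), which is what the exponent $-b_{\mathcal{M}} \sum_m a_m / r_m$ suggests.

```python
import random


def scale_posterior_params(a, r):
    """Return (shape, rate) of the Gamma posterior of a tied scale b_M.

    a, r: shape parameters a_m and sampled variances r_m for m in M.
    Assumption: the posterior is Gamma(shape = 1 + sum a_m, rate = sum a_m / r_m).
    """
    shape = 1.0 + sum(a)
    rate = sum(a_m / r_m for a_m, r_m in zip(a, r))
    return shape, rate


def sample_tied_scale(a, r, rng=random):
    """One Gibbs draw of b_M from its Gamma conditional posterior."""
    shape, rate = scale_posterior_params(a, r)
    # random.gammavariate is parameterised by (shape, scale), so invert the rate.
    return rng.gammavariate(shape, 1.0 / rate)
```

Passing a seeded `random.Random` instance as `rng` makes a run reproducible; the draw would replace the current value of the tied scale before the next sweep over the templates and excitations.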
The posterior distribution of the shape parameter is:

$$p(a_{\mathcal{M}}|\{b_m, r_m : m \in \mathcal{M}\}) \propto \exp\left( -a_{\mathcal{M}} \sum_{m \in \mathcal{M}} \left( \log r_m + \frac{b_m}{r_m} - \log b_m \right) + |\mathcal{M}| \left( a_{\mathcal{M}} \log a_{\mathcal{M}} - \log \Gamma(a_{\mathcal{M}}) \right) \right)$$

which is not a standard distribution. Sampling from this requires a Metropolis-Hastings step.

7.3.3 Importance Sampling

Importance sampling is not suitable for practical applications of the Gaussian variance model because of the high dimensionality of the posterior. It involves computing a Monte-Carlo estimate using samples from the prior $p(\mathbf{T},\mathbf{V})$. Without any iterations that perform some degree of source separation to update the columns of $\mathbf{T}$ and the rows of $\mathbf{V}$, almost all of the samples drawn from the prior are far from the mode and the marginal likelihood is severely underestimated. However, importance sampling may be used for single-source examples to confirm values of the marginal likelihood calculated by the other methods.

The marginal likelihood can be written as the expected value of the likelihood under the prior,

$$p(\mathbf{X}) = \langle p(\mathbf{X}|\mathbf{T},\mathbf{V}) \rangle_{p(\mathbf{T},\mathbf{V})} = \int p(\mathbf{X}|\mathbf{T},\mathbf{V}) \, p(\mathbf{T},\mathbf{V}) \, d\mathbf{T} \, d\mathbf{V}$$

which can be approximated with the Monte-Carlo estimate

$$p(\mathbf{X}) \approx \frac{1}{N} \sum_{n=1}^{N} p(\mathbf{X}|\mathbf{T}^{(n)},\mathbf{V}^{(n)}) \qquad \{\mathbf{T}^{(n)},\mathbf{V}^{(n)}\} \sim p(\mathbf{T},\mathbf{V}), \quad n = 1, \dots, N$$

7.3.4 Consistency of Marginal Likelihood Estimates

We use a toy example to confirm the consistency of the marginal likelihood calculations. With $F = I = T = 1$, $a^t = b^t = a^v = b^v = 100$, and $\mathbf{T}$, $\mathbf{V}$, $\mathbf{X}$ set to the mode of the prior, all four methods discussed return a marginal log likelihood of -5.5266, and with $T = 2$ the marginal log likelihood is -11.0462. Both of these values are confirmed with MATLAB quadrature methods over the double and triple integrals respectively.

For larger models, such as those arising in musical audio analysis as described in Section 7.4, only the variational Bayes approach and the Metropolis-Hastings algorithms are practical.
The VB algorithm converges to a lower bound on the marginal likelihood, because the approximating distribution to the posterior ignores the coupling between the latent sources and the templates/excitations; whilst the MH algorithm converges in the limit to the true marginal likelihood. In the single source case, the VB algorithm converges to the true marginal likelihood (as there is no coupling between the latent sources and the templates/excitations to be ignored), but for more than one source, the VB algorithm underestimates the marginal likelihood compared with the estimate returned by the MH algorithm. However, the discrepancy between the two estimates is negligible compared with the ratios of marginal likelihood when selecting between different numbers of sources for some observed data, as described in Section 7.4.3. For Bayesian model selection, the VB and MH algorithms would lead us to the same conclusions.

7.4 Musical Audio Analysis

The Gaussian variance matrix factorization model is suitable for the time-frequency surfaces that result from applying a transformation matrix to an audio signal. The Bayesian extension is particularly useful for specifying prior knowledge concerning the spectral profile of musical instruments (templates) and volume / damping (excitations). A signal $y = (y_1, \dots, y_n, \dots, y_N)$ is represented by a linear combination

$$y_n = \sum_{\nu,\tau} \phi_n^{(\nu,\tau)} x_{\nu,\tau}$$

where $\phi_n^{(\nu,\tau)}$ are localized sinusoidal basis functions in time $\tau$ and frequency $\nu$. The choice of basis functions determines the transform. The short-time Fourier transform (STFT) uses time windowed complex exponentials at linearly spaced frequencies, and the resulting expansion coefficients $x_{\nu,\tau}$ are accordingly complex valued. The [short-time] discrete Cosine transform (DCT) uses even sinusoidal functions, and the expansion coefficients are real valued. Other transforms for musical audio processing include the Gabor regression model of Wolfe et al.
[2004], wavelets [Mallat, 1999], the modified discrete Cosine transform (MDCT) [Daudet and Sandler, 2004] and the constant-Q transform of Brown [1991].

In the following examples, the observation matrix is formed of a matrix of DCT coefficients. Audio signals are downsampled to 8000 Hz and buffered into frames of $F = 1024$ samples with no windowing or overlapping. Inference is carried out using the variational Bayes algorithm.

7.4.1 Model Training

Here, we optimize hyperparameters for a set of piano notes. The RWC musical instrument sounds database [Goto et al., 2003, Goto, 2004] contains audio for three pianos played with a variety of dynamics and playing styles. In Figure 7.3 on page 117 we display the resulting template hyperparameters for the audio 011PFNOF, 011PFNOM, and 011PFNOP in the database, which denote a piano played with `normal' style at the dynamics forte, mezzo and piano. Each note on the keyboard is played once, covering a range of 88 notes from MIDI 21 to MIDI 108. Here, individual shape and scale parameters are trained for each frequency bin for each pitch class.

The plot of the scale parameters in Figure 7.3b on page 117 clearly shows the harmonic series of each note, and that the spacing between the harmonics increases with pitch. A careful inspection of the plot of the shape parameters in Figure 7.3a on page 117 shows that frequency bins corresponding to the harmonic series have a larger variance than those corresponding to the noise floor.

For this example, we have chosen to train the hyperparameters for single source models, so that it can be clearly illustrated that the priors capture the harmonic series of the piano notes. As the samples are of differing length, we choose to tie the excitation parameters across time, thus estimating a single value of $a^v$ and $b^v$ per note. This means that the priors estimated here are valid for signals of arbitrary length.
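The construction of the observation matrix described at the start of Section 7.4 (unwindowed, non-overlapping frames of $F$ samples, each transformed by a DCT) can be sketched as follows. This is only an illustration under stated assumptions: the text does not say which DCT variant or normalization is used, so an orthonormal DCT-II is assumed, the function names are hypothetical, the input is taken to be already downsampled, and the naive $O(F^2)$ loop stands in for a fast transform.

```python
import math


def dct_ii(frame):
    """Orthonormal DCT-II of one frame (naive O(F^2) loop, for clarity)."""
    F = len(frame)
    coeffs = []
    for k in range(F):
        acc = sum(x * math.cos(math.pi * (n + 0.5) * k / F)
                  for n, x in enumerate(frame))
        norm = math.sqrt(1.0 / F) if k == 0 else math.sqrt(2.0 / F)
        coeffs.append(norm * acc)
    return coeffs


def observation_matrix(signal, frame_length=1024):
    """Split `signal` into non-overlapping, unwindowed frames and transform each.

    Returns a list of frames; element [tau][nu] is DCT coefficient nu of frame tau.
    A trailing partial frame is dropped (an assumption of this sketch).
    """
    n_frames = len(signal) // frame_length
    return [dct_ii(signal[t * frame_length:(t + 1) * frame_length])
            for t in range(n_frames)]
```

With the orthonormal scaling assumed here, the energy of each frame equals the sum of its squared coefficients, which gives a convenient sanity check on the transform.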
[Figure 7.3: Template hyperparameters for single source models of piano notes. (a) Shape parameters $a^t_{\nu,i}$. (b) Scale parameters $b^t_{\nu,i}$. Axes in both panels: frequency / Hz and MIDI note pitch.]

7.4.2 Source Separation and Visualization

Here we illustrate a source separation application using the Gaussian variance matrix factorization model. We have taken an extract of a piano piece and synthesized the MIDI file using the same audio samples as in Section 7.4.1. Over the extract we have assumed that the model is stationary, and that all 88 sources, corresponding to every note on the piano keyboard, are active. We then infer $\mathbf{V}$ using the variational Bayes algorithm.

Our intuition is that notes which are being played will have a large excitation, while notes which are not being played will have a small excitation and thus be indistinguishable from silence. This is confirmed in Figure 7.4 on page 119, which is a useful alternative to frequency representations such as the harmonic transform of Zhang et al. [2004], and can be used to visualize the frequency content of audio signals. All 88 possible notes are modelled using rank one source matrices, and the hyperparameters optimized separately. Regions of high excitation (in red) correspond to a note being played. The positions of the notes are offset slightly so that the high excitation regions are not obscured.

7.4.3 Model Selection

In the previous two sections, we have modelled each piano note using a single source model. It remains to be discussed, however, whether we would obtain a better model by using a multiple source model, i.e., a rank $I > 1$ source matrix. The goal of this section is to determine whether the variational Bayes algorithm gives consistent answers as to the optimal number of sources for the piano notes.
The results of our investigations are given in Figure 7.5 on page 120 for the set of piano notes 01[1,3] PFNO [F,M,P] in the RWC database. The optimal number of sources is correlated with the pitch of the piano note, which may be the result of the particular time-frequency representation chosen for the audio. The trend shown is that notes with a low or high pitch are best modelled by a few sources, whilst notes with a medium pitch are best modelled by a larger number of sources. Factors that perhaps contribute to this are: 1) the poor resolution of the DCT for low pitches; 2) downsampling to 8000 Hz, which cuts off many of the harmonics of higher pitches, thus leading to simpler models. The dependency of the number of sources on the length of the frame was not investigated here, and is suggested for future work when multiple source models are used for polyphonic music transcription.

7.5 Prior Model for Polyphonic Piano Music

In this section, we extend the prior model for the excitation matrix to include MIDI pitch and velocity of the notes that are playing in a piece of solo polyphonic piano music. We also apply this prior model to the Poisson intensity model of Cemgil [2008] so that the transcription performance of both models can be compared.

7.5.1 Model Description

In this section, we have chosen to rely on deterministic approaches to solve the transcription inference problem, as opposed to more expensive MCMC approaches. We describe a quite general approach which lends itself to any form of music for which the MIDI format is an admissible representation of the transcription.

[Figure 7.4: Transcription of the first 20 seconds of Albeniz's Suite Española No.5 Asturias (Leyenda) using the Gaussian variance matrix factorization model. (a) Visualization by source separation. Red areas denote regions of higher excitation, which occur particularly at note onsets for notes within the melodic line. The onset excitation corresponds with the MIDI note velocity below. (b) Original MIDI. High note onset velocities are shaded red. Axes in both panels: MIDI pitch and time / s.]

[Figure 7.5: Optimal number of sources for a set of piano notes. (a) Histogram of the optimal number of sources (axes: number of examples and optimal number of sources). (b) Optimal number of sources by pitch (axes: optimal number of sources and MIDI pitch).]

We select the maximum number of sources $I$ to be the total number of pitches represented in the MIDI format. Each source $i$ corresponds to a particular pitch. Then we have a single set of template parameters $\mathbf{T} \in \mathbb{R}_+^{F \times I}$ for all $I$ sources, which are intended to represent the spectral, harmonic information of the pitches.

For polyphonic transcription, we are typically interested in inferring the piano roll matrix $\mathbf{C}$, which owing to the above assumption of one source per pitch has the same dimensions as the excitation matrix $\mathbf{V}$. For note $i$ at time $\tau$ we set $C_{i,\tau}$ to be the value of the velocity of the note, and $C_{i,\tau} = 0$ if the note is not playing. We use the note on velocity, which is stored in the MIDI format as an integer between 1 and 128. Thus, we model note velocity using our generative model. This contrasts with previous approaches which infer a binary-valued piano roll matrix of note activity, essentially discarding potentially useful volume information. The prior distribution $p(\mathbf{C})$ is a discrete distribution, which can incorporate note transition probabilities and commonly occurring groups of note pitches, i.e., chords and harmony information.

A note with a larger velocity will have a larger corresponding excitation. The magnitude of the excitation will depend on the pitch of the note as well as its velocity, so instead of applying a volume curve such as (2.1), we infer this relationship from training data.
We will represent this information as an a priori unknown positive-valued matrix $\mathbf{F}$ of size $I \times 128$, where $F_{i,z}$ represents a mapping from the MIDI pitch $i$ and velocity $z$ to the excitation matrix given by

$$V_{i,\tau} = \begin{cases} 0 & C_{i,\tau} = 0 \\ F_{i,C_{i,\tau}} & \text{otherwise} \end{cases} \tag{7.15}$$

For music transcription, we extend the prior model on $\mathbf{V}$ to include the matrices $\mathbf{F}$ and $\mathbf{C}$, i.e.,

$$p(\mathbf{V},\mathbf{F},\mathbf{C}) = p(\mathbf{V}|\mathbf{F},\mathbf{C}) \, p(\mathbf{F},\mathbf{C})$$

As $\mathbf{F}$ is a mapping to the excitation matrix, we place an inverse-gamma prior (for the Gaussian variance model) or a gamma prior (for the Poisson intensity model) over each element of $\mathbf{F}$. The resulting conditional posterior over $\mathbf{F}$ is of the same family as the prior, and is obtained by combining the expectations of the sources corresponding to the correct pitch and velocity.

The full generative model for polyphonic transcription is given by

$$p(\mathbf{X},\mathbf{S},\mathbf{T},\mathbf{V},\mathbf{F},\mathbf{C}) = p(\mathbf{X}|\mathbf{S}) \, p(\mathbf{S}|\mathbf{T},\mathbf{V}) \, p(\mathbf{V}|\mathbf{F},\mathbf{C}) \, p(\mathbf{T}) \, p(\mathbf{F},\mathbf{C})$$

One advantage of this model is that minimal storage is required for the parameters, which can be estimated offline from training data, as we demonstrate in Section 7.6.2. The two sets of parameters are intuitive for musical signals. This model also allows closer modeling of the excitation of the notes than the MIDI format allows.

7.5.2 Algorithm

We are able to integrate out the latent sources $\mathbf{S}$ (see Section 7.2), and also eliminate $\mathbf{V}$ given $\mathbf{F}$ and $\mathbf{C}$, using (7.15). The algorithm we present here is a generalized EM algorithm, which iterates to find a solution of the posterior:

$$\arg\max_{\mathbf{T},\mathbf{F},\mathbf{C}} p(\mathbf{T},\mathbf{F},\mathbf{C}|\mathbf{X})$$

The posterior distribution of $\mathbf{F}$ conditional on $\mathbf{C}$, $\mathbf{T}$, $\mathbf{X}$ is inverse-gamma, as it is formed by collecting the estimates of $\mathbf{V}$ corresponding to each note pitch/velocity pairing. To maximize for the piano roll $\mathbf{C}$ we first note that each frame of observation data is independent given the other parameters $\mathbf{V}$, $\mathbf{F}$. For each $\tau$ we wish to calculate

$$\arg\max_{\mathbf{C}_\tau} p(\mathbf{X}_\tau|\mathbf{T},\mathbf{V}_\tau) \, p(\mathbf{V}_\tau|\mathbf{F},\mathbf{C}_\tau) \, p(\mathbf{F},\mathbf{C}_\tau) \tag{7.16}$$

where $\mathbf{X}_\tau$, $\mathbf{V}_\tau$ and $\mathbf{C}_\tau$ are the $\tau$th column vectors of $\mathbf{X}$, $\mathbf{V}$ and $\mathbf{C}$ respectively.
However, as each $\mathbf{C}_\tau$ has $128^I$ possible values, an exhaustive search to maximize this is not feasible. Instead, we have found that the following greedy search algorithm works sufficiently well: for each frame $\tau$ calculate

$$\arg\max_{\tilde{\mathbf{C}}_\tau} p(\mathbf{X}_\tau|\mathbf{T},\tilde{\mathbf{V}}_\tau) \, p(\tilde{\mathbf{V}}_\tau|\mathbf{F},\tilde{\mathbf{C}}_\tau) \, p(\mathbf{F},\tilde{\mathbf{C}}_\tau) \tag{7.17}$$

where $\tilde{\mathbf{C}}_\tau$ differs from $\mathbf{C}_\tau$ by at most one element, and $\tilde{\mathbf{V}}$ is the corresponding excitation matrix. There are $I \times 128$ possible settings of $\tilde{\mathbf{C}}_\tau$ for which we evaluate the likelihood at each stage of the greedy search. This can be carried out efficiently by noticing that during the search the corresponding matrix products $\mathbf{T}\tilde{\mathbf{V}}$ differ from the existing value by only a rank-one update of $\mathbf{T}\mathbf{V}$.

The resulting algorithm has one update for the expectation step and three possible updates for the maximization step. For the generalized EM algorithm to be valid, we note that any maximization step based on parameter values that were not used to calculate the source expectations is not guaranteed to increase the log likelihood, and the increase must therefore be verified.

7.6 Results

A useful comparative study of three differing approaches has been carried out in Poliner and Ellis [2007]. A dataset with ground-truth of polyphonic piano music has been provided to assess the performance of a support-vector machine (SVM) classifier [Poliner and Ellis, 2007], which is provided as an example of a discriminative approach having favorable performance in classification accuracy; a neural-network classifier [Marolt, 2004], known as SONIC²; and an auditory-model based approach [Ryynänen and Klapuri, 2005].

7.6.1 Comparison

To comprehensively evaluate these models, we use the Poliner and Ellis training and test data and compare the performance against the results provided in the same paper, which are repeated here for convenience. The ground truth for the data consists of 124 MIDI files of classical piano music, of which 24 have been designated for testing purposes and 13 are designated for validation.
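In a Bayesian framework there need not be any distinction between training and validation data: both may be considered labeled observations. Here we have chosen to discard the validation data rather than include it in the training examples for a fairer comparison with the approaches used by other authors. We also do not attempt to optimize the model parameters to minimize transcription errors on the validation set, as this is not consistent with a generative modelling approach. This is discussed further in Section 7.7.

² http://lgm.fri.uni-lj.si/SONIC

Algorithm 7.2 Gaussian Variance: algorithm for polyphonic transcription

• Source Expectation

$$\langle s_{\nu,\tau} s_{\nu,\tau}^\top \rangle = [t_\nu v_\tau^\top] \cdot I_N - \kappa_{\nu,\tau} \kappa_{\nu,\tau}^\top [\mathbf{T}\mathbf{V}]_{\nu,\tau} + \langle s_{\nu,\tau} \rangle \langle s_{\nu,\tau} \rangle^\top$$

• Template Maximization
  Shape and scale parameters of inverse-gamma posterior distribution

$$A_{\nu,i} = \alpha^{(T)}_{\nu,i} + T \qquad B_{\nu,i} = \beta^{(T)}_{\nu,i} + \sum_\tau V_{i,\tau}^{-1} \langle s_{\nu,\tau} s_{\nu,\tau}^\top \rangle$$

  Mode of posterior distribution

$$T_{\nu,i} \leftarrow \frac{B_{\nu,i}}{A_{\nu,i} + 1}$$

• Excitation Maximization
  Shape and scale parameters of inverse-gamma posterior distribution

$$A_{i,z} = \sum_{\{\tau : C_{i,\tau} = z\}} \alpha^{(V)}_{i,\tau} + F_{i,C_{i,\tau}} \qquad B_{i,z} = \sum_{\{\tau : C_{i,\tau} = z\}} \left( \beta^{(V)}_{i,\tau} + \sum_\nu T_{\nu,i}^{-1} \langle s_{\nu,\tau} s_{\nu,\tau}^\top \rangle \right)$$

  Mode of posterior distribution

$$F_{i,z} \leftarrow \frac{B_{i,z}}{A_{i,z} + 1}$$

• Transcription Search
  for $\tau = 1, \dots, T$

$$\mathbf{C}_\tau \leftarrow \arg\max_{\tilde{\mathbf{C}}_\tau} \sum_\nu \left( -\frac{1}{2} \frac{|X_{\nu,\tau}|^2}{[\mathbf{T}\tilde{\mathbf{V}}]_{\nu,\tau}} - \log [\mathbf{T}\tilde{\mathbf{V}}]_{\nu,\tau} \right) p(\mathbf{F}, \tilde{\mathbf{C}}_\tau)$$

Algorithm 7.3 Poisson Intensity: algorithm for polyphonic transcription

• Source Expectation

$$\langle s_{\nu,\tau} \rangle = \kappa_{\nu,\tau} X_{\nu,\tau}$$

• Template Maximization
  Shape and scale parameters of gamma posterior distribution

$$A_{\nu,i} = \alpha^{(T)}_{\nu,i} + \sum_\tau \langle s_{i,\tau} \rangle \qquad B_{\nu,i} = \beta^{(T)}_{\nu,i} + \sum_\tau V_{i,\tau}$$

  Mode of posterior distribution

$$T_{\nu,i} \leftarrow \frac{A_{\nu,i} - 1}{B_{\nu,i}}$$

• Excitation Maximization
  Shape and scale parameters of gamma posterior distribution

$$A_{i,z} = \sum_{\{\tau : C_{i,\tau} = z\}} \left( \alpha^{(V)}_{i,\tau} + \sum_\nu \langle s_{\nu,\tau} \rangle \right) \qquad B_{i,z} = \sum_{\{\tau : C_{i,\tau} = z\}} \left( \beta^{(V)}_{i,\tau} + \sum_\nu T_{\nu,i} \right)$$

  Mode of posterior distribution

$$F_{i,z} \leftarrow \frac{A_{i,z} - 1}{B_{i,z}}$$

• Transcription Search
  for $\tau = 1, \dots, T$

$$\mathbf{C}_\tau \leftarrow \arg\max_{\tilde{\mathbf{C}}_\tau} \sum_\nu \left( X_{\nu,\tau} \log [\mathbf{T}\tilde{\mathbf{V}}]_{\nu,\tau} - [\mathbf{T}\tilde{\mathbf{V}}]_{\nu,\tau} \right) p(\mathbf{F}, \tilde{\mathbf{C}}_\tau)$$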
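The source expectation and template maximization steps of the Poisson intensity algorithm (Algorithm 7.3) can be sketched in pure Python as follows. This is a hedged illustration, not the implementation used for the experiments. It assumes that the source expectation is $\langle s_{\nu,i,\tau} \rangle = \kappa_{\nu,i,\tau} X_{\nu,\tau}$ with $\kappa_{\nu,i,\tau} = T_{\nu,i} V_{i,\tau} / [\mathbf{T}\mathbf{V}]_{\nu,\tau}$, and that the gamma prior is parameterised by a shape `alpha` and a rate `beta`. The defaults `alpha=1.0`, `beta=0.0` correspond to a flat prior, and all function names are hypothetical.

```python
import math


def reconstruct(T, V):
    """Return the matrix product [TV] for list-of-lists T (F x I) and V (I x N)."""
    n_src, n_frames = len(V), len(V[0])
    return [[sum(T[nu][i] * V[i][tau] for i in range(n_src))
             for tau in range(n_frames)] for nu in range(len(T))]


def poisson_loglik(X, T, V):
    """Sum of X log[TV] - [TV]; the log X! term does not depend on T or V."""
    TV = reconstruct(T, V)
    return sum(X[nu][tau] * math.log(TV[nu][tau]) - TV[nu][tau]
               for nu in range(len(X)) for tau in range(len(X[0])))


def template_update(X, T, V, alpha=1.0, beta=0.0):
    """One source-expectation step followed by the template mode update."""
    TV = reconstruct(T, V)
    n_src, n_frames = len(V), len(V[0])
    new_T = []
    for nu in range(len(T)):
        row = []
        for i in range(n_src):
            # Assumed E-step: <s_{nu,i,tau}> = kappa_{nu,i,tau} * X_{nu,tau}
            s_sum = sum(T[nu][i] * V[i][tau] / TV[nu][tau] * X[nu][tau]
                        for tau in range(n_frames))
            A = alpha + s_sum          # posterior shape
            B = beta + sum(V[i])       # posterior rate (assumed parameterisation)
            row.append((A - 1.0) / B)  # mode of the gamma posterior
        new_T.append(row)
    return new_T
```

Under the flat prior the update is an exact EM step for the templates with the excitations held fixed, so `poisson_loglik` should not decrease from one call to the next; checking this is a simple way to validate an implementation.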
The observation data is primarily obtained by using a software synthesizer to generate audio data. In addition, 19 of the training tracks and 10 of the test tracks were synthesized and recorded on a Yamaha Disklavier. Only the first 60 seconds of each extract is used. The audio, sampled at 8000 Hz, is then buffered into frames of length 128 ms with a 10 ms hop between frames, and the spectrogram is obtained from the short-time Fourier transform of these frames. Poliner and Ellis subsequently carry out a spectral normalization step in order to remove some of the timbral and dynamical variation in the data prior to classification. However, we omit this processing stage as we rather wish to capture this information in our generative model.

7.6.2 Implementation

Because of the copious amount of training data available, there is enough information concerning the frequencies of occurrence of the note pitches and velocities that it is not necessary to place informative priors on these parameters. It is not necessary to explicitly carry out a training run to estimate values of the model parameters before evaluating against the test data. However, the EM algorithm does converge faster during testing if we first estimate the parameters from the labelled observations.

Figure 7.6 on page 126 and Figure 7.7 on page 127 show the $\log \mathbf{T}$ templates and $\log \mathbf{F}$ excitation parameters estimated from the Poliner and Ellis training data for the Gaussian variance and Poisson intensity models, with flat prior distributions after running the EM algorithm to convergence on the training data only, and using a single source to model each note pitch as in Section 7.4.1. The templates clearly exhibit the harmonic series of the musical notes, and the excitations have the desired property that notes with higher velocity correspond to higher excitation; hence our assumption of uniform priors on these parameters has not been detrimental.
For the excitation parameters, white areas denote pitch/velocity pairs that are not present in the training data and are thus unobserved.

For each of the matrix factorization models we consider two choices of the prior on $\mathbf{C}$. The first assumes that each frame of data is independent of the others, which is useful in evaluating the performance of the source models in isolation. The second assumes that each note pitch is independent of the others, and that between consecutive frames there is a state transition probability, where the states are each note being active or inactive, i.e.,

$$p(C_{i,\tau} > 0 \,|\, C_{i,\tau-1} = 0) = p(C_{i,\tau} = 0 \,|\, C_{i,\tau-1} > 0) = p_{\text{event}} \tag{7.18}$$

This prior is known as the Markov prior in the remainder of this chapter. The state transition probabilities are estimated from the training data. It is possible, and more correct, to include these transition probabilities as parameters in the model, but we have not carried out the inference of note transition probabilities in this work. Similar Markov time dependencies between frames of data modelled by NMF techniques are used in Ozerov et al. [2009]. The modification to the inference is straightforward: in (7.16) the prior on $C_{i,\tau}$ is calculated using (7.18) with the current estimates of $C_{i,\tau-1}$ and $C_{i,\tau+1}$.
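A minimal sketch of how the Markov prior (7.18) might enter the frame-wise search is given below. It is an assumption-laden illustration: only the active/inactive state is modelled, as in (7.18), so the velocity of an active note does not affect the prior; the estimate of $p_{\text{event}}$ is a simple count of state changes in a training piano roll; and the function names are hypothetical.

```python
import math


def estimate_p_event(piano_roll):
    """Fraction of frame-to-frame steps on which a note switches on or off.

    piano_roll[i][tau] is the velocity of pitch i at frame tau (0 = inactive).
    Each row needs at least two frames for the ratio to be defined.
    """
    changes = steps = 0
    for row in piano_roll:
        for prev, cur in zip(row, row[1:]):
            changes += (prev > 0) != (cur > 0)
            steps += 1
    return changes / steps


def local_log_prior(c_prev, c_cur, c_next, p_event):
    """Log prior weight of C[i][tau] = c_cur given its estimated neighbours.

    Pass None for a neighbour that lies outside the extract.
    """
    def log_step(a, b):
        # A change of state has probability p_event, no change 1 - p_event.
        return math.log(p_event if (a > 0) != (b > 0) else 1.0 - p_event)

    total = 0.0
    if c_prev is not None:
        total += log_step(c_prev, c_cur)
    if c_next is not None:
        total += log_step(c_cur, c_next)
    return total
```

In a greedy search of the kind in (7.17), this term would be added to the log likelihood of each candidate setting, so that a note switched on for a single frame between two inactive neighbours is penalized twice by $\log p_{\text{event}}$.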
[Figure 7.6: Parameter estimates for the Gaussian variance model from training data. (a) Template parameters $\log T$ (frequency / Hz against MIDI note pitch). (b) Velocity-excitation mapping $\log F$ (MIDI velocity against MIDI note pitch).]

[Figure 7.7: Parameter estimates for the Poisson model from training data. (a) Template parameters $\log T$ (frequency / Hz against MIDI note pitch). (b) Velocity-excitation mapping $\log F$ (MIDI velocity against MIDI note pitch).]

7.6.3 Evaluation

Following training, the matrix of spectrogram coefficients is extended to include the test extracts. As the same two instruments are used in the training and test data, we simply use the same parameters which were estimated in the training phase. We transcribe each test extract independently of the others. In the full Bayesian setting the extracts should be transcribed jointly, but this is neither practical nor typical of a reasonable application of a transcription system.

An example of the transcription output for the first ten seconds of the synthesized version of Burgmueller's The Fountain is provided for the Gaussian variance model, with both independent and Markov priors on $C$ (Figure 7.8 on page 129 and Figure 7.9 on page 130), compared to the MIDI ground truth (Figure 7.10 on page 131). The transcription is graphically represented in terms of detections and misses in Figure 7.11 on page 132. True positives are in light gray, false positives in dark gray, and false negatives in black. Most of the difficulties encountered in transcription in this particular extract were due to the positioning of note offsets, rather than the detection of the pitches themselves.
The transcription with an independent prior on $C$ shows that the generative model has not only detected the activity of many of the notes playing, but has also attempted to jointly infer the velocity of the notes. The velocity is inferred independently in each frame, hence there is much variation across a note; however, there is correlation between the maximum inferred velocity during a note event and the ground truth velocities. The Markov prior on $C$ eliminates many of the spurious notes detected, which typically last only a few frames. We have used only the information contained in note pitches, but the effect of resonance and pedaling can be clearly seen by comparing the ground truth with the transcriptions. This motivates the use of a note onset evaluation criterion.

We follow the same evaluation criteria as provided by Poliner and Ellis. As well as recording the accuracy Acc (true positive rate), the transcription error is decomposed into three parts: Subs, the substitution error rate, when a note from the ground truth is transcribed with the wrong pitch; Miss, the note miss rate, when a note in the ground truth is not transcribed; and FA, the false alarm rate beyond substitutions, when a note not present in the ground truth is transcribed. These sum to form the total transcription error Tot, which cannot be biased simply by adjusting a threshold for how many notes are transcribed.

Table 7.1 on page 129 shows the frame-level transcription accuracy for the approaches studied in Poliner and Ellis [2007]. We use the same data sets and feature dimensions selected by the authors of that paper to compare our generative models against these techniques. This table expands the accuracy column in Table 7.2 on page 129 by splitting the test data into the recorded piano extracts and the MIDI synthesized extracts. Table 7.2 on page 129 shows the frame-level transcription results for the full synthesized and recorded data set.
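The error decomposition described above can be sketched as follows. The per-frame formulas are an assumption based on our reading of the frame-level measures in Poliner and Ellis [2007] and should be checked against that paper: in each frame, substitutions are counted as the notes that could have been correct but were not, misses as the shortfall in the number of transcribed notes, and false alarms as the excess, all normalized by the total number of ground-truth notes. The set-per-frame input format and the function name are hypothetical.

```python
def frame_errors(reference, estimate):
    """Frame-level error rates from per-frame sets of MIDI pitches.

    reference, estimate: lists of sets, one set of pitches per frame.
    Returns a dict with Subs, Miss, FA and their sum Tot, each expressed
    as a fraction of the total number of reference notes.
    """
    subs = miss = fa = 0
    n_ref_total = 0
    for ref, est in zip(reference, estimate):
        n_ref, n_est = len(ref), len(est)
        n_correct = len(ref & est)
        subs += min(n_ref, n_est) - n_correct   # wrong pitch reported
        miss += max(0, n_ref - n_est)           # too few notes reported
        fa += max(0, n_est - n_ref)             # too many notes reported
        n_ref_total += n_ref
    rates = {
        "Subs": subs / n_ref_total,
        "Miss": miss / n_ref_total,
        "FA": fa / n_ref_total,
    }
    rates["Tot"] = rates["Subs"] + rates["Miss"] + rates["FA"]
    return rates
```

Under these definitions, transcribing fewer notes trades false alarms for misses rather than reducing the total, which is consistent with the remark that Tot cannot be biased by adjusting a detection threshold.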
Accuracy is the true positive rate expressed as a percentage, which can be biased by not reporting notes. The total error is a more meaningful measure, and is divided between substitution, note miss and false alarm errors. This table shows that the matrix factorization models with a Markov note event prior have a similar error rate to the Marolt system on this dataset, but a greater error rate than the support vector machine classifier.

[Figure 7.8: Transcription using a priori independent frames (pitch against time / s).]

Table 7.1: Frame-level transcription accuracy (%)

Model                      Piano   MIDI   Both
SVM                        56.5    72.1   67.7
Ryynänen & Klapuri         41.2    48.3   46.3
Marolt                     38.4    40.0   39.6
Variance (Independent)     36.0    41.2   39.7
Variance (Markov)          38.0    44.0   42.3
Intensity (Independent)    40.1    35.4   36.8
Intensity (Markov)         39.7    36.2   37.3

Table 7.2: Frame-level transcription results

Model                      Acc    Tot    Subs   Miss   FA
SVM                        67.7   34.2    5.3   12.1   16.8
Ryynänen & Klapuri         46.6   52.3   15.0   26.2   11.1
Marolt                     36.9   65.7   19.3   30.9   15.4
Variance (Independent)     39.7   68.2   22.9   27.7   17.6
Variance (Markov)          42.3   62.1   18.1   32.0   12.0
Intensity (Independent)    36.8   71.0   27.8   24.6   18.6
Intensity (Markov)         37.3   66.6   23.7   30.0   12.9

[Figure 7.9: Transcription using Markov transition probabilities between frames (pitch against time / s).]

[Figure 7.10: Ground truth for the transcription results in Figure 7.8 on page 129 and Figure 7.9 on page 130 (pitch against time / s).]

[Figure 7.11: Detection assessment (pitch against time / s).]

[Figure 7.12: Number of errors for the Gaussian variance Markov model by number of notes and error type (number of errors, ×10^4, against number of notes in frame, for Subs, Miss and FA).]
7.7 Conclusion

In this chapter we have described a generative model for factorizing the variances of matrix elements into smaller template and excitation matrices, and developed Bayesian priors and inference algorithms to estimate the variances and the dimensions of the factor matrices. When applied to the spectrogram of a musical note, the template matrix models the harmonic content of the note, and the excitation matrix controls how the volume of the note varies over time. The hyperparameter optimization techniques described in this chapter can be applied to labeled training data to develop a system capable of distinguishing between and modelling different musical instruments and pitches. We have demonstrated how the excitation matrix of a polyphonic signal can be used to visualize the transcription of the music.

We have compared the performance of generative spectrogram factorization models with three existing transcription systems on a common dataset. The models exhibit a similar error rate to the neural-network classification system of Marolt [2004]. However, the support vector machine classifier of Poliner and Ellis [2007] achieves a lower error rate for polyphonic piano transcription on this dataset. In this conclusion, we principally discuss the reasons for the difference in error rate of these systems, and how the generative models can be improved in terms of inference and prior structure to achieve an improved performance.

The support vector machine is purely a classification system for transcription, for which the parameters have been explicitly chosen to provide the best transcription performance on a validation set; whereas the spectrogram factorization models, being generative in nature, are applicable to a much wider range of problems: source separation, restoration, score-audio alignment and so on.
For this reason, we have not attempted to select priors by hand-tuning in order to improve transcription performance, but rather adopt a fully Bayesian approach with an explicit model which infers correlations in the spectrogram coefficients in training and test data, and thus provides a transcription of the test data as a product of this inference. The differences between these styles of approach, and the subsequent difference in performance, resemble those between supervised and unsupervised learning in classification. In light of this, we consider the performance of the spectrogram factorization models to be encouraging, as they are comparable to an existing polyphonic piano transcription system without explicitly attempting to improve the transcription performance by tuning prior hyperparameters.

Vincent et al. [2008], for instance, demonstrate the improvement in performance for polyphonic piano transcription that can be achieved over the standard NMF algorithm by developing improved basis spectra for the pitches, and achieve a performance mildly better than the neural-network classifier: a similar result to what has been presented here. Bertin et al. [2009b] similarly report improvement in transcription performance for the Gaussian variance model compared to existing Bayesian NMF. They also suggest a tempering approach to avoid iterative algorithms being trapped in local maxima of the likelihood function.

There are a number of alternative algorithms to perform NMF with the aim of increasing speed of convergence and locating better solutions. Recent work includes a split-gradient method developed by Lantéri et al. [2010]. To improve performance for transcription in a Bayesian spectrogram factorization, we can firstly improve initialization using existing multiple frequency detection systems for spectrogram data, and extend the hierarchical model for polyphonic transcription using concepts such as chords and keys.
We can also jointly track tempo and rhythm using a probabilistic model; for examples of this see Whiteley et al. [2006], Raphael [2004] and Peeling et al. [2007a], where the model used could easily be incorporated into the Bayesian hierarchical approach here.

The models we have used have assumed that the templates and excitations are drawn independently from priors. However, the existing framework of gamma Markov fields developed in Cemgil et al. [2007] and Dikmen and Cemgil [2010] can be used to replace these priors. This allows us to model stronger correlations, for example between the harmonic frequencies of the same musical pitch, which additionally contain timbral content, and also to model the damping of the excitation of notes from one frame to the next. It has been shown qualitatively that using gamma Markov field priors results in a much improved transcription, and in future work we will use this existing framework to extend the model described in this chapter, expecting to see a much improved transcription performance by virtue of a more appropriate model of the time-frequency surface. An alternative framework for enforcing temporal continuity in Bayesian NMF for polyphonic music transcription is presented by Bertin et al. [2009a], which could also be applied to the Gaussian variance model as opposed to the Poisson intensity model used by the authors. Other priors enforcing correlations between the elements in the factored matrices are possible, for example Gaussian process priors [Schmidt and Laurberg, 2008].

On this dataset, the Gaussian variance model has better performance for transcription than the intensity based model, and we suggest that this is due to the generative model modeling the weighting of the spectrogram coefficients directly, and thus being a more appropriate model for time-frequency surface estimation.
However, most of the literature for polyphonic music transcription systems using matrix factorization models has focused on the KL divergence and modifications and enhancements of the basic concept. Therefore it would be useful to firstly evaluate such variants of NMF against this dataset and other systems used for comparing and evaluating music transcription systems. Secondly, it would also be useful to replace the implicit Poisson intensity source model in these approaches with the Gaussian variance model.

In summary, we have presented matrix factorization models for spectrogram coefficients using a Gaussian variance parametrization, and have developed inference algorithms for the parameters of these models. The suitability of these models has been assessed for the polyphonic transcription of solo piano music, resulting in a performance which is comparable to some existing transcription systems. As we have used a Bayesian approach, we can extend the prior structure in a hierarchical manner to improve performance and model higher-level features of music.

Chapter 8

A Probabilistic Framework for Inferring Temporal Structure in Music

In this chapter we develop a probabilistic framework for the tractable inference of temporal structure in musical audio. The goal of this framework is to unify otherwise separate applications of Bayesian musical signal processing into a common, generative modelling framework for inference. We model the performance of a piece of music as the movement of a score pointer through a symbolic representation of the music. This representation may be the actual written score of the piece of music being played, converted to a suitable format, when the application is score following or tracking; or the representation may be a code book of rhythmic patterns for a tempo and beat tracking application; or a code book of chords for a transcription application.
The observed audio itself is modelled by a generative process conditional on the properties of the score at that point.

In the previous chapters of this thesis, we have mainly focused on processing individual frames of musical audio to detect multiple pitches with prior information. In Chapter 7 we added a simple Markov model for note transitions, such that the probability of a note sounding or ceasing to sound in a particular frame is dependent on the previous frame. This addition has a smoothing effect on the transcription such that spurious note detections are avoided, and notes are transcribed to their full length. In this chapter we define a Markov model over all of the notes sounding in each frame, mapping structures in the score, or the expected score of the music, into Bayesian priors. This allows a richer, more realistic transcription of the musical structure than a simple frame-by-frame transcription would produce, and provides accuracy and robustness when aligning a preexisting transcription to a performed piece of music.

8.1 Audio Matching using Generative Models

8.1.1 Existing Dynamic Time Warping Approach

We introduce our model by considering an existing state-of-the-art approach [Hu et al., 2003, Orio and Schwarz, 2001, Turetsky and Ellis, 2003] to the alignment of two pieces of music which are assumed to share a common score. Each piece of music is buffered into overlapping frames, and a feature vector is extracted from each frame. A distance metric is used to compute the similarity between pairs of feature vectors from each piece, and dynamic time warping (DTW) is used to compute a joint path through both pieces, maximizing the similarity between matched frames. Many choices of feature vector and distance metrics have been proposed, including chroma vectors using non-negative matrix factorization divergence measures by Niedermayer [2009].
An interesting variation of these methods is that of Stark and Plumbley [2010], which performs localized self-alignment to detect repetitious structures in music in order to 'follow' a performance without a score.

This approach to audio alignment may be extended to score matching and transcription by using a synthesizer to generate audio from a score or code book of musical chords. The synthesized audio is aligned to the observed audio, and the path through the observed audio can be used to infer the score position or transcription. For transcription, the path must allow movements from the end of one chord to the beginning of the next in the code book. Such jumps in the path reduce the efficiency of most DTW algorithm implementations.

In this section we state the audio alignment application using a hidden Markov model for the path and a generative signal model for each frame. This allows a number of extensions to the DTW approach, for example being able to jointly align multiple pieces of music, jointly inferring the structure within each piece (for example when sections of the score are repeated), and being able to use approximate sequential inference for longer pieces of music. Hidden Markov models have been used for audio and score alignment by Orio and Déchelle [2001] and Cano et al. [1999]; Raphael [2004] also uses a probabilistic generative model for score alignment.

8.1.2 Model Statement

The data we observe consists of $N$ pieces of music, with $T_n$ frames in each piece, $n = 1, \ldots, N$. Each frame of music is denoted by $y_t^{(n)}$, $t = 1, \ldots, T_n$. The score, which is the underlying representation of all of the pieces, is divided into $M$ sections. Over each section the properties of the score are stationary, hence note onsets and offsets, for example, mark the beginning of new sections. In regular classical music, each bar of the score may be divided equally, for example into 32 if no notes are longer than semi-quavers.
We consider the method in which the score is converted into musical audio via the concept of a 'score pointer', which denotes at time $t$ the current position of the performance in the score. If someone were to listen to a piece of music and follow the score with their finger at the same time, the position of the score pointer would be the position of the listener's finger. The path of the score pointer through each piece of music is denoted $x_t^{(n)}$, which takes values $1, \ldots, M$. The score pointer may move forwards, and also backwards in the case of repeated sections, therefore it is not necessary in general that $x_{t+1}^{(n)} \geq x_t^{(n)}$.

To proceed, we require a generative signal model which assigns a probability $p(y_t^{(n)} \mid \theta_m)\, p(\theta_m)$ for all values of $t, n, m$. $\theta_m$ denotes a set of parameters which describes how the signal changes given the position in the score. Some of the parameters $\theta_m$ will be unknown and must be inferred as part of the audio matching task. We also define a Markov model $p(x_t^{(n)} \mid x_{t-1}^{(n)})$ as a prior on the dynamics of the score pointer through each piece, with initial priors on the score position $p(x_1^{(n)})$ and the signal $p(y_1^{(n)} \mid \theta_m)$. The joint probability distribution of this model is given by

$$ \prod_{n=1}^{N} p\big(y_1^{(n)} \mid \theta_{x_1^{(n)}}\big)\, p\big(\theta_{x_1^{(n)}}\big)\, p\big(x_1^{(n)}\big) \prod_{t=2}^{T_n} p\big(y_t^{(n)} \mid \theta_{x_t^{(n)}}\big)\, p\big(\theta_{x_t^{(n)}}\big)\, p\big(x_t^{(n)} \mid x_{t-1}^{(n)}\big) \tag{8.1} $$

8.1.3 Interpretation of Dynamic Time Warping

Dynamic time warping constructs a set of $M$ matches between the frames of two pieces of audio, in such a way that the set of matches represents the path of an unknown score pointer through both pieces. Each match $m$ is a unique pair of frames $\{y^{(1)}_{p(m)}, y^{(2)}_{q(m)}\}$ from each piece, where $p(m) \in \{1, \ldots, T_1\}$ is the timing of the match in piece 1, and $q(m) \in \{1, \ldots, T_2\}$ is the timing of the match in piece 2. The cost of each match $d(y_p^{(1)}, y_q^{(2)})$ (also known as the distance or similarity between the frames) is a function of the two frames.
The cost has a small value when the frames share some characteristics that indicate that they are at the same position in the unknown score. A good cost function is insensitive to non-score related features, such as the overall energy of the frame. The cosine distance, which normalizes each vector, is a popular choice:

$$ d\big(y_p^{(1)}, y_q^{(2)}\big) = \frac{y_p^{(1)} \cdot y_q^{(2)}}{\big\|y_p^{(1)}\big\|\, \big\|y_q^{(2)}\big\|} $$

where $\|y\|$ is the $\ell_2$ norm.

There is also an additional cost between consecutive matches, which controls how the score pointer moves frame by frame through both pieces. This function ensures that only realistic score pointer paths are acceptable and disallows large differences in time between consecutive matches in both pieces, i.e., both $p(m) - p(m-1)$ and $q(m) - q(m-1)$ are constrained to be small for all $m$. Formally, we define a cost function $c(p(m-1), p(m), q(m-1), q(m))$ which is the cost of moving the score pointer from $p(m-1)$ to $p(m)$ in piece 1 and from $q(m-1)$ to $q(m)$ in piece 2. The value of the cost function must be infinite for any movement of the score pointer disallowed by the application. For example, if the application does not allow the score pointer to move backwards in time, then the cost function must be infinite for $p(m) - p(m-1) < 0$ or $q(m) - q(m-1) < 0$. For every frame in each piece to be matched with a frame in the other piece, we cannot allow the cost function to skip frames, hence it must also be infinite for $p(m) - p(m-1) > 1$ or $q(m) - q(m-1) > 1$.
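The two kinds of cost introduced so far can be sketched in a few lines. The normalized dot product in the formula above is largest for similar frames, so this sketch takes one minus that value as the matching cost; that conversion, the penalty value and all function names are assumptions made for illustration. The step cost is written in terms of the frame differences and returns infinity for backward moves, skipped frames and repeated matches.

```python
import math

def cosine_similarity(a, b):
    """Normalized dot product of two feature vectors (non-zero norms assumed)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def frame_cost(a, b):
    """Assumed conversion to a cost: small when the frames are similar."""
    return 1.0 - cosine_similarity(a, b)

def step_cost(dp, dq, penalty=1.0):
    """Hypothetical step cost for dp = p(m) - p(m-1) and dq = q(m) - q(m-1).

    Moving one frame in both pieces is free, moving in only one piece
    costs `penalty`, and every other move is disallowed.
    """
    if (dp, dq) == (1, 1):
        return 0.0
    if (dp, dq) in ((1, 0), (0, 1)):
        return penalty
    return math.inf
```

Because both vectors are normalized, scaling a frame by a constant leaves the matching cost unchanged, which is the insensitivity to overall energy that the text asks of a good cost function.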
In the literature, the cost function is normally chosen to be stationary, in that it does not depend on the actual values of $p(m-1), p(m), q(m-1), q(m)$, but only on the differences between them, i.e.,

$$ c(p(m-1), p(m), q(m-1), q(m)) = c(p(m-1)+t,\ p(m)+t,\ q(m-1)+t,\ q(m)+t) \quad \forall t \in \mathbb{Z} $$

When the cost function is stationary it can therefore be written as

$$ c(p(m-1), p(m), q(m-1), q(m)) \equiv c(p(m)-p(m-1),\ q(m)-q(m-1)) $$

Later in this chapter we will introduce a dependency on the position of onsets in the score, hence we do not simplify the specification of the cost function at this point.

A DTW algorithm minimizes the overall cost

$$ \sum_{m=1}^{M} d\big(y^{(1)}_{p(m)}, y^{(2)}_{q(m)}\big) + \sum_{m=2}^{M} c(p(m-1), p(m), q(m-1), q(m)) \tag{8.2} $$

over the set of matches $\{p(m), q(m) : m = 1, \ldots, M\}$ and the path length $M$, subject to the constraints

$$ p(1) = 1, \quad q(1) = 1, \quad p(M) = T_1, \quad q(M) = T_2 \tag{8.3} $$

This set of matches is known as a path and denotes the position of the score pointer in each piece throughout the entire path. The constraints require that the path of the score pointer starts at the first frame of both pieces, and terminates at the last frame of both pieces. Note that the length of the path $M$ is also unknown, although a shorter length is preferred by the cost function, which normally indicates a better match between the two pieces.

The constraints may be relaxed to allow for silences or missing notes. In this case, we do not apply the constraints (8.3) but instead define edge costs: $e(p(1), q(1))$, which defines the cost of starting the path at $p(1), q(1)$, and $f(p(M), q(M))$, which determines the cost of terminating the path at $p(M), q(M)$. The overall cost is now written as:

$$ e(p(1), q(1)) + f(p(M), q(M)) + \sum_{m=1}^{M} d\big(y^{(1)}_{p(m)}, y^{(2)}_{q(m)}\big) + \sum_{m=2}^{M} c(p(m-1), p(m), q(m-1), q(m)) $$

The DTW model may be interpreted using the generative model in the previous section.
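Returning to the minimization of (8.2) under the end-point constraints (8.3), a minimal dynamic programming sketch is given below. It assumes a stationary step cost supplied as a dictionary from frame differences to costs, with only the three moves that neither go backwards nor skip frames; the function name, the default step costs and the 0-based indexing are choices made for this example rather than details taken from the text.

```python
import math

def dtw(cost, steps=None):
    """Minimize the sum of match and step costs over monotone paths.

    cost[p][q]: cost d of matching frame p of piece 1 to frame q of piece 2.
    steps: maps (dp, dq) to a stationary step cost c(dp, dq); any move not
    listed is treated as having infinite cost.
    Returns (total cost, path) with 0-based frame indices.
    """
    if steps is None:
        steps = {(1, 1): 0.0, (1, 0): 0.0, (0, 1): 0.0}
    T1, T2 = len(cost), len(cost[0])
    total = [[math.inf] * T2 for _ in range(T1)]
    back = [[None] * T2 for _ in range(T1)]
    total[0][0] = cost[0][0]          # path must start at the first frames
    for p in range(T1):
        for q in range(T2):
            for (dp, dq), c in steps.items():
                pp, qq = p - dp, q - dq
                if pp < 0 or qq < 0:
                    continue
                candidate = total[pp][qq] + c + cost[p][q]
                if candidate < total[p][q]:
                    total[p][q] = candidate
                    back[p][q] = (pp, qq)
    # Trace back from the last frames of both pieces.
    path = [(T1 - 1, T2 - 1)]
    while back[path[-1][0]][path[-1][1]] is not None:
        path.append(back[path[-1][0]][path[-1][1]])
    path.reverse()
    return total[T1 - 1][T2 - 1], path
```

For example, aligning the toy sequences 0, 1, 2 and 0, 1, 1, 2 with absolute difference as the match cost yields a zero-cost path in which the second frame of the first sequence is matched twice. The path length is not fixed in advance; it falls out of the traceback.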
The cost functions are interpreted as negative log probabilities, such that the summations in (8.2) correspond to the products in (8.1), and minimizing the cost function is equivalent to maximizing the likelihood. This interpretation is powerful as the model may be extended with further prior information if available. The cost of matching frames is equivalent to the negative log marginal likelihood of both frames being generated by the same score

$$ d\big(y^{(1)}_{p(m)}, y^{(2)}_{q(m)}\big) \equiv -\log \int p\big(y_p^{(1)} \mid \theta_m\big)\, p(\theta_m)\, p\big(y_q^{(2)} \mid \theta_m\big)\, p(\theta_m)\, d\theta_m \tag{8.4} $$

and the cost of the score pointer movement between successive matches is related to the transition probabilities:

$$ c(p(m-1), p(m), q(m-1), q(m)) \equiv -\log p\big(x^{(1)}_{p(m)} \mid x^{(1)}_{p(m-1)}\big) - \log p\big(x^{(2)}_{q(m)} \mid x^{(2)}_{q(m-1)}\big) \tag{8.5} $$

The edge cost at the beginning of the path is equivalent to the priors:

$$ e(p(1), q(1)) = -\log p\big(x^{(1)}_{p(1)}\big)\, p\big(x^{(2)}_{q(1)}\big) $$

and the cost at the end of the path may similarly be written as

$$ f(p(M), q(M)) = -\log \int p\big(x^{(1)}_{p(M)} \mid x^{(1)}_{p(M-1)}\big)\, p\big(x^{(2)}_{q(M)} \mid x^{(2)}_{q(M-1)}\big)\, dp(M-1)\, dq(M-1) $$

marginalizing over the preceding position of the score pointer.

8.2 Score Alignment

A common way to use audio alignment techniques to align a score to a piece of musical audio is to synthesize the score to audio using an electronic instrument and compute a joint path through both pieces using DTW as described in the previous section. However, the inferred path through the observed audio may not be appropriate to infer the position through the score in every frame. A single frame in the observed audio may be matched to multiple frames in the synthesized audio which may span more than one score event. Thus there is ambiguity over which score event the frame in the observed audio should be matched to. In cases where the length of a frame is small in comparison to the minimum length of a score event, it is impossible for multiple score events to occur within the same frame of the observed audio.
When a single frame is matched to multiple score events, the path inferred through the audio must therefore be incorrect. We suggest that this is due to the prior model (8.4) which matches frames being too weak, and the transition model (8.5) being too flexible, allowing dramatic changes in tempo. In this section we focus on the development of a stronger transition model, i.e., the movement of the score pointer through the audio. We state the transition model in such a way that it can be incorporated into existing DTW cost functions and also expressed as a hidden Markov model, which allows further development of the prior models.

8.2.1 Treatment of Score Events

In score alignment, there is much more value in accurately inferring the timings of note onset events in the score than in inferring a smooth contour of the tempo through a piece of audio. Note onsets are also usually more important than note releases. In piano music, the sustain pedal can blur the timings of released notes, whereas note onsets are clearly defined. In percussive instruments, including plucked and hammered strings, the notes may not be explicitly cut off, but allowed to sound and decay. In legato playing, note releases are timed with the onsets of the next note, whilst in staccato playing, the note lengths are short and the exact position of the release may be difficult to determine in the score. Perceptually, minor errors in the timing of a note onset have a greater impact than those in the timing of a release.

In light of this, we treat note onsets with a rigid prior model to attempt to infer the timings as accurately as possible. Away from the onsets, however, we allow the score pointer to move flexibly through the progression of the note and its release.

8.2.2 Dynamic Time Warping Cost Function

Let the observed audio in 8.1.3 be $y_t^{(1)}$ and the synthetic audio, which is known to be aligned to the underlying score, be $y_t^{(2)}$.
We know a priori which frames in the synthetic audio $y_t^{(2)}$ are related to note onsets in the score. If $y^{(2)}_{q(m)}$ is related to a note onset, then we set the cost function as

$$ c(p(m-1), p(m), q(m-1), q(m)) = \begin{cases} 0 & p(m) - p(m-1) = 1 \text{ and } q(m) - q(m-1) = 1 \\ \infty & \text{otherwise} \end{cases} \tag{8.6} $$

Equation (8.6) forces a movement of 1 frame in both pieces when there is an onset in the synthetic audio. If $y^{(2)}_{q(m)}$ is not related to a note onset, then we assume that the cost function is stationary and takes a value $c(p(m) - p(m-1),\ q(m) - q(m-1))$. When the modified cost function described in this section is applied to score alignment using DTW, the constraint ensures that every note onset is uniquely identifiable in the path of the score pointer through the observed audio.

8.2.3 Hidden Markov Model Formulation

The hidden Markov model (HMM) formulation of score alignment is more powerful as it uses an explicit model of the score itself. This allows a generative model $p(y_t \mid \theta_m)$ of a frame of audio $y_t$ to be used to infer unknown parameters $\theta_m$ using all of the frames in both the synthesized and observed audio corresponding to a certain set of notes in the score, and even using other frames sharing one or more notes. A library of training data may also be used as priors for the signal model.

The parameter set $\theta_m$ at score position $m$ includes the pitches and volumes of the set of notes currently sounding, plus any unknown parameters of the generative model for the frame. Any reasonable generative model for a frame of audio given the playing notes may be used, including the Bayesian models developed in the preceding chapters of this thesis. The joint probability distribution of the frame $y_t$ and the parameters $\theta_m$ may be written as

$$ p(y_t, \theta_m) = p(y_t \mid \theta_m)\, p(\theta_m) $$

Now if we have previously obtained, through synthesis or other means, several frames of music $y_t^{(n)}$, $t = 1, \ldots, T_n$, which we denote $y^{(n)}_{1:T_n}$ and which we know were generated from a score with parameters $\theta_m$, then we can update the prior generative model with the additional data. As the expression below shows, this is equivalent to replacing the prior $p(\theta_m)$ with the posterior under the previously observed frames $p(\theta_m \mid y^{(n)}_{1:T_n})$:

$$ p\big(y_t, \theta_m \mid y^{(n)}_{1:T_n}\big) = p(y_t \mid \theta_m)\, p\big(\theta_m \mid y^{(n)}_{1:T_n}\big) $$

If the prior is a conjugate prior of the likelihood function $p(y_t \mid \theta_m)$ then the posterior $p(\theta_m \mid y^{(n)}_{1:T_n})$ is of the same family as the prior, the parameters of which are calculated using standard update rules.

In the HMM formulation, we are only interested in inferring the path of the score pointer $x_t$ through the observed audio $y_t$, as the path through any synthesized audio is already known. The probabilistic model for the movement of the score pointer is simple. Usually from one frame to the next, the score pointer may either stay in the same score position or move to the next position:

$$ p(x_{t+1} \mid x_t) = \begin{cases} p_m & x_{t+1} = m+1,\ x_t = m \\ 1 - p_m & x_{t+1} = x_t = m \\ 0 & \text{otherwise} \end{cases} \tag{8.7} $$

$p_m$ is related to the current tempo and the expected duration of the event represented by the score pointer position at that point. For example, if the event is expected to last 1 second and the time difference between frames is 125 ms then $p_m$ should be set to 1/8.

The model (8.7) may be extended to allow other transitions, such as skipping a score event in a live and error prone performance. An additional Markov model allowing changes in tempo may be added by extending the hidden state $x_t$ to include the unknown tempo parameter. The tempo should only be allowed to change slowly, for example at bar lines in the score or explicitly marked tempo change points.

When $\theta_m$ represents a note onset, we set $p_m = 1$, similar to the constraint (8.6), to improve the accuracy of onset timings. This forces the score pointer to move one frame at the onset.
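The transition model (8.7) can be written out as a matrix, as in the following sketch. The move probability is taken as the frame hop divided by the expected event duration, which reproduces the 1/8 example above, since staying with probability $1 - p_m$ gives an expected dwell of $1/p_m$ frames. Onset positions are given a move probability of 1 so that they are occupied for exactly one frame; this reading of the onset constraint, the absorbing final position and the function names are assumptions of this example.

```python
def move_probability(expected_duration_s, hop_s):
    """p_m chosen so that the expected dwell time equals the event duration."""
    return min(1.0, hop_s / expected_duration_s)

def transition_matrix(durations_s, hop_s, onset_states=()):
    """Left-to-right transition matrix A[i][j] = p(x_{t+1} = j | x_t = i).

    durations_s: expected duration in seconds of each score position.
    onset_states: indices of positions representing note onsets, which are
    forced to last a single frame (assumption: move probability 1).
    The final position is made absorbing (assumption).
    """
    M = len(durations_s)
    A = [[0.0] * M for _ in range(M)]
    for m in range(M):
        if m == M - 1:
            A[m][m] = 1.0
            continue
        p = 1.0 if m in onset_states else move_probability(durations_s[m], hop_s)
        A[m][m + 1] = p
        A[m][m] = 1.0 - p
    return A
```

Extensions mentioned in the text, such as skipping a score event, would add further non-zero entries to each row.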
The remainder of the note is represented by $\theta_{m+1}$ and may be further divided into sustain and release portions of the note if this is available or may be inferred from the score.

8.2.4 Inference

In Peeling et al. [2007a] we described two methods of inference using a simpler version of the hidden Markov model described in 8.2.3. In this section, we extend these methods into an iterative scheme which improves the accuracy of the timings and the consistency of the inferred score parameters. This scheme firstly infers $x_{1:T}$ given $\theta_{1:M}$ and then infers $\theta_{1:M}$ given $x_{1:T}$.

The method of calculating the posterior marginals $p(x_t \mid y_{1:T}, \theta_{1:M})$ for all $t = 1, \ldots, T$ is known as the forwards-backwards algorithm [Rabiner, 1989]. Modifications of this algorithm can be used on-line to calculate a fixed lag filtering distribution $p(x_t \mid y_{1:t+L}, \theta_{1:M})$, where $L \geq 0$ is the lag permitted in observations before the score pointer position is required.

The method of calculating the mode of the posterior distribution is known as the Viterbi algorithm. It is used where a consistent path of the score pointer is required across the entire piece.

The remaining step of inference is to update the parameters $\theta_m$ of the generative model, given that we have attempted to fit the score to the observed frames. The standard method for hidden Markov models is the Baum-Welch algorithm, an expectation-maximization algorithm. The algorithm first computes the posterior distribution of the score pointer by the forwards-backwards procedure, and then locates the MAP estimate of $\theta_{1:M}$ under the posterior distribution of the parameters $p(\theta_{1:M} \mid x_{1:T}, y_{1:T})$. An alternative method, known as conditional modes, uses the Viterbi path to compute the score pointer, and then maximizes $p(\theta_m \mid \{x_t, y_t\} : x_t = m)$ for all $m$.
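For the path-finding half of the iterative scheme, a standard log-domain Viterbi recursion is sketched below. It takes the initial, transition and per-frame observation log probabilities as plain lists, so any generative frame model can supply the observation terms. The function names and the `safe_log` helper are illustrative and not taken from the text.

```python
import math

def safe_log(p):
    """Logarithm that maps zero probability to minus infinity."""
    return math.log(p) if p > 0.0 else -math.inf

def viterbi(log_init, log_trans, log_obs):
    """Most probable score pointer path under a hidden Markov model.

    log_init[m]: log prior of starting at score position m.
    log_trans[i][j]: log probability of moving from position i to j.
    log_obs[t][m]: log likelihood of frame t given position m.
    Returns the path as a list of 0-based score positions.
    """
    T, M = len(log_obs), len(log_init)
    delta = [log_init[m] + log_obs[0][m] for m in range(M)]
    back = []
    for t in range(1, T):
        new_delta, pointers = [], []
        for j in range(M):
            best_i = max(range(M), key=lambda i: delta[i] + log_trans[i][j])
            new_delta.append(delta[best_i] + log_trans[best_i][j] + log_obs[t][j])
            pointers.append(best_i)
        delta = new_delta
        back.append(pointers)
    # Backtrack from the most probable final position.
    state = max(range(M), key=lambda m: delta[m])
    path = [state]
    for pointers in reversed(back):
        state = pointers[state]
        path.append(state)
    path.reverse()
    return path
```

In a conditional modes iteration, the returned path would be used to group frames by score position before each $\theta_m$ is re-estimated from its own group.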
As this method segments the score parameters and data into separate maximization problems, there is potential for the computation to be carried out in parallel, and in general the computation and memory requirements are lower than those of the Baum-Welch algorithm.

8.2.5 Results

Figure 8.1 on page 144 shows an example of audio alignment carried out using DTW techniques on a recorded piece of guitar audio aligned to a synthesized version. The synthesized version was generated from the known score of the piece with a constant tempo. The note onsets were identified from the score, and timings were assigned by scaling the number of beats from the beginning of the score to the note onset event. Note onset costs are implemented as described in 8.2.2.

The distance matrix $d(p, q)$ is computed using a single source Gaussian variance model (Section 7.2) as follows. Each element of the distance matrix is the joint marginal likelihood of the frame of recorded audio $y^{(1)}_p$ and the frame of synthetic audio $y^{(2)}_q$, assuming that both frames share the same template vector $t$ but have separate excitation parameters: $v^{(1)}_p$ being the excitation parameter for $y^{(1)}_p$ and $v^{(2)}_q$ being the excitation parameter for $y^{(2)}_q$. We therefore calculate

\[ d(y^{(1)}_p, y^{(2)}_q) = p(y^{(1)}_p, y^{(2)}_q) = \int p(y^{(1)}_p \mid t, v^{(1)}_p)\, p(y^{(2)}_q \mid t, v^{(2)}_q)\, p(t)\, p(v^{(1)}_p)\, p(v^{(2)}_q)\, dt\, dv^{(1)}_p\, dv^{(2)}_q \tag{8.8} \]

for each pair of frames, using the Variational Bayes algorithm (Algorithm 7.1) to approximate the integral in (8.8). The template and excitation hyperparameters are chosen to be uniform: $a_t = 2$, $b_t = 1$ and $a_v = 2$, $b_v = 1$, and the hyperparameters are not optimized for this algorithm. The alignment path is then computed by DTW. The result is that the timings of note onsets in the recorded piece are synchronized well with the synthetic version, which is indicated by the alignment path passing through the intersections of the edges of the vertical and horizontal bands in the distance matrix.
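The DTW step that turns a frame-by-frame matrix into an alignment path can be sketched as follows. This is a generic textbook formulation rather than the implementation used for the results: it assumes the matrix holds costs to be minimized (for a likelihood such as (8.8), one would pass the negative log-likelihood), it allows only diagonal, vertical and horizontal steps, and it omits the note onset costs of 8.2.2. The function name `dtw_path` is hypothetical.

```python
import math

def dtw_path(cost):
    """Minimum-cost monotonic path from (0, 0) to the last cell of `cost`.

    cost[p][q] is the cost of matching observed frame p to synthetic frame q.
    Returns the path as a list of (p, q) pairs and its accumulated cost.
    """
    n_rows, n_cols = len(cost), len(cost[0])
    acc = [[math.inf] * n_cols for _ in range(n_rows)]
    acc[0][0] = cost[0][0]
    # Forward pass: accumulate the cheapest way of reaching every cell.
    for p in range(n_rows):
        for q in range(n_cols):
            if p == 0 and q == 0:
                continue
            best_prev = min(
                acc[p - 1][q - 1] if p > 0 and q > 0 else math.inf,
                acc[p - 1][q] if p > 0 else math.inf,
                acc[p][q - 1] if q > 0 else math.inf,
            )
            acc[p][q] = cost[p][q] + best_prev
    # Backward pass: follow the cheapest predecessor back to the origin.
    p, q = n_rows - 1, n_cols - 1
    path = [(p, q)]
    while (p, q) != (0, 0):
        steps = []
        if p > 0 and q > 0:
            steps.append((acc[p - 1][q - 1], p - 1, q - 1))
        if p > 0:
            steps.append((acc[p - 1][q], p - 1, q))
        if q > 0:
            steps.append((acc[p][q - 1], p, q - 1))
        _, p, q = min(steps)
        path.append((p, q))
    path.reverse()
    return path, acc[-1][-1]
```

The quadratic time and memory of the accumulated-cost matrix is the reason a standard DTW implementation does not extend to indefinitely long or real-time input.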
Figure 8.1a on page 144 shows the spectrogram of the recorded audio, and Figure 8.1b on page 144 shows the distance matrix computed from (8.8) overlaid with the alignment path. By comparing the distance matrix with the spectrogram, it can be seen that the note onsets of the observed audio occur at the edges of the vertical bands. Similarly, note onsets in the synthetic audio correspond to the edges of the horizontal bands. On close observation, the alignment path passes through the intersections of the band edges, which indicates that the note onsets in the observed and synthetic audio have been matched together with little timing error.

To quantify the improvement in the alignment when using note onset costs and the iterative inference algorithm of 8.2.4, we use a data set built from Midi and mp3 files from the Classical Piano Midi page¹. The mp3 audio files are recorded acoustically from a Midi controlled grand piano, and are aligned to the Midi files provided. An accompanying synthetic set of audio was obtained by removing all of the tempo and expressive markings from the Midi files and synthesizing the result. Both pieces of audio were downsampled to 8000 Hz and split into frames of 48 ms with 50% overlap between the frames. These were then matched together using the algorithms described in this chapter.

The tempo through the observed audio is assumed to be constant, and is estimated from the tempo of the synthetic audio by multiplying by the ratio of the lengths of the two pieces. The estimated tempo is then used to set the cost function for moving from one frame to the next using the model in (8.7). For each note onset in the score, we identify the frame in the synthetic audio containing the note onset.
If that frame is matched to a unique frame in the observed audio by the DTW algorithm (which is guaranteed if the note onset costs described in 8.2.2 are used), the centre of the frame in the observed audio is recorded as the timing of the note onset in the observed audio. If the frame in the synthetic audio is not matched to a unique frame in the observed audio, the timing of the onset in the observed audio is recorded as halfway between the start of the first frame and the end of the last frame of the group of frames in the observed audio matched to the frame containing the onset in the synthetic audio.

¹ www.piano-midi.de

[Figure 8.1, panel (a): spectrogram; horizontal axis Time / s, vertical axis Frequency / Hz.]
(a) Spectrogram of the observed audio. Vertical features in the spectrogram correspond to vertical bands in the distance matrix below.

[Figure 8.1, panel (b): distance matrix; horizontal axis Observed audio time / s, vertical axis Synthesized audio time / s.]
(b) The distance matrix of the observed spectrogram above, compared frame-by-frame to synthesized audio from the same score.

Figure 8.1: On the distance matrix of the observed spectrogram, regions of high spectral similarity are shaded darker and regions of low spectral similarity are shaded lighter. Overlaid in red is the optimal alignment path computed using DTW, which moves steadily from the beginning of both pieces along the diagonal of the distance matrix.
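The rule used to read an onset time off the alignment can be sketched as below. The frame length of 48 ms and the hop of 24 ms follow from the 48 ms frames with 50% overlap used in the experiments; the function name, zero-based frame indexing, and the convention that frame $i$ starts at $i$ times the hop are assumptions made for illustration.

```python
def onset_time_ms(matched_frames, frame_ms=48.0, hop_ms=24.0):
    """Onset time in the observed audio, from the observed frames matched
    to the synthetic frame that contains the onset.

    matched_frames: zero-based indices of the matched observed frames.
    """
    first, last = min(matched_frames), max(matched_frames)
    start_of_first = first * hop_ms          # assumption: frame i starts at i * hop
    end_of_last = last * hop_ms + frame_ms
    # Halfway between the start of the first frame and the end of the last.
    # For a single matched frame this is simply the centre of that frame.
    return 0.5 * (start_of_first + end_of_last)
```

Under these conventions the two cases in the text collapse into one expression, since the midpoint of a single frame is its centre.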
Piece | Unaligned | Cosine distance: DTW | Cosine distance: note onset costs | Gaussian variance: DTW | Gaussian variance: note onset costs | Gaussian variance: iterative inference
alb_esp1 | 578.1 | 350.1 | 342.7 | 357.8 | 345.3 | 331.8
scn15_5 | 1278.5 | 11.9 | 11.6 | 11.3 | 10.9 | 10.6
bor_ps7 | 822.0 | 55.4 | 51.5 | 50.6 | 47.0 | 45.8
alb_esp2 | 1203.4 | 122.7 | 119.9 | 111.5 | 110.0 | 110.0
mendel_op62_4 | 607.5 | 16.0 | 15.9 | 15.8 | 15.4 | 14.7
ty_maerz | 3068.1 | 639.0 | 634.1 | 637.6 | 633.6 | 625.7
chpn-p22 | 753.2 | 158.4 | 152.7 | 154.9 | 151.1 | 151.0
scn15_3 | 74.2 | 12.4 | 12.2 | 17.0 | 13.4 | 12.4
scn15_1 | 374.4 | 13.2 | 12.5 | 12.6 | 12.2 | 12.2
scn15_6 | 408.0 | 27.2 | 22.8 | 22.7 | 22.2 | 19.4

Table 8.1: Score alignment: median alignment error in milliseconds.

We then calculate the difference in milliseconds between the onset timings in the original Midi (before removing tempo changes) and the note onsets identified in the aligned audio. The quality of alignment is measured using the median alignment error of note onsets in milliseconds, as this evaluation criterion is also used in Cont et al. [2007] and Devaney et al. [2009] for score alignment. The results are presented in Table 8.1 on page 145.

From these results we observe that the cosine distance and Gaussian variance models produce similar alignments. The iterative inference scheme that is available for the Gaussian variance model provides modest improvements in alignment accuracy. The accuracy of the alignment varies strongly with the piece that is being aligned. The common characteristic of the pieces which are badly aligned is that they include sections of fast notes with high polyphony, and in these situations a simple distance metric such as the cosine distance or a single source Gaussian variance model is too weak a model to improve alignment dramatically.
We therefore suggest that a useful line of future work would be to investigate more powerful models which are in themselves capable of music transcription, such as the multiple source Gaussian variance model in Chapter 7 with training data in addition to the synthetic audio, and to implement them in the inference framework described in this chapter, in order to tackle these more difficult cases of score alignment.

In Figure 8.2 on page 146 we show how the spectrogram, audio and aligned score can be presented in a visual manner. The score pointer is presented as a vertical bar which moves through both the spectrogram and the score, synchronized with the audio track.

[Figure 8.2, two panels. Top, "Spectrogram Data": horizontal axis Time / s, vertical axis Frequency / Hz. Bottom, "MIDI Data": horizontal axis Score position, vertical axis MIDI note.]

Figure 8.2: Score alignment using the Gaussian variance model. The movement of the score pointer is displayed regularly in time as a vertical bar passing through the spectrogram and also the Midi representation. The spectral features can be matched visually to the score representation easily, and it can be seen that the score pointer positions shown correspond to the appropriate timing in the audio. In a practical visualization application, this figure can be presented as a video, where the score pointer moves through the spectrogram and the Midi representation concurrently, whilst the audio is playing. It is then much easier to notice subtle changes in tempo (in performance rather than an explicit score marking), for example between score positions 150 and 175, which marks the entry of the second voice in Bach's second Fugue in C minor (from the Well-Tempered Clavier, performed by Daniel Ben Pienaar).

8.3 Event Based Inference

The generative model described in 8.1.2 works directly with frames of observed audio, and infers the temporal structure from the posterior distribution. In this section, we consider a different application, where the observations are not frames of audio but incompletely labeled events. The observations here are the output of an audio preprocessing system such as an onset detector (which detects rapid changes in the energy of the audio over time), or an approximate transcription of the melodic line of the piece, which could be provided by a human in the context of a query-by-humming application. In these situations, we again wish to determine the temporal structure in the music, and be able to match and align the observed events with the underlying score. The goal of this section is the same as in Section 8.2; however, instead of using a generative model of the audio in each frame, we must now define and infer using a model which describes the production of these observed events from the score.

For example, a tempo tracking algorithm must first infer the presence of onsets of note events in a piece of music, and then infer the overall speed of the arrival of events by looking at periodic patterns in the timing of events (Figure 8.3 on page 151). The onset timings may also be provided directly by a human listener, in an application which infers and even controls the tempo of a piece playing, or in a query-by-tapping context, where the application attempts to match the timings provided by the listener to a Midi database. In these applications, the events are unlabeled: we do not know which beat of which bar each event belongs to. Exact onset timings may also be acquired from a Midi enabled instrument attempting to perform a score. Analyzing the note timings in relation to the score is useful for musicological studies of performance.

Another application is an incomplete transcription of the score, such as a melodic line transcription in a query-by-singing application, or even the output of a classification model-based transcription system which cannot easily be set in a generative model framework.
In these applications, the events are partially labeled: we know the pitch and perhaps the volume of the note, but these notes are not matched with the score itself. Some of the notes observed may be erroneous, and others may arrive in the wrong order, even subtly in the case of piano chords and polyphonic music.

In the above descriptions, we have motivated both the use of an alignment application (to infer the tempo throughout the audio) and a matching application where we wish to select the best match to the observed events from a large set of candidate pieces. Although these applications appear different, they can both be addressed by Bayesian inference of the unknown model parameters. For alignment, inference of the note onset positions can be used to construct the progress of the tempo through the piece. For matching against other candidates, Bayesian inference can compute the likelihood of the observed events given a candidate score. The candidate giving the highest likelihood to the observations is therefore the best match to the observations. In 8.3.3 we consider a query-by-tapping application, where we attempt to match a set of onsets intended to mimic the rhythmic structure of a piece of music, to candidate scores from a database.

In terms of the generative model $p(y_t \mid \theta_m)\, p(\theta_m)$, $y_t$ refers to the set of event onsets inferred in that frame. In this section we will propose a Bayesian model which can be adapted to all of the above applications in a general way. We define $C$ as the number of categories of events that we are able to observe. If an event of category $c \in \{1, \ldots, C\}$ occurs in frame $y_t$, then we use the notation that $c \in y_t$, and if it is expected to occur during the section of score $\theta_m$ then $c \in \theta_m$. The definition of an event category depends on the nature of the observed events.

If the observed events are generated by a simple onset detector, there is only one event category: the detected onsets.
Simultaneously sounding note onsets are grouped into score onset sections. In a score onset section $c \in \theta_m$, as we expect an onset to be detected when there is an onset in the score. Sections are also defined for the periods in the score where there are no onsets. In these sections, $c \notin \theta_m$, as we do not expect an onset to be detected.

If the observed events are note onsets returned by a melodic line transcription algorithm, then each pitch returned by the transcription algorithm would correspond to one of the $C$ categories. The score is divided into disjoint sections, where a group of pitches is sounding throughout a section, or there is silence. When a pitch corresponding to event category $c$ is sounding in section $m$ then $c \in \theta_m$, as we expect the transcription algorithm to detect this pitch during this section of the score. In the case of silence, we expect $c \notin \theta_m$ for all $c$.

8.3.1 Counting of Temporal Events

The basic model we propose maintains a count of the number and category of temporal events observed in the music up to and including the present frame. Our observation is a function $n_c(t)$ for every frame $y_t$, defined as the number of events of category $c$ observed up to and including frame $y_t$. Our prior model consists of the expected number of events observed at different times in the piece of music. We will model the temporal events as a set of independent non-homogeneous Poisson processes, one for each category. The occurrence of events of category $c$ has a time-varying intensity $\rho_c(t)$, whose cumulative sum gives the expected number of events by the time we reach frame $t$, i.e.,

\[ n_c(t) \sim \mathrm{Po}\left( \sum_{\tau=0}^{t} \rho_c(\tau) \right) = \mathrm{Po}(\lambda_c(t)) \tag{8.9} \]

where $\lambda_c(t) = \sum_{\tau=0}^{t} \rho_c(\tau)$. The intensity function $\rho_c(t)$ gives the expected number of events of category $c$ occurring in each frame.
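As an illustration of the counting model (8.9), the sketch below evaluates the log-probability of a sequence of cumulative event counts for a single category, given a per-frame intensity. It treats the cumulative count in each frame as a separate Poisson observation and sums the log-probabilities over frames; this factorization, the function name and the handling of a zero cumulative intensity are assumptions made for illustration.

```python
import math

def count_log_likelihood(counts, intensity):
    """Log-probability of cumulative counts n_c(t) under n_c(t) ~ Po(lambda_c(t)).

    counts: n_c(t) for t = 0, 1, ..., the number of events seen up to frame t.
    intensity: rho_c(t) for the same frames, the expected events per frame.
    """
    log_lik = 0.0
    lam = 0.0
    for n, rho in zip(counts, intensity):
        lam += rho  # lambda_c(t) is the running sum of rho_c
        if lam == 0.0:
            # A Poisson variable with zero mean is zero with probability one.
            if n != 0:
                return -math.inf
            continue
        # log Po(n; lam) = n log(lam) - lam - log(n!)
        log_lik += n * math.log(lam) - lam - math.lgamma(n + 1)
    return log_lik
```

Counts that track the cumulative intensity closely receive a higher value than counts compared against a mismatched intensity, which is the property used when ranking candidate scores.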
When matching onset detections to a score, we would initially expect that this intensity function is equal to the number of events in the section of the score $x_t$ corresponding to the frame $t$, i.e.,

\[ \rho_c(t) \mid \theta_{x_t} = \begin{cases} 1 & c \in \theta_{x_t} \\ 0 & c \notin \theta_{x_t} \end{cases} \tag{8.10} \]

This model of the intensity function is suitable for ideal cases where we do not expect any errors in the performance or errors in the detection of the events, and the only unknown variable is the tempo of the performance, represented by $x_t$, which maps frame $t$ to score section $m$.

(8.9) is a generative model for the occurrence of observed events, given the expected counts of events in the underlying score, and therefore may be used for all of the applications described earlier. Alignment applications involve inferring the tempo of the performance $x_t$ throughout the piece. Applications which match the observations to the best candidate score must compute the likelihood of the observed event counts $n_c(t)$ given the candidate score $\theta_{1:M}$,

\[ \prod_{c=1}^{C} \prod_{t=1}^{T} p(n_c(t) \mid \theta_{1:M}) \tag{8.11} \]

for each candidate. The candidate with the highest likelihood (8.11) is chosen as the best match. The Poisson assumption allows variability in both the number of events detected and their timings. The maximum likelihood case is when the events match the score exactly; however, we show in 8.3.3 that using (8.10) without modification is sufficient to match scores to observations on a global or coarse scale.

8.3.2 Clutter and Missed Detections

For a more powerful model, which is more appropriate for real performances and onset detection methods and is able to match individual events in the score to the observations, we may begin by adding two additional error parameters per event type. The first parameter is a clutter process $\rho^{(\mathrm{clutter})}_c$, which gives the expected number of spurious event detections in each frame. The second parameter governs the probability of missing note detections, which we denote $\rho^{(\mathrm{missed})}_c$.
The new model for the intensity function is

\[ \rho_c(t) \mid \theta_{x_t} = \begin{cases} 1 - \rho^{(\mathrm{missed})}_c & c \in \theta_{x_t} \\ \rho^{(\mathrm{clutter})}_c & c \notin \theta_{x_t} \end{cases} \]

In this section, we treat the two error processes as independent per event type and constant throughout the score and observation. It is straightforward to allow the error parameters to vary across different parts of the score, for example in fast moving sections where notes are more likely to be missed, by appropriately subdividing the score.

These parameters are straightforward to infer as part of the iterative procedure described in 8.2.4. If the current MAP estimate of the path of the score pointer is given by $x^*_t$ for all $t$, then the maximum likelihood estimate of the missed event parameter,

\[ \rho^{(\mathrm{missed})}_c = \frac{1}{T} \sum_{t=0}^{T} I\left[ c \in \theta_{x^*_t} \right] I\left[ c \notin y_t \right] \]

is the average of the missed detections in the observation, and the maximum likelihood estimate of the clutter parameter,

\[ \rho^{(\mathrm{clutter})}_c = \frac{1}{T} \sum_{t=0}^{T} I\left[ c \notin \theta_{x^*_t} \right] I\left[ c \in y_t \right] \]

is the average of the spurious detections.

If we are repeatedly using a particular technique for detecting events, it is likely that we would have strong prior information about the clutter and missed event processes. In this case, we can put a Beta distribution (Section A.4)

\[ \rho^{(\mathrm{missed})}_c \sim \mathcal{B}\left( \alpha^{(\mathrm{missed})}_c, \beta^{(\mathrm{missed})}_c \right) \]

on the parameter, where $\alpha^{(\mathrm{missed})}_c + \beta^{(\mathrm{missed})}_c$ is the number of times we applied the technique and $\beta^{(\mathrm{missed})}_c$ is the number of missed detections. The maximum a posteriori estimate of the missed event parameter is

\[ \rho^{(\mathrm{missed})}_c = \frac{1}{T + \alpha^{(\mathrm{missed})}_c + \beta^{(\mathrm{missed})}_c} \left( \beta^{(\mathrm{missed})}_c + \sum_{t=0}^{T} I\left[ c \in \theta_{x^*_t} \right] I\left[ c \notin y_t \right] \right) \]

An equivalent formula holds for the clutter parameter.

8.3.3 Query-by-Tapping Results

The Mirex Query by Tapping Task is an interesting setting for evaluating models of temporal structure in music. A set of onset times is provided to the system, and the task is to match the rhythmic structure observed to one of a set of candidate scores.
The Mirex task provides a set of monophonic Midi database records, and a set of audio tapped and symbolic queries available for download². The question here is whether the temporal structure of music can be adequately represented by rhythmic (onset) information alone [Jang et al., 2001]. This has implications for score-alignment algorithms where robustness may be increased by ignoring pitch information rather than including it.

Applying the models here is straightforward. $C = 1$ and $n_c(t)$ is the number of observed onsets provided in the query up to frame $t$. The cumulative intensity $\lambda_c(t)$ is set to the number of note onsets in the Midi file up to frame $t$. The examples in the Mirex task are monophonic, so note onsets do not happen at exactly the same time. The system ranks all of the Midi files in the database according to the likelihood (8.11) for each query. Figure 8.3 on page 151 shows an example of the inter-onset timings which are used to match the observed onsets to the note onsets in Midi.

To evaluate how well the system performs, for each query $q$ we compute the rank $r_q$ of the correct score when the database scores are ranked according to likelihood. If the system correctly matches query $q$ by assigning the highest likelihood, then $r_q = 1$. If the system assigned two incorrect scores a higher likelihood than the correct score, then $r_q = 3$. The evaluation method published by Mirex is the mean reciprocal rank (MRR) over all the queries $q \in Q$:

\[ \mathrm{MRR} = \frac{1}{|Q|} \sum_{q \in Q} \frac{1}{r_q} \]

such that a perfect system which returns the correct score for every query, i.e., $r_q = 1 \; \forall q$, has an MRR of 1, and systems which make more mistakes have lower MRR values. The best performing algorithm for the 2008 task, by Typke and Walczak-Typke [2008], which calculates an Earth mover's distance (EMD) between the observed and score onsets, achieved an average MRR of 0.52.
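The evaluation measure can be sketched as follows. The data layout (a dictionary of per-candidate scores for each query, where a higher score means a better match, together with the name of the correct candidate) and the function names are assumptions made for illustration; ties are resolved in favour of the correct candidate, which is also an assumption.

```python
def reciprocal_rank(scores, correct):
    """1 / r_q, where r_q is the rank of the correct candidate.

    scores: mapping from candidate name to its score (e.g. log-likelihood).
    correct: name of the candidate that should have been returned.
    """
    target = scores[correct]
    # The rank is one more than the number of candidates that score higher.
    rank = 1 + sum(1 for s in scores.values() if s > target)
    return 1.0 / rank

def mean_reciprocal_rank(queries):
    """Average of 1 / r_q over a list of (scores, correct) pairs."""
    return sum(reciprocal_rank(s, c) for s, c in queries) / len(queries)
```

For example, if the correct score is ranked first for one query and third for another, the MRR over those two queries is (1 + 1/3) / 2 = 2/3.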
Even without using any Bayesian techniques, we obtain an average MRR of 0.54 on the Mirex database, outperforming more elaborate and computationally expensive algorithms on the symbolic onset data.

² www.music-ir.org/mirex/wiki/2008:Query_by_Tapping

[Figure 8.3, two panels: "Query Onsets" and "MIDI Onsets"; vertical axis Time / ms.]

Figure 8.3: Inter-onset timings in a query-by-tapping problem. The timings in the tapped query have the same rhythmic structure as the Midi timings, which enables the query to be matched with a high likelihood to the score using the Poisson process model developed in this chapter.

8.4 Conclusion

In this chapter we have developed techniques for applying a generative model of a musical audio signal, coupled with a Markov model of the movement of a score pointer, to a variety of inference tasks in musical signal processing.

The first application we consider is aligning two extracts of musical audio by matching frames on the basis of the similarity of spectral features. A popular framework for carrying out this task is dynamic time warping (DTW). By expressing the dynamic time warping problem in a generative model setting, we are able to incorporate Bayesian priors on the spectral features and the matching process in order to generate more reliable and realistic results. Moreover, we are able to extend the method to multiple extracts and apply iterative Bayesian inference techniques on hidden Markov models (8.2.4) in order to process frames in real time and over indefinitely long periods of time, which is not possible with a standard DTW implementation.

The second application we consider is score alignment, where we typically have a symbolic representation of a piece of music which is to be matched to an extract of audio. The audio is expected to be a performance of that score with tempo changes and some errors. We sketch a model of the score pointer which assigns more importance to the position of onsets in the score, and allows gradual tempo changes. We also show how training data, for example a synthesized version of the score, can be used to infer the parameters of the generative model, and how to update the model using the structure inferred within the observed audio itself. A standard approach to score alignment is to match a synthesized track with the observed audio using DTW. By applying the prior model of the score pointer position and the improved inference techniques, we show how the accuracy of the matching can be improved, especially at the position of note onsets, where timing errors are perceptually most critical.

The final applications are based on an event-based observation of the audio, which may be obtained through an onset detector or a principal pitch transcription algorithm. We develop a novel event counting generative model of the events using a non-homogeneous Poisson process, which replaces a generative model of the signal itself, and may use the same inference techniques as the other models in this chapter. We describe simple priors for the event processes which allow for, and permit inference of, the probability of missed event detections and clutter for each event type. The approach is applied to a query-by-tapping example where it is shown to be more accurate than more elaborate algorithms. The work of Degara et al. [2010] shows that significant improvement in note onset detection can be achieved by applying a prior model of rhythmic structure and fusing onset detection with rhythmic structure constraints. A natural extension of the work in Section 8.3 would be to jointly infer the event positions and the score structure using Bayesian inference.
Throughout this chapter we have mostly outlined the prior models and inference algorithms that can be applied using this framework, whereas in previous work [Peeling et al., 2007a] we have expressed a specific approach in more detail. The hidden Markov model framework is general and well studied, and many software packages exist for inference in HMMs which only require the specification of the observation model and the dynamics of the hidden state. Our contribution in this chapter has been to illustrate how several important applications in musical signal processing may be carried out through these inference algorithms.

Chapter 9 Conclusion

9.1 Summary

In this thesis we have presented a variety of Bayesian models and modelling methods aimed at tackling difficult applications of the processing of musical audio signals. A hierarchy of Bayesian priors is a feasible way to represent the complex structure exhibited in musical audio that is known from musical theory and has been obtained by experiments on the physics of musical instruments. Generative models of music are attractive as they are not tied to a particular application. Instead, the same generative model and prior structure may be used for music transcription, source separation, synthesis and reconstruction, simply by identifying which parameters of the model are known and unknown in a particular context, and applying an appropriate inference algorithm to the model to infer the values of the unknown parameters. For each model we have developed in this thesis, we have described how it differs from existing work or generalizes an existing model, and have provided a theoretical basis and justification for the accuracy observed in the experimental results presented.

We have focused particularly on the problem of multiple pitch detection in frames of audio, because it is an exacting measure of how appropriate and accurate a model of a musical signal is, especially for mixtures of three or more notes.
Our assumption is that a generative model which is capable of performing the difficult classification task of identifying multiple pitches with high accuracy will also produce faithful and realistic reconstructions when used for its original purpose in a synthesis application, as far as the limitations of the particular model allow¹.

¹ The matrix factorization models in Chapter 7 have no explicit model for the phase of the source coefficients, hence the phase of a synthetic signal would need to be constructed using a phase vocoder [Flanagan et al., 1965, Cemgil and Godsill, 2005].

Another goal of this thesis was to produce algorithms and inference methods using these generative models that are ultimately feasible for real-time applications. Musical signals are inherently complex, and even a simple model requires many parameters to model a single frame of audio. Full Bayesian inference by reversible jump MCMC when the number of parameters is itself unknown is slow, despite substantial improvements to inference in the Bayesian harmonic model in previous work. It is important, however, that any compromises made should be with the inference algorithm rather than by oversimplifying the model, because musical audio requires a rich prior structure to capture the information we are extracting from the signal.

In Chapter 5 we embellished the existing Bayesian harmonic model with a justification for using the Hilbert transform to improve the model's ability to capture frequency and amplitude modulations in a partial frequency. We also saw a small improvement when using sinc windowed Gabor basis functions to model signals from instruments with vibrato. We also provided a result for the mode of the posterior distribution of the signal-to-noise ratio parameter which reduces the number of parameters that need to be simulated in the MCMC algorithm.
By making these modifications to the existing model, we were able to show improvement in the accuracy of multiple pitch detection for two-note mixtures, although there was no improvement demonstrated for mixtures of three or more notes, and we claimed this was due to not estimating the frequency values to sufficient accuracy. In Chapter 6 however, we make this model practical for applications demanding faster computation, by splitting inference into two stages: the first stage to detect partial frequency positions in a frame using the generative model with a vague prior over possible frequency positions, using numerical techniques to improve the accuracy of the frequency estimates; and the second stage to fit a harmonic model to the estimated partial frequencies. Both stages were implemented using greedy algorithms, but we showed that the generative signal model and the prior harmonic model are powerful enough to detect the number and frequencies of pitches in a frame to the same level of accuracy as a full Bayesian inference scheme, but with computation reduced dramatically. In Chapter 7 each frame of audio is modelled by projection onto a fixed basis, rather than inferring the number and frequencies of the bases. The harmonic spectrum of a note is represented as a prior on the relative amplitudes of the basis functions in each frame, and the amplitude envelope of a note is represented as a prior on the relative energy of the signal across frames. The model is linear in the amplitude parameters, allowing multiple notes to be superimposed, and importantly multiple frames can be processed in parallel. A simple polyphonic transcription algorithm was implemented for this model, and was shown to have a competitive level of accuracy to other transcription schemes on a large set of classical music. In Chapter 8 we present a unified framework for musical signal processing applications which interact with a score of the music. 
This framework defines the dynamics of a virtual score pointer, such that the generative model of the signal in each frame depends on the properties of the score at the position indicated by the score pointer. By treating the dynamics and the generative model together as a hidden Markov model, we have a large and flexible class of inference algorithms available to estimate the movement of the score pointer through the signal and to iteratively train and learn the parameters of the generative models on past and currently observed data.

9.2 Discussion

The use of Bayesian methods defines a clear separation between the model and the inference technique used to infer unknown quantities about that model. The use of hierarchical priors allows different models to be substituted in place of one another. For example, in Chapter 8 any of the generative signal models developed in this thesis, or elsewhere, can be introduced without modifying the overall structure of the inference algorithm, although the complexity and number of parameters of the signal model may prohibit this. In Chapter 5 we were able to substitute new models developed in the chapter into the reversible jump MCMC algorithm developed in existing work without modifying any of the parameters used in the inference. This allowed us to objectively measure and demonstrate different levels of accuracy achieved with different model configurations.

Through the use of analogies between models, we are able to apply the techniques developed by researchers in different fields and contribute likewise. This is illustrated in Chapter 7 and prior work, where the technique of non-negative matrix factorization (NMF) has applications in document classification [Xu et al., 2003], face recognition [Guillamet and Vitrià, 2002] and chemometrics [Paatero, 1997] for example, as well as polyphonic music transcription [Smaragdis and Brown, 2003].
One major advantage of this is that any inference algorithm designed for the general statement of a problem can be used for all the applications that have been transformed into this model. For popular models there typically exist multiple approaches which differ in terms of accuracy, computation, memory, etc. In Chapter 7 we only considered inference techniques which update the two matrices separately, a technique commonly known as multiplicative update rules [Lin, 2007a]. However, other authors, such as Lin [2007b], have applied gradient descent techniques to reduce the number of iterations required to reach a local optimum. Another example of this is constructing a Bayesian network using conjugate priors. Once the model is designed and specified, then either a Gibbs sampler, an MCMC technique, or Variational Bayes can be applied, using the same expressions for the conditional distributions of the parameters².

Many of the models described make use of linearity in the parameters so that independent processes can be superimposed and inferred from the observation. In music, this is a good approximation, as notes on a musical instrument are mostly independent of any other notes played on that instrument and notes played on other instruments. In Chapter 6 and Chapter 7 we therefore considered the model for a single harmonic source first, and then showed how multiple sources could be superimposed with an additional source modelling background noise. In Chapter 5 we take this further, and begin with the model for a single partial frequency within a harmonic. These models may then be used for source separation, where individual components of a signal are extracted and synthesized.

9.3 Further Research

9.3.1 Improvements to the Gaussian Variance Model

Further work is needed to investigate and strengthen the relationship between the Bayesian harmonic model developed in Chapter 5 and the Gaussian variance model of spectrogram coefficients in Chapter 7.
The goal of this work would be to increase the accuracy of the polyphonic transcription algorithm in Chapter 7 to that of the frame-based multiple pitch detection algorithm developed in Chapter 6. Both algorithms function by greedily adding notes to the transcription, so accuracy could be improved by focusing on the following areas:

1. The choice of basis functions used for the Gaussian variance model. The implementation presented in this thesis used the short-time Fourier transform to obtain the observed signal coefficients, thus assuming that each component of the signal in a frame is a sinusoid with constant amplitude. Davy and Godsill [2003] previously showed that a musical signal could be better modelled using Gabor basis functions, which allow a slowly varying amplitude throughout the frame, compared to the model of Walmsley et al. [1999], which has only one amplitude parameter per frame. We also showed in Chapter 5 that using the Hilbert transform and a sinc window Gabor basis improves the accuracy of modelling. Although the frequencies of the basis functions are fixed in the Gaussian variance model, the number of basis functions per frame and the shape of the Gabor windows could be modified in order to model the signal better and obtain improvements in transcription accuracy. The basis frequencies could also be chosen to match what would be expected in a harmonic musical signal rather than being spaced equally on the frequency axis.

² Generic Gibbs sampler algorithms are implemented in software frameworks such as BUGS (Bayesian inference Using Gibbs Sampling), available at www.openbugs.info/w/. A similar framework exists for variational Bayes, known as VIBES (Variational Inference for Bayesian Networks), available at vibes.sourceforge.net.

2. The number of template functions used to model each pitch.
In the transcription algorithm only one template function was used per pitch to keep computation to a minimum, although Section 7.4.1 indicates that multiple templates are preferred by Bayesian model selection.

3. Deriving the relationship between the priors in the two models. As a starting point, data generated by the harmonic model could be used to train the template functions of the Gaussian variance model, so that the model priors are equivalent. However, deriving even an approximate mathematical relationship may provide more insight into the models, for example into the number of templates that should be used.

9.3.2 Frame Boundaries

The algorithms in this thesis model continuity between adjacent frames only at a high level, by modelling the transition probabilities of note pitches and volumes across frame boundaries. The generative model for each frame is not directly dependent on the signal in the previous frame. This allows for potential phase discontinuities at frame boundaries, which result in unpleasant artifacts when a signal is reconstructed. Also, the audio frames often overlap by 50% of their samples, in which case the frames are more strongly dependent on one another than if there were no overlap.

To improve the modelling of frame boundaries, we need to account for the fact that a basis function in the model can contribute to two adjacent frames of audio. In the case of 50% overlap, every basis function contributes to two frames, whereas when there is no overlap, only the basis functions whose region of support extends beyond the end of one frame also contribute to the signal in the next frame. One method of treating shared basis functions in an iterative multiple frame processing algorithm is to fix the parameters and amplitudes of the basis functions in one frame to the values found when they were inferred in the adjacent frame, and subsequently to alternate between the frames in which the parameters are fixed and the frames in which they are inferred.
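The alternating scheme just described can be sketched as follows for the simplest possible case: two adjacent frames, each with one basis function of its own and one basis function shared across the boundary. This is a toy least-squares illustration rather than any of the Bayesian inference algorithms in this thesis; the names `fit_two`, `fit_own`, `alternate` and `make_demo_frames`, and the particular window and frequencies in the demo, are assumptions made for this example.

```python
import math


def dot(u, v):
    """Inner product of two equal-length sequences."""
    return sum(a * b for a, b in zip(u, v))


def fit_two(y, g, s):
    """Least-squares amplitudes (own, shared) for y ~ own * g + shared * s."""
    gg, gs, ss = dot(g, g), dot(g, s), dot(s, s)
    gy, sy = dot(g, y), dot(s, y)
    det = gg * ss - gs * gs  # assumed non-zero: g and s linearly independent
    return (ss * gy - gs * sy) / det, (gg * sy - gs * gy) / det


def fit_own(y, g, s, a_shared):
    """Fix the shared amplitude, subtract its contribution, then fit own."""
    residual = [yi - a_shared * si for yi, si in zip(y, s)]
    return dot(g, residual) / dot(g, g)


def alternate(frames, sweeps=4):
    """Alternate which of two frames infers the shared amplitude.

    `frames` is a list of two dicts with keys "y" (observed frame),
    "g" (the frame's own basis function) and "s" (the part of the
    shared basis function that falls inside the frame).
    """
    own = [0.0, 0.0]
    shared = 0.0
    for sweep in range(sweeps):
        free = sweep % 2      # frame in which everything is inferred
        fixed = 1 - free      # frame in which the shared amplitude is fixed
        own[free], shared = fit_two(**frames[free])
        own[fixed] = fit_own(a_shared=shared, **frames[fixed])
    return own, shared


def make_demo_frames(n=64, own_amps=(2.0, -1.0), shared_amp=0.5):
    """Noise-free synthetic data: a windowed sinusoid spans both frames."""
    span = 2 * n
    shared = [0.5 * (1.0 - math.cos(2.0 * math.pi * t / (span - 1)))
              * math.cos(0.3 * t) for t in range(span)]
    g = [[math.cos(0.7 * t) for t in range(n)],
         [math.cos(1.1 * t) for t in range(n)]]
    s = [shared[:n], shared[n:]]
    return [{"y": [own_amps[k] * gi + shared_amp * si
                   for gi, si in zip(g[k], s[k])],
             "g": g[k], "s": s[k]} for k in range(2)]
```

On noise-free data such as that produced by `make_demo_frames`, the scheme recovers the generating amplitudes. With noisy observations the two frames will generally disagree about the shared amplitude, and nothing in this sketch establishes that the alternation settles on a sensible value.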
This alternating scheme is straightforward to implement for the generative models in this thesis, as fixing the basis functions is equivalent to subtracting their contribution from the observed signal. However, further work is required to determine whether this inference approach will converge properly, or whether the contribution of shared basis functions to multiple frames should be inferred jointly.

9.3.3 Note Envelopes

We have mostly paid attention to the harmonic spectral content of musical notes, which was used solely as a method of multiple pitch detection in Chapter 6. In Section 5.2.4 we modelled a note as having a constant damping ratio over its length, following the model of Cemgil et al. [2006], and in Section 5.2.5 we allowed for regular amplitude modulations. In Chapter 8 the excitation vector component of the spectrogram variances was used to model, in a general way, the amplitude envelope of a note. However, as discussed in Section 3.3.1, the note envelope may be divided into attack, sustain, decay and release stages (ASDR), each potentially with a different spectral profile, and other authors such as Orio and Déchelle [2001] have modelled each of these stages as additional states in a hidden Markov model. In Section 8.2 we showed the value of modelling note onsets explicitly as an additional state. Adding further states to represent decay and release stages is straightforward for both the transcription model in Section 7.5 and the hidden Markov models in Chapter 8.

From a modelling perspective there are still aspects of note envelopes that need further investigation. The onset of a note is very characteristic of the sound of a particular instrument, and hence could be used for instrument identification, an application we have not investigated using the models in this thesis.
The onset of a note may also have a percussive component in addition to any harmonic content, which requires an additional component in the harmonic signal model of a musical note, using for example an autoregressive model for the noise process. A full generative model of a note envelope for different instruments would take into account the relationships between the different stages, including the relative volumes and damping ratios, and the time for each stage.

9.3.4 High Level Score Priors

Score priors represent the highest level of the hierarchy in this thesis. They are applied as a prior probability of a pitch being present in a frame, and as the probability of a pitch transition from one frame to the next. Having a suitably powerful and realistic prior, which models chords, melodic and bass lines and so on, promises to eliminate many transcription errors and increase the accuracy to the level of a human transcriber. A generative model of a score is ideal, as it can also be used as a computer music composition system. Automated music composition using generative models is a popular area of music research. Although many of these models are too intricate for general use, as they are designed to emulate a particular genre or style, the basic elements of these models, such as chord progressions and the placement of notes around beat positions and divisions of the beat, should be suitable for many applications. One interesting observation is that, from a Bayesian perspective, score models with fewer parameters will lead to more consonant and regular sounding music, as notes with multiple shared harmonics often occur together in chords, and regular timings of note onsets lead to a strong perception of beat and tempo, drawing a parallel with Occam's razor. As music has developed over centuries, harmonic and temporal structures have increased in complexity, so the simplest model may not be appropriate for later genres.
The major challenge with using high level score priors is managing the additional parameters introduced and designing efficient and practical inference algorithms for these models.

Bibliography

S.A. Abdallah and M.D. Plumbley. Polyphonic music transcription by non-negative sparse coding of power spectra. In International Conference on Music Information Retrieval, 2004.
R. P. Adams, I. Murray, and D. J. C. MacKay. Tractable nonparametric Bayesian inference in Poisson processes with Gaussian process intensities. In Proceedings of the 26th Annual International Conference on Machine Learning, 2009.
L. Ahlzen and C. Song. The Sound Blaster Live! Book. No Starch Press, 2003.
C. Andrieu and A. Doucet. Joint Bayesian model selection and estimation of noisy sinusoids via reversible jump MCMC. IEEE Transactions on Signal Processing, 47:2667–2676, 1999.
I. Arroabarren, M. Zivanovic, X. Rodet, and A. Carlosena. Instantaneous frequency and amplitude of vibrato in singing voice. In IEEE International Conference on Acoustics, Speech and Signal Processing, 2003.
M. Barthet, P. Guillemain, R. Kronland-Martinet, and S. Ystad. On the relative influence of even and odd harmonics in clarinet timbre. In Proc. Int. Comp. Music Conf (ICMC 2005), Barcelona, Spain, pages 351–354, 2005.
L. Benaroya, R. Gribonval, and F. Bimbot. Non negative sparse representation for Wiener based source separation with a single sensor. 6:613–616, 2003.
N. Bertin, R. Badeau, and E. Vincent. Enforcing harmonicity and smoothness in Bayesian non-negative matrix factorization applied to polyphonic music transcription. IEEE Transactions on Audio, Speech and Language Processing, 18:538–549, 2009a.
N. Bertin, C. Fevotte, and R. Badeau. A tempering approach for Itakura-Saito non-negative matrix factorization. With application to music transcription. In International Conference on Acoustics, Speech and Signal Processing, 2009b.
C. M. Bishop. Pattern recognition and machine learning. Springer, 2006.
S. Bregman.
Auditory Scene Analysis: The Perceptual Organization of Sound. MIT Press, 1990.
G. L. Bretthorst. Bayesian Spectrum Analysis and Parameter Estimation. Springer-Verlag, 1989.
J. C. Brown. Calculation of a constant Q spectral transform. Journal of the Acoustical Society of America, 89:425–434, 1991.
J. C. Brown and K. V. Vaughn. Pitch center of stringed instrument vibrato tones. Journal of the Acoustical Society of America, 100(3):1728–1735, 1996.
C. J. C. Burges. A tutorial on support vector machines for pattern recognition. Data Mining and Knowledge Discovery, 2:121–167, 1998.
C. Cannam, C. Landone, M. Sandler, and J.P. Bello. The sonic visualiser: A visualisation platform for semantic descriptors from musical signals. In ISMIR 2006 7th International Conference on Music Information Retrieval Proceedings, 2006.
P. Cano, A. Loscos, and J. Bonada. Score-performance matching using HMMs. In International Computer Music Conference, 1999.
A. Cemgil and O. Dikmen. Conjugate gamma Markov random fields for modelling nonstationary sources. Independent Component Analysis and Signal Separation, pages 697–705, 2007.
A. T. Cemgil. Bayesian inference in non-negative matrix factorisation models. Technical report, University of Cambridge, 2008.
A. T. Cemgil and S. J. Godsill. Probabilistic phase vocoder and its application to interpolation of missing values in audio signals. In 13th European Signal Processing Conference, 2005.
A. T. Cemgil and B. Kappen. Monte Carlo methods for tempo tracking and rhythm quantization. Journal of Artificial Intelligence Research, 18:45–81, 2003.
A. T. Cemgil, H. J. Kappen, and D. Barber. A generative model for music transcription. IEEE Transactions on Audio, Speech and Language Processing, 14:679–694, 2006.
A. T. Cemgil, C. Févotte, and S. J. Godsill. Variational and stochastic inference for Bayesian source separation. Digital Signal Processing, 17:891–913, 2007.
S. Chib. Marginal likelihood from the Gibbs output.
Journal of the American Statistical Association, 90: 75108, 1995.
S. Chib and I. Jeliazkov. Marginal likelihood from the Metropolis-Hastings output. Journal of the American Statistical Association, 96:270–281, 2001.
L. Cohen, P. Loughlin, and D. Vakman. On an ambiguity in the definition of the amplitude and phase of a signal. Signal Processing, 79:301–307, 1999.
A. Cont. A coupled duration-focused architecture for realtime music to score alignment. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2009.
A. Cont, D. Schwarz, N. Schnell, and C. Raphael. Evaluation of real-time audio-to-score alignment. In International Conference on Music Information Retrieval, 2007.
Nicholas Cook. Reactions to the Record: Perspectives on Historical Performance, chapter Objective expression: phrase arching in recordings of Chopin's Mazurkas.
D. R. Cox and V. Isham. Point processes. Chapman & Hall, 1980.
M. J. Crowder, A. C. Kimber, R. L. Smith, and T. J. Sweeting. Statistical analysis of reliability data. Chapman & Hall, 1991.
L. Daudet and M. Sandler. MDCT analysis of sinusoids: exact results and applications to coding artifacts reduction. IEEE Transactions on Speech and Audio Processing, 12:302–312, 2004.
M. Davy and S. J. Godsill. Bayesian harmonic models for musical signal analysis. In Bayesian Statistics. Oxford University Press, 2003.
M. Davy, S. J. Godsill, and J. Idier. Bayesian analysis of polyphonic western tonal music. Journal of the Acoustical Society of America, 119:2498–2517, 2006.
N. Degara, A. Pena, M. E. P. Davies, and M. D. Plumbley. Note onset detection using rhythmic structure. In International Conference on Acoustics, Speech and Signal Processing, 2010.
J. Devaney, M. I. Mandel, and D. P. W. Ellis. Improving MIDI-audio alignment with acoustic features. In IEEE Workshop on Applications of Signal Processing to Audio and Acoustics, 2009.
I. Dhillon and S. Sra. Generalized nonnegative matrix approximations with Bregman divergences.
Advances in neural information processing systems, 18:283, 2006. ISSN 1049-5258.
O. Dikmen and A. T. Cemgil. Gamma Markov random fields for audio source modeling. IEEE Transactions on Audio, Speech and Language Processing, 18:589–601, 2010.
B. N. Dimitrov, V. V. Rykov, and Z. L. Krougly. Periodic Poisson processes and almost-lack-of-memory distributions. Automation and Remote Control, 65:1597–1610, 2004.
S. Dixon and G. Widmer. Match: A music alignment tool chest. In Proceedings of the International Conference of Music Information Retrieval, 2005.
P.M. Djuric. Simultaneous detection and frequency estimation of sinusoidal signals. In IEEE International Conference on Acoustics, Speech, and Signal Processing, 1993.
P.M. Djuric. A model selection rule for sinusoids in white Gaussian noise. IEEE Transactions on Signal Processing, 44:1744–1751, 1996.
Charles Dodge and Thomas A. Jerse. Computer Music. Schirmer Books, 1997.
A. Doucet, N. De Freitas, and N. Gordon. Sequential Monte Carlo methods in practice. Springer Verlag, 2001.
J. S. Downie. The Music Information Retrieval Evaluation Exchange (2005-2007): A window into music information retrieval research. Acoustical Science and Technology, 29:247–255, 2008.
D. Eck, P. Lamere, T. Bertin-Mahieux, and S. Green. Automatic generation of social tags for music recommendation. Advances in neural information processing systems, 20:385–392, 2007.
C. Févotte and A.T. Cemgil. Nonnegative matrix factorizations as probabilistic inference in composite models. In Proc. 17th European Signal Processing Conf. (EUSIPCO), Glasgow, Scotland, 2009.
C. Févotte, N. Bertin, and J.-L. Durrieu. Nonnegative matrix factorization with the Itakura-Saito divergence. With application to music analysis. Neural Computation, 21(3):793–830, March 2009.
J. L. Flanagan, D. I. S. Meinhart, . M. Golden, and M. M. Sondhi. Phase vocoder. The Journal of the Acoustical Society of America, 38(5):939–940, 1965.
H. Fletcher and W. A. Munson.
Loudness, its definition, measurement and calculation. Journal of the Acoustical Society of America, 5:82–108, 1933.
N. H. Fletcher and T. D. Rossing. The physics of musical instruments. Springer, 1998.
C. Févotte. Itakura-Saito nonnegative factorizations of the power spectrogram for music signal decomposition, chapter 11. IGI Global Press, 2010.
D. Gabor. Theory of communication. IEE Journal on Communications Engineering, 93:429–457, 1946.
S. A. Gelfand. Hearing: An Introduction to Psychological and Physiological Acoustics. Informa HealthCare, 2004.
A. Gelman. Bayesian data analysis. CRC press, 2004.
J. M. Geringer and M. L. Allen. An analysis of vibrato among high school and university violin and cello students. Journal of Research in Music Education, 52:167–179, 2004.
C. J. Geyer. Reweighting Monte Carlo mixtures. Journal of American Statistical Association, 1991.
Z. Ghahramani and M. J. Beal. Propagation algorithms for variational Bayesian learning. Advances in Neural Information Processing Systems, pages 507–513, 2001.
W. R. Gilks and D. J. Spiegelhalter. Markov chain Monte Carlo in practice. Chapman & Hall/CRC, 1996.
S. Godsill. The shifted inverse-gamma model for noise-floor estimation in archived audio recordings. Signal Processing, 2009.
S. Godsill and M. Davy. Bayesian harmonic models for musical pitch estimation and analysis. In IEEE International Conference on Acoustics Speech and Signal Processing, volume 2, 2002.
S. J. Godsill and M. Davy. Bayesian computational models for inharmonicity in musical instruments. In IEEE Workshop on Applications of Signal Processing to Audio and Acoustics, 2005.
S. J. Godsill, A. T. Cemgil, C. Fevotte, and P. J. Wolfe. Bayesian computational methods for sparse audio and music processing. In 15th European Signal Processing Conference, 2007.
S.J. Godsill. Bayesian enhancement of speech and audio signals which can be modelled as ARMA processes.
International Statistical Review/Revue Internationale de Statistique, 65(1):1–21, 1997.
M. Goto. Development of the RWC music database. In Proceedings of the 18th International Congress on Acoustics, volume 1, pages 553–556, 2004.
M. Goto, H. Hashiguchi, T. Nishimura, and R. Oka. RWC music database: Music genre database and musical instrument sound database. In International Symposium on Music Information Retrieval, pages 229–230, 2003.
P. J. Green. Reversible jump Markov chain Monte Carlo computation and Bayesian model determination. Biometrika, 82(4):711, 1995.
D. D. Greenwood. Critical bandwidth and the frequency coordinates of the basilar membrane. The Journal of the Acoustical Society of America, 33:1344, 1961.
D. Guillamet and J. Vitrià. Non-negative matrix factorization for face recognition. Topics in Artificial Intelligence, pages 336–344, 2002.
W. M. Hartmann. Pitch, periodicity, and auditory organization. The Journal of the Acoustical Society of America, 100:3491, 1996.
N. Hu, R. Dannenberg, and G. Tzanetakis. Polyphonic audio matching and alignment for music retrieval. In IEEE Workshop on Applications of Signal Processing to Audio and Acoustics, 2003.
F. Itakura and S. Saito. Analysis synthesis telephony based on the maximum likelihood method. In Proceedings of the 6th International Congress on Acoustics, 1968.
J. S. R. Jang, H. R. Lee, and C. H. Yeh. Query by tapping: A new paradigm for content-based music retrieval from acoustic input. In IEEE Pacific Rim Conference on Multimedia, 2001.
H. Jeffreys. An invariant form for the prior probability in estimation problems. Proceedings of the Royal Society of London. Series A, Mathematical and Physical Sciences, pages 453–461, 1946.
M. Karjalainen and U. K. Laine. A model for real-time sound synthesis of guitar on a floating-point signal processor. In International Conference on Acoustics, Speech, and Signal Processing, 1991.
R. E. Kass and A. E. Raftery. Bayes factors.
Journal of the American Statistical Association, 90(430), 1995.
A. Klapuri. Signal Processing Methods for Music Transcription, chapter Auditory-Model Based Methods for Multiple F0 Estimation, pages 229–265. Springer, 2006.
A. Klapuri. Multipitch analysis of polyphonic music and speech signals using an auditory model. IEEE Transactions on Audio, Speech, and Language Processing, 16:255–266, 2008.
A. P. Klapuri. Multiple fundamental frequency estimation based on harmonicity and spectral smoothness. IEEE Transactions on Speech and Audio Processing, 11(6):804–816, 2003.
A. P. Klapuri, A. J. Eronen, and J. T. Astola. Analysis of the meter of acoustic musical signals. IEEE Transactions on Audio, Speech, and Language Processing, 14:342–355, 2006.
A.P. Klapuri. Automatic music transcription as we know it today. Journal of New Music Research, 33(3):269–282, 2004.
H. Lantéri, C. Theys, C. Richard, and C. Févotte. Split gradient method for nonnegative matrix factorization. In European Signal Processing Conference, Aalborg, Denmark, Aug. 2010.
D. D. Lee and H. S. Seung. Algorithms for non-negative matrix factorization. In Advances in Neural Information Processing, 2000.
C.J. Lin. On the convergence of multiplicative update algorithms for nonnegative matrix factorization. Neural Networks, IEEE Transactions on, 18(6):1589–1596, 2007a. ISSN 1045-9227.
C.J. Lin. Projected gradient methods for nonnegative matrix factorization. Neural Computation, 19(10):2756–2779, 2007b. ISSN 0899-7667.
J. S. Liu. Monte Carlo strategies in scientific computing. Springer, 2003.
R. B. MacLeod. Influences of dynamic level and pitch register on the vibrato rates and widths of violin and viola players. Journal of Research in Music Education, 56:43–54, 2008.
R. C. Maher and J. W. Beauchamp. Fundamental frequency estimation of musical signals using a two-way mismatch procedure. Journal of the Acoustical Society of America, 4:2254–2263, 1995.
S. Mallat. A wavelet tour of signal processing.
Academic press, 1999.
S. G. Mallat and Z. Zhang. Matching pursuits with time-frequency dictionaries. IEEE Transactions on Signal Processing, 41:3397–3415, 1993.
M. Marolt. A connectionist approach to automatic transcription of polyphonic piano music. IEEE Trans. Multimedia, 6(3):439–449, Jun. 2004.
R. Martin. Noise power spectral density estimation based on optimal smoothing and minimum statistics. IEEE Transactions on Speech and Audio Processing, 9, 2001.
R. Meddis and L. O'Mard. A unitary model of pitch perception. The Journal of the Acoustical Society of America, 102:1811, 1997.
W. Q. Meeker and L. A. Escobar. Statistical methods for reliability data. Wiley, 1998.
J. A. Moorer. On the transcription of musical sound by computer. Journal of Computer Music, pages 32–38, 1977.
F. D. Neeser and J. L. Massey. Proper complex random processes with applications to information theory. IEEE Transactions on Information Theory, 39(4):1293–1302, 1993.
B. Niedermayer. Towards audio to score alignment in the symbolic domain. In Sound and Music Computing Conference, 2009.
N. Orio and F. Déchelle. Score following using spectral analysis and hidden Markov models. In Proceedings of the ICMC, pages 151–154, 2001.
N. Orio and D. Schwarz. Alignment of monophonic and polyphonic music to a score. In International Computer Music Conference, 2001.
A. Ozerov, C. Févotte, and M. Charbit. Factorial scaled hidden Markov model for polyphonic audio representation and source separation. In IEEE Workshop on Applications of Signal Processing to Audio and Acoustics, Mohonk, NY, USA, Oct. 2009.
P. Paatero. Least squares formulation of robust nonnegative factor analysis. Chemometrics and Intelligent Laboratory Systems, 37:22–35, 1997.
H. Papadopoulos and G. Peeters. Large-scale study of chord estimation algorithms based on chroma representation and HMM. In Proceedings of the International Workshop on Content-Based Multimedia Indexing (CBMI), pages 53–60, 2007.
R. D. Patterson, K. Robinson, J.
Holdsworth, D. McKeown, C. Zhang, and M. Allerhand. Complex sounds and auditory images. Auditory physiology and perception, 83:429–446, 1992.
J. Paulus and T. Virtanen. Drum transcription with non-negative spectrogram factorisation. In European Signal Processing Conference, 2006.
P. H. Peeling, A. T. Cemgil, and S. J. Godsill. A probabilistic framework for matching music representations. In International Conference on Music Information Retrieval, 2007a.
P. H. Peeling, C. F. Li, and S. J. Godsill. Poisson point process modeling for polyphonic music transcription. Journal of the Acoustical Society of America, 121:EL168–EL175, 2007b.
P.H. Peeling, A.T. Cemgil, and S.J. Godsill. Generative spectrogram factorization models for polyphonic piano transcription. IEEE Transactions on Audio, Speech, and Language Processing, 18:519–527, 2010. ISSN 1558-7916.
G. Peeters. Chroma-based estimation of musical key from audio-signal analysis. In Proc. of the 7th International Conference on Music Information Retrieval (ISMIR), pages 115–120. Citeseer, 2006.
A. Pertusa and J. M. Inesta. Multiple fundamental frequency estimation using Gaussian smoothness. In IEEE International Conference on Acoustics, Speech and Signal Processing, 2008.
B. Picinbono and P. Bondon. Second-order statistics of complex signals. IEEE Transactions on Signal Processing, 45(2):411–420, 1997.
C. J. Plack, A. J. Oxenham, and R. R. Fay, editors. Pitch: Neural Coding and Perception. Springer, 2005.
M. D. Plumbley, S. A. Abdallah, J. P. Bello, M. E. Davies, G. Monti, and M. B. Sandler. Automatic music transcription and audio source separation. Cybernetics and Systems, 33(6):603–627, 2002.
G. E. Poliner and D. P. W. Ellis. A discriminative model for polyphonic piano transcription. EURASIP Journal on Advances in Signal Processing, 2007.
E. Prame. Measurements of the vibrato rate of ten singers. Journal of the Acoustical Society of America, 96(4):1979–1984, 1994.
J. P. Princen, A. W. Johnson, and A. B. Bradley.
Subband/transform coding using filter bank designs based on time domain aliasing cancellation. In IEEE International Conference on Acoustics, Speech, and Signal Processing, 1987.
M. Puckette, T. Apel, and D. Zicarelli. Real-time audio analysis tools for Pd and MSP. In International Computer Music Conference, 1998.
L. R. Rabiner. A tutorial on hidden Markov models and selected applications in speech recognition. Proceedings of the IEEE, 77:257–286, 1989.
L. R. Rabiner and B. Juang. Fundamentals of speech recognition. Prentice-Hall, 1993.
C. Raphael. Automated rhythm transcription. In International Symposium on Music Information Retrieval, 2001.
C. Raphael. A hybrid graphical model for aligning polyphonic audio with musical scores. In International Conference on Musical Information Retrieval, 2004.
C. Raphael. Aligning music audio with symbolic scores using a hybrid graphical model. Machine Learning, 65:389–409, 2006.
C. Raphael. A classifier-based approach to score-guided source separation of musical audio. Computer Music Journal, 32(1):51–59, 2008.
S. Richardson and P. J. Green. On Bayesian analysis of mixtures with unknown number of components. Journal of the Royal Statistical Society: Series B, 4:731–792, 1997.
C. P. Robert and G. Casella. Monte Carlo statistical methods. Springer Verlag, 2004.
A. Robertson and M. D. Plumbley. Post-processing fiddle: A real-time multi-pitch tracking technique using harmonic partial subtraction for use within live performance systems. In International Computer Music Conference, 2009.
M. P. Ryynänen and A. P. Klapuri. Modelling of note events for singing transcription. In ISCA Tutorial and Research Workshop (ITRW) on Statistical and Perceptual Audio Processing. Citeseer, 2004.
M. P. Ryynänen and A. P. Klapuri. Polyphonic music transcription using note event modeling. In IEEE Workshop on Applications of Signal Processing to Audio and Acoustics, 2005.
E. D. Scheirer.
Readings in Computational Auditory Scene Analysis, chapter Using musical knowledge to extract expressive performance information from audio recordings, pages 361–380. L. Erlbaum Associates Inc., 1998.
E. D. Scheirer. Music-listening systems. PhD thesis, Massachusetts Institute of Technology, 2000.
M. N. Schmidt and H. Laurberg. Nonnegative matrix factorization with Gaussian process priors. Computational intelligence and neuroscience, 2008.
D. Schwarz, A. Cont, and N. Schnell. From Boulez to ballads: Training IRCAM's score follower. In Proceedings of International Computer Music Conference, 2005.
C. Seashore. Objective analysis of musical performance. McGraw-Hill, 1936.
X. Serra. Musical sound modeling with sinusoids plus noise. Musical signal processing, pages 497–510, 1997.
C. E. Shannon. Communication in the presence of noise. Proceedings of the IEEE, 86(2):447–457, 1998.
P. Smaragdis and J.C. Brown. Non-negative matrix factorization for polyphonic music transcription. In IEEE Workshop on Applications of Signal Processing to Audio and Acoustics, pages 177–180, 2003.
A. Stark and M. D. Plumbley. Tracking a performance without a score. In International Conference on Acoustics, Speech and Signal Processing, 2010.
S. S. Stevens, J. Volkmann, and E. B. Newman. A scale for the measurement of the psychological magnitude pitch. The Journal of the Acoustical Society of America, 8:185, 1937.
M. E. Tipping and C. M. Bishop. Probabilistic principal component analysis. Journal of the Royal Statistical Society. Series B, Statistical Methodology, pages 611–622, 1999.
P. M. Todd and D. G. Loy. Music and connectionism. The MIT Press, 1991.
T. Tolonen and M. Karjalainen. A computationally efficient multipitch analysis model. IEEE Transactions on Speech and Audio Processing, 8(6):708–716, 2000.
R. J. Turetsky and D. P. W. Ellis. Ground-truth transcriptions of real music from force-aligned MIDI syntheses. In International Conference on Music Information Retrieval, 2003.
R.
Typke and A. Walczak-Typke. A tunneling-vantage indexing method for non-metrics. In International Conference on Music Information Retrieval, 2008.
B. L. Vercoe, W. G. Gardner, and E. D. Scheirer. Structured audio: Creation, transmission, and rendering of parametric sound representations. Proceedings of the IEEE, 86(5):922–940, 1998.
E. Vincent, N. Bertin, and R. Badeau. Harmonic and inharmonic nonnegative matrix factorization for polyphonic pitch transcription. In IEEE International Conference on Acoustics, Speech and Signal Processing, 2008.
E. Vincent, N. Bertin, and R. Badeau. Adaptive harmonic spectral decomposition for multiple pitch estimation. Audio, Speech, and Language Processing, IEEE Transactions on, 18(3):528–537, 2010. ISSN 1558-7916.
T. Virtanen. Monaural sound source separation by nonnegative matrix factorization with temporal continuity and sparseness criteria. IEEE Transactions on Audio, Speech, and Language Processing, 15(3):1066–1074, 2007.
T. Virtanen, A. T. Cemgil, and S. J. Godsill. Bayesian extensions to non-negative matrix factorisation for audio signal modelling. In International Conference on Acoustics, Speech and Signal Processing, 2008.
G. H. Wakefield. Mathematical representation of joint time-chroma distributions. In Proceedings of SPIE, volume 3807, page 637, 1999.
P. J. Walmsley, S. J. Godsill, and P. J. W. Rayner. Polyphonic pitch tracking using joint Bayesian estimation of multiple frame parameters. In IEEE Workshop on Applications of Signal Processing to Audio and Acoustics, pages 119–122. Citeseer, 1999.
B. Wang and M.D. Plumbley. Musical audio stream separation by non-negative matrix factorization. In DMRN Summer Conference, 2005.
N. P. Whiteley, A. T. Cemgil, and S. J. Godsill. Bayesian modelling of temporal structure in musical audio. In International Conference on Music Information Retrieval, 2006.
P. J. Wolfe, M. Dorfler, and S. J. Godsill. Bayesian modelling of time-frequency coefficients for audio signal enhancement.
Advances in Neural Information Processing Systems, 2003.
P. J. Wolfe, S. J. Godsill, and W. J. Ng. Bayesian variable selection and regularization for time-frequency surface estimation. Journal of the Royal Statistical Society Series B, 66:575–589, 2004.
W. Xu, X. Liu, and Y. Gong. Document clustering based on non-negative matrix factorization. In Proceedings of the 26th annual international ACM SIGIR Conference on Research and Development in Information Retrieval, pages 267–273. ACM, 2003. ISBN 1581136463.
C. Yeh, A. Röbel, and X. Rodet. Multiple fundamental frequency estimation of polyphonic music signals. In IEEE International Conference on Acoustics, Speech and Signal Processing, 2005.
A. Zellner. On assessing prior distributions and Bayesian regression analysis with g-prior distributions. Bayesian Inference and Decision Techniques: Essays in Honor of Bruno de Finetti, 6:233–243, 1986.
F. Zhang, G. Bi, and Y. Q. Chen. Harmonic transform. IEE Proceedings-Vision, Image and Signal Processing, 151(4):257–263, 2004.
E. Zwicker. Subdivision of the audible frequency range into critical bands (Frequenzgruppen). Acoustical Society of America Journal, 33:248, 1961.

Appendix A

Probability Distributions

A.1 Normal Distribution

The normal distribution $\mathcal{N}(x; \mu, \sigma^2)$, for real-valued $x$ with mean $\mu$ and variance $\sigma^2$, has probability density
\[ p(x) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left( -\frac{1}{2} \frac{(x-\mu)^2}{\sigma^2} \right). \]
$\mu$ is a location parameter for the normal distribution, and $\sigma^2$ is a scale parameter. Sufficient statistics for a set of $n$ normally distributed observations $x_i$ are given by $\sum_i x_i$ and $\sum_i x_i^2$. Maximum likelihood estimates of the parameters are given by $\hat{\mu} = \frac{1}{n}\sum_i x_i$ and $\hat{\sigma}^2 = \frac{1}{n}\sum_i x_i^2 - \hat{\mu}^2$. The entropy of the normal distribution is $\frac{1}{2} \ln\left( 2\pi e \sigma^2 \right)$.

The normal distribution is the conjugate prior distribution for the mean parameter of a normal distribution.
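If $\mu \sim \mathcal{N}(\mu_0, \sigma_0^2)$ and the variance $\sigma^2$ is known, then the posterior distribution of $\mu$ given $n$ observations $x_i$ is a normal distribution with mean

\[
\left( \frac{1}{\sigma_0^2} + \frac{n}{\sigma^2} \right)^{-1} \left( \frac{\mu_0}{\sigma_0^2} + \frac{\sum_i x_i}{\sigma^2} \right)
\]

and variance

\[
\left( \frac{1}{\sigma_0^2} + \frac{n}{\sigma^2} \right)^{-1}
\]

If $x$ is a complex number where the real and imaginary components are independently normally distributed with zero mean and variance $\sigma^2$, then

\[
p(x) = \frac{1}{2\pi\sigma^2} \exp\left( -\frac{1}{2} \frac{|x|^2}{\sigma^2} \right)
\]

A.2 Gamma Distribution

The Gamma distribution $\mathcal{G}(r; \alpha, \beta)$ for $r > 0$, where $\alpha > 0$ is the shape parameter and $\beta > 0$ is a scale parameter, has probability density

\[
p(r) = \frac{r^{\alpha-1} \exp(-r/\beta)}{\beta^{\alpha}\, \Gamma(\alpha)}
\]

Sufficient statistics for a set of $N$ Gamma distributed observations $r_i$ are $\sum_i r_i$ and $\sum_i \log r_i$. The parameters of the Gamma distribution are related to the sufficient statistics by

\[
\frac{1}{N} \sum_i \log r_i = \psi(\alpha) + \log\beta, \qquad \frac{1}{N} \sum_i r_i = \alpha\beta
\]

where $\psi$ is the digamma function. The entropy of the Gamma distribution is

\[
\alpha + \ln\beta + \ln\Gamma(\alpha) + (1-\alpha)\,\psi(\alpha)
\]

The Gamma distribution is the conjugate prior for the rate parameter of the Poisson distribution. If $\lambda \sim \mathcal{G}(\alpha, \beta)$ and we have $n$ observations $x_i \sim \mathrm{Po}(\lambda)$, then with $\beta$ the scale parameter defined above the posterior distribution of $\lambda$ is

\[
p(\lambda \mid x_1, \ldots, x_n, \alpha, \beta) = \mathcal{G}\left( \alpha + \sum_i x_i,\; \frac{\beta}{1 + n\beta} \right)
\]

Equivalently, the inverse scale $1/\beta$ is increased by $n$.

A.3 Inverse-Gamma Distribution

The inverse Gamma distribution $\mathcal{IG}(r; \alpha, \beta)$ for $r > 0$, where $\alpha > 0$ is the shape parameter and $\beta > 0$ is a scale parameter, has probability density

\[
p(r) = \frac{\beta^{\alpha}}{\Gamma(\alpha)} \left(\frac{1}{r}\right)^{\alpha+1} \exp(-\beta/r)
\]

The inverse Gamma distribution is the distribution of the random variable $1/r$ when $r \sim \mathcal{G}(r; \alpha, 1/\beta)$. Sufficient statistics for a set of $N$ inverse Gamma distributed observations $r_i$ are $\sum_i r_i^{-1}$ and $\sum_i \log r_i$. The parameters of the inverse Gamma distribution are related to the sufficient statistics by

\[
\frac{1}{N} \sum_i \log r_i = -\psi(\alpha) + \log\beta, \qquad \frac{1}{N} \sum_i r_i^{-1} = \frac{\alpha}{\beta}
\]

The entropy of the inverse Gamma distribution is

\[
\alpha + \ln\beta + \ln\Gamma(\alpha) - (1+\alpha)\,\psi(\alpha)
\]

The inverse Gamma distribution is the conjugate prior distribution for the variance parameter of a normal distribution.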
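If $\sigma^2 \sim \mathcal{IG}(\alpha, \beta)$ and the mean $\mu$ is known, then the posterior distribution of $\sigma^2$ given $n$ observations $x_i$ is

\[
p\left( \sigma^2 \mid x_1, \ldots, x_n, \mu, \alpha, \beta \right) = \mathcal{IG}\left( \alpha + \frac{n}{2},\; \beta + \frac{\sum_i (x_i - \mu)^2}{2} \right)
\]

A.4 Beta Distribution

The Beta distribution $\mathrm{Beta}(x; \alpha, \beta)$, where $0 \le x \le 1$ and $\alpha > 0$, $\beta > 0$, has probability density

\[
p(x) = \frac{x^{\alpha-1} (1-x)^{\beta-1}}{B(\alpha, \beta)}
\]

The Beta distribution is the conjugate prior for the probability of success $\rho$ of a set of $n$ Bernoulli trials $r_i$, where $r_i \in \{0, 1\}$. The posterior distribution of $\rho$ is

\[
p(\rho \mid r_1, \ldots, r_n, \alpha, \beta) = \mathrm{Beta}\left( \alpha + \sum_i r_i,\; \beta + n - \sum_i r_i \right)
\]

Appendix B

Derivation of Results

B.1 Mode of Posterior Distribution of Signal-to-noise Parameter

Take the natural log of (5.22):

\[
\log p(y \mid D, \xi) + \log p(\xi) = -\frac{N + \alpha_n}{2} \log\left( y^{\top} P y + \beta_n \right) - (\alpha_\xi + 1) \log \xi - \frac{\beta_\xi}{\xi}
\]

Differentiate the expression with respect to $\xi$ and set equal to zero:

\begin{align*}
-\frac{N + \alpha_n}{2}\, \frac{-\frac{1}{\xi+1}\, y^{\top} D D^{\dagger} y}{y^{\top} y - \frac{\xi}{\xi+1}\, y^{\top} D D^{\dagger} y} - \frac{\alpha_\xi + 1}{\xi} + \frac{\beta_\xi}{\xi^2} &= 0 \\
\frac{N + \alpha_n}{2}\, \frac{y^{\top} D D^{\dagger} y}{y^{\top} y - \xi\, y^{\top} D D^{\dagger} y} - \frac{\alpha_\xi + 1}{\xi} + \frac{\beta_\xi}{\xi^2} &= 0 \\
\frac{N + \alpha_n}{2}\, \frac{\xi^2\, y^{\top} D D^{\dagger} y}{y^{\top} y - \xi\, y^{\top} D D^{\dagger} y} - \xi (\alpha_\xi + 1) + \beta_\xi &= 0
\end{align*}

After some rearranging,

\[
\frac{N + \alpha_n}{2}\, \xi^2\, y^{\top} D D^{\dagger} y + \left( \beta_\xi - \xi (\alpha_\xi + 1) \right) \left( y^{\top} y - \xi\, y^{\top} D D^{\dagger} y \right) = 0
\]

\[
\xi^2 \left( \frac{N + \alpha_n}{2} + (\alpha_\xi + 1) \right) y^{\top} D D^{\dagger} y - \xi \left( (\alpha_\xi + 1)\, y^{\top} y + \beta_\xi\, y^{\top} D D^{\dagger} y \right) + \beta_\xi\, y^{\top} y = 0 \tag{B.1}
\]

B.2 Posterior over Latent Sources in Gaussian Variance Matrix Factorization Model

$J$ is the identity matrix with dimensions $I \times I$.

\begin{align*}
& -\frac{D}{2} \operatorname{Tr} A^{-1} s s^{H} + \frac{D}{2} \operatorname{Tr} \frac{\mathbf{1}^{\top} \mathbf{1}\, s s^{H}}{\mathbf{1} A \mathbf{1}^{\top}} + \cdots \\
&= -\frac{D}{2} \operatorname{Tr} A^{-1} \left( J - \frac{A \mathbf{1}^{\top} \mathbf{1}}{\mathbf{1} A \mathbf{1}^{\top}} \right) s s^{H} + \cdots \\
&= -\frac{D}{2} \operatorname{Tr} \frac{A^{-1}}{\mathbf{1} A \mathbf{1}^{\top}} \left( \mathbf{1} A \mathbf{1}^{\top} J - A \mathbf{1}^{\top} \mathbf{1} \right) s s^{H} + \cdots \\
&= -\frac{D}{2} \operatorname{Tr} \frac{1}{\mathbf{1} A \mathbf{1}^{\top}} \left( \mathbf{1} A \mathbf{1}^{\top} J - A \mathbf{1}^{\top} \mathbf{1} \right)^{\top} \left( \mathbf{1} A \mathbf{1}^{\top} J - A \mathbf{1}^{\top} \mathbf{1} \right)^{-1} A^{-1} \left( \mathbf{1} A \mathbf{1}^{\top} J - A \mathbf{1}^{\top} \mathbf{1} \right) s s^{H} + \cdots \\
&= -\frac{D}{2} \operatorname{Tr} \left( s - \frac{A \mathbf{1}^{\top} \mathbf{1} s}{\mathbf{1} A \mathbf{1}^{\top}} \right)^{H} \left( \mathbf{1} A \mathbf{1}^{\top} J - A \mathbf{1}^{\top} \mathbf{1} \right)^{-1} \frac{A^{-1}}{\mathbf{1} A \mathbf{1}^{\top}} \left( s - \frac{A \mathbf{1}^{\top} \mathbf{1} s}{\mathbf{1} A \mathbf{1}^{\top}} \right) + \cdots \\
&= -\frac{D}{2} \operatorname{Tr} \left( s - \frac{A \mathbf{1}^{\top} \mathbf{1} s}{\mathbf{1} A \mathbf{1}^{\top}} \right)^{H} \left( A - \frac{A \mathbf{1}^{\top} \mathbf{1} A}{\mathbf{1} A \mathbf{1}^{\top}} \right)^{-1} \left( s - \frac{A \mathbf{1}^{\top} \mathbf{1} s}{\mathbf{1} A \mathbf{1}^{\top}} \right) + \cdots
\end{align*}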
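As a numerical companion to the conjugate-prior results of Appendix A, the following minimal Python sketch computes the posterior parameters for three of the cases stated there: the mean of a normal distribution with known variance (A.1), the variance of a normal distribution with known mean under an inverse Gamma prior (A.3), and the success probability of Bernoulli trials under a Beta prior (A.4). The function names and argument order are illustrative assumptions and do not come from the thesis; the sketch only restates the update formulas in executable form.

```python
from typing import Sequence, Tuple


def normal_mean_posterior(mu0: float, var0: float, var: float,
                          xs: Sequence[float]) -> Tuple[float, float]:
    """Posterior (mean, variance) of mu for a N(mu0, var0) prior
    and observations xs with known variance var (hypothetical helper)."""
    precision = 1.0 / var0 + len(xs) / var      # posterior precision
    post_var = 1.0 / precision
    post_mean = post_var * (mu0 / var0 + sum(xs) / var)
    return post_mean, post_var


def inverse_gamma_variance_posterior(alpha: float, beta: float, mu: float,
                                     xs: Sequence[float]) -> Tuple[float, float]:
    """Posterior (shape, scale) of sigma^2 for an IG(alpha, beta) prior
    and observations xs with known mean mu (hypothetical helper)."""
    sum_sq = sum((x - mu) ** 2 for x in xs)
    return alpha + len(xs) / 2.0, beta + sum_sq / 2.0


def beta_bernoulli_posterior(alpha: float, beta: float,
                             rs: Sequence[int]) -> Tuple[float, float]:
    """Posterior (alpha, beta) of the success probability for a
    Beta(alpha, beta) prior and Bernoulli outcomes rs in {0, 1}
    (hypothetical helper)."""
    successes = sum(rs)
    return alpha + successes, beta + len(rs) - successes


if __name__ == "__main__":
    print(normal_mean_posterior(0.0, 1.0, 1.0, [1.0, 2.0, 3.0]))
    print(inverse_gamma_variance_posterior(2.0, 1.0, 0.0, [1.0, -1.0]))
    print(beta_bernoulli_posterior(1.0, 1.0, [1, 0, 1, 1]))
```

In each case the posterior stays in the same family as the prior, so only the two parameters need to be updated. This is what keeps inference with these priors cheap when many observations are processed.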