Bayesian Methods in Music Modelling
Paul Halliday Peeling
Clare College
December 19, 2010
Declaration
This dissertation is submitted for the degree of Doctor of Philosophy. This dissertation is the result of my
own work and includes nothing which is the outcome of work done in collaboration except where specifically
indicated in the text. This dissertation does not exceed 50,000 words and 50 figures.
Paul H. Peeling
Abstract
This thesis presents several hierarchical generative Bayesian models of musical signals designed to improve the
accuracy of existing multiple pitch detection systems and other musical signal processing applications whilst
remaining feasible for real-time computation. At the lowest level the signal is modelled as a set of overlapping
sinusoidal basis functions. The parameters of these basis functions are built into a prior framework based
on principles known from musical theory and the physics of musical instruments. The model of a musical
note optionally includes phenomena such as frequency and amplitude modulations, damping, volume, timbre
and inharmonicity. The occurrence of note onsets in a performance of a piece of music is controlled by an
underlying tempo process and the alignment of the timings to the underlying score of the music.
A variety of applications are presented for these models under differing inference constraints. Where
full Bayesian inference is possible, reversible-jump Markov Chain Monte Carlo is employed to estimate
the number of notes and partial frequency components in each frame of music. We also use approximate
techniques such as model selection criteria and variational Bayes methods for inference in situations where
computation time is limited or the amount of data to be processed is large. For the higher level score
parameters, greedy search and conditional modes algorithms are found to be sufficiently accurate.
We emphasize the links between the models and inference algorithms developed in this thesis and those
in existing and parallel work, and demonstrate the effects of modifying these models both
theoretically and by means of experimental results.
Acknowledgments
First of all I would like to thank my supervisor Prof. Simon Godsill, whose insight and direction have guided
much of the work in this thesis, and whose attention to detail has been invaluable.
Very special thanks to Dr. Taylan Cemgil, for being a constant source of inspiration and encouragement
on this project and for providing so many examples and resources which made this possible. I am also very
grateful to all those with whom I have collaborated at various points over the last four years: Drs.
Sumeetpal Singh, Nick Whiteley, Daniel Clark, Onur Dikmen and Chung-fai Li.
Thanks to all the staff and students in the Signal Processing Laboratory, especially Henry, Jonathan and
Simon for on- and off-topic coffee room discussions, and to Janet and Rachel for putting in so much effort
to support us all. Thanks also to the folks at Featurespace, especially Bill, Dave and Kirsty, for encouraging
me along the way.
My final thanks go to my family and friends, who haven't needed to understand what this thesis is about
to guide and support me: Matthew, Glen, Denise, Glen, Elliot, Theodore, Bob, Jon, Ding, Yuwei, Ted, Olly,
Thomas. Thanks to Mum, Dad and Katie for their love and support throughout the university years. Most
of all, thanks to Angel for caring and loving every time. This thesis is dedicated to our new arrival.
Notation
y ∼ p (y) y is sampled from the probability distribution p (y)
p (y|θ) The conditional probability density of y given θ
N (y; µ, σ²) y is normally distributed with mean µ and variance σ²
NC (y; µ, σ²) Complex normal distribution
Am,n The element of matrix A in the mth row and nth column
Ep(y) [f (y)] The expectation of the function f (y) under the probability distribution p (y)
〈f (y)〉p(y) The expectation of the function f (y) under the probability distribution p (y)
f0 The fundamental frequency of a musical note
Tr A The trace of matrix A
A† Pseudo-inverse of matrix A
A440 The note A, which has a pitch of 440 Hz
θ̂ A point estimate of the parameters θ
y1:K The set of observations {y1, . . . , yK}
Acronyms
DFT Discrete Fourier transform
STFT Short-time Fourier transform (spectrogram)
(M)DCT (Modified) Discrete Cosine transform
MIDI Musical Instrument Digital Interface
MIREX Music Information Retrieval Evaluation eXchange
NMF Non-negative Matrix Factorization
HMM Hidden Markov Model
ML Maximum-likelihood parameter estimate
MAP Maximum a posteriori parameter estimate
EM Expectation-maximization algorithm
MCMC Markov Chain Monte Carlo
MH Metropolis-Hastings
GMM Gaussian mixture model
ACF Autocorrelation function
SNR Signal-to-noise ratio
Contents
1 Introduction 12
1.1 Background . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12
1.2 Scope of Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13
1.2.1 Psychoacoustics and auditory modelling . . . . . . . . . . . . . . . . . . . . . . . . . . 13
1.2.2 Machine learning techniques . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14
1.2.3 Generative models . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14
1.3 Motivation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15
1.4 Outline of Thesis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17
2 Structure of Musical Audio 19
2.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19
2.2 Perception of Musical Audio . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19
2.2.1 Dynamics (Loudness) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 22
2.2.2 Pitch . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 23
2.2.3 Timbre . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 24
2.3 Oscillators . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 24
2.3.1 Terminology . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 24
2.3.2 Stringed Instruments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 26
2.3.3 Wind Instruments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 26
2.3.4 Vibrato . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 26
2.4 Harmony, Chords and Key . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 27
2.5 Tempo . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 28
2.6 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 28
3 Literature Review 30
3.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 30
3.2 Multipitch Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 31
3.2.1 Auditory Models . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 31
3.2.2 Spectrum Estimation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 32
3.2.3 Harmonicity . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 35
3.2.4 Spectral Envelope . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 36
3.2.5 Transform Decomposition Methods . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 36
3.3 Polyphony Tracking . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 37
3.3.1 Attack-Sustain-Decay-Release . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 37
3.3.2 Pitch Evolution . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 38
3.4 Tempo Tracking . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 38
3.5 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 41
4 Bayesian Methods for Signal Processing 43
4.1 Bayesian Modelling Methods . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 43
4.1.1 Bayes Rule and Model Comparison . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 43
4.1.1.1 Parameter Estimation using Bayes Rule . . . . . . . . . . . . . . . . . . . . . 43
4.1.1.2 Marginal Likelihood for Model Comparison . . . . . . . . . . . . . . . . . . . 44
4.1.1.3 Generative and Discriminative Models . . . . . . . . . . . . . . . . . . . . . . 45
4.1.1.4 Hierarchical Models . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 45
4.1.1.5 Conjugate Priors . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 45
4.1.2 Bayesian networks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 46
4.1.3 Hidden Markov Models . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 47
4.1.4 Exponential Family of Probability Distributions . . . . . . . . . . . . . . . . . . . . . 47
4.2 Inference Algorithms . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 48
4.2.1 Exact Inference . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 48
4.2.2 Monte Carlo Methods . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 50
4.2.2.1 Markov Chain Monte Carlo . . . . . . . . . . . . . . . . . . . . . . . . . . . . 50
4.2.2.2 Importance Sampling and Sequential Monte Carlo . . . . . . . . . . . . . . . 51
4.2.3 Variational Methods . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 53
5 A Signal Model for Pitched Musical Instruments 55
5.1 Contributions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 55
5.2 Model for an Isolated Partial . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 56
5.2.1 Motivation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 56
5.2.2 Amplitude and Phase Modulation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 56
5.2.3 Analytic Representation of Sinusoidal and Noise Signals . . . . . . . . . . . . . . . . . 57
5.2.4 State-Space Formulation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 58
5.2.5 Gabor Model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 59
5.3 Probabilistic Model for Multiple Partials . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 60
5.3.1 Background . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 61
5.3.2 Noise Model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 62
5.3.3 State-Space Formulation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 63
5.3.4 Gabor Model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 65
5.4 Bayesian Inference using Reversible Jump MCMC . . . . . . . . . . . . . . . . . . . . . . . . 69
5.4.1 Proposals and Acceptance Ratios . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 71
5.4.2 Prior Model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 71
5.4.3 Examples of Reversible Moves . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 72
5.4.3.1 n-increase/decrease . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 72
5.4.3.2 double/halve frequency . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 72
5.4.3.3 note birth/death . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 73
5.5 Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 73
5.5.1 Performance on Monophonic Extracts . . . . . . . . . . . . . . . . . . . . . . . . . . . 73
5.5.2 Multiple F0 Estimation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 75
5.6 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 78
6 Multiple Pitch Estimation using Non-homogeneous Poisson Processes 80
6.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 80
6.2 Non-homogeneous Poisson Processes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 81
6.2.1 Frequency-Domain Process . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 82
6.2.2 Superposition . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 83
6.2.3 Evaluation of Likelihood . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 83
6.2.3.1 Exact Calculation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 84
6.2.3.2 Binning . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 84
6.2.3.3 Censored Frequencies . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 85
6.3 Bayesian Priors . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 85
6.3.1 Fixed Bins . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 86
6.3.2 Gaussian Mixture Model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 86
6.3.3 Model for mixture weights . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 88
6.4 Signal Model Based Partial Estimation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 89
6.4.1 Overview . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 90
6.4.2 Bayesian Model Selection Criterion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 90
6.4.3 Zero-Padding . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 91
6.4.4 Likelihood-Search . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 93
6.5 Polyphonic Pitch Estimation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 93
6.5.1 Greedy Search . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 93
6.5.2 Estimation of number of notes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 95
6.5.3 Comparison with State-of-the-Art . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 96
6.6 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 97
7 Gaussian Variance Generative Matrix Factorization Models 99
7.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 99
7.2 Gaussian Variance Matrix Factorization Model . . . . . . . . . . . . . . . . . . . . . . . . . . 101
7.2.1 Maximum-likelihood and the EM algorithm . . . . . . . . . . . . . . . . . . . . . . . . 103
7.2.2 Expectation Step . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 103
7.2.3 Maximization Step . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 104
7.3 Bayesian Hierarchical Model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 105
7.3.1 Inference by Variational Bayes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 105
7.3.1.1 Variational update equations and sufficient statistics . . . . . . . . . . . . . . 107
7.3.1.2 The Variational Bound . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 108
7.3.1.3 Hyperparameter Optimization . . . . . . . . . . . . . . . . . . . . . . . . . . 109
7.3.2 Markov Chain Monte-Carlo . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 111
7.3.2.1 Gibbs Sampler . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 111
7.3.2.2 Metropolis-Hastings . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 113
7.3.2.3 Hyperparameter Optimization . . . . . . . . . . . . . . . . . . . . . . . . . . 114
7.3.3 Importance Sampling . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 115
7.3.4 Consistency of Marginal Likelihood Estimates . . . . . . . . . . . . . . . . . . . . . . . 115
7.4 Musical Audio Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 116
7.4.1 Model Training . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 116
7.4.2 Source Separation and Visualization . . . . . . . . . . . . . . . . . . . . . . . . . . . . 118
7.4.3 Model Selection . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 118
7.5 Prior Model for Polyphonic Piano Music . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 118
7.5.1 Model Description . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 118
7.5.2 Algorithm . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 121
7.6 Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 122
7.6.1 Comparison . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 122
7.6.2 Implementation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 125
7.6.3 Evaluation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 128
7.7 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 134
8 A Probabilistic Framework for Inferring Temporal Structure in Music 136
8.1 Audio Matching using Generative Models . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 136
8.1.1 Existing Dynamic Time Warping Approach . . . . . . . . . . . . . . . . . . . . . . . . 136
8.1.2 Model Statement . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 137
8.1.3 Interpretation of Dynamic Time Warping . . . . . . . . . . . . . . . . . . . . . . . . . 138
8.2 Score Alignment . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 140
8.2.1 Treatment of Score Events . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 140
8.2.2 Dynamic Time Warping Cost Function . . . . . . . . . . . . . . . . . . . . . . . . . . . 140
8.2.3 Hidden Markov Model Formulation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 141
8.2.4 Inference . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 142
8.2.5 Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 142
8.3 Event Based Inference . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 145
8.3.1 Counting of Temporal Events . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 148
8.3.2 Clutter and Missed Detections . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 149
8.3.3 Query-by-Tapping Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 149
8.4 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 150
9 Conclusion 153
9.1 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 153
9.2 Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 154
9.3 Further Research . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 155
9.3.1 Improvements to the Gaussian Variance Model . . . . . . . . . . . . . . . . . . . . . . 155
9.3.2 Frame Boundaries . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 156
9.3.3 Note Envelopes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 156
9.3.4 High Level Score Priors . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 157
Bibliography 158
A Probability Distributions 168
A.1 Normal Distribution . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 168
A.2 Gamma Distribution . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 169
A.3 Inverse-Gamma Distribution . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 169
A.4 Beta Distribution . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 170
B Derivation of Results 171
B.1 Mode of Posterior Distribution of Signal-to-noise Parameter . . . . . . . . . . . . . . . . . . . 171
B.2 Posterior over Latent Sources in Gaussian Variance Matrix Factorization Model . . . . . . . . 172
List of Figures
2.1 Flow of information in musical production and the auditory system . . . . . . . . . . . . . . . 20
2.2 Impulse and frequency response for a second-order gammatone filter . . . . . . . . . . . . . . 21
2.3 Volume curves used by some MIDI implementations. . . . . . . . . . . . . . . . . . . . . . . . 22
2.4 Comparison of the Mel frequency scale and the MIDI definition of pitch . . . . . . . . . . . . . 23
2.5 Frequency response and autocorrelation of a comb filter . . . . . . . . . . . . . . . . . . . . . 25
3.1 Comparison of Fourier and summary spectra . . . . . . . . . . . . . . . . . . . . . . . . . . . 33
3.2 Comparison of salience functions obtained from spectra . . . . . . . . . . . . . . . . . . . . . 34
3.3 Bayesian network representations of tempo models . . . . . . . . . . . . . . . . . . . . . . . . 40
4.1 Hidden Markov model. Observed random variables have doubled lines . . . . . . . . . . . . . 47
5.1 Comparison of sinc and Hamming basis functions for modelling sinusoids in noise . . . . . . . 61
5.2 Convergence of the model parameters using MCMC . . . . . . . . . . . . . . . . . . . . . . . 68
5.3 Comparison of the residual using periodogram and maximum likelihood estimates . . . . . . 77
6.1 Probability mass function for the Poisson distribution . . . . . . . . . . . . . . . . . . . . . . 82
6.2 Prior on expected number of partials and marginal distribution of number of partials . . . . . 87
6.3 Poisson intensity function using a Gaussian mixture model . . . . . . . . . . . . . . . . . . . . 88
6.4 Partial estimation results for zero padding method . . . . . . . . . . . . . . . . . . . . . . . . 92
6.5 Partial estimation results and periodogram estimate for a polyphonic mixture of four notes. . 94
7.1 Representations of the single-channel source separation model as a matrix factorization problem . 101
7.2 The inverse-gamma distribution, p(r) = IG(r; a, a) for different a, and scale parameter b = 1 . 106
7.3 Template hyperparameters for single source models of piano notes . . . . . . . . . . . . . . . 117
7.4 Transcription using the Gaussian variance matrix factorization model. . . . . . . . . . . . . . 119
7.5 Optimal number of sources for a set of piano notes . . . . . . . . . . . . . . . . . . . . . . . . 120
7.6 Parameter estimates for the Gaussian variance model from training data . . . . . . . . . . . . 126
7.7 Parameter estimates for the Poisson model from training data . . . . . . . . . . . . . . . . . . 127
7.8 Transcription using a priori independent frames . . . . . . . . . . . . . . . . . . . . . . . . . 129
7.9 Transcription using Markov transition probabilities between frames . . . . . . . . . . . . . . . 130
7.10 Ground truth for the transcription results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 131
7.11 Detection assessment . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 132
7.12 Number of errors for the Gaussian variance Markov model by number of notes and error type. 133
8.1 Audio alignment using DTW with note onset costs . . . . . . . . . . . . . . . . . . . . . . . . 144
8.2 Score alignment using Gaussian variance model . . . . . . . . . . . . . . . . . . . . . . . . . . 146
8.3 Inter-onset timings in a query-by-tapping problem . . . . . . . . . . . . . . . . . . . . . . . . 151
List of Tables
2.1 Intervals and harmonics . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 27
5.1 Partial estimation results for real and analytical representation . . . . . . . . . . . . . . . . . 74
5.2 Partial estimation results for different basis function and model choices . . . . . . . . . . . . . 75
5.3 Polyphonic pitch estimation using the Bayesian harmonic model . . . . . . . . . . . . . . . . 76
6.1 Polyphonic pitch estimation using the Poisson process model . . . . . . . . . . . . . . . . . . 95
6.2 Precision and recall using the Poisson process model . . . . . . . . . . . . . . . . . . . . . . . 96
6.3 F-measure of multiple pitch estimation on woodwind data . . . . . . . . . . . . . . . . . . . . 97
7.1 Frame-level transcription accuracy . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 129
7.2 Frame-level transcription results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 129
8.1 Score alignment: median alignment in milliseconds . . . . . . . . . . . . . . . . . . . . . . . . 145
List of Algorithms
4.1 Generic MCMC . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 50
4.2 Metropolis-Hastings Kernel . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 51
4.3 Bootstrap Particle Filter . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 53
5.1 Gibbs sampler for the state-space model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 65
5.2 Metropolis-Hastings for the Gabor model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 70
6.1 Partial estimation scheme for a frame of audio y with N samples . . . . . . . . . . . . . . . . 91
7.1 Variational Bayes for the Gaussian variance model, with hyperparameter optimization . . . . 111
7.2 Gaussian Variance: algorithm for polyphonic transcription . . . . . . . . . . . . . . . . . . . . 123
7.3 Poisson Intensity: algorithm for polyphonic transcription . . . . . . . . . . . . . . . . . . . . 124
Chapter 1
Introduction
1.1 Background
This thesis is concerned with Bayesian methods for the modelling of musical signals. Musical signals are rich
in structure, and are capable of evoking emotional and aesthetic responses in the listener. When processing
music, much of this structure is known in advance. Thus we may use Bayesian methods to infer some of the
unknown aspects of the musical signal in a signal processing setting.
Bayesian methods center around the application of Bayes' rule to statistical models of observed data y.
The unknown information about the musical signal y is encapsulated in a set of parameters θ. We define
a statistical model p (y|θ), the likelihood, which describes how the signal is related to these parameters. To
complete the picture, we define a prior p (θ) which describes what we know about the parameters before we
observe any data. A Bayesian model thus may be represented by its joint probability distribution
p (y, θ) = p (y|θ) p (θ)
Using the prior and the likelihood, the posterior distribution p (θ|y) of the parameters after we observe
the signal is given by Bayes' rule
p (θ|y) = p (y|θ) p (θ) / p (y) ∝ p (y|θ) p (θ)    (1.1)
The quantity p (y) in (1.1) is both the normalization constant of the posterior p (θ|y) and also the marginal
distribution of the observed data for our particular choice of model p (y, θ):
p (y) = ∫ p (y, θ) dθ
p (y) is therefore known as the marginal likelihood or the evidence of the data, and may be used to evaluate
which model from a range of potential models best explains the data observed.
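As a minimal numerical sketch of (1.1) and of the evidence integral, the following Python fragment evaluates the posterior on a grid for a toy model that is not taken from this thesis: the observations y are assumed to be normally distributed about an unknown scalar θ with known variance, and θ is given a normal prior. The function names, the grid and all numerical values are illustrative assumptions.

```python
import math

def normal_pdf(x, mean, var):
    """Density of N(x; mean, var)."""
    return math.exp(-0.5 * (x - mean) ** 2 / var) / math.sqrt(2.0 * math.pi * var)

def grid_posterior(y, thetas, prior_mean, prior_var, noise_var):
    """Return the posterior density on an evenly spaced grid and the evidence p(y)."""
    step = thetas[1] - thetas[0]
    joint = []
    for theta in thetas:
        # Likelihood p(y | theta) for independent observations.
        likelihood = 1.0
        for obs in y:
            likelihood *= normal_pdf(obs, theta, noise_var)
        # Joint p(y, theta) = p(y | theta) p(theta).
        joint.append(likelihood * normal_pdf(theta, prior_mean, prior_var))
    # Evidence p(y): integral of the joint over theta (Riemann sum).
    evidence = sum(joint) * step
    posterior = [j / evidence for j in joint]
    return posterior, evidence

y = [0.9, 1.1, 1.3]                               # assumed toy observations
thetas = [-5.0 + 0.01 * i for i in range(1001)]   # grid on [-5, 5]
step = thetas[1] - thetas[0]

# Two candidate models that differ only in their prior on theta.
post, ev_vague = grid_posterior(y, thetas, 0.0, 4.0, 0.25)
_, ev_tight = grid_posterior(y, thetas, -3.0, 0.01, 0.25)

post_mean = sum(t * p for t, p in zip(thetas, post)) * step
print("posterior mean:", post_mean)
print("evidence (vague prior):", ev_vague)
print("evidence (tight prior far from data):", ev_tight)
```

Under these assumed settings the prior concentrated far from the data yields a much smaller evidence, which illustrates how p (y) can be used to rank candidate models.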
From a musical signal processing perspective, the unknown parameters θ represent a hierarchy of musical
information. At the highest level we have the cognitive concepts of genre, mood, style and so forth, which
may be used to group different pieces of music together. Within a piece of music we have music theoretic
constructs such as key, tempo and meter, which are defined globally, and time localized structure such as
pitch, tone, harmony and rhythm. At this local level, we can begin to represent the structure in terms of
physical entities, such as frequencies, transients and noise. A complete representation of musical information
can be naturally expressed as a hierarchical Bayes model [Gelman, 2004]. For example, we could denote three
such levels of musical information by θ1, θ2, θ3, progressively incorporating higher level information,
and write the prior as
p (θ) ≡ p (θ1, θ2, θ3) = p (θ1|θ2, θ3) p (θ2|θ3) p (θ3)
When we are able to draw random samples from the likelihood and prior distributions we may simulate
data y under various conditions and assumptions. Such a scheme is known as generative modelling. This
allows us to evaluate the modelling assumptions not only using mathematical considerations, but also
perceptually, by listening to the generated data and assessing it qualitatively. For instance, a generative model
for the frequencies contained in a model of musical pitch (which is a perceived characteristic of a musical
note) would include definitions of fundamental frequency and harmonicity. As inharmonicity can be
substantial in some musical instruments, we would be tempted to put a vague model on how the frequencies of
individual harmonics are related to the fundamental. If the model is too vague, however, we would be able to perceive
by listening to simulated data that the perceived pitch is no longer equal to the desired pitch.
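The following Python sketch illustrates this kind of generative simulation under simple assumptions that are not the models developed in this thesis: each partial frequency is drawn from a normal distribution centred on its ideal harmonic k·f0, with a standard deviation expressed as a fraction of f0, and the note is synthesized as a sum of sinusoids with 1/k amplitudes. The function names and parameter values are hypothetical.

```python
import math
import random

def sample_partials(f0, n_partials, jitter_sd, rng):
    """Draw partial frequencies around the harmonic series k * f0.

    jitter_sd is the standard deviation of each deviation, relative to f0.
    A small value is a tight prior on harmonicity, a large value a vague one.
    """
    return [k * f0 + rng.gauss(0.0, jitter_sd * f0)
            for k in range(1, n_partials + 1)]

def synthesize(freqs, n_samples, sample_rate):
    """Sum of zero-phase sinusoids; the k-th partial has amplitude 1/k (assumed)."""
    return [
        sum(math.sin(2.0 * math.pi * f * t / sample_rate) / (k + 1)
            for k, f in enumerate(freqs))
        for t in range(n_samples)
    ]

def max_deviation(freqs, f0):
    """Largest absolute departure (in Hz) from the ideal harmonic positions."""
    return max(abs(f - (k + 1) * f0) for k, f in enumerate(freqs))

rng = random.Random(0)
f0 = 440.0
tight = sample_partials(f0, 8, 0.001, rng)   # nearly harmonic
vague = sample_partials(f0, 8, 0.2, rng)     # partials scattered widely

print("tight prior, max deviation (Hz):", max_deviation(tight, f0))
print("vague prior, max deviation (Hz):", max_deviation(vague, f0))

# One second of audio at 8 kHz for each draw; written to a sound file and
# played back, the second signal would typically no longer sound like A440.
signal_tight = synthesize(tight, 8000, 8000)
signal_vague = synthesize(vague, 8000, 8000)
```

Listening to both signals is the perceptual check described above: the mathematical form of the two priors is the same, and only the simulated data reveal when the assumed spread has become too large.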
1.2 Scope of Work
The signal processing and modelling of musical signals is a diverse field, with a multitude of approaches,
philosophies and goals. Here we present a brief overview of popular approaches, and how they relate to the
methods we develop in this thesis.
1.2.1 Psychoacoustics and auditory modelling
These approaches characterize and model the human auditory system. Principles of psychoacoustics underlie
present audio coding standards [Ahlzen and Song, 2003]. Music is not merely the production of sound by
physical processes, but also includes how these sounds are perceived. Hence any system which processes
musical signals, even when based solely on physical models, should consider the perception of musical sounds.
We might ask, for instance, whether an automatic music transcription system should transcribe notes which
are not audible.
Principles known from psychoacoustics include:
• The frequency and dynamic (intensity of sound) response of the ear, including the range of sounds that
can be detected and the resolution with which they are perceived. Physical measurements of different
parts of the ear (from canal to neurones) have been modelled using filter banks and other non-linear
signal processing techniques, to compute a summary spectrum, which maps higher order harmonics to
lower order harmonics, partially accounting for the ear's ability to perceive pitch. Currently in some
Bayesian systems, the periodogram is used as a spectrum estimator to guide Bayesian inference towards
more likely frequencies and pitches in the signal, for example as a proposal distribution in an MCMC
setting (see Section 4.2.2). It is reasonable to suggest that a spectrum estimator based on an auditory
model and pitch perception could improve existing methods.
• Masking: a psychoacoustical effect in which tones (for example, musical pitches) are presented in such a
way that one tone renders the other(s) inaudible [Gelfand, 2004]. For example, a loud tone can mask a
quiet tone with a similar frequency or pitch. A related phenomenon is the fusing of two similar sounds
into one, such as two tones with equal loudness and similar frequency.
• Perception of the parameters of musical notes such as pitch, loudness and harmonicity. Pitch and
loudness, for example, are not perceived linearly, or independently of one another. The Bark scale
[Zwicker, 1961] is a perceptual frequency scale based on the critical bands of hearing, and the Mel
scale [Stevens et al., 1937] is a perceptual scale of pitch. A well known phenomenon in the perception
of harmonicity is the ability of the human ear to reconstruct missing fundamental frequencies
[Todd and Loy, 1991], for example when listening
to a bassoon.
1.2.2 Machine learning techniques
We may describe an example of this type of approach in general terms as a two-stage procedure:
1. Extracting salient features from frames of audio. The method used to extract the feature depends
on the application. Applications which aim to extract frequency related information, such as pitch or
harmony, from the music, may start by computing the Fourier transform. A popular feature computed
via the DFT is a chroma vector [Wakefield, 1999], which describes the energy distribution between
the 12 notes of the Western pitch scale. This information can then be used in chord recognition
[Papadopoulos and Peeters, 2007] or key recognition [Peeters, 2006] applications, by reasoning that
notes with higher energy distributed to them will often form a root, third or fifth of the chord. Feature
extraction methods which prove successful and useful for a musical signal processing application can
often inspire or be adapted in a generative model approach. Considering the example of chroma vectors,
the expected distribution of energy across different note groups can be treated as a Bayesian prior for
chords and keys in music.
2. Learning and classification algorithms. These algorithms map the features as inputs to the information
to be extracted from the music as an output. The majority of these algorithms were not designed
specifically for musical signals, but rather map real-valued vectors to labeled classes. The distinction
between the feature extraction method and the machine learning algorithm may be considered
analogous to the separation of a probabilistic model and the inference algorithm, although machine
learning algorithms do assume models of both the features and the labels. Some popular algorithms and
their uses include support vector machines for classification [Burges, 1998] and dynamic programming
[Rabiner and Juang, 1993] for aligning sequences of music.
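As an illustration of the feature extraction stage, the following minimal sketch computes a chroma vector from a single frame of audio. It is not the method of Wakefield [1999]: the naive DFT, the equal-tempered mapping of bins to pitch classes with A = 440 Hz, the lower cut-off `f_min` and all function names are assumptions made for this example.

```python
import cmath
import math


def dft_magnitudes(frame):
    """Magnitudes of the non-negative-frequency DFT bins (naive O(N^2) DFT)."""
    n = len(frame)
    mags = []
    for k in range(n // 2 + 1):
        acc = 0j
        for i, sample in enumerate(frame):
            acc += sample * cmath.exp(-2j * math.pi * k * i / n)
        mags.append(abs(acc))
    return mags


def chroma_vector(frame, sample_rate, f_min=55.0):
    """Distribute spectral energy over 12 pitch classes (C = 0, ..., B = 11).

    Assumption: each bin is assigned to the nearest equal-tempered pitch,
    with A = 440 Hz; bins below f_min are ignored.
    """
    n = len(frame)
    chroma = [0.0] * 12
    for k, mag in enumerate(dft_magnitudes(frame)):
        freq = k * sample_rate / n
        if freq < f_min:
            continue
        pitch = round(69 + 12 * math.log2(freq / 440.0))
        chroma[pitch % 12] += mag ** 2
    total = sum(chroma)
    # Normalise so that the vector describes a distribution of energy.
    return [c / total for c in chroma] if total > 0 else chroma
```

For a frame containing a 440 Hz sinusoid, almost all of the energy would be expected in pitch class 9 (A). A chord or key recognizer would then compare such vectors against expected energy profiles.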
1.2.3 Generative models
The scope of a generative model for music, i.e., the actual data being modelled, varies. A model which
describes the production of each sample and channel of a digital audio signal may be desirable for high
fidelity signal processing, however sample rates of 44.1 kHz (CD quality) or 48 kHz (digital audio tape) may
be unmanageable in terms of computation and inference. Instead, a generative model is usually defined on
a simpler domain, e.g., overlapping frames of audio, and additionally some preprocessing and downsampling
may be applied using the Fourier transform. These effects however are often non-invertible, for example
taking the magnitude of the Fourier spectrum, and losing the phase continuity between frames of audio. For
the signal to be reproduced without audible artefacts then requires additional postprocessing such as using
a phase vocoder [Flanagan et al., 1965]. The modified discrete cosine transform (MDCT), which has a 50%
overlap between frames, retains the phase continuity between frames [Princen et al., 1987], hence its inclusion
in the MP3 standard.
In this thesis we have chosen to build upon approaches that utilize Bayesian and generative modelling
ideas for music. Crucially at the lowest level we restrict ourselves to models proved to be valid for audio
signal processing, for in a hierarchical Bayes model musical information is represented by additional levels
of hierarchy above that existing for audio signals, because musical signals are a subset of audio signals.
For this reason we have adopted a bottom-up approach to the design for our models, and the structure of
this thesis is similar. Most Bayesian analysis is carried out in a similar manner: beginning with the definition
of the likelihood, and adding levels of hierarchy until the model is deemed sufficient for the purpose. The
models that we propose and develop here are not a complete representation of music information. We
have focused here particularly on the modelling of pitch and temporal structure, such that we can infer the
positions of musical notes both in time and frequency, and their respective dynamics. The information we
obtain may be stored in the intermediate Midi format for musical events. Midi (Musical Instrument Digital
Interface) is a widely adopted standard available at http://www.midi.org for controlling electronic musical
instruments. The data transmitted through a Midi interface does not contain the recorded waveforms used
to synthesize the music into audio, and therefore compactly represents musical content.
1.3 Motivation
The motivation for this work is not directly related to the classification of music and music recommendation
systems. The technology behind these systems is advanced and has led to the development of commercial
applications, for example www.Last.FM which is an Internet radio, social network and music recommendation
service. That technology does not necessarily involve signal processing or machine listening, as there is
abundant information supplied by human listeners, for example in the form of tagging. Listeners are often
very willing to provide such information and even gain utility from it, which contributes to the success of
these systems. However there is interest in the automatic generation of tags, see for example Eck et al.
[2007].
The classification and identification of structure within a piece of music is perceived to be much more
laborious, time consuming, subjective and frustrating. Such is the task of a trained musician in music
transcription: to write the musical score after listening to extracts of the audio multiple times. Unsurprisingly
there are few sources of musical audio labeled with a corresponding accurate Midi transcription readily
available.
Transcription is not necessarily the final goal: being able to align a Midi file with a performance in the
audio is also appropriate for the applications we envisage, as there is a large quantity of Midi files related
to the score of a piece of music, but divorced from a real performance of that piece.
Content-based music information retrieval (MIR) is possible using the models presented here. Some
examples of how a system may be used include:
• Notation of improvised music, for example in Western jazz.
• Preservation and study of music from traditions without a notation system, which instead rely on oral
transmission.
• Detailed performance analysis for musicology [Scheirer, 1998, Seashore, 1936]. One example of this is
the Mazurka project¹, which has made detailed annotations of the tempo and dynamics of recordings of
Chopin's Mazurkas by several performers. These annotations have been used to compare and contrast
different styles and approaches to playing the same piece, capturing performance effects such as phrase
arching [Cook] and accentuation patterns. This approach has been used to identify historical recordings
which have influenced today's performers, and even cases of copyright violation where existing
performances have been replayed.
• Visualization of musical structure in a music media player framework. The Sonic Visualiser [Cannam
et al., 2006] allows musicians to listen to and study recordings, offering synchronized playback and
display of annotations (including Midi) and frequency spectra. This tool is already integrated with an
audio alignment system called Match [Dixon and Widmer, 2005].
• Score alignment - matching events in a score with corresponding audio cues in a recording. The two
applications mentioned above would be improved with automated score alignment, reducing much of
the manual labour involved.
• Score following - tracking the position of a live performance in a score. One motivation for this
application arises in present day music compositions which rely on the synchronization of human
performers with prerecorded or synthesized electronic music. The current score follower at Ircam is
trained during rehearsal time to improve its performance, see for example Schwarz et al. [2005]. Current
development is centered on anticipatory score following, see Cont [2009], permitting a higher level of
interaction between the performance and the score follower. A similar development is the Music Plus
One system [Raphael, 2006] which uses score following to accompany and anticipate a performer. The
system is used for example to control the playback of a Music Minus One² recording to assist musicians
when practicing a piece of music without live accompaniment.
As we are using generative models, we also have the following applications:
• Source separation. Score guided source separation may be used to produce the Music Minus One
recording automatically from a favoured historical recording [Raphael, 2008], a process known as de-
soloing. Another use of source separation for musical signals is the separation of polyphonic instruments
(for example piano and guitar) into separate channels based on pitch or register, which in a recording
studio would be typically recorded on a single channel. This functionality has been demonstrated in
Melodyne's Direct Note Access technology³.
¹ http://www.mazurka.org.uk
² A recording of a piece of music for soloist and accompaniment where only the accompaniment is recorded.
³ www.celomony.com
• Object coding based music compression and synthesis. As discussed in Plumbley et al. [2002] the
recent Mpeg-4 standard for structured audio [Vercoe et al., 1998] provides for an advanced form of
parametric coding of musical signals. The labeling required for the coding could be produced with
automated music transcription, and the decomposition and synthesis tasks could also be automated
using source separation.
• Morphing, reconstruction and digital effects. Bayesian models have been used for audio reconstruction
[Godsill, 1997, Cemgil and Godsill, 2005], noise estimation [Godsill, 2009] and enhancement [Wolfe
et al., 2003]. These approaches can be readily modified by extending the prior models for general
audio to account for the higher level of structure present in music (for example harmonicity). These
extensions form a major part of this thesis.
1.4 Outline of Thesis
This introduction has described the motivation and approach to the research covered in this thesis. We have
demonstrated our reasons for adopting a generative modelling approach using Bayesian inference, but have
reflected on how complementary approaches from psychoacoustic modelling and machine learning can be
adopted. The three following chapters provide the background and foundation for our research.
Chapter 2 provides an overview of our current knowledge of the structure of musical audio signals, drawing
from music theory, the physics of musical instruments and the propagation of sounds, and psychoacoustics.
The material covered in this chapter is the basis and reasoning for having selected the particular Bayesian
priors that we use in this research.
Chapter 3 reviews the literature for machine listening of musical signals. Since 2004 a community driven
evaluation of machine listening systems for the important applications in this field, known as Mirex⁴
[Downie, 2008], has become prominent. Previously there had been no overall consensus on standardized
data sets or evaluation metrics to use for comparing the performance of different systems. In Chapter 3 we
use this resource to learn how this area of research has developed and has been influenced over the last few
years, and make an objective comparison of the current state-of-the-art technologies.
Chapter 4 introduces the field and methods of Bayesian signal processing, reviewing models and inference
methods. Models and inference algorithms used in this thesis will be covered in greater detail in this chapter
so that these are collected into one place and are referenced throughout this thesis where they are used.
The remaining chapters describe the new research carried out. Each chapter first describes a novel
Bayesian model for some level of musical structure in audio, then develops a variety of inference methods for
the model, and finally presents applications of the model with results obtained using data sets
from the literature review in Chapter 3. The chapters are not fully self contained, as the models used are
extensions of models already described in the thesis or in the literature, but the focus of each chapter is on
one particular level in a hierarchical Bayes model of musical audio, and the chapters are ordered accordingly:
partial frequencies and amplitudes in Chapter 5; the grouping of partials into musical notes in Chapter 6;
the harmonic and temporal distributions of time-frequency basis coefficients in Chapter 7; prior distributions
for the volume of notes and their continuity in Chapter 7; and finally dynamic models for tempo and how
the performance of a piece of music moves through the score of that piece in Chapter 8.
⁴ http://www.music-ir.org/MIREX
Chapter 5 describes a generative model for musical audio using the analytic representation of a signal.
This model is based on existing Bayesian models for musical audio using sinusoidal and Gabor bases, but
modelling the analytic representation has a number of implications for both inference and the prior structure.
We also consider common musical effects such as frequency modulations (vibrato) and amplitude modulations
(tremolo) and model these in musical notes as multiple partials occupying the same harmonic position. We
then use the model and inference methods developed to perform spectrum estimation and polyphonic pitch
transcription in musical signals, demonstrating its ability to model vibrato and detect higher order partials.
Chapter 6 describes a Poisson point process model for inferring musical notes and chords from partial
frequency estimates. The model allows for multiple and missed partial detections, and can be applied to the
model described in Chapter 5 and also other spectral estimation schemes, both Bayesian and heuristic. The
primary advantage of this model is that calculating the likelihood function is computationally inexpensive
and inference is straightforward. A simple and intuitive prior is used with a partial estimation scheme using
Bayesian model selection on the model in Chapter 5 to produce an effective system for inferring polyphony
in short frames of music.
Chapter 7 describes modelling the coefficients of a time-frequency transform of a musical signal by
variance parameters. The variances are grouped into a matrix, which is composed of harmonic factors
across frequency and excitation factors across time, analogous to non-negative matrix factorization (NMF).
A number of Bayesian inference procedures are proposed for rapid and efficient inference, which is necessary
for processing large amounts of audio data. A number of applications for general musical signal processing
are illustrated using these models. We extend the model with prior structures for the volume of a note, and
its onset and offset. This allows the inference of a Midi transcription from musical audio. The transcription
performance of the model is compared to existing systems using a large selection of synthesized classical
piano music.
Chapter 8 develops two models for inferring the motion of a hypothetical 'score pointer' through the
performance of a piece of music. The first model is a hidden Markov model with a tempo variable describing
the probability of moving from one position in the score to the next. The second is a Poisson model counting
the expected number of note onsets and offsets occurring at each point in the music. Results are presented
for score following and query-by-tapping music retrieval applications.
Chapter 2
Structure of Musical Audio
2.1 Introduction
In this chapter we develop some understanding of musical audio which will aid us considerably in defining
models. This chapter covers known and experimentally derived results from physics, psychoacoustics and
musical theory, which are a necessary foundation for the rest of this thesis. In Chapter 3 we cover the
progress made in using these models for the applications grouped under the encompassing term machine
listening.
In analyzing music audio, we need to address both the physical production of the sound and how it
is perceived. Most systems for the analysis of musical audio model either the musical instrument or the
auditory system in isolation. In reality neither exists in isolation. Sound is produced by the instrument and
received by the sensory system (which includes but is not restricted to the auditory system, as the existence
of deaf musicians proves). Feedback may exist in the form of performance, as indicated in Figure 2.1 on page
20.
Physical models of musical instruments are well studied, and although this area of research is by no
means dormant or redundant, much of what has already been discovered is of value to us. Fletcher and
Rossing [1998] reviews the physics of musical instruments, and forms the basis of our description of pitched
musical instruments in Section 2.3.
Models of the perception of sound focus firstly on the auditory periphery and secondly on psychoacoustical
studies. We will not describe in detail the research in these areas, pointing the reader to Klapuri [2006] for
an excellent introduction, but focus on the models that have been developed as a result of this research.
2.2 Perception of Musical Audio
The process of perceiving audio can be divided into two stages. The first, physical, stage converts the pressure
waves that transmit sound through the air into electrical signals in the brain. Central to this is the action
of the basilar membrane, which is a stiff structure in the inner ear separating two fluids: the endolymph
and the perilymph. One function of the basilar membrane is frequency dispersion: the membrane varies in
thickness and stiffness, and thus responds to different frequencies along its length. The variation of location
with frequency is described by the Greenwood function [Greenwood, 1961].
[Figure 2.1 block diagram. Labels: Musical Instrument, Production of Sound, Acoustics, Sensory System, Performance, Signal, Auditory Filterbank, Neural Transducers, Psychoacoustics, Perception]
Figure 2.1: The flow of information in musical production and the auditory system. Sound is produced by
a musical instrument and received by the sensory system. Performance is a feedback route from the sensory
system back to the instrument. The flow of information through the auditory system is shown here as a
block diagram. In reality many more channels exist than are shown here.
The auditory periphery may be modelled by a filter bank, splitting the received signal into a number of
channels (Figure 2.1 on page 20). Each filter has a bandwidth selected to model the frequency selectivity of
various parts of the basilar membrane. Experimental results [Patterson et al., 1992] suggest that gammatone
filters are a good approximation to the frequency response. Figure 2.2 on page 21 shows the impulse
and frequency response of a second-order gammatone filter. Each channel is then subject to dynamic level
compression, half-wave rectification and low pass filtering. These processes are designed to model neural
transduction. The dynamic level compression models the loudness response of the ear which is relevant in
our discussion of dynamics in Section 2.2.1. The frequency content is then typically analyzed by computing
the summary spectrum which is the summation of the spectrum magnitudes across the channel outputs after
the low pass filtering operation.
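The processing of a single auditory channel described above can be sketched as follows. This is a simplified illustration rather than the auditory model of Meddis and O'Mard [1997]: the gammatone bandwidth is taken from the Glasberg and Moore equivalent rectangular bandwidth approximation, the low pass filter is a one-pole smoother with an arbitrary coefficient, and dynamic level compression is omitted. All of these choices, and the function names, are assumptions made for the example.

```python
import math


def erb_bandwidth(centre_hz):
    """Equivalent rectangular bandwidth in Hz (Glasberg and Moore approximation; an assumption)."""
    return 24.7 * (4.37 * centre_hz / 1000.0 + 1.0)


def gammatone_impulse_response(centre_hz, sample_rate, order=2, duration=0.04):
    """Sampled gammatone impulse response t^(order-1) exp(-2 pi b t) cos(2 pi fc t)."""
    b = 1.019 * erb_bandwidth(centre_hz)  # common bandwidth scaling; an assumption
    n = round(duration * sample_rate)
    response = []
    for i in range(n):
        t = i / sample_rate
        envelope = t ** (order - 1) * math.exp(-2.0 * math.pi * b * t)
        response.append(envelope * math.cos(2.0 * math.pi * centre_hz * t))
    return response


def half_wave_rectify(signal):
    """Keep the positive half of the waveform and set the rest to zero."""
    return [max(0.0, s) for s in signal]


def one_pole_lowpass(signal, alpha=0.1):
    """Simple recursive smoother standing in for the low pass filtering stage."""
    out, state = [], 0.0
    for s in signal:
        state += alpha * (s - state)
        out.append(state)
    return out
```

With `centre_hz=440.0` and `sample_rate=8000`, the impulse response covers the same 40 ms span as the second-order example in Figure 2.2. A summary spectrum would be obtained by repeating this for many centre frequencies and summing the spectrum magnitudes of the channel outputs.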
A side effect of the half-wave rectification combined with low pass filtering is that the harmonic complexity
of a musical signal is reduced: higher order partials are mapped onto lower order partials (see Section 2.3
for an explanation of these terms) which may explain why humans are good at perceiving multiple pitches.
One difficulty with applying standard spectral analysis even for monophonic signals is that the fundamental
frequency does not necessarily have the largest amplitude. However when the mapping of partials takes place
and we observe the summary spectrum (see Figure 3.1b on page 33 for example), the fundamental does then
have the largest amplitude and can be easily identified. A complete auditory model based on the process
outlined above appears in Meddis and O'Mard [1997].
The second stage of perception is psychological, occurring within the brain. The first research on the frequency
response of the ear was carried out in Fletcher and Munson [1933] where contours of equal subjective
loudness (measured in phons) are obtained for different frequencies and sound pressure levels.
[Figure 2.2 plots: (a) Impulse response and amplitude envelope, Amplitude against Time / ms; (b) Magnitude and phase of the frequency response, Magnitude / dB and Phase / rads against Frequency / Hz]
Figure 2.2: Impulse and frequency response for a second-order gammatone filter with centre frequency 440 Hz
and sampling rate 8000 Hz
[Figure 2.3 plot: Amplitude scaling against MIDI velocity for β = 1, β ≈ 1.661 and β = 2]
Figure 2.3: Volume curves (2.1) used by some Midi implementations. β = 1 gives a linear response, Roland's
GS standard uses β = 2, a square-law relationship, and β ≈ 1.661 is derived from the rule of thumb that
loudness doubles when the sound intensity is increased by a factor of ten
2.2.1 Dynamics (Loudness)
Dynamics refers to the volume of a musical note, which may have stylistic interpretation, but primarily refers
to the note on velocity Midi event. Typically a particular dynamic notated in a musical score e.g., forte
(loud), is mapped to a range of velocities.
The mapping of velocity to a scaling of the note in amplitude is relevant to underlying signal models.
Based on the human ear having a perceptual logarithmic response to amplitude variations, GM (General
Midi) synthesizers use the following volume curve
\[ a = \left( \frac{v}{127} \right)^{\beta} \tag{2.1} \]
which expresses the amplitude scaling a in terms of the note velocity v as a fraction of the maximum velocity
127 allowed by the Midi standard, and a logarithmic response scaling term β. Figure 2.3 on page 22 plots
volume curves for the different values of β for implementations of the Midi standard.
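The volume curve (2.1) is simple to evaluate, and the value β ≈ 1.661 can be recovered under one reading of the rule of thumb: if perceived loudness is taken to be proportional to v/127 and sound intensity to a², then doubling the loudness multiplies the amplitude by √10, which gives β = log2 √10. That reading, and the names used in the sketch below, are assumptions made for illustration.

```python
import math

# Exponent implied by "loudness doubles when intensity increases tenfold",
# assuming loudness is proportional to v / 127 and intensity to a ** 2.
BETA_RULE_OF_THUMB = math.log2(math.sqrt(10.0))


def midi_amplitude(velocity, beta):
    """Amplitude scaling a = (v / 127) ** beta of equation (2.1)."""
    if not 0 <= velocity <= 127:
        raise ValueError("MIDI velocity must lie in the range 0 to 127")
    return (velocity / 127.0) ** beta


def volume_curve(beta):
    """Amplitude scaling for every velocity from 0 to 127."""
    return [midi_amplitude(v, beta) for v in range(128)]
```

Calling `volume_curve(1.0)`, `volume_curve(BETA_RULE_OF_THUMB)` and `volume_curve(2.0)` gives three curves of the kind described for the different values of β.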
The RWC database [Goto et al., 2003, Goto, 2004] also contains samples at three levels of dynamics:
forte, mezzo, piano. Studying the musical instrument samples here can give us a granular yet instructive set
of volume curves for different instruments.
The dynamic of a note is not necessarily constant across the duration of the note. Crescendo (gradually
getting louder) and diminuendo (gradually getting softer) are typically modelled by synthesizers as linear
trajectories in note velocity. Tremolo is a periodic variation of volume, which often occurs with vibrato (Section 2.3.4),
the pitch analog of volume oscillations.
[Figure 2.4 plot: MIDI pitch and Mel frequency against Frequency / Hz]
Figure 2.4: Comparison of the Mel frequency scale (2.2) and the Midi definition of pitch (2.3)
2.2.2 Pitch
Pitch is the perceived fundamental frequency of a musical note. This perception can deviate substantially
from physical reality, especially when the sounds are not quasiperiodic. Experimental studies have shown
that missing fundamentals and harmonics are tolerated, as is a substantial amount of inharmonicity [Plack
et al., 2005]. Hence we typically distinguish between pitch estimation and fundamental frequency estimation.
Pitch estimation may not therefore directly rely on physical signal models, but rather on a more approximate
representation of a fundamental frequency.
One strong argument in favour of this is that a listener is able to group musical notes by pitch indepen-
dently of the timbre of that note. A listener is able to identify a piano, a bell, a tympanum and filtered white
noise as having a common pitch. Moorer [1977] defines the pitch of a sound as some property that allows
it to be matched to a sine wave with a particular frequency. A sine wave itself has no harmonic structure
whatsoever, however it has an identifiable pitch. Hence although the harmonic structure of a musical note
may be invaluable to identifying the pitch of the note because the instrument is approximately a harmonic
oscillator (Section 2.3), harmonicity is not an adequate description of pitch itself.
Pitch perception is not independent of the volume either. The Mel frequency scale [Stevens et al., 1937],
which was arrived at by comparing perceived intervals of equal pitch with frequency intervals, illustrates
that the perception of pitch is non-linear. A general rule is that pitch decreases with increased loudness.
One formulation of the Mel frequency scale is
\[ m = 1127 \log_e \left( \frac{f}{700} + 1 \right) \tag{2.2} \]
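The formulation (2.2) and its inverse can be written directly, as in the short sketch below. The function names are illustrative only.

```python
import math


def hz_to_mel(freq_hz):
    """Mel value m = 1127 * ln(f / 700 + 1) of equation (2.2)."""
    return 1127.0 * math.log(freq_hz / 700.0 + 1.0)


def mel_to_hz(mel):
    """Inverse of equation (2.2), obtained by solving for f."""
    return 700.0 * (math.exp(mel / 1127.0) - 1.0)
```

With this formulation a frequency of 1000 Hz maps to approximately 1000 mel, and equal steps in mel correspond to progressively wider steps in Hz as frequency increases.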
As with many parametric models in signal processing, pitch estimation may be approached in the time
domain or in the frequency domain. Autocorrelation function (ACF) based methods detect periodicity in a
non-linear transformation of the Fourier spectrum of a signal. As such, regularly spaced partial frequencies
in the spectrum are detected by an increase in the energy distribution around those frequencies. Comb
filters add a delayed copy of a signal to itself, causing periodicity in the signal to be constructive when the
delay is correctly specified. The lag at which the maximum of the ACF or the comb filter response occurs
corresponds to a dominant fundamental frequency in the signal, and is used as an estimate of the pitch, as
illustrated in Figure 2.5 on page 25. Klapuri [2004] shows that comb filter solutions exhibit both better SNR
and pitch detection range than ACF methods, but at the cost of additional computation.
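A minimal sketch of lag-based pitch estimation is given below. It computes the autocorrelation directly in the time domain and picks the lag of the largest value within a search range that excludes zero lag, and it includes a feed-forward comb filter for comparison. It is not the method evaluated by Klapuri [2004]; the search range, the biased autocorrelation estimate and the function names are assumptions made for this example.

```python
def autocorrelation(signal, max_lag):
    """Biased autocorrelation r[k] = sum_n x[n] x[n + k] for k = 0..max_lag."""
    n = len(signal)
    return [
        sum(signal[i] * signal[i + k] for i in range(n - k))
        for k in range(max_lag + 1)
    ]


def comb_filter(signal, delay):
    """Feed-forward comb filter y[n] = x[n] + x[n - delay]."""
    return [
        s + (signal[i - delay] if i >= delay else 0.0)
        for i, s in enumerate(signal)
    ]


def acf_pitch(signal, sample_rate, min_lag=2, max_lag=50):
    """Return (lag, frequency) for the largest autocorrelation value in the search range.

    The zero-lag maximum is excluded by starting the search at min_lag.
    """
    acf = autocorrelation(signal, max_lag)
    lag = max(range(min_lag, max_lag + 1), key=lambda k: acf[k])
    return lag, sample_rate / lag
```

For a harmonic signal with a period of 10 samples the selected lag would be 10, and a comb filter with the same delay reinforces the signal, since each sample is added to an identical sample one period earlier.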
Despite our above discussion concerning the imprecise relationship between pitch and fundamental fre-
quency, we still need to mention that the MIDI standard has adopted the following formula to map between
Western tonal musical pitches and fundamental frequency:
\[ p = 69 + 12 \log_2 \left( \frac{f_0}{440} \right) \tag{2.3} \]
where p is a pitch such that there are 12 integer pitches per octave, and p = 69 corresponds to A at 440 Hz, the standardized
pitch. A semitone is the difference between two adjacent integer pitches, and a cent is 1/100 of this interval.
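The mapping (2.3), its inverse and the measurement of an interval in cents can be sketched as follows. The function names are illustrative only.

```python
import math


def freq_to_midi(f0_hz):
    """Possibly fractional MIDI pitch p = 69 + 12 * log2(f0 / 440), equation (2.3)."""
    return 69.0 + 12.0 * math.log2(f0_hz / 440.0)


def midi_to_freq(pitch):
    """Inverse of equation (2.3): fundamental frequency in Hz for a MIDI pitch."""
    return 440.0 * 2.0 ** ((pitch - 69.0) / 12.0)


def cents_between(ref_hz, freq_hz):
    """Interval from ref_hz to freq_hz in cents (100 cents per semitone)."""
    return 1200.0 * math.log2(freq_hz / ref_hz)
```

Here `freq_to_midi(440.0)` returns 69, and `midi_to_freq(60)` gives the frequency of middle C, approximately 261.63 Hz.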
2.2.3 Timbre
Timbre is another grouping category of music perception which can stand independent of dynamics and
pitch. In terms of frequency analysis, because pitch refers to a specific frequency, timbre may only
refer to the overall spectral profile. This is backed up by studies which have shown that timbre is related
to the relative amplitudes of the partials. Inharmonicity in the partials also affects timbre. Away from
frequency domain considerations, the timbre also can be characterized by the onset of the note, which is
typically unpitched due to the non-linear activation mechanism used to generate the sound.
2.3 Oscillators
The perception of a pitched sound requires some periodicity in the signal. The most common means of
producing periodicity is by a mechanical system known as an oscillator. Oscillators are used in a great
variety of musical instruments.
2.3.1 Terminology
The fundamental frequency often corresponds to the lowest partial frequency in a signal, which may be the
oscillation across the entire string length for a string instrument, or the air column for a wind instrument.
Other partial frequencies occur at the resonant frequencies of the instrument. Because of the relative spacing
of the resonant frequencies, partial frequencies are often integer multiples of the fundamental frequency.
Inharmonicity is often measured as the deviation in cents of a partial from the closest ideal harmonic of the
fundamental frequency.
[Figure 2.5 plots: (a) Magnitude response of the comb filter, Magnitude / dB against Normalised frequency (×π rads / sample); (b) Autocorrelation of a white noise signal filtered by the comb filter, Autocorrelation against Lag / samples]
Figure 2.5: Frequency response and autocorrelation of a comb filter with 10 samples delay. The maximum of
the autocorrelation occurs at zero lag, but the peak at 10 samples corresponds to the fundamental frequency
of the signal.
2.3.2 Stringed Instruments
For a string to be a harmonic oscillator the requirement is that it must be homogeneous, infinitely thin
and flexible. In this case, a string will exhibit a fundamental frequency related to its length and partial
frequencies at exactly integer multiples of the fundamental.
This is rarely the method used to construct stringed instruments simply because the space required
for the string lengths and their transverse vibrations is impractical for low pitches. In these cases, the
stiffness of the string is increased, so that the string vibrates more slowly. The result of this however is that
inharmonicity is introduced. Stiffness is increased by using thicker strings, or winding strings, or increasing
the tension applied to the string. Hence for stringed instruments such as the piano, guitar, harp and members
of the violin family played pizzicato (plucking the string), inharmonicity is present. For a piano, the following
formula relates the harmonic frequencies fh to the fundamental [Fletcher and Rossing, 1998]
\[ f_h = h f_0 \sqrt{1 + h^2 B} \tag{2.4} \]
where B is the inharmonicity coefficient.
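To give a feel for the size of the effect, the sketch below evaluates (2.4) and expresses the deviation of each partial from its ideal harmonic position in cents. The inharmonicity coefficient used in the usage note is a hypothetical value chosen for illustration, not a measurement, and the function names are assumptions.

```python
import math


def partial_frequency(h, f0_hz, b_coeff):
    """Frequency of partial h from equation (2.4): h * f0 * sqrt(1 + h^2 * B)."""
    return h * f0_hz * math.sqrt(1.0 + h * h * b_coeff)


def inharmonicity_cents(h, b_coeff):
    """Deviation in cents of partial h from the ideal harmonic h * f0."""
    return 1200.0 * math.log2(math.sqrt(1.0 + h * h * b_coeff))
```

With a hypothetical B = 0.0004, the tenth partial would lie roughly 34 cents above the ideal tenth harmonic, while the deviation of the first partial is a fraction of a cent. The deviation grows with partial number, which is why higher partials are the most affected.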
In the case of a bowed violin family instrument, the non-linear action of the bow on the string negates the
effect of the inharmonicity. The partial frequencies are driven to be multiples of the fundamental frequency.
The fundamental frequency however is not precisely that predicted by the length of the string, a fact which is taken into account
when a violinist plays the instrument.
2.3.3 Wind Instruments
The resonator in a wind instrument is an air column, the length of which is controlled by various means. A
partially open pipe has end effects where the acoustic length of the pipe differs from the geometric length.
However the effect varies with frequency, giving rise to inharmonicity.
There are exceptions: flutes and clarinets have been experimentally determined to have harmonic partial
frequencies. This is due to the sound-generating mechanism not being linear whilst the resonator itself is
(approximately) so. The system exhibits mode locking behaviour which forces partial frequencies to assume
their ideal positions.
Other interesting points about wind instruments are that the fundamental frequency of a bassoon is
nearly always absent; and even-numbered harmonics of a clarinet are suppressed because of the cylindrical
shape of the resonator [Barthet et al., 2005]. The saxophone, which is related to the clarinet, having a conical
resonator, does not suppress the even harmonics.
2.3.4 Vibrato
Vibrato refers to the periodic variation in pitch around a note, and as such is characterized by its depth: the
amplitude of the pitch variation, and the speed of the variation. Brown and Vaughn [1996] have determined
experimentally that the perceived pitch centre of a note with vibrato is the mean value of the perceived
frequency of the sound. Vibrato is not purely a frequency modulation, as it is difficult to perform this effect
without some degree of amplitude modulation, as shown by Arroabarren et al. [2003]. Deliberate amplitude
modulation in the absence of vibrato in a musical performance is known as tremolo.
Ratio   Harmonic   Name             Example
2:1     2          Octave           C
3:2     3          Perfect Fifth    G
4:3     -          Perfect Fourth   F
5:4     5          Major Third      E
6:5     6          Minor Third      E♭
9:8     9          Major Second     D
Table 2.1: Intervals and harmonics
Vibrato is subject to the mechanical limitations of the instrument, physical limitations of the player,
and stylistic rules. On some instruments vibrato cannot be produced. For the violin, Geringer and
Allen [2004] have investigated vibrato performance across students, finding that the speed of vibrato was
approximately 5.5 Hz and independent of the experience of the performer. A more detailed study by MacLeod
[2008] suggests that vibrato speed is a function of the pitch and dynamic, whilst the depth is a function of
the dynamic only. Similar empirical studies have been carried out for wind instruments and the human voice
[Prame, 1994].
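As an illustration of the two parameters that characterize vibrato, the following minimal sketch treats vibrato as a sinusoidal modulation of pitch about the note centre. The function names, the depth of a quarter of a semitone and the sampling rate are assumptions made for the example; the speed of 5.5 Hz is the figure quoted above for the violin. The accompanying amplitude modulation is deliberately ignored.

```python
import math

def vibrato_frequency(t, f_centre=440.0, depth_semitones=0.25, speed_hz=5.5):
    """Instantaneous frequency (Hz) at time t (s) of a note with sinusoidal vibrato."""
    offset = depth_semitones * math.sin(2.0 * math.pi * speed_hz * t)
    return f_centre * 2.0 ** (offset / 12.0)

def synthesize(duration=1.0, fs=8000, **kwargs):
    """Accumulate the instantaneous frequency into a phase to obtain the tone."""
    phase, samples = 0.0, []
    for n in range(int(duration * fs)):
        samples.append(math.sin(phase))
        phase += 2.0 * math.pi * vibrato_frequency(n / fs, **kwargs) / fs
    return samples

tone = synthesize(duration=0.5)  # half a second of a 440 Hz note with vibrato
```

Because the modulation is symmetric in pitch (semitones), the mean pitch over a whole number of vibrato cycles coincides with the note centre in this idealized model.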
2.4 Harmony, Chords and Key
Chords are a grouping of simultaneous pitches, and harmony is the relationship between these pitches to
create chords. The dual use of the word harmony in describing the relationship between partial frequencies
and also notes within a chord is deliberate. Table 2.1 on page 27 is based on the Pythagorean scale of just
intonation. In practice, Western music has generally adopted a tempered system, where the pitch ratios are
approximate, being logarithmically spaced with 12 semitones per octave. The pitch ratios in Table 2.1 on
page 27 can be achieved on instruments which do not have fixed tuning.
The seventh harmonic is omitted here as the interval suggested is dissonant (the same is true for the
eleventh harmonic). This particular overtone is often avoided in the production of sounds, as it is not
sufficiently close to the pitch B♭ related to C.
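The size of the approximation made by the tempered system can be checked numerically. The sketch below compares each just ratio of Table 2.1 with its equal-tempered counterpart, measured in cents (hundredths of a tempered semitone). The semitone counts assigned to each interval and the helper names are our own additions for the example.

```python
import math

# name: (ratio numerator, ratio denominator, size in equal-tempered semitones)
JUST_INTERVALS = {
    "Octave": (2, 1, 12),
    "Perfect Fifth": (3, 2, 7),
    "Perfect Fourth": (4, 3, 5),
    "Major Third": (5, 4, 4),
    "Minor Third": (6, 5, 3),
    "Major Second": (9, 8, 2),
}

def cents(ratio):
    """Size of a frequency ratio in cents."""
    return 1200.0 * math.log2(ratio)

def tempered_error_cents(num, den, semitones):
    """Equal-tempered interval minus the just interval, in cents."""
    return 100.0 * semitones - cents(num / den)

for name, (num, den, semis) in JUST_INTERVALS.items():
    print(f"{name:15s} {num}:{den}  error {tempered_error_cents(num, den, semis):+6.2f} cents")

# Seventh harmonic (7:4) against the tempered minor seventh (10 semitones, B flat above C).
print(f"Seventh harmonic  error {tempered_error_cents(7, 4, 10):+6.2f} cents")
```

The fifth and fourth are within about two cents of their tempered versions, whereas the seventh harmonic lies roughly a third of a semitone from the tempered B♭, which is consistent with its omission from the table.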
Consonance is the pleasantness with which different pitches sound together, and it is usually attributed to
notes sharing harmonics. One rule for defining consonance is to consider notes in the chord pairwise. If the
lowest overlapping harmonic is the eighth or less, the chord is consonant. This accounts for the prevalence of
major and minor triads. A chord possessing a semitone interval is always dissonant. The number of possible
chords in Western music is reduced massively when semitone intervals are rejected. Overlapping harmonics
are thus common in polyphonic music, and are also one of the difficulties that signal processing algorithms
must overcome when attempting to resolve separate notes within a chord.
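The pairwise rule above can be sketched in a few lines. We assume just-intonation ratios, and we assume that the "lowest overlapping harmonic" is counted on the lower note of each pair: for a ratio p:q in lowest terms, harmonic p of the lower note coincides with harmonic q of the upper note. The 16:15 ratio used for the semitone is likewise an assumption of the example, as are all function names.

```python
from fractions import Fraction
from itertools import combinations

def lowest_overlapping_harmonic(ratio):
    """Index, on the lower note, of the first harmonic shared by two notes."""
    r = Fraction(ratio)
    if r < 1:
        r = 1 / r
    return r.numerator  # Fraction is always stored in lowest terms

def is_consonant(chord, limit=8):
    """chord: note frequencies as Fractions relative to a common reference."""
    return all(lowest_overlapping_harmonic(b / a) <= limit
               for a, b in combinations(sorted(chord), 2))

major_triad = [Fraction(1), Fraction(5, 4), Fraction(3, 2)]
minor_triad = [Fraction(1), Fraction(6, 5), Fraction(3, 2)]
semitone_pair = [Fraction(1), Fraction(16, 15)]

for name, chord in [("major", major_triad), ("minor", minor_triad), ("semitone", semitone_pair)]:
    print(name, is_consonant(chord))
```

Under these assumptions both triads pass the test, since every pair of notes shares a harmonic at index six or lower, while the semitone pair does not.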
Favoured chord progressions are those which involve the fewest modifications to the pitches present
within a chord. For example, chord progressions involving I, IV and V chords are extremely common, each
only involving one note out of three to be modified. The concept of a key derives from the chord progressions.
2.5 Tempo
The rhythm of Western music is often felt as a regularly occurring pulse, known also as the beat or tactus. The
rate at which beats occur in music is known as the tempo, measured in beats per minute (BPM). Consecutive
groups of typically two, three or four beats are known as measures or bars. The properties of this grouping
are known as the meter of the music. Subdivisions of the tactus, corresponding to the shortest musical note
durations (semiquavers for example), are referred to as the tatum.
Generative models of tempo are presented in Section 3.4.
2.6 Conclusion
In this chapter we have briefly covered several topics in psychoacoustics and music theory. The goal of this
review has been to describe a priori structure in musical audio. The technology we review in Chapter 3 and
the methods we develop in this thesis all make use of this prior knowledge to infer musical information from
audio. In Chapter 3 we will study the applications and implementations of the models presented here for
inferring notes and tempo in musical audio.
Some of the results in this chapter may be applied to musical signal processing directly. The auditory
model described in Section 2.2 is based on physical evidence and experimental results. When this model is
applied in the literature, there are some variations in its implementation, such as the choice of the number
of gammatone filters to use in the filter bank, or how the summary spectrum is computed. These variations
are motivated by a need for computational efficiency when processing multiple output channels from the
filter bank compared to the faithfulness of the model to the ear. On the other hand, only general principles
and theories are available from psychological evidence when we consider how the brain perceives pitch. The
pitch estimation stages of systems in the literature therefore vary widely in both the models they use and their
implementations: for example, compare the use of comb filters and the autocorrelation function and the
different models they imply. Ultimately we must compare these systems by measuring and evaluating their
performance against that of a human listener.
A similar situation, where we have both reliable experimental knowledge and only general principles, is
when modelling the sound of a musical instrument. For an inharmonic stringed instrument, we have a model
of the harmonic positions of the partial frequencies given by (2.4), with one unknown parameter which can
be estimated experimentally [Godsill and Davy, 2005]. However the sound of a note from this instrument
may come with a performance effect like vibrato. As described in Section 2.3.4, experimental studies of vibrato are
limited and have only provided us with some limitations on the depth and speed of vibrato, rather than
precise measures.
In conclusion we have a set of models describing both the structure of musical audio and how it is
perceived and interpreted. These models at present have different levels of certainty because of the process
by which they were arrived at. Although we expect models to improve in accuracy with further investigation
and experimentation, there will always be uncertainty due to the human element in the performance and
reception of music. A flexible musical signal processing system needs to take into account the knowledge
covered in this chapter and the certainty we have about it. Moreover, the system should be able to incorporate
existing information where it is available. For example, if the key of a piece of music is known beforehand,
perhaps determined by a human listener, then the system may take advantage of this prior knowledge
to estimate chords and chord progressions and consequently improve transcription performance.
Chapter 3
Literature Review
3.1 Introduction
Machine listening of music audio [Scheirer, 2000] refers to the processing of digital audio signals to extract
information within a music information retrieval (MIR) context. It differs from, and complements, semantic
information which relies on information supplied by human listeners.
As the previous chapter has shown, the modelling of musical audio is not a straightforward task. Physical
models of musical instruments are approximate at best, and evaluation using perceptual models is at present
still focused on reducing computational expense whilst remaining realistic. Physical models for example may
have large numbers of parameters for each instant in time: partial frequencies, amplitudes, noise and so on,
making inference using these models expensive until recently. As a result, many authors have resorted to
approximations of these models, which have resulted in a wide array of algorithms in the literature.
Moreover, evaluation of machine listening systems is not necessarily straightforward. Evaluation is crucial
to performing comparative research and understanding which models and techniques are most appropriate
in given situations. Often the presentation of results in publications can be biased by the researcher's
particular choice of evaluation criteria, and the conclusions therefore misleading. Taking the example of
music transcription, a suitable method of evaluation would involve taking a set of trained musicians and
asking them to subjectively assess and rank the transcription outputs of different systems. This type of
evaluation involves considerable expense and time, and is not feasible for studies on large libraries of music.
An automated approach to evaluation requires choosing libraries of music for which there is a ground truth,
such as a reliable Midi file with its events aligned in time to the audio.
Assuming such a ground truth exists, we need to consider how the performance of a transcription system
can be measured. Poliner and Ellis [2007] state that even simple measures such as frame-level transcription
accuracy can be biased by reporting too many notes. An event based metric, such as edit distance, or the
measure used by the authors mentioned, is more appropriate given that music may be regarded as a stream
of note events. The authors also note that, at least in the case of polyphonic piano music which they study,
the position of the note onsets is more important than the duration and release of the notes.
Some form of consensus is being arrived at in the form of the annual Music Information Retrieval Evaluation
eXchange (Mirex, see footnote 1) [Downie, 2008] contest, which consists of a community-driven open evaluation
of research systems on a selection of evaluation tasks deemed to be of interest. Mirex is gaining more
credibility in the research community: a good or improved performance at one of the evaluation tasks is now
considered acceptable as comparative research.
This chapter will focus on those areas of interest highlighted at Mirex which are relevant to this thesis,
and will describe the models and algorithms that have been shown to be suitable to the processing tasks
they are designed for. We will demonstrate that the methods which have their basis and justification in the
concepts described in Chapter 2 are also the methods which perform best in these comparative evaluations.
3.2 Multipitch Analysis
The goal of automatic music transcription systems is to correctly infer the notes being played in a polyphonic
piece of music, producing an intermediate representation, such as Midi, from an audio track. The inference
of polyphonic music may be broken into two tasks:
1. Estimating multiple pitches / fundamental frequencies in individual frames of music. We focus on this
task in this section.
2. Tracking the pitches as note contours over consecutive frames, also filtering and smoothing the estimates
in the frames. This aspect of multipitch analysis is the focus of Section 3.3.
InMirex the multiple fundamental frequency estimation and tracking task evaluates frequencies as correct
when they are within a semitone of the ground truth. The accuracy of the estimation of the fundamental
frequencies is not evaluated: a system which reports ideal fundamental frequencies corresponding to the
MIDI standard will be ranked equally as one which finely evaluates the fundamental frequencies. For physical
models to perform well on such a task requires that the model of the audio fits the observed data very well,
and such tractable models have only become feasible in recent years. Auditory models on the other hand
will in general perform quite well even with simple implementations, because the definition of pitch is looser
than that of fundamental frequency.
3.2.1 Auditory Models
The best performing auditory model, both at Mirex and in other comparisons such as Poliner and Ellis
[2007], is that of Klapuri [2008]. The unitary model of pitch perception described in Section 2.2 was first used
by Tolonen and Karjalainen [2000] for multipitch analysis using an autocorrelation method. Klapuri [2008]
uses a comb filter which analyzes the periodicity of the summary spectrum (see Section 2.2). The output of
the comb filter is used to select fundamental frequency candidates. The comb filter weights partial m with
fundamental frequency f0 using the formula:
f0 + 1
mf0 + 2
(3.1)
1
http://www.music-ir.org/MIREX
31
where 1 ≈ 20Hz and 2 ≈ 320Hz, and the number of partials is limited to 20. This weighting ascribes more
importance to partials with lower harmonic index than higher, as many of these partials have already been
mapped to lower frequencies due to the auditory model processing.
In Figure 3.1b on page 33 we compare the summary spectrum which is obtained from the output of the
auditory model with the power spectrum obtained using Fourier analysis in Figure 3.1a on page 33. In Figure
3.2b on page 34 and Figure 3.2a on page 34 we compare the salience functions which are obtained from the
periodicity analysis by the comb filter. There are pronounced peaks in the salience functions of both spectra
at the correct pitches. However the salience function derived from the auditory model correctly ranks the
true pitches in first and second position, whereas the power spectrum ranks the third harmonic of one of the
notes with a higher salience. This improvement in multiple pitch estimation is due to the aforementioned
higher-order partial suppression which is a property of the auditory model.
The estimation of multiple pitches is iterative. At each iteration the pitch with the highest salience is
chosen and then the partials corresponding to this frequency are subtracted from the summary spectrum using
(3.3). A heuristic formula is used for evaluating the point at which the algorithm should stop, outputting
the number of notes that have been detected.
The above model is an example of an iterative estimation algorithm, where a preferred predominant
fundamental frequency candidate is determined; and then its contribution to the spectrum is canceled. The
algorithm iterates until a stopping criterion is met. This scheme requires less effort than jointly estimating
fundamental frequency candidates and also has justification from a psychoacoustical perspective [Bregman,
1990, Hartmann, 1996]. Klapuri [2003] refers to this as predominant F0 estimation.
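The estimate-and-cancel scheme described in this section can be sketched as follows. This is not Klapuri's implementation: the spectrum is a toy dictionary mapping integer frequency bins in Hz to magnitudes, cancellation simply zeroes the bins at the harmonic positions, and the stopping rule (a fraction `stop_ratio` of the first salience) is a hypothetical stand-in for the heuristic mentioned above. Only the partial weighting and the limit of 20 partials follow (3.1); the constant names `EPS1` and `EPS2` are ours.

```python
EPS1, EPS2 = 20.0, 320.0   # Hz, the constants quoted for (3.1)
MAX_PARTIALS = 20

def partial_weight(f0, m):
    """Weight of partial m of a candidate with fundamental f0, following (3.1)."""
    return (f0 + EPS1) / (m * f0 + EPS2)

def salience(spectrum, f0):
    """Weighted sum of spectrum magnitudes at the harmonic positions of f0."""
    return sum(partial_weight(f0, m) * spectrum.get(round(m * f0), 0.0)
               for m in range(1, MAX_PARTIALS + 1))

def iterative_f0(spectrum, candidates, stop_ratio=0.2, max_notes=6):
    """Pick the most salient candidate, cancel its partials, and repeat."""
    residual = dict(spectrum)
    found, first = [], None
    while len(found) < max_notes:
        best = max(candidates, key=lambda f0: salience(residual, f0))
        s = salience(residual, best)
        if first is None:
            first = s
        if s <= 0.0 or s < stop_ratio * first:   # hypothetical stopping rule
            break
        found.append(best)
        for m in range(1, MAX_PARTIALS + 1):     # crude cancellation: zero the bins
            residual.pop(round(m * best), None)
    return found

def note(f0, n_partials=10):
    """Toy harmonic note with partial magnitudes decaying as 1/m."""
    return {round(m * f0): 1.0 / m for m in range(1, n_partials + 1)}

mixture = note(220)
for f, a in note(330).items():
    mixture[f] = mixture.get(f, 0.0) + a
print(iterative_f0(mixture, candidates=[110, 165, 220, 330, 440, 660]))
```

Note that the two toy notes share partials at 660 Hz, 1320 Hz and 1980 Hz, so zeroing the bins of the first detected note also removes energy belonging to the second. This is the overlapping-harmonics difficulty raised in Section 2.4, and a practical system would subtract only an estimated contribution.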
3.2.2 Spectrum Estimation
In this section we look at algorithms based on sinusoidal models. In the recent Mirex evaluations of 2008
and 2009 sinusoidal models have shown the best performance, whereas in previous years auditory models
outperformed other schemes. Sinusoidal models are used to identify multiple frequencies in musical signals,
and the algorithm identifies which of those frequencies are the fundamentals. In early work such as Maher
and Beauchamp [1995], estimating multiple frequencies, i.e., spectrum estimation, was performed in isolation
from the task of detecting harmonic structure in the frequency estimates and thus identifying the fundamental
frequencies. Spectrum estimation has been a popular area of research in the signal processing community for
years, and there are many algorithms to choose from. However their ability to detect all of the frequencies
of interest in a complex polyphonic signal is limited. The successful approaches that we cover in this section
incorporate known priors of musical signals such as harmonicity (Section 3.2.3) and spectral smoothness into their
spectrum estimation algorithms.
After obtaining the sinusoidal representation of the signal, the algorithm must then determine which partial
frequencies belong to which pitch, and how many pitches are in the frame. The problem of labelling each
frequency is combinatorial in nature, especially when the number of pitches is not known, as is normally the
case. Hence the algorithm may only perform a limited search through possible combinations of fundamental
frequencies.
[Figure 3.1: Comparison of Fourier and summary spectra. Both panels plot magnitude / dB against frequency / Hz (0 to 5000 Hz).]
(a) Fourier spectrum for a mixture of two notes with pitch 1 and 5 relative to A440.
(b) Summary spectrum obtained using an auditory model from the same signal as Figure 3.1a. The higher order
partial frequencies have been suppressed due to the half-wave rectification and low pass filtering applied by the model.

[Figure 3.2: Comparison of salience functions obtained from spectra. Both panels plot salience against pitch relative to A440 (-30 to 20).]
(a) Salience function obtained from Fourier spectrum in Figure 3.1a. Pitch 5 is correctly identified with the
highest salience; however, pitch 20 (the third harmonic of pitch 1) is ranked with a higher salience than the
correct pitch 1.
(b) Salience function obtained from summary spectrum. Pitches 1 and 5 are correctly ranked with the highest
salience, even before the contribution of pitch 1 is removed from the spectrum in an iterative F0 estimation
procedure.

Spectrum estimation may be carried out in the time domain or the frequency domain. The motivation for
frequency domain approaches arises from the result by Bretthorst [1989] that the maximum of the periodogram
spectrum estimator gives the frequency of a single sinusoid embedded in white noise. Although the result
does not hold for multiple sinusoids, a popular approach picks local maxima of the periodogram above a
noise floor threshold. One approach for estimating the noise floor in this context is to compute the statistics
of the local minima of the periodogram [Martin, 2001], as the minima are expected to be noise components
of the signal rather than the sinusoidal components. Bayesian spectrum estimation on the other hand makes
use of an explicit signal model, i.e., sum of sinusoids, and a noise model. The principles used to design the
systems are similar:
1. Perform spectrum estimation: i.e., identify a number of frequencies present in the noise-corrupted
signal. This may be carried out using an explicit signal model, or by extraction from the coefficients
in a transform domain of the signal.
2. Identify and label fundamental frequencies and corresponding partial frequencies. A set of possible
candidates is generated in this step.
3. Select the candidate set with the simplest explanation.
Stages 2 and 3 may be combined into a single step as the principles used to identify the frequencies are
typically rules based on harmonicity and correlation between the amplitudes of the detected partials. For
example, Yeh et al. [2005] assume a sinusoidal model with multiplicative detuning parameters for each
partial frequency. The authors prefer candidates with smaller detuning
parameters. In a Bayesian setting, this would correspond to a prior which assigned greater probability mass
to smaller detuning parameters. Pertusa and Inesta [2008] apply the principle of Gaussian smoothness to
the partial amplitudes, which is similar to treating the evolution of the amplitudes as a linear dynamical
system.
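Stage 1 in its simplest frequency-domain form, picking periodogram maxima above a noise floor, can be sketched as below. The median of the periodogram is used here as a crude noise-floor estimate in place of the minimum statistics of Martin [2001], and the threshold factor of 100 is an arbitrary assumption. A direct DFT is used to keep the example dependency-free, so it is only suitable for short frames; all names and signal parameters are illustrative.

```python
import cmath
import math
import random

def periodogram(x):
    """Hann-windowed periodogram of a real signal via a direct DFT (O(n^2))."""
    n = len(x)
    xw = [v * (0.5 - 0.5 * math.cos(2.0 * math.pi * i / n)) for i, v in enumerate(x)]
    return [abs(sum(v * cmath.exp(-2j * math.pi * k * i / n) for i, v in enumerate(xw))) ** 2 / n
            for k in range(n // 2 + 1)]

def pick_peaks(power, fs, n, floor_factor=100.0):
    """Frequencies (Hz) of local maxima exceeding floor_factor times the median power."""
    floor = sorted(power)[len(power) // 2]   # median as a stand-in noise-floor estimate
    return [k * fs / n for k in range(1, len(power) - 1)
            if power[k] > floor_factor * floor
            and power[k] > power[k - 1] and power[k] >= power[k + 1]]

random.seed(0)
fs, n = 2560, 256                            # 10 Hz bin spacing
signal = [math.sin(2 * math.pi * 200 * i / fs) + 0.5 * math.sin(2 * math.pi * 450 * i / fs)
          + random.gauss(0.0, 0.05) for i in range(n)]
peaks = pick_peaks(periodogram(signal), fs, n)
print(peaks)
```

The two test sinusoids are placed exactly on bin centres, which is the most favourable case. For frequencies between bins the estimate is limited by the bin spacing, which is one motivation for the explicit signal models discussed next.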
The preference for the simplest explanation for an observation is encapsulated by Occam's razor, and
is naturally applied by means of Bayesian model selection (Section 4.1). This principle can be used to guide
the design of the models for musical signals, and may be used to select between competing models in a
mathematically rigorous manner. It is encouraging to see that signal models are indeed able to outperform
auditory models for multiple pitch recognition. One goal of this thesis will be to set the signal models
described above in a fully Bayesian framework, using the principles described in Chapter 2. In so doing, we
rely heavily on the work carried out for Bayesian spectrum estimation in Andrieu and Doucet [1999] and its
extension to polyphonic music in Davy et al. [2006].
3.2.3 Harmonicity
An explicit model for the inharmonicity of stringed instruments is given by (2.4). The inharmonicity pa-
rameter is estimated as part of a Bayesian modelling scheme by Godsill and Davy [2005], and an estimation
scheme is given as part of the multi-pitch extraction system of Klapuri [2003].
In addition, there exist models for inharmonicity in generic musical instruments, which we review in this
section. These models implicitly prefer a lower amount of inharmonicity, that is, partial frequencies are
expected to lie close to integer multiples of the fundamental frequency. Another property of the models is
that higher frequency partials are allowed to deviate more from their ideal harmonic positions than lower
frequency partials, as in (2.4).
Yeh et al. [2005] describe the degree of deviation $d$ of an observed partial $f$ from its expected harmonic
position $h f_0$ as
\[
d =
\begin{cases}
\dfrac{|f - h f_0|}{\alpha h f_0} & \text{if } |f - h f_0| < \alpha h f_0 \\
1 & \text{otherwise}
\end{cases}
\tag{3.2}
\]
where α is a tolerance on the amount of inharmonicity allowed.
Godsill and Davy [2005] introduce a generative model for inharmonicity via detuning parameters. The
detuning of a single partial $\delta$ is defined as the multiplicative error term that maps the expected harmonic
$h f_0$ to the observed partial $f$, i.e.,
\[
f = (1 + \delta) h f_0 \tag{3.3}
\]
Each $\delta$ in the model has a zero-mean Gaussian prior with a constant variance $\sigma_\delta^2 = 3 \times 10^{-8}$. Thus smaller
absolute values of the detuning parameters have a higher probability.
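The two measures of inharmonicity in (3.2) and (3.3) can be written down directly. In the sketch below the tolerance value passed for $\alpha$ in the usage lines is an arbitrary assumption, and the function names are ours. The prior variance is the value quoted above.

```python
import math

def harmonic_deviation(f, h, f0, alpha):
    """Deviation d of an observed partial f from harmonic position h*f0, as in (3.2)."""
    err = abs(f - h * f0)
    tol = alpha * h * f0
    return err / tol if err < tol else 1.0

def detuning(f, h, f0):
    """Multiplicative detuning delta such that f = (1 + delta) * h * f0, as in (3.3)."""
    return f / (h * f0) - 1.0

def log_prior_detuning(delta, variance=3e-8):
    """Log density of the zero-mean Gaussian prior on delta."""
    return -0.5 * math.log(2.0 * math.pi * variance) - 0.5 * delta ** 2 / variance

# A second partial observed slightly sharp of 2 * 440 Hz (alpha = 0.01 is an assumed tolerance).
f_obs = 880.5
print(harmonic_deviation(f_obs, 2, 440.0, alpha=0.01))
print(detuning(f_obs, 2, 440.0), log_prior_detuning(detuning(f_obs, 2, 440.0)))
```

Both measures penalize departures from the ideal harmonic position relative to $h f_0$, so a fixed deviation in Hz is tolerated more readily for higher partials.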
3.2.4 Spectral Envelope
In this section we cover how the spectral envelope of musical signals has been modelled in the literature.
The relative partial energies of the harmonic series of a musical instrument make an important contribution to
its timbre. The partial energies of a musical instrument decay relatively smoothly with increasing frequency,
so that the majority of the signal energy is concentrated within the lower harmonics.
A simple Bayesian model for the decay of the amplitudes with increasing frequency is given in Godsill
and Davy [2005]. The prior for the amplitude $b_m$ of the $m$th partial is a zero mean Gaussian:
\[
p(b_m) = \mathcal{N}\left(0, \sigma_n^2 \xi k_m\right), \qquad k_m = \frac{1}{1 + (T m)^{\nu}}
\]
$\sigma_n^2$ is the variance of the ambient noise, and $\xi$ represents the overall signal-to-noise ratio (see Section 5.3 for
details of this Bayesian model). The scaling $k_m$ controls the decay of the amplitudes according to a low-pass
filter with cut-off frequency $T$ and decay constant $\nu$. The use of low pass filters for modelling instrument
timbre is common, for example the system of Karjalainen and Laine [1991]. A fractional delay filter provides
the periodicity defining the fundamental frequency of the note, and a low pass filter in the feedback loop
causes higher harmonics to decay more rapidly. When excited with a short burst of white noise, the output
of the system has a realistic plucked string sound.
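To make the amplitude prior concrete, the sketch below evaluates the scaling $k_m$ and draws one set of partial amplitudes from the zero mean Gaussian prior. The numerical values chosen for $T$, $\nu$, $\sigma_n^2$ and $\xi$ are assumptions made purely for illustration and are not taken from Godsill and Davy [2005].

```python
import random

def amplitude_scale(m, T, nu):
    """Scaling k_m = 1 / (1 + (T m)^nu) applied to the prior variance of partial m."""
    return 1.0 / (1.0 + (T * m) ** nu)

def sample_amplitudes(n_partials, noise_var, xi, T, nu, seed=0):
    """Draw b_m ~ N(0, noise_var * xi * k_m) for m = 1..n_partials."""
    rng = random.Random(seed)
    return [rng.gauss(0.0, (noise_var * xi * amplitude_scale(m, T, nu)) ** 0.5)
            for m in range(1, n_partials + 1)]

scales = [amplitude_scale(m, T=0.2, nu=2.0) for m in range(1, 11)]
amplitudes = sample_amplitudes(10, noise_var=1.0, xi=100.0, T=0.2, nu=2.0)
print([round(k, 3) for k in scales])
```

The scaling decreases monotonically with the partial index, so the prior concentrates signal energy in the lower harmonics while still permitting an occasional strong higher partial.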
3.2.5 Transform Decomposition Methods
The last class of methods we will consider for multi-pitch analysis are those which decompose a linear basis
transform, such as the Fourier transform, of a musical signal into separate harmonic contributions from each
pitch. The method was pioneered for polyphonic music transcription by Smaragdis and Brown [2003]. In
their method, the spectrogram is formed into a matrix X with the Fourier transform of each overlapping
frame of music forming the columns of this matrix, i.e.,
\[
X_{\omega,t} = |\mathrm{STFT}(t, \omega)|^2
\]
The matrix $X$ is then decomposed by non-negative matrix factorization [Lee and Seung, 2000] into two
factors $X \approx WH$ where the number of columns of $W$ and the number of rows of $H$ are both equal to the
number of pitches that are modelled in this segment of music. Each column of W models the harmonic
profile of a note with a certain pitch. W may be trained or constrained to some parametric form. Each
element of H gives the weight, or the energy, of that note in a frame. H, when the rows are arranged in
ascending pitch order, has the appearance of a piano roll (see Figure 7.4 on page 119 as an example), and
may be used as a starting point for a full transcription.
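A minimal sketch of such a factorization is given below, using the multiplicative updates for the Euclidean cost. This is one common choice and not necessarily the cost function used by Smaragdis and Brown [2003]. The toy matrix stands in for a spectrogram with five frequency bins and four frames, and plain lists are used so that the example has no dependencies; all names and values are assumptions.

```python
import random

def matmul(a, b):
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def transpose(a):
    return [list(col) for col in zip(*a)]

def frobenius_error(x, w, h):
    wh = matmul(w, h)
    return sum((xv - av) ** 2 for xr, ar in zip(x, wh) for xv, av in zip(xr, ar))

def nmf(x, rank, n_iter=200, seed=0, eps=1e-12):
    """Factorize non-negative x (bins x frames) as w times h by multiplicative updates."""
    rng = random.Random(seed)
    w = [[rng.random() + 0.1 for _ in range(rank)] for _ in range(len(x))]
    h = [[rng.random() + 0.1 for _ in range(len(x[0]))] for _ in range(rank)]
    errors = [frobenius_error(x, w, h)]
    for _ in range(n_iter):
        wt = transpose(w)                      # update activations h
        num, den = matmul(wt, x), matmul(matmul(wt, w), h)
        h = [[hv * a / (b + eps) for hv, a, b in zip(hr, nr, dr)]
             for hr, nr, dr in zip(h, num, den)]
        ht = transpose(h)                      # update harmonic profiles w
        num, den = matmul(x, ht), matmul(w, matmul(h, ht))
        w = [[wv * a / (b + eps) for wv, a, b in zip(wr, nr, dr)]
             for wr, nr, dr in zip(w, num, den)]
        errors.append(frobenius_error(x, w, h))
    return w, h, errors

# Two toy "notes": columns of w_true are spectral profiles, rows of h_true are activations.
w_true = [[1.0, 0.0], [0.5, 0.0], [0.0, 1.0], [0.0, 0.3], [0.2, 0.4]]
h_true = [[1.0, 0.8, 0.0, 0.0], [0.0, 0.5, 1.0, 0.6]]
x = matmul(w_true, h_true)
w, h, errors = nmf(x, rank=2)
print(errors[0], errors[-1])
```

The factors are only determined up to scaling and ordering of the components, so in practice the rows of $H$ must be associated with pitches before the result can be read as a piano roll.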
3.3 Polyphony Tracking
The task of tracking multiple pitches present in a piece of musical audio over time is normally evaluated
separately from the task of estimating multiple pitches in a single frame (or otherwise stationary section of
audio in terms of pitch). This is useful because it is possible to combine different pitch analysis and tracking
systems together, and evaluate the performance of each separately.
Two applications related to tracking polyphonic signals are:
1. Melody extraction. This refers to extracting a melodic line from a polyphonic piece. What constitutes
a melodic line must be answered using music theory, but a practical working definition is that the
melodic line corresponds to the predominant pitch (Section 3.2.1) in each frame. We will therefore view melody
extraction as a sub-problem within the general polyphonic tracking problem, and do not directly address
it here. However, we note that many of the melody extraction systems that perform well in the Mirex
task use similar models and algorithms to systems which carry out full multiple-pitch estimation and
tracking, rather than being specifically designed for the purpose of melodic extraction.
2. Score alignment. This refers to tracking polyphony through audio when the ordering of the pitches
present is known to some extent, but the tempo is not. Hence polyphonic tracking systems should
have one model to describe the evolution of the pitches, and another model to describe the evolution of
the tempo. We cover tempo models in Section 3.4 as a separate application. In this section we discuss
models that describe how the pitches within a polyphonic piece change over time.
Polyphony tracking may be carried out in a real-time online manner, or offline. Offline approaches tend to
outperform online approaches, as expected and reported in Robertson and Plumbley [2009], who develop a
real-time system based on Puckette et al. [1998].
3.3.1 Attack-Sustain-Decay-Release
An Attack-Sustain-Decay-Release (ASDR) envelope is a function of time present in modern synthesizers used
to modulate the loudness of the note. Dodge and Jerse [1997] state that the shape of the ASDR envelope
is the dominant factor in the perception of instrument timbre. Thus this relatively simple model is used
frequently in the literature for an isolated musical note spanning multiple frames of audio data.
Orio and Déchelle [2001] model each note event in a score using a three-state hidden Markov model
(attack, sustain, rest). The duration of each note is governed by the transition probability from the note being
in the sustain state in one frame to the note remaining in the sustain phase in the succeeding frame. The
duration of each note thus follows a negative binomial law. A similar technique is used by Devaney et al.
[2009] where transient and sustained sections of sung notes are modelled using a three state hidden Markov
model.
Cemgil et al. [2006] use a state space model for the evolution of sinusoids through time. The beginning of
each note is modelled by the state being drawn from a zero mean Gaussian with covariance matrix modelling
the distribution of energy across partials. The sustain/decay phase is treated by each harmonic sinusoid $h$
having a damping ratio $\rho_h = \rho_d^h$ where $\rho_d$ is the damping ratio of the fundamental. In the rest phase, the
damping ratio is increased to $\rho_r$, so that the note rapidly becomes inaudible.
3.3.2 Pitch Evolution
A popular method for polyphonic tracking in the literature is a hierarchical hidden Markov model (HMM).
Individual notes are modelled as in Section 3.3.1. The transition probabilities between notes of different pitches are
estimated from large databases of music, such as the work carried out by Ryynänen and Klapuri [2004]. The
key of the music is provided as prior information to improve the relevance of the model, and most databases
can be shifted cyclically so that data for one key can be applied to another key to improve the diversity of
training data. Key detection systems are global, and may rely on coarse feature vectors such as chroma.
3.4 Tempo Tracking
Listeners without musical training are capable of tapping a rhythm corresponding to the pulse of the music,
and are able to adapt to changing tempos. With some training, listeners are able to infer the meter (Section
2.5). One goal of machine listening therefore is to be able to infer the same basic rhythmic structure. This
task is collectively known as beat tracking, although the emphasis may be more on tracking tempo rather
than precise beat locations.
Score following is a closely related application to beat tracking. In score following, knowledge of the
score automatically gives us the deterministic relationship between the tatum, beat and metrical structures.
Inferring the tatum is adequate for this task. Other than this, the onset detection model of a beat tracker
may need to be modified, at least to allow for misses and false alarms. It is also typical in the literature
to extend the onset detection model so that it incorporates knowledge of the underlying score, particularly
note pitches and volumes. It is an open question whether such extensions generally improve the quality
of score following or whether there might be a degradation in robustness to variations in timbre etc. The
seemingly satisfactory performance of query by tapping systems (QBT) [Jang et al., 2001] for querying
musical databases adds some weight to this question. Moreover, a perceptual evaluation of the alignment of
the audio to the score may be based on the alignment of onsets in the music.
In the literature, beat tracking is often coupled with audio onset detection to form a complete listening
system. However it is instructive to study these separately, as the onsets may already be available to us
in some electronic format, for example Midi events, and also we may wish to evaluate the performance of
different models in a modular system. We will also indicate how these models may be modified as part of a
score following system.
Cemgil and Kappen [2003] and Raphael [2001] model the observed onset times $y_k$ as actual onsets $\tau_k$
with some added noise $\epsilon_k$ per onset. The difference between successive actual onsets $\tau_{k-1}$ and $\tau_k$ is given
by the expected inter-onset interval (IOI) $\gamma_k$ (as a multiple of the tatum) scaled by the current value of the
tempo $\Delta_{k-1}$, with some additional process noise $\sigma_\tau$. The tempo evolves as a random process $\Delta_{k-1}, \Delta_k, \ldots$
with process noise $\sigma_\Delta$. The model can be written in state space form:
\[
\begin{aligned}
\begin{bmatrix} \tau_k \\ \Delta_k \end{bmatrix}
&=
\begin{bmatrix} 1 & \gamma_k \\ 0 & 1 \end{bmatrix}
\begin{bmatrix} \tau_{k-1} \\ \Delta_{k-1} \end{bmatrix}
+
\begin{bmatrix} \sigma_\tau \\ \sigma_\Delta \end{bmatrix} \\
y_k &= \tau_k + \epsilon_k
\end{aligned}
\tag{3.4}
\]
These models solely relate the observed IOIs to the expected IOIs. Beat and metrical information is supplied
a priori as $p(\gamma_k)$. Figure 3.3a on page 40 presents the model as a Bayesian network.
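The state space model (3.4) is straightforward to simulate, which is a useful check on the roles of its variables. In the sketch below we assume that the noise terms are zero-mean Gaussian with the given standard deviations, and that the tempo $\Delta$ is expressed as a period in seconds per tatum. The function and parameter names are ours.

```python
import random

def simulate_onsets(iois, tau0=0.0, delta0=0.5, sigma_tau=0.0, sigma_delta=0.0,
                    sigma_obs=0.0, seed=0):
    """Simulate (3.4). iois holds the expected inter-onset intervals gamma_k in tatums."""
    rng = random.Random(seed)
    tau, delta = tau0, delta0
    actual, observed = [], []
    for gamma in iois:
        tau = tau + gamma * delta + rng.gauss(0.0, sigma_tau)   # uses Delta_{k-1}
        delta = delta + rng.gauss(0.0, sigma_delta)             # random-walk tempo
        actual.append(tau)
        observed.append(tau + rng.gauss(0.0, sigma_obs))        # y_k = tau_k + noise
    return actual, observed

# Noise-free case: onsets fall exactly on the score positions scaled by the tempo.
actual, observed = simulate_onsets([1, 1, 2], delta0=0.5)
print(actual)

# A noisy performance with a slowly drifting tempo.
_, noisy = simulate_onsets([1, 1, 2, 1, 1, 2], sigma_tau=0.01, sigma_delta=0.005, sigma_obs=0.02)
print([round(y, 3) for y in noisy])
```

Inference reverses this simulation: given the observed onsets, the tempo and the actual onset times are recovered by filtering or smoothing, for example with a Kalman filter when the IOIs $\gamma_k$ are known.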
This model is appropriate for rhythm-based quantization of MIDI events; however, we must generalize
the model for some detection probability $p(y_k \mid \tau_k)$ when using audio onset detection. For score following we
could extend this model to have a probability distribution over the observed audio $p(s_{y_k:y_{k+1}} \mid \gamma_k)$ between
the recorded onset times. Raphael [2004] uses a simple generative model for the spectrum of each frame
given the score $\gamma_k$.
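Klapuri et al. [2006] assume a frame-based onset detector, and include metrical structure in their generative
Markov model
\[
\begin{aligned}
p\left(s_k, \tau^{\mathrm{tatum}}_{k-1:k}, \tau^{\mathrm{beat}}_{k-1:k}, \tau^{\mathrm{measure}}_{k-1:k}\right)
= {} & p\left(s_k \mid \tau^{\mathrm{tatum}}_k, \tau^{\mathrm{beat}}_k, \tau^{\mathrm{measure}}_k\right) \\
& \times p\left(\tau^{\mathrm{tatum}}_k \mid \tau^{\mathrm{beat}}_k, \tau^{\mathrm{tatum}}_{k-1}\right) \\
& \times p\left(\tau^{\mathrm{measure}}_k \mid \tau^{\mathrm{beat}}_k, \tau^{\mathrm{measure}}_{k-1}\right) \\
& \times p\left(\tau^{\mathrm{beat}}_k \mid \tau^{\mathrm{beat}}_{k-1}\right)
\end{aligned}
\tag{3.5}
\]
As can be seen from the structure of this model, the fundamental state is the beat or pulse. The metrical
structure is derived from the evolution of the beat state. The onset detector itself assigns a likelihood
$p\left(s_k \mid \tau^{\mathrm{tatum}}_k, \tau^{\mathrm{beat}}_k, \tau^{\mathrm{measure}}_k\right)$ for a feature vector $s_k$ based on the frame of audio. Figure 3.3b on page 40
presents the model as a Bayesian network.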
Whiteley et al. [2006] adopt a different definition of tempo as the velocity $n_k$ of the tatum $m_k$ over feature
vectors $s_k$ of frames of music:
\[
\begin{aligned}
p(s_k, m_{k+1}, n_{k+1}, \theta_{k+1} \mid m_k, n_k, \theta_k)
= {} & p(s_k \mid m_k, \theta_k) \\
& \times p(\theta_{k+1} \mid \theta_k, m_{k+1}, m_k) \\
& \times p(m_{k+1} \mid m_k, n_k, \theta_k) \\
& \times p(n_{k+1} \mid n_k)
\end{aligned}
\]
Here $\theta_k = \{M_k, r_k\}$ denotes changes in meter $M_k$ (the number of tatum positions in the bar) and also
rhythmical pattern indicators $r_k$ within a bar, hence its inclusion in the onset detector $p(s_k \mid m_k, \theta_k) =
p(s_k \mid m_k, r_k)$. The tatum moves deterministically according to
\[
m_{k+1} = (m_k + n_k - 1) \bmod M_k + 1
\]
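The deterministic tatum update is a wrap-around increment on bar positions numbered from 1 to $M_k$. The short sketch below, with hypothetical function names and a meter held constant for simplicity, shows how the position advances with the velocity and wraps at the end of the bar.

```python
def next_tatum(m, n, M):
    """Bar position update m_{k+1} = (m_k + n_k - 1) mod M_k + 1, positions 1..M."""
    return (m + n - 1) % M + 1

def simulate_bar_positions(m0, velocities, M):
    """Bar positions visited from m0 for a sequence of velocities n_k and fixed meter M."""
    positions = [m0]
    for n in velocities:
        positions.append(next_tatum(positions[-1], n, M))
    return positions

print(simulate_bar_positions(1, [1, 1, 1, 1], M=4))   # one tatum per frame
print(simulate_bar_positions(1, [2, 2, 2], M=4))      # twice the velocity
```

A larger velocity traverses the bar in fewer frames, which is the sense in which $n_k$ plays the role of tempo in this model.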
[Figure 3.3: Bayesian network representations of tempo models.]
(a) State-space model of tempo $\Delta_k$ with actual onsets $\tau_k$, observed onsets $y_k$ and expected
inter-onset intervals $\gamma_k$ (3.4) for a single time slice $k$.
(b) Markov model of metrical structure (3.5) for a single time slice $k$.
Both of the above models can be extended to score following by adding a probability distribution of the
form $p(s_{m_k:m_{k+1}})$ on the frames of audio between tatum positions. Peeling et al. [2007a] extend the model of Klapuri et al.
[2006] to score following using a generative model for the spectrogram coefficients.
3.5 Conclusion
Music is inherently hierarchical in structure, and in this chapter we have seen that the state-of-the-art
systems for extracting pitches, notes and tempo from musical signals reflect this hierarchy. To extract
multiple pitches from a signal, individual partial frequencies are estimated and grouped into the smallest set
of pitches that explain the harmonicity and timbre of the signal. Pitch estimates in consecutive frames of
music are then linked together as notes. Missing pitch estimates between a note's onset and offset are filled
in, and spurious pitches are discarded as noise. Notes in polyphonic music tend to arrive in groups with
regularly spaced intervals, giving rise to the perception of beat and tempo in music. Concurrently sounding
notes also tend to have extended harmonic relationships, giving rise to chords, chord sequences and the key
of the music.
It is clear from the above description that a single pass through a musical signal, first extracting pitches,
then smoothing pitch tracks to obtain notes, and estimating metrical structure and key, would be incomplete.
For example, when spurious pitch detections are discarded as noise, this gives us more information about the
structure of the noise in that frame, and can therefore be used to improve the original system that extracted
multiple pitches from the frame. An iterative approach thus seems appropriate; however, computational
performance can be an issue here, particularly if the pitch extraction algorithm only works on a single frame
at a time, and frames are then processed in order. Therefore single-pass systems are often experimentally
trained offline to determine the optimum set of parameters for pitch extraction and smoothing notes. However,
today's music has a large variety of genres, instruments, styles and so forth. Selecting an appropriate training
set which will generalize well is difficult.
The motivation for an iterative approach has given rise to algorithms based on matrix factorization of
time-frequency coefficients, which in single steps can process multiple frames of music, and are computationally
viable alternatives (Section 3.2.5). These methods have yet to be shown to be similar in performance to multiple
pitch extraction schemes based on auditory or sinusoidal models. Reasons for this include that the models
are not physically realizable, as they rely on the spectrogram being additive, the difficulty of determining the
number of pitches in each frame, and linking the frames together to track polyphony. These three reasons
are the motivation for our research in Chapter 7 and Chapter 8, where we attempt to address each issue
without sacrificing computation.
In this chapter we have also seen models for various music phenomena, such as harmonicity, timbre and
the envelope of the note energy. Some of these models were derived from studying the physics of the musical
instruments that produce the sound, whilst others were developed to capture more general characteristics,
such as low inharmonicity and the spectral smoothness of partial amplitudes. The application of these
models has been shown in the Mirex evaluations to be effective at extracting multiple pitches from frames
of music, and may be cast in a Bayesian setting. In Chapter 5 and Chapter 6 we will consider additional
parameters for frequency and amplitude modulations in a note, and investigate efficient schemes for inference
in models containing large numbers of parameters.
Combining a generative multiple pitch model in a frame with a hierarchical hidden Markov model for
tracking polyphony and tempo is attractive for musical signal analysis, as the entire model is suitable for
Bayesian inference, prior information can be incorporated in a transparent way, and the model is useful for
a number of applications. In Chapter 8 we consider models for tracking the movement of a score pointer in
a performance, and apply the model to the applications of score following and query by tapping.
Chapter 4
Bayesian Methods for Signal Processing
In this chapter we describe the basic Bayesian methods for modelling and inference on which the models developed
in this thesis rely. In Section 4.1 we discuss modelling concepts using Bayes' rule, and introduce some
popular models and representations of models. In Section 4.2 we describe the inference algorithms applied
to these models that are used elsewhere in this thesis.
4.1 Bayesian Modelling Methods
In this section we introduce some probabilistic modelling techniques. Section 4.1.1 provides definitions and concepts
relating to Bayes' rule and comparing probabilistic models as explanations of observed data. In Section
4.1.2 we show how graph structures can represent complex probabilistic models. We then define the hidden
Markov model in Section 4.1.3 which can be used to model causal and dynamical systems. Finally we
discuss the exponential family of probability distributions in Section 4.1.4 which are often used in Bayesian
approaches because of their useful properties.
4.1.1 Bayes Rule and Model Comparison
4.1.1.1 Parameter Estimation using Bayes Rule
In frequentist statistics, we have observations y produced by a set of unknown model parameters θ according
to a likelihood function p(y|θ). One estimate of the model parameters given the data is the maximum
likelihood (ML) estimate
\[
\hat{\theta}_{\mathrm{ML}} = \arg\max_{\theta}\, p(y|\theta)
\]
However we often have some prior knowledge expressed as a Bayesian belief p(θ) concerning the parameters,
and p(y|θ) is interpreted as the information provided by observations y conditioned on the parameters.
Bayes' rule states that
\[
p(\theta|y) = \frac{p(y|\theta)\,p(\theta)}{p(y)} \tag{4.1}
\]
\[
\text{posterior} = \frac{\text{likelihood} \times \text{prior}}{\text{evidence}}
\]
which can be viewed as weighting our prior belief with the likelihood of the data to give a posterior estimate
of the parameters. The prior captures our belief about the model parameters before we observed any data.
The ML estimate is now replaced with the maximum a posteriori (MAP) estimator
\[
\hat{\theta}_{\mathrm{MAP}} = \arg\max_{\theta}\, p(\theta|y) \tag{4.2}
\]
4.1.1.2 Marginal Likelihood for Model Comparison
The evidence or marginal likelihood $p(y)$ is the normalizing term in (4.1). The term marginal likelihood arises
as $p(y)$ is the likelihood of the observations $y$ after marginalizing the model parameters $\theta$:
\[
p(y) = \int_{\theta \in \Theta} p(y|\theta)\,p(\theta)\,d\theta \tag{4.3}
\]
The marginal likelihood is important for Bayesian model comparison as the only remaining unknown is the
identity of the probabilistic model for y itself. A higher marginal likelihood for a model indicates that the
model is a better explanation of the observed data. Calculating the marginal likelihood is an application
of Occam's razor, which states that a simpler explanation for an observation is to be preferred. Here,
simpler implies a model with fewer parameters. Both ML and MAP estimators will prefer a model with more
parameters, a problem known as over-fitting.
We can select the best model for y by computing the marginal likelihood p (y) for every model in the set
of models we are comparing. We will illustrate this here with two probabilistic models M1 and M2 which
are considered possible explanations for y. The two models may have different numbers of parameters. We
compute the marginal likelihood for each model, which we denote as p (y|M1) and p (y|M2). One method of
comparing the models is the Bayes factor [Kass and Raftery, 1995]
\[
\frac{p(y|M_1)}{p(y|M_2)}
\]
which assesses the evidence for $M_1$ against $M_2$. A Bayes factor greater than 1 indicates that $M_1$ should be
preferred.
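As a small illustration of the Bayes factor and of the Occam's razor effect, consider a sequence of coin flips explained either by a fair coin with no free parameters, or by a coin whose bias has a uniform prior. This is a minimal sketch under those assumptions; the two models and the function names are illustrative and do not appear elsewhere in this thesis. The marginal likelihood of the second model is the Beta integral of the Bernoulli likelihood over the bias.

```python
from math import exp, lgamma, log


def log_evidence_fair(heads: int, tails: int) -> float:
    """M1: bias fixed at 0.5, so the evidence is just the likelihood."""
    return (heads + tails) * log(0.5)


def log_evidence_uniform(heads: int, tails: int) -> float:
    """M2: bias ~ Uniform(0, 1); integrating theta^h (1 - theta)^t gives B(h + 1, t + 1)."""
    return lgamma(heads + 1) + lgamma(tails + 1) - lgamma(heads + tails + 2)


def bayes_factor(heads: int, tails: int) -> float:
    """p(y | M1) / p(y | M2) for a particular sequence with the given counts."""
    return exp(log_evidence_fair(heads, tails) - log_evidence_uniform(heads, tails))


balanced = bayes_factor(5, 5)     # data consistent with a fair coin
one_sided = bayes_factor(10, 0)   # data strongly favouring a biased coin
print(balanced, one_sided)
```

For balanced data the simpler model is preferred even though the more flexible model can fit the data equally well, because the flexible model spreads its prior mass over many biases that explain the data poorly.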
In full Bayesian inference we do not compute point estimates of the model parameters θ such as the MAP
estimate. Rather we are interested in inferring the posterior probability distribution p (θ|y), for which the
MAP estimate is the mode of this distribution. Inferring the posterior distribution allows us to compute
expectations such as the variance, which gives us some idea of the uncertainty in our parameter estimates, and
is important for making decisions based on such inference. Note that as the likelihood p (y|θ) and the prior
p (θ) are known, then computing the marginal likelihood p (y) is equivalent to computing the normalization
constant of the posterior distribution and thus computing the distribution itself (4.1).
4.1.1.3 Generative and Discriminative Models
A generative model for y is a probabilistic model p (y, θ) = p (y|θ) p (θ) for which we can randomly sample
a set of parameters θ′ ∼ p (θ) and then sample or generate an observation y′ ∼ p (y|θ′). When choosing a
generative model to study a signal, we desire that
• Statistical properties of the observed signal should match the properties of data generated by the model
• The parameters of the model θ include information which is to be extracted from the signal
• The prior p (θ) and the likelihood p (y|θ) of the model are based on the known physical processes which
drive the signal
The alternative to a generative model is a discriminative model which directly defines a probability distribution
p (θ|y) which can then be used to obtain the MAP estimate (4.2). This often means that the original
likelihood and prior in (4.1) may not be available in closed form and cannot be used as a generative model. A
discriminative model may be designed to give accurate results for the task it is designed for, such as multiple
pitch detection, but needs to go through cross-validation to ensure that the model will generalize well to
unseen signals. For generative models we are able to use model comparison to select the most appropriate
model, and model selection can be carried out not only offline on training data but also at the time when
the signal is observed.
4.1.1.4 Hierarchical Models
A hierarchical model enforces conditional independencies between the model parameters. There is often good
justification in a signal processing application for making this assumption. For example, in a beat tracking
application, where we want to jointly track the tempo and the position of beats in a drum track, the beats
themselves appear as sudden bursts of energy in the signal, but the tempo is not directly relevant to the
observed signal. Rather, the tempo controls the rate at which the beats occur. If we define the parameters
related to the beats as θb and the tempo parameters as θt, then the model
p (y, θb, θt) = p (y|θb) p (θb|θt) p (θt)
is not only justifiable from a theoretical point of view, but also can be a generative model: first sample the
tempo according to the prior p (θt) and then sample the beats, and finally the signal.
Hierarchical generative models are often represented graphically as Bayesian networks (Section 4.1.2).
The graph can then be used to determine how to sample the parameters in turn to generate data from the
model, and also guides inference algorithms to estimate unknown parameters given observed data.
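The generative reading of the beat tracking example can be sketched in code: sample the tempo, then the beats given the tempo, then the signal given the beats. This is a toy illustration only; the uniform prior on the beat period, the jitter on beat positions, the Gaussian frame energies, and the name `sample_drum_track` are all assumptions made for the example.

```python
import random


def sample_drum_track(n_frames: int = 200, seed: int = 0):
    """Ancestral sampling from p(y | theta_b) p(theta_b | theta_t) p(theta_t)."""
    rng = random.Random(seed)
    # theta_t ~ p(theta_t): tempo expressed as a beat period in frames (assumed prior).
    period = rng.randint(20, 40)
    # theta_b ~ p(theta_b | theta_t): beat positions, one period apart with small jitter.
    beats = []
    t = rng.randint(0, period - 1)
    while t < n_frames:
        beats.append(t)
        t += period + rng.choice([-1, 0, 1])
    # y ~ p(y | theta_b): frame energy is high at beats and low elsewhere.
    beat_set = set(beats)
    signal = [rng.gauss(1.0 if k in beat_set else 0.0, 0.1) for k in range(n_frames)]
    return period, beats, signal


period, beats, signal = sample_drum_track()
print(period, beats[:4])
```

Note that the signal is sampled from the beats alone: the tempo influences the observations only through the beat positions, which is the conditional independence expressed by the factorization.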
4.1.1.5 Conjugate Priors
In this section we define a useful property of certain families of probability distributions which simplifies
inference and computation. A family of probability distributions is a set of distributions which share the same
functional form but have different parameters. For example, the normal distribution may be parametrized
by mean $\mu$ and standard deviation $\sigma$ as
\[
\mathcal{N}(y; \mu, \sigma) \equiv \frac{1}{\sqrt{2\pi}\,\sigma} \exp\left(-\frac{(y - \mu)^2}{2\sigma^2}\right)
\]
The mean µ is a location parameter for the normal distribution. A family of distributions with a location
parameter has the following functional form
fµ (y) = f (y − µ)
The standard deviation is a scale parameter for the normal distribution. A family of distributions with a
scale parameter has the following functional form
fσ (y) = f (y/σ) /σ
Now assume the mean µ of some observed data y is unknown, but the standard deviation σ is known. The
posterior of µ is
p (µ|y, σ) ∝ p (y|µ, σ) p (µ)
If the prior p (µ) is also chosen to be a normal distribution, then it can be shown that the posterior p (µ|y, σ)
is also a normal distribution, where the mean and standard deviation of the posterior are given by standard
rules (Section A.1). The normal distribution prior p (µ) is a conjugate prior for the mean parameter of the
normal distribution, because the posterior is in the same family of probability distributions as the prior. A
conjugate prior is a choice of prior for a particular parameter of a likelihood function, where the posterior
of this parameter is in the same family as the prior.
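The conjugate update for the mean of a normal distribution with known standard deviation can be written in a few lines. The sketch below uses the standard result that precisions (inverse variances) add and that the posterior mean is the precision-weighted average of the prior mean and the data; the function name and the example values are assumptions for illustration.

```python
import math


def normal_mean_posterior(y, sigma, mu0, sigma0):
    """Posterior N(mu_n, sigma_n) of the mean, for data y with known sigma and prior N(mu0, sigma0)."""
    n = len(y)
    precision = 1.0 / sigma0**2 + n / sigma**2           # prior and data precisions add
    mu_n = (mu0 / sigma0**2 + sum(y) / sigma**2) / precision
    return mu_n, math.sqrt(1.0 / precision)


# Illustrative values (assumptions): three observations, unit noise, standard normal prior.
mu_n, sigma_n = normal_mean_posterior([1.0, 2.0, 3.0], sigma=1.0, mu0=0.0, sigma0=1.0)
print(mu_n, sigma_n)
```

The posterior mean lies between the prior mean and the sample mean, and the posterior standard deviation shrinks as more data are observed.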
4.1.2 Bayesian networks
Graphical models are a method of visualizing the structure of probabilistic models by diagrammatically
representing probability distributions. Inference algorithms can be viewed and defined as messages being
passed between nodes of the graphical model [Bishop, 2006]. Directed acyclic graphs, also known as Bayesian
networks, are useful for constructing models via conditional probability distributions. Figure 3.3 on page
40 provides two examples of Bayesian networks. Circular nodes represent the values of random variables,
such as the unknown parameters and observed data, and edges represent statistical dependencies between
the random variables.
Bayesian networks intuitively represent causality. The definition is that an edge is directed from A to B if
B is conditionally dependent on A. We call A a parent of B, and B a child of A. The acyclic property makes
these graphs suitable for working with generative probabilistic models, where we can generate synthetic
data by sampling the distributions of nodes with no parents (hence they are conditionally independent of
all other random variables in the network), and then moving through the network along the directed edges,
sampling each child node dependent on its parents. This is known as ancestral sampling [Bishop, 2006].
The probability distribution, and the method of generating samples from it, are represented by the following
factorization:
\[
p(x_1, \dots, x_N) = \prod_{i=1}^{N} p(x_i|\mathrm{Par}(x_i)) \tag{4.4}
\]
where $\mathrm{Par}(x_i)$ denotes the set of parent nodes of $x_i$. One weakness of Bayesian networks is that although
conditional dependencies are expressed directly, the task of determining whether a variable is conditionally
independent of another is not so clearly evident. We define the Markov blanket of a node as the set of nodes
which, when their values are given, render the node conditionally independent of all other nodes in the
network. For a Bayesian network, the Markov blanket of $x_i$ is the union
$\{\mathrm{Par}(x_i) \cup \mathrm{Chl}(x_i) \cup \mathrm{Par}(\mathrm{Chl}(x_i))\}$, excluding $x_i$ itself,
where $\mathrm{Chl}(x_i)$ denotes the set of child nodes of $x_i$.
[Figure 4.1: Hidden Markov model, with hidden states $\theta_0, \theta_1, \dots, \theta_n, \dots, \theta_N$ and observations $y_1, \dots, y_n, \dots, y_N$. Observed random variables have doubled lines.]
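The Markov blanket can be read off mechanically from the parent sets that define the network. The following sketch assumes the network is stored as a dictionary mapping each node to a list of its parents; this representation, the node names, and the function name `markov_blanket` are illustrative assumptions.

```python
def markov_blanket(parents: dict, node: str) -> set:
    """Parents, children, and the children's other parents of `node`."""
    children = {c for c, ps in parents.items() if node in ps}
    co_parents = {p for c in children for p in parents[c]}
    return (set(parents[node]) | children | co_parents) - {node}


# A short hidden Markov chain: th0 -> th1 -> th2, with observations y1 and y2.
hmm = {"th0": [], "th1": ["th0"], "th2": ["th1"], "y1": ["th1"], "y2": ["th2"]}
print(markov_blanket(hmm, "th1"))

# A node with two parents: a and b become dependent once c is given.
collider = {"a": [], "b": [], "c": ["a", "b"]}
print(markov_blanket(collider, "a"))
```

The second example shows why the parents of the children must be included: conditioning on a common child couples its parents.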
4.1.3 Hidden Markov Models
A hidden Markov model (HMM) is a probabilistic model with an underlying Markov process, that is, the
conditional probability distribution of future states of the process given the present and all past states of the
process is conditionally independent of the past states, i.e., it depends only on the present state. This state
of the process is assumed to be hidden, and the task is to infer the sequence of states over time given some
observations dependent only on the current value of the state. Usually the hidden Markov model is viewed
as a special case of a general state space probabilistic model, where the state space Θ is discrete.
Hidden Markov models are often used for modelling dynamical systems. We will use the following
notation, which is represented as a Bayesian network in Figure 4.1 on page 47: the hidden state sequence
is θ0:K and y1:K is the sequence of observations produced according to the state likelihoods (also known as
emission probabilities) p(yk|θk), over discrete times k = 1, . . . ,K. The state sequence evolves as a Markov
chain, that is, it has an initial probability distribution p(θ0) and a set of transition probability distributions
p(θk|θ0:k−1) = p(θk|θk−1), k = 1, . . . ,K.
4.1.4 Exponential Family of Probability Distributions
The exponential family of probability distributions is a class of probability distributions having the form
\[
p(y|\theta) = \frac{1}{Z_\theta} e^{-\langle \theta, T(y) \rangle} \tag{4.5}
\]
The normalizing factor $Z_\theta$ is given by $\int dy\, e^{-\langle \theta, T(y) \rangle}$. This family has a number of useful properties which
make it important for Bayesian inference:
• The finite vector T (y) is the collection of sufficient statistics, which capture all the possible information
about θ that is represented by observations y, that is: p(y|T (y), θ) = p(y|T (y)) ∀θ. For inference only
the sufficient statistics T (y) are required, not the entire data.
• The maximum likelihood parameters θˆML that maximize (4.5) are those for which the observed values
of the sufficient statistics equal their expected values: T (y) = 〈T 〉θ.
• For a likelihood function p(y|θ) in the exponential family, there exists a conjugate prior p(θ) (Section 4.1.1.5),
often itself in the exponential family, for which the posterior p(θ|y) is in the same class of distributions
as the prior. This is useful in variational methods (Section 4.2.3) as the update equations require
only updating the sufficient statistics of a factor rather than performing a computationally intensive
calculation over the whole parameter space of a probability distribution.
4.2 Inference Algorithms
In this section we cover inference algorithms falling into three main categories: exact inference (Section
4.2.1) for situations where it is possible to compute the posterior for all possible parameter settings, sampling
methods (Section 4.2.2) which aim to generate samples from the posterior and compute expectations via
Monte Carlo integration, and variational methods (Section 4.2.3) which approximate the full posterior with
distributions for which the integrals can be computed.
4.2.1 Exact Inference
Exact inference refers to being able to compute posterior quantities precisely. Often this requires marginalizing
over parameter spaces, which therefore requires the integrals to be analytic (such as in the case of
models with Gaussian conditional probabilities) or that the parameters only assume discrete values. The
Kalman filter is an exact inference algorithm for linear dynamical systems with Gaussian transition and
observation noise. For a hidden Markov model (Section 4.1.3) with a discrete and finite state space, such
that $\theta_k$ can assume one of $E$ possible values, and the conditional probability of the observation given the
state is computable, there exists a group of message-passing inference algorithms with complexity $O(E^2 K)$
to carry out various inference tasks.
Typically we may wish to determine the probability of states at time $k$ given past observations, $p(\theta_k|y_{1:k})$,
which is known as filtering and can be carried out recursively on-line; or including all future observations
up to time $K$, $p(\theta_k|y_{1:K})$, which is known as smoothing and must be carried out offline; or including $N$ further
observations beyond time $k$, $p(\theta_k|y_{1:k+N})$, which is known as fixed-lag smoothing and is practical if a certain amount of
latency in the inference is acceptable. We may also wish to predict future states $p(\theta_{k+N}|y_{1:k})$ or infer the
most likely sequence of states $\arg\max_{\theta_{0:K}} p(\theta_{0:K}|y_{1:K})$, which is known as the Viterbi path. The computations required
for all these related but distinct queries can be viewed in terms of message passing algorithms. Both the
Kalman filter and the following HMM algorithms are actually special cases of an inference algorithm, known as
the sum-product algorithm (for filtering and smoothing), that acts over general Bayesian networks or factor
graphs; see Bishop [2006] for details. Rabiner [1989] provides a tutorial on HMMs, also describing the Baum-Welch
algorithm, which is an expectation-maximization (EM) algorithm for learning the parameters of
the transition and observation processes.
Let θ0:K be the unknown state sequence in a hidden Markov model, and let y1:K be the sequence of
observations generated. By Bayes' theorem, the posterior distribution over all possible state sequences is
given by
\[
p(\theta_{0:K}|y_{1:K}) = \frac{p(y_{1:K}|\theta_{0:K})\,p(\theta_{0:K})}{p(y_{1:K})} \tag{4.6}
\]
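The marginal filtering density $p(\theta_k|y_{1:k})$ can be computed up to the normalizing constant $p(y_{1:k})$ by
passing `alpha' messages $\alpha_{k|k}(\theta_k) \equiv p(\theta_k|y_{1:k})\,p(y_{1:k})$ between neighbouring frames:
\[
\alpha_{0|0}(\theta_0) = p(\theta_0) \tag{4.7}
\]
\[
\alpha_{k|k-1}(\theta_k) = \sum_{\theta_{k-1}} p(\theta_k|\theta_{k-1})\,\alpha_{k-1|k-1}(\theta_{k-1}) \tag{4.8}
\]
\[
\alpha_{k|k}(\theta_k) = p(y_k|\theta_k)\,\alpha_{k|k-1}(\theta_k) \tag{4.9}
\]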
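The marginal smoothing density $p(\theta_k|y_{1:K})$ is computed offline by passing `beta' messages
$\beta_{k|k+1}(\theta_k) \equiv p(y_{k+1:K}|\theta_k)$ as follows:
\[
\beta_{K|K+1}(\theta_K) = 1 \tag{4.10}
\]
\[
\beta_{k|k}(\theta_k) = p(y_k|\theta_k)\,\beta_{k|k+1}(\theta_k) \tag{4.11}
\]
\[
\beta_{k|k+1}(\theta_k) = \sum_{\theta_{k+1}} p(\theta_{k+1}|\theta_k)\,\beta_{k+1|k+1}(\theta_{k+1}) \tag{4.12}
\]
\[
p(\theta_k|y_{1:K}) \propto \alpha_{k|k}(\theta_k)\,\beta_{k|k+1}(\theta_k) \tag{4.13}
\]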
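The alpha and beta recursions translate directly into code for a small discrete state space. The following is a minimal sketch, assuming the model is given as plain lists of probabilities; the function name, argument layout, and the numerical values are illustrative assumptions. No rescaling of the messages is performed, so it is only suitable for short sequences.

```python
def forward_backward(prior, trans, lik):
    """Filtering and smoothing for a discrete HMM.

    prior[i]    = p(theta_0 = i)
    trans[i][j] = p(theta_k = j | theta_{k-1} = i)
    lik[k-1][j] = p(y_k | theta_k = j) for k = 1..K
    """
    E, K = len(prior), len(lik)
    alpha = [list(prior)]                                          # (4.7)
    for k in range(1, K + 1):
        pred = [sum(trans[i][j] * alpha[k - 1][i] for i in range(E))
                for j in range(E)]                                 # (4.8)
        alpha.append([lik[k - 1][j] * pred[j] for j in range(E)])  # (4.9)
    beta = [None] * (K + 1)                                        # beta[k] holds beta_{k|k+1}
    beta[K] = [1.0] * E                                            # (4.10)
    for k in range(K - 1, -1, -1):
        upd = [lik[k][j] * beta[k + 1][j] for j in range(E)]       # (4.11) at time k+1
        beta[k] = [sum(trans[i][j] * upd[j] for j in range(E))
                   for i in range(E)]                              # (4.12)
    smoothed = []
    for k in range(K + 1):
        joint = [alpha[k][i] * beta[k][i] for i in range(E)]       # (4.13)
        z = sum(joint)
        smoothed.append([v / z for v in joint])
    evidence = sum(alpha[K])                                       # p(y_{1:K})
    return alpha, smoothed, evidence


# Illustrative two-state model with three observations (values are assumptions).
prior = [0.6, 0.4]
trans = [[0.7, 0.3], [0.2, 0.8]]
lik = [[0.9, 0.1], [0.5, 0.5], [0.2, 0.7]]
alpha, smoothed, evidence = forward_backward(prior, trans, lik)
print(smoothed, evidence)
```

Each sweep costs on the order of E squared operations per frame, in line with the complexity quoted above, and summing the final alpha message gives the marginal likelihood of the observations.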
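The Viterbi path is computed in an analogous manner, where messages from neighbouring frames and
observations are combined by taking the maximum rather than summing, i.e.,
\[
\alpha_{k|k-1}(\theta_k) = \max_{\theta_{k-1}} p(\theta_k|\theta_{k-1})\,\alpha_{k-1|k-1}(\theta_{k-1}) \tag{4.14}
\]
\[
\beta_{k|k+1}(\theta_k) = \max_{\theta_{k+1}} p(\theta_{k+1}|\theta_k)\,\beta_{k+1|k+1}(\theta_{k+1}) \tag{4.15}
\]
\[
\arg\max_{\theta_{0:K}} p(\theta_{0:K}|y_{1:K}) = \left\{ \arg\max_{\theta_k} \alpha_{k|k}(\theta_k)\,\beta_{k|k+1}(\theta_k) \right\}_{k=0:K} \tag{4.16}
\]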
Algorithm 4.1 Generic MCMC
• Sample $\tilde{\theta}^{(0)} \sim \pi_0(\theta)$
• For $i = 1, \dots, M$
  ◦ Sample $\tilde{\theta}^{(i)} \sim K(\theta|\tilde{\theta}^{(i-1)})$
4.2.2 Monte Carlo Methods
Bayesian inference problems frequently involve computing high-dimensional integrals. For example, if we
are interested in the mean of the posterior p(θ|y), we are required to compute the following integral over the
space of possible parameter settings Θ.
\[
E_{p(\theta|y)}[\theta] = \int_{\Theta} \theta\, p(\theta|y)\, d\theta \tag{4.17}
\]
In many cases, the dimension of Θ is very large and the problem is intractable. We must resort to using a
Monte Carlo estimate, which is a stochastic numerical integration method. Monte Carlo methods [Gilks and
Spiegelhalter, 1996, Liu, 2003] are a general class of methods to compute expectations of random variables.
We denote the target probability density function (pdf) of interest as $\pi(\theta)$, and assume we have a set of
random samples $\tilde{\theta}^{(i)}$, $i = 1, \dots, N$ drawn from $\pi(\theta)$, which are called Monte Carlo samples. Now, for a general
function $h(\theta)$ over the parameter space, we have, by the law of large numbers, the following approximation
to the general integration problem
\[
\int_{\theta} h(\theta)\,\pi(\theta)\,d\theta \approx \frac{1}{N} \sum_{i=1}^{N} h(\tilde{\theta}^{(i)}) \tag{4.18}
\]
However, $\pi(\theta)$ can be of large dimension, with an unknown normalizing constant, and may therefore be
difficult to sample from. See Robert and Casella [2004] for a full overview of Monte Carlo techniques. Here
we will mention two methods in particular which are useful to us for generating random samples from such
a pdf.
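The Monte Carlo approximation (4.18) is easiest to see in a case where the target can be sampled directly. The sketch below estimates the second moment of a standard normal distribution, whose exact value is 1; the target, the function h, the sample size, and the helper name are assumptions chosen for illustration.

```python
import random


def monte_carlo_expectation(h, sampler, n: int, seed: int = 0) -> float:
    """Approximate the integral of h(theta) pi(theta) by an average over n samples from pi."""
    rng = random.Random(seed)
    return sum(h(sampler(rng)) for _ in range(n)) / n


# Target pi = N(0, 1) and h(theta) = theta^2, so the exact answer is 1.
estimate = monte_carlo_expectation(lambda t: t * t,
                                   lambda rng: rng.gauss(0.0, 1.0),
                                   100_000)
print(estimate)
```

The error of the estimate shrinks in proportion to the inverse square root of the number of samples, independently of the dimension of the parameter space, which is what makes the approach attractive for high-dimensional integrals.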
4.2.2.1 Markov Chain Monte Carlo
Markov Chain Monte Carlo (MCMC) methods construct a Markov chain with stationary distribution equal
to the target pdf $\pi(\theta)$ which we wish to sample from. A Markov chain is specified by an initial sample
distribution $\pi_0(\theta)$ and a transition kernel $K(\theta|\theta')$ which is a probability density. For the stationary distribution
of the Markov chain to be equal to the target pdf $\pi(\theta)$ the transition kernel must obey
\[
\int_{\Theta} K(\theta|\theta')\,\pi(\theta')\,d\theta' = \pi(\theta) \quad \forall \theta \in \Theta \tag{4.19}
\]
Given such a kernel, the Monte Carlo samples $\tilde{\theta}^{(i)}$ are generated from the Markov chain as in Algorithm 4.1.
A general method of constructing the kernel $K(\theta|\theta')$ is provided by the Metropolis-Hastings (MH) algorithm.
Suppose we have a proposal pdf $q(\theta|\theta')$ which we can directly sample from, with $q(\theta|\theta') > 0$ wherever
$\pi(\theta) > 0$. Step 3 in Algorithm 4.1, which involves drawing samples from the kernel, is expanded in Algorithm 4.2, where candidate
samples $\theta^\star$ are drawn from the proposal pdf $q(\theta|\theta')$, and then accepted or rejected according to an acceptance
ratio or probability $\alpha_{\mathrm{MH}}$.
Algorithm 4.2 Metropolis-Hastings Kernel
• Sample $\theta^\star \sim q(\theta|\tilde{\theta}^{(i-1)})$
• Compute the acceptance ratio
\[
\alpha_{\mathrm{MH}} = \min\left[1, \frac{\pi(\theta^\star)}{\pi(\tilde{\theta}^{(i-1)})} \frac{q(\tilde{\theta}^{(i-1)}|\theta^\star)}{q(\theta^\star|\tilde{\theta}^{(i-1)})}\right]
\]
• Accept: with probability $\alpha_{\mathrm{MH}}$ set $\tilde{\theta}^{(i)} \leftarrow \theta^\star$
• Reject: otherwise set $\tilde{\theta}^{(i)} \leftarrow \tilde{\theta}^{(i-1)}$
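The Metropolis-Hastings kernel can be sketched for a one-dimensional target as follows. This is a minimal illustration rather than the implementation used in this thesis: the Gaussian random-walk proposal, the step size, the target, and the function name `metropolis_hastings` are assumptions. Because the random-walk proposal is symmetric, the ratio of proposal densities in the acceptance ratio cancels, and only the ratio of (unnormalized) target densities remains.

```python
import math
import random


def metropolis_hastings(log_target, n_samples, step=1.0, theta0=0.0, seed=0):
    """Random-walk Metropolis-Hastings for an unnormalized 1-D log density."""
    rng = random.Random(seed)
    theta = theta0
    samples = []
    for _ in range(n_samples):
        candidate = theta + rng.gauss(0.0, step)       # draw from q(.|theta)
        # Symmetric proposal: the q terms cancel, leaving pi(candidate) / pi(theta).
        log_ratio = log_target(candidate) - log_target(theta)
        accept_prob = math.exp(min(0.0, log_ratio))
        if rng.random() < accept_prob:
            theta = candidate                          # accept the candidate
        samples.append(theta)                          # on rejection, theta is repeated
    return samples


# Assumed target: N(2, 0.5^2), known only up to its normalizing constant.
samples = metropolis_hastings(lambda t: -0.5 * ((t - 2.0) / 0.5) ** 2, 20_000)
kept = samples[2_000:]                                 # discard an initial burn-in period
mean = sum(kept) / len(kept)
var = sum((s - mean) ** 2 for s in kept) / len(kept)
print(mean, var)
```

Working with log densities avoids numerical underflow, and the normalizing constant of the target is never required because it cancels in the ratio.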
It is not necessary for the entire parameter θ to be sampled in each step. Let θ be partitioned into D
disjoint components {θ1, θ2, . . . , θD} which may be groups (blocks) of parameters. It is permissible at each
iteration for the target pdf to be one of p(θd|θ¬d), d = 1, . . . , D, provided every component θd has some
probability of being chosen at each iteration. For the special case when we are able to sample directly from
p(θd|θ¬d), αMH = 1, that is, every candidate is accepted. This is known as the Gibbs sampler. This
typically occurs in Bayesian networks in which the conditional dependencies between the nodes are defined as standard
probability distributions. The algorithm can then be viewed in terms of passing messages between nodes,
and has been implemented in software packages such as WinBUGS and OpenBUGS.
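A standard toy example of the Gibbs sampler is a zero-mean bivariate normal with unit variances and correlation ρ, for which each full conditional is itself a normal distribution that can be sampled directly. The sketch below assumes this target; the function name and the chosen correlation are illustrative.

```python
import math
import random


def gibbs_bivariate_normal(rho: float, n_samples: int, seed: int = 0):
    """Gibbs sampling for a zero-mean, unit-variance bivariate normal with correlation rho."""
    rng = random.Random(seed)
    cond_sd = math.sqrt(1.0 - rho**2)      # standard deviation of each full conditional
    t1, t2 = 0.0, 0.0
    samples = []
    for _ in range(n_samples):
        t1 = rng.gauss(rho * t2, cond_sd)  # sample from p(theta_1 | theta_2)
        t2 = rng.gauss(rho * t1, cond_sd)  # sample from p(theta_2 | theta_1)
        samples.append((t1, t2))
    return samples


samples = gibbs_bivariate_normal(rho=0.8, n_samples=50_000)[1_000:]
cross_moment = sum(a * b for a, b in samples) / len(samples)   # should approach rho
print(cross_moment)
```

Every draw is accepted, but successive samples are correlated, and the correlation between samples grows as ρ approaches 1.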
Commonly the Markov chain is allowed to run for a burn-in period before computing statistics of the
Monte Carlo samples, as it is a commonly observed phenomenon that the Markov chain `converges' to likely
parameter settings after such a period.
Green [1995] extends MCMC to model selection problems, where the model parameters have different
dimensions. The scheme is known as Reversible Jump MCMC. An MH kernel is required to move between
models $M_j$ and $M_{j'}$ of differing dimension. The jump must be reversible in that for every accepted move
$\theta_j \to \theta_{j'}$, the reverse move $\theta_{j'} \to \theta_j$ must have positive probability of acceptance. The MH proposals are
normally designed so that $\theta_j$ and $\theta_{j'}$ share most of the parameter settings between them, and therefore a
sensible proposal normally requires that the difference between the dimensions of $\theta_j$ and $\theta_{j'}$ is small.
The MH proposal pdf is augmented to become $q(\theta_{j'}, u_{j'}|\theta_j, u_j)$ where $u_{j'}$ and $u_j$ are auxiliary random
variables that match the dimensions of the augmented parameter spaces: $D(\theta_{j'}) + D(u_{j'}) =
D(\theta_j) + D(u_j)$, where $D(\cdot)$ is the dimension of the random variable. The acceptance probability in Algorithm 4.2 is
replaced by
\[
\alpha_{\mathrm{MH}} = \min\left[1,
\frac{\pi(\theta^\star_{j'})}{\pi(\tilde{\theta}_j)}
\frac{q(\tilde{\theta}_j, u_j|\theta^\star_{j'}, u_{j'})}{q(\theta^\star_{j'}, u_{j'}|\tilde{\theta}_j, u_j)}
\left|\frac{\partial(\theta^\star_{j'}, u_{j'})}{\partial(\tilde{\theta}_j, u_j)}\right|
\right] \tag{4.20}
\]
4.2.2.2 Importance Sampling and Sequential Monte Carlo
The Metropolis-Hastings algorithm derives from acceptance-rejection sampling. Another sampling method,
importance sampling, leads to a class of algorithms known as sequential Monte Carlo, so called because they
are often suitable for sequential inference problems such as those that appear in dynamical systems and hidden Markov
models. Importance sampling requires another proposal pdf $q(\theta)$ with $q(\theta) > 0$ wherever $\pi(\theta) > 0$, known
as the importance pdf. Now, if we have a set of Monte Carlo samples $\tilde{\theta}^{(i)}$, $i = 1, \dots, N$ drawn from $q(\theta)$, we
can write the general integration problem as
\[
\int_{\theta} h(\theta)\,\pi(\theta)\,d\theta = \int_{\theta} h(\theta)\,\frac{\pi(\theta)}{q(\theta)}\,q(\theta)\,d\theta \approx \frac{1}{N} \sum_{i=1}^{N} \tilde{\omega}^{(i)} h(\tilde{\theta}^{(i)}) \tag{4.21}
\]
Each Monte Carlo sample $\tilde{\theta}^{(i)}$ is weighted by an importance weight $\tilde{\omega}^{(i)}$ which corrects for the difference between
$q(\theta)$ and $\pi(\theta)$. The importance weights are given by the importance ratio
\[
\tilde{\omega}^{(i)} = \frac{\pi(\tilde{\theta}^{(i)})}{q(\tilde{\theta}^{(i)})} \tag{4.22}
\]
The variance of the estimate depends strongly on how close $q(\theta)$ is to $\pi(\theta)$, which is usually evaluated as
the Kullback-Leibler divergence $\mathrm{KL}(q(\theta)||\pi(\theta))$. The optimal importance pdf that minimizes the variance
of the estimate is $q(\theta) \propto |h(\theta)|\pi(\theta)$, which however usually cannot be sampled from.
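The importance sampling estimate (4.21) with weights (4.22) can be illustrated with a target and proposal that are both normal distributions, so that the exact answer is known. In the sketch below the target is N(0, 1), the proposal is the wider N(0, 2²), and h(θ) = θ², whose expectation under the target is 1; these choices and the function names are assumptions made for the example.

```python
import math
import random


def normal_pdf(x: float, mu: float, sigma: float) -> float:
    return math.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * math.sqrt(2.0 * math.pi))


def importance_estimate(h, n: int, seed: int = 0) -> float:
    """Estimate the expectation of h under pi = N(0, 1) using samples from q = N(0, 2^2)."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n):
        theta = rng.gauss(0.0, 2.0)                                          # sample from q
        weight = normal_pdf(theta, 0.0, 1.0) / normal_pdf(theta, 0.0, 2.0)   # (4.22)
        total += weight * h(theta)
    return total / n                                                         # (4.21)


estimate = importance_estimate(lambda t: t * t, 100_000)
print(estimate)
```

A proposal with heavier tails than the target, as here, keeps the weights bounded; a proposal that is too narrow would produce occasional very large weights and a high-variance estimate.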
Sequential Monte Carlo methods [Doucet et al., 2001], also known as particle filtering, are algorithms
for approximate inference, typically on dynamical systems, using importance sampling. At each time $k$ we
are interested in generating Monte Carlo samples $\theta^{(i)}_{1:k}$ of the state trajectories, also known as particles. The
target posterior pdf $p(\theta_{1:k}|y_{1:k})$ can be written sequentially as
\[
p(\theta_{1:k}|y_{1:k}) = \frac{p(\theta_{1:k-1}|y_{1:k-1})\,p(\theta_k|\theta_{k-1})\,p(y_k|\theta_k)}{p(y_k|y_{1:k-1})} \tag{4.23}
\]
The normalization constant $p(y_k|y_{1:k-1})$ does not need to be computed. We also select a sequential
importance pdf
\[
q_k(\theta_{1:k}) = q_{k-1}(\theta_{1:k-1})\,q_k(\theta_k|\theta_{k-1}) = q_0(\theta_0) \prod_{k'=1}^{k} q_{k'}(\theta_{k'}|\theta_{k'-1}) \tag{4.24}
\]
so we can sample the new state at time $k$ from the proposal $q_k(\theta_k|\theta_{k-1})$ using the state trajectories found up
until time $k - 1$. The choice of the proposal pdf affects the resulting variance of the particle estimate. The
optimal proposal uses the new observation: $q_k(\theta_k|\theta_{k-1}) = p(\theta_k|\theta_{k-1}, y_k)$, but can be impossible to sample
from. Local Gaussian approximations to the optimal proposal lead to the unscented and extended Kalman
filters. The simplest proposal is the transition pdf $q_k(\theta_k|\theta_{k-1}) = p(\theta_k|\theta_{k-1})$, which results in the algorithm
called the bootstrap filter (Algorithm 4.3).
The importance weight of each particle is given as the ratio
\[
\frac{p(\theta_{1:k}|y_{1:k})}{q_k(\theta_{1:k})} \tag{4.26}
\]
However, after a few iterations, most of these weights become close to zero, and the solution is degenerate,
being represented by only a few particles. The solution is to resample the particles: particles with low
weights are moved to more accurate positions. A resampling step takes place when the efficiency number
$I_{\mathrm{eff}} = \left[\sum_{i=1}^{I} (\tilde{w}^{(i)}_k)^2\right]^{-1}$ is lower than some threshold. A simple scheme known as stratified sampling samples
such that the expected number of particles following the resampling step at $\theta^{(i)}_k$ is equal to $I\tilde{w}^{(i)}_k$. See Doucet
et al. [2001] for further details and alternative resampling schemes.
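The bootstrap filter of Algorithm 4.3 can be sketched for a scalar random-walk state observed in Gaussian noise. This is an illustration under stated assumptions, not the filter used later in this thesis: the model, its noise levels, the resampling threshold, and the function name are all hypothetical, and plain multinomial resampling is used in place of stratified sampling for brevity. With the transition pdf as proposal, the weight update (4.25) reduces to multiplying the previous weight by the observation likelihood.

```python
import math
import random


def bootstrap_filter(ys, n_particles=500, sd_trans=0.1, sd_obs=0.5,
                     ess_fraction=0.5, seed=0):
    """Filtered means for theta_k = theta_{k-1} + noise, y_k = theta_k + noise."""
    rng = random.Random(seed)
    particles = [rng.gauss(0.0, 1.0) for _ in range(n_particles)]   # theta_0 ~ p_0 = q_0
    weights = [1.0 / n_particles] * n_particles
    means = []
    for y in ys:
        # Propose each new state from the transition pdf.
        particles = [p + rng.gauss(0.0, sd_trans) for p in particles]
        # Weight update: previous weight times the likelihood p(y_k | theta_k).
        weights = [w * math.exp(-0.5 * ((y - p) / sd_obs) ** 2)
                   for w, p in zip(weights, particles)]
        total = sum(weights)
        weights = [w / total for w in weights]
        means.append(sum(w * p for w, p in zip(weights, particles)))
        # Resample when the efficiency number drops below the threshold.
        ess = 1.0 / sum(w * w for w in weights)
        if ess < ess_fraction * n_particles:
            particles = rng.choices(particles, weights=weights, k=n_particles)
            weights = [1.0 / n_particles] * n_particles
    return means


# Synthetic data drawn from the same assumed model.
data_rng = random.Random(1)
state = data_rng.gauss(0.0, 1.0)
states, ys = [], []
for _ in range(200):
    state += data_rng.gauss(0.0, 0.1)
    states.append(state)
    ys.append(state + data_rng.gauss(0.0, 0.5))

means = bootstrap_filter(ys)
rmse_filter = math.sqrt(sum((m - s) ** 2 for m, s in zip(means, states)) / len(states))
rmse_obs = math.sqrt(sum((y - s) ** 2 for y, s in zip(ys, states)) / len(states))
print(rmse_filter, rmse_obs)
```

For this linear Gaussian model the Kalman filter would give the exact answer; the particle filter is shown here only because the exact case makes its behaviour easy to check.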
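Algorithm 4.3 Bootstrap Particle Filter
• Initialize $\tilde{\theta}^{(i)}_0 \sim q_0(\theta_0)$, $\tilde{w}^{(i)}_0 = p_0(\tilde{\theta}^{(i)}_0)/q_0(\tilde{\theta}^{(i)}_0)$ for each particle
• For $k = 1, \dots, K$
  ◦ Update particle trajectories $\tilde{\theta}^{(i)}_k \sim q_k(\theta_k|\tilde{\theta}^{(i)}_{k-1})$
  ◦ Compute particle weights
\[
\tilde{w}^{(i)}_k \propto \tilde{w}^{(i)}_{k-1}\,\frac{p(\tilde{\theta}^{(i)}_k|\tilde{\theta}^{(i)}_{k-1})\,p(y_k|\tilde{\theta}^{(i)}_k)}{q_k(\tilde{\theta}^{(i)}_k|\tilde{\theta}^{(i)}_{k-1})} \tag{4.25}
\]
    and normalize: $\tilde{w}^{(i)}_k \leftarrow \tilde{w}^{(i)}_k / \sum_{i'=1}^{I} \tilde{w}^{(i')}_k$
  ◦ Resample if necessary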
4.2.3 Variational Methods
Variational methods are an alternative, deterministic approach to making approximate posterior estimates.
Define $p(y, \theta)$ as the joint distribution when $y$ is observed, with normalizing constant $Z_y$. We wish to compute
the posterior
\[
p(\theta|y) = \frac{1}{Z_y}\,p(y, \theta) \tag{4.27}
\]
\[
Z_y = \int p(y, \theta)\,d\theta \tag{4.28}
\]
The Monte Carlo techniques described in Section 4.2.2 can be utilized to draw samples from $p(y, \theta)/Z_y$.
However, in practice, the integral can often be more quickly approximated by the structured mean field
method, also known as variational Bayes. The integrand $P = p(y, \theta)/Z_y$ is approximated with a simpler
distribution $Q$ such that the integral in (4.28) is tractable. A common factorization involves partitioning the
parameters into $D$ disjoint components, $\theta_1, \dots, \theta_D$ such that
\[
Q = \prod_{d=1}^{D} Q_d(\theta_d) \tag{4.29}
\]
The mean field method minimizes the KL divergence $\mathrm{KL}(Q||P)$. Due to the non-negativity of the KL
measure, we then obtain a lower bound on the normalizing constant
\[
\log Z_y \geq \langle \log p(y, \theta) \rangle_Q - \langle \log Q \rangle_Q \tag{4.30}
\]
The second term of (4.30) is the entropy: $H[Q] \equiv -\langle \log Q \rangle_Q$. See Chapter A for expressions for the entropy
of probability distributions in the exponential family. The factors $Q_d$ obey the fixed point equation
\[
Q_d \propto \exp\left(\langle \log p(y, \theta) \rangle_{Q_{\neg d}}\right) \tag{4.31}
\]
which can be computed easily if all the factor distributions are chosen to be in a conjugate-exponential
family. For example, if a random variable $\sigma^2$ has sufficient statistics $\langle 1/\sigma^2 \rangle_{\mathrm{IG}}$ and $\langle \log \sigma^2 \rangle_{\mathrm{IG}}$ under an
Inverse-Gamma distribution, and $y \sim \mathcal{N}(0, \sigma^2)$ then
\[
Q_y \propto \exp\left(-\tfrac{1}{2} \langle 1/\sigma^2 \rangle\, y^2 - \tfrac{1}{2} \log 2\pi - \tfrac{1}{2} \langle \log \sigma^2 \rangle\right) \tag{4.32}
\]
and the VB update is $\langle y \rangle_{\mathcal{N}} = 0$, $\langle y^2 \rangle_{\mathcal{N}} = \langle 1/\sigma^2 \rangle^{-1}$ as expected. See Chapter A for expressions of the
sufficient statistics of some probability distributions in the exponential family. The fixed point equation
(4.31) has the property that at every iteration the lower bound (4.30) is guaranteed not to decrease.
Variational Bayes and the Gibbs sampler have been compared for inference in audio signal models in Godsill
et al. [2007] and Cemgil et al. [2007]. Qualitatively, VB methods tend to converge more quickly than the Gibbs sampler,
but may result in a poorer solution because only a lower bound on the marginal likelihood is being computed.
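The fixed point iteration (4.31) can be illustrated with the standard textbook example of a normal distribution with unknown mean and precision (treated, for instance, in Bishop [2006]). The sketch below assumes a factorized approximation Q(µ)Q(τ) with a normal-gamma prior, where τ = 1/σ² is the precision; this parametrization, the weak prior settings, and the function name are assumptions for illustration and differ from the Inverse-Gamma notation used above.

```python
import random


def vb_gaussian(y, mu0=0.0, lam0=1e-3, a0=1e-3, b0=1e-3, n_iter=50):
    """Mean-field VB for y_i ~ N(mu, 1/tau), mu | tau ~ N(mu0, 1/(lam0 tau)), tau ~ Gamma(a0, b0).

    Returns the parameters of Q(mu) = N(mu_n, 1/lam_n) and Q(tau) = Gamma(a_n, b_n).
    """
    n = len(y)
    ybar = sum(y) / n
    mu_n = (lam0 * mu0 + n * ybar) / (lam0 + n)     # fixed across iterations
    a_n = a0 + 0.5 * (n + 1)                        # fixed across iterations
    e_tau = 1.0                                     # initial guess for <tau>
    for _ in range(n_iter):
        lam_n = (lam0 + n) * e_tau                  # update Q(mu) given <tau>
        # <sum_i (y_i - mu)^2 + lam0 (mu - mu0)^2> under Q(mu)
        expected_sq = (sum((v - mu_n) ** 2 for v in y) + n / lam_n
                       + lam0 * ((mu_n - mu0) ** 2 + 1.0 / lam_n))
        b_n = b0 + 0.5 * expected_sq                # update Q(tau) given Q(mu)
        e_tau = a_n / b_n
    return mu_n, lam_n, a_n, b_n


# Synthetic data (assumed): mean 3, standard deviation 2, so the true precision is 0.25.
rng = random.Random(0)
data = [rng.gauss(3.0, 2.0) for _ in range(1000)]
mu_n, lam_n, a_n, b_n = vb_gaussian(data)
print(mu_n, a_n / b_n)
```

Each update only changes the expected sufficient statistics of one factor given those of the other, which is why conjugate-exponential choices make the iteration cheap.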
Chapter 5
A Signal Model for Pitched Musical
Instruments
We consider the modelling of pitched musical instruments as a summation of several sinusoids with correlated
frequency, amplitude and phase. Background noise and modelling error are treated as Gaussian noise, which
makes a probabilistic treatment desirable. We introduce appropriate priors for the model parameters, which
are chosen to reflect prior knowledge of the structure of pitched musical note signals, and to allow effective
numerical Bayesian inference. The model we introduce is shown to be capable of modelling both frequency
and amplitude modulations, which are characteristic of the sound of many musical instruments. The use of
a Bayesian methodology allows model selection to be carried out implicitly, so that the relevant number of
sinusoids necessary to model the signal appropriately may be determined automatically.
5.1 Contributions
The motivation for this chapter is to extend and further develop promising Bayesian generative models based
on a sinusoidal representation for each partial frequency, and present the developments in such a way that
they can be incorporated easily into and compared with existing approaches.
In 5.2.2 we formulate the mathematical representation of a sinusoid for which the amplitude and phase
are permitted to vary slowly in comparison to the central frequency. This motivates the use of the analytic
representation in 5.2.3, which eliminates ambiguity in the sinusoidal representation, and we show that this
analytic representation is appropriate to be applied to existing Bayesian models using sinusoidal representations.
We then describe a state-space representation with constant damping ratio in 5.2.4, showing that the
analytic representation can result in a closed form posterior distribution for the damping ratio and frequency,
which does not typically arise in the literature, and may be used to linearize the posterior distribution under
arbitrary choices of frequency priors. In 5.2.5 we apply the analytic representation to Gabor models, and
additionally motivate the use of sinc basis functions to specify and control the bandwidth of frequency and
amplitude modulations in the signal. The use of sinc basis functions may be incorporated into the existing
methods independently of whether the analytic representation is used or not.
In Section 5.3 we describe the literature for Bayesian inference in sinusoidal models and a noise model
that may be used. We then derive MCMC algorithms for a fixed number of partials with an arbitrary prior
on partial frequencies for both the state-space and Gabor models. For the state-space model in particular,
we derive the posterior distribution of the frequencies and damping ratios under a normally distributed prior,
and show how this can be adapted as an efficient, model-based, proposal distribution when the prior is not
a normal distribution.
5.2 Model for an Isolated Partial
In this section we consider how to model the signal of an isolated partial frequency. The focus of this
section is to introduce various representations of a sinusoidal signal and how the amplitude envelope and
modulations around the central frequency can be handled. In the following section, we will consider the
superposition of multiple sinusoids and embed the entire model in a probabilistic framework for Bayesian
inference.
5.2.1 Motivation
An isolated partial is a minimal description of a musical note. When we listen to an isolated partial, we have
the clear perception of the pitch and volume of a musical note. The pitch is related to the frequency of the
partial, although the perception of pitch itself is non-linear (2.2.2). Some degree of frequency modulation
is tolerated, being perceived as vibrato. The timbre of an isolated partial is perceived as the purest tone,
as there is no harmonic structure and therefore no possibility of inharmonicity. Musical instruments which
may be modelled by isolated partials include whistles, tuning forks and rubbing crystal glasses.
The model we introduce is parametric in that the sinusoid is completely described by its frequency, phase
and amplitude envelope; and linear so that we may superimpose multiple sinusoids in order to generate
more complex musical tones. We require a comprehensive understanding of the parameters of even such
a simple model, because we will pursue the intuition that the perceptual grouping of partials into musical
notes (rather than being perceived as separate frequencies) is due to the group of partials having a shared
set of parameters.
5.2.2 Amplitude and Phase Modulation
In this chapter we will consider a segment of audio data with N samples and time indices t = 0, . . . , N − 1,
where it is assumed that a set of multiple pitches are sounding throughout the length of the segment. We will
begin by modelling an isolated partial as a sinusoid x [t] having a constant angular frequency ω, amplitude
envelope c [t] and time-varying phase φ [t]:
x [t] = c [t] cos [ωt+ φ [t]] (5.1)
For this model to be realistic, we require constraints on the amplitude envelope and phase modulation. The
amplitude envelope is used to model changes in the perceived volume of the partial. Hence the bandwidth of
the envelope should be restricted to below the lower limit of hearing (20Hz); otherwise the frequency content of the
envelope will be perceived as an additional pitch. The ear is relatively insensitive to the phase of a pitched
note, but phase modulations may be perceived as frequency modulations, as the modulation in frequency
around ω is given by the time derivative of φ [t]. Vibrato (see 2.3.4) is common in many musical genres and
instruments, and results in both frequency and amplitude modulations of the note. Experimental studies
by Brown and Vaughn [1996] have shown that the perceived pitch centre of a vibrato note is still equivalent
to the centre frequency ω. The permissible amount of frequency modulation is governed by stylistic rules;
however, a useful guideline is that the depth of vibrato should not cross the frequency boundary of the pitch,
which in Western music is a semitone. The speed and depth of vibrato are also limited by the mechanical
process which creates the vibrato effect, for example in the case of a violin, the rocking of the violinist's
finger on the string.
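The model (5.1) can be sketched numerically as follows. This is a minimal example under stated assumptions: the sampling rate of 8000 Hz, the sinusoidal shapes chosen for c [t] and φ [t], and the function name are hypothetical, while the modulation depths and speeds are borrowed from the synthetic example discussed in 5.2.5.

```python
import math


def modulated_partial(n_samples, fs, freq_hz, vib_depth_hz, vib_rate_hz,
                      am_depth, am_rate_hz):
    """Sample x[t] = c[t] cos(w t + phi[t]) with sinusoidal vibrato and tremolo."""
    omega = 2.0 * math.pi * freq_hz / fs  # angular frequency in radians per sample
    x = []
    for t in range(n_samples):
        sec = t / fs
        # Amplitude envelope c[t]: a slow modulation around 1.
        c = 1.0 + am_depth * math.sin(2.0 * math.pi * am_rate_hz * sec)
        # Phase phi[t]: its time derivative gives a frequency deviation of
        # vib_depth_hz * cos(2 pi vib_rate_hz * sec) in Hz.
        phi = (vib_depth_hz / vib_rate_hz) * math.sin(2.0 * math.pi * vib_rate_hz * sec)
        x.append(c * math.cos(omega * t + phi))
    return x


# 1 second at an assumed 8 kHz sampling rate: 440 Hz centre, 5 Hz vibrato
# of depth 5 Hz, and amplitude modulation of magnitude 0.2 at 5 Hz.
signal = modulated_partial(8000, 8000.0, 440.0, 5.0, 5.0, 0.2, 5.0)
```

Both modulations here have a bandwidth well below 20Hz, so the envelope and the phase vary slowly compared with the carrier, as the constraints above require.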
5.2.3 Analytic Representation of Sinusoidal and Noise Signals
In (5.1) there is ambiguity in the definitions of c [t] and φ [t]. Different choices can result in an identical
signal x [t]. To overcome this Gabor [1946] defines the instantaneous amplitude and instantaneous phase
using the analytic representation of a signal. This has become the conventional definition, and reduces the
ambiguity (see Cohen et al. [1999] for cases where the ambiguity still exists). An analytic signal is a complex
valued signal with no negative frequency components in its Fourier spectrum. The analytic representation of
a real valued signal is produced by discarding the negative frequency components of the Fourier transform.
There is no loss of information as the Fourier spectrum of a real signal has Hermitian symmetry around zero
frequency.
The analytic representation xa [t] of a real-valued signal x [t] is given by
xa [t] ≡ x [t] + iH [x [t]] (5.2)
where H denotes the Hilbert transform. The Hilbert transform shifts the phase of negative frequency
components of the Fourier spectrum by +π/2 and the positive frequency components by −π/2. Thus the
operation in (5.2) discards the negative frequency components. The original signal may be simply recovered
from the real part of the analytic signal:
x [t] = ℜ (xa [t])
The instantaneous amplitude of the analytic representation is defined as |xa [t]| and the instantaneous
phase is defined as arg xa [t]. For the model of the isolated sinusoid (5.1) we have
xa [t] = c [t] cos [ωt+ φ [t]] + ic [t] sin [ωt+ φ [t]]
= c [t] exp i [ωt+ φ [t]]
from which we can see that the instantaneous amplitude is c [t] and the instantaneous phase is given by
ωt+ φ [t].
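The construction in (5.2) can be sketched in the frequency domain by discarding the negative frequency bins of a discrete Fourier transform. The example below is a minimal illustration, not the procedure used in this thesis: it assumes an even frame length, uses a naive O(N²) DFT to remain self-contained, and all function names are hypothetical.

```python
import cmath
import math


def dft(x, inverse=False):
    """Naive O(N^2) discrete Fourier transform (adequate for short examples)."""
    n = len(x)
    sign = 1.0 if inverse else -1.0
    out = []
    for k in range(n):
        acc = 0j
        for t in range(n):
            acc += x[t] * cmath.exp(sign * 2j * math.pi * k * t / n)
        out.append(acc / n if inverse else acc)
    return out


def analytic_signal(x):
    """Analytic representation of a real signal of even length N."""
    n = len(x)
    spectrum = dft(x)
    for k in range(n):
        if 0 < k < n // 2:
            spectrum[k] *= 2.0      # positive frequencies are doubled
        elif k > n // 2:
            spectrum[k] = 0j        # negative frequencies are discarded
        # k = 0 (DC) and k = N/2 (Nyquist) are left unchanged
    return dft(spectrum, inverse=True)


N = 64
omega = 2.0 * math.pi * 5 / N        # five cycles in the frame
x = [0.8 * math.cos(omega * t + 0.3) for t in range(N)]
xa = analytic_signal(x)
inst_amplitude = [abs(v) for v in xa]          # |xa[t]|
inst_phase = [cmath.phase(v) for v in xa]      # arg xa[t], wrapped to (-pi, pi]
```

For this constant-amplitude test tone the instantaneous amplitude is the constant 0.8 and the real part of xa [t] reproduces x [t], in line with the definitions above.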
We will find it convenient to use the analytic representation for our models, and will particularly use the
following form
xa [t] = c [t] exp [iφ [t]] exp [iωt] (5.3)
as the three parameters of the sinusoid are thus separated. The analytic signal is composed of three separate
signals multiplied (modulated) together. Using communications terminology, c [t] is the amplitude waveform,
exp [iφ [t]] is the phase waveform, and exp [iωt] is the carrier signal with frequency ω. To be able to transmit
and recover the original waveforms from the modulated signal, it is necessary for the bandwidths of the
amplitude and phase waveforms to be much smaller than ω. We will use the same concept for modelling
musical signals, as extracting and storing these low bandwidth waveforms is attractive for compression,
reconstruction and synthesis.
The model which we develop in this chapter is based on existing work on methods for sinusoidal models
[Serra, 1997, Walmsley et al., 1999, Davy and Godsill, 2003, etc.], and it is necessary for us to confirm that
the properties of these models are consistent with the analytic representation of the signal. The Hilbert
transform is a linear operator, so the frequencies and amplitudes of the sinusoids are preserved. Moreover
Picinbono and Bondon [1997] show that the analytic representation of a wide-sense stationary real signal is
proper, or circularly symmetric [Neeser and Massey, 1993]. Hence the analytic representation of a white noise
process is a complex Gaussian random variable, and the properties of an autoregressive (AR) process used
to model coloured noise are retained in the analytic representation.
In 5.2.4 and 5.2.5, we consider two common formulations of the sinusoidal signal (5.3) which are used
in state-of-the-art Bayesian harmonic models. Our contribution here is to apply these formulations to
the analytic representation of the signal, and to demonstrate how bandwidth constraints on frequency and
amplitude modulations may be naturally and practically applied.
5.2.4 State-Space Formulation
The first formulation treats the sinusoid as a rotating phasor, and is motivated by the work of Cemgil et al.
[2006] who use a state-space approach to model the rotation of a real-valued sinusoid from one sample to
the next. This approach was used in a polyphonic transcription system capable of resolving note onsets and
offsets to sample resolution. Moreover as the notes are processed sample by sample and not on a frame by
frame basis there are no artifacts arising from reconstruction and synthesis due to phase discontinuities and
discrepancies at frame boundaries.
In our model, the relationship between one sample of the sinusoid and the next is given by
$$\begin{aligned}
x_a[t+1] &= c[t+1] \exp[i\phi[t+1]] \exp[i\omega(t+1)] \\
&= c[t+1] \exp[i\phi[t+1]] \exp[i\omega t] \exp[i\omega] \\
&= \frac{c[t+1]}{c[t]} \frac{\exp[i\phi[t+1]]}{\exp[i\phi[t]]} \exp[i\omega]\, c[t] \exp[i\phi[t]] \exp[i\omega t] \\
&= \frac{c[t+1]}{c[t]} \exp[i\phi[t+1] - i\phi[t]] \exp[i\omega]\, x_a[t] \qquad (5.4)
\end{aligned}$$
The $c[t+1]/c[t]$ term in (5.4) gives the rise or decay in the amplitude envelope between t and t+ 1. Following
Cemgil et al. [2006] we refer to this as the damping ratio, and define $\rho[t] \equiv c[t+1]/c[t]$. We will apply a constraint
on the amplitude envelope by choosing to make this damping ratio constant throughout the segment of
audio: ρ [t] = ρ for all t = 0, . . . , N − 1. This is appropriate for musical instruments such as the piano and
guitar, where after the onset of the note (when the string is struck by a hammer or plucked) the decay of
the energy in each partial can be approximately described as exponential, 0 < ρ < 1. It is also appropriate
for notes held at a constant volume, where ρ = 1. For other situations where other shapes of amplitude
envelope would be expected, the Gabor model described in the next section is more appropriate.
The exp [iφ [t+ 1]− iφ [t]] term in (5.4) is the difference in the phase modulations, which can be seen as
an approximation to the frequency modulation at t. Realistic frequency modulations should be small, hence
we approximate using the Taylor expansion:
exp [iφ [t+ 1]− iφ [t]] ≈ 1 + (iφ [t+ 1]− iφ [t]) (5.5)
The second term in (5.5) which we will denote as f [t] ≡ iφ [t+ 1] − iφ [t] is small, and is purely imaginary,
as the phase φ [t] is real for all t. However for convenience, we will model this frequency modulation term as
a zero mean complex Gaussian random variable with small variance σ2f , i.e.,
$$p(f[t]) = \mathcal{N}_C(0, \sigma_f^2)$$
The consequence of f [t] having a real part is that small amplitude modulations in addition to the damping
ratio ρ are permitted. As stated in 5.2.2, frequency modulations in a musical note are often accompanied by
amplitude modulations, hence we do not consider this inconsistency in our model a disadvantage.
When we incorporate the above constraints into (5.4) we have
xa [t+ 1] = ρ exp [iω]xa [t] + f [t] (5.6)
or, alternatively, expressed as a conditional probability distribution:
$$p(x_a[t+1] \mid x_a[t], \rho, \exp[i\omega], \sigma_f^2) = \mathcal{N}_C(\rho \exp[i\omega]\, x_a[t],\ \sigma_f^2) \qquad (5.7)$$
(5.6) and (5.7) show us that xa [t] can be regarded as the internal state of a linear dynamic system. This
fact was used in Cemgil et al. [2006] where the parameters ρ, ω and σ2f were known, and the Kalman filter
used to infer the sinusoid in the presence of observation noise. In 5.3.3 we will show that these parameters
may be treated as unknown and Bayesian inference can be used to estimate them.
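To make the recursion (5.6) concrete, the sketch below simulates a single partial as a rotating, decaying phasor driven by circular complex Gaussian noise. The parameter values, the seed and the function names are assumptions chosen for illustration only.

```python
import cmath
import math
import random


def complex_normal(variance, rng):
    """Draw from a circular complex Gaussian NC(0, variance)."""
    s = math.sqrt(variance / 2.0)   # half of the variance in each of the real and imaginary parts
    return complex(rng.gauss(0.0, s), rng.gauss(0.0, s))


def simulate_partial(n_samples, rho, omega, var_f, x0=1 + 0j, seed=0):
    """Simulate xa[t+1] = rho * exp(i omega) * xa[t] + f[t], as in (5.6)."""
    rng = random.Random(seed)
    a = rho * cmath.exp(1j * omega)
    x = [x0]
    for _ in range(n_samples - 1):
        x.append(a * x[-1] + complex_normal(var_f, rng))
    return x


clean = simulate_partial(100, rho=0.99, omega=0.3, var_f=0.0)
noisy = simulate_partial(100, rho=0.99, omega=0.3, var_f=1e-4)
observed = [v.real for v in noisy]   # the real part gives the real-valued partial
```

With var_f set to zero the magnitude decays exactly as ρ^t and the phase advances by ω per sample; a small positive var_f perturbs both, which is how the model admits slight frequency and amplitude modulations.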
5.2.5 Gabor Model
A model for slowly varying partial amplitudes was introduced by Godsill and Davy [2002] as an extension of
the existing harmonic model of Walmsley et al. [1999] which assumed that the amplitude of each partial is
constant throughout the note segment. Each partial is projected onto a set of Gabor functions ψi,ω [t],
each of which has a fixed real-valued envelope ψ [t], symmetric around t = 0 and having a finite region of
support, shifted in time by i∆ and modulated by frequency ω equal to the frequency of the partial:
ψi,ω [t] = ψ [t− i∆] exp [iωt]
The constant ∆ is the difference, in samples, between the centres of adjacent basis functions and controls
the spacing along the time axis between neighbouring Gabor atoms. ∆ is chosen such that the support of
each function overlaps with the next, thus ensuring that the amplitude envelope varies smoothly throughout
the length of the note segment.
The Gabor model applied to the analytic representation of a signal is the projection of xa [t] onto I + 1
Gabor functions, where ∆(I + 1) is equal to the length N of xa [t]:
$$x_a[t] = \sum_{i=0}^{I} b_i\, \psi[t - i\Delta] \exp[i\omega t] \qquad (5.8)$$
The complex-valued basis coefficients bi may be viewed as the amplitude of each Gabor function ψ [t− i∆] exp [iωt].
The amplitude envelope as modelled, comparing with (5.3), is given by
$$c[t] \exp[i\phi[t]] = \sum_{i=-\infty}^{\infty} b_i\, \psi[t - i\Delta] \qquad (5.9)$$
and is also used to account for frequency modulations in our model. From a signal processing perspective,
the envelope would be obtained by low-pass filtering the partial to remove the frequency component at ω of
the spectrum. If we were to select the envelope of the Gabor function as the sinc function
$$\psi[t] = \frac{\sin[2\pi t/\Delta]}{2\pi t/\Delta} \qquad (5.10)$$
then the bandwidth of the amplitude envelope (5.9) is constrained to 1/∆. This result is based on the use
of the sinc filter for perfect reconstruction of bandlimited signals [Shannon, 1998].
Godsill and Davy [2002] use a Hamming window as the envelope for the Gabor basis functions. When we
use a sinc function (5.10) to model sinusoids with periodic amplitude and frequency modulations embedded in
white noise, we have found that the residual of the modelling is smaller than when using Hamming windows,
and the reconstruction sounds better. Figure 5.1 on page 61 compares the reconstructions obtained of a
sinusoid by sinc and Hamming basis functions. The sinusoid has a central frequency of 440Hz, with a
frequency modulation of depth 5Hz and speed 5Hz, and amplitude modulation of magnitude 0.2 and speed
5Hz. Ten basis functions were used to cover the entire signal length of 1 second, hence modulations up to
10Hz can be captured. Both basis functions model the spectrum well around the central frequency, but the
sinc basis model has a smaller residual and fewer reconstruction artefacts away from the central frequency.
In practice we limit the support of the sinc function to 4∆ i.e.,
$$\psi[t] = \begin{cases} \dfrac{\sin[2\pi t/\Delta]}{2\pi t/\Delta} & |t| \le 4\Delta \\ 0 & |t| > 4\Delta \end{cases}$$
as the amplitude of the envelope is small outside the central region.
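The truncated sinc envelope and the synthesis equation (5.8) can be sketched as follows. This is an illustrative example only: the coefficient values, ∆ = 100 samples and the function names are hypothetical.

```python
import cmath
import math


def sinc_envelope(t, delta, support=4):
    """Truncated sinc envelope psi[t] of (5.10), zero outside |t| <= support * delta."""
    if abs(t) > support * delta:
        return 0.0
    if t == 0:
        return 1.0                       # limit of sin(u) / u as u -> 0
    u = 2.0 * math.pi * t / delta
    return math.sin(u) / u


def gabor_partial(coeffs, omega, delta, n_samples):
    """Synthesize xa[t] = sum_i b_i psi[t - i delta] exp(i omega t), as in (5.8)."""
    out = []
    for t in range(n_samples):
        env = sum(b * sinc_envelope(t - i * delta, delta) for i, b in enumerate(coeffs))
        out.append(env * cmath.exp(1j * omega * t))
    return out


delta = 100
b = [1.0, 0.8 + 0.1j, 0.9, 1.1 - 0.1j]      # hypothetical complex coefficients b_i
xa = gabor_partial(b, omega=0.3, delta=delta, n_samples=delta * len(b))
```

The envelope defined by (5.10) vanishes at the centres of all other atoms, so at t = i∆ the modelled envelope equals b_i up to rounding; between the centres the overlapping atoms interpolate smoothly.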
5.3 Probabilistic Model for Multiple Partials
In this section we will combine multiple instances of the models for isolated partials described in 5.2.2 and
embed them in observation noise. The combined model may then be used for estimating the spectrum of
musical signals. We will adopt a Bayesian approach throughout, and our goal is to jointly infer the number
of partials and their frequencies and amplitudes through Bayesian model selection.
[Figure 5.1 plot: log spectrum against Frequency / Hz, 400 to 500 Hz; curves: Original signal, Sinc basis reconstruction, Hamming basis reconstruction]
Figure 5.1: Comparison of the reconstructions of a violin note with fundamental frequency 440Hz and played
with vibrato, using sinc and Hamming basis functions. The sinc basis reconstruction has a smooth spectral
shape matching the original signal, whereas the reconstruction using the Hamming basis has a periodic
artefact resulting from difficulties modelling the frequency and amplitude modulations in the signal.
5.3.1 Background
Full Bayesian inference of a sinusoidal model with noise was first carried out by Andrieu and Doucet [1999]
using a reversible jump MCMC scheme (4.2.2.1). A short frame of samples is modelled by a set of constant
amplitude sinusoids in white noise, and the inference scheme is shown to correctly and robustly estimate the
number of sinusoids present even at low signal-to-noise ratios. The conditional distribution of the frequencies
of the sinusoids is not however of a form which can be sampled easily. By this we mean that p (ω|y, θ) where
ω is the set of sinusoid frequencies, y is the observed data, and θ denotes the remaining model parameters,
is not of a standard form for which a sampling algorithm is known. Two Metropolis-Hastings proposal
schemes are suggested for updating the frequencies of the sinusoids from ω(i) in iteration i of the algorithm,
to ω(i+1). The first is a local proposal which generates candidates ω′ from a Gaussian distribution with
mean ω(i) and small variance. This allows the frequencies to be estimated to a high precision. The second
is a global proposal which generates candidates ω′ independently from ω(i), with probability proportional to
the Fourier spectrum. This allows the Markov chain to explore new regions of the spectrum. Andrieu and
Doucet [1999] provide the probabilities at which to accept each proposal, and also describe birth and death
moves for sinusoids in the reversible jump framework, so that the number of sinusoids can be estimated.
Walmsley et al. [1999] extend the above model to harmonic signals, where the frequencies of each partial
are set to integer multiples of the fundamental frequency. Notes in the model are turned on and off using
binary indicator variables. The global proposal is facilitated by a harmonic transform which functions in
a similar way to a comb filter (2.2.2), incorporating the energy of the higher-order harmonics into the
fundamental and low-order harmonics. An additional proposal is designed to allow the inference algorithm
to explore octave errors.
Godsill and Davy [2002] further extend the model so that each sinusoid is modelled by a set of Gabor
basis functions, as outlined in 5.2.5. Inharmonicity is introduced into the model, originally as an additive
term, then as a multiplicative term in Godsill and Davy [2005] (3.3). Reversible jump MCMC is again
used, and a range of moves are proposed to explore the high dimensional model space fully: note births and
deaths, adding and subtracting variable numbers of harmonics from each note, and multiplying or dividing
the fundamental frequency by a factor of two to explore octave errors.
Cemgil et al. [2006] use the state-space representation outlined in 5.2.4 in a polyphonic transcription
system. The partial frequencies are fixed to MIDI specification frequencies (2.3) and the amplitude envelopes
of the notes are fixed. The note onsets and offsets are inferred using a pruning algorithm.
In this section we will use much of the prior structure that has been developed by the above authors, and
apply it to the analytic representation of the signal with the constraints on the amplitude and frequency
modulation as described in Section 5.2. The contributions made in this section are improvements to and
developments of the state-of-the-art inference algorithms for these models. For the state-space model, the
posterior distribution of the frequency parameters under a normal distribution prior is available in closed
form, and this fact is used to derive a Gibbs sampler for the normal distribution prior. For an arbitrary
prior distribution on the frequency parameters a Metropolis-Hastings MCMC algorithm is derived using a
linearization of the posterior distribution as an efficient proposal distribution. This allows the state-space
representation to be used to accurately infer frequencies using a rich prior model for inharmonicity, as will
be demonstrated in Section 5.5. Prior to this work, inference on the state-space model was restricted to a
fixed grid of frequencies [Cemgil et al., 2006]. For the Gabor model, the contribution is the derivation of
the posterior mode of a signal-to-noise ratio hyperparameter, allowing this parameter to be inferred from a
marginalized distribution, thus improving estimation and eliminating the computation required to simulate
latent parameters.
5.3.2 Noise Model
In this chapter we have chosen to use a white noise model. As the sinusoidal models we consider here are
linear, it is straightforward to model coloured noise sources using an autoregressive (AR) process. For the
state-space model, the extension required is straightforward as the AR model is itself commonly expressed
as a state-space model. For the extensions required to the Gabor model, see Godsill and Davy [2002].
For the remainder of this chapter, we drop the subscript a denoting that xa [t] is an analytic representation,
and work with M partials which we denote xm [t]. The white noise process is denoted n [t] and has variance
σ2n. The signal we observe is denoted y [t] and is given by
$$y[t] = \sum_{m=1}^{M} x_m[t] + n[t] \qquad (5.11)$$
We choose the prior distribution of σ2n to be inverse-Gamma
$$p(\sigma_n^2) = \mathcal{IG}(\sigma_n^2;\ \alpha_n, \beta_n)$$
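such that the conditional distribution of σ2n, given $\mathbf{n} \equiv [n[0], \ldots, n[N-1]]^\top$, is
$$p(\sigma_n^2 \mid \mathbf{n}) = \mathcal{IG}\left( \sigma_n^2;\ \alpha_n + \frac{N}{2},\ \beta_n + \frac{1}{2} \mathbf{n}^\top \mathbf{n} \right) \qquad (5.12)$$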
A common setting is αn = βn = 0 such that
$$p(\sigma_n^2) \propto \frac{1}{\sigma_n^2}$$
This prior is invariant to arbitrary scaling of the observed signal, and has a maximum entropy interpretation
[Jeffreys, 1946].
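The conditional (5.12) can be sampled with a standard Gamma generator, since the reciprocal of a Gamma variate is inverse-Gamma distributed. The sketch below assumes the shape and scale parametrization of the inverse-Gamma used in (5.12); the synthetic residual and the function name are hypothetical.

```python
import random


def sample_noise_variance(residual, alpha_n=0.0, beta_n=0.0, rng=random):
    """Draw sigma_n^2 from IG(alpha_n + N/2, beta_n + n'n / 2), as in (5.12)."""
    shape = alpha_n + 0.5 * len(residual)
    scale = beta_n + 0.5 * sum(abs(v) ** 2 for v in residual)
    # If g ~ Gamma(shape, rate=scale) then 1/g ~ IG(shape, scale).
    # random.gammavariate takes the Gamma scale, hence the 1/scale argument.
    return 1.0 / rng.gammavariate(shape, 1.0 / scale)


rng = random.Random(1)
n = [rng.gauss(0.0, 0.1) for _ in range(2000)]     # synthetic residual, true variance 0.01
draws = [sample_noise_variance(n, rng=rng) for _ in range(200)]
posterior_mean = sum(draws) / len(draws)
```

With αn = βn = 0 the draws concentrate around the empirical power of the residual, which is the behaviour expected from the scale-invariant prior.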
The structure of the signal model depends on the parametrization that we have chosen.
5.3.3 State-Space Formulation
For the state-space formulation, each partial xm [t] has an unknown damping ratio ρm, and frequency ωm.
From (5.7) and (5.11) the model is
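$$p(x_m[t+1] \mid x_m[t], \rho_m, \omega_m, \sigma_f^2) = \mathcal{N}_C\left( x_m[t+1];\ \rho_m \exp[i\omega_m]\, x_m[t],\ \sigma_f^2 \right) \qquad (5.13)$$
$$p(y[t] \mid x_1[t], \ldots, x_M[t], M, \sigma_n^2) = \mathcal{N}_C\left( y[t];\ \sum_{m=1}^{M} x_m[t],\ \sigma_n^2 \right)$$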
(5.13) is a linear dynamical system, with an unobserved state vector $\mathbf{x}_t = [x_1[t], \ldots, x_M[t]]^\top$ at time t,
diagonal M ×M state transition matrix A with elements ρm exp [iωm] along the diagonal, process noise
covariance matrix σ2fIM , observation model H which is a 1×M vector with all elements equal to one, and
observation noise variance σ2n:
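$$p(\mathbf{x}_{t+1} \mid \mathbf{x}_t, \mathbf{A}, \sigma_f^2) = \mathcal{N}(\mathbf{x}_{t+1};\ \mathbf{A}\mathbf{x}_t,\ \sigma_f^2 \mathbf{I}_M) \qquad (5.14)$$
$$p(y[t] \mid \mathbf{x}_t, \sigma_n^2) = \mathcal{N}(y[t];\ \mathbf{H}\mathbf{x}_t,\ \sigma_n^2)$$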
This is in a standard form for inferring the marginal distribution of the state vector xt at each time t given
the entire signal y [0] , . . . , y [N − 1]:
$$p\left( \mathbf{x}_t \mid y[0], \ldots, y[N-1], \{\rho_m, \omega_m\}_{m=1,\ldots,M}, \sigma_f^2, \sigma_n^2 \right)$$
using the Kalman filtering and smoothing recursions. The only remaining requirement is that a multivariate
normal prior p (x0) be specified as the initial condition of the state vector.
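For intuition, the filtering half of these recursions can be written out for a single partial (M = 1), where the state, transition and observation model are all scalars. This is a simplified sketch under that assumption; it omits the smoothing pass and the multi-partial case, and the function name and parameter values are hypothetical.

```python
import cmath


def kalman_filter_single_partial(y, a, var_f, var_n, m0=0j, p0=1.0):
    """Kalman filter for x[t+1] = a x[t] + f[t], y[t] = x[t] + n[t] with M = 1.

    Returns the filtered means and variances of x[t] given y[0..t].
    """
    means, variances = [], []
    m, p = m0, p0
    for t, obs in enumerate(y):
        if t > 0:
            # Prediction through the state transition a = rho * exp(i omega).
            m = a * m
            p = abs(a) ** 2 * p + var_f
        # Measurement update with observation model H = 1.
        gain = p / (p + var_n)
        m = m + gain * (obs - m)
        p = (1.0 - gain) * p
        means.append(m)
        variances.append(p)
    return means, variances


a = 0.995 * cmath.exp(0.2j)
truth = [a ** t for t in range(50)]            # noiseless partial starting at 1
means, variances = kalman_filter_single_partial(truth, a, var_f=1e-6, var_n=1e-2)
```

The prior p (x0) enters through m0 and p0. In the general case the same two steps apply with the diagonal matrix A and the all-ones row vector H in place of the scalars.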
The unknown parameters for the state-space model appear together as a complex number am ≡ ρm exp [iωm].
We show that the posterior distribution of the unknown parameters am is a normal distribution if a normal
prior
$$p(a_m \mid \mu_m, \sigma_m^2) = \mathcal{N}(a_m;\ \mu_m, \sigma_m^2) \qquad (5.15)$$
is used:
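$$\begin{aligned}
p(a_m \mid x_m[0], \ldots, x_m[N-1], \mu_m, \sigma_m^2, \sigma_f^2)
&= \frac{1}{Z_x}\, p(x_m[0], \ldots, x_m[N-1], a_m, \mu_m, \sigma_m^2, \sigma_f^2) && (5.16) \\
&= \frac{1}{Z_x}\, p(a_m, \mu_m, \sigma_m^2) \prod_{t=1}^{N-1} p(x_m[t] \mid x_m[t-1], a_m, \sigma_f^2) && (5.17) \\
&= \frac{1}{Z_x} \exp\left( -\frac{a_m^2}{2\sigma_m^2} + \frac{a_m \mu_m}{\sigma_m^2} + \frac{a_m}{\sigma_f^2} \sum_{t=1}^{N-1} x_m[t]\, x_m[t-1] - \frac{a_m^2}{2\sigma_f^2} \sum_{t=1}^{N-1} x_m^2[t-1] \right) \\
&= \frac{1}{Z_x} \exp\left( -\frac{1}{2} \left( \frac{1}{\sigma_m^2} + \frac{1}{\sigma_f^2} \sum_{t=1}^{N-1} x_m^2[t-1] \right) a_m^2 + \left( \frac{\mu_m}{\sigma_m^2} + \frac{1}{\sigma_f^2} \sum_{t=1}^{N-1} x_m[t]\, x_m[t-1] \right) a_m \right) && (5.18)
\end{aligned}$$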
where Zx is the normalizing constant of the posterior distribution of am:
$$Z_x = p(x_m[0], \ldots, x_m[N-1], \mu_m, \sigma_m^2, \sigma_f^2)$$
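From this we see that the variance of the posterior distribution of am is
$$\left( \frac{1}{\sigma_m^2} + \frac{1}{\sigma_f^2} \sum_{t=1}^{N-1} x_m^2[t-1] \right)^{-1}$$
and the mean is
$$\left( \frac{1}{\sigma_m^2} + \frac{1}{\sigma_f^2} \sum_{t=1}^{N-1} x_m^2[t-1] \right)^{-1} \left( \frac{\mu_m}{\sigma_m^2} + \frac{1}{\sigma_f^2} \sum_{t=1}^{N-1} x_m[t]\, x_m[t-1] \right)$$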
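The two expressions above translate directly into code. The sketch below is restricted to a real-valued state sequence so that it matches the formulas exactly as printed; handling complex-valued states is not attempted here. The function name and the test sequence are hypothetical.

```python
def posterior_transition(x, mu, var_prior, var_f):
    """Posterior mean and variance of a in x[t] = a x[t-1] + f[t], real-valued case.

    Evaluates the two expressions above with prior N(mu, var_prior).
    """
    sum_sq = sum(x[t - 1] ** 2 for t in range(1, len(x)))
    sum_cross = sum(x[t] * x[t - 1] for t in range(1, len(x)))
    precision = 1.0 / var_prior + sum_sq / var_f
    variance = 1.0 / precision
    mean = variance * (mu / var_prior + sum_cross / var_f)
    return mean, variance


x = [0.9 ** t for t in range(30)]      # noiseless decay with a = 0.9
mean, variance = posterior_transition(x, mu=0.0, var_prior=10.0, var_f=1e-4)
```

With a broad prior and a small process noise variance the posterior mean is dominated by the data term and lies very close to 0.9, while the posterior variance shrinks as the energy of the state sequence grows.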
For a known number of partials M we have derived an MCMC scheme to infer the posterior distribution
$$p\left( a_1, \ldots, a_M, \mathbf{x}_0, \ldots, \mathbf{x}_{N-1}, \sigma_n^2 \mid y[0], \ldots, y[N-1], \sigma_f^2, \{\mu_m, \sigma_m^2\}_{m=1,\ldots,M} \right)$$
which is presented as Algorithm 5.1.
Although this is a simple and straightforward MCMC scheme for spectrum estimation, without specifying
additional parameters for tuning the algorithm, the requirement that the prior p (am) on the damping ratio
and partial frequencies must be a normal distribution is very restrictive. We do not foresee that Bayesian
hierarchical models for musical structure such as key and chords will impose normally distributed priors on
the partial frequencies. Rather, at this stage, we may allow an arbitrary prior p (a1, . . . , aM ), but the final
sampling step of Algorithm 5.1 must then be replaced with a Metropolis-Hastings step. Our contribution here is to
suggest a proposal distribution strongly based on the underlying model of the signal, which has been found in
practice to have a high acceptance rate whilst reaching the mode of the posterior distribution of the damping
ratio and partial frequencies rapidly. This contrasts with global proposals based on the periodogram estimate
and local random-walk proposals normally required to effectively explore the non-linear posterior (see 5.3.4).
Algorithm 5.1 Gibbs sampler for the state-space model
• Initialization
  - For m = 1, . . . ,M sample the diagonal elements of $\mathbf{A}^{(0)}$: $a_m^{(0)} \sim p(a_m \mid \mu_m, \sigma_m^2)$ (5.15)
  - Sample $\mathbf{x}_0^{(0)} \sim p(\mathbf{x}_0)$
  - For t = 1, . . . , N − 1 sample $\mathbf{x}_t^{(0)} \sim p(\mathbf{x}_t \mid \mathbf{x}_{t-1}^{(0)}, \mathbf{A}^{(0)}, \sigma_f^2)$ (5.14)
• Iterations, i = 1, 2, . . .
  - Compute $n^{(i-1)}[t] = y[t] - \sum_{m=1}^{M} x_m^{(i-1)}[t]$ and sample $\sigma_n^{2(i)} \sim p(\sigma_n^2 \mid \mathbf{n}^{(i-1)})$ (5.12)
  - For t = 1, . . . , N − 1 sample $\mathbf{x}_t^{(i)} \sim p(\mathbf{x}_t \mid y[0], \ldots, y[N-1], \mathbf{A}^{(i-1)}, \sigma_f^2, \sigma_n^{2(i)})$ computed using Kalman filter and smoother recursions
  - For m = 1, . . . ,M sample $a_m^{(i)} \sim p(a_m \mid x_m^{(i)}[0], \ldots, x_m^{(i)}[N-1], \mu_m, \sigma_m^2, \sigma_f^2)$ (5.18)
In this scheme, for each m = 1, . . . ,M we use a proposal distribution
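$$Q\left( a_m;\ x_m^{(i)}[0], \ldots, x_m^{(i)}[N-1], \sigma_f^2, a^{(i)}_{1:m-1}, a^{(i-1)}_{m+1:M} \right)$$
which is of the same form as (5.18) but substituting
$$\mu_m = \langle a_m \rangle_{p\left( a_m \mid a^{(i-1)}_{1:m-1},\, a^{(i)}_{m+1:M} \right)}$$
and
$$\sigma_m^2 = \langle a_m^2 \rangle_{p\left( a_m \mid a^{(i-1)}_{1:m-1},\, a^{(i)}_{m+1:M} \right)} - \mu_m^2$$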
so that the proposal distribution would be equal to the posterior if the prior were a normal distribution. The
acceptance probability of the proposed candidate $a'_m \sim Q\left( a_m;\ x_m^{(i)}[0], \ldots, x_m^{(i)}[N-1], \sigma_f^2, a^{(i-1)}_{1:m-1}, a^{(i)}_{m+1:M} \right)$
is the minimum of 1 and
$$\frac{\prod_{t=1}^{N-1} \mathcal{N}\left( x_m^{(i-1)}[t];\ a'_m x_m^{(i-1)}[t-1],\ \sigma_f^2 \right) p\left( a'_m \mid a^{(i-1)}_{1:m-1}, a^{(i)}_{m+1:M} \right)}{\prod_{t=1}^{N-1} \mathcal{N}\left( x_m^{(i-1)}[t];\ a_m^{(i-1)} x_m^{(i-1)}[t-1],\ \sigma_f^2 \right) p\left( a_m^{(i-1)} \mid a^{(i-1)}_{1:m-1}, a^{(i)}_{m+1:M} \right)} \times \frac{Q\left( a_m^{(i-1)};\ x_m^{(i)}[0], \ldots, x_m^{(i)}[N-1], \sigma_f^2, a^{(i-1)}_{1:m-1}, a^{(i)}_{m+1:M} \right)}{Q\left( a'_m;\ x_m^{(i)}[0], \ldots, x_m^{(i)}[N-1], \sigma_f^2, a^{(i-1)}_{1:m-1}, a^{(i)}_{m+1:M} \right)}$$
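The structure of this acceptance test can be sketched generically in the log domain, which avoids underflow in the product over t. The example is a schematic illustration rather than the sampler used in this thesis: the target and proposal are stand-in one-dimensional normal densities over a real-valued parameter, and all names are hypothetical.

```python
import math
import random


def mh_accept(log_target, log_proposal, current, candidate, rng=random):
    """One Metropolis-Hastings accept/reject step for an independence proposal.

    log_target(a) is the log of likelihood times prior, up to a constant;
    log_proposal(a) is the log density of Q at a, up to a constant.
    """
    log_ratio = (log_target(candidate) - log_target(current)
                 + log_proposal(current) - log_proposal(candidate))
    accept_prob = math.exp(min(0.0, log_ratio))   # min(1, ratio) in the log domain
    accepted = rng.random() < accept_prob
    return (candidate if accepted else current), accept_prob


def log_normal(a, mean, var):
    """Unnormalized log density of N(mean, var)."""
    return -0.5 * (a - mean) ** 2 / var


# When the proposal equals the target, every candidate is accepted.
state, prob = mh_accept(lambda a: log_normal(a, 0.9, 0.01),
                        lambda a: log_normal(a, 0.9, 0.01),
                        current=0.85, candidate=0.93)
```

This mirrors the remark above: if the prior were normal, the proposal would coincide with the conditional posterior and the ratio would equal one, so the step would reduce to a Gibbs update.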
5.3.4 Gabor Model
The Gabor formulation of the model is given by (5.8)
$$y[t] = \sum_{m=1}^{M} \sum_{i=0}^{I} b_{i,m}\, \psi[t - i\Delta] \exp[i\omega_m t] + n[t] \qquad (5.19)$$
For convenience, we rewrite (5.19) in matrix form, by stacking the amplitudes bi,m into a column vector b
of length (I + 1)M , with elements
$$b_{(m-1)(I+1)+i} = b_{i,m}$$
and the Gabor basis functions into an N × (I + 1)M matrix D with elements
$$D_{t,(m-1)(I+1)+i} = \psi[t - i\Delta] \exp[i\omega_m t] \qquad (5.20)$$
Writing $\mathbf{y} = [y[0], \ldots, y[N-1]]^\top$ and $\mathbf{n} \equiv [n[0], \ldots, n[N-1]]^\top$ as before, (5.19) becomes
$$\mathbf{y} = \mathbf{D}\mathbf{b} + \mathbf{n}$$
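The stacking convention can be checked with a small sketch that builds D column by column and synthesizes the noise-free signal Db. The envelope choice (the truncated sinc of 5.2.5), the frequencies, the amplitudes and the function names are all assumptions for illustration; indices are 0-based in the code, whereas m is 1-based in the text.

```python
import cmath
import math


def envelope(t, delta):
    """Hypothetical envelope: the truncated sinc of 5.2.5 (any psi could be used)."""
    if abs(t) > 4 * delta:
        return 0.0
    if t == 0:
        return 1.0
    u = 2.0 * math.pi * t / delta
    return math.sin(u) / u


def gabor_dictionary(omegas, n_samples, n_atoms, delta):
    """Build the N x (I+1)M matrix D of (5.20) as a list of rows."""
    D = [[0j] * (n_atoms * len(omegas)) for _ in range(n_samples)]
    for m, omega in enumerate(omegas):          # m = 0, ..., M-1
        for i in range(n_atoms):                # i = 0, ..., I
            col = m * n_atoms + i               # (m-1)(I+1)+i in the 1-based text
            for t in range(n_samples):
                D[t][col] = envelope(t - i * delta, delta) * cmath.exp(1j * omega * t)
    return D


def synthesize(D, b):
    """Noise-free signal D b."""
    return [sum(d * coeff for d, coeff in zip(row, b)) for row in D]


D = gabor_dictionary(omegas=[0.3, 0.6], n_samples=40, n_atoms=4, delta=10)
b = [1.0, 0.9, 0.8, 0.7, 0.5, 0.4, 0.3, 0.2]   # hypothetical stacked amplitudes
y_clean = synthesize(D, b)
```

Each block of I + 1 consecutive columns belongs to one partial, so changing a single frequency ωm only requires recomputing that block of D.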
The approaches described in 5.3.1 related to this model all adopt the g-prior [Zellner, 1986] which is chosen
for its properties in Bayesian model selection. The g-prior is a zero mean multivariate normal prior distribution
for $p(\mathbf{b} \mid \mathbf{D}, \sigma_n^2)$ with covariance matrix $\sigma_n^2 \xi (\mathbf{D}^\top \mathbf{D})^{-1}$ and an additional parameter ξ. As ξ scales the
amplitudes with respect to the noise level, it can be interpreted as a prior signal-to-noise ratio. In the case
where we treat ξ as unknown and wish to additionally infer it in a Bayesian setting, we again follow the
literature in 5.3.1¹ and assign an inverse-gamma prior: p (ξ) = IG (ξ;αξ, βξ).
The probabilistic model described so far, including the additional parameter ξ which was introduced by
adopting the g-prior, is
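$$p(\mathbf{y}, \mathbf{b}, \mathbf{D}, \sigma_n^2, \xi) = p(\mathbf{y} \mid \mathbf{D}\mathbf{b}, \sigma_n^2)\, p(\mathbf{b} \mid \mathbf{D}, \sigma_n^2, \xi)\, p(\sigma_n^2)\, p(\xi) \qquad (5.21)$$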
We have yet to discuss a prior p (D) for D. From (5.20) this is a prior p (ω1, . . . , ωM ) on the partial
frequencies, which are the remaining unknowns in this model.
In the remainder of this section, we show a result for this model that has not been referred to in the
literature we have reviewed. The model parameters b and σ2n may be integrated out, giving the following
marginal distribution:
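$$p(\mathbf{y} \mid \mathbf{D}, \xi)\, p(\xi) \propto \left( \mathbf{y}^\top \mathbf{P} \mathbf{y} + \beta_n \right)^{-(N+\alpha_n)/2} \xi^{-(\alpha_\xi + 1)} \exp\left( -\frac{\beta_\xi}{\xi} \right) \qquad (5.22)$$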
where the following definitions are used, as in the literature:
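$$\begin{aligned}
\mathbf{S}^{-1} &= \mathbf{D}^\top \mathbf{D} + \frac{1}{\xi} \mathbf{D}^\top \mathbf{D} \\
&= \frac{\xi + 1}{\xi} \mathbf{D}^\top \mathbf{D} \\
\mathbf{P} &= \mathbf{I}_N - \mathbf{D} \mathbf{S} \mathbf{D}^\top \qquad (5.23) \\
&= \mathbf{I}_N - \frac{\xi}{\xi + 1} \mathbf{D} \left( \mathbf{D}^\top \mathbf{D} \right)^{-1} \mathbf{D}^\top \\
&= \mathbf{I}_N - \frac{\xi}{\xi + 1} \mathbf{D} \mathbf{D}^\dagger
\end{aligned}$$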
¹ For clarity of notation we denote the scaling hyperparameter as ξ. This is equivalent to δ2 used in Andrieu and Doucet
[1999] and ξ2 used in Davy et al. [2006].
where $\mathbf{D}^\dagger = \left( \mathbf{D}^\top \mathbf{D} \right)^{-1} \mathbf{D}^\top$ has been used to make the notation a little more concise. It would be useful for
Bayesian inference to be able to sample from the conditional distribution p (ξ|y,D) but this is not a standard
distribution. Instead, previous approaches have used the following distribution
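$$p(\xi \mid \mathbf{D}, \mathbf{b}, \sigma_n^2) = \mathcal{IG}\left( \alpha_\xi + (I + 1)M,\ \frac{1}{2\sigma_n^2} \mathbf{b}^\top \mathbf{D}^\top \mathbf{D} \mathbf{b} + \beta_\xi \right)$$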
which is a standard distribution, but requires that b and σ2n be available. This may not be satisfactory:
we chose to integrate them out analytically in (5.22), which improves the Monte Carlo estimates of the
remaining parameters, and simulating them requires significant additional computation. Alternatively,
it is possible to integrate p (ξ|y,D) numerically, as Richardson and Green [1997] have chosen to do.
The result we present here is that although p (ξ|y,D) is not a standard distribution, its mode, or MAP
estimate, is available as the solution of the quadratic equation (see Section B.1 for the full derivation)
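$$\xi^2 \left( \frac{N + \alpha_n}{2} + (\alpha_\xi + 1) \right) \left( \mathbf{y}^\top \mathbf{D} \mathbf{D}^\dagger \mathbf{y} \right) - \xi \left( (\alpha_\xi + 1)\, \mathbf{y}^\top \mathbf{y} + \beta_\xi\, \mathbf{y}^\top \mathbf{D} \mathbf{D}^\dagger \mathbf{y} \right) + \beta_\xi\, \mathbf{y}^\top \mathbf{y} = 0 \qquad (5.24)$$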
The positive root of (5.24) is the mode, the other root is negative and is disallowed by the prior p (ξ).
Using (5.24) to estimate the mode ξ∗ reduces the computation required for each iteration of the MCMC
algorithm, and slightly reduces the number of iterations required for convergence. This is illustrated in
Figure 5.2 on page 68 for a single violin note with fundamental frequency 440Hz, which is part of the data
set used later in this chapter in 5.5.1 to demonstrate the ability of these algorithms to infer partial frequencies
in the signal.
This result is useful in an iterative MAP estimation scheme. It can also be used as the basis of a proposal
distribution for a Metropolis-Hastings MCMC kernel. As the prior $p(\xi)$ and the conditional posterior distribution
$p(\xi \mid D, b, \sigma_n^2)$ are both inverse gamma, it seems plausible to construct a proposal distribution which is also
inverse gamma, with its mode $\xi^*$ given by (5.24). We suggest setting the shape parameter of the proposal
distribution to $\alpha_q$ and the scale parameter to $(\alpha_q + 1)\,\xi^*$. The parameter $\alpha_q$ controls the variance of the
proposal and should be tuned for a suitable acceptance rate.
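The choice of scale parameter can be checked numerically, since an inverse gamma distribution with shape $a$ and scale $s$ has its mode at $s/(a+1)$. The following is a minimal sketch of such a proposal using only the Python standard library; the function names and the values of `xi_star` and `alpha_q` are illustrative assumptions and are not taken from the thesis.

```python
import math
import random


def inv_gamma_logpdf(x, shape, scale):
    """Log density of the inverse gamma distribution IG(shape, scale) at x > 0."""
    return (shape * math.log(scale) - math.lgamma(shape)
            - (shape + 1.0) * math.log(x) - scale / x)


def sample_xi_proposal(xi_star, alpha_q, rng=random):
    """Draw from IG(alpha_q, (alpha_q + 1) * xi_star), whose mode is xi_star."""
    scale = (alpha_q + 1.0) * xi_star
    # If G ~ Gamma(shape, rate=scale) then 1/G ~ IG(shape, scale).
    # random.gammavariate takes (shape, scale), so the rate is passed as 1/scale.
    return 1.0 / rng.gammavariate(alpha_q, 1.0 / scale)


xi_star, alpha_q = 2.0, 5.0  # illustrative values only
scale = (alpha_q + 1.0) * xi_star
# Locate the maximum of the proposal density on a fine grid.
grid = [0.01 * k for k in range(1, 1001)]
mode_on_grid = max(grid, key=lambda x: inv_gamma_logpdf(x, alpha_q, scale))
draw = sample_xi_proposal(xi_star, alpha_q, random.Random(0))
```

Larger values of `alpha_q` concentrate the proposal more tightly around the mode, which is the tuning behaviour described above.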
A Metropolis-Hastings kernel is also necessary for simulating the frequencies $\omega_1, \ldots, \omega_M$, as neither the joint
distribution $p(\omega_1, \ldots, \omega_M \mid \xi, y)$ nor any of the individual distributions $p(\omega_m \mid \omega_1, \ldots, \omega_{m-1}, \omega_{m+1}, \ldots, \omega_M, \xi, y)$
is a standard distribution. We are able to impose any prior distribution $p(\omega_{1:M})$ using the Gabor model.
Here we suggest two proposal distributions based on computing the residual of the signal excluding the partial
frequency of interest. Denote by $D_{\neg m}$ the $N \times (I + 1)(M - 1)$ matrix where the columns related to the $m$th
partial have been omitted. The least squares reconstruction of the signal excluding the $m$th partial is given
by $D_{\neg m} D_{\neg m}^{\dagger} y$, and the residual signal, which is expected to contain the $m$th partial and also noise, is given
by $x_m = y - D_{\neg m} D_{\neg m}^{\dagger} y$. Note that computing the residual avoids simulating $b$ and $\sigma_n^2$. The proposal
distributions are thus of the form $Q(\omega_m; y, \omega_{1:M \neg m}, \xi)$ where $\omega_{1:M \neg m}$ denotes the set of partial frequencies
excluding $\omega_m$. We are taking advantage of being able to design custom Metropolis-Hastings kernels in order
to reduce computation whilst performing full Bayesian inference.
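As an illustration of the residual computation, the sketch below projects a signal onto the span of a set of basis columns and subtracts the projection, which equals $y - D_{\neg m} D_{\neg m}^{\dagger} y$ when the columns are linearly independent. It uses modified Gram-Schmidt rather than an explicit pseudo-inverse purely to stay within the standard library; the function name and the two-sinusoid test signal are assumptions made for this example, and real-valued columns are assumed.

```python
import math


def residual_excluding(y, columns):
    """Return y minus its least-squares projection onto span(columns).

    `columns` is a list of equal-length real vectors standing in for the
    columns of D with the m-th partial removed; they are assumed to be
    linearly independent.
    """
    basis = []
    for col in columns:
        v = list(col)
        for q in basis:  # modified Gram-Schmidt orthogonalisation
            dot = sum(a * b for a, b in zip(q, v))
            v = [a - dot * b for a, b in zip(v, q)]
        norm = math.sqrt(sum(a * a for a in v))
        basis.append([a / norm for a in v])
    x = list(y)
    for q in basis:  # subtract the component along each orthonormal direction
        dot = sum(a * b for a, b in zip(q, x))
        x = [a - dot * b for a, b in zip(x, q)]
    return x


# Toy signal: two sinusoids; the basis columns model only the first one.
N = 200
y = [1.5 * math.cos(0.3 * t) + 0.5 * math.cos(1.1 * t) for t in range(N)]
cols = [[math.cos(0.3 * t) for t in range(N)],
        [math.sin(0.3 * t) for t in range(N)]]
x_m = residual_excluding(y, cols)
# The residual is orthogonal to every column that was projected out.
max_corr = max(abs(sum(a * b for a, b in zip(c, x_m))) for c in cols)
```

The residual retains most of the energy of the second sinusoid, which is the component a proposal for the excluded partial would then try to locate.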
The first proposal involves computing a K-point DFT of xm and defining a probability distribution
function proportional to the magnitude in each frequency bin. This allows the algorithm to explore many
parts of the spectrum rapidly. For the proposal to be reversible, the proposed frequency is assumed to be
[Figure 5.2(a): trace plot of the amplitudes b1 and b2 against MCMC iterations (0 to 200).]
(a) The amplitudes of the Gabor basis functions must be simulated in order to sample from the posterior of the signal-to-noise ratio parameter. This figure presents the convergence of two amplitude parameters of the fundamental frequency. The Markov chain has converged to the true posterior distribution in approximately 100 iterations. The convergence of the corresponding noise variance and signal-to-noise ratio parameters for this Markov chain is shown in Figure 5.2b.
[Figure 5.2(b): trace plot of $\log \sigma^2$, $\log \xi$ and $\log \xi^*$ against MCMC iterations (0 to 200).]
(b) Convergence of the noise variance $\sigma^2$ and a comparison of the convergence of $\xi$ when inferred from the posterior distribution, or set to the mode $\xi^*$. As in Figure 5.2a, the simulated parameters converge in approximately 100 iterations, whereas the convergence to the mode is complete in around 75 iterations.
Figure 5.2: The result presented in (5.24) allows the conditional mode of the signal-to-noise parameter $\xi$ to be calculated without requiring inference of the amplitude parameters $b$ or the noise variance $\sigma^2$, which reduces the computation required for each iteration of the MCMC algorithm, and slightly reduces the number of iterations required for convergence.
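uniformly distributed within the range of frequencies $|p/K - \omega_m| < 1/2$ for the frequency bin $p$ with centre
frequency $p/K$, i.e.,
$$Q(\omega_m; y, \omega_{1:M \neg m}, \xi) \propto \prod_{p=0}^{K-1} \begin{cases} \left|\mathrm{DFT}[x_m]\right|_p & \left|\frac{p}{K} - \omega_m\right| < \frac{1}{2} \\ 0 & \text{otherwise} \end{cases} \qquad (5.25)$$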
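A proposal of this kind can be sketched as a two-step draw: choose a DFT bin with probability proportional to its magnitude, then draw a frequency uniformly within that bin. The sketch below is one possible reading rather than the implementation used here. It assumes frequencies in cycles per sample and takes each bin to extend half a bin spacing, $1/(2K)$, either side of its centre $p/K$; that width, the function names and the test signal are all assumptions of this example.

```python
import cmath
import math
import random


def dft_magnitudes(x, K):
    """Magnitudes of the K-point DFT of x (zero-padded or truncated to K samples)."""
    x = list(x[:K]) + [0.0] * max(0, K - len(x))
    return [abs(sum(x[t] * cmath.exp(-2j * math.pi * p * t / K) for t in range(K)))
            for p in range(K)]


def sample_dft_proposal(mags, rng=random):
    """Pick bin p with probability proportional to mags[p], then a uniform frequency in it."""
    K = len(mags)
    p = rng.choices(range(K), weights=mags, k=1)[0]
    return (p + rng.uniform(-0.5, 0.5)) / K


def dft_proposal_logpdf(omega, mags):
    """Log density of the piecewise-uniform proposal at frequency omega."""
    K = len(mags)
    p = int(math.floor(omega * K + 0.5)) % K  # bin whose interval contains omega
    if mags[p] <= 0.0:
        return float("-inf")
    # P(bin p) = mags[p] / sum(mags); the uniform density within a bin is K.
    return math.log(mags[p]) - math.log(sum(mags)) + math.log(K)


# Complex exponential at 0.25 cycles per sample, standing in for an analytic residual.
K = 32
x_m = [cmath.exp(2j * math.pi * 0.25 * t) for t in range(K)]
mags = dft_magnitudes(x_m, K)
omega_prop = sample_dft_proposal(mags, random.Random(0))
```

Because the proposal density does not depend on the current frequency, the log density returned by `dft_proposal_logpdf` is what would enter the proposal ratio of a Metropolis-Hastings acceptance step.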
The second proposal is based on fitting a damped sinusoid to the residual signal, discarding the damping
ratio and using the frequency obtained. This proposal is used to make small adjustments to the partial
frequencies to search for local maxima in the posterior distribution. We use the posterior distribution (5.18)
which we derived for a damped sinusoid with process noise. In this MCMC context, we can use $\sigma_f^2$, the
process noise variance, to control the acceptance rate, and the damping ratio $\rho_m$ is discarded. For simplicity,
we also remove the prior parameters $\mu_m, \sigma_m^2$, giving
$$Q(\omega_m; y, \omega_{1:M \neg m}, \xi) = \mathcal{N}\left(\omega_m;\ \left(\sum_{t=1}^{N-1} x_m^2[t-1]\right)^{-1} \left(\sum_{t=1}^{N-1} x_m[t]\, x_m[t-1]\right),\ \sigma_f^2 \left(\sum_{t=1}^{N-1} x_m^2[t-1]\right)^{-1}\right) \qquad (5.26)$$
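The mean and variance in (5.26) are the least squares estimate of a first-order autoregressive coefficient and its scaled uncertainty. The sketch below evaluates the two sums exactly as written, for a real-valued sequence; how the resulting coefficient maps to a frequency and a damping ratio depends on the state-space parametrisation behind (5.18), which is not reproduced here, so the example only checks that a known coefficient is recovered. The function name and test sequence are assumptions of this example.

```python
def ar1_proposal_moments(x, sigma_f_sq):
    """Mean and variance of the Gaussian in (5.26) for a real-valued residual x."""
    # Both sums run over t = 1, ..., N - 1, matching the limits in (5.26).
    denom = sum(x[t - 1] ** 2 for t in range(1, len(x)))
    numer = sum(x[t] * x[t - 1] for t in range(1, len(x)))
    return numer / denom, sigma_f_sq / denom


# Noise-free decaying sequence x[t] = 0.9 ** t, i.e. x[t] = 0.9 * x[t - 1].
x_m = [0.9 ** t for t in range(50)]
mean, variance = ar1_proposal_moments(x_m, sigma_f_sq=1e-4)
```

Increasing `sigma_f_sq` widens the proposal, which is the handle on the acceptance rate mentioned above.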
Alternatively we may use a standard random-walk proposal to search for a local maximum, especially in
cases where the above damped sinusoid model is not appropriate, such as for sustained note instruments
with significant amplitude and frequency modulations. The proposal distribution is a Gaussian distribution
centred around the existing frequency with a small random-walk variance $\sigma_{RW}^2$, which Godsill and Davy
[2005] suggest setting to $10^{-3}$:
$$Q(\omega; \omega_m) = \mathcal{N}\left(\omega; \omega_m, \sigma_{RW}^2\right) \qquad (5.27)$$
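Because the Gaussian random walk is symmetric, the proposal densities cancel in the acceptance ratio and only the ratio of target densities remains. The following is a minimal sketch of one such step; the toy log target, a narrow Gaussian bump around 0.3, is a stand-in chosen for this example and is not the posterior of the Gabor model.

```python
import math
import random


def random_walk_mh_step(omega, log_target, sigma_rw_sq, rng=random):
    """One Metropolis step with the symmetric proposal N(omega, sigma_rw_sq)."""
    proposal = rng.gauss(omega, math.sqrt(sigma_rw_sq))
    log_ratio = log_target(proposal) - log_target(omega)
    # Accept with probability min(1, exp(log_ratio)).
    if log_ratio >= 0.0 or rng.random() < math.exp(log_ratio):
        return proposal, True
    return omega, False


def toy_log_target(omega):
    """Hypothetical unnormalised log density with a single mode at 0.3."""
    return -0.5 * ((omega - 0.3) / 0.02) ** 2


rng = random.Random(1)
omega, chain = 0.1, []
for _ in range(3000):
    omega, _accepted = random_walk_mh_step(omega, toy_log_target, 1e-3, rng)
    chain.append(omega)
kept = chain[1000:]  # discard the first 1000 iterations as burn-in
posterior_mean = sum(kept) / len(kept)
```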
5.4 Bayesian Inference using Reversible Jump MCMC
In the previous section we have developed two probabilistic signal models for musical signals. The state-
space model is considered useful for musical instruments which may be approximately modelled by damped
oscillators. The Gabor basis model is considered useful for bandlimited frequency and amplitude modulations
such as those arising from vibrato.
MCMC algorithms (Algorithms 5.1 and 5.2) have been derived for a fixed number $M$ of partial frequencies with
conditionally independent priors $p(\omega_{1:M} \mid M)$. However, there are few situations where the number of partials
and their frequencies are known a priori. In this section we consider how to infer the number of partials
jointly with their frequencies. The most flexible method of Bayesian inference for this type of problem is
reversible jump MCMC, which was first applied to Bayesian sinusoidal models in Andrieu and Doucet [1999].
Firstly we must specify a prior p (M) on the overall number of partials. In the literature the prior on
the number of partials is typically split into a hierarchical model, with a prior on the number of notes and
a prior on the number of partials in each note.
The contribution here is our general presentation of how to propose and accept changes to the numbers
of partials and their frequencies, without referring to any specific move, together with the application of reversible jump
MCMC to the state-space model. Being able to propose changes to multiple partials is necessary for being
able to rapidly explore the numerous combinations of harmonics and notes that arise in musical signals.
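Algorithm 5.2 Metropolis-Hastings for the Gabor model
• Initialization
  - For $m = 1, \ldots, M$ sample $\omega_m^{(0)} \sim p(\omega_m)$
  - Sample $\xi^{(0)} \sim \mathcal{IG}(\alpha_\xi, \beta_\xi)$
• Iterations $i = 1, 2, \ldots$
  - Choose $m$ from $1, \ldots, M$ with equal probability $1/M$
  - Compute $x_m = y - D_{\neg m}^{(i-1)} D_{\neg m}^{\dagger (i-1)} y$
  - Select a proposal $Q\left(\omega_m; y, \omega_{1:M}^{(i-1)}, \xi^{(i-1)}\right)$ from (5.25) to (5.27) and sample $\omega'_m$ from it
  - Draw $u$ from $\mathcal{U}(0, 1)$ and set $\omega_m^{(i)} \leftarrow \omega'_m$ if
    $$u < \frac{p\left(y, D', \xi^{(i-1)}\right)}{p\left(y, D^{(i-1)}, \xi^{(i-1)}\right)} \cdot \frac{Q\left(\omega_m^{(i-1)}; y, \omega_{1:M}^{(i-1)}, \xi^{(i-1)}\right)}{Q\left(\omega'_m; y, \omega_{1:M}^{(i-1)}, \xi^{(i-1)}\right)}$$
    otherwise set $\omega_m^{(i)} \leftarrow \omega_m^{(i-1)}$
  - Compute $\xi^*$ using (5.24)
  - Sample $\xi'$ from $Q\left(\xi; D^{(i)}, y\right) \equiv \mathcal{IG}\left(\xi; \alpha_q, (\alpha_q + 1)\,\xi^*\right)$
  - Draw $v$ from $\mathcal{U}(0, 1)$ and set $\xi^{(i)} \leftarrow \xi'$ if
    $$v < \frac{p\left(y \mid D^{(i)}, \xi'\right) p(\xi')}{p\left(y \mid D^{(i)}, \xi^{(i-1)}\right) p\left(\xi^{(i-1)}\right)} \cdot \frac{Q\left(\xi^{(i-1)}; D^{(i)}, y\right)}{Q\left(\xi'; D^{(i)}, y\right)}$$
    otherwise set $\xi^{(i)} \leftarrow \xi^{(i-1)}$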
5.4.1 Proposals and Acceptance Ratios
In the following expressions in this section, ω1:M should be substituted with a1:M when using the state-space
model. When using the Gabor model, all of the joint and proposal distributions are dependent on the current
value of ξ(i).
A proposal distribution for changing the number of partials and their frequencies would be expressed as
follows
$$Q\left(\omega_{1:M}, M \mid y, \omega_{1:M}^{(i-1)}, M^{(i-1)}\right)$$
however this proposal distribution is not reversible as the dimension of the model parameters has changed.
Instead we define a birth proposal, where the number of partials has increased, but the frequencies of the
original partials have not changed, i.e., $M^* > M^{(i-1)}$:
$$\left\{\omega^*_{(M^{(i-1)}+1):M^*}, M^*\right\} \sim Q_B\left(\omega_{(M^{(i-1)}+1):M}, M \mid y, \omega^{(i-1)}_{1:M^{(i-1)}}, M^{(i-1)}\right)$$
such that if the proposal is accepted, $M^* \to M^{(i)}$ and $\left\{\omega^{(i-1)}_{1:M^{(i-1)}}, \omega^*_{(M^{(i-1)}+1):M^*}\right\} \to \omega^{(i)}_{1:M^{(i)}}$.
This proposal must be accompanied by a death proposal, where a number of the original partials have
been removed, and the remaining partial frequencies are unchanged, i.e., $M^* < M^{(i-1)}$:
$$M^* \sim Q_D\left(M \mid y, \omega^{(i-1)}_{1:M^{(i-1)}}, M^{(i-1)}\right)$$
such that if the proposal is accepted, $M^* \to M^{(i)}$ and $\omega^{(i-1)}_{1:M^{(i-1)}} \to \omega^{(i)}_{1:M^{(i)}}$.
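The Jacobian of the transform for the birth and death proposals equals one. The acceptance ratio for a
birth proposal is given by the minimum of 1 and
$$\frac{p\left(y, \omega^{(i-1)}_{1:M^{(i-1)}}, \omega^*_{(M^{(i-1)}+1):M^*}, M^*\right)}{p\left(y, \omega^{(i-1)}_{1:M^{(i-1)}}, M^{(i-1)}\right)} \cdot \frac{Q_D\left(M^{(i-1)} \mid y, \omega^{(i-1)}_{1:M^{(i-1)}}, \omega^*_{(M^{(i-1)}+1):M^*}, M^*\right)}{Q_B\left(\omega^*_{(M^{(i-1)}+1):M^*}, M^* \mid y, \omega^{(i-1)}_{1:M^{(i-1)}}, M^{(i-1)}\right)}$$
Similarly, the acceptance ratio for a death proposal is given by the minimum of 1 and the reciprocal of the
above ratio:
$$\frac{p\left(y, \omega^{(i-1)}_{1:M^*}, M^*\right)}{p\left(y, \omega^{(i-1)}_{1:M^{(i-1)}}, M^{(i-1)}\right)} \cdot \frac{Q_B\left(\omega^{(i-1)}_{(M^*+1):M^{(i-1)}}, M^{(i-1)} \mid y, \omega^{(i-1)}_{1:M^*}, M^*\right)}{Q_D\left(M^* \mid y, \omega^{(i-1)}_{1:M^{(i-1)}}, M^{(i-1)}\right)}$$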
Before or after any MCMC move, a complete reordering of the partial frequencies ω1:M is permitted, so
that different combinations of partials may be created or destroyed.
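In practice these ratios are usually evaluated in the log domain to avoid numerical underflow. The sketch below shows the bookkeeping for a birth move and for the death move that exactly reverses it, with a unit Jacobian; the function names and the numerical log densities are hypothetical placeholders rather than values from any model in this chapter.

```python
import math


def log_accept_birth(log_joint_new, log_joint_old, log_q_death_reverse, log_q_birth_forward):
    """Log of min(1, joint ratio * Q_D / Q_B) for a birth move (unit Jacobian)."""
    return min(0.0, (log_joint_new - log_joint_old)
               + (log_q_death_reverse - log_q_birth_forward))


def log_accept_death(log_joint_new, log_joint_old, log_q_birth_reverse, log_q_death_forward):
    """Log of min(1, joint ratio * Q_B / Q_D) for a death move (unit Jacobian)."""
    return min(0.0, (log_joint_new - log_joint_old)
               + (log_q_birth_reverse - log_q_death_forward))


# Hypothetical log densities for a smaller model (M) and a larger model (M*).
log_joint_small, log_joint_large = -12.0, -10.0
log_q_birth, log_q_death = -2.0, -0.7

birth = log_accept_birth(log_joint_large, log_joint_small, log_q_death, log_q_birth)
death = log_accept_death(log_joint_small, log_joint_large, log_q_birth, log_q_death)
accept_birth_prob = math.exp(birth)
accept_death_prob = math.exp(death)
```

Here the birth ratio exceeds one, so the birth is always accepted, while the reverse death is accepted with probability equal to the reciprocal of the birth ratio.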
5.4.2 Prior Model
For the remainder of this chapter we adopt the hierarchical prior used by Godsill and Davy [2005] for the
partial frequency structure of the music. A short segment of music has K notes, where the unknown number
of notes is distributed according to a truncated Poisson distribution:
$$p(K = k \mid \lambda_K) = \frac{\lambda_K^{k} / k!}{\sum_{k' = K_{\mathrm{MIN}}}^{K_{\mathrm{MAX}}} \lambda_K^{k'} / k'!} \qquad (5.28)$$
The hyperparameter $\lambda_K$ is distributed according to a Gamma distribution $p(\lambda_K) = \mathcal{G}(\lambda_K; \alpha_K, \beta_K)$ where
$\alpha_K = 1$ and $\beta_K = 2$ are set such that $p(\lambda_K)$ has infinite variance. It is often reasonable within the context
of polyphonic music transcription to assume that the minimum and maximum number of notes sounding at
the same time is known in advance (for example a four-part fugue or chorale).
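The truncated Poisson probabilities in (5.28) are straightforward to evaluate directly. The sketch below is a minimal illustration; the function name and the choice $\lambda_K = 2$ with between one and four notes are assumptions made for the example.

```python
import math


def truncated_poisson_pmf(k, lam, k_min, k_max):
    """p(K = k | lam) for a Poisson distribution truncated to k_min <= k <= k_max."""
    if not k_min <= k <= k_max:
        return 0.0

    def weight(j):
        return lam ** j / math.factorial(j)

    return weight(k) / sum(weight(j) for j in range(k_min, k_max + 1))


# Example: between one and four simultaneous notes, as in a four-part texture.
probabilities = [truncated_poisson_pmf(k, 2.0, 1, 4) for k in range(0, 6)]
```

The common factor $e^{-\lambda_K}$ of the untruncated Poisson distribution cancels between numerator and denominator, which is why it does not appear in (5.28) or in the code.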
Each note $k = 1, \ldots, K$ has $M_k$ partials, where $M_k$ is distributed according to the same truncated Poisson
distribution (5.28). The minimum number of partials is set to 2 in the prior work, and the maximum
number of partials is set to either 30 or the limit $\omega_s / 2\omega_0$, which is the number of partials permitted by
the sampling frequency. Here we have set the minimum number of partials to 1 and allowed as many partials
as the sampling frequency permits. The sampling of the hyperparameters for the posterior distributions of
the number of notes and partials is identical to that of Andrieu and Doucet [1999].
The prior distribution for frequencies $p(\omega_{1:M_k} \mid M_k)$ is given by the following hierarchical model
$$p(\omega_{1:M_k} \mid M_k) = p(\omega_0) \prod_{k=1}^{M_k} p(\delta_k)$$
$$\omega_k = k\,\omega_0\,(1 + \delta_k)$$
$$p(\delta_k) = \mathcal{N}\left(0, \sigma_\delta^2\right)$$
with $\sigma_\delta^2 = 3 \times 10^{-8}$. $p(\omega_0)$ is set to be uniform, with limits set to be a semitone below and above the
minimum and maximum MIDI frequencies.
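To give a feel for how tight this prior is, the sketch below draws a set of partial frequencies for a fixed fundamental. The function name and the 440 Hz fundamental are assumptions of the example; the value of $\sigma_\delta^2$ is the one quoted above.

```python
import math
import random


def sample_partial_frequencies(omega_0, num_partials, sigma_delta_sq=3e-8, rng=random):
    """Draw omega_k = k * omega_0 * (1 + delta_k) with delta_k ~ N(0, sigma_delta_sq)."""
    sd = math.sqrt(sigma_delta_sq)
    return [k * omega_0 * (1.0 + rng.gauss(0.0, sd))
            for k in range(1, num_partials + 1)]


partials = sample_partial_frequencies(440.0, 10, rng=random.Random(0))
# Relative deviation of each partial from its ideal harmonic position.
deviations = [abs(f / (k * 440.0) - 1.0) for k, f in enumerate(partials, start=1)]
```

With a standard deviation of about $1.7 \times 10^{-4}$, the sampled partials typically lie within a fraction of a hertz of the exact harmonic positions for the lower harmonics of a 440 Hz note.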
5.4.3 Examples of Reversible Moves
The following moves, designed for inferring the harmonic structure of polyphonic music, are taken from
Godsill and Davy [2005] for the prior model described in the previous section.
5.4.3.1 n-increase/decrease
This pair of moves is designed to estimate the number of harmonics present in the signal for a single note. For
the n-increase move, a new set of n partial frequencies is proposed in harmonic positions above the highest
existing harmonic for that note. For the n-decrease move, the highest n partial frequencies for that note
are deleted. The number of possible harmonics is limited between 1 and $\omega_s / 2\omega_0$, where $\omega_s$ is the sampling
frequency and $\omega_0$ is the fundamental frequency of the note.
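The bounds on the number of harmonics can be made concrete with a small sketch. It only tracks the harmonic count and returns `None` when a move would leave the permitted range; taking the integer part of $\omega_s / 2\omega_0$ as the upper limit is an assumption of this example, as are the function names.

```python
def max_harmonics(omega_s, omega_0):
    """Largest harmonic number whose frequency does not exceed half the sampling frequency."""
    return int(omega_s // (2.0 * omega_0))


def n_increase(num_harmonics, n, omega_s, omega_0):
    """Harmonic count after adding n harmonics, or None if the upper limit is exceeded."""
    proposed = num_harmonics + n
    return proposed if proposed <= max_harmonics(omega_s, omega_0) else None


def n_decrease(num_harmonics, n):
    """Harmonic count after deleting the highest n harmonics, or None if fewer than 1 remain."""
    proposed = num_harmonics - n
    return proposed if proposed >= 1 else None


# A 440 Hz note sampled at 11,025 Hz admits at most 12 harmonics.
limit = max_harmonics(11025.0, 440.0)
```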
5.4.3.2 double/halve frequency
This pair of moves is designed to explore octave ambiguities and errors in the pattern of harmonics in
the signal. The double move doubles the fundamental frequency of a note by removing the odd-numbered
partials of that note. The halve move halves the fundamental frequency of a note, keeping the existing partial
frequencies, but assigning them to twice the original harmonic position. The missing partials are added in
the odd-numbered harmonic positions.
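The re-indexing performed by these two moves can be sketched on a mapping from harmonic number to partial frequency. In a sampler the new odd-numbered partials of the halve move would be drawn from a proposal distribution; here they are simply supplied at ideal harmonic positions, which is an assumption of the example, as are the function names.

```python
def double_move(partials):
    """Drop odd-numbered harmonics and renumber harmonic h as h / 2."""
    return {h // 2: f for h, f in partials.items() if h % 2 == 0}


def halve_move(partials, new_odd_partials):
    """Move harmonic h to position 2h and insert new partials at the odd positions."""
    moved = {2 * h: f for h, f in partials.items()}
    moved.update(new_odd_partials)
    return moved


# A 440 Hz note with four harmonics.
note = {1: 440.0, 2: 880.0, 3: 1320.0, 4: 1760.0}
doubled = double_move(note)  # fundamental becomes 880 Hz
# Hypothetical new partials at odd multiples of the halved fundamental, 220 Hz.
odd = {h: 220.0 * h for h in (1, 3, 5, 7)}
halved = halve_move(note, odd)  # fundamental becomes 220 Hz
```

Applying the double move to the output of the halve move returns the original note, which reflects the pairing of the two moves required for reversibility.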
5.4.3.3 note birth/death
This pair of moves is designed to estimate the number of notes in the signal. The birth move adds a new
note, with a fundamental frequency and a new set of harmonics. The death move deletes a note with all of
its partial frequencies.
5.5 Results
In this chapter we have extended and developed two existing models for musical signal analysis, and described
a reversible jump MCMC framework to infer the number of partials in a small segment of music, and their
respective partial frequencies. To put this work into perspective, in this section we use the Bayesian prior
model of Davy et al. [2006] for polyphony and harmonic notes to infer the number of notes and their
fundamental frequencies in short segments of musical chords.
The objective of this section is to quantitatively evaluate the effect of the model enhancements suggested
in this chapter on polyphonic music transcription, using the prior work as a baseline. In Davy et al. [2006],
multiple F0 estimation results are presented for a set of recorded note mixtures of a variety of musical
instruments from the McGill database (see footnote 2). Each signal is downsampled from 44,100 Hz to 11,025 Hz, and the
first 544 ms is used to estimate the fundamental frequencies present, assuming the number of notes is known
a priori. We use the same experimental setup, running the reversible jump MCMC algorithm once for 800
iterations, and using only the final 100 samples to estimate the fundamental frequencies and the frequencies
of the partials.
5.5.1 Performance on Monophonic Extracts
When evaluating model-based polyphonic music transcription systems, it is implicitly expected that the
system should accurately detect the fundamental frequency of an isolated note in a monophonic extract,
and be resilient to octave errors, as this capability already exists in simpler detection systems based on
autocorrelation function or harmonic transform analysis. We have found it extremely useful to study how
the partial frequencies are detected by a model, and have found that significant differences exist in the
performance of the different models suggested in this chapter. We choose to study monophonic extracts,
as the actual number of partials existing in the signal can be estimated easily by hand by studying the
periodogram of the entire monophonic extract. If a model is able to successfully detect a higher number of
partials, this indicates that the model is able to capture more of the harmonic structure of a note, and should
therefore be able to distinguish better between ambiguous combinations of notes that degrade the performance
of polyphonic music transcription systems.
In this section, we present partial estimation results for different model choices. In each case, we have
modified one aspect of the model, and ensured that all other model and algorithm parameters are constant.
A ground truth for the number of partials was prepared by a human observer from the periodogram with
[Footnote 2: http://www.music.mcgill.ca/resources/mums/html/MUMS_dvd.htm]
                                    Average percentage of partials detected
Instrument     Number of Extracts   Real Signal   Analytical Representation
Violin         31                   37.3          73.3
French Horn    36                   50.5          82.5
Oboe           40                   48.2          80.8
Flute          29                   47.5          77.5
Trumpet        26                   51.0          81.0
Clarinet       34                   47.7          78.5
Viola          32                   40.0          66.7
Piano          64                   47.2          74.1
Guitar         48                   44.7          76.6
TOTAL          340                  46.1          76.6
Table 5.1: Partial estimation results, comparing the average percentage of partials detected from the real
signal and the analytical representation using the Gabor model with Hamming basis functions.
prior knowledge of the instrument and the fundamental frequency of the note. For each extract, we record the
number of partials detected by the model compared to the ground truth number of partials, and express the
ratio as a percentage. The partial estimation results are grouped by musical instrument, and the percentage
of partials detected is averaged over the extracts for that instrument (see footnote 3). The results are presented in Table
5.1 on page 74 and Table 5.2 on page 75 for a variety of instruments over a range of pitches from the McGill
database.
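The averaging can be made explicit with a short sketch. It reproduces the two-extract oboe illustration given in footnote 3, and also checks that the TOTAL row of Table 5.1 is consistent with a mean over instruments weighted by the number of extracts; the function names are assumptions of this example.

```python
def mean_percentage_detected(extracts):
    """Average of per-extract percentages; extracts is a list of (detected, ground_truth)."""
    percentages = [100.0 * detected / truth for detected, truth in extracts]
    return sum(percentages) / len(percentages)


def weighted_total(rows):
    """Extract-weighted mean of per-instrument averages; rows is a list of (count, percentage)."""
    total_count = sum(count for count, _ in rows)
    return sum(count * pct for count, pct in rows) / total_count


# Footnote 3 illustration: 6 of 10 and 9 of 12 partials detected.
oboe_example = mean_percentage_detected([(6, 10), (9, 12)])

# "Real Signal" column of Table 5.1 as (number of extracts, average percentage).
real_signal_rows = [(31, 37.3), (36, 50.5), (40, 48.2), (29, 47.5), (26, 51.0),
                    (34, 47.7), (32, 40.0), (64, 47.2), (48, 44.7)]
real_signal_total = weighted_total(real_signal_rows)
```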
The first set of results compares choosing to apply the model to the original real signal or to the Hilbert
transform of the signal. In Table 5.1 on page 74 we use the Gabor model with Hamming window basis
functions, and compare the number of partials modelled using the original signal and using its Hilbert
transform. It is clear from these results that applying the Hilbert transform, which reduces some of the
modelling ambiguity in the sinusoidal representation of the signal, increases the number of partials correctly
estimated from the note.
The second comparison we make is between the choice of basis function, or the use of the state space
model, when applied to the analytical representation of the signal. Partial estimation results are presented
in Table 5.2 on page 75 comparing the number of partials modelled using the Hamming, Gaussian and sinc
basis functions and the state-space model. The difference in the number of partials correctly estimated is
overall much less dramatic. The Gaussian basis and the state-space model perform slightly worse than the
Hamming and sinc bases, which themselves mostly return a consistent number of partials. However, as
motivated in Section 5.2, the performance for certain types of instruments can be improved by the choice of
model. For violin and viola notes, which are played with substantial vibrato, the sinc basis model detects
more partials than the other models. The spectral energy of the higher order partials of these notes is therefore spread over
a wide range of frequencies, and it appears that only the sinc basis model is correctly detecting and modelling these frequency
modulations. This can be explained by the explicit bandwidth constraint on frequency and amplitude
modulations when using a sinc basis. For piano and guitar notes, which are played percussively and allowed
to decay over time, the state-space model, with a constant damping ratio across the length of note, detects
[Footnote 3: As an illustration, if we have two oboe extracts, the first having 10 partials of which 6 were detected by the model (60%), and
the second having 12 partials of which 9 were detected by the model (75%), then we record the average percentage of partials detected
for this model and instrument as 67.5%.]
                                    Average percentage of partials detected
Instrument     Number of Extracts   Hamming   Sinc   Gaussian (α = 2.5)   State-space
Violin         31                   73.3      84.4   59.3                 59.8
French Horn    36                   82.5      83.9   69.2                 70.2
Oboe           40                   80.8      74.7   69.6                 68.1
Flute          29                   77.5      79.9   67.9                 70.9
Trumpet        26                   81.0      76.5   70.3                 60.8
Clarinet       34                   78.5      80.0   68.4                 71.9
Viola          32                   66.7      78.3   61.7                 61.3
Piano          64                   74.1      73.2   70.3                 90.7
Guitar         48                   76.6      75.4   69.2                 90.4
TOTAL          340                  76.6      77.8   67.7                 74.4
Table 5.2: Partial estimation results for different instruments, comparing Hamming, sinc and Gaussian basis
functions and the state-space model with constant damping ratio, on the analytical representation of the
signal.
nearly all of the partials of these notes.
To conclude, we have found that using the Hilbert transform to calculate the analytical representation of
a signal, and using this representation for modelling, increases the number of partials that are detected and
modelled correctly in single harmonic notes. The additional computation required to compute the transform
is small, and we expect this to improve the accuracy of polyphonic music transcription by capturing more higher order
structure in harmonic notes.
We have also shown that in most cases there is little difference in the partial estimation performance
of different basis functions, or using the state-space model. Therefore, we suggest that, unless there is
appreciable vibrato or other modulations, Hamming basis functions should be used as they have finite
support and therefore it is more efficient to compute and invert the basis. In cases where the modulation
bandwidth is known a priori and is sufficiently large, sinc basis functions are attractive as we have observed
that higher order partials are modelled correctly, whereas other basis functions will typically either split a
higher order partial into two adjacent frequencies, or fail to detect the frequency. For instruments with an
approximately constant damping ratio over the length of the note, we have shown that the state-space model
defined in this chapter is appropriate for modelling the higher order partials of the signal.
5.5.2 Multiple F0 Estimation
In this section, we present results for polyphonic transcription where the number of notes is known a
priori. This is a limited application, as this situation is rare in practice. Similar implementations of Bayesian
inference for these harmonic models overestimate the number of notes, and have to resort to a somewhat
unsatisfactory heuristic to determine whether the energy of a modelled note is too low to be significant. In
our research with these models, we have experienced the same problem. However, we have also determined
that the difficulty does not lie with the model, but with the estimation accuracy of the partial frequencies. We
have made the following observations:
1. The local maximum of the likelihood of a single partial frequency also minimizes the residual signal in the
vicinity of that frequency. Correctly minimizing the residual of the signal around a partial frequency
                                          Notes in Mixture
Evaluation Metric   Model                 1     2      3      4
% octave error      Gabor                 0     2.8    11.1   10.2
                    Davy et al. [2006]    0     10.3   17.8   9.3
                    Klapuri [2008]        0     13.6   19.0   22.2
% pitch error       Gabor                 0     8.3    15.6   18.6
                    Davy et al. [2006]    0     5.1    7.2    19.7
                    Klapuri [2008]        0     1.4    6.0    10.3
% total error       Gabor                 0     11.1   26.7   28.8
                    Davy et al. [2006]    0     15.4   25.0   29.0
                    Klapuri [2008]        0     15.0   25.0   32.5
Table 5.3: Polyphonic pitch estimation using the Bayesian harmonic model. The total error for the system
presented in this chapter is split into errors when the pitch estimated is an octave above or below the ground
truth, and errors when the estimated pitch is not an octave error.
reduces the chance of spurious partial detections to either side of the original frequency, which lead
to duplicate notes being estimated.
2. Due to frequency and amplitude modulations in the partial, the maximum likelihood frequency may
differ by up to 3 Hz from the ideal harmonic frequency for theoretically harmonic instruments, and by
a comparable amount from the local maxima of the periodogram estimator. This means that global
frequency proposals based on these may produce inaccurate initial frequency estimates.
3. The difference between an initial frequency estimate and the maximum likelihood frequency is too large
for the constant variance random walk MH proposal distribution used in our algorithm to both arrive
near the optimum and accurately estimate its value.
We illustrate these observations in Figure 5.3 on page 77 for a single partial frequency estimated firstly from
the local maximum of the periodogram, and then by minimizing the residual of the signal.
We obtained some improvements by adapting the random walk MH proposal to progressively reduce
its variance, so that larger jumps in frequency were followed by smaller steps to explore the local maximum.
It was also necessary to reduce the ratio of global to local MH proposals so that enough time was given
to search for the maximum before losing the current frequency found in a global jump. However, this also
increased the computation time, and restricted the algorithm's ability to explore the entire spectrum
using global proposals. In the next chapter, we show the benefits of estimating the partial frequencies to
high accuracy, and how it can be used to effectively estimate the number of notes. As future lines of research
using these Bayesian harmonic models with reversible-jump MCMC, we suggest either using a Hamiltonian
Monte Carlo scheme to use derivative information in the likelihood to arrive at the local maximum more quickly, or
using the numerical methods described in Chapter 6 as the basis of a proposal distribution, and balancing this
correctly with global proposals.
To complete this chapter we present the estimation results where the number of notes is known, so that
they can be compared with other transcription systems using this metric. To compare objectively against
prior work, we use the data set of Davy et al. [2006] which consists of 20 short monophonic signals and
20 short polyphonic signals for each of 2, 3 and 4 note mixtures, taken from a limited set of instruments
[Figure 5.3(a): periodogram over 400 to 600 Hz showing the signal, the partial estimate and the residual.]
(a) Estimate of the partial frequency from the maxima of the periodogram. The residual of the resulting signal has two prominent peaks on either side of the detected partial frequency. Subsequent iterations of the algorithm will identify these peaks as additional partials, whereas these peaks are in reality due to amplitude and frequency modulations in the original partial.
[Figure 5.3(b): periodogram over 400 to 600 Hz showing the signal, the partial estimate and the residual.]
(b) Maximum likelihood estimation of a partial frequency using the signal model. This estimate minimizes the residual of the signal. The deviation between the peak of the periodogram and the maximum likelihood estimate is 2.2 Hz.
Figure 5.3: Comparison of the residual signal when using a periodogram estimate and maximum likelihood estimate of the frequency of a partial of a musical note. The estimated frequency is marked on the periodogram estimate of the signal.
from the McGill database. As this set of notes is only for sustained note instruments and does not include
piano or guitar notes, we only use the Gabor model with Hamming basis functions for our comparison.
Additionally we have prepared an implementation of the auditory model based system of Klapuri [2008]
as a state-of-the-art polyphonic transcription system to compare against. We present our results in Table 5.3 on
page 76. The results show an improvement in transcription accuracy for two note mixtures when using
the analytical representation, and we suggest that this is due to the improved estimation of higher partial
frequency structure demonstrated in the previous section. However, for three and four note mixtures, the
performance is not appreciably different to prior work of Davy et al. [2006], although there is an improvement
in accuracy over the auditory model system for 4 note mixtures. We suggest that as the inference algorithm
is mostly identical, the inference algorithm is limiting the performance in these cases. During this work, we
took the opportunity to study the reversible jump MCMC algorithm in progress, observing the current state
of the fitted model after each iteration. We observed that in many of the situations where transcription
errors occurred, the algorithm did reach the correct configuration of notes at some point, but was not able
to sustain this state due to inaccuracies in the frequency estimates.
5.6 Conclusion
In this chapter we have described and developed signal models for pitched, harmonic instruments from first
principles. Our primary motivation has been to investigate the modelling of different forms of frequency
and amplitude variations throughout the length of a musical note with a nominally constant frequency. A
damped amplitude envelope, where the oscillations decay exponentially in time, was found to fit naturally
in a state-space formulation, appropriate for percussive instruments such as the piano and guitar. A limit
on the bandwidth of frequency and amplitude modulations of the note was modelled intuitively as a Gabor
basis, where the window function was sinc in shape, appropriate for held-note instruments such as the bowed
string and woodwind families.
The formulation of these models was deliberately chosen so that well-known and understood Bayesian
priors and inference algorithms could be applied, complementing and building on existing work. The
improvements in this chapter, as they are in the context of these models, may therefore be conversely applied
to the original models, algorithms and applications which inspired them. For example, choosing to model
the analytical representation of the observed signal revealed some improvements to the inference algorithms
for these models. The damped envelope model, which can be treated as a linear dynamical system when
adding Gaussian noise to the state and the observation processes, allows the posterior distribution of the
partial frequencies and damping ratio to be computed in closed form if the prior is Gaussian, or otherwise
a good proposal distribution to be constructed for an MCMC inference scheme. The Gabor basis model is
a general linear model, with a well studied prior structure for the basis coefficients given the frequencies.
We were able to derive the mode of the signal-to-noise ratio parameter in this model, which means that the
amplitudes and noise variance parameters do not need to be simulated in an MCMC scheme. This reduces
the computational cost and complexity of the algorithm required, and also reduces the dimensionality of the
target posterior distribution.
The models and the inference algorithms developed for them were then applied to monophonic and
polyphonic transcription problems. In the case of single notes playing, we focused on how many of the harmonic
partial frequencies present were detected and modelled. We saw that using the analytic representation of the
signal resulted in significantly more partials being detected, and conclude that the reduction in the ambiguity
of instantaneous phase and amplitude afforded by this representation is beneficial for signal model based
inference methods. We also present polyphonic transcription results for the case where the number of notes
is known, and compare with prior work. We found that the new models improve transcription performance
for two-note mixtures when compared to prior work, but the performance for more complicated mixtures is
limited by the inference algorithm's ability to estimate the partial frequencies accurately enough that spurious partial
detections are avoided. The benefits of increasing the accuracy of partial frequency estimation are shown in the
next chapter, where we simplify the polyphonic inference algorithm to a two-stage process, estimating the
partial frequencies first, and inferring the harmonic structure second, to improve transcription accuracy
for higher numbers of notes, and also correctly estimate the number of notes playing in the mixture.
Chapter 6
Multiple Pitch Estimation using Non-homogeneous Poisson Processes
Point estimates of the parameters of partial frequencies of a musical note are modelled as realizations from
a non-homogeneous Poisson process defined on the frequency axis. When several notes are combined, the
processes for the individual notes combine to give a new Poisson process whose likelihood is easy to compute.
This model avoids the data association step of linking the harmonics of each note with the corresponding
partials and is ideal for efficient Bayesian inference of unknown multiple fundamental frequencies in a polyphonic
mixture of notes.
6.1 Introduction
By observing the periodogram of a polyphonic mixture of notes, a trained observer can estimate the partial
frequencies present in the signal from the localized peaks in the spectrum, and then suggest fundamental
frequencies by observing that some of the partial frequencies are regularly spaced along the frequency axis.
For example, peaks in the spectrum at 440, 880, 1320 Hz and so on suggest a fundamental frequency of 440
Hz.
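The visual reasoning in this example can be mimicked with a toy calculation: take the detected peak frequencies, sort them, and use the typical spacing between neighbours as a candidate fundamental. This is only an illustration of the observation above and is not the Poisson process model developed in this chapter; the function name and the use of the median spacing are assumptions of the example.

```python
import statistics


def fundamental_from_peaks(peaks_hz):
    """Candidate fundamental: the median spacing between adjacent spectral peaks."""
    peaks = sorted(peaks_hz)
    spacings = [b - a for a, b in zip(peaks, peaks[1:])]
    return statistics.median(spacings)


candidate = fundamental_from_peaks([440.0, 880.0, 1320.0])
# A single missed harmonic (1320 Hz here) leaves the median spacing unchanged.
candidate_with_gap = fundamental_from_peaks([440.0, 880.0, 1760.0, 2200.0])
```

Such a heuristic breaks down as soon as partials from several notes interleave, which is one motivation for the probabilistic treatment that follows.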
In the author's experience, using the periodogram to transcribe mixtures of notes is more reliable and
quicker than listening to the mixture. This method also outperforms automated transcription systems such
as the signal models described in the previous chapter and state-of-the-art auditory systems, especially
avoiding octave errors which plague other systems. One of the goals of this chapter is to investigate and
propose models for this method in order to improve the accuracy of polyphonic transcription.
Two assumptions about the transcription process are made. The first is that the observer does not
change his or her estimates of the partial frequencies when attempting to find a set of notes which fits the
observations. In plain terms, the observer is trying to fit a model to the observations, incorporating errors
in the partial frequency estimates into the prior, rather than fitting the observations to the model. This
motivates a two-stage process where the partial frequencies are estimated first, and then a harmonic model
is fitted to the frequencies. A prior model on the partial frequencies is still required however, as the observer
may know the range of fundamental frequencies that can be produced by the instrument for example, but
this prior must also be defined when the number of notes in the mixture is not known.
The second assumption is that the spectral shape in the vicinity of a peak is important to the estimation
of partial frequencies, whereas only the frequencies and sometimes the amplitudes of the partials are required
for transcription. The spectral shape sometimes allows us to distinguish between merged harmonics of two
or more notes. There are various cases where simply picking peaks of the spectrum above an adaptive noise
floor is inadequate, and these cases are often the cause of transcription errors. The notes of chords in music
often have overlapping harmonics, which may not be manifested as separate peaks but to the observer are
obvious because of differences in spectral shape. The spectral shape also helps distinguish between noise
or artifacts in the signal and genuine partial frequencies, reducing spurious detections of partials which can
lead to over or under-reporting of the number of notes playing. We will use an explicit signal model with a
prior on the expected spectral shape of harmonic notes to accurately estimate partial frequencies.
We do not assume that the partial estimation procedure is perfect however, and therefore need a
transcription system which is capable of dealing both with missed and duplicated partial detections. The solution
we present in this chapter is to use an iterative algorithm based on the signal model presented in the previous
chapter to provide high quality estimates of the partial frequencies, and to model the prior on the frequency
estimates as a non-homogeneous Poisson process. Choosing to use a signal model rather than a heuristic
estimation scheme for the partial frequency estimation is advantageous as present and future improvements
to that model will also benefit the estimation procedure here. However, it is also permissible to use other
methods to estimate the partial frequencies, as was carried out previously using periodogram peak picking
[Peeling et al., 2007b] and subspace methods [Peeling et al., 2007a]. In these cases, the prior on the
frequencies needs to reflect the estimation procedure, for example including a uniform clutter process across the
frequency axis if many spurious partials are detected.
The structure of this chapter is as follows. In Section 6.2 we introduce the properties of non-homogeneous
Poisson processes and how to calculate the likelihood given a set of observed frequencies. In Section 6.3 priors
for harmonic models are discussed, and suggestions for how these priors should be modified for different
partial estimation methods are given. In Section 6.4 a general method for making partial estimates from a
signal model is presented. Transcription results for polyphonic mixtures of notes are presented in Section 6.5
and are compared with the previous chapter and prior work. Conclusions and suggestions for future research
are given in Section 6.6.
6.2 Non-homogeneous Poisson Processes
In this section we define and describe a non-homogeneous Poisson process model [Cox and Isham, 1980]. A
homogeneous Poisson process is a stochastic process, defined usually over time, where the number of events
N (b)−N (a) occurring between time a and time b has a Poisson distribution with rate parameter λ:
\[
P\left(N(b) - N(a) = k \mid \lambda\right) = \frac{\exp\left(-\lambda (b - a)\right) \left(\lambda (b - a)\right)^{k}}{k!}
\]
λ is the expected number of events per unit of time, and is constant for a homogeneous Poisson process. A
non-homogeneous Poisson process generalizes this by allowing the rate parameter to vary with time.
The principal ideas behind the model are explained by considering a model based solely on the frequency
estimates of multiple partials in a musical signal.
[Figure 6.1: Probability mass function for the Poisson distribution (horizontal axis k, vertical axis P(k))]
6.2.1 Frequency-Domain Process
We define the number of partial estimates as a non-homogeneous Poisson process on the frequency axis.
Let N (f) be the number of partial estimates observed in the frequency range (0, f ]. We assume that the
number of partial estimates in a particular interval (a, b] of the frequency axis has a Poisson distribution
with parameter λa,b. The number of partials is given by N (b)−N (a) and has probability distribution
\[
P\left(N(b) - N(a) = k \mid \lambda_{a,b}\right) = \frac{\exp\left(-\lambda_{a,b}\right) \left(\lambda_{a,b}\right)^{k}}{k!} \tag{6.1}
\]
We interpret λa,b as the expected number of partials occurring in (a, b]. Figure 6.1 on page 82 shows the
probability mass function for the Poisson distribution in (6.1) with λa,b = 1. We expect to observe one
partial in the region (a, b]. This region could for instance be a DFT bin at a harmonic frequency of a musical
note. The probability mass for observing zero and one partial in the region are equal when λa,b = 1.
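As a quick numerical check of (6.1), the following minimal Python sketch evaluates the Poisson probability mass function for λa,b = 1 and confirms that zero and one partial are equally probable. The function name `poisson_pmf` is ours and is introduced only for illustration.

```python
import math


def poisson_pmf(k: int, lam: float) -> float:
    """Probability of observing k partials when lam partials are expected, as in (6.1)."""
    return math.exp(-lam) * lam ** k / math.factorial(k)


# Probabilities of observing 0, 1, ..., 10 partials in (a, b] when lambda_{a,b} = 1.
pmf = [poisson_pmf(k, 1.0) for k in range(11)]
```

The first two entries of `pmf` are both exp(−1), and the remaining mass decays rapidly with k, in agreement with Figure 6.1.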
Under the assumptions of a Poisson process, we write λa,b in terms of a continuous rate function λ (f)
\[
\lambda_{a,b} \equiv \int_{a}^{b} \lambda(f) \, df
\]
The rate function λ (f) of the Poisson process describes the expected concentration of partial frequencies
along the frequency axis. For a harmonic musical note, we would expect the rate function to be large around
the fundamental frequency and harmonics of the note, and small but non-zero elsewhere to allow for spurious
partial detections and transient effects.
For (6.1) to be valid for any values of a and b, there are two requirements. First, no two estimates may
have exactly the same frequency. The signal model or partial estimation scheme used to observe partial
positions must not provide estimates with exactly the same frequency: there must be a non-zero
interval between successive frequencies. It is a property of the signal models we use in Section 6.4 that
two partials will never be estimated with exactly the same frequency, as this would lead to the basis
functions being linearly dependent. Two basis functions with the same frequency may always be combined
into a single basis function.
The second requirement is that the process is memoryless: the probability of a number of partials
occurring in any region of the frequency axis must be independent of the occurrence of partials in any other
region disjoint from that region.¹ This requires that λ (f) contains all of the prior information about the
occurrence of partials. Modelling the occurrence of partials as a Poisson process makes the model robust
to missing or duplicate partial detections. In contrast, harmonic models such as those described in 5.4.2 require
the existence of a single partial frequency in every harmonic position modelled, and therefore an entire note
may not be detected due to a single missing partial frequency.
6.2.2 Superposition
One of the key attractions of using a Poisson process model to model partial estimates is that the observation
of multiple Poisson processes superimposed on the same axis is also a Poisson process. Moreover, the rate
function of the combined process is formed from the summation of the individual rate functions. Formally
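we have M Poisson processes N1 (f) , . . . , NM (f) with rate functions λ1 (f) , . . . , λM (f), and we observe
\[
N_{1:M}(f) = \sum_{m=1}^{M} N_m(f)
\]
Then
\[
P\left(N_{1:M}(b) - N_{1:M}(a) = k \mid \lambda^{(1:M)}_{a,b}\right) = \frac{\exp\left(-\lambda^{(1:M)}_{a,b}\right) \left(\lambda^{(1:M)}_{a,b}\right)^{k}}{k!}
\]
\[
\lambda^{(1:M)}_{a,b} = \int_{a}^{b} \sum_{m=1}^{M} \lambda_m(f) \, df \tag{6.2}
\]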
Note that in observing N1:M (f) we lose labeling information, i.e., which Poisson process m generated each
partial. This makes the likelihood (6.2) easy to compute. Inferring the actual labels of the partials, for
example in a source separation setting, cannot be carried out using the superimposed process alone. However,
the labels may be inferred in a probabilistic manner using a likelihood function based on the individual
rate functions for each note.
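The superposition property can be illustrated with a short Python sketch. It is a minimal example under stated assumptions: the two single-bump rate functions, their centres, spreads and weights are hypothetical, and the integral over (a, b] is approximated with a simple midpoint rule.

```python
import math
from typing import Callable, Sequence

Rate = Callable[[float], float]


def gaussian_bump(centre: float, spread: float, weight: float) -> Rate:
    """Rate function with `weight` expected partials concentrated around `centre`."""
    def rate(f: float) -> float:
        z = (f - centre) / spread
        return weight * math.exp(-0.5 * z * z) / (spread * math.sqrt(2.0 * math.pi))
    return rate


def superpose(rates: Sequence[Rate]) -> Rate:
    """Rate function of the superimposed process: the sum of the individual rates."""
    return lambda f: sum(rate(f) for rate in rates)


def expected_count(rate: Rate, a: float, b: float, steps: int = 4000) -> float:
    """Midpoint-rule approximation of the integral of `rate` over (a, b]."""
    width = (b - a) / steps
    return width * sum(rate(a + (i + 0.5) * width) for i in range(steps))


note_1 = gaussian_bump(centre=440.0, spread=5.0, weight=1.0)  # hypothetical note
note_2 = gaussian_bump(centre=660.0, spread=5.0, weight=0.5)  # hypothetical note
combined = superpose([note_1, note_2])

lam_1 = expected_count(note_1, 400.0, 700.0)
lam_2 = expected_count(note_2, 400.0, 700.0)
lam_12 = expected_count(combined, 400.0, 700.0)

# Probability of observing no partials at all in (400, 700] under the combined process.
p_empty = math.exp(-lam_12)
```

The integrated rate of the combined process equals the sum of the individual integrated rates, so the probability of an empty interval factorizes into the product of the per-note probabilities without any reference to which note generated which partial.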
6.2.3 Evaluation of Likelihood
In this section we consider how to evaluate the likelihood of the occurrence of the entire set of observed
partial positions. Although we would naturally try to calculate the likelihood exactly, the method we choose
depends on how we observe the Poisson process. In this section, three methods are given for evaluating
the likelihood. The exact method in 6.2.3.1 should be applied when a signal model is used to estimate the
partial frequencies. The binning method in 6.2.3.2 is suitable when a periodogram peak picking method is
employed. If the peak picking method by design only detects zero or one peaks in each frequency bin, the
calculation should be modified to allow for the possibility that more than one partial frequency was present
in the bin. In this case, the method in 6.2.3.3 is appropriate.
¹ This does not contradict the previous requirement that partial occurrences may not have the same frequency, as identical
partial frequencies cannot be mapped to disjoint regions of the frequency axis.
6.2.3.1 Exact Calculation
When the partial estimates are known with sufficient accuracy, and their frequencies are distinct, the
likelihood of the occurrence of frequencies f1, f2, . . . , fN under a non-homogeneous Poisson process λ (f) on the
frequency axis between 0 and fs/2, where fs is the sampling frequency, is given by Crowder et al. [1991],
Meeker and Escobar [1998] as
\[
p\left(f_1, f_2, \ldots, f_N, N \mid \lambda(f)\right) = \exp\left(-\int_{0}^{f_s/2} \lambda(f) \, df\right) \prod_{n=1}^{N} \lambda(f_n) \tag{6.3}
\]
The derivation of the above likelihood is informally obtained firstly by noting that in the interval between
observed frequency fn and fn+1 there are no observations. Hence, using (6.2) and substituting k = 0, each
such interval has probability $\exp\left(-\lambda_{f_n,f_{n+1}}\right) = \exp\left(-\int_{f_n}^{f_{n+1}} \lambda(f) \, df\right)$. At each observed frequency fn, the
probability, using (6.2), of observing k = 1 is given by λ (fn). We also take into account that no frequencies
were observed in the interval [0, f1) and (fN , fs/2]. As a Poisson process requires that the observations in
disjoint intervals of the frequency axis must be independent, we simply combine the probabilities of these
observations together by multiplying them, thus:
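\[
\begin{aligned}
p\left(f_1, f_2, \ldots, f_N, N \mid \lambda(f)\right)
&= \exp\left(-\int_{0}^{f_1} \lambda(f) \, df\right) \exp\left(-\int_{f_N}^{f_s/2} \lambda(f) \, df\right) \\
&\quad \times \prod_{n=1}^{N-1} \exp\left(-\int_{f_n}^{f_{n+1}} \lambda(f) \, df\right) \times \prod_{n=1}^{N} \lambda(f_n) \\
&= \exp\left(-\int_{0}^{f_s/2} \lambda(f) \, df\right) \prod_{n=1}^{N} \lambda(f_n)
\end{aligned}
\]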
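A minimal Python sketch of the exact likelihood (6.3), evaluated in the log domain, is given below. The rate function used here (bumps of weight 1/h at each harmonic plus a constant clutter level) and all numerical values are hypothetical choices for illustration, and the integral is approximated by a midpoint rule rather than computed in closed form.

```python
import math
from typing import Callable, Sequence


def harmonic_rate(f0: float, f_max: float, spread: float = 2.0,
                  clutter: float = 1e-4) -> Callable[[float], float]:
    """Hypothetical rate: a bump of weight 1/h at each harmonic h*f0 below f_max, plus clutter."""
    harmonics = range(1, int(f_max // f0) + 1)
    norm = spread * math.sqrt(2.0 * math.pi)

    def rate(f: float) -> float:
        bumps = sum(math.exp(-0.5 * ((f - h * f0) / spread) ** 2) / (h * norm)
                    for h in harmonics)
        return clutter + bumps
    return rate


def exact_log_likelihood(freqs: Sequence[float], rate: Callable[[float], float],
                         f_max: float, steps: int = 20000) -> float:
    """Logarithm of (6.3): minus the integrated rate plus the sum of log-rates at the observations."""
    width = f_max / steps
    integral = width * sum(rate((i + 0.5) * width) for i in range(steps))
    return -integral + sum(math.log(rate(f)) for f in freqs)


observed = [440.0, 880.0, 1320.0]  # partial frequency estimates in Hz
ll_440 = exact_log_likelihood(observed, harmonic_rate(440.0, 4000.0), 4000.0)
ll_450 = exact_log_likelihood(observed, harmonic_rate(450.0, 4000.0), 4000.0)
```

With these illustrative settings the candidate fundamental of 440 Hz receives a much higher log-likelihood than 450 Hz, because under the latter every observation can only be explained by the clutter level.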
6.2.3.2 Binning
The likelihood when observations are grouped into non-overlapping regions (bins) of the frequency axis may
be calculated as follows. Assume we have F such bins, spanning frequency intervals A1, . . . , AF , and denote
the number of observations in each bin by Nf . We then have, by the independence of intervals in a Poisson
process,
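\[
P\left(N_1, \ldots, N_F \mid \lambda_1, \ldots, \lambda_F\right) = \prod_{f=1}^{F} P\left(N_f \mid \lambda_f\right) = \prod_{f=1}^{F} \frac{\exp\left(-\lambda_f\right) \left(\lambda_f\right)^{N_f}}{N_f!} \tag{6.4}
\]
\[
\lambda_f = \int_{A_f} \lambda(f) \, df
\]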
The advantage of this method over the exact calculation method is that the rate function λf may be computed
in advance for each bin f before the partial frequencies are estimated, which reduces the computation required
when evaluating the likelihood for multiple frames of music. Often the bins will coincide with the frequencies
of the DFT used to estimate the partial frequencies.
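The binned likelihood (6.4) reduces to a sum of independent Poisson log-probabilities, as in the following Python sketch. The integrated bin rates and counts are hypothetical values; in practice the rates would be computed once from the rate function and reused for every frame.

```python
import math
from typing import Sequence


def binned_log_likelihood(counts: Sequence[int], bin_rates: Sequence[float]) -> float:
    """Logarithm of (6.4): independent Poisson counts, one per frequency bin."""
    return sum(-lam + n * math.log(lam) - math.lgamma(n + 1)
               for n, lam in zip(counts, bin_rates))


bin_rates = [0.5, 1.0, 2.0]  # hypothetical integrated rates lambda_f for three bins
counts = [0, 1, 2]           # number of partial estimates N_f observed in each bin
log_lik = binned_log_likelihood(counts, bin_rates)
```

Working in the log domain avoids underflow when the number of bins is large, and `math.lgamma(n + 1)` supplies log(Nf!) without forming the factorial explicitly.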
6.2.3.3 Censored Frequencies
The partial estimation method used may only indicate that there is a partial in a frequency bin or not. An
example is a single step peak picking scheme, which selects all the spectrum bins with amplitudes larger than
neighbouring bins and above a noise threshold. It is possible that multiple frequencies are present within
the region of the frequency axis covered by a single observation bin, for example in the case of overlapping
harmonics. Although we have only `observed' at most one frequency per observation bin, we wish to allow
for the possibility that more than one frequency could be present in each bin. This is useful in practice when the
rate function of the Poisson process is a superposition of the rate functions of harmonically related notes.
For every harmonic that overlaps within the region of a single bin, we would expect two or more partial
frequencies to occur within that bin. Thus we are asserting that an observed peak in the spectrum implies the
existence of one or more partial frequencies in that bin, and no observed peak implies that no partial frequencies
were present in the bin.
For the observations to be valid as a Poisson process, when a peak is detected in a bin, we calculate
the probability that one or more frequencies were observed in that bin, i.e., p (Nf ≥ 1) = 1− p (Nf = 0) =
1 − exp (−λf ). When a peak is not detected in a bin, the probability is given by p (Nf = 0) = exp (−λf ).
The likelihood over all the frequency bins is thus given by
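\[
\prod_{f=1}^{F}
\begin{cases}
1 - \exp\left(-\lambda_f\right) & \text{peak observed in bin } f \\
\exp\left(-\lambda_f\right) & \text{no peak observed in bin } f
\end{cases}
\tag{6.5}
\]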
The likelihood calculation in this case is the same as a set of Bernoulli trials with probability 1− exp (−λf ).
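The censored likelihood (6.5) can be sketched in Python as follows. The peak indicators and bin rates are hypothetical values chosen for illustration.

```python
import math
from typing import Sequence


def censored_log_likelihood(peaks: Sequence[bool], bin_rates: Sequence[float]) -> float:
    """Logarithm of (6.5): each bin reports only whether at least one partial was present."""
    total = 0.0
    for peak, lam in zip(peaks, bin_rates):
        if peak:
            # log(1 - exp(-lam)); expm1 keeps precision when lam is small.
            total += math.log(-math.expm1(-lam))
        else:
            total += -lam
    return total


peaks = [True, False, True]   # hypothetical peak-picking output, one flag per bin
bin_rates = [1.0, 0.5, 2.0]   # hypothetical integrated rates lambda_f
log_lik = censored_log_likelihood(peaks, bin_rates)
```

Each bin contributes one Bernoulli term, so bins in which several harmonics overlap are not penalized for reporting only a single peak.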
6.3 Bayesian Priors
Bayesian inference for the non-homogeneous Poisson process involves treating the rate function λ (f) or
λ (x, f) for a vector valued process as unknown and placing a prior distribution on the rate function. A
suitable choice of prior depends greatly on the observation method. If we are binning the observations into
fixed, a priori intervals, then we can see from (6.4) that we need to infer each of the unknown parameters λf
of the model rather than the full rate function λ (f). However, if we consider each observation and evaluate
intervals between partials, then our target for inference is the rate function λ (f).
A full non-parametric Bayesian inference of the rate function of a Poisson process is carried out in Adams
et al. [2009]. A Gaussian process is used as a prior, which is transformed into a rate function using a sigmoid
function. The inference is tractable, however it is not immediately clear how higher-level information, such
as partials occurring at harmonic positions, could be structured into such a prior.
An interesting alternative would be to use a periodic Poisson process [Dimitrov et al., 2004]. In this
model, the rate function is a periodic function along the axis. For our model, the period of the rate function
would be the fundamental frequency of the musical note.
Here however we pursue two designs of Bayesian prior which may be inferred tractably and are amenable
to additional, higher level, prior structure.
6.3.1 Fixed Bins
When observations are grouped into fixed bins, then the model parameters are a finite set of positive values
λf . Each λf is the intensity parameter of a Poisson distribution, for which the conjugate prior choice is the
Gamma distribution:
p (λf ) = G (αf , βf ) (6.6)
The posterior distribution when observing Nf is
p (λf |Nf ) = G (αf +Nf , βf + 1)
We can integrate the unknown λf to obtain a negative binomial (Pascal) distribution (Figure 6.2 on page
87)
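\[
p\left(N_f\right) = \frac{\Gamma\left(\alpha_f + N_f\right)}{N_f! \, \Gamma\left(\alpha_f\right)} \, p_f^{\alpha_f} \left(1 - p_f\right)^{N_f} \tag{6.7}
\]
\[
p_f = \frac{\beta_f}{\beta_f + 1}
\]
where βf is the rate (inverse scale) parameter of the Gamma prior, consistent with the posterior update βf + 1 above.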
Figure 6.2 on page 87 shows the prior distribution on expected number of partials λf (6.6) with αf = 2, βf = 1
and corresponding marginal distribution (6.7) on observed number of partials Nf .
The hyperparameters may be optimized for the purposes of training. For example, to train the hyperpa-
rameters of the rate function for a particular musical instrument and pitch, we would use I example frames
of data and estimate the partial frequencies in each frame, obtaining a set of observations $N_f^{(1)}, \ldots, N_f^{(I)}$ for
each frequency bin. The posterior of the rate function given these observations is
\[
p\left(\lambda_f \mid N_f^{(1)}, \ldots, N_f^{(I)}\right) = G\left(\alpha_f + \sum_{i=1}^{I} N_f^{(i)}, \; \beta_f + I\right) \tag{6.8}
\]
and the hyperparameters can thus be set to new values: $\alpha_f \to \alpha_f + \sum_{i=1}^{I} N_f^{(i)}$ and $\beta_f \to \beta_f + I$. The new
values of the hyperparameters can now be used as the prior (6.6) for when new frames of data are observed,
thus transparently incorporating training data into the Bayesian model.
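The conjugate update (6.8) and the resulting marginal distribution of the count in a bin can be sketched in Python as below. This is a minimal illustration which assumes that βf is the rate parameter of the Gamma distribution, as implied by the update βf → βf + I, and writes the marginal directly in terms of αf and βf; the training counts are hypothetical.

```python
import math
from typing import Sequence, Tuple


def update_gamma(alpha: float, beta: float, counts: Sequence[int]) -> Tuple[float, float]:
    """Posterior hyperparameters after observing one count per training frame, as in (6.8)."""
    return alpha + sum(counts), beta + len(counts)


def marginal_log_pmf(n: int, alpha: float, beta: float) -> float:
    """log p(N_f = n) after integrating the Poisson rate over a Gamma(alpha, rate=beta) prior."""
    return (math.lgamma(alpha + n) - math.lgamma(n + 1) - math.lgamma(alpha)
            + alpha * math.log(beta) - (alpha + n) * math.log(beta + 1.0))


# Prior used for Figure 6.2, followed by three hypothetical training frames for one bin.
alpha_0, beta_0 = 2.0, 1.0
alpha_1, beta_1 = update_gamma(alpha_0, beta_0, [1, 0, 2])

p_zero_before = math.exp(marginal_log_pmf(0, alpha_0, beta_0))
p_zero_after = math.exp(marginal_log_pmf(0, alpha_1, beta_1))
```

The updated pair (`alpha_1`, `beta_1`) can be passed straight back in as the prior for the next batch of frames, which is the sense in which training data is incorporated transparently.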
6.3.2 Gaussian Mixture Model
[Figure 6.2: Prior on expected number of partials, p(λf |αf , βf ), and marginal distribution of number of partials, p(Nf |αf , βf )]
In this section we model the entire rate function as a Gaussian mixture model (GMM). Modelling the rate
function as a GMM is a convenient method to use prior information concerning the partial frequencies of
harmonic instruments. The rate function is shaped by the probability density function of a Gaussian mixture
model:
\[
\lambda(f) = \sum_{h=1}^{H} c_h \, \mathcal{N}\left(\mu_h, h\sigma^2\right) \tag{6.9}
\]
\[
c_h \geq 0 \quad \forall h \tag{6.10}
\]
H denotes the number of mixture components. A meaningful interpretation of the above model is that H is
the number of harmonic positions for a note with fundamental frequency f0, and that a single component
of the mixture corresponds to a single harmonic h. We assign
\[
H = \left\lfloor \frac{f_s}{2 f_0} \right\rfloor
\]
where fs is the sampling frequency. The means of the components are set to the expected harmonic positions
µh ≈ hf0 and are allowed to deviate from their ideal positions to account for inharmonicity. σ² allows for
further spread around the harmonics, which may occur with split peaks or modulations in the signal. Finally
ch weights each harmonic, and we expect that low frequency partials have a higher probability of being
detected and hence have higher values of weighting ch.
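A minimal Python sketch of the rate function (6.9) is given below. Several choices are assumptions made only for illustration: the means are fixed at the ideal positions µh = hf0, the weights are set to ch = 1/h, σ² is interpreted as a variance in Hz², and the values of fs and σ² in the example are hypothetical.

```python
import math


def number_of_harmonics(f0: float, fs: float) -> int:
    """H = floor(fs / (2 f0)): harmonics of f0 that lie below the Nyquist frequency."""
    return int(fs // (2.0 * f0))


def gmm_rate(f: float, f0: float, fs: float, sigma2: float) -> float:
    """Evaluate (6.9) with mu_h = h f0, variance h * sigma2 and weights c_h = 1/h."""
    total = 0.0
    for h in range(1, number_of_harmonics(f0, fs) + 1):
        variance = h * sigma2
        density = (math.exp(-0.5 * (f - h * f0) ** 2 / variance)
                   / math.sqrt(2.0 * math.pi * variance))
        total += density / h
    return total


# Hypothetical settings: 440 Hz note, 8 kHz sampling, sigma2 = 25 Hz^2.
rate_at_fundamental = gmm_rate(440.0, 440.0, 8000.0, 25.0)
rate_between_harmonics = gmm_rate(660.0, 440.0, 8000.0, 25.0)
```

With these settings the note has nine components, the rate is large at the fundamental and essentially zero midway between harmonics; a small constant could be added to the returned value if spurious detections are to be allowed for.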
Inference of the unknown parameters in a Gaussian mixture model involves introducing labels for each
observation and using Expectation Maximization (EM). When we train our model by fitting the parameters
to the estimated partial frequencies of a set of frames of audio from a harmonic musical instrument with a
particular pitch, the values for σ2 are typically small so that there is negligible overlap between the mixture
components for different harmonic positions. Figure 6.3 on page 88 is provided as an example of this, where
we have chosen σ² = 10^{−4}, µh = hf0 and ch = 1/h as illustrative parameters. This assumption is also
used in [Davy et al., 2006] for the model of the detuning parameters for each partial frequency. In practice,
the Expectation step for training the GMM can be replaced with a K-means clustering step, where the
component means are set to the expected harmonic positions, which reduces the amount of computation
required for inferring the unknown parameters compared to the full Expectation step.
[Figure 6.3: Intensity function λ(f) for a musical note with fundamental frequency 440Hz parameterized using a
Gaussian mixture model with H = 9 harmonics (horizontal axis f, vertical axis λ(f))]
6.3.3 Model for mixture weights
The Gaussian mixture prior in the previous section allows for inharmonicity through the variance of each
mixture component. Rather than inferring the unknown parameters of the mixture model, we can also set
the parameters to particular values which match existing generative models of partial frequencies. The prior
model of Godsill and Davy [2005], which is also used in Chapter 5 can be adapted as a Poisson process
easily by interpreting the prior probability distribution over the number of partials and their frequencies as
a counting process of the number of partials along the frequency axis. The number of partials H per note is
modelled as Poisson distributed
\[
p(H) = \mathrm{Po}\left(H \mid \Lambda\right) = \frac{\Lambda^{H} e^{-\Lambda}}{H!}
\]
where Λ is the expected number of partials. The position of each partial frequency is normally distributed
around hf0, where h is the harmonic number and f0 is the fundamental frequency of the note.
To convert this to a Gaussian mixture model of the form (6.9), we note that each mixture weight ch
gives the expected number of partials around hf0. We interpret this as the probability under the generative
model that the number of partials is greater than or equal to h. Hence, following this model, the mixture weights
in (6.9) are given by
\[
c_h = \left(1 - \sum_{m=1}^{h} \mathrm{Po}\left(m - 1 \mid \Lambda\right)\right) p_{\mathrm{note}} \tag{6.11}
\]
pnote is the prior probability that the note is playing in the mixture, and is applied as a scaling to all of the
mixture weights for that model. $\sum_{m=1}^{h} \mathrm{Po}\left(m - 1 \mid \Lambda\right)$ is the cumulative Poisson distribution of observing up
to h−1 partials. ch, when calculated by (6.11), gives the probability of observing a partial at that frequency
under the prior model.
When we set µh = hf0 and σ² to 3 × 10^{−8} we obtain the multiplicative inharmonicity model suggested
in Godsill and Davy [2005].
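The mixture weights (6.11) can be computed with a running Poisson cumulative sum, as in the following Python sketch. The values of Λ, pnote and the number of harmonics in the example are hypothetical.

```python
import math
from typing import List


def poisson_pmf(k: int, lam: float) -> float:
    """Po(k | lam)."""
    return math.exp(-lam) * lam ** k / math.factorial(k)


def mixture_weights(num_harmonics: int, expected_partials: float, p_note: float) -> List[float]:
    """Weights c_1, ..., c_H from (6.11)."""
    weights = []
    cdf = 0.0
    for h in range(1, num_harmonics + 1):
        cdf += poisson_pmf(h - 1, expected_partials)  # P(number of partials <= h - 1)
        weights.append((1.0 - cdf) * p_note)
    return weights


# Hypothetical settings: 5 partials expected per note, prior note probability 0.5.
weights = mixture_weights(num_harmonics=9, expected_partials=5.0, p_note=0.5)
```

The weights decrease with harmonic number, which matches the expectation stated earlier that low frequency partials are more likely to be detected.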
6.4 Signal Model Based Partial Estimation
In previous work using non-homogeneous Poisson processes for polyphonic transcription, we have used several
methods for extracting the partial frequencies as a preprocessing step. In Peeling et al. [2007b] a heuristic
scheme for selecting peaks in the Fourier spectrum above an adaptive noise threshold was considered. Such
a scheme is quick and can detect the partial frequencies of high amplitude notes without difficulty. However
without an explicit signal model this scheme cannot differentiate between genuine partials and transient
features of the noise floor, and its performance is therefore limited.
In Peeling et al. [2007a] a matrix pencil scheme for estimating damped sinusoids was used to provide
frequency and amplitude data for partials. Matrix pencil schemes use eigenvalue analysis to decompose
the signal into a number of sinusoids, from which partial-frequency estimates can be obtained. This is an
improvement as an explicit signal model is being used, but the number of sinusoids needs to be supplied
to the estimation scheme. As this is not known a priori, the number of sinusoids has to be estimated
separately. Underestimating or overestimating the number of sinusoids can result in frequency estimates
that differ greatly from the true frequencies, leading to transcription errors, as the algorithm attempts to fit
an incorrect number of sinusoids to explain the data and adapts the frequencies accordingly.
The most satisfying partial estimation scheme we have investigated and present here is to apply an
existing probabilistic signal model, and to iteratively estimate the number of partials and their frequencies
accurately using a Bayesian model selection criterion for that signal model to determine when a suitable
number of partials have been detected. The scheme is similar in structure to Matching Pursuit [Mallat
and Zhang, 1993] which at each iteration selects a basis function from an over-complete dictionary of basis
functions such as Gabor functions (5.2.5) to match the residual of the signal. The procedure here differs from
Matching Pursuit in the following aspects: a set of basis functions with the same frequency is selected per
iteration; a local search is employed to identify a suitable frequency at each iteration, using the periodogram
as an initial estimator (6.4.3) and additionally numerical optimization to minimize the residual (6.4.4), rather
than a global search over the dictionary; and the g-prior is incorporated in the signal model which improves
the correct selection of the number of partials when using Bayesian model selection.
The frequency estimates obtained are much less sensitive to the modelled number of partials than for the
matrix pencil scheme. Moreover, future improvements in these signal models will also improve the quality
of the frequency estimates. Using a signal model selection criterion requires much more computation than
other schemes, but substantial improvements in transcription accuracy are obtained as a result.
Although we describe the algorithm making reference to the probabilistic signal model developed in
Chapter 5, in practice any suitable signal model with an accompanying model selection criterion can be used
for the partial estimation scheme component of this system.
6.4.1 Overview
To estimate partial frequencies using a signal model, we use a simple iterative approach. At the beginning of
an iteration, we have a set of partial frequencies, and using the signal model, we can assign some of the signal
to the model, and the remainder to a residual. The residual contains the remaining partials that we are
yet to detect. A model selection criterion is used to calculate how well the current set of partial frequencies
explains the observed signal. We then select a new frequency from the residual, and add this to the set of
partial frequencies. When the model selection criterion fails to improve by adding partial frequencies, the
algorithm is stopped.
In general, the best frequency to select from the residual is that which accounts for the most energy
in the residual. This approach naturally will tend to reduce the number of partials that are ultimately
selected. Selecting the maximum of the Fourier spectrum is clearly a good estimate of this frequency. We
also investigated using the maximum as a starting point, and using numerical methods to maximize the
energy removed from the residual. For a probabilistic signal model, this is equivalent to finding the MAP
frequency estimate [Bretthorst, 1989].
6.4.2 Bayesian Model Selection Criterion
To exactly estimate the number of partials using a probabilistic signal model requires a reversible jump
MCMC scheme, as covered in the previous chapter. However we have found that using a Bayesian model
selection criterion, which is an approximate means to compare models of different dimensions, produces
acceptable estimates of the number of partials. In Djuric [1996, 1993] a model selection criterion for the
estimation of complex valued sinusoids in white noise was derived:
\[
N \log\left(y^{\top} P y\right) + \frac{5k}{2} \log N + \log p\left(\theta\right) \tag{6.12}
\]
where k is the number of sinusoids and P has the same definition as in (5.23). p (θ) is the prior probability
of any additional model parameters. The appropriate number of sinusoids is that which minimizes the above
expression.
In the remainder of this chapter, we will use the Gabor model, using sinc basis functions, as described in
5.3.4. The only additional model parameter that it is necessary to infer is the expected signal-to-noise ratio
ξ which is assigned an inverse-Gamma prior p (ξ) ∼ IG (ξ;αξ, βξ) where αξ = 2 and βξ = 1 are chosen to
give this prior infinite variance and thus be uninformative.
We divide the signal into frames with 50% overlapping samples. The partial frequencies are estimated
in each frame separately. At each iteration we select and add a frequency to the set of partial frequencies
already estimated for that frame. If this addition increases the model selection criterion (6.12), then we
terminate at this point. Otherwise we update all of the estimated frequencies, and continue. The scheme is
described in more detail in Algorithm 6.1.
Algorithm 6.1 Partial estimation scheme for a frame of audio y with N samples
• Initialize: k ← 0, M0 = (N/2) log(y⊤y), r = y
• Repeat until Mk > Mk−1:
  – k ← k + 1
  – Estimate new partial frequency ωk from the residual r by the method in 6.4.3 or 6.4.4
  – Form the basis matrix: Dt,k = exp (iωkt), t = 1, . . . , N
  – Estimate signal to noise ratio, ξ∗ = arg maxξ p (ξ|y), and P
  – Mk = (N/2) log(y⊤Py) + (5/2) k log N + log p (ξ∗)
  – Calculate partial amplitudes: b = (1 + 1/ξ∗)^{−1} (D⊤D)^{−1} D⊤y
  – Update residual of signal: r ← y − Db
An important step in Algorithm 6.1 is the method by which the frequency value of a new partial is
estimated from the residual of the signal. In the remainder of this section, we investigate two methods for
producing an accurate frequency estimate from the residual of the signal at each iteration of the algorithm. In
Section 6.1 we stated that the prior on the partial frequency estimates practically depends on the estimation
scheme. For each method therefore, we present a rate function for single harmonic notes. For polyphonic
mixtures of harmonic notes, the rate functions can be superimposed (see 6.2.2). In Section 6.5 we describe
inference of polyphony using these rate functions, and compare the accuracy of both methods within the
partial estimation scheme for polyphonic music transcription.
6.4.3 Zero-Padding
The first method we investigated to estimate the value of a partial frequency in the residual was to zero pad
the residual to a length of 4N and find the maxima of the DFT spectrum. Zero padding interpolates the
DFT spectrum, increasing the number of discrete frequencies at which the partial frequency can be found.
An example of output of the partial estimation scheme in Algorithm 6.1 for a polyphonic note mixture is
provided in Figure 6.5 on page 94. We see that many of the partials visible in the spectrum of the signal
are detected over multiple frames with only a small number of additional spurious detections. Based on our
observation of the partial estimation results, we propose a novel parametric rate function which is designed to
robustly infer multiple fundamental frequencies when estimating partial frequencies iteratively from the DFT
spectrum. As the DFT spectrum gives discrete estimates of the frequency in bins, the likelihood function of
the observed partial frequency estimates should be calculated using the method described in 6.2.3.2.
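The effect of zero padding on the frequency grid can be seen in a small Python sketch. It is only an illustration under stated assumptions: the DFT is evaluated directly rather than with an FFT, and the test signal (a single 452 Hz cosine, 200 samples at 8 kHz) is hypothetical.

```python
import cmath
import math
from typing import Sequence


def spectrum_peak(x: Sequence[float], fs: float, pad_factor: int = 1) -> float:
    """Frequency (Hz) of the largest DFT magnitude after zero padding x to pad_factor * len(x)."""
    n = len(x)
    length = pad_factor * n

    def magnitude(k: int) -> float:
        # The padded zeros contribute nothing, so only the first n samples are summed.
        return abs(sum(x[t] * cmath.exp(-2j * math.pi * k * t / length) for t in range(n)))

    best_bin = max(range(1, length // 2), key=magnitude)
    return best_bin * fs / length


fs = 8000.0
n = 200
x = [math.cos(2.0 * math.pi * 452.0 * t / fs) for t in range(n)]

coarse = spectrum_peak(x, fs)              # bins spaced fs / N = 40 Hz apart
fine = spectrum_peak(x, fs, pad_factor=4)  # bins spaced fs / 4N = 10 Hz apart
```

Padding to 4N does not add information to the frame, but it moves the selected bin from 440 Hz to 450 Hz in this example, which is closer to the true frequency of the sinusoid.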
The rate function we propose has the following form:
\[
\lambda_f\left(f_0\right) =
\begin{cases}
\lambda_{\mathrm{Note}} & \left| f/f_0 - \left[ f/f_0 \right] \right| < \epsilon \\
\lambda_{\mathrm{Clutter}} & \text{otherwise}
\end{cases}
\tag{6.13}
\]
where ε is the maximum allowed inharmonicity and f is the central frequency of the DFT bin. The notation
[f/f0] denotes rounding to the nearest integer, and hence gives us the position of the closest harmonic of f0 to
the central frequency f. ε allows for multiple partials per harmonic, which can occur because of modulations
in the signal, or because of modelling and estimation error. This inharmonicity model is multiplicative,
allowing for a larger deviation from the ideal position for the higher frequency harmonics. The rate function
bears some resemblance to the scoring function (3.2) for inharmonicity used by Yeh et al. [2005]. This
intensity function approximately resembles histogram data of partial estimates for the low harmonics, see for
example Figure 6.4 on page 92. The partial estimation scheme described in Algorithm 6.1 returns clusters
of partials around the harmonic positions, and a few spurious estimates. We capture this behaviour with a
parametrized intensity function (6.13).
[Figure 6.4: Partial estimation results for zero padding method. Top panel: periodogram power spectral density
estimate (log power spectral density against frequency / Hz). Bottom panel: partial-frequency estimation output
(number of partial-frequency estimates against frequency / Hz).]
Our expectation is that one partial in the correct position would be detected in each frame of data, so
we set λNote equal to the number of frames. λClutter models additional detections and must be greater than
zero in each frequency bin. λClutter should be independent of the actual notes being played, and may be
estimated or inferred. However, we found that the model is quite robust to a range of clutter values, and we
set λClutter = 0.1 in the following experiments.
The inharmonicity parameter ε is also unknown, and depends on the collection of notes being played.
Instruments with high modulations (such as vibrato) will have large values of ε, as will inharmonic
instruments. We assign a uniform prior p (ε) = U (1/100, 1/10) and infer ε with each observation. Note that λf
has a functional dependence on ε, hence it is unknown, and we use the Bayesian prior described in 6.3.1 to
correctly calculate the appropriate marginal likelihood value.
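The rate function (6.13) is simple to evaluate per DFT bin, as the following Python sketch shows. The function and parameter names are ours, and the guard that treats bins below the fundamental (nearest harmonic number zero) as clutter is an assumption that is not stated in the text.

```python
def harmonic_clutter_rate(f: float, f0: float, max_inharmonicity: float,
                          rate_note: float, rate_clutter: float = 0.1) -> float:
    """Rate for a DFT bin centred at f, given a candidate fundamental f0, as in (6.13)."""
    ratio = f / f0
    harmonic = round(ratio)  # position of the closest harmonic of f0
    # Assumption: bins whose nearest harmonic number is 0 are treated as clutter.
    if harmonic >= 1 and abs(ratio - harmonic) < max_inharmonicity:
        return rate_note
    return rate_clutter


# Hypothetical example: 10 frames of data, bins every 20 Hz up to 4 kHz, f0 = 440 Hz.
bin_centres = [20.0 * k for k in range(1, 200)]
bin_rates = [harmonic_clutter_rate(f, 440.0, 0.05, rate_note=10.0) for f in bin_centres]
```

The resulting list of bin rates can be used directly in the binned likelihood of 6.2.3.2, and lists for several candidate notes can be added element-wise to form the rate of a polyphonic mixture.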
6.4.4 Likelihood-Search
The second method we investigate is to directly maximize the likelihood of the Gabor signal model when
adding a new frequency estimate to the set of partial frequencies estimated in the residual. This is motivated
by our observations in 5.5.2 where we found that there could be significant deviation between the maxima
of the DFT spectrum and the maximum likelihood frequency estimate, and that the residual of the signal
after subtracting the detected frequency could contain additional peaks which can result in spurious partial
frequency detections, as shown in Figure 5.3 on page 77.
Maximization of the likelihood, which is equivalent to minimizing the energy of the residual signal, was
carried out numerically using the golden-section search method. The steps for calculating the residual are shown
in Algorithm 6.1. This method requires an upper and lower bound for the frequency, between which the
maximum is located. First we chose the maximum of the Fourier spectrum as an initial estimate, as in 6.4.3
but without any zero padding. The bounds for the golden-section search were then set 10 Hz either side of the
initial estimate, based on our observations of the discrepancy between the Fourier spectrum and the signal
model maxima in 5.5.2.
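The refinement step can be sketched as follows. This is a minimal illustration and not the implementation used for the experiments: it assumes a single stationary sinusoid fitted by least squares in place of the Gabor signal model, and the function names (`residual_energy`, `golden_section_minimize`, `refine_frequency`) are hypothetical. Only the 10 Hz search half-width and the use of the unpadded Fourier maximum as the initial estimate are taken from the text.

```python
import numpy as np

def residual_energy(y, f, fs):
    """Energy left after projecting y onto a cosine/sine pair at frequency f (Hz)."""
    n = np.arange(len(y))
    basis = np.column_stack([np.cos(2 * np.pi * f * n / fs),
                             np.sin(2 * np.pi * f * n / fs)])
    coef, *_ = np.linalg.lstsq(basis, y, rcond=None)
    r = y - basis @ coef
    return float(r @ r)

def golden_section_minimize(func, lo, hi, tol=1e-4):
    """Golden-section search for the minimum of a unimodal func on [lo, hi]."""
    invphi = (np.sqrt(5.0) - 1.0) / 2.0
    c = hi - invphi * (hi - lo)
    d = lo + invphi * (hi - lo)
    fc, fd = func(c), func(d)
    while hi - lo > tol:
        if fc < fd:            # minimum lies in [lo, d]
            hi, d, fd = d, c, fc
            c = hi - invphi * (hi - lo)
            fc = func(c)
        else:                  # minimum lies in [c, hi]
            lo, c, fc = c, d, fd
            d = lo + invphi * (hi - lo)
            fd = func(d)
    return 0.5 * (lo + hi)

def refine_frequency(y, fs, half_width=10.0):
    """Start from the unpadded DFT peak, then search half_width Hz either side."""
    spectrum = np.abs(np.fft.rfft(y))
    k = int(np.argmax(spectrum[1:])) + 1      # skip the DC bin
    f0 = k * fs / len(y)
    return golden_section_minimize(lambda f: residual_energy(y, f, fs),
                                   f0 - half_width, f0 + half_width)
```

Minimizing the residual energy is the same as maximizing the likelihood only under the white Gaussian noise assumption; with a different noise model the objective passed to the search would change, but the bracketing logic would not.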
An example of the partial estimation results is shown in Figure 6.5 on page 94. These results show
that the partial estimates are very good, in that many of the partials are detected, and there are very few
duplicate or spurious estimates. As these results match the actual partials in a harmonic note, we simply
use the rate function described in 6.3.3 which was derived from a Bayesian harmonic model.
This method requires more computation per iteration than the method in 6.4.3 but as fewer spurious and
duplicate partials are detected, the overall number of iterations is smaller, and the cost of computing the
model selection criterion is also reduced. We found that the likelihood search method was quicker overall than
the zero padding method. As it also produces better partial estimation results which can then be analyzed
using an acceptable Bayesian model and compared easily with other inference schemes, we recommend the
likelihood-search over the zero-padding method from a modelling and practical point of view. As we shall
see in Section 6.5, there are also mild improvements in polyphonic transcription performance.
6.5 Polyphonic Pitch Estimation
The Poisson process model is useful for polyphonic pitch estimation because the likelihood is quick to
compute, and it is therefore feasible to perform searches over pitch candidates, exhaustively testing every
single note and also pairs of note pitches, at each stage selecting the single note or pair of notes that results
in the highest likelihood.
The performance of the model is very much dependent on the quality of the partial frequency estimates,
although the Poisson process model allows for more clutter, errors and inaccuracy than direct inference
using a poor signal model would.
6.5.1 Greedy Search
In this section we present results for a maximum marginal likelihood approach comparing the two partial
estimation schemes described in Section 6.4. Following other inexpensive multiple pitch estimation schemes,
we search in a greedy manner, adding one note at a time to the mixture, selecting the maximum likelihood
solution at each point. For the zero-padding method, the observations are observed in bins, hence the
likelihood function is given by (6.4), whereas for the likelihood-search method, the observed frequencies are
known precisely, hence the likelihood function is given by (6.3).
Figure 6.5: Partial estimation results and periodogram estimate for a polyphonic mixture of four notes.
[Plot omitted: log power spectral density against frequency / Hz (0 to 4000 Hz); series: signal, partial-frequency estimate.]
We evaluate the model using the dataset in Davy et al. [2006], described also in 5.5.2 and compare with
the results obtained there. The polyphonic note mixtures are buffered into frames of length 1024 samples
with 50% overlap. Table 6.1 on page 95 presents our results for the case where the number of notes is known in
advance, and are compared to the results in Table 5.3 on page 76 on the same data set, showing that we
are able to achieve a comparable and even superior level of performance to a full Bayesian model and also
a state-of-the-art auditory model. The likelihood search method in general makes fewer transcription errors
than the zero-padding method.
In these results, we observe that the Poisson process model is capable of correctly inferring multiple notes
with the same pitch. Due to the superposition property of Poisson processes, it is straightforward to test
whether a greedy search performs worse than a more exhaustive search which also considers adding pairs of
notes to the solution at each iteration. We can also check whether the search leads to a higher likelihood
solution than the likelihood of the true pitches under this model. For the results in Table 5.3 on page 76 the
greedy search returns the same sets of notes as the method which adds pairs of notes, and the performance of
the two search methods is identical. Thus the greedy algorithm is shown to be sufficient for finding the maximum
likelihood solution, and we suggest that this is a property that results from the superposition property.
                                              Notes in Mixture
Evaluation Metric   Model                   1      2      3      4
% octave error      Zero-padding            0    4.9   11.9    8.9
                    Likelihood-search       0    7.0    6.0    7.5
                    Gabor                   0    2.8   11.1   10.2
                    Davy et al. [2006]      0   10.3   17.8    9.3
                    Klapuri [2008]          0   13.6   19.0   22.2
% pitch error       Zero-padding            0    6.7    9.0   17.0
                    Likelihood-search       0    5.0   10.7   15.5
                    Gabor                   0    8.3   15.6   18.6
                    Davy et al. [2006]      0    5.1    7.2   19.7
                    Klapuri [2008]          0    1.4    6.0   10.3
% total error       Zero-padding            0   11.6   20.9   25.9
                    Likelihood-search       0   12.0   16.7   23.0
                    Gabor                   0   11.1   26.7   28.8
                    Davy et al. [2006]      0   15.4   25.0   29.0
                    Klapuri [2008]          0   15.0   25.0   32.5
Table 6.1: Polyphonic pitch estimation using the Poisson process model, comparing the zero-padding and
likelihood-search methods for estimating the frequency of the partial at each iteration of the partial estimation
scheme. The results are compared to estimation results for the Gabor model developed in Chapter 5,
the Bayesian harmonic model of Davy et al. [2006], and a state-of-the-art auditory model for polyphonic
transcription [Klapuri, 2008].
6.5.2 Estimation of number of notes
The greedy search method may also be used to estimate the number of notes. Once the partial frequencies
have been estimated, as a result of the superposition property there are no remaining parameters in the
model. There is no danger of overfitting too many notes using the Poisson process model once the partial
frequencies have been estimated, as the expected number of partials in the signal is implicitly defined by
the integral of the rate function of the Poisson process. Attempting to fit too many notes will result in a
much higher expected number of partials than actually observed, which is penalized by the likelihood of
the Poisson process. This is advantageous, as there is no need for an explicit penalization term to avoid
overfitting of the model. The likelihood itself is therefore sufficient for determining the number of notes.
Notes are added to the candidate set until this fails to increase the likelihood.
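The stopping behaviour described above can be illustrated with a small sketch. Everything specific here is an assumption made for illustration and is not the rate function (6.13) or the settings used in the experiments: the intensity of a note is taken to be one Gaussian bump per harmonic with a width proportional to the harmonic frequency, the clutter is a constant density, and the names `note_rate`, `poisson_log_likelihood` and `greedy_notes` are hypothetical. The likelihood is the continuous-frequency Poisson process form, the sum of log intensities at the observed partials minus the expected number of partials.

```python
import numpy as np

def note_rate(freqs, f0, eps=0.01, n_harmonics=8):
    """Hypothetical note intensity: a Gaussian bump at each harmonic h*f0 with
    width eps*h*f0 (multiplicative inharmonicity), each integrating to one partial."""
    centres = f0 * np.arange(1, n_harmonics + 1)
    widths = eps * centres
    z = (np.asarray(freqs, dtype=float)[:, None] - centres) / widths
    return np.sum(np.exp(-0.5 * z**2) / (np.sqrt(2.0 * np.pi) * widths), axis=1)

def poisson_log_likelihood(partials, notes, f_max=4000.0, clutter=1e-3,
                           eps=0.01, n_harmonics=8):
    """log L = sum_j log rate(f_j) - integral of the rate over [0, f_max].
    By superposition, the rate of a mixture is the sum of the note rates plus clutter."""
    partials = np.asarray(partials, dtype=float)
    rate = np.full(partials.shape, clutter)
    for f0 in notes:
        rate = rate + note_rate(partials, f0, eps, n_harmonics)
    # Assumes every harmonic bump lies well inside [0, f_max], so each note
    # contributes n_harmonics expected partials to the integral.
    expected_count = clutter * f_max + n_harmonics * len(notes)
    return float(np.sum(np.log(rate)) - expected_count)

def greedy_notes(partials, candidates, max_notes=10, **model):
    """Add the best candidate pitch until the likelihood stops increasing."""
    notes = []
    best = poisson_log_likelihood(partials, notes, **model)
    while len(notes) < max_notes:
        scores = [poisson_log_likelihood(partials, notes + [c], **model)
                  for c in candidates]
        k = int(np.argmax(scores))
        if scores[k] <= best:
            break
        notes.append(candidates[k])
        best = scores[k]
    return sorted(notes)
```

In this sketch each extra note adds `n_harmonics` to the expected count, so a note that explains few observed partials lowers the likelihood. This is the implicit penalty that removes the need for an explicit model-order term.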
We evaluate the estimation of the number of notes by calculating the average precision and recall for
different numbers of notes and overall, and present our results in Table 6.2 on page 96. The precision P is
defined as the number of correct notes detected in the mixture divided by the estimated number of notes by
the system. The recall R is defined as the number of correct notes detected in the mixture divided by the
actual number of notes in the mixture. Both precision and recall are commonly expressed as percentages.
The F-Score F , given by
\[ F = \frac{2PR}{P + R} \]
is in excess of 80% for both methods, showing that the number of notes is being estimated accurately.
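As a quick check of these definitions, the following sketch computes precision, recall and the F-Score from note counts, and recomputes the overall F-Score entries of Table 6.2 from the reported overall precision and recall. The function names are illustrative only.

```python
def precision_recall(n_correct, n_estimated, n_actual):
    """Precision and recall (as percentages) from note counts."""
    precision = 100.0 * n_correct / n_estimated if n_estimated else 0.0
    recall = 100.0 * n_correct / n_actual if n_actual else 0.0
    return precision, recall

def f_score(precision, recall):
    """Harmonic mean of precision and recall, F = 2PR / (P + R)."""
    if precision + recall == 0:
        return 0.0
    return 2.0 * precision * recall / (precision + recall)

# Overall precision and recall reported in Table 6.2.
reported = {
    "Zero-padding": (75.7, 88.5),
    "Likelihood-search": (78.2, 85.9),
    "Klapuri [2008]": (66.2, 76.5),
}
overall_f = {name: round(f_score(p, r), 1) for name, (p, r) in reported.items()}
```

The recomputed values agree with the F-Score row of the table to one decimal place.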
Overall the two partial estimation schemes, with their associated rate functions, have the same level
of accuracy for polyphonic transcription. The zero-padding method overall has a higher recall than the
likelihood-search method, which is expected as the zero-padding method returns more partial frequencies,
hence potentially detecting more notes.
                                                 Notes in Mixture
Evaluation Metric   Model                   1      2      3      4   Overall
Precision %         Zero-padding         93.8   85.4   72.9   68.4      75.7
                    Likelihood-search    98.0   85.0   76.7   71.0      78.2
                    Klapuri [2008]       87.0   73.9   65.2   58.7      66.2
Recall %            Zero-padding        100.0   91.8   86.7   87.7      88.5
                    Likelihood-search   100.0   92.4   85.8   87.1      85.9
                    Klapuri [2008]      100.0   85.0   75.0   67.5      76.5
F-Score %           Zero-padding         96.8   88.5   79.2   76.9      81.6
                    Likelihood-search    99.0   88.5   80.1   78.2      81.9
                    Klapuri [2008]       93.0   79.0   69.8   62.8      71.0
Table 6.2: Precision and recall for estimating the number of notes and their pitches using the Poisson process
model and the two frequency estimation schemes described in this chapter. The results are compared to a
state-of-the-art auditory model.
In our opinion and experience, a higher recall is preferable for an offline polyphonic transcription system,
where the results can be verified by a trained musician, as it is perceptually simpler to notice and delete a
spurious, incorrectly transcribed note than it is to determine and add a missing note. However, we would
still prefer to use the likelihood-search method as it requires less computation in the partial estimation
stage. To increase the recall of the likelihood-search method, the 5k/2 penalization term in (6.12) can be
reduced, so that more partials are detected in the estimation scheme, and thus more notes are detected.
Even when reducing the penalization term, the number of partials returned by the likelihood-search method
is substantially smaller than that returned by the zero-padding method, and the computational savings are retained.
6.5.3 Comparison with State-of-the-Art
In this section we compare the performance of the system developed in this chapter with other state-of-the-
art multiple pitch estimation systems that compute estimates on a frame-by-frame basis. The results we
compare with are taken from Vincent et al. [2010] using the MIREX 2007 woodwind training dataset². Test
excerpts are generated by successively summing together the first 30 seconds of the flute, clarinet, bassoon,
horn and oboe tracks in order. To use the same evaluation criterion, we use overlapping frames of 46ms length
spaced 10ms apart. The results are presented in Table 6.3 on page 97 for the best performing algorithms
evaluated in Vincent et al. [2010], and also the likelihood-search and zero-padding methods described here.
The results demonstrate that the two multiple pitch estimation schemes of Vincent et al. [2010] outperform
the likelihood-search method on this woodwind dataset, although the performance of the likelihood-search
method is comparable with state-of-the-art systems. There are also significant differences between the
results on this real dataset and the results on the isolated examples presented in Table 6.2 on page 96,
where the likelihood-search and zero-padding methods have a similar level of performance, show significant
improvement over the system of Klapuri [2008] and have a higher F-measure overall than in Table 6.3 on
page 97. These differences are clearly due to the number of frames per note estimate used. The isolated
examples, which have several frames per note, are realistic only when concurrently sounding note segments
are extracted from the audio beforehand, whereas each frame in the woodwind dataset contains much less
spectral information. The zero-padding method suffers especially for short frames as the frequency estimates
are obtained from the spectral information alone, whereas the likelihood-search method is able to obtain
more accurate estimates by fitting a signal model to the audio. Additional improvements could be made
by jointly inferring the parameters of the prior harmonic model described in 6.3.3, as the parameters were
chosen in that model to be suitable for longer notes. Alternatively, larger improvements could be made
by extending the signal model with dependencies between neighbouring frames, rather than independently
inferring isolated frequencies and notes in each frame. This would effectively increase the length of the
frame available for estimating the frequencies, thus improving the accuracy of the estimates in line with the
results presented in Table 6.2 on page 96.
                                                  Polyphony
Algorithm                                      2      3      4      5
Unconstrained NMF Vincent et al. [2010]     79.9   56.3   62.1   61.9
Constrained NMF Vincent et al. [2010]       76.5   64.7   67.5   62.5
Klapuri [2008]                              73.4   59.1   63.5   59.9
Likelihood-search                           75.0   66.1   55.0   59.6
Zero-padding                                66.2   59.4   51.2   56.3
Table 6.3: Comparison of the F-measure of multiple pitch estimation for the MIREX 2007 woodwind training
dataset. The 'Constrained NMF' algorithm is known as 'NMF under harmonicity and spectral smoothness
constraints' in Vincent et al. [2010].
² www.music-ir.org/mirex2007/
6.6 Conclusion
In this chapter we have motivated and implemented a two-stage process for polyphonic pitch detection. The
first stage is to accurately detect partial frequencies in a short segment of the signal. The second stage is
to infer the notes using a harmonic model of the expected frequencies. The approach we have used is to
adapt existing Bayesian signal and prior harmonic models for musical notes and design simple, approximate
inference schemes for these models which are computationally inexpensive and allow for the exploration of
many more combinations of notes than may be feasible using full Bayesian models. As a result we have
shown that a higher level of transcription accuracy can be achieved even with a simple algorithm design.
Moreover we are able to benefit from present and future advances of the Bayesian models to improve partial
frequency and multiple pitch estimation.
The partial frequency estimation stage is carried out by progressively fitting a sinusoidal basis to the
observed signal, halting the process when a Bayesian model selection criterion fails to improve by adding
more partial frequencies. There are a number of requirements on the signal model for this method to be
feasible, firstly that the optimum frequency to be added per iteration can be found quickly, and secondly
that adding additional frequencies to the model does not require that the existing frequencies be modified.
These requirements are met by the Gabor basis models discussed in Chapter 5 but are not met by matrix
pencil methods which were used in previous work [Peeling et al., 2007a]. The primary benefit of using explicit
signal and noise models over heuristic peak-detection schemes is that the spectral shape of partials and the
level of the noise floor can be used to distinguish between actual partials and spurious artefacts, and also
detect partials with small amplitudes which are otherwise masked by nearby partials with larger amplitudes.
The estimated partial frequencies are then tested against a series of multiple pitch hypotheses by evaluating
the likelihood of the frequency positions under the assumption of a non-homogeneous Poisson process.
We choose to use a Poisson process for the following reasons: it is a generative model for which Bayesian
priors for different harmonic models and partial frequency estimation schemes can be created easily; the
likelihood can be calculated exactly without the use of an iterative algorithm, and the superposition property
means that multiple notes can be inferred using a greedy search scheme where previously found notes are
consistent with the current hypothesis, and the number of notes can be estimated correctly.
The resulting transcription scheme is accurate for isolated notes, although the accuracy when used to
transcribe music frame-by-frame for short frame lengths is much less than for isolated notes. Future work
should concentrate on extending the signal model to include dependencies between neighbouring frames, to
allow the frequencies to be estimated in each frame with much higher accuracy than possible for short frame
lengths.
The relationship between the pitches in neighbouring frames, although not explored in this chapter, can
be expressed through prior, generative models of pitch trajectories and note durations. For example, the
transcription system here could be used to evaluate the likelihood of observed partial frequencies in each
frame for different note combinations, and the Viterbi algorithm used to infer the most likely transcription
across multiple frames using a hidden Markov model prior. Multiple passes would be used to add additional
notes to each frame, similar to the greedy search procedure presented here. These ideas are developed further
in Chapter 8.
Although the method we have presented here is accurate and flexible, it is not suitable for large scale
processing of musical data as the algorithms for both frequency and pitch estimation are iterative in nature,
and their computation cannot be easily parallelised to make use of optimized software libraries or parallel
processing in hardware. In the next chapter we develop and apply alternative Bayesian generative models
where the signal is projected onto a fixed set of harmonic bases, one per pitch, rather than attempting to
infer the basis of the signal model. The relative amounts of energy used in each projection are used to
create a transcription of long passages of polyphonic music. This approach allows multiple frames and the
dependencies between them to be processed in parallel.
Chapter 7
Gaussian Variance Generative Matrix
Factorization Models
In this chapter we develop a generative Bayesian model for matrix valued observations, where each element
of the matrix is assumed distributed zero mean Gaussian, and the variances of the elements are factorized
into two positive-valued matrices with smaller common dimension than those of the observed matrix. The
models represent the observed data as the superposition of statistically independent sources. These models
can be used for dimensionality reduction, modelling and compression of real-valued data, such as the short-
time Fourier transform (STFT) of an audio signal, where it can be applied to source separation and music
transcription by jointly modelling spectral characteristics such as harmonicity and temporal activations or
excitations. Suitable priors are chosen so that the appropriate modelling dimensionality is selected, and the
variance parameters can be inferred using efficient matrix update equations, allowing large amounts of data
to be processed efficiently. The algorithm is adapted for the task of polyphonic transcription of music using
labeled training data. The performance of the system is compared to that of existing discriminative and
model-based approaches on a dataset of classical piano music.
7.1 Introduction
Tools for multivariate data analysis, processing and compression include principal components analysis
(PCA) and non-negative matrix factorization (NMF), which perform dimensionality reduction. Recently
these tools have been made more flexible and powerful by description in a probabilistic, statistical framework,
using a generative model. Existing algorithms can then typically be described as performing maximum
likelihood estimation of the parameters. Tipping and Bishop [1999] describe PCA in a probabilistic framework
as a Gaussian latent variable model, and use the Expectation-Maximization (EM) algorithm to iteratively
reach a solution. Virtanen et al. [2008] describe NMF using a Poisson source model, and obtain the iterative
update equations for the information-divergence measure given by Lee and Seung [2000]. The advantages
cited for adopting a probabilistic signal model are the ability to incorporate prior information via Bayesian
methods, and a consistent approach to dealing with multiple observations and missing data. See also Cemgil
[2008] for a full report on a Bayesian NMF model with applications to missing data.
Non-negative matrix factorization has been used with some success in the modelling of time-frequency
energy distributions in audio and musical signal applications, such as drum transcription [Paulus and Virtanen,
2006], source separation [Wang and Plumbley, 2005, Virtanen, 2007], and polyphonic music transcription
[Smaragdis and Brown, 2003, Bertin et al., 2009a, Abdallah and Plumbley, 2004].
To illustrate the principal concept, consider a matrix
\[ \mathbf{X} = \left| \left\{ x_{\nu,\tau}^2 \right\} \right| \]
formed of the coefficients of the short-time Fourier transform of an audio signal (see Section 7.4) with
frequency indices $\nu = 1, \dots, F$ and time indices $\tau = 1, \dots, T$. This non-negative matrix can be approximately
decomposed into two non-negative matrices
\[ \mathbf{X} \approx \mathbf{T}\mathbf{V} \]
$\mathbf{T}$ is an $F \times I$ matrix which is typically interpreted as a set of harmonic template vectors $[\mathbf{t}_1, \dots, \mathbf{t}_I]$ along the
frequency axis, and $\mathbf{V}$ is an $I \times T$ matrix, interpreted as a set of activation or excitation vectors $[\mathbf{v}_1, \dots, \mathbf{v}_I]^\top$
in time. In a single channel source separation situation, the observed matrix $\mathbf{X}$ can be written as the
superposition of $I$ sources:
\[ \mathbf{X} \approx \sum_{i=1}^{I} \mathbf{t}_i \mathbf{v}_i^\top \approx \sum_{i=1}^{I} \mathbf{S}_i, \qquad \mathbf{S}_i = \mathbf{t}_i \mathbf{v}_i^\top \]
The representation of X as the sum of a set of single rank matrices is shown schematically in Figure 7.1 on
page 101. However, this model is physically unrealistic: because energy is a quadratic quantity, the energy
for two sources is not additive, i.e.,
\[ |s_1 + s_2|^2 \neq s_1^2 + s_2^2 \]
This is typical of spectral modelling where superposition is not addressed. Here we describe a probabilistic
model based on the transform coefficients themselves rather than a positive energy representation. The
source model is conditionally zero-mean Gaussian, motivated by the underlying physics, which obeys the
superposition property desired. The model is valid for real-valued observations which arise from the discrete
Cosine transform (DCT) for instance, and also complex-valued coefficients from the discrete Fourier transform
(DFT). The statistical perspective readily admits a Bayesian framework in the form of prior distributions
placed on the variance parameter of the Gaussian. A review of existing prior structures and related inference
methods is in Godsill et al. [2007].
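The distinction between the two notions of additivity can be checked numerically. The sketch below is illustrative only (the function name and the chosen variances are assumptions): it draws two independent zero-mean Gaussian sources and shows that their energies do not add sample by sample, while the variance of the mixture is the sum of the source variances.

```python
import numpy as np

def energy_additivity_demo(var1, var2, n_samples=200_000, seed=0):
    """Return (sample variance of s1 + s2, mean absolute pointwise energy gap)."""
    rng = np.random.default_rng(seed)
    s1 = rng.normal(0.0, np.sqrt(var1), n_samples)
    s2 = rng.normal(0.0, np.sqrt(var2), n_samples)
    x = s1 + s2
    # Pointwise, |s1 + s2|^2 - (s1^2 + s2^2) = 2 * s1 * s2, which is not zero.
    pointwise_gap = float(np.mean(np.abs(x**2 - (s1**2 + s2**2))))
    return float(np.var(x)), pointwise_gap

mixture_variance, pointwise_gap = energy_additivity_demo(1.0, 3.0)
```

With variances 1 and 3 the mixture variance should be close to 4, whereas the pointwise gap stays well away from zero; only the expected energies are additive.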
This chapter is an expansion of Peeling et al. [2010] which develops an Expectation-Maximization (EM)
algorithm for the Gaussian variance model. Here we additionally discuss variational Bayes and Monte-Carlo
inference techniques in Section 7.3, and demonstrate additional applications to musical audio processing in
Section 7.4 other than polyphonic music transcription.
In this chapter, we use the Gaussian variance model in a matrix factorization framework. We first express
the model and obtain the maximum likelihood estimation of the factorization as an iterative EM algorithm,
and describe the implementation as a pair of matrix-update equations to demonstrate that the Gaussian
model shares the same computational attractiveness as related NMF approaches (Section 7.2). We then
place conjugate priors on the elements of the factor matrices, and describe a number of Bayesian inference
techniques: variational Bayes and a set of Monte-Carlo techniques (Section 7.3). We present demonstrations
of applications for this model in the field of musical audio processing in Section 7.4, and by placing a prior
model for Midi transcription (Section 7.5) we develop a system for polyphonic transcription of piano music.
We present comparative results on a large dataset of music in Section 7.6.
Figure 7.1: Representations of the single-channel source separation model as a matrix factorization problem
[Diagram omitted: nodes for the observations x_{ν,τ}, latent sources s_{ν,i,τ}, templates t_{ν,i} and excitations v_{i,τ}, shown for frames 1, ..., k, ..., K.]
7.2 Gaussian Variance Matrix Factorization Model
In this section we describe a matrix factorization model, where the observed matrix coefficients have a zero
mean Gaussian distribution, where the variance of each coefficient is obtained from the matrix product of
the template and excitation matrices. This model was used in single channel audio source separation by
Benaroya et al. [2003] and polyphonic music transcription by Abdallah and Plumbley [2004], and was linked
with the Itakura-Saito (IS) divergence between the observed matrix coefficients and the underlying variances
by Févotte et al. [2009].
We initially express the model as a probability distribution over individual sources.
\[ s_{\nu,i,\tau} \sim \mathcal{N}(s_{\nu,i,\tau}; 0, t_{\nu,i} v_{i,\tau}) \tag{7.1} \]
\[ x_{\nu,\tau} = \sum_i s_{\nu,i,\tau} \]
The s variables represent the individual latent sources, and the x variables are the observations, formed from
the superposition of the sources. When the latent source variables and the observed coefficients are complex
valued, the distribution in (7.1) is the circular symmetric complex normal distribution (Section A.1) i.e., the
real and imaginary parts are uncorrelated and have equal variance.
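The matrix representation of the superposition is
\[ \mathbf{X} = \mathbf{S}_1 + \cdots + \mathbf{S}_I = \sum_{i=1}^{I} \mathbf{S}_i \]
where $\mathbf{X}, \mathbf{S}_i \in \mathbb{R}^{F \times T}$, $i = 1, \dots, I$, have elements $x_{\nu,\tau}$, $s_{\nu,i,\tau}$, for $\nu = 1, \dots, F$, $\tau = 1, \dots, T$ respectively.
Marginalizing out the latent sources $\mathbf{S} = \{\mathbf{S}_1, \dots, \mathbf{S}_I\}$ gives
\[ p(\mathbf{X}|\mathbf{T},\mathbf{V}) = \int d\mathbf{S}\, p(\mathbf{X}|\mathbf{S})\, p(\mathbf{S}|\mathbf{T},\mathbf{V}) = \prod_{\nu,\tau} \mathcal{N}\left( x_{\nu,\tau}; 0, \sum_i t_{\nu,i} v_{i,\tau} \right) \]
due to the superposition property of normal random variables, that is: when
\[ s_i \sim \mathcal{N}\left( s_i; 0, \sigma_i^2 \right), \qquad x = s_1 + \cdots + s_I \]
then the marginal probability is given by
\[ p(x) = \mathcal{N}\left( x; 0, \sum_i \sigma_i^2 \right) \]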
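For real x, the marginal log-likelihood of a single observation is given by:
\[ \log p(\mathbf{X}|\mathbf{T},\mathbf{V}) = \sum_{\nu} \sum_{\tau} \left( -\frac{1}{2\sigma_{\nu,\tau}^2} x_{\nu,\tau}^2 - \frac{1}{2} \log 2\pi\sigma_{\nu,\tau}^2 \right) \tag{7.2} \]
and for complex x
\[ \log p(\mathbf{X}|\mathbf{T},\mathbf{V}) = \sum_{\nu} \sum_{\tau} \left( -\frac{1}{\sigma_{\nu,\tau}^2} |x_{\nu,\tau}|^2 - \log \pi\sigma_{\nu,\tau}^2 \right) \tag{7.3} \]
where
\[ \sigma_{\nu,\tau}^2 = \sum_i t_{\nu,i} v_{i,\tau} = [\mathbf{T}\mathbf{V}]_{\nu,\tau} \]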
The derivation for the real and complex valued models is so similar that when it is required to specify
which observation model is being used, we let D = 1 for the real valued model, and D = 2 for the complex
valued model. This is motivated by viewing the complex normal distribution as a two-dimensional normal
distribution with equal variance on the real and imaginary axes. The marginal log-likelihood in its general
form is thus
\[ \log p(\mathbf{X}|\mathbf{T},\mathbf{V}) = \sum_{\nu} \sum_{\tau} \left( -\frac{D}{2\sigma_{\nu,\tau}^2} |x_{\nu,\tau}|^2 - \frac{D}{2} \log \frac{2\pi\sigma_{\nu,\tau}^2}{D} \right) \tag{7.4} \]
which reduces to (7.2) for D = 1 and to (7.3) for D = 2.
As observed in Févotte et al. [2009], maximizing the log-likelihood is equivalent to minimizing the IS
divergence
\[ d_{IS}\left( z | \sigma^2 \right) = \frac{z}{\sigma^2} - \log z + \log \sigma^2 - 1 \]
between $z = \left| x^2 \right|$ and $\sigma^2$. This can be seen by comparing (7.3) and (7.4) and ignoring the elements of both
equations that do not depend on the variances σ2.
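The equivalence can be verified numerically for a single coefficient. The sketch below uses the complex-valued form (7.3) and illustrative function names: the negative log-likelihood and the IS divergence differ by a quantity that depends on z but not on the variance, so both are minimized by the same variance.

```python
import numpy as np

def complex_neg_log_lik(z, sigma2):
    """Negative of the summand in (7.3), with z = |x|^2."""
    return z / sigma2 + np.log(np.pi * sigma2)

def is_divergence(z, sigma2):
    """Itakura-Saito divergence between z and sigma2."""
    return z / sigma2 - np.log(z) + np.log(sigma2) - 1.0

z = 2.5
sigma2 = np.array([0.1, 1.0, 2.5, 7.3])
# The difference is log(pi) + log(z) + 1 for every value of sigma2.
offset = complex_neg_log_lik(z, sigma2) - is_divergence(z, sigma2)
```

The divergence is zero at `sigma2 == z` and positive elsewhere, which is the single-coefficient maximum likelihood solution.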
7.2.1 Maximum-likelihood and the EM algorithm
The EM algorithm for maximum-likelihood estimation of parameters in the Gaussian variance model was
independently derived by Févotte et al. [2009]. Maximizing the likelihood of the Gaussian variance model is
equivalent to minimizing the Itakura-Saito distance [Itakura and Saito, 1968] between the observed matrix
X and its reconstruction TV.
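The log likelihood of observed datum X can be written as
\begin{align*}
L_X &\equiv \log \int d\mathbf{S}\; p(\mathbf{X},\mathbf{S}|\mathbf{T},\mathbf{V}) \\
&= \log \int d\mathbf{S}\; \frac{q(\mathbf{S})}{q(\mathbf{S})}\, p(\mathbf{X},\mathbf{S}|\mathbf{T},\mathbf{V}) \\
&\geq \int d\mathbf{S}\; q(\mathbf{S}) \log \frac{p(\mathbf{X},\mathbf{S}|\mathbf{T},\mathbf{V})}{q(\mathbf{S})} \\
&\equiv B[q(\mathbf{S})]
\end{align*}
by Jensen's inequality [Bishop, 2006], defining a lower bound on the log likelihood. Here, q(S) is an instrumental
distribution over the set of latent sources, with the property that q(S) = 0 if and only if p(X,S|T,V) = 0.
The lower bound is tight when the instrumental distribution is the posterior of the latent sources:
\[ \arg\max_{q(\mathbf{S})} B[q(\mathbf{S})] = p(\mathbf{S}|\mathbf{X},\mathbf{T},\mathbf{V}) \]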
Hence we can use an iterative coordinate ascent scheme to maximize the log likelihood. The first step,
called the expectation (E) step, is to compute the posterior distribution, which we will see has the form of a
multivariate normal. Because this is an exponential family, we only need to compute the sufficient statistics,
which is why we call this the expectation step. The second step is called the maximization (M) step because
we find the maximum likelihood T and V holding q(S) fixed. The two steps of the expectation maximization
algorithm (EM) are summarized as:
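E-step: \[ q(\mathbf{S})^{(n)} = p\left(\mathbf{S} \,|\, \mathbf{X}, \mathbf{T}^{(n-1)}, \mathbf{V}^{(n-1)}\right) \]
M-step: \[ \left\{ \mathbf{T}^{(n)}, \mathbf{V}^{(n)} \right\} = \arg\max_{\mathbf{T},\mathbf{V}} \left\langle \log p(\mathbf{S},\mathbf{X}|\mathbf{T},\mathbf{V}) \right\rangle_{q(\mathbf{S})} \]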
7.2.2 Expectation Step
In this section we derive the posterior of the latent sources
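\[ p(\mathbf{S}|\mathbf{X},\mathbf{T},\mathbf{V}) = \frac{p(\mathbf{S},\mathbf{X}|\mathbf{T},\mathbf{V})}{p(\mathbf{X}|\mathbf{T},\mathbf{V})} \]
The terms in the expression for the log probability density of the posterior are given by
\[ \log p(\mathbf{S}|\mathbf{X},\mathbf{T},\mathbf{V}) = \sum_{\nu,\tau} \left( \sum_i \left( -\frac{D}{2} \frac{|s_{\nu,i,\tau}|^2}{t_{\nu,i} v_{i,\tau}} \right) + \frac{D}{2} \frac{\left| \sum_i s_{\nu,i,\tau} \right|^2}{\sum_i t_{\nu,i} v_{i,\tau}} \right) + \cdots \tag{7.5} \]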
This defines a multivariate normal distribution over the latent sources, for which we will adopt the following
notation: $\mathbf{s}_{\nu,\tau} = [s_{\nu,1,\tau}, \dots, s_{\nu,I,\tau}]$, $\mathbf{1}$ is an $I$-element row vector of ones, so that we can write
$\sum_i s_{\nu,i,\tau} = \mathbf{1}\mathbf{s}_{\nu,\tau}$.
Let $\mathbf{A}_{\nu,\tau}$ be an $I \times I$ diagonal (covariance) matrix with $i$th diagonal element $t_{\nu,i} v_{i,\tau}$. The above expression is rewritten
as
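\[ \log p(\mathbf{S}|\mathbf{X},\mathbf{T},\mathbf{V}) = \sum_{\nu,\tau} \left( -\frac{D}{2} \operatorname{Tr} \mathbf{A}_{\nu,\tau}^{-1} \mathbf{s}_{\nu,\tau} \mathbf{s}_{\nu,\tau}^{H} + \frac{D}{2} \operatorname{Tr} \frac{\mathbf{1}^\top \mathbf{1} \mathbf{s}_{\nu,\tau} \mathbf{s}_{\nu,\tau}^{H}}{\mathbf{1} \mathbf{A}_{\nu,\tau} \mathbf{1}^\top} \right) + \cdots \]
which, after some manipulations (see Section B.2), becomes
\begin{align*}
&= \sum_{\nu,\tau} -\frac{D}{2} \operatorname{Tr} \left( \mathbf{s}_{\nu,\tau} - \frac{\mathbf{A}_{\nu,\tau} \mathbf{1}^\top \mathbf{1} \mathbf{s}_{\nu,\tau}}{\mathbf{1} \mathbf{A}_{\nu,\tau} \mathbf{1}^\top} \right)^{H} \left( \mathbf{A}_{\nu,\tau} - \frac{\mathbf{A}_{\nu,\tau} \mathbf{1}^\top \mathbf{1} \mathbf{A}_{\nu,\tau}}{\mathbf{1} \mathbf{A}_{\nu,\tau} \mathbf{1}^\top} \right) \left( \mathbf{s}_{\nu,\tau} - \frac{\mathbf{A}_{\nu,\tau} \mathbf{1}^\top \mathbf{1} \mathbf{s}_{\nu,\tau}}{\mathbf{1} \mathbf{A}_{\nu,\tau} \mathbf{1}^\top} \right) + \cdots \\
&= \sum_{\nu,\tau} -\frac{D}{2} \operatorname{Tr} \left( \mathbf{s}_{\nu,\tau} - \frac{\mathbf{A}_{\nu,\tau} \mathbf{1}^\top x_{\nu,\tau}}{\mathbf{1} \mathbf{A}_{\nu,\tau} \mathbf{1}^\top} \right)^{H} \left( \mathbf{A}_{\nu,\tau} - \frac{\mathbf{A}_{\nu,\tau} \mathbf{1}^\top \mathbf{1} \mathbf{A}_{\nu,\tau}}{\mathbf{1} \mathbf{A}_{\nu,\tau} \mathbf{1}^\top} \right) \left( \mathbf{s}_{\nu,\tau} - \frac{\mathbf{A}_{\nu,\tau} \mathbf{1}^\top x_{\nu,\tau}}{\mathbf{1} \mathbf{A}_{\nu,\tau} \mathbf{1}^\top} \right) + \cdots
\end{align*}
This is a multivariate normal distribution, as we can write
\[ p(\mathbf{S}|\mathbf{X},\mathbf{T},\mathbf{V}) = \prod_{\nu,\tau} \mathcal{N}\left( \mathbf{s}_{\nu,\tau};\; \frac{\mathbf{A}_{\nu,\tau} \mathbf{1}^\top x_{\nu,\tau}}{\mathbf{1} \mathbf{A}_{\nu,\tau} \mathbf{1}^\top},\; \mathbf{A}_{\nu,\tau} - \frac{\mathbf{A}_{\nu,\tau} \mathbf{1}^\top \mathbf{1} \mathbf{A}_{\nu,\tau}}{\mathbf{1} \mathbf{A}_{\nu,\tau} \mathbf{1}^\top} \right) = \prod_{\nu,\tau} \mathcal{N}\left( \mathbf{s}_{\nu,\tau}; \boldsymbol{\mu}_{\nu,\tau}, \boldsymbol{\Sigma}_{\nu,\tau} \right) \]
and the standard results for the sufficient statistics of the posterior are:
\[ \langle \mathbf{s}_{\nu,\tau} \rangle = \boldsymbol{\mu}_{\nu,\tau}, \qquad \left\langle \mathbf{s}_{\nu,\tau} \mathbf{s}_{\nu,\tau}^{H} \right\rangle = \boldsymbol{\mu}_{\nu,\tau} \boldsymbol{\mu}_{\nu,\tau}^{H} + \boldsymbol{\Sigma}_{\nu,\tau} \]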
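By defining a positive quantity called the responsibility by Cemgil and Dikmen [2007]
\[ \kappa_{\nu,i,\tau} = \frac{t_{\nu,i} v_{i,\tau}}{\sum_{i'} t_{\nu,i'} v_{i',\tau}} \]
we can write the correlations as
\[ \left\langle |s_{\nu,i,\tau}|^2 \right\rangle = D\, t_{\nu,i} v_{i,\tau} \left( 1 - \kappa_{\nu,i,\tau} \right) + \kappa_{\nu,i,\tau}^2 |x_{\nu,\tau}|^2 \]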
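For a single time-frequency bin, the matrix form of the posterior and the responsibility form of the second moments can be compared directly. The sketch below is restricted to the real-valued case D = 1, and the function names are illustrative rather than taken from any implementation.

```python
import numpy as np

def posterior_moments(a, x):
    """Posterior mean and covariance of the sources at one bin (real case, D = 1).

    a : prior variances t_{nu,i} v_{i,tau} of the I sources
    x : observed coefficient, the sum of the sources
    """
    a = np.asarray(a, dtype=float)
    ones = np.ones((1, a.size))                # row vector of ones
    A = np.diag(a)
    denom = (ones @ A @ ones.T).item()         # 1 A 1^T, the total variance
    mu = (A @ ones.T).ravel() * x / denom
    Sigma = A - (A @ ones.T @ ones @ A) / denom
    return mu, Sigma

def second_moments(a, x):
    """<s_i^2> written with the responsibilities kappa_i (real case, D = 1)."""
    a = np.asarray(a, dtype=float)
    kappa = a / a.sum()
    return a * (1.0 - kappa) + kappa**2 * x**2

a, x = [0.5, 2.0, 1.5], -1.3
mu, Sigma = posterior_moments(a, x)
```

The posterior mean shares the observation among the sources in proportion to their prior variances, and the covariance has no mass along the direction of the ones vector, since the sources are constrained to sum to the observation.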
7.2.3 Maximization Step
In this section, we will present the M step as a coordinate ascent in T and V. Other schemes such as gradient
descent and Hessian based approaches are possible (see for example Dhillon and Sra [2006]), however they
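involve more computation and storage requirements. The update rule for the templates is given by
\[ \frac{\partial}{\partial t_{\nu,i}} \left\langle \log p(\mathbf{X},\mathbf{S}|\mathbf{T},\mathbf{V}) \right\rangle = \frac{D}{2} \sum_{\tau} \left( \frac{|s_{\nu,i,\tau}|^2}{t_{\nu,i}^2 v_{i,\tau}} - \frac{1}{t_{\nu,i}} \right) = 0 \]
\[ t_{\nu,i} = \frac{1}{T} \sum_{\tau} \frac{|s_{\nu,i,\tau}|^2}{v_{i,\tau}} \]
and the update rule for the excitations is given by:
\[ \frac{\partial}{\partial v_{i,\tau}} \left\langle \log p(\mathbf{X},\mathbf{S}|\mathbf{T},\mathbf{V}) \right\rangle = \frac{D}{2} \sum_{\nu} \left( \frac{|s_{\nu,i,\tau}|^2}{t_{\nu,i} v_{i,\tau}^2} - \frac{1}{v_{i,\tau}} \right) = 0 \]
\[ v_{i,\tau} = \frac{1}{F} \sum_{\nu} \frac{|s_{\nu,i,\tau}|^2}{t_{\nu,i}} \]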
The summations in the update rules can be carried out by means of efficient matrix multiplications. Note
that it is unnecessary (and also expensive) to calculate the complete sufficient statistics of the posterior
over the latent sources. All that is required is the summations over frequency and time of the individual
correlations (the diagonal elements of the covariance matrix).
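One full EM iteration can be sketched in a few lines of array code. The sketch assumes the real-valued model (D = 1), replaces the squared sources in the update rules by their posterior expectations from the E-step, and updates the excitations with the newly updated templates. The function names are illustrative, and the explicit three-dimensional arrays are used for clarity rather than efficiency.

```python
import numpy as np

def gaussian_variance_log_lik(X2, T, V):
    """Marginal log-likelihood (7.2) of squared observations X2 given T and V."""
    sigma2 = T @ V
    return float(-0.5 * np.sum(X2 / sigma2 + np.log(2.0 * np.pi * sigma2)))

def em_step(X2, T, V):
    """One EM iteration for the real-valued Gaussian variance model.

    X2 : (F, N) squared observations, T : (F, I) templates, V : (I, N) excitations.
    """
    sigma2 = T @ V                                   # (F, N) total variances
    TV = T[:, :, None] * V[None, :, :]               # (F, I, N) source variances
    kappa = TV / sigma2[:, None, :]                  # responsibilities
    # E-step: posterior second moments <s^2> of each latent source.
    S2 = TV * (1.0 - kappa) + kappa**2 * X2[:, None, :]
    # M-step, templates: average of <s^2> / v over time.
    T_new = np.mean(S2 / V[None, :, :], axis=2)
    # M-step, excitations: average of <s^2> / t over frequency, using T_new.
    V_new = np.mean(S2 / T_new[:, :, None], axis=0)
    return T_new, V_new

def run_em(X2, T, V, n_iter=25):
    """Iterate EM and record the log-likelihood after every step."""
    trace = [gaussian_variance_log_lik(X2, T, V)]
    for _ in range(n_iter):
        T, V = em_step(X2, T, V)
        trace.append(gaussian_variance_log_lik(X2, T, V))
    return T, V, trace
```

Because each coordinate update can only increase the expected complete-data log-likelihood, the recorded trace should never decrease, which is a convenient check on an implementation.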
7.3 Bayesian Hierarchical Model
To exploit the power of Bayesian inference, we place conjugate priors on the templates and excitations in
the following hierarchical model.
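\begin{align*}
t_{\nu,i} &\sim \mathcal{IG}\left( t_{\nu,i};\; a^{t}_{\nu,i},\; b^{t}_{\nu,i} a^{t}_{\nu,i} \right) \\
v_{i,\tau} &\sim \mathcal{IG}\left( v_{i,\tau};\; a^{v}_{i,\tau},\; b^{v}_{i,\tau} a^{v}_{i,\tau} \right) \\
s_{\nu,i,\tau} &\sim \mathcal{N}\left( s_{\nu,i,\tau};\; 0,\; t_{\nu,i} v_{i,\tau} \right) \\
x_{\nu,\tau} &= \sum_i s_{\nu,i,\tau} \tag{7.6}
\end{align*}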
The inverse-gamma distribution (Figure 7.2 on page 106) is a conjugate prior to the variance of the normal
distribution. This particular parametrization has the following interpretation: $\langle 1/t_{\nu,i} \rangle = 1/b^{t}_{\nu,i}$ and
$\langle 1/v_{i,\tau} \rangle = 1/b^{v}_{i,\tau}$ under the prior, so the scale parameters approximately give the expected values of the
templates and excitations. The standard deviation, relative to the scale parameter, is given by
$\frac{a}{(a-1)\sqrt{a-2}}$ (for a > 2), which decreases with a, hence the
shape parameter a can represent the sparsity of the representation. A high value of a means a low standard
deviation from the scale parameter, hence most of the coefficients have similar magnitudes, implying a full
representation. A low value of a means a high standard deviation from the scale parameter, hence most of
the coefficients of the representation will be close to zero as shown in Figure 7.2 on page 106, favouring a
sparse representation.
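The interpretation of the two parameters can be checked by simulation. The sketch below draws from the inverse-gamma prior with shape a and scale ab as the reciprocal of a gamma variate; the function names and the particular values of a and b are illustrative assumptions.

```python
import numpy as np

def sample_inverse_gamma(a, b, size, rng):
    """Draw from IG(shape=a, scale=a*b) as the reciprocal of a gamma variate.

    If g ~ Gamma(shape=a, rate=a*b) then 1/g ~ IG(a, a*b); NumPy parametrizes
    the gamma distribution by scale = 1/rate.
    """
    return 1.0 / rng.gamma(shape=a, scale=1.0 / (a * b), size=size)

def relative_std(a):
    """Standard deviation divided by the scale parameter b, valid for a > 2."""
    return a / ((a - 1.0) * np.sqrt(a - 2.0))

rng = np.random.default_rng(0)
a, b = 10.0, 2.0
r = sample_inverse_gamma(a, b, 200_000, rng)
mean_inverse = float(np.mean(1.0 / r))      # should be close to 1/b
empirical_rel_std = float(np.std(r) / b)    # should be close to relative_std(a)
```

Evaluating `relative_std` for increasing a shows the spread around the scale parameter shrinking, which is the sense in which a large a favours a full representation and a small a a sparse one.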
The joint probability distribution of this model is given by
p(X,S,T,V) = p(X,S|T,V)p(T)p(V)
from which we can consider a number of inference tasks. These typically involve calculating the posterior
p(S,T,V|X) and the marginal likelihood (also known as the evidence) given the hyperparameters p(X).
7.3.1 Inference by Variational Bayes
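The development of the Variational Bayes inference algorithm [Bishop, 2006, Ghahramani and Beal, 2001] is
similar to the EM algorithm. Again we approximate the log marginal likelihood by means of an instrumental
distribution:
\begin{align*}
L_X \equiv \log p(\mathbf{X}) &= \log \int d\mathbf{S}\, d\mathbf{T}\, d\mathbf{V}\; p(\mathbf{X},\mathbf{S},\mathbf{T},\mathbf{V}) \\
&= \log \int d\mathbf{S}\, d\mathbf{T}\, d\mathbf{V}\; \frac{q(\mathbf{S},\mathbf{T},\mathbf{V})}{q(\mathbf{S},\mathbf{T},\mathbf{V})}\, p(\mathbf{X},\mathbf{S},\mathbf{T},\mathbf{V}) \\
&\geq \int d\mathbf{S}\, d\mathbf{T}\, d\mathbf{V}\; q(\mathbf{S},\mathbf{T},\mathbf{V}) \log \frac{p(\mathbf{X},\mathbf{S},\mathbf{T},\mathbf{V})}{q(\mathbf{S},\mathbf{T},\mathbf{V})} \\
&\equiv B[q(\mathbf{S},\mathbf{T},\mathbf{V})]
\end{align*}
Figure 7.2: The inverse-gamma distribution, p(r) = IG(r; a, a) for different a, and scale parameter b = 1
[Plot omitted: P(r) against r for r from 0 to 4; curves for a = 1, a = 2 and a = 3.]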
The bound is tight when the instrumental distribution is equal to the posterior:
q(S,T,V) = p(S,T,V|X)
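The bound and its tightness at the posterior can be illustrated on a toy problem. The sketch below assumes a single discrete latent variable with three states and a hypothetical joint p(x, z); it is not the Gaussian variance model itself.

```python
import math

def lower_bound(joint, q):
    """B[q] = sum_z q(z) log(p(x, z) / q(z)) for a discrete latent z."""
    return sum(qz * math.log(pz / qz) for pz, qz in zip(joint, q) if qz > 0)

# Hypothetical joint p(x, z) for one fixed observation x and three values of z.
joint = [0.10, 0.25, 0.05]
evidence = sum(joint)                       # p(x)
posterior = [p / evidence for p in joint]   # p(z | x)
uniform = [1.0 / 3.0] * 3                   # an arbitrary instrumental distribution

# Gap between log p(x) and the bound: positive in general, zero at the posterior.
gap_uniform = math.log(evidence) - lower_bound(joint, uniform)
gap_posterior = math.log(evidence) - lower_bound(joint, posterior)
```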
However, the posterior distribution is intractable, so instead we assume a factorized form
$$
q(S,T,V) = q(S)\,q(T)\,q(V) = \Big(\prod_{\nu,\tau} q(s_{\nu,\tau})\Big)\Big(\prod_{\nu,i} q(t_{\nu,i})\Big)\Big(\prod_{i,\tau} q(v_{i,\tau})\Big)
$$
This particular approximation is known as the mean field approximation. It can be shown that updating the sufficient statistics of q(S), q(T) or q(V), holding both of the other distributions constant, leads to a monotonically increasing bound with each iteration: $\mathcal{B}[q(S,T,V)^{(n+1)}] \ge \mathcal{B}[q(S,T,V)^{(n)}]$. To illustrate the similarity between VB and EM, we will choose the following ordering of update rules.
The approximate E-step is:
$$
q(S)^{(n)} \propto \exp\left( \langle \log p(X,S,T,V) \rangle_{q(T)^{(n-1)} q(V)^{(n-1)}} \right)
$$
then the approximate M-step involves iterating
$$
\begin{aligned}
q(T)^{(n+k)} &\propto \exp\left( \langle \log p(X,S,T,V) \rangle_{q(S)^{(n)} q(V)^{(n+k-1)}} \right) \\
q(V)^{(n+k)} &\propto \exp\left( \langle \log p(X,S,T,V) \rangle_{q(S)^{(n)} q(T)^{(n+k)}} \right)
\end{aligned}
$$
for $k = 1, \dots, K$ until convergence as defined in Section 7.3.1.2.
An alternative means of Bayesian model selection for the Gaussian variance model with half-normal
priors on the factor matrices is investigated by Févotte [2010] using a model selection criterion. In Févotte
and Cemgil [2009] Bayesian model selection using both Variational Bayes and MCMC is sketched for a
number of NMF models, including the Gaussian variance model.
7.3.1.1 Variational update equations and sufficient statistics
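The update rule for the expectation step follows from (7.5):
$$
q(s_{\nu,\tau}) \propto \exp\left( \sum_i \left( -\frac{D}{2} \langle t^{-1}_{\nu,i}\rangle \langle v^{-1}_{i,\tau}\rangle |s_{\nu,i,\tau}|^2 \right) + \frac{D}{2}\, \frac{\left|\sum_i s_{\nu,i,\tau}\right|^2}{\sum_i \left(\langle t^{-1}_{\nu,i}\rangle \langle v^{-1}_{i,\tau}\rangle\right)^{-1}} \right)
$$
The calculation is the same as for the E-step of the EM algorithm, but the covariance matrix $A_{\nu,\tau}$ has elements $\left(\langle t^{-1}_{\nu,i}\rangle \langle v^{-1}_{i,\tau}\rangle\right)^{-1}$ along the diagonal. The responsibilities are
$$
\kappa_{\nu,i,\tau} = \frac{\left(\langle t^{-1}_{\nu,i}\rangle \langle v^{-1}_{i,\tau}\rangle\right)^{-1}}{\sum_{i'} \left(\langle t^{-1}_{\nu,i'}\rangle \langle v^{-1}_{i',\tau}\rangle\right)^{-1}}
$$
and the correlations that we need for the M-step are
$$
\langle |s_{\nu,i,\tau}|^2 \rangle = \left(\langle t^{-1}_{\nu,i}\rangle \langle v^{-1}_{i,\tau}\rangle\right)^{-1} (1 - \kappa_{\nu,i,\tau}) + \kappa^2_{\nu,i,\tau} |x_{\nu,\tau}|^2 \tag{7.7}
$$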
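For a single time-frequency bin, the responsibilities and the correlations of (7.7) could be computed as in the following sketch. It assumes the responsibilities are the normalized diagonal elements of $A_{\nu,\tau}$; the function and argument names are illustrative only.

```python
def source_moments(inv_t, inv_v, x):
    """Responsibilities and second moments of the sources for one (nu, tau) bin.

    inv_t[i] and inv_v[i] stand for <1/t_{nu,i}> and <1/v_{i,tau}>; x is the
    observed coefficient x_{nu,tau} (real or complex).
    """
    # Diagonal of A_{nu,tau}: (<1/t> <1/v>)^{-1} for each source i.
    variances = [1.0 / (it * iv) for it, iv in zip(inv_t, inv_v)]
    total = sum(variances)
    kappa = [a / total for a in variances]
    power = abs(x) ** 2
    # Equation (7.7): <|s_i|^2> = A_i (1 - kappa_i) + kappa_i^2 |x|^2.
    second = [a * (1.0 - k) + k * k * power for a, k in zip(variances, kappa)]
    return kappa, second
```

Only these per-source scalars are needed by the M-step, which is why the full posterior covariance over the sources never has to be formed.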
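The update equations and sufficient statistics for the templates and excitations follow from the properties of the inverse-gamma distribution:
$$
\begin{aligned}
q(t_{\nu,i}) &\propto \exp\left( -\Big(a^t_{\nu,i} + \frac{DT}{2} + 1\Big) \log t_{\nu,i} - \Big(a^t_{\nu,i} b^t_{\nu,i} + \frac{D}{2} \sum_\tau \langle |s_{\nu,i,\tau}|^2 \rangle \langle v^{-1}_{i,\tau} \rangle\Big)\, t^{-1}_{\nu,i} \right) \\
&\propto \mathcal{IG}(t_{\nu,i};\, \alpha^t_{\nu,i},\, \beta^t_{\nu,i}) \\
\alpha^t_{\nu,i} &= a^t_{\nu,i} + \frac{DT}{2}, \qquad \beta^t_{\nu,i} = a^t_{\nu,i} b^t_{\nu,i} + \frac{D}{2} \sum_\tau \langle |s_{\nu,i,\tau}|^2 \rangle \langle v^{-1}_{i,\tau} \rangle \\
\langle t^{-1}_{\nu,i} \rangle &= \frac{\alpha^t_{\nu,i}}{\beta^t_{\nu,i}}, \qquad \langle \log t_{\nu,i} \rangle = -\Psi(\alpha^t_{\nu,i}) + \log \beta^t_{\nu,i}
\end{aligned}
\tag{7.8}
$$
$$
\begin{aligned}
q(v_{i,\tau}) &\propto \exp\left( -\Big(a^v_{i,\tau} + \frac{DF}{2} + 1\Big) \log v_{i,\tau} - \Big(a^v_{i,\tau} b^v_{i,\tau} + \frac{D}{2} \sum_\nu \langle |s_{\nu,i,\tau}|^2 \rangle \langle t^{-1}_{\nu,i} \rangle\Big)\, v^{-1}_{i,\tau} \right) \\
&\propto \mathcal{IG}(v_{i,\tau};\, \alpha^v_{i,\tau},\, \beta^v_{i,\tau}) \\
\alpha^v_{i,\tau} &= a^v_{i,\tau} + \frac{DF}{2}, \qquad \beta^v_{i,\tau} = a^v_{i,\tau} b^v_{i,\tau} + \frac{D}{2} \sum_\nu \langle |s_{\nu,i,\tau}|^2 \rangle \langle t^{-1}_{\nu,i} \rangle \\
\langle v^{-1}_{i,\tau} \rangle &= \frac{\alpha^v_{i,\tau}}{\beta^v_{i,\tau}}, \qquad \langle \log v_{i,\tau} \rangle = -\Psi(\alpha^v_{i,\tau}) + \log \beta^v_{i,\tau}
\end{aligned}
\tag{7.9}
$$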
We retain the attractiveness of being able to perform these update equations as matrix operations. Note that
expensive evaluations of the digamma function Ψ (α) can be precomputed, as the posterior shape parameters
are constant.
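As an illustration of (7.8), the posterior parameters of a single template element might be computed as follows. This is a sketch for one element only; the default `D=1` is an assumption for real-valued coefficients, and the names are hypothetical. The excitation update (7.9) has the same form with the sum taken over frequency instead of time.

```python
def template_update(a, b, second_moments, inv_v, D=1):
    """Posterior IG(alpha, beta) of one template element t_{nu,i}, as in (7.8).

    second_moments[tau] stands for <|s_{nu,i,tau}|^2> and inv_v[tau] for
    <1/v_{i,tau}>; a and b are the prior shape and scale hyperparameters.
    """
    n_frames = len(second_moments)
    alpha = a + D * n_frames / 2.0
    beta = a * b + 0.5 * D * sum(s * iv for s, iv in zip(second_moments, inv_v))
    inv_t = alpha / beta          # <1/t> under the inverse-gamma posterior
    return alpha, beta, inv_t
```

With no observed frames the function returns the prior statistics, so `inv_t` equals `1 / b`, in agreement with the interpretation of the scale parameter given for the prior.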
7.3.1.2 The Variational Bound
The variational bound is a lower bound on the marginal log likelihood and also can be used to define
convergence for the E and M steps.
B[q(S,T,V)] = 〈log p(X,S,T,V)〉q +H[q(S,T,V)]
H[q(S,T,V)] denotes the entropy of the variational distribution q, which is defined as $-\langle \log q(S,T,V) \rangle_q$.
The following is the complete expression for the variational bound immediately after the E-step. The
first two rows are the combined energy and entropy of the latent sources.
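$$
\begin{aligned}
\mathcal{B}[q(S,T,V)] ={}& \sum_{\nu,\tau} \left[ -\frac{D}{2} \Big( \sum_i \langle t^{-1}_{\nu,i} \rangle^{-1} \langle v^{-1}_{i,\tau} \rangle^{-1} \Big)^{-1} |x_{\nu,\tau}|^2 - \frac{D}{2} \log \frac{2\pi}{D} + \frac{D}{2} \log \Big( \sum_i \langle t^{-1}_{\nu,i} \rangle^{-1} \langle v^{-1}_{i,\tau} \rangle^{-1} \Big) \right] \\
& - \frac{D}{2} \sum_{\nu,i,\tau} \Big( -\langle \log t_{\nu,i} \rangle - \log \langle t^{-1}_{\nu,i} \rangle - \langle \log v_{i,\tau} \rangle - \log \langle v^{-1}_{i,\tau} \rangle \Big) && F[S] + H[q(S)] \\
& + \sum_{\nu,i} \Big( -(a^t_{\nu,i} + 1) \langle \log t_{\nu,i} \rangle - a^t_{\nu,i} b^t_{\nu,i} \langle t^{-1}_{\nu,i} \rangle + a^t_{\nu,i} \log (a^t_{\nu,i} b^t_{\nu,i}) - \log \Gamma(a^t_{\nu,i}) \Big) && F[T] \\
& - \sum_{\nu,i} \Big( -(\alpha^t_{\nu,i} + 1) \langle \log t_{\nu,i} \rangle - \beta^t_{\nu,i} \langle t^{-1}_{\nu,i} \rangle + \alpha^t_{\nu,i} \log \beta^t_{\nu,i} - \log \Gamma(\alpha^t_{\nu,i}) \Big) && H[q(T)] \\
& + \sum_{\tau,i} \Big( -(a^v_{i,\tau} + 1) \langle \log v_{i,\tau} \rangle - a^v_{i,\tau} b^v_{i,\tau} \langle v^{-1}_{i,\tau} \rangle + a^v_{i,\tau} \log (a^v_{i,\tau} b^v_{i,\tau}) - \log \Gamma(a^v_{i,\tau}) \Big) && F[V] \\
& - \sum_{\tau,i} \Big( -(\alpha^v_{i,\tau} + 1) \langle \log v_{i,\tau} \rangle - \beta^v_{i,\tau} \langle v^{-1}_{i,\tau} \rangle + \alpha^v_{i,\tau} \log \beta^v_{i,\tau} - \log \Gamma(\alpha^v_{i,\tau}) \Big) && H[q(V)]
\end{aligned}
\tag{7.10}
$$
The 'energy' notation used to label the summations here is:
$$
\begin{aligned}
F[S] &= \langle \log p(X,S|T,V) \rangle_q \\
F[T] &= \langle \log p(T) \rangle_q \\
F[V] &= \langle \log p(V) \rangle_q
\end{aligned}
$$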
The calculation of the variational bound can be implemented efficiently using matrix operations, just as with
the update equations. Expensive evaluations of log Γ (α) can be precomputed.
Once the templates or excitations are updated in the M-step, the variational bound cannot be derived in
closed form. However, during the M-step, H[q(S)] remains constant, so we do not need to take it into account
when determining whether the M-step iterations have converged. This is convenient because H[q(S)] is not
straightforward to derive in isolation from F [S]. We therefore need to confirm after each update in the
M-step that the quantity
F [S] + F [T] +H[q(T)] + F [V] +H[q(V)] (7.11)
increases, otherwise we perform another E-step at this stage (see Algorithm 7.1). The values for F [T],
H[q(T)], F [V] and H[q(V)] are as in (7.10), however F [S] during the M-step is given by
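$$
F[S] = \sum_\nu \sum_\tau \sum_i \left( -\frac{D}{2} \langle t^{-1}_{\nu,i} \rangle \langle v^{-1}_{i,\tau} \rangle \langle |s_{\nu,i,\tau}|^2 \rangle - \frac{D}{2} \log \frac{2\pi}{D} - \frac{D}{2} \langle \log t_{\nu,i} \rangle - \frac{D}{2} \langle \log v_{i,\tau} \rangle \right)
$$
where $\langle |s_{\nu,i,\tau}|^2 \rangle$ is given by (7.7). F[S] may be calculated efficiently using matrix update equations from the sufficient statistics of T and V immediately prior to being updated during the M-step.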
7.3.1.3 Hyperparameter Optimization
Hyperparameter optimization involves maximizing the variational bound with respect to the hyperparameters. The components of the variational bound that are relevant to this maximization are F[T] for the template hyperparameters, and F[V] for the excitation hyperparameters¹. The resulting expressions for
optimization are very similar to expressions for finding maximum likelihood estimates of the parameters of
an inverse-gamma distribution.
Hyperparameter optimization is used for training the priors of the template and excitation matrices
using labelled data. The hyperparameters will be tied over some subset of the elements of the template or
excitation factorization matrices. For example, we typically do not a priori know the length of time over
which we will observe data for the model. For the model to be identifiable, we are forced to tie the excitation
hyperparameters across the rows of the excitation matrix.
Because of the numerous possibilities for how we set up the optimization, we will only outline the process
here for a single set of parameters M with variances {rm} over which either the shape parameter or the
scale parameter is tied. This allows the shape parameters and the scale parameters to be tied over different
subsets of the variances, for example we might want to have a global shape parameter for all of the template
variances, but have a scale parameter for each column of the template matrix.
¹ In (7.10) the entropy expression H[q(T)] is not directly dependent on the values of the hyperparameters, but is dependent only through the variational distribution q(T), which is updated during the M-Step for the templates. The same is true for the excitation hyperparameters. Therefore the entropy expressions are not used to optimize the hyperparameters in this section.
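To optimize a single scale parameter $b_M$ which is tied over the variances $\{r_m\}$ with corresponding shape parameters $\{a_m\}$, we maximize the following expression
$$
\mathcal{B}(b_M) = \sum_{m \in M} \Big( -(a_m + 1) \langle \log r_m \rangle - a_m b_M \langle r^{-1}_m \rangle + a_m \log (a_m b_M) - \log \Gamma(a_m) \Big)
$$
by setting the derivative to zero, i.e.
$$
\frac{\partial \mathcal{B}}{\partial b_M} = \sum_{m \in M} \left( -a_m \langle r^{-1}_m \rangle + \frac{a_m}{b_M} \right) = 0 \tag{7.12}
$$
giving an update rule:
$$
b_M \leftarrow \frac{\sum_{m \in M} a_m}{\sum_{m \in M} a_m \langle r^{-1}_m \rangle} \tag{7.13}
$$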
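The closed-form update (7.13) is a ratio of two sums, as the following sketch shows. The list arguments and function names are illustrative; `inv_means[m]` stands for the sufficient statistic $\langle r^{-1}_m \rangle$.

```python
def update_tied_scale(shapes, inv_means):
    """Update (7.13): b_M <- sum_m a_m / sum_m a_m <1/r_m>."""
    numerator = sum(shapes)
    denominator = sum(a * inv for a, inv in zip(shapes, inv_means))
    return numerator / denominator

def scale_gradient(b, shapes, inv_means):
    """Left-hand side of (7.12), the derivative of the bound with respect to b_M."""
    return sum(-a * inv + a / b for a, inv in zip(shapes, inv_means))
```

Evaluating `scale_gradient` at the value returned by `update_tied_scale` gives zero up to rounding, which is a convenient check on an implementation.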
To optimize a single shape parameter $a_M$ which is tied over the variances $\{r_m\}$ with corresponding scale parameters $\{b_m\}$, we maximize the following expression
$$
\mathcal{B}(a_M) = \sum_{m \in M} \Big( -(a_M + 1) \langle \log r_m \rangle - a_M b_m \langle r^{-1}_m \rangle + a_M \log (a_M b_m) - \log \Gamma(a_M) \Big)
$$
by setting the derivative to zero, i.e.
$$
\frac{\partial \mathcal{B}}{\partial a_M} = \sum_{m \in M} \Big( -\langle \log r_m \rangle - b_m \langle r^{-1}_m \rangle + \log a_M + 1 + \log b_m - \Psi(a_M) \Big) = 0
$$
giving the following expression:
$$
\log a_M - \Psi(a_M) = \frac{1}{|M|} \sum_{m \in M} \Big( \langle \log r_m \rangle + b_m \langle r^{-1}_m \rangle - 1 - \log b_m \Big)
$$
Equations of this form appear in the maximum likelihood estimate of Gamma distributions. $a_M$ can be found by Newton iterations. Denote the right hand side as $c$, then the following is the Newton-Raphson update:
$$
a_M \leftarrow a_M - \frac{\log a_M - \Psi(a_M) - c}{1/a_M - \Psi'(a_M)} \tag{7.14}
$$
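A minimal sketch of the Newton-Raphson iteration (7.14) is given here. The digamma and trigamma functions are approximated with the standard recurrence and asymptotic series, since the Python standard library does not provide them; the starting value and the step-halving safeguard are assumptions added for robustness and are not taken from the text.

```python
import math

def digamma(x):
    """Psi(x) for x > 0 via the recurrence and an asymptotic series."""
    result = 0.0
    while x < 6.0:
        result -= 1.0 / x
        x += 1.0
    inv2 = 1.0 / (x * x)
    return (result + math.log(x) - 0.5 / x
            - inv2 * (1.0 / 12 - inv2 * (1.0 / 120 - inv2 / 252)))

def trigamma(x):
    """Psi'(x) for x > 0 via the recurrence and an asymptotic series."""
    result = 0.0
    while x < 6.0:
        result += 1.0 / (x * x)
        x += 1.0
    inv2 = 1.0 / (x * x)
    return (result + 1.0 / x + 0.5 * inv2
            + (1.0 / x) * inv2 * (1.0 / 6 - inv2 * (1.0 / 30 - inv2 / 42)))

def solve_shape(c, a=None, tol=1e-10, max_iter=100):
    """Solve log(a) - Psi(a) = c for a > 0 with the update (7.14)."""
    if c <= 0:
        raise ValueError("log(a) - Psi(a) is positive for all a > 0")
    if a is None:
        a = 0.5 / c   # assumed start, from log(a) - Psi(a) ~ 1 / (2a)
    for _ in range(max_iter):
        f = math.log(a) - digamma(a) - c
        step = f / (1.0 / a - trigamma(a))
        new_a = a - step
        while new_a <= 0:          # safeguard (not in the text): halve the step
            step *= 0.5
            new_a = a - step
        if abs(new_a - a) < tol * a:
            return new_a
        a = new_a
    return a
```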
Hyperparameter optimization is performed after the variational bound no longer increases through the
E-Step and M-Step updates. The entire structure of the algorithm is described in Algorithm 7.1.
Algorithm 7.1 Variational Bayes for the Gaussian variance model, with hyperparameter optimization
• Sample point estimates $t_{\nu,i}$ and $v_{i,\tau}$ from the priors (7.6)
• Initialize sufficient statistics $\langle t^{-1}_{\nu,i} \rangle \leftarrow (t_{\nu,i})^{-1}$, $\langle \log t_{\nu,i} \rangle \leftarrow \log t_{\nu,i}$, $\langle v^{-1}_{i,\tau} \rangle \leftarrow (v_{i,\tau})^{-1}$, $\langle \log v_{i,\tau} \rangle \leftarrow \log v_{i,\tau}$ from the point estimates
• Calculate the variational bound $\mathcal{B}[q(S,T,V)]^{(0)}$ from (7.10)
• for n = 1, . . .
  – for k = 1, . . .
    ∗ Update the template sufficient statistics $\langle t^{-1}_{\nu,i} \rangle$, $\langle \log t_{\nu,i} \rangle$ (7.8)
    ∗ Update the excitation sufficient statistics $\langle v^{-1}_{i,\tau} \rangle$, $\langle \log v_{i,\tau} \rangle$ (7.9)
    ∗ If the quantity (7.11) increases, then continue, otherwise break
  – Calculate the variational bound $\mathcal{B}[q(S,T,V)]^{(n)}$
  – If $\mathcal{B}[q(S,T,V)]^{(n)} = \mathcal{B}[q(S,T,V)]^{(n-1)}$ then
    ∗ update each tied shape parameter using (7.14)
    ∗ update each tied scale parameter using (7.13)
    ∗ If the updates result in no changes to the hyperparameters, exit the algorithm
7.3.2 Markov Chain Monte-Carlo
7.3.2.1 Gibbs Sampler
A possible Gibbs sampler for the model involves sampling the blocks:
$$
\begin{aligned}
S^{(n+1)} &\sim p(S|X, T^{(n)}, V^{(n)}) \\
T^{(n+1)} &\sim p(T|S^{(n+1)}, V^{(n)}) \\
V^{(n+1)} &\sim p(V|S^{(n+1)}, T^{(n+1)})
\end{aligned}
$$
The marginal likelihood is estimated by Chib's method [Chib, 1995] around the mode $\{S^*, T^*, V^*\}$ found by running the Gibbs sampler. The marginal likelihood is decomposed as:
$$
\log p(X) = \log p(X, S^*, T^*, V^*) - \log p(S^*|X) - \log p(T^*|S^*) - \log p(V^*|S^*, T^*)
$$
The second term is found by the Monte-Carlo estimate:
$$
p(S^*|X) \approx \frac{1}{N} \sum_{n=1}^{N} p(S^*|X, T^{(n)}, V^{(n)})
$$
with $\{T^{(n)}, V^{(n)}\}$ returned by the Gibbs sampler. The third term is found by the Monte-Carlo estimate:
$$
p(T^*|S^*) \approx \frac{1}{M} \sum_{m=1}^{M} p(T^*|S^*, V^{(m)})
$$
requiring a further M samples from the reduced Gibbs sampler, clamping $S = S^*$:
$$
\begin{aligned}
T^{(m+1)} &\sim p(T|S^*, V^{(m)}) \\
V^{(m+1)} &\sim p(V|S^*, T^{(m+1)})
\end{aligned}
$$
Sampling Latent Sources The distribution $p(s_{\nu,1:I,\tau}|x_{\nu,\tau}, t_{\nu,1:I}, v_{1:I,\tau})$ is a degenerate multivariate normal distribution, because $p(x_{\nu,\tau}|s_{\nu,1:I,\tau})$ is itself degenerate. As the covariance matrix is not positive definite, we cannot form the Cholesky factor and therefore cannot sample from this distribution directly. Instead, we sample from the reduced distribution $p(s_{\nu,2:I,\tau}|x_{\nu,\tau}, t_{\nu,2:I}, v_{2:I,\tau})$ and then set $s_{\nu,1,\tau} = x_{\nu,\tau} - \sum_{i=2}^{I} s_{\nu,i,\tau}$. In the rest of this section, we drop the subscripts ν, τ for brevity. First observe that
$$
p(x|s_{2:I}) = \mathcal{N}(x;\, \mathbf{1} s_{2:I},\, t_1 v_1)
$$
where $\mathbf{1}$ is an (I − 1) row vector of ones. It follows that the posterior is
$$
\begin{aligned}
\log p(s_{2:I}|x, t_{2:I}, v_{2:I}) &= -\frac{D}{2} \frac{|x - \mathbf{1} s_{2:I}|^2}{t_1 v_1} - \frac{D}{2} s^H_{2:I} A^{-1} s_{2:I} + \dots \\
&= \frac{D}{t_1 v_1} s^H_{2:I} \mathbf{1}^\top x - \frac{D}{2} \operatorname{Tr}\left[ \left( \frac{1}{t_1 v_1} \mathbf{1}^\top \mathbf{1} + A^{-1} \right) s_{2:I} s^H_{2:I} \right] + \dots
\end{aligned}
$$
where A is a diagonal matrix with elements $t_i v_i$, $i = 2, \dots, I$. The posterior is therefore a multivariate normal with covariance matrix and mean
$$
\Sigma = \left( \frac{1}{t_1 v_1} \mathbf{1}^\top \mathbf{1} + A^{-1} \right)^{-1} = A - \frac{A \mathbf{1}^\top \mathbf{1} A}{t_1 v_1 + \mathbf{1} A \mathbf{1}^\top}, \qquad \mu = \frac{1}{t_1 v_1} \Sigma \mathbf{1}^\top x
$$
Note that the covariance matrix is formed by downdating the diagonal matrix A with the vector $\mathbf{1} A$ scaled by $(t_1 v_1 + \mathbf{1} A \mathbf{1}^\top)^{-1}$. The Cholesky factorization of the covariance matrix can be computed more efficiently by downdating the Cholesky factor of A than by calculating the full factorization. The Cholesky factor of A is a diagonal matrix with elements $\sqrt{t_i v_i}$, $i = 2, \dots, I$.
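The rank-one structure of the covariance matrix can be made concrete with a short sketch for one time-frequency bin. It follows the expressions for Σ and μ as stated in this section (without any additional scaling by D) and uses plain Python lists; the function name is hypothetical.

```python
def reduced_posterior(t, v, x):
    """Mean and covariance of p(s_{2:I} | x) for one (nu, tau) bin.

    t[i] and v[i] are the template and excitation values of source i + 1,
    so t[0] * v[0] is the variance t_1 v_1 of the eliminated source.
    """
    first = t[0] * v[0]
    diag = [ti * vi for ti, vi in zip(t[1:], v[1:])]   # diagonal of A
    denom = first + sum(diag)                           # t_1 v_1 + 1 A 1^T
    size = len(diag)
    # Sigma = A - (A 1^T)(1 A) / denom: a rank-one downdate of the diagonal A.
    sigma = [[(diag[i] if i == j else 0.0) - diag[i] * diag[j] / denom
              for j in range(size)] for i in range(size)]
    # mu = Sigma 1^T x / (t_1 v_1)
    mu = [sum(row) * x / first for row in sigma]
    return mu, sigma
```

The mean reduces to `diag[i] * x / denom`, i.e. each source receives a share of the observation in proportion to its variance.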
Single Source The following is a description of the Gibbs sampler for the single source case, where the
source is directly observed: S = X. We mention this as a special case as the previous expressions for
estimating the marginal likelihood using Chib's method do not apply here. The algorithm iterates
$$
\begin{aligned}
T^{(n+1)} &\sim p(T|X, V^{(n)}) \\
V^{(n+1)} &\sim p(V|X, T^{(n+1)})
\end{aligned}
$$
and the marginal likelihood at the mode $\{T^*, V^*\}$ found from the above run is:
$$
\log p(X) = \log p(X, T^*, V^*) - \log p(T^*|X) - \log p(V^*|X, T^*)
$$
and the second term is found by the Monte-Carlo estimate
$$
p(T^*|X) \approx \frac{1}{N} \sum_{n=1}^{N} p(T^*|X, V^{(n)})
$$
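The single-source sampler and Chib's estimate can be sketched for a scalar toy model with F = I = T = 1. The sketch assumes a real-valued observation (D = 1), so that the conditionals are inverse-gamma with shape increased by 1/2, and it evaluates the ordinates at the highest-density sample of the chain; all names are illustrative.

```python
import math
import random

def log_ig(r, shape, scale):
    """Log density of the inverse-gamma distribution IG(r; shape, scale)."""
    return (shape * math.log(scale) - math.lgamma(shape)
            - (shape + 1.0) * math.log(r) - scale / r)

def log_normal(x, var):
    """Log density of a zero-mean real normal with variance var."""
    return -0.5 * math.log(2.0 * math.pi * var) - 0.5 * x * x / var

def chib_log_evidence(x, a_t, b_t, a_v, b_v, n=5000, burn=500, seed=0):
    """Gibbs sampling plus Chib's estimate of log p(x) for a scalar toy model."""
    rng = random.Random(seed)

    def draw_ig(shape, scale):
        # scale / Gamma(shape, 1) is an inverse-gamma draw.
        return scale / rng.gammavariate(shape, 1.0)

    def t_scale(v):
        return a_t * b_t + 0.5 * x * x / v    # scale of p(t | x, v)

    def v_scale(t):
        return a_v * b_v + 0.5 * x * x / t    # scale of p(v | x, t)

    def log_joint(t, v):
        return (log_normal(x, t * v) + log_ig(t, a_t, a_t * b_t)
                + log_ig(v, a_v, a_v * b_v))

    t, v, chain = b_t, b_v, []
    for sweep in range(n + burn):
        t = draw_ig(a_t + 0.5, t_scale(v))
        v = draw_ig(a_v + 0.5, v_scale(t))
        if sweep >= burn:
            chain.append((t, v))
    # Evaluation point: the sample with the highest joint density.
    t_star, v_star = max(chain, key=lambda tv: log_joint(*tv))
    # p(t* | x) is averaged over the sampled v; p(v* | x, t*) is exact.
    ordinate_t = sum(math.exp(log_ig(t_star, a_t + 0.5, t_scale(v_n)))
                     for _, v_n in chain) / len(chain)
    log_ordinate_v = log_ig(v_star, a_v + 0.5, v_scale(t_star))
    return log_joint(t_star, v_star) - math.log(ordinate_t) - log_ordinate_v
```

Because the last ordinate is available in closed form, only one sampling run is needed in this special case.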
7.3.2.2 Metropolis-Hastings
The particular scheme we propose here marginalizes the latent sources, and attempts to draw samples from
the posterior p(T,V|X) directly by constructing a Markov chain which draws samples from p(T|X,V)
and p(V|X,T) in sequence. However, these distributions cannot be sampled directly, so we resort to a
Metropolis-Hastings (MH) algorithm. Note that the posterior can be written as:
$$
p(T,V|X) = \frac{1}{p(X)}\, p(X|T,V)\, p(T)\, p(V) \propto p(X|T,V)\, p(T)\, p(V)
$$
p(X|T,V) is already given in (7.2). The MH algorithm requires a proposal density with the same coverage
as the posterior distribution being sampled. We suggest using the inverse-gamma distributions q(T) and
q(V) in (7.8) and (7.9) as proposal distributions for the template and excitation parameters respectively,
substituting the sufficient statistics with the current point estimates. These are suitable for inferring the
posterior p(T,V|X) as the method used to derive them also involved marginalizing the latent sources (the
E-Step of the Variational Bayes algorithm). For the MH algorithm we denote these proposal densities as
q(T,T′|X,V), the probability of moving from T to T′. The MH algorithm simply involves iterating between
the moves
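$$
\begin{aligned}
T' &\sim q(T^{(n)}, T'|X, V^{(n)}) \\
T^{(n+1)} &= \begin{cases} T' & \text{if } u \le \alpha(T^{(n)}, T'|X, V^{(n)}),\; u \sim U(0,1) \\ T^{(n)} & \text{otherwise} \end{cases} \\
V' &\sim q(V^{(n)}, V'|X, T^{(n+1)}) \\
V^{(n+1)} &= \begin{cases} V' & \text{if } u \le \alpha(V^{(n)}, V'|X, T^{(n+1)}),\; u \sim U(0,1) \\ V^{(n)} & \text{otherwise} \end{cases}
\end{aligned}
$$
where the acceptance ratios are given by
$$
\begin{aligned}
\alpha(T, T'|X, V) &= \min\left\{ 1,\; \frac{p(X, T', V)}{p(X, T, V)}\, \frac{q(T', T|X, V)}{q(T, T'|X, V)} \right\} \\
\alpha(V, V'|X, T) &= \min\left\{ 1,\; \frac{p(X, T, V')}{p(X, T, V)}\, \frac{q(V', V|X, T)}{q(V, V'|X, T)} \right\}
\end{aligned}
$$
We have found in practice that the acceptance rate using these well-formed proposal distributions is very high (approximately 90%) when sampling the entire template and excitation matrices.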
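A single Metropolis-Hastings move of this kind can be written generically. The sketch below is not specific to the Gaussian variance model: the target, the proposal sampler and the proposal density are passed in as functions, and a candidate is accepted when a uniform draw does not exceed the acceptance ratio.

```python
import math
import random

def mh_step(current, log_target, propose, log_proposal, rng):
    """One Metropolis-Hastings move; returns the next state and an accept flag.

    propose(current, rng) draws a candidate, and log_proposal(src, dst) is the
    log density of proposing dst from src.
    """
    candidate = propose(current, rng)
    log_ratio = (log_target(candidate) - log_target(current)
                 + log_proposal(candidate, current)
                 - log_proposal(current, candidate))
    alpha = math.exp(min(0.0, log_ratio))      # acceptance probability
    if rng.random() <= alpha:
        return candidate, True
    return current, False
```

For the scheme described above, `log_target` would be log p(X, T, V) with V held fixed and the proposal would be the inverse-gamma density of (7.8); the usage below instead takes a scalar gamma target with an exponential proposal, purely as an illustration.

```python
rng = random.Random(0)
log_target = lambda r: 2.0 * math.log(r) - r            # Gamma(3, 1), unnormalized
propose = lambda cur, g: g.expovariate(1.0 / 3.0)       # independent proposal
log_q = lambda src, dst: -math.log(3.0) - dst / 3.0
state = 1.0
for _ in range(1000):
    state, accepted = mh_step(state, log_target, propose, log_q, rng)
```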
An extension of Chib's method for evaluating the marginal likelihood from the MH output [Chib and
Jeliazkov, 2001] is outlined below. At any point {T∗,V∗} the following holds:
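$$
\log p(X) = \log p(X, T^*, V^*) - \log p(T^*|X) - \log p(V^*|X, T^*)
$$
The posterior ordinates are given by
$$
\begin{aligned}
p(T^*|X) &= \frac{\langle \alpha(T, T^*|X, V)\, q(T, T^*|X, V) \rangle_{p(T,V|X)}}{\langle \alpha(T^*, T|X, V) \rangle_{p(V|X,T^*)\, q(T^*, T|X, V)}} \\
p(V^*|X, T^*) &= \frac{\langle \alpha(V, V^*|X, T^*)\, q(V, V^*|X, T^*) \rangle_{p(V|X,T^*)}}{\langle \alpha(V^*, V|X, T^*) \rangle_{q(V^*, V|X, T^*)}}
\end{aligned}
$$
for which Monte-Carlo estimates are given by
$$
\begin{aligned}
p(T^*|X) &\approx \frac{M^{-1} \sum_{m=1}^{M} \alpha(T^{(m)}, T^*|X, V^{(m)})\, q(T^{(m)}, T^*|X, V^{(m)})}{J^{-1} \sum_{j=1}^{J} \alpha(T^*, T^{(j)}|X, V^{(j)})} \\
p(V^*|X, T^*) &\approx \frac{J^{-1} \sum_{j=1}^{J} \alpha(V^{(j)}, V^*|X, T^*)\, q(V^{(j)}, V^*|X, T^*)}{G^{-1} \sum_{g=1}^{G} \alpha(V^*, V^{(g)}|X, T^*)}
\end{aligned}
$$
To calculate these ordinates requires three sampling runs:
1. Sample $\{T^{(m)}, V^{(m)}\} \sim p(T,V|X)$, $m = 1, \dots, M$ by the MH algorithm. Select $\{T^*, V^*\}$ as the mode found by this sampling run for the best estimate of the marginal likelihood.
2. Sample $\{V^{(j)}\} \sim p(V|X, T^*)$, $j = 1, \dots, J$, also generating $\{T^{(j)}\} \sim q(T^*, T^{(j)}|X, V^{(j)})$ after each step. This is carried out by running the MH algorithm, but rejecting all moves $T^* \to T^{(j)}$.
3. Sample $V^{(g)} \sim q(V^*, V^{(g)}|X, T^*)$, $g = 1, \dots, G$.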
The above Metropolis-Hastings algorithm is also straightforward to apply to the Bayesian NMF model in
Cemgil [2008], circumventing the multinomial sampling required for the Gibbs sampler.
7.3.2.3 Hyperparameter Optimization
Hyperparameter optimization using MCMC schemes can be carried out in a number of ways. In analogy
to maximizing the variational bound as considered in Section 7.3.1.3, the Markov chain can be run for
a number of iterations, and then the hyperparameters estimated from the sample statistics of the chain.
However, after this step, the normalization constant p(X) has increased, and the chain is not valid for the
new hyperparameters. Either the chain has to be discarded, which is wasteful, or the samples have to be
re-weighted. Re-weighting using reverse logistic regression is discussed in Geyer [1991].
A simpler method involves extending the MCMC scheme to sampling the hyperparameters based on their likelihood function, i.e., we sample from the posterior of the hyperparameters assuming flat priors, which was the case with the variational procedure. The use of flat (improper) priors does not create problems, because the marginal likelihood is calculated with respect to the hyperparameters; we are not integrating them out.
We use the same notation as in Section 7.3.1.3 to denote any tying structure on the hyperparameters.
The posterior distribution of the scale parameter (see (7.12)) is Gamma (Section A.2):
$$
p(b_M | \{a_m, r_m : m \in M\}) \propto \exp\left( -b_M \sum_{m \in M} \frac{a_m}{r_m} + \log b_M \sum_{m \in M} a_m \right) \propto \mathcal{G}\left( b_M;\; 1 + \sum_{m \in M} a_m,\; \sum_{m \in M} \frac{a_m}{r_m} \right)
$$
which means a Gibbs sampler step can be used to sample the scale parameters.
The posterior distribution of the shape parameter is:
$$
p(a_M | \{b_m, r_m : m \in M\}) \propto \exp\left( -a_M \sum_{m \in M} \left( \log r_m + \frac{b_m}{r_m} - \log b_m \right) + |M| \big( a_M \log a_M - \log \Gamma(a_M) \big) \right)
$$
which is not a standard distribution. Sampling from this requires a Metropolis-Hastings step.
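The two conditionals could be coded as in the following sketch. It assumes that the second argument of the Gamma distribution above is a rate (inverse scale), which is what the exponent implies, and converts it to the scale expected by `random.gammavariate`; the function names are illustrative.

```python
import math
import random

def sample_tied_scale(shapes, variances, rng):
    """Gibbs draw of a tied scale b_M from its gamma conditional."""
    shape = 1.0 + sum(shapes)
    rate = sum(a / r for a, r in zip(shapes, variances))
    return rng.gammavariate(shape, 1.0 / rate)   # gammavariate takes a scale

def log_shape_posterior(a, scales, variances):
    """Unnormalized log conditional of a tied shape a_M (flat prior)."""
    data_term = sum(math.log(r) + b / r - math.log(b)
                    for b, r in zip(scales, variances))
    return -a * data_term + len(scales) * (a * math.log(a) - math.lgamma(a))
```

`log_shape_posterior` is the kind of function that would serve as the log target of a Metropolis-Hastings move on the shape parameter.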
7.3.3 Importance Sampling
Importance sampling is not suitable for practical applications of the Gaussian variance model because of the high dimensionality of the posterior. It involves computing a Monte-Carlo estimate using samples from the prior p(T,V). Without any iterations that perform some degree of source separation to update the columns of T and the rows of V, almost all of the samples drawn from the prior are far from the mode, and the marginal likelihood is severely underestimated.
However, importance sampling may be used for single-source examples to confirm values of the marginal likelihood calculated by the other methods. The marginal likelihood can be written as the expected value of the likelihood under the prior,
$$
p(X) = \langle p(X|T,V) \rangle_{p(T,V)} = \int p(X|T,V)\, p(T,V)\, dT\, dV
$$
which can be approximated with the Monte-Carlo estimate
$$
p(X) \approx \frac{1}{N} \sum_{n=1}^{N} p(X|T^{(n)}, V^{(n)}), \qquad \{T^{(n)}, V^{(n)}\} \sim p(T,V),\; n = 1, \dots, N
$$
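For a scalar single-source model the estimate takes only a few lines. The sketch assumes F = I = T = 1, a real-valued observation and the prior parametrization IG(a, ab) of (7.6); it is an illustration and not the implementation used to obtain the values reported in this chapter.

```python
import math
import random

def prior_sampling_log_evidence(x, a_t, b_t, a_v, b_v, n=20000, seed=0):
    """Estimate log p(x) for a scalar single-source model by sampling the prior."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n):
        t = a_t * b_t / rng.gammavariate(a_t, 1.0)    # t ~ IG(a_t, a_t b_t)
        v = a_v * b_v / rng.gammavariate(a_v, 1.0)    # v ~ IG(a_v, a_v b_v)
        var = t * v
        total += math.exp(-0.5 * x * x / var) / math.sqrt(2.0 * math.pi * var)
    return math.log(total / n)
```

With large shape parameters the prior concentrates near the scale parameters, and the estimate approaches the log density of a normal distribution with variance $b^t b^v$.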
7.3.4 Consistency of Marginal Likelihood Estimates
We use a toy example to confirm the consistency of the marginal likelihood calculations. With F = I =
T = 1, at = bt = av = bv = 100, and T,V,X set to the mode of the prior, all four methods discussed return
a marginal log likelihood of -5.5266, and with T = 2 the marginal log likelihood is -11.0462. Both of these
values are confirmed with MATLAB quadrature methods over the double and triple integrals respectively.
For larger models, such as those arising in musical audio analysis as described in Section 7.4, only the variational Bayes approach and the Metropolis-Hastings algorithms are practical. The VB algorithm converges
to a lower bound on the marginal likelihood, because the approximating distribution to the posterior ignores
the coupling between the latent sources and the templates/excitations; whilst the MH algorithm converges
in the limit to the true marginal likelihood. In the single source case, the VB algorithm converges to the true
marginal likelihood (as there is no coupling between the latent sources and the templates/excitations to be
ignored), but for more than one source, the VB algorithm underestimates the marginal likelihood compared
with the estimate returned by the MH algorithm. However, the discrepancy between the two estimates
is negligible compared with the ratios of marginal likelihood when selecting between different numbers of
sources for some observed data, as described in Section 7.4.3. For Bayesian model selection, the VB and MH
algorithms would lead us to the same conclusions.
7.4 Musical Audio Analysis
The Gaussian variance matrix factorization model is suitable for the time-frequency surfaces that result
from applying a transformation matrix to an audio signal. The Bayesian extension is particularly useful for
specifying prior knowledge concerning the spectral profile of musical instruments (templates) and volume
/ damping (excitations). A signal $y = (y_1, \dots, y_n, \dots, y_N)$ is represented by a linear combination $y_n = \sum_{\nu,\tau} \phi^{(\nu,\tau)}_n x_{\nu,\tau}$, where $\phi^{(\nu,\tau)}_n$ are localized sinusoidal basis functions in time τ and frequency ν.
The choice of basis functions determines the transform. The short-time Fourier transform (STFT) uses
time windowed complex exponentials at linearly spaced frequencies, and the resulting expansion coefficients
xν,τ are accordingly complex valued. The [short-time] discrete Cosine transform (DCT) uses even sinusoidal
functions, and the expansion coefficients are real valued. Other transforms for musical audio processing
include the Gabor regression model of Wolfe et al. [2004], wavelets [Mallat, 1999], the modified discrete
Cosine transform (MDCT) [Daudet and Sandler, 2004] and the constant-Q transform of Brown [1991].
In the following examples, the observation matrix is formed of a matrix of DCT coefficients. Audio
signals are downsampled to 8000Hz and buffered into frames of F = 1024 samples with no windowing or
overlapping. Inference is carried out using the variational Bayes algorithm.
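The construction of the observation matrix can be sketched as follows. The text does not state which DCT variant or normalization is used, so the orthonormal DCT-II here is an assumption, and the direct O(F²) evaluation is chosen for clarity rather than speed.

```python
import math

def frame_signal(signal, frame_length):
    """Split a signal into consecutive non-overlapping frames (no windowing)."""
    count = len(signal) // frame_length       # a trailing partial frame is dropped
    return [signal[k * frame_length:(k + 1) * frame_length] for k in range(count)]

def dct_ii(frame):
    """Orthonormal DCT-II of one frame, computed directly in O(F^2)."""
    size = len(frame)
    coeffs = []
    for k in range(size):
        total = sum(value * math.cos(math.pi * (n + 0.5) * k / size)
                    for n, value in enumerate(frame))
        scale = math.sqrt(1.0 / size) if k == 0 else math.sqrt(2.0 / size)
        coeffs.append(scale * total)
    return coeffs

def observation_matrix(signal, frame_length):
    """Matrix X with one row per frequency bin and one column per frame."""
    columns = [dct_ii(frame) for frame in frame_signal(signal, frame_length)]
    return [[column[k] for column in columns] for k in range(frame_length)]
```

For frames of 1024 samples a fast transform from a numerical library would normally replace `dct_ii`.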
7.4.1 Model Training
Here, we optimize hyperparameters for a set of piano notes. The RWC musical instrument sounds database
[Goto et al., 2003, Goto, 2004] contains audio for three pianos played with a variety of dynamics and playing
styles. In Figure 7.3 on page 117 we display the resulting template hyperparameters for the audio 011PFNOF,
011PFNOM, and 011PFNOP in the database, which denote a piano played with 'normal' style at the dynamics
forte, mezzo and piano. Each note on the keyboard is played once, covering a range of 88 notes from MIDI
21 to MIDI 108. Here, individual shape and scale parameters are trained for each frequency bin for each
pitch class. The plot of the scale parameters in Figure 7.3b on page 117 clearly shows the harmonic series
of each note, and that the spacing between the harmonics increases with pitch. A careful inspection of the
plot of the shape parameters in Figure 7.3a on page 117 shows that frequency bins corresponding to the
harmonic series have a larger variance than those corresponding to the noise floor.
For this example, we have chosen to train the hyperparameters for single source models, so that it can
be clearly illustrated that the priors capture the harmonic series of the piano notes. As the samples are of
differing length, we choose to tie the excitation parameters across time, thus estimating a single value of av
and bv per note. This means that the priors estimated here are valid for signals of arbitrary length.
Figure 7.3: Template hyperparameters for single source models of piano notes. (a) Shape parameters $a^t_{\nu,i}$. (b) Scale parameters $b^t_{\nu,i}$. Both panels plot frequency / Hz (0 to 4000) against MIDI note pitch.
7.4.2 Source Separation and Visualization
Here we illustrate a source separation application using the Gaussian variance matrix factorization model.
We have taken an extract of a piano piece and synthesized the MIDI file using the same audio samples as in Section 7.4.1. Over the extract we have assumed that the model is stationary, and all 88 sources, corresponding to every note on the piano keyboard, are active. We then infer V using the variational Bayes algorithm. Our intuition is that notes which are being played will have a large excitation, while notes which are not being played will have a small excitation and thus be indistinguishable from silence. This is confirmed in Figure 7.4 on page 119, which is a useful alternative to frequency representations such as the harmonic transform [Zhang et al., 2004], and can be used to visualize the frequency content of audio signals. All 88 possible
notes are modelled using rank one source matrices, and the hyperparameters optimized separately. Regions
of high excitation (in red) correspond to a note being played. The positions of the notes are offset slightly
so that the high excitation regions are not obscured.
7.4.3 Model Selection
In the previous two sections, we have modelled each piano note using a single source model. It remains
to be discussed however, if we would obtain a better model by using a multiple source model, i.e., a rank
I > 1 source matrix. The goal of this section is to determine whether the variational Bayes algorithm gives
consistent answers as to the optimal number of sources for the piano notes.
The results of our investigations are given in Figure 7.5 on page 120 for the set of piano notes 01[1,3]
PFNO [F,M,P] in the RWC database. The optimal number of sources is correlated with the pitch of the
piano note, which may be the result of the particular time-frequency representation chosen for the audio.
The trend shown is that notes with a low or high pitch are best modelled by a few sources, whilst notes with a medium pitch are best modelled by a larger number of sources. Factors that contribute to this are perhaps: 1) poor resolution of the DCT for low pitches, and 2) downsampling to 8000Hz, which cuts off many of the harmonics of higher pitches, thus leading to simpler models. The dependency of the number of sources on
the length of the frame was not however investigated here, and is suggested for investigation in future work
when multiple source models are used for polyphonic music transcription.
7.5 Prior Model for Polyphonic Piano Music
In this section, we extend the prior model for the excitation matrix to include MIDI pitch and velocity of
the notes that are playing in a piece of solo polyphonic piano music. We also apply this prior model to
the Poisson intensity model of Cemgil [2008] so that the transcription performance of both models can be
compared.
7.5.1 Model Description
In this section, we have chosen to rely on deterministic approaches to solve the transcription inference
problem, as opposed to more expensive MCMC approaches. We describe a quite general approach which lends
itself to any form of music for which the MIDI format is an admissible representation of the transcription.
Figure 7.4: Transcription of the first 20 seconds of Albeniz's Suite Española No.5 Asturias (Leyenda) using the Gaussian variance matrix factorization model. Both panels plot MIDI pitch against time / s. (a) Visualization by source separation. Red areas denote regions of higher excitation, which occur particularly at note onsets for notes within the melodic line. The onset excitation corresponds with the MIDI note velocity below. (b) Original MIDI. High note onset velocities are shaded red.
Figure 7.5: Optimal number of sources for a set of piano notes. (a) Histogram of the optimal number of sources (number of examples against optimal number of sources). (b) Optimal number of sources by pitch (optimal number of sources against MIDI pitch).
We select the maximum number of sources, N = I, to be the total number of pitches represented in the MIDI format. Each source i corresponds to a particular pitch. Then we have a single set of template parameters $T \in \mathbb{R}^{F \times I}_+$ for all I sources, which are intended to represent the spectral, harmonic information of the pitches.
For polyphonic transcription, we are typically interested in inferring the piano roll matrix C which, owing to the above assumption of one source per pitch, has the same dimensions as the excitation matrix V. For note i at time τ we set $C_{i,\tau}$ to be the value of the velocity of the note, and $C_{i,\tau} = 0$ if the note is not playing. We use the note-on velocity, which is stored in the MIDI format as an integer between 1 and 128. Thus,
we model note velocity using our generative model. This contrasts with previous approaches which infer a
binary-valued piano roll matrix of note activity, essentially discarding potentially useful volume information.
The prior distribution p(C) is a discrete distribution, which can incorporate note transition probabilities and
commonly occurring groups of note pitches, i.e., chords and harmony information.
A note with a larger velocity will have a larger corresponding excitation. The magnitude of the excitation
will depend on the pitch of the note as well as its velocity, so instead of applying a volume curve such as (2.1),
we infer this relationship from training data. We will represent this information as an a priori unknown
positive-valued matrix F of size I × 128, where $F_{i,z}$ represents a mapping from the MIDI pitch i and velocity z to the excitation matrix given by
$$
V_{i,\tau} = \begin{cases} 0 & C_{i,\tau} = 0 \\ F_{i,C_{i,\tau}} & \text{otherwise} \end{cases} \tag{7.15}
$$
For music transcription, we extend the prior model on V to include the matrices F and C, i.e.,
p(V,F,C) = p(V|F,C)p(F,C)
As F is a mapping to the excitation matrix, we place an inverse-gamma prior (for the Gaussian variance
model) or a gamma prior (for the Poisson intensity model) over each element of F. The resulting conditional
posterior over F is of the same family as the prior, and is obtained by combining the expectations of the
sources corresponding to the correct pitch and velocity.
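The deterministic mapping (7.15) from the piano roll to the excitation matrix amounts to a table lookup, as the following sketch shows. The list-of-lists layout and the zero-based column `z - 1` for velocity z are assumptions of the sketch.

```python
def excitation_from_piano_roll(piano_roll, velocity_map):
    """Apply (7.15): V[i][tau] = 0 if C[i][tau] == 0 else F[i][C[i][tau]].

    piano_roll[i][tau] holds the velocity (0 for a silent note) and
    velocity_map[i][z - 1] holds F_{i,z} for velocities z = 1, 2, ...
    """
    return [[0.0 if c == 0 else row_map[c - 1] for c in row_c]
            for row_c, row_map in zip(piano_roll, velocity_map)]
```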
The full generative model for polyphonic transcription is given by
p(X,S,T,V,F,C) = p(X|S)p(S|T,V)p(V|F,C)p(T)p(F,C)
One advantage of this model is that minimal storage is required for the parameters, which can be estimated offline from training data, as we demonstrate in Section 7.6.2. The two sets of parameters are intuitive for musical signals. This model also allows closer modeling of the excitation of the notes than the MIDI format allows.
7.5.2 Algorithm
We are able to integrate out the latent sources S (see Section 7.2), and also eliminate V given F and C,
using (7.15). The algorithm we present here is a generalized EM algorithm, which iterates to find a solution
of the posterior:
$$
\arg\max_{T,F,C}\; p(T,F,C|X)
$$
The posterior distribution of F conditional on C,T,X is inverse-gamma as it is formed by collecting the
estimates of V corresponding to each note pitch/velocity pairing.
To maximize for the piano roll C we first note that each frame of observation data is independent given
the other parameters V,F. For each τ we wish to calculate
$$
\arg\max_{C_\tau}\; p(X_\tau|T, V_\tau)\, p(V_\tau|F, C_\tau)\, p(F, C_\tau) \tag{7.16}
$$
where $X_\tau$, $V_\tau$ and $C_\tau$ are the τth column vectors of X, V and C respectively. However, as each $C_\tau$ has $128^I$ possible values, an exhaustive search to maximize this is not feasible. Instead, we have found that the following greedy search algorithm works sufficiently well: for each frame τ calculate
$$
\arg\max_{\tilde{C}_\tau}\; p(X_\tau|T, \tilde{V}_\tau)\, p(\tilde{V}_\tau|F, \tilde{C}_\tau)\, p(F, \tilde{C}_\tau) \tag{7.17}
$$
where $\tilde{C}_\tau$ differs from $C_\tau$ by at most one element, and $\tilde{V}$ is the corresponding excitation matrix. There are I × 128 possible settings of $\tilde{C}_\tau$ for which we evaluate the likelihood at each stage of the greedy search. This can be carried out efficiently by noticing that during the search the corresponding matrix products $T\tilde{V}$ differ from the existing value by only a rank-one update of TV.
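A sketch of the greedy search for a single frame is given below. It assumes a flat prior over the piano roll, uses the Gaussian variance objective of Algorithm 7.2, and introduces a hypothetical noise floor so that the modelled variance stays positive when no note is active; none of these choices is prescribed by the text.

```python
import math

def frame_log_likelihood(power, variance):
    """Sum over bins of -|X|^2 / (2 [TV]) - log [TV]."""
    return sum(-0.5 * p / lam - math.log(lam) for p, lam in zip(power, variance))

def greedy_frame_search(power, templates, velocity_map, floor=1e-3):
    """Greedy search for one column of the piano roll under a flat prior p(C).

    power[nu] is |X_{nu,tau}|^2, templates[nu][i] is T_{nu,i} and
    velocity_map[i][z - 1] is F_{i,z}. `floor` is a hypothetical noise floor.
    """
    n_bins, n_notes = len(templates), len(templates[0])
    roll = [0] * n_notes
    excitation = [0.0] * n_notes
    variance = [floor] * n_bins
    best = frame_log_likelihood(power, variance)
    while True:
        move = None
        for i in range(n_notes):
            for z in range(len(velocity_map[i]) + 1):
                if z == roll[i]:
                    continue
                new_v = 0.0 if z == 0 else velocity_map[i][z - 1]
                delta = new_v - excitation[i]
                # Rank-one update: only column i of T contributes to the change.
                trial = [lam + templates[nu][i] * delta
                         for nu, lam in enumerate(variance)]
                score = frame_log_likelihood(power, trial)
                if score > best:
                    best, move = score, (i, z, new_v, trial)
        if move is None:
            return roll           # no single-element change improves the score
        i, z, new_v, variance = move
        roll[i], excitation[i] = z, new_v
```

Each pass changes at most one element of the column, so the search stops at a local optimum with respect to single-element changes rather than at the global maximum of (7.16).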
The resulting algorithm has one update for the expectation step and three possible updates for the maximization step. For the generalized EM algorithm to be valid, each maximization step must increase the log likelihood. A maximization step based on parameter values that were not used to calculate the source expectations is not guaranteed to do so, and the increase must therefore be verified.
7.6 Results
A useful comparative study of three differing approaches has been carried out in Poliner and Ellis [2007]. A dataset with ground-truth of polyphonic piano music has been provided to assess the performance of a support-vector machine (SVM) classifier [Poliner and Ellis, 2007], which is provided as an example of a discriminative approach, having favorable performance in classification accuracy; a neural-network classifier [Marolt, 2004], known as SONIC²; and an auditory-model based approach [Ryynänen and Klapuri, 2005].
7.6.1 Comparison
To evaluate these models comprehensively, we use the Poliner and Ellis training and test data and compare
the performance against the results provided in the same paper, which are repeated here for convenience.
The ground truth for the data consists of 124 MIDI files of classical piano music, of which 24 have been
designated for testing purposes and 13 are designated for validation. In a Bayesian framework there need
not be any distinction between training and validation data: both may be considered labeled observations.
Here we have chosen to discard the validation data rather than include it in the training examples for a
fairer comparison with the approaches used by other authors. We also do not attempt to optimize the model
Footnote 2: http://lgm.fri.uni-lj.si/SONIC
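Algorithm 7.2 Gaussian Variance: algorithm for polyphonic transcription
• Source Expectation
\[ \langle s_{\nu,\tau} s_{\nu,\tau}^\top \rangle = [t_\nu v_\tau^\top] \cdot I_N - \kappa_{\nu,\tau} \kappa_{\nu,\tau}^\top [TV]_{\nu,\tau} + \langle s_{\nu,\tau} \rangle \langle s_{\nu,\tau} \rangle^\top \]
• Template Maximization
Shape and scale parameters of inverse-gamma posterior distribution
\[ A_{\nu,i} = \alpha^{(T)}_{\nu,i} + T \]
\[ B_{\nu,i} = \beta^{(T)}_{\nu,i} + \sum_{\tau} V^{-1}_{n,\tau} \langle s_{\nu,k} s_{\nu,\tau}^\top \rangle \]
Mode of posterior distribution
\[ T_{\nu,i} \leftarrow \frac{B_{\nu,i}}{A_{\nu,i} + 1} \]
• Excitation Maximization
Shape and scale parameters of inverse-gamma posterior distribution
\[ A_{i,z} = \sum_{\{\tau : C_{i,\tau} = z\}} \alpha^{(V)}_{n,\tau} + F_{i,C_{i,\tau}} \]
\[ B_{i,z} = \sum_{\{\tau : C_{i,\tau} = z\}} \left( \beta^{(V)}_{n,\tau} + \sum_{\nu} T^{-1}_{\nu,i} \langle s_{\nu,\tau} s_{\nu,\tau}^\top \rangle \right) \]
Mode of posterior distribution
\[ F_{i,z} \leftarrow \frac{B_{i,z}}{A_{i,z} + 1} \]
• Transcription Search
for τ = 1, . . . , T
\[ C_\tau \leftarrow \arg\max_{\tilde{C}_\tau} \sum_{\nu} \left( -\frac{1}{2} \frac{|X_{\nu,\tau}|^2}{[T\tilde{V}]_{\nu,\tau}} - \log [T\tilde{V}]_{\nu,\tau} \right) p(F, \tilde{C}_\tau) \]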
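Algorithm 7.3 Poisson Intensity: algorithm for polyphonic transcription
• Source Expectation
\[ \langle s_{\nu,\tau} \rangle = \kappa_{\nu,\tau} X_{\nu,\tau} \]
• Template Maximization
Shape and scale parameters of inverse-gamma posterior distribution
\[ A_{\nu,i} = \alpha^{(T)}_{\nu,i} + \sum_{\tau} \langle s_{i,\tau} \rangle \]
\[ B_{\nu,i} = \beta^{(T)}_{\nu,i} + \sum_{\tau} V_{i,\tau} \]
Mode of posterior distribution
\[ T_{\nu,i} \leftarrow \frac{A_{\nu,i} - 1}{B_{\nu,i}} \]
• Excitation Maximization
Shape and scale parameters of inverse-gamma posterior distribution
\[ A_{i,z} = \sum_{\{\tau : C_{i,\tau} = z\}} \left( \alpha^{(V)}_{i,\tau} + \sum_{\nu} \langle s_{\nu,\tau} \rangle \right) \]
\[ B_{i,z} = \sum_{\{\tau : C_{i,\tau} = z\}} \left( \beta^{(V)}_{i,\tau} + \sum_{\nu} T_{\nu,i} \right) \]
Mode of posterior distribution
\[ F_{i,z} \leftarrow \frac{A_{i,z} - 1}{B_{i,z}} \]
• Transcription Search
for τ = 1, . . . , T
\[ C_\tau \leftarrow \arg\max_{\tilde{C}_\tau} \sum_{\nu} \left( X_{\nu,\tau} \log [T\tilde{V}]_{\nu,\tau} - [T\tilde{V}]_{\nu,\tau} \right) p(F, \tilde{C}_\tau) \]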
parameters to minimize transcription errors on the validation set, as this is not consistent with a generative
modelling approach. This is discussed further in Section 7.7.
The observation data is primarily obtained by using a software synthesizer to generate audio data. In
addition, 19 of the training tracks and 10 of the test tracks were synthesized and recorded on a Yamaha
Disklavier. Only the first 60 seconds of each extract is used. The audio, sampled at 8000 Hz, is then
buffered into frames of length 128 ms with a 10 ms hop between frames, and the spectrogram is obtained
from the short-time Fourier transform of these frames. Poliner and Ellis subsequently carry out a spectral
normalization step in order to remove some of the timbral and dynamical variation in the data prior to
classification. However, we omit this processing stage as we rather wish to capture this information in our
generative model.
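The framing arithmetic implied by these settings can be checked with a short sketch. The function name is hypothetical, and the frame count assumes that only complete frames are kept, with no zero padding at the edges or in the transform; the original processing may differ in these details.

```python
def frame_layout(sample_rate_hz, frame_ms, hop_ms, duration_s):
    """Return (frame length, hop length, number of frames, number of bins)."""
    frame_len = round(sample_rate_hz * frame_ms / 1000)  # samples per frame
    hop_len = round(sample_rate_hz * hop_ms / 1000)      # samples between frame starts
    n_samples = round(sample_rate_hz * duration_s)
    # Count only frames that fit entirely inside the extract (assumption)
    n_frames = 1 + (n_samples - frame_len) // hop_len if n_samples >= frame_len else 0
    # One-sided spectrum of a real signal, from 0 Hz up to half the sample rate
    n_bins = frame_len // 2 + 1
    return frame_len, hop_len, n_frames, n_bins

print(frame_layout(8000, 128, 10, 60))  # (1024, 80, 5988, 513)
```

Under these assumptions each 60 second extract gives 5988 frames of 1024 samples, and the 513 frequency bins are spaced 8000 / 1024 = 7.8125 Hz apart, covering 0 to 4000 Hz.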
7.6.2 Implementation
Because of the copious amount of training data available, there is enough information concerning the
frequencies of the occurrence of the note pitches and velocities that it is not necessary to place informative
priors on these parameters.
It is not necessary to explicitly carry out a training run to estimate values of the model parameters before
evaluating against the test data. However, the EM algorithm does converge faster during testing if we first
estimate the parameters from the labelled observations. Figure 7.6 on page 126 and Figure 7.7 on page 127
show the log T templates and log F excitation parameters estimated from the Poliner and Ellis training data
for the Gaussian variance and Poisson intensity models. These estimates use flat prior distributions, are
obtained by running the EM algorithm to convergence on the training data only, and use a single source to
model each note pitch as in Section 7.4.1. The templates clearly exhibit the harmonic series of the musical
notes, and the excitations have the desired property that notes with higher velocity correspond to higher
excitation, hence our assumption of uniform priors on these parameters has not been detrimental. For the
excitation parameters, white areas denote pitch/velocity pairs that are not present in the training data and
are thus unobserved.
For each of the matrix factorization models we consider two choices of the prior on C. The first assumes that
each frame of data is independent of the others, which is useful in evaluating the performance of the source
models in isolation. The second assumes that each note pitch is independent of the others, and between
consecutive frames there is a state transition probability, where the states are each note being active or
inactive, i.e.,
\[ p(C_{i,\tau} > 0 \mid C_{i,\tau-1} = 0) = p(C_{i,\tau} = 0 \mid C_{i,\tau-1} > 0) = p_{\text{event}} \tag{7.18} \]
This prior is known as the Markov prior in the remainder of this chapter. The state transition probabilities
are estimated from the training data. It is possible and more correct to include these transition probabilities
as parameters in the model, but we have not carried out the inference of note transition probabilities in
this work. Similar Markov time dependencies between frames of data modelled by NMF techniques are used
in Ozerov et al. [2009]. The modification to the inference is straightforward: in (7.16) the prior on Ci,τ is
calculated using (7.18) using the current values of Ci,τ−1 and Ci,τ+1 that have been estimated.
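A minimal sketch of this prior term for one element of the piano roll is given below. It assumes that the probability of a note keeping its active or inactive state is 1 − pevent, and it treats all non-zero velocities as a single active state; the function name and the use of None for a missing neighbouring frame are choices made for the illustration.

```python
import math

def log_markov_prior(c_prev, c_curr, c_next, p_event):
    """Log prior of C[i, tau] given its current neighbours in time.

    c_prev, c_next : neighbouring values of C for the same pitch, or None
                     at the first or last frame
    p_event        : probability of switching between active and inactive
    """
    def log_trans(a, b):
        changed = (a > 0) != (b > 0)  # an onset or an offset between frames
        return math.log(p_event) if changed else math.log(1.0 - p_event)

    total = 0.0
    if c_prev is not None:
        total += log_trans(c_prev, c_curr)
    if c_next is not None:
        total += log_trans(c_curr, c_next)
    return total
```

For a small value of pevent, an isolated single-frame note incurs two state changes and is therefore penalized twice, which is the mechanism by which short spurious detections are discouraged.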
Figure 7.6: Parameter estimates for the Gaussian variance model from training data. (a) Template parameters log T (frequency / Hz against MIDI note pitch). (b) Velocity-excitation mapping log F (MIDI velocity against MIDI note pitch).
Figure 7.7: Parameter estimates for the Poisson model from training data. (a) Template parameters log T (frequency / Hz against MIDI note pitch). (b) Velocity-excitation mapping log F (MIDI velocity against MIDI note pitch).
7.6.3 Evaluation
Following training, the matrix of spectrogram coefficients is extended to include the test extracts. As
the same two instruments are used in the training and test data, we simply use the same parameters which
were estimated in the training phase. We transcribe each test extract independently of the others. In the
full Bayesian setting this should be carried out jointly, but joint transcription is neither practical nor typical
of a reasonable application of a transcription system. An example of the transcription output for the first
ten seconds of the synthesized version of Burgmueller's The Fountain is provided for the Gaussian variance
model, both with independent and Markov priors on C (Figure 7.8 on page 129 and Figure 7.9 on page 130),
compared to the MIDI ground truth (Figure 7.10 on page 131). The transcription is graphically represented
in terms of detections and misses in Figure 7.11 on page 132. True positives are in light gray, false positives in
dark gray, and false negatives in black. Most of the difficulties encountered in transcription in this particular
extract were due to the positioning of note offsets, rather than the detection of the pitches themselves.
The transcription with independent prior on C shows that the generative model has not only detected
the activity of many of the notes playing, but also has attempted to jointly infer the velocity of the notes.
Each frame has an independently inferred velocity, hence there is much variation across a note. However, there
is correlation between the maximum inferred velocity during a note event and the ground truth velocities. The
Markov prior on C eliminates many of the spurious notes detected, which are typically of a short duration
of a few frames.
We have used only the information contained in note pitches, but the effect of resonance and pedaling
can be clearly seen by comparing the ground truth with the transcriptions. This motivates the use of a note
onset evaluation criterion.
We follow the same evaluation criteria as provided by Poliner and Ellis. As well as recording the accuracy
Acc (true positive rate), the transcription error is decomposed into three parts: Subs, the substitution
error rate, when a note from the ground truth is transcribed with the wrong pitch; Miss, the note miss rate,
when a note in the ground truth is not transcribed; and FA, the false alarm rate beyond substitutions, when
a note not present in the ground truth is transcribed. These sum to form the total transcription error Tot,
which cannot be biased simply by adjusting a threshold for how many notes are transcribed.
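The following sketch shows one way these frame-level scores can be computed from two piano rolls. The formulas are assumptions made for the illustration rather than a quotation of the evaluation code: Acc is taken as TP / (TP + FP + FN), and the three error terms are normalized by the total number of reference notes so that they sum exactly to Tot. The function name and the representation of each frame as a set of active pitches are also illustrative.

```python
def frame_level_scores(ref, est):
    """ref, est: lists of sets of active pitches, one set per frame."""
    tp = fp = fn = 0
    n_ref = subs = miss = fa = 0
    for r, e in zip(ref, est):
        n_corr = len(r & e)                       # correctly transcribed notes
        tp += n_corr
        fp += len(e - r)
        fn += len(r - e)
        n_ref += len(r)
        subs += min(len(r), len(e)) - n_corr      # wrong pitch in place of a true one
        miss += max(0, len(r) - len(e))           # too few notes reported
        fa += max(0, len(e) - len(r))             # too many notes reported
    if n_ref == 0 or tp + fp + fn == 0:
        raise ValueError("reference contains no notes")
    return {
        "Acc": tp / (tp + fp + fn),
        "Subs": subs / n_ref,
        "Miss": miss / n_ref,
        "FA": fa / n_ref,
        "Tot": (subs + miss + fa) / n_ref,
    }
```

With this decomposition, reporting fewer notes lowers FA but raises Miss, and reporting more notes does the opposite, so Tot cannot be improved merely by changing how many notes are transcribed.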
Table 7.1 on page 129 shows the frame-level transcription accuracy for the approaches studied in Poliner
and Ellis [2007]. We use the same data sets and feature dimensions selected by the authors of that
paper to compare our generative models against these techniques. This table expands the accuracy column
in Table 7.2 on page 129 by splitting the test data into the recorded piano extracts and the MIDI synthesized
extracts.
Table 7.2 on page 129 shows the frame-level transcription results for the full synthesized and recorded
data set. Accuracy is the true positive rate expressed as a percentage, which can be biased by not reporting
notes. The total error is a more meaningful measure, which is divided between substitution, note miss and
false alarm errors. This table shows that the matrix factorization models with a Markov note event prior
have a similar error rate to the Marolt system on this dataset, but have a greater error rate than the support
vector machine classifier.
Figure 7.8: Transcription using a priori independent frames. (Plot of pitch against time / s for the first ten seconds.)
Table 7.1: Frame-level transcription accuracy (%)
Model                      Piano   MIDI   Both
SVM                        56.5    72.1   67.7
Ryynänen & Klapuri         41.2    48.3   46.3
Marolt                     38.4    40.0   39.6
Variance (Independent)     36.0    41.2   39.7
Variance (Markov)          38.0    44.0   42.3
Intensity (Independent)    40.1    35.4   36.8
Intensity (Markov)         39.7    36.2   37.3
Table 7.2: Frame-level transcription results (%)
Model                      Acc    Tot    Subs   Miss   FA
SVM                        67.7   34.2    5.3   12.1   16.8
Ryynänen & Klapuri         46.6   52.3   15.0   26.2   11.1
Marolt                     36.9   65.7   19.3   30.9   15.4
Variance (Independent)     39.7   68.2   22.9   27.7   17.6
Variance (Markov)          42.3   62.1   18.1   32.0   12.0
Intensity (Independent)    36.8   71.0   27.8   24.6   18.6
Intensity (Markov)         37.3   66.6   23.7   30.0   12.9
Figure 7.9: Transcription using Markov transition probabilities between frames. (Plot of pitch against time / s for the first ten seconds.)
Figure 7.10: Ground truth for the transcription results in Figure 7.8 on page 129 and Figure 7.9 on page 130. (Plot of pitch against time / s for the first ten seconds.)
Figure 7.11: Detection assessment. (Plot of pitch against time / s for the first ten seconds.)
Figure 7.12: Number of errors for the Gaussian variance Markov model by number of notes and error type. (Plot of number of errors, ×10^4, against number of notes in frame, with series Subs, Miss and FA.)
7.7 Conclusion
In this chapter we have described a generative model for factorizing the variances of matrix elements into
smaller template and excitation matrices, and developed Bayesian priors and inference algorithms to estimate
the variances and the dimensions of the factor matrices. When applied to the spectrogram of a musical note,
the template matrix models the harmonic content of the note, and the excitation matrix controls how the
volume of the note varies over time. The hyperparameter optimization techniques described in this chapter
can be applied to labeled training data to develop a system capable of distinguishing between and modelling
different musical instruments and pitches. We have demonstrated how the excitation matrix of a polyphonic
signal can be used to visualize the transcription of the music.
We have compared the performance of generative spectrogram factorization models with three existing
transcription systems on a common dataset. The models exhibit a similar error rate to the neural-network
classification system of Marolt [2004]. However, the support vector machine classifier of Poliner and Ellis
[2007] achieves a lower error rate for polyphonic piano transcription on this dataset. In this conclusion,
we principally discuss the reasons for the difference in error rate of these systems, and how the generative
models can be improved in terms of inference and prior structure to achieve an improved performance.
The support vector machine is purely a classification system for transcription, for which the parameters
have been explicitly chosen to provide the best transcription performance on a validation set; whereas
the spectrogram factorization models, being generative in nature, are applicable to a much wider range of
problems: source separation, restoration, score-audio alignment and so on. For this reason, we have not
attempted to select priors by hand-tuning in order to improve transcription performance, but rather adopt
a fully Bayesian approach with an explicit model which infers correlations in the spectrogram coefficients
in training and test data, and thus as a product of this inference provides a transcription of the test data.
The differences in this style of approach, and the subsequent difference in performance, resemble those between
supervised and unsupervised learning in classification. In light of this, we consider the performance of
the spectrogram factorization models to be encouraging, as they are comparable to an existing polyphonic
piano transcription system without explicitly attempting to improve the transcription performance by tuning
prior hyperparameters. Vincent et al. [2008] for instance demonstrate the improvement in performance
for polyphonic piano transcription that can be achieved over the standard NMF algorithm by developing
improved basis spectra for the pitches, and achieve a performance mildly better than the neural-network
classifier: a similar result to what has been presented here.
Bertin et al. [2009b] similarly report improvement in transcription performance for the Gaussian variance
model compared to existing Bayesian NMF. They also suggest a tempering approach to avoid iterative
algorithms being trapped in local maxima of the likelihood function. There are a number of alternative
algorithms to perform NMF with the aim of increasing speed of convergence and locating better solutions.
Recent work includes a split-gradient method developed by Lantéri et al. [2010].
To improve performance for transcription in a Bayesian spectrogram factorization, we can firstly improve
initialization using existing multiple frequency detection systems for spectrogram data, and extend the
hierarchical model for polyphonic transcription using concepts such as chords and keys. We can also jointly
track tempo and rhythm using a probabilistic model. For examples, see Whiteley et al. [2006], Raphael
[2004] and Peeling et al. [2007a], whose models could easily be incorporated into the Bayesian hierarchical
approach here.
The models we have used have assumed that the templates and excitations are drawn independently
from priors, however the existing framework of gamma Markov fields developed in Cemgil et al. [2007],
Dikmen and Cemgil [2010] can be used as replacements of these priors, and allows us to model stronger
correlations, for example, between the harmonic frequencies of the same musical pitch, which additionally
contain timbral content, and also model the damping of the excitation of notes from one frame to the next. It
has been shown qualitatively that using gamma Markov field priors results in a much improved transcription, and
in future work we will use this existing framework to extend the model described in this paper, expecting to
see a much improved transcription performance by virtue of a more appropriate model of the time-frequency
surface. An alternative framework for enforcing temporal continuity in Bayesian NMF for polyphonic music
transcription is presented by Bertin et al. [2009a], which could also be applied to the Gaussian variance
model as opposed to the Poisson intensity model used by the authors. Other priors enforcing correlations
between the elements in the factored matrices are possible, for example Gaussian process priors [Schmidt
and Laurberg, 2008].
On this dataset, the Gaussian variance model has better performance for transcription than the intensity
based model, and we suggest that this is due to the generative model modeling the weighting of the
spectrogram coefficients directly, and thus being a more appropriate model for time-frequency surface estimation.
However, most of the literature for polyphonic music transcription systems using matrix factorization models
has focused on the KL divergence and modifications and enhancements of the basic concept. Therefore it
would be useful to firstly evaluate such variants of NMF against this dataset and other systems used for
comparing and evaluating music transcription systems. Secondly, it would also be useful to replace the
implicit Poisson intensity source model in these approaches with the Gaussian variance model.
In summary, we have presented matrix factorization models for spectrogram coefficients using a Gaussian
variance parametrization, and have developed inference algorithms for the parameters of these models. The
suitability of these models has been assessed for the polyphonic transcription of solo piano music, resulting
in a performance which is comparable to some existing transcription systems. As we have used a Bayesian
approach, we can extend the prior structure in a hierarchical manner to improve performance and model
higher-level features of music.
Chapter 8
A Probabilistic Framework for Inferring Temporal Structure in Music
In this chapter we develop a probabilistic framework for the tractable inference of temporal structure in
musical audio. The goal of this framework is to unify otherwise separate applications of Bayesian musical
signal processing into a common, generative modelling framework for inference. We model the performance
of a piece of music as the movement of a score pointer through a symbolic representation of the music. This
representation may be the actual written score of the piece of music being played, converted to a suitable
format, when the application is score following or tracking; or the representation may be a code book of
rhythmic patterns for a tempo and beat tracking application; or a code book of chords for a transcription
application. The observed audio itself is modelled by a generative process conditional on the properties of
the score at that point.
In the previous chapters of this thesis, we have mainly focused on processing individual frames of musical
audio to detect multiple pitches with prior information. In Chapter 7 we added a simple Markov model
for note transitions, such that the probability of a note sounding or ceasing to sound in a particular frame
is dependent on the previous frame. This addition has a smoothing effect on the transcription such that
spurious note detections are avoided, and notes are transcribed to their full length. In this chapter we
define a Markov model over all of the notes sounding in each frame, mapping structures in the score, or
the expected score of the music into Bayesian priors. This allows a richer, more realistic transcription of
the musical structure than a simple frame-by-frame transcription would produce, and provides accuracy and
robustness when aligning a preexisting transcription to a performed piece of music.
8.1 Audio Matching using Generative Models
8.1.1 Existing Dynamic Time Warping Approach
We introduce our model by considering an existing state-of-the-art approach [Hu et al., 2003, Orio and
Schwarz, 2001, Turetsky and Ellis, 2003] to the alignment of two pieces of music which are assumed to share
a common score. Each piece of music is buffered into overlapping frames, and a feature vector is extracted
from each frame. A distance metric is used to compute the similarity between pairs of feature vectors
from each piece, and dynamic time warping (DTW) is used to compute a joint path through both pieces,
maximizing the similarity between matched frames. Many choices of feature vector and distance metrics
have been proposed, including chroma vectors using non-negative matrix factorization divergence measures
by Niedermayer [2009]. An interesting variation of these methods is that of Stark and Plumbley [2010] which
performs localized self-alignment to detect repetitious structures in music to 'follow' a performance without
a score.
This approach to audio alignment may be extended to score matching and transcription by using a
synthesizer to generate audio from a score or code book of musical chords. The synthesized audio is aligned
to the observed audio and the path through the observed audio can be used to infer the score position
or transcription. For transcription, the path must allow movements from the end of one chord to the
beginning of the next in the code book. Such jumps in the path reduce the efficiency of most DTW algorithm
implementations.
In this section we state the audio alignment application using a hidden Markov model for the path and
a generative signal model for each frame. This allows a number of extensions to the DTW approach, for
example being able to jointly align multiple pieces of music, jointly inferring the structure within each piece
(for example when sections of the score are repeated), and being able to use approximate sequential inference
for longer pieces of music. Hidden Markov models have been used for audio and score alignment by Orio and
Déchelle [2001], Cano et al. [1999]; and Raphael [2004] also uses a probabilistic generative model for score
alignment.
8.1.2 Model Statement
The data we observe consists of N pieces of music, with $T_n$ frames in each piece, $n = 1, \ldots, N$. Each frame
of music is denoted by $y^{(n)}_t$, $t = 1, \ldots, T_n$. The score, which is the underlying representation of all of the
pieces, is divided into M sections. Over each section the properties of the score are stationary, hence note
onsets and offsets, for example, mark the beginning of new sections. In regular classical music, each bar of
the score may be divided equally, for example into 32 if no notes are longer than semi-quavers. We consider
the method in which the score is converted into musical audio via the concept of a 'score pointer', which
denotes at time t the current position in the score where the performance is. If someone were to listen to
a piece of music and follow the score with their finger at the same time, the position of the score pointer
would be the position of the listener's finger. The path of the score pointer through each piece of music is
denoted $x^{(n)}_t$, which takes values $1, \ldots, M$. The score pointer may move forwards and also backwards in the
case of repeated sections, therefore it is not necessary in general for $x^{(n)}_{t+1} \geq x^{(n)}_t$.
To proceed, we require a generative signal model which assigns a probability $p(y^{(n)}_t \mid \theta_m)\, p(\theta_m)$ for all
values of t, n, m. $\theta_m$ denotes a set of parameters which describes how the signal changes given the position in
the score. Some of the parameters $\theta_m$ will be unknown and must be inferred as part of the audio matching
task. We also define a Markov model $p(x^{(n)}_t \mid x^{(n)}_{t-1})$ as a prior on the dynamics of the score pointer through
each piece, with initial priors on the score position $p(x^{(n)}_1)$ and the signal $p(y^{(n)}_1 \mid \theta_m)$. The joint probability
distribution of this model is given by
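\[ \prod_{n=1}^{N} p\left(y^{(n)}_1 \mid \theta_{x^{(n)}_1}\right) p\left(\theta_{x^{(n)}_1}\right) p\left(x^{(n)}_1\right) \prod_{t=2}^{T_n} p\left(y^{(n)}_t \mid \theta_{x^{(n)}_t}\right) p\left(\theta_{x^{(n)}_t}\right) p\left(x^{(n)}_t \mid x^{(n)}_{t-1}\right) \tag{8.1} \]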
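For fixed paths and tabulated probabilities, the logarithm of (8.1) is a sum of one signal term, one parameter prior term and one pointer term per frame. The sketch below evaluates it directly; the function name, the argument layout and the use of 0-based score positions are assumptions made for the illustration, and the parameter prior is applied at every frame exactly as written in (8.1).

```python
def log_joint(paths, log_lik, log_theta_prior, log_init, log_trans):
    """Log of the joint distribution (8.1) for given score pointer paths.

    paths[n][t]           : score position (0-based) of piece n at frame t
    log_lik[n][t][m]      : log p(y_t^(n) | theta_m)
    log_theta_prior[m]    : log p(theta_m)
    log_init[m]           : log p(x_1 = m)
    log_trans[m_prev][m]  : log p(x_t = m | x_{t-1} = m_prev)
    """
    total = 0.0
    for n, path in enumerate(paths):
        for t, m in enumerate(path):
            # Signal term and parameter prior term for this frame
            total += log_lik[n][t][m] + log_theta_prior[m]
            # Initial prior for the first frame, transition prior afterwards
            total += log_init[m] if t == 0 else log_trans[path[t - 1]][m]
    return total
```

Working with log probabilities turns the products over pieces and frames into sums, which is the form that is compared with alignment costs later in the chapter.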
8.1.3 Interpretation of Dynamic Time Warping
Dynamic time warping constructs a set of M matches between the frames of two pieces of audio, in such
a way that the set of matches represents the path of an unknown score pointer through both pieces. Each
match m is a unique pair of frames $\{y^{(1)}_{p(m)}, y^{(2)}_{q(m)}\}$ from each piece, where $p(m) \in \{1, \ldots, T_1\}$ is the timing of
the match in piece 1, and $q(m) \in \{1, \ldots, T_2\}$ is the timing of the match in piece 2.
The cost of each match $d(y^{(1)}_p, y^{(2)}_q)$ (also known as the distance or similarity between the frames) is
a function of the two frames. The cost has a small value when the frames share some characteristics that
indicate that they are at the same position in the unknown score. A good cost function is insensitive to
non-score related features, such as the overall energy of the frame. The cosine distance which normalizes
each vector is a popular choice.
\[ d\left(y^{(1)}_p, y^{(2)}_q\right) = \frac{y^{(1)}_p \cdot y^{(2)}_q}{\left\| y^{(1)}_p \right\| \left\| y^{(2)}_q \right\|} \]
where $\|y\|$ is the $\ell_2$ norm.
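A minimal sketch of a cost built from the normalized inner product is given below. Because the normalized inner product itself is largest for similar frames, the sketch takes the cost to be one minus that value so that similar frames receive a small cost; this convention, the function names, and the treatment of all-zero frames are assumptions made for the illustration.

```python
import math

def cosine_similarity(a, b):
    """Normalized inner product of two feature vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # assumption: an all-zero (silent) frame matches nothing
    return dot / (norm_a * norm_b)

def cosine_cost(a, b):
    """Match cost that is small when the frames point in the same direction."""
    return 1.0 - cosine_similarity(a, b)
```

Scaling either frame by a positive constant leaves the cost unchanged, which is the insensitivity to overall frame energy asked of a good cost function.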
There is also an additional cost between consecutive matches, which controls how the score pointer
moves frame by frame through both pieces. This function ensures that only realistic score pointer paths
are acceptable and disallows large differences in time between consecutive matches in both pieces, i.e., both
p (m) − p (m− 1) and q (m) − q (m− 1) are constrained to be small for all m. Formally, we define a cost
function
c (p (m− 1) , p (m) , q (m− 1) , q (m))
which is the cost of moving the score pointer from p (m− 1) to p (m) in piece 1 and from q (m− 1) to q (m)
in piece 2. The value of the cost function must be infinite for any movement of the score pointer disallowed
by the application. For example, if the application does not allow the score pointer to move backwards in
time, then the cost function must be infinite for p (m) − p (m− 1) < 0 or q (m) − q (m− 1) < 0. For every
frame in each piece to be matched with a frame in the other piece, we cannot allow the cost function to skip
frames, hence it must also be infinite for p (m) − p (m− 1) > 1 or q (m) − q (m− 1) > 1. In the literature,
the cost function is normally chosen to be stationary, in that it does not depend on the actual values of
p (m− 1) , p (m) , q (m− 1) , q (m), but only on the differences between them, i.e.,
\[ c(p(m-1), p(m), q(m-1), q(m)) = c(p(m-1)+t, p(m)+t, q(m-1)+t, q(m)+t) \quad \forall t \in \mathbb{Z} \]
When the cost function is stationary it can therefore be written as
c (p (m− 1) , p (m) , q (m− 1) , q (m)) ≡ c (p (m)− p (m− 1) , q (m)− q (m− 1))
Later in this chapter we will introduce a dependency on the position of onsets in the score, hence we do not
simplify the specification of the cost function at this point.
A DTW algorithm minimizes the overall cost
\[ \sum_{m=1}^{M} d\left(y^{(1)}_{p(m)}, y^{(2)}_{q(m)}\right) + \sum_{m=2}^{M} c(p(m-1), p(m), q(m-1), q(m)) \tag{8.2} \]
over the set of matches $\{p(m), q(m) : m = 1, \ldots, M\}$ and the path length M, subject to the constraints
\[ p(1) = 1, \quad q(1) = 1, \quad p(M) = T_1, \quad q(M) = T_2 \tag{8.3} \]
This set of matches is known as a path and denotes the position of the score pointer in each piece throughout
the entire path. The constraints require that the path of the score pointer starts at the first frame of both
pieces, and terminates at the last frame of both pieces. Note that the length of the path M is also unknown,
although a shorter length is preferred by the cost function, which normally indicates a better match between
the two pieces.
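The minimization of (8.2) subject to (8.3) can be sketched with a standard dynamic programming recursion. The sketch assumes a stationary transition cost that is finite only for the three steps (1, 0), (0, 1) and (1, 1), so that the pointer never moves backwards and never skips a frame; the function name, the default step costs of zero and the 0-based frame indices are choices made for the illustration.

```python
import math

def dtw(match_cost, step_cost=None):
    """Minimum-cost path through a match cost matrix.

    match_cost[p][q]    : cost d of matching frame p of piece 1 with frame q of piece 2
    step_cost[(dp, dq)] : stationary transition cost c for each allowed step
    Returns (total cost, path as a list of (p, q) pairs).
    """
    if step_cost is None:
        step_cost = {(1, 0): 0.0, (0, 1): 0.0, (1, 1): 0.0}
    T1, T2 = len(match_cost), len(match_cost[0])
    D = [[math.inf] * T2 for _ in range(T1)]   # best cost of a path ending at (p, q)
    back = [[None] * T2 for _ in range(T1)]    # predecessor on that best path
    D[0][0] = match_cost[0][0]                 # the path must start at the first frames
    for p in range(T1):
        for q in range(T2):
            for (dp, dq), c in step_cost.items():
                pp, qq = p - dp, q - dq
                if pp < 0 or qq < 0:
                    continue
                cand = D[pp][qq] + c + match_cost[p][q]
                if cand < D[p][q]:
                    D[p][q] = cand
                    back[p][q] = (pp, qq)
    # Trace back from the last frames of both pieces, as the constraints require
    path = [(T1 - 1, T2 - 1)]
    while back[path[-1][0]][path[-1][1]] is not None:
        path.append(back[path[-1][0]][path[-1][1]])
    path.reverse()
    return D[T1 - 1][T2 - 1], path
```

The path length M is not fixed in advance; it emerges from the trace back, and diagonal steps shorten the path whenever they do not increase the accumulated cost.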
The constraints may be relaxed to allow for silences or missing notes. In this case, we do not apply the
constraints (8.3) but instead define edge costs: e (p (1) , q (1)) which defines the cost of starting the path at
p (1) , q (1) and f (p (M) , q (M)) which determines the cost of terminating the path at p (M) , q (M). The
overall cost is now written as:
\[ e(p(1), q(1)) + f(p(M), q(M)) + \sum_{m=1}^{M} d\left(y^{(1)}_{p(m)}, y^{(2)}_{q(m)}\right) + \sum_{m=2}^{M} c(p(m-1), p(m), q(m-1), q(m)) \]
The DTW model may be interpreted using the generative model in the previous section. The cost
functions are interpreted as negative log probabilities, such that the summations in (8.2) correspond to
the products in (8.1), and minimizing the cost function is equivalent to maximizing the likelihood. This
interpretation is powerful as the model may be extended with further prior information if available.
The cost of matching frames is equivalent to the negative log marginal likelihood of both frames being
generated by the same score
\[ d\left(y^{(1)}_{p(m)}, y^{(2)}_{q(m)}\right) \equiv -\log \int p\left(y^{(1)}_p \mid \theta_m\right) p(\theta_m)\, p\left(y^{(2)}_q \mid \theta_m\right) p(\theta_m)\, d\theta_m \tag{8.4} \]
and the cost of the score pointer movement between successive matches is related to the transition proba-
bilities:
\[ c(p(m-1), p(m), q(m-1), q(m)) \equiv -\log p\left(x^{(1)}_{p(m)} \mid x^{(1)}_{p(m-1)}\right) - \log p\left(x^{(2)}_{q(m)} \mid x^{(2)}_{q(m-1)}\right) \tag{8.5} \]
The edge cost at the beginning of the path is equivalent to the priors:
\[ e(p(1), q(1)) = -\log p\left(x^{(1)}_{p(1)}\right) p\left(x^{(2)}_{q(1)}\right) \]
and the cost at the end of the path may similarly be written as
\[ f(p(M), q(M)) = -\log \int p\left(x^{(1)}_{p(M)} \mid x^{(1)}_{p(M-1)}\right) p\left(x^{(2)}_{q(M)} \mid x^{(2)}_{q(M-1)}\right) dp(M-1)\, dq(M-1) \]
marginalizing over the preceding position of the score pointer.
8.2 Score Alignment
A common way to use audio alignment techniques to align a score to a piece of musical audio is to synthesize
the score to audio using an electronic instrument and compute a joint path through both pieces using DTW
as described in the previous section. However, the inferred path through the observed audio may not be
appropriate to infer the position through the score in every frame. A single frame in the observed audio may
be matched to multiple frames in the synthesized audio which may span more than one score event. Thus
there is ambiguity over which score event the frame in the observed audio should be matched to.
In cases where the length of a frame is small in comparison to the minimum length of a score event, it is
impossible for multiple score events to occur within the same frame of the observed audio. When a single
frame is matched to multiple score events, the path inferred through the audio must therefore be incorrect.
We suggest that this is due to the prior model (8.4) which matches frames being too weak, and the transition
model (8.5) being too flexible, allowing dramatic changes in tempo.
In this section we focus on the development of a stronger transition model, i.e., the movement of the score
pointer through the audio. We state the transition model in such a way that it can be incorporated into
existing DTW cost functions and also expressed as a hidden Markov model, which allows further development
of the prior models.
8.2.1 Treatment of Score Events
In score alignment, there is much more value in accurately inferring the timings of note onset events in the
score rather than a smooth contour of the tempo through a piece of audio. Note onsets are also usually
more important than note releases. In piano music, the sustain pedal can blur the timings of released notes,
however note onsets are clearly defined. In percussive instruments, including plucked and hammered strings,
the notes may not be explicitly cut off, but allowed to sound and decay. In legato playing, note releases are
timed with the onsets of the next note, whilst in staccato playing, the note lengths are short and the exact
position of the release may be difficult to determine in the score. Perceptually, minor errors in the timing of
a note onset have a greater impact than in the timing of a release.
In light of this, we treat note onsets with a rigid prior model to attempt to infer the timings as accurately
as possible. Away from the onsets however, we allow the score pointer to move flexibly through the progression
of the note and its release.
8.2.2 Dynamic Time Warping Cost Function
Let the observed audio in 8.1.3 be $y^{(1)}_t$ and the synthetic audio, which is known to be aligned to the
underlying score, be $y^{(2)}_t$. We know a priori which frames in the synthetic audio $y^{(2)}_t$ are related
to note onsets in the score. If $y^{(2)}_{q(m)}$ is related to a note onset, then we set the cost function as
\[
c\left(p(m-1), p(m), q(m-1), q(m)\right) =
\begin{cases}
0 & p(m) - p(m-1) = 1 \text{ and } q(m) - q(m-1) = 1 \\
\infty & \text{otherwise}
\end{cases}
\tag{8.6}
\]
(8.6) forces a movement of 1 frame in both pieces when there is an onset in the synthetic audio. If
$y^{(2)}_{q(m)}$ is not related to a note onset, then we assume that the cost function is stationary and takes
a value $c\left(p(m) - p(m-1), q(m) - q(m-1)\right)$. When the modified cost function described in this section
is applied to score alignment using DTW, the constraint ensures that every note onset is uniquely identifiable
in the path of the score pointer through the observed audio.
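The constraint (8.6) can be sketched in a few lines of Python. This is a minimal illustration rather than the implementation used for the results in this chapter: the function name, the set `onset_frames` of synthetic frame indices, and the default stationary step costs (all three steps free) are assumptions made for the example.

```python
import numpy as np

def dtw_with_onset_costs(dist, onset_frames, step_costs=None):
    """Return a DTW path [(p, q), ...] from (0, 0) to the final frame pair.

    dist         : array, dist[p, q] is the local distance between observed
                   frame p and synthetic frame q
    onset_frames : set of synthetic frame indices q related to a note onset
    step_costs   : stationary cost c(dp, dq) used away from onsets
                   (illustrative default: all three steps are free)
    """
    if step_costs is None:
        step_costs = {(1, 1): 0.0, (1, 0): 0.0, (0, 1): 0.0}
    n_obs, n_syn = dist.shape
    total = np.full((n_obs, n_syn), np.inf)
    total[0, 0] = dist[0, 0]
    back = {}
    for p in range(n_obs):
        for q in range(n_syn):
            if p == 0 and q == 0:
                continue
            # (8.6): an onset frame may only be entered by a (1, 1) step;
            # every other step has infinite cost and is simply skipped.
            allowed = {(1, 1): 0.0} if q in onset_frames else step_costs
            for (dp, dq), cost in allowed.items():
                pp, qq = p - dp, q - dq
                if pp < 0 or qq < 0:
                    continue
                candidate = total[pp, qq] + cost + dist[p, q]
                if candidate < total[p, q]:
                    total[p, q] = candidate
                    back[(p, q)] = (pp, qq)
    if not np.isfinite(total[-1, -1]):
        raise ValueError("no admissible path under the onset constraint")
    # Trace the optimal path back from the final frame pair.
    path = [(n_obs - 1, n_syn - 1)]
    while path[-1] != (0, 0):
        path.append(back[path[-1]])
    return path[::-1]
```

Because a step that keeps q fixed at an onset frame also has infinite cost, each onset frame of the synthetic audio appears exactly once in the returned path.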
8.2.3 Hidden Markov Model Formulation
The hidden Markov model (HMM) formulation of score alignment is more powerful as it uses an explicit
model of the score itself. This allows a generative model p (yt|θm) of a frame of audio yt to be used to infer
unknown parameters θm using all of the frames in both the synthesized and observed audio corresponding
to a certain set of notes in the score, and even using other frames sharing one or more notes. A library of
training data may also be used as priors for the signal model. The parameter set θm at score position m
includes the pitches and volumes of the set of notes currently sounding, plus any unknown parameters of
the generative model for the frame. Any reasonable generative model for a frame of audio given the playing
notes may be used, including the Bayesian models developed in the preceding chapters of this thesis.
The joint probability distribution of the frame $y_t$ and the parameters $\theta_m$ may be written as
\[
p(y_t, \theta_m) = p(y_t|\theta_m)\, p(\theta_m)
\]
Now if we have previously obtained several frames of music $y^{(n)}_t$, $t = 1, \ldots, T_n$, which we denote
$y^{(n)}_{1:T_n}$, through synthesis or other means, and which we know were generated from a score with
parameters $\theta_m$, then we can update the prior generative model with the additional data. As the expression
below shows, this is equivalent to replacing the prior $p(\theta_m)$ with the posterior under the previously
observed frames $p\left(\theta_m|y^{(n)}_{1:T_n}\right)$:
\[
p\left(y_t, \theta_m|y^{(n)}_{1:T_n}\right) = p(y_t|\theta_m)\, p\left(\theta_m|y^{(n)}_{1:T_n}\right)
\]
If the prior is a conjugate prior of the likelihood function $p(y_t|\theta_m)$ then the posterior
$p\left(\theta_m|y^{(n)}_{1:T_n}\right)$ is of the same family as the prior, the parameters of which are
calculated using standard update rules.
In the HMM formulation, we are only interested in inferring the path of the score pointer xt through the
observed audio yt, as the path through any synthesized audio is already known. The probabilistic model
for the movement of the score pointer is simple. Usually from one frame to the next, the score pointer may
either stay in the same score position or move to the next position.
\[
p(x_{t+1}|x_t) =
\begin{cases}
p_m & x_{t+1} = m+1,\ x_t = m \\
1 - p_m & x_{t+1} = x_t = m \\
0 & \text{otherwise}
\end{cases}
\tag{8.7}
\]
pm is related to the current tempo and the expected duration of the event represented by the score pointer
position at that point. For example, if the event is expected to last 1 second and the time difference between
frames is 125ms then pm should be set to 1/8.
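As a small illustration of (8.7), the sketch below builds the transition matrix for a short score. The function names are hypothetical, and treating the final score position as absorbing is an assumption of the example. Under (8.7) the number of frames spent at a position is geometric with mean 1/pm, which is why pm is taken as the frame step divided by the expected duration.

```python
import numpy as np

def advance_probability(expected_duration_s, frame_step_s):
    """p_m chosen so that the expected dwell time (1 / p_m frames) matches
    the expected duration of the score event."""
    return min(1.0, frame_step_s / expected_duration_s)

def transition_matrix(expected_durations_s, frame_step_s):
    """Left-to-right transition matrix A[i, j] = p(x_{t+1} = j | x_t = i)."""
    n_states = len(expected_durations_s)
    A = np.zeros((n_states, n_states))
    for m, duration in enumerate(expected_durations_s):
        if m == n_states - 1:
            A[m, m] = 1.0  # assumption: the final score position is absorbing
            continue
        p_m = advance_probability(duration, frame_step_s)
        A[m, m + 1] = p_m        # move to the next score position
        A[m, m] = 1.0 - p_m      # stay in the same score position
    return A

# Three score events lasting 1 s, 0.5 s and 0.25 s, with 125 ms between frames.
A = transition_matrix([1.0, 0.5, 0.25], frame_step_s=0.125)
```

For the first event the entry `A[0, 1]` equals 1/8, matching the example in the text.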
(8.7) may be extended to allow other transitions, such as skipping a score event in a live and error prone
performance. An additional Markov model allowing changes in tempo may be added by extending the hidden
state xt to include the unknown tempo parameter. The tempo should only be allowed to change slowly, for
example at bar lines in the score or explicitly marked tempo change points.
When θm represents a note onset, we set pm = 1, similar to the constraint (8.6), to improve the accuracy
of onset timings. This forces the score pointer to move on after exactly one frame at the onset. The remainder
of the note is represented by θm+1 and may be further divided into sustain and release portions of the note if
this information is available or may be inferred from the score.
8.2.4 Inference
In Peeling et al. [2007a] we described two methods of inference using a simpler version of the hidden Markov
model described in 8.2.3. In this section, we extend these methods into an iterative scheme which improves
the accuracy of the timings and the consistency of the inferred score parameters. This scheme firstly infers
x1:T given θ1:M and then infers θ1:M given x1:T .
The method of calculating the posterior marginal distributions $p(x_t|y_{1:T}, \theta_{1:M})$ for all
$t = 1, \ldots, T$ is known as the forwards-backwards algorithm [Rabiner, 1989]. Modifications of this algorithm
can be used on-line to calculate a fixed lag filtering distribution $p(x_t|y_{1:t+L}, \theta_{1:M})$ where
$L \geq 0$ is the lag permitted in the observations before the score pointer position is required.
The method of calculating the mode of the posterior distribution is known as the Viterbi algorithm. It
is used where a consistent path of the score pointer is required across the entire piece.
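A minimal log-domain sketch of the Viterbi recursion is given below for illustration. It assumes the transition matrix and the per-frame log-likelihoods log p(yt|θm) have already been evaluated; the function and argument names are ours and are not tied to any particular software package.

```python
import numpy as np

def viterbi(log_A, log_lik, log_init):
    """Most probable state path of an HMM.

    log_A    : (M, M) array, log_A[i, j] = log p(x_{t+1} = j | x_t = i)
    log_lik  : (T, M) array, log_lik[t, m] = log p(y_t | theta_m)
    log_init : (M,) array, log p(x_1 = m)
    """
    T, M = log_lik.shape
    delta = np.empty((T, M))            # best log-score of a path ending in m at t
    psi = np.zeros((T, M), dtype=int)   # best predecessor of state m at time t
    delta[0] = log_init + log_lik[0]
    for t in range(1, T):
        scores = delta[t - 1][:, None] + log_A   # scores[i, j]: from i to j
        psi[t] = np.argmax(scores, axis=0)
        delta[t] = scores[psi[t], np.arange(M)] + log_lik[t]
    # Backtrack from the best final state.
    path = np.empty(T, dtype=int)
    path[-1] = np.argmax(delta[-1])
    for t in range(T - 2, -1, -1):
        path[t] = psi[t + 1, path[t + 1]]
    return path
```

Zero transition probabilities enter as minus infinity in the log domain, so forbidden moves of the score pointer are never selected.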
The remaining step of inference is to update the parameters θm of the generative model, given that we
have attempted to fit the score to the observed frames. The standard method for hidden Markov models
is the Baum-Welch algorithm, an expectation-maximization algorithm. The algorithm first computes the
posterior distribution of the score pointer by the forwards-backwards procedure, and then locates the MAP
estimate of θ1:M under the posterior distribution of the parameters
\[
p(\theta_{1:M}|x_{1:T}, y_{1:T})
\]
An alternative method, known as conditional modes, uses the Viterbi path to compute the score pointer,
and then maximizes
\[
p(\theta_m|\{x_t, y_t\} : x_t = m)
\]
for all m. As this method segments the score parameters and data into separate maximization problems,
there is potential for the computation to be carried out in parallel; and in general the computation and
memory requirements are less than employing the Baum-Welch algorithm.
8.2.5 Results
Figure 8.1 on page 144 shows an example of audio alignment carried out using DTW techniques on a recorded
piece of guitar audio aligned to a synthesized version. The synthesized version was generated from the known
score of the piece with a constant tempo. The note onsets were identified from the score, and timings were
assigned by scaling the number of beats from the beginning of the score to the note onset event. Note onset
costs are implemented as described in 8.2.2. The distance matrix d (p, q) is computed using a single source
Gaussian variance model (Section 7.2) as follows. Each element of the distance matrix is the joint marginal
likelihood of the frame of recorded audio $y^{(1)}_p$ and the frame of synthetic audio $y^{(2)}_q$, assuming
that both frames share the same template vector $t$ but have a separate excitation parameter: $v^{(1)}_p$ being
the excitation parameter for $y^{(1)}_p$ and $v^{(2)}_q$ being the excitation parameter for $y^{(2)}_q$. We
therefore calculate
\[
d\left(y^{(1)}_p, y^{(2)}_q\right) = p\left(y^{(1)}_p, y^{(2)}_q\right)
= \int p\left(y^{(1)}_p|t, v^{(1)}_p\right) p\left(y^{(2)}_q|t, v^{(2)}_q\right)
p(t)\, p\left(v^{(1)}_p\right) p\left(v^{(2)}_q\right) \, dt \, dv^{(1)}_p \, dv^{(2)}_q
\tag{8.8}
\]
for each pair of frames, using the Variational Bayes algorithm 7.1 to approximate the integral in (8.8). The
template and excitation hyperparameters are chosen to be uniform: $a_t = 2$, $b_t = 1$ and $a_v = 2$, $b_v = 1$,
and the hyperparameters are not optimized for this algorithm.
The alignment path is then computed by DTW. The result is that the timings of note onsets in the
recorded piece are synchronized well with the synthetic version, which is indicated by the alignment path
passing through the intersections of the edges of the vertical and horizontal bands in the distance matrix.
Figure 8.1a on page 144 shows the spectrogram of the recorded audio, and Figure 8.1b on page 144 shows
the distance matrix computed from (8.8) overlaid with the alignment path. By comparing the distance
matrix with the spectrogram, it can be seen that the note onsets of the observed audio occur at the edges
of the vertical bands. Similarly, note onsets in the synthetic audio correspond to the edges of the horizontal
bands. On close observation, the alignment path passes through the intersections of the band edges, which
indicates that the note onsets in the observed and synthetic audio have been matched together with little
timing error.
To quantify the improvement in the alignment when using note onset costs and the iterative inference
algorithm of 8.2.4 we use a data set built from Midi and mp3 files from the Classical Piano Midi page¹.
The mp3 audio files are recorded acoustically from a Midi controlled grand piano, and are aligned to the
Midi files provided. An accompanying synthetic set of audio was obtained by removing all of the tempo and
expressive markings from the Midi files and synthesizing the result. Both pieces of audio were downsampled
to 8000Hz and split into frames of 48ms with 50% overlap between the frames. These were then matched
together using the algorithms described in this chapter. The tempo through the observed audio is assumed
to be constant, and is estimated from the tempo of the synthetic audio by multiplying by the ratio of the
lengths of the two pieces. The estimated tempo is then used to set the cost function for moving from one
frame to the next using the model in (8.7).
For each note onset in the score, we identify the frame in the synthetic audio containing the note onset.
Then if that frame is matched to a unique frame in the observed audio by the DTW algorithm (which is
guaranteed if the note onset costs described in 8.2.2 are used), then the centre of the frame in the observed
audio is recorded as the timing of the note onset in the observed audio. If the frame in the synthetic audio is
not matched to a unique frame in the observed audio, then the timing of the onset in the observed audio is
recorded as halfway between the start of the first frame and the end of the last frame of the group of frames
in the observed audio matched to the frame containing the onset in the synthetic audio.
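The rule above for reading an onset time off the alignment path can be sketched as follows. The frame length and hop follow the 48ms frames with 50% overlap used here; the function name and the convention that frame p starts at p times the hop are assumptions of the example. A single matched frame is a special case of the general rule, since the midpoint between its start and its end is its centre.

```python
FRAME_LENGTH_S = 0.048   # 48 ms frames
HOP_S = 0.024            # 50% overlap between consecutive frames

def onset_time_from_path(path, onset_frame_q,
                         frame_length_s=FRAME_LENGTH_S, hop_s=HOP_S):
    """Onset time (seconds) in the observed audio for one synthetic onset frame.

    path          : list of (p, q) pairs, observed frame p matched to synthetic frame q
    onset_frame_q : index of the synthetic frame containing the note onset
    """
    matched = sorted(p for p, q in path if q == onset_frame_q)
    if not matched:
        raise ValueError("onset frame does not appear in the alignment path")
    start_of_first = matched[0] * hop_s
    end_of_last = matched[-1] * hop_s + frame_length_s
    # Halfway between the start of the first and the end of the last matched frame.
    return 0.5 * (start_of_first + end_of_last)
```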
¹ www.piano-midi.de

[Figure 8.1a: Spectrogram of the observed audio (frequency / Hz against time / s). Vertical features in the
spectrogram correspond to vertical bands in the distance matrix below.]

[Figure 8.1b: The distance matrix of the observed spectrogram above, compared frame-by-frame to synthesized
audio from the same score (synthesized audio time / s against observed audio time / s).]

Figure 8.1: On the distance matrix of the observed spectrogram, regions of high spectral similarity are shaded
darker and regions of low spectral similarity are shaded lighter. Overlaid in red is the optimal alignment
path computed using DTW, which moves steadily from the beginning of both pieces along the diagonal of
the distance matrix.
                          Cosine Distance           Gaussian Variance
Piece          Unaligned  DTW    Note Onset Costs   DTW    Note Onset Costs   Iterative Inference
alb_esp1       578.1      350.1  342.7              357.8  345.3              331.8
scn15_5        1278.5     11.9   11.6               11.3   10.9               10.6
bor_ps7        822.0      55.4   51.5               50.6   47.0               45.8
alb_esp2       1203.4     122.7  119.9              111.5  110.0              110.0
mendel_op62_4  607.5      16.0   15.9               15.8   15.4               14.7
ty_maerz       3068.1     639.0  634.1              637.6  633.6              625.7
chpn-p22       753.2      158.4  152.7              154.9  151.1              151.0
scn15_3        74.2       12.4   12.2               17.0   13.4               12.4
scn15_1        374.4      13.2   12.5               12.6   12.2               12.2
scn15_6        408.0      27.2   22.8               22.7   22.2               19.4

Table 8.1: Score alignment: median alignment error in milliseconds
We then calculate the difference in milliseconds between the onset timings in the original Midi (before
removing tempo changes) and the note onsets identified in the aligned audio. The quality of
alignment is measured using the median alignment error of note onsets in milliseconds, as this evaluation
criterion is also used in Cont et al. [2007], Devaney et al. [2009] for score alignment. The results are
presented in Table 8.1 on page 145.
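For concreteness, the evaluation criterion can be written as a short function. This is a sketch under the assumption that the error of each onset is taken as an absolute difference and that both lists are in the same note order; the function name is illustrative.

```python
import numpy as np

def median_alignment_error_ms(reference_onsets_s, estimated_onsets_s):
    """Median absolute difference between reference and estimated onset times.

    Both arguments are sequences of onset times in seconds, in the same order.
    """
    reference = np.asarray(reference_onsets_s, dtype=float)
    estimated = np.asarray(estimated_onsets_s, dtype=float)
    if reference.shape != estimated.shape:
        raise ValueError("onset lists must have the same length")
    return float(np.median(np.abs(estimated - reference)) * 1000.0)
```

The median is insensitive to a small number of grossly misaligned onsets, which is useful when a few notes in a piece are matched badly.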
From these results we observe that both the cosine distance and Gaussian variance models produce similar
alignments. The advantage of being able to use an iterative inference scheme for the Gaussian variance model
provides modest improvements in alignment accuracy. The accuracy of the alignment varies strongly based
on the piece that is being aligned. The common characteristic of the pieces which are badly aligned is that
they include sections of fast notes with high polyphony, and in these situations a simple distance metric
such as the cosine distance or a single source Gaussian variance model is too weak a model to improve
alignment dramatically. We therefore suggest that a useful line of future work would be to investigate
more powerful models, which in themselves are capable of music transcription, such as the multiple source
Gaussian variance model in Chapter 7 with training data in addition to the synthetic audio, and implement
them in the inference framework described in this chapter, in order to tackle these more difficult cases of
score alignment.
In Figure 8.2 on page 146 we show how the spectrogram of the audio and the aligned score can be presented
visually. The score pointer is presented as a vertical bar which moves through both the spectrogram
and the score, synchronized with the audio track.
8.3 Event Based Inference
The generative model described in 8.1.2 directly works with frames of observed audio, and infers the temporal
structure from the posterior distribution. In this section, we consider a different application, where the
observations are not frames of audio but incompletely labeled events. The observations here are the output
of an audio preprocessing system such as an onset detector (which detects rapid changes in the energy of the
audio over time), or an approximate transcription of the melodic line of the piece, which could be provided by
a human in the context of a query-by-humming application. In these situations, we again wish to determine
the temporal structure in the music, and be able to match and align the observed events with the underlying
score. The goal of this section is the same as in Section 8.2, however instead of using a generative model
of the audio in each frame, we now must define and infer using a model which describes the production of
these observed events from the score.

[Figure 8.2, upper panel: Spectrogram Data (frequency / Hz against time / s). Lower panel: MIDI Data (MIDI
note against score position).]

Figure 8.2: Score alignment using Gaussian variance model. The movement of the score pointer is displayed
regularly in time as a vertical bar passing through the spectrogram and also the Midi representation. The
spectral features can be matched visually to the score representation easily, and it can be seen that the
score pointer positions shown correspond to the appropriate timing in the audio. In a practical visualization
application, this figure can be presented as a video, where the score pointer moves through the spectrogram
and the Midi representation concurrently, whilst the audio is playing. It is then much easier to notice
subtle changes in tempo (in performance rather than an explicit score marking), for example between score
positions 150 and 175, which marks the entry of the second voice in Bach's second Fugue in C minor (from
the Well-Tempered Klavier performed by Daniel Ben Pienaar).
For example, a tempo tracking algorithm must firstly infer the presence of onsets of note events in a piece
of music, and then infer the overall speed of the arrival of events by looking at periodic patterns in the timing
of events (Figure 8.3 on page 151). The onset timings may also be provided directly by a human listener, in
an application which infers and even controls the tempo of a piece playing, or in a query-by-tapping context,
where the application attempts to match the timings provided by the listener to a Midi database. In these
applications, the events are unlabeled - we do not know which beat of which bar each event belongs to.
Exact onset timings may also be acquired from a Midi enabled instrument attempting to perform a
score. Analyzing the note timings in relation to the score is useful for musicological studies of performance.
Another application is an incomplete transcription of the score, such as a melodic line transcription in a
query-by-singing application, or even the output of a classification model-based transcription system which
cannot easily be set in a generative model framework. In these applications, the events are partially labeled:
we know the pitch and perhaps the volume of the note, but these notes are not matched with the score itself.
Some of the notes observed may be erroneous, and others may arrive in the wrong order, even subtly in the
case of piano chords and polyphonic music.
In the above descriptions, we have motivated both the use of an alignment application (to infer the tempo
throughout the audio) and a matching application where we wish to select the best match to the observed
events from a large set of candidate pieces. Although these applications appear different, they can both be
addressed by Bayesian inference of the unknown model parameters. For alignment, inference of the note
onset positions can be used to construct the progress of the tempo through the piece. For matching against
other candidates, Bayesian inference can compute the likelihood of the observed events given a candidate
score. The candidate giving the highest likelihood to the observations is therefore the best match to the
observations. In 8.3.3 we consider a query-by-tapping application, where we attempt to match a set of onsets
intended to mimic the rhythmic structure of a piece of music, to candidate scores from a database.
In terms of the generative model p (yt|θm) p (θm), yt refers to the set of event onsets inferred in that frame.
In this section we will propose a Bayesian model which can be adapted to all of the above applications in a
general way. We define C as the number of categories of events that we are able to observe. If an event of
category c ∈ {1, . . . , C} occurs in frame yt, then we use the notation that c ∈ yt, and if it is expected to occur
during the section of score θm then c ∈ θm.
The definition of an event category depends on the nature of the observed events. If the observed
events are generated by a simple onset detector, there is only one event category: the detected onsets.
Simultaneously sounding note onsets are grouped into score onset sections. In a score onset section c ∈ θm
as we expect an onset to be detected when there is an onset in the score. Sections are also defined for the
periods in the score where there are no onsets. In these sections, c ∉ θm as we do not expect an onset to be
detected.
If the observed events are note onsets returned by a melodic line transcription algorithm, then each pitch
returned by a transcription algorithm would correspond to one of the C categories. The score is divided into
disjoint sections, where a group of pitches are sounding throughout a section, or there is silence. When a
pitch corresponding to event category c is sounding in section m then c ∈ θm as we expect the transcription
algorithm to detect this pitch during this section of the score. In the case of silence, we expect c ∉ θm for
all c.
8.3.1 Counting of Temporal Events
The basic model we propose maintains a count of the number and category of temporal events observed in
the music up to and including the present frame. Our observation is a function nc (t) for every frame yt
defined as the number of events of category c observed up to and including frame yt.
Our prior model consists of the expected number of events observed at different times in the piece of
music. We will model the temporal events as a set of independent non-homogeneous Poisson processes, one
for each category. The occurrence of events of category c has a time-varying intensity $\rho_c(t)$, whose
cumulative sum $\lambda_c(t)$ gives the expected number of events by the time we reach frame t, i.e.,
\[
n_c(t) \sim \mathrm{Po}\left(\sum_{\tau=0}^{t} \rho_c(\tau)\right) = \mathrm{Po}\left(\lambda_c(t)\right)
\tag{8.9}
\]
where $\lambda_c(t) = \sum_{\tau=0}^{t} \rho_c(\tau)$.
The intensity function $\rho_c(t)$ gives the expected number of events of category c occurring in each frame.
When matching onset detections to a score, we would initially expect that this intensity function is equal to
the number of events in the section of the score $x_t$ corresponding to the frame t, i.e.,
\[
\rho_c(t)|\theta_{x_t} =
\begin{cases}
1 & c \in \theta_{x_t} \\
0 & c \notin \theta_{x_t}
\end{cases}
\tag{8.10}
\]
This model of the intensity function is suitable for ideal cases where we do not expect any errors in the
performance or errors in the detection of the events, and the only unknown variable is the tempo of the
performance, represented by xt which maps frame t to score section m.
(8.9) is a generative model for the occurrence of observed events, given the expected counts of events
in the underlying score, and therefore may be used for all of the applications described earlier. Alignment
applications involve inferring the tempo of the performance xt throughout the piece. Applications which
match the observations to the best candidate score must compute the likelihood of the observed event counts
$n_c(t)$ given the candidate score $\theta_{1:M}$
\[
\prod_{c=1}^{C} \prod_{t=1}^{T} p\left(n_c(t)|\theta_{1:M}\right)
\tag{8.11}
\]
for each candidate. The candidate with the highest likelihood (8.11) is chosen as the best match.
The Poisson assumption allows variability in both the number of events detected and their timings. The
maximum likelihood case is when the events match the score exactly, however we show in 8.3.3 that using
(8.10) without modification is sufficient to match scores to observations on a global or coarse scale.
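The matching procedure can be illustrated with a short sketch for a single event category (C = 1). The function names and the toy candidate scores are hypothetical; the per-frame expected events play the role of ρc (t) in (8.10), with the alignment of frames to score sections assumed known, and their running sum gives λc (t) in (8.9).

```python
import math

def poisson_log_pmf(n, lam):
    """log Po(n; lam), with the convention that Po(0; 0) = 1."""
    if lam == 0.0:
        return 0.0 if n == 0 else -math.inf
    return n * math.log(lam) - lam - math.lgamma(n + 1)

def cumulative(values):
    """Running sum, turning per-frame values into counts up to each frame."""
    out, running = [], 0
    for v in values:
        running += v
        out.append(running)
    return out

def count_log_likelihood(observed_events, expected_events):
    """Logarithm of (8.11) for one event category.

    observed_events : number of events detected in each frame
    expected_events : intensity rho(t) for each frame under a candidate score
    """
    n = cumulative(observed_events)      # n(t)
    lam = cumulative(expected_events)    # lambda(t)
    return sum(poisson_log_pmf(n_t, lam_t) for n_t, lam_t in zip(n, lam))

def best_match(observed_events, candidates):
    """Name of the candidate score with the highest likelihood."""
    return max(candidates,
               key=lambda name: count_log_likelihood(observed_events, candidates[name]))
```

With the unmodified intensity (8.10), an event observed before any event is expected receives zero probability, so this sketch returns minus infinity for such a candidate.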
8.3.2 Clutter and Missed Detections
For a more powerful model which is more appropriate for real performances and onset detection methods and
is able to match individual events in the score to the observations, we may begin by adding two additional
error parameters per event type. The first parameter is a clutter process $\rho^{(\mathrm{clutter})}_c$ which
gives the expected number of spurious event detections in each frame. The second parameter governs the
probability of missing note detections, which we denote $\rho^{(\mathrm{missed})}_c$. The new model for the
intensity function is
\[
\rho_c(t)|\theta_{x_t} =
\begin{cases}
1 - \rho^{(\mathrm{missed})}_c & c \in \theta_{x_t} \\
\rho^{(\mathrm{clutter})}_c & c \notin \theta_{x_t}
\end{cases}
\]
In this section, we treat the two error processes as independent per event type and constant throughout
the score / observation. It is straightforward to allow the error parameters to vary across different parts of
the score, for example in fast moving sections where notes are more likely to be missed, by appropriately
subdividing the score.
These parameters are straightforward to infer as part of the iterative procedure described in 8.2.4. If the
current MAP estimate of the path of the score pointer is given by $x^*_t$ for all t, then the maximum likelihood
estimate of the missed event parameter
\[
\rho^{(\mathrm{missed})}_c = \frac{1}{T} \sum_{t=0}^{T} I\left[c \in \theta_{x^*_t}\right] I\left[c \notin y_t\right]
\]
is the average of the missed detections in the observation, and the maximum likelihood estimate of the clutter
parameter
\[
\rho^{(\mathrm{clutter})}_c = \frac{1}{T} \sum_{t=0}^{T} I\left[c \notin \theta_{x^*_t}\right] I\left[c \in y_t\right]
\]
is the average of the spurious detections.
If we are repeatedly using a particular technique for detecting events, it is likely that we would have strong
prior information about the clutter and missed event processes. In this case, we can put a Beta distribution
(Section A.4) $\rho^{(\mathrm{missed})}_c \sim B\left(\alpha^{(\mathrm{missed})}_c, \beta^{(\mathrm{missed})}_c\right)$
on the parameter, where $\alpha^{(\mathrm{missed})}_c + \beta^{(\mathrm{missed})}_c$ is the number of times we
applied the technique and $\beta^{(\mathrm{missed})}_c$ is the number of missed detections. The maximum
a posteriori estimate of the missed event parameter is
\[
\rho^{(\mathrm{missed})}_c = \frac{1}{T + \alpha^{(\mathrm{missed})}_c + \beta^{(\mathrm{missed})}_c}
\left(\beta^{(\mathrm{missed})}_c + \sum_{t=0}^{T} I\left[c \in \theta_{x^*_t}\right] I\left[c \notin y_t\right]\right)
\]
An equivalent formula holds for the clutter parameter.
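The estimates of this section can be sketched for a single event category as follows. The function and argument names are illustrative, and the sketch assumes one indicator per frame for T frames, so that the sums run over exactly T terms. Setting the prior counts to zero recovers the maximum likelihood estimates.

```python
def error_rate_estimates(expected, observed,
                         alpha_missed=0.0, beta_missed=0.0,
                         alpha_clutter=0.0, beta_clutter=0.0):
    """Estimate the missed-detection and clutter parameters for one category.

    expected : per-frame booleans, True if the event is expected in the score
               section that the current path assigns to the frame
    observed : per-frame booleans, True if the event was detected in the frame
    alpha_*, beta_* : Beta prior counts (all zero gives the ML estimates)
    """
    T = len(expected)
    if T != len(observed) or T == 0:
        raise ValueError("expected and observed must be non-empty and equal in length")
    missed = sum(1 for e, o in zip(expected, observed) if e and not o)
    spurious = sum(1 for e, o in zip(expected, observed) if o and not e)
    rho_missed = (beta_missed + missed) / (T + alpha_missed + beta_missed)
    rho_clutter = (beta_clutter + spurious) / (T + alpha_clutter + beta_clutter)
    return rho_missed, rho_clutter
```

For example, with 8 frames containing one missed and one spurious detection, both maximum likelihood estimates are 1/8.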
8.3.3 Query-by-Tapping Results
The Mirex Query by Tapping Task is an interesting setting for evaluating models of temporal structure in
music. A set of onset times are provided to the system, and the task is to match the rhythmic structure
observed to one of a set of candidate scores. The Mirex task provides a set of monophonic Midi database
records, and a set of audio tapped and symbolic queries available for download². The question here is whether
the temporal structure of music can be adequately represented by rhythmic (onset) information alone [Jang
et al., 2001]. This has implications for score-alignment algorithms where robustness may be increased by
ignoring pitch information rather than including it.
Applying the models here is straightforward. C = 1 and nc (t) is the number of observed onsets provided
in the query up to frame t. ρc (t) is set to the number of note onsets in the Midi file up to frame t. The
examples in the Mirex task are monophonic, so note onsets do not happen at exactly the same time.
The system ranks all of the Midi files in the database according to the likelihood (8.11) for each query.
Figure 8.3 on page 151 shows an example of the inter-onset timings which are used to match the observed
onsets to the note onsets in Midi. To evaluate how well the system performs, for each query q, we compute
the rank rq of the correct score when the database scores are ranked according to likelihood. If the system
correctly matches query q by assigning the highest likelihood, then rq = 1. If the system assigned two
incorrect scores with a higher likelihood than the correct score, then rq = 3. The evaluation method
published by Mirex is the mean reciprocal rank (MRR) over all the queries q ∈ Q:
\[
\mathrm{MRR} = \frac{1}{|Q|} \sum_{q \in Q} \frac{1}{r_q}
\]
such that a perfect system which returns the correct score for every query, i.e., rq = 1∀q has an MRR of 1,
and systems which make more mistakes have lower MRR values.
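The evaluation can be sketched as below. The function names are illustrative; the reciprocal ranks are averaged over the queries so that a perfect system scores 1, and breaking ties in favour of the correct score is an assumption of the example.

```python
def rank_of_correct(log_likelihoods, correct):
    """Rank r_q of the correct score: one plus the number of candidates
    that were assigned a strictly higher likelihood."""
    target = log_likelihoods[correct]
    return 1 + sum(1 for score in log_likelihoods.values() if score > target)

def mean_reciprocal_rank(ranks):
    """Average of 1 / r_q over all queries."""
    if not ranks:
        raise ValueError("at least one query is required")
    return sum(1.0 / r for r in ranks) / len(ranks)
```

For instance, `rank_of_correct({"a": -1.0, "b": -5.0, "c": -3.0}, "c")` gives a rank of 2, since only candidate "a" has a higher likelihood.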
The best performing algorithm for the 2008 task, by Typke and Walczak-Typke [2008], which calculates
an Earth mover's distance (EMD) between the observed and score onsets, achieved an average MRR of 0.52.
Even without using any Bayesian techniques, we obtain an average MRR of 0.54 on the Mirex database,
outperforming more elaborate and computationally expensive algorithms on the symbolic onset data.
8.4 Conclusion
In this chapter we have developed techniques for applying a generative model of a musical audio signal
coupled with a Markov model of the movement of a score pointer to a variety of inference tasks in musical
signal processing. The first application we consider is aligning two extracts of musical audio by matching
frames on the basis of the similarity of spectral features. A popular framework for carrying out this task
is dynamic time warping (DTW). By expressing the dynamic time warping problem in a generative model
setting, we are able to incorporate Bayesian priors on the spectral features and the matching process in
order to generate more reliable and realistic results. Moreover we are able to extend the method to multiple
extracts and apply iterative Bayesian inference techniques on hidden Markov models (8.2.4) in order to
process frames in real time and over indefinitely long periods of time, which is not possible with a standard
DTW implementation.
The second application we consider is score alignment, where we typically have a symbolic representation
of a piece of music which is to be matched to an extract of audio. The audio is expected to be a performance
of that score with tempo changes and some errors. We sketch a model of the score pointer which assigns
more importance to the position of onsets in the score, and allows gradual tempo changes. We also show
how training data, for example a synthesized version of the score, can be used to infer the parameters of
the generative model, and how to update the model using the structure inferred within the observed audio
itself. A standard approach to score alignment is to match a synthesized track with the observed audio using
DTW. By applying the prior model of the score pointer position and the improved inference techniques, we
show how the accuracy of the matching can be improved, especially at the position of note onsets where
timing errors are perceptually most critical.

² www.music-ir.org/mirex/wiki/2008:Query_by_Tapping

[Figure 8.3, upper panel: Query Onsets (time / ms). Lower panel: MIDI Onsets (time / ms).]

Figure 8.3: Inter-onset timings in a query-by-tapping problem. The timings in the tapped query have the
same rhythmic structure as the Midi timings, which enables the query to be matched with a high likelihood
to the score using the Poisson process model generated in this chapter.
The final applications are based on an event-based observation of the audio, which may be obtained
through an onset detector or a principal pitch transcription algorithm. We develop a novel event counting
generative model of the events using a non-homogeneous Poisson process, which replaces a generative model
of the signal itself, and may use the same inference techniques as the other models in this chapter. We describe
simple priors for the event processes which allow and infer the probability of missed event detections and
clutter for each event type. The approach is applied to a query-by-tapping example where it is shown to be
more accurate than more elaborate algorithms.
The work of Degara et al. [2010] shows that significant improvement in note onset detection can be
achieved by applying a prior model of rhythmic structure and fusing onset detection with rhythmic structure
constraints. A natural extension of the work in Section 8.3 would be to jointly infer the event positions and
the score structure using Bayesian inference.
Throughout this chapter we have mostly outlined the prior models and inference algorithms that can be
applied using this framework, whereas in previous work [Peeling et al., 2007a] we have expressed a specific
approach in more detail. The hidden Markov model framework is general and well studied, and many
software packages exist for inference in HMMs which only require the specification of the observation model
and the dynamics of the hidden state. Our contribution in this chapter has been to illustrate how several
important applications in musical signal processing may be carried out through these inference algorithms.
Chapter 9
Conclusion
9.1 Summary
In this thesis we have presented a variety of Bayesian models and modelling methods aimed at tackling
difficult applications of the processing of musical audio signals. A hierarchy of Bayesian priors is a feasible
way to represent the complex structure exhibited in musical audio that is known from musical theory and
has been obtained by experiments on the physics of musical instruments. Generative models of music are
attractive as they are not tied to a particular application. Instead, the same generative model and prior
structure may be used for music transcription, source separation, synthesis and reconstruction, simply by
identifying which parameters of the model are known and unknown in a particular context, and applying an
appropriate inference algorithm to the model to infer the values of the unknown parameters.
For each model we have developed in this thesis, we have described how it differs from existing work or
generalizes an existing model, and provided a theoretical basis and justification for the accuracy observed in
the experimental results presented. We have focused particularly on the problem of multiple pitch detection
in frames of audio, because it is an exacting measure of how appropriate and accurate a model
of a musical signal is, especially for mixtures of three or more notes. Our assumption is that a generative
model which is capable of performing the difficult classification task of identifying multiple pitches with
high accuracy will also produce faithful and realistic reconstructions when used for its original purpose in a
synthesis application, as far as the limitations of the particular model allow¹.
Another goal of this thesis was to produce algorithms and inference methods using these generative
models that are ultimately feasible for real-time applications. Musical signals are inherently complex, and
even a simple model requires many parameters to model a single frame of audio. Full Bayesian inference
by reversible jump MCMC when the number of parameters is itself unknown is slow despite substantial
improvements to inference in the Bayesian harmonic model in previous work. It is important, however, that
any compromise be made in the inference algorithm rather than by oversimplifying the model,
because musical audio requires a rich prior structure to capture the information we are extracting from the
signal.
¹ The matrix factorization models in Chapter 7 have no explicit model for the phase of the source coefficients, hence the
phase of a synthetic signal would need to be constructed using a phase vocoder [Flanagan et al., 1965, Cemgil and Godsill, 2005].
In Chapter 5 we embellished the existing Bayesian harmonic model with a justification for using the
Hilbert transform to improve the model's ability to capture frequency and amplitude modulations in a
partial frequency. We also saw a small improvement when using sinc windowed Gabor basis functions to
model signals from instruments with vibrato. We also provided a result for the mode of the posterior
distribution of the signal-to-noise ratio parameter which reduces the number of parameters that need to be
simulated in the MCMC algorithm. By making these modifications to the existing model, we were able to
show improvement in the accuracy of multiple pitch detection for two-note mixtures. No improvement was
demonstrated for mixtures of three or more notes, which we attributed to the frequency values not being
estimated to sufficient accuracy. In Chapter 6, however, we made this model practical
for applications demanding faster computation, by splitting inference into two stages: the first stage to
detect partial frequency positions in a frame using the generative model with a vague prior over possible
frequency positions, using numerical techniques to improve the accuracy of the frequency estimates; and the
second stage to fit a harmonic model to the estimated partial frequencies. Both stages were implemented
using greedy algorithms, but we showed that the generative signal model and the prior harmonic model are
powerful enough to detect the number and frequencies of pitches in a frame to the same level of accuracy as
a full Bayesian inference scheme, but with computation reduced dramatically.
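The second stage summarized above, fitting a harmonic model to partial frequencies that have already been estimated, can be illustrated with a small greedy sketch. This is not the algorithm of Chapter 6: the scoring rule (matched harmonics minus missing harmonics), the 2% relative tolerance, the equal-tempered candidate grid and all function names are assumptions chosen for illustration.

```python
def score_candidate(f0, partials, n_harmonics=4, tol=0.02):
    """Score a candidate pitch as matched harmonics minus missing harmonics."""
    matched = []
    for h in range(1, n_harmonics + 1):
        target = h * f0
        # Index of the detected partial nearest to this harmonic.
        i = min(range(len(partials)), key=lambda j: abs(partials[j] - target))
        if i not in matched and abs(partials[i] - target) <= tol * target:
            matched.append(i)
    return 2 * len(matched) - n_harmonics, matched


def greedy_pitches(partials, candidates, min_score=2):
    """Repeatedly pick the best-scoring candidate and remove the partials it explains."""
    remaining = list(partials)
    pitches = []
    while remaining:
        best = max(candidates, key=lambda f0: score_candidate(f0, remaining)[0])
        score, matched = score_candidate(best, remaining)
        if score < min_score:
            break
        pitches.append(best)
        remaining = [f for i, f in enumerate(remaining) if i not in matched]
    return pitches


# Hypothetical partial estimates for two notes near 220 Hz and 330 Hz.
partials = [220.5, 441.0, 659.0, 880.0, 330.2, 661.5, 990.0, 1321.0]
# Equal-tempered candidate grid from A2 (110 Hz) upwards, three octaves.
candidates = [110.0 * 2 ** (k / 12) for k in range(37)]
pitches = greedy_pitches(partials, candidates)
print([round(p, 2) for p in pitches])
```

Penalizing missing harmonics is what stops the sub-octave candidate at 110 Hz from absorbing the partials of both notes; a Bayesian harmonic prior plays the corresponding role in a probabilistic treatment.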
In Chapter 7 each frame of audio is modelled by projection onto a fixed basis, rather than inferring the
number and frequencies of the bases. The harmonic spectrum of a note is represented as a prior on the
relative amplitudes of the basis functions in each frame, and the amplitude envelope of a note is represented
as a prior on the relative energy of the signal across frames. The model is linear in the amplitude parameters,
allowing multiple notes to be superimposed, and importantly multiple frames can be processed in parallel.
A simple polyphonic transcription algorithm was implemented for this model, and was shown to be
competitive in accuracy with other transcription schemes on a large set of classical music.
In Chapter 8 we present a unified framework for musical signal processing applications which interact
with a score of the music. This framework defines the dynamics of a virtual score pointer, such that the
generative model of the signal in each frame depends on the properties of the score at the position indicated
by the score pointer. By treating the dynamics and the generative model together as a hidden Markov
model, we have a large and flexible class of inference algorithms available to estimate the movement of the
score pointer through the signal and to iteratively train and learn the parameters of the generative models
on past and currently observed data.
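As a minimal sketch of the score pointer idea, the following forward filter tracks a pointer through a three-note score. It assumes a left-to-right hidden Markov model in which the pointer either stays or advances by one event per frame; the advance probability, the toy observation likelihood and the function name are hypothetical and are not taken from Chapter 8.

```python
def forward_filter(obs_lik, p_advance=0.3):
    """Most probable score position per frame under the filtered posterior.

    obs_lik[t][s] is the likelihood of frame t given score position s.
    """
    n_states = len(obs_lik[0])
    alpha = [1.0] + [0.0] * (n_states - 1)  # pointer starts at the first event
    path = []
    for t, lik in enumerate(obs_lik):
        if t > 0:
            # Prediction step: stay in place or advance by one score event.
            pred = [0.0] * n_states
            for s in range(n_states):
                last = s == n_states - 1
                pred[s] += alpha[s] * (1.0 if last else 1.0 - p_advance)
                if not last:
                    pred[s + 1] += alpha[s] * p_advance
            alpha = pred
        # Update step: weight by the observation likelihood and normalize.
        alpha = [a * l for a, l in zip(alpha, lik)]
        total = sum(alpha)
        alpha = [a / total for a in alpha]
        path.append(max(range(n_states), key=lambda s: alpha[s]))
    return path


score = [60, 62, 64]                 # MIDI pitches of three score events
frames = [60, 60, 62, 62, 62, 64]    # hypothetical per-frame pitch estimates
obs_lik = [[0.8 if f == p else 0.1 for p in score] for f in frames]
path = forward_filter(obs_lik)
print(path)
```

Because only past and current frames are used, this filtering recursion suits online use; a smoothing or Viterbi pass over the same model would be the offline counterpart.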
9.2 Discussion
The use of Bayesian methods defines a clear separation between the model and the inference technique
used to infer unknown quantities about that model. The use of hierarchical priors allows different models
to be substituted in place of one another. For example, in Chapter 8 any of the generative signal models
developed in this thesis, or elsewhere, can be introduced without modifying the overall structure of the
inference algorithm, although the complexity and number of parameters of the signal model may prohibit
this. In Chapter 5 we were able to substitute new models developed in the chapter into the reversible jump
MCMC algorithm developed in existing work without modifying any of the parameters used in the inference.
This allowed us to objectively measure and demonstrate different levels of accuracy achieved with different
model configurations.
Through the use of analogies between models, we are able to apply the techniques developed by researchers
in different fields and contribute likewise. This is illustrated in Chapter 7 and prior work, where the technique
of non-negative matrix factorization (NMF) has applications in document classification [Xu et al., 2003],
face recognition [Guillamet and Vitrià, 2002] and chemometrics [Paatero, 1997] for example, as well as
polyphonic music transcription [Smaragdis and Brown, 2003]. One major advantage of this is that any
inference algorithm designed for the general statement of a problem can be used for all the applications that
have been transformed into this model. For popular models there typically exist multiple approaches which
differ in terms of accuracy, computation, memory, etc. In Chapter 7 we only considered inference techniques
which update the two matrices separately, a technique commonly known as multiplicative update rules [Lin,
2007a]. However, other authors have applied gradient descent techniques, such as Lin [2007b], to reduce the
number of iterations required to reach a local optimum. Another example of this is constructing a Bayesian
network using conjugate priors. Once the model is designed and specified, either a Gibbs sampler,
an MCMC technique, or Variational Bayes can be applied, using the same expressions for the conditional
distributions of the parameters.²
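To make the idea of multiplicative update rules concrete, here is a minimal sketch of the Lee and Seung [2000] updates for non-negative matrix factorization. It assumes the squared-error objective for brevity, which is not the cost function of the Gaussian variance model in Chapter 7, and the matrix sizes, random data and helper names are illustrative only.

```python
import random


def matmul(a, b):
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def transpose(a):
    return [list(col) for col in zip(*a)]


def sq_error(v, w, h):
    wh = matmul(w, h)
    return sum((x - y) ** 2 for rv, rwh in zip(v, wh) for x, y in zip(rv, rwh))


def nmf_step(v, w, h, eps=1e-12):
    """Update H with W fixed, then W with H fixed (squared-error objective)."""
    wt = transpose(w)
    num, den = matmul(wt, v), matmul(matmul(wt, w), h)
    h = [[hij * n / (d + eps) for hij, n, d in zip(rh, rn, rd)]
         for rh, rn, rd in zip(h, num, den)]
    ht = transpose(h)
    num, den = matmul(v, ht), matmul(w, matmul(h, ht))
    w = [[wij * n / (d + eps) for wij, n, d in zip(rw, rn, rd)]
         for rw, rn, rd in zip(w, num, den)]
    return w, h


rng = random.Random(0)
# Synthetic non-negative data of rank two (6 "frequency bins", 5 "frames").
v = matmul([[rng.random() for _ in range(2)] for _ in range(6)],
           [[rng.random() for _ in range(5)] for _ in range(2)])
w = [[rng.random() + 0.1 for _ in range(2)] for _ in range(6)]
h = [[rng.random() + 0.1 for _ in range(5)] for _ in range(2)]

errors = [sq_error(v, w, h)]
for _ in range(200):
    w, h = nmf_step(v, w, h)
    errors.append(sq_error(v, w, h))
print(errors[0], errors[-1])
```

Each factor is rescaled element-wise by a ratio of non-negative terms, so non-negativity is preserved without a projection step, and the error does not increase from one update to the next.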
Many of the models described make use of linearity in the parameters so that independent processes
can be superimposed and inferred from the observation. In music, this is a good approximation, as notes
on a musical instrument are mostly independent of any other notes played on that instrument and notes
played on other instruments. In Chapter 6 and Chapter 7 we therefore considered the model for a single
harmonic source first, and then showed how multiple sources could be superimposed with an additional
source modelling background noise. In Chapter 5 we take this further, and begin with the model for a single
partial frequency within a harmonic. These models may then be used for source separation, where individual
components of a signal are extracted and synthesized.
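The consequence of linear superposition for source separation can be sketched as follows. Assuming, as in a Gaussian variance model, that each source contributes an independent zero-mean Gaussian coefficient in every frequency bin, the posterior mean of a source is the observed coefficient scaled by that source's share of the total variance. The variances and observations below are made-up numbers, and the function name is hypothetical.

```python
def separate(x, variances):
    """Posterior mean of each source when x is a sum of independent
    zero-mean Gaussian sources with the given per-bin variances."""
    total = [sum(vs) for vs in zip(*variances)]
    return [[xi * vi / ti for xi, vi, ti in zip(x, v, total)] for v in variances]


# Observed complex coefficients in three frequency bins of one frame.
x = [complex(1.0, -2.0), complex(0.5, 0.5), complex(-3.0, 1.0)]
variances = [
    [4.0, 0.1, 1.0],  # hypothetical note 1
    [1.0, 0.1, 3.0],  # hypothetical note 2
    [0.2, 0.2, 0.2],  # background noise source
]
parts = separate(x, variances)
# The separated components add back up to the observation.
recon = [sum(col) for col in zip(*parts)]
print(parts[0])
```

Each extracted component could then be resynthesized on its own, subject to the phase limitation noted in footnote 1.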
9.3 Further Research
9.3.1 Improvements to the Gaussian Variance Model
Further work is needed to investigate and strengthen the relationship between the Bayesian harmonic model
developed in Chapter 5 and the Gaussian variance model of spectrogram coefficients in Chapter 7. The goal
of this work would be to increase the accuracy of the polyphonic transcription algorithm in Chapter 7 to that
of the frame-based multiple pitch detection algorithm that was developed in Chapter 6. Both algorithms
function by greedily adding notes to the transcription, so accuracy could be improved by focusing on the
following areas:
1. The choice of basis functions used for the Gaussian variance model. The implementation presented
in this thesis used the short-time Fourier transform to obtain the observed signal coefficients, thus
assuming that each component of the signal in a frame is a sinusoid with constant amplitude. Davy
and Godsill [2003] previously showed that a musical signal could be better modelled using Gabor basis
functions, which allow a slowly varying amplitude throughout the frame, compared to the model of
Walmsley et al. [1999], which has only one amplitude parameter per frame. We also showed in Chapter
5 that using the Hilbert transform and a sinc window Gabor basis improves the accuracy of modelling.
Although the frequencies of the basis functions are fixed in the Gaussian variance model, the number
of basis functions per frame and the shape of the Gabor windows could be modified in order to model
the signal better and obtain improvements in transcription accuracy. The basis frequencies could also
be chosen to match what would be expected in a harmonic musical signal rather than being spaced
equally on the frequency axis.
2. The number of template functions used to model each pitch. In the transcription algorithm only one
template function was used per pitch to keep computation at a minimum, although Section 7.4.1
indicates that multiple templates are preferred by Bayesian model selection.
3. Deriving the relationship between the priors in the two models. As a starting point, data generated by
the harmonic model could be used to train the template functions of the Gaussian variance model, so
that the model priors are equivalent. However, deriving even an approximate mathematical relationship
may provide more insight into the models, for example the number of templates that should be used.
² Generic Gibbs sampler implementations are provided by software frameworks such as BUGS (Bayesian Inference
Using Gibbs Sampling), available at www.openbugs.info/w/. A similar framework for Variational Bayes, VIBES
(Variational Inference for Bayesian Networks), is available at vibes.sourceforge.net.
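The suggestion in item 1 of choosing basis frequencies to match a harmonic signal can be sketched by comparing two frequency grids. The equal-tempered tuning (A4 = 440 Hz), the pitch range, the number of harmonics and the function names are assumptions made for this illustration; the thesis does not specify such a grid.

```python
def stft_frequencies(n_fft, sample_rate):
    """Equally spaced bin centre frequencies of an n_fft-point STFT."""
    return [k * sample_rate / n_fft for k in range(n_fft // 2 + 1)]


def harmonic_frequencies(midi_low, midi_high, n_harmonics, sample_rate):
    """Harmonics of equal-tempered pitches below the Nyquist frequency."""
    freqs = set()
    for m in range(midi_low, midi_high + 1):
        f0 = 440.0 * 2 ** ((m - 69) / 12)
        for h in range(1, n_harmonics + 1):
            if h * f0 < sample_rate / 2:
                # Rounding merges harmonics shared exactly between pitches.
                freqs.add(round(h * f0, 3))
    return sorted(freqs)


stft = stft_frequencies(1024, 44100.0)
harm = harmonic_frequencies(57, 69, 5, 44100.0)  # A3 to A4, five harmonics
print(len(stft), len(harm))
```

The harmonic grid places basis functions only where partials are expected, so it is far smaller than the equally spaced grid and denser at low frequencies, where neighbouring pitches lie closest together.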
9.3.2 Frame Boundaries
The algorithms in this thesis model continuity between adjacent frames only at a high level, by modelling
the transition probabilities of note pitches and volumes across frame boundaries. The generative model
for each frame is not directly dependent on the signal in the previous frame. This allows for potential
phase discontinuities at frame boundaries, the results of which are unpleasant artifacts when a signal is
reconstructed. Also, the frames of audio often overlap by 50% of the samples, in which case
the frames are more strongly dependent on one another than if there were no overlap.
To improve the modelling of frame boundaries, we need to account for the fact that a basis function
in the model can contribute to two adjacent frames of audio. In the case of 50% overlap, every basis
function contributes to two frames, whereas when there is no overlap, only the basis functions whose region
of support extends beyond the end of one frame also contribute to the signal in the next frame. One
method of treating shared basis functions in an iterative multiple frame processing algorithm is to fix the
parameters and amplitudes of the basis functions in one frame to the values found when they were inferred
in the adjacent frame, and subsequently to alternate in which frames the parameters are fixed and in which
frames they are inferred. This is straightforward to implement for the generative models in this thesis, as
fixing the basis functions is equivalent to subtracting their contribution from the observed signal. However,
further work is required to determine whether this inference approach converges properly, or whether the
contribution of shared basis functions to multiple frames should be inferred jointly.
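The equivalence between fixing a shared basis function and subtracting its contribution can be shown on a toy signal. In this sketch one Hann-windowed cosine straddles the boundary between two non-overlapping frames; its amplitude is inferred in the first frame, held fixed, and subtracted before the second frame is processed. The window shape, frequencies, amplitudes and helper names are assumptions for illustration, and a least-squares fit stands in for the Bayesian inference of the thesis.

```python
import math


def atom(freq, start, length, sample_rate=8000.0):
    """Hann-windowed cosine supported on samples [start, start + length)."""
    return {start + n: 0.5 * (1.0 - math.cos(2.0 * math.pi * n / length))
            * math.cos(2.0 * math.pi * freq * n / sample_rate)
            for n in range(length)}


def fit_amplitude(signal, g, frame):
    """Least-squares amplitude of atom g using only the samples in the frame."""
    num = sum(signal[n] * g.get(n, 0.0) for n in frame)
    den = sum(g.get(n, 0.0) ** 2 for n in frame)
    return num / den


frame_a, frame_b = range(0, 128), range(128, 256)
shared = atom(440.0, 64, 128)   # straddles the frame boundary at sample 128
local = atom(660.0, 128, 128)   # lies entirely within frame B
signal = [0.8 * shared.get(n, 0.0) + 0.5 * local.get(n, 0.0) for n in range(256)]

# Infer the shared atom in frame A, then fix it while processing frame B.
a = fit_amplitude(signal, shared, frame_a)
residual = [signal[n] - a * shared.get(n, 0.0) for n in range(256)]
b = fit_amplitude(residual, local, frame_b)
print(round(a, 6), round(b, 6))
```

In this noise-free toy a single pass recovers both amplitudes; with noise or several shared atoms the alternation would need to be iterated, which is where the convergence question raised above arises.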
9.3.3 Note Envelopes
We have mostly paid attention to the harmonic spectral content of musical notes, which was used solely as a
method of multiple pitch detection in Chapter 6. In Section 5.2.4 we modelled a note as having a constant damping
ratio over its length, following from the model of Cemgil et al. [2006], and in Section 5.2.5 we allowed for regular
amplitude modulations. In Chapter 8 the excitation vector component of the spectrogram variances was
used to model in a general way the amplitude envelope of a note. However, as discussed in Section 3.3.1, the note
envelope may be divided into attack, decay, sustain and release stages (ADSR), each potentially with a
different spectral profile, and other authors such as Orio and Déchelle [2001] have modelled each of these
stages as additional states in a hidden Markov model. In Section 8.2 we showed the value in modelling note
onsets explicitly as an additional state. Adding further states to represent decay and release stages is
straightforward for both the transcription model in Section 7.5 and the hidden Markov models in Chapter 8.
From a modelling perspective there are still aspects of note envelopes that need further investigation.
The onset of a note is very characteristic of the sound of a particular instrument, and hence could be used
for instrument identification, an application we have not investigated using the models in this thesis. The
onset of a note may also have a percussive component in addition to any harmonic content, which requires an
additional component in the harmonic signal model of a musical note, using for example an autoregressive
model for the noise process.
A full generative model of a note envelope for different instruments would take into account the relationships
between the different stages, including the relative volumes and damping ratios, and the time for each
stage.
9.3.4 High Level Score Priors
Score priors represent the highest level of hierarchy in this thesis. They are applied as a prior probability
of a pitch being present in a frame, and the probability of a pitch transition from one frame to the next.
Having a suitably powerful and realistic prior, which models chords, melodic lines, bass lines, etc., promises to
eliminate many transcription errors and increase the accuracy to the level of a human transcriber.
A generative model of a score is ideal, as it can be used as a computer music composition system.
Automated music composition using generative models is a popular area of music research. Although many
of these models are too intricate for general use as they are designed to emulate a particular genre or style, the
basic elements of these models such as chord progressions and the placements of notes around beat positions
and divisions of the beat, should be suitable for many applications. One interesting observation is that
from a Bayesian perspective, score models with fewer parameters will lead to more consonant and regular
sounding music, as notes with multiple shared harmonics often occur together in chords, and regular timings
of note onsets lead to a strong perception of beat and tempo, drawing a parallel with Occam's razor. As
music has developed over centuries, harmonic and temporal structures have increased in complexity, so the
simplest model may not be appropriate for later genres.
The major challenge with using high level score priors is managing the additional parameters introduced
and designing efficient and practical inference algorithms for these models.
Bibliography
S.A. Abdallah and M.D. Plumbley. Polyphonic music transcription by non-negative sparse coding of power
spectra. In International Conference on Music Information Retrieval, 2004.
R. P. Adams, I. Murray, and D. J. C. MacKay. Tractable nonparametric Bayesian inference in Poisson
processes with Gaussian process intensities. In Proceedings of the 26th Annual International Conference
on Machine Learning, 2009.
L. Ahlzen and C. Song. The Sound Blaster Live! Book. No Starch Press, 2003.
C. Andrieu and A. Doucet. Joint Bayesian model selection and estimation of noisy sinusoids via reversible
jump MCMC. IEEE Transactions on Signal Processing, 47:2667–2676, 1999.
I. Arroabarren, M. Zivanovic, X. Rodet, and A. Carlosena. Instantaneous frequency and amplitude of vibrato
in singing voice. In IEEE International Conference on Acoustics, Speech and Signal Processing, 2003.
M. Barthet, P. Guillemain, R. Kronland-Martinet, and S. Ystad. On the relative influence of even and odd
harmonics in clarinet timbre. In Proc. Int. Comp. Music Conf (ICMC 2005), Barcelona, Spain, pages
351354, 2005.
L. Benaroya, R. Gribonval, and F. Bimbot. Non negative sparse representation for Wiener based source
separation with a single sensor. 6:613–616, 2003.
N. Bertin, R. Badeau, and E. Vincent. Enforcing harmonicity and smoothness in Bayesian non-negative
matrix factorization applied to polyphonic music transcription. IEEE Transactions on Audio, Speech and
Language Processing, 18:538549, 2009a.
N. Bertin, C. Fevotte, and R. Badeau. A tempering approach for Itakura-Saito non-negative matrix
factorization. With application to music transcription. In International Conference on Acoustics, Speech and
Signal Processing, 2009b.
C. M. Bishop. Pattern recognition and machine learning. Springer, 2006.
A. S. Bregman. Auditory Scene Analysis: The Perceptual Organization of Sound. MIT Press, 1990.
G. L. Bretthorst. Bayesian Spectrum Analysis and Parameter Estimation. Springer-Verlag, 1989.
J. C. Brown. Calculation of a constant Q spectral transform. Journal of the Acoustical Society of America,
89:425434, 1991.
J. C. Brown and K. V. Vaughn. Pitch center of stringed instrument vibrato tones. Journal of the Acoustical
Society of America, 100(3):1728–1735, 1996.
C. J. C. Burges. A tutorial on support vector machines for pattern recognition. Data Mining and Knowledge
Discovery, 2:121167, 1998.
C. Cannam, C. Landone, M. Sandler, and J.P. Bello. The sonic visualiser: A visualisation platform for
semantic descriptors from musical signals. In ISMIR 2006 7th International Conference on Music Information
Retrieval Proceedings, 2006.
P. Cano, A. Loscos, and J. Bonada. Score-performance matching using HMMs. In International Computer
Music Conference, 1999.
A. Cemgil and O. Dikmen. Conjugate gamma Markov random fields for modelling nonstationary sources.
Independent Component Analysis and Signal Separation, pages 697–705, 2007.
A. T. Cemgil. Bayesian inference in non-negative matrix factorisation models. Technical report, University
of Cambridge, 2008.
A. T. Cemgil and S. J. Godsill. Probabilistic phase vocoder and its application to interpolation of missing
values in audio signals. In 13th European Signal Processing Conference, 2005.
A. T. Cemgil and B. Kappen. Monte Carlo methods for tempo tracking and rhythm quantization. Journal
of Artificial Intelligence Research, 18:45–81, 2003.
A. T. Cemgil, H. J. Kappen, and D. Barber. A generative model for music transcription. IEEE Transactions
on Audio, Speech and Language Processing, 14:679694, 2006.
A. T. Cemgil, C. Févotte, and S. J. Godsill. Variational and stochastic inference for Bayesian source
separation. Digital Signal Processing, 17:891–913, 2007.
S. Chib. Marginal likelihood from the Gibbs output. Journal of the American Statistical Association, 90:
75108, 1995.
S. Chib and I. Jeliazkov. Marginal likelihood from the Metropolis-Hastings output. Journal of the American
Statistical Association, 96:270281, 2001.
L. Cohen, P. Loughlin, and D. Vakman. On an ambiguity in the definition of the amplitude and phase of a
signal. Signal Processing, 79:301307, 1999.
A. Cont. A coupled duration-focused architecture for realtime music to score alignment. IEEE Transactions
on Pattern Analysis and Machine Intelligence, 2009.
A. Cont, D. Schwarz, N. Schnell, and C. Raphael. Evaluation of real-time audio-to-score alignment. In
International Conference on Music Information Retrieval., 2007.
Nicholas Cook. Reactions to the Record: Perspectives on Historical Performance, chapter Objective
expression: phrase arching in recordings of Chopin's Mazurkas.
D. R. Cox and V. Isham. Point processes. Chapman & Hall, 1980.
M. J. Crowder, A. C. Kimber, R. L. Smith, and T. J. Sweeting. Statistical analysis of reliability data.
Chapman & Hall, 1991.
L. Daudet and M. Sandler. MDCT analysis of sinusoids: exact results and applications to coding artifacts
reduction. IEEE Transactions on Speech and Audio Processing, 12:302312, 2004.
M. Davy and S. J. Godsill. Bayesian harmonic models for musical signal analysis. In Bayesian Statistics.
Oxford University Press, 2003.
M. Davy, S. J. Godsill, and J. Idier. Bayesian analysis of polyphonic western tonal music. Journal of the
Acoustical Society of America, 119:24982517, 2006.
N. Degara, A. Pena, M. E. P. Davies, and M. D. Plumbley. Note onset detection using rhythmic structure.
In International Conference on Acoustics, Speech and Signal Processing, 2010.
J. Devaney, M. I. Mandel, and D. P. W. Ellis. Improving MIDI-audio alignment with acoustic features. In
IEEE Workshop on Applications of Signal Processing to Audio and Acoustics, 2009.
I. Dhillon and S. Sra. Generalized nonnegative matrix approximations with Bregman divergences. Advances
in neural information processing systems, 18:283, 2006. ISSN 1049-5258.
O. Dikmen and A. T. Cemgil. Gamma Markov random fields for audio source modeling. IEEE Transactions
on Audio, Speech and Language Processing, 18:589601, 2010.
B. N. Dimitrov, V. V. Rykov, and Z. L. Krougly. Periodic Poisson processes and almost-lack-of-memory
distributions. Automation and Remote Control, 65:1597–1610, 2004.
S. Dixon and G. Widmer. Match: A music alignment tool chest. In Proceedings of the International
Conference of Music Information Retrieval, 2005.
P.M. Djuric. Simultaneous detection and frequency estimation of sinusoidal signals. In IEEE International
Conference on Acoustics, Speech, and Signal Processing, 1993.
P.M. Djuric. A model selection rule for sinusoids in white Gaussian noise. IEEE Transactions on Signal
Processing, 44:17441751, 1996.
Charles Dodge and Thomas A. Jerse. Computer Music. Schirmer Books, 1997.
A. Doucet, N. De Freitas, and N. Gordon. Sequential Monte Carlo methods in practice. Springer Verlag,
2001.
J. S. Downie. The Music Information Retrieval Evaluation Exchange (2005-2007): A window into music
information retrieval research. Acoustical Science and Technology, 29:247–255, 2008.
D. Eck, P. Lamere, T. Bertin-Mahieux, and S. Green. Automatic generation of social tags for music
recommendation. Advances in neural information processing systems, 20:385–392, 2007.
C. Févotte and A.T. Cemgil. Nonnegative matrix factorizations as probabilistic inference in composite
models. In Proc. 17th European Signal Processing Conf.(EUSIPCO), Glasgow, Scotland, 2009.
C. Févotte, N. Bertin, and J.-L. Durrieu. Nonnegative matrix factorization with the Itakura-Saito divergence.
With application to music analysis. Neural Computation, 21(3):793–830, March 2009.
J. L. Flanagan, D. I. S. Meinhart, R. M. Golden, and M. M. Sondhi. Phase vocoder. The Journal of the
Acoustical Society of America, 38(5):939–940, 1965.
H. Fletcher and W. A. Munson. Loudness, its definition, measurement and calculation. Journal of the
Acoustical Society of America, 5:82108, 1933.
N. H. Fletcher and T. D. Rossing. The physics of musical instruments. Springer, 1998.
C. Févotte. Itakura-Saito nonnegative factorizations of the power spectrogram for music signal decomposition,
chapter 11. IGI Global Press, 2010.
D. Gabor. Theory of communication. IEE Journal on Communications Engineering, 93:429–457, 1946.
S. A. Gelfand. Hearing: An Introduction to Psychological and Physiological Acoustics. Informa HealthCare,
2004.
A. Gelman. Bayesian data analysis. CRC press, 2004.
J. M. Geringer and M. L. Allen. An analysis of vibrato among high school and university violin and cello
students. Journal of Research in Music Education, 52:167–179, 2004.
C. J. Geyer. Reweighting Monte Carlo mixtures. Journal of the American Statistical Association, 1991.
Z. Ghahramani and M. J. Beal. Propagation algorithms for variational Bayesian learning. Advances in
Neural Information Processing Systems, pages 507513, 2001.
W. R. Gilks and D. J. Spiegelhalter. Markov chain Monte Carlo in practice. Chapman & Hall/CRC, 1996.
S. Godsill. The shifted inverse-gamma model for noise-floor estimation in archived audio recordings. Signal
Processing, 2009.
S. Godsill and M. Davy. Bayesian harmonic models for musical pitch estimation and analysis. In IEEE
International Conference on Acoustics Speech and Signal Processing, volume 2, 2002.
S. J. Godsill and M. Davy. Bayesian computational models for inharmonicity in musical instruments. In
IEEE Workshop on Applications of Signal Processing to Audio and Acoustics, 2005.
S. J. Godsill, A. T. Cemgil, C. Fevotte, and P. J. Wolfe. Bayesian computational methods for sparse audio
and music processing. In 15th European Signal Processing Conference, 2007.
S.J. Godsill. Bayesian enhancement of speech and audio signals which can be modelled as ARMA processes.
International Statistical Review/Revue Internationale de Statistique, 65(1):1–21, 1997.
M. Goto. Development of the RWC music database. In Proceedings of the 18th International Congress on
Acoustics, volume 1, pages 553556, 2004.
M. Goto, H. Hashiguchi, T. Nishimura, and R. Oka. RWC music database: Music genre database and
musical instrument sound database. In International Symposium on Music Information Retrieval, pages
229230, 2003.
P. J. Green. Reversible jump Markov chain Monte Carlo computation and Bayesian model determination.
Biometrika, 82(4):711, 1995.
D. D. Greenwood. Critical bandwidth and the frequency coordinates of the basilar membrane. The Journal
of the Acoustical Society of America, 33:1344, 1961.
D. Guillamet and J. Vitrià. Non-negative matrix factorization for face recognition. Topics in Artificial
Intelligence, pages 336344, 2002.
W. M. Hartmann. Pitch, periodicity, and auditory organization. The Journal of the Acoustical Society of
America, 100:3491, 1996.
N. Hu, R. Dannenberg, and G. Tzanetakis. Polyphonic audio matching and alignment for music retrieval.
In IEEE Workshop on Applications of Signal Processing to Audio and Acoustics, 2003.
F. Itakura and S. Saito. Analysis synthesis telephony based on the maximum likelihood method. In
Proceedings of the 6th International Congress on Acoustics, 1968.
J. S. R. Jang, H. R. Lee, and C. H. Yeh. Query by tapping: A new paradigm for content-based music
retrieval from acoustic input. In IEEE Pacific Rim Conference on Multimedia, 2001.
H. Jeffreys. An invariant form for the prior probability in estimation problems. Proceedings of the Royal
Society of London. Series A, Mathematical and Physical Sciences, pages 453–461, 1946.
M. Karjalainen and U. K. Laine. A model for real-time sound synthesis of guitar on a floating-point signal
processor. In International Conference on Acoustics, Speech, and Signal Processing, 1991.
R. E. Kass and A. E. Raftery. Bayes factors. Journal of the American Statistical Association, 90(430), 1995.
A. Klapuri. Signal Processing Methods for Music Transcription, chapter Auditory-Model Based Methods for
Multiple F0 Estimation, pages 229–265. Springer, 2006.
A. Klapuri. Multipitch analysis of polyphonic music and speech signals using an auditory model. IEEE
Transactions on Audio, Speech, and Language Processing, 16:255266, 2008.
A. P. Klapuri. Multiple fundamental frequency estimation based on harmonicity and spectral smoothness.
IEEE Transactions on Speech and Audio Processing, 11(6):804–816, 2003.
A. P. Klapuri, A. J. Eronen, and J. T. Astola. Analysis of the meter of acoustic musical signals. IEEE
Transactions on Audio, Speech, and Language Processing, 14:342355, 2006.
A.P. Klapuri. Automatic music transcription as we know it today. Journal of New Music Research, 33(3):
269282, 2004.
H. Lantéri, C. Theys, C. Richard, and C. Févotte. Split gradient method for nonnegative matrix factorization.
In European Signal Processing Conference, Aalborg, Denmark, Aug. 2010.
D. D. Lee and H. S. Seung. Algorithms for non-negative matrix factorization. In Advances in Neural
Information Processing, 2000.
C.J. Lin. On the convergence of multiplicative update algorithms for nonnegative matrix factorization.
Neural Networks, IEEE Transactions on, 18(6):1589–1596, 2007a. ISSN 1045-9227.
C.J. Lin. Projected gradient methods for nonnegative matrix factorization. Neural Computation, 19(10):
27562779, 2007b. ISSN 0899-7667.
J. S. Liu. Monte Carlo strategies in scientific computing. Springer, 2003.
R. B. MacLeod. Influences of dynamic level and pitch register on the vibrato rates and widths of violin and
viola players. Journal of Research in Music Education, 56:43–54, 2008.
R. C. Maher and J. W. Beauchamp. Fundamental frequency estimation of musical signals using a two-way
mismatch procedure. Journal of the Acoustical Society of America, 4:2254–2263, 1995.
S. Mallat. A wavelet tour of signal processing. Academic press, 1999.
S. G. Mallat and Z. Zhang. Matching pursuits with time-frequency dictionaries. IEEE Transactions on
Signal Processing, 41:33973415, 1993.
M. Marolt. A connectionist approach to automatic transcription of polyphonic piano music. IEEE Trans.
Multimedia, 6(3):439–449, Jun. 2004.
R. Martin. Noise power spectral density estimation based on optimal smoothing and minimum statistics.
IEEE Transactions on Speech and Audio Processing, 9, 2001.
R. Meddis and L. O'Mard. A unitary model of pitch perception. The Journal of the Acoustical Society of
America, 102:1811, 1997.
W. Q. Meeker and L. A. Escobar. Statistical methods for reliability data. Wiley, 1998.
J. A. Moorer. On the transcription of musical sound by computer. Journal of Computer Music, pages 32–38,
1977.
F. D. Neeser and J. L. Massey. Proper complex random processes with applications to information theory.
IEEE Transactions on Information Theory, 39(4):1293–1302, 1993.
B. Niedermayer. Towards audio to score alignment in the symbolic domain. In Sound and Music Computing
Conference, 2009.
N. Orio and F. Déchelle. Score following using spectral analysis and hidden Markov models. In Proceedings
of the ICMC, pages 151–154, 2001.
N. Orio and D. Schwarz. Alignment of monophonic and polyphonic music to a score. In International
Computer Music Conference, 2001.
A. Ozerov, C. Févotte, and M. Charbit. Factorial scaled hidden Markov model for polyphonic audio
representation and source separation. In IEEE Workshop on Applications of Signal Processing to Audio and
Acoustics, Mohonk, NY, USA, Oct. 2009.
P. Paatero. Least squares formulation of robust nonnegative factor analysis. Chemometrics and Intelligent
Laboratory Systems, 37:2235, 1997.
H. Papadopoulos and G. Peeters. Large-scale study of chord estimation algorithms based on chroma
representation and HMM. In Proceedings of the International Workshop on Content-Based Multimedia Indexing
(CBMI), pages 53–60, 2007.
R. D. Patterson, K. Robinson, J. Holdsworth, D. McKeown, C. Zhang, and M. Allerhand. Complex sounds
and auditory images. Auditory physiology and perception, 83:429–446, 1992.
J. Paulus and T. Virtanen. Drum transcription with non-negative spectrogram factorisation. In European
Signal Processing Conference, 2006.
P. H. Peeling, A. T. Cemgil, and S. J. Godsill. A probabilistic framework for matching music representations.
In International Conference on Music Information Retrieval, 2007a.
P. H. Peeling, C. F. Li, and S. J. Godsill. Poisson point process modeling for polyphonic music transcription.
Journal of the Acoustical Society of America, 121:EL168–EL175, 2007b.
P.H. Peeling, A.T. Cemgil, and S.J. Godsill. Generative spectrogram factorization models for polyphonic
piano transcription. IEEE Transactions on Audio, Speech, and Language Processing, 18:519–527, 2010.
ISSN 1558-7916.
G. Peeters. Chroma-based estimation of musical key from audio-signal analysis. In Proc. of the 7th
International Conference on Music Information Retrieval (ISMIR), pages 115–120. Citeseer, 2006.
A. Pertusa and J. M. Inesta. Multiple fundamental frequency estimation using Gaussian smoothness. In
IEEE International Conference on Acoustics, Speech and Signal Processing, 2008.
B. Picinbono and P. Bondon. Second-order statistics of complex signals. IEEE Transactions on Signal
Processing, 45(2):411420, 1997.
C. J. Plack, A. J. Oxenham, and R. R. Fay, editors. Pitch: Neural Coding and Perception. Springer, 2005.
M. D. Plumbley, S. A. Abdallah, J. P. Bello, M. E. Davies, G. Monti, and M. B. Sandler. Automatic music
transcription and audio source separation. Cybernetics and Systems, 33(6):603627, 2002.
G. E. Poliner and D. P. W. Ellis. A discriminative model for polyphonic piano transcription. EURASIP
Journal on Advances in Signal Processing, 2007.
E. Prame. Measurements of the vibrato rate of ten singers. Journal of the Acoustical Society of America, 96
(4):19791984, 1994.
164
J. P. Princen, A. W. Johnson, and A. B. Bradley. Subband/transform coding using filter bank designs based
on time domain aliasing cancellation. In IEEE International Conference on Acoustics, Speech, and Signal
Processing, 1987.
M. Puckette, T. Apel, and D. Zicarelli. Real-time audio analysis tools for Pd and MSP. In International
Computer Music Conference, 1998.
L. R. Rabiner. A tutorial on hidden Markov models and selected applications in speech recognition. Pro-
ceedings of the IEEE, 77:257286, 1989.
L. R. Rabiner and B. Juang. Fundamentals of speech recognition. Prentice-Hall, 1993.
C. Raphael. Automated rhythm transcription. In International Symposium on Music Information Retrieval,
2001.
C. Raphael. A hybrid graphical model for aligning polyphonic audio with musical scores. In International
Conference on Musical Information Retrieval, 2004.
C. Raphael. Aligning music audio with symbolic scores using a hybrid graphical model. Machine Learning,
65:389409, 2006.
C. Raphael. A classifier-based approach to score-guided source separation of musical audio. Computer Music
Journal, 32(1):5159, 2008.
S. Richardson and P. J. Green. On Bayesian analysis of mixtures with unknown number of components.
Journal of the Royal Statistical Society: Series B, 4:731792, 1997.
C. P. Robert and G. Casella. Monte Carlo statistical methods. Springer Verlag, 2004.
A. Robertson and M. D. Plumbley. Post-processing fiddle: A real-time multi-pitch tracking technique using
harmonic partial subtraction for use within live performance systems. In International Computer Music
Conference, 2009.
M. P. Ryynänen and A. P. Klapuri. Modelling of note events for singing transcription. In ISCA Tutorial
and Research Workshop (ITRW) on Statistical and Perceptual Audio Processing. Citeseer, 2004.
M. P. Ryynänen and A. P. Klapuri. Polyphonic music transcription using note event modeling. In IEEE
Workshop on Applications of Signal Processing to Audio and Acoustics, 2005.
E. D. Scheirer. Readings in Computational Auditory Scene Analysis, chapter Using musical knowledge to
extract expressive performance information from audio recordings, pages 361380. L. Erlbaum Associates
Inc., 1998.
E. D. Scheirer. Music-listening systems. PhD thesis, Massachusetts Institute of Technology, 2000.
M. N. Schmidt and H. Laurberg. Nonnegative matrix factorization with Gaussian process priors. Computa-
tional intelligence and neuroscience, 2008.
D. Schwarz, A. Cont, and N. Schnell. From Boulez to ballads: Training IRCAM's score follower. In
Proceedings of International Computer Music Conference, 2005.
165
C. Seashore. Objective analysis of musical performance. McGraw-Hill, 1936.
X. Serra. Musical sound modeling with sinusoids plus noise. Musical signal processing, pages 497510, 1997.
C. E. Shannon. Communication in the presence of noise. Proceedings of the IEEE, 86(2):447457, 1998.
P. Smaragdis and J.C. Brown. Non-negative matrix factorization for polyphonic music transcription. In
IEEE Workshop on Applications of Signal Processing to Audio and Acoustics, pages 177180, 2003.
A. Stark and M. D. Plumbley. Tracking a performance without a score. In International Conference on
Acoustics, Speech and Signal Processing, 2010.
S. S. Stevens, J. Volkmann, and E. B. Newman. A scale for the measurement of the psychological magnitude
pitch. The Journal of the Acoustical Society of America, 8:185, 1937.
M. E. Tipping and C. M. Bishop. Probabilistic principal component analysis. Journal of the Royal Statistical
Society. Series B, Statistical Methodology, pages 611622, 1999.
P. M. Todd and D. G. Loy. Music and connectionism. The MIT Press, 1991.
T. Tolonen and M. Karjalainen. A computationally efficient multipitch analysis model. IEEE Transactions
on Speech and Audio Processing, 8(6):708716, 2000.
R. J. Turetsky and D. P. W. Ellis. Ground-truth transcriptions of real music from force-aligned MIDI
syntheses. In International Conference on Music Information Retrieval, 2003.
R. Typke and A. Walczak-Typke. A tunneling-vantage indexing method for non-metrics. In International
Conference on Music Information Retrieval, 2008.
B. L. Vercoe, W. G. Gardner, and E. D. Scheirer. Structured audio: Creation, transmission, and rendering
of parametric sound representations. Proceedings of the IEEE, 86(5):922940, 1998.
E. Vincent, N. Bertin, and R. Badeau. Harmonic and inharmonic nonnegative matrix factorization for poly-
phonic pitch transcription. In IEEE Internation Conference on Acoustics, Speech and Signal Processing,
2008.
E. Vincent, N. Bertin, and R. Badeau. Adaptive harmonic spectral decomposition for multiple pitch es-
timation. Audio, Speech, and Language Processing, IEEE Transactions on, 18(3):528537, 2010. ISSN
1558-7916.
T. Virtanen. Monaural sound source separation by nonnegative matrix factorization with temporal continuity
and sparseness criteria. IEEE Transactions on Audio, Speech, and Language Processing, 15(3):10661074,
2007.
T. Virtanen, A. T. Cemgil, and S. J. Godsill. Bayesian extensions to non-negative matrix factorisation for
audio signal modelling. In International Conference on Acoustics, Speech and Signal Processing, 2008.
G. H. Wakefield. Mathematical representation of joint time-chroma distributions. In Proceedings of SPIE,
volume 3807, page 637, 1999.
166
P. J. Walmsley, S. J. Godsill, and P. J. W. Rayner. Polyphonic pitch tracking using joint Bayesian estimation
of multiple frame parameters. In IEEE Workshop on Applications of Signal Processing to Audio and
Acoustics, pages 119122. Citeseer, 1999.
B. Wang and M.D. Plumbley. Musical audio stream separation by non-negative matrix factorization. In
DMRN Summer Conference, 2005.
N. P. Whiteley, A. T. Cemgil, and S. J. Godsill. Bayesian modelling of temporal structure in musical audio.
In International Conference on Music Information Retrieval, 2006.
P. J. Wolfe, M. Dorfler, and S. J. Godsill. Bayesian modelling of time-frequency coefficients for audio signal
enhancement. Advances in Neural Information Processing Systems, 2003.
P. J. Wolfe, S. J. Godsill, and W. J. Ng. Bayesian variable selection and regularization for time-frequency
surface estimation. Journal of the Royal Statistical Society Series B, 66:575589, 2004.
W. Xu, X. Liu, and Y. Gong. Document clustering based on non-negative matrix factorization. In Proceedings
of the 26th annual international ACM SIGIR Conference on Research and Development in Informaion
Retrieval, pages 267273. ACM, 2003. ISBN 1581136463.
C. Yeh, A. Röbel, and X. Rodet. Multiple fundamental frequency estimation of polyphonic music signals.
In IEEE International Conference on Acoustics, Speech and Signal Processing, 2005.
A. Zellner. On assessing prior distributions and Bayesian regression analysis with g-prior distributions.
Bayesian Inference and Decision Techniques: Essays in Honor of Bruno de Finetti, 6:233243, 1986.
F. Zhang, G. Bi, and Y. Q. Chen. Harmonic transform. IEE Proceedings-Vision, Image and Signal Processing,
151(4):257263, 2004.
E. Zwicker. Subdivision of the audible frequency range into critical bands (Frequenzgruppen). Acoustical
Society of America Journal, 33:248, 1961.
167
Appendix A
Probability Distributions
A.1 Normal Distribution
The normal distribution $\mathcal{N}(x; \mu, \sigma^2)$, for real-valued $x$ with mean $\mu$ and variance $\sigma^2$, has probability density
\[
p(x) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left( -\frac{1}{2} \frac{(x - \mu)^2}{\sigma^2} \right)
\]
$\mu$ is a location parameter for the normal distribution, and $\sigma^2$ is a squared scale parameter. Sufficient statistics for a set of $n$ normally distributed observations $x_i$ are given by $\sum_i x_i$ and $\sum_i x_i^2$. Maximum likelihood estimates of the parameters are given by $\hat{\mu} = \frac{1}{n}\sum_i x_i$ and $\hat{\sigma}^2 = \frac{1}{n}\sum_i x_i^2 - \hat{\mu}^2$. The entropy of the normal distribution is $\frac{1}{2}\ln\left(2\pi e \sigma^2\right)$.
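The normal distribution is the conjugate prior distribution for the mean parameter of a normal distribution. If $\mu \sim \mathcal{N}(\mu_0, \sigma_0^2)$ and the variance $\sigma^2$ is known, then the posterior distribution of $\mu$ given $n$ observations $x_i$ is a normal distribution with mean
\[
\left( \frac{1}{\sigma_0^2} + \frac{n}{\sigma^2} \right)^{-1} \left( \frac{\mu_0}{\sigma_0^2} + \frac{\sum_i x_i}{\sigma^2} \right)
\]
and variance
\[
\left( \frac{1}{\sigma_0^2} + \frac{n}{\sigma^2} \right)^{-1}
\]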
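The conjugate update for the mean can be sketched in a few lines of Python. This is a minimal illustration rather than code from the thesis; the function name `normal_mean_posterior` and its argument names are assumptions introduced here.

```python
def normal_mean_posterior(xs, mu0, var0, var):
    """Posterior mean and variance of mu under a N(mu0, var0) prior,
    given observations xs with known observation variance var.
    (Hypothetical helper; names are illustrative only.)"""
    n = len(xs)
    precision = 1.0 / var0 + n / var          # posterior precision
    post_var = 1.0 / precision                # posterior variance
    post_mean = post_var * (mu0 / var0 + sum(xs) / var)
    return post_mean, post_var


if __name__ == "__main__":
    mean, variance = normal_mean_posterior([1.0, 2.0, 3.0], mu0=0.0, var0=1.0, var=1.0)
    print(mean, variance)
```

The posterior precision is the sum of the prior precision and $n$ times the observation precision, so the posterior mean is a precision-weighted average of $\mu_0$ and the data.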
If $x$ is a complex number where the real and imaginary components are independently normally distributed with zero mean and variance $\sigma^2$, then
\[
p(x) = \frac{1}{2\pi\sigma^2} \exp\left( -\frac{1}{2} \frac{|x|^2}{\sigma^2} \right)
\]
A.2 Gamma Distribution
The Gamma distribution $\mathcal{G}(r; \alpha, \beta)$ for $r > 0$, where $\alpha > 0$ is the shape parameter and $\beta > 0$ is a scale parameter, has probability density
\[
p(r) = r^{\alpha - 1} \frac{\exp(-r/\beta)}{\beta^{\alpha}\, \Gamma(\alpha)}
\]
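Sufficient statistics for a set of $N$ Gamma distributed observations $r_i$ are $\sum_i r_i$ and $\sum_i \log r_i$. The parameters of the Gamma distribution are related to the sufficient statistics by
\[
\frac{1}{N} \sum_i \log r_i = \psi(\alpha) + \log \beta
\]
\[
\frac{1}{N} \sum_i r_i = \alpha\beta
\]
where $\psi$ is the digamma function. The sign of $\log\beta$ is positive because $\beta$ is a scale parameter.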
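The two relations can be solved numerically for the parameters. The sketch below assumes the scale parameterisation of the density in A.2, for which $\frac{1}{N}\sum_i \log r_i = \psi(\alpha) + \log\beta$ and $\frac{1}{N}\sum_i r_i = \alpha\beta$; eliminating $\beta$ leaves the single equation $\log\alpha - \psi(\alpha) = \log\bar{r} - \overline{\log r}$, which is solved here by bisection. The helper names `digamma` and `fit_gamma` are illustrative, and the digamma routine is a simple series approximation rather than a library call.

```python
import math


def digamma(x):
    """Approximate psi(x) for x > 0 (recurrence plus asymptotic series)."""
    result = 0.0
    while x < 6.0:                      # psi(x) = psi(x + 1) - 1/x
        result -= 1.0 / x
        x += 1.0
    inv2 = 1.0 / (x * x)
    # Asymptotic expansion of psi, adequate once x >= 6.
    result += math.log(x) - 0.5 / x
    result -= inv2 * (1.0 / 12 - inv2 * (1.0 / 120 - inv2 / 252))
    return result


def fit_gamma(rs):
    """Return (alpha, beta) matching the sufficient statistics of rs,
    with beta a scale parameter. (Hypothetical helper.)"""
    n = len(rs)
    mean = sum(rs) / n
    mean_log = sum(math.log(r) for r in rs) / n
    s = math.log(mean) - mean_log       # non-negative by Jensen's inequality
    if s <= 0.0:
        raise ValueError("observations must not all be equal")
    lo, hi = 1e-6, 1e6
    for _ in range(200):                # bisection on a logarithmic scale
        mid = math.sqrt(lo * hi)
        if math.log(mid) - digamma(mid) > s:
            lo = mid                    # left side decreases with alpha
        else:
            hi = mid
    alpha = math.sqrt(lo * hi)
    return alpha, mean / alpha


if __name__ == "__main__":
    print(fit_gamma([0.5, 1.0, 2.0, 4.0]))
```

Bisection is used only because it needs no derivative of $\psi$; a Newton iteration would converge faster where a trigamma routine is available.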
The entropy of the gamma distribution is
\[
\alpha + \ln\beta + \ln\Gamma(\alpha) + (1 - \alpha)\,\psi(\alpha)
\]
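The gamma distribution is the conjugate prior for the rate parameter of the Poisson distribution. If $\lambda \sim \mathcal{G}(\alpha, \beta)$ and we have $n$ observations $x_i \sim \mathcal{P}o(\lambda)$, then the posterior distribution of $\lambda$ is
\[
p(\lambda \mid x_1, \ldots, x_n, \alpha, \beta) = \mathcal{G}\left( \alpha + \sum_i x_i,\; \frac{\beta}{1 + n\beta} \right)
\]
Since $\beta$ is a scale parameter, this is equivalent to adding $n$ to the inverse scale $1/\beta$.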
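A minimal Python sketch of the Gamma-Poisson update follows. It assumes that $\beta$ is a scale parameter, as in the density given at the start of A.2, so that the $n$ observations are added to the inverse scale $1/\beta$; the function name `gamma_poisson_posterior` is an assumption made for illustration.

```python
def gamma_poisson_posterior(counts, alpha, beta_scale):
    """Posterior (shape, scale) of a Poisson rate under a Gamma prior
    with shape alpha and scale beta_scale. (Hypothetical helper.)"""
    n = len(counts)
    post_shape = alpha + sum(counts)
    # The prior contributes exp(-lambda / beta) and each observation
    # contributes exp(-lambda), so the inverse scales add.
    post_scale = 1.0 / (1.0 / beta_scale + n)
    return post_shape, post_scale


if __name__ == "__main__":
    shape, scale = gamma_poisson_posterior([2, 3, 4], alpha=1.0, beta_scale=0.5)
    print(shape, scale, shape * scale)   # last value is the posterior mean
```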
A.3 Inverse-Gamma Distribution
The inverse Gamma distribution $\mathcal{IG}(r; \alpha, \beta)$ for $r > 0$, where $\alpha > 0$ is the shape parameter and $\beta > 0$ is a scale parameter, has probability density
\[
p(r) = \left( \frac{1}{r} \right)^{\alpha + 1} \frac{\beta^{\alpha} \exp(-\beta/r)}{\Gamma(\alpha)}
\]
The inverse Gamma distribution is the distribution of the random variable $1/r$ when $r \sim \mathcal{G}(r; \alpha, 1/\beta)$.
Sufficient statistics for a set of $N$ inverse Gamma distributed observations $r_i$ are $\sum_i r_i^{-1}$ and $\sum_i \log r_i$. The parameters of the inverse Gamma distribution are related to the sufficient statistics by
\[
\frac{1}{N} \sum_i \log r_i = -\psi(\alpha) + \log \beta
\]
\[
\frac{1}{N} \sum_i r_i^{-1} = \alpha / \beta
\]
The entropy of the inverse Gamma distribution is
\[
\alpha + \ln\beta + \ln\Gamma(\alpha) - (1 + \alpha)\,\psi(\alpha)
\]
The inverse Gamma distribution is the conjugate prior distribution for the variance parameter of a normal distribution. If $\sigma^2 \sim \mathcal{IG}(\alpha, \beta)$ and the mean $\mu$ is known, then the posterior distribution of $\sigma^2$ given $n$ observations $x_i$ is
\[
p\left( \sigma^2 \mid x_1, \ldots, x_n, \mu, \alpha, \beta \right) = \mathcal{IG}\left( \alpha + \frac{n}{2},\; \beta + \frac{\sum_i (x_i - \mu)^2}{2} \right)
\]
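The variance update can be written directly from this expression. The following Python sketch is illustrative only; the function name `inverse_gamma_variance_posterior` and its arguments are assumptions, not part of the thesis.

```python
def inverse_gamma_variance_posterior(xs, mu, alpha, beta):
    """Posterior (shape, scale) of the variance of a normal distribution
    with known mean mu, under an inverse Gamma(alpha, beta) prior.
    (Hypothetical helper.)"""
    n = len(xs)
    sum_sq = sum((x - mu) ** 2 for x in xs)   # squared deviations from the known mean
    return alpha + 0.5 * n, beta + 0.5 * sum_sq


if __name__ == "__main__":
    shape, scale = inverse_gamma_variance_posterior([1.0, 2.0, 3.0], mu=2.0, alpha=2.0, beta=1.0)
    print(shape, scale)
```

Each observation adds one half to the shape and half its squared deviation to the scale, so the prior behaves like $2\alpha$ pseudo-observations with total squared deviation $2\beta$.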
A.4 Beta Distribution
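The Beta distribution $\mathrm{Beta}(x; \alpha, \beta)$, where $0 \le x \le 1$ and $\alpha > 0$, $\beta > 0$, has probability density
\[
p(x) = \frac{x^{\alpha - 1} (1 - x)^{\beta - 1}}{B(\alpha, \beta)}
\]
where $B(\alpha, \beta)$ is the Beta function.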
The Beta distribution is the conjugate prior to the probability of success $\rho$ of a set of $n$ Bernoulli trials $r_i$ where $r_i \in \{0, 1\}$. The posterior distribution of $\rho$ is
\[
p(\rho \mid r_1, \ldots, r_n, \alpha, \beta) = \mathrm{Beta}\left( \alpha + \sum_i r_i,\; \beta + n - \sum_i r_i \right)
\]
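As a small illustration, the Beta-Bernoulli update amounts to counting successes and failures. The Python sketch below uses a hypothetical function name, `beta_bernoulli_posterior`, chosen here for clarity.

```python
def beta_bernoulli_posterior(trials, alpha, beta):
    """Posterior Beta parameters for the success probability, given
    Bernoulli outcomes in {0, 1}. (Hypothetical helper.)"""
    if any(r not in (0, 1) for r in trials):
        raise ValueError("trials must contain only 0 or 1")
    successes = sum(trials)
    failures = len(trials) - successes
    return alpha + successes, beta + failures


if __name__ == "__main__":
    a, b = beta_bernoulli_posterior([1, 0, 1, 1], alpha=1.0, beta=1.0)
    print(a, b, a / (a + b))   # last value is the posterior mean of rho
```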
Appendix B
Derivation of Results
B.1 Mode of Posterior Distribution of Signal-to-noise Parameter
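Take the natural logarithm of (5.22):
\[
\log p(y \mid D, \xi) + \log p(\xi) = -\frac{N + \alpha_n}{2} \log\left( y^\top P y + \beta_n \right) - (\alpha_\xi + 1) \log(\xi) - \frac{\beta_\xi}{\xi}
\]
Differentiate the expression with respect to $\xi$ and set equal to zero:
\begin{align*}
-\frac{N + \alpha_n}{2}\, \frac{-\frac{1}{\xi + 1}\, y^\top D D^\dagger y}{y^\top y - \frac{\xi}{\xi + 1}\, y^\top D D^\dagger y} - \frac{\alpha_\xi + 1}{\xi} + \frac{\beta_\xi}{\xi^2} &= 0 \\
\frac{N + \alpha_n}{2}\, \frac{y^\top D D^\dagger y}{y^\top y - \xi\, y^\top D D^\dagger y} - \frac{\alpha_\xi + 1}{\xi} + \frac{\beta_\xi}{\xi^2} &= 0 \\
\frac{N + \alpha_n}{2}\, \frac{\xi^2\, y^\top D D^\dagger y}{y^\top y - \xi\, y^\top D D^\dagger y} - \xi (\alpha_\xi + 1) + \beta_\xi &= 0
\end{align*}
After some rearranging,
\begin{align*}
\frac{N + \alpha_n}{2}\, \xi^2\, y^\top D D^\dagger y + \left( \beta_\xi - \xi (\alpha_\xi + 1) \right) \left( y^\top y - \xi\, y^\top D D^\dagger y \right) &= 0 \\
\xi^2 \left( \frac{N + \alpha_n}{2} + (\alpha_\xi + 1) \right) \left( y^\top D D^\dagger y \right) - \xi \left( (\alpha_\xi + 1)\, y^\top y + \beta_\xi\, y^\top D D^\dagger y \right) + \beta_\xi\, y^\top y &= 0 \tag{B.1}
\end{align*}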
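Equation (B.1) is a quadratic in the signal-to-noise parameter, so candidate modes can be obtained in closed form. The Python sketch below takes (B.1) as given and returns its positive real roots; the function name `snr_mode_candidates` and the scalar arguments `yty` (for $y^\top y$) and `y_proj` (for $y^\top D D^\dagger y$) are assumptions introduced for illustration. Which root, if any, is the posterior mode still has to be checked against the log posterior.

```python
import math


def snr_mode_candidates(N, alpha_n, alpha_xi, beta_xi, yty, y_proj):
    """Positive real roots of the quadratic (B.1), in ascending order.
    yty stands for y'y and y_proj for y'DD^+y. (Hypothetical helper.)"""
    if y_proj <= 0.0:
        raise ValueError("y_proj must be positive for (B.1) to be quadratic")
    a2 = (0.5 * (N + alpha_n) + alpha_xi + 1.0) * y_proj   # coefficient of xi^2
    a1 = -((alpha_xi + 1.0) * yty + beta_xi * y_proj)      # coefficient of xi
    a0 = beta_xi * yty                                     # constant term
    disc = a1 * a1 - 4.0 * a2 * a0
    if disc < 0.0:
        return []                                          # no real stationary point
    root = math.sqrt(disc)
    candidates = [(-a1 - root) / (2.0 * a2), (-a1 + root) / (2.0 * a2)]
    return sorted(x for x in candidates if x > 0.0)


if __name__ == "__main__":
    print(snr_mode_candidates(N=1, alpha_n=1.0, alpha_xi=1.0, beta_xi=1.0, yty=10.0, y_proj=1.0))
```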
B.2 Posterior over Latent Sources in Gaussian Variance Matrix Factorization Model
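$J$ is the identity matrix with dimensions $I \times I$.
\begin{align*}
& -\frac{D}{2} \operatorname{Tr} A^{-1} s s^{H} + \frac{D}{2} \operatorname{Tr} \frac{\mathbf{1}^\top \mathbf{1}\, s s^{H}}{\mathbf{1} A \mathbf{1}^\top} + \cdots \\
&= -\frac{D}{2} \operatorname{Tr} A^{-1} \left( J - A \frac{\mathbf{1}^\top \mathbf{1}}{\mathbf{1} A \mathbf{1}^\top} \right) s s^{H} + \cdots \\
&= -\frac{D}{2} \operatorname{Tr} \frac{A^{-1}}{\mathbf{1} A \mathbf{1}^\top} \left( \mathbf{1} A \mathbf{1}^\top J - A \mathbf{1}^\top \mathbf{1} \right) s s^{H} + \cdots \\
&= -\frac{D}{2} \operatorname{Tr} \frac{1}{\mathbf{1} A \mathbf{1}^\top} \left( \mathbf{1} A \mathbf{1}^\top J - A \mathbf{1}^\top \mathbf{1} \right)^{\top} \left( \mathbf{1} A \mathbf{1}^\top J - A \mathbf{1}^\top \mathbf{1} \right)^{-1} A^{-1} \left( \mathbf{1} A \mathbf{1}^\top J - A \mathbf{1}^\top \mathbf{1} \right) s s^{H} + \cdots \\
&= -\frac{D}{2} \operatorname{Tr} \left( s - \frac{A \mathbf{1}^\top \mathbf{1} s}{\mathbf{1} A \mathbf{1}^\top} \right)^{H} \left( \mathbf{1} A \mathbf{1}^\top J - A \mathbf{1}^\top \mathbf{1} \right)^{-1} \frac{A^{-1}}{\mathbf{1} A \mathbf{1}^\top} \left( s - \frac{A \mathbf{1}^\top \mathbf{1} s}{\mathbf{1} A \mathbf{1}^\top} \right) + \cdots \\
&= -\frac{D}{2} \operatorname{Tr} \left( s - \frac{A \mathbf{1}^\top \mathbf{1} s}{\mathbf{1} A \mathbf{1}^\top} \right)^{H} \left( A - \frac{A \mathbf{1}^\top \mathbf{1} A}{\mathbf{1} A \mathbf{1}^\top} \right)^{-1} \left( s - \frac{A \mathbf{1}^\top \mathbf{1} s}{\mathbf{1} A \mathbf{1}^\top} \right) + \cdots
\end{align*}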