Exploring the relationship between voice similarity estimates by listeners and by an automatic speaker recognition system incorporating phonetic features
The present study investigates the relationship between voice similarity ratings made by human listeners and comparison scores produced by an automatic speaker recognition system that incorporates phonetic, perceptually relevant features in its modelling. The study analyses human voice similarity ratings of pairs of speech samples from unrelated speakers drawn from an accent-controlled database (DyViS, Standard Southern British English), alongside the comparison scores from an i-vector-based automatic speaker recognition system using ‘auto-phonetic’ (automatically extracted phonetic) features. The voice similarity ratings were obtained from 106 listeners, each of whom rated the voice similarity of pairings of ten speakers on a Likert scale via an online test. Correlation analysis and multidimensional scaling showed a positive relationship between listeners’ judgements and the automatic comparison scores. A separate analysis of the subsets of listener responses from English and German native-speaker groups showed that a positive relationship was present for both groups, but that the correlation was higher for the English listeners. This work has key implications for forensic phonetics: it highlights the potential to automate part of the process of selecting foil voices in voice parade construction, a step that currently requires the collection and processing of human judgements. Further, establishing that automatic voice comparisons using phonetic features can select similar-sounding voices has important applications in ‘voice casting’ (finding voices that are similar to a given voice) and ‘voice banking’ (preserving one’s voice for future synthesis, for example ahead of an operation or in the face of degenerative disease).