Exploring the relationship between voice similarity estimates by listeners and by an automatic speaker recognition system incorporating phonetic features

Gerlach, Linda; McDougall, Kirsty; Kelly, Finnian; Alexander, Anil; Nolan, Francis

doi:10.17863/CAM.56656

Accepted version

Peer-reviewed

Repository URI

https://www.repository.cam.ac.uk/handle/1810/309562

Repository DOI

https://doi.org/10.17863/CAM.56656

Files

Accepted version (562.35 KB)

Type

Article

Authors

Gerlach, Linda

McDougall, Kirsty

Kelly, Finnian

Alexander, Anil

Nolan, Francis

https://orcid.org/0000-0002-8302-5726

Abstract

The present study investigates relationships between voice similarity ratings made by human listeners and comparison scores produced by an automatic speaker recognition system that includes phonetic, perceptually-relevant features in its modelling. The study analyses human voice similarity ratings of pairs of speech samples from unrelated speakers from an accent-controlled database (DyViS, Standard Southern British English) and the comparison scores from an i-vector-based automatic speaker recognition system using ‘auto-phonetic’ (automatically extracted phonetic) features. The voice similarity ratings were obtained from 106 listeners who each rated the voice similarity of pairings of ten speakers on a Likert scale via an online test. Correlation analysis and Multidimensional Scaling showed a positive relationship between listeners’ judgements and the automatic comparison scores. A separate analysis of the subsets of listener responses from English and German native speaker groups showed that a positive relationship was present for both groups, but that the correlation was higher for the English listener group. This work has key implications for forensic phonetics through highlighting the potential to automate part of the process of selecting foil voices in voice parade construction for which the collection and processing of human judgements is currently needed. Further, establishing that it is possible to use automatic voice comparisons using phonetic features to select similar-sounding voices has important applications in ‘voice casting’ (finding voices that are similar to a given voice) and ‘voice banking’ (saving one’s voice for future synthesis in case of an operation or degenerative disease).

Keywords

Perceived voice similarity, Speaker similarity, Automatic speaker recognition, Voice parades, Earwitness evidence

Journal Title

Speech Communication

Journal ISSN

0167-6393
1872-7182

Publisher

Elsevier

Publisher DOI

https://doi.org/10.1016/j.specom.2020.08.003

Rights

Attribution-NonCommercial-NoDerivatives 4.0 International

Collections

University of Cambridge Research Outputs (Articles and Conferences)
