A deep learning approach to assessing non-native pronunciation of English using phone distances

Accepted version
Peer-reviewed

Type

Conference Object

Authors

Kyriakopoulos, Konstantinos (ORCID: https://orcid.org/0000-0002-7659-4541)
Knill, KM 
Gales, MJF 

Abstract

The way a non-native speaker pronounces the phones of a language is an important predictor of their proficiency. In grading spontaneous speech, the pairwise distances between generative statistical models trained on each phone have been shown to be powerful features. This paper presents a deep learning alternative to model-based phone distances: a tunable Siamese network feature extractor that derives distance metrics directly from the audio frame sequence. Features are extracted at the phone instance level and combined into phone-level representations using an attention mechanism. Pairwise distances between phone features are then projected through a feed-forward layer to predict the score. The extraction stage is initialised either on a binary phone instance-pair classification task or to mimic the model-based features; the whole system is then fine-tuned end-to-end, optimising the learned distance metric for the score prediction task. This method is therefore more adaptable and more sensitive to phone instance level phenomena. Its performance is compared against
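
The pipeline described in the abstract (instance-level features, attention pooling to phone level, pairwise phone distances, feed-forward projection to a score) can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the random vectors stand in for the output of the Siamese extractor, and the dot-product attention, the Euclidean distance, the single linear output layer, and all names and dimensions are assumptions chosen for illustration only.

```python
import numpy as np

def attention_pool(instances, query):
    """Combine instance vectors (n, d) into one phone-level vector (d,).

    Assumption: simple dot-product attention with a single query vector.
    """
    scores = instances @ query                    # one score per instance, shape (n,)
    scores = scores - scores.max()                # shift for numerical stability
    weights = np.exp(scores) / np.exp(scores).sum()
    return weights @ instances                    # weighted average, shape (d,)

def pairwise_distances(phone_vectors):
    """Distance for every unordered pair of phones, flattened to shape (P*(P-1)/2,).

    Assumption: Euclidean distance; the abstract does not fix the metric.
    """
    i, j = np.triu_indices(phone_vectors.shape[0], k=1)
    return np.linalg.norm(phone_vectors[i] - phone_vectors[j], axis=1)

def predict_score(instances_per_phone, query, w, b):
    """Pool each phone's instances, take pairwise distances, project to a scalar."""
    phones = np.stack([attention_pool(x, query) for x in instances_per_phone])
    return float(pairwise_distances(phones) @ w + b)

# Hypothetical toy data: 4 phones, each with 2 to 5 instance vectors of size 8.
rng = np.random.default_rng(0)
num_phones, dim = 4, 8
instances_per_phone = [
    rng.normal(size=(int(rng.integers(2, 6)), dim)) for _ in range(num_phones)
]
query = rng.normal(size=dim)
num_pairs = num_phones * (num_phones - 1) // 2
w, b = rng.normal(size=num_pairs), 0.0            # untrained, illustrative weights

score = predict_score(instances_per_phone, query, w, b)
```

In the system the abstract describes, the extractor, attention, and output layer are trained jointly; here the weights are random, so `score` only demonstrates the data flow and shapes.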

Keywords

pronunciation assessment, phone distances, CALL, CAPT, Siamese Networks, attention mechanism

Journal Title

Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH

Conference Name

Interspeech 2018

Journal ISSN

2308-457X
1990-9772

Volume Title

2018-September

Publisher

ISCA

Sponsorship

Cambridge Assessment (unknown)