This paper proposes a multilingual audio information management system based on semantic knowledge in complex environments. The complex environment is defined by the limited resources (financial, material, human, and audio resources); the poor quality of the audio signal taken from an internet radio channel; the multilingual context (Spanish, French, and Basque, which is under-resourced in some areas); and the regular appearance of cross-lingual elements between the three languages. In addition, the system is constrained by the requirements of the local multilingual industrial sector. We present the first evolutionary system based on a scalable architecture that is able to fulfill these specifications with automatic adaptation based on automatic semantic speech recognition, folksonomies, automatic configuration selection, machine learning, neural computing methodologies, and collaborative networks. As a result, the initial goals have been accomplished, and the usability of the final application has been tested successfully, even with inexperienced users.
In order to contribute to the development of a community, we should take into account its socio-cultural and economic context. In this paper, we propose a joint solution to the socio-cultural context of the Basque Country and the needs and characteristics of small and medium companies with limited staff and economic resources, which are one of the foundations of the economy of the area. The model is easily exportable to other similar environments that may appear in developing areas [
In the socio-cultural context of the Basque Country, the interest in multilingual systems [
In this sense, audio information retrieval systems are increasingly used in different applications ranging from the extraction of musical information [
In this context, this paper proposes an audio information management system (the
This document is organized as follows: Sect.
In the field of speech processing, extraction and indexing of audio information from internet mass media is a research area of interest to the international scientific community [
When working in complex environments with a limited amount of data, multilingual contexts, nonlinearities, or uncontrollable noise, some possibilities are based on: enriching the poor resources of a language with resources from a more powerful neighboring language, approaches oriented to the lack of resources, cross-lingual approaches [
Within this complex environment, our proposal aims to: automatically generate an optimal corpus from an initial corpus, where a corpus is said to be optimal when the recognition results measured by means of cross-validation methods are optimal for the task; reduce the dimensionality of data through the convergence of redundant information by means of principal components analysis (PCA); estimate the accuracy of the system by means of the leave-one-out cross-validation technique (LOOCV); reduce unwanted effects in sublexical units (SLU) due to the lack of resources and the low number of samples by properly choosing sublexical units and possible groupings of them; and extract information by means of folksonomies.
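The leave-one-out estimate mentioned above can be illustrated with a minimal sketch. The nearest-neighbour classifier and the toy feature vectors are assumptions made only for illustration; they stand in for whatever recognizer and data are being evaluated and are not the system described in this paper.

```python
def nearest_neighbour(train_x, train_y, query):
    """Return the label of the training vector closest to `query`
    (squared Euclidean distance). Hypothetical stand-in classifier."""
    best = min(
        range(len(train_x)),
        key=lambda k: sum((a - b) ** 2 for a, b in zip(train_x[k], query)),
    )
    return train_y[best]


def loocv_accuracy(samples, labels, classify):
    """Leave-one-out cross-validation: each sample is held out once,
    the classifier is given all remaining samples, and the fraction of
    correctly labelled held-out samples is returned."""
    correct = 0
    for i in range(len(samples)):
        train_x = list(samples[:i]) + list(samples[i + 1:])
        train_y = list(labels[:i]) + list(labels[i + 1:])
        if classify(train_x, train_y, samples[i]) == labels[i]:
            correct += 1
    return correct / len(samples)


# Toy data (illustrative only): two well-separated groups of 2-D vectors.
samples = [(0.0, 0.1), (0.2, 0.0), (0.1, 0.2), (5.0, 5.1), (5.2, 4.9), (4.9, 5.0)]
labels = ["a", "a", "a", "b", "b", "b"]
accuracy = loocv_accuracy(samples, labels, nearest_neighbour)
```

With very small corpora, LOOCV uses all but one sample for training in every round, which is why it suits the low-resource setting described here.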
In this context, the requirements and novelty of the system lie in the complex environment in which it has to work, which is defined by: The trilingual context that comprises the three languages of the Basque Country (Basque, Spanish, and French), which requires the design of the first multilingual ASR system for these three languages. The regular appearance of cross-lingual elements between the three languages. The fact that, within the geographical area of this research project, the variety of the Basque language in the French region is in a critical under-resourced situation because it is not an official language in the French State. The limited resources (financial, material, human, and audio resources); in particular, the audio corpora for training amount to only about 0.5–1% of the usual corpora for such systems, for example, in [ The extremely poor quality of the audio signal, taken from an internet radio channel, which increases the complexity of the development.
In this section, the resources for the development of the system are presented. Furthermore, the main linguistic features are analyzed because they have a clear impact both on the performance of the acoustic phonetic decoding (APD) system, and on the size of the vocabulary of the system.
The three languages involved in our study are Basque, Spanish, and French. Basque is a Pre-Indo-European language of unknown origin and has circa 1,000,000 speakers in the Basque Country, which spreads over the international border between France and Spain. The Basque language has a wide range of dialects: there are six main dialects and several variations, and this dialectal variety entails phonetic, phonologic, and morphologic differences. In order to develop the APD system, a sound inventory of each language is necessary. Table

Sound inventories in the X-SAMPA

| Sound type | Sounds | BS | FR | SP |
|---|---|---|---|---|
| Plosive | p b t d k g | X | X | X |
| | J\ c p_h t_h k_h | X | – | – |
| Affricates | tS J\j\ | X | – | X |
| | ts_m ts_a dZ | X | – | – |
| Fricatives | B f s_a z_a | X | X | X |
| | S Z | X | X | – |
| | D x G | X | – | X |
| | S_m z_m j\ h | X | – | – |
| | v | – | X | – |
| Nasals | m n J | X | X | X |
| | F n_d N | X | – | X |
| Liquids | l | X | X | X |
| | R\ | X | X | – |
| | L r\ r | X | – | X |
| Vowel glides | w j | X | X | X |
| | H | – | X | – |
| Vowels | i e a o u | X | X | X |
| | y @ | X | X | – |
| | A E O 2 9 a~ e~ o~ 9~ | – | X | – |
A further challenge for developing an ASR system is that Basque is an agglutinative language with a special intra-word morphosyntactic structure [

Examples of the agglutinative structure of the Basque language

| Example in Basque | Lemma+morphemes | Translation |
|---|---|---|
| Etxekoarenak | Etxe+ko+aren+ak | The people from home |
| Parisekoak | Paris+eko+ak | People from Paris |
| Miguelek txakur hori ikusiko du | Miguel+ek txakur hori ikusi+ko du | Miguel will see that dog |
A plausible approach to the problem would be to use lemmas and morphemes instead of words when defining the system vocabulary [
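To make the idea of lemma and morpheme units concrete, the following minimal sketch contrasts a word-level vocabulary with a lemma+morpheme inventory. It assumes that the text has already been segmented with `+` as the boundary marker, as in the examples of agglutinative structure given earlier; the function names are hypothetical and no morphological analyzer is implemented.

```python
from collections import Counter

# Pre-segmented examples; "+" marks a lemma/morpheme boundary (assumed format).
segmented = [
    "Etxe+ko+aren+ak",
    "Paris+eko+ak",
    "Miguel+ek txakur hori ikusi+ko du",
]


def surface_words(sentences):
    """Rebuild the surface words by removing the boundary markers."""
    return [token.replace("+", "") for s in sentences for token in s.split()]


def morph_units(sentences):
    """Split every word into its lemma and morphemes."""
    return [unit for s in sentences for token in s.split() for unit in token.split("+")]


word_vocab = set(surface_words(segmented))      # one entry per distinct word form
unit_counts = Counter(morph_units(segmented))   # how often each unit is reused
```

In this tiny sample the morphemes `ak` and `ko` already occur in more than one word. That reuse is what allows a unit inventory to grow more slowly than a word list as the corpus grows, although on three sentences the unit inventory is still the larger of the two.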
Most speakers in the Basque Country are bilingual, and they commonly mix two of the three languages in their speech, particularly in spontaneous speech. The two languages that are mixed depend on the region of the country where the speaker lives: most Basque speakers living on the Spanish side also use Spanish, while Basque speakers on the French side also use French. Moreover, mixing of all three languages also occurs. Indeed, the acoustic interactions between these three languages, together with the Basque dialects, are very strong, because speakers naturally and spontaneously mix sounds and vocabulary, and sometimes they also add other influences, such as English. Some speakers are able to use the three languages consecutively in the same sentence with native pronunciation. All the resources contain numerous instances of cross-lingual material at the word, sentence, and pronunciation levels (Tables

Examples of cross-lingual appearance in the language resources

| Primary language | Text |
|---|---|
| Basque | |
| French | |
| Spanish | |
The basic audio resources used in this project have been mainly provided by a small local news radio channel,
In order to correctly implement an ASR system, it is crucial to design and obtain appropriate linguistic resources. A speech corpus is a collection of audio recordings tagged at different levels, which contains phrases, words, and common expressions of a certain language. This type of corpus is a database that stores implicitly various properties of the language, and this information lays the foundation for building voice recognition systems.
When only limited resources are available, the appropriate choice of the training corpus is a fundamental part of the design of the application. This is because on the one hand, depending on the circumstances a larger corpus does not imply better performance results [
The resources inventory is summarized in Table

Summary of the resources inventory, audio, and speech segments (SSG)

| Languages | Audio (hh:mm:ss) | SSG (hh:mm:ss) | SSG used for training (hh:mm:ss) |
|---|---|---|---|
| BS | 2:47:27 | 2:10:37 | 0:55:09 |
| FR | 3:17:23 | 1:22:54 | 0:55:38 |
| SP | 2:10:15 | 1:01:13 | 0:55:37 |
| Total | 8:15:05 | 4:34:44 | 2:46:24 |
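The totals in the resources inventory can be checked with a few lines of arithmetic. The sketch below copies the per-language durations from the inventory; the helper names `to_seconds` and `to_hms` are hypothetical.

```python
def to_seconds(hms):
    """Convert an 'h:mm:ss' string into a number of seconds."""
    hours, minutes, seconds = (int(part) for part in hms.split(":"))
    return hours * 3600 + minutes * 60 + seconds


def to_hms(total_seconds):
    """Convert a number of seconds back into an 'h:mm:ss' string."""
    hours, remainder = divmod(total_seconds, 3600)
    minutes, seconds = divmod(remainder, 60)
    return f"{hours}:{minutes:02d}:{seconds:02d}"


# Durations per language: (audio, SSG, SSG used for training).
inventory = {
    "BS": ("2:47:27", "2:10:37", "0:55:09"),
    "FR": ("3:17:23", "1:22:54", "0:55:38"),
    "SP": ("2:10:15", "1:01:13", "0:55:37"),
}

# Sum each column across the three languages.
totals = [
    to_hms(sum(to_seconds(row[column]) for row in inventory.values()))
    for column in range(3)
]
```

The three sums reproduce the Total row of the inventory, and they show that the training material is balanced at roughly 55 minutes per language.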
The French and Spanish audio contains high background noise due to signature tunes, which can be noted in Fig.

NIST SNR and WADA SNR of speech signals with regard to the signal length
The
The architecture of the Architecture of the
This layer, labeled as 1 in Fig.
The domain layer consists of different components which provide all the information to fill the database of the system. This layer includes the Language Identification (LID) tool [
The database contains all the information of the application, which is related to users, audio, concepts, and conceptual information. This information is divided into tables and implemented in the MySQL database management system. This layer is labeled as 3 in Fig.
For each language, a folksonomy has been defined. Figure

Example of the structure of the folksonomy, showing only some of the concepts (culture, politics, international, …), superclasses (sports, arts, …), classes (local sports, general, …), etc.
The folksonomy of each language consists of:
In order to determine the class to which a certain text belongs, the following procedure is followed, where several variables have been defined:
Finally, the class to which the text belongs is the one that fulfills the following optimization criterion:
Although an expert usually defines the set of SLUs, this method becomes very complex when the application is multilingual, when resources are limited or scarce, or in under-resourced conditions.

Fig. General diagram of

In the first configuration, the HMMs have the same number of states (NS) for all the SLUs (NS-X, where X is the number of states used). In the second configuration, the effect of assigning a different number of states to each SLU is analyzed, where the selection of the number of states depends on the nature of the allophones (a similar approach can be found in [

Examples of HMM topologies for the SLUs of the three languages with different topologies and number of states (NS)

| Languages | Topology | NS | SLU |
|---|---|---|---|
| BS | Bs-T5 | 5 | /j/,/w/,/p/,/t/,/b/,/f/,/m/,/n/,/J/,/l/,/c/,/J\/,/r\/,/s_a/ |
| | | 6 | /a/,/e/,/i/,/o/,/u/,/d/,/F/,/N/,/L/,/r/ |
| | | 7 | /k/,/B/,/D/,/g/,/G/,/x/,/n_d/,/j\/,/z_a/,/s_m/,/S/,/T/,/ts_a/,/ts_m/,/tS/,/INS/ |
| SP | Sp-T5 | 5 | /p/,/t/,/k/ |
| | | 6 | /a/,/e/,/i/,/o/,/u/,/j/,/w/,/b/,/d/,/g/,/B/,/D/,/G/,/x/,/f/,/m/,/F/,/n/,/N/,/n_d/,/J/,/l/,/L/,/j\/,/c/,/J\/,/r\/,/r/,/z_a/,/s_a/ |
| | | 7 | /T/,/S/,/s_m/,/ts_a/,/ts_m/,/tS/,/INS/ |
| FR | Fr-T5 | 6 | /p/,/t/,/k/,/m/,/n/ |
| | | 7 | /a/,/e/,/i/,/o/,/u/,/y/,/j/,/w/,/b/,/d/,/g/,/x/,/f/,/F/,/N/,/n_d/,/J/,/l/,/L/,/j\/,/c/,/J\/,/r\/,/r/,/z_a/,/s_a/ |
| | | 8 | /A/,/E/,/O/,/@/,/2/,/9/,/9~/,/a~/,/e~/,/o~/,/B/,/D/,/G/,/T/,/S/,/s_m/,/ts_a/,/ts_m/,/tS/,/INS/ |

Description of examples of groups of SLUs for the three languages

| Language | Group type | Description of different SLU groups |
|---|---|---|
| BS | G-C | /i/=/i/+/j/, /u/=/u/+/w/, /b/=/b/+/B/, /d/=/d/+/D/, /g/=/g/+/G/, /m/=/m/+/F/, /n/=/n/+/N/+/n_d/, /s_a/=/s_a/+/z_a/, /j\/=/j\/+/J\/+/c/+/L/ |
| SP | G-B | /i/=/i/+/j/, /u/=/u/+/w/, /b/=/b/+/B/, /d/=/d/+/D/, /g/=/g/+/G/, /m/=/m/+/F/, /n/=/n/+/N/+/n_d/, /s_a/=/s_a/+/z_a/, /j\/=/j\/+/J\/+/c/+/L/ |
| FR | G-C | /a/=/a/+/a~/, /e/=/e/+/E/+/e~/+/9~/, /i/=/i/+/j/, /o/=/o/+/O/+/o~/, /u/=/u/+/w/+/y/+/H/+/@/+/2/+/9/, /b/=/b/+/B/+/v/, /d/=/d/+/D/, /g/=/g/+/G/, /m/=/m/+/F/, /n/=/n/+/N/+/n_d/, /s_a/=/s_a/+/z_a/, /s_anFR/=/s_anFR/+/z_anFR/, /R\/=/R\/+/r/+/r\/ |
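As an illustration of how such SLU groups might be applied, the sketch below maps a sequence of X-SAMPA allophones onto group representatives using the Basque G-C groups listed above. The dictionary layout and the function names are assumptions; the paper does not specify how the grouping is implemented.

```python
# Basque G-C groups: representative -> members (X-SAMPA symbols).
BS_GROUPS = {
    "i": ["i", "j"],
    "u": ["u", "w"],
    "b": ["b", "B"],
    "d": ["d", "D"],
    "g": ["g", "G"],
    "m": ["m", "F"],
    "n": ["n", "N", "n_d"],
    "s_a": ["s_a", "z_a"],
    "j\\": ["j\\", "J\\", "c", "L"],
}


def build_lookup(groups):
    """Invert the group table so that every member points to its representative."""
    return {member: rep for rep, members in groups.items() for member in members}


def merge_units(sequence, groups):
    """Replace each allophone by its group representative; symbols that
    belong to no group are kept unchanged."""
    lookup = build_lookup(groups)
    return [lookup.get(symbol, symbol) for symbol in sequence]


merged = merge_units(["B", "a", "N", "J\\", "k"], BS_GROUPS)
```

Merging acoustically close allophones in this way reduces the number of models to train, which means more training samples per model when the corpus is small.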
Finally, sets of triphonemes are created for all the selected allophonic options, and discrete and semicontinuous HMMs are generated [
The
In a keyword-spotting system, the presence of out-of-vocabulary (OOV) words in the speech signal during the recognition process can degrade performance. One of the classic methods for solving this problem consists of using filler models to absorb this unwanted part of the signal. Furthermore, these fillers can provide valuable information for the construction of hypothetical new vocabulary words or in-vocabulary words. This approach is particularly interesting in the case of under-resourced languages because their vocabularies lack many words. On the one hand, the
The characteristics of a system determine its performance and its ability to evolve and improve, and to become more general in the long term. Nevertheless, the definition of these characteristics becomes tedious when multiple possibilities must be tested in order to reach optimal configurations Fig.

Inventory of components involved in the system configuration

| Component | Type | Description |
|---|---|---|
| SLU type | Allophone | Allophones of the language (Table |
| | Triphoneme | Triphonemes based on allophones |
| | G-X | Group of allophones or triphonemes based on Table D |
| HMM structure | DC | Discrete |
| | SC | Semicontinuous |
| HMM topology | XLG-TX | XLG = Language, TX, number of states based on Table C proposals |
| | NS-X | All SLUs the same number of states, X |
| NG | | Number of Gaussians |
| LUs type | Words | Word type unit |
| | Pseudo-morphemes | Unit based on morphemes, lemmas and morphemes |
| Fillers | Allophonic | Allophone, triphoneme |
| | Syllabic | Syllables |
| | Subwords | Most frequent filler words in the languages, automatically calculated |
| Length | Subword-2 | Subwords of length 2 |
| | SYL-subword-2 | Syllabic type and subwords of length 2 |
| Recogn-adjustment | WIP | Word insertion penalty |
| | FIP | Filler insertion penalty |
| Folksonomies | WTc | Total weight for a class c |
| | WRc | Ratified weight for a class c |
In our case, there are about 6000 possible system configurations for the three languages. Therefore, our goal is to automatically find optimal combinations of parameters (optimal configurations), while ensuring the quality of the system. In the literature, phone error rate (PER), word error rate (WER), and word correct rate (WCR) have usually been used to assess the quality of ASR systems.
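For reference, the sketch below computes WER and WCR from a minimum-edit alignment between a reference and a hypothesis. It assumes the commonly used definitions WER = (S + D + I) / N and WCR = H / N, where H, S, D, and I are the numbers of hits, substitutions, deletions, and insertions and N is the reference length; the exact scoring tool used in this work is not stated, so these definitions are an assumption. The same functions apply to phone sequences for PER.

```python
def align_counts(ref, hyp):
    """Return (hits, substitutions, deletions, insertions) for one
    minimum-edit-distance alignment of `hyp` against `ref`."""
    n, m = len(ref), len(hyp)
    # table[i][j] = (distance, hits, subs, dels, ins) for ref[:i] vs hyp[:j]
    table = [[None] * (m + 1) for _ in range(n + 1)]
    table[0][0] = (0, 0, 0, 0, 0)
    for i in range(1, n + 1):          # only deletions possible
        d, h, s, de, ins = table[i - 1][0]
        table[i][0] = (d + 1, h, s, de + 1, ins)
    for j in range(1, m + 1):          # only insertions possible
        d, h, s, de, ins = table[0][j - 1]
        table[0][j] = (d + 1, h, s, de, ins + 1)
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d, h, s, de, ins = table[i - 1][j - 1]
            if ref[i - 1] == hyp[j - 1]:
                diagonal = (d, h + 1, s, de, ins)       # hit
            else:
                diagonal = (d + 1, h, s + 1, de, ins)   # substitution
            d, h, s, de, ins = table[i - 1][j]
            deletion = (d + 1, h, s, de + 1, ins)
            d, h, s, de, ins = table[i][j - 1]
            insertion = (d + 1, h, s, de, ins + 1)
            table[i][j] = min(diagonal, deletion, insertion, key=lambda t: t[0])
    return table[n][m][1:]


def wer(ref, hyp):
    """Word error rate: (S + D + I) / N."""
    _, subs, dels, ins = align_counts(ref, hyp)
    return (subs + dels + ins) / len(ref)


def wcr(ref, hyp):
    """Word correct rate: H / N."""
    hits, _, _, _ = align_counts(ref, hyp)
    return hits / len(ref)


reference = ["a", "b", "c", "d"]
hypothesis = ["a", "x", "c"]
```

Under these definitions WER can exceed 100% when there are many insertions, while WCR ignores insertions entirely, which is one reason the two are usually reported together.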
However, these measures strongly depend on the vocabulary and they only analyze some features of recognition, without measuring any features related to the quality of unit partitions, the mutual interaction between units, combined semantic quality features, or a combination of these factors. Therefore, there is not a comprehensive quality measurement of the system that would be very useful in the future, when there could be significant changes in the original components of the system. In this paper, we propose a new method for the selection and ranking of the features of the system in order to adjust the performance of the system, which is already in use in other fields [
The proposed methodology, based on optimization algorithms, fuzzy indices, PCA, ANN, and SVM, is described below. In a first stage, the quality of the sets of SLUs is ensured by an automatic selection based on several indices which analyze the quality of unit partitions, the mutual interaction between units, false positive and true negative rates, entropy, and similarities. More than 80 external and fuzzy indices have been analyzed, and the best eight are chosen for the selection process described in [ Then, for each of the combinations of configuration parameters, twelve system performance rates are calculated (described in Table

System performance rates

| Component involved | Rate | Description |
|---|---|---|
| APD | | PER using triphonemes as SLUs |
| | | PCR using triphonemes as SLUs |
| adiUP | adiUP-Co | Concept correct rate in the |
| | adiUP-Cl | Class correct rate in the |
| LEXICON | w-WCR | WCR of the words present in the folksonomy |
| | w-WER | WER of the words present in the folksonomy |
| | WCR-mean | Mean of w-WCR |
| | WER-mean | Mean of w-WER |
| Folksonomy | kc-WCR | WCR of the key concepts defined in the folksonomy |
| | kc-WER | WER of the key concepts defined in the folksonomy |
| | kc-WCR-mean | Mean of kc-WCR |
| | kc-WER-mean | Mean of kc-WER |

During the second stage, an expert classifies all configurations into five levels of quality with regard to an Objective Function, which consists of a linear combination of the aforementioned rates (Table
In the ANN case, a multi-layer perceptron (MLP) is used with neuron number in hidden layer (NNHL) = (attribute number + classes/2) and training step (TS) = NNHL*10. In the SVM case, the PolyKernel option has been used. For the training and validation steps, we used k-fold cross-validation with

The third stage consists of the automatic selection of rates in order to get the optimal set of parameters. The most relevant features are selected by means of attribute selection algorithms, which are available in the WEKA Program [

Once the ranking is obtained, the system performance rates outside the ranking are discarded, and the classification is repeated by using the DT machine learning algorithm in order to analyze significant changes in the results and in the overall quality of the system. Then the number of features (rates) is reduced by applying the principal component analysis (PCA) algorithm to the ten selected rates, and th

In this step, artificial neural networks (ANN) and support vector machines (SVM) are used for the final automatic model selection. Finally, a group of experts (developers, linguists, and staff of the
The main aim of the experiments is to improve the performance of the designed system using the current resources, but keeping in mind the possibility of incorporating new resources in the future. The experimentation is divided into two tasks: the automatic selection of the system configuration by using the methodology described in Sect.
Prior to evaluating the system by users, it is necessary to select the optimal configuration of the system, that is, to choose the best set of parameters in order to optimize the system performance. These parameters include the optimal unit set, the folksonomy, the topology, the LUs, the SLUs, the number of Gaussians, the word insertion penalty, the fillers, and the acoustic models, among others. The selection process is carried out by using the methodology described in Sect.
In the first stage, the optimal sets of SLUs are selected from the proposals described in Sect.
Then, the combinations of system configuration parameters are generated. There are about 6000 different combinations for the three languages of the

Rates that have the greatest effect on the classification, according to the attribute-selection algorithms

| System performance rates | Repetitiveness |
|---|---|
| | 8 |
| | 7 |
| kc-WER | 7 |
| w-WER | 6 |
| w-WCR | 6 |
| kc-WCR | 6 |
| kc-WER-mean | 6 |
| kc-WCR-mean | 6 |
| Triph-WER | 6 |
| Triph-WCR | 6 |
The other system performance rates are discarded, and the classification is repeated by using the DT machine learning algorithm without significant changes in the results, so we decided to keep using only the best rates.
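One plausible reading of the repetitiveness score is the number of attribute-selection algorithms that retain a given rate. The sketch below implements that reading; it is an assumption, since the scoring rule is not spelled out here. The selector names and the selections are invented for illustration and do not correspond to specific WEKA evaluators or to the reported results.

```python
from collections import Counter

# Hypothetical output of three attribute-selection runs:
# each selector returns the set of rates it considers relevant.
selections = {
    "selector_a": {"kc-WER", "w-WER", "w-WCR"},
    "selector_b": {"kc-WER", "w-WER"},
    "selector_c": {"kc-WER", "kc-WCR"},
}


def repetitiveness(selections):
    """Count in how many selections each rate appears."""
    counts = Counter()
    for chosen in selections.values():
        counts.update(chosen)
    return counts


def keep_rates(counts, minimum):
    """Keep only the rates selected at least `minimum` times (sorted by name)."""
    return sorted(rate for rate, count in counts.items() if count >= minimum)


counts = repetitiveness(selections)
kept = keep_rates(counts, minimum=2)
```

Rates that fall below the threshold are the ones that would be discarded before the classification is repeated.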
In the next stage, new combinations of rates are generated by PCA, and the number of criteria is reduced. The result of the reduction by PCA is the creation of five groups that integrate different performance features. These groups are shown in Table

Results of the PCA algorithm

| Principal component | Definition |
|---|---|
| PC1 | |
| PC2 | |
| PC3 | |
| PC4 | |
| PC5 | |
For each language, an Objective Function is created by a linear combination of PCA components, and the six best solutions are selected. Afterward, an automatic selection by ANNs and SVM is carried out, and the best selected solutions are supervised by a group of experts according to their performance on: traditional classification paradigms; experiments with semicontinuous HMM (SC-HMM) ASR systems; experiments with the

Three of the best configuration options selected for each language and their system performance % rates

Rate columns in the original table: APD (Triph-PCR, Triph-PER); A (Cl, Co); Key concepts (Kc-WCR, Kc-WER); Words (WCR, WER)

| Language | Topology | SLU | NG | WIP | Filler type | System rates |
|---|---|---|---|---|---|---|
| BS | Bs-T5 | Triphonemes | 26 | −25 | SYL-subword-2 | 21,10 80,50 84,67 35,47 82,80 |
| | NS-6 | G-C | 10 | −10 | subword-2 | 78,90 25,50 77,89 80,65 64,12 37,60 22,42 |
| | Bs-T5 | G-C | 6 | −25 | subword-2 | 80,50 23,95 34,50 85,98 21,74 |
| SP | Sp-T5 | Triphonemes | 24 | −20 | subword-2 | 53,25 52,5 79,88 63,36 54,48 82,97 24,78 |
| | NS-7 | G-B | 8 | −10 | subword-2 | 58,80 46,04 72,89 75,65 59,12 52,60 |
| | NS-7 | G-B | 4 | −25 | SYL-subword-2 | 76,14 81,90 23,12 |
| FR | NS-8 | Triphonemes | 10 | −25 | subword-2 | 46,33 58,28 65,30 66,88 53,26 61,07 40,78 |
| | NS-7 | G-C | 8 | −10 | SYL-subword-2 | 52,86 45,94 62,24 56,65 52,81 60,75 39,27 |
| | Fr-T5 | G-C | 26 | −25 | SYL-subword-2 | 67,85 52,11 |
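The selection of the six best solutions by a linear Objective Function can be sketched as follows. The weights, the configuration names, and the component scores are all hypothetical placeholders, since the actual coefficients are not given here; only the mechanism (weighted sum over the five principal components, then ranking) follows the description above.

```python
def objective(components, weights):
    """Linear combination of principal-component scores."""
    if len(components) != len(weights):
        raise ValueError("components and weights must have the same length")
    return sum(w * c for w, c in zip(weights, components))


def best_configurations(configs, weights, k=6):
    """Rank configurations by objective value (highest first) and keep the top k."""
    ranked = sorted(
        configs.items(),
        key=lambda item: objective(item[1], weights),
        reverse=True,
    )
    return [name for name, _ in ranked[:k]]


# Hypothetical weights for PC1..PC5 and hypothetical PC scores per configuration.
weights = (0.4, 0.3, 0.1, 0.1, 0.1)
configs = {
    "cfg-1": (0.8, 0.2, 0.1, 0.0, 0.3),
    "cfg-2": (0.5, 0.9, 0.2, 0.1, 0.0),
    "cfg-3": (0.1, 0.1, 0.9, 0.9, 0.9),
}
top_two = best_configurations(configs, weights, k=2)
```

With roughly 6000 candidate configurations, a ranking of this kind narrows the field to a handful of candidates that the ANN/SVM selection and the experts then review.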
It can be seen that the best results were obtained for Basque. Although Spanish has weaker acoustic models, the success rates in the
An expert supervises the proposed solutions and selects the final combination of configuration parameters for each of the three languages. Table

Configurations implemented in the

| Language | Topology | SLU | NG | WIP | Filler |
|---|---|---|---|---|---|
| BS | Bs-T5 | G-C | 6 | −25 | subword-2 |
| SP | NS-7 | G-B | 4 | −25 | SYL-subword-2 |
| FR | Fr-T5 | G-C | 26 | −25 | SYL-subword-2 |
The assessment of the final performance of the system was carried out both automatically, by using the tags created during the analysis stage, and manually by experts, mainly journalists. Figure

Performance of the final system in terms of concept correct rate (
Finally, in order to evaluate the usability of the system, it was also tested by people not familiar with this kind of system. Tests have been carried out with four types of users: one person who uses the system as a super-user; three expert workers of the internet radio channel; three collaborators of the project; and three external real users of the system. These people have used the system and have subjectively graded it with regard to: the interface of the application; the answers provided by the search engine; the updating of the system; the flexibility of the system; and the general information provided by the system. Real users were asked only about this last general issue. The subjective parameters represent the degree of satisfaction with regard to the outcome of the system and the interface. This satisfaction was measured on a scale of 0–10, where 0 is the lowest satisfaction level and 10 is the highest, with regard to the application performance: interface, perplexity, level of confusion, information, search usefulness, updating-adaptation, and information level. The tests and usability heuristics were designed by following the methodology described in [

Summary of the results of the usability tests for each user profile
Neural computing-based approaches provide flexible solutions for complex systems. This paper presents the design and development of the
The
This work is being funded by Grants: TEC2016-77791-C4 from Plan Nacional de I + D + i, Ministry of Economic Affairs and Competitiveness of Spain and from the DomusVi Foundation Kms para recorder, the Basque Government (ELKARTEK KK-2018/00114, GEJ IT1189-19), the Government of Gipuzkoa (DG18/14, DG17/16), UPV/EHU (GIU19/090), and COST ACTION (CA18106, CA15225).
The authors declare that they have no conflict of interest.
All procedures performed in studies involving human participants were in accordance with the ethical standards of the institutional and national research committee and with the 1964 Helsinki declaration and its later amendments or comparable ethical standards. This article does not contain any studies with animals performed by any of the authors.
Informed consent was obtained from all individual participants included in the study.
Audio information management system
Artificial neural networks
Acoustic phonetic decoding
Automatic speech recognition
Correct rates for classes
Correct rates for concepts
Fast Fourier transform
Filler insertion penalty
Hidden Markov model
Language identification
Leave-one-out cross-validation
Lexical units
Mel frequency cepstral coefficients
Number of Gaussians
Number of states
Out of vocabulary
Principal components analysis
Phone error rate
Semicontinuous HMM
Sublexical units
Signal-to-noise ratio
Audio and speech segments
Support vector machines
Voice activity detection
Waveform amplitude distribution analysis
Word error rate
Word insertion penalty
eXtended speech assessment methods phonetic alphabet