CS229 - Fina Report Speech Recording based Language Recognition (Natura Language) Leopod Cambier - cambier; Matan Leibovich - matane; Cindy Orozco Bohorquez - orozcocc ABSTRACT We construct a rea time anguage cassifier for communication purposes. Feature vectors are based on Shifted Deta Cepstra Coefficients (SDC). The cassifier is constructed by a maximum ikeihood estimator based on Gaussian Mixture Mode (GMM) density estimations of the feature vectors. As expected, cassification error decreases with mode compexity to a certain imit. We observe that the optima number of gaussians varies significanty across anguages, and there is custering pattern in the confusion matrix. I. INTRODUCTION How to recognize a spoken anguage without any semantica anaysis?. This is a common probem in fieds where anguage detection has to be immediate, and therefore compex agorithms to identify phones and then words are not feasibe. For exampe, modern day communication requires service providers to be accessibe to many parts of the gobe, where Engish is not a commony spoken anguage. In case there is need to provide information, or converse with a cient, there is great urgency in detecting a anguage in which the cient can communicate. Another appication is nationa security. Inteigence agencies coect copious amounts of ceuar data. In many cases, it is imperative to identify the anguage spoken as soon as possibe, for the data to be vauabe. Finay, technoogy appications shoud fit in a gobaized word, and that incudes satisfy the demands of poygot peope, who maintain interactions in different anguages during a daiy routine. The appications shoud be abe to identify and switch from one anguage to another, depending on the demand of the user. Just think in Stanford students whose first anguage is not Engish and want to communicate with both famiy and cassmates. Notice that as an additiona outcome, a anguage cassification agorithm can aid in mapping reations between anguages, and distinctive features that woud not necessariy be thought of as ones, speciay for those anguages that do not have a written representation. With this set of motivations, we construct a rea time anguage cassifier which identifies any speech audio sampe to a anguage within a given database. For this purpose, we receive as input an audio sampe of spoken anguage (which can be as sma as 210 ms sampe), we extract a set of features caed Shifted Cepstra coefficients and then we use a maximum ikeihood estimator based on Gaussian Mixture Mode (GMM) to identify the most suitabe anguage among the trained ones. A. Description II. PROBLEM For the space of audio signas construct a cassifier S : [0, T ] R, C : S L i.e., given an audio sampe s, cassify it to, a member of a known set of anguages L C(s) = L. C is constructed using a training dataset D, B. Stages D S L, D = {(s i, i ) s i S, i L}. The work itsef is constructed of 3 stages 1) Constructing the feature vectors for every sampe in the training set using Shifted Cepstra Coefficients. 2) Training a mode for each anguage, and cross vaidate to seect the best fit. 3) Compute test error. III. RELATED WORK A comprehensive review of historica methods and approaches to spoken anguage recognition appears in [1]. Human anguage recognition has been the topic of research for many decades. Human istening experiments suggest that there are two broad casses of anguage cues peope rey on in identifying a anguage: the preexica information, such as intonation and pitch and the exica semantic knowedge based on identifying words, syntax and context. Research shows that pre-exica recognition predates exica recognitions, as infants are abe to identify a anguage ong before they have a soid grasp of anguage vocabuary and syntax. Accordingy, there is a choice of the features to use when trying to cassify a spoken anguage: they are 1
Fig. 1: Cassification schemes for acoustic and phonotactic methods [1] (a) Cepstra coefficients (b) SDC coefficients [1] Fig. 2: Extraction of Feature Vectors diagrams roughy divided into acoustic - physica sound patterns associated with a anguage, which reate to pre-exica data and phonotactic - constrained syabe structures, more associated with grammar and syntax structure. Acoustic features are easier to describe as they reate to primordia traits of the anguage, whie phonotactic features rey on particuar grammatica structures. As we were interested in finding a cassifier that woud work we on a heterogeneous cass of exotic anguages, we wished to make fewest assumptions about the grammatica structure of the anguages (tona/ admissibe consonants etc.). Therefore we chose to focus on Acoustic features. There is a pethora of cassification methods based on extracted features. However, as demonstrated in Fig 1, the most effective methods seem to be based on N-gram modeing for phonotactic features (recognizing phones in a speech segments and assessing the probabiity of a sequence of N such phones given each anguage in dictionary), and mixture of Gaussian density estimation for the conditiona density of the features. As we chose acoustic features, we based our cassification on a mixture of Gaussians density estimation. IV. FEATURE VECTORS CEPSTRAL COEFFICIENTS Cepstra coefficients have been used in anguage processing for a whie, since they seem to match the human auditory perception. i.e. signas with highy different Cepstra coefficient are most ikey to be attributed to different sources for exampe in speaker cassification. Cepstra coefficients are usuay computed over a phone, which is a of a duration of 10ms on average for human speakers. Cepstra coefficients for every 10 ms sampe are constructed by (aso see figure 2a) Hamming Window : x Ham (t) = (x Ham)(t), Fourier Transform : ˆx(f) = F(x Ham (t)), Transition to Me Scae : ˆx Me (f) = ˆx(M(f)), DCT of ogarithm : x Cep [i] = DCT(og ˆx Me (f))[i]. We can aso construct Shifted Deta Cepstra coefficients (SDC) by ooking at the variation of the Cepstra coefficients over adjacent sampes (see figure 2b), x k SDC[i] = x k+1 Cep [i] xk 1[i]. (1) Cep 2
0.8 0.75 0.7 Diagona Σ 0.65 Vaidation Error 0.6 0.55 0.5 0.45 0.4 0.35 Optimum = 0.34 Fu Σ 0.3 10 0 10 1 10 2 10 3 N =NumberofGaussiansinGMM Fig. 3: Vaidation error as a function of N V. DATASET We train our mode C with a data base from a Hacking competition from the webpage topcoder [3]. This data set contains information from 176 different anguages, each one with 376 sampes of 10 seconds. Each of these sampes is a recording of daiy conversation. One particuar characteristic of this data set is the unusua presence of exotic anguages. We divide each 10s sampe into 210ms sampes. For each 210 ms sampe we construct 21 7-vectors Cepstra coefficient for each 10 ms segment. We then use each 3 adjacent Cepstra coefficients to construct a 7-vector SDC coefficient. Thus we have a 49 feature vector for each 210 ms sampe. 46 376 = 17296 data sampes, to be used for training, vaidation and testing (where we divide them at the 10s eve). We first isoate 10% of the dataset for testing, and then divide the remaining 90% into 70% for training and 30% for vaidation. VI. TRAINING MODEL In the training we use a GMM mixture mode: every anguage is modeed as a sum of Gaussians. However, we do not assume every Gaussian is associated with a different abe. Rather, this is an extension of the GDA anaysis to mutipe Gaussians, such that the training agorithm can mode generaized distributions. We assume that the prior distribution of the anguages is uniform, which is the case in our training set, which mainy contains exotic anguages. We train the mode on 0.9 0.7 = 63 % of our dataset to get p(x ) = N i=1 w (i) p(x µ (i), Σ(i) ) (2) where x is the 49- feature vector, constructed in the training set from a 210ms sampe, w (i), µ(i), Σ(i) are our trained weights, means and covariances, and N is the number of Gaussians in the mode. VII. VALIDATION Given the trained parameters in (2), we cacuate the conditiona probabiity of each 10s sampe x (i), in our vaidation set to be drawn from a specific anguage by p(x (i) ) = 46 j=1 p(x (i) j ) (3) ˆ = arg max p(x (i) ) (4) where x (i) j is the the j-th 210 ms segment of x (i). We store the probabiities in ogarithm to avoid machine precision issues. VIII. RESULTS A. Vaidation error and hyperparameters optimization To expore different modes, we trained our mode assuming both a fu and diagona covariance matrix. The vaidation errors for various vaues of N are depicted on Figure 3 for those two modes. Both modes present the same bias-variance behavior. We can see that the fu mode runs into probems after N = 20. Given that for N = 20 we woud have roughy 20 49 + 20 49 2 50, 000 variabes to sove for whie we have 18, 000 data sampes per anguage in tota, the probem quicky becomes over determined for the fu Gaussian mode (for instance, the covariance matrices earned become singuar for N = 100). Using diagona covariances aeviates this issue to some extent, 3
(a) Distribution of vaidation error for N = 10 (b) Distribution of the optima N across anguages Fig. 4: Anaysis of the vaidation error for each individua anguage aowing us to go up to N = 300 before reaching singuarity issues. As we can see on the graph, the smaest vaidation error is reached for N = 10 with the fu covariances matrices. In order to use more Gaussians for training the modes, we need to increase the number of data sampes. Since we are training the mode for every anguage independenty, we are aso interested in how the error spreads between different anguages. First, as we can see in Figure 4a, for N = 10 using fu covariance matrix, we have a unimoda distribution of the error among the anguages and the error [0, 1]. The same pattern hods as we change N or the type of covariance matrix. Therefore, this suggests that some anguages are more susceptibe to inaccurate modeing, either because the GMM assumption is not accurate enough at ow N, or because the set of features does not capture distinguishabe characteristics of them. On the other hand, we aso anayzed the impact of choosing a uniform N for a anguages. In the origina mode, we fit a anguages with the same number of Gaussians N, and if this parameter were independent of the anguage, then we woud expect that a anguages get their minimum vaidation error at the same N. Nevertheess in Figure 4b, we can see the distribution of the optima N across anguages. For exampe for modes with fu covariance matrix, ony 40% of the anguages get the minimum error in the optima seection of N = 10. The same phenomenon happens for diagona covariance matrix, where the optima N = 150, but ony 25% of the anguages attains its minimum error at this N, whereas more than 30% of the anguages have minimum error when N = 100. This anaysis show that in order to improve our predictions, we need to make N variabe among different anguages. B. Test error and confusion matrix After seecting the best mode in section VIII-A (N = 10 with dense confusion matrix), we run the mode on the 10% remaining data. We obtained a test error of 0.34, which is consistent with the vaidation error obtained earier, suggesting good generaization of the mode. To anayze the error by anguage, we computed the confusion matrix of the test set. In this matrix, every (i, j) entry corresponds to the fraction of cases when given anguage i it is predicted as anguage j. To cacuate this vaue, we work with the score vector defined in (3) per each sampe, and then we normaize aong a the sampes. To have a better visuaization, we use coors in og-scae, as shown in Figure 5a. The confusion matrix is a standard technique to identify if the miscassification error is uniform aong the sampe, or if there exists subfamiies or custers, where the error resides. Given the arge amount of casses (neary 200), without any reative order among them, potting the confusion matrix to find custers is useess, uness we have by chance a suitabe permutation. In order to get this permutation, we understood the confusion matrix as the adjacency matrix of a directed graph. Under this assumption, custers of miscassified anguages are communities in the graph. Therefore, once we computed the confusion matrix, we used a community detection agorithm [2] to find five communities of anguages. Once this is done, a suitabe permutation of the confusion matrix gives us Figure 5a and the associated graph is depicted on Figure 5b. This shows that the error, when any, is not uniform and tends 4
(a) Permuted Confusion matrix (b) Communities graph Fig. 5: Custer anaysis in test error for N = 10 with fu covariance matrix, with highighted communities to custer, i.e., the agorithm tends to confuse simiar anguage together within a community. IX. CONCLUSION In this project, we impemented a rea time anguage cassifier for communication purposes. After buiding the features using SDC and using a GMM mode with MLE estimator for cassification, we obtained a test error of 0.34, which has to be compared to the performance a random estimator woud have on the 176 casses of the dataset. Looking at the data, we see that there is some pattern in the cassification, in the sense that when the mode miscassifies a anguage, it does not do so at random but tend to cassify the anguage within some community. X. FUTURE WORK Considering the resuts from section VIII-A, one way to improve the agorithm woud be to aow the number of Gaussian to vary for each anguage. This shoud ikey aow us to improve the fit and decrease the vaidation and test error. Additionay, as figure 3 iustrates, the mode quicky become overdetermined because of the important number of unknowns and the imited size of the dataset. One way to improve the mode woud then be to find a arger dataset to be abe to fit more compex modes which may improve the accuracy of the agorithm. Another interesting feature woud be to compute a measure of confidence when predicting a anguage (confidence interva). This coud be couped with a measure of the quaity of the GMM fit for each anguage. For exampe, for a given sampe, we coud return the probabiity of that sampe being we cassified, and a set of (e.g. 5) ikey anguages, with a score and confidence interva for each of them. The described approach can be based in the custering effect of the cassification described in section VIII-B. A hierarchica custering agorithm, where different communities of anguages are treated separatey, coud ead to some improvements. In this mode, the goa woud be to first cassify an audio sampe among different communities of anguages, which ideay (but not mandatory), correspond to the geographica region or the same inguistic famiy. After this step, we woud find the specific anguage to which it beongs, ooking ony among the distinct community. This may aso hep improve performances. Lasty, other direction for future work incude making the mode avaiabe to targeted users. This incudes deveoping an interactive interface that easiy recognizes anguages, and ater on, additiona components can be added such as rea time recognition of anguage, even it is switched aong time. REFERENCES [1] Li, Haizhou, Bin Ma, and Kong Aik Lee. Spoken anguage recognition: from fundamentas to practice. Proceedings of the IEEE 101.5 (2013): 1136-1159. [2] Bonde, Vincent D., et a. Fast unfoding of communities in arge networks. Journa of statistica mechanics: theory and experiment 2008.10 (2008): P10008. [3] topcoder. Maraton Match: Spoken Languages 2 (2015). Last visit October 2016: https://community.topcoder.com/ongcontest/?modue=viewprobemstatement&rd=16555&pm=13978 5