CS229 - Final Report
Speech Recording based Language Recognition (Natural Language)
Leopold Cambier - cambier; Matan Leibovich - matane; Cindy Orozco Bohorquez - orozcocc

ABSTRACT
We construct a real-time language classifier for communication purposes. Feature vectors are based on Shifted Delta Cepstral coefficients (SDC). The classifier is a maximum likelihood estimator based on Gaussian Mixture Model (GMM) density estimates of the feature vectors. As expected, classification error decreases with model complexity up to a certain limit. We observe that the optimal number of Gaussians varies significantly across languages, and that there is a clustering pattern in the confusion matrix.

I. INTRODUCTION
How can one recognize a spoken language without any semantic analysis? This is a common problem in fields where language detection has to be immediate, and where complex algorithms that identify phones and then words are therefore not feasible. For example, modern-day communication requires service providers to be accessible from many parts of the globe, where English is not a commonly spoken language. When there is a need to provide information to, or converse with, a client, there is great urgency in detecting a language in which the client can communicate. Another application is national security. Intelligence agencies collect copious amounts of cellular data, and in many cases it is imperative to identify the spoken language as soon as possible for the data to be valuable. Finally, technology applications should fit into a globalized world, and that includes satisfying the demands of polyglot people who maintain interactions in different languages during their daily routine. Applications should be able to identify and switch from one language to another, depending on the demand of the user; just think of Stanford students whose first language is not English and who want to communicate with both family and classmates. Notice that, as an additional outcome, a language classification algorithm can aid in mapping relations between languages and distinctive features that would not necessarily be thought of as such, especially for languages that do not have a written representation.

With this set of motivations, we construct a real-time language classifier which assigns any speech audio sample to a language within a given database. We receive as input an audio sample of spoken language (which can be as small as a 210 ms sample), extract a set of features called Shifted Delta Cepstral coefficients, and then use a maximum likelihood estimator based on Gaussian Mixture Model (GMM) density estimates to identify the most suitable language among the trained ones.

II. PROBLEM
A. Description
For the space of audio signals S = {s : [0, T] → R}, construct a classifier

C : S → L,

i.e., given an audio sample s, classify it as ℓ, a member of a known set of languages L: C(s) = ℓ ∈ L. C is constructed using a training dataset D ⊂ S × L,

D = {(s_i, ℓ_i) : s_i ∈ S, ℓ_i ∈ L}.

B. Stages
The work itself consists of three stages:
1) Constructing the feature vectors for every sample in the training set using Shifted Delta Cepstral coefficients.
2) Training a model for each language, and cross-validating to select the best fit.
3) Computing the test error.
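To make the data flow of these stages concrete, here is a minimal, hypothetical Python sketch of how the classifier C is assembled from per-language scoring functions; none of these identifiers come from the report, and the concrete scorers are the per-language GMMs trained in Section VI.

from typing import Callable, Dict

import numpy as np

# Hypothetical shape of C : S -> L. Each language label in L maps to a
# scoring function returning log p(sample | language); C picks the arg max.
def make_classifier(log_scores: Dict[str, Callable[[np.ndarray], float]]):
    def C(sample: np.ndarray) -> str:
        return max(log_scores, key=lambda lang: log_scores[lang](sample))
    return C

# Toy usage with dummy scorers standing in for trained models.
rng = np.random.default_rng(0)
dummy = {lang: (lambda s, b=b: float(-np.sum((s - b) ** 2)))
         for b, lang in enumerate(["english", "french", "quechua"])}
C = make_classifier(dummy)
print(C(rng.normal(loc=1.0, size=49)))   # classify one 49-dim feature vector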
III. RELATED WORK
A comprehensive review of historical methods and approaches to spoken language recognition appears in [1]. Human language recognition has been a topic of research for many decades. Human listening experiments suggest that there are two broad classes of language cues people rely on in identifying a language: pre-lexical information, such as intonation and pitch, and lexical-semantic knowledge, based on identifying words, syntax, and context. Research shows that pre-lexical recognition predates lexical recognition, as infants are able to identify a language long before they have a solid grasp of its vocabulary and syntax. Accordingly, there is a choice of features to use when trying to classify a spoken language: they are roughly divided into acoustic features (physical sound patterns associated with a language, which relate to pre-lexical data) and phonotactic features (constrained syllable structures, more associated with grammar and syntax structure).

Fig. 1: Classification schemes for acoustic and phonotactic methods [1]
Fig. 2: Extraction of feature vectors: (a) cepstral coefficients, (b) SDC coefficients [1]

Acoustic features are easier to describe, as they relate to primordial traits of the language, while phonotactic features rely on particular grammatical structures. Since we were interested in a classifier that would work well on a heterogeneous class of exotic languages, we wished to make the fewest possible assumptions about the grammatical structure of the languages (tonality, admissible consonants, etc.). Therefore we chose to focus on acoustic features. There is a plethora of classification methods based on extracted features. However, as demonstrated in Fig. 1, the most effective methods seem to be based on N-gram modeling for phonotactic features (recognizing phones in a speech segment and assessing the probability of a sequence of N such phones given each language in the dictionary), and on Gaussian mixture density estimation for the conditional density of acoustic features. As we chose acoustic features, we based our classification on a mixture-of-Gaussians density estimate.

IV. FEATURE VECTORS: CEPSTRAL COEFFICIENTS
Cepstral coefficients have been used in language processing for a long time, since they seem to match human auditory perception: signals with highly different cepstral coefficients are most likely to be attributed to different sources, for example in speaker classification. Cepstral coefficients are usually computed over a phone, which has a duration of about 10 ms on average for human speakers. The cepstral coefficients for every 10 ms sample are constructed by (see also Fig. 2a):

Hamming window:            x_Ham(t) = (x · Ham)(t),
Fourier transform:         x̂(f) = F(x_Ham(t)),
Transition to Mel scale:   x̂_Mel(f) = x̂(M(f)),
DCT of logarithm:          x_Cep[i] = DCT(log x̂_Mel(f))[i].

We can also construct Shifted Delta Cepstral coefficients (SDC) by looking at the variation of the cepstral coefficients over adjacent samples (see Fig. 2b),

x_SDC^k[i] = x_Cep^{k+1}[i] − x_Cep^{k−1}[i].   (1)
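The following is a minimal sketch of this feature-extraction pipeline, assuming an 8 kHz mono signal, non-overlapping 10 ms frames, 7 cepstral coefficients per frame, and a simplified triangular Mel filterbank; the report does not specify the sampling rate, the filterbank, or the exact SDC grouping, so those choices (and all function names) are illustrative.

import numpy as np
from scipy.fftpack import dct

def mel_filterbank(n_filters, frame_len, sr):
    """Triangular filters spaced evenly on the Mel scale (simplified)."""
    mel = lambda f: 2595.0 * np.log10(1.0 + f / 700.0)
    inv_mel = lambda m: 700.0 * (10.0 ** (m / 2595.0) - 1.0)
    pts_hz = inv_mel(np.linspace(mel(0.0), mel(sr / 2.0), n_filters + 2))
    bins = np.floor((frame_len + 1) * pts_hz / sr).astype(int)
    fbank = np.zeros((n_filters, frame_len // 2 + 1))
    for i in range(1, n_filters + 1):
        lo, ctr, hi = bins[i - 1], bins[i], bins[i + 1]
        for k in range(lo, ctr):
            fbank[i - 1, k] = (k - lo) / max(ctr - lo, 1)
        for k in range(ctr, hi):
            fbank[i - 1, k] = (hi - k) / max(hi - ctr, 1)
    return fbank

def cepstral_coeffs(signal, sr=8000, frame_ms=10, n_ceps=7, n_filters=20):
    """x_Cep for every 10 ms frame: Hamming window -> FFT -> Mel scale -> log -> DCT."""
    frame_len = int(sr * frame_ms / 1000)
    window = np.hamming(frame_len)
    fbank = mel_filterbank(n_filters, frame_len, sr)
    n_frames = len(signal) // frame_len
    ceps = np.empty((n_frames, n_ceps))
    for k in range(n_frames):
        frame = signal[k * frame_len:(k + 1) * frame_len] * window   # x_Ham(t)
        spec = np.abs(np.fft.rfft(frame))                            # |x̂(f)|
        mel_spec = fbank @ spec + 1e-10                              # x̂_Mel(f), floored
        ceps[k] = dct(np.log(mel_spec), norm="ortho")[:n_ceps]       # x_Cep[i]
    return ceps

def sdc_coeffs(ceps):
    """Eq. (1): x_SDC^k[i] = x_Cep^{k+1}[i] - x_Cep^{k-1}[i] for interior frames k."""
    return ceps[2:] - ceps[:-2]

def block_feature(block, sr=8000):
    """49-dim feature for one 210 ms block: 21 frames -> 7 SDC 7-vectors, concatenated."""
    ceps = cepstral_coeffs(block, sr)   # shape (21, 7)
    sdc = sdc_coeffs(ceps)              # shape (19, 7), centered on frames 1..19
    return sdc[::3][:7].reshape(-1)     # centers 1, 4, ..., 19 -> 49-vector

For a 10 s recording, splitting it into 210 ms blocks and calling block_feature on each block yields the 46 feature vectors per sample described in Section V.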

Fig. 3: Validation error as a function of N, the number of Gaussians in the GMM, for full and diagonal covariance matrices Σ (optimum: 0.34)

V. DATASET
We train our model C with a database from a hacking competition on the topcoder website [3]. This dataset covers 176 different languages, each with 376 samples of 10 seconds; each sample is a recording of daily conversation. One particular characteristic of this dataset is the unusual presence of exotic languages. We divide each 10 s sample into 210 ms samples. For each 210 ms sample we construct 21 cepstral-coefficient 7-vectors, one for each 10 ms segment, and then use every 3 adjacent cepstral vectors to construct one 7-vector of SDC coefficients. Thus we have a 49-dimensional feature vector for each 210 ms sample, and 46 × 376 = 17,296 data samples per language to be used for training, validation, and testing (split at the 10 s level). We first isolate 10% of the dataset for testing, and then divide the remaining 90% into 70% for training and 30% for validation.

VI. TRAINING MODEL
For training we use a Gaussian mixture model: every language is modeled as a sum of Gaussians. However, we do not assume every Gaussian is associated with a different label. Rather, this is an extension of GDA to multiple Gaussians per class, so that the training algorithm can model more general distributions. We assume that the prior distribution over languages is uniform, which is the case in our training set, which mainly contains exotic languages. We train the model on 0.9 × 0.7 = 63% of our dataset to get

p(x | ℓ) = Σ_{i=1}^{N} w_ℓ^{(i)} p(x | µ_ℓ^{(i)}, Σ_ℓ^{(i)}),   (2)

where x is the 49-dimensional feature vector constructed from a 210 ms sample, w_ℓ^{(i)}, µ_ℓ^{(i)}, Σ_ℓ^{(i)} are the trained weights, means, and covariances for language ℓ, and N is the number of Gaussians in the model.

VII. VALIDATION
Given the trained parameters in (2), we calculate the conditional probability of each 10 s sample x^{(i)} in our validation set being drawn from a specific language ℓ by

p(x^{(i)} | ℓ) = Π_{j=1}^{46} p(x_j^{(i)} | ℓ),   (3)
ℓ̂ = arg max_ℓ p(x^{(i)} | ℓ),   (4)

where x_j^{(i)} is the j-th 210 ms segment of x^{(i)}. We store the probabilities in logarithm to avoid machine-precision issues.
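A hedged sketch of the per-language training of (2) and of the scoring rule (3)-(4) is given below. The report does not name an implementation; scikit-learn's GaussianMixture, the reg_covar ridge, and the function names are assumptions made here for illustration.

import numpy as np
from sklearn.mixture import GaussianMixture

def train_models(features_by_language, n_gaussians=10, covariance="full"):
    """Fit one GMM per language, as in eq. (2).

    features_by_language: dict mapping a language label to an array of shape
    (n_segments, 49) of SDC feature vectors from the training split.
    """
    models = {}
    for lang, X in features_by_language.items():
        gmm = GaussianMixture(n_components=n_gaussians,
                              covariance_type=covariance,
                              reg_covar=1e-6)   # small ridge; only delays singular covariances
        models[lang] = gmm.fit(X)
    return models

def classify_sample(models, segments):
    """Classify one 10 s sample given its 46 segment features (shape (46, 49)).

    Implements eqs. (3)-(4): sum the per-segment log-likelihoods for each
    language (the log of the product in (3)), then take the arg max. The
    uniform prior over languages drops out of the arg max.
    """
    log_liks = {lang: gmm.score_samples(segments).sum()   # log p(x | lang)
                for lang, gmm in models.items()}
    return max(log_liks, key=log_liks.get), log_liks

Sweeping n_gaussians and the covariance type ("full" vs. "diag") and scoring the held-out 10 s samples with classify_sample produces validation-error curves of the kind shown in Fig. 3.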

VIII. RESULTS
A. Validation error and hyperparameter optimization
To explore different models, we trained our model assuming both a full and a diagonal covariance matrix. The validation errors for various values of N are depicted in Figure 3 for those two models. Both models present the same bias-variance behavior. We can see that the full model runs into problems after N = 20. Given that for N = 20 we would have roughly 20 · 49 + 20 · 49² ≈ 50,000 variables to solve for, while we have about 18,000 data samples per language in total, the problem quickly becomes over-parameterized for the full-covariance model (for instance, the learned covariance matrices become singular for N = 100). Using diagonal covariances alleviates this issue to some extent, allowing us to go up to N = 300 before reaching singularity issues. As we can see on the graph, the smallest validation error is reached for N = 10 with full covariance matrices. In order to use more Gaussians when training the models, we would need to increase the number of data samples.

Fig. 4: Analysis of the validation error for each individual language: (a) distribution of the validation error for N = 10, (b) distribution of the optimal N across languages

Since we train the model for every language independently, we are also interested in how the error spreads across languages. First, as we can see in Figure 4a for N = 10 with a full covariance matrix, the distribution of the error among the languages is unimodal, with errors spanning [0, 1]. The same pattern holds as we change N or the type of covariance matrix. This suggests that some languages are more susceptible to inaccurate modeling, either because the GMM assumption is not accurate enough at low N, or because the set of features does not capture their distinguishing characteristics. We also analyzed the impact of choosing a uniform N for all languages. In the original model we fit all languages with the same number of Gaussians N; if this parameter were independent of the language, we would expect all languages to reach their minimum validation error at the same N. Nevertheless, Figure 4b shows the distribution of the optimal N across languages: for models with a full covariance matrix, only 40% of the languages attain their minimum error at the overall optimum N = 10. The same phenomenon occurs for diagonal covariance matrices, where the overall optimum is N = 150 but only 25% of the languages attain their minimum error at this N, whereas more than 30% of the languages have their minimum error at N = 100. This analysis shows that, to improve our predictions, we need to let N vary across languages.

B. Test error and confusion matrix
After selecting the best model from section VIII-A (N = 10 with full covariance matrices), we run the model on the remaining 10% of the data. We obtain a test error of 0.34, consistent with the validation error obtained earlier, which suggests good generalization of the model. To analyze the error by language, we computed the confusion matrix of the test set. In this matrix, every (i, j) entry corresponds to the fraction of cases in which language i is predicted as language j. To calculate this value, we work with the score vector defined in (3) for each sample, and then normalize over all samples. For better visualization, we use colors in log-scale, as shown in Figure 5a. The confusion matrix is a standard technique for identifying whether the misclassification error is uniform across the sample, or whether there exist subfamilies or clusters where the error resides. Given the large number of classes (nearly 200), without any natural order among them, plotting the confusion matrix to find clusters is useless unless we happen upon a suitable permutation. To obtain such a permutation, we interpret the confusion matrix as the adjacency matrix of a directed graph. Under this interpretation, clusters of misclassified languages are communities in the graph. Therefore, once we computed the confusion matrix, we used a community detection algorithm [2] to find five communities of languages. A suitable permutation of the confusion matrix then gives Figure 5a, and the associated graph is depicted in Figure 5b.

Fig. 5: Cluster analysis of the test error for N = 10 with full covariance matrix, with highlighted communities: (a) permuted confusion matrix, (b) communities graph

This shows that the error, when present, is not uniform and tends to cluster: the algorithm tends to confuse similar languages within a community.
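As a hedged illustration of this clustering step, the sketch below builds a weighted graph from the off-diagonal entries of a confusion matrix and runs Louvain community detection. The report cites the Louvain method [2] but does not name an implementation; the use of networkx and the symmetrization of the directed confusions are assumptions made here.

import numpy as np
import networkx as nx

def confusion_communities(conf, labels, resolution=1.0, seed=0):
    """Group languages that the classifier tends to confuse with each other.

    conf: (n, n) confusion matrix with conf[i, j] = fraction of samples of
    language i predicted as language j. The diagonal (correct predictions)
    is removed, and the matrix is symmetrized because Louvain, as used here,
    operates on an undirected weighted graph.
    """
    off = conf - np.diag(np.diag(conf))
    weights = off + off.T
    G = nx.Graph()
    G.add_nodes_from(labels)
    n = len(labels)
    for i in range(n):
        for j in range(i + 1, n):
            if weights[i, j] > 0:
                G.add_edge(labels[i], labels[j], weight=float(weights[i, j]))
    communities = nx.community.louvain_communities(
        G, weight="weight", resolution=resolution, seed=seed)
    return [sorted(c) for c in communities]

The permutation used for Figure 5a then simply lists the languages community by community.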

IX. CONCLUSION
In this project we implemented a real-time language classifier for communication purposes. After building the features using SDC and classifying with a GMM model and a maximum likelihood estimator, we obtained a test error of 0.34, which should be compared with the performance a random estimator would achieve on the 176 classes of the dataset. Looking at the data, we see that there is some pattern in the classification: when the model misclassifies a language, it does not do so at random but tends to classify the language within some community.

X. FUTURE WORK
Considering the results from section VIII-A, one way to improve the algorithm would be to allow the number of Gaussians to vary for each language. This would likely improve the fit and decrease the validation and test errors. Additionally, as Figure 3 illustrates, the model quickly becomes over-parameterized because of the large number of unknowns and the limited size of the dataset. One way to improve the model would therefore be to find a larger dataset, so that more complex models could be fit, which may improve the accuracy of the algorithm. Another interesting feature would be to compute a measure of confidence when predicting a language (a confidence interval). This could be coupled with a measure of the quality of the GMM fit for each language. For example, for a given sample, we could return the probability of that sample being well classified, together with a set of (e.g. 5) likely languages, each with a score and a confidence interval. Such an approach could build on the clustering effect of the classification described in section VIII-B. A hierarchical clustering algorithm, where different communities of languages are treated separately, could also lead to improvements. In this model, the goal would be to first classify an audio sample into one of the communities of languages, which ideally (though not necessarily) correspond to geographical regions or linguistic families; after this step, we would find the specific language to which it belongs, looking only within that community. This may also help improve performance. Lastly, another direction for future work is making the model available to targeted users. This includes developing an interactive interface that easily recognizes languages; later on, additional components could be added, such as real-time recognition of a language even when the speaker switches languages over time.

REFERENCES
[1] Li, Haizhou, Bin Ma, and Kong Aik Lee. "Spoken language recognition: from fundamentals to practice." Proceedings of the IEEE 101.5 (2013): 1136-1159.
[2] Blondel, Vincent D., et al. "Fast unfolding of communities in large networks." Journal of Statistical Mechanics: Theory and Experiment 2008.10 (2008): P10008.
[3] topcoder. "Marathon Match: Spoken Languages 2" (2015). Last visited October 2016: https://community.topcoder.com/longcontest/?module=ViewProblemStatement&rd=16555&pm=13978