Eigenvoice Modeling With Sparse Training Data


IEEE TRANSACTIONS ON SPEECH AND AUDIO PROCESSING, VOL. 13, NO. 3, MAY 2005

Eigenvoice Modeling With Sparse Training Data

Patrick Kenny, Member, IEEE, Gilles Boulianne, Member, IEEE, and Pierre Dumouchel, Member, IEEE

Manuscript received March 26, 2003; revised October 30. The associate editor coordinating the review of this manuscript and approving it for publication was Dr. Ramesh A. Gopinath. The authors are with the Centre de Recherche Informatique de Montréal, Montréal, QC H3A 1B9, Canada (e-mail: pkenny@crim.ca).

Abstract: We derive an exact solution to the problem of maximum likelihood estimation of the supervector covariance matrix used in extended MAP (or EMAP) speaker adaptation and show how it can be regarded as a new method of eigenvoice estimation. Unlike other approaches to the problem of estimating eigenvoices in situations where speaker-dependent training is not feasible, our method enables us to estimate as many eigenvoices from a given training set as there are training speakers. In the limit as the amount of training data for each speaker tends to infinity, it is equivalent to cluster adaptive training.

Index Terms: Cluster adaptive training, eigenvoices, extended MAP (EMAP), speech recognition, speaker adaptation.

I. INTRODUCTION

EIGENVOICE modeling has proved to be effective in small vocabulary speech recognition tasks where enough data can be collected to carry out speaker-dependent training for large numbers of speakers [1]-[4]. Our objective in this article is to show how this type of modeling can be extended to situations where the training data is sparse in the sense that speaker-dependent training is not feasible.

Suppose, to begin with, we ask an oracle to supply speaker-dependent models for any speaker. If C denotes the total number of mixture components in a speaker model and F the dimension of the acoustic feature vectors then, for each speaker, we can concatenate the mean vectors associated with the mixture components to form a supervector of dimension CF. The idea behind eigenvoice modeling is that principal components analysis can be used to constrain these supervectors to lie in a low dimensional space with little loss of accuracy. This ensures that only a small number of parameters need to be estimated in order to enroll a new speaker. Thus, speaker adaptation saturates quickly in the sense that the recognition accuracy for a test speaker reaches a plateau after a small amount of adaptation data has been collected.

To flesh this out a bit, let m and K denote the mean and covariance matrix of the supervectors for the speaker population. The basic assumption in eigenvoice modeling is that most of the eigenvalues of K are zero. This guarantees that the speaker supervectors are all contained in a linear manifold of low dimension, namely, the set of supervectors of the form m + z where z lies in the range of K. This set of supervectors is known as the eigenspace; the eigenvoices of the population are the eigenvectors of K corresponding to nonzero eigenvalues. (It will be helpful to bear in mind that the dimension of the eigenspace is equal to the number of eigenvoices and this is equal to the rank of K.)

In order to construct a speaker-adapted HMM for a given speaker (absent the oracle) we need a mechanism to impose the constraint that the speaker's supervector lies in the eigenspace. In this article, we will use maximum a posteriori (MAP) estimation for this purpose rather than use eigenvoices explicitly.

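To make the supervector construction concrete, the following is a minimal sketch (not from the paper) of how speaker supervectors could be assembled and eigenvoices extracted by ordinary PCA when speaker-dependent models are available. The dimensions and variable names are illustrative assumptions, not the authors' setup.

```python
import numpy as np

# Illustrative dimensions: S speakers, C mixture components, F-dimensional features.
S, C, F = 80, 128, 26

rng = np.random.default_rng(0)
# speaker_means[s, c] stands in for the mean vector of mixture component c
# for speaker s, as supplied by the "oracle" (speaker-dependent training).
speaker_means = rng.normal(size=(S, C, F))

# Concatenate the C mean vectors of each speaker into a CF-dimensional supervector.
supervectors = speaker_means.reshape(S, C * F)

# Classical (non-probabilistic) eigenvoice estimation: PCA on the supervectors.
m = supervectors.mean(axis=0)                       # population mean supervector
centered = supervectors - m
# With S << CF, at most S - 1 eigenvoices can be obtained this way.
_, singular_values, vt = np.linalg.svd(centered, full_matrices=False)
eigenvoices = vt                                    # rows span the eigenspace
eigenvalues = singular_values ** 2 / S              # eigenvalues of the sample covariance
print(eigenvoices.shape)                            # (S, C*F)
```
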
The idea behind MAP estimation is that we have a prior probability distribution for the speaker's supervector, namely, the Gaussian distribution with mean m and covariance matrix K. Since this prior distribution is concentrated on the eigenspace, the same is true of the posterior distribution derived from it using the speaker's adaptation data. In particular, the mode of the posterior distribution lies in the eigenspace. Thus, MAP estimation of the speaker's supervector ensures that the constraint is satisfied. The term EMAP adaptation (E for extended) [5] is generally used to refer to MAP adaptation using a supervector covariance matrix but, since the traditional usage assumes that K is invertible whereas our concern in this article is with the case where K is of less than full rank, we will use the term eigenvoice MAP instead. The idea of integrating eigenvoice modeling with MAP speaker adaptation has been suggested by other authors [6]-[8].

Of course other mechanisms can be used to constrain the supervectors to lie in the eigenspace [2], [9], [10] but a good case can be made that MAP estimation is the most natural way. First, MAP adaptation implicitly takes account of the eigenvalues of K, as well as the eigenvectors. In fact, if we use the MAP approach, there is no reason in principle why we should make a hard decision to suppress the eigenvectors of K which correspond to minor eigenvalues (as required by other approaches). Secondly, MAP adaptation is asymptotically equivalent to speaker-dependent training as the amount of adaptation data increases. Thirdly, it is natural from a mathematical point of view since, as we will show in this article, it plays a central role in estimating eigenvoices by probabilistic principal components analysis. (In particular, no extra software needs to be developed for speaker adaptation if the eigenvoices are estimated in this way.) Finally, the computational burden of MAP adaptation is quite modest provided that the rank of K is reasonably small since it essentially boils down to inverting an R x R matrix where R is the rank of K [6]-[8]. The only difficulty with MAP adaptation arises in the case where R is large (more than a few thousand) in which case it is computationally intractable unless K is constrained in some way. (For example, if K is assumed to be of full rank and sparsity constraints are imposed on its inverse, MAP speaker adaptation can be implemented by viewing the posterior distribution as a Gaussian Markov random field [11].)

The main issue that needs to be addressed in order to implement an eigenvoice model is to estimate the population covariance matrix K.


If large numbers of training speakers can be enlisted then we can take the sample covariance matrix of the training set supervectors as an estimate of K. But note that the rank of the sample covariance matrix is at most the number of speakers in the training set. In most situations of interest this is far less than the dimension of the supervector space so the sample covariance matrix may not provide a reasonable estimate of the population covariance matrix. In such cases it may be necessary to impose some constraints on the population covariance matrix in order to produce an eigenspace of sufficiently high dimension that MAP adaptation has some hope of being asymptotically equivalent to speaker-dependent training. For instance, we could impose a block diagonal structure on K by ignoring the correlations between mixture components associated with different phonemes as in the seminal paper [5]. (Block diagonal constraints have the effect of boosting the rank of K by a factor equal to the number of blocks.) Alternatively, as we will explain, we can boost the rank of K by streaming the acoustic features. Or we could guarantee correct asymptotic behavior, at least in theory, simply by forcing the supervector covariance matrix to be of full rank. (Note that a singular covariance matrix can be made nonsingular by an arbitrarily small perturbation. For example, we could set the zero eigenvalues to a small positive value as in [6]. We say "in theory" because very large amounts of data (hundreds of sentences) may be needed for speaker adaptation to saturate with a covariance matrix of full rank [11].) Generally speaking, increasing the dimension of the eigenspace can be expected to slow down the rate at which speaker adaptation saturates so, in practice, it may not be possible to ensure both that speaker adaptation saturates quickly and that it behaves like speaker-dependent training as the amount of adaptation data increases.

Assuming that a satisfactory tradeoff between these two considerations can be found, there remains the question of how to estimate the population covariance matrix if speaker-dependent training is not feasible. (Recall that we began our discussion by appealing to an oracle to supply the training speakers' supervectors.) In this article we will show how to estimate the principal eigenvectors of the population covariance matrix directly from the training data without having recourse to speaker-dependent models. Our solution to this problem is a variant of the probabilistic principal components approach introduced in [12]. This methodology was developed in order to extend principal components analysis to handle priors specified by a mixture of Gaussians (rather than by a single Gaussian as in traditional principal components analysis). Priors of this type give rise to an interesting class of MAP estimators which, loosely speaking, are locally linear but globally nonlinear. These MAP estimators can be applied to a variety of data compression and pattern recognition tasks but in the context of eigenvoice modeling large numbers of training speakers seem to be needed to take advantage of them. So we have to limit ourselves to the case of a single Gaussian as in conventional principal components analysis. (This is unfortunate since clear evidence of at least two modes (one male, one female) is presented in [9], [13].)

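As a quick illustration of the rank limitation discussed above (not from the paper; the dimensions are arbitrary assumptions), the sample covariance of S supervectors has rank bounded by S no matter how large the supervector dimension is:

```python
import numpy as np

rng = np.random.default_rng(1)
S, CF = 20, 500                          # 20 training speakers, supervector dimension 500
supervectors = rng.normal(size=(S, CF))

centered = supervectors - supervectors.mean(axis=0)
sample_cov = centered.T @ centered / S   # CF x CF sample covariance matrix

# Rank is bounded by the number of speakers (in fact S - 1 after centering),
# far below the CF x CF size of the matrix.
print(np.linalg.matrix_rank(sample_cov))  # 19
```
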
The probabilistic approach still has an advantage over other approaches even in the unimodal case because it enables us to estimate as many eigenvoices as there are speakers in the training set. Even if it turns out that most of the variability in the training data can be captured by a smaller number of eigenvoices, MAP speaker adaptation can use the remaining eigenvoices to good advantage since it implicitly takes account of the corresponding eigenvalues. A clear example of this is given in [14, Table I] which reports the results of some closed-set speaker-identification experiments on a population of 319 speakers. Decreasing the number of eigenvoices from 300 to 100 was found to increase the error rate from 14.8% to 16.5% in the case of mean adaptation and from 14.7% to 17.7% when the variances were adapted as well. The need to use as many eigenvoices as possible is particularly evident in the case of large vocabulary speech recognition given the very high dimensionality of the supervector space. The experiments we report in this article were conducted on a large vocabulary task using as many eigenvoices in each stream as there are speakers in the training set.

We begin by explaining why cluster adaptive training [9] and the maximum likelihood eigenspace method [13] would break down if we attempted to use them to estimate a full complement of eigenvoices in situations where speaker-dependent training is not feasible and how this problem can be avoided by adopting the probabilistic approach.

II. MAXIMUM LIKELIHOOD FORMULATIONS OF THE ESTIMATION PROBLEM

Let M(s) denote the supervector for a speaker s. Our assumption is that, for a randomly chosen speaker, M(s) is Gaussian distributed with mean m and covariance matrix K. The speaker-independent supervector m can be estimated by Baum-Welch training in the usual way so the problem confronting us is how to estimate K if speaker-dependent training is not feasible (so that the supervectors for the training speakers are unobservable). This problem is complicated by the fact that the dimensions of K are enormous but it is amenable to maximum likelihood estimation because the maximum likelihood estimate of K is of low rank in practice (the rank is bounded by the number of training speakers). This fact will allow us to work out an exact solution to the estimation problem (without having to make any approximations as in [11], [15]).

Previous approaches to the problem of estimating eigenvoices have adopted a different perspective: instead of estimating K, they estimate a basis for the eigenspace. This point of view is equivalent to ours in the sense that given an estimate of K we can write down a basis for the eigenspace (namely, the eigenvectors of K corresponding to nonzero eigenvalues) and, conversely, given a basis for the eigenspace we can estimate K (by fitting a multivariate Gaussian distribution to the coordinate representations of the supervectors). However the two points of view give rise to different maximum likelihood formulations of the estimation problem and hence to different estimation procedures.

To describe the likelihood function used to estimate the eigenspace basis in [9], [13] we need to introduce some notation. For each mixture component c, let m_c(s) be the subvector of M(s) which corresponds to it. It is assumed that there is a covariance matrix Σ_c such that, for any speaker, acoustic observation vectors (frames) associated with mixture component c are normally distributed with mean m_c(s) and covariance matrix Σ_c.


Note that, although Σ_c is independent of s, it is not a speaker-independent covariance matrix in the usual sense since it measures deviations from the speaker-dependent mean vectors rather than deviations from a speaker-independent mean vector. Let Σ denote the CF x CF block diagonal matrix whose diagonal blocks are Σ_1, ..., Σ_C. For each speaker s, let X(s) denote the speaker's training data and, for each supervector M, let P_Σ(X(s) | M) denote the likelihood of X(s) calculated with the HMM specified by the supervector M and the supercovariance matrix Σ. (We assume throughout that the HMM transition probabilities and mixture weights are fixed so there is no need to include these in the notation.)

Although the bias term m can be eliminated by augmenting the dimension of the eigenspace [9], it will be convenient for us to retain it. Given a basis for the eigenspace, let V be the matrix whose columns are the basis supervectors so that every speaker supervector can be written in the form m + V y for some vector y. (The vector y is of dimension R where R is the dimension of the eigenspace.) The likelihood function which serves as the estimation criterion in [9], [13] is

    P(X) = prod_s max_y P_Σ(X(s) | m + V y)                                        (1)

where s ranges over the speakers in the training set. (Strictly speaking this is not a likelihood function since it does not integrate to 1. In order to obtain a proper likelihood function the max operation would have to be replaced by a suitable integral with respect to y.) The optimization proceeds by iterating the following two steps:

1) For each training speaker s, use the current estimates of V and Σ to find the value y(s) which maximizes the HMM likelihood of the speaker's training data. Set M(s) = m + V y(s).

2) Update V and Σ by maximizing prod_s P_Σ(X(s) | M(s)) where the product extends over all speakers in the training set.

These two steps are referred to as maximum likelihood eigendecomposition and the maximum likelihood eigenspace method in [13] and as estimating the cluster weights and estimating the cluster means in [9]. The methods in [9], [13] differ as to how the maximization in the second step is performed. In particular, it is assumed in [13] that, for each mixture component c, Σ_c is equal to the speaker-independent covariance matrix for the mixture component and no attempt is made to re-estimate it.

It will be helpful to briefly review Gales's method [9] for estimating V since our approach will require a similar calculation. For each speaker s, let M(s) = m + V y(s) where y(s) is given by the first step, and suppose that each frame in the training data has been aligned with a mixture component (as in a Viterbi alignment although a forward-backward alignment can also be used). For each mixture component c, let N_c(s) denote the number of frames aligned with c. Then the quantity to be maximized in the second step is

    sum_s sum_c sum_t log N(X_t; m_c + V_c y(s), Σ_c)

where s ranges over all speakers in the training set, c ranges over all mixture components and, for each pair (s, c), the sum over t extends over all frames aligned with c. If we ignore the issue of how to estimate Σ, the problem is just to minimize

    sum_s sum_c sum_t (X_t - m_c - V_c y(s))* Σ_c^{-1} (X_t - m_c - V_c y(s))

regarded as a function of V. It is characteristic of this type of optimization problem that the covariance matrices drop out so the problem reduces to an exercise in least squares, namely, to minimize

    sum_s sum_c sum_t ||X_t - m_c - V_c y(s)||^2.

For our purposes, the important thing to note is that linear regression (with the y(s)'s as the explanatory variables and the entries of V as the regression coefficients) provides the solution.

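A minimal sketch of this second step, given point estimates y(s) from step 1 and a hard frame-to-component alignment (an assumption-laden illustration, not the authors' code): each block V_c is obtained by ordinary least squares with the y(s) as explanatory variables, the Σ_c weighting having cancelled from the normal equations.

```python
import numpy as np

def update_eigenvoices_least_squares(frames, alignments, speaker_ids, y, m, C, F, R):
    """One cluster-adaptive-training style update of the eigenvoice matrix.

    frames:      (T, F) acoustic frames pooled over all speakers
    alignments:  (T,) mixture component index of each frame
    speaker_ids: (T,) speaker index of each frame
    y:           (S, R) point estimates of the speaker coordinates from step 1
    m:           (C, F) speaker-independent component means
    Returns V with shape (C, F, R): block V[c] maps y(s) to the offset of component c.
    """
    V = np.zeros((C, F, R))
    for c in range(C):
        idx = np.where(alignments == c)[0]
        if idx.size == 0:
            continue                              # unseen component: leave block at zero
        Y = y[speaker_ids[idx]]                   # (n_c, R) explanatory variables
        X = frames[idx] - m[c]                    # (n_c, F) centered observations
        # Normal equations (Y^T Y) B = Y^T X; the Sigma_c weighting cancels here.
        V[c] = np.linalg.lstsq(Y, X, rcond=None)[0].T
    return V
```
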
Our objection to using the likelihood function (1) as the estimation criterion is that, in order to obtain a reasonable estimate of the eigenspace basis in situations where speaker-dependent training is not feasible, the dimension of the eigenspace has to be strictly less than the number of training speakers. (If not, the maximum value of (1) is obviously attained by the basis which consists of the estimates of the training speakers' supervectors given by speaker-dependent training.) So in order to avoid this type of degeneracy, only a relatively small number of eigenvoices can be estimated or the basis vectors have to be constrained in some other way. (For example, the basis vectors can be constrained to lie in the low dimensional space consisting of all MLLR transforms of m [9].) These constraints are unreasonable because they are artifacts of the estimation procedure.

In order to outline the alternative estimation procedure that we will develop, note first that, if R is the rank of K, we can write K = V V* where V is a CF x R matrix of rank R. This implies that for each speaker s, there is a unique R-dimensional vector y(s) such that

    M(s) = m + V y(s).

To say that M(s) is normally distributed with mean m and covariance matrix K is equivalent to saying that y(s) has a standard normal distribution. So the problem of estimating the supervector covariance matrix can be formulated in terms of estimating V using a different generative model for the training data.

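The generative model just described can be summarized in a few lines (a sketch under assumed shapes; the sampling is purely illustrative): a speaker's supervector is obtained by drawing y(s) from a standard normal prior and offsetting the speaker-independent supervector through V, so that K = V V* automatically has rank at most R.

```python
import numpy as np

rng = np.random.default_rng(2)
CF, R = 600, 20                       # supervector dimension and eigenspace rank (illustrative)

m = rng.normal(size=CF)               # speaker-independent supervector
V = rng.normal(size=(CF, R))          # columns span the eigenspace

def sample_speaker_supervector():
    """Draw M(s) = m + V y(s) with y(s) ~ N(0, I), the model's prior on supervectors."""
    y = rng.standard_normal(R)
    return m + V @ y

# The implied supervector covariance K = V V^T has rank at most R.
K = V @ V.T
print(np.linalg.matrix_rank(K))       # 20
```
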

Fix a speaker s and suppose that each frame in the training data has been aligned with a mixture component. If the vector y(s) were observable then, using the notation introduced above, the log likelihood of X(s) would be given by log P_Σ(X(s) | m + V y(s)). Since we are treating y(s) as a random vector with a standard normal distribution, we define the likelihood of X(s) by integrating this conditional likelihood function with respect to the standard Gaussian kernel. We denote the value of this integral by P(X(s)) (a closed form expression is given in Proposition 2 below). Thus our approach will be to estimate V and Σ by maximizing the log likelihood function log prod_s P(X(s)) where the product extends over all speakers in the training set.

We will develop an EM algorithm for this purpose in which the role of the hidden variables is played by the vectors y(s). We will show that successive estimates of V and Σ are convergent, at least to the extent that the likelihood of the training data increases on successive iterations. However, as in [12], the EM algorithm generally has more than one fixed point so there is no guarantee that if V is an EM fixed point then the eigenvectors of V V* are the principal eigenvectors of the supervector covariance matrix.

Our procedure consists in iterating the following three steps. We will spell out the details in Section III.

1) For each training speaker, use the current alignment of the speaker's training data and the current estimates of V and Σ to carry out MAP speaker adaptation. Use the speaker-adapted model to realign the speaker's training data.

2) The E-step: For each speaker s, calculate the posterior distribution of y(s) using the current alignment of the speaker's training data, the current estimates of V and Σ, and the standard normal prior.

3) The M-step: Update V and Σ by a linear regression in which the y(s)'s play the role of the explanatory variables.

Calculating the posterior distribution of y(s) in the E-step rather than the maximum likelihood estimate is the key to avoiding the degeneracy problem that arises when (1) is used as the criterion for estimating the eigenvoices in situations where speaker-dependent training is not feasible and the number of eigenvoices is large compared with the number of training speakers. This calculation is also essentially all that is required for the first step. The principal mathematical difficulty that we will encounter is in carrying out the regression in the M-step, given that, for each speaker s, y(s) is only observable up to the posterior distribution calculated in the E-step.

III. ADAPTATION AND ESTIMATION PROCEDURES

The main computation that needs to be done, both for estimating V and Σ and for MAP speaker adaptation of a speaker s, is to calculate the posterior distribution of y(s) given the speaker's training data. To explain how this is done we need to introduce some notation. Assuming that the data for the speaker has been aligned with either the speaker-independent or a speaker-adapted HMM, so that each frame is labeled by a mixture component, we denote by X(s) the entire collection of labeled frames for speaker s. We extract the following statistics from X(s). For each mixture component c, let N_c(s) be the number of frames in the training data for speaker s which are accounted for by the given mixture component and set

    F_c(s) = sum_t (X_t - m_c)
    S_c(s) = sum_t (X_t - m_c)(X_t - m_c)*

where the sums extend over all frames X_t for speaker s that are aligned with mixture component c and m_c is the speaker-independent mean vector for the component. Let N(s) be the CF x CF block diagonal matrix whose diagonal blocks are N_c(s) I, where I denotes the F x F identity matrix. Let F(s) be the CF x 1 vector obtained by concatenating F_1(s), ..., F_C(s).

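These statistics are simple accumulations over the aligned frames. A sketch of how they might be collected (hard alignment assumed for brevity; all names are illustrative rather than the authors'):

```python
import numpy as np

def accumulate_statistics(frames, alignments, m, C):
    """Zeroth, first and second order statistics for one speaker.

    frames:     (T, F) frames of this speaker
    alignments: (T,) index of the mixture component each frame is aligned with
    m:          (C, F) speaker-independent component means
    Returns N (C,), F_stats (C, F), S (C, F, F).
    """
    T, F = frames.shape
    N = np.zeros(C)
    F_stats = np.zeros((C, F))
    S = np.zeros((C, F, F))
    for c in range(C):
        X = frames[alignments == c] - m[c]      # frames of component c, centered on m_c
        N[c] = X.shape[0]                       # N_c(s)
        F_stats[c] = X.sum(axis=0)              # F_c(s) = sum_t (X_t - m_c)
        S[c] = X.T @ X                          # S_c(s) = sum_t (X_t - m_c)(X_t - m_c)^T
    return N, F_stats, S
```
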
Let l(s) be the R x R matrix defined by

    l(s) = I + V* Σ^{-1} N(s) V.

Proposition 1: For each training or test speaker s, the posterior distribution of y(s) given X(s) and a parameter set (V, Σ) is Gaussian with mean l^{-1}(s) V* Σ^{-1} F(s) and covariance matrix l^{-1}(s).

The proof of this proposition can be found in the Appendix. Note that l(s) is strictly positive definite (and hence, invertible). Furthermore, if R is reasonably small (at most a few thousand) there is no difficulty in calculating the inverse of l(s). Hence calculating the posterior distribution is straightforward even in the large vocabulary case.

Corollary 1: If, for each training or test speaker s, we denote the posterior mean and covariance matrix of y(s) by E[y(s)] and Cov(y(s)), then

    E[y(s)] = l^{-1}(s) V* Σ^{-1} F(s),    Cov(y(s)) = l^{-1}(s).

This corollary is the key to MAP speaker adaptation. Invoking the Bayesian predictive classification principle [16], we can adapt both the mean vectors and the variances in the speaker-independent HMM to a given speaker as follows. With each mixture component c, we associate the mean vector m_c + (V E[y(s)])_c and a covariance matrix given by

    Σ_c + (V Cov(y(s)) V*)_{cc}                                                     (2)

where (V Cov(y(s)) V*)_{cc} is the cth diagonal entry of V Cov(y(s)) V* when it is considered as a C x C block matrix (each block being of dimension F x F).

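Concretely, the posterior computation in Proposition 1 only involves R x R linear algebra once the statistics are in hand. The sketch below follows the reconstruction of l(s), the posterior mean and the posterior covariance given above; it assumes diagonal Σ_c, and the variable names are mine, not the paper's.

```python
import numpy as np

def posterior_y(V, Sigma_diag, N, F_stats):
    """Posterior of y(s) given the statistics of one speaker (Proposition 1).

    V:          (C*F, R) eigenvoice matrix
    Sigma_diag: (C*F,) diagonal of the block-diagonal covariance Σ (diagonal Σ_c assumed)
    N:          (C,) zeroth order statistics N_c(s)
    F_stats:    (C, F) first order statistics F_c(s)
    Returns the posterior mean (R,) and covariance (R, R) of y(s).
    """
    CF, R = V.shape
    F = CF // N.shape[0]
    N_diag = np.repeat(N, F)                            # diagonal of the CF x CF matrix N(s)
    F_vec = F_stats.reshape(CF)

    V_weighted = V * (N_diag / Sigma_diag)[:, None]     # Σ^{-1} N(s) V, row-scaled
    l = np.eye(R) + V.T @ V_weighted                    # l(s) = I + V^T Σ^{-1} N(s) V
    cov = np.linalg.inv(l)                              # posterior covariance l^{-1}(s)
    mean = cov @ (V.T @ (F_vec / Sigma_diag))           # l^{-1}(s) V^T Σ^{-1} F(s)
    return mean, cov

# MAP adaptation (Corollary 1): the adapted supervector is m + V @ mean, and the adapted
# covariance of component c adds the cth diagonal block of V cov V^T to Σ_c, as in (2).
```
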

This type of covariance estimation seems to have been first suggested in [11]. The rationale behind it is that if X_t is an observation for speaker s and mixture component c, we have

    X_t = (m + V y(s))_c + e_t

where (m + V y(s))_c denotes the cth block of m + V y(s) when it is considered as a block vector, the residual e_t is distributed with mean 0 and covariance matrix Σ_c, and (m + V y(s))_c is distributed under the posterior with mean m_c + (V E[y(s)])_c and covariance matrix (V Cov(y(s)) V*)_{cc}. Since e_t and y(s) are assumed to be statistically independent,

    Cov(X_t) = Cov((m + V y(s))_c) + Cov(e_t)

which gives (2). We will refer to this as covariance adaptation although this is not strictly correct since it is really a mechanism for incorporating the uncertainty about the MAP estimate of y(s) into the HMM. (Note that as the amount of adaptation data increases, the uncertainty tends to zero and the covariance estimate reduces to Σ_c for all speakers.)

Corollary 2: If, for each training speaker s, E[y(s)] denotes the posterior expectation of y(s) given X(s) and a parameter set (V, Σ), and likewise for E[y(s) y*(s)], then

    E[y(s) y*(s)] = l^{-1}(s) + E[y(s)] E[y*(s)].

This also follows immediately from Proposition 1 and it is the key to implementing the E-step of the EM algorithm described in Proposition 3 below.

Next we explain how to calculate the likelihood function that we used in our formulation of the estimation problem. (This calculation is not strictly necessary; its only role is to provide a diagnostic for verifying the implementation of the EM algorithm.)

Proposition 2: For each speaker s, let G_Σ(s) denote the Gaussian log likelihood function given by the expression

    G_Σ(s) = sum_c ( N_c(s) log [ 1 / ((2π)^{F/2} |Σ_c|^{1/2}) ] - (1/2) tr(Σ_c^{-1} S_c(s)) ).        (3)

Then

    log P(X(s)) = G_Σ(s) - (1/2) log |l(s)| + (1/2) F*(s) Σ^{-1} V l^{-1}(s) V* Σ^{-1} F(s).

Finally, the EM algorithm:

Proposition 3: Suppose we are given initial parameter estimates (V, Σ). For each training speaker s, let E[y(s)] and E[y(s) y*(s)] be the first and second moments of y(s) calculated with these estimates according to Corollary 2 to Proposition 1. Let (V', Σ') be new estimates of the model parameters defined as follows: V' is the solution of

    sum_s N(s) V' E[y(s) y*(s)] = sum_s F(s) E[y*(s)]                                                 (4)

and, for each mixture component c,

    Σ'_c = ( sum_s N_c(s) )^{-1} sum_s ( S_c(s) - (V' E[y(s)] F*(s))_{cc} )                           (5)

where the sums extend over all speakers in the training set and (.)_{cc} denotes the cth diagonal block.

Equation (4) is the system of normal equations for our linear regression problem. To solve it, note that since the matrices N(s) are diagonal, equating the ith row of the left hand side with the ith row of the right hand side for i = 1, ..., CF gives

    V'_i sum_s N_c(s) E[y(s) y*(s)] = sum_s F_i(s) E[y*(s)]

where V'_i is the ith row of V', F_i(s) is the ith entry of F(s), and c is the mixture component to which row i corresponds. Note also that if we write this equation in the form V'_i A_c = b_i, where A_c = sum_s N_c(s) E[y(s) y*(s)] and b_i = sum_s F_i(s) E[y*(s)], then we obtain V'_i = b_i A_c^{-1}. This is just an R x R system of equations so there is no difficulty in solving for V'_i, and in the limit as the amount of training data tends to infinity (so that, for each speaker, the posterior distribution of y(s) becomes concentrated at a single point, namely the maximum likelihood estimate of y(s)) it is equivalent to [9, eq. (49)]. Equation (5) stands in the same relation to [9, eq. (51)], so Gales's method and ours are asymptotically equivalent.

Note that the EM algorithm does not impose any restrictions on R, which seems to suggest that it is possible to estimate a larger number of eigenvoices than there are speakers in the training set. This paradox can be resolved by proving that if V is a fixed point of the EM algorithm then the dimension of the eigenspace (that is, the rank of V V*) is necessarily less than or equal to the number of speakers in the training set. It follows that there is nothing to be gained by taking the number of eigenvoices to be larger than the number of training speakers, so we took them to be equal in our experiments.

IV. IMPLEMENTATION ISSUES

In order to apply the EM estimation procedure given in Proposition 3 we need the first and second order statistics F_c(s) and S_c(s), together with the counts N_c(s), for each training speaker and mixture component. These statistics can be extracted by aligning the training data with the speaker-independent HMM using either a Viterbi alignment or the Baum-Welch algorithm. (We used the Baum-Welch procedure in our experiments.)

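A sketch of the M-step implied by the normal equations above (my reconstruction and naming, with a per-row solve and accumulators summed over speakers before the solve; it assumes every mixture component is observed at least once):

```python
import numpy as np

def m_step(stats, V_shape):
    """Re-estimate the eigenvoice matrix V from per-speaker posterior moments.

    stats: iterable of (N, F_stats, Ey, Eyy) per speaker, where
           N (C,), F_stats (C, F), Ey (R,) = E[y(s)], Eyy (R, R) = E[y(s) y(s)^T].
    Returns V with shape (C, F, R).
    """
    C, F, R = V_shape
    A = np.zeros((C, R, R))          # A_c = sum_s N_c(s) E[y y^T]
    B = np.zeros((C, F, R))          # rows b_i = sum_s F_i(s) E[y^T], grouped by component
    for N, F_stats, Ey, Eyy in stats:
        A += N[:, None, None] * Eyy
        B += F_stats[:, :, None] * Ey[None, None, :]
    V = np.zeros((C, F, R))
    for c in range(C):
        # Each row of V_c solves the same R x R system: V_i A_c = b_i.
        V[c] = np.linalg.solve(A[c].T, B[c].T).T
    return V
```

The covariance update (5) would similarly accumulate S_c(s) and the diagonal blocks of V' E[y(s)] F*(s) over speakers before dividing by the total count for each component.
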
But note that after estimates of V and Σ have been obtained we can use MAP adaptation (i.e., Corollary 1 to Proposition 1) to construct speaker-adapted models for each of the training speakers and use these to align the training data instead of the speaker-independent model. Accordingly, in the training phase of our experiments, we alternate between alignment and EM iterations until the estimates of V and Σ converge. (A similar issue arises when applying MAP adaptation to a test speaker. We deal with it in the same way by performing several alignment iterations on the speaker's adaptation data.)

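Putting the pieces together, the overall training procedure alternates alignment, E-step and M-step. The high-level skeleton below ties together the earlier sketches; the helper functions and the `align` callable are hypothetical placeholders, not the authors' implementation.

```python
import numpy as np

def train_eigenvoices(speakers, align, V, Sigma_diag, m, C, F, n_iterations=10):
    """Alternate alignment and EM iterations until the estimates stabilize.

    speakers:   list of (T_s, F) frame arrays, one per training speaker
    align:      callable returning a frame-to-component alignment given a speaker model;
                a placeholder for Viterbi or Baum-Welch alignment (not implemented here)
    V:          (C*F, R) current eigenvoice matrix; Sigma_diag (C*F,); m (C*F,)
    Uses accumulate_statistics(), posterior_y() and m_step() sketched above.
    """
    R = V.shape[1]
    for _ in range(n_iterations):
        stats = []
        for s, frames in enumerate(speakers):
            # Step 1: realign the speaker's data with the current speaker-adapted model.
            alignments = align(frames, m, V, Sigma_diag, speaker=s)
            N, F_stats, S = accumulate_statistics(frames, alignments, m.reshape(C, F), C)
            # Step 2 (E-step): posterior moments of y(s) under the standard normal prior.
            Ey, cov = posterior_y(V, Sigma_diag, N, F_stats)
            Eyy = cov + np.outer(Ey, Ey)                # Corollary 2
            stats.append((N, F_stats, Ey, Eyy))
        # Step 3 (M-step): regression with the posterior moments of y(s) as explanatory data.
        V = m_step(stats, (C, F, R)).reshape(C * F, R)
        # (Sigma_diag would be re-estimated here as well, per equation (5).)
    return V
```
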

As we mentioned in the introduction, some constraints have to be imposed on the supervector covariance matrix in order to produce an eigenspace of sufficiently high dimension that MAP adaptation has some hope of being asymptotically equivalent to speaker-dependent training. Our experience has been that the type of block diagonal constraint used in [5], which we referred to in the introduction, is not very effective, so we experimented with another type of constraint which we implemented by splitting the acoustic features into streams and applying the EM algorithm in each stream separately. This has the effect of boosting the dimension of the eigenspace by a factor equal to the number of streams (so that the greatest effect is achieved by treating each acoustic feature as a separate stream).

As we mentioned in Section II, Nguyen and colleagues [13] construct speaker HMMs by applying eigenvoice adaptation to the HMM mean vectors and copying the HMM covariance matrices from a speaker-independent HMM trained in the usual way. Since this is easier to implement than our Bayesian version of Gales's treatment of the HMM covariances, we compared both approaches in all of our experiments. Contrary to our expectations we found that Nguyen's approach (which is really a type of variance inflation) almost always gives better results.

TABLE I: RECOGNITION ACCURACIES (%) AVERAGED OVER 20 SPEAKERS, 26 STREAMS, FOR VARYING NUMBERS OF ADAPTATION SENTENCES, WITH MEAN ADAPTATION AND WITH MEAN AND VARIANCE ADAPTATION.

TABLE II: RECOGNITION ACCURACIES (%) AVERAGED OVER 20 SPEAKERS, TWO STREAMS, FOR VARYING NUMBERS OF ADAPTATION SENTENCES, WITH MEAN ADAPTATION AND WITH MEAN AND VARIANCE ADAPTATION.

V. EXPERIMENTS

We carried out some experiments on the French language AUPELF task [17] using the BREF-80 training set which consists of 10.6 h of data collected from 80 speakers. The speaker-independent HMM was a tied-state triphone model with 782 output distributions, each having four mixture components with diagonal covariance matrices. For signal processing we used a 10 ms frame rate and a 26-dimensional acoustic feature vector (13 liftered mel-frequency cepstral coefficients together with their first derivatives). For recognition we used a dictionary and a language model consisting of bigrams and trigrams. (Admittedly a much larger language model would be needed to deal with the problem of homophone confusions in French.) We modeled liaisons in both training and recognition [18]. For the test set we chose 20 speakers (not represented in the training set) and five sentences per speaker; 3.0% of the test words were out of vocabulary.

A. Speaker Adaptation

We experimented with 26 streams (each of dimension 1) and two streams (each of dimension 13, one for the cepstra and the other for their first derivatives). We used the training set to estimate 80 eigenvoices in each stream and performed supervised adaptation with various adaptation sets comprising 1, 2, 5, 10, 15, 20, and 100 sentences per speaker; the average length of a sentence was 6 s.

1) 26 Streams: Recognition results averaged over the 20 test speakers are reported in Table I. The second column of Table I gives the results of adapting the HMM mean vectors and copying the HMM covariance matrices from the speaker-independent HMM.

Adapting only the mean vectors with small amounts of adaptation data gave small improvements in accuracy and performance saturated slowly, only reaching a plateau after 20 sentences (2 min of speech). The third column shows that adapting the variances as well as the mean vectors was unhelpful across the entire range of adaptation sets. The only way we were able to achieve any improvement with variance adaptation was by imposing block diagonal constraints on K (compare [5]). However these constraints diminished the effectiveness of mean adaptation so that no net gain in performance was obtained. Thus, partitioning the mixture components into four blocks gave 73.4% accuracy for mean adaptation versus 74.0% for mean and variance adaptation (using 100 adaptation sentences for each speaker). Similarly, partitioning the mixture components into ten blocks gave 74.7% accuracy for mean adaptation versus 74.8% for mean and variance adaptation.

2) 2 Streams: Table II reports recognition results obtained under the same conditions as in Table I using two streams rather than 26. Aside from the fact that variance adaptation is still ineffective, the main thing to note here is that the performance for mean adaptation saturates much more quickly but reaches a lower plateau than in Table I. This type of behavior is to be expected since the eigenspace is of much lower dimension (160 versus 2080). The substantial improvement obtained with a single sentence of adaptation data and a relatively small number of eigenvoices confirms Botterweck's result [6].

B. Multispeaker Modeling

Since variance adaptation with low dimensional eigenspaces was consistently unhelpful, we tried another way of producing speaker-adapted models for the test speakers, namely adding the adaptation data to the training data and estimating the model parameters using the extended training set.


With sufficient adaptation data for the test speakers, this increases the number of eigenvoices that can be estimated (from 80 to 100 per stream in the case at hand) and so ensures that the eigenspace is big enough to contain the test speakers as well as the training speakers. We refer to this type of training as multispeaker modeling since it only produces speaker-adapted models for the speakers in the extended training set. We performed a multispeaker recognition experiment using 100 adaptation sentences for each of the test speakers. Performing Baum-Welch estimation of the speaker-independent HMM with the extended training set and running recognition on the test speakers gave a new benchmark recognition accuracy of 72.1%. Performing mean adaptation on all of the speakers in the extended training set in the course of estimating V and running recognition on the test speakers gave a recognition accuracy of 76.4%. Adapting both the means and the variances also led to a recognition accuracy of 76.4%, so there is no degradation in performance in this case but no improvement either.

VI. DISCUSSION

Eigenvoice methods of acoustic phonetic modeling were developed initially to tackle small vocabulary tasks where speaker-dependent training can be carried out for large numbers of speakers. In this article we have shown how to estimate the principal eigenvoices of a speaker population in situations where the training data is too sparse to permit speaker-dependent training by formulating the problem in terms of maximum likelihood estimation of the supervector covariance matrix used in EMAP speaker adaptation. Unlike other methods of eigenvoice estimation, this approach enables us to estimate as many eigenvoices from a given training set as there are training speakers.

Our results, like Botterweck's, show that eigenvoice MAP can yield substantial improvements in accuracy on a large vocabulary task with very small amounts of adaptation data (one or two sentences). However, it is doubtful that eigenvoice modeling can provide a complete solution to the problem of acoustic phonetic adaptation in the large vocabulary case because, although it seems reasonable to assume that the true supervector covariance matrix is of relatively low rank and our approach enables us to extract the largest possible number of eigenvoices from a given training set, it seems unlikely that sufficiently many training speakers could ever be enlisted to estimate the supervector covariance matrix reliably. (An exception is the case of multispeaker modeling where the training and test speaker populations coincide. This type of modeling is more likely to be of use in speaker recognition than in speech recognition [14]. It is also interesting to note that the eigenchannel MAP estimator introduced in [14], which uses the methods developed here to tackle the problem of blind channel compensation for speaker identification, does not seem to suffer from this rank deficiency problem.) If the supervector covariance matrix is not reliably estimated then speaker adaptation may saturate quickly but there is no guarantee that it will be asymptotically equivalent to speaker-dependent training. Ad hoc constraints designed to increase the dimension of the eigenspace (such as block diagonal constraints or statistical independence conditions on the acoustic features) result in better asymptotic behavior but the improvement is slight and the principal effect of such constraints seems to be to slow down the rate at which speaker adaptation saturates.

If the goal of speaker adaptation is to attain speaker-dependent performance with the smallest possible amount of adaptation data (rather than to attain the best possible performance with one or two sentences of adaptation data), then it may be necessary to adopt a different approach to estimating the supervector covariance matrix in the large vocabulary case. Instead of using a principal components analysis to estimate the principal eigenvectors of the supervector covariance matrix, we could carry out a factor analysis. That is, we could assume a decomposition of the form K = V V* + D where D is a nonsingular diagonal covariance matrix (this guarantees that K is of full rank). Ideally this type of model will exhibit both the correct asymptotic behavior of classical MAP (the case V = 0) and the rapid saturation of eigenvoice MAP (the case D = 0). A factor analysis of this type has been implemented in a connected digit recognition task (where speaker-dependent training is feasible) [8]. However, a factor analysis of the training data in the absence of speaker-dependent models will require a good deal of algorithmic development. (Note that even if speaker-dependent models are given, the natural procedure for estimating a factor loading matrix is an EM algorithm [19].) We will take up this question elsewhere.

Our results to the effect that variance adaptation is generally less effective than simply copying the speaker-independent HMM variances were contrary to our expectations. They seem to indicate that the best way to model variances is to inflate them and raise the question of whether heavy-tailed distributions ought to be used instead of Gaussians in HMM mixture modeling. Since replacing individual Gaussians by discrete scale Richter mixtures has been found to be a moderately effective strategy [20], continuous scale Richter mixtures might be worth exploring. (These have proved to be effective in dealing with the non-Gaussian behavior of images containing edges as well as textures [21].) More ambitiously, one could replace each Gaussian in an HMM output distribution by an independent components analyzer, tying the mixing matrices in the spirit of [22]. (See [23] for an outstanding tutorial on ICA. Integrating ICA with HMMs is straightforward in principle [24].) This suggests that non-Gaussian modeling of speaker supervectors could also be considered but we do not have any evidence to indicate whether this is worth pursuing.

APPENDIX
PROOFS OF THE PROPOSITIONS

For each speaker s, let P_Σ(X(s) | y) denote the conditional likelihood of X(s) given y(s) = y and the parameter set (V, Σ).

Lemma 1: For each speaker s,

    log P_Σ(X(s) | y) = G_Σ(s) + y* V* Σ^{-1} F(s) - (1/2) y* V* Σ^{-1} N(s) V y


where G_Σ(s) is defined by (3) and N(s) and F(s) are the statistics defined in Section III.

b) Proof of Proposition 2: If denotes the Gaussian kernel with mean and covariance matrix, then

Proof: To simplify the notation, let us drop the reference to s. Let. For each mixture component, let denote the th block of (where each block is an F x 1 vector) and let

By Lemma 1, we can write this as

where the sum extends over all the frames for the given speaker which are aligned with the mixture component. The log likelihood of conditioned on is (6) where the sum extends over all mixture components. This can be simplified by writing so that

If denotes the Gaussian kernel with mean and covariance matrix then the integral can be written in the form. By the formula for the Fourier-Laplace transform of the Gaussian kernel [25] this simplifies to which, by Corollary 1 to Proposition 1, can be written as for. This implies that so and the result follows by substituting this expression in (6).

a) Proof of Proposition 1: In order to show that the posterior distribution of is of the stated form it is enough to show that as required.

It will be helpful to make a few remarks on differentiating functions of a matrix variable before turning to the proof of Proposition 3. If is a function of a matrix variable, then for any matrix of the same dimensions as we set. Dropping the reference to, we have and we refer to as the derivative of in the direction of. These directional derivatives are easily calculated for linear and quadratic functions of. For the determinant function they are given by as required. (This can be derived from the identity where is the adjoint of.) If for all matrices then is a critical point of. Define an inner product by setting


Since, for each, the functional is linear, there is a unique matrix (the gradient of) such that, and the second term on the right hand side can be simplified as follows: for all. So to find the critical points of we first find the matrix-valued function which enables us to express the directional derivatives of in the form and then set. If the domain of is restricted to symmetric matrices (e.g., covariance matrices), the critical points of are defined by the condition that for all symmetric matrices. In this case, the critical points can be found by setting

c) Proof of Proposition 3: We begin by constructing an EM auxiliary function with the y(s)'s as hidden variables. By Jensen's inequality,

We derive the re-estimation formula (4) by differentiating this expression with respect to and setting the gradient to. If is a matrix of the same dimensions as then the derivative with respect to in the direction of is given by. In order for this to be equal to for all we must have. Since the right hand side of this inequality simplifies to the total log likelihood of the training data, the likelihood can be increased by choosing the new estimates so as to maximize the left hand side. Since, for each speaker, from which (4) follows.

It remains to derive the re-estimation formula (5) for. Let be any block diagonal matrix having the same dimensions as, so that it has the form, and similarly for. This is equivalent to optimizing the quantity, which we refer to as the auxiliary function and denote by. Observe that, where is the conditional expectation operator introduced in the statement of the proposition. Furthermore, by Lemma 1, note that, for each, the derivative of with respect to in the direction of is. Hence the directional derivative of with respect to in the direction of is which, by (4), simplifies to


This can be written as, where, for each mixture component and speaker, denotes the cth diagonal block of the corresponding matrix. The latter expression evaluates to 0 for all symmetric matrices iff (5) holds for each c. Since the matrices involved are symmetric for all c and s, (5) follows immediately. (The estimates for the covariance matrices can be shown to be positive definite with a little bit of extra work.)

REFERENCES

[1] R. Kuhn et al., "Eigenfaces and eigenvoices: Dimensionality reduction for specialized pattern recognition," in Proc. IEEE Workshop Multimedia Signal Processing, Dec.
[2] R. Kuhn et al., "Eigenvoices for speaker adaptation," in Proc. ICSLP, Sydney, Australia, Dec.
[3] R. Kuhn et al., "Fast speaker adaptation using a priori knowledge," in Proc. ICASSP, Phoenix, AZ, Mar.
[4] R. Kuhn, J.-C. Junqua, P. Nguyen, and N. Niedzielski, "Rapid speaker adaptation in eigenvoice space," IEEE Trans. Speech Audio Process., vol. 8, no. 6, Nov.
[5] G. Zavaliagkos, R. Schwartz, and J. Makhoul, "Batch, incremental and instantaneous adaptation techniques for speech recognition," in Proc. ICASSP, Detroit, MI, May 1995.
[6] H. Botterweck, "Anisotropic MAP defined by eigenvoices for large vocabulary continuous speech recognition," in Proc. ICASSP, Salt Lake City, UT, May.
[7] E. Jon, D. Kim, and N. Kim, "EMAP-based speaker adaptation with robust correlation estimation," in Proc. ICASSP, Salt Lake City, UT, May.
[8] D. Kim and N. Kim, "Online adaptation of continuous density hidden Markov models based on speaker space model evolution," in Proc. ICSLP, Denver, CO, Sep.
[9] M. Gales, "Cluster adaptive training of hidden Markov models," IEEE Trans. Speech Audio Process., vol. 8, no. 4, Jul.
[10] R. Westwood, "Speaker Adaptation Using Eigenvoices," M.Phil. thesis, Cambridge University, Cambridge, UK.
[11] B. Shahshahani, "A Markov random field approach to Bayesian speaker adaptation," IEEE Trans. Speech Audio Process., vol. 5, no. 2, Mar.
[12] M. Tipping and C. Bishop, "Mixtures of probabilistic principal component analyzers," Neural Computation, vol. 11.
[13] P. Nguyen, C. Wellekens, and J.-C. Junqua, "Maximum likelihood eigenspace and MLLR for speech recognition in noisy environments," in Proc. Eurospeech, Budapest, Hungary, Sep.
[14] P. Kenny, M. Mihoubi, and P. Dumouchel, "New MAP estimators for speaker recognition," in Proc. Eurospeech, Geneva, Switzerland, Sep.
[15] Q. Huo and C.-H. Lee, "Online adaptive learning of the correlated continuous-density hidden Markov models for speech recognition," IEEE Trans. Speech Audio Process., vol. 6, no. 4, Jul.
[16] C.-H. Lee and Q. Huo, "On adaptive decision rules and decision parameter adaptation for automatic speech recognition," Proc. IEEE, vol. 88, Aug.
[17] J. Dolmazon et al., "ARC B1 Organisation de la première campagne AUPELF pour l'évaluation des systèmes de dictée vocale," in JST97 FRANCIL, Avignon, France, Apr.
[18] G. Boulianne, J. Brousseau, P. Ouellet, and P. Dumouchel, "French large vocabulary recognition with cross-word phonology transducers," in Proc. ICASSP, Istanbul, Turkey, Jun.
[19] D. Rubin and D. Thayer, "EM algorithms for ML factor analysis," Psychometrika, vol. 47.
[20] M. Gales and P. Olsen, "Tail distribution modeling using the Richter and power exponential distributions," in Proc. Eurospeech, Budapest, Hungary, Sep.
[21] M. Wainwright, E. P. Simoncelli, and A. Willsky, "Random cascades on wavelet trees and their use in modeling and analyzing natural imagery," Applied Computational Harmonic Analysis, vol. 11, Jul.

[22] M. Gales, "Semi-tied covariance matrices for hidden Markov models," IEEE Trans. Speech Audio Process., vol. 7.
[23] A. Hyvärinen and E. Oja, "Independent Component Analysis: A Tutorial," 1999. [Online].
[24] W. Penny, R. Everson, and S. Roberts, "Hidden Markov independent components analysis," in Advances in Independent Components Analysis, M. Girolami, Ed. New York: Springer-Verlag.
[25] H. Stark and J. Woods, Probability, Random Processes and Estimation Theory for Engineers. Englewood Cliffs, NJ: Prentice-Hall.

Patrick Kenny (M'04) received the B.A. degree in mathematics from Trinity College, Dublin, Ireland, and the M.Sc. and Ph.D. degrees, also in mathematics, from McGill University. He was Professor of electrical engineering at INRS-Télécommunications, Montréal, QC, Canada, from 1990 to 1995, where he started up Spoken Word Technologies (RIP) to spin off INRS's speech recognition technology. He joined the Centre de Recherche Informatique de Montréal in 1998, where he now holds the position of principal research scientist. His current research interests are concentrated on Bayesian speaker- and channel-adaptation for speech and speaker recognition.

Gilles Boulianne (M'99) received the B.Sc. degree in unified engineering from the Université du Québec, Chicoutimi, QC, Canada, and the M.Sc. degree in telecommunications from INRS-Télécommunications, Montréal, QC. He worked on speech analysis and articulatory speech modeling at the Linguistics Department of the Université du Québec, Montréal, until 1990, and then on large vocabulary speech recognition at INRS and Spoken Word Technologies until 1998, when he joined the Centre de Recherche Informatique de Montréal. His research interests include finite state transducer approaches and practical applications of large vocabulary speech recognition.

Pierre Dumouchel (M'97) received the Ph.D. degree in telecommunications from INRS-Télécommunications, Montréal, QC, Canada. He is currently Vice-President R&D of the Centre de Recherche Informatique de Montréal, a Professor at the Ecole de Technologie Supérieure, and a board member of the Canadian Language Industry Association (AILIA). His research interests include broadcast news speech recognition, speaker recognition and audio-visual content extraction.


More information

Lecture 5: GMM Acoustic Modeling and Feature Extraction

Lecture 5: GMM Acoustic Modeling and Feature Extraction CS 224S / LINGUIST 285 Spoken Language Processing Andrew Maas Stanford University Spring 2017 Lecture 5: GMM Acoustic Modeling and Feature Extraction Original slides by Dan Jurafsky Outline for Today Acoustic

More information

Mixtures of Gaussians with Sparse Regression Matrices. Constantinos Boulis, Jeffrey Bilmes

Mixtures of Gaussians with Sparse Regression Matrices. Constantinos Boulis, Jeffrey Bilmes Mixtures of Gaussians with Sparse Regression Matrices Constantinos Boulis, Jeffrey Bilmes {boulis,bilmes}@ee.washington.edu Dept of EE, University of Washington Seattle WA, 98195-2500 UW Electrical Engineering

More information

Hidden Markov Models and Gaussian Mixture Models

Hidden Markov Models and Gaussian Mixture Models Hidden Markov Models and Gaussian Mixture Models Hiroshi Shimodaira and Steve Renals Automatic Speech Recognition ASR Lectures 4&5 23&27 January 2014 ASR Lectures 4&5 Hidden Markov Models and Gaussian

More information

Mixtures of Gaussians with Sparse Structure

Mixtures of Gaussians with Sparse Structure Mixtures of Gaussians with Sparse Structure Costas Boulis 1 Abstract When fitting a mixture of Gaussians to training data there are usually two choices for the type of Gaussians used. Either diagonal or

More information

Bayesian Networks: Construction, Inference, Learning and Causal Interpretation. Volker Tresp Summer 2016

Bayesian Networks: Construction, Inference, Learning and Causal Interpretation. Volker Tresp Summer 2016 Bayesian Networks: Construction, Inference, Learning and Causal Interpretation Volker Tresp Summer 2016 1 Introduction So far we were mostly concerned with supervised learning: we predicted one or several

More information

Hidden Markov Modelling

Hidden Markov Modelling Hidden Markov Modelling Introduction Problem formulation Forward-Backward algorithm Viterbi search Baum-Welch parameter estimation Other considerations Multiple observation sequences Phone-based models

More information

Bayesian Networks: Construction, Inference, Learning and Causal Interpretation. Volker Tresp Summer 2014

Bayesian Networks: Construction, Inference, Learning and Causal Interpretation. Volker Tresp Summer 2014 Bayesian Networks: Construction, Inference, Learning and Causal Interpretation Volker Tresp Summer 2014 1 Introduction So far we were mostly concerned with supervised learning: we predicted one or several

More information

Notes on Latent Semantic Analysis

Notes on Latent Semantic Analysis Notes on Latent Semantic Analysis Costas Boulis 1 Introduction One of the most fundamental problems of information retrieval (IR) is to find all documents (and nothing but those) that are semantically

More information

On the Behavior of Information Theoretic Criteria for Model Order Selection

On the Behavior of Information Theoretic Criteria for Model Order Selection IEEE TRANSACTIONS ON SIGNAL PROCESSING, VOL. 49, NO. 8, AUGUST 2001 1689 On the Behavior of Information Theoretic Criteria for Model Order Selection Athanasios P. Liavas, Member, IEEE, and Phillip A. Regalia,

More information

Machine Learning for OR & FE

Machine Learning for OR & FE Machine Learning for OR & FE Regression II: Regularization and Shrinkage Methods Martin Haugh Department of Industrial Engineering and Operations Research Columbia University Email: martin.b.haugh@gmail.com

More information

BLIND SEPARATION OF INSTANTANEOUS MIXTURES OF NON STATIONARY SOURCES

BLIND SEPARATION OF INSTANTANEOUS MIXTURES OF NON STATIONARY SOURCES BLIND SEPARATION OF INSTANTANEOUS MIXTURES OF NON STATIONARY SOURCES Dinh-Tuan Pham Laboratoire de Modélisation et Calcul URA 397, CNRS/UJF/INPG BP 53X, 38041 Grenoble cédex, France Dinh-Tuan.Pham@imag.fr

More information

Bayesian Inference Course, WTCN, UCL, March 2013

Bayesian Inference Course, WTCN, UCL, March 2013 Bayesian Course, WTCN, UCL, March 2013 Shannon (1948) asked how much information is received when we observe a specific value of the variable x? If an unlikely event occurs then one would expect the information

More information

Riccati difference equations to non linear extended Kalman filter constraints

Riccati difference equations to non linear extended Kalman filter constraints International Journal of Scientific & Engineering Research Volume 3, Issue 12, December-2012 1 Riccati difference equations to non linear extended Kalman filter constraints Abstract Elizabeth.S 1 & Jothilakshmi.R

More information

Outline of Today s Lecture

Outline of Today s Lecture University of Washington Department of Electrical Engineering Computer Speech Processing EE516 Winter 2005 Jeff A. Bilmes Lecture 12 Slides Feb 23 rd, 2005 Outline of Today s

More information

Automatic Speech Recognition (CS753)

Automatic Speech Recognition (CS753) Automatic Speech Recognition (CS753) Lecture 21: Speaker Adaptation Instructor: Preethi Jyothi Oct 23, 2017 Speaker variations Major cause of variability in speech is the differences between speakers Speaking

More information

Joint Factor Analysis for Speaker Verification

Joint Factor Analysis for Speaker Verification Joint Factor Analysis for Speaker Verification Mengke HU ASPITRG Group, ECE Department Drexel University mengke.hu@gmail.com October 12, 2012 1/37 Outline 1 Speaker Verification Baseline System Session

More information

Using Kernel PCA for Initialisation of Variational Bayesian Nonlinear Blind Source Separation Method

Using Kernel PCA for Initialisation of Variational Bayesian Nonlinear Blind Source Separation Method Using Kernel PCA for Initialisation of Variational Bayesian Nonlinear Blind Source Separation Method Antti Honkela 1, Stefan Harmeling 2, Leo Lundqvist 1, and Harri Valpola 1 1 Helsinki University of Technology,

More information

Overview of Statistical Tools. Statistical Inference. Bayesian Framework. Modeling. Very simple case. Things are usually more complicated

Overview of Statistical Tools. Statistical Inference. Bayesian Framework. Modeling. Very simple case. Things are usually more complicated Fall 3 Computer Vision Overview of Statistical Tools Statistical Inference Haibin Ling Observation inference Decision Prior knowledge http://www.dabi.temple.edu/~hbling/teaching/3f_5543/index.html Bayesian

More information

1. Markov models. 1.1 Markov-chain

1. Markov models. 1.1 Markov-chain 1. Markov models 1.1 Markov-chain Let X be a random variable X = (X 1,..., X t ) taking values in some set S = {s 1,..., s N }. The sequence is Markov chain if it has the following properties: 1. Limited

More information

Necessary Corrections in Intransitive Likelihood-Ratio Classifiers

Necessary Corrections in Intransitive Likelihood-Ratio Classifiers Necessary Corrections in Intransitive Likelihood-Ratio Classifiers Gang Ji and Jeff Bilmes SSLI-Lab, Department of Electrical Engineering University of Washington Seattle, WA 9895-500 {gang,bilmes}@ee.washington.edu

More information

Part of Speech Tagging: Viterbi, Forward, Backward, Forward- Backward, Baum-Welch. COMP-599 Oct 1, 2015

Part of Speech Tagging: Viterbi, Forward, Backward, Forward- Backward, Baum-Welch. COMP-599 Oct 1, 2015 Part of Speech Tagging: Viterbi, Forward, Backward, Forward- Backward, Baum-Welch COMP-599 Oct 1, 2015 Announcements Research skills workshop today 3pm-4:30pm Schulich Library room 313 Start thinking about

More information

Chapter 1 Gaussian Mixture Models

Chapter 1 Gaussian Mixture Models Chapter 1 Gaussian Mixture Models Abstract In this chapter we rst introduce the basic concepts of random variables and the associated distributions. These concepts are then applied to Gaussian random variables

More information

FACTORIAL HMMS FOR ACOUSTIC MODELING. Beth Logan and Pedro Moreno

FACTORIAL HMMS FOR ACOUSTIC MODELING. Beth Logan and Pedro Moreno ACTORIAL HMMS OR ACOUSTIC MODELING Beth Logan and Pedro Moreno Cambridge Research Laboratories Digital Equipment Corporation One Kendall Square, Building 700, 2nd loor Cambridge, Massachusetts 02139 United

More information

NON-NEGATIVE SPARSE CODING

NON-NEGATIVE SPARSE CODING NON-NEGATIVE SPARSE CODING Patrik O. Hoyer Neural Networks Research Centre Helsinki University of Technology P.O. Box 9800, FIN-02015 HUT, Finland patrik.hoyer@hut.fi To appear in: Neural Networks for

More information

Introduction to Machine Learning Midterm, Tues April 8

Introduction to Machine Learning Midterm, Tues April 8 Introduction to Machine Learning 10-701 Midterm, Tues April 8 [1 point] Name: Andrew ID: Instructions: You are allowed a (two-sided) sheet of notes. Exam ends at 2:45pm Take a deep breath and don t spend

More information

Jorge Silva and Shrikanth Narayanan, Senior Member, IEEE. 1 is the probability measure induced by the probability density function

Jorge Silva and Shrikanth Narayanan, Senior Member, IEEE. 1 is the probability measure induced by the probability density function 890 IEEE TRANSACTIONS ON AUDIO, SPEECH, AND LANGUAGE PROCESSING, VOL. 14, NO. 3, MAY 2006 Average Divergence Distance as a Statistical Discrimination Measure for Hidden Markov Models Jorge Silva and Shrikanth

More information

Unsupervised Learning with Permuted Data

Unsupervised Learning with Permuted Data Unsupervised Learning with Permuted Data Sergey Kirshner skirshne@ics.uci.edu Sridevi Parise sparise@ics.uci.edu Padhraic Smyth smyth@ics.uci.edu School of Information and Computer Science, University

More information

Estimation of linear non-gaussian acyclic models for latent factors

Estimation of linear non-gaussian acyclic models for latent factors Estimation of linear non-gaussian acyclic models for latent factors Shohei Shimizu a Patrik O. Hoyer b Aapo Hyvärinen b,c a The Institute of Scientific and Industrial Research, Osaka University Mihogaoka

More information

STA 4273H: Sta-s-cal Machine Learning

STA 4273H: Sta-s-cal Machine Learning STA 4273H: Sta-s-cal Machine Learning Russ Salakhutdinov Department of Computer Science! Department of Statistical Sciences! rsalakhu@cs.toronto.edu! h0p://www.cs.utoronto.ca/~rsalakhu/ Lecture 2 In our

More information

Dynamic Time-Alignment Kernel in Support Vector Machine

Dynamic Time-Alignment Kernel in Support Vector Machine Dynamic Time-Alignment Kernel in Support Vector Machine Hiroshi Shimodaira School of Information Science, Japan Advanced Institute of Science and Technology sim@jaist.ac.jp Mitsuru Nakai School of Information

More information

Auxiliary signal design for failure detection in uncertain systems

Auxiliary signal design for failure detection in uncertain systems Auxiliary signal design for failure detection in uncertain systems R. Nikoukhah, S. L. Campbell and F. Delebecque Abstract An auxiliary signal is an input signal that enhances the identifiability of a

More information

Speech Recognition Lecture 8: Expectation-Maximization Algorithm, Hidden Markov Models.

Speech Recognition Lecture 8: Expectation-Maximization Algorithm, Hidden Markov Models. Speech Recognition Lecture 8: Expectation-Maximization Algorithm, Hidden Markov Models. Mehryar Mohri Courant Institute and Google Research mohri@cims.nyu.com This Lecture Expectation-Maximization (EM)

More information

IN this paper, we consider the capacity of sticky channels, a

IN this paper, we consider the capacity of sticky channels, a 72 IEEE TRANSACTIONS ON INFORMATION THEORY, VOL. 54, NO. 1, JANUARY 2008 Capacity Bounds for Sticky Channels Michael Mitzenmacher, Member, IEEE Abstract The capacity of sticky channels, a subclass of insertion

More information

Multi-level Gaussian selection for accurate low-resource ASR systems

Multi-level Gaussian selection for accurate low-resource ASR systems Multi-level Gaussian selection for accurate low-resource ASR systems Leïla Zouari, Gérard Chollet GET-ENST/CNRS-LTCI 46 rue Barrault, 75634 Paris cedex 13, France Abstract For Automatic Speech Recognition

More information

Weighted Finite-State Transducers in Computational Biology

Weighted Finite-State Transducers in Computational Biology Weighted Finite-State Transducers in Computational Biology Mehryar Mohri Courant Institute of Mathematical Sciences mohri@cims.nyu.edu Joint work with Corinna Cortes (Google Research). 1 This Tutorial

More information

Diagonal Priors for Full Covariance Speech Recognition

Diagonal Priors for Full Covariance Speech Recognition Diagonal Priors for Full Covariance Speech Recognition Peter Bell 1, Simon King 2 Centre for Speech Technology Research, University of Edinburgh Informatics Forum, 10 Crichton St, Edinburgh, EH8 9AB, UK

More information

Math 350: An exploration of HMMs through doodles.

Math 350: An exploration of HMMs through doodles. Math 350: An exploration of HMMs through doodles. Joshua Little (407673) 19 December 2012 1 Background 1.1 Hidden Markov models. Markov chains (MCs) work well for modelling discrete-time processes, or

More information

Title without the persistently exciting c. works must be obtained from the IEE

Title without the persistently exciting c.   works must be obtained from the IEE Title Exact convergence analysis of adapt without the persistently exciting c Author(s) Sakai, H; Yang, JM; Oka, T Citation IEEE TRANSACTIONS ON SIGNAL 55(5): 2077-2083 PROCESS Issue Date 2007-05 URL http://hdl.handle.net/2433/50544

More information

Temporal Modeling and Basic Speech Recognition

Temporal Modeling and Basic Speech Recognition UNIVERSITY ILLINOIS @ URBANA-CHAMPAIGN OF CS 498PS Audio Computing Lab Temporal Modeling and Basic Speech Recognition Paris Smaragdis paris@illinois.edu paris.cs.illinois.edu Today s lecture Recognizing

More information

BLIND SEPARATION OF TEMPORALLY CORRELATED SOURCES USING A QUASI MAXIMUM LIKELIHOOD APPROACH

BLIND SEPARATION OF TEMPORALLY CORRELATED SOURCES USING A QUASI MAXIMUM LIKELIHOOD APPROACH BLID SEPARATIO OF TEMPORALLY CORRELATED SOURCES USIG A QUASI MAXIMUM LIKELIHOOD APPROACH Shahram HOSSEII, Christian JUTTE Laboratoire des Images et des Signaux (LIS, Avenue Félix Viallet Grenoble, France.

More information

Speaker adaptation based on regularized speaker-dependent eigenphone matrix estimation

Speaker adaptation based on regularized speaker-dependent eigenphone matrix estimation Zhang et al EURASIP Journal on Audio, Speech, and Music Processing 2014, 2014:11 RESEARCH Speaker adaptation based on regularized speaker-dependent eigenphone matrix estimation Wen-Lin Zhang 1*,Wei-QiangZhang

More information

The Noisy Channel Model. Statistical NLP Spring Mel Freq. Cepstral Coefficients. Frame Extraction ... Lecture 10: Acoustic Models

The Noisy Channel Model. Statistical NLP Spring Mel Freq. Cepstral Coefficients. Frame Extraction ... Lecture 10: Acoustic Models Statistical NLP Spring 2009 The Noisy Channel Model Lecture 10: Acoustic Models Dan Klein UC Berkeley Search through space of all possible sentences. Pick the one that is most probable given the waveform.

More information

Statistical NLP Spring The Noisy Channel Model

Statistical NLP Spring The Noisy Channel Model Statistical NLP Spring 2009 Lecture 10: Acoustic Models Dan Klein UC Berkeley The Noisy Channel Model Search through space of all possible sentences. Pick the one that is most probable given the waveform.

More information

UNIVERSITY of PENNSYLVANIA CIS 520: Machine Learning Final, Fall 2013

UNIVERSITY of PENNSYLVANIA CIS 520: Machine Learning Final, Fall 2013 UNIVERSITY of PENNSYLVANIA CIS 520: Machine Learning Final, Fall 2013 Exam policy: This exam allows two one-page, two-sided cheat sheets; No other materials. Time: 2 hours. Be sure to write your name and

More information

The Noisy Channel Model. CS 294-5: Statistical Natural Language Processing. Speech Recognition Architecture. Digitizing Speech

The Noisy Channel Model. CS 294-5: Statistical Natural Language Processing. Speech Recognition Architecture. Digitizing Speech CS 294-5: Statistical Natural Language Processing The Noisy Channel Model Speech Recognition II Lecture 21: 11/29/05 Search through space of all possible sentences. Pick the one that is most probable given

More information

ECE 521. Lecture 11 (not on midterm material) 13 February K-means clustering, Dimensionality reduction

ECE 521. Lecture 11 (not on midterm material) 13 February K-means clustering, Dimensionality reduction ECE 521 Lecture 11 (not on midterm material) 13 February 2017 K-means clustering, Dimensionality reduction With thanks to Ruslan Salakhutdinov for an earlier version of the slides Overview K-means clustering

More information

A Cross-Associative Neural Network for SVD of Nonsquared Data Matrix in Signal Processing

A Cross-Associative Neural Network for SVD of Nonsquared Data Matrix in Signal Processing IEEE TRANSACTIONS ON NEURAL NETWORKS, VOL. 12, NO. 5, SEPTEMBER 2001 1215 A Cross-Associative Neural Network for SVD of Nonsquared Data Matrix in Signal Processing Da-Zheng Feng, Zheng Bao, Xian-Da Zhang

More information

Recent Advances in Bayesian Inference Techniques

Recent Advances in Bayesian Inference Techniques Recent Advances in Bayesian Inference Techniques Christopher M. Bishop Microsoft Research, Cambridge, U.K. research.microsoft.com/~cmbishop SIAM Conference on Data Mining, April 2004 Abstract Bayesian

More information

Optimum Sampling Vectors for Wiener Filter Noise Reduction

Optimum Sampling Vectors for Wiener Filter Noise Reduction 58 IEEE TRANSACTIONS ON SIGNAL PROCESSING, VOL. 50, NO. 1, JANUARY 2002 Optimum Sampling Vectors for Wiener Filter Noise Reduction Yukihiko Yamashita, Member, IEEE Absact Sampling is a very important and

More information

Covariance Matrix Enhancement Approach to Train Robust Gaussian Mixture Models of Speech Data

Covariance Matrix Enhancement Approach to Train Robust Gaussian Mixture Models of Speech Data Covariance Matrix Enhancement Approach to Train Robust Gaussian Mixture Models of Speech Data Jan Vaněk, Lukáš Machlica, Josef V. Psutka, Josef Psutka University of West Bohemia in Pilsen, Univerzitní

More information

Multiclass Discriminative Training of i-vector Language Recognition

Multiclass Discriminative Training of i-vector Language Recognition Odyssey 214: The Speaker and Language Recognition Workshop 16-19 June 214, Joensuu, Finland Multiclass Discriminative Training of i-vector Language Recognition Alan McCree Human Language Technology Center

More information

IEEE TRANSACTIONS ON INFORMATION THEORY, VOL. 52, NO. 2, FEBRUARY Uplink Downlink Duality Via Minimax Duality. Wei Yu, Member, IEEE (1) (2)

IEEE TRANSACTIONS ON INFORMATION THEORY, VOL. 52, NO. 2, FEBRUARY Uplink Downlink Duality Via Minimax Duality. Wei Yu, Member, IEEE (1) (2) IEEE TRANSACTIONS ON INFORMATION THEORY, VOL. 52, NO. 2, FEBRUARY 2006 361 Uplink Downlink Duality Via Minimax Duality Wei Yu, Member, IEEE Abstract The sum capacity of a Gaussian vector broadcast channel

More information

Embedded Kernel Eigenvoice Speaker Adaptation and its Implication to Reference Speaker Weighting

Embedded Kernel Eigenvoice Speaker Adaptation and its Implication to Reference Speaker Weighting IEEE TRANSACTIONS ON SPEECH AND AUDIO PROCESSING, December 21, 2005 1 Embedded Kernel Eigenvoice Speaker Adaptation and its Implication to Reference Speaker Weighting Brian Mak, Roger Hsiao, Simon Ho,

More information

Hidden Markov Models Part 2: Algorithms

Hidden Markov Models Part 2: Algorithms Hidden Markov Models Part 2: Algorithms CSE 6363 Machine Learning Vassilis Athitsos Computer Science and Engineering Department University of Texas at Arlington 1 Hidden Markov Model An HMM consists of:

More information

Lecture 21: Spectral Learning for Graphical Models

Lecture 21: Spectral Learning for Graphical Models 10-708: Probabilistic Graphical Models 10-708, Spring 2016 Lecture 21: Spectral Learning for Graphical Models Lecturer: Eric P. Xing Scribes: Maruan Al-Shedivat, Wei-Cheng Chang, Frederick Liu 1 Motivation

More information

Cheng Soon Ong & Christian Walder. Canberra February June 2018

Cheng Soon Ong & Christian Walder. Canberra February June 2018 Cheng Soon Ong & Christian Walder Research Group and College of Engineering and Computer Science Canberra February June 2018 Outlines Overview Introduction Linear Algebra Probability Linear Regression

More information

Speaker Verification Using Accumulative Vectors with Support Vector Machines

Speaker Verification Using Accumulative Vectors with Support Vector Machines Speaker Verification Using Accumulative Vectors with Support Vector Machines Manuel Aguado Martínez, Gabriel Hernández-Sierra, and José Ramón Calvo de Lara Advanced Technologies Application Center, Havana,

More information

Expectation Maximization

Expectation Maximization Expectation Maximization Machine Learning CSE546 Carlos Guestrin University of Washington November 13, 2014 1 E.M.: The General Case E.M. widely used beyond mixtures of Gaussians The recipe is the same

More information

Discriminative Direction for Kernel Classifiers

Discriminative Direction for Kernel Classifiers Discriminative Direction for Kernel Classifiers Polina Golland Artificial Intelligence Lab Massachusetts Institute of Technology Cambridge, MA 02139 polina@ai.mit.edu Abstract In many scientific and engineering

More information

A TWO-LAYER NON-NEGATIVE MATRIX FACTORIZATION MODEL FOR VOCABULARY DISCOVERY. MengSun,HugoVanhamme

A TWO-LAYER NON-NEGATIVE MATRIX FACTORIZATION MODEL FOR VOCABULARY DISCOVERY. MengSun,HugoVanhamme A TWO-LAYER NON-NEGATIVE MATRIX FACTORIZATION MODEL FOR VOCABULARY DISCOVERY MengSun,HugoVanhamme Department of Electrical Engineering-ESAT, Katholieke Universiteit Leuven, Kasteelpark Arenberg 10, Bus

More information

SINGLE CHANNEL SPEECH MUSIC SEPARATION USING NONNEGATIVE MATRIX FACTORIZATION AND SPECTRAL MASKS. Emad M. Grais and Hakan Erdogan

SINGLE CHANNEL SPEECH MUSIC SEPARATION USING NONNEGATIVE MATRIX FACTORIZATION AND SPECTRAL MASKS. Emad M. Grais and Hakan Erdogan SINGLE CHANNEL SPEECH MUSIC SEPARATION USING NONNEGATIVE MATRIX FACTORIZATION AND SPECTRAL MASKS Emad M. Grais and Hakan Erdogan Faculty of Engineering and Natural Sciences, Sabanci University, Orhanli

More information

On Information Maximization and Blind Signal Deconvolution

On Information Maximization and Blind Signal Deconvolution On Information Maximization and Blind Signal Deconvolution A Röbel Technical University of Berlin, Institute of Communication Sciences email: roebel@kgwtu-berlinde Abstract: In the following paper we investigate

More information

CEPSTRAL analysis has been widely used in signal processing

CEPSTRAL analysis has been widely used in signal processing 162 IEEE TRANSACTIONS ON SPEECH AND AUDIO PROCESSING, VOL. 7, NO. 2, MARCH 1999 On Second-Order Statistics and Linear Estimation of Cepstral Coefficients Yariv Ephraim, Fellow, IEEE, and Mazin Rahim, Senior

More information

MAP adaptation with SphinxTrain

MAP adaptation with SphinxTrain MAP adaptation with SphinxTrain David Huggins-Daines dhuggins@cs.cmu.edu Language Technologies Institute Carnegie Mellon University MAP adaptation with SphinxTrain p.1/12 Theory of MAP adaptation Standard

More information