Mixtures of Gaussians with Sparse Structure

Costas Boulis

1 Abstract

When fitting a mixture of Gaussians to training data there are usually two choices for the type of Gaussians used: either diagonal or full covariance. Imposing a structure, though, may be restrictive and lead to degraded performance and/or increased computation. In this work I sparsify the regression matrix of each Gaussian and experiment with two different structure-finding techniques: the difference of mutual informations and structural EM. I evaluate the approach on the 1996 NIST speaker recognition task.

2 Introduction

Most state-of-the-art systems in speech and speaker recognition use mixtures of Gaussians when fitting a probability distribution to data. Reasons for this choice are the easily implementable estimation formulas and the modeling power of mixtures of Gaussians. For example, it is known that a mixture of diagonal Gaussians can still model dependencies at the global level. An established practice when employing mixtures of Gaussians is to use either full or diagonal covariances. However, imposing a structure can be less than optimal, and a more general methodology should allow two steps: first, find the optimum structure of the model given the data, and second, find the optimum parameter values given the structure and the data.¹ Current techniques for mixtures of Gaussians focus only on the second step, with a very specific structure (either full or diagonal).

¹ Here, we describe the ML estimation methodology for both structure and parameters. One alternative is Bayesian estimation.

The first question we have to answer is what type of structure we want to estimate. For mixtures of Gaussians there are three choices: covariances, inverse covariances, or regression matrices. In all cases, selecting a structure can be seen as introducing zeros in the respective matrix. The three structures are distinctively different, and zeros in one matrix do not in general map to zeros in another. For example, we can have a sparse covariance but a full inverse covariance, or a sparse inverse covariance and a full regression matrix. There are no clear theoretical reasons why one choice of structure is more suitable than the others. However, introducing zeros in the inverse covariance can be seen as deleting arcs in an Undirected Graphical Model (UGM) where each node represents one dimension of a single Gaussian [1]. Similarly, introducing zeros in the regression matrix can be seen as deleting arcs in a Directed Graphical Model (DGM). There is a rich body of work on structure learning for UGMs and DGMs, and therefore the view of a mixture of Gaussians as a mixture of DGMs or UGMs may be advantageous. In addition, the specific problem of selecting features for linear regression has been encountered in different fields in the past.

In this work, I adopt the view of a mixture of Gaussians as a mixture of DGMs and introduce zeros in the component regression matrices [1]. Since we evaluate our method in a classification task (speaker recognition), discriminative approaches may achieve better performance than generative ones, but they are in general hard to estimate. We apply structure-finding algorithms that use both approaches. The first algorithm uses the difference of mutual informations between a target speaker and the impostors, and the second is a specific implementation of the structural EM algorithm for the mixture-of-Gaussians case. I present experimental results on the 1996 NIST speaker recognition evaluation task.
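To make the point about the three structures concrete, the short check below is my own illustration (not from the paper), assuming numpy: it builds a covariance matrix with an explicit zero entry and shows that the corresponding inverse covariance has no zero in that position.

```python
import numpy as np

# A minimal illustration (not from the paper): a zero in the covariance matrix
# does not, in general, map to a zero in the inverse covariance.
Sigma = np.array([[2.0, 0.5, 0.0],   # Sigma[0, 2] = 0: dims 0 and 2 are uncorrelated
                  [0.5, 2.0, 0.7],
                  [0.0, 0.7, 2.0]])

P = np.linalg.inv(Sigma)             # inverse covariance (precision matrix)
print(np.round(P, 3))                # the (0, 2) entry is nonzero (~0.054)
```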

3 Gaussians as Directed Graphical Models

Suppose that we have a mixture of M Gaussians:

    p(x) = \sum_{m=1}^{M} p(z = m) \, \mathcal{N}(x; \mu_m, \Sigma_m)    (1)

It is known from linear algebra that any square matrix A can be decomposed as A = LDU, where L is a lower triangular matrix, D is a diagonal matrix and U is an upper triangular matrix. In the special case where A is also symmetric and positive definite, the decomposition becomes A = U^T D U, where U is an upper triangular matrix with ones on the main diagonal. Applying this decomposition to the inverse covariance of each Gaussian, we can write U = I - B with B_{ij} = 0 for i >= j. The quadratic form in the exponent of the Gaussian can then be written as [1]:

    (\tilde{x} - B\tilde{x})^T D (\tilde{x} - B\tilde{x})    (2)

where \tilde{x} = x - \mu. The i-th element of (\tilde{x} - B\tilde{x}) can be written as \tilde{x}_i - B_{i,\{i+1:V\}} \tilde{x}_{\{i+1:V\}}, with V being the dimensionality of each vector. We can see that B_{i,\{i+1:V\}} regresses \tilde{x}_i on \tilde{x}_{\{i+1:V\}}, hence the name regression matrix. Regression schemes can be represented as Directed Graphical Models; in fact, the multivariate Gaussian can be represented as a DGM, as shown in Figure 1. Missing arcs represent zeros in the regression matrix; for example, the B matrix in Figure 1 would have B_{1,4} = B_{2,3} = 0. We can use the EM algorithm to estimate the parameters of a mixture of Gaussians, \theta = [\mu_m, B_m, D_m].

[Figure 1: A multivariate Gaussian over dimensions X_1, X_2, X_3, X_4 drawn as a Directed Graphical Model. Missing arcs correspond to zeros in the regression matrix.]
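The factorization above is easy to verify numerically. The sketch below is my own illustration (not the paper's code), assuming numpy; the helper name gaussian_to_regression is hypothetical. It recovers B and D from an arbitrary covariance via a Cholesky factorization of the precision matrix and checks that (\tilde{x} - B\tilde{x})^T D (\tilde{x} - B\tilde{x}) matches \tilde{x}^T \Sigma^{-1} \tilde{x}.

```python
import numpy as np

def gaussian_to_regression(Sigma):
    """Factor the precision matrix as Sigma^{-1} = U^T D U with U = I - B unit
    upper triangular, so the Gaussian quadratic form becomes
    (x~ - B x~)^T D (x~ - B x~).  Illustrative sketch, not the paper's code."""
    P = np.linalg.inv(Sigma)            # inverse covariance (precision)
    C = np.linalg.cholesky(P)           # P = C C^T, with C lower triangular
    d = np.diag(C)
    L = C / d                           # unit lower triangular: P = L diag(d^2) L^T
    U = L.T                             # unit upper triangular, so P = U^T D U
    B = np.eye(Sigma.shape[0]) - U      # strictly upper triangular (B_ij = 0 for i >= j)
    D = np.diag(d ** 2)
    return B, D

# Verify the identity on a random full covariance.
rng = np.random.default_rng(0)
A = rng.standard_normal((4, 4))
Sigma = A @ A.T + 4.0 * np.eye(4)       # symmetric positive definite
mu = rng.standard_normal(4)
x = rng.standard_normal(4)

B, D = gaussian_to_regression(Sigma)
xt = x - mu
lhs = xt @ np.linalg.inv(Sigma) @ xt
rhs = (xt - B @ xt) @ D @ (xt - B @ xt)
print(np.allclose(lhs, rhs))            # True
```

Sparsifying the regression matrix then simply means forcing individual entries of B to zero before re-estimating the remaining parameters.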

4 Structure Learning

In general, structure learning in DGMs is an NP-hard problem even when all the variables are observed [3]. Our case is further complicated by the fact that we have a hidden variable (the Gaussian index). In this paper we experimented with two structure-learning approaches, each with different strengths and weaknesses.

The first approach is to learn a discriminative structure, i.e. a structure that can discriminate between classes even though the parameters are estimated in an ML fashion. Our algorithm starts from the full model and deletes arcs, i.e. sets B_m^{i,j} = 0 for all m = 1:M (M is the number of Gaussian components in a mixture), according to:

    \min \{ I(X_i; X_j \mid \text{speaker}) - I(X_i; X_j \mid \text{impostors}) \}    (3)

where I(X_i; X_j) is the mutual information between elements X_i and X_j of the input vector X. Although this criterion can roughly capture discriminative structure, it is limited by the fact that all the Gaussians will have the same structure.

The second approach is based on an ML criterion, which may not be optimum for classification tasks but can assign a different structure to each component. We used structural EM [2], [4] and adapted it to the case of mixtures of Gaussians. Structural EM generalizes the EM algorithm by searching in the combined space of structures and parameters. One approach to structure finding would be to start from the full model, evaluate every possible combination of arc removals in every Gaussian, and pick the ones with the least decrease in likelihood. Unfortunately, this can be very expensive, since every time we remove an arc from one of the Gaussians we have to re-estimate all the parameters, so the EM algorithm must be run for each combination. Such an approach alternates parameter search with structure search and remains expensive even with greedy search. Structural EM, on the other hand, interleaves parameter search with structure search: instead of following the sequence E-step -> M-step -> structure search, structural EM follows E-step -> structure search -> M-step. By treating expected data as observed data, the likelihood score decomposes, and therefore local changes do not affect the likelihood terms of other parameters. In essence, structural EM has the same core idea as standard EM. If M is the structure, \Theta are the parameters and n is the iteration index, the naive approach would be to do:

    \{M_n, \Theta_n\} \rightarrow \{M_{n+1}, \Theta_{n+1}\}    (4)

whereas structural EM follows the sequence:

    \{M_n, \Theta_n\} \rightarrow \{M_{n+1}, \Theta_n\} \rightarrow \{M_{n+1}, \Theta_{n+1}\}    (5)

If we replace M with H, i.e. the hidden variables or sufficient statistics, we recognize this sequence of steps as the standard EM algorithm. For a more thorough discussion of structural EM the reader is referred to [2], which treats the structural EM algorithm for an arbitrary graphical model. In this paper we introduce a greedy pruning algorithm with step size K for mixtures of Gaussians.

Algorithm: Finding both structure and parameter values using structural EM

    Start with the full model for a given number of Gaussians.
    while (number of pruned regression coefficients < T)
        E-step: collect sufficient statistics for the given structure, i.e.
            \gamma_m(n) = p(z_n = m \mid x_n, M_{old})
        Structure search: remove one arc from a Gaussian at a time, i.e. set B_m^{i,j} = 0.
            The score associated with zeroing a single regression coefficient is
            \mathrm{Score}_{m,i,j} = 2 D_m^i B_m^{i,j} \sum_{n=1}^{N} \gamma_m(n) \, \tilde{x}_{n,m}^j (\tilde{x}_{n,m}^i - B_m^i \tilde{x}_{n,m})
                                     + D_m^i (B_m^{i,j})^2 \sum_{n=1}^{N} \gamma_m(n) (\tilde{x}_{n,m}^j)^2
            Order the coefficients in ascending order of score and let P be the set of the first K of them.
            Set the new structure to M_{new} = M_{old} \setminus P.
        M-step: calculate the new parameters given M_{new}. This step can be followed by a number of
            EM iterations to obtain better parameter values.
    end

One thing to note about the scoring criterion is that it is local: zeroing regression coefficient (m, i, j) does not involve computations on other parameters.
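To make the structure-search step concrete, here is a minimal sketch in numpy (my own illustration; the array layout and the helper names prune_scores and prune_step are assumptions, not the paper's implementation). It scores every active coefficient B_m[i, j] with the formula as reconstructed above and greedily zeros the K lowest-scoring ones; in the full algorithm this would be interleaved with the E-step and M-step as in the pseudocode.

```python
import numpy as np

def prune_scores(X, mu, B, D, gamma):
    """Score every active regression coefficient B[m, i, j] (j > i).
    The score is proportional to the expected drop in log-likelihood from
    zeroing that coefficient while all other parameters are held fixed.
    Shapes: X (N, V), mu (M, V), B (M, V, V), D (M, V), gamma (N, M)."""
    M, V = mu.shape
    scores = {}
    for m in range(M):
        Xt = X - mu[m]                    # x~_{n,m}, one row per frame
        resid = Xt - Xt @ B[m].T          # rows: x~_{n,m} - B_m x~_{n,m}
        g = gamma[:, m]                   # responsibilities of component m
        for i in range(V):
            for j in range(i + 1, V):
                if B[m, i, j] == 0.0:
                    continue              # arc already pruned
                s1 = np.sum(g * Xt[:, j] * resid[:, i])
                s2 = np.sum(g * Xt[:, j] ** 2)
                scores[(m, i, j)] = (2.0 * D[m, i] * B[m, i, j] * s1
                                     + D[m, i] * B[m, i, j] ** 2 * s2)
    return scores

def prune_step(B, scores, K):
    """Greedy pruning step: zero out the K coefficients with the smallest scores."""
    for (m, i, j) in sorted(scores, key=scores.get)[:K]:
        B[m, i, j] = 0.0
    return B
```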

5 Experiments

We evaluated our approach on the 1996 NIST speaker recognition task. The problem can be described as follows: given 21 target speakers, perform 21 binary classifications (one for each target speaker) for each of the test sentences. Each binary classification is a YES if the sentence belongs to the target speaker and NO otherwise. Under this setting, one sentence may be decided to have been generated by more than one speaker, in which case there will be at least one false alarm. Also, some of the test sentences were spoken by non-target speakers (impostors), in which case the correct answer is 21 NOs. All speakers are male and the data are extracted from the Switchboard database. The features are 20-dimensional MFCC vectors, cepstral mean normalized and with all silences and pauses removed. There are approximately 2 minutes of training data for each target speaker, and approximately 2 minutes of training data for each of 43 impostors. The test data contain impostors who do not appear in the training data.

The system consists of a mixture of Gaussians trained on each of the target speakers. For each test sentence, the log-likelihood under the impostor model is subtracted from the log-likelihood under each target-speaker model, and a YES decision is made if this difference is above a threshold. We used 100 diagonal Gaussians estimated on all 43 impostors as the impostor model, and the impostor model remained fixed throughout our experiments. Although in real operation of the system the thresholds are parameters that need to be estimated from the training data, in this evaluation the thresholds are optimized for the current test set. The results reported should therefore be viewed as a best-case scenario, but they are nevertheless useful for comparing different approaches. The metric used in all experiments is the Equal Error Rate (EER); a short sketch of how EER is computed from trial scores is given after Table 1.

In Table 1 the best results are reported for different configurations. DMI stands for Difference of Mutual Informations. minDMI sets regression coefficients to zero according to min{ I(X_i; X_j | speaker) - I(X_i; X_j | impostors) }, while maxDMI sets them to zero according to max{ I(X_i; X_j | speaker) - I(X_i; X_j | impostors) }; the latter is included to confirm that minDMI is consistently better than maxDMI. Random sets regression coefficients to zero at random, and minMI sets them to zero according to min{ I(X_i; X_j | speaker) }. minMI is introduced to evaluate whether, and by how much, the discriminative criterion offers a better structure than a purely generative one. It should be noted that all results are optimized over the number of Gaussians and the percentage of parameters pruned. Also, the best results for diagonal Gaussians were worse than the best results for full Gaussians and are therefore not reported here.

                                                     Full   minDMI   maxDMI   minMI   Random
    structure from 15 sents, training from 15 sents   6.9     6.6      7.5     7.2     7.5
    structure from 15 sents, training from 5 sents    9.4     8.8     10.4     9.4    10.1
    structure from 5 sents,  training from 5 sents    9.4     9.4      9.7     9.7    10.1

    Table 1: EER (%) for different sparse structures selected using mutual information criteria.

Table 1 shows small improvements for 15 training sentences when using minDMI over the mixture of full Gaussians. For 5 training sentences all the sparse structures perform about the same, and equal to the full case. Interestingly, if we estimate the structure with 15 sentences but do the training with 5 sentences, we see a clear advantage of minDMI over the baseline. This shows that the structure-finding criterion is valid, but also that estimates of mutual information depend strongly on the amount of training data available.
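For reference, the Equal Error Rate can be computed from per-trial scores roughly as in the sketch below (my own illustration, assuming numpy; not the evaluation tool used in the paper): the decision threshold is swept over the observed scores and the operating point where the miss and false-alarm rates are closest is reported.

```python
import numpy as np

def equal_error_rate(target_scores, impostor_scores):
    """Approximate Equal Error Rate: sweep the decision threshold over all
    observed scores and return the average of the miss and false-alarm rates
    at the point where the two are closest."""
    thresholds = np.sort(np.concatenate([target_scores, impostor_scores]))
    best_gap, eer = np.inf, None
    for t in thresholds:
        miss = np.mean(target_scores < t)            # true-speaker trials rejected
        false_alarm = np.mean(impostor_scores >= t)  # impostor trials accepted
        if abs(miss - false_alarm) < best_gap:
            best_gap, eer = abs(miss - false_alarm), 0.5 * (miss + false_alarm)
    return eer

# Example with synthetic scores (higher = more speaker-like).
rng = np.random.default_rng(0)
tgt = rng.normal(1.0, 1.0, 500)    # scores for true-speaker trials
imp = rng.normal(-1.0, 1.0, 500)   # scores for impostor trials
print(equal_error_rate(tgt, imp))  # roughly 0.16 for these synthetic scores
```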

               K = 100   K = 50
    15 sents     7.2       7.2
    5 sents      9.4       9.4

    Table 2: EER (%) for different pruning step sizes, using structural EM.

Results for structural EM do not show any improvement over the baseline, even when the pruning step is varied.

6 Conclusions

In this work the problem of sparsifying the regression matrices of mixtures of Gaussians was addressed. Two structure-finding algorithms were used, one discriminative and the other based on extensions of EM. Interesting connections can be drawn with MLLR speaker adaptation: not surprisingly, the re-estimation equations for the regression matrix bear a resemblance to the MLLR equations. So far, however, researchers have barely looked into the problem of structure finding for speaker adaptation, focusing mostly on parameter adaptation. An interesting new topic for speaker adaptation could be joint structure and parameter adaptation.

References

[1] J. Bilmes, "Factored sparse inverse covariance matrices," Proceedings of ICASSP, 2000.

[2] N. Friedman, "Learning belief networks in the presence of missing values and hidden variables," Proc. 14th International Conference on Machine Learning, pp. 125-133, 1997.

[3] D. M. Chickering, D. Geiger, and D. E. Heckerman, "Learning Bayesian Networks is NP-Hard," Technical Report MSR-TR-94-17, Microsoft Research, 1994.

[4] B. Thiesson, C. Meek, D. Chickering, and D. Heckerman, "Learning mixtures of DAG models," Technical Report MSR-TR-97-30, Microsoft Research, Redmond, WA, 1998.