Mixtures of Gaussians with Sparse Structure

Costas Boulis

1 Abstract

When fitting a mixture of Gaussians to training data there are usually two choices for the type of Gaussians used: either diagonal or full covariance. Imposing a structure, though, may be restrictive and lead to degraded performance and/or increased computation. In this work I sparsify the regression matrix of each Gaussian and experiment with two different structure-finding techniques: the difference of mutual informations and structural EM. I evaluate the approach on the 1996 NIST speaker recognition task.

2 Introduction

Most state-of-the-art systems in speech and speaker recognition use mixtures of Gaussians when fitting a probability distribution to data. Reasons for this choice are the easily implementable estimation formulas and the modeling power of mixtures of Gaussians. For example, it is known that a mixture of diagonal Gaussians can still model dependencies at the global level. An established practice when employing mixtures of Gaussians is to use either full or diagonal covariances. However, imposing a structure can be less than optimal, and a more general methodology should allow two steps: first, find the optimum structure of the model given the data, and second, find the optimum parameter values given the structure and the data.¹ Current techniques for mixtures of Gaussians focus only on the second step, with a very specific structure (either full or diagonal).

¹ Here, we describe the ML estimation methodology for both structure and parameters. One alternative is Bayesian estimation.

The first question we have to answer is what type of structure we want to estimate. For mixtures of Gaussians there are three choices: covariances, inverse covariances, or regression matrices. In all cases, selecting a structure can be seen as introducing zeros in the respective matrix. The three structures are distinctively different, and zeros in one matrix do not in general map to zeros in another. For example, we can have a sparse covariance but a full inverse covariance, or a sparse inverse covariance and a full regression matrix. There are no clear theoretical reasons why one choice of structure is more suitable than the others. However, introducing zeros in the inverse covariance can be seen as deleting arcs in an Undirected Graphical Model (UGM) where each node represents one dimension of a single Gaussian [1]. Similarly, introducing zeros in the regression matrix can be seen as deleting arcs in a Directed Graphical Model (DGM). There is a rich body of work on structure learning for UGMs and DGMs, and therefore the view of a mixture of Gaussians as a mixture of DGMs or UGMs may be advantageous. In addition, the specific problem of selecting features for linear regression has been encountered in different fields in the past.

In this work, I adopt the view of a mixture of Gaussians as a mixture of DGMs and introduce zeros in the component regression matrices [1]. Since we evaluate our method in a classification task (speaker recognition), discriminative approaches may achieve better performance than generative ones, but they are in general hard to estimate. We apply structure-finding algorithms that use both approaches. The first algorithm uses the difference of mutual informations between a target speaker and the impostors, and the second is a specific implementation of the structural EM algorithm for the mixture-of-Gaussians case. I present experimental results on the 1996 NIST speaker recognition evaluation task.
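To make the point about the three structures concrete, the short check below is my own illustration (not from the paper), assuming numpy: it builds a covariance matrix with an explicit zero entry and shows that the corresponding inverse covariance has no zero in that position.

```python
import numpy as np

# A minimal illustration (not from the paper): a zero in the covariance matrix
# does not, in general, map to a zero in the inverse covariance.
Sigma = np.array([[2.0, 0.5, 0.0],   # Sigma[0, 2] = 0: dims 0 and 2 are uncorrelated
                  [0.5, 2.0, 0.7],
                  [0.0, 0.7, 2.0]])

P = np.linalg.inv(Sigma)             # inverse covariance (precision matrix)
print(np.round(P, 3))                # the (0, 2) entry is nonzero (~0.054)
```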

3 Gaussians as Directed Graphical Models

Suppose that we have a mixture of M Gaussians:

    p(x) = \sum_{m=1}^{M} p(z = m) \, \mathcal{N}(x; \mu_m, \Sigma_m)    (1)

It is known from linear algebra that any square matrix A can be decomposed as A = LDU, where L is a lower triangular matrix, D is a diagonal matrix and U is an upper triangular matrix. In the special case where A is also symmetric and positive definite, the decomposition becomes A = U^T D U, where U is an upper triangular matrix with ones on the main diagonal. Applying this decomposition to the inverse covariance of each Gaussian, we can write U = I - B with B_{ij} = 0 for i >= j. The quadratic form in the exponent of the Gaussian can then be written as [1]:

    (\tilde{x} - B\tilde{x})^T D (\tilde{x} - B\tilde{x})    (2)

where \tilde{x} = x - \mu. The i-th element of (\tilde{x} - B\tilde{x}) can be written as \tilde{x}_i - B_{i,\{i+1:V\}} \tilde{x}_{\{i+1:V\}}, with V being the dimensionality of each vector. We can see that B_{i,\{i+1:V\}} regresses \tilde{x}_i on \tilde{x}_{\{i+1:V\}}, hence the name regression matrix. Regression schemes can be represented as Directed Graphical Models; in fact, the multivariate Gaussian can be represented as a DGM, as shown in Figure 1. Missing arcs represent zeros in the regression matrix; for example, the B matrix in Figure 1 would have B_{1,4} = B_{2,3} = 0. We can use the EM algorithm to estimate the parameters of a mixture of Gaussians, \theta = [\mu_m, B_m, D_m].

[Figure 1: A multivariate Gaussian over dimensions X_1, X_2, X_3, X_4 drawn as a Directed Graphical Model. Missing arcs correspond to zeros in the regression matrix.]
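The factorization above is easy to verify numerically. The sketch below is my own illustration (not the paper's code), assuming numpy; the helper name gaussian_to_regression is hypothetical. It recovers B and D from an arbitrary covariance via a Cholesky factorization of the precision matrix and checks that (\tilde{x} - B\tilde{x})^T D (\tilde{x} - B\tilde{x}) matches \tilde{x}^T \Sigma^{-1} \tilde{x}.

```python
import numpy as np

def gaussian_to_regression(Sigma):
    """Factor the precision matrix as Sigma^{-1} = U^T D U with U = I - B unit
    upper triangular, so the Gaussian quadratic form becomes
    (x~ - B x~)^T D (x~ - B x~).  Illustrative sketch, not the paper's code."""
    P = np.linalg.inv(Sigma)            # inverse covariance (precision)
    C = np.linalg.cholesky(P)           # P = C C^T, with C lower triangular
    d = np.diag(C)
    L = C / d                           # unit lower triangular: P = L diag(d^2) L^T
    U = L.T                             # unit upper triangular, so P = U^T D U
    B = np.eye(Sigma.shape[0]) - U      # strictly upper triangular (B_ij = 0 for i >= j)
    D = np.diag(d ** 2)
    return B, D

# Verify the identity on a random full covariance.
rng = np.random.default_rng(0)
A = rng.standard_normal((4, 4))
Sigma = A @ A.T + 4.0 * np.eye(4)       # symmetric positive definite
mu = rng.standard_normal(4)
x = rng.standard_normal(4)

B, D = gaussian_to_regression(Sigma)
xt = x - mu
lhs = xt @ np.linalg.inv(Sigma) @ xt
rhs = (xt - B @ xt) @ D @ (xt - B @ xt)
print(np.allclose(lhs, rhs))            # True
```

Sparsifying the regression matrix then simply means forcing individual entries of B to zero before re-estimating the remaining parameters.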

4 Structure Learning

In general, structure learning in DGMs is an NP-hard problem even when all the variables are observed [3]. Our case is further complicated by the fact that we have a hidden variable (the Gaussian index). In this paper we experimented with two structure-learning approaches, each with different strengths and weaknesses.

The first approach is to learn a discriminative structure, i.e. a structure that can discriminate between classes even though the parameters are estimated in an ML fashion. Our algorithm starts from the full model and deletes arcs, i.e. sets B_m^{i,j} = 0 for all m = 1:M (M is the number of Gaussian components in a mixture), according to:

    \min \{ I(X_i; X_j \mid \text{speaker}) - I(X_i; X_j \mid \text{impostors}) \}    (3)

where I(X_i; X_j) is the mutual information between elements X_i and X_j of the input vector X. Although this criterion can roughly capture discriminative structure, it is limited by the fact that all the Gaussians will have the same structure.

The second approach is based on an ML criterion, which may not be optimum for classification tasks but can assign a different structure to each component. We used structural EM [2], [4] and adapted it to the case of mixtures of Gaussians. Structural EM generalizes the EM algorithm by searching in the combined space of structures and parameters. One approach to structure finding would be to start from the full model, evaluate every possible combination of arc removals in every Gaussian, and pick the ones with the least decrease in likelihood. Unfortunately, this can be very expensive, since every time we remove an arc from one of the Gaussians we have to re-estimate all the parameters, so the EM algorithm must be run for each combination. Such an approach alternates parameter search with structure search and remains expensive even with greedy search. Structural EM, on the other hand, interleaves parameter search with structure search: instead of following the sequence E-step -> M-step -> structure search, structural EM follows E-step -> structure search -> M-step. By treating expected data as observed data, the likelihood score decomposes, and therefore local changes do not affect the likelihood terms of other parameters. In essence, structural EM has the same core idea as standard EM. If M is the structure, \Theta are the parameters and n is the iteration index, the naive approach would be to do:

    \{M_n, \Theta_n\} \rightarrow \{M_{n+1}, \Theta_{n+1}\}    (4)

whereas structural EM follows the sequence:

    \{M_n, \Theta_n\} \rightarrow \{M_{n+1}, \Theta_n\} \rightarrow \{M_{n+1}, \Theta_{n+1}\}    (5)

If we replace M with H, i.e. the hidden variables or sufficient statistics, we recognize this sequence of steps as the standard EM algorithm. For a more thorough discussion of structural EM the reader is referred to [2], which treats the structural EM algorithm for an arbitrary graphical model. In this paper we introduce a greedy pruning algorithm with step size K for mixtures of Gaussians.

Algorithm: Finding both structure and parameter values using structural EM

    Start with the full model for a given number of Gaussians.
    while (number of pruned regression coefficients < T)
        E-step: collect sufficient statistics for the given structure, i.e.
            \gamma_m(n) = p(z_n = m \mid x_n, M_{old})
        Structure search: remove one arc from a Gaussian at a time, i.e. set B_m^{i,j} = 0.
            The score associated with zeroing a single regression coefficient is
            \mathrm{Score}_{m,i,j} = 2 D_m^i B_m^{i,j} \sum_{n=1}^{N} \gamma_m(n) \, \tilde{x}_{n,m}^j (\tilde{x}_{n,m}^i - B_m^i \tilde{x}_{n,m})
                                     + D_m^i (B_m^{i,j})^2 \sum_{n=1}^{N} \gamma_m(n) (\tilde{x}_{n,m}^j)^2
            Order the coefficients in ascending order of score and let P be the set of the first K of them.
            Set the new structure to M_{new} = M_{old} \setminus P.
        M-step: calculate the new parameters given M_{new}. This step can be followed by a number of
            EM iterations to obtain better parameter values.
    end

One thing to note about the scoring criterion is that it is local: zeroing regression coefficient (m, i, j) does not involve computations on other parameters.
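To make the structure-search step concrete, here is a minimal sketch in numpy (my own illustration; the array layout and the helper names prune_scores and prune_step are assumptions, not the paper's implementation). It scores every active coefficient B_m[i, j] with the formula as reconstructed above and greedily zeros the K lowest-scoring ones; in the full algorithm this would be interleaved with the E-step and M-step as in the pseudocode.

```python
import numpy as np

def prune_scores(X, mu, B, D, gamma):
    """Score every active regression coefficient B[m, i, j] (j > i).
    The score is proportional to the expected drop in log-likelihood from
    zeroing that coefficient while all other parameters are held fixed.
    Shapes: X (N, V), mu (M, V), B (M, V, V), D (M, V), gamma (N, M)."""
    M, V = mu.shape
    scores = {}
    for m in range(M):
        Xt = X - mu[m]                    # x~_{n,m}, one row per frame
        resid = Xt - Xt @ B[m].T          # rows: x~_{n,m} - B_m x~_{n,m}
        g = gamma[:, m]                   # responsibilities of component m
        for i in range(V):
            for j in range(i + 1, V):
                if B[m, i, j] == 0.0:
                    continue              # arc already pruned
                s1 = np.sum(g * Xt[:, j] * resid[:, i])
                s2 = np.sum(g * Xt[:, j] ** 2)
                scores[(m, i, j)] = (2.0 * D[m, i] * B[m, i, j] * s1
                                     + D[m, i] * B[m, i, j] ** 2 * s2)
    return scores

def prune_step(B, scores, K):
    """Greedy pruning step: zero out the K coefficients with the smallest scores."""
    for (m, i, j) in sorted(scores, key=scores.get)[:K]:
        B[m, i, j] = 0.0
    return B
```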

5 Experiments

We evaluated our approach on the 1996 NIST speaker recognition task. The problem can be described as follows: given 21 target speakers, perform 21 binary classifications (one for each target speaker) for each of the test sentences. Each binary classification is a YES if the sentence belongs to the target speaker and NO otherwise. Under this setting, one sentence may be decided to have been generated by more than one speaker, in which case there will be at least one false alarm. Also, some of the test sentences were spoken by non-target speakers (impostors), in which case the correct answer is 21 NOs. All speakers are male and the data are extracted from the Switchboard database. The features are 20-dimensional MFCC vectors, cepstral mean normalized and with all silences and pauses removed. There are approximately 2 minutes of training data for each target speaker, and approximately 2 minutes of training data for each of 43 impostors. The test data contain impostors who do not appear in the training data.

The system consists of a mixture of Gaussians trained on each of the target speakers. For each test sentence, the log-likelihood under the impostor model is subtracted from the log-likelihood under each target-speaker model, and a YES decision is made if this difference is above a threshold. We used 100 diagonal Gaussians estimated on all 43 impostors as the impostor model, and the impostor model remained fixed throughout our experiments. Although in real operation of the system the thresholds are parameters that need to be estimated from the training data, in this evaluation the thresholds are optimized for the current test set. The results reported should therefore be viewed as a best-case scenario, but they are nevertheless useful for comparing different approaches. The metric used in all experiments is the Equal Error Rate (EER); a short sketch of how EER is computed from trial scores is given after Table 1.

In Table 1 the best results are reported for different configurations. DMI stands for Difference of Mutual Informations. minDMI sets regression coefficients to zero according to min{ I(X_i; X_j | speaker) - I(X_i; X_j | impostors) }, while maxDMI sets them to zero according to max{ I(X_i; X_j | speaker) - I(X_i; X_j | impostors) }; the latter is included to confirm that minDMI is consistently better than maxDMI. Random sets regression coefficients to zero at random, and minMI sets them to zero according to min{ I(X_i; X_j | speaker) }. minMI is introduced to evaluate whether, and by how much, the discriminative criterion offers a better structure than a purely generative one. It should be noted that all results are optimized over the number of Gaussians and the percentage of parameters pruned. Also, the best results for diagonal Gaussians were worse than the best results for full Gaussians and are therefore not reported here.

                                                     Full   minDMI   maxDMI   minMI   Random
    structure from 15 sents, training from 15 sents   6.9     6.6      7.5     7.2     7.5
    structure from 15 sents, training from 5 sents    9.4     8.8     10.4     9.4    10.1
    structure from 5 sents,  training from 5 sents    9.4     9.4      9.7     9.7    10.1

    Table 1: EER (%) for different sparse structures selected using mutual information criteria.

Table 1 shows small improvements for 15 training sentences when using minDMI over the mixture of full Gaussians. For 5 training sentences all the sparse structures perform about the same, and equal to the full case. Interestingly, if we estimate the structure with 15 sentences but do the training with 5 sentences, we see a clear advantage of minDMI over the baseline. This shows that the structure-finding criterion is valid, but also that estimates of mutual information depend strongly on the amount of training data available.
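For reference, the Equal Error Rate can be computed from per-trial scores roughly as in the sketch below (my own illustration, assuming numpy; not the evaluation tool used in the paper): the decision threshold is swept over the observed scores and the operating point where the miss and false-alarm rates are closest is reported.

```python
import numpy as np

def equal_error_rate(target_scores, impostor_scores):
    """Approximate Equal Error Rate: sweep the decision threshold over all
    observed scores and return the average of the miss and false-alarm rates
    at the point where the two are closest."""
    thresholds = np.sort(np.concatenate([target_scores, impostor_scores]))
    best_gap, eer = np.inf, None
    for t in thresholds:
        miss = np.mean(target_scores < t)            # true-speaker trials rejected
        false_alarm = np.mean(impostor_scores >= t)  # impostor trials accepted
        if abs(miss - false_alarm) < best_gap:
            best_gap, eer = abs(miss - false_alarm), 0.5 * (miss + false_alarm)
    return eer

# Example with synthetic scores (higher = more speaker-like).
rng = np.random.default_rng(0)
tgt = rng.normal(1.0, 1.0, 500)    # scores for true-speaker trials
imp = rng.normal(-1.0, 1.0, 500)   # scores for impostor trials
print(equal_error_rate(tgt, imp))  # roughly 0.16 for these synthetic scores
```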

               K = 100   K = 50
    15 sents     7.2       7.2
    5 sents      9.4       9.4

    Table 2: EER (%) for different pruning step sizes, using structural EM.

Results for structural EM do not show any improvement over the baseline, even when the pruning step is varied.

6 Conclusions

In this work the problem of sparsifying the regression matrices of mixtures of Gaussians was addressed. Two structure-finding algorithms were used, one discriminative and the other based on extensions of EM. Interesting connections can be drawn with MLLR speaker adaptation: not surprisingly, the re-estimation equations for the regression matrix bear a resemblance to the MLLR equations. So far, however, researchers have barely looked into the problem of structure finding for speaker adaptation, focusing mostly on parameter adaptation. An interesting new topic for speaker adaptation could be joint structure and parameter adaptation.

References

[1] J. Bilmes, "Factored sparse inverse covariance matrices," Proceedings of ICASSP, 2000.

[2] N. Friedman, "Learning belief networks in the presence of missing values and hidden variables," Proc. 14th International Conference on Machine Learning, pp. 125-133, 1997.

[3] D. M. Chickering, D. Geiger, and D. E. Heckerman, "Learning Bayesian Networks is NP-Hard," Technical Report MSR-TR-94-17, Microsoft Research, 1994.

[4] B. Thiesson, C. Meek, D. Chickering, and D. Heckerman, "Learning mixtures of DAG models," Technical Report MSR-TR-97-30, Microsoft Research, Redmond, WA, 1998.