Parsimonious Gaussian Mixture Models


1 Parsimonious Gaussian Mixture Models Brendan Murphy, Department of Statistics, Trinity College Dublin, Ireland. [Map: olive oil sampling regions: East Liguria, West Liguria, Umbria, North Apulia, Coastal Sardinia, Inland Sardinia, South Apulia, Calabria, Sicily] 1

2 Outline Data From Food Authenticity Studies Italian Olive Oils Italian Wines Background: Model-Based Clustering Data Reduction and Clustering Factor Analysis Methods: Parsimonious Gaussian Mixture Models 2

3 Acknowledgements This work has been done in collaboration with: Paul McNicholas, Department of Statistics, Trinity College Dublin. This work is supported by a Science Foundation Ireland Basic Research Grant (04/BR/M0057). Much of this work was carried out while visiting CSSS. Thanks to Nema for producing the results of Slide 33 at extremely short notice. 3

4 Food Authenticity Studies An authentic food is one that is exactly what it claims to be. Important aspects of food description include: Process history Geographic origin Species or variety Purity and adulteration Food producers and consumers need to be assured of the authenticity of their food purchases. Food authenticity studies are concerned with establishing if foods are authentic or not. 4

5 Analytical Techniques Many analytical chemistry techniques are used in food authenticity studies. These include: Gas chromatography Mass spectroscopy Vibrational spectroscopic techniques (Raman, ultraviolet, mid-infrared, near-infrared and visible). These techniques have been shown to be capable of discriminating between sets of similar biological materials. Some of these techniques are slow and difficult, while others are quick and easy. 5

6 Italian Olive Oils Forina et al (1982, 1983) report the percentage composition of eight fatty acids found in the lipid fraction of 572 Italian olive oils. Fatty Acids palmitic palmitoleic stearic oleic linoleic linolenic arachidic eicosenoic These data are available in the GGobi package (Swayne et al, 2003). 6

7 Italian Olive Oils: Origin The data are used to classify the olive oil samples to their geographic origin. [Map: sampling regions: East Liguria, West Liguria, Umbria, North Apulia, Coastal Sardinia, Inland Sardinia, South Apulia, Calabria, Sicily] 7

8 Italian Wines Forina et al (1986) used twenty-eight chemical properties of Italian wines from the Asti region to classify the wines into their specific type (Barolo, Grignolino, Barbera). A subset of thirteen variables is available from the gclus library (Hurley, 2004) for R. These data are also in the UCI Machine Learning Database. Chemical Properties Alcohol Malic Acid Ash Alcalinity of Ash Magnesium Total Phenols Flavanoids Nonflavanoid Phenols Proanthocyanins Color Intensity Hue OD280/OD315 of Diluted Wines Proline 8

9 Model-based Clustering and Discriminant Analysis Model-based clustering (Banfield and Raftery, 1993; Fraley and Raftery, 2000; 2002) uses normal mixtures to develop a flexible suite of cluster analysis methods. Model-based clustering uses constraints on the group covariance matrices; the constraints use the eigenvalue decomposition of the covariance matrices to impose shape restrictions on the groups. Bensmail and Celeux (1996) developed discriminant analysis methods using the same covariance decomposition. The decomposition is of the form Σ_g = λ_g D_g A_g D_g^T, where λ_g is a constant, D_g is an orthonormal matrix, and A_g is a diagonal matrix with det(A_g) = 1. 9

10 Covariance Parameters Interpretations for the parameters are: λ_g (volume), A_g (shape), D_g (orientation). These parameters can be constrained in various ways, giving the models EII, VII, EEI, VEI, EVI, VVI, EEE, EEV, VEV, VVV. 10
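As an aside (a minimal sketch, not part of the original slides, using an arbitrary simulated covariance), the volume/shape/orientation components can be read off a plain eigendecomposition:

```r
# Sketch: recover volume (lambda), shape (A) and orientation (D) from a
# sample covariance matrix, so that Sigma = lambda * D %*% A %*% t(D).
set.seed(1)
X <- matrix(rnorm(600), ncol = 3) %*% matrix(c(2, 0, 0, 1, 1, 0, 0, 0, 0.5), 3, 3)
Sigma <- cov(X)
e <- eigen(Sigma)
lambda <- prod(e$values)^(1 / 3)            # volume: det(Sigma)^(1/p)
A <- diag(e$values / lambda)                # shape: diagonal, det(A) = 1
D <- e$vectors                              # orientation: orthonormal eigenvectors
max(abs(Sigma - lambda * D %*% A %*% t(D))) # reconstruction error, ~0
```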

11 Model-Based Clustering Let y_1, y_2, ..., y_M be unlabelled data that we wish to cluster. Model-based clustering is based on the likelihood function, f(y_1, ..., y_M | θ_1, ..., θ_G) = ∏_{m=1}^{M} Σ_{g=1}^{G} π_g f(y_m | θ_g). The likelihood function is maximized using the EM algorithm to get estimates for the parameters. Observations are clustered on the basis of estimated values for the posterior probability of component membership, P{Component g | y} = π̂_g f(y | θ̂_g) / Σ_{g'=1}^{G} π̂_{g'} f(y | θ̂_{g'}). 11
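A minimal base-R sketch of this clustering rule (not from the slides; it assumes the mixture parameters have already been estimated, and the helper names dmvnorm_basic and posterior are hypothetical):

```r
# E-step: posterior probabilities of component membership for one observation y.
dmvnorm_basic <- function(x, mu, Sigma) {
  p <- length(mu)
  R <- chol(Sigma)                                   # Sigma = t(R) %*% R
  z <- backsolve(R, x - mu, transpose = TRUE)        # solves t(R) z = x - mu
  exp(-0.5 * sum(z^2) - sum(log(diag(R))) - 0.5 * p * log(2 * pi))
}
posterior <- function(y, pi_g, mu_list, Sigma_list) {
  dens <- mapply(function(pi, mu, S) pi * dmvnorm_basic(y, mu, S),
                 pi_g, mu_list, Sigma_list)
  dens / sum(dens)                                   # P(component g | y)
}
```

Each observation is then assigned to the component with the largest posterior probability.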

12 Data Reduction, Clustering and Classification Chang (1983) showed that the principal components corresponding to the larger eigenvalues do not necessarily contain information about group structure. Data reduction and clustering separately may not be a good idea! 12

13 Model-based Clustering & Variable Selection Raftery and Dean (2004) recently developed a version of model-based clustering that includes variable selection. With their method, variables are selected in a step-wise manner. Their method involves the stages: Find the variable with the greatest evidence of clustering given the already selected variables. Remove a variable from the set of selected variables if it no longer has evidence of clustering. This is one approach that avoids the problems of data reduction followed by clustering. 13

14 Factor Analysis The factor analysis model assumes that observed values are conditionally independent given a latent variable. Specifically, X = µ + ΛU + ε, or element-wise, X_j = µ_j + λ_{j1} U_1 + λ_{j2} U_2 + ... + λ_{jq} U_q + ε_j for j = 1, 2, ..., p, where, independently, U ~ MVN(0, I), ε ~ MVN(0, Ψ) and Ψ = diag(σ²_1, σ²_2, ..., σ²_p). 14

15 Factor Analysis The resulting distribution for X is X ~ MVN(µ, ΛΛ^T + Ψ). Λ is called the loading matrix. Ψ is the noise variance. Λ is not defined uniquely: if Λ is replaced by Λ* = ΛD, where D is orthonormal, then Λ*(Λ*)^T + Ψ = ΛΛ^T + Ψ. As a result, the covariance has pq - q(q-1)/2 + p free parameters. 15
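A quick simulation check of this covariance identity (a sketch, not from the slides, with arbitrary parameter values):

```r
# Simulate from X = mu + Lambda U + eps and compare the sample covariance
# with Lambda %*% t(Lambda) + Psi.
set.seed(2)
p <- 8; q <- 2; n <- 1e5
Lambda <- matrix(rnorm(p * q), p, q)
Psi <- diag(runif(p, 0.2, 1))
U <- matrix(rnorm(n * q), n, q)
eps <- matrix(rnorm(n * p), n, p) %*% diag(sqrt(diag(Psi)))
X <- U %*% t(Lambda) + eps                         # mu = 0 here
max(abs(cov(X) - (Lambda %*% t(Lambda) + Psi)))    # small for large n
p * q - q * (q - 1) / 2 + p                        # free covariance parameters
```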

16 Probabilistic Principal Components Analysis Tipping and Bishop (1999a) developed the probabilistic principal components analysis (PPCA) model. This model is equivalent to imposing an isotropy constraint on the noise variance, Ψ = diag(ψ, ψ, ..., ψ) = ψI, in the factor analysis model. The covariance in this model has pq - q(q-1)/2 + 1 free parameters. 16

17 Mixture of Factor Analyzers Model The Mixture of Factor Analyzers (MFA) model assumes a normal mixture model. The covariance of each component has a factor analysis covariance structure. So, X ~ Σ_{g=1}^{G} π_g MVN(µ_g, Λ_g Λ_g^T + Ψ_g). This model was developed by Ghahramani and Hinton (1997) and further developed by McLachlan et al (2002, 2003). Tipping and Bishop (1999b) developed a Mixture of Probabilistic Principal Components Analysers (MPPCA) model (Ψ_g = σ²_g I). 17

18 Constraints We can constrain the Λ_g and Ψ_g parameters in the MFA model across groups to reduce the number of parameters. We also have the option of assuming that Ψ_g = ψ_g I. This leads to eight Parsimonious Gaussian Mixture Models:
ModelID  Loading        Noise          Isotropic
CCC      Constrained    Constrained    Constrained
CCU      Constrained    Constrained    Unconstrained
CUC      Constrained    Unconstrained  Constrained
CUU      Constrained    Unconstrained  Unconstrained
UCC      Unconstrained  Constrained    Constrained
UCU      Unconstrained  Constrained    Unconstrained
UUC      Unconstrained  Unconstrained  Constrained
UUU      Unconstrained  Unconstrained  Unconstrained
For example, with G = 2 and p = 8 the CCC model has 9 covariance parameters when q = 1 and 22 when q = 3.
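The covariance-parameter counts tabulated on this slide can be reproduced from the free-parameter formulas on Slides 15 and 16. The function below is a counting sketch, not code from the talk (pgmm_cov_params is a hypothetical name); it matches the surviving entries for the CCC model.

```r
# Covariance parameters for each PGMM, given G groups, p variables, q factors.
pgmm_cov_params <- function(model, G, p, q) {
  load_per_group <- p * q - q * (q - 1) / 2
  loading <- if (substr(model, 1, 1) == "C") load_per_group else G * load_per_group
  noise_groups <- if (substr(model, 2, 2) == "C") 1 else G      # Psi shared or group-specific
  noise_size <- if (substr(model, 3, 3) == "C") 1 else p        # isotropic or diagonal
  loading + noise_groups * noise_size
}
models <- c("CCC", "CCU", "CUC", "CUU", "UCC", "UCU", "UUC", "UUU")
sapply(models, pgmm_cov_params, G = 2, p = 8, q = 1)   # CCC gives 9
sapply(models, pgmm_cov_params, G = 2, p = 8, q = 3)   # CCC gives 22
```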

19 Model Fitting The Parsimonious Gaussian mixture models are fitted using the AECM algorithm (Meng and van Dyk, 1997). The ECM algorithm (Meng and Rubin, 1993) replaces the M-step by a series of conditional maximization steps. The AECM algorithm (Meng and van Dyk, 1997) allows a different specification of the complete-data for each conditional maximization step. McLachlan and Krishnan (1997) give an extensive review of the EM algorithm and its variants. McLachlan and Peel (2000) give extensive details of the fitting algorithm in the UUU case. 19

20 Three Likelihoods The likelihood function for this mixture is L = f(x | π_g, µ_g, Λ_g, Ψ_g) = ∏_{n=1}^{N} Σ_{g=1}^{G} π_g φ(x_n | µ_g, Λ_g Λ_g^T + Ψ_g). The first complete-data likelihood function is L_1 = f(x, z | π_g, µ_g, Λ_g, Ψ_g) = ∏_{n=1}^{N} ∏_{g=1}^{G} [π_g φ(x_n | µ_g, Λ_g Λ_g^T + Ψ_g)]^{z_ng}. The second complete-data likelihood function is L_2 = f(x, z, u | π_g, µ_g, Λ_g, Ψ_g) = ∏_{n=1}^{N} ∏_{g=1}^{G} [π_g φ(x_n | µ_g + Λ_g u_n, Ψ_g) φ(u_n | 0, I)]^{z_ng}. 20

21 AECM: Stage 1 (π_g and µ_g) The missing data are the component membership labels z_ng. These are replaced by their expected values, ẑ_ng ∝ π̂_g φ(x_n | µ̂_g, Λ̂_g Λ̂_g^T + Ψ̂_g). This leads to the expected complete-data log-likelihood, Q_1 = Σ_{g=1}^{G} N_g log π_g - (Np/2) log 2π - Σ_{g=1}^{G} (N_g/2) log|Λ_g Λ_g^T + Ψ_g| - Σ_{g=1}^{G} (N_g/2) tr{S_g (Λ_g Λ_g^T + Ψ_g)^{-1}}, where N_g = Σ_{n=1}^{N} ẑ_ng and S_g = (1/N_g) Σ_{n=1}^{N} ẑ_ng (x_n - µ̂_g)(x_n - µ̂_g)^T. 21

22 AECM: Stage 1 (π_g and µ_g) Maximizing Q_1 with respect to µ_g and π_g gives the estimates µ̂_g = Σ_{n=1}^{N} ẑ_ng x_n / Σ_{n=1}^{N} ẑ_ng and π̂_g = N_g / N. 22
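In code, the stage-1 updates are one line each; the sketch below is illustrative and not from the slides (zhat is assumed to be the N x G matrix of posterior membership probabilities and X the N x p data matrix):

```r
# Stage-1 AECM updates for the mixing proportions and group means.
stage1_updates <- function(X, zhat) {
  Ng  <- colSums(zhat)            # effective group sizes N_g
  pig <- Ng / nrow(X)             # pi_g = N_g / N
  mug <- t(zhat) %*% X / Ng       # G x p matrix; row g is mu_g
  list(pi = pig, mu = mug)
}
```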

23 AECM: Stage 2 (Λ_g and Ψ_g) The missing data are the z_ng and the latent variables u_n. The expected complete-data log-likelihood can be shown to be Q_2 = C + Σ_{g=1}^{G} N_g [ log π_g + (1/2) log|Ψ_g^{-1}| - (1/2) tr{Ψ_g^{-1} S_g} + tr{Ψ_g^{-1} Λ_g B̂_g S_g} - (1/2) tr{Λ_g^T Ψ_g^{-1} Λ_g (I - B̂_g Λ̂_g + B̂_g S_g B̂_g^T)} ], where B̂_g = Λ̂_g^T (Λ̂_g Λ̂_g^T + Ψ̂_g)^{-1}. Maximizing this with respect to Λ_g and Ψ_g gives new estimates for these parameters. How we do this depends on the constraints... 23

24 AECM: An Aside Graybill (1983) and Lütkepohl (1996) give matrix differential results that help with the maximization. In particular, the following useful identities: ∂ log|X| / ∂X = X^{-1}, ∂ tr(XA) / ∂X = A^T, ∂ tr(AXB) / ∂X = A^T B^T, ∂ tr(XAXB) / ∂X = B^T X^T A^T + A^T X^T B^T. 24

25 AECM: Stage 2 (Λ_g and Ψ_g Constraints) Constraints are implemented by replacing the Λ_g and Ψ_g terms with the appropriate version and then maximizing. For example, the UCU estimates are: B̂_g = Λ̂_g^T (Λ̂_g Λ̂_g^T + Ψ̂)^{-1}, Λ̂_g^new = S_g B̂_g^T (I - B̂_g Λ̂_g + B̂_g S_g B̂_g^T)^{-1}, Ψ̂^new = Σ_{g=1}^{G} π̂_g diag{S_g - Λ̂_g^new B̂_g S_g}. The CUU constrained model is more complicated than the others. 25
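A direct transcription of the UCU updates into R (a sketch, not from the talk; Lambda_list, Psi, S_list and pig are assumed to hold the current loadings, the shared diagonal noise matrix, the weighted group scatter matrices and the mixing proportions):

```r
# One stage-2 AECM update under the UCU constraints.
stage2_ucu <- function(Lambda_list, Psi, S_list, pig) {
  G <- length(Lambda_list); p <- nrow(Psi)
  Lambda_new <- vector("list", G)
  Psi_new <- matrix(0, p, p)
  for (g in seq_len(G)) {
    Lg <- Lambda_list[[g]]; Sg <- S_list[[g]]
    Bg <- t(Lg) %*% solve(Lg %*% t(Lg) + Psi)                  # B_g
    Theta <- diag(ncol(Lg)) - Bg %*% Lg + Bg %*% Sg %*% t(Bg)  # I - B_g Lambda_g + B_g S_g B_g^T
    Lambda_new[[g]] <- Sg %*% t(Bg) %*% solve(Theta)
    Psi_new <- Psi_new + pig[g] * diag(diag(Sg - Lambda_new[[g]] %*% Bg %*% Sg))
  }
  list(Lambda = Lambda_new, Psi = Psi_new)
}
```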

26 Model Selection Model selection was done using BIC, BIC = 2(Maximized Log-Likelihood) - log(n)(Number of Parameters). Three model features are chosen: constraints; components (G); latent factor dimension (q). For small problems, an exhaustive search was possible. For larger problems, a local search strategy can be used. 26
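An exhaustive BIC search over the three model features can be sketched as below (an illustration only; fit_pgmm is a placeholder for whatever routine returns the maximized log-likelihood and parameter count of a fitted model):

```r
# Pick the constraint set, G and q with the largest BIC.
bic_search <- function(X, models, Gs, qs, fit_pgmm) {
  grid <- expand.grid(model = models, G = Gs, q = qs, stringsAsFactors = FALSE)
  grid$BIC <- mapply(function(m, G, q) {
    fit <- fit_pgmm(X, m, G, q)
    2 * fit$loglik - log(nrow(X)) * fit$npar
  }, grid$model, grid$G, grid$q)
  grid[which.max(grid$BIC), ]
}
```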

27 Results: Italian Olive Oils The eight Parsimonious Gaussian mixture models (CCC, CCU, ..., UUU), G = 1, 2, ..., 14 and q = 1, 2, ..., 5 were fitted. The best model is a UCU model with (G = 7, q = 5). [Figure: BIC values over latent dimension q and number of components G] 27

28 Results: Italian Olive Oils Classification table for the best PGMM (Rand Index = 0.90, Adjusted Rand Index = 0.64): N. Apulia 24, 1; Calabria 48, 8; S. Apulia; Sicily; I. Sardinia 64, 1; C. Sardinia 33; E. Liguria 50; W. Liguria 50; Umbria 51. 28
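The quoted Rand and adjusted Rand indices can be computed from any such cross-tabulation; a base-R sketch (not from the slides, with a hypothetical function name) is:

```r
# Rand and adjusted Rand indices for two label vectors a and b.
rand_indices <- function(a, b) {
  tab <- table(a, b)
  n <- sum(tab)
  comb2 <- function(x) x * (x - 1) / 2
  sum_ij <- sum(comb2(tab))
  sum_a  <- sum(comb2(rowSums(tab)))
  sum_b  <- sum(comb2(colSums(tab)))
  total  <- comb2(n)
  rand <- (total + 2 * sum_ij - sum_a - sum_b) / total
  exp_ij <- sum_a * sum_b / total
  ari <- (sum_ij - exp_ij) / ((sum_a + sum_b) / 2 - exp_ij)
  c(Rand = rand, ARI = ari)
}
```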

29 Results: Italian Olive Oils Classification table for the best model found using mclust (Rand Index = 0.93, Adjusted Rand Index = 0.78): N. Apulia 25; Calabria 56; S. Apulia; Sicily 36; I. Sardinia 65; C. Sardinia 33; E. Liguria 50; W. Liguria 50; Umbria 51. 29

30 Results: Italian Wines Fitted eight Parsimonious Gaussian Mixture Models (CCC, CCU, ..., UUU), G = 1, 2, ..., 8 and q = 1, 2, ..., 5. The best model is a CUU model with (G = 4, q = 2). [Figure: BIC values over latent dimension q and number of components G] 30

31 Results: Italian Wines Classification table for the best PGMM (Rand Index = 0.91, Adjusted Rand Index = 0.79): Barolo 59; Grignolino; Barbera 48.

32 Results: Italian Wines (mclust) Classification table for the best model found using mclust (Rand Index = 0.80, Adjusted Rand Index = 0.48). [Table: cluster counts for Barolo, Grignolino, Barbera]

33 Results: Italian Wines (Variable Selection) Using the variable selection method of Raftery and Dean (2004) we get (Rand Index = 0.88): Barolo 51, 8; Grignolino; Barbera 1, 47. Variables Selected: Malic Acid, Proline, Flavanoids, Color Intensity. 33

34 Italian Wines (A Little Known Fact?) The wines in this study were from the years: [Table: vintage years for Barolo, Grignolino and Barbera]. Could this be affecting the results? 34

35 Results: Italian Wines (A Little Deeper) Returning to the PGMM results... and assuming that the data are in year order... [Table: cluster by wine type (Barolo, Grignolino, Barbera)] The middle two clusters are almost grouped by year. 35

36 Results: Italian Wines (A Little Deeper) Returning to the mclust results... [Table: cluster by wine type (Barolo, Grignolino, Barbera)] 36

37 Discriminant Analysis Results A quick study of the use of the PGMM for discriminant analysis (semi-supervised) found the following misclassification rates. The data were randomly split into training and test sets in a 50:50 ratio; the results for 50 random splits are below. Italian Olive Oils: model UCU, q = 4, misclassification rate 5.6% (1.6). Italian Wines: model UUU, q = 3, misclassification rate 1.8% (1.4). 37
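The split-and-refit procedure can be sketched as follows (an illustration only; pgmm_da is a placeholder for a routine that fits the semi-supervised model on the training data and returns predicted labels for the test data):

```r
# Misclassification rate over repeated random 50:50 train/test splits.
misclass_rate <- function(X, y, pgmm_da, nsplits = 50) {
  rates <- replicate(nsplits, {
    idx <- sample(nrow(X), nrow(X) %/% 2)           # training half
    pred <- pgmm_da(X[idx, ], y[idx], X[-idx, ])    # predict the held-out half
    mean(pred != y[-idx])
  })
  c(mean = mean(rates), sd = sd(rates))             # e.g. a "5.6% (1.6)" style summary
}
```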

38 Conclusions Data reduction can improve clustering and classification results. Combining variable selection and clustering can give improved results. The constrained mixture of factor analyzers model leads to a family of Parsimonious Gaussian Mixture Models. These models should be especially useful in high-dimensional problems. For fixed q, the number of parameters grows linearly in dimension. Incorporating a LASSO-type constraint on the loading matrix may give a sparse solution and effectively do variable selection. 38
