Minimum Message Length Inference and Mixture Modelling of Inverse Gaussian Distributions

Daniel F. Schmidt and Enes Makalic
Centre for Molecular, Environmental, Genetic & Analytic (MEGA) Epidemiology
School of Population Health, University of Melbourne

25th Australasian Joint Conference on Artificial Intelligence (AI 2012)
Contents

1. Mixture Modelling
   - Problem Description
   - MML Mixture Models
2. MML Inverse Gaussian Distributions
   - Inverse Gaussian Distributions
   - MML Inference of Inverse Gaussians
3. Example
Problem Description

We have n items, each with q associated attributes, formed into a matrix

$$Y = \begin{pmatrix} y_1 \\ y_2 \\ \vdots \\ y_n \end{pmatrix} = \begin{pmatrix} y_{1,1} & y_{1,2} & \cdots & y_{1,q} \\ y_{2,1} & y_{2,2} & \cdots & y_{2,q} \\ \vdots & \vdots & \ddots & \vdots \\ y_{n,1} & y_{n,2} & \cdots & y_{n,q} \end{pmatrix}$$

- Group together, or cluster, similar items
- A form of unsupervised learning, sometimes called intrinsic classification
- Class labels are learned from the data
Mixture Modelling (1)

Models the data as a mixture of probability distributions

$$p(y_{i,j}; \Phi) = \sum_{k=1}^{K} \alpha_k \, p(y_{i,j}; \theta_{k,j})$$

where
- K is the number of classes
- $\alpha = (\alpha_1, \ldots, \alpha_K)$ are the mixing (population) weights
- $\theta_{k,j}$ are the parameters of the distributions
- $\Phi = \{K, \alpha, \theta_{1,1}, \ldots, \theta_{K,q}\}$ denotes the complete mixture model

Has an explicit probabilistic form, which allows for statistical interpretation (a code sketch follows below).
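To make the notation concrete, here is a minimal Python sketch of evaluating the mixture density; the Gaussian class densities and all parameter values are illustrative assumptions, not choices made in the paper.

```python
import numpy as np
from scipy.stats import norm

# Illustrative K = 2 mixture over a single attribute; any class
# density family (normal, inverse Gaussian, Poisson, ...) works here.
alphas = np.array([0.3, 0.7])        # mixing weights alpha_k, sum to 1
thetas = [(0.0, 1.0), (4.0, 2.0)]    # assumed (mean, sd) for each class

def mixture_density(y):
    """p(y; Phi) = sum_k alpha_k * p(y; theta_k)."""
    return sum(a * norm.pdf(y, mu, sd) for a, (mu, sd) in zip(alphas, thetas))

print(mixture_density(1.5))  # density of the two-class mixture at y = 1.5
```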
Mixture Modelling (2)

How is this related to clustering?
- Each class is a cluster
- Class-specific probability distributions over each attribute, e.g., normal, inverse Gaussian, Poisson, etc.
- The mixing weight is the prevalence of the class in the population
- Measure of similarity of an item to a class:

$$p_k(y_i) = \prod_{j=1}^{q} p(y_{i,j}; \theta_{k,j})$$

the probability of the item's attributes under the class distributions
Mixture Modelling (3)

Membership of items to classes is soft:

$$r_{i,k} = \frac{\alpha_k \, p_k(y_i)}{\sum_{l=1}^{K} \alpha_l \, p_l(y_i)}$$

- $r_{i,k}$ is the posterior probability of item i belonging to class k
- $\alpha_k$ is the a priori probability that an item belongs to class k
- $p_k(y_i)$ is the probability of data item $y_i$ under class k

Assign each item to the class with the highest posterior probability. The total number of samples in a class is then

$$n_k = \sum_{i=1}^{n} r_{i,k}$$

(sketched in code below).
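A minimal sketch of the soft assignment, assuming the per-class densities $p_k(y_i)$ have already been evaluated into an (n, K) array:

```python
import numpy as np

def responsibilities(alphas, class_densities):
    """Posterior class memberships r[i, k] and effective class sizes n_k.

    alphas:          length-K array of mixing weights
    class_densities: (n, K) array with entry (i, k) = p_k(y_i), i.e. the
                     product of the attribute densities for item i, class k
    """
    weighted = class_densities * alphas              # alpha_k * p_k(y_i)
    r = weighted / weighted.sum(axis=1, keepdims=True)
    n_k = r.sum(axis=0)                              # n_k = sum_i r[i, k]
    assignment = r.argmax(axis=1)                    # hard assignment
    return r, n_k, assignment
```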
MML Mixture Models (1)

- Minimum Message Length (MML) goodness-of-fit criterion
- A popular criterion for mixture modelling
- Based on the idea of compression

The message length of the data is our yardstick; it comprises:
1. The length of the codeword needed to state the model Φ:
   - number of classes: I(K)
   - relative abundances: I(α)
   - parameters for each distribution in each class: $I(\theta_{k,j})$
2. The length of the codeword needed to state the data, given the model: $I(Y \mid \Phi)$
MML Mixture Models (2)

Total message length:

$$I(Y, \Phi) = I(K) + I(\alpha) + \sum_{k=1}^{K} \sum_{j=1}^{q} I(\theta_{k,j}) + I(Y \mid \Phi)$$

which balances model complexity against model fit.

- Estimate Φ by minimising the message length
- $\hat\alpha$ and $\hat\theta_{k,j}$ are found by expectation-maximisation
- Find $\hat{K}$ by splitting/merging classes (a schematic of the search follows below)
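A schematic of the model-selection loop. The helpers `em_fit` (EM to convergence for a fixed K) and `message_length` (evaluating I(Y, Φ)) are hypothetical stand-ins for the components described above, and the plain sweep over K is a simplification of the split/merge search.

```python
def select_mixture(Y, K_max, em_fit, message_length):
    """Return the fitted mixture minimising the total message length."""
    best_I, best_phi = float("inf"), None
    for K in range(1, K_max + 1):
        phi = em_fit(Y, K)            # EM: alternate responsibilities / estimates
        I = message_length(Y, phi)    # I(K) + I(alpha) + sum I(theta) + I(Y | Phi)
        if I < best_I:
            best_I, best_phi = I, phi
    return best_phi, best_I
```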
Contents

1. Mixture Modelling
   - Problem Description
   - MML Mixture Models
2. MML Inverse Gaussian Distributions
   - Inverse Gaussian Distributions
   - MML Inference of Inverse Gaussians
3. Example
Inverse Gaussian Distributions (1)

- A distribution for positive, continuous data
- We say $Y_i \sim \mathrm{IG}(\mu, \lambda)$ if the p.d.f. for $Y_i = y_i$ is

$$p(y_i; \mu, \lambda) = \left( \frac{1}{2\pi \lambda y_i^3} \right)^{1/2} \exp\left( -\frac{(y_i - \mu)^2}{2 \mu^2 \lambda y_i} \right)$$

where $\mu > 0$ is the mean parameter and $\lambda > 0$ is the inverse-shape parameter
- Suitable for positively skewed data
- Goal: derive the message length formula for use in mixture modelling (the density is transcribed into code below)
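The density transcribed directly into code, a small sketch using the slide's mean/inverse-shape parameterisation (written out explicitly because scipy's `invgauss` uses a different parameterisation):

```python
import numpy as np

def invgauss_pdf(y, mu, lam):
    """IG(mu, lam) density: mean mu > 0, inverse-shape lam > 0.

    p(y; mu, lam) = (2 pi lam y^3)^(-1/2) exp(-(y - mu)^2 / (2 mu^2 lam y))
    """
    y = np.asarray(y, dtype=float)
    return (2 * np.pi * lam * y**3) ** -0.5 * np.exp(
        -((y - mu) ** 2) / (2 * mu**2 * lam * y)
    )

# The three curves on the next slide: (mu, lam) in {(1, 1), (1, 3), (3, 1)}
```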
Inverse Gaussian Distributions (2)

[Figure: example inverse Gaussian densities $p(y; \mu, \lambda)$ for $(\mu = 1, \lambda = 1)$, $(\mu = 1, \lambda = 3)$ and $(\mu = 3, \lambda = 1)$, plotted over $0 \le y \le 3$]
MML Inference of Inverse Gaussians (1)

- Use the Wallace-Freeman approximation
- Bayesian; we chose uninformative priors

$$\pi(\mu, \lambda) \propto \frac{1}{\lambda \mu^{3/2}}$$

- Message length component for use in mixture models:

$$I(\theta_{k,j}) = \log n_k - \frac{1}{2} \log \left( \frac{\hat\lambda_{k,j}}{2 a_j^2} \right) + \log b_j$$

where
- $\hat\lambda_{k,j}$ is the MML estimate of λ for class k and variable j
- $n_k$ is the number of samples in class k
- $a_j$, $b_j$ are hyper-parameters

Details may be found in the paper.
MML Inference of Inverse Gaussians (2)

Let $y = (y_1, \ldots, y_n)$ be data from an inverse Gaussian. Define the sufficient statistics

$$S_1 = \sum_{i=1}^{n} y_i, \qquad S_2 = \sum_{i=1}^{n} \frac{1}{y_i}$$

Compare the maximum likelihood estimates

$$\hat\mu_{\mathrm{ML}} = \frac{S_1}{n}, \qquad \hat\lambda_{\mathrm{ML}} = \frac{S_1 S_2 - n^2}{n S_1}$$

to the minimum message length estimates

$$\hat\mu_{87} = \frac{S_1}{n}, \qquad \hat\lambda_{87} = \frac{S_1 S_2 - n^2}{(n-1) S_1}$$

The MML estimates (compared in code below):
1. are unbiased
2. strictly dominate the ML estimates in terms of KL risk
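The two estimators differ only in the denominator of $\hat\lambda$. A small sketch computing both from data; the simulated input is purely illustrative (numpy's Wald sampler draws inverse Gaussian variates):

```python
import numpy as np

def ig_estimates(y):
    """ML and MML ('87) estimates of IG(mu, lambda) from data y."""
    y = np.asarray(y, dtype=float)
    n = len(y)
    S1, S2 = y.sum(), (1.0 / y).sum()    # sufficient statistics
    mu_hat = S1 / n                      # identical for ML and MML
    lam_ml = (S1 * S2 - n**2) / (n * S1)
    lam_87 = (S1 * S2 - n**2) / ((n - 1) * S1)
    return mu_hat, lam_ml, lam_87

# Example on simulated data; note numpy's 'scale' is the *shape* 1/lambda,
# so the true inverse-shape parameter here is lambda = 1/4.
rng = np.random.default_rng(0)
y = rng.wald(mean=2.0, scale=4.0, size=50)
print(ig_estimates(y))
```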
Contents

1. Mixture Modelling
   - Problem Description
   - MML Mixture Models
2. MML Inverse Gaussian Distributions
   - Inverse Gaussian Distributions
   - MML Inference of Inverse Gaussians
3. Example
Example (1)

- Compared inverse Gaussian mixture models against standard Gaussian mixture models
- Used several well-known, real datasets:
  1. Enzyme
  2. Acidity
  3. Galaxy
- Results shown for the enzyme data (n = 245 samples)
- See the paper for the acidity and galaxy results
Example (2)

[Figure: histogram of the enzyme data, $0 \le y \le 3$]
Example (3)

Gaussian mixture model (K = 2, I = 86.19)

[Figure: fitted Gaussian mixture density over $0 \le y \le 3$]
Example (4)

Inverse Gaussian mixture model (K = 3, I = 69.34)

[Figure: fitted inverse Gaussian mixture density over $0 \le y \le 3$]
References

- Wallace, C. S. and Boulton, D. M. An information measure for classification. Computer Journal, Vol. 11, pp. 185-194, 1968.
- Wallace, C. S. and Dowe, D. L. MML mixture modelling of multi-state, Poisson, von Mises circular and Gaussian distributions. Proceedings of the 6th International Workshop on Artificial Intelligence and Statistics, pp. 529-536, 1997.
- Wallace, C. S. Intrinsic classification of spatially correlated data. The Computer Journal, Vol. 41, pp. 602-611, 1998.
- Wallace, C. S. and Dowe, D. L. MML clustering of multi-state, Poisson, von Mises circular and Gaussian distributions. Statistics and Computing, Vol. 10, pp. 73-83, 2000.
- Wallace, C. S. Statistical and Inductive Inference by Minimum Message Length. Springer, 2005.