Minimum Message Length Inference and Mixture Modelling of Inverse Gaussian Distributions

Similar documents
The Minimum Message Length Principle for Inductive Inference

Minimum Message Length Analysis of the Behrens Fisher Problem

Minimum Message Length Analysis of Multiple Short Time Series

Lecture 4: Probabilistic Learning

Minimum Message Length Clustering of Spatially-Correlated Data with Varying Inter-Class Penalties

Minimum message length estimation of mixtures of multivariate Gaussian and von Mises-Fisher distributions

Lecture 4: Probabilistic Learning. Estimation Theory. Classification with Probability Distributions

Model Selection Tutorial 2: Problems With Using AIC to Select a Subset of Exposures in a Regression Model

Naïve Bayes classification

The Behaviour of the Akaike Information Criterion when Applied to Non-nested Sequences of Models

Estimation Theory. as Θ = (Θ 1,Θ 2,...,Θ m ) T. An estimator

MML Invariant Linear Regression

Naïve Bayes classification. p ij 11/15/16. Probability theory. Probability theory. Probability theory. X P (X = x i )=1 i. Marginal Probability

Statistical and Inductive Inference by Minimum Message Length

Logistic Regression with the Nonnegative Garrote

Machine Learning. Gaussian Mixture Models. Zhiyao Duan & Bryan Pardo, Machine Learning: EECS 349 Fall

Minimum Message Length Grouping of Ordered Data

Lecture 9: PGM Learning

Lecture 6: Graphical Models: Learning

Machine Learning Summer School

Sequence labeling. Taking collective a set of interrelated instances x 1,, x T and jointly labeling them

Probabilistic classification CE-717: Machine Learning Sharif University of Technology. M. Soleymani Fall 2016

Performance Comparison of K-Means and Expectation Maximization with Gaussian Mixture Models for Clustering EE6540 Final Project

Shrinkage and Denoising by Minimum Message Length

Parametric Models. Dr. Shuang LIANG. School of Software Engineering TongJi University Fall, 2012

EXAMINATIONS OF THE ROYAL STATISTICAL SOCIETY

Variable selection for model-based clustering

Model Based Clustering of Count Processes Data

Density Estimation: ML, MAP, Bayesian estimation

Bayesian Learning (II)

arxiv: v3 [stat.me] 11 Feb 2018

High Dimensional Discriminant Analysis

PATTERN RECOGNITION AND MACHINE LEARNING

Probabilistic Machine Learning. Industrial AI Lab.

Statistics: Learning models from data

On the identification of outliers in a simple model

2.6.3 Generalized likelihood ratio tests

Introduction to Machine Learning

Pattern Recognition and Machine Learning. Bishop Chapter 2: Probability Distributions

Lecture 13: Data Modelling and Distributions. Intelligent Data Analysis and Probabilistic Inference Lecture 13 Slide No 1

Bayesian Learning. HT2015: SC4 Statistical Data Mining and Machine Learning. Maximum Likelihood Principle. The Bayesian Learning Framework

Clustering K-means. Clustering images. Machine Learning CSE546 Carlos Guestrin University of Washington. November 4, 2014.

SYDE 372 Introduction to Pattern Recognition. Probability Measures for Classification: Part I

Statistical Data Mining and Machine Learning Hilary Term 2016

Non-Parametric Bayes

Lecture 6: Model Checking and Selection

Latent Variable Models and EM Algorithm

Introduction to Systems Analysis and Decision Making Prepared by: Jakub Tomczak

Mathematical statistics

Density Estimation. Seungjin Choi

Mixture Models & EM. Nicholas Ruozzi University of Texas at Dallas. based on the slides of Vibhav Gogate

Introduction to Machine Learning

CSC321 Lecture 18: Learning Probabilistic Models

Tutorial on Gaussian Processes and the Gaussian Process Latent Variable Model

Today. Probability and Statistics. Linear Algebra. Calculus. Naïve Bayes Classification. Matrix Multiplication Matrix Inversion

Unsupervised Learning

Expectation Propagation Algorithm

Parametric Unsupervised Learning Expectation Maximization (EM) Lecture 20.a

Problem Set 2. MAS 622J/1.126J: Pattern Recognition and Analysis. Due: 5:00 p.m. on September 30

Latent Variable Models and EM algorithm

Chapter 10. Semi-Supervised Learning

Clustering by Mixture Models. General background on clustering Example method: k-means Mixture model based clustering Model estimation

Parametric Techniques Lecture 3

Cheng Soon Ong & Christian Walder. Canberra February June 2018

Minimum Message Length Shrinkage Estimation

STATS 306B: Unsupervised Learning Spring Lecture 2 April 2

The Jackknife-Like Method for Assessing Uncertainty of Point Estimates for Bayesian Estimation in a Finite Gaussian Mixture Model

Primer on statistics:

Foundations of Statistical Inference

Parametric Techniques

Bayesian Interpretations of Regularization

Mixture Models and Expectation-Maximization

Clustering. Professor Ameet Talwalkar. Professor Ameet Talwalkar CS260 Machine Learning Algorithms March 8, / 26

Tutorial on Approximate Bayesian Computation

The Expectation-Maximization Algorithm

Handling imprecise and uncertain class labels in classification and clustering

Machine Learning. Probabilistic KNN.

HCOC: hierarchical classifier with overlapping class groups

Estimation of Optimally-Combined-Biomarker Accuracy in the Absence of a Gold-Standard Reference Test

COMS 4721: Machine Learning for Data Science Lecture 1, 1/17/2017

Expectation Maximization

Gaussian Mixture Models

Probabilistic & Unsupervised Learning

Bayesian Decision and Bayesian Learning

Parametric Inference Maximum Likelihood Inference Exponential Families Expectation Maximization (EM) Bayesian Inference Statistical Decison Theory

CS-E3210 Machine Learning: Basic Principles

Classification for High Dimensional Problems Using Bayesian Neural Networks and Dirichlet Diffusion Trees

MML Mixture Models of Heterogeneous Poisson Processes with Uniform Outliers for Bridge Deterioration

COS513 LECTURE 8 STATISTICAL CONCEPTS

Probabilistic Time Series Classification

LECTURE 5 NOTES. n t. t Γ(a)Γ(b) pt+a 1 (1 p) n t+b 1. The marginal density of t is. Γ(t + a)γ(n t + b) Γ(n + a + b)

Machine Learning and Bayesian Inference. Unsupervised learning. Can we find regularity in data without the aid of labels?

Probabilistic modeling. The slides are closely adapted from Subhransu Maji s slides

LECTURE NOTE #3 PROF. ALAN YUILLE

F & B Approaches to a simple model

PATTERN RECOGNITION AND MACHINE LEARNING CHAPTER 2: PROBABILITY DISTRIBUTIONS

Introduction to Machine Learning

Probabilistic and Bayesian Machine Learning

Bayesian Inference Course, WTCN, UCL, March 2013

Minimum Message Length Inference and Mixture Modelling of Inverse Gaussian Distributions
Daniel F. Schmidt and Enes Makalic
Centre for Molecular, Environmental, Genetic & Analytic (MEGA) Epidemiology, School of Population Health, University of Melbourne
25th Australasian Joint Conference on Artificial Intelligence (AI 2012)

Content
1 Mixture Modelling
  Problem Description
  MML Mixture Models
2 MML Inverse Gaussian Distributions
  Inverse Gaussian Distributions
  MML Inference of Inverse Gaussians
3 Example

Problem Description
We have n items, each with q associated attributes, formed into a matrix
Y = \begin{pmatrix} y_1 \\ y_2 \\ \vdots \\ y_n \end{pmatrix} = \begin{pmatrix} y_{1,1} & y_{1,2} & \cdots & y_{1,q} \\ y_{2,1} & y_{2,2} & \cdots & y_{2,q} \\ \vdots & \vdots & \ddots & \vdots \\ y_{n,1} & y_{n,2} & \cdots & y_{n,q} \end{pmatrix}
The goal is to group together, or cluster, similar items.
This is a form of unsupervised learning, sometimes called intrinsic classification: the class labels are learned from the data.

Mixture Modelling (1)
Models the data as a mixture of probability distributions:
p(y_{i,j}; \Phi) = \sum_{k=1}^{K} \alpha_k \, p(y_{i,j}; \theta_{k,j})
where
K is the number of classes
\alpha = (\alpha_1, \ldots, \alpha_K) are the mixing (population) weights
\theta_{k,j} are the parameters of the class distributions
\Phi = \{K, \alpha, \theta_{1,1}, \ldots, \theta_{K,q}\} denotes the complete mixture model
The model has an explicit probabilistic form, which allows for statistical interpretation.
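The mixture density is straightforward to evaluate once the weights and component parameters are fixed. Below is a minimal sketch for a single attribute, using Gaussian components purely for illustration; the weights, parameters and function name are assumptions for this sketch, not taken from the paper.

```python
import numpy as np
from scipy.stats import norm

alphas = np.array([0.3, 0.7])        # mixing weights alpha_k (sum to 1)
thetas = [(0.0, 1.0), (3.0, 0.5)]    # per-class (mean, sd), purely illustrative

def mixture_pdf(y):
    """p(y; Phi) = sum_k alpha_k * p(y; theta_k) for a single attribute."""
    return sum(a * norm.pdf(y, loc=mu, scale=sd) for a, (mu, sd) in zip(alphas, thetas))

print(mixture_pdf(np.array([0.5, 2.9])))
```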

Mixture Modelling (2)
How is this related to clustering?
Each class is a cluster.
Each class has class-specific probability distributions over each attribute, e.g., normal, inverse Gaussian, Poisson, etc.
The mixing weight is the prevalence of the class in the population.
The measure of similarity of an item to a class is
p_k(y_i) = \prod_{j=1}^{q} p(y_{i,j}; \theta_{k,j}),
the probability of the item's attributes under the class distributions (see the sketch below).
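A small sketch of this per-class similarity score, assuming (hypothetically) independent Gaussian attribute distributions within a class; the attribute model and all values are illustrative assumptions, not the paper's choices.

```python
import numpy as np
from scipy.stats import norm

def class_density(y_row, class_params):
    """p_k(y_i) = prod_j p(y_{i,j}; theta_{k,j}) for one item and one class."""
    return float(np.prod([norm.pdf(y, loc=mu, scale=sd)
                          for y, (mu, sd) in zip(y_row, class_params)]))

# One item with q = 2 attributes, one class with per-attribute (mean, sd) parameters
print(class_density([1.2, 0.4], [(1.0, 0.5), (0.0, 1.0)]))
```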

Mixture Modelling (3)
Membership of items to classes is soft:
r_{i,k} = \frac{\alpha_k \, p_k(y_i)}{\sum_{l=1}^{K} \alpha_l \, p_l(y_i)}
This is the posterior probability of item i belonging to class k:
\alpha_k is the a priori probability that an item belongs to class k
p_k(y_i) is the probability of data item y_i under class k
Each item is assigned to the class with the highest posterior probability.
The total (effective) number of samples in a class is then
n_k = \sum_{i=1}^{n} r_{i,k}
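A sketch of how these soft memberships and effective class sizes might be computed, again with a made-up Gaussian attribute model; none of the parameter values come from the paper.

```python
import numpy as np
from scipy.stats import norm

Y = np.array([[0.1], [0.2], [2.9], [3.1]])     # n = 4 items, q = 1 attribute
alphas = np.array([0.5, 0.5])                  # mixing weights
params = [[(0.0, 0.5)], [(3.0, 0.5)]]          # per class, per attribute (mean, sd)

def class_density(y_row, class_params):
    return np.prod([norm.pdf(y, loc=mu, scale=sd) for y, (mu, sd) in zip(y_row, class_params)])

# Unnormalised alpha_k * p_k(y_i), then normalise across classes to get r_{i,k}
joint = np.array([[a * class_density(y, p) for a, p in zip(alphas, params)] for y in Y])
r = joint / joint.sum(axis=1, keepdims=True)   # soft memberships r_{i,k}
n_k = r.sum(axis=0)                            # effective class sizes n_k
labels = r.argmax(axis=1)                      # hard assignment: highest posterior
print(r, n_k, labels)
```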

MML Mixture Models (1)
Minimum Message Length is a goodness-of-fit criterion based on the idea of compression, and a popular criterion for mixture modelling.
The message length of the data is our yardstick; it comprises:
1. The length of the codeword needed to state the model \Phi:
   the number of classes, I(K)
   the relative abundances (mixing weights), I(\alpha)
   the parameters of each distribution in each class, I(\theta_{k,j})
2. The length of the codeword needed to state the data given the model, I(Y | \Phi)

MML Mixture Models (2)
Total message length:
I(Y, \Phi) = I(K) + I(\alpha) + \sum_{k=1}^{K} \sum_{j=1}^{q} I(\theta_{k,j}) + I(Y | \Phi)
which balances model complexity against model fit.
Estimate \Phi by minimising the message length:
\hat\alpha and \hat\theta_{k,j} are found by expectation-maximisation
\hat{K} is found by splitting and merging classes
(A rough sketch of this bookkeeping follows.)
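A very rough sketch of the message-length bookkeeping. The component codelengths and likelihood values below are dummy numbers, not the formulas derived in the paper; the point is only how the criterion trades off model complexity (the I(...) terms) against fit (I(Y | Phi)) when comparing candidate values of K.

```python
import numpy as np

def total_message_length(code_K, code_alpha, code_thetas, neg_log_lik):
    """I(Y, Phi) = I(K) + I(alpha) + sum_{k,j} I(theta_{k,j}) + I(Y | Phi), all in nits."""
    return code_K + code_alpha + float(np.sum(code_thetas)) + neg_log_lik

# Model selection: fit candidate mixtures (e.g. by EM with class splits/merges),
# then keep the candidate with the smallest total message length.
candidates = {
    "K=2": total_message_length(2.0, 3.1, np.array([[4.0, 3.5]] * 2), 70.0),
    "K=3": total_message_length(2.5, 4.0, np.array([[4.0, 3.5]] * 3), 55.0),
}
print(min(candidates, key=candidates.get))
```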

Content
1 Mixture Modelling
  Problem Description
  MML Mixture Models
2 MML Inverse Gaussian Distributions
  Inverse Gaussian Distributions
  MML Inference of Inverse Gaussians
3 Example

Inverse Gaussian Distributions (1)
A distribution for positive, continuous data.
We say Y_i \sim IG(\mu, \lambda) if the p.d.f. for Y_i = y_i is
p(y_i; \mu, \lambda) = \left( \frac{1}{2\pi\lambda y_i^3} \right)^{1/2} \exp\left( -\frac{(y_i - \mu)^2}{2\mu^2 \lambda y_i} \right)
where
\mu > 0 is the mean parameter
\lambda > 0 is the inverse-shape parameter
It is suitable for positively skewed data.
Goal: derive the message length formula for use in mixture modelling.
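A small sketch of this density in the slide's (\mu, \lambda) parametrisation, with a cross-check against scipy.stats.invgauss, which uses a different parametrisation; the mapping (mu*lam, scale=1/lam) is my own translation and should be verified independently.

```python
import numpy as np
from scipy.stats import invgauss

def ig_pdf(y, mu, lam):
    """Inverse Gaussian density with mean mu and inverse-shape (dispersion) lam."""
    return (2.0 * np.pi * lam * y**3) ** -0.5 * np.exp(-(y - mu) ** 2 / (2.0 * mu**2 * lam * y))

y = np.linspace(0.1, 3.0, 5)
print(ig_pdf(y, mu=1.5, lam=0.5))
# Assumed parameter mapping to scipy's invgauss: mu_scipy = mu*lam, scale = 1/lam
print(invgauss.pdf(y, 1.5 * 0.5, scale=1.0 / 0.5))
```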

Inverse Gaussian Distributions (2)
[Figure: example inverse Gaussian densities p(y; \mu, \lambda) for (\mu=1, \lambda=1), (\mu=1, \lambda=3) and (\mu=3, \lambda=1), plotted over y \in [0, 3].]

MML Inference of Inverse Gaussians (1)
We use the Wallace-Freeman (MML87) approximation.
The approach is Bayesian; we chose the uninformative prior
\pi(\mu, \lambda) \propto \frac{1}{\lambda \mu^{3/2}}
The message length component for use in mixture models is
I(\theta_{k,j}) = \log n_k - \frac{1}{2} \log\left( \frac{\hat\lambda_{k,j}^2}{2 a_j} \right) + \log b_j
where
\hat\lambda_{k,j} is the MML estimate of \lambda for class k and attribute j
n_k is the number of samples in class k
a_j, b_j are hyper-parameters
Details may be found in the paper.

MML Inference of Inverse Gaussians (2)
Let y = (y_1, \ldots, y_n) be data from an inverse Gaussian distribution.
Define the sufficient statistics
S_1 = \sum_{i=1}^{n} y_i,  S_2 = \sum_{i=1}^{n} \frac{1}{y_i}
Compare the maximum likelihood estimates
\hat\mu_{ML} = \frac{S_1}{n},  \hat\lambda_{ML} = \frac{S_1 S_2 - n^2}{n S_1}
to the minimum message length estimates
\hat\mu_{87} = \frac{S_1}{n},  \hat\lambda_{87} = \frac{S_1 S_2 - n^2}{(n-1) S_1}
The MML estimates:
1. are unbiased
2. strictly dominate the ML estimates in terms of Kullback-Leibler (KL) risk
(A sketch of both estimators appears below.)
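A short sketch computing both estimators from the sufficient statistics on the slide. The simulated draw uses scipy with the same assumed parameter mapping as in the earlier density sketch; it is illustrative only.

```python
import numpy as np
from scipy.stats import invgauss

def ig_estimates(y):
    """Return (mu_hat, lambda_ML, lambda_MML87) from S1 = sum(y_i), S2 = sum(1/y_i)."""
    n = len(y)
    S1, S2 = y.sum(), (1.0 / y).sum()
    mu_hat = S1 / n
    lam_ml = (S1 * S2 - n**2) / (n * S1)         # maximum likelihood
    lam_87 = (S1 * S2 - n**2) / ((n - 1) * S1)   # Wallace-Freeman (MML87)
    return mu_hat, lam_ml, lam_87

# Illustrative simulation; the invgauss mapping (mu*lam, scale=1/lam) is an assumption.
mu, lam = 1.0, 0.5
y = invgauss.rvs(mu * lam, scale=1.0 / lam, size=20, random_state=0)
print(ig_estimates(y))
```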

Content
1 Mixture Modelling
  Problem Description
  MML Mixture Models
2 MML Inverse Gaussian Distributions
  Inverse Gaussian Distributions
  MML Inference of Inverse Gaussians
3 Example

Example (1)
We compared inverse Gaussian mixture models against standard Gaussian mixture models on several well-known real datasets:
1. Enzyme
2. Acidity
3. Galaxy
Results are shown here for the enzyme data (n = 245 samples); see the paper for the acidity and galaxy results.

Example (2)
[Figure: histogram of the enzyme data (n = 245), values ranging over roughly 0 to 3.]

Example (3)
[Figure: fitted Gaussian mixture model for the enzyme data (K = 2 classes, message length I = 86.19).]

Example (4)
[Figure: fitted inverse Gaussian mixture model for the enzyme data (K = 3 classes, message length I = 69.34). The shorter message length favours the inverse Gaussian mixture.]

References
Wallace, C. S. and Boulton, D. M. An information measure for classification. Computer Journal, Vol. 11, pp. 185-194, 1968.
Wallace, C. S. and Dowe, D. L. MML mixture modelling of multi-state, Poisson, von Mises circular and Gaussian distributions. Proceedings of the 6th International Workshop on Artificial Intelligence and Statistics, pp. 529-536, 1997.
Wallace, C. S. Intrinsic Classification of Spatially Correlated Data. The Computer Journal, Vol. 41, pp. 602-611, 1998.
Wallace, C. S. and Dowe, D. L. MML clustering of multi-state, Poisson, von Mises circular and Gaussian distributions. Statistics and Computing, Vol. 10, pp. 73-83, 2000.
Wallace, C. S. Statistical and Inductive Inference by Minimum Message Length. Springer, 2005.