Clustering by Mixture Models

General background on clustering
Example method: k-means
Mixture model based clustering
Model estimation

Clustering

A basic tool in data mining/pattern recognition: divide a set of data into groups. Samples in one cluster are close together, and clusters are far apart.

[Scatter plot of example data forming well-separated groups.]

Motivations:
Discover classes of data in an unsupervised way (unsupervised learning).
Efficient representation of data: fast retrieval, data complexity reduction.
Various engineering purposes: tightly linked with pattern recognition.

Approaches to Clustering

Represent samples by feature vectors. Define a distance measure to assess the closeness between data. Closeness can be measured in many ways:
Define distance based on various norms.
For gene expression levels in a set of micro-array data, closeness between genes may be measured by the Euclidean distance between the gene profile vectors, or by correlation.

Approaches:
Define an objective function to assess the quality of clustering and optimize it (purely computational). Clustering can be performed based merely on pairwise distances; how each sample is represented does not come into the picture.
Statistical model based clustering.

K-means

Assume there are M clusters with centroids Z = {z_1, z_2, ..., z_M}. Each training sample is assigned to one of the clusters. Denote the assignment function by η(·); then η(i) = j means the ith training sample is assigned to the jth cluster.

Goal: minimize the total mean squared error between the training samples and their representative cluster centroids, that is, the trace of the pooled within-cluster covariance matrix:

$$ \arg\min_{Z,\eta} \sum_{i=1}^{N} \| x_i - z_{\eta(i)} \|^2 $$

Denote the objective function by

$$ L(Z, \eta) = \sum_{i=1}^{N} \| x_i - z_{\eta(i)} \|^2 . $$

Intuition: training samples are tightly clustered around the centroids. Hence, the centroids serve as a compact representation for the training data.

Necessary Conditions

If Z is fixed, the optimal assignment function η(·) should follow the nearest neighbor rule, that is,

$$ \eta(i) = \arg\min_{j \in \{1, 2, \dots, M\}} \| x_i - z_j \| . $$

If η(·) is fixed, the cluster centroid z_j should be the average of all the samples assigned to the jth cluster:

$$ z_j = \frac{\sum_{i:\, \eta(i) = j} x_i}{N_j} , $$

where N_j is the number of samples assigned to cluster j.

The Algorithm

Based on the necessary conditions, the k-means algorithm alternates between two steps:
For a fixed set of centroids, optimize η(·) by assigning each sample to its closest centroid using the Euclidean distance.
Update each centroid as the average of all the samples assigned to it.

The algorithm converges since the objective function is non-increasing after each iteration. It usually converges fast. Stopping criterion: the ratio between the decrease and the objective function is below a threshold. A minimal sketch follows.
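A minimal sketch of this alternation in Python/NumPy, assuming random samples as the initial centroids; the function name kmeans and all defaults are illustrative, not part of the slides:

```python
import numpy as np

def kmeans(X, M, n_iter=100, tol=1e-6, seed=0):
    """Alternate nearest-centroid assignment and centroid averaging on data X (N x d)."""
    rng = np.random.default_rng(seed)
    N = X.shape[0]
    Z = X[rng.choice(N, size=M, replace=False)].astype(float)  # initial centroids: random samples
    L_prev = np.inf
    for _ in range(n_iter):
        # Assignment step: eta(i) = argmin_j ||x_i - z_j||
        d2 = ((X[:, None, :] - Z[None, :, :]) ** 2).sum(axis=2)  # (N, M) squared distances
        eta = d2.argmin(axis=1)
        # Update step: z_j = average of the samples assigned to cluster j
        for j in range(M):
            members = X[eta == j]
            if len(members) > 0:
                Z[j] = members.mean(axis=0)
        L = d2[np.arange(N), eta].sum()  # objective L(Z, eta), evaluated before the centroid update
        if np.isfinite(L_prev) and (L_prev - L) / L_prev < tol:  # relative decrease below threshold
            break
        L_prev = L
    return Z, eta, L
```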

Mixture Model-based Clustering

Each cluster is mathematically represented by a parametric distribution. Examples: Gaussian (continuous), Poisson (discrete). The entire data set is modeled by a mixture of these distributions. An individual distribution used to model a specific cluster is often referred to as a component distribution.

Suppose there are K components (clusters). Each component is a Gaussian distribution parameterized by µ_k, Σ_k. Denote the data by X, X ∈ R^d. The density of component k is

$$ f_k(x) = \phi(x \mid \mu_k, \Sigma_k) = \frac{1}{\sqrt{(2\pi)^d |\Sigma_k|}} \exp\!\left( -\frac{(x - \mu_k)^t \Sigma_k^{-1} (x - \mu_k)}{2} \right) . $$

The prior probability (weight) of component k is a_k. The mixture density is

$$ f(x) = \sum_{k=1}^{K} a_k f_k(x) = \sum_{k=1}^{K} a_k \phi(x \mid \mu_k, \Sigma_k) . $$
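As a small illustration, the mixture density can be evaluated directly from this formula; the sketch below assumes SciPy is available, and the helper name mixture_density is illustrative:

```python
import numpy as np
from scipy.stats import multivariate_normal

def mixture_density(x, weights, means, covs):
    """Evaluate f(x) = sum_k a_k * phi(x | mu_k, Sigma_k) for a Gaussian mixture."""
    return sum(a * multivariate_normal.pdf(x, mean=m, cov=S)
               for a, m, S in zip(weights, means, covs))

# Example: a two-component mixture in R^2 with equal weights.
f_x = mixture_density(np.zeros(2),
                      weights=[0.5, 0.5],
                      means=[np.zeros(2), np.ones(2)],
                      covs=[np.eye(2), np.eye(2)])
```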

Advantages

A mixture model with high likelihood tends to have the following traits:
Component distributions have high peaks (data in one cluster are tight).
The mixture model covers the data well (dominant patterns in the data are captured by component distributions).

Advantages:
Well-studied statistical inference techniques are available.
Flexibility in choosing the component distributions.
A density estimate is obtained for each cluster.
A soft classification is available.

[Figure: density function of two clusters.]

EM Algorithm

The parameters are estimated by the maximum likelihood (ML) criterion using the EM algorithm.

A. P. Dempster, N. M. Laird, and D. B. Rubin, "Maximum likelihood from incomplete data via the EM algorithm," Journal of the Royal Statistical Society, Series B, vol. 39, no. 1, pp. 1-38, 1977.

The EM algorithm provides an iterative computation of maximum likelihood estimation when the observed data are incomplete. Incompleteness can be conceptual:
We need to estimate the distribution of X, in sample space 𝒳, but we can only observe X indirectly through Y, in sample space 𝒴.
In many cases, there is a mapping x → y(x) from 𝒳 to 𝒴, and x is only known to lie in a subset of 𝒳, denoted by 𝒳(y), which is determined by the equation y = y(x).
The distribution of X is parameterized by a family of distributions f(x | θ), with parameters θ ∈ Ω, on x. The distribution of Y, g(y | θ), is

$$ g(y \mid \theta) = \int_{\mathcal{X}(y)} f(x \mid \theta) \, dx . $$

The EM algorithm aims at finding a θ that maximizes g(y | θ) given an observed y.

Introduce the function

$$ Q(\theta' \mid \theta) = E\left( \log f(x \mid \theta') \mid y, \theta \right) , $$

that is, the expected value of log f(x | θ′) according to the conditional distribution of x given y and parameter θ. The expectation is assumed to exist for all pairs (θ′, θ). In particular, it is assumed that f(x | θ) > 0 for θ ∈ Ω.

EM Iteration:
E-step: compute Q(θ | θ^(p)).
M-step: choose θ^(p+1) to be a value of θ ∈ Ω that maximizes Q(θ | θ^(p)).

EM for the Mixture of Normals

Observed data (incomplete): {x_1, x_2, ..., x_n}, where n is the sample size. Denote all the samples collectively by x.

Complete data: {(x_1, y_1), (x_2, y_2), ..., (x_n, y_n)}, where y_i is the cluster (component) identity of sample x_i.

The collection of parameters, θ, includes: a_k, µ_k, Σ_k, k = 1, 2, ..., K.

The likelihood function is

$$ L(x \mid \theta) = \sum_{i=1}^{n} \log\left( \sum_{k=1}^{K} a_k \phi(x_i \mid \mu_k, \Sigma_k) \right) . $$

L(x | θ) is the objective function the EM algorithm maximizes. The numerical difficulty comes from the sum inside the logarithm.

The Q function is

$$ Q(\theta' \mid \theta) = E\left[ \log \prod_{i=1}^{n} a'_{y_i} \phi(x_i \mid \mu'_{y_i}, \Sigma'_{y_i}) \,\Big|\, x, \theta \right] $$
$$ = E\left[ \sum_{i=1}^{n} \left( \log a'_{y_i} + \log \phi(x_i \mid \mu'_{y_i}, \Sigma'_{y_i}) \right) \Big|\, x, \theta \right] $$
$$ = \sum_{i=1}^{n} E\left[ \log a'_{y_i} + \log \phi(x_i \mid \mu'_{y_i}, \Sigma'_{y_i}) \mid x_i, \theta \right] . $$

The last equality comes from the fact that the samples are independent. Note that when x_i is given, only y_i is random in the complete data (x_i, y_i). Also, y_i only takes a finite number of values, i.e., the cluster identities 1 to K. The distribution of Y given X = x_i is the posterior probability of Y given X.

Denote the posterior probability of Y = k, k = 1, ..., K, given x_i by p_{i,k}. By the Bayes formula, the posterior probabilities satisfy

$$ p_{i,k} \propto a_k \phi(x_i \mid \mu_k, \Sigma_k) , \qquad \sum_{k=1}^{K} p_{i,k} = 1 . $$

Then each summand in Q(θ′ | θ) is

$$ E\left[ \log a'_{y_i} + \log \phi(x_i \mid \mu'_{y_i}, \Sigma'_{y_i}) \mid x_i, \theta \right] = \sum_{k=1}^{K} p_{i,k} \log a'_k + \sum_{k=1}^{K} p_{i,k} \log \phi(x_i \mid \mu'_k, \Sigma'_k) . $$

Note that we cannot see the direct effect of θ in the above equation; the p_{i,k} are computed using θ, i.e., the current parameters, while θ′ contains the updated parameters. We then have

$$ Q(\theta' \mid \theta) = \sum_{i=1}^{n} \sum_{k=1}^{K} p_{i,k} \log a'_k + \sum_{i=1}^{n} \sum_{k=1}^{K} p_{i,k} \log \phi(x_i \mid \mu'_k, \Sigma'_k) . $$

Note that the prior probabilities a′_k and the parameters of the Gaussian components µ′_k, Σ′_k can be optimized separately.

The a′_k are subject to \sum_{k=1}^{K} a'_k = 1. Basic optimization theory shows that they are optimized by

$$ a'_k = \frac{\sum_{i=1}^{n} p_{i,k}}{n} . $$

The optimization of µ′_k and Σ′_k is simply a maximum likelihood estimation of the parameters using the samples x_i with weights p_{i,k}. Basic optimization techniques lead to

$$ \mu'_k = \frac{\sum_{i=1}^{n} p_{i,k} \, x_i}{\sum_{i=1}^{n} p_{i,k}} , \qquad \Sigma'_k = \frac{\sum_{i=1}^{n} p_{i,k} \, (x_i - \mu'_k)(x_i - \mu'_k)^t}{\sum_{i=1}^{n} p_{i,k}} . $$

After every iteration, the likelihood function L is guaranteed to increase (though perhaps not strictly). We have derived the EM algorithm for a mixture of Gaussians.
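For completeness, here is the short Lagrange multiplier argument behind the a′_k update, which the slide invokes without spelling out. Maximize the a′-dependent part of Q subject to the constraint:

$$ \max_{a'} \; \sum_{i=1}^{n} \sum_{k=1}^{K} p_{i,k} \log a'_k + \lambda \left( 1 - \sum_{k=1}^{K} a'_k \right) . $$

Setting the derivative with respect to a′_k to zero gives \sum_{i=1}^{n} p_{i,k} = \lambda a'_k; summing over k and using \sum_{k} p_{i,k} = 1 yields \lambda = n, hence a'_k = \frac{1}{n} \sum_{i=1}^{n} p_{i,k}.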

EM Algorithm for the Mixture of Gaussians

Parameters estimated at the pth iteration are marked by a superscript (p).

1. Initialize the parameters.
2. E-step: compute the posterior probabilities for all i = 1, ..., n, k = 1, ..., K:

$$ p_{i,k} = \frac{a_k^{(p)} \phi(x_i \mid \mu_k^{(p)}, \Sigma_k^{(p)})}{\sum_{k'=1}^{K} a_{k'}^{(p)} \phi(x_i \mid \mu_{k'}^{(p)}, \Sigma_{k'}^{(p)})} . $$

3. M-step:

$$ a_k^{(p+1)} = \frac{\sum_{i=1}^{n} p_{i,k}}{n} , \qquad \mu_k^{(p+1)} = \frac{\sum_{i=1}^{n} p_{i,k} \, x_i}{\sum_{i=1}^{n} p_{i,k}} , \qquad \Sigma_k^{(p+1)} = \frac{\sum_{i=1}^{n} p_{i,k} \, (x_i - \mu_k^{(p+1)})(x_i - \mu_k^{(p+1)})^t}{\sum_{i=1}^{n} p_{i,k}} . $$

4. Repeat steps 2 and 3 until convergence.

Comment: for mixtures of other distributions, the EM algorithm is very similar. The E-step involves computing the posterior probabilities; only the particular density φ needs to be changed. The M-step always involves parameter optimization, with formulas that differ according to the distribution.
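A compact sketch of the E-step and M-step above in Python/NumPy, assuming SciPy for the Gaussian density; the function name em_gmm, the random initialization, and the fixed iteration count are illustrative choices, not part of the slides:

```python
import numpy as np
from scipy.stats import multivariate_normal

def em_gmm(X, K, n_iter=100, seed=0):
    """EM for a Gaussian mixture on data X (n x d); returns weights, means, covariances, posteriors."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    a = np.full(K, 1.0 / K)                                       # priors a_k
    mu = X[rng.choice(n, size=K, replace=False)].astype(float)    # means initialized at random samples
    Sigma = np.array([np.cov(X.T) + 1e-6 * np.eye(d) for _ in range(K)])
    for _ in range(n_iter):
        # E-step: p_{i,k} proportional to a_k * phi(x_i | mu_k, Sigma_k), normalized over k
        p = np.column_stack([a[k] * multivariate_normal.pdf(X, mean=mu[k], cov=Sigma[k])
                             for k in range(K)])
        p /= p.sum(axis=1, keepdims=True)
        # M-step: weighted maximum likelihood estimates
        Nk = p.sum(axis=0)                                        # effective cluster sizes
        a = Nk / n
        mu = (p.T @ X) / Nk[:, None]
        for k in range(K):
            Xc = X - mu[k]
            Sigma[k] = (p[:, k, None] * Xc).T @ Xc / Nk[k] + 1e-6 * np.eye(d)  # small ridge for stability
    return a, mu, Sigma, p
```

In practice one would monitor the log-likelihood L(x | θ) across iterations and stop when its increase falls below a tolerance.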

Computation Issues

If a different Σ_k is allowed for each component, the likelihood function is not bounded; the global optimum is meaningless. (Don't overdo it!)

How to initialize? Example: apply k-means first. Initialize µ_k and Σ_k using the samples classified to cluster k. Initialize a_k by the proportion of data assigned to cluster k by k-means.

In practice, we may want to reduce model complexity by putting constraints on the parameters, for instance, assuming equal priors or identical covariance matrices for all the components.
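A sketch of the k-means initialization just described, reusing the illustrative kmeans helper from the earlier sketch (all names are hypothetical):

```python
import numpy as np

def init_from_kmeans(X, K):
    """Initialize GMM parameters from a k-means partition of X (n x d)."""
    Z, eta, _ = kmeans(X, K)          # kmeans() as sketched earlier
    n, d = X.shape
    a = np.array([(eta == k).mean() for k in range(K)])             # proportion assigned to cluster k
    mu = np.array([X[eta == k].mean(axis=0) for k in range(K)])     # per-cluster sample means
    Sigma = np.array([np.cov(X[eta == k].T) + 1e-6 * np.eye(d)      # per-cluster sample covariances
                      for k in range(K)])
    return a, mu, Sigma
```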

Examples

The heart disease data set is taken from the UCI machine learning database repository. There are 297 cases (samples) in the data set, of which 137 have heart disease. Each sample contains 13 quantitative variables, including cholesterol, max heart rate, etc.

We remove the mean of each variable and normalize it to yield unit variance. The data are projected onto the plane spanned by the two most dominant principal component directions. A two-component Gaussian mixture is fit.
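One way to reproduce this kind of experiment with scikit-learn is sketched below; the file name heart.csv is a placeholder, since the slides do not specify how the data were loaded:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.mixture import GaussianMixture

# Placeholder: the 297 x 13 matrix of quantitative variables (path is hypothetical).
X = np.loadtxt("heart.csv", delimiter=",")

X_std = StandardScaler().fit_transform(X)         # zero mean, unit variance per variable
X_2d = PCA(n_components=2).fit_transform(X_std)   # project onto the two dominant PC directions

gmm = GaussianMixture(n_components=2, covariance_type="full").fit(X_2d)
labels = gmm.predict(X_2d)                        # hard cluster assignments
posteriors = gmm.predict_proba(X_2d)              # soft classification p_{i,k}
```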

Figure 1: The heart disease data set and the estimated cluster densities. Top: the scatter plot of the data. Bottom: the contour plot of the pdf estimated using a single-layer mixture of two normals. The thick lines are the boundaries between the two clusters based on the estimated pdfs of the individual clusters.

Classification Likelihood

The likelihood maximized by the EM algorithm,

$$ L(x \mid \theta) = \sum_{i=1}^{n} \log\left( \sum_{k=1}^{K} a_k \phi(x_i \mid \mu_k, \Sigma_k) \right) , $$

is sometimes called the mixture likelihood.

Maximization can also be applied to the classification likelihood. Denote the collection of cluster identities of all the samples by y = {y_1, ..., y_n}:

$$ L(x \mid \theta, y) = \sum_{i=1}^{n} \log\left( a_{y_i} \phi(x_i \mid \mu_{y_i}, \Sigma_{y_i}) \right) . $$

The cluster identities y_i, i = 1, ..., n, are treated as parameters together with θ and are part of the estimation.

To maximize L, the EM algorithm can be modified to yield an ascending algorithm. This modified version is called Classification EM (CEM).

Classification EM

A classification step is inserted between the E-step and the M-step.

1. Initialize the parameters.
2. E-step: compute the posterior probabilities for all i = 1, ..., n, k = 1, ..., K:

$$ p_{i,k} = \frac{a_k^{(p)} \phi(x_i \mid \mu_k^{(p)}, \Sigma_k^{(p)})}{\sum_{k'=1}^{K} a_{k'}^{(p)} \phi(x_i \mid \mu_{k'}^{(p)}, \Sigma_{k'}^{(p)})} . $$

3. Classification:

$$ y_i^{(p+1)} = \arg\max_{k} p_{i,k} . $$

Or equivalently, let \hat{p}_{i,k} = 1 if k = \arg\max_{k'} p_{i,k'} and 0 otherwise.

4. M-step:

$$ a_k^{(p+1)} = \frac{\sum_{i=1}^{n} \hat{p}_{i,k}}{n} = \frac{\sum_{i=1}^{n} I(y_i^{(p+1)} = k)}{n} , $$

$$ \mu_k^{(p+1)} = \frac{\sum_{i=1}^{n} \hat{p}_{i,k} \, x_i}{\sum_{i=1}^{n} \hat{p}_{i,k}} = \frac{\sum_{i=1}^{n} I(y_i^{(p+1)} = k) \, x_i}{\sum_{i=1}^{n} I(y_i^{(p+1)} = k)} , $$

$$ \Sigma_k^{(p+1)} = \frac{\sum_{i=1}^{n} \hat{p}_{i,k} \, (x_i - \mu_k^{(p+1)})(x_i - \mu_k^{(p+1)})^t}{\sum_{i=1}^{n} \hat{p}_{i,k}} = \frac{\sum_{i=1}^{n} I(y_i^{(p+1)} = k) \, (x_i - \mu_k^{(p+1)})(x_i - \mu_k^{(p+1)})^t}{\sum_{i=1}^{n} I(y_i^{(p+1)} = k)} . $$

5. Repeat steps 2, 3, and 4 until convergence.

Comment: CEM tends to underestimate the variances. It usually converges much faster than EM. For the purpose of clustering, it is generally believed to perform similarly to EM.

If we assume equal priors a_k and that the covariance matrices Σ_k are identical and equal to a scalar matrix, CEM is exactly k-means. (Exercise)
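A minimal sketch of CEM, mirroring the earlier illustrative em_gmm sketch but with the classification step inserted so that the M-step uses hard 0/1 weights:

```python
import numpy as np
from scipy.stats import multivariate_normal

def cem_gmm(X, K, n_iter=100, seed=0):
    """Classification EM: E-step, hard classification, then M-step on the induced partition."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    a = np.full(K, 1.0 / K)
    mu = X[rng.choice(n, size=K, replace=False)].astype(float)
    Sigma = np.array([np.cov(X.T) + 1e-6 * np.eye(d) for _ in range(K)])
    for _ in range(n_iter):
        # E-step: posterior probabilities p_{i,k}
        p = np.column_stack([a[k] * multivariate_normal.pdf(X, mean=mu[k], cov=Sigma[k])
                             for k in range(K)])
        p /= p.sum(axis=1, keepdims=True)
        # Classification step: y_i = argmax_k p_{i,k}
        y = p.argmax(axis=1)
        # M-step: plain ML estimates within each hard cluster
        for k in range(K):
            members = X[y == k]
            if len(members) == 0:
                continue                                  # keep previous parameters for empty clusters
            a[k] = len(members) / n
            mu[k] = members.mean(axis=0)
            Xc = members - mu[k]
            Sigma[k] = Xc.T @ Xc / len(members) + 1e-6 * np.eye(d)
    return a, mu, Sigma, y
```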