Statistics 202: Data Mining. © Jonathan Taylor. Model-based clustering. Based in part on slides from textbook, slides of Susan Holmes.

1 Model-based clustering. Based in part on slides from textbook, slides of Susan Holmes. December 2,

2 Model-based clustering: General approach
Choose a type of mixture model (e.g. multivariate Normal) and a maximum number of clusters, K. Use a specialized hierarchical clustering technique. Use some criterion to determine the optimal model and number of clusters.

3 Model-based clustering: Choosing a mixture model
General form of the mixture model:
$$f(x) = \sum_{j=1}^{k} \pi_j f(x; \theta_j)$$
For the multivariate normal, $\theta_j = (\mu_j, \Sigma_j)$. The EM algorithm we discussed before assumed that the $\Sigma_j$ are all different in the different classes. Other possibilities: $\Sigma_j = \Sigma$, $\Sigma_j = \lambda_j I$, etc.
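As a minimal sketch of the mixture density above, the following evaluates $f(x)$ for the special case $\Sigma_j = \lambda_j I$, using only the Python standard library. The function names and the two-component parameter values are hypothetical and chosen purely for illustration.

```python
import math

def spherical_normal_pdf(x, mean, lam):
    """Density of N(mean, lam * I) at the point x (sequences of length p)."""
    p = len(x)
    sq_dist = sum((xi - mi) ** 2 for xi, mi in zip(x, mean))
    return (2.0 * math.pi * lam) ** (-p / 2.0) * math.exp(-sq_dist / (2.0 * lam))

def mixture_pdf(x, weights, means, lams):
    """f(x) = sum_j pi_j f(x; theta_j) with theta_j = (mu_j, lam_j * I)."""
    return sum(w * spherical_normal_pdf(x, m, l)
               for w, m, l in zip(weights, means, lams))

# Hypothetical two-component mixture in the plane (illustrative values only).
weights = [0.3, 0.7]
means = [(0.0, 0.0), (3.0, 3.0)]
lams = [1.0, 0.5]
density_at_origin = mixture_pdf((0.0, 0.0), weights, means, lams)
```

At the origin the second component contributes almost nothing, so the value is close to $0.3 / (2\pi)$.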

4 Model-based clustering: Choosing a mixture model
Generally, we can write
$$\Sigma_j = c_j D_j A_j D_j^T$$
with $A_j = \mathrm{diag}(a_j)$, so that the entries of $c_j a_j$ are the eigenvalues of $\Sigma_j$, normalized so that $\max(a_j) = 1$. The parameter $c_j$ is the size, $A_j$ is the shape, and $D_j$ (the matrix of eigenvectors) is the orientation.
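To make the size/shape/orientation decomposition concrete, here is a small sketch that builds a 2-dimensional covariance matrix from a size $c$, a shape vector $a$ with $\max(a) = 1$, and a rotation angle standing in for the orientation $D$. The function name `covariance_2d` and the numbers are hypothetical.

```python
import math

def covariance_2d(c, a, angle):
    """Build Sigma = c * D * diag(a) * D^T, with D a rotation by `angle` radians."""
    cos_t, sin_t = math.cos(angle), math.sin(angle)
    D = [[cos_t, -sin_t], [sin_t, cos_t]]
    sigma = [[0.0, 0.0], [0.0, 0.0]]
    for r in range(2):
        for s in range(2):
            # (D A D^T)[r][s] = sum_m D[r][m] * a[m] * D[s][m]
            sigma[r][s] = c * sum(D[r][m] * a[m] * D[s][m] for m in range(2))
    return sigma

# Illustrative values: size 4, shape (1, 0.25), orientation 45 degrees.
sigma = covariance_2d(c=4.0, a=(1.0, 0.25), angle=math.pi / 4)
```

The eigenvalues of the result are $c \cdot a = (4, 1)$, so its trace is 5 and its determinant is 4 whatever the angle; changing only `angle` rotates the ellipse without altering its size or shape.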

5 Model-based agglomerative clustering: Ward's criterion
A hierarchical clustering algorithm that merges $k$ clusters $\{C_1^k, \dots, C_k^k\}$ into $k-1$ clusters based on
$$\mathrm{WSS} = \sum_{j=1}^{k-1} \mathrm{WSS}(C_j^{k-1})$$
where WSS is the within-cluster sum of squared distances. The procedure merges the two clusters $C_i^k$, $C_l^k$ that produce the smallest increase in WSS.
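One merge step of this procedure can be sketched as follows. This is a naive illustration that recomputes WSS for every candidate pair rather than an efficient implementation; the helper names and the five toy points are hypothetical.

```python
def wss(cluster):
    """Within-cluster sum of squared distances to the cluster mean."""
    p = len(cluster[0])
    mean = [sum(pt[d] for pt in cluster) / len(cluster) for d in range(p)]
    return sum((pt[d] - mean[d]) ** 2 for pt in cluster for d in range(p))

def ward_merge(clusters):
    """Merge the pair of clusters whose union increases total WSS the least."""
    best = None
    for i in range(len(clusters)):
        for l in range(i + 1, len(clusters)):
            increase = (wss(clusters[i] + clusters[l])
                        - wss(clusters[i]) - wss(clusters[l]))
            if best is None or increase < best[0]:
                best = (increase, i, l)
    _, i, l = best
    merged = clusters[i] + clusters[l]
    rest = [c for idx, c in enumerate(clusters) if idx not in (i, l)]
    return rest + [merged]

# Toy data: start from singletons and merge down to three clusters.
clusters = [[(0.0, 0.0)], [(0.0, 1.0)], [(5.0, 5.0)], [(5.0, 6.0)], [(9.0, 0.0)]]
while len(clusters) > 3:
    clusters = ward_merge(clusters)
```

On these points the two close pairs are merged first, leaving `(9.0, 0.0)` as a singleton.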

6 [Figure: NCI data (Ward's linkage)]

7 Model-based agglomerative clustering
If $\Sigma_j = \sigma^2 I$, then Ward's criterion is equivalent to merging based on the criterion
$$-2 \log L(\theta, l)$$
where
$$L(\theta, l) = \prod_{i=1}^{n} f(x_i; \theta_{l_i})$$
is called the classification likelihood and $l_i$ is the cluster label of observation $i$. This idea can be used to make a hierarchical clustering algorithm for other types of multivariate normal models, e.g. equal shape, same size, etc.
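The equivalence with Ward's criterion can be checked directly. The sketch below assumes that $\sigma^2$ is held fixed, that $p$ denotes the dimension of each observation, and that for given labels $l$ the means $\mu_j$ are set to the cluster means (which minimize the sum of squares).

```latex
\begin{align*}
f(x; \theta_j) &= (2\pi\sigma^2)^{-p/2}
  \exp\!\left(-\frac{\|x - \mu_j\|^2}{2\sigma^2}\right), \\
-2 \log L(\theta, l) &= n p \log(2\pi\sigma^2)
  + \frac{1}{\sigma^2} \sum_{i=1}^{n} \|x_i - \mu_{l_i}\|^2, \\
\min_{\mu_1, \dots, \mu_k} \; -2 \log L(\theta, l) &= n p \log(2\pi\sigma^2)
  + \frac{\mathrm{WSS}}{\sigma^2}.
\end{align*}
```

The first term does not depend on the labels, so the merge with the smallest increase in $-2 \log L$ is the merge with the smallest increase in WSS.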

8 Model selection & BIC: Bayesian Information Criterion
After a merge, the clusters are taken as initial starting points for the EM algorithm. This results in several mixture models: one for each number of clusters and each type of mixture model considered. How do we choose? This raises the topic of model selection.

9 Model selection & BIC: Bayesian Information Criterion
Suppose we have several possible models $\mathcal{M} = \{M_1, \dots, M_T\}$ for a data set, which we assume is given by a data matrix $X_{n \times p}$. These models have parameters $\Theta = \{\theta_1, \dots, \theta_T\}$. Further, suppose that each one has a likelihood $L_j$, and that $\hat{\Theta} = \{\hat{\theta}_1, \dots, \hat{\theta}_T\}$ are the maximum likelihood estimators. We can compare $-2 \log L_j(\hat{\theta}_j)$, but this ignores how much fitting each model does, i.e. how many parameters it is free to adjust. A common approach is to add a penalty that makes different models comparable.

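10 Model selection & BIC: Bayesian Information Criterion
The BIC of a model is usually
$$\mathrm{BIC}(M_j) = -2 \log L_j(\hat{\theta}_j) + \log n \cdot \#\{\text{parameters in } M_j\}.$$
The BIC can be thought of as approximating $P(M_j \text{ is correct} \mid X_{n \times p})$ under an appropriate Bayesian model for $X$, with a smaller BIC corresponding to a larger approximate posterior probability.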
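Computing the BIC requires a parameter count. As a sketch, the count below is for a $k$-component, $p$-dimensional Gaussian mixture in which every component has its own unrestricted covariance matrix; restricted models (e.g. $\Sigma_j = \Sigma$) would have fewer parameters. The function names and the log-likelihood value are hypothetical placeholders, not results from any fitted model.

```python
import math

def n_params_full_gmm(k, p):
    """Free parameters of a k-component, p-dimensional Gaussian mixture
    with one unrestricted covariance matrix per component."""
    weights = k - 1                      # mixing proportions sum to 1
    means = k * p
    covariances = k * p * (p + 1) // 2   # symmetric p x p matrices
    return weights + means + covariances

def bic(log_likelihood, n_params, n):
    """BIC = -2 log L + log(n) * (number of parameters); smaller is better."""
    return -2.0 * log_likelihood + math.log(n) * n_params

n_params = n_params_full_gmm(k=2, p=4)   # 1 + 8 + 20 = 29
# Placeholder log-likelihood, for illustration only.
score = bic(log_likelihood=-215.0, n_params=n_params, n=150)
```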

11 Model selection & BIC: Bayesian Information Criterion
Typically, statisticians will try to prove that choosing the model with the best BIC yields the correct model. Some theoretical justification is needed for this, and it breaks down for mixture models. Nevertheless, BIC is still used. Another common criterion is the AIC (Akaike Information Criterion):
$$\mathrm{AIC}(M_j) = -2 \log L_j(\hat{\theta}_j) + 2 \cdot \#\{\text{parameters in } M_j\}.$$
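Because the two criteria differ only in the penalty per parameter ($\log n$ versus 2), they can rank the same candidates differently once $\log n > 2$. The sketch below illustrates this with three hypothetical candidates; the log-likelihoods and parameter counts are made up for the example and do not come from any data set.

```python
import math

def aic(log_likelihood, n_params):
    """AIC = -2 log L + 2 * (number of parameters)."""
    return -2.0 * log_likelihood + 2.0 * n_params

def bic(log_likelihood, n_params, n):
    """BIC = -2 log L + log(n) * (number of parameters)."""
    return -2.0 * log_likelihood + math.log(n) * n_params

# Hypothetical (maximized log-likelihood, number of parameters) per model.
candidates = {"M1": (-260.0, 9), "M2": (-240.0, 19), "M3": (-236.0, 29)}
n = 150

best_by_aic = min(candidates, key=lambda m: aic(*candidates[m]))
best_by_bic = min(candidates, key=lambda m: bic(*candidates[m], n))
```

With these numbers AIC prefers `M2` while the heavier BIC penalty prefers the smaller model `M1`.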

12 Model-based clustering: Summary
1. Choose a type of mixture model (e.g. multivariate Normal) and a maximum number of clusters, K.
2. Use a specialized hierarchical clustering technique: model-based hierarchical agglomeration.
3. Use clusters from the previous step to initialize EM for the mixture model.
4. Use BIC to compare different mixture models and models with different numbers of clusters.
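The four steps can be sketched end to end on univariate data. This is a simplified illustration under several assumptions: the data are simulated, only one model type is considered (univariate normal components with separate variances), and step 2 is replaced by a plain contiguous split of the sorted data rather than model-based hierarchical agglomeration. All function names are hypothetical.

```python
import math
import random

def normal_pdf(x, mu, var):
    """Univariate normal density with mean mu and variance var."""
    return math.exp(-(x - mu) ** 2 / (2.0 * var)) / math.sqrt(2.0 * math.pi * var)

def contiguous_chunks(sorted_data, k):
    """Stand-in for step 2: split sorted data into k contiguous groups."""
    n = len(sorted_data)
    bounds = [round(i * n / k) for i in range(k + 1)]
    return [sorted_data[bounds[i]:bounds[i + 1]] for i in range(k)]

def init_from_partition(groups, n):
    """Step 3a: turn a hard partition into (weight, mean, variance) triples."""
    params = []
    for g in groups:
        mu = sum(g) / len(g)
        var = max(sum((x - mu) ** 2 for x in g) / len(g), 1e-6)
        params.append((len(g) / n, mu, var))
    return params

def em(data, params, n_iter=200):
    """Step 3b: plain EM; returns fitted parameters and the log-likelihood."""
    n = len(data)
    for _ in range(n_iter):
        # E-step: responsibility of each component for each point.
        resp = []
        for x in data:
            dens = [w * normal_pdf(x, mu, var) for w, mu, var in params]
            total = sum(dens)
            resp.append([d / total for d in dens])
        # M-step: weighted updates of weights, means and variances.
        new_params = []
        for j in range(len(params)):
            nj = sum(r[j] for r in resp)
            mu = sum(r[j] * x for r, x in zip(resp, data)) / nj
            var = sum(r[j] * (x - mu) ** 2 for r, x in zip(resp, data)) / nj
            new_params.append((nj / n, mu, max(var, 1e-6)))  # floor the variance
        params = new_params
    loglik = sum(math.log(sum(w * normal_pdf(x, mu, var) for w, mu, var in params))
                 for x in data)
    return params, loglik

def bic(loglik, k, n):
    """Step 4: k components have (k - 1) + k + k = 3k - 1 free parameters."""
    return -2.0 * loglik + math.log(n) * (3 * k - 1)

# Simulated data: two well-separated groups of 100 points each.
random.seed(0)
data = sorted([random.gauss(0.0, 1.0) for _ in range(100)]
              + [random.gauss(6.0, 1.0) for _ in range(100)])

bic_by_k = {}
for k in (1, 2, 3):  # step 1: maximum number of clusters K = 3
    start = init_from_partition(contiguous_chunks(data, k), len(data))
    _, loglik = em(data, start)
    bic_by_k[k] = bic(loglik, k, len(data))

best_k = min(bic_by_k, key=bic_by_k.get)
```

With groups this far apart, the BIC for two components should be far below the BIC for one; how close three components come depends on the particular sample.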

13 The Iris data
Best model: equal shape, 2 components.
[Figures: the Iris data]

