BU Statistics Seminar Series Fall 2006
1 Modal EM for Mixtures and its Application in Clustering
Surajit Ray, Boston University, September 2006 (BU: Sep, slide #1)
2 Why Study Modality
Inference about the actual data-generating process.
Potentially symptomatic of underlying population structure or substructure.
Modes are stable and more elemental than the components of a mixture model. For example, to model a single t-distribution we may require a mixture of many normals with the same mean but different variances.
3 Mixture of Distributions
K-component mixture: f(x) = ∑_{j=1}^{K} π_j f_j(x; λ_j, θ), x in R^D.
f_j(x; λ_j, θ): component densities.
π_j: mixing proportions, with 0 ≤ π_j ≤ 1 for all j and ∑_{j=1}^{K} π_j = 1, so π lives in a (K-1)-simplex.
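As a quick illustration (not from the slides), the mixture density above can be evaluated directly; the weights, means, and covariances below are toy values chosen for the example.

```python
import numpy as np

def gauss_pdf(x, mu, cov):
    """Multivariate normal density phi(x; mu, Sigma)."""
    d = len(mu)
    diff = x - mu
    norm = np.sqrt((2 * np.pi) ** d * np.linalg.det(cov))
    return np.exp(-0.5 * diff @ np.linalg.inv(cov) @ diff) / norm

def mixture_pdf(x, pis, mus, covs):
    """f(x) = sum_j pi_j f_j(x; lambda_j): K-component Gaussian mixture."""
    return sum(p * gauss_pdf(x, m, c) for p, m, c in zip(pis, mus, covs))

# Two equally weighted bivariate components (toy values)
pis = [0.5, 0.5]
mus = [np.zeros(2), np.array([3.0, 0.0])]
covs = [np.eye(2), np.eye(2)]
print(mixture_pdf(np.zeros(2), pis, mus, covs))
```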
4 Mixture of Distributions
A flexible way of modeling a heterogeneous population.
A data reduction technique: visualization.
Directly applicable to model-based clustering [Ray 2003, Ray and Lindsay 2006].
5 Contents
Modal EM and modal clusters of parametric mixtures
Modes and clusters from kernel density estimates / nonparametric modes
Multi-scale modal clustering, or hierarchical modal clustering
Further pruning of modes: modal ridges, gap test, hilltop test, coverage
Visualization of modes and dimension reduction: direction of maximum homogeneity
Applications: parametric mixture modal clustering; multi-scale modal clustering
Conclusion
6-7 Components vs Modes
A 5-component bivariate normal mixture with only 3 modes. [Scatter plot with the five component means µ marked.]
8-9 Number of Components vs Number of Modes
[Density plots: a mixture of two normals 4 standard deviations apart is bimodal; 2 standard deviations apart, unimodal.]
For univariate normal mixtures: number of components ≥ number of modes.
Condition for bimodality of a mixture of two univariate normals (Helguero, 1904): X ~ πN(µ_1, σ²) + (1-π)N(µ_2, σ²) is bimodal iff |µ_2 - µ_1| > 2σ (for π = 1/2).
Conditions for arbitrary univariate mixtures: Eisenberger (1964), Behboodian (1970), Robertson and Fryer (1969).
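A quick numerical check of the 2σ rule (my sketch, not from the slides): count the local maxima of the equal-weight two-normal mixture on a fine grid.

```python
import numpy as np

def n_modes(mu_diff, sigma=1.0, pi=0.5, grid=None):
    """Count local maxima of pi*N(0, sigma^2) + (1-pi)*N(mu_diff, sigma^2)."""
    if grid is None:
        grid = np.linspace(-5 * sigma, mu_diff + 5 * sigma, 20001)
    def phi(x, mu):
        return np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))
    f = pi * phi(grid, 0.0) + (1 - pi) * phi(grid, mu_diff)
    # strict interior local maxima of the density on the grid
    interior = (f[1:-1] > f[:-2]) & (f[1:-1] > f[2:])
    return int(interior.sum())

print(n_modes(4.0))  # separation 4 sigma: bimodal
print(n_modes(1.5))  # separation 1.5 sigma < 2 sigma: unimodal
```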
10-11 Parametric Modes: Multivariate Data, Equal Variance
Not much previous work in this area except Ray and Lindsay (2005).
A mixture of multivariate normals is bimodal iff the univariate distribution along some line is bimodal: find the axis of maximum separation and apply the univariate bimodality condition along it. For a mixture of two multivariate normals with equal covariance, this is the line joining the two means.
Theorem 1 (Multivariate modality condition, equal-variance case): for X ~ πN(µ_1, Σ) + (1-π)N(µ_2, Σ) with π = 1/2, the distribution of X is bimodal iff (µ_1 - µ_2)' Σ^{-1} (µ_1 - µ_2) > 4.
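Theorem 1 gives an easily computable test: compare the squared Mahalanobis separation of the two means with 4. A minimal sketch with toy inputs:

```python
import numpy as np

def is_bimodal_equal_cov(mu1, mu2, cov):
    """Theorem 1 (equal covariance, pi = 1/2): the mixture is bimodal iff
    the Mahalanobis separation (mu1-mu2)' Sigma^{-1} (mu1-mu2) exceeds 4."""
    d = np.asarray(mu1, float) - np.asarray(mu2, float)
    return float(d @ np.linalg.solve(cov, d)) > 4.0

cov = np.array([[1.0, 0.5], [0.5, 1.0]])
print(is_bimodal_equal_cov([0, 0], [3, 0], cov))  # well-separated means
print(is_bimodal_equal_cov([0, 0], [1, 0], cov))  # close means
```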
12 The Ridgeline Manifold
g(x) = π_1 φ(x; µ_1, Σ_1) + π_2 φ(x; µ_2, Σ_2) + ... + π_K φ(x; µ_K, Σ_K)
Mapping x*: R^{K-1} -> R^D:
x*(α) = [α_1 Σ_1^{-1} + α_2 Σ_2^{-1} + ... + α_K Σ_K^{-1}]^{-1} [α_1 Σ_1^{-1} µ_1 + α_2 Σ_2^{-1} µ_2 + ... + α_K Σ_K^{-1} µ_K]
The image of this map will be denoted by M and called the ridgeline surface or manifold. If K = 2, it will be called the ridgeline, as it is a one-dimensional curve.
Theorem 2: All the critical values of the D-dimensional multivariate mixture, and hence its modes, antimodes, and saddle points, lie in the manifold M given by x*(α).
Enormous dimension reduction for D >> K: explore only K - 1 dimensions.
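For K = 2 the ridgeline map above is a one-line computation. A sketch with toy parameters (my parameterization puts the weight α on component 1, so the endpoints of the curve are the two means):

```python
import numpy as np

def ridgeline(alpha, mu1, mu2, cov1, cov2):
    """Ridgeline point x*(alpha) of a 2-component normal mixture:
    x*(a) = [a P1 + (1-a) P2]^{-1} [a P1 mu1 + (1-a) P2 mu2],
    where Pj is the precision of component j; x*(1) = mu1, x*(0) = mu2."""
    i1, i2 = np.linalg.inv(cov1), np.linalg.inv(cov2)
    A = alpha * i1 + (1 - alpha) * i2
    b = alpha * (i1 @ np.asarray(mu1, float)) + (1 - alpha) * (i2 @ np.asarray(mu2, float))
    return np.linalg.solve(A, b)

# Trace the one-dimensional curve on a grid of alpha values (toy parameters)
mu1, mu2 = np.array([0.0, 0.0]), np.array([2.0, 1.0])
curve = np.array([ridgeline(a, mu1, mu2, np.eye(2), 2 * np.eye(2))
                  for a in np.linspace(0, 1, 101)])
```

By Theorem 2, searching this curve suffices to locate every mode, antimode, and saddle point of the bivariate mixture.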
13 Modality and Mixing Proportion: Π-plots
Ridgeline elevation/contour plots: full information (location and height) of the modes and saddle points; no dependence on the mixing proportions π.
Π-plots and analytical solutions: only the location of critical points, but dependence on the mixing parameter π.
14 The Π-equation
Two-component case: x*(α) is a critical value of the mixture if it satisfies h'(α) = π φ_1'(α) + (1 - π) φ_2'(α) = 0, where h(α) = g(x*(α)).
Solving for π gives Π(α) = φ_2'(α) / (φ_2'(α) - φ_1'(α)), so if α is a critical value then it solves the Π-equation: Π(α) = π.
1 solution => 1 mode; 3 solutions => 2 modes; 5 solutions => 3 modes.
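Rather than solving Π(α) = π directly, a numerical sketch (my shortcut, not the slides' method) can count the local maxima of h(α) = g(x*(α)) along the ridgeline; this matches the solution-count correspondence above, since modes alternate with antimodes along the curve. Toy parameters throughout:

```python
import numpy as np

def count_modes(pi1, mu1, mu2, cov1, cov2, n=4001):
    """Count modes of a 2-component normal mixture by counting local maxima
    of h(alpha) = g(x*(alpha)) along the ridgeline (all critical points lie on it)."""
    mu1, mu2 = np.asarray(mu1, float), np.asarray(mu2, float)
    i1, i2 = np.linalg.inv(cov1), np.linalg.inv(cov2)
    d = len(mu1)
    def phi(x, mu, inv, cov):
        r = x - mu
        return np.exp(-0.5 * r @ inv @ r) / np.sqrt((2 * np.pi) ** d * np.linalg.det(cov))
    alphas = np.linspace(0.0, 1.0, n)
    h = np.empty(n)
    for k, a in enumerate(alphas):
        # ridgeline point x*(a), then the mixture density there
        x = np.linalg.solve(a * i1 + (1 - a) * i2,
                            a * (i1 @ mu1) + (1 - a) * (i2 @ mu2))
        h[k] = pi1 * phi(x, mu1, i1, cov1) + (1 - pi1) * phi(x, mu2, i2, cov2)
    interior = (h[1:-1] > h[:-2]) & (h[1:-1] > h[2:])
    return int(interior.sum()) + int(h[0] > h[1]) + int(h[-1] > h[-2])
```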
15-16 Analytical Solution for the Number of Modes
Counting the solutions of the Π-equation reduces to finding the roots of an equation in the quantities p(α) and κ(α), built from the mean difference µ_2 - µ_1, the component precisions Σ_1^{-1} and Σ_2^{-1}, and S_α = [α_1 Σ_1^{-1} + α_2 Σ_2^{-1}]^{-1} [α_1 Σ_1^{-1} µ_1 + α_2 Σ_2^{-1} µ_2]; closed-form expressions are given in Ray and Lindsay (2005).
Solution to: a quadratic for equal variances; a cubic for proportional variances.
Open question: an upper bound for the number of modes in the unequal-variance case.
17 Modal EM and Modal Clusters of Parametric Mixtures
[Illustration: points x_1, x_2, x_3, x_4 ascending to their associated modes.]
Any point x can move along its path of steepest ascent with respect to the density. The associated mode of x is the point beyond which no further ascent is possible.
18-19 MEM: The Modal EM
The Modal EM (MEM) algorithm solves for a local maximum of a mixture density by an ascending iterative procedure starting from any initial point.
1. Expectation step: p_k = π_k f_k(x^{(r)}) / f(x^{(r)}), k = 1, ..., K.
2. Maximization step: x^{(r+1)} = argmax_x ∑_{k=1}^{K} p_k log f_k(x).
MM [Hunter and Lange] is not an algorithm but a prescription for constructing EM-like algorithms: build a surrogate by majorizing or minorizing the objective function. A likelihood or missing-data framework is not necessary; convexity of the relevant functions and the associated inequalities are what matter.
20 Modal EM: Normal Mixture
For a normal mixture with common covariance Σ, the M-step reduces to x^{(r+1)} = ∑_j Pr(J = j | X = x^{(r)}) µ_j, with posterior probability Pr(J = j | X = x^{(r)}) = π_j f_j(x^{(r)}) / ∑_k π_k f_k(x^{(r)}).
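A minimal sketch of this fixed-point iteration (toy parameters, not the authors' code): the E-step computes the posterior weights at the current point and the M-step averages the component means under those weights.

```python
import numpy as np

def modal_em_normal(x0, pis, mus, cov, tol=1e-10, max_iter=500):
    """Modal EM for a normal mixture with common covariance Sigma:
    the M-step reduces to x <- sum_j P(J=j | x) mu_j."""
    inv = np.linalg.inv(cov)
    mus = np.asarray(mus, float)
    x = np.asarray(x0, float)
    for _ in range(max_iter):
        # E-step: posterior component probabilities (log scale for stability)
        logw = np.array([np.log(p) - 0.5 * (x - m) @ inv @ (x - m)
                         for p, m in zip(pis, mus)])
        w = np.exp(logw - logw.max())
        w /= w.sum()
        # M-step: posterior-weighted average of the component means
        x_new = w @ mus
        if np.linalg.norm(x_new - x) < tol:
            return x_new
        x = x_new
    return x

# Two well-separated components: ascents from either side find different modes
m1 = modal_em_normal([0.5, 0.0], [0.5, 0.5], [[0, 0], [4, 0]], np.eye(2))
m2 = modal_em_normal([3.5, 0.0], [0.5, 0.5], [[0, 0], [4, 0]], np.eye(2))
```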
21 Properties of Modal EM
The iteration monotonically increases the density: g(x^{(r+1)}) ≥ g(x^{(r)}).
It is a discrete approximation to the path of steepest ascent, followed until no further ascent is possible.
The steps are computationally simple, even in high dimensions.
22 Modal EM for Non-normal Distributions
t-distribution M-step:
f_k(x; µ_k, Σ, ν) = Γ[(ν + D)/2] / { (πν)^{D/2} |Σ|^{1/2} Γ[ν/2] } · [1 + (x - µ_k)^T Σ^{-1} (x - µ_k)/ν]^{-(ν+D)/2}   (1)
Poisson distribution: modal merging is even more important for this one-parameter distribution. Work with integer optimization, using the fact that f(x+1)/f(x) = λ/(x+1), or with a smooth surrogate that replicates the ease of solution via Stirling's approximation.
23 Defining Modal Clusters Using Modal EM
For each µ_j, run the modal EM with it as a starting value.
Group together all µ_j that have the same associated mode, and call the group a modal cluster.
Example: modal clusters of a parametric mixture, n = 80, with four estimated means. [Plot.]
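The component-grouping rule above can be sketched as follows (toy weights and means, assuming the common-covariance normal M-step): run the ascent from each component mean and merge components whose ascents land on the same mode.

```python
import numpy as np

def modal_clusters(pis, mus, cov, tol=1e-8, max_iter=1000, digits=3):
    """Group components of a common-covariance normal mixture into modal
    clusters: ascend from each component mean, then merge components
    whose ascents reach the same mode."""
    inv = np.linalg.inv(cov)
    mus = np.asarray(mus, float)
    def ascend(x):
        for _ in range(max_iter):
            logw = np.array([np.log(p) - 0.5 * (x - m) @ inv @ (x - m)
                             for p, m in zip(pis, mus)])
            w = np.exp(logw - logw.max())
            w /= w.sum()
            x_new = w @ mus
            if np.linalg.norm(x_new - x) < tol:
                break
            x = x_new
        return x_new
    modes, labels = {}, []
    for m in mus:
        # round the limit point so numerically identical modes share one key
        key = tuple(np.round(ascend(m), digits))
        labels.append(modes.setdefault(key, len(modes)))
    return labels, [np.array(k) for k in modes]

# Toy mixture (assumed values): three overlapping components and one distant one
labels, modes = modal_clusters([0.3, 0.3, 0.2, 0.2],
                               [[0, 0], [0.5, 0], [0, 0.5], [5, 5]],
                               np.eye(2))
print(labels)  # first three components merge into one modal cluster
```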
24 Example: Modal Clusters of Parametric Mixtures
[Plot: estimated modes and component means in the (x, y) plane.]
25 Motivation Behind Kernel-density-based Clustering
No parametric assumptions necessary.
The kernel density is obtained by putting a normal kernel on each data point: K(x, x_i; h).
It is exactly a finite mixture model (with n components), so this is a natural extension of parametric modal clustering.
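Because the kernel density estimate is an n-component mixture with equal weights, the same Modal EM update applies with the data points as component means; the resulting iteration is the familiar mean-shift step. A sketch on assumed toy data:

```python
import numpy as np

def kde_modal_labels(data, h, tol=1e-9, max_iter=2000, digits=2):
    """Modal clustering of a Gaussian KDE: for an n-component mixture with
    weights 1/n, the Modal EM update is the mean-shift step
    x <- sum_i w_i(x) x_i with Gaussian kernel weights w_i(x)."""
    data = np.asarray(data, float)
    def ascend(x):
        for _ in range(max_iter):
            w = np.exp(-0.5 * np.sum((data - x) ** 2, axis=1) / h ** 2)
            w /= w.sum()
            x_new = w @ data
            if np.linalg.norm(x_new - x) < tol:
                break
            x = x_new
        return x_new
    modes = {}
    # each point is labeled by the mode its ascent converges to
    return [modes.setdefault(tuple(np.round(ascend(p), digits)), len(modes))
            for p in data]

# Two synthetic groups (assumed toy data)
rng = np.random.default_rng(0)
data = np.vstack([rng.normal(0, 0.3, (30, 2)), rng.normal(3, 0.3, (30, 2))])
labels = kde_modal_labels(data, h=0.5)
```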
26 History of Nonparametric Modality Tests
Hartigan and Hartigan (1985) and Cheng and Hall (1998): the DIP test; univariate, using local minima and maxima.
SPAN test and RUNT test: Hartigan (1988) and Hartigan and Mohanty (1992).
Cheng et al. (2004): bivariate, path of steepest ascent.
Wegman and Luo (2002): mode forest.
27 Example: Model-based Nonparametric Clustering
Selection of the smoothing parameter h is crucial. [Clustering results at several bandwidths.]
28 Multi-scale Modal Clustering
Based on the kernel density approach: points belonging to a common mode of the density estimator define a cluster, and varying the smoothing parameter provides a multi-scale analysis.
Research questions:
What is the interesting range of the kernel smoothing parameter h to explore, and in what incremental steps?
Are the modes significant?
Are the modes connected? (Ridgeline elevation plot.)
What are the starting values at a particular level?
Flat clustering vs Hierarchical Modal Clustering (HMC).
29 Diffusion Process Motivation
When a diffusion kernel K(x, µ; h) is used for the smoothing then, for a continuous-time Markov chain, it represents the probability distribution of the location x of a particle that started at µ, after h² time units. If the process is reversible, the same holds going backward in time.
If we have n points x_1, ..., x_n and ask for the distribution of a randomly selected x_j h² time units in the past, the empirical kernel represents its likely distribution back then. That is, the modes are the most likely evolution points of the past.
30 Choice of h: Spectral Degrees of Freedom
sdof is analogous to the degrees of freedom in a chi-squared goodness-of-fit test [details in Ray 2003, Ray and Lindsay 2006].
Hypothesis testing H_0: G = F_τ, with fit statistic
d_K(f, g) = ∫ (f_h(w) - g_h(w))² dw, where f_h(w) = ∫ K^{(h/2)}(x, w) dF(x) and g_h(w) = ∫ K^{(h/2)}(x, w) dG(x).
Choose h such that 2 < sdof < n/5.
31 Spectral Decomposition
If the kernel satisfies ∫∫_S K(x, y)² dM(x) dM(y) < ∞, then K(x, y) generates a Hilbert-Schmidt operator on L²(M) through the operation (Kg)(x) = ∫ K(x, y) g(y) dM(y).
Theorem (Yosida, 1980): K(x, y) = ∑_{j=1}^{∞} λ_j φ_j(x) φ_j(y), where the λ_j are the eigenvalues and the φ_j(x) are the corresponding normalized eigenfunctions of K under the baseline measure M.
32 Satterthwaite Approximation and SDOF
Asymptotically, d_K(f, g) is distributed as ∑_j λ_j χ²_j(1), which can be further approximated by a single chi-squared with df = spectral degrees of freedom: χ²_{sdof}.
The empirical estimate of sdof is
sdof-hat = (trace F̂(K_{(h)c}))² / trace F̂(K²_{(h)c}) = [ (1/n) ∑_{i=1}^{n} K_{(h)c}(x_i, x_i) ]² / [ (2/(n(n-1))) ∑_{i=1}^{n} ∑_{j<i} K_{(h)c}(x_i, x_j)² ],
where the centered kernel based on the empirical distribution is
K_{(h)c}(x_i, x_j) = K_h(x_i, x_j) - (1/n) ∑_{i'} K_h(x_{i'}, x_j) - (1/n) ∑_{j'} K_h(x_i, x_{j'}) + (1/n²) ∑_{i'} ∑_{j'} K_h(x_{i'}, x_{j'}).
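The empirical estimate above amounts to double-centering the kernel matrix and combining its trace with an off-diagonal U-statistic. A sketch for a Gaussian kernel (the bandwidths in the example are arbitrary; smaller bandwidths give more effective degrees of freedom):

```python
import numpy as np

def sdof(data, h):
    """Empirical spectral degrees of freedom of a Gaussian kernel K_h:
    (trace of the doubly centered kernel / n)^2 divided by the off-diagonal
    U-statistic estimate of the trace of the squared centered kernel."""
    data = np.asarray(data, float)
    n = len(data)
    sq = np.sum((data[:, None, :] - data[None, :, :]) ** 2, axis=2)
    K = np.exp(-0.5 * sq / h ** 2)
    # doubly centered kernel: K_c = K - row means - column means + grand mean
    Kc = K - K.mean(axis=0) - K.mean(axis=1)[:, None] + K.mean()
    num = (np.trace(Kc) / n) ** 2
    iu = np.triu_indices(n, k=1)
    den = 2.0 / (n * (n - 1)) * np.sum(Kc[iu] ** 2)
    return num / den

data = np.linspace(0, 5, 50)[:, None]
print(sdof(data, 0.05), sdof(data, 2.0))
```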
33 Practical Issues Regarding the Choice of h
Meaningful clusters: the values of h should make a difference in the cluster allocation of sample points.
Select a range and intermediate points.
Number of clusters and sdof: do they have a connection?
34 Example: Hierarchical Modal Clustering
35 Modal Ridges: Gap Test
Purpose: to measure the separateness of each pair of modes.
Evaluate the height along the ridgeline connecting the two modes; the modal ridge is determined by the height of the saddle between the two modes.
Use a paired t-test of the kernel density estimates, with multiple-testing corrections.
36 Modal Ridges: Gap Test
[Significant ridgeline plots at different levels of h.]
37 Dimension Reduction Using the Ridge Direction
Visualization of modes: in high dimensions, how do we visualize modes?
Using the ridgelines, can we find interesting linear or non-linear directions onto which to project the data?
Direction of maximum heterogeneity.
38 Rees Data Analysis (used by Hartigan)
Rees (1993) determined the proper motions of 515 stars in the region of the globular cluster M5.
39 Rees Data: parametric modal clustering starting with 9 clusters. [Scatter plot with component means marked.]
40 Rees Data: parametric modal clustering starting with 10 clusters. [Scatter plot.]
41 Rees Data: parametric modal clustering starting with 14 clusters. [Scatter plot.]
42-44 Rees Data: Nonparametric Clustering
Nonparametric modal clustering, shown at three values of the bandwidth h.
45 Rees Data: Nonparametric Hierarchical Modal Clustering
46 Image Segmentation
Cluster the pixel color components and label pixels in the same cluster as one region: this partitions an image into relatively homogeneous regions.
Comparison with k-means clustering with respect to computational cost, prior specification, and interpretation of results.
47 Image Segmentation: Impressionist Paintings
[Segmentations: HMAC modes vs k-means means.]
48 Image Segmentation: Conclusion
K-means clusters the pixels and computes the centroid vector for each cluster; HMAC finds the modal vectors, which are the bump peaks of the density.
The representative colors generated by k-means tend to be muddier due to averaging. HMAC retains the true colors better: the white of the stars in the first picture, the purplish pink of the vase in the second.
HMAC may ignore colors that are not distinct enough from others or do not occupy enough pixels. For finding the main palette of a painting, HMC may be preferable.
49-50 Medical Image Segmentation
[Example segmentations of medical images.]
51 Conclusion
Combines the best of two worlds: model-based and distance-based/partition clustering.
Multivariate in nature.
Unlike hierarchical clustering, each step is a global analysis (for kernels with appropriate support).
We use knowledge of the topography of mixtures to merge components whose combined contribution is unimodal (pruning).
Robust to specification of the underlying probability model; can be completely nonparametric.
Underlying theme: computational simplicity, even in higher dimensions.
52 Extensions of Modal Clustering
Determining marginal modes.
Determining conditional modes.
Missing-data problems.
Modal parametric distributions.
Modal nonparametric distributions.
53 Collaborators
Mixture models, quadratic distance, modal clusters: Bruce G. Lindsay, Marianthi Markatou, Shu-Chuan (Grace) Chen, Jia Li.
HDLSS geometry, bi-clustering, analysis of microarray data: Steve Marron.
Proteomics (classification of MHC binders, vaccine design, HMMs): Thomas B. Kepler, Andrew Nobel, Scott Schmidler.
Medical imaging, non-linear shape statistics, imaging using tissue mixtures: Keith E. Muller, Stephen M. Pizer, Joshua Stough (PhD student), Ja Yeon Jeong (PhD student).
Bayesian model selection, generalized BIC, structural equation models: Jim Berger, Kenneth Bollen, Jane Zavisca, M. J. (Susie) Bayarri.
54 References
Topography of mixtures and modal clustering:
Ray, S. and Lindsay, B. G. (2005). The topography of multivariate normal mixtures. The Annals of Statistics, 33(5), 2042-2065.
Ray, S. (2003). Distance-based model-selection with application to the analysis of gene expression data. Electronic thesis.
HDLSS and microarray analysis:
Ray, S. and Marron, J. S. Feature selection in high dimensional low sample size data using two-way mixtures.
Model selection in high dimensions:
Ray, S. and Lindsay, B. G. Model selection in high-dimensions: a quadratic-risk based approach. [Submitted]
Quadratic distances:
Lindsay, B. G., Ray, S., Chen, S. C., Yang, K. and Markatou, M. Quadratic distances on probabilities: the foundations. [Submitted]
Lindsay, B. G., Ray, S., Chen, S. C., Yang, K. and Markatou, M. Quadratic distances on probabilities: multivariate construction using tuneable diffusion kernels. [Tech report]
More informationJ. Cwik and J. Koronacki. Institute of Computer Science, Polish Academy of Sciences. to appear in. Computational Statistics and Data Analysis
A Combined Adaptive-Mixtures/Plug-In Estimator of Multivariate Probability Densities 1 J. Cwik and J. Koronacki Institute of Computer Science, Polish Academy of Sciences Ordona 21, 01-237 Warsaw, Poland
More informationMachine Learning. CUNY Graduate Center, Spring Lectures 11-12: Unsupervised Learning 1. Professor Liang Huang.
Machine Learning CUNY Graduate Center, Spring 2013 Lectures 11-12: Unsupervised Learning 1 (Clustering: k-means, EM, mixture models) Professor Liang Huang huang@cs.qc.cuny.edu http://acl.cs.qc.edu/~lhuang/teaching/machine-learning
More informationExtreme Value Analysis and Spatial Extremes
Extreme Value Analysis and Department of Statistics Purdue University 11/07/2013 Outline Motivation 1 Motivation 2 Extreme Value Theorem and 3 Bayesian Hierarchical Models Copula Models Max-stable Models
More informationA Search and Jump Algorithm for Markov Chain Monte Carlo Sampling. Christopher Jennison. Adriana Ibrahim. Seminar at University of Kuwait
A Search and Jump Algorithm for Markov Chain Monte Carlo Sampling Christopher Jennison Department of Mathematical Sciences, University of Bath, UK http://people.bath.ac.uk/mascj Adriana Ibrahim Institute
More informationUnsupervised Learning
2018 EE448, Big Data Mining, Lecture 7 Unsupervised Learning Weinan Zhang Shanghai Jiao Tong University http://wnzhang.net http://wnzhang.net/teaching/ee448/index.html ML Problem Setting First build and
More informationForecasting Wind Ramps
Forecasting Wind Ramps Erin Summers and Anand Subramanian Jan 5, 20 Introduction The recent increase in the number of wind power producers has necessitated changes in the methods power system operators
More informationA Fully Nonparametric Modeling Approach to. BNP Binary Regression
A Fully Nonparametric Modeling Approach to Binary Regression Maria Department of Applied Mathematics and Statistics University of California, Santa Cruz SBIES, April 27-28, 2012 Outline 1 2 3 Simulation
More informationData Mining and Analysis: Fundamental Concepts and Algorithms
Data Mining and Analysis: Fundamental Concepts and Algorithms dataminingbook.info Mohammed J. Zaki 1 Wagner Meira Jr. 2 1 Department of Computer Science Rensselaer Polytechnic Institute, Troy, NY, USA
More informationDEGREES OF FREEDOM IN QUADRATIC GOODNESS OF FIT. Pennsylvania State University, Columbia University, Boston University
Submitted to the Annals of Statistics DEGREES OF FREEDOM IN QUADRATIC GOODNESS OF FIT By Bruce G. Lindsay, Marianthi Markatou and Surajit Ray Pennsylvania State University, Columbia University, Boston
More informationParametric Modelling of Over-dispersed Count Data. Part III / MMath (Applied Statistics) 1
Parametric Modelling of Over-dispersed Count Data Part III / MMath (Applied Statistics) 1 Introduction Poisson regression is the de facto approach for handling count data What happens then when Poisson
More informationReview and Motivation
Review and Motivation We can model and visualize multimodal datasets by using multiple unimodal (Gaussian-like) clusters. K-means gives us a way of partitioning points into N clusters. Once we know which
More informationTesting Statistical Hypotheses
E.L. Lehmann Joseph P. Romano, 02LEu1 ttd ~Lt~S Testing Statistical Hypotheses Third Edition With 6 Illustrations ~Springer 2 The Probability Background 28 2.1 Probability and Measure 28 2.2 Integration.........
More informationInformation geometry for bivariate distribution control
Information geometry for bivariate distribution control C.T.J.Dodson + Hong Wang Mathematics + Control Systems Centre, University of Manchester Institute of Science and Technology Optimal control of stochastic
More informationParametric Empirical Bayes Methods for Microarrays
Parametric Empirical Bayes Methods for Microarrays Ming Yuan, Deepayan Sarkar, Michael Newton and Christina Kendziorski April 30, 2018 Contents 1 Introduction 1 2 General Model Structure: Two Conditions
More informationClustering and Model Integration under the Wasserstein Metric. Jia Li Department of Statistics Penn State University
Clustering and Model Integration under the Wasserstein Metric Jia Li Department of Statistics Penn State University Clustering Data represented by vectors or pairwise distances. Methods Top- down approaches
More informationMultivariate Statistics
Multivariate Statistics Chapter 2: Multivariate distributions and inference Pedro Galeano Departamento de Estadística Universidad Carlos III de Madrid pedro.galeano@uc3m.es Course 2016/2017 Master in Mathematical
More informationLecture 13: Data Modelling and Distributions. Intelligent Data Analysis and Probabilistic Inference Lecture 13 Slide No 1
Lecture 13: Data Modelling and Distributions Intelligent Data Analysis and Probabilistic Inference Lecture 13 Slide No 1 Why data distributions? It is a well established fact that many naturally occurring
More informationSpectral Clustering of Polarimetric SAR Data With Wishart-Derived Distance Measures
Spectral Clustering of Polarimetric SAR Data With Wishart-Derived Distance Measures STIAN NORMANN ANFINSEN ROBERT JENSSEN TORBJØRN ELTOFT COMPUTATIONAL EARTH OBSERVATION AND MACHINE LEARNING LABORATORY
More informationConditional Random Field
Introduction Linear-Chain General Specific Implementations Conclusions Corso di Elaborazione del Linguaggio Naturale Pisa, May, 2011 Introduction Linear-Chain General Specific Implementations Conclusions
More informationNon-Bayesian Classifiers Part II: Linear Discriminants and Support Vector Machines
Non-Bayesian Classifiers Part II: Linear Discriminants and Support Vector Machines Selim Aksoy Department of Computer Engineering Bilkent University saksoy@cs.bilkent.edu.tr CS 551, Fall 2018 CS 551, Fall
More informationEstimation of cumulative distribution function with spline functions
INTERNATIONAL JOURNAL OF ECONOMICS AND STATISTICS Volume 5, 017 Estimation of cumulative distribution function with functions Akhlitdin Nizamitdinov, Aladdin Shamilov Abstract The estimation of the cumulative
More informationMark your answers ON THE EXAM ITSELF. If you are not sure of your answer you may wish to provide a brief explanation.
CS 189 Spring 2015 Introduction to Machine Learning Midterm You have 80 minutes for the exam. The exam is closed book, closed notes except your one-page crib sheet. No calculators or electronic items.
More informationHybrid Dirichlet processes for functional data
Hybrid Dirichlet processes for functional data Sonia Petrone Università Bocconi, Milano Joint work with Michele Guindani - U.T. MD Anderson Cancer Center, Houston and Alan Gelfand - Duke University, USA
More informationLinear Methods for Prediction
Chapter 5 Linear Methods for Prediction 5.1 Introduction We now revisit the classification problem and focus on linear methods. Since our prediction Ĝ(x) will always take values in the discrete set G we
More informationGaussian Processes. Le Song. Machine Learning II: Advanced Topics CSE 8803ML, Spring 2012
Gaussian Processes Le Song Machine Learning II: Advanced Topics CSE 8803ML, Spring 01 Pictorial view of embedding distribution Transform the entire distribution to expected features Feature space Feature
More informationFinal Exam, Machine Learning, Spring 2009
Name: Andrew ID: Final Exam, 10701 Machine Learning, Spring 2009 - The exam is open-book, open-notes, no electronics other than calculators. - The maximum possible score on this exam is 100. You have 3
More informationKernel Methods. Lecture 4: Maximum Mean Discrepancy Thanks to Karsten Borgwardt, Malte Rasch, Bernhard Schölkopf, Jiayuan Huang, Arthur Gretton
Kernel Methods Lecture 4: Maximum Mean Discrepancy Thanks to Karsten Borgwardt, Malte Rasch, Bernhard Schölkopf, Jiayuan Huang, Arthur Gretton Alexander J. Smola Statistical Machine Learning Program Canberra,
More informationBayesian Nonparametrics
Bayesian Nonparametrics Peter Orbanz Columbia University PARAMETERS AND PATTERNS Parameters P(X θ) = Probability[data pattern] 3 2 1 0 1 2 3 5 0 5 Inference idea data = underlying pattern + independent
More informationOptimal Estimation of a Nonsmooth Functional
Optimal Estimation of a Nonsmooth Functional T. Tony Cai Department of Statistics The Wharton School University of Pennsylvania http://stat.wharton.upenn.edu/ tcai Joint work with Mark Low 1 Question Suppose
More informationL11: Pattern recognition principles
L11: Pattern recognition principles Bayesian decision theory Statistical classifiers Dimensionality reduction Clustering This lecture is partly based on [Huang, Acero and Hon, 2001, ch. 4] Introduction
More informationSome Curiosities Arising in Objective Bayesian Analysis
. Some Curiosities Arising in Objective Bayesian Analysis Jim Berger Duke University Statistical and Applied Mathematical Institute Yale University May 15, 2009 1 Three vignettes related to John s work
More informationUNIVERSITY of PENNSYLVANIA CIS 520: Machine Learning Final, Fall 2013
UNIVERSITY of PENNSYLVANIA CIS 520: Machine Learning Final, Fall 2013 Exam policy: This exam allows two one-page, two-sided cheat sheets; No other materials. Time: 2 hours. Be sure to write your name and
More informationSemi-Nonparametric Inferences for Massive Data
Semi-Nonparametric Inferences for Massive Data Guang Cheng 1 Department of Statistics Purdue University Statistics Seminar at NCSU October, 2015 1 Acknowledge NSF, Simons Foundation and ONR. A Joint Work
More informationComputer Intensive Methods in Mathematical Statistics
Computer Intensive Methods in Mathematical Statistics Department of mathematics johawes@kth.se Lecture 16 Advanced topics in computational statistics 18 May 2017 Computer Intensive Methods (1) Plan of
More informationPattern Recognition and Machine Learning. Bishop Chapter 9: Mixture Models and EM
Pattern Recognition and Machine Learning Chapter 9: Mixture Models and EM Thomas Mensink Jakob Verbeek October 11, 27 Le Menu 9.1 K-means clustering Getting the idea with a simple example 9.2 Mixtures
More informationSimultaneous inference for multiple testing and clustering via a Dirichlet process mixture model
Simultaneous inference for multiple testing and clustering via a Dirichlet process mixture model David B Dahl 1, Qianxing Mo 2 and Marina Vannucci 3 1 Texas A&M University, US 2 Memorial Sloan-Kettering
More informationGaussian processes for inference in stochastic differential equations
Gaussian processes for inference in stochastic differential equations Manfred Opper, AI group, TU Berlin November 6, 2017 Manfred Opper, AI group, TU Berlin (TU Berlin) inference in SDE November 6, 2017
More information6.047 / Computational Biology: Genomes, Networks, Evolution Fall 2008
MIT OpenCourseWare http://ocw.mit.edu 6.047 / 6.878 Computational Biology: Genomes, Networks, Evolution Fall 2008 For information about citing these materials or our Terms of Use, visit: http://ocw.mit.edu/terms.
More informationECE 521. Lecture 11 (not on midterm material) 13 February K-means clustering, Dimensionality reduction
ECE 521 Lecture 11 (not on midterm material) 13 February 2017 K-means clustering, Dimensionality reduction With thanks to Ruslan Salakhutdinov for an earlier version of the slides Overview K-means clustering
More information