Estimation for nonparametric mixture models

1 Estimation for nonparametric mixture models
David Hunter, Penn State University
Research supported by NSF Grant SES
Joint work with Didier Chauveau (University of Orléans, France) and Tatiana Benaglia (Cambridge University)
Purdue University, November 12, 2009

2 Finite mixture estimation problem
Goal: Estimate $\lambda_j$ and $f_j$ (or $f_{jk}$) given an i.i.d. sample from:
Univariate case: $g(x) = \sum_{j=1}^m \lambda_j f_j(x)$

3 Finite mixture estimation problem
Goal: Estimate $\lambda_j$ and $f_j$ (or $f_{jk}$) given an i.i.d. sample from:
Univariate case: $g(x) = \sum_{j=1}^m \lambda_j f_j(x)$
Multivariate case: $g(\mathbf{x}) = \sum_{j=1}^m \lambda_j \prod_{k=1}^r f_{jk}(x_k)$
N.B.: Assume conditional independence of $x_1, \ldots, x_r$.

4 Finite mixture estimation problem
Goal: Estimate $\lambda_j$ and $f_j$ (or $f_{jk}$) given an i.i.d. sample from:
Univariate case: $g(x) = \sum_{j=1}^m \lambda_j f_j(x)$
Multivariate case: $g(\mathbf{x}) = \sum_{j=1}^m \lambda_j \prod_{k=1}^r f_{jk}(x_k)$
N.B.: Assume conditional independence of $x_1, \ldots, x_r$.
Motivation for this talk: Do not assume any more than necessary about the parametric form of $f_j$ or $f_{jk}$.

5 Advertisement
All computational techniques in this talk are implemented in a package called mixtools for R.
mixtools is on CRAN.
See the recent J. Stat. Soft. article (Benaglia et al., 2009).

6 Outline of talk
1 Motivating examples and notation
2 Review of EM algorithm-ology
3 Identifiability: The univariate case
4 The multivariate non-parametric EM algorithm

7 Outline: Next up...
1 Motivating examples and notation
2 Review of EM algorithm-ology
3 Identifiability: The univariate case
4 The multivariate non-parametric EM algorithm

8 Univariate example: Old Faithful wait times (min.)
[Figure: histogram of the time between Old Faithful eruptions, in minutes]
Obvious bimodality
Normal-looking components (?)
More on this later!

9 Multivariate example: Water-level angles
This example is due to Thomas, Lohaus, and Brainerd (1993).
The task: Subjects are shown 8 vessels, pointing at 1:00, 2:00, 4:00, 5:00, 7:00, 8:00, 10:00, and 11:00.
They draw the water surface for each.
Measure: (signed) angle with the horizontal formed by the drawn surface.
[Figure: vessel tilted to point at 1:00]

10 Notational nightmare
We have:
n = # of individuals in the sample
m = # of Mixture components
r = # of Repeated measurements (coordinates per observation)

11 Notational nightmare
We have:
n = # of individuals in the sample
m = # of Mixture components
r = # of Repeated measurements (coordinates per observation)
Thus, the log-likelihood given data $x_1, \ldots, x_n$ is
$L(\theta) = \sum_{i=1}^n \log \sum_{j=1}^m \lambda_j \prod_{k=1}^r f_{jk}(x_{ik})$
Note the subscripts: Throughout, we use $1 \le i \le n$, $1 \le j \le m$, $1 \le k \le r$.
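
To make the notation concrete, here is a minimal base-R sketch (not from the talk) that evaluates this log-likelihood for an n x r data matrix, assuming the component densities are supplied as a list of lists of R functions; the two-component normal example at the end is purely hypothetical.

# Evaluate L(theta) = sum_i log sum_j lambda_j prod_k f_jk(x_ik)
loglik <- function(x, lambda, f) {
  n <- nrow(x); m <- length(lambda); r <- ncol(x)
  total <- 0
  for (i in 1:n) {
    mix <- 0
    for (j in 1:m) {
      prod_k <- 1
      for (k in 1:r) prod_k <- prod_k * f[[j]][[k]](x[i, k])
      mix <- mix + lambda[j] * prod_k
    }
    total <- total + log(mix)
  }
  total
}

# Hypothetical example: m = 2 normal components, r = 2 coordinates
f <- list(list(function(u) dnorm(u, 0, 1), function(u) dnorm(u, 0, 1)),
          list(function(u) dnorm(u, 3, 1), function(u) dnorm(u, 4, 1)))
x <- matrix(rnorm(20), ncol = 2)
loglik(x, lambda = c(0.4, 0.6), f)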

12 Water-level dataset
After loading mixtools:
R> data(Waterdata)
R> Waterdata
[Output: a data frame with columns Trial.1 through Trial.8]
For these data,
Number of subjects: n = 405
Number of coordinates (repeated measures): r = 8
What should m (the number of mixture components) be?

13 Outline: Next up...
1 Motivating examples and notation
2 Review of EM algorithm-ology
3 Identifiability: The univariate case
4 The multivariate non-parametric EM algorithm

14 Review of standard EM for mixtures
For MLE in finite mixtures, EM algorithms are standard.
A complete observation $(X, Z)$ consists of:
The observed data $X$.
The unobserved vector $Z$, defined for $1 \le j \le m$ by
$Z_j = \begin{cases} 1 & \text{if } X \text{ comes from component } j \\ 0 & \text{otherwise} \end{cases}$

15 Review of standard EM for mixtures
For MLE in finite mixtures, EM algorithms are standard.
A complete observation $(X, Z)$ consists of:
The observed data $X$.
The unobserved vector $Z$, defined for $1 \le j \le m$ by
$Z_j = \begin{cases} 1 & \text{if } X \text{ comes from component } j \\ 0 & \text{otherwise} \end{cases}$
What does this mean?
In simulations: Generate $Z$ first, then $X \mid Z \sim \prod_j [f_j(\cdot)]^{Z_j}$.
In real data, $Z$ is a latent variable whose interpretation depends on context.
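
As an aside (not from the talk), the "generate Z first, then X given Z" recipe is easy to carry out in R; the sample size, mixing proportions, means, and standard deviations below are purely illustrative.

# Simulate n draws from a two-component normal mixture via the latent Z
n <- 300
lambda <- c(0.3, 0.7)
z <- sample(1:2, n, replace = TRUE, prob = lambda)   # latent component labels
mu <- c(0, 4); sigma <- c(1, 1.5)
x <- rnorm(n, mean = mu[z], sd = sigma[z])           # X given Z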

16 Parametric EM algorithm for mixtures
E-step: Amounts to finding the conditional expectation of each $Z_i$:
$\hat{Z}_{ij} \stackrel{\text{def}}{=} E_{\hat\theta}[Z_{ij} \mid x_i] = \frac{\hat\lambda_j \hat f_j(x_i)}{\sum_{j'} \hat\lambda_{j'} \hat f_{j'}(x_i)}$
M-step: Amounts to maximizing the expected complete-data loglikelihood
$L_c(\theta) = \sum_{i=1}^n \sum_{j=1}^m \hat{Z}_{ij} \log\left[\lambda_j f_j(x_i)\right]$
Iterate: Let $\hat\theta = \arg\max_\theta L_c(\theta)$ and repeat.
N.B.: Usually, $f_j(x) \equiv f(x; \phi_j)$. We let $\theta$ denote $(\lambda, \phi)$.

17 Parametric EM algorithm for mixtures
E-step: Amounts to finding the conditional expectation of each $Z_i$:
$\hat{Z}_{ij} \stackrel{\text{def}}{=} E_{\hat\theta}[Z_{ij} \mid x_i] = \frac{\hat\lambda_j \hat f_j(x_i)}{\sum_{j'} \hat\lambda_{j'} \hat f_{j'}(x_i)}$
M-step: Amounts to maximizing the expected complete-data loglikelihood
$L_c(\theta) = \sum_{i=1}^n \sum_{j=1}^m \hat{Z}_{ij} \log\left[\lambda_j f_j(x_i)\right] \;\Longrightarrow\; \hat\lambda^{\text{next}} = \frac{1}{n} \sum_{i=1}^n \hat{Z}_i$
Iterate: Let $\hat\theta = \arg\max_\theta L_c(\theta)$ and repeat.
N.B.: Usually, $f_j(x) \equiv f(x; \phi_j)$. We let $\theta$ denote $(\lambda, \phi)$.
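
Putting the E- and M-steps together for the usual Gaussian case, here is a minimal base-R sketch (not the mixtools implementation); the function name normal_em and the simple convergence rule are my own choices.

# EM for a univariate m-component normal mixture; theta = (lambda, mu, sigma)
normal_em <- function(x, lambda, mu, sigma, maxit = 500, tol = 1e-8) {
  n <- length(x); m <- length(lambda)
  for (it in 1:maxit) {
    # E-step: posterior component probabilities Z-hat (an n x m matrix)
    dens <- sapply(1:m, function(j) lambda[j] * dnorm(x, mu[j], sigma[j]))
    z <- dens / rowSums(dens)
    # M-step: weighted maximum likelihood updates
    lambda_new <- colMeans(z)
    mu_new <- colSums(z * x) / colSums(z)
    sigma_new <- sqrt(colSums(z * (outer(x, mu_new, "-"))^2) / colSums(z))
    if (max(abs(lambda_new - lambda)) < tol) break
    lambda <- lambda_new; mu <- mu_new; sigma <- sigma_new
  }
  list(lambda = lambda, mu = mu, sigma = sigma, posterior = z)
}

For example, normal_em(faithful$waiting, lambda = c(0.5, 0.5), mu = c(55, 80), sigma = c(5, 5)) should behave much like the normalmixEM call on the next slide, up to differences in the stopping rule.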

18 Old Faithful data with parametric normal EM
[Figure: histogram of the time between Old Faithful eruptions (minutes) with fitted density]
In R with mixtools, type
R> data(faithful)
R> attach(faithful)
R> ans = normalmixEM(waiting, mu = c(55, 80), sigma = 5, fast = TRUE)
number of iterations = 24

19 Old Faithful data with parametric normal EM
[Figure: histogram with the fitted two-component normal mixture; $\hat\lambda_1$ is shown on the plot]
In R with mixtools, type
R> data(faithful)
R> attach(faithful)
R> ans = normalmixEM(waiting, mu = c(55, 80), sigma = 5, fast = TRUE)
number of iterations = 24
Gaussian EM result: $\hat\mu = (54.6, 80.1)$

20 Outline: Next up...
1 Motivating examples and notation
2 Review of EM algorithm-ology
3 Identifiability: The univariate case
4 The multivariate non-parametric EM algorithm

21 Identifiability
Univariate case: $g(x) = \sum_{j=1}^m \lambda_j f_j(x)$
Identifiability means: $g(x)$ uniquely determines all $\lambda_j$ and $f_j$ (up to permuting the subscripts).
Parametric case: When $f_j(x) = f(x; \phi_j)$, generally no problem.
Nonparametric case: Big problem! We need some restrictions on $f_j$.

22 How to restrict $f_j$ in the univariate (r = 1) case?
Bordes, Mottelet, and Vandekerkhove (2006) and Hunter, Wang, and Hettmansperger (2007) both showed that
$g(x) = \sum_{j=1}^2 \lambda_j f_j(x)$
is identifiable, at least when $\lambda_1 \neq 1/2$, if $f_j(x) \equiv f(x - \mu_j)$ for some density $f(\cdot)$ that is symmetric about the origin.

23 Exploiting identifiability: An EM algorithm
Assume that $g(x) = \sum_{j=1}^2 \lambda_j f(x - \mu_j)$, where $f(\cdot)$ is a symmetric density.
Bordes, Chauveau, and Vandekerkhove (2007) introduce an EM-like algorithm that includes a kernel density estimation step.
It is much simpler than the algorithms of Bordes et al. (2006) or Hunter et al. (2007).

24 An EM algorithm for m = 2, r = 1:
E-step: Same as usual:
$\hat{Z}_{ij} \stackrel{\text{def}}{=} E_{\hat\theta}[Z_{ij} \mid x_i] = \frac{\hat\lambda_j \hat f(x_i - \hat\mu_j)}{\hat\lambda_1 \hat f(x_i - \hat\mu_1) + \hat\lambda_2 \hat f(x_i - \hat\mu_2)}$

25 An EM algorithm for m = 2, r = 1:
E-step: Same as usual:
$\hat{Z}_{ij} \stackrel{\text{def}}{=} E_{\hat\theta}[Z_{ij} \mid x_i] = \frac{\hat\lambda_j \hat f(x_i - \hat\mu_j)}{\hat\lambda_1 \hat f(x_i - \hat\mu_1) + \hat\lambda_2 \hat f(x_i - \hat\mu_2)}$
M-step: Maximize the complete-data loglikelihood for $\lambda$ and $\mu$:
$\hat\lambda^{\text{next}} = \frac{1}{n} \sum_{i=1}^n \hat{Z}_i \qquad \hat\mu_j^{\text{next}} \stackrel{?}{=} \frac{1}{n \hat\lambda_j^{\text{next}}} \sum_{i=1}^n \hat{Z}_{ij} x_i$

26 An EM algorithm for m = 2, r = 1:
E-step: Same as usual:
$\hat{Z}_{ij} \stackrel{\text{def}}{=} E_{\hat\theta}[Z_{ij} \mid x_i] = \frac{\hat\lambda_j \hat f(x_i - \hat\mu_j)}{\hat\lambda_1 \hat f(x_i - \hat\mu_1) + \hat\lambda_2 \hat f(x_i - \hat\mu_2)}$
M-step: Maximize the complete-data loglikelihood for $\lambda$ and $\mu$:
$\hat\lambda^{\text{next}} = \frac{1}{n} \sum_{i=1}^n \hat{Z}_i \qquad \hat\mu_j^{\text{next}} \stackrel{?}{=} \frac{1}{n \hat\lambda_j^{\text{next}}} \sum_{i=1}^n \hat{Z}_{ij} x_i$
KDE-step: Update the estimate of $f$ (for some bandwidth $h$) by
$\hat f^{\text{next}}(u) = \frac{1}{nh} \sum_{i=1}^n \sum_{j=1}^2 \hat{Z}_{ij} K\!\left(\frac{u - x_i + \hat\mu_j}{h}\right)$, then symmetrize.
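
Here is a minimal base-R sketch (not the mixtools implementation, which provides this algorithm as spEMsymloc) of one pass of these three steps; the Gaussian kernel, the symmetrization by averaging $\hat f$ with its reflection about zero, and the function name sp_em_step are my own choices.

# One pass of the EM-like algorithm for a 2-component location mixture of a
# common symmetric density f; fhat is the current density estimate (a function)
sp_em_step <- function(x, lambda, mu, fhat, h) {
  n <- length(x)
  # E-step: posterior probabilities using the current fhat
  num <- cbind(lambda[1] * fhat(x - mu[1]), lambda[2] * fhat(x - mu[2]))
  z <- num / rowSums(num)
  # M-step: update lambda and mu
  lambda_new <- colMeans(z)
  mu_new <- colSums(z * x) / colSums(z)
  # KDE-step: weighted kernel density estimate of f, then symmetrize about 0
  f_raw <- function(u) sapply(u, function(u0)
    sum(z * dnorm((u0 - outer(x, mu_new, "-")) / h)) / (n * h))
  fhat_next <- function(u) (f_raw(u) + f_raw(-u)) / 2
  list(lambda = lambda_new, mu = mu_new, fhat = fhat_next, posterior = z)
}

# Illustrative single pass on the Old Faithful waiting times with h = 4
x <- faithful$waiting
step1 <- sp_em_step(x, lambda = c(0.5, 0.5), mu = c(55, 80),
                    fhat = function(u) dnorm(u, sd = 10), h = 4)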

27 Old Faithful data again
[Figure: histogram of the time between Old Faithful eruptions (minutes)]

28 Old Faithful data again
[Figure: histogram with the Gaussian EM fit; $\hat\lambda_1$ is shown on the plot]
Gaussian EM: $\hat\mu = (54.6, 80.1)$

29 Old Faithful data again
[Figure: histogram with the Gaussian EM and semiparametric EM fits; $\hat\lambda_1$ values are shown on the plot]
Gaussian EM: $\hat\mu = (54.6, 80.1)$
Semiparametric EM with bandwidth = 4: $\hat\mu = (54.7, 79.8)$

30 Outline: Next up...
1 Motivating examples and notation
2 Review of EM algorithm-ology
3 Identifiability: The univariate case
4 The multivariate non-parametric EM algorithm

31 The blessing of dimensionality (!)
Recall the model in the multivariate case, r > 1:
$g(\mathbf{x}) = \sum_{j=1}^m \lambda_j \prod_{k=1}^r f_{jk}(x_k)$
N.B.: Assume conditional independence of $x_1, \ldots, x_r$.
Hall and Zhou (2003) show that when m = 2 and r = 3, the model is identifiable without restrictions on the $f_{jk}(\cdot)$!

32 The blessing of dimensionality (!)
Recall the model in the multivariate case, r > 1:
$g(\mathbf{x}) = \sum_{j=1}^m \lambda_j \prod_{k=1}^r f_{jk}(x_k)$
N.B.: Assume conditional independence of $x_1, \ldots, x_r$.
Hall and Zhou (2003) show that when m = 2 and r = 3, the model is identifiable without restrictions on the $f_{jk}(\cdot)$!
Hall et al. (2005) remark that this is a case in which "... from at least one point of view, the curse of dimensionality works in reverse."

33 An amazing result
Recall the model in the multivariate case, r > 1:
$g(\mathbf{x}) = \sum_{j=1}^m \lambda_j \prod_{k=1}^r f_{jk}(x_k)$
N.B.: Assume conditional independence of $x_1, \ldots, x_r$.
Allman et al. (2009) use a theorem by Kruskal (1976) to show that if:
$f_{1k}, \ldots, f_{mk}$ are linearly independent for each $k$;
$r \ge 3$;
... then $g(\mathbf{x})$ uniquely determines all the $\lambda_j$ and $f_{jk}$ (up to label-switching).

34 The nonparametric EM, generalized
E-step: Same as usual:
$\hat{Z}_{ij} \stackrel{\text{def}}{=} E_{\hat\theta}[Z_{ij} \mid \mathbf{x}_i] = \frac{\hat\lambda_j \prod_{k=1}^r \hat f_{jk}(x_{ik})}{\sum_{j'} \hat\lambda_{j'} \prod_{k=1}^r \hat f_{j'k}(x_{ik})}$

35 The nonparametric EM, generalized
E-step: Same as usual:
$\hat{Z}_{ij} \stackrel{\text{def}}{=} E_{\hat\theta}[Z_{ij} \mid \mathbf{x}_i] = \frac{\hat\lambda_j \prod_{k=1}^r \hat f_{jk}(x_{ik})}{\sum_{j'} \hat\lambda_{j'} \prod_{k=1}^r \hat f_{j'k}(x_{ik})}$
M-step: Maximize the complete-data loglikelihood for $\lambda$:
$\hat\lambda^{\text{next}} = \frac{1}{n} \sum_{i=1}^n \hat{Z}_i$

36 The nonparametric EM, generalized
E-step: Same as usual:
$\hat{Z}_{ij} \stackrel{\text{def}}{=} E_{\hat\theta}[Z_{ij} \mid \mathbf{x}_i] = \frac{\hat\lambda_j \prod_{k=1}^r \hat f_{jk}(x_{ik})}{\sum_{j'} \hat\lambda_{j'} \prod_{k=1}^r \hat f_{j'k}(x_{ik})}$
M-step: Maximize the complete-data loglikelihood for $\lambda$:
$\hat\lambda^{\text{next}} = \frac{1}{n} \sum_{i=1}^n \hat{Z}_i$
KDE-step: Update the estimate of each $f_{jk}$ (for some bandwidth $h$) by
$\hat f_{jk}^{\text{next}}(u) = \frac{1}{n h \hat\lambda_j} \sum_{i=1}^n \hat{Z}_{ij} K\!\left(\frac{u - x_{ik}}{h}\right)$.
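
The loop below is a minimal base-R sketch of one such pass (not the mixtools npEM implementation, which adds refinements such as bandwidth selection); the representation of the density estimates as a list of lists of functions and the name np_em_step are my own choices.

# One pass of the multivariate nonparametric EM for an n x r data matrix x,
# mixing proportions lambda (length m), density estimates f[[j]][[k]], bandwidth h
np_em_step <- function(x, lambda, f, h) {
  n <- nrow(x); r <- ncol(x); m <- length(lambda)
  # E-step: lambda_j times the product over coordinates of the current f_jk
  num <- sapply(1:m, function(j)
    lambda[j] * apply(sapply(1:r, function(k) f[[j]][[k]](x[, k])), 1, prod))
  z <- num / rowSums(num)
  # M-step: update the mixing proportions
  lambda_new <- colMeans(z)
  # KDE-step: weighted kernel density estimate for each component and coordinate
  f_new <- lapply(1:m, function(j) lapply(1:r, function(k) {
    xk <- x[, k]; w <- z[, j]
    function(u) sapply(u, function(u0)
      sum(w * dnorm((u0 - xk) / h)) / (n * h * lambda_new[j]))
  }))
  list(lambda = lambda_new, f = f_new, posterior = z)
}

Iterating np_em_step until the $\hat\lambda_j$ stabilize, starting from (say) a rough k-means split of the rows, gives a crude version of what the npEM calls on the next two slides do.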

37 Simulated example: bandwidth chosen by default
Using mixtools in R, type
mu = rbind(0, 1:4)                        # means (m=2, r=4)
sigma = matrix(1, 2, 4)
lambda = c(0.4, 0.6)
x = rmvnormmix(500, lambda, mu, sigma)    # simulated data
a = npEM(x, 2)                            # fit the model (2 comps)
par(mfrow=c(2,2))
plot(a)
[Output: estimated means and variances by component and coordinate; plots of the fitted densities for each coordinate. bandwidth = 0.23]

38 Simulated example: bandwidth = 1
Use the same simulated x data from the previous slide, then type
b = npEM(x, 2, bw=1)    # fit the model (2 comps)
plot(b)
[Output: estimated means and variances by component and coordinate; plots of the fitted densities for each coordinate]

39 Different bandwidths
From earlier...
KDE-step: Update the estimate of each $f_{jk}$ (for some bandwidth $h$) by
$\hat f_{jk}^{\text{next}}(u) = \frac{1}{n h \hat\lambda_j} \sum_{i=1}^n \hat{Z}_{ij} K\!\left(\frac{u - x_{ik}}{h}\right)$.
No reason $h$ can't be $h_{jk}$.
Algorithms for choosing $h_{jk}$ are discussed in Benaglia et al. (2008) and implemented in mixtools.
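
One simple way to obtain component- and coordinate-specific bandwidths is a Silverman-type rule of thumb computed with the posterior weights; the sketch below only illustrates that idea and is not necessarily the rule used in Benaglia et al. (2008) or in mixtools.

# Adaptive bandwidths h[j, k] from posterior weights z (n x m) and data x (n x r)
adapt_bw <- function(x, z) {
  n <- nrow(x); r <- ncol(x); m <- ncol(z)
  h <- matrix(NA, m, r)
  for (j in 1:m) for (k in 1:r) {
    w <- z[, j] / sum(z[, j])                 # normalized weights for component j
    mu_w <- sum(w * x[, k])
    sd_w <- sqrt(sum(w * (x[, k] - mu_w)^2))  # weighted standard deviation
    h[j, k] <- 0.9 * sd_w * (n * mean(z[, j]))^(-1/5)   # Silverman-style factor
  }
  h
}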

40 The notation gets even worse...
Suppose some of the r coordinates are identically distributed.
Let the r coordinates be grouped into B i.d. blocks.
Denote the block of the kth coordinate by $b_k$, $1 \le b_k \le B$.
The model becomes
$g(\mathbf{x}) = \sum_{j=1}^m \lambda_j \prod_{k=1}^r f_{j b_k}(x_k)$

41 The notation gets even worse...
Suppose some of the r coordinates are identically distributed.
Let the r coordinates be grouped into B i.d. blocks.
Denote the block of the kth coordinate by $b_k$, $1 \le b_k \le B$.
The model becomes
$g(\mathbf{x}) = \sum_{j=1}^m \lambda_j \prod_{k=1}^r f_{j b_k}(x_k)$
Special cases:
$b_k = k$ for each k: Fully general model, seen earlier (Hall et al. 2005; Qin and Leung 2006)
$b_k = 1$ for each k: Conditionally i.i.d. assumption (Elmore et al. 2004)

42 The Water-level data, three components
Fitted component (mean, std dev) pairs by block; the mixing proportions appear on the original figures.
Block 1 (1:00 and 7:00 orientations): (32.1, 19.4), (3.9, 23.3), (1.4, 6.0)
Block 2 (2:00 and 8:00 orientations): (31.4, 55.4), (11.7, 27.0), (2.7, 4.6)
Block 3 (4:00 and 10:00 orientations): (43.6, 39.7), (11.4, 27.5), (1.0, 5.3)
Block 4 (5:00 and 11:00 orientations): (27.5, 19.3), (2.0, 22.1), (0.1, 6.1)

43 The Water-level data, four components
Fitted component (mean, std dev) pairs by block; the mixing proportions appear on the original figures.
Block 1 (1:00 and 7:00 orientations): (31.0, 10.2), (22.9, 35.2), (0.5, 16.4), (1.7, 5.1)
Block 2 (2:00 and 8:00 orientations): (48.2, 36.2), (0.3, 51.9), (14.5, 18.0), (2.7, 4.3)
Block 3 (4:00 and 10:00 orientations): (58.2, 16.3), (0.5, 49.0), (15.6, 16.9), (0.9, 5.2)
Block 4 (5:00 and 11:00 orientations): (28.2, 12.0), (18.0, 34.6), (1.9, 14.8), (0.3, 5.3)

44 Old slide: Pros and cons of npEM
Compared with the mixture model inversion method (Hall et al. 2005):
Pro: Easily generalizes beyond m = 2, r = 3.
"Con": Easily generalizes beyond m = 2, r = 3.
Pro: Much lower MISE for similar test problems.
Pro: Computationally simple.
Compared with the exponential tilting method (Qin and Leung, 2006):
Pro and "Con": Easily generalizable, as above.
Pro: Computationally simple.
Pro: Less restrictive model; does not assume the different components are related in any way.
Compared with the cutpoint method (Elmore et al. 2004):
Pro: No need to assume conditionally i.i.d. coordinates.
Pro: No loss of information from categorizing data.
Con: Identifiability of the model not settled.
Update: Now the "cons" are fixed; identifiability is settled.

45 Open questions
Are the estimators consistent, and if so at what rate?
Empirical evidence: Rates of convergence similar to those in the non-mixture setting.
Does this algorithm maximize any sort of log-likelihood? Is it truly an EM algorithm?
With small adjustments to the E-step, the answer is yes. This is ongoing work with Michael Levine.

46 References
Allman, E. S., Matias, C., and Rhodes, J. A. (2009), Identifiability of parameters in latent structure models with many observed variables, Annals of Statistics, 37.
Benaglia, T., Chauveau, D., and Hunter, D. R. (2008), Bandwidth selection in an EM-like algorithm for nonparametric multivariate mixtures, Technical Report, version 1, HAL.
Benaglia, T., Chauveau, D., and Hunter, D. R. (2009a), An EM-like algorithm for semi- and non-parametric estimation in multivariate mixtures, Journal of Computational and Graphical Statistics, 18.
Benaglia, T., Chauveau, D., Hunter, D. R., and Young, D. S. (2009b), mixtools: An R package for analyzing finite mixture models, Journal of Statistical Software, 32(6).
Bordes, L., Mottelet, S., and Vandekerkhove, P. (2006), Semiparametric estimation of a two-component mixture model, Annals of Statistics, 34.
Bordes, L., Chauveau, D., and Vandekerkhove, P. (2007), An EM algorithm for a semiparametric mixture model, Computational Statistics and Data Analysis, 51.
Elmore, R. T., Hettmansperger, T. P., and Thomas, H. (2004), Estimating component cumulative distribution functions in finite mixture models, Communications in Statistics: Theory and Methods, 33.
Hall, P. and Zhou, X. H. (2003), Nonparametric estimation of component distributions in a multivariate mixture, Annals of Statistics, 31.
Hall, P., Neeman, A., Pakyari, R., and Elmore, R. (2005), Nonparametric inference in multivariate mixtures, Biometrika, 92.
Hunter, D. R., Wang, S., and Hettmansperger, T. P. (2007), Inference for mixtures of symmetric distributions, Annals of Statistics, 35.
Kruskal, J. B. (1976), More factors than subjects, tests and treatments: An indeterminacy theorem for canonical decomposition and individual differences scaling, Psychometrika, 41.
Qin, J. and Leung, D. H.-Y. (2006), Semiparametric analysis in conditionally independent multivariate mixture models, unpublished manuscript.
Thomas, H., Lohaus, A., and Brainerd, C. J. (1993), Modeling growth and individual differences in spatial tasks, Monographs of the Society for Research in Child Development, 58(9).
