Maximum Smoothed Likelihood for Multivariate Nonparametric Mixtures


Maximum Smoothed Likelihood for Multivariate Nonparametric Mixtures. David Hunter, Pennsylvania State University, USA. Joint work with: Tom Hettmansperger, Hoben Thomas, Didier Chauveau, Pierre Vandekerkhove, Laurent Bordes, Tatiana Benaglia, Michael Levine, Xiaotian Zhu. Research partially supported by NSF Grants SES 0518772 and DMS 1209007. JdS Toulouse, May 30, 2013.

Outline 1 Nonparametric mixtures and parameter identifiability 2 Motivating example; extension to multivariate case 3 Smoothed maximum likelihood

We first introduce nonparametric finite mixtures:

$$g(x) = \int f_\phi(x)\, dQ(\phi) \qquad (1)$$

where g is the mixture density, f_φ the component density, and Q the mixing distribution. Sometimes Q(·) is the nonparametric part; e.g., work by Bruce Lindsay assumes Q(·) is unrestricted. However, in this talk we assume that f_φ(·) is (mostly) unrestricted and Q(·) has finite support. So (1) becomes

$$g(x) = \sum_{j=1}^m \lambda_j f_j(x)$$

... and we assume m is known.
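As a quick illustration of model (1) with finite support (not from the original slides), here is a minimal R sketch that simulates from a two-component mixture g(x) = λ_1 f_1(x) + λ_2 f_2(x); the component families and parameter values are arbitrary choices.

# Minimal sketch: simulate n draws from a two-component finite mixture
set.seed(1)
n      <- 500
lambda <- c(0.3, 0.7)                                    # mixing proportions (sum to 1)
z      <- sample(1:2, n, replace = TRUE, prob = lambda)  # latent component labels
x      <- ifelse(z == 1,
                 rnorm(n, mean = -2, sd = 1),            # f_1 = N(-2, 1)
                 rnorm(n, mean =  3, sd = 1.5))          # f_2 = N(3, 1.5^2)
hist(x, breaks = 30, freq = FALSE, main = "Sample from a two-component mixture")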

Living histogram at Penn State (Brian Joiner, 1975) Can we learn about male / female subpopulations (i.e., parameters)? What can we say about individuals?

Old Faithful Geyser waiting times provide another simple univariate example. [Histogram: time between Old Faithful eruptions, 40-100 minutes; from www.nps.gov/yell] Let m = 2, so assume we have a sample from λ_1 f_1(x) + λ_2 f_2(x). Why do we need any assumptions on the f_j?

With no assumptions, parameters are not identifiable. [Figure: two different decompositions of the same density into λ_1 f_1 and λ_2 f_2.] Multiple different parameter combinations (λ_1, λ_2, f_1, f_2) give the same mixture density. Thus, some constraints on the f_j are necessary. NB: Sometimes there is no obvious multi-modality.

The univariate case is identifiable under some assumptions. It is possible to show¹ that if

$$g(x) = \sum_{j=1}^2 \lambda_j f_j(x),$$

the λ_j and f_j are uniquely identifiable from g if λ_1 ≠ 1/2 and f_j(x) = f(x − μ_j) for some density f(·) that is symmetric about the origin.

¹ cf. Bordes, Mottelet, and Vandekerkhove (2006); Hunter, Wang, and Hettmansperger (2007)

A modified EM algorithm may be used for estimation. EM preliminaries: A complete observation (X, Z) consists of the observed data X and the unobserved vector Z, defined by

$$Z_j = \begin{cases} 1 & \text{if } X \text{ comes from component } j, \text{ for } 1 \le j \le m, \\ 0 & \text{otherwise.} \end{cases}$$

What does this mean? In simulations: generate Z first, then X | Z ∼ ∏_j [f_j(·)]^{Z_j}. In real data, Z is a latent variable whose interpretation depends on context.

Standard EM for finite mixtures looks like this:

E-step: Amounts to finding the conditional expectation of each Z_i:

$$\hat{Z}_{ij} \overset{\text{def}}{=} E_{\hat\theta}[Z_{ij} \mid x_i] = \frac{\hat\lambda_j \hat{f}_j(x_i)}{\sum_{j'} \hat\lambda_{j'} \hat{f}_{j'}(x_i)}$$

M-step: Amounts to maximizing the expected complete-data loglikelihood

$$L_c(\theta) = \sum_{i=1}^n \sum_{j=1}^m \hat{Z}_{ij} \log\left[\lambda_j f_j(x_i)\right] \quad\Longrightarrow\quad \hat\lambda_j^{\text{next}} = \frac{1}{n}\sum_{i=1}^n \hat{Z}_{ij}$$

Iterate: Let θ̂^next = arg max_θ L_c(θ) and repeat.

N.B.: Usually, f_j(x) = f(x; φ_j). We let θ denote (λ, φ).
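For concreteness, the following R sketch performs one such E-step/M-step update for a two-component Gaussian mixture; the data (Old Faithful waiting times from base R) and starting values are placeholders rather than anything from the slides.

# One EM iteration for a two-component Gaussian mixture (placeholder starting values)
x      <- faithful$waiting
lambda <- c(0.5, 0.5)                  # current mixing proportions
mu     <- c(50, 80); sigma <- c(5, 5)  # current component means and sds

# E-step: posterior probabilities Zhat_ij = lambda_j f_j(x_i) / sum_j' lambda_j' f_j'(x_i)
dens <- cbind(lambda[1] * dnorm(x, mu[1], sigma[1]),
              lambda[2] * dnorm(x, mu[2], sigma[2]))
zhat <- dens / rowSums(dens)

# M-step: weighted updates maximizing the expected complete-data loglikelihood
lambda <- colMeans(zhat)
mu     <- colSums(zhat * x) / colSums(zhat)
sigma  <- sqrt(colSums(zhat * outer(x, mu, "-")^2) / colSums(zhat))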

A modified EM algorithm may be used for estimation.

E-step and M-step: Same as usual:

$$\hat{Z}_{ij} \overset{\text{def}}{=} E_{\hat\theta}[Z_{ij} \mid x_i] = \frac{\hat\lambda_j \hat{f}_j(x_i)}{\sum_{j'} \hat\lambda_{j'} \hat{f}_{j'}(x_i)} \qquad\text{and}\qquad \hat\lambda_j^{\text{next}} = \frac{1}{n}\sum_{i=1}^n \hat{Z}_{ij}$$

KDE-step: Update each f̂_j using a weighted kernel density estimate, weighting each x_i by the corresponding Ẑ_ij. Alternatively, randomly select Z_i ∼ Mult(1, Ẑ_i) and use a standard KDE for the x_i selected into each component (cf. Bordes, Chauveau, & Vandekerkhove, 2007).

For the symmetric location family assumption, the modified EM looks like this:

E-step: Same as usual:

$$\hat{Z}_{ij} = \frac{\hat\lambda_j \hat{f}(x_i - \hat\mu_j)}{\hat\lambda_1 \hat{f}(x_i - \hat\mu_1) + \hat\lambda_2 \hat{f}(x_i - \hat\mu_2)}$$

M-step: Maximize the complete-data loglikelihood for λ and μ:

$$\hat\lambda_j^{\text{next}} = \frac{1}{n}\sum_{i=1}^n \hat{Z}_{ij}, \qquad \hat\mu_j^{\text{next}} = (n\hat\lambda_j)^{-1}\sum_{i=1}^n \hat{Z}_{ij}\, x_i$$

KDE-step: Update the estimate of f (for some bandwidth h) by

$$\hat{f}^{\text{next}}(u) = (nh)^{-1}\sum_{i=1}^n \sum_{j=1}^2 \hat{Z}_{ij}\, K\!\left(\frac{u - x_i + \hat\mu_j}{h}\right),$$

then symmetrize.
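A minimal R sketch of this KDE-step (with reflection to enforce symmetry), assuming a current symmetric density estimate, location estimates, and mixing proportions; all numerical values here are placeholders, and the bandwidth h = 4 matches the Old Faithful fit shown below.

# Placeholders for the current iteration's estimates
x    <- faithful$waiting
mu   <- c(55, 80)                      # current location estimates
lam  <- c(0.4, 0.6)                    # current mixing proportions
fcur <- function(u) dnorm(u, sd = 6)   # current symmetric density estimate

# E-step weights
dens <- cbind(lam[1] * fcur(x - mu[1]), lam[2] * fcur(x - mu[2]))
zhat <- dens / rowSums(dens)

# KDE-step: weighted KDE of the residuals x_i - mu_j, symmetrized by reflecting about 0
res  <- c(x - mu[1], x - mu[2])
w    <- c(zhat[, 1], zhat[, 2])
fnew <- density(c(res, -res), weights = c(w, w) / (2 * sum(w)), bw = 4)
plot(fnew, main = "Updated symmetric density estimate")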

Compare two solutions for Old Faithful data. [Figure: histogram of time between Old Faithful eruptions (minutes), with the two fitted mixture densities overlaid.]

Gaussian EM: λ̂_1 = 0.361, μ̂ = (54.6, 80.1)

Semiparametric EM with bandwidth 4: λ̂_1 = 0.361, μ̂ = (54.7, 79.8)

Both algorithms are implemented in the mixtools package for R (Benaglia et al., 2009).
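Both fits can be reproduced along the following lines (a sketch: the starting values are my own choices, and the names of the components returned by spEMsymloc may vary across mixtools versions).

library(mixtools)
wait <- faithful$waiting
# Gaussian EM fit with two components
gauss_fit <- normalmixEM(wait, k = 2)
gauss_fit$lambda
gauss_fit$mu
# Semiparametric EM under the symmetric location assumption, bandwidth 4
sp_fit <- spEMsymloc(wait, mu0 = c(55, 80), bw = 4)
str(sp_fit)   # inspect the estimated lambda, mu, and posteriors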

Outline 1 Nonparametric mixtures and parameter identifiability 2 Motivating example; extension to multivariate case 3 Smoothed maximum likelihood

Multivariate example: Water-level angles. This example is due to Thomas, Lohaus, & Brainerd (1993). The task: Subjects are shown 8 vessels, pointing at 1:00, 2:00, 4:00, 5:00, 7:00, 8:00, 10:00, and 11:00. They draw the water surface on each vessel. Measure: the (signed) angle with the horizontal formed by the line drawn. [Figure: vessel tilted to point at 1:00.]

Water-level dataset. After loading mixtools,

R> data(Waterdata)
R> Waterdata
  Trial.1 Trial.2 Trial.3 Trial.4 Trial.5 Trial.6 Trial.7 Trial.8
1     -16      30      -5       8      30       1       0     -26
2     -12      13     -15     -64      21      17     -46     -66
3       7     -85     -27     -81      39      -2     -10     -12
4       6      55      -5       1      36      -3       5     -14
5      32      58     -61     -30      60      28     -30     -60
...

For these data: number of subjects n = 405; number of coordinates (repeated measures) r = 8. What should m (the number of mixture components) be?

We will assume conditional independence. Each density f_j(x) on R^r is assumed to be the product of its marginals:

$$g(x) = \sum_{j=1}^m \lambda_j \prod_{k=1}^r f_{jk}(x_k)$$

We call this assumption conditional independence (cf. Hall and Zhou, 2003; Qin and Leung, 2006). It is very similar to repeated measures models: in RM models, we often assume measurements are independent conditional on the individual. Here, we have component-specific effects instead of individual-specific effects.
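To make the product-of-marginals structure concrete, here is a small illustrative R function (not from the slides) that evaluates g(x) when the marginal densities f_jk are supplied as an m-by-r matrix of functions; the toy marginals below are arbitrary.

# Evaluate g(x) = sum_j lambda_j prod_k f_jk(x_k) for one observation x of length r
mixdens_condindep <- function(x, lambda, marginals) {
  m <- length(lambda)
  comp <- sapply(1:m, function(j)
    prod(sapply(seq_along(x), function(k) marginals[[j, k]](x[k]))))
  sum(lambda * comp)
}

# Toy usage: m = 2 components, r = 3 coordinates (arbitrary marginal choices)
marg <- matrix(list(), nrow = 2, ncol = 3)
for (k in 1:3) {
  marg[[1, k]] <- function(u) dnorm(u, 0, 1)
  marg[[2, k]] <- function(u) dnorm(u, 3, 2)
}
mixdens_condindep(c(0.5, 1.0, -0.2), lambda = c(0.4, 0.6), marginals = marg)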

There is a nice identifiability result for conditional independence when r ≥ 3. Recall the conditional independence finite mixture model:

$$g(x) = \sum_{j=1}^m \lambda_j \prod_{k=1}^r f_{jk}(x_k)$$

Allman, Matias, & Rhodes (2009) use a theorem by Kruskal (1976) to show that if f_{1k}, ..., f_{mk} are linearly independent for each k, and r ≥ 3, then g(x) uniquely determines all the λ_j and f_{jk} (up to label-switching).

Some of the marginals may be assumed identical. Let the r coordinates be grouped into B i.d. blocks, and denote the block of the kth coordinate by b_k, 1 ≤ b_k ≤ B. The model becomes

$$g(x) = \sum_{j=1}^m \lambda_j \prod_{k=1}^r f_{j b_k}(x_k)$$

Special cases: b_k = k for each k gives the fully general model seen earlier (Hall, Neeman, Pakyari, & Elmore 2005; Qin & Leung 2006); b_k = 1 for each k gives the conditionally i.i.d. assumption (Elmore, Hettmansperger, & Thomas 2004).

The water-level data may be blocked. The 8 vessels are presented in the order 11, 4, 2, 7, 10, 5, 1, 8 o'clock. Assume that opposite clock-face orientations lead to conditionally i.i.d. responses (same behavior). This gives B = 4 blocks, defined by b = (4, 3, 2, 1, 3, 4, 1, 2); e.g., b_4 = b_7 = 1, i.e., block 1 relates to coordinates 4 and 7, corresponding to clock orientations 1:00 and 7:00. [Figure: the eight vessel orientations 11:00, 4:00, 2:00, 7:00, 10:00, 5:00, 1:00, 8:00.]

The nonparametric EM algorithm is easily extended to the multivariate conditional independence case.

E-step and M-step: Same as usual:

$$\hat{Z}_{ij} = \frac{\hat\lambda_j \hat{f}_j(x_i)}{\sum_{j'} \hat\lambda_{j'} \hat{f}_{j'}(x_i)} = \frac{\hat\lambda_j \prod_{k=1}^r \hat{f}_{jk}(x_{ik})}{\sum_{j'} \hat\lambda_{j'} \prod_{k=1}^r \hat{f}_{j'k}(x_{ik})} \qquad\text{and}\qquad \hat\lambda_j^{\text{next}} = \frac{1}{n}\sum_{i=1}^n \hat{Z}_{ij}$$

KDE-step: Update the estimate of each f_{jk} (for some bandwidth h) by

$$\hat{f}_{jk}^{\text{next}}(u) = \frac{1}{n h \hat\lambda_j^{\text{next}}} \sum_{i=1}^n \hat{Z}_{ij}\, K\!\left(\frac{u - x_{ik}}{h}\right).$$

(Benaglia, Chauveau, Hunter, 2009)
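The fits shown on the next two slides can be obtained roughly as follows (a sketch: the column selection follows the printout shown earlier, the blockid vector is the blocking b = (4, 3, 2, 1, 3, 4, 1, 2) described above, and default bandwidths are assumed).

library(mixtools)
data(Waterdata)
water   <- as.matrix(Waterdata[, paste0("Trial.", 1:8)])  # the 8 angle measurements
blockid <- c(4, 3, 2, 1, 3, 4, 1, 2)                      # blocks pairing opposite orientations
# Nonparametric EM-like algorithm under conditional independence with blocked marginals
fit3 <- npEM(water, mu0 = 3, blockid = blockid)           # mu0 = number of components (or initial centers)
fit4 <- npEM(water, mu0 = 4, blockid = blockid)
summary(fit3)
plot(fit3)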

The Water-level data, three components. [Figure: four panels, one per block, showing the three estimated component densities on (−90, 90) degrees. Each line below gives the mixing proportion followed by (mean, std dev):]
Block 1 (1:00 and 7:00 orientations): 0.077 (32.1, 19.4); 0.431 (3.9, 23.3); 0.492 (1.4, 6.0)
Block 2 (2:00 and 8:00 orientations): 0.077 (31.4, 55.4); 0.431 (11.7, 27.0); 0.492 (2.7, 4.6)
Block 3 (4:00 and 10:00 orientations): 0.077 (43.6, 39.7); 0.431 (11.4, 27.5); 0.492 (1.0, 5.3)
Block 4 (5:00 and 11:00 orientations): 0.077 (27.5, 19.3); 0.431 (2.0, 22.1); 0.492 (0.1, 6.1)

The Water-level data, four components. [Figure: four panels, one per block, showing the four estimated component densities on (−90, 90) degrees. Each line below gives the mixing proportion followed by (mean, std dev):]
Block 1 (1:00 and 7:00 orientations): 0.049 (31.0, 10.2); 0.117 (22.9, 35.2); 0.355 (0.5, 16.4); 0.478 (1.7, 5.1)
Block 2 (2:00 and 8:00 orientations): 0.049 (48.2, 36.2); 0.117 (0.3, 51.9); 0.355 (14.5, 18.0); 0.478 (2.7, 4.3)
Block 3 (4:00 and 10:00 orientations): 0.049 (58.2, 16.3); 0.117 (0.5, 49.0); 0.355 (15.6, 16.9); 0.478 (0.9, 5.2)
Block 4 (5:00 and 11:00 orientations): 0.049 (28.2, 12.0); 0.117 (18.0, 34.6); 0.355 (1.9, 14.8); 0.478 (0.3, 5.3)

Outline 1 Nonparametric mixtures and parameter identifiability 2 Motivating example; extension to multivariate case 3 Smoothed maximum likelihood

The previous algorithm is not truly EM. Does this algorithm maximize any sort of log-likelihood? Does it have an EM-like ascent property? Are the estimators consistent and, if so, at what rate? Empirical evidence: rates of convergence are similar to those in the non-mixture setting. We might hope that the algorithm has an ascent property for

$$\ell(\lambda, f) = \sum_{i=1}^n \log \sum_{j=1}^m \lambda_j f_j(x_i) = \sum_{i=1}^n \log \sum_{j=1}^m \lambda_j \prod_{k=1}^r f_{jk}(x_{ik}).$$

Unfortunately, the ascent inequality ℓ(λ^next, f^next) ≥ ℓ(λ, f) is not guaranteed.

Smoothing the likelihood. There is no ascent property for

$$\ell(\lambda, f) = \sum_{i=1}^n \log \sum_{j=1}^m \lambda_j f_j(x_i).$$

However, we borrow an idea from Eggermont and LaRiccia (1995) and introduce a nonlinearly smoothed version:

$$\ell_n^S(\lambda, f) = \sum_{i=1}^n \log \sum_{j=1}^m \lambda_j [N f_j](x_i),$$

where

$$[N f_j](x) = \exp \int \frac{1}{h^r} K_r\!\left(\frac{x - u}{h}\right) \log f_j(u)\, du.$$
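As a univariate (r = 1) illustration of the operator N (my own sketch, with an arbitrary density f and a Gaussian kernel), [N f](x) can be computed by numerical integration:

# [N f](x) = exp( integral of (1/h) K((x - u)/h) log f(u) du ), univariate case
Nsmooth <- function(x, f, h) {
  integrand <- function(u) dnorm((x - u) / h) / h * log(f(u))
  exp(integrate(integrand, lower = x - 10 * h, upper = x + 10 * h)$value)
}
f <- function(u) dnorm(u, mean = 1, sd = 2)  # arbitrary smooth density
Nsmooth(0, f, h = 0.5)                       # close to f(0) for small h
f(0)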

The infinite-sample case: minimum K-L divergence. Whereas

$$\ell_n^S(\lambda, f) = \sum_{i=1}^n \log \sum_{j=1}^m \lambda_j [N f_j](x_i),$$

we may define e_j = λ_j f_j and write

$$\ell(e) = \int g(x) \log \frac{g(x)}{\sum_j [N e_j](x)}\, dx + \int \sum_j e_j(x)\, dx.$$

We wish to minimize ℓ(e) over vectors e of functions not necessarily integrating to unity. The added term ensures that the solution satisfies the usual constraints, i.e., Σ_j λ_j = 1 and ∫ f_j(x) dx = 1. This may be viewed as minimizing a (penalized) K-L divergence between g(·) and Σ_j [N e_j](·).

The finite- and infinite-sample problems lead to EM-like (actually MM) algorithms.

E-step: Old: $$\hat{Z}_{ij} = \frac{\hat\lambda_j \hat{f}_j(x_i)}{\sum_{j'} \hat\lambda_{j'} \hat{f}_{j'}(x_i)}$$

M-step: $$\hat\lambda_j^{\text{next}} = \frac{1}{n}\sum_{i=1}^n \hat{Z}_{ij}$$

KDE-step (part 2 of the M-step): $$\hat{f}_{jk}^{\text{next}}(u) = \frac{1}{n h \hat\lambda_j^{\text{next}}} \sum_{i=1}^n \hat{Z}_{ij}\, K\!\left(\frac{u - x_{ik}}{h}\right).$$

The finite- and infinite-sample problems lead to EM-like (actually MM) algorithms.

E-step (technically an M-step, for "minorization"):

Old: $$\hat{Z}_{ij} = \frac{\hat\lambda_j \hat{f}_j(x_i)}{\sum_{j'} \hat\lambda_{j'} \hat{f}_{j'}(x_i)}$$

New: $$\hat{Z}_{ij} = \frac{\hat\lambda_j [N\hat{f}_j](x_i)}{\sum_{j'} \hat\lambda_{j'} [N\hat{f}_{j'}](x_i)}$$

M-step: $$\hat\lambda_j^{\text{next}} = \frac{1}{n}\sum_{i=1}^n \hat{Z}_{ij}$$

KDE-step (part 2 of the M-step): $$\hat{f}_{jk}^{\text{next}}(u) = \frac{1}{n h \hat\lambda_j^{\text{next}}} \sum_{i=1}^n \hat{Z}_{ij}\, K\!\left(\frac{u - x_{ik}}{h}\right).$$
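As far as I know, this smoothed MM algorithm is implemented in mixtools as npMSL, with an interface mirroring npEM; a hedged usage sketch on the water-level data:

library(mixtools)
data(Waterdata)
water   <- as.matrix(Waterdata[, paste0("Trial.", 1:8)])
blockid <- c(4, 3, 2, 1, 3, 4, 1, 2)
# Nonlinearly smoothed EM-like (MM) algorithm; arguments assumed analogous to npEM
fit_nems <- npMSL(water, mu0 = 3, blockid = blockid)
summary(fit_nems)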

The first M in MM is for minorization. Recall the smoothed loglikelihood

$$\ell_n^S(\lambda, f) = \sum_{i=1}^n \log \sum_{j=1}^m \lambda_j [N f_j](x_i)$$

and define

$$Q(\lambda, f \mid \hat\lambda, \hat{f}) = \sum_{i=1}^n \sum_{j=1}^m \hat{Z}_{ij} \log\left\{ \lambda_j [N f_j](x_i) \right\}.$$

Using Jensen's inequality (since Σ_j Ẑ_ij = 1 for each i), we can prove

$$\ell_n^S(\lambda, f) - \ell_n^S(\hat\lambda, \hat{f}) \ge Q(\lambda, f \mid \hat\lambda, \hat{f}) - Q(\hat\lambda, \hat{f} \mid \hat\lambda, \hat{f}).$$

We say that Q(· | λ̂, f̂) is a minorizer of ℓ_n^S(·) at (λ̂, f̂).

MM is a generalization of EM. The minorizing inequality is

$$\ell_n^S(\lambda, f) - \ell_n^S(\hat\lambda, \hat{f}) \ge Q(\lambda, f \mid \hat\lambda, \hat{f}) - Q(\hat\lambda, \hat{f} \mid \hat\lambda, \hat{f}). \qquad (2)$$

By (2), increasing Q(· | λ̂, f̂) leads to an increase in ℓ_n^S(·), and Q(λ, f | λ̂, f̂) is maximized at (λ̂^next, f̂^next). Thus, the MM algorithm guarantees that ℓ_n^S(λ̂^next, f̂^next) ≥ ℓ_n^S(λ̂, f̂). [Figure: top curve ℓ_n^S(λ, f); bottom curve Q(λ, f | λ̂, f̂) − Q(λ̂, f̂ | λ̂, f̂) + ℓ_n^S(λ̂, f̂); the two curves touch at (λ̂, f̂).]

The Water-level data revisited. [Figure: four panels, one per block, overlaying the NEMS solution on the original solution; mixing proportions 0.431, 0.492, 0.077.]
NEMS = nonlinearly smoothed EM-like algorithm (Levine, Chauveau, Hunter 2011)
Colored lines = original (non-smoothed) algorithm
Very little difference in any example we've seen.

Next step: Use the ascent property to establish theory. Empirical results suggest that consistency and convergence-rate results are possible. Eggermont and LaRiccia (1995) prove asymptotic results for a related problem; however, it appears that their methods do not apply directly here. [Figure: log(MISE) versus log(n) with 90% empirical confidence intervals; estimated slope −0.365 for f_21 and −0.488 for λ_2.]

References

Allman, E. S., Matias, C., and Rhodes, J. A. (2009), Identifiability of parameters in latent structure models with many observed variables, Annals of Statistics, 37: 3099–3132.
Benaglia, T., Chauveau, D., and Hunter, D. R. (2009), An EM-like algorithm for semi- and non-parametric estimation in multivariate mixtures, Journal of Computational and Graphical Statistics, 18: 505–526.
Benaglia, T., Chauveau, D., Hunter, D. R., and Young, D. S. (2009), mixtools: An R Package for Analyzing Finite Mixture Models, Journal of Statistical Software, 32(6).
Bordes, L., Mottelet, S., and Vandekerkhove, P. (2006), Semiparametric estimation of a two-component mixture model, Annals of Statistics, 34: 1204–1232.
Bordes, L., Chauveau, D., and Vandekerkhove, P. (2007), An EM algorithm for a semiparametric mixture model, Computational Statistics and Data Analysis, 51: 5429–5443.
Eggermont, P. P. B. and LaRiccia, V. N. (1995), Maximum Smoothed Likelihood Density Estimation for Inverse Problems, Annals of Statistics, 23: 199–220.
Elmore, R. T., Hettmansperger, T. P., and Thomas, H. (2004), Estimating component cumulative distribution functions in finite mixture models, Communications in Statistics: Theory and Methods, 33: 2075–2086.
Hall, P., Neeman, A., Pakyari, R., and Elmore, R. (2005), Nonparametric inference in multivariate mixtures, Biometrika, 92: 667–678.
Hunter, D. R., Wang, S., and Hettmansperger, T. P. (2007), Inference for mixtures of symmetric distributions, Annals of Statistics, 35: 224–251.
Levine, M., Hunter, D. R., and Chauveau, D. (2011), Maximum Smoothed Likelihood for Multivariate Mixtures, Biometrika, 98(2): 403–416.
Kruskal, J. B. (1976), More Factors Than Subjects, Tests and Treatments: An Indeterminacy Theorem for Canonical Decomposition and Individual Differences Scaling, Psychometrika, 41: 281–293.
Thomas, H., Lohaus, A., and Brainerd, C. J. (1993), Modeling Growth and Individual Differences in Spatial Tasks, Monographs of the Society for Research in Child Development, 58(9): 1–190.
Qin, J. and Leung, D. H.-Y. (2006), Semiparametric analysis in conditionally independent multivariate mixture models, unpublished manuscript.

EXTRA SLIDES

Identifiability for mixtures of regressions: Intuition. Consider the simplest case: univariate x_i and J ∈ {1, 2}:

$$Y = X\beta_J + \epsilon, \qquad \epsilon \sim f.$$

[Figure: two regression lines through the origin, with the conditional densities of Y at X = x_0 and X = x_1.] Fix X = x_0. The conditional distribution of Y when X = x_0 is not necessarily identifiable as a mixture of shifted versions of f, even if f is assumed (say) symmetric. Identifiability depends on using additional X values that change the relative locations of the mixture components.

Mixtures of simple linear regressions. Next allow an intercept: Y = β_{J1} + Xβ_{J2} + ε, with ε ∼ f. [Figure: two parallel regression lines, with the conditional densities of Y at X = x_0 and X = x_1.] Even if f is assumed (say) symmetric about zero, identifiability can fail: additional X values give no new information if the regression lines are parallel.

Theorem (Hunter and Young 2012). If the support of the density h(x) contains an open set in R^p, then the parameters in

$$\psi(x, y) = h(x) \sum_{j=1}^m \lambda_j f(y - x^t \beta_j)$$

are uniquely identifiable from ψ(x, y). When X is one-dimensional, there is a nice way to see this based on an idea of Laurent Bordes and Pierre Vandekerkhove.

Single predictor case. Let

$$g_x(y) = \sum_{j=1}^m \lambda_j f(y - \mu_j - \beta_j x)$$

denote the conditional density of y given x. Next, define

$$R_k(a, b) = \int x^k g_x(a + bx)\, dx.$$

A bit of algebra gives:

$$R_0(a, b) = \sum_{j=1}^m \frac{\lambda_j}{b - \beta_j} \qquad \text{(this identifies the } \beta_j \text{ and } \lambda_j\text{);}$$

$$R_1(a, b) = \sum_{j=1}^m \frac{\lambda_j (\mu_j - a)}{(b - \beta_j)^2} \qquad \text{(this identifies the } \mu_j\text{).}$$
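These two identities can be checked numerically in R (all parameter values below are arbitrary; f is taken to be a mean-zero normal density, and b is chosen larger than every β_j so the displayed forms apply directly):

# Numerical check of the R_0 and R_1 identities for a two-component example
lambda <- c(0.4, 0.6); mu <- c(1, -2); beta <- c(0.5, 2)
f  <- dnorm                                             # mean-zero error density
gx <- function(y, x) sum(lambda * f(y - mu - beta * x)) # conditional density of y given x
Rk <- function(k, a, b)
  integrate(Vectorize(function(x) x^k * gx(a + b * x, x)), -Inf, Inf)$value
a <- 0.3; b <- 3                                        # b > max(beta)
c(Rk(0, a, b), sum(lambda / (b - beta)))                # the two numbers should agree
c(Rk(1, a, b), sum(lambda * (mu - a) / (b - beta)^2))   # the two numbers should agree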