Maximum Likelihood Estimation in Latent Class Models for Contingency Table Data


Slide 1: Maximum Likelihood Estimation in Latent Class Models for Contingency Table Data

Stephen E. Fienberg
Department of Statistics, Machine Learning Department, and Cylab, Carnegie Mellon University
May 20, 2008
(Joint work with P. Hersh, A. Rinaldo, and Yi Zhou)

Slide 2: Outline

- Latent class models: existence and uniqueness of the MLE, identifiability, model selection
- Algebraic geometry: the geometric description of LC models
- Examples: the 100 Swiss Francs problem; a sparse 2^16 table from the National Long Term Care Survey

Slides 3-6: 100 Swiss Francs Problem

Observed table:

    n = [ 4 2 2 2
          2 4 2 2
          2 2 4 2
          2 2 2 4 ]

Log-likelihood:

    l(p) = 4 Σ_i log p_ii + 2 Σ_{i≠j} log p_ij

Problem, version 1: find the MLE for the 4 x 4 contingency table n under the latent class model with two classes.

Problem, version 2: find the 4 x 4 matrix p of rank 2 that is closest, in the sense of maximum likelihood, to the empirical distribution (1/40) n.

Is the following table (shown as 40 times the fitted probabilities) an MLE? Can you prove it is a global maximum?

    [ 3 3 2 2
      3 3 2 2
      2 2 3 3
      2 2 3 3 ]

Is the MLE unique? If not, can you suggest other MLEs?
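
The following is a minimal EM sketch, not part of the slides, for fitting the two-class model to the table above; the names lam, a, b stand for the parameters λ, α, β introduced on the later slides, and the starting values are arbitrary. Depending on the random start it typically converges to one of the seven local maxima discussed below.

import numpy as np

n = np.array([[4., 2, 2, 2],
              [2, 4, 2, 2],
              [2, 2, 4, 2],
              [2, 2, 2, 4]])
N = n.sum()

rng = np.random.default_rng(0)
lam = np.full(2, 0.5)
a = rng.dirichlet(np.ones(4), size=2).T   # a[i, h] = alpha_{ih}, columns sum to 1
b = rng.dirichlet(np.ones(4), size=2).T   # b[j, h] = beta_{jh}, columns sum to 1

for _ in range(2000):
    # E-step: posterior weight of each class h for each cell (i, j)
    joint = lam[None, None, :] * a[:, None, :] * b[None, :, :]   # shape (4, 4, 2)
    post = joint / joint.sum(axis=2, keepdims=True)
    # M-step: re-estimate the parameters from the expected class-specific counts
    w = n[:, :, None] * post
    lam = w.sum(axis=(0, 1)) / N
    a = w.sum(axis=1) / w.sum(axis=(0, 1))
    b = w.sum(axis=0) / w.sum(axis=(0, 1))

p = (lam[None, None, :] * a[:, None, :] * b[None, :, :]).sum(axis=2)
print(np.round(40 * p, 3))          # fitted table on the scale of the counts
print((n * np.log(p)).sum())        # log-likelihood at the fitted point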

Slide 7: Categorical Data and Contingency Tables

Consider k categorical variables X_1, ..., X_k, where each X_i takes values in the finite set [d_i] = {1, ..., d_i}. The cross-classification of N i.i.d. realizations of (X_1, ..., X_k) produces a random integer-valued vector n with entries

    n_{i_1,...,i_k} = Σ_{j=1}^N 1{X_1^(j) = i_1, ..., X_k^(j) = i_k}.

The contingency table n has a Multinomial(N, p) distribution, where d = Π_{i=1}^k d_i and p is a point in the (d - 1)-dimensional probability simplex Δ_{d-1} with coordinates

    p_{i_1,...,i_k} = Pr{(X_1, ..., X_k) = (i_1, ..., i_k)}.

The likelihood function is

    L(p) = ( N! / Π_{i_1,...,i_k} n_{i_1,...,i_k}! ) Π_{i_1,...,i_k} p_{i_1,...,i_k}^{n_{i_1,...,i_k}}.

Slide 8: Latent Structure

Let H be an unobservable latent variable taking values in the set [r] = {1, ..., r}. In its most basic version, a.k.a. the naive Bayes model, the LC model postulates that, conditional on H, the variables X_1, ..., X_k are mutually independent:

    p_{i_1,...,i_k} = Σ_{h=1}^r p_{i_1,...,i_k,h} = Σ_{h=1}^r λ_h p_1^(h)(i_1) p_2^(h)(i_2) ··· p_k^(h)(i_k),

where λ_h = Pr(H = h) and p_k^(h)(i_k) = Pr(X_k = i_k | H = h).
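
As an illustration of this parametrization (not from the slides), the sketch below builds the full cell-probability array of a naive Bayes / latent class model from mixing weights and conditional distributions; the function name lc_probabilities and the example numbers are purely illustrative.

import numpy as np
from functools import reduce

def lc_probabilities(lam, conditionals):
    # lam: (r,) mixing weights; conditionals: list of (d_i, r) arrays whose
    # columns are P(X_i = . | H = h).  Returns the d_1 x ... x d_k array p.
    p = np.zeros([c.shape[0] for c in conditionals])
    for h in range(len(lam)):
        # outer product of the k conditional distributions for class h
        p += lam[h] * reduce(np.multiply.outer, [c[:, h] for c in conditionals])
    return p

lam = np.array([0.6, 0.4])
conds = [np.array([[0.7, 0.2],
                   [0.3, 0.8]]),                 # binary X_1
         np.array([[0.5, 0.1],
                   [0.3, 0.3],
                   [0.2, 0.6]])]                 # ternary X_2
p = lc_probabilities(lam, conds)
print(p.shape, p.sum())                          # (2, 3) and 1.0 up to rounding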

Slide 9: The Likelihood (2-way Tables)

For the 2-class naive Bayes model with two manifest variables, the cell probabilities are

    p_ij = Σ_{h ∈ {1,2}} λ_h α_ih β_jh,

and the log-likelihood function is

    l(θ) = Σ_{i,j} n_ij log ( Σ_{h ∈ {1,2}} λ_h α_ih β_jh ),

where Σ_h λ_h = Σ_i α_ih = Σ_j β_jh = 1.

Slides 10-12: Issues for Estimation and Testing

1. Maximum likelihood estimation
   - not an exponential family, so there is no general theory for the existence and uniqueness of the MLE
   - no reduction by sufficiency: the minimal sufficient statistics are the data themselves
   - maxima of the likelihood computed by Newton-Raphson or the EM algorithm are only local maxima
   - the MLE for p = (p_ij) need not be unique (multimodality)
   - there can be infinitely many MLEs for (λ_h, α_ih, β_jh) (unidentifiability)

2. Goodness-of-fit testing
   - because the model may be unidentifiable with respect to (λ_h, α_ih, β_jh), computing the effective dimension is an issue

Slides 13-15: Geometric Derivation of LC Models (2-way Tables)

2 x 2 table with margins:

    [ p_11  p_12 | p_1+ ]
    [ p_21  p_22 | p_2+ ]
    [ p_+1  p_+2 | p_++ ]

Model of independence: p_ij = p_i+ p_+j = α_i β_j.

Now consider the polynomial map

    f : Δ_1 × Δ_1 → Δ_3,   (α_1, α_2, β_1, β_2) ↦ ( α_1 β_1, α_1 β_2, α_2 β_1, α_2 β_2 ),

where Δ_{d-1} = {x ∈ R^d : Σ_{i=1}^d x_i = 1, x_i ≥ 0}. With some computation, we get the image of the mapping:

    Image(f) = {(p_11, p_12, p_21, p_22) : p_11 p_22 = p_12 p_21},

the surface of independence.
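
A one-line numerical check, illustrative and not from the slides, that tables of the form p_ij = α_i β_j satisfy the equation p_11 p_22 = p_12 p_21 defining Image(f):

import numpy as np

rng = np.random.default_rng(0)
alpha = rng.dirichlet(np.ones(2))            # (alpha_1, alpha_2)
beta = rng.dirichlet(np.ones(2))             # (beta_1, beta_2)
p = np.outer(alpha, beta)                    # p_ij = alpha_i * beta_j
print(np.isclose(p[0, 0] * p[1, 1], p[0, 1] * p[1, 0]))   # True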

Slides 16-18: Surface of 2-level LC Model for 2 x 2 Table

    p_ij = λ_1 α_i1 β_j1 + λ_2 α_i2 β_j2

i.e., as a table,

    p = λ_1 [ α_11 β_11   α_11 β_21     + λ_2 [ α_12 β_12   α_12 β_22
              α_21 β_11   α_21 β_21 ]            α_22 β_12   α_22 β_22 ]

The model is the set of mixtures of two points on the surface of independence:

    V = {(p_11, p_12, p_21, p_22) : p_11 p_22 = p_12 p_21}
    S = {λ_1 p + λ_2 q : p, q ∈ V, λ_1, λ_2 ≥ 0, λ_1 + λ_2 = 1}

[Figure: the surface V inside the simplex Δ_3.]

Slide 19: 2 x 2 Table with 1 Binary Latent Class Variable is Identifiable

What about identifiability?

    p_ij = λ_1 α_i1 β_j1 + λ_2 α_i2 β_j2

Definition (Identifiability): the mapping f : (λ_h, α_ih, β_jh) → (p_11, p_12, p_21, p_22) is locally one-to-one.

1. Identifiability holds if and only if the symbolic rank of the Jacobian of f is full.
2. A necessary condition is that the dimension of Image(f) equals the expected dimension min{1 + 2·2·(2 - 1), 3} = 3.

Slide 20: 3 x 3 Table with 1 Binary Latent Class Variable is Unidentifiable!

    p_ij = Σ_{h ∈ {1,2}} λ_h α_ih β_jh,   i, j = 1, 2, 3.

We can compute the polynomials vanishing on Image(f), where f : (λ, α, β) → p = (p_ij)_{i,j=1,2,3}. The only such polynomial is the determinant of the 3 x 3 table, so

    Image(f) = {p = (p_ij)_{i,j=1,2,3} : det(p) = 0}.

Then we can compute the dimension of Image(f): the effective dimension is 7, less than the standard dimension 2(3 - 1) + 2(3 - 1) + 1 = 9 and the expected dimension min{9, 8} = 8.
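
The claimed effective dimension can be checked numerically. The sketch below is my own illustration, not the authors' code: it parametrizes the 2-class model for the 3 x 3 table by its 9 free parameters, estimates the Jacobian of f by central differences, and reports its rank, which matches the effective dimension 7.

import numpy as np

rng = np.random.default_rng(1)

def f(theta):
    # free parameters: lambda_1, plus alpha_{ih} and beta_{jh} for i, j = 1, 2 and h = 1, 2
    lam = np.array([theta[0], 1 - theta[0]])
    a = theta[1:5].reshape(2, 2)
    b = theta[5:9].reshape(2, 2)
    a = np.vstack([a, 1 - a.sum(axis=0)])    # alpha[i, h], columns sum to 1
    b = np.vstack([b, 1 - b.sum(axis=0)])    # beta[j, h], columns sum to 1
    return (lam * a[:, None, :] * b[None, :, :]).sum(axis=2).ravel()

theta0 = rng.uniform(0.1, 0.4, size=9)
eps = 1e-6
J = np.column_stack([(f(theta0 + eps * e) - f(theta0 - eps * e)) / (2 * eps)
                     for e in np.eye(9)])
print(np.linalg.matrix_rank(J, tol=1e-6))    # expected to print 7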

Slide 21: Algebraic Tools

In the language of algebraic geometry, Image(f) is a variety. Below is the code we use in the symbolic software SINGULAR to compute the polynomials defining Image(f).

LIB "elim.lib";   // assumed: library providing elim1 (elimination of variables from an ideal)
ring r = 0, (p11,p12,p13,p21,p22,p23,p31,p32,p33,
             h1,h2, a11,a21,a31, a12,a22,a32,
             b11,b21,b31, b12,b22,b32), lp;
// generators: the parametrization p_ij = h1*a_i1*b_j1 + h2*a_i2*b_j2,
// plus the sum-to-one constraints on the parameters and the cell probabilities
ideal I = p11-h1*a11*b11-h2*a12*b12,
          p12-h1*a11*b21-h2*a12*b22,
          p13-h1*a11*b31-h2*a12*b32,
          p21-h1*a21*b11-h2*a22*b12,
          p22-h1*a21*b21-h2*a22*b22,
          p23-h1*a21*b31-h2*a22*b32,
          p31-h1*a31*b11-h2*a32*b12,
          p32-h1*a31*b21-h2*a32*b22,
          p33-h1*a31*b31-h2*a32*b32,
          h1+h2-1,
          a11+a21+a31-1, a12+a22+a32-1,
          b11+b21+b31-1, b12+b22+b32-1,
          p11+p12+p13+p21+p22+p23+p31+p32+p33-1;
// eliminate the parameters, leaving the polynomials in the p_ij alone
ideal J = elim1(I, h1*h2*a11*a21*a31*a12*a22*a32*b11*b21*b31*b12*b22*b32);

Slides 22-23: Identifiability of LC Models for General 2-way Tables

Lemma
1. The I x J probability matrix P of the r-level latent class model has rank at most r.
2. The image variety Image(f) is defined by all the (r + 1) x (r + 1) subdeterminants of P.

Theorem (Effective Dimension for 2-way Tables)
The r-level latent class model for an I x J table has effective dimension (I + J - r)r - 1, and therefore the dimension of the unidentifiable space for (λ_h, α_ih, β_jh) is SD - ED = r(r - 1).
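
A quick numerical illustration of the lemma, not from the slides: a randomly generated 2-class probability matrix for a 4 x 5 table has rank 2, so all of its 3 x 3 subdeterminants vanish up to rounding. The parameter names mirror the notation above.

import numpy as np
from itertools import combinations

rng = np.random.default_rng(4)
I, J, r = 4, 5, 2
lam = rng.dirichlet(np.ones(r))
alpha = rng.dirichlet(np.ones(I), size=r).T          # (I, r), columns sum to 1
beta = rng.dirichlet(np.ones(J), size=r).T           # (J, r), columns sum to 1
P = (lam * alpha[:, None, :] * beta[None, :, :]).sum(axis=2)

print(np.linalg.matrix_rank(P))                      # 2
minors = [np.linalg.det(P[np.ix_(rows, cols)])
          for rows in combinations(range(I), 3)
          for cols in combinations(range(J), 3)]
print(max(abs(m) for m in minors))                   # essentially 0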

Slides 24-25: 100 Swiss Francs Problem

Here is the observed table again:

    n = [ 4 2 2 2
          2 4 2 2
          2 2 4 2
          2 2 2 4 ]

We want to fit a 2-level latent class model to this table. Equivalently, solve either

    max_{λ,α,β}  Σ_{i,j} n_ij log ( Σ_h λ_h α_ih β_jh )   s.t.  Σ_h λ_h = Σ_i α_ih = Σ_j β_jh = 1,

or

    max_p  Σ_{i,j} n_ij log p_ij   s.t.  det(p^(ij)) = 0 for all i, j   (i.e. rank(p) ≤ 2),

where p^(ij) is the 3 x 3 submatrix of p obtained by erasing the ith row and the jth column.

Slides 26-29: How Many Maxima Are There?

At the outset, I told you about the candidate solution

    (1/40) [ 3 3 2 2
             3 3 2 2
             2 2 3 3
             2 2 3 3 ].

But there are actually 7 local maxima, of which this is one. We found them by repeated runs of EM from random starting points. How many of them are global?

Slide 30: Maxima of the Log-likelihood Function

The fitted tables at the seven maxima (shown as 40 times the fitted probabilities) are of two kinds:

- tables such as

      [ 3 3 2 2
        3 3 2 2
        2 2 3 3
        2 2 3 3 ],

  obtained by pairing up the rows and columns into two blocks;

- tables such as

      [ 8/3 8/3 8/3 2
        8/3 8/3 8/3 2
        8/3 8/3 8/3 2
        2   2   2   4 ],

  obtained by singling out one index.
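
A small check, not from the slides, that both kinds of fitted tables satisfy the rank-2 constraint, together with a comparison of their log-likelihoods; the values of roughly -110.10 and -110.15 are computed here rather than quoted from the talk.

import numpy as np

n = np.array([[4., 2, 2, 2], [2, 4, 2, 2], [2, 2, 4, 2], [2, 2, 2, 4]])
p_mle = np.array([[3., 3, 2, 2], [3, 3, 2, 2], [2, 2, 3, 3], [2, 2, 3, 3]]) / 40
p_loc = np.array([[8/3, 8/3, 8/3, 2], [8/3, 8/3, 8/3, 2],
                  [8/3, 8/3, 8/3, 2], [2, 2, 2, 4]]) / 40

for p in (p_mle, p_loc):
    print(np.linalg.matrix_rank(p), (n * np.log(p)).sum())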

Slide 31: The Shape of the Likelihood Function

[Figure: profile likelihood for the parameters (α_11, α_21, α_31).]

Slides 32-33: 2-D Unidentifiable Subspaces for the 7 Local Maxima

We know the degree of deficiency in the parameter space is 2. Therefore, for each MLE (or local maximum) there is a 2-dimensional subspace of the parameter space corresponding to it. For the MLE shown above, this subspace is cut out (in terms of λ_1, α_11, β_11) by

    80 α_11 β_11 λ_1 - 20 α_11 λ_1 - 20 β_11 λ_1 + 6 λ_1 - 1 = 0.
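
The sketch below is illustrative; the helper params_on_surface and the sample points are my own. It solves the surface equation above for β_11 given (λ_1, α_11), rebuilds a full parameter vector, and confirms that different points on the surface all yield the same fitted table (1/40)·[3 3 2 2; 3 3 2 2; 2 2 3 3; 2 2 3 3].

import numpy as np

def params_on_surface(l1, a11):
    # solve the surface equation for beta_11, then build the full (lambda, alpha, beta)
    b11 = (1 + 20 * a11 * l1 - 6 * l1) / (80 * a11 * l1 - 20 * l1)
    alpha1 = np.array([a11, a11, 0.5 - a11, 0.5 - a11])
    beta1 = np.array([b11, b11, 0.5 - b11, 0.5 - b11])
    alpha2 = (0.25 - l1 * alpha1) / (1 - l1)    # forces every row sum of p to be 1/4
    beta2 = (0.25 - l1 * beta1) / (1 - l1)      # forces every column sum of p to be 1/4
    lam = np.array([l1, 1 - l1])
    return lam, np.stack([alpha1, alpha2], axis=1), np.stack([beta1, beta2], axis=1)

for l1, a11 in [(0.5, 0.30), (0.4, 0.35), (0.6, 0.30)]:
    lam, a, b = params_on_surface(l1, a11)
    p = (lam * a[:, None, :] * b[None, :, :]).sum(axis=2)
    print(np.round(40 * p, 6))                  # the same table every time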

Slides 34-36: 2-Dim Unidentifiable Subspaces for the 7 Local Maxima (continued)

For the local maximum with fitted table (40 times the fitted probabilities)

    [ 4 2   2   2
      2 8/3 8/3 8/3
      2 8/3 8/3 8/3
      2 8/3 8/3 8/3 ],

the unidentifiable subspace is

    80 α_11 β_11 λ_1 - 20 α_11 λ_1 - 20 β_11 λ_1 + 8 λ_1 - 3 = 0;

for the local maximum with fitted table

    [ 8/3 8/3 8/3 2
      8/3 8/3 8/3 2
      8/3 8/3 8/3 2
      2   2   2   4 ],

it is

    240 α_11 β_11 λ_1 - 60 α_11 λ_1 - 60 β_11 λ_1 + 16 λ_1 - 1 = 0.

Slide 37: Summary for the 100 Swiss Francs Problem

    observed table n:              [ 4 2 2 2 ; 2 4 2 2 ; 2 2 4 2 ; 2 2 2 4 ]
    MLE (40·p̂):                    [ 3 3 2 2 ; 3 3 2 2 ; 2 2 3 3 ; 2 2 3 3 ]
    a local maximum (40·p̂):        [ 8/3 8/3 8/3 2 ; 8/3 8/3 8/3 2 ; 8/3 8/3 8/3 2 ; 2 2 2 4 ]

The idea generalizes to general 2-way square tables with symmetric data:
- symmetric MLEs and local maxima
- multiple MLEs
- averaging

Slide 38: Data From the National Long Term Care Survey (NLTCS)

A 2^16 contingency table with data on 6 activities of daily living (ADLs) and 10 instrumental activities of daily living (IADLs), extracted by Erosheva for community-dwelling elderly from the 1982, 1984, 1989, and 1994 survey waves.

Of the 65,536 cells in the table, 62,384 (95.19%) contain zero counts, 1,729 (2.64%) contain a count of 1, and 499 (0.76%) contain counts of 2. The largest cell count is 3,853.

Data analyzed in Erosheva, Fienberg, and Joutard (2007).

ADLs: eating; getting in/out of bed; getting around inside; dressing; bathing; getting to the bathroom.
IADLs: doing heavy housework; doing light housework; doing laundry; cooking; grocery shopping; getting about outside; travelling; managing money; taking medicine; telephoning.

Slide 39: Model Selection for the NLTCS Extract

[Table: model dimension, maximal log-likelihood, and BIC for latent class models with various numbers of classes r.]

Slide 40: Fitted Values for the Largest Six Cells in the NLTCS Extract

[Table: observed counts and fitted values for the six largest cells, for various numbers of classes r.]

Slides 41-42: Computational Approaches for LC Model MLEs

Expectation Maximization (EM)
- a hill-climbing method that converges steadily
- but converges only linearly
- time complexity for a single step is O(d · r · Σ_i d_i); space complexity is O(d · r)

Newton-Raphson (NR)
- converges quadratically
- tends to be very time and space intensive: both the time and space complexity are O(d · r² · Σ_i d_i)
- numerically unstable if the Hessian matrix is poorly conditioned

Modified NR approach
- modify the Hessian matrices so they remain negative definite
- then approximate the log-likelihood locally by a quadratic function
- since the log-likelihood is neither concave nor quadratic, these modifications do not guarantee an increase of the log-likelihood at each iteration step

Slide 43: Condition Numbers of Hessian Matrices

Condition numbers of the Hessian matrices at the maxima for the NLTCS data.

[Table: condition number (in scientific notation) of the Hessian at the maximum, for each number of classes r.]
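
To see the same phenomenon on the 100 Swiss Francs table, the sketch below (my own illustration, not the authors' code) runs EM to a stationary point, builds a finite-difference Hessian of the log-likelihood in the 13 free parameters, and prints its condition number, which is typically very large because of the unidentifiable directions.

import numpy as np

n = np.array([[4., 2, 2, 2], [2, 4, 2, 2], [2, 2, 4, 2], [2, 2, 2, 4]])

def loglik(theta):
    # 13 free parameters: lambda_1, then alpha and beta for rows/columns 1-3
    lam = np.array([theta[0], 1 - theta[0]])
    a = np.vstack([theta[1:7].reshape(3, 2), 1 - theta[1:7].reshape(3, 2).sum(0)])
    b = np.vstack([theta[7:13].reshape(3, 2), 1 - theta[7:13].reshape(3, 2).sum(0)])
    p = (lam * a[:, None, :] * b[None, :, :]).sum(axis=2)
    return (n * np.log(p)).sum()

# EM to reach a (local) maximum
rng = np.random.default_rng(3)
lam, a, b = np.array([0.5, 0.5]), rng.dirichlet(np.ones(4), 2).T, rng.dirichlet(np.ones(4), 2).T
for _ in range(5000):
    post = lam * a[:, None, :] * b[None, :, :]
    post /= post.sum(axis=2, keepdims=True)
    w = n[:, :, None] * post
    lam, a, b = w.sum((0, 1)) / n.sum(), w.sum(1) / w.sum((0, 1)), w.sum(0) / w.sum((0, 1))
theta = np.concatenate([lam[:1], a[:3].ravel(), b[:3].ravel()])

# central-difference Hessian of the log-likelihood at the EM solution
eps, E = 1e-4, np.eye(13)
H = np.array([[(loglik(theta + eps * (E[i] + E[j])) - loglik(theta + eps * (E[i] - E[j]))
                - loglik(theta - eps * (E[i] - E[j])) + loglik(theta - eps * (E[i] + E[j])))
               / (4 * eps ** 2) for j in range(13)] for i in range(13)])
print(np.linalg.cond(H))                        # typically very large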

Slide 44: Profile Likelihood for r = 2

[Figure: profile likelihood for the r = 2 model.]

Slide 45: Summary

- Latent class models: existence and uniqueness of the MLE, identifiability, model selection
- Computational tools: Singular, Expectation Maximization, the Newton-Raphson method
- Examples: the 100 Swiss Francs problem; a sparse 2^16 table from the National Long Term Care Survey

Slide 46: Thank you!

Slide 47: References

S.E. Fienberg, P. Hersh, A. Rinaldo, and Y. Zhou (2008). Maximum likelihood estimation in latent class models for contingency tables. In P. Gibilisco, E. Riccomagno, and M.-P. Rogantin (eds.), Algebraic and Geometric Methods in Probability and Statistics, Cambridge University Press, to appear.

Slide 48: Effective Dimension and Deficiency

For a latent class model,

    p_{i_1,...,i_k} = Σ_{h=1}^r λ_h p^(1)_{i_1 h} ··· p^(k)_{i_k h},   i_k = 1, ..., d_k.

- standard dimension: the dimension of the fully observable model of conditional independence, which is r Σ_i (d_i - 1) + r - 1.
- expected dimension: min { d - 1, r Σ_i (d_i - 1) + r - 1 }, where d = Π_i d_i is the number of cells of the table.
- effective dimension: the actual dimension of the model.

Definition (Deficiency): a latent class model is deficient if the effective dimension is smaller than the expected dimension.
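
A small helper (illustrative; the function names are mine) for the two dimension formulas above:

import numpy as np

def standard_dim(r, dims):
    # r * sum_i (d_i - 1) + r - 1
    return r * sum(d - 1 for d in dims) + r - 1

def expected_dim(r, dims):
    d = int(np.prod(dims))                      # number of cells of the table
    return min(d - 1, standard_dim(r, dims))

# 2-class model for a 3x3 table: standard 9, expected 8 (the effective dimension is 7)
print(standard_dim(2, [3, 3]), expected_dim(2, [3, 3]))
# 2-class model for the 2^16 NLTCS table
print(standard_dim(2, [2] * 16), expected_dim(2, [2] * 16))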

Slide 49: Algebraic Definitions

Definition (Variety): the zero set of a system of polynomials; geometrically, a (hyper)surface, such as the surface of independence we have seen before.

Definition (Polynomial ring): the set of polynomials in one or more variables with coefficients in a ring, e.g. R[x], Q[x, y].

Definition (Ideal): a subset I of a ring R satisfying f + g ∈ I whenever f, g ∈ I, and pf ∈ I whenever f ∈ I and p ∈ R is arbitrary. For example, the even numbers, or the multiples of 3, form ideals of the ring of integers, with generating sets {2} and {3} respectively.

Definition (Ideal of a variety): the set of polynomials vanishing on the variety (hypersurface). For example, the polynomial p_11 p_22 - p_12 p_21 generates the ideal of the surface of independence.

Slide 50: Different Dimensions of Some Latent Class Models

[Table: effective dimension, standard dimension, complete dimension (d - 1), and deficiency for a collection of latent class models, by table size and number of classes r.]
