Maximum likelihood: counterexamples, examples, and open problems
Jon A. Wellner, University of Washington
Talk at University of Idaho, Department of Mathematics, September 15, 2005
Email: jaw@stat.washington.edu
http://www.stat.washington.edu/jaw/jaw.research.html
Outline
1. Introduction: maximum likelihood estimation
2. Counterexamples
3. Beyond consistency: rates and distributions
4. Positive examples
5. Problems and challenges
1. Introduction: maximum likelihood estimation

Setting 1: dominated families.
Suppose that X_1, ..., X_n are i.i.d. with density p_{θ_0} with respect to some dominating measure μ, where p_{θ_0} ∈ P = {p_θ : θ ∈ Θ} for Θ ⊂ R^d. The likelihood is

    L_n(\theta) = \prod_{i=1}^n p_\theta(X_i).

Definition: A Maximum Likelihood Estimator (or MLE) of θ_0 is any value θ̂ ∈ Θ satisfying

    L_n(\hat\theta) = \sup_{\theta \in \Theta} L_n(\theta).
Equivalently, the MLE θ̂ maximizes the log-likelihood

    \log L_n(\theta) = \sum_{i=1}^n \log p_\theta(X_i).
Example 1. Exponential(θ). X_1, ..., X_n are i.i.d. p_{θ_0}, where

    p_\theta(x) = \theta \exp(-\theta x)\, 1_{[0,\infty)}(x).

Then the likelihood is

    L_n(\theta) = \theta^n \exp\Big(-\theta \sum_{i=1}^n X_i\Big),

so the log-likelihood is

    \log L_n(\theta) = n \log\theta - \theta \sum_{i=1}^n X_i,

and θ̂_n = 1/X̄_n.
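As a quick numerical check, the closed form θ̂_n = 1/X̄_n can be compared with a brute-force maximization of the log-likelihood. This is a minimal sketch; the sample size, true θ_0, seed, and grid are arbitrary choices:

```python
import numpy as np

rng = np.random.default_rng(0)
theta0 = 2.0
x = rng.exponential(scale=1.0 / theta0, size=50)   # Exponential(theta0) sample

theta_hat = 1.0 / x.mean()                         # closed-form MLE: 1 / X-bar

# Brute force: maximize n*log(theta) - theta*sum(x) over a fine grid.
grid = np.linspace(0.1, 6.0, 10_000)
loglik = len(x) * np.log(grid) - grid * x.sum()
theta_grid = float(grid[int(np.argmax(loglik))])
```

The two maximizers agree up to the grid resolution, since the log-likelihood is strictly concave in θ.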
[Figure: 1/n-th power of the likelihood, n = 50]
[Figure: 1/n times the log-likelihood, n = 50]
[Figure: MLE p_{θ̂}(x) and true density p_{θ_0}(x)]
Example 2. Monotone decreasing densities on (0, ∞). X_1, ..., X_n are i.i.d. p_0 ∈ P, where P = {all nonincreasing densities on (0, ∞)}. Then the likelihood is L_n(p) = ∏_{i=1}^n p(X_i), and L_n(p) is maximized by the Grenander estimator:

    p̂_n(x) = the left derivative at x of the least concave majorant C_n of F_n,

where F_n(x) = n^{-1} ∑_{i=1}^n 1{X_i ≤ x} is the empirical distribution function.
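The Grenander estimator is computable in O(n log n): sort the data, form the points (X_(i), i/n) on the graph of F_n, take their least concave majorant, and read off its left slopes. A minimal numpy sketch, assuming distinct observations (the function name is ours):

```python
import numpy as np

def grenander(x):
    """Left slopes of the least concave majorant of the empirical CDF,
    evaluated at the order statistics (assumes distinct observations).
    Returns (sorted data, density values); the estimate is constant on
    each interval (X_(i-1), X_(i)]."""
    x = np.sort(np.asarray(x, dtype=float))
    n = len(x)
    px = np.concatenate(([0.0], x))        # graph of F_n: (0,0), (X_(i), i/n)
    py = np.arange(n + 1) / n
    hull = [0]                             # vertex indices of the concave majorant
    for i in range(1, n + 1):
        while len(hull) >= 2:
            j, k = hull[-2], hull[-1]
            # drop k if it lies on or below the chord from j to i
            if (py[k] - py[j]) * (px[i] - px[j]) <= (py[i] - py[j]) * (px[k] - px[j]):
                hull.pop()
            else:
                break
        hull.append(i)
    dens = np.empty(n)
    for a, b in zip(hull[:-1], hull[1:]):  # slope of each majorant segment
        dens[a:b] = (py[b] - py[a]) / (px[b] - px[a])
    return x, dens

rng = np.random.default_rng(1)
xs, dens = grenander(rng.exponential(size=100))
total = float(np.diff(np.concatenate(([0.0], xs))) @ dens)  # integral of p-hat
```

By construction dens is nonincreasing, and the estimator integrates to one over (0, X_(n)].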
[Figure: Grenander estimator, n = 10]
[Figure: Grenander estimator, n = 100]
Setting 2: non-dominated families.
Suppose that X_1, ..., X_n are i.i.d. P_0 ∈ P, where P is some collection of probability measures on a measurable space (X, A). If P{x} denotes the measure under P of the one-point set {x}, the likelihood of X_1, ..., X_n is defined to be

    L_n(P) = \prod_{i=1}^n P\{X_i\}.

Then a Maximum Likelihood Estimator (or MLE) of P_0 can be defined as a measure P̂_n ∈ P that maximizes L_n(P); thus

    L_n(\hat P_n) = \sup_{P \in \mathcal{P}} L_n(P).
Example 3. (Empirical measure) If P = {all probability measures on (X, A)}, then

    \hat P_n = \frac{1}{n} \sum_{i=1}^n \delta_{X_i} \equiv P_n,

where δ_x(A) = 1_A(x). Thus

    \hat P_n(A) = \frac{1}{n} \sum_{i=1}^n \delta_{X_i}(A) = \frac{1}{n} \sum_{i=1}^n 1_A(X_i) = \frac{\#\{1 \le i \le n : X_i \in A\}}{n}.
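Why P_n maximizes the likelihood: for distinct observations, any candidate measure must put all of its mass on the data (mass elsewhere only lowers the product), and among weight vectors w on the data, ∏ w_i is maximized at w_i = 1/n (AM-GM on the logs). A small numerical illustration of the second step, with arbitrary n and random competitors:

```python
import numpy as np

rng = np.random.default_rng(2)
n = 8                                  # number of distinct observed points

def log_lik(w):
    # log of prod_i P{X_i} for a measure putting mass w_i on X_i
    return float(np.sum(np.log(w)))

best = log_lik(np.full(n, 1.0 / n))    # empirical measure: equal weights 1/n

# Random competing weight vectors never beat the equal weights.
for _ in range(1000):
    w = rng.dirichlet(np.ones(n))
    assert log_lik(w) <= best + 1e-12
```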
Consistency of the MLE:
Wald (1949)
Kiefer and Wolfowitz (1956)
Huber (1967)
Perlman (1972)
Wang (1985)
van de Geer (1993)
Counterexamples:
Neyman and Scott (1948)
Bahadur (1958)
Ferguson (1982)
Le Cam (1975), (1990)
Barlow, Bartholomew, Bremner, and Brunk (the "4 B's") (1972)
Boyles, Marshall, and Proschan (1985)
bivariate right censoring: Tsai, van der Laan, Pruitt
left truncation and interval censoring: Chappell and Pan (1999)
Maathuis and Wellner (2005)
2. Counterexamples: MLEs are not always consistent

Counterexample 1. (Ferguson, 1982) Suppose that X_1, ..., X_n are i.i.d. with density f_{θ_0}, where, for θ ∈ [0, 1],

    f_\theta(x) = (1 - \theta)\, \frac{1}{\delta(\theta)}\, f_0\Big(\frac{x - \theta}{\delta(\theta)}\Big) + \theta f_1(x),

with

    f_1(x) = \tfrac{1}{2}\, 1_{[-1,1]}(x)  (the Uniform[-1, 1] density),
    f_0(x) = (1 - |x|)\, 1_{[-1,1]}(x)  (the Triangular[-1, 1] density),

and δ(θ) satisfying δ(0) = 1, 0 < δ(θ) ≤ 1 − θ, and δ(θ) → 0 as θ → 1.
[Figure: density f_θ(x) for c = 2, θ = .38]
[Figure: F_θ(x) for c = 2, θ = .38]
Ferguson (1982) shows that θ̂_n →_{a.s.} 1, no matter what θ_0 is true, if δ(θ) → 0 fast enough. In fact, the assertion is true if

    \delta(\theta) = (1 - \theta) \exp\{-(1 - \theta)^{-c} + 1\}

with c > 2. (Ferguson shows that c = 4 works.) If c = 2, Ferguson's argument shows that

    \sup_{0 \le \theta \le 1} \frac{1}{n} \log L_n(\theta) \ge \frac{n-1}{n} \log\frac{M_n}{2} + \frac{1}{n} \log\frac{1 - M_n}{\delta(M_n)} \to_d D,

where M_n = max(X_1, ..., X_n).
Here

    P(D \le y) = \exp\Big( -\frac{1}{2(y + \log 2)} \Big), \qquad y \ge -\log 2;

that is, with E an Exponential(1) random variable,

    D =_d -\log 2 + \frac{1}{2E}.
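The mechanism is easy to see in a short simulation: placing θ at the largest observation makes the triangular spike contribute a term of order (1 − θ)^{−c}/n to the normalized log-likelihood, which swamps the value at θ_0. A sketch with θ_0 = 0 and arbitrary n and seed; the likelihood is evaluated on the log scale to avoid overflow:

```python
import numpy as np

rng = np.random.default_rng(5)
n = 500
x = rng.triangular(-1.0, 0.0, 1.0, size=n)  # theta_0 = 0: pure Triangular[-1,1] data
c = 4.0                                      # Ferguson's choice of c

def mean_loglik(th, x):
    """(1/n) log L_n(theta), computed on the log scale: the spike height
    (1-th)/delta(th) = exp((1-th)**(-c) - 1) overflows for th near 1."""
    a = (1.0 - th) ** (-c) - 1.0             # log of the spike height
    log_delta = np.log1p(-th) - a            # log delta(th)
    with np.errstate(divide="ignore"):
        log_dist = np.log(np.abs(x - th))    # -inf when x == th exactly
    inside = log_dist < log_delta            # |x - th| < delta(th): under the spike
    u = np.exp(np.where(inside, log_dist - log_delta, -np.inf))
    log_spike = np.where(inside, np.log1p(-u) + a, -np.inf)
    log_unif = np.full_like(x, np.log(th / 2.0) if th > 0 else -np.inf)
    return float(np.mean(np.logaddexp(log_spike, log_unif)))

at_true = mean_loglik(0.0, x)                # criterion at theta_0
at_max = mean_loglik(float(x.max()), x)      # theta placed at the largest observation
```

Here 1 − M_n is of order n^{−1/2} for triangular data, so with c = 4 the spike term (1 − M_n)^{−4}/n grows like n, and the criterion at θ = M_n dwarfs the value at θ_0 already for moderate n.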
Counterexample 2. (4 B's, 1972) A distribution F on [0, b) is star-shaped if F(x)/x is non-decreasing on [0, b). Thus if F has a density f which is increasing on [0, b), then F is star-shaped.
Let F_star be the class of all star-shaped distributions on [0, b) for some b, and suppose that X_1, ..., X_n are i.i.d. F ∈ F_star. Barlow, Bartholomew, Bremner, and Brunk (1972) show that the MLE of a star-shaped distribution function F is

    \hat F_n(x) = \begin{cases} 0, & x < X_{(1)}, \\ \dfrac{i\,x}{n X_{(n)}}, & X_{(i)} \le x < X_{(i+1)}, \quad i = 1, \dots, n-1, \\ 1, & x \ge X_{(n)}. \end{cases}
Moreover, BBBB (1972) show that if F(x) = x for 0 ≤ x ≤ 1 (the Uniform[0, 1] distribution), then

    \hat F_n(x) \to_{a.s.} x^2 \ne x \qquad \text{for } 0 \le x \le 1.
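This inconsistency is easy to see numerically: for Uniform[0, 1] data, F̂_n(1/2) settles near 1/4 = (1/2)², not 1/2. A minimal sketch (sample size and seed are arbitrary choices):

```python
import numpy as np

rng = np.random.default_rng(3)
n = 200_000
xs = np.sort(rng.uniform(size=n))      # F(t) = t on [0, 1]

def fhat(t, xs):
    """BBBB star-shaped MLE: i*t/(n*X_(n)) for X_(i) <= t < X_(i+1)."""
    n = len(xs)
    i = int(np.searchsorted(xs, t, side="right"))  # number of X_(i) <= t
    if i == 0:
        return 0.0
    if i == n:
        return 1.0
    return i * t / (n * xs[-1])

val = fhat(0.5, xs)   # close to 0.25, far from the true F(0.5) = 0.5
```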
[Figure: MLE, n = 5, and true d.f.]
[Figure: MLE, n = 100, and true d.f.]
[Figure: MLE, n = 100, and the limit]
Note 1. Since X_{(i)} =_d S_i/S_{n+1}, where S_i = \sum_{j=1}^i E_j with E_j i.i.d. Exponential(1) random variables, the total mass that F̂_n puts at the order statistics equals

    \frac{1}{n X_{(n)}} \sum_{i=1}^n X_{(i)} =_d \frac{1}{n S_n} \sum_{i=1}^n S_i = \frac{1}{S_n} \sum_{j=1}^n \Big(1 - \frac{j-1}{n}\Big) E_j \to_p \int_0^1 (1 - t)\, dt = 1/2.

Note 2. BBBB (1972) present consistent estimators of a star-shaped F, obtained via isotonization, due to Barlow and Scheurer (1971) and van Zwet.
Counterexample 3. (Boyles, Marshall, and Proschan, 1985) A distribution F on [0, ∞) is Increasing Failure Rate Average (IFRA) if

    \frac{1}{x}\,\{-\log(1 - F(x))\} \equiv \frac{1}{x}\,\Lambda(x)

is non-decreasing; that is, if Λ is star-shaped. Let F_IFRA be the class of all IFRA distributions on [0, ∞).
Suppose that X_1, ..., X_n are i.i.d. F ∈ F_IFRA. Boyles, Marshall, and Proschan (1985) showed that the MLE F̂_n of an IFRA distribution function F is given by

    -\frac{1}{x}\,\log(1 - \hat F_n(x)) = \hat\lambda_j, \qquad X_{(j)} \le x < X_{(j+1)}, \quad j = 0, \dots, n-1

(with X_{(0)} ≡ 0), and F̂_n(x) = 1 for x ≥ X_{(n)}, where

    \hat\lambda_j = \sum_{i=1}^j \frac{1}{X_{(i)}}\, \log\Big( \frac{\sum_{k=i}^n X_{(k)}}{\sum_{k=i+1}^n X_{(k)}} \Big).
Moreover, BMP (1985) show that if F is the Exponential(1) distribution, then

    1 - \hat F_n(x) \to_{a.s.} (1 + x)^{-x} \ne \exp(-x),

so

    \frac{1}{x}\,\hat\Lambda_n(x) \to_{a.s.} \log(1 + x) \ne 1.
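The λ̂_j above can be computed with cumulative tail sums; here is a simulation sketch checking the stated limit at x = 1, where the estimated failure rate average should be near log 2 ≈ 0.693 rather than the true value 1 (the sample size, seed, and evaluation point are arbitrary choices):

```python
import numpy as np

rng = np.random.default_rng(6)
n = 50_000
xs = np.sort(rng.exponential(size=n))    # F = Exponential(1)

# Tail sums T_i = X_(i) + ... + X_(n), i = 1, ..., n, plus a trailing 0.
tail = np.append(np.cumsum(xs[::-1])[::-1], 0.0)
# Summands of lambda_hat: log(T_i / T_{i+1}) / X_(i), for i = 1, ..., n-1.
terms = np.log(tail[:-2] / tail[1:-1]) / xs[:-1]
lam = np.cumsum(terms)                   # lam[j-1] = lambda_hat_j

t = 1.0
j = int(np.searchsorted(xs, t, side="right"))  # X_(j) <= t < X_(j+1)
fra = float(lam[j - 1])                  # estimated failure rate average at t
```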
[Figure: MLE, n = 100, and true d.f. 1 − exp(−x)]
[Figure: MLE, n = 100, and limit d.f. 1 − (1 + x)^{−x}]
More counterexamples:
bivariate right censoring: Tsai, van der Laan, Pruitt
left truncation and interval censoring: Chappell and Pan (1999)
bivariate interval censoring with a continuous mark: Hudgens, Maathuis, and Gilbert (2005); Maathuis and Wellner (2005)
3. Beyond consistency: rates and distributions

Le Cam (1973); Birgé (1983): the optimal rate of convergence r_n = r_n^{opt} is determined by

    n r_n^{-2} = \log N_{[\,]}(1/r_n, \mathcal{P}).    (1)

If

    \log N_{[\,]}(\epsilon, \mathcal{P}) \le K \epsilon^{-1/\gamma},    (2)

then (1) leads to the optimal rate of convergence r_n^{opt} = n^{\gamma/(2\gamma+1)}.
On the other hand, entropy bounds from Birgé and Massart (1993) yield achieved rates of convergence for maximum likelihood estimators (and other minimum contrast estimators), r_n = r_n^{ach}, determined by

    \sqrt{n}\, r_n^{-2} = \int_{c r_n^{-2}}^{r_n^{-1}} \sqrt{\log N_{[\,]}(\epsilon, \mathcal{P})}\, d\epsilon.

If (2) holds, this leads to the rate

    r_n^{ach} = \begin{cases} n^{\gamma/(2\gamma+1)}, & \gamma > 1/2, \\ n^{\gamma/2}, & \gamma < 1/2. \end{cases}

Thus there is the possibility that maximum likelihood is not (rate-)optimal when γ < 1/2.
Typically γ = α/d, where d is the dimension of the underlying sample space and α is a measure of the smoothness of the functions in P. Hence α < d/2 leads to γ < 1/2. But there are many examples/problems with γ > 1/2!
[Figure: optimal rate and MLE rate as a function of γ]
[Figure: difference of rate exponents, γ/(2γ + 1) − γ/2]
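The comparison of the two rate exponents is elementary to tabulate:

```python
# Rate exponents from the entropy calculation: the optimal rate is
# n^(gamma/(2*gamma+1)); the MLE bound gives only n^(gamma/2) when gamma < 1/2.
def opt_exponent(gamma: float) -> float:
    return gamma / (2.0 * gamma + 1.0)

def mle_exponent(gamma: float) -> float:
    return opt_exponent(gamma) if gamma > 0.5 else gamma / 2.0

# The gap vanishes for gamma >= 1/2 and is largest for small gamma.
gap = {g: opt_exponent(g) - mle_exponent(g) for g in (0.25, 0.5, 1.0, 2.0)}
```

For example, at γ = 1/4 the gap in the exponent is 1/6 − 1/8 = 1/24 ≈ 0.042, matching the scale seen in the figure above.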
4. Positive Examples (some still in progress!)
interval censoring (Groeneboom): case 1 (current status data); case 2 (Groeneboom)
panel count data (Wellner and Zhang, 2000)
k-monotone densities (Balabdaoui and Wellner, 2004)
competing risks with current status data (Jewell and van der Laan; Maathuis)
Example 1. (interval censoring, case 1)
X ∼ F and Y ∼ G independent. Observe (1{X ≤ Y}, Y) ≡ (Δ, Y). Goal: estimate F. The MLE F̂_n exists.
Global rate: d = 1, α = 1, γ = α/d = 1, so γ/(2γ + 1) = 1/3 and r_n = n^{1/3}:

    n^{1/3}\, h(p_{\hat F_n}, p_0) = O_p(1),

and this yields

    n^{1/3} \int |\hat F_n - F_0|\, dG = O_p(1).
Interval censoring case 1, continued:
Local rate (Groeneboom, 1987):

    n^{1/3}(\hat F_n(t_0) - F(t_0)) \to_d \left\{ \frac{F(t_0)(1 - F(t_0))\, f_0(t_0)}{2\, g(t_0)} \right\}^{1/3} 2Z,

where Z = argmin{W(t) + t²} for a standard two-sided Brownian motion W.
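For case 1 (current status data), the MLE F̂_n can be computed by isotonic regression: maximizing the binomial likelihood over nondecreasing F gives the isotonic least-squares fit of the indicators Δ_i against the ordered observation times, via pool-adjacent-violators. A small simulation sketch, with arbitrary choices of n, F, G, and seed:

```python
import numpy as np

rng = np.random.default_rng(4)
n = 2000
x = rng.exponential(size=n)            # unobserved failure times, F = Exp(1)
y = rng.exponential(size=n)            # observation times, G = Exp(1)
delta = (x <= y).astype(float)         # current status indicators 1{X <= Y}

def pava(values):
    """Pool-adjacent-violators: the nondecreasing vector closest to `values`
    in least squares; for 0/1 current status data this is also the MLE of F
    at the ordered observation times."""
    blocks = []                        # [mean, count] per merged block
    for v in values:
        blocks.append([float(v), 1])
        while len(blocks) >= 2 and blocks[-2][0] > blocks[-1][0]:
            m2, c2 = blocks.pop()
            m1, c1 = blocks.pop()
            blocks.append([(m1 * c1 + m2 * c2) / (c1 + c2), c1 + c2])
    return np.concatenate([[m] * c for m, c in blocks])

order = np.argsort(y)
ys = y[order]
fhat = pava(delta[order])              # F-hat_n at Y_(1) <= ... <= Y_(n)

t = float(np.median(y))
err = abs(float(fhat[np.searchsorted(ys, t)]) - (1.0 - np.exp(-t)))
```

The fitted values are nondecreasing, stay in [0, 1], and track the true F at the n^{-1/3} local rate discussed above.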
Example 2. (interval censoring, case 2)
X ∼ F; (U, V) ∼ H with U ≤ V, independent of X. Observe i.i.d. copies of (Δ, U, V), where

    \Delta = (\Delta_1, \Delta_2, \Delta_3) = (1\{X \le U\},\ 1\{U < X \le V\},\ 1\{V < X\}).

Goal: estimate F. The MLE F̂_n exists.
(interval censoring, case 2, continued)
Global rate (separated case): if P(V − U ≥ ε) = 1, then d = 1, α = 1, γ = α/d = 1, so γ/(2γ + 1) = 1/3 and r_n = n^{1/3}:

    n^{1/3}\, h(p_{\hat F_n}, p_0) = O_p(1),

and this yields

    n^{1/3} \int |\hat F_n - F_0|\, d\mu = O_p(1),

where μ(A) = P(U ∈ A) + P(V ∈ A) for A ∈ B_1.
(interval censoring, case 2, continued)
Global rate (non-separated case) (van de Geer, 1993):

    n^{1/3} (\log n)^{-1/6}\, h(p_{\hat F_n}, p_0) = O_p(1).

Although this looks worse in terms of the rate, it is actually better, because the Hellinger metric is much stronger in this case.
(interval censoring, case 2, continued)
Local rate (separated case) (Groeneboom, 1996):

    n^{1/3}(\hat F_n(t_0) - F_0(t_0)) \to_d \left\{ \frac{f_0(t_0)}{2\, a(t_0)} \right\}^{1/3} 2Z,

where Z = argmin{W(t) + t²},

    a(t_0) = \frac{h_1(t_0)}{F_0(t_0)} + k_1(t_0) + k_2(t_0) + \frac{h_2(t_0)}{1 - F_0(t_0)},

and

    k_1(u) = \int_u^M \frac{h(u, v)}{F_0(v) - F_0(u)}\, dv, \qquad k_2(v) = \int_0^v \frac{h(u, v)}{F_0(v) - F_0(u)}\, du.
(interval censoring, case 2, continued)
Local rate (non-separated case) (conjectured, G&W, 1992):

    (n \log n)^{1/3}(\hat F_n(t_0) - F_0(t_0)) \to_d \left\{ \frac{3}{4}\, \frac{f_0(t_0)^2}{h(t_0, t_0)} \right\}^{1/3} 2Z,

where Z = argmin{W(t) + t²}.
Monte Carlo evidence in support: Groeneboom and Ketelaars (2005).
[Figures: MSE of histogram estimator / MSE of MLE, for f_0(t) = 1 and f_0(t) = 4(1 − t)^3; n = 1000, 2500, 5000, 10000]
Example 3. (k-monotone densities)
A density p on (0, ∞) is k-monotone (p ∈ D_k) if it is non-negative and nonincreasing, when k = 1; and if (−1)^j p^{(j)}(x) ≥ 0 for j = 0, ..., k − 2 and (−1)^{k−2} p^{(k−2)} is convex, for k ≥ 2.
Mixture representation: p ∈ D_k if and only if

    p(x) = \int_0^\infty \frac{k (y - x)_+^{k-1}}{y^k}\, dF(y)

for some distribution function F on (0, ∞).
k = 1: monotone decreasing densities on R_+
k = 2: convex decreasing densities on R_+
k ≥ 3: ...
k = ∞: completely monotone densities = scale mixtures of exponentials
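A quick numerical check of the mixture representation: each kernel x ↦ k(y − x)_+^{k−1}/y^k is itself a density on (0, y), so mixing preserves total mass, and for k ≥ 2 the resulting p is decreasing and convex. A sketch with k = 3 and an Exp(1) mixing distribution discretized on a grid (all grid choices are arbitrary):

```python
import numpy as np

def kernel(x, y, k):
    # k-monotone mixture kernel: a density in x on (0, y) for each fixed y > 0
    return k * np.clip(y - x, 0.0, None) ** (k - 1) / y ** k

k = 3
xs = np.linspace(0.0, 20.0, 4001)
dx = xs[1] - xs[0]
ys = np.linspace(0.05, 20.0, 2000)     # grid for the mixing variable y
w = np.exp(-ys)
w /= w.sum()                            # discretized Exp(1) mixing d.f. F

p = (kernel(xs[:, None], ys[None, :], k) * w).sum(axis=1)

mass = float(np.sum(0.5 * (p[:-1] + p[1:])) * dx)   # trapezoid rule, ~ 1
decreasing = bool(np.all(np.diff(p) <= 1e-12))
convex = bool(np.all(np.diff(p, n=2) >= -1e-10))
```

The mass is 1 only up to discretization error, since kernels with small y are concentrated near 0 relative to the x-grid.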
(k-monotone densities, continued)
The MLE p̂_n of p_0 ∈ D_k exists and is characterized by

    \int_0^\infty \frac{k (y - x)_+^{k-1}}{y^k\, \hat p_n(x)}\, dP_n(x) \begin{cases} \le 1 & \text{for all } y \ge 0, \\ = 1 & \text{if } (-1)^k \hat p_n^{(k-1)}(y-) > (-1)^k \hat p_n^{(k-1)}(y+). \end{cases}

k = 1, the Grenander estimator: r_n = n^{1/3}.
Global rates and finite-n minimax bounds: Birgé (1986), (1987), (1989).
Local rates: Prakasa Rao (1969), Groeneboom (1985), (1989), Kim and Pollard (1990):

    n^{1/3}(\hat p_n(t_0) - p_0(t_0)) \to_d \left\{ \frac{p_0(t_0)\, |p_0'(t_0)|}{2} \right\}^{1/3} 2Z.
(k-monotone densities, continued)
k = 2, convex decreasing densities: d = 1, α = 2, γ = 2, γ/(2γ + 1) = 2/5, so r_n = n^{2/5} (forward problem).
Global rates: nothing yet.
Local rates and distributions: Groeneboom, Jongbloed, Wellner (2001).
k ≥ 3, k-monotone densities: d = 1, α = k, γ = k, γ/(2γ + 1) = k/(2k + 1), so r_n = n^{k/(2k+1)} (forward problem)?
Global rates: nothing yet.
Local rates: should be r_n = n^{k/(2k+1)}.
Progress: Balabdaoui and Wellner (2004); the local rate is true if a certain conjecture about Hermite interpolation holds.
[Figure: direct and inverse estimators (LSE and MLE), k = 3, n = 100]
[Figure: direct and inverse estimators (LSE and MLE), k = 3, n = 1000]
[Figure: direct and inverse estimators (LSE and MLE), k = 6, n = 100]
[Figure: direct and inverse estimators (LSE and MLE), k = 6, n = 1000]
Example 4. (Competing risks with current status data)
Variables of interest: (X, Y); X = failure time, Y = failure cause; X ∈ R_+, Y ∈ {1, ..., K}.
T = an observation time, independent of (X, Y).
Observe (Δ, T), with Δ = (Δ_1, ..., Δ_K, Δ_{K+1}), where

    \Delta_j = 1\{X \le T, Y = j\}, \quad j = 1, \dots, K; \qquad \Delta_{K+1} = 1\{X > T\}.

Goal: estimate F_j(t) = P(X ≤ t, Y = j) for j = 1, ..., K.
The MLE F̂_n = (F̂_{n,1}, ..., F̂_{n,K}) exists! The characterization of F̂_n involves an interacting system of slopes of convex minorants.
(competing risks with current status data, continued)
Global rates: easy with present methods:

    n^{1/3} \sum_{k=1}^K \int |\hat F_{n,k}(t) - F_{0,k}(t)|\, dG(t) = O_p(1).

Local rates? Conjecture: r_n = n^{1/3}. New methods needed. Tricky. Maathuis (2006)?
Limit distribution theory: will involve the slopes of an interacting system of greatest convex minorants defined in terms of a vector of dependent two-sided Brownian motions.
[Figure: n^{2/3} · MSE of the MLE and naive estimators (NE, SNE, TNE) of F_1; n = 25, 250, 2500, 25000]
[Figure: n^{2/3} · MSE of the MLE and naive estimators (NE, SNE, TNE) of F_2; n = 25, 250, 2500, 25000]
Example 5. (Monotone densities in R^d)
Entropy bounds? Natural conjectures: α = 1, dimension d, γ = 1/d, so γ/(2γ + 1) = 1/(d + 2).
Biau and Devroye (2003) show, via Assouad's lemma and direct calculations, that r_n^{opt} = n^{1/(d+2)}.
Rate achieved by the MLE: natural conjecture: r_n^{ach} = n^{1/(2d)} for d > 2.
Biau and Devroye (2003) construct generalizations of Birgé's (1987) histogram estimators that achieve the optimal rate for all d ≥ 2.
5. Problems and Challenges
More tools for local rates and distribution theory?
Under what additional hypotheses are MLEs globally rate-optimal in the case γ > 1/2?
More counterexamples to clarify when MLEs do not work?
What is the limit distribution for interval censoring, case 2? (Does the G&W (1992) conjecture hold?)
When the MLE is not rate-optimal, is it still preferable from some other perspectives? For example, does the MLE provide efficient estimators of smooth functionals (while alternative rate-optimal estimators fail to have this property)? Compare with Bickel and Ritov (2003).
Problems and challenges, continued
More rate and optimality theory for maximum likelihood estimators of mixing distributions in mixture models with smooth kernels: e.g. completely monotone densities (scale mixtures of exponentials), normal location mixtures (deconvolution problems).
Stable and efficient algorithms for computing MLEs in models where they exist (e.g. mixture models, missing data).
Results for monotone densities in R^d?
Selected References

Bahadur, R. R. (1958). Examples of inconsistency of maximum likelihood estimates. Sankhyā Ser. A 20, 207-210.
Barlow, R. E., Bartholomew, D. J., Bremner, J. M., and Brunk, H. D. (1972). Statistical Inference under Order Restrictions. Wiley, New York.
Barlow, R. E. and Scheurer, E. M. (1971). Estimation from accelerated life tests. Technometrics 13, 145-159.
Biau, G. and Devroye, L. (2003). On the risk of estimates for block decreasing densities. J. Multivariate Anal. 86, 143-165.
Bickel, P. J. and Ritov, Y. (2003). Nonparametric estimators which can be "plugged-in". Ann. Statist. 31, 1033-1053.
Birgé, L. (1987). On the risk of histograms for estimating decreasing densities. Ann. Statist. 15, 1013-1022.
Birgé, L. (1989). The Grenander estimator: a nonasymptotic approach. Ann. Statist. 17, 1532-1549.
Birgé, L. and Massart, P. (1993). Rates of convergence for minimum contrast estimators. Probab. Theory Related Fields 97, 113-150.
Boyles, R. A., Marshall, A. W., and Proschan, F. (1985). Inconsistency of the maximum likelihood estimator of a distribution having increasing failure rate average. Ann. Statist. 13, 413-417.
Ferguson, T. S. (1982). An inconsistent maximum likelihood estimate. J. Amer. Statist. Assoc. 77, 831-834.
Ghosh, M. (1995). Inconsistent maximum likelihood estimators for the Rasch model. Statist. Probab. Lett. 23, 165-170.
Hudgens, M., Maathuis, M., and Gilbert, P. (2005). Nonparametric estimation of the joint distribution of a survival time subject to interval censoring and a continuous mark variable. Submitted to Biometrics.
Le Cam, L. (1990). Maximum likelihood: an introduction. Internat. Statist. Rev. 58, 153-171.
Maathuis, M. and Wellner, J. A. (2005). Inconsistency of the MLE for the joint distribution of interval censored survival times and continuous marks. Submitted to Biometrika.
Pan, W. and Chappell, R. (1999). A note on inconsistency of NPMLE of the distribution function from left truncated and case I interval censored data. Lifetime Data Anal. 5, 281-291.