Maximum likelihood: counterexamples, examples, and open problems. Jon A. Wellner, University of Washington.


Talk at the University of Idaho, Department of Mathematics, September 15, 2005.
Email: jaw@stat.washington.edu
http://www.stat.washington.edu/jaw/jaw.research.html

Outline
- Introduction: maximum likelihood estimation
- Counterexamples
- Beyond consistency: rates and distributions
- Positive examples
- Problems and challenges

1. Introduction: maximum likelihood estimation

Setting 1: dominated families. Suppose that $X_1, \ldots, X_n$ are i.i.d. with density $p_{\theta_0}$ with respect to some dominating measure $\mu$, where $p_{\theta_0} \in \mathcal{P} = \{p_\theta : \theta \in \Theta\}$ for $\Theta \subset \mathbb{R}^d$. The likelihood is
$$L_n(\theta) = \prod_{i=1}^n p_\theta(X_i).$$
Definition: A Maximum Likelihood Estimator (MLE) of $\theta_0$ is any value $\hat\theta \in \Theta$ satisfying
$$L_n(\hat\theta) = \sup_{\theta \in \Theta} L_n(\theta).$$

Equivalently, the MLE $\hat\theta$ maximizes the log-likelihood
$$\log L_n(\theta) = \sum_{i=1}^n \log p_\theta(X_i).$$

Example 1. Exponential($\theta$). $X_1, \ldots, X_n$ are i.i.d. $p_{\theta_0}$ where
$$p_\theta(x) = \theta \exp(-\theta x)\, 1_{[0,\infty)}(x).$$
Then the likelihood is
$$L_n(\theta) = \theta^n \exp\Big(-\theta \sum_{i=1}^n X_i\Big),$$
so the log-likelihood is
$$\log L_n(\theta) = n \log\theta - \theta \sum_{i=1}^n X_i,$$
and $\hat\theta_n = 1/\overline{X}_n$.

[Figure: $1/n$-th power of the likelihood, $n = 50$]
[Figure: $1/n$ times the log-likelihood, $n = 50$]
[Figure: MLE $p_{\hat\theta}(x)$ and true density $p_{\theta_0}(x)$]
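As a quick numerical check (not part of the original slides), the closed form $\hat\theta_n = 1/\overline{X}_n$ can be compared with a brute-force maximization of the log-likelihood; all names here are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
theta0 = 2.0
x = rng.exponential(scale=1.0 / theta0, size=50)  # Exponential(theta0) sample

# closed-form MLE: theta_hat = 1 / sample mean
mle_closed_form = 1.0 / x.mean()

# brute-force check: maximize n*log(theta) - theta*sum(x) over a grid
thetas = np.linspace(0.01, 10.0, 200_000)
loglik = len(x) * np.log(thetas) - thetas * x.sum()
mle_grid = thetas[np.argmax(loglik)]
```

The grid maximizer agrees with $1/\overline{X}_n$ up to the grid spacing.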

Example 2. Monotone decreasing densities on $(0, \infty)$. $X_1, \ldots, X_n$ are i.i.d. $p_0 \in \mathcal{P}$ where $\mathcal{P}$ = all nonincreasing densities on $(0, \infty)$. Then the likelihood is $L_n(p) = \prod_{i=1}^n p(X_i)$; $L_n(p)$ is maximized by the Grenander estimator:
$$\hat p_n(x) = \text{left derivative at } x \text{ of the least concave majorant } C_n \text{ of } \mathbb{F}_n,$$
where $\mathbb{F}_n(x) = n^{-1}\sum_{i=1}^n 1\{X_i \le x\}$.

[Figure: Grenander estimator and true density, $n = 10$]
[Figure: Grenander estimator and true density, $n = 100$]
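A minimal sketch of the Grenander estimator (assuming only NumPy; the function name and interface are illustrative): build the least concave majorant of the empirical d.f. with an upper-hull scan, then read off its left slopes.

```python
import numpy as np

def grenander(sample):
    """Grenander estimator: left derivative of the least concave majorant
    of the empirical d.f., returned as a step function on the order statistics."""
    xs = np.sort(np.asarray(sample, dtype=float))
    n = len(xs)
    # points of the empirical d.f.: (0, 0), (x_(1), 1/n), ..., (x_(n), 1)
    px = np.concatenate(([0.0], xs))
    py = np.arange(n + 1) / n
    # monotone-chain scan keeping the upper hull (= least concave majorant)
    hull = [0]
    for i in range(1, n + 1):
        while len(hull) >= 2:
            j, k = hull[-2], hull[-1]
            # drop k if it lies on or below the chord from j to i
            if (py[k] - py[j]) * (px[i] - px[j]) <= (py[i] - py[j]) * (px[k] - px[j]):
                hull.pop()
            else:
                break
        hull.append(i)
    # density estimate: slope of the majorant on each hull segment
    heights = np.empty(n)
    for a, b in zip(hull[:-1], hull[1:]):
        heights[a:b] = (py[b] - py[a]) / (px[b] - px[a])  # constant on (px[a], px[b]]
    return xs, heights
```

The returned slopes are automatically nonincreasing and integrate to one over $(0, X_{(n)}]$.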

Setting 2: non-dominated families. Suppose that $X_1, \ldots, X_n$ are i.i.d. $P_0 \in \mathcal{P}$ where $\mathcal{P}$ is some collection of probability measures on a measurable space $(\mathcal{X}, \mathcal{A})$. If $P\{x\}$ denotes the measure under $P$ of the one-point set $\{x\}$, the likelihood of $X_1, \ldots, X_n$ is defined to be
$$L_n(P) = \prod_{i=1}^n P\{X_i\}.$$
Then a Maximum Likelihood Estimator (MLE) of $P_0$ can be defined as a measure $\hat P_n \in \mathcal{P}$ that maximizes $L_n(P)$; thus
$$L_n(\hat P_n) = \sup_{P \in \mathcal{P}} L_n(P).$$

Example 3. (Empirical measure.) If $\mathcal{P}$ = all probability measures on $(\mathcal{X}, \mathcal{A})$, then
$$\hat P_n = \frac{1}{n}\sum_{i=1}^n \delta_{X_i} \equiv \mathbb{P}_n,$$
where $\delta_x(A) = 1_A(x)$. Thus
$$\hat P_n(A) = \frac{1}{n}\sum_{i=1}^n \delta_{X_i}(A) = \frac{1}{n}\sum_{i=1}^n 1_A(X_i) = \frac{\#\{1 \le i \le n : X_i \in A\}}{n}.$$
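In code, the empirical measure is just the fraction of sample points falling in a set; a tiny illustrative sketch:

```python
import numpy as np

def empirical_measure(sample, indicator):
    """P_hat_n(A) = #{i : X_i in A} / n, with the set A given by its indicator."""
    sample = np.asarray(sample)
    return float(np.mean(indicator(sample)))

x = np.array([0.3, 1.2, 0.3, 2.5, 1.2, 1.2])
mass = empirical_measure(x, lambda v: v <= 1.0)  # 2 of the 6 points lie in (-inf, 1]
```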

Consistency of the MLE:
- Wald (1949)
- Kiefer and Wolfowitz (1956)
- Huber (1967)
- Perlman (1972)
- Wang (1985)
- van de Geer (1993)

Counterexamples:
- Neyman and Scott (1948)
- Bahadur (1958)
- Ferguson (1982)
- Le Cam (1975), (1990)
- Barlow, Bartholomew, Bremner, and Brunk (the "4 B's") (1972)
- Boyles, Marshall, and Proschan (1985)
- bivariate right censoring: Tsai, van der Laan, Pruitt
- left truncation and interval censoring: Chappell and Pan (1999)
- Maathuis and Wellner (2005)

2. Counterexamples: MLEs are not always consistent

Counterexample 1. (Ferguson, 1982.) Suppose that $X_1, \ldots, X_n$ are i.i.d. with density $f_{\theta_0}$ where, for $\theta \in [0, 1]$,
$$f_\theta(x) = (1-\theta)\,\frac{1}{\delta(\theta)}\, f_0\!\left(\frac{x-\theta}{\delta(\theta)}\right) + \theta f_1(x),$$
where
$$f_1(x) = \tfrac{1}{2}\, 1_{[-1,1]}(x) \ \ \text{(Uniform}[-1,1]\text{)}, \qquad f_0(x) = (1-|x|)\, 1_{[-1,1]}(x) \ \ \text{(Triangular}[-1,1]\text{)},$$
and $\delta(\theta)$ satisfies: $\delta(0) = 1$, $0 < \delta(\theta) \le 1-\theta$, and $\delta(\theta) \to 0$ as $\theta \to 1$.

[Figure: density $f_\theta(x)$ for $c = 2$, $\theta = 0.38$]
[Figure: $F_\theta(x)$ for $c = 2$, $\theta = 0.38$]

Ferguson (1982) shows that $\hat\theta_n \to_{a.s.} 1$ no matter what $\theta_0$ is true, if $\delta(\theta) \to 0$ fast enough. In fact, the assertion is true if
$$\delta(\theta) = (1-\theta)\exp\big(-(1-\theta)^{-c} + 1\big) \quad \text{with } c > 2.$$
(Ferguson shows that $c = 4$ works.) If $c = 2$, Ferguson's argument shows that
$$\sup_{0 \le \theta \le 1} \frac{1}{n}\log L_n(\theta) \ \ge\ \frac{n-1}{n}\log\frac{M_n}{2} + \frac{1}{n}\log\frac{1-M_n}{\delta(M_n)} \ \to_d\ D,$$
where $M_n = \max_i X_i$ and
$$P(D \le y) = \exp\left(-\frac{1}{2(y + \log 2)}\right), \qquad y \ge -\log 2;$$
that is, with $E$ an Exponential(1) random variable, $D =_d -\log 2 + \frac{1}{2E}$.


Counterexample 2. (The 4 B's, 1972.) A distribution $F$ on $[0, b)$ is star-shaped if $F(x)/x$ is non-decreasing on $[0, b)$. Thus if $F$ has a density $f$ which is increasing on $[0, b)$, then $F$ is star-shaped. Let $\mathcal{F}_{star}$ be the class of all star-shaped distributions on $[0, b)$ for some $b$. Suppose that $X_1, \ldots, X_n$ are i.i.d. $F \in \mathcal{F}_{star}$. Barlow, Bartholomew, Bremner, and Brunk (1972) show that the MLE of a star-shaped distribution function $F$ is
$$\hat F_n(x) = \begin{cases} 0, & x < X_{(1)}, \\ \dfrac{i\,x}{n X_{(n)}}, & X_{(i)} \le x < X_{(i+1)},\ i = 1, \ldots, n-1, \\ 1, & x \ge X_{(n)}. \end{cases}$$

Moreover, BBBB (1972) show that if $F(x) = x$ for $0 \le x \le 1$, then $\hat F_n(x) \to_{a.s.} x^2 \ne x$ for $0 \le x < 1$.

[Figure: MLE, $n = 5$, and true d.f.]
[Figure: MLE, $n = 100$, and true d.f.]
[Figure: MLE, $n = 100$, and limit $x^2$]
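A sketch of the BBBB star-shaped MLE (names illustrative), which also makes the inconsistency easy to see numerically: for Uniform(0, 1) data, $\hat F_n(1/2)$ settles near $1/4 = (1/2)^2$, not $1/2$.

```python
import numpy as np

def star_shaped_mle(sample, t):
    """BBBB (1972) MLE of a star-shaped d.f.:
    F_hat(t) = i*t/(n*X_(n)) for X_(i) <= t < X_(i+1);
    0 below X_(1), and 1 from X_(n) on."""
    xs = np.sort(np.asarray(sample, dtype=float))
    n = len(xs)
    t = np.asarray(t, dtype=float)
    i = np.searchsorted(xs, t, side="right")   # number of order statistics <= t
    F = i * t / (n * xs[-1])
    F = np.where(t < xs[0], 0.0, F)
    F = np.where(t >= xs[-1], 1.0, F)
    return F

u = np.random.default_rng(2).uniform(size=20_000)
at_half = float(star_shaped_mle(u, 0.5))   # close to 0.25, not 0.5
```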

Note 1. Since $X_{(i)} =_d S_i/S_{n+1}$ where $S_i = \sum_{j=1}^i E_j$ with $E_j$ i.i.d. Exponential(1) random variables, the total mass at the order statistics equals
$$\frac{1}{n X_{(n)}} \sum_{i=1}^n X_{(i)} \ =_d\ \frac{1}{n S_n} \sum_{i=1}^n S_i \ =\ \frac{n}{S_n}\cdot\frac{1}{n}\sum_{j=1}^n \Big(1 - \frac{j-1}{n}\Big) E_j \ \to_p\ \int_0^1 (1-t)\,dt = \frac{1}{2}.$$

Note 2. BBBB (1972) present consistent estimators of a star-shaped $F$ via isotonization, due to Barlow and Scheurer (1971) and van Zwet.

Counterexample 3. (Boyles, Marshall, and Proschan, 1985.) A distribution $F$ on $[0, \infty)$ is Increasing Failure Rate Average (IFRA) if
$$\frac{1}{x}\{-\log(1 - F(x))\} \equiv \frac{1}{x}\Lambda(x)$$
is non-decreasing; that is, if $\Lambda$ is star-shaped. Let $\mathcal{F}_{IFRA}$ be the class of all IFRA distributions on $[0, \infty)$.

Suppose that $X_1, \ldots, X_n$ are i.i.d. $F \in \mathcal{F}_{IFRA}$. Boyles, Marshall, and Proschan (1985) showed that the MLE $\hat F_n$ of an IFRA distribution function $F$ is given by
$$-\frac{1}{x}\log(1 - \hat F_n(x)) = \hat\lambda_j \quad \text{for } X_{(j)} \le x < X_{(j+1)},\ j = 0, \ldots, n-1, \qquad \hat F_n(x) = 1 \ \text{for } x \ge X_{(n)},$$
where
$$\hat\lambda_j = \sum_{i=1}^j \frac{1}{X_{(i)}} \log\left(\frac{\sum_{k=i}^n X_{(k)}}{\sum_{k=i+1}^n X_{(k)}}\right).$$

Moreover, BMP (1985) show that if $F$ is Exponential(1), then
$$1 - \hat F_n(x) \to_{a.s.} (1+x)^{-x} \ne e^{-x}, \qquad \text{so} \qquad \frac{1}{x}\hat\Lambda_n(x) \to_{a.s.} \log(1+x) \ne 1.$$

[Figure: MLE, $n = 100$, and true d.f. $1 - e^{-x}$]
[Figure: MLE, $n = 100$, and limit d.f. $1 - (1+x)^{-x}$]
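Under the reconstruction of $\hat\lambda_j$ above (which I am not certain matches BMP's exact display, so treat it as an assumption, along with the names below), the inconsistency can be checked by simulation: for Exponential(1) data, $-(1/x)\log(1-\hat F_n(x))$ at $x = 1$ should approach $\log 2 \approx 0.693$ rather than $1$.

```python
import numpy as np

def ifra_mle_slopes(sample):
    """Slopes lambda_hat_1, ..., lambda_hat_{n-1} of the IFRA MLE, as
    reconstructed above: lambda_hat_j = sum_{i<=j} (1/X_(i)) * log(T_i/T_{i+1}),
    with T_i = X_(i) + ... + X_(n) the tail sums of the order statistics."""
    xs = np.sort(np.asarray(sample, dtype=float))
    tails = np.cumsum(xs[::-1])[::-1]            # T_1, ..., T_n
    terms = np.log(tails[:-1] / tails[1:]) / xs[:-1]
    return xs, np.cumsum(terms)

xs, lam = ifra_mle_slopes(np.random.default_rng(3).exponential(size=20_000))
j = int(np.searchsorted(xs, 1.0)) - 1            # interval [X_(j), X_(j+1)) containing 1.0
at_one = lam[j]                                  # near log 2, not 1
```

Since the $\hat\lambda_j$ are cumulative sums of positive terms they are nondecreasing, so $\hat\Lambda_n$ is indeed star-shaped.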

More counterexamples:
- bivariate right censoring: Tsai, van der Laan, Pruitt
- left truncation and interval censoring: Chappell and Pan (1999)
- bivariate interval censoring with a continuous mark: Hudgens, Maathuis, and Gilbert (2005); Maathuis and Wellner (2005)

3. Beyond consistency: rates and distributions

Le Cam (1973); Birgé (1983): the optimal rate of convergence $r_n = r_n^{opt}$ is determined by
$$n r_n^{-2} = \log N_{[\,]}(1/r_n, \mathcal{P}). \tag{1}$$
If
$$\log N_{[\,]}(\epsilon, \mathcal{P}) \le K \epsilon^{-1/\gamma}, \tag{2}$$
then (1) leads to the optimal rate of convergence $r_n^{opt} = n^{\gamma/(2\gamma+1)}$.

On the other hand, bounds from Birgé and Massart (1993) yield achieved rates of convergence for maximum likelihood estimators (and other minimum contrast estimators), $r_n = r_n^{ach}$, determined by
$$\sqrt{n}\, r_n^{-2} = \int_{c r_n^{-2}}^{1/r_n} \sqrt{\log N_{[\,]}(\epsilon, \mathcal{P})}\, d\epsilon.$$
If (2) holds, this leads to the rate
$$r_n^{ach} = \begin{cases} n^{\gamma/(2\gamma+1)} & \text{if } \gamma > 1/2, \\ n^{\gamma/2} & \text{if } \gamma < 1/2. \end{cases}$$
Thus there is the possibility that maximum likelihood is not (rate-)optimal when $\gamma < 1/2$.

Typically $\gamma = \alpha/d$, where $d$ is the dimension of the underlying sample space and $\alpha$ is a measure of the smoothness of the functions in $\mathcal{P}$. Hence $\alpha < d/2$ leads to $\gamma < 1/2$. But there are many examples/problems with $\gamma > 1/2$!

[Figure: optimal rate and MLE rate as a function of $\gamma$]
[Figure: difference of rates $\gamma/(2\gamma+1) - \gamma/2$ for $0 < \gamma < 1/2$]
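The two rate exponents are easy to compare directly; a small illustrative check that $\gamma/(2\gamma+1) \ge \gamma/2$ precisely on $(0, 1/2]$, with the largest gap at $\gamma = (\sqrt{2}-1)/2 \approx 0.207$ (found by setting the derivative of the gap to zero):

```python
import numpy as np

gammas = np.linspace(1e-3, 0.5, 100_000)
opt = gammas / (2.0 * gammas + 1.0)   # optimal-rate exponent gamma/(2*gamma+1)
mle = gammas / 2.0                    # Birge-Massart MLE bound exponent gamma/2
gap = opt - mle                       # nonnegative for gamma <= 1/2
gamma_star = gammas[np.argmax(gap)]   # maximizer of the gap
```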

4. Positive examples (some still in progress!)
- interval censoring: case 1, current status data (Groeneboom); case 2 (Groeneboom)
- panel count data (Wellner and Zhang, 2000)
- k-monotone densities (Balabdaoui and Wellner, 2004)
- competing risks current status data (Jewell and van der Laan; Maathuis)

Example 1. (Interval censoring, case 1.)
$X \sim F$, $Y \sim G$ independent. Observe $(1\{X \le Y\}, Y) \equiv (\Delta, Y)$. Goal: estimate $F$. The MLE $\hat F_n$ exists.
Global rate: $d = 1$, $\alpha = 1$, $\gamma = \alpha/d = 1$, $\gamma/(2\gamma+1) = 1/3$, so $r_n = n^{1/3}$:
$$n^{1/3}\, h(p_{\hat F_n}, p_0) = O_p(1),$$
and this yields
$$n^{1/3} \int |\hat F_n - F_0|\, dG = O_p(1).$$

Interval censoring case 1, continued.
Local rate (Groeneboom, 1987):
$$n^{1/3}(\hat F_n(t_0) - F(t_0)) \to_d \left\{\frac{F(t_0)(1 - F(t_0))\, f_0(t_0)}{2\, g(t_0)}\right\}^{1/3} 2\mathbb{Z},$$
where $\mathbb{Z} = \operatorname{argmin}\{W(t) + t^2\}$ for $W$ a standard two-sided Brownian motion.
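The limit variable $\mathbb{Z} = \operatorname{argmin}\{W(t) + t^2\}$ (Chernoff's distribution, up to scaling) can be approximated by crude Monte Carlo on a grid; this is a sketch with illustrative names and grid choices, not a serious discretization.

```python
import numpy as np

def simulate_argmin(n_paths=2000, T=2.0, m=400, seed=0):
    """Monte Carlo draws of Z = argmin_t {W(t) + t^2}, with W a two-sided
    Brownian motion, discretized on a grid over [-T, T]."""
    rng = np.random.default_rng(seed)
    t = np.linspace(-T, T, 2 * m + 1)
    dt = t[1] - t[0]
    # two independent Brownian motions for t > 0 and t < 0, pinned at W(0) = 0
    pos = np.cumsum(rng.normal(scale=np.sqrt(dt), size=(n_paths, m)), axis=1)
    neg = np.cumsum(rng.normal(scale=np.sqrt(dt), size=(n_paths, m)), axis=1)
    W = np.concatenate((neg[:, ::-1], np.zeros((n_paths, 1)), pos), axis=1)
    return t[np.argmin(W + t ** 2, axis=1)]

Z = simulate_argmin()
```

By symmetry $E\,\mathbb{Z} = 0$, and essentially all of the mass sits well inside $[-2, 2]$, so the truncation to $[-T, T]$ matters little.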

Example 2. (Interval censoring, case 2.)
$X \sim F$; $(U, V) \sim H$ with $U \le V$, independent of $X$. Observe i.i.d. copies of $(\Delta, U, V)$ where
$$\Delta = (\Delta_1, \Delta_2, \Delta_3) = (1\{X \le U\},\ 1\{U < X \le V\},\ 1\{V < X\}).$$
Goal: estimate $F$. The MLE $\hat F_n$ exists.

(Interval censoring, case 2, continued.)
Global rate (separated case): if $P(V - U \ge \epsilon) = 1$, then $d = 1$, $\alpha = 1$, $\gamma = \alpha/d = 1$, $\gamma/(2\gamma+1) = 1/3$, so $r_n = n^{1/3}$:
$$n^{1/3}\, h(p_{\hat F_n}, p_0) = O_p(1),$$
and this yields
$$n^{1/3} \int |\hat F_n - F_0|\, d\mu = O_p(1), \qquad \text{where } \mu(A) = P(U \in A) + P(V \in A),\ A \in \mathcal{B}_1.$$

(Interval censoring, case 2, continued.)
Global rate (non-separated case) (van de Geer, 1993):
$$n^{1/3}(\log n)^{-1/6}\, h(p_{\hat F_n}, p_0) = O_p(1).$$
Although this looks worse in terms of the rate, it is actually better because the Hellinger metric is much stronger in this case.

(Interval censoring, case 2, continued.)
Local rate (separated case) (Groeneboom, 1996):
$$n^{1/3}(\hat F_n(t_0) - F_0(t_0)) \to_d \left\{\frac{f_0(t_0)}{2\, a(t_0)}\right\}^{1/3} 2\mathbb{Z},$$
where $\mathbb{Z} = \operatorname{argmin}\{W(t) + t^2\}$ and
$$a(t_0) = \frac{h_1(t_0)}{F_0(t_0)} + k_1(t_0) + k_2(t_0) + \frac{h_2(t_0)}{1 - F_0(t_0)},$$
$$k_1(u) = \int_u^M \frac{h(u, v)}{F_0(v) - F_0(u)}\, dv, \qquad k_2(v) = \int_0^v \frac{h(u, v)}{F_0(v) - F_0(u)}\, du.$$

(Interval censoring, case 2, continued.)
Local rate (non-separated case) (conjectured, Groeneboom and Wellner, 1992):
$$(n \log n)^{1/3}(\hat F_n(t_0) - F_0(t_0)) \to_d \left\{\frac{3}{4}\,\frac{f_0(t_0)^2}{h(t_0, t_0)}\right\}^{1/3} 2\mathbb{Z},$$
where $\mathbb{Z} = \operatorname{argmin}\{W(t) + t^2\}$.
Monte Carlo evidence in support: Groeneboom and Ketelaars (2005).

[Figure: MSE of histogram estimator / MSE of MLE, $f_0(t) = 1$; $n = 1000, 2500, 5000, 10000$]
[Figure: MSE of histogram estimator / MSE of MLE, $f_0(t) = 4(1-t)^3$; $n = 1000, 2500, 5000, 10000$]
[Figure: MSE of histogram estimator / MSE of MLE, $f_0(t) = 1$; $n = 1000, 2500, 5000, 10000$]
[Figure: MSE of histogram estimator / MSE of MLE, $f_0(t) = 4(1-t)^3$; $n = 1000, 2500, 5000, 10000$]

Example 3. (k-monotone densities.)
A density $p$ on $(0, \infty)$ is $k$-monotone ($p \in \mathcal{D}_k$) if it is non-negative and nonincreasing when $k = 1$; and if $(-1)^j p^{(j)}(x) \ge 0$ for $j = 0, \ldots, k-2$ and $(-1)^{k-2} p^{(k-2)}$ is convex, for $k \ge 2$.
Mixture representation: $p \in \mathcal{D}_k$ if and only if
$$p(x) = \int_0^\infty \frac{k (y - x)_+^{k-1}}{y^k}\, dF(y)$$
for some distribution function $F$ on $(0, \infty)$.
- $k = 1$: monotone decreasing densities on $\mathbb{R}_+$
- $k = 2$: convex decreasing densities on $\mathbb{R}_+$
- $k \ge 3$: ...
- $k = \infty$: completely monotone densities = scale mixtures of exponentials
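The mixture representation is easy to evaluate for a discrete mixing distribution $F$; a small sketch (illustrative names; written for $k \ge 2$, since the $k = 1$ kernel needs an explicit indicator rather than $(y-x)_+^0$):

```python
import numpy as np

def k_monotone_density(x, k, ys, weights):
    """p(x) = sum_j w_j * k * (y_j - x)_+^(k-1) / y_j^k  for a discrete mixing
    d.f. F putting mass w_j at y_j (weights sum to 1); valid for k >= 2."""
    x = np.asarray(x, dtype=float)[..., None]
    kernel = k * np.clip(ys - x, 0.0, None) ** (k - 1) / ys ** k
    return (kernel * weights).sum(axis=-1)

grid = np.linspace(0.0, 3.0, 3001)
p2 = k_monotone_density(grid, 2, np.array([1.0, 3.0]), np.array([0.4, 0.6]))
```

Each kernel is a density supported on $[0, y_j]$, so the mixture integrates to one; for $k = 2$ the result is convex and decreasing.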

(k-monotone densities, continued.)
The MLE $\hat p_n$ of $p_0 \in \mathcal{D}_k$ exists and is characterized by
$$\int_0^\infty \frac{k (y - x)_+^{k-1}}{y^k\, \hat p_n(x)}\, d\mathbb{P}_n(x) \begin{cases} \le 1 & \text{for all } y \ge 0, \\ = 1 & \text{if } (-1)^k \hat p_n^{(k-1)}(y-) > (-1)^k \hat p_n^{(k-1)}(y+). \end{cases}$$
- $k = 1$, Grenander estimator: $r_n = n^{1/3}$.
- Global rates and finite-$n$ minimax bounds: Birgé (1986), (1987), (1989).
- Local rates: Prakasa Rao (1969), Groeneboom (1985), (1989), Kim and Pollard (1990):
$$n^{1/3}(\hat p_n(t_0) - p_0(t_0)) \to_d \left\{\frac{p_0(t_0)\, |p_0'(t_0)|}{2}\right\}^{1/3} 2\mathbb{Z}.$$

(k-monotone densities, continued.)
$k = 2$, convex decreasing density: $d = 1$, $\alpha = 2$, $\gamma = 2$, $\gamma/(2\gamma+1) = 2/5$, so $r_n = n^{2/5}$ (forward problem).
Global rates: nothing yet. Local rates and distributions: Groeneboom, Jongbloed, Wellner (2001).
$k \ge 3$, $k$-monotone density: $d = 1$, $\alpha = k$, $\gamma = k$, $\gamma/(2\gamma+1) = k/(2k+1)$, so $r_n = n^{k/(2k+1)}$ (forward problem)?
Global rates: nothing yet. Local rates: should be $r_n = n^{k/(2k+1)}$. Progress: Balabdaoui and Wellner (2004) — the local rate is true if a certain conjecture about Hermite interpolation holds.

[Figure: direct and inverse estimators (LSE and MLE), $k = 3$, $n = 100$]
[Figure: direct and inverse estimators (LSE and MLE), $k = 3$, $n = 1000$]
[Figure: direct and inverse estimators (LSE and MLE), $k = 6$, $n = 100$]
[Figure: direct and inverse estimators (LSE and MLE), $k = 6$, $n = 1000$]

Example 4. (Competing risks with current status data.)
Variables of interest: $(X, Y)$, with $X$ = failure time and $Y$ = failure cause; $X \in \mathbb{R}_+$, $Y \in \{1, \ldots, K\}$. $T$ = an observation time, independent of $(X, Y)$.
Observe $(\Delta, T)$, $\Delta = (\Delta_1, \ldots, \Delta_K, \Delta_{K+1})$, where
$$\Delta_j = 1\{X \le T,\ Y = j\},\ j = 1, \ldots, K, \qquad \Delta_{K+1} = 1\{X > T\}.$$
Goal: estimate $F_j(t) = P(X \le t, Y = j)$ for $j = 1, \ldots, K$.
The MLE $\hat F_n = (\hat F_{n,1}, \ldots, \hat F_{n,K})$ exists! The characterization of $\hat F_n$ involves an interacting system of slopes of convex minorants.

(Competing risks with current status data, continued)

Global rates: easy with present methods;
n^{1/3} ∫ Σ_{k=1}^{K} |F̂_{n,k}(t) − F_{0,k}(t)| dG(t) = O_p(1).

Local rates? Conjecture: r_n = n^{1/3}. New methods needed; tricky. Maathuis (2006)?

Limit distribution theory will involve the slopes of an interacting system of greatest convex minorants defined in terms of a vector of dependent two-sided Brownian motions.

[Figure: n^{2/3} × MSE of the MLE and of the naive estimators (NE, SNE, TNE) of F_1, plotted against t ∈ [0, 1.5], for n = 25, 250, 2500, 25000.]

[Figure: the same comparison for F_2.]

Example 5. (Monotone densities in R^d)

Entropy bounds? Natural conjectures: α = 1, γ = 1/d, so γ/(2γ + 1) = 1/(d + 2).

Biau and Devroye (2003) show, via Assouad's lemma and direct calculations, that the optimal rate is r_n^{opt} = n^{1/(d+2)}.

Rate achieved by the MLE: natural conjecture r_n^{ach} = n^{1/(2d)} for d > 2.

Biau and Devroye (2003) construct generalizations of Birgé's (1987) histogram estimators that achieve the optimal rate for all d ≥ 2.
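A quick numeric check of these exponents (plain arithmetic on the conjectured rates above): the MLE exponent 1/(2d) equals the optimal exponent 1/(d + 2) at d = 2 and falls strictly below it for d > 2, so the conjecture would make the MLE rate suboptimal in higher dimensions.

```python
# Compare the optimal exponent 1/(d+2) (Biau and Devroye, 2003) with the
# conjectured MLE exponent 1/(2d) for block-decreasing densities on R^d.
rows = []
for d in range(2, 7):
    opt, mle = 1 / (d + 2), 1 / (2 * d)
    rows.append((d, opt, mle, mle < opt))
    print(f"d={d}: optimal={opt:.4f}, MLE conjecture={mle:.4f}, "
          f"suboptimal={mle < opt}")
```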

5. Problems and Challenges

More tools for local rates and distribution theory?

Under what additional hypotheses are MLEs globally rate optimal in the case γ > 1/2?

More counterexamples to clarify when MLEs do not work?

What is the limit distribution for interval censoring, case 2? (Does the Groeneboom and Wellner (1992) conjecture hold?)

When the MLE is not rate optimal, is it still preferable from some other perspective? For example, does the MLE provide efficient estimators of smooth functionals (while alternative rate-optimal estimators fail to have this property)? Compare with Bickel and Ritov (2003).

Problems and challenges, continued

More rate and optimality theory for maximum likelihood estimators of mixing distributions in mixture models with smooth kernels: e.g., completely monotone densities (scale mixtures of exponentials), normal location mixtures (deconvolution problems).

Stable and efficient algorithms for computing MLEs in models where they exist (e.g., mixture models, missing data).

Results for monotone densities in R^d?

Selected References

Bahadur, R. R. (1958). Examples of inconsistency of maximum likelihood estimates. Sankhyā Ser. A 20, 207-210.

Barlow, R. E., Bartholomew, D. J., Bremner, J. M., and Brunk, H. D. (1972). Statistical Inference under Order Restrictions. Wiley, New York.

Barlow, R. E. and Scheuer, E. M. (1971). Estimation from accelerated life tests. Technometrics 13, 145-159.

Biau, G. and Devroye, L. (2003). On the risk of estimates for block decreasing densities. J. Multivariate Anal. 86, 143-165.

Bickel, P. J. and Ritov, Y. (2003). Nonparametric estimators which can be plugged-in. Ann. Statist. 31, 1033-1053.

Birgé, L. (1987). On the risk of histograms for estimating decreasing densities. Ann. Statist. 15, 1013-1022.

Birgé, L. (1989). The Grenander estimator: a nonasymptotic approach. Ann. Statist. 17, 1532-1549.

Birgé, L. and Massart, P. (1993). Rates of convergence for minimum contrast estimators. Probab. Theory Related Fields 97, 113-150.

Boyles, R. A., Marshall, A. W., and Proschan, F. (1985). Inconsistency of the maximum likelihood estimator of a distribution having increasing failure rate average. Ann. Statist. 13, 413-417.

Ferguson, T. S. (1982). An inconsistent maximum likelihood estimate. J. Amer. Statist. Assoc. 77, 831-834.

Ghosh, M. (1995). Inconsistent maximum likelihood estimators for the Rasch model. Statist. Probab. Lett. 23, 165-170.

Hudgens, M., Maathuis, M., and Gilbert, P. (2005). Nonparametric estimation of the joint distribution of a survival time subject to interval censoring and a continuous mark variable. Submitted to Biometrics.

Le Cam, L. (1990). Maximum likelihood: an introduction. Internat. Statist. Rev. 58, 153-171.

Maathuis, M. and Wellner, J. A. (2005). Inconsistency of the MLE for the joint distribution of interval censored survival times and continuous marks. Submitted to Biometrika.

Pan, W. and Chappell, R. (1999). A note on inconsistency of NPMLE of the distribution function from left truncated and case I interval censored data. Lifetime Data Anal. 5, 281-291.