Introduction to Empirical Processes and Semiparametric Inference
Lecture 25: Semiparametric Models

Michael R. Kosorok, Ph.D.
Professor and Chair of Biostatistics
Professor of Statistics and Operations Research
University of North Carolina-Chapel Hill

Score Functions and Estimating Equations

A parameter $\psi(P)$ of particular interest is the parametric component $\theta$ of a semiparametric model $\{P_{\theta,\eta} : \theta \in \Theta, \eta \in H\}$, where $\Theta$ is an open subset of $\mathbb{R}^k$ and $H$ is an arbitrary set that may be infinite dimensional. Tangent sets can be used to develop an efficient estimator for $\psi(P_{\theta,\eta}) = \theta$ through the formation of an efficient score function.

In this setting, we consider submodels of the form $\{P_{\theta+ta,\eta_t} : t \in N_\epsilon\}$ that are differentiable in quadratic mean with score function
$$\frac{\partial}{\partial t}\log dP_{\theta+ta,\eta_t}\Big|_{t=0} = a'\dot\ell_{\theta,\eta} + g,$$
where $a \in \mathbb{R}^k$, $\dot\ell_{\theta,\eta} : \mathcal{X} \mapsto \mathbb{R}^k$ is the ordinary score for $\theta$ when $\eta$ is fixed, and where $g : \mathcal{X} \mapsto \mathbb{R}$ is an element of a tangent set $\dot{\mathcal{P}}^{(\eta)}_{P_{\theta,\eta}}$ for the submodel
$$\mathcal{P}_\theta = \{P_{\theta,\eta} : \eta \in H\}$$
(holding $\theta$ fixed).

This tangent set is the tangent set for $\eta$ and should be rich enough to reflect all parametric submodels of $\mathcal{P}_\theta$. The tangent set for the full model is
$$\dot{\mathcal{P}}_{P_{\theta,\eta}} = \left\{a'\dot\ell_{\theta,\eta} + g : a \in \mathbb{R}^k,\ g \in \dot{\mathcal{P}}^{(\eta)}_{P_{\theta,\eta}}\right\}.$$

While ψ(p θ+ta,ηt ) = θ + ta is clearly differentiable with respect to t, we also require that there there exists a function ψ θ,η : X R k such that ψ(p θ+ta,ηt ) [ ( t = a = P ψθ,η l θ,η a + g)] t=0 for all a R k and all g Ṗ(η) P θ,η., (1) After setting a = 0, we see that such a function must be uncorrelated with all of the elements of Ṗ (η) P θ,η. 5

Define $\Pi_{\theta,\eta}$ to be the orthogonal projection onto the closed linear span of $\dot{\mathcal{P}}^{(\eta)}_{P_{\theta,\eta}}$ in $L_2^0(P_{\theta,\eta})$. The efficient score function for $\theta$ is
$$\tilde\ell_{\theta,\eta} = \dot\ell_{\theta,\eta} - \Pi_{\theta,\eta}\dot\ell_{\theta,\eta},$$
while the efficient information matrix for $\theta$ is
$$\tilde I_{\theta,\eta} = P\left[\tilde\ell_{\theta,\eta}\tilde\ell_{\theta,\eta}'\right].$$

Provided that $\tilde I_{\theta,\eta}$ is nonsingular, the function $\tilde\psi_{\theta,\eta} = \tilde I_{\theta,\eta}^{-1}\tilde\ell_{\theta,\eta}$ satisfies (1) for all $a \in \mathbb{R}^k$ and all $g \in \dot{\mathcal{P}}^{(\eta)}_{P_{\theta,\eta}}$. Thus the functional (parameter) $\psi(P_{\theta,\eta}) = \theta$ is differentiable at $P_{\theta,\eta}$ relative to the tangent set $\dot{\mathcal{P}}_{P_{\theta,\eta}}$, with efficient influence function $\tilde\psi_{\theta,\eta}$.

Recall that the search for an efficient estimator of $\theta$ is over if one can find an estimator $T_n$ satisfying
$$\sqrt{n}(T_n - \theta) = \sqrt{n}\,\mathbb{P}_n\tilde\psi_{\theta,\eta} + o_P(1).$$
Note that
$$\tilde I_{\theta,\eta} = I_{\theta,\eta} - P\left[\Pi_{\theta,\eta}\dot\ell_{\theta,\eta}\left(\Pi_{\theta,\eta}\dot\ell_{\theta,\eta}\right)'\right], \qquad \text{where } I_{\theta,\eta} = P\left[\dot\ell_{\theta,\eta}\dot\ell_{\theta,\eta}'\right].$$

An intuitive justification for the form of the efficient score is that some information for estimating $\theta$ is lost due to a lack of knowledge about $\eta$. The amount subtracted off of the ordinary score, $\Pi_{\theta,\eta}\dot\ell_{\theta,\eta}$, is the minimum possible amount for regular estimators when $\eta$ is unknown.

Consider again the semiparametric regression model
$$Y = \beta'Z + e,$$
where $E[e|Z] = 0$ and $E[e^2|Z] \le K < \infty$ almost surely, and where we observe $(Y, Z)$, with the joint density $\eta$ of $(e, Z)$ satisfying
$$\int_{\mathbb{R}} e\,\eta(e, Z)\,de = 0$$
almost surely.

Assume $\eta$ has partial derivative with respect to the first argument, $\eta_1$, satisfying
$$\frac{\eta_1}{\eta} \in L_2(P_{\beta,\eta}), \quad \text{and hence} \quad \frac{\eta_1}{\eta} \in L_2^0(P_{\beta,\eta}),$$
where $P_{\beta,\eta}$ is the joint distribution of $(Y, Z)$. The Euclidean parameter of interest in this semiparametric model is $\theta = \beta$.

The score for $\beta$, assuming $\eta$ is known, is
$$\dot\ell_{\beta,\eta} = -Z\left(\frac{\eta_1}{\eta}\right)(Y - \beta'Z, Z),$$
where we use the shorthand $(f/g)(u, v) = f(u, v)/g(u, v)$ for ratios of functions. One can show that the tangent set $\dot{\mathcal{P}}^{(\eta)}_{P_{\beta,\eta}}$ for $\eta$ is the subset of $L_2^0(P_{\beta,\eta})$ consisting of all functions $g(e, Z) \in L_2^0(P_{\beta,\eta})$ which satisfy
$$E[e\,g(e, Z)|Z] = \frac{\int_{\mathbb{R}} e\,g(e, Z)\,\eta(e, Z)\,de}{\int_{\mathbb{R}} \eta(e, Z)\,de} = 0,$$
almost surely.

One can also show that this set is the orthocomplement in $L_2^0(P_{\beta,\eta})$ of all functions of the form $e f(Z)$, where $f$ satisfies $P_{\beta,\eta} f^2(Z) < \infty$. This means that $\tilde\ell_{\beta,\eta} = (I - \Pi_{\beta,\eta})\dot\ell_{\beta,\eta}$ is the projection in $L_2^0(P_{\beta,\eta})$ of $-Z(\eta_1/\eta)(e, Z)$ onto $\{e f(Z) : P_{\beta,\eta} f^2(Z) < \infty\}$, where $I$ is the identity.

Thus
$$\tilde\ell_{\beta,\eta}(Y, Z) = \frac{-Ze\int_{\mathbb{R}}\eta_1(e, Z)\,e\,de}{P_{\beta,\eta}[e^2|Z]} = \frac{-Ze(-1)}{P_{\beta,\eta}[e^2|Z]} = \frac{Z(Y - \beta'Z)}{P_{\beta,\eta}[e^2|Z]},$$
where the second-to-last step follows from the identity
$$\int_{\mathbb{R}}\eta_1(e, Z)\,e\,de = \frac{\partial}{\partial t}\int_{\mathbb{R}}\eta(te, Z)\,de\Big|_{t=1},$$
and the last step follows since $e = Y - \beta'Z$.
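The $-1$ in the middle step can also be seen directly by integration by parts; as a sketch, assuming in addition that $e\,\eta(e, Z) \to 0$ as $|e| \to \infty$ and treating $\eta$ as normalized in its first argument (so that $\int_{\mathbb{R}}\eta(e, Z)\,de = 1$),
$$\int_{\mathbb{R}} e\,\eta_1(e, Z)\,de = \Big[e\,\eta(e, Z)\Big]_{e=-\infty}^{e=\infty} - \int_{\mathbb{R}}\eta(e, Z)\,de = -1.$$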

When the function $z \mapsto P_{\beta,\eta}[e^2|Z = z]$ is non-constant in $z$, $\tilde\ell_{\beta,\eta}(Y, Z)$ is not proportional to $Z(Y - \beta'Z)$, and the usual least-squares estimator $\hat\beta$ will not be efficient. This is discussed in greater detail in Chapter 4.
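To make this concrete, here is a minimal simulation sketch (not from the lecture; the data-generating choices and variable names are illustrative assumptions) comparing ordinary least squares with the weighted least-squares estimator suggested by the efficient score, which weights each observation by $1/E[e^2|Z]$:

```python
import numpy as np

rng = np.random.default_rng(0)

def one_draw(n=2000, beta=1.5):
    # Covariate and heteroscedastic error: Var(e | Z) depends on Z, so the
    # map z -> E[e^2 | Z = z] is non-constant.
    Z = rng.uniform(0.5, 2.0, size=n)
    sigma = 0.5 + Z                       # conditional std. dev. of e given Z
    e = rng.normal(0.0, sigma)
    Y = beta * Z + e

    # Ordinary least squares (scalar covariate, no intercept).
    b_ols = np.sum(Z * Y) / np.sum(Z ** 2)

    # Weighted least squares with weights 1 / E[e^2 | Z], as suggested by the
    # efficient score Z (Y - beta'Z) / E[e^2 | Z]; oracle weights for simplicity.
    w = 1.0 / sigma ** 2
    b_wls = np.sum(w * Z * Y) / np.sum(w * Z ** 2)
    return b_ols, b_wls

draws = np.array([one_draw() for _ in range(500)])
print("OLS variance:", draws[:, 0].var())
print("WLS variance:", draws[:, 1].var())   # typically clearly smaller
```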

Two very useful tools for computing efficient scores are score and information operators. Returning to the generic semiparametric model $\{P_{\theta,\eta} : \theta \in \Theta, \eta \in H\}$, it is sometimes easier to represent an element $g$ of the tangent set for $\eta$, $\dot{\mathcal{P}}^{(\eta)}_{P_{\theta,\eta}}$, as $B_{\theta,\eta}b$, where $b$ is an element of another set $\mathcal{H}_\eta$ and $B_{\theta,\eta}$ is an operator satisfying
$$\dot{\mathcal{P}}^{(\eta)}_{P_{\theta,\eta}} = \{B_{\theta,\eta}b : b \in \mathcal{H}_\eta\}.$$

Such an operator is a score operator. The corresponding information operator is
$$B_{\theta,\eta}^* B_{\theta,\eta} : \mathcal{H}_\eta \mapsto \mathrm{lin}\,\mathcal{H}_\eta.$$
If $B_{\theta,\eta}^* B_{\theta,\eta}$ has an inverse, then it can be shown that the efficient score for $\theta$ has the form
$$\tilde\ell_{\theta,\eta} = \left(I - B_{\theta,\eta}\left[B_{\theta,\eta}^* B_{\theta,\eta}\right]^{-1} B_{\theta,\eta}^*\right)\dot\ell_{\theta,\eta}.$$

To illustrate these methods, consider again the Cox model for right-censored data. Recall in this setting that we observe a sample of $n$ realizations of $X = (V, d, Z)$, where $V = T \wedge C$, $d = 1\{V = T\}$, $Z \in \mathbb{R}^k$ is a covariate vector, $T$ is a failure time, and $C$ is a censoring time.

We also assume that $T$ and $C$ are independent given $Z$, that $T$ given $Z$ has integrated hazard function $e^{\beta'Z}\Lambda(t)$, for $\beta$ in an open subset $B \subset \mathbb{R}^k$ and $\Lambda$ continuous and monotone increasing with $\Lambda(0) = 0$, and that the censoring distribution does not depend on $\beta$ or $\Lambda$ (i.e., censoring is uninformative).

Recall also the counting and at-risk processes $N(t) = 1\{V \le t\}d$ and $Y(t) = 1\{V \ge t\}$, and let
$$M(t) = N(t) - \int_0^t Y(s)e^{\beta'Z}\,d\Lambda(s).$$
For some $0 < \tau < \infty$ with $P\{C \ge \tau\} > 0$, let $H$ be the set of all $\Lambda$'s satisfying our criteria with $\Lambda(\tau) < \infty$. Now the set of models $\mathcal{P}$ is indexed by $\beta \in B$ and $\Lambda \in H$.

We let $P_{\beta,\Lambda}$ be the distribution of $(V, d, Z)$ corresponding to the given parameters. The likelihood for a single observation is thus proportional to
$$p_{\beta,\Lambda}(X) = \left[e^{\beta'Z}\lambda(V)\right]^d \exp\left[-e^{\beta'Z}\Lambda(V)\right],$$
where $\lambda$ is the derivative of $\Lambda$.

Now let $L_2(\Lambda)$ be the set of measurable functions $b : [0, \tau] \mapsto \mathbb{R}$ with
$$\int_0^\tau b^2(s)\,d\Lambda(s) < \infty.$$
Note that if $b \in L_2(\Lambda)$ is bounded, then
$$\Lambda_t(s) = \int_0^s e^{t b(u)}\,d\Lambda(u) \in H$$
for all $t$.

The score function is thus
$$\frac{\partial}{\partial t}\log p_{\beta+ta,\Lambda_t}(X)\Big|_{t=0} = \int_0^\tau\left[a'Z + b(s)\right]dM(s),$$
for any $a \in \mathbb{R}^k$.
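As a brief sketch of this differentiation (using the density above, and assuming $V \le \tau$ so the integrals over $[0,\tau]$ capture the observation): since $\lambda_t(s) = e^{tb(s)}\lambda(s)$,
$$\log p_{\beta+ta,\Lambda_t}(X) = d\left[(\beta+ta)'Z + tb(V) + \log\lambda(V)\right] - e^{(\beta+ta)'Z}\int_0^V e^{tb(s)}\,d\Lambda(s),$$
and differentiating at $t = 0$ gives
$$d\left[a'Z + b(V)\right] - e^{\beta'Z}\int_0^V\left[a'Z + b(s)\right]d\Lambda(s) = \int_0^\tau\left[a'Z + b(s)\right]dM(s).$$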

The score function for $\beta$ is therefore $\dot\ell_{\beta,\Lambda}(X) = Z M(\tau)$, while the score function for $\Lambda$ is $\int_0^\tau b(s)\,dM(s)$. In fact, one can show that there exist one-dimensional submodels $\Lambda_t$ such that $\log p_{\beta+ta,\Lambda_t}$ is differentiable with score
$$a'\dot\ell_{\beta,\Lambda}(X) + \int_0^\tau b(s)\,dM(s),$$
for any $b \in L_2(\Lambda)$ and $a \in \mathbb{R}^k$.

The operator $B_{\beta,\Lambda} : L_2(\Lambda) \mapsto L_2^0(P_{\beta,\Lambda})$, given by
$$B_{\beta,\Lambda}(b) = \int_0^\tau b(s)\,dM(s),$$
is the score operator which generates the tangent set for $\Lambda$,
$$\dot{\mathcal{P}}^{(\Lambda)}_{P_{\beta,\Lambda}} \equiv \{B_{\beta,\Lambda}b : b \in L_2(\Lambda)\}.$$
It can be shown that this tangent space spans all square-integrable score functions for $\Lambda$ generated by parametric submodels.

The adjoint operator can be shown to be $B_{\beta,\Lambda}^* : L_2(P_{\beta,\Lambda}) \mapsto L_2(\Lambda)$, where
$$B_{\beta,\Lambda}^*(g)(t) = \frac{P_{\beta,\Lambda}[g(X)\,dM(t)]}{d\Lambda(t)}.$$
The information operator $B_{\beta,\Lambda}^* B_{\beta,\Lambda} : L_2(\Lambda) \mapsto L_2(\Lambda)$

is thus
$$B_{\beta,\Lambda}^* B_{\beta,\Lambda}(b)(t) = \frac{P_{\beta,\Lambda}\left[\int_0^\tau b(s)\,dM(s)\,dM(t)\right]}{d\Lambda(t)} = P_{\beta,\Lambda}\left[Y(t)e^{\beta'Z}\right]b(t),$$
using martingale methods.
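The martingale step can be sketched as follows: $M$ has predictable variation $\langle M\rangle(t) = \int_0^t Y(s)e^{\beta'Z}\,d\Lambda(s)$, so, heuristically,
$$P_{\beta,\Lambda}\left[\int_0^\tau b(s)\,dM(s)\,dM(t)\right] = P_{\beta,\Lambda}\left[b(t)\,d\langle M\rangle(t)\right] = P_{\beta,\Lambda}\left[b(t)\,Y(t)e^{\beta'Z}\right]d\Lambda(t),$$
and dividing by $d\Lambda(t)$ gives the displayed expression.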

Since
$$B_{\beta,\Lambda}^*\left(\dot\ell_{\beta,\Lambda}\right)(t) = P_{\beta,\Lambda}\left[ZY(t)e^{\beta'Z}\right],$$
we have that the efficient score for $\beta$ is
$$\tilde\ell_{\beta,\Lambda} = \left(I - B_{\beta,\Lambda}\left[B_{\beta,\Lambda}^* B_{\beta,\Lambda}\right]^{-1} B_{\beta,\Lambda}^*\right)\dot\ell_{\beta,\Lambda} \qquad (2)$$
$$= \int_0^\tau\left(Z - \frac{P_{\beta,\Lambda}\left[ZY(t)e^{\beta'Z}\right]}{P_{\beta,\Lambda}\left[Y(t)e^{\beta'Z}\right]}\right)dM(t).$$

When $\tilde I_{\beta,\Lambda} \equiv P_{\beta,\Lambda}\left[\tilde\ell_{\beta,\Lambda}\tilde\ell_{\beta,\Lambda}'\right]$ is positive definite, the resulting efficient influence function is $\tilde\psi_{\beta,\Lambda} \equiv \tilde I_{\beta,\Lambda}^{-1}\tilde\ell_{\beta,\Lambda}$.

Since the estimator $\hat\beta_n$ obtained from maximizing the partial likelihood
$$L_n(\beta) = \prod_{i=1}^n\left(\frac{e^{\beta'Z_i}}{\sum_{j=1}^n 1\{V_j \ge V_i\}e^{\beta'Z_j}}\right)^{d_i} \qquad (3)$$
can be shown to satisfy
$$\sqrt{n}(\hat\beta_n - \beta) = \sqrt{n}\,\mathbb{P}_n\tilde\psi_{\beta,\Lambda} + o_P(1),$$
this estimator is efficient.
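As a numerical illustration (a minimal sketch, not part of the lecture; the simulated data and helper names are assumptions), $\hat\beta_n$ can be computed by applying Newton's method to the logarithmic derivative of (3), i.e., the partial-likelihood score:

```python
import numpy as np

rng = np.random.default_rng(1)

# Simulated right-censored data from a Cox model with a scalar covariate.
n, beta_true = 500, 0.8
Z = rng.normal(size=n)
T = rng.exponential(scale=np.exp(-beta_true * Z))   # baseline hazard = 1
C = rng.exponential(scale=2.0, size=n)
V, d = np.minimum(T, C), (T <= C).astype(float)

def partial_lik_score_and_info(beta):
    """Score and observed information of the log partial likelihood."""
    score, info = 0.0, 0.0
    for i in np.flatnonzero(d):                 # sum over observed failures
        at_risk = V >= V[i]                     # risk set at time V_i
        w = np.exp(beta * Z[at_risk])
        zbar = np.sum(w * Z[at_risk]) / np.sum(w)
        z2bar = np.sum(w * Z[at_risk] ** 2) / np.sum(w)
        score += Z[i] - zbar
        info += z2bar - zbar ** 2
    return score, info

beta_hat = 0.0
for _ in range(25):                             # Newton-Raphson iterations
    s, info = partial_lik_score_and_info(beta_hat)
    beta_hat += s / info
print("beta_hat:", beta_hat)
```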

Returning to our discussion of score and information operators, these operators are also useful for generating scores for the entire model, not just for the nuisance component. For semiparametric models having score functions of the form $a'\dot\ell_{\theta,\eta} + B_{\theta,\eta}b$, for $a \in \mathbb{R}^k$ and $b \in \mathcal{H}_\eta$, we can define a new operator $A_{\theta,\eta} : \{(a, b) : a \in \mathbb{R}^k, b \in \mathrm{lin}\,\mathcal{H}_\eta\} \mapsto L_2^0(P_{\theta,\eta})$, where
$$A_{\theta,\eta}(a, b) = a'\dot\ell_{\theta,\eta} + B_{\theta,\eta}b.$$

More generally, we can define the score operator $A_\eta : \mathrm{lin}\,\mathcal{H}_\eta \mapsto L_2(P_\eta)$ for the model $\{P_\eta : \eta \in H\}$, where $H$ indexes the entire model and may include both parametric and nonparametric components, and where $\mathrm{lin}\,\mathcal{H}_\eta$ indexes directions in $H$. Let the parameter of interest be $\psi(P_\eta) = \chi(\eta) \in \mathbb{R}^k$.

We assume there exists a linear operator $\dot\chi : \mathrm{lin}\,\mathcal{H}_\eta \mapsto \mathbb{R}^k$ such that, for every $b \in \mathrm{lin}\,\mathcal{H}_\eta$, there exists a one-dimensional submodel $\{P_{\eta_t} : \eta_t \in H, t \in N_\epsilon\}$ satisfying
$$\int\left[\frac{(dP_{\eta_t})^{1/2} - (dP_\eta)^{1/2}}{t} - \frac{1}{2}A_\eta b\,(dP_\eta)^{1/2}\right]^2 \to 0, \quad \text{as } t \to 0,$$
and
$$\frac{\partial\chi(\eta_t)}{\partial t}\Big|_{t=0} = \dot\chi(b).$$

We require $\mathcal{H}_\eta$ to be a Hilbert space with inner product $\langle\cdot,\cdot\rangle_\eta$. The efficient influence function is the solution $\tilde\psi_{P_\eta} \in R(A_\eta) \subset L_2^0(P_\eta)$ of
$$A_\eta^*\tilde\psi_{P_\eta} = \tilde\chi_\eta, \qquad (4)$$
where $\tilde\chi_\eta \in \mathcal{H}_\eta$ satisfies $\langle\tilde\chi_\eta, b\rangle_\eta = \dot\chi(b)$ for all $b \in \mathcal{H}_\eta$.

When $A_\eta^* A_\eta$ is invertible, the solution to (4) can be written
$$\tilde\psi_{P_\eta} = A_\eta\left(A_\eta^* A_\eta\right)^{-1}\tilde\chi_\eta.$$
In Chapter 4, we utilized this approach to derive efficient estimators for all parameters of the Cox model.

Returning to the semiparametric model setting, where $\mathcal{P} = \{P_{\theta,\eta} : \theta \in \Theta, \eta \in H\}$, $\Theta$ is an open subset of $\mathbb{R}^k$, and $H$ is a set, the efficient score can be used to derive estimating equations for computing efficient estimators of $\theta$.

Recall that an estimating equation is a data-dependent function $\Psi_n : \Theta \mapsto \mathbb{R}^k$ for which an approximate zero yields a Z-estimator for $\theta$. When $\Psi_n(\theta)$ has the form $\mathbb{P}_n\hat\ell_{\theta,n}$, where $\hat\ell_{\theta,n}(X \mid X_1, \ldots, X_n)$ is a function of the generic observation $X$ which depends on the value of $\theta$ and the sample data $X_1, \ldots, X_n$, we have the following estimating equation result:

THEOREM 1. Suppose that the model $\{P_{\theta,\eta} : \theta \in \Theta\}$, where $\Theta \subset \mathbb{R}^k$, is differentiable in quadratic mean with respect to $\theta$ at $(\theta, \eta)$, and let the efficient information matrix $\tilde I_{\theta,\eta}$ be nonsingular. Let $\hat\theta_n$ satisfy $\sqrt{n}\,\mathbb{P}_n\hat\ell_{\hat\theta_n,n} = o_P(1)$ and be consistent for $\theta$. Also assume that $\hat\ell_{\hat\theta_n,n}$ is contained in a $P_{\theta,\eta}$-Donsker class with probability tending to 1 and that the following conditions hold:
$$P_{\hat\theta_n,\eta}\hat\ell_{\hat\theta_n,n} = o_P\left(n^{-1/2} + \|\hat\theta_n - \theta\|\right), \qquad (5)$$
$$P_{\theta,\eta}\left\|\hat\ell_{\hat\theta_n,n} - \tilde\ell_{\theta,\eta}\right\|^2 \overset{P}{\to} 0, \qquad P_{\hat\theta_n,\eta}\left\|\hat\ell_{\hat\theta_n,n}\right\|^2 = O_P(1). \qquad (6)$$

Then $\hat\theta_n$ is asymptotically efficient at $(\theta, \eta)$.

Returning to the Cox model example, the profile likelihood score is the partial likelihood score $\Psi_n(\beta) = \mathbb{P}_n\hat\ell_{\beta,n}$, where
$$\hat\ell_{\beta,n}(X = (V, d, Z) \mid X_1, \ldots, X_n) = \int_0^\tau\left(Z - \frac{\mathbb{P}_n\left[ZY(t)e^{\beta'Z}\right]}{\mathbb{P}_n\left[Y(t)e^{\beta'Z}\right]}\right)dM_\beta(t),$$
and $M_\beta(t) = N(t) - \int_0^t Y(u)e^{\beta'Z}\,d\Lambda(u)$. We showed in Chapter 4 that all the conditions of Theorem 1 are satisfied for $\hat\beta_n$, the root of $\Psi_n(\beta) = 0$, and thus the partial likelihood yields efficient estimation of $\beta$.

Maximum Likelihood Estimation

The most common approach to efficient estimation is based on modifications of maximum likelihood estimation that lead to efficient estimates. These modifications, which we will call likelihoods, are generally not really likelihoods (products of densities) because of complications resulting from the presence of an infinite-dimensional nuisance parameter. Recall the setting of estimation of an unknown real density $f(x)$ from an i.i.d. sample $X_1, \ldots, X_n$.

The likelihood is $\prod_{i=1}^n f(X_i)$, and the maximizer over all densities has arbitrarily high peaks at the observations, with zero mass at the other values, and is therefore not a density. This problem can be fixed by using an empirical likelihood $\prod_{i=1}^n p_i$, where $p_1, \ldots, p_n$ are the masses assigned to the observations indexed by $i = 1, \ldots, n$ and are constrained to satisfy $\sum_{i=1}^n p_i = 1$. This leads to the empirical distribution function estimator, which is known to be fully efficient.

Consider again the Cox model for right-censored data explored in the previous section. The density for a single observation $X = (V, d, Z)$ is proportional to
$$\left[e^{\beta'Z}\lambda(V)\right]^d \exp\left[-e^{\beta'Z}\Lambda(V)\right].$$
Maximizing the likelihood based on this density will result in the same problem raised in the previous paragraph.

A likelihood that works is the following, which assigns mass only at observed failure times:
$$L_n(\beta, \Lambda) = \prod_{i=1}^n\left[e^{\beta'Z_i}\Delta\Lambda(V_i)\right]^{d_i}\exp\left[-e^{\beta'Z_i}\Lambda(V_i)\right], \qquad (7)$$
where $\Delta\Lambda(t)$ is the jump size of $\Lambda$ at $t$. For each value of $\beta$, one can maximize or profile $L_n(\beta, \Lambda)$ over the nuisance parameter $\Lambda$ to obtain the profile likelihood $pL_n(\beta)$, which for the Cox model is $\exp\left[-\sum_{i=1}^n d_i\right]$ times the partial likelihood (3).
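As a brief sketch of the profiling step (filling in the intermediate calculation): for fixed $\beta$, (7) depends on $\Lambda$ only through its jumps at the observed failure times, and maximizing over those jumps gives
$$\widehat{\Delta\Lambda}_\beta(V_i) = \frac{d_i}{\sum_{j=1}^n 1\{V_j \ge V_i\}e^{\beta'Z_j}}.$$
Substituting these jumps back into (7) yields
$$pL_n(\beta) = \exp\left[-\sum_{i=1}^n d_i\right]\prod_{i=1}^n\left(\frac{e^{\beta'Z_i}}{\sum_{j=1}^n 1\{V_j \ge V_i\}e^{\beta'Z_j}}\right)^{d_i}.$$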

Let $\hat\beta$ be the maximizer of $pL_n(\beta)$. Then the maximizer $\hat\Lambda$ of $L_n(\hat\beta, \Lambda)$ is the Breslow estimator
$$\hat\Lambda(t) = \int_0^t\frac{\mathbb{P}_n\,dN(s)}{\mathbb{P}_n\left[Y(s)e^{\hat\beta'Z}\right]}.$$
We showed in Chapter 4 that $\hat\beta$ and $\hat\Lambda$ are both efficient.
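Continuing the numerical sketch begun after (3) (reusing the simulated arrays V, d, Z and the value beta_hat from that block; an illustration only, not part of the lecture), the Breslow estimator is the step function whose increment at each observed failure time is the profiled jump size above, evaluated at $\hat\beta$:

```python
import numpy as np

def breslow(beta_hat, V, d, Z):
    """Breslow estimator of the baseline cumulative hazard, evaluated at the
    observed failure times (uses V, d, Z from the earlier simulation sketch)."""
    fail_times = np.sort(V[d == 1])
    # Increment at failure time t: 1 / sum over the risk set of exp(beta_hat * Z_j)
    increments = np.array(
        [1.0 / np.sum(np.exp(beta_hat * Z[V >= t])) for t in fail_times]
    )
    return fail_times, np.cumsum(increments)

times, Lambda_hat = breslow(beta_hat, V, d, Z)
# In the simulation above the true baseline cumulative hazard is Lambda(t) = t,
# so Lambda_hat should roughly track the 45-degree line.
print(times[:5], Lambda_hat[:5])
```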

Another useful class of likelihood variants is penalized likelihoods. Penalized likelihoods add a penalty term (or terms) in order to maintain an appropriate level of smoothness for one or more of the nuisance parameters. This method was used in the partly linear logistic regression model described in Chapter 1. Other methods of generating likelihood variants that work are possible.

The basic idea is that using the likelihood principle to guide estimation in semiparametric models often leads to efficient estimators of the model components which are $\sqrt{n}$ consistent. Because of the richness of this approach to estimation, one needs to verify for each new situation that a likelihood-inspired estimator is consistent, efficient, and well behaved for moderate sample sizes. Verifying efficiency usually entails demonstrating that the estimator satisfies the efficient score equation described in the previous section.

Approximately Least-Favorable Submodels

One of the key challenges in this setting is to ensure that the efficient score is a derivative of the chosen log-likelihood along some submodel. A helpful tool here is the approximately least-favorable submodel.

The basic idea is to find a function $\eta_t(\theta, \eta)$ such that $\eta_0(\theta, \eta) = \eta$ for all $\theta \in \Theta$ and $\eta \in H$, where $\eta_t(\theta, \eta) \in H$ for all $t$ small enough, and such that $\kappa_{\theta_0,\eta_0} = \tilde\ell_{\theta_0,\eta_0}$, where
$$\kappa_{\theta,\eta}(x) = \frac{\partial}{\partial t}\,\ell_{\theta+t,\eta_t(\theta,\eta)}(x)\Big|_{t=0},$$
$\ell_{\theta,\eta}(x)$ is the log-likelihood for the observed value $x$ at the parameters $(\theta, \eta)$, and $(\theta_0, \eta_0)$ are the true parameter values. Note that we require $\kappa_{\theta,\eta} = \tilde\ell_{\theta,\eta}$ only when $(\theta, \eta) = (\theta_0, \eta_0)$.

If $(\hat\theta_n, \hat\eta_n)$ is the maximum likelihood estimator, i.e., the maximizer of $\mathbb{P}_n\ell_{\theta,\eta}$, then the function $t \mapsto \mathbb{P}_n\ell_{\hat\theta_n+t,\,\eta_t(\hat\theta_n,\hat\eta_n)}$ is maximal at $t = 0$, and thus $(\hat\theta_n, \hat\eta_n)$ is a zero of $\mathbb{P}_n\kappa_{\hat\theta_n,\hat\eta_n}$. Now if $\hat\theta_n$ and $\hat\ell_{\theta,n} = \kappa_{\theta,\hat\eta_n}$ satisfy the conditions of Theorem 1 at $(\theta, \eta) = (\theta_0, \eta_0)$, then the maximum likelihood estimator $\hat\theta_n$ is asymptotically efficient at $(\theta_0, \eta_0)$.

Consider now the maximum likelihood estimator $\hat\theta_n$ based on maximizing the joint empirical log-likelihood $L_n(\theta, \eta) \equiv n\mathbb{P}_n\ell(\theta, \eta)$, where $\ell(\cdot,\cdot)$ is the log-likelihood for a single observation. For now, $\eta$ will be regarded as a nuisance parameter, and thus we can restrict our attention to the profile log-likelihood $\theta \mapsto pL_n(\theta) \equiv \sup_{\eta} L_n(\theta, \eta)$. Note that $L_n$ is a sum and not an average, since we multiplied the empirical measure by $n$.

While the solution of an efficient score equation need not be a maximum likelihood estimator, it is also possible that the maximum likelihood estimator in a semiparametric model may not be expressible as the zero of an efficient score equation. This possibility occurs because the efficient score is a projection, and, as such, there is no assurance that this projection is the derivative of the log-likelihood along a submodel. This is the main issue that motivates approximately least-favorable submodels.

An approximately least-favorable submodel approximates the true least-favorable submodel to a level of accuracy sufficient to facilitate the analysis of semiparametric estimators. We will now describe this process in generality; the specifics will depend on the situation. As mentioned previously, we first need a general map from a neighborhood of $\theta$ into the parameter set for $\eta$, which we will denote by $t \mapsto \eta_t(\theta, \eta)$, for $t \in \mathbb{R}^k$.

We require that
$$\eta_t(\theta, \eta) \in \hat H, \text{ for all } \|t - \theta\| \text{ small enough, and } \eta_\theta(\theta, \eta) = \eta, \text{ for any } (\theta, \eta) \in \Theta \times \hat H, \qquad (8)$$
where $\hat H$ is a suitable enlargement of $H$ that includes all estimators that satisfy the constraints of the estimation process. Now define the map $\ell(t, \theta, \eta) \equiv \ell(t, \eta_t(\theta, \eta))$.

We will require several things of $\ell(\cdot,\cdot,\cdot)$, at various points in our discussion, that will result in further restrictions on $\eta_t(\theta, \eta)$. Define
$$\dot\ell(t, \theta, \eta) \equiv \frac{\partial\,\ell(t, \theta, \eta)}{\partial t},$$
and let $\hat\ell_{\theta,n} \equiv \dot\ell(\theta, \theta, \hat\eta_n)$. Clearly, $\mathbb{P}_n\hat\ell_{\hat\theta_n,n} = 0$, and thus $\hat\theta_n$ is efficient for $\theta_0$, provided $\hat\ell_{\theta,n}$ satisfies the conditions of Theorem 1.

The reason it is necessary to check this even for maximum likelihood estimators is that, as mentioned previously, $\hat\eta_n$ often lies on the boundary of (or even slightly outside) the parameter space. Recall again the Cox model setting for right-censored data.

In this case, $\eta$ is the baseline integrated hazard function, which is usually assumed to be continuous. However, $\hat\eta_n$ is the Breslow estimator, which is a right-continuous step function with jumps at the observed failure times and is therefore not in the parameter space. Thus direct differentiation of the log-likelihood at the maximum likelihood estimator will not yield an efficient score equation.

We will also require that
$$\dot\ell(\theta_0, \theta_0, \eta_0) = \tilde\ell_{\theta_0,\eta_0}. \qquad (9)$$
Note that we are only insisting that this identity holds at the true parameter values. This approximately least-favorable submodel structure is very useful for developing methods of inference for $\theta$.