Introduction to Empirical Processes and Semiparametric Inference

Lecture 25: Semiparametric Models

Michael R. Kosorok, Ph.D.
Professor and Chair of Biostatistics
Professor of Statistics and Operations Research
University of North Carolina-Chapel Hill
Score Functions and Estimating Equations

A parameter $\psi(P)$ of particular interest is the parametric component $\theta$ of a semiparametric model $\{P_{\theta,\eta}:\theta\in\Theta,\eta\in H\}$, where $\Theta$ is an open subset of $\mathbb{R}^k$ and $H$ is an arbitrary set that may be infinite dimensional. Tangent sets can be used to develop an efficient estimator for $\psi(P_{\theta,\eta})=\theta$ through the formation of an efficient score function.
In this setting, we consider submodels of the form $\{P_{\theta+ta,\eta_t},\ t\in N_\epsilon\}$ that are differentiable in quadratic mean with score function
$$\left.\frac{\partial}{\partial t}\log dP_{\theta+ta,\eta_t}\right|_{t=0} = a'\dot\ell_{\theta,\eta} + g,$$
where $a\in\mathbb{R}^k$, $\dot\ell_{\theta,\eta}:\mathcal{X}\mapsto\mathbb{R}^k$ is the ordinary score for $\theta$ when $\eta$ is fixed, and where $g:\mathcal{X}\mapsto\mathbb{R}$ is an element of a tangent set $\dot{\mathcal{P}}^{(\eta)}_{P_{\theta,\eta}}$ for the submodel $\mathcal{P}_\theta = \{P_{\theta,\eta}:\eta\in H\}$ (holding $\theta$ fixed).

This tangent set is the tangent set for $\eta$ and should be rich enough to reflect all parametric submodels of $\mathcal{P}_\theta$. The tangent set for the full model is
$$\dot{\mathcal{P}}_{P_{\theta,\eta}} = \left\{a'\dot\ell_{\theta,\eta} + g : a\in\mathbb{R}^k,\ g\in\dot{\mathcal{P}}^{(\eta)}_{P_{\theta,\eta}}\right\}.$$
While $\psi(P_{\theta+ta,\eta_t}) = \theta + ta$ is clearly differentiable with respect to $t$, we also require that there exists a function $\tilde\psi_{\theta,\eta}:\mathcal{X}\mapsto\mathbb{R}^k$ such that
$$\left.\frac{\partial\psi(P_{\theta+ta,\eta_t})}{\partial t}\right|_{t=0} = a = P\left[\tilde\psi_{\theta,\eta}\left(\dot\ell_{\theta,\eta}'a + g\right)\right], \qquad (1)$$
for all $a\in\mathbb{R}^k$ and all $g\in\dot{\mathcal{P}}^{(\eta)}_{P_{\theta,\eta}}$. After setting $a = 0$, we see that such a function must be uncorrelated with all of the elements of $\dot{\mathcal{P}}^{(\eta)}_{P_{\theta,\eta}}$.
Define $\Pi_{\theta,\eta}$ to be the orthogonal projection onto the closed linear span of $\dot{\mathcal{P}}^{(\eta)}_{P_{\theta,\eta}}$ in $L_2^0(P_{\theta,\eta})$. The efficient score function for $\theta$ is
$$\tilde\ell_{\theta,\eta} = \dot\ell_{\theta,\eta} - \Pi_{\theta,\eta}\dot\ell_{\theta,\eta},$$
while the efficient information matrix for $\theta$ is
$$\tilde I_{\theta,\eta} = P\left[\tilde\ell_{\theta,\eta}\tilde\ell_{\theta,\eta}'\right].$$

Provided that $\tilde I_{\theta,\eta}$ is nonsingular, the function $\tilde\psi_{\theta,\eta} = \tilde I_{\theta,\eta}^{-1}\tilde\ell_{\theta,\eta}$ satisfies (1) for all $a\in\mathbb{R}^k$ and all $g\in\dot{\mathcal{P}}^{(\eta)}_{P_{\theta,\eta}}$. Thus the functional (parameter) $\psi(P_{\theta,\eta}) = \theta$ is differentiable at $P_{\theta,\eta}$ relative to the tangent set $\dot{\mathcal{P}}_{P_{\theta,\eta}}$, with efficient influence function $\tilde\psi_{\theta,\eta}$.

Recall that the search for an efficient estimator of $\theta$ is over if one can find an estimator $T_n$ satisfying
$$\sqrt{n}\,(T_n - \theta) = \sqrt{n}\,P_n\tilde\psi_{\theta,\eta} + o_P(1).$$
Note that
$$\tilde I_{\theta,\eta} = I_{\theta,\eta} - P\left[\left(\Pi_{\theta,\eta}\dot\ell_{\theta,\eta}\right)\left(\Pi_{\theta,\eta}\dot\ell_{\theta,\eta}\right)'\right],$$
where $I_{\theta,\eta} = P\left[\dot\ell_{\theta,\eta}\dot\ell_{\theta,\eta}'\right]$.
An intuitive justification for the form of the efficient score is that some information for estimating $\theta$ is lost due to a lack of knowledge about $\eta$. The amount subtracted from the ordinary score to form the efficient score, $\Pi_{\theta,\eta}\dot\ell_{\theta,\eta}$, is the minimum possible amount for regular estimators when $\eta$ is unknown.
Consider again the semiparametric regression model
$$Y = \beta'Z + e,$$
where $E[e\mid Z] = 0$ and $E[e^2\mid Z]\le K < \infty$ almost surely, and where we observe $(Y,Z)$, with the joint density $\eta$ of $(e,Z)$ satisfying
$$\int_{\mathbb{R}} e\,\eta(e,Z)\,de = 0$$
almost surely.

Assume $\eta$ has partial derivative with respect to the first argument, $\eta_1$, satisfying $\eta_1/\eta\in L_2(P_{\beta,\eta})$, and hence $\eta_1/\eta\in L_2^0(P_{\beta,\eta})$, where $P_{\beta,\eta}$ is the joint distribution of $(Y,Z)$. The Euclidean parameter of interest in this semiparametric model is $\theta = \beta$.
The score for $\beta$, assuming $\eta$ is known, is $\dot\ell_{\beta,\eta} = -Z(\eta_1/\eta)(Y-\beta'Z, Z)$, where we use the shorthand $(f/g)(u,v) = f(u,v)/g(u,v)$ for ratios of functions. One can show that the tangent set $\dot{\mathcal{P}}^{(\eta)}_{P_{\beta,\eta}}$ for $\eta$ is the subset of $L_2^0(P_{\beta,\eta})$ consisting of all functions $g(e,Z)\in L_2^0(P_{\beta,\eta})$ which satisfy
$$E[e\,g(e,Z)\mid Z] = \frac{\int_{\mathbb{R}} e\,g(e,Z)\,\eta(e,Z)\,de}{\int_{\mathbb{R}}\eta(e,Z)\,de} = 0,$$
almost surely.

One can also show that this set is the orthocomplement in $L_2^0(P_{\beta,\eta})$ of all functions of the form $ef(Z)$, where $f$ satisfies $P_{\beta,\eta}f^2(Z) < \infty$. This means that $\tilde\ell_{\beta,\eta} = (I - \Pi_{\beta,\eta})\dot\ell_{\beta,\eta}$ is the projection in $L_2^0(P_{\beta,\eta})$ of $-Z(\eta_1/\eta)(e,Z)$ onto $\{ef(Z): P_{\beta,\eta}f^2(Z) < \infty\}$, where $I$ is the identity.

Thus
$$\tilde\ell_{\beta,\eta}(Y,Z) = -\frac{Ze\int_{\mathbb{R}}\eta_1(e,Z)e\,de}{P_{\beta,\eta}[e^2\mid Z]} = -\frac{Ze(-1)}{P_{\beta,\eta}[e^2\mid Z]} = \frac{Z(Y-\beta'Z)}{P_{\beta,\eta}[e^2\mid Z]},$$
where the second-to-last step follows from the identity
$$\int_{\mathbb{R}}\eta_1(e,Z)e\,de = \left.\frac{\partial}{\partial t}\int_{\mathbb{R}}\eta(te,Z)\,de\right|_{t=1},$$
and the last step follows since $e = Y - \beta'Z$.
When the function $z\mapsto P_{\beta,\eta}[e^2\mid Z=z]$ is non-constant in $z$, $\tilde\ell_{\beta,\eta}(Y,Z)$ is not proportional to $Z(Y-\beta'Z)$, and the usual least-squares estimator $\hat\beta$ will not be efficient. This is discussed in greater detail in Chapter 4.
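This efficiency gap can be illustrated numerically. The following is a simulation sketch of my own construction (not from the text), under the assumption that the conditional variance $E[e^2\mid Z]$ is known: weighting each observation by $1/E[e^2\mid Z]$, as the efficient score $Z(Y-\beta'Z)/P_{\beta,\eta}[e^2\mid Z]$ suggests, reduces the Monte Carlo variance relative to ordinary least squares.

```python
import numpy as np

# Hypothetical heteroscedastic model: Y = beta*Z + e with E[e|Z] = 0 and
# sd(e|Z) = Z, so E[e^2|Z] = Z^2 is non-constant in Z.
rng = np.random.default_rng(0)
beta, n, reps = 2.0, 400, 2000
b_ols = np.empty(reps)
b_wls = np.empty(reps)
for r in range(reps):
    Z = rng.uniform(0.5, 2.0, n)
    e = rng.normal(0.0, Z)                    # Var(e|Z) = Z^2
    Y = beta * Z + e
    b_ols[r] = (Z @ Y) / (Z @ Z)              # ordinary least squares
    w = 1.0 / Z**2                            # weights 1/E[e^2|Z] (assumed known)
    b_wls[r] = ((w * Z) @ Y) / ((w * Z) @ Z)  # inverse-variance weighting

print(b_ols.var(), b_wls.var())               # compare Monte Carlo variances
```

Both estimators are unbiased here, but the weighted one has the smaller variance, consistent with the least-squares estimator being inefficient when $E[e^2\mid Z]$ varies with $Z$.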
Two very useful tools for computing efficient scores are score and information operators. Returning to the generic semiparametric model $\{P_{\theta,\eta}:\theta\in\Theta,\eta\in H\}$, it is sometimes easier to represent an element $g$ of the tangent set for $\eta$, $\dot{\mathcal{P}}^{(\eta)}_{P_{\theta,\eta}}$, as $B_{\theta,\eta}b$, where $b$ is an element of another set $H_\eta$ and $B_{\theta,\eta}$ is an operator satisfying
$$\dot{\mathcal{P}}^{(\eta)}_{P_{\theta,\eta}} = \{B_{\theta,\eta}b : b\in H_\eta\}.$$

Such an operator is a score operator. The information operator is $B^*_{\theta,\eta}B_{\theta,\eta}: H_\eta\mapsto\mathrm{lin}\,H_\eta$, where $B^*_{\theta,\eta}$ is the adjoint of $B_{\theta,\eta}$. If $B^*_{\theta,\eta}B_{\theta,\eta}$ has an inverse, then it can be shown that the efficient score for $\theta$ has the form
$$\tilde\ell_{\theta,\eta} = \left(I - B_{\theta,\eta}\left[B^*_{\theta,\eta}B_{\theta,\eta}\right]^{-1}B^*_{\theta,\eta}\right)\dot\ell_{\theta,\eta}.$$
To illustrate these methods, consider again the Cox model for right-censored data. Recall in this setting that we observe a sample of $n$ realizations of $X = (V, d, Z)$, where $V = T\wedge C$, $d = 1\{V = T\}$, $Z\in\mathbb{R}^k$ is a covariate vector, $T$ is a failure time, and $C$ is a censoring time.

We also assume that $T$ and $C$ are independent given $Z$, that $T$ given $Z$ has integrated hazard function $e^{\beta'Z}\Lambda(t)$, for $\beta$ in an open subset $B\subset\mathbb{R}^k$ and $\Lambda$ continuous and monotone increasing with $\Lambda(0) = 0$, and that the censoring distribution does not depend on $\beta$ or $\Lambda$ (i.e., censoring is uninformative).
Recall also the counting and at-risk processes $N(t) = 1\{V\le t\}d$ and $Y(t) = 1\{V\ge t\}$, and let
$$M(t) = N(t) - \int_0^t Y(s)e^{\beta'Z}\,d\Lambda(s).$$
For some $0 < \tau < \infty$ with $P\{C\ge\tau\} > 0$, let $H$ be the set of all $\Lambda$'s satisfying our criteria with $\Lambda(\tau) < \infty$. Now the set of models $\mathcal{P}$ is indexed by $\beta\in B$ and $\Lambda\in H$.
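These processes are easy to tabulate from data. A small sketch with illustrative numbers and a hypothetical baseline $\Lambda(t) = t$ (none of the values below are from the text):

```python
import numpy as np

# Each subject contributes V = observed time, d = event indicator, Z = covariate.
V = np.array([1.2, 0.7, 2.5, 1.9])
d = np.array([1.0, 0.0, 1.0, 1.0])
Z = np.array([0.5, -1.0, 0.0, 1.5])
beta = 0.3                                  # hypothetical parameter value

def N(t):
    """Counting process N(t) = 1{V <= t} d, one value per subject."""
    return (V <= t) * d

def Y(t):
    """At-risk process Y(t) = 1{V >= t}, one value per subject."""
    return (V >= t).astype(float)

def M(t):
    """M(t) = N(t) - int_0^t Y(s) e^{beta Z} dLambda(s), with Lambda(s) = s.
    Since Y(s) = 1 exactly when s <= V, the integral is e^{beta Z} min(V, t)."""
    return N(t) - np.exp(beta * Z) * np.minimum(V, t)
```

The martingale residuals $M(t)$ start at zero and, under the true parameters, have mean zero at every $t$.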
We let $P_{\beta,\Lambda}$ be the distribution of $(V,d,Z)$ corresponding to the given parameters. The likelihood for a single observation is thus proportional to
$$p_{\beta,\Lambda}(X) = \left[e^{\beta'Z}\lambda(V)\right]^d\exp\left[-e^{\beta'Z}\Lambda(V)\right],$$
where $\lambda$ is the derivative of $\Lambda$.
Now let $L_2(\Lambda)$ be the set of measurable functions $b:[0,\tau]\mapsto\mathbb{R}$ with $\int_0^\tau b^2(s)\,d\Lambda(s) < \infty$. Note that if $b\in L_2(\Lambda)$ is bounded, then $\Lambda_t(s) = \int_0^s e^{tb(u)}\,d\Lambda(u)\in H$ for all $t$.

The score function is thus
$$\left.\frac{\partial}{\partial t}\log p_{\beta+ta,\Lambda_t}(X)\right|_{t=0} = \int_0^\tau\left[a'Z + b(s)\right]dM(s),$$
for any $a\in\mathbb{R}^k$.
The score function for $\beta$ is therefore $\dot\ell_{\beta,\Lambda}(X) = ZM(\tau)$, while the score function for $\Lambda$ is $\int_0^\tau b(s)\,dM(s)$. In fact, one can show that there exist one-dimensional submodels $\Lambda_t$ such that $\log p_{\beta+ta,\Lambda_t}$ is differentiable with score
$$a'\dot\ell_{\beta,\Lambda}(X) + \int_0^\tau b(s)\,dM(s),$$
for any $b\in L_2(\Lambda)$ and $a\in\mathbb{R}^k$.
The operator $B_{\beta,\Lambda}: L_2(\Lambda)\mapsto L_2^0(P_{\beta,\Lambda})$, given by
$$B_{\beta,\Lambda}(b) = \int_0^\tau b(s)\,dM(s),$$
is the score operator which generates the tangent set for $\Lambda$, $\dot{\mathcal{P}}^{(\Lambda)}_{P_{\beta,\Lambda}} \equiv \{B_{\beta,\Lambda}b : b\in L_2(\Lambda)\}$. It can be shown that this tangent space spans all square-integrable score functions for $\Lambda$ generated by parametric submodels.
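That $B_{\beta,\Lambda}b$ is a mean-zero score can be checked by Monte Carlo. A sketch under assumed specifics of my own choosing (scalar $\beta$, baseline $\Lambda(t) = t$ so that $T\mid Z$ is exponential with rate $e^{\beta Z}$, uniform censoring, and direction $b(s) = s$):

```python
import numpy as np

# Monte Carlo sketch (assumed setup, not from the text). The score-operator
# output B(b) = int_0^tau b(s) dM(s) should average to zero under the true
# parameters, since M is a mean-zero martingale.
rng = np.random.default_rng(1)
n, beta, tau = 200_000, 0.5, 2.0
Z = rng.normal(size=n)
T = rng.exponential(1.0 / np.exp(beta * Z))   # hazard e^{beta Z}, Lambda(t) = t
C = rng.uniform(0.0, 3.0, size=n)             # independent censoring
V = np.minimum(T, C)
d = (T <= C).astype(float)

b = lambda s: s
# int_0^tau b dM = b(V) d 1{V <= tau} - e^{beta Z} int_0^{min(V,tau)} s ds
score = b(V) * d * (V <= tau) - np.exp(beta * Z) * np.minimum(V, tau) ** 2 / 2
print(score.mean())                           # close to zero
```

Evaluating the same statistic at a wrong value of $\beta$ would generally shift the mean away from zero, which is what makes such scores usable as estimating functions.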
The adjoint operator can be shown to be $B^*_{\beta,\Lambda}: L_2(P_{\beta,\Lambda})\mapsto L_2(\Lambda)$, where
$$B^*_{\beta,\Lambda}(g)(t) = \frac{P_{\beta,\Lambda}\left[g(X)\,dM(t)\right]}{d\Lambda(t)}.$$
The information operator $B^*_{\beta,\Lambda}B_{\beta,\Lambda}: L_2(\Lambda)\mapsto L_2(\Lambda)$ is thus
$$B^*_{\beta,\Lambda}B_{\beta,\Lambda}(b)(t) = \frac{P_{\beta,\Lambda}\left[\int_0^\tau b(s)\,dM(s)\,dM(t)\right]}{d\Lambda(t)} = P_{\beta,\Lambda}\left[Y(t)e^{\beta'Z}\right]b(t),$$
using martingale methods.

Since $B^*_{\beta,\Lambda}\left(\dot\ell_{\beta,\Lambda}\right)(t) = P_{\beta,\Lambda}\left[ZY(t)e^{\beta'Z}\right]$, we have that the efficient score for $\beta$ is
$$\tilde\ell_{\beta,\Lambda} = \left(I - B_{\beta,\Lambda}\left[B^*_{\beta,\Lambda}B_{\beta,\Lambda}\right]^{-1}B^*_{\beta,\Lambda}\right)\dot\ell_{\beta,\Lambda} \qquad (2)$$
$$= \int_0^\tau\left(Z - \frac{P_{\beta,\Lambda}\left[ZY(t)e^{\beta'Z}\right]}{P_{\beta,\Lambda}\left[Y(t)e^{\beta'Z}\right]}\right)dM(t).$$
When $\tilde I_{\beta,\Lambda} \equiv P_{\beta,\Lambda}\left[\tilde\ell_{\beta,\Lambda}\tilde\ell_{\beta,\Lambda}'\right]$ is positive definite, the resulting efficient influence function is $\tilde\psi_{\beta,\Lambda} \equiv \tilde I^{-1}_{\beta,\Lambda}\tilde\ell_{\beta,\Lambda}$.

Since the estimator $\hat\beta_n$ obtained from maximizing the partial likelihood
$$L_n(\beta) = \prod_{i=1}^n\left(\frac{e^{\beta'Z_i}}{\sum_{j=1}^n 1\{V_j\ge V_i\}e^{\beta'Z_j}}\right)^{d_i} \qquad (3)$$
can be shown to satisfy
$$\sqrt{n}\,(\hat\beta_n - \beta) = \sqrt{n}\,P_n\tilde\psi_{\beta,\Lambda} + o_P(1),$$
this estimator is efficient.
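For a scalar covariate, the partial likelihood (3) can be maximized directly with Newton's method. A self-contained numpy sketch on simulated data (the data-generating choices are my own, not from the text):

```python
import numpy as np

# Simulate from a Cox model with Lambda(t) = t (hypothetical setup):
# T | Z is exponential with rate e^{beta_true * Z}; censoring is uniform.
rng = np.random.default_rng(2)
n, beta_true = 1000, 0.7
Z = rng.normal(size=n)
T = rng.exponential(1.0 / np.exp(beta_true * Z))
C = rng.uniform(0.0, 3.0, size=n)
V, d = np.minimum(T, C), T <= C

def score_and_info(beta):
    """Partial-likelihood score U(beta) and information I(beta), scalar beta."""
    U, I = 0.0, 0.0
    w = np.exp(beta * Z)
    for i in np.flatnonzero(d):
        risk = V >= V[i]                    # risk set at failure time V[i]
        s0 = w[risk].sum()
        s1 = (w[risk] * Z[risk]).sum()
        s2 = (w[risk] * Z[risk] ** 2).sum()
        U += Z[i] - s1 / s0                 # Z_i minus weighted risk-set mean
        I += s2 / s0 - (s1 / s0) ** 2       # weighted risk-set variance of Z
    return U, I

beta_hat = 0.0
for _ in range(20):                         # Newton-Raphson on U(beta) = 0
    U, I = score_and_info(beta_hat)
    beta_hat += U / I
```

The score here is a sum over failures of $Z_i$ minus the weighted risk-set average of $Z$, which is the empirical analogue of the efficient score displayed in (2).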
Returning to our discussion of score and information operators, these operators are also useful for generating scores for the entire model, not just for the nuisance component. With semiparametric models having score functions of the form $a'\dot\ell_{\theta,\eta} + B_{\theta,\eta}b$, for $a\in\mathbb{R}^k$ and $b\in H_\eta$, we can define a new operator $A_{\theta,\eta}: \{(a,b): a\in\mathbb{R}^k, b\in\mathrm{lin}\,H_\eta\}\mapsto L_2^0(P_{\theta,\eta})$, where
$$A_{\theta,\eta}(a,b) = a'\dot\ell_{\theta,\eta} + B_{\theta,\eta}b.$$
More generally, we can define the score operator $A_\eta: \mathrm{lin}\,H_\eta\mapsto L_2(P_\eta)$ for the model $\{P_\eta: \eta\in H\}$, where $H$ indexes the entire model and may include both parametric and nonparametric components, and where $\mathrm{lin}\,H_\eta$ indexes directions in $H$. Let the parameter of interest be $\psi(P_\eta) = \chi(\eta)\in\mathbb{R}^k$.
We assume there exists a linear operator $\dot\chi_\eta: \mathrm{lin}\,H_\eta\mapsto\mathbb{R}^k$ such that, for every $b\in\mathrm{lin}\,H_\eta$, there exists a one-dimensional submodel $\{P_{\eta_t}: \eta_t\in H, t\in N_\epsilon\}$ satisfying
$$\int\left[\frac{(dP_{\eta_t})^{1/2} - (dP_\eta)^{1/2}}{t} - \frac{1}{2}A_\eta b\,(dP_\eta)^{1/2}\right]^2 \to 0, \text{ as } t\to 0, \text{ and } \left.\frac{\partial\chi(\eta_t)}{\partial t}\right|_{t=0} = \dot\chi_\eta(b).$$

We require $H_\eta$ to be a Hilbert space with inner product $\langle\cdot,\cdot\rangle_\eta$. The efficient influence function is the solution $\tilde\psi_{P_\eta}\in\overline{R(A_\eta)}\subset L_2^0(P_\eta)$ of
$$A^*_\eta\tilde\psi_{P_\eta} = \tilde\chi_\eta, \qquad (4)$$
where $\tilde\chi_\eta\in\overline{H}_\eta$ satisfies $\langle\tilde\chi_\eta, b\rangle_\eta = \dot\chi_\eta(b)$ for all $b\in H_\eta$.

When $A^*_\eta A_\eta$ is invertible, the solution to (4) can be written
$$\tilde\psi_{P_\eta} = A_\eta\left(A^*_\eta A_\eta\right)^{-1}\tilde\chi_\eta.$$
In Chapter 4, we utilized this approach to derive efficient estimators for all parameters of the Cox model.
Returning to the semiparametric model setting, where $\mathcal{P} = \{P_{\theta,\eta}:\theta\in\Theta,\eta\in H\}$, $\Theta$ is an open subset of $\mathbb{R}^k$, and $H$ is a set, the efficient score can be used to derive estimating equations for computing efficient estimators of $\theta$.
Recall that an estimating equation is a data-dependent function $\Psi_n: \Theta\mapsto\mathbb{R}^k$ for which an approximate zero yields a Z-estimator for $\theta$. When $\Psi_n(\theta)$ has the form $P_n\hat\ell_{\theta,n}$, where $\hat\ell_{\theta,n}(X\mid X_1,\ldots,X_n)$ is a function of the generic observation $X$ which depends on the value of $\theta$ and the sample data $X_1,\ldots,X_n$, we have the following estimating equation result:
THEOREM 1. Suppose that the model $\{P_{\theta,\eta}:\theta\in\Theta\}$, where $\Theta\subset\mathbb{R}^k$, is differentiable in quadratic mean with respect to $\theta$ at $(\theta,\eta)$, and let the efficient information matrix $\tilde I_{\theta,\eta}$ be nonsingular. Let $\hat\theta_n$ satisfy $\sqrt{n}\,P_n\hat\ell_{\hat\theta_n,n} = o_P(1)$ and be consistent for $\theta$. Also assume that $\hat\ell_{\hat\theta_n,n}$ is contained in a $P_{\theta,\eta}$-Donsker class with probability tending to 1 and that the following conditions hold:
$$P_{\hat\theta_n,\eta}\,\hat\ell_{\hat\theta_n,n} = o_P\left(n^{-1/2} + \|\hat\theta_n - \theta\|\right), \qquad (5)$$
$$P_{\theta,\eta}\left\|\hat\ell_{\hat\theta_n,n} - \tilde\ell_{\theta,\eta}\right\|^2 \overset{P}{\to} 0, \qquad P_{\hat\theta_n,\eta}\left\|\hat\ell_{\hat\theta_n,n}\right\|^2 = O_P(1). \qquad (6)$$

Then $\hat\theta_n$ is asymptotically efficient at $(\theta,\eta)$.

Returning to the Cox model example, the profile likelihood score is the partial-likelihood score $\Psi_n(\beta) = P_n\hat\ell_{\beta,n}$, where
$$\hat\ell_{\beta,n}(X = (V,d,Z)\mid X_1,\ldots,X_n) = \int_0^\tau\left(Z - \frac{P_n\left[ZY(t)e^{\beta'Z}\right]}{P_n\left[Y(t)e^{\beta'Z}\right]}\right)dM_\beta(t),$$
and $M_\beta(t) = N(t) - \int_0^t Y(u)e^{\beta'Z}\,d\Lambda(u)$. We showed in Chapter 4 that all the conditions of Theorem 1 are satisfied for the root $\hat\beta_n$ of $\Psi_n(\beta) = 0$, and thus the partial likelihood yields efficient estimation of $\beta$.
Maximum Likelihood Estimation

The most common approach to efficient estimation is based on modifications of maximum likelihood estimation that lead to efficient estimates. These modifications, which we will call likelihoods, are generally not really likelihoods (products of densities) because of complications resulting from the presence of an infinite-dimensional nuisance parameter. Recall the setting of estimation of an unknown real density $f(x)$ from an i.i.d. sample $X_1,\ldots,X_n$.
The likelihood is $\prod_{i=1}^n f(X_i)$, and its maximizer over all densities has arbitrarily high peaks at the observations, with zero at the other values, and is therefore not a density. This problem can be fixed by using an empirical likelihood $\prod_{i=1}^n p_i$, where $p_1,\ldots,p_n$ are the masses assigned to the observations indexed by $i = 1,\ldots,n$ and are constrained to satisfy $\sum_{i=1}^n p_i = 1$. This leads to the empirical distribution function estimator, which is known to be fully efficient.
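A quick numeric check of this maximization (a sketch of my own, not from the text): subject to $\sum_i p_i = 1$, the log empirical likelihood $\sum_i \log p_i$ is maximized by the uniform masses $p_i = 1/n$, which is exactly the empirical distribution.

```python
import numpy as np

# By Jensen's inequality, sum_i log p_i over the probability simplex is
# maximized at the uniform masses p_i = 1/n. Compare uniform masses against
# randomly drawn probability vectors:
rng = np.random.default_rng(3)
n = 10
best = n * np.log(1.0 / n)                 # log-likelihood of uniform masses

def loglik(p):
    return np.log(p).sum()

max_random = max(loglik(rng.dirichlet(np.ones(n))) for _ in range(1000))
print(best >= max_random)                  # True
```

No random probability vector beats the uniform masses, matching the claim that the empirical distribution maximizes the empirical likelihood.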
Consider again the Cox model for right-censored data explored in the previous section. The density for a single observation $X = (V,d,Z)$ is proportional to
$$\left[e^{\beta'Z}\lambda(V)\right]^d\exp\left[-e^{\beta'Z}\Lambda(V)\right].$$
Maximizing the likelihood based on this density will result in the same problem raised in the previous paragraph.

A likelihood that works is the following, which assigns mass only at observed failure times:
$$L_n(\beta,\Lambda) = \prod_{i=1}^n\left[e^{\beta'Z_i}\,\Delta\Lambda(V_i)\right]^{d_i}\exp\left[-e^{\beta'Z_i}\Lambda(V_i)\right], \qquad (7)$$
where $\Delta\Lambda(t)$ is the jump size of $\Lambda$ at $t$. For each value of $\beta$, one can maximize or profile $L_n(\beta,\Lambda)$ over the nuisance parameter $\Lambda$ to obtain the profile likelihood $pL_n(\beta)$, which for the Cox model is $\exp\left[-\sum_{i=1}^n d_i\right]$ times the partial likelihood (3).

Let $\hat\beta$ be the maximizer of $pL_n(\beta)$. Then the maximizer $\hat\Lambda$ of $L_n(\hat\beta,\Lambda)$ is the Breslow estimator
$$\hat\Lambda(t) = \int_0^t\frac{P_n\,dN(s)}{P_n\left[Y(s)e^{\hat\beta'Z}\right]}.$$
We showed in Chapter 4 that $\hat\beta$ and $\hat\Lambda$ are both efficient.
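The Breslow estimator is straightforward to compute. A numpy sketch with illustrative data (the values and the plug-in $\hat\beta$ are hypothetical): each observed failure time $V_i$ contributes a jump of size $1/\sum_j 1\{V_j\ge V_i\}e^{\hat\beta'Z_j}$.

```python
import numpy as np

# Illustrative right-censored data and a hypothetical plug-in beta_hat.
V = np.array([1.2, 0.7, 2.5, 1.9, 0.4])
d = np.array([1,   0,   1,   1,   1  ])
Z = np.array([0.5, -1.0, 0.0, 1.5, -0.5])
beta_hat = 0.3

def breslow(t):
    """Breslow estimator: sum over observed failures V_i <= t of
    1 / sum_j 1{V_j >= V_i} exp(beta_hat * Z_j)."""
    w = np.exp(beta_hat * Z)
    jumps = [1.0 / w[V >= V[i]].sum() for i in np.flatnonzero(d) if V[i] <= t]
    return sum(jumps)
```

Note that the result is a right-continuous step function that jumps only at observed failure times and is constant in between.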
Another useful class of likelihood variants is penalized likelihoods. Penalized likelihoods add a penalty term (or terms) in order to maintain an appropriate level of smoothness for one or more of the nuisance parameters. This method was used in the partly linear logistic regression model described in Chapter 1. Other methods of generating likelihood variants that work are possible.
The basic idea is that using the likelihood principle to guide estimation of semiparametric models often leads to efficient estimators of the model components which are $\sqrt{n}$-consistent. Because of the richness of this approach to estimation, one needs to verify for each new situation that a likelihood-inspired estimator is consistent, efficient, and well-behaved for moderate sample sizes. Verifying efficiency usually entails demonstrating that the estimator satisfies the efficient score equation described in the previous section.
Approximately Least-Favorable Submodels

One of the key challenges in this setting is to ensure that the efficient score is a derivative of the chosen log-likelihood along some submodel. Approximately least-favorable submodels are helpful in this setting.
The basic idea is to find a function $\eta_t(\theta,\eta)$ such that $\eta_0(\theta,\eta) = \eta$ for all $\theta\in\Theta$ and $\eta\in H$, where $\eta_t(\theta,\eta)\in H$ for all $t$ small enough, and such that $\kappa_{\theta_0,\eta_0} = \tilde\ell_{\theta_0,\eta_0}$, where
$$\kappa_{\theta,\eta}(x) = \left.\frac{\partial}{\partial t}\,\ell_{\theta+t,\eta_t(\theta,\eta)}(x)\right|_{t=0},$$
$\ell_{\theta,\eta}(x)$ is the log-likelihood for the observed value $x$ at the parameters $(\theta,\eta)$, and where $(\theta_0,\eta_0)$ are the true parameter values. Note that we require $\kappa_{\theta,\eta} = \tilde\ell_{\theta,\eta}$ only when $(\theta,\eta) = (\theta_0,\eta_0)$.

If $(\hat\theta_n,\hat\eta_n)$ is the maximum likelihood estimator, i.e., the maximizer of $P_n\ell_{\theta,\eta}$, then the function $t\mapsto P_n\ell_{\hat\theta_n+t,\eta_t(\hat\theta_n,\hat\eta_n)}$ is maximal at $t = 0$, and thus $(\hat\theta_n,\hat\eta_n)$ is a zero of $(\theta,\eta)\mapsto P_n\kappa_{\theta,\eta}$. Now if $\hat\theta_n$ and $\hat\ell_{\theta,n} = \kappa_{\theta,\hat\eta_n}$ satisfy the conditions of Theorem 1 at $(\theta,\eta) = (\theta_0,\eta_0)$, then the maximum likelihood estimator $\hat\theta_n$ is asymptotically efficient at $(\theta_0,\eta_0)$.

Consider now the maximum likelihood estimator $\hat\theta_n$ based on maximizing the joint empirical log-likelihood $L_n(\theta,\eta) \equiv nP_n\ell(\theta,\eta)$, where $\ell(\cdot,\cdot)$ is the log-likelihood for a single observation. For now, $\eta$ will be regarded as a nuisance parameter, and thus we can restrict our attention to the profile log-likelihood $\theta\mapsto pL_n(\theta) \equiv \sup_\eta L_n(\theta,\eta)$. Note that $L_n$ is a sum and not an average, since we multiplied the empirical measure by $n$.
While the solution of an efficient score equation need not be a maximum likelihood estimator, it is also possible that the maximum likelihood estimator in a semiparametric model may not be expressible as the zero of an efficient score equation. This possibility occurs because the efficient score is a projection, and, as such, there is no assurance that this projection is the derivative of the log-likelihood along a submodel. This is the main issue that motivates approximately least-favorable submodels.
An approximately least-favorable submodel approximates the true least-favorable submodel to a useful level of accuracy, one that facilitates analysis of semiparametric estimators. We will now describe this process in generality; the specifics will depend on the situation. As mentioned previously, we first need a general map from a neighborhood of $\theta$ into the parameter set for $\eta$, which we will denote by $t\mapsto\eta_t(\theta,\eta)$, for $t\in\mathbb{R}^k$.
We require that
$$\eta_t(\theta,\eta)\in\hat H \text{ for all } \|t-\theta\| \text{ small enough, and } \eta_\theta(\theta,\eta) = \eta, \qquad (8)$$
for any $(\theta,\eta)\in\Theta\times\hat H$, where $\hat H$ is a suitable enlargement of $H$ that includes all estimators that satisfy the constraints of the estimation process. Now define the map $\ell(t,\theta,\eta) \equiv \ell(t,\eta_t(\theta,\eta))$.

We will require several things of $\ell(\cdot,\cdot,\cdot)$, at various points in our discussion, that will result in further restrictions on $\eta_t(\theta,\eta)$. Define
$$\dot\ell(t,\theta,\eta) \equiv \frac{\partial}{\partial t}\,\ell(t,\theta,\eta),$$
and let $\hat\ell_{\theta,n} \equiv \dot\ell(\theta,\theta,\hat\eta_n)$. Clearly, $P_n\hat\ell_{\hat\theta_n,n} = 0$, and thus $\hat\theta_n$ is efficient for $\theta_0$, provided $\hat\ell_{\theta,n}$ satisfies the conditions of Theorem 1.
The reason it is necessary to check this even for maximum likelihood estimators is that, as mentioned previously, $\hat\eta_n$ is often on the boundary of (or even a little outside of) the parameter space. Recall again the Cox model setting for right-censored data.
In this case, $\eta$ is the baseline integrated hazard function, which is usually assumed to be continuous. However, $\hat\eta_n$ is the Breslow estimator, which is a right-continuous step function with jumps at observed failure times and is therefore not in the parameter space. Thus direct differentiation of the log-likelihood at the maximum likelihood estimator will not yield an efficient score equation.

We will also require that
$$\dot\ell(\theta_0,\theta_0,\eta_0) = \tilde\ell_{\theta_0,\eta_0}. \qquad (9)$$
Note that we are only insisting that this identity holds at the true parameter values. This approximately least-favorable submodel structure is very useful for developing methods of inference for $\theta$.