Bounded, Efficient, and Doubly Robust Estimation with Inverse Weighting

Biometrika (2008), 94, 2, pp. © 2008 Biometrika Trust. Printed in Great Britain. Advance Access publication on 31 July 2008.

Bounded, Efficient, and Doubly Robust Estimation with Inverse Weighting

BY Z. TAN
Department of Statistics, Rutgers University, Piscataway, New Jersey 08854, U.S.A. ztan@stat.rutgers.edu

SUMMARY

Consider the problem of estimating the mean of an outcome in the presence of missing data, or estimating population average treatment effects in causal inference. A doubly robust estimator remains consistent if an outcome regression model or a propensity score model is correctly specified. We build on the nonparametric likelihood approach of Tan and propose new doubly robust estimators. These estimators have desirable properties in efficiency if the propensity score model is correctly specified, and in boundedness even if the inverse probability weights are highly variable. We compare new and existing estimators in a simulation study and find that the robustified likelihood estimators yield overall the smallest mean squared errors.

Some key words: Causal inference; Double robustness; Inverse weighting; Missing data; Nonparametric likelihood; Propensity score.

1. INTRODUCTION

Consider the problem of estimating the mean of an outcome in the presence of missing data under ignorability (Rubin, 1976). A related problem is to estimate population average treatment effects under no unmeasured confounding in causal inference (Neyman, 1923; Rubin, 1974). Such problems can be handled in two different ways. One approach is to model the mean of the outcome given covariates, called the outcome regression function, and derive an estimator based on the fitted values for observed and missing outcomes. The other approach is to model the probability of non-missingness given the covariates, called the propensity score (Rosenbaum & Rubin, 1983), and derive an estimator through inverse probability weighting of observed outcomes.
Inverse-probability-weighted estimators are central to the semiparametric theory of estimation with missing data (e.g., Tsiatis, 2006; van der Laan & Robins, 2003). The two approaches rely on different modelling assumptions and one does not necessarily dominate the other (Tan, 2007). A doubly robust approach makes use of both the outcome regression model and the propensity score model and derives an estimator that remains consistent if either of the two models is correctly specified. A prototypical doubly robust estimator is the augmented inverse-probability-weighted estimator of Robins et al. (1994). Recently, a number of alternative doubly robust estimators have been proposed. See Kang & Schafer (2007) and the related discussions. All existing doubly robust estimators are locally efficient: they attain the semiparametric variance bound, and hence are asymptotically equivalent to each other, if both the propensity score model and the outcome regression model are correctly specified. Therefore, it is important to compare doubly robust estimators in their statistical properties if only one of the models is correctly specified or if both models are misspecified.

We review various doubly robust estimators and highlight statistical criteria underlying their construction. Some estimators are intrinsically efficient: if the propensity score model is correctly specified, then each of them is asymptotically efficient among a class of augmented inverse-probability-weighted estimators that use the same fitted outcome regression function (Tan, 2006, 2007). Some estimators are improved-locally efficient: if the propensity score model is correctly specified, then they are asymptotically at least as efficient as the augmented inverse-probability-weighted estimator that uses the true propensity score and an optimally fitted outcome regression function (Rubin & van der Laan, 2008; Tan, 2008). Some estimators are population-bounded or sample-bounded: they lie within the range of all possible values or that of observed values of the outcome (Robins et al., 2007). The properties of boundedness rule out estimates outside the population or sample range even when the inverse probability weights are highly variable.

We propose a robustification of the likelihood estimator of Tan (2006), named the calibrated likelihood estimator, by calibrating the coefficients in a linear, extended propensity score model. The estimator is computationally convenient, involving two steps of maximizing concave functions. Moreover, the estimator is locally and intrinsically efficient and sample-bounded, and is further improved-locally efficient if the outcome regression function is suitably estimated. No existing doubly robust estimators achieve these four properties simultaneously. We further derive a robustification of the likelihood estimator of Tan (2006), named the augmented likelihood estimator, by incorporating an augmentation term. This estimator satisfies only a weaker form of boundedness than population and sample boundedness.
We compare new and existing estimators in a simulation study and find that the calibrated and augmented likelihood estimators yield overall the smallest mean squared errors.

2. MISSING DATA PROBLEMS

2·1. Setup

Let X be a vector of covariates and Y be an outcome. The variables X are always observed, but Y may be missing. Let R be the non-missing indicator such that R = 1 or 0 if Y is observed or missing, respectively. Throughout, assume that the missing-data mechanism is ignorable, that is, R and Y are conditionally independent given X (Rubin, 1976). Suppose that an independent and identically distributed sample of n units is available. The observed data consist of (X_i, R_i, R_i Y_i), i = 1, …, n. Our objective is to estimate the population mean µ = E(Y). Although this problem is simple to describe, it provides a basic setting for us to investigate methods for handling missing data.

2·2. Models

There are two different ways of postulating dimension-reduction assumptions to obtain consistent and asymptotically normal estimators of µ. One approach is to specify a parametric model for the outcome regression function m(X) = E(Y | X) in the form

E(Y | X) = m(X; α) = Ψ{α^T g(X)},   (1)

where Ψ is an inverse link function, g(X) is a vector of known functions including the constant 1, and α is a vector of unknown parameters. Let α̂_OLS be the maximum quasi-likelihood estimator of α or its variant. For concreteness, fix α̂_OLS as the estimator that solves the equation 0 = Ẽ[R{Y − m(X; α)}g(X)], where Ẽ denotes sample average. Let µ̂_OLS = Ẽ{m̂_OLS(X)}, where m̂_OLS(X) = m(X; α̂_OLS). Under regularity conditions, if model (1) is correctly specified, then µ̂_OLS is consistent and asymptotically normal, with asymptotic variance no greater than the semiparametric variance bound, provided that E(Y²) < ∞.
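To make the outcome-regression route concrete, the following minimal sketch fits model (1) on the complete cases and averages the fitted values over the full sample. It is not from the paper: the data-generating model, coefficients, and sample size are all hypothetical, with identity link Ψ.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 5000
# Hypothetical data: true m(x) = 1 + 2x, so mu = E(Y) = 1.
X = rng.normal(size=n)
Y = 1.0 + 2.0 * X + rng.normal(size=n)
# Non-missing indicator R, with P(R = 1 | X) depending on X only (ignorability).
R = rng.binomial(1, 1.0 / (1.0 + np.exp(-(0.5 + X))))

# Fit E(Y | X) = alpha' g(X) with g(x) = (1, x)' on complete cases (R = 1).
G = np.column_stack([np.ones(n), X])
alpha_ols = np.linalg.lstsq(G[R == 1], Y[R == 1], rcond=None)[0]

# mu_hat_OLS: average the fitted values over ALL units, observed or not.
mu_ols = (G @ alpha_ols).mean()
```

Under ignorability the complete-case fit of α is consistent, so mu_ols estimates µ even though Y is missing for some units.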

The other approach is to specify a parametric model for the propensity score π(X) = P(R = 1 | X) in the form

P(R = 1 | X) = π(X; γ) = Π{γ^T f(X)},   (2)

where Π is an inverse link function, f(X) is a vector of known functions, and γ is a vector of unknown parameters. Let γ̂_ML be the maximum likelihood estimator of γ and hence a solution to the equation 0 = Ẽ[{R − π(X; γ)}ϱ(X; γ)f(X)], where ϱ(X; γ) = Π′{γ^T f(X)}/[π(X; γ){1 − π(X; γ)}] and Π′ is the derivative of Π. Two non-augmented inverse-probability-weighted estimators are

µ̂_IPW = Ẽ{RY/π̂_ML(X)},   µ̂_IPW,ratio = Ẽ{RY/π̂_ML(X)} / Ẽ{R/π̂_ML(X)},

where π̂_ML(X) = π(X; γ̂_ML). Under regularity conditions, if model (2) is correctly specified, then µ̂_IPW and µ̂_IPW,ratio are consistent and asymptotically normal, with asymptotic variances no smaller than the semiparametric variance bound, provided that E{π^{−1}(X)} < ∞ and E{Y²π^{−1}(X)} < ∞. See Tan (2007) for a comparison between the two approaches.

2·3. Existing estimators

The estimator µ̂_OLS is based on model (1) only, and µ̂_IPW and µ̂_IPW,ratio are based on model (2) only. Alternatively, a range of estimators have been proposed by using both model (1) and model (2) to gain efficiency and robustness. Many such estimators can be cast in the form

µ̂(π̂, m̂) = Ẽ[RY/π̂(X) − {R/π̂(X) − 1}m̂(X)] = Ẽ[m̂(X) + R{Y − m̂(X)}/π̂(X)],

where π̂(X) and m̂(X) are fitted values of π(X) and m(X) respectively. See Kang & Schafer (2007), Robins et al. (2007), and Tan (2006, 2007, 2008) for related discussions. Consider the following estimators of µ, with the same choice π̂_ML(X) for π̂(X) but different choices for m̂(X). Robins et al. (1994) proposed the estimator µ̂_AIPW = µ̂(π̂_ML, m̂_OLS). Scharfstein et al.
(1999) suggested the estimator µ̂_OLS,ext = µ̂{π̂_ML, m̂_ext(π̂_ML)} = Ẽ{m̂_ext(X; π̂_ML)}, where m̂_ext(X; π̂) = m_ext{X; κ̂(π̂)} and κ̂(π̂) is a solution to 0 = Ẽ[R{Y − m_ext(X; κ)}{π̂^{−1}(X), g^T(X)}^T] for the extended outcome regression model E(Y | X) = m_ext(X; κ) = Ψ{κ_1 π̂^{−1}(X) + κ_2^T g(X)} with κ = (κ_1, κ_2^T)^T. Kang & Schafer (2007) considered the estimator µ̂_WLS = µ̂{π̂_ML, m̂_WLS(π̂_ML)} = Ẽ{m̂_WLS(X; π̂_ML)}, where m̂_WLS(X; π̂) = m{X; α̂_WLS(π̂)} and α̂_WLS(π̂) is a solution to 0 = Ẽ[Rπ̂^{−1}(X){Y − m(X; α)}g(X)] and hence differs from α̂_OLS in using weight π̂^{−1}(X). Rubin & van der Laan (2008) proposed two related estimators

µ̂_V = µ̂{π̂_ML, m̂_V(π̂_ML)},   µ̃_V = µ̂{π̂_ML, m̃_V(π̂_ML)},

where m̂_V(X; π̂) = m{X; α̂_V(π̂)} with α̂_V(π̂) = argmin_α Ẽ([RY/π̂(X) − {R/π̂(X) − 1}m(X; α)]²) for the first estimator, and m̃_V(X; π̂) = m{X; α̃_V(π̂)} with α̃_V(π̂) = argmin_α Ẽ[{R/π̂(X)}{R/π̂(X) − 1}{Y − m(X; α)}²] for the second estimator. The estimator α̃_V(π̂) is a weighted least-squares estimator using weight π̂^{−1}(X){π̂^{−1}(X) − 1}. Our notation makes explicit the dependency of m̂_ext(π̂), m̂_WLS(π̂), m̂_V(π̂), and m̃_V(π̂) on π̂.

The choice π̂_ML(X) for π̂(X) is derived under model (2), independently of model (1). A more elaborate choice can be derived under an extended propensity score model with extra linear

predictors depending on m̂(X). Consider the model

P(R = 1 | X) = π_ext(X; ν) = Π[ν_1^T υ̂(X)/{ϱ̂_ML(X)π̂_ML(X)} + ν_2^T f(X)],   (3)

where ν = (ν_1^T, ν_2^T)^T, υ̂(X) = {1, m̂(X)}^T, and ϱ̂_ML(X) = ϱ(X; γ̂_ML). Let ν̂(m̂) be the maximum likelihood estimator of ν and write π̂_ext(X; m̂) = π_ext{X; ν̂(m̂)}. Substitution of π̂_ext(m̂_OLS) for π̂_ML in µ̂_IPW yields the estimator of Rotnitzky & Robins (1995), µ̂_IPW,ext = µ̂{π̂_ext(m̂_OLS), 0}. For m̂ = m̂_OLS or m̂_WLS(π̂_ML), substitution of π̂_ext(m̂) for π̂_ML in µ̂(π̂_ML, m̂), but not for that within m̂, yields the estimators

µ̂_AIPW,ext = µ̂{π̂_ext(m̂_OLS), m̂_OLS},   µ̂_WLS,ext = µ̂[π̂_ext{m̂_WLS(π̂_ML)}, m̂_WLS(π̂_ML)],

proposed by Robins et al. in a 2008 technical report at Harvard University. In addition, they proposed µ̂_WLS,ext2 = µ̂(π̂_ext{m̂_WLS(π̂_ML)}, m̂_WLS[π̂_ext{m̂_WLS(π̂_ML)}]) through a further iteration from µ̂_WLS,ext.

The targeted maximum likelihood approach of van der Laan & Rubin (2006, Sections ) is closely related to the estimators µ̂_OLS,ext and µ̂_IPW,ext. With m̂_OLS and π̂_ML as initial fitted values, this approach leads to the estimators

µ̂_TML = µ̂{π̂_ML, m̂_TML(π̂_ML)} = Ẽ{m̂_TML(X; π̂_ML)},   µ̂_TIPW = µ̂{π̂_TML(m̂_OLS), 0},   µ̂_TAIPW = µ̂{π̂_TML(m̂_OLS), m̂_TML(π̂_ML)},

where m̂_TML(X; π̂) is obtained by fitting E(Y | X) = m_ext(X; κ) with κ_2 fixed at α̂_OLS, and π̂_TML(X; m̂) is obtained by fitting P(R = 1 | X) = π_ext(X; ν) with ν_2 fixed at γ̂_ML. The estimators µ̂_IPW,ext and µ̂_TIPW are similar to the two likelihood estimators of Tan (2006). The first estimator accommodates the variation of γ̂_ML whereas the second ignores that variation.

2·4. Comparison

Consider the following criteria for evaluating estimators of µ. Note that improved local efficiency implies local efficiency, and sample boundedness implies population boundedness.

(a) Double robustness: µ̂ remains consistent if either model (1) or model (2) is correctly specified.
(b) Local efficiency: µ̂ attains the semiparametric variance bound, i.e., it is asymptotically equivalent to the first order to Ẽ[RY/π(X) − {R/π(X) − 1}m(X)], if both model (1) and model (2) are correctly specified.

(c) Improved local efficiency: µ̂ is asymptotically at least as efficient as Ẽ[RY/π(X) − {R/π(X) − 1}m(X; α)] for arbitrary α if model (2) is correctly specified.

(d) Intrinsic efficiency: µ̂ attains the minimum asymptotic variance among the class of estimators Ẽ[RY/π̂_ML(X) − b_1^T{R/π̂_ML(X) − 1}υ̂(X)] for arbitrary b_1 if model (2) is correctly specified, where υ̂(X) = {1, m̂(X)}^T and m̂(X) is the fitted value of m(X) used in µ̂. Therefore, µ̂ is asymptotically at least as efficient as µ̂_IPW, µ̂_IPW,ratio, and µ̂(π̂_ML, m̂).

(e) Population boundedness: µ̂ lies within the range of all possible values of Y, even if model (1) or model (2) or both are misspecified.

(f) Sample boundedness: µ̂ lies within the range of {Y_i : R_i = 1, i = 1, …, n}, even if model (1) or model (2) or both are misspecified.

The upper half of Table 1 presents a comparison of various estimators in Section 2·3 in terms of the foregoing criteria. See Sections 3–4 for a discussion of the likelihood and regression estimators in the lower half of Table 1.
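The basic estimators reviewed in Sections 2·2–2·3 can be illustrated side by side in a small simulation. Everything below is hypothetical (models, coefficients, sample size), and the logistic propensity fit uses a hand-rolled Newton iteration rather than standard software:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 5000
X = rng.normal(size=n)
Y = 1.0 + 2.0 * X + rng.normal(size=n)               # true mu = E(Y) = 1
R = rng.binomial(1, 1.0 / (1.0 + np.exp(-(0.5 + X))))

# Maximum-likelihood fit of the logistic propensity model (2), f(x) = (1, x)'.
F = np.column_stack([np.ones(n), X])
gamma = np.zeros(2)
for _ in range(25):                                  # Newton-Raphson on the score
    p = 1.0 / (1.0 + np.exp(-F @ gamma))
    gamma += np.linalg.solve(F.T @ ((p * (1 - p))[:, None] * F), F.T @ (R - p))
pi_hat = 1.0 / (1.0 + np.exp(-F @ gamma))

# Outcome regression (1) fitted on complete cases, identity link.
m_hat = F @ np.linalg.lstsq(F[R == 1], Y[R == 1], rcond=None)[0]

mu_ipw = np.mean(R * Y / pi_hat)                            # mu_hat_IPW
mu_ipw_ratio = np.sum(R * Y / pi_hat) / np.sum(R / pi_hat)  # mu_hat_IPW,ratio
mu_aipw = np.mean(m_hat + R * (Y - m_hat) / pi_hat)         # mu_hat_AIPW
```

Here both working models are correctly specified, so all three estimators are consistent; mu_aipw, the augmented estimator, would remain consistent if either (but not both) of the two fits were replaced by a misspecified one.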

Table 1. Theoretical comparison of estimators

Upper half: µ̂_TAIPW, µ̂_TML, µ̂_AIPW,ext, µ̂_AIPW, µ̂_OLS,ext, µ̂_WLS, µ̂_V, µ̃_V, µ̂_IPW,ext, µ̂_WLS,ext, µ̂_WLS,ext2. Lower half: µ̂_LIK,OLS, µ̂_REG,OLS, µ̃_REG,OLS, µ̃_LIK2,OLS, µ̃_LIK2,WLS, µ̃_LIK2,V. The columns DR, LE, IE, ILE, PB, and SB correspond to criteria (a)–(f); the individual table entries are not recoverable in this transcription.

3. PROPOSED APPROACH

3·1. Summary

We extend the nonparametric likelihood approach of Tan (2006). The main contribution is to obtain an estimator of µ that is doubly robust, locally and intrinsically efficient, and sample-bounded simultaneously. Moreover, our approach is flexible enough to allow different choices, such as m̂_OLS, m̂_WLS(π̂_ML), and m̃_V(π̂_ML), for the fitted value m̂. If m̂ = m̃_V(π̂_ML), then the resulting estimator is further improved-locally efficient.

3·2. Non-doubly-robust likelihood estimator

We describe the likelihood estimator of Tan (2006) in the current setup of missing data. The nonparametric likelihood of (X_i, R_i, R_i Y_i), i = 1, …, n, is

L_1 · L_2 = [∏_{i=1}^n π(X_i; γ)^{R_i} {1 − π(X_i; γ)}^{1−R_i}] · ∏_{i: R_i=1} G_1({X_i, Y_i}) · ∏_{i: R_i=0} G_0({X_i}),

where G_1 is the joint distribution of (X, Y) and G_0 is the marginal distribution of X. Maximizing L_1 leads to the maximum likelihood estimator γ̂_ML. Recall that m̂(x) is a fitted value of m(x) based on model (1) and υ̂(x) = {1, m̂(x)}^T. Let ĥ = (ĥ_1^T, ĥ_2^T)^T, where

ĥ_1(x) = {1 − π̂_ML(x)}υ̂(x),   ĥ_2(x) = (∂π/∂γ)(x; γ̂_ML).

We choose to ignore the fact that G_0 and the marginal distribution of X under G_1 are identical, and retain only the constraints ∫ĥ(x) dG_1 = ∫ĥ(x) dG_0, i.e.,

∫{1 − π̂_ML(x)} dG_1 = ∫{1 − π̂_ML(x)} dG_0,
∫{1 − π̂_ML(x)}m̂(x) dG_1 = ∫{1 − π̂_ML(x)}m̂(x) dG_0,
∫(∂π/∂γ)(x; γ̂_ML) dG_1 = ∫(∂π/∂γ)(x; γ̂_ML) dG_0.

See Kong et al. (2003) for a related formulation. The first two constraints respectively ensure that the resulting estimator of µ is consistent under correctly specified model (2) and locally efficient, whereas the third constraint accounts for the variation of γ̂_ML such that the resulting estimator is intrinsically efficient. Furthermore, we require that G_1 be a probability measure supported on {(X_i, Y_i) : R_i = 1, i = 1, …, n}, and hence ∫dG_1 = 1, and that G_0 be a nonnegative measure (not necessarily a probability) supported on {X_i : R_i = 0, i = 1, …, n}. Maximizing L_2 subject to these constraints leads to the estimators

Ĝ_1({X_i, Y_i}) = n^{−1}ω^{−1}(X_i; λ̂) if R_i = 1,   Ĝ_0({X_i}) = n^{−1}{1 − ω(X_i; λ̂)}^{−1} if R_i = 0,

where ω(X; λ) = π̂_ML(X) + λ^T ĥ(X), λ̂ = argmax_λ ℓ(λ), and

ℓ(λ) = Ẽ[R log{ω(X; λ)} + (1 − R) log{1 − ω(X; λ)}].

The function ℓ(λ) is finite and concave on the set {λ : ω(X_i; λ) > 0 if R_i = 1 and ω(X_i; λ) < 1 if R_i = 0, i = 1, …, n}. Moreover, ℓ(λ) is strictly concave and bounded from above, and hence has a unique maximum, if and only if the set

{λ ≠ 0 : λ^T ĥ(X_i) ≥ 0 if R_i = 1 and λ^T ĥ(X_i) ≤ 0 if R_i = 0, i = 1, …, n} is empty.   (4)

See the Appendix for a proof. From our experience, λ̂ can be computed effectively by using a globally convergent optimization algorithm such as the package trust. Setting the gradient of ℓ(λ) to 0 shows that λ̂ is a solution to

0 = Ẽ( {R − ω(X; λ)}ĥ(X) / [ω(X; λ){1 − ω(X; λ)}] ).   (5)

By construction, λ̂ also satisfies

1 = ∫dĜ_1 = Ẽ{R/ω(X; λ̂)}.   (6)

The resulting estimator of µ is

µ̂_LIK = ∫y dĜ_1 = Ẽ{RY/ω(X; λ̂)}.

The estimator µ̂_LIK is structurally similar to µ̂_IPW,ext based on the extended model (3). The value λ̂ can be interpreted as the maximum likelihood estimator of λ under the linear, extended propensity score model P(R = 1 | X) = ω(X; λ). However, there are important differences between µ̂_LIK and µ̂_IPW,ext. First, ω(X_i; λ̂) may not lie between 0 and 1 for all i = 1, …, n.
It is only required that ω(X_i; λ̂) > 0 if R_i = 1 and ω(X_i; λ̂) < 1 if R_i = 0. Moreover, equation (6) automatically holds, whereas Ẽ{R/π̂_ext(X)} = 1 does not. By (6), the values ω(X_i; λ̂) with R_i = 1 are bounded from below by n^{−1}, and µ̂_LIK is sample-bounded. In contrast, π̂_ext(X_i) with R_i = 1 may be arbitrarily close to 0, and µ̂_IPW,ext is not sample-bounded.

Tan (2006, Theorem 4) obtained an asymptotic expansion of µ̂_LIK, assuming that model (2) is correctly specified. Here, we provide a general asymptotic expansion of µ̂_LIK, allowing for misspecification of model (1) and model (2). See Manski (1988) for related asymptotic theory in misspecified models. Under regularity conditions, λ̂ converges to a constant λ* in probability

with the expansion

λ̂ − λ* = B̂^{−1} Ẽ( {R − ω(X; λ*)}ĥ(X) / [ω(X; λ*){1 − ω(X; λ*)}] ) + o_p(n^{−1/2}),

where

B̂ = Ẽ( {R − ω(X; λ*)}² ĥ(X)ĥ^T(X) / [ω²(X; λ*){1 − ω(X; λ*)}²] ).

Moreover, a Taylor expansion of µ̂_LIK about λ* yields

µ̂_LIK = Ẽ{RY/ω(X; λ*)} − Ĉ^T B̂^{−1} Ẽ( {R − ω(X; λ*)}ĥ(X) / [ω(X; λ*){1 − ω(X; λ*)}] ) + o_p(n^{−1/2}),   (7)

where Ĉ = Ẽ[{RY/ω²(X; λ*)}ĥ(X)]. If model (2) is correctly specified, then λ* = 0 and hence the expansion reduces to µ̂_LIK = µ̂_REG + o_p(n^{−1/2}) with µ̂_REG = Ẽ(η̂) − β̂^T Ẽ(ξ̂), where η̂ = RY/π̂_ML(X),

ξ̂ = [{R/π̂_ML(X) − 1}υ̂^T(X), {R − π̂_ML(X)}ϱ̂_ML(X)f^T(X)]^T,

B̂ = Ẽ(ξ̂ξ̂^T), Ĉ = Ẽ(ξ̂η̂), and β̂ = B̂^{−1}Ĉ is the least-squares estimator in the linear regression of η̂ on ξ̂. The estimator µ̂_REG is locally and intrinsically efficient (Robins et al., 1995), but not doubly robust. See Section 4·5 for a further discussion.

3·3. Doubly robust likelihood estimator

The estimator µ̂_LIK is sample-bounded and locally and intrinsically efficient. If m̂ = m̂_V(π̂_ML) or m̃_V(π̂_ML), then µ̂_LIK is further improved-locally efficient because it is asymptotically at least as efficient as µ̂_V or µ̃_V, which is improved-locally efficient. However, µ̂_LIK is not doubly robust. It may be inconsistent if model (1) is correctly specified but model (2) is misspecified.

We propose a robustification of µ̂_LIK such that it satisfies double robustness in addition to sample boundedness and local and intrinsic efficiency. We first discuss a simple version of our proposal. Consider the system of estimating equations

0 = Ẽ[{R/ω(X; λ) − 1}υ̂(X)],   (8)
0 = Ẽ( {R − ω(X; λ)}ĥ_2(X) / [ω(X; λ){1 − ω(X; λ)}] ),   (9)

which are equivalent to (5) except that (R − ω)/{ω(1 − ω)} is replaced by (R/ω − 1)/(1 − π̂_ML) in the equations associated with ĥ_1 = (1 − π̂_ML)υ̂. Let λ̃ be a solution to (8)–(9) subject to the constraint that ω(X_i; λ) > 0 if R_i = 1 (i = 1, …, n) and let

µ̃_LIK = Ẽ{RY/ω(X; λ̃)}.

Note that υ̂(X) includes the constant 1 and hence Ẽ{R/ω(X; λ̃)} = 1 by (8).
Therefore, µ̃_LIK is sample-bounded in a similar manner as µ̂_LIK is. We derive asymptotic expansions for λ̃ and µ̃_LIK, allowing for misspecification of model (1) and model (2), in parallel to those for λ̂ and µ̂_LIK. Under regularity conditions, λ̃ converges to a constant λ̃* in probability with the expansion

λ̃ − λ̃* = B̃^{−1} Ẽ( {R/ω(X; λ̃*) − 1}υ̂(X) ; {R − ω(X; λ̃*)}ĥ_2(X) / [ω(X; λ̃*){1 − ω(X; λ̃*)}] ) + o_p(n^{−1/2}),

where the two blocks are stacked into one vector,

and B̃ is the 2 × 2 block matrix

B̃ = Ẽ( Rĥ_1(X)υ̂^T(X)/ω²(X; λ̃*),  {R − ω(X; λ̃*)}²ĥ_1(X)ĥ_2^T(X)/[ω²(X; λ̃*){1 − ω(X; λ̃*)}²] ;
       Rĥ_2(X)υ̂^T(X)/ω²(X; λ̃*),  {R − ω(X; λ̃*)}²ĥ_2(X)ĥ_2^T(X)/[ω²(X; λ̃*){1 − ω(X; λ̃*)}²] ).

Moreover, a Taylor expansion of µ̃_LIK about λ̃* yields

µ̃_LIK = Ẽ{RY/ω(X; λ̃*)} − Ĉ^T B̃^{−1} Ẽ( {R/ω(X; λ̃*) − 1}υ̂(X) ; {R − ω(X; λ̃*)}ĥ_2(X) / [ω(X; λ̃*){1 − ω(X; λ̃*)}] ) + o_p(n^{−1/2}).   (10)

If model (2) is correctly specified, then λ̃* = 0 and hence the expansion reduces to µ̃_LIK = µ̃_REG + o_p(n^{−1/2}) with µ̃_REG = Ẽ(η̂) − β̃^T Ẽ(ξ̂), where

ζ̂ = [Rυ̂^T(X)/π̂_ML(X), {R − π̂_ML(X)}ϱ̂_ML(X)f^T(X)]^T,

B̃ = Ẽ(ξ̂ζ̂^T), and β̃ = B̃^{−1}Ĉ. In this case, µ̂_REG and µ̃_REG are asymptotically equivalent to the first order and hence so are µ̂_LIK and µ̃_LIK. However, µ̃_REG is akin to the doubly robust regression estimator of Tan (2006). These regression estimators, unlike µ̂_REG, satisfy double robustness in addition to local and intrinsic efficiency.

The estimators µ̂_LIK and µ̃_LIK are sample-bounded and locally and intrinsically efficient. However, µ̃_LIK, unlike µ̂_LIK, is further doubly robust. This difference follows from the general asymptotic expansions (7) for µ̂_LIK and (10) for µ̃_LIK: the leading terms are structurally similar to, respectively, µ̂_REG, which is not doubly robust, and µ̃_REG, which is doubly robust. Alternatively, µ̃_LIK is doubly robust because

Ẽ{Rm̂(X)/ω(X; λ̃)} = Ẽ{m̂(X)}   (11)

by (8), and hence µ̃_LIK is identical to µ̂{ω(·; λ̃), m̂} in the typical form of doubly robust estimators. In contrast, Ẽ{Rm̂(X)/ω(X; λ̂)} = Ẽ{m̂(X)} does not necessarily hold for µ̂_LIK. We regard λ̃ as a calibration of the maximum likelihood estimator λ̂ in the linear, extended propensity score model P(R = 1 | X) = ω(X; λ) such that equation (11) holds.

So far, we seem to fulfil the objective of deriving an estimator that is doubly robust, locally and intrinsically efficient, and sample-bounded. However, there remain subtle issues about the existence and computation of λ̃.
First, it is difficult to characterize conditions under which there exists a solution to (8)–(9) subject to the constraint that ω(X_i; λ) > 0 if R_i = 1 (i = 1, …, n). Moreover, algorithms for solving nonlinear equations such as (8)–(9) may fail to locate a solution, much less all possible solutions, if any exist. It presents a further challenge to accommodate the constraint on the domain of λ. Finally, if there exists no solution or there exist multiple solutions, it remains difficult to redefine λ̃ or to select λ̃ among the multiple solutions. These difficulties are applicable not only to (8)–(9), but to nonlinear estimating equations in general. See Small et al. (2000) for a survey that mainly deals with multiple solutions.

We now discuss a more effective version of our proposal to address the foregoing issues. Recall that λ̂ is defined as a maximizer of ℓ(λ). Under condition (4), ℓ(λ) is strictly concave and bounded from above and hence λ̂ exists and is unique. Consider the following two-step estimator.

(a) Compute λ̂ = (λ̂_1^T, λ̂_2^T)^T, partitioned according to ĥ = (ĥ_1^T, ĥ_2^T)^T.

(b) Compute λ̃_step2 = (λ̃_1,step2^T, λ̂_2^T)^T, where λ̃_1,step2 = argmax_{λ_1} κ_1(λ_1) and

κ_1(λ_1) = Ẽ( R[log{ω(X; λ_1, λ̂_2)} − log{ω(X; λ̂)}]/{1 − π̂_ML(X)} − λ_1^T υ̂(X) ).

The function κ_1(λ_1) is finite and concave on the set {λ_1 : ω(X_i; λ_1, λ̂_2) > 0 if R_i = 1, i = 1, …, n}. Moreover, as shown in the Appendix, κ_1(λ_1) is strictly concave and bounded from above, and hence has a unique maximum, if and only if the set

{λ_1 ≠ 0 : λ_1^T υ̂(X_i) ≥ 0 if R_i = 1, i = 1, …, n, and Ẽ{λ_1^T υ̂(X)} ≤ 0} is empty.   (12)

Like λ̂ in step (a), λ̃_1,step2 in step (b) can be computed effectively by using a globally convergent optimization algorithm such as the package trust. Setting the gradient of κ_1(λ_1) to 0 shows that λ̃_1,step2 is a solution to

0 = Ẽ[{R/ω(X; λ_1, λ̂_2) − 1}υ̂(X)],   (13)

which is equivalent to (8) with λ_2 evaluated at λ̂_2. In fact, we consider (13) as estimating equations and obtain κ_1(λ_1) as an objective function by integrating the right side of (13). This construction is feasible because the matrix of the partial derivatives of the right side of (13) is symmetric and negative-semidefinite. In the degenerate case where ĥ_2(X) is removed from ĥ(X), λ consists of λ_1 only and hence λ̃ and λ̃_step2 are identical. The resulting estimator of µ is

µ̃_LIK2 = Ẽ{RY/ω(X; λ̃_step2)}.

The estimator µ̃_LIK2, like µ̃_LIK, is sample-bounded and doubly robust due to, respectively, Ẽ{R/ω(X; λ̃_step2)} = 1 and Ẽ{Rm̂(X)/ω(X; λ̃_step2)} = Ẽ{m̂(X)} by (13). Furthermore, µ̃_LIK2 is asymptotically equivalent to the first order to µ̂_LIK and µ̃_LIK if model (2) is correctly specified, and hence is locally and intrinsically efficient. See the Appendix for an asymptotic expansion of µ̃_LIK2, allowing for misspecification of model (1) and model (2).

The foregoing development allows a general choice of the fitted value m̂(X). The estimator µ̃_LIK2 is doubly robust, locally and intrinsically efficient, and sample-bounded. Nevertheless, different choices of m̂(X) lead to specific versions of µ̃_LIK2 that differ beyond the four properties.
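In the degenerate case where ĥ_2 is dropped (so λ = λ_1), the second step amounts to a concave maximization whose stationarity condition is the calibration equation (13). The sketch below is a hypothetical illustration, not the paper's implementation: the propensity fit is deliberately misspecified as a constant while the outcome model is correct, so consistency of the result reflects double robustness, and a damped Newton ascent stands in for the trust-region software mentioned above.

```python
import numpy as np

rng = np.random.default_rng(3)
n = 5000
X = rng.normal(size=n)
Y = 1.0 + 2.0 * X + rng.normal(size=n)                # true mu = E(Y) = 1
R = rng.binomial(1, 1.0 / (1.0 + np.exp(-(0.5 + X))))

pi_hat = np.full(n, R.mean())                         # misspecified propensity fit
F = np.column_stack([np.ones(n), X])
m_hat = F @ np.linalg.lstsq(F[R == 1], Y[R == 1], rcond=None)[0]  # correct model (1)
ups = np.column_stack([np.ones(n), m_hat])            # upsilon-hat(x) = (1, m_hat(x))'
obs = R == 1

def omega(lam):
    # h = h1 = (1 - pi_hat)*ups, so w = pi_hat + (1 - pi_hat) * lam'ups.
    return pi_hat + (1.0 - pi_hat) * (ups @ lam)

def kappa1(lam):
    # Concave objective whose gradient is the calibration equation (13).
    w = omega(lam)
    if np.any(w[obs] <= 0):
        return -np.inf                                # outside the admissible set
    return np.sum(np.log(w[obs]) / (1.0 - pi_hat[obs])) / n - lam @ ups.mean(axis=0)

lam = np.zeros(2)
for _ in range(50):                                   # damped Newton ascent of kappa_1
    w_obs = omega(lam)[obs]
    grad = (ups[obs] / w_obs[:, None]).sum(axis=0) / n - ups.mean(axis=0)
    hess = -(ups[obs].T @ (((1.0 - pi_hat[obs]) / w_obs**2)[:, None] * ups[obs])) / n
    step = np.linalg.solve(hess, -grad)
    t = 1.0
    while kappa1(lam + t * step) < kappa1(lam):       # halve until admissible ascent
        t /= 2.0
    lam = lam + t * step

w = omega(lam)
mu_lik2 = np.mean(R * Y / w)                          # calibrated likelihood estimator
```

At the solution, Ẽ(R/ω) = 1 and the ω-weighted average of m̂ over observed units matches its full-sample average, which is what delivers sample boundedness and double robustness.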
Denote by µ̃_LIK2,OLS, µ̃_LIK2,WLS, and µ̃_LIK2,V the versions of µ̃_LIK2 corresponding to m̂ = m̂_OLS, m̂_WLS(π̂_ML), and m̃_V(π̂_ML), and similarly denote those of µ̂_LIK, µ̂_REG, and µ̃_REG. The estimator µ̃_LIK2,V, unlike µ̃_LIK2,OLS and µ̃_LIK2,WLS, is further improved-locally efficient. See Table 1 for a comparison of these estimators among other estimators.

4. EXTENSIONS AND COMPARISONS

4·1. Specification of υ̂(X)

The vector υ̂(X) is so far fixed as {1, m̂(X)}^T. However, it can be replaced throughout by a general vector of known functions of X including the constant 1, as in Tan (2006). With this extension, µ̂_LIK and µ̃_LIK2 still have asymptotic expansions in the current forms. The two estimators are sample-bounded and intrinsically efficient. Furthermore, if

m̂(X) = b_1^T υ̂(X) for some vector b_1,   (14)

then µ̂_LIK is locally efficient, and µ̃_LIK2 is doubly robust and locally efficient. Condition (14) automatically holds for υ̂(X) = {1, m̂(X)}^T with b_1 = (0, 1)^T.

Consider the case where model (1) is linear with identity link Ψ. Then g(X) is an alternative choice of υ̂(X) satisfying (14). For this choice, intrinsic efficiency implies improved local efficiency and hence µ̂_LIK and µ̃_LIK2 are improved-locally efficient. This result can also be seen from the following relationship. Suppose that ĥ_2(X) is removed from ĥ(X) throughout. Then µ̂_REG and µ̃_REG are identical to µ̂_V and µ̃_V respectively, which are improved-locally efficient (Tan, 2008). The estimators µ̂_LIK and µ̃_LIK2 have increased asymptotic variances, but are still asymptotically equivalent to the first order to µ̂_REG and µ̃_REG if model (2) is correctly specified. Therefore, the original estimators µ̂_LIK and µ̃_LIK2 are improved-locally efficient.

4·2. Estimation of E(X) and G_1

The estimators µ̂_LIK and µ̃_LIK2 for µ = E(Y) can be used for estimating E(X) with Y replaced by X, and similarly for estimating the expectations of functions of X. The resulting estimators have similar properties to those of µ̂_LIK and µ̃_LIK2. Suppose that X is contained in υ̂(X) by specification. If model (2) is correctly specified, then Ẽ{RX/ω(X; λ̂)} is asymptotically at least as efficient as Ẽ[RX/π̂_ML(X) − {R/π̂_ML(X) − 1}X] = Ẽ(X) by intrinsic efficiency, and hence is asymptotically equivalent to the first order to Ẽ(X). The estimator Ẽ{RX/ω(X; λ̃_step2)}, in contrast with Ẽ{RX/ω(X; λ̂)}, is identical to Ẽ(X) by (13), whether or not model (2) is correctly specified.

Estimation of E(Y), E(X), and the expectations of functions of (X, Y) is unified in estimation of G_1 from the distributional perspective of Tan (2006). Let G̃_1,step2 be the probability measure supported on {(X_i, Y_i) : R_i = 1, i = 1, …, n} such that if R_i = 1 then

G̃_1,step2({X_i, Y_i}) = n^{−1}ω^{−1}(X_i; λ̃_step2).

Then Ĝ_1 and G̃_1,step2 are both estimators of G_1, supported on the completely observed data. However, G̃_1,step2 satisfies ∫υ̂(x) dG̃_1,step2 = Ẽ{υ̂(X)}, i.e., the weighted average of υ̂(X) under G̃_1,step2 is exactly matched to the overall sample average of υ̂(X).

We compare our approach with the empirical likelihood approach of Qin & Zhang (2003).
Their approach is to maximize ∏_{i: R_i=1} G_1({X_i, Y_i}) subject to the constraints that G_1 is a probability measure supported on {(X_i, Y_i) : R_i = 1, i = 1, …, n} and ∫â(x) dG_1 = Ẽ{â(X)}, where â(x) = {π̂_ML(x), m̂(x)}^T. The maximization leads to the estimator such that if R_i = 1 then

Ĝ_QZ({X_i, Y_i}) = ( n_1 [1 + λ̂_QZ^T [â(X_i) − Ẽ{â(X)}]] )^{−1},

where n_1 = Σ_{i=1}^n R_i, λ̂_QZ = argmax_{λ_1} ℓ_QZ(λ_1), and ℓ_QZ(λ_1) = Ẽ{R log(1 + λ_1^T [â(X) − Ẽ{â(X)}])}. The estimator µ̂_QZ = ∫y dĜ_QZ is sample-bounded due to ∫dĜ_QZ = 1, and doubly robust and locally efficient due to ∫m̂(x) dĜ_QZ = Ẽ{m̂(X)}. However, µ̂_QZ is not intrinsically or improved-locally efficient, even in the special case where π(X) is known and substituted for π̂_ML(X) and m̂_V(π̂_ML) or m̃_V(π̂_ML) is used for m̂.

4·3. Augmentation of µ̂_LIK

The estimator µ̃_LIK2 is derived as a robustification of µ̂_LIK to realize double robustness and retain sample boundedness and local and intrinsic efficiency. Our method is to calibrate the estimation of λ. An alternative method for robustification is to augment µ̂_LIK with the additional term −Ẽ[{R/ω(X; λ̂) − 1}m̂(X)], in a similar manner to augmenting µ̂_IPW,ext to µ̂_AIPW,ext by Robins et al. in their 2008 technical report. The resulting estimator is doubly robust and locally and intrinsically efficient, but not sample-bounded.
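The fitting of λ̂ by maximizing ℓ(λ) and the augmentation just described can be sketched as follows. This is a hypothetical illustration: for simplicity ĥ is taken as ĥ_1 = (1 − π̂)υ̂ only (the degenerate case in which ĥ_2 is dropped), a damped Newton ascent stands in for the trust-region software of Section 3·2, and the propensity fit is deliberately misspecified as a constant while the outcome model is correct, so that only the augmented version is expected to be consistent.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 5000
X = rng.normal(size=n)
Y = 1.0 + 2.0 * X + rng.normal(size=n)                # true mu = E(Y) = 1
R = rng.binomial(1, 1.0 / (1.0 + np.exp(-(0.5 + X))))

pi_hat = np.full(n, R.mean())                         # misspecified propensity fit
F = np.column_stack([np.ones(n), X])
m_hat = F @ np.linalg.lstsq(F[R == 1], Y[R == 1], rcond=None)[0]  # correct model (1)
ups = np.column_stack([np.ones(n), m_hat])            # upsilon-hat = (1, m_hat)'
h = (1.0 - pi_hat)[:, None] * ups                     # h = h1 only
obs = R == 1

def ell(lam):
    # l(lambda) = E~[R log w + (1 - R) log(1 - w)], w = pi_hat + lambda'h.
    w = pi_hat + h @ lam
    if np.any(w[obs] <= 0) or np.any(w[~obs] >= 1):
        return -np.inf                                # outside the admissible set
    return (np.sum(np.log(w[obs])) + np.sum(np.log(1.0 - w[~obs]))) / n

lam = np.zeros(2)
for _ in range(50):                                   # damped Newton ascent of l
    w = pi_hat + h @ lam
    grad = ((h[obs] / w[obs, None]).sum(axis=0)
            - (h[~obs] / (1.0 - w[~obs])[:, None]).sum(axis=0)) / n
    hess = -(h[obs].T @ (h[obs] / (w[obs]**2)[:, None])
             + h[~obs].T @ (h[~obs] / ((1.0 - w[~obs])**2)[:, None])) / n
    step = np.linalg.solve(hess, -grad)
    t = 1.0
    while ell(lam + t * step) < ell(lam):             # halve until admissible ascent
        t /= 2.0
    lam = lam + t * step

w = pi_hat + h @ lam
mu_lik = np.mean(R * Y / w)                           # not doubly robust
mu_lik_aug = mu_lik - np.mean((R / w - 1.0) * m_hat)  # augmented: doubly robust
```

Because the outcome model is correct here, mu_lik_aug stays near the truth despite the bad propensity fit, whereas mu_lik need not.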

Recall that λ̂ = λ̂(m̂) depends on m̂ and write ω̂(X; m̂) = ω{X; λ̂(m̂)}. Substitution of ω̂(m̂) for π̂_ext(m̂) in various estimators in Section 2·3 leads to

µ̂_AIPW,lik = µ̂{ω̂(m̂_OLS), m̂_OLS},   µ̂_WLS,lik = µ̂[ω̂{m̂_WLS(π̂_ML)}, m̂_WLS(π̂_ML)],
µ̂_WLS,lik2 = µ̂(ω̂{m̂_WLS(π̂_ML)}, m̂_WLS[ω̂{m̂_WLS(π̂_ML)}]),   µ̃_V,lik = µ̂[ω̂{m̃_V(π̂_ML)}, m̃_V(π̂_ML)].

These estimators are similar to their counterparts in Section 2·3 in terms of the six properties in Table 1. The estimator µ̂{ω̂(m̂), m̂} is not population-bounded or sample-bounded, whereas µ̂_WLS,lik2 is population-bounded. Nevertheless, µ̂{ω̂(m̂), m̂} is bounded in absolute value by

∆ = max{|m̂(X_i)| : i = 1, …, n} + max{|Y_i − m̂(X_i)| : R_i = 1, i = 1, …, n},

due to the normalization (6). In contrast, µ̂{π̂_ext(m̂), m̂} may lie outside this range, because such a normalization does not hold for π̂_ext(X), as discussed in Section 3·2. Kang & Schafer (2007) and Robins et al. (2007) considered a modification of µ̂(π̂, m̂) by deliberately normalizing the weights, that is,

µ̂_ratio(π̂, m̂) = Ẽ^{−1}{R/π̂(X)} Ẽ( RY/π̂(X) − {R/π̂(X) − 1}[m̂(X) − Ẽ{m̂(X)}] )
            = Ẽ{m̂(X)} + Ẽ^{−1}{R/π̂(X)} Ẽ[ R{Y − m̂(X)}/π̂(X) ].

The estimator µ̂_ratio{π̂_ext(m̂), m̂} is bounded in absolute value by ∆. Moreover, it is similar to µ̂{π̂_ext(m̂), m̂} and µ̂{ω̂(m̂), m̂} in terms of the six properties in Table 1. These estimators, two based on π̂_ext and one based on ω̂, are asymptotically equivalent to each other if model (2) is correctly specified, but may differ in various ways otherwise.

4·4. Bounded robustification of µ̂_IPW,ext

The estimator µ̂_AIPW,ext is doubly robust but not sample-bounded. An alternative robustification of µ̂_IPW,ext can be derived such that it is doubly robust and sample-bounded, in a similar manner as µ̃_LIK2 is derived from µ̂_LIK. Our method is to calibrate estimation of ν in the extended model (3). For simplicity, fix Π(z) = expit(z), i.e., {1 + exp(−z)}^{−1}.
Then ϱ(x; γ) 1 free of γ, and π ext (X; ν) reduces to Π{ν T 1 ˆυ(X)/ˆπ ML(X) + ν T 2 f(x)}. ecall that ˆν = (ˆν T 1, ˆνT 2 )T is the maximum likelihood estimator of ν and hence a solution to 0 = Ẽ[ { π ext (X; ν)} f(x) ], [ 0 = Ẽ { π ext (X; ν)} ˆυ(X) ]. (15) ˆπ ML (X) Let ν step2 = ( ν 1,step2 T, ˆνT 2 )T, ν 1,step2 = argmax ν1 J 1(ν 1 ), and [ { J 1(ν 1 ) = Ẽ ˆυ(X) ˆπ ML (X) exp ν1 T ˆπ ML (X) ˆνT 2 f(x) } ] (1 )ν1 T ˆυ(X) by integrating the right side of (17) below. The function J 1(ν 1 ), unlike l(λ) and κ 1 (λ 1 ), is finite and concave everywhere. Moreover, J 1(ν 1 ) is strictly concave and bounded from above, and hence has a unique maximum, if and only if the set {ν 1 : ν1 Tˆυ(X i) 0 if i = 1, i = 1,..., n, and Ẽ{(1 )νt 1 ˆυ(X)} 0} is empty. (16) See the Appendix for a proof. The existence condition (16) for ν 1,step2 is more demanding than (12) for λ 1,step2 in that (16) implies (12), but not necessarily vice versa. Setting the gradient of.

J₁(ν₁) to 0 shows that ν̃₁,step2 is a solution to

0 = Ẽ[ {R/π_ext(X; ν₁, ν̂₂) − 1} υ̂(X) ],   (17)

which is equivalent to (15) with (R − π_ext) replaced by (R/π_ext − 1)π̂_ML and ν₂ evaluated at ν̂₂. The resulting estimator of µ is μ̃_IPW,ext2 = Ẽ{RY/π_ext(X; ν̃_step2)}. This estimator, like μ̃_LIK2, is doubly robust, locally and intrinsically efficient, and sample-bounded.

We compare μ̃_IPW,ext2 with the bounded, doubly robust estimator of Robins et al. (2007, Section 4·1·2). Consider the extended propensity score model

π_ext,sl(X; χ, γ) = Π( χ[m̂(X) − Ẽ{m̂(X)}] + γᵀ f(X) ).

Let χ̂ = χ̂(m̂) be a solution to

0 = Ẽ( [m̂(X) − Ẽ{m̂(X)}] {R/π_ext,sl(X; χ, γ̂_ML) − 1} ),

and write π̂_ext,sl(X; m̂) = π_ext,sl{X; χ̂(m̂), γ̂_ML}. The estimator μ̂_IPW,ext,sl = μ̂_ratio{π̂_ext,sl(m̂), 0} is sample-bounded. Moreover, it is identical to μ̂_ratio{π̂_ext,sl(m̂), m̂} by the construction of χ̂ and hence is doubly robust and locally efficient. However, it is not intrinsically or improved-locally efficient, even in the case where γ̂_ML is replaced by the true value and m̂(X) − Ẽ{m̂(X)} in π_ext,sl(X; χ, γ) is replaced by [m̂(X) − Ẽ{m̂(X)}]/π(X).

Regression estimators

The estimators μ̂_REG and μ̃_REG are called regression estimators (Tan, 2006, 2007), with connection to survey sampling (e.g., Cochran, 1977) and Monte Carlo integration (e.g., Hammersley & Handscomb, 1964). The idea is to exploit the fact that if model (2) is correctly specified, then η̂ has mean µ and ξ̂ has mean 0 asymptotically. The estimator μ̂_REG attains the minimum asymptotic variance among the class of estimators Ẽ(η̂) − bᵀ Ẽ(ξ̂) for arbitrary b. Moreover, μ̃_REG is asymptotically equivalent to the first order to μ̂_REG because both β̃ and β̂ converge to β* = E^{-1}(ξξᵀ)E(ξη) in probability. Note that Ẽ(ξ̂₂) = 0 and hence Ẽ(η̂) − bᵀ Ẽ(ξ̂) reduces to Ẽ(η̂) − b₁ᵀ Ẽ(ξ̂₁), where b = (b₁ᵀ, b₂ᵀ)ᵀ and ξ̂ = (ξ̂₁ᵀ, ξ̂₂ᵀ)ᵀ according to ĥ = (ĥ₁ᵀ, ĥ₂ᵀ)ᵀ.
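The regression-estimator construction Ẽ(η̂) − bᵀ Ẽ(ξ̂) with the variance-minimizing coefficient can be sketched in the scalar case. Everything below is a hypothetical simulation in which the propensity score model is correct, so η̂ has mean µ and ξ̂ has mean 0:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 2000
X = rng.normal(size=n)
pi = 1.0 / (1.0 + np.exp(-X))        # true propensity score, taken as the fitted one
R = rng.binomial(1, pi)
Y = 2.0 + X + rng.normal(size=n)     # true mean is mu = 2
m_hat = 2.0 + X                      # fitted outcome regression values (hypothetical)

eta = R * Y / pi                     # IPW term: mean mu asymptotically
xi = (R / pi - 1.0) * m_hat          # control variate: mean 0 asymptotically

# Scalar version of b = E^{-1}(xi xi^T) E(xi eta), estimated from the sample
b_hat = np.mean(xi * eta) / np.mean(xi * xi)
mu_reg = np.mean(eta) - b_hat * np.mean(xi)   # regression estimator
```

With both moments estimated from the same sample, mu_reg is first-order equivalent to the optimal member of the class, as stated above.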
The estimators μ̂_REG and μ̃_REG are no longer asymptotically equivalent if model (2) is misspecified. In fact, μ̃_REG is doubly robust whereas μ̂_REG is not. The estimator μ̃_REG is akin to the doubly robust regression estimator of Tan (2006), in which η̂ is defined as {υ̂ᵀ(X)/π̂_ML(X), ϱ̂_ML(X)fᵀ(X)}ᵀ. A benefit of using this version of η̂ is that the resulting matrix B̃ is symmetric and negative semidefinite. Moreover, if {λ ≠ 0 : λᵀ ĥ(X_i) = 0 if R_i = 1 (i = 1, ..., n)} is empty, then B̃ is negative definite. This symmetrization tends to stabilize the inversion of B̃ in β̃ = B̃^{-1}Ĉ and hence improve the finite-sample behavior of μ̃_REG.

A similar symmetrization can be applied to estimating equations (8)–(9). Consider the following estimating equations in place of (9):

0 = Ẽ[ {R/ω(X; λ) − 1} ĥ₂(X)/{1 − π̂_ML(X)} ].   (18)

The matrix of the partial derivatives of the right-hand sides of (8) and (18) is symmetric and negative semidefinite. If {λ ≠ 0 : λᵀ ĥ(X_i) = 0 if R_i = 1 (i = 1, ..., n)} is empty, then the matrix is negative definite. In fact, (8) and (18) are jointly equivalent to setting to 0 the gradient of

κ(λ) = Ẽ( [R log{ω(X; λ)} − λᵀ ĥ(X)]/{1 − π̂_ML(X)} ),

similarly as (13) is obtained from κ₁(λ₁). The function κ(λ) has similar properties of concavity and boundedness to those of κ₁(λ₁). Therefore, it is numerically convenient to redefine λ̃ as a maximizer of κ(λ), or equivalently a solution to

(8) and (18), subject to the constraint that ω(X_i; λ) > 0 if R_i = 1 (i = 1, ..., n). The resulting estimator μ̃_LIK is comparable to μ̃_LIK2 in terms of the six properties in Table 1. A limitation of the modified estimator μ̃_LIK, as compared with μ̃_LIK2, is that it is difficult to generalize μ̃_LIK, while retaining the structure of λ̃, to the setup of causal inference with non-binary, discrete treatments. See Section 5·4 for a further discussion.

5. CAUSAL INFERENCE

5·1. Setup

We now turn to causal inference in the framework of potential outcomes (Neyman, 1923; Rubin, 1974). Let X be a vector of covariates and Y be an outcome as before. Let T be a treatment variable taking values in T = {0, 1, ..., J − 1} with J ≥ 2, where 0 denotes the null treatment or placebo. For each t ∈ T, let Y_t be the potential outcome that would be observed under treatment t. We make the consistency assumption that Y = Y_t if T = t, and the no-confounding assumption that, for each t ∈ T, R_t and Y_t are conditionally independent given X, where R_t = 1{T = t}. Throughout, 1{·} denotes the indicator function. The observed data consist of independent and identically distributed (X_i, T_i, Y_i), i = 1, ..., n. Our objective is to estimate the population mean µ_t = E(Y_t) for t ∈ T. The difference µ_t − µ_0 is called the average causal effect of treatment t.

To a certain extent, this problem can be handled as J separate problems of estimating µ_t from the data (X_i, R_{t,i}, R_{t,i} Y_{t,i}), i = 1, ..., n, as in Sections 2–4. However, the estimators of µ_t obtained in this way are not jointly intrinsically efficient and hence those of µ_t − µ_0 may be inefficient even marginally.

5·2. Models and existing estimators

Consider a parametric model for m(t, X) = E(Y | T = t, X) in the form

E(Y | T = t, X) = m(t, X; α)   (t ∈ T),   (19)

where m(t, x; α) is a known function and α is a vector of unknown parameters.
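The potential-outcomes setup of Section 5·1 can be simulated minimally, illustrating the consistency assumption Y = Y_t when T = t and the indicators R_t = 1{T = t}. The generative model below (J = 3) is hypothetical:

```python
import numpy as np

rng = np.random.default_rng(2)
n, J = 1000, 3
X = rng.normal(size=n)

# Potential outcomes Y_t for t = 0, 1, 2 (hypothetical generative model)
Y_pot = np.stack([t * 1.0 + X + rng.normal(size=n) for t in range(J)], axis=1)

# Treatment assignment depends on X only, so no unmeasured confounding holds
logits = np.stack([np.zeros(n), 0.5 * X, -0.5 * X], axis=1)
probs = np.exp(logits) / np.exp(logits).sum(axis=1, keepdims=True)
T = np.array([rng.choice(J, p=p) for p in probs])

Y = Y_pot[np.arange(n), T]                       # consistency: Y = Y_t when T = t
R = (T[:, None] == np.arange(J)).astype(float)   # column t holds R_t = 1{T = t}

assert np.all(R.sum(axis=1) == 1.0)              # each unit receives exactly one treatment
```

Only (X_i, T_i, Y_i) would be observed in practice; the full matrix Y_pot exists only inside the simulation.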
To focus on main ideas, assume that m(t, X; α) = Ψ{α_tᵀ g(X)}, where α_t is a vector of unknown parameters and α = (α₀ᵀ, ..., α_{J−1}ᵀ)ᵀ. This specification of (19) is separable in the sense that m(t, X; α) depends on α only through α_t. By abuse of notation, treat m(t, X; α) as m(t, X; α_t). Let α̂_t,OLS be a solution to 0 = Ẽ[R_t{Y − m(t, X; α_t)} g(X)] and write m̂_OLS(t, X) = m(t, X; α̂_t,OLS).

Consider a parametric model for π(t, X) = P(T = t | X) in the form

P(T = t | X) = π(t, X; γ)   (t ∈ T),   (20)

where π(t, x; γ) is a known function and γ is a vector of unknown parameters. Let γ̂_ML be the maximum likelihood estimator of γ and write π̂_ML(t, X) = π(t, X; γ̂_ML). A convenient specification of (20) is the multinomial logit model

π(t, X; γ) = exp{γ_tᵀ f(X)} / Σ_{j∈T} exp{γ_jᵀ f(X)},   (21)

where γ = (γ₀ᵀ, γ₁ᵀ, ..., γ_{J−1}ᵀ)ᵀ with γ₀ = 0. In this case, the score equations for γ̂_ML are 0 = Ẽ[{R_t − π(t, X; γ)} f(X)] for t = 1, ..., J − 1.

To estimate µ_t, the estimators in Section 2·3 can be adopted. Replace μ̂(π̂, m̂) by

μ̂_t(π̂, m̂) = Ẽ[ R_t Y/π̂(t, X) − {R_t/π̂(t, X) − 1} m̂(t, X) ],

where π̂(t, X) and m̂(t, X) are estimators of π(t, X) and m(t, X) respectively. Various choices of the two estimators are available. The estimator m̂_OLS(t, X) is a simple choice of m̂(t, X), and π̂_ML(t, X) is a simple choice of π̂(t, X). Moreover, there are iterative choices of m̂(t, X) and π̂(t, X). Let m̂_ext(t, X; π̂) = m_ext{t, X; κ̂_t(π̂)}, m̂_WLS(t, X; π̂) = m{t, X; α̂_t,WLS(π̂)}, and m̃_V(t, X; π̂) = m{t, X; α̃_t,V(π̂)}, where κ̂_t(π̂), α̂_t,WLS(π̂), and α̃_t,V(π̂) are obtained by substituting R_t, π̂(t, X), and m(t, X; α_t) for R, π̂(X), and m(X; α) throughout in κ̂(π̂), α̂_WLS(π̂), and α̃_V(π̂).

Construction of an extension to π̂_ext(m̂) seems difficult for a general specification of model (20) with J > 2. Nevertheless, the task is straightforward if the multinomial logit specification (21) is used. Consider the model

P(T = t | X) = π_ext(t, X; ν) = C^{-1}(X; ν) exp{ Σ_{j∈T} ν₁t,jᵀ υ̂(j, X)/π̂_ML(j, X) + ν₂tᵀ f(X) },   (22)

where ν = (ν₁ᵀ, ν₂ᵀ)ᵀ, ν₁ is the vector of ν₁t,j for t, j ∈ T with ν₁0,j = 0 for j ∈ T and ν₁t,0 = ν₁1,0 for t ≠ 0, ν₂ is the vector of ν₂t for t ∈ T with ν₂0 = 0, υ̂(j, X) = {1, m̂(j, X)}ᵀ, and C(X; ν) is determined by Σ_{t∈T} π_ext(t, X; ν) ≡ 1. Let ν̂(m̂) be the maximum likelihood estimator of ν and write π̂_ext(t, X; m̂) = π_ext{t, X; ν̂(m̂)}.

The foregoing choices of m̂(t, X) and π̂(t, X) can be employed in similar combinations to those of m̂(X) and π̂(X) in Section 2·3. Label the resulting estimators of µ_t accordingly. For each t ∈ T, the marginal behavior of μ̂_t can be evaluated by the criteria in Section 2·4. However, consider the following criteria for the joint behavior of (μ̂₀, μ̂₁, ..., μ̂_{J−1}). We say that a vector-valued estimator θ̂₁ is more efficient than θ̂₂ if the asymptotic variance matrix of θ̂₁ is smaller than that of θ̂₂ in the order on positive-definite matrices.

(a) Joint double robustness: (μ̂₀, μ̂₁, ..., μ̂_{J−1}) remains consistent if either model (19) or model (20) is correctly specified.
(b) Joint local efficiency: (μ̂₀, μ̂₁, ..., μ̂_{J−1}) attains the semiparametric variance bound if both model (19) and model (20) are correctly specified.

(c) Joint improved local efficiency: (μ̂₀, μ̂₁, ..., μ̂_{J−1}) is at least as efficient as {μ̂₀(α₀), μ̂₁(α₁), ..., μ̂_{J−1}(α_{J−1})} if model (20) is correctly specified, where μ̂_t(α_t) = Ẽ[ R_t Y/π(t, X) − {R_t/π(t, X) − 1} m(t, X; α_t) ] for α_t a vector of arbitrary constants (t ∈ T).

(d) Joint intrinsic efficiency: (μ̂₀, μ̂₁, ..., μ̂_{J−1}) is at least as efficient as {μ̂₀(b₀), μ̂₁(b₁), ..., μ̂_{J−1}(b_{J−1})} if model (20) is correctly specified, where μ̂_t(b_t) = Ẽ[ R_t Y/π̂_ML(t, X) − b_tᵀ {R_t/π̂_ML(t, X) − 1} υ̂(t, X) ] for b_t a vector of arbitrary constants (t ∈ T).

(e) Joint population boundedness: μ̂_t is population-bounded for each t ∈ T.

(f) Joint sample boundedness: μ̂_t is sample-bounded for each t ∈ T.

Joint double robustness, local efficiency, or population or sample boundedness is equivalent to the fact that μ̂_t satisfies the corresponding property for each t ∈ T. However, joint intrinsic or improved local efficiency is respectively more stringent than the fact that, for each t ∈ T, μ̂_t satisfies intrinsic or improved local efficiency. The comparison in Table 1 remains applicable, except for one correction, if the estimators are replaced by the joint estimators of (µ₀, µ₁, ..., µ_{J−1}) and the properties are replaced by those on the joint behavior. See Sections 5·3–5·4 for a description of the likelihood and regression estimators. The correction is that none of the joint estimators satisfies joint improved local efficiency, although Table 1 is still valid regarding whether or not the estimators of µ_t satisfy improved local efficiency marginally. See Tan (2008, Section 3) for a further discussion.
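As a concrete sketch of the per-treatment estimator μ̂_t(π̂, m̂) under the multinomial logit specification (21): the simulation below is hypothetical, with both the propensity and outcome models taken as correctly specified, so the estimates should land near the true means (1, 1.5, 2):

```python
import numpy as np

rng = np.random.default_rng(3)
n, J = 4000, 3
X = rng.normal(size=n)

# Multinomial logit propensities (21) with gamma_0 = 0; here the true model plays pi-hat
logits = np.stack([np.zeros(n), 0.4 * X, -0.4 * X], axis=1)
pi_hat = np.exp(logits) / np.exp(logits).sum(axis=1, keepdims=True)
assert np.allclose(pi_hat.sum(axis=1), 1.0)      # probabilities sum to one over t

T = np.array([rng.choice(J, p=p) for p in pi_hat])
Y = 1.0 + 0.5 * T + X + rng.normal(size=n)       # hypothetical outcome model
m_hat = np.stack([1.0 + 0.5 * t + X for t in range(J)], axis=1)  # fitted m(t, X)

mu_hat = np.empty(J)
for t in range(J):
    w = (T == t).astype(float) / pi_hat[:, t]    # R_t / pi-hat(t, X)
    # mu-hat_t(pi-hat, m-hat) = E-tilde[ R_t Y / pi(t,X) - {R_t/pi(t,X) - 1} m(t,X) ]
    mu_hat[t] = np.mean(w * Y - (w - 1.0) * m_hat[:, t])

assert np.allclose(mu_hat, [1.0, 1.5, 2.0], atol=0.2)
```

The joint criteria (a)–(f) concern how the vector (μ̂₀, μ̂₁, μ̂₂) behaves together; the loop above only illustrates the marginal construction for each t.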

Note that (μ̂_t,IPW,ext)_{t∈T} satisfies joint intrinsic efficiency because υ̂(j, X)/π̂_ML(j, X), j ∈ T, are simultaneously included as extra linear predictors for log{π(t, X)/π(0, X)} for each t ≠ 0 in model (22). For fixed j ≠ 0, if model (22) were specified such that log{π(t, X)/π(0, X)} = ν₂tᵀ f(X) if t ≠ 0, j, or ν₁j,jᵀ υ̂(j, X)/π̂_ML(j, X) + ν₂jᵀ f(X) if t = j, then μ̂_j,IPW,ext would satisfy intrinsic efficiency marginally, but (μ̂_t,IPW,ext)_{t∈T} would not satisfy joint intrinsic efficiency. See Tan (2007, Section 3) for a related discussion.

5·3. Non-doubly-robust likelihood estimator

We present the likelihood estimator of Tan (2006) in the setup of causal inference, with the extension to accommodate discrete, binary or non-binary, treatments. See a 2007 Rutgers University technical report by Tan for a further extension to deal with marginal and nested structural models.

The nonparametric likelihood of (X_i, T_i, Y_i), i = 1, ..., n, is

L₁ × L₂ = ∏_{i=1}^n π(T_i, X_i; γ) × ∏_{i=1}^n G_{T_i}({X_i, Y_i}),

where G_t is the joint distribution of (X, Y_t), t ∈ T. Maximizing L₁ leads to the maximum likelihood estimator γ̂_ML. Recall that m̂(t, x) is an estimator of m(t, x) based on model (19) and υ̂(t, x) = {1, m̂(t, x)}ᵀ. Let ĥ = (ĥ₁ᵀ, ĥ₂ᵀ)ᵀ and ĥ₁ = (ĥ₁₀ᵀ, ĥ₁₁ᵀ, ..., ĥ₁,J−1ᵀ)ᵀ, where

ĥ₁j(t, x) = [1{t = j} − π̂_ML(t, x)] υ̂(j, x)   (j ∈ T),
ĥ₂(t, x) = (∂π/∂γ)(t, x; γ̂_ML).

By construction, Σ_{t∈T} ĥ(t, x) ≡ 0 because Σ_{t∈T} π̂_ML(t, x) ≡ 1. We choose to ignore the fact that G_t, t ∈ T, induce the same marginal distribution of X, and retain only the constraints Σ_{t∈T} ∫ ĥ(t, x) dG_t = 0, i.e.,

0 = Σ_{t∈T} ∫ [1{t = j} − π̂_ML(t, x)] dG_t   (j ∈ T),
0 = Σ_{t∈T} ∫ [1{t = j} − π̂_ML(t, x)] m̂(j, x) dG_t   (j ∈ T),
0 = Σ_{t∈T} ∫ (∂π/∂γ)(t, x; γ̂_ML) dG_t.

Furthermore, we require that G_t be a probability measure supported on {(X_i, Y_i) : T_i = t, i = 1, ..., n}, and hence ∫ dG_t = 1, t ∈ T.
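The identity Σ_{t∈T} ĥ(t, x) ≡ 0 can be verified numerically for the multinomial logit model, whose score components are ∂π(t, x; γ)/∂γ_k = π(t, x){1{t = k} − π(k, x)} f(x) for k = 1, ..., J − 1. The coefficients and fitted values below are arbitrary stand-ins:

```python
import numpy as np

rng = np.random.default_rng(4)
J = 3
x = rng.normal(size=2)
f = np.append(1.0, x)                            # f(x) = (1, x^T)^T

# Multinomial logit pi-ML(t, x) with hypothetical coefficients, gamma_0 = 0
gamma = np.vstack([np.zeros(3), rng.normal(size=(J - 1, 3))])
logits = gamma @ f
pi_ml = np.exp(logits) / np.exp(logits).sum()

m_hat = 0.5 + x.sum() + 0.3 * np.arange(J)       # hypothetical fitted m(t, x)
ups = np.stack([np.ones(J), m_hat])              # upsilon(j, x) = {1, m(j, x)}^T in column j

# h_1j(t, x) = [1{t = j} - pi_ML(t, x)] upsilon(j, x), stacked over j
h1 = np.stack([((np.arange(J) == j).astype(float) - pi_ml)[:, None] * ups[:, j]
               for j in range(J)], axis=1)       # shape (t, j, 2)

# h_2(t, x): multinomial logit score, pi_t (1{t = k} - pi_k) f(x) for k = 1, ..., J-1
h2 = np.stack([np.outer(pi_ml[t] * ((np.arange(1, J) == t) - pi_ml[1:]), f)
               for t in range(J)])               # shape (t, J-1, len(f))

assert np.allclose(h1.sum(axis=0), 0.0)          # sum over t of h_1 vanishes identically
assert np.allclose(h2.sum(axis=0), 0.0)          # sum over t of h_2 vanishes identically
```

Both sums vanish for any x and any coefficients, which is the point of the "by construction" remark above.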
Maximizing L₂ subject to these constraints leads to the estimators such that, if T_i = t, then

Ĝ_t({X_i, Y_i}) = n^{-1} ω^{-1}(t, X_i; λ̂),

where ω(t, X; λ) = π̂_ML(t, X) + λᵀ ĥ(t, X), λ̂ = argmax_λ ℓ(λ), and ℓ(λ) = Ẽ[log{ω(T, X; λ)}]. The function ℓ(λ) is finite and concave on the set {λ : ω(T_i, X_i; λ) > 0, i = 1, ..., n}. Moreover, ℓ(λ) is strictly concave and bounded from above, and hence has a unique maximum, if and only if {λ ≠ 0 : λᵀ ĥ(T_i, X_i) ≥ 0, i = 1, ..., n} is empty. This proposition follows in a similar manner as that concerning ℓ(λ) and condition (4) in Section 3·2.

The estimators Ĝ_t, t ∈ T, are similar to Ĝ₁ in Section 3·2. If J = 2, π̂_ML(1, X) is identified as π̂_ML(X), ĥ₁₀ is removed from ĥ, and the constraint ∫ dG₀ = 1 is cancelled, then Ĝ₁ reduces to exactly Ĝ₁ in Section 3·2. For causal inference, Ĝ_t, t ∈ T, are equally of interest and constrained

as probability measures. In contrast, only Ĝ₁, but not Ĝ₀, is of interest and constrained as a probability measure in the missing-data setup.

Setting the gradient of ℓ(λ) to 0 shows that λ̂ is a solution to

0 = Ẽ{ ĥ(T, X)/ω(T, X; λ) },   (23)

or equivalently 0 = Σ_{t∈T} ∫ ĥ(t, x) dĜ_t. The resulting estimator of µ_t is

μ̂_t,LIK = ∫ y_t dĜ_t = Ẽ{ R_t Y/ω(t, X; λ̂) }.

We derive the following asymptotic expansions for λ̂ and μ̂_t,LIK, allowing for misspecification of model (19) and model (20), similarly as in Section 3·2. Under regularity conditions, λ̂ converges to a constant λ* with the expansion λ̂ − λ* = B̂^{-1} Ẽ{ĥ(T, X)/ω(T, X; λ*)} + o_p(n^{-1/2}). Moreover, μ̂_t,LIK has the expansion

μ̂_t,LIK = Ẽ{ R_t Y/ω(t, X; λ*) } − Ĉ_tᵀ B̂^{-1} Ẽ{ ĥ(T, X)/ω(T, X; λ*) } + o_p(n^{-1/2}),

where B̂ = Ẽ{ĥ(T, X) ĥᵀ(T, X)/ω²(T, X; λ*)} and Ĉ_t = Ẽ{ĥ(T, X) R_t Y/ω²(T, X; λ*)}. If model (20) is correctly specified, then λ* = 0 and hence μ̂_t,LIK is asymptotically equivalent to the first order to μ̂_t,REG = Ẽ(η̂_t) − Ĉ_tᵀ B̂^{-1} Ẽ(ξ̂), where η̂_t = R_t Y/π̂_ML(T, X), ξ̂ = ĥ(T, X)/π̂_ML(T, X), B̂ = Ẽ(ξ̂ ξ̂ᵀ), and Ĉ_t = Ẽ(ξ̂ η̂_t).

5·4. Doubly robust likelihood estimator

The estimator μ̂_t,LIK is sample-bounded and locally and intrinsically efficient marginally. Moreover, (μ̂_{0,LIK}, μ̂_{1,LIK}, ..., μ̂_{J−1,LIK}) satisfies joint intrinsic efficiency. However, μ̂_t,LIK is not doubly robust. We propose a robustification of μ̂_t,LIK such that the resulting estimator of µ_t satisfies double robustness in addition to sample boundedness and local and intrinsic efficiency, and the joint estimator satisfies joint intrinsic efficiency. For our derivation, rewrite ĥ(t, x) as

ĥ(t, x) = ĝ(t, x) − π̂_ML(t, x) Σ_{j∈T} ĝ(j, x),   (24)

where ĝ = (ĝ₁ᵀ, ĝ₂ᵀ)ᵀ, ĝ₂ is defined the same as ĥ₂, but ĝ₁ is defined as ĥ₁ with ĥ₁j(t, x) replaced by ĝ₁j(t, x) = 1{t = j} υ̂(j, x), j ∈ T.
Instead of (23), consider the system of estimating equations

0 = Ẽ{ ĝ(T, X)/ω(T, X; λ) − Σ_{t∈T} ĝ(t, X) },   (25)

i.e., 0 = Ẽ[{R_t/ω(t, X; λ) − 1} υ̂(t, X)], t ∈ T, and 0 = Ẽ{ĝ₂(T, X)/ω(T, X; λ)}. In retrospect, the vector of estimating functions ĝ(T, X)/ω(T, X; λ) − Σ_{t∈T} ĝ(t, X) in (25) equals ĥ(T, X)/ω(T, X; λ) in (23) left-multiplied by the matrix I − {Σ_{t∈T} ĝ(t, X)} λᵀ, where I is the appropriate identity matrix. Let λ̃ be a solution to (25) subject to the constraint that ω(T_i, X_i; λ) > 0 (i = 1, ..., n) and let μ̃_t,LIK = Ẽ{R_t Y/ω(t, X; λ̃)}.
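The left-multiplication identity relating (25) to (23) is purely algebraic: since λᵀĥ(T, X) = ω(T, X; λ) − π̂_ML(T, X), the cross term collapses and only ĝ(T, X)/ω − Σ_t ĝ(t, X) remains. It can be checked on a random instance, with arbitrary vectors standing in for the components of ĝ and π̂_ML:

```python
import numpy as np

rng = np.random.default_rng(5)
J, p = 3, 4                                      # number of treatments, dimension of g
g = rng.normal(size=(J, p))                      # stand-ins for g-hat(t, x), one row per t
pi_ml = rng.dirichlet(np.ones(J))                # pi-ML(t, x), summing to one over t
S = g.sum(axis=0)                                # S = sum over t of g-hat(t, x)

h = g - np.outer(pi_ml, S)                       # decomposition (24): h = g - pi_ML * S
assert np.allclose(h.sum(axis=0), 0.0)           # h still sums to zero over t

lam = 0.05 * rng.normal(size=p)                  # a small lambda keeping omega positive
Tt = 1                                           # the observed treatment value
omega = pi_ml[Tt] + lam @ h[Tt]                  # omega(T, X; lambda)

lhs = g[Tt] / omega - S                          # estimating function appearing in (25)
rhs = (np.eye(p) - np.outer(S, lam)) @ (h[Tt] / omega)   # (I - S lambda^T) applied to (23)
assert np.allclose(lhs, rhs)                     # the two estimating functions coincide
```

The identity holds for any λ with ω ≠ 0, which is why solutions of (25) inherit the consistency argument sketched for (23) while gaining double robustness.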


More information

Empirical likelihood methods in missing response problems and causal interference

Empirical likelihood methods in missing response problems and causal interference The University of Toledo The University of Toledo Digital Repository Theses and Dissertations 2017 Empirical likelihood methods in missing response problems and causal interference Kaili Ren University

More information

Causal Inference Basics

Causal Inference Basics Causal Inference Basics Sam Lendle October 09, 2013 Observed data, question, counterfactuals Observed data: n i.i.d copies of baseline covariates W, treatment A {0, 1}, and outcome Y. O i = (W i, A i,

More information

University of California, Berkeley

University of California, Berkeley University of California, Berkeley U.C. Berkeley Division of Biostatistics Working Paper Series Year 2010 Paper 260 Collaborative Targeted Maximum Likelihood For Time To Event Data Ori M. Stitelman Mark

More information

Notes on the Multivariate Normal and Related Topics

Notes on the Multivariate Normal and Related Topics Version: July 10, 2013 Notes on the Multivariate Normal and Related Topics Let me refresh your memory about the distinctions between population and sample; parameters and statistics; population distributions

More information

Statistical Methods for Handling Incomplete Data Chapter 2: Likelihood-based approach

Statistical Methods for Handling Incomplete Data Chapter 2: Likelihood-based approach Statistical Methods for Handling Incomplete Data Chapter 2: Likelihood-based approach Jae-Kwang Kim Department of Statistics, Iowa State University Outline 1 Introduction 2 Observed likelihood 3 Mean Score

More information

A note on L convergence of Neumann series approximation in missing data problems

A note on L convergence of Neumann series approximation in missing data problems A note on L convergence of Neumann series approximation in missing data problems Hua Yun Chen Division of Epidemiology & Biostatistics School of Public Health University of Illinois at Chicago 1603 West

More information

6 Pattern Mixture Models

6 Pattern Mixture Models 6 Pattern Mixture Models A common theme underlying the methods we have discussed so far is that interest focuses on making inference on parameters in a parametric or semiparametric model for the full data

More information

Fall 2017 STAT 532 Homework Peter Hoff. 1. Let P be a probability measure on a collection of sets A.

Fall 2017 STAT 532 Homework Peter Hoff. 1. Let P be a probability measure on a collection of sets A. 1. Let P be a probability measure on a collection of sets A. (a) For each n N, let H n be a set in A such that H n H n+1. Show that P (H n ) monotonically converges to P ( k=1 H k) as n. (b) For each n

More information

Shu Yang and Jae Kwang Kim. Harvard University and Iowa State University

Shu Yang and Jae Kwang Kim. Harvard University and Iowa State University Statistica Sinica 27 (2017), 000-000 doi:https://doi.org/10.5705/ss.202016.0155 DISCUSSION: DISSECTING MULTIPLE IMPUTATION FROM A MULTI-PHASE INFERENCE PERSPECTIVE: WHAT HAPPENS WHEN GOD S, IMPUTER S AND

More information

A Course in Applied Econometrics Lecture 18: Missing Data. Jeff Wooldridge IRP Lectures, UW Madison, August Linear model with IVs: y i x i u i,

A Course in Applied Econometrics Lecture 18: Missing Data. Jeff Wooldridge IRP Lectures, UW Madison, August Linear model with IVs: y i x i u i, A Course in Applied Econometrics Lecture 18: Missing Data Jeff Wooldridge IRP Lectures, UW Madison, August 2008 1. When Can Missing Data be Ignored? 2. Inverse Probability Weighting 3. Imputation 4. Heckman-Type

More information

Power and Sample Size Calculations with the Additive Hazards Model

Power and Sample Size Calculations with the Additive Hazards Model Journal of Data Science 10(2012), 143-155 Power and Sample Size Calculations with the Additive Hazards Model Ling Chen, Chengjie Xiong, J. Philip Miller and Feng Gao Washington University School of Medicine

More information

Recent Advances in the analysis of missing data with non-ignorable missingness

Recent Advances in the analysis of missing data with non-ignorable missingness Recent Advances in the analysis of missing data with non-ignorable missingness Jae-Kwang Kim Department of Statistics, Iowa State University July 4th, 2014 1 Introduction 2 Full likelihood-based ML estimation

More information

Spring 2017 Econ 574 Roger Koenker. Lecture 14 GEE-GMM

Spring 2017 Econ 574 Roger Koenker. Lecture 14 GEE-GMM University of Illinois Department of Economics Spring 2017 Econ 574 Roger Koenker Lecture 14 GEE-GMM Throughout the course we have emphasized methods of estimation and inference based on the principle

More information

A Course in Applied Econometrics Lecture 14: Control Functions and Related Methods. Jeff Wooldridge IRP Lectures, UW Madison, August 2008

A Course in Applied Econometrics Lecture 14: Control Functions and Related Methods. Jeff Wooldridge IRP Lectures, UW Madison, August 2008 A Course in Applied Econometrics Lecture 14: Control Functions and Related Methods Jeff Wooldridge IRP Lectures, UW Madison, August 2008 1. Linear-in-Parameters Models: IV versus Control Functions 2. Correlated

More information

Nuisance parameter elimination for proportional likelihood ratio models with nonignorable missingness and random truncation

Nuisance parameter elimination for proportional likelihood ratio models with nonignorable missingness and random truncation Biometrika Advance Access published October 24, 202 Biometrika (202), pp. 8 C 202 Biometrika rust Printed in Great Britain doi: 0.093/biomet/ass056 Nuisance parameter elimination for proportional likelihood

More information

Robustness of a semiparametric estimator of a copula

Robustness of a semiparametric estimator of a copula Robustness of a semiparametric estimator of a copula Gunky Kim a, Mervyn J. Silvapulle b and Paramsothy Silvapulle c a Department of Econometrics and Business Statistics, Monash University, c Caulfield

More information

STAT331. Cox s Proportional Hazards Model

STAT331. Cox s Proportional Hazards Model STAT331 Cox s Proportional Hazards Model In this unit we introduce Cox s proportional hazards (Cox s PH) model, give a heuristic development of the partial likelihood function, and discuss adaptations

More information

DISCUSSION PAPER. The Bias from Misspecification of Control Variables as Linear. L e o n a r d G o f f. November 2014 RFF DP 14-41

DISCUSSION PAPER. The Bias from Misspecification of Control Variables as Linear. L e o n a r d G o f f. November 2014 RFF DP 14-41 DISCUSSION PAPER November 014 RFF DP 14-41 The Bias from Misspecification of Control Variables as Linear L e o n a r d G o f f 1616 P St. NW Washington, DC 0036 0-38-5000 www.rff.org The Bias from Misspecification

More information

A Measure of Robustness to Misspecification

A Measure of Robustness to Misspecification A Measure of Robustness to Misspecification Susan Athey Guido W. Imbens December 2014 Graduate School of Business, Stanford University, and NBER. Electronic correspondence: athey@stanford.edu. Graduate

More information

Optimal Treatment Regimes for Survival Endpoints from a Classification Perspective. Anastasios (Butch) Tsiatis and Xiaofei Bai

Optimal Treatment Regimes for Survival Endpoints from a Classification Perspective. Anastasios (Butch) Tsiatis and Xiaofei Bai Optimal Treatment Regimes for Survival Endpoints from a Classification Perspective Anastasios (Butch) Tsiatis and Xiaofei Bai Department of Statistics North Carolina State University 1/35 Optimal Treatment

More information

(θ θ ), θ θ = 2 L(θ ) θ θ θ θ θ (θ )= H θθ (θ ) 1 d θ (θ )

(θ θ ), θ θ = 2 L(θ ) θ θ θ θ θ (θ )= H θθ (θ ) 1 d θ (θ ) Setting RHS to be zero, 0= (θ )+ 2 L(θ ) (θ θ ), θ θ = 2 L(θ ) 1 (θ )= H θθ (θ ) 1 d θ (θ ) O =0 θ 1 θ 3 θ 2 θ Figure 1: The Newton-Raphson Algorithm where H is the Hessian matrix, d θ is the derivative

More information

On the Power of Tests for Regime Switching

On the Power of Tests for Regime Switching On the Power of Tests for Regime Switching joint work with Drew Carter and Ben Hansen Douglas G. Steigerwald UC Santa Barbara May 2015 D. Steigerwald (UCSB) Regime Switching May 2015 1 / 42 Motivating

More information

Introduction to Empirical Processes and Semiparametric Inference Lecture 25: Semiparametric Models

Introduction to Empirical Processes and Semiparametric Inference Lecture 25: Semiparametric Models Introduction to Empirical Processes and Semiparametric Inference Lecture 25: Semiparametric Models Michael R. Kosorok, Ph.D. Professor and Chair of Biostatistics Professor of Statistics and Operations

More information

Accounting for Population Uncertainty in Covariance Structure Analysis

Accounting for Population Uncertainty in Covariance Structure Analysis Accounting for Population Uncertainty in Structure Analysis Boston College May 21, 2013 Joint work with: Michael W. Browne The Ohio State University matrix among observed variables are usually implied

More information

Deductive Derivation and Computerization of Semiparametric Efficient Estimation

Deductive Derivation and Computerization of Semiparametric Efficient Estimation Deductive Derivation and Computerization of Semiparametric Efficient Estimation Constantine Frangakis, Tianchen Qian, Zhenke Wu, and Ivan Diaz Department of Biostatistics Johns Hopkins Bloomberg School

More information

Generalized Linear Models

Generalized Linear Models Generalized Linear Models Advanced Methods for Data Analysis (36-402/36-608 Spring 2014 1 Generalized linear models 1.1 Introduction: two regressions So far we ve seen two canonical settings for regression.

More information

CCP Estimation. Robert A. Miller. March Dynamic Discrete Choice. Miller (Dynamic Discrete Choice) cemmap 6 March / 27

CCP Estimation. Robert A. Miller. March Dynamic Discrete Choice. Miller (Dynamic Discrete Choice) cemmap 6 March / 27 CCP Estimation Robert A. Miller Dynamic Discrete Choice March 2018 Miller Dynamic Discrete Choice) cemmap 6 March 2018 1 / 27 Criteria for Evaluating Estimators General principles to apply when assessing

More information

Propensity-Score Based Methods for Causal Inference in Observational Studies with Fixed Non-Binary Treatments

Propensity-Score Based Methods for Causal Inference in Observational Studies with Fixed Non-Binary Treatments Propensity-Score Based Methods for Causal Inference in Observational Studies with Fixed Non-Binary reatments Shandong Zhao Department of Statistics, University of California, Irvine, CA 92697 shandonm@uci.edu

More information

Bootstrapping Sensitivity Analysis

Bootstrapping Sensitivity Analysis Bootstrapping Sensitivity Analysis Qingyuan Zhao Department of Statistics, The Wharton School University of Pennsylvania May 23, 2018 @ ACIC Based on: Qingyuan Zhao, Dylan S. Small, and Bhaswar B. Bhattacharya.

More information

Conditional Empirical Likelihood Approach to Statistical Analysis with Missing Data

Conditional Empirical Likelihood Approach to Statistical Analysis with Missing Data Conditional Empirical Likelihood Approach to Statistical Analysis with Missing Data by Peisong Han A dissertation submitted in partial fulfillment of the requirements for the degree of Doctor of Philosophy

More information

arxiv: v1 [stat.me] 5 Apr 2017

arxiv: v1 [stat.me] 5 Apr 2017 Doubly Robust Inference for Targeted Minimum Loss Based Estimation in Randomized Trials with Missing Outcome Data arxiv:1704.01538v1 [stat.me] 5 Apr 2017 Iván Díaz 1 and Mark J. van der Laan 2 1 Division

More information

A Bayesian Nonparametric Approach to Monotone Missing Data in Longitudinal Studies with Informative Missingness

A Bayesian Nonparametric Approach to Monotone Missing Data in Longitudinal Studies with Informative Missingness A Bayesian Nonparametric Approach to Monotone Missing Data in Longitudinal Studies with Informative Missingness A. Linero and M. Daniels UF, UT-Austin SRC 2014, Galveston, TX 1 Background 2 Working model

More information

Using Estimating Equations for Spatially Correlated A

Using Estimating Equations for Spatially Correlated A Using Estimating Equations for Spatially Correlated Areal Data December 8, 2009 Introduction GEEs Spatial Estimating Equations Implementation Simulation Conclusion Typical Problem Assess the relationship

More information

Linear & nonlinear classifiers

Linear & nonlinear classifiers Linear & nonlinear classifiers Machine Learning Hamid Beigy Sharif University of Technology Fall 1396 Hamid Beigy (Sharif University of Technology) Linear & nonlinear classifiers Fall 1396 1 / 44 Table

More information

The BLP Method of Demand Curve Estimation in Industrial Organization

The BLP Method of Demand Curve Estimation in Industrial Organization The BLP Method of Demand Curve Estimation in Industrial Organization 9 March 2006 Eric Rasmusen 1 IDEAS USED 1. Instrumental variables. We use instruments to correct for the endogeneity of prices, the

More information

SIMPLE EXAMPLES OF ESTIMATING CAUSAL EFFECTS USING TARGETED MAXIMUM LIKELIHOOD ESTIMATION

SIMPLE EXAMPLES OF ESTIMATING CAUSAL EFFECTS USING TARGETED MAXIMUM LIKELIHOOD ESTIMATION Johns Hopkins University, Dept. of Biostatistics Working Papers 3-3-2011 SIMPLE EXAMPLES OF ESTIMATING CAUSAL EFFECTS USING TARGETED MAXIMUM LIKELIHOOD ESTIMATION Michael Rosenblum Johns Hopkins Bloomberg

More information

Nonlinear and/or Non-normal Filtering. Jesús Fernández-Villaverde University of Pennsylvania

Nonlinear and/or Non-normal Filtering. Jesús Fernández-Villaverde University of Pennsylvania Nonlinear and/or Non-normal Filtering Jesús Fernández-Villaverde University of Pennsylvania 1 Motivation Nonlinear and/or non-gaussian filtering, smoothing, and forecasting (NLGF) problems are pervasive

More information

Propensity score adjusted method for missing data

Propensity score adjusted method for missing data Graduate Theses and Dissertations Graduate College 2013 Propensity score adjusted method for missing data Minsun Kim Riddles Iowa State University Follow this and additional works at: http://lib.dr.iastate.edu/etd

More information

arxiv: v2 [stat.me] 8 Oct 2018

arxiv: v2 [stat.me] 8 Oct 2018 SENSITIVITY ANALYSIS FOR INVERSE PROBABILITY WEIGHTING ESTIMATORS VIA THE PERCENTILE BOOTSTRAP QINGYUAN ZHAO, DYLAN S. SMALL AND BHASWAR B. BHATTACHARYA arxiv:1711.1186v [stat.me] 8 Oct 018 Department

More information

Improving Efficiency of Inferences in Randomized Clinical Trials Using Auxiliary Covariates

Improving Efficiency of Inferences in Randomized Clinical Trials Using Auxiliary Covariates Improving Efficiency of Inferences in Randomized Clinical Trials Using Auxiliary Covariates Anastasios (Butch) Tsiatis Department of Statistics North Carolina State University http://www.stat.ncsu.edu/

More information

University of California, Berkeley

University of California, Berkeley University of California, Berkeley U.C. Berkeley Division of Biostatistics Working Paper Series Year 24 Paper 153 A Note on Empirical Likelihood Inference of Residual Life Regression Ying Qing Chen Yichuan

More information

Introduction to Econometrics

Introduction to Econometrics Introduction to Econometrics T H I R D E D I T I O N Global Edition James H. Stock Harvard University Mark W. Watson Princeton University Boston Columbus Indianapolis New York San Francisco Upper Saddle

More information

Economic modelling and forecasting

Economic modelling and forecasting Economic modelling and forecasting 2-6 February 2015 Bank of England he generalised method of moments Ole Rummel Adviser, CCBS at the Bank of England ole.rummel@bankofengland.co.uk Outline Classical estimation

More information

Fractional Hot Deck Imputation for Robust Inference Under Item Nonresponse in Survey Sampling

Fractional Hot Deck Imputation for Robust Inference Under Item Nonresponse in Survey Sampling Fractional Hot Deck Imputation for Robust Inference Under Item Nonresponse in Survey Sampling Jae-Kwang Kim 1 Iowa State University June 26, 2013 1 Joint work with Shu Yang Introduction 1 Introduction

More information

Notes, March 4, 2013, R. Dudley Maximum likelihood estimation: actual or supposed

Notes, March 4, 2013, R. Dudley Maximum likelihood estimation: actual or supposed 18.466 Notes, March 4, 2013, R. Dudley Maximum likelihood estimation: actual or supposed 1. MLEs in exponential families Let f(x,θ) for x X and θ Θ be a likelihood function, that is, for present purposes,

More information

Hypothesis Testing Based on the Maximum of Two Statistics from Weighted and Unweighted Estimating Equations

Hypothesis Testing Based on the Maximum of Two Statistics from Weighted and Unweighted Estimating Equations Hypothesis Testing Based on the Maximum of Two Statistics from Weighted and Unweighted Estimating Equations Takeshi Emura and Hisayuki Tsukuma Abstract For testing the regression parameter in multivariate

More information

CS 195-5: Machine Learning Problem Set 1

CS 195-5: Machine Learning Problem Set 1 CS 95-5: Machine Learning Problem Set Douglas Lanman dlanman@brown.edu 7 September Regression Problem Show that the prediction errors y f(x; ŵ) are necessarily uncorrelated with any linear function of

More information