arxiv: v1 [math.st] 20 May 2008

Size: px

Start display at page:

Download "arxiv: v1 [math.st] 20 May 2008"

Elmer Thornton
5 years ago
Views:

1 IMS Collections Probability and Statistics: Essays in Honor of David A Freedman Vol c Institute of Mathematical Statistics, 2008 DOI: / arxiv: v1 mathst 20 May 2008 Higher order influence functions and minimax estimation of nonlinear functionals James Robins 1, Lingling Li 1, Eric Tchetgen 1 and Aad van der Vaart 2 Harvard School of Public Health and Vrije Universiteit Amsterdam Abstract: We present a theory of point and interval estimation for nonlinear functionals in parametric, semi-, and non-parametric models based on higher order influence functions Robins 2004, Section 9; Li et al 2004, Tchetgen et al 2006, Robins et al 2007 Higher order influence functions are higher order U-statistics Our theory extends the first order semiparametric theory of Bicel et al 1993 and van der Vaart 1991 by incorporating the theory of higher order scores considered by Pfanzagl 1990, Small and McLeish 1994 and Lindsay and Waterman 1996 The theory reproduces many previous results, produces new non- n results, and opens up the ability to perform optimal non- n inference in complex high dimensional models We present novel rate-optimal point and interval estimators for various functionals of central importance to biostatistics in settings in which estimation at the expected n rate is not possible, owing to the curse of dimensionality We also show that our higher order influence functions have a multi-robustness property that extends the double robustness property of first order influence functions described by Robins and Rotnitzy 2001 and van der Laan and Robins Introduction Over the past 3 years, we have developed a theory of point and interval estimation for nonlinear functionals ψ F in parametric, semi-, and non-parametric models based on higher order lielihood scores and influence functions that applies equally to both n and non- n problems Robins 16, Section 9, Li et al 9, Tchetgen et al 21, Robins et al 18 The theory reproduces results previously obtained by the modern theory of non-parametric inference, produces many new non- n results, and most importantly opens up the ability to perform non- n inference in complex high dimensional models, such as models for the estimation of the causal effect of time varying treatments in the presence of time varying confounding and informative censoring See Tchetgen et al 22 for examples of the latter Higher order influence functions are higher order U-statistics Our theory extends the first order semiparametric theory of Bicel et al 3 and van der Vaart 25 by incorporating the theory of higher order scores and Bhattacharrya bases considered by Pfanzagl 11, Small and McLeish 20 and Lindsay and Waterman 8 1 Harvard School of Public Health, Department of Biostatistics and Epidemiology, 677 Huntington Ave, Boston, MA 02115, USA, robins@hsphharvardedu; lingling li@hphcorg; etchetge@hsphharvardedu 2 Vrije Universiteit Amsterdam, Department of Mathematics, aad@csvunl AMS 2000 subject classifications: 62G05, 62C20 Keywords and phrases: influence functions, nonparametric, minimax, robust inference, U- statistics 335

2 336 J Robins, L Li, E Tchetgen and A van der Vaart The purpose of this paper is to demonstrate the scope and flexibility of our methodology by deriving rate-optimal point and interval estimators for various functionals that are of central importance to biostatistics We now describe some of these functionals We suppose we observe iid copies of a random vector O = Y, A, X with unnown distribution F on each of n study subjects In this paper, we largely study non-parametric models that place no restrictions on F, other than bounds on both the L p norms and on the smoothness of certain density and conditional expectation functions The variable X represents a random vector of baseline covariates such as age, height, weight, hematocrit, and laboratory measures of lung, renal, liver, brain, and heart function X is assumed to have compact support and a density f X x with respect to the Lebesgue measure in R d, where, in typical applications, d is in the range 5 to 100 A is a binary treatment and Y is a response, higher values of which are desirable Then, in the absence of confounding by additional unmeasured factors, the functional ψ F = E E Y A = 1, X} E E Y A = 0, X} is the mean effect of treatment in the total study population Our results for E E Y A = 1, X} E E Y A = 0, X} follow from results for the functional ψ F = E E Y A = 1, X} based on data AY, A, X rather than Y, A, X If Y is missing for some study subjects, and A is now the indicator that taes the value 1 when Y is observed and zero otherwise, then the functional E E Y A = 1, X} is the marginal mean of Y under the missing at random assumption that the probability P A = 0 X, Y = P A = 0 X that Y is missing does not depend on the unobserved Y Returning to data O = Y, A, X, the functional ψ F = E covy, A X}/E var A X} = E w X E Y A = 1, X E Y A = 0, X}, with w X = var A X}/E var A X} is the variance weighted average treatment effect Our results for E covy, A X} /E var A X} are derived from results for the functionals ψ F = E covy, A X} and ψ F = E E Y X} 2 We note that Robins and van der Vaart s 19 construction of an adaptive confidence set for a regression function E Y X = x depended on being able to construct a confidence interval for ψ F = E E Y X} 2 They constructed an interval for E E Y X} 2 when the marginal distribution of X was nown In this paper, we construct a confidence interval for E E Y X} 2 when the marginal of X is unnown and, in Section 5, use it to obtain an adaptive confidence set for E Y X = x The functional E covy, A X} is the functional E vary X} in the special case in which Y = A wp1 Minimax estimation of vary X has recently been discussed by Wang et al 27 and Cai et al 6 in the setting of non-random X The function γ x = E Y A = 1, X = x E Y A = 0, X = x is the effect of treatment on the subgroup with X = x It is important to estimate the function γ x, in addition to the average treatment effect in the total population, because treatment should be given, since beneficial, to those subjects with γ x > 0 but withheld, since harmful, from subjects with γ x < 0 We show that one can obtain adaptive confidence sets for γ x if one can set confidence intervals for the functional ψ F = E γ X 2 We construct intervals for E γ X 2 under the additional assumption that the data O = Y, A, X came from a randomized trial In a randomized trial, in contrast to an observational study, the randomization

3 Higher order influence functions 337 probabilities, P A = 1 X = E A X are nown by design We plan to report confidence intervals for E γ X 2 with E A X unnown elsewhere All of the above functionals ψ F have a positive semiparametric information bound SIB and thus a first order efficient influence function with a finite variance In fact all the functionals ψ F have efficient influence function 11 IF b F, p F,ψ F if O, b X, F,pX, F,ψ F, where b x, F,px, F are functions of certain conditional expectations, and, for any b x, p x, E F IF b, p, ψ F = E F h 1 O b X b X; F} p X p X; F} where h 1 O is a nown function We refer to functionals in our class as doublyrobust to indicate that IF b F,pF,ψ F continues to have mean zero when either but not both p F is misspecified as p or b F is misspecified as b The functions b x, F,px, F,IF O, b X, F,pX, F,ψ F, and h 1 O differ depending on the functional ψ F of interest As the functionals ψ F are all closely related, we shall use E covy, A X} as a prototype in this introduction For ψ F E covy, A X}, b X; F = E F Y X, p X; F = E F A X, IF b F,pF,ψ F = Y b X; F} A p X; F} ψ F, and h 1 O 1 Whenever a functional ψ F has a non-zero SIB, given sufficiently stringent bounds on L p norms and on smoothness, it is possible to use the estimated first order influence function to construct regular estimators and honest asymptotic confidence intervals whose width shrins at the usual parametric rate of n 1/2 We recall that, by definition, regular estimators are n 1/2 -consistent When X is high dimensional, the a priori smoothness restrictions on p X; F and b X; F necessary for point or interval estimators of E covy, A X} to achieve the parametric rate of n 1/2 are so severe as to be substantively implausible As a consequence, we replace the usual approach based on first order influence functions by one based on higher order influence functions To provide quantitative results, we require a measure of the maximal possible complexity eg smoothness of p ; F and b ; F believed substantively plausible We use Hölder balls for concreteness, although our methods extend to other measures of complexity A function h lies in the Hölder ball Hβ, C, with Hölder exponent β > 0 and radius C > 0, if and only if h is bounded in supremum norm by C and all partial derivatives of hx up to order β exist, and all partial derivatives of order β are Lipschitz with exponent β β and constant C We mae the assumption that b, F, p, F lie in given Hölder balls Hβ b, C b, Hβ p, C p Furthermore, it turns out we must also mae assumptions about the complexity of the function g X; F E F h 1 O X f X X, which we tae to lie in a given Hβ g, C g For ψ F = E covy, A X},g X; F = f X X Using higher order influence functions, we construct regular estimators and honest ie uniform over our model asymptotic confidence intervals for functionals ψ F in our class whose width shrins at the usual parametric rate of n 1/2 whenever β/d β b+β p 2 /d > 1/4 and β g > 0 This result cannot be improved on, since even when g x is nown a priori, β/d > 1/4 is necessary for a regular estimator to exist

4 338 J Robins, L Li, E Tchetgen and A van der Vaart When β/d 1/4 and g x is nown a priori, we have shown using arguments similar to those of Birge and Massart 5 that the minimax rate of convergence for an estimator and minimax rate of shrinage of a confidence interval is n 4β/d 4β/d+1 n 1 2 When g x is unnown, we construct point and interval estimators with this same rate of n 4β/d 4β/d+1 whenever β/d 12 β g /d > β/d β/d 4 β/d1 4β/d + 1, where = βp β b 1 For example, if = 0, β/d = 1/8, we require β g /d exceed 1/22 to achieve the rate n 4β/d 4β/d+1 When the previous inequality does not hold and = 0, we have constructed, in a yet unpublished paper, estimators that converge at rate 13 log nn βg/d m βg/d 2β/d, with β m 4 βd d β 1/ βg /d 1 + 2β/d d β g /d We conjecture that this rate is minimax, up to log factors In this paper, however, the estimators we construct are inefficient when the previous inequality fails to hold, converging at rates less than the conjectured minimax rate of Equation 13 Let us return to the case where Y = A wp1 Then ψ F = E var Y X} and p = b so = 0 Now, for fixed β, Equation 13 converges to log nn 2β/d as β g 0, which agrees up to a log factor with the minimax rate of n 2β/d given by Wang et al 27 and Cai et al 6 under the semiparametric homoscedastic model vary X = σ 2 with equal-spaced non-random X This result might suggest that X being random rather than equal-spaced can result in faster rates of convergence only when the density of X has some smoothness, as quantified here by β g > 0 But this suggestion is not correct Recall that we obtained the rate log nn 2β/d for ψ F = E vary X} as β g 0 under a non-parametric model In Section 4, we construct a simple estimator of σ 2 under the homoscedastic model with X random with unnown density that, for β/d < 1/4, β < 1, and without smoothness restrictions on f X x, converges at the rate n 4β/d 4β/d+1, which is faster than the equal-spaced non-random minimax rate of n 2β/d The paper is organized as follows In Section 2, we define the higher order estimation influence functions of a functional ψ F for F contained in a model M and prove two fundamental theorems the extended information equality theorem and the efficient estimation influence function theorem Further, in the context of a parametric model whose dimension increases with sample size, we outline why estimators based on higher order influence can outperform those based on first order influence functions in high-dimensional models In Section 3, we introduce the class of functionals we study in the remainder of the paper and describe their importance in biostatistics The theory of Section 2, however, is not directly applicable to these functionals because they have first order but not higher order influence functions We show that higher order influence functions fail to exist precisely because the Dirac delta function is not an element of the Hilbert space L 2 of square integrable functions We describe two approaches to overcoming this difficulty The first approach is based on approximating the Dirac delta function by a projection operator onto a subspace of L 2 of dimension n, where n can be as large as

5 Higher order influence functions 339 the square of the sample size n The second approach is based on approximating the functional ψ F by a truncated functional ψ n F The truncated functional has influence functions of all orders, is equal to ψ F if either a n dimensional woring parametric model with n < n 2 for the function b or the function p in Equation 11 is correct, and remains close to ψ F even if both woring models are misspecified We then use higher order influence function based estimators of ψ n F as estimators of ψ F These estimators ψ m,n are asymptotically normal with variance and bias for ψ F depending both on the choice of the dimension n of the woring models and on the order m of the influence function of ψ n F We show that these same estimators ψ m,n can also be obtained under the approximate Dirac delta function approach We derive the optimal estimator ψ mopt, optn β b, β p, β g in the class as a function of the Hölder balls in which the functions b, p, and g are assumed to lie Finally we conclude Section 3 by showing that the estimators ψ m,n have a multi-robustness property that extends the double-robustness property of the first order influence function estimator ψ 1 In Section 4, we consider whether the estimators ψ mopt, optn β b, β p, β g are rate-minimax We show that whenever β/d β b+β p 2 /d > 1/4 and β g > 0, ψ mopt, optn β b, β p, β g is not only rate minimax but is semiparametric efficient Further, by letting the order m = m n of the U-statistic depend on sample size, we construct a single estimator ψ mn,n that is semiparametric efficient for all β/d > 1/4 even when g cannot be estimated at an algebraic rate We show, however, that when β/d < 1/4, ψmopt, optn β b, β p, β g does not in general converge at the minimax rate In Section 41, however, we construct a new estimator ψ eff K J β g, β b, β p that converges at the minimax rate of n 4β/d 4β/d+1 whenever Eq 12 holds In Section 5, we use the results obtained in earlier sections to construct adaptive confidence intervals for a regression function E Y X = x when the marginal of X is unnown and for the treatment effect function and optimal treatment regime in a randomized clinical trial In Section 61, we discuss how to obtain higher order U-statistic point estimators and confidence intervals for functionals τ F that are implicitly defined as the solution to an equation ψ τ, F = 0 In Section 62, we define higher order testing influence functions and efficient scores and describe their relationship to the higher order estimation influence functions and efficient influence functions of Section 2 Finally, in Section 63, we discuss the relationship between the higher order U-statistic point estimators of an implicitly defined functional τ F and higher order testing influence functions Before proceeding, several additional comments are in order In this paper, we investigate the asymptotic properties of our higher order U-statistic point and interval estimators The reader is referred to Li et al 9 for an investigation of the finite sample properties of our procedures through simulation Furthermore due to space limitations we only provide proofs for selected theorems Proofs of the remaining theorems can be found in an accompanying technical report In addition, precise regularity conditions are sometimes omitted from both the statements and the proofs of various theorems This reflects the fact that the goal of this paper is to provide a broad overview of our theory as it currently stands Different subject matter experts will clearly disagree as to the maximum possible complexity of p x; F, b x; F and g x; F Thus it is important to have methods that adapt to the actual smoothness of these functions Elsewhere, we plan to provide point estimators that optimally adapt to unnown smoothness In contrast to point estimators, however, for honest confidence intervals, the degree of possible

6 340 J Robins, L Li, E Tchetgen and A van der Vaart adaption to unnown smoothness is small Therefore we propose that an analyst should report a mapping from a priori smoothness assumptions encoded in the exponents and radii of Hölder balls or in other measures of complexity to the associated optimal 1 α honest confidence intervals proposed in this paper Such a mapping is finally only useful if substantive experts can approximately quantify their informal opinions concerning the shape and wiggliness of p, b, and g using the measure of complexity on offer by the analyst It is an open question which, if any, complexity measure is suitable for this purpose Finally, most of our mathematical results concern rates of convergence We offer only a few results on the constants in front of those rates This is not because the constant is less important than the rate in predicting how a proposed procedure will perform in the moderate sized samples occurring in practice Rather, at present, we do not possess the mathematical tools necessary to obtain useful results concerning constants A more extended discussion of the issue is found in Section 3 of Li et al 9 In the following, we use X n Y n to mean X n = O p Y n and Y n = O p X n ; X n Y n to mean Xn P Y n 1; and Xn Y n X n Y n to respectively mean Yn P X n 0 Xn P Y n 0 as n 2 Theory of higher order influence functions Given n iid observations O O n O i, i = 1,,n} from a model M Θ = F ; θ, θ Θ}, we consider inference on a nonlinear functional ψ θ In general, ψ θ can be infinite dimensional but for now we only consider the one dimensional case In the following all quantities can depend on the sample size n, including the support of O, the parameter space Θ, and the functional ψ θ We generally suppress the dependence on n in the notation We will be particularly interested in models in which the parameter θ is infinite dimensional and θ, Θ, and ψ do not depend on n We also briefly discuss models in which subvectors of θ are finite-dimensional parameters whose dimension n = n 1+ρ increases as power 1 + ρ often ρ > 0 of n and thus θ n, Θ n, and ψ n depend on n Our first tas is to define higher order influence functions Before proceeding we recall some facts about U statistics Consider a function b m o 1, o 2,,o m b o 1, o 2,, o m where we often suppress b s subscript m For integers i 1, i 2,, i m lying in 1,,n}, we define and B m,i1,,i m b m O i1, O i2,,o im b O i1, O i2,,o im n m! V n b m n! i 1 i 2 i m B m,i1,,i m In an abuse of notation, we will consider the following expressions to be equivalent V n B m V n B m,,i1,,i m V n b m Thus V n b m is an m th order U-statistic with ernel b m o 1, o 2,, o m We do not assume that b m o 1, o 2,, o m is symmetric We will write V n B m as B n,m So, suppressing the dependence on n, B m V B m

7 Higher order influence functions 341 Any B m has a unique up to permutation decomposition B m = m s=1 Db s θ under any F; θ as a sum of degenerate U-statistics D s b θ, where degeneracy of D b s θ means that D s b θ = d s b O i1, O i2,,o is ; θ satisfies E θ oi1,,o, O il 1 i l, o, o il+1 i s ; θ = 0, l = 1,,s, d b s where upper and lower case letters, respectively, denote random variables and their possible realizations Let U m θ be the Hilbert space of all U statistics of order m with mean zero and finite variance with inner product defined by covariances with respect to the n-fold product measure F n ; θ Note that any U-statistic B s of order s, s < m, is also an m th order U statistic with D b l θ identically zero for m l > s Since any two degenerate U statistics of different orders are uncorrelated, the U m θ-hilbert space projection of B m on U l θ is l s=1 Db s θ for l < m Thus a U statistic B m is degenerate B m = D m b θ Π θ B m U m 1 θ = 0 B m U m 1 θ m,θ, where Π θ Π θ,m is the projection operator of the Hilbert space U m θ with the dependence on m suppressed when no ambiguity can arise and, for any linear subspace R of U m θ, R m,θ is its orthocomplement in the Hilbert space U m θ Given any B m = V B m, D m b θ is explicitly given by V d m,θ B m } where d m,θ maps B m b O i1, O i2,,o im to 21 1 m t m 1 + t=0 d m,θ B m } = b O i1, O i2,, O im i r1 i r2 i rt E θ b Oi1, O i2,, O im O ir1, O i r2,,o i rt Given a function g ζ, ζ ζ 1,,ζ r } T, define for m = 0, 1, 2,, g \lm ζ g \l1,,l m ζ m g ζ ζ l1 ζ lm ψ θ with l s 1,, r}, where the \ symbol denotes differentiation by the variables occurring to its right and the overbar l m denotes the vector l 1,,l m Given a sufficiently smooth r-dimensional parametric submodel θ ζ mapping ζ R r injectively into Θ, define for θ in the range of θ, ψ \lm θ ζ \l ζ= θ 1,,l 1 θ} m and f \lm O n ; θ f θ ζ, where f O \l ζ= θ 1,,l 1 θ} n;θ i fo i; θ is m the density of O n with respect to a dominating measure That is ψ \lm θ and f \lm O n,θ are higher order derivatives of ψ and f O n ; under a parametric submodel θ ζ, where the model θ has been suppressed in the notation An s th order score associated with the submodel θ ζ is defined to be S s,ls θ f \ls O n ;θ/f O n ;θ, where S s,ls θ is a U-statistic of order s To understand why S s,ls θ is a U statistic we provide formulae for an arbitrary score S s,ls θ in terms of the subject specific scores S l1l m,j θ f /l1l m,j O j ; θ/f j O j ; θ

8 342 J Robins, L Li, E Tchetgen and A van der Vaart j = 1,,n for s = 1, 2, 3 Suppressing the θ-dependence, results in Waterman and Lindsay 8 imply S 1,l1 = S l1,j, j S 2,l2 = j S l1l 2,j + s j S l1,js l2,s, S 3,l3 = j S l1l 2l 3,j + s j S l1l 2,jS l3,s + S l3l 2,jS l1,s + S l1l 3,jS l2,s + s j t S l1,js l2,ss l3,t Note these equations express each S m,lm as a sum of degenerate U-statistics We now define a m th order estimation influence function IF m,ψ θ IF m,ψ θ IF m θ for ψ θ where we suppress the dependence on ψ when no ambiguity will arise Definition 21 A U-statistic IF m θ of order m and finite variance is said to be an m th order estimation influence function for ψ θ if i E θ IF m θ = 0, θ Θ and ii for s = 1, 2,, m and every suitably smooth and regular see Appendix r dimensional parametric submodel θ ζ, r = 1, 2,,m, ψ \ls θ = E θ IF m θ S s,ls θ Estimation influence functions need not always exist, but when they do they are useful for deriving point estimators of ψ with small bias and for deriving confidence interval estimators centered on an estimate of ψ We will generally refer to estimation influence functions simply as influence functions We remar that IF m θ is an influence function under the above definition if and only if it is one under the modified version in which the dimension of the parametric submodel θ ζ is unrestricted A ey result is the following theorem which is related to results of Small and McLeish 20 Theorem 22 Extended information equality theorem Given a m th order influence function IF m θ, for any smooth, regular submodel θ ζ and s m, s E θ IF m θ ζ / ζ l1 ζ ls ζ= θ 1 θ} = ψ \l s θ Thus, if the functionals E θ IF m θ and ψ θ ψ θ have bounded Fréchet derivatives with respect to θ to order m + 1 for a norm, E θ IF m θ + δθ = ψ θ + δθ ψ θ + O δθ m+1 since the functions E θ IF m θ and ψ θ ψ θ of θ have the same Taylor expansion around θ up to order m The proof is in the Appendix Define the m th order tangent space Γ m θ at θ for the model M Θ to be the subspace of U m θ formed by taing the closed linear span of all scores of order m or less as we vary over all regular parametric

9 Higher order influence functions 343 submodels θ ς whose range includes θ of our model M Θ We say a model is locally nonparametric for m th order inference if Γ m θ = U m θ Given any m th order estimation influence function IF m θ, define the m th order efficient estimation influence function to be IF eff m θ = Π θ IF m θ Γ m θ, where Π θ Π θ,m is the U m θ projection operator In the appendix, we prove the following: Theorem 23 Efficient estimation influence function theorem 1 IF eff m θ is unique in the sense that for any two mth order influence functions Π θ IF 1 m θ Γ m θ = Π θ IF 2 m θ Γ m θ as 2 IF eff m θ is a mth order estimation influence function and has variance less than or equal to any other m th order estimation influence function 3 IF m θ is a m th order estimation influence function if and only if } IF m θ IF eff m θ + U m θ ; U m θ Γ m,θ m θ where Γ m,θ m θ is the ortho-complement of Γ m θ in U m θ 4 If IF m θ exists then IF eff s θ exists for s < m and Π θ IF m θ Γ s θ = IF eff s θ 5 If the model M Θ is locally nonparametric, then a there is at most one m th order estimation influence function IF m θ for ψ θ, b IF m θ = IF m 1 θ + IF mm θ where IF m 1 θ = Π m,θ IF m θ U m 1 θ and IF mm θ is a degenerate m th order U-statistic and thus E θ IF m 1 θ IF mm θ = 0 c i Suppose, for a given m 2, IF m 1 θ exists and a ernel if m 1,m 1 oi1,,o im 1 ; θ of IF m 1,m 1 θ has a first order influence function with ernel if 1,ifm 1,m 1o i1,,o im 1 ; O i m ; θ for all o i1,, o im 1 in a set O m 1 which has probability 1 under F m 1, θ Then IF m θ exists and 22 mif m,m θ = V d m,θ if 1,ifm 1,m 1O i1,,o im 1 ; O i m ; θ where the operator d m,θ is given in Equation 21 ii Conversely, if IF m exists then the symmetric ernel if sym m 1,m 1 o i 1,, o im 1 ; θ of IF m 1,m 1 θ has a first order influence function for all o i1,, o im 1 in a set O m 1 which has probability 1 under F m 1, θ Further m 1 d m,θ if 1,if sym m 1,m 1O i1,,o im 1 ; O i m ; θ = ifm,m sym O i1,, O im ; θ

10 344 J Robins, L Li, E Tchetgen and A van der Vaart Remar 24 Pfanzagl 11 previously proved part 5ci for m = 2 Our theorem offers a generalization of his result Note, in part i of 5c, we can always tae the ernel to be the symmetric ernel Remar 25 Provided one nows how to calculate first order influence functions, one can obtain IF 2 θ,,if m θ recursively using part 5c An example of such a calculation is given in Section 322 below Thus part 5c has the interesting implication that even though higher order influence functions are defined in terms of their inner products with higher order scores S m,lm, nevertheless, in locally nonparametric models, one can derive all the higher order influence functions of a functional ψ θ without even nowing how to compute the scores S m,lm for any m > 1 In fact, one need not even be aware of the structure of the scores S m,lm in terms of the subject-specific higher order scores S l1l s,j θ In contrast, in parametric or semiparametric models whose tangent space Γ m θ does not equal the set U m θ of all m th order U statistics, one can often but not always still obtain an inefficient influence IF m θ by applying part 5c of the Theorem However, calculation of the efficient influence function IF eff m θ = Π θ IF m θ Γ m θ by projection generally requires explicit nowledge of the scores S m,lm to derive Γ m θ For this reason, it can be considerably more difficult to analyze certain parametric models with dimension increasing with sample size than to analyze locally nonparametric models We will consider derivation of and projections onto Γ m θ in a forthcoming paper In the current paper, however, we do calculate IF eff 2 θ in one model that is not locally nonparametric so as to provide some sense of the issues that arise Specifically in Section 4, we calculate IF eff 2 θ for a truncated version of the functional E of X is nown E Y X} 2 in a model that assumes the marginal distribution Remar 26 Implications of Theorem 23 for the variance of unbiased estimators Suppose we have n iid draws O =O 1,, O n from Fo; θ, θ Θ, and a U-statistic ψ m of order m n with var θ ψm < for θ Θ satisfying E θ ψm = ψ θ for all θ Θ That is, ψm is unbiased for ψ θ We will use Theorem 23 to generalize a number of well-nown results on minimum variance unbiased estimation to arbitrary models By E θ ψm = ψ θ, we immediately conclude that, viewing ψ m as a th order U- statistic, ψ m ψ θ is a th order estimation influence function for ψ θ for n m By Theorem 23, var θ ψm var θ IF eff m θ We refer to var θ IF eff m θ as the m th order Bhattacharyya variance bound at θ for the parameter ψ θ in model M Θ, as this definition, in a precise analogy to Bicel et al 3 s generalization of the Cramer-Rao variance bound, generalizes Bhattacharyya s 2 variance bound to arbitrary semi- and non- parametric models Indeed our first order Bhattacharyya bound is precisely Bicel et al s 3 generalization of the Cramer-Rao variance bound We shall refer to an m th order U-statistic estimator ψ m as m th order unbiased locally efficient at θ for ψ θ in model M Θ if it is unbiased for ψ θ under the model with variance at θ equal to the m th order Bhattacharyya bound at θ If ψ m is unbiased locally efficient at θ for all θ Θ, we say it is unbiased globally efficient By Theorem 23, var θ IF eff θ var θ IF eff m θ for n > m As

11 Higher order influence functions 345 a consequence if an m th order unbiased locally efficient estimator ψ m,eff exists at θ then, for n m, IF eff θ = IF eff m θ so the m th and th order Bhattacharyya bounds are equal at θ and ψ m,eff is also th order unbiased locally efficient at θ From the fact that for an unbiased estimator ψ m, ψm ψ θ is an m th order influence function, we conclude that the variance of ψm attains the bound var θ IF eff m θ at θ if and only if ψ m ψ θ = IF eff m θ It follows that ψ m is unbiased globally efficient if and only if ψ m ψ θ = IF eff m θ for all θ Θ We thus have proved the following theorem in the direction The direction is immediate Theorem 27 In a model M Θ, there exists an m th order unbiased globally efficient U-statistic estimator of ψ θ, if and only if, for all θ Θ, IF eff m θ+ψ θ is a function ψ m,eff of the data O, not depending on θ In that case, ψ m,eff is the unique unbiased globally efficient estimator In a locally nonparametric model all unbiased m th order estimators are unbiased globally efficient, as there is a unique m th order influence function For example, the usual unbiased estimator σ 2 = n i=1 X i 2 n j=1 j/n} X / n 1 of the variance of a random variable X is a second order U-statistic and thus is a th order unbiased globally efficient U-statistic for 2 in the locally nonparametric model consisting of all distributions under which σ 2 has a finite variance In Section 4 we use the results from this remar to compare the relative efficiencies of competing rate-optimal unbiased estimators in a model which is not locally nonparametric We now describe the main heuristic idea behind using higher order influence functions Technical details are suppressed Consider the estimator 23 ψm = ψ θ + IF eff m based on a sample size n, where θ is an initial rate optimal estimator of θ from a separate independent training sample That is we assume that our actual sample size is N and we randomly split the N observations into two samples: an analysis sample of size n and a training sample of size N n where N n/n = c, 1 > c > 0 We obtain our initial estimate θ from the training sample data Sample splitting has no effect on optimal rates of convergence, although in the form described here does affect constants Throughout the paper, we derive the properties of our estimators conditional on the data in the training sample In a later section, we describe how one can sometimes obtain an optimal constant by choosing N n/n = N ǫ, ǫ > 0 rather than c Remar 28 Note that sample splitting is avoided in most statistical applications by using modern empirical process theory to prove that plug-in estimators such as ψ m = ψ θ + IF eff m }θ= θ θ that estimate θ from the same sample used to calculate IF eff m have nice statistical properties However empirical process theory is not applicable in our setting because we are interested in function classes whose size entropy is so large that they fail to be Donser For this reason we initially believed that explicit sample splitting would be difficult to avoid in our methodology However, in Robins et al 18, we describe a new method that effectively θ

12 346 J Robins, L Li, E Tchetgen and A van der Vaart allows one to use all the data for estimator construction Expanding and evaluating conditionally on the training sample or equivalently on θ, we find by Theorem 22 that the conditional bias θ E θ ψm ψ θ θ = ψ θ ψ θ + E θ IF eff m θ is O p θ θ m+1 which decreases with m provided θ θ < 1 In Theorem 322 below, we show that if f sup o O o; θ f o; θ 0 as θ θ 0, where f o; θ is the density of O under θ and O has probability one under all θ Θ, then θ θ var θ ψm θ var θ θ = IF var θ eff m 1 + O p θ θ IF eff m Now, by Theorem 23, θ var θ IF var θ eff 1 IF eff m 1/n, since, conditional on θ, IF eff 1 θ increases with m Further, θ is the sample average of iid random variables To proceed further we shall need to be more explicit about the model M Θ For now, we consider finite-dimensional parametric models whose dimension n increases with sample size That is θ θ n depends on n and the dimension of Θ Θ n is n Suppose n n γ, γ 0 Let θ n be the maximum lielihood estimator of θ If n increases slower than the sample size ie, γ < 1, then, a under regularity conditions, θ n θ n = O p n/n} 1/2 = O p n γ θ with the usual Euclidean norm in R n ; and b IF var θ eff m, although increasing with m, remains order 1/n; as a consequence, if m is chosen greater than the solution m to n m γ = n 1/2, the bias of ψ m will be o p n 1/2, the rate of convergence will be the usual parametric rate of n 1/2, and thus, for n sufficiently large, the squared bias of ψ m will be less than the variance As a consequence, as discussed in Section 325, we can construct honest ie uniform over θ n Θ n asymptotic confidence intervals centered at ψ m with width of order n 1/2 Here is a concrete example Example Suppose O = Y, X with Y Bernoulli and with X having a density with respect to the uniform measure µ on the unit cube 0, 1 d in R d Suppose ψ = E E Y X 2 Let z l } z l x;1, 2, } be a countable, linearly independent, sequence of either spline, polynomial, or compact wavelet basis functions dense in L 2 µ Set z x = z 1 x,, z x T We assume E Y X = x f X x 1 b x; η n 1 + exp η T n z n x ; η n N n f x; ω n c ω n exp ω T n z n x ; ω n W n } },,

13 Higher order influence functions 347 where c ω n is a normalizing constant and N n and W n are open bounded subsets of R n and R n Hence, Θ n = N n W n has dimension n = n + n and ψ θ = ψ n θ n = b x; 2 η n f x; ω n dµ x He 7 and Portnoy 12 prove that, under regularity conditions, θ n θ n = O p n/n} 1/2 when n = n γ n Below we shall see that θ IF var θ eff m θ 1/n for n γ n Consider next models whose dimension n n γ increases faster than n ie, γ > 1 In such models, the MLE θ n is generally inconsistent and indeed there may exist no consistent estimator of θ n In that case, θ n θ n fails to be o p 1 and the conditional bias E θ ψm ψ θ θ may not decrease with m In order to guarantee consistent estimators of θ n exist, it is necessary to place further a priori restrictions on the complexity of Θ n Typical examples of complexity-reducing assumptions would be an ǫ sparseness assumption that only n ǫ, 0 < ǫ < 1, of the n parameters are non-zero or a smoothness assumption that specifies that the rate of decrease of the j th component of θ n is equal to 1/j raised to a given positive power Even after imposing such complexity-reducing assumptions, ψ θ ψ n θ n may not be estimable at rate n 1/2 For instance consider the previous example but now with γ and γ exceeding 1, so n = n γ n, n = n γ n and n = n + n n γ n with γ = max γ, γ Consider the norms η n ω n 1/2 = b 2 x; η n dµ x}, p = f } 1/p p x; ω n dµ x and θ p = + ω n p η n Suppose, under a particular smoothness assumption, optimal rate estimators η n and ω n of η n and ω n satisfy η n η n = O p n γη and ω n ω n = O p n γω for some γ η > 0, γ ω > 0 and all p 2 Hence, p θ θ p = O p max n γη, n γω } For γ > 1, based on arguments given later, we expect that ψm ψ θ θ and var θ E θ ψm ψ θ θ nγ 1m 1 n = O p η n η n 2 ω n ω n = O p n 2γη m 1γω = O p θ θ m+1 m 1 m 1 m 1 To find the estimator ψ mbest in the class ψ m with optimal rate of convergence, let m = 1 + be the value of m that equates the order n 4γη 2m 1γω 1 4γη γ 1+2γ ω of the squared bias and the order nγ 1m 1 n of the variance Then m best = m if the order n 4γη 2m 1γω +n γ 1m 1 1 of the mean squared error at m is less

14 348 J Robins, L Li, E Tchetgen and A van der Vaart than or equal to that at m Otherwise, m best = m The rate of convergence of ψ mbest will often be slower than n 1/2 Note m best = 1 whenever γ > 2, regardless of γ η and γ ω By using the estimator ψ m rather than ψ mbest, we can guarantee that the variance asymptotically dominates bias and construct honest ie uniform over θ n Θ n asymptotic confidence intervals centered at ψ m Of course, the sample size n at which, for all θ n Θ n, the finite sample coverage of the intervals discussed above is close to the asymptotic ie nominal coverage is generally unnown and could be very large For this reason, a better, but unfortunately as yet technically out of reach, approach to confidence interval construction is discussed in Section 325 In contrast to the case of parametric models of increasing dimension, in the infinite dimensional models which we consider in the following section, the functionals ψ θ of interest have first order influence functions IF 1 θ but do not have higher order influence functions As a consequence, an initial truncation step is needed before we can apply the approach outlined in the preceding paragraph Finally, even in the case of parametric models with n n and complexity reducing assumptions imposed,, when the minimax rate for estimation of ψ θ is slower than n 1/2, the optimal estimator ψ mbest in the class ψ m will generally not be rate minimax See Section 326 and Sections 411 for additional discussion 3 Inference for a class of doubly robust functionals 31 The class of functionals In this Section we consider models in which the parameter θ is infinite dimensional and θ, Θ, and ψ do not depend on n We mae the following three assumptions Ai Aiii: Ai The data O includes a vector X, where, for all θ Θ, the distribution of X is supported on the unit cube 0, 1 d or more generally a compact set in R d and has a density f x with respect to the Lebesgue measure Further Θ = Θ 1 Θ 2 where θ 1 Θ 1 governs the marginal law of X and θ 2 Θ 2 governs the conditional distribution of O X Aii The parameter θ 2 contains components b = b and p = p, b : 0, 1 d R and p : 0, 1 d R, such that the functional ψ θ of interest has a first order influence function IF 1,ψ θ = V IF 1,ψ θ, where IF 1,ψ θ = H b, p ψ θ, with H b, p h O, b X,pX b XpXh 1 O + b Xh 2 O + p Xh 3 O + h 4 O BPH 1 + BH 2 + PH 3 + H 4, and the nown functions h 1,h 2,h 3, h 4 do not depend on θ Aiii a Θ 2b Θ 2p Θ 2 where Θ 2b and Θ 2p are the parameter spaces for the functions b and p Furthermore the sets Θ 2b and Θ 2p are dense in L 2 F X x at each θ1 Θ 1 or b b = p, h 3 O = h 2 O wp1, and Θ 2b Θ 2 is dense in L 2 F X x at each θ1 Θ 1

15 Higher order influence functions 349 Remar 31 Aiiib can be viewed as a special case of Aiiia as discussed in Example 1a below, so we need only prove results under assumption Aiiia Assumptions Ai Aiii have a number of important implications that we summarize in a Theorem and two Lemmas Theorem 32 Double-robustness Assume Ai Aiii hold, and recall p and b are elements of θ Then E θ H b, p = E θ H b, p = E θ H b, p = ψ θ for all p, b Θ 2p Θ 2b, θ Θ Proof E θ H b, p E θ H b, p = E θ H 1 p X + H 2 } b X b X} and E θ H b, p E θ H b, p = E θ H 1 b X + H 3 } p X p X} The theorem then follows from part 1 of the following lemma Theorem 32 states that H, has mean ψ θ under F ; θ even when p is misspecified as p or b is misspecified as b We refer to the functional ψ θ as doubly robust because of this property The next lemma shows that H b, p is not unbiased if both b and p are simultaneously misspecified That is, E θ H b, p ψ θ Lemma 33 Assume Ai Aiii hold Then 1 E θ H 1 B + H 3 } X = E θ H 1 P + H 2 } X = 0 2 E θ H b, p E θ H b, p = E θ B B P P H 1 and ψ θ E θ H b, p = E θ BPH 1 + H 4 Proof Part 1: By assumptions Ai and Aiiia we have paths θ l t, l = 1, 2,, in our model with θ l 0 = θ and p l t = p l x; t = p x + tc l x, b l x; t = b x, F l x; t = F x for l = 1, 2,, where the sequence c l is dense in L 2 F X x Let S l θ be the score for path θ l t at t = 0 Then by ψ θl t = E θl t H b, p l t dψ θl t /dt t=0 = E θ H 1 B + H 3 } c l X + E θ H b, ps l θ By IF 1,ψ θ=h b, p ψ θ, dψ θl t /dt t=0 = E θ H b, ps l Thus E H 1 B + H 3 } c l X = 0 But c l } is dense in L 2 F 0 X so E H 1 B + H 3 X = 0 An analogous argument proves E θ H 1 P + H 2 } X = 0 Part 2: E θ H b, p E θ H b, p = E θ B P BPH 1 + B BH 2 + P PH 3 = E θ B P BPH 1 B B PH 1 P PBH 1 = E θ B B P P H 1, where the second equality is by part 1 Choosing P = B = 0 wp1 completes the proof of the theorem since then E θ H b, p = E θ H 4 Below we will need the following partial converse to Lemma 33

16 350 J Robins, L Li, E Tchetgen and A van der Vaart Lemma 34 Let Θ 2b, Θ 2p, Θ 1 and Θ and H b, p be as defined in Ai Aiiia Suppose that E θ H 1 B + H 3 } X = E θ H 1 P + H 2 } X = 0 and ψ θ = E θ H b, p Then V H b, p ψ θ is the first order influence function of ψ θ Proof The influence function of the functional E θ H b, p for nown functions b, p is V H b, p E θ H b, p Thus by the linearity of first order influence functions, the Lemma is true if and only if for each θ 0 Θ, the functional τ b, p = E θ0 H b, p with θ 0 fixed has influence function equal to 0 wp1 at b, p = b 0, p 0 θ 0 That the influence function is equal to 0 follows from the fact that, under the assumptions of the Lemma, for sets c l } and d l } dense in L 2 F 0 X, de θ0 H b 0 X + tc l X,p 0 X + td l X /dt t=0 = E θ H 1 b 0 X + H 3 } d l X + E θ H 1 p 0 X + H 2 } c l X = 0 Results of Ritov and Bicel 14 and Robins and Ritov 15 imply it is not possible to construct honest asymptotic confidence intervals for ψ θ whose width shrins to 0 as n if b and p are too rough Therefore we also place a priori bounds on their roughness Our bounds will be based on the following definition Definition 35 A function h with domain 0, 1 d is said to belong to a Hölder ball Hβ, C, with Hölder exponent β > 0 and radius C > 0, if and only if h is uniformly bounded by C, all partial derivatives of h up to order β exist and are bounded, and all partial derivatives β of order β satisfy sup β hx + δx β hx C δx β β x,x+δx 0,1 d We note that the L p, 2 < p < and L rates of convergence for estimation of a marginal density or conditional expectation h Hβ, C are O n β 2β+d and O n log n β 2β+d respectively We refer to an estimator attaining these rates as rate optimal We mae the following fourth assumption: Aiv We assume b, p, and g lie in given Hölder balls Hβ b, C b, Hβ p, C p, Hβ g, C g where 33 g x E H 1 X = x} f x Furthermore we assume g X > σ g > 0 wp1 Finally we assume, as can always be arranged by a suitable choice of estimator, that the initial training sample estimators b, p, and ĝ are rate optimal, have more than maxβ b, β g, β p } derivatives, and have L norm bounded by a constant c Further inf x 0,1 d ĝ x > σ g The reason for the restrictions on g will become clear below The restrictions Ai Aiv are the only restrictions common to all functionals and models in the class Additional model and/or functional specific restrictions will be given below To motivate our interest in such a class of functionals and models we provide a number of examples In each case, one can use Lemma 34 to verify that the

17 Higher order influence functions 351 influence function of ψ θ is as given All but Examples 3 and 4 are examples of locally nonparametric models Example 1 Suppose O=A, Y, X with A and Y univariate random variables Example 1a Expected product of conditional expectations Let ψ θ = E θ p X bx where b X = E θ Y X, p X = E θ A X In this model IF 1,ψ θ = p X bx ψ θ + p X Y b X} + b X A p X} so H 1 = 1, H 2 = A, H 3 = Y, H 4 = 0 We also consider the special case of this model where A = Y with probability one wp1 Then, as in assumption Aiiib, b X = p X wp1, H 2 = H 3 wp1 Then ψ θ = E θ b 2 X In Section 5, we show how our confidence interval for E θ b 2 X can be used to obtain an adaptive confidence interval for the regression function b Example 1b Expected conditional covariance has influence function ψ θ = E θ AY E θ p X bx = E θ cov θ Y, A X} AY p X b X + p X Y b X} + b X A p X}} ψ θ, so H 1 = 1, H 2 = A, H 3 = Y, H 4 = AY Example 1c below shows that a confidence interval and point estimators for E θ cov θ Y, A X} can be used to obtain confidence intervals and point estimator for the variance weighted average treatment effect in an observational study Example 1c Variance-weighted average treatment effect Suppose, in an observational study, O = Y, A, X}, A is a binary treatment taing values in 0, 1}, Y is a univariate response and X is a vector of pretreatment covariates Consider the parameter τ θ given by: 34 τ θ = E θ cov θ Y, A X E θ var θ A X = E θ cov θ Y, A X E θ π X 1 π X}, where π X = pr A = 1 X is often referred to as the propensity score We are interested in τ θ for several reasons First, in the absence of confounding by unmeasured factors, τ θ is the variance-weighted average treatment effect since τ θ can be rewritten as E θ w θ Xγ X; θ where w θ X = γ x; θ = E θ Y A = 1, X = x E θ Y A = 0, X = x var θa X E θ var θ A X and is the average conditional treatment effect at level x of the covariates Second, under the semiparametric model 35 γ X; θ = υ θ wp1 that assumes the treatment effect does not depend on X, τ θ = υ θ In Remar 42, we briefly consider inference on τ θ under model 35 However since the model 35 may not hold and therefore the parameter υ θ may be undefined, our main goal is to mae inference on τ θ without imposing 35

18 352 J Robins, L Li, E Tchetgen and A van der Vaart Now if for any τ R, we define ψ τ, θ to be ψ τ, θ = E θ Y τ E θ Y τ X} A E θ A X}, with Y τ = Y τa, it is easy to verify that τ θ may also be characterized as the solution τ = τ θ to the equation ψ τ, θ = 0 Thus inference on τ θ is easily obtained from inference on ψ τ, θ In particular a 1 α confidence set for τ θ is the set of τ such that a 1 α CI interval for ψ τ, θ contains 0 Therefore, with no loss of generality, we consider the construction of a 1 α CI for ψ τ, θ for a fixed value τ = τ, and write Y = Y τ and ψ θ = ψ τ, θ Thus ψ θ = E θ cov θ Y, A X} and we are in the setting of Example 1b In Section 6, we show the rates at which the width of the confidence sets for ψ τ, θ and for τ θ shrin with n are equal Example 2a Missing at random Suppose O = AY, A, X where Y is an outcome that is not always observed, A is the binary missingness indicator, X is a d-dimensional vector of always observed continuous covariates, and let b X = EY A = 1, X, π X = PA = 1 X be the propensity score, and p X = 1/π X We suppose π X > σ > 0 and define 36 ψ θ = E θ AY π X = E θ b X Interest in ψ θ lies in the fact that ψ θ is the marginal mean of Y under the missing equivalently, coarsening at random MAR assumption that PA = 1 X, Y = π X In this model IF 1,ψ θ = Ap XY bx + bx ψ θ so H 1 = A, H 2 = 1, H 3 = AY, H 4 = 0 Note that if one has assumed a priori that f X and p X lay in Hölder balls with respective exponents β fx and β p, then β g would be min β fx, β p, since g X = f X X/p X Example 2b Missing not-at random Consider again the setting of Example 2a but we no longer assume MAR Rather we assume PA = 1 X, Y = 1 + exp γ X + αy }} 1 may depend on Y, where now γ X is an unnown function and α is a nown constant to be later varied in a sensitivity analysis In this case the marginal mean of Y is given by ψ θ = E θ AY 1 + exp γ X + αy } Robins and Rotnitzy 17 proved this model places no restrictions on F o and derived where, now, IF 1,ψ θ = A 1 + exp αy }px} Y b X } + b X ψ θ b X = E Y exp αy } A = 1, X/E exp αy } A = 1, X, and p X = exp γ X} Thus H 1 = exp αy } A, H 2 = 1 A}, H 3 = AY exp αy }, and H 4 = AY When α = 0 this provides an alternate parametrization of Example 2a Example 3 Marginal structural models and the average treatment effect Consider the set-up of Example 1c including the non-identifiable assumption of no

19 Higher order influence functions 353 unmeasured confounders, except now A is discrete with possibly many levels and f a X > δ > 0 wp1 A marginal structural model assumes E fx E θ Y A = a, X} = d a, υ θ, where d a, υ is a nown function and υ θ is an unnown vector parameter of dimension d When A is dichotomous with a 0, 1} and d a, υ = υ 1 + υ 2 a, then υ 2 θ is the average treatment effect parameter Let f a be any density with the same support as A and let s a be a d -vector function, both chosen by the analyst Then υ θ is identified as the assumed unique value of υ satisfying ψ υ θ E θ s O, A, υ f A = 0, f A X where s O, a, υ = Y d a, υ} s a Thus a 1 α confidence set for υ θ is the set of vectors υ such that a 1 α CI for ψ υ θ contains 0 Therefore, with no loss of generality, we consider the construction of a 1 α CI for the d vector functional ψ θ θ for a fixed value υ and define h O, A s O, a, υ and ψ υ b a, X E θ h O, a A = a, X Then θ has influence function ψ υ IF 1 θ = f A h O, A b A, X } + f A X b a, X df a ψ θ Next define p a, X = 1/f a X,ψ θ, a = E fx b a, X Then IF 1 θ is the integral IF 1 θ = df aif 1 a, θ, IF 1 a, θ = H 1 apa, Xb a, X + H 2 aba, X + H 3 apa, X ψ θ, a, H 1 a = I A = a,h 2 a = 1, H 3 a = I A = aho, a It follows that IF 1 θ is a integral over a A of influence functions IF 1 a, θ for parameters ψ θ, a in our class with H 4 a = 0 Thus we can estimate ψ θ by df a ψ a, where ψ a is an estimator of ψ θ, a If the support of A is of greater cardinality than d, the model is not locally nonparametric Different choices for s a and f a for which / υ T} E θ s O, A, υ f A fa X is invertible may result in difference influence functions All yield the same rate of convergence, although the constants differ See Remar 25 above Extension of our methods to continuous A will be treated elsewhere Example 4 Confidence intervals for the optimal treatment strategy in a randomized clinical trial Consider a randomized clinical trial with data O = Y, Y, A, X}, A is a binary treatment taing values in 0, 1}, Y and Y are univariate responses, X is a vector of pretreatment covariates In a randomized trial, the randomization probabilities π 0 X = P A = 1 X are nown by design Let b x = E θ Y A = 1, X = x E θ Y A = 0, X = x and p x = E θ Y A = 1, X = x E θ Y A = 0, X = x be the average treatment effects at level X = x on Y and Y We assume Y and Y have been coded so that positive treatment effects are desirable Let ψ θ = E b XpX Because the model is not locally nonparametric there exists more than a single first order influence function Indeed, for any given function c, IF 1,ψ θ, c =b XpX ψ θ + b X Y Ap X} + p X Y Ab X} A π 0 X} σ 2 0 X + c X A π 0 X},

20 354 J Robins, L Li, E Tchetgen and A van der Vaart with σ 2 0 X = π 0 X 1 π 0 X} is an influence function in our class provided it is square integrable with H 1 = 1 2A A π 0 X} σ 2 0 X, H 2 = A π 0 X}σ 2 0 X Y, H 3 = A π 0 X}σ 2 0 XY, H 4 = c X A π 0 X} As c is varied, one obtains all first order influence functions We do not discuss the efficient choice of c in this paper Our interest lies in the special case where Y = Y wp1 so there is but one response of interest and thus, as in assumption Aiiib, b = p, H 2 = H 3 and we construct confidence interval for ψ θ = E b 2 X In Section 5 we describe how we can use a confidence interval for ψ θ = E b 2 X to obtain confidence intervals for the treatment effect function b x and, most importantly, for the optimal treatment strategy d opt x = I b x > 0 under which a subject with covariate value x is treated if and only if the treatment effect b x is positive ie, d opt x = 1 32 Higher order influence functions for our model 321 Dirac ernels, truncation bias, and a truncated parameter In all of our examples the functions p and b are functions of conditional expectations given the continuous random variable X It is well nown that the associated point-evaluation functional p x and b x do not have first order influence functions It then follows from part 5c of Theorem 23 and the dependence of IF 1,ψ θ = V if 1,ψ O i1 ; θ on b and p evaluated at the point X that, in none of our examples, does ψ θ have a second or higher order influence function As a precise understanding of the reason for the nonexistence of higher order influence functions for ψ θ is fundamental to our approach, we now use part 5c of Theorem 23 to prove that IF 2,ψ θ does not exist by showing that the functional if 1,ψ o; θ does not have a first order influence function V if 1,if1,ψ o; O; θ Let F X and f X = f X denote the marginal CDF and density of X In this proof, we do not assume that p and b are functions of conditional expectations Rather we only assume that our functional satisfies assumptions Ai-Aiv Consider paths parametric submodels θ l t such that θ l 0 = θ satisfying p l t p l x, t p x + tc l x, b l t b l x, t b x + ta l x, where the sequences c l and a l,l = 1, 2,, are each dense in L 2 F X x Let s l O; θ = s l O X; θ + s l X; θ, s l O X; θ, and s l X; θ denote the overall, conditional, and marginal scores lnf O; θ l 0 / t, lnf O X; θ l 0 / t, lnf X X; θ l 0 / t By linearity, if 1,ψ o; θ has an influence function only if the functionals b x and p x have one as well Now by differentiating the identity E θl t H 1b l X, t + H 3 } X = x = 0

21 with respect to t and evaluating at t = 0, we have Higher order influence functions 355 E θ H 1 b X + H 3 }}s l O X X = x = E θ H 1 X = x a l x However, by definition, b x has an influence function V if 1,bx O; θ at θ only if for l = 1, 2,, both b l x, t / t t=0 = a l x equals E θ if1,bx O; θ s l O; θ and E θ if1,bx O; θ = 0 Thus if if 1,bx O; θ exists, it must satisfy E θ H 1 b X + H 3 } s l O X X = x = E θ H 1 X = x E θ if1,bx O; θ s l O; θ Without loss of generality, suppose H 1 0 wp1 Now if we could find a ernel K fx, x, X such that 37 then r x = E fx K fx, x, X r X K fx, x, x r x f X x dx for all r L 2 F X if 1,bx O ; θ E θ H 1 X = x} 1/2 K fx, x, X E θ H 1 X} 1/2 H 1 b X + H 3 } would be an influence function since E E θ H 1 X = x E θ H 1 X = x} 1/2 K fx, x, X θ E θ H 1 X} 1/2 H 1 b X + H 3 }s l O; θ = E H 1 X = x 1/2 E θ K fx, x, X E θ H 1 X} 1/2 H 1 b X + H 3 } s l O X + s l X} = E H 1 X = x 1/2 E fx E θ KfX, x, X E θ H 1 X} 1/2 H 1 b X + H 3 } s l O X X = E θ H 1 b X + H 3 s l O X X = x By an analogous argument if 1,px O; θ = E θ H 1 X = x} 1/2 K fx, x, X E θ H 1 X} 1/2 H 1 p X + H 2 } would be an influence function Indeed since the sequences c l } and a l } are dense the existence of such a ernel is also a necessary condition for if 1,bx O; θ and if 1,px O; θ to exist and thus for if 1,ψ o; θ to exist A ernel satisfying Equation 37 is referred to as the Dirac delta function with respect to the measure df X x and would clearly have to satisfy 38 K fx, x i1, x i2 = 0 if x i2 x i1 were it to exist Of course a ernel satisfying Equation 37 is nown not to exist in L 2 F X L 2 F X We conclude that if 1,ψ o; θ does not have an influence function and therefore IF 2,2,ψ θ does not exist A formal approach To motivate how one might overcome this difficulty, we note that ernels satisfying Equation 37 exist as generalized functions or ernels also nown as Schwartz functions or distributions We shall formally derive higher order influence func- }

22 356 J Robins, L Li, E Tchetgen and A van der Vaart tions that appear to be elements of the space of generalized functions However, we use these calculations only as motivation for statistical procedures based on ordinary ernels living in L 2 F X L 2 F X Thus it does not matter whether these formal calculations could be made rigorous with appropriate redefinitions Rather we can simply regard the following as results obtained by applying a formal calculus to part 5c of Theorem 23 that adds to the usual calculus additional identities licensed by Equations 37 and 38 We will need the fact that, for any function v x; θ, Eq 38 implies that We now show that v x; θk fx, x, X = v X; θk fx, x, X IF 2,2,ψ θ V IF 2,2,ψ,i1,i 2 θ = Π θ,2 V if 1,if1,ψO i1 ; O i 2 ; θ /2 U 2,θ 1 θ would formally have U-statistic ernel ε IF 2,2,ψ,i1,i 2 θ = b,i1 θe θ H 1 X i1 1 2 K fx, X i1, X i2 39 E θ H 1 X i2, 1 2 ε p,i2 θ with ε b,i1 θ = B i1 H 1,i1 + H 3,i1 }, ε p,i2 θ = H 1,i2 P i2 + H 2,i2 } To show Equation 39 note, by and we have where H b, p/ P = BPH 1 + BH 2 + PH 3 + H 4 }/ P = BH 1 + H 3 H b, p/ B = PH 1 + H 2, if 1,if1,ψO i1 ; O i 2 ; θ = Q 2,b,i2 θ + Q 2,p,i2 θ IF 1,ψ,i2 θ, Q 2,p,i2 θ B i1 H 1,i1 + H 3,i1 }if 1,pXi1 O i 2 ; θ = B i1 H 1,i1 + H 3,i1 }E θ H 1 X i1 1 2 K fx, X i1, X i2 E θ H 1 X i2 1 2 P i2 H 1,i2 + H 2,i2 } = ε b,i1 θ E θ H 1 X i1 1 2 K fx, X i1, X i2 E θ H 1 X i2 1 2 ε p,i2 θ Q 2,b,i2 θ P i1 H 1,i1 + H 2,i1 }if 1,bXi1 O i 2 ; θ = ε b,i2 θ E θ H 1 X i2 1 2 K fx, X i2, X i1 E θ H 1 X i1 1 2 ε p,i1 θ Thus, by part 5c of Theorem 23, 1 } IF 2,2,ψ θ = Π θ,2 Q 2 2,p,i2 θ + Q 2,b,i2 θ + IF 1,ψ,i2 U 2,θ 1 θ = 1 } Q 2 2,p,i2 θ + Q 2,b,i2 θ = Q 2,p,i2 θ V RHS of Equation 39 since IF 1,ψ,i2 is a function of only one subject s data and Q 2,p,i2 θ and Q 2,b,i2 θ are the same up to a permutation that exchanges i 2 with i 1 To obtain IF 3,3,ψ,im θ, one must derive the influence function if 1,if2,2,ψO i1,o i2 ; O i 3 ; θ of if 2,2,ψ O i1, O i2 ; θ The formula for IF 3,3,ψ,im θ is given in Equation 313 A detailed derivation is given in our technical report Here we simply note that the only essentially new point is that we now require the

23 Higher order influence functions 357 influence function of K fx, X i1, X i2, which, as shown next, is given by 310 IF 1,KfX, X i1,x i2 = KfX, X i1, X i3 K fx, X i3, X i2 K fx, X i1, X i2 } To see that if Equation 37 held, Equation 310 would hold, note that for any path θ t with θ 0 = f X, h x = x, X E θt K θt, i1 hx i1 Differentiating with respect to t and evaluating at t = 0, we have 0 = E θ K fx, x, XhXS θ + E θ t K θt, x, X i 1 t=0 }hx i1 Hence it suffices to show that E θ K fx, x, XhXS θ = E θ E θ K fx, x, X i2 K fx, X i2, X i1 S i2 θ X i1 }}hx i1 But, by Equation 37, E θ E θ K fx, x, X i2 K fx, X i2, X i1 S i2 θ X i1 }}hx i1 = E θ K fx, x, X i1 S i1 θhx i1 Feasible estimators These formal calculations motivate a truncated Dirac approach to estimate ψ θ Let z l } z l X;1, 2, } be a countable sequence of nown basis functions with dense span in L 2 F X and define z X T = z 1 X,,z X Define K fx, X i1, X i2 z X i1 T E fx z X z X T} 1 z X i2 to be the projection ernel in L 2 F X onto the subspace lin z X} η T z x ;η R, η T z x L 2 F X } spanned by the elements of z X That is, for any h x, Π fx h X lin z x} = E fx K fx, x, XhX = z x T E fx z X z X T} 1 EfX z X hx Then we can view K fx, x i1, x i2 as a truncated at approximation to K fx, x i1, x i2 that is in L 2 F X L 2 F X and satisfies Equation 37 for all r x lin z X} Then a natural idea would be to substitute IF 2,2,ψ,i 1,i 2 θ with, for example, ε b,i 1 θ H E θ 1 X i1 1 2 K fx, X i 1, X i2 H E θ 1 X i2 1 2 ε p,i2 θ } ε b,i1 θ = H E θ 1 X i1 1 2 Bi1 H 1,i1 + H 3,i1 for the generalized function IF 2,2,ψ,i1,i 2 θ based on Equation 39 resulting in

24 358 J Robins, L Li, E Tchetgen and A van der Vaart the feasible second U-statistic estimator ψ 2 = ψ θ + IF 1,ψ θ + IF 2,2,ψθ θ where IF 2,2,ψ θ V IF 2,2,ψ,i 1,i 2 θ To avoid having to do a matrix inversion it would be convenient, when possible, to choose z X = ϕ X / fx X} 1/2 where ϕ1 X,ϕ 2 X, is a complete orthonormal basis with respect to Lebesgue measure in R d Then E fx z X z X T = I so K fx, X i 1, X i2 = z X i1 T z X i2 = K Leb, X i1, X i2 fx X i1 f X X i2 } 1/2, where K Leb, X i1, X i2 ϕ X i1 T ϕ X i2 This choice corresponds to having taen K fx, X i1, X i2 = K Leb, X i1, X i2 /f X X i1 f X X i2 } 1/2 in our formal calculations where K Leb, X i1, X i2 is the Dirac delta function with respect to Lebesgue measure In that case with G g X f X XE θ H 1 X and Ĝ ĝ X f X XE θ H 1 X, 311 IF 2,2,ψ,i1,i 2 θ = ε b,i1 θ g X i1 1 2 K Leb, X i1, X i2 g X i2 1 2 ε p,i2 θ 312 IF 2,2,ψ,i 1,i 2 θ = ε b,i1 θ ĝ X i1 1 2 K Leb, X i1, X i2 ĝ X i2 1 2 ε p,i2 θ Unfortunately, we show later that this choice for z has good statistical properties only when f X is nown to lie in a Holder ball with exponent exceeding max β b, β p } In our technical report we show one can proceed by induction to formally obtain that for m = 3, 4,, 313 IF m,m,ψ,im θ = ε b,i1 θ ς X i1 1 2 j s=1 H 1,is+1 m 2 j=0 cm, j ςx is+1 K f X, Xis, X is+1 Xij+1, X im K fx, ς X i m 1 2 ε p,im θ where ς X = E H 1 X and cm, j = m 2 j 1 j+1, which we then use to obtain IF θ and ψ m = ψ m,m,ψ,i m 2 + m j=3 IF j,j,ψ θ

25 Higher order influence functions 359 Statistical properties We shall prove below that the estimator var θ ψ m ψ m 1 n max has variance m 1 1,, n when z l X;l = 1, 2, } is a compact wavelet basis Robins et al 18 proves this result for more general bases We also prove that the bias E ψ θ m ψ θ = TB θ + EB m θ, of ψ m is the sum of a truncation bias term of order TB θ = O p β b+β p/d for a basis z l X;l = 1, 2, } that provides optimal rate approximation for Hölder balls and an estimation bias term of order EB m θ = O p P P } B B } m 1 G Ĝ Ĝ = O p n m 1βg 2βg+d β b 2β b +d βp 2βp+d θ Note this estimation bias is O P θ m+1 It gets its name from the fact that, unlie the truncation bias, it would be exactly zero if the initial estimator θ happened to equal θ Thus, the U-statistic estimator ψ m for our functional ψ θ which does not admit a second order influence function differs from the U-statistic estimators ψ m of Eq 23 for functionals that admit second order influence functions θ in that, owing to truncation bias, the total bias of is not O p θ m+1 ψ = m The choice of determines the trade-off between the variance and truncation bias As with n fixed, var ψ θ m and TB θ 0 Thus, we can heuristically view the non-existent estimator ψ m = as the choice of that results in θ no truncation bias and therefore a total bias of O p θ m+1 at the expense of an infinite variance Writing = n = n ρ, the order of the asymptotic MSE of ψ m is minimized at the value of ρ for which order of the variance equals the order of the sum of the truncation and estimation bias Remar 36 The models of Examples 1-4 exhibit a spectrum of different lielihood functions and therefore a spectrum of different first order and higher order scores Nonetheless, because the first order influence functions of the functionals ψ θ share a common structure, we were able to use part 5c of Theorem 23 to formally derive IF m,m,ψ,im θ and, thus, the feasible IF θ in Examples 1-4 in a m,m,ψ,i m unified manner without needing to consult the full lielihood function for any of the models See Remar 25 above for a closely related discussion ψ m

26 360 J Robins, L Li, E Tchetgen and A van der Vaart A critical non-uniqueness We have as yet neglected a critical non-uniqueness in our definition of IF ψ m m,m,ψθ θ and thus that poses a significant problem for our truncated Dirac approach For instance, when m = 3, the two generalized U-statistic ernels IF 3,3,ψ,i1,i 2,i 3 θ of Equation 313 and IF3,3,ψ,i 1,i 2,i 3 θ ε H 1,i2 b,i 1 θ ςx i2 K f X, X i1, X i2 ς X i1 1 2 E θ K fx, X i1, X i2 X i1 K fx, X i1, X i3 ε p,i 3 θ ς X i3 1 2 are precisely equal, by Eq 38; nonetheless, upon truncation, they result in different feasible ernels; and IF 3,3,ψ,i 1,i 2 θ = ε b,i 1 θ ς X i1 1 2 H1,i2 ςx i2 K f X, X i 1, X i2 K fx, X i 2, X i3 IF, 3,3,ψ,i 1,i 2,i 3 θ ε b,i 1 θ ς X i1 1 2 K fx, X i 1, X i3 K fx, X i 1, X i3 ε p,i 3 θ ς X i3 1 2 ε p,i 3 θ ς X i3 1 2 H 1,i2 ςx i2 K f X, X i 1, X i2 E θ K fx, X i 1, X i2 X i1 with possibly different orders of bias For simplicity, we consider the case where H 1 = 1 as in Examples 1b Let δb B B, δp P P, δf = δg f f 1, and Z z X Ê ϕ Xϕ X T} 1/2 ϕ X, then, E θ IF, 3,3,ψ,i 1,i 2,i 3 θ δb i1 fx i2 = E θ E θ fx i2 1 z X i2 z T X i1 E θ δp i3 z X i3 T z X i1 fx i1 fx i δb i1 fx = i2 E θ E θ fx i2 1 z X i2 z T X i1 fx i3 E θ fx i δp i3 z X i3 T z X i1 = δb E θ i1 δf X E θ i2 Z T,i 2 Z,i1 δp E θ i3 Z T,i 3 Z,i1 + O p B B } P P } } 2 G Ĝ

27 Higher order influence functions 361 and That is, E θ IF 3,3,ψ,i 1,i 2,i 3 θ δb i1 z X i1 T z X i2 fx i2 fx i2 1 = E θ E θ z X i2 T E θ δp i3 z X i3 δf X E θ i1 + 1δB i1 Z T,i 1 fx = i2 E θ Z,i2 fx i2 1 Z T,i 2 E θ δf Xi3 + 1δP i3 Z,i3 δb = E θ i1 Z T fx i2,i1 Z,i2 fx i2 1 E θ Z T,i δpi3 2 Z E θ,i3 + O p B B } P P } } 2 G Ĝ E θ IF, 3,3,ψ,i 1,i 2,i 3 θ E θ IF 3,3,ψ,i 1,i 2,i 3 θ = δb E θ i1 δf X E θ i2 Z T,i 2 E θ E θ Z,i1 δp E θ i3 Z T,i 3 δb i1 Z T Z,i1,i2 δf X i2 Z T,i δpi3 2 Z E θ,i3 + O p B B } P P } } 2 G Ĝ = E θ δbπ θ δp Z Π θ δf X Z E θ Π θ δp Z δf δb Z XΠ θ + O p B B } P P } } 2 G Ĝ Π θ δp Z Π θ δb Z Π θ δf X Z = E θ δb Z Π θ Π δf X Z θ + O p B B } P P } } 2 G Ĝ } Z,i1 where Π θ h X Z and Π h X Z respectively denote the projection under F ; θ θ in L 2 F of h X on the dimensional linear subspace lin z X} spanned by the components of the vector z X and the projection on the orthocomplement of this subspace Since the basis ϕ l X;l = 1, 2, } provides optimal rate approximation for Hölder balls, it is easy to verify that the difference is of order βp/d 1+2βp/d βg/d 1+2βg/d βb/d + n βp/d 1+2βp/d β b /d 1+2β b /d βg/d O p n +n βp/d 1+2βp/d β b /d 1+2β b /d 2βg/d 1+2βg/d,

28 362 J Robins, L Li, E Tchetgen and A van der Vaart which we expect to be sharp for many bases, although not for Haar For concreteness, we shall loo at an example Suppose β b /d = β p /d = 03 and β g /d = 01, thus, by choosing = n 5 6, ψ 3 converges to ψ θ at rate n 2 1 In contrast, the order, min n βp/d 1+2βp/d βg/d 1+2βg/d β b/d + n of the optimal root mean squares error of βp/d 1+2βp/d β b /d 1+2β b /d βg/d + ψ, 3 that uses IF, ψ, 1 n max 1, 2 n 2, 3,3,ψ,i 3 θ is n 0477 n 05 Thus, for many orthonormal bases, 3 converges to ψ θ at a slower rate than ψ 3 which uses IF θ Nothing in our development up to this point 3,3,ψ,i 3 provides any guidance as to which of the many equivalent generalized U-statistic ernels should be selected for truncation To provide some guidance, we introduce an alternative approach to the estimation of ψ θ based on truncated parameters that admit higher order influence functions The class of estimators we derive using this alternative approach includes members algebraically identical to the estimators ψ m but does not include estimators equivalent to less efficient estimators such as ψ, 3 An approach based on truncated parameters We introduce a class of truncated parameters ψ θ that i depend on the sample size through a positive integer index = n which we refer to as the truncation index and will be optimized below, ii have influence functions IF m, ψ θ of all orders m, iii equal ψ θ on a large subset Θ sub, of Θ and iv the initial estimator θ is an element of Θ sub, so that the plug-ins ψ θ and ψ θ are equal To prepare we introduce a simplified notation For functions h o, or r of θ, we will often write h o, θ and r θ as ĥ o and r, and as Ê Similarly, we often E θ write h o, θ and r θ as h o and r, and E θ as E Further we shall introduce slightly different definitions of truncation and estimation bias Define the estimator ψ m, θ ψ θ + IF m, ψ θ or, equivalently, ψm, ψ+îf Then the conditional bias E ψm, θ ψ of ψ m, ψ m, is TB +EB m, where the truncation bias TB = ψ ψ is zero for θ Θ sub, and does not depend on m and the estimation bias EB m, = E ψm, θ ψ is O P θ θ m+1 by Theorem 22 Since, as we show later, the order of EB m, does not depend on, we will abbreviate EB m, as EB m, suppressing the dependence on Under minimal conditions, the conditional variance of ψ m, is of the order of var IF m, ψ whenever n n The rate of convergence of ψ m, to ψ can depend on the choice of ψ Nevertheless, many different choices ψ result in estimators ψ m, that achieve what we conjecture to be the optimal rate for estimators of the form ψ m, We choose, among all such ψ, the class that minimizes the computational complexity of ψ m, Specifically for all ψ in our chosen class and all j, IF jj, ψ consists of a single term rather than a sum of many terms We conjecture this appealing property does not hold for any ψ outside our class We now describe this choice The parameter ψ is

29 Higher order influence functions 363 defined in terms of n dimensional woring linear parametric submodels for p and b depending on unnown parameters α and η K through the basepoints p and b, where p and b are initial estimators from the training sample Specifically let ṗ X and ḃ X be arbitrary nown functions chosen by the analyst satisfying Eqs below ṗ Xḃ XE H 1 X 0 wp1, ṗ X ḃ X < C ḃx, < C, ṗ X ṗx 316 ḃx has at least max β b, β p } derivatives Particular choices of ṗ X and θ ḃ X can mae the form of IF more aes- m, ψ thetic The choice has no bearing on the rate of convergence of the estimator ψ m, to ψ θ Often there are fairly natural choices for ṗ and ḃ See Remar 39 below for examples Let α, η be vectors of unnown parameters and consider the woring linear models p X, α px + ṗxα T z X P + Pα T Z, b X, η = bx + ḃxηt z X = B + ḂηT Z We define the parameters η θ and α θ respectively to be the solution to = E θ H b X, η,p X, α / α = E θ H 1 b X, η + H 3 } PZ, 0 = E θ H b X, η,p X, α / η = E θ H 1 p X, α + H 2 }ḂZ The solution to 319 and 320 exist in closed form as η θ = E θ Ḃ PH 1 Z Z T 1 } Eθ Z P H 1 B + H3, α θ = E θ PḂH 1Z Z T 1 } Eθ Z Ḃ H 1 P + H2 Next define b θ = b, θ = b, η θ ψ θ = E θ H and p θ = p, θ = p, α θ and b θ, p θ Note the models p, α and b, η are used only to define the truncated parameter ψ θ They are not assumed to be correctly specified In particular, the training sample estimates p, b need not be based on the models p, α, b, η We now compare our truncated parameter ψ θ with ψ θ and calculate the truncation bias It is important to eep in mind that b, p are components of the unnown θ while ṗ, ḃ, p, b are regarded as nown functions

30 364 J Robins, L Li, E Tchetgen and A van der Vaart Theorem 37 If our model satisfies Ai Aiii and θ Θ sub, = θ; p = p, α for some α or b = b, η for some η } Θ then ψ θ = ψ θ Further TB θ = ψ θ ψ θ = E θ B θ B } P θ P } H 1 Proof Immediate from Theorem 32 and Lemma 33 We now from the above Theorem that TB θ = 0 for θ Θ sub, However to control the truncation bias in forming confidence intervals for ψ θ we will need to now how fast sup θ Θ TB θ} decreases as increases The following theorem is a ey step towards determining an upper bound Theorem 38 Suppose Let ḃx and ṗ X are chosen so that Ḃ PE H 1 X 0 wp1 } 1/2 Q q X = Ḃ PE H 1 X and Π h Z QZ and Π h X QZ be, respectively, the projection in L 2 F X x of h X on the dimensional linear subspace lin } QZ spanned by the components of the vector QZ = q Xz X and the projection on the orthocomplement of this subspace Then if Ai Aiii are satisfied, TB = E Π P P P Q QZ Π B B Q QZ Ḃ Remar 39 To simplify various formulae it is often convenient and aesthetically pleasing to have Q = 1 We can choose Ḃ and P to guarantee Q = 1 wp1 For the functional ψ θ = E θ b XpX of Example 1a, H 1 = 1 wp1 Thus choosing Ḃ and P equal to 1 and 1, respectively, wp1 maes Q = 1 wp1 In the missing data Example 2a, the function H 1 = A so Ê H 1 X = 1/ P and thus the choice Ḃ = 1, P = P maes Q = 1 wp1 Note since inference on ψ θ is conditional on the training sample data, we view the initial estimator p of p from the training sample as nown and thus an analyst is free to choose P to be P Examples continued In Example 1a, recall ψ = E BP Choose Ḃ = P = 1 wp1 so Q = Q = 1, and tae B lin } Z Then B = B + Π B B Z = Π B Z, P = Π P Z, TB = E Π B Z Π P Z }, ψ = ψ TB = E Π B Z Π P Z } Thus ψ appears to be the natural choice for a truncated parameter In Example 2a with ψ = E B, Ḃ = 1, P = P = 1/ π, Q } 1/2 = 1, Q = P/P = } 1/2, π π X,π π X, we obtain π π } 1/2 Π π 1 π TB = E 1 π π π } 1/2 Π B π π B π π π π } 1/2 Z } 1/2 Z

31 Higher order influence functions 365 Thus the truncated parameter ψ = ψ TB does not seem to be a particular natural or obvious choice The complexity of ψ is not simply due to the fact that we chose P = P rather than P = 1 as we now demonstrate In Example 2a with Ḃ = 1, P = 1, Q = π 1/2, Q = π 1/2, Π 1 π 1 π π 1/2 π 1/2 Z TB = E } 1/2 Π B π π B π 1/2 Z Nonetheless we will see that, for either choice of result in estimators with good properties Ḃ, P, the parameter ψ will Remar 310 Henceforth, given β p, β b, β g, ϕ l X,l = 1, 2, } will always denote a complete orthonormal basis with respect to Lebesgue measure in R d or in the unit cube in R d that provides optimal rate approximation for Hölder balls H β, C,β maxβ p, β b, β g, ie 323 sup h Hβ,Cinf ςl R d h x l=1 2 ς l ϕ l x dx = O 2β /d The basis consisting of d-fold tensor products of univariate orthonormal polynomials satisfies 323 for all β The basis consisting of d-fold tensor products of a univariate Daubechies compact wavelet basis with mother wavelet ϕ w u satisfying R 1 u m ϕ w udu = 0, m = 0, 1,,M also satisfies 323 for β < M + 1 Theorem 311 Suppose that Ai Aiv are satisfied, that ḃx and ṗ X satisfy and that we tae 324 z X = ϕ, f X Ê H1 Xḃ XṗX } 1/2, where Then ϕ X, f Ê ϕ Xϕ X T} 1/2 ϕ X sup θ Θ TB 2 θ } = O p 2β b+β p/d Remar 312 Note if we have chosen ḃ and ṗ so that Q = 1 wp1 then z X = ϕ, f X simplifies The preceeding theorem does not hold with ϕ X/ f X } 1/2 replacing ϕ, f X unless f X is nown to lie in a Holder ball with exponent exceeding max β b, β p } 322 Derivation of the higher order influence functions of the truncated parameter We begin by proving that the first order influence functions of ψ and ψ are identical except with bθ, p θ, ψ θ replacing b, p, ψ θ

32 366 J Robins, L Li, E Tchetgen and A van der Vaart Theorem 313 IF 1, ψ θ = V IF 1, ψ θ,i 1 with Proof Since ψ θ = E θ H IF 1, ψ θ = H bθ, pθ, b θ, p θ ψ θ IF 1, ψ θ = H b θ, pθ ψ θ + E H b X, η θ, p X, α θ / η T IF θ 1, η + E H b X, η θ, p X, α θ / α T IF θ 1, α But, by definition of η θ and α θ, both expectations are zero Note that η θ and α θ are not maximizers of the expected log-lielihood for α and η This choice was deliberate Had we defined η θ and α θ as the maximizers of the expected log-lielihood, then IF 1, ψ θ would have had additional terms since the expectations in the preceding proof would not be zero The existence of these extra terms would translate to many extra terms in IF m, ψ θ for large m leading to computational difficulties Similarly had we chosen models p X, α Φ P + Pα T Z and b X, η = Φ B + Ḃη T Z with Φ a non-linear inverse-lin function, IF m, ψ θ would also have had many extra terms without an improvement in the rate of convergence The following is proved in the Appendix Theorem 314 IF m, ψ = IF 1, ψ + m j=2 IF where IF jj, ψ jj, ψ =V IF jj, ψ,ij is a jth order degenerate U statistic given by T H 1 P + H 2 ḂZ E PḂH 1 Z Z IF 22, ψ = i1 T,i 2 Z H 1 B + H3 P i 2 IF jj, ψ = 1 j 1 T H,i 1 P + H2 ḂZ j i 1 j } E PḂH 1Z Z T 1 s=3 } PḂH 1Z Z T E PḂH 1Z Z T i s } E PḂH 1Z Z T 1 Z H 1 B + H3 P i 2 } 1, 323 The Estimator ψ m, ψ + ÎF m, ψ and its Estimation Bias We can now calculate the estimator ψ m, ψ + ÎF m, ψ by substitution of θ for θ in IF m, ψ IF m, ψ θ to obtain the following

33 Higher order influence functions 367 Theorem 315 Suppose 324 holds and define ς X = Ê H 1 X Then ψ m, = ψ + ÎF 1, ψ + m j=2 ÎF jj, ψ where ψ + ÎF = B PH 1, ψ 1 + BH 2 + PH 3 + H 4, T ÎF 22, ψ = H,i 1 P + H2 ḂZ 2 Z H 1 B + H3 P i1 i 2 1/2 ϕ T X H 1 P + Ḃ, f H2 Ṗ ςx 1/2 = i } 1, 1/2 H 1 B + Ḃ ϕ, f X H3 Ṗ ςx 1/2 i 2 ÎF jj, ψ = 1 j 1,i j = 1 j 1 j H 1 P + H2 ḂZ T j PḂH 1Z Z T i s s=3 I H 1 B + H3 P i 1 Z 1/2 ϕ T H 1 P + Ḃ, f H2 P s=3 H 1 B + H3 Ḃ Ṗ X ςx 1/2 ϕ, f H XϕT X, f 1 ςx i s I } 1/2 ϕ, f X ςx 1/2 i 2 i 2 i 1 } } } Proof By Lemma 33 H E θ 1 B + H3 PZ = H E θ 1 P + H2 ḂZ = 0 Thus by Eqs 321 and 322 η θ = α θ = 0 so B θ = B and P θ = P Further, by Eq 324, Ê PḂH 1Z Z T = Ê PḂÊ H 1 XZ Z T = Ê Q2 Z = I Z T Remar 316 The reader can easily chec that when Ḃ = P = 1 and H 1 0 wp 1, ÎF is precisely the same as IF 22, ψ,i 2 2,2,ψ,i 1,i 2 θ of equation 312 in Section } m ϕ, f 321 By expanding the product H ϕt, f 1 I, the equality of ςz s=3 i s ÎF mm, ψ and IF θ can be demonstrated for all m Note that ÎF,i m m,m,ψ,i m jj, ψ,i j depends on P and Ḃ only through their ratio Example 1a Continued ψ = E BP, Ḃ = P = 1 wp1, Q = 1, H1 = 1, Z = ϕ, f X so Ê Z Z T = I is Then } P H 1 B + H3 = Y B }, H Ḃ 1 P + H2 = A P

34 368 J Robins, L Li, E Tchetgen and A van der Vaart and thus ÎF 22, ψ = A P Z T,i 2 Z Y i1 B, i 2 ÎF jj, ψ = 1 j A P j } Z T,i j Z is Z T is I Z Y B i 1 s=3 Example 2a Continued H 1 = A, Ḃ = 1, P = P = 1/ π, Q = 1, ψ = E B, } 1/2 } 1/2 Q = P/P = π and Z = ϕ X, so Ê π, f Z Z T is = I Then } P H 1 B + H3 Y B }, Ḃ H 1 P + H2 = 1, so A π = A π A ÎF 22, ψ =,i 2 π 1 Z T A Z Y i 1 π B, i 2 ÎF jj, ψ = 1 Z j 1,i j A π 1 T j s=3 i 1 } A π Z Z T I i s A Z Y π B i 2 Consider Example 2a with Ḃ = 1, P = 1, Q = π 1/2, and Z = ϕ } X,, f Ê Q2 Z Z T i s = I, P H 1 B + H3 = A Y B }, Ḃ H 1 P + H2 = 1, so A π ÎF 22, ψ,i 2 A = π 1 Z T Z A Y B i 1 ÎF jj, ψ,i j = 1 j 1 A π 1 Z T j s=3 i 1 } AZ Z T I i s i 2, Z A Y B i 2 Our next theorem, proved in the Appendix of our technical report, derives the estimation bias EB m = E ψm, ψ Theorem 317 Suppose and Ai Aiv hold then } EB m = 1 m 1 E Q 2 B B Ḃ Z T E Q 2 Z Z T m 1 I 325 } E Q 2 Z Z T 1 E Z Q 2 P P Ṗ EB m 1/2 1/2 ḂṖ G} ṖḂG} δg m o p /2 } 2 1/2 p X px dx} 2 b X b X dx = O P log n m 1βg d+2βg n βb d+2β + b d+2βp 327 n i 2

35 Higher order influence functions 369 for m 1, where δg = gx ĝx ĝx Remar 318 At the cost of a longer proof we could have used Hölder s inequality repeatedly to control δg in the L p norm δg m+1 with p = m + 1 to show that EB m = O P δg m 1 m+1 b b p p m+1 Thus, EB m is m+1 θ O P θ m+1, consistent with the form of the bias given in our fundamental Theorem 22 Remar 319 An alternate derivation of ψ m, The above derivation of ψ m, required that one have facility in calculating higher order influence functions IF m, ψ, as done in the proof of Theorem 314 in the Appendix However, there exists an alternate derivation of ψ m, that does not require one learn how to calculate influence functions Specifically, we now from Theorems 22 and 23 that in a locally nonparametric model ÎF, j 2 is the unique j th order U-statistic that is degenerate jj, ψ under θ and satisfies j EB j 1 + E θ EB j = O p θ θ ÎFjj, ψ with EB 1 = E ψ1 θ ψ In fact, we first derived ψ m, by beginning with ψ 1 = ψ + ÎF, calculating EB 1, ψ 1 = E ψ1 θ ψ, and then, recursively for j = 2, finding ÎF satisfying the above equation In fact if one did not even now how jj, ψ to derive IF 1, ψ, one could begin the recursion by obtaining ÎF as the unique 1, ψ first order U-statistic with mean zero under θ satisfying ψ ψ + E θ = ÎF1, ψ O p θ θ The variance of ψ m, ψ + ÎF m, ψ using compact wavelets In this section, we derive the order of the variance of ψ m, when the orthonormal system ϕ j X} used to construct our U-statistics are a compact wavelet basis First consider the case where X is univariate; without loss of generality, assume that X Uniform0, 1 Because we are primarily interested in convergence rates, the fact that X may not follow the uniform distribution will not affect the rate results given below, but can influence the size of the constants We use φ j X in place of ϕ j X to indicate univariate basis functions Let, be integer powers of two with > Denote by φx φ 1 X the dimensional basis vector whose first components φ 1 X are the vector of level log 2 scaled and translated versions of a compactly supported father wavelet Mallat 10 and whose last components φ +1 X are the associated compact mother wavelets between levels log 2 and log 2 In particular, one may use periodic wavelets, folded wavelets or Daubechies boundary wavelets with enough vanishing moments to obtain the optimal approximation rate of O 2β/d for β = maxβ g, β p, β b The multiresolution analysis MRA property of wavelets allows us to decompose the vector space spanned by the

36 370 J Robins, L Li, E Tchetgen and A van der Vaart log 2 -level father wavelets V log2 into the direct sum of the subspace spanned by log 2 -level father wavelets V log2 = a T φ 1 X : a R } and the span of mother wavelets for each level between log 2 and log 2 1 which we respectively write as W log2 = a T φ 2 +1 X : a R }, W log2 0+1 = a T φ X : a R 2 }, W log2 1 = } a T φ 2 +1 X : a R 2 Then for any integer s with log s, we have V s = V log2 s 1 v=log 2 As s, the resulting basis system is dense in L 2 X Mallat 10 Since, in fact, X is d dimensional we require a generalization that allows for multivariate tensor wavelet basis functions In fact, suppose X T = X 1,, X d is now multivariate, and we again assume X Uniform on 0, 1 d Given d univariate vector spaces V 1,log2, V 2,log2,,V d,log2 respectively spanned by vectors φ 1 X 1, φ 1 X 2,, φ 1 X d, so that for 1 r d, V r,log2 V r,log2 +1 V r,log2 1 V r,log2 and V r,log2 = V r,log2 W v log 2 1 v=log 2 One may define d dimensional tensor vector spaces such that where for s 0, W r,v Y d,log2, Y d,log2 +1,, Y d,log2 Y d,log2 Y d,log2 +1 Y d,log2, Y d,log2 0+s= 1 r d V r,log2 0+s As s, the resulting tensor basis system is dense in L 2 XMallat 10 Next, suppose that we have a set of multivariate basis functions } ϕ j 1 X,j = 0, 1,, 2m such that for each j, ϕ j 1 X spans 1 r d V r,log2 j,r, where d j,r = j Define 2 as the L 2 norm with respect to the Lebesgue measure The following theorem is ey to our derivation of the order of the variance of ψ m, r=1

37 Higher order influence functions 371 Theorem 320 For m 0, m 1 X i 1 T ϕ j 1 Xij+1 ϕ j+1 T 1 Xij+1 }ϕ m+1 1 ϕ1 j=1 m+1 = E K 1,j Xij, X ij+1 m+1 j=1 j=1 j 2 Xim+2 The following theorem is an immediate consequence of Theorem 320 obtained by taing j = = which implies we use the father wavelets at level log 2 but no mother wavelets Theorem 321 For all θ Θ, var θ IF 1, ψ θ 1 n, } j 1 1 var θ IF jj, ψ θ n max 1,, n var θ IF m, ψ θ θ 1, var θ ÎFm, ψ 1 n max } m 1 n We now use Theorem 321 to derive the order of the conditional variance of ψ m, given θ f Theorem 322 If sup o O o; θ f o; θ 0 as θ θ 0, then for a fixed m, var θ ψm, ψ θ = var θ θ ÎFm, ψ = θ var θ ÎFm, ψ 1 n max 1, n 1 + o p 1 } m 1 The proof is in our technical report For a given m, the estimator ψ m,optm that minimizes the maximum asymptotic MSE over the model M Θ defined by Ai Aiv among the candidates ψ m, uses the value opt m opt m, n of that equates the order 1 n 1, max m 1 } n of var ψm, ψ θ to the order max TB } 2, EB m θ} 2 = max 2m 1βg log n d+2βg n n 2βb d+2β + 2βp b d+2βp, 2β b +βp d 2 2

38 372 J Robins, L Li, E Tchetgen and A van der Vaart of the maximal squared bias The estimator ψ mopt, opt ψ mopt, optm opt that minimizes the maximum asymptotic MSE over the model M Θ among all candidates ψ m, is the estimator ψ m 1 m,optm,n which minimizes 1 n 1, max optm,n n 325 Distribution theory and confidence interval construction We derive a consistent estimator of the variance and give the asymptotic distribution of ψ m, for any model and functional satisfying Ai Aiv Let z α be the upper α quantile of a standard normal distribution, ie a N 0, 1 2 Theorem 323 a Let Ŵ2 = n 1 V ÎF 1, ψ 1, ψ,i1}, for j 2, and where Ŵ 2 jj, ψ = 1 2 n V ÎF s j j,j, ψ, Ŵ 2 m, ψ = Ŵ2 1, ψ + m Ŵ 2, jj, ψ j=2 s ÎF j,j, ψ is the symmetric ernel of ÎF We have, jj, ψ Ê Ŵ2 Ê Ŵ2 1, ψ jj, ψ Ê Ŵ2 = var m, ψ where var = var θ b Conditional on the training sample, 1 n max opt m, n 1, n = var ÎF1, ψ = var ÎFjj, ψ θ, θ,, θ ÎFm, ψ }} m 1 1/2 ψm,optm,n E ψm,optm,n θ} converges uniformly for θ Θ to a normal distribution with finite variance as n The asymptotic variance is uniformly consistently estimated by }} m n max 1, Ŵ 2 n m, ψ opt m,n Thus ψm,optm,n E ψm,optm,n θ} /Ŵm, ψ opt m,n is converging in distribution to a standard normal distribution c Define the interval C m, = ψ m, ±z α Ŵ m, ψ Suppose opt m, n = n ρ opt m,n Then for = n ρ, ρ > ρ opt m,n, E θ ψ2, θ sup θ Θ = o p 1 var θ ψ2, θ

39 Higher order influence functions 373 and ψm, ψ θ} converges uniformly in θ Θ to a N 0, 1 Moreover, C m, is a conservative uniform asymptotic 1 α confidence interval for /Ŵm, ψ ψ θ d Suppose we could derive a constant C bias and a constant N such that sup E θ ψm,optm,n ψ θ} θ = sup TB optm,n θ + EB m θ } θ 1 n ρ }} C bias n max opt m,n m 1 1/2 1, n for n > N Then BC m,optm,n = ψ m,optm,n ± z αŵm, ψ + C bias 1 n max 1, n ρ }} opt m,n m 1 1/2 n is a conservative uniform asymptotic 1 α confidence interval for ψ θ Part a of the theorem is an easy calculation The asymptotic normality of ψ m,optm,n is based on new results on the asymptotic distribution of higher order U-statistics with ernels depending on n to be published elsewhere Robins et al 18 Part c of the theorem implies we obtain a conservative uniform asymptotic 1 α confidence interval for any value of ρ exceeding ρ optm,n However, for the actual fixed sample size of our study, say n = 5000, there is no guarantee the interval of part c based on given difference ρ ρ optm,n, say 03, will provide conservative finite sample coverage Because of this difficulty, a better approach, described in part d, would be to determine a constant C bias that can be used to bound the maximal bias under the model at a sample sizes exceeding N, with N no greater than the actual fixed sample size n of the study Then the interval BC m,optm,n will be a honest conservative finite sample 1 α confidence interval, provided that ψ m,optm,n has nearly converged to its normal limit at sample size n Unfortunately, as yet we do not now how to determine the constants C bias and N of part d as a function of our model and of our initial estimator θ This is an important open problem 326 Models of increasing dimension and multi-robustness A model of increasing dimension The previous results can also be used for the analysis of models whose dimension increases with sample size In fact, consider the M Θ n η, η nown, that differs from model M Θ in that, rather than assuming b x and p x live in particular Hölder balls, we instead assume the woring models of Eqs 317 and 318 are precisely true for = n η, so ψ θ ψ n η θ and the dimensions of b x and p x increase as n η Valid point and interval estimation for ψ n η θ can still be based on the estimators ψ m, except now i there is truncation bias only when < n η, ii the variance remains of the order of

40 374 J Robins, L Li, E Tchetgen and A van der Vaart 1 n max 1, m 1 n, and iii the estimation and truncation bias when it exists orders will be determined by any additional complexity reducing restrictions placed on the fraction of non-zero components or on the rate of decay of the components of the vectors η n η θ and α n η θ, and, for estimation bias, by β g as well As a consequence, m opt and opt under model M Θ n η will differ from their values under model M Θ Note we need not tae = n η as we did in the heuristic discussion following Remar 28 Indeed ψ mbest in that discussion corresponds to the estimator in the class ψ m,=n η with the fastest rate of convergence In general, ψmbest will have convergence rate slower than ψ mopt, opt Furthermore, the discussion in Section 411 implies that, when n η n and the minimax rate for estimation of ψ θ is slower than n 1/2, even ψ mopt, opt will typically fail to converge at the minimax rate when complexity reducing restrictions have been imposed on η n η θ and α n η θ Multi-robustness and a practical data analysis strategy Conditional on θ, for m 2, EB m is zero and thus estimator ψ m, is unbiased for ψ if p = p, b = b, or ĝ = g We refer to ψ m, as triply-robust for ψ, generalizing Robins and Rotnitzy 17 and van der Laan and Robins 24 who referred to ψ 1 as doubly-robust because of its being unbiased for ψ if either p = p ψ mod m, or b = b In fact, for m 3, we can construct a modified estimator that is m + 1-fold robust as follows Let ĝ s, s = 3,,m, denote m 2 additional initial estimators of g that differ from one another and from ĝ Define ψ m, mod = ψ + ÎF + ÎF + m 1, ψ 22, ψ,i j j=3 ÎFmod jj, ψ, where ÎF mod jj, ψ,i j = 1 j 1 T H 1 P + H2 ḂZ i 1 PḂH 1Z Z T I i 2 j 1 } Ês PḂH 1Z Z T 1 s=3 } PḂH 1Z Z T i Ês PḂH 1Z Z T s } Êj PḂH 1Z Z T 1 Z H 1 B + H3 P with Ês defined lie Ê, except with ĝ s replacing ψmod ĝ In the Appendix of our technical report, we prove that EBm mod = E m, ψ is } E Ḃ PH P P 1 Ṗ Z T E Ḃ PH 1 Z Z T I m } Ês Ḃ PH 1 Z Z T 1 1 m 1 s=3 } E Ḃ PH 1 Z Z T Ḃ Ês PH 1 Z Z T } E Ḃ PH 1 Z Z T 1 E Ḃ PH B B 1 which is zero if p = p, b = b, ĝ = g, or if any of the m 2 ĝ s equals g We note that if p = p or b = b, ψ = ψ mod and thus ψ m, and ψ m, are unbiased for ψ In settings where the dimension d of X is so large eg, that the above asymptotic results fail as a guide to the finite sample performance of our procedures i j Ḃ }

41 Higher order influence functions 375 at the moderate sample sizes, say n = , commonly found in practice, one might consider, as a practical data analysis strategy, using the m + 1-fold robust mod estimator ψ m, with p, b,ĝ, and the ĝ s selected by cross-validation as in van der Laan and Dudoit 23 Specifically, the training sample is split into two random subsamples a candidate estimator subsample of size n c and a validation subsample of size n v, where both n c /n and n v /n are bounded away from 0 as n A large number eg, n 3 candidate parametric models of various dimensions and functional forms for p, b, and g are fit to the candidate estimator subsample and the validation sample is used to find the candidate estimators p and b for p and b and the m 1 candidate estimators ĝ and ĝ s,s = 3,,m, for g with the smallest estimated riss with respect to an appropriate ris function such as squared error or Kullbac-Leibler In the setting of very high dimensional X, current practice is to use a doubly robust estimator, say ψ 1 with p and b selected by cross validation van der Laan mod and Dudoit 23 An m + 1-robust estimator ψ m, with n and m rather large may be preferable to a doubly robust estimator for two reasons First, if one uses an m + 1-robust estimator of ψ rather than a DR estimator, it may be more liely that the estimator will have very small bias, as it is more liely that at least one of m + 1, rather than one of two, models is very nearly correct Second, because n, nominal 1 α Wald confidence intervals centered at will be wider ψ mod m, than the interval of length proportional to n 1/2 centered at ψ 1 A wide interval is a more appropriate measure of the actual uncertainty about ψ However, it is also mod the case that the bias of ψ m, can exceed that of ψ 1 when all of the m + 1 models selected by cross-validation are far from correct, owing to the product structure of the estimation bias The product of m + 1 errors, each greater than 1, will exceed the product of just 2 of the errors We therefore plan to compare through mod simulations the finite sample performances of ψ m, and ψ 1 in the setting of very high dimensional X 4 Rates of convergence and minimaxity We consider a generic version in which we only assume a model and functional satisfying Ai Aiv To examine efficiency issues, we first consider the estimator ψ 1 based on the first order influence function and sample splitting Without loss of generality we assume β p β b Otherwise simply interchange β p and β b in what follows It will be useful to consider the alternative parametrization β = β p + β b, 2 βp = 1 0 β b The conditional variance of ψ 1 is of the order of 1/n and the conditional bias of ψ 1 in estimating ψ is O p n βb d+2β + βp b d+2βp If = 0 and thus β p = β b, the 2β d+2β bias of ψ 1 is n and ψ 1 is not n 1/2 consistent for ψ when β/d < 1/2 At the other extreme, as, ie β b 0, the bias of ψ 1 is n 2β d+4β which fails to be n 1/2 consistent for any finite β

42 376 J Robins, L Li, E Tchetgen and A van der Vaart Minimaxity with g nown To further examine efficiency issues, it is instructive to first consider the estimation of ψ with g nown If g were nown, we could set ĝ X = g X when calculating ψ m, Then EB 2 = 0 and ψ 2, would therefore be an unbiased estimator of ψ Letting a superscript g denote the model with g nown, it is easy to see that ψ m g opt,g optm g opt would be ψ 2, g opt 2 where g opt 2 satisfies max 1/n, /n 2 var ψ2, = TB 2 = 2β b+β p/d = 4β/d Solving this, we find that when β/d is greater than or equal to 1/4, we can tae g opt = n 1 4β/d n and ψ2, g opt 2 ψ = O p n 1 2 regardless of, which is, of course, the minimax rate In contrast if β/d < 1/4, g opt 2 = n 2 1+4β/d and ψ2, g opt 2 ψ = n 4β 4β+d In an unpublished paper, we have proved that this is the minimax rate when g is nown This raises the question of whether the lower bounds of rate n 1 2 for β/d 1/4 and/or rate n 4β 4β+d for β/d < 1/4 are still achievable when g is unnown, without restrictions on the smoothness of g Before addressing this question, we tae the opportunity to compare the relative efficiencies of competing rate-optimal unbiased estimators in the case of g nown This discussion will provide further insight into the results given in Remar 26 for models which are not locally nonparametric Relative efficiency of various unbiased estimators with g nown For simplicity, we restrict the following discussion to the truncated version of the parameter ψ = E b X} 2, with b X = EY X, g nown, and Y Bernoulli For this choice of ψ, g is the marginal density of X In this subsection, we assume ĝ X is chosen equal to the nown g X so E Z Z T = I Also we choose Ḃ = P = 1 and tae B = b X lin } Z, so B = Π B Z = E BZ T Z and Π } 2 ψ = E B Z do not depend on B Further we only concern ourselves with efficiency relative to the n observations in the estimation sample We thus ignore any efficiency loss from using N n observations to construct b Let Θ g = b : x b x 0, 1} Θ denote the subset of Θ corresponding to the nown g, which consists of all functions from the unit cube in R d to the unit interval The model M Θ g is not locally nonparametric For example, the first order tangent space Γ 1 θ does not include first order scores for g Its second order tangent space Γ 2 b does not contain second order scores for g or mixed scores for g and b Rather, Γ 2 b is the closed linear span of the first and second order scores for b Thus where Γ 2 b = S a, c;var b S a, c} < ; a A, c C } S ij a, c = Y BaX} i + Y B i c X i, X j Y B j, and A and C are the set of one and two dimensional functions of x Since, for

43 Higher order influence functions 377 b lin z x}, ψ 2, b ψ 2, B2 + 2 B Y B = V i + Y B Z T Z Y B i j is unbiased for ψ Π } 2 b = E B Z in model M Θ g, we now, by Remar 26, that IF eff b for ψ b is the projection Π b ψ2, ψ b Γ 2 θ of the second 2, ψ order influence function ψ 2, ψ b onto Γ 2 b Now if ψ 2, ψ b was an element of Γ 2 b, ψ2, ψ b would equal IF eff b and thus be second order unbiased 2, ψ locally efficient, at b Θ g, as defined earlier in Remar 26 However we show below that, when b X = c for some c wp1 does not hold, ψ 2, ψ b is not an element of Γ 2 b for any b Rather, a straightforward calculation gives 2E BZ T IF eff Z Y B b = V i 2, ψ + Y BZ T Z Y B j Now one can chec that ψ b +IF eff 2, ψ b i is a function of b, so by Theorem 27 of Remar 26, we conclude no unbiased globally efficient estimator exists However, we prove below that ψ b + IF b eff and ψ 2, have identical means It follows 2, ψ that ψ b + IF b eff is an unbiased estimator of ψ Π 2 b = E B Z for 2, ψ any b lin z x} Thus, for a given choice of b lin z x}, ψ b +IF eff 2, ψ b is second order unbiased locally efficient at b = b However, one can show using a proof analogous to that in Theorem 322 that for n 2 var b ψ b + IF b eff /var b IF eff b 2, ψ 2, ψ = 1 + o P b b Henceforth we assume that b lies in a Hölder ball Hβ b, C b That is, we consider the submodel b Θ g Hβ b, C b and assume b x lin z x} converges βb /2β b +d n to b in sup norm at the optimal rate of log n uniformly over Θ g Hβ b, C b The submodel and the model Θ g have identical tangent spaces For all b Θ g Hβ b, C b, max n 1, /n 2 1/2 ψ b + IF eff ψ } b has an 2, ψ b asymptotic distribution with mean zero and variance equal to lim max n 1, /n 2 1 varb IF eff b, n 2, ψ for all b Θ g Hβ b, C b In a slight abuse of language, b we shall refer to var b IF eff as the asymptotic variance of ψ + IF eff ψ } b 2, ψ b 2, ψ b Thus, as with standard first order theory, even when no unbiased estimator has finite sample variance that attains the Bhattacharyya bound for all b Θ g Hβ b, C b,

44 378 J Robins, L Li, E Tchetgen and A van der Vaart there can exist an unbiased estimator sequence whose asymptotic variance does attain the bound globally We next compare the means and variances of ψ b +IF eff and ψ 2, Now 2, ψ b the two estimators are algebraically related by } ψ 2, = ψ b + IF b } eff + V B2 E B2 2, ψ Since V B2 E B2 is unbiased for zero, we conclude that ψ 2, and ψ b + IF b eff have the same mean but var b ψ2, /var b IF eff b > 1 except when 2, ψ 2, ψ b X = b X = c wp1 for some c Thus, since ψ b +IF eff variance var b IF eff b and, except when bx = c+o p 1, var 2, ψ 2, ψ b has asymptotic n 1, V B2 we conclude the asymptotic variance of ψ2, attains the bound var b IF eff 2, ψ b when n, but exceeds the bound when n, except when b X = c + o p 1 Finally, for completeness, Robins and van der Vaart 19 considered an alternative particularly simple rate-optimal unbiased estimator of ψ Π } 2 b = E B Z given by ψ RV = V Y Z T Z Y j} The Hoeffding decomposition of ψ RV ψ b is i V E BZ T Z Y ψ b + V Y Z T E BZ T Z Y E } BZ i j = IF eff b + Q + T 2, ψ with Q = V Π } B Z B ψ 2 B i Z T,i T = V Z,j Π B Z Y B j j +B i Z T,iZ,j B j Π B Z i B i Π B Z j B j + ψ Since, except when B = c wp1, var b Q n 1 and var b T /n 2, we conclude that the asymptotic variance of ψ RV exceeds the bound var b IF eff regardless 2, ψ b of whether n does or does not hold except when b X = c wp1 Minimaxity with unnown g and β/d 1/4 We now show that the bound n 1 2 for β/d 1/4 is achievable for each β g > 0 Consider the estimator ψ m, with n 2 1+4β/d n and 1 m β b β } p 2βg + d d + 2β b d + 2β p β g so that EB m = O p n m 1βg 2βg+d + β b d+2β + βp b d+2βp is O p n 1/2 Then var ψm, 1/n, TB 2 = O p 1/n and EBm 2 = O p 1/n so ψ m, will be n 1 2 consistent for ψ If = 0 and β < 1/2, the above expression implies that m d 2β 22β +d / βg 2β + 1 g+d

45 Higher order influence functions 379 for n 1 2 consistency Similarly, if, ie β b 0, it is necessary that m d 24β +d / βg 2β +1 for n 1 2 g+d consistency These results imply that estimators ψm, in our class can always achieve n 1 2 consistency whenever β g > 0, but for fixed β < d/2, the order m of the required U-statistic increases without bound as the smoothness β g of g approaches zero Efficiency We now show that when β/d is strictly greater than 1/4, we can construct an unconditional asymptotically linear estimator based on all N subjects with influence function N 1 N i=1 IF 1,ψ,i θ by having the number of the N subjects allotted to the validation sample and analysis sample be N 1 ǫ and n = n ǫ = N N 1 ǫ, respectively, for 1 > ǫ > 0 It then follows from van der Vaart 26 that the estimator is regular and semiparametric efficient Specifically, suppose β/d = 1/4 + δ, δ > 0 Consider the estimator ψ m, with m > 1+ so that EB m = O p N 1 ǫ m 1βg 2βg+d + β b d+2β b + βp d+2βp 1 21 ǫ β b d+2β b } βp 2βg+d d+2β p β g is o p N 1/2 and = n ǫ 1 1+2δ < n ǫ so that TB 2 = o p 1/N and var 2 Then, by our previous results, ÎF jj, ψ = o p 1/N for j It remains to show that But the LHS is nǫ ψ m, ψ θ = n ǫ 1 IF 1,ψ,i θ + o p N 1/2 i=1 N nǫ N 1 IF 1,ψ,i θ n ǫ 1 IF 1,ψ,i θ = o p N 1/2 i=1 i=1 i=1 nǫ } n ǫ 1 n ǫ IF 1,ψ,i θ N 1 N + N 1 i=nǫ+1 = O p n ǫ 1/2 N ǫ + O p N 1 ǫ/2 N 1 = O p N 1/2 N ǫ + O p N 1/2 N ǫ/2 = o p N 1/2 IF 1,ψ,i θ Adaptivity when β/d > 1/4 We next prove that if we let n n ǫ = N N 1 ǫ, m m N = on with ln N = O m N and = n ǫ/ ln n, ψm, will be semiparametric efficient for each β > 1/4, provided ĝ g = o p m N 1 ǫ 2 Clearly, the truncation bias is o N 1/2 The estimation bias EB mn is O p m N 1 ǫ 2mN 1 βb } 1 ǫ N d+2β + βp b d+2βp Thus EB mn = o p N 1/2 if m N 1 ǫ 2mN 1 = o N ǫ βb d+2β b + βp d+2βp } So we require 2m N 1ln m N 1 ǫ} } 1 / 2 1 ǫ βb d+2β b + βp d+2β p ln N, which is satisfied if ln N = O m N In the appendix of our technical report

46 380 J Robins, L Li, E Tchetgen and A van der Vaart we prove that var θ ψm, = ψm, 1 + o var θ p 1} provided ĝ g = o p m N 1 ǫ 2 Now ψm, = n IF θ} mn var θ 1 var θ 1,ψ,i O lnn} l But mn lnn} l 1 lnn mn+1 = O 1 lnn} 1 = O 1 + lnn} 1, l=0 } so var θ ψm, is n IF 1 var θ 1,ψ,i θ 1 + o p 1} = n 1 var IF 1,ψ } 1 + o p 1} The proof of efficiency now proceeds as above Alternative estimators when β/d > 1/4 When β/d > 1/4, there actually exist, at least for certain functionals in our class, n 1 2 -consistent estimators of ψ that are much simpler than our very high order U-statistic estimators For example consider the expected conditional covariance ψ = E cov A, Y X} of Example 1b with d = 1 Example 1b Continued Number the study subjects i = 0,,N 1 ordered by their realized values X i, where we have not split the sample Following Wang et al 27, consider the difference-based estimator N/2 1 ψ d = N 1 i=0 l=0 Y 2i A 2i + Y 2i+1 A 2i+1 Y 2i+1 A 2i Y 2i A 2i+1 }, which has conditional mean given X 1,,X N } of Hence N/2 1 N 1 i=0 N/2 1 + N 1 i=0 E ψd ψ = N 1 E = O p cov A, Y X 2i } + cov A, Y X 2i+1 } N/2 1 i=0 N N i=0 b X i+1 b X i } p X i+1 p X i } b X i+1 b X i } p X i+1 p X i } E X i+1 X i } 2β = O N 2β by the theory of spacings Pye 13 But O N 2β is o p N 1/2 when β > 1/4 The variance of ψ d is O N 1 so ψ d is N 1/2 var θ ψd consistent However, var θ IF 1,ψ θ 1 + o p 1 so ψ d is not semiparametric efficient As discussed by Arellano 1, by using a m th order rather than a second order difference operator and letting m at an appropriate rate as N, the m th order estimator ψ d can be made efficient

47 Higher order influence functions 381 Minimaxity with unnown g and β/d < 1/4 Consider next whether the lower bound of n 4β 4β+d for β/d < 1/4 is achievable when g is unnown but βg > 0 We will show the next section that the bound n 4β 4β+d is achievable provided β/d 2β g /d 2β g /d + 1 > 4β/d 1+4β/d + 1, + 2 2β +11 4β/d β/d 4β/d1 4β/d +1 ie, β g > To attain the bound 4β n 4β+d whenever Equation 41 holds, we introduce new more efficient estimators, owing to the fact that an estimator ψ m, in our class can attain the bound n 4β 4β+d only in the special case where the second order estimation bias EB 2 = O p n βg 2βg+d + β b d+2β + βp b d+2βp is less than n 4β 4β+d For a fixed β = β p + β b /2, the right hand side of Equation 41 is minimized over 0 at = 0 At = 0, Equation 41 reduces to β g /d 2β g /d + 1 > 1 4β/d 1 + 4β/d β/d β g > β 1 4β/d 1 + 2β/d + 8 β/d 2 The right hand side of Equation 41 increases with with asymptote equal to twice the RHS of Equation 42 as Hence, in order to attain the optimal rate n 4β β 4β+d when βp = 2β and β b = 0, the quantity g 2β g+d must be twice as large as when β p = β b = β In the next section, we construct an estimator with a convergence rate of log nn 4β β 4β+d at the cut-point g 1+2βg = 1 4ββ 1+4β In this paper we do not consider the construction of estimators that are rate optimal below this cutpoint However, for the special case = 0, in an unpublished paper Li et al 9 have constructed estimators which converge at a rate given in Eq 13, whenever inequality 41 fails to hold We conjecture that this rate is minimax, possibly only up to log factors, when inequality 41 fails to hold and = 0 At the β cut-point g 1+2βg = 1 4ββ 1+4β, we obtain m = 0 and thus Equation 13 becomes log nn 4β 4β+d, in agreement with the rate of the estimator of Section 412 below In the extreme case in which β g 0 with β remaining fixed, log nn βg/d m βg/d 2β/d log nn βg 1 1+2βg β β1 4β/d1+2βg 2βg = log nn 2β/d, which agrees up to a log factor with the rate of n 2β/d given by the simple estimator of Wang et al 27 analyzed above under Example 1b continued Based on the arguments given in the Appendix of our technical report, we conjecture that when β < d/4 and r n ĝ g = o p 1 for some r n as n, ψ m, ψ = O p n 2β/d log n t for some natural number t and any m = m n 4β/d 2 logn 1+2β/dlogrn and = n = n mn 4β/d mn 1 Remar 41 Suppose b = E Y X = is nown to be contained in a Hölder ball H β, C We provide a heuristic argument as to why the minimax rate for the linear functional b x does not depend on a priori nowledge of the smoothness of f X x but the minimax rate for the functional ψ = E b 2 X may when β/d < 1/4 First

48 382 J Robins, L Li, E Tchetgen and A van der Vaart consider the case where f X x is nown Let φ l x;l = 1,, } be a complete linear independent basis for L 2 F X Define η T = E FX b Xφ X T E FX φ Xφ X T 1 Then b x = l=1 Π b x φ x = η Tφ x is the projection in L 2 F X of b x onto linear span of the first basis functions With f X x nown and η T P n Y φ X T E FX φ Xφ X T 1, unbiased estimators l=1 ηt φ x and η 1, φ T E Xφ X T η 2, of the the truncated functionals b x = η Tφ x and ψ = E b2 X = E η Tφ Xφ X T η are, respectively, rate minimax for b x and ψ, when opt is chosen to equate the order of the respective variances /n and max 1 n, /n2 with the order of the respective squared truncation biases b x b x 2 = 2β/d and ψ ψ 2 = 4β/d For b x, opt = n 1/1+2β/d n and the rate is n β/d+2β For ψ, opt = n 2/1+4β/d n and the rate is n 4β/d/1+4β/d when β < 1/4 The minimax rate for b x with f X x unnown and without smoothness assumptions imposed is the same as with f X x nown, since, subject to some regularity conditions, for opt < n, P n Y φ opt X T P n φ opt Xφ opt X T 1 φopt x η T opt φ opt x remains of order O p opt /n In contrast, with > n, P n φ Xφ X T is not invertible It is for this reason that the minimax rate for ψ with f X x unnown is slower than n 4β/d/1+4β/d unless the model places sufficient restrictions on the complexity of f X x Improved rates of convergence with X random in a semiparametric model We now, as promised in the Introduction, construct an estimator of σ 2 under the homoscedastic model E Y X = b X, vary X = σ 2 with X random with unnown density that, whenever β < min 1, d/4} and, regardless of the smoothness of f X x, converges at the rate n 4β/d 4β/d+1, which is faster than equalspaced non-random minimax rate of n 2β/d Specifically we divide the support of X, ie the unit cube in R d, into = n = n γ, γ > 1 identical subcubes with edge length 1/d We continue to assume the unnown density f X x is absolutely continuous with respect to Lebesgue measure and both it and its inverse are bounded in sup-norm Then it is a standard probability calculation that the number of subcubes containing at least two observations is O p n 2 / We estimate σ 2 in each such subcube by Y i Y j 2 /2, where, for any subcube with 3 or more observations, i and j are chosen randomly, without replacement Our final estimator σ 2 of σ 2 is the average of our subcube-specific estimates Y i Y j 2 /2 over the O p n 2 / subcubes with at least two observations The rate of convergence of the estimator is minimized at n 4β/d 2 4β/d+1 by taing = n 1+4β/d, as we now show We note that E Y i Y j 2 /2 X i, X j = σ 2 + b X i b X j } 2 /2, b X i b X j = O X i X j β by β < 1, and X i X j = d 1/2 O 1/d when X i and X j are in the same subcube It follows that the estimator has variance O p /n 2 and bias of O 2β/d To minimize the convergence rate we

49 Higher order influence functions 383 equate the orders of the variance and the squared bias by solving /n 2 = 4β/d which gives = n 2 1+4β/d Our random design estimator has better bias control and hence converges faster than the optimal equal-spaced fixed X estimator, because the random design estimator exploits the O p n 2 /n 2 1+4β/d random fluctuations for which X s corresponding to two different observations are a distance of O n 2 } 1/d 1+4β/d apart Our estimator will not converge at rate n 4β/d 4β/d+1 to E vary X in our nonparametric model, because it then no longer suffices to average estimates of vary X only over subcubes containing 2 or more observations Indeed, when vary X depends on X, the estimator σ 2 = σ n 2 satisfies 2 σ n E f n Xvar Y X} /Ef n X } = O p n 4β/d 4β/d+1, n = n 2 1+4β/d, f n X is 1/ n times the integral of f X x with respect to Lesbegue measure over the subcube containing X Remar 42 Consider again Example 1c with τ θ being the variance weighted average treatment effect We impose no smoothness assumptions on f X x The argument in the previous two paragraphs implies that if β = β p + β b /2 < d/4 and max β p, β b < 1, we can construct an estimator τ of τ θ that converges at rate n 4β/d 4β/d+1 when the semiparametric model 35 holds, which is faster than our conjectured minimax rate of n 2β/d for τ θ when 35 is not assumed Specifically, again create = n 2 1+4β/d subcubes Let τ mae the sum ψ τ over subcubes containing at least 2 observations of Yi τ Yj τ} A i A j }/2 equal to 0 treating subcubes with greater than 2 observations as above, where Y τ = Y τa When 35 holds, cov Y τ θ,a X} = 0 Thus, an argument analogous to that above implies that ψ τ θ converges to cov Y τ θ, A X} = 0 at rate n 4β/d 4β/d+1 That τ converges to τ θ at rate n 4β/d 4β/d+1 is then proved by a Taylor expansion of 0 = ψ τ around τ θ 41 More efficient estimators 411 Case 1: The estimation bias of the third order estimator is less than the optimal rate In a locally nonparametric model M Θ, the estimator ψ m, = ψ + ÎF is m, ψ essentially the unique m th order U-statistic estimator of the truncated parameter m+1 ψ for which the leading term in the bias is O θ θ However, when the minimax rate of convergence for ψ is less than n 1/2, other m th order U-statistics estimators will often converge to ψ and thus ψ at a faster rate uniformly over the model than does any estimator ψ m, constructed from an estimated higher order influence function ÎF for ψ m, ψ by tolerating bias at orders less than m + 1 in exchange for a savings in variance Remar 43 A heuristic understanding as to why this is so can be gained from the following considerations The theory of higher order influence functions as developed in Theorems 22 and 23 is a theory of score functions derivatives Thus it can directly incorporate the restriction that a function, say b x, has an expansion b x = l=1 η lz l x for which η l = 0 for l >, as the restriction is equivalent to

50 384 J Robins, L Li, E Tchetgen and A van der Vaart various scores being equal to zero However the theory cannot directly incorporate restrictions such as l= η2 l = 2β b or η l l β b+ 2 1 that do not imply any restrictions on score functions Thus to find an optimal estimator, one must perform additional side calculations to quantify the estimation and truncation bias of various candidate estimators under these restrictions As the assumption that b x lies in a Hölder ball can be expressed in terms of such restrictions, this remar is relevant to a search for an optimal rate estimator We now construct more efficient estimators We first consider the case where β b, β b, and β g are such that the estimation bias O n βg 2βg+d + β b d+2β + βp b d+2βp of the second order estimator is greater than O n 4β 4β+d but the estimation bias O n 2βg 2βg+d + β b d+2β + βp b d+2βp of the third order estimator is less than O n 4β 4β+d That is, 44 n 2βg 2βg+d + β b d+2β + βp b d+2βp < n 4β βg 4β+d < n 2βg+d + β b d+2β + βp b d+2βp Then the most efficient estimator ψ m, in our class has rate of convergence slower than n 4β 4β+d because ψ2,opt2 converges at rate n βg 2βg+d + β b d+2β + βp b d+2βp determined by the second order estimation bias and, for m > 3, ψ m,optm converges at a rate 6β no faster than n d+2β = n 4 β d 3/3 1+4 β d = minm;m>3} n 4 β d m/m 1+4 β d We obtained n 4 β d m/m 1+4 β d as 4β/d 1/2, where solves the equation m /n m+1 = 4β/d that equates the variance m /n m+1 of IF m to the squared truncation bias 4β/d First, for the remainder of the paper, we redefine Z z X ϕ, ˆfXÊH 1 XḃXṗx} 1/2 by redefining ϕ n 2, ˆf X to have orthonormal components under L 2 ˆF X such that the linear spans of ϕ t, ˆfX,, ϕ n2, ˆf X} and ϕ tx,, ϕ n 2X} agree for t < n 2 This can be accomplished by Gram-Schmidt orthogonalization of ϕ n 2X in L 2 ˆF X beginning with ϕ n 2X and woring bacwards To describe our more efficient estimator, define for nonnegative integers 0, 1, 0, 1 with 0 < 1 and 0 < 1 the U-statistic 1, 1 1, Û 3 0, = V Û , 0 with 1, Û 1 3 0, 0 = ǫ i1 Z 1,T 0,i PḂH 1 1Z 1 0 Z 1,T 0 I 1 0} 1 0} Z 1 i 0,i i ǫ i1 z s1 X i1 Ḃ } = PH 1 z s1 X i2 z s2 X i2 I s 1 = s 2, i 2 s 1=0+1 s 2= 0+1 z s2 X i3 i3 where Z 1 0 = T Z 0+1,, Z 1, ǫ = H 1 P + H2 Ḃ, = H 1 B + H3 P, I r v = I ij r v with I ij = I i = j

51 Higher order influence functions 385 As an example, ÎF = 0 33, ψ Û3, 1, 0 We can identify 1 0, with the rectangle in R 2 defined by r 1, r 2 ; r 1 1, r 1 1} with , and 1 + 1, 1 + 1, respectively, the vertices closest and furthest from the origin Thus ÎF = 0 33, ψ Û3, 0 is identified with the rectangle 0, 0 Indeed we can write 1, 1 Û 3 0, 0 ǫ i1 z s1 X i1 Ḃ } = PH 1 z s1 X i2 z s2 X i2 I s 1 = s 2 s 1,s 2 1, i 2 1 0, 0 z s2 X i3 i3 where, here and below, s 1 and s 2 are restricted to be integers, so s 1, s 2 1, 1 0, are the lattice points of the rectangle 0 1, We next study the variance of Û3 1 0, It follows from Theorem 320 above 0 1, that the number of lattice points in 1 0, is proportional to the variance of 0 1, Û 1 3 0, so if 0 1 and 0 1, 1 then var Û , 0 1, and var Û3 1 0, 0 are both of order 1 1/n 3 Hence the order of the 1, variance of Û3 1 1, 0, is determined by the vertex of the rectangle 1 0 0, 0 furthest from the origin In contrast by a theorem in the Appendix of our technical report, the mean 1, E Û3 1 is 0, 0 Ê Π δb Z 1 0 δg Q 2 Π δp Z o p 1 with δb = PÊ H 1 X B B, δp = ḂÊ H 1 X P P, δg = gx ĝx and ĝx Q 2 = Ḃ PÊ H 1 X It follows that if 0 1 and 0 1 then 1, E Û3 1 and E, Û3 are both of order 0, 0 0, 0 O p 0 β b 0 βp n/ logn βg 2βg+1 This bias is a product mixture of truncation bias through the term 0 β b 0 βp and estimaton bias through the term n/ logn βg 2βg+1 To see this 1, for E Û3 1, we sup out δg Q 2 from which is 0, 0 Ê Π δb Z 1 0 We then apply Cauchy Schwartz to Π Ê Ê Π δb Z 1 0} 2 1/2 = O δg Q 2 Π δp Z 1 0 O p n/ log n βg 2βg+1 Ê Π δb Z0 1 Π δp Z 1 0 δb Z0 1 Π δp Z 1 0, noting that 0 β b Again a more careful argument using

52 386 J Robins, L Li, E Tchetgen and A van der Vaart Hölder s inequality would show the log factor is unnecessary Hence the order of 1, the mean of Û3 1 1, 0, is determined by the vertex of the rectangle 1 0 0, 0 closest to the origin Motivation With this bacground we are ready to motivate our new estimator Recall from Section 325, that with g nown, the choice g opt 2 = n 2 1+4β/d gives ψ2, g opt 2 ψ = O p n 4β 4β+d because the truncation bias ψ g opt 2 ψ and variance are of order n 4β 4β+d and the estimation bias is zero Any choice of larger than g opt 2 will result in a slower rate of convergence However, when g is unnown and thus estimated, ψ 2, g opt 2 ψ does not attain the optimal rate of convergence because the estimation bias n βg 2βg+d + β b d+2β + βp b d+2βp exceeds n 4β 4β+d The estimator ψ3, g opt 2 = ψ 2, g opt 2 + Û3 g opt 0, 2, fails to attain the rate n 4β 4β+d because it has variance of the order of g opt 2 g opt 2 n 2 1+4β/d n n 2 = O n 8β 4β+d, n g opt 2 0 which exceeds O n 8β 4β+d On the other hand, ψ 3, g opt 2 has bias of O p because the truncation bias is O p n 4β 4β+d and the estimation bias is also O p n 4β 4β+d replace the term Û3 g opt 2, Û 3 by 0, g opt 2 0 Û 3 Ω = O p n 2βg 2βg+d + β b d+2β + βp b d+2βp g opt 2 0 also n 4β 4β+d under our assumption 44 Our strategy will be to try to g opt 2, 0, in the estimator ψ 3, g opt 2 = ψ 2, g opt 2 + s 1,s 2 Ω where Ω is a subset of the rectangle n 8β 4β+d but the additional bias Û 3 g opt 2, 0, E = E Û 3 E g opt 2 0 g opt 2, 0, ǫ i1 z s1 X i1 z s2 X i3 i3 Ḃ } PH 1 z s1 X i2 z s2 X i2 I s 1 = s 2 i 2 g opt 2 0 g 2, opt g s 1,s 2 opt 2 0, 0 Û3 Ω \Ω \Ω g opt 0, 2, ǫ i1 z s1 X i1 g opt 2 such that varû3 Ω 0 Ḃ PH 1 z s1 X i2 i 2 z s2 X i2 I s 1 = s 2 z s2 X i3 i3 }

53 Higher order influence functions 387 is O p n 4β 4β+d This approach will succeed if we can chose Ω and thus g opt 2, 0, g opt 2 \Ω 0 to be sums of rectangles whose number does not increase with n such that g opt 2, i each rectangle in g opt 2 0, \Ω has its closest vertex to the origin, say 0 0, 0, satisfying O p 0 β b 0 βp n βg 2βg+1 n 4β 4β+d and ii simultaneously each rectangle in Ω has its furthest vertex from the origin, say 1, 1, satisfying O 1 1/n 3 = O n 8β 4β+d We index the vertices of our set of rectangles as follows Consider a natural number J and a set of non-negative integers K J,tot = 2, 1, 0,, 2J, 2J+1, 2J+2 } satisfying 0 = 2 < 0 < 2 < < 2J 2 < 2J < 2J+2 = 2J+1 < 2J 1 < < 1 < 1 Note the elements with even subscripts increase from 0 to 2J + 2 while elements with odd subscripts decrease from 1 to 2J 1 Further the smallest element with odd subscript equals the largest element with even subscript We will use two such sets K b,j,tot and K p,j,tot with corresponding elements bl and pl with b, 1 = p, 1 Set for s 1, 0,, J} b,2s+1 = n 3d+4β d+4β /p,2s+2, p,2s+1 = n 3d+4β d+4β /b,2s+2, so p,2s+1 b,2s+2 n 3 = b,2s+1 p,2s+2 n 3 = n 8β 4β+d We leave J, K p,j = p,2s, s = 0,,J + 1}, and K b,j = b,2s, s = 0,,J + 1} unspecified for now but derive optimal values below Let Ω = Ω K pj, K bj be the union of rectangles Ω K pj, K bj = J s=0 } p,2s 1, b,2s p,2s, p,2s 2, b,2s 2 b,2s 1 p,2j+1 p,2s 2 b,2s b,2j+1 p,2j b,2j The points p,2s+1, b,2s+2, p,2s+2, b,2s+1 for s = 1, 0, },J + 1 lie on a hyperbola Hy in R 2 defined by Hy = r 1, r 2 ;r 1 r 2 = n 3d+4β d+4β shown in Figure g opt 2, 1 for J = 2 The set Ω K pj, K bj g opt 2 0, lies below Hy 0 Define We then have ψ 3,KpJ,K bj = ψ 2, 1 + Û3 Ω K pj, K bj Theorem 44 i The estimator ψ 3,KpJ,K bj has variance of the order of 1 + 2J + 1n n2 8β 4β+d

388 J Robins, L Li, E Tchetgen and A van der Vaart Fig 1 Hyperbola Hy and Associated Rectangles and bias E ψ3,kpj,k bj ψ of order O p n βg 2βg+d J s=0 +O p n 2βg 2βg+d + β b d+2β b + βp d+2βp } βp/d

54 388 J Robins, L Li, E Tchetgen and A van der Vaart Fig 1 Hyperbola Hy and Associated Rectangles and bias E ψ3,kpj,k bj ψ of order O p n βg 2βg+d J s=0 +O p n 2βg 2βg+d + β b d+2β b + βp d+2βp } βp/d p,2s+1 β b/d b,2s + β b/d b,2s+1 βp/d p,2s + O p βp+β b/d 1 Proof Each of the 2J + 1 rectangles whose union is Ω K pj, K bj has p,2s+1, b,2s+2 or p,2s+2, b,2s+1 for some s 1, 0,, J} as the vertex furthest from the origin and thus contributes p,2s+1 b,2s+2 n = n 8β 3 4β+d to the variance of ψ 3,KpJ,K bj The variance of ψ 2, 1 1 n Now 2 E ψ3,kpj,k bj, = E ψ3, 1 = O p + E Û 3 ψ } ψ βp+β b/d 1 1, 0, + E Û3 Ω K pj, K bj } E Û 3 + O p n 2βg 2βg+d + β b d+2β + βp b d+2βp } 1 \Ω K pj, K bj 0 As is evident from Figure 1, Ω c K pj, K bj 1, 0, 1, 0, }} \Ω K pj, K bj is the union

Minimax Estimation of a nonlinear functional on a structured high-dimensional model

Minimax Estimation of a nonlinear functional on a structured high-dimensional model Eric Tchetgen Tchetgen Professor of Biostatistics and Epidemiologic Methods, Harvard U. (Minimax ) 1 / 38 Outline Heuristics