Minimax Estimation of a nonlinear functional on a structured high-dimensional model
Eric Tchetgen Tchetgen
Professor of Biostatistics and Epidemiologic Methods, Harvard U.
(Minimax) 1 / 38
Outline

- Heuristics
- First order influence functions: the $\sqrt{n}$-regular case
- Higher order influence functions: non-regular rates slower than $\sqrt{n}$
- Application to the quadratic functional and the missing-at-random data functional

This is joint work with Robins, Li, Mukherjee and van der Vaart.
Heuristics

Let $X_1, \ldots, X_n$ be i.i.d. $\sim p$, where $p$ belongs to a collection $\mathcal{P}$ of densities, and we aim to estimate the value $\chi(p)$ of a functional $\chi : \mathcal{P} \to \mathbb{R}$.

We are particularly interested in semiparametric or nonparametric models, so that $\mathcal{P}$ is infinite dimensional.

We will give special attention to settings where the semiparametric model is described through parameters of low regularity, in which case $\sqrt{n}$-convergence may not be attainable.
Heuristics

For estimation, consider the class of "one-step" estimators of the form
$$\hat\chi = \chi(\hat p_n) + \mathbb{P}_n \chi_{\hat p_n}$$
for $\hat p_n$ an initial estimator of $p$, where $x \mapsto \chi_p(x)$ is a measurable function for each $p$ and $\mathbb{P}_n f = n^{-1} \sum_{i=1}^n f(X_i)$.

One possible choice is $\chi_{\hat p_n} = 0$, leading to the plug-in estimator; this is typically suboptimal.

$\hat p_n$ is constructed from an independent sample, i.e. by sample splitting. This is not required in the regular regime, where one can appeal to empirical process theory; such theory does not apply in the non-regular regime.
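The one-step recipe above can be sketched generically in code. This is an illustrative Python sketch, not from the talk; the function and argument names (`one_step_estimate`, `chi_plugin`, `influence_fn`) are hypothetical. The initial estimator is fit on one half of the data and the empirical average $\mathbb{P}_n$ is taken over the other half, as in the sample-splitting remark above.

```python
import numpy as np

def one_step_estimate(x, chi_plugin, influence_fn, seed=0):
    """Generic one-step estimator: chi_hat = chi(p_hat_n) + P_n[chi_{p_hat_n}],
    with p_hat_n fit on one half of the data (sample splitting) and the
    empirical average P_n taken over the other, independent half."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(x))
    train, evl = x[idx[: len(x) // 2]], x[idx[len(x) // 2:]]
    # chi_plugin(train): plug-in value chi(p_hat_n);
    # influence_fn(train, evl): estimated influence function at the eval points.
    return chi_plugin(train) + influence_fn(train, evl).mean()
```

As a toy check, for the functional $\chi(p) = E[X]$ the plug-in is the training mean and the influence function is $x - E[X]$, so the one-step estimator reduces to the evaluation-half mean.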
Heuristics

To motivate a good choice of $\chi_{\hat p_n}$, consider the decomposition
$$\hat\chi - \chi(p) = \left[\chi(\hat p_n) - \chi(p) + P\chi_{\hat p_n}\right] + (\mathbb{P}_n - P)\chi_{\hat p_n}$$
where $P\chi_{\hat p_n} = \int \chi_{\hat p_n}(x)\, dp(x)$.

The second term is asymptotically normal of order $n^{-1/2}$; a good choice of $\chi_{\hat p_n}$ makes the term in brackets have "small bias", i.e. no larger than $O_P(n^{-1/2})$.

This can be achieved if $P\chi_{\hat p_n}$ acts like minus the derivative of the functional $\chi$ in the $(\hat p_n - p)$ direction.

A function $\chi_{\hat p_n}$ with this property is known as a "first order" influence function.
Heuristics

For a 1st order influence function the bias term is quadratic in the error $d(\hat p_n, p)$, for an appropriate distance $d$.

The no-bias condition then essentially requires $d(\hat p_n, p) = o_P(n^{-1/4})$. To get this rate the model cannot be too large.

For instance, a regression function or density in dimension $d$ can be estimated at rate $n^{-1/4}$ if it is a priori known to have at least $d/2$ derivatives.

Our main concern will be settings where the model is very large or has very low regularity, so that the rate $n^{-1/4}$ is not attainable for $d(\hat p_n, p)$.
Heuristics

The one-step estimator $\hat\chi = \chi(\hat p_n) + \mathbb{P}_n \chi_{\hat p_n}$ is suboptimal in such cases because it does not strike the right balance between bias and variance: in
$$\hat\chi - \chi(p) = \left[\chi(\hat p_n) - \chi(p) + P\chi_{\hat p_n}\right] + (\mathbb{P}_n - P)\chi_{\hat p_n}$$
the magnitude of the first term dominates, and one cannot obtain valid inferences about $\chi(p)$, i.e. CIs are not centered properly to achieve nominal coverage.

We will replace the first order IF $\mathbb{P}_n \chi_{\hat p_n}$ by a higher order IF $\mathbb{U}_n \chi_{\hat p_n}$, an $m$th order U-statistic, such that
$$\hat\chi_n = \chi(\hat p_n) + \mathbb{U}_n \chi_{\hat p_n}$$
and
$$\hat\chi_n - \chi(p) = \left[\chi(\hat p_n) - \chi(p) + P^m \chi_{\hat p_n}\right] + (\mathbb{U}_n - P^m)\chi_{\hat p_n}.$$

This in fact suggests choosing the HOIF such that $P^m \chi_{\hat p_n}$ behaves like the first $m$ terms of the Taylor expansion of $\chi(\hat p_n) - \chi(p)$.
Heuristics

Exact HOIFs exist only for special functionals.

For general functionals, we find a "good approximate" functional, in which case the convergence rate is obtained by trading off estimation bias + approximation bias against variance.

In some cases the optimal rate of the HOIF estimator may still be $n^{-1/2}$, in which case the semiparametric efficiency bound may be achieved: the first order IF determines the variance and the HOIFs correct the bias.

Typically the rate will be slower than $n^{-1/2}$. What is the optimal (i.e. minimax) rate of estimation? Can we achieve it?

Two cases emerge:

- Case 1: the smoothness of $p$ is assumed known; we have obtained both lower and upper bounds.
- Case 2: the smoothness of $p$ is not known, and therefore one must adapt to it; some recent results.
Example 1: Quadratic density functional

Quadratic functional of a density $p(x)$:
$$\chi : p \mapsto \chi(p) = \int_{\mathbb{R}^d} p(x)^2\, dx$$
where $p \in H_d(\alpha, C)$ lies in a $d$-dimensional Hölder ball with smoothness $\alpha$ and radius $C$.

Of interest in bandwidth selection for optimal density smoothing.

Well studied problem, e.g. Bickel and Ritov (1988), Laurent (1996, 1997), Cai and Low (2005), Cai, Low et al (2006), Donoho and Nussbaum (1990), Efromovich and Low et al (1994, 1996), Giné and Nickl (2008), Laurent and Massart (2000).

Exhibits an elbow phenomenon: $\chi(p)$ is $n^{-1/2}$-estimable only if $\alpha/d \geq 1/4$. Below the threshold, the minimax lower bound is $n^{-8\alpha/(d+4\alpha)}$ in squared error norm (Birgé and Massart, 1995).
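A standard estimator of the quadratic functional is a second-order U-statistic built from a kernel, which can be sketched as follows. This is an illustrative Python sketch (restricted to $d = 1$, Gaussian kernel, hypothetical function name), not the construction used in the talk:

```python
import numpy as np

def quadratic_functional_ustat(x, h):
    """Second-order U-statistic estimate of ∫ p(x)^2 dx in d = 1:
    (n(n-1))^{-1} sum_{i != j} K_h(X_i - X_j), with K_h a Gaussian kernel
    of bandwidth h.  Its expectation is ∫ (K_h * p)(x) p(x) dx, which
    approaches ∫ p(x)^2 dx as h -> 0."""
    n = len(x)
    diff = x[:, None] - x[None, :]
    k = np.exp(-0.5 * (diff / h) ** 2) / (h * np.sqrt(2 * np.pi))
    np.fill_diagonal(k, 0.0)          # U-statistic excludes i == j terms
    return k.sum() / (n * (n - 1))
```

For the standard normal density the target value is $\int \varphi^2 = 1/(2\sqrt{\pi}) \approx 0.282$, which the estimator recovers up to smoothing bias of order $h^2$.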
Example 2: Nonlinear density functional

Nonlinear functional of a density $p(x)$:
$$\chi : p \mapsto \chi(p) = \int_{\mathbb{R}^d} T(p(x))\, dx$$
where $T$ is a smooth mapping and $p \in H_d(\alpha, C)$ lies in a $d$-dimensional Hölder ball with smoothness $\alpha$ and radius $C$.

There is a large body of work on estimating the entropy of an underlying distribution: Beirlant et al. (1997); more recent works include estimation of Rényi and Tsallis entropies (Leonenko and Seleznjev, 2010; Pál, Póczos and Szepesvári, 2010). Also see Kandasamy et al. (2014).

Exhibits the same elbow phenomenon: $n^{-1/2}$-estimable only if $\alpha/d \geq 1/4$. Below the threshold, the minimax lower bound is $n^{-8\alpha/(d+4\alpha)}$ in squared error norm (Birgé and Massart, 1995).

The case $T(p) = p^3$ was studied for $d = 1$ by Kerkyacharian, Picard and Tsybakov (1998); a general solution for $d > 1$ was given by Tchetgen et al (2008).
Example 3: Missing at random data

Suppose the full data is $(Y, Z)$ where $Y$ is a binary response and $Z$ a $d$-dimensional vector of covariates; however one observes $(YA, A, Z)$, from which we wish to estimate $\chi = E(Y) = \Pr(Y = 1)$.

$\chi$ is nonparametrically identified under MAR, $A \perp Y \mid Z$, provided $\Pr(A = 1 \mid Z) > 0$ a.s.

Note: this is isomorphic to estimating the average treatment effect of a binary intervention $A$ on $Y$ under the assumption of no unmeasured confounding given $Z$, i.e. $\chi = E(Y_a)$ where $Y_a$ is the counterfactual response under regime $a$; the full data is $(Y_a, Z)$, and the observed data is $(Y_a, Z)$ if $A = a$ and $Z$ otherwise. Identified under $A \perp Y_a \mid Z$.
Example 3: Missing at random data

The functional can be expressed as an observed-data functional in terms of the density $f$ of $Z$ (relative to a measure $\nu$) and the probability $b(z) = \Pr(Y = 1 \mid Z = z) = \Pr(Y = 1 \mid Z = z, A = 1)$:
$$\chi : p_{b,f} \mapsto \chi(p_{b,f}) = \int b f\, d\nu.$$

Alternative parametrization in terms of $a(z)^{-1} = \Pr(A = 1 \mid Z = z)$ and $g = f/a$ (proportional to the density of $Z$ given $A = 1$):
$$\chi : p_{a,b,g} \mapsto \chi(p_{a,b,g}) = \int a b g\, d\nu.$$
Example 3: Missing at random data

Estimators of $\chi(p)$ that are $\sqrt{n}$-consistent and asymptotically efficient in the semiparametric sense have been constructed with a variety of methods, but only if $a$ or $b$ or both parameters are restricted to sufficiently small regularity classes.

Formally, suppose that $a$ and $b$ belong to $H_d(\alpha, C_\alpha)$ and $H_d(\beta, C_\beta)$ respectively; then $\sqrt{n}$-consistent estimators have been obtained for $\alpha$ and $\beta$ large enough that
$$\frac{\alpha}{2\alpha + d} + \frac{\beta}{2\beta + d} \geq \frac{1}{2}. \qquad (1)$$

In our work, we show that (1) is more stringent than required: root-$n$ consistency can be attained using HOIFs provided
$$\frac{\alpha + \beta}{2d} \geq \frac{1}{4}.$$
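The gap between condition (1) and the weaker HOIF condition is easy to check numerically. A minimal sketch (hypothetical function names; the formulas are the two displayed conditions above):

```python
def first_order_ok(alpha, beta, d):
    """Condition (1): alpha/(2 alpha + d) + beta/(2 beta + d) >= 1/2,
    under which classical first-order methods give root-n consistency."""
    return alpha / (2 * alpha + d) + beta / (2 * beta + d) >= 0.5

def hoif_ok(alpha, beta, d):
    """Weaker condition (alpha + beta)/(2 d) >= 1/4 attainable via HOIFs."""
    return (alpha + beta) / (2 * d) >= 0.25
```

For instance, with $d = 10$ and $\alpha = \beta = 2.5$, condition (1) fails (each summand is $1/6$) while the HOIF condition holds with equality.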
Example 3: Missing at random data

For moderate to large dimensions $d$ this is still a restrictive requirement; likewise if regularity is low, even when $d$ is small.

In fact, in Robins, Li, Tchetgen, van der Vaart (2009) we establish that, even if $g$ were known, the minimax lower bound when $(\alpha + \beta)/(2d) < 1/4$ is
$$n^{-2(\alpha+\beta)/(2(\alpha+\beta)+d)}.$$

Our estimators are shown to achieve this minimax rate when $\alpha$ and $\beta$ are a priori known, provided $g$ is smooth enough.
First Order Semiparametric Theory

Theory which goes back to Pfanzagl (1982), Newey (1990), van der Vaart (1991), Bickel et al (1993).

Given a functional $\chi : \mathcal{P} \to \mathbb{R}$ defined on the semiparametric model $\mathcal{P}$, an IF is a function $\chi_p : x \mapsto \chi_p(x)$ of the observed data which satisfies the following equation: for a sufficiently regular submodel $t \mapsto p_t \in \mathcal{P}$,
$$\frac{d}{dt}\Big|_{t=0} \chi(p_t) = \frac{d}{dt}\Big|_{t=0} P_t \chi_p = P(\chi_p g),$$
where $g = (d/dt)|_{t=0}\, p_t / p$ is the score function of the model $t \mapsto p_t$ at $t = 0$. The closed linear span of all scores attached to submodels $t \mapsto p_t$ of $\mathcal{P}$ is called the tangent space.

Therefore an influence function is an element of the Hilbert space $L_2(p)$ whose inner product with elements of the tangent space represents the derivative of the functional. This is an instance of the Riesz representation theorem from functional analysis.
IF of quadratic density functional

To compute the IF of $\chi(p) = \int_{\mathbb{R}^d} p(x)^2\, dx$, consider $\chi(p_t) = \int_{\mathbb{R}^d} p_t(x)^2\, dx$; computing $\frac{d}{dt}\big|_{t=0} \chi(p_t)$ one obtains
$$\frac{d}{dt}\Big|_{t=0} \chi(p_t) = P\{2(p(X) - \chi(p))\, g\}.$$
Therefore
$$\chi_p = 2(p(X) - \chi(p)).$$

Furthermore,
$$\hat\chi = \chi(\hat p_n) + \mathbb{P}_n \chi^{(1)}_{\hat p_n} = 2\,\mathbb{P}_n \hat p_n(X) - \chi(\hat p_n),$$
so that
$$P(\hat\chi - \chi(p)) = -\int (p - \hat p_n)^2(x)\, dx,$$
which is quadratic in the preliminary estimator, as expected.

The variance of $\hat\chi$ is of order $1/n$; if the preliminary estimator $\hat p_n$ can be constructed so that the squared bias is smaller than the variance, the estimator is rate optimal; otherwise the bias dominates and a higher order IF is needed.
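The bias reduction of the one-step correction over the plug-in can be seen in a small simulation. An illustrative Python sketch (restricted to $d = 1$, kernel density estimate on a grid, hypothetical function names), computing both $\chi(\hat p_n)$ and the one-step $2\,\mathbb{P}_n \hat p_n(X) - \chi(\hat p_n)$ with sample splitting:

```python
import numpy as np

def kde(train, pts, h):
    """Gaussian kernel density estimate p_hat (fit on `train`) at `pts`."""
    z = (pts[:, None] - train[None, :]) / h
    return np.exp(-0.5 * z ** 2).mean(axis=1) / (h * np.sqrt(2 * np.pi))

def plugin_vs_onestep(x, h, grid):
    """Plug-in chi(p_hat) = ∫ p_hat^2 versus the one-step estimator
    2 P_n p_hat(X) - chi(p_hat), whose bias is the quadratic remainder
    -∫ (p - p_hat)^2; p_hat is fit on one half, P_n uses the other."""
    train, evl = x[: len(x) // 2], x[len(x) // 2:]
    dx = grid[1] - grid[0]
    chi_plug = (kde(train, grid, h) ** 2).sum() * dx   # Riemann sum for ∫ p_hat^2
    chi_one = 2 * kde(train, evl, h).mean() - chi_plug
    return chi_plug, chi_one
```

For standard normal data the target is $1/(2\sqrt{\pi}) \approx 0.282$; the one-step value is typically markedly closer than the plug-in, whose smoothing bias is first order.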
Missing Data IF

To compute the IF in the MAR example, consider the score functions of the one-dimensional submodel $p_t = p_{a_t, b_t, f_t}$ induced by paths of the form
$$a_t = a + t\,\mathbf{a}, \qquad b_t = b + t\,\mathbf{b}, \qquad f_t = f + t\,\mathbf{f}$$
for given measurable functions $\mathbf{a}, \mathbf{b}, \mathbf{f} : \mathcal{Z} \to \mathbb{R}$.

This gives rise to the corresponding scores
$$\frac{Aa(Z) - 1}{a(Z)\{a(Z) - 1\}}\,\mathbf{a}(Z) \;\; \text{(a-score)}, \qquad \frac{A(Y - b(Z))}{b(Z)\{1 - b(Z)\}}\,\mathbf{b}(Z) \;\; \text{(b-score)}, \qquad \frac{\mathbf{f}(Z)}{f(Z)} \;\; \text{(f-score)}.$$
IF for Missing Data Functional

Let $B$ = a-score + b-score + f-score. Then
$$\frac{d}{dt}\Big|_{t=0} \chi(p_t) = P\left(\chi^{(1)}_p B\right)$$
for all $B$ in the tangent space of the model $\mathcal{P}$, where
$$\chi^{(1)}_p = Aa(Z)(Y - b(Z)) + b(Z) - \chi(p).$$

This is the well-known doubly robust influence function due to Robins, Rotnitzky and Zhao (1994).
IF for Missing Data Functional

Suppose one has obtained rate-optimal nonparametric estimators $\hat a, \hat b$ (optimal kernel smoothing or series estimation), e.g. $\hat a$ converges at rate $n^{-2\alpha/(2\alpha+d)}$ in mean squared error. Then the one-step estimator
$$\hat\chi = \chi(\hat p_n) + \mathbb{P}_n \chi^{(1)}_{\hat p_n} = \mathbb{P}_n\left\{A\hat a(Z)\left(Y - \hat b(Z)\right) + \hat b(Z)\right\}$$
has second order bias
$$\left|E(\hat\chi) - \chi(p)\right| \lesssim \|\hat a - a\|_2\, \|\hat b - b\|_2,$$
which is quadratic in the preliminary estimators, as expected. Note that $f$ does not enter the bias (provided the empirical measure is used).

The variance of $\hat\chi$ is of order $1/n$. If the preliminary estimators $\hat a$ and $\hat b$ can be constructed so that the squared bias is smaller than the variance, the estimator is rate optimal; otherwise the bias dominates and a higher order IF is needed.
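This one-step estimator, $\mathbb{P}_n\{A\hat a(Z)(Y - \hat b(Z)) + \hat b(Z)\}$, is the familiar AIPW form. A minimal Python sketch with cross-fitting (hypothetical names; the nuisance fitters are passed in as callables and are assumptions of the sketch, not part of the slides):

```python
import numpy as np

def aipw_mean(y, a, z, fit_b, fit_invprob):
    """One-step (AIPW) estimator P_n{ A a_hat(Z)(Y - b_hat(Z)) + b_hat(Z) }
    of chi = E(Y) under MAR, with cross-fitting: nuisances are fit on one
    half of the sample and averaged over the other half, then roles swap.
    fit_b(z, y, a) -> callable b_hat(z); fit_invprob(z, a) -> callable
    a_hat(z), where a_hat estimates 1 / Pr(A = 1 | Z)."""
    n = len(y)
    halves = [np.arange(n // 2), np.arange(n // 2, n)]
    vals = []
    for tr, ev in [(halves[0], halves[1]), (halves[1], halves[0])]:
        b_hat = fit_b(z[tr], y[tr], a[tr])
        a_hat = fit_invprob(z[tr], a[tr])
        vals.append(np.mean(a[ev] * a_hat(z[ev]) * (y[ev] - b_hat(z[ev]))
                            + b_hat(z[ev])))
    return float(np.mean(vals))
```

Any reasonable nonparametric (or, in a test, oracle) nuisance fitters can be plugged in; the bias of the result is the product of the two nuisance errors, per the display above.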
HOIFs

Given a collection of smooth submodels $t \mapsto p_t$ in $\mathcal{P}$, an $m$th order influence function $\chi_p$ of the functional $\chi(p)$ satisfies $P^m \chi_p = 0$ and
$$\frac{d^j}{dt^j}\Big|_{t=0} \chi(p_t) = \frac{d^j}{dt^j}\Big|_{t=0} P^m_t \chi_p, \qquad j = 1, \ldots, m,$$
for every submodel through $p$ in $\mathcal{P}$.

The latter condition implies a Taylor expansion of $t \mapsto \chi(p_t)$ at $t = 0$ of order $m$, but in addition requires that the derivatives of this map can be represented as expectations involving a function $\chi_p$.

In order to exploit derivatives up to the $m$th order, groups of $m$ observations can be used to match the expectation $P^m$; this leads to U-statistics of order $m$.
HOIFs

Thus, we must solve for an $m$th order U-statistic with mean zero that satisfies
$$\frac{d^j}{dt^j}\Big|_{t=0} \chi(p_t) = \frac{d^j}{dt^j}\Big|_{t=0} P^m_t \chi_p, \qquad j = 1, \ldots, m,$$
for every submodel through $p$ in $\mathcal{P}$.

This equation involves multiple derivatives and many paths, so directly solving for $\chi_p$ is challenging.

For actual computation of influence functions, we have shown that it is usually easier to derive higher order influence functions as influence functions of lower order ones.

In fact, the Hoeffding decomposition theorem states that one can decompose any $m$th order U-statistic as the sum of degenerate U-statistics of orders $1, 2, \ldots, m$.
HOIFs

Thus, by the Hoeffding decomposition,
$$\mathbb{U}_n \chi_p = \mathbb{U}_n \chi^{(1)}_p + \frac{\mathbb{U}_n \chi^{(2)}_p}{2!} + \cdots + \frac{\mathbb{U}_n \chi^{(m)}_p}{m!},$$
where $\chi^{(j)}_p$ is a degenerate (symmetric) kernel of $j$ arguments defined uniquely as a projection of $\chi_p$.

Suitable functions $\chi^{(j)}_p$ can be found by the following algorithm:

1. Let $x_1 \mapsto \chi^{(1)}_p(x_1)$ be a first order influence function of the functional $p \mapsto \chi(p)$.
2. Let $x_j \mapsto \chi^{(j)}_p(x_1, x_2, \ldots, x_j)$ be a first order influence function of the functional $p \mapsto \chi^{(j-1)}_p(x_1, x_2, \ldots, x_{j-1})$, for each fixed $x_1, x_2, \ldots, x_{j-1}$ and $j = 2, \ldots, m$.
3. Let $\chi^{(j)}_p$ be replaced by its degenerate part relative to $P$, i.e.
$$\int \chi^{(j)}_p(X_1, \ldots, x_s, \ldots, X_j)\, dp(x_s) = 0 \quad \text{for all } s = 1, \ldots, j.$$
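The Hoeffding decomposition invoked above can be verified numerically on a textbook case (an illustrative Python sketch, not from the slides; the variance kernel and the centering at $EX = 0$ are assumptions of the example). For $h(x_1, x_2) = (x_1 - x_2)^2/2$ with $EX = 0$ and $\operatorname{Var} X = \sigma^2$, the decomposition is $\theta = \sigma^2$, first-order projection $h_1(x) = (x^2 - \sigma^2)/2$, and degenerate part $h_2(x_1, x_2) = -x_1 x_2$; the identity then holds exactly, sample by sample:

```python
import numpy as np

def hoeffding_decomposition_check(x, sigma2):
    """For the variance kernel h(x1, x2) = (x1 - x2)^2 / 2 with EX = 0 and
    Var X = sigma2, verify the Hoeffding decomposition
        U_n h = theta + (2/n) sum_i h1(X_i) + U_n h2,
    with theta = sigma2, h1(x) = (x^2 - sigma2)/2, h2(x1, x2) = -x1 x2.
    Returns (left-hand side, right-hand side)."""
    n = len(x)
    d2 = (x[:, None] - x[None, :]) ** 2 / 2.0
    np.fill_diagonal(d2, 0.0)
    u_full = d2.sum() / (n * (n - 1))                 # U_n h
    lin = 2.0 * np.mean((x ** 2 - sigma2) / 2.0)      # (2/n) sum_i h1(X_i)
    xy = np.outer(x, x)
    np.fill_diagonal(xy, 0.0)
    u_deg = -xy.sum() / (n * (n - 1))                 # U_n h2 (degenerate part)
    return u_full, sigma2 + lin + u_deg
```

Both sides agree to machine precision for any sample, illustrating that the decomposition is an algebraic identity, not merely an asymptotic one.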
HOIFs

Thus, HOIFs are constructed as first order influence functions of an influence function.

Although the order $m$ is fixed at a suitable value, we suppress it in the notation $\chi_p$.

In abuse of language, we refer to $\chi^{(j)}_p$ as the $j$th order IF.

The starting IF $\chi^{(1)}_p$ in step 1 of the algorithm may be any first order IF; it need not be an element of the tangent space (i.e. efficient) and need not have mean zero.

The same remark applies in step 2 of the algorithm; it is only in step 3 that we make the IFs degenerate.
HOIFs of quadratic functional

Recall that the uncentered (1st order) IF of $p \mapsto \int_{\mathbb{R}^d} p(x)^2\, dx$ is $\chi^{(1)}_p(x) = 2p(x)$, where $p(x)$ is the joint density of $X$. Applying step 2 for $j = 2$, we seek $\chi^{(2)}_p(x_1, x_2)$ that solves
$$\frac{d}{dt}\Big|_{t=0} \chi^{(1)}_{p_t}(x_1) = P\left\{\chi^{(2)}_p(x_1, X_2)\, g(X_2)\right\}$$
for all $x_1$ and smooth paths $t \mapsto p_t$. However,
$$\frac{d}{dt}\Big|_{t=0} \chi^{(1)}_{p_t}(x_1) = 2\, \frac{d}{dt}\Big|_{t=0} p_t(x_1),$$
and $p(x_1)$, the value of a continuous density at a point, is not pathwise differentiable, so such a representation does not exist!
Approximate functionals
HOIFs will generally fail to exist if the first-order IF depends on a parameter that is not pathwise differentiable, such as a conditional expectation or a density at a point.
To make progress, we propose to replace χ(p) with an approximate functional χ(p̃), for a given mapping p ↦ p̃, such that χ(p) and χ(p̃) are close and χ(p̃) is higher-order pathwise differentiable.
This leads to representation bias. Ensuring that this bias is negligible requires that the projection of p onto p̃ involve models of high dimension (number of parameters of order n or larger).
This will typically result in a variance of the HOIF larger than n^{-1}, so that nonparametric rates of convergence apply.
Projection kernel
An orthogonal projection K_p : L_2(p) → L_2(p) onto a k-dimensional linear subspace of L_2(p) is given by a kernel operator, with the kernel denoted by the same symbol as the operator:
K_p t(x) = ∫ K_p(x, y) t(y) dp(y).
The projection property K_p² = K_p holds.
We also require that sup_u K_p(u, u) ≲ k and ‖K_p t‖ ≲ ‖t‖.
Moreover, it is desirable that the projection operator be suitable for approximation in Hölder spaces: for every k, and ‖·‖_α a norm for H^α[0,1]^d,
sup_{‖t‖_α ≤ 1} ‖K_p t − t‖ ≲ (1/k)^{α/d}.
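As a concrete sketch (my own toy instance, not from the talk): with d = 1 and p uniform on [0,1], the projection onto piecewise constants over k equal-width bins has kernel K(x, y) = k·1{x and y share a bin}. The snippet below checks the projection property K² = K, the diagonal identity K(u, u) = k, and that the Hölder-type approximation error shrinks as k grows:

```python
import numpy as np

# Histogram projection kernel on [0,1]: K(x, y) = k * 1{x, y in the same bin}
def histogram_kernel(x, y, k):
    bx = np.minimum((np.asarray(x) * k).astype(int), k - 1)
    by = np.minimum((np.asarray(y) * k).astype(int), k - 1)
    return k * (bx == by)

def apply_K(tvals, grid, k, w):
    # (K t)(x) = ∫ K(x, y) t(y) dy, approximated by midpoint quadrature
    return np.array([np.sum(histogram_kernel(x, grid, k) * tvals) * w
                     for x in grid])

grid = (np.arange(2000) + 0.5) / 2000   # midpoint quadrature grid
w = 1.0 / len(grid)
t = np.sin(2 * np.pi * grid)            # a smooth test function (alpha = 1)

Kt8 = apply_K(t, grid, 8, w)
KKt8 = apply_K(Kt8, grid, 8, w)
proj_ok = np.allclose(Kt8, KKt8)        # projection property K^2 = K
diag_ok = np.all(histogram_kernel(grid, grid, 8) == 8)  # K(u, u) = k

err8 = np.max(np.abs(Kt8 - t))          # sup-norm approximation error, k = 8
err16 = np.max(np.abs(apply_K(t, grid, 16, w) - t))     # smaller for k = 16
```

Doubling k roughly halves the sup-norm error here, consistent with the (1/k)^{α/d} bound for α = d = 1.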
Approximate quadratic functional
The approximate functional:
χ(p̃) = ∫ p̃(x)² dx,  p̃(x) = K_μ p(x),  μ the Lebesgue measure.
Note that whereas p(x) is not pathwise differentiable, p̃(x) is!
Therefore
(d/dt)|_{t=0} χ(p̃_t) = P{2 p̃(X) g(X)},
(d/dt)|_{t=0} χ_{p̃_t}^{(1)}(x_1) = 2 (d/dt)|_{t=0} p̃_t(x_1) = P{2 K_μ(x_1, X_2) g(X_2)},
so that χ_{p̃}^{(2)}(x_1, x_2) = 2 K_μ(x_1, x_2), and χ_{p̃}^{(j)} = 0 for j > 2.
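The second-order representation above can be verified in one line. For a path p_t with score g, using p̃_t = K_μ p_t and (d/dt)p_t|_{t=0} = g p:

```latex
\frac{d}{dt}\Big|_{t=0}\chi^{(1)}_{\tilde p_t}(x_1)
  = 2\,\frac{d}{dt}\Big|_{t=0}\int K_\mu(x_1,y)\,p_t(y)\,dy
  = 2\int K_\mu(x_1,y)\,g(y)\,p(y)\,dy
  = P\bigl\{2K_\mu(x_1,X_2)\,g(X_2)\bigr\},
```

so χ_{p̃}^{(2)}(x_1, x_2) = 2K_μ(x_1, x_2) solves the step-2 equation, and since K_μ is a fixed kernel not depending on p, all higher-order derivatives vanish.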
Approximate quadratic functional
Step 3 of the algorithm yields a 2nd-order IF:
U_n χ_{p̃} = U_n χ_{p̃}^{(1)} + (1/2) U_n χ_{p̃}^{(2)},
χ_{p̃}^{(1)}(x_1) = 2(p̃(x_1) − χ(p̃)),
χ_{p̃}^{(2)}(x_1, x_2) = 2(K_μ(x_1, x_2) − p̃(x_1) − p̃(x_2) + χ(p̃)).
We further have, for χ̂_n = χ(p̂_n) + U_n χ_{p̂_n},
P(χ̂_n − χ(p̃)) = 0
and
var(χ̂_n) ≲ max( var(U_n χ_{p̃}^{(1)}), var((1/2) U_n χ_{p̃}^{(2)}) ) = max(1/n, k/n²).
The representation bias is
|χ(p̃) − χ(p)| ≲ (1/k)^{2α/d}.
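One can check that when the pieces of χ̂_n are expanded, the terms involving the pilot estimate p̂ cancel algebraically, leaving the classical pairwise U-statistic (1/(n(n−1))) Σ_{i≠j} K_μ(X_i, X_j). A toy sketch (the histogram kernel and the density p(x) = 2x are my illustrative choices, not from the talk):

```python
import numpy as np

rng = np.random.default_rng(0)

def chi_hat(x, k):
    """Second-order estimator of ∫ p² with histogram kernel K(x,y) = k·1{same bin}."""
    n = len(x)
    counts = np.bincount(np.minimum((x * k).astype(int), k - 1), minlength=k)
    # sum_{i != j} K(X_i, X_j) = k * sum over bins of c_b (c_b - 1)
    return k * np.sum(counts * (counts - 1)) / (n * (n - 1))

# Hypothetical example: density p(x) = 2x on [0,1], so χ(p) = ∫ 4x² dx = 4/3
n = 4000
x = np.sqrt(rng.uniform(size=n))   # inverse-CDF sampling from p
est = chi_hat(x, k=20)
```

With k = 20 bins the representation bias (1/k)^{2α/d} is negligible here, and the estimator fluctuates around 4/3 at the n^{-1/2} scale.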
Approximate quadratic functional
Therefore, if α/d ≥ 1/4, the approximation bias can be made ≲ n^{−1/2} with k ≤ n, so that
var(χ̂_n) ≲ max(1/n, k/n²) = 1/n,
and the functional is estimable at rate n^{−1/2}.
However, whenever α/d < 1/4, the optimal MSE is attained by k_opt = n^{2d/(4α+d)} > n, achieving the minimax rate of convergence n^{−4α/(4α+d)} (MSE of order n^{−8α/(4α+d)}).
Note that because there is no estimation bias, the optimal rate is obtained by trading off variance and representation bias.
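The value of k_opt follows from balancing the squared representation bias against the variance of the degenerate second-order term:

```latex
\underbrace{\Bigl(\tfrac{1}{k}\Bigr)^{4\alpha/d}}_{\text{bias}^2}
 \;\asymp\; \underbrace{\frac{k}{n^{2}}}_{\text{variance}}
 \;\Longleftrightarrow\; k^{1+4\alpha/d}\asymp n^{2}
 \;\Longleftrightarrow\; k_{\mathrm{opt}} = n^{2d/(4\alpha+d)},
```

which gives RMSE k_opt^{−2α/d} = n^{−4α/(4α+d)}, i.e. MSE n^{−8α/(4α+d)}. For α/d ≥ 1/4 one has k_opt ≤ n, and the linear-term variance 1/n dominates instead.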
Approximate functional
Interestingly, in this example the approximate functional is locally equivalent to the true functional, in the sense that their first-order IFs match.
Although this is a desirable property, because it leads to parsimonious HOIFs, it will not generally hold for an arbitrary functional; it can, however, be ensured by construction.
To do so, we propose to define the approximate functional to satisfy
(d/dt)|_{t=0} ( χ(p̃_t) + P χ_{p̃_t}^{(1)} ) = 0.
In fact, the first-order IF of χ(p̃) then coincides with that of χ(p) (evaluated at p̃), so step 1 of the algorithm returns the same first-order IF.
Approximate functional for the missing data functional
To define the approximate functional, for a user-specified â we let ã be the function such that (ã − â)/a is the orthogonal projection of (a − â)/a onto a "large" linear subspace of L_2(g), with projection kernel K_g, i.e.
ã/a = â/a + K_g((a − â)/a),
and define b̃ similarly.
The corresponding approximate functional χ(p̃) = ∫ ã b̃ g dν has the same first-order IF as χ(p), and its representation bias satisfies
|∫ ã b̃ g dν − ∫ a b g dν| ≤ ‖(a − ã)/a‖ · ‖(b − b̃)/b‖ ≲ (1/k)^{(α+β)/d},
with norms taken in L_2(abg dν); this can be made small by choosing the range of K_g large enough.
Approximate functional for the missing data functional
Step 1 of the algorithm applied to χ(p̃) gives
χ_{p̃}^{(1)}(x_1) = a_1 ã(z_1)(y_1 − b̃(z_1)) + b̃(z_1).
The corresponding bias of χ̂_n = χ(p̂_n) + U_n χ_{p̂_n}^{(1)} is
∫ (â − a)(b̂ − b) g dν = O_p(n^{−α/(2α+d) − β/(2β+d)}),
while the corresponding variance is of order 1/n; recovering a regular (√n-consistent) estimator, when possible, therefore requires
α/(2α+d) + β/(2β+d) ≥ 1/2.
One may be able to relax this somewhat by undersmoothing, or by exploiting the fact that the bias has an integral representation. We do not pursue this here, as we assume all nuisance parameters are estimated at their rate-optimal rates.
Approximate functional for the missing data functional
Whereas χ_p^{(1)}(x_1) is not pathwise differentiable, χ_{p̃}^{(1)}(x_1) is.
Step 2 of the algorithm applied to χ_{p̃}^{(1)}(x_1) gives
χ_{p̃}^{(2)}(x_1, x_2) = 2 a_1 (y_1 − b̃(z_1)) K_g(z_1, z_2) (a_2 ã(z_2) − 1).
Step 3 symmetrizes this IF and makes it a degenerate U-statistic.
Note that because K_g depends on g, the 2nd-order IF is not exactly unbiased, and therefore higher-order IFs may be required to further reduce the estimation bias, depending on the smoothness of g.
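A toy illustration in the spirit of these slides (my own simplified construction, not the exact estimator of the talk): a MAR mean functional with deliberately crude nuisance fits â, b̂, whose first-order bias is then largely removed by a second-order U-statistic coupling the two residuals through a histogram kernel K_g(z_1, z_2) = k·1{same bin} over Z. The simulation design (π, b, the constant nuisance fits) is entirely hypothetical:

```python
import numpy as np

rng = np.random.default_rng(1)

n, k = 8000, 20
z = rng.uniform(size=n)
pi = 0.4 + 0.4 * z                        # true propensity P(A = 1 | Z)
a_ind = (rng.uniform(size=n) < pi).astype(float)           # A
y = np.where(a_ind == 1, z + 0.3 * rng.standard_normal(n), 0.0)
# truth: b(z) = E[Y | Z = z, A = 1] = z, so χ(p) = E[b(Z)] = 0.5

a_hat = 1.0 / a_ind.mean()                # crude constant fit of a = 1/π
b_hat = 0.5                               # crude constant fit of b

# First-order (AIPW-type) estimator: biased because â, b̂ are poor
if1 = a_ind * a_hat * (y - b_hat) + b_hat
chi1 = if1.mean()

# Second-order correction: U-statistic coupling the residuals
#   e1 = A·â·(Y − b̂)  (carries b − b̂)  and  e2 = A·â − 1  (carries a − â)
# through K_g; with a histogram kernel it can be computed bin-by-bin.
e1 = a_ind * a_hat * (y - b_hat)
e2 = a_ind * a_hat - 1.0
bins = np.minimum((z * k).astype(int), k - 1)
s1 = np.bincount(bins, weights=e1, minlength=k)
s2 = np.bincount(bins, weights=e2, minlength=k)
diag = np.bincount(bins, weights=e1 * e2, minlength=k)
u_stat = k * np.sum(s1 * s2 - diag) / (n * (n - 1))        # Σ_{i≠j} e1_i K e2_j

chi2 = chi1 - u_stat                      # second-order estimator
```

In this design the first-order estimator carries a bias of roughly E[(πâ − 1)(b − b̂)] ≈ 0.056, while the second-order correction estimates and removes most of it.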
Minimax estimation with the 2nd-order IF
Let χ̂_n = χ(p̂_n) + U_n χ_{p̂_n}^{(1)} + (1/2) U_n χ_{p̂_n}^{(2)}, and let γ denote the Hölder coefficient of g.
Theorem. Suppose γ/(d+2γ) ≥ 2(β+α)/(d+2(β+α)) − (β/(d+2β) + α/(d+2α)); set k = n^{2d/(d+2(β+α))}. Then we have the following result:
1. If (β+α)/2d ≥ 1/4, then sup_{θ∈Θ(β,α,γ)} (χ̂_n − χ(p)) = O_p(n^{−1/2});
2. If (β+α)/2d ≤ 1/4, then sup_{θ∈Θ(β,α,γ)} (χ̂_n − χ(p)) = O_p(n^{−2(β+α)/(d+2(β+α))}).
Minimax estimation
This result is new! All existing methods break down once (β+α)/2d < 1/4, i.e. once the effective smoothness (β+α)/2d is too small.
We have likewise constructed minimax estimators when
γ/(d+2γ) < 2(β+α)/(d+2(β+α)) − (β/(d+2β) + α/(d+2α)).
These higher-order IFs are higher-order U-statistics used for further bias reduction without increasing the variance beyond the second-order rate k/n²; this requires tremendous care (see Robins et al., 2016, Annals of Statistics).
Honest confidence intervals
Our approach can be used to obtain honest CIs for χ(p), given by C_n = χ̂_n ± m_{1−α} σ(χ̂_n), so that
lim_n Pr{χ(p) ∉ C_n} ≤ α
uniformly in p ∈ P, with an appropriate constant m_{1−α}.
This is because our HOIFs are asymptotically normally distributed (Robins et al., 2016).
Simulations of honest confidence intervals
Nominal coverage 90%; effective smoothness of g = 0.2. (Simulation figure not transcribed.)
Final remarks
Our results apply to a general class of functionals, which includes the semilinear model, several IV models, MAR monotone missing data functionals, and nonignorable missing data functionals, among others.
One will rarely know the smoothness of the nuisance parameters; ideally, one would like to adapt to such smoothness.
We have recently developed such adaptive estimators for a large class of functionals by extending and applying a technique proposed by Lepski (Mukherjee, Tchetgen Tchetgen, Robins, 2016).
More informationIntroduction to Empirical Processes and Semiparametric Inference Lecture 02: Overview Continued
Introduction to Empirical Processes and Semiparametric Inference Lecture 02: Overview Continued Michael R. Kosorok, Ph.D. Professor and Chair of Biostatistics Professor of Statistics and Operations Research
More informationSDS : Theoretical Statistics
SDS 384 11: Theoretical Statistics Lecture 1: Introduction Purnamrita Sarkar Department of Statistics and Data Science The University of Texas at Austin https://psarkar.github.io/teaching Manegerial Stuff
More informationLecture 8: Information Theory and Statistics
Lecture 8: Information Theory and Statistics Part II: Hypothesis Testing and I-Hsiang Wang Department of Electrical Engineering National Taiwan University ihwang@ntu.edu.tw December 23, 2015 1 / 50 I-Hsiang
More informationUniversity of California, Berkeley
University of California, Berkeley U.C. Berkeley Division of Biostatistics Working Paper Series Year 2015 Paper 334 Targeted Estimation and Inference for the Sample Average Treatment Effect Laura B. Balzer
More informationClass 2 & 3 Overfitting & Regularization
Class 2 & 3 Overfitting & Regularization Carlo Ciliberto Department of Computer Science, UCL October 18, 2017 Last Class The goal of Statistical Learning Theory is to find a good estimator f n : X Y, approximating
More informationDensity estimation Nonparametric conditional mean estimation Semiparametric conditional mean estimation. Nonparametrics. Gabriel Montes-Rojas
0 0 5 Motivation: Regression discontinuity (Angrist&Pischke) Outcome.5 1 1.5 A. Linear E[Y 0i X i] 0.2.4.6.8 1 X Outcome.5 1 1.5 B. Nonlinear E[Y 0i X i] i 0.2.4.6.8 1 X utcome.5 1 1.5 C. Nonlinearity
More informationSmooth simultaneous confidence bands for cumulative distribution functions
Journal of Nonparametric Statistics, 2013 Vol. 25, No. 2, 395 407, http://dx.doi.org/10.1080/10485252.2012.759219 Smooth simultaneous confidence bands for cumulative distribution functions Jiangyan Wang
More informationAsymptotic Nonequivalence of Nonparametric Experiments When the Smoothness Index is ½
University of Pennsylvania ScholarlyCommons Statistics Papers Wharton Faculty Research 1998 Asymptotic Nonequivalence of Nonparametric Experiments When the Smoothness Index is ½ Lawrence D. Brown University
More informationMeasuring robustness
Measuring robustness 1 Introduction While in the classical approach to statistics one aims at estimates which have desirable properties at an exactly speci ed model, the aim of robust methods is loosely
More information41903: Introduction to Nonparametrics
41903: Notes 5 Introduction Nonparametrics fundamentally about fitting flexible models: want model that is flexible enough to accommodate important patterns but not so flexible it overspecializes to specific
More informationOn variable bandwidth kernel density estimation
JSM 04 - Section on Nonparametric Statistics On variable bandwidth kernel density estimation Janet Nakarmi Hailin Sang Abstract In this paper we study the ideal variable bandwidth kernel estimator introduced
More informationSupplement to Quantile-Based Nonparametric Inference for First-Price Auctions
Supplement to Quantile-Based Nonparametric Inference for First-Price Auctions Vadim Marmer University of British Columbia Artyom Shneyerov CIRANO, CIREQ, and Concordia University August 30, 2010 Abstract
More informationMMSE Dimension. snr. 1 We use the following asymptotic notation: f(x) = O (g(x)) if and only
MMSE Dimension Yihong Wu Department of Electrical Engineering Princeton University Princeton, NJ 08544, USA Email: yihongwu@princeton.edu Sergio Verdú Department of Electrical Engineering Princeton University
More informationMissing Covariate Data in Matched Case-Control Studies
Missing Covariate Data in Matched Case-Control Studies Department of Statistics North Carolina State University Paul Rathouz Dept. of Health Studies U. of Chicago prathouz@health.bsd.uchicago.edu with
More informationMinimum Hellinger Distance Estimation in a. Semiparametric Mixture Model
Minimum Hellinger Distance Estimation in a Semiparametric Mixture Model Sijia Xiang 1, Weixin Yao 1, and Jingjing Wu 2 1 Department of Statistics, Kansas State University, Manhattan, Kansas, USA 66506-0802.
More informationL p Functions. Given a measure space (X, µ) and a real number p [1, ), recall that the L p -norm of a measurable function f : X R is defined by
L p Functions Given a measure space (, µ) and a real number p [, ), recall that the L p -norm of a measurable function f : R is defined by f p = ( ) /p f p dµ Note that the L p -norm of a function f may
More informationStatistics - Lecture One. Outline. Charlotte Wickham 1. Basic ideas about estimation
Statistics - Lecture One Charlotte Wickham wickham@stat.berkeley.edu http://www.stat.berkeley.edu/~wickham/ Outline 1. Basic ideas about estimation 2. Method of Moments 3. Maximum Likelihood 4. Confidence
More informationIntroduction to Empirical Processes and Semiparametric Inference Lecture 09: Stochastic Convergence, Continued
Introduction to Empirical Processes and Semiparametric Inference Lecture 09: Stochastic Convergence, Continued Michael R. Kosorok, Ph.D. Professor and Chair of Biostatistics Professor of Statistics and
More informationUnsupervised Learning Techniques Class 07, 1 March 2006 Andrea Caponnetto
Unsupervised Learning Techniques 9.520 Class 07, 1 March 2006 Andrea Caponnetto About this class Goal To introduce some methods for unsupervised learning: Gaussian Mixtures, K-Means, ISOMAP, HLLE, Laplacian
More informationNETWORK-REGULARIZED HIGH-DIMENSIONAL COX REGRESSION FOR ANALYSIS OF GENOMIC DATA
Statistica Sinica 213): Supplement NETWORK-REGULARIZED HIGH-DIMENSIONAL COX REGRESSION FOR ANALYSIS OF GENOMIC DATA Hokeun Sun 1, Wei Lin 2, Rui Feng 2 and Hongzhe Li 2 1 Columbia University and 2 University
More informationInverse problems in statistics
Inverse problems in statistics Laurent Cavalier (Université Aix-Marseille 1, France) Yale, May 2 2011 p. 1/35 Introduction There exist many fields where inverse problems appear Astronomy (Hubble satellite).
More informationAn Efficient Estimation Method for Longitudinal Surveys with Monotone Missing Data
An Efficient Estimation Method for Longitudinal Surveys with Monotone Missing Data Jae-Kwang Kim 1 Iowa State University June 28, 2012 1 Joint work with Dr. Ming Zhou (when he was a PhD student at ISU)
More information1. If 1, ω, ω 2, -----, ω 9 are the 10 th roots of unity, then (1 + ω) (1 + ω 2 ) (1 + ω 9 ) is A) 1 B) 1 C) 10 D) 0
4 INUTES. If, ω, ω, -----, ω 9 are the th roots of unity, then ( + ω) ( + ω ) ----- ( + ω 9 ) is B) D) 5. i If - i = a + ib, then a =, b = B) a =, b = a =, b = D) a =, b= 3. Find the integral values for
More information13 Endogeneity and Nonparametric IV
13 Endogeneity and Nonparametric IV 13.1 Nonparametric Endogeneity A nonparametric IV equation is Y i = g (X i ) + e i (1) E (e i j i ) = 0 In this model, some elements of X i are potentially endogenous,
More informationLecture 11: Regression Methods I (Linear Regression)
Lecture 11: Regression Methods I (Linear Regression) Fall, 2017 1 / 40 Outline Linear Model Introduction 1 Regression: Supervised Learning with Continuous Responses 2 Linear Models and Multiple Linear
More informationInference in Nonparametric Series Estimation with Data-Dependent Number of Series Terms
Inference in Nonparametric Series Estimation with Data-Dependent Number of Series Terms Byunghoon ang Department of Economics, University of Wisconsin-Madison First version December 9, 204; Revised November
More informationLinear Methods for Prediction
Chapter 5 Linear Methods for Prediction 5.1 Introduction We now revisit the classification problem and focus on linear methods. Since our prediction Ĝ(x) will always take values in the discrete set G we
More informationInverse problems in statistics
Inverse problems in statistics Laurent Cavalier (Université Aix-Marseille 1, France) YES, Eurandom, 10 October 2011 p. 1/32 Part II 2) Adaptation and oracle inequalities YES, Eurandom, 10 October 2011
More informationWavelet Shrinkage for Nonequispaced Samples
University of Pennsylvania ScholarlyCommons Statistics Papers Wharton Faculty Research 1998 Wavelet Shrinkage for Nonequispaced Samples T. Tony Cai University of Pennsylvania Lawrence D. Brown University
More informationLecture 11: Regression Methods I (Linear Regression)
Lecture 11: Regression Methods I (Linear Regression) 1 / 43 Outline 1 Regression: Supervised Learning with Continuous Responses 2 Linear Models and Multiple Linear Regression Ordinary Least Squares Statistical
More informationTargeted Maximum Likelihood Estimation in Safety Analysis
Targeted Maximum Likelihood Estimation in Safety Analysis Sam Lendle 1 Bruce Fireman 2 Mark van der Laan 1 1 UC Berkeley 2 Kaiser Permanente ISPE Advanced Topics Session, Barcelona, August 2012 1 / 35
More informationUncertainty Quantification for Inverse Problems. November 7, 2011
Uncertainty Quantification for Inverse Problems November 7, 2011 Outline UQ and inverse problems Review: least-squares Review: Gaussian Bayesian linear model Parametric reductions for IP Bias, variance
More informationInference For High Dimensional M-estimates. Fixed Design Results
: Fixed Design Results Lihua Lei Advisors: Peter J. Bickel, Michael I. Jordan joint work with Peter J. Bickel and Noureddine El Karoui Dec. 8, 2016 1/57 Table of Contents 1 Background 2 Main Results and
More information