Minimax Estimation of a nonlinear functional on a structured high-dimensional model


Eric Tchetgen Tchetgen, Professor of Biostatistics and Epidemiologic Methods, Harvard University

Outline

- Heuristics
- First order influence functions: the $\sqrt{n}$-regular case
- Higher order influence functions: non-regular rates (slower than $\sqrt{n}$)
- Application to the quadratic functional and the missing at random data functional

This is joint work with Robins, Li, Mukherjee and van der Vaart.

Heuristics

$X_1, \dots, X_n$ are i.i.d. $p$, where $p$ belongs to a collection $\mathcal P$ of densities, and we aim to estimate the value $\chi(p)$ of a functional $\chi : \mathcal P \to \mathbb R$.

We are particularly interested in semiparametric or nonparametric settings, in which $\mathcal P$ is infinite dimensional.

We will give special attention to settings where the semiparametric model is described through parameters of low regularity, in which case $\sqrt{n}$-convergence may not be attainable.

Heuristics

For estimation, consider the class of "one-step" estimators of the form
$$\hat\chi = \chi(\hat p_n) + \mathbb P_n \chi_{\hat p_n},$$
for $\hat p_n$ an initial estimator of $p$, where $x \mapsto \chi_p(x)$ is a measurable function for each $p$, and $\mathbb P_n f = n^{-1}\sum_{i=1}^n f(X_i)$.

One possible choice is $\chi_{\hat p_n} = 0$, leading to the plug-in estimator; this is typically suboptimal.

$\hat p_n$ is constructed from an independent sample, i.e. sample splitting. This is not required in the regular regime through use of empirical process theory, but such theory does not apply in the non-regular regime.

Heuristics

To motivate a good choice of $\chi_{\hat p_n}$, consider
$$\hat\chi - \chi(p) = \big[\chi(\hat p_n) - \chi(p) + P\chi_{\hat p_n}\big] + (\mathbb P_n - P)\chi_{\hat p_n},$$
where $P\chi_{\hat p_n} = \int \chi_{\hat p_n}(x)\,dp(x)$.

The second term is of order $n^{-1/2}$ and asymptotically normal after scaling; a good choice of $\chi_{\hat p_n}$ makes sure that the term in brackets has "small bias", i.e. is no larger than $O_P(n^{-1/2})$.

This can be achieved if $P\chi_{\hat p_n}$ acts like minus the derivative of the functional $\chi$ in the $(\hat p_n - p)$ direction.

A function $\chi_{\hat p_n}$ with this property is known as a "first order" influence function.
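To spell out the reasoning in the last two points (our own one-line Taylor argument, not on the slide): if $P\chi_{\hat p_n}$ equals minus the first derivative term up to a second order remainder, the linear terms cancel in the bracket.

```latex
% If  P\chi_{\hat p_n} = -\, d\chi(p)[\hat p_n - p] + O\big(d(\hat p_n, p)^2\big),
% then Taylor expanding \chi at p gives
\chi(\hat p_n) - \chi(p) + P\chi_{\hat p_n}
  = d\chi(p)[\hat p_n - p] - d\chi(p)[\hat p_n - p] + O\big(d(\hat p_n, p)^2\big)
  = O\big(d(\hat p_n, p)^2\big).
```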

Heuristics

For a first order influence function, the bias term is quadratic in the error $d(\hat p_n, p)$, for an appropriate distance $d$.

The small-bias condition then essentially requires $d(\hat p_n, p) = o_P(n^{-1/4})$. To get this rate, the model cannot be too large.

For instance, a regression function or density can be estimated at rate $n^{-1/4}$ if it is a priori known to have at least $d/2$ derivatives, $d$ the dimension of the covariates.

Our main concern will be settings where the model is very large or has very low regularity, such that the rate $n^{-1/4}$ is not attainable for $d(\hat p_n, p)$.

Heuristics

The one-step estimator $\hat\chi = \chi(\hat p_n) + \mathbb P_n\chi_{\hat p_n}$ is suboptimal in such cases because it does not strike the right balance between bias and variance; that is, in
$$\hat\chi - \chi(p) = \big[\chi(\hat p_n) - \chi(p) + P\chi_{\hat p_n}\big] + (\mathbb P_n - P)\chi_{\hat p_n}$$
the magnitude of the first term dominates and one cannot obtain valid inferences about $\chi(p)$, i.e. confidence intervals are not centered properly to achieve nominal coverage.

We will replace the first order IF $\mathbb P_n\chi_{\hat p_n}$ by a higher order IF $\mathbb U_n\chi_{\hat p_n}$, an $m$th order U-statistic, such that
$$\hat\chi_n = \chi(\hat p_n) + \mathbb U_n\chi_{\hat p_n}$$
and
$$\hat\chi_n - \chi(p) = \big[\chi(\hat p_n) - \chi(p) + P^m\chi_{\hat p_n}\big] + (\mathbb U_n - P^m)\chi_{\hat p_n}.$$

This in fact suggests choosing the HOIF such that $P^m\chi_{\hat p_n}$ behaves like minus the first $m$ terms of the Taylor expansion of $\chi(\hat p_n) - \chi(p)$.

Heuristics

Exact HOIFs exist only for special functionals.

For general functionals, we find a "good approximate" functional, in which case the convergence rate is obtained by trading off estimation bias plus approximation bias against variance.

In some cases the optimal rate of the HOIF estimator may still be $n^{-1/2}$, in which case the semiparametric efficiency bound may be achieved: the first order IF determines the variance and the HOIFs correct the bias.

Typically the rate will be slower than $n^{-1/2}$. What is the optimal (i.e. minimax) rate of estimation? Can we achieve it?

Two cases emerge:

- Case 1: the smoothness of $p$ is assumed known; we have obtained both lower and upper bounds.
- Case 2: the smoothness of $p$ is not known and therefore one must adapt to it; some recent results.

Example 1: Quadratic density functional

Quadratic functional of a density $p(x)$:
$$\chi : p \mapsto \chi(p) = \int p(x)^2\,dx \in \mathbb R,$$
where $p \in H_d(\alpha, C)$ lies in a $d$-dimensional Hölder ball with smoothness $\alpha$ and radius $C$.

Of interest in bandwidth selection for optimal density smoothing.

Well studied problem, e.g. Bickel and Ritov (1988), Laurent (1996, 1997), Cai and Low (2005), Cai, Low et al (2006), Donoho and Nussbaum (1990), Efromovich and Low et al (1994, 1996), Giné and Nickl (2008), Laurent and Massart (2000).

Exhibits an elbow phenomenon: $\chi(p)$ is $\sqrt{n}$-estimable only if $\alpha/d \geq 1/4$. Below the threshold, the minimax lower bound is $n^{-\frac{8\alpha}{d+4\alpha}}$ in squared error norm (Birgé and Massart, 1995).

Example 2: Nonlinear density functional

Nonlinear functional of a density $p(x)$:
$$\chi : p \mapsto \chi(p) = \int T(p(x))\,dx \in \mathbb R,$$
where $T$ is a smooth mapping and $p \in H_d(\alpha, C)$ lies in a $d$-dimensional Hölder ball with smoothness $\alpha$ and radius $C$.

There is a large body of work on estimating the entropy of an underlying distribution, e.g. Beirlant et al. (1997); more recent works include estimation of Rényi and Tsallis entropies (Leonenko and Seleznjev, 2010; Pál, Póczos and Szepesvári, 2010). Also see Kandasamy et al. (2014).

Exhibits the same elbow phenomenon: $\sqrt{n}$-estimable only if $\alpha/d \geq 1/4$, and below the threshold the minimax lower bound is $n^{-\frac{8\alpha}{d+4\alpha}}$ in squared error norm (Birgé and Massart, 1995).

The case $T(p) = p^3$ was studied for $d = 1$ by Kerkyacharian, Picard and Tsybakov (1998); a general solution for $d > 1$ was given by Tchetgen et al (2008).

Example 3: Missing at random data

Suppose the full data is $(Y, Z)$, where $Y$ is a binary response and $Z$ a $d$-dimensional vector of covariates; however one observes $(YA, A, Z)$, from which we wish to estimate $\chi = E(Y) = \Pr(Y = 1)$.

$\chi$ is nonparametrically identified under MAR, $A \perp Y \mid Z$, provided $\Pr(A = 1 \mid Z) > 0$ a.s.

Note: this is isomorphic to estimating the average treatment effect of a binary intervention $A$ on $Y$ under the assumption of no unmeasured confounding given $Z$, i.e. $\chi = E(Y^a)$, where $Y^a$ is the counterfactual response under regime $a$; the full data is $(Y^a, X)$, and the observed data is $(Y^a, X)$ if $A = a$ and $X$ otherwise. Identified under $A \perp Y^a \mid Z$.

Example 3: Missing at random data

The functional can be expressed as an observed data functional in terms of the density $f$ of $Z$ (relative to a measure $\nu$) and the probability $b(z) = \Pr(Y = 1 \mid Z = z) = \Pr(Y = 1 \mid Z = z, A = 1)$:
$$\chi : p_{b,f} \mapsto \chi(p_{b,f}) = \int b f\,d\nu.$$

An alternative parametrization is in terms of $a(z)^{-1} = \Pr(A = 1 \mid Z = z)$ and $g = f/a$ (proportional to the density of $Z$ given $A = 1$):
$$\chi : p_{a,b,g} \mapsto \chi(p_{a,b,g}) = \int a b g\,d\nu.$$

Example 3: Missing at random data

Estimators of $\chi(p)$ that are $n^{-1/2}$-consistent and asymptotically efficient in the semiparametric sense have been constructed with a variety of methods, but only if $a$ or $b$ or both parameters are restricted to sufficiently small regularity classes.

Formally, suppose that $a$ and $b$ belong to $H_d(\alpha, C_\alpha)$ and $H_d(\beta, C_\beta)$ respectively; then $n^{-1/2}$-consistent estimators have been obtained for $\alpha$ and $\beta$ large enough that
$$\frac{\alpha}{2\alpha + d} + \frac{\beta}{2\beta + d} \geq \frac12. \tag{1}$$

In our work, we show that (1) is more stringent than required: root-$n$ consistency can be attained using HOIFs provided
$$\frac{\alpha + \beta}{2d} \geq \frac14.$$
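As a quick sanity check of how much weaker the HOIF requirement is, one can tabulate both conditions for a few illustrative $(\alpha, \beta, d)$ values; this is a hypothetical numerical illustration of the two displays above, not part of the talk.

```python
# Compare the classical root-n condition (1) with the weaker HOIF
# condition (alpha + beta) / (2 d) >= 1/4.
def classical_ok(alpha, beta, d):
    return alpha / (2 * alpha + d) + beta / (2 * beta + d) >= 0.5

def hoif_ok(alpha, beta, d):
    return (alpha + beta) / (2 * d) >= 0.25

for alpha, beta, d in [(3.0, 3.0, 2), (1.0, 1.0, 4), (2.0, 2.0, 10)]:
    print((alpha, beta, d), classical_ok(alpha, beta, d), hoif_ok(alpha, beta, d))
# (3.0, 3.0, 2):  classical True,  HOIF True
# (1.0, 1.0, 4):  classical False, HOIF True   <- (1) fails but HOIF holds
# (2.0, 2.0, 10): classical False, HOIF False
```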

Example 3: Missing at random data

For moderate to large dimensions $d$ this is still a restrictive requirement, and likewise if regularity is low even when $d$ is small.

In fact, in Robins, Li, Tchetgen, van der Vaart (2009) we establish that, even if $g$ were known, the minimax lower bound when $(\alpha + \beta)/2d < \frac14$ is
$$n^{-\frac{2(\alpha+\beta)}{2(\alpha+\beta)+d}}.$$

Our estimators are shown to achieve this minimax rate when $\alpha$ and $\beta$ are a priori known, provided $g$ is smooth enough.

First Order Semiparametric Theory

Theory which goes back to Pfanzagl (1982), Newey (1990), van der Vaart (1991), Bickel et al (1993).

Given a functional $\chi : \mathcal P \to \mathbb R$ defined on the semiparametric model $\mathcal P$, an IF is a function $\chi_p : x \mapsto \chi_p(x)$ of the observed data which satisfies the following equation: for a sufficiently regular submodel $t \mapsto p_t \in \mathcal P$,
$$\frac{d}{dt}\Big|_{t=0}\chi(p_t) = P(\chi_p\, g),$$
where $g = (d/dt)\big|_{t=0}\, p_t/p$ is the score function of the submodel $t \mapsto p_t$ at $t = 0$. The closed linear span of all scores attached to submodels $t \mapsto p_t$ of $\mathcal P$ is called the tangent space.

Therefore an influence function is an element of the Hilbert space $L_2(p)$ whose inner product with elements of the tangent space represents the derivative of the functional. This is an instance of the Riesz representation theorem from functional analysis.

IF of quadratic density functional

To compute the IF of $\chi(p) = \int p(x)^2\,dx$, consider $\chi(p_t) = \int p_t(x)^2\,dx$; computing $\frac{d}{dt}\big|_{t=0}\chi(p_t)$ one obtains
$$\frac{d}{dt}\Big|_{t=0}\chi(p_t) = P\{2(p(X) - \chi(p))\,g\},$$
and therefore
$$\chi_p = 2(p(X) - \chi(p)).$$

Furthermore,
$$\hat\chi = \chi(\hat p_n) + \mathbb P_n\chi^{(1)}_{\hat p_n} = 2\,\mathbb P_n \hat p_n(X) - \chi(\hat p_n),$$
so that
$$P(\hat\chi - \chi(p)) = -\int (p - \hat p_n)^2(x)\,dx,$$
which is quadratic in the preliminary estimator, as expected.

The variance of $\hat\chi$ is of order $1/n$; if the preliminary estimator $\hat p_n$ can be constructed so that the squared bias is smaller than the variance, the estimator is rate optimal; otherwise the bias dominates and a higher order IF is needed.
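A minimal numerical sketch of this one-step construction in one dimension, with sample splitting as discussed earlier; the function name, the Gaussian kernel density estimator, and the grid-based evaluation of $\chi(\hat p_n)$ are our illustrative choices, not the talk's.

```python
import numpy as np
from scipy.stats import gaussian_kde

def one_step_quadratic(x, seed=0):
    """One-step estimator of chi(p) = int p(x)^2 dx for univariate data:
    fit p_hat on one half (sample splitting), then add the empirical
    average of the IF  chi_p(x) = 2 (p(x) - chi(p))  over the other half,
    i.e.  chi(p_hat) + P_n chi_{p_hat} = 2 P_n p_hat - chi(p_hat)."""
    x = np.random.default_rng(seed).permutation(x)
    train, evaluate = x[: len(x) // 2], x[len(x) // 2 :]
    p_hat = gaussian_kde(train)

    # chi(p_hat) by Riemann sum on a fine grid:
    grid = np.linspace(x.min() - 1.0, x.max() + 1.0, 4000)
    plug_in = np.sum(p_hat(grid) ** 2) * (grid[1] - grid[0])
    return plug_in + np.mean(2.0 * (p_hat(evaluate) - plug_in))

# N(0,1) data: true chi(p) = 1 / (2 sqrt(pi)) ~ 0.2821.
print(one_step_quadratic(np.random.default_rng(1).normal(size=4000)))
```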

Missing Data IF

To compute the IF in the MAR example, consider the score functions of the one-dimensional submodels $p_t = p_{a_t, b_t, f_t}$ induced by paths of the form
$$a_t = a + t\,\underline a, \qquad b_t = b + t\,\underline b, \qquad f_t = f + t\,\underline f,$$
for given measurable functions $\underline a, \underline b, \underline f : \mathcal Z \to \mathbb R$.

This gives rise to the corresponding scores:
$$\frac{Aa(Z) - 1}{a(Z)(a - 1)(Z)}\,\underline a(Z) \;\; (a\text{-score}), \qquad \frac{A(Y - b(Z))}{b(Z)(1 - b)(Z)}\,\underline b(Z) \;\; (b\text{-score}), \qquad \frac{\underline f(Z)}{f(Z)} \;\; (f\text{-score}).$$

IF for Missing Data Functional

Let $B$ = $a$-score + $b$-score + $f$-score. Then
$$\frac{d}{dt}\Big|_{t=0}\chi(p_t) = P\big(\chi^{(1)}_p B\big)$$
for all $B$ in the tangent space of the model $\mathcal P$, where
$$\chi^{(1)}_p = Aa(Z)(Y - b(Z)) + b(Z) - \chi(p).$$

This is the well-known doubly robust influence function due to Robins, Rotnitzky and Zhao (1994).
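A sketch of the resulting doubly robust ("AIPW") one-step estimator from the observed data $(YA, A, Z)$; the nuisance functions `b_hat` and `a_hat` are placeholders standing in for any rate optimal nonparametric estimators, as discussed on the next slide.

```python
import numpy as np

def aipw_mean(ya, a, z, b_hat, a_hat):
    """One-step estimator  P_n[ A a(Z) (Y - b(Z)) + b(Z) ]  under MAR.

    ya = Y * A (Y is observed only when A = 1), a in {0, 1};
    b_hat(z) estimates b(z) = Pr(Y = 1 | Z = z, A = 1);
    a_hat(z) estimates a(z), where 1 / a(z) = Pr(A = 1 | Z = z).
    """
    b, w = b_hat(z), a_hat(z)
    # A * a(Z) * (Y - b(Z)) = w * (ya - a * b), since A * Y = ya:
    return np.mean(w * (ya - a * b) + b)
```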

IF for Missing Data Functional

Suppose one has obtained rate optimal nonparametric estimators $\hat a, \hat b$ (optimal kernel smoothing or series estimation); e.g. $\hat a$ converges at rate $n^{-\frac{2\alpha}{2\alpha+d}}$ in mean squared error. Then the one-step estimator
$$\hat\chi = \chi(\hat p_n) + \mathbb P_n\chi^{(1)}_{\hat p_n} = \mathbb P_n\big(A\hat a(Z)(Y - \hat b(Z)) + \hat b(Z)\big)$$
has second order bias
$$E(\hat\chi - \chi(p)) \lesssim \|\hat a - a\|_2\, \|\hat b - b\|_2,$$
which is quadratic in the preliminary estimators, as expected. Note that $f$ does not enter the bias (provided the empirical measure is used).

The variance of $\hat\chi$ is of order $1/n$. If the preliminary estimators $\hat a$ and $\hat b$ can be constructed so that the squared bias is smaller than the variance, the estimator is rate optimal; otherwise the bias dominates and a higher order IF is needed.

HOIFs

Given a collection of smooth submodels $t \mapsto p_t$ in $\mathcal P$, an $m$th order influence function $\chi_p$ of the functional $\chi(p)$ satisfies $P^m\chi_p = 0$ and
$$\frac{d^j}{dt^j}\Big|_{t=0}\chi(p_t) = \frac{d^j}{dt^j}\Big|_{t=0} P^m_t\chi_p, \qquad j = 1,\dots,m,$$
for every submodel through $p$ in $\mathcal P$.

The second requirement implies a Taylor expansion of $t \mapsto \chi(p_t)$ at $t = 0$ of order $m$, but in addition requires that the derivatives of this map can be represented as expectations involving a function $\chi_p$.

In order to exploit derivatives up to the $m$th order, groups of $m$ observations can be used to match the expectation $P^m$; this leads to U-statistics of order $m$.

HOIFs

Thus, we must solve for an $m$th order U-statistic with mean zero that satisfies
$$\frac{d^j}{dt^j}\Big|_{t=0}\chi(p_t) = \frac{d^j}{dt^j}\Big|_{t=0}P^m_t\chi_p, \qquad j = 1,\dots,m,$$
for every submodel through $p$ in $\mathcal P$.

This equation involves multiple derivatives and many paths, so directly solving for $\chi_p$ is challenging.

For actual computation of influence functions, we have shown that it is usually easier to derive higher order influence functions as influence functions of lower order ones.

In fact, the Hoeffding decomposition theorem states that one can decompose any $m$th order U-statistic as the sum of $m$ degenerate U-statistics of orders $1, 2, \dots, m$.

HOIFs

Thus, by the Hoeffding decomposition,
$$\mathbb U_n\chi_p = \mathbb U_n\bar\chi^{(1)}_p + \frac{\mathbb U_n}{2!}\bar\chi^{(2)}_p + \cdots + \frac{\mathbb U_n}{m!}\bar\chi^{(m)}_p,$$
where $\bar\chi^{(j)}_p$ is a degenerate (symmetric) kernel of $j$ arguments defined uniquely as a projection of $\chi_p$.

Suitable functions $\bar\chi^{(j)}_p$ can be found by the following algorithm (a numerical illustration of step 3 appears below):

1. Let $x_1 \mapsto \chi^{(1)}_p(x_1)$ be a first order influence function of the functional $p \mapsto \chi(p)$.
2. Let $x_j \mapsto \chi^{(j)}_p(x_1, x_2, \dots, x_j)$ be a first order influence function of the functional $p \mapsto \chi^{(j-1)}_p(x_1, x_2, \dots, x_{j-1})$, for each $x_1, x_2, \dots, x_{j-1}$ and $j = 2, \dots, m$.
3. Let $\bar\chi^{(j)}_p$ be the degenerate part of $\chi^{(j)}_p$ relative to $P$, i.e.
$$\int \bar\chi^{(j)}_p(X_1, \dots, x_s, \dots, X_j)\,dp(x_s) = 0 \quad \text{for all } s = 1, \dots, j.$$
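As a numerical illustration of step 3 for a second order kernel (our own sketch; `h` and `x_ref` are hypothetical inputs), the degenerate part subtracts both conditional means and adds back the grand mean, as in the usual Hoeffding projections; here those $P$-expectations are approximated by Monte Carlo against a reference sample from $p$.

```python
import numpy as np

def degenerate_part(h, x_ref):
    """Return hbar(x1, x2) = h(x1, x2) - Ph(x1, .) - Ph(., x2) + P^2 h,
    the degenerate part of a symmetric 2nd order kernel h relative to P,
    with the P-expectations approximated by averages over x_ref ~ p."""
    def marginal(x):                     # Ph(x, .) = E_p h(x, X)
        return np.mean([h(x, xr) for xr in x_ref])

    grand_mean = np.mean([marginal(xr) for xr in x_ref])   # P^2 h

    def hbar(x1, x2):
        return h(x1, x2) - marginal(x1) - marginal(x2) + grand_mean

    return hbar
```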

HOIFs

Thus, HOIFs are constructed as first order influence functions of an influence function.

Although the order $m$ is fixed at a suitable value, we suppress it in the notation $\chi_p$.

In an abuse of language, we refer to $\chi^{(j)}_p$ as the $j$th order IF.

The starting IF $\chi^{(1)}_p$ in step 1 of the algorithm may be any first order IF; it does not need to be an element of the tangent space (i.e. efficient) and does not need to have mean zero.

The same remark applies in step 2 of the algorithm; it is only in step 3 that we make the IFs degenerate.

HOIFs of quadratic functional

Recall that the uncentered (1st order) IF of $p \mapsto \int p(x)^2\,dx$ is $\chi^{(1)}_p(x) = 2p(x)$, where $p(x)$ is the joint density of $X$. Applying step 2 for $j = 2$, we seek $\chi^{(2)}_p(x_1, x_2)$ that solves
$$\frac{d}{dt}\Big|_{t=0}\chi^{(1)}_{p_t}(x_1) = P\big\{\chi^{(2)}_p(x_1, X_2)\,g(X_2)\big\}$$
for all $x_1$ and smooth paths $t \mapsto p_t$; however,
$$\frac{d}{dt}\Big|_{t=0}\chi^{(1)}_{p_t}(x_1) = 2\,\frac{d}{dt}\Big|_{t=0} p_t(x_1),$$
and $p(x_1)$, the value of a continuous density at a point, is not pathwise differentiable, so such a representation does not exist!

Approximate functionals

A HOIF will generally fail to exist if the first order IF depends on a parameter that is not pathwise differentiable, such as a conditional expectation or a density at a point.

To make progress, we propose to replace $\chi(p)$ with an approximate functional $\chi(\tilde p)$, for a given mapping $p \mapsto \tilde p$, such that $\chi(p)$ and $\chi(\tilde p)$ are close, and $\chi(\tilde p)$ is higher order pathwise differentiable.

This leads to representation bias. Ensuring that this bias is negligible will require the projection of $p$ on $\tilde p$ to involve models of high dimension, with the number of parameters $k$ of the order of $n$ or larger.

This will typically result in a HOIF variance larger than $n^{-1}$, so nonparametric rates of convergence apply.

Projection Kernel

An orthogonal projection $K_p : L_2(p) \to L_2(p)$ onto a $k$-dimensional linear subspace of $L_2(p)$ is given by a kernel operator, with kernel denoted by the same symbol as the operator:
$$K_p t(x) = \int K_p(x, y)\,t(y)\,dp(y).$$
The projection property $K_p^2 = K_p$ holds.

We also require that $K_p(u, u) \lesssim k$ and $\|K_p t\| \lesssim \|t\|$.

Moreover, it is desirable that the projection operator be suitable for approximation in Hölder space: for every $k$, and $\|\cdot\|_\alpha$ a norm for $H^\alpha[0,1]^d$,
$$\sup_t \frac{\|K_p t - t\|}{\|t\|_\alpha} \lesssim \Big(\frac1k\Big)^{\alpha/d}.$$
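A concrete instance (ours) on $[0,1]$ with $p$ replaced by Lebesgue measure: the orthonormal histogram basis $\phi_j = \sqrt{k}\,1\{x \in [j/k, (j+1)/k)\}$ gives a projection kernel with $K(u,u) = k$ and $K^2 = K$, and the stated Hölder approximation behavior.

```python
import numpy as np

def histogram_kernel(k):
    """Projection kernel K(x, y) = sum_j phi_j(x) phi_j(y) for the
    histogram basis phi_j = sqrt(k) 1{[j/k, (j+1)/k)}, orthonormal in
    L2(Lebesgue) on [0, 1]: K(x, y) = k if x, y share a bin, else 0."""
    def K(x, y):
        bx = np.floor(k * np.asarray(x)).clip(0, k - 1)
        by = np.floor(k * np.asarray(y)).clip(0, k - 1)
        return k * (bx == by)
    return K

K = histogram_kernel(10)
print(K(0.05, 0.07), K(0.05, 0.95))   # 10 (same bin), 0 (different bins)
```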

Approximate quadratic functional

The approximate functional is
$$\chi(\tilde p) = \int \tilde p(x)^2\,dx, \qquad \tilde p(x) = K_\mu p(x), \quad \mu \text{ the } d\text{-fold Lebesgue measure}.$$

Note that whereas $p(x)$ is not pathwise differentiable, $\tilde p(x)$ is!

Therefore
$$\frac{d}{dt}\Big|_{t=0}\chi(\tilde p_t) = P\{2\tilde p(X)\,g\},$$
$$\frac{d}{dt}\Big|_{t=0}\chi^{(1)}_{\tilde p_t}(x_1) = 2\,\frac{d}{dt}\Big|_{t=0}\tilde p_t(x_1) = P\big\{2K_\mu(x_1, X_2)\,g(X_2)\big\},$$
so that
$$\chi^{(2)}_{\tilde p}(x_1, x_2) = 2K_\mu(x_1, x_2), \qquad \chi^{(j)}_{\tilde p} = 0, \quad j > 2.$$

Approximate quadratic functional

Step 3 of the algorithm yields a 2nd order IF:
$$\mathbb U_n\chi_{\tilde p} = \mathbb U_n\chi^{(1)}_{\tilde p} + \frac{\mathbb U_n}{2}\bar\chi^{(2)}_{\tilde p},$$
$$\chi^{(1)}_{\tilde p}(x_1) = 2\big(\tilde p(x_1) - \chi(\tilde p)\big); \qquad \bar\chi^{(2)}_{\tilde p}(x_1, x_2) = 2\big(K_\mu(x_1, x_2) - K_\mu p(x_1) - K_\mu p(x_2) + \chi(\tilde p)\big).$$

We further have, for $\hat\chi_n = \chi(\hat p_n) + \mathbb U_n\chi_{\hat p_n}$,
$$P(\hat\chi_n - \chi(\tilde p)) = 0$$
and
$$\operatorname{var}(\hat\chi_n) \lesssim \max\Big(\operatorname{var}\big(\mathbb U_n\chi^{(1)}_{\tilde p}\big),\ \operatorname{var}\big(\tfrac{\mathbb U_n}{2}\bar\chi^{(2)}_{\tilde p}\big)\Big) = \max\Big(\frac1n, \frac{k}{n^2}\Big).$$

The representation bias is
$$\chi(\tilde p) - \chi(p) = O\Big(\big(\tfrac1k\big)^{2\alpha/d}\Big).$$
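For this particular functional the pieces telescope: a short calculation (ours) shows that, with the plug-in evaluated at $K_\mu\hat p_n$, the estimator reduces to the classical pair U-statistic $\mathbb U_n K_\mu(X_1, X_2)$, which is exactly unbiased for $\chi(\tilde p)$. A sketch with the histogram kernel from the previous slide:

```python
import numpy as np

def quadratic_hoif(x, k):
    """U_n K_mu(X_i, X_j) over distinct pairs, with the histogram
    projection kernel on [0, 1]: unbiased for chi(p_tilde) = int (K_mu p)^2,
    variance of order max(1/n, k/n^2), representation bias O(k^{-2a/d})."""
    bins = np.floor(k * np.asarray(x)).clip(0, k - 1)
    vals = k * (bins[:, None] == bins[None, :])   # K_mu(X_i, X_j) matrix
    n = len(x)
    return (vals.sum() - np.trace(vals)) / (n * (n - 1))

# Beta(2, 2) data on [0, 1]: true chi(p) = int (6x(1-x))^2 dx = 1.2.
x = np.random.default_rng(2).beta(2.0, 2.0, size=5000)
print(quadratic_hoif(x, k=50))
```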

Approximate quadratic functional

Therefore, if $\alpha/d \geq \frac14$, the approximation bias can be made $\lesssim n^{-1/2}$ with $k \leq n$, so that
$$\operatorname{var}(\hat\chi_n) \lesssim \max\Big(\frac1n, \frac{k}{n^2}\Big) = \frac1n,$$
and the functional is estimable at rate $n^{-1/2}$.

However, whenever $\alpha/d < \frac14$, the optimal MSE is attained by $k_{\mathrm{opt}} = n^{\frac{2d}{4\alpha+d}} > n$, achieving the minimax rate $n^{-\frac{8\alpha}{4\alpha+d}}$ in squared error norm.

Note that because there is no estimation bias, the optimal rate is obtained by trading off variance and representation bias.
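The arithmetic behind $k_{\mathrm{opt}}$ (our worked version of the slide's claim), equating the dominant variance term with the squared representation bias:

```latex
\frac{k}{n^{2}} = \Big(k^{-2\alpha/d}\Big)^{2} = k^{-4\alpha/d}
\;\Longrightarrow\;
k_{\mathrm{opt}} = n^{\frac{2d}{4\alpha+d}} \;(> n \text{ when } \alpha/d < 1/4),
\qquad
\mathrm{MSE} \asymp \frac{k_{\mathrm{opt}}}{n^{2}}
 = n^{\frac{2d}{4\alpha+d}-2}
 = n^{-\frac{8\alpha}{4\alpha+d}} .
```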

Approximate functional

Interestingly, in this example the approximate functional is locally equivalent to the true functional, in the sense that their first order IFs match.

Although this is a desirable property because it leads to parsimonious HOIFs, it will not generally hold for a general functional; it can, however, be ensured by construction.

To do so, we propose to define the approximate functional to satisfy
$$\frac{d}{dt}\Big|_{t=0}\Big(\chi(\tilde p_t) + P\chi^{(1)}_{\tilde p_t}\Big) = 0.$$

In fact, then $\chi^{(1)}_{\tilde p} = \chi^{(1)}_p$, so step 1 of the algorithm returns the same 1st order IF.

Approximate functional for missing data functional

To define the approximate functional, given the preliminary estimator $\hat a$ we let $\tilde a$ be the function such that $(\tilde a - \hat a)/a$ is the orthogonal projection of $(a - \hat a)/a$ onto a "large" linear subspace of $L_2(g)$ with projection kernel $K_g$, i.e.
$$\frac{\tilde a}{a} = \frac{\hat a}{a} + K_g\Big(\frac{a - \hat a}{a}\Big),$$
and define $\tilde b$ similarly.

The corresponding approximate functional $\chi(\tilde p) = \int \tilde a\,\tilde b\,g\,d\nu$ satisfies $\chi^{(1)}_{\tilde p} = \chi^{(1)}_p$, and its representation bias satisfies
$$\Big|\int \tilde a\,\tilde b\,g\,d\nu - \int a b g\,d\nu\Big| \lesssim \Big\|\frac{a - \tilde a}{a}\Big\|_2\,\Big\|\frac{b - \tilde b}{b}\Big\|_2 \lesssim \Big(\frac1k\Big)^{(\alpha+\beta)/d},$$
which can be made small by choosing the range of $K_g$ large enough.

Approximate functional for missing data functional

Step 1 of the algorithm applied to $\chi(\tilde p)$ gives
$$\chi^{(1)}_{\tilde p}(x_1) = a_1\tilde a(z_1)\big(y_1 - \tilde b(z_1)\big) + \tilde b(z_1) - \chi(\tilde p).$$

The corresponding bias of $\hat\chi_n = \chi(\hat p_n) + \mathbb U_n\chi^{(1)}_{\hat p_n}$ is
$$\int(\hat a - a)\big(\hat b - b\big)\,g\,d\nu = O_p\Big(n^{-\frac{\alpha}{2\alpha+d}-\frac{\beta}{2\beta+d}}\Big),$$
while the corresponding variance is of order $1/n$; this requires $\frac{\alpha}{2\alpha+d} + \frac{\beta}{2\beta+d} \geq \frac12$ to recover a regular estimator when possible.

One may be able to relax this somewhat by undersmoothing or by exploiting the fact that the bias has an integral representation. We do not wish to do so here, as we assume all nuisance parameter estimators are rate optimal.

Approximate functional for missing data functional

Whereas $\chi^{(1)}_p(x_1)$ is not pathwise differentiable, $\chi^{(1)}_{\tilde p}(x_1)$ is.

Step 2 of the algorithm applied to $\chi^{(1)}_{\tilde p}(x_1)$ gives
$$\chi^{(2)}_{\tilde p}(x_1, X_2) = -2\,a_1\tilde a(z_1)\big(y_1 - \tilde b(z_1)\big)\,K_g(z_1, Z_2)\,\big(A_2\tilde a(Z_2) - 1\big).$$

Step 3 symmetrizes this IF and makes it a degenerate U-statistic.

Note that because $K_g$ depends on $g$, the 2nd order IF is not exactly unbiased, and therefore higher order IFs may be required to further reduce estimation bias, depending on the smoothness of $g$.
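A numerical sketch (ours) of the resulting second order correction, with the histogram basis standing in for the $L_2(g)$-projection kernel $K_g$ (so the normalization is only appropriate when $g$ is roughly uniform on $[0,1]$), and with the talk's sign convention for $\chi^{(2)}_{\tilde p}$ assumed; `b_hat` and `a_hat` are hypothetical fitted nuisance functions and $z$ is taken univariate:

```python
import numpy as np

def second_order_term(ya, a, z, b_hat, a_hat, k):
    """Average over distinct pairs of the symmetrized degenerate kernel
       -2 a1 ahat(z1)(y1 - bhat(z1)) K_g(z1, z2) (a2 ahat(z2) - 1),
    the factor 2 being absorbed by the 1/2! of the Hoeffding
    decomposition; a histogram kernel stands in for K_g."""
    b, w = b_hat(z), a_hat(z)
    res = a * w * (ya - a * b)          # A * ahat(Z) * (Y - bhat(Z))
    lever = a * w - 1.0                 # A * ahat(Z) - 1
    bins = np.floor(k * np.asarray(z)).clip(0, k - 1)
    Kg = k * (bins[:, None] == bins[None, :])
    mat = -np.outer(res, lever) * Kg
    np.fill_diagonal(mat, 0.0)
    n = len(z)
    return mat.sum() / (n * (n - 1))
```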

Minimax estimation with 2nd order IF

Let $\hat\chi_n = \chi(\hat p_n) + \mathbb U_n\chi^{(1)}_{\hat p_n} + \mathbb U_n\chi^{(2)}_{\hat p_n}$, and let $\gamma$ denote the Hölder coefficient of $g$.

Theorem. Suppose $\frac{\gamma}{d+2\gamma} \geq \frac{2(\beta+\alpha)}{d+2(\beta+\alpha)} - \big(\frac{\beta}{d+2\beta} + \frac{\alpha}{d+2\alpha}\big)$, and set $k = n^{\frac{2d}{d+2(\beta+\alpha)}}$. Then we have the following result (a small numeric illustration of the two regimes follows):

1. If $\frac{\beta+\alpha}{2d} \geq \frac14$, then
$$\sup_{\theta\in\Theta(\beta,\alpha,\gamma)}(\hat\chi_n - \chi(p)) = O_p\big(n^{-1/2}\big).$$
2. If $\frac{\beta+\alpha}{2d} < \frac14$, then
$$\sup_{\theta\in\Theta(\beta,\alpha,\gamma)}(\hat\chi_n - \chi(p)) = O_p\Big(n^{-\frac{2(\beta+\alpha)}{d+2(\beta+\alpha)}}\Big).$$
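The theorem's two regimes as a tiny rate calculator (ours), returning the exponent $r$ in the $O_p(n^{-r})$ rate:

```python
def mar_rate(alpha, beta, d):
    """Exponent r with  sup |chi_hat_n - chi(p)| = O_p(n^{-r})  in the
    theorem's two regimes, driven by effective smoothness (a + b)/(2d)."""
    if (alpha + beta) / (2.0 * d) >= 0.25:
        return 0.5
    return 2.0 * (alpha + beta) / (d + 2.0 * (alpha + beta))

for alpha, beta, d in [(3, 3, 2), (1, 1, 4), (0.5, 0.5, 8)]:
    print((alpha, beta, d), mar_rate(alpha, beta, d))
# -> 0.5, 0.5, and 0.2 respectively
```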

Minimax estimation

This result is new! All existing methods break down once $(\beta+\alpha)/2d < \frac14$, i.e. once the effective smoothness $(\beta+\alpha)/2d$ is too small.

We have likewise constructed minimax estimators when
$$\frac{\gamma}{d+2\gamma} < \frac{2(\beta+\alpha)}{d+2(\beta+\alpha)} - \Big(\frac{\beta}{d+2\beta} + \frac{\alpha}{d+2\alpha}\Big).$$

These higher order IFs are U-statistics of order greater than two, used for further bias reduction without increasing the variance beyond the second order rate $\frac{k}{n^2}$; this requires tremendous care (see Robins et al, 2016, Annals of Statistics).

Honest confidence intervals

Our approach can be used to obtain honest CIs for $\chi(p)$, given by $C_n = \hat\chi_n \pm m_{1-\alpha}\,\sigma(\hat\chi_n)$, so that
$$\limsup_n \Pr\{\chi(p) \notin C_n\} \leq \alpha$$
uniformly in $p \in \mathcal P$, with an appropriate constant $m_{1-\alpha}$.

This is because our HOIFs are asymptotically normally distributed (Robins et al, 2016).
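A Wald-type construction of $C_n$ (a sketch under the assumption that the bias is dominated by $\sigma(\hat\chi_n)$, as the HOIF construction arranges; in practice $m_{1-\alpha}$ may be inflated beyond the normal quantile to account for residual bias). Here `if_values` stands for estimated influence-function contributions, a name of our choosing:

```python
import numpy as np
from scipy.stats import norm

def wald_ci(chi_hat, if_values, level=0.90):
    """C_n = chi_hat +/- m_{1-alpha} * sigma_hat, taking m_{1-alpha} as
    the normal quantile and sigma_hat from the empirical variance of the
    estimated influence-function contributions."""
    sigma_hat = np.std(if_values, ddof=1) / np.sqrt(len(if_values))
    m = norm.ppf(0.5 + level / 2.0)
    return chi_hat - m * sigma_hat, chi_hat + m * sigma_hat
```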

Simulations of honest confidence intervals

Nominal coverage 90%; effective smoothness of $g$ equal to $0.2$. [The sample size, number of iterations, and the simulation results themselves were shown as a table in the slides and are not recoverable from the transcription.]

Final remarks

Our results apply to a general class of functionals which includes the semilinear model, several IV models, MAR monotone missing data functionals, and nonignorable missing data functionals, among others.

One will rarely know the smoothness of the nuisance parameters; ideally one would like to adapt to such smoothness.

We have recently developed such adaptive estimators for a large class of functionals by extending and applying a technique proposed by Lepski (Mukherjee, Tchetgen Tchetgen, Robins, 2016).


This model of the conditional expectation is linear in the parameters. A more practical and relaxed attitude towards linear regression is to say that Linear Regression For (X, Y ) a pair of random variables with values in R p R we assume that E(Y X) = β 0 + with β R p+1. p X j β j = (1, X T )β j=1 This model of the conditional expectation is linear

More information

An Omnibus Nonparametric Test of Equality in. Distribution for Unknown Functions

An Omnibus Nonparametric Test of Equality in. Distribution for Unknown Functions An Omnibus Nonparametric Test of Equality in Distribution for Unknown Functions arxiv:151.4195v3 [math.st] 14 Jun 217 Alexander R. Luedtke 1,2, Mark J. van der Laan 3, and Marco Carone 4,1 1 Vaccine and

More information

University of California, Berkeley

University of California, Berkeley University of California, Berkeley U.C. Berkeley Division of Biostatistics Working Paper Series Year 24 Paper 153 A Note on Empirical Likelihood Inference of Residual Life Regression Ying Qing Chen Yichuan

More information

Time Series and Forecasting Lecture 4 NonLinear Time Series

Time Series and Forecasting Lecture 4 NonLinear Time Series Time Series and Forecasting Lecture 4 NonLinear Time Series Bruce E. Hansen Summer School in Economics and Econometrics University of Crete July 23-27, 2012 Bruce Hansen (University of Wisconsin) Foundations

More information

Statistical Properties of Numerical Derivatives

Statistical Properties of Numerical Derivatives Statistical Properties of Numerical Derivatives Han Hong, Aprajit Mahajan, and Denis Nekipelov Stanford University and UC Berkeley November 2010 1 / 63 Motivation Introduction Many models have objective

More information

Bayesian Adaptation. Aad van der Vaart. Vrije Universiteit Amsterdam. aad. Bayesian Adaptation p. 1/4

Bayesian Adaptation. Aad van der Vaart. Vrije Universiteit Amsterdam.  aad. Bayesian Adaptation p. 1/4 Bayesian Adaptation Aad van der Vaart http://www.math.vu.nl/ aad Vrije Universiteit Amsterdam Bayesian Adaptation p. 1/4 Joint work with Jyri Lember Bayesian Adaptation p. 2/4 Adaptation Given a collection

More information

University of California, Berkeley

University of California, Berkeley University of California, Berkeley U.C. Berkeley Division of Biostatistics Working Paper Series Year 2011 Paper 290 Targeted Minimum Loss Based Estimation of an Intervention Specific Mean Outcome Mark

More information

Simple and Honest Confidence Intervals in Nonparametric Regression

Simple and Honest Confidence Intervals in Nonparametric Regression Simple and Honest Confidence Intervals in Nonparametric Regression Timothy B. Armstrong Yale University Michal Kolesár Princeton University June, 206 Abstract We consider the problem of constructing honest

More information

University of California, Berkeley

University of California, Berkeley University of California, Berkeley U.C. Berkeley Division of Biostatistics Working Paper Series Year 2014 Paper 327 Entering the Era of Data Science: Targeted Learning and the Integration of Statistics

More information

Learning 2 -Continuous Regression Functionals via Regularized Riesz Representers

Learning 2 -Continuous Regression Functionals via Regularized Riesz Representers Learning 2 -Continuous Regression Functionals via Regularized Riesz Representers Victor Chernozhukov MIT Whitney K. Newey MIT Rahul Singh MIT September 10, 2018 Abstract Many objects of interest can be

More information

Least squares under convex constraint

Least squares under convex constraint Stanford University Questions Let Z be an n-dimensional standard Gaussian random vector. Let µ be a point in R n and let Y = Z + µ. We are interested in estimating µ from the data vector Y, under the assumption

More information

ECE521 week 3: 23/26 January 2017

ECE521 week 3: 23/26 January 2017 ECE521 week 3: 23/26 January 2017 Outline Probabilistic interpretation of linear regression - Maximum likelihood estimation (MLE) - Maximum a posteriori (MAP) estimation Bias-variance trade-off Linear

More information

Modification and Improvement of Empirical Likelihood for Missing Response Problem

Modification and Improvement of Empirical Likelihood for Missing Response Problem UW Biostatistics Working Paper Series 12-30-2010 Modification and Improvement of Empirical Likelihood for Missing Response Problem Kwun Chuen Gary Chan University of Washington - Seattle Campus, kcgchan@u.washington.edu

More information

DISCUSSION: COVERAGE OF BAYESIAN CREDIBLE SETS. By Subhashis Ghosal North Carolina State University

DISCUSSION: COVERAGE OF BAYESIAN CREDIBLE SETS. By Subhashis Ghosal North Carolina State University Submitted to the Annals of Statistics DISCUSSION: COVERAGE OF BAYESIAN CREDIBLE SETS By Subhashis Ghosal North Carolina State University First I like to congratulate the authors Botond Szabó, Aad van der

More information

Efficient Estimation in Convex Single Index Models 1

Efficient Estimation in Convex Single Index Models 1 1/28 Efficient Estimation in Convex Single Index Models 1 Rohit Patra University of Florida http://arxiv.org/abs/1708.00145 1 Joint work with Arun K. Kuchibhotla (UPenn) and Bodhisattva Sen (Columbia)

More information

SIMPLE AND HONEST CONFIDENCE INTERVALS IN NONPARAMETRIC REGRESSION. Timothy B. Armstrong and Michal Kolesár. June 2016 Revised October 2016

SIMPLE AND HONEST CONFIDENCE INTERVALS IN NONPARAMETRIC REGRESSION. Timothy B. Armstrong and Michal Kolesár. June 2016 Revised October 2016 SIMPLE AND HONEST CONFIDENCE INTERVALS IN NONPARAMETRIC REGRESSION By Timothy B. Armstrong and Michal Kolesár June 2016 Revised October 2016 COWLES FOUNDATION DISCUSSION PAPER NO. 2044R COWLES FOUNDATION

More information

The Nonparametric Bootstrap

The Nonparametric Bootstrap The Nonparametric Bootstrap The nonparametric bootstrap may involve inferences about a parameter, but we use a nonparametric procedure in approximating the parametric distribution using the ECDF. We use

More information

Previous lecture. P-value based combination. Fixed vs random effects models. Meta vs. pooled- analysis. New random effects testing.

Previous lecture. P-value based combination. Fixed vs random effects models. Meta vs. pooled- analysis. New random effects testing. Previous lecture P-value based combination. Fixed vs random effects models. Meta vs. pooled- analysis. New random effects testing. Interaction Outline: Definition of interaction Additive versus multiplicative

More information

Efficiency of Profile/Partial Likelihood in the Cox Model

Efficiency of Profile/Partial Likelihood in the Cox Model Efficiency of Profile/Partial Likelihood in the Cox Model Yuichi Hirose School of Mathematics, Statistics and Operations Research, Victoria University of Wellington, New Zealand Summary. This paper shows

More information

University of California, Berkeley

University of California, Berkeley University of California, Berkeley U.C. Berkeley Division of Biostatistics Working Paper Series Year 2011 Paper 288 Targeted Maximum Likelihood Estimation of Natural Direct Effect Wenjing Zheng Mark J.

More information

Hilbert Spaces. Hilbert space is a vector space with some extra structure. We start with formal (axiomatic) definition of a vector space.

Hilbert Spaces. Hilbert space is a vector space with some extra structure. We start with formal (axiomatic) definition of a vector space. Hilbert Spaces Hilbert space is a vector space with some extra structure. We start with formal (axiomatic) definition of a vector space. Vector Space. Vector space, ν, over the field of complex numbers,

More information

Lecture 7 Introduction to Statistical Decision Theory

Lecture 7 Introduction to Statistical Decision Theory Lecture 7 Introduction to Statistical Decision Theory I-Hsiang Wang Department of Electrical Engineering National Taiwan University ihwang@ntu.edu.tw December 20, 2016 1 / 55 I-Hsiang Wang IT Lecture 7

More information

D I S C U S S I O N P A P E R

D I S C U S S I O N P A P E R I N S T I T U T D E S T A T I S T I Q U E B I O S T A T I S T I Q U E E T S C I E N C E S A C T U A R I E L L E S ( I S B A ) UNIVERSITÉ CATHOLIQUE DE LOUVAIN D I S C U S S I O N P A P E R 2014/06 Adaptive

More information

Density estimators for the convolution of discrete and continuous random variables

Density estimators for the convolution of discrete and continuous random variables Density estimators for the convolution of discrete and continuous random variables Ursula U Müller Texas A&M University Anton Schick Binghamton University Wolfgang Wefelmeyer Universität zu Köln Abstract

More information

Bayesian Regularization

Bayesian Regularization Bayesian Regularization Aad van der Vaart Vrije Universiteit Amsterdam International Congress of Mathematicians Hyderabad, August 2010 Contents Introduction Abstract result Gaussian process priors Co-authors

More information

Structural Nested Mean Models for Assessing Time-Varying Effect Moderation. Daniel Almirall

Structural Nested Mean Models for Assessing Time-Varying Effect Moderation. Daniel Almirall 1 Structural Nested Mean Models for Assessing Time-Varying Effect Moderation Daniel Almirall Center for Health Services Research, Durham VAMC & Dept. of Biostatistics, Duke University Medical Joint work

More information

Higher-Order von Mises Expansions, Bagging and Assumption-Lean Inference

Higher-Order von Mises Expansions, Bagging and Assumption-Lean Inference Higher-Order von Mises Expansions, Bagging and Assumption-Lean Inference Andreas Buja joint with: Richard Berk, Lawrence Brown, Linda Zhao, Arun Kuchibhotla, Kai Zhang Werner Stützle, Ed George, Mikhail

More information

University of California, Berkeley

University of California, Berkeley University of California, Berkeley U.C. Berkeley Division of Biostatistics Working Paper Series Year 2006 Paper 213 Targeted Maximum Likelihood Learning Mark J. van der Laan Daniel Rubin Division of Biostatistics,

More information

Global Sensitivity Analysis for Repeated Measures Studies with Informative Drop-out: A Semi-Parametric Approach

Global Sensitivity Analysis for Repeated Measures Studies with Informative Drop-out: A Semi-Parametric Approach Global for Repeated Measures Studies with Informative Drop-out: A Semi-Parametric Approach Daniel Aidan McDermott Ivan Diaz Johns Hopkins University Ibrahim Turkoz Janssen Research and Development September

More information

Problem 3. Give an example of a sequence of continuous functions on a compact domain converging pointwise but not uniformly to a continuous function

Problem 3. Give an example of a sequence of continuous functions on a compact domain converging pointwise but not uniformly to a continuous function Problem 3. Give an example of a sequence of continuous functions on a compact domain converging pointwise but not uniformly to a continuous function Solution. If we does not need the pointwise limit of

More information

Introduction to Empirical Processes and Semiparametric Inference Lecture 02: Overview Continued

Introduction to Empirical Processes and Semiparametric Inference Lecture 02: Overview Continued Introduction to Empirical Processes and Semiparametric Inference Lecture 02: Overview Continued Michael R. Kosorok, Ph.D. Professor and Chair of Biostatistics Professor of Statistics and Operations Research

More information

SDS : Theoretical Statistics

SDS : Theoretical Statistics SDS 384 11: Theoretical Statistics Lecture 1: Introduction Purnamrita Sarkar Department of Statistics and Data Science The University of Texas at Austin https://psarkar.github.io/teaching Manegerial Stuff

More information

Lecture 8: Information Theory and Statistics

Lecture 8: Information Theory and Statistics Lecture 8: Information Theory and Statistics Part II: Hypothesis Testing and I-Hsiang Wang Department of Electrical Engineering National Taiwan University ihwang@ntu.edu.tw December 23, 2015 1 / 50 I-Hsiang

More information

University of California, Berkeley

University of California, Berkeley University of California, Berkeley U.C. Berkeley Division of Biostatistics Working Paper Series Year 2015 Paper 334 Targeted Estimation and Inference for the Sample Average Treatment Effect Laura B. Balzer

More information

Class 2 & 3 Overfitting & Regularization

Class 2 & 3 Overfitting & Regularization Class 2 & 3 Overfitting & Regularization Carlo Ciliberto Department of Computer Science, UCL October 18, 2017 Last Class The goal of Statistical Learning Theory is to find a good estimator f n : X Y, approximating

More information

Density estimation Nonparametric conditional mean estimation Semiparametric conditional mean estimation. Nonparametrics. Gabriel Montes-Rojas

Density estimation Nonparametric conditional mean estimation Semiparametric conditional mean estimation. Nonparametrics. Gabriel Montes-Rojas 0 0 5 Motivation: Regression discontinuity (Angrist&Pischke) Outcome.5 1 1.5 A. Linear E[Y 0i X i] 0.2.4.6.8 1 X Outcome.5 1 1.5 B. Nonlinear E[Y 0i X i] i 0.2.4.6.8 1 X utcome.5 1 1.5 C. Nonlinearity

More information

Smooth simultaneous confidence bands for cumulative distribution functions

Smooth simultaneous confidence bands for cumulative distribution functions Journal of Nonparametric Statistics, 2013 Vol. 25, No. 2, 395 407, http://dx.doi.org/10.1080/10485252.2012.759219 Smooth simultaneous confidence bands for cumulative distribution functions Jiangyan Wang

More information

Asymptotic Nonequivalence of Nonparametric Experiments When the Smoothness Index is ½

Asymptotic Nonequivalence of Nonparametric Experiments When the Smoothness Index is ½ University of Pennsylvania ScholarlyCommons Statistics Papers Wharton Faculty Research 1998 Asymptotic Nonequivalence of Nonparametric Experiments When the Smoothness Index is ½ Lawrence D. Brown University

More information

Measuring robustness

Measuring robustness Measuring robustness 1 Introduction While in the classical approach to statistics one aims at estimates which have desirable properties at an exactly speci ed model, the aim of robust methods is loosely

More information

41903: Introduction to Nonparametrics

41903: Introduction to Nonparametrics 41903: Notes 5 Introduction Nonparametrics fundamentally about fitting flexible models: want model that is flexible enough to accommodate important patterns but not so flexible it overspecializes to specific

More information

On variable bandwidth kernel density estimation

On variable bandwidth kernel density estimation JSM 04 - Section on Nonparametric Statistics On variable bandwidth kernel density estimation Janet Nakarmi Hailin Sang Abstract In this paper we study the ideal variable bandwidth kernel estimator introduced

More information

Supplement to Quantile-Based Nonparametric Inference for First-Price Auctions

Supplement to Quantile-Based Nonparametric Inference for First-Price Auctions Supplement to Quantile-Based Nonparametric Inference for First-Price Auctions Vadim Marmer University of British Columbia Artyom Shneyerov CIRANO, CIREQ, and Concordia University August 30, 2010 Abstract

More information

MMSE Dimension. snr. 1 We use the following asymptotic notation: f(x) = O (g(x)) if and only

MMSE Dimension. snr. 1 We use the following asymptotic notation: f(x) = O (g(x)) if and only MMSE Dimension Yihong Wu Department of Electrical Engineering Princeton University Princeton, NJ 08544, USA Email: yihongwu@princeton.edu Sergio Verdú Department of Electrical Engineering Princeton University

More information

Missing Covariate Data in Matched Case-Control Studies

Missing Covariate Data in Matched Case-Control Studies Missing Covariate Data in Matched Case-Control Studies Department of Statistics North Carolina State University Paul Rathouz Dept. of Health Studies U. of Chicago prathouz@health.bsd.uchicago.edu with

More information

Minimum Hellinger Distance Estimation in a. Semiparametric Mixture Model

Minimum Hellinger Distance Estimation in a. Semiparametric Mixture Model Minimum Hellinger Distance Estimation in a Semiparametric Mixture Model Sijia Xiang 1, Weixin Yao 1, and Jingjing Wu 2 1 Department of Statistics, Kansas State University, Manhattan, Kansas, USA 66506-0802.

More information

L p Functions. Given a measure space (X, µ) and a real number p [1, ), recall that the L p -norm of a measurable function f : X R is defined by

L p Functions. Given a measure space (X, µ) and a real number p [1, ), recall that the L p -norm of a measurable function f : X R is defined by L p Functions Given a measure space (, µ) and a real number p [, ), recall that the L p -norm of a measurable function f : R is defined by f p = ( ) /p f p dµ Note that the L p -norm of a function f may

More information

Statistics - Lecture One. Outline. Charlotte Wickham 1. Basic ideas about estimation

Statistics - Lecture One. Outline. Charlotte Wickham  1. Basic ideas about estimation Statistics - Lecture One Charlotte Wickham wickham@stat.berkeley.edu http://www.stat.berkeley.edu/~wickham/ Outline 1. Basic ideas about estimation 2. Method of Moments 3. Maximum Likelihood 4. Confidence

More information

Introduction to Empirical Processes and Semiparametric Inference Lecture 09: Stochastic Convergence, Continued

Introduction to Empirical Processes and Semiparametric Inference Lecture 09: Stochastic Convergence, Continued Introduction to Empirical Processes and Semiparametric Inference Lecture 09: Stochastic Convergence, Continued Michael R. Kosorok, Ph.D. Professor and Chair of Biostatistics Professor of Statistics and

More information

Unsupervised Learning Techniques Class 07, 1 March 2006 Andrea Caponnetto

Unsupervised Learning Techniques Class 07, 1 March 2006 Andrea Caponnetto Unsupervised Learning Techniques 9.520 Class 07, 1 March 2006 Andrea Caponnetto About this class Goal To introduce some methods for unsupervised learning: Gaussian Mixtures, K-Means, ISOMAP, HLLE, Laplacian

More information

NETWORK-REGULARIZED HIGH-DIMENSIONAL COX REGRESSION FOR ANALYSIS OF GENOMIC DATA

NETWORK-REGULARIZED HIGH-DIMENSIONAL COX REGRESSION FOR ANALYSIS OF GENOMIC DATA Statistica Sinica 213): Supplement NETWORK-REGULARIZED HIGH-DIMENSIONAL COX REGRESSION FOR ANALYSIS OF GENOMIC DATA Hokeun Sun 1, Wei Lin 2, Rui Feng 2 and Hongzhe Li 2 1 Columbia University and 2 University

More information

Inverse problems in statistics

Inverse problems in statistics Inverse problems in statistics Laurent Cavalier (Université Aix-Marseille 1, France) Yale, May 2 2011 p. 1/35 Introduction There exist many fields where inverse problems appear Astronomy (Hubble satellite).

More information

An Efficient Estimation Method for Longitudinal Surveys with Monotone Missing Data

An Efficient Estimation Method for Longitudinal Surveys with Monotone Missing Data An Efficient Estimation Method for Longitudinal Surveys with Monotone Missing Data Jae-Kwang Kim 1 Iowa State University June 28, 2012 1 Joint work with Dr. Ming Zhou (when he was a PhD student at ISU)

More information

1. If 1, ω, ω 2, -----, ω 9 are the 10 th roots of unity, then (1 + ω) (1 + ω 2 ) (1 + ω 9 ) is A) 1 B) 1 C) 10 D) 0

1. If 1, ω, ω 2, -----, ω 9 are the 10 th roots of unity, then (1 + ω) (1 + ω 2 ) (1 + ω 9 ) is A) 1 B) 1 C) 10 D) 0 4 INUTES. If, ω, ω, -----, ω 9 are the th roots of unity, then ( + ω) ( + ω ) ----- ( + ω 9 ) is B) D) 5. i If - i = a + ib, then a =, b = B) a =, b = a =, b = D) a =, b= 3. Find the integral values for

More information

13 Endogeneity and Nonparametric IV

13 Endogeneity and Nonparametric IV 13 Endogeneity and Nonparametric IV 13.1 Nonparametric Endogeneity A nonparametric IV equation is Y i = g (X i ) + e i (1) E (e i j i ) = 0 In this model, some elements of X i are potentially endogenous,

More information

Lecture 11: Regression Methods I (Linear Regression)

Lecture 11: Regression Methods I (Linear Regression) Lecture 11: Regression Methods I (Linear Regression) Fall, 2017 1 / 40 Outline Linear Model Introduction 1 Regression: Supervised Learning with Continuous Responses 2 Linear Models and Multiple Linear

More information

Inference in Nonparametric Series Estimation with Data-Dependent Number of Series Terms

Inference in Nonparametric Series Estimation with Data-Dependent Number of Series Terms Inference in Nonparametric Series Estimation with Data-Dependent Number of Series Terms Byunghoon ang Department of Economics, University of Wisconsin-Madison First version December 9, 204; Revised November

More information

Linear Methods for Prediction

Linear Methods for Prediction Chapter 5 Linear Methods for Prediction 5.1 Introduction We now revisit the classification problem and focus on linear methods. Since our prediction Ĝ(x) will always take values in the discrete set G we

More information

Inverse problems in statistics

Inverse problems in statistics Inverse problems in statistics Laurent Cavalier (Université Aix-Marseille 1, France) YES, Eurandom, 10 October 2011 p. 1/32 Part II 2) Adaptation and oracle inequalities YES, Eurandom, 10 October 2011

More information

Wavelet Shrinkage for Nonequispaced Samples

Wavelet Shrinkage for Nonequispaced Samples University of Pennsylvania ScholarlyCommons Statistics Papers Wharton Faculty Research 1998 Wavelet Shrinkage for Nonequispaced Samples T. Tony Cai University of Pennsylvania Lawrence D. Brown University

More information

Lecture 11: Regression Methods I (Linear Regression)

Lecture 11: Regression Methods I (Linear Regression) Lecture 11: Regression Methods I (Linear Regression) 1 / 43 Outline 1 Regression: Supervised Learning with Continuous Responses 2 Linear Models and Multiple Linear Regression Ordinary Least Squares Statistical

More information

Targeted Maximum Likelihood Estimation in Safety Analysis

Targeted Maximum Likelihood Estimation in Safety Analysis Targeted Maximum Likelihood Estimation in Safety Analysis Sam Lendle 1 Bruce Fireman 2 Mark van der Laan 1 1 UC Berkeley 2 Kaiser Permanente ISPE Advanced Topics Session, Barcelona, August 2012 1 / 35

More information

Uncertainty Quantification for Inverse Problems. November 7, 2011

Uncertainty Quantification for Inverse Problems. November 7, 2011 Uncertainty Quantification for Inverse Problems November 7, 2011 Outline UQ and inverse problems Review: least-squares Review: Gaussian Bayesian linear model Parametric reductions for IP Bias, variance

More information

Inference For High Dimensional M-estimates. Fixed Design Results

Inference For High Dimensional M-estimates. Fixed Design Results : Fixed Design Results Lihua Lei Advisors: Peter J. Bickel, Michael I. Jordan joint work with Peter J. Bickel and Noureddine El Karoui Dec. 8, 2016 1/57 Table of Contents 1 Background 2 Main Results and

More information