Minimax Estimation of a Nonlinear Functional on a Structured High-Dimensional Model


Eric Tchetgen Tchetgen, Professor of Biostatistics and Epidemiologic Methods, Harvard University

Outline

- Heuristics
- First order influence functions: the $\sqrt{n}$-regular case
- Higher order influence functions: non-regular (slower-than-$\sqrt{n}$) rates
- Application to the quadratic functional and to the missing-at-random data functional

This is joint work with Robins, Li, Mukherjee and van der Vaart.

Heuristics

$X_1,\dots,X_n$ i.i.d. $\sim p$, where $p$ belongs to a collection $\mathcal{P}$ of densities, and we aim to estimate the value $\chi(p)$ of a functional $\chi : \mathcal{P} \to \mathbb{R}$.

We are particularly interested in semiparametric or nonparametric settings in which $\mathcal{P}$ is infinite dimensional.

We will give special attention to settings where the semiparametric model is described through parameters of low regularity, in which case $\sqrt{n}$-convergence may not be attainable.

Heuristics

For estimation, consider the class of "one-step" estimators of the form
\[
\hat\chi = \chi(\hat p_n) + P_n \chi_{\hat p_n},
\]
for $\hat p_n$ an initial estimator of $p$, where $x \mapsto \chi_p(x)$ is a measurable function for each $p$ and $P_n f = n^{-1}\sum_{i=1}^n f(X_i)$.

One possible choice is $\chi_{\hat p_n} = 0$, leading to the plug-in estimator; this is typically suboptimal.

$\hat p_n$ is constructed from an independent sample, i.e. by sample splitting. This is not required in the regular regime, where empirical process theory can be used instead, but such theory does not apply in the non-regular regime.

Heuristics

To motivate a good choice of $\chi_{\hat p_n}$, consider
\[
\hat\chi - \chi(p) = \big[\chi(\hat p_n) - \chi(p) + P\chi_{\hat p_n}\big] + (P_n - P)\chi_{\hat p_n},
\qquad P\chi_{\hat p_n} = \int \chi_{\hat p_n}(x)\,dp(x).
\]

The second term is centered and asymptotically normal of order $n^{-1/2}$; a good choice of $\chi_{\hat p_n}$ makes the term in brackets have "small bias", i.e. no larger than $O_P(n^{-1/2})$.

This can be achieved if $P\chi_{\hat p_n}$ acts like minus the derivative of the functional $\chi$ in the direction $\hat p_n - p$.

A function $\chi_{\hat p_n}$ with this property is known as a "first order" influence function.
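To spell out the cancellation this buys (a one-line sketch, assuming $\chi$ admits a second-order Taylor expansion along the path from $p$ to $\hat p_n$):
\[
\chi(\hat p_n) - \chi(p) + P\chi_{\hat p_n}
= \underbrace{d\chi(p)[\hat p_n - p] + P\chi_{\hat p_n}}_{\approx\,0\ \text{when } P\chi_{\hat p_n}\ \text{acts like minus the derivative}}
\;+\; O\big(\|\hat p_n - p\|^2\big),
\]
so the bracketed bias term becomes quadratic in the estimation error of $\hat p_n$, which is exactly the property exploited on the next slide.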

Heuristics

For a first order influence function the bias term is quadratic in the error $d(\hat p_n, p)$, for an appropriate distance $d$.

The small-bias condition then essentially requires $d(\hat p_n, p) = o_p(n^{-1/4})$. To get this rate the model cannot be too large.

For instance, a regression function or a density can be estimated at rate $n^{-1/4}$ if it is a priori known to have at least $d/2$ derivatives.

Our main concern will be settings where the model is very large or has very low regularity, so that the rate $n^{-1/4}$ is not attainable for $d(\hat p_n, p)$.
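The $d/2$-derivatives requirement is just the usual nonparametric rate calculation:
\[
n^{-\alpha/(2\alpha+d)} \le n^{-1/4}
\iff \frac{\alpha}{2\alpha+d} \ge \frac14
\iff 4\alpha \ge 2\alpha + d
\iff \alpha \ge \frac{d}{2}.
\]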

Heuristics

The one-step estimator $\hat\chi = \chi(\hat p_n) + P_n\chi_{\hat p_n}$ is suboptimal in such cases because it does not strike the right balance between bias and variance: in the decomposition
\[
\hat\chi - \chi(p) = \big[\chi(\hat p_n) - \chi(p) + P\chi_{\hat p_n}\big] + (P_n - P)\chi_{\hat p_n},
\]
the magnitude of the first term dominates and one cannot obtain valid inferences about $\chi(p)$, i.e. confidence intervals are not centered properly to achieve nominal coverage.

We will replace the first order IF term $P_n\chi_{\hat p_n}$ by a higher order IF term $U_n\chi_{\hat p_n}$, an $m$th order U-statistic, so that
\[
\hat\chi_n = \chi(\hat p_n) + U_n\chi_{\hat p_n},
\qquad
\hat\chi_n - \chi(p) = \big[\chi(\hat p_n) - \chi(p) + P^m\chi_{\hat p_n}\big] + (U_n - P^m)\chi_{\hat p_n}.
\]

This in fact suggests choosing the HOIF so that $P^m\chi_{\hat p_n}$ cancels the first $m$ terms of the Taylor expansion of $\chi(\hat p_n) - \chi(p)$.

Heuristics

Exact HOIFs exist only for special functionals.

For general functionals, we find a "good approximate" functional, in which case the convergence rate is obtained by trading off estimation bias plus approximation bias against variance.

In some cases the optimal rate of the HOIF estimator may still be $n^{-1/2}$, in which case the semiparametric efficiency bound may be achieved: the first order IF determines the variance and the HOIFs correct the bias.

Typically the rate will be slower than $n^{-1/2}$. What is the optimal (i.e. minimax) rate of estimation? Can we achieve it?

Two cases emerge:

Case 1: the smoothness of $p$ is assumed known; we have obtained both lower and upper bounds.

Case 2: the smoothness of $p$ is not known and therefore one must adapt to it; some recent results.

Example 1: Quadratic density functional

Quadratic functional of a density $p(x)$:
\[
\chi : p \mapsto \chi(p) = \int p(x)^2\,dx,
\]
where $p \in H_d(\alpha, C)$, a $d$-dimensional Hölder ball with smoothness $\alpha$ and radius $C$.

Of interest in bandwidth selection for optimal density smoothing.

Well studied problem, e.g. Bickel and Ritov (1988), Laurent (1996, 1997), Cai and Low (2005), Cai, Low et al. (2006), Donoho and Nussbaum (1990), Efromovich and Low et al. (1994, 1996), Giné and Nickl (2008), Laurent and Massart (2000).

Exhibits an elbow phenomenon: $\sqrt{n}$-estimable only if $\alpha/d \ge 1/4$. Below the threshold, the minimax lower bound is $n^{-8\alpha/(4\alpha+d)}$ in squared error norm (Birgé and Massart, 1995).
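For concreteness, the elbow in one dimension (illustrative arithmetic, not from the slides):
\[
d = 1:\quad
\alpha \ge \tfrac14 \;\Rightarrow\; \text{squared-error rate } n^{-1};
\qquad
\alpha = \tfrac18 \;\Rightarrow\; n^{-8\alpha/(4\alpha+d)} = n^{-1/(3/2)} = n^{-2/3}.
\]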

Example 2: Nonlinear density functional

Nonlinear functional of a density $p(x)$:
\[
\chi : p \mapsto \chi(p) = \int T(p(x))\,dx,
\]
where $T$ is a smooth mapping and $p \in H_d(\alpha, C)$, a $d$-dimensional Hölder ball with smoothness $\alpha$ and radius $C$.

There is a large body of work on estimating the entropy of an underlying distribution, e.g. Beirlant et al. (1997); more recent work includes estimation of Rényi and Tsallis entropies (Leonenko and Seleznjev, 2010; Pál, Póczos and Szepesvári, 2010). Also see Kandasamy et al. (2014).

Exhibits the same elbow phenomenon: $\sqrt{n}$-estimable only if $\alpha/d \ge 1/4$. Below the threshold, the minimax lower bound is $n^{-8\alpha/(4\alpha+d)}$ in squared error norm (Birgé and Massart, 1995).

The case $T(p) = p^3$ was studied for $d = 1$ by Kerkyacharian, Picard and Tsybakov (1998); a general solution for $d > 1$ was given by Tchetgen et al. (2008).

Example 3: Missing at random data

Suppose the full data are $(Y, Z)$, where $Y$ is a binary response and $Z$ a $d$-dimensional vector of covariates; however, one observes $(YA, A, Z)$, from which we wish to estimate $\chi = E(Y) = \Pr(Y = 1)$.

$\chi$ is nonparametrically identified under MAR, $A \perp Y \mid Z$, provided $\Pr(A = 1 \mid Z) > 0$ a.s.

Note: this is isomorphic to estimating the average treatment effect of a binary intervention $A$ on $Y$ under the assumption of no unmeasured confounding given $Z$, i.e. $\chi = E(Y^a)$, where $Y^a$ is the counterfactual response under regime $a$; the full data are $(Y^a, Z)$ and $Y^a$ is observed only when $A = a$. Identification holds under $A \perp Y^a \mid Z$.

Example 3: Missing at random data

The functional can be expressed as an observed-data functional in terms of the density $f$ of $Z$ (relative to a measure $\nu$) and the probability $b(z) = \Pr(Y = 1 \mid Z = z) = \Pr(Y = 1 \mid Z = z, A = 1)$:
\[
\chi : p_{b,f} \mapsto \chi(p_{b,f}) = \int b f\,d\nu.
\]

Alternative parametrization in terms of $a(z)^{-1} = \Pr(A = 1 \mid Z = z)$ and $g = f/a$ (proportional to the density of $Z$ given $A = 1$):
\[
\chi : p_{a,b,g} \mapsto \chi(p_{a,b,g}) = \int a\, b\, g\,d\nu.
\]

Example 3: Missing at random data

Estimators of $\chi(p)$ that are $n^{-1/2}$-consistent and asymptotically efficient in the semiparametric sense have been constructed with a variety of methods, but only if $a$ or $b$ or both parameters are restricted to sufficiently small regularity classes.

Formally, suppose that $a$ and $b$ belong to $H_d(\alpha, C_\alpha)$ and $H_d(\beta, C_\beta)$ respectively; then $n^{-1/2}$-consistent estimators have been obtained for $\alpha$ and $\beta$ large enough that
\[
\frac{\alpha}{2\alpha+d} + \frac{\beta}{2\beta+d} \ge \frac12. \tag{1}
\]

In our work, we show that (1) is more stringent than required: root-$n$ consistency can be attained using HOIFs provided
\[
\frac{\alpha+\beta}{2d} \ge \frac14.
\]
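A quick numerical comparison of the two conditions, taking for illustration $d = 4$ and $\alpha = \beta = 1$ (values chosen here, not in the slides):
\[
\frac{\alpha}{2\alpha+d} + \frac{\beta}{2\beta+d} = \frac16 + \frac16 = \frac13 < \frac12
\quad\text{(condition (1) fails)},
\qquad
\frac{\alpha+\beta}{2d} = \frac{2}{8} = \frac14
\quad\text{(the HOIF condition holds)}.
\]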

Example 3: Missing at random data

For moderate to large dimensions $d$ this is still a restrictive requirement, and likewise if the regularity is low even when $d$ is small.

In fact, in Robins, Li, Tchetgen, van der Vaart (2009) we establish that even if $g$ were known, the minimax lower bound when $(\alpha+\beta)/(2d) < 1/4$ is
\[
n^{-2(\alpha+\beta)/(2(\alpha+\beta)+d)}.
\]

Our estimators are shown to achieve this minimax rate when $\alpha$ and $\beta$ are a priori known, provided $g$ is smooth enough.

First Order Semiparametric Theory

Theory which goes back to Pfanzagl (1982), Newey (1990), van der Vaart (1991), Bickel et al. (1993).

Given a functional $\chi : \mathcal{P} \to \mathbb{R}$ defined on the semiparametric model $\mathcal{P}$, an IF is a function $\chi_p : x \mapsto \chi_p(x)$ of the observed data which satisfies the following equation: for a sufficiently regular submodel $t \mapsto p_t \in \mathcal{P}$,
\[
\frac{d}{dt}\Big|_{t=0}\chi(p_t) = \frac{d}{dt}\Big|_{t=0} P_t\chi_p = P(\chi_p\, g),
\]
where $g = (d/dt)\big|_{t=0}\, p_t / p$ is the score function of the model $t \mapsto p_t$ at $t = 0$. The closed linear span of all scores attached to submodels $t \mapsto p_t$ of $\mathcal{P}$ is called the tangent space.

Therefore an influence function is an element of the Hilbert space $L_2(p)$ whose inner product with elements of the tangent space represents the derivative of the functional. This result is known in functional analysis as the Riesz representation theorem.

IF of quadratic density functional

To compute the IF of $\chi(p) = \int p(x)^2\,dx$, consider $\chi(p_t) = \int p_t(x)^2\,dx$; computing $\frac{d}{dt}\big|_{t=0}\chi(p_t)$ one obtains
\[
\frac{d}{dt}\Big|_{t=0}\chi(p_t) = P\big\{2\big(p(X) - \chi(p)\big)\,g\big\},
\]
therefore
\[
\chi_p = 2\big(p(X) - \chi(p)\big).
\]

Furthermore,
\[
\hat\chi = \chi(\hat p_n) + P_n\chi^{(1)}_{\hat p_n} = 2\, P_n\, \hat p_n(X) - \chi(\hat p_n),
\]
so that
\[
P\big(\hat\chi - \chi(p)\big) = -\int (p - \hat p_n)^2(x)\,dx,
\]
which is quadratic in the preliminary-estimator error, as expected.

The variance of $\hat\chi$ is of order $1/n$; if the preliminary estimator $\hat p_n$ can be constructed so that the squared bias is smaller than the variance, the estimator is rate optimal; otherwise the bias dominates and a higher order IF is needed.
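A minimal numerical sketch of this one-step estimator in one dimension; the Gaussian KDE, the bandwidth, the sample-splitting scheme and the Gaussian example are illustrative choices made here, not prescriptions from the talk:

```python
import numpy as np
from sklearn.neighbors import KernelDensity

def one_step_quadratic(x, bandwidth=0.2, grid_size=2000):
    """One-step estimator of chi(p) = int p(x)^2 dx for a 1-d sample x."""
    # Sample splitting: p_hat from the first half, empirical correction from the second.
    n = len(x)
    x_train, x_eval = x[: n // 2], x[n // 2:]

    kde = KernelDensity(kernel="gaussian", bandwidth=bandwidth).fit(x_train[:, None])
    p_hat = lambda z: np.exp(kde.score_samples(z[:, None]))

    # Plug-in term chi(p_hat) = int p_hat(z)^2 dz, by a Riemann sum on a grid.
    grid = np.linspace(x.min() - 1.0, x.max() + 1.0, grid_size)
    chi_plugin = np.sum(p_hat(grid) ** 2) * (grid[1] - grid[0])

    # One-step correction: chi_hat = chi(p_hat) + P_n[2 (p_hat(X) - chi(p_hat))]
    #                              = 2 * mean(p_hat(X_eval)) - chi(p_hat).
    return 2.0 * p_hat(x_eval).mean() - chi_plugin

# Example with N(0,1) data, for which the truth is 1 / (2 sqrt(pi)) ~ 0.282.
rng = np.random.default_rng(0)
print(one_step_quadratic(rng.normal(size=4000)))
```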

Missing Data IF

To compute the IF in the MAR example, consider the score functions of the one-dimensional submodels $p_t = p_{a_t, b_t, f_t}$ induced by paths of the form
\[
a_t = a + t\,\underline{a},\qquad b_t = b + t\,\underline{b},\qquad f_t = f + t\,\underline{f},
\]
for given measurable functions $\underline{a}, \underline{b}, \underline{f} : \mathcal{Z} \to \mathbb{R}$.

This gives rise to the corresponding scores
\[
\frac{Aa(Z) - 1}{a(Z)\,(a-1)(Z)}\,\underline{a}(Z)\quad (a\text{-score}),
\qquad
\frac{A\,(Y - b(Z))}{b(Z)\,(1-b)(Z)}\,\underline{b}(Z)\quad (b\text{-score}),
\qquad
\frac{\underline{f}(Z)}{f(Z)}\quad (f\text{-score}).
\]

IF for Missing Data Functional

Let $B = a\text{-score} + b\text{-score} + f\text{-score}$. Then
\[
\frac{d}{dt}\Big|_{t=0}\chi(p_t) = P\big(\chi^{(1)}_p\, B\big)
\]
for all $B$ in the tangent space of the model $\mathcal{P}$, where
\[
\chi^{(1)}_p = A\,a(Z)\,\big(Y - b(Z)\big) + b(Z) - \chi(p).
\]

This is the well-known doubly robust influence function due to Robins, Rotnitzky and Zhao (1994).
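A small numerical sketch of the corresponding first-order one-step (AIPW) estimator of $\chi = E(Y)$; the logistic-regression nuisance fits and the simulated data are illustrative stand-ins chosen here, not the nonparametric estimators used in the talk:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def aipw_mean(y, a, z):
    """Doubly robust one-step estimator of chi = E[Y] under MAR.
    y: outcomes (only the entries with a == 1 are actually used),
    a: missingness indicators, z: covariate matrix."""
    # b_hat(z) = Pr(Y = 1 | Z = z, A = 1), fit on the complete cases.
    b_hat = LogisticRegression().fit(z[a == 1], y[a == 1]).predict_proba(z)[:, 1]
    # pi_hat(z) = Pr(A = 1 | Z = z); the IF uses its reciprocal a(z) = 1 / pi(z).
    pi_hat = LogisticRegression().fit(z, a).predict_proba(z)[:, 1]
    # chi_hat = P_n[ A / pi_hat * (Y - b_hat) + b_hat ].
    return np.mean(a / pi_hat * (y - b_hat) + b_hat)

# Tiny simulated illustration (Z uniform, logistic missingness and outcome models).
rng = np.random.default_rng(0)
z = rng.uniform(size=(2000, 1))
a = rng.binomial(1, 1 / (1 + np.exp(-(z[:, 0] - 0.5))))
y = rng.binomial(1, 1 / (1 + np.exp(-(2 * z[:, 0] - 1))))
print(aipw_mean(y, a, z), y.mean())   # estimator vs. full-data mean
```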

IF for Missing Data Functional

Suppose one has obtained rate-optimal nonparametric estimators $\hat a, \hat b$ (optimal kernel smoothing or series estimation), e.g. $\hat a$ converges at rate $n^{-2\alpha/(2\alpha+d)}$ in mean squared error. Then the one-step estimator
\[
\hat\chi = \chi(\hat p_n) + P_n\chi^{(1)}_{\hat p_n} = P_n\big\{A\,\hat a(Z)\,\big(Y - \hat b(Z)\big) + \hat b(Z)\big\}
\]
has second-order bias
\[
E\big(\hat\chi - \chi(p)\big) \lesssim \|\hat a - a\|_2\,\|\hat b - b\|_2,
\]
which is quadratic in the preliminary estimators, as expected. Note that $f$ does not enter the bias (provided the empirical measure is used).

The variance of $\hat\chi$ is of order $1/n$. If the preliminary estimators $\hat a$ and $\hat b$ can be constructed so that the squared bias is smaller than the variance, the estimator is rate optimal; otherwise the bias dominates and a higher order IF is needed.

HOIFs

Given a collection of smooth submodels $t \mapsto p_t$ in $\mathcal{P}$, an $m$th order influence function $\chi_p$ of the functional $\chi(p)$ is an $m$th order U-statistic kernel which satisfies $P^m\chi_p = 0$ and
\[
\frac{d^j}{dt^j}\Big|_{t=0}\chi(p_t) = \frac{d^j}{dt^j}\Big|_{t=0} P_t^m\chi_p,
\qquad j = 1, \dots, m,
\]
for every submodel through $p$ in $\mathcal{P}$.

The second requirement implies a Taylor expansion of $t \mapsto \chi(p_t)$ at $t = 0$ of order $m$, but in addition requires that the derivatives of this map can be represented as expectations involving a single function $\chi_p$.

In order to exploit derivatives up to the $m$th order, groups of $m$ observations can be used to match the expectation $P^m$; this leads to U-statistics of order $m$.

HOIFs

Thus, we must solve for an $m$th order U-statistic with mean zero that satisfies
\[
\frac{d^j}{dt^j}\Big|_{t=0}\chi(p_t) = \frac{d^j}{dt^j}\Big|_{t=0} P_t^m\chi_p,
\qquad j = 1, \dots, m,
\]
for every submodel through $p$ in $\mathcal{P}$.

This equation involves multiple derivatives and many paths, so directly solving for $\chi_p$ is challenging.

For actual computation of influence functions we have shown that it is usually easier to derive higher order influence functions as influence functions of lower order ones.

In fact, the Hoeffding decomposition theorem states that one can decompose any $m$th order U-statistic as the sum of $m$ degenerate U-statistics of orders $1, 2, \dots, m$.

HOIFs

Thus, by the Hoeffding decomposition,
\[
U_n\chi_p = U_n\frac{1}{1!}\chi^{(1)}_p + U_n\frac{1}{2!}\chi^{(2)}_p + \dots + U_n\frac{1}{m!}\chi^{(m)}_p,
\]
where $\chi^{(j)}_p$ is a degenerate (symmetric) kernel of $j$ arguments, defined uniquely as a projection of $\chi_p$.

Suitable functions $\chi^{(j)}_p$ can be found by the following algorithm:

1. Let $x_1 \mapsto \chi^{(1)}_p(x_1)$ be a first order influence function of the functional $p \mapsto \chi(p)$.
2. Let $x_j \mapsto \chi^{(j)}_p(x_1, x_2, \dots, x_j)$ be a first order influence function of the functional $p \mapsto \chi^{(j-1)}_p(x_1, x_2, \dots, x_{j-1})$, for each fixed $x_1, x_2, \dots, x_{j-1}$ and $j = 2, \dots, m$.
3. Replace $\chi^{(j)}_p$ by its degenerate part relative to $P$, i.e. the version satisfying $\int \chi^{(j)}_p(X_1, \dots, x_s, \dots, X_j)\,dP(x_s) = 0$ for all $s = 1, \dots, j$.

HOIFs

Thus, HOIFs are constructed as first order influence functions of an influence function.

Although the order $m$ is fixed at a suitable value, we suppress it in the notation $\chi_p$.

In an abuse of language, we refer to $\chi^{(j)}_p$ as the $j$th order IF.

The starting IF $\chi^{(1)}_p$ in step 1 of the algorithm may be any first order IF; it does not need to be an element of the tangent space (i.e. efficient) and does not need to have mean zero.

The same remark applies in step 2 of the algorithm; it is only in step 3 that we make the IFs degenerate.

HOIFs of quadratic functional

Recall that the uncentered (first order) IF of $p \mapsto \int p(x)^2\,dx$ is $\chi^{(1)}_p(x) = 2p(x)$, where $p(x)$ is the joint density of $X$. Applying step 2 for $j = 2$, we seek $\chi^{(2)}_p(x_1, x_2)$ that solves
\[
\frac{d}{dt}\Big|_{t=0}\chi^{(1)}_{p_t}(x_1) = P\big\{\chi^{(2)}_p(x_1, X_2)\,g(X_2)\big\}
\]
for all $x_1$ and all smooth paths $t \mapsto p_t$. However,
\[
\frac{d}{dt}\Big|_{t=0}\chi^{(1)}_{p_t}(x_1) = 2\,\frac{d}{dt}\Big|_{t=0} p_t(x_1),
\]
and $p(x_1)$, a continuous density evaluated at a point, is not pathwise differentiable, so such a representation does not exist!

Approximate functionals

HOIFs will generally fail to exist if the first order IF depends on a parameter that is not pathwise differentiable, such as a conditional expectation or a density at a point.

To make progress, we propose to replace $\chi(p)$ with an approximate functional $\chi(\tilde p)$, for a given mapping $p \mapsto \tilde p$, such that $\chi(p)$ and $\chi(\tilde p)$ are close and $\chi(\tilde p)$ is higher order pathwise differentiable.

This leads to representation bias. Ensuring that this bias is negligible requires that the projection of $p$ onto $\tilde p$ involve models of high dimension (number of parameters $\gtrsim n$).

This will typically result in a HOIF variance $\gtrsim n^{-1}$, so that nonparametric rates of convergence apply.

Projection Kernel

An orthogonal projection $K_p : L_2(p) \to L_2(p)$ onto a $k$-dimensional linear subspace of $L_2(p)$ is given by a kernel operator, with kernel denoted by the same symbol as the operator:
\[
K_p t(x) = \int K_p(x, y)\, t(y)\, dp(y).
\]
The projection property $K_p^2 = K_p$ holds.

We also require that $K_p(u, u) \lesssim k$ and $\|K_p t\| \lesssim \|t\|$.

Moreover, it is desirable that the projection operator be suitable for approximation in Hölder space: for every $k$, with $\|\cdot\|_\alpha$ a norm for $H^\alpha[0,1]^d$,
\[
\sup_{t : \|t\|_\alpha \le 1}\|K_p t - t\| \lesssim \Big(\frac1k\Big)^{\alpha/d}.
\]
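A concrete instance of such a kernel (a sketch under the assumption that the subspace is spanned by the first $k$ cosine basis functions on $[0,1]$ and that the reference measure is Lebesgue; the basis choice and grid sizes are mine, not the talk's):

```python
import numpy as np

def cosine_basis(x, k):
    """First k functions of the cosine basis, orthonormal in L2(Lebesgue) on [0, 1]."""
    cols = [np.ones_like(x)] + [np.sqrt(2.0) * np.cos(np.pi * j * x) for j in range(1, k)]
    return np.stack(cols, axis=-1)                    # shape (len(x), k)

def projection_kernel(x, y, k):
    """K(x, y) = sum_j phi_j(x) phi_j(y): kernel of the L2-projection onto
    span{phi_1, ..., phi_k}; its trace int K(u, u) du equals k."""
    return cosine_basis(x, k) @ cosine_basis(y, k).T

# Numerical check of the projection property K o K = K (midpoint rule on [0, 1]).
m = 1000
grid = (np.arange(m) + 0.5) / m
K = projection_kernel(grid, grid, k=10)
K2 = (K @ K) / m                                      # discretized int K(x,u) K(u,y) du
print(np.max(np.abs(K2 - K)))                         # close to machine precision
```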

Approximate quadratic functional

The approximate functional is
\[
\chi(\tilde p) = \int \tilde p(x)^2\,dx,
\qquad
\tilde p(x) = K_\mu p(x),
\]
with $K_\mu$ a projection kernel with respect to Lebesgue measure $\mu$.

Note that whereas $p(x)$ is not pathwise differentiable, $\tilde p(x)$ is!

Therefore,
\[
\frac{d}{dt}\Big|_{t=0}\chi(\tilde p_t) = P\{2\tilde p(X)\,g\},
\qquad
\frac{d}{dt}\Big|_{t=0}\chi^{(1)}_{\tilde p_t}(x_1) = 2\,\frac{d}{dt}\Big|_{t=0}\tilde p_t(x_1) = P\big\{2 K_\mu(x_1, X_2)\,g(X_2)\big\},
\]
so that
\[
\chi^{(2)}_{\tilde p}(x_1, x_2) = K_\mu(x_1, x_2),
\qquad
\chi^{(j)}_{\tilde p} = 0,\quad j > 2.
\]

Approximate quadratic functional

Step 3 of the algorithm yields a second order IF:
\[
U_n\chi_{\tilde p} = U_n\chi^{(1)}_{\tilde p} + \tfrac12\, U_n\chi^{(2)}_{\tilde p},
\]
\[
\chi^{(1)}_{\tilde p}(x_1) = 2\big(\tilde p(x_1) - \chi(\tilde p)\big),
\qquad
\chi^{(2)}_{\tilde p}(x_1, x_2) = K_\mu(x_1, x_2) - K_\mu p(x_1) - K_\mu p(x_2) + \chi(\tilde p).
\]

We further have
\[
\hat\chi_n = \chi(\hat p_n) + U_n\chi_{\hat p_n},
\qquad
P\big(\hat\chi_n - \chi(\tilde p)\big) = 0,
\]
and
\[
\operatorname{var}(\hat\chi_n) \lesssim \max\Big\{\operatorname{var}\big(U_n\chi^{(1)}_{\tilde p}\big),\ \operatorname{var}\big(\tfrac12 U_n\chi^{(2)}_{\tilde p}\big)\Big\} \asymp \max\Big\{\frac1n,\ \frac{k}{n^2}\Big\}.
\]

The representation bias is
\[
\chi(\tilde p) - \chi(p) = O\big(k^{-2\alpha/d}\big).
\]
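A numerical sketch of this construction: for the projection onto the span of the first $k$ cosine basis functions, the second order estimator can be written (up to the bookkeeping of constants in the expansion, and for any choice of $\hat p_n$) as the classical pairwise U-statistic $\frac{1}{n(n-1)}\sum_{i\ne j}K_\mu(X_i, X_j)$, which is exactly unbiased for $\chi(\tilde p)$. The basis, the value of $k$, and the Beta(2,2) example below are illustrative choices made here:

```python
import numpy as np

def cosine_basis(x, k):
    cols = [np.ones_like(x)] + [np.sqrt(2.0) * np.cos(np.pi * j * x) for j in range(1, k)]
    return np.stack(cols, axis=-1)

def u_stat_quadratic(x, k):
    """Pairwise U-statistic estimator of chi(p~) = int (K_mu p)(x)^2 dx for data in [0, 1]:
    unbiased for the approximate functional, with variance of order max(1/n, k/n^2)."""
    Phi = cosine_basis(x, k)                 # (n, k) basis evaluations
    K = Phi @ Phi.T                          # K_mu(X_i, X_j) for all pairs
    n = len(x)
    return (K.sum() - np.trace(K)) / (n * (n - 1))

# Example: p = Beta(2, 2) on [0, 1], for which int p^2 = 6/5; k = 20 keeps the
# representation bias negligible for this smooth density.
rng = np.random.default_rng(0)
print(u_stat_quadratic(rng.beta(2.0, 2.0, size=2000), k=20))
```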

Approximate quadratic functional

Therefore, if $\alpha/d \ge 1/4$, the approximation bias can be made $\lesssim n^{-1/2}$ with $k < n$, so that
\[
\operatorname{var}(\hat\chi_n) \lesssim \max\Big\{\frac1n,\ \frac{k}{n^2}\Big\} = \frac1n,
\]
and the functional is estimable at rate $n^{-1/2}$.

However, whenever $\alpha/d < 1/4$, the optimal MSE is attained by $k_{\mathrm{opt}} = n^{2d/(4\alpha+d)} > n$, achieving the minimax rate $n^{-8\alpha/(4\alpha+d)}$.

Note that because there is no estimation bias, the optimal rate is obtained by trading off variance against representation bias.
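The stated $k_{\mathrm{opt}}$ and rate follow from balancing the variance term against the squared representation bias:
\[
\frac{k}{n^2} = k^{-4\alpha/d}
\;\Longrightarrow\;
k^{1 + 4\alpha/d} = n^2
\;\Longrightarrow\;
k_{\mathrm{opt}} = n^{2d/(4\alpha+d)},
\qquad
\frac{k_{\mathrm{opt}}}{n^2} = n^{-8\alpha/(4\alpha+d)}.
\]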

Approximate functional

Interestingly, in this example the approximate functional is locally equivalent to the true functional, in the sense that their first order IFs match.

Although this is a desirable property, because it leads to parsimonious HOIFs, it will not generally hold for a general functional; it can, however, be ensured by construction.

To do so, we propose to define the approximate functional to satisfy
\[
\frac{d}{dt}\Big|_{t=0}\Big(\chi(\tilde p_t) + P\chi^{(1)}_{\tilde p_t}\Big) = 0.
\]

In that case $\tilde\chi^{(1)}_p = \chi^{(1)}_p$, so step 1 of the algorithm returns the same first order IF.

Approximate functional for missing data functional

To define the approximate functional, for a user-specified $\hat a$ we let $\tilde a$ be the function such that $(\tilde a - \hat a)/a$ is the orthogonal projection of $(a - \hat a)/a$ onto a "large" linear subspace of $L_2(g)$ with projection kernel $K_g$, i.e.
\[
\frac{\tilde a}{a} = \frac{\hat a}{a} + K_g\Big(\frac{a - \hat a}{a}\Big),
\]
and define $\tilde b$ similarly.

The corresponding approximate functional
\[
\chi(\tilde p) = \int \tilde a\,\tilde b\, g\, d\nu
\]
satisfies $\tilde\chi^{(1)}_p = \chi^{(1)}_p$, and its representation bias satisfies
\[
\Big|\int \tilde a\,\tilde b\, g\, d\nu - \int a\, b\, g\, d\nu\Big|
\lesssim \Big\|\frac{a - \tilde a}{a}\Big\|_2\,\Big\|\frac{b - \tilde b}{b}\Big\|_2
\lesssim \Big(\frac1k\Big)^{(\alpha+\beta)/d},
\]
which can be made small by choosing the range of $K_g$ large enough.

Approximate functional for missing data functional

Step 1 of the algorithm applied to $\chi(\tilde p)$ gives
\[
\tilde\chi^{(1)}_p(x_1) = a_1\,\tilde a(z_1)\,\big(y_1 - \tilde b(z_1)\big) + \tilde b(z_1).
\]

The corresponding bias of $\hat\chi_n = \chi(\hat p_n) + U_n\tilde\chi^{(1)}_{\hat p_n}$ is
\[
\int (\hat a - a)\,(b - \hat b)\, g\, d\nu = O_p\big(n^{-\alpha/(2\alpha+d) - \beta/(2\beta+d)}\big),
\]
while the corresponding variance is of order $1/n$; this requires $\frac{\alpha}{2\alpha+d} + \frac{\beta}{2\beta+d} \ge \frac12$ to recover a regular estimator when possible.

One may be able to relax this somewhat by undersmoothing, or by exploiting the fact that the bias has an integral representation. We do not wish to do so here, as we assume all nuisance estimators are rate optimal.

Approximate functional for missing data functional

Whereas $\chi^{(1)}_p(x_1)$ is not pathwise differentiable, $\tilde\chi^{(1)}_p(x_1)$ is.

Step 2 of the algorithm applied to $\tilde\chi^{(1)}_p(x_1)$ gives
\[
\tilde\chi^{(2)}_p(x_1, x_2) = -2\, a_1\,\big(y_1 - \tilde b(z_1)\big)\, K_g(z_1, z_2)\,\big(a_2\,\tilde a(z_2) - 1\big).
\]

Step 3 symmetrizes this IF and makes it a degenerate U-statistic kernel.

Note that because $K_g$ depends on $g$, the second order IF is not exactly unbiased, and therefore higher order IFs may be required to further reduce the estimation bias, depending on the smoothness of $g$.
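A rough numerical sketch of the resulting second-order corrected estimator for a scalar covariate on $[0,1]$. The cosine basis, the use of the $A=1$ subsample to build an empirical projection kernel (as a stand-in for a projection in $L_2(\hat g)$), and the externally supplied nuisance arrays are all illustrative choices made here; the constants follow the bias heuristics above rather than the exact bookkeeping of the slides:

```python
import numpy as np

def cosine_basis(z, k):
    cols = [np.ones_like(z)] + [np.sqrt(2.0) * np.cos(np.pi * j * z) for j in range(1, k)]
    return np.stack(cols, axis=-1)

def second_order_mar(y, a, z, b_tilde, a_tilde, k=20):
    """chi_hat = first-order (AIPW) term + second-order U-statistic correction.
    y, a: outcome and observation indicator; z: scalar covariate in [0, 1];
    b_tilde, a_tilde: arrays with the nuisance estimates b~(z_i) and a~(z_i)."""
    n = len(y)
    # First-order term P_n[ A a~(Z)(Y - b~(Z)) + b~(Z) ].
    chi1 = np.mean(a * a_tilde * (y - b_tilde) + b_tilde)
    # Projection kernel built from the A = 1 subsample, mimicking a projection in L2(g_hat).
    Phi = cosine_basis(z, k)
    gram = Phi[a == 1].T @ Phi[a == 1] / max(a.sum(), 1)
    K = Phi @ np.linalg.solve(gram, Phi.T)
    # Second-order correction: average over ordered pairs i != j of
    #   - A_i (Y_i - b~(Z_i)) K(Z_i, Z_j) (A_j a~(Z_j) - 1).
    r1 = a * (y - b_tilde)
    r2 = a * a_tilde - 1.0
    M = -np.outer(r1, r2) * K
    chi2 = (M.sum() - np.trace(M)) / (n * (n - 1))
    return chi1 + chi2
```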

Minimax estimation with 2nd order IF

Let $\hat\chi_n = \chi(\hat p_n) + U_n\tilde\chi^{(1)}_{\hat p_n} + U_n\tilde\chi^{(2)}_{\hat p_n}$, and let $\gamma$ denote the Hölder coefficient of $g$.

Theorem. Suppose $\frac{\gamma}{d+2\gamma} \ge \frac{2(\beta+\alpha)}{d+2(\beta+\alpha)} - \Big(\frac{\beta}{d+2\beta} + \frac{\alpha}{d+2\alpha}\Big)$, and set $k = n^{2d/(d+2(\beta+\alpha))}$. Then we have the following result:

1. If $\frac{\beta+\alpha}{2d} \ge \frac14$, then
\[
\sup_{\theta \in \Theta(\beta,\alpha,\gamma)} \big(\hat\chi_n - \chi(p)\big) = O_p\big(n^{-1/2}\big).
\]
2. If $\frac{\beta+\alpha}{2d} \le \frac14$, then
\[
\sup_{\theta \in \Theta(\beta,\alpha,\gamma)} \big(\hat\chi_n - \chi(p)\big) = O_p\big(n^{-2(\beta+\alpha)/(d+2(\beta+\alpha))}\big).
\]

Minimax estimation

This result is new!!! All existing methods break down once $(\beta+\alpha)/(2d) \le \frac14$, i.e. once the effective smoothness $(\beta+\alpha)/(2d)$ is too small.

We have likewise constructed minimax estimators when
\[
\frac{\gamma}{d+2\gamma} < \frac{2(\beta+\alpha)}{d+2(\beta+\alpha)} - \Big(\frac{\beta}{d+2\beta} + \frac{\alpha}{d+2\alpha}\Big).
\]

These higher order IFs are higher order U-statistics used for further bias reduction without increasing the variance beyond the second order rate $k/n^2$; this requires tremendous care (see Robins et al., 2016, Annals of Statistics).

Honest confidence intervals

Our approach can be used to obtain honest CIs for $\chi(p)$, given by $C_n = \hat\chi_n \pm m_{1-\alpha}\,\sigma(\hat\chi_n)$, such that
\[
\limsup_{n} \Pr\{\chi(p) \notin C_n\} \le \alpha
\]
uniformly in $p \in \mathcal{P}$, with an appropriate constant $m_{1-\alpha}$.

This is because our HOIFs are asymptotically normally distributed (Robins et al., 2016).

Simulations of honest confidence intervals

Nominal coverage 90%; sample size 1000; 200 iterations; effective smoothness of $g$ = 0.2.

Final remarks

Our results apply to a general class of functionals which includes the semilinear model, several IV models, MAR monotone missing data functionals, and nonignorable missing data functionals, among others.

One will rarely know the smoothness of the nuisance parameters; ideally one would like to adapt to such smoothness.

We have recently developed such adaptive estimators for a large class of functionals by extending and applying a technique proposed by Lepski (Mukherjee, Tchetgen Tchetgen, Robins, 2016).