Minimax Rate of Convergence for an Estimator of the Functional Component in a Semiparametric Multivariate Partially Linear Model.


Minimax Rate of Convergence for an Estimator of the Functional Component in a Semiparametric Multivariate Partially Linear Model

By Michael Levine
Purdue University

Technical Report #14-03
Department of Statistics
Purdue University
May 2014

Journal of Multivariate Analysis (2014): Preprint

Minimax rate of convergence for an estimator of the functional component in a semiparametric multivariate partially linear model

Michael Levine, Purdue University

Abstract: A multivariate semiparametric partially linear model for both the fixed and random design cases is considered. Earlier, in Brown, Levine and Wang (2014), the model was analyzed using a difference sequence approach. In particular, the functional component was estimated using a multivariate Nadaraya-Watson kernel smoother of the residuals of the linear fit. Moreover, this functional component estimator was shown to be rate optimal if the Lipschitz smoothness index exceeds half the dimensionality of the functional component's domain. In the current manuscript, we take this research further and show that, for both fixed and random designs, the rate achieved is the minimax rate under both the risk at a point and the $L_2$ risk. The result is obtained by proving lower bounds on both the pointwise risk and the $L_2$ risk of possible estimators of the functional component.

Key words and phrases: Multivariate semiparametric partially linear model, minimax rate of convergence, functional component, lower bound, Fano's lemma, Varshamov-Gilbert bound.

1. Introduction

Semiparametric models have a long history in statistics and have received considerable attention in the last several decades. The main reason they are of considerable interest is that, quite often, the relationships between the response and the predictors are very heterogeneous within the same model. Some of the relationships are clearly linear, whereas detailed information about the others is hard to come by. In many situations, a small subset of variables is presumed to have an unknown relationship with the response that is modeled nonparametrically, while the rest are assumed to have a linear relationship with it. As an example, Engle, Granger, Rice and Weiss (1986) studied the nonlinear relationship between temperature and electricity usage, where other related factors, such as income and price, are parameterized linearly.

The model we consider in this paper is the semiparametric partially linear multivariate model

$$Y_i = a + X_i^\top \beta + f(U_i) + \varepsilon_i \qquad (1.1)$$

where $X_i \in \mathbb{R}^p$ and $U_i \in \mathbb{R}^q$, $\beta$ is an unknown $p \times 1$ vector of parameters, $a$ is an unknown intercept term, $f(\cdot)$ is an unknown function, and the $\varepsilon_i$ are independent and identically distributed random variables with mean $0$ and constant variance $\sigma^2$. We consider two cases with respect to $U$: a random design case, whereby $U_i$ is a $q$-dimensional random variable, and a fixed design case, with $U_i$ being a $q$-dimensional vector whose coordinates are defined on an equispaced grid on $[0,1]$. In the fixed design case the errors are independent of $X_i$, while in the random design case they are independent of $(X_i, U_i)$. To obtain meaningful results, the function $f$ is assumed to belong to the Lipschitz ball $\Lambda^\alpha(M)$, where $\alpha$ is the Lipschitz exponent. Of particular interest is the fact that, to be coherent, in the fixed design case with $q > 1$ the model (1.1) must have multivariate indices. The version with $q = 1$ was considered earlier in Wang, Brown and Cai (2011), while Brown, Levine and Wang (2014) considered the case $q > 1$ in detail. The latter defined two conceptually similar difference-based estimators of the parametric component for the fixed and random design cases, respectively, and showed $\sqrt{n}$ asymptotic normality of both of these estimators. Moreover, it was also established that, in order for the estimator of the parametric component to be efficient, the order of the difference sequence must go to infinity.
Brown, Levine and Wang (2014) also obtained a convergence result for an estimator of the functional component, uniform over the Lipschitz ball $\Lambda^\alpha(M)$, establishing the rate of convergence $n^{-2\alpha/(2\alpha+q)}$. To the best of our knowledge, the minimax-optimal rate of convergence for estimators of the nonparametric component of the multivariate partially linear model (1.1) has not been established. Since the results of Brown, Levine and Wang (2014) amount to establishing the upper bound for that rate, the remaining task is to establish the lower bound. In this manuscript, we do just that, for both fixed and random designs as well as for two different functional distances. The first distance considered is the difference at a given fixed point, and the second is the $L_2[0,1]^q$ distance. A number of different techniques are used to obtain these results. The next two sections present detailed results for the fixed and random design cases, respectively.

2. Optimal rates of convergence for the deterministic design case

We consider the following semiparametric model:

$$Y_{\mathbf{i}} = a + X_{\mathbf{i}}^\top \beta + f(U_{\mathbf{i}}) + \varepsilon_{\mathbf{i}} \qquad (2.1)$$

where $X_{\mathbf{i}} \in \mathbb{R}^p$, $U_{\mathbf{i}} \in S = [0,1]^q \subset \mathbb{R}^q$, and the $\varepsilon_{\mathbf{i}}$ are iid zero-mean random variables with variance $\sigma^2$ and finite absolute moment of order $\delta + 2$ for some small $\delta > 0$, that is, $E|\varepsilon_{\mathbf{i}}|^{\delta+2} < \infty$. As noticed earlier in Brown, Levine and Wang (2014), the model (2.1) must have multidimensional indices $\mathbf{i} = (i_1, \ldots, i_q)$ to be coherent. Throughout this work, we use the bold font $\mathbf{i}$ to refer to a multivariate index and the regular font $i_k$ to refer to a coordinate of a multivariate index. For some positive integer $m$, we can take $i_k = 1, \ldots, m$ for $k = 1, \ldots, q$; thus, the total sample size is $n = m^q$. Note that this assumption implies that $m = o(n)$ as $n \to \infty$. Due to the use of multivariate indices, one can also say that the $\varepsilon_{\mathbf{i}}$ form an independent random field with marginal density function $h(x)$, where $x$ is a generic argument. We will say that two indices satisfy $\mathbf{i}^1 = (i^1_1, \ldots, i^1_q) \preceq \mathbf{i}^2 = (i^2_1, \ldots, i^2_q)$ if $i^1_k \le i^2_k$ for every $k = 1, \ldots, q$; the relationship between $\mathbf{i}^1$ and $\mathbf{i}^2$ is that of partial ordering. Also, for a multivariate index $\mathbf{i}$, we denote $|\mathbf{i}| = i_1 + \cdots + i_q$. In this section, we assume that $U_{\mathbf{i}}$ follows a fixed equispaced design: $U_{\mathbf{i}} = (u_{i_1}, \ldots, u_{i_q}) \in \mathbb{R}^q$, where each coordinate is $u_{i_k} = i_k/m$.
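As a concrete illustration of the fixed-design setup (2.1), the following sketch simulates data on the equispaced grid with multivariate indices. The particular values of $q$, $p$, $\beta$, the choice of $f$, and the Gaussian error law are illustrative assumptions for the demo, not choices made in the paper.

```python
import itertools
import numpy as np

rng = np.random.default_rng(0)

# Illustrative choices (not from the paper): q = 2, p = 3, m = 20, so n = m**q.
q, p, m = 2, 3, 20
a, beta = 1.0, np.array([0.5, -1.0, 2.0])

def f(u):
    # A smooth functional component; any member of the Lipschitz ball would do.
    return np.sin(2 * np.pi * u[..., 0]) * np.cos(np.pi * u[..., 1])

# Multivariate indices i = (i_1, ..., i_q) with i_k = 1, ..., m; U_i = i/m on the grid.
idx = np.array(list(itertools.product(range(1, m + 1), repeat=q)))
U = idx / m                      # n x q matrix of design points, n = m**q
n = U.shape[0]

X = rng.normal(size=(n, p))      # covariates, independent of the errors
eps = rng.normal(scale=0.1, size=n)
Y = a + X @ beta + f(U) + eps    # observations from model (2.1)

print(n == m ** q)               # True: total sample size is n = m**q
```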
In the model (2.1), $\beta$ is an unknown $p$-dimensional vector of parameters and $a$ is an unknown intercept term. We assume that the $X_{\mathbf{i}}$'s are independent random vectors that are also independent of the $\varepsilon_{\mathbf{i}}$; moreover, we denote the non-singular covariance matrix of $X$ by $\Sigma_X$. For convenience, we also denote $N = \{1, \ldots, m\}^q$. This model requires an identifiability condition to be satisfied; more specifically, $\int_{[0,1]^q} f(u)\,du = 0$. The version of (2.1) with $q = 1$ was considered earlier in Wang, Brown and Cai (2011). The case $q = 1$ is quite different in that it only requires univariate indices for the model to be coherent.

As a reminder, we consider functions $f$ belonging to the Lipschitz ball $\Lambda^\alpha(M)$ for some positive constant $M$, defined as follows. For a $q$-dimensional index $j = (j_1, \ldots, j_q)$, let $|j| = j_1 + \cdots + j_q$ and, for $|j| = l$, define the differential operator

$$D^{j} f = \frac{\partial^l f}{\partial u_1^{j_1} \cdots \partial u_q^{j_q}},$$

which is defined for all $j$ such that $|j| = l$. Then the Lipschitz ball $\Lambda^\alpha(M)$ consists of all functions $f : [0,1]^q \to \mathbb{R}$ such that $|D^{j} f(u)| \le M$ for all $|j| = l$, $l = 0, 1, \ldots, \lfloor\alpha\rfloor$, and

$$|D^{j} f(v) - D^{j} f(w)| \le M \|v - w\|^{\alpha'} \quad \text{for all } |j| = \lfloor\alpha\rfloor,$$

where $\alpha' = \alpha - \lfloor\alpha\rfloor$. Here and in what follows, $\|\cdot\|$ stands for the usual $\ell_2$ norm in $\mathbb{R}^q$.

In Brown, Levine and Wang (2014), the estimator of the nonparametric component $f$ was constructed in two stages. First, the vector coefficient $\beta$ and the intercept $a$ were estimated using the difference approach. Then, averaged residuals from that fit were smoothed using the multivariate Nadaraya-Watson kernel smoother to obtain an estimator of the unknown function $f$. For that estimator, Brown, Levine and Wang (2014) established a uniform upper bound on the risk at a point as well as on the $L_2[0,1]^q$ risk. More specifically, they proved that, for any Lipschitz exponent $\alpha > 0$ and any $U_0 \in [0,1]^q$, the estimator $\hat f$ satisfies

$$\sup_{f \in \Lambda^\alpha(M)} E\big[(\hat f(U_0) - f(U_0))^2\big] \le C n^{-2\alpha/(2\alpha+q)}$$

and

$$\sup_{f \in \Lambda^\alpha(M)} E\Big[\int_{[0,1]^q} (\hat f(U) - f(U))^2\,dU\Big] \le C n^{-2\alpha/(2\alpha+q)}$$

for a generic constant $C > 0$ that does not depend on $n$. The following result establishes the corresponding lower bounds on the risk of an arbitrary estimator, therefore proving that $n^{-\alpha/(2\alpha+q)}$ is the minimax rate of convergence for both distances. Compared to Theorem 2.3 of Brown, Levine and Wang (2014), some extra assumptions are needed.

Theorem 2.1. Let the $\varepsilon_{\mathbf{i}}$ be independent identically distributed random variables with zero mean and finite variance $\sigma^2$; moreover, assume that, for some small $\delta > 0$, $E|\varepsilon_{\mathbf{i}}|^{2+\delta} < \infty$.
Also, we assume that the marginal density function of the observations $Y_{\mathbf{i}}$,

$$p(y) = \int p_\varepsilon(y - x^\top \beta)\, p_X(x)\, dx,$$

satisfies the following assumption: there exist $p_* > 0$ and $v_0 > 0$ such that

$$\int p(u) \log \frac{p(u)}{p(u+v)}\, du \le p_* v^2 \quad \text{for all } |v| \le v_0. \qquad (2.2)$$

Then the convergence rate $n^{-\alpha/(2\alpha+q)}$ is optimal

1. on $(\Lambda^\alpha(M), d_0)$, where $d_0$ is the distance at a fixed point $U_0 \in [0,1]^q$, and
2. on $(\Lambda^\alpha(M), \|\cdot\|_2)$, where $\|\cdot\|_2$ is the $L_2$ distance on $[0,1]^q$.

Remark 2.2. Note that the condition (2.2) is fairly general; in particular, a standard Gaussian density satisfies it. For more information on this condition, see Tsybakov (2009).

Proof. 1. To obtain the minimax convergence rate in the first case, we use a two-point argument going all the way back to Ibragimov and Has'minskii (1977) and Has'minskii (1978). First, we need to define an appropriate pair of test functions. We use the optimal bandwidth $h_n = n^{-1/(2\alpha+q)}$ in each dimension $j = 1, \ldots, q$, and the bandwidth matrix $H_n = \mathrm{diag}\{h_n\} = h_n I_{q \times q}$, for the estimator of the function $f(U)$. Next, consider a non-negative function $K \in \Lambda^\alpha(\tfrac12)$ such that $K(U) > 0$ if and only if $\|U\| < \tfrac12$. As an example of such a function, we can select

$$K(U) = a \exp\Big(-\frac{1}{1 - 4\|U\|^2}\Big)\, I\big(\|U\| \le \tfrac12\big)$$

for a sufficiently small $a > 0$. Note that this function reaches its maximum at $U = 0$, and that this maximum equals $a \exp(-1)$. The pair of functions that we need for our problem is $f_{0,n}(U) \equiv 0$ and $f_{1,n}(U) = M h_n^\alpha K(H_n^{-1}(U - U_0)) = M h_n^\alpha K(h_n^{-1}(U - U_0))$.

The following three conditions must be checked for us to conclude that the lower bound is achieved at the convergence rate $n^{-\alpha/(2\alpha+q)}$. The first condition is that $f_{j,n} \in \Lambda^\alpha(M)$ for $j = 0, 1$ and sufficiently large $n$. For $f_{0,n} \equiv 0$ it is clearly satisfied immediately. Next, for any $l = 0, 1, \ldots, \lfloor\alpha\rfloor$ and any $j$ with $|j| = l$,

$$|D^{j} f_{1,n}(U)| = M h_n^{\alpha - l}\, \big|D^{j} K(h_n^{-1}(U - U_0))\big| \le M$$

for sufficiently large $n$, since $K \in \Lambda^\alpha(\tfrac12)$. Finally, since for any $j$ with $|j| = \lfloor\alpha\rfloor$

$$|D^{j} f_{1,n}(V) - D^{j} f_{1,n}(W)| = M h_n^{\alpha - \lfloor\alpha\rfloor}\, \big|D^{j} K(h_n^{-1}(V - U_0)) - D^{j} K(h_n^{-1}(W - U_0))\big| \le M \|V - W\|^{\alpha - \lfloor\alpha\rfloor}$$

for all sufficiently large $n$, we can say that $f_{1,n} \in \Lambda^\alpha(M)$.
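A quick numerical sanity check of a bump function of this type (the constant $a = 0.1$ and the dimension are arbitrary illustrative choices): it is supported on $\{\|U\| \le 1/2\}$, attains its maximum $a\exp(-1)$ at the origin, and so the perturbation $f_{1,n}$ has height of exact order $h_n^\alpha$ at $U_0$.

```python
import numpy as np

def K(u, a=0.1):
    """Bump function: a * exp(-1 / (1 - 4||u||^2)) for ||u|| < 1/2, and 0 otherwise."""
    u = np.atleast_2d(u)
    sq = 4.0 * np.sum(u * u, axis=1)
    out = np.zeros(u.shape[0])
    inside = sq < 1.0
    out[inside] = a * np.exp(-1.0 / (1.0 - sq[inside]))
    return out

# Maximum at the origin is a * exp(-1); the support is the ball of radius 1/2.
print(float(K(np.zeros((1, 2)))[0]))        # a * e^{-1} ≈ 0.0368
print(float(K(np.array([[0.5, 0.0]]))[0]))  # 0.0: vanishes off ||u|| < 1/2

# The two hypotheses: f0 = 0 and f1(U) = M * h^alpha * K((U - U0)/h).
M, alpha, h, U0 = 1.0, 1.5, 0.1, np.array([0.5, 0.5])
f1 = lambda U: M * h ** alpha * K((U - U0) / h)
print(float(f1(U0.reshape(1, 2))[0]) > 0)   # True: separation of order h^alpha at U0
```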

Next, we need to check that, for sufficiently large $n$, the distance between the two test functions satisfies $d(f_{1,n}(U_0), f_{0,n}(U_0)) \ge 2A n^{-\alpha/(2\alpha+q)}$. Indeed,

$$f_{1,n}(U_0) = M h_n^\alpha K(0) \ge 2A n^{-\alpha/(2\alpha+q)}$$

for, e.g., $A = \frac{M}{4}\exp(-1)$. Therefore, the distance between the two points of the Lipschitz ball $\Lambda^\alpha(M)$ that we selected is, indeed, of the right order $n^{-\alpha/(2\alpha+q)}$. Finally, let us define the two product densities $m_{0,n}$ and $m_{1,n}$ of the observations generated by (2.1) when the functional component equals $f_{0,n}$ and $f_{1,n}$, respectively. Using assumption (2.2), the Kullback distance between the two satisfies

$$K(m_{0,n}, m_{1,n}) = \sum_{\mathbf{i} \in N} \int \log \frac{p(y_{\mathbf{i}})}{p\big(y_{\mathbf{i}} - f_{1,n}(U_{\mathbf{i}})\big)}\, p(y_{\mathbf{i}})\, dy_{\mathbf{i}} \le p_* \sum_{\mathbf{i} \in N} f_{1,n}^2(U_{\mathbf{i}}) \le p_* M^2 \exp(-2)\, h_n^{2\alpha} \sum_{\mathbf{i} \in N} I\big(\|U_{\mathbf{i}} - U_0\| \le \tfrac{h_n}{2}\big) \le C' p_* M^2 \exp(-2)\, n h_n^{2\alpha+q} \le C$$

for sufficiently large $n$, since the number of grid points within distance $h_n/2$ of $U_0$ is of order $n h_n^q$ and $n h_n^{2\alpha+q} = 1$ by the choice of $h_n$. This establishes the optimality of the rate $n^{-\alpha/(2\alpha+q)}$.

2. Next, we establish the minimax rate of convergence for the $L_2[0,1]^q$ risk. First, recall the earlier result that, for any $\alpha > 0$,

$$\sup_{f \in \Lambda^\alpha(M)} E\Big[\int_{[0,1]^q} (\hat f(U) - f(U))^2\, dU\Big] \le C n^{-2\alpha/(2\alpha+q)}.$$

We will argue that $n^{-\alpha/(2\alpha+q)}$ is also the minimax rate under the $L_2[0,1]^q$ loss. In what follows, $\lceil x \rceil$ denotes the smallest integer larger than $x \in \mathbb{R}$. First, let us define $m = \lceil c_0 n^{q/(2\alpha+q)} \rceil$, where $c_0 > 0$ is some real number. As a second step, we choose the bandwidth $h_n = m^{-1/q}$. Our next purpose is to define a partition of $[0,1]^q$ into a set of disjoint subsets, together with a sequence of functions each taking non-zero values on just one of these subsets. Such a multivariate partition is obtained by splitting $[0,1]^q$ into $m$ cubes $\Delta_k$, $k = 1, \ldots, m$, of side length $h_n$, with centers $U_k = (u_k^1, \ldots, u_k^q) \in \mathbb{R}^q$; note that the entire $[0,1]^q = \bigcup_k \Delta_k$ and that $\Delta_k \cap \Delta_{k'} = \emptyset$ if $k \ne k'$, that is, the $\Delta_k$'s are disjoint.

The next step consists of selecting hypotheses based on a function $K(U) : [0,1]^q \to \mathbb{R}$ such that $K \in \Lambda^\alpha(\tfrac12)$ and $K(U) > 0$ if and only if $\|U\| < \tfrac12$. As before, we select $K(U) = \exp\big(-\frac{1}{1 - 4\|U\|^2}\big)\, I(\|U\| \le \tfrac12)$. Also, denote by $\|K\|_2$ the $L_2[0,1]^q$ norm of the function $K$. To simplify the notation, let the diagonal bandwidth matrix be $H = \mathrm{diag}\{h_n\} = h_n I_{q \times q}$. Now we can define a set of $m$ functions

$$\Phi_k(U) = M h_n^\alpha K\big(H^{-1}(U - U_k)\big), \quad k = 1, \ldots, m.$$

Finally, denote the set of all binary sequences of length $m$ by

$$\Omega = \big\{\omega = (\omega_1, \ldots, \omega_m),\ \omega_k \in \{0, 1\}\big\} = \{0, 1\}^m.$$

Then the test functions $f_{j,n}$, $j = 0, \ldots, J$, will be selected from the set of functions

$$\mathcal{E} = \Big\{ f_\omega(u) = \sum_{k=1}^m \omega_k \Phi_k(u),\ \omega \in \Omega \Big\}. \qquad (2.3)$$

The following three conditions must now be verified to ensure that $n^{-\alpha/(2\alpha+q)}$ is, indeed, the minimax rate.

(a) First, we need to show that the $L_2[0,1]^q$ distance between any two of the test functions is bounded below by a multiple of $n^{-\alpha/(2\alpha+q)}$. For any two functions $f_\omega, f_{\omega'} \in \mathcal{E}$, the $L_2[0,1]^q$ distance between them is

$$d(f_\omega, f_{\omega'}) = \Big[\int_{[0,1]^q} \big(f_\omega(U) - f_{\omega'}(U)\big)^2\, dU\Big]^{1/2} = \Big[\sum_{k=1}^m (\omega_k - \omega'_k)^2 \int_{\Delta_k} \Phi_k^2(U)\, dU\Big]^{1/2} = M h_n^{\alpha + q/2} \|K\|_2\, \sqrt{\rho(\omega, \omega')},$$

where $\rho(\omega, \omega') = \sum_{k=1}^m I(\omega_k \ne \omega'_k)$ is the Hamming distance between $\omega$ and $\omega'$. Our next step requires the use of the so-called Varshamov-Gilbert lemma, which has been known for a long time in information theory (see, e.g., Gilbert (1952), as well as Ibragimov and Has'minskii (1977) for pioneering examples of the use of the Varshamov-Gilbert lemma in a statistical context). The Varshamov-Gilbert lemma states that,

for any $m \ge 8$, there exists a subset $\{\omega^{(0)}, \ldots, \omega^{(J)}\}$ of $\Omega$ such that $\omega^{(0)} = (0, \ldots, 0)$ and $\rho(\omega^{(j)}, \omega^{(k)}) \ge \frac{m}{8}$ for any $0 \le j < k \le J$; moreover, $J \ge 2^{m/8}$. Note that it suffices to choose $\omega$ and $\omega'$ such that $\rho(\omega, \omega') \gtrsim h_n^{-q}$, which is equivalent to $\rho(\omega, \omega') \gtrsim m$. In other words, to show that the rate $n^{-\alpha/(2\alpha+q)}$ is, indeed, the minimax rate of convergence, we need to use a number of testing hypotheses $J$ that goes to infinity with $n$. It is now easy to verify that, for sufficiently large $n$,

$$d(f_\omega, f_{\omega'}) \ge M h_n^{\alpha + q/2} \|K\|_2\, \sqrt{\frac{m}{16}} = \frac{M}{4} \|K\|_2\, h_n^\alpha = \frac{M}{4} \|K\|_2\, n^{-\alpha/(2\alpha+q)}, \qquad (2.4)$$

and so the rate is correct.

(b) Clearly, each $\Phi_k(U) \in \Lambda^\alpha(M)$; since each $\omega_k \le 1$ and the functions $\Phi_k(U)$ have disjoint supports for different $k$, $f_\omega \in \Lambda^\alpha(M)$.

(c) Finally, we also need to verify that the average Kullback-Leibler distance between the null hypothesis and the others is bounded from above as $\frac{1}{J} \sum_{j=1}^J K(m_{0,n}, m_{j,n}) \le \gamma \log J$ for some $\gamma \in (0, \tfrac18)$. Indeed, proceeding as before in the case of the pointwise risk, one finds that

$$K(m_{0,n}, m_{j,n}) \le p_* \sum_{\mathbf{i} \in N} f_{j,n}^2(U_{\mathbf{i}}) \le p_* \sum_{k=1}^m \sum_{\mathbf{i} : U_{\mathbf{i}} \in \Delta_k} \Phi_k^2(U_{\mathbf{i}}) \le p_* M^2 K_{\max}^2\, h_n^{2\alpha} \sum_{k=1}^m \#\{\mathbf{i} : U_{\mathbf{i}} \in \Delta_k\} \le p_* M^2 K_{\max}^2\, n h_n^{2\alpha} = p_* M^2 K_{\max}^2\, c_0^{-(2\alpha+q)/q}\, m$$

for sufficiently large $n$, where $K_{\max} = \max_U K(U)$. Since the Varshamov-Gilbert result gives $m \le \frac{8 \log J}{\log 2}$, the claim is, indeed, true if we select

$$c_0 = \Big(\frac{8\, p_* M^2 K_{\max}^2}{\gamma \log 2}\Big)^{q/(2\alpha+q)}.$$

3. Optimal rates of convergence for the random design case

In the same way as in Wang, Brown and Cai (2011), our next step is to obtain minimax convergence rates in the case of random design. For convenience, we restate the assumptions for that case. Our model is again

$$Y_i = a + X_i^\top \beta + f(U_i) + \varepsilon_i \qquad (3.1)$$
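The Varshamov-Gilbert bound can be illustrated with the classical greedy construction: scan the binary words of length $m$ and keep every word lying at Hamming distance at least $m/8$ from all words kept so far. The value $m = 16$ below is an arbitrary small choice for demonstration.

```python
def hamming(x, y):
    # Hamming distance between two binary words stored as integers
    return bin(x ^ y).count("1")

def greedy_code(m, d, max_size):
    """Greedy Varshamov-Gilbert construction: keep every length-m binary word lying
    at Hamming distance >= d from all words kept so far, starting from omega^(0) = 0."""
    code = [0]
    for w in range(1, 2 ** m):
        if all(hamming(w, c) >= d for c in code):
            code.append(w)
            if len(code) >= max_size:
                break
    return code

m = 16
d = m // 8                       # required pairwise distance m/8 = 2
J = 2 ** (m // 8)                # the lemma guarantees at least J = 2^{m/8} hypotheses
code = greedy_code(m, d, max_size=J + 1)

print(len(code) >= J + 1)        # True: found omega^(0), ..., omega^(J)
print(min(hamming(x, y) for i, x in enumerate(code) for y in code[i + 1:]) >= d)  # True
```

The greedy argument is exactly why the bound holds: each kept word excludes only the Hamming ball of radius $d - 1$ around it, so at least $2^m / V(m, d-1)$ words survive.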

for $i = 1, \ldots, n$; we also assume that the $U_i$ are random variables on $[0,1]^q$ and that the pairs $(X_i, U_i) \in \mathbb{R}^p \times \mathbb{R}^q$ are independent with an unknown joint density $g(x, u)$. The random errors $\varepsilon_i$ are independent identically distributed with mean zero and variance $\sigma^2$, and are independent of $(X_i, U_i)$. Moreover, we assume that the conditional covariance matrix

$$\Sigma = E\big[(X_1 - E(X_1 \mid U_1))(X_1 - E(X_1 \mid U_1))^\top\big]$$

is nonsingular. As in any linear regression model, $\beta \in \mathbb{R}^p$ is a vector of coefficients. For $U$ with marginal density $g(u)$, we also need to assume that $E(f(U)) = \int f(u)\, g(u)\, du = 0$ to ensure identifiability of the model (3.1). Finally, an individual coordinate of the vector $X_i$ is denoted $X_i^l$, $l = 1, \ldots, p$, and an individual coordinate of the random vector $U$ is denoted $U^r$, $r = 1, \ldots, q$. Note that, unlike in the fixed design case, the indices $i$ here are univariate.

As a first step, we obtain least squares estimates of the coefficient vector, $\hat\beta$, and of the intercept, $\hat a$. The asymptotic normality and efficiency of the estimator $\hat\beta$ in the random design case were established in Theorem 3.4 of Brown, Levine and Wang (2014). To estimate the function $f(U)$, we apply a multivariate kernel smoother to the residuals $r_i = Y_i - \hat a - X_i^\top \hat\beta$. In the case $q = 1$, Wang, Brown and Cai (2011) used the Gasser-Müller kernel smoother; however, the generalization of Gasser-Müller kernels to the multivariate case is not very convenient, and a better alternative is the Nadaraya-Watson smoother based on the multivariate product kernel. More specifically, let $K(U^r)$ be a univariate kernel function for a specific coordinate $U^r$, $r = 1, \ldots, q$, satisfying $\int K(u^r)\, du^r = 1$ and having $\lfloor\alpha\rfloor$ vanishing moments. We choose the asymptotically optimal bandwidth $h_n = n^{-1/(2\alpha+q)}$ (see, for example, Fan and Gijbels (1995)).
The rescaled version of this kernel is $K_h(U^r) = h^{-1} K(h^{-1} U^r)$, so that the $q$-dimensional rescaled product kernel is $K_h(U) = h^{-q} \prod_{r=1}^q K(h^{-1} U^r)$. Then the Nadaraya-Watson estimator of $f(U)$ is

$$\hat f_n(U) = \sum_{i=1}^n W_{i,h}(U - U_i)\, r_i,$$

where the weights are

$$W_{i,h}(U - U_i) = \frac{K_h(U - U_i)}{\sum_{i'=1}^n K_h(U - U_{i'})}.$$

We stress the dependence of this estimator on the sample size $n$ by using it as a subscript. To make the notation shorter, we will also use $\|\cdot\|_2$ to denote the $L_2[0,1]^q$ norm and $\|\cdot\|_2^2$ the squared norm in the same space. As a first step, we need to establish the analogue of Theorem 2.3 of Brown, Levine and Wang (2014) in the case of random design. Since the proof of that result is almost completely analogous, we omit it and only state the final result.

Theorem 3.3. Under the assumptions of Theorem 3.4 of Brown, Levine and Wang (2014), for any Lipschitz exponent $\alpha > 0$ and any $U_0 \in [0,1]^q$, the estimator $\hat f_n$ satisfies

$$\sup_{f \in \Lambda^\alpha(M)} E\big[(\hat f_n(U_0) - f(U_0))^2\big] \le C n^{-2\alpha/(2\alpha+q)}$$

for a constant $C > 0$. Also, for any $\alpha > 0$,

$$\sup_{f \in \Lambda^\alpha(M)} E\Big[\int_{[0,1]^q} (\hat f_n(U) - f(U))^2\, dU\Big] \le C n^{-2\alpha/(2\alpha+q)}.$$

Theorem 3.3 establishes upper bounds on the rate of convergence for the distance at a point and for the $L_2[0,1]^q$ distance. In order to obtain the optimality of this convergence rate, we need to match these upper bounds with lower bounds. This is done in the following main result of this section.

Theorem 3.4. 1. Let $T_n$ be an arbitrary estimator of the function $f$. Under the assumptions of Theorem 3.4 of Brown, Levine and Wang (2014), for any Lipschitz exponent $\alpha > 0$ and any $U_0 \in [0,1]^q$, the following holds:

$$\liminf_{n \to \infty} \inf_{T_n} \sup_{f \in \Lambda^\alpha(M)} E\big[n^{2\alpha/(2\alpha+q)}\, (T_n(U_0) - f(U_0))^2\big] \ge c_1,$$

where $c_1$ is a constant that does not depend on $n$. In other words, $n^{-\alpha/(2\alpha+q)}$ is the optimal (minimax) rate of convergence when estimating the function $f(U)$ at a given fixed point $U_0$.

2. Again, let $T_n$ be an arbitrary estimator of the function $f$. Under the assumptions of Theorem 3.4 of Brown, Levine and Wang (2014), for any Lipschitz exponent $\alpha > 0$,

$$\liminf_{n \to \infty} \inf_{T_n} \sup_{f \in \Lambda^\alpha(M)} E\Big[n^{2\alpha/(2\alpha+q)} \int_{[0,1]^q} (T_n(U) - f(U))^2\, dU\Big] \ge c_2,$$

where $c_2$ is a constant that does not depend on $n$.

Proof. 1. As a first step, we consider the case of estimation at a point $U_0 \in [0,1]^q$. The proposed minimax rate is $\psi_n = n^{-\alpha/(2\alpha+q)}$; the subscript $n$ stresses its dependence on the sample size. We also define two test functions, $f_{0,n}(U) \equiv 0$ and $f_{1,n}(U) = M h_n^\alpha K(h_n^{-1}(U - U_0))$, which are exactly the same as those used in the proof of the first part of Theorem 2.1. Recall that the distance at the point $U_0$ between these functions satisfies $f_{1,n}(U_0) \ge 2A\psi_n$, where $\psi_n = n^{-\alpha/(2\alpha+q)}$ and $A = \frac{M}{4}\exp(-1)$. Denote by $E_{U_1,\ldots,U_n}$ the expectation with respect to the joint distribution of $U_1, \ldots, U_n$. Also, denote by $\Psi$ a test that, for a given set of data, selects either $f_{0,n}$ or $f_{1,n}$. Then, using Chebyshev's inequality, we have

$$\sup_{f \in \Lambda^\alpha(M)} E\big[\psi_n^{-2}(T_n(U_0) - f(U_0))^2\big] \ge A^2 \max_{f \in \{f_{0,n}, f_{1,n}\}} P\big(|T_n(U_0) - f(U_0)| \ge A\psi_n\big)$$
$$\ge A^2\, E_{U_1,\ldots,U_n}\Big[\frac12 \sum_{j=0}^1 P\big(|T_n(U_0) - f_{j,n}(U_0)| \ge A\psi_n \mid U_1, \ldots, U_n\big)\Big]$$
$$\ge A^2\, E_{U_1,\ldots,U_n}\Big[\inf_\Psi \frac12 \sum_{j=0}^1 P\big(\Psi \ne j \mid U_1, \ldots, U_n\big)\Big],$$

where the last inequality follows from the triangle inequality and the fact that the distance between the two functions at the point $U_0$ is greater than or equal to $2A\psi_n$. For fixed $U_1, \ldots, U_n$, the Kullback distance between the two product densities $m_{0,n}$ and $m_{1,n}$ associated with the functions $f_{0,n}$ and $f_{1,n}$ satisfies $K(P_0, P_1) \le C$ for the same finite $C$ that was obtained in the proof of Theorem 2.1. Thus, the lower bound becomes

$$p_{e,1} = \inf_\Psi \frac12 \sum_{j=0}^1 P\big(\Psi \ne j \mid U_1, \ldots, U_n\big) \ge \max\Big(\frac14 \exp(-C),\ \frac{1 - \sqrt{C/2}}{2}\Big)$$

according to Theorem 2.2 of Tsybakov (2009), which finishes the proof.

2. Next, we consider the $L_2[0,1]^q$ case. Here, yet again, we need not just two but $J + 1$ test functions, with $J \to \infty$ as $n \to \infty$. The needed test functions $f_{j,n}$, $j = 0, \ldots, J$, are again selected from the function set (2.3), as before in the case of fixed design. Earlier, we proved that the $L_2$ distance between any two functions from this set satisfies

$$\|f_{j,n} - f_{k,n}\|_2 \ge \frac{M}{4} \|K\|_2\, n^{-\alpha/(2\alpha+q)}$$

for any $0 \le j < k \le J$. The basic idea is to bound the $L_2[0,1]^q$ risk from below by the average probability of error, rather than by the minimax probability of error. This approach uses the so-called Fano's lemma, originally borrowed from information theory; see, for example, Fano (1952). Again using Chebyshev's inequality, we obtain

$$\sup_{f \in \Lambda^\alpha(M)} E\big[n^{2\alpha/(2\alpha+q)} \|\hat f_n - f\|_2^2\big] \ge A^2 \max_{f \in \{f_{0,n}, \ldots, f_{J,n}\}} P_f\big(\|\hat f_n - f\|_2 \ge A\psi_n\big)$$
$$\ge A^2\, E_{U_1,\ldots,U_n}\Big[\frac{1}{J+1} \sum_{j=0}^J P_j\big(\|\hat f_n - f_{j,n}\|_2 \ge A\psi_n \mid U_1, \ldots, U_n\big)\Big].$$

Next, let $\Psi$ be a test that, based on the data, makes a choice between the $J + 1$ hypotheses considered. Then the above result implies that

$$\sup_{f \in \Lambda^\alpha(M)} E\big[n^{2\alpha/(2\alpha+q)} \|\hat f_n - f\|_2^2\big] \ge A^2\, E_{U_1,\ldots,U_n}\Big[\inf_\Psi \frac{1}{J+1} \sum_{j=0}^J P_j\big(\Psi \ne j \mid U_1, \ldots, U_n\big)\Big].$$

While proving Theorem 2.1, we showed that $\frac{1}{J} \sum_{j=1}^J K(m_{0,n}, m_{j,n}) \le \gamma \log J$ for some $\gamma \in (0, \tfrac18)$. A direct consequence of Fano's lemma (for details see, for example, Corollary 2.6 in Tsybakov (2009)) is that the minimum average probability of error is bounded from below as

$$\inf_\Psi \frac{1}{J+1} \sum_{j=0}^J P_j\big(\Psi \ne j \mid U_1, \ldots, U_n\big) \ge \frac{\log(J+1) - \log 2}{\log J} - \gamma. \qquad (3.2)$$

The right-hand side of the last inequality does not depend on $U_1, \ldots, U_n$, and so the desired result is obtained.
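The product-kernel Nadaraya-Watson smoother $\hat f_n$ used throughout this section can be sketched as follows. The Gaussian univariate kernel is an illustrative stand-in for a kernel with the required vanishing moments, and the specific $n$, $q$, and $\alpha$ are assumptions for the demo.

```python
import numpy as np

def nw_estimate(u0, U, r, h):
    """Nadaraya-Watson estimate: f_hat(u0) = sum_i W_{i,h}(u0 - U_i) r_i, where
    K_h(u) = h^{-q} prod_r K(u^r / h) and the weights W_{i,h} sum to one."""
    q = U.shape[1]
    diff = (u0[None, :] - U) / h
    # Gaussian product kernel (illustrative choice of the univariate kernel K).
    Kh = np.exp(-0.5 * np.sum(diff ** 2, axis=1)) / (np.sqrt(2.0 * np.pi) * h) ** q
    W = Kh / Kh.sum()
    return float(W @ r)

rng = np.random.default_rng(1)
n, q, alpha = 500, 2, 2.0
h = n ** (-1.0 / (2 * alpha + q))        # asymptotically optimal bandwidth n^{-1/(2a+q)}
U = rng.uniform(size=(n, q))             # random design on [0, 1]^q
r = np.full(n, 3.0)                      # residuals r_i; a constant field here

# Since the weights sum to one, a constant residual field is reproduced exactly.
print(round(nw_estimate(np.array([0.5, 0.5]), U, r, h), 10))   # 3.0
```

Reproducing constants exactly is a convenient sanity check of the weight normalization; on real residuals the same call returns a local weighted average around the evaluation point.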

4. Acknowledgements

Michael Levine's research has been partially supported by the NSF grant DMS #1208994.

References

Brown, L.D., Levine, M., and Wang, L. (2014). A semiparametric multivariate partially linear model: a difference approach. Technical Report, Department of Statistics, Purdue University. Available at http://www.stat.purdue.edu/research/technical reports/2014-tr.html

Engle, R.F., Granger, C.W., Rice, J., and Weiss, A. (1986). Semiparametric estimates of the relation between weather and electricity sales. Journal of the American Statistical Association, 81(394), 310-320.

Fan, J., and Gijbels, I. (1995). Local Polynomial Modelling and Its Applications. Monographs on Statistics and Applied Probability. Chapman & Hall, London.

Fano, R.M. (1952). Class Notes for Transmission of Information, Course 6.574. MIT, Cambridge, Massachusetts.

Gilbert, E.N. (1952). A comparison of signalling alphabets. Bell System Technical Journal, 31, 504-522.

Has'minskii, R.Z. (1978). A lower bound on the risks of nonparametric estimates of densities in the uniform metric. Theory of Probability and Its Applications, 23, 794-798.

Ibragimov, I.A., and Has'minskii, R.Z. (1977). On the estimation of an infinite-dimensional parameter in Gaussian white noise. Soviet Mathematics Doklady, 18, 1307-1309.

Tsybakov, A.B. (2009). Introduction to Nonparametric Estimation. Springer, New York.

Wang, L., Brown, L.D., and Cai, T.T. (2011). A difference based approach to the semiparametric partial linear model. Electronic Journal of Statistics, 5, 619-641.