Kernel Density Based Linear Regression Estimate

Weixin Yao and Zhibiao Zhao

Abstract

For linear regression models with non-normally distributed errors, the least squares estimate (LSE) will lose some efficiency compared to the maximum likelihood estimate (MLE). In this article, we propose a kernel density based regression estimate (KDRE) that is adaptive to the unknown error distribution. The key idea is to approximate the likelihood function by a nonparametric kernel density estimate of the error density based on some initial parameter estimate. The proposed estimate is shown to be asymptotically as efficient as the oracle MLE, which assumes the error density were known. In addition, we propose an EM type algorithm to maximize the estimated likelihood function and show that the KDRE can be considered as an iterated weighted least squares estimate, which provides some insight into the adaptiveness of the KDRE to the unknown error distribution. Our Monte Carlo simulation studies show that, while comparable to the traditional LSE for normal errors, the proposed estimation procedure can have substantial efficiency gains for non-normal errors. Moreover, the efficiency gain can be achieved even for a small sample size.

Key words: EM algorithm, Kernel density estimate, Least squares estimate, Linear regression, Maximum likelihood estimate.

Department of Statistics, Kansas State University, Manhattan, KS 66506. Email: wxyao@ksu.edu
Department of Statistics, The Pennsylvania State University, University Park, PA 16802. Email: zuz13@stat.psu.edu

1 Introduction

Linear regression models are widely used to investigate the relationship between several variables. Suppose (x_1, y_1), ..., (x_n, y_n) are sampled from the regression model

y = x^T β + ϵ,   (1.1)

where x is a p-dimensional vector of covariates independent of the error ϵ with E(ϵ) = 0. The well-known least squares estimate (LSE) of β is

\tilde{β} = \arg\min_{β} \sum_{i=1}^{n} (y_i - x_i^T β)^2.   (1.2)

For normally distributed errors, \tilde{β} is exactly the maximum likelihood estimate (MLE). However, \tilde{β} will lose some efficiency when the error is not normally distributed. Therefore, it is desirable to have an estimate that can adapt to the unknown error distribution.

The idea of adaptiveness is not new. Beran (1974) and Stone (1975) considered adaptive estimation for the location model. Bickel (1982), Schick (1993), Yuan and De Gooijer (2007), and Yuan (2010) extended the adaptive idea to regression and some other models. Linton and Xiao (2007) further applied the adaptive idea to nonparametric regression, and Wang and Yao (2012) applied it to dimension reduction. Empirical likelihood techniques (Owen, 1988, 2001) have also been used in regression problems to adaptively construct confidence intervals and test statistics without any parametric assumption on the error density. However, empirical likelihood regression cannot provide efficient point estimates of the regression parameters by adaptively using the unknown error density information.

In this article, we propose an adaptive kernel density based regression estimate (KDRE). The basic idea is to estimate the error density by a kernel density estimate based on some initial parameter estimate and then estimate the regression parameters by maximizing the estimated likelihood function. Our estimation procedure uses a kernel error idea similar to that of Stone (1975) and Linton and Xiao (2007) to gain adaptiveness based on some initial consistent estimate.

However, while Linton and Xiao (2007) mainly deal with nonparametric regression, the current paper deals with parametric regression. We prove that the proposed estimate is asymptotically as efficient as the oracle MLE, which assumes the error density were known. Therefore, our proposed estimate can adapt to different error distributions. In addition, we propose a novel EM algorithm to maximize the estimated likelihood function and show that the KDRE can be viewed as an iterated weighted least squares estimate, which provides some insight into why the KDRE can adapt to the unknown error distribution. To examine the finite sample performance, we conduct a Monte Carlo simulation study based on a wide range of error densities, including heavy-tail, multi-modal, and skewed error densities. Our simulation study confirms our theoretical findings. Our main claims are as follows.

1. The KDRE is comparable to the traditional LSE when the error is normal.

2. The KDRE is more efficient than the LSE when the error is not normal. The efficiency gain can be substantial even for a small sample size.

The remainder of this paper is organized as follows. In Section 2, we introduce the new estimation procedure and prove its asymptotic oracle property. In addition, an EM type algorithm is introduced to maximize the estimated likelihood function. Numerical comparisons are conducted in Section 3. Summary and discussion are given in Section 4. Technical proofs are gathered in the Appendix.

2 Kernel Density Based Regression Estimate

2.1 The new estimation method

Let f(t) be the marginal density of ϵ in (1.1). If f(t) is known, instead of using the LSE, we can better estimate β in (1.1) by maximizing the log-likelihood

\sum_{i=1}^{n} \log f(y_i - x_i^T β).   (2.1)

In practice, however, f(t) is often unknown and thus (2.1) is not directly applicable. To attenuate this, denote by \tilde{β} an initial estimate of β, such as the LSE in (1.2).

Based on the residuals \tilde{ϵ}_i = y_i - x_i^T \tilde{β}, we can estimate f(t) by the kernel density estimate, denoted by \tilde{f}(t),

\tilde{f}(t) = \frac{1}{n} \sum_{j=1}^{n} K_h(t - \tilde{ϵ}_j),   (2.2)

where K_h(t) = K(t/h)/h, K(·) is a kernel density, and h is the tuning parameter. In this article, we use the Gaussian kernel for K(·). Replacing f(·) in (2.1) with \tilde{f}(·), we then propose the kernel density based regression parameter estimate (KDRE) as

\hat{β} = \arg\max_{β} Q(β),   (2.3)

where Q(β) is the estimated likelihood function

Q(β) = \sum_{i=1}^{n} \log \tilde{f}(y_i - x_i^T β) = \sum_{i=1}^{n} \log \Big\{ \frac{1}{n} \sum_{j \ne i} K_h(y_i - x_i^T β - \tilde{ϵ}_j) \Big\}.   (2.4)

Here we use the leave-one-out kernel density estimate for \tilde{f}(ϵ_i) to remove the estimation bias; see also Yuan and De Gooijer (2007) and Linton and Xiao (2007). The above estimation procedure can easily be extended to nonlinear regression by replacing x_i^T β in (2.4) with the assumed nonlinear function.
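To fix ideas, the following minimal Python sketch (not from the paper; the function name kdre_objective and its interface are illustrative assumptions) evaluates Q(β) in (2.4) with a Gaussian kernel and the leave-one-out rule. The KDRE then maximizes this function over β, for example with a general-purpose optimizer started at the LSE.

```python
import numpy as np

def kdre_objective(beta, X, y, resid_init, h):
    """Estimated log-likelihood Q(beta) of (2.4), using the Gaussian kernel
    K_h(t) = exp(-t^2 / (2 h^2)) / (h * sqrt(2 pi)) and the leave-one-out rule."""
    e = y - X @ beta                              # current residuals y_i - x_i^T beta
    diff = e[:, None] - resid_init[None, :]       # (y_i - x_i^T beta) - tilde_eps_j
    kern = np.exp(-0.5 * (diff / h) ** 2) / (h * np.sqrt(2.0 * np.pi))
    np.fill_diagonal(kern, 0.0)                   # leave-one-out: drop the j = i term
    dens = kern.sum(axis=1) / len(y)              # tilde_f(y_i - x_i^T beta)
    return np.log(dens).sum()
```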

2.2 Asymptotic result

Let β_0 be the true value of β. Then we have the following asymptotic oracle result for our proposed estimate \hat{β}.

Theorem 2.1. Assume that Conditions C1-C5 in the Appendix hold. As n → ∞,

\sqrt{n}(\hat{β} - β_0) →_d N(0, V^{-1}),   (2.5)

where ϵ = y - x^T β_0 and

V = I_{β_0} M, \quad M = \lim_{n→∞} \frac{1}{n} \sum_{i=1}^{n} x_i x_i^T = E(x x^T), \quad I_{β_0} = E\Big\{ \frac{f'(ϵ)^2}{f(ϵ)^2} \Big\}.   (2.6)

Remark 1: By the above theorem, the proposed estimate \hat{β} in (2.4) has root-n convergence rate and its asymptotic distribution does not depend on the kernel K(·) or the bandwidth h, although a kernel density estimator with a slower convergence rate is involved. In addition, \hat{β} has the same asymptotic variance as that of the infeasible oracle MLE, which assumes f(·) were known.

Remark 2: In (2.4), if we replace the objective function \log \tilde{f}(·) by another objective function, say ρ(·) with E\{ρ'(ϵ)\} = 0 (the LSE corresponds to ρ(ϵ) = ϵ^2), then the resulting estimate has limiting variance

v_ρ = \frac{E\{ρ'(ϵ)^2\}}{[E\{ρ''(ϵ)\}]^2} M^{-1}.

Based on the classical Cramér-Rao type inequality

\frac{E\{ρ'(ϵ)^2\}}{[E\{ρ''(ϵ)\}]^2} \ge I_{β_0}^{-1},

we have v_ρ \ge [I_{β_0} M]^{-1}. Therefore, the objective function we use in (2.4) is optimal in the sense that the proposed estimate is asymptotically efficient.

Remark 3: Our proposed method can also be applied to nonlinear regression models, and similar oracle properties can be established as in Theorem 2.1.

Remark 4: Yuan and De Gooijer (2007) proposed estimating β by maximizing

\sum_{i=1}^{n} \log \Big[ \frac{1}{n} \sum_{j \ne i} K_h\big\{ r(y_i - x_i^T β) - r(y_j - x_j^T β) \big\} \Big],   (2.7)

where r(·) is some monotone nonlinear function, such as r(z) = e^z/(1 + e^z). Here r(·) is used to avoid the cancelation of the intercept term in β. Note that the asymptotic variance in (2.5) is the same as that in Yuan and De Gooijer (2007) with r(t) = t, which is efficient. One main advantage of their method is that it does not require an initial estimate. However, the asymptotic variance of their estimator depends on the choice of r(·) and generally does not reach the Cramér-Rao lower bound [I_{β_0} M]^{-1} for a nonlinear r(·).

Note that when r(t) = t in (2.7), although the intercept term, denoted by β_0, will be canceled, the slope parameter, denoted by β_1, will remain estimable.

Let \breve{β}_1 be its estimate. In (2.5), let

V^{-1} = \begin{pmatrix} V_{11} & V_{12} \\ V_{21} & V_{22} \end{pmatrix},

where V_{11} is a scalar. Based on the results of Yuan and De Gooijer (2007), we know that \breve{β}_1 is still an efficient estimate and has the asymptotic distribution

\sqrt{n}(\breve{β}_1 - β_1) →_d N(0, V_{22}).

Write x = (1, z^T)^T, where z collects the non-intercept covariates. Based on the slope estimate \breve{β}_1, we can simply estimate β_0 by \breve{β}_0 = \bar{Y} - \bar{z}^T \breve{β}_1. Note that \breve{β}_0 can be considered as an LSE for the model y_i - z_i^T \breve{β}_1 = β_0 + ϵ_i after we fix β_1 at \breve{β}_1. Denote by KDRE1 the resulting estimate (\breve{β}_0, \breve{β}_1). Based on some standard calculations (a sketch of the proof is given at the end of the Appendix), we can obtain the asymptotic distribution of \breve{β}_0:

\sqrt{n}(\breve{β}_0 - β_0) →_d N(0, σ^2),

where

σ^2 = var\Big[ ϵ_i - \frac{f'(ϵ_i)}{f(ϵ_i)} \big\{ E(z^T) V_{21} + E(z^T) V_{22} z_i \big\} \Big].

Note that generally \breve{β}_0 does not reach the Cramér-Rao lower bound, and the efficiency loss depends on the true error density f(ϵ). However, one nice feature of this estimate is that it does not require an initial estimate. In addition, it does not require choosing a nonlinear function r(·).

2.3 Computations: an EM algorithm

Note that the objective function (2.4) has a mixture form. In this section, we propose an EM algorithm to maximize it. The proposed EM algorithm can be similarly used to find \breve{β}_1 by maximizing (2.7) with r(t) = t. Let β^{(0)} be an initial parameter estimate, such as the LSE. We then update the parameter estimate according to the algorithm below.

Algorithm 2.1. At the (k + 1)th step, we calculate the following E and M steps:

E-step: Calculate the classification probabilities

p_{ij}^{(k+1)} = \frac{K_h(y_i - x_i^T β^{(k)} - \tilde{ϵ}_j)}{\sum_{l \ne i} K_h(y_i - x_i^T β^{(k)} - \tilde{ϵ}_l)} \propto K_h(y_i - x_i^T β^{(k)} - \tilde{ϵ}_j), \quad j \ne i.   (2.8)

M-step: Update β^{(k+1)} by

β^{(k+1)} = \arg\max_{β} \sum_{i=1}^{n} \sum_{j \ne i} p_{ij}^{(k+1)} \log\big\{ K_h(y_i - x_i^T β - \tilde{ϵ}_j) \big\} = \arg\min_{β} \sum_{i=1}^{n} \sum_{j \ne i} p_{ij}^{(k+1)} (y_i - x_i^T β - \tilde{ϵ}_j)^2,   (2.9)

which has an explicit solution, since K_h(·) is a Gaussian kernel density.

From the M-step (2.9), the KDRE can be considered as a weighted least squares estimate, which minimizes the weighted squared difference between the new residual y_i - x_i^T β and the initial residual \tilde{ϵ}_j for all 1 ≤ i ≠ j ≤ n. Based on the weights in (2.8), one can see that if the jth observation is an isolated outlier (i.e., |\tilde{ϵ}_j| is large), then the weights p_{ij}^{(k+1)} will be small for i ≠ j, and thus the effect of \tilde{ϵ}_j on updating β^{(k+1)} will also be small.

By Theorem 2.2 below, Algorithm 2.1 is truly an EM algorithm and has the monotone ascent property for the objective function (2.4).

Theorem 2.2. The objective function (2.4) is non-decreasing after each iteration of Algorithm 2.1, i.e., Q(β^{(k+1)}) ≥ Q(β^{(k)}), until a fixed point is reached.
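The weighted least squares form of the M-step can be made concrete with the following minimal Python sketch of Algorithm 2.1 (not the authors' code; the name kdre_em, the convergence tolerance, and the iteration cap are assumptions made for illustration). Since the weights in (2.8) sum to one over j ≠ i, with a Gaussian kernel the minimizer of (2.9) is the ordinary least squares coefficient of y_i - \sum_{j \ne i} p_{ij}^{(k+1)} \tilde{ϵ}_j on x_i.

```python
import numpy as np

def kdre_em(X, y, h=None, max_iter=200, tol=1e-8):
    """Sketch of the KDRE EM iteration (2.8)-(2.9) with a Gaussian kernel.
    X is the n x p design matrix (include a column of ones for an intercept)."""
    n = len(y)
    beta = np.linalg.lstsq(X, y, rcond=None)[0]        # initial estimate: the LSE
    resid_init = y - X @ beta                          # initial residuals tilde_eps_j
    if h is None:                                      # rule-of-thumb bandwidth of Section 3
        h = 1.06 * n ** (-0.2) * resid_init.std(ddof=1)
    XtX_inv_Xt = np.linalg.solve(X.T @ X, X.T)         # (X^T X)^{-1} X^T, reused each M-step
    for _ in range(max_iter):
        # E-step (2.8): p_ij proportional to K_h(y_i - x_i^T beta^(k) - tilde_eps_j), j != i
        logk = -0.5 * (((y - X @ beta)[:, None] - resid_init[None, :]) / h) ** 2
        np.fill_diagonal(logk, -np.inf)                # leave-one-out
        p = np.exp(logk - logk.max(axis=1, keepdims=True))
        p /= p.sum(axis=1, keepdims=True)
        # M-step (2.9): weighted least squares; because the weights sum to one over j,
        # it reduces to an OLS fit on the adjusted responses y_i - sum_j p_ij * tilde_eps_j
        beta_new = XtX_inv_Xt @ (y - p @ resid_init)
        if np.max(np.abs(beta_new - beta)) < tol:
            return beta_new
        beta = beta_new
    return beta
```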

3 Simulation Studies

In this section, we use a simulation study to compare the proposed KDRE and KDRE1 with the traditional LSE for linear regression models with different types of error densities. For the proposed estimate, we use the rule-of-thumb bandwidth h = 1.06 n^{-1/5} \hat{σ} for the kernel density estimate of f(ϵ), where \hat{σ} is the sample standard deviation of the initial residuals \tilde{ϵ}_i = y_i - x_i^T \tilde{β} and \tilde{β} is the LSE. Better estimates might be obtained with a more sophisticated bandwidth for the kernel density estimate; see, for example, Sheather and Jones (1991) and Raykar and Duraiswami (2006). In addition, we could also use a cross-validation method to select the bandwidth, which focuses directly on the performance of the regression estimate instead of the density estimate.

We generate independent and identically distributed data {(x_i, y_i), i = 1, ..., n} from the model Y = 1 + 3X + ϵ, where X ~ U(0, 1), the uniform distribution on [0, 1]. For the error density, we consider the following six choices (all have standard deviation around 1):

Case 1: ϵ ~ N(0, 1), normal error.

Case 2: ϵ ~ U(-2, 2), the uniform distribution on [-2, 2], short-tail error.

Case 3: ϵ ~ t_3/\sqrt{3}, where t_3 is the t-distribution with 3 degrees of freedom, heavy-tail error.

Case 4: ϵ ~ 0.95 N(0, 0.7^2) + 0.05 N(0, 3.5^2), contaminated normal error. The 5% of data from N(0, 3.5^2) are most likely to be outliers.

Case 5: ϵ ~ 0.5 N(-1, 0.5^2) + 0.5 N(1, 0.5^2), multi-modal error.

Case 6: ϵ ~ 0.3 N(-1.4, 1) + 0.7 N(0.6, 0.4^2), skewed error.

Here, Case 6 is also used to check how our method performs compared with the LSE when the error is not symmetric. We estimate the regression parameters using KDRE, KDRE1, and the traditional LSE. Based on 1000 replicates, Tables 1 and 2 report the mean squared errors (MSE) of the parameter estimates for the intercept and the slope, respectively, for sample sizes n = 30, 100, 300, and 600. The rightmost two columns contain the relative efficiency of KDRE and KDRE1 compared to the LSE; for example, RE(KDRE) = MSE(LSE)/MSE(KDRE).

From Cases 2 to 6 in Tables 1 and 2, we can see that KDRE and KDRE1 are much more efficient than the LSE when the error is not normal (for both symmetric and skewed error densities). Moreover, the efficiency gain can be substantial even for a small sample size. In addition, when the error is normal, KDRE is comparable to the LSE and works better than KDRE1, especially for small sample sizes. However, for the Case 6 skewed error density, KDRE1 works better than KDRE, although both of them perform much better than the LSE. In addition, for large sample sizes, the performances of KDRE and KDRE1 are almost the same, even for the intercept estimate, although KDRE has some theoretical advantage over KDRE1. Note that KDRE1 is simpler in that it does not first estimate the error data.
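For concreteness, the simulation design above can be generated as in the following sketch (again not the authors' code; the helper names draw_error and simulate are hypothetical). For example, one replicate of the contaminated-normal case with n = 100 is obtained as X, y = simulate(4, 100, rng); the entries of Tables 1 and 2 average squared estimation errors over 1000 such replicates.

```python
import numpy as np

rng = np.random.default_rng(2024)                    # seed chosen arbitrarily

def draw_error(case, n, rng):
    """Draw n errors from one of the six error densities of Section 3."""
    if case == 1:                                    # standard normal
        return rng.standard_normal(n)
    if case == 2:                                    # short-tail: uniform on [-2, 2]
        return rng.uniform(-2.0, 2.0, n)
    if case == 3:                                    # heavy-tail: t_3 / sqrt(3)
        return rng.standard_t(3, n) / np.sqrt(3.0)
    if case == 4:                                    # contaminated normal
        scale = np.where(rng.random(n) < 0.95, 0.7, 3.5)
        return scale * rng.standard_normal(n)
    if case == 5:                                    # multi-modal two-component mixture
        center = np.where(rng.random(n) < 0.5, -1.0, 1.0)
        return center + 0.5 * rng.standard_normal(n)
    if case == 6:                                    # skewed two-component mixture
        left = rng.random(n) < 0.3
        return np.where(left, rng.normal(-1.4, 1.0, n), rng.normal(0.6, 0.4, n))
    raise ValueError("case must be 1, ..., 6")

def simulate(case, n, rng):
    """One data set from Y = 1 + 3 X + eps with X ~ U(0, 1)."""
    x = rng.uniform(0.0, 1.0, n)
    y = 1.0 + 3.0 * x + draw_error(case, n, rng)
    return np.column_stack([np.ones(n), x]), y       # design matrix with intercept, response
```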

4 Summary

In this article, we proposed an adaptive linear regression estimate obtained by maximizing an estimated likelihood function in which the error density is estimated by a kernel density estimate. The proposed estimate can adapt to the unknown error density and is asymptotically equivalent to the oracle MLE. Using the proposed EM algorithm, the computation is quick and stable. Our extensive simulation studies show that the proposed method outperforms the LSE in the presence of non-normal errors.

Although developed for linear regression models, the same idea can easily be extended to nonlinear regression, and the asymptotic oracle property follows similarly. In addition, our proposed EM algorithm can also be used to estimate the adaptive nonparametric regression of Linton and Xiao (2007) and the semiparametric regression of Yuan and De Gooijer (2007). Future research directions include extensions to other regression models such as varying coefficient partially linear models and nonparametric additive models.

5 Appendix: Proofs

The following conditions are imposed to facilitate the proofs.

C1. {ϵ_i} and {x_i} are i.i.d. and mutually independent with E(ϵ_i) = 0 and E(|ϵ_i|^3) < ∞. Additionally, the predictors x_i have bounded support.

C2. The density f(·) of ϵ is symmetric about 0 and has bounded continuous derivatives up to order 4. Let l(ϵ) = \log f(ϵ). Assume E\{l'(ϵ)^2 + |l''(ϵ)| + |l'''(ϵ)|\} < ∞.

C3. The kernel K(·) is symmetric, has bounded support, and is four times continuously differentiable.

C4. As n → ∞, nh^4 → ∞ and nh^8 → 0.

C5. For the initial estimate \tilde{β} of β_0, assume \tilde{β} - β_0 = O_p(n^{-1/2}).

Condition C1 guarantees that the least squares estimate is consistent with root-n convergence rate. Condition C2 is used to guarantee the adaptiveness of our proposed estimate. If \lim_{n→∞} n^{-1} \sum_{i=1}^{n} x_i = 0, then the symmetry condition on f(ϵ) can be removed.

5.1 Proof of Theorem 2.1

We follow a strategy similar to that of Linton and Xiao (2007). Note that the maximizer \hat{β} in (2.3) is a zero of the score function

\frac{1}{n} \sum_{i=1}^{n} \frac{\tilde{f}'(y_i - x_i^T β)}{\tilde{f}(y_i - x_i^T β)} x_i,   (5.1)

where \tilde{f}'(t) is the derivative of \tilde{f}(t) in (2.2). For technical reasons, we consider a trimmed version of \hat{β}, defined as the solution of S(β) = 0, where

S(β) = \frac{1}{n} \sum_{i=1}^{n} \frac{\tilde{f}'(y_i - x_i^T β)}{\tilde{f}(y_i - x_i^T β)} x_i G_b\big(\tilde{f}(y_i - x_i^T β)\big).   (5.2)

Here G_b(x) = 0 for x < b; G_b(x) = \int_b^x g_b(t) dt for b ≤ x ≤ 2b; and G_b(x) = 1 for x > 2b, where g_b(t) is any density function with support on [b, 2b] such that G_b(t) is four times continuously differentiable on [b, 2b]. In the following proof, we assume that b = h^r, where 0 < r < 1/2. In practice, when b is small, the difference between the original estimate and the trimmed one is negligible.

By Taylor's expansion, there exists β^* with |β^* - β_0| ≤ |\hat{β} - β_0| such that

S(\hat{β}) = S(β_0) + \frac{\partial S(β_0)}{\partial β}(\hat{β} - β_0) + \frac{1}{2}(\hat{β} - β_0)^T \frac{\partial^2 S(β^*)}{\partial β \partial β^T}(\hat{β} - β_0).

The desired result then follows from Lemmas 5.2-5.4 below.

Lemma 5.1. For \tilde{f} in (2.2), we have the uniform consistency results

\sup_t |\tilde{f}(t) - f(t)| = O_p\Big[ h^2 + \Big\{ \frac{\log(n)}{nh} \Big\}^{1/2} \Big],   (5.3)

\sup_t |\tilde{f}'(t) - f'(t)| = O_p\Big[ h^2 + \Big\{ \frac{\log(n)}{nh^3} \Big\}^{1/2} \Big].   (5.4)

Proof. Denote by f^{(k)} the kth derivative of f, with the convention f^{(0)} = f. Let

\check{f}^{(k)}(t) = \frac{1}{n h^{k+1}} \sum_{j=1}^{n} K^{(k)}\Big( \frac{t - ϵ_j}{h} \Big), \quad k = 0, 1, 2, 3,

be the traditional kernel estimator of the density derivative f^{(k)}(·), based on the unobserved errors ϵ_j. By Silverman (1978),

\sup_t |\check{f}^{(k)}(t) - f^{(k)}(t)| = O_p\Big\{ h^2 + \Big( \frac{\log(n)}{n h^{2k+1}} \Big)^{1/2} \Big\}.   (5.5)

Since x_i has bounded support and \tilde{β} - β_0 = O_p(n^{-1/2}), we have \tilde{ϵ}_j - ϵ_j = x_j^T(β_0 - \tilde{β}) = O_p(n^{-1/2}) uniformly over j. By Taylor's expansion, for some ϵ_j^* between ϵ_j and \tilde{ϵ}_j,

\tilde{f}(t) - \check{f}(t) = \frac{1}{n h^2} \sum_j K'\Big( \frac{t - ϵ_j}{h} \Big)(ϵ_j - \tilde{ϵ}_j) + \frac{1}{2 n h^3} \sum_j K''\Big( \frac{t - ϵ_j}{h} \Big)(ϵ_j - \tilde{ϵ}_j)^2 + \frac{1}{6 n h^4} \sum_j K'''\Big( \frac{t - ϵ_j^*}{h} \Big)(ϵ_j - \tilde{ϵ}_j)^3

= O_p(1/\sqrt{n}) + O_p(1/n) O_p\Big\{ 1 + \Big( \frac{\log(n)}{n h^5} \Big)^{1/2} \Big\} + O_p(1/n^{3/2}) O_p(1/h^4),

uniformly in t, entailing (5.3) via Condition C4 and (5.5). Similarly, (5.4) follows.

Lemma 5.2. Let V be defined as in (2.6). Then \partial S(β_0)/\partial β →_p V.

Proof. For notational convenience, write f_i = f(ϵ_i), f'_i = f'(ϵ_i), f''_i = f''(ϵ_i), \tilde{f}_i = \tilde{f}(ϵ_i), \tilde{f}'_i = \tilde{f}'(ϵ_i), and \tilde{f}''_i = \tilde{f}''(ϵ_i). Note that

\frac{\partial S(β_0)}{\partial β} = \frac{1}{n} \sum_{i=1}^{n} \frac{\tilde{f}'^2_i}{\tilde{f}^2_i} G_b(\tilde{f}_i) x_i x_i^T - \frac{1}{n} \sum_{i=1}^{n} \frac{\tilde{f}''_i}{\tilde{f}_i} G_b(\tilde{f}_i) x_i x_i^T - \frac{1}{n} \sum_{i=1}^{n} \frac{\tilde{f}'^2_i}{\tilde{f}_i} g_b(\tilde{f}_i) x_i x_i^T \equiv A + B + C.

It suffices to prove A →_p V, B →_p 0, and C →_p 0.

First, we consider A. Let Δ_i = \tilde{f}_i - f_i, Δ'_i = \tilde{f}'_i - f'_i, δ_n = h^2 + \{\log(n)/(nh)\}^{1/2}, and δ'_n = h^2 + \{\log(n)/(nh^3)\}^{1/2}. By Lemma 5.1, \max_i |Δ_i| = O_p(δ_n) and \max_i |Δ'_i| = O_p(δ'_n). By definition, \sup_x |G_b(x)/x^k| ≤ 1/b^k for k ≥ 0. So, by the boundedness of f_i and f'_i,

\frac{\tilde{f}'^2_i}{\tilde{f}^2_i} G_b(\tilde{f}_i) = \Big\{ \frac{f'^2_i}{f^2_i} + \frac{Δ'_i(\tilde{f}'_i + f'_i)}{\tilde{f}^2_i} - \frac{Δ_i f'^2_i (f_i + \tilde{f}_i)}{\tilde{f}^2_i f^2_i} \Big\} G_b(\tilde{f}_i) = \frac{f'^2_i}{f^2_i} G_b(\tilde{f}_i) + O_p\Big( \frac{δ'_n}{b^2} \Big) + O_p\Big( \frac{δ_n}{b^2} \Big) \frac{f'^2_i}{f^2_i}.

By Condition C2, f'^2_i/f^2_i is integrable, so we have

A = \frac{1}{n} \sum_{i=1}^{n} \frac{f'^2_i}{f^2_i} G_b(\tilde{f}_i) x_i x_i^T + O_p\Big( \frac{δ'_n}{b^2} \Big).   (5.6)

By Condition C2 and the Dominated Convergence Theorem, as b → 0,

E\Big\{ \frac{f'^2_i}{f^2_i} \big(1 - G_b(f_i)\big) \Big\} ≤ E\Big\{ \frac{f'^2_i}{f^2_i} I\big(f(ϵ_i) < 2b\big) \Big\} = o(1).

Note that \max_{1 ≤ i ≤ n} |G_b(\tilde{f}_i) - G_b(f_i)| = o_p(1). Therefore, by decomposing G_b(\tilde{f}_i) in (5.6) into 1 + \{G_b(f_i) - 1\} + \{G_b(\tilde{f}_i) - G_b(f_i)\}, it is easily seen that A →_p V.

Next, we consider B. There exists ξ between 0 and \{\tilde{f}(ϵ) - f(ϵ)\}/f(ϵ) such that

\frac{1}{\tilde{f}(ϵ)} = \frac{1}{f(ϵ)} - \frac{1}{(1 + ξ)^2 f^2(ϵ)} \big\{ \tilde{f}(ϵ) - f(ϵ) \big\}.   (5.7)

Using the latter identity, we have

B = -\frac{1}{n} \sum_{i=1}^{n} \frac{f''(ϵ_i)}{f(ϵ_i)} G_b(\tilde{f}_i) x_i x_i^T - \frac{1}{n} \sum_{i=1}^{n} \frac{\tilde{f}''(ϵ_i) - f''(ϵ_i)}{f(ϵ_i)} G_b(\tilde{f}_i) x_i x_i^T + \frac{1}{n} \sum_{i=1}^{n} \frac{\{\tilde{f}(ϵ_i) - f(ϵ_i)\} \tilde{f}''(ϵ_i)}{f(ϵ_i)^2 (1 + ξ_i)^2} G_b(\tilde{f}_i) x_i x_i^T \equiv B_1 + B_2 + B_3.   (5.8)

Similar to the proof of A in (5.6), we can get B_1 = o_p(1). Note that

|B_2| ≤ \frac{1}{n^2 h^4} \sum_{i=1}^{n} \sum_{j=1}^{n} \frac{1}{f(ϵ_i)} \Big| K'''\Big( \frac{ϵ_i - ϵ_j}{h} \Big) \Big| \, |x_j^T \tilde{β} - x_j^T β_0| \, G_b(f(ϵ_i)) \, \|x_i x_i^T\|.

Elementary calculations show that

E\Big\{ K^{(k)}\Big( \frac{t - ϵ}{h} \Big) \Big\} = h^{k+1} \int K(z) f^{(k)}(t + hz)\,dz, \quad k = 1, 2, 3.

Let

k_1(ϵ_i, ϵ_j) = \frac{1}{h^4} \frac{1}{f(ϵ_i)} K'''\Big( \frac{ϵ_i - ϵ_j}{h} \Big) G_b(f(ϵ_i)).

It can be easily shown that, for distinct i, j, l,

E\{k_1(ϵ_i, ϵ_j)\} = O(b^{-1}), \quad E\{k_1^2(ϵ_i, ϵ_j)\} = O(b^{-2} h^{-7}), \quad E\{k_1(ϵ_i, ϵ_j) k_1(ϵ_i, ϵ_l)\} = O(b^{-2}).

Thus, calculating the first two moments based on results for U-statistics, we have

B_2 = O_p(1/\sqrt{n}) \Big[ O_p(b^{-1}) + O_p\Big\{ \Big( \frac{1}{n^2 b^2 h^7} + \frac{1}{n b^2} \Big)^{1/2} \Big\} \Big] = o_p(1).

That B_3 = o_p(1) follows from

\max_{1 ≤ i ≤ n} \Big| \frac{\tilde{f}(ϵ_i) - f(ϵ_i)}{f(ϵ_i)^2 (1 + ξ_i)^2} G_b(\tilde{f}_i) \Big| = O_p\Big( \frac{δ_n}{b^2} \Big) = o_p(1).

Finally, we consider C. Note that

C = -\frac{1}{n} \sum_{i=1}^{n} \frac{f'(ϵ_i)^2}{f(ϵ_i)} g_b(\tilde{f}_i) x_i x_i^T - \frac{1}{n} \sum_{i=1}^{n} \frac{\tilde{f}'(ϵ_i)^2 - f'(ϵ_i)^2}{f(ϵ_i)} g_b(\tilde{f}_i) x_i x_i^T + \frac{1}{n} \sum_{i=1}^{n} \frac{\{\tilde{f}(ϵ_i) - f(ϵ_i)\} \tilde{f}'(ϵ_i)^2}{f(ϵ_i)^2 (1 + ξ_i)^2} g_b(\tilde{f}_i) x_i x_i^T \equiv C_1 + C_2 + C_3.

Based on the uniform convergence results in Lemma 5.1 and g_b(·) = O(b^{-1}), we can easily get C_2 = o_p(1) and C_3 = o_p(1). By the Dominated Convergence Theorem,

E\Big\{ \frac{f'(ϵ_i)^2}{f(ϵ_i)} g_b(f_i) \Big\} ≤ \max_x \{ g_b(x) x \} \, E\Big\{ \frac{f'(ϵ_i)^2}{f^2(ϵ_i)} I\big( b ≤ f(ϵ_i) ≤ 2b \big) \Big\} → 0,

which, along with the argument used for A in (5.6), gives C_1 = o_p(1).

Lemma 5.3. Let V be defined as in (2.6). Then \sqrt{n}\, S(β_0) →_d N(0, V).

Proof. By (5.7),

\sqrt{n}\, S(β_0) = \frac{1}{\sqrt{n}} \sum_{i=1}^{n} \frac{f'(ϵ_i)}{f(ϵ_i)} x_i G_b(\tilde{f}(ϵ_i)) + \frac{1}{\sqrt{n}} \sum_{i=1}^{n} \frac{\tilde{f}'(ϵ_i) - f'(ϵ_i)}{f(ϵ_i)} x_i G_b(\tilde{f}(ϵ_i)) - \frac{1}{\sqrt{n}} \sum_{i=1}^{n} \frac{\tilde{f}'(ϵ_i)\{\tilde{f}(ϵ_i) - f(ϵ_i)\}}{(1 + ξ_i)^2 f(ϵ_i)^2} x_i G_b(\tilde{f}(ϵ_i)) \equiv J_1 + J_2 + J_3.

By the technique in Lemma 5.2 and Lemma S2 of Linton and Xiao (2007),

J_1 = \frac{1}{\sqrt{n}} \sum_{i=1}^{n} \frac{f'(ϵ_i)}{f(ϵ_i)} x_i + o_p(1) →_d N(0, V).   (5.9)

It remains to prove J_2 →_p 0 and J_3 →_p 0. Decompose J_2 as

J_2 = \frac{1}{\sqrt{n}} \sum_{i=1}^{n} \frac{\tilde{f}'(ϵ_i) - \check{f}'(ϵ_i)}{f(ϵ_i)} x_i G_b(\tilde{f}(ϵ_i)) + \frac{1}{\sqrt{n}} \sum_{i=1}^{n} \frac{\check{f}'(ϵ_i) - f'(ϵ_i)}{f(ϵ_i)} x_i G_b(\tilde{f}(ϵ_i)) \equiv J_{21} + J_{22}.

For the ath component (J_{21})_a of J_{21},

|(J_{21})_a| ≤ \frac{1}{\sqrt{n}} \frac{1}{n h^3} \sum_{i=1}^{n} \sum_{j=1}^{n} \frac{1}{f(ϵ_i)} \Big| K''\Big( \frac{ϵ_i - ϵ_j}{h} \Big) \Big| \, |x_j^T(\tilde{β} - β_0)| \, |X_{ia}| \, G_b(f(ϵ_i)) = O_p(1/\sqrt{n}) \, \frac{1}{\sqrt{n}} \frac{1}{n h^3} \sum_{i=1}^{n} \sum_{j=1}^{n} \frac{1}{f(ϵ_i)} \Big| K''\Big( \frac{ϵ_i - ϵ_j}{h} \Big) \Big| \, \|x_j\| \, |X_{ia}| \, G_b(f(ϵ_i)).

Similar to the proof of B_2 in (5.8), by calculating the first two moments of (J_{21})_a using results for U-statistics, we have E\{(J_{21})_a\} = O(h^2 b^{-1}) and var\{(J_{21})_a\} = O(1/(\sqrt{n} b^4)). Therefore, (J_{21})_a = o_p(1). Note that

J_{22} = \frac{1}{\sqrt{n}} \sum_{i=1}^{n} \frac{\check{f}'(ϵ_i) - f'(ϵ_i)}{f(ϵ_i)} x_i G_b(f(ϵ_i)) = \frac{1}{\sqrt{n}} \sum_{i=1}^{n} \Big\{ \frac{(nh^2)^{-1} \sum_{j=1}^{n} K'\big( \frac{ϵ_i - ϵ_j}{h} \big) - (nh^2)^{-1} \sum_{j=1}^{n} E_i K'\big( \frac{ϵ_i - ϵ_j}{h} \big)}{f(ϵ_i)} \Big\} x_i G_b(f(ϵ_i)) + \frac{1}{\sqrt{n}} \sum_{i=1}^{n} \Big\{ \frac{(nh^2)^{-1} \sum_{j=1}^{n} E_i K'\big( \frac{ϵ_i - ϵ_j}{h} \big)}{f(ϵ_i)} - \frac{f'(ϵ_i)}{f(ϵ_i)} \Big\} x_i G_b(f(ϵ_i)) \equiv J_{22A} + J_{22B},

where E_i is the conditional expectation given ϵ_i. Similar to the proof of B_2 in (5.8) and the proof techniques in Lemma S2 of Linton and Xiao (2007), we can prove E(J_{22A}) = 0 and var\{J_{22A}\} = o(1). Therefore J_{22A} = o_p(1). Similarly, we can prove J_{22B} = o_p(1) and J_3 = o_p(1).

Lemma 5.4. \partial^2 S(β^*)/\partial β \partial β^T = o_p(\sqrt{n}).

Proof. It follows from the same arguments as in Lemmas 5.2-5.3, and we omit the details.

5.2 Proof of Theorem 2.2

Let Z_i^{(k+1)} be a random variable such that

P\Big\{ Z_i^{(k+1)} = \frac{K_h(y_i - x_i^T β^{(k+1)} - \tilde{ϵ}_j)}{K_h(y_i - x_i^T β^{(k)} - \tilde{ϵ}_j)} \Big\} = p_{ij}^{(k+1)}, \quad j \ne i.

By Jensen's inequality, we have

Q(β^{(k+1)}) - Q(β^{(k)}) = \sum_{i=1}^{n} \log \frac{\sum_{j \ne i} K_h(y_i - x_i^T β^{(k+1)} - \tilde{ϵ}_j)}{\sum_{j \ne i} K_h(y_i - x_i^T β^{(k)} - \tilde{ϵ}_j)} = \sum_{i=1}^{n} \log \sum_{j \ne i} \Big\{ \frac{K_h(y_i - x_i^T β^{(k+1)} - \tilde{ϵ}_j)}{K_h(y_i - x_i^T β^{(k)} - \tilde{ϵ}_j)} p_{ij}^{(k+1)} \Big\} = \sum_{i=1}^{n} \log E\big( Z_i^{(k+1)} \big) ≥ \sum_{i=1}^{n} E\big\{ \log\big( Z_i^{(k+1)} \big) \big\}.

By the M-step of Algorithm 2.1, the desired result follows from

\sum_{i=1}^{n} E\big\{ \log\big( Z_i^{(k+1)} \big) \big\} = \sum_{i=1}^{n} \sum_{j \ne i} p_{ij}^{(k+1)} \log \frac{K_h(y_i - x_i^T β^{(k+1)} - \tilde{ϵ}_j)}{K_h(y_i - x_i^T β^{(k)} - \tilde{ϵ}_j)} ≥ 0.

Sketch of the proof of the asymptotic distribution of \breve{β}_0: Recall that x = (1, z^T)^T. Note that

\breve{β}_0 = \bar{y} - \bar{z}^T \breve{β}_1 = β_0 + \bar{z}^T β_1 + \bar{ϵ} - \bar{z}^T \breve{β}_1 = β_0 + \bar{z}^T(β_1 - \breve{β}_1) + \bar{ϵ}.

Therefore,

\sqrt{n}(\breve{β}_0 - β_0) = \bar{z}^T \sqrt{n}(β_1 - \breve{β}_1) + \sqrt{n}\, \bar{ϵ}.

In addition, we know that

\sqrt{n}(\breve{β}_1 - β_1) = \frac{1}{\sqrt{n}} \sum_{i=1}^{n} \frac{f'(ϵ_i)}{f(ϵ_i)} (V_{21} + V_{22} z_i) + o_p(1).

Therefore,

\sqrt{n}(\breve{β}_0 - β_0) = \frac{1}{\sqrt{n}} \sum_{i=1}^{n} \Big\{ ϵ_i - \frac{f'(ϵ_i)}{f(ϵ_i)} \big( \bar{z}^T V_{21} + \bar{z}^T V_{22} z_i \big) \Big\} →_d N(0, σ^2),

where

σ^2 = var\Big[ ϵ_i - \frac{f'(ϵ_i)}{f(ϵ_i)} \big\{ E(z^T) V_{21} + E(z^T) V_{22} z_i \big\} \Big].

6 Acknowledgements

The authors are grateful to the editors and the referee for their insightful comments and suggestions, which greatly improved this article. In addition, the method KDRE1 is based on the referee's suggestion.

References

Beran, R. (1978). Asymptotically efficient adaptive rank estimates in location models. Annals of Statistics, 2, 248-266.

Bickel, P. J. (1982). On adaptive estimation. Annals of Statistics, 10, 647-671.

Linton, O. and Xiao, Z. (2007). A nonparametric regression estimator that adapts to error distribution of unknown form. Econometric Theory, 23, 371-413.

Owen, A. B. (1988). Empirical likelihood ratio confidence intervals for a single functional. Biometrika, 75, 237-249.

Owen, A. B. (2001). Empirical Likelihood. New York: Chapman & Hall/CRC.

Raykar, V. C. and Duraiswami, R. (2006). Fast optimal bandwidth selection for kernel density estimation. In Proceedings of the Sixth SIAM International Conference on Data Mining, Bethesda, April 2006, 524-528.

Schick, A. (1993). On efficient estimation in regression models. Annals of Statistics, 21, 1486-1521.

Sheather, S. J. and Jones, M. C. (1991). A reliable data-based bandwidth selection method for kernel density estimation. Journal of the Royal Statistical Society, Series B, 53, 683-690.

Silverman, B. W. (1978). Weak and strong uniform consistency of the kernel estimate of a density and its derivatives. Annals of Statistics, 6, 177-184.

Stone, C. (1975). Adaptive maximum likelihood estimation of a location parameter. Annals of Statistics, 3, 267-284.

Wang, Q. and Yao, W. (2012). An adaptive estimation of MAVE. Journal of Multivariate Analysis, 104, 88-100.

Yuan, A. and De Gooijer, J. G. (2007). Semiparametric regression with kernel error model. Scandinavian Journal of Statistics, 34, 841-869.

Yuan, A. (2010). Semiparametric inference with kernel likelihood. Journal of Nonparametric Statistics, 21, 207-228.

Table 1: Simulation Results for the Intercept Estimates. MSE of each estimate and relative efficiency compared to the LSE.

Error Distribution                        n     LSE     KDRE    KDRE1   RE(KDRE)  RE(KDRE1)
N(0, 1)                                   30    0.146   0.156   0.175   0.939     0.834
(Standard normal)                         100   0.041   0.043   0.047   0.940     0.859
                                          300   0.014   0.015   0.016   0.960     0.893
                                          600   0.007   0.007   0.007   1.010     0.997
U(-2, 2)                                  30    0.183   0.144   0.154   1.266     1.190
(Short-tail distribution)                 100   0.060   0.033   0.036   1.807     1.689
                                          300   0.017   0.008   0.009   2.180     1.901
                                          600   0.008   0.004   0.005   2.130     1.890
t_3/sqrt(3)                               30    0.159   0.104   0.109   1.529     1.465
(Heavy-tail distribution)                 100   0.036   0.026   0.026   1.390     1.394
                                          300   0.012   0.009   0.009   1.315     1.329
                                          600   0.007   0.005   0.005   1.540     1.592
0.95 N(0, 0.7^2) + 0.05 N(0, 3.5^2)       30    0.150   0.102   0.106   1.470     1.417
(Contaminated normal)                     100   0.040   0.028   0.028   1.411     1.411
                                          300   0.015   0.009   0.009   1.564     1.597
                                          600   0.008   0.005   0.005   1.513     1.438
0.5 N(-1, 0.5^2) + 0.5 N(1, 0.5^2)        30    0.180   0.122   0.111   1.477     1.598
(Multi-modal distribution)                100   0.051   0.027   0.027   1.864     1.889
                                          300   0.019   0.009   0.010   2.077     2.010
                                          600   0.009   0.005   0.005   1.918     1.825
0.3 N(-1.4, 1) + 0.7 N(0.6, 0.4^2)        30    0.182   0.115   0.088   1.593     2.083
(Skewed distribution)                     100   0.053   0.028   0.022   2.005     2.412
                                          300   0.016   0.008   0.007   2.102     2.363
                                          600   0.009   0.005   0.004   1.907     2.270

Table 2: Simulation Results for the Slope Estimates. MSE of each estimate and relative efficiency compared to the LSE.

Error Distribution                        n     LSE     KDRE    KDRE1   RE(KDRE)  RE(KDRE1)
N(0, 1)                                   30    0.418   0.456   0.543   0.918     0.771
(Standard normal)                         100   0.119   0.128   0.144   0.933     0.826
                                          300   0.046   0.049   0.053   0.951     0.878
                                          600   0.020   0.020   0.020   1.020     0.999
U(-2, 2)                                  30    0.520   0.414   0.413   1.259     1.259
(Short-tail distribution)                 100   0.169   0.088   0.081   2.001     2.093
                                          300   0.048   0.018   0.018   2.634     2.673
                                          600   0.026   0.009   0.009   3.010     3.070
t_3/sqrt(3)                               30    0.526   0.242   0.267   2.174     1.970
(Heavy-tail distribution)                 100   0.114   0.065   0.067   1.744     1.691
                                          300   0.038   0.024   0.025   1.571     1.539
                                          600   0.018   0.009   0.009   2.020     2.024
0.95 N(0, 0.7^2) + 0.05 N(0, 3.5^2)       30    0.468   0.252   0.278   1.854     1.683
(Contaminated normal)                     100   0.123   0.068   0.071   1.815     1.739
                                          300   0.043   0.020   0.021   2.118     2.097
                                          600   0.023   0.012   0.013   1.904     1.812
0.5 N(-1, 0.5^2) + 0.5 N(1, 0.5^2)        30    0.519   0.319   0.256   1.629     1.985
(Multi-modal distribution)                100   0.144   0.055   0.050   2.630     2.863
                                          300   0.058   0.019   0.018   3.058     3.157
                                          600   0.023   0.007   0.007   3.358     3.358
0.3 N(-1.4, 1) + 0.7 N(0.6, 0.4^2)        30    0.546   0.239   0.173   2.283     3.148
(Skewed distribution)                     100   0.157   0.042   0.036   3.702     4.396
                                          300   0.046   0.012   0.011   4.007     4.153
                                          600   0.027   0.006   0.006   4.401     4.594