Kernel Density Based Linear Regression Estimate

Weixin Yao and Zhibiao Zhao

Abstract

For linear regression models with non-normally distributed errors, the least squares estimate (LSE) will lose some efficiency compared to the maximum likelihood estimate (MLE). In this article, we propose a kernel density based regression estimate (KDRE) that is adaptive to the unknown error distribution. The key idea is to approximate the likelihood function by a nonparametric kernel density estimate of the error density based on some initial parameter estimate. The proposed estimate is shown to be asymptotically as efficient as the oracle MLE, which assumes the error density were known. In addition, we propose an EM type algorithm to maximize the estimated likelihood function and show that the KDRE can be considered as an iterated weighted least squares estimate, which provides some insight into the adaptiveness of the KDRE to the unknown error distribution. Our Monte Carlo simulation studies show that, while comparable to the traditional LSE for normal errors, the proposed estimation procedure can have a substantial efficiency gain for non-normal errors. Moreover, the efficiency gain can be achieved even for a small sample size.

Key words: EM algorithm, Kernel density estimate, Least squares estimate, Linear regression, Maximum likelihood estimate.

Department of Statistics, Kansas State University, Manhattan, KS 66506. Email: wxyao@ksu.edu
Department of Statistics, The Pennsylvania State University, University Park, PA 16802. Email: zuz13@stat.psu.edu
1 Introduction

Linear regression models are widely used to investigate the relationship between several variables. Suppose (x_1, y_1), ..., (x_n, y_n) are sampled from the regression model

y = x^T β + ϵ,  (1.1)

where x is a p-dimensional vector of covariates independent of the error ϵ with E(ϵ) = 0. The well-known least squares estimate (LSE) of β is

β̃ = argmin_β Σ_{i=1}^n (y_i − x_i^T β)^2.  (1.2)

For normally distributed errors, β̃ is exactly the maximum likelihood estimate (MLE). However, β̃ will lose some efficiency when the error is not normally distributed. Therefore, it is desirable to have an estimate that can adapt to the unknown error distribution.

The idea of adaptiveness is not new. Beran (1974) and Stone (1975) considered adaptive estimation for the location model. Bickel (1982), Schick (1993), Yuan and De Gooijer (2007), and Yuan (2010) extended the adaptive idea to regression and some other models. Linton and Xiao (2007) further applied the adaptive idea to the nonparametric regression estimate. Wang and Yao (2012) applied the adaptive idea to dimension reduction. Empirical likelihood techniques (Owen, 1988, 2001) have also been used in regression problems to adaptively construct confidence intervals and test statistics without any parametric assumption on the error density. However, empirical likelihood regression cannot provide efficient point estimates of the regression parameters by adaptively using the unknown error density information.

In this article, we propose an adaptive kernel density based regression estimate (KDRE). The basic idea is to estimate the error density by a kernel density estimate based on some initial parameter estimate, and then to estimate the regression parameters by maximizing the estimated likelihood function. Our proposed estimation procedure uses a kernel error idea similar to that of Stone (1975) and Linton and Xiao (2007) to gain adaptiveness based on some initial consistent estimate. However, while Linton and Xiao (2007) mainly deals with nonparametric regression, the current paper deals with
parametric regression. We prove that the proposed estimate is asymptotically as efficient as the oracle MLE, which assumes the error density were known. Therefore, our proposed estimate can adapt to different error distributions. In addition, we propose a novel EM algorithm to maximize the estimated likelihood function and show that the KDRE can be viewed as an iterated weighted least squares estimate, which provides some insight into why the KDRE can adapt to the unknown error distribution. To examine the finite sample performance, we conduct a Monte Carlo simulation study based on a wide range of error densities, including heavy-tail, multimodal, and skewed error densities. Our simulation study confirms our theoretical findings. Our main claims are as follows.

1. The KDRE is comparable to the traditional LSE when the error is normal.

2. The KDRE is more efficient than the LSE when the error is not normal. The efficiency gain can be substantial even for a small sample size.

The remainder of this paper is organized as follows. In Section 2, we introduce the new estimation procedure and prove its asymptotic oracle property. In addition, an EM type algorithm is introduced to maximize the estimated likelihood function. Numerical comparisons are conducted in Section 3. Summary and discussion are given in Section 4. Technical proofs are gathered in the Appendix.

2 Kernel Density Based Regression Estimate

2.1 The new estimation method

Let f(t) be the marginal density of ϵ in (1.1). If f(t) were known, instead of using the LSE, we could better estimate β in (1.1) by maximizing the log-likelihood

Σ_{i=1}^n log f(y_i − x_i^T β).  (2.1)

In practice, however, f(t) is often unknown and thus (2.1) is not directly applicable. To address this, denote by β̃ an initial estimate of β, such as the LSE in (1.2). Based on the residuals ϵ̃_i = y_i − x_i^T β̃, we can estimate f(t) by the kernel density
estimate, denoted by f̃(t), as

f̃(t) = (1/n) Σ_{j=1}^n K_h(t − ϵ̃_j),  (2.2)

where K_h(t) = h^{-1} K(t/h), K(·) is a kernel density, and h is the bandwidth (tuning parameter). In this article, we use the Gaussian kernel for K(·). Replacing f(·) in (2.1) with f̃(·), we then propose the kernel density based regression parameter estimate (KDRE) as

β̂ = argmax_β Q(β),  (2.3)

where Q(β) is the estimated likelihood function

Q(β) = Σ_{i=1}^n log f̃(y_i − x_i^T β) = Σ_{i=1}^n log{ (1/n) Σ_{j≠i} K_h(y_i − x_i^T β − ϵ̃_j) }.  (2.4)

Here we use the leave-one-out kernel density estimate for f(ϵ_i) to remove the estimation bias; see also Yuan and De Gooijer (2007) and Linton and Xiao (2007).

The above estimation procedure can be easily extended to nonlinear regression by replacing x_i^T β in (2.4) with the assumed nonlinear function.

2.2 Asymptotic result

Let β_0 be the true value of β. Then we have the following asymptotic oracle result for our proposed estimate β̂.

Theorem 2.1. Assume that Assumptions C1–C5 in the Appendix hold. As n → ∞,

√n (β̂ − β_0) →d N(0, V^{-1}),  (2.5)

where ϵ = y − x^T β_0, V = I_{β_0} M,

M = lim_{n→∞} (1/n) Σ_{i=1}^n x_i x_i^T = E(xx^T),  I_{β_0} = E{ f′(ϵ)^2 / f(ϵ)^2 }.  (2.6)

Remark 1: By the above theorem, the proposed estimate β̂ in (2.4) has root-n
convergence rate and its asymptotic distribution does not depend on the kernel K(·) or the bandwidth h, although the kernel density estimator, with its slower convergence rate, is involved. In addition, β̂ has the same asymptotic variance as that of the infeasible oracle MLE, which assumes f(·) were known.

Remark 2: In (2.4), if we replace the objective function −log f̃(·) by another objective function to be minimized, say ρ(·) with E{ρ′(ϵ)} = 0 (the LSE corresponds to ρ(ϵ) = ϵ^2), then the resulting estimate has limiting variance

v_ρ = [ E{ρ′(ϵ)^2} / E{ρ″(ϵ)}^2 ] M^{-1}.

Based on the classical Cramér–Rao inequality that

E{ρ′(ϵ)^2} / E{ρ″(ϵ)}^2 ≥ I_{β_0}^{-1},

we have v_ρ ≥ [I_{β_0} M]^{-1}. Therefore, the objective function we use in (2.4) is optimal in the sense that the proposed estimate is asymptotically efficient.

Remark 3: Our proposed method can also be applied to nonlinear regression models, and similar oracle properties can be established as in Theorem 2.1.

Remark 4: Yuan and De Gooijer (2007) proposed estimating β by maximizing

Σ_{i=1}^n log[ (1/n) Σ_{j≠i} K_h{ r(y_i − x_i^T β) − r(y_j − x_j^T β) } ],  (2.7)

where r(·) is some monotone nonlinear function, such as r(z) = e^z/(1 + e^z). Here r(·) is used to avoid the cancelation of the intercept term in β. Note that the asymptotic variance in (2.5) is the same as that in Yuan and De Gooijer (2007) with r(t) = t, which is efficient. One main advantage of their method is that it does not require an initial estimate. However, the asymptotic variance of their estimator depends on the choice of r(·) and generally does not reach the Cramér–Rao lower bound [I_{β_0} M]^{-1} for a nonlinear function r(·).

Note that when r(t) = t in (2.7), although the intercept term, denoted by β_0, will be canceled, the slope parameter, denoted by β_1, will remain estimable. Let β̌_1 be
its estimate. In (2.5), let

V^{-1} = ( V_11  V_12 ; V_21  V_22 ),

where V_11 is a scalar. Based on the result of Yuan and De Gooijer (2007), we know that β̌_1 is still an efficient estimate and has the asymptotic distribution

√n (β̌_1 − β_1) →d N(0, V_22).

Let x̃ = (1, x^T)^T. Based on the slope estimate β̌_1, we can simply estimate β_0 by β̌_0 = ȳ − x̄^T β̌_1. Note that β̌_0 can be considered as an LSE for the model y_i − x_i^T β̌_1 = β_0 + ϵ_i after we fix β_1 at β̌_1. Denote by KDRE1 the resulting estimate (β̌_0, β̌_1). Based on some standard calculations (a sketch of the proof is given at the end of the Appendix), we can get the asymptotic distribution for β̌_0:

√n (β̌_0 − β_0) →d N(0, σ^2),

where

σ^2 = var[ ϵ_i − {f′(ϵ_i)/f(ϵ_i)} { E(x)^T V_21 + E(x)^T V_22 x_i } ].

Note that generally β̌_0 does not reach the Cramér–Rao lower bound, and the efficiency loss depends on the true error density f(ϵ). However, one nice feature of such an estimate is that it does not require an initial estimate. In addition, it does not require choosing a nonlinear function r(·).

2.3 Computations: an EM algorithm

Note that the objective function (2.4) has a mixture form. In this section, we propose an EM algorithm to maximize it. The proposed EM algorithm can be similarly used to find β̌_1 by maximizing (2.7) with r(t) = t.

Let β^{(0)} be an initial parameter estimate, such as the LSE. We then update the parameter estimate according to the algorithm below.

Algorithm 2.1. At the (k+1)th step, we calculate the following E and M steps:
E-Step: Calculate the classification probabilities

p_ij^{(k+1)} = K_h(y_i − x_i^T β^{(k)} − ϵ̃_j) / Σ_{l≠i} K_h(y_i − x_i^T β^{(k)} − ϵ̃_l) ∝ K_h(y_i − x_i^T β^{(k)} − ϵ̃_j),  j ≠ i.  (2.8)

M-Step: Update β^{(k+1)},

β^{(k+1)} = argmax_β Σ_{i=1}^n Σ_{j≠i} p_ij^{(k+1)} log K_h(y_i − x_i^T β − ϵ̃_j)
          = argmin_β Σ_{i=1}^n Σ_{j≠i} p_ij^{(k+1)} (y_i − x_i^T β − ϵ̃_j)^2,  (2.9)

which has an explicit solution, since K_h(·) is a Gaussian kernel density.

From the M step (2.9), the KDRE can be considered as a weighted least squares estimate, which minimizes the weighted squared difference between the new residual y_i − x_i^T β and the initial residual ϵ̃_j for all 1 ≤ i ≠ j ≤ n. Based on the weights in (2.8), one knows that if the jth observation is an isolated outlier (i.e., |ϵ̃_j| is large), then the weights p_ij^{(k+1)} will be small for i ≠ j, and thus the effect of ϵ̃_j on updating β^{(k+1)} will also be small.

By Theorem 2.2 below, Algorithm 2.1 is truly an EM algorithm and has the monotone property for the objective function (2.4).

Theorem 2.2. The objective function (2.4) is non-decreasing after each iteration of Algorithm 2.1, i.e., Q(β^{(k+1)}) ≥ Q(β^{(k)}), until a fixed point is reached.

3 Simulation Studies

In this section, we use a simulation study to compare the proposed KDRE and KDRE1 with the traditional LSE for linear regression models with different types of error densities. For the proposed estimate, we use the rule-of-thumb bandwidth h = 1.06 n^{-1/5} σ̂ for the kernel density estimate of f(ϵ), where σ̂ is the sample standard deviation of the initial residuals ϵ̃_i = y_i − x_i^T β̃ and β̃ is the LSE. Better estimates might be obtained with some more sophisticated bandwidth selectors for the kernel density estimate. See, for example, Sheather and Jones (1991) and Raykar and Duraiswami (2006). In addition,
we can also use the cross-validation method to select the bandwidth, which focuses directly on the performance of the regression estimate rather than that of the density estimate.

We generate independent and identically distributed data {(x_i, y_i), i = 1, ..., n} from the model Y = 1 + 3X + ϵ, where X ~ U(0, 1), the uniform distribution on [0, 1]. For the error density, we consider the following six choices (all have standard deviation around 1):

Case 1: ϵ ~ N(0, 1), normal error.

Case 2: ϵ ~ U(−2, 2), the uniform distribution on [−2, 2], short-tail error.

Case 3: ϵ ~ t_3/√3, the rescaled t-distribution with 3 degrees of freedom, heavy-tail error.

Case 4: ϵ ~ 0.95N(0, 0.7^2) + 0.05N(0, 3.5^2), contaminated normal error. The 5% of data from N(0, 3.5^2) are most likely to be outliers.

Case 5: ϵ ~ 0.5N(−1, 0.5^2) + 0.5N(1, 0.5^2), multimodal error.

Case 6: ϵ ~ 0.3N(−1.4, 1) + 0.7N(0.6, 0.4^2), skewed error.

Case 6 is also used to check how our method performs compared with the LSE when the error is not symmetric.

We estimate the regression parameters using the KDRE, KDRE1, and the traditional LSE. Based on 1000 replicates, Tables 1–2 report the mean squared errors (MSE) of the parameter estimates for the intercept and slope, respectively, for sample sizes n = 30, 100, 300, and 600. The rightmost two columns contain the relative efficiency of KDRE and KDRE1 compared to the LSE; for example, RE(KDRE) = MSE(LSE)/MSE(KDRE).

From Case 2 to Case 6 in Tables 1–2, we can see that KDRE and KDRE1 are much more efficient than the LSE when the error is not normal (for both symmetric and skewed error densities). Moreover, the efficiency gain can be substantial even for a small sample size. In addition, when the error is normal, the KDRE is comparable to the LSE and works better than KDRE1, especially for small sample sizes. However, for the Case 6 skewed error density, KDRE1 works better than KDRE, although both of them have much better performance than the LSE. In addition, for large sample sizes, the performances of KDRE and KDRE1 are
almost the same, even for the intercept estimate, although the KDRE has some theoretical advantage over KDRE1. Note that KDRE1 is simpler in that it does not require an initial estimate.

4 Summary

In this article, we proposed an adaptive linear regression estimate by maximizing an estimated likelihood function, in which the error density is estimated by a kernel density estimate. The proposed estimate can adapt to the unknown error density and is asymptotically equivalent to the oracle MLE. Using the proposed EM algorithm, the computation is quick and stable. Our extensive simulation studies show that the proposed method outperforms the LSE in the presence of non-normal errors.

Although developed for linear regression models, the same idea can be easily extended to nonlinear regression cases. The asymptotic oracle property follows similarly. In addition, our proposed EM algorithm can also be used to estimate the adaptive nonparametric regression of Linton and Xiao (2007) and the semiparametric regression of Yuan and De Gooijer (2007). Future research directions include extensions to other regression models such as varying coefficient partially linear models and nonparametric additive models.

5 Appendix: Proofs

The following conditions are imposed to facilitate the proofs.

C1. {ϵ_i} and {x_i} are i.i.d. and mutually independent with E(ϵ_i) = 0 and E(|ϵ_i|^3) < ∞. Additionally, the predictors x_i have bounded support, and M = E(xx^T) is positive definite.

C2. The density f(·) of ϵ is symmetric about 0 and has bounded continuous derivatives up to order 4. Let l(ϵ) = log f(ϵ). Assume E{l′(ϵ)^2 + |l″(ϵ)| + |l‴(ϵ)|} < ∞.

C3. The kernel K(·) is symmetric, has bounded support, and is four times continuously differentiable.
C5. For te initial estimate β of β 0, assume β β 0 = O p (n 1/2. Te condition C1 can guarantee tat te least squares estimate is consistent and as root n convergence rate. Te condition C2 is used to guarantee te adaptiveness of our proposed estimate. If lim n n 1 n x i = 0, ten te symmetric condition of f(ϵ can be removed. 5.1 Proof of Teorem 2.1 We follow a similar strategy in Linton and Xiao (2007. Note tat te maximizer ˆβ in (2.3 is te solution of te score function 1 n f (y i x T i β f(y i x T i β x i, (5.1 were f (t is te derivative of f(t in (2.2. For tecnical reason, we will consider anoter trimmed version of ˆβ as te solution of S(β = 0, were S(β = 1 n f (y i x T i β f(y i x T i β x ig b ( f(ϵ i. (5.2 Here 0, x < b; G b (x = x b g b(tdt, b x 2b; 1, x > 2b. were g b (t is any density function wit support on [b, 2b] suc tat G b (t is four times continuously differentiable on [b, 2b]. In te following proof, we assume tat b = r, were 0 < r < 1/2. In practice, wen b is small, te difference between te original estimate and te trimmed one is negligible. By Taylor s expansion, tere exists β suc tat β β 0 ˆβ β 0 and S(ˆβ = S(β 0 + S(β 0 β (ˆβ β 0 + 1 2 (ˆβ β 0 T 2 S(β β β T (ˆβ β 0. Te desired result ten follows from Lemmas 5.2 5.4 below. 10
Lemma 5.1. For f̃ in (2.2), we have the uniform consistency results

sup_t |f̃(t) − f(t)| = O_p[ h^2 + {log n / (nh)}^{1/2} ],  (5.3)

sup_t |f̃′(t) − f′(t)| = O_p[ h^2 + {log n / (nh^3)}^{1/2} ].  (5.4)

Proof. Denote by f^{(k)} the kth derivative of f, with the convention f^{(0)} = f. Let

f̌^{(k)}(t) = {1/(nh^{k+1})} Σ_{j=1}^n K^{(k)}((t − ϵ_j)/h),  k = 0, 1, 2, 3,

be the traditional kernel density derivative estimator of f^{(k)}(·). By Silverman (1978),

sup_t |f̌^{(k)}(t) − f^{(k)}(t)| = O_p{ h^2 + (log n / (nh^{2k+1}))^{1/2} }.  (5.5)

Since x_i has bounded support and ‖β̃ − β_0‖ = O_p(n^{-1/2}),

ϵ̃_j − ϵ_j = x_j^T(β_0 − β̃) = O_p(n^{-1/2}),  uniformly over j.

By Taylor's expansion, for some ϵ_j* between ϵ_j and ϵ̃_j,

f̃(t) − f̌(t) = −{1/(nh^2)} Σ_j K′((t − ϵ_j)/h)(ϵ̃_j − ϵ_j) + {1/(2nh^3)} Σ_j K″((t − ϵ_j)/h)(ϵ̃_j − ϵ_j)^2 − {1/(6nh^4)} Σ_j K‴((t − ϵ_j*)/h)(ϵ̃_j − ϵ_j)^3
= O_p(1/√n) + O_p(1/n) O_p{1 + (log n/(nh^5))^{1/2}} + O_p(1/n^{3/2}) O_p(1/h^4),

uniformly, entailing (5.3) via Condition C4 and (5.5). Similarly, (5.4) follows.

Lemma 5.2. Let V be defined as in (2.6). Then ∂S(β_0)/∂β →p V.

Proof. For notational convenience, write f_i = f(ϵ_i), f′_i = f′(ϵ_i), f″_i = f″(ϵ_i), f̃_i = f̃(ϵ_i), f̃′_i = f̃′(ϵ_i), f̃″_i = f̃″(ϵ_i). Note that

∂S(β_0)/∂β = (1/n) Σ_i {(f̃′_i)^2/(f̃_i)^2} G_b(f̃_i) x_i x_i^T − (1/n) Σ_i (f̃″_i/f̃_i) G_b(f̃_i) x_i x_i^T − (1/n) Σ_i {(f̃′_i)^2/f̃_i} g_b(f̃_i) x_i x_i^T = A + B + C.
It suffices to prove A →p V, B →p 0, and C →p 0.

First, we consider A. Let Δ_i = f̃_i − f_i, Δ′_i = f̃′_i − f′_i, δ_n = h^2 + {log n/(nh)}^{1/2}, and δ′_n = h^2 + {log n/(nh^3)}^{1/2}. By Lemma 5.1, max_i |Δ_i| = O_p(δ_n) and max_i |Δ′_i| = O_p(δ′_n). By definition, sup_x |G_b(x)/x^k| ≤ 1/b^k for k ≥ 0. So, by the boundedness of f_i and f′_i,

{(f̃′_i)^2/(f̃_i)^2} G_b(f̃_i) = {(f′_i)^2/(f_i)^2} G_b(f̃_i) + O_p(δ′_n)/b^2 + O_p(δ_n) (f′_i)^2/{b^2 (f_i)^2}.

By Condition C2, (f′_i)^2/(f_i)^2 is integrable, so we have

A = (1/n) Σ_i {(f′_i)^2/(f_i)^2} G_b(f̃_i) x_i x_i^T + O_p{(δ_n + δ′_n)/b^2}.  (5.6)

By Condition C2 and the Dominated Convergence Theorem, as b → 0,

E[ {(f′_i)^2/(f_i)^2} {1 − G_b(f_i)} ] ≤ E[ {(f′_i)^2/(f_i)^2} I(f(ϵ_i) < 2b) ] = o(1).

Note that max_{1≤i≤n} |G_b(f̃_i) − G_b(f_i)| = o_p(1). Therefore, by decomposing G_b(f̃_i) in (5.6) into 1 + {G_b(f_i) − 1} + {G_b(f̃_i) − G_b(f_i)}, it is easily seen that A →p V.

Next, we consider B. There exists ξ between 0 and {f̃(ϵ) − f(ϵ)}/f(ϵ) such that

f̃^{-1}(ϵ) = f^{-1}(ϵ) − {f̃(ϵ) − f(ϵ)} / {(1 + ξ)^2 f^2(ϵ)}.  (5.7)

Using the latter identity, we have

B = −(1/n) Σ_i (f″_i/f_i) G_b(f̃_i) x_i x_i^T − (1/n) Σ_i {(f̃″_i − f″_i)/f_i} G_b(f̃_i) x_i x_i^T + (1/n) Σ_i {(f̃_i − f_i) f̃″_i / ((1 + ξ_i)^2 f_i^2)} G_b(f̃_i) x_i x_i^T
= B_1 + B_2 + B_3.  (5.8)
Similar to the proof for A in (5.6), we can get B_1 = o_p(1). Note that, up to terms of smaller order,

|B_2| ≤ {1/(n^2 h^4)} Σ_i Σ_j {1/f(ϵ_i)} |K‴((ϵ_i − ϵ_j)/h)| |x_j^T β̃ − x_j^T β_0| G_b(f_i) |x_i x_i^T|.

Elementary calculations show that

E{ K^{(k)}((t − ϵ)/h) } = h^{k+1} ∫ K(z) f^{(k)}(t + hz) dz,  k = 1, 2, 3.

Let

k_1(ϵ_i, ϵ_j) = (1/h^4) {1/f(ϵ_i)} K‴((ϵ_i − ϵ_j)/h) G_b(f_i).

It can be easily shown that, for distinct i, j, l,

E{k_1(ϵ_i, ϵ_j)} = O(b^{-1}),  E{k_1^2(ϵ_i, ϵ_j)} = O(b^{-2} h^{-7}),  E{k_1(ϵ_i, ϵ_j) k_1(ϵ_i, ϵ_l)} = O(b^{-2}).

Thus, calculating the first two moments based on the theory of U-statistics, we have

B_2 = O_p(1/√n) [ O(b^{-1}) + O_p{ (n^{-2} b^{-2} h^{-7} + n^{-1} b^{-2})^{1/2} } ] = o_p(1).

That B_3 = o_p(1) follows from

max_{1≤i≤n} | {f̃(ϵ_i) − f(ϵ_i)} / {(1 + ξ_i)^2 f(ϵ_i)^2} | G_b(f̃_i) = O_p[ { h^2 + (log n/(nh))^{1/2} } b^{-2} ] = o_p(1).

Finally, we consider C. Note that

C = −(1/n) Σ_i {(f′_i)^2/f_i} g_b(f̃_i) x_i x_i^T − (1/n) Σ_i [{(f̃′_i)^2 − (f′_i)^2}/f_i] g_b(f̃_i) x_i x_i^T + (1/n) Σ_i [(f̃_i − f_i)(f̃′_i)^2 / {(1 + ξ_i)^2 f_i^2}] g_b(f̃_i) x_i x_i^T
= C_1 + C_2 + C_3.

Based on the uniform convergence results in Lemma 5.1 and g_b(·) = O(b^{-1}), we can
easily get C_2 = o_p(1) and C_3 = o_p(1). By the Dominated Convergence Theorem,

E[ {f′(ϵ_i)^2/f(ϵ_i)} g_b(f(ϵ_i)) ] ≤ max_x {g_b(x) x} E[ {f′(ϵ_i)^2/f(ϵ_i)^2} I(b ≤ f(ϵ_i) ≤ 2b) ] → 0,

which, along with the argument in the proof for A in (5.6), gives C_1 = o_p(1).

Lemma 5.3. Let V be defined as in (2.6). Then √n S(β_0) →d N(0, V).

Proof. By (5.7),

√n S(β_0) = (1/√n) Σ_i {f′(ϵ_i)/f(ϵ_i)} x_i G_b(f̃(ϵ_i)) + (1/√n) Σ_i [{f̃′(ϵ_i) − f′(ϵ_i)}/f(ϵ_i)] x_i G_b(f̃(ϵ_i)) − (1/√n) Σ_i [ f̃′(ϵ_i) {f̃(ϵ_i) − f(ϵ_i)} / {(1 + ξ_i)^2 f(ϵ_i)^2} ] x_i G_b(f̃(ϵ_i))
= J_1 + J_2 + J_3.

By the technique in Lemma 5.2 and Lemma S2 of Linton and Xiao (2007),

J_1 = (1/√n) Σ_i {f′(ϵ_i)/f(ϵ_i)} x_i + o_p(1) →d N(0, V).  (5.9)

It remains to prove J_2 →p 0 and J_3 →p 0. Decompose J_2 as

J_2 = (1/√n) Σ_i [{f̃′(ϵ_i) − f̌′(ϵ_i)}/f(ϵ_i)] x_i G_b(f̃(ϵ_i)) + (1/√n) Σ_i [{f̌′(ϵ_i) − f′(ϵ_i)}/f(ϵ_i)] x_i G_b(f̃(ϵ_i)) = J_21 + J_22.

For the ath component of J_21, the leading term satisfies

(J_21)_a ≈ −(1/√n) Σ_i {1/(nh^3)} Σ_j {1/f(ϵ_i)} K″((ϵ_i − ϵ_j)/h) x_j^T(β̃ − β_0) X_ia G_b(f(ϵ_i))
= O_p(1/√n) (1/√n) Σ_i {1/(nh^3)} Σ_j {1/f(ϵ_i)} |K″((ϵ_i − ϵ_j)/h)| ‖x_j‖ |X_ia| G_b(f(ϵ_i)).

Similar to the proof of B_2 in (5.8), by calculating the first two moments of (J_21)_a
using the results of U-statistics, we have E{(J_21)_a} = O(h^2 b^{-1}) and var{(J_21)_a} = O{1/(√n b^4)}. Therefore, (J_21)_a = o_p(1). Note that

J_22 ≈ (1/√n) Σ_i [{f̌′(ϵ_i) − f′(ϵ_i)}/f(ϵ_i)] x_i G_b(f(ϵ_i))
= (1/√n) Σ_i [ {1/(nh^2)} Σ_j { K′((ϵ_i − ϵ_j)/h) − E_i K′((ϵ_i − ϵ_j)/h) } / f(ϵ_i) ] x_i G_b(f(ϵ_i))
+ (1/√n) Σ_i [ { (1/h^2) E_i K′((ϵ_i − ϵ_j)/h) − f′(ϵ_i) } / f(ϵ_i) ] x_i G_b(f(ϵ_i))
= J_22A + J_22B,

where E_i is the conditional expectation given ϵ_i. Similar to the proof of B_2 in (5.8) and the proof techniques in Lemma S2 of Linton and Xiao (2007), we can prove E(J_22A) = 0 and var{J_22A} = o(1). Therefore, J_22A = o_p(1). Similarly, we can prove J_22B = o_p(1) and J_3 = o_p(1).

Lemma 5.4. ∂^2 S(β̄)/∂β∂β^T = o_p(√n).

Proof. It follows from the same arguments as in Lemmas 5.2–5.3, and we omit the details.

5.2 Proof of Theorem 2.2

Let Z_i^{(k+1)} be a random variable such that

P{ Z_i^{(k+1)} = K_h(y_i − x_i^T β^{(k+1)} − ϵ̃_j) / K_h(y_i − x_i^T β^{(k)} − ϵ̃_j) } = p_ij^{(k+1)},  j ≠ i.
By Jensen's inequality, we have

Q(β^{(k+1)}) − Q(β^{(k)}) = Σ_i log [ Σ_{j≠i} K_h(y_i − x_i^T β^{(k+1)} − ϵ̃_j) / Σ_{j≠i} K_h(y_i − x_i^T β^{(k)} − ϵ̃_j) ]
= Σ_i log [ Σ_{j≠i} p_ij^{(k+1)} K_h(y_i − x_i^T β^{(k+1)} − ϵ̃_j) / K_h(y_i − x_i^T β^{(k)} − ϵ̃_j) ]
= Σ_i log{ E(Z_i^{(k+1)}) } ≥ Σ_i E{ log(Z_i^{(k+1)}) }.

By the M-step of Algorithm 2.1, the desired result follows from

Σ_i E{ log(Z_i^{(k+1)}) } = Σ_i Σ_{j≠i} p_ij^{(k+1)} log [ K_h(y_i − x_i^T β^{(k+1)} − ϵ̃_j) / K_h(y_i − x_i^T β^{(k)} − ϵ̃_j) ] ≥ 0.

Sketch of the proof of the asymptotic distribution of β̌_0: Let x̃ = (1, x^T)^T. Note that

β̌_0 = ȳ − x̄^T β̌_1 = β_0 + x̄^T β_1 + ϵ̄ − x̄^T β̌_1 = β_0 − x̄^T(β̌_1 − β_1) + ϵ̄.

Therefore,

√n(β̌_0 − β_0) = −x̄^T √n(β̌_1 − β_1) + √n ϵ̄.

In addition, we know

√n(β̌_1 − β_1) = (1/√n) Σ_i {f′(ϵ_i)/f(ϵ_i)} (V_21 + V_22 x_i) + o_p(1).
Therefore,

√n(β̌_0 − β_0) = (1/√n) Σ_i [ ϵ_i − {f′(ϵ_i)/f(ϵ_i)} ( x̄^T V_21 + x̄^T V_22 x_i ) ] + o_p(1) →d N(0, σ^2),

where

σ^2 = var[ ϵ_i − {f′(ϵ_i)/f(ϵ_i)} { E(x)^T V_21 + E(x)^T V_22 x_i } ].

6 Acknowledgements

The authors are grateful to the editors and the referee for their insightful comments and suggestions, which greatly improved this article. In addition, the method of KDRE1 is based on the referee's suggestion.

References

Beran, R. (1974). Asymptotically efficient adaptive rank estimates in location models. Annals of Statistics, 2, 248-266.

Bickel, P. J. (1982). On adaptive estimation. Annals of Statistics, 10, 647-671.

Linton, O. and Xiao, Z. (2007). A nonparametric regression estimator that adapts to error distribution of unknown form. Econometric Theory, 23, 371-413.

Owen, A. B. (1988). Empirical likelihood ratio confidence intervals for a single functional. Biometrika, 75, 237-249.

Owen, A. B. (2001). Empirical Likelihood. New York: Chapman & Hall/CRC.

Raykar, V. C. and Duraiswami, R. (2006). Fast optimal bandwidth selection for kernel density estimation. In Proceedings of the Sixth SIAM International Conference on Data Mining, Bethesda, April 2006, 524-528.

Schick, A. (1993). On efficient estimation in regression models. Annals of Statistics, 21, 1486-1521.
Sheather, S. J. and Jones, M. C. (1991). A reliable data-based bandwidth selection method for kernel density estimation. Journal of the Royal Statistical Society, Series B, 53, 683-690.

Silverman, B. W. (1978). Weak and strong uniform consistency of the kernel estimate of a density and its derivatives. Annals of Statistics, 6, 177-184.

Stone, C. (1975). Adaptive maximum likelihood estimation of a location parameter. Annals of Statistics, 3, 267-284.

Wang, Q. and Yao, W. (2012). An adaptive estimation of MAVE. Journal of Multivariate Analysis, 104, 88-100.

Yuan, A. and De Gooijer, J. G. (2007). Semiparametric regression with kernel error model. Scandinavian Journal of Statistics, 34, 841-869.

Yuan, A. (2010). Semiparametric inference with kernel likelihood. Journal of Nonparametric Statistics, 21, 207-228.
Table 1: Simulation Results for the Intercept Estimates.

                                                        Mean (MSE)
Error Distribution                         n      LSE     KDRE    KDRE1   RE(KDRE)  RE(KDRE1)
---------------------------------------------------------------------------------------------
N(0, 1)                                   30    0.146    0.156    0.175     0.939      0.834
(Standard normal)                        100    0.041    0.043    0.047     0.940      0.859
                                         300    0.014    0.015    0.016     0.960      0.893
                                         600    0.007    0.007    0.007     1.010      0.997

U(−2, 2)                                  30    0.183    0.144    0.154     1.266      1.190
(Short-tail distribution)                100    0.060    0.033    0.036     1.807      1.689
                                         300    0.017    0.008    0.009     2.180      1.901
                                         600    0.008    0.004    0.005     2.130      1.890

t_3/√3                                    30    0.159    0.104    0.109     1.529      1.465
(Heavy-tail distribution)                100    0.036    0.026    0.026     1.390      1.394
                                         300    0.012    0.009    0.009     1.315      1.329
                                         600    0.007    0.005    0.005     1.540      1.592

0.95N(0, 0.7²) + 0.05N(0, 3.5²)           30    0.150    0.102    0.106     1.470      1.417
(Contaminated normal)                    100    0.040    0.028    0.028     1.411      1.411
                                         300    0.015    0.009    0.009     1.564      1.597
                                         600    0.008    0.005    0.005     1.513      1.438

0.5N(−1, 0.5²) + 0.5N(1, 0.5²)            30    0.180    0.122    0.111     1.477      1.598
(Multimodal distribution)                100    0.051    0.027    0.027     1.864      1.889
                                         300    0.019    0.009    0.010     2.077      2.010
                                         600    0.009    0.005    0.005     1.918      1.825

0.3N(−1.4, 1) + 0.7N(0.6, 0.4²)           30    0.182    0.115    0.088     1.593      2.083
(Skewed distribution)                    100    0.053    0.028    0.022     2.005      2.412
                                         300    0.016    0.008    0.007     2.102      2.363
                                         600    0.009    0.005    0.004     1.907      2.270
Table 2: Simulation Results for the Slope Estimates.

                                                        Mean (MSE)
Error Distribution                         n      LSE     KDRE    KDRE1   RE(KDRE)  RE(KDRE1)
---------------------------------------------------------------------------------------------
N(0, 1)                                   30    0.418    0.456    0.543     0.918      0.771
(Standard normal)                        100    0.119    0.128    0.144     0.933      0.826
                                         300    0.046    0.049    0.053     0.951      0.878
                                         600    0.020    0.020    0.020     1.020      0.999

U(−2, 2)                                  30    0.520    0.414    0.413     1.259      1.259
(Short-tail distribution)                100    0.169    0.088    0.081     2.001      2.093
                                         300    0.048    0.018    0.018     2.634      2.673
                                         600    0.026    0.009    0.009     3.010      3.070

t_3/√3                                    30    0.526    0.242    0.267     2.174      1.970
(Heavy-tail distribution)                100    0.114    0.065    0.067     1.744      1.691
                                         300    0.038    0.024    0.025     1.571      1.539
                                         600    0.018    0.009    0.009     2.020      2.024

0.95N(0, 0.7²) + 0.05N(0, 3.5²)           30    0.468    0.252    0.278     1.854      1.683
(Contaminated normal)                    100    0.123    0.068    0.071     1.815      1.739
                                         300    0.043    0.020    0.021     2.118      2.097
                                         600    0.023    0.012    0.013     1.904      1.812

0.5N(−1, 0.5²) + 0.5N(1, 0.5²)            30    0.519    0.319    0.256     1.629      1.985
(Multimodal distribution)                100    0.144    0.055    0.050     2.630      2.863
                                         300    0.058    0.019    0.018     3.058      3.157
                                         600    0.023    0.007    0.007     3.358      3.358

0.3N(−1.4, 1) + 0.7N(0.6, 0.4²)           30    0.546    0.239    0.173     2.283      3.148
(Skewed distribution)                    100    0.157    0.042    0.036     3.702      4.396
                                         300    0.046    0.012    0.011     4.007      4.153
                                         600    0.027    0.006    0.006     4.401      4.594
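To make the procedure concrete, the sketch below mimics a single replicate of the Section 3 design: data from Y = 1 + 3X + ϵ with the Case 4 contaminated normal errors, the rule-of-thumb bandwidth h = 1.06 n^{−1/5} σ̂, and the EM iteration of Algorithm 2.1 started from the LSE. This is a minimal illustration under stated assumptions, not the authors' code; all variable names are our own.

```python
import numpy as np

rng = np.random.default_rng(7)

# --- one replicate of the Section 3 design (Case 4 errors) ---
n = 300
x = rng.uniform(0.0, 1.0, n)
X = np.column_stack([np.ones(n), x])          # intercept + covariate
is_clean = rng.random(n) < 0.95
eps = np.where(is_clean, rng.normal(0, 0.7, n), rng.normal(0, 3.5, n))
y = 1.0 + 3.0 * x + eps

# --- initial LSE and rule-of-thumb bandwidth ---
beta_lse, *_ = np.linalg.lstsq(X, y, rcond=None)
resid0 = y - X @ beta_lse                     # initial residuals
h = 1.06 * n ** (-0.2) * resid0.std(ddof=1)   # h = 1.06 n^{-1/5} sigma_hat

# --- EM iteration of Algorithm 2.1 ---
# E-step (2.8): p_ij proportional to K_h(y_i - x_i^T beta - resid0_j), j != i.
# M-step (2.9): weighted least squares; since sum_j p_ij = 1, the normal
# equations reduce to regressing z_i = y_i - sum_j p_ij resid0_j on x_i.
beta = beta_lse.copy()
for _ in range(30):
    e = y - X @ beta
    d = (e[:, None] - resid0[None, :]) / h
    W = np.exp(-0.5 * d ** 2)                 # Gaussian kernel (unnormalized)
    np.fill_diagonal(W, 0.0)                  # leave-one-out: drop j = i
    W /= W.sum(axis=1, keepdims=True)         # E-step weights p_ij
    z = y - W @ resid0                        # effective response for M-step
    beta, *_ = np.linalg.lstsq(X, z, rcond=None)

beta_kdre = beta
```

Repeating this over many replicates and comparing mean squared errors against the LSE should reproduce the qualitative pattern of Tables 1–2 (relative efficiency above one for non-normal errors); with normal errors the two estimates are nearly identical.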