Local Polynomial Regression


VI. Local Polynomial Regression

(1) Global polynomial regression

We observe random pairs (X_1, Y_1), ..., (X_n, Y_n) iid as (X, Y) and want to estimate m(x) = E(Y | X = x) based on (X_1, Y_1), ..., (X_n, Y_n).

(i) Averaging as constant regression
    Model: m(u) ≡ α (constant).
    LSE: α̂ = argmin_α Σ_{i=1}^n (Y_i - α)^2 = Ȳ.

(ii) Linear regression
    1. Model: m(u) = α_0 + α_1 u.
       LSE: (α̂_0, α̂_1) = argmin_{α_0, α_1} Σ_{i=1}^n (Y_i - α_0 - α_1 X_i)^2
            = ( Ȳ - α̂_1 X̄,  Σ_{i=1}^n (X_i - X̄) Y_i / Σ_{i=1}^n (X_i - X̄)^2 ).
    2. Model: m(u) = β_0 + β_1 (u - x), where x is the point at which we want to estimate m; thus α_1 = β_1 and α_0 = β_0 - β_1 x.
       LSE: (β̂_0, β̂_1) = argmin_{β_0, β_1} Σ_{i=1}^n (Y_i - β_0 - β_1 (X_i - x))^2
            = ( Ȳ - β̂_1 (X̄ - x),  Σ_{i=1}^n {(X_i - x) - (X̄ - x)} Y_i / Σ_{i=1}^n {(X_i - x) - (X̄ - x)}^2 )
            = ( Ȳ - β̂_1 (X̄ - x),  Σ_{i=1}^n (X_i - X̄) Y_i / Σ_{i=1}^n (X_i - X̄)^2 ).
       Hence m̂(x) = α̂_0 + α̂_1 x = β̂_0 and m̂'(x) = α̂_1 = β̂_1.
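A small numerical check (not part of the original notes) of the equivalence of the two parameterizations in (ii): the centred least squares fit returns the fitted value and slope at x directly. The simulated data and the use of numpy's polyfit are illustrative choices only.

```python
import numpy as np

# Uncentred model m(u) = a0 + a1*u versus centred model m(u) = b0 + b1*(u - x):
# least squares gives b0 = a0 + a1*x (= fitted value at x) and b1 = a1 (= slope).
rng = np.random.default_rng(1)
X = rng.uniform(0, 1, 100)
Y = 1.0 + 2.0 * X + 0.1 * rng.standard_normal(100)
x = 0.4

a1, a0 = np.polyfit(X, Y, 1)        # uncentred fit: slope, intercept
b1, b0 = np.polyfit(X - x, Y, 1)    # centred fit at x

print(np.allclose(b0, a0 + a1 * x), np.allclose(b1, a1))   # True True
```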

(iii) Polynomial regression
    1. Model: m(u) = α_0 + α_1 u + α_2 u^2 + ··· + α_p u^p.
       LSE: (α̂_0, ..., α̂_p) = argmin_{α_0,...,α_p} Σ_{i=1}^n (Y_i - α_0 - α_1 X_i - ··· - α_p X_i^p)^2.
    2. Model: m(u) = β_0 + β_1 (u - x) + β_2 (u - x)^2 + ··· + β_p (u - x)^p.
       LSE: (β̂_0, ..., β̂_p) = argmin_{β_0,...,β_p} Σ_{i=1}^n (Y_i - β_0 - β_1 (X_i - x) - ··· - β_p (X_i - x)^p)^2.
       Under the model, m(x) = α_0 + α_1 x + ··· + α_p x^p = β_0 and
           m^{(r)}(x) = Σ_{j=r}^p j(j-1)···(j-r+1) α_j x^{j-r} = r! β_r,  r = 1, ..., p,
       and the same relations hold for the estimates:
           m̂(x) = α̂_0 + α̂_1 x + ··· + α̂_p x^p = β̂_0,
           m̂^{(r)}(x) = Σ_{j=r}^p j(j-1)···(j-r+1) α̂_j x^{j-r} = r! β̂_r,  r = 1, ..., p.

(2) Local polynomial regression: basic idea

Nothing is assumed for the structure of m.

(i) Local constant regression
    1. Idea: m(u) ≈ β_0 (constant) when u ≈ x. Use only the (X_i, Y_i) with X_i ≈ x and approximate m(X_i) for such X_i's by an unknown constant β_0.
    2. Suppose we use only the (X_i, Y_i) with |X_i - x| ≤ h:
           β̂_0 = argmin_{β_0} Σ_{i=1}^n (Y_i - β_0)^2 I_{(-h,h)}(X_i - x)
               = argmin_{β_0} Σ_{i=1}^n (Y_i - β_0)^2 (1/h) (1/2) I_{(-1,1)}((X_i - x)/h),
       and define m̂(x; h) ≡ β̂_0. (See Figure 1, p.4 of the lecture note.)
    3. Generalization of the weight function:
           m̂(x; h) ≡ β̂_0 = argmin_{β_0} Σ_{i=1}^n (Y_i - β_0)^2 K_h(X_i - x) = Σ_{i=1}^n K_h(X_i - x) Y_i / Σ_{i=1}^n K_h(X_i - x).
       This is the famous Nadaraya-Watson estimator.
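The Nadaraya-Watson estimator is simple to code. Below is a minimal Python sketch (not from the notes); the Epanechnikov kernel, the function name and the simulated data are illustrative assumptions.

```python
import numpy as np

def nadaraya_watson(x0, X, Y, h,
                    kernel=lambda u: 0.75 * (1 - u**2) * (np.abs(u) <= 1)):
    """Local constant (Nadaraya-Watson) estimate of m(x0) = E(Y | X = x0)."""
    w = kernel((X - x0) / h) / h        # K_h(X_i - x0)
    if w.sum() == 0:                    # no design points inside the window
        return np.nan
    return np.sum(w * Y) / np.sum(w)

# toy usage with simulated data
rng = np.random.default_rng(0)
X = rng.uniform(0, 1, 200)
Y = np.sin(2 * np.pi * X) + 0.3 * rng.standard_normal(200)
print(nadaraya_watson(0.5, X, Y, h=0.1))    # should be near sin(pi) = 0
```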

(ii) Local polynomial regression
    1. Idea: m(u) ≈ β_0 + β_1 (u - x) + ··· + β_p (u - x)^p when u ≈ x (a better approximation than local constant modelling!), with β_r = m^{(r)}(x)/r!.
    2. Definition:
           (β̂_0, ..., β̂_p) = argmin_{β_0,...,β_p} Σ_{i=1}^n (Y_i - β_0 - β_1 (X_i - x) - ··· - β_p (X_i - x)^p)^2 K_h(X_i - x),
           m̂^{(r)}(x; h) ≡ r! β̂_r,  r = 0, 1, ..., p.
       (See Figure 2, p.6 of the lecture note.)
       Note: β̂_r depends on x, the point of interest.
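Computationally, the definition above is a weighted least squares problem. A hedged Python sketch (illustrative function names; Epanechnikov kernel assumed):

```python
import numpy as np

def epanechnikov(u):
    return 0.75 * (1.0 - u**2) * (np.abs(u) <= 1)

def local_poly_fit(x0, X, Y, h, p=1, kernel=epanechnikov):
    """Weighted least squares local polynomial fit of order p at x0.

    Returns beta = (beta_0, ..., beta_p); m(x0) is estimated by beta[0]
    and m^(r)(x0) by factorial(r) * beta[r]."""
    u = X - x0
    w = kernel(u / h) / h                        # K_h(X_i - x0)
    D = np.vander(u, N=p + 1, increasing=True)   # columns 1, u, ..., u^p
    sw = np.sqrt(w)
    # Solve min_beta sum_i w_i (Y_i - D_i beta)^2.  The system is rank
    # deficient when too few design points fall inside the kernel window
    # (the sparseness issue discussed in (8) below).
    beta, *_ = np.linalg.lstsq(D * sw[:, None], Y * sw, rcond=None)
    return beta

# Example: beta = local_poly_fit(0.5, X, Y, h=0.15, p=2)
# gives m_hat(0.5) = beta[0] and m_hat'(0.5) = beta[1].
```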

(3) Theory for local constant fitting (Nadaraya-Watson estimator)

f: density of the X_i's, supported on [0, 1]; v(u) ≡ var(Y | X = u).

(i) Asymptotic MSE in Int(supp(f)). Assume:
    - f has a continuous derivative at x, and f(x) > 0
    - v is continuous at x
    - m has two continuous derivatives at x
    - K is a symmetric probability density supported on a compact set, say [-1, 1], and is bounded
Then
    bias(m̂(x; h) | X_1, ..., X_n) = (1/2) {m''(x) f(x) + 2 m'(x) f'(x)} / f(x) · µ_2(K) h^2 + o_p(h^2) + O_p(n^{-1/2} h^{1/2}),
    var(m̂(x; h) | X_1, ..., X_n) = v(x)/f(x) · ∫K^2 · n^{-1} h^{-1} + o_p(n^{-1} h^{-1}).

Proof.
1. (Bias part)
       E{m̂(x; h) | X_1, ..., X_n} - m(x) = [ (1/n) Σ_{i=1}^n K_h(X_i - x) {m(X_i) - m(x)} ] / [ (1/n) Σ_{i=1}^n K_h(X_i - x) ].
   Recall the formula given on p.8 and note that Z_n = E Z_n + O_p(√var(Z_n)). Then the above ratio equals
       [ K_h * {m(·) - m(x)} f(·)(x) + O_p( {n^{-1} h^{-1} (K^2)_h * {m(·) - m(x)}^2 f(·)(x)}^{1/2} ) ] / [ K_h * f(x) + O_p(n^{-1/2} h^{-1/2}) ]
       = [ (1/2) {m''(x) f(x) + 2 m'(x) f'(x)} µ_2(K) h^2 + O_p(n^{-1/2} h^{1/2}) + o_p(h^2) ] / [ f(x) + o(h) + O_p(n^{-1/2} h^{-1/2}) ].
2. (Variance part)
       var{m̂(x; h) | X_1, ..., X_n} = Σ_{i=1}^n [ n^{-1} K_h(x - X_i) / { n^{-1} Σ_{j=1}^n K_h(x - X_j) } ]^2 v(X_i)
       = n^{-1} h^{-1} [ n^{-1} Σ_{i=1}^n (K^2)_h(x - X_i) v(X_i) ] / [ n^{-1} Σ_{i=1}^n K_h(x - X_i) ]^2
       = n^{-1} h^{-1} [ (K^2)_h * (vf)(x) + O_p(n^{-1/2} h^{-1/2}) ] / [ f(x) + o(h) + O_p(n^{-1/2} h^{-1/2}) ]^2.

(ii) Asymptotic MSE at boundaries: x ≡ x_n = αh (0 ≤ α < 1). Assume:
    - f is right continuous at 0, and f(0+) > 0
    - v is right continuous at 0
    - m is differentiable on (0, ε) for some ε > 0, and m' is right continuous at 0
    - K is a symmetric probability density supported on [-1, 1] and is bounded

Then
    bias(m̂(x; h) | X_1, ..., X_n) = {µ_1(K; α)/µ_0(K; α)} m'(0+) h + o(h) + O_p(n^{-1/2} h^{1/2}),
    var(m̂(x; h) | X_1, ..., X_n) = {µ_0(K^2; α)/µ_0(K; α)^2} · v(0+)/f(0+) · n^{-1} h^{-1} + o_p(n^{-1} h^{-1}).

Proof. Apply the formula in (8)-(i)-1, p.19.
1. (Bias part)
       K_h * {m(·) - m(x)} f(·)(x) = µ_1(K; α) m'(x) f(x) h + o(h),
       K_h * f(x) = µ_0(K; α) f(x) + o(1),
       (K^2)_h * {m(·) - m(x)}^2 f(·)(x) = O(h^2),
       (K^2)_h * f(x) = µ_0(K^2; α) f(x) + o(1).
   Hence
       bias(m̂(x; h) | X_1, ..., X_n) = [ µ_1(K; α) m'(x) f(x) h + o(h) + O_p(n^{-1/2} h^{1/2}) ] / [ µ_0(K; α) f(x) + o(1) + O_p(n^{-1/2} h^{-1/2}) ].
2. (Variance part)
       (K^2)_h * (vf)(x) = µ_0(K^2; α) v(x) f(x) + o(1),
       var(m̂(x; h) | X_1, ..., X_n) = n^{-1} h^{-1} { µ_0(K^2; α) v(x) f(x) + O_p(n^{-1/2} h^{-1/2}) + o(1) } / { µ_0(K; α) f(x) + o(1) + O_p(n^{-1/2} h^{-1/2}) }^2.

Remark. In fact, it is unnecessary for K to be a symmetric probability density.

(4) Theory for local linear fitting

f: density of the X_i's, supported on [0, 1]; v(u) ≡ var(Y | X = u).

(i) Asymptotic MSE in Int(supp(f)). Assume:

    - f is continuous at x, and f(x) > 0
    - v is continuous at x
    - m has two continuous derivatives at x
    - K is supported on [-1, 1] and is bounded
Then
    bias(m̂(x; h) | X_1, ..., X_n) = (1/2) · {µ_2(K)^2 - µ_1(K) µ_3(K)} / {µ_0(K) µ_2(K) - µ_1(K)^2} · m''(x) h^2 + o(h^2) + O_p(n^{-1/2} h^{3/2}),
    var(m̂(x; h) | X_1, ..., X_n) = [ ∫ {µ_2(K) - z µ_1(K)}^2 K^2(z) dz / {µ_0(K) µ_2(K) - µ_1(K)^2}^2 ] · v(x)/f(x) · n^{-1} h^{-1} + o_p(n^{-1} h^{-1}).

Note 1. We do not assume that K is a symmetric probability density here. If we do, the conditional bias and variance reduce to
    bias(m̂(x; h) | X_1, ..., X_n) = (1/2) µ_2(K) m''(x) h^2 + o(h^2) + O_p(n^{-1/2} h^{3/2}),
    var(m̂(x; h) | X_1, ..., X_n) = µ_0(K^2) · v(x)/f(x) · n^{-1} h^{-1} + o_p(n^{-1} h^{-1}).

Note 2. Why is the conditional bias still O_p(h^2) even when we put unbalanced weights around the point x?
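A quick numerical check (assuming the Epanechnikov kernel K(z) = 0.75(1 - z^2) on [-1, 1]) of the constants µ_2(K) and µ_0(K^2) entering the simplified expressions in Note 1:

```python
import numpy as np
from scipy.integrate import quad

K = lambda z: 0.75 * (1.0 - z**2)              # Epanechnikov kernel on [-1, 1]
mu2, _ = quad(lambda z: z**2 * K(z), -1, 1)    # second moment mu_2(K)
RK, _ = quad(lambda z: K(z)**2, -1, 1)         # mu_0(K^2) = int K^2
print(round(mu2, 6), round(RK, 6))             # 0.2 and 0.6
```

So for this kernel the leading bias constant is µ_2(K)/2 = 0.1 and the leading variance constant is µ_0(K^2) = 0.6.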

[Figure: local constant vs. local linear fits when one puts weights only on the right-hand side of the point x.]

We can expect O_p(h^2) conditional bias at boundaries too, which will be shown shortly.

Proof of the bias and variance formulas.
    (β̂_0, β̂_1) = argmin_{β_0, β_1} Σ_{i=1}^n (Y_i - β_0 - β_1 (X_i - x))^2 K_h(X_i - x).
Write â_0 = β̂_0 - m(x), â_1 = h(β̂_1 - m'(x)), and let m̄(u) = m(x) + m'(x)(u - x) be the linear approximation of m near x. Then
    Σ_{i=1}^n (Y_i - β_0 - β_1 (X_i - x))^2 K_h(X_i - x)
    = Σ_{i=1}^n { Y_i - m(x) - m'(x)(X_i - x) - â_0 - â_1 (X_i - x)/h }^2 K_h(X_i - x)
    = Σ_{i=1}^n { Y_i - m̄(X_i) - (â_0, â_1)(1, (X_i - x)/h)^T }^2 K_h(X_i - x),
so (â_0, â_1) minimizes, with respect to a^T = (a_0, a_1), the weighted sum of squares

    Σ_{i=1}^n (Y_i^* - a^T X_i^*)^2 K_h(X_i - x) = (Y^* - X^* a)^T W (Y^* - X^* a),
where Y^* = (Y_1^*, ..., Y_n^*)^T with Y_i^* = Y_i - m̄(X_i), X^* = (X_1^*, ..., X_n^*)^T with X_i^* = (1, (X_i - x)/h)^T, and W = Diag(K_h(X_i - x)). Hence
    â = (â_0, â_1)^T = (X^{*T} W X^*)^{-1} X^{*T} W Y^*.
1. (Bias part)
       E(â | X_1, ..., X_n) = (X^{*T} W X^*)^{-1} X^{*T} W E(Y^* | X_1, ..., X_n).
   For r, s = 0, 1,
       (1/n)(X^{*T} W X^*)_{r,s} = (1/n) Σ_{i=1}^n ((X_i - x)/h)^{r+s} K_h(X_i - x)
       = (1/n) Σ_{i=1}^n (P_{r+s} K)_h(X_i - x)
       = (P_{r+s} K)_h * f(x) + O_p(n^{-1/2} h^{-1/2})
       = µ_{r+s}(K) f(x) + O_p(n^{-1/2} h^{-1/2}) + o(1).
   (Here A_{r,s} means the (r, s) component of the matrix A with indices counted from zero; P_l(u) = u^l, P_l K(u) = u^l K(u), and (P_l K)_h(u) = (1/h)(u/h)^l K(u/h).)
   For r = 0, 1 (a_r means the rth component of the vector a with indices counted from zero),
       (1/n){X^{*T} W E(Y^* | X_1, ..., X_n)}_r = (1/n) Σ_{i=1}^n ((X_i - x)/h)^r K_h(X_i - x) {m(X_i) - m̄(X_i)}
       = (1/n) Σ_{i=1}^n (P_r K)_h(X_i - x) {m(X_i) - m̄(X_i)}
       = (P_r K)_h * {m(·) - m̄(·)} f(·)(x) + O_p(n^{-1/2} h^{3/2})
       = (1/2) µ_2(P_r K) m''(x) f(x) h^2 + o(h^2) + O_p(n^{-1/2} h^{3/2})

= 1 2 µ r+2(k)m (x)f(x)h 2 + o(h 2 ) + O p (n 1/2 h 3/2 ) ( ) ( ) µ0 (K) µ 1 (K) µ2 (K) Write N =, γ = µ 1 (K) µ 2 (K) µ 3 (K) Then, E(â 0 X 1,, X n ) = 1 2 (N 1 γ) 0 m (x)h 2 + o(h 2 ) + O p (n 1/2 h 3/2 ); ( )( ) (N 1 γ) 0 = 1 µ2 (K) µ 1 (K) µ2 (K) µ 0 (K)µ 2 (K) µ 1 (K) 2 µ 1 (K) µ 0 (K) µ 3 (K) 2 (variance part) var(â X 1,, X n ) = µ 2(K) 2 µ 1 (K)µ 3 (K) µ 0 (K)µ 2 (K) µ 1 (K) 2 = (X T WX ) 1 X T W var(y X 1,, X n ) WX (X T WX ) 1 = (X T WX ) 1 X T ΣX (X T WX ) 1 where Σ = Diag(v(X i )(K h (X i x)) 2 ) 0 h ( X T ΣX ) n r,s = 1 ( ) r+s Xi x (K 2 ) h (X i x)v(x i ); r, s = 0, 1 n h = µ r+s (K 2 )v(x)f(x) + o p (1) ( ) µ0 (K 2 ) µ 1 (K 2 ) Write S =. Then, µ 1 (K 2 ) µ 2 (K 2 ) var(â 0 X 1,, X n ) = (N 1 SN 1 v(x) ) 0,0 f(x) n 1 h 1 + o p (n 1 h 1 ) 1 1 (N 1 SN 1 ) 0,0 = (N 1 ) 0,r S r,s (N 1 ) s,0 = r=0 s=0 {µ2 (K) zµ 1 (K)} 2 K(z) 2 dz {µ 0 (K)µ 2 (K) µ 1 (K) 2 } 2 63

Note 3. Let
    m̂_I(x; h) ≡ (1/n) Σ_{i=1}^n {f(X_i)}^{-1} L_h(X_i - x) Y_i : a prototype of the internal regression estimator.
If we take
    L(z) = K*(z) ≡ [ {µ_2(K) - z µ_1(K)} / {µ_0(K) µ_2(K) - µ_1(K)^2} ] K(z),
then m̂_I(x; h) has the same first-order asymptotic properties as m̂(x; h). We call K* the equivalent kernel.

(ii) Asymptotic MSE at boundaries: x ≡ x_n = αh (0 ≤ α < 1). Assume:
    - f is right continuous at 0, and f(0+) > 0
    - v is right continuous at 0
    - m is twice differentiable on (0, ε) for some ε > 0, and m'' is right continuous at 0
    - K is supported on [-1, 1] and is bounded
Then
    bias(m̂(x; h) | X_1, ..., X_n) = (1/2) {µ_2(K; α)^2 - µ_1(K; α) µ_3(K; α)} / {µ_0(K; α) µ_2(K; α) - µ_1(K; α)^2} · m''(0+) h^2 + o(h^2) + O_p(n^{-1/2} h^{3/2}),
    var(m̂(x; h) | X_1, ..., X_n) = [ ∫_{-1}^{α} {µ_2(K; α) - z µ_1(K; α)}^2 K(z)^2 dz / {µ_0(K; α) µ_2(K; α) - µ_1(K; α)^2}^2 ] · v(0+)/f(0+) · n^{-1} h^{-1} + o_p(n^{-1} h^{-1}).

(5) Theory for local polynomial fitting

p: the order of the local polynomial fit; f: density of the X_i's, supported on [0, 1]; v(u) ≡ var(Y | X = u).

(i) Asymptotic MSE in Int(supp(f)): even p case

Assume:
    - f has a continuous derivative at x, and f(x) > 0
    - v is continuous at x
    - m has (p + 2) continuous derivatives at x
    - K is supported on [-1, 1] and is bounded
Then
    bias(m̂(x; h) | X_1, ..., X_n) = (N^{-1} γ)_0 m^{(p+1)}(x)/(p+1)! h^{p+1}
        + [ (N^{-1} δ)_0 m^{(p+2)}(x)/(p+2)! + (N^{-1} δ - N^{-1} J N^{-1} γ)_0 {m^{(p+1)}(x)/(p+1)!} f'(x)/f(x) ] h^{p+2}
        + o(h^{p+2}) + O_p(n^{-1/2} h^{(2p+1)/2}),
    var(m̂(x; h) | X_1, ..., X_n) = (N^{-1} S N^{-1})_{0,0} v(x)/f(x) n^{-1} h^{-1} + o_p(n^{-1} h^{-1}),
where
    N = (µ_{r+s}(K)) : (p+1) × (p+1) matrix,
    γ = (µ_{p+1+r}(K)) : (p+1) × 1 vector,
    δ = (µ_{p+2+r}(K)) : (p+1) × 1 vector,
    J = (µ_{r+s+1}(K)) : (p+1) × (p+1) matrix,
    S = (µ_{r+s}(K^2)) : (p+1) × (p+1) matrix.

Note 1. Suppose all the odd moments of K vanish, i.e. µ_j(K) = 0 for all odd j. Then N_{r,s} = 0 for r + s odd, which implies (N^{-1})_{r,s} = 0 for r + s odd too. (See Appendix (4) in the lecture note "Nonparametric Regression Function Estimation".) Hence
    (N^{-1} γ)_0 = Σ_{s=0}^{p} (N^{-1})_{0,s} µ_{p+1+s} = Σ_{s even} (N^{-1})_{0,s} µ_{p+1+s} = 0,
since p + 1 + s is odd when p and s are even, and therefore
    bias(m̂(x; h) | X_1, ..., X_n)
        = [ (N^{-1} δ)_0 m^{(p+2)}(x)/(p+2)! + (N^{-1} δ - N^{-1} J N^{-1} γ)_0 {m^{(p+1)}(x)/(p+1)!} f'(x)/f(x) ] h^{p+2} + o(h^{p+2}) + O_p(n^{-1/2} h^{(2p+1)/2}).

(ii) Asymptotic MSE at boundaries: even p case. x ≡ x_n = αh, 0 ≤ α < 1. Assume:
    - f is right continuous at 0, and f(0+) > 0
    - v is right continuous at 0
    - m is (p + 1) times differentiable on (0, ε) for some ε > 0, and m^{(p+1)} is right continuous at 0
    - K is supported on [-1, 1] and is bounded
Then
    bias(m̂(x; h) | X_1, ..., X_n) = (N^{-1} γ)_0 m^{(p+1)}(0+)/(p+1)! h^{p+1} + o(h^{p+1}) + O_p(n^{-1/2} h^{(2p+1)/2}),
    var(m̂(x; h) | X_1, ..., X_n) = (N^{-1} S N^{-1})_{0,0} v(0+)/f(0+) n^{-1} h^{-1} + o_p(n^{-1} h^{-1}),
where all the entries of N, γ and S are replaced by the corresponding incomplete moments of K and K^2.

(iii) Asymptotic MSE in Int(supp(f)): odd p. Assume:
    - f is continuous at x, and f(x) > 0
    - v is continuous at x
    - m has (p + 1) continuous derivatives at x
    - K is supported on [-1, 1] and is bounded
Then
    bias(m̂(x; h) | X_1, ..., X_n) = (N^{-1} γ)_0 m^{(p+1)}(x)/(p+1)! h^{p+1} + o(h^{p+1}) + O_p(n^{-1/2} h^{(2p+1)/2}),
    var(m̂(x; h) | X_1, ..., X_n) = (N^{-1} S N^{-1})_{0,0} v(x)/f(x) n^{-1} h^{-1} + o_p(n^{-1} h^{-1}).

(iv) Asymptotic MSE at boundaries: odd p. The conditions and the formulas for bias and variance are the same as those in (ii).

Proof of (i)-(iv). It suffices to prove (i). Define
    (β̂_0, β̂_1, ..., β̂_p) = argmin_{β_0,...,β_p} Σ_{i=1}^n {Y_i - β_0 - β_1 (X_i - x) - ··· - β_p (X_i - x)^p}^2 K_h(X_i - x).
Write â_0 = β̂_0 - m(x), â_1 = h(β̂_1 - m'(x)), ..., â_p = h^p (β̂_p - m^{(p)}(x)/p!), and let
    m̄(u) = m(x) + m'(x)(u - x) + ··· + {m^{(p)}(x)/p!}(u - x)^p.
Then
    Σ_{i=1}^n {Y_i - β_0 - β_1 (X_i - x) - ··· - β_p (X_i - x)^p}^2 K_h(X_i - x)
    = Σ_{i=1}^n { Y_i - m̄(X_i) - â_0 - â_1 (X_i - x)/h - ··· - â_p ((X_i - x)/h)^p }^2 K_h(X_i - x),
so (â_0, ..., â_p) minimizes, with respect to a^T = (a_0, ..., a_p), the weighted sum of squares

    Σ_{i=1}^n (Y_i^* - a^T X_i^*)^2 K_h(X_i - x) = (Y^* - X^* a)^T W (Y^* - X^* a).
Hence â = (â_0, ..., â_p)^T = (X^{*T} W X^*)^{-1} X^{*T} W Y^*, where Y^* = (Y_1^*, ..., Y_n^*)^T with Y_i^* = Y_i - m̄(X_i), X^* = (X_1^*, ..., X_n^*)^T with X_i^* = (1, (X_i - x)/h, ..., ((X_i - x)/h)^p)^T, and W = Diag(K_h(X_i - x)).
1. (Bias part)
       E(â | X_1, ..., X_n) = (X^{*T} W X^*)^{-1} X^{*T} W E(Y^* | X_1, ..., X_n),
       (1/n)(X^{*T} W X^*) = N f(x) + J f'(x) h + o(h) + O_p(n^{-1/2} h^{-1/2}),
       (1/n){X^{*T} W E(Y^* | X_1, ..., X_n)} = γ {m^{(p+1)}(x)/(p+1)!} f(x) h^{p+1} + δ { m^{(p+2)}(x)/(p+2)! f(x) + m^{(p+1)}(x)/(p+1)! f'(x) } h^{p+2} + o(h^{p+2}) + O_p(n^{-1/2} h^{(2p+1)/2}),
       [ N f(x) + J f'(x) h + o(h) + O_p(n^{-1/2} h^{-1/2}) ]^{-1} = N^{-1} {1/f(x)} - N^{-1} J N^{-1} {f'(x)/f(x)^2} h + o(h) + O_p(n^{-1/2} h^{-1/2}).
2. (Variance part)
       var(â | X_1, ..., X_n) = (X^{*T} W X^*)^{-1} X^{*T} Σ X^* (X^{*T} W X^*)^{-1}, where Σ = Diag(v(X_i) K_h(X_i - x)^2),
       (h/n)(X^{*T} Σ X^*) = S v(x) f(x) + o_p(1),
       (1/n)(X^{*T} W X^*) = N f(x) + o_p(1) + O_p(n^{-1/2} h^{-1/2}),
       var(â | X_1, ..., X_n) = n^{-1} h^{-1} {(1/n) X^{*T} W X^*}^{-1} {(h/n) X^{*T} Σ X^*} {(1/n) X^{*T} W X^*}^{-1} = n^{-1} h^{-1} N^{-1} S N^{-1} v(x)/f(x) + o_p(n^{-1} h^{-1}).

(v) Estimation of m^{(r)} (m̂^{(r)}(x; h) = r! β̂_r), r = 0, ..., p

Under the conditions in (i),
    bias(m̂^{(r)}(x; h) | X_1, ..., X_n) = r! h^{-r} E(â_r | X_1, ..., X_n)
        = r! (N^{-1} γ)_r m^{(p+1)}(x)/(p+1)! h^{p-r+1}
        + r! [ (N^{-1} δ)_r m^{(p+2)}(x)/(p+2)! + (N^{-1} δ - N^{-1} J N^{-1} γ)_r {m^{(p+1)}(x)/(p+1)!} f'(x)/f(x) ] h^{p-r+2}
        + o(h^{p-r+2}) + O_p(n^{-1/2} h^{p-r+1/2}),
    var(m̂^{(r)}(x; h) | X_1, ..., X_n) = (r!)^2 h^{-2r} var(â_r | X_1, ..., X_n) = (r!)^2 (N^{-1} S N^{-1})_{r,r} v(x)/f(x) n^{-1} h^{-2r-1} + o_p(n^{-1} h^{-2r-1}).
If (p - r) is even and all the odd moments of K vanish, then
    (N^{-1} γ)_r = Σ_{s=0}^{p} (N^{-1})_{r,s} µ_{p+1+s} = Σ_{s: r+s even} (N^{-1})_{r,s} µ_{p+1+s} = Σ_{s: p+s even} (N^{-1})_{r,s} µ_{p+1+s} = 0   (p - r even),
so that
    bias(m̂^{(r)}(x; h) | X_1, ..., X_n) = r! [ (N^{-1} δ)_r m^{(p+2)}(x)/(p+2)! + (N^{-1} δ - N^{-1} J N^{-1} γ)_r {m^{(p+1)}(x)/(p+1)!} f'(x)/f(x) ] h^{p-r+2} + o(h^{p-r+2}) + O_p(n^{-1/2} h^{p-r+1/2}).

Under the conditions in (ii) (p - r even),
    bias(m̂^{(r)}(x; h) | X_1, ..., X_n) = r! (N^{-1} γ)_r m^{(p+1)}(0+)/(p+1)! h^{p-r+1} + o(h^{p-r+1}) + O_p(n^{-1/2} h^{p-r+1/2}),
    var(m̂^{(r)}(x; h) | X_1, ..., X_n) = (r!)^2 (N^{-1} S N^{-1})_{r,r} v(0+)/f(0+) n^{-1} h^{-2r-1} + o_p(n^{-1} h^{-2r-1}).
Under the conditions in (iii) (p - r odd),
    bias(m̂^{(r)}(x; h) | X_1, ..., X_n) = r! (N^{-1} γ)_r m^{(p+1)}(x)/(p+1)! h^{p-r+1} + o(h^{p-r+1}) + O_p(n^{-1/2} h^{p-r+1/2}),
    var(m̂^{(r)}(x; h) | X_1, ..., X_n) = (r!)^2 (N^{-1} S N^{-1})_{r,r} v(x)/f(x) n^{-1} h^{-2r-1} + o_p(n^{-1} h^{-2r-1}).
Under the conditions in (iv) (p - r odd), the formulas for bias and variance are the same as those in the second case.

(vi) Optimal bandwidth for estimating m^{(r)}, r = 0, ..., p

Define
    C_1 = (r!)^2 [ (N^{-1} γ)_r m^{(p+1)}(x)/(p+1)! ]^2,
    C_2 = (r!)^2 [ (N^{-1} δ)_r m^{(p+2)}(x)/(p+2)! + (N^{-1} δ - N^{-1} J N^{-1} γ)_r {m^{(p+1)}(x)/(p+1)!} f'(x)/f(x) ]^2,
    C_3 = (r!)^2 (N^{-1} S N^{-1})_{r,r} v(x)/f(x).

Case 1: interior point, even (p - r), symmetric kernel.
    h_opt = { (2r + 1) C_3 / (2(p - r + 2) C_2) }^{1/(2p+5)} n^{-1/(2p+5)},
and the minimum (conditional) MSE is of order n^{-2(p-r+2)/(2p+5)}.

Case 2: interior point with odd (p - r), or boundary point (x = αh, 0 ≤ α < 1) with odd (p - r), or boundary point with even (p - r).
    h_opt = { (2r + 1) C_3 / (2(p - r + 1) C_1) }^{1/(2p+3)} n^{-1/(2p+3)},
and the minimum (conditional) MSE is of order n^{-2(p-r+1)/(2p+3)}.

(vii) Which order to fit when estimating m^{(r)}?

    polynomial order :   r                      r+1                    r+2                    r+3
    interior bias    :   O(h^2)                 O(h^2)                 O(h^4)                 O(h^4)
    boundary bias    :   O(h)                   O(h^2)                 O(h^3)                 O(h^4)
    variance         :   c_0 n^{-1} h^{-2r-1}   c_1 n^{-1} h^{-2r-1}   c_2 n^{-1} h^{-2r-1}   c_3 n^{-1} h^{-2r-1}

For polynomial orders r + j with j even, the orders of the interior biases are based on the use of symmetric kernels.
    c_0 = c_1 < c_2 = c_3 < ··· for interior points (see the lemma below)
    c_0 < c_1 < c_2 < c_3 < ··· for boundary points
When p - r is even, the leading bias term involves a complicated constant factor depending on m and f, and the bias order at the boundary is inferior to that in the interior. When p - r is odd, the leading bias term involves a relatively simple constant factor depending only on m, and the boundary bias is of the same order as the interior bias. Hence p with (p - r) odd is recommended.

Lemma. Let A, B be k × k matrices and Ã, B̃ be (k+1) × (k+1) matrices such that Ã_{r,s} = A_{r,s} and B̃_{r,s} = B_{r,s} for 0 ≤ r, s ≤ k - 1. Suppose Ã_{r,s} = B̃_{r,s} = 0 for (r, s) with r + s odd, and the same for A and B. Suppose that the matrices obtained by deleting all the odd-numbered columns and rows from A and Ã, and those obtained by deleting all the even-numbered columns and rows, are invertible. Then
    (Ã^{-1} B̃ Ã^{-1})_{r,r} = (A^{-1} B A^{-1})_{r,r}   when k - r is odd.
Proof. See pp.10-11 of the lecture note "Nonparametric Regression Function Estimation".

(6) Minimax efficiency of the local linear smoother

x: a fixed interior point of supp(f).
(A1) f is continuous at x, and f(x) > 0.
(A2) v is continuous at x.
C_2 = { m : |m(u) - m(x) - m'(x)(u - x)| ≤ (C/2)(u - x)^2 }.

(i) Best linear smoother
    L: the class of all linear estimators of the form m̂(x) = Σ_{i=1}^n w_i(x; X_1, ..., X_n) Y_i,
    R_L(n, C_2) ≡ inf_{m̂ ∈ L} sup_{m ∈ C_2} E{(m̂(x) - m(x))^2 | X_1, ..., X_n} : linear minimax risk.
Then, under (A1) and (A2),
    R_L(n, C_2) / sup_{m ∈ C_2} E{(m̂(x; h) - m(x))^2 | X_1, ..., X_n} →_p 1,
where m̂(x; h) is the local linear smoother with the Epanechnikov kernel K and the bandwidth
    h = [ ∫K^2 · v(x) / {µ_2(K)^2 C^2 f(x)} ]^{1/5} n^{-1/5}.

(ii) 89.4% minimax efficiency
    T: the class of all estimators of m(x),
    R(n, C_2) ≡ inf_{m̂ ∈ T} sup_{m ∈ C_2} E{(m̂(x) - m(x))^2 | X_1, ..., X_n} : minimax risk.
Under (A1) and (A2),
    R(n, C_2) / sup_{m ∈ C_2} E{(m̂(x; h) - m(x))^2 | X_1, ..., X_n} ≥ (0.894)^2 + o_p(1),
where m̂(x; h) is the local linear smoother with the Epanechnikov kernel and the bandwidth h given in (i).

Reference: Fan, J. (1993). Local linear regression smoothers and their minimax efficiencies. The Annals of Statistics, vol. 21, pp. 196-216.
Note: Minimax efficiency for general local polynomial estimators and derivative estimation is dealt with in Fan, Gasser, Gijbels, Brockmann and Engel (1997), Annals of the Institute of Statistical Mathematics, vol. 49, pp. 79-99. The fact that the local linear smoother is also the best among all linear estimators at boundary points was proved by Cheng, Fan and Marron (1997), The Annals of Statistics, vol. 25, pp. 1691-1708.

(7) Bandwidth selection

(i) Cross-validation
    Prediction error: (1/n) Σ_{i=1}^n [Y_i - m̂(X_i; h)]^2.
    CV(h) ≡ (1/n) Σ_{i=1}^n [Y_i - m̂_{-i}(X_i; h)]^2,
    ĥ_CV ≡ argmin_{h>0} CV(h),
where m̂_{-i}(·; h) is the leave-one-out estimator with the i-th pair (X_i, Y_i) removed in its construction.
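A direct (O(n^2)) Python sketch of ĥ_CV, reusing the local_poly_fit and epanechnikov sketches given earlier; the grid search and all tuning choices here are illustrative.

```python
import numpy as np

def cv_score(h, X, Y, p=1, kernel=None):
    """Leave-one-out cross-validation score CV(h) for the local polynomial
    estimator (reuses local_poly_fit / epanechnikov from the earlier sketch)."""
    kernel = kernel or epanechnikov
    n = len(X)
    sq_errs = np.empty(n)
    for i in range(n):
        keep = np.arange(n) != i
        beta = local_poly_fit(X[i], X[keep], Y[keep], h, p=p, kernel=kernel)
        sq_errs[i] = (Y[i] - beta[0])**2       # beta[0] = m_hat_{-i}(X_i; h)
    return sq_errs.mean()

# h_CV: minimise CV(h) over a grid of candidate bandwidths, e.g.
# grid = np.linspace(0.05, 0.5, 30)
# h_cv = grid[np.argmin([cv_score(h, X, Y) for h in grid])]
```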

Reference: Härdle, W., Hall, P. and Marron, J. S. (1988). How far are automatically chosen regression smoothing parameters from their optimum? (with discussion). Journal of the American Statistical Association, vol. 83, pp. 86-99.

(ii) Plug-in methods (p odd)
    MISE(m̂(·; h) | X_1, ..., X_n) = E[ ∫ {m̂(x; h) - m(x)}^2 w(x) dx | X_1, ..., X_n ]
        = h^{2p+2} { (N^{-1} γ)_0 / (p+1)! }^2 ∫ {m^{(p+1)}(x)}^2 w(x) dx
        + n^{-1} h^{-1} (N^{-1} S N^{-1})_{0,0} ∫ {v(x) w(x)/f(x)} dx + o_p(h^{2p+2} + n^{-1} h^{-1}).
When w = f and v(x) ≡ σ^2,
    h_MISE = [ (p+1)! p! (N^{-1} S N^{-1})_{0,0} σ^2 / ( 2 {(N^{-1} γ)_0}^2 ∫ {m^{(p+1)}(x)}^2 f(x) dx ) ]^{1/(2p+3)} n^{-1/(2p+3)}.
Estimate the unknowns by
    θ̂_{p+1}(g) ≡ (1/n) Σ_{i=1}^n { m̂^{(p+1)}(X_i; g) }^2,
    σ̂^2 : a reasonable estimator of σ^2,
    g(h) = C(K) h^α, with C(K) and α determined in a way similar to Park and Marron (1990) or Sheather and Jones (1991),
and let ĥ_SE be the solution of the equation
    h = [ (p+1)! p! (N^{-1} S N^{-1})_{0,0} σ̂^2 / ( 2 {(N^{-1} γ)_0}^2 θ̂_{p+1}(g(h)) ) ]^{1/(2p+3)} n^{-1/(2p+3)}.

Reference: Ruppert, D., Sheather, S. J. and Wand, M. P. (1995). An effective bandwidth selector for local least squares regression, Journal of the American Statistical Association, vol. 90, pp. 1257-1270. See Chapter 4 of Fan and Gijbels (1996) for other methods and further references.
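The full ĥ_SE algorithm involves a pilot bandwidth g(h) and is beyond a short snippet, but a simplified rule-of-thumb in the same spirit (for p = 1 and the Epanechnikov kernel, with m'' and σ^2 estimated from a global quartic pilot fit) can be sketched as follows. This is an assumption-laden illustration, not the Ruppert-Sheather-Wand procedure itself.

```python
import numpy as np

def rot_bandwidth_local_linear(X, Y):
    """Rule-of-thumb plug-in bandwidth for local linear regression with the
    Epanechnikov kernel: h = {R(K) sigma^2 / (mu_2(K)^2 theta_2 n)}^(1/5),
    where R(K) = 3/5, mu_2(K) = 1/5 and theta_2 = (1/n) sum m''(X_i)^2.
    m'' and sigma^2 come from a global quartic pilot fit (a crude choice)."""
    n = len(X)
    coef = np.polyfit(X, Y, 4)                 # global quartic pilot fit
    resid = Y - np.polyval(coef, X)
    sigma2 = np.sum(resid**2) / (n - 5)        # residual variance estimate
    d2 = np.polyval(np.polyder(coef, 2), X)    # m''(X_i) from the pilot fit
    theta2 = np.mean(d2**2)
    RK, mu2 = 3.0 / 5.0, 1.0 / 5.0             # Epanechnikov constants
    return (RK * sigma2 / (mu2**2 * theta2 * n)) ** 0.2
```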

(8) Difficulties with design sparseness

K is compactly supported, say on [-1, 1]. Then X^T W X is singular if Σ_i I_{[x-h, x+h]}(X_i) = 0. But P[ Σ_i I_{[x-h, x+h]}(X_i) = 0 ] > 0 for all n, so P[ X^T W X is singular ] > 0 for all n, and hence the unconditional bias and variance of the local polynomial estimator do not exist.

(i) Local polynomial ridge regression
    X : the n × (p+1) design matrix with i-th row (1, (X_i - x), ..., (X_i - x)^p),  Y = (Y_1, ..., Y_n)^T,  W = Diag(K_h(X_i - x)),
    β̂ ≡ (β̂_0(x), ..., β̂_p(x)) = argmin_β (Y - Xβ)^T W (Y - Xβ) = (X^T W X)^{-1} X^T W Y,
    β̂_ridge(x) ≡ (H + X^T W X)^{-1} X^T W Y,
where H is a nonnegative definite matrix such that H + X^T W X is nonsingular.

Reference: Seifert, B. and Gasser, T. (1996). Finite-sample variance of local polynomials: analysis and solutions, Journal of the American Statistical Association, vol. 91, pp. 267-275.
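A hedged sketch of the ridge-regularised local polynomial fit with H = λI (only one possible choice of H; Seifert and Gasser discuss others). It reuses the epanechnikov kernel defined earlier.

```python
import numpy as np

def local_poly_ridge(x0, X, Y, h, p=1, lam=1e-6, kernel=None):
    """Local polynomial ridge regression: beta = (H + X'WX)^{-1} X'WY with
    H = lam * I, so the system stays nonsingular even when few (or no)
    design points fall inside the kernel window."""
    kernel = kernel or epanechnikov
    u = X - x0
    w = kernel(u / h) / h
    D = np.vander(u, N=p + 1, increasing=True)
    A = D.T @ (D * w[:, None]) + lam * np.eye(p + 1)   # H + X'WX
    b = D.T @ (w * Y)                                  # X'WY
    return np.linalg.solve(A, b)                       # (beta_0(x), ..., beta_p(x))
```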

(ii) Adding pseudo data

Idea: Let X_(1) ≤ ··· ≤ X_(n) denote the ordered design points and Y_[i] the response paired with X_(i). If X_(i+1) - X_(i) is large for some i, add k_i equally spaced pseudo design points to the interval [X_(i), X_(i+1)]. For each pseudo design point X* ∈ (X_(i), X_(i+1)), define Y* by linear interpolation between the pairs (X_(i), Y_[i]) and (X_(i+1), Y_[i+1]).

[Figure: pseudo data placed on the line segment joining (X_(i), Y_[i]) and (X_(i+1), Y_[i+1]).]

Reference: Hall, P. and Turlach, B. A. (1997). Interpolation methods for adapting to sparse design in nonparametric regression, Journal of the American Statistical Association, vol. 92, pp. 466-476.
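A rough Python sketch of the pseudo-data idea (the gap threshold and the number of inserted points are illustrative choices, not recommendations from Hall and Turlach):

```python
import numpy as np

def add_pseudo_data(X, Y, gap=0.1):
    """Fill large design gaps with equally spaced pseudo design points whose
    responses are obtained by linear interpolation between the two
    neighbouring observations."""
    order = np.argsort(X)
    Xs, Ys = X[order], Y[order]
    newX, newY = [Xs[0]], [Ys[0]]
    for i in range(len(Xs) - 1):
        d = Xs[i + 1] - Xs[i]
        if d > gap:
            k = int(d // gap)                        # number of pseudo points
            t = np.linspace(0, 1, k + 2)[1:-1]       # interior fractions of the gap
            newX.extend(Xs[i] + t * d)
            newY.extend(Ys[i] + t * (Ys[i + 1] - Ys[i]))   # linear interpolation
        newX.append(Xs[i + 1])
        newY.append(Ys[i + 1])
    return np.array(newX), np.array(newY)
```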

(9) Quantile regression

(X_1, Y_1), ..., (X_n, Y_n) iid with joint cdf F(x, y) and pdf f(x, y).

(i) Quantile function
    y = q_p(x)  ⇔  F_{Y|X}(y | x) = p  (0 < p < 1),
    q_p(x) = argmin_β E[ρ_p(Y - β) | X = x],
where ρ_p is the check function defined by
    ρ_p(z) = (1/2){|z| + (2p - 1)z} = p z I_{[0,∞)}(z) - (1 - p) z I_{(-∞,0)}(z).
[Figure: the check function ρ_p for p = 1/2, p < 1/2 and p > 1/2.]
Indeed,
    E[ρ_p(Y - β) | X = x] = p ∫_β^∞ y f_{Y|X}(y | x) dy - (1 - p) ∫_{-∞}^β y f_{Y|X}(y | x) dy - pβ[1 - F_{Y|X}(β | x)] + (1 - p)β F_{Y|X}(β | x),
and solving (∂/∂β) E[ρ_p(Y - β) | X = x] = 0 leads to the result.

(ii) Applications
    - constructing prediction intervals [q̂_{α/2}(x), q̂_{1-α/2}(x)]
    - detecting heteroscedasticity

(iii) Local polynomial quantile regression
    q̂_p(x) = β̂_0(x), where
    (β̂_0(x), ..., β̂_p(x)) = argmin_β Σ_{i=1}^n ρ_p(Y_i - β_0 - β_1 (X_i - x) - ··· - β_p (X_i - x)^p) K_h(X_i - x).

Reference: Yu, K. and Jones, M. C. (1998). Local linear quantile regression, Journal of the American Statistical Association, vol. 93, pp. 228-237.
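A minimal sketch of local linear quantile regression by direct minimisation of the kernel-weighted check loss (a simple derivative-free optimiser is used here purely for illustration; Yu and Jones discuss more efficient algorithms):

```python
import numpy as np
from scipy.optimize import minimize

def check_loss(z, p):
    """Check function rho_p(z): p*z for z >= 0 and (p - 1)*z for z < 0."""
    return np.where(z >= 0, p * z, (p - 1.0) * z)

def local_linear_quantile(x0, X, Y, h, p=0.5,
                          kernel=lambda u: 0.75 * (1 - u**2) * (np.abs(u) <= 1)):
    """Estimate q_p(x0) by beta_0 from the locally weighted check-loss fit."""
    w = kernel((X - x0) / h) / h
    def objective(beta):
        r = Y - beta[0] - beta[1] * (X - x0)
        return np.sum(check_loss(r, p) * w)
    # Nelder-Mead because the check loss is not differentiable at zero
    res = minimize(objective, np.array([np.median(Y), 0.0]), method="Nelder-Mead")
    return res.x[0]
```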

(10) Robust regression

Very useful when the error variance is large.
    m̂(x; h) = β̂_0(x), where
    (β̂_0(x), ..., β̂_p(x)) = argmin_β Σ_{i=1}^n l(Y_i - β_0 - β_1 (X_i - x) - ··· - β_p (X_i - x)^p) K_h(X_i - x).
Choice of l: more resistant to outliers than the squared error loss.

(i) Huber's Ψ: l′ = Ψ_c, where Ψ_c(z) = max{-c, min(c, z)}.
[Figure: Huber's Ψ_c, linear on [-c, c] and constant at ±c outside.]

(ii) Huber's bisquare: l′ = B_c, where
    B_c(z) = z {1 - (z/c)^2}^2 I_{[-1,1]}(z/c).
[Figure: the bisquare function B_c, which vanishes outside [-c, c].]

References:
1. Härdle, W. and Gasser, T. (1984). Robust non-parametric function fitting, Journal of the Royal Statistical Society, Series B, vol. 46, pp. 42-51.
2. Fan, J., Hu, T.-C. and Truong, Y. K. (1994). Robust nonparametric function estimation, Scandinavian Journal of Statistics, vol. 21, pp. 433-446.

(iii) Robust locally weighted regression (LOWESS: LOcally WEighted Scatter plot Smoothing, due to Cleveland (1979), JASA, vol. 74, pp. 829-836)

Step 1. For each k, minimize
    Σ_{i=1}^n {Y_i - β_0 - β_1 (X_i - X_k) - ··· - β_p (X_i - X_k)^p}^2 K_{h_k}(X_i - X_k)
with respect to β_0, ..., β_p, where h_k is the <nd>-th smallest number among {|X_k - X_j|; j = 1, ..., n}. (<nd> means the integer nearest to nd, 0 < d ≤ 1. Cleveland used K(t) = (70/81)(1 - |t|^3)^3 I_{(-1,1)}(t) and suggested 0.2 ≤ d ≤ 0.8.) Set
    Ŷ_k = β̂_0 ≡ β̂_0(X_k),  k = 1, ..., n,
    r_k = Y_k - Ŷ_k,  k = 1, ..., n.

Step 2. Compute M = median{|r_1|, ..., |r_n|} and δ_i = L(r_i/(6M)), where L is a kernel. (Cleveland used the biweight kernel L(t) = (1 - t^2)^2 I_{(-1,1)}(t).)

Step 3. For each k, minimize
    Σ_{i=1}^n {Y_i - β_0 - β_1 (X_i - X_k) - ··· - β_p (X_i - X_k)^p}^2 δ_i K_{h_k}(X_i - X_k)
and set Ŷ_k = β̂_0 ≡ β̂_0(X_k), r_k = Y_k - Ŷ_k.

Repeat Steps 2 and 3 a total of N times to get the final fitted values Ŷ_k, k = 1, ..., n. Get estimated values at points x different from the design points {X_k : k = 1, ..., n} by interpolation. (Cleveland suggested p = 1 and N = 3; see Figure 2.4, p.27 of FG.)

Note. As N → ∞, Ŷ_k converges to the robust regression estimator m̂(X_k; h_k) with l′(z) = B_{6M}(z).
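A compact Python sketch of Steps 1-3 (illustrative only: it assumes distinct design points, uses p = 1 by default and the kernels Cleveland suggested, and makes no attempt at efficiency). In practice one would usually reach for an existing implementation such as the lowess function in statsmodels.

```python
import numpy as np

def lowess_sketch(X, Y, d=0.5, p=1, N=3):
    """Robust locally weighted regression (LOWESS), Steps 1-3 above.
    Returns the fitted values at the design points."""
    n = len(X)
    k = max(p + 1, int(round(n * d)))          # number of nearest neighbours
    tricube = lambda t: (70.0 / 81.0) * (1 - np.abs(t)**3)**3 * (np.abs(t) < 1)
    biweight = lambda t: (1 - t**2)**2 * (np.abs(t) < 1)
    delta = np.ones(n)                         # robustness weights, start at 1
    fitted = np.empty(n)
    for _ in range(N + 1):                     # initial fit + N robust updates
        for j in range(n):
            dist = np.abs(X - X[j])
            h = np.sort(dist)[k - 1]           # k-th smallest distance (h_k)
            w = delta * tricube(dist / h)
            D = np.vander(X - X[j], N=p + 1, increasing=True)
            sw = np.sqrt(w)
            beta, *_ = np.linalg.lstsq(D * sw[:, None], Y * sw, rcond=None)
            fitted[j] = beta[0]                # Y_hat_k = beta_0(X_k)
        r = Y - fitted
        M = np.median(np.abs(r))
        delta = biweight(r / (6.0 * M))        # Step 2: robustness weights
    return fitted
```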