VI Local Polynomial Regression

(1) Global polynomial regression

We observe random pairs $(X_1,Y_1),\ldots,(X_n,Y_n)$, where $(X_1,Y_1),\ldots,(X_n,Y_n)$ iid $\sim (X,Y)$. We want to estimate $m(x) = E(Y\mid X=x)$ based on $(X_1,Y_1),\ldots,(X_n,Y_n)$.

(i) Averaging as a constant regression
Model: $m(u) \equiv \alpha$ (constant)
LSE: $\hat\alpha = \operatorname{argmin}_\alpha \sum_{i=1}^n (Y_i-\alpha)^2 = \bar Y$

(ii) Linear regression
1. Model: $m(u) = \alpha_0 + \alpha_1 u$
LSE: $(\hat\alpha_0,\hat\alpha_1) = \operatorname{argmin}_{\alpha_0,\alpha_1}\sum_{i=1}^n (Y_i-\alpha_0-\alpha_1 X_i)^2 = \Big(\bar Y - \hat\alpha_1\bar X,\ \dfrac{\sum_{i=1}^n (X_i-\bar X)Y_i}{\sum_{i=1}^n (X_i-\bar X)^2}\Big)$
2. Model: $m(u) = \beta_0 + \beta_1(u-x)$, where $x$ is the point at which we want to estimate the value of $m$; then $\alpha_1=\beta_1$, $\alpha_0=\beta_0-\beta_1 x$.
LSE: $(\hat\beta_0,\hat\beta_1) = \operatorname{argmin}_{\beta_0,\beta_1}\sum_{i=1}^n (Y_i-\beta_0-\beta_1(X_i-x))^2 = \Big(\bar Y-\hat\beta_1(\bar X-x),\ \dfrac{\sum_{i=1}^n (X_i-x-(\bar X-x))Y_i}{\sum_{i=1}^n (X_i-x-(\bar X-x))^2}\Big) = \Big(\bar Y-\hat\beta_1(\bar X-x),\ \dfrac{\sum_{i=1}^n (X_i-\bar X)Y_i}{\sum_{i=1}^n (X_i-\bar X)^2}\Big)$
$\Rightarrow \hat m(x) = \hat\alpha_0+\hat\alpha_1 x = \hat\beta_0$ and $\hat m'(x) = \hat\alpha_1 = \hat\beta_1$.

(iii) Polynomial regression
1. Model: $m(u) = \alpha_0+\alpha_1 u+\alpha_2 u^2+\cdots+\alpha_p u^p$
LSE: $(\hat\alpha_0,\ldots,\hat\alpha_p) = \operatorname{argmin}_{\alpha_0,\ldots,\alpha_p}\sum_{i=1}^n (Y_i-\alpha_0-\alpha_1 X_i-\cdots-\alpha_p X_i^p)^2$
2. Model: $m(u) = \beta_0+\beta_1(u-x)+\beta_2(u-x)^2+\cdots+\beta_p(u-x)^p$
LSE: $(\hat\beta_0,\ldots,\hat\beta_p) = \operatorname{argmin}_{\beta_0,\ldots,\beta_p}\sum_{i=1}^n (Y_i-\beta_0-\beta_1(X_i-x)-\cdots-\beta_p(X_i-x)^p)^2$
The two parametrizations are related by $m^{(r)}(x) = \sum_{j=r}^p j(j-1)\cdots(j-r+1)\,\alpha_j x^{j-r} = r!\,\beta_r$, $r=1,\ldots,p$, and for the estimates
$\hat m(x) = \hat\alpha_0+\hat\alpha_1 x+\cdots+\hat\alpha_p x^p = \hat\beta_0$,
$\hat m^{(r)}(x) = \sum_{j=r}^p j(j-1)\cdots(j-r+1)\,\hat\alpha_j x^{j-r} = r!\,\hat\beta_r$, $r=1,\ldots,p$.

(2) Local polynomial regression: basic idea

Nothing is assumed about the structure of $m$.

(i) Local constant regression
1. Idea: $m(u)\approx\beta_0$ (constant) when $u\approx x$. Use only the $(X_i,Y_i)$ with $X_i\approx x$ and approximate $m(X_i)$ for such $X_i$'s by an (unknown) constant $\beta_0$.
2. Suppose we use only the $(X_i,Y_i)$ with $|X_i-x|\le h$:
$\hat\beta_0 = \operatorname{argmin}_{\beta_0}\sum_{i=1}^n (Y_i-\beta_0)^2\, I_{(-h,h)}(X_i-x) = \operatorname{argmin}_{\beta_0}\sum_{i=1}^n (Y_i-\beta_0)^2\,\frac1h\cdot\frac12 I_{(-1,1)}\Big(\frac{X_i-x}{h}\Big)$
$\hat m(x;h)\equiv\hat\beta_0$ (see Figure 1, p.4 of the lecture note)
3. Generalization of the weight function:
$\hat m(x;h)\equiv\hat\beta_0 = \operatorname{argmin}_{\beta_0}\sum_{i=1}^n (Y_i-\beta_0)^2 K_h(X_i-x) = \dfrac{\sum_{i=1}^n K_h(X_i-x)Y_i}{\sum_{i=1}^n K_h(X_i-x)}$
This is the famous Nadaraya-Watson estimator.
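A minimal numerical sketch of the Nadaraya-Watson estimator in 3; the Epanechnikov kernel, the simulated regression function and the bandwidth below are illustrative assumptions only.

```python
import numpy as np

def nadaraya_watson(x0, X, Y, h):
    """Local constant fit: sum_i K_h(X_i - x0) Y_i / sum_i K_h(X_i - x0)."""
    u = (X - x0) / h
    w = 0.75 * (1 - u**2) * (np.abs(u) <= 1) / h   # Epanechnikov K_h weights
    return np.sum(w * Y) / np.sum(w)                # undefined if no X_i falls in [x0 - h, x0 + h]

# illustrative example with m(x) = sin(2*pi*x)
rng = np.random.default_rng(0)
X = rng.uniform(0, 1, 200)
Y = np.sin(2 * np.pi * X) + 0.3 * rng.standard_normal(200)
print(nadaraya_watson(0.5, X, Y, h=0.1))            # should be close to m(0.5) = 0
```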
(ii) Local polynomial regression
1. Idea: $m(u)\approx\beta_0+\beta_1(u-x)+\cdots+\beta_p(u-x)^p$ when $u\approx x$ (a better approximation than local constant modelling!), with $\beta_r = \dfrac{m^{(r)}(x)}{r!}$.
2. Definition:
$(\hat\beta_0,\ldots,\hat\beta_p) = \operatorname{argmin}_{\beta_0,\ldots,\beta_p}\sum_{i=1}^n (Y_i-\beta_0-\beta_1(X_i-x)-\cdots-\beta_p(X_i-x)^p)^2 K_h(X_i-x)$
$\hat m^{(r)}(x;h)\equiv r!\,\hat\beta_r$, $r=0,1,\ldots,p$ (see Figure 2, p.6 of the lecture note)
Note that $\hat\beta_r$ depends on $x$, the point of interest.

(3) Theory for local constant fitting (Nadaraya-Watson estimator)

$f$: density of the $X_i$'s, supported on $[0,1]$; $v(u)\equiv\operatorname{var}(Y\mid X=u)$

(i) Asymptotic MSE in $\operatorname{Int}(\operatorname{supp}(f))$

Assume:
$f$ has a continuous derivative at $x$, and $f(x)>0$;
$v$ is continuous at $x$;
$m$ has two continuous derivatives at $x$;
$K$ is a symmetric probability density supported on a compact set, say $[-1,1]$, and $K$ is bounded.

Then
$\operatorname{bias}(\hat m(x;h)\mid X_1,\ldots,X_n) = \dfrac12\,\dfrac{m''(x)f(x)+2m'(x)f'(x)}{f(x)}\,\mu_2(K)h^2 + o_p(h^2) + O_p(n^{-1/2}h^{1/2})$
$\operatorname{var}(\hat m(x;h)\mid X_1,\ldots,X_n) = \dfrac{v(x)}{f(x)}\Big(\int K^2\Big)\, n^{-1}h^{-1} + o_p(n^{-1}h^{-1})$
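Before turning to the proof, here is a minimal sketch of the weighted least squares fit defined in (2)(ii) above, returning $\hat m^{(r)}(x;h)=r!\,\hat\beta_r$ for $r=0,\ldots,p$. The Epanechnikov kernel and the default order are illustrative assumptions, and the solve step fails when too few observations fall in the window (cf. (8) below).

```python
import numpy as np
from math import factorial

def local_poly_fit(x0, X, Y, h, p=1):
    """Local polynomial fit at x0: returns (m_hat(x0), m_hat'(x0), ..., m_hat^(p)(x0))."""
    u = (X - x0) / h
    w = 0.75 * (1 - u**2) * (np.abs(u) <= 1) / h      # Epanechnikov K_h(X_i - x0)
    D = np.vander(X - x0, N=p + 1, increasing=True)    # columns 1, (X_i - x0), ..., (X_i - x0)^p
    WD = D * w[:, None]
    beta = np.linalg.solve(D.T @ WD, WD.T @ Y)          # weighted least squares coefficients
    return np.array([factorial(r) * b for r, b in enumerate(beta)])   # m_hat^(r)(x0; h) = r! * beta_r
```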
Proof
1. (bias part)
$E\{\hat m(x;h)\mid X_1,\ldots,X_n\}-m(x) = \dfrac{n^{-1}\sum_{i=1}^n K_h(X_i-x)\{m(X_i)-m(x)\}}{n^{-1}\sum_{i=1}^n K_h(X_i-x)}$
Recall the formula given in p.8 and note that $Z_n = EZ_n + O_p(\sqrt{\operatorname{var}(Z_n)})$. For the numerator,
$n^{-1}\sum_{i=1}^n K_h(X_i-x)\{m(X_i)-m(x)\} = K_h\star[\{m(\cdot)-m(x)\}f(\cdot)](x) + O_p\Big(\sqrt{n^{-1}h^{-1}\,(K^2)_h\star[\{m(\cdot)-m(x)\}^2 f(\cdot)](x)}\Big)$
$= \tfrac12\{m''(x)f(x)+2m'(x)f'(x)\}\mu_2(K)h^2 + o(h^2) + O_p(n^{-1/2}h^{1/2}),$
and for the denominator,
$n^{-1}\sum_{i=1}^n K_h(X_i-x) = K_h\star f(x) + O_p(n^{-1/2}h^{-1/2}) = f(x) + o(h) + O_p(n^{-1/2}h^{-1/2}),$
so that
$E\{\hat m(x;h)\mid X_1,\ldots,X_n\}-m(x) = \dfrac{\tfrac12\{m''(x)f(x)+2m'(x)f'(x)\}\mu_2(K)h^2 + o_p(h^2)+O_p(n^{-1/2}h^{1/2})}{f(x)+o(h)+O_p(n^{-1/2}h^{-1/2})}.$
2. (variance part)
$\operatorname{var}\{\hat m(x;h)\mid X_1,\ldots,X_n\} = \sum_{i=1}^n\Big[\dfrac{n^{-1}K_h(x-X_i)}{n^{-1}\sum_{j=1}^n K_h(x-X_j)}\Big]^2 v(X_i)$
$= n^{-1}h^{-1}\,\dfrac{n^{-1}\sum_{i=1}^n (K^2)_h(x-X_i)v(X_i)}{\{n^{-1}\sum_{i=1}^n K_h(x-X_i)\}^2}$
$= n^{-1}h^{-1}\,\dfrac{(K^2)_h\star(vf)(x)+O_p(n^{-1/2}h^{-1/2})}{\{f(x)+o(h)+O_p(n^{-1/2}h^{-1/2})\}^2}$

(ii) Asymptotic MSE at boundaries: $x = x_n = \alpha h$ $(0\le\alpha<1)$

Assume:
$f$ is right continuous at $0$, and $f(0+)>0$;
$v$ is right continuous at $0$;
$m$ is differentiable on $(0,\epsilon)$ for some $\epsilon>0$, and $m'$ is right continuous at $0$;
$K$ is a symmetric probability density supported on $[-1,1]$ and is bounded.

Then
$\operatorname{bias}(\hat m(x;h)\mid X_1,\ldots,X_n) = \dfrac{\mu_1(K;\alpha)}{\mu_0(K;\alpha)}\,m'(0+)\,h + o(h) + O_p(n^{-1/2}h^{1/2})$
$\operatorname{var}(\hat m(x;h)\mid X_1,\ldots,X_n) = \dfrac{\mu_0(K^2;\alpha)}{\{\mu_0(K;\alpha)\}^2}\,\dfrac{v(0+)}{f(0+)}\,n^{-1}h^{-1} + o_p(n^{-1}h^{-1})$

Proof
Apply the formula in (8)-(i)-1, Page 19.
1. (bias part)
$K_h\star[\{m(\cdot)-m(x)\}f(\cdot)](x) = \mu_1(K;\alpha)\,m'(x)f(x)\,h + o(h)$
$K_h\star f(x) = \mu_0(K;\alpha)f(x) + o(1)$
$(K^2)_h\star[\{m(\cdot)-m(x)\}^2 f(\cdot)](x) = O(h^2)$
$(K^2)_h\star f(x) = \mu_0(K^2;\alpha)f(x) + o(1)$
$\Rightarrow \operatorname{bias}(\hat m(x;h)\mid X_1,\ldots,X_n) = \dfrac{\mu_1(K;\alpha)m'(x)f(x)h+o(h)+O_p(n^{-1/2}h^{1/2})}{\mu_0(K;\alpha)f(x)+o(1)+O_p(n^{-1/2}h^{-1/2})}$
2. (variance part)
$(K^2)_h\star(vf)(x) = \mu_0(K^2;\alpha)v(x)f(x) + o(1)$
$\Rightarrow \operatorname{var}(\hat m(x;h)\mid X_1,\ldots,X_n) = n^{-1}h^{-1}\,\dfrac{\mu_0(K^2;\alpha)v(x)f(x)+o(1)+O_p(n^{-1/2}h^{-1/2})}{\{\mu_0(K;\alpha)f(x)+o(1)+O_p(n^{-1/2}h^{-1/2})\}^2}$

Remark: In fact, it is unnecessary for $K$ to be a symmetric probability density.

(4) Theory for local linear fitting

$f$: density of the $X_i$'s, supported on $[0,1]$; $v(u)\equiv\operatorname{var}(Y\mid X=u)$

(i) Asymptotic MSE in $\operatorname{Int}(\operatorname{supp}(f))$
Assume:
$f$ is continuous at $x$, and $f(x)>0$;
$v$ is continuous at $x$;
$m$ has two continuous derivatives at $x$;
$K$ is supported on $[-1,1]$ and is bounded.

Then
$\operatorname{bias}(\hat m(x;h)\mid X_1,\ldots,X_n) = \dfrac12\,\dfrac{\mu_2(K)^2-\mu_1(K)\mu_3(K)}{\mu_0(K)\mu_2(K)-\mu_1(K)^2}\,m''(x)h^2 + o(h^2) + O_p(n^{-1/2}h^{3/2})$
$\operatorname{var}(\hat m(x;h)\mid X_1,\ldots,X_n) = \dfrac{\int\{\mu_2(K)-z\mu_1(K)\}^2K^2(z)\,dz}{\{\mu_0(K)\mu_2(K)-\mu_1(K)^2\}^2}\,\dfrac{v(x)}{f(x)}\,n^{-1}h^{-1} + o_p(n^{-1}h^{-1})$

Note 1: We do not assume that $K$ is a symmetric probability density here. If we do, then the conditional bias and variance reduce to
$\operatorname{bias}(\hat m(x;h)\mid X_1,\ldots,X_n) = \tfrac12\mu_2(K)\,m''(x)h^2 + o(h^2) + O_p(n^{-1/2}h^{3/2})$
$\operatorname{var}(\hat m(x;h)\mid X_1,\ldots,X_n) = \mu_0(K^2)\,\dfrac{v(x)}{f(x)}\,n^{-1}h^{-1} + o_p(n^{-1}h^{-1})$

Note 2: Why is the conditional bias still $O_p(h^2)$ even when we put unbalanced weights around the point $x$?
[Figure: local constant vs. local linear fits when one puts weights only on the right-hand side of the point $x$.]

We can expect $O_p(h^2)$ conditional bias at boundaries too, which shall be shown shortly.

Proof of the bias and the variance formula.
$(\hat\beta_0,\hat\beta_1) = \operatorname{argmin}_{\beta_0,\beta_1}\sum_{i=1}^n (Y_i-\beta_0-\beta_1(X_i-x))^2 K_h(X_i-x)$
Write $\hat a_0 = \hat\beta_0-m(x)$, $\hat a_1 = h(\hat\beta_1-m'(x))$, and let $\tilde m(u) = m(x)+m'(x)(u-x)$ be the linear approximation of $m$ near $x$. Then, for $a_0 = \beta_0-m(x)$ and $a_1 = h(\beta_1-m'(x))$,
$\sum_{i=1}^n (Y_i-\beta_0-\beta_1(X_i-x))^2 K_h(X_i-x) = \sum_{i=1}^n\Big\{Y_i-m(x)-m'(x)(X_i-x)-a_0-a_1\Big(\dfrac{X_i-x}{h}\Big)\Big\}^2 K_h(X_i-x) = \sum_{i=1}^n\Big\{Y_i-\tilde m(X_i)-(a_0,a_1)\Big(1,\dfrac{X_i-x}{h}\Big)^T\Big\}^2 K_h(X_i-x)$
$\Rightarrow (\hat a_0,\hat a_1)$ minimizes, w.r.t. $a^T=(a_0,a_1)$,
$\sum_{i=1}^n (Y_i^*-a^TX_i^*)^2 K_h(X_i-x) = (\mathbf Y^*-\mathbf X^*a)^T\mathbf W(\mathbf Y^*-\mathbf X^*a)$
where $\mathbf Y^* = (Y_1^*,\ldots,Y_n^*)^T$, $Y_i^* = Y_i-\tilde m(X_i)$, $\mathbf X^* = (X_1^*,\ldots,X_n^*)^T$, $X_i^* = \Big(1,\dfrac{X_i-x}{h}\Big)^T$, $\mathbf W = \operatorname{Diag}(K_h(X_i-x))$.
$\Rightarrow (\hat a_0,\hat a_1)^T = (\mathbf X^{*T}\mathbf W\mathbf X^*)^{-1}\mathbf X^{*T}\mathbf W\mathbf Y^*$

1. (bias part)
$E(\hat a\mid X_1,\ldots,X_n) = (\mathbf X^{*T}\mathbf W\mathbf X^*)^{-1}\mathbf X^{*T}\mathbf W\,E(\mathbf Y^*\mid X_1,\ldots,X_n)$
$\dfrac1n(\mathbf X^{*T}\mathbf W\mathbf X^*)_{r,s} = \dfrac1n\sum_{i=1}^n\Big(\dfrac{X_i-x}{h}\Big)^{r+s}K_h(X_i-x) = \dfrac1n\sum_{i=1}^n (P_{r+s}K)_h(X_i-x)$, $r,s=0,1$
$= (P_{r+s}K)_h\star f(x) + O_p(n^{-1/2}h^{-1/2}) = \mu_{r+s}(K)f(x) + o(1) + O_p(n^{-1/2}h^{-1/2})$
($A_{r,s}$ means the $(r,s)$ component of the matrix $A$, with indices counted from zero; $P_l(u)=u^l$, $P_lK(u)=u^lK(u)$, $(P_lK)_h(u)=\frac1h\big(\frac uh\big)^lK\big(\frac uh\big)$.)
$\dfrac1n\{\mathbf X^{*T}\mathbf W\,E(\mathbf Y^*\mid X_1,\ldots,X_n)\}_r$ ($a_r$ means the $r$th component of the vector $a$, with indices counted from zero)
$= \dfrac1n\sum_{i=1}^n\Big(\dfrac{X_i-x}{h}\Big)^r K_h(X_i-x)\{m(X_i)-\tilde m(X_i)\} = \dfrac1n\sum_{i=1}^n (P_rK)_h(X_i-x)\{m(X_i)-\tilde m(X_i)\}$, $r=0,1$
$= (P_rK)_h\star[\{m(\cdot)-\tilde m(\cdot)\}f(\cdot)](x) + O_p(n^{-1/2}h^{3/2})$
$= \tfrac12\mu_2(P_rK)\,m''(x)f(x)h^2 + o(h^2) + O_p(n^{-1/2}h^{3/2})$
$= \tfrac12\mu_{r+2}(K)\,m''(x)f(x)h^2 + o(h^2) + O_p(n^{-1/2}h^{3/2})$

Write $N = \begin{pmatrix}\mu_0(K)&\mu_1(K)\\ \mu_1(K)&\mu_2(K)\end{pmatrix}$, $\gamma = \begin{pmatrix}\mu_2(K)\\ \mu_3(K)\end{pmatrix}$. Then
$E(\hat a_0\mid X_1,\ldots,X_n) = \tfrac12(N^{-1}\gamma)_0\,m''(x)h^2 + o(h^2) + O_p(n^{-1/2}h^{3/2})$;
$(N^{-1}\gamma)_0 = \dfrac{1}{\mu_0(K)\mu_2(K)-\mu_1(K)^2}\left[\begin{pmatrix}\mu_2(K)&-\mu_1(K)\\ -\mu_1(K)&\mu_0(K)\end{pmatrix}\begin{pmatrix}\mu_2(K)\\ \mu_3(K)\end{pmatrix}\right]_0 = \dfrac{\mu_2(K)^2-\mu_1(K)\mu_3(K)}{\mu_0(K)\mu_2(K)-\mu_1(K)^2}$

2. (variance part)
$\operatorname{var}(\hat a\mid X_1,\ldots,X_n) = (\mathbf X^{*T}\mathbf W\mathbf X^*)^{-1}\mathbf X^{*T}\mathbf W\operatorname{var}(\mathbf Y^*\mid X_1,\ldots,X_n)\mathbf W\mathbf X^*(\mathbf X^{*T}\mathbf W\mathbf X^*)^{-1} = (\mathbf X^{*T}\mathbf W\mathbf X^*)^{-1}\mathbf X^{*T}\Sigma\mathbf X^*(\mathbf X^{*T}\mathbf W\mathbf X^*)^{-1}$
where $\Sigma = \operatorname{Diag}(v(X_i)(K_h(X_i-x))^2)$.
$\dfrac hn(\mathbf X^{*T}\Sigma\mathbf X^*)_{r,s} = \dfrac1n\sum_{i=1}^n\Big(\dfrac{X_i-x}{h}\Big)^{r+s}(K^2)_h(X_i-x)v(X_i) = \mu_{r+s}(K^2)v(x)f(x) + o_p(1)$, $r,s=0,1$

Write $S = \begin{pmatrix}\mu_0(K^2)&\mu_1(K^2)\\ \mu_1(K^2)&\mu_2(K^2)\end{pmatrix}$. Then
$\operatorname{var}(\hat a_0\mid X_1,\ldots,X_n) = (N^{-1}SN^{-1})_{0,0}\,\dfrac{v(x)}{f(x)}\,n^{-1}h^{-1} + o_p(n^{-1}h^{-1})$,
$(N^{-1}SN^{-1})_{0,0} = \sum_{r=0}^1\sum_{s=0}^1 (N^{-1})_{0,r}S_{r,s}(N^{-1})_{s,0} = \dfrac{\int\{\mu_2(K)-z\mu_1(K)\}^2K(z)^2\,dz}{\{\mu_0(K)\mu_2(K)-\mu_1(K)^2\}^2}$
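The constants in the bias and variance formulas of (4)(i) can be evaluated numerically. The sketch below is an illustrative assumption-based check for the Epanechnikov kernel using a simple Riemann sum; for this symmetric kernel the variance factor reduces to $\mu_0(K^2) = \int K^2 = 0.6$ and the bias constant to $\mu_2(K) = 0.2$, in agreement with Note 1.

```python
import numpy as np

def kernel_moments(K, j_max, a=-1.0, b=1.0, n=20001):
    """Numerical moments mu_j(K) = int_a^b u^j K(u) du via a Riemann sum."""
    u = np.linspace(a, b, n)
    du = u[1] - u[0]
    return np.array([np.sum(u**j * K(u)) * du for j in range(j_max + 1)])

K = lambda u: 0.75 * (1 - u**2)                    # Epanechnikov kernel on [-1, 1]
mu  = kernel_moments(K, 3)                          # mu_0(K), ..., mu_3(K)
mu2 = kernel_moments(lambda u: K(u)**2, 2)          # mu_0(K^2), ..., mu_2(K^2)

N = np.array([[mu[0], mu[1]], [mu[1], mu[2]]])
S = np.array([[mu2[0], mu2[1]], [mu2[1], mu2[2]]])
gamma = np.array([mu[2], mu[3]])

Ninv = np.linalg.inv(N)
print((Ninv @ gamma)[0])           # bias constant (N^{-1} gamma)_0, equals mu_2(K) = 0.2 here
print((Ninv @ S @ Ninv)[0, 0])     # variance constant (N^{-1} S N^{-1})_{0,0}, equals int K^2 = 0.6 here
```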
Note 3: Let $\hat m_I(x;h)\equiv\dfrac1n\sum_{i=1}^n\{f(X_i)\}^{-1}L_h(X_i-x)Y_i$, a prototype of an internal regression estimator. If we take
$L(z) = K^*(z)\equiv\dfrac{\mu_2(K)-z\mu_1(K)}{\mu_0(K)\mu_2(K)-\mu_1(K)^2}\,K(z),$
then $\hat m_I(x;h)$ has the same (first-order) asymptotic properties as $\hat m(x;h)$. We call $K^*$ the equivalent kernel.

(ii) Asymptotic MSE at boundaries: $x = x_n = \alpha h$ $(0\le\alpha<1)$

Assume:
$f$ is right continuous at $0$, and $f(0+)>0$;
$v$ is right continuous at $0$;
$m$ is twice differentiable on $(0,\epsilon)$ for some $\epsilon>0$, and $m''$ is right continuous at $0$;
$K$ is supported on $[-1,1]$ and is bounded.

Then
$\operatorname{bias}(\hat m(x;h)\mid X_1,\ldots,X_n) = \dfrac12\,\dfrac{\mu_2(K;\alpha)^2-\mu_1(K;\alpha)\mu_3(K;\alpha)}{\mu_0(K;\alpha)\mu_2(K;\alpha)-\mu_1(K;\alpha)^2}\,m''(0+)h^2 + o(h^2) + O_p(n^{-1/2}h^{3/2})$
$\operatorname{var}(\hat m(x;h)\mid X_1,\ldots,X_n) = \dfrac{\int_{-\alpha}^{1}\{\mu_2(K;\alpha)-z\mu_1(K;\alpha)\}^2K(z)^2\,dz}{\{\mu_0(K;\alpha)\mu_2(K;\alpha)-\mu_1(K;\alpha)^2\}^2}\,\dfrac{v(0+)}{f(0+)}\,n^{-1}h^{-1} + o_p(n^{-1}h^{-1})$

(5) Theory for local polynomial fitting

$p$: the order of the local polynomial fitting; $f$: density of the $X_i$'s, supported on $[0,1]$; $v(u)\equiv\operatorname{var}(Y\mid X=u)$

(i) Asymptotic MSE in $\operatorname{Int}(\operatorname{supp}(f))$: even $p$ case
Assume:
$f$ has a continuous derivative at $x$, and $f(x)>0$;
$v$ is continuous at $x$;
$m$ has $(p+2)$ continuous derivatives at $x$;
$K$ is supported on $[-1,1]$, and is bounded.

Then
$\operatorname{bias}(\hat m(x;h)\mid X_1,\ldots,X_n) = (N^{-1}\gamma)_0\,\dfrac{m^{(p+1)}(x)}{(p+1)!}\,h^{p+1} + \Big[(N^{-1}\delta)_0\,\dfrac{m^{(p+2)}(x)}{(p+2)!} + (N^{-1}\delta-N^{-1}JN^{-1}\gamma)_0\,\dfrac{m^{(p+1)}(x)}{(p+1)!}\,\dfrac{f'(x)}{f(x)}\Big]h^{p+2} + o(h^{p+2}) + O_p(n^{-1/2}h^{(2p+1)/2})$
$\operatorname{var}(\hat m(x;h)\mid X_1,\ldots,X_n) = (N^{-1}SN^{-1})_{0,0}\,\dfrac{v(x)}{f(x)}\,n^{-1}h^{-1} + o_p(n^{-1}h^{-1})$
where
$N = (\mu_{r+s}(K))$: $(p+1)\times(p+1)$ matrix,
$\gamma = (\mu_{p+1+r}(K))$: $(p+1)\times 1$ vector,
$\delta = (\mu_{p+2+r}(K))$: $(p+1)\times 1$ vector,
$J = (\mu_{r+s+1}(K))$: $(p+1)\times(p+1)$ matrix,
$S = (\mu_{r+s}(K^2))$: $(p+1)\times(p+1)$ matrix.

Note 1: Suppose all the odd moments of $K$ vanish, i.e. $\mu_j(K)=0$ for all odd $j$. Then $N_{r,s}=0$ for $r+s$ odd, which implies $(N^{-1})_{r,s}=0$ for $r+s$ odd, too. (See Appendix (4) in the lecture note "Nonparametric Regression Function Estimation".) Since $p$ is even,
$(N^{-1}\gamma)_0 = \sum_{s=0}^p (N^{-1})_{0,s}\,\mu_{p+1+s}(K) = \sum_{s:\,\text{even}}(N^{-1})_{0,s}\,\mu_{p+1+s}(K) = 0$, since $p+1+s$ is odd.
$\Rightarrow \operatorname{bias}(\hat m(x;h)\mid X_1,\ldots,X_n)$
$= \Big[(N^{-1}\delta)_0\,\dfrac{m^{(p+2)}(x)}{(p+2)!} + (N^{-1}\delta-N^{-1}JN^{-1}\gamma)_0\,\dfrac{m^{(p+1)}(x)}{(p+1)!}\,\dfrac{f'(x)}{f(x)}\Big]h^{p+2} + o(h^{p+2}) + O_p(n^{-1/2}h^{(2p+1)/2})$

(ii) Asymptotic MSE at boundaries: even $p$ case, $x = x_n = \alpha h$, $0\le\alpha<1$

Assume:
$f$ is right continuous at $0$, and $f(0+)>0$;
$v$ is right continuous at $0$;
$m$ is $(p+1)$ times differentiable on $(0,\epsilon)$ for some $\epsilon>0$, and $m^{(p+1)}$ is right continuous at $0$;
$K$ is supported on $[-1,1]$ and is bounded.

Then
$\operatorname{bias}(\hat m(x;h)\mid X_1,\ldots,X_n) = (N^{-1}\gamma)_0\,\dfrac{m^{(p+1)}(0+)}{(p+1)!}\,h^{p+1} + o(h^{p+1}) + O_p(n^{-1/2}h^{(2p+1)/2})$
$\operatorname{var}(\hat m(x;h)\mid X_1,\ldots,X_n) = (N^{-1}SN^{-1})_{0,0}\,\dfrac{v(0+)}{f(0+)}\,n^{-1}h^{-1} + o_p(n^{-1}h^{-1})$
where all the entries of $N$, $\gamma$ and $S$ are replaced by the corresponding incomplete moments of $K$ and $K^2$.
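To see the boundary effect numerically, the sketch below computes incomplete moments and the local linear boundary variance factor of (4)(ii) for the Epanechnikov kernel. The convention $\mu_j(K;\alpha) = \int_{-\alpha}^{1} u^j K(u)\,du$ used here is an assumption and should be checked against the definition used earlier in the lecture note; the $\alpha$ values are illustrative.

```python
import numpy as np

def incomplete_moments(K, j_max, alpha, n=20001):
    """Assumed convention: mu_j(K; alpha) = int_{-alpha}^{1} u^j K(u) du for a left-boundary point x = alpha*h."""
    u = np.linspace(-alpha, 1.0, n)
    du = u[1] - u[0]
    return np.array([np.sum(u**j * K(u)) * du for j in range(j_max + 1)])

K = lambda u: 0.75 * (1 - u**2) * (np.abs(u) <= 1)     # Epanechnikov
for alpha in (1.0, 0.5, 0.0):                           # alpha = 1 reproduces the interior constants
    mu  = incomplete_moments(K, 2, alpha)
    mu2 = incomplete_moments(lambda u: K(u)**2, 2, alpha)
    N = np.array([[mu[0], mu[1]], [mu[1], mu[2]]])
    S = np.array([[mu2[0], mu2[1]], [mu2[1], mu2[2]]])
    print(alpha, (np.linalg.inv(N) @ S @ np.linalg.inv(N))[0, 0])   # variance factor grows as alpha -> 0
```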
(iii) Asymptotic MSE in $\operatorname{Int}(\operatorname{supp}(f))$: odd $p$ case

Assume:
$f$ is continuous at $x$, and $f(x)>0$;
$v$ is continuous at $x$;
$m$ has $(p+1)$ continuous derivatives at $x$;
$K$ is supported on $[-1,1]$, and is bounded.

Then
$\operatorname{bias}(\hat m(x;h)\mid X_1,\ldots,X_n) = (N^{-1}\gamma)_0\,\dfrac{m^{(p+1)}(x)}{(p+1)!}\,h^{p+1} + o(h^{p+1}) + O_p(n^{-1/2}h^{(2p+1)/2})$
$\operatorname{var}(\hat m(x;h)\mid X_1,\ldots,X_n) = (N^{-1}SN^{-1})_{0,0}\,\dfrac{v(x)}{f(x)}\,n^{-1}h^{-1} + o_p(n^{-1}h^{-1})$

(iv) Asymptotic MSE at boundaries: odd $p$ case
The conditions and the formulas for bias and variance are the same as those in (ii).

Proof of (i)-(iv)
It suffices to prove (i). Define
$(\hat\beta_0,\hat\beta_1,\ldots,\hat\beta_p) = \operatorname{argmin}_{\beta_0,\ldots,\beta_p}\sum_{i=1}^n\{Y_i-\beta_0-\beta_1(X_i-x)-\cdots-\beta_p(X_i-x)^p\}^2K_h(X_i-x).$
Write $\hat a_0 = \hat\beta_0-m(x)$, $\hat a_1 = h(\hat\beta_1-m'(x))$, $\ldots$, $\hat a_p = h^p\big(\hat\beta_p-\frac{m^{(p)}(x)}{p!}\big)$, and let
$\tilde m(u) = m(x)+m'(x)(u-x)+\cdots+\dfrac{m^{(p)}(x)}{p!}(u-x)^p.$
Then
$\sum_{i=1}^n\{Y_i-\beta_0-\beta_1(X_i-x)-\cdots-\beta_p(X_i-x)^p\}^2K_h(X_i-x) = \sum_{i=1}^n\Big\{Y_i-\tilde m(X_i)-a_0-a_1\Big(\dfrac{X_i-x}{h}\Big)-\cdots-a_p\Big(\dfrac{X_i-x}{h}\Big)^p\Big\}^2K_h(X_i-x)$
$\Rightarrow (\hat a_0,\ldots,\hat a_p)$ minimizes, w.r.t. $a^T=(a_0,\ldots,a_p)$,
$\sum_{i=1}^n (Y_i^*-a^TX_i^*)^2K_h(X_i-x) = (\mathbf Y^*-\mathbf X^*a)^T\mathbf W(\mathbf Y^*-\mathbf X^*a)$
$\Rightarrow \hat a = (\hat a_0,\ldots,\hat a_p)^T = (\mathbf X^{*T}\mathbf W\mathbf X^*)^{-1}\mathbf X^{*T}\mathbf W\mathbf Y^*$
where $\mathbf Y^* = (Y_1^*,\ldots,Y_n^*)^T$, $Y_i^* = Y_i-\tilde m(X_i)$, $\mathbf X^* = (X_1^*,\ldots,X_n^*)^T$, $X_i^* = \Big(1,\dfrac{X_i-x}{h},\ldots,\Big(\dfrac{X_i-x}{h}\Big)^p\Big)^T$, $\mathbf W = \operatorname{Diag}(K_h(X_i-x))$.

1. (bias part)
$E(\hat a\mid X_1,\ldots,X_n) = (\mathbf X^{*T}\mathbf W\mathbf X^*)^{-1}\mathbf X^{*T}\mathbf W\,E(\mathbf Y^*\mid X_1,\ldots,X_n)$
$\dfrac1n(\mathbf X^{*T}\mathbf W\mathbf X^*) = Nf(x) + Jf'(x)h + o(h) + O_p(n^{-1/2}h^{-1/2})$
$\dfrac1n\{\mathbf X^{*T}\mathbf W\,E(\mathbf Y^*\mid X_1,\ldots,X_n)\} = \gamma\,\dfrac{m^{(p+1)}(x)}{(p+1)!}f(x)h^{p+1} + \delta\Big\{\dfrac{m^{(p+2)}(x)}{(p+2)!}f(x)+\dfrac{m^{(p+1)}(x)}{(p+1)!}f'(x)\Big\}h^{p+2} + o(h^{p+2}) + O_p(n^{-1/2}h^{(2p+1)/2})$
$\big[Nf(x)+Jf'(x)h+o(h)+O_p(n^{-1/2}h^{-1/2})\big]^{-1} = N^{-1}\dfrac{1}{f(x)} - N^{-1}JN^{-1}\dfrac{f'(x)}{f(x)^2}h + o(h) + O_p(n^{-1/2}h^{-1/2})$

2. (variance part)
$\operatorname{var}(\hat a\mid X_1,\ldots,X_n) = (\mathbf X^{*T}\mathbf W\mathbf X^*)^{-1}\mathbf X^{*T}\Sigma\mathbf X^*(\mathbf X^{*T}\mathbf W\mathbf X^*)^{-1}$, where $\Sigma = \operatorname{Diag}(v(X_i)(K_h(X_i-x))^2)$
$\dfrac hn(\mathbf X^{*T}\Sigma\mathbf X^*) = S\,v(x)f(x) + o_p(1)$
$\dfrac1n(\mathbf X^{*T}\mathbf W\mathbf X^*) = Nf(x) + o_p(1) + O_p(n^{-1/2}h^{-1/2})$
$\Rightarrow \operatorname{var}(\hat a\mid X_1,\ldots,X_n) = n^{-1}h^{-1}\Big(\dfrac1n\mathbf X^{*T}\mathbf W\mathbf X^*\Big)^{-1}\Big(\dfrac hn\mathbf X^{*T}\Sigma\mathbf X^*\Big)\Big(\dfrac1n\mathbf X^{*T}\mathbf W\mathbf X^*\Big)^{-1} = n^{-1}h^{-1}\,N^{-1}SN^{-1}\,\dfrac{v(x)}{f(x)} + o_p(n^{-1}h^{-1})$
(v) Estimation of $m^{(r)}$ ($\hat m^{(r)}(x;h) = r!\,\hat\beta_r$), $r = 0,\ldots,p$

Under the conditions in (i),
$\operatorname{bias}(\hat m^{(r)}(x;h)\mid X_1,\ldots,X_n) = \dfrac{r!}{h^r}E(\hat a_r\mid X_1,\ldots,X_n)$
$= r!\,(N^{-1}\gamma)_r\,\dfrac{m^{(p+1)}(x)}{(p+1)!}\,h^{p-r+1} + r!\Big[(N^{-1}\delta)_r\,\dfrac{m^{(p+2)}(x)}{(p+2)!} + (N^{-1}\delta-N^{-1}JN^{-1}\gamma)_r\,\dfrac{m^{(p+1)}(x)}{(p+1)!}\,\dfrac{f'(x)}{f(x)}\Big]h^{p-r+2} + o(h^{p-r+2}) + O_p(n^{-1/2}h^{p-r+1/2})$
$\operatorname{var}(\hat m^{(r)}(x;h)\mid X_1,\ldots,X_n) = \dfrac{(r!)^2}{h^{2r}}\operatorname{var}(\hat a_r\mid X_1,\ldots,X_n) = (r!)^2(N^{-1}SN^{-1})_{r,r}\,\dfrac{v(x)}{f(x)}\,n^{-1}h^{-2r-1} + o_p(n^{-1}h^{-2r-1})$

If $(p-r)$ is even and all the odd moments of $K$ vanish, then
$\operatorname{bias}(\hat m^{(r)}(x;h)\mid X_1,\ldots,X_n) = r!\Big[(N^{-1}\delta)_r\,\dfrac{m^{(p+2)}(x)}{(p+2)!} + (N^{-1}\delta-N^{-1}JN^{-1}\gamma)_r\,\dfrac{m^{(p+1)}(x)}{(p+1)!}\,\dfrac{f'(x)}{f(x)}\Big]h^{p-r+2} + o(h^{p-r+2}) + O_p(n^{-1/2}h^{p-r+1/2}),$
since
$(N^{-1}\gamma)_r = \sum_{s=0}^p(N^{-1})_{r,s}\,\mu_{p+1+s}(K) = \sum_{s:\,r+s\ \text{even}}(N^{-1})_{r,s}\,\mu_{p+1+s}(K) = \sum_{s:\,p+s\ \text{even}}(N^{-1})_{r,s}\,\mu_{p+1+s}(K) = 0$ ($p-r$: even).
Under the conditions in (ii) ($p-r$: even),
$\operatorname{bias}(\hat m^{(r)}(x;h)\mid X_1,\ldots,X_n) = r!\,(N^{-1}\gamma)_r\,\dfrac{m^{(p+1)}(0+)}{(p+1)!}\,h^{p-r+1} + o(h^{p-r+1}) + O_p(n^{-1/2}h^{p-r+1/2})$
$\operatorname{var}(\hat m^{(r)}(x;h)\mid X_1,\ldots,X_n) = (r!)^2(N^{-1}SN^{-1})_{r,r}\,\dfrac{v(0+)}{f(0+)}\,n^{-1}h^{-2r-1} + o_p(n^{-1}h^{-2r-1})$

Under the conditions in (iii) ($p-r$: odd),
$\operatorname{bias}(\hat m^{(r)}(x;h)\mid X_1,\ldots,X_n) = r!\,(N^{-1}\gamma)_r\,\dfrac{m^{(p+1)}(x)}{(p+1)!}\,h^{p-r+1} + o(h^{p-r+1}) + O_p(n^{-1/2}h^{p-r+1/2})$
$\operatorname{var}(\hat m^{(r)}(x;h)\mid X_1,\ldots,X_n) = (r!)^2(N^{-1}SN^{-1})_{r,r}\,\dfrac{v(x)}{f(x)}\,n^{-1}h^{-2r-1} + o_p(n^{-1}h^{-2r-1})$

Under the conditions in (iv) ($p-r$: odd), the formulas for bias and variance are the same as those in the second case.

(vi) Optimal bandwidth for estimating $m^{(r)}$, $r = 0,\ldots,p$

$C_1 = (r!)^2\Big[(N^{-1}\gamma)_r\,\dfrac{m^{(p+1)}(x)}{(p+1)!}\Big]^2$
$C_2 = (r!)^2\Big[(N^{-1}\delta)_r\,\dfrac{m^{(p+2)}(x)}{(p+2)!} + (N^{-1}\delta-N^{-1}JN^{-1}\gamma)_r\,\dfrac{m^{(p+1)}(x)}{(p+1)!}\,\dfrac{f'(x)}{f(x)}\Big]^2$
$C_3 = (r!)^2(N^{-1}SN^{-1})_{r,r}\,\dfrac{v(x)}{f(x)}$

Case 1: Interior point, even $(p-r)$, symmetric kernel
$h_{opt} = \Big\{\dfrac{(2r+1)C_3}{2(p-r+2)C_2}\Big\}^{1/(2p+5)}n^{-1/(2p+5)}$
and the minimum (conditional) MSE is of order $n^{-2(p-r+2)/(2p+5)}$.
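As a concrete worked instance of Case 1 (assuming a symmetric kernel), take the local constant fit with $p = 0$ and $r = 0$, so that $2p+5 = 5$ and
$$h_{opt} = \Big\{\dfrac{C_3}{4C_2}\Big\}^{1/5} n^{-1/5}, \qquad \text{minimum (conditional) MSE} \asymp n^{-2(p-r+2)/(2p+5)} = n^{-4/5}.$$
Here $N = (\mu_0(K)) = (1)$, $\gamma = J = (\mu_1(K)) = (0)$ and $\delta = (\mu_2(K))$, so the constants reduce to $C_2 = \tfrac14\mu_2(K)^2\{m''(x)+2m'(x)f'(x)/f(x)\}^2$ and $C_3 = \mu_0(K^2)\,v(x)/f(x)$, matching the Nadaraya-Watson bias and variance in (3)(i).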
Case 2: Interior point with odd $(p-r)$, or boundary point ($x = \alpha h$, $0\le\alpha<1$) with odd $(p-r)$, or boundary point with even $(p-r)$
$h_{opt} = \Big\{\dfrac{(2r+1)C_3}{2(p-r+1)C_1}\Big\}^{1/(2p+3)}n^{-1/(2p+3)}$
and the minimum (conditional) MSE is of order $n^{-2(p-r+1)/(2p+3)}$.

(vii) Which order to fit when estimating $m^{(r)}$?

polynomial order    $r$                       $r+1$                     $r+2$                     $r+3$
interior bias       $O(h^2)$                  $O(h^2)$                  $O(h^4)$                  $O(h^4)$
boundary bias       $O(h)$                    $O(h^2)$                  $O(h^3)$                  $O(h^4)$
variance            $c_0\,n^{-1}h^{-2r-1}$    $c_1\,n^{-1}h^{-2r-1}$    $c_2\,n^{-1}h^{-2r-1}$    $c_3\,n^{-1}h^{-2r-1}$

For the cases of polynomial order $r+j$ with $j$ even, the orders of the interior biases are based on the use of symmetric kernels.
$c_0 = c_1 < c_2 = c_3 < \cdots$ for interior points (see the Lemma below); $c_0 < c_1 < c_2 < c_3 < \cdots$ for boundary points.
When $p-r$ is even, the leading bias term involves a complicated constant factor depending on $m$ and $f$, and the bias order at the boundary is inferior to that at the interior. When $p-r$ is odd, the leading bias term involves a relatively simple constant factor depending only on $m$, and the boundary bias is of the same order as the interior bias.
$\Rightarrow$ $p$ with $(p-r)$ odd is recommended.

Lemma: Let $A$, $B$ be $k\times k$ matrices and $\tilde A$, $\tilde B$ be $(k+1)\times(k+1)$ matrices such that $\tilde A_{r,s} = A_{r,s}$ and $\tilde B_{r,s} = B_{r,s}$ for $0\le r,s\le k-1$. Suppose $\tilde A_{r,s} = \tilde B_{r,s} = 0$ for $(r,s)$ with $r+s$ odd, and also for $A$ and $B$. Suppose that the matrices obtained by deleting all the odd-numbered columns and rows from $A$,
$\tilde A$, and those obtained by deleting all the even-numbered columns and rows, are invertible. Then
$(\tilde A^{-1}\tilde B\tilde A^{-1})_{r,r} = (A^{-1}BA^{-1})_{r,r}$ when $k-r$ is odd.

Proof: See p.10-p.11 of the lecture note "Nonparametric Regression Function Estimation".

(6) Minimax efficiency of the local linear smoother

$x$: a fixed interior point of $\operatorname{supp}(f)$
(A1) $f$ is continuous at $x$, and $f(x)>0$
(A2) $v$ is continuous at $x$
$\mathcal C_2 = \{m : |m(u)-m(x)-m'(x)(u-x)|\le\tfrac{C}{2}(u-x)^2\}$

(i) Best linear smoother
$\mathcal L$: class of all linear estimators of the form $\hat m(x) = \sum_{i=1}^n w_i(x;X_1,\ldots,X_n)Y_i$
$R_L(n,\mathcal C_2)\equiv\inf_{\hat m\in\mathcal L}\sup_{m\in\mathcal C_2}E\{(\hat m(x)-m(x))^2\mid X_1,\ldots,X_n\}$: linear minimax risk

Then, under (A1) and (A2),
$\dfrac{R_L(n,\mathcal C_2)}{\sup_{m\in\mathcal C_2}E\{(\hat m(x;h)-m(x))^2\mid X_1,\ldots,X_n\}}\ \overset{p}{\longrightarrow}\ 1$
where $\hat m(x;h)$ is the local linear smoother with the Epanechnikov kernel $K$ and the bandwidth
$h = \Big[\dfrac{\int K^2\ v(x)}{\mu_2(K)^2\,C^2 f(x)}\Big]^{1/5}n^{-1/5}$
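For the Epanechnikov kernel $K(z) = \tfrac34(1-z^2)I_{[-1,1]}(z)$ one has $\int K^2 = \tfrac35$ and $\mu_2(K) = \tfrac15$, so the bandwidth in (i) becomes $h = \{15\,v(x)/(C^2f(x))\}^{1/5}n^{-1/5}$. A small numerical sketch follows; the values of $C$, $v(x)$, $f(x)$ and $n$ are purely illustrative assumptions.

```python
def minimax_bandwidth(C, v_x, f_x, n):
    """Bandwidth of (6)(i) for the Epanechnikov kernel: int K^2 = 3/5, mu_2(K) = 1/5."""
    int_K2, mu2 = 3 / 5, 1 / 5
    return (int_K2 * v_x / (mu2**2 * C**2 * f_x)) ** (1 / 5) * n ** (-1 / 5)

print(minimax_bandwidth(C=2.0, v_x=0.25, f_x=1.0, n=500))   # about 0.28
```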
(ii) 89.4% minimax efficiency
$\mathcal T$: class of all estimators of $m(x)$
$R(n,\mathcal C_2) = \inf_{\hat m\in\mathcal T}\sup_{m\in\mathcal C_2}E\{(\hat m(x)-m(x))^2\mid X_1,\ldots,X_n\}$: minimax risk

Under (A1) and (A2),
$\dfrac{R(n,\mathcal C_2)}{\sup_{m\in\mathcal C_2}E\{(\hat m(x;h)-m(x))^2\mid X_1,\ldots,X_n\}}\ \ge\ (0.894)^2 + o_p(1)$
where $\hat m(x;h)$ is the local linear smoother with the Epanechnikov kernel and the bandwidth $h$ given in (i).

Reference: Fan, J. (1993). Local linear regression smoothers and their minimax efficiencies. The Annals of Statistics, vol. 21, pp. 196-216.

Note: Minimax efficiency in the case of general local polynomial estimators and derivative estimation has been dealt with in Fan, Gasser, Gijbels, Brockmann and Engel (1997), Annals of the Institute of Statistical Mathematics, vol. 49, pp. 79-99. The fact that the local linear smoother is the best among all linear estimators at boundary points too has been proved by Cheng, Fan and Marron (1997), The Annals of Statistics, vol. 25, pp. 1691-1708.

(7) Bandwidth selection

(i) Cross-validation
Prediction error: $CV(h)\equiv\dfrac1n\sum_{i=1}^n[Y_i-\hat m_{-i}(X_i;h)]^2$
$\hat h_{CV}\equiv\operatorname{argmin}_{h>0}CV(h)$
where $\hat m_{-i}(\cdot\,;h)$ is the leave-one-out estimator constructed with the $i$-th pair $(X_i,Y_i)$ removed.

Reference: Härdle, W., Hall, P. and Marron, J. S. (1988). How far are automatically chosen regression smoothing parameters from their optimum? (with discussion). Journal of the American Statistical Association, vol. 83, pp. 86-99.

(ii) Plug-in methods ($p$: odd)
$MISE(\hat m(\cdot\,;h)\mid X_1,\ldots,X_n) = E\Big[\int\{\hat m(x;h)-m(x)\}^2w(x)\,dx\ \Big|\ X_1,\ldots,X_n\Big]$
$= h^{2p+2}\Big\{\dfrac{(N^{-1}\gamma)_0}{(p+1)!}\Big\}^2\int\{m^{(p+1)}(x)\}^2w(x)\,dx + n^{-1}h^{-1}(N^{-1}SN^{-1})_{0,0}\int\dfrac{v(x)w(x)}{f(x)}\,dx + o_p(h^{2p+2}+n^{-1}h^{-1})$

When $w = f$ and $v(x)\equiv\sigma^2$,
$h_{MISE} = \Big[\dfrac{(p+1)!\,p!\,(N^{-1}SN^{-1})_{0,0}\,\sigma^2}{2\{(N^{-1}\gamma)_0\}^2\int\{m^{(p+1)}(x)\}^2f(x)\,dx}\Big]^{1/(2p+3)}n^{-1/(2p+3)}$

$\hat\theta_{p+1}(g)\equiv\dfrac1n\sum_{i=1}^n\{\hat m^{(p+1)}(X_i;g)\}^2$
$\hat\sigma^2$: a reasonable estimator of $\sigma^2$
$g(h) = C(K)h^\alpha$ with $C(K)$ and $\alpha$ determined in a way similar to Park and Marron (1990) or Sheather and Jones (1991).
$\hat h_{SE}$: the solution of the equation
$h = \Big[\dfrac{(p+1)!\,p!\,(N^{-1}SN^{-1})_{0,0}\,\hat\sigma^2}{2\{(N^{-1}\gamma)_0\}^2\,\hat\theta_{p+1}(g(h))}\Big]^{1/(2p+3)}n^{-1/(2p+3)}$
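Returning to the cross-validation selector of (i), here is a minimal sketch of $\hat h_{CV}$ for the local linear smoother. The kernel, the search grid and the simulated data are illustrative assumptions, and for very small $h$ the leave-one-out fit can fail when no neighbours remain in the window (cf. (8) below).

```python
import numpy as np

def loclin(x0, X, Y, h):
    """Local linear estimate of m(x0) with the Epanechnikov kernel."""
    u = (X - x0) / h
    w = 0.75 * (1 - u**2) * (np.abs(u) <= 1)
    D = np.column_stack([np.ones_like(X), X - x0])
    WD = D * w[:, None]
    return np.linalg.solve(D.T @ WD, WD.T @ Y)[0]

def cv(h, X, Y):
    """CV(h) = n^{-1} sum_i (Y_i - m_hat_{-i}(X_i; h))^2 with leave-one-out fits."""
    idx = np.arange(len(X))
    errs = [Y[i] - loclin(X[i], X[idx != i], Y[idx != i], h) for i in idx]
    return np.mean(np.square(errs))

# illustrative data and a crude grid search for h_CV
rng = np.random.default_rng(1)
X = rng.uniform(0, 1, 100)
Y = np.sin(2 * np.pi * X) + 0.3 * rng.standard_normal(100)
grid = np.linspace(0.05, 0.5, 10)
print(grid[int(np.argmin([cv(h, X, Y) for h in grid]))])
```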
Reference: Ruppert, D., Sheather, S. J. and Wand, M. P. (1995). An effective bandwidth selector for local least squares regression. Journal of the American Statistical Association, vol. 90, pp. 1257-1270.
See Chapter 4 of Fan and Gijbels (1996) for other methods and further references.

(8) Difficulties with design sparseness

$K$ is compactly supported, say on $[-1,1]$.
$\Rightarrow \mathbf X^T\mathbf W\mathbf X$ is singular if $\sum_{i=1}^n I_{[x-h,x+h]}(X_i) = 0$.
But $P\big[\sum_{i=1}^n I_{[x-h,x+h]}(X_i) = 0\big] > 0$ for all $n$.
$\Rightarrow P[\mathbf X^T\mathbf W\mathbf X\ \text{is singular}] > 0$ for all $n$
$\Rightarrow$ Unconditional bias and variance of the local polynomial estimator do not exist.

(i) Local polynomial ridge regression
$\mathbf X = \begin{pmatrix}1&(X_1-x)&\cdots&(X_1-x)^p\\ \vdots& & &\vdots\\ 1&(X_n-x)&\cdots&(X_n-x)^p\end{pmatrix}$, $\mathbf Y = \begin{pmatrix}Y_1\\ \vdots\\ Y_n\end{pmatrix}$, $\mathbf W = \operatorname{Diag}(K_h(X_i-x))$
$\hat\beta\equiv(\hat\beta_0(x),\ldots,\hat\beta_p(x))^T = \operatorname{argmin}_\beta(\mathbf Y-\mathbf X\beta)^T\mathbf W(\mathbf Y-\mathbf X\beta) = (\mathbf X^T\mathbf W\mathbf X)^{-1}\mathbf X^T\mathbf W\mathbf Y$
$\hat\beta_{ridge}(x)\equiv(\mathbf H+\mathbf X^T\mathbf W\mathbf X)^{-1}\mathbf X^T\mathbf W\mathbf Y$
where $\mathbf H$ is a nonnegative definite matrix such that $\mathbf H+\mathbf X^T\mathbf W\mathbf X$ is nonsingular.
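A minimal sketch of the ridge-modified fit above; taking $\mathbf H = \lambda I$ with a small $\lambda > 0$ is one simple assumed choice of the nonnegative definite matrix $\mathbf H$, not the specific proposal of Seifert and Gasser (1996).

```python
import numpy as np

def local_poly_ridge(x0, X, Y, h, p=1, lam=1e-8):
    """Ridge-stabilised local polynomial fit: beta_ridge = (H + X'WX)^{-1} X'WY with H = lam * I."""
    u = (X - x0) / h
    w = 0.75 * (1 - u**2) * (np.abs(u) <= 1) / h          # Epanechnikov K_h weights
    D = np.vander(X - x0, N=p + 1, increasing=True)        # design matrix X of (8)(i)
    WD = D * w[:, None]
    H = lam * np.eye(p + 1)                                 # H + X'WX is nonsingular even with an empty window
    return np.linalg.solve(H + D.T @ WD, WD.T @ Y)[0]       # beta_0_ridge(x0)
```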
Reference: Seifert, B. and Gasser, T. (1996). Finite-sample variance of local polynomials: analysis and solutions. Journal of the American Statistical Association, vol. 91, pp. 267-275.

(ii) Adding pseudo data
Idea: Let $X_{(1)}\le\cdots\le X_{(n)}$ be the ordered design points and let $Y_{[i]}$ be the response paired with $X_{(i)}$. If $X_{(i+1)}-X_{(i)}$ is large for some $i$, then add $k_i$ equally spaced pseudo design points to the interval $[X_{(i)},X_{(i+1)}]$. For each pseudo design point $X^*\in(X_{(i)},X_{(i+1)})$, define $Y^*$ by linear interpolation between the pairs $(X_{(i)},Y_{[i]})$ and $(X_{(i+1)},Y_{[i+1]})$.

[Figure: pseudo data placed on the line segment joining $(X_{(i)},Y_{[i]})$ and $(X_{(i+1)},Y_{[i+1]})$.]

Reference: Hall, P. and Turlach, B. A. (1997). Interpolation methods for adapting to sparse design in nonparametric regression. Journal of the American Statistical Association, vol. 92, pp. 466-476.
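A rough sketch of the pseudo-data idea above; the fixed gap threshold and the fixed number k of pseudo points per gap are simplifying assumptions for illustration only.

```python
import numpy as np

def add_pseudo_data(X, Y, gap, k=3):
    """Insert k equally spaced pseudo points into each design gap wider than `gap`,
    with pseudo responses obtained by linear interpolation between the neighbouring pairs."""
    order = np.argsort(X)
    Xs, Ys = X[order], Y[order]
    newX, newY = [Xs[0]], [Ys[0]]
    for i in range(len(Xs) - 1):
        if Xs[i + 1] - Xs[i] > gap:
            t = np.linspace(Xs[i], Xs[i + 1], k + 2)[1:-1]                     # k interior pseudo design points
            newX.extend(t)
            newY.extend(np.interp(t, [Xs[i], Xs[i + 1]], [Ys[i], Ys[i + 1]]))  # linear interpolation
        newX.append(Xs[i + 1])
        newY.append(Ys[i + 1])
    return np.array(newX), np.array(newY)
```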
(9) Quantile regression

$(X_1,Y_1),\ldots,(X_n,Y_n)$ iid with joint cdf $F(x,y)$ and pdf $f(x,y)$

(i) Quantile function: $y = q_p(x)\ \Leftrightarrow\ F_{Y|X}(y\mid x) = p$ $(0<p<1)$
$q_p(x) = \operatorname{argmin}_\beta E[\rho_p(Y-\beta)\mid X=x]$
where $\rho_p$ is the check function defined by
$\rho_p(z) = \tfrac12\{|z|+(2p-1)z\} = pz\,I_{[0,\infty)}(z) - (1-p)z\,I_{(-\infty,0)}(z)$

[Figure: the check function $\rho_p$ for $p = \tfrac12$, $p < \tfrac12$ and $p > \tfrac12$.]

Proof:
$E[\rho_p(Y-\beta)\mid X=x] = p\int_\beta^\infty yf_{Y|X}(y\mid x)\,dy - (1-p)\int_{-\infty}^\beta yf_{Y|X}(y\mid x)\,dy - p\beta\big[1-F_{Y|X}(\beta\mid x)\big] + (1-p)\beta F_{Y|X}(\beta\mid x)$
Solving $\dfrac{\partial}{\partial\beta}E[\rho_p(Y-\beta)\mid X=x] = F_{Y|X}(\beta\mid x) - p = 0$ leads to the result.

(ii) Applications
constructing prediction intervals $[\hat q_{\alpha/2}(x),\hat q_{1-\alpha/2}(x)]$
detecting heteroscedasticity
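A small numerical check of (i): the minimiser of the empirical analogue of $E[\rho_p(Y-\beta)]$ is (approximately) the $p$-th sample quantile. The simulated data and the grid are illustrative assumptions.

```python
import numpy as np

def rho(z, p):
    """Check function rho_p(z) = p*z for z >= 0 and (p-1)*z for z < 0."""
    return np.where(z >= 0, p * z, (p - 1) * z)

rng = np.random.default_rng(2)
Y = rng.standard_normal(5000)
grid = np.linspace(-3, 3, 1201)
risk = [np.mean(rho(Y - b, 0.9)) for b in grid]
print(grid[int(np.argmin(risk))], np.quantile(Y, 0.9))   # both close to the N(0,1) 0.9-quantile, about 1.28
```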
(iii) Local polynomial quantile regression
$\hat q_p(x) = \hat\beta_0(x)$, where
$(\hat\beta_0(x),\ldots,\hat\beta_p(x)) = \operatorname{argmin}_\beta\sum_{i=1}^n\rho_p\big(Y_i-\beta_0-\beta_1(X_i-x)-\cdots-\beta_p(X_i-x)^p\big)K_h(X_i-x)$

Reference: Yu, K. and Jones, M. C. (1998). Local linear quantile regression. Journal of the American Statistical Association, vol. 93, pp. 228-237.

(10) Robust regression

Very useful when the error variance is large.
$\hat m(x;h) = \hat\beta_0(x)$, where
$(\hat\beta_0(x),\ldots,\hat\beta_p(x)) = \operatorname{argmin}_\beta\sum_{i=1}^n\ell\big(Y_i-\beta_0-\beta_1(X_i-x)-\cdots-\beta_p(X_i-x)^p\big)K_h(X_i-x)$
Choice of $\ell$: more resistant to outliers than the squared error loss.

(i) Huber's $\Psi$: $\ell' = \Psi_c$, where $\Psi_c(z) = \max\{-c,\min(c,z)\}$

[Figure: Huber's $\Psi_c$, linear on $[-c,c]$ and constant outside.]
(ii) Huber's bisquare: $\ell' = B_c$, where
$B_c(z) = z\Big\{1-\Big(\dfrac zc\Big)^2\Big\}^2 I_{[-1,1]}\Big(\dfrac zc\Big)$

[Figure: Huber's bisquare $B_c$, vanishing outside $[-c,c]$.]

References
1. Härdle, W. and Gasser, T. (1984). Robust non-parametric function fitting. Journal of the Royal Statistical Society, Series B, vol. 46, pp. 42-51.
2. Fan, J., Hu, T.-C. and Truong, Y. K. (1994). Robust nonparametric function estimation. Scandinavian Journal of Statistics, vol. 21, pp. 433-446.

(iii) Robust locally weighted regression (LOWESS: LOcally WEighted Scatter plot Smoothing, due to Cleveland (1979), JASA, vol. 74, pp. 829-836)

Step 1: For each $k$, minimize
$\sum_{i=1}^n\{Y_i-\beta_0-\beta_1(X_i-X_k)-\cdots-\beta_p(X_i-X_k)^p\}^2K_{h_k}(X_i-X_k)$
with respect to $\beta_0,\ldots,\beta_p$, where $h_k$ is the $[nd]$-th smallest number among $\{|X_k-X_j| : j=1,\ldots,n\}$ ($[nd]$ means the nearest integer to $nd$, $0<d\le1$; Cleveland used $K(t) = \frac{70}{81}(1-|t|^3)^3I_{(-1,1)}(t)$ and suggested $0.2\le d\le0.8$).
$\hat Y_k = \hat\beta_0\equiv\hat\beta_0(X_k)$, $k=1,\ldots,n$
$r_k = Y_k-\hat Y_k$, $k=1,\ldots,n$
Step 2: $M = \operatorname{median}\{|r_1|,\ldots,|r_n|\}$ and $\delta_i = L(r_i/6M)$, where $L$ is a kernel. (Cleveland used the biweight kernel $L(t) = (1-t^2)^2I_{(-1,1)}(t)$.)

Step 3: For each $k$, minimize
$\sum_{i=1}^n\{Y_i-\beta_0-\beta_1(X_i-X_k)-\cdots-\beta_p(X_i-X_k)^p\}^2\,\delta_iK_{h_k}(X_i-X_k)$
$\hat Y_k = \hat\beta_0\equiv\hat\beta_0(X_k)$, $r_k = Y_k-\hat Y_k$

Repeat Steps 2 and 3 a total of $N$ times and take the final fitted values $\hat Y_k$, $k=1,\ldots,n$. Obtain the estimated values at points $x$ different from the design points $\{X_k : k=1,\ldots,n\}$ by interpolation. (Cleveland suggested $p=1$, $N=3$; see Figure 2.4, p.27 of FG.)

Note: As $N\to\infty$, $\hat Y_k$ converges to the robust regression estimator $\hat m(X_k;h_k)$ with $\ell'(z) = B_{6M}(z)$.
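A compact sketch of Steps 1-3 (local linear by default, with the tricube kernel $K$ and the biweight kernel $L$ as Cleveland used); the helper structure and the defaults are illustrative assumptions, and no care is taken here about tied design points or degenerate windows.

```python
import numpy as np

def lowess(X, Y, d=0.5, p=1, N=3):
    """Robust locally weighted regression: Step 1, then Steps 2-3 repeated N times."""
    n = len(X)
    k = max(int(round(n * d)), p + 1)                     # [nd], the nearest-neighbour window size
    tricube  = lambda t: (70 / 81) * np.clip(1 - np.abs(t)**3, 0, None)**3
    biweight = lambda t: np.clip(1 - t**2, 0, None)**2

    def fit_all(delta):
        Yhat = np.empty(n)
        for j in range(n):
            h = np.sort(np.abs(X - X[j]))[k - 1]          # h_j: [nd]-th smallest |X_j - X_i|
            w = delta * tricube((X - X[j]) / h)
            D = np.vander(X - X[j], N=p + 1, increasing=True)
            WD = D * w[:, None]
            Yhat[j] = np.linalg.solve(D.T @ WD, WD.T @ Y)[0]
        return Yhat

    Yhat = fit_all(np.ones(n))                            # Step 1: delta_i = 1
    for _ in range(N):                                    # Steps 2 and 3, repeated N times
        r = Y - Yhat
        M = np.median(np.abs(r))                          # Step 2: robustness weights from residuals
        Yhat = fit_all(biweight(r / (6 * M)))             # Step 3: refit with delta_i * K weights
    return Yhat
```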