Local Polynomial Regression
John Hughes
October 2, 2013

Recall that the nonparametric regression model is $Y_i = f(x_i) + \varepsilon_i$, where $f$ is the regression function and the $\varepsilon_i$ are errors such that $E\varepsilon_i = 0$.

The Nadaraya-Watson Kernel Estimator

The Nadaraya-Watson kernel estimator offers what is probably the simplest approach to nonparametric regression. The kernel estimator is an example of a linear smoother. The estimator is linear in the sense that it is given by a linear transformation of the response. Specifically, let $s(x) = (s_1(x), \dots, s_n(x))'$, where
$$s_i(x) = \frac{w_i(x)}{\sum_{j=1}^n w_j(x)},$$
where $w_i(x) = K\{(x - x_i)/h\}$. Now, if $Y = (Y_1, \dots, Y_n)'$, the kernel estimator of $f(x)$ is
$$\hat{f}(x) = s(x)'Y = \sum_{i=1}^n s_i(x)\,Y_i = \sum_{i=1}^n \frac{w_i(x)}{\sum_{j=1}^n w_j(x)}\,Y_i = \sum_{i=1}^n \frac{K\{(x - x_i)/h\}}{\sum_{j=1}^n K\{(x - x_j)/h\}}\,Y_i.$$
This shows that $\hat{f}(x)$ is a weighted average of the observations, where the weights $s(x)$ are normalized kernel weights. This formulation can easily be extended to handle a grid of estimation points $z = (z_1, \dots, z_g)'$. Form the $g \times n$ matrix $S$, the $k$th row of which is $s(z_k)'$. Then $\hat{f}(z) = (\hat{f}(z_1), \dots, \hat{f}(z_g))' = SY$. The matrix $S$ is called the smoothing matrix. It is analogous to the hat matrix from linear regression.
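As a concrete illustration, here is a minimal sketch of the Nadaraya-Watson estimator and its smoothing matrix. The Gaussian kernel, the bandwidth, and the simulated data are assumptions for illustration, not part of the notes:

```python
import numpy as np

def gaussian_kernel(u):
    """Standard Gaussian kernel K(u)."""
    return np.exp(-0.5 * u**2) / np.sqrt(2.0 * np.pi)

def smoothing_matrix(z, x, h):
    """g x n matrix S whose kth row holds the normalized kernel weights s(z_k)'."""
    w = gaussian_kernel((z[:, None] - x[None, :]) / h)  # raw kernel weights w_i(z_k)
    return w / w.sum(axis=1, keepdims=True)             # normalize each row to sum to 1

# Nadaraya-Watson fit on simulated data
rng = np.random.default_rng(0)
x = np.sort(rng.uniform(0, 1, 100))
y = np.sin(2 * np.pi * x) + 0.2 * rng.standard_normal(100)

z = np.linspace(0.05, 0.95, 50)       # grid of estimation points
S = smoothing_matrix(z, x, h=0.05)
fhat = S @ y                          # f_hat(z) = SY, linear in the response
```

Because each row of $S$ sums to one, every fitted value is a weighted average of the observations, which is exactly the sense in which the estimator is a linear smoother.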
Theorem 1 The risk (assuming the $L_2$ loss) of the Nadaraya-Watson kernel estimator is
$$R(\hat{f}, f) \approx \frac{h^4}{4}\left\{\int x^2 K(x)\,dx\right\}^2 \int \left\{\ddot{f}(x) + 2\dot{f}(x)\frac{\dot{g}(x)}{g(x)}\right\}^2 dx + \frac{\sigma^2 \int K^2(x)\,dx}{nh}\int \frac{1}{g(x)}\,dx + o\{(nh)^{-1}\} + o(h^4) \quad (1)$$
as $h \to 0$ and $nh \to \infty$, where $g$ is the density from which the $x_i$ are drawn, and $\sigma^2 = \mathbb{V}\varepsilon_i$.

If we set the derivative of (1) equal to zero and solve for $h$, we get the optimal bandwidth
$$h_{opt} = n^{-1/5}\left[\frac{\sigma^2 \int K^2(x)\,dx \int \frac{1}{g(x)}\,dx}{\left\{\int x^2 K(x)\,dx\right\}^2 \int \left\{\ddot{f}(x) + 2\dot{f}(x)\frac{\dot{g}(x)}{g(x)}\right\}^2 dx}\right]^{1/5},$$
which implies that $h_{opt} = O(n^{-1/5})$. If we plug $h_{opt}$ into (1), we see that the risk decreases at the rate $O(n^{-4/5})$. For most parametric models, the risk of the MLE decreases at the rate $O(n^{-1})$. The moral of this story is that we pay a price for using a nonparametric approach. We gain flexibility, but we may sacrifice statistical power to get it.

Figure 1: This figure illustrates the price we pay for adopting a nonparametric approach, as measured by the rate at which risk decreases with increasing sample size. The solid curve is $n^{-4/5}$. The dashed curve is $n^{-1}$.

Local Polynomial Regression

A kernel estimator suffers from design bias (a bias that depends on the distribution of the $x_i$) and boundary bias (a bias near the endpoints of the $x_i$). These biases can be reduced by using local polynomial regression.

Consider choosing an estimator that minimizes $\sum_{i=1}^n (Y_i - \beta_0)^2$. Note that this is equivalent to minimizing the squared length of $Y - \beta_0\mathbf{1}$, where length is defined as the ordinary Euclidean norm
$$\|v\| = \sqrt{\sum_i v_i^2},$$
which is in turn defined in terms of the usual inner product, the dot product: $\langle u, v \rangle = u'v = \sum_i u_i v_i$. That is, $\|v\|^2 = \langle v, v \rangle = v'v$. Recall that the solution to this estimation problem is $\hat\beta_0 = \bar{Y}$. The vector $\bar{Y}\mathbf{1}$ is the vector in $\mathrm{span}\{\mathbf{1}\}$ that is closest to $Y$ with respect to the ordinary norm. You may also recall that $\bar{Y}\mathbf{1}$ is the orthogonal
projection of $Y$ onto $\mathrm{span}\{\mathbf{1}\}$, where our notion of perpendicularity is given by the dot product: $u \perp v$ iff $u'v = 0$. To see this, observe that the orthogonal projection of $Y$ onto $\mathrm{span}\{\mathbf{1}\}$ is $\mathbf{1}(\mathbf{1}'\mathbf{1})^{-1}\mathbf{1}'Y$. (This is just a special case of $X(X'X)^{-1}X'Y$ from ordinary linear regression.) Now, $\mathbf{1}'\mathbf{1} = n$, and $\mathbf{1}'Y = \sum_{i=1}^n Y_i$. Thus
$$\mathbf{1}(\mathbf{1}'\mathbf{1})^{-1}\mathbf{1}'Y = \mathbf{1}\,\frac{1}{n}\sum_{i=1}^n Y_i = \bar{Y}\mathbf{1}.$$

Now change the scenario slightly by changing the inner product from $u'v$ to $u'W_x v$, where $W_x = \mathrm{diag}\{w_i(x)\}$, with $w_i(x) = K\{(x - x_i)/h\}$. The analogous estimation problem is to minimize $\sum_{i=1}^n w_i(x)(Y_i - \beta_0)^2$, but now the relevant projection is orthogonal with respect to this new inner product. Hence,
$$\hat\beta_0 = (\mathbf{1}'W_x\mathbf{1})^{-1}\mathbf{1}'W_x Y.$$
This implies that
$$\hat{f}(x) = \hat\beta_0 = \frac{\sum_{i=1}^n w_i(x)\,Y_i}{\sum_{i=1}^n w_i(x)},$$
the kernel estimator. And so we see that the kernel estimator results from introducing kernel weights into an intercept-only linear model. The weights localize the estimator in the sense that more distant observations are down-weighted. Since the kernel estimator is local and uses only an intercept, the kernel estimator is sometimes called a locally constant estimator.

Local polynomial regression is based on the idea that we might improve the estimator by using a higher-order polynomial as a local approximation to $f$. Taylor's theorem tells us this is a sensible idea. According to Taylor's theorem,
$$f(z) \approx f(x) + f^{(1)}(x)(z - x) + \frac{f^{(2)}(x)}{2!}(z - x)^2 + \cdots + \frac{f^{(p)}(x)}{p!}(z - x)^p$$
$$= \beta_0 + \beta_1(z - x) + \frac{\beta_2}{2!}(z - x)^2 + \cdots + \frac{\beta_p}{p!}(z - x)^p \equiv P_x(z, \beta)$$
for $z$ in a neighborhood of $x$, where $f^{(m)}$ denotes the $m$th derivative of $f$. The kernel estimator takes $p = 0$. More generally, local polynomial regression of order $p$ minimizes
$$\sum_{i=1}^n w_i(x)\{Y_i - P_x(x_i, \beta)\}^2. \quad (2)$$
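The equivalence between the weighted intercept-only fit and the kernel estimator can be checked numerically: $(\mathbf{1}'W_x\mathbf{1})^{-1}\mathbf{1}'W_xY$ matches the normalized weighted average exactly. A minimal sketch (the Gaussian kernel and simulated data are assumptions):

```python
import numpy as np

rng = np.random.default_rng(1)
x = np.sort(rng.uniform(0, 1, 50))
y = x**2 + 0.1 * rng.standard_normal(50)

x0, h = 0.5, 0.1
w = np.exp(-0.5 * ((x0 - x) / h) ** 2)   # kernel weights w_i(x0)
W = np.diag(w)
ones = np.ones((50, 1))

# Intercept-only weighted least squares: (1'W1)^{-1} 1'WY
beta0_hat = float(np.linalg.solve(ones.T @ W @ ones, ones.T @ W @ y))

# Locally constant (kernel) estimator: normalized weighted average
f_hat = float(np.sum(w * y) / np.sum(w))
```

The two quantities agree to machine precision, which is the projection argument made concrete.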
This yields the local estimate
$$\hat{f}(x) = P_x(x, \hat\beta) = \hat\beta_0(x).$$
Note that the minimizer of (2) is
$$\hat\beta(x) = (X_x'W_x X_x)^{-1}X_x'W_x Y,$$
where
$$X_x = \begin{pmatrix} 1 & x_1 - x & \frac{(x_1 - x)^2}{2!} & \cdots & \frac{(x_1 - x)^p}{p!} \\ \vdots & \vdots & \vdots & & \vdots \\ 1 & x_n - x & \frac{(x_n - x)^2}{2!} & \cdots & \frac{(x_n - x)^p}{p!} \end{pmatrix}.$$
This implies that $\hat{f}(x) = \hat\beta_0(x)$ is the inner product of $Y$ with the first row of $(X_x'W_x X_x)^{-1}X_x'W_x$, and so $\hat{f}(x)$ is a linear smoother. The estimator has mean and variance
$$E\hat{f}(x) = s(x)'f(x), \qquad \mathbb{V}\hat{f}(x) = \sigma^2\|s(x)\|^2,$$
where $s(x)'$ is the first row of $(X_x'W_x X_x)^{-1}X_x'W_x$ and $f(x) = (f(x_1), \dots, f(x_n))'$.

Why p Should Be Odd

The case $p = 1$ is called local linear regression. Local linear regression eliminates design bias and alleviates boundary bias.

Theorem 2 Let $Y_i = f(X_i) + \sigma(X_i)\varepsilon_i$ for $i \in \{1, \dots, n\}$ and $X_i \in [a, b]$. Assume that the $X_i$ were drawn from density $g$. Suppose that $g$ is positive; $g$, $\ddot{f}$, and $\sigma$ are continuous in a neighborhood of $x$; and $h \to 0$ and $nh \to \infty$. Let $x \in (a, b)$. Then the local constant estimator and the local linear estimator both have variance
$$\frac{\sigma^2(x)}{g(x)nh}\int K^2(u)\,du + o\{(nh)^{-1}\}.$$
The local constant estimator has bias
$$h^2\left\{\frac{1}{2}\ddot{f}(x) + \frac{\dot{f}(x)\dot{g}(x)}{g(x)}\right\}\int u^2 K(u)\,du + o(h^2),$$
and the local linear estimator has bias
$$h^2\,\frac{1}{2}\ddot{f}(x)\int u^2 K(u)\,du + o(h^2).$$
At the endpoints of $[a, b]$, the local constant estimator has bias of order $h$, and the local linear estimator has bias of order $h^2$.

More generally, let $p$ be even. Then local polynomial regression of order $p + 1$ reduces design bias and boundary bias relative to local polynomial regression of order $p$, without increasing the variance.
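The order-$p$ estimator $\hat\beta(x) = (X_x'W_xX_x)^{-1}X_x'W_xY$ translates directly into code. The Gaussian kernel, the simulated data, and the helper name `local_poly_fit` are assumptions for illustration:

```python
import numpy as np
from math import factorial

def local_poly_fit(x0, x, y, h, p=1):
    """Local polynomial estimate f_hat(x0) of order p with a Gaussian kernel."""
    w = np.exp(-0.5 * ((x0 - x) / h) ** 2)                 # kernel weights
    # Design matrix X_x: jth column is (x_i - x0)^j / j!
    X = np.column_stack([(x - x0) ** j / factorial(j) for j in range(p + 1)])
    WX = w[:, None] * X                                    # W_x X_x without forming diag(w)
    beta = np.linalg.solve(X.T @ WX, WX.T @ y)             # (X'WX)^{-1} X'WY
    return beta[0]                                         # f_hat(x0) = beta0_hat(x0)

rng = np.random.default_rng(2)
x = np.sort(rng.uniform(0, 1, 200))
y = np.sin(2 * np.pi * x) + 0.1 * rng.standard_normal(200)
fit = local_poly_fit(0.5, x, y, h=0.1, p=1)   # local linear estimate at x = 0.5
```

Setting `p=0` recovers the locally constant (Nadaraya-Watson) estimator, since the design matrix then collapses to a column of ones.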
Variance Estimation

Homoscedasticity

Until the previous theorem we had been assuming homoscedasticity, i.e., $Y_i = f(x_i) + \sigma\varepsilon_i$ for all $i$, where $\mathbb{V}\varepsilon_i = 1$. In this case, we can estimate $\sigma^2$ in a simple and familiar way, namely, as the sum of the squared residuals divided by the residual degrees of freedom. More specifically, the estimator is
$$\hat\sigma^2 = \frac{\sum_{i=1}^n\{Y_i - \hat{f}(x_i)\}^2}{n - 2\nu + \tilde\nu} = \frac{e'e}{n - 2\nu + \tilde\nu} = \frac{\|e\|^2}{n - 2\nu + \tilde\nu},$$
where $\nu = \mathrm{tr}(S)$ and $\tilde\nu = \mathrm{tr}(S'S) = \sum_i \|s(x_i)\|^2$. Recall that $S$ is the smoothing matrix. The estimator $\hat\sigma^2$ is consistent for $\sigma^2$. To see this, first observe that $e = Y - SY = (I - S)Y$, which implies that $\hat\sigma^2 = Y'\Lambda Y/\mathrm{tr}(\Lambda)$, where $\Lambda = (I - S)'(I - S)$. A well-known fact about quadratic forms is $EY'AY = \mathrm{tr}(A\Sigma) + \mu'A\mu$, where $\Sigma = \mathbb{V}Y$ and $\mu = EY$. Thus
$$E\hat\sigma^2 = \frac{EY'\Lambda Y}{\mathrm{tr}(\Lambda)} = \frac{\mathrm{tr}(\Lambda\sigma^2 I) + f'\Lambda f}{\mathrm{tr}(\Lambda)} = \sigma^2 + \frac{f'\Lambda f}{n - 2\nu + \tilde\nu}. \quad (3)$$
Under mild conditions, the second term in (3) will go to zero as $n \to \infty$. The appearance of $n - 2\nu + \tilde\nu$ may seem mysterious, but this quantity is in fact analogous to the residual degrees of freedom $n - p$ in ordinary linear regression. In that setting, $n - p = \mathrm{tr}(I - H) = \mathrm{tr}\{(I - H)'(I - H)\}$, where $H$ is the hat matrix. In the current setting,
$I - H$ is replaced by $I - S$, and we have
$$\mathrm{tr}\{(I - S)'(I - S)\} = \mathrm{tr}(I'I - I'S - S'I + S'S) = \mathrm{tr}(I - S - S' + S'S)$$
$$= \mathrm{tr}(I) - \mathrm{tr}(S) - \mathrm{tr}(S') + \mathrm{tr}(S'S) = n - 2\,\mathrm{tr}(S) + \mathrm{tr}(S'S) = n - 2\nu + \tilde\nu.$$

Heteroscedasticity

Now suppose that $Y_i = f(x_i) + \sigma(x_i)\varepsilon_i$. Since this implies that $\sigma$ is a (presumably non-constant) function, estimating it requires a second regression. The second regression is for the model
$$Z_i \equiv \log\{Y_i - f(x_i)\}^2 = \log\{\sigma^2(x_i)\varepsilon_i^2\} = \log\sigma^2(x_i) + \log\varepsilon_i^2 = \log\sigma^2(x_i) + \delta_i.$$
This model suggests that we could estimate $\log\sigma^2(x)$ by doing a regression with the log squared residuals from the first regression as the response. Specifically, we do the following.

1. Estimate $f(x)$ to arrive at $\hat{f}(x)$.
2. Let $Z_i = \log\{Y_i - \hat{f}(x_i)\}^2$.
3. Regress the $Z_i$ on the $x_i$ to get an estimate $\hat{g}(x)$ of $\log\sigma^2(x)$.
4. Let $\hat\sigma^2(x) = \exp\hat{g}(x)$.

Confidence Bands

We would of course like to construct confidence bands for $f$. A confidence interval for $f(x)$ usually has the form $\hat{f}(x) \pm c\,\mathrm{se}(x)$, where $c > 0$ is a constant and $\mathrm{se}(x)$ is an estimate of the standard deviation of $\hat{f}(x)$. Perhaps counterintuitively, such a confidence interval is not truly an interval for $f(x)$, but is instead an interval for
$$\bar{f}(x) \equiv E\hat{f}(x) = s(x)'f(x).$$
This is because there is a bias that does not disappear as the sample size becomes large.
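Both variance estimators can be sketched with a Nadaraya-Watson smoother. Everything below (Gaussian kernel, bandwidths, simulated data) is an illustrative assumption:

```python
import numpy as np

def nw_matrix(z, x, h):
    """Smoothing matrix whose rows are normalized Gaussian kernel weights."""
    w = np.exp(-0.5 * ((z[:, None] - x[None, :]) / h) ** 2)
    return w / w.sum(axis=1, keepdims=True)

rng = np.random.default_rng(3)
n = 400
x = np.sort(rng.uniform(0, 1, n))

# Homoscedastic case: sigma2_hat = ||e||^2 / (n - 2 nu + nu_tilde)
sigma = 0.3
y = np.sin(2 * np.pi * x) + sigma * rng.standard_normal(n)
S = nw_matrix(x, x, h=0.05)
e = y - S @ y                                  # e = (I - S)Y
nu = np.trace(S)                               # nu = tr(S)
nu_tilde = np.trace(S.T @ S)                   # nu_tilde = tr(S'S)
sigma2_hat = (e @ e) / (n - 2 * nu + nu_tilde)

# Heteroscedastic case: the four-step log-squared-residual regression
sigma_x = 0.1 + 0.4 * x                        # true non-constant sigma(x)
y2 = np.sin(2 * np.pi * x) + sigma_x * rng.standard_normal(n)
fhat = nw_matrix(x, x, h=0.05) @ y2            # step 1: estimate f
Z = np.log((y2 - fhat) ** 2 + 1e-12)           # step 2: log squared residuals
ghat = nw_matrix(x, x, h=0.15) @ Z             # step 3: estimate log sigma^2(x)
sigma2_x_hat = np.exp(ghat)                    # step 4: exponentiate
```

One caveat on the second regression: for Gaussian errors $E\log\varepsilon_i^2 < 0$, so $\exp\hat{g}(x)$ understates $\sigma^2(x)$ by a constant factor; the algorithm is reproduced here as stated in the notes, and it still recovers the shape of the variance function.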
Let $s_n(x)$ be the standard deviation of $\hat{f}(x)$. Then
$$\frac{\hat{f}(x) - f(x)}{s_n(x)} = \frac{\hat{f}(x) - \bar{f}(x)}{s_n(x)} + \frac{\bar{f}(x) - f(x)}{s_n(x)} = Z_n(x) + \frac{\mathrm{bias}\{\hat{f}(x)\}}{\sqrt{\mathbb{V}\hat{f}(x)}}.$$
Typically, $Z_n(x) \rightsquigarrow N(0, 1)$. In a nonparametric setting, the second term does not go to zero as the sample size increases. This means the bias is present in the limit, which implies that the resulting confidence interval is not centered around $f(x)$. We might respond to this by

1. accepting that our confidence interval is for $\bar{f}(x)$ rather than $f(x)$;
2. attempting to correct the bias by estimating the bias function $\bar{f}(x) - f(x)$; or
3. minimizing the bias by undersmoothing.

The second option is perhaps the most tempting but is considerably more difficult than estimating $f(x)$ since the bias involves $\ddot{f}(x)$. This fact makes the first and third options more appealing. Most people go with the first option because it is difficult to choose the right amount of undersmoothing.

Pointwise Bands

We can construct a pointwise band by invoking asymptotic normality or by using the bootstrap. In the former case, the interval is $\hat{f}(x) \pm \Phi^{-1}(1 - \alpha/2)\,\mathrm{se}(x)$. As for the bootstrap, how we should resample depends on whether we assume homoscedasticity. If we do assume constant variance, i.e., $\sigma(x) = \sigma$, the $k$th bootstrap dataset is
$$Y_i^{(k)} = \hat{f}(x_i) + e_i^{(k)} \qquad (i = 1, \dots, n),$$
where $e^{(k)} = (e_1^{(k)}, \dots, e_n^{(k)})'$ is a sample (with replacement) of size $n$ from the vector of residuals $e = (Y_1 - \hat{f}(x_1), \dots, Y_n - \hat{f}(x_n))'$. The endpoints of the resulting interval at $x_i$ are the $\alpha/2$ and $1 - \alpha/2$ quantiles of the bootstrap sample $\hat{f}^{(1)}(x_i), \dots, \hat{f}^{(b)}(x_i)$.

If we assume that $\sigma(x)$ is a non-constant function, we can still do a bootstrap, but we must modify the resampling procedure. Here is the algorithm in detail.

1. Estimate $\sigma(x_i)$ to arrive at $\hat\sigma(x_i)$ for $i \in \{1, \dots, n\}$.
2. Studentize the vector of residuals $(Y_1 - \hat{f}(x_1), \dots, Y_n - \hat{f}(x_n))'$ by dividing the $i$th element by $\hat\sigma(x_i)$:
$$e_i = \frac{Y_i - \hat{f}(x_i)}{\hat\sigma(x_i)}.$$
3. Compute the $k$th bootstrap dataset as
$$Y_i^{(k)} = \hat{f}(x_i) + \hat\sigma(x_i)\,e_i^{(k)} \qquad (i = 1, \dots, n),$$
where $e^{(k)} = (e_1^{(k)}, \dots, e_n^{(k)})'$ is a sample (with replacement) of size $n$ from the vector of Studentized residuals.
4. Compute $\hat{f}^{(k)}(x) = SY^{(k)}$ for $k \in \{1, \dots, b\}$.
5. The endpoints of the confidence interval at $x_i$ are again the $\alpha/2$ and $1 - \alpha/2$ quantiles of the bootstrap sample $\hat{f}^{(1)}(x_i), \dots, \hat{f}^{(b)}(x_i)$.

Simultaneous Bands

To construct a simultaneous band we use the so-called tube formula. Suppose that $\sigma$ is known, and let $I(x)$ be an interval. Then
$$P\{\bar{f}(x) \notin I(x) \text{ for some } x \in [a, b]\} = P\left(\max_{x \in [a,b]} \frac{|\hat{f}(x) - \bar{f}(x)|}{\sigma\|s(x)\|} > c\right) = P\left(\max_{x \in [a,b]} \frac{\left|\sum_i \varepsilon_i s_i(x)\right|}{\sigma\|s(x)\|} > c\right) = P\left(\max_{x \in [a,b]} |W(x)| > c\right),$$
where $W(x) = \sum_i Z_i T_i(x)$, $Z_i = \varepsilon_i/\sigma \sim N(0, 1)$, and $T_i(x) = s_i(x)/\|s(x)\|$. It turns out that
$$P\left(\max_{x \in [a,b]} |W(x)| > c\right) \approx 2\{1 - \Phi(c)\} + \frac{\kappa}{\pi}\exp(-c^2/2)$$
for large $c$, where $\kappa = \int_a^b \|\dot{T}(x)\|\,dx$, with $\dot{T}(x) = (\dot{T}_1(x), \dots, \dot{T}_n(x))'$. Choosing $c$ to solve
$$2\{1 - \Phi(c)\} + \frac{\kappa}{\pi}\exp(-c^2/2) = \alpha$$
yields the desired band $\hat{f}(x) \pm c\,\mathrm{se}(x)$.
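The tube-formula critical value can be computed numerically, approximating $\kappa$ by finite differences of the normalized weight functions on a grid and solving for $c$ by bisection. The Nadaraya-Watson weights, Gaussian kernel, grid, and bandwidth below are assumptions for illustration:

```python
import numpy as np
from math import erf, sqrt, pi

def Phi(c):
    """Standard normal CDF via the error function."""
    return 0.5 * (1.0 + erf(c / sqrt(2.0)))

rng = np.random.default_rng(5)
x = np.sort(rng.uniform(0, 1, 100))
h = 0.1
grid = np.linspace(0.05, 0.95, 400)

# Rows s(x)' of normalized kernel weights, then T(x) = s(x)/||s(x)||
w = np.exp(-0.5 * ((grid[:, None] - x[None, :]) / h) ** 2)
s = w / w.sum(axis=1, keepdims=True)
T = s / np.linalg.norm(s, axis=1, keepdims=True)

# kappa = integral of ||T_dot(x)|| over [a, b], via finite differences
dT = np.diff(T, axis=0) / np.diff(grid)[:, None]
kappa = np.sum(np.linalg.norm(dT, axis=1) * np.diff(grid))

# Solve 2{1 - Phi(c)} + (kappa/pi) exp(-c^2/2) = alpha by bisection
alpha = 0.05
def tail(c):
    return 2.0 * (1.0 - Phi(c)) + (kappa / pi) * np.exp(-c**2 / 2.0) - alpha

lo, hi = 1.0, 10.0
for _ in range(80):
    mid = 0.5 * (lo + hi)
    if tail(mid) > 0:   # tail probability still above alpha: move right
        lo = mid
    else:
        hi = mid
c = 0.5 * (lo + hi)     # simultaneous critical value; band is f_hat(x) +/- c*se(x)
```

The resulting $c$ exceeds the pointwise value $\Phi^{-1}(1 - \alpha/2) \approx 1.96$, which is the price of simultaneous coverage.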
Choosing the Right Bandwidth

We want to choose $h$ to minimize the risk
$$R(h) = \frac{1}{n}E\left(\sum_{i=1}^n\{\hat{f}(x_i) - f(x_i)\}^2\right).$$
Since $R(h)$ depends on the unknown function $f$, we will instead minimize an estimate $\hat{R}(h)$ of $R(h)$. It might seem sensible to estimate $R(h)$ using
$$\hat{R}(h) = \frac{1}{n}\sum_{i=1}^n\{Y_i - \hat{f}(x_i)\}^2,$$
the so-called training error. But this estimator is biased downward and usually leads to undersmoothing. A better risk estimator is the leave-one-out cross-validation score:
$$\mathrm{CV}(h) = \hat{R}(h) = \frac{1}{n}\sum_{i=1}^n\{Y_i - \hat{f}_{(-i)}(x_i)\}^2,$$
where $\hat{f}_{(-i)}(x_i)$ is the estimate obtained by leaving out the $i$th observation. Intuitively, we are asking, "How well can we predict $Y_i$ if we do not use $Y_i$ in the estimation procedure?" For linear smoothers, computing this score is not as burdensome as it may seem because we do not have to recompute the estimate with each observation left out. Instead, we have
$$\mathrm{CV}(h) = \hat{R}(h) = \frac{1}{n}\sum_{i=1}^n\left\{\frac{Y_i - \hat{f}(x_i)}{1 - S_{ii}}\right\}^2,$$
where $S_{ii}$ is the $i$th diagonal element of $S$. An alternative is the generalized cross-validation score:
$$\mathrm{GCV}(h) = \frac{1}{n}\sum_{i=1}^n\left\{\frac{Y_i - \hat{f}(x_i)}{1 - n^{-1}\,\mathrm{tr}(S)}\right\}^2,$$
which replaces the $S_{ii}$ with their average. Usually CV and GCV lead to bandwidths that are close to one another.
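The shortcut formula makes leave-one-out cross-validation cheap for linear smoothers, since one fit per bandwidth suffices. A sketch with a Nadaraya-Watson smoother (the kernel, the data, and the candidate bandwidth grid are assumptions):

```python
import numpy as np

def cv_score(h, x, y):
    """Leave-one-out CV via the shortcut
    CV(h) = (1/n) sum {(Y_i - f_hat(x_i)) / (1 - S_ii)}^2."""
    w = np.exp(-0.5 * ((x[:, None] - x[None, :]) / h) ** 2)
    S = w / w.sum(axis=1, keepdims=True)
    resid = y - S @ y
    return np.mean((resid / (1.0 - np.diag(S))) ** 2)

rng = np.random.default_rng(6)
x = np.sort(rng.uniform(0, 1, 200))
y = np.sin(2 * np.pi * x) + 0.3 * rng.standard_normal(200)

bandwidths = np.linspace(0.01, 0.3, 30)
scores = [cv_score(h, x, y) for h in bandwidths]
h_best = bandwidths[int(np.argmin(scores))]
```

For the Nadaraya-Watson smoother the shortcut is exact: deleting observation $i$ and renormalizing the remaining weights reproduces $(Y_i - \hat{f}(x_i))/(1 - S_{ii})$ identically, so the score equals the brute-force leave-one-out computation.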