Local Polynomial Regression


John Hughes
October 2, 2013

Recall that the nonparametric regression model is $Y_i = f(x_i) + \varepsilon_i$, where $f$ is the regression function and the $\varepsilon_i$ are errors such that $E\varepsilon_i = 0$.

The Nadaraya-Watson Kernel Estimator

The Nadaraya-Watson kernel estimator offers what is probably the simplest approach to nonparametric regression. The kernel estimator is an example of a linear smoother. The estimator is linear in the sense that it is given by a linear transformation of the response. Specifically, let $s(x) = (s_1(x), \dots, s_n(x))'$, where
\[
s_i(x) = \frac{w_i(x)}{\sum_{j=1}^n w_j(x)}, \qquad w_i(x) = K\{(x - x_i)/h\}.
\]
Now, if $Y = (Y_1, \dots, Y_n)'$, then the kernel estimator of $f(x)$ is
\[
\hat{f}(x) = s(x)'Y = \sum_{i=1}^n s_i(x)\,Y_i = \sum_{i=1}^n \frac{w_i(x)}{\sum_{j=1}^n w_j(x)}\,Y_i = \sum_{i=1}^n \frac{K\{(x - x_i)/h\}}{\sum_{j=1}^n K\{(x - x_j)/h\}}\,Y_i.
\]
This shows that $\hat{f}(x)$ is a weighted average of the observations, where the weights $s(x)$ are normalized kernel weights. This formulation can easily be extended to handle a grid of estimation points $z = (z_1, \dots, z_g)'$. Form the $g \times n$ matrix $S$, the $k$th row of which is $s(z_k)'$. Then
\[
\hat{f}(z) = (\hat{f}(z_1), \dots, \hat{f}(z_g))' = SY.
\]
The matrix $S$ is called the smoothing matrix. It is analogous to the hat matrix from linear regression.
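Here is a minimal sketch of the estimator in Python (not part of the original notes). The Gaussian kernel, the simulated data, the bandwidth, and the function names are illustrative choices; any kernel $K$ and bandwidth $h > 0$ could be substituted.

```python
import numpy as np

def gaussian_kernel(u):
    """Standard Gaussian kernel K(u)."""
    return np.exp(-0.5 * u**2) / np.sqrt(2.0 * np.pi)

def smoothing_matrix(z, x, h, kernel=gaussian_kernel):
    """Return the g x n smoothing matrix S whose kth row is s(z_k)'."""
    W = kernel((z[:, None] - x[None, :]) / h)   # w_i(z_k) = K{(z_k - x_i)/h}
    return W / W.sum(axis=1, keepdims=True)     # normalize each row: s_i(z_k)

def nadaraya_watson(z, x, y, h):
    """Kernel estimate f_hat(z) = S Y at the grid points z."""
    return smoothing_matrix(z, x, h) @ y

# Illustrative data: noisy observations of f(x) = sin(2*pi*x) on [0, 1].
rng = np.random.default_rng(0)
x = np.sort(rng.uniform(0.0, 1.0, 200))
y = np.sin(2 * np.pi * x) + 0.3 * rng.normal(size=200)
z = np.linspace(0.0, 1.0, 101)
f_hat = nadaraya_watson(z, x, y, h=0.05)
```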

Theorem 1. The risk (assuming the $L_2$ loss) of the Nadaraya-Watson kernel estimator is
\[
R(\hat{f}, f) = \frac{h^4}{4} \left\{\int x^2 K(x)\,dx\right\}^2 \int \left\{ f''(x) + 2 f'(x)\,\frac{\dot{g}(x)}{g(x)} \right\}^2 dx
+ \frac{\sigma^2 \int K^2(x)\,dx}{nh} \int \frac{1}{g(x)}\,dx + o\{(nh)^{-1}\} + o(h^4) \tag{1}
\]
as $h \to 0$ and $nh \to \infty$, where $g$ is the density from which the $x_i$ are drawn, and $\sigma^2 = V\varepsilon_i$.

If we set the derivative of (1) equal to zero and solve for $h$, we get the optimal bandwidth
\[
h_{\mathrm{opt}} = n^{-1/5} \left( \frac{\sigma^2 \int K^2(x)\,dx \int \frac{1}{g(x)}\,dx}{\left\{\int x^2 K(x)\,dx\right\}^2 \int \left\{ f''(x) + 2 f'(x)\,\frac{\dot{g}(x)}{g(x)} \right\}^2 dx} \right)^{1/5},
\]
which implies that $h_{\mathrm{opt}} = O(n^{-1/5})$. If we plug $h_{\mathrm{opt}}$ into (1), we see that the risk decreases at the rate $O(n^{-4/5})$. For most parametric models, the risk of the MLE decreases at the rate $O(n^{-1})$. The moral of this story is that we pay a price for using a nonparametric approach. We gain flexibility, but we may sacrifice statistical power to get it.

[Figure 1: This figure illustrates the price we pay for adopting a nonparametric approach, as measured by the rate at which risk decreases with increasing sample size. The solid curve is $n^{-4/5}$. The dashed curve is $n^{-1}$.]
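The calculation behind $h_{\mathrm{opt}}$, written out (this step is left implicit in the notes), keeps only the leading terms of (1):
\[
R(h) \approx \frac{h^4}{4}\,c_1 c_2 + \frac{c_3}{nh}, \qquad
c_1 = \left\{\int x^2 K(x)\,dx\right\}^2, \quad
c_2 = \int \left\{ f''(x) + 2 f'(x)\,\frac{\dot{g}(x)}{g(x)} \right\}^2 dx, \quad
c_3 = \sigma^2 \int K^2(x)\,dx \int \frac{1}{g(x)}\,dx.
\]
Setting $R'(h) = h^3 c_1 c_2 - c_3/(nh^2) = 0$ gives $h^5 = c_3/(n\,c_1 c_2)$, i.e., $h_{\mathrm{opt}} = n^{-1/5}\{c_3/(c_1 c_2)\}^{1/5}$, and substituting back into $R(h)$ shows that both terms are then of order $n^{-4/5}$.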

Local Polynomial Regression

A kernel estimator suffers from design bias (a bias that depends on the distribution of the $x_i$) and boundary bias (a bias near the endpoints of the $x_i$). These biases can be reduced by using local polynomial regression.

Consider choosing an estimator that minimizes $\sum_{i=1}^n (Y_i - \beta_0)^2$. Note that this is equivalent to minimizing the squared length of $Y - \beta_0 \mathbf{1}$, where length is defined as the ordinary Euclidean norm
\[
\|v\| = \sqrt{\textstyle\sum_i v_i^2},
\]
which is in turn defined in terms of the usual inner product, the dot product: $\langle u, v \rangle = u'v = \sum_i u_i v_i$. That is, $\|v\|^2 = \langle v, v \rangle = v'v$. Recall that the solution to this estimation problem is $\hat{\beta}_0 = \bar{Y}$. The vector $\bar{Y}\mathbf{1}$ is the vector in $\mathrm{span}\{\mathbf{1}\}$ that is closest to $Y$ with respect to the ordinary norm. You may also recall that $\bar{Y}\mathbf{1}$ is the orthogonal projection of $Y$ onto $\mathrm{span}\{\mathbf{1}\}$, where our notion of perpendicularity is given by the dot product: $u \perp v$ iff $u'v = 0$. To see this, observe that the orthogonal projection of $Y$ onto $\mathrm{span}\{\mathbf{1}\}$ is $\mathbf{1}(\mathbf{1}'\mathbf{1})^{-1}\mathbf{1}'Y$. (This is just a special case of $X(X'X)^{-1}X'Y$ from ordinary linear regression.) Now, $\mathbf{1}'\mathbf{1} = n$, and $\mathbf{1}'Y = \sum_i Y_i$. Thus $\mathbf{1}(\mathbf{1}'\mathbf{1})^{-1}\mathbf{1}'Y = \mathbf{1}\,\frac{1}{n}\sum_i Y_i = \bar{Y}\mathbf{1}$.

Now change the scenario slightly by changing the inner product from $u'v$ to $u'W_x v$, where $W_x = \mathrm{diag}\{w_i(x)\}$, with $w_i(x) = K\{(x - x_i)/h\}$. The analogous estimation problem is to minimize $\sum_i w_i(x)(Y_i - \beta_0)^2$, but now the relevant projection is orthogonal with respect to this new inner product. Hence,
\[
\hat{\beta}_0 = (\mathbf{1}'W_x\mathbf{1})^{-1}\mathbf{1}'W_x Y.
\]
This implies that
\[
\hat{f}(x) = \hat{\beta}_0 = \frac{\sum_i w_i(x)\,Y_i}{\sum_i w_i(x)},
\]
the kernel estimator. And so we see that the kernel estimator results from introducing kernel weights in an intercept-only linear model. The weights localize the estimator in the sense that more distant observations are down-weighted. Since the kernel estimator is local and uses only an intercept, the kernel estimator is sometimes called a locally constant estimator.
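As a quick numerical check (not from the notes), the weighted intercept-only least squares solution $(\mathbf{1}'W_x\mathbf{1})^{-1}\mathbf{1}'W_x Y$ does reproduce the Nadaraya-Watson estimate at a point; the kernel, data, and evaluation point $x_0$ below are illustrative.

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.uniform(0.0, 1.0, 100)
y = np.sin(2 * np.pi * x) + 0.3 * rng.normal(size=100)
x0, h = 0.5, 0.1

w = np.exp(-0.5 * ((x0 - x) / h) ** 2)        # kernel weights w_i(x0)
ones = np.ones((len(x), 1))                   # design matrix of the intercept-only model
W = np.diag(w)                                # W_x = diag{w_i(x0)}
beta0_hat = np.linalg.solve(ones.T @ W @ ones, ones.T @ W @ y)[0]

nw = np.sum(w * y) / np.sum(w)                # Nadaraya-Watson estimate at x0
assert np.isclose(beta0_hat, nw)              # the two estimates coincide
```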

Local polynomial regression is based on the idea that we might improve the estimator by using a higher-order polynomial as a local approximation to $f$. Taylor's theorem tells us this is a sensible idea. According to Taylor's theorem,
\[
f(z) \approx f(x) + f^{(1)}(x)(z - x) + \frac{f^{(2)}(x)}{2!}(z - x)^2 + \dots + \frac{f^{(p)}(x)}{p!}(z - x)^p
\equiv \beta_0 + \beta_1 (z - x) + \frac{\beta_2}{2!}(z - x)^2 + \dots + \frac{\beta_p}{p!}(z - x)^p
\equiv P_x(z, \beta)
\]
for $z$ in a neighborhood of $x$, where $f^{(m)}$ denotes the $m$th derivative of $f$. The kernel estimator takes $p = 0$. More generally, local polynomial regression of order $p$ minimizes
\[
\sum_{i=1}^n w_i(x)\{Y_i - P_x(x_i, \beta)\}^2. \tag{2}
\]
This yields the local estimate
\[
\hat{f}(x) = P_x(x, \hat{\beta}) = \hat{\beta}_0(x).
\]
Note that the minimizer of (2) is
\[
\hat{\beta}(x) = (X_x'W_x X_x)^{-1} X_x'W_x Y,
\]
where
\[
X_x = \begin{pmatrix}
1 & x_1 - x & \frac{(x_1 - x)^2}{2!} & \cdots & \frac{(x_1 - x)^p}{p!} \\
\vdots & \vdots & \vdots & & \vdots \\
1 & x_n - x & \frac{(x_n - x)^2}{2!} & \cdots & \frac{(x_n - x)^p}{p!}
\end{pmatrix}.
\]
This implies that $\hat{f}(x) = \hat{\beta}_0(x)$ is the inner product of $Y$ with the first row of $(X_x'W_x X_x)^{-1} X_x'W_x$, and so $\hat{f}(x)$ is a linear smoother. The estimator has mean and variance
\[
E\hat{f}(x) = s(x)'\mathbf{f}, \qquad V\hat{f}(x) = \sigma^2 \|s(x)\|^2,
\]
where $s(x)'$ is the first row of $(X_x'W_x X_x)^{-1} X_x'W_x$ and $\mathbf{f} = (f(x_1), \dots, f(x_n))'$.

Why p Should Be Odd

The case $p = 1$ is called local linear regression. Local linear regression eliminates design bias and alleviates boundary bias.

Theorem 2. Let $Y_i = f(X_i) + \sigma(X_i)\varepsilon_i$ for $i \in \{1, \dots, n\}$ and $X_i \in [a, b]$. Assume that the $X_i$ were drawn from density $g$. Suppose that $g$ is positive; $g$, $f''$, and $\sigma^2$ are continuous in a neighborhood of $x$; and $h \to 0$ and $nh \to \infty$. Let $x \in (a, b)$. Then the local constant estimator and the local linear estimator both have variance
\[
\frac{\sigma^2(x)}{g(x)\,nh} \int K^2(u)\,du + o\!\left(\frac{1}{nh}\right).
\]
The local constant estimator has bias
\[
h^2 \left( \frac{1}{2} f''(x) + \frac{f'(x)\,\dot{g}(x)}{g(x)} \right) \int u^2 K(u)\,du + o(h^2),
\]
and the local linear estimator has bias
\[
h^2\, \frac{1}{2} f''(x) \int u^2 K(u)\,du + o(h^2).
\]
At the endpoints of $[a, b]$, the local constant estimator has bias of order $h$, and the local linear estimator has bias of order $h^2$. More generally, let $p$ be even. Then local polynomial regression of order $p + 1$ reduces design bias and boundary bias relative to local polynomial regression of order $p$, without increasing the variance.
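A minimal sketch of local polynomial regression of order $p$ ($p = 1$ gives the local linear estimator). The design matrix and weights follow the formulas above; the Gaussian kernel, bandwidth, and example data are illustrative and not part of the notes.

```python
import numpy as np
from math import factorial

def local_poly_fit(x0, x, y, h, p=1):
    """Return f_hat(x0) = beta_hat_0(x0) from (X_x' W_x X_x)^{-1} X_x' W_x Y."""
    # Columns of X_x: 1, (x_i - x0), (x_i - x0)^2/2!, ..., (x_i - x0)^p/p!
    Xx = np.column_stack([(x - x0) ** j / factorial(j) for j in range(p + 1)])
    w = np.exp(-0.5 * ((x - x0) / h) ** 2)        # kernel weights w_i(x0)
    XtW = Xx.T * w                                # X_x' W_x without forming diag(w)
    beta_hat = np.linalg.solve(XtW @ Xx, XtW @ y)
    return beta_hat[0]                            # beta_hat_0(x0) = f_hat(x0)

# Example: local linear (p = 1) fit on a grid.
rng = np.random.default_rng(0)
x = np.sort(rng.uniform(0.0, 1.0, 200))
y = np.sin(2 * np.pi * x) + 0.3 * rng.normal(size=200)
grid = np.linspace(0.05, 0.95, 50)
f_hat = np.array([local_poly_fit(z, x, y, h=0.08, p=1) for z in grid])
```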

Variance Estimation

Homoscedasticity

Until the previous theorem we had been assuming homoscedasticity, i.e., $Y_i = f(x_i) + \sigma\varepsilon_i$ for all $i$, where $V\varepsilon_i = 1$. In this case, we can estimate $\sigma^2$ in a simple and familiar way, namely, as the sum of the squared residuals divided by the residual degrees of freedom. More specifically, the estimator is
\[
\hat{\sigma}^2 = \frac{\sum_{i=1}^n \{Y_i - \hat{f}(x_i)\}^2}{n - 2\nu + \tilde{\nu}} = \frac{e'e}{n - 2\nu + \tilde{\nu}} = \frac{\|e\|^2}{n - 2\nu + \tilde{\nu}},
\]
where $\nu = \mathrm{tr}(S)$ and $\tilde{\nu} = \mathrm{tr}(S'S) = \sum_i \|s(x_i)\|^2$. Recall that $S$ is the smoothing matrix. The estimator $\hat{\sigma}^2$ is consistent for $\sigma^2$. To see this, first observe that $e = Y - SY = (I - S)Y$, which implies that $\hat{\sigma}^2 = Y'\Lambda Y/\mathrm{tr}(\Lambda)$, where $\Lambda = (I - S)'(I - S)$. A well-known fact about quadratic forms is $E(Y'AY) = \mathrm{tr}(A\Sigma) + \mu'A\mu$, where $\Sigma = VY$ and $\mu = EY$. Thus
\[
E\hat{\sigma}^2 = \frac{E(Y'\Lambda Y)}{\mathrm{tr}(\Lambda)} = \frac{\mathrm{tr}(\Lambda\,\sigma^2 I)}{\mathrm{tr}(\Lambda)} + \frac{\mathbf{f}'\Lambda\mathbf{f}}{n - 2\nu + \tilde{\nu}} = \sigma^2 + \frac{\mathbf{f}'\Lambda\mathbf{f}}{n - 2\nu + \tilde{\nu}}. \tag{3}
\]
Under mild conditions, the second term in (3) will go to zero as $n \to \infty$. The appearance of $n - 2\nu + \tilde{\nu}$ may seem mysterious, but this quantity is in fact analogous to the residual degrees of freedom $n - p$ in ordinary linear regression. In that setting, $n - p = \mathrm{tr}(I - H) = \mathrm{tr}\{(I - H)'(I - H)\}$, where $H$ is the hat matrix. In the current setting, $I - H$ is replaced by $I - S$, and we have
\[
\mathrm{tr}\{(I - S)'(I - S)\} = \mathrm{tr}(I'I - I'S - S'I + S'S) = \mathrm{tr}(I - S - S' + S'S) = \mathrm{tr}(I) - \mathrm{tr}(S) - \mathrm{tr}(S') + \mathrm{tr}(S'S) = n - 2\,\mathrm{tr}(S) + \mathrm{tr}(S'S) = n - 2\nu + \tilde{\nu}.
\]
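A short sketch of this estimator (not from the notes), assuming $S$ is the $n \times n$ smoothing matrix whose $i$th row is $s(x_i)'$, for example the output of the smoothing_matrix() sketch above evaluated at the design points ($z = x$):

```python
import numpy as np

def sigma2_hat(y, S):
    """Residual-based estimate of sigma^2 for a linear smoother with matrix S."""
    n = len(y)
    resid = y - S @ y                 # e = (I - S) Y
    nu = np.trace(S)                  # nu = tr(S)
    nu_tilde = np.trace(S.T @ S)      # nu_tilde = tr(S'S) = sum_i ||s(x_i)||^2
    return np.sum(resid**2) / (n - 2 * nu + nu_tilde)
```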

Heteroscedasticity

Now suppose that $Y_i = f(x_i) + \sigma(x_i)\varepsilon_i$. Since this implies that $\sigma$ is a (presumably non-constant) function, estimating it requires a second regression. The second regression is for the model
\[
Z_i \equiv \log\{Y_i - f(x_i)\}^2 = \log\{\sigma^2(x_i)\,\varepsilon_i^2\} = \log\sigma^2(x_i) + \log\varepsilon_i^2 = \log\sigma^2(x_i) + \delta_i.
\]
This model suggests that we could estimate $\log\sigma^2(x)$ by doing a regression with the log squared residuals from the first regression as the response. Specifically, we do the following.

1. Estimate $f(x)$ to arrive at $\hat{f}(x)$.
2. Let $Z_i = \log\{Y_i - \hat{f}(x_i)\}^2$.
3. Regress the $Z_i$ on the $x_i$ to get an estimate $\hat{g}(x)$ of $\log\sigma^2(x)$.
4. Let $\hat{\sigma}^2(x) = \exp\{\hat{g}(x)\}$.
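A minimal sketch of the four steps (not from the notes), using a Nadaraya-Watson smooth for both regressions; the Gaussian kernel and the two bandwidths h_f and h_sigma are illustrative tuning choices, and any smoother could be substituted.

```python
import numpy as np

def _kernel_smooth(z, x, v, h):
    """Nadaraya-Watson smooth of values v observed at x, evaluated at z."""
    W = np.exp(-0.5 * ((z[:, None] - x[None, :]) / h) ** 2)
    return (W / W.sum(axis=1, keepdims=True)) @ v

def sigma2_x_hat(x, y, h_f=0.05, h_sigma=0.1):
    """Estimate sigma^2(x) at the design points via the four steps above."""
    f_hat = _kernel_smooth(x, x, y, h_f)         # 1. estimate f at the x_i
    Z = np.log((y - f_hat) ** 2)                 # 2. log squared residuals
    g_hat = _kernel_smooth(x, x, Z, h_sigma)     # 3. regress the Z_i on the x_i
    return np.exp(g_hat)                         # 4. sigma2_hat(x) = exp{g_hat(x)}
```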

Confidence Bands

We would of course like to construct confidence bands for $f$. A confidence interval for $f(x)$ usually has the form $\hat{f}(x) \pm c\,\mathrm{se}(x)$, where $c > 0$ is a constant and $\mathrm{se}(x)$ is an estimate of the standard deviation of $\hat{f}(x)$. Perhaps counterintuitively, such a confidence interval is not truly an interval for $f(x)$, but is instead an interval for $\bar{f}(x) \equiv E\hat{f}(x) = s(x)'\mathbf{f}$. This is because there is a bias that does not disappear as the sample size becomes large.

Let $s_n(x)$ be the standard deviation of $\hat{f}(x)$. Then
\[
\frac{\hat{f}(x) - f(x)}{s_n(x)} = \frac{\hat{f}(x) - \bar{f}(x)}{s_n(x)} + \frac{\bar{f}(x) - f(x)}{s_n(x)} = Z_n(x) + \frac{\mathrm{bias}\{\hat{f}(x)\}}{\sqrt{V\hat{f}(x)}}.
\]
Typically, $Z_n(x) \rightsquigarrow N(0, 1)$. In a nonparametric setting, the second term does not go to zero as the sample size increases. This means the bias is present in the limit, which implies that the resulting confidence interval is not centered around $f(x)$. We might respond to this by

1. accepting that our confidence interval is for $\bar{f}(x)$ rather than $f(x)$;
2. attempting to correct the bias by estimating the bias function $\bar{f}(x) - f(x)$; or
3. minimizing the bias by undersmoothing.

The second option is perhaps the most tempting but is considerably more difficult than estimating $f(x)$ since the bias involves $f''(x)$. This fact makes the first and third options more appealing. Most people go with the first option because it is difficult to choose the right amount of undersmoothing.

Pointwise Bands

We can construct a pointwise band by invoking asymptotic normality or by using the bootstrap. In the former case, the interval is $\hat{f}(x) \pm \Phi^{-1}(1 - \alpha/2)\,\mathrm{se}(x)$. As for the bootstrap, how we should resample depends on whether we assume homoscedasticity. If we do assume constant variance, i.e., $\sigma(x) \equiv \sigma$, then the $k$th bootstrap dataset is
\[
Y_i^{(k)} = \hat{f}(x_i) + e_i^{(k)} \quad (i = 1, \dots, n),
\]
where $e^{(k)} = (e_1^{(k)}, \dots, e_n^{(k)})'$ is a sample (with replacement) of size $n$ from the vector of residuals $e = (Y_1 - \hat{f}(x_1), \dots, Y_n - \hat{f}(x_n))'$. The endpoints of the resulting interval at $x_i$ are the $\alpha/2$ and $1 - \alpha/2$ quantiles of the bootstrap sample $\hat{f}^{(1)}(x_i), \dots, \hat{f}^{(b)}(x_i)$.
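A sketch of the homoscedastic residual bootstrap band (not from the notes), assuming $S$ is the $n \times n$ smoothing matrix at the design points, so that $\hat{f} = SY$ and $\hat{f}^{(k)} = SY^{(k)}$; the number of replicates b, the level alpha, and the seed are illustrative.

```python
import numpy as np

def bootstrap_pointwise_band(y, S, b=1000, alpha=0.05, seed=0):
    """Return (lower, upper) pointwise band at the design points."""
    rng = np.random.default_rng(seed)
    n = len(y)
    f_hat = S @ y
    resid = y - f_hat
    boot = np.empty((b, n))
    for k in range(b):
        e_star = rng.choice(resid, size=n, replace=True)   # resample residuals
        boot[k] = S @ (f_hat + e_star)                      # f_hat^{(k)}(x_i)
    lower = np.quantile(boot, alpha / 2, axis=0)
    upper = np.quantile(boot, 1 - alpha / 2, axis=0)
    return lower, upper
```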

If we assume that $\sigma(x)$ is a non-constant function, we can still do a bootstrap, but we must modify the resampling procedure. Here is the algorithm in detail.

1. Estimate $\sigma(x_i)$ to arrive at $\hat{\sigma}(x_i)$ for $i \in \{1, \dots, n\}$.
2. Studentize the vector of residuals $(Y_1 - \hat{f}(x_1), \dots, Y_n - \hat{f}(x_n))'$ by dividing the $i$th element by $\hat{\sigma}(x_i)$:
\[
e_i = \frac{Y_i - \hat{f}(x_i)}{\hat{\sigma}(x_i)}.
\]
3. Compute the $k$th bootstrap dataset as
\[
Y_i^{(k)} = \hat{f}(x_i) + \hat{\sigma}(x_i)\, e_i^{(k)} \quad (i = 1, \dots, n),
\]
where $e^{(k)} = (e_1^{(k)}, \dots, e_n^{(k)})'$ is a sample (with replacement) of size $n$ from the vector of Studentized residuals.
4. Compute $\hat{f}^{(k)}(x) = SY^{(k)}$ for $k \in \{1, \dots, b\}$.
5. The endpoints of the confidence interval at $x_i$ are again the $\alpha/2$ and $1 - \alpha/2$ quantiles of the bootstrap sample $\hat{f}^{(1)}(x_i), \dots, \hat{f}^{(b)}(x_i)$.

Simultaneous Bands

To construct a simultaneous band we use the so-called tube formula. Suppose that $\sigma$ is known, and let $I(x)$ be an interval. Then
\[
P\{\bar{f}(x) \notin I(x) \text{ for some } x \in [a, b]\}
= P\left( \max_{x \in [a, b]} \frac{|\hat{f}(x) - \bar{f}(x)|}{\sigma\|s(x)\|} > c \right)
= P\left( \max_{x \in [a, b]} \frac{\left|\sum_i \varepsilon_i s_i(x)\right|}{\sigma\|s(x)\|} > c \right)
= P\left( \max_{x \in [a, b]} |W(x)| > c \right),
\]
where $W(x) = \sum_i Z_i T_i(x)$, $Z_i = \varepsilon_i/\sigma \sim N(0, 1)$, and $T_i(x) = s_i(x)/\|s(x)\|$. It turns out that
\[
P\left( \max_{x \in [a, b]} |W(x)| > c \right) \approx 2\{1 - \Phi(c)\} + \frac{\kappa}{\pi}\exp(-c^2/2)
\]
for large $c$, where $\kappa = \int_a^b \|\dot{T}(x)\|\,dx$ and $\dot{T}(x) = (\dot{T}_1(x), \dots, \dot{T}_n(x))'$. Choosing $c$ to solve
\[
2\{1 - \Phi(c)\} + \frac{\kappa}{\pi}\exp(-c^2/2) = \alpha
\]
yields the desired band $\hat{f}(x) \pm c\,\mathrm{se}(x)$.
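A small sketch (not from the notes) of solving the tube-formula equation for the critical value $c$ by bisection; $\kappa$ must be supplied (in practice it is computed from $\dot{T}(x)$ over $[a, b]$), and the kappa and alpha values used below are purely illustrative.

```python
import math

def tube_critical_value(kappa, alpha=0.05, lo=1.0, hi=10.0, tol=1e-10):
    """Solve 2*(1 - Phi(c)) + (kappa/pi)*exp(-c^2/2) = alpha for c."""
    def phi(c):  # standard normal CDF via the error function
        return 0.5 * (1.0 + math.erf(c / math.sqrt(2.0)))
    def g(c):
        return 2.0 * (1.0 - phi(c)) + (kappa / math.pi) * math.exp(-c**2 / 2.0) - alpha
    # g is decreasing in c; lo and hi must bracket the root (g(lo) > 0 > g(hi)).
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if g(mid) > 0:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

# Example with an illustrative kappa: c comes out near 3 for kappa = 15, alpha = 0.05.
c = tube_critical_value(kappa=15.0, alpha=0.05)
```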

Choosing the Right Bandwidth

We want to choose $h$ to minimize the risk
\[
R(h) = \frac{1}{n} E\left( \sum_{i=1}^n \{\hat{f}(x_i) - f(x_i)\}^2 \right).
\]
Since $R(h)$ depends on the unknown function $f$, we will instead minimize an estimate $\hat{R}(h)$ of $R(h)$. It might seem sensible to estimate $R(h)$ using
\[
\hat{R}(h) = \frac{1}{n} \sum_{i=1}^n \{Y_i - \hat{f}(x_i)\}^2,
\]
the so-called training error. But this estimator is biased downward and usually leads to undersmoothing. A better risk estimator is the leave-one-out cross-validation score:
\[
\mathrm{CV}(h) = \hat{R}(h) = \frac{1}{n} \sum_{i=1}^n \{Y_i - \hat{f}_{(-i)}(x_i)\}^2,
\]
where $\hat{f}_{(-i)}(x_i)$ is the estimate obtained by leaving out the $i$th observation. Intuitively, we are asking, "How well can we predict $Y_i$ if we do not use $Y_i$ in the estimation procedure?" For linear smoothers, computing this score is not as burdensome as it may seem because we do not have to recompute the estimate with each observation left out. Instead, we have
\[
\mathrm{CV}(h) = \hat{R}(h) = \frac{1}{n} \sum_{i=1}^n \left\{ \frac{Y_i - \hat{f}(x_i)}{1 - S_{ii}} \right\}^2,
\]
where $S_{ii}$ is the $i$th diagonal element of $S$. An alternative is the generalized cross-validation score:
\[
\mathrm{GCV}(h) = \frac{1}{n} \sum_{i=1}^n \left\{ \frac{Y_i - \hat{f}(x_i)}{1 - \frac{1}{n}\mathrm{tr}(S)} \right\}^2,
\]
which replaces the $S_{ii}$ with their average. Usually CV and GCV lead to bandwidths that are close to one another.
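A minimal sketch (not from the notes) of bandwidth selection by the leave-one-out shortcut and by GCV for a linear smoother; the Nadaraya-Watson smoother, the Gaussian kernel, the candidate bandwidth grid, and the simulated data are all illustrative choices.

```python
import numpy as np

def smoother(x, h):
    """n x n Nadaraya-Watson smoothing matrix at the design points."""
    W = np.exp(-0.5 * ((x[:, None] - x[None, :]) / h) ** 2)
    return W / W.sum(axis=1, keepdims=True)

def cv_score(y, S):
    """CV(h) = (1/n) sum {(Y_i - f_hat(x_i)) / (1 - S_ii)}^2."""
    resid = y - S @ y
    return np.mean((resid / (1.0 - np.diag(S))) ** 2)

def gcv_score(y, S):
    """GCV(h) = (1/n) sum {(Y_i - f_hat(x_i)) / (1 - tr(S)/n)}^2."""
    resid = y - S @ y
    return np.mean((resid / (1.0 - np.trace(S) / len(y))) ** 2)

# Pick the bandwidth minimizing CV over a grid of candidates.
rng = np.random.default_rng(0)
x = np.sort(rng.uniform(0.0, 1.0, 200))
y = np.sin(2 * np.pi * x) + 0.3 * rng.normal(size=200)
bandwidths = np.linspace(0.01, 0.2, 40)
h_cv = bandwidths[np.argmin([cv_score(y, smoother(x, h)) for h in bandwidths])]
```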