Nonparametric Regression


1 Nonparametric Regression Econ 674 Purdue University April 8, 2009 Justin L. Tobias (Purdue) Nonparametric Regression April 8, / 31

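2 Consider the univariate nonparametric regression model:

$$y_i = m(x_i) + \epsilon_i,$$

where $y_i$ and $x_i$ are scalars, for simplicity. Note that the marginal density for $x$ might be estimated at a point $x_0$ as:

$$\hat{f}(x_0) = \frac{1}{n h_n} \sum_{i=1}^{n} K\left(\frac{x_i - x_0}{h_n}\right).$$

Similarly, the joint density for $x$ and $y$ could be estimated (simply) as:

$$\hat{f}(x_0, y_0) = \frac{1}{n h_n^2} \sum_{i=1}^{n} K\left(\frac{x_i - x_0}{h_n}\right) K\left(\frac{y_i - y_0}{h_n}\right).$$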

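3 The function of interest,

$$m(x) = E(y \mid x) = \int y\, f(y \mid x)\,dy = \frac{\int y\, f(x, y)\,dy}{f(x)}.$$

The denominator can be estimated, as shown on the last slide. As for the numerator, substitute in our estimator for the joint density, and obtain (assuming a symmetric, mean-zero kernel):

$$\int y\, \hat{f}(x_0, y)\,dy = \frac{1}{n h_n^2} \sum_{i=1}^{n} K\left(\frac{x_i - x_0}{h_n}\right) \int y\, K\left(\frac{y_i - y}{h_n}\right) dy = \frac{1}{n h_n} \sum_{i=1}^{n} K\left(\frac{x_i - x_0}{h_n}\right) y_i.$$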

4 Thus, we have all the pieces we need to obtain a nonparametric estimator (called the Nadaraya-Watson estimator) of the conditional mean. Noting that the bandwidth terms cancel in the ratio, we obtain:

$$\hat{m}(x_0) = \frac{\sum_{i=1}^{n} K\left(\frac{x_i - x_0}{h_n}\right) y_i}{\sum_{i=1}^{n} K\left(\frac{x_i - x_0}{h_n}\right)}.$$
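The Nadaraya-Watson estimator is a kernel-weighted average of the $y_i$, which can be sketched in a few lines. This is a minimal illustration, assuming a Gaussian kernel and NumPy; the function names, the simulated data, and the bandwidth value are hypothetical choices, not part of the slides.

```python
import numpy as np

def gaussian_kernel(u):
    """Standard normal density: a symmetric, mean-zero kernel."""
    return np.exp(-0.5 * u**2) / np.sqrt(2.0 * np.pi)

def nadaraya_watson(x, y, x0, h, kernel=gaussian_kernel):
    """Kernel-weighted average of y at each evaluation point in x0."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    x0 = np.atleast_1d(np.asarray(x0, dtype=float))
    # weights[g, i] = K((x_i - x0_g) / h); the 1/(n h) factors cancel in the ratio
    weights = kernel((x[None, :] - x0[:, None]) / h)
    return weights @ y / weights.sum(axis=1)

# Simulated example (assumed data-generating process, for illustration only)
rng = np.random.default_rng(0)
x = rng.uniform(0.0, 1.0, 500)
y = np.sin(2.0 * np.pi * x) + rng.normal(0.0, 0.3, 500)
grid = np.linspace(0.1, 0.9, 9)
m_hat = nadaraya_watson(x, y, grid, h=0.05)
```

Because the estimator is a ratio of kernel sums, any constant scaling of the kernel drops out, so only the shape of $K$ and the bandwidth matter.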

5 Intuition

Let us now try to justify this estimator in an intuitive way, much like we did for the case of nonparametric density estimation. Suppose that $x$ is discrete-valued and we observe $n_0$ points with $x = x_0$. In this case, we might use the sample average as a consistent estimate of the conditional mean function at $x_0$:

$$\hat{m}(x_0) = \frac{1}{n_0} \sum_{i:\, x_i = x_0} y_i.$$

This technique works great, of course, if $x$ is discrete-valued. However, if $x$ is continuous, the above will not work: we will never observe $n_0$ points for which $x = x_0$.

6 Intuition

To remedy this problem, we can average those $y$'s for which the $x$'s fall in some interval around $x_0$. We can then replace $\hat{m}(x_0) = \hat{E}(y \mid x = x_0)$ with the sample average of the $y$'s falling in this region:

$$\hat{m}(x_0) = \frac{\sum_{i=1}^{n} y_i\, I(|x_i - x_0| \le h_n)}{\sum_{i=1}^{n} I(|x_i - x_0| \le h_n)}.$$

7 Intuition

In the previous estimator, we placed equal weight on all the points in the interval, and zero weight on points outside the interval. More generally, we might replace the indicator function above with a continuous weight function:

$$\hat{m}(x_0) = \frac{\sum_{i=1}^{n} K\left(\frac{x_i - x_0}{h_n}\right) y_i}{\sum_{i=1}^{n} K\left(\frac{x_i - x_0}{h_n}\right)},$$

where, as before, $K$ is a mean-zero symmetric density function.

8 Under certain regularity conditions, we can establish pointwise consistency of the local constant kernel estimator:

$$\hat{m}(x_0) \xrightarrow{p} m(x_0).$$

To save time we defer the proof, although it follows similarly to the proof for the kernel density estimator. The method above generalizes to higher dimensions. With $x_i$ and $x_0$ now $d \times 1$ vectors and $K$ a $d$-dimensional kernel:

$$\hat{m}(x_0) = \frac{\sum_{i=1}^{n} K\left(\frac{x_i - x_0}{h_n}\right) y_i}{\sum_{i=1}^{n} K\left(\frac{x_i - x_0}{h_n}\right)}.$$

9 Local Polynomial Regression

We can consider this problem yet another way, which will lead to an estimator that improves upon the N-W estimator. As with least squares, we might wish to minimize the objective function:

$$\sum_{i=1}^{n} \left(y_i - m(x_i)\right)^2.$$

To this end, we take a second-order expansion of the regression function about the point $x_0$:

$$m(x_i) \approx m(x_0) + m'(x_0)(x_i - x_0) + \frac{1}{2} m''(x_0)(x_i - x_0)^2.$$

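10 Local Polynomial Regression

We then substitute this expansion into our objective function and include a kernel-weighting term:

$$\sum_{i=1}^{n} \left(y_i - m(x_0) - m'(x_0)(x_i - x_0) - \frac{1}{2} m''(x_0)(x_i - x_0)^2\right)^2 K\left(\frac{x_i - x_0}{h_n}\right).$$

The kernel-weighting term weights the points closer to $x_0$ more heavily than the points farther away from $x_0$ (like weighted least squares). Let $\alpha_0 = m(x_0)$, $\alpha_1 = m'(x_0)$, and $\alpha_2 = m''(x_0)$. Then we rewrite the objective function as:

$$\min_{\alpha_0, \alpha_1, \alpha_2} \sum_{i=1}^{n} \left(y_i - \alpha_0 - \alpha_1 (x_i - x_0) - \alpha_2 \frac{(x_i - x_0)^2}{2}\right)^2 K\left(\frac{x_i - x_0}{h_n}\right).$$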

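11 Local Polynomial Regression

We can stack this problem in matrix form:

$$\min_{\alpha} \left( \begin{bmatrix} y_1 \\ y_2 \\ \vdots \\ y_n \end{bmatrix} - \begin{bmatrix} 1 & (x_1 - x_0) & \frac{(x_1 - x_0)^2}{2} \\ 1 & (x_2 - x_0) & \frac{(x_2 - x_0)^2}{2} \\ \vdots & \vdots & \vdots \\ 1 & (x_n - x_0) & \frac{(x_n - x_0)^2}{2} \end{bmatrix} \begin{bmatrix} \alpha_0 \\ \alpha_1 \\ \alpha_2 \end{bmatrix} \right)' \begin{bmatrix} K\left(\frac{x_1 - x_0}{h_n}\right) & 0 & \cdots & 0 \\ 0 & K\left(\frac{x_2 - x_0}{h_n}\right) & \cdots & 0 \\ \vdots & & \ddots & \vdots \\ 0 & 0 & \cdots & K\left(\frac{x_n - x_0}{h_n}\right) \end{bmatrix} \left( \begin{bmatrix} y_1 \\ y_2 \\ \vdots \\ y_n \end{bmatrix} - \begin{bmatrix} 1 & (x_1 - x_0) & \frac{(x_1 - x_0)^2}{2} \\ 1 & (x_2 - x_0) & \frac{(x_2 - x_0)^2}{2} \\ \vdots & \vdots & \vdots \\ 1 & (x_n - x_0) & \frac{(x_n - x_0)^2}{2} \end{bmatrix} \begin{bmatrix} \alpha_0 \\ \alpha_1 \\ \alpha_2 \end{bmatrix} \right),$$

which is equivalent to:

$$\min_{\alpha}\; (y - X\alpha)' K (y - X\alpha),$$

where $y$, $X$, and $K$ denote the stacked outcome vector, the regressor matrix, and the diagonal matrix of kernel weights shown above.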

12 Local Polynomial Regression

This objective function is just like the objective function which yields the GLS estimator. Thus, we have:

$$\hat{\alpha} = (X'KX)^{-1} X'Ky,$$

where row $i$ of $X$ is $[1,\; (x_i - x_0),\; (x_i - x_0)^2/2]$ and $K$ is the diagonal matrix with entries $K\left((x_i - x_0)/h_n\right)$. The (1,1) element gives an estimate of the conditional mean function (CMF) at $x_0$; the (2,1) element gives an estimate of the marginal effect at $x_0$.

13 Local Polynomial Regression

So, here's the procedure for fitting a regression model via local polynomial regression.

1. Select a bandwidth $h_n$ and kernel $K$.
2. Pick a set of points at which to evaluate the CMF (a 3σ rule, perhaps). This could also be all the $x_i$.
3. For each point, compute $\hat{\alpha}$ as above. You can plot the (1,1) elements to plot the conditional mean function. (Often, the (2,1) elements are of primary interest; marginal effects mimic regression coefficients.)
4. What happens if you just approximate $m(x_i)$ with a constant $\alpha_0$?
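The procedure above can be sketched as a weighted least-squares solve at each evaluation point. This is a minimal illustration, assuming Gaussian kernel weights and NumPy; the function name `local_polynomial`, the noise-free quadratic test data, and the bandwidth are hypothetical choices made so that the output is easy to check.

```python
import numpy as np

def local_polynomial(x, y, x0, h, order=2):
    """Return alpha_hat = (X'KX)^{-1} X'Ky at the single point x0."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    d = x - x0
    # Columns 1, (x - x0), (x - x0)^2 / 2!, ... so alpha_j estimates the j-th derivative
    factorials = np.cumprod(np.r_[1.0, np.arange(1, order + 1)])
    X = np.column_stack([d**j / factorials[j] for j in range(order + 1)])
    k = np.exp(-0.5 * (d / h) ** 2)  # Gaussian weights; constant factors cancel
    XtK = X.T * k                    # X'K without forming the n x n diagonal matrix
    return np.linalg.solve(XtK @ X, XtK @ y)

# Noise-free quadratic, so a local quadratic fit should recover m and m' exactly
rng = np.random.default_rng(1)
x = rng.uniform(0.0, 1.0, 400)
y = 1.0 + 2.0 * x + 3.0 * x**2
grid = np.linspace(0.2, 0.8, 7)
fits = np.array([local_polynomial(x, y, g, h=0.1) for g in grid])
cmf_hat = fits[:, 0]        # (1,1) elements: conditional mean estimates
marginal_hat = fits[:, 1]   # (2,1) elements: marginal effect estimates
```

Calling the same function with `order=0` leaves only the column of ones in `X`, so the fit at each point is just a kernel-weighted average of the $y_i$.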

14 Other points to note:

- The above also generalizes to the multivariate case. However, there is the curse of dimensionality: the rate of convergence slows down with the dimension $d$ of the problem, $\sqrt{n h_n^d}$ (assuming a common bandwidth is employed in all dimensions).
- How can you pick the bandwidth $h_n$ and the kernel $K$?
- Standard errors for the above point estimates are rather involved and difficult to compute. Bootstrapping is a possibility, though the bootstrap should correct for the bias of the estimator.

15 It's an odd world

There is a preference for odd-order fits. Let $p$ be the order of the series expansion (e.g., 1 = linear, 2 = quadratic) and $v$ be the order of the derivative we seek to estimate. Then Ruppert and Wand (1994, Ann. Stat.) show that bias is reduced and performance at the boundary is improved by setting $p - v$ to be odd. This suggests a preference for local linear regression when estimating the conditional mean function.

To get around the curse of dimensionality, some specify a nonparametric part for only one or a few elements of $x$:

$$y_i = z_i \beta + m(x_i) + \epsilon_i.$$

This is called a partially linear or semilinear model. See Robinson (1988, Econometrica) or Yatchew (1997, Economics Letters) for estimation procedures.

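16 Bandwidth Selection

Fan and Gijbels (1995) derive an optimal bandwidth rule. They consider an asymptotic weighted mean integrated squared error criterion:

$$\int \left\{ \left[\text{Bias}\left(\hat{m}_v(x)\right)\right]^2 + \text{Var}\left(\hat{m}_v(x)\right) \right\} w(x)\,dx,$$

where $m_v(x)$ is the $v$th derivative of the CMF which we are interested in, and $w(x)$ is a weighting function. They show the bandwidth which minimizes this criterion is of the form:

$$h_n = C_{p,v}(K) \left[ \frac{\int \sigma^2(x)\, w(x) / f(x)\,dx}{\int \left(m_{p+1}(x)\right)^2 w(x)\,dx} \right]^{1/(2p+3)} n^{-1/(2p+3)}.$$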

17 Bandwidth Selection

In the above, $\sigma^2(x) \equiv E[(y - m(x))^2]$, $w(x) \equiv w_0(x) f(x)$, and $C_{p,v}(K)$ is a constant which depends on the expansion order ($p$), the order of the derivative ($v$), and the kernel $K$. Finally, $m_{p+1}$ is the $(p+1)$th derivative of the unknown function $m$. The bandwidth can be estimated as:

$$\hat{h}_n = C_{p,v}(K) \left[ \frac{\hat{\sigma}^2 \int w_0(x)\,dx}{\sum_{i=1}^{n} \left(\hat{m}_{p+1}(x_i)\right)^2 w_0(x_i)} \right]^{1/(2p+3)}.$$

We can obtain $\hat{\sigma}^2$ and $\hat{m}_{p+1}$ by running a linear regression of $y$ on $x, x^2, \ldots, x^{p+3}$. A starting choice for $w_0$ may be 1. Finally, note that there are a variety of other bandwidth selectors used in practice (e.g., cross-validation or AIC$_c$ [Hurvich et al. (1998)]).
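The estimated bandwidth above can be sketched as follows. This is an illustration under stated assumptions, not a definitive implementation: $w_0$ is taken to be 1 on the observed range of $x$ (so its integral is the sample range), $\hat{\sigma}^2$ is the residual variance from the global polynomial pilot fit, and the constant $C_{p,v}(K)$ must be supplied by the caller. The value 0.776 used in the example is the constant commonly tabulated for a Gaussian kernel with $p = 1$, $v = 0$; treat it as an assumption and check it against Fan and Gijbels' tables. The function name `rot_bandwidth` is hypothetical.

```python
import numpy as np
from numpy.polynomial import polynomial as P

def rot_bandwidth(x, y, c_pv, p=1):
    """Rule-of-thumb bandwidth from a global polynomial pilot fit of order p + 3."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    n = x.size
    coefs = P.polyfit(x, y, p + 3)            # regress y on 1, x, ..., x^(p+3)
    resid = y - P.polyval(x, coefs)
    sigma2_hat = resid @ resid / (n - (p + 4))
    # (p+1)-th derivative of the pilot polynomial at each x_i
    m_deriv = P.polyval(x, P.polyder(coefs, p + 1))
    integral_w0 = x.max() - x.min()           # assumption: w_0 = 1 on the sample range
    ratio = sigma2_hat * integral_w0 / np.sum(m_deriv**2)
    return c_pv * ratio ** (1.0 / (2 * p + 3))

# Simulated example (assumed data-generating process)
rng = np.random.default_rng(5)
x = rng.uniform(0.0, 1.0, 300)
y = np.sin(2.0 * np.pi * x) + rng.normal(0.0, 0.3, 300)
h_rot = rot_bandwidth(x, y, c_pv=0.776, p=1)  # 0.776: assumed Gaussian-kernel constant
```

The sum in the denominator grows with $n$, which is how the $n^{-1/(2p+3)}$ rate enters the estimated bandwidth.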

18 [Figure: fitted curves for several values of η, each plotted against the true curve.]

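19 Generalizing to Multiple Regression

Consider, for illustration, the case of a bivariate nonparametric regression problem:

$$y_i = m(x_{1i}, x_{2i}) + \epsilon_i.$$

As before, we can take a first-order approximation of the regression function about the point $x_0 = (x_{10}, x_{20})$:

$$m(x_{1i}, x_{2i}) \approx m(x_0) + \frac{\partial m(x_0)}{\partial x_1}(x_{1i} - x_{10}) + \frac{\partial m(x_0)}{\partial x_2}(x_{2i} - x_{20}).$$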

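20 Substituting this back into our objective function produces:

$$\sum_{i=1}^{n} \left(y_i - \alpha_0 - \alpha_1 (x_{1i} - x_{10}) - \alpha_2 (x_{2i} - x_{20})\right)^2,$$

where $\alpha_0$ is the value of $m$ at $x_0 = (x_{10}, x_{20})$ and $\alpha_1$, $\alpha_2$ are its two partial derivatives there. We can then formulate a weighted least-squares type objective function, as before:

$$\min_{\alpha} \sum_{i=1}^{n} \left(y_i - \alpha_0 - \alpha_1 (x_{1i} - x_{10}) - \alpha_2 (x_{2i} - x_{20})\right)^2 K_H(x_i - x_0),$$

where $K$ is a 2-dimensional kernel, $H$ is the bandwidth or smoothing matrix, and

$$K_H(u) = |H|^{-1/2} K\left(H^{-1/2} u\right).$$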

21 Partially Linear Models

Consider a model of the form:

$$y_i = z_i \beta + m(x_i) + \epsilon_i.$$

This model is often called a semilinear or partially linear model. Here, we assume that the $z$'s (which can be large in number) enter in a linear fashion, and that $x$, still assumed a scalar, enters nonparametrically. Two questions naturally arise:

- How should we estimate $m$?
- What about $\beta$?

22 Partially Linear Models

Robinson's (1988, Econometrica) Estimator: Given the above specification, note:

$$E(y_i \mid x_i) = E(z_i \mid x_i)\beta + m(x_i).$$

This implies:

$$y_i - E(y_i \mid x_i) = \left[z_i - E(z_i \mid x_i)\right]\beta + \epsilon_i.$$

If we knew each of the conditional mean functions, we could just run a least squares regression. Robinson's idea is to estimate each of these CMFs nonparametrically, as we have discussed. Thus, we can estimate $\beta$ by running least squares using the following regression:

$$y_i - \hat{E}(y_i \mid x_i) = \left[z_i - \hat{E}(z_i \mid x_i)\right]\beta + u_i.$$

23 Partially Linear Models

1. Since $\hat{\beta}$ converges at the standard parametric rate (we can show this), we can ignore the fact that it is estimated (asymptotically) when deriving confidence intervals for $m$. (Some focus on $\beta$ as the parameter of interest.)
2. This procedure can be quite computationally intensive, since we need to perform $k_z + 2$ nonparametric regressions in total, where $k_z$ is the number of variables in $Z$.
3. This estimator is asymptotically efficient.
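Robinson's double-residual idea can be sketched as below. This is a simplified illustration, assuming Nadaraya-Watson fits with a Gaussian kernel and a single common bandwidth; it omits refinements such as leave-one-out fits and trimming, and the function names and simulated data are hypothetical.

```python
import numpy as np

def nw_fit(x, target, h):
    """Nadaraya-Watson fit of each column of target on x, at the sample points."""
    w = np.exp(-0.5 * ((x[:, None] - x[None, :]) / h) ** 2)
    return w @ target / w.sum(axis=1, keepdims=True)

def robinson(y, z, x, h):
    """Return (beta_hat, m_hat at the sample points) for y = z*beta + m(x) + e."""
    y = np.asarray(y, dtype=float).reshape(-1, 1)
    z = np.asarray(z, dtype=float).reshape(len(y), -1)
    x = np.asarray(x, dtype=float)
    y_res = y - nw_fit(x, y, h)   # y_i - E_hat(y_i | x_i): one regression
    z_res = z - nw_fit(x, z, h)   # z_i - E_hat(z_i | x_i): k_z regressions
    beta_hat = np.linalg.lstsq(z_res, y_res, rcond=None)[0].ravel()
    # Final regression of y - z*beta_hat on x recovers m
    m_hat = nw_fit(x, y - z @ beta_hat[:, None], h).ravel()
    return beta_hat, m_hat

# Simulated example (assumed data-generating process)
rng = np.random.default_rng(2)
n = 500
x = rng.uniform(0.0, 1.0, n)
z = x[:, None] + rng.normal(size=(n, 2))   # z is correlated with x
beta = np.array([1.0, -2.0])
y = z @ beta + np.sin(2.0 * np.pi * x) + rng.normal(0.0, 0.2, n)
beta_hat, m_hat = robinson(y, z, x, h=0.05)
```

The three calls to `nw_fit` cover the $k_z + 2$ nonparametric regressions counted above: one for $y$, one per column of $z$, and one for $m$.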

24 Partially Linear Models

An alternate method has been suggested by Yatchew (1997, Economics Letters). He suggests the use of differencing to eliminate the unknown function $m$. Note that for a continuous $m$ with $x_i \approx x_j$:

$$y_i - y_j = (z_i - z_j)\beta + m(x_i) - m(x_j) + \epsilon_i - \epsilon_j \approx (z_i - z_j)\beta + \epsilon_i - \epsilon_j.$$

This intuition suggests the following simple estimator:

25 Partially Linear Models

1. Sort the data by ascending values of $x$.
2. Take adjacent differences of the sorted data, and estimate $\beta$ by an OLS regression of the differenced $y$'s on the differenced $z$'s.
3. Given $\hat{\beta}$, estimate the unknown function $m$ pointwise using local linear regression of $y - z\hat{\beta}$ on $x$, or from an alternate nonparametric estimation procedure.

Over a compact support, and under certain regularity conditions on $m$, the differencing technique asymptotically purges the model of the nonparametric component $m$, and consistent estimates of $\beta$ are obtained.
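The three steps above can be sketched as follows. This is a minimal illustration of first-order differencing, assuming a Gaussian-weighted local linear fit for step 3; the function names, simulated data, and bandwidth are hypothetical.

```python
import numpy as np

def difference_estimator(y, z, x):
    """Steps 1-2: sort by x, difference adjacent rows, run OLS of dy on dz."""
    y = np.asarray(y, dtype=float)
    z = np.asarray(z, dtype=float).reshape(y.size, -1)
    order = np.argsort(x)
    dy = np.diff(y[order])
    dz = np.diff(z[order], axis=0)
    return np.linalg.lstsq(dz, dy, rcond=None)[0]

def local_linear(x, y, x0, h):
    """Step 3 helper: local linear estimate of the conditional mean at x0."""
    d = x - x0
    k = np.exp(-0.5 * (d / h) ** 2)   # Gaussian weights (an assumed kernel choice)
    X = np.column_stack([np.ones_like(d), d])
    XtK = X.T * k
    return np.linalg.solve(XtK @ X, XtK @ y)[0]

# Simulated example (assumed data-generating process)
rng = np.random.default_rng(2)
n = 500
x = rng.uniform(0.0, 1.0, n)
z = x[:, None] + rng.normal(size=(n, 2))
beta = np.array([1.0, -2.0])
y = z @ beta + np.sin(2.0 * np.pi * x) + rng.normal(0.0, 0.2, n)

beta_hat = difference_estimator(y, z, x)
grid = np.linspace(0.1, 0.9, 9)
m_hat = np.array([local_linear(x, y - z @ beta_hat, g, h=0.05) for g in grid])
```

Only the final step involves a nonparametric regression; the estimate of $\beta$ comes from a single OLS fit on the differenced data.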

26 Partially Linear Models

Yatchew (1997, 1998) describes how higher-order optimal differencing can be applied to estimate $\beta$, approaching the efficiency of Robinson's estimator as the order of differencing gets large. Note that this estimator only requires one nonparametric regression!

27 Smooth Coefficient Models (Li et al. (2002, JBES))

Consider the following smooth coefficient model:

$$y_i = \alpha(z_i) + x_i'\beta(z_i) + \epsilon_i = X_i'\delta(z_i) + \epsilon_i,$$

where $X_i = [1 \;\; x_i']'$ and $\delta(z_i) = [\alpha(z_i) \;\; \beta(z_i)']'$. We can think of $\beta(z_i)$ as a vector of (smooth) coefficients that depend on $z$. The standard partially linear model follows as $\beta(z_i) = \beta$. Let $z$ be $q \times 1$ and $x$ be $p \times 1$. (Typically, think of both $q$ and $p$ being equal to 1.) They suggest the following estimator:

$$\hat{\delta}(z_0) = \left[ (n h^q)^{-1} \sum_{j=1}^{n} X_j X_j' K\left(\frac{z_j - z_0}{h}\right) \right]^{-1} (n h^q)^{-1} \sum_{j=1}^{n} X_j y_j K\left(\frac{z_j - z_0}{h}\right).$$

28 Smooth Coefficient Models (Li et al. (2002, JBES))

Intuition: This is like a weighted least squares rule. Suppose that $z$ is a scalar, and assume that we are using a uniform kernel, so that the weight on observation $j$ is:

$$K = \begin{cases} 1/(2h) & \text{if } |z_j - z_0| < h \\ 0 & \text{otherwise.} \end{cases}$$

Under this rule, we can see that

$$\hat{\delta}(z_0) = \left[ \sum_{j:\, |z_j - z_0| < h} X_j X_j' \right]^{-1} \sum_{j:\, |z_j - z_0| < h} X_j y_j.$$

This is the least squares estimator of the intercept and slopes, using only those data points for which $z_j$ is close to $z_0$. Doing this over a grid of $z_0$ values will enable us to piece together the intercept function (as a function of $z$) and the slope coefficients (also as a function of $z$).
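The smooth coefficient estimator is a kernel-weighted least squares fit at each $z_0$, which can be sketched as below. This is a minimal illustration for scalar $z$, assuming Gaussian kernel weights rather than the uniform kernel of the intuition above; the function name, the simulated coefficient functions, and the bandwidth are hypothetical.

```python
import numpy as np

def smooth_coefficient(y, x, z, z0, h):
    """delta_hat(z0) = [sum_j X_j X_j' K_j]^{-1} sum_j X_j y_j K_j for scalar z."""
    y = np.asarray(y, dtype=float)
    z = np.asarray(z, dtype=float)
    X = np.column_stack([np.ones(y.size), x])   # X_j = [1, x_j']'
    k = np.exp(-0.5 * ((z - z0) / h) ** 2)      # the (n h^q)^{-1} factors cancel
    XtK = X.T * k
    return np.linalg.solve(XtK @ X, XtK @ y)

# Simulated example with alpha(z) = z^2 and beta(z) = 1 + z (assumed for illustration)
rng = np.random.default_rng(3)
n = 2000
z = rng.uniform(0.0, 1.0, n)
x = rng.normal(size=n)
y = z**2 + x * (1.0 + z) + rng.normal(0.0, 0.2, n)

z_grid = np.linspace(0.2, 0.8, 4)
delta_hat = np.array([smooth_coefficient(y, x, z, z0, h=0.1) for z0 in z_grid])
alpha_hat, beta_hat = delta_hat[:, 0], delta_hat[:, 1]
```

Evaluating over `z_grid` pieces together the intercept and slope functions point by point, as described above.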

29 Tests Against Parametric Alternatives

Li et al. also provide a way to test against parametric alternatives. They consider a parametric version of the model:

$$y_i = X_i'\delta_0(z_i) + \epsilon_i,$$

with $\delta_0(z_i)$ being a particular parametric function of $z$. For example,

$$y_i = \alpha_0 + \gamma_0 z_i + x_i'\beta_0 + \epsilon_i$$

implies that $X_i = [1 \;\; x_i']'$ and $\delta_0(z_i) = [(\alpha_0 + z_i \gamma_0) \;\; \beta_0']'$. We would like to test

$$H_0: \delta(z) - \delta_0(z) = 0 \text{ a.e.}$$
$$H_A: \delta(z) - \delta_0(z) \neq 0 \text{ on a set with positive measure.}$$

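30 Tests Against Parametric Alternatives

They propose the test statistic:

$$\hat{I}_n = (n^2 h^q)^{-1} \sum_{i} \sum_{j \neq i} X_i' X_j\, \hat{\epsilon}_i \hat{\epsilon}_j K\left(\frac{z_i - z_j}{h}\right),$$

where $\hat{\epsilon}_i = y_i - X_i'\hat{\delta}_0(z_i)$. They also show

$$J_n = \frac{n h^{q/2}\, \hat{I}_n}{\hat{\sigma}_0} \xrightarrow{d} N(0, 1),$$

where

$$\hat{\sigma}_0^2 = 2 (n^2 h^q)^{-1} \sum_{i} \sum_{j \neq i} (X_i' X_j)^2\, \hat{\epsilon}_i^2 \hat{\epsilon}_j^2 K^2\left(\frac{z_i - z_j}{h}\right).$$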

31 Tests Against Parametric Alternatives

Notes: A rule of thumb for the bandwidth choice is $h_l = z_{l,sd}\, n^{-1/(4+q)}$, where $z_{l,sd}$ is the sample standard deviation of $z_l$.
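The test statistic and the rule-of-thumb bandwidth can be combined in a short sketch. This is an illustration for scalar $z$ ($q = 1$), assuming a Gaussian kernel, OLS estimation of the linear null model, and a variance estimate that weights each pair by $(X_i'X_j)^2$; the function names and simulated data are hypothetical.

```python
import numpy as np

def j_statistic(resid, X, z, h):
    """J_n for scalar z (q = 1), from residuals of the parametric null fit."""
    resid = np.asarray(resid, dtype=float)
    X = np.asarray(X, dtype=float)
    z = np.asarray(z, dtype=float)
    n, q = resid.size, 1
    u = (z[:, None] - z[None, :]) / h
    K = np.exp(-0.5 * u**2) / np.sqrt(2.0 * np.pi)
    np.fill_diagonal(K, 0.0)          # exclude the j == i terms
    XX = X @ X.T                      # XX[i, j] = X_i' X_j
    ee = np.outer(resid, resid)       # ee[i, j] = e_i * e_j
    terms = XX * ee * K
    I_n = terms.sum() / (n**2 * h**q)
    sigma2_0 = 2.0 * (terms**2).sum() / (n**2 * h**q)
    return n * h ** (q / 2) * I_n / np.sqrt(sigma2_0)

def null_residuals(y, x, z):
    """OLS residuals from the linear null y = a0 + g0*z + x*b0 + e."""
    W = np.column_stack([np.ones_like(z), z, x])
    coef = np.linalg.lstsq(W, y, rcond=None)[0]
    return y - W @ coef

# Simulated example (assumed data-generating processes)
rng = np.random.default_rng(4)
n = 400
z = rng.uniform(0.0, 1.0, n)
x = rng.normal(size=n)
e = rng.normal(0.0, 0.3, n)
X = np.column_stack([np.ones(n), x])
h = z.std(ddof=1) * n ** (-1.0 / 5.0)   # rule of thumb with q = 1

y_null = 1.0 + 0.5 * z + 2.0 * x + e                 # linear null holds
y_alt = np.sin(2.0 * np.pi * z) + 2.0 * x + e        # intercept is nonlinear in z
j_null = j_statistic(null_residuals(y_null, x, z), X, z, h)
j_alt = j_statistic(null_residuals(y_alt, x, z), X, z, h)
```

When the null is misspecified, the residuals of nearby observations in $z$ are positively correlated, which pushes $\hat{I}_n$ and hence $J_n$ upward.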
