Econ 582: Nonparametric Regression
Eric Zivot
May 28, 2013
Nonparametric Regression

So far we have only considered linear regression models
$$y_i = x_i'\beta + \varepsilon_i, \quad E[\varepsilon_i \mid x_i] = 0,$$
so that
$$E[y_i \mid x_i = x] = x'\beta, \qquad \frac{\partial E[y_i \mid x_i = x]}{\partial x} = \beta.$$
The assumption that $E[y_i \mid x_i = x] = x'\beta$ is a linear function of $x$ is often made for convenience. In general, when the components of $x_i$ are continuously distributed, the conditional mean
$$E[y_i \mid x_i = x] = m(x)$$
can take on any nonlinear shape.
Two cases to consider:

1. If $E[y_i \mid x_i = x] = m(x) = m(x, \theta)$ for $\theta \in \mathbb{R}^p$, then we have a parametric nonlinear regression model
$$y_i = m(x_i, \theta) + \varepsilon_i,$$
and the parameters $\theta$ can be estimated using nonlinear regression techniques.

2. If $E[y_i \mid x_i = x] = m(x)$ cannot be modeled parametrically, or the parametric form $m(x, \theta)$ is unknown, then we have a nonparametric regression
$$y_i = m(x_i) + \varepsilon_i,$$
and we can estimate the function $m(x)$ at each point $x$ using nonparametric regression techniques.
Binned Estimation of $m(x)$

Consider a nonparametric regression with a single covariate:
$$y_i = m(x_i) + \varepsilon_i.$$
Fix the point $x$ and consider estimating $m(x)$ using a local average of the $y_i$ values associated with $x_i$ values near $x$, i.e., such that $|x_i - x| \le h$:
$$\hat{m}(x) = \frac{\sum_{i=1}^n 1(|x_i - x| \le h)\, y_i}{\sum_{i=1}^n 1(|x_i - x| \le h)} = \sum_{i=1}^n w_i(x)\, y_i,$$
where $1(|x_i - x| \le h) = 1$ if $|x_i - x| \le h$ and $0$ otherwise, and
$$w_i(x) = \frac{1(|x_i - x| \le h)}{\sum_{j=1}^n 1(|x_j - x| \le h)}.$$
Note that $\sum_{i=1}^n w_i(x) = 1$.
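The following is a minimal Python sketch of the binned estimator, using simulated data in the spirit of the example on the next slide; the function name `binned_mhat` and the simulation details are illustrative assumptions, not part of the original notes.

```python
import numpy as np

def binned_mhat(x0, x, y, h):
    """Local average of y_i over observations with |x_i - x0| <= h."""
    in_bin = np.abs(x - x0) <= h        # indicator 1(|x_i - x0| <= h)
    if not in_bin.any():                # empty bin: estimator undefined here
        return np.nan
    return y[in_bin].mean()             # equal weights w_i(x0) inside the bin

# simulated data roughly matching the Hansen example on the next slide
rng = np.random.default_rng(0)
x = rng.normal(4.0, 1.0, size=100)
y = 10 * np.log10(np.abs(x)) + rng.normal(0.0, 4.0, size=100)  # sd 4 => var 16
# (np.abs guards the rare negative draw of x before taking logs)

print([round(binned_mhat(x0, x, y, h=0.5), 2) for x0 in (2, 3, 4, 5, 6)])
```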
Example: Nonparametric regression (Hansen)

The true model is
$$y_i = m(x_i) + \varepsilon_i, \quad i = 1, \dots, 100,$$
$$m(x) = 10 \log_{10}(x), \quad x_i \sim N(4, 1), \quad \varepsilon_i \sim N(0, 16),$$
so that $\mathrm{var}(\varepsilon_i) = 16$. For binned estimation, let $x = 2, 3, 4, 5, 6$ and $h = 0.5$.
Remarks:

- The binned estimator is a step function (a discontinuous estimate of $m(x)$).
- For a coarse grid of $x$ values, the steps (squares in the figure) are large.
- For a fine grid of $x$ values, the steps (solid line in the figure) are smaller.
- The bandwidth $h$ determines the smoothness of the estimate: small $h$ gives small bins and less smoothness.
Figure 1: Binned estimator at $x = 2, 3, 4, 5, 6$ with $h = 1/2$, and NW estimator with Epanechnikov kernel.
Kernel Regression

- The binned estimator is discontinuous because the weights $w_i(x)$ are constructed from indicator functions, which are discontinuous.
- If $w_i(x)$ is constructed from a continuous function, then $\hat{m}(x)$ will also be continuous.
- Kernel estimators of $m(x)$ are continuous estimators based on continuous kernel weight functions.
Example: Kernel weight function based on the uniform distribution

Define the weights $1(|x_i - x| \le h)$ in terms of the uniform density on $[-1, 1]$:
$$k_0(u) = \frac{1}{2}\, 1(|u| \le 1) = \text{rectangular kernel}.$$
Then
$$1(|x_i - x| \le h) = 1\!\left(\left|\frac{x_i - x}{h}\right| \le 1\right) = 2\, k_0\!\left(\frac{x_i - x}{h}\right),$$
and
$$\hat{m}(x) = \frac{\sum_{i=1}^n k_0\!\left(\frac{x_i - x}{h}\right) y_i}{\sum_{i=1}^n k_0\!\left(\frac{x_i - x}{h}\right)}.$$
Definition: A second-order kernel function $k(u)$ satisfies

1. $0 \le k(u) \le \bar{k} < \infty$,
2. $k(u) = k(-u)$ (symmetry),
3. $\int_{-\infty}^{\infty} k(u)\, du = 1$,
4. $\sigma_k^2 = \int_{-\infty}^{\infty} u^2 k(u)\, du < \infty$.
Kernel Estimator

Given a kernel weight function $k(u)$, a kernel estimator of $m(x)$ has the form
$$\hat{m}(x) = \frac{\sum_{i=1}^n k\!\left(\frac{x_i - x}{h}\right) y_i}{\sum_{i=1}^n k\!\left(\frac{x_i - x}{h}\right)} = \sum_{i=1}^n w_i(x)\, y_i,$$
where
$$w_i(x) = \frac{k\!\left(\frac{x_i - x}{h}\right)}{\sum_{j=1}^n k\!\left(\frac{x_j - x}{h}\right)}.$$

Note: The kernel estimator is also known as the Nadaraya-Watson estimator, the kernel regression estimator, or the local constant estimator.
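A short Python sketch of the NW estimator with the Epanechnikov kernel (defined on the following slides); the function names are illustrative.

```python
import numpy as np

def epanechnikov(u):
    """k1(u) = (3/4)(1 - u^2) for |u| <= 1, zero otherwise."""
    return np.where(np.abs(u) <= 1, 0.75 * (1 - u**2), 0.0)

def nw_estimator(x0, x, y, h, kernel=epanechnikov):
    """Nadaraya-Watson estimate m_hat(x0) = sum_i w_i(x0) y_i."""
    k = kernel((x - x0) / h)            # kernel weights k((x_i - x0)/h)
    return np.sum(k * y) / np.sum(k)    # weighted local average of y
```

Evaluating `nw_estimator` over a fine grid of `x0` values traces out a continuous estimate of $m(x)$, in contrast to the step-function binned estimator.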
Role of the Bandwidth Parameter

The bandwidth $h$ determines the smoothness of the estimator: large $h$ gives a smoother $\hat{m}(x)$; smaller $h$ gives a rougher (more erratic) $\hat{m}(x)$. As $h \to 0$, $\hat{m}(x_i) \to y_i$ (the estimator interpolates the data); as $h \to \infty$, $\hat{m}(x) \to \bar{y}$ (the estimator flattens toward the sample mean).
Commonly Used Kernels

1. Epanechnikov kernel
$$k_1(u) = \frac{3}{4}(1 - u^2)\, 1(|u| \le 1)$$

2. Gaussian kernel
$$k_4(u) = \frac{1}{\sqrt{2\pi}} \exp\!\left(-\frac{u^2}{2}\right)$$

Two important properties of kernels:
$$\sigma_k^2 = \int_{-\infty}^{\infty} u^2 k(u)\, du, \qquad R(k) = \int_{-\infty}^{\infty} k(u)^2\, du.$$
Properties of Commonly Used Kernels

| Kernel | Equation | $R(k)$ | $\sigma_k^2$ |
|---|---|---|---|
| Uniform | $k_0(u) = \frac{1}{2}\, 1(|u| \le 1)$ | 1/2 | 1/3 |
| Epanechnikov | $k_1(u) = \frac{3}{4}(1 - u^2)\, 1(|u| \le 1)$ | 3/5 | 1/5 |
| Biweight | $k_2(u) = \frac{15}{16}(1 - u^2)^2\, 1(|u| \le 1)$ | 5/7 | 1/7 |
| Triweight | $k_3(u) = \frac{35}{32}(1 - u^2)^3\, 1(|u| \le 1)$ | 350/429 | 1/9 |
| Gaussian | $k_4(u) = \frac{1}{\sqrt{2\pi}} \exp(-u^2/2)$ | $1/(2\sqrt{\pi})$ | 1 |
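As a numerical check of the table, the sketch below evaluates $R(k)$ and $\sigma_k^2$ for each kernel by quadrature; this is purely illustrative code, not part of the original notes.

```python
import numpy as np
from scipy.integrate import quad

kernels = {
    "Uniform":      lambda u: 0.5 * (abs(u) <= 1),
    "Epanechnikov": lambda u: 0.75 * (1 - u**2) * (abs(u) <= 1),
    "Biweight":     lambda u: (15 / 16) * (1 - u**2)**2 * (abs(u) <= 1),
    "Triweight":    lambda u: (35 / 32) * (1 - u**2)**3 * (abs(u) <= 1),
    "Gaussian":     lambda u: np.exp(-u**2 / 2) / np.sqrt(2 * np.pi),
}
for name, k in kernels.items():
    # [-10, 10] covers all supports here; points=[-1, 1] flags the kinks
    Rk, _  = quad(lambda u: k(u)**2, -10, 10, points=[-1, 1])      # R(k)
    sk2, _ = quad(lambda u: u**2 * k(u), -10, 10, points=[-1, 1])  # sigma_k^2
    print(f"{name:12s} R(k) = {Rk:.4f}   sigma_k^2 = {sk2:.4f}")
```

For example, the Gaussian entries print as $R(k) = 1/(2\sqrt{\pi}) \approx 0.2821$ and $\sigma_k^2 = 1$.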
Local Linear Estimator

The Nadaraya-Watson (NW) kernel estimator is often called a local constant estimator because it locally (about $x$) approximates $m(x)$ as a constant function. In fact, the NW estimator solves the minimization problem
$$\hat{m}(x) = \arg\min_{\alpha} \sum_{i=1}^n k\!\left(\frac{x_i - x}{h}\right)(y_i - \alpha)^2.$$
This is a weighted regression of $y_i$ on an intercept only.
A local linear (LL) approximation solves the minimization problem
$$\{\hat{\alpha}(x), \hat{\beta}(x)\} = \arg\min_{\alpha, \beta} \sum_{i=1}^n k\!\left(\frac{x_i - x}{h}\right)\left(y_i - \alpha - \beta(x_i - x)\right)^2.$$
The local linear estimator of $m(x)$ is the estimated intercept
$$\hat{m}(x) = \hat{\alpha}(x),$$
and the local linear estimator of the regression derivative $m'(x)$ is the estimated slope coefficient
$$\widehat{m'}(x) = \hat{\beta}(x).$$
Matrix notation

Define
$$z_i(x) = \begin{pmatrix} 1 \\ x_i - x \end{pmatrix}, \qquad k_i(x) = k\!\left(\frac{x_i - x}{h}\right).$$
Then the LL estimator is the weighted LS estimator
$$\begin{pmatrix} \hat{\alpha}(x) \\ \hat{\beta}(x) \end{pmatrix} = \left(\sum_{i=1}^n k_i(x)\, z_i(x)\, z_i(x)'\right)^{-1} \sum_{i=1}^n k_i(x)\, z_i(x)\, y_i = (Z'KZ)^{-1} Z'Ky,$$
where
$$K = \begin{pmatrix} k_1(x) & & \\ & \ddots & \\ & & k_n(x) \end{pmatrix}.$$
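A Python sketch of the local linear estimator via the weighted least squares formula above; names are illustrative, and no safeguard is included for the case where too few observations receive positive weight (which would make $Z'KZ$ singular).

```python
import numpy as np

def epanechnikov(u):
    return np.where(np.abs(u) <= 1, 0.75 * (1 - u**2), 0.0)

def ll_estimator(x0, x, y, h, kernel=epanechnikov):
    """Local linear fit at x0; returns (m_hat(x0), derivative estimate)."""
    k = kernel((x - x0) / h)                          # diagonal of K
    Z = np.column_stack([np.ones_like(x), x - x0])    # rows z_i(x0)' = (1, x_i - x0)
    alpha, beta = np.linalg.solve(Z.T @ (k[:, None] * Z),  # (Z'KZ)^{-1} ...
                                  Z.T @ (k * y))           # ... Z'Ky
    return alpha, beta   # intercept = m_hat(x0), slope = m'_hat(x0)
```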
Remarks

- The LL estimate of $m(x)$ is $\hat{\alpha}(x)$ alone, not $\hat{\alpha}(x) + \hat{\beta}(x)x$, because the regressor $x_i - x$ is centered at $x$ (the first element of $z_i(x)$ is 1).
- NW does better than LL when $m(x)$ is close to a flat line.
- LL does better than NW when $m(x)$ is meaningfully nonconstant.
- LL does better than NW for values of $x$ near the boundary of the support of $x_i$.
Figure 2: Local linear estimator.
Nonparametric Residuals and Regression Fit

Define the nonparametric residual as
$$\hat{\varepsilon}_i = y_i - \hat{m}(x_i).$$
Problem: $\hat{\varepsilon}_i$ is not a good error measure for small $h$, because $\hat{m}(x_i) \to y_i$ as $h \to 0$, and so $\hat{\varepsilon}_i \to 0$ as $h \to 0$. We need a residual that does not suffer from this overfitting problem.
Leave-One-Out (Jackknife) Residuals (NW Estimator)

Idea: For the NW estimator, we can prevent $\hat{m}(x_i) \to y_i$ as $h \to 0$ by leaving observation $(y_i, x_i)$ out of the nonparametric fit:
$$\tilde{m}_{-i}(x) = \frac{\sum_{j \ne i} k\!\left(\frac{x_j - x}{h}\right) y_j}{\sum_{j \ne i} k\!\left(\frac{x_j - x}{h}\right)}.$$
The leave-one-out (jackknife) NW predictor and residual for observation $i$ are
$$\tilde{y}_i = \tilde{m}_{-i}(x_i), \qquad \tilde{\varepsilon}_i = y_i - \tilde{y}_i.$$
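A sketch of the leave-one-out NW residuals in Python; the O(n²) loop is written for clarity rather than speed, names are illustrative, and the code assumes every $x_i$ has at least one neighbor within $h$.

```python
import numpy as np

def epanechnikov(u):
    return np.where(np.abs(u) <= 1, 0.75 * (1 - u**2), 0.0)

def loo_residuals_nw(x, y, h, kernel=epanechnikov):
    """Jackknife residuals e_i = y_i - m_tilde_{-i}(x_i) for the NW fit."""
    n = len(x)
    resid = np.empty(n)
    for i in range(n):
        xj, yj = np.delete(x, i), np.delete(y, i)   # drop observation i
        k = kernel((xj - x[i]) / h)                 # weights over j != i
        resid[i] = y[i] - np.sum(k * yj) / np.sum(k)
    return resid
```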
Leave-One-Out (Jackknife) Residuals (LL Estimator)

The jackknife LL estimator has the form
$$\begin{pmatrix} \tilde{\alpha}(x) \\ \tilde{\beta}(x) \end{pmatrix} = \left(\sum_{j \ne i} k_j(x)\, z_j(x)\, z_j(x)'\right)^{-1} \sum_{j \ne i} k_j(x)\, z_j(x)\, y_j, \qquad z_j(x) = \begin{pmatrix} 1 \\ x_j - x \end{pmatrix},$$
and the LL residual is
$$\tilde{\varepsilon}_i = y_i - \tilde{m}_{-i}(x_i) = y_i - \tilde{\alpha}(x_i).$$
Cross-Validation and Bandwidth Selection

Consider the model
$$y_i = m(x_i) + \varepsilon_i, \qquad \mathrm{var}(\varepsilon_i \mid x_i) = \sigma^2 \text{ for all } i,$$
and let $\hat{m}(x)$ denote a nonparametric estimate of $m(x)$.

Problem: How to choose $h$?

- Large $h$: smoother estimator (smaller variance of $\hat{m}(x)$) but higher bias at each $x$.
- Small $h$: noisier estimator (higher variance of $\hat{m}(x)$) but lower bias at each $x$ (recall, $\hat{m}(x_i) \to y_i$ as $h \to 0$).

Key point: It is desirable to choose $h$ to balance this bias-variance tradeoff.
MSE, IMSE and MSFE

The mean-squared error (MSE) at $x$ is defined as
$$MSE_n(x, h) = E\left[\left(\hat{m}(x) - m(x)\right)^2\right] = \left(E[\hat{m}(x)] - m(x)\right)^2 + \mathrm{var}\left(\hat{m}(x)\right),$$
and is a function of both $x$ and $h$.

The integrated MSE (IMSE), a weighted average of the MSE over all $x$, is
$$IMSE_n(h) = \int MSE_n(x, h)\, f(x)\, dx = E\left[MSE_n(x_i, h)\right],$$
where $f(x)$ is the pdf of $x_i$; the IMSE is a function of $h$ only.

Goal: Find $h$ to minimize $IMSE_n(h)$.
Problem: $IMSE_n(h)$ depends on $m(x)$, which is unknown.

Result: $IMSE_n(h)$ can be estimated using the sample mean-squared forecast error (MSFE).

Let $(y_{n+1}, x_{n+1})$ be an out-of-sample observation independent of the sample. The prediction of $y_{n+1}$ given $x_{n+1}$ is
$$\hat{y}_{n+1} = \hat{m}(x_{n+1}).$$
The MSFE is defined as
$$MSFE_n(h) = E\left[\left(y_{n+1} - \hat{y}_{n+1}\right)^2\right] = E\left[\left(y_{n+1} - \hat{m}(x_{n+1})\right)^2\right].$$
Using the trivial identity
$$y_{n+1} - \hat{m}(x_{n+1}) = y_{n+1} - m(x_{n+1}) + m(x_{n+1}) - \hat{m}(x_{n+1}) = \varepsilon_{n+1} + m(x_{n+1}) - \hat{m}(x_{n+1}),$$
it can be shown that
$$MSFE_n(h) = E\left[\left(\varepsilon_{n+1} + m(x_{n+1}) - \hat{m}(x_{n+1})\right)^2\right] = \sigma^2 + \int MSE_n(x, h)\, f(x)\, dx = \sigma^2 + IMSE_n(h).$$
Hence, minimizing $MSFE_n(h)$ is equivalent to minimizing $IMSE_n(h)$.
Estimating $MSFE_n(h)$

Using the jackknife nonparametric residuals
$$\tilde{\varepsilon}_i(h) = y_i - \tilde{m}_{-i}(x_i),$$
an estimate of $MSFE_n(h)$ is
$$\widehat{MSFE}_n(h) = \frac{1}{n} \sum_{i=1}^n \tilde{\varepsilon}_i(h)^2.$$
Treated as a function of $h$, $\widehat{MSFE}_n(h)$ is called the cross-validation criterion:
$$CV(h) = \frac{1}{n} \sum_{i=1}^n \tilde{\varepsilon}_i(h)^2.$$
Optimal Bandwidth Estimation

The bandwidth that minimizes the estimate of the IMSE solves
$$\hat{h} = \arg\min_{h > 0} CV(h).$$

Notes:

- Typically, this univariate minimization is done by evaluating $CV(h)$ over a grid $h \in [h_1, \dots, h_J]$ and choosing $\hat{h}$ as the value that gives the smallest $CV(h)$ over the grid, as in the sketch below.
- Plots of $CV(h)$ against $h$ provide a visual guide to choosing $h$.
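A Python sketch of the grid-search procedure for the NW estimator; the grid endpoints and simulated data are illustrative assumptions, not from the original notes.

```python
import numpy as np

def epanechnikov(u):
    return np.where(np.abs(u) <= 1, 0.75 * (1 - u**2), 0.0)

def cv_nw(h, x, y):
    """CV(h) = (1/n) sum_i (y_i - m_tilde_{-i}(x_i))^2 for the NW estimator."""
    n = len(x)
    sse = 0.0
    for i in range(n):
        xj, yj = np.delete(x, i), np.delete(y, i)
        k = epanechnikov((xj - x[i]) / h)
        if k.sum() == 0:            # h too small: some x_i has no neighbors
            return np.inf
        sse += (y[i] - np.sum(k * yj) / k.sum()) ** 2
    return sse / n

rng = np.random.default_rng(0)
x = rng.normal(4.0, 1.0, 100)
y = 10 * np.log10(np.abs(x)) + rng.normal(0.0, 4.0, 100)

h_grid = np.linspace(0.2, 2.0, 40)                 # grid [h_1, ..., h_J]
cv = np.array([cv_nw(h, x, y) for h in h_grid])
h_hat = h_grid[cv.argmin()]                        # minimizer over the grid
print(f"CV-optimal bandwidth: {h_hat:.3f}")
```

Plotting `cv` against `h_grid` produces the kind of cross-validation plot shown in Figure 3.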
Asymptotic Distribution Theory

Theorem. Let $\hat{m}(x)$ denote either the NW or LL estimator of $m(x)$. If $x$ is interior to the support of $x_i$ and $f(x) > 0$, then as $n \to \infty$ and $h \to 0$ such that $nh \to \infty$,
$$\sqrt{nh}\left(\hat{m}(x) - m(x) - h^2 \sigma_k^2 B(x)\right) \xrightarrow{d} N\!\left(0, \frac{R(k)\, \sigma^2(x)}{f(x)}\right),$$
where
$$\sigma^2(x) = E[\varepsilon_i^2 \mid x_i = x], \qquad \sigma_k^2 = \int_{-\infty}^{\infty} u^2 k(u)\, du, \qquad R(k) = \int_{-\infty}^{\infty} k(u)^2\, du.$$
Figure 3: Cross-validation criteria, NW and LL estimators.
Figure 4: NW and LL estimates using data-dependent CV bandwidths.
The asymptotic bias terms for the NW and LL estimators are
$$B_{NW}(x) = \frac{1}{2} m''(x) + f(x)^{-1} f'(x)\, m'(x),$$
$$B_{LL}(x) = \frac{1}{2} m''(x).$$
Remarks:

- The asymptotic variances of the NW and LL estimators are the same, but the biases differ.
- $\hat{m}(x)$ converges at rate $\sqrt{nh}$ instead of the usual CLT rate of $\sqrt{n}$. Because $h \to 0$, $\sqrt{nh}$ diverges more slowly than $\sqrt{n}$. Hence, nonparametric estimators converge more slowly to their asymptotic distributions than parametric estimators.
- $\hat{m}(x)$ has an asymptotic bias term $h^2 \sigma_k^2 B(x)$, which depends on $\sigma_k^2$, $m'(x)$, $m''(x)$, $f(x)$ and $f'(x)$.
- The asymptotic bias decreases as $h \to 0$, while the asymptotic variance increases as $h \to 0$ (the bias-variance tradeoff).
- $B_{NW}(x)$ depends on both $m'(x)$ and $f'(x)$, whereas $B_{LL}(x)$ depends only on $m''(x)$.
- $B_{NW}(x) = B_{LL}(x) = 0$ if $m(x)$ is constant (i.e., $m'(x) = m''(x) = 0$).
- $B_{LL}(x)$ is typically lower than $B_{NW}(x)$.
Estimating Asymptotic Standard Errors

The asymptotic distribution theory gives the result
$$\mathrm{avar}(\hat{m}(x)) = \frac{R(k)\, \sigma^2(x)}{n h\, f(x)}.$$
The known quantities are $n$, $h$ and $R(k)$. The unknown quantities are $\sigma^2(x) = E[\varepsilon_i^2 \mid x_i = x]$ and $f(x)$.

An estimate of $\mathrm{avar}(\hat{m}(x))$ uses estimates of $\sigma^2(x)$ and $f(x)$:
$$\widehat{\mathrm{avar}}(\hat{m}(x)) = \frac{R(k)\, \hat{\sigma}^2(x)}{n h\, \hat{f}(x)}, \qquad \widehat{SE}(\hat{m}(x)) = \sqrt{\frac{R(k)\, \hat{\sigma}^2(x)}{n h\, \hat{f}(x)}}.$$

Question: How to estimate $\sigma^2(x)$ and $f(x)$?
Nonparametric Estimation of $\sigma^2(x) = E[\varepsilon_i^2 \mid x_i = x]$ and $f(x)$

A nonparametric estimate of $\sigma^2(x)$ has the form
$$\hat{\sigma}^2(x) = \frac{\sum_{i=1}^n k\!\left(\frac{x_i - x}{h}\right) \tilde{\varepsilon}_i^2}{\sum_{i=1}^n k\!\left(\frac{x_i - x}{h}\right)},$$
where $\tilde{\varepsilon}_i$ is the jackknife residual.

A nonparametric estimate of $f(x)$ is the kernel density estimator
$$\hat{f}(x) = \frac{1}{nh} \sum_{i=1}^n k\!\left(\frac{x_i - x}{h}\right).$$
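A Python sketch of the plug-in standard error, combining the two estimates above; it assumes the jackknife residuals have already been computed (e.g., with `loo_residuals_nw` from the earlier sketch), and the names are illustrative.

```python
import numpy as np

def epanechnikov(u):
    return np.where(np.abs(u) <= 1, 0.75 * (1 - u**2), 0.0)

R_K = 3 / 5   # R(k) for the Epanechnikov kernel (see the kernel table)

def se_mhat(x0, x, loo_resid, h, kernel=epanechnikov, Rk=R_K):
    """sqrt( R(k) * sigma2_hat(x0) / (n h f_hat(x0)) )."""
    n = len(x)
    k = kernel((x - x0) / h)
    sigma2_hat = np.sum(k * loo_resid**2) / np.sum(k)  # weighted avg of e~_i^2
    f_hat = np.sum(k) / (n * h)                        # kernel density estimate
    return np.sqrt(Rk * sigma2_hat / (n * h * f_hat))
```

Pointwise confidence bands then follow as $\hat{m}(x) \pm 1.96\, \widehat{SE}(\hat{m}(x))$.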
Extension to Multiple Regression

Consider the multiple regression model
$$y_i = E[y_i \mid \mathbf{x}_i = \mathbf{x}] + \varepsilon_i = m(\mathbf{x}_i) + \varepsilon_i, \qquad \mathbf{x}_i = (x_{1i}, \dots, x_{di})'.$$
For any vector $\mathbf{x}$ and observation $i$, define the product kernel weights
$$K_i(\mathbf{x}) = k\!\left(\frac{x_{1i} - x_1}{h_1}\right) \times \cdots \times k\!\left(\frac{x_{di} - x_d}{h_d}\right)$$
and bandwidth vector
$$\mathbf{h} = (h_1, \dots, h_d)'.$$
Nonparametric Estimators

Multivariate NW estimator:
$$\hat{m}(\mathbf{x}) = \frac{\sum_{i=1}^n K_i(\mathbf{x})\, y_i}{\sum_{i=1}^n K_i(\mathbf{x})}.$$

Multivariate LL estimator:
$$\begin{pmatrix} \hat{\alpha}(\mathbf{x}) \\ \hat{\boldsymbol{\beta}}(\mathbf{x}) \end{pmatrix} = \left(\sum_{i=1}^n K_i(\mathbf{x})\, z_i(\mathbf{x})\, z_i(\mathbf{x})'\right)^{-1} \sum_{i=1}^n K_i(\mathbf{x})\, z_i(\mathbf{x})\, y_i = (Z'KZ)^{-1} Z'Ky,$$
where
$$z_i(\mathbf{x}) = \begin{pmatrix} 1 \\ \mathbf{x}_i - \mathbf{x} \end{pmatrix}.$$
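A minimal Python sketch of the multivariate NW estimator with a product kernel; the function name and interface are illustrative.

```python
import numpy as np

def epanechnikov(u):
    return np.where(np.abs(u) <= 1, 0.75 * (1 - u**2), 0.0)

def nw_multivariate(x0, X, y, h, kernel=epanechnikov):
    """Multivariate NW estimate at x0.

    X : (n, d) regressor matrix; x0 : length-d point; h : length-d bandwidths.
    """
    U = (np.asarray(X) - np.asarray(x0)) / np.asarray(h)  # scaled deviations
    K = np.prod(kernel(U), axis=1)        # K_i(x0) = prod_j k((x_ji - x_j)/h_j)
    return np.sum(K * y) / np.sum(K)      # weighted local average
```

The LL version replaces the local average with a weighted regression of $y_i$ on $(1, \mathbf{x}_i - \mathbf{x})$, exactly as in the univariate case.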
Remarks

- Finding the cross-validation bandwidth vector
$$\hat{\mathbf{h}} = \arg\min_{\mathbf{h}} CV(\mathbf{h})$$
is a cumbersome numerical problem if $d$ is large.
- Asymptotic distribution theory is similar to the univariate case, with one important difference: the convergence rate to the asymptotic normal distribution depends on the dimension $d$ of $\mathbf{x}$. The higher is $d$, the slower is the convergence rate. This is called the curse of dimensionality and is a major problem in nonparametric regression.