
Econ 582 Nonparametric Regression Eric Zivot May 28, 2013

Nonparametric Regression

So far we have only considered linear regression models
$$y_i = x_i'\beta + e_i, \qquad E[e_i \mid x_i] = 0,$$
so that the conditional mean is linear in $x$: $E[y_i \mid x_i = x] = x'\beta$. The assumption that $E[y_i \mid x_i = x] = x'\beta$ is a linear function of $x$ is often made for convenience. In general, when the components of $x$ are continuously distributed, the conditional mean can take on any nonlinear shape:
$$E[y_i \mid x_i = x] = m(x).$$

Two cases to consider:
- If $E[y_i \mid x_i = x] = m(x) = m(x, \theta)$ for $\theta \in \mathbb{R}^p$, then we have a parametric nonlinear regression model $y_i = m(x_i, \theta) + e_i$, and the parameters $\theta$ can be estimated using nonlinear regression techniques.
- If $E[y_i \mid x_i = x] = m(x)$ cannot be modeled parametrically, or the parametric form $m(x, \theta)$ is unknown, then we have a nonparametric regression $y_i = m(x_i) + e_i$, and we can estimate the function $m(x)$ at each point $x$ using nonparametric regression techniques.

Binned Estimation of $m(x)$

Consider a nonparametric regression with a single covariate:
$$y_i = m(x_i) + e_i.$$
Fix the point $x$ and consider estimating $m(x)$ using a local average of the $y_i$ values associated with $x_i$ values near $x$:
$$\hat{m}(x) = \frac{\sum_{i=1}^{n} 1(|x_i - x| \le h)\, y_i}{\sum_{i=1}^{n} 1(|x_i - x| \le h)}, \qquad 1(|x_i - x| \le h) = \begin{cases} 1 & \text{if } |x_i - x| \le h, \\ 0 & \text{otherwise.} \end{cases}$$
Equivalently, with weights
$$w_i(x) = \frac{1(|x_i - x| \le h)}{\sum_{j=1}^{n} 1(|x_j - x| \le h)}, \qquad \sum_{i=1}^{n} w_i(x) = 1,$$
we have $\hat{m}(x) = \sum_{i=1}^{n} w_i(x)\, y_i$.

Example: Nonparametric regression (Hansen)

The true model is $y_i = m(x_i) + e_i$, $i = 1, \ldots, 100$, with $m(x) = 10\,\log(x)$, $x_i \sim N(4, 1)$, and $e_i \sim N(0, 16)$ (so $y_i \mid x_i \sim N(m(x_i), 16)$). For binned estimation let $x = 2, 3, 4, 5, 6$ and $h = 0.5$.
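
A minimal sketch of the binned estimator in Python (NumPy), simulated roughly along the lines of this example; the design values ($m(x) = 10\log(x)$, $x_i \sim N(4,1)$, $e_i \sim N(0,16)$) are as reconstructed from the slide, and the helper name is illustrative:

```python
# Sketch: binned (local average) estimator of m(x) on simulated data.
import numpy as np

rng = np.random.default_rng(0)
n = 100
x = np.clip(rng.normal(4.0, 1.0, n), 0.1, None)  # keep x > 0 so log(x) is defined
e = rng.normal(0.0, 4.0, n)                      # sd 4, i.e. variance 16
y = 10.0 * np.log(x) + e

def binned_estimator(x0, x, y, h):
    """Average the y_i whose x_i fall within h of x0."""
    in_bin = np.abs(x - x0) <= h
    return y[in_bin].mean() if in_bin.any() else np.nan

h = 0.5
for x0 in (2, 3, 4, 5, 6):
    print(x0, round(binned_estimator(x0, x, y, h), 3))
```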

Remarks:
- The binned estimator is a step function (a discontinuous estimate of $m(x)$).
- For a coarse grid of $x$ values the steps (squares in the figure) are large.
- For a fine grid of $x$ values the steps (solid line in the figure) are smaller.
- The bandwidth $h$ determines the smoothness of the estimate: small $h$ gives small bins and less smoothness.

Figure 1: Binned estimator at $x = 2, 3, 4, 5, 6$ with $h = 1/2$, and NW estimator with Epanechnikov kernel.

Kernel Regression

The binned estimator is discontinuous because $w_i(x)$ is constructed from indicator functions, which are discontinuous. If $w_i(x)$ is constructed from a continuous function, then $\hat{m}(x)$ will also be continuous. Kernel estimators of $m(x)$ are continuous estimators based on continuous kernel weight functions.

Example: Kernel weight function based on the uniform distribution

Define the weights $1(|x_i - x| \le h)$ in terms of the uniform density on $[-1, 1]$ (the rectangular kernel):
$$k_0(u) = \frac{1}{2}\, 1(|u| \le 1).$$
Then
$$1(|x_i - x| \le h) = 1\!\left(\left|\frac{x_i - x}{h}\right| \le 1\right) = 2\, k_0\!\left(\frac{x_i - x}{h}\right)$$
and
$$\hat{m}(x) = \frac{\sum_{i=1}^{n} k_0\!\left(\frac{x_i - x}{h}\right) y_i}{\sum_{i=1}^{n} k_0\!\left(\frac{x_i - x}{h}\right)}.$$

Definition: A second-order kernel function $k(u)$ satisfies
$$0 \le k(u) < \infty, \qquad k(u) = k(-u), \qquad \int_{-\infty}^{\infty} k(u)\, du = 1, \qquad \sigma_k^2 = \int_{-\infty}^{\infty} u^2 k(u)\, du < \infty.$$

Kernel Estimator

Given a kernel weight function $k(u)$, a kernel estimator of $m(x)$ has the form
$$\hat{m}(x) = \frac{\sum_{i=1}^{n} k\!\left(\frac{x_i - x}{h}\right) y_i}{\sum_{i=1}^{n} k\!\left(\frac{x_i - x}{h}\right)} = \sum_{i=1}^{n} w_i(x)\, y_i, \qquad w_i(x) = \frac{k\!\left(\frac{x_i - x}{h}\right)}{\sum_{j=1}^{n} k\!\left(\frac{x_j - x}{h}\right)}.$$
Note: The kernel estimator is also known as the Nadaraya-Watson estimator, the kernel regression estimator, or the local constant estimator.
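
A minimal NW estimator sketch in Python (NumPy) with the Epanechnikov kernel; function names are illustrative:

```python
# Sketch: Nadaraya-Watson (local constant) kernel regression estimator.
import numpy as np

def epanechnikov(u):
    return 0.75 * (1.0 - u**2) * (np.abs(u) <= 1.0)

def nw_estimator(x0, x, y, h, kernel=epanechnikov):
    """m_hat(x0) = sum_i k((x_i - x0)/h) * y_i / sum_i k((x_i - x0)/h)."""
    w = kernel((x - x0) / h)
    s = w.sum()
    return np.dot(w, y) / s if s > 0 else np.nan

# Usage (x, y are data arrays): evaluate m_hat on a grid of points
# grid = np.linspace(x.min(), x.max(), 200)
# m_hat = np.array([nw_estimator(g, x, y, h=0.5) for g in grid])
```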

Role of the Bandwidth Parameter

The bandwidth $h$ determines the smoothness of the estimator: a large $h$ gives a smoother $\hat{m}(x)$; a smaller $h$ gives a rougher (more erratic) $\hat{m}(x)$. As $h \to 0$, $\hat{m}(x_i) \to y_i$ (the estimator interpolates the data); as $h \to \infty$, $\hat{m}(x) \to \bar{y}$ (the sample mean).

Commonly Used Kernels

1. Epanechnikov kernel: $k_1(u) = \frac{3}{4}(1 - u^2)\, 1(|u| \le 1)$
2. Gaussian kernel: $k_4(u) = \frac{1}{\sqrt{2\pi}} \exp\!\left(-\frac{u^2}{2}\right)$

Two important properties of kernels:
$$\sigma_k^2 = \int_{-\infty}^{\infty} u^2 k(u)\, du, \qquad R(k) = \int_{-\infty}^{\infty} k(u)^2\, du.$$

Properties of Commonly Used Kernels ($R(k) = \int k(u)^2\, du$, $\sigma_k^2 = \int u^2 k(u)\, du$):
- Uniform: $k_0(u) = \frac{1}{2}\, 1(|u| \le 1)$, $R(k) = 1/2$, $\sigma_k^2 = 1/3$
- Epanechnikov: $k_1(u) = \frac{3}{4}(1 - u^2)\, 1(|u| \le 1)$, $R(k) = 3/5$, $\sigma_k^2 = 1/5$
- Biweight: $k_2(u) = \frac{15}{16}(1 - u^2)^2\, 1(|u| \le 1)$, $R(k) = 5/7$, $\sigma_k^2 = 1/7$
- Triweight: $k_3(u) = \frac{35}{32}(1 - u^2)^3\, 1(|u| \le 1)$, $R(k) = 350/429$, $\sigma_k^2 = 1/9$
- Gaussian: $k_4(u) = \frac{1}{\sqrt{2\pi}} \exp\!\left(-\frac{u^2}{2}\right)$, $R(k) = 1/(2\sqrt{\pi})$, $\sigma_k^2 = 1$
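
These constants can be checked numerically; a small sketch using NumPy and scipy.integrate.quad (the kernel labels are just dictionary keys):

```python
# Sketch: numerically verify R(k) and sigma_k^2 for the common kernels.
import numpy as np
from scipy.integrate import quad

kernels = {
    # name: (k(u), lower limit, upper limit of support)
    "uniform":      (lambda u: 1 / 2,                        -1, 1),
    "epanechnikov": (lambda u: (3 / 4) * (1 - u**2),         -1, 1),
    "biweight":     (lambda u: (15 / 16) * (1 - u**2) ** 2,  -1, 1),
    "triweight":    (lambda u: (35 / 32) * (1 - u**2) ** 3,  -1, 1),
    "gaussian":     (lambda u: np.exp(-u**2 / 2) / np.sqrt(2 * np.pi),
                     -np.inf, np.inf),
}

for name, (k, lo, hi) in kernels.items():
    R_k = quad(lambda u: k(u) ** 2, lo, hi)[0]      # R(k)
    sig2 = quad(lambda u: u**2 * k(u), lo, hi)[0]   # sigma_k^2
    print(f"{name:12s}  R(k) = {R_k:.4f}   sigma_k^2 = {sig2:.4f}")
```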

Local Linear Estimator

The Nadaraya-Watson (NW) kernel estimator is often called a local constant estimator because it locally (about $x$) approximates $m(x)$ as a constant function. In fact, the NW estimator solves the minimization problem
$$\hat{m}(x) = \arg\min_{\alpha} \sum_{i=1}^{n} k\!\left(\frac{x_i - x}{h}\right)(y_i - \alpha)^2.$$
This is a weighted regression of $y_i$ on an intercept only.

A local linear approximation solves the minimization problem
$$\{\hat{\alpha}(x), \hat{\beta}(x)\} = \arg\min_{\alpha, \beta} \sum_{i=1}^{n} k\!\left(\frac{x_i - x}{h}\right)\bigl(y_i - \alpha - \beta(x_i - x)\bigr)^2.$$
The local linear (LL) estimator of $m(x)$ is the estimated intercept
$$\hat{m}(x) = \hat{\alpha}(x),$$
and the LL estimator of the regression derivative $m'(x)$ is the estimated slope coefficient
$$\widehat{m'}(x) = \hat{\beta}(x).$$

Matrix notation

Define
$$z_i(x) = \begin{pmatrix} 1 \\ x_i - x \end{pmatrix}, \qquad k_i(x) = k\!\left(\frac{x_i - x}{h}\right).$$
Then the LL estimator is the weighted LS estimator
$$\begin{pmatrix} \hat{\alpha}(x) \\ \hat{\beta}(x) \end{pmatrix} = \left(\sum_{i=1}^{n} k_i(x)\, z_i(x) z_i(x)'\right)^{-1} \sum_{i=1}^{n} k_i(x)\, z_i(x)\, y_i = (Z'KZ)^{-1} Z'Ky,$$
where $K(x) = \mathrm{diag}\bigl(k_1(x), \ldots, k_n(x)\bigr)$.
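
A sketch of the LL estimator computed directly from the weighted least squares formula; the helper names are illustrative:

```python
# Sketch: local linear (LL) estimator at a point x0 via weighted least squares.
import numpy as np

def epanechnikov(u):
    return 0.75 * (1.0 - u**2) * (np.abs(u) <= 1.0)

def local_linear(x0, x, y, h, kernel=epanechnikov):
    """Return (m_hat(x0), m'_hat(x0)) = (intercept, slope) of the local fit."""
    k = kernel((x - x0) / h)                        # kernel weights k_i(x0)
    Z = np.column_stack([np.ones_like(x), x - x0])  # rows z_i(x0)' = (1, x_i - x0)
    K = np.diag(k)
    # (Z'KZ)^{-1} Z'Ky; singular if too few observations receive positive weight
    coef = np.linalg.solve(Z.T @ K @ Z, Z.T @ K @ y)
    return coef[0], coef[1]
```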

Remarks:
- The LL fitted value at $x$ is the intercept alone, $\hat{m}(x) = \hat{\alpha}(x)$ rather than $\hat{\alpha}(x) + \hat{\beta}(x)x$, because the regressor $x_i - x$ equals zero at the evaluation point.
- NW does better than LL when $m(x)$ is close to a flat line.
- LL does better than NW when $m(x)$ is meaningfully nonconstant.
- LL does better than NW for values of $x$ near the boundary of the support of $x$.

Figure 2: Local linear estimator.

Nonparametric Residuals and Regression Fit

Define the nonparametric residual as
$$\hat{e}_i = y_i - \hat{m}(x_i).$$
Problem: $\hat{e}_i$ is not a good error measure for small $h$, because $\hat{m}(x_i) \to y_i$ as $h \to 0$ and so $\hat{e}_i \to 0$ as $h \to 0$. We need a residual that does not suffer from this overfitting problem.

Leave-One-Out (Jackknife) Residuals (NW Estimator)

Idea: For the NW estimator, we can prevent $\hat{m}(x_i) \to y_i$ as $h \to 0$ by leaving $(y_i, x_i)$ out of the nonparametric fit:
$$\tilde{m}_{-i}(x) = \frac{\sum_{j \ne i} k\!\left(\frac{x_j - x}{h}\right) y_j}{\sum_{j \ne i} k\!\left(\frac{x_j - x}{h}\right)}.$$
The leave-one-out (jackknife) NW predictor and residual for observation $i$ are
$$\tilde{y}_i = \tilde{m}_{-i}(x_i), \qquad \tilde{e}_i = y_i - \tilde{m}_{-i}(x_i).$$

Leave-One-Out (Jackknife) Residuals (LL Estimator)

The jackknife LL estimator has the form
$$\begin{pmatrix} \tilde{\alpha}(x_i) \\ \tilde{\beta}(x_i) \end{pmatrix} = \left(\sum_{j \ne i} k_j(x_i)\, z_j(x_i) z_j(x_i)'\right)^{-1} \sum_{j \ne i} k_j(x_i)\, z_j(x_i)\, y_j, \qquad z_j(x_i) = \begin{pmatrix} 1 \\ x_j - x_i \end{pmatrix},$$
and the LL leave-one-out residual is
$$\tilde{e}_i = y_i - \tilde{\alpha}(x_i).$$
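
A minimal sketch of the leave-one-out NW residuals (the LL version is analogous, dropping observation i from the weighted least squares fit); helper names are illustrative:

```python
# Sketch: leave-one-out (jackknife) NW residuals e_tilde_i = y_i - m_tilde_{-i}(x_i).
import numpy as np

def epanechnikov(u):
    return 0.75 * (1.0 - u**2) * (np.abs(u) <= 1.0)

def loo_nw_residuals(x, y, h, kernel=epanechnikov):
    n = len(y)
    resid = np.empty(n)
    for i in range(n):
        keep = np.arange(n) != i              # drop observation i
        w = kernel((x[keep] - x[i]) / h)
        s = w.sum()
        m_tilde = np.dot(w, y[keep]) / s if s > 0 else np.nan
        resid[i] = y[i] - m_tilde
    return resid
```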

Cross-Validation and Bandwidth Selection

$$y_i = m(x_i) + e_i, \qquad \mathrm{var}(e_i) = \sigma^2 \text{ for all } i, \qquad \hat{m}(x) = \text{nonparametric estimate of } m(x).$$
Problem: How to choose $h$?
- Large $h$: smoother estimator (smaller variance of $\hat{m}(x)$) but higher bias at each $x$.
- Small $h$: noisier estimator (higher variance of $\hat{m}(x)$) but lower bias at each $x$ (recall, $\hat{m}(x_i) \to y_i$ as $h \to 0$).
Key point: it is desirable to choose $h$ to balance this bias-variance tradeoff.

MSE, IMSE and MSFE

The mean-squared error (MSE) at $x$ is defined as
$$MSE_h(x) = E\bigl[(\hat{m}(x) - m(x))^2\bigr] = \bigl(E[\hat{m}(x)] - m(x)\bigr)^2 + \mathrm{var}(\hat{m}(x))$$
and is a function of both $x$ and $h$. The integrated MSE, a weighted average of the MSE over all $x$, is
$$IMSE(h) = \int MSE_h(x)\, f(x)\, dx = E[MSE_h(x_i)],$$
where $f(x)$ is the pdf of $x$; it is a function of $h$ only. Goal: find $h$ to minimize $IMSE(h)$.

Problem: $IMSE(h)$ depends on $m(x)$, which is unknown.
Result: $IMSE(h)$ can be estimated using the sample mean-squared forecast error (MSFE). Let $(y_{n+1}, x_{n+1})$ be an out-of-sample observation independent of the sample. The prediction of $y_{n+1}$ given $x_{n+1}$ is $\hat{y}_{n+1} = \hat{m}(x_{n+1})$, and the MSFE is defined as
$$MSFE(h) = E\bigl[(y_{n+1} - \hat{y}_{n+1})^2\bigr] = E\bigl[(y_{n+1} - \hat{m}(x_{n+1}))^2\bigr].$$

Using the trivial identity
$$y_{n+1} - \hat{m}(x_{n+1}) = y_{n+1} - m(x_{n+1}) + m(x_{n+1}) - \hat{m}(x_{n+1}) = e_{n+1} + m(x_{n+1}) - \hat{m}(x_{n+1}),$$
it can be shown that
$$MSFE(h) = E\bigl[(e_{n+1} + m(x_{n+1}) - \hat{m}(x_{n+1}))^2\bigr] = \sigma^2 + \int MSE_h(x)\, f(x)\, dx = \sigma^2 + IMSE(h),$$
since the cross term has zero expectation: $e_{n+1}$ has mean zero and is independent of $\hat{m}$, which is computed from the original sample. Hence, minimizing $MSFE(h)$ is equivalent to minimizing $IMSE(h)$.

Estimating $MSFE(h)$

Using the jackknife nonparametric residuals $\tilde{e}_i(h) = y_i - \tilde{m}_{-i}(x_i)$, an estimate of $MSFE(h)$ is
$$\widehat{MSFE}(h) = \frac{1}{n} \sum_{i=1}^{n} \tilde{e}_i(h)^2.$$
Treated as a function of $h$, $\widehat{MSFE}(h)$ is called the cross-validation criterion
$$CV(h) = \frac{1}{n} \sum_{i=1}^{n} \tilde{e}_i(h)^2.$$

Optimal Bandwidth Estimation

The bandwidth that minimizes this estimate of the IMSE solves
$$\hat{h} = \arg\min_{h > 0} CV(h).$$
Notes:
- Typically, the univariate minimization is done by evaluating $CV(h)$ over a grid $[h_1, h_2, \ldots]$ and choosing $\hat{h}$ as the value that gives the smallest $CV(h)$ over the grid.
- Plots of $CV(h)$ against $h$ provide a visual guide to choosing $h$.
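
A sketch of the CV criterion for the NW estimator and a grid search for $\hat{h}$; this vectorized form is equivalent to averaging the squared leave-one-out residuals from the sketch above (all names illustrative):

```python
# Sketch: cross-validation criterion CV(h) for the NW estimator and a grid search.
import numpy as np

def epanechnikov(u):
    return 0.75 * (1.0 - u**2) * (np.abs(u) <= 1.0)

def cv_criterion(x, y, h):
    """CV(h) = (1/n) * sum_i (y_i - m_tilde_{-i}(x_i))^2."""
    W = epanechnikov((x[:, None] - x[None, :]) / h)   # W[i, j] = k((x_j - x_i)/h)
    np.fill_diagonal(W, 0.0)                          # leave observation i out
    m_tilde = W @ y / W.sum(axis=1)                   # leave-one-out NW fits
    return np.mean((y - m_tilde) ** 2)

# Grid search (x, y are data arrays; the grid endpoints are illustrative)
# h_grid = np.linspace(0.1, 2.0, 40)
# h_hat = min(h_grid, key=lambda h: cv_criterion(x, y, h))
```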

Asymptotic Distribution Theory

Theorem. Let $\hat{m}(x)$ denote either the NW or the LL estimator of $m(x)$. If $x$ is interior to the support of $x_i$ and $f(x) > 0$, then as $n \to \infty$ and $h \to 0$ such that $nh \to \infty$,
$$\sqrt{nh}\,\bigl(\hat{m}(x) - m(x) - h^2 \sigma_k^2 B(x)\bigr) \xrightarrow{d} N\!\left(0,\; \frac{R(k)\, \sigma^2(x)}{f(x)}\right),$$
so that, approximately,
$$\hat{m}(x) \sim N\!\left(m(x) + h^2 \sigma_k^2 B(x),\; \frac{R(k)\, \sigma^2(x)}{nh\, f(x)}\right),$$
where
$$\sigma^2(x) = E[e_i^2 \mid x_i = x], \qquad \sigma_k^2 = \int_{-\infty}^{\infty} u^2 k(u)\, du, \qquad R(k) = \int_{-\infty}^{\infty} k(u)^2\, du.$$

Figure 3: Cross-validation criteria, NW and LL estimators.

Figure 4: NW and LL estimates using data-dependent CV bandwidths.

The asymptotic bias terms for the NW and LL estimators are
$$B_{NW}(x) = \frac{1}{2} m''(x) + f(x)^{-1} f'(x)\, m'(x), \qquad B_{LL}(x) = \frac{1}{2} m''(x).$$

Remarks:
- The asymptotic variances of the NW and LL estimators are the same, but their biases differ.
- $\hat{m}(x)$ converges at rate $\sqrt{nh}$ instead of the usual CLT rate of $\sqrt{n}$. Because $h \to 0$ as $n \to \infty$, $\sqrt{nh}$ diverges more slowly than $\sqrt{n}$. Hence, nonparametric estimators converge more slowly to their asymptotic distributions than parametric estimators.
- $\hat{m}(x)$ has an asymptotic bias term $h^2 \sigma_k^2 B(x)$, which depends on $\sigma_k^2$, $m'(x)$, $m''(x)$, $f(x)$, and $f'(x)$.

- The asymptotic bias decreases and the asymptotic variance increases as $h \to 0$.
- $B_{NW}(x)$ depends on both $m'(x)$ and $f'(x)$, whereas $B_{LL}(x)$ depends only on $m''(x)$.
- $B_{NW}(x) = B_{LL}(x) = 0$ if $m(x)$ is constant (i.e., $m'(x) = m''(x) = 0$).
- $B_{LL}(x)$ is typically smaller than $B_{NW}(x)$.

Estimating Asymptotic Standard Errors

The asymptotic distribution theory gives the result
$$\mathrm{avar}(\hat{m}(x)) = \frac{R(k)\, \sigma^2(x)}{nh\, f(x)}.$$
The known quantities are $n$, $h$, and $R(k)$. The unknown quantities are $\sigma^2(x) = E[e_i^2 \mid x_i = x]$ and $f(x)$. An estimate of $\mathrm{avar}(\hat{m}(x))$ uses estimates of $\sigma^2(x)$ and $f(x)$:
$$\widehat{\mathrm{avar}}(\hat{m}(x)) = \frac{R(k)\, \hat{\sigma}^2(x)}{nh\, \hat{f}(x)}, \qquad \widehat{\mathrm{se}}(\hat{m}(x)) = \sqrt{\frac{R(k)\, \hat{\sigma}^2(x)}{nh\, \hat{f}(x)}}.$$
Question: How to estimate $\sigma^2(x)$ and $f(x)$?

Nonparametric Estimation of $\sigma^2(x) = E[e_i^2 \mid x_i = x]$ and $f(x)$

A nonparametric estimate of $\sigma^2(x)$ has the form
$$\hat{\sigma}^2(x) = \frac{\sum_{i=1}^{n} k\!\left(\frac{x_i - x}{h}\right) \tilde{e}_i^2}{\sum_{i=1}^{n} k\!\left(\frac{x_i - x}{h}\right)},$$
where $\tilde{e}_i$ is the jackknife residual. A nonparametric (kernel density) estimate of $f(x)$ has the form
$$\hat{f}(x) = \frac{1}{nh} \sum_{i=1}^{n} k\!\left(\frac{x_i - x}{h}\right).$$
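
A sketch of the plug-in standard error built from these two estimates, using the Epanechnikov value $R(k) = 3/5$; e_tilde denotes the jackknife residuals from the earlier sketch, and all names are illustrative:

```python
# Sketch: plug-in asymptotic standard error of the kernel regression estimate at x0.
import numpy as np

def epanechnikov(u):
    return 0.75 * (1.0 - u**2) * (np.abs(u) <= 1.0)

R_K = 3.0 / 5.0   # R(k) for the Epanechnikov kernel

def nw_std_error(x0, x, y, e_tilde, h):
    """se_hat(m_hat(x0)) = sqrt(R(k) * sigma2_hat(x0) / (n * h * f_hat(x0)))."""
    n = len(y)
    w = epanechnikov((x - x0) / h)
    sigma2_hat = np.dot(w, e_tilde**2) / w.sum()   # kernel estimate of sigma^2(x0)
    f_hat = w.sum() / (n * h)                      # kernel density estimate of f(x0)
    return np.sqrt(R_K * sigma2_hat / (n * h * f_hat))
```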

Extension to Multiple Regression

$$y_i = E[y_i \mid x_i = x] + e_i = m(x) + e_i, \qquad x = (x_1, \ldots, x_d)'.$$
For any vector $x$ and observation $i$, define the (product) kernel weights and the bandwidth vector
$$k_i(x) = k\!\left(\frac{x_{1i} - x_1}{h_1}\right) k\!\left(\frac{x_{2i} - x_2}{h_2}\right) \cdots k\!\left(\frac{x_{di} - x_d}{h_d}\right), \qquad h = (h_1, \ldots, h_d)'.$$

Nonparametric Estimators

Multivariate NW estimator:
$$\hat{m}(x) = \frac{\sum_{i=1}^{n} k_i(x)\, y_i}{\sum_{i=1}^{n} k_i(x)}.$$
Multivariate LL estimator:
$$\begin{pmatrix} \hat{\alpha}(x) \\ \hat{\beta}(x) \end{pmatrix} = \left(\sum_{i=1}^{n} k_i(x)\, z_i(x) z_i(x)'\right)^{-1} \sum_{i=1}^{n} k_i(x)\, z_i(x)\, y_i = (Z'KZ)^{-1} Z'Ky, \qquad z_i(x) = \begin{pmatrix} 1 \\ x_i - x \end{pmatrix}.$$
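
A sketch of the multivariate NW estimator with a product kernel and bandwidth vector; the LL version replaces the intercept-only fit with a weighted regression of $y_i$ on $(1, (x_i - x)')'$. Names are illustrative:

```python
# Sketch: multivariate NW estimator with a product kernel and bandwidth vector h.
import numpy as np

def epanechnikov(u):
    return 0.75 * (1.0 - u**2) * (np.abs(u) <= 1.0)

def nw_multivariate(x0, X, y, h, kernel=epanechnikov):
    """X: (n, d) regressor matrix; x0 and h: length-d vectors."""
    # Product kernel: k_i(x0) = prod_j k((X[i, j] - x0[j]) / h[j])
    w = kernel((X - x0) / h).prod(axis=1)
    s = w.sum()
    return np.dot(w, y) / s if s > 0 else np.nan
```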

Remarks:
- Finding the cross-validation bandwidth vector $\hat{h} = \arg\min_{h} CV(h)$ is a cumbersome numerical problem if $d$ is large.
- Asymptotic distribution theory is similar to the univariate case, with one important difference: the rate of convergence to the asymptotic normal distribution depends on the dimension of $x$. The higher $d$ is, the slower the convergence rate. This is called the curse of dimensionality and is a major problem in nonparametric regression.