Nonparametric Regression Härdle, Müller, Sperlich, Werwatz, 1995, Nonparametric and Semiparametric Models, An Introduction


Härdle, Müller, Sperlich, Werwatz, 1995, Nonparametric and Semiparametric Models, An Introduction. Tine Buch-Kromann

Univariate Kernel Regression
The relationship between two variables X and Y: $Y = m(X)$, where $m(\cdot)$ is a function.
Model: $y_i = m(x_i) + \varepsilon_i$, $i = 1, \ldots, n$, (1)
$E(Y \mid X = x) = m(x)$. (2)
(1): $Y = m(X)$ does not need to hold exactly for the $i$-th observation; error term $\varepsilon$.
(2): The relationship holds on average.

Univariate Kernel Regression
Goal: Estimate $m(\cdot)$ on the basis of the observations $(x_i, y_i)$, $i = 1, \ldots, n$.
In a parametric approach: $m(x) = \alpha + \beta x$, and then estimate $\alpha$ and $\beta$.
In the nonparametric approach: no prior restrictions on $m(\cdot)$.

Conditional Expectation (repetition)
Let X and Y be two random variables with joint pdf $f(x, y)$. Then the conditional expectation of Y given that $X = x$ is
$E(Y \mid X = x) = \int y f(y \mid x)\, dy = \int y \frac{f(x, y)}{f_X(x)}\, dy = m(x)$
Note: $m(x)$ might be quite nonlinear.

Example
Consider the joint pdf $f(x, y) = x + y$ for $0 \le x \le 1$ and $0 \le y \le 1$, with $f(x, y) = 0$ elsewhere, and marginal pdf
$f_X(x) = \int_0^1 f(x, y)\, dy = x + \tfrac{1}{2}$ for $0 \le x \le 1$, with $f_X(x) = 0$ elsewhere.
Then
$E(Y \mid X = x) = \int y \frac{f(x, y)}{f_X(x)}\, dy = \int_0^1 y\, \frac{x + y}{x + \tfrac{1}{2}}\, dy = \frac{\tfrac{1}{2} x + \tfrac{1}{3}}{x + \tfrac{1}{2}} = m(x),$
which is nonlinear.

Design
Random design: observations $(X_i, Y_i)$, $i = 1, \ldots, n$, from a bivariate distribution $f(x, y)$. The density $f_X(x)$ is unknown.
Fixed design: we control the predictor variable X, so Y is the only random variable (e.g. the beer example). The density $f_X(x)$ is known.

Kernel Regression
Random design: the Nadaraya-Watson estimator
$\hat{m}_h(x) = \frac{n^{-1} \sum_{i=1}^n K_h(x - X_i) Y_i}{n^{-1} \sum_{j=1}^n K_h(x - X_j)}$
Rewrite the Nadaraya-Watson estimator:
$\hat{m}_h(x) = \frac{1}{n} \sum_{i=1}^n \left( \frac{K_h(x - X_i)}{n^{-1} \sum_{j=1}^n K_h(x - X_j)} \right) Y_i = \frac{1}{n} \sum_{i=1}^n W_{hi}(x) Y_i$
A weighted (local) average of the $Y_i$ (note: $\frac{1}{n} \sum_{i=1}^n W_{hi}(x) = 1$).
h determines the degree of smoothness.
What happens if the denominator of $W_{hi}(x)$ is equal to 0? Then the numerator is also equal to 0, and the estimator is not defined. This happens in regions of sparse data.
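As a concrete illustration of the weighted-average form above, here is a minimal numerical sketch of the Nadaraya-Watson estimator. The Gaussian kernel, the function name and the simulated data are illustrative choices, not part of the original slides.

```python
import numpy as np

def nadaraya_watson(x_grid, X, Y, h):
    """Nadaraya-Watson estimate: sum_i K_h(x - X_i) Y_i / sum_j K_h(x - X_j),
    here with a Gaussian kernel K_h(u) = phi(u/h)/h."""
    u = (x_grid[:, None] - X[None, :]) / h
    K = np.exp(-0.5 * u**2) / (np.sqrt(2 * np.pi) * h)   # K_h(x - X_i)
    denom = K.sum(axis=1)
    # where the denominator is 0 the estimator is not defined (sparse data)
    return np.where(denom > 0, K @ Y / np.where(denom > 0, denom, 1.0), np.nan)

# toy example: m(x) = sin(2*pi*x) with noise
rng = np.random.default_rng(0)
X = rng.uniform(0, 1, 200)
Y = np.sin(2 * np.pi * X) + rng.normal(0, 0.3, size=200)
m_hat = nadaraya_watson(np.linspace(0, 1, 101), X, Y, h=0.05)
```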

Kernel Regression

Kernel Regression
Fixed design: $f_X(x)$ is known. Weights of the form
$W_{hi}^{FD}(x) = \frac{K_h(x - x_i)}{f_X(x)}$
Simpler structure and therefore easier to analyse.
One particular fixed design kernel regression estimator is the Gasser-Müller estimator. For ordered design points $x_{(i)}$, $i = 1, \ldots, n$, from $[a, b]$:
$W_{hi}^{GM}(x) = n \int_{s_{i-1}}^{s_i} K_h(x - u)\, du, \quad \text{where } s_i = \frac{x_{(i)} + x_{(i+1)}}{2},\ s_0 = a,\ s_{n+1} = b.$
Note that $W_{hi}^{GM}(x)$ sums to 1.

Statistical Properties
Is the kernel regression estimator consistent?
Theorem 4.1: Consistency (Nadaraya-Watson)
Assume the univariate random design model and the regularity conditions $\int |K(u)|\, du < \infty$, $u K(u) \to 0$ for $|u| \to \infty$, $EY^2 < \infty$. Suppose also $h \to 0$, $nh \to \infty$. Then
$\frac{1}{n} \sum_{i=1}^n W_{hi}(x) Y_i = \hat{m}_h(x) \stackrel{P}{\longrightarrow} m(x)$
for every x with $f_X(x) > 0$ at which $m(x)$, $f_X(x)$ and $\sigma^2(x) = V(Y \mid X = x)$ are continuous.
Proof idea: Show that the numerator and the denominator of $\hat{m}_h(x)$ converge; then $\hat{m}_h(x)$ converges (Slutsky's theorem).

Statistical Properties
What is the speed of convergence?
Theorem 4.2: Speed of convergence (Gasser-Müller)
Assume the univariate fixed design model and the conditions: $K(\cdot)$ has support $[-1, 1]$ with $K(-1) = K(1) = 0$, $m(\cdot)$ is twice continuously differentiable, $\max_i |x_i - x_{i-1}| = O(n^{-1})$, and $V(\varepsilon_i) = \sigma^2$, $i = 1, \ldots, n$. Then, under $h \to 0$, $nh \to \infty$,
$\mathrm{MSE}\left\{ \frac{1}{n} \sum_{i=1}^n W_{hi}^{GM}(x) Y_i \right\} \approx \frac{1}{nh} \sigma^2 \|K\|_2^2 + \frac{h^4}{4} \mu_2^2(K) \{m''(x)\}^2$
Note: AMSE = variance + bias², i.e. the usual bias-variance trade-off.

Statistical Properties
What is the speed of convergence in the random design?
Theorem 4.3: Speed of convergence (Nadaraya-Watson)
Assume the univariate random design model and the conditions $\int |K(u)|\, du < \infty$, $u K(u) \to 0$ for $|u| \to \infty$, $EY^2 < \infty$. Suppose also $h \to 0$, $nh \to \infty$. Then
$\mathrm{MSE}(\hat{m}_h(x)) \approx \frac{1}{nh} \frac{\sigma^2(x)}{f_X(x)} \|K\|_2^2 + \frac{h^4}{4} \mu_2^2(K) \left\{ m''(x) + 2 \frac{m'(x) f_X'(x)}{f_X(x)} \right\}^2$
for every x with $f_X(x) > 0$ at which $m'(x)$, $m''(x)$, $f_X(x)$, $f_X'(x)$ and $\sigma^2(x) = V(Y \mid X = x)$ are continuous.
Proof idea: Rewrite the Nadaraya-Watson estimator as $\hat{m}_h(x) = \hat{r}_h(x) / \hat{f}_h(x)$.

Statistical Properties
Asymptotic MSE:
$\mathrm{AMSE}(n, h) = \frac{1}{nh} C_1 + h^4 C_2$
Minimizing with respect to h gives the optimal bandwidth $h_{opt} \sim n^{-1/5}$.
The rate of convergence of the AMSE is of order $O(n^{-4/5})$: slower than the rate obtained by LS estimation in linear regression, but the same as in nonparametric density estimation.

Local Polynomial Regression
The Nadaraya-Watson estimator is a special case of local polynomial regression: a local constant least squares fit.
To motivate the local polynomial fit, look at the Taylor expansion of the unknown conditional expectation function,
$m(t) \approx m(x) + m'(x)(t - x) + \cdots + \frac{m^{(p)}(x)(t - x)^p}{p!}$
for t in a neighborhood of x.
Idea: Fit a polynomial in a neighborhood of x (called local polynomial regression),
$\min_{\beta} \sum_{i=1}^n \underbrace{\{Y_i - \beta_0 - \beta_1 (X_i - x) - \cdots - \beta_p (X_i - x)^p\}^2}_{\text{polynomial}}\ \underbrace{K_h(x - X_i)}_{\text{local}}$
where $\beta = (\beta_0, \beta_1, \ldots, \beta_p)^T$.

Local Polynomial Regression
The result is a weighted least squares estimator
$\hat{\beta}(x) = (\mathbf{X}^T \mathbf{W} \mathbf{X})^{-1} \mathbf{X}^T \mathbf{W} \mathbf{Y}$
where
$\mathbf{X} = \begin{pmatrix} 1 & X_1 - x & \cdots & (X_1 - x)^p \\ \vdots & \vdots & & \vdots \\ 1 & X_n - x & \cdots & (X_n - x)^p \end{pmatrix}, \quad \mathbf{Y} = \begin{pmatrix} Y_1 \\ \vdots \\ Y_n \end{pmatrix}$
and
$\mathbf{W} = \begin{pmatrix} K_h(x - X_1) & 0 & \cdots & 0 \\ 0 & K_h(x - X_2) & \cdots & 0 \\ \vdots & & \ddots & \vdots \\ 0 & 0 & \cdots & K_h(x - X_n) \end{pmatrix}$
Note: This estimator varies with x (in contrast to parametric least squares).

Local Polynomial Regression
Denote the components of $\hat{\beta}(x)$ by $\hat{\beta}_0(x), \ldots, \hat{\beta}_p(x)$. Then the local polynomial estimator of the regression function m is
$\hat{m}_{p,h}(x) = \hat{\beta}_0(x)$
because $m(x) \approx \beta_0(x)$ (recall the Taylor expansion).

Local Polynomial Regression
Local constant estimator (Nadaraya-Watson): p = 0. Then
$\min_{\beta} \sum_{i=1}^n \{Y_i - \beta_0\}^2 K_h(x - X_i)$
and
$\hat{m}_{0,h}(x) = \frac{\sum_{i=1}^n K_h(x - X_i) Y_i}{\sum_{i=1}^n K_h(x - X_i)}$

Local Polynomial Regression
Local linear estimator: p = 1. Then
$\min_{\beta} \sum_{i=1}^n \{Y_i - \beta_0 - \beta_1 (X_i - x)\}^2 K_h(x - X_i)$
with
$\hat{\beta}(x) = \begin{pmatrix} S_{h,0}(x) & S_{h,1}(x) \\ S_{h,1}(x) & S_{h,2}(x) \end{pmatrix}^{-1} \begin{pmatrix} T_{h,0}(x) \\ T_{h,1}(x) \end{pmatrix}$
and
$\hat{m}_{1,h}(x) = \hat{\beta}_0(x) = \frac{T_{h,0}(x) S_{h,2}(x) - T_{h,1}(x) S_{h,1}(x)}{S_{h,0}(x) S_{h,2}(x) - S_{h,1}^2(x)}$
where
$S_{h,j}(x) = \sum_{i=1}^n K_h(x - X_i)(X_i - x)^j, \qquad T_{h,j}(x) = \sum_{i=1}^n K_h(x - X_i)(X_i - x)^j Y_i$

Local Polynomial Regression
Local polynomial estimator: the general formula for a polynomial of order p,
$\min_{\beta} \sum_{i=1}^n \{Y_i - \beta_0 - \beta_1 (X_i - x) - \cdots - \beta_p (X_i - x)^p\}^2 K_h(x - X_i)$
Then
$\hat{\beta}(x) = \begin{pmatrix} S_{h,0}(x) & S_{h,1}(x) & \cdots & S_{h,p}(x) \\ S_{h,1}(x) & S_{h,2}(x) & \cdots & S_{h,p+1}(x) \\ \vdots & \vdots & & \vdots \\ S_{h,p}(x) & S_{h,p+1}(x) & \cdots & S_{h,2p}(x) \end{pmatrix}^{-1} \begin{pmatrix} T_{h,0}(x) \\ T_{h,1}(x) \\ \vdots \\ T_{h,p}(x) \end{pmatrix}$
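A short sketch of the weighted least squares computation of $\hat{\beta}(x)$ described above; the quartic kernel and the helper name `local_polynomial` are illustrative assumptions, not from the text.

```python
import numpy as np

def quartic_kernel(u):
    # K(u) = 15/16 (1 - u^2)^2 for |u| <= 1, 0 otherwise
    return np.where(np.abs(u) <= 1, 15 / 16 * (1 - u**2) ** 2, 0.0)

def local_polynomial(x0, X, Y, h, p=1):
    """Return beta_hat(x0) = (X'WX)^{-1} X'WY; beta_hat[0] is m_hat_{p,h}(x0),
    and nu! * beta_hat[nu] estimates the nu-th derivative of m at x0."""
    w = quartic_kernel((x0 - X) / h) / h                 # K_h(x0 - X_i)
    Xd = np.vander(X - x0, N=p + 1, increasing=True)     # columns 1, (X_i-x0), ..., (X_i-x0)^p
    sw = np.sqrt(w)
    beta, *_ = np.linalg.lstsq(Xd * sw[:, None], Y * sw, rcond=None)
    return beta
```

With p = 0 this reduces to the Nadaraya-Watson estimator, and with p = 1 to the local linear estimator of the previous slide.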

Local Constant Regression
n = 7125 observations of net income and food expenditure from the 1973 Family Expenditure Survey of British households.

Local Linear Regression
Differences between local constant and local linear regression appear at the boundaries. Later we will see that the local linear fit is influenced less by outliers.

Local Linear Regression
Asymptotic MSE for the local linear regression estimator:
$\mathrm{AMSE}(\hat{m}_{1,h}(x)) = \frac{1}{nh} \frac{\sigma^2(x)}{f_X(x)} \|K\|_2^2 + \frac{h^4}{4} \{m''(x)\}^2 \mu_2^2(K)$
Note: the bias is design-independent, and the bias disappears when $m(\cdot)$ is linear.

Local Linear Regression
Compare the AMSE of the local constant, the Gasser-Müller and the local linear estimators:
$\mathrm{AMSE}(\hat{m}_{0,h}(x)) = \frac{1}{nh} \frac{\sigma^2(x)}{f_X(x)} \|K\|_2^2 + \frac{h^4}{4} \mu_2^2(K) \left\{ m''(x) + 2 \frac{m'(x) f_X'(x)}{f_X(x)} \right\}^2$
$\mathrm{AMSE}(\hat{m}_{GM,h}(x)) = \frac{1}{nh} \sigma^2 \|K\|_2^2 + \frac{h^4}{4} \mu_2^2(K) \{m''(x)\}^2$
$\mathrm{AMSE}(\hat{m}_{1,h}(x)) = \frac{1}{nh} \frac{\sigma^2(x)}{f_X(x)} \|K\|_2^2 + \frac{h^4}{4} \mu_2^2(K) \{m''(x)\}^2$
LC uses one parameter less than LL without reducing the asymptotic variance, but the extra parameter enables LL to reduce the bias. The LL fit improves the estimation at the boundary compared to LC. GM improves the bias compared to LC.

Local Linear Regression
Remarks:
LL is a weighted local average of the response variables $Y_i$ (as is LC).
h determines the degree of smoothness of $\hat{m}_{p,h}$: for $h \to 0$, $\hat{m}_{p,h}(X_i) \to Y_i$ at the observations $X_i$; when $h \to \infty$ all weights are equal, i.e. we obtain a parametric p-th order polynomial fit.
Derivatives of m: the natural approach is to estimate m by $\hat{m}$ and compute the derivative $\hat{m}'$. An alternative and more efficient method is the polynomial derivative estimator
$\hat{m}_{p,h}^{(\nu)}(x) = \nu!\, \hat{\beta}_\nu(x)$
for the $\nu$-th derivative of $m(\cdot)$. Usually the order of the polynomial is $p = \nu + 1$ or $p = \nu + 3$.

Nearest-Neighbor Estimator
Other nonparametric smoothing techniques differ from the kernel method.
Kernel regression: weighted averages of the $Y_i$ in a fixed neighborhood (governed by h) around x.
k-NN: weighted averages of the $Y_i$ in a neighborhood around x whose width is variable, using the k nearest observations to x:
$\hat{m}_k(x) = \frac{1}{n} \sum_{i=1}^n W_{ki}(x) Y_i \quad \text{with} \quad W_{ki}(x) = \begin{cases} n/k & \text{if } i \in J_x \\ 0 & \text{otherwise} \end{cases}$
and the set of indices
$J_x = \{i : X_i \text{ is one of the } k \text{ nearest observations to } x\}$

Nearest-Neighbor Estimator
Remark: when we estimate $m(\cdot)$ at a point x with sparse data, the k nearest neighbors are rather far away from x.
k is the smoothing parameter: increasing k makes the estimate smoother.
The k-NN estimator is the kernel estimator with uniform kernel $K(u) = \tfrac{1}{2}\, \mathbb{1}(|u| \le 1)$ and variable bandwidth $R = R(k)$, with $R(k)$ the distance between x and its furthest k-th nearest neighbor:
$\hat{m}_k(x) = \frac{\sum_{i=1}^n K_R(x - X_i) Y_i}{\sum_{i=1}^n K_R(x - X_i)}$
The k-NN estimator can be generalised by considering other kernel functions.

Nearest-Neighbor Estimator
Bias and variance: let $k \to \infty$, $k/n \to 0$ and $n \to \infty$. Then
$E\{\hat{m}_k(x)\} - m(x) \approx \frac{\mu_2(K)}{8 f_X(x)^2} \left\{ m''(x) + 2 \frac{m'(x) f_X'(x)}{f_X(x)} \right\} \left( \frac{k}{n} \right)^2$
$V\{\hat{m}_k(x)\} \approx \frac{2 \|K\|_2^2\, \sigma^2(x)}{k}$
Remark: the variance does not depend on $f_X(x)$ (unlike LC), because k-NN always averages over k observations. k-NN is approximately identical to LC with bandwidth h with respect to MSE if we choose $k = 2 n h f_X(x)$.
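A minimal sketch of the k-NN smoother at a single point (uniform-kernel version), assuming a scalar X; the function name is illustrative.

```python
import numpy as np

def knn_regression(x0, X, Y, k):
    """Average of the Y_i whose X_i are among the k observations nearest to x0."""
    J_x = np.argsort(np.abs(X - x0))[:k]   # indices of the k nearest neighbors
    return Y[J_x].mean()
```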

Nearest-Neighbor Estimator

Median Smoothing
Goal: Estimate the conditional median function
$\hat{m}(x) = \mathrm{med}\{Y_i : i \in J_x\}$
where
$J_x = \{i : X_i \text{ is one of the } k \text{ nearest observations to } x\}$
i.e. the median of the $Y_i$'s for which $X_i$ is one of the k nearest neighbors of x.
$\mathrm{med}(Y \mid X = x)$ is more robust to outliers than $E(Y \mid X = x)$.
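The corresponding median smoother only changes the local summary from mean to median; again a hedged sketch with illustrative names.

```python
import numpy as np

def median_smoother(x0, X, Y, k):
    """Median of the Y_i whose X_i are among the k observations nearest to x0."""
    J_x = np.argsort(np.abs(X - x0))[:k]
    return np.median(Y[J_x])
```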

Median Smoothing

Spline Smoothing
Consider the residual sum of squares (RSS) as a criterion for the goodness of fit of $m(\cdot)$:
$\mathrm{RSS} = \sum_{i=1}^n \{Y_i - m(X_i)\}^2$
Minimizing the RSS alone leads to interpolating the data: $m(X_i) = Y_i$, $i = 1, \ldots, n$.
Solution: add a stabilizer that penalizes non-smoothness of $m(\cdot)$,
$\|m''\|_2^2 = \int \{m''(x)\}^2\, dx$

Spline Smoothing
Minimization problem:
$\hat{m}_\lambda = \arg\min_m S_\lambda(m), \quad \text{where } S_\lambda(m) = \sum_{i=1}^n \{Y_i - m(X_i)\}^2 + \lambda \|m''\|_2^2$
For the class of all twice differentiable functions on $[a, b] = [X_{(1)}, X_{(n)}]$, the unique minimizer is the cubic spline, which consists of cubic polynomials
$p_i(x) = \alpha_i + \beta_i x + \gamma_i x^2 + \delta_i x^3, \quad i = 1, \ldots, n - 1,$
between $X_{(i)}$ and $X_{(i+1)}$.
$\lambda$ controls the weight given to the stabilizer: a high $\lambda$ puts more weight on $\|m''\|_2^2$ and gives a smooth estimate; $\lambda \to 0$ gives $\hat{m}_\lambda(\cdot) \to$ interpolation; $\lambda \to \infty$ gives $\hat{m}_\lambda(\cdot) \to$ a linear function.

Spline Smoothing
The estimator must be twice continuously differentiable:
$p_i(X_{(i)}) = p_{i-1}(X_{(i)})$ (no jumps in the function),
$p_i'(X_{(i)}) = p_{i-1}'(X_{(i)})$ (no jumps in the first derivative),
$p_i''(X_{(i)}) = p_{i-1}''(X_{(i)})$ (no jumps in the second derivative),
$p_1''(X_{(1)}) = p_{n-1}''(X_{(n)}) = 0$ (boundary condition).
These restrictions and the conditions for minimizing $S_\lambda(m)$ with respect to the coefficients of the $p_i$ define a system of linear equations which can be solved in $O(n)$ calculations.

Spline Smoothing
The spline smoother is a linear estimator in the $Y_i$:
$\hat{m}_\lambda(x) = \frac{1}{n} \sum_{i=1}^n W_{\lambda,i}(x) Y_i$
The spline smoother is asymptotically equivalent to a kernel smoother (under certain conditions) with the spline kernel
$K_S(u) = \frac{1}{2} \exp\left( -\frac{|u|}{\sqrt{2}} \right) \sin\left( \frac{|u|}{\sqrt{2}} + \frac{\pi}{4} \right)$
and local bandwidth $h(X_i) = \lambda^{1/4} n^{-1/4} f_X(X_i)^{-1/4}$.
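A minimal discrete sketch of the penalized least-squares idea behind $S_\lambda(m)$: it replaces $\int \{m''\}^2$ by squared second differences of the fitted values at the sorted design points (a Whittaker-type smoother), so it only approximates the exact cubic-spline solution; names and data layout are assumptions.

```python
import numpy as np

def discrete_spline_smoother(x, y, lam):
    """Minimize sum (y_i - m_i)^2 + lam * sum (second differences of m)^2
    over the fitted values m at the sorted design points."""
    order = np.argsort(x)
    y_sorted = y[order]
    n = len(y_sorted)
    D = np.diff(np.eye(n), n=2, axis=0)              # (n-2) x n second-difference matrix
    m_fit = np.linalg.solve(np.eye(n) + lam * D.T @ D, y_sorted)
    return x[order], m_fit
```

As in the slides, $\lambda \to 0$ interpolates the data, while a very large $\lambda$ forces the second differences towards zero, i.e. an (approximately) linear fit.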

Spline Smoothing

Orthogonal Series
Suppose that $m(\cdot)$ can be represented by a Fourier series
$m(x) = \sum_{j=0}^{\infty} \beta_j \varphi_j(x)$
where $\{\varphi_j(x)\}_{j=0}^{\infty}$ is a known basis of functions and $\{\beta_j\}_{j=0}^{\infty}$ are the unknown Fourier coefficients.
Goal: estimate the unknown Fourier coefficients. We cannot estimate an infinite number of coefficients from a finite number of observations; therefore choose the number N of terms to include in the Fourier series representation.

Orthogonal Series
Estimation proceeds in three steps: select a basis of functions, select an integer N < n, and estimate the N unknown coefficients.
N is the smoothing parameter: large N, many terms in the Fourier series, interpolation; small N, few terms, smooth estimates.
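A sketch of the three steps with a cosine basis on [0, 1] and the coefficients estimated by ordinary least squares; the basis choice and names are assumptions, not from the slides.

```python
import numpy as np

def cosine_design(x, N):
    """Basis functions phi_0(x) = 1, phi_j(x) = sqrt(2) cos(j*pi*x), j = 1..N."""
    Phi = np.sqrt(2) * np.cos(np.pi * np.outer(x, np.arange(N + 1)))
    Phi[:, 0] = 1.0
    return Phi

def series_regression(x_grid, X, Y, N):
    # step 3: estimate the N+1 Fourier coefficients by least squares
    beta, *_ = np.linalg.lstsq(cosine_design(X, N), Y, rcond=None)
    return cosine_design(x_grid, N) @ beta
```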

Orthogonal Series

Smoothing Parameter Selection
How can we choose the smoothing parameter of the kernel regression estimators? Which criteria do we have that measure how close the estimate is to the true curve?

Smoothing Parameter Selection
Mean Squared Error (MSE):
$\mathrm{MSE}(\hat{m}_h(x)) = E[\{\hat{m}_h(x) - m(x)\}^2]$
$m(x)$ is an unknown constant; $\hat{m}_h(x)$ is a random variable, i.e. E is taken with respect to this random variable. $\hat{m}_h(x)$ is a function of $\{X_i, Y_i\}_{i=1}^n$, therefore
$\mathrm{MSE}(\hat{m}_h(x)) = \int \cdots \int \{\hat{m}_h(x) - m(x)\}^2 f(x_1, \ldots, x_n, y_1, \ldots, y_n)\, dx_1 \cdots dx_n\, dy_1 \cdots dy_n$
A local measure. The AMSE-optimal bandwidth is a complicated function of $m''(x)$ and $\sigma^2(x)$.

Smoothing Parameter Selection
Integrated Squared Error (ISE):
$\mathrm{ISE}(\hat{m}_h) = \int \{\hat{m}_h(x) - m(x)\}^2 w(x)\, f_X(x)\, dx$
A global measure. The ISE is a random variable: different samples give different values of $\hat{m}_h(x)$ and hence of the ISE.
Weight function $w(x)$: used to assign less weight to observations in regions of sparse data (to reduce variance) or in the tails (to reduce boundary effects).

Smoothing Parameter Selection
Mean Integrated Squared Error (MISE):
$\mathrm{MISE}(\hat{m}_h) = E\{\mathrm{ISE}(\hat{m}_h)\} = \int \cdots \int \left[ \int \{\hat{m}_h(x) - m(x)\}^2 w(x)\, f_X(x)\, dx \right] f(x_1, \ldots, x_n, y_1, \ldots, y_n)\, dx_1 \cdots dx_n\, dy_1 \cdots dy_n$
The expected value of the ISE.

Smoothing Parameter Selection
Averaged Squared Error (ASE):
$\mathrm{ASE}(\hat{m}_h) = \frac{1}{n} \sum_{j=1}^n \{\hat{m}_h(X_j) - m(X_j)\}^2 w(X_j)$
A global measure and a random variable; a discrete approximation to the ISE.

Smoothing Parameter Selection
Mean Averaged Squared Error (MASE):
$\mathrm{MASE}(\hat{m}_h) = E\{\mathrm{ASE} \mid X_1 = x_1, \ldots, X_n = x_n\} = \frac{1}{n} \sum_{j=1}^n E[\{\hat{m}_h(X_j) - m(X_j)\}^2 \mid X_1 = x_1, \ldots, X_n = x_n]\, w(X_j)$
The conditional expectation of the ASE. If we view $X_1, \ldots, X_n$ as random variables, then MASE is a random variable. MASE is an asymptotic version of the AMISE.

Smoothing Parameter Selection
Which measure should we use to derive a bandwidth selection method? A natural choice is MISE/AMISE, as used in the density case, but here there are more unknown quantities than in the density case.
Bandwidth selection methods: cross-validation, penalty terms, plug-in methods (Fan & Gijbels, 1996).
For the NW estimator, ASE, ISE and MISE lead asymptotically to the same amount of smoothing. Therefore use the easiest: ASE (the discrete ISE).

Smoothing Parameter Selection
Averaged Squared Error:
$\mathrm{ASE}(\hat{m}_h) = \frac{1}{n} \sum_{i=1}^n \{\hat{m}_h(X_i) - m(X_i)\}^2 w(X_i)$
and its conditional expectation, the MASE:
$\mathrm{MASE}(h) = E\{\mathrm{ASE}(h) \mid X_1 = x_1, \ldots, X_n = x_n\} = \frac{1}{n} \sum_{i=1}^n E[\{\hat{m}_h(X_i) - m(X_i)\}^2 \mid X_1 = x_1, \ldots, X_n = x_n]\, w(X_i)$

Smoothing Parameter Selection
Bias² (thin solid line), variance (dashed line), MASE (thick line). Simulated data set: $m(x) = \{\sin(2\pi x^3)\}^3$, $Y_i = m(X_i) + \varepsilon_i$, $X_i \sim U(0, 1)$, $\varepsilon_i \sim N(0, 0.01)$.

Smoothing Parameter Selection
However, MASE and ASE contain $m(\cdot)$, which is unknown.
Resubstitution estimate: replace $m(\cdot)$ with the observations Y,
$p(h) = \frac{1}{n} \sum_{i=1}^n \{Y_i - \hat{m}_h(X_i)\}^2 w(X_i)$
the weighted residual sum of squares (RSS).
Problem: $Y_i$ is used in $\hat{m}_h(X_i)$ to predict itself. Therefore p(h) can be made arbitrarily small by letting $h \to 0$ (interpolation).

Smoothing Parameter Selection
Cross-validation: the leave-one-out estimator
$\hat{m}_{h,-i}(X_i) = \frac{\sum_{j \ne i} K_h(X_i - X_j) Y_j}{\sum_{j \ne i} K_h(X_i - X_j)}$
and the cross-validation function
$CV(h) = \frac{1}{n} \sum_{i=1}^n \{Y_i - \hat{m}_{h,-i}(X_i)\}^2 w(X_i)$
Choose the h that minimizes CV(h).
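A sketch of leave-one-out cross-validation for the Nadaraya-Watson estimator with a Gaussian kernel and w ≡ 1; the grid of candidate bandwidths is an illustrative assumption.

```python
import numpy as np

def cv_bandwidth(X, Y, h_grid):
    """Return the bandwidth in h_grid minimizing CV(h) = mean_i {Y_i - m_hat_{h,-i}(X_i)}^2."""
    d = X[:, None] - X[None, :]                    # pairwise differences X_i - X_j
    cv = []
    for h in h_grid:
        K = np.exp(-0.5 * (d / h) ** 2)            # Gaussian kernel weights (constants cancel)
        np.fill_diagonal(K, 0.0)                   # leave observation i out
        m_loo = K @ Y / K.sum(axis=1)
        cv.append(np.mean((Y - m_loo) ** 2))
    return h_grid[int(np.argmin(cv))]

# e.g. h_cv = cv_bandwidth(X, Y, np.linspace(0.02, 0.5, 25))
```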

Smoothing Parameter Selection
Left: local constant regression estimator (Nadaraya-Watson) with cross-validated bandwidth ($\hat{h}_{CV} = 0.15$), quartic kernel. Right: local linear regression estimator with cross-validated bandwidth ($\hat{h}_{CV} = 0.56$); it is more stable in regions with sparse data and outperforms the LC estimator at the boundaries.

Plug-in method (Fan & Gijbels, 1996)
Bandwidth selection methods: constant (global) bandwidth, or variable bandwidth (reduction of bias in peaked regions and of variance in flat regions).
Local variable bandwidth $h(x_0)$: varies with the location point $x_0$. Global variable bandwidth $h(X_i)$: varies with the data points.
We focus on constant and local variable bandwidths.

Plug-in method (Fan & Gijbels, 1996)
Local bandwidth: obtain the optimal local bandwidth by minimizing the conditional MSE,
$[\mathrm{Bias}\{\hat{m}_\nu(x_0) \mid \mathbf{X}\}]^2 + V\{\hat{m}_\nu(x_0) \mid \mathbf{X}\}$
where $\hat{m}_\nu(\cdot)$ estimates the $\nu$-th derivative of $m(\cdot)$. Using the general expressions for bias and variance, the (local) AMSE-optimal bandwidth is
$h_{opt}^{LOC}(x_0) = C_{\nu,p}(K) \left[ \frac{\sigma^2(x_0)}{\{m^{(p+1)}(x_0)\}^2 f(x_0)} \right]^{1/(2p+3)} n^{-1/(2p+3)}$
where $C_{\nu,p}(K)$ is just a constant,
$C_{\nu,p}(K) = \left[ \frac{(p+1)!^2\, (2\nu + 1) \int K_\nu^2(t)\, dt}{2(p + 1 - \nu) \{\int t^{p+1} K_\nu(t)\, dt\}^2} \right]^{1/(2p+3)}$

Plug-in method (Fan & Gijbels, 1996)
Constant bandwidth: obtain the optimal global bandwidth by minimizing the conditional weighted MISE,
$\int \left( [\mathrm{Bias}\{\hat{m}_\nu(x) \mid \mathbf{X}\}]^2 + V\{\hat{m}_\nu(x) \mid \mathbf{X}\} \right) w(x)\, dx$
where $w(\cdot) \ge 0$ is some weight function. Using (again) the asymptotic expressions of bias and variance, the optimal bandwidth is
$h_{opt}^{GLO} = C_{\nu,p}(K) \left[ \frac{\int \sigma^2(x) w(x) / f(x)\, dx}{\int \{m^{(p+1)}(x)\}^2 w(x)\, dx} \right]^{1/(2p+3)} n^{-1/(2p+3)}$

Plug-in method (Fan & Gijbels, 1996)
The optimal bandwidths $h_{opt}^{LOC}$ and $h_{opt}^{GLO}$ depend on unknown quantities: the design density $f(\cdot)$, the conditional variance $\sigma^2(\cdot) = V(Y \mid X = \cdot)$, and the derivative function $m^{(p+1)}(\cdot)$:
$h_{opt}^{LOC}(x_0) = C_{\nu,p}(K) \left[ \frac{\sigma^2(x_0)}{\{m^{(p+1)}(x_0)\}^2 f(x_0)} \right]^{1/(2p+3)} n^{-1/(2p+3)}, \qquad h_{opt}^{GLO} = C_{\nu,p}(K) \left[ \frac{\int \sigma^2(x) w(x) / f(x)\, dx}{\int \{m^{(p+1)}(x)\}^2 w(x)\, dx} \right]^{1/(2p+3)} n^{-1/(2p+3)}$
Solution: substitute pilot estimates, giving a plug-in bandwidth.

Plug-in method (Fan & Gijbels, 1996)
Rule of thumb for constant bandwidth selection: find pilot estimates of $f(\cdot)$, $\sigma^2(\cdot)$ and $m^{(p+1)}(\cdot)$ by fitting a polynomial of order p + 3 globally to m(x) (a parametric fit):
$\check{m}(x) = \check{\alpha}_0 + \cdots + \check{\alpha}_{p+3} x^{p+3}$
$\check{\sigma}^2$: the standardized RSS from this fit.
$\check{m}^{(p+1)}$: the (p+1)-th derivative of this fit, a quadratic function.

Plug-in method (Fan & Gijbels, 1996)
We want to estimate $m^{(\nu)}(\cdot)$ and we take $w(x) = f(x) w_0(x)$. Substitute the pilot estimates into the optimal constant bandwidth.
Rule-of-thumb bandwidth:
$\check{h}_{ROT} = C_{\nu,p}(K) \left[ \frac{\check{\sigma}^2 \int w_0(x)\, dx}{\sum_{i=1}^n \{\check{m}^{(p+1)}(X_i)\}^2 w_0(X_i)} \right]^{1/(2p+3)}$
where $C_{\nu,p}(K) = \left[ \frac{(p+1)!^2\, (2\nu + 1) \int K_\nu^2(t)\, dt}{2(p + 1 - \nu) \{\int t^{p+1} K_\nu(t)\, dt\}^2} \right]^{1/(2p+3)}$ is a constant (tabulated for different kernels and different values of $\nu$ and p).

Smoothing Parameter Selection
Local linear regression (p = 1) and derivative estimation (p = 2) with the rule-of-thumb bandwidth $\check{h}$.

Confidence Regions
How close is the smoothed curve to the true curve?
Theorem 4.5: Asymptotic distribution of the Nadaraya-Watson estimator
Suppose m and $f_X$ are twice differentiable, $\int |K(u)|^{2+\kappa}\, du < \infty$ for some $\kappa > 0$, x is a continuity point of $\sigma^2(x)$ and $E(|Y|^{2+\kappa} \mid X = x)$, and $f_X(x) > 0$. Take $h = c n^{-1/5}$. Then
$n^{2/5} \{\hat{m}_h(x) - m(x)\} \stackrel{L}{\longrightarrow} N(b_x, v_x^2)$
where
$b_x = c^2 \mu_2(K) \left\{ \frac{m''(x)}{2} + \frac{m'(x) f_X'(x)}{f_X(x)} \right\}, \qquad v_x^2 = \frac{\sigma^2(x) \|K\|_2^2}{c\, f_X(x)}$

Confidence Regions
Suppose the bias is of negligible magnitude compared to the variance, e.g. h sufficiently small. Approximate confidence intervals:
$\left[ \hat{m}_h(x) - z_{1-\frac{\alpha}{2}} \sqrt{\frac{\|K\|_2^2\, \hat{\sigma}^2(x)}{n h\, \hat{f}_h(x)}},\ \hat{m}_h(x) + z_{1-\frac{\alpha}{2}} \sqrt{\frac{\|K\|_2^2\, \hat{\sigma}^2(x)}{n h\, \hat{f}_h(x)}} \right]$
where $z_{1-\frac{\alpha}{2}}$ is the $(1 - \frac{\alpha}{2})$-quantile of the standard normal distribution and the estimate of $\sigma^2(x)$ is
$\hat{\sigma}^2(x) = \frac{1}{n} \sum_{i=1}^n W_{hi}(x) \{Y_i - \hat{m}_h(x)\}^2$
where the $W_{hi}$ are the weights from the Nadaraya-Watson estimator.
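A sketch of these pointwise intervals, assuming a Gaussian kernel (for which $\|K\|_2^2 = 1/(2\sqrt{\pi})$) and ignoring the bias term as stated above; the function name is illustrative.

```python
import numpy as np
from scipy.stats import norm

def nw_confidence_band(x_grid, X, Y, h, alpha=0.05):
    """Pointwise (1 - alpha) confidence intervals for the Nadaraya-Watson estimate."""
    n = len(X)
    K = np.exp(-0.5 * ((x_grid[:, None] - X[None, :]) / h) ** 2) / (np.sqrt(2 * np.pi) * h)
    f_hat = K.mean(axis=1)                              # kernel density estimate f_hat_h(x)
    m_hat = K @ Y / (n * f_hat)                         # Nadaraya-Watson estimate
    W = K / f_hat[:, None]                              # weights W_hi(x)
    sigma2_hat = (W * (Y[None, :] - m_hat[:, None]) ** 2).mean(axis=1)
    K_norm2 = 1 / (2 * np.sqrt(np.pi))                  # ||K||_2^2 for the Gaussian kernel
    half = norm.ppf(1 - alpha / 2) * np.sqrt(K_norm2 * sigma2_hat / (n * h * f_hat))
    return m_hat - half, m_hat + half
```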

Confidence Regions The Nadaraya-Watson kernel regression and 95% confidence intervals, h = 0.2.

Hypothesis Testing
What would we like to test? Is there indeed an impact of X on Y? Is the estimated function $m(\cdot)$ significantly different from a traditional parameterization (e.g. linear or log-linear models)?
The optimal rates of convergence are different for nonparametric estimation and nonparametric testing. Therefore, the choice of smoothing parameter is an issue to be discussed separately.

Hypothesis Testing
Specifying the hypothesis: $H_0$ is the hypothesis, $H_1$ the alternative.
Nonparametric regression:
$H_0$: $m(x) \equiv 0$ (X has no impact on Y; assume EY = 0, otherwise use $\tilde{Y}_i = Y_i - \bar{Y}$).
$H_1$: $m(x) \ne 0$.
Estimate m(x) by the Nadaraya-Watson estimator: $\hat{m}(\cdot) = \hat{m}_h(\cdot)$.

Hypothesis Testing
A measure for the deviation from zero:
$T_1 = n \sqrt{h} \int \{\hat{m}(x) - 0\}^2 w(x)\, dx$
where $w(x)$ is a weight function (to trim boundaries or regions of sparse data). If $w(x) = f_X(x)\, \bar{w}(x)$, then the empirical version of $T_1$ is
$T_2 = \sqrt{h} \sum_{i=1}^n \{\hat{m}(X_i) - 0\}^2 \bar{w}(X_i)$

Hypothesis Testing
Under $H_0$: $T_1 \to 0$ and $T_2 \to 0$ for $n \to \infty$. Under $H_1$: $T_1 \to \infty$ and $T_2 \to \infty$ for $n \to \infty$.
Under $H_0$, $\hat{m}(\cdot)$ does not have any bias (Theorem 4.3, speed of convergence of the NW estimator), and $T_1$ and $T_2$ converge to $N(0, V)$ where
$V = 2 \int \frac{\sigma^4(x)\, w(x)}{f_X^2(x)}\, dx \int (K \star K)^2(x)\, dx$

Hypothesis Testing
General hypothesis:
$H_0$: $m(x) = m_\theta(x)$ (a specific parametric model); $H_1$: $m(x) \ne m_\theta(x)$.
Test statistic, where $\hat{\theta}$ is a consistent estimator:
$\sqrt{h} \sum_{i=1}^n \{\hat{m}(X_i) - m_{\hat{\theta}}(X_i)\}^2 w(X_i)$
Problem: $m_{\hat{\theta}}(\cdot)$ is asymptotically unbiased and converges at rate $\sqrt{n}$, whereas $\hat{m}(x)$ has a kernel smoothing bias and converges at rate $\sqrt{nh}$.

Hypothesis Testing
Solution: introduce an artificial bias by replacing $m_{\hat{\theta}}(\cdot)$ with
$\hat{m}_{\hat{\theta}}(x) = \frac{\sum_{i=1}^n K_h(x - X_i)\, m_{\hat{\theta}}(X_i)}{\sum_{i=1}^n K_h(x - X_i)}$
in the test statistic
$T = \sqrt{h} \sum_{i=1}^n \{\hat{m}(X_i) - \hat{m}_{\hat{\theta}}(X_i)\}^2 w(X_i)$
Now the bias of $\hat{m}(\cdot)$ and $\hat{m}_{\hat{\theta}}(\cdot)$ cancels out, and the same holds for the convergence rates.

Hypothesis Testing
We reject $H_0$ if T is large, but what is large?
Theorem 4.7: Asymptotic distribution of T
Assume the conditions of Theorem 4.3, and further that $\hat{\theta}$ is a $\sqrt{n}$-consistent estimator of $\theta$ and $h = O(n^{-1/5})$. Then
$T - \frac{1}{\sqrt{h}} \|K\|_2^2 \int \sigma^2(x) w(x)\, dx \ \stackrel{D}{\longrightarrow}\ N\left( 0,\ 2 \int \sigma^4(x) w^2(x)\, dx \int (K \star K)^2\, dx \right)$

Hypothesis Testing
Problem: slow convergence of T towards the normal distribution.
Solution: bootstrap. Simulate the distribution of the test statistic under the hypothesis and determine the critical values based on the simulated distribution. The most popular method is the wild bootstrap: resample from the residuals $\hat{\varepsilon}_i$, $i = 1, \ldots, n$, under $H_0$. Each bootstrap residual $\varepsilon_i^*$ is drawn from a distribution that matches $\hat{\varepsilon}_i$ up to the first three moments.

Hypothesis Testing
Wild bootstrap:
1. Estimate $m_{\hat{\theta}}(\cdot)$ under $H_0$ and construct $\hat{\varepsilon}_i = Y_i - m_{\hat{\theta}}(X_i)$.
2. For each $X_i$ draw a bootstrap residual $\varepsilon_i^*$ so that $E(\varepsilon_i^*) = 0$, $E(\varepsilon_i^{*2}) = \hat{\varepsilon}_i^2$, $E(\varepsilon_i^{*3}) = \hat{\varepsilon}_i^3$.
3. Generate a bootstrap sample $\{Y_i^*, X_i\}_{i=1,\ldots,n}$ by setting $Y_i^* = m_{\hat{\theta}}(X_i) + \varepsilon_i^*$.
4. From this sample, calculate the bootstrap test statistic $T^*$.
5. Repeat (2.)-(4.) $n_{boot}$ times and use the $n_{boot}$ generated test statistics $T^*$ to determine the quantiles of the test statistic under $H_0$. This gives approximate critical values for your test statistic T.
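A sketch of steps 1-5 for the simplest parametric null, a linear model, using the two-point ("golden ratio") residual distribution, which is one common way to match the first three moments required in step 2. The helper names, the Gaussian kernel and the choice of $H_0$ are illustrative assumptions.

```python
import numpy as np

def nw_fit(X, Y, h, x_eval):
    """Nadaraya-Watson smoother with a Gaussian kernel (see the earlier sketch)."""
    K = np.exp(-0.5 * ((x_eval[:, None] - X[None, :]) / h) ** 2)
    return K @ Y / K.sum(axis=1)

def test_stat(X, Y, fitted_H0, h):
    """T = sqrt(h) * sum_i {m_hat(X_i) - m_hat_theta(X_i)}^2 with w = 1;
    the parametric fit is smoothed too, so the kernel biases cancel."""
    return np.sqrt(h) * np.sum((nw_fit(X, Y, h, X) - nw_fit(X, fitted_H0, h, X)) ** 2)

def wild_bootstrap_test(X, Y, h, n_boot=500, seed=0):
    """Wild bootstrap test of H0: m(x) = a + b*x."""
    rng = np.random.default_rng(seed)
    b1, b0 = np.polyfit(X, Y, 1)                   # 1. parametric fit under H0
    fitted = b0 + b1 * X
    resid = Y - fitted                             #    residuals hat(eps)_i
    T_obs = test_stat(X, Y, fitted, h)
    lo, hi = (1 - np.sqrt(5)) / 2, (1 + np.sqrt(5)) / 2
    p = (5 + np.sqrt(5)) / 10                      # two-point law: mean 0, matches eps^2 and eps^3
    T_star = np.empty(n_boot)
    for b in range(n_boot):                        # 2.-4. bootstrap replications
        V = np.where(rng.uniform(size=len(Y)) < p, lo, hi)
        Y_star = fitted + V * resid
        b1s, b0s = np.polyfit(X, Y_star, 1)
        T_star[b] = test_stat(X, Y_star, b0s + b1s * X, h)
    return T_obs, np.quantile(T_star, 0.95)        # 5. compare T_obs with the bootstrap quantile
```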

Multivariate Kernel Regression
How does Y depend on a vector of variables X? We want to estimate
$E(Y \mid \mathbf{X}) = E(Y \mid X_1, \ldots, X_d) = m(\mathbf{X})$
where $\mathbf{X} = (X_1, \ldots, X_d)^T$. We have
$E(Y \mid \mathbf{X} = x) = \int y f(y \mid x)\, dy = \frac{\int y f(y, x)\, dy}{f_X(x)}$

Multivariate Kernel Regression
Multivariate Nadaraya-Watson estimator (local constant estimator): replace the unknown densities by kernel estimates,
$\hat{m}_H(x) = \frac{\sum_{i=1}^n K_H(X_i - x) Y_i}{\sum_{i=1}^n K_H(X_i - x)}$
a weighted sum of the observed $Y_i$.

Multivariate Kernel Regression
Local linear estimator: the minimization problem
$\min_{\beta_0, \beta_1} \sum_{i=1}^n \{Y_i - \beta_0 - \beta_1^T (X_i - x)\}^2 K_H(X_i - x)$
with solution
$\hat{\beta} = (\hat{\beta}_0, \hat{\beta}_1^T)^T = (\mathbf{X}^T \mathbf{W} \mathbf{X})^{-1} \mathbf{X}^T \mathbf{W} \mathbf{Y}$
where
$\mathbf{X} = \begin{pmatrix} 1 & (X_1 - x)^T \\ \vdots & \vdots \\ 1 & (X_n - x)^T \end{pmatrix}, \quad \mathbf{Y} = \begin{pmatrix} Y_1 \\ \vdots \\ Y_n \end{pmatrix}, \quad \mathbf{W} = \mathrm{diag}(K_H(X_1 - x), \ldots, K_H(X_n - x))$
$\hat{\beta}_0$ estimates the regression function, $\hat{m}_{1,H}(x) = \hat{\beta}_0(x)$; $\hat{\beta}_1$ estimates the partial derivatives with respect to the components of x.
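A sketch of the multivariate local linear fit at a single point, assuming a product Gaussian kernel for $K_H$; the function name is illustrative.

```python
import numpy as np

def local_linear_multivariate(x0, X, Y, H):
    """Multivariate local linear fit at x0: X is (n, d), H a d x d bandwidth matrix,
    K_H(u) proportional to det(H)^{-1} exp(-0.5 ||H^{-1} u||^2)."""
    U = np.linalg.solve(H, (X - x0).T).T                    # H^{-1}(X_i - x0)
    w = np.exp(-0.5 * (U**2).sum(axis=1)) / abs(np.linalg.det(H))
    Xd = np.column_stack([np.ones(len(X)), X - x0])         # rows (1, (X_i - x0)^T)
    sw = np.sqrt(w)
    beta, *_ = np.linalg.lstsq(Xd * sw[:, None], Y * sw, rcond=None)
    return beta[0], beta[1:]                                # m_hat(x0) and the gradient estimate
```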

Multivariate Kernel Regression
Statistical properties of the Nadaraya-Watson estimator:
Theorem 4.8: Asymptotic bias and variance
The conditional asymptotic bias and variance of the multivariate Nadaraya-Watson kernel regression estimator are
$\mathrm{Bias}(\hat{m}_H \mid X_1, \ldots, X_n) \approx \mu_2(K)\, \frac{\nabla m(x)^T H H^T \nabla f_X(x)}{f_X(x)} + \frac{1}{2} \mu_2(K)\, \mathrm{tr}\{H^T \mathcal{H}_m(x) H\}$
$V(\hat{m}_H \mid X_1, \ldots, X_n) \approx \frac{1}{n \det(H)} \|K\|_2^2\, \frac{\sigma^2(x)}{f_X(x)}$
in the interior of the support of $f_X$, where $\nabla m$ and $\nabla f_X$ denote gradients and $\mathcal{H}_m(x)$ the Hessian of m.

Multivariate Kernel Regression
Statistical properties of the local linear estimator:
Theorem 4.9: Asymptotic bias and variance
The conditional asymptotic bias and variance of the multivariate local linear kernel regression estimator are
$\mathrm{Bias}(\hat{m}_{1,H} \mid X_1, \ldots, X_n) \approx \frac{1}{2} \mu_2(K)\, \mathrm{tr}\{H^T \mathcal{H}_m(x) H\}$
$V(\hat{m}_{1,H} \mid X_1, \ldots, X_n) \approx \frac{1}{n \det(H)} \|K\|_2^2\, \frac{\sigma^2(x)}{f_X(x)}$
in the interior of the support of $f_X$.

Multivariate Kernel Regression
Practical aspects, d = 2: comparison of the Nadaraya-Watson and the local linear two-dimensional estimates for simulated data; 500 design points uniformly distributed in $[0, 1] \times [0, 1]$, $m(x) = \sin(2\pi x_1) + x_2$, error term $N(0, \tfrac{1}{4})$, bandwidths $h_1 = h_2 = 0.3$.

Multivariate Kernel Regression
Curse of dimensionality: nonparametric regression estimators are based on the idea of local (weighted) averaging. In higher dimensions the observations are sparsely distributed even for large sample sizes, and consequently estimators based on local averaging perform unsatisfactorily.
The effect can be explained by looking at the AMSE with $H = h\, \mathbf{I}$:
$\mathrm{AMSE}(n, h) = \frac{1}{n h^d} C_1 + h^4 C_2$
The optimal bandwidth is $h_{opt} \sim n^{-1/(4+d)}$ and the rate of convergence of the AMSE is $n^{-4/(4+d)}$: the speed of convergence decreases dramatically for higher dimensions d.
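A small back-of-the-envelope illustration (my own numbers, not from the slides): equating the one-dimensional rate $n^{-4/5}$ with the d-dimensional rate $n_d^{-4/(4+d)}$ gives $n_d = n^{(4+d)/5}$, the sample size needed to keep the same AMSE order as d grows.

```python
# sample size needed in dimension d to match the AMSE order of n = 1000 points in d = 1
n = 1000
for d in (1, 2, 3, 5, 10):
    print(d, round(n ** ((4 + d) / 5)))
```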