Nonparametric Regression Härdle, Müller, Sperlich, Werwatz, 1995, Nonparametric and Semiparametric Models, An Introduction


Härdle, Müller, Sperlich, Werwatz, 1995, Nonparametric and Semiparametric Models, An Introduction. Tine Buch-Kromann

Univariate Kernel Regression
The relationship between two variables X and Y: $Y = m(X)$, where $m(\cdot)$ is a function.
Model: $y_i = m(x_i) + \varepsilon_i$, $i = 1, \ldots, n$, (1)
$E(Y \mid X = x) = m(x)$. (2)
(1): $Y = m(X)$ does not need to hold exactly for the $i$-th observation; error term $\varepsilon$.
(2): The relationship holds on average.

Univariate Kernel Regression
Goal: Estimate $m(\cdot)$ on the basis of the observations $(x_i, y_i)$, $i = 1, \ldots, n$.
In a parametric approach: $m(x) = \alpha + \beta x$, and then estimate $\alpha$ and $\beta$.
In the nonparametric approach: no prior restrictions on $m(\cdot)$.

Conditional Expectation (repetition)
Let X and Y be two random variables with joint pdf $f(x, y)$. Then the conditional expectation of Y given that $X = x$ is
$E(Y \mid X = x) = \int y f(y \mid x)\, dy = \int y \frac{f(x, y)}{f_X(x)}\, dy = m(x)$
Note: $m(x)$ might be quite nonlinear.

Example
Consider the joint pdf $f(x, y) = x + y$ for $0 \le x \le 1$ and $0 \le y \le 1$, with $f(x, y) = 0$ elsewhere, and marginal pdf
$f_X(x) = \int_0^1 f(x, y)\, dy = x + \tfrac{1}{2}$ for $0 \le x \le 1$, with $f_X(x) = 0$ elsewhere.
Then
$E(Y \mid X = x) = \int y \frac{f(x, y)}{f_X(x)}\, dy = \int_0^1 y\, \frac{x + y}{x + \tfrac{1}{2}}\, dy = \frac{\tfrac{1}{2} x + \tfrac{1}{3}}{x + \tfrac{1}{2}} = m(x),$
which is nonlinear.

Design
Random design: observations $(X_i, Y_i)$, $i = 1, \ldots, n$, from a bivariate distribution $f(x, y)$. The density $f_X(x)$ is unknown.
Fixed design: we control the predictor variable X, so Y is the only random variable (e.g. the beer example). The density $f_X(x)$ is known.

Kernel Regression
Random design: the Nadaraya-Watson estimator
$\hat{m}_h(x) = \frac{n^{-1} \sum_{i=1}^n K_h(x - X_i) Y_i}{n^{-1} \sum_{j=1}^n K_h(x - X_j)}$
Rewrite the Nadaraya-Watson estimator:
$\hat{m}_h(x) = \frac{1}{n} \sum_{i=1}^n \left( \frac{K_h(x - X_i)}{n^{-1} \sum_{j=1}^n K_h(x - X_j)} \right) Y_i = \frac{1}{n} \sum_{i=1}^n W_{hi}(x) Y_i$
A weighted (local) average of the $Y_i$ (note: $\frac{1}{n} \sum_{i=1}^n W_{hi}(x) = 1$).
h determines the degree of smoothness.
What happens if the denominator of $W_{hi}(x)$ is equal to 0? Then the numerator is also equal to 0, and the estimator is not defined. This happens in regions of sparse data.
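As a concrete illustration of the weighted-average form above, here is a minimal numerical sketch of the Nadaraya-Watson estimator. The Gaussian kernel, the function name and the simulated data are illustrative choices, not part of the original slides.

```python
import numpy as np

def nadaraya_watson(x_grid, X, Y, h):
    """Nadaraya-Watson estimate: sum_i K_h(x - X_i) Y_i / sum_j K_h(x - X_j),
    here with a Gaussian kernel K_h(u) = phi(u/h)/h."""
    u = (x_grid[:, None] - X[None, :]) / h
    K = np.exp(-0.5 * u**2) / (np.sqrt(2 * np.pi) * h)   # K_h(x - X_i)
    denom = K.sum(axis=1)
    # where the denominator is 0 the estimator is not defined (sparse data)
    return np.where(denom > 0, K @ Y / np.where(denom > 0, denom, 1.0), np.nan)

# toy example: m(x) = sin(2*pi*x) with noise
rng = np.random.default_rng(0)
X = rng.uniform(0, 1, 200)
Y = np.sin(2 * np.pi * X) + rng.normal(0, 0.3, size=200)
m_hat = nadaraya_watson(np.linspace(0, 1, 101), X, Y, h=0.05)
```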

Kernel Regression

Kernel Regression
Fixed design: $f_X(x)$ is known. Weights of the form
$W_{hi}^{FD}(x) = \frac{K_h(x - x_i)}{f_X(x)}$
Simpler structure and therefore easier to analyse.
One particular fixed design kernel regression estimator is the Gasser-Müller estimator. For ordered design points $x_{(i)}$, $i = 1, \ldots, n$, from $[a, b]$:
$W_{hi}^{GM}(x) = n \int_{s_{i-1}}^{s_i} K_h(x - u)\, du, \quad \text{where } s_i = \frac{x_{(i)} + x_{(i+1)}}{2},\ s_0 = a,\ s_{n+1} = b.$
Note that $W_{hi}^{GM}(x)$ sums to 1.

Statistical Properties
Is the kernel regression estimator consistent?
Theorem 4.1: Consistency (Nadaraya-Watson)
Assume the univariate random design model and the regularity conditions $\int |K(u)|\, du < \infty$, $u K(u) \to 0$ for $|u| \to \infty$, $EY^2 < \infty$. Suppose also $h \to 0$, $nh \to \infty$. Then
$\frac{1}{n} \sum_{i=1}^n W_{hi}(x) Y_i = \hat{m}_h(x) \stackrel{P}{\longrightarrow} m(x)$
for every x with $f_X(x) > 0$ at which $m(x)$, $f_X(x)$ and $\sigma^2(x) = V(Y \mid X = x)$ are continuous.
Proof idea: Show that the numerator and the denominator of $\hat{m}_h(x)$ converge; then $\hat{m}_h(x)$ converges (Slutsky's theorem).

Statistical Properties
What is the speed of convergence?
Theorem 4.2: Speed of convergence (Gasser-Müller)
Assume the univariate fixed design model and the conditions: $K(\cdot)$ has support $[-1, 1]$ with $K(-1) = K(1) = 0$, $m(\cdot)$ is twice continuously differentiable, $\max_i |x_i - x_{i-1}| = O(n^{-1})$, and $V(\varepsilon_i) = \sigma^2$, $i = 1, \ldots, n$. Then, under $h \to 0$, $nh \to \infty$,
$\mathrm{MSE}\left\{ \frac{1}{n} \sum_{i=1}^n W_{hi}^{GM}(x) Y_i \right\} \approx \frac{1}{nh} \sigma^2 \|K\|_2^2 + \frac{h^4}{4} \mu_2^2(K) \{m''(x)\}^2$
Note: AMSE = variance + bias², i.e. the usual bias-variance trade-off.

Statistical Properties
What is the speed of convergence in the random design?
Theorem 4.3: Speed of convergence (Nadaraya-Watson)
Assume the univariate random design model and the conditions $\int |K(u)|\, du < \infty$, $u K(u) \to 0$ for $|u| \to \infty$, $EY^2 < \infty$. Suppose also $h \to 0$, $nh \to \infty$. Then
$\mathrm{MSE}(\hat{m}_h(x)) \approx \frac{1}{nh} \frac{\sigma^2(x)}{f_X(x)} \|K\|_2^2 + \frac{h^4}{4} \mu_2^2(K) \left\{ m''(x) + 2 \frac{m'(x) f_X'(x)}{f_X(x)} \right\}^2$
for every x with $f_X(x) > 0$ at which $m'(x)$, $m''(x)$, $f_X(x)$, $f_X'(x)$ and $\sigma^2(x) = V(Y \mid X = x)$ are continuous.
Proof idea: Rewrite the Nadaraya-Watson estimator as $\hat{m}_h(x) = \hat{r}_h(x) / \hat{f}_h(x)$.

Statistical Properties
Asymptotic MSE:
$\mathrm{AMSE}(n, h) = \frac{1}{nh} C_1 + h^4 C_2$
Minimizing with respect to h gives the optimal bandwidth $h_{opt} \sim n^{-1/5}$.
The rate of convergence of the AMSE is of order $O(n^{-4/5})$: slower than the rate obtained by LS estimation in linear regression, but the same as in nonparametric density estimation.

Local Polynomial Regression
The Nadaraya-Watson estimator is a special case of local polynomial regression: a local constant least squares fit.
To motivate the local polynomial fit, look at the Taylor expansion of the unknown conditional expectation function,
$m(t) \approx m(x) + m'(x)(t - x) + \cdots + \frac{m^{(p)}(x)(t - x)^p}{p!}$
for t in a neighborhood of x.
Idea: Fit a polynomial in a neighborhood of x (called local polynomial regression),
$\min_{\beta} \sum_{i=1}^n \underbrace{\{Y_i - \beta_0 - \beta_1 (X_i - x) - \cdots - \beta_p (X_i - x)^p\}^2}_{\text{polynomial}}\ \underbrace{K_h(x - X_i)}_{\text{local}}$
where $\beta = (\beta_0, \beta_1, \ldots, \beta_p)^T$.

Local Polynomial Regression
The result is a weighted least squares estimator
$\hat{\beta}(x) = (\mathbf{X}^T \mathbf{W} \mathbf{X})^{-1} \mathbf{X}^T \mathbf{W} \mathbf{Y}$
where
$\mathbf{X} = \begin{pmatrix} 1 & X_1 - x & \cdots & (X_1 - x)^p \\ \vdots & \vdots & & \vdots \\ 1 & X_n - x & \cdots & (X_n - x)^p \end{pmatrix}, \quad \mathbf{Y} = \begin{pmatrix} Y_1 \\ \vdots \\ Y_n \end{pmatrix}$
and
$\mathbf{W} = \begin{pmatrix} K_h(x - X_1) & 0 & \cdots & 0 \\ 0 & K_h(x - X_2) & \cdots & 0 \\ \vdots & & \ddots & \vdots \\ 0 & 0 & \cdots & K_h(x - X_n) \end{pmatrix}$
Note: This estimator varies with x (in contrast to parametric least squares).

Local Polynomial Regression
Denote the components of $\hat{\beta}(x)$ by $\hat{\beta}_0(x), \ldots, \hat{\beta}_p(x)$. Then the local polynomial estimator of the regression function m is
$\hat{m}_{p,h}(x) = \hat{\beta}_0(x)$
because $m(x) \approx \beta_0(x)$ (recall the Taylor expansion).

Local Polynomial Regression
Local constant estimator (Nadaraya-Watson): p = 0. Then
$\min_{\beta} \sum_{i=1}^n \{Y_i - \beta_0\}^2 K_h(x - X_i)$
and
$\hat{m}_{0,h}(x) = \frac{\sum_{i=1}^n K_h(x - X_i) Y_i}{\sum_{i=1}^n K_h(x - X_i)}$

Local Polynomial Regression
Local linear estimator: p = 1. Then
$\min_{\beta} \sum_{i=1}^n \{Y_i - \beta_0 - \beta_1 (X_i - x)\}^2 K_h(x - X_i)$
with
$\hat{\beta}(x) = \begin{pmatrix} S_{h,0}(x) & S_{h,1}(x) \\ S_{h,1}(x) & S_{h,2}(x) \end{pmatrix}^{-1} \begin{pmatrix} T_{h,0}(x) \\ T_{h,1}(x) \end{pmatrix}$
and
$\hat{m}_{1,h}(x) = \hat{\beta}_0(x) = \frac{T_{h,0}(x) S_{h,2}(x) - T_{h,1}(x) S_{h,1}(x)}{S_{h,0}(x) S_{h,2}(x) - S_{h,1}^2(x)}$
where
$S_{h,j}(x) = \sum_{i=1}^n K_h(x - X_i)(X_i - x)^j, \qquad T_{h,j}(x) = \sum_{i=1}^n K_h(x - X_i)(X_i - x)^j Y_i$

Local Polynomial Regression
Local polynomial estimator: the general formula for a polynomial of order p,
$\min_{\beta} \sum_{i=1}^n \{Y_i - \beta_0 - \beta_1 (X_i - x) - \cdots - \beta_p (X_i - x)^p\}^2 K_h(x - X_i)$
Then
$\hat{\beta}(x) = \begin{pmatrix} S_{h,0}(x) & S_{h,1}(x) & \cdots & S_{h,p}(x) \\ S_{h,1}(x) & S_{h,2}(x) & \cdots & S_{h,p+1}(x) \\ \vdots & \vdots & & \vdots \\ S_{h,p}(x) & S_{h,p+1}(x) & \cdots & S_{h,2p}(x) \end{pmatrix}^{-1} \begin{pmatrix} T_{h,0}(x) \\ T_{h,1}(x) \\ \vdots \\ T_{h,p}(x) \end{pmatrix}$
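A short sketch of the weighted least squares computation of $\hat{\beta}(x)$ described above; the quartic kernel and the helper name `local_polynomial` are illustrative assumptions, not from the text.

```python
import numpy as np

def quartic_kernel(u):
    # K(u) = 15/16 (1 - u^2)^2 for |u| <= 1, 0 otherwise
    return np.where(np.abs(u) <= 1, 15 / 16 * (1 - u**2) ** 2, 0.0)

def local_polynomial(x0, X, Y, h, p=1):
    """Return beta_hat(x0) = (X'WX)^{-1} X'WY; beta_hat[0] is m_hat_{p,h}(x0),
    and nu! * beta_hat[nu] estimates the nu-th derivative of m at x0."""
    w = quartic_kernel((x0 - X) / h) / h                 # K_h(x0 - X_i)
    Xd = np.vander(X - x0, N=p + 1, increasing=True)     # columns 1, (X_i-x0), ..., (X_i-x0)^p
    sw = np.sqrt(w)
    beta, *_ = np.linalg.lstsq(Xd * sw[:, None], Y * sw, rcond=None)
    return beta
```

With p = 0 this reduces to the Nadaraya-Watson estimator, and with p = 1 to the local linear estimator of the previous slide.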

Local Constant Regression
n = 7125 observations of net income and food expenditure from the 1973 Family Expenditure Survey of British households.

Local Linear Regression
Differences between local constant and local linear regression appear at the boundaries. Later we will see that the local linear fit is influenced less by outliers.

Local Linear Regression
Asymptotic MSE for the local linear regression estimator:
$\mathrm{AMSE}(\hat{m}_{1,h}(x)) = \frac{1}{nh} \frac{\sigma^2(x)}{f_X(x)} \|K\|_2^2 + \frac{h^4}{4} \{m''(x)\}^2 \mu_2^2(K)$
Note: the bias is design-independent, and the bias disappears when $m(\cdot)$ is linear.

Local Linear Regression
Compare the AMSE of the local constant, the Gasser-Müller and the local linear estimators:
$\mathrm{AMSE}(\hat{m}_{0,h}(x)) = \frac{1}{nh} \frac{\sigma^2(x)}{f_X(x)} \|K\|_2^2 + \frac{h^4}{4} \mu_2^2(K) \left\{ m''(x) + 2 \frac{m'(x) f_X'(x)}{f_X(x)} \right\}^2$
$\mathrm{AMSE}(\hat{m}_{GM,h}(x)) = \frac{1}{nh} \sigma^2 \|K\|_2^2 + \frac{h^4}{4} \mu_2^2(K) \{m''(x)\}^2$
$\mathrm{AMSE}(\hat{m}_{1,h}(x)) = \frac{1}{nh} \frac{\sigma^2(x)}{f_X(x)} \|K\|_2^2 + \frac{h^4}{4} \mu_2^2(K) \{m''(x)\}^2$
LC uses one parameter less than LL without reducing the asymptotic variance, but the extra parameter enables LL to reduce the bias. The LL fit improves the estimation at the boundary compared to LC. GM improves the bias compared to LC.

Local Linear Regression
Remarks:
LL is a weighted local average of the response variables $Y_i$ (as is LC).
h determines the degree of smoothness of $\hat{m}_{p,h}$: for $h \to 0$, $\hat{m}_{p,h}(X_i) \to Y_i$ at the observations $X_i$; when $h \to \infty$ all weights are equal, i.e. we obtain a parametric p-th order polynomial fit.
Derivatives of m: the natural approach is to estimate m by $\hat{m}$ and compute the derivative $\hat{m}'$. An alternative and more efficient method is the polynomial derivative estimator
$\hat{m}_{p,h}^{(\nu)}(x) = \nu!\, \hat{\beta}_\nu(x)$
for the $\nu$-th derivative of $m(\cdot)$. Usually the order of the polynomial is $p = \nu + 1$ or $p = \nu + 3$.

Nearest-Neighbor Estimator
Other nonparametric smoothing techniques differ from the kernel method.
Kernel regression: weighted averages of the $Y_i$ in a fixed neighborhood (governed by h) around x.
k-NN: weighted averages of the $Y_i$ in a neighborhood around x whose width is variable, using the k nearest observations to x:
$\hat{m}_k(x) = \frac{1}{n} \sum_{i=1}^n W_{ki}(x) Y_i \quad \text{with} \quad W_{ki}(x) = \begin{cases} n/k & \text{if } i \in J_x \\ 0 & \text{otherwise} \end{cases}$
and the set of indices
$J_x = \{i : X_i \text{ is one of the } k \text{ nearest observations to } x\}$

Nearest-Neighbor Estimator
Remark: when we estimate $m(\cdot)$ at a point x with sparse data, the k nearest neighbors are rather far away from x.
k is the smoothing parameter: increasing k makes the estimate smoother.
The k-NN estimator is the kernel estimator with uniform kernel $K(u) = \tfrac{1}{2}\, \mathbb{1}(|u| \le 1)$ and variable bandwidth $R = R(k)$, with $R(k)$ the distance between x and its furthest k-th nearest neighbor:
$\hat{m}_k(x) = \frac{\sum_{i=1}^n K_R(x - X_i) Y_i}{\sum_{i=1}^n K_R(x - X_i)}$
The k-NN estimator can be generalised by considering other kernel functions.

Nearest-Neighbor Estimator
Bias and variance: let $k \to \infty$, $k/n \to 0$ and $n \to \infty$. Then
$E\{\hat{m}_k(x)\} - m(x) \approx \frac{\mu_2(K)}{8 f_X(x)^2} \left\{ m''(x) + 2 \frac{m'(x) f_X'(x)}{f_X(x)} \right\} \left( \frac{k}{n} \right)^2$
$V\{\hat{m}_k(x)\} \approx \frac{2 \|K\|_2^2\, \sigma^2(x)}{k}$
Remark: the variance does not depend on $f_X(x)$ (unlike LC), because k-NN always averages over k observations. k-NN is approximately identical to LC with bandwidth h with respect to MSE if we choose $k = 2 n h f_X(x)$.
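A minimal sketch of the k-NN smoother at a single point (uniform-kernel version), assuming a scalar X; the function name is illustrative.

```python
import numpy as np

def knn_regression(x0, X, Y, k):
    """Average of the Y_i whose X_i are among the k observations nearest to x0."""
    J_x = np.argsort(np.abs(X - x0))[:k]   # indices of the k nearest neighbors
    return Y[J_x].mean()
```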

Nearest-Neighbor Estimator

Median Smoothing
Goal: Estimate the conditional median function
$\hat{m}(x) = \mathrm{med}\{Y_i : i \in J_x\}$
where
$J_x = \{i : X_i \text{ is one of the } k \text{ nearest observations to } x\}$
i.e. the median of the $Y_i$'s for which $X_i$ is one of the k nearest neighbors of x.
$\mathrm{med}(Y \mid X = x)$ is more robust to outliers than $E(Y \mid X = x)$.
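The corresponding median smoother only changes the local summary from mean to median; again a hedged sketch with illustrative names.

```python
import numpy as np

def median_smoother(x0, X, Y, k):
    """Median of the Y_i whose X_i are among the k observations nearest to x0."""
    J_x = np.argsort(np.abs(X - x0))[:k]
    return np.median(Y[J_x])
```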

Median Smoothing

Spline Smoothing
Consider the residual sum of squares (RSS) as a criterion for the goodness of fit of $m(\cdot)$:
$\mathrm{RSS} = \sum_{i=1}^n \{Y_i - m(X_i)\}^2$
Minimizing the RSS alone leads to interpolating the data: $m(X_i) = Y_i$, $i = 1, \ldots, n$.
Solution: add a stabilizer that penalizes non-smoothness of $m(\cdot)$,
$\|m''\|_2^2 = \int \{m''(x)\}^2\, dx$

Spline Smoothing
Minimization problem:
$\hat{m}_\lambda = \arg\min_m S_\lambda(m), \quad \text{where } S_\lambda(m) = \sum_{i=1}^n \{Y_i - m(X_i)\}^2 + \lambda \|m''\|_2^2$
For the class of all twice differentiable functions on $[a, b] = [X_{(1)}, X_{(n)}]$, the unique minimizer is the cubic spline, which consists of cubic polynomials
$p_i(x) = \alpha_i + \beta_i x + \gamma_i x^2 + \delta_i x^3, \quad i = 1, \ldots, n - 1,$
between $X_{(i)}$ and $X_{(i+1)}$.
$\lambda$ controls the weight given to the stabilizer: a high $\lambda$ puts more weight on $\|m''\|_2^2$ and gives a smooth estimate; $\lambda \to 0$ gives $\hat{m}_\lambda(\cdot) \to$ interpolation; $\lambda \to \infty$ gives $\hat{m}_\lambda(\cdot) \to$ a linear function.

Spline Smoothing
The estimator must be twice continuously differentiable:
$p_i(X_{(i)}) = p_{i-1}(X_{(i)})$ (no jumps in the function),
$p_i'(X_{(i)}) = p_{i-1}'(X_{(i)})$ (no jumps in the first derivative),
$p_i''(X_{(i)}) = p_{i-1}''(X_{(i)})$ (no jumps in the second derivative),
$p_1''(X_{(1)}) = p_{n-1}''(X_{(n)}) = 0$ (boundary condition).
These restrictions and the conditions for minimizing $S_\lambda(m)$ with respect to the coefficients of the $p_i$ define a system of linear equations which can be solved in $O(n)$ calculations.

Spline Smoothing
The spline smoother is a linear estimator in the $Y_i$:
$\hat{m}_\lambda(x) = \frac{1}{n} \sum_{i=1}^n W_{\lambda,i}(x) Y_i$
The spline smoother is asymptotically equivalent to a kernel smoother (under certain conditions) with the spline kernel
$K_S(u) = \frac{1}{2} \exp\left( -\frac{|u|}{\sqrt{2}} \right) \sin\left( \frac{|u|}{\sqrt{2}} + \frac{\pi}{4} \right)$
and local bandwidth $h(X_i) = \lambda^{1/4} n^{-1/4} f_X(X_i)^{-1/4}$.
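A minimal discrete sketch of the penalized least-squares idea behind $S_\lambda(m)$: it replaces $\int \{m''\}^2$ by squared second differences of the fitted values at the sorted design points (a Whittaker-type smoother), so it only approximates the exact cubic-spline solution; names and data layout are assumptions.

```python
import numpy as np

def discrete_spline_smoother(x, y, lam):
    """Minimize sum (y_i - m_i)^2 + lam * sum (second differences of m)^2
    over the fitted values m at the sorted design points."""
    order = np.argsort(x)
    y_sorted = y[order]
    n = len(y_sorted)
    D = np.diff(np.eye(n), n=2, axis=0)              # (n-2) x n second-difference matrix
    m_fit = np.linalg.solve(np.eye(n) + lam * D.T @ D, y_sorted)
    return x[order], m_fit
```

As in the slides, $\lambda \to 0$ interpolates the data, while a very large $\lambda$ forces the second differences towards zero, i.e. an (approximately) linear fit.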

Spline Smoothing

Orthogonal Series
Suppose that $m(\cdot)$ can be represented by a Fourier series
$m(x) = \sum_{j=0}^{\infty} \beta_j \varphi_j(x)$
where $\{\varphi_j(x)\}_{j=0}^{\infty}$ is a known basis of functions and $\{\beta_j\}_{j=0}^{\infty}$ are the unknown Fourier coefficients.
Goal: estimate the unknown Fourier coefficients. We cannot estimate an infinite number of coefficients from a finite number of observations; therefore choose the number N of terms to include in the Fourier series representation.

Orthogonal Series
Estimation proceeds in three steps: select a basis of functions, select an integer N < n, and estimate the N unknown coefficients.
N is the smoothing parameter: large N, many terms in the Fourier series, interpolation; small N, few terms, smooth estimates.
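A sketch of the three steps with a cosine basis on [0, 1] and the coefficients estimated by ordinary least squares; the basis choice and names are assumptions, not from the slides.

```python
import numpy as np

def cosine_design(x, N):
    """Basis functions phi_0(x) = 1, phi_j(x) = sqrt(2) cos(j*pi*x), j = 1..N."""
    Phi = np.sqrt(2) * np.cos(np.pi * np.outer(x, np.arange(N + 1)))
    Phi[:, 0] = 1.0
    return Phi

def series_regression(x_grid, X, Y, N):
    # step 3: estimate the N+1 Fourier coefficients by least squares
    beta, *_ = np.linalg.lstsq(cosine_design(X, N), Y, rcond=None)
    return cosine_design(x_grid, N) @ beta
```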

Orthogonal Series

Smoothing Parameter Selection
How can we choose the smoothing parameter of the kernel regression estimators? Which criteria do we have that measure how close the estimate is to the true curve?

Smoothing Parameter Selection
Mean Squared Error (MSE):
$\mathrm{MSE}(\hat{m}_h(x)) = E[\{\hat{m}_h(x) - m(x)\}^2]$
$m(x)$ is an unknown constant; $\hat{m}_h(x)$ is a random variable, i.e. E is taken with respect to this random variable. $\hat{m}_h(x)$ is a function of $\{X_i, Y_i\}_{i=1}^n$, therefore
$\mathrm{MSE}(\hat{m}_h(x)) = \int \cdots \int \{\hat{m}_h(x) - m(x)\}^2 f(x_1, \ldots, x_n, y_1, \ldots, y_n)\, dx_1 \cdots dx_n\, dy_1 \cdots dy_n$
A local measure. The AMSE-optimal bandwidth is a complicated function of $m''(x)$ and $\sigma^2(x)$.

Smoothing Parameter Selection
Integrated Squared Error (ISE):
$\mathrm{ISE}(\hat{m}_h) = \int \{\hat{m}_h(x) - m(x)\}^2 w(x)\, f_X(x)\, dx$
A global measure. The ISE is a random variable: different samples give different values of $\hat{m}_h(x)$ and hence of the ISE.
Weight function $w(x)$: used to assign less weight to observations in regions of sparse data (to reduce variance) or in the tails (to reduce boundary effects).

Smoothing Parameter Selection
Mean Integrated Squared Error (MISE):
$\mathrm{MISE}(\hat{m}_h) = E\{\mathrm{ISE}(\hat{m}_h)\} = \int \cdots \int \left[ \int \{\hat{m}_h(x) - m(x)\}^2 w(x)\, f_X(x)\, dx \right] f(x_1, \ldots, x_n, y_1, \ldots, y_n)\, dx_1 \cdots dx_n\, dy_1 \cdots dy_n$
The expected value of the ISE.

Smoothing Parameter Selection
Averaged Squared Error (ASE):
$\mathrm{ASE}(\hat{m}_h) = \frac{1}{n} \sum_{j=1}^n \{\hat{m}_h(X_j) - m(X_j)\}^2 w(X_j)$
A global measure and a random variable; a discrete approximation to the ISE.

Smoothing Parameter Selection
Mean Averaged Squared Error (MASE):
$\mathrm{MASE}(\hat{m}_h) = E\{\mathrm{ASE} \mid X_1 = x_1, \ldots, X_n = x_n\} = \frac{1}{n} \sum_{j=1}^n E[\{\hat{m}_h(X_j) - m(X_j)\}^2 \mid X_1 = x_1, \ldots, X_n = x_n]\, w(X_j)$
The conditional expectation of the ASE. If we view $X_1, \ldots, X_n$ as random variables, then MASE is a random variable. MASE is an asymptotic version of the AMISE.

Smoothing Parameter Selection
Which measure should we use to derive a bandwidth selection method? A natural choice is MISE/AMISE, as used in the density case, but here there are more unknown quantities than in the density case.
Bandwidth selection methods: cross-validation, penalty terms, plug-in methods (Fan & Gijbels, 1996).
For the NW estimator, ASE, ISE and MISE lead asymptotically to the same amount of smoothing. Therefore use the easiest: ASE (the discrete ISE).

Smoothing Parameter Selection
Averaged Squared Error:
$\mathrm{ASE}(\hat{m}_h) = \frac{1}{n} \sum_{i=1}^n \{\hat{m}_h(X_i) - m(X_i)\}^2 w(X_i)$
and its conditional expectation, the MASE:
$\mathrm{MASE}(h) = E\{\mathrm{ASE}(h) \mid X_1 = x_1, \ldots, X_n = x_n\} = \frac{1}{n} \sum_{i=1}^n E[\{\hat{m}_h(X_i) - m(X_i)\}^2 \mid X_1 = x_1, \ldots, X_n = x_n]\, w(X_i)$

Smoothing Parameter Selection
Bias² (thin solid line), variance (dashed line), MASE (thick line). Simulated data set: $m(x) = \{\sin(2\pi x^3)\}^3$, $Y_i = m(X_i) + \varepsilon_i$, $X_i \sim U(0, 1)$, $\varepsilon_i \sim N(0, 0.01)$.

Smoothing Parameter Selection
However, MASE and ASE contain $m(\cdot)$, which is unknown.
Resubstitution estimate: replace $m(\cdot)$ with the observations Y,
$p(h) = \frac{1}{n} \sum_{i=1}^n \{Y_i - \hat{m}_h(X_i)\}^2 w(X_i)$
the weighted residual sum of squares (RSS).
Problem: $Y_i$ is used in $\hat{m}_h(X_i)$ to predict itself. Therefore p(h) can be made arbitrarily small by letting $h \to 0$ (interpolation).

Smoothing Parameter Selection
Cross-validation: the leave-one-out estimator
$\hat{m}_{h,-i}(X_i) = \frac{\sum_{j \ne i} K_h(X_i - X_j) Y_j}{\sum_{j \ne i} K_h(X_i - X_j)}$
and the cross-validation function
$CV(h) = \frac{1}{n} \sum_{i=1}^n \{Y_i - \hat{m}_{h,-i}(X_i)\}^2 w(X_i)$
Choose the h that minimizes CV(h).
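A sketch of leave-one-out cross-validation for the Nadaraya-Watson estimator with a Gaussian kernel and w ≡ 1; the grid of candidate bandwidths is an illustrative assumption.

```python
import numpy as np

def cv_bandwidth(X, Y, h_grid):
    """Return the bandwidth in h_grid minimizing CV(h) = mean_i {Y_i - m_hat_{h,-i}(X_i)}^2."""
    d = X[:, None] - X[None, :]                    # pairwise differences X_i - X_j
    cv = []
    for h in h_grid:
        K = np.exp(-0.5 * (d / h) ** 2)            # Gaussian kernel weights (constants cancel)
        np.fill_diagonal(K, 0.0)                   # leave observation i out
        m_loo = K @ Y / K.sum(axis=1)
        cv.append(np.mean((Y - m_loo) ** 2))
    return h_grid[int(np.argmin(cv))]

# e.g. h_cv = cv_bandwidth(X, Y, np.linspace(0.02, 0.5, 25))
```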

Smoothing Parameter Selection
Left: local constant regression estimator (Nadaraya-Watson) with cross-validated bandwidth ($\hat{h}_{CV} = 0.15$), quartic kernel. Right: local linear regression estimator with cross-validated bandwidth ($\hat{h}_{CV} = 0.56$); it is more stable in regions with sparse data and outperforms the LC estimator at the boundaries.

Plug-in method (Fan & Gijbels, 1996)
Bandwidth selection methods: constant (global) bandwidth, or variable bandwidth (reduction of bias in peaked regions and of variance in flat regions).
Local variable bandwidth $h(x_0)$: varies with the location point $x_0$. Global variable bandwidth $h(X_i)$: varies with the data points.
We focus on constant and local variable bandwidths.

Plug-in method (Fan & Gijbels, 1996)
Local bandwidth: obtain the optimal local bandwidth by minimizing the conditional MSE,
$[\mathrm{Bias}\{\hat{m}_\nu(x_0) \mid \mathbf{X}\}]^2 + V\{\hat{m}_\nu(x_0) \mid \mathbf{X}\}$
where $\hat{m}_\nu(\cdot)$ estimates the $\nu$-th derivative of $m(\cdot)$. Using the general expressions for bias and variance, the (local) AMSE-optimal bandwidth is
$h_{opt}^{LOC}(x_0) = C_{\nu,p}(K) \left[ \frac{\sigma^2(x_0)}{\{m^{(p+1)}(x_0)\}^2 f(x_0)} \right]^{1/(2p+3)} n^{-1/(2p+3)}$
where $C_{\nu,p}(K)$ is just a constant,
$C_{\nu,p}(K) = \left[ \frac{(p+1)!^2\, (2\nu + 1) \int K_\nu^2(t)\, dt}{2(p + 1 - \nu) \{\int t^{p+1} K_\nu(t)\, dt\}^2} \right]^{1/(2p+3)}$

Plug-in method (Fan & Gijbels, 1996)
Constant bandwidth: obtain the optimal global bandwidth by minimizing the conditional weighted MISE,
$\int \left( [\mathrm{Bias}\{\hat{m}_\nu(x) \mid \mathbf{X}\}]^2 + V\{\hat{m}_\nu(x) \mid \mathbf{X}\} \right) w(x)\, dx$
where $w(\cdot) \ge 0$ is some weight function. Using (again) the asymptotic expressions of bias and variance, the optimal bandwidth is
$h_{opt}^{GLO} = C_{\nu,p}(K) \left[ \frac{\int \sigma^2(x) w(x) / f(x)\, dx}{\int \{m^{(p+1)}(x)\}^2 w(x)\, dx} \right]^{1/(2p+3)} n^{-1/(2p+3)}$

Plug-in method (Fan & Gijbels, 1996)
The optimal bandwidths $h_{opt}^{LOC}$ and $h_{opt}^{GLO}$ depend on unknown quantities: the design density $f(\cdot)$, the conditional variance $\sigma^2(\cdot) = V(Y \mid X = \cdot)$, and the derivative function $m^{(p+1)}(\cdot)$:
$h_{opt}^{LOC}(x_0) = C_{\nu,p}(K) \left[ \frac{\sigma^2(x_0)}{\{m^{(p+1)}(x_0)\}^2 f(x_0)} \right]^{1/(2p+3)} n^{-1/(2p+3)}, \qquad h_{opt}^{GLO} = C_{\nu,p}(K) \left[ \frac{\int \sigma^2(x) w(x) / f(x)\, dx}{\int \{m^{(p+1)}(x)\}^2 w(x)\, dx} \right]^{1/(2p+3)} n^{-1/(2p+3)}$
Solution: substitute pilot estimates, giving a plug-in bandwidth.

Plug-in method (Fan & Gijbels, 1996)
Rule of thumb for constant bandwidth selection: find pilot estimates of $f(\cdot)$, $\sigma^2(\cdot)$ and $m^{(p+1)}(\cdot)$ by fitting a polynomial of order p + 3 globally to m(x) (a parametric fit):
$\check{m}(x) = \check{\alpha}_0 + \cdots + \check{\alpha}_{p+3} x^{p+3}$
$\check{\sigma}^2$: the standardized RSS from this fit.
$\check{m}^{(p+1)}$: the (p+1)-th derivative of this fit, a quadratic function.

Plug-in method (Fan & Gijbels, 1996)
We want to estimate $m^{(\nu)}(\cdot)$ and we take $w(x) = f(x) w_0(x)$. Substitute the pilot estimates into the optimal constant bandwidth.
Rule-of-thumb bandwidth:
$\check{h}_{ROT} = C_{\nu,p}(K) \left[ \frac{\check{\sigma}^2 \int w_0(x)\, dx}{\sum_{i=1}^n \{\check{m}^{(p+1)}(X_i)\}^2 w_0(X_i)} \right]^{1/(2p+3)}$
where $C_{\nu,p}(K) = \left[ \frac{(p+1)!^2\, (2\nu + 1) \int K_\nu^2(t)\, dt}{2(p + 1 - \nu) \{\int t^{p+1} K_\nu(t)\, dt\}^2} \right]^{1/(2p+3)}$ is a constant (tabulated for different kernels and different values of $\nu$ and p).

Smoothing Parameter Selection
Local linear regression (p = 1) and derivative estimation (p = 2) with the rule-of-thumb bandwidth $\check{h}$.

Confidence Regions
How close is the smoothed curve to the true curve?
Theorem 4.5: Asymptotic distribution of the Nadaraya-Watson estimator
Suppose m and $f_X$ are twice differentiable, $\int |K(u)|^{2+\kappa}\, du < \infty$ for some $\kappa > 0$, x is a continuity point of $\sigma^2(x)$ and $E(|Y|^{2+\kappa} \mid X = x)$, and $f_X(x) > 0$. Take $h = c n^{-1/5}$. Then
$n^{2/5} \{\hat{m}_h(x) - m(x)\} \stackrel{L}{\longrightarrow} N(b_x, v_x^2)$
where
$b_x = c^2 \mu_2(K) \left\{ \frac{m''(x)}{2} + \frac{m'(x) f_X'(x)}{f_X(x)} \right\}, \qquad v_x^2 = \frac{\sigma^2(x) \|K\|_2^2}{c\, f_X(x)}$

Confidence Regions
Suppose the bias is of negligible magnitude compared to the variance, e.g. h sufficiently small. Approximate confidence intervals:
$\left[ \hat{m}_h(x) - z_{1-\frac{\alpha}{2}} \sqrt{\frac{\|K\|_2^2\, \hat{\sigma}^2(x)}{n h\, \hat{f}_h(x)}},\ \hat{m}_h(x) + z_{1-\frac{\alpha}{2}} \sqrt{\frac{\|K\|_2^2\, \hat{\sigma}^2(x)}{n h\, \hat{f}_h(x)}} \right]$
where $z_{1-\frac{\alpha}{2}}$ is the $(1 - \frac{\alpha}{2})$-quantile of the standard normal distribution and the estimate of $\sigma^2(x)$ is
$\hat{\sigma}^2(x) = \frac{1}{n} \sum_{i=1}^n W_{hi}(x) \{Y_i - \hat{m}_h(x)\}^2$
where the $W_{hi}$ are the weights from the Nadaraya-Watson estimator.
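A sketch of these pointwise intervals, assuming a Gaussian kernel (for which $\|K\|_2^2 = 1/(2\sqrt{\pi})$) and ignoring the bias term as stated above; the function name is illustrative.

```python
import numpy as np
from scipy.stats import norm

def nw_confidence_band(x_grid, X, Y, h, alpha=0.05):
    """Pointwise (1 - alpha) confidence intervals for the Nadaraya-Watson estimate."""
    n = len(X)
    K = np.exp(-0.5 * ((x_grid[:, None] - X[None, :]) / h) ** 2) / (np.sqrt(2 * np.pi) * h)
    f_hat = K.mean(axis=1)                              # kernel density estimate f_hat_h(x)
    m_hat = K @ Y / (n * f_hat)                         # Nadaraya-Watson estimate
    W = K / f_hat[:, None]                              # weights W_hi(x)
    sigma2_hat = (W * (Y[None, :] - m_hat[:, None]) ** 2).mean(axis=1)
    K_norm2 = 1 / (2 * np.sqrt(np.pi))                  # ||K||_2^2 for the Gaussian kernel
    half = norm.ppf(1 - alpha / 2) * np.sqrt(K_norm2 * sigma2_hat / (n * h * f_hat))
    return m_hat - half, m_hat + half
```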

Confidence Regions The Nadaraya-Watson kernel regression and 95% confidence intervals, h = 0.2.

Hypothesis Testing
What would we like to test? Is there indeed an impact of X on Y? Is the estimated function $m(\cdot)$ significantly different from a traditional parameterization (e.g. linear or log-linear models)?
The optimal rates of convergence are different for nonparametric estimation and nonparametric testing. Therefore, the choice of smoothing parameter is an issue to be discussed separately.

Hypothesis Testing
Specifying the hypothesis: $H_0$ is the hypothesis, $H_1$ the alternative.
Nonparametric regression:
$H_0$: $m(x) \equiv 0$ (X has no impact on Y; assume EY = 0, otherwise use $\tilde{Y}_i = Y_i - \bar{Y}$).
$H_1$: $m(x) \ne 0$.
Estimate m(x) by the Nadaraya-Watson estimator: $\hat{m}(\cdot) = \hat{m}_h(\cdot)$.

Hypothesis Testing
A measure for the deviation from zero:
$T_1 = n \sqrt{h} \int \{\hat{m}(x) - 0\}^2 w(x)\, dx$
where $w(x)$ is a weight function (to trim boundaries or regions of sparse data). If $w(x) = f_X(x)\, \bar{w}(x)$, then the empirical version of $T_1$ is
$T_2 = \sqrt{h} \sum_{i=1}^n \{\hat{m}(X_i) - 0\}^2 \bar{w}(X_i)$

Hypothesis Testing
Under $H_0$: $T_1 \to 0$ and $T_2 \to 0$ for $n \to \infty$. Under $H_1$: $T_1 \to \infty$ and $T_2 \to \infty$ for $n \to \infty$.
Under $H_0$, $\hat{m}(\cdot)$ does not have any bias (Theorem 4.3, speed of convergence of the NW estimator), and $T_1$ and $T_2$ converge to $N(0, V)$ where
$V = 2 \int \frac{\sigma^4(x)\, w(x)}{f_X^2(x)}\, dx \int (K \star K)^2(x)\, dx$

Hypothesis Testing
General hypothesis:
$H_0$: $m(x) = m_\theta(x)$ (a specific parametric model); $H_1$: $m(x) \ne m_\theta(x)$.
Test statistic, where $\hat{\theta}$ is a consistent estimator:
$\sqrt{h} \sum_{i=1}^n \{\hat{m}(X_i) - m_{\hat{\theta}}(X_i)\}^2 w(X_i)$
Problem: $m_{\hat{\theta}}(\cdot)$ is asymptotically unbiased and converges at rate $\sqrt{n}$, whereas $\hat{m}(x)$ has a kernel smoothing bias and converges at rate $\sqrt{nh}$.

Hypothesis Testing
Solution: introduce an artificial bias by replacing $m_{\hat{\theta}}(\cdot)$ with
$\hat{m}_{\hat{\theta}}(x) = \frac{\sum_{i=1}^n K_h(x - X_i)\, m_{\hat{\theta}}(X_i)}{\sum_{i=1}^n K_h(x - X_i)}$
in the test statistic
$T = \sqrt{h} \sum_{i=1}^n \{\hat{m}(X_i) - \hat{m}_{\hat{\theta}}(X_i)\}^2 w(X_i)$
Now the bias of $\hat{m}(\cdot)$ and $\hat{m}_{\hat{\theta}}(\cdot)$ cancels out, and the same holds for the convergence rates.

Hypothesis Testing
We reject $H_0$ if T is large, but what is large?
Theorem 4.7: Asymptotic distribution of T
Assume the conditions of Theorem 4.3, and further that $\hat{\theta}$ is a $\sqrt{n}$-consistent estimator of $\theta$ and $h = O(n^{-1/5})$. Then
$T - \frac{1}{\sqrt{h}} \|K\|_2^2 \int \sigma^2(x) w(x)\, dx \ \stackrel{D}{\longrightarrow}\ N\left( 0,\ 2 \int \sigma^4(x) w^2(x)\, dx \int (K \star K)^2\, dx \right)$

Hypothesis Testing
Problem: slow convergence of T towards the normal distribution.
Solution: bootstrap. Simulate the distribution of the test statistic under the hypothesis and determine the critical values based on the simulated distribution. The most popular method is the wild bootstrap: resample from the residuals $\hat{\varepsilon}_i$, $i = 1, \ldots, n$, under $H_0$. Each bootstrap residual $\varepsilon_i^*$ is drawn from a distribution that matches $\hat{\varepsilon}_i$ up to the first three moments.

Hypothesis Testing
Wild bootstrap:
1. Estimate $m_{\hat{\theta}}(\cdot)$ under $H_0$ and construct $\hat{\varepsilon}_i = Y_i - m_{\hat{\theta}}(X_i)$.
2. For each $X_i$ draw a bootstrap residual $\varepsilon_i^*$ so that $E(\varepsilon_i^*) = 0$, $E(\varepsilon_i^{*2}) = \hat{\varepsilon}_i^2$, $E(\varepsilon_i^{*3}) = \hat{\varepsilon}_i^3$.
3. Generate a bootstrap sample $\{Y_i^*, X_i\}_{i=1,\ldots,n}$ by setting $Y_i^* = m_{\hat{\theta}}(X_i) + \varepsilon_i^*$.
4. From this sample, calculate the bootstrap test statistic $T^*$.
5. Repeat (2.)-(4.) $n_{boot}$ times and use the $n_{boot}$ generated test statistics $T^*$ to determine the quantiles of the test statistic under $H_0$. This gives approximate critical values for your test statistic T.
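A sketch of steps 1-5 for the simplest parametric null, a linear model, using the two-point ("golden ratio") residual distribution, which is one common way to match the first three moments required in step 2. The helper names, the Gaussian kernel and the choice of $H_0$ are illustrative assumptions.

```python
import numpy as np

def nw_fit(X, Y, h, x_eval):
    """Nadaraya-Watson smoother with a Gaussian kernel (see the earlier sketch)."""
    K = np.exp(-0.5 * ((x_eval[:, None] - X[None, :]) / h) ** 2)
    return K @ Y / K.sum(axis=1)

def test_stat(X, Y, fitted_H0, h):
    """T = sqrt(h) * sum_i {m_hat(X_i) - m_hat_theta(X_i)}^2 with w = 1;
    the parametric fit is smoothed too, so the kernel biases cancel."""
    return np.sqrt(h) * np.sum((nw_fit(X, Y, h, X) - nw_fit(X, fitted_H0, h, X)) ** 2)

def wild_bootstrap_test(X, Y, h, n_boot=500, seed=0):
    """Wild bootstrap test of H0: m(x) = a + b*x."""
    rng = np.random.default_rng(seed)
    b1, b0 = np.polyfit(X, Y, 1)                   # 1. parametric fit under H0
    fitted = b0 + b1 * X
    resid = Y - fitted                             #    residuals hat(eps)_i
    T_obs = test_stat(X, Y, fitted, h)
    lo, hi = (1 - np.sqrt(5)) / 2, (1 + np.sqrt(5)) / 2
    p = (5 + np.sqrt(5)) / 10                      # two-point law: mean 0, matches eps^2 and eps^3
    T_star = np.empty(n_boot)
    for b in range(n_boot):                        # 2.-4. bootstrap replications
        V = np.where(rng.uniform(size=len(Y)) < p, lo, hi)
        Y_star = fitted + V * resid
        b1s, b0s = np.polyfit(X, Y_star, 1)
        T_star[b] = test_stat(X, Y_star, b0s + b1s * X, h)
    return T_obs, np.quantile(T_star, 0.95)        # 5. compare T_obs with the bootstrap quantile
```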

Multivariate Kernel Regression
How does Y depend on a vector of variables X? We want to estimate
$E(Y \mid \mathbf{X}) = E(Y \mid X_1, \ldots, X_d) = m(\mathbf{X})$
where $\mathbf{X} = (X_1, \ldots, X_d)^T$. We have
$E(Y \mid \mathbf{X} = x) = \int y f(y \mid x)\, dy = \frac{\int y f(y, x)\, dy}{f_X(x)}$

Multivariate Kernel Regression
Multivariate Nadaraya-Watson estimator (local constant estimator): replace the unknown densities by kernel estimates,
$\hat{m}_H(x) = \frac{\sum_{i=1}^n K_H(X_i - x) Y_i}{\sum_{i=1}^n K_H(X_i - x)}$
a weighted sum of the observed $Y_i$.

Multivariate Kernel Regression
Local linear estimator: the minimization problem
$\min_{\beta_0, \beta_1} \sum_{i=1}^n \{Y_i - \beta_0 - \beta_1^T (X_i - x)\}^2 K_H(X_i - x)$
with solution
$\hat{\beta} = (\hat{\beta}_0, \hat{\beta}_1^T)^T = (\mathbf{X}^T \mathbf{W} \mathbf{X})^{-1} \mathbf{X}^T \mathbf{W} \mathbf{Y}$
where
$\mathbf{X} = \begin{pmatrix} 1 & (X_1 - x)^T \\ \vdots & \vdots \\ 1 & (X_n - x)^T \end{pmatrix}, \quad \mathbf{Y} = \begin{pmatrix} Y_1 \\ \vdots \\ Y_n \end{pmatrix}, \quad \mathbf{W} = \mathrm{diag}(K_H(X_1 - x), \ldots, K_H(X_n - x))$
$\hat{\beta}_0$ estimates the regression function, $\hat{m}_{1,H}(x) = \hat{\beta}_0(x)$; $\hat{\beta}_1$ estimates the partial derivatives with respect to the components of x.
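A sketch of the multivariate local linear fit at a single point, assuming a product Gaussian kernel for $K_H$; the function name is illustrative.

```python
import numpy as np

def local_linear_multivariate(x0, X, Y, H):
    """Multivariate local linear fit at x0: X is (n, d), H a d x d bandwidth matrix,
    K_H(u) proportional to det(H)^{-1} exp(-0.5 ||H^{-1} u||^2)."""
    U = np.linalg.solve(H, (X - x0).T).T                    # H^{-1}(X_i - x0)
    w = np.exp(-0.5 * (U**2).sum(axis=1)) / abs(np.linalg.det(H))
    Xd = np.column_stack([np.ones(len(X)), X - x0])         # rows (1, (X_i - x0)^T)
    sw = np.sqrt(w)
    beta, *_ = np.linalg.lstsq(Xd * sw[:, None], Y * sw, rcond=None)
    return beta[0], beta[1:]                                # m_hat(x0) and the gradient estimate
```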

Multivariate Kernel Regression
Statistical properties of the Nadaraya-Watson estimator:
Theorem 4.8: Asymptotic bias and variance
The conditional asymptotic bias and variance of the multivariate Nadaraya-Watson kernel regression estimator are
$\mathrm{Bias}(\hat{m}_H \mid X_1, \ldots, X_n) \approx \mu_2(K)\, \frac{\nabla m(x)^T H H^T \nabla f_X(x)}{f_X(x)} + \frac{1}{2} \mu_2(K)\, \mathrm{tr}\{H^T \mathcal{H}_m(x) H\}$
$V(\hat{m}_H \mid X_1, \ldots, X_n) \approx \frac{1}{n \det(H)} \|K\|_2^2\, \frac{\sigma^2(x)}{f_X(x)}$
in the interior of the support of $f_X$, where $\nabla m$ and $\nabla f_X$ denote gradients and $\mathcal{H}_m(x)$ the Hessian of m.

Multivariate Kernel Regression
Statistical properties of the local linear estimator:
Theorem 4.9: Asymptotic bias and variance
The conditional asymptotic bias and variance of the multivariate local linear kernel regression estimator are
$\mathrm{Bias}(\hat{m}_{1,H} \mid X_1, \ldots, X_n) \approx \frac{1}{2} \mu_2(K)\, \mathrm{tr}\{H^T \mathcal{H}_m(x) H\}$
$V(\hat{m}_{1,H} \mid X_1, \ldots, X_n) \approx \frac{1}{n \det(H)} \|K\|_2^2\, \frac{\sigma^2(x)}{f_X(x)}$
in the interior of the support of $f_X$.

Multivariate Kernel Regression
Practical aspects, d = 2: comparison of the Nadaraya-Watson and the local linear two-dimensional estimates for simulated data; 500 design points uniformly distributed in $[0, 1] \times [0, 1]$, $m(x) = \sin(2\pi x_1) + x_2$, error term $N(0, \tfrac{1}{4})$, bandwidths $h_1 = h_2 = 0.3$.

Multivariate Kernel Regression
Curse of dimensionality: nonparametric regression estimators are based on the idea of local (weighted) averaging. In higher dimensions the observations are sparsely distributed even for large sample sizes, and consequently estimators based on local averaging perform unsatisfactorily.
The effect can be explained by looking at the AMSE with $H = h\, \mathbf{I}$:
$\mathrm{AMSE}(n, h) = \frac{1}{n h^d} C_1 + h^4 C_2$
The optimal bandwidth is $h_{opt} \sim n^{-1/(4+d)}$ and the rate of convergence of the AMSE is $n^{-4/(4+d)}$: the speed of convergence decreases dramatically for higher dimensions d.
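A small back-of-the-envelope illustration (my own numbers, not from the slides): equating the one-dimensional rate $n^{-4/5}$ with the d-dimensional rate $n_d^{-4/(4+d)}$ gives $n_d = n^{(4+d)/5}$, the sample size needed to keep the same AMSE order as d grows.

```python
# sample size needed in dimension d to match the AMSE order of n = 1000 points in d = 1
n = 1000
for d in (1, 2, 3, 5, 10):
    print(d, round(n ** ((4 + d) / 5)))
```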