Nonparametric Regression. Badr Missaoui


Outline: Kernel and local polynomial regression. Penalized regression.

We are given $n$ pairs of observations $(X_1, Y_1), \dots, (X_n, Y_n)$ where
$$Y_i = r(X_i) + \varepsilon_i, \quad i = 1, \dots, n,$$
and $r(x) = E(Y \mid X = x)$. If the $X_i$ are deterministic, we assume that $\varepsilon_i \sim N(0, \sigma^2)$. If the $X_i$ are random variables, we assume that the $\varepsilon_i \sim N(0, \sigma^2)$ are independent of the $X_i$. In the absence of any hypothesis on the function $r$, we are in the nonparametric framework.

The simplest nonparametric estimator is the regressogram. Suppose that the $X_i$ lie in the interval $[a, b]$ and denote the bins by $B_1, \dots, B_m$. Let $k_j$ be the number of observations in bin $B_j$. Define
$$\hat r_n(x) = \frac{1}{k_j} \sum_{i : X_i \in B_j} Y_i = \bar Y_j \quad \text{for } x \in B_j.$$
We can rewrite the estimator as
$$\hat r_n(x) = \sum_{i=1}^n l_i(x) Y_i$$
where $l_i(x) = 1/k_j$ if $X_i \in B_j$ (the bin containing $x$), and $l_i(x) = 0$ otherwise. In other words, the estimate $\hat r_n$ is a step function obtained by averaging the $Y_i$ over each bin.
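As a concrete illustration, here is a minimal R sketch of the regressogram on simulated data; the sample size, the interval $[a, b] = [0, 1]$, the number of bins $m = 10$, and the test function $\sin(2\pi x)$ are our own illustrative choices, and the later sketches in these notes reuse this simulated data set.

set.seed(1)
n <- 200
x <- runif(n)                                  # design points in [a, b] = [0, 1]
y <- sin(2 * pi * x) + rnorm(n, sd = 0.3)      # Y_i = r(X_i) + eps_i

m      <- 10                                   # number of bins
breaks <- seq(0, 1, length.out = m + 1)
j      <- findInterval(x, breaks, rightmost.closed = TRUE)   # bin index of each X_i
ybar   <- tapply(y, factor(j, levels = 1:m), mean)           # bin means Ybar_j

rhat <- function(x0) {                         # the step-function estimate r-hat_n
  as.numeric(ybar[findInterval(x0, breaks, rightmost.closed = TRUE)])
}

plot(x, y, col = "grey")
curve(rhat(x), add = TRUE, n = 1000)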

Recall from our discussion of model selection that
$$R(h) = E(Y - \hat r_n(X))^2 = \sigma^2 + E(r(X) - \hat r_n(X))^2 = \sigma^2 + \text{MSE}.$$
We can write the MSE as
$$\text{MSE} = \int \text{bias}^2(x)\, p(x)\,dx + \int \text{var}(x)\, p(x)\,dx$$
where $\text{bias}(x) = E(\hat r_n(x)) - r(x)$ and $\text{var}(x) = \text{Var}(\hat r_n(x))$. When the data are oversmoothed, the bias term is large and the variance is small. When the data are undersmoothed, the opposite is true. This is called the bias-variance tradeoff. Minimizing risk corresponds to balancing bias and variance.

Ideally, we would like to choose $h$ to minimize $R(h)$, but $R(h)$ depends on the unknown function $r(x)$. One could try to estimate $R(h)$ by the average residual sum of squares
$$\frac{1}{n} \sum_{i=1}^n (Y_i - \hat r_n(X_i))^2,$$
but this uses the data twice and underestimates the risk. We will instead estimate the risk using leave-one-out cross-validation, defined by
$$\text{CV} = \hat R(h) = \frac{1}{n} \sum_{i=1}^n \left( Y_i - \hat r_{n,(-i)}(X_i) \right)^2$$
where $\hat r_{n,(-i)}$ is the estimator obtained by omitting the $i$-th pair $(X_i, Y_i)$.

There is a shortcut formula for computing $\hat R(h)$, just as in linear regression. Let $\hat r_n$ be a linear smoother. Then the leave-one-out cross-validation score can be written as
$$\hat R(h) = \frac{1}{n} \sum_{i=1}^n \left( \frac{Y_i - \hat r_n(X_i)}{1 - L_{ii}} \right)^2$$
where $L_{ii} = l_i(X_i)$. An alternative is to use generalized cross-validation:
$$\text{GCV}(h) = \frac{1}{n} \sum_{i=1}^n \left( \frac{Y_i - \hat r_n(X_i)}{1 - n^{-1} \sum_{i=1}^n L_{ii}} \right)^2.$$
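A hedged sketch of these shortcut formulas, continuing the regressogram example above (the helper names cv_score and gcv_score are ours, and the hat matrix $L$ is the one whose $i$-th row puts weight $1/k_j$ on the observations sharing bin $B_j$ with $X_i$):

cv_score  <- function(y, fit, L) mean(((y - fit) / (1 - diag(L)))^2)        # LOO CV shortcut
gcv_score <- function(y, fit, L) mean(((y - fit) / (1 - mean(diag(L))))^2)  # GCV

# Regressogram hat matrix (assumes every bin holds at least two points).
k   <- tabulate(j, nbins = m)                 # bin counts k_j
L   <- (outer(j, j, "==") * 1) / k[j]         # row i: weight 1/k_j on i's bin mates
fit <- as.vector(L %*% y)
cv_score(y, fit, L)
gcv_score(y, fit, L)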

Kernel regression. A kernel is any smooth function $K$ such that $K(x) \ge 0$ and
$$\int K(x)\,dx = 1, \qquad \int x K(x)\,dx = 0, \qquad \sigma^2_K = \int x^2 K(x)\,dx > 0.$$
Some commonly used kernels are:
the boxcar kernel: $K(x) = \tfrac{1}{2} I(|x| \le 1)$;
the Gaussian kernel: $K(x) = \tfrac{1}{\sqrt{2\pi}} e^{-x^2/2}$.

Kernel regression. Let $h > 0$ be a positive number, called the bandwidth. The Nadaraya-Watson kernel estimator is defined by
$$\hat r_n(x) = \sum_{i=1}^n l_i(x) Y_i \quad \text{where} \quad l_i(x) = \frac{K\!\left(\frac{x - X_i}{h}\right)}{\sum_{j=1}^n K\!\left(\frac{x - X_j}{h}\right)}.$$
In R, we suggest using the loess command or the locfit library:
out = loess(y ~ x, span = .25, degree = 0)
lines(x, fitted(out))
out = locfit(y ~ x, deg = 0, alpha = c(0, h))
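To make the formula concrete, here is a minimal hand-rolled Nadaraya-Watson estimator with a Gaussian kernel (our own sketch, reusing the simulated data above; the bandwidth 0.05 is an arbitrary illustrative value). In practice one would use loess or locfit as shown.

nw <- function(x0, x, y, h, K = dnorm) {
  w <- K((x0 - x) / h)        # kernel weights K((x - X_i)/h)
  sum(w * y) / sum(w)         # weighted average of the Y_i
}

xgrid <- seq(0, 1, length.out = 200)
lines(xgrid, sapply(xgrid, nw, x = x, y = y, h = 0.05), lty = 2)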

Kernel regression. To do the cross-validation, create a vector of bandwidths $h = (h_1, \dots, h_k)$:
h = c(... put your values here ...)
k = length(h)
zero = rep(0, k)
H = cbind(zero, h)
out = gcvplot(y ~ x, deg = 0, alpha = H)
plot(out$df, out$values)

Kernel regression. The choice of the kernel $K$ is not too important. What matters much more is the choice of the bandwidth, which controls the amount of smoothing. In general the bandwidth depends on the sample size, so we write $h_n$. Assume that $f$ is the density of $X_1, \dots, X_n$. The risk of the Nadaraya-Watson kernel estimator is
$$R(h_n) = \frac{h_n^4}{4} \left( \int x^2 K(x)\,dx \right)^2 \int \left( r''(x) + 2 r'(x) \frac{f'(x)}{f(x)} \right)^2 dx + \frac{\sigma^2 \int K^2(x)\,dx}{n h_n} \int \frac{1}{f(x)}\,dx + o\!\left((n h_n)^{-1}\right) + o(h_n^4)$$
as $h_n \to 0$ and $n h_n \to \infty$.

Kernel regression. What is especially notable is the presence of the term
$$2 r'(x) \frac{f'(x)}{f(x)}.$$
This means that the bias is sensitive to the positions of the $X_i$'s. If we differentiate $R(h_n)$ with respect to $h_n$ and set the result to 0, we find that the optimal bandwidth is
$$h_* = n^{-1/5} \left( \frac{\sigma^2 \int K^2(x)\,dx \int \frac{1}{f(x)}\,dx}{\left( \int x^2 K(x)\,dx \right)^2 \int \left( r''(x) + 2 r'(x) \frac{f'(x)}{f(x)} \right)^2 dx} \right)^{1/5}.$$

Kernel regression. Thus, $h_* = O(n^{-1/5})$, and the risk $R(h_n)$ decreases at rate $O(n^{-4/5})$. In practice we cannot use the formula for $h_*$ given above, since it depends on the unknown function $r$. Instead, we use leave-one-out cross-validation.

Local Polynomials. Kernel estimators suffer from design bias. This problem can be alleviated by using local polynomial regression. The idea is to approximate a smooth regression function $r(u)$ near the target value $x$ by the polynomial
$$r(u) \approx P_x(u; a) = a_0 + a_1 (u - x) + \frac{a_2}{2!} (u - x)^2 + \dots + \frac{a_p}{p!} (u - x)^p.$$
We estimate $a = (a_0, \dots, a_p)^T$ by minimizing
$$\sum_{i=1}^n w_i(x) \left( Y_i - P_x(X_i; a) \right)^2$$
where $w_i(x) = K\!\left(\frac{x - X_i}{h}\right)$ are kernel weights.

Local Polynomials. To find $\hat a(x)$, it is helpful to re-express the problem in matrix notation. Let
$$X_x = \begin{pmatrix} 1 & X_1 - x & \cdots & \frac{(X_1 - x)^p}{p!} \\ 1 & X_2 - x & \cdots & \frac{(X_2 - x)^p}{p!} \\ \vdots & \vdots & & \vdots \\ 1 & X_n - x & \cdots & \frac{(X_n - x)^p}{p!} \end{pmatrix}$$
and let $W_x = \text{diag}\{w_i(x)\}_i$. We can then write the problem as minimizing
$$(Y - X_x a)^T W_x (Y - X_x a).$$
Minimizing this gives the weighted least squares estimator
$$\hat a(x) = (X_x^T W_x X_x)^{-1} X_x^T W_x Y.$$
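A sketch of local linear regression ($p = 1$) computed directly from this weighted least squares formula, again on the simulated data above, with Gaussian weights $w_i(x) = K((x - X_i)/h)$ and an illustrative bandwidth:

loclin <- function(x0, x, y, h) {
  Xx <- cbind(1, x - x0)                        # design matrix X_x for p = 1
  w  <- dnorm((x0 - x) / h)                     # w_i(x0) = K((x0 - X_i)/h)
  a  <- solve(t(Xx) %*% (w * Xx), t(Xx) %*% (w * y))
  a[1]                                          # r-hat_n(x0) = a_0
}

xgrid <- seq(0, 1, length.out = 200)
lines(xgrid, sapply(xgrid, loclin, x = x, y = y, h = 0.05), lty = 3)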

Local Polynomials. The local polynomial regression estimate is
$$\hat r_n(x) = \sum_{i=1}^n l_i(x) Y_i$$
where $l(x)^T = (l_1(x), \dots, l_n(x))$ is given by
$$l(x)^T = e_1^T (X_x^T W_x X_x)^{-1} X_x^T W_x, \qquad e_1 = (1, 0, \dots, 0)^T.$$
The R code is the same as before, except we use deg=1 for local linear and deg=2 for local quadratic:
loess(y ~ x, degree = 1, span = h)
locfit(y ~ x, deg = 1, alpha = c(0, h))

Local Polynomials. Let $Y_i = r(X_i) + \sigma(X_i) \varepsilon_i$ for $i = 1, \dots, n$ with $a < X_i < b$. Assume that $X_1, \dots, X_n$ are a sample from a distribution with density $f$ and that (i) $f(x) > 0$, (ii) $f$, $r''$ and $\sigma^2$ are continuous in a neighborhood of $x$, and (iii) $h_n \to 0$ and $n h_n \to \infty$. Then the local linear estimator has variance
$$\frac{\sigma^2(x) \int K^2(u)\,du}{f(x)\, n h_n} + o\!\left(\frac{1}{n h_n}\right)$$
and asymptotic bias
$$\frac{h_n^2}{2}\, r''(x) \int u^2 K(u)\,du + o(h_n^2).$$
Thus, the local linear estimator is free from design bias.

Penalized Regression. Consider polynomial regression
$$Y = \sum_{j=0}^p \beta_j x^j + \varepsilon \qquad \text{and} \qquad \hat r(x) = \sum_{j=0}^p \hat\beta_j x^j.$$
The design matrix is
$$X = \begin{pmatrix} 1 & x_1 & \cdots & x_1^p \\ 1 & x_2 & \cdots & x_2^p \\ \vdots & \vdots & & \vdots \\ 1 & x_n & \cdots & x_n^p \end{pmatrix}.$$

Penalized Regression. Least squares minimizes $(Y - X\beta)^T (Y - X\beta)$, which gives
$$\hat\beta = (X^T X)^{-1} X^T Y.$$
Ridge regression instead minimizes $(Y - X\beta)^T (Y - X\beta) + \lambda \beta^T \beta$, which gives
$$\hat\beta = (X^T X + \lambda I)^{-1} X^T Y.$$
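A hedged sketch of polynomial ridge regression on the simulated data from the earlier sketches; the degree $p = 10$ and $\lambda = 10^{-3}$ are illustrative choices, and (purely to mirror the formula above) the intercept is penalized along with the other coefficients:

p      <- 10
X      <- outer(x, 0:p, "^")                           # columns 1, x, ..., x^p
lambda <- 1e-3
beta   <- solve(t(X) %*% X + lambda * diag(p + 1), t(X) %*% y)
fit    <- X %*% beta
lines(sort(x), fit[order(x)], lty = 4)                 # ridge fit on the scatterplot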

Penalized Regression. An alternative is to minimize the penalized sum of squares
$$M(\lambda) = \sum_i (Y_i - r(X_i))^2 + \lambda J(r)$$
where $J(r) = \int (r''(x))^2\,dx$ is the roughness penalty. This penalty leads to a solution that favors smoother functions. The parameter $\lambda$ controls the amount of smoothness.

Penalized Regression. The most commonly used splines are piecewise cubic splines. Let $\xi_1 < \dots < \xi_k$ be a set of ordered points, called knots, contained in some interval $(a, b)$. A cubic spline is a continuous function $r$ such that (i) $r$ is a cubic polynomial on each of the intervals $(\xi_1, \xi_2), (\xi_2, \xi_3), \dots$ and (ii) $r$ has continuous first and second derivatives at the knots. A spline that is linear beyond the boundary knots is called a natural spline.
Theorem. The function $\hat r_n(x)$ that minimizes $M(\lambda)$ is a natural cubic spline with knots at the data points. The estimator $\hat r_n$ is called a smoothing spline.

Penalized Regression. The theorem above does not give an explicit form for $\hat r_n$. We therefore construct a basis for the set of cubic splines, the cubic B-splines:
$$B_{i,0}(t) = 1_{[t_i, t_{i+1}]}(t),$$
$$B_{i,d}(t) = \frac{t - t_i}{t_{i+d} - t_i} B_{i,d-1}(t) + \frac{t_{i+d+1} - t}{t_{i+d+1} - t_{i+1}} B_{i+1,d-1}(t).$$
Without the penalty, the B-spline basis interpolates the data and therefore provides a perfect fit.
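A direct transcription of this recursion into R (our own sketch; the knot sequence below, with repeated boundary knots, is an illustrative choice, and in practice splines::splineDesign builds the same basis far more efficiently):

# Cox-de Boor recursion for B_{i,d}(t). Terms whose knot span has zero length are
# taken to be zero, the usual convention for repeated knots; the order-0 basis
# uses half-open intervals to avoid double counting at interior knots.
bspline <- function(u, i, d, knots) {
  if (d == 0) return(as.numeric(u >= knots[i] & u < knots[i + 1]))
  w1 <- if (knots[i + d] > knots[i]) (u - knots[i]) / (knots[i + d] - knots[i]) else 0
  w2 <- if (knots[i + d + 1] > knots[i + 1]) (knots[i + d + 1] - u) / (knots[i + d + 1] - knots[i + 1]) else 0
  w1 * bspline(u, i, d - 1, knots) + w2 * bspline(u, i + 1, d - 1, knots)
}

knots <- c(0, 0, 0, 0, 0.25, 0.5, 0.75, 1, 1, 1, 1)   # illustrative cubic knot vector
curve(bspline(x, i = 4, d = 3, knots = knots), from = 0, to = 1, ylab = "B_{4,3}(x)")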

Penalized Regression. Now we can write
$$\hat r_n(x) = \sum_{j=1}^n \hat\beta_j B_j(x)$$
where the $B_j(x)$ are the B-spline basis functions. We follow the pattern of polynomial regression, with design matrix
$$B = \begin{pmatrix} B_1(x_1) & \cdots & B_n(x_1) \\ B_1(x_2) & \cdots & B_n(x_2) \\ \vdots & & \vdots \\ B_1(x_n) & \cdots & B_n(x_n) \end{pmatrix}.$$

Penalized Regression. We can rewrite the problem as
$$\hat\beta = \arg\min_\beta\, (Y - B\beta)^T (Y - B\beta) + \lambda \beta^T \Omega \beta$$
where $B_{ij} = B_j(X_i)$ and $\Omega_{jk} = \int B_j''(x) B_k''(x)\,dx$. The solution is
$$\hat\beta = (B^T B + \lambda \Omega)^{-1} B^T Y$$
and
$$\hat Y = LY \quad \text{where} \quad L = B (B^T B + \lambda \Omega)^{-1} B^T.$$
We define the effective degrees of freedom by $\text{df} = \text{trace}(L)$, and we choose $\lambda$ using GCV or CV. In R:
out = smooth.spline(x, y, df = 10, cv = TRUE)
lines(out$x, out$y)
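A brief usage note (our own sketch, on the simulated data above): with cv = FALSE, smooth.spline chooses $\lambda$ by GCV rather than using a fixed df, and the effective degrees of freedom $\text{trace}(L)$ can be read off from the fit.

out <- smooth.spline(x, y, cv = FALSE)     # lambda chosen by GCV
c(lambda = out$lambda, df = out$df)        # effective degrees of freedom = trace(L)
lines(out$x, out$y, lty = 5)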

Penalized Regression. More generally, we assume that $r$ admits a series expansion with respect to an orthonormal basis $(\phi_j)_j$:
$$r(x) = \sum_{j=1}^\infty \beta_j \phi_j(x).$$
We approximate $r$ by
$$r_J(x) = \sum_{j=1}^J \beta_j \phi_j(x).$$
The number of terms $J$ is our smoothing parameter. Our estimate is $\hat r(x) = \sum_{j=1}^J \hat\beta_j \phi_j(x)$.

Penalized Regression. The estimate $\hat\beta$ is
$$\hat\beta = (U^T U)^{-1} U^T Y, \qquad U = [\phi_j(X_i)]_{ij},$$
and $\hat Y = SY$ where $S = U (U^T U)^{-1} U^T$. We can choose $J$ by cross-validation. Note that $\text{trace}(S) = J$, so the GCV criterion takes the simple form
$$\text{GCV}(J) = \frac{\text{SSE}/n}{(1 - J/n)^2}.$$
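A sketch of the series estimator with a cosine basis, an illustrative choice of orthonormal basis on $[0, 1]$ (the basis, the grid of $J$ values, and the helper names are ours), selecting $J$ by the GCV formula above on the simulated data from the earlier sketches:

phi <- function(x, j) if (j == 1) rep(1, length(x)) else sqrt(2) * cos((j - 1) * pi * x)

gcv_J <- sapply(1:20, function(J) {
  U   <- sapply(1:J, function(j) phi(x, j))     # U_ij = phi_j(X_i)
  fit <- U %*% solve(t(U) %*% U, t(U) %*% y)    # fitted values S Y
  (sum((y - fit)^2) / n) / (1 - J / n)^2        # GCV(J) = (SSE/n) / (1 - J/n)^2
})
J_hat <- which.min(gcv_J)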

Penalized Regression: Variance estimation.
Theorem. Let $\hat r_n$ be a linear smoother and let
$$\hat\sigma^2 = \frac{\sum_{i=1}^n (Y_i - \hat r(X_i))^2}{n - 2\nu + \tilde\nu}$$
where $\nu = \text{tr}(L)$ and $\tilde\nu = \text{tr}(L^T L)$. If $r$ is sufficiently smooth, $\nu = o(n)$ and $\tilde\nu = o(n)$, then $\hat\sigma^2$ is a consistent estimator of $\sigma^2$.
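As an illustration (our own sketch), the estimator above computed for the Nadaraya-Watson smoother on the simulated data, whose smoothing matrix $L$ has rows $l(X_i)$; the bandwidth is the same illustrative value used earlier.

h  <- 0.05
W  <- dnorm(outer(x, x, "-") / h)          # K((X_i - X_j)/h)
L  <- W / rowSums(W)                       # i-th row = l(X_i) for the NW smoother
fit <- as.vector(L %*% y)
nu      <- sum(diag(L))                    # nu = tr(L)
nu_tild <- sum(L^2)                        # tr(L^T L) = sum of squared entries of L
sigma2_hat <- sum((y - fit)^2) / (n - 2 * nu + nu_tild)
sigma2_hat                                 # compare with the simulated sigma^2 = 0.3^2 = 0.09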