Introduction to Regression


Introduction to Regression Chad M. Schafer May 20, 2015

Outline:
- General Concepts of Regression, Bias-Variance Tradeoff
- Linear Regression
- Nonparametric Procedures
- Cross-Validation
- Local Polynomial Regression
- Confidence Bands
- Basis Methods: Splines
- Multiple Regression

Basic Concepts in Regression. The Regression Problem: the objective is to construct a model that can be used to predict the response variable Y from observations of the predictor variables X. Note: in this framework Y is real-valued, while X may be a vector. Example: predicting redshift from photometric information, with Y = redshift and X = intensity in different photometric bands.

Basic Concepts in Regression. The Regression Problem: observe a training sample (X_1, Y_1), ..., (X_n, Y_n) and use it to estimate f(x) = E(Y | X = x). Equivalently, estimate f(x) in the model Y_i = f(X_i) + ε_i, i = 1, 2, ..., n, where E(ε_i) = 0. Simple estimators when X is real-valued: f̂(x) = β̂_0 + β̂_1 x (parametric), and f̂(x) = mean{Y_i : |X_i − x| ≤ h} (nonparametric).
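As a minimal sketch (not from the lecture; the data are simulated and the names are illustrative), the following R code contrasts the two simple estimators above: a least-squares line and a crude local-average estimate.

# Sketch: parametric vs. nonparametric estimates of f(x) = E(Y | X = x)
set.seed(1)
n <- 200
x <- runif(n)
y <- sin(2 * pi * x) + rnorm(n, sd = 0.3)

# Parametric: fhat(x) = beta0_hat + beta1_hat * x
fit <- lm(y ~ x)

# Nonparametric: fhat(x) = mean{ Y_i : |X_i - x| <= h }
local_average <- function(x0, x, y, h) mean(y[abs(x - x0) <= h])
grid <- seq(0, 1, length.out = 100)
fhat_np <- sapply(grid, local_average, x = x, y = y, h = 0.1)

plot(x, y, col = "grey")
abline(fit, lwd = 2)                    # parametric fit
lines(grid, fhat_np, lwd = 2, lty = 2)  # local-average fit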

Basic Concepts of Regression. Parametric regression: Y = β_0 + β_1 X + ε; Y = β_0 + β_1 X_1 + ... + β_d X_d + ε; Y = β_0 e^{β_1 X_1} + β_2 X_2 + ε. Nonparametric regression: Y = f(X) + ε; Y = f_1(X_1) + ... + f_d(X_d) + ε; Y = f(X_1, X_2) + ε.

Next... We will first quickly look at a simple example from astronomy

Example: Type Ia Supernovae. [Figure: distance modulus versus redshift for 182 Type Ia supernovae, from Riess et al. (2007).]

Example: Type Ia Supernovae. Assumption: the observed (redshift, distance modulus) pairs (z_i, Y_i) are related by Y_i = f(z_i) + σ_i ε_i, where the ε_i are i.i.d. standard normal. The error standard deviations σ_i are assumed known. Objective: estimate f(·).

Example: Type Ia Supernovae. ΛCDM, a two-parameter model:
f(z) = 5 log_10 [ (c(1+z)/H_0) ∫_0^z du / √(Ω_m (1+u)^3 + (1 − Ω_m)) ] + 25

Example: Type Ia Supernovae. [Figure: the ΛCDM estimate with H_0 = 72.76 and Ω_m = 0.341 (the MLE), overlaid on the distance modulus versus redshift data.]

Example: Type Ia Supernovae. [Figure] Fourth-order polynomial fit: Y = β_0 + β_1 z + β_2 z^2 + β_3 z^3 + β_4 z^4 + ε

Example: Type Ia Supernovae. [Figure] Fifth-order polynomial fit: Y = β_0 + β_1 z + β_2 z^2 + β_3 z^3 + β_4 z^4 + β_5 z^5 + ε

Next... The previous two slides illustrate the fundamental challenge of constructing a regression model: How complex should it be?

The Bias-Variance Tradeoff. We must choose a model that balances two failure modes: too simple gives high bias, i.e., E(f̂(x)) is not close to f(x) (precise, but not accurate); too complex gives high variance, i.e., Var(f̂(x)) is large (accurate, but not precise). This is the classic bias-variance tradeoff. It could also be called the accuracy-precision tradeoff.

The Bias-Variance Tradeoff. An example: consider the linear regression model. These models are of the form f(x) = β_0 + Σ_{i=1}^p β_i g_i(x), where the g_i(·) are fixed, specified functions (for example, g_i(x) = x^i). Increasing p yields a more complex model.

Next... Linear regression models are the classic approach to regression. Here, the response is modelled as a linear combination of p selected predictors.

Linear Regression. Assume that Y = β_0 + Σ_{i=1}^p β_i g_i(x) + ε, where the g_i(·) are specified functions, not estimated. Assume that E(ε) = 0 and Var(ε) = σ^2. Often ε is assumed Gaussian, but this is not essential. Start with the simple linear regression model: Y = β_0 + β_1 x + ε

Linear Regression. [Figure: redshift versus g − r color; LRG data from 2SLAQ, www.2slaq.info.]

Linear Regression. [Figure: the same scatter plot, showing one of the fitted values, Ŷ_i.]

Linear Regression. [Figure: the same scatter plot, showing one of the residuals, ε̂_i.]

Linear Regression. The fitted line is Ŷ_i = β̂_0 + β̂_1 x_i, with β̂_0 = 0.493 and β̂_1 = 0.0272. These estimates of β_0 and β_1 minimize the sum of squared errors: SSE = Σ_{i=1}^n ε̂_i^2 = Σ_{i=1}^n (Y_i − Ŷ_i)^2. This is least squares regression. If the ε_i are Gaussian, then the β̂ are Gaussian, leading to confidence intervals and hypothesis tests.

Linear Regression Least squares has a natural geometric interpretation.

Linear Regression. The design matrix D is the n × (p+1) matrix whose i-th row is (1, g_1(X_i), g_2(X_i), ..., g_p(X_i)). Then L = D(D^T D)^{−1} D^T is the projection matrix. Note that ν = trace(L) = p + 1 = the complexity of the model.

Linear Regression. Let Y be the n-vector (Y_1, Y_2, ..., Y_n). The least squares estimate of β: β̂ = (D^T D)^{−1} D^T Y. The fitted values: Ŷ = LY = D β̂. The residuals: ε̂ = Y − Ŷ = (I − L)Y.
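A hedged R sketch of these matrix formulas, using simulated data and the illustrative basis g_1(x) = x, g_2(x) = x^2 (so p = 2):

set.seed(2)
n <- 100
x <- runif(n)
y <- 1 + 2 * x - 3 * x^2 + rnorm(n, sd = 0.2)

D <- cbind(1, x, x^2)                      # design matrix
beta_hat <- solve(t(D) %*% D, t(D) %*% y)  # (D^T D)^{-1} D^T Y
L <- D %*% solve(t(D) %*% D) %*% t(D)      # projection matrix
y_hat <- L %*% y                           # fitted values, LY = D beta_hat
resid <- y - y_hat                         # residuals, (I - L)Y
nu <- sum(diag(L))                         # trace(L) = p + 1 = 3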

Linear Regression in R. Linear regression in R is done via the command lm():
> holdlm = lm(y ~ x)
Note the tilde (~) between y and x. The default is to include an intercept term. To exclude it, use:
> holdlm = lm(y ~ x - 1)

Linear Regression in R. [Figure: scatter plot of the simulated (x, y) data used in the following example.]

Linear Regression in R. Fitting the model Y = β_0 + β_1 x + ε. The summary() function shows the key output:
> holdlm = lm(y ~ x)
> summary(holdlm)
< CUT SOME HERE >
Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)  -0.2468     0.1061  -2.326   0.0319 *
x             2.0671     0.1686  12.260 3.57e-10 ***
< CUT SOME HERE >
Note that β̂_1 = 2.07, and the SE for β̂_1 is approximately 0.17.

Linear Regression in R.
Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)  -0.2468     0.1061  -2.326   0.0319 *
x             2.0671     0.1686  12.260 3.57e-10 ***
If the ε are Gaussian, then the last column gives the p-value for the test of H_0: β_i = 0 versus H_1: β_i ≠ 0. Also, if the ε are Gaussian, β̂_i ± 2 ŜE(β̂_i) is an approximate 95% confidence interval for β_i.

Linear Regression in R. [Figure: residuals versus fitted values.] It is always a good idea to look at a plot of the residuals versus the fitted values; there should be no apparent pattern.
> plot(holdlm$fitted.values, holdlm$residuals)

Linear Regression in R. [Figure: normal probability plot of the residuals.] Is it reasonable to assume the ε are Gaussian? Check the normal probability plot of the residuals:
> qqnorm(holdlm$residuals)
> qqline(holdlm$residuals)

Linear Regression in R. It is simple to include more terms in the model:
> holdlm = lm(y ~ x + z)
Unfortunately, this does not work:
> holdlm = lm(y ~ x + x^2)
Instead, use:
> holdlm = lm(y ~ x + I(x^2))

The Correlation Coefficient. A standard way of quantifying the strength of the linear relationship between two variables is the (Pearson) correlation coefficient. Sample version: r = (1/(n−1)) Σ_{i=1}^n X̃_i Ỹ_i, where X̃_i and Ỹ_i are the standardized versions of X_i and Y_i, i.e., X̃_i = (X_i − X̄)/s_X.

The Correlation Coefficient. Note that −1 ≤ r ≤ 1. r = 1 indicates a perfect linear, increasing relationship; r = −1 indicates a perfect linear, decreasing relationship. Also, note that β̂_1 = r (s_Y / s_X). In R: use cor().
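A quick R check of the identity β̂_1 = r s_Y / s_X, on simulated data (a sketch, not from the lecture):

set.seed(3)
x <- rnorm(50)
y <- 0.5 * x + rnorm(50)

r <- cor(x, y)
slope_from_r  <- r * sd(y) / sd(x)
slope_from_lm <- coef(lm(y ~ x))[2]
all.equal(unname(slope_from_lm), slope_from_r)  # TRUE, up to numerical error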

The Correlation Coefficient. [Figure: examples of scatter plots and their corresponding values of r (Wikipedia).]

The Coefficient of Determination. A very popular, but overused, regression diagnostic is the coefficient of determination, denoted R^2: R^2 = 1 − Σ_{i=1}^n (Y_i − Ŷ_i)^2 / Σ_{i=1}^n (Y_i − Ȳ)^2, the proportion of the variation explained by the model. Note that 0 ≤ R^2 ≤ 1 and, in simple linear regression, R^2 = r^2. But: a large R^2 does not imply that the model is a good fit.

Anscombe's Quartet (1973). Each of these data sets has the same marginal means and variances, the same r, the same R^2, and the same β̂_0 and β̂_1.

Next... Now we will begin our discussion of nonparametric regression. Reconsider the SNe example from the start of the lecture.

Example: Type Ia Supernovae. [Figure: a nonparametric estimate overlaid on the distance modulus versus redshift data.]

The Bias-Variance Tradeoff. Nonparametric case: every nonparametric procedure requires a smoothing parameter h. Consider the regression estimator based on local averaging: f̂(x) = mean{Y_i : |X_i − x| ≤ h}. Increasing h yields a smoother estimate f̂.

What is nonparametric? [Diagram: data and assumptions combine to produce the estimate.] In the parametric case, the influence of the assumptions is fixed.

What is nonparametric? [Diagram: data and assumptions combine to produce the estimate, with the balance set by h.] In the nonparametric case, the influence of the assumptions is controlled by h, where h = O(n^{−1/5}).

The Bias-Variance Tradeoff. We can formalize this via an appropriate measure of error in the estimator. Squared error loss: L(f(x), f̂(x)) = (f(x) − f̂(x))^2. Mean squared error, MSE (risk): MSE = R(f(x), f̂(x)) = E(L(f(x), f̂(x))).

The Bias-Variance Tradeoff. The key result: R(f(x), f̂(x)) = bias_x^2 + variance_x, where bias_x = E(f̂(x)) − f(x) and variance_x = Var(f̂(x)). Often written as MSE = BIAS^2 + VARIANCE.

The Bias-Variance Tradeoff. Mean integrated squared error: MISE = ∫ R(f(x), f̂(x)) dx = ∫ bias_x^2 dx + ∫ variance_x dx. Empirical version: (1/n) Σ_{i=1}^n R(f(x_i), f̂(x_i)).

The Bias-Variance Tradeoff. [Figure: risk, squared bias, and variance as functions of the amount of smoothing; squared bias grows and variance shrinks as smoothing increases, with the risk minimized at an intermediate, optimal amount of smoothing.]

The Bias-Variance Tradeoff. For nonparametric procedures, when x is one-dimensional, MISE ≈ c_1 h^4 + c_2/(nh), which is minimized at h = O(n^{−1/5}). Hence MISE = O(n^{−4/5}), whereas for parametric problems, when the model is correct, MISE = O(1/n).
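Filling in the minimization step behind these rates (a standard calculation, written here in LaTeX):

\mathrm{MISE}(h) \approx c_1 h^4 + \frac{c_2}{nh}, \qquad
\frac{d}{dh}\,\mathrm{MISE}(h) = 4 c_1 h^3 - \frac{c_2}{n h^2} = 0
\;\Longrightarrow\; h^5 = \frac{c_2}{4 c_1 n}
\;\Longrightarrow\; h = O\!\left(n^{-1/5}\right),\quad
\mathrm{MISE} = O\!\left(n^{-4/5}\right).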

Next... Now we will begin our discussion of nonparametric regression procedures. But, these nonparametric procedures share a common structure with linear regression. Namely, they are all linear smoothers.

Linear Smoothers. For a linear smoother, f̂(x) = Σ_i Y_i l_i(x) for some weights l_1(x), ..., l_n(x). The vector of weights depends on the target point x.

Linear Smoothers. If f̂(x) = Σ_{i=1}^n l_i(x) Y_i, then Ŷ = (Ŷ_1, ..., Ŷ_n)^T = (f̂(X_1), ..., f̂(X_n))^T = LY, where L is the n × n matrix with (i, j) entry l_j(X_i). The effective degrees of freedom is ν = trace(L) = Σ_{i=1}^n L_ii, which can be interpreted as the number of parameters in the model. Recall that ν = p + 1 for linear regression.

Linear Smoothers. Consider the extreme cases. When l_i(X_j) = 1/n for all i, j, then f̂(X_j) = Ȳ for all j and ν = 1. When l_i(X_j) = δ_ij for all i, j, then f̂(X_j) = Y_j and ν = n.

Next... To begin the discussion of nonparametric procedures, we start with a simple, but naive, approach: binning.

Binning. Divide the x-axis into bins B_1, B_2, ... of width h. f̂ is a step function based on averaging the Y_i's in each bin: for x ∈ B_j, f̂(x) = mean{Y_i : X_i ∈ B_j}. The (arbitrary) choice of the bin boundaries can affect inference, especially when h is large.
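A minimal R sketch of the binning estimator on simulated data (bin origin, bandwidth, and names are illustrative, not from the lecture):

bin_estimate <- function(x0, x, y, h, origin = 0) {
  j <- floor((x0 - origin) / h)            # index of the bin containing x0
  in_bin <- floor((x - origin) / h) == j   # which observations fall in that bin
  mean(y[in_bin])
}

set.seed(4)
x <- runif(300)
y <- cos(3 * x) + rnorm(300, sd = 0.2)
grid <- seq(0, 1, length.out = 200)
fhat <- sapply(grid, bin_estimate, x = x, y = y, h = 0.1)
plot(x, y, col = "grey")
lines(grid, fhat, type = "s", lwd = 2)     # step-function estimate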

Binning. [Figure: WMAP data, binned estimate with h = 15; Cl versus multipole l.]

Binning. [Figure: WMAP data, binned estimate with h = 50; Cl versus multipole l.]

Binning. [Figure: WMAP data, binned estimate with h = 75; Cl versus multipole l.]

Next... A step beyond binning is to take local averages to estimate the regression function.

Local Averaging. For each x, let N_x = [x − h/2, x + h/2]. This is a moving window of length h, centered at x. Define f̂(x) = mean{Y_i : X_i ∈ N_x}. This is like binning but removes the arbitrary boundaries.

Local Averaging. [Figure: WMAP data, binned and local average estimates with h = 15.]

Local Averaging. [Figure: WMAP data, binned and local average estimates with h = 50.]

Local Averaging. [Figure: WMAP data, binned and local average estimates with h = 75.]

Next... Local averaging is a special case of kernel regression.

Kernel Regression. The local average estimator can be written as f̂(x) = Σ_{i=1}^n Y_i K((x − X_i)/h) / Σ_{i=1}^n K((x − X_i)/h), where K(x) = 1 if |x| < 1/2 and 0 otherwise. We can improve on this by using a smoother function K.

Kernels. A kernel is any smooth function K such that K(x) ≥ 0, ∫ K(x) dx = 1, ∫ x K(x) dx = 0, and σ_K^2 ≡ ∫ x^2 K(x) dx > 0. Some commonly used kernels: the boxcar kernel K(x) = (1/2) I(|x| < 1); the Gaussian kernel K(x) = (1/√(2π)) e^{−x^2/2}; the Epanechnikov kernel K(x) = (3/4)(1 − x^2) I(|x| < 1); and the tricube kernel K(x) = (70/81)(1 − |x|^3)^3 I(|x| < 1).

Kernels. [Figure: the shapes of the boxcar, Gaussian, Epanechnikov, and tricube kernels.]

Kernel Regression. We can write this as f̂(x) = Σ_{i=1}^n Y_i K((x − X_i)/h) / Σ_{i=1}^n K((x − X_i)/h) = Σ_i Y_i l_i(x), where l_i(x) = K((x − X_i)/h) / Σ_{j=1}^n K((x − X_j)/h).
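A hedged R sketch of this kernel (Nadaraya-Watson) estimator with a Gaussian kernel, on simulated data; the bandwidth and names are illustrative:

kernel_regression <- function(x0, x, y, h, K = dnorm) {
  w <- K((x0 - x) / h)
  sum(w * y) / sum(w)     # fhat(x0) = sum_i Y_i l_i(x0), with l_i = w_i / sum(w)
}

set.seed(5)
x <- runif(300)
y <- sin(2 * pi * x) + rnorm(300, sd = 0.3)
grid <- seq(0, 1, length.out = 200)
fhat <- sapply(grid, kernel_regression, x = x, y = y, h = 0.05)
plot(x, y, col = "grey")
lines(grid, fhat, lwd = 2)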

Kernel Regression. [Figure: WMAP data, kernel regression estimates with the boxcar, Gaussian, and tricube kernels, h = 15.]

Kernel Regression. [Figure: WMAP data, kernel regression estimates with the boxcar, Gaussian, and tricube kernels, h = 50.]

Kernel Regression. [Figure: WMAP data, kernel regression estimates with the boxcar, Gaussian, and tricube kernels, h = 75.]

Kernel Regression.
MISE ≈ (h^4/4) (∫ x^2 K(x) dx)^2 ∫ (f''(x) + 2 f'(x) g'(x)/g(x))^2 dx + (σ^2 ∫ K^2(x) dx)/(nh) ∫ dx/g(x),
where g(x) is the density of X. What is especially notable is the presence of the term 2 f'(x) g'(x)/g(x), the design bias. Also, the bias is large near the boundary. We can reduce these biases using local polynomials.

Kernel Regression. [Figure: WMAP data at low l, boxcar, Gaussian, and tricube kernel estimates with h = 50. Note the boundary bias.]

Next... The deficiencies of kernel regression can be addressed by considering models which, on small scales, model the regression function as a low order polynomial.

Local Polynomial Regression. Recall polynomial regression: f̂(x) = β̂_0 + β̂_1 x + β̂_2 x^2 + ... + β̂_p x^p, where β̂ = (β̂_0, ..., β̂_p) is obtained by least squares: minimize Σ_{i=1}^n (Y_i − [β_0 + β_1 X_i + β_2 X_i^2 + ... + β_p X_i^p])^2.

Local Polynomial Regression. Local polynomial regression approximates f locally by a different polynomial for every x: f(u) ≈ β_0(x) + β_1(x)(u − x) + β_2(x)(u − x)^2 + ... + β_p(x)(u − x)^p for u near x. Estimate (β̂_0(x), ..., β̂_p(x)) by locally weighted least squares: minimize Σ_{i=1}^n (Y_i − [β_0(x) + β_1(x)(X_i − x) + ... + β_p(x)(X_i − x)^p])^2 K((x − X_i)/h), where K is the kernel. Then f̂(x) = β̂_0(x).

Local Polynomial Regression. Taking p = 0 yields the kernel regression estimator: f̂_n(x) = Σ_{i=1}^n l_i(x) Y_i with l_i(x) = K((x − X_i)/h) / Σ_{j=1}^n K((x − X_j)/h). Taking p = 1 yields the local linear estimator; this is the best all-purpose smoother. The choice of the kernel K is not important; the choice of the bandwidth h is crucial.

Local Polynomial Regression. The local polynomial regression estimate is f̂_n(x) = Σ_{i=1}^n l_i(x) Y_i, where l(x)^T = (l_1(x), ..., l_n(x)) is given by l(x)^T = e_1^T (X_x^T W_x X_x)^{−1} X_x^T W_x, with e_1 = (1, 0, ..., 0)^T, X_x the n × (p+1) matrix whose i-th row is (1, (X_i − x), ..., (X_i − x)^p / p!), and W_x the n × n diagonal matrix with entries K((x − X_i)/h).
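A hedged R sketch of this weight formula for the local linear case (p = 1), evaluated at a single illustrative target point x0 = 0.5, on simulated data:

local_linear_weights <- function(x0, x, h, K = dnorm) {
  Xx <- cbind(1, x - x0)                  # rows (1, X_i - x)
  Wx <- diag(K((x0 - x) / h))             # kernel weights on the diagonal
  e1 <- c(1, 0)
  as.vector(t(e1) %*% solve(t(Xx) %*% Wx %*% Xx) %*% t(Xx) %*% Wx)
}

set.seed(6)
x <- runif(200)
y <- sin(2 * pi * x) + rnorm(200, sd = 0.3)
l <- local_linear_weights(0.5, x, h = 0.1)
fhat_at_half <- sum(l * y)                # fhat(0.5) = sum_i l_i(0.5) Y_i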

Local Polynomial Regression. Note that f̂(x) = Σ_{i=1}^n l_i(x) Y_i is a linear smoother. Define Ŷ = (Ŷ_1, ..., Ŷ_n) with Ŷ_i = f̂(X_i). Then Ŷ = LY, where L is the smoothing matrix with (i, j) entry l_j(X_i). The effective degrees of freedom is ν = trace(L) = Σ_{i=1}^n L_ii.

Theoretical Aside. Why local linear (p = 1) is better than kernel (p = 0) regression: both have approximate variance σ^2(x) ∫ K^2(u) du / (g(x) nh). The kernel estimator has bias h^2 ((1/2) f''(x) + f'(x) g'(x)/g(x)) ∫ u^2 K(u) du, whereas the local linear estimator has asymptotic bias h^2 (1/2) f''(x) ∫ u^2 K(u) du. The local linear estimator is free from design bias. At the boundary, the kernel estimator has asymptotic bias of O(h), while the local linear estimator has bias O(h^2).

Next... Parametric procedures require choosing p, the number of predictors. Nonparametric procedures require choosing the smoothing parameter h. Next we discuss how h can be selected in a data-driven manner via cross-validation.

Choosing p and h. Estimate the risk (1/n) Σ_{i=1}^n E(f̂(x_i) − f(x_i))^2 with the leave-one-out cross-validation score: CV = (1/n) Σ_{i=1}^n (Y_i − f̂_{(−i)}(X_i))^2, where f̂_{(−i)} is the estimator obtained by omitting the i-th pair (X_i, Y_i).

Choosing p and h. Amazing shortcut formula: CV = (1/n) Σ_{i=1}^n ((Y_i − f̂_n(x_i)) / (1 − L_ii))^2. A commonly used approximation is GCV (generalized cross-validation): GCV = (1/n) Σ_{i=1}^n ((Y_i − f̂(x_i)) / (1 − ν/n))^2, with ν = trace(L).
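A hedged R sketch checking the shortcut formula against brute-force leave-one-out, for an ordinary least-squares fit (a linear smoother whose L_ii are the hat values); the data are simulated:

set.seed(7)
n <- 60
x <- runif(n)
y <- 1 + 2 * x + rnorm(n, sd = 0.4)
fit <- lm(y ~ x)

# Shortcut: CV = (1/n) sum( ((Y_i - fhat(x_i)) / (1 - L_ii))^2 )
cv_shortcut <- mean((resid(fit) / (1 - hatvalues(fit)))^2)

# Brute force: refit n times, each time omitting the i-th pair
cv_brute <- mean(sapply(seq_len(n), function(i) {
  fi <- lm(y[-i] ~ x[-i])
  (y[i] - (coef(fi)[1] + coef(fi)[2] * x[i]))^2
}))
all.equal(cv_shortcut, cv_brute)   # TRUE: the shortcut is exact for least squares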

Next... Next, we will go over a bit of how to perform local polynomial regression in R.

Using locfit(). Need to include the locfit library:
> install.packages("locfit")
> library(locfit)
> result = locfit(y ~ x, alpha=c(0, 1.5), deg=1)
y and x are vectors. The second argument to alpha gives the bandwidth (h); the first argument to alpha specifies the nearest-neighbor fraction, an alternative to the bandwidth. fitted(result) gives the fitted values, f̂(x_i); residuals(result) gives the residuals, Y_i − f̂(x_i).

Using locfit(). [Figure: degree=1, n=1000, sigma=0.1. True function x(1 − x) sin(2.1π/(x + 0.2)); estimates using h = 0.005 and the GCV-optimal h = 0.015.]

Using locfit(). [Figure: degree=1, n=1000, sigma=0.1. True function x(1 − x) sin(2.1π/(x + 0.2)); estimates using h = 0.1 and the GCV-optimal h = 0.015.]

Using locfit(). [Figure: degree=1, n=1000, sigma=0.5. True function x + 5 exp(−4x^2); estimates using h = 0.05 and the GCV-optimal h = 0.174.]

Using locfit(). [Figure: degree=1, n=1000, sigma=0.5. True function x + 5 exp(−4x^2); estimates using h = 0.5 and the GCV-optimal h = 0.174.]

Example. [Figure: WMAP data, local linear fit with h = 15 and with the GCV-optimal h = 63.012.]

Example. [Figure: WMAP data, local linear fit with h = 125 and with the GCV-optimal h = 63.012.]

Next... We should quantify the amount of error in our estimator for the regression function. Next we will look at how this can be done using local polynomial regression.

Variance Estimation. Let σ̂^2 = Σ_{i=1}^n (Y_i − f̂(x_i))^2 / (n − 2ν + ν̃), where ν = tr(L) and ν̃ = tr(L^T L) = Σ_{i=1}^n ||l(x_i)||^2. If f is sufficiently smooth, then σ̂^2 is a consistent estimator of σ^2.

Variance Estimation. For the WMAP data, using the local linear fit:
> wmap = read.table("wmap.dat", header=T)
> opth = 63.0
> locfitwmap = locfit(wmap$cl[1:700] ~ wmap$ell[1:700], alpha=c(0, opth), deg=1)
> nu = as.numeric(locfitwmap$dp[6])
> nutilde = as.numeric(locfitwmap$dp[7])
> sigmasqrhat = sum(residuals(locfitwmap)^2)/(700 - 2*nu + nutilde)
> sigmasqrhat
[1] 1122214
But it does not seem reasonable to assume homoscedasticity...

Variance Estimation. Allow σ to be a function of x: Y_i = f(x_i) + σ(x_i) ε_i. Let Z_i = log(Y_i − f(x_i))^2 and δ_i = log ε_i^2. Then Z_i = log(σ^2(x_i)) + δ_i. This suggests estimating log σ^2(x) by regressing the log squared residuals on x.

Variance Estimation.
1. Estimate f(x) with any nonparametric method to get an estimate f̂_n(x).
2. Define Z_i = log(Y_i − f̂_n(x_i))^2.
3. Regress the Z_i's on the x_i's (again using any nonparametric method) to get an estimate q̂(x) of log σ^2(x), and let σ̂^2(x) = e^{q̂(x)}.
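A hedged R sketch of these three steps using locfit, on simulated heteroscedastic data; the bandwidths and variable names are illustrative, not from the lecture:

library(locfit)
set.seed(8)
x <- runif(500)
y <- sin(2 * pi * x) + (0.1 + 0.4 * x) * rnorm(500)   # sigma(x) grows with x

# 1. Estimate f(x) nonparametrically.
fitf <- locfit(y ~ x, alpha = c(0, 0.1), deg = 1)

# 2. Log squared residuals.
z <- log(residuals(fitf)^2)

# 3. Smooth Z on x and exponentiate to get sigma^2(x).
fitz <- locfit(z ~ x, alpha = c(0, 0.3), deg = 1)
sigma2_hat <- exp(fitted(fitz))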

Example. [Figure: WMAP data, log squared residuals with a local linear fit, h = 63.0.]

Example. [Figure: log squared residuals with a local linear fit, h = 130, chosen via GCV.]

Example. [Figure: the resulting estimate σ̂(x), plotted against l.]

Confidence Bands. Recall that f̂(x) = Σ_i Y_i l_i(x), so Var(f̂(x)) = Σ_i σ^2(X_i) l_i^2(x). An approximate 1 − α confidence interval for f(x) is f̂(x) ± z_{α/2} √(Σ_i σ^2(X_i) l_i^2(x)). When σ(x) is smooth, we can approximate √(Σ_i σ^2(X_i) l_i^2(x)) ≈ σ(x) ||l(x)||.

Confidence Bands. Two caveats: 1. f̂ is biased, so this is really an interval for E(f̂(x)); as a result, the bands can miss sharp peaks in the function. 2. Pointwise coverage does not imply simultaneous coverage for all x. The solution for 2 is to replace z_{α/2} with a larger number (Sun and Loader 1994); locfit() does this for you.

More on locfit().
> diaghat = predict.locfit(locfitwmap, where="data", what="infl")
> normell = predict.locfit(locfitwmap, where="data", what="vari")
diaghat will be L_ii, i = 1, 2, ..., n; normell will be ||l(x_i)||, i = 1, 2, ..., n. The Sun and Loader replacement for z_{α/2} is found using kappa0(locfitwmap)$crit.val.
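As a hedged sketch of how these pieces might be assembled into bands, taking normell to be ||l(x_i)|| and kappa0(...)$crit.val to be the simultaneous critical value as described above, and assuming constant variance with sigmasqrhat from the earlier variance-estimation slide:

# Sketch: approximate simultaneous bands fhat(x_i) +/- crit * sigma_hat * ||l(x_i)||
crit  <- kappa0(locfitwmap)$crit.val      # Sun & Loader replacement for z_{alpha/2}
fhat  <- fitted(locfitwmap)
half  <- crit * sqrt(sigmasqrhat) * normell
upper <- fhat + half
lower <- fhat - half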

Example. [Figure: 95% pointwise confidence bands for the WMAP fit, assuming constant variance and using nonconstant variance.]

Example. [Figure: 95% simultaneous confidence bands for the WMAP fit, assuming constant variance and using nonconstant variance.]

Next... We discuss the use of regression splines. These are a well-motivated choice of basis for constructing regression functions.

Basis Methods. Idea: expand f as f(x) = Σ_j β_j ψ_j(x), where ψ_1(x), ψ_2(x), ... are specially chosen, known functions. Then estimate the β_j and set f̂(x) = Σ_j β̂_j ψ_j(x). Here we consider a popular version: splines.

Splines and Penalization. Define f̂_n to be the function f that minimizes M(λ) = Σ_i (Y_i − f(x_i))^2 + λ ∫ (f''(x))^2 dx. λ = 0 gives f̂_n(X_i) = Y_i (no smoothing); λ = ∞ gives f̂_n(x) = β̂_0 + β̂_1 x (linear); 0 < λ < ∞ gives a cubic spline with knots at the X_i. A cubic spline is a continuous function f such that (1) f is a cubic polynomial between the X_i's, and (2) f has continuous first and second derivatives at the X_i's.

Basis for Splines. Define (z)_+ = max{z, 0} and N = n + 4, and let ψ_1(x) = 1, ψ_2(x) = x, ψ_3(x) = x^2, ψ_4(x) = x^3, ψ_5(x) = (x − X_1)^3_+, ψ_6(x) = (x − X_2)^3_+, ..., ψ_N(x) = (x − X_n)^3_+. These functions form a basis for the splines: we can write f(x) = Σ_{j=1}^N β_j ψ_j(x). (For numerical calculations it is actually more efficient to use other spline bases.) We can thus write f̂_n(x) = Σ_{j=1}^N β̂_j ψ_j(x).

Basis for Splines. We can now rewrite the minimization as: minimize (Y − Ψβ)^T (Y − Ψβ) + λ β^T Ω β, where Ψ_ij = ψ_j(X_i) and Ω_jk = ∫ ψ_j''(x) ψ_k''(x) dx. The value of β that minimizes this is β̂ = (Ψ^T Ψ + λΩ)^{−1} Ψ^T Y. The smoothing spline f̂_n(x) is a linear smoother, that is, there exist weights l(x) such that f̂_n(x) = Σ_{i=1}^n Y_i l_i(x).
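A hedged R sketch of this penalized fit using a truncated-power cubic basis with 10 illustrative knots (rather than knots at every data point) and a numerical approximation to Ω; the data are simulated and λ is arbitrary:

set.seed(9)
n <- 200
x <- sort(runif(n))
y <- sin(2 * pi * x) + rnorm(n, sd = 0.3)
knots <- quantile(x, probs = seq(0.1, 0.9, length.out = 10))

basis    <- function(u) cbind(1, u, u^2, u^3,
                              outer(u, knots, function(a, b) pmax(a - b, 0)^3))
basis_d2 <- function(u) cbind(0, 0, 2, 6 * u,
                              outer(u, knots, function(a, b) 6 * pmax(a - b, 0)))

Psi  <- basis(x)                                  # Psi_ij = psi_j(X_i)
grid <- seq(min(x), max(x), length.out = 1000)
D2   <- basis_d2(grid)
Omega <- t(D2) %*% D2 * diff(grid)[1]             # Riemann approx. of int psi_j'' psi_k''

lambda   <- 1e-4
beta_hat <- solve(t(Psi) %*% Psi + lambda * Omega, t(Psi) %*% y)
fhat     <- basis(grid) %*% beta_hat
plot(x, y, col = "grey")
lines(grid, fhat, lwd = 2)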

Basis for Splines. [Figure: the truncated power basis functions with 5 knots.]

Basis for Splines. [Figure: the B-spline basis over the same span, 5 knots.]

Smoothing Splines in R.
> smosplresult = smooth.spline(x, y, cv=FALSE, all.knots=TRUE)
If cv=TRUE, then cross-validation is used to choose λ; if cv=FALSE, GCV is used. If all.knots=TRUE, then knots are placed at all data points; if all.knots=FALSE, then set nknots to specify the number of knots to be used. Using fewer knots eases the computational cost. predict(smosplresult) gives the fitted values.

Example. [Figure: WMAP data, smoothing spline fit with λ chosen using GCV.]

Next... Finally, we address the challenges of fitting regression models when the predictor x is of large dimension.

Recall: The Bias-Variance Tradeoff. For nonparametric procedures, when x is one-dimensional, MISE ≈ c_1 h^4 + c_2/(nh), which is minimized at h = O(n^{−1/5}). Hence MISE = O(n^{−4/5}), whereas for parametric problems, when the model is correct, MISE = O(1/n).

The Curse of Dimensionality. For nonparametric procedures, when x is d-dimensional, MISE ≈ c_1 h^4 + c_2/(n h^d), which is minimized at h = O(n^{−1/(4+d)}). Hence MISE = O(n^{−4/(4+d)}), whereas for parametric problems, when the model is correct, we still have MISE = O(1/n).

Multiple Regression Y = f(x 1,X 2,...,X d )+ǫ The curse of dimensionality: Optimal rate of convergence for d = 1 is n 4/5. In d dimensions the optimal rate of convergence is is n 4/(4+d). Thus, the sample size m required for a d-dimensional problem to have the same accuracy as a sample size n in a one-dimensional problem is m n cd where c = (4+d)/(5d) > 0. To maintain a given degree of accuracy of an estimator, the sample size must increase exponentially with the dimension d. Put another way, confidence bands get very large as the dimension d increases.

Multiple Local Linear. Given a nonsingular, positive definite d × d bandwidth matrix H, define K_H(x) = |H|^{−1/2} K(H^{−1/2} x). Often one scales each covariate to have the same mean and variance and then uses the kernel h^{−d} K(||x||/h), where K is any one-dimensional kernel. Then there is a single bandwidth parameter h.

Multiple Local Linear. At a target value x = (x_1, ..., x_d)^T, the local sum of squares is Σ_{i=1}^n w_i(x) (Y_i − a_0 − Σ_{j=1}^d a_j (x_ij − x_j))^2, where w_i(x) = K(||x_i − x||/h). The estimator is f̂_n(x) = â_0, where â = (â_0, ..., â_d)^T is the value of a = (a_0, ..., a_d)^T that minimizes the weighted sum of squares.

Multiple Local Linear. The solution is â = (X_x^T W_x X_x)^{−1} X_x^T W_x Y, where X_x is the n × (d+1) matrix whose i-th row is (1, (x_i1 − x_1), ..., (x_id − x_d)), and W_x is the diagonal matrix whose (i, i) element is w_i(x).

A Compromise: Additive Models. Y = α + Σ_{j=1}^d f_j(x_j) + ε. Usually take α̂ = Ȳ. Then estimate the f_j's by backfitting:
1. Set α̂ = Ȳ and f̂_1 = ... = f̂_d = 0.
2. Iterate until convergence: for j = 1, ..., d, compute Ỹ_i = Y_i − α̂ − Σ_{k ≠ j} f̂_k(X_i), i = 1, ..., n; apply a smoother to Ỹ_i on X_j to obtain f̂_j; then set f̂_j(x) equal to f̂_j(x) − n^{−1} Σ_{i=1}^n f̂_j(x_i). The last step ensures that Σ_i f̂_j(X_i) = 0 (identifiability).
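A hedged R sketch of backfitting with d = 2, using loess as the one-dimensional smoother (an illustrative choice; the lecture does not prescribe a particular smoother) on simulated data:

set.seed(10)
n <- 400
x1 <- runif(n); x2 <- runif(n)
y  <- sin(2 * pi * x1) + (x2 - 0.5)^2 + rnorm(n, sd = 0.2)

alpha <- mean(y)                          # alpha_hat = Ybar
f1 <- rep(0, n); f2 <- rep(0, n)
smooth_on <- function(x, r) fitted(loess(r ~ x, span = 0.3))

for (iter in 1:20) {                      # iterate until (approximate) convergence
  r1 <- y - alpha - f2                    # partial residuals for f1
  f1 <- smooth_on(x1, r1); f1 <- f1 - mean(f1)   # center for identifiability
  r2 <- y - alpha - f1                    # partial residuals for f2
  f2 <- smooth_on(x2, r2); f2 <- f2 - mean(f2)
}
fitted_additive <- alpha + f1 + f2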

To Sum Up...
- Classic linear regression is computationally and theoretically simple.
- Nonparametric procedures for regression offer increased flexibility.
- Careful consideration of MISE suggests the use of local polynomial fits.
- Smoothing parameters can be chosen objectively.
- High-dimensional data present a challenge.