Basis Penalty Smoothers
Simon Wood, Mathematical Sciences, University of Bath, U.K.

Estimating functions
It is sometimes useful to estimate smooth functions from data without being too precise about their functional form. When estimating such unknown functions there are two desirable properties:
1. a reasonable approximation to the truth, without too much mis-specification bias;
2. reasonable computational efficiency.
The basis-penalty approach is a reasonable compromise between these objectives.

Bases and penalties
Write the function to be estimated as
f(x) = Σ_{k=1}^{K} β_k b_k(x),
where the β_k are coefficients to be estimated, and the b_k(x) are known basis functions chosen for convenience.
PROBLEM: if K is large enough to definitely avoid underfitting, it will probably overfit!
SOLUTION: during fitting, penalize departure from smoothness with a penalty, e.g.
P(f) = ∫ f''(x)^2 dx = β^T S β.

Why does ∫ f''(x)^2 dx = β^T S β?
We can write f'' in terms of β and a vector of second derivatives of the basis functions,
f''(x) = Σ_k β_k b_k''(x) = β^T b''(x),
where b''(x) = [b_1''(x), b_2''(x), ...]^T. Then the penalty can be re-written
∫ f''(x)^2 dx = ∫ β^T b''(x) b''(x)^T β dx = β^T { ∫ b''(x) b''(x)^T dx } β = β^T S β,
where S = ∫ b''(x) b''(x)^T dx, a matrix of fixed coefficients.
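As a quick numerical check of this identity (my own toy example, not from the slides), take the simple polynomial basis b_k(x) = x^(k-1) on [0,1], for which both sides are easy to compute in R:

## Toy check of  integral f''(x)^2 dx = beta' S beta  for b_k(x) = x^(k-1) on [0,1]
bdd <- function(x, k) if (k < 3) 0*x else (k-1)*(k-2)*x^(k-3)  ## b_k''(x)
K <- 4; S <- matrix(0, K, K)
for (i in 1:K) for (j in 1:K)
  S[i, j] <- integrate(function(x) bdd(x, i) * bdd(x, j), 0, 1)$value
beta <- c(0.3, -1, 2, 0.5)                       ## arbitrary coefficients
fdd2 <- function(x) (2*beta[3] + 6*beta[4]*x)^2  ## f''(x)^2 for this basis
c(integrate(fdd2, 0, 1)$value, t(beta) %*% S %*% beta)  ## the two values agree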

Basis & Penalty example y 0.0 0.2 0.4 0.6 0.8 1.0 basis functions y 2 0 2 4 6 8 10 f"(x)2 dx = 377000 0.0 0.2 0.4 0.6 0.8 1.0 x 0.0 0.2 0.4 0.6 0.8 1.0 x y 2 0 2 4 6 8 10 f"(x)2 dx = 63000 y 2 0 2 4 6 8 10 f"(x)2 dx = 200 0.0 0.2 0.4 0.6 0.8 1.0 x 0.0 0.2 0.4 0.6 0.8 1.0 x

A simple basis
Before considering smoothing in detail, consider linearly interpolating x, y data. [Figure: data points with the piecewise linear interpolant in red.] The interpolant (red) can be written f(x) = Σ_k β_k b_k(x), where the b_k are tent functions: there is one per data point.

The tent basis
The kth tent function is 1 at x_k and descends linearly to zero at x_{k±1}; elsewhere it is zero. The full set looks like this... [Figure: the complete set of tent functions b_k(x).] Under this definition of b_k(x), β_k = y_k.

The interpolating basis
So the interpolating function is represented by multiplying each tent function by its coefficient (β_k = y_k) and summing the results... [Figure: the scaled tent functions and their sum, the interpolant.] Given the basis functions and coefficients, we can predict the value of the interpolant anywhere in the range of the x values.

Prediction matrix
Suppose that we have a series of points x_k, y_k to interpolate. The x_k values define the tent basis, and the y_k give the coefficients β_k. Now suppose that we want to evaluate the interpolant at a series of values x_i. If f = [f(x_1), f(x_2), ...]^T, then f = Xβ, where the prediction matrix has elements X_ij = b_j(x_i), i.e.

X = ( b_1(x_1)  b_2(x_1)  b_3(x_1)  ... )
    ( b_1(x_2)  b_2(x_2)  b_3(x_2)  ... )
    (    ...       ...       ...        )

Regression with a basis
Suppose that we want to model these data: [Figure: the mcycle data, acceleration (accel) against time.] One model is a_i = f(t_i) + ε_i, where f is an unknown function. We could set f(t) = Σ_k β_k b_k(t), where the b_k are tent functions based on a set of knots t_k spaced evenly through the range of observed times.

Regression model form
Writing the model in vector form we have a = f + ε, where f^T = [f(t_1), f(t_2), ...]. Let X be a prediction matrix produced using the tent basis, so that X_ij = b_j(t_i). The model becomes
a = Xβ + ε.
So we have a linear model, with model matrix X, which can be estimated by standard linear modelling methods. Let's try it out in R.

A simple regression smoother in R
First an R function for producing tent functions:

tf <- function(x, xk=seq(0,1,length=10), k=1) {
## generate the kth tent function, defined by knot sequence xk, evaluated at x
  yk <- xk*0; yk[k] <- 1     ## value 1 at knot k, 0 at all other knots
  approx(xk, yk, x)$y        ## linear interpolation gives the tent function
}

And now a function to use it for making prediction/model matrices:

tf.X <- function(x, xk=seq(0,1,length=10)) {
## tent function basis (model/prediction) matrix, given knot sequence xk
  nk <- length(xk); n <- length(x)
  X <- matrix(NA, n, nk)
  for (i in 1:nk) X[,i] <- tf(x, xk, i)
  X
}
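A small usage sketch (my own check, not on the original slides): if the knots are placed exactly at the data x values, the basis matrix reduces to the identity and the basis interpolates the data, as claimed on the earlier slides.

x <- c(0, 2, 3, 5, 7, 8)/8   ## some example x values, also used as knots
y <- c(1, 3, 2, 4, 2, 3)     ## corresponding y values
X <- tf.X(x, xk=x)           ## square basis matrix evaluated at the knots
range(X %*% y - y)           ## essentially zero: beta = y interpolates exactly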

Fitting the mcycle data

library(MASS)                             ## for the mcycle data
K <- 40                                   ## basis dimension
t0 <- min(mcycle$times); t1 <- max(mcycle$times)
tk <- seq(t0, t1, length=K)               ## knot sequence
X <- tf.X(x=mcycle$times, xk=tk)          ## model matrix
b <- coef(lm(mcycle$accel ~ X - 1))       ## fit model
Xp <- tf.X(x=0:120/2, xk=tk)              ## prediction matrix
plot(mcycle$times, mcycle$accel, ylab="accel", xlab="time")
lines(0:120/2, Xp %*% b, col=2, lwd=2)

[Figure: the resulting fit overlaid on the mcycle data.] Far too wiggly! Reduce K.

Reducing K
After some experimentation, K = 15 seems reasonable (a sketch of the refit follows below)... [Figure: the K = 15 fit.] ...but selecting K this way is a bit fiddly and ad hoc:
1. Models with different K are not nested, so we can't use hypothesis testing.
2. We have little choice but to fit with every possible K value if AIC is to be used.
3. It is very difficult to generalize this approach to model selection when the model contains more than one function.
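A minimal sketch of that K = 15 refit (my own code, reusing t0, t1, tf.X and the plot from the previous slide; the names tk15, X15 and b15 are mine, and K is deliberately left at 40 so the later slides still run unchanged):

tk15 <- seq(t0, t1, length=15)            ## coarser knot sequence, K = 15
X15 <- tf.X(x=mcycle$times, xk=tk15)      ## model matrix for the smaller basis
b15 <- coef(lm(mcycle$accel ~ X15 - 1))   ## refit
lines(0:120/2, tf.X(x=0:120/2, xk=tk15) %*% b15, col=4, lwd=2)  ## add fit to plot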

Smoothing
Using the basis for regression was OK, but there are some problems in choosing K and in deciding where to put the knots t_k. To overcome these, consider using the basis for smoothing:
1. Make K large enough that bias is negligible.
2. Use evenly spaced knots.
3. To avoid overfit, penalize the wiggliness of f using, e.g.,
P(f) = Σ_{k=2}^{K-1} (β_{k-1} - 2β_k + β_{k+1})^2.

Evaluating the penalty
To get the penalty into convenient form, note that

Dβ = ( β_1 - 2β_2 + β_3 )
     ( β_2 - 2β_3 + β_4 )
     (        ...       )

where

D = ( 1 -2  1  0  .  . )
    ( 0  1 -2  1  .  . )
    ( .  .  .  .  .  . )

by definition of D. Hence
P(f) = β^T D^T D β = β^T S β,
where S = D^T D, by definition of S.
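A one-line numerical check of this identity (a sketch, not part of the slides; note that diff(diag(K), differences=2) builds the same D as the diff(diff(diag(K))) used in the fitting code below):

K <- 6; beta <- rnorm(K)                ## small arbitrary example
D <- diff(diag(K), differences=2)       ## (K-2) x K second-difference matrix
c(sum(diff(beta, differences=2)^2),     ## penalty as a sum of squared 2nd differences
  t(beta) %*% t(D) %*% D %*% beta)      ## penalty in matrix form: identical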

Penalized fitting
Now the penalized least squares estimates are
β̂ = argmin_β Σ_i {a_i - f(t_i)}^2 + λ P(f),
where the smoothing parameter λ controls the fit-wiggliness tradeoff. For computational purposes this is re-written as
β̂ = argmin_β ||a - Xβ||^2 + λ β^T S β.
In fact we can re-write it once more as
β̂ = argmin_β ||ã - X̃β||^2,
where ã = (a^T, 0^T)^T is the response vector augmented with zeros and X̃ = (X^T, √λ D^T)^T is the model matrix with the rows of √λ D appended
... least squares for an augmented linear model!

Penalized mcycle fit

D <- diff(diff(diag(K)))                 ## t(D)%*%D is the penalty coefficient matrix
sp <- 2                                  ## square root of the smoothing parameter
XD <- rbind(X, D*sp)                     ## augmented model matrix
y0 <- c(mcycle$accel, rep(0, nrow(D)))   ## augmented response
b <- lm(y0 ~ XD - 1)                     ## fit augmented model
plot(mcycle$times, mcycle$accel, ylab="accel", xlab="time")
lines(0:120/2, Xp %*% coef(b), col=2, lwd=2)

[Figure: the penalized fit overlaid on the mcycle data.]
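As an optional check (a sketch; the names lambda, S and bd are mine), the coefficients from the augmented regression match the direct penalized least squares formula used on the later slides, (X^T X + λS)^{-1} X^T y with λ = sp^2:

lambda <- sp^2                    ## the smoothing parameter itself
S <- t(D) %*% D                   ## penalty matrix
bd <- solve(t(X) %*% X + lambda * S, t(X) %*% mcycle$accel)
range(bd - coef(b))               ## essentially zero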

Diversion: avoiding the penalty
The penalized fit is much nicer than the regression fit, but much theory will be needed to select λ and to account for the fact that we penalized. Couldn't we avoid the penalty? To some extent the answer is yes! Given the penalty, we can re-parameterize in such a way that the basis functions are orthogonal and arranged in decreasing order of penalization, i.e. we can express the original smooth in terms of a set of basis functions that are orthonormal, and where the kth is smoother than the (k-1)th according to the penalty. We can use such a basis unpenalized, and perform smoothness selection by deciding how many basis functions to drop, starting from the first. Formal hypothesis testing or AIC can be used to aid the decision.

The natural basis
Let a smoother have model matrix X and penalty matrix S. Form the QR decomposition X = QR, followed by the symmetric eigen-decomposition
R^{-T} S R^{-1} = U Λ U^T.
Define P = U^T R and re-parameterize so that the new coefficients are β' = Pβ. In the new parameterization the model matrix is X' = QU, which has orthonormal columns (X = X'P). The penalty matrix is now the diagonal matrix Λ, with its eigenvalues in decreasing order down the leading diagonal: Λ_ii is the wiggliness of the ith basis function.

Natural basis regression smoothing

qrx <- qr(X)                                ## QR decomposition of X
RD <- forwardsolve(t(qr.R(qrx)), t(D))      ## R^{-T} D^T
es <- eigen(RD %*% t(RD), symmetric=TRUE)   ## eigen-decomposition of R^{-T} S R^{-1}
Xs <- qr.qy(qrx, rbind(es$vectors,          ## natural model matrix X' = QU
            matrix(0, nrow(X)-nrow(RD), nrow(RD))))
b <- lm(mcycle$accel ~ Xs - 1)              ## unpenalized full fit
summary(b)
...
Coefficients:
       Estimate Std. Error t value Pr(>|t|)
Xs1      0.8267    23.5508   0.035   0.9721
Xs2     -8.2636    23.5508  -0.351   0.7265
...
Xs30    -5.8426    23.5508  -0.248   0.8046
Xs31   -10.4568    23.5508  -0.444   0.6581
Xs32  -126.5964    23.5508  -5.375 5.64e-07 ***
Xs33   117.9980    23.5508   5.010 2.58e-06 ***
Xs34   136.8142    23.5508   5.809 8.70e-08 ***
Xs35   260.9345    23.5508  11.080  < 2e-16 ***
...

Since the basis is orthogonal, I can legitimately look for the cut-off where the p-values switch from high to low. This suggests deleting the first 31 basis functions (out of 40).
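A quick sanity check (my addition, not on the slides): the columns of the re-parameterized model matrix really are orthonormal, so the unpenalized coefficient estimates are simply X'^T y.

range(crossprod(Xs) - diag(ncol(Xs)))     ## numerically zero: Xs has orthonormal columns
range(coef(b) - t(Xs) %*% mcycle$accel)   ## so the fitted coefficients are just Xs'y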

NP refit

> b0 <- lm(mcycle$accel ~ Xs[, -c(1:31)] - 1)   ## drop the 31 wiggliest basis functions
> anova(b0, b)
Analysis of Variance Table
Model 1: mcycle$accel ~ Xs[, -c(1:31)] - 1
Model 2: mcycle$accel ~ Xs - 1
  Res.Df   RSS Df Sum of Sq      F Pr(>F)
1    124 62336
2     93 51582 31     10755 0.6255 0.9304
> AIC(b0, b)
   df      AIC
b0 10 1215.381
b  41 1252.194

The reduced model b0 is clearly better. Since this is just an ordinary linear model, it is perfectly legitimate to use AIC and F-ratio testing here.

The NP fit

plot(mcycle$times, mcycle$accel, ylab="accel", xlab="time")
lines(mcycle$times, fitted(b0), col=3, lwd=2)

[Figure: the reduced natural-basis fit overlaid on the data.] ... a nice smooth fit, without having to leave the inferential framework of standard linear models. This would work just as well in a GLM. However, with multiple correlated predictors, model selection would not be so straightforward.

Back to penalized fitting!

Effective Degrees of Freedom
Penalization restricts the freedom of the coefficients to vary, so with 40 coefficients we have fewer than 40 effective degrees of freedom (EDF). How the penalty restricts the coefficients is best seen in the natural parameterization (let y be the response). Without penalization the coefficient estimates would be β' = X'^T y. With penalization they are β̂' = (I + λΛ)^{-1} X'^T y, i.e. β̂'_j = β'_j / (1 + λΛ_jj). So (1 + λΛ_jj)^{-1} is the shrinkage factor for the jth coefficient; it is bounded between 0 and 1, and gives the EDF for β̂'_j. So the total EDF is Σ_j (1 + λΛ_jj)^{-1} = tr(F), where F = (X^T X + λS)^{-1} X^T X is the EDF matrix.
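A sketch of this computation in R (using objects from the earlier code: es$values holds the Λ_jj and sp the square-root smoothing parameter; the names lambda and edf are mine). The total should match the tr(F) value computed on the confidence interval slide below.

lambda <- sp^2                        ## smoothing parameter, lambda = sp^2
edf <- 1/(1 + lambda * es$values)     ## per-coefficient shrinkage factors / EDFs
sum(edf)                              ## total EDF = tr(F)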

Smoothing bias
The formal expression for the penalized least squares estimates is
β̂ = (X^T X + λS)^{-1} X^T y.
Hence
E(β̂) = (X^T X + λS)^{-1} X^T E(y) = (X^T X + λS)^{-1} X^T Xβ = Fβ ≠ β.
Smooths are biased! This is the price paid for reducing mis-specification bias when you don't know the functional form in advance. The bias makes frequentist inference difficult (including bootstrapping!).

A Bayesian smoothing model
We penalize because we think that the truth is more likely to be smooth than wiggly. This can be formalized by putting a prior on wiggliness,
wiggliness prior ∝ exp{-λ β^T S β / (2σ^2)},
which is equivalent to the prior
β ~ N(0, S^- σ^2/λ),
where S^- is a generalized inverse of S. From the model, y | β ~ N(Xβ, Iσ^2), so from Bayes' rule
β | y ~ N(β̂, (X^T X + λS)^{-1} σ^2).
Finally, σ̂^2 = ||y - Xβ̂||^2 / {n - tr(F)} is a useful estimator of σ^2.

Using the β|y distribution
It is very cheap to simulate from the distribution of β|y, allowing easy Bayesian inference about any quantity derived from the model. Confidence/credible intervals for the smooth can be constructed without simulation; across the function they have very good frequentist properties (Nychka, 1988).
Nychka's idea: construct an interval C(x) so that, if x is chosen randomly, Pr{f(x) ∈ C(x)} = 0.95 (say). Choosing x randomly turns the bias f̂(x) - f(x) into a zero mean random variable. Basing C(x) on the sum of sampling variability and random bias yields (approximately) the Bayesian intervals. So, by construction, the Bayesian intervals have close to nominal across-the-function coverage.

P-values
You might want to test whether a smooth could be replaced by a constant (or other parametric term). This sits uncomfortably with the Bayesian approach, but there are several alternatives:
1. Functions in the null space of the penalty are not penalized by smoothing, so under the null the parameter estimates are unbiased and a purely frequentist approach can be taken; however, the correct degrees of freedom to use are problematic.
2. Given Nychka's result on the good frequentist properties of the Bayesian intervals, it is possible to try to invert the intervals to obtain p-values.
3. If testing is really a key part of the analysis, it may be better simply to use unpenalized models, via the natural parameterization trick.

Adding a CI to the mcycle fit

V <- solve(t(XD) %*% XD)             ## (X'X + lambda*S)^{-1}
ldf <- rowSums(V * (t(X) %*% X))     ## diag(F), the coefficient-wise EDFs
n <- nrow(X)
sig2 <- sum(residuals(b)[1:n]^2)/(n - sum(ldf))   ## sigma^2 estimate
V <- V * sig2                        ## posterior covariance matrix of beta|y
## get s.e. of predicted curve
f.se <- rowSums((Xp %*% V) * Xp)^.5  ## sqrt(diag(Xp%*%V%*%t(Xp)))
t <- 0:120/2; f <- Xp %*% coef(b)
lines(t, f + 2*f.se, col=3, lwd=2)
lines(t, f - 2*f.se, col=3, lwd=2)
title(paste("EDF =", round(sum(ldf), digits=2)))

[Figure: the penalized fit with approximate 95% credible limits; EDF = 13.7.]
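Given V and Xp from the code above, the posterior simulation mentioned two slides back really is very cheap. A minimal sketch (my own illustration, not from the slides; tk, t0 and t1 come from the earlier fitting code, and the names br, tp, fr and ci are mine):

library(MASS)                                  ## for mvrnorm
br <- mvrnorm(1000, coef(b), V)                ## 1000 draws from beta | y
tp <- seq(t0, t1, length=100)                  ## prediction grid inside the knot range
fr <- tf.X(tp, xk=tk) %*% t(br)                ## corresponding draws of the curve f
ci <- apply(fr, 1, quantile, probs=c(0.025, 0.975))  ## pointwise 95% credible band
lines(tp, ci[1,], col=4, lty=2)                ## add the band to the existing plot
lines(tp, ci[2,], col=4, lty=2)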

Time to consider λ selection...