Smoothness Selection. Simon Wood Mathematical Sciences, University of Bath, U.K.



1 Smoothness Selection. Simon Wood, Mathematical Sciences, University of Bath, U.K.

2 Smoothness selection approaches

The smoothing model $y_i = f(x_i) + \epsilon_i$, $\epsilon_i \sim N(0, \sigma^2)$, is represented via a basis expansion of $f$, with coefficients $\beta$. The $\beta$ estimates are

$$\hat\beta = \arg\min_\beta \|y - X\beta\|^2 + \lambda \beta^T S \beta$$

where $X$ is the model matrix derived from the basis, and $S$ is the wiggliness penalty matrix. $\lambda$ controls smoothness: how should it be chosen? There are 3 main statistical approaches:

1. Choose $\lambda$ to minimize error in predicting new data.
2. Treat smooths as random effects, following the Bayesian smoothing model, and estimate $\lambda$ as a variance parameter using a marginal likelihood approach.
3. Go fully Bayesian by completing the Bayesian model with a prior on $\lambda$ (requires simulation and not pursued here).
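As a minimal sketch of the penalized fit, the estimate can be computed by solving $(X^T X + \lambda S)\beta = X^T y$. Everything concrete here is an assumption made for illustration, not something taken from the slides: the simulated sine data, the piecewise-linear "tent" basis, the second-difference penalty, and the helper names `fit` and `objective`.

```python
import numpy as np

# Simulated data (assumption: sine truth plus Gaussian noise).
n, k = 100, 20
x = np.linspace(0, 1, n)
rng = np.random.default_rng(0)
y = np.sin(2 * np.pi * x) + rng.normal(0, 0.3, n)

# Model matrix X: column j is a piecewise-linear tent function centred on knot j.
knots = np.linspace(0, 1, k)
X = np.column_stack([np.interp(x, knots, e) for e in np.eye(k)])

# Wiggliness penalty S = D^T D, with D taking second differences of coefficients.
D = np.diff(np.eye(k), n=2, axis=0)
S = D.T @ D

def fit(lam):
    """Minimizer of ||y - X b||^2 + lam * b^T S b."""
    return np.linalg.solve(X.T @ X + lam * S, X.T @ y)

def objective(beta, lam):
    """Penalized least squares objective at beta."""
    r = y - X @ beta
    return r @ r + lam * beta @ S @ beta

beta_hat = fit(1.0)
mu_hat = X @ beta_hat   # fitted values
```

Increasing `lam` pulls the coefficients towards a straight line (zero second differences), while `lam` near zero approaches the unpenalized least squares fit.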

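3 Prediction error: $C_p$/UBRE

Suppose $\sigma^2$ is known, and let $A = X(X^T X + \lambda S)^{-1} X^T$. $\hat\mu = Ay$ where $E(y) = \mu$, so consider

$$\begin{aligned}
\|\mu - \hat\mu\|^2 &= \|\mu - Ay\|^2 = \|y - Ay - \epsilon\|^2 \\
&= \|y - Ay\|^2 + \epsilon^T\epsilon - 2\epsilon^T(y - Ay) \\
&= \|y - Ay\|^2 + \epsilon^T\epsilon - 2\epsilon^T(\mu + \epsilon) + 2\epsilon^T A(\mu + \epsilon) \\
&= \|y - Ay\|^2 - \epsilon^T\epsilon - 2\epsilon^T\mu + 2\epsilon^T A\mu + 2\epsilon^T A\epsilon
\end{aligned}$$

Hence

$$E\|\mu - \hat\mu\|^2 = E\|y - Ay\|^2 - n\sigma^2 + 2\sigma^2\,\mathrm{tr}(A)$$

Estimating $E\|y - Ay\|^2$ by $\|y - Ay\|^2$ yields

$$C_p = \|y - Ay\|^2 - n\sigma^2 + 2\sigma^2\,\mathrm{tr}(A)$$

Can choose $\lambda$ to minimize $C_p$.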
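The $C_p$ criterion can be sketched numerically as follows. The setup is assumed for illustration only: simulated data with a known `sigma`, a tent-function basis, a second-difference penalty, and the hypothetical helper names `influence` and `cp`. Because the data are simulated, `mu` and `eps` are available, which makes it possible to check the algebraic identity behind the derivation directly.

```python
import numpy as np

n, k, sigma = 100, 20, 0.3
x = np.linspace(0, 1, n)
rng = np.random.default_rng(1)
mu = np.sin(2 * np.pi * x)        # true mean, known only because we simulate
eps = rng.normal(0, sigma, n)
y = mu + eps

knots = np.linspace(0, 1, k)
X = np.column_stack([np.interp(x, knots, e) for e in np.eye(k)])
D = np.diff(np.eye(k), n=2, axis=0)
S = D.T @ D

def influence(lam):
    """Influence (hat) matrix A = X (X^T X + lam S)^{-1} X^T."""
    return X @ np.linalg.solve(X.T @ X + lam * S, X.T)

def cp(lam):
    """C_p / UBRE score for known sigma^2."""
    A = influence(lam)
    r = y - A @ y
    return r @ r - n * sigma**2 + 2 * sigma**2 * np.trace(A)

# Crude grid search over lambda (a real implementation would optimize properly).
lams = 10.0 ** np.arange(-4, 5)
scores = np.array([cp(lam) for lam in lams])
lam_cp = lams[scores.argmin()]
```

A grid search is used only to keep the sketch short; the point is that each evaluation needs nothing beyond the residual sum of squares and $\mathrm{tr}(A)$.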

4 $\sigma^2$ unknown: cross validation

[Figure: three panels of fitted smooths, labelled "λ too high", "λ about right" and "λ too low".]

1. Choose $\lambda$ to try to minimize the error predicting new data.
2. Minimize the average error in predicting single datapoints omitted from the fit. Each datum is left out once in the average.
3. It turns out that

$$V_o(\lambda) = \frac{1}{n}\sum_i \left(y_i - \hat\mu_i^{[-i]}\right)^2 = \frac{1}{n}\sum_i \frac{(y_i - \hat\mu_i)^2}{(1 - A_{ii})^2}$$
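The equality between the leave-one-out definition and the single-fit formula can be checked numerically. This is a sketch under assumed choices (simulated data, tent basis, second-difference penalty); the function names `ocv_shortcut` and `ocv_brute` are made up for this example.

```python
import numpy as np

n, k = 60, 10
x = np.linspace(0, 1, n)
rng = np.random.default_rng(2)
y = np.sin(2 * np.pi * x) + rng.normal(0, 0.3, n)

knots = np.linspace(0, 1, k)
X = np.column_stack([np.interp(x, knots, e) for e in np.eye(k)])
D = np.diff(np.eye(k), n=2, axis=0)
S = D.T @ D

def ocv_shortcut(lam):
    """OCV from one fit: mean of (residual_i / (1 - A_ii))^2."""
    A = X @ np.linalg.solve(X.T @ X + lam * S, X.T)
    return np.mean(((y - A @ y) / (1 - np.diag(A))) ** 2)

def ocv_brute(lam):
    """OCV by actually refitting n times, each time omitting one datum."""
    errs = []
    for i in range(n):
        keep = np.arange(n) != i
        Xi, yi = X[keep], y[keep]
        beta = np.linalg.solve(Xi.T @ Xi + lam * S, Xi.T @ yi)
        errs.append(y[i] - X[i] @ beta)   # prediction error for the omitted point
    return np.mean(np.square(errs))

v_fast, v_slow = ocv_shortcut(0.5), ocv_brute(0.5)
```

The two values agree up to rounding, but the shortcut needs one fit instead of `n`.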

5 OCV not invariant

OCV is not invariant in an odd way. If $Q$ is orthogonal then fitting objective

$$\|Qy - QX\beta\|^2 + \lambda\beta^T S\beta$$

yields identical inferences about $\beta$ as the original objective, but it gives a different $V_o$.

[Figure: plot with axes labelled "ocv" and "edf".]
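The lack of invariance is easy to see numerically. In the sketch below, a random orthogonal `Q` is drawn via a QR decomposition, and the same penalized fit is run on the original and the rotated problem. The data, basis, penalty and the helper name `fit_and_ocv` are all assumptions made for illustration.

```python
import numpy as np

n, k, lam = 60, 10, 0.5
x = np.linspace(0, 1, n)
rng = np.random.default_rng(3)
y = np.sin(2 * np.pi * x) + rng.normal(0, 0.3, n)

knots = np.linspace(0, 1, k)
X = np.column_stack([np.interp(x, knots, e) for e in np.eye(k)])
D = np.diff(np.eye(k), n=2, axis=0)
S = D.T @ D

# Random orthogonal matrix: the Q factor of a Gaussian matrix.
Q, _ = np.linalg.qr(rng.normal(size=(n, n)))

def fit_and_ocv(Xm, ym, lam):
    """Return penalized coefficient estimates and the OCV score."""
    B = np.linalg.solve(Xm.T @ Xm + lam * S, Xm.T)   # maps data to coefficients
    A = Xm @ B                                       # influence matrix
    beta = B @ ym
    ocv = np.mean(((ym - A @ ym) / (1 - np.diag(A))) ** 2)
    return beta, ocv

beta_orig, v_orig = fit_and_ocv(X, y, lam)
beta_rot, v_rot = fit_and_ocv(Q @ X, Q @ y, lam)
# beta_orig and beta_rot coincide; v_orig and v_rot generally do not.
```

The coefficients agree because $(QX)^T(QX) = X^TX$ and $(QX)^T Qy = X^Ty$, while the OCV score changes because the diagonal of $QAQ^T$ differs from that of $A$.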

6 GCV: generalized cross validation

If we find the $Q$ that causes the leading diagonal elements of $A$ to be constant, and then perform OCV, the result is the invariant alternative GCV:

$$V_g = \frac{n\|y - \hat\mu\|^2}{\{n - \mathrm{tr}(A)\}^2}$$

It is easy to show that $\mathrm{tr}(A) = \mathrm{tr}(F)$, where $F$ is the degrees of freedom matrix. In addition to invariance, GCV is much easier to optimize efficiently in the multiple smoothing parameter case.
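A short sketch of the GCV score, with checks of the two properties mentioned: invariance under an orthogonal rotation and $\mathrm{tr}(A) = \mathrm{tr}(F)$. Here $F$ is taken to be $(X^TX + \lambda S)^{-1}X^TX$, which is an assumption about the intended definition; the data, basis, penalty and the name `gcv` are likewise illustrative.

```python
import numpy as np

n, k, lam = 60, 10, 0.5
x = np.linspace(0, 1, n)
rng = np.random.default_rng(4)
y = np.sin(2 * np.pi * x) + rng.normal(0, 0.3, n)

knots = np.linspace(0, 1, k)
X = np.column_stack([np.interp(x, knots, e) for e in np.eye(k)])
D = np.diff(np.eye(k), n=2, axis=0)
S = D.T @ D

def gcv(Xm, ym, lam):
    """GCV score n ||y - mu_hat||^2 / (n - tr(A))^2."""
    A = Xm @ np.linalg.solve(Xm.T @ Xm + lam * S, Xm.T)
    r = ym - A @ ym
    return n * (r @ r) / (n - np.trace(A)) ** 2

# Trace of the n x n influence matrix versus the k x k matrix F (assumed form).
A = X @ np.linalg.solve(X.T @ X + lam * S, X.T)
F = np.linalg.solve(X.T @ X + lam * S, X.T @ X)
tr_A, tr_F = np.trace(A), np.trace(F)

# Rotating the problem leaves GCV unchanged.
Q, _ = np.linalg.qr(rng.normal(size=(n, n)))
v_orig, v_rot = gcv(X, y, lam), gcv(Q @ X, Q @ y, lam)

# Grid search for the GCV-optimal lambda.
lams = 10.0 ** np.arange(-4, 5)
lam_gcv = lams[np.argmin([gcv(X, y, l) for l in lams])]
```

Working with `F` rather than `A` is cheaper when `k` is much smaller than `n`, since only a `k` by `k` matrix is formed.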

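7 REML/ML $\lambda$ estimation

The Bayesian smooth model is

$$y = X\beta + \epsilon, \quad \beta \sim N(0, S^{-}\sigma^2/\lambda), \quad \epsilon \sim N(0, I\sigma^2)$$

where $S^{-}$ is a generalized inverse of $S$. This can be viewed as a mixed model for computational purposes, but the impropriety of $f(\beta)$ is awkward. To fix this, find the eigen-decomposition

$$S = U\Lambda U^T$$

Reparameterize $\beta' = U^T\beta$ and let $\Lambda_+$ denote the diagonal matrix of positive eigenvalues. Now

$$\beta^T S\beta = \beta'^T\Lambda\beta' = b^T\Lambda_+ b$$

where $\beta' = (b^T, \gamma^T)^T$. Now partition $X' = XU = (Z : \tilde X)$, so that the model becomes

$$y = \tilde X\gamma + Zb + \epsilon, \quad b \sim N(0, \Lambda_+^{-1}\sigma^2/\lambda), \quad \epsilon \sim N(0, I\sigma^2)$$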
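The reparameterization can be sketched with `numpy.linalg.eigh`. The penalty (second differences, which has a two-dimensional null space), the tent basis, and the variable names `Z`, `X_fixed`, `b` and `gamma` are assumptions for this example. Eigenvalues are sorted in decreasing order so that the penalized block comes first, matching the ordering $(b^T, \gamma^T)^T$.

```python
import numpy as np

n, k = 60, 10
x = np.linspace(0, 1, n)
knots = np.linspace(0, 1, k)
X = np.column_stack([np.interp(x, knots, e) for e in np.eye(k)])
D = np.diff(np.eye(k), n=2, axis=0)
S = D.T @ D

# Eigen-decomposition S = U diag(ev) U^T, eigenvalues sorted largest first.
ev, U = np.linalg.eigh(S)
order = np.argsort(ev)[::-1]
ev, U = ev[order], U[:, order]

# Numerically positive eigenvalues define the penalized (random effect) block.
pos = ev > 1e-8 * ev.max()
lam_plus = ev[pos]

# Split the rotated model matrix into random effect and fixed effect parts.
XU = X @ U
Z, X_fixed = XU[:, pos], XU[:, ~pos]

# Any coefficient vector splits the same way.
rng = np.random.default_rng(5)
beta = rng.normal(size=k)
beta_prime = U.T @ beta
b, gamma = beta_prime[pos], beta_prime[~pos]

penalty_original = beta @ S @ beta
penalty_reparam = b @ (lam_plus * b)       # b^T Lambda_+ b
fitted_original = X @ beta
fitted_reparam = X_fixed @ gamma + Z @ b
```

The penalty and the linear predictor are unchanged by the rotation; only the unpenalized directions `gamma` are left without a proper prior.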

8 REML/ML $\lambda$ estimation

Now that the model is in standard mixed model form, mixed model methods can estimate $\lambda$ as a variance parameter. MLE or REML can be used. From a Bayesian perspective we are being empirical Bayesians and using marginal likelihood. Notice that the restricted/marginal likelihood has the form

$$\int f(y|\beta) f(\beta)\,d\beta$$

That is, we are taking the expectation of the likelihood over the prior on $\beta$. From this perspective it is possible to plot why the approach is intuitively sensible.

9 Basic principle of ML smoothness selection

[Figure: three panels of random draws from the prior plotted with the data, labelled "λ too low, prior variance too high", "λ and prior variance about right" and "λ too high, prior variance too low".]

1. Choose $\lambda$ to maximize the average likelihood of random draws from the prior implied by $\lambda$.
2. If $\lambda$ is too low, then almost all draws are too variable to have high likelihood. If $\lambda$ is too high, then draws all underfit and have low likelihood. The right $\lambda$ maximizes the proportion of draws close enough to the data to give high likelihood.
3. Formally, maximize e.g.

$$V_r(\lambda) = \log \int f(y|\beta) f_\lambda(\beta)\,d\beta$$
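For a Gaussian model the integral defining $V_r(\lambda)$ can be evaluated in closed form. The sketch below does so under several assumptions that are not in the slides: $\sigma^2$ is treated as known, the improper prior density is normalized over the range space of $S$ only (so the unpenalized directions get a flat prior, giving a REML-type score), and the data, basis, penalty and function name `reml` are illustrative.

```python
import numpy as np

n, k, sigma2 = 100, 20, 0.09
x = np.linspace(0, 1, n)
rng = np.random.default_rng(6)
y = np.sin(2 * np.pi * x) + rng.normal(0, np.sqrt(sigma2), n)

knots = np.linspace(0, 1, k)
X = np.column_stack([np.interp(x, knots, e) for e in np.eye(k)])
D = np.diff(np.eye(k), n=2, axis=0)
S = D.T @ D

def reml(lam, sigma2):
    """log of integral f(y|beta) f_lam(beta) d beta, assuming known sigma2.

    Closed form from Gaussian integration:
      -D_p/(2 sigma2) + log|lam S|_+ / 2 - log|X^T X + lam S| / 2
      - (n - M)/2 * log(2 pi sigma2),
    where D_p is the penalized residual sum of squares at the estimate,
    |.|_+ is the product of positive eigenvalues and M = dim null(S).
    """
    H = X.T @ X + lam * S
    beta = np.linalg.solve(H, X.T @ y)
    r = y - X @ beta
    Dp = r @ r + lam * beta @ S @ beta
    ev = np.linalg.eigvalsh(S)
    pos = ev > 1e-8 * ev.max()
    rank = pos.sum()
    M = k - rank
    log_det_S = rank * np.log(lam) + np.log(ev[pos]).sum()
    log_det_H = np.linalg.slogdet(H)[1]
    return (-Dp / (2 * sigma2) + 0.5 * log_det_S - 0.5 * log_det_H
            - 0.5 * (n - M) * np.log(2 * np.pi * sigma2))

# Grid search for the lambda maximizing the score.
lams = 10.0 ** np.linspace(-3, 4, 29)
scores = np.array([reml(lam, sigma2) for lam in lams])
lam_reml = lams[scores.argmax()]
```

In practice $\sigma^2$ would be estimated as well, and the score would be optimized by a Newton-type method rather than a grid; both are left out to keep the sketch short.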

10 Prediction error vs. likelihood $\lambda$ estimation

[Figure: two replicate datasets. For each, a fitted smooth (panel labels s(x,12.07) and s(x,1)) shown with log GCV and REML/n plotted against log(λ).]

1. Pictures show GCV and REML scores for different replicates from the same truth.
2. Compared to REML, GCV penalizes overfit only weakly, and so tends to undersmooth.

11 Are smoothers really random effects?

Most times that smooth functions are used in models, the modeller believes that the function is a fixed state of nature, i.e. the assumption is that the true function is something that would stay fixed on replication of the dataset. So we are really being Bayesian in treating the function as random. If the function were a true frequentist random effect then we would expect to get a different random draw from its prior at each dataset replication. This almost never makes sense. Does this mean that using mixed modelling methods is wrong? No. It just happens that the mixed model methods can conveniently compute the Bayesian answers for us.
