Alternatives to Basis Expansions. Kernels in Density Estimation. Kernels and Bandwidth. Idea Behind Kernel Methods

Lucas Chambers
5 years ago
Alternatives to Basis Expansions

Basis expansions require either the choice of a discrete set of basis functions or the choice of a smoothing penalty and smoothing parameter. Both impose prior beliefs on the data. Local methods are an alternative.

Kernels in Density Estimation

To estimate a density, simply add a bump centered on each observation $t_i$:

$$\hat{f}(t) = \frac{1}{n} \sum_{i=1}^{n} K(t - t_i)$$

Idea Behind Kernel Methods

Estimate quantities based on local data. The easiest approach is just to include nearby data, $|t_i - t| < h$. Alternatively, use a weighting function $K(t_i - t)$.

Kernels and Bandwidth

We still need to control the width of the kernel. Usually we use a consistent shape, but take $K_h(t - t_i) = K[(t - t_i)/h]$. Larger $h$ means $K$ is more spread out.
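The role of the bandwidth can be seen in a small sketch of the density estimate. This is an illustration only: the Gaussian bump, the made-up data, and the names `gaussian_kernel` and `kde` are assumptions, and the extra factor $1/h$ is added here so that the rescaled bumps still integrate to one.

```python
import math

def gaussian_kernel(u):
    """Standard normal density, used here as the bump shape K (an assumption)."""
    return math.exp(-0.5 * u * u) / math.sqrt(2.0 * math.pi)

def kde(t, data, h):
    """Density estimate at t: average of bumps K[(t - t_i)/h], rescaled by 1/h."""
    return sum(gaussian_kernel((t - ti) / h) for ti in data) / (len(data) * h)

data = [1.0, 1.5, 2.0, 4.0, 4.2]                 # made-up observations t_i
grid = [i * 0.01 for i in range(-1000, 1501)]    # evaluation points from -10 to 15

for h in (0.2, 1.0):
    # Riemann sum as a sanity check that the estimate integrates to about one.
    area = sum(kde(t, data, h) for t in grid) * 0.01
    print(f"h={h}: f_hat(2.0)={kde(2.0, data, h):.4f}, area={area:.4f}")
```

With the small bandwidth the estimate is spiky around each observation; with the larger one the bumps merge into a smoother curve.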

Kernels in Regression

The simplest non-parametric estimate is a moving average. Alternatively, we could use a kernel to weight:

$$\hat{x}(t) = \frac{1}{n} \sum_i y_i K[(t - t_i)/h]$$

But don't we want the weights to sum to 1?

$$\hat{x}(t) = \frac{\sum_i y_i K[(t - t_i)/h]}{\sum_i K[(t - t_i)/h]}$$

This is the Nadaraya-Watson estimator.

Choosing a Bandwidth

We observe that

$$\hat{y}_i = \sum_j \frac{K[(t_i - t_j)/h]}{\sum_k K[(t_i - t_k)/h]} \, y_j$$

so that $\hat{y} = S_h y$, and we have

$$\mathrm{OCV}(h) = \sum_i \frac{(\hat{y}_i - y_i)^2}{(1 - S_{ii})^2} \quad \text{and} \quad \mathrm{GCV}(h) = \frac{\sum_i (\hat{y}_i - y_i)^2}{(1 - \mathrm{tr}(S_h)/n)^2}$$

There are also some plug-in rules that avoid searching. The same idea applies for variance estimates as before.

Vancouver Precipitation: The Effect of Bandwidth

[Figure: Vancouver precipitation smoothed with h = 10, h = 100, and h = 1000]
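As a hedged sketch of the bandwidth search, the smoother matrix of the Nadaraya-Watson estimator can be built explicitly and the OCV score evaluated over a handful of candidate bandwidths. The Gaussian kernel, the synthetic data, the candidate list, and the function names are all assumptions made for illustration.

```python
import math

def K(u):
    """Unnormalised Gaussian kernel; the constant cancels in the weights."""
    return math.exp(-0.5 * u * u)

def smoother_matrix(t, h):
    """Row i holds the Nadaraya-Watson weights used to predict at t[i]."""
    S = []
    for ti in t:
        w = [K((ti - tj) / h) for tj in t]
        total = sum(w)
        S.append([wj / total for wj in w])
    return S

def ocv(t, y, h):
    """Ordinary cross-validation score computed from the diagonal of S_h."""
    S = smoother_matrix(t, h)
    score = 0.0
    for i, row in enumerate(S):
        y_hat = sum(s * yj for s, yj in zip(row, y))
        score += ((y_hat - y[i]) / (1.0 - row[i])) ** 2
    return score

# Synthetic data: a smooth signal plus an alternating perturbation.
t = [i / 20 for i in range(41)]
y = [math.sin(3 * ti) + 0.1 * (-1) ** i for i, ti in enumerate(t)]

candidates = [0.02, 0.05, 0.1, 0.2, 0.5, 1.0]
best_h = min(candidates, key=lambda h: ocv(t, y, h))
print("bandwidth chosen by OCV:", best_h)
```

For this estimator the shortcut through $S_{ii}$ agrees exactly with refitting without observation $i$, because dropping a point only renormalises the remaining weights.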

What's Going on Here?

Note: this is NOT using the $K(t - t_i)$ as basis functions. But the Nadaraya-Watson estimator looks sensible:

$$\hat{x}(t) = \frac{\sum_i y_i K[(t - t_i)/h]}{\sum_i K[(t - t_i)/h]}$$

It's a locally-weighted estimate of the mean. That means it solves

$$\hat{x}(t) = \operatorname{argmin}_m \sum_i K[(t - t_i)/h] \, (y_i - m)^2$$

Local Polynomial Regression

Why stop with estimating the mean? Set

$$(\hat{a}(t), \hat{b}(t)) = \operatorname{argmin}_{a, b} \sum_i K[(t - t_i)/h] \, (y_i - a - b t_i)^2$$

and $\hat{x}(t) = \hat{a}(t) + \hat{b}(t) t$. Then as $h$ increases, we revert to a linear trend. Further polynomial terms can be added.

Derivatives and Local Polynomial Regression

If we have $\hat{x}(t) = \hat{a}(t) + \hat{b}(t) t + \hat{c}(t) t^2$, then we can also estimate $D\hat{x}(t) = \hat{b}(t) + 2\hat{c}(t) t$. This turns out to be better than differentiating the coefficients. Local polynomial methods perform better at the edges of the data. Notice that they are still linear smoothers.

Local Quadratic Regression for Vancouver Precipitation

[Figure: local quadratic fit to Vancouver precipitation; panels $\hat{x}(t)$ and $D\hat{x}(t)$]
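A minimal sketch of a local linear fit, assuming a Gaussian kernel and made-up data. The line is parameterised as $a + b(t_i - t_0)$ rather than $a + b t_i$; this gives the same fitted line, but the intercept is then directly the fitted value at $t_0$ and the slope is the derivative estimate. The function names are hypothetical.

```python
import math

def local_linear(t0, t, y, h):
    """Kernel-weighted least squares for a + b*(t_i - t0); returns (fit, slope) at t0."""
    w = [math.exp(-0.5 * ((t0 - ti) / h) ** 2) for ti in t]
    d = [ti - t0 for ti in t]
    # Entries of the 2x2 weighted normal equations.
    s0 = sum(w)
    s1 = sum(wi * di for wi, di in zip(w, d))
    s2 = sum(wi * di * di for wi, di in zip(w, d))
    r0 = sum(wi * yi for wi, yi in zip(w, y))
    r1 = sum(wi * di * yi for wi, di, yi in zip(w, d, y))
    det = s0 * s2 - s1 * s1
    a = (s2 * r0 - s1 * r1) / det
    b = (s0 * r1 - s1 * r0) / det
    return a, b

def nadaraya_watson(t0, t, y, h):
    """Local constant fit, for comparison at the boundary."""
    w = [math.exp(-0.5 * ((t0 - ti) / h) ** 2) for ti in t]
    return sum(wi * yi for wi, yi in zip(w, y)) / sum(w)

# Exactly linear data: the true value at the left edge t = 0 is 2, the slope is 3.
t = [i / 10 for i in range(21)]
y = [2.0 + 3.0 * ti for ti in t]

fit_edge, slope_edge = local_linear(0.0, t, y, h=0.3)
nw_edge = nadaraya_watson(0.0, t, y, h=0.3)
print("local linear at edge:", fit_edge, "slope:", slope_edge)
print("Nadaraya-Watson at edge:", nw_edge)
```

On this example the local constant estimate is pulled upward at the left edge because all of its neighbours lie to the right, while the local linear fit recovers the line.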

A Comparison

[Figure: Vancouver precipitation, λ = 1000]

Advantages/Disadvantages

Computational
(-) no data compression; requires O(n) operations to evaluate
(+) but for compact kernels (which are eventually zero) some speedup is possible
(+) scale to higher dimensions
(+) numerically stable

Practical
(+) implicit definition of smoothness; don't need to think
(-) implicit definition of smoothness; hard to enforce constraints

Mathematical: much easier analysis available for kernels.

Some Resources in R

A number of packages provide local polynomial fitting:
locpol: choice of kernels; degree 1 through 10; nice plotting features
KernSmooth: only Gaussian kernels; also allows degree 0; density estimation, etc.
locfit: not as clearly documented, but does multivariate local polynomials

Generally, these require evaluation points to be specified beforehand.

What about Functional Linear Regression?

Principal Components Analysis:
1. Smooth each curve.
2. Compute the covariance surface
$$C(s, t) = \frac{1}{n} \sum_i (x_i(t) - \bar{x}(t))(x_i(s) - \bar{x}(s))$$
at a fine grid of points to give a matrix $C$.
3. Take the principal components of $C$.
4. Interpolate.

Smoothing the Covariance Surface

Idea: if curves are sparsely observed, we can still estimate the covariance. We have observations $(t_{1i}, y_{1i}), \ldots, (t_{k_i i}, y_{k_i i})$, $i = 1, \ldots, n$. Put all of the observations together and smooth to get $\bar{x}(t)$. For each $i$, with $1 \le j_1, j_2 \le k_i$, form triples

$$[t_{j_1 i},\; t_{j_2 i},\; (y_{j_1 i} - \bar{x}(t_{j_1 i}))(y_{j_2 i} - \bar{x}(t_{j_2 i}))] = [s, t, z]$$

Now produce a bivariate smooth: $C(s, t)$ = local linear regression on $\{(s_i, t_i, z_i)\}$.

A Few Variations

1. Remove the diagonal. Because $y_i = x(t_i) + \epsilon$, we have $E(y_i - \bar{x}(t_i))^2 = C(t_i, t_i) + \sigma^2$.
2. Use diagonal co-ordinates $s_1 = s + t$, $s_2 = s - t$.
3. Add a quadratic term on the diagonal.

Non-parametric Functional Data Analysis

Kernel methods are not dependent on dimension, so we can use them for non-parametric functional regression:

$$y_i = f[x_i(t)] + \epsilon$$

Take

$$K[x(t) - x_i(t)] = \exp\left\{ -\int (x(t) - x_i(t))^2 \, dt \right\}$$

Then estimate $f$ by

$$\hat{f}[x(t)] = \frac{\sum_i y_i K[x(t) - x_i(t)]}{\sum_i K[x(t) - x_i(t)]}$$

See Ferraty and Vieu (2005), Nonparametric Functional Data Analysis.
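The functional version of the kernel estimate can be sketched as follows, assuming all curves are observed on a common equally spaced grid so that the integral can be approximated by a Riemann sum. The bandwidth `h` dividing the squared distance is a hypothetical addition (setting `h=1.0` gives the kernel as written), and the curves and responses are made up.

```python
import math

def l2_sq(x, z, dt):
    """Riemann-sum approximation to the integral of (x(t) - z(t))^2 dt."""
    return sum((a - b) ** 2 for a, b in zip(x, z)) * dt

def functional_nw(x_new, curves, y, dt, h=1.0):
    """Kernel-weighted average of responses, weights decaying with L2 distance."""
    w = [math.exp(-l2_sq(x_new, xi, dt) / h) for xi in curves]
    return sum(wi * yi for wi, yi in zip(w, y)) / sum(w)

# Made-up functional covariates x_i(t) = c_i * sin(2*pi*t) with response y_i = c_i^2.
n_grid = 51
dt = 1.0 / (n_grid - 1)
ts = [i * dt for i in range(n_grid)]
amps = [0.0, 0.5, 1.0, 1.5, 2.0]
curves = [[c * math.sin(2 * math.pi * s) for s in ts] for c in amps]
y = [c ** 2 for c in amps]

x_new = [math.sin(2 * math.pi * s) for s in ts]   # new curve with amplitude 1
pred = functional_nw(x_new, curves, y, dt, h=0.05)
print("predicted response:", round(pred, 3))
```

Because the weights are non-negative and sum to one, the prediction always lies between the smallest and largest observed response.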

Summary

Kernel methods are an alternative way of smoothing. They:
avoid needing to decide on penalty terms
generalize to local polynomial methods
generalize even further to local likelihood methods
require fPCA for regression
can be boosted up to non-parametric models with functional covariates

First, an old idea

Smoothing works for splines, so why not for ordinary regression? Actually, historically that's the wrong way around.

Ridge Regression

For the model $y_i = \beta_0 + x_i \beta + \epsilon_i$, instead of minimizing

$$\mathrm{SSE}(\beta) = \sum_i (y_i - \beta_0 - x_i \beta)^2$$

try

$$\mathrm{PENSSE}_\lambda(\beta) = \mathrm{SSE}(\beta) + \lambda \sum_j \beta_j^2$$

One can show there is always a $\lambda > 0$ for which future prediction error is improved. If $p > 3$, we also improve $E \sum_j (\beta_j - \hat{\beta}_j)^2$.

More Recently

LASSO (Tibshirani 1996): the model $y_i = x_i \beta + \epsilon_i$ is fit by minimizing

$$\sum_i (y_i - x_i \beta)^2 + \lambda \sum_j |\beta_j|$$

This tends to set $\hat{\beta}_j$ to zero. There is lots of recent work on this.
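For a single predictor the ridge criterion has a closed-form minimiser, which makes the shrinkage easy to see. The sketch below assumes one covariate, an unpenalised intercept, and made-up data; `ridge_1d` is a hypothetical name.

```python
def ridge_1d(x, y, lam):
    """Minimise sum (y_i - b0 - x_i*b)^2 + lam*b^2 with the intercept unpenalised."""
    n = len(x)
    xbar = sum(x) / n
    ybar = sum(y) / n
    sxy = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))
    sxx = sum((xi - xbar) ** 2 for xi in x)
    b = sxy / (sxx + lam)      # lam = 0 recovers ordinary least squares
    b0 = ybar - b * xbar
    return b0, b

x = [0.0, 1.0, 2.0, 3.0, 4.0]
y = [1.1, 2.9, 5.2, 6.8, 9.1]   # made-up responses

for lam in (0.0, 1.0, 10.0, 100.0):
    b0, b = ridge_1d(x, y, lam)
    print(f"lambda={lam}: intercept={b0:.3f}, slope={b:.3f}")
```

The slope shrinks smoothly toward zero as $\lambda$ grows but never reaches it for finite $\lambda$, which is the contrast with the lasso penalty.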

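The Selection Effect

Why does the lasso set $\hat{\beta}_j$ to zero? Consider

$$(\beta - 1)^2 + \lambda |\beta|$$

The point at zero still exists for large $\lambda$.

[Figure: the penalized criterion plotted against $\beta$]

Functional Linear Regression That's Interpretable (FLiRTI: James and Zhu, 2007)

Apply the LASSO to functional linear regression:

$$\mathrm{PENSSE}(\beta) = \sum_i \left( y_i - \int \beta(t) x_i(t) \, dt \right)^2 + \lambda_1 \int |\beta(t)| \, dt + \lambda_2 \int |D^2 \beta(t)| \, dt$$

We can choose penalties so that parts of the curve are set to:
exactly $\beta(t) = 0$
exactly $D\beta(t) = 0$ ($\beta(t)$ is flat)
exactly $D^2 \beta(t) = 0$ ($\beta(t)$ is exactly linear)
etc.

There is more theory concerning a new technique called the Dantzig Selector.

FLiRTI Test

On data designed for it, it gets very near the truth.

[Figure: FLiRTI estimate against the true coefficient function]

FLiRTI and Penetration Data

[Table: (non-adjusted) $R^2$ for FLiRTI, PCA, and Smooth]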
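The selection effect can be checked directly on the one-dimensional criterion. The sketch below minimises $(\beta - z)^2 + \lambda|\beta|$ in closed form (soft thresholding) and compares it with the ridge analogue $(\beta - z)^2 + \lambda\beta^2$; the function names are hypothetical.

```python
def lasso_1d(z, lam):
    """Minimiser of (b - z)^2 + lam*|b|: shrink |z| by lam/2, stopping at zero."""
    shrunk = max(abs(z) - lam / 2.0, 0.0)
    return shrunk if z >= 0 else -shrunk

def ridge_shrink(z, lam):
    """Minimiser of (b - z)^2 + lam*b^2: proportional shrinkage, never exactly zero."""
    return z / (1.0 + lam)

for lam in (0.5, 1.0, 2.0, 3.0):
    print(f"lambda={lam}: lasso={lasso_1d(1.0, lam)}, ridge={ridge_shrink(1.0, lam):.3f}")
```

For $z = 1$ the lasso minimiser is $1 - \lambda/2$ while $\lambda < 2$ and exactly zero afterwards, because the kink of $|\beta|$ at zero becomes the minimum once the penalty slope outweighs the pull of the quadratic term.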
