Modern Regression Basics


1 Modern Regression Basics
T. W. Yee, University of Auckland
t.yee@auckland.ac.nz
October, Cagliari

2 Outline of This Talk
1 Linear Models
2 Generalized Linear Models (GLMs)
3 Smoothing
4 Generalized Additive Models (GAMs)
5 Introduction to VGLMs and VGAMs
6 Concluding Remarks

3 Linear Models
Data (x_i, y_i, w_i), i = 1, ..., n, with Var(ε_i) = σ²/w_i and

    E(Y_i) = η(x_i) = Σ_{k=1}^p x_{ik} β_k.

That is,

    y = X β + ε,   ε ~ N(0, σ² W^{-1}).   (1)

X is an n × p matrix (assumed of rank p), and β is a p-vector of regression coefficients (parameters).
The t-test, ANOVA, multiple linear regression etc. are special cases of (1).

4 Linear Models: Estimation I
Estimate β by weighted least squares (WLS):

    β̂ = argmin Σ_{i=1}^n w_i (y_i - Σ_{k=1}^p x_{ik} β_k)²
       = argmin (y - Xβ)^T W (y - Xβ).

The solution is (from the normal equations)

    β̂ = (X^T W X)^{-1} X^T W y,   (2)
    ŷ = X β̂.                      (3)

Also, the variance-covariance matrix of β̂ is

    Var(β̂) = σ² (X^T W X)^{-1}.   (4)
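As a quick check of (2)-(4), here is a small sketch (simulated data, hypothetical names) comparing the normal-equations solution with lm() and a weights argument:

set.seed(1)
n <- 50
X <- cbind(1, runif(n), runif(n))                        # n x p design matrix (p = 3)
w <- runif(n, 0.5, 2)                                    # known positive weights
y <- drop(X %*% c(1, 2, -1)) + rnorm(n, sd = 1/sqrt(w))  # Var(eps_i) = sigma^2 / w_i
W <- diag(w)
betahat <- solve(t(X) %*% W %*% X, t(X) %*% W %*% y)     # (X' W X)^{-1} X' W y
cbind(betahat, coef(lm(y ~ X - 1, weights = w)))         # the two columns agree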

5 Linear Models: Estimation II
Suppose W = I_n. Then LS has a very nice geometric interpretation. Also, ŷ = H y where

    H = X (X^T X)^{-1} X^T.   (5)

Note that H = H² (idempotent) and H = H^T (symmetric), hence H is a projection matrix. Such a matrix represents an orthogonal projection.
The eigenvalues of H are p 1's and (n - p) 0's. Consequently, trace(H) = rank(H).
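A small numerical sketch (unweighted case, W = I, simulated design) verifying the projection properties of H stated above:

set.seed(2)
n <- 20
X <- cbind(1, rnorm(n), rnorm(n))                    # p = 3 columns
H <- X %*% solve(t(X) %*% X) %*% t(X)                # hat matrix (5)
max(abs(H %*% H - H))                                # ~0: idempotent
max(abs(H - t(H)))                                   # ~0: symmetric
sum(diag(H))                                         # trace(H) = p = 3
round(sort(eigen(H)$values, decreasing = TRUE), 6)   # three 1s and (n - 3) 0s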

6 Linear Models
[Figure.]

7 Linear Models: S Model Formulae I
The S model formula was adopted from Wilkinson and Rogers (1973).
Form: response ~ expression
LHS = the response (usually a vector in a data frame, or a matrix).
RHS = explanatory variables.

8 Linear Models: S Model Formulae II
Consider

> y ~ x1 + x2 + x3 + f1:f2 + f1 * x1 + f2/f3 + f3:f4:f5 +
+     (f6 + f7)^2

where variables beginning with an x are numeric and those beginning with an f are factors.
By default an intercept is fitted, denoted 1. Suppress the intercept with -1.
The interaction f1*f2 expands to 1 + f1 + f2 + f1:f2. The terms f1 and f2 are main effects.
A second-order interaction between two factors, factor:factor, contributes terms of the form γ_ij. There are other types of interactions. Interactions between a factor and a numeric, factor:numeric, produce terms of the form β_j x.

9 Linear Models: S Model Formulae III
Interactions between two numerics, numeric:numeric, produce a cross-product term such as β x_2 x_3.
The term (f6 + f7)^2 expands to f6 + f7 + f6:f7. A term (f6 + f7 + f8)^2 - f7:f8 would expand to all main effects and all second-order interactions except for f7:f8.
Nesting is achieved by /, e.g., f2/f3 is shorthand for 1 + f2 + f3:f2, or equivalently,

> 1 + f2 + f3 %in% f2

Example: f2 = state and f3 = county.
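To see how these operators expand in practice, one can inspect terms() and model.matrix(); a small sketch with a made-up data frame (hypothetical variable names) follows.

d <- data.frame(f2 = factor(rep(c("stateA", "stateB"), each = 4)),
                f3 = factor(rep(c("c1", "c2"), times = 4)),
                x1 = rnorm(8), y = rnorm(8))
attr(terms(y ~ f2/f3, data = d), "term.labels")   # "f2" "f2:f3": the nesting expansion
colnames(model.matrix(y ~ f2/f3, data = d))       # intercept, f2, then f3 within f2
colnames(model.matrix(y ~ f2 * x1, data = d))     # main effects plus an f2:x1 slope term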

10 Linear Models: S Model Formulae IV
There are times when you need to use the identity function I(), e.g., because ^ has special meaning:

> lm(y ~ -1 + offset(a) + x1 + I(x2 - 1) + I(x3^3))

fits

    y_i = a_i + β_1 x_{i1} + β_2 (x_{i2} - 1) + β_3 x_{i3}³ + ε_i,   ε_i ~ iid N(0, σ²),  i = 1, ..., n,

where a is a vector containing the (known) a_i.
Other functions: factor(), as.factor(), ordered(), terms(), levels(), options().

11 Linear Models: S generics
Generic functions are available for lm objects. They include add1(), anova(), coef(), deviance(), drop1(), plot(), predict(), print(), residuals(), step(), summary(), update().
Other less used generic functions are alias(), effects(), family(), kappa(), labels(), proj().
Some other functions are model.matrix(), options().

12 Linear Models: The lm() Function

> args(lm)
function (formula, data, subset, weights, na.action, method = "qr",
    model = TRUE, x = FALSE, y = FALSE, qr = TRUE, singular.ok = TRUE,
    contrasts = NULL, offset, ...)
NULL

The most useful arguments are weights, subset, na.action (na.fail(), na.omit()), contrasts.
Data frames: read.table(), write.table(), na.omit().

13 Linear Models: Factors I

> options()$contrasts
        unordered           ordered
"contr.treatment"      "contr.poly"

Note:
1 contr.treatment is used so that each coefficient compares that level with level 1 (omitting level 1 itself).
2 contr.sum constrains the coefficients to sum to zero.
3 contr.poly is used for equally spaced, equally replicated orthogonal polynomial contrasts.

14 Linear Models: Factors II
One can change them by, for example,

> options(contrasts = c("contr.treatment", "contr.poly"))

Here, the first level of the factor is the baseline level.

Table: Dummy variables (partial method).

RACE       D1  D2  D3
White       0   0   0
Black       1   0   0
Hispanic    0   1   0
Other       0   0   1

15 Linear Models: Factors III
An example:

> options(contrasts = c("contr.treatment", "contr.poly"))
> y <- 1:9
> x <- rep(1:3, len = 9)
> lm(y ~ as.factor(x))

Call:
lm(formula = y ~ as.factor(x))

Coefficients:
  (Intercept)  as.factor(x)2  as.factor(x)3
            4              1              2

16 Linear Models: Factors IV
Another example:

> options(contrasts = c("contr.sum", "contr.poly"))
> lm(y ~ as.factor(x))

Call:
lm(formula = y ~ as.factor(x))

Coefficients:
  (Intercept)  as.factor(x)1  as.factor(x)2
    5.000e+00     -1.000e+00      ~0 (order e-17, numerically zero)

17 Linear Models: Topics not done...
Other important topics not covered:
  residual analysis
  influential observations
  robust regression
  variable selection
  ...

18 Generalized Linear Models (GLMs)
Y ~ exponential family (normal, binomial, Poisson, ...)

    g(µ) = η(x) = β^T x = β_1 + β_2 x_2 + ... + β_p x_p

g is the link function (known, monotonic, twice differentiable).

    η = Σ_{k=1}^p β_k x_k  is known as the linear predictor.

Proposed by Nelder and Wedderburn (1972), GLMs include the general linear model, logistic regression, probit analysis, Poisson regression, gamma, inverse Gaussian etc. The unification was a major breakthrough in statistical theory.
Estimation: iteratively reweighted least squares (IRLS; see later).

19 Generalized Linear Models (GLMs): The Exponential Family I
The distribution of a univariate r.v. Y belongs to the exponential family if its p.(d).f. f(y; θ) can be written as

    f(y; θ) = exp{p(y) q(θ) + r(y) + s(θ)}.   (6)

Here θ = parameter of interest, and the functions p, q, r, s are known. Other parameters can be accommodated provided they are known; we simply incorporate them in p, q, r and s.
The exponential family has a canonical form where p(y) = y. We also want to be able to explicitly consider scale parameters such as σ in N(µ, σ²), so we write

    f(y; θ, φ) = exp{ [y d(θ) - b(θ)] ω / φ + c(y, φ, ω) }.   (7)

20 Generalized Linear Models (GLMs): The Exponential Family II
Equation (7) belongs to (6) provided φ is known (ω is some known constant here); φ > 0, ω > 0.
θ* = d(θ) is often called the natural parameter of the distribution.

21 Generalized Linear Models (GLMs): The Exponential Family III
(i) Y ~ N(µ, σ²):

    f(y; µ, σ) = (2πσ²)^{-1/2} exp{ -(y - µ)²/(2σ²) }
               = exp{ [yµ - µ²/2]/σ² - y²/(2σ²) - (1/2) log(2πσ²) }.

(ii) Y ~ Poisson(θ):

    f(y; θ) = e^{-θ} θ^y / y! = exp{ y log θ - θ - log y! }.
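As a quick numerical sanity check (a sketch, not from the slides), the Poisson p.f. can be evaluated both directly and via its exponential-family form:

theta <- 2.5
y <- 0:6
cbind(dpois(y, lambda = theta),
      exp(y * log(theta) - theta - lfactorial(y)))   # the two columns are identical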

22 Generalized Linear Models (GLMs): The Exponential Family IV
(iii) Z ~ Binomial(m, p). For z = 0, 1, ..., m,

    P(Z = z; p) = C(m, z) p^z (1 - p)^{m-z}
                = exp{ z log p + (m - z) log(1 - p) + log C(m, z) }
                = exp{ z log[p/(1 - p)] + m log(1 - p) + log C(m, z) }.

If we look at the sample proportion, Y = Z/m, for y = 0, 1/m, ..., m/m,

    P(Y = y; p) = P(Z = my; p)
                = exp[ m{ y log[p/(1 - p)] + log(1 - p) } + log C(m, my) ].

23 Generalized Linear Models (GLMs): The Exponential Family V
Let ℓ(θ) = log L(θ) = log f(Y; θ), u(θ) = ∂ℓ(θ)/∂θ, and I(θ) = -∂²ℓ(θ)/∂θ².
One can show that

    E[Y] = µ = b'(θ)/d'(θ)

and

    Var(Y) = [φ / (ω d'(θ)²)] [ b''(θ) - µ d''(θ) ],   (8)

or

    Var(Y) = b''(θ) φ / (ω d'(θ)²) - φ b'(θ) d''(θ) / (ω {d'(θ)}³).   (9)

24 Generalized Linear Models (GLMs): The Exponential Family VI
If the model is parameterized in terms of the natural parameter θ* (e.g., θ* = log[p/(1 - p)] instead of p) then d'(θ*) = 1 and

    Var(Y) = b''(θ*) φ / ω.   (10)

One then gets the following table.

25 Generalized Linear Models (GLMs): The Exponential Family VII

                     Normal(µ, σ²)                     Poisson(λ)         (1/m) Binomial(m, p)  (Y = sample proportion)
f(y; θ, φ)           (2πσ²)^{-1/2} exp{-(y-µ)²/(2σ²)}  e^{-λ} λ^y / y!    C(m, my) p^{my} (1-p)^{m-my}
Range of Y           (-∞, ∞)                           0, 1, 2, ...       0, 1/m, 2/m, ..., m/m
Mean = E(Y)          µ                                 µ = λ              µ = p
Usual parameter, θ   µ                                 λ                  p
Natural param., θ*   µ                                 log λ = log µ      log[p/(1-p)] = log[µ/(1-µ)]
b(θ*)                µ²/2 (= θ*²/2)                    λ (= e^{θ*})       -log(1-p) = log(1 + e^{θ*})
φ                    σ²                                1                  1
ω                    1                                 1                  m
c(y, φ, ω)           -(y²/φ + log(2πφ))/2              -log y!            log C(m, my)
µ = E[Y]             µ                                 λ (= e^{θ*})       p = e^{θ*}/(1 + e^{θ*})
Variance function    constant (σ²)                     µ                  µ(1 - µ)
(Var(Y) as fn of µ)

26 Generalized Linear Models (GLMs): S and GLMs I
In S use, e.g., glm(y ~ x2 + x3 + x4, family = binomial, data = d).
Family functions are gaussian(), binomial(), poisson(), Gamma(), inverse.gaussian(), quasi().
Generic functions include anova(), coef(), fitted(), plot(), predict(), print(), resid(), summary(), update().
Recall the Wilkinson and Rogers (1973) formula language, e.g., if f1 and f2 are factors and x1 and x2 are numeric, then
    f1 * f2  expands to  1 + f1 + f2 + f1:f2,
    f1/f2    means  f1, and then f2 within factor f1,
    x1 + x2  gives  β_1 X_1 + β_2 X_2.
Data frames hold all the data. Columns are the variables.

27 Generalized Linear Models (GLMs): S and GLMs II

> library(VGAM)    # provides vglm() and binomialff
> data(nzc)
> with(nzc, plot(year, female/(male + female), ylab = "Proportion",
+     main = "Proportion of NZ Chinese that are female",
+     col = "blue", las = 1))
> abline(h = 0.5, lty = "dashed")
> fit.nzc = vglm(cbind(female, male) ~ year, fam = binomialff,
+     data = nzc)
> with(nzc, lines(year, fitted(fit.nzc), col = "red"))

28 Generalized Linear Models (GLMs): S and GLMs III
[Figure: Proportion of NZ Chinese that are female, plotted against year, with the fitted curve.]

29 Generalized Linear Models (GLMs): S and GLMs IV

> with(nzc, plot(year, female/(male + female), ylab = "Proportion",
+     main = "Proportion of NZ Chinese that are female",
+     col = "blue", las = 1))
> abline(h = 0.5, lty = "dashed")
> fit.nzc = vglm(cbind(female, male) ~ poly(year, 2), fam = binomialff,
+     data = nzc)

[Figure: Proportion of NZ Chinese that are female, plotted against year.]

30 Generalized Linear Models (GLMs): Logistic regression I

> options(contrasts = c("contr.treatment", "contr.poly"))
> y <- cbind(c(5, 20, 15, 10), c(20, 10, 10, 10))
> x <- 1:4
> fit <- glm(y ~ as.factor(x), family = binomial)
> fit

Call:  glm(formula = y ~ as.factor(x), family = binomial)

Coefficients:
  (Intercept)  as.factor(x)2  as.factor(x)3  as.factor(x)4
       -1.386          2.079          1.792          1.386

Degrees of Freedom: 3 Total (i.e. Null);  0 Residual
Null Deviance:
Residual Deviance: 4.441e-15    AIC:

> exp(coef(fit)[-1])

31 Generalized Linear Models (GLMs): Logistic regression II

as.factor(x)2  as.factor(x)3  as.factor(x)4
            8              6              4

Here the model is

    logit p(x) = β_0 + β_j,   j = 1, 2, 3, 4,

where β_1 = 0.

32 Generalized Linear Models (GLMs): Logistic regression III
If η(x) = β_0 + β_1 x then the log odds for a change of c units in x is obtained from the logit difference

    η(x + c) - η(x) = c β_1   (11)

and the associated odds ratio is

    ψ(c) = ψ(x + c, x) = exp(c β_1).   (12)

Example: if logit P(D | AGE) = β_0 + 0.13 AGE then an increase in age of 10 years will increase the odds of disease by exp(1.3) ≈ 3.67.
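A small sketch (simulated data, illustrative coefficient values) of the odds ratio (12) for a c-unit change in x, with a Wald-type interval built from Var(β̂_1):

set.seed(3)
n <- 200
x <- rnorm(n)
y <- rbinom(n, size = 1, prob = plogis(-0.5 + 1.3 * x))
fit <- glm(y ~ x, family = binomial)
cc <- 2                                    # a change of c = 2 units in x
est <- cc * coef(fit)["x"]                 # estimate of c * beta1
se  <- cc * sqrt(vcov(fit)["x", "x"])      # its standard error
exp(est)                                   # estimated odds ratio psi(c) = exp(c * beta1)
exp(est + c(-1.96, 1.96) * se)             # approximate 95% CI for psi(c)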

33 Generalized Linear Models (GLMs): Logistic regression IV
In general, for

    logit p(x) = β_0 + β^T x   (13)

we have

    log ψ = log{ [p(x_1)/(1 - p(x_1))] / [p(x_0)/(1 - p(x_0))] } = β^T (x_1 - x_0).   (14)

Thus, for confidence intervals etc., use

    Var(log ψ̂) = (x_1 - x_0)^T Var(β̂) (x_1 - x_0).   (15)

Note: [p(x_1)/(1 - p(x_1))] / [p(x_0)/(1 - p(x_0))] is the odds ratio for Y = 1 for a person with x_1 relative to a person with x_0.

34 Generalized Linear Models (GLMs): Some extensions of GLMs
Quasi-likelihood (Wedderburn, 1974)
Composite link functions (Thompson and Baker, 1981)
IRLS for other models (Green, 1984)
Double exponential families (Efron, 1986)
Generalized estimating equations (GEE; Liang and Zeger, 1986)
Generalized linear mixed models (GLMMs)
ANOVA splines (Wahba and co-workers, 1995)
Polychotomous regression (Kooperberg and co-workers, 1997)
Multivariate GLMs (Fahrmeir and Tutz, 2001)
Hierarchical GLMs (Nelder and Lee, late 1990s)
Generalized additive models (GAMs; Hastie and Tibshirani, 1986)
Generalized additive mixed models (GAMMs; Lin, 1998)
Vector GLMs and VGAMs (Yee and Wild, 1996)

35 Smoothing
Smoothing is a powerful tool for exploratory data analysis. It allows a data-driven rather than model-driven approach; it allows the data to speak for themselves.
Probably the central idea is localness, i.e., local behaviour versus global behaviour of a function.
Scatterplot data (x_i, y_i), i = 1, ..., n. The classical smoothing problem is

    y_i = f(x_i) + ε_i,   ε_i ~ (0, σ_i²)   (16)

independently. Here, f is an arbitrary smooth function, and i = 1, ..., n.
Q: How can f be estimated?
A: If there is no a priori functional form for f, one solution is the smoother.

36 Smoothing: Uses of Smoothing
Smoothing has many uses, e.g.,
  data visualization and EDA
  prediction
  derivative estimation, e.g., growth curves, acceleration
  used as a basis for many modern statistical techniques

37 Smoothing: Example I
[Figure: scatterplot of y against x.]

38 Smoothing: Example I (continued)
[Figure: scatterplot of y against x.]

39 Smoothing
There are four broad categories of smoothers:
1 series or regression smoothers (polynomials, Fourier regression, regression splines, filtering),
2 kernel smoothers (N-W, locally weighted averages, local regression, loess),
3 smoothing splines (roughness penalties),
4 near-neighbour smoothers (running means, medians, Tukey smoothers).
We will look at kernel smoothers and splines.

40 Smoothing
Scatterplot data (y_i, x_i), i = 1, ..., n. The classical smoothing problem is

    Y_i = f(X_i) + ε_i   (17)

where f = a smooth function estimated from the data, E(ε_i) = 0, Var(ε_i) = σ², independently.
We let Var(ε_i) ∝ w_i^{-1} (known), written Var(ε) = W^{-1}, W = diag(w_1, ..., w_n) = Σ^{-1}.
WLOG assume the data are ordered so that x_1 < x_2 < ... < x_n.

41 Smoothing: Kernel Smoothers I (Nadaraya-Watson Estimator)
Kernel regression estimators are well known, easily understood and mathematically tractable. The Nadaraya-Watson (N-W) estimator estimates f(x) by

    f̂_nw(x) = Σ_{i=1}^n K((x - x_i)/h) y_i / Σ_{i=1}^n K((x - x_i)/h)
             = Σ_{i=1}^n K_h(x - x_i) y_i / Σ_{i=1}^n K_h(x - x_i)   (18)

where

    K_h(u) = h^{-1} K(u/h).   (19)
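A direct sketch of (18)-(19) with a Gaussian kernel (simulated data; the function and bandwidth values are illustrative only):

nw <- function(x, xdat, ydat, h) {
  sapply(x, function(x0) {
    w <- dnorm((x0 - xdat) / h) / h        # K_h(x0 - x_i) with K = Gaussian
    sum(w * ydat) / sum(w)                 # locally weighted average (18)
  })
}
set.seed(4)
xdat <- sort(runif(100))
ydat <- sin(2 * pi * xdat) + rnorm(100, sd = 0.3)
xx <- seq(0, 1, length = 200)
plot(xdat, ydat, col = "blue")
lines(xx, nw(xx, xdat, ydat, h = 0.05), col = "red")      # small h: wiggly
lines(xx, nw(xx, xdat, ydat, h = 0.30), col = "green4")   # large h: smooth, more bias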

42 Smoothing: Kernel Smoothers II (Nadaraya-Watson Estimator)
K, a symmetric unimodal function about 0 which integrates to unity, creates the local averaging of values of y_i whose corresponding values of x_i are close to the point of estimation x. The amount of smoothing is controlled by the bandwidth h. Some popular kernel functions are given in Table 2.
As h decreases, the bias decreases and the variance increases. In practice, the choice of bandwidth h is more crucial than the choice of kernel function.

43 Smoothing: Kernel Smoothers III (Nadaraya-Watson Estimator)
[Figure: regression function.]

44 Smoothing: Kernel Smoothers IV (Nadaraya-Watson Estimator)

Table: Popular kernel functions. Nb. the quartic is also known as the biweight.

Kernel         K(u)
Uniform        (1/2) I(|u| ≤ 1)
Triangle       (1 - |u|) I(|u| ≤ 1)
Epanechnikov   (3/4) (1 - u²) I(|u| ≤ 1)
Quartic        (15/16) (1 - u²)² I(|u| ≤ 1)
Tricube        (70/81) (1 - |u|³)³ I(|u| ≤ 1)
Triweight      (35/32) (1 - u²)³ I(|u| ≤ 1)
Gaussian       exp(-u²/2) / √(2π)
Cosinus        (π/4) cos(πu/2) I(|u| ≤ 1)

45 Smoothing: Local Regression I
Theoretically elegant and also called local polynomial kernel estimation, it has favourable asymptotic properties and boundary behaviour.
Idea: estimate f(x_0) by locally fitting an rth degree polynomial to the data via weighted least squares (WLS).
Example: a local linear kernel estimate for data generated from

    y_i = f(x_i) + ε_i,   f(x) = 2 exp(-x²/0.3²) + 3 exp(-(x - 1)²/0.7²),

with x_i = (i - 1)/n and ε_i ~ N(0, σ = 0.115) independently.

46 Smoothing: Local Regression II
Figure: Local linear kernel estimate (solid red) of the regression function f given in the text, based on 100 simulated observations (crosses). The solid black curve is the true function. The red dashed curves are the kernel weights.

47 Smoothing
We now derive an explicit expression for the local polynomial kernel estimator. Let r be the degree of the polynomial being fitted. At a point x, the estimator f̂(x; r, h) is obtained by fitting the polynomial

    β_0 + β_1 (· - x) + ... + β_r (· - x)^r

to the (x_i, y_i) using WLS with kernel weights K_h(x_i - x). The value of f̂(x; r, h) is the height of the fit, β̂_0, where β̂ = (β̂_0, ..., β̂_r)^T minimizes

    Σ_{i=1}^n {y_i - β_0 - β_1 (x_i - x) - ... - β_r (x_i - x)^r}² K_h(x_i - x).   (20)

The solution is

    β̂ = (X_x^T W_x X_x)^{-1} X_x^T W_x y   (21)

48 Smoothing
where y = (y_1, ..., y_n)^T,

    X_x = [ 1  (x_1 - x)  ...  (x_1 - x)^r ]
          [ :      :               :       ]
          [ 1  (x_n - x)  ...  (x_n - x)^r ]

is n × (r + 1), and W_x = Diag(K_h(x_1 - x), ..., K_h(x_n - x)).
Since the estimator of f(x) is the intercept, we have

    f̂(x; r, h) = e_1^T (X_x^T W_x X_x)^{-1} X_x^T W_x y.   (22)
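A small sketch of (22): fit the degree-r polynomial by WLS with kernel weights and read off the intercept (simulated data; the Gaussian kernel and bandwidth are illustrative choices):

locpoly1 <- function(x0, xdat, ydat, r = 1, h = 0.2) {
  Xx <- outer(xdat - x0, 0:r, "^")                 # columns 1, (x_i - x0), ..., (x_i - x0)^r
  wts <- dnorm((xdat - x0) / h) / h                # K_h(x_i - x0)
  beta <- solve(t(Xx) %*% (wts * Xx), t(Xx) %*% (wts * ydat))
  beta[1]                                          # e_1' betahat = fhat(x0; r, h)
}
set.seed(5)
xdat <- runif(150)
ydat <- cos(3 * xdat) + rnorm(150, sd = 0.2)
xx <- seq(0, 1, length = 100)
plot(xdat, ydat, col = "blue")
lines(xx, sapply(xx, locpoly1, xdat = xdat, ydat = ydat, r = 1, h = 0.1), col = "red")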

49 Smoothing
Simple explicit formulae exist for the N-W estimator (r = 0):

    f̂(x; 0, h) = Σ_{i=1}^n K_h(x_i - x) y_i / Σ_{i=1}^n K_h(x_i - x)   (23)

and the local linear estimator (r = 1):

    f̂(x; 1, h) = n^{-1} Σ_{i=1}^n {ŝ_2(x; h) - ŝ_1(x; h)(x_i - x)} K_h(x_i - x) y_i / [ŝ_2(x; h) ŝ_0(x; h) - ŝ_1(x; h)²]   (24)

where

    ŝ_r(x; h) = n^{-1} Σ_{i=1}^n (x_i - x)^r K_h(x_i - x).   (25)

50 Smoothing: Derivative Estimation I
Uses include the study of human growth curves, where the first two derivatives of height as a function of age (the speed and acceleration of growth) have important biological significance.
The extension of local polynomial ideas to estimate the νth derivative is straightforward. One can estimate f^(ν)(x) via the intercept coefficient of the νth derivative of the local polynomial being fitted at x, assuming ν ≤ r. In general,

    f̂^(ν)(x; r, h) = ν! e_{ν+1}^T (X_x^T W_x X_x)^{-1} X_x^T W_x y,   for all ν = 0, ..., r,   (26)

from (22). Note that f̂^(ν)(x; r, h) is not in general equal to the νth derivative of f̂(x; r, h).

51 Smoothing: Derivative Estimation II (Choosing r)
In the early 1990s Fan and co-workers showed that, for estimating f^(ν)(x), there is no increase in variability when passing from an even (i.e., r - ν even) r = ν + 2q order fit to an odd r = ν + 2q + 1 order fit, but when passing from an odd r = ν + 2q + 1 order fit to the consecutive even order r = ν + 2q + 2 there is a price to be paid in terms of increased variability. Therefore, even order fits r = ν + 2q are not recommended.
Fan and Gijbels (1996) recommend using the lowest odd order, i.e., r = ν + 1, or occasionally r = ν + 3.
For f choose r = 1 (maybe 3); for f' choose r = 2 (maybe 4).

52 Smoothing: Lowess and Loess I
A popular method based on local regression is lowess (Cleveland, 1979) and loess (Cleveland and Devlin, 1988). Lowess = locally weighted scatterplot smoother; it robustifies the locally WLS method above.
The basic idea is to fit a polynomial of degree r locally via (20) and obtain the fitted values. Then calculate the residuals and assign weights to each residual: large/small residuals receive small/large weights respectively. Then perform another local polynomial fit of order r with weights given by the product of the initial weight and the new weight. Thus observations showing large residuals at the initial fit are downweighted in the second fit.
The above process is repeated a number of times. Cleveland (1979) recommended r = 1 and 3 iterations (the default).

53 Smoothing: Lowess and Loess II

> par(mfrow = c(2, 2), mar = c(5, 4, 2, 1) + 0.1)
> set.seed(761)
> x <- sort(rnorm(100))
> eps <- rnorm(100, 0, 0.1)
> y <- sin(x) + eps
> plot(x, y, col = "blue", pch = 4)
> title("Default: lowess(x, y)", cex = 0.5)
> lo1 <- lowess(x, y)
> lines(lo1, lty = 1, col = "red")
> plot(x, y, col = "blue", pch = 4)
> title("lowess(x, y, f=0.5)", cex = 0.5)
> lo2 <- lowess(x, y, f = 0.5)
> lines(lo2, lty = 1, col = "red")
> plot(x, y, col = "blue", pch = 4)
> title("lowess(x, y, f=0.2)", cex = 0.5)
> lo3 <- lowess(x, y, f = 0.2)
> lines(lo3, lty = 1, col = "red")
> plot(x, y, col = "blue", pch = 4)
> title("Default: loess(y ~ x)", cex = 0.5)

54 Smoothing: Lowess and Loess III

> lo4 <- loess(y ~ x)
> lines(x, fitted(lo4), lty = 1, col = "red")

[Figure: four panels, "Default: lowess(x, y)", "lowess(x, y, f=0.5)", "lowess(x, y, f=0.2)" and "Default: loess(y ~ x)", each showing the data and the fitted curve.]

55 Smoothing: Lowess and Loess IV
Once again, choosing a good bandwidth is crucial.

56 Smoothing: Local Likelihood I
Local likelihood replaces the local least squares criterion by an appropriate local log-likelihood criterion.
Example: for binary data (x_i, y_i), i = 1, ..., n, y_i = 0 or 1, the local log-likelihood is

    Σ_{i=1}^n K((x_i - x)/h) {y_i log p_i + (1 - y_i) log(1 - p_i)}   (27)

where p_i = p(x_i) = P(Y = 1 | x_i).
We could model p(x) directly using local polynomials; however, it is usually preferable to use θ(x) = logit p(x). We approximate θ(x) locally by a polynomial, then choose the polynomial coefficients to maximize the likelihood.

57 Smoothing: Local Likelihood II
Local likelihood can also be applied to other regression models and to density estimation.
Local likelihood was developed by Tibshirani (1984). A good book on the topic is Loader (1999).

58 Smoothing: Regression Splines I
Idea: fit a higher degree polynomial (polynomial regression). Some drawbacks:
  polynomials aren't very local but have a global nature, so they usually misbehave at the boundaries, especially if the degree of the polynomial is high [cf. Stone-Weierstrass Theorem];
  individual observations can have a large influence on remote parts of the curve;
  the polynomial degree cannot be controlled continuously.
Polynomial regression can be fitted using the poly() function, e.g.,

> fit <- lm(y ~ poly(x, 5))

fits a 5th degree polynomial.

59 Smoothing: Regression Splines II
Regression splines use a piecewise polynomial. The regions are separated by knots (or breakpoints). The positions where each pair of segments join are called joints. The more knots, the more flexible the family of curves becomes.
It is customary to force the piecewise polynomials to join smoothly at these knots. A popular choice is piecewise cubic polynomials with continuous 0th, 1st and 2nd derivatives, called cubic splines. Using splines of degree > 3 seldom yields any advantage.
Given a set of knots, the smooth is computed by multiple regression on a set of basis vectors.

60 Smoothing: Regression Splines III
Here's a regression spline.

> pos <- function(x) ifelse(x > 0, x, 0)
> x <- 1:7
> y <- c(8, 3, 8, 5, 9, 14, 11)
> knot <- 4
> plot(x, y, col = "blue")
> X <- cbind(1, x, x^2, x^3, pos(x - knot)^3)
> fit <- lm(y ~ X - 1)
> xx <- seq(1, 7, length = 200)
> XX <- cbind(1, xx, xx^2, xx^3, pos(xx - knot)^3)
> lines(xx, XX %*% coef(fit))
> abline(v = knot, lty = "dashed", col = "purple")
> X

61 Smoothing: Regression Splines IV
The design matrix X = cbind(1, x, x^2, x^3, pos(x - knot)^3) is

            x
[1,]  1  1   1    1   0
[2,]  1  2   4    8   0
[3,]  1  3   9   27   0
[4,]  1  4  16   64   0
[5,]  1  5  25  125   1
[6,]  1  6  36  216   8
[7,]  1  7  49  343  27

62 Smoothing: Regression Splines V
[Figure: the data and the fitted regression spline, with the knot at x = 4 marked by a dashed vertical line.]

63 Smoothing: Regression Splines VI
Definitions: a function f ∈ C^k[a, b] if the derivatives f', f'', ..., f^(k) all exist and are continuous on [a, b]; e.g., |x| ∉ C¹[a, b] (when 0 lies inside [a, b]).
Notes:
1 f ∈ C^k[a, b] implies f ∈ C^{k-1}[a, b].
2 C[a, b] ≡ C^0[a, b] = {f(t) : f(t) continuous and real valued, a ≤ t ≤ b}.
There are at least two bases for cubic splines:
1 truncated power series: easier to understand but not used in practice,
2 B-splines: harder to understand but used in practice.

64 Smoothing: Regression Splines VII
Advantages of regression splines:
  computationally and statistically simple,
  standard parametric inferences are available. For example, whether a knot can be removed and the same polynomial equation used to explain two adjacent segments can be tested by H_0: θ_j = 0, which is one of the t-tests always printed by a regression program.
Disadvantages of regression splines:
  difficult to choose the number of knots,
  difficult to choose the position of the knots,
  the smoothness of the estimate cannot be varied continuously as a function of a single smoothing parameter.

65 Smoothing: Regression Splines VIII
Here is a more formal definition of a spline. In mathematics, a spline denotes a function s(x) which is essentially a piecewise polynomial over an interval (a, b), such that a certain number of its derivatives are continuous for all points in (a, b).
More precisely, s(x) is a spline of degree r (some given positive integer) with knots ξ_1, ..., ξ_K (such that a < ξ_1 < ξ_2 < ... < ξ_K < b) if it satisfies the following properties:
  on any subinterval (ξ_j, ξ_{j+1}), s(x) is a polynomial of degree r (order r + 1);
  s'(x), ..., s^{(r-1)}(x) are continuous, i.e., s ∈ C^{r-1}(a, b);
  the rth derivative of s(x) is a step function with jumps at ξ_1, ..., ξ_K.
Often r is chosen to be 3, and the term cubic spline is then used for the associated curve.

66 Smoothing: Regression Splines IX
Wold (1974), in a paper reflecting a lot of experience fitting regression splines, made the following recommendations when using cubic splines:
1 Knot points should be located at data points,
2 Have as few knots as possible, ensuring that a minimum of 4 or 5 observations fall between knot points,
3 No more than one extremum point and one inflexion point should fall between knots (because a cubic is not capable of approximating more variation),
4 Extrema should be centred in intervals and inflexion points should be located near knot points.

67 Smoothing: B-Splines I
B-splines form a numerically stable basis for splines. It is convenient to consider splines of a general order, M say.
1 M = 4: cubic spline.
2 M = 3: quadratic spline, which has continuous derivatives up to order M - 2 = 1 at the knots; this is aka a parabolic spline.
3 M = 2: linear spline, which has continuous derivatives up to order M - 2 = 0 at the knots, i.e., the function is continuous.
Let ξ_0 (< ξ_1) and ξ_{K+1} (> ξ_K) be two boundary knots. Define the augmented knot sequence {τ} such that
  τ_1 ≤ τ_2 ≤ ... ≤ τ_M ≤ ξ_0;
  τ_{j+M} = ξ_j,  j = 1, ..., K;
  ξ_{K+1} ≤ τ_{K+M+1} ≤ ... ≤ τ_{K+2M}.

68 Smoothing: B-Splines II
The actual values of these additional knots beyond the boundary are arbitrary, and it is customary to make them all the same and equal to ξ_0 and ξ_{K+1} respectively.
Denote by B_{i,m}(x) the ith B-spline basis function of order m for the knot sequence {τ}, m ≤ M. They are defined recursively as follows. For i = 1, ..., K + 2M - 1,

    B_{i,1}(x) = 1 if τ_i ≤ x < τ_{i+1}, and 0 otherwise.   (28)

69 Smoothing: B-Splines III
Then for i = 1, ..., K + 2M - m,

    B_{i,m}(x) = [(x - τ_i)/(τ_{i+m-1} - τ_i)] B_{i,m-1}(x) + [(τ_{i+m} - x)/(τ_{i+m} - τ_{i+1})] B_{i+1,m-1}(x)   (29)

(de Boor, 1978). He derived stable and efficient recursive algorithms for computing them.
Thus with m = 4, B_{i,4}, i = 1, ..., K + 4, are the K + 4 cubic B-spline basis functions for the knot sequence {ξ}. This recursion can be continued and will generate the B-spline basis for any order spline.
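A recursive sketch of (28)-(29); the hypothetical function below can be checked against splines::splineDesign() or bs():

bspl <- function(x, tau, i, m) {
  if (m == 1)
    return(as.numeric(tau[i] <= x & x < tau[i + 1]))          # (28)
  a1 <- if (tau[i + m - 1] > tau[i])                          # guard 0/0 at repeated knots
    (x - tau[i]) / (tau[i + m - 1] - tau[i]) else 0
  a2 <- if (tau[i + m] > tau[i + 1])
    (tau[i + m] - x) / (tau[i + m] - tau[i + 1]) else 0
  a1 * bspl(x, tau, i, m - 1) + a2 * bspl(x, tau, i + 1, m - 1)   # (29)
}
tau <- c(0, 0, 0, 0, 1, 2, 3, 4, 4, 4, 4)    # augmented knots: M = 4, interior knots 1, 2, 3
xx <- seq(0, 3.99, by = 0.01)
plot(xx, bspl(xx, tau, i = 4, m = 4), type = "l", ylab = "B_{4,4}(x)")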

70 Smoothing: B-Splines IV

> knots <- c(1:3, 5, 7, 8, 10)
> atx <- seq(0, 11, by = 0.01)
> mycol = (1:(22 + 1))[-7]
> for (ord in 2:5) {
+     B <- bs(x = atx, degree = ord - 1, knots = knots,
+         intercept = TRUE)
+     matplot(atx, B[, 1], type = "l", ylim = 0:1,
+         lty = 2, ylab = "", xlab = "")
+     matlines(atx, B[, -1], col = mycol, lty = 1)
+     title(paste("B-splines of order", ord))
+     abline(v = knots, lty = 2, col = "purple")
+ }
> attr(B, "degree")
[1] 4
> attr(B, "knots")
[1]  1  2  3  5  7  8 10

71 Smoothing: B-Splines V

> attr(B, "Boundary.knots")
[1]  0 11
> attr(B, "intercept")
[1] TRUE
> attr(B, "class")
[1] "bs"    "basis"

72 Smoothing: B-Splines VI
[Figure: four panels showing the B-spline basis functions of orders 2, 3, 4 and 5 for the knot sequence above.]

73 Smoothing: B-Splines VII
In general, bs() adds ord boundary knots to each end, where the boundary knot values are min(x_i) and max(x_i). If intercept = FALSE then the left-most function/column is omitted.

74 Smoothing: B-Splines VIII
To illustrate that linear combinations of the B-spline basis functions do accommodate smooth curves,

> matplot(atx, 2 * B[, 3] - 5 * B[, 4] + 3 * B[, 7], type = "l",
+     lwd = 2, col = "blue")

[Figure: the curve 2 * B[, 3] - 5 * B[, 4] + 3 * B[, 7] plotted against atx.]

75 Smoothing: B-Splines IX
Here are some additional notes:
1
> args(bs)
function (x, df = NULL, knots = NULL, degree = 3, intercept = FALSE,
    Boundary.knots = range(x))
NULL
In fact, in Value:, df should be length(knots) + degree + intercept.
2 Safe prediction is not as good as smart prediction, e.g., I(bs(x)), poly(scale(x), 2).

76 Smoothing: B-Splines X
3 B-splines are actually defined by means of divided differences. B_{i,m}, which is based on knots τ_i, ..., τ_{i+m}, is defined as

    B_{i,m}(x) = (τ_{i+m} - τ_i) Σ_{j=i}^{i+m} (x - τ_j)_+^{m-1} / Π_{s=i, s≠j}^{i+m} (τ_j - τ_s).   (30)

Equation (29) follows from this.

77 Smoothing: B-Splines XI
As an illustration,

> library(splines)
> n <- 50
> set.seed(760)
> knots <- 1:5
> x <- seq(0, 2 * pi, length = n)
> y <- sin(x) + rnorm(n, sd = 0.5)
> plot(x, y, col = "blue", pch = 4)
> fit <- lm(y ~ bs(x, knots = knots))
> abline(v = knots, lty = "dashed", col = "purple")
> lines(x, sin(x), col = "black")
> aknots = c(-Inf, knots, Inf)
> for (ii in 2:length(aknots)) {
+     newx = seq(max(aknots[ii - 1], min(x)), min(aknots[ii], max(x)),
+         len = 200)
+     lines(newx, predict(fit, data.frame(x = newx)),
+         col = ii - 1, lwd = 2)
+ }

78 Smoothing: B-Splines XII
Overall, the fit is ok, but could be improved by decreasing the number of knots and heeding the recommendations of Wold (1974).

[Figure: the data, the true function sin(x), and the fitted regression spline drawn segment by segment between the knots.]

79 Smoothing: B-Splines XIII
Knots with varying multiplicities have an effect illustrated by the following.

[Figure: four panels showing B-spline bases with knots of multiplicity 1, 2, 3 and 4.]

80 Smoothing: Natural Splines I
A cubic spline on [a, b] is a natural cubic spline (NCS) if its 2nd and 3rd derivatives are 0 at a and b (natural boundary conditions).
Natural splines, a restricted form of B-splines, have been implemented by the function ns(). Given knots ξ_1, ..., ξ_K, ns() is linear on (-∞, ξ_0] and [ξ_{K+1}, ∞), where ξ_0 and ξ_{K+1} are two extra knots; ns() chooses these to be the minimum and maximum of the x_i respectively. The result is K + 2 parameters.

81 Smoothing: Natural Splines II
Here's an example.

> set.seed(21)
> nn = 20
> x = seq(0, 1, len = nn)
> y = runif(nn)
> myknots = c(0.3, 0.7)
> plot(x, y, xlim = c(-0.5, 1.5), col = "blue")
> fit = lm(y ~ ns(x, knot = myknots))
> newx = seq(-0.5, 2.5, len = 100)
> lines(newx, predict(fit, data.frame(x = newx)),
+     col = "blue")
> abline(v = c(range(x), myknots), col = "purple",
+     lty = "dashed")
> coef(fit)

82 Smoothing: Natural Splines III

(Intercept)  ns(x, knot = myknots)1  ns(x, knot = myknots)2  ns(x, knot = myknots)3

83 Smoothing: Natural Splines IV
[Figure: the data and the fitted natural spline from the code above, plotted over a range extending beyond the data.]

84 Smoothing: Smoothing splines I
Cubic smoothing splines minimize

    S(f) = Σ_{i=1}^n (y_i - f(x_i))² + λ ∫_a^b {f''(x)}² dx   (31)

over a Sobolev space of order 2. Here, a < x_1 < ... < x_n < b for some a and b, and λ ≥ 0.
The terms of S(f):
1 the first penalizes lack of fit;
2 the second penalizes wiggliness.
These two conflicting quantities are weighted by the non-negative smoothing parameter λ.

85 Smoothing: Smoothing splines II
Larger values of λ produce smoother curves. As λ → ∞, f''(x) → 0 and the solution is a least squares line. As λ → 0, the solution tends to an interpolating twice-differentiable function.
(31) fits into the penalty function approach (Green and Silverman, 1994). Penalized least squares minimizes

    (y - f)^T Σ^{-1} (y - f) + f^T K f.

Solution: f̂ = A(λ) y where A(λ) = (I_n + Σ K)^{-1} is the influence or smoother matrix.
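A small sketch of the penalized least squares solution, using a second-difference penalty matrix K = D'D as a discrete stand-in for the integrated squared second derivative, with Σ = I and the multiplier on K playing the role of λ:

set.seed(6)
n <- 100
x <- seq(0, 1, length = n)
y <- sin(2 * pi * x) + rnorm(n, sd = 0.3)
D <- diff(diag(n), differences = 2)        # (n-2) x n second-difference operator
K <- t(D) %*% D                            # discrete roughness penalty
fhat <- function(lambda) solve(diag(n) + lambda * K, y)   # (I + lambda K)^{-1} y
plot(x, y, col = "blue")
lines(x, fhat(10),  col = "red")           # moderate smoothing
lines(x, fhat(1e5), col = "green4")        # heavy smoothing: approaches a straight line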

86 Smoothing: Some notes I
Here are some notes:
1 The smoothing parameter λ can be regarded as the turning knob which controls the tradeoff between fidelity to the data and smoothness. One can select λ by trial and error.
2 The penalty term is justified by physics: energy ∝ ∫_a^b curvature², which is approximated by ∫_a^b f''(t)² dt (cf. Hooke's Law).
3 Importantly, Reinsch (1967) showed, using the calculus of variations, that the solution of (31) is a cubic spline with knots at the unique values of the x_i. It can be shown that minimizing (31) is equivalent to minimizing ∫_a^b {f''(x)}² dx subject to Σ_{i=1}^n {y_i - f(x_i)}² ≤ σ.
4 As n → ∞, λ should become smaller.

87 Smoothing: Some notes II
5 There are alternative regularizations, e.g.,

    ∫_a^b f'(x)² dx   (32)

whose solution is a linear spline. In general, ∫_a^b f^{(ν)}(x)² dx produces a spline of degree 2ν - 1. Note we never get an even degree spline, not unless fractional derivatives are used.
6 S_2[a, b] is actually a Sobolev space of order 2. In general, a Sobolev space of order m is

    W_2^m[a, b] = {f : f^{(j)}, j = 0, ..., m - 1, is absolutely continuous on [a, b]; f^{(m)} ∈ L_2[a, b]},

i.e., ∫_a^b {f^{(m)}(t)}² dt < ∞.

88 Smoothing: Some notes III
How can we compute a cubic smoothing spline? There are several ways:
1 Direct method. Not recommended (O(n³)).
2 State-space approach (O(n)).
3 B-splines, a numerically stable method (O(n)).
4 Reinsch algorithm (O(n)).

89 Smoothing

> args(smooth.spline)
function (x, y = NULL, w = NULL, df, spar = NULL, cv = FALSE,
    all.knots = FALSE, nknots = NULL, keep.data = TRUE, df.offset = 0,
    penalty = 1, control.spar = list())
NULL

For basic use, use the argument df. Here's an example.

> data(cars)
> with(cars, plot(speed, dist, main = "data(cars) & smoothing splines"))
> cars.spl <- with(cars, smooth.spline(speed, dist))
> cars.spl

Call:
smooth.spline(x = speed, y = dist)

Smoothing Parameter  spar=          lambda=          (11 iterations)
Equivalent Degrees of Freedom (Df):
Penalized Criterion:
GCV:

90 Smoothing
This example has duplicate points, so avoid cv = TRUE.

> lines(cars.spl, col = "blue")
> with(cars, lines(smooth.spline(speed, dist, df = 10),
+     lty = 2, col = "red", lwd = 2))
> with(cars.spl, legend(5, 120, c(paste("default [C.V.] => df =",
+     round(df, 1)), "s( * , df = 10)"), col = c("blue",
+     "red"), lty = 1:2, lwd = 1:2, bg = "bisque"))

[Figure: data(cars) & smoothing splines; dist against speed, with the default fit (df = 2.6) and the df = 10 fit.]

91 Smoothing: Some General Theory I
In scatterplot smoothing there is a fundamental trade-off between the bias and variance of the estimate, and this phenomenon is governed by the smoothing parameter. An optimal choice of span would trade the bias off against the variance. One such criterion is the mean square error (MSE):

    E[(f̂_k(x_i) - f(x_i))²] = Var(f̂_k(x_i)) + {E[f̂_k(x_i)] - f(x_i)}².

92 Smoothing: Linear Smoothers I
A smoother is linear if

    S(a y_1 + b y_2 | x) = a S(y_1 | x) + b S(y_2 | x)   (33)

for any constants a and b. That is,

    ŷ = S y   (34)

where S does not depend on y. S is referred to as the influence (or smoother) matrix.
Examples: the bin, running-mean, running-line, regression spline, cubic spline, kernel and local polynomial kernel smoothers are all linear smoothers (with a fixed smoothing parameter).

93 Smoothing: Linear Smoothers II
The theory for linear smoothers is much simpler than for nonlinear smoothers. Many properties of smoothers can be seen from the eigenvalues and eigenvectors of S.
For example, for a cubic smoothing spline, S(λ) has all its eigenvalues in (0, 1], with exactly two unit eigenvalues whose corresponding eigenvectors are 1 and x. That is, S 1 = 1 and S x = x. These correspond to constant and linear functions.
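One can look at S empirically: a sketch that recovers the smoother matrix of smooth.spline() column by column (by smoothing unit vectors at a fixed amount of smoothing) and checks the two unit eigenvalues:

set.seed(7)
n <- 30
x <- sort(runif(n))
S <- sapply(1:n, function(j) {
  ej <- numeric(n); ej[j] <- 1
  smooth.spline(x, ej, spar = 0.7, all.knots = TRUE)$y   # jth column of S
})
max(abs(S %*% rep(1, n) - 1))        # ~0:  S 1 = 1
max(abs(S %*% x - x))                # ~0:  S x = x
range(Re(eigen(S)$values))           # eigenvalues lie in (0, 1]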

94 Smoothing: Linear Smoothers III
Figure: Eigenvalues of a cubic smoothing spline.

95 Smoothing: Degrees of Freedom I
All smoothers allow the user to vary the amount of smoothing done via the smoothing parameter, e.g., the bandwidth, the span, or λ. However, it would be useful to have some measure of the amount of smoothing done. One such measure is the effective degrees of freedom (EDF) of a smooth. It is useful for a number of reasons, e.g., comparing different types of smoothers while keeping the amount of smoothing roughly equal.
The theory of EDF is a natural extension of standard results for the general linear model. Recall that, if β is p × 1,

    Y = Xβ + ε,   Var(ε) = σ² I.

96 Smoothing: Degrees of Freedom II
1 Ŷ = P y where P = X(X^T X)^{-1} X^T is idempotent of rank p. Then trace(P) = p,
2 trace(Var(Ŷ)) = σ² trace(P P^T) = σ² p,
3 E[(n - p) S²] = E[ResSS] = E[(Y - Ŷ)^T (Y - Ŷ)] = σ² (n - p).
By replacing P by S, these results suggest the following three definitions for the effective degrees of freedom of a smooth:
1 df = trace(S),
2 df_var = trace(S S^T), and
3 df_err = n - trace(2S - S S^T).
More generally, with weights W, these are
1 df = trace(S),
2 df_var = trace(W S W^{-1} S^T), and
3 df_err = n - trace(2S - S^T W S W^{-1}).
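A sketch computing the three EDF measures for a cubic smoothing spline from its (empirically recovered) smoother matrix, with unit weights:

set.seed(8)
n <- 40
x <- sort(runif(n))
S <- sapply(1:n, function(j) {
  ej <- numeric(n); ej[j] <- 1
  smooth.spline(x, ej, df = 6, all.knots = TRUE)$y
})
c(df     = sum(diag(S)),                        # trace(S); close to the target 6
  df.var = sum(diag(S %*% t(S))),               # trace(S S')
  df.err = n - sum(diag(2 * S - S %*% t(S))))   # n - trace(2S - S S')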

97 Smoothing: Degrees of Freedom III
It can be shown that if S is a symmetric projection matrix then trace(S), trace(2S - S S^T) and trace(S S^T) coincide.
For cubic smoothing splines, it can be shown that

    trace(S S^T) ≤ trace(S) ≤ trace(2S - S S^T)

and that all three of these functions are decreasing in λ.
Notes:
1 df is the most popular and easiest to compute. The cost of df is O(n) for most smoothers.

98 Smoothing: Degrees of Freedom IV
2 We have 2 ≤ degrees of freedom ≤ the number of distinct x_i. A linear fit corresponds to 2; degrees of freedom equal to the number of distinct x_i corresponds to an interpolant. As the degrees of freedom increases, the fit becomes more wiggly.
A smooth with 3 degrees of freedom has approximately the same flexibility as a quadratic. A value of 4 or 5 degrees of freedom is often used as the default value in software, as this can accommodate a reasonable amount of nonlinearity without being excessive.

99 Smoothing: Standard Errors I

    f̂ = S y   implies   Var(f̂) = σ² S S^T.   (35)

One can form pointwise SE bands for f̂ (useful in preventing the over-interpretation of a plot of the estimated function). But (35) is impractical if n is large (all of S is needed).
Trick: for cubic smoothing splines, Silverman (1985) uses a Bayesian derivation to discuss the use of the alternative σ² S. Its cost is O(n).

100 Smoothing: Equivalent Kernels I
Consider ŷ = S y for a linear smoother. Plotting the jth row of S versus the x_i gives the weights used for the estimate ŷ_j. This mimics the kernel function of a kernel smoother.
The EK for a cubic smoothing spline is

    κ(u) = (1/2) exp(-|u|/√2) sin(|u|/√2 + π/4)

as n → ∞ (Silverman, 1984).

101 Smoothing: Equivalent Kernels II
[Figure: the equivalent kernel, EK plotted against u.]
If the design points x_i have a local density g(x), and if x is not too near the boundary and λ is not too big or too small, then the local bandwidth h(x) satisfies

    h(x) = {λ / (n g(x))}^{1/4}.

102 Smoothing: Automatic Smoothing Parameter Selection I
Choosing the bandwidth/smoothing parameter is the most important decision for a specified method. We want an automatic way of choosing the right smoothing parameter. A popular method is cross-validation (CV); we restrict attention to linear smoothers.
CV idea: leave one point (x_i, y_i) out at a time and estimate the smooth at x_i based on the remaining n - 1 points. Choose λ_CV to minimize the cross-validation sum of squares

    CV(λ) = (1/n) Σ_{i=1}^n { y_i - f̂_λ^{(-i)}(x_i) }²   (36)

103 Smoothing: Automatic Smoothing Parameter Selection II
where f̂_λ^{(-i)}(x_i) is the fitted value at x_i, computed by leaving out the ith data point.
One can compute (36) naïvely, but there is a trick. Define f̂_λ^{(-i)}(x_i) to be the fit obtained by setting the weight of the ith observation to zero and increasing the remaining weights so that they sum to unity, i.e.,

    f̂_λ^{(-i)}(x_i) = Σ_{j≠i} [ s_ij / (1 - s_ii) ] y_j.   (37)

This means

    f̂_λ^{(-i)}(x_i) = Σ_{j≠i} s_ij y_j + s_ii f̂_λ^{(-i)}(x_i)   (38)

104 Smoothing: Automatic Smoothing Parameter Selection III
and

    y_i - f̂_λ^{(-i)}(x_i) = [ y_i - f̂_λ(x_i) ] / (1 - s_ii).   (39)

Thus, CV(λ) can be written

    CV(λ) = (1/n) Σ_{i=1}^n { [ y_i - f̂_λ(x_i) ] / [ 1 - s_ii(λ) ] }².   (40)

So there is no need to compute f̂_λ^{(-i)}(x_i) naïvely. In practice, CV sometimes gives questionable performance.
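A sketch of (40) using the leverage values (the s_ii) returned by smooth.spline(), so the leave-one-out score needs no refitting; the data and df grid are illustrative:

set.seed(9)
n <- 60
x <- sort(runif(n))
y <- sin(2 * pi * x) + rnorm(n, sd = 0.3)
cv.score <- function(df) {
  fit <- smooth.spline(x, y, df = df, all.knots = TRUE)
  mean(((y - fit$y) / (1 - fit$lev))^2)          # (40): fit$lev holds the s_ii
}
dfs <- seq(3, 15, by = 0.5)
plot(dfs, sapply(dfs, cv.score), type = "b", xlab = "df", ylab = "CV")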

105 Smoothing: Generalized Cross-Validation
A variant of CV(λ) is generalized cross-validation (GCV). The GCV idea is to replace s_ii by its average value, trace(S)/n, which is easier to compute:

    GCV(λ) = (1/n) Σ_{i=1}^n { [ y_i - f̂_λ(x_i) ] / [ 1 - trace(S)/n ] }².

GCV tends to undersmooth.

106 Smoothing: Testing for nonlinearity
Suppose we wish to compare two smooths f̂_1 = S_1 y and f̂_2 = S_2 y. For example, the smooth f̂_2 might be rougher than f̂_1, and we wish to test whether it picks up any significant bias. A standard case that often arises is when f̂_1 is linear, in which case we want to test if the linearity is real. We must assume that f̂_2 is unbiased, and that f̂_1 is unbiased under H_0.
Letting ResSS_j be the residual sum of squares for the jth smooth and γ_j = trace(2 S_j - S_j^T S_j), then

    [ (ResSS_1 - ResSS_2)/(γ_2 - γ_1) ] / [ ResSS_2/(n - γ_1) ]  ~  F_{γ_2 - γ_1, n - γ_1}   (41)

approximately.

107 Smoothing: The Curse of Dimensionality I
Sometimes multidimensional smoothers can work with a moderate number of inputs. But the curse of dimensionality hinders them in higher dimensions:
  local neighbourhoods are empty, or nearest-neighbourhoods are not local,
  all points are close to the boundary,
  sample sizes need to grow exponentially.

108 Smoothing: The Curse of Dimensionality II
That is, neighbourhoods with a fixed number of points become less local as the dimension increases. For fixed n, the data become more isolated in d-space and smoothers require a larger neighbourhood to find enough data points in order to calculate the variance of an estimate. Hence the estimate is no longer local and can be severely biased.
The following illustrates the curse of dimensionality. Suppose we have data uniformly distributed in a d-dimensional unit cube. We spread out a subcube from the origin to capture span% of the data. What distance do we have to reach out on each axis? The next figure gives the answer (a small numerical sketch also follows).
Most reasonable high-dimensional procedures assume some structure.
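The distance in question is easy to compute (a one-line sketch): a subcube capturing a fraction span of uniform data in d dimensions needs side length span^(1/d).

span <- seq(0.01, 0.99, by = 0.01)
plot(span, span, type = "l", xlab = "span", ylab = "distance on each axis")  # d = 1
for (d in c(2, 3, 10)) lines(span, span^(1/d), lty = 2)                      # d = 2, 3, 10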

109 Smoothing: The Curse of Dimensionality III
Figure: Distance on each axis of a subcube required to capture span% of the data of a d-dimensional unit cube, for d = 1, 2, 3, 10.

110 Generalized Additive Models (GAMs) I
For general p, the linear model is

    Y = β_1 X_1 + ... + β_p X_p + ε,   ε ~ N(0, σ²) independently.   (42)

This model has some strong assumptions:
1 Linearity, i.e., the effect of each X_k on E(Y) is linear,
2 Normal errors with zero mean, constant variance, and independence,
3 Additivity, i.e., X_k and X_l do not interact; they have an additive effect on the response.

111 Generalized Additive Models (GAMs) II
We relax the linearity assumption. The linear predictor becomes an additive predictor, a sum of arbitrary smooth functions:

    η(x) = f_1(x_1) + ... + f_p(x_p).   (43)

Additivity is still assumed. Easy to interpret.
Identifiability: the f_k(x_k) are centred.
Very useful for exploratory data analysis; allows the data to speak for themselves.
Some GAM books are Hastie and Tibshirani (1990) and Wood (2006).

112 Generalized Additive Models (GAMs) I
Fit an additive model by backfitting. It is an iterative procedure that smooths partial residuals against each x_t: since

    E(y | x) = f_t(x_t) + Σ_{k=1, k≠t}^p f_k(x_k),

we have

    f_t(x_t) = E[ y - Σ_{k=1, k≠t}^p f_k(x_k) | x_t ].

Modified backfitting is possible and is implemented. It decomposes

    η(x) = X β + Σ_{k=1}^p r_k(x_k),

i.e., into linear and nonlinear components.
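A bare-bones sketch of backfitting for two covariates, using smooth.spline() with a fixed df as the scatterplot smoother (all names and settings here are illustrative):

set.seed(10)
n <- 200
x1 <- runif(n); x2 <- runif(n)
y <- sin(2 * pi * x1) + 4 * (x2 - 0.5)^2 + rnorm(n, sd = 0.3)
sm <- function(x, r) predict(smooth.spline(x, r, df = 5), x)$y   # smooth residuals r on x
alpha <- mean(y)
f1 <- f2 <- rep(0, n)
for (it in 1:20) {                                   # cycle over the smooths until they settle
  f1 <- sm(x1, y - alpha - f2); f1 <- f1 - mean(f1)  # centre each f_k (identifiability)
  f2 <- sm(x2, y - alpha - f1); f2 <- f2 - mean(f2)
}
par(mfrow = c(1, 2))
plot(x1, f1, col = "blue"); plot(x2, f2, col = "blue")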

113 Generalized Additive Models (GAMs) I: Example 1, Kauri data
Y = presence/absence of a tree species, agaaus, which is Agathis australis, better known as Kauri, NZ's most famous tree. Data are from 392 sites in the Hunua forest near Auckland.

Figure: Big Kauri tree.


More information

Spatial Process Estimates as Smoothers: A Review

Spatial Process Estimates as Smoothers: A Review Spatial Process Estimates as Smoothers: A Review Soutir Bandyopadhyay 1 Basic Model The observational model considered here has the form Y i = f(x i ) + ɛ i, for 1 i n. (1.1) where Y i is the observed

More information

SCHOOL OF MATHEMATICS AND STATISTICS. Linear and Generalised Linear Models

SCHOOL OF MATHEMATICS AND STATISTICS. Linear and Generalised Linear Models SCHOOL OF MATHEMATICS AND STATISTICS Linear and Generalised Linear Models Autumn Semester 2017 18 2 hours Attempt all the questions. The allocation of marks is shown in brackets. RESTRICTED OPEN BOOK EXAMINATION

More information

Stat 5101 Lecture Notes

Stat 5101 Lecture Notes Stat 5101 Lecture Notes Charles J. Geyer Copyright 1998, 1999, 2000, 2001 by Charles J. Geyer May 7, 2001 ii Stat 5101 (Geyer) Course Notes Contents 1 Random Variables and Change of Variables 1 1.1 Random

More information

Econ 582 Nonparametric Regression

Econ 582 Nonparametric Regression Econ 582 Nonparametric Regression Eric Zivot May 28, 2013 Nonparametric Regression Sofarwehaveonlyconsideredlinearregressionmodels = x 0 β + [ x ]=0 [ x = x] =x 0 β = [ x = x] [ x = x] x = β The assume

More information

Inversion Base Height. Daggot Pressure Gradient Visibility (miles)

Inversion Base Height. Daggot Pressure Gradient Visibility (miles) Stanford University June 2, 1998 Bayesian Backtting: 1 Bayesian Backtting Trevor Hastie Stanford University Rob Tibshirani University of Toronto Email: trevor@stat.stanford.edu Ftp: stat.stanford.edu:

More information

Time Series and Forecasting Lecture 4 NonLinear Time Series

Time Series and Forecasting Lecture 4 NonLinear Time Series Time Series and Forecasting Lecture 4 NonLinear Time Series Bruce E. Hansen Summer School in Economics and Econometrics University of Crete July 23-27, 2012 Bruce Hansen (University of Wisconsin) Foundations

More information

Outline of GLMs. Definitions

Outline of GLMs. Definitions Outline of GLMs Definitions This is a short outline of GLM details, adapted from the book Nonparametric Regression and Generalized Linear Models, by Green and Silverman. The responses Y i have density

More information

Introduction to Regression

Introduction to Regression Introduction to Regression Chad M. Schafer May 20, 2015 Outline General Concepts of Regression, Bias-Variance Tradeoff Linear Regression Nonparametric Procedures Cross Validation Local Polynomial Regression

More information

Introduction to machine learning and pattern recognition Lecture 2 Coryn Bailer-Jones

Introduction to machine learning and pattern recognition Lecture 2 Coryn Bailer-Jones Introduction to machine learning and pattern recognition Lecture 2 Coryn Bailer-Jones http://www.mpia.de/homes/calj/mlpr_mpia2008.html 1 1 Last week... supervised and unsupervised methods need adaptive

More information

Chapter 1 Statistical Inference

Chapter 1 Statistical Inference Chapter 1 Statistical Inference causal inference To infer causality, you need a randomized experiment (or a huge observational study and lots of outside information). inference to populations Generalizations

More information

STA216: Generalized Linear Models. Lecture 1. Review and Introduction

STA216: Generalized Linear Models. Lecture 1. Review and Introduction STA216: Generalized Linear Models Lecture 1. Review and Introduction Let y 1,..., y n denote n independent observations on a response Treat y i as a realization of a random variable Y i In the general

More information

A Handbook of Statistical Analyses Using R 2nd Edition. Brian S. Everitt and Torsten Hothorn

A Handbook of Statistical Analyses Using R 2nd Edition. Brian S. Everitt and Torsten Hothorn A Handbook of Statistical Analyses Using R 2nd Edition Brian S. Everitt and Torsten Hothorn CHAPTER 7 Logistic Regression and Generalised Linear Models: Blood Screening, Women s Role in Society, Colonic

More information

Generalized Linear Models. Last time: Background & motivation for moving beyond linear

Generalized Linear Models. Last time: Background & motivation for moving beyond linear Generalized Linear Models Last time: Background & motivation for moving beyond linear regression - non-normal/non-linear cases, binary, categorical data Today s class: 1. Examples of count and ordered

More information

Penalized Regression

Penalized Regression Penalized Regression Deepayan Sarkar Penalized regression Another potential remedy for collinearity Decreases variability of estimated coefficients at the cost of introducing bias Also known as regularization

More information

Introduction to Nonparametric Regression

Introduction to Nonparametric Regression Introduction to Nonparametric Regression Nathaniel E. Helwig Assistant Professor of Psychology and Statistics University of Minnesota (Twin Cities) Updated 04-Jan-2017 Nathaniel E. Helwig (U of Minnesota)

More information

vgam Family Functions for Log-linear Models

vgam Family Functions for Log-linear Models vgam Family Functions for Log-linear Models T. W. Yee October 30, 2006 Beta Version 0.6-5 Thomas W. Yee Department of Statistics, University of Auckland, New Zealand yee@stat.auckland.ac.nz http://www.stat.auckland.ac.nz/

More information

Regression: Lecture 2

Regression: Lecture 2 Regression: Lecture 2 Niels Richard Hansen April 26, 2012 Contents 1 Linear regression and least squares estimation 1 1.1 Distributional results................................ 3 2 Non-linear effects and

More information

Logistic regression. 11 Nov Logistic regression (EPFL) Applied Statistics 11 Nov / 20

Logistic regression. 11 Nov Logistic regression (EPFL) Applied Statistics 11 Nov / 20 Logistic regression 11 Nov 2010 Logistic regression (EPFL) Applied Statistics 11 Nov 2010 1 / 20 Modeling overview Want to capture important features of the relationship between a (set of) variable(s)

More information

Introduction to Regression

Introduction to Regression Introduction to Regression Chad M. Schafer cschafer@stat.cmu.edu Carnegie Mellon University Introduction to Regression p. 1/100 Outline General Concepts of Regression, Bias-Variance Tradeoff Linear Regression

More information

Introduction. Linear Regression. coefficient estimates for the wage equation: E(Y X) = X 1 β X d β d = X β

Introduction. Linear Regression. coefficient estimates for the wage equation: E(Y X) = X 1 β X d β d = X β Introduction - Introduction -2 Introduction Linear Regression E(Y X) = X β +...+X d β d = X β Example: Wage equation Y = log wages, X = schooling (measured in years), labor market experience (measured

More information

Administration. Homework 1 on web page, due Feb 11 NSERC summer undergraduate award applications due Feb 5 Some helpful books

Administration. Homework 1 on web page, due Feb 11 NSERC summer undergraduate award applications due Feb 5 Some helpful books STA 44/04 Jan 6, 00 / 5 Administration Homework on web page, due Feb NSERC summer undergraduate award applications due Feb 5 Some helpful books STA 44/04 Jan 6, 00... administration / 5 STA 44/04 Jan 6,

More information

Lecture 5: LDA and Logistic Regression

Lecture 5: LDA and Logistic Regression Lecture 5: and Logistic Regression Hao Helen Zhang Hao Helen Zhang Lecture 5: and Logistic Regression 1 / 39 Outline Linear Classification Methods Two Popular Linear Models for Classification Linear Discriminant

More information

Model-free prediction intervals for regression and autoregression. Dimitris N. Politis University of California, San Diego

Model-free prediction intervals for regression and autoregression. Dimitris N. Politis University of California, San Diego Model-free prediction intervals for regression and autoregression Dimitris N. Politis University of California, San Diego To explain or to predict? Models are indispensable for exploring/utilizing relationships

More information

A Nonparametric Monotone Regression Method for Bernoulli Responses with Applications to Wafer Acceptance Tests

A Nonparametric Monotone Regression Method for Bernoulli Responses with Applications to Wafer Acceptance Tests A Nonparametric Monotone Regression Method for Bernoulli Responses with Applications to Wafer Acceptance Tests Jyh-Jen Horng Shiau (Joint work with Shuo-Huei Lin and Cheng-Chih Wen) Institute of Statistics

More information

Model checking overview. Checking & Selecting GAMs. Residual checking. Distribution checking

Model checking overview. Checking & Selecting GAMs. Residual checking. Distribution checking Model checking overview Checking & Selecting GAMs Simon Wood Mathematical Sciences, University of Bath, U.K. Since a GAM is just a penalized GLM, residual plots should be checked exactly as for a GLM.

More information

Chapter 9. Non-Parametric Density Function Estimation

Chapter 9. Non-Parametric Density Function Estimation 9-1 Density Estimation Version 1.2 Chapter 9 Non-Parametric Density Function Estimation 9.1. Introduction We have discussed several estimation techniques: method of moments, maximum likelihood, and least

More information

Lecture 3: Statistical Decision Theory (Part II)

Lecture 3: Statistical Decision Theory (Part II) Lecture 3: Statistical Decision Theory (Part II) Hao Helen Zhang Hao Helen Zhang Lecture 3: Statistical Decision Theory (Part II) 1 / 27 Outline of This Note Part I: Statistics Decision Theory (Classical

More information

Semiparametric Generalized Linear Models

Semiparametric Generalized Linear Models Semiparametric Generalized Linear Models North American Stata Users Group Meeting Chicago, Illinois Paul Rathouz Department of Health Studies University of Chicago prathouz@uchicago.edu Liping Gao MS Student

More information

Generalized linear models

Generalized linear models Generalized linear models Søren Højsgaard Department of Mathematical Sciences Aalborg University, Denmark October 29, 202 Contents Densities for generalized linear models. Mean and variance...............................

More information

An Introduction to GAMs based on penalized regression splines. Simon Wood Mathematical Sciences, University of Bath, U.K.

An Introduction to GAMs based on penalized regression splines. Simon Wood Mathematical Sciences, University of Bath, U.K. An Introduction to GAMs based on penalied regression splines Simon Wood Mathematical Sciences, University of Bath, U.K. Generalied Additive Models (GAM) A GAM has a form something like: g{e(y i )} = η

More information

A Handbook of Statistical Analyses Using R. Brian S. Everitt and Torsten Hothorn

A Handbook of Statistical Analyses Using R. Brian S. Everitt and Torsten Hothorn A Handbook of Statistical Analyses Using R Brian S. Everitt and Torsten Hothorn CHAPTER 6 Logistic Regression and Generalised Linear Models: Blood Screening, Women s Role in Society, and Colonic Polyps

More information

Spatially Adaptive Smoothing Splines

Spatially Adaptive Smoothing Splines Spatially Adaptive Smoothing Splines Paul Speckman University of Missouri-Columbia speckman@statmissouriedu September 11, 23 Banff 9/7/3 Ordinary Simple Spline Smoothing Observe y i = f(t i ) + ε i, =

More information

Data Mining Stat 588

Data Mining Stat 588 Data Mining Stat 588 Lecture 9: Basis Expansions Department of Statistics & Biostatistics Rutgers University Nov 01, 2011 Regression and Classification Linear Regression. E(Y X) = f(x) We want to learn

More information

Linear Regression Model. Badr Missaoui

Linear Regression Model. Badr Missaoui Linear Regression Model Badr Missaoui Introduction What is this course about? It is a course on applied statistics. It comprises 2 hours lectures each week and 1 hour lab sessions/tutorials. We will focus

More information

STAT 704 Sections IRLS and Bootstrap

STAT 704 Sections IRLS and Bootstrap STAT 704 Sections 11.4-11.5. IRLS and John Grego Department of Statistics, University of South Carolina Stat 704: Data Analysis I 1 / 14 LOWESS IRLS LOWESS LOWESS (LOcally WEighted Scatterplot Smoothing)

More information

Classification. Chapter Introduction. 6.2 The Bayes classifier

Classification. Chapter Introduction. 6.2 The Bayes classifier Chapter 6 Classification 6.1 Introduction Often encountered in applications is the situation where the response variable Y takes values in a finite set of labels. For example, the response Y could encode

More information

STA 216: GENERALIZED LINEAR MODELS. Lecture 1. Review and Introduction. Much of statistics is based on the assumption that random

STA 216: GENERALIZED LINEAR MODELS. Lecture 1. Review and Introduction. Much of statistics is based on the assumption that random STA 216: GENERALIZED LINEAR MODELS Lecture 1. Review and Introduction Much of statistics is based on the assumption that random variables are continuous & normally distributed. Normal linear regression

More information

Part 8: GLMs and Hierarchical LMs and GLMs

Part 8: GLMs and Hierarchical LMs and GLMs Part 8: GLMs and Hierarchical LMs and GLMs 1 Example: Song sparrow reproductive success Arcese et al., (1992) provide data on a sample from a population of 52 female song sparrows studied over the course

More information

LOGISTIC REGRESSION Joseph M. Hilbe

LOGISTIC REGRESSION Joseph M. Hilbe LOGISTIC REGRESSION Joseph M. Hilbe Arizona State University Logistic regression is the most common method used to model binary response data. When the response is binary, it typically takes the form of

More information

Kneib, Fahrmeir: Supplement to "Structured additive regression for categorical space-time data: A mixed model approach"

Kneib, Fahrmeir: Supplement to Structured additive regression for categorical space-time data: A mixed model approach Kneib, Fahrmeir: Supplement to "Structured additive regression for categorical space-time data: A mixed model approach" Sonderforschungsbereich 386, Paper 43 (25) Online unter: http://epub.ub.uni-muenchen.de/

More information

Generalized Linear Models Introduction

Generalized Linear Models Introduction Generalized Linear Models Introduction Statistics 135 Autumn 2005 Copyright c 2005 by Mark E. Irwin Generalized Linear Models For many problems, standard linear regression approaches don t work. Sometimes,

More information

Generalized additive modelling of hydrological sample extremes

Generalized additive modelling of hydrological sample extremes Generalized additive modelling of hydrological sample extremes Valérie Chavez-Demoulin 1 Joint work with A.C. Davison (EPFL) and Marius Hofert (ETHZ) 1 Faculty of Business and Economics, University of

More information

Estimation of cumulative distribution function with spline functions

Estimation of cumulative distribution function with spline functions INTERNATIONAL JOURNAL OF ECONOMICS AND STATISTICS Volume 5, 017 Estimation of cumulative distribution function with functions Akhlitdin Nizamitdinov, Aladdin Shamilov Abstract The estimation of the cumulative

More information

Generalized Linear Models: An Introduction

Generalized Linear Models: An Introduction Applied Statistics With R Generalized Linear Models: An Introduction John Fox WU Wien May/June 2006 2006 by John Fox Generalized Linear Models: An Introduction 1 A synthesis due to Nelder and Wedderburn,

More information

Chapter 9. Non-Parametric Density Function Estimation

Chapter 9. Non-Parametric Density Function Estimation 9-1 Density Estimation Version 1.1 Chapter 9 Non-Parametric Density Function Estimation 9.1. Introduction We have discussed several estimation techniques: method of moments, maximum likelihood, and least

More information

Generalised linear models. Response variable can take a number of different formats

Generalised linear models. Response variable can take a number of different formats Generalised linear models Response variable can take a number of different formats Structure Limitations of linear models and GLM theory GLM for count data GLM for presence \ absence data GLM for proportion

More information

Mixed models in R using the lme4 package Part 5: Generalized linear mixed models

Mixed models in R using the lme4 package Part 5: Generalized linear mixed models Mixed models in R using the lme4 package Part 5: Generalized linear mixed models Douglas Bates 2011-03-16 Contents 1 Generalized Linear Mixed Models Generalized Linear Mixed Models When using linear mixed

More information

Linear Methods for Prediction

Linear Methods for Prediction This work is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike License. Your use of this material constitutes acceptance of that license and the conditions of use of materials on this

More information

Two Hours. Mathematical formula books and statistical tables are to be provided THE UNIVERSITY OF MANCHESTER. 26 May :00 16:00

Two Hours. Mathematical formula books and statistical tables are to be provided THE UNIVERSITY OF MANCHESTER. 26 May :00 16:00 Two Hours MATH38052 Mathematical formula books and statistical tables are to be provided THE UNIVERSITY OF MANCHESTER GENERALISED LINEAR MODELS 26 May 2016 14:00 16:00 Answer ALL TWO questions in Section

More information

Diagnostics and Transformations Part 2

Diagnostics and Transformations Part 2 Diagnostics and Transformations Part 2 Bivariate Linear Regression James H. Steiger Department of Psychology and Human Development Vanderbilt University Multilevel Regression Modeling, 2009 Diagnostics

More information

Variable Selection and Model Choice in Survival Models with Time-Varying Effects

Variable Selection and Model Choice in Survival Models with Time-Varying Effects Variable Selection and Model Choice in Survival Models with Time-Varying Effects Boosting Survival Models Benjamin Hofner 1 Department of Medical Informatics, Biometry and Epidemiology (IMBE) Friedrich-Alexander-Universität

More information

Likelihood Ratio Tests. that Certain Variance Components Are Zero. Ciprian M. Crainiceanu. Department of Statistical Science

Likelihood Ratio Tests. that Certain Variance Components Are Zero. Ciprian M. Crainiceanu. Department of Statistical Science 1 Likelihood Ratio Tests that Certain Variance Components Are Zero Ciprian M. Crainiceanu Department of Statistical Science www.people.cornell.edu/pages/cmc59 Work done jointly with David Ruppert, School

More information

Generalized Linear Models

Generalized Linear Models Generalized Linear Models Lecture 3. Hypothesis testing. Goodness of Fit. Model diagnostics GLM (Spring, 2018) Lecture 3 1 / 34 Models Let M(X r ) be a model with design matrix X r (with r columns) r n

More information

LWP. Locally Weighted Polynomials toolbox for Matlab/Octave

LWP. Locally Weighted Polynomials toolbox for Matlab/Octave LWP Locally Weighted Polynomials toolbox for Matlab/Octave ver. 2.2 Gints Jekabsons http://www.cs.rtu.lv/jekabsons/ User's manual September, 2016 Copyright 2009-2016 Gints Jekabsons CONTENTS 1. INTRODUCTION...3

More information

Outline. Mixed models in R using the lme4 package Part 5: Generalized linear mixed models. Parts of LMMs carried over to GLMMs

Outline. Mixed models in R using the lme4 package Part 5: Generalized linear mixed models. Parts of LMMs carried over to GLMMs Outline Mixed models in R using the lme4 package Part 5: Generalized linear mixed models Douglas Bates University of Wisconsin - Madison and R Development Core Team UseR!2009,

More information

Consider fitting a model using ordinary least squares (OLS) regression:

Consider fitting a model using ordinary least squares (OLS) regression: Example 1: Mating Success of African Elephants In this study, 41 male African elephants were followed over a period of 8 years. The age of the elephant at the beginning of the study and the number of successful

More information

Mixed models in R using the lme4 package Part 7: Generalized linear mixed models

Mixed models in R using the lme4 package Part 7: Generalized linear mixed models Mixed models in R using the lme4 package Part 7: Generalized linear mixed models Douglas Bates University of Wisconsin - Madison and R Development Core Team University of

More information

Statistical Machine Learning Hilary Term 2018

Statistical Machine Learning Hilary Term 2018 Statistical Machine Learning Hilary Term 2018 Pier Francesco Palamara Department of Statistics University of Oxford Slide credits and other course material can be found at: http://www.stats.ox.ac.uk/~palamara/sml18.html

More information

Final Review. Yang Feng. Yang Feng (Columbia University) Final Review 1 / 58

Final Review. Yang Feng.   Yang Feng (Columbia University) Final Review 1 / 58 Final Review Yang Feng http://www.stat.columbia.edu/~yangfeng Yang Feng (Columbia University) Final Review 1 / 58 Outline 1 Multiple Linear Regression (Estimation, Inference) 2 Special Topics for Multiple

More information

Introduction to Linear regression analysis. Part 2. Model comparisons

Introduction to Linear regression analysis. Part 2. Model comparisons Introduction to Linear regression analysis Part Model comparisons 1 ANOVA for regression Total variation in Y SS Total = Variation explained by regression with X SS Regression + Residual variation SS Residual

More information

Econ 2148, fall 2017 Gaussian process priors, reproducing kernel Hilbert spaces, and Splines

Econ 2148, fall 2017 Gaussian process priors, reproducing kernel Hilbert spaces, and Splines Econ 2148, fall 2017 Gaussian process priors, reproducing kernel Hilbert spaces, and Splines Maximilian Kasy Department of Economics, Harvard University 1 / 37 Agenda 6 equivalent representations of the

More information

Generalized Linear Models 1

Generalized Linear Models 1 Generalized Linear Models 1 STA 2101/442: Fall 2012 1 See last slide for copyright information. 1 / 24 Suggested Reading: Davison s Statistical models Exponential families of distributions Sec. 5.2 Chapter

More information

Penalized Splines, Mixed Models, and Recent Large-Sample Results

Penalized Splines, Mixed Models, and Recent Large-Sample Results Penalized Splines, Mixed Models, and Recent Large-Sample Results David Ruppert Operations Research & Information Engineering, Cornell University Feb 4, 2011 Collaborators Matt Wand, University of Wollongong

More information

Generalized Linear Models. Kurt Hornik

Generalized Linear Models. Kurt Hornik Generalized Linear Models Kurt Hornik Motivation Assuming normality, the linear model y = Xβ + e has y = β + ε, ε N(0, σ 2 ) such that y N(μ, σ 2 ), E(y ) = μ = β. Various generalizations, including general

More information

Linear Regression With Special Variables

Linear Regression With Special Variables Linear Regression With Special Variables Junhui Qian December 21, 2014 Outline Standardized Scores Quadratic Terms Interaction Terms Binary Explanatory Variables Binary Choice Models Standardized Scores:

More information

1. Hypothesis testing through analysis of deviance. 3. Model & variable selection - stepwise aproaches

1. Hypothesis testing through analysis of deviance. 3. Model & variable selection - stepwise aproaches Sta 216, Lecture 4 Last Time: Logistic regression example, existence/uniqueness of MLEs Today s Class: 1. Hypothesis testing through analysis of deviance 2. Standard errors & confidence intervals 3. Model

More information

Mixed models in R using the lme4 package Part 5: Generalized linear mixed models

Mixed models in R using the lme4 package Part 5: Generalized linear mixed models Mixed models in R using the lme4 package Part 5: Generalized linear mixed models Douglas Bates Madison January 11, 2011 Contents 1 Definition 1 2 Links 2 3 Example 7 4 Model building 9 5 Conclusions 14

More information

Density estimation Nonparametric conditional mean estimation Semiparametric conditional mean estimation. Nonparametrics. Gabriel Montes-Rojas

Density estimation Nonparametric conditional mean estimation Semiparametric conditional mean estimation. Nonparametrics. Gabriel Montes-Rojas 0 0 5 Motivation: Regression discontinuity (Angrist&Pischke) Outcome.5 1 1.5 A. Linear E[Y 0i X i] 0.2.4.6.8 1 X Outcome.5 1 1.5 B. Nonlinear E[Y 0i X i] i 0.2.4.6.8 1 X utcome.5 1 1.5 C. Nonlinearity

More information

Generalized Linear Models I

Generalized Linear Models I Statistics 203: Introduction to Regression and Analysis of Variance Generalized Linear Models I Jonathan Taylor - p. 1/16 Today s class Poisson regression. Residuals for diagnostics. Exponential families.

More information

Boosting Methods: Why They Can Be Useful for High-Dimensional Data

Boosting Methods: Why They Can Be Useful for High-Dimensional Data New URL: http://www.r-project.org/conferences/dsc-2003/ Proceedings of the 3rd International Workshop on Distributed Statistical Computing (DSC 2003) March 20 22, Vienna, Austria ISSN 1609-395X Kurt Hornik,

More information

Time Series. Anthony Davison. c

Time Series. Anthony Davison. c Series Anthony Davison c 2008 http://stat.epfl.ch Periodogram 76 Motivation............................................................ 77 Lutenizing hormone data..................................................

More information

Logistic Regression. James H. Steiger. Department of Psychology and Human Development Vanderbilt University

Logistic Regression. James H. Steiger. Department of Psychology and Human Development Vanderbilt University Logistic Regression James H. Steiger Department of Psychology and Human Development Vanderbilt University James H. Steiger (Vanderbilt University) Logistic Regression 1 / 38 Logistic Regression 1 Introduction

More information

Logistic Regression and Generalized Linear Models

Logistic Regression and Generalized Linear Models Logistic Regression and Generalized Linear Models Sridhar Mahadevan mahadeva@cs.umass.edu University of Massachusetts Sridhar Mahadevan: CMPSCI 689 p. 1/2 Topics Generative vs. Discriminative models In

More information

MATH 644: Regression Analysis Methods

MATH 644: Regression Analysis Methods MATH 644: Regression Analysis Methods FINAL EXAM Fall, 2012 INSTRUCTIONS TO STUDENTS: 1. This test contains SIX questions. It comprises ELEVEN printed pages. 2. Answer ALL questions for a total of 100

More information

Multiple Linear Regression

Multiple Linear Regression Multiple Linear Regression University of California, San Diego Instructor: Ery Arias-Castro http://math.ucsd.edu/~eariasca/teaching.html 1 / 42 Passenger car mileage Consider the carmpg dataset taken from

More information

MS&E 226: Small Data

MS&E 226: Small Data MS&E 226: Small Data Lecture 12: Logistic regression (v1) Ramesh Johari ramesh.johari@stanford.edu Fall 2015 1 / 30 Regression methods for binary outcomes 2 / 30 Binary outcomes For the duration of this

More information

Generalized Linear Models in R

Generalized Linear Models in R Generalized Linear Models in R NO ORDER Kenneth K. Lopiano, Garvesh Raskutti, Dan Yang last modified 28 4 2013 1 Outline 1. Background and preliminaries 2. Data manipulation and exercises 3. Data structures

More information