Recap. HW due Thursday by 5 pm Next HW coming on Thursday Logistic regression: Pr(G = k X) linear on the logit scale Linear discriminant analysis:

Size: px

Start display at page:

Download "Recap. HW due Thursday by 5 pm Next HW coming on Thursday Logistic regression: Pr(G = k X) linear on the logit scale Linear discriminant analysis:"

Rebecca Hood
5 years ago
Views:

1 1 / 23 Recap HW due Thursday by 5 pm Next HW coming on Thursday Logistic regression: Pr(G = k X) linear on the logit scale Linear discriminant analysis: Pr(G = k X) Pr(X G = k)pr(g = k) Theory: LDA more efficient if X really is normally distributed Practice: LR usually viewed as more robust, but HTF claim prediction errors are very similar Data: LR is more complicated with K > 2; use multinom in the MASS library Deviance in a generalized linear model (such as LR), is 2 log L( ˆβ) + constant Comparing deviances from two model fits is a log-likelihood ratio test that the corresponding parameters are 0 AIC compares instead 2 log L( ˆβ) + 2p see heart-dump.txt using dummy variables for inputs (board)

2 2 / 23 Separating hyperplanes ( 4.5) assume two classes only; change notation so that y = ±1 use linear combinations of inputs to predict y { 1 y = +1 as β 0 + x T β < 0 β 0 + x T β > 0 misclassification error D(β, β 0 ) = Σ i M y i (β 0 + x T i β) where M = {j; y j (β 0 + x T j β) < 0} note that D(β) > 0 and proportional to the size of β 0 + x T i β Can show that an algorithm to minimize D(β, β 0 ) exists and converges the the plane that separates y = +1 from y = 1 if such a plane exists But it will cycle if no such plane exists and be very slow if the gap is small

3 Also if one plane exists there is likely many (Figure 4.14) The plane that defines the largest gap is defined to be best can show that this needs to s.t. 1 min β 0,β 2 β 2 y i (β 0 + x T i β) 1, i = 1,... N (4.48) See Figure 4.16 the points on the edges (margin) of the gap called support points or support vectors; there are typically many fewer of these than original points this is the basis for the development of Support Vector Machines (SVM), more later sometimes add features by using basis expansions; to be discussed first in the context of regression (Chapter 5) 3 / 23

4 4 / 23 Flexible modelling using basis expansions (Chapter 5) Linear regression: y = Xβ + ɛ, ɛ (0, σ 2 ) Smooth regression: y = f (X) + ɛ f (X) = E(Y X) to be specified Flexible linear modelling f (X) = Σ M m=1 β mh m (X) This is called a linear basis expansion, and h m is the mth basis function For example if X is one-dimensional: f (X) = β 0 + β 1 X + β 2 X 2, or f (X) = β 0 + β 1 sin(x) + β 2 cos(x), etc. Simple linear regression has h 1 (X) = 1, h 2 (X) = X. Several other examples on p.140

5 5 / 23 Polynomial fits: h j (x) = x j, j = 0,..., m Fit using linear regression with design matrix X, where X ij = h j (x i ) Justification is that any smooth function can be approximated by a polynomial expansion (Taylor series) Can be difficult to fit numerically, as correlation between columns can be large May be useful locally, but less likely to work over the range of X (rafal.pdf) f (x) = x 2 sin(2x), ɛ N(0, 2), n = 200 It turns out to be easier to find a good set of basis functions if we only consider small segments in the X-space (local fits) Need to be careful not to overfit, since we are using only a fraction of the data

6 6 / 23 Piecewise polynomials Piecewise constant: h 1 (X) = I(X < ξ 1 ), h 2 (X) = I(ξ 1 X < ξ 2 ), h 3 (x) = I(ξ 2 X); corresponds to fitting by local averaging Similarly for piecewise linear fits (Figure 5.1), but constraints to make it continuous at the break points: h 1 (X) = 1, h 2 (X) = X, h 3 (X) = (X ξ 1 ) +, h 4 (X) = (X ξ 2 ) + windows defined by knots ξ 1, ξ 2,... To fit a cubic polynomial in each window: e.g. a 1 + b 1 X + c 1 X 2 + d 1 X 3 in the first, a 2 + b 2 X + c 2 X 2 + d 2 X 3 in second, etc. basis functions {1 w1, X w1, Xw 2 1, Xw 3 1 }, {1 w2, X w2, Xw 2 2, Xw 3 2 }, etc. Now require f w (k) 1 (ξ 1 ) = f w (k) 2 (ξ 1 ) etc., puts constraints on the coefficients a, b, c, d Cubic splines require continuous function, first derivatives, second derivatives (Figure 5.2)

7 7 / 23 Constraints on derivatives can be incorporated into the basis: {1, X, X 2, X 3, (X ξ 1 ) 3 +,..., (X ξ k) 3 + } the truncated power basis procedure: choose number (K ) and placement of knots ξ 1,... ξ K construct X matrix using truncated power basis set run linear regression with?? degrees of freedom Example: heart failure data from Chapter 4 (462 observations, 10 covariates)

8 8 / 23 > bs.sbp <- bs(hr$sbp,df=4) # the B-spline basis > dim(bs.sbp) [1] # this is the basis matrix > bs.sbp[1:4,] [1,] [2,] [3,] [4,]

9 9 / 23 The B-spline basis is hard to described explicitly (see Appendix to Ch. 5), but can be shown to be equivalent to the truncated power basis: In R library(splines): bs(x, df=null, knots=null, degree=3,intercept=false, Boundary.knots=range(x)) Must specify either df or knots. For the B-spline basis, # knots = df - degree (degree is usually 3: see?bs). The knots are fixed, even if you use df (see R code) Natural cubic splines have better endpoint behaviour (linear) ( 5.2.1) ns(x, df=null, knots=null, degree=3,intercept=false, Boundary.knots=range(x)) For natural cubic splines, # knots = df - 1

10 Regression splines (p.144) refers to using these basis matrices in a regression model. > ns.hr.glm <- glm (chd ns.sbp+ns.tobacco+ns.ldl+famhist+ns.obesity + + ns.alcohol + ns.age, family=binomial, data=hr) > summary(ns.hr.glm) Call: glm(formula = chd ns.sbp + ns.tobacco + ns.ldl + famhist + ns.obesity + ns.alcohol + ns.age, family = binomial, data = hr) Deviance Residuals: Min 1Q Median 3Q Max Coefficients: Estimate Std. Error z value Pr(> z ) (Intercept) ns.sbp ns.sbp ns.sbp ns.sbp ns.tobacco ns.tobacco ns.tobacco ** ns.tobacco ns.ldl ns.ldl ns.ldl ns.ldl * famhistpresent e-06 *** ns.obesity ns.obesity * ns.obesity / 23

11 11 / 23 The individual coefficients don t mean anything, we need to evaluate groups of coefficients. We can do this with successive likelihood ratio tests, by hand, e.g. > glm(hr$chd ns.sbp + ns.tobacco + ns.ldl + hr$famhist + ns.obesity + ns.age, family=binomia Call: glm(formula = hr$chd ns.sbp + ns.tobacco + ns.ldl + hr$famhist + ns.obesity + ns Coefficients: (Intercept) ns.sbp1 ns.sbp ns.sbp3 ns.sbp4 ns.tobacco ns.tobacco2 ns.tobacco3 ns.tobacco ns.ldl1 ns.ldl2 ns.ldl stuff omitted Degrees of Freedom: 461 Total (i.e. Null); Null Deviance: Residual Deviance: AIC: Residual

12 12 / 23 > update(.last.value,.. - ns.sbp) Call: glm(formula = hr$chd ns.tobacco + ns.ldl + hr$famhist + ns.obesity + ns.age, fam Coefficients: (Intercept) ns.tobacco1 ns.tobacco ns.tobacco3 ns.tobacco4 ns.ldl ns.ldl2 ns.ldl3 ns.ldl hr$famhistpresent ns.obesity1 ns.obesity ns.obesity3 ns.obesity4 ns.age ns.age2 ns.age3 ns.age Degrees of Freedom: 461 Total (i.e. Null); Null Deviance: Residual Deviance: AIC: Residual > [1] 9.1 > pchisq(9.1,df=4) [1] > 1-.Last.value [1] # compare Table 5.1 See Figure 5.4

13 The function step does all this for you: > step(ns.hr.glm) Start: AIC= chd ns(sbp, 4) + ns(tobacco, 4) + ns(ldl, 4) + famhist + ns(obesity, 4) + ns(alcohol, 4) + ns(age, 4) Df Deviance AIC - ns(alcohol, 4) ns(obesity, 4) <none> ns(sbp, 4) ns(tobacco, 4) ns(ldl, 4) ns(age, 4) famhist Step: AIC= chd ns(sbp, 4) + ns(tobacco, 4) + ns(ldl, 4) + famhist + ns(obesity, 4) + ns(age, 4) Df Deviance AIC <none> ns(obesity, 4) ns(sbp, 4) ns(tobacco, 4) ns(ldl, 4) ns(age, 4) famhist Call: glm(formula = chd ns(sbp, 4) + ns(tobacco, 4) + ns(ldl, 4) + famhist + ns(obesit See Table / 23

14 14 / 23 notes 1. The degrees of freedom fitted are the number of columns in the basis matrix (+ 1 for the intercept). This can also be computed as the trace of the hat matrix, which can be extracted from lm. 2. There is something analogous for glm, because glm s are fitted using iteratively reweighted least squares. 3. Skip and 5.3 (for now at least)

15 15 / 23 Smoothing splines ( 5.4) ridge regression applied to natural cubic splines Recall y = f (X) + ɛ; f (X) = Σ M m=1 β mh m (X) change of notation f (X) = Σ N j=1 θ jn j (X) knots at each distinct x value min(y Nθ) T (y Nθ) + λθ T Ω N θ (5.11) θ N ij = N j (x i ), Ω jk = N j (t)n k (t)dt note use of N for set of natural splines solution ˆθ = (N T N + λω N ) 1 N T y

16 16 / 23 Smoothing splines ( 5.4) solution ˆθ = (N T N + λω N ) 1 N T y This solves the variational problem argmin f Σ i (y i f (x i )) 2 + λ b a {f (t)} 2 dt the solution is a natural cubic spline with knots at each x i How many parameters have been fit? It can be shown that the solution to the smoothing spline problem gives fitted values of the form ŷ = S λ y

17 17 / 23 Smoothing splines ( 5.4) By analogy with ordinary regression, define the effective degrees of freedom (EDF) as trace S λ In the smoothing case it can be shown that ŷ smooth = N(N T N + λω N ) 1 N T y where N is the basis matrix. See p for details on S λ, the smoothing matrix. How to choose λ? a) Decide on df to be used up, e.g. smooth.spline(x,y,df=6), note that increasing df means less bias and more variance. b) Automatic selection by cross-validation (Figure 5.9)

18 Smoothing Parameter spar= lambda= Equivalent Degrees of Freedom (Df): Penalized Criterion: GCV: > lines(cars.spl, col = "blue") > lines(smooth.spline(speed, dist, df=10), lty=2, c > legend(5,120,c(paste("default [C.V.] => df =",rou + "s( *, df = 10)"), col = c("blue" + bg= bisque ) 18 / 23 A smoothing spline version of logistic regression is outlined in 5.6, but we ll wait till we discuss generalized additive models. An example from the R help file for smooth.spline: > data(cars) > attach(cars) > plot(speed, dist, main = "data(cars) & smoothing spl > cars.spl <- smooth.spline(speed, dist) > (cars.spl) Call: smooth.spline(x = speed, y = dist)

19 data(cars) & smoothing splines dist speed

20 data(cars) & smoothing splines dist speed

21 data(cars) & smoothing splines dist speed

22 data(cars) & smoothing splines dist default [C.V.] => df = 2.6 s( *, df = 10) speed

23 23 / 23 Multidimensional splines ( 5.7) so far we are considering just 1 X at a time for regression splines we replace each X by the new columns of the basis matrix for smoothing splines we get a univariate regression it is possible to construct smoothing splines for two or more inputs simultaneously, but computational difficulty increases rapidly these are called thin plate splines

STA 450/4000 S: January

STA 450/4000 S: January 6 005 Notes Friday tutorial on R programming reminder office hours on - F; -4 R The book Modern Applied Statistics with S by Venables and Ripley is very useful. Make sure you have