Administration. Homework 1 on web page, due Feb 11. NSERC summer undergraduate award applications due Feb 5. Some helpful books.
1 STA 450/4000 S, January 2005 — Administration
- Homework 1 on web page, due Feb 11
- NSERC summer undergraduate award applications due Feb 5
- Some helpful books
2 ... administration
3 ... administration
- collection of tools for regression and classification
- some old (least squares, discriminant analysis), some new (lasso, support vector machines)
- statistical justifications: loss, likelihood, mean squared error, classification error, posterior probabilities, ...
- statistical thinking: a framework for analysing new tools
4 Linear regression plus
- variable selection: forward, backward, stepwise, all possible subsets
- comparing models: adjusted R², C_p vs. p, AIC and variants, K-fold cross-validation
- shrinkage methods: ridge regression, lasso, Least Angle Regression
- tuning parameter (amount of shrinkage): validation data, K-fold cross-validation
- derived variables: principal components regression, partial least squares
C_p and AIC can be shown to be estimates of expected prediction error
5 Predictions with smoothing
regression on training data: the x's are centered and scaled when fitting lasso, LAR, and ridge regression; β̂_0 is not included in the shrinkage

β̂_ridge = argmin_β Σ_{i=1}^N (y_i − β_0 − Σ_{j=1}^p x_ij β_j)² + λ Σ_{j=1}^p β_j²   (.4)
β̂_lasso = argmin_β Σ_{i=1}^N (y_i − β_0 − Σ_{j=1}^p x_ij β_j)² + λ Σ_{j=1}^p |β_j|   (.5)

β̂_0 = ȳ = ȳ_train; for predicting a new y_0 = β̂_0 + Σ_{j=1}^p (x_j⁰ − x̄_j) β̂_j, use β̂_0 = ȳ_train, where x̄_j is the average of the jth feature on the training data
see construction of tx and mm in TableR.txt
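A minimal sketch of this centre-and-scale convention, on simulated data (this is not the course's `TableR.txt` construction): the ridge estimate is computed in closed form on standardized inputs, the intercept is left out of the penalty, and a new case is predicted with β̂_0 = ȳ_train and the training means and scales.

```r
# Ridge regression with the centring/scaling convention from the slide
# (simulated data; lambda is an arbitrary illustrative value).
set.seed(1)
n <- 50; p <- 3
X <- matrix(rnorm(n * p), n, p)
y <- 2 + X %*% c(1, -1, 0.5) + rnorm(n)

# centre and scale the training inputs; intercept excluded from shrinkage
xbar <- colMeans(X); s <- apply(X, 2, sd)
Z <- scale(X, center = xbar, scale = s)

lambda <- 1
# closed-form ridge estimate on the standardized inputs
beta <- solve(t(Z) %*% Z + lambda * diag(p), t(Z) %*% y)
beta0 <- mean(y)                      # beta0-hat = mean of training y

# predict a new observation: standardize with the *training* means/scales
x0 <- c(0.2, -0.1, 0.3)
y0hat <- beta0 + sum(((x0 - xbar) / s) * beta)
```

Setting `lambda <- 0` recovers ordinary least squares on the standardized inputs, which is one way to sanity-check the algebra.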
6 ...predictions
> options(digits=4)
> as.vector(predict.lars(pr.lars, newx=test[,1:8], type="fit", s=0.6, mode="fraction")$fit)
> as.vector(tx %*% coef(pr.lars, s=0.6, mode="fraction")) + mean(train$lpsa)
(the two calls return identical fitted values; the numeric output was not preserved in the transcription)
7 Derived features (§3.5)
- replace x_1, ..., x_p with linear combinations of the columns
- principal components from the SVD are natural candidates
  X = UDV^T, U^T U = I, V^T V = I
  z_m = X v_m, m = 1, ..., M < p
- the z_m are orthogonal by construction
- ŷ_pcr(M) = ȳ 1 + Σ_{m=1}^M θ̂_m z_m, where θ̂_m = ⟨z_m, y⟩ / ⟨z_m, z_m⟩ = Σ_i z_mi y_i / Σ_i z_mi²
- inputs should be scaled first (mean 0, variance 1)
- angle brackets notation explained at (.5)
- Exercise: show that z̄_m = 0
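The PCR recipe above can be sketched in a few lines on simulated, already centred/scaled inputs (M and the data are arbitrary choices for illustration); this also verifies the exercise that each derived feature z_m has mean zero.

```r
# Principal components regression via the SVD (simulated data).
set.seed(2)
n <- 40; p <- 4; M <- 2
X <- scale(matrix(rnorm(n * p), n, p))      # mean 0, variance 1 inputs
y <- X %*% c(1, 0.5, 0, 0) + rnorm(n)

s <- svd(X)                                  # X = U D V^T
Z <- X %*% s$v[, 1:M]                        # derived features z_m = X v_m
theta <- drop(t(Z) %*% y) / colSums(Z^2)     # <z_m, y> / <z_m, z_m>
yhat <- mean(y) + Z %*% theta                # yhat_pcr(M)

crossprod(Z[, 1], Z[, 2])   # the z_m are orthogonal by construction
colMeans(Z)                  # and centred (the exercise)
```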
8 ... derived features
- a closely related method, partial least squares, also constructs derived variables
- widely used in chemometrics, where often p > N
- see §3.6 for discussion
9 [figure slide; no text transcribed]
10 (Chapter 4) Linear methods for classification
- inputs X = (X_1, ..., X_p) (notation x used on p.0)
- output takes values in one of K classes
- output G is a group label: values 1, ..., K
- construct a response Y as needed (Y ↔ G), e.g. Y = 1, 0 as G = blue, orange; Fig.; eq. (.7)
- data (x_i, g_i), i = 1, ..., N
- goal: learn a model to predict the correct class for a future observation, based on its inputs
11 Linear methods
- rule: Ĝ = 1 if a linear function of the inputs > something, else Ĝ = 2 (with extensions for K > 2)
- see Fig. 2.2 for a nonlinear method
- note that boundaries can be curved by including terms like X_j² and X_j X_k: see Fig. 4.1
- types of linear methods:
  - linear regression (4.2): Y = β_0 + β^T X
  - linear discriminant analysis (4.3)
  - logistic regression (4.4): log{P(Y = 1)/P(Y = 0)} = β_0 + β^T X
  - separating hyperplanes (4.5)
12 Linear regression (§4.2)
Y is the N × K indicator matrix with rows y_i = (y_i1, ..., y_iK), a "multivariate Bernoulli":

Y = ( y_11 ... y_1K ; y_21 ... y_2K ; ... ; y_N1 ... y_NK )

E(Y|X) = XB, where X is the matrix with rows x_i^T (augmented with a leading 1)
B̂ = (X^T X)^{−1} X^T Y : dimension?
new observation (1, x_0^T); new prediction (1, x_0^T) B̂ = (f̂_1, ..., f̂_K): find the largest
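A small sketch of the indicator-matrix regression classifier, on simulated three-class data (the new point `x0` is made up for illustration). A useful by-product to check: with an intercept column, the fitted f̂_k(x_0) sum to 1.

```r
# Linear regression on an indicator response matrix (simulated data).
set.seed(3)
n <- 60; p <- 2; K <- 3
g <- rep(1:K, each = n / K)
X <- cbind(1, matrix(rnorm(n * p), n, p) + g)   # rows (1, x_i^T), class-shifted
Y <- outer(g, 1:K, "==") * 1                    # N x K indicator matrix
B <- solve(t(X) %*% X, t(X) %*% Y)              # Bhat: (p + 1) x K

x0 <- c(1, 3, 3)                                # new observation (1, x0^T)
f <- drop(x0 %*% B)                             # (fhat_1, ..., fhat_K)
which.max(f)                                    # predicted class: the largest
```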
13 ... linear regression
- if K = 2 this is linear regression of a 0/1 response
- it will produce predictions larger than 1 and smaller than 0
- more natural to consider instead modelling Pr(Y = 1|X) directly, with methods that predict in (0, 1)
- but if the data support 0.2 < Pr(Y = 1|X) < 0.8, results will not be very different; Figure 4.
- targets t_k (p. 04): t_k = (0, ..., 1, ..., 0), with the 1 in position k
- classify to the class whose target is closest: minimize ||f̂(x_0) − t_k||²; another way to state the least squares solution
14 Discriminant analysis (§4.3)
G ∈ {1, 2, ..., K}; f_k(x) = f(x|G = k) = density of x in class k
new ingredient: the density of the inputs
Bayes theorem:
Pr(G = k|x) = f(x|G = k) π_k / f(x),  k = 1, ..., K
associated classification rule: assign a new observation to class k if
Pr(G = k|x) > Pr(G = k′|x) for all k′ ≠ k (maximize the posterior probability)
15 Special case — normal: x|G = k ∼ N_p(μ_k, Σ_k)

Pr(G = k|x) ∝ π_k (2π)^{−p/2} |Σ_k|^{−1/2} exp{−½ (x − μ_k)^T Σ_k^{−1} (x − μ_k)}

which is maximized by maximizing the log:

max_k {log π_k − ½ log|Σ_k| − ½ (x − μ_k)^T Σ_k^{−1} (x − μ_k)}

if we further assume Σ_k ≡ Σ, then

max_k {log π_k − ½ (x − μ_k)^T Σ^{−1} (x − μ_k)}
= max_k {log π_k − ½ (x^T Σ^{−1} x − x^T Σ^{−1} μ_k − μ_k^T Σ^{−1} x + μ_k^T Σ^{−1} μ_k)}
= max_k {log π_k + x^T Σ^{−1} μ_k − ½ μ_k^T Σ^{−1} μ_k}
16 Procedure: compute
δ_k(x) = log π_k + x^T Σ^{−1} μ_k − ½ μ_k^T Σ^{−1} μ_k
and classify observation x to the class k for which δ_k(x) is largest (see Figure 4.5, left)
Estimate the unknown parameters π_k, μ_k, Σ:
π̂_k = N_k / N,  μ̂_k = (1/N_k) Σ_{i: g_i = k} x_i
Σ̂ = Σ_{k=1}^K Σ_{i: g_i = k} (x_i − μ̂_k)(x_i − μ̂_k)^T / (N − K)
Figure 4.5, 4., 4.
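The plug-in procedure above can be coded directly; the following hedged sketch simulates two Gaussian classes with a common covariance, computes the pooled estimates, and evaluates the discriminant scores δ_k(x) at a new point.

```r
# Linear discriminant scores delta_k(x) from plug-in estimates
# (two simulated Gaussian classes; class 2 is shifted by (2, 2)).
set.seed(4)
n <- 100
g <- rep(1:2, each = n / 2)
X <- matrix(rnorm(n * 2), n, 2) + 2 * (g - 1)

Nk <- as.numeric(table(g)); pihat <- Nk / n
mu <- rbind(colMeans(X[g == 1, ]), colMeans(X[g == 2, ]))
# pooled covariance with divisor N - K
S <- (crossprod(sweep(X[g == 1, ], 2, mu[1, ])) +
      crossprod(sweep(X[g == 2, ], 2, mu[2, ]))) / (n - 2)
Sinv <- solve(S)

delta <- function(x, k)
  log(pihat[k]) + x %*% Sinv %*% mu[k, ] - 0.5 * mu[k, ] %*% Sinv %*% mu[k, ]

x <- c(2, 2)                      # a new observation near the class-2 mean
d <- c(delta(x, 1), delta(x, 2))
which.max(d)                      # class with the largest discriminant score
```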
17 Special case: 2 classes
log π̂_1 + x^T Σ̂^{−1} μ̂_1 − ½ μ̂_1^T Σ̂^{−1} μ̂_1 > (<) log π̂_2 + x^T Σ̂^{−1} μ̂_2 − ½ μ̂_2^T Σ̂^{−1} μ̂_2,
i.e. x^T Σ̂^{−1} (μ̂_1 − μ̂_2) > (<) ½ μ̂_1^T Σ̂^{−1} μ̂_1 − ½ μ̂_2^T Σ̂^{−1} μ̂_2 + log(N_2/N) − log(N_1/N)
LHS is a linear combination of the inputs; RHS is the cutpoint
If the Σ_k are not all equal, the discriminant function δ_k(x) defines a quadratic boundary; see Figure 4.6, left
An alternative is to augment the original set of features with quadratic terms and use linear discriminant functions; see Figure 4.6, right
18 Another description of LDA (§4.3.2, 4.3.3)
Let
W = within-class covariance matrix = Σ_{k=1}^K Σ_{i: g_i = k} (x_i − x̄_k)(x_i − x̄_k)^T
B = between-class covariance matrix = Σ_{k=1}^K N_k (x̄_k − x̄)(x̄_k − x̄)^T
B + W = Σ_{k=1}^K Σ_{i: g_i = k} (x_i − x̄)(x_i − x̄)^T = T
linear classification rule: a^T x with
max_a a^T B a / a^T W a, equivalently max_a a^T B a subject to a^T W a = 1
19 ... LDA
The solution a_1, say, is the eigenvector of W^{−1}B corresponding to the largest eigenvalue. This determines a line in R^p.
Continue, finding a_2, orthogonal (with respect to W) to a_1, which is the eigenvector corresponding to the second largest eigenvalue, and so on.
There are at most min(p, K − 1) positive eigenvalues.
These eigenvectors are the linear discriminants, also called canonical variates.
This technique can be useful for visualization of the groups. Figure 4.
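The eigen-description above is easy to check numerically; this sketch (simulated data with K = 3 classes in p = 4 dimensions, not the wine data) builds W and B from their definitions and confirms that W^{−1}B has at most min(p, K − 1) = 2 nonzero eigenvalues.

```r
# Canonical variates as eigenvectors of W^{-1} B (simulated data).
set.seed(5)
n <- 90; p <- 4; K <- 3
g <- rep(1:K, each = n / K)
# class means (0,0,0,0), (3,0,0,0), (0,3,0,0): they span two directions
shift <- cbind(3 * (g == 2), 3 * (g == 3), 0, 0)
X <- matrix(rnorm(n * p), n, p) + shift

xbar <- colMeans(X)
W <- matrix(0, p, p); B <- matrix(0, p, p)
for (k in 1:K) {
  Xk <- X[g == k, , drop = FALSE]
  mk <- colMeans(Xk)
  W <- W + crossprod(sweep(Xk, 2, mk))        # within-class scatter
  B <- B + nrow(Xk) * tcrossprod(mk - xbar)   # between-class scatter
}
ev <- eigen(solve(W) %*% B)
Re(ev$values)   # at most min(p, K - 1) = 2 eigenvalues are nonzero
```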
20 ... LDA (§4.3.2)
- write Σ̂ = UDU^T, where U^T U = I and D is diagonal (see p.09 for Σ̂)
- X* = D^{−1/2} U^T X, so that X* has Σ̂* = I ("sphering")
- classification rule: choose k if μ̂_k* is closest (closest class centroid)
- only needs the K points μ̂_k, and the (K − 1)-dimensional subspace they span, to compute this, since the remaining directions are orthogonal (in the X* space)
- if K = 3 can plot the first two variates (cf. the wine data)
- Figures 4.4 and 4.8; algorithm on p.4
21 R code for the wine data
> library(MASS)
> wine.lda <- lda(class ~ alcohol + malic + ash + alcil + mag + totphen + flav + nonflav + proanth + col + hue + dil + proline, data = wine)
> wine.lda
Call:
lda.formula(class ~ alcohol + malic + ash + alcil + mag + totphen + flav + nonflav + proanth + col + hue + dil + proline, data = wine)
Prior probabilities of groups:
Group means:
  alcohol malic ash alcil mag totphen flav nonflav proanth col hue dil proline
Coefficients of linear discriminants:
  LD1 LD2 for alcohol, malic, ash, alcil, mag, totphen, flav, nonflav, ...
(numeric output not preserved in the transcription)
22 [Plot: wine data projected onto the canonical variates; axes "first linear discriminant" vs. "second linear discriminant"]
23 Logistic regression
data (x_i, g_i), g_i = 1, 2; equivalently (x_i, y_i), y_i = 0, 1
natural starting point is a Bernoulli distribution for y_i: y_i = 1 with probability p_i = p_i(β)
likelihood function
L(β) = Π_{i=1}^N p_i^{y_i} (1 − p_i)^{1 − y_i}
log-likelihood
l(β) = Σ_{i=1}^N [ y_i log p_i(β) + (1 − y_i) log{1 − p_i(β)} ]
A common choice for p_i(β) is the logistic function
p_i(β) = exp(β^T x_i) / {1 + exp(β^T x_i)}
x_i is a column vector: the ith row of X is x_i^T
24 Likelihood methods
log-likelihood
l(β) = Σ_{i=1}^N { y_i β^T x_i − log(1 + e^{β^T x_i}) }
(constant term included in the set of inputs; β has length p + 1)
Maximum likelihood estimate of β: ∂l(β)/∂β |_{β=β̂} = 0, i.e.
Σ_{i=1}^N y_i x_ij = Σ_{i=1}^N p_i(β̂) x_ij,  j = 1, ..., p
Fisher information
−∂²l(β)/∂β ∂β^T |_{β=β̂} = Σ_{i=1}^N x_i x_i^T p̂_i (1 − p̂_i)
Fitting: use an iteratively reweighted least squares algorithm; equivalent to Newton-Raphson
Asymptotics: β̂ →_d N(β, {−l″(β̂)}^{−1})
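The score and information formulas above translate directly into a Newton-Raphson (IRLS) loop; this sketch on simulated data (with the constant included as the first element of each x_i, as on the slide) checks the hand-rolled iteration against `glm`.

```r
# Newton-Raphson / IRLS for logistic regression, checked against glm()
# (simulated data; beta_true is an arbitrary illustrative choice).
set.seed(6)
n <- 200
X <- cbind(1, rnorm(n), rnorm(n))           # first element of x_i is 1
beta_true <- c(-0.5, 1, -1)
prob <- 1 / (1 + exp(-X %*% beta_true))
y <- rbinom(n, 1, prob)

beta <- rep(0, ncol(X))
for (it in 1:25) {
  eta <- drop(X %*% beta)
  mu <- 1 / (1 + exp(-eta))                  # p_i(beta)
  w <- mu * (1 - mu)                         # IRLS weights
  score <- t(X) %*% (y - mu)                 # score vector
  info <- t(X) %*% (w * X)                   # Fisher information
  beta <- drop(beta + solve(info, score))    # Newton step
}
fit <- glm(y ~ X[, -1], family = binomial)
cbind(irls = beta, glm = unname(coef(fit)))  # the two columns agree
```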
25 Inference
Component: β̂_j ≈ N(β_j, σ̂_j²), with σ̂_j² = [{−l″(β̂)}^{−1}]_jj; gives a t-test (z-test) for each component
2{l(β̂) − l(β_j, β̂_{(j)})} ≈ χ²_{dim β_j}; in particular for each component we get a χ²_1, or equivalently
sign(β̂_j − β_j) √[ 2{l(β̂) − l(β_j, β̂_{(j)})} ] ≈ N(0, 1)
this gives two (asymptotically equivalent) ways to test whether an input is statistically significant
To compare nested models M_0 ⊂ M_1 we can use this twice to get
2{l_{M_1}(β̂) − l_{M_0}(β̃_q)} ≈ χ²_{p−q},
which provides a test of the adequacy of M_0
LHS is the difference in (residual) deviances; analogous to ΔSS in regression
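In R the deviance-difference test for nested models falls out of two `glm` fits; a hedged sketch on simulated data (where the extra input `x2` is pure noise, so M_0 should look adequate):

```r
# Comparing nested logistic models by the drop in deviance (simulated data).
set.seed(7)
n <- 300
x1 <- rnorm(n); x2 <- rnorm(n)
y <- rbinom(n, 1, plogis(-0.3 + x1))   # x2 plays no role in the truth

m0 <- glm(y ~ x1, family = binomial)        # M_0
m1 <- glm(y ~ x1 + x2, family = binomial)   # M_1 (adds one parameter)

lrt <- deviance(m0) - deviance(m1)   # 2{ l_M1(betahat) - l_M0(betatilde) }
pval <- pchisq(lrt, df = 1, lower.tail = FALSE)
c(lrt = lrt, p = pval)
```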
26 Heart Data
> library(ElemStatLearn)
> data(SAheart)
> hr = SAheart
> pairs(hr[,1:9], pch=21, bg=c("red","green")[codes(factor(hr$chd))])
> hr.glm = glm(chd ~ ., data = hr, family=binomial)
> summary(hr.glm)
Call:
glm(formula = chd ~ ., family = binomial, data = hr)
Deviance Residuals:
    Min  1Q  Median  3Q  Max
Coefficients: Estimate Std. Error z value Pr(>|z|)
(Intercept)      ...  e-06 ***
sbp              ...
tobacco          ...  **
ldl              ...  **
adiposity        ...
famhistPresent   ...  e-05 ***
typea            ...  **
obesity          ...
alcohol          ...
age              ...  ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 596.1 on 461 degrees of freedom
Residual deviance: 472.14 on 452 degrees of freedom
AIC: 492.14
27 [Scatterplot matrix (pairs plot) of the heart-data inputs: sbp, tobacco, ldl, adiposity, famhist, typea, obesity, alcohol, age]
28 Extensions
- E(y_i) = p_i, var(y_i) = p_i(1 − p_i) under the Bernoulli model
- Often the model is generalized to allow var(y_i) = φ p_i(1 − p_i); called over-dispersion
- Most software provides an estimate of φ based on residuals
- if y_i ∼ Binom(n_i, p_i) the same model applies; E(y_i) = n_i p_i and var(y_i) = n_i p_i(1 − p_i) under the Binomial
- Model selection uses AIC: −2 l(β̂) + 2p
- In R, use glm to fit logistic regression, step for model selection
- glm can be used for all exponential family models: uses iteratively reweighted least squares
- See also §4.4
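One way to get the dispersion estimate mentioned above is R's `quasibinomial` family, which reports a Pearson-residual-based φ̂; a hedged sketch on simulated grouped binomial data with a random effect built in to create extra-binomial variation:

```r
# Estimating the dispersion phi via family = quasibinomial (simulated data).
set.seed(8)
m <- 40; ni <- 20                            # m groups of size n_i = 20
x <- rnorm(m)
# a latent random effect makes the counts over-dispersed
p <- plogis(-0.2 + x + rnorm(m, sd = 0.7))
y <- rbinom(m, ni, p)

fit <- glm(cbind(y, ni - y) ~ x, family = quasibinomial)
phi <- summary(fit)$dispersion   # estimate of phi (1 under a plain binomial)
phi
```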
29 step
> step(hr.glm)
Start: AIC=492.14
chd ~ sbp + tobacco + ldl + adiposity + famhist + typea + obesity + alcohol + age
            Df Deviance AIC
- alcohol
- adiposity
- sbp
<none>
- obesity
- ldl
- tobacco
- typea
- age
- famhist
(deviance and AIC values not preserved in the transcription)

Step: AIC=490.4
chd ~ sbp + tobacco + ldl + adiposity + famhist + typea + obesity + age
            Df Deviance AIC
- adiposity
- sbp
<none>
- obesity
- ldl
- tobacco
- typea
- age
- famhist
30 ... step
Step: AIC=...
chd ~ sbp + tobacco + ldl + famhist + typea + obesity + age
            Df Deviance AIC
- sbp
<none>
- obesity
- tobacco
- ldl
- typea
- famhist
- age

Step: AIC=...
chd ~ tobacco + ldl + famhist + typea + obesity + age
            Df Deviance AIC
- obesity
<none>
- tobacco
- typea
- ldl
- famhist
- age
31 ... step
Step: AIC=...
chd ~ tobacco + ldl + famhist + typea + age
            Df Deviance AIC
<none>
- ldl
- typea
- tobacco
- famhist
- age

Call: glm(formula = chd ~ tobacco + ldl + famhist + typea + age, family = binomial, data = hr)
Coefficients: (Intercept) tobacco ldl famhistPresent typea age
Degrees of Freedom: 461 Total (i.e. Null); 456 Residual
Null Deviance: 596.1
Residual Deviance: ...  AIC: ...
32 Interpretation of coefficients
e.g. tobacco (measured in kg): coefficient 0.081
logit{p_i(β)} = β^T x_i
an increase of one unit in x_ij leads to an increase in logit p_i of 0.081,
i.e. an increase in the odds p_i/(1 − p_i) by the factor exp(0.081) = 1.084
estimated s.e. 0.026: interval for the change in logit p_i is 0.081 ± 2(0.026);
for the odds, (exp(0.081 − 2 × 0.026), exp(0.081 + 2 × 0.026)) = (1.03, 1.14)
similarly for age: β̂_j = 0.044; increased odds exp(0.044) = 1.045 for a 1-year increase
prediction of new values: classify to class 1 or 0 according as p̂ > (<) 0.5
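The tobacco calculation above generalizes to any fitted coefficient: exponentiate the estimate for the odds multiplier and exponentiate estimate ± 2 s.e. for a rough interval. A hedged sketch on simulated stand-in data (not the South African heart data):

```r
# Odds ratios and rough Wald intervals from a fitted logistic model
# (simulated data mimicking the slide's tobacco-style calculation).
set.seed(9)
n <- 500
tobacco <- rexp(n)                                  # made-up exposure
y <- rbinom(n, 1, plogis(-1 + 0.08 * tobacco))
fit <- glm(y ~ tobacco, family = binomial)

b  <- coef(fit)["tobacco"]
se <- sqrt(diag(vcov(fit)))["tobacco"]
or <- exp(b)                     # multiplicative change in odds per unit
ci <- exp(b + c(-2, 2) * se)     # ~95% interval, as on the slide
c(odds.ratio = unname(or), lo = unname(ci[1]), hi = unname(ci[2]))
```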
33 L1 regularization — eqn (4.)
max_{β_0, β} [ Σ_{i=1}^N { y_i(β_0 + β^T x_i) − log(1 + e^{β_0 + β^T x_i}) } − λ Σ_{j=1}^p |β_j| ]
note that the intercept has been separated out
Figure 4. — interpretation?
recently developed software: glmpath and glmnet
34 Notes
- how to choose between logistic regression and discriminant analysis?
- classification error on the training data is overly optimistic
- logistic regression (and its generalizations to K classes) doesn't assume any distribution for the inputs
- discriminant analysis is more efficient if the assumed distribution is correct
- warning: in §4.3, x and x_i are p-vectors, and we estimate β_0 and β, the latter a p-vector; in §4.4 they are (p + 1)-vectors with first element 1, and β is (p + 1)-dimensional
35 Misclassification
> hr.glm.class = predict(hr.glm) > 0
> table(hr.glm.class, hr$chd)
hr.glm.class   0   1
       FALSE   .   .
       TRUE   46   8
(some counts not preserved in the transcription)
> hr.glm = glm(chd ~ tobacco + ldl + famhist + age, data = hr, family=binomial)
> table(predict(hr.glm) > 0, hr$chd)
> hr.lda = lda(chd ~ ., data = hr)
> table(predict(hr.lda)$class, hr$chd)
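The confusion-table idiom above (cutting at linear predictor > 0, i.e. p̂ > 0.5) can be sketched on simulated data; note `predict.glm` returns the linear predictor by default, which is why the cut is at 0 rather than 0.5.

```r
# Training-set confusion matrix and misclassification rate for a
# logistic fit (simulated data, not the heart data).
set.seed(10)
n <- 400
x <- matrix(rnorm(n * 2), n, 2)
y <- rbinom(n, 1, plogis(x[, 1] - x[, 2]))
fit <- glm(y ~ x, family = binomial)

pred <- as.integer(predict(fit) > 0)  # linear predictor > 0  <=>  phat > 0.5
tab <- table(pred, y)                 # confusion matrix
err <- mean(pred != y)                # training misclassification rate
tab
err
```

As the slide before this one warns, this training-set error rate is overly optimistic as an estimate of prediction error.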
More informationAnswer Key for STAT 200B HW No. 8
Answer Key for STAT 200B HW No. 8 May 8, 2007 Problem 3.42 p. 708 The values of Ȳ for x 00, 0, 20, 30 are 5/40, 0, 20/50, and, respectively. From Corollary 3.5 it follows that MLE exists i G is identiable
More informationNeural networks (not in book)
(not in book) Another approach to classification is neural networks. were developed in the 1980s as a way to model how learning occurs in the brain. There was therefore wide interest in neural networks
More informationStatistical Machine Learning Hilary Term 2018
Statistical Machine Learning Hilary Term 2018 Pier Francesco Palamara Department of Statistics University of Oxford Slide credits and other course material can be found at: http://www.stats.ox.ac.uk/~palamara/sml18.html
More informationRegression, Ridge Regression, Lasso
Regression, Ridge Regression, Lasso Fabio G. Cozman - fgcozman@usp.br October 2, 2018 A general definition Regression studies the relationship between a response variable Y and covariates X 1,..., X n.
More informationIntroduction to Machine Learning
1, DATA11002 Introduction to Machine Learning Lecturer: Antti Ukkonen TAs: Saska Dönges and Janne Leppä-aho Department of Computer Science University of Helsinki (based in part on material by Patrik Hoyer,
More informationStat 579: Generalized Linear Models and Extensions
Stat 579: Generalized Linear Models and Extensions Yan Lu Jan, 2018, week 3 1 / 67 Hypothesis tests Likelihood ratio tests Wald tests Score tests 2 / 67 Generalized Likelihood ratio tests Let Y = (Y 1,
More informationSTAT 135 Lab 13 (Review) Linear Regression, Multivariate Random Variables, Prediction, Logistic Regression and the δ-method.
STAT 135 Lab 13 (Review) Linear Regression, Multivariate Random Variables, Prediction, Logistic Regression and the δ-method. Rebecca Barter May 5, 2015 Linear Regression Review Linear Regression Review
More informationSingle-level Models for Binary Responses
Single-level Models for Binary Responses Distribution of Binary Data y i response for individual i (i = 1,..., n), coded 0 or 1 Denote by r the number in the sample with y = 1 Mean and variance E(y) =
More informationSTA 4504/5503 Sample Exam 1 Spring 2011 Categorical Data Analysis. 1. Indicate whether each of the following is true (T) or false (F).
STA 4504/5503 Sample Exam 1 Spring 2011 Categorical Data Analysis 1. Indicate whether each of the following is true (T) or false (F). (a) (b) (c) (d) (e) In 2 2 tables, statistical independence is equivalent
More informationFSAN815/ELEG815: Foundations of Statistical Learning
FSAN815/ELEG815: Foundations of Statistical Learning Gonzalo R. Arce Chapter 14: Logistic Regression Fall 2014 Course Objectives & Structure Course Objectives & Structure The course provides an introduction
More information10702/36702 Statistical Machine Learning, Spring 2008: Homework 3 Solutions
10702/36702 Statistical Machine Learning, Spring 2008: Homework 3 Solutions March 24, 2008 1 [25 points], (Jingrui) (a) Generate data as follows. n = 100 p = 1000 X = matrix(rnorm(n p), n, p) beta = c(rep(10,
More informationMATH 829: Introduction to Data Mining and Analysis Logistic regression
1/11 MATH 829: Introduction to Data Mining and Analysis Logistic regression Dominique Guillot Departments of Mathematical Sciences University of Delaware March 7, 2016 Logistic regression 2/11 Suppose
More information7/28/15. Review Homework. Overview. Lecture 6: Logistic Regression Analysis
Lecture 6: Logistic Regression Analysis Christopher S. Hollenbeak, PhD Jane R. Schubart, PhD The Outcomes Research Toolbox Review Homework 2 Overview Logistic regression model conceptually Logistic regression
More informationMachine Learning. Bayesian Regression & Classification. Marc Toussaint U Stuttgart
Machine Learning Bayesian Regression & Classification learning as inference, Bayesian Kernel Ridge regression & Gaussian Processes, Bayesian Kernel Logistic Regression & GP classification, Bayesian Neural
More informationLeast Squares Regression
CIS 50: Machine Learning Spring 08: Lecture 4 Least Squares Regression Lecturer: Shivani Agarwal Disclaimer: These notes are designed to be a supplement to the lecture. They may or may not cover all the
More informationProbabilistic classification CE-717: Machine Learning Sharif University of Technology. M. Soleymani Fall 2016
Probabilistic classification CE-717: Machine Learning Sharif University of Technology M. Soleymani Fall 2016 Topics Probabilistic approach Bayes decision theory Generative models Gaussian Bayes classifier
More informationLinear Methods for Regression. Lijun Zhang
Linear Methods for Regression Lijun Zhang zlj@nju.edu.cn http://cs.nju.edu.cn/zlj Outline Introduction Linear Regression Models and Least Squares Subset Selection Shrinkage Methods Methods Using Derived
More informationBIO5312 Biostatistics Lecture 13: Maximum Likelihood Estimation
BIO5312 Biostatistics Lecture 13: Maximum Likelihood Estimation Yujin Chung November 29th, 2016 Fall 2016 Yujin Chung Lec13: MLE Fall 2016 1/24 Previous Parametric tests Mean comparisons (normality assumption)
More informationMSH3 Generalized linear model Ch. 6 Count data models
Contents MSH3 Generalized linear model Ch. 6 Count data models 6 Count data model 208 6.1 Introduction: The Children Ever Born Data....... 208 6.2 The Poisson Distribution................. 210 6.3 Log-Linear
More informationMSH3 Generalized linear model
Contents MSH3 Generalized linear model 5 Logit Models for Binary Data 173 5.1 The Bernoulli and binomial distributions......... 173 5.1.1 Mean, variance and higher order moments.... 173 5.1.2 Normal limit....................
More informationLDA, QDA, Naive Bayes
LDA, QDA, Naive Bayes Generative Classification Models Marek Petrik 2/16/2017 Last Class Logistic Regression Maximum Likelihood Principle Logistic Regression Predict probability of a class: p(x) Example:
More informationMachine Learning for OR & FE
Machine Learning for OR & FE Regression II: Regularization and Shrinkage Methods Martin Haugh Department of Industrial Engineering and Operations Research Columbia University Email: martin.b.haugh@gmail.com
More informationData Mining and Data Warehousing. Henryk Maciejewski. Data Mining Predictive modelling: regression
Data Mining and Data Warehousing Henryk Maciejewski Data Mining Predictive modelling: regression Algorithms for Predictive Modelling Contents Regression Classification Auxiliary topics: Estimation of prediction
More informationIntroduction to General and Generalized Linear Models
Introduction to General and Generalized Linear Models Generalized Linear Models - part III Henrik Madsen Poul Thyregod Informatics and Mathematical Modelling Technical University of Denmark DK-2800 Kgs.
More informationLeast Squares Regression
E0 70 Machine Learning Lecture 4 Jan 7, 03) Least Squares Regression Lecturer: Shivani Agarwal Disclaimer: These notes are a brief summary of the topics covered in the lecture. They are not a substitute
More informationIntroduction to the Generalized Linear Model: Logistic regression and Poisson regression
Introduction to the Generalized Linear Model: Logistic regression and Poisson regression Statistical modelling: Theory and practice Gilles Guillot gigu@dtu.dk November 4, 2013 Gilles Guillot (gigu@dtu.dk)
More informationGeneralized Linear Models 1
Generalized Linear Models 1 STA 2101/442: Fall 2012 1 See last slide for copyright information. 1 / 24 Suggested Reading: Davison s Statistical models Exponential families of distributions Sec. 5.2 Chapter
More information