STA 450/4000 S: January 6, 2005 Notes

- Friday tutorial on R programming
- Reminder: office hours on -F; -4
- The book Modern Applied Statistics with S by Venables and Ripley is very useful. Make sure you have the MASS library available when using R or S-PLUS (in R, type library(MASS)). All the code in the 4th edition of the book is available in a file called scripts, in the MASS subdirectory of the R library. On CQuest this is in /usr/lib/R/library.
- Undergraduate Summer Research Awards (USRA): see the Statistics office, SS 608; applications due Feb 8
Logistic regression (§4.4): Likelihood methods

- log-likelihood: $\ell(\beta) = \sum_{i=1}^N \left\{ y_i \beta^T x_i - \log(1 + e^{\beta^T x_i}) \right\}$
- maximum likelihood estimate of $\beta$: set $\partial \ell(\beta)/\partial\beta = 0$, i.e. $\sum_{i=1}^N y_i x_{ij} = \sum_{i=1}^N p_i(\hat\beta)\, x_{ij}$, $j = 1, \dots, p$
- Fisher information: $-\partial^2 \ell(\beta)/\partial\beta\,\partial\beta^T = \sum_{i=1}^N x_i x_i^T\, p_i(1 - p_i)$
- Fitting: use an iteratively reweighted least squares (IRLS) algorithm; equivalent to Newton-Raphson; p. 99
- Asymptotics: $\hat\beta \;\dot\sim\; N(\beta, \{-\ell''(\hat\beta)\}^{-1})$ (approximately, in distribution)
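The IRLS/Newton-Raphson step can be written in a few lines of R. Below is a minimal sketch (the function name irls_logistic and its arguments are invented for illustration, and it omits the safeguards in glm's own fitting routine); with X a model matrix whose first column is all 1s and y a 0/1 response, its result should agree with the coefficients reported by glm(..., family = binomial).

## Minimal IRLS sketch for logistic regression (illustration only).
## X: N x (p+1) model matrix (first column all 1s); y: 0/1 response vector.
irls_logistic <- function(X, y, maxit = 25, tol = 1e-8) {
  beta <- rep(0, ncol(X))
  for (it in 1:maxit) {
    eta <- drop(X %*% beta)
    p <- 1 / (1 + exp(-eta))                    # fitted probabilities p_i(beta)
    W <- p * (1 - p)                            # weights from the Fisher information
    z <- eta + (y - p) / W                      # working response
    beta_new <- drop(solve(crossprod(X, W * X), crossprod(X, W * z)))
    converged <- max(abs(beta_new - beta)) < tol
    beta <- beta_new
    if (converged) break                        # Newton-Raphson step has converged
  }
  beta
}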
Logistic regression (§4.4): Inference

- Component-wise: $\hat\beta_j \approx N(\beta_j, \hat\sigma_j^2)$, where $\hat\sigma_j^2 = [\{-\ell''(\hat\beta)\}^{-1}]_{jj}$; gives a t-test (z-test) for each component
- $2\{\ell(\hat\beta) - \ell(\beta_j, \hat\beta_{(j)})\} \sim \chi^2_{\dim\beta_j}$, where $\hat\beta_{(j)}$ maximizes $\ell$ with $\beta_j$ held fixed; in particular for each single component we get a $\chi^2_1$, or equivalently $\mathrm{sign}(\hat\beta_j - \beta_j)\,[2\{\ell(\hat\beta) - \ell(\beta_j, \hat\beta_{(j)})\}]^{1/2} \sim N(0, 1)$
- To compare nested models $M_0 \subset M_1$, with $q$ and $p$ parameters respectively, use this twice to get $2\{\ell_{M_1}(\hat\beta) - \ell_{M_0}(\tilde\beta)\} \sim \chi^2_{p-q}$, which provides a test of the adequacy of $M_0$
- The LHS is the difference in (residual) deviances; analogous to the extra sum of squares in regression
- See Ch. 4 of the text, and the algorithm on p. 99 of HTF. (See Figure 4.) (See R code)
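In R the deviance comparison of nested models can be carried out with anova(); a small sketch, assuming the heart data frame hr that is read in on the slides below:

## Likelihood-ratio (deviance) test of a smaller model M0 against a larger M1.
m0 <- glm(chd ~ tobacco + ldl + age, family = binomial, data = hr)
m1 <- glm(chd ~ tobacco + ldl + famhist + age, family = binomial, data = hr)
anova(m0, m1, test = "Chisq")                       # difference in residual deviances
1 - pchisq(deviance(m0) - deviance(m1), df = 1)     # the same p-value, computed directly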
Logistic regression (§4.4): extensions

- $E(y_i) = p_i$, $\operatorname{var}(y_i) = p_i(1 - p_i)$ under the Bernoulli model
- Often the model is generalized to allow $\operatorname{var}(y_i) = \phi\, p_i(1 - p_i)$; this is called over-dispersion
- Most software provides an estimate of $\phi$ based on residuals
- If $y_i \sim \operatorname{Binom}(n_i, p_i)$ the same model applies, with $E(y_i) = n_i p_i$ and $\operatorname{var}(y_i) = n_i p_i(1 - p_i)$ under the Binomial
- Model selection uses a $C_p$-like criterion called AIC
- In S-PLUS or R, use glm to fit logistic regression and stepAIC (in MASS) for model selection
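One way to fit the over-dispersed version in R is the quasibinomial family, which estimates $\phi$ from the residuals; a sketch, again assuming the heart data frame hr read in on the next slide:

## Over-dispersed fit: same mean model, variance phi * p(1 - p).
hr.quasi <- glm(chd ~ tobacco + ldl + famhist + age,
                family = quasibinomial, data = hr)
summary(hr.quasi)$dispersion   # estimate of phi (1 corresponds to no over-dispersion)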
Logistic regression (§4.4)

> hr <- read.table("heart.data", header = T)
> dim(hr)
[1] 462  10
> hr <- data.frame(hr)
> pairs(hr[1:10], pch = 21, bg = c("red", "green")[codes(factor(hr$chd))])
> glm(chd ~ sbp + tobacco + ldl + famhist + obesity + alcohol + age,
+     family = binomial, data = hr)

Call:  glm(formula = chd ~ sbp + tobacco + ldl + famhist + obesity +
    alcohol + age, family = binomial, data = hr)

Coefficients:
   (Intercept)             sbp         tobacco             ldl
     -4.90787       0.0057608         0.07957         0.84770
famhistpresent         obesity         alcohol             age
         0.990       -0.045467       0.0006058         0.04544

Degrees of Freedom: 461 Total (i.e. Null);  454 Residual
Null Deviance:      596.
Residual Deviance:  48.      AIC: 499.

> hr.glm <- .Last.value
> coef(hr.glm)
   (Intercept)            sbp        tobacco            ldl famhistpresent
  -4.9078750     0.005760899    0.0795750      0.847709867      0.9904
       obesity        alcohol            age
  -0.045466980    0.000605845    0.04544469
Logistic regression (§4.4)

> summary(hr.glm)

Call:
glm(formula = chd ~ sbp + tobacco + ldl + famhist + obesity +
    alcohol + age, family = binomial, data = hr)

Deviance Residuals:
    Min       1Q   Median       3Q      Max
  -.757   -0.879   -0.455     0.99      .44

Coefficients:
                 Estimate  Std. Error  z value  Pr(>|z|)
(Intercept)     -4.90787    0.960686    -4.98   .7e-05  ***
sbp              0.0057608  0.005650     .04    0.0577
tobacco          0.07957    0.06876      .07    0.009   **
ldl              0.84770    0.0576       .4     0.007   **
famhistpresent   0.990      0.468        4.86   .84e-05 ***
obesity         -0.045467   0.0905      -.89    0.440
alcohol          0.0006058  0.0044490    0.6    0.8968
age              0.04544    0.0095       4.99   .68e-05 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 596.   on 461  degrees of freedom
Residual deviance: 48.7   on 454  degrees of freedom
AIC: 499.7

Number of Fisher Scoring iterations:
Logistic regression (§4.4)

> library(MASS)
> hr.step <- stepAIC(hr.glm, trace = F)
> anova(hr.step)
Analysis of Deviance Table

Model: binomial, link: logit

Response: chd

Terms added sequentially (first to last)

         Df Deviance Resid. Df Resid. Dev
NULL                        461      596.
tobacco   1     4.46        460      554.65
ldl       1      .          459      5.5
famhist   1     4.8         458      507.4
age       1      .80        457      485.44

> coef(hr.step)
   (Intercept)        tobacco            ldl famhistpresent            age
      -4.0899      0.0806979       0.675745       0.940600      0.0440574
Logistic regression (§4.4)

> summary(hr.step)

Call:
glm(formula = chd ~ tobacco + ldl + famhist + age, family = binomial,
    data = hr)

Deviance Residuals:
     Min        1Q    Median        3Q       Max
  -.7559    -0.86   -0.4546    0.9457      .490

Coefficients:
                Estimate  Std. Error  z value  Pr(>|z|)
(Intercept)    -4.080      0.49457     -8.50   < 2e-16  ***
tobacco         0.080698   0.05484      .67    0.0054   **
ldl             0.67574    0.05409      .098   0.0095   **
famhistpresent  0.94060    0.66         4.50   .e-05    ***
age             0.04406    0.009696     4.54   5.58e-06 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 596.    on 461  degrees of freedom
Residual deviance: 485.44  on 457  degrees of freedom
AIC: 495.44

Number of Fisher Scoring iterations:
Logistic regression (§4.4): Interpretation of coefficients

- e.g. tobacco (measured in kg): coefficient = 0.081
- $\operatorname{logit}\{p_i(\beta)\} = \beta^T x_i$; an increase of one unit in $x_{ij}$, say, leads to an increase in $\operatorname{logit} p_i$ of 0.081, i.e. the odds $p_i/(1-p_i)$ are multiplied by $\exp(0.081) = 1.084$
- estimated s.e. 0.026, so an approximate interval for the odds multiplier is $(\exp(0.081 - 2 \times 0.026),\ \exp(0.081 + 2 \times 0.026)) = (1.03, 1.14)$
- similarly for age: $\hat\beta_j = 0.044$; the odds increase by a factor of 1.045 per one-year increase
- prediction of new values: classify to class 1 or 0 according as $\hat p > (<)\ 0.5$
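These odds-ratio summaries can be computed directly from the fitted object; a small sketch, using the stepwise model hr.step fitted above:

## Odds multipliers and approximate 95% Wald intervals, exp(coef +/- 2 s.e.).
est <- coef(summary(hr.step))                   # Estimate, Std. Error, z value, Pr(>|z|)
odds  <- exp(est[, "Estimate"])
lower <- exp(est[, "Estimate"] - 2 * est[, "Std. Error"])
upper <- exp(est[, "Estimate"] + 2 * est[, "Std. Error"])
round(cbind(odds, lower, upper), 3)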
Logistic regression (§4.4): generalize to K classes

1. multivariate logistic regression: model
   $\log\dfrac{pr(y = 1 \mid x)}{pr(y = K \mid x)},\ \dots,\ \log\dfrac{pr(y = K-1 \mid x)}{pr(y = K \mid x)}$
2. impose an ordering (polytomous regression: MASS p.??? and polr)
3. multinomial distribution (related to neural networks: MASS p.??? and ??)
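Both extensions are available in the Venables & Ripley packages; a hedged sketch (package availability assumed, and the datasets are used only for illustration):

## Multinomial logit: K-1 log-odds against a baseline class (multinom, in nnet).
library(nnet)
fit.multi <- multinom(Species ~ Sepal.Length + Sepal.Width, data = iris)
head(predict(fit.multi, type = "probs"))        # estimated class probabilities

## Ordered (proportional-odds) logistic regression (polr, in MASS);
## the response must be an ordered factor, as Sat is in the housing data.
library(MASS)
fit.ord <- polr(Sat ~ Infl + Type + Cont, weights = Freq, data = housing)
summary(fit.ord)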
Discriminant analysis (§4.3)

- $y \in \{1, 2, \dots, K\}$; $f_c(x) = f(x \mid y = c)$ = density of $x$ in class $c$
- Bayes' Theorem:
  $pr(y = c \mid x) = \dfrac{f(x \mid y = c)\,\pi_c}{f(x)}, \qquad c = 1, \dots, K$
- associated classification rule: assign a new observation to class $c$ if $p(y = c \mid x) > p(y = k \mid x)$ for all $k \neq c$ (maximize the posterior probability)
Discriminant analysis (§4.3)

- $x \mid y = k \sim N_p(\mu_k, \Sigma_k)$
- $p(y = k \mid x) \propto \pi_k\, (2\pi)^{-p/2} |\Sigma_k|^{-1/2} \exp\{-\tfrac{1}{2}(x - \mu_k)^T \Sigma_k^{-1}(x - \mu_k)\}$, which is maximized by maximizing the log:
  $\max_k \{\log\pi_k - \tfrac{1}{2}\log|\Sigma_k| - \tfrac{1}{2}(x - \mu_k)^T\Sigma_k^{-1}(x - \mu_k)\}$
- if we further assume $\Sigma_k \equiv \Sigma$, then
  $\max_k \{\log\pi_k - \tfrac{1}{2}(x - \mu_k)^T\Sigma^{-1}(x - \mu_k)\}$
  $= \max_k \{\log\pi_k - \tfrac{1}{2}(x^T\Sigma^{-1}x - x^T\Sigma^{-1}\mu_k - \mu_k^T\Sigma^{-1}x + \mu_k^T\Sigma^{-1}\mu_k)\}$
  $= \max_k \{\log\pi_k + x^T\Sigma^{-1}\mu_k - \tfrac{1}{2}\mu_k^T\Sigma^{-1}\mu_k\}$
  (the term $x^T\Sigma^{-1}x$ is common to all classes, so it drops out)
Discriminant analysis (§4.3)

Procedure:
- compute $\delta_k(x) = \log\pi_k + x^T\Sigma^{-1}\mu_k - \tfrac{1}{2}\mu_k^T\Sigma^{-1}\mu_k$
- classify observation $x$ to class $c$ if $\delta_c(x)$ is largest (see Figure 4.5, left)
- estimate the unknown parameters $\pi_k, \mu_k, \Sigma$:
  $\hat\pi_k = \dfrac{N_k}{N}, \qquad \hat\mu_k = \dfrac{\sum_{i: y_i = k} x_i}{N_k}, \qquad \hat\Sigma = \sum_{k=1}^K \sum_{i: y_i = k}(x_i - \hat\mu_k)(x_i - \hat\mu_k)^T / (N - K)$
  (see Figure 4.5, right)
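For concreteness, here is a hedged sketch of this procedure computed directly from the plug-in estimates, using the iris data purely for illustration (its column layout is assumed); the resulting classification should match MASS::lda with proportional priors.

## LDA "by hand": plug-in estimates and the discriminant scores delta_k(x).
X <- as.matrix(iris[, 1:4]); y <- iris$Species
K <- nlevels(y); N <- nrow(X)
Nk <- as.vector(table(y))
pi.hat <- Nk / N                                     # class proportions
mu.hat <- rowsum(X, y) / Nk                          # K x p matrix of class means
S <- Reduce(`+`, lapply(levels(y), function(k) {
  Xk <- scale(X[y == k, ], center = mu.hat[k, ], scale = FALSE)
  crossprod(Xk)                                      # sum of (x_i - mu_k)(x_i - mu_k)'
})) / (N - K)                                        # pooled covariance estimate
Sinv <- solve(S)
delta <- X %*% Sinv %*% t(mu.hat) -
  matrix(0.5 * rowSums((mu.hat %*% Sinv) * mu.hat), N, K, byrow = TRUE) +
  matrix(log(pi.hat), N, K, byrow = TRUE)            # delta_k(x) for each x and k
pred <- levels(y)[max.col(delta)]                    # classify to the largest delta_k
table(pred, y)        # should match predict(lda(Species ~ ., data = iris))$class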
Discriminant analysis (§4.3)

Special case: K = 2 classes
- Choose class 2 if
  $\log\hat\pi_2 + x^T\hat\Sigma^{-1}\hat\mu_2 - \tfrac{1}{2}\hat\mu_2^T\hat\Sigma^{-1}\hat\mu_2 > \log\hat\pi_1 + x^T\hat\Sigma^{-1}\hat\mu_1 - \tfrac{1}{2}\hat\mu_1^T\hat\Sigma^{-1}\hat\mu_1$,
  i.e.
  $x^T\hat\Sigma^{-1}(\hat\mu_2 - \hat\mu_1) > \tfrac{1}{2}\hat\mu_2^T\hat\Sigma^{-1}\hat\mu_2 - \tfrac{1}{2}\hat\mu_1^T\hat\Sigma^{-1}\hat\mu_1 + \log(N_1/N) - \log(N_2/N)$
- Note it is quite common to specify $\pi_k = 1/K$ in advance rather than estimating it from the data
- If the $\Sigma_k$ are not all equal, the discriminant function $\delta_k(x)$ defines a quadratic boundary; see Figure 4.6, left
- An alternative is to augment the original set of features with quadratic terms and use linear discriminant functions; see Figure 4.6, right
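In R the quadratic rule (unequal $\Sigma_k$) is available as qda in MASS; a brief hedged illustration on the iris data:

## Linear vs. quadratic discriminant analysis (equal vs. unequal Sigma_k).
library(MASS)
iris.lda <- lda(Species ~ ., data = iris)
iris.qda <- qda(Species ~ ., data = iris)          # separate covariance per class
table(predict(iris.lda)$class, iris$Species)       # confusion table, linear boundaries
table(predict(iris.qda)$class, iris$Species)       # confusion table, quadratic boundaries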
Discriminant analysis (§4.3)

Another description of LDA (§4.3.2, 4.3.3): Let
- W = within-class covariance matrix ($\hat\Sigma$)
- B = between-class covariance matrix

Find $a^T X$ such that $a^T B a$ is maximized and $a^T W a$ is minimized, i.e.
$\max_a \dfrac{a^T B a}{a^T W a}$,
equivalently $\max_a a^T B a$ subject to $a^T W a = 1$.

The solution $a_1$, say, is the eigenvector of $W^{-1}B$ corresponding to the largest eigenvalue. This determines a line in $\mathbb{R}^p$. Continue by finding $a_2$, orthogonal (with respect to W) to $a_1$, which is the eigenvector corresponding to the second largest eigenvalue, and so on.
Discriminant analysis (§4.3)

- There are at most min(p, K-1) positive eigenvalues. These eigenvectors are the linear discriminants, also called canonical variates.
- This technique can be useful for visualization of the groups. Figure 4.4 shows the first two canonical variables for a data set with 10 classes.
- (§4.3.3) write $\hat\Sigma = UDU^T$, where $U^TU = I$ and $D$ is diagonal (see p. 87 for $\hat\Sigma$)
- $X^* = D^{-1/2}U^TX$, with $\hat\Sigma^* = I$
- the classification rule is to choose class k if $\hat\mu_k^*$ is closest (closest class centroid)
- this only needs the K points $\hat\mu_k^*$, and the (K-1)-dimensional subspace containing them, since the remaining directions are orthogonal (in the $X^*$ space)
- if K = 3 we can plot the first two variates (cf. the wine data)
- See p. 9, Figures 4.4 and 4.8 (the algorithm on p. 9 finds the best directions in order, as described on the previous slide)
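Continuing the hedged iris sketch from the LDA slide above (reusing X, y, K, Nk, mu.hat and the pooled covariance S defined there), the canonical variates can be obtained from the eigen-decomposition of $W^{-1}B$; up to scaling, the leading directions correspond to the "Coefficients of linear discriminants" reported by lda.

## Between-class covariance and discriminant directions (eigenvectors of W^{-1} B).
mu.bar <- colMeans(X)
Mc <- scale(mu.hat, center = mu.bar, scale = FALSE)   # class means, centered
B  <- crossprod(sqrt(Nk) * Mc) / (K - 1)              # between-class covariance
ev <- eigen(solve(S) %*% B)
A  <- Re(ev$vectors[, 1:(K - 1)])                     # at most min(p, K-1) useful directions
Z  <- X %*% A                                         # canonical variates
plot(Z[, 1], Z[, 2], col = as.integer(y),
     xlab = "first canonical variate", ylab = "second canonical variate")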
Discriminant analysis (§4.3): Notes

- §4.2 considers linear regression of a 0/1 variable on several inputs (odd from a statistical point of view)
- how to choose between logistic regression and discriminant analysis? they give the same classification error on the heart data (is this a coincidence?)
- logistic regression, and its generalizations to K classes, doesn't assume any distribution for the inputs
- discriminant analysis is more efficient if the assumed distribution is correct
- warning: in §4.3, $x$ and $x_i$ are p-vectors, and we estimate $\beta_0$ and $\beta$, the latter a p-vector; in §4.4 they are (p+1)-vectors with first element equal to 1, and $\beta$ is (p+1)-dimensional.
R code for the wine data

> library(MASS)
> wine.lda <- lda(class ~ alcohol + malic + ash + alcil + mag + totphen +
+     flav + nonflav + proanth + col + hue + dil + proline, data = wine)
> wine.lda
Call:
lda.formula(class ~ alcohol + malic + ash + alcil + mag + totphen +
    flav + nonflav + proanth + col + hue + dil + proline, data = wine)

Prior probabilities of groups:
0.4607  0.988764  0.69669

Group means:
alcohol malic ash alcil mag totphen flav nonflav
.74475 .00678 .45559 7.079 06.90 .84069 .9879 0.90000
.787 .9676 .44789 0.80 94.549 .5887 .080845 0.666
.575 .750 .4708 .4667 99.5 .678750 0.78458 0.447500
proanth col hue dil proline
.899 5.5805 .0609 .57797 5.79
.608 .08660 .05687 .7855 59.5070
.554 7.9650 0.68708 .6854 69.8958

Coefficients of linear discriminants:
              LD1          LD2
alcohol  -0.409978   0.87790699
malic     0.6554596  0.057975
ash      -0.6907556   .458497486
alcil     0.54797889 -0.46807654
mag      -0.006496   -0.000467565
totphen   0.6805068  -0.087
flav      -.6695     -0.49998054
nonflav   -.49588440  -.6095795
proanth   0.40968    -0.070875776
col       0.5505570   0.506865
hue      -0.880607    -.55644987
R code for the wine data

[Figure: the wine data plotted on the first two linear discriminants (LD1 vs LD2), showing the three classes.]
Separating hyperplanes (§4.5)

- assume two classes only; change notation so that $y = \pm 1$
- use linear combinations of the inputs to predict $y$:
  $y = \begin{cases} -1 & \text{if } \beta_0 + x^T\beta < 0 \\ +1 & \text{if } \beta_0 + x^T\beta > 0 \end{cases}$
- misclassification error: $D(\beta, \beta_0) = -\sum_{i \in \mathcal{M}} y_i(\beta_0 + x_i^T\beta)$, where $\mathcal{M} = \{i : y_i(\beta_0 + x_i^T\beta) < 0\}$ is the set of misclassified points
- note that $D(\beta, \beta_0) \geq 0$, with equality only when nothing is misclassified, and it is proportional to the distances of the misclassified points from the boundary $\beta_0 + x^T\beta = 0$
- can show that an algorithm (perceptron learning) to minimize $D(\beta, \beta_0)$ exists and converges to a plane that separates the $y = +1$ points from the $y = -1$ points, if such a plane exists
- but it will cycle if no such plane exists, and be very slow if the gap is small
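A minimal sketch of the perceptron-style update, to make the algorithm concrete (the function name and the simple "first misclassified point" sweep are my own illustration, not the exact scheme in the text); X is an N x p matrix, y is coded -1/+1, and rho is a step size:

## Perceptron sketch: nudge the plane toward one misclassified point at a time.
perceptron <- function(X, y, rho = 1, maxit = 1000) {
  beta0 <- 0; beta <- rep(0, ncol(X))
  for (it in 1:maxit) {
    miss <- which(y * (beta0 + X %*% beta) <= 0)   # currently misclassified points
    if (length(miss) == 0) break                   # a separating plane has been found
    i <- miss[1]
    beta0 <- beta0 + rho * y[i]                    # gradient step for that one point
    beta  <- beta  + rho * y[i] * X[i, ]
  }
  list(beta0 = beta0, beta = beta, converged = length(miss) == 0)
}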
Separating hyperplanes (§4.5)

- Also, if one separating plane exists, there are typically many (Figure 4.)
- The plane that defines the largest gap (margin) between the classes is defined to be the best; can show that it solves
  $\min_{\beta_0, \beta} \|\beta\| \quad \text{subject to } y_i(\beta_0 + x_i^T\beta) \geq 1, \quad i = 1, \dots, N \qquad (4.44)$
  See Figure 4.5
- the points on the edges (margin) of the gap are called support points or support vectors; there are typically many fewer of these than original points
- this is the basis for the development of Support Vector Machines (SVM); more on these later
- sometimes we add features by using basis expansions; to be discussed first in the context of regression (Chapter 5)
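The optimization (4.44) is a quadratic program, so a hedged sketch of the optimal separating hyperplane can be written with the quadprog package (assumed installed); the classes must actually be separable for solve.QP to succeed, and the small eps on beta0 is only there to keep the quadratic term positive definite:

## Optimal separating hyperplane: minimize ||beta||^2 / 2
## subject to y_i (beta0 + x_i' beta) >= 1 for all i.
library(quadprog)
osh <- function(X, y, eps = 1e-8) {
  p <- ncol(X)
  Dmat <- diag(c(eps, rep(1, p)))          # quadratic term in (beta0, beta)
  dvec <- rep(0, p + 1)                    # no linear term
  Amat <- t(y * cbind(1, X))               # one margin constraint per observation
  sol  <- solve.QP(Dmat, dvec, Amat, bvec = rep(1, nrow(X)))$solution
  list(beta0 = sol[1], beta = sol[-1])
}
## Support points are those with y_i (beta0 + x_i' beta) very close to 1.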