Homework 2: Solutions


Statistics 613, Fall 2017

Theoretical Problems:

1. Since $\hat\beta = \arg\min_\beta \left\{ \|Y - X\beta\|_2^2/2n + \lambda\|\beta\|_1 \right\}$, we have:
$$\|Y - X\hat\beta\|_2^2/2n + \lambda\|\hat\beta\|_1 \le \|Y - X\beta_0\|_2^2/2n + \lambda\|\beta_0\|_1.$$
Also, $\beta_0$ is the true parameter value, so $Y = X\beta_0 + \epsilon$. Hence:
$$\|X\beta_0 + \epsilon - X\hat\beta\|_2^2/2n + \lambda\|\hat\beta\|_1 \le \|X\beta_0 + \epsilon - X\beta_0\|_2^2/2n + \lambda\|\beta_0\|_1$$
$$\|X(\hat\beta - \beta_0) - \epsilon\|_2^2/2n + \lambda\|\hat\beta\|_1 \le \|\epsilon\|_2^2/2n + \lambda\|\beta_0\|_1$$
$$\|X(\hat\beta - \beta_0)\|_2^2/2n - \epsilon^T X(\hat\beta - \beta_0)/n + \lambda\|\hat\beta\|_1 \le \lambda\|\beta_0\|_1$$
$$\|X(\hat\beta - \beta_0)\|_2^2/n + 2\lambda\|\hat\beta\|_1 \le 2\epsilon^T X(\hat\beta - \beta_0)/n + 2\lambda\|\beta_0\|_1.$$

2. Since
$$J(\theta) = \frac{1}{n}\sum_{i=1}^n \log\!\left(1 + e^{-y^{(i)}\theta^T x^{(i)}}\right),$$
we have
$$\frac{\partial J(\theta)}{\partial \theta_k} = -\frac{1}{n}\sum_{i=1}^n \frac{1}{1 + e^{\,y^{(i)}\theta^T x^{(i)}}}\, y^{(i)} x^{(i)}_k = -\frac{1}{n}\sum_{i=1}^n h_\theta\!\left(-y^{(i)} x^{(i)}\right) y^{(i)} x^{(i)}_k.$$
The Hessian:
$$H_{kl} = \frac{\partial^2 J(\theta)}{\partial\theta_k\,\partial\theta_l} = -\frac{1}{n}\sum_{i=1}^n \frac{\partial}{\partial\theta_l}\, h_\theta\!\left(-y^{(i)} x^{(i)}\right) y^{(i)} x^{(i)}_k = \frac{1}{n}\sum_{i=1}^n h_\theta\!\left(x^{(i)}\right)\left(1 - h_\theta\!\left(x^{(i)}\right)\right) x^{(i)}_l x^{(i)}_k.$$
The last equality uses that for $g(z) = 1/(1 + e^{-z})$, $g'(z) = g(z)(1 - g(z))$; therefore, for $h(x) = g(\theta^T x)$, $\partial h(x)/\partial\theta_k = h(x)(1 - h(x))\,x_k$ (together with $(y^{(i)})^2 = 1$ and $h_\theta(-y^{(i)}x^{(i)})(1 - h_\theta(-y^{(i)}x^{(i)})) = h_\theta(x^{(i)})(1 - h_\theta(x^{(i)}))$). So we have for the Hessian matrix $H$:
$$H = \frac{1}{n}\sum_{i=1}^n h\!\left(x^{(i)}\right)\left(1 - h\!\left(x^{(i)}\right)\right) x^{(i)} x^{(i)T}.$$
To prove $H$ is positive semidefinite, we show $z^T H z \ge 0$ for all $z$:
$$z^T H z = \frac{1}{n}\, z^T\!\left(\sum_{i=1}^n h\!\left(x^{(i)}\right)\left(1 - h\!\left(x^{(i)}\right)\right) x^{(i)} x^{(i)T}\right) z = \frac{1}{n}\sum_{i=1}^n h\!\left(x^{(i)}\right)\left(1 - h\!\left(x^{(i)}\right)\right) z^T x^{(i)} x^{(i)T} z = \frac{1}{n}\sum_{i=1}^n h\!\left(x^{(i)}\right)\left(1 - h\!\left(x^{(i)}\right)\right) \left(z^T x^{(i)}\right)^2 \ge 0.$$
The last inequality holds since $0 \le h(x^{(i)}) \le 1$, which implies $h(x^{(i)})(1 - h(x^{(i)})) \ge 0$, and $(z^T x^{(i)})^2 \ge 0$.
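A quick numerical sanity check of the Hessian result, as a minimal R sketch on simulated data (all names and dimensions below are illustrative, not part of the assignment data):

# Numerical check (illustrative only): the logistic-loss Hessian is PSD.
set.seed(1)
n <- 200; p <- 5
X <- matrix(rnorm(n * p), n, p)         # simulated design matrix
theta <- rnorm(p)                       # arbitrary parameter vector
h <- 1 / (1 + exp(-X %*% theta))        # h_theta(x^(i)) for each observation
w <- as.vector(h * (1 - h))             # weights h(1 - h) >= 0
H <- t(X) %*% (w * X) / n               # H = (1/n) sum_i w_i x_i x_i^T
min(eigen(H, symmetric = TRUE)$values)  # should be >= 0 (up to rounding)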

3. (a) The between-class covariance is
$$\Sigma_B = \frac{1}{n}\sum_{k=1}^K n_k\, \mu_k \mu_k^T,$$
where $n_k$ is the number of observations in the $k$-th class.

(b) Since $Y$ is the $n \times K$ indicator matrix of class labels, $X^T Y$ gives us a $p \times K$ matrix $[\, n_1\mu_1 \;\; n_2\mu_2 \;\; \cdots \;\; n_K\mu_K \,]$. Therefore
$$\Sigma_B = \frac{1}{n}(X^T Y)(Y^T Y)^{-1}(X^T Y)^T.$$

(c) We have
$$\Sigma_W = \frac{1}{n}\sum_{k=1}^K \sum_{\{i:\, Y_{ik} = 1\}} (X_i - \mu_k)(X_i - \mu_k)^T,$$
where $X_i$ is the $i$-th row of $X$ (the $i$-th observation, written as a column vector). We can also write $\Sigma_W$ as
$$\Sigma_W = \frac{1}{n} X^T\!\left(I - Y(Y^T Y)^{-1}Y^T\right)X.$$
Hence
$$\Sigma_B + \Sigma_W = \frac{1}{n}(X^T Y)(Y^T Y)^{-1}(X^T Y)^T + \frac{1}{n}X^T\!\left(I - Y(Y^T Y)^{-1}Y^T\right)X = \frac{1}{n}X^T Y(Y^T Y)^{-1}Y^T X + \frac{1}{n}X^T\!\left(I - Y(Y^T Y)^{-1}Y^T\right)X = \frac{1}{n}X^T X = \Sigma_T.$$

4. Let us assume that the data has been centered so that the grand mean $\mu = 0$. Let
- $K$ be the total number of classes,
- $X$ be the data matrix,
- $Y$ be an $n \times K$ indicator matrix of class membership,
- $n_i$ be the number of samples in class $i$,
- $N = \sum_{i=1}^K n_i$ be the total number of samples,
- $\mu$ be the grand mean of the data, by assumption $0$,
- $\mu_i$ be the (estimated) center of class $i$,
- $\Sigma_W$ be the within-class covariance,
- $\Sigma_B$ be the between-class covariance.

Before digging into details, note that
$$Y^T X = \begin{pmatrix} n_1\mu_1 & n_2\mu_2 & \cdots & n_K\mu_K \end{pmatrix}^T,$$
which gives
$$(Y^T Y)^{-1} Y^T X = M = \begin{pmatrix} \mu_1 & \mu_2 & \cdots & \mu_K \end{pmatrix}^T \in \mathbb{R}^{K \times p}.$$
Notation-wise, it may help to recall that $M$ is an upper-case $\mu$.
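The identity $\Sigma_B + \Sigma_W = \Sigma_T$ can also be checked numerically. Below is a minimal R sketch on a small simulated data set (the variables X, Y, lab and the dimensions are hypothetical, chosen only for illustration):

# Illustrative check: with a centered data matrix X and indicator matrix Y,
# Sigma_B + Sigma_W should equal Sigma_T.
set.seed(1)
n <- 90; p <- 4; K <- 3
lab <- sample(1:K, n, replace = TRUE)
X <- matrix(rnorm(n * p), n, p) + outer(lab, 1:p)   # class-dependent means
X <- scale(X, center = TRUE, scale = FALSE)         # center: grand mean = 0
Y <- model.matrix(~ factor(lab) - 1)                # n x K indicator matrix
P <- Y %*% solve(t(Y) %*% Y) %*% t(Y)               # projection onto the class indicators
Sigma_B <- t(X) %*% P %*% X / n
Sigma_W <- t(X) %*% (diag(n) - P) %*% X / n
Sigma_T <- t(X) %*% X / n
max(abs(Sigma_B + Sigma_W - Sigma_T))               # ~ 0 up to rounding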

To see this, note that
$$M_{i,j} = (\mu_i)_j = \frac{1}{n_i}\sum_{k:\, y_k = i} (x_k)_j = \frac{1}{n_i}\sum_{k:\, y_k = i} X_{k,j} = \frac{1}{n_i}\sum_{k=1}^n \mathbf{1}_{\{y_k = i\}}\, X_{k,j} = \frac{1}{n_i}\sum_{k=1}^n Y_{k,i}\, X_{k,j} = \frac{1}{n_i}\sum_{k=1}^n Y^T_{i,k}\, X_{k,j} = \frac{1}{n_i}\,(Y^T X)_{i,j},$$
where the $(i,i)$-th element of $(Y^T Y)^{-1}$ is $1/n_i$ and we don't worry about cross terms since $(Y^T Y)^{-1}$ is diagonal.²

Recall that, for a general centered data set $Z$ of $k$ observations, the covariance is given by $\frac{1}{k} Z^T Z$. Applying this principle to $M = (Y^T Y)^{-1} Y^T X$, we have:
$$\Sigma_B = \frac{1}{K} M^T M = \frac{1}{K}\left((Y^T Y)^{-1} Y^T X\right)^T\left((Y^T Y)^{-1} Y^T X\right) = \frac{1}{K} X^T Y (Y^T Y)^{-1}(Y^T Y)^{-1} Y^T X = \frac{1}{K} X^T Y (Y^T Y)^{-2} Y^T X.$$

From here, recall that
$$\Sigma_T = \frac{1}{N} X^T X = \frac{1}{N} X^T\!\left[ Y(Y^T Y)^{-1}Y^T + I - Y(Y^T Y)^{-1}Y^T \right] X = \frac{1}{N} X^T\!\left[ Y(Y^T Y)^{-1}Y^T \right] X + \frac{1}{N} X^T\!\left[ I - Y(Y^T Y)^{-1}Y^T \right] X = \Sigma_B + \Sigma_W,$$
as claimed in lecture.

² $(Y^T Y)_{i,j} = \sum_{k=1}^n Y^T_{i,k} Y_{k,j} = \sum_{k=1}^n Y_{k,i} Y_{k,j} = \sum_{k=1}^n \mathbf{1}_{\{y_k = i\}}\mathbf{1}_{\{y_k = j\}} = \sum_{k=1}^n \mathbf{1}_{\{y_k = i = j\}} = \mathbf{1}_{\{i = j\}}\, n_i$. $(Y^T Y)^{-1}$ is diagonal because the inverse of a diagonal matrix is simply the matrix with the diagonal (non-zero) elements inverted.
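Continuing the same illustrative example, the claim that $(Y^T Y)^{-1} Y^T X$ recovers the matrix of class means can be checked directly (this reuses the hypothetical X, Y, lab, and K from the sketch above):

# (Y^T Y)^{-1} Y^T X should equal the K x p matrix M whose rows are the class means.
M <- solve(t(Y) %*% Y) %*% t(Y) %*% X
M_direct <- t(sapply(1:K, function(k) colMeans(X[lab == k, , drop = FALSE])))
max(abs(M - M_direct))   # ~ 0: the two constructions agree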

This can also be verified directly:
$$(\Sigma_W)_{i,j} = \frac{1}{K}\sum_{k=1}^K \left(\Sigma_W^{(k)}\right)_{i,j} = \frac{1}{K}\sum_{k=1}^K \Big( E[X_{\cdot,i} X_{\cdot,j} \mid \text{class } k] - E[X_{\cdot,i} \mid \text{class } k]\, E[X_{\cdot,j} \mid \text{class } k] \Big) = \frac{1}{K}\sum_{k=1}^K \left( \frac{1}{n_k}\sum_{l:\, y_l = k} x_{l,i}\, x_{l,j} - \mu_{k,i}\,\mu_{k,j} \right) = \frac{1}{n}\left(X^T X\right)_{i,j} - \frac{1}{K}\left(M^T M\right)_{i,j} = (\Sigma_T)_{i,j} - (\Sigma_B)_{i,j},$$
where the second-to-last equality collects the per-class sums into a single sum over all $n$ observations (using $n_k = n/K$ for balanced classes) and recognizes $\sum_{k} \mu_{k,i}\,\mu_{k,j} = (M^T M)_{i,j}$.

This reflects a general result of probability theory, the Law of Total Variance:
$$\underbrace{\operatorname{Var}(X)}_{\text{total variance of } X} = \underbrace{E[\operatorname{Var}(X \mid Y)]}_{\text{within-group variance}} + \underbrace{\operatorname{Var}(E[X \mid Y])}_{\text{between-group variance}}, \qquad \text{i.e.}\qquad E[\operatorname{Var}(X \mid Y)] = \operatorname{Var}(X) - \operatorname{Var}(E[X \mid Y]).$$

From here forward, let us assume without loss of generality that $K = 2$ and $n_1 = n_2$ (hence $N = 2 n_1$).

Equivalence of LDA and FDA: With the above relationships worked out, we can now prove the equivalence of LDA and FDA. Recall that FDA solves the problem:
$$\text{maximize}_\beta \;\; \beta^T \Sigma_B \beta \quad \text{subject to} \quad \beta^T \Sigma_W \beta = 1.$$
This is a generalized eigenvalue problem and can be solved easily. We can write it in Lagrangian form and take the gradient with respect to $\beta$:
$$L = \beta^T \Sigma_B \beta - \lambda\left(\beta^T \Sigma_W \beta - 1\right)$$
$$0 = \nabla_\beta L = 2\Sigma_B\beta - 2\lambda\Sigma_W\beta \;\;\Longrightarrow\;\; \Sigma_B\beta = \lambda\Sigma_W\beta \;\;\Longrightarrow\;\; \Sigma_W^{-1}\Sigma_B\beta = \lambda\beta,$$
assuming $\Sigma_W$ is invertible.³
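The stationarity condition above is an ordinary eigenproblem for $\Sigma_W^{-1}\Sigma_B$. A minimal sketch of solving it numerically, reusing the hypothetical Sigma_W and Sigma_B from the earlier toy example and assuming Sigma_W is well-conditioned:

# Illustrative sketch: solve the FDA generalized eigenproblem numerically.
eig <- eigen(solve(Sigma_W) %*% Sigma_B)
beta_fda <- Re(eig$vectors[, 1])                                         # leading eigenvector
beta_fda <- beta_fda / sqrt(drop(t(beta_fda) %*% Sigma_W %*% beta_fda))  # enforce beta' Sigma_W beta = 1
drop(t(beta_fda) %*% Sigma_B %*% beta_fda)                               # maximized between-class variance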

Hence, our solution vector $\beta$ is the first eigenvector of $\Sigma_W^{-1}\Sigma_B$.⁴

Alternatively, we can consider FDA as the problem of finding $w$ which maximizes the ratio of the between- and within-class variances:
$$J(w) = \frac{w^T \Sigma_B w}{w^T \Sigma_W w}.$$
This problem does not have a unique solution ($J(w) = J(\alpha w)$ for any $\alpha \in \mathbb{R}$, $w \in \mathbb{R}^p$), but our decision rule does not depend on the scale of $w$, so this isn't a problem and we can play a bit fast-and-loose with constants. Taking the gradient of $J(\cdot)$ and setting it equal to zero, we find:
$$0 = \nabla_w J = \frac{(w^T \Sigma_W w)(2\Sigma_B w) - (w^T \Sigma_B w)(2\Sigma_W w)}{(w^T \Sigma_W w)^2} \;\;\Longrightarrow\;\; (w^T \Sigma_B w)(\Sigma_W w) = (w^T \Sigma_W w)(\Sigma_B w) \;\;\Longrightarrow\;\; \Sigma_W w \propto \Sigma_B w.$$
Here we note that $\Sigma_B w$ will always lie in the span of $\mu_2 - \mu_1$, so we have
$$\Sigma_W w \propto \mu_2 - \mu_1 \quad\text{or}\quad w \propto \Sigma_W^{-1}(\mu_2 - \mu_1),$$
which defines the discriminant vector.

Now consider LDA. From [HTF09, Eq. 4.9], we know that the decision boundary for two-class LDA is a line of the form:
$$0 = \log\!\left(\frac{n_1}{n_2}\right) - \frac{1}{2}(\mu_1 + \mu_2)^T \Sigma_W^{-1}(\mu_1 - \mu_2) + x^T \Sigma_W^{-1}(\mu_1 - \mu_2)$$
$$= \log\!\left(\frac{n_1}{n_2}\right) - \frac{1}{2}\left(\mu_1^T \Sigma_W^{-1}\mu_1 - \mu_1^T \Sigma_W^{-1}\mu_2 + \mu_2^T \Sigma_W^{-1}\mu_1 - \mu_2^T \Sigma_W^{-1}\mu_2\right) + x^T \Sigma_W^{-1}(\mu_1 - \mu_2)$$
$$= \log\!\left(\frac{n_1}{n_2}\right) - \frac{1}{2}\left(\mu_1^T \Sigma_W^{-1}\mu_1 - \mu_2^T \Sigma_W^{-1}\mu_2\right) + x^T \Sigma_W^{-1}(\mu_1 - \mu_2),$$
where the last step uses that the transpose of a scalar is itself, so $\mu_1^T \Sigma_W^{-1}\mu_2 = \mu_2^T \Sigma_W^{-1}\mu_1$. Hence the decision boundary lies along the span of $\Sigma_W^{-1}(\mu_1 - \mu_2)$. By construction, it is clear that $\operatorname{Span}(\mu_1 - \mu_2) = \operatorname{range}(\Sigma_B)$, so we have the same line as before and hence the same decision boundary.⁵

Equivalence of FDA and Optimal Scoring: Next we show that FDA and Optimal Scoring are equivalent.

³ A reasonable assumption since $\Sigma_W$ is a covariance matrix (and hence positive semi-definite) by construction. If it is not invertible, then our data lies in a linear manifold and we should apply some form of dimension reduction before classification.

⁴ If we considered the $K$-class case, FDA would identify $K - 1$ eigenvectors. Note here that $\Sigma_W^{-1}\Sigma_B$ has only one non-zero eigenvector under the centering constraint.

⁵ For completeness, we should show that the constant from LDA has a relationship with the decision boundary from FDA. I omit this step.
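For two classes, the chain of equivalences above says the FDA direction should be proportional to $\Sigma_W^{-1}(\mu_2 - \mu_1)$. A small self-contained R sketch (simulated two-class data; all names are illustrative):

# Illustrative two-class check: the leading eigenvector of Sigma_W^{-1} Sigma_B
# is proportional to Sigma_W^{-1} (mu_2 - mu_1), the LDA direction.
set.seed(2)
n1 <- 60; p <- 4
X1 <- matrix(rnorm(n1 * p), n1, p)                 # class 1
X2 <- matrix(rnorm(n1 * p), n1, p) + 1             # class 2, shifted mean
mu1 <- colMeans(X1); mu2 <- colMeans(X2); mu_bar <- (mu1 + mu2) / 2
Sw <- (cov(X1) * (n1 - 1) + cov(X2) * (n1 - 1)) / (2 * n1)   # pooled within-class covariance
Sb <- ((mu1 - mu_bar) %o% (mu1 - mu_bar) + (mu2 - mu_bar) %o% (mu2 - mu_bar)) / 2
w_fda <- Re(eigen(solve(Sw) %*% Sb)$vectors[, 1])
w_lda <- solve(Sw, mu2 - mu1)
abs(sum(w_fda * w_lda)) / sqrt(sum(w_fda^2) * sum(w_lda^2))  # ~ 1: same direction up to sign/scale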

We first find a solution to the Optimal Scoring problem:
$$\text{minimize}_{\beta,\Theta} \;\; \|Y\Theta - X\beta\|_2^2 \quad \text{subject to} \quad \Theta^T Y^T Y \Theta = 1.$$
Let us fix $\beta$ temporarily and optimize with respect to $\Theta \in \mathbb{R}^2$. Moving the constraint into a penalty in the Lagrangian form of the problem, we cast this as a generalized ridge regression problem:⁶
$$\text{minimize}_\Theta \;\; \|Y\Theta - X\beta\|_2^2 + \lambda\,\Theta^T Y^T Y \Theta,$$
with solution given by:
$$L = \|Y\Theta - X\beta\|_2^2 + \lambda\,\Theta^T Y^T Y \Theta$$
$$0 = \nabla_\Theta L = 2Y^T(Y\Theta - X\beta) + 2\lambda Y^T Y \Theta$$
$$2Y^T X\beta = 2Y^T Y\Theta + 2\lambda Y^T Y\Theta$$
$$Y^T X\beta = (Y^T Y + \lambda Y^T Y)\Theta$$
$$\Theta = (Y^T Y + \lambda Y^T Y)^{-1} Y^T X\beta = \frac{1}{1+\lambda}(Y^T Y)^{-1} Y^T X\beta$$
$$Y\Theta = \frac{1}{1+\lambda}\, Y(Y^T Y)^{-1} Y^T X\beta.$$
Note here that $Y^T Y$ is the diagonal matrix of counts, so it is invertible. Next we choose $\lambda$ so that the original problem is feasible:
$$1 = \|Y\Theta\|_2^2 = \left[\frac{1}{1+\lambda} Y(Y^T Y)^{-1} Y^T X\beta\right]^T \left[\frac{1}{1+\lambda} Y(Y^T Y)^{-1} Y^T X\beta\right] = \frac{1}{(1+\lambda)^2}\,\beta^T X^T Y (Y^T Y)^{-1} Y^T Y (Y^T Y)^{-1} Y^T X\beta = \frac{1}{(1+\lambda)^2}\,\beta^T X^T Y (Y^T Y)^{-1} Y^T X\beta$$
$$\Longrightarrow\quad 1 + \lambda = \left(\beta^T X^T Y (Y^T Y)^{-1} Y^T X\beta\right)^{1/2}.$$
Substituting this back into the original optimal scoring problem, we find the optimal $\beta$ is that which satisfies:
$$\text{minimize}_\beta \;\; -2\sqrt{\beta^T \Sigma_B \beta} + \beta^T \Sigma_T \beta,$$
or equivalently
$$\text{minimize}_\beta \;\; -2\sqrt{\beta^T \Sigma_B \beta} + \beta^T \Sigma_B \beta + \beta^T \Sigma_W \beta.$$

⁶ With the substitutions (Generalized Ridge → Optimal Scoring): $\beta \to \Theta$, $\Omega \to Y^T Y$, $Y \to X\beta$, $X \to Y$.
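The closed-form inner step derived above ($\Theta \propto (Y^T Y)^{-1} Y^T X\beta$, rescaled so that $\Theta^T Y^T Y \Theta = 1$) can be sketched directly; this reuses the hypothetical X and Y from the earlier toy example and an arbitrary $\beta$:

# Illustrative sketch of the inner Optimal Scoring step for a fixed beta.
set.seed(3)
beta <- rnorm(ncol(X))
theta_raw <- solve(t(Y) %*% Y) %*% t(Y) %*% X %*% beta        # proportional to the optimal Theta
scale_fac <- sqrt(drop(t(theta_raw) %*% t(Y) %*% Y %*% theta_raw))
theta_hat <- theta_raw / scale_fac                            # rescale to satisfy the constraint
drop(t(theta_hat) %*% t(Y) %*% Y %*% theta_hat)               # = 1: Theta' Y'Y Theta = 1 holds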

To avoid clutter, let $\tilde\beta = \Sigma_W^{1/2}\beta$ and $\tilde\Sigma_B = \Sigma_W^{-1/2}\Sigma_B\Sigma_W^{-1/2}$. Our problem then becomes
$$\text{minimize}_{\tilde\beta}\;\; -2\sqrt{\tilde\beta^T \tilde\Sigma_B \tilde\beta} + \tilde\beta^T\big(\tilde\Sigma_B + I\big)\tilde\beta.$$
Suppose $\hat\beta$ is a non-trivial solution to this problem: we then have $\hat\beta^T \tilde\Sigma_B \hat\beta > 0$, so $\hat\beta$ satisfies $\tilde\Sigma_B \hat\beta = c\,\hat\beta$ for a scalar $c$ depending on $\sqrt{\hat\beta^T \tilde\Sigma_B \hat\beta}$, and hence $\hat\beta$ is an eigenvector of $\tilde\Sigma_B$. Making the same notational substitutions into FDA, we see that FDA is characterized by the first eigenvector of $\tilde\Sigma_B$; hence the solutions are equal.

This proof is due to [WT11, Appendix A.6] with $\Omega = 0$. [HBT95, Section 3] gives an alternate proof of this result (based on a clever use of the SVD) which you may find clearer.

Note: This equivalence only holds for the unpenalized forms of these classifiers; it breaks for the penalized forms. See jrojo/4th-lehmann/slides/witten.pdf or [WT11].

References

[HBT95] Trevor Hastie, Andreas Buja, and Robert Tibshirani. Penalized discriminant analysis. Annals of Statistics, 23(1):73-102, 1995.

[HTF09] Trevor Hastie, Robert Tibshirani, and Jerome Friedman. The Elements of Statistical Learning: Data Mining, Inference, and Prediction. Springer Series in Statistics. Springer, 2nd edition, February 2009. tibs/ElemStatLearn/.

[WT11] Daniel M. Witten and Robert Tibshirani. Penalized classification using Fisher's linear discriminant. Journal of the Royal Statistical Society, Series B (Statistical Methodology), 73(5), 2011.

STAT 613 Homework II
Yanjun Yang

Data Analysis

1. (a) The test misclassification error rates were compared across the following classifiers: NB, LDA, QDA, logistic regression (LR), LR with lasso, LR with ridge, LR with elastic net (alpha = 0.8), and linear SVMs.

For regularized logistic regression, I tried lasso, ridge, and elastic net penalties. If I had to choose one, I would choose the ridge, because the predictors are grayscale pixel values, which are highly correlated for handwritten zip codes.

(b) Logistic regression with the ridge penalty and the linear SVMs gave the best performance. The predictors are highly correlated, especially for nearby pixels. On the other hand, since 3's and 8's differ only on the left side, only those predictors are essential for separating 3's from 8's. Regularized logistic regression with the ridge penalty therefore performs best; given the highly correlated predictors, ridge performs better than the lasso. Linear SVMs with slack variables also perform among the best, because they do a similar job if we view them as a penalization method. Meanwhile, the poor performance of QDA indicates that it overfit the data: the decision boundaries are closer to linear. The R code is attached after problem 2 of this part.

2. (a) The test misclassification error rates were compared across the following classifiers: NB, LDA, multinomial regression (MR), MR with lasso, MR with ridge, MR with elastic net (alpha = 0.8), MR with grouped lasso, and linear SVMs (one vs one).

For regularized multinomial regression, I tried lasso, ridge, elastic net, and grouped lasso penalties. If I had to choose one, I would choose the grouped lasso, because the predictors are highly correlated and only part of them are essential for this classification. For linear SVMs, I used the one-vs-one method to implement a multi-class SVM, because one-vs-one is usually more accurate than one-vs-all and, with only 10 classes here, the computational cost is affordable.

(b) Linear SVMs perform best, followed by regularized multinomial regression with a grouped lasso penalty. Because the decision boundaries are close to linear and the predictors are highly correlated, linear SVMs with slack variables perform very well in this situation. The confusion matrices for these classifiers (rows: predicted class yhat; columns: true class y) show the following patterns:

i. Naive Bayes:

Most often misclassified is 4 (→ 9), followed by 5 (→ 6) and 3 (→ 8).

ii. LDA: most often misclassified is 2 (→ 4 or 8), followed by 5 (→ 3) and 8 (→ 3).

iii. Multinomial regression: most often misclassified is 8 (→ 5, 0, or 2), followed by 2 (→ 4 or 8) and 4 (→ 9).

iv. Regularized multinomial regression (with a grouped lasso penalty):

Most often misclassified is 2 (→ 4 or 8), followed by 4 (→ 2 or 9) and 8 (→ 0 or 5).

v. Linear SVMs (one-vs-one): most often misclassified is 2 (→ 4 or 8), followed by 3 (→ 5) and 8 (→ 5 or 0).

The digits most often misclassified across all methods are 2 and 8; they rank in the top 3 for 4 of these 5 classifiers. This is because they look similar to many other classes in handwriting: for example, 2 looks similar to 4, 5, 8, etc., and 8 looks similar to 0, 2, 3, 5, etc.

The R code for problems 1 and 2 is attached below.

train = as.matrix(read.csv(file="zip.train.csv", header=FALSE))
test = as.matrix(read.csv(file="zip.test.csv", header=FALSE))

#Problem 1: Binary Classification
#select 3 and 8 for problem 1
tetrain = train[(train[,1]==3 | train[,1]==8),]
tetest = test[(test[,1]==3 | test[,1]==8),]
tetrainy = as.factor(tetrain[,1])
tetrainx = as.matrix(tetrain[,-1])
tetesty = as.factor(tetest[,1])
tetestx = as.matrix(tetest[,-1])

#Naive Bayes
library("e1071")
mod.nb = naiveBayes(x=tetrainx, y=as.factor(tetrainy))
pred.nb = predict(mod.nb, newdata = tetestx, type = "class")
conmat.nb = table(pred.nb, tetesty)
err.nb = 1 - mean(pred.nb == tetesty)

#LDA
library(MASS)
tetrain = data.frame(tetrain)
tetest = data.frame(tetest)
mod.lda = lda(V1 ~ ., data = tetrain)
pred.lda = predict(mod.lda, tetest)
conmat.lda = table(pred.lda$class, tetesty)
err.lda = 1 - mean(pred.lda$class == tetesty)

#QDA (jitter the training predictors so no pixel is exactly constant within a class)
tetrain.j <- tetrain
tetrain.j[, -1] <- apply(tetrain[,-1], 2, jitter)
mod.qda = qda(V1 ~ ., data = tetrain.j)
pred.qda = predict(mod.qda, tetest)
conmat.qda = table(pred.qda$class, tetesty)
err.qda = 1 - mean(pred.qda$class == tetesty)

#Logistic Regression (code the response as 3 vs 8)
tetrain.b <- tetrain
tetrain.b[,1] = (tetrain[,1]==3)
tetest.b <- tetest
tetest.b[,1] = (tetest[,1]==3)
mod.lr = glm(V1 ~ ., data = tetrain.b, family = "binomial")
pred.lr = predict(mod.lr, newdata = tetest.b, type = "response")

13 response" pred.lr2=rep("8", length(tetesty pred.lr2[pred.lr>.5]="3" conmat.lr = table(pred.lr2, tetesty err.lr = -mean(pred.lr2==tetesty #Regularized logistic regression library("glmnet" #lasso tetrainy.b = as.numeric(tetrain.b[,] tetrainx.b = as.matrix(tetrain.b[,-] tetesty.b = as.numeric(tetest.b[,] tetestx.b = as.matrix(tetest.b[,-] mod.rlr.lasso = glmnet(tetrainx.b, tetrainy.b, family = " binomial", alpha = cv.rlr.lasso = cv.glmnet(tetrainx.b, tetrainy.b, family = " binomial", alpha= bestlam = cv.rlr.lasso$lambda.min pred.rlr.lasso0 = predict(mod.rlr.lasso, s=bestlam, newx = tetestx, type = "response" pred.rlr.lasso=rep("8", length(tetesty pred.rlr.lasso[pred.rlr.lasso0>.5]="3" conmat.rlr.lasso = table(pred.rlr.lasso, tetesty err.rlr.lasso = - mean(pred.rlr.lasso==tetesty #ridge mod.rlr.ridge = glmnet(tetrainx.b, tetrainy.b, family = " binomial", alpha = 0 cv.rlr.ridge = cv.glmnet(tetrainx.b, tetrainy.b, family = " binomial", alpha=0 bestlam = cv.rlr.ridge$lambda.min pred.rlr.ridge0 = predict(mod.rlr.ridge, s=bestlam, newx = tetestx, type = "response" pred.rlr.ridge=rep("8", length(tetesty pred.rlr.ridge[pred.rlr.ridge0>.5]="3" conmat.rlr.ridge = table(pred.rlr.ridge, tetesty err.rlr.ridge = -mean(pred.rlr.ridge==tetesty #enet mod.rlr.ent = glmnet(tetrainx, tetrainy, family = "binomial", 6

cv.rlr.ent = cv.glmnet(tetrainx, tetrainy, family = "binomial", alpha = 0.8)
bestlam = cv.rlr.ent$lambda.min
pred.rlr.ent0 = predict(mod.rlr.ent, s = bestlam, newx = tetestx, type = "response")
pred.rlr.ent = rep("8", length(tetesty))
pred.rlr.ent[pred.rlr.ent0 > .5] = "3"
conmat.rlr.ent = table(pred.rlr.ent, tetesty)
err.rlr.ent = 1 - mean(pred.rlr.ent == tetesty)
#LiblineaR might be helpful as well

#Linear SVMs
tetraindat = data.frame(x=tetrainx, y=as.factor(tetrainy))
tetestdat = data.frame(x=tetestx, y=as.factor(tetesty))
tune.svm = tune(svm, y ~ ., data = tetraindat, kernel = "linear",
                ranges = list(cost=c(0.001,0.005,0.01,0.05,0.1,1,5,10,100)), scale = FALSE)
bestsvmmod = tune.svm$best.model
pred.svm = predict(bestsvmmod, tetestdat)
conmat.svm = table(pred.svm, tetesty)
err.svm = 1 - mean(pred.svm == tetesty)

#Summary
bierror <- rbind(err.nb, err.lda, err.qda, err.lr, err.rlr.lasso,
                 err.rlr.ridge, err.rlr.ent, err.svm)
colnames(bierror) <- c("Test Error Rate")
rownames(bierror) <- c("NB", "LDA", "QDA", "Logistic Regression (LR)", "LR with lasso",
                       "LR with ridge", "LR with elastic net alpha = 0.8", "Linear SVMs")

#Problem 2: Multi-class Classification
trainy = as.factor(train[,1])
trainx = as.matrix(train[,-1])
testy = as.factor(test[,1])
testx = as.matrix(test[,-1])

#NB
mod.nb2 = naiveBayes(x=trainx, y=as.factor(trainy))

pred.nb2 = predict(mod.nb2, newdata = testx, type = "class")
conmat.nb2 = table(yhat=pred.nb2, y=testy)
err.nb2 = 1 - mean(pred.nb2 == testy)

#LDA
traindat = data.frame(x=trainx, y=as.factor(trainy))
testdat = data.frame(x=testx, y=as.factor(testy))
mod.lda2 = lda(y ~ ., data = traindat)
pred.lda2 = predict(mod.lda2, testdat)
conmat.lda2 = table(yhat = pred.lda2$class, y = testy)
err.lda2 = 1 - mean(pred.lda2$class == testy)

#Multinomial regression
library(nnet)
mod.mr2 = multinom(y ~ ., traindat, MaxNWts = 3000)
pred.mr2 = predict(mod.mr2, testdat)
conmat.mr2 = table(yhat=pred.mr2, y=testy)
err.mr2 = 1 - mean(pred.mr2 == testy)

#Regularized multinomial regression
#lasso
mod.rmr.lasso2 = glmnet(trainx, as.factor(trainy), family = "multinomial", alpha = 1)
cv.rmr.lasso2 = cv.glmnet(trainx, as.factor(trainy), family = "multinomial", alpha = 1)
bestlam = cv.rmr.lasso2$lambda.min
pred.rmr.lasso2 = predict(mod.rmr.lasso2, s = bestlam, newx = testx, type = "class")
conmat.lasso2 = table(yhat=pred.rmr.lasso2, y=testy)
err.lasso2 = 1 - mean(pred.rmr.lasso2 == testy)

#ridge
mod.rmr.ridge2 = glmnet(trainx, as.factor(trainy), family = "multinomial", alpha = 0)
cv.rmr.ridge2 = cv.glmnet(trainx, as.factor(trainy), family = "multinomial", alpha = 0)
bestlam = cv.rmr.ridge2$lambda.min
pred.rmr.ridge2 = predict(mod.rmr.ridge2, s = bestlam, newx = testx, type = "class")
conmat.ridge2 = table(yhat = pred.rmr.ridge2, y = testy)

err.ridge2 = 1 - mean(pred.rmr.ridge2 == testy)

#elastic net
mod.rmr.enet2 = glmnet(trainx, as.factor(trainy), family = "multinomial", alpha = 0.8)
cv.rmr.enet2 = cv.glmnet(trainx, as.factor(trainy), family = "multinomial", alpha = 0.8)
bestlam = cv.rmr.enet2$lambda.min
pred.rmr.enet2 = predict(mod.rmr.enet2, s = bestlam, newx = testx, type = "class")
conmat.enet2 = table(yhat = pred.rmr.enet2, y = testy)
err.enet2 = 1 - mean(pred.rmr.enet2 == testy)

#lasso 2 (grouped)
mod.rmr.lasso3 = glmnet(trainx, as.factor(trainy), family = "multinomial",
                        type.multinomial = "grouped", alpha = 1)
cv.rmr.lasso3 = cv.glmnet(trainx, as.factor(trainy), family = "multinomial",
                          type.multinomial = "grouped", alpha = 1)
bestlam = cv.rmr.lasso3$lambda.min
pred.rmr.lasso3 = predict(mod.rmr.lasso3, s = bestlam, newx = testx, type = "class")
conmat.lasso3 = table(yhat = pred.rmr.lasso3, y = testy)
err.lasso3 = 1 - mean(pred.rmr.lasso3 == testy)

#Linear SVMs
#this function uses the one-vs-one method to implement a multi-class SVM
tune.svm2 = tune(svm, y ~ ., data = traindat, kernel = "linear",
                 ranges = list(cost=c(0.001,0.005,0.01,0.05,0.1,1,5,10)), scale = FALSE)
bestsvmmod2 = tune.svm2$best.model
pred.svm2 = predict(bestsvmmod2, testdat)
conmat.svm2 = table(yhat = pred.svm2, y = testy)
err.svm2 = 1 - mean(pred.svm2 == testy)

#Summary
error <- rbind(err.nb2, err.lda2, err.mr2, err.lasso2, err.ridge2,
               err.enet2, err.lasso3, err.svm2)
colnames(error) <- c("Test Error Rate")
rownames(error) <- c("NB", "LDA", "Multinomial Regression (MR)",

17 ", "MR with lasso", "MR with ridge", "MR with elastic net alpha = 0.8", "MR with grouped lasso", "Linear SVMs (one vs one" 0
