Homework 2: Solutions


Statistics 613, Fall 2017

Theoretical Problems:

1. Since $\hat\beta = \arg\min_\beta \left\{ \|Y - X\beta\|_2^2/2n + \lambda\|\beta\|_1 \right\}$, we have:
$$\|Y - X\hat\beta\|_2^2/2n + \lambda\|\hat\beta\|_1 \le \|Y - X\beta_0\|_2^2/2n + \lambda\|\beta_0\|_1.$$
Also, $\beta_0$ is the true parameter value, so $Y = X\beta_0 + \epsilon$. Hence:
$$\|X\beta_0 + \epsilon - X\hat\beta\|_2^2/2n + \lambda\|\hat\beta\|_1 \le \|X\beta_0 + \epsilon - X\beta_0\|_2^2/2n + \lambda\|\beta_0\|_1$$
$$\|X(\hat\beta - \beta_0) - \epsilon\|_2^2/2n + \lambda\|\hat\beta\|_1 \le \|\epsilon\|_2^2/2n + \lambda\|\beta_0\|_1$$
$$\|X(\hat\beta - \beta_0)\|_2^2/2n - \epsilon^T X(\hat\beta - \beta_0)/n + \lambda\|\hat\beta\|_1 \le \lambda\|\beta_0\|_1$$
$$\|X(\hat\beta - \beta_0)\|_2^2/n + 2\lambda\|\hat\beta\|_1 \le 2\epsilon^T X(\hat\beta - \beta_0)/n + 2\lambda\|\beta_0\|_1.$$

2. Since
$$J(\theta) = \frac{1}{n}\sum_{i=1}^n \log\!\left(1 + e^{-y^{(i)}\theta^T x^{(i)}}\right),$$
we have
$$\frac{\partial J(\theta)}{\partial \theta_k} = -\frac{1}{n}\sum_{i=1}^n \frac{1}{1 + e^{\,y^{(i)}\theta^T x^{(i)}}}\, y^{(i)} x^{(i)}_k = -\frac{1}{n}\sum_{i=1}^n h_\theta\!\left(-y^{(i)} x^{(i)}\right) y^{(i)} x^{(i)}_k.$$
The Hessian:
$$H_{kl} = \frac{\partial^2 J(\theta)}{\partial\theta_k\,\partial\theta_l} = -\frac{1}{n}\sum_{i=1}^n \frac{\partial}{\partial\theta_l}\, h_\theta\!\left(-y^{(i)} x^{(i)}\right) y^{(i)} x^{(i)}_k = \frac{1}{n}\sum_{i=1}^n h_\theta\!\left(x^{(i)}\right)\left(1 - h_\theta\!\left(x^{(i)}\right)\right) x^{(i)}_l x^{(i)}_k.$$
The last equality uses that for $g(z) = 1/(1 + e^{-z})$, $g'(z) = g(z)(1 - g(z))$; therefore, for $h(x) = g(\theta^T x)$, $\partial h(x)/\partial\theta_k = h(x)(1 - h(x))\,x_k$ (together with $(y^{(i)})^2 = 1$ and $h_\theta(-y^{(i)}x^{(i)})(1 - h_\theta(-y^{(i)}x^{(i)})) = h_\theta(x^{(i)})(1 - h_\theta(x^{(i)}))$). So we have for the Hessian matrix $H$:
$$H = \frac{1}{n}\sum_{i=1}^n h\!\left(x^{(i)}\right)\left(1 - h\!\left(x^{(i)}\right)\right) x^{(i)} x^{(i)T}.$$
To prove $H$ is positive semidefinite, we show $z^T H z \ge 0$ for all $z$:
$$z^T H z = \frac{1}{n}\, z^T\!\left(\sum_{i=1}^n h\!\left(x^{(i)}\right)\left(1 - h\!\left(x^{(i)}\right)\right) x^{(i)} x^{(i)T}\right) z = \frac{1}{n}\sum_{i=1}^n h\!\left(x^{(i)}\right)\left(1 - h\!\left(x^{(i)}\right)\right) z^T x^{(i)} x^{(i)T} z = \frac{1}{n}\sum_{i=1}^n h\!\left(x^{(i)}\right)\left(1 - h\!\left(x^{(i)}\right)\right) \left(z^T x^{(i)}\right)^2 \ge 0.$$
The last inequality holds since $0 \le h(x^{(i)}) \le 1$, which implies $h(x^{(i)})(1 - h(x^{(i)})) \ge 0$, and $(z^T x^{(i)})^2 \ge 0$.
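A quick numerical sanity check of the Hessian result, as a minimal R sketch on simulated data (all names and dimensions below are illustrative, not part of the assignment data):

# Numerical check (illustrative only): the logistic-loss Hessian is PSD.
set.seed(1)
n <- 200; p <- 5
X <- matrix(rnorm(n * p), n, p)         # simulated design matrix
theta <- rnorm(p)                       # arbitrary parameter vector
h <- 1 / (1 + exp(-X %*% theta))        # h_theta(x^(i)) for each observation
w <- as.vector(h * (1 - h))             # weights h(1 - h) >= 0
H <- t(X) %*% (w * X) / n               # H = (1/n) sum_i w_i x_i x_i^T
min(eigen(H, symmetric = TRUE)$values)  # should be >= 0 (up to rounding)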

3. (a) The between-class covariance is
$$\Sigma_B = \frac{1}{n}\sum_{k=1}^K n_k\, \mu_k \mu_k^T,$$
where $n_k$ is the number of observations in the $k$-th class.

(b) Since $Y$ is the $n \times K$ indicator matrix of class labels, $X^T Y$ gives us a $p \times K$ matrix $[\, n_1\mu_1 \;\; n_2\mu_2 \;\; \cdots \;\; n_K\mu_K \,]$. Therefore
$$\Sigma_B = \frac{1}{n}(X^T Y)(Y^T Y)^{-1}(X^T Y)^T.$$

(c) We have
$$\Sigma_W = \frac{1}{n}\sum_{k=1}^K \sum_{\{i:\, Y_{ik} = 1\}} (X_i - \mu_k)(X_i - \mu_k)^T,$$
where $X_i$ is the $i$-th row of $X$ (the $i$-th observation, written as a column vector). We can also write $\Sigma_W$ as
$$\Sigma_W = \frac{1}{n} X^T\!\left(I - Y(Y^T Y)^{-1}Y^T\right)X.$$
Hence
$$\Sigma_B + \Sigma_W = \frac{1}{n}(X^T Y)(Y^T Y)^{-1}(X^T Y)^T + \frac{1}{n}X^T\!\left(I - Y(Y^T Y)^{-1}Y^T\right)X = \frac{1}{n}X^T Y(Y^T Y)^{-1}Y^T X + \frac{1}{n}X^T\!\left(I - Y(Y^T Y)^{-1}Y^T\right)X = \frac{1}{n}X^T X = \Sigma_T.$$

4. Let us assume that the data has been centered so that the grand mean $\mu = 0$. Let
- $K$ be the total number of classes,
- $X$ be the data matrix,
- $Y$ be an $n \times K$ indicator matrix of class membership,
- $n_i$ be the number of samples in class $i$,
- $N = \sum_{i=1}^K n_i$ be the total number of samples,
- $\mu$ be the grand mean of the data, by assumption $0$,
- $\mu_i$ be the (estimated) center of class $i$,
- $\Sigma_W$ be the within-class covariance,
- $\Sigma_B$ be the between-class covariance.

Before digging into details, note that
$$Y^T X = \begin{pmatrix} n_1\mu_1 & n_2\mu_2 & \cdots & n_K\mu_K \end{pmatrix}^T,$$
which gives
$$(Y^T Y)^{-1} Y^T X = M = \begin{pmatrix} \mu_1 & \mu_2 & \cdots & \mu_K \end{pmatrix}^T \in \mathbb{R}^{K \times p}.$$
Notation-wise, it may help to recall that $M$ is an upper-case $\mu$.
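The identity $\Sigma_B + \Sigma_W = \Sigma_T$ can also be checked numerically. Below is a minimal R sketch on a small simulated data set (the variables X, Y, lab and the dimensions are hypothetical, chosen only for illustration):

# Illustrative check: with a centered data matrix X and indicator matrix Y,
# Sigma_B + Sigma_W should equal Sigma_T.
set.seed(1)
n <- 90; p <- 4; K <- 3
lab <- sample(1:K, n, replace = TRUE)
X <- matrix(rnorm(n * p), n, p) + outer(lab, 1:p)   # class-dependent means
X <- scale(X, center = TRUE, scale = FALSE)         # center: grand mean = 0
Y <- model.matrix(~ factor(lab) - 1)                # n x K indicator matrix
P <- Y %*% solve(t(Y) %*% Y) %*% t(Y)               # projection onto the class indicators
Sigma_B <- t(X) %*% P %*% X / n
Sigma_W <- t(X) %*% (diag(n) - P) %*% X / n
Sigma_T <- t(X) %*% X / n
max(abs(Sigma_B + Sigma_W - Sigma_T))               # ~ 0 up to rounding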

To see this, note that
$$M_{i,j} = (\mu_i)_j = \frac{1}{n_i}\sum_{k:\, y_k = i} (x_k)_j = \frac{1}{n_i}\sum_{k:\, y_k = i} X_{k,j} = \frac{1}{n_i}\sum_{k=1}^n \mathbf{1}_{\{y_k = i\}}\, X_{k,j} = \frac{1}{n_i}\sum_{k=1}^n Y_{k,i}\, X_{k,j} = \frac{1}{n_i}\sum_{k=1}^n Y^T_{i,k}\, X_{k,j} = \frac{1}{n_i}\,(Y^T X)_{i,j},$$
where the $(i,i)$-th element of $(Y^T Y)^{-1}$ is $1/n_i$ and we don't worry about cross terms since $(Y^T Y)^{-1}$ is diagonal.²

Recall that, for a general centered data set $Z$ of $k$ observations, the covariance is given by $\frac{1}{k} Z^T Z$. Applying this principle to $M = (Y^T Y)^{-1} Y^T X$, we have:
$$\Sigma_B = \frac{1}{K} M^T M = \frac{1}{K}\left((Y^T Y)^{-1} Y^T X\right)^T\left((Y^T Y)^{-1} Y^T X\right) = \frac{1}{K} X^T Y (Y^T Y)^{-1}(Y^T Y)^{-1} Y^T X = \frac{1}{K} X^T Y (Y^T Y)^{-2} Y^T X.$$

From here, recall that
$$\Sigma_T = \frac{1}{N} X^T X = \frac{1}{N} X^T\!\left[ Y(Y^T Y)^{-1}Y^T + I - Y(Y^T Y)^{-1}Y^T \right] X = \frac{1}{N} X^T\!\left[ Y(Y^T Y)^{-1}Y^T \right] X + \frac{1}{N} X^T\!\left[ I - Y(Y^T Y)^{-1}Y^T \right] X = \Sigma_B + \Sigma_W,$$
as claimed in lecture.

² $(Y^T Y)_{i,j} = \sum_{k=1}^n Y^T_{i,k} Y_{k,j} = \sum_{k=1}^n Y_{k,i} Y_{k,j} = \sum_{k=1}^n \mathbf{1}_{\{y_k = i\}}\mathbf{1}_{\{y_k = j\}} = \sum_{k=1}^n \mathbf{1}_{\{y_k = i = j\}} = \mathbf{1}_{\{i = j\}}\, n_i$. $(Y^T Y)^{-1}$ is diagonal because the inverse of a diagonal matrix is simply the matrix with the diagonal (non-zero) elements inverted.
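Continuing the same illustrative example, the claim that $(Y^T Y)^{-1} Y^T X$ recovers the matrix of class means can be checked directly (this reuses the hypothetical X, Y, lab, and K from the sketch above):

# (Y^T Y)^{-1} Y^T X should equal the K x p matrix M whose rows are the class means.
M <- solve(t(Y) %*% Y) %*% t(Y) %*% X
M_direct <- t(sapply(1:K, function(k) colMeans(X[lab == k, , drop = FALSE])))
max(abs(M - M_direct))   # ~ 0: the two constructions agree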

This can also be verified directly:
$$(\Sigma_W)_{i,j} = \frac{1}{K}\sum_{k=1}^K \left(\Sigma_W^{(k)}\right)_{i,j} = \frac{1}{K}\sum_{k=1}^K \Big( E[X_{\cdot,i} X_{\cdot,j} \mid \text{class } k] - E[X_{\cdot,i} \mid \text{class } k]\, E[X_{\cdot,j} \mid \text{class } k] \Big) = \frac{1}{K}\sum_{k=1}^K \left( \frac{1}{n_k}\sum_{l:\, y_l = k} x_{l,i}\, x_{l,j} - \mu_{k,i}\,\mu_{k,j} \right) = \frac{1}{n}\left(X^T X\right)_{i,j} - \frac{1}{K}\left(M^T M\right)_{i,j} = (\Sigma_T)_{i,j} - (\Sigma_B)_{i,j},$$
where the second-to-last equality collects the per-class sums into a single sum over all $n$ observations (using $n_k = n/K$ for balanced classes) and recognizes $\sum_{k} \mu_{k,i}\,\mu_{k,j} = (M^T M)_{i,j}$.

This reflects a general result of probability theory, the Law of Total Variance:
$$\underbrace{\operatorname{Var}(X)}_{\text{total variance of } X} = \underbrace{E[\operatorname{Var}(X \mid Y)]}_{\text{within-group variance}} + \underbrace{\operatorname{Var}(E[X \mid Y])}_{\text{between-group variance}}, \qquad \text{i.e.}\qquad E[\operatorname{Var}(X \mid Y)] = \operatorname{Var}(X) - \operatorname{Var}(E[X \mid Y]).$$

From here forward, let us assume without loss of generality that $K = 2$ and $n_1 = n_2$ (hence $N = 2 n_1$).

Equivalence of LDA and FDA: With the above relationships worked out, we can now prove the equivalence of LDA and FDA. Recall that FDA solves the problem:
$$\text{maximize}_\beta \;\; \beta^T \Sigma_B \beta \quad \text{subject to} \quad \beta^T \Sigma_W \beta = 1.$$
This is a generalized eigenvalue problem and can be solved easily. We can write it in Lagrangian form and take the gradient with respect to $\beta$:
$$L = \beta^T \Sigma_B \beta - \lambda\left(\beta^T \Sigma_W \beta - 1\right)$$
$$0 = \nabla_\beta L = 2\Sigma_B\beta - 2\lambda\Sigma_W\beta \;\;\Longrightarrow\;\; \Sigma_B\beta = \lambda\Sigma_W\beta \;\;\Longrightarrow\;\; \Sigma_W^{-1}\Sigma_B\beta = \lambda\beta,$$
assuming $\Sigma_W$ is invertible.³
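The stationarity condition above is an ordinary eigenproblem for $\Sigma_W^{-1}\Sigma_B$. A minimal sketch of solving it numerically, reusing the hypothetical Sigma_W and Sigma_B from the earlier toy example and assuming Sigma_W is well-conditioned:

# Illustrative sketch: solve the FDA generalized eigenproblem numerically.
eig <- eigen(solve(Sigma_W) %*% Sigma_B)
beta_fda <- Re(eig$vectors[, 1])                                         # leading eigenvector
beta_fda <- beta_fda / sqrt(drop(t(beta_fda) %*% Sigma_W %*% beta_fda))  # enforce beta' Sigma_W beta = 1
drop(t(beta_fda) %*% Sigma_B %*% beta_fda)                               # maximized between-class variance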

Hence, our solution vector $\beta$ is the first eigenvector of $\Sigma_W^{-1}\Sigma_B$.⁴

Alternatively, we can consider FDA as the problem of finding $w$ which maximizes the ratio of the between- and within-class variances:
$$J(w) = \frac{w^T \Sigma_B w}{w^T \Sigma_W w}.$$
This problem does not have a unique solution ($J(w) = J(\alpha w)$ for any $\alpha \in \mathbb{R}$, $w \in \mathbb{R}^p$), but our decision rule does not depend on the scale of $w$, so this isn't a problem and we can play a bit fast-and-loose with constants. Taking the gradient of $J(\cdot)$ and setting it equal to zero, we find:
$$0 = \nabla_w J = \frac{(w^T \Sigma_W w)(2\Sigma_B w) - (w^T \Sigma_B w)(2\Sigma_W w)}{(w^T \Sigma_W w)^2} \;\;\Longrightarrow\;\; (w^T \Sigma_B w)(\Sigma_W w) = (w^T \Sigma_W w)(\Sigma_B w) \;\;\Longrightarrow\;\; \Sigma_W w \propto \Sigma_B w.$$
Here we note that $\Sigma_B w$ will always lie in the span of $\mu_2 - \mu_1$, so we have
$$\Sigma_W w \propto \mu_2 - \mu_1 \quad\text{or}\quad w \propto \Sigma_W^{-1}(\mu_2 - \mu_1),$$
which defines the discriminant vector.

Now consider LDA. From [HTF09, Eq. 4.9], we know that the decision boundary for two-class LDA is a line of the form:
$$0 = \log\!\left(\frac{n_1}{n_2}\right) - \frac{1}{2}(\mu_1 + \mu_2)^T \Sigma_W^{-1}(\mu_1 - \mu_2) + x^T \Sigma_W^{-1}(\mu_1 - \mu_2)$$
$$= \log\!\left(\frac{n_1}{n_2}\right) - \frac{1}{2}\left(\mu_1^T \Sigma_W^{-1}\mu_1 - \mu_1^T \Sigma_W^{-1}\mu_2 + \mu_2^T \Sigma_W^{-1}\mu_1 - \mu_2^T \Sigma_W^{-1}\mu_2\right) + x^T \Sigma_W^{-1}(\mu_1 - \mu_2)$$
$$= \log\!\left(\frac{n_1}{n_2}\right) - \frac{1}{2}\left(\mu_1^T \Sigma_W^{-1}\mu_1 - \mu_2^T \Sigma_W^{-1}\mu_2\right) + x^T \Sigma_W^{-1}(\mu_1 - \mu_2),$$
where the last step uses that the transpose of a scalar is itself, so $\mu_1^T \Sigma_W^{-1}\mu_2 = \mu_2^T \Sigma_W^{-1}\mu_1$. Hence the decision boundary lies along the span of $\Sigma_W^{-1}(\mu_1 - \mu_2)$. By construction, it is clear that $\operatorname{Span}(\mu_1 - \mu_2) = \operatorname{range}(\Sigma_B)$, so we have the same line as before and hence the same decision boundary.⁵

Equivalence of FDA and Optimal Scoring: Next we show that FDA and Optimal Scoring are equivalent.

³ A reasonable assumption since $\Sigma_W$ is a covariance matrix (and hence positive semi-definite) by construction. If it is not invertible, then our data lies in a linear manifold and we should apply some form of dimension reduction before classification.

⁴ If we considered the $K$-class case, FDA would identify $K - 1$ eigenvectors. Note here that $\Sigma_W^{-1}\Sigma_B$ has only one non-zero eigenvector under the centering constraint.

⁵ For completeness, we should show that the constant from LDA has a relationship with the decision boundary from FDA. I omit this step.
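For two classes, the chain of equivalences above says the FDA direction should be proportional to $\Sigma_W^{-1}(\mu_2 - \mu_1)$. A small self-contained R sketch (simulated two-class data; all names are illustrative):

# Illustrative two-class check: the leading eigenvector of Sigma_W^{-1} Sigma_B
# is proportional to Sigma_W^{-1} (mu_2 - mu_1), the LDA direction.
set.seed(2)
n1 <- 60; p <- 4
X1 <- matrix(rnorm(n1 * p), n1, p)                 # class 1
X2 <- matrix(rnorm(n1 * p), n1, p) + 1             # class 2, shifted mean
mu1 <- colMeans(X1); mu2 <- colMeans(X2); mu_bar <- (mu1 + mu2) / 2
Sw <- (cov(X1) * (n1 - 1) + cov(X2) * (n1 - 1)) / (2 * n1)   # pooled within-class covariance
Sb <- ((mu1 - mu_bar) %o% (mu1 - mu_bar) + (mu2 - mu_bar) %o% (mu2 - mu_bar)) / 2
w_fda <- Re(eigen(solve(Sw) %*% Sb)$vectors[, 1])
w_lda <- solve(Sw, mu2 - mu1)
abs(sum(w_fda * w_lda)) / sqrt(sum(w_fda^2) * sum(w_lda^2))  # ~ 1: same direction up to sign/scale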

We first find a solution to the Optimal Scoring problem:
$$\text{minimize}_{\beta,\Theta} \;\; \|Y\Theta - X\beta\|_2^2 \quad \text{subject to} \quad \Theta^T Y^T Y \Theta = 1.$$
Let us fix $\beta$ temporarily and optimize with respect to $\Theta \in \mathbb{R}^2$. Moving the constraint into a penalty in the Lagrangian form of the problem, we cast this as a generalized ridge regression problem:⁶
$$\text{minimize}_\Theta \;\; \|Y\Theta - X\beta\|_2^2 + \lambda\,\Theta^T Y^T Y \Theta,$$
with solution given by:
$$L = \|Y\Theta - X\beta\|_2^2 + \lambda\,\Theta^T Y^T Y \Theta$$
$$0 = \nabla_\Theta L = 2Y^T(Y\Theta - X\beta) + 2\lambda Y^T Y \Theta$$
$$2Y^T X\beta = 2Y^T Y\Theta + 2\lambda Y^T Y\Theta$$
$$Y^T X\beta = (Y^T Y + \lambda Y^T Y)\Theta$$
$$\Theta = (Y^T Y + \lambda Y^T Y)^{-1} Y^T X\beta = \frac{1}{1+\lambda}(Y^T Y)^{-1} Y^T X\beta$$
$$Y\Theta = \frac{1}{1+\lambda}\, Y(Y^T Y)^{-1} Y^T X\beta.$$
Note here that $Y^T Y$ is the diagonal matrix of counts, so it is invertible. Next we choose $\lambda$ so that the original problem is feasible:
$$1 = \|Y\Theta\|_2^2 = \left[\frac{1}{1+\lambda} Y(Y^T Y)^{-1} Y^T X\beta\right]^T \left[\frac{1}{1+\lambda} Y(Y^T Y)^{-1} Y^T X\beta\right] = \frac{1}{(1+\lambda)^2}\,\beta^T X^T Y (Y^T Y)^{-1} Y^T Y (Y^T Y)^{-1} Y^T X\beta = \frac{1}{(1+\lambda)^2}\,\beta^T X^T Y (Y^T Y)^{-1} Y^T X\beta$$
$$\Longrightarrow\quad 1 + \lambda = \left(\beta^T X^T Y (Y^T Y)^{-1} Y^T X\beta\right)^{1/2}.$$
Substituting this back into the original optimal scoring problem, we find the optimal $\beta$ is that which satisfies:
$$\text{minimize}_\beta \;\; -2\sqrt{\beta^T \Sigma_B \beta} + \beta^T \Sigma_T \beta,$$
or equivalently
$$\text{minimize}_\beta \;\; -2\sqrt{\beta^T \Sigma_B \beta} + \beta^T \Sigma_B \beta + \beta^T \Sigma_W \beta.$$

⁶ With the substitutions (Generalized Ridge → Optimal Scoring): $\beta \to \Theta$, $\Omega \to Y^T Y$, $Y \to X\beta$, $X \to Y$.
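The closed-form inner step derived above ($\Theta \propto (Y^T Y)^{-1} Y^T X\beta$, rescaled so that $\Theta^T Y^T Y \Theta = 1$) can be sketched directly; this reuses the hypothetical X and Y from the earlier toy example and an arbitrary $\beta$:

# Illustrative sketch of the inner Optimal Scoring step for a fixed beta.
set.seed(3)
beta <- rnorm(ncol(X))
theta_raw <- solve(t(Y) %*% Y) %*% t(Y) %*% X %*% beta        # proportional to the optimal Theta
scale_fac <- sqrt(drop(t(theta_raw) %*% t(Y) %*% Y %*% theta_raw))
theta_hat <- theta_raw / scale_fac                            # rescale to satisfy the constraint
drop(t(theta_hat) %*% t(Y) %*% Y %*% theta_hat)               # = 1: Theta' Y'Y Theta = 1 holds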

To avoid clutter, let $\tilde\beta = \Sigma_W^{1/2}\beta$ and $\tilde\Sigma_B = \Sigma_W^{-1/2}\Sigma_B\Sigma_W^{-1/2}$. Our problem then becomes
$$\text{minimize}_{\tilde\beta}\;\; -2\sqrt{\tilde\beta^T \tilde\Sigma_B \tilde\beta} + \tilde\beta^T\big(\tilde\Sigma_B + I\big)\tilde\beta.$$
Suppose $\hat\beta$ is a non-trivial solution to this problem: we then have $\hat\beta^T \tilde\Sigma_B \hat\beta > 0$, so $\hat\beta$ satisfies $\tilde\Sigma_B \hat\beta = c\,\hat\beta$ for a scalar $c$ depending on $\sqrt{\hat\beta^T \tilde\Sigma_B \hat\beta}$, and hence $\hat\beta$ is an eigenvector of $\tilde\Sigma_B$. Making the same notational substitutions into FDA, we see that FDA is characterized by the first eigenvector of $\tilde\Sigma_B$; hence the solutions are equal.

This proof is due to [WT11, Appendix A.6] with $\Omega = 0$. [HBT95, Section 3] gives an alternate proof of this result (based on a clever use of the SVD) which you may find clearer.

Note: This equivalence only holds for the unpenalized forms of these classifiers; it breaks for the penalized forms. See jrojo/4th-lehmann/slides/witten.pdf or [WT11].

References

[HBT95] Trevor Hastie, Andreas Buja, and Robert Tibshirani. Penalized discriminant analysis. Annals of Statistics, 23(1):73-102, 1995.

[HTF09] Trevor Hastie, Robert Tibshirani, and Jerome Friedman. The Elements of Statistical Learning: Data Mining, Inference, and Prediction. Springer Series in Statistics. Springer, 2nd edition, February 2009. tibs/ElemStatLearn/.

[WT11] Daniel M. Witten and Robert Tibshirani. Penalized classification using Fisher's linear discriminant. Journal of the Royal Statistical Society, Series B (Statistical Methodology), 73(5), 2011.

STAT 613 Homework II
Yanjun Yang

Data Analysis

1. (a) The test misclassification error rates were compared across the following classifiers: NB, LDA, QDA, logistic regression (LR), LR with lasso, LR with ridge, LR with elastic net (alpha = 0.8), and linear SVMs.

For regularized logistic regression, I tried lasso, ridge, and elastic net penalties. If I had to choose one, I would choose the ridge, because the predictors are grayscale pixel values, which are highly correlated for handwritten zip codes.

(b) Logistic regression with the ridge penalty and the linear SVMs gave the best performance. The predictors are highly correlated, especially for nearby pixels. On the other hand, since 3's and 8's differ only on the left side, only those predictors are essential for separating 3's from 8's. Regularized logistic regression with the ridge penalty therefore performs best; given the highly correlated predictors, ridge performs better than the lasso. Linear SVMs with slack variables also perform among the best, because they do a similar job if we view them as a penalization method. Meanwhile, the poor performance of QDA indicates that it overfit the data: the decision boundaries are closer to linear. The R code is attached after problem 2 of this part.

2. (a) The test misclassification error rates were compared across the following classifiers: NB, LDA, multinomial regression (MR), MR with lasso, MR with ridge, MR with elastic net (alpha = 0.8), MR with grouped lasso, and linear SVMs (one vs one).

For regularized multinomial regression, I tried lasso, ridge, elastic net, and grouped lasso penalties. If I had to choose one, I would choose the grouped lasso, because the predictors are highly correlated and only part of them are essential for this classification. For linear SVMs, I used the one-vs-one method to implement a multi-class SVM, because one-vs-one is usually more accurate than one-vs-all and, with only 10 classes here, the computational cost is affordable.

(b) Linear SVMs perform best, followed by regularized multinomial regression with a grouped lasso penalty. Because the decision boundaries are close to linear and the predictors are highly correlated, linear SVMs with slack variables perform very well in this situation. The confusion matrices for these classifiers (rows: predicted class yhat; columns: true class y) show the following patterns:

i. Naive Bayes:

Most often misclassified is 4 (→ 9), followed by 5 (→ 6) and 3 (→ 8).

ii. LDA: most often misclassified is 2 (→ 4 or 8), followed by 5 (→ 3) and 8 (→ 3).

iii. Multinomial regression: most often misclassified is 8 (→ 5, 0, or 2), followed by 2 (→ 4 or 8) and 4 (→ 9).

iv. Regularized multinomial regression (with a grouped lasso penalty):

Most often misclassified is 2 (→ 4 or 8), followed by 4 (→ 2 or 9) and 8 (→ 0 or 5).

v. Linear SVMs (one-vs-one): most often misclassified is 2 (→ 4 or 8), followed by 3 (→ 5) and 8 (→ 5 or 0).

The digits most often misclassified across all methods are 2 and 8; they rank in the top 3 for 4 of these 5 classifiers. This is because they look similar to many other classes in handwriting: for example, 2 looks similar to 4, 5, 8, etc., and 8 looks similar to 0, 2, 3, 5, etc.

The R code for problems 1 and 2 is attached below.

train = as.matrix(read.csv(file="zip.train.csv", header=FALSE))
test = as.matrix(read.csv(file="zip.test.csv", header=FALSE))

#Problem 1: Binary Classification
#select 3 and 8 for problem 1
tetrain = train[(train[,1]==3 | train[,1]==8),]
tetest = test[(test[,1]==3 | test[,1]==8),]
tetrainy = as.factor(tetrain[,1])
tetrainx = as.matrix(tetrain[,-1])
tetesty = as.factor(tetest[,1])
tetestx = as.matrix(tetest[,-1])

#Naive Bayes
library("e1071")
mod.nb = naiveBayes(x=tetrainx, y=as.factor(tetrainy))
pred.nb = predict(mod.nb, newdata = tetestx, type = "class")
conmat.nb = table(pred.nb, tetesty)
err.nb = 1 - mean(pred.nb == tetesty)

#LDA
library(MASS)
tetrain = data.frame(tetrain)
tetest = data.frame(tetest)
mod.lda = lda(V1 ~ ., data = tetrain)
pred.lda = predict(mod.lda, tetest)
conmat.lda = table(pred.lda$class, tetesty)
err.lda = 1 - mean(pred.lda$class == tetesty)

#QDA (jitter the training predictors so no pixel is exactly constant within a class)
tetrain.j <- tetrain
tetrain.j[, -1] <- apply(tetrain[,-1], 2, jitter)
mod.qda = qda(V1 ~ ., data = tetrain.j)
pred.qda = predict(mod.qda, tetest)
conmat.qda = table(pred.qda$class, tetesty)
err.qda = 1 - mean(pred.qda$class == tetesty)

#Logistic Regression (code the response as 3 vs 8)
tetrain.b <- tetrain
tetrain.b[,1] = (tetrain[,1]==3)
tetest.b <- tetest
tetest.b[,1] = (tetest[,1]==3)
mod.lr = glm(V1 ~ ., data = tetrain.b, family = "binomial")
pred.lr = predict(mod.lr, newdata = tetest.b, type = "response")

13 response" pred.lr2=rep("8", length(tetesty pred.lr2[pred.lr>.5]="3" conmat.lr = table(pred.lr2, tetesty err.lr = -mean(pred.lr2==tetesty #Regularized logistic regression library("glmnet" #lasso tetrainy.b = as.numeric(tetrain.b[,] tetrainx.b = as.matrix(tetrain.b[,-] tetesty.b = as.numeric(tetest.b[,] tetestx.b = as.matrix(tetest.b[,-] mod.rlr.lasso = glmnet(tetrainx.b, tetrainy.b, family = " binomial", alpha = cv.rlr.lasso = cv.glmnet(tetrainx.b, tetrainy.b, family = " binomial", alpha= bestlam = cv.rlr.lasso$lambda.min pred.rlr.lasso0 = predict(mod.rlr.lasso, s=bestlam, newx = tetestx, type = "response" pred.rlr.lasso=rep("8", length(tetesty pred.rlr.lasso[pred.rlr.lasso0>.5]="3" conmat.rlr.lasso = table(pred.rlr.lasso, tetesty err.rlr.lasso = - mean(pred.rlr.lasso==tetesty #ridge mod.rlr.ridge = glmnet(tetrainx.b, tetrainy.b, family = " binomial", alpha = 0 cv.rlr.ridge = cv.glmnet(tetrainx.b, tetrainy.b, family = " binomial", alpha=0 bestlam = cv.rlr.ridge$lambda.min pred.rlr.ridge0 = predict(mod.rlr.ridge, s=bestlam, newx = tetestx, type = "response" pred.rlr.ridge=rep("8", length(tetesty pred.rlr.ridge[pred.rlr.ridge0>.5]="3" conmat.rlr.ridge = table(pred.rlr.ridge, tetesty err.rlr.ridge = -mean(pred.rlr.ridge==tetesty #enet mod.rlr.ent = glmnet(tetrainx, tetrainy, family = "binomial", 6

cv.rlr.ent = cv.glmnet(tetrainx, tetrainy, family = "binomial", alpha = 0.8)
bestlam = cv.rlr.ent$lambda.min
pred.rlr.ent0 = predict(mod.rlr.ent, s = bestlam, newx = tetestx, type = "response")
pred.rlr.ent = rep("8", length(tetesty))
pred.rlr.ent[pred.rlr.ent0 > .5] = "3"
conmat.rlr.ent = table(pred.rlr.ent, tetesty)
err.rlr.ent = 1 - mean(pred.rlr.ent == tetesty)
#LiblineaR might be helpful as well

#Linear SVMs
tetraindat = data.frame(x=tetrainx, y=as.factor(tetrainy))
tetestdat = data.frame(x=tetestx, y=as.factor(tetesty))
tune.svm = tune(svm, y ~ ., data = tetraindat, kernel = "linear",
                ranges = list(cost=c(0.001,0.005,0.01,0.05,0.1,1,5,10,100)), scale = FALSE)
bestsvmmod = tune.svm$best.model
pred.svm = predict(bestsvmmod, tetestdat)
conmat.svm = table(pred.svm, tetesty)
err.svm = 1 - mean(pred.svm == tetesty)

#Summary
bierror <- rbind(err.nb, err.lda, err.qda, err.lr, err.rlr.lasso,
                 err.rlr.ridge, err.rlr.ent, err.svm)
colnames(bierror) <- c("Test Error Rate")
rownames(bierror) <- c("NB", "LDA", "QDA", "Logistic Regression (LR)", "LR with lasso",
                       "LR with ridge", "LR with elastic net alpha = 0.8", "Linear SVMs")

#Problem 2: Multi-class Classification
trainy = as.factor(train[,1])
trainx = as.matrix(train[,-1])
testy = as.factor(test[,1])
testx = as.matrix(test[,-1])

#NB
mod.nb2 = naiveBayes(x=trainx, y=as.factor(trainy))

pred.nb2 = predict(mod.nb2, newdata = testx, type = "class")
conmat.nb2 = table(yhat=pred.nb2, y=testy)
err.nb2 = 1 - mean(pred.nb2 == testy)

#LDA
traindat = data.frame(x=trainx, y=as.factor(trainy))
testdat = data.frame(x=testx, y=as.factor(testy))
mod.lda2 = lda(y ~ ., data = traindat)
pred.lda2 = predict(mod.lda2, testdat)
conmat.lda2 = table(yhat = pred.lda2$class, y = testy)
err.lda2 = 1 - mean(pred.lda2$class == testy)

#Multinomial regression
library(nnet)
mod.mr2 = multinom(y ~ ., traindat, MaxNWts = 3000)
pred.mr2 = predict(mod.mr2, testdat)
conmat.mr2 = table(yhat=pred.mr2, y=testy)
err.mr2 = 1 - mean(pred.mr2 == testy)

#Regularized multinomial regression
#lasso
mod.rmr.lasso2 = glmnet(trainx, as.factor(trainy), family = "multinomial", alpha = 1)
cv.rmr.lasso2 = cv.glmnet(trainx, as.factor(trainy), family = "multinomial", alpha = 1)
bestlam = cv.rmr.lasso2$lambda.min
pred.rmr.lasso2 = predict(mod.rmr.lasso2, s = bestlam, newx = testx, type = "class")
conmat.lasso2 = table(yhat=pred.rmr.lasso2, y=testy)
err.lasso2 = 1 - mean(pred.rmr.lasso2 == testy)

#ridge
mod.rmr.ridge2 = glmnet(trainx, as.factor(trainy), family = "multinomial", alpha = 0)
cv.rmr.ridge2 = cv.glmnet(trainx, as.factor(trainy), family = "multinomial", alpha = 0)
bestlam = cv.rmr.ridge2$lambda.min
pred.rmr.ridge2 = predict(mod.rmr.ridge2, s = bestlam, newx = testx, type = "class")
conmat.ridge2 = table(yhat = pred.rmr.ridge2, y = testy)

err.ridge2 = 1 - mean(pred.rmr.ridge2 == testy)

#elastic net
mod.rmr.enet2 = glmnet(trainx, as.factor(trainy), family = "multinomial", alpha = 0.8)
cv.rmr.enet2 = cv.glmnet(trainx, as.factor(trainy), family = "multinomial", alpha = 0.8)
bestlam = cv.rmr.enet2$lambda.min
pred.rmr.enet2 = predict(mod.rmr.enet2, s = bestlam, newx = testx, type = "class")
conmat.enet2 = table(yhat = pred.rmr.enet2, y = testy)
err.enet2 = 1 - mean(pred.rmr.enet2 == testy)

#lasso 2 (grouped)
mod.rmr.lasso3 = glmnet(trainx, as.factor(trainy), family = "multinomial",
                        type.multinomial = "grouped", alpha = 1)
cv.rmr.lasso3 = cv.glmnet(trainx, as.factor(trainy), family = "multinomial",
                          type.multinomial = "grouped", alpha = 1)
bestlam = cv.rmr.lasso3$lambda.min
pred.rmr.lasso3 = predict(mod.rmr.lasso3, s = bestlam, newx = testx, type = "class")
conmat.lasso3 = table(yhat = pred.rmr.lasso3, y = testy)
err.lasso3 = 1 - mean(pred.rmr.lasso3 == testy)

#Linear SVMs
#this function uses the one-vs-one method to implement a multi-class SVM
tune.svm2 = tune(svm, y ~ ., data = traindat, kernel = "linear",
                 ranges = list(cost=c(0.001,0.005,0.01,0.05,0.1,1,5,10)), scale = FALSE)
bestsvmmod2 = tune.svm2$best.model
pred.svm2 = predict(bestsvmmod2, testdat)
conmat.svm2 = table(yhat = pred.svm2, y = testy)
err.svm2 = 1 - mean(pred.svm2 == testy)

#Summary
error <- rbind(err.nb2, err.lda2, err.mr2, err.lasso2, err.ridge2,
               err.enet2, err.lasso3, err.svm2)
colnames(error) <- c("Test Error Rate")
rownames(error) <- c("NB", "LDA", "Multinomial Regression (MR)",

17 ", "MR with lasso", "MR with ridge", "MR with elastic net alpha = 0.8", "MR with grouped lasso", "Linear SVMs (one vs one" 0
