Assignment 4
Machine Learning, Summer term 2014, Ulrike von Luxburg
To be discussed in exercise groups on May 12-14

Exercise 1 (Rewriting the Fisher criterion for LDA, 2 points)
Show that the Fisher criterion

$$J(w) = \frac{\langle w, m_+ - m_- \rangle^2}{\sigma_{w,+}^2 + \sigma_{w,-}^2}$$

can be rewritten as

$$J(w) = \frac{\langle w, C_B w \rangle}{\langle w, C_W w \rangle}.$$

See the lecture notes for the meaning of $m_+$, $m_-$, $\sigma_{w,+}^2$, $\sigma_{w,-}^2$, $C_B$ and $C_W$. Read prepare04.pdf (available on the course webpage) for an example of how to perform linear discriminant analysis (LDA) in MATLAB.

Exercise 2 (Digit classification with LDA, 1+2 points)
We want to use LDA for handwritten digit classification on the USPS dataset from Assignment 1 (do not forget to convert the data type from uint8 to double).

(a) Use ClassificationDiscriminant.fit to train an LDA classifier for digit 2 versus 9. Use imagesc (see Assignment 1) to plot the learned weights as a 16x16 image. Here you do not need to change the colormap.

(b) Use the learned weights (.Linear) and the constant weight (.Const) to predict labels of the test data for digits 2 and 9. Report the 0-1 loss of your prediction and compare it with the performance of the kNN classifier for k = 5.

Multiclass classification: So far, all classification algorithms we have seen were designed for two-class (binary) classification problems. In many real-world applications (e.g. digit recognition), the data has several classes. A common approach to such problems is to build a multiclass classifier from several binary classifiers. Assume we have $k$ classes $C_1, \ldots, C_k$. Usually one follows one of two strategies (a MATLAB sketch of the first is given after Exercise 3(a) below):

One vs. All: A single classifier is trained per class to distinguish that class from all other classes. Prediction is performed by evaluating each binary classifier and choosing the prediction with the highest confidence score (in LDA, $\langle w_i, x \rangle + b_i$ is the score for class $i$).

One vs. One: For each pair of classes we construct a binary classifier ($k(k-1)/2$ classifiers in total). Classification of an unknown pattern is usually done by majority voting, where each of the $k(k-1)/2$ classifiers votes for one class.

Exercise 3 (Multiclass classification with LDA, 2+3 points)
(a) Implement the One vs. All strategy for the LDA classifier to classify digits 1, 2, 3 and 4 from the USPS dataset (use the complete dataset available on the course webpage). Report the test error.

Web page: http://www.informatik.uni-hamburg.de/ml/contents/people/luxburg/teaching/2014-ss-vorlesung-ml/
Login: machine, Password: learning
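A minimal MATLAB sketch of the One vs. All strategy for Exercise 3(a) could look as follows. This is only a sketch under assumptions: train_data/train_label and test_data/test_label are loaded as in Assignment 1 and already restricted to digits 1-4, and Coeffs(2,1) is used so that a positive score favors the class labeled +1.

% One vs. All with LDA: a minimal sketch (assumes the data is already
% restricted to digits 1-4).
classes = 1:4;
scores = zeros(size(test_data,1), numel(classes));
for i = 1:numel(classes)
    y = 2*double(train_label == classes(i)) - 1;  % +1 = class i, -1 = rest
    cls = ClassificationDiscriminant.fit(double(train_data), y);
    w = cls.Coeffs(2,1).Linear;   % oriented so that a positive score
    b = cls.Coeffs(2,1).Const;    % favors the +1 class
    scores(:,i) = double(test_data)*w + b;        % confidence for class i
end
[~, idx] = max(scores, [], 2);                    % highest confidence wins
pred = classes(idx)';
test_error = mean(pred ~= double(test_label))     % 0-1 test error

For One vs. One (Exercise 3(b)), one would instead train one classifier per pair of classes and let each classifier vote, predicting the class with the most votes.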

(b) Implement the One vs. One strategy for the LDA classifier to classify digits 1, 2, 3 and 4 from the USPS dataset (use the complete dataset available on the course webpage). Report the test error.

Exercise 4 (Complexity of multiclass classification, 2+1 points)
You have a multiclass classification problem with $n$ training points and $k$ classes. Assume that each class has the same number of training points. A binary classifier myclassifier is given which requires $c(m_1^2 + m_2^2)$ operations to learn (where $m_i$ is the number of training points in class $i$ and $c$ is a constant).

(a) Compare the training time of multiclass classification with the One vs. All and One vs. One strategies using myclassifier. Which strategy is faster?

(b) Does the situation change if myclassifier only requires $c(m_1 + m_2)$ operations for training?

Cross validation (m-fold): In many machine learning algorithms we need to choose parameters of the learning algorithm. The regularization parameter $\lambda$ in ridge regression or the number of neighbors $k$ in kNN classification are examples of such parameters. In m-fold cross-validation the original sample is randomly partitioned into $m$ (roughly) equally sized subsamples. Of the $m$ subsamples, a single subsample is retained as the validation data for testing the model, and the remaining $m-1$ subsamples are used as training data. The cross-validation process is then repeated $m$ times (the folds), with each of the $m$ subsamples used exactly once as the validation data. The $m$ results from the folds can then be combined to produce a single estimate of the performance of the classifier under consideration. Hence, we have the following scheme for choosing a parameter out of a finite set $\Lambda$ of possible parameter values (a MATLAB sketch follows Exercise 6 below):

1. Split the data into $m$ equally sized groups.
2. FOR $\lambda \in \Lambda$, $i = 1, \ldots, m$ DO
   (a) Select group $i$ to be the validation set and all other $m-1$ groups to be the training set.
   (b) Train the model with parameter $\lambda$ on the training set, evaluate it on the validation set and store the validation error.
3. Select the parameter which performed best over the $m$ folds (usually, one considers the average validation error).

Exercise 5 (Parameter selection by the training error, 1 point)
A naive approach to choosing the learning parameter would be to select the one which minimizes the training error. Explain why this is not a good idea in ridge regression and kNN classification.

Exercise 6 (Cross validation in ridge regression, 3 points)
Use your ridge regression code (or the one available on the course webpage) and the training data in dataridge.mat from Assignment 3. Choose the regularization parameter from $\lambda \in \{2^i \mid i = -15, -14, \ldots, 7, 8\}$. Use 10-fold cross validation to select the regularization parameter with the best performance (validation error with L2 loss). To partition the data, the MATLAB command cvpartition might be helpful.
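The scheme above translates almost line by line into MATLAB. The following is only a sketch: trainAndEval is a hypothetical placeholder (not a course or built-in function) that trains a model with parameter Lambda(j) on the fold-training data and returns its error on the held-out fold; X, y and the number of training points n are assumed to be given.

m = 10;
Lambda = 2.^(-15:8);                    % candidate parameter values
cvo = cvpartition(n, 'KFold', m);       % random partition into m folds
avgErr = zeros(size(Lambda));
for j = 1:numel(Lambda)
    errs = zeros(m,1);
    for i = 1:m
        tr = training(cvo, i);          % logical mask of the training folds
        te = test(cvo, i);              % logical mask of the validation fold
        % hypothetical placeholder: train with Lambda(j), return fold error
        errs(i) = trainAndEval(X(tr,:), y(tr), X(te,:), y(te), Lambda(j));
    end
    avgErr(j) = mean(errs);             % average validation error
end
[~, best] = min(avgErr);
bestLambda = Lambda(best)               % parameter with the best performance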

MACHINE LEARNING - ASSIGNMENT 4
VICTOR BERNAL ARZOLA

Exercise 1

The Fisher criterion is

$$J(w) = \frac{\langle w, m_+ - m_- \rangle^2}{\sigma_{w,+}^2 + \sigma_{w,-}^2}. \tag{0.1}$$

If we manipulate the numerator using the definition of the between-class covariance $C_B = (m_+ - m_-)(m_+ - m_-)^T$, we get

$$\frac{\langle w, m_+ - m_- \rangle^2}{\sigma_{w,+}^2 + \sigma_{w,-}^2} = \frac{w^T (m_+ - m_-)(m_+ - m_-)^T w}{\sigma_{w,+}^2 + \sigma_{w,-}^2} = \frac{\langle w, C_B w \rangle}{\sigma_{w,+}^2 + \sigma_{w,-}^2}. \tag{0.2}$$

Recalling the definition of the denominator of (0.1), we have

$$\sigma_{w,+}^2 = \frac{1}{n_+} \sum_{\{i : y_i = +1\}} \big( \langle w, X_i \rangle - \langle w, m_+ \rangle \big)^2 = \frac{1}{n_+} \sum_{\{i : y_i = +1\}} \big( w^T (X_i - m_+) \big)^2,$$

which can be written as

$$\sigma_{w,+}^2 = \frac{1}{n_+} \sum_{\{i : y_i = +1\}} w^T (X_i - m_+)(X_i - m_+)^T w. \tag{0.3}$$

Using a similar procedure for the other class, we get

$$\sigma_{w,+}^2 + \sigma_{w,-}^2 = w^T \left( \frac{1}{n_+} \sum_{\{i : y_i = +1\}} (X_i - m_+)(X_i - m_+)^T + \frac{1}{n_-} \sum_{\{i : y_i = -1\}} (X_i - m_-)(X_i - m_-)^T \right) w,$$

and recalling the definition of the within-class covariance $C_W$,

$$\sigma_{w,+}^2 + \sigma_{w,-}^2 = w^T C_W w = \langle w, C_W w \rangle. \tag{0.4}$$

Finally, (0.1)-(0.4) lead to

$$J(w) = \frac{\langle w, C_B w \rangle}{\langle w, C_W w \rangle}.$$

Exercise 4

Complexity of multiclass classification. You have a multiclass classification problem with $N$ training points and $k$ classes. Assume that each class has the same number of training points, i.e. $N/k$ points per class.
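The identity can also be checked numerically. The following small sketch (an addition, not part of the original solution) compares both expressions for $J(w)$ on random two-class data; var(.,1) and cov(.,1) use the 1/n normalization that matches the definitions above.

% Numerical sanity check of J(w) = <w, C_B w> / <w, C_W w>.
rng(0);
d = 5; n = 40;
Xp = randn(n,d) + 1;                    % class +1
Xm = randn(n,d) - 1;                    % class -1
mp = mean(Xp)'; mm = mean(Xm)';
w  = randn(d,1);
CB = (mp - mm)*(mp - mm)';              % between-class covariance
CW = cov(Xp,1) + cov(Xm,1);             % within-class covariance, 1/n normalized
J1 = (w'*(mp - mm))^2 / (var(Xp*w,1) + var(Xm*w,1));
J2 = (w'*CB*w) / (w'*CW*w);
disp([J1 J2])                           % the two values agree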

                              One vs. One                One vs. All
# classifiers                 k(k-1)/2                   k
# operations per classifier   c((N/k)^2 + (N/k)^2)       c((N/k)^2 + (N - N/k)^2)
total training time           c(k-1)N^2/k                cN^2(1 + (k-1)^2)/k

Table 1. Time complexity of the multiclass approaches for training cost c(m_1^2 + m_2^2).

(a) For a classifier that takes $c(m_1^2 + m_2^2)$ operations to learn, One vs. All takes more time: its total cost grows like $ckN^2$ for large $k$, compared to roughly $cN^2$ for One vs. One, so One vs. One is faster by about a factor of $k - 1$.

                              One vs. One                One vs. All
# classifiers                 k(k-1)/2                   k
# operations per classifier   c(N/k + N/k)               c(N/k + (N - N/k))
total training time           c(k-1)N                    ckN

Table 2. Time complexity of the multiclass approaches for training cost c(m_1 + m_2).

(b) For a classifier that takes $c(m_1 + m_2)$ operations to learn, One vs. All still takes slightly more time ($ckN$ versus $c(k-1)N$), but the situation does change: the gap is now only a factor of $k/(k-1)$, so the two strategies are of the same order.

Exercise 5

A model has overfit the training data when its test error is substantially larger than its training error. Such a model fails to generalize, which is why the training error is a poor predictor of the accuracy of a hypothesis on new data. In particular, the training error of ridge regression is minimized by $\lambda = 0$ (no regularization at all), and the training error of kNN is minimized by $k = 1$ (every training point is its own nearest neighbor, giving zero training error), so selecting these parameters by the training error always picks the most complex, most overfitting model.

Remarks

The polynomial regression model can be expressed in matrix form in terms of a design matrix $X$, a response vector $Y$, a parameter vector $a$, and a vector $\varepsilon$ of random errors. Then the model can be written as a system of linear equations $Y = Xa + \varepsilon$:

$$\begin{pmatrix} Y_1 \\ \vdots \\ Y_n \end{pmatrix} = \begin{pmatrix} 1 & x_1 & \cdots & x_1^p \\ 1 & x_2 & \cdots & x_2^p \\ \vdots & \vdots & & \vdots \\ 1 & x_n & \cdots & x_n^p \end{pmatrix} \begin{pmatrix} a_0 \\ \vdots \\ a_p \end{pmatrix} + \begin{pmatrix} \varepsilon_1 \\ \vdots \\ \varepsilon_n \end{pmatrix}.$$

The $i$-th row of $X$ and $Y$ contains the $x$ and $y$ values of the $i$-th data sample. The vector of estimated polynomial regression coefficients can be found with least squares; this is the unique least-squares solution as long as $X$ has linearly independent columns. (A small MATLAB sketch follows below.)
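As an illustration of the matrix form above (an addition to the original remarks; x and y are assumed to be n-by-1 column vectors of inputs and responses), the design matrix can be built and the least-squares problem solved with MATLAB's backslash operator:

% Least-squares polynomial fit via the design matrix: a minimal sketch.
p = 3;                                  % polynomial degree
X = ones(numel(x), p+1);
for j = 1:p
    X(:,j+1) = x.^j;                    % columns 1, x, x.^2, ..., x.^p
end
a = X \ y;                              % least-squares coefficients a_0,...,a_p
yhat = X * a;                           % fitted values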

Table of Contents
Part (a)
Part (b): Label prediction with LDA on the test data
Compare LDA with a kNN classifier with k = 5
Cross validation (10-fold) for kNN
Exercise 6: Cross validation (10-fold) for ridge regression

Part (a)

% Machine Learning Exercise 2
% The USPS dataset contains grayscale handwritten digit images scanned from
% envelopes by the U.S. Postal Service.
% The images are of size 16 x 16 (256 pixels) with pixel values
% in the range 0 to 255. We have 10 classes {1, 2, ..., 9, 0}.
% The training data has 500 images, stored in a 500 x 256 MATLAB matrix
% (usps_train.mat). The training label is a 500 x 1 vector
% containing the labels for the training data.
% There are 200 test images for evaluating your algorithm
% in the test data (usps_test.mat).

clear all; clc; close all;

load usps_train
load usps_test

% Find the indices of the data corresponding to digits 2 and 9
trlist = find(train_label==2 | train_label==9);

% Training data at these indices
x_train = double(train_data(trlist,:));
y_train = double(train_label(trlist));

% Create a linear classifier: LDA with decision boundary b + L'*x = 0
cls = ClassificationDiscriminant.fit(x_train, y_train);
b = cls.Coeffs(1,2).Const;   % constant weight (needed in Part (b))
L = cls.Coeffs(1,2).Linear;  % learned weights, 256 x 1

% Learned weights as a 16 x 16 image:
%weights = reshape(L,16,16);
%imagesc(weights);

% A nine?
figure(1)
weights = reshape(x_train(100,:),16,16);
imagesc(weights);

% A two?
figure(2)
weights = reshape(x_train(2,:),16,16);
imagesc(weights);

Part (b): Label prediction with LDA on the test data

% There are 200 test images for evaluating the algorithm
% in the test data (usps_test.mat).
testlist = find(test_label==2 | test_label==9);
x_test = double(test_data(testlist,:)); % 40 x 256
y_test = double(test_label(testlist));

% Prediction: [1, x_test] * [b; L] gives the decision scores
x_test_modified = [ones(size(x_test,1),1) x_test]; % add offset column
coeff = [b; L];
y_predicted = x_test_modified * coeff;

% Positive score means class 2, negative score means class 9
class_2 = find(y_predicted >= 0);
class_9 = find(y_predicted < 0);
y_predicted(class_2) = 2;
y_predicted(class_9) = 9;

% === Evaluate the classifier with the 0-1 loss ===
% (a sketch of loss01 is given at the end of this document)
err = loss01(y_predicted, y_test)
%[y_predicted, y_test]

err = 0.0750

Compare LDA with a kNN classifier with k = 5

The kNN classifier from Assignment 1 has the signature knnclassify(train_data, train_label, test_data, k).

k_values = 5;

% kNN prediction
predtest(:,1) = knnclassify(x_train, y_train, x_test, k_values);

% === Evaluate the classifier with the 0-1 loss ===
errtest(:,1) = loss01(predtest(:,1), y_test)
%========================================================================

errtest = 0.0250

Cross validation (10-fold) for kNN

Select the parameter which performed best over the m folds (usually, one considers the average validation error).

% Set the seed
rng(123)

% y_train (the labels) has length 100, and we make 90/10 partitions
k = [1,3,5,7,10,20,30,50];
CVO = cvpartition(y_train, 'KFold', 10);
err = zeros(CVO.NumTestSets, 1);
for j = 1:length(k)
    for i = 1:CVO.NumTestSets
        trIdx = find(CVO.training(i)); % 90 indices into the training data
        teIdx = find(CVO.test(i));     % 10 indices into the training data
        % In cross validation we evaluate on a held-out portion of the
        % training data
        ytest = knnclassify(x_train(trIdx,:), y_train(trIdx,:), x_train(teIdx,:), k(j));
        err(i) = loss01(ytest, y_train(teIdx));
    end
    % Average validation error over the folds
    cverr(j) = mean(err);
end

plot(k, 100*cverr, '*-')
xlabel('k')
ylabel('error (%)')
grid on

Exercise 6: Cross validation (10-fold) for ridge regression

Select the parameter which performed best over the m folds (usually, one considers the average validation error).

% Set the seed
clc; clear all; close all;
load dataridge.mat
rng(123)

% Build the polynomial basis, see
% https://en.wikipedia.org/wiki/Polynomial_regression#Matrix_form_and_calculation_of_estimates
% y_train = x_poly * a, where a is the vector of coefficients to be computed
p = 15; % polynomial basis degree
for i = 0:p
    x_poly(:,i+1) = x_train.^i;
end

% y_train (the labels) has length 100, and we make 90/10 partitions.
% Candidate regularization parameters are lambda = 2^i for the exponents:
lambda_exp = [-12,-10,-8,-6,-4,-2,0,2,4,6,8,10,12];
CVO = cvpartition(length(y_train), 'KFold', 10); % unstratified: y is continuous
err = zeros(CVO.NumTestSets, 1);

for j = 1:length(lambda_exp)
    for i = 1:CVO.NumTestSets
        trIdx = find(CVO.training(i)); % 90 indices into the training data
        teIdx = find(CVO.test(i));     % 10 indices into the training data
        % In cross validation we evaluate on a held-out portion of the
        % training data
        weight = RidgeLLS(x_poly(trIdx,:), y_train(trIdx,:), 2^lambda_exp(j));
        ytest = x_poly(teIdx,:) * weight;
        err(i) = sqrt((ytest - y_train(teIdx))' * (ytest - y_train(teIdx))); % L2 loss
    end
    % Average validation error over the folds
    cverr(j) = mean(err);
end

figure(11)
plot(lambda_exp, cverr, '*-')
grid on
xlabel('log_2(\lambda)')
ylabel('average error')
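The helper functions loss01 and RidgeLLS (from Assignment 3) are used throughout but not reproduced in the transcript. Minimal sketches of what they could look like follow; both are assumptions, and the course versions may differ (e.g. in whether RidgeLLS regularizes the offset column).

function err = loss01(y_predicted, y_true)
% LOSS01 Fraction of misclassified points (0-1 loss). Hypothetical sketch.
err = mean(y_predicted(:) ~= y_true(:));
end

function w = RidgeLLS(X, y, lambda)
% RIDGELLS Ridge regression weights (hypothetical sketch):
%   w = argmin_w ||X*w - y||^2 + lambda*||w||^2 = (X'*X + lambda*I) \ (X'*y)
d = size(X, 2);
w = (X'*X + lambda*eye(d)) \ (X'*y);
end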