HW5 Solutions. Gabe Hope. March 2017


1 Problem 1

Recall that the multiclass SVM loss is

$$L((w_1, \ldots, w_k), (x, y)) = \max\Big(0,\; \big(\max_{y' \neq y} w_{y'}^T x\big) + 1 - w_y^T x\Big),$$

and let $\hat{y} = \operatorname{argmax}_{y' \neq y} w_{y'}^T x$ denote the highest-scoring incorrect class.

1.1 Part a

Suppose $w_y^T x > w_{\hat{y}}^T x + 1$. In this case our loss function is defined to be 0, therefore we know that any partial derivative of the function will also be 0, so:

$$\frac{\partial L((w_1, \ldots, w_k), (x, y))}{\partial w_{j,l}} = 0$$

1.2 Part b

Suppose $w_y^T x < w_{\hat{y}}^T x + 1$ and $j = y$. In this case we can write the relevant piece of the loss function as:

$$\Big(\max_{y' \neq j} w_{y'}^T x\Big) + 1 - w_j^T x = \Big(\max_{y' \neq j} w_{y'}^T x\Big) + 1 - \sum_{i=1}^{D} w_{j,i} x_i$$

Taking the derivative with respect to $w_{j,l}$ we get:

$$\frac{\partial L((w_1, \ldots, w_k), (x, y))}{\partial w_{j,l}} = -x_l$$

1.3 Part c

Suppose $w_y^T x < w_{\hat{y}}^T x + 1$ and $j = \hat{y}$, where $\hat{y} = \operatorname{argmax}_{y' \neq y} w_{y'}^T x$. In this case we can write the relevant piece of the loss function as:

$$w_j^T x + 1 - w_y^T x = \sum_{i=1}^{D} w_{j,i} x_i + 1 - w_y^T x$$

Taking the derivative with respect to $w_{j,l}$ we get:

$$\frac{\partial L((w_1, \ldots, w_k), (x, y))}{\partial w_{j,l}} = x_l$$

1.4 Part d

Suppose $w_y^T x < w_{\hat{y}}^T x + 1$ and $j \neq \hat{y}$ and $j \neq y$, where $\hat{y} = \operatorname{argmax}_{y' \neq y} w_{y'}^T x$. In this case we can write the relevant piece of the loss function as:

$$w_{\hat{y}}^T x + 1 - w_y^T x$$

Noting that this expression does not include $w_j$, we get that the partial derivative with respect to $w_{j,l}$ is 0:

$$\frac{\partial L((w_1, \ldots, w_k), (x, y))}{\partial w_{j,l}} = 0$$

2 Problem 2

2.1 Part a

Recall that the decision function for the multiclass SVM is:

$$y_i = \operatorname{argmax}_y\, w_y^T x_i$$

For the 2-class (0 or 1) case, this is:

$$y_i = \operatorname{argmax}(w_0^T x_i,\; w_1^T x_i)$$

We can equivalently write this as:

$$y_i = 1 \iff w_1^T x_i > w_0^T x_i$$

Rearranging our condition we get:

$$w_1^T x_i - w_0^T x_i > 0$$
$$(w_1 - w_0)^T x_i > 0$$

Note that this is equivalent to a decision function of the form:

$$y_i = 1 \iff w^T x_i > 0,$$

which is a linear perceptron. Therefore we see that our 2-class multiclass SVM with parameters $w_0$ and $w_1$ is equivalent to a perceptron with parameter:

$$w = w_1 - w_0$$
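As an aside (not part of the original solution), this equivalence is easy to confirm numerically. The sketch below uses made-up variable names and random data purely for illustration:

% Sketch: verify that argmax([w0 w1]' * x) agrees with the perceptron
% rule (w1 - w0)' * x > 0 on random data. All names are illustrative.
d = 5; N = 1000;
w0 = randn(d, 1); w1 = randn(d, 1);
X  = randn(d, N);
[~, cls] = max([w0 w1]' * X, [], 1);   % multiclass rule: index 1 or 2
svmlabel  = (cls == 2);                % "class 1" in the 0/1 notation
perclabel = ((w1 - w0)' * X > 0);      % perceptron with w = w1 - w0
all(svmlabel == perclabel)             % should print 1 (true)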

2.2 Part b

Now we consider the generalized case, where again our decision function is defined as:

$$y_i = \operatorname{argmax}_y\, w_y^T x_i$$

Consider the decision boundary around a particular class $k$ from our set of $K$ classes. We see that:

$$y_i = k \iff w_k^T x_i > w_j^T x_i \quad \forall j \neq k$$

Equivalently:

$$(w_k - w_j)^T x_i > 0 \quad \forall j \neq k$$

We see then that the decision region for class $k$ is formed by the intersection of the positive regions of $K - 1$ implicitly defined linear perceptrons, where each separates $k$ from another class $j$. Therefore the boundary around class $k$ is formed by the intersection of up to $K - 1$ linear decision boundaries. Accounting for all of the classes, the multiclass SVM's decision function is defined by $\binom{K}{2}$ implicit perceptrons. So all of the boundaries defined by the model will have a piecewise linear form. For the $K = 3$ case, we can illustrate the decision boundaries like this:

[Figure: piecewise linear decision boundaries for the K = 3 case]

3 Problem 3

See the svmobj function in the code section.
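As a companion to the K = 3 illustration described in Part b above (the original figure is not reproduced here), the following minimal MATLAB sketch, with arbitrary example weights, colors the plane by argmax_y w_y^T x and shows the piecewise linear boundaries:

% Sketch: visualize argmax decision regions for K = 3 classes in 2-D
% (with a bias term). The weights below are arbitrary examples.
W = [ 1.0 -0.5  0.2;     % each column is w_k = [w_k1; w_k2; bias]
      0.3  1.0 -0.8;
     -0.2  0.1  0.4];
[g1, g2] = meshgrid(linspace(-3, 3, 400));
X = [g1(:)'; g2(:)'; ones(1, numel(g1))];   % 3 x (# grid points)
[~, region] = max(W' * X, [], 1);           % class with the largest score
figure
imagesc(linspace(-3, 3, 400), linspace(-3, 3, 400), reshape(region, size(g1)));
axis xy equal tight
title('argmax_y w_y^T x decision regions (K = 3)')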

4 Problem 4

4.1 Part a

See the gradopt function in the code section.

4.2 Part b

See the code section for the implementation. The plot resulting from training with C = 0.01 is shown below.

[Figure: SVM objective vs. gradient descent iterations for C = 0.01]

4.3 Part c

Training and testing with 7 different C values, we get the following results:

Test accuracy for each C value:

    C        Accuracy (%)
    0.0001   74.86
    0.001    84.82
    0.01     87.28
    0.1      79.66
    1        79.44
    10       77.54
    100      78.22

The best accuracy (with C = 0.01) was 87.28%. The corresponding confusion matrix is shown below.

[Figure: confusion matrix for the best classifier (C = 0.01)]

4.4 Part d

Plot of the weight vector $w_k$ for each class:

[Figure: visualization of the weights $w_k$ for each of the 10 digit classes]

5 Code

5.1 Code

function [ X, Y ] = setupdata( varargin )
% Preprocess a dataset into an appropriate form for our SVM
% training/prediction code.
%   varargin: A set of (N' x d) matrices, where each matrix contains
%             all (N') observations for a given class.
%   X: Combined matrix of all observations (d x N)
%   Y: Vector of labels for each observation (1 x N)

% Concatenate the 'X' matrices into one training set
X = double(vertcat(varargin{:})');
% Add the extra (bias) dimension
X = [X ; ones(1, size(X, 2))];
% Create a vector with the label for each observation
Y = zeros(1, size(X, 2));
start = 1;
for label = 1:nargin
    numlabel = size(varargin{label}, 1);
    Y(start:(start + numlabel - 1)) = label;
    start = start + numlabel;
end
end
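A small usage sketch for setupdata (the toy matrices below are made up for illustration and are not part of the original experiments):

% Toy example: two classes with 3 observations each, of dimension 4.
A = rand(3, 4);            % class 1: N' = 3 observations, d = 4
B = rand(3, 4);            % class 2: N' = 3 observations
[X, Y] = setupdata(A, B);
size(X)                    % 5 x 6: (d + 1) x N, with a row of ones appended
Y                          % [1 1 1 2 2 2]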

function [ obj, grad ] = svmobj( w, C, X, Y )
% Computes the SVM objective and gradient.
%   w: Current model weights (d x k)
%   C: Regularization tradeoff hyperparameter
%   X: Input observations (d x N)
%   Y: Input labels (1 x N)
%   obj: Objective value
%   grad: Gradient matrix

% 'Score' of each class for each observation
wx = w' * X;
% Index of the true class' score for each obs.
yind = sub2ind(size(wx), Y, 1:size(wx, 2));
% Get the true class scores, then 'remove' them
wyx = wx(yind);
wx(yind) = -inf;
% Max scores after removing the true classes
[wyhatx, ~] = max(wx, [], 1);
% Loss for each observation
Loss = max(0, wyhatx + 1 - wyx);
% Objective as given in the notes
obj = 0.5 * (w(:)' * w(:)) + C * sum(Loss);
% If needed compute the gradient
if nargout > 1
    % Create a (# classes x # obs.) matrix, where entry (i,j) indicates
    % how obs. 'j' should contribute to the gradient for class 'i'
    % [bsxfun applies a given operation row-by-row]
    cm = double(bsxfun(@eq, wx, wyhatx));
    cm(yind) = -1;
    cm = bsxfun(@times, cm, Loss > 0);
    % Compute grad by multiplying our contribution matrix by our obs.
    % matrix and adding the regularization term
    grad = w + C * (cm * X')';
end
end
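The analytic gradient returned by svmobj can be validated against central finite differences. The following sketch is illustrative only (random data, arbitrary sizes and tolerance) and was not part of the original solution:

% Sketch: check the analytic gradient from svmobj against central
% finite differences on a small random problem.
d = 6; k = 3; N = 20; C = 1;
X = randn(d, N);
Y = randi(k, 1, N);
w = randn(d, k);
[~, grad] = svmobj(w, C, X, Y);
numgrad = zeros(size(w));
h = 1e-5;
for i = 1:numel(w)
    e = zeros(size(w)); e(i) = h;
    numgrad(i) = (svmobj(w + e, C, X, Y) - svmobj(w - e, C, X, Y)) / (2 * h);
end
max(abs(grad(:) - numgrad(:)))   % should be small (e.g. < 1e-6) away from hinge kinks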

function [ w, traceout ] = gradopt(func, w0, r0, T, varargin)
% Generic gradient descent with line search.
%   func: Function to compute objective and gradient
%   w0: Initial values of parameters to optimize
%   r0: Base learning rate
%   T: # of iterations for gradient descent
%   varargin: Additional arguments passed to 'func'
%   w: Final trained weights
%   traceout: Objective value at each iteration

% Initialize the weights and objective trace
trace = zeros(T, 1);
w = w0;
% Run for T iterations
for iter = 1:T
    [obj, grad] = func(w, varargin{:});
    trace(iter) = obj;
    % Backtracking line search: halve the step until the objective improves
    r = r0;
    for lsiter = 1:20
        wnew = w - r * grad;
        r = r * 0.5;
        if func(wnew, varargin{:}) < obj
            w = wnew;
            break;
        end
    end
end
if nargout > 1
    traceout = trace;
end
end
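For intuition, gradopt can also be exercised on a simple quadratic before plugging in svmobj. The helper below (quadobj, a made-up name saved in its own file) is a minimal sketch of the [obj, grad] interface that gradopt expects, not part of the homework code:

% quadobj.m (illustrative test objective, not from the original handout)
% Returns the objective and gradient of f(w) = 0.5 * ||w - 3||^2.
function [ obj, grad ] = quadobj( w )
obj  = 0.5 * sum((w - 3).^2);
grad = w - 3;
end

% Example call (illustrative): the minimizer is w = 3 in every coordinate.
% [wopt, trace] = gradopt(@quadobj, zeros(4, 1), 1, 25);
% plot(trace)   % objective decreasing toward 0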

function [ w, trace ] = svmtrain( X, Y, C, T, r )
% Trains a support vector machine given a dataset.
%   X: Input observations (d x N)
%   Y: Input labels (1 x N)
%   C: Regularization tradeoff hyperparameter
%   T: # of iterations for gradient descent
%   r: Base learning rate for gradient descent
%   w: Final trained model weights (d x k)
%   trace: Objective value at each iteration

% Defaults for the hyperparameters
if nargin < 5
    r = 0.5;
end
if nargin < 4
    T = 100;
end
if nargin < 3
    C = 1;
end
% Randomly initialize the model
nlabels = numel(unique(Y));
dim = size(X, 1);
w0 = randn(dim, nlabels);
% Run gradient descent (w/ or w/o the plot data)
if nargout == 1
    w = gradopt(@svmobj, w0, r, T, C, X, Y);
else
    [w, trace] = gradopt(@svmobj, w0, r, T, C, X, Y);
end
end

function [ Y ] = svmpredict( w, X )
% Makes predictions for a set of observations given a trained model.
%   w: Trained model weights (d x k)
%   X: Input observations (d x N)
%   Y: Output labels (1 x N)
wx = w' * X;
[~, Y] = max(wx, [], 1);
end

function [ acc ] = score( Ytrue, Ypred )
% Computes the accuracy of a set of predictions.
%   Ytrue: True labels of data (1 x N)
%   Ypred: Predicted labels of data (1 x N)
%   acc: Accuracy
acc = sum(double(Ytrue == Ypred)) / numel(Ytrue);
end

function [ mat ] = confusion( Ytrue, Ypred )
% Generate a confusion matrix for a set of predictions.
%   Ytrue: True labels of data (1 x N)
%   Ypred: Predicted labels of data (1 x N)
%   mat: Confusion matrix
labels = unique([Ytrue, Ypred]);
nlabels = numel(labels);
mat = zeros(nlabels);
for l1 = 1:nlabels
    for l2 = 1:nlabels
        mat(l1, l2) = sum(double(Ytrue == l1 & Ypred == l2));
    end
end
end

% hw5.m
% Runs all of the experiments for homework 5

% Load and combine the training and test data matrices
load digits.mat
[Xtrain, Ytrain] = setupdata(train0, train1, train2, train3, train4, ...
                             train5, train6, train7, train8, train9);
[Xtest, Ytest] = setupdata(test0, test1, test2, test3, test4, ...
                           test5, test6, test7, test8, test9);

% Variables to save our best result
bestc = 0;
besttrace = [];
bestpred = [];
bestw = [];
bestacc = 0;

% Train and evaluate the SVM for a range of C values
% (the seven values reported in Problem 4c)
for C = [0.0001, 0.001, 0.01, 0.1, 1, 10, 100]
    [w, trace] = svmtrain(Xtrain, Ytrain, C);
    Ypred = svmpredict(w, Xtest);
    acc = score(Ytest, Ypred);
    % Save the best result
    if acc > bestacc
        bestc = C;
        besttrace = trace;
        bestpred = Ypred;
        bestw = w;
        bestacc = acc;
    end
end

% Report the test accuracy and confusion matrix
% (Alternatively, use Matlab's plotconfusion func.)
bestacc
confusionmat = confusion(Ytest, bestpred)

% Plot the convergence of gradient descent
figure
plot(besttrace)
title('Objective vs. Iterations')
xlabel('Gradient descent iterations')

ylabel('SVM objective')

% Visualize the weights for the best classifier
figure
title('Best Classifier Visualization')
for digit = 1:10
    % Drop the bias dimension before reshaping to a 28 x 28 image
    img = bestw(1:(end - 1), digit);
    subplot(2, 5, digit);
    imagesc(reshape(img, 28, 28)');
    axis off
end