HOMEWORK 4: SVMS AND KERNELS

CMU 10601: MACHINE LEARNING (FALL 2016)
OUT: Sep. 26, 2016
DUE: 5:30 pm, Oct. 05, 2016
TAs: Simon Shaolei Du, Tianshu Ren, Hsiao-Yu Fish Tung

Instructions

Homework Submission: Submit your answers and results to Gradescope. You will use Autolab to submit your code for Problem 2.2. For Gradescope, you will need to specify which pages go with which question. We provide a LaTeX template which you can use to type up your solutions. Please check Piazza for updates about the homework.

Collaboration policy: The purpose of student collaboration is to facilitate learning, not to circumvent it. Studying the material in groups is strongly encouraged. It is also allowed to seek help from other students in understanding the material needed to solve a particular homework problem, provided no written notes (including code) are shared or taken at that time, and provided learning is facilitated, not circumvented. The actual solution must be done by each student alone. The presence or absence of any form of help or collaboration, whether given or received, must be explicitly stated and disclosed in full by all involved. Please refer to the course webpage.

1 Support Vector Machine (50 pts + 10 pts Extra Credit)

Suppose we have the following data D = (X, y), where X ∈ R^{d×n}, the i-th column x_i contains the features of the i-th training sample, and y_i is the label of the i-th training sample. We have y ∈ {−1, +1}^n if this is a classification problem and y ∈ R^n if this is a regression problem.

1.1 SVM for Classification (Hard Margin)

In this problem, you will derive the SVM algorithm from the large-margin principle. We use the linear discriminant function f(x) = w^T x for classification: we classify x into class −1 if f(x) < 0 and into class +1 otherwise. If the data is linearly separable and f is the target function, then y f(x) > 0. This fact allows us to define

\gamma = \frac{y f(x)}{\|w\|_2}

as the distance of x from the decision boundary, i.e., the margin.

1. (20 pts) We would like to make this margin γ as large as possible, i.e., maximize the perpendicular distance to the closest point. Thus our objective function becomes

\max_{w} \min_{i=1,\dots,n} \frac{y_i f(x_i)}{\|w\|_2}.    (1)

(Think about why we use this function.) Show that (1) is equivalent to the following problem (Hint: does it matter if we rescale w → kw?):

\min_{w} \frac{1}{2}\|w\|_2^2 \quad \text{such that} \quad y_i (w^T x_i) \ge 1, \quad i = 1, \dots, n.

2. (10 pts) In the linearly separable case, if one of the training samples is removed, will the decision boundary shift toward the removed point, shift away from it, or remain the same? Justify your answer.

1.2 SVM for Classification (Soft Margin)

Recall from the lecture notes that if we allow some misclassification in the training data, the primal optimization problem of the SVM is given by

\min_{w, \xi} \; \frac{1}{2}\|w\|_2^2 + C \sum_{i=1}^{n} \xi_i
\text{subject to} \quad y_i (w^T x_i) \ge 1 - \xi_i, \quad i = 1, \dots, n
\phantom{\text{subject to}} \quad \xi_i \ge 0, \quad i = 1, \dots, n.

Recall from the lecture notes that ξ_1, ..., ξ_n are called slack variables. The optimal slack variables have an intuitive geometric interpretation, as shown in Figure 1. Basically, when ξ_i = 0, the corresponding feature vector x_i is correctly classified and will either lie on the margin of the separator or on the correct side of the margin. A feature vector with 0 < ξ_i ≤ 1 lies within the margin but is still correctly classified. When ξ_i > 1, the corresponding feature vector is misclassified. Support vectors correspond to the instances with ξ_i > 0 or instances that lie on the margin.

1. (5 pts) Suppose the optimal ξ_1, ..., ξ_n have been computed. Use the ξ_i to obtain an upper bound on the number of misclassified instances.

2. (5 pts) In the primal optimization problem of the SVM, what is the role of the coefficient C? Briefly explain your answer by considering two extreme cases, i.e., C → 0 and C → ∞.
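The soft-margin primal above is a quadratic program in the stacked variable [w; ξ]. As an illustration of the formulation only (not part of the required submission), here is a minimal MATLAB sketch that hands it to quadprog from the Optimization Toolbox; the names Xtrain, ytrain, and C are placeholders assumed for this example.

    % Illustrative: solve min 1/2*||w||^2 + C*sum(xi)
    % s.t. y_i*(w'*x_i) >= 1 - xi_i and xi_i >= 0, over z = [w; xi].
    [d, n] = size(Xtrain);                   % Xtrain is d x n, ytrain is n x 1 in {-1,+1}
    H = blkdiag(eye(d), zeros(n));           % quadratic term acts on w only
    f = [zeros(d, 1); C * ones(n, 1)];       % linear term: C times the sum of slacks
    A = [-diag(ytrain) * Xtrain', -eye(n)];  % encodes -y_i*(w'*x_i) - xi_i <= -1
    b = -ones(n, 1);
    lb = [-inf(d, 1); zeros(n, 1)];          % w unconstrained, slacks nonnegative
    z = quadprog(H, f, A, b, [], [], lb, []);
    w = z(1:d); xi = z(d+1:end);

Re-running such a sketch while varying C is one concrete way to observe the two extreme regimes asked about in question 2.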

Figure 1: The relationship between the optimal slack variables and the optimal linear separator in the feature space. Support vectors are surrounded with circles.

1.3 SVM for Regression (10 pts Extra Credit)

Let x ∈ R^d be the feature vector and y ∈ R be the label. In this question, we use a linear predictor for the label, i.e., given the feature vector, we predict the label by ŷ(x) = w^T x, where w ∈ R^d is the linear coefficient vector. In this question, we consider the epsilon-insensitive loss function, defined by

L_\epsilon(y, \hat{y}) = \begin{cases} 0 & \text{if } |y - \hat{y}| < \epsilon \\ |y - \hat{y}| - \epsilon & \text{otherwise,} \end{cases}

where ε is a tuning parameter. To obtain a good w, we would like to solve the following optimization problem:

J(w) = \frac{1}{n} \sum_{i=1}^{n} L_\epsilon(y_i, \hat{y}(x_i)) + \lambda \|w\|_2^2.    (2)

1. (2 pts) What kind of loss is L_ε if ε = 0?

2. (3 pts) When is it beneficial to use L_ε with ε > 0? Give a real-world example. (Hint: think about the name of this loss function.)

3. (5 pts) Notice that (2) is not differentiable. Show how to convert (2) into an optimization problem whose objective is differentiable and whose constraints are linear by introducing slack variables.

2 Kernels (50 pts)

2.1 Kernel Basics (10 pts)

1. (5 pts) Suppose the input space is three-dimensional, x = (x_1, x_2, x_3)^T. Define the feature mapping

\phi(x) = (x_1^2, \; x_2^2, \; x_3^2, \; \sqrt{2}\,x_1 x_2, \; \sqrt{2}\,x_1 x_3, \; \sqrt{2}\,x_2 x_3)^T.

What is the corresponding kernel function, i.e., K(x, z)? Express your answer in terms of x_1, x_2, x_3, z_1, z_2, z_3.

2. (5 pts) With the same feature map as in the previous question, suppose we want to compute the value of the kernel function K(x, z) on two vectors x, z ∈ R^3. How many additions and multiplications are needed if you
(a) map the input vectors to the feature space and then perform the dot product on the mapped features?
(b) compute it through the kernel function you derived in the previous question?
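A quick numerical sanity check for this kind of derivation (illustrative only): the closed-form kernel should agree with the explicit feature-space dot product on random inputs. Here myKernel is a hypothetical placeholder standing for whatever closed form you derive in question 1.

    % Sanity check: phi(x)' * phi(z) should equal the closed-form K(x, z).
    phi = @(x) [x(1)^2; x(2)^2; x(3)^2; sqrt(2)*x(1)*x(2); sqrt(2)*x(1)*x(3); sqrt(2)*x(2)*x(3)];
    x = randn(3, 1); z = randn(3, 1);
    lhs = phi(x)' * phi(z);          % explicit feature-space dot product
    rhs = myKernel(x, z);            % placeholder for your derived K(x, z)
    assert(abs(lhs - rhs) < 1e-10);  % should agree up to floating-point error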

2.2 Programming: Kernel Perceptron (40 pts)

In this section you will implement the kernel perceptron algorithm to classify between the handwritten digits 4 and 7 of the MNIST dataset. You will also compare the kernel perceptron with the linear perceptron on a synthetic dataset. For your convenience, we have provided full code for the linear perceptron in perceptron_run.m and starter code for the kernel perceptron in the handout, which you may download from Autolab. You may also test the correctness of your program with the offline dataset we provided; however, you will be evaluated on a separate dataset on Autolab. For questions 1 - 4 below, you are expected to experiment on perceptron_train.mat and perceptron_test.mat. For question 5, you are expected to experiment on synthetic_train.mat and synthetic_test.mat.

To submit the code, make sure you put all your files (excluding data) in a folder called hw4 and run the following command in its top directory: tar -cvf hw4.tar hw4. Then submit the tarball created to Autolab.

In order to create faster code, we approach the kernel perceptron by first building a kernel matrix K and then querying that kernel matrix every time we are interested in using some kernel evaluation k(x_i, x_j). The kernel matrix K stores the evaluations of the kernel between X ∈ R^{d×n} and X2 ∈ R^{d×m}, where X is usually the training set and X2 the testing set. Thus, K ∈ R^{n×m} has dimensions n × m.

Recall the kernel perceptron algorithm we learned in class, for x_i ∈ R^d and y_i ∈ {−1, +1}:

- Initialize the counting vector α ∈ R^n.
- Iterate until convergence:
  - For i = 1, ..., n:
    * ŷ_i = ϕ( Σ_{r=1}^{n} α_r y_r K_{i,r} )
    * If ŷ_i ≠ y_i, then α_i = α_i + 1,

where

\varphi(z) = \begin{cases} +1 & \text{if } z \ge 0 \\ -1 & \text{otherwise.} \end{cases}

Note that in the code we use a instead of α.

1. (5 pts) Complete polykernel.m: the function K = polykernel(X, X2) computes a kernel matrix of a non-homogeneous polynomial kernel of degree 3 with an offset of 1. The kernel matrix should be constructed such that if X ∈ R^{d×n} and X2 ∈ R^{d×m}, then K ∈ R^{n×m}. The polynomial kernel of degree 3 with an offset of 1 is defined as

k(x_i, x_j) = (x_i^T x_j + 1)^3.

2. (10 pts) Complete kernel_perceptron_pred.m: the function kernel_perceptron_pred(a, Y, K, i) takes in a counting vector a, class labels Y, the kernel matrix K, and the index i of the sample we are interested in evaluating. The output should be the predicted label for x_i.

3. (10 pts) Complete kernel_perceptron.m: the function w = kernel_perceptron(a0, X, Y) runs the kernel perceptron training algorithm, taking in an initial counting vector a0, a matrix of features X, and a vector of labels Y. It outputs a learned weight vector.

4. (10 pts) Run kernel_perceptron_run.m on perceptron_train.mat and perceptron_test.mat to perform training and see the performance on the training and test sets. Report your accuracy on both datasets.

5. (5 pts) We have provided you with linear perceptron code and another dataset: synthetic_train.mat and synthetic_test.mat. Please run perceptron_run.m and report the accuracy of the linear perceptron. Next, test your kernel perceptron on this new dataset. Compare the classification accuracy of the two methods. Which one is better? Why is this the case?
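To make the conventions above concrete, here is a minimal MATLAB sketch of the kernel-matrix construction and of the prediction rule exactly as written in the pseudocode. It is only an illustration under the stated conventions (X is d×n, X2 is d×m, a and Y are n×1, i is a sample index), not a drop-in replacement for the starter code.

    % Degree-3 polynomial kernel matrix with offset 1: entry (i,j) is (x_i' * x2_j + 1)^3.
    % X is d x n and X2 is d x m, so K is n x m as required above.
    K = (X' * X2 + 1).^3;

    % Prediction for sample i, following the pseudocode: phi of the alpha- and
    % label-weighted kernel row.
    s = sum(a .* Y .* K(i, :)');                   % sum_r a_r * y_r * K(i, r)
    if s >= 0, yhat_i = 1; else yhat_i = -1; end   % phi(z) from the definition above

Looping this prediction over i = 1, ..., n and incrementing a(i) whenever yhat_i differs from Y(i) gives one pass of the training loop described above.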

3 Perceptron and Subgradient (10 pts Extra Credit)

In this problem you will find the connection between the Perceptron algorithm and (stochastic) subgradient descent. First, please look at the following two definitions.

Definition 1 (Hinge Loss). For y, y' ∈ R, the hinge loss is defined by

L(y, y') = \max(0, -y y').

Definition 2 (Subgradient). Given a function F(x), which may be non-differentiable, v(x) is a subgradient of F(x) if and only if, for any x', the following holds:

F(x') \ge F(x) + v(x)^T (x' - x).

1. (5 pts) Let (x, y) be a sample from the training data, where x ∈ R^d and y ∈ R. Define F(w) = L(y, w^T x). What is the subgradient of F(w)? (Hint: there are 3 cases.)

2. (5 pts) For a given function F(w) = \frac{1}{n} \sum_{i=1}^{n} F_i(w), the Cyclic Subgradient Descent algorithm is the following procedure:

- Initialize the variable w ∈ R^d.
- Iterate until convergence:
  - For i = 1, ..., n:
    * Find a subgradient v_i(w) of F_i(w).
    * w ← w − η v_i(w), where η ∈ R_+ is the step size.

Show why the perceptron is a special case of the Cyclic Subgradient Descent algorithm.
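As a concrete rendering of the procedure just listed (illustrative only; d, n, max_epochs, eta, and the cell array subgrads of per-term subgradient oracles are all assumed placeholders):

    % Cyclic subgradient descent: repeatedly cycle through i = 1..n and step
    % along a subgradient of F_i at the current iterate w.
    w = zeros(d, 1);                 % initialize w in R^d
    for epoch = 1:max_epochs         % stands in for "iterate until convergence"
        for i = 1:n
            v = subgrads{i}(w);      % a subgradient of F_i at w
            w = w - eta * v;         % step of size eta > 0
        end
    end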