Lecture 11. Linear Soft Margin Support Vector Machines

CS142: Machine Learning, Spring 2017
Lecture 11
Instructor: Pedro Felzenszwalb
Scribes: Dan Xiang, Tyler Dae Devlin

Linear Soft Margin Support Vector Machines

We continue our discussion of linear soft margin support vector machines. But first let's break down that phrase a little bit. Linear because our classifier takes a linear combination of the input features and outputs the sign of the result. Soft margin because we allow some points to be located within the margin or even on the wrong side of the decision boundary; this was the purpose of introducing the ε_i terms. Support vector machine because the final decision boundary depends only on a subset of the training data known as the support vectors.

Suppose we have a training set T = {(x_1, y_1), ..., (x_N, y_N)} with x_i ∈ R^D and y_i ∈ {-1, +1}. Recall that the goal is to choose w so as to minimize the objective function F : R^D → R defined by

    F(w) = \frac{1}{2}\|w\|^2 + C \sum_{i=1}^{N} \max\{0, 1 - y_i w^T x_i\}.    (1)

The first term above is motivated by regularization considerations and the second term is the hinge loss training error. Speaking of which, recall that the hinge loss H is given by

    H(w, x, y) = \max\{0, 1 - y w^T x\}.

To minimize this loss, we would like to make y w^T x large, which happens when points are far off on the correct side of the decision boundary.

Definition: A function F is called convex if for any w_1, w_2 ∈ R^D and α ∈ (0, 1), we have

    F(\alpha w_1 + (1 - \alpha) w_2) \le \alpha F(w_1) + (1 - \alpha) F(w_2).

If F is convex, then any local minimum of F is a global minimum. The sum of two convex functions is also convex. The maximum of two convex functions is also convex. From these facts it follows that the objective function F given by (1) is convex in w.
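As a quick sanity check on equation (1), here is a minimal NumPy sketch of the hinge loss and the objective. The function and variable names (hinge_loss, objective, X, y, C) are illustrative and not from the notes; X is assumed to be an N x D matrix of inputs and y a vector of ±1 labels.

```python
import numpy as np

def hinge_loss(w, x, y):
    """Hinge loss for a single example: H(w, x, y) = max{0, 1 - y w^T x}."""
    return max(0.0, 1.0 - y * np.dot(w, x))

def objective(w, X, y, C):
    """Soft margin objective (1): (1/2)||w||^2 + C sum_i max{0, 1 - y_i w^T x_i}."""
    margins = y * (X @ w)                    # y_i w^T x_i for every training example
    hinge = np.maximum(0.0, 1.0 - margins)   # elementwise hinge loss
    return 0.5 * np.dot(w, w) + C * hinge.sum()
```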

Gradient descent

Gradient descent is a procedure that allows one to move from some starting point on a function to a nearby local minimum. The algorithm formalizes the notion of what it means to roll downhill. For convex functions, gradient descent will find the global minimum, since in this case a local minimum is a global minimum.

Algorithm 1 Gradient Descent
Precondition: a function F(w), a step size r, and a stopping threshold ε.
1: function GradientDescent(F, r, ε)
2:     Initialize w ← 0
3:     while the difference in the values of F on successive iterations is > ε do
4:         Set g ← ∇F(w)
5:         Update w ← w - r g
6:     return w

Although the algorithm above initializes w to 0, in practice it is often better to choose the initial w randomly (say, multivariate normal with mean 0) so as to avoid complications produced by any symmetries the function might have at the origin.

Line search procedure

If our step size r is too large, there is a possibility that we might overshoot the minimum of F. Consequently, it makes sense to adjust r whenever this may be the case. The following line search procedure selects r in such a way that we are guaranteed to move lower down on the function in the next step.

Algorithm 2 Line Search
Precondition: a function F(w), a step direction g, and the current location w_0.
1: function LineSearch(F, g, w_0)
2:     Initialize r ← 1/2
3:     while F(w_0 - r g) > F(w_0) do
4:         Update r ← r/2
5:     return r

The line search subroutine would be called between lines 4 and 5 of the gradient descent algorithm, i.e. after the step direction is chosen but before the step is taken.
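The two procedures translate almost directly into code. Below is a hedged sketch under the conventions above; because the line search chooses the step size, the fixed r argument of Algorithm 1 is dropped, and the dimension D, the iteration caps, and the function names are my additions rather than part of the notes.

```python
import numpy as np

def line_search(F, g, w0, max_halvings=50):
    """Algorithm 2: start at r = 1/2 and halve until the step decreases F.
    The cap on halvings is a numerical safeguard added here."""
    r = 0.5
    for _ in range(max_halvings):
        if F(w0 - r * g) <= F(w0):
            break
        r /= 2.0
    return r

def gradient_descent(F, grad_F, D, eps=1e-6, max_iters=10_000):
    """Algorithm 1, with the line search called after computing the direction g
    (between lines 4 and 5). Stops when F changes by at most eps."""
    w = np.zeros(D)          # the notes suggest a random init may work better
    prev = F(w)
    for _ in range(max_iters):
        g = grad_F(w)
        r = line_search(F, g, w)
        w = w - r * g
        cur = F(w)
        if abs(prev - cur) <= eps:
            break
        prev = cur
    return w
```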

Computing the gradient of F

In order to compute the gradient of F, we need to compute the partial derivatives of the hinge loss H(w, x, y) = max{0, 1 - y w^T x} with respect to each w_j. We have

    \frac{\partial}{\partial w_j} H(w, x, y) = \begin{cases} 0 & \text{if } y w^T x \ge 1, \\ -y x_j & \text{if } y w^T x < 1. \end{cases}

Note that the derivative is not defined when y w^T x = 1, so we adopt the convention that the derivative is 0 in this case. The gradient of H is then

    \nabla H(w, x, y) = \begin{cases} 0 & \text{if } y w^T x \ge 1, \\ -y x & \text{if } y w^T x < 1. \end{cases}    (2)

Next, observe that the gradient of the regularization term is

    \nabla \left( \frac{1}{2}\|w\|^2 \right) = w,    (3)

since \|w\|^2 = \sum_{i=1}^{D} w_i^2. Combining equations (2) and (3) with the linearity of the gradient operator, we have

    \nabla F(w) = w - C \sum_{i \in E} y_i x_i,

where E = {i : y_i w^T x_i < 1} is the set of indices of the training examples that are misclassified or within the margin.

Gradient descent update

Returning to the gradient descent algorithm, we see that the update rule for the weight vector w (Algorithm 1, line 5) is

    w \leftarrow w - r\left(w - C \sum_{i \in E} y_i x_i\right) = (1 - r)w + rC \sum_{i \in E} y_i x_i.

Thus, the effect of the update step is to shrink (i.e. scale down) the vector w and add the misclassified or margin-violating example vectors to w.
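Under the same conventions as before (X an N x D matrix, y a vector of ±1 labels), here is a hedged NumPy sketch of this (sub)gradient and the update rule; the names svm_gradient and gd_update are illustrative, not from the notes.

```python
import numpy as np

def svm_gradient(w, X, y, C):
    """(Sub)gradient of F: w - C * sum_{i in E} y_i x_i, where
    E = {i : y_i w^T x_i < 1} indexes the misclassified or in-margin examples."""
    margins = y * (X @ w)
    E = margins < 1.0                                    # boolean mask for E
    return w - C * (y[E][:, None] * X[E]).sum(axis=0)    # zero correction if E is empty

def gd_update(w, X, y, C, r):
    """One step of Algorithm 1, line 5: w <- (1 - r) w + r C sum_{i in E} y_i x_i."""
    return w - r * svm_gradient(w, X, y, C)
```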

Multiclass support vector machines

Up until now we have restricted our attention to binary classifiers, i.e. Y = {-1, +1}. Now we explore how support vector machines can be extended to handle K-class classification, where K ≥ 2.

Let T = {(x_1, y_1), ..., (x_N, y_N)} be a training set with each x_i ∈ R^D and y_i ∈ {1, ..., K}. For each class j = 1, ..., K, we define a weight vector w_j. We can think of the inner product w_j^T x as the score of class j for example x. We define our classifier y : R^D → {1, ..., K} by picking the class that has the maximum score for a given point, i.e.

    y(x) = \arg\max_j w_j^T x.

As in the two-class setting, we would like to pick the weight vectors in such a way that the decision boundaries are accompanied by margins. This can be accomplished by imposing the constraint

    w_{y_i}^T x_i > \max_{j \ne y_i} (w_j^T x_i) + 1

for each training point (x_i, y_i) ∈ T. In words, this says that the score of the correct class should be bigger than the score of any other class by a margin of 1. We can now define an objective F that maps a set of weight vectors to a nonnegative real number:

    F(w_1, ..., w_K) = \frac{1}{2} \sum_{j=1}^{K} \|w_j\|^2 + \sum_{i=1}^{N} \max\left\{0, \max_{j \ne y_i}(w_j^T x_i) + 1 - w_{y_i}^T x_i\right\}.

The terms in the second sum are called the multiclass hinge loss. The set of weights that minimize this objective function defines the multiclass soft margin linear SVM.
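A small NumPy sketch of the multiclass scorer and objective follows, under the assumption that the classes are re-indexed 0, ..., K-1 and that the rows of a matrix W hold the vectors w_j; the names predict and multiclass_objective are illustrative.

```python
import numpy as np

def predict(W, x):
    """y(x) = argmax_j w_j^T x, with W of shape (K, D) holding one w_j per row."""
    return int(np.argmax(W @ x))

def multiclass_objective(W, X, y):
    """(1/2) sum_j ||w_j||^2 + sum_i max{0, max_{j != y_i} w_j^T x_i + 1 - w_{y_i}^T x_i},
    where y holds integer class indices in 0..K-1."""
    N = X.shape[0]
    scores = X @ W.T                      # (N, K): score of every class for every example
    correct = scores[np.arange(N), y]     # w_{y_i}^T x_i
    others = scores.copy()
    others[np.arange(N), y] = -np.inf     # exclude the correct class from the max
    losses = np.maximum(0.0, others.max(axis=1) + 1.0 - correct)
    return 0.5 * np.sum(W * W) + losses.sum()
```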

How naïve Bayes can fail

What makes the naïve Bayes approach to classification so naïve is its strong (often unwarranted) assumption of conditional independence. Recall that a naïve Bayes classifier assumes that features (such as the length and weight of a fish) are independent conditioned on the class (the type of fish). It is not difficult to come up with examples for which a naïve Bayes classifier fails badly.

Indeed, let (x_1, x_2) be the length and weight of a fish. The assumption of conditional independence gives

    P(x_1, x_2 | y) = P(x_1 | y) P(x_2 | y).

It is not unreasonable to expect naïve Bayes to perform decently well in this scenario. But now suppose we redefine our feature vector by duplicating the weight x_2 999 times over, so that the new feature vector is (x_1, x_2, x_3, x_4, ..., x_1000), where x_1 is the length and x_2 = x_3 = ... = x_1000 are all equal to the weight. Of course, this feature vector contains no additional information compared to the original vector (x_1, x_2), and in principle we would like our classifier to remain largely unaffected. Nevertheless, the naïve Bayes model is now

    P(x | y) = P(x_1 | y) P(x_2 | y)^{999}.

Thus, our classification decision will be skewed in a way that almost entirely neglects the value of x_1, since the term P(x_2 | y)^{999} now dominates the posterior probability.
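To make this concrete, here is a tiny numeric illustration with made-up class-conditional probabilities; the numbers, class names, and the log_posterior function are hypothetical, chosen only so that the length strongly favors one class while the weight mildly favors the other. Working in log space, duplicating x_2 simply multiplies its log-likelihood by the number of copies.

```python
import numpy as np

# Made-up likelihoods for one test fish: the length x1 strongly favors class "A",
# while the weight x2 only mildly favors class "B".
prior = {"A": 0.5, "B": 0.5}
p_x1 = {"A": 0.9, "B": 0.1}   # P(x1 | y)
p_x2 = {"A": 0.4, "B": 0.6}   # P(x2 | y)

def log_posterior(copies_of_x2):
    """Unnormalized log posterior: log P(y) + log P(x1|y) + copies * log P(x2|y)."""
    return {y: np.log(prior[y]) + np.log(p_x1[y]) + copies_of_x2 * np.log(p_x2[y])
            for y in ("A", "B")}

for copies in (1, 999):
    lp = log_posterior(copies)
    print(copies, max(lp, key=lp.get))
# 1   -> A   (the length is allowed to matter)
# 999 -> B   (the duplicated weight term drowns out the length)
```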