Lecture 16: Modern Classification (I) - Separating Hyperplanes


Outline: 1. Separating Hyperplane 2. Binary SVM for Separable Case

Bayes Rule for Binary Problems Consider the simplest case: the two classes are perfectly linearly separable. Linear classifiers construct a linear decision boundary that separates the data into different classes as much as possible. Under equal misclassification costs, the Bayes rule is φ_B(x) = 1 if P(Y = 1 | X = x) > 0.5, and φ_B(x) = 0 if P(Y = 1 | X = x) < P(Y = 0 | X = x). We would like to mimic the Bayes rule.

Soft classifiers estimate the class (posterior) probabilities first and then make a decision: (1) first estimate the conditional probability P(Y = +1 | X = x); (2) then construct the decision rule as P̂(Y = +1 | x) > 0.5. Examples: LDA, logistic regression, kNN methods. Hard classifiers do not estimate the class probabilities. Instead, they target the Bayes rule directly, estimating I[P(Y = +1 | x) > 0.5] directly. Example: SVM.
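To make the soft/hard distinction concrete, here is a minimal sketch assuming scikit-learn and a made-up toy dataset (none of the names or data below come from the lecture): logistic regression exposes estimated posterior probabilities that are then thresholded, while a linear SVM returns only the decision.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC

# Toy data: two (mostly) well-separated clusters in R^2, purely illustrative
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(-2, 1, (20, 2)), rng.normal(2, 1, (20, 2))])
y = np.array([-1] * 20 + [1] * 20)

# Soft classifier: estimates P(Y = +1 | X = x), then thresholds at 0.5
soft = LogisticRegression().fit(X, y)
print(soft.predict_proba(X[:3]))   # estimated class probabilities
print(soft.predict(X[:3]))         # labels obtained by thresholding

# Hard classifier: targets the decision rule directly, no probabilities
hard = SVC(kernel="linear").fit(X, y)
print(hard.predict(X[:3]))         # labels only
```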

Separating Hyperplanes Construct a linear decision boundary that explicitly tries to separate the data into different classes as well as possible. Good separation is defined mathematically in a specific form. Even when the training data can be perfectly separated by hyperplanes, LDA and other linear methods developed under a statistical framework may not achieve perfect separation. For convenience, we label the two classes as +1 and −1 from now on.

[Figure (Elements of Statistical Learning, Hastie, Tibshirani & Friedman 2001, Figure 4.14): A toy example with two classes separable by a hyperplane. The orange line is the least squares solution, which misclassifies one of the training points. Also shown are two blue separating hyperplanes found by the perceptron learning algorithm with different random starts.]

Review of Vector Algebra Given two vectors a and b in R^d, the vector projection of a onto b is the orthogonal projection of a onto a straight line parallel to b: it equals a_1 b̂, where b̂ = b/‖b‖ is the unit vector in the direction of b and a_1 = ‖a‖ cos θ = a^T b̂ is the scalar projection of a onto b (‖a‖ is the length of a and θ is the angle between a and b). The vector projection a_1 b̂ is a vector parallel to b; the scalar projection a_1 is a scalar, which can be positive or negative. If a_1 = 0, we say a is perpendicular to b. The absolute value of the scalar projection equals the length of the vector projection, with a minus sign if the direction of the projection is opposite to the direction of b.
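A quick numerical illustration of these definitions (a sketch with NumPy; the two vectors are arbitrary):

```python
import numpy as np

a = np.array([3.0, 4.0])
b = np.array([1.0, 0.0])

b_unit = b / np.linalg.norm(b)        # unit vector in the direction of b
scalar_proj = a @ b_unit              # a_1 = ||a|| cos(theta), may be negative
vector_proj = scalar_proj * b_unit    # vector projection of a onto b

print(scalar_proj)    # 3.0
print(vector_proj)    # [3. 0.]
```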

Hyperplane and Its Normal Vector A hyperplane (or affine set) L is defined by the linear equation L = {x : β_0 + β^T x = 0}. A hyperplane in a d-dimensional space is a flat subset of dimension d − 1. If d = 2, the set L is a line. A hyperplane separates the space into two half-spaces. For any two points x_1 and x_2 in L, we have β^T(x_1 − x_2) = 0, so β is perpendicular to L. Define the unit normal vector of L as β* = β/‖β‖.

Signed Distance For any x_0 ∈ L, we have β^T x_0 = −β_0. For any x in R^d, its signed distance to L is the projection of x − x_0 onto the direction β* (the unit normal vector of L), where x_0 ∈ L: β*^T(x − x_0) = β^T(x − x_0)/‖β‖. The choice of x_0 ∈ L can be arbitrary. The signed distance can be positive or negative, and it can be used for classification: points with positive signed distance are assigned to the +1 class, and points with negative signed distance are assigned to the −1 class.
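As a sketch (NumPy, with an arbitrary made-up hyperplane β_0 + β^T x = 0), the signed distance of each point can be computed directly, and its sign gives the assigned class:

```python
import numpy as np

beta = np.array([2.0, 1.0])   # normal direction of L
beta0 = -4.0                  # L: beta0 + beta^T x = 0

X = np.array([[3.0, 1.0],     # a few query points
              [0.0, 0.0],
              [1.0, 3.0]])

signed_dist = (X @ beta + beta0) / np.linalg.norm(beta)
labels = np.where(signed_dist > 0, 1, -1)   # +1 on the positive side of L
print(signed_dist, labels)
```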

Linear Classification Boundary For convenience, define f(x) = β_0 + β^T x. From the previous slide, the signed distance of any point x to L is f(x)/‖β‖ = f(x)/‖∇f(x)‖, so f(x) is proportional to (and has the same sign as) the signed distance from x to L. If f(x) > 0, we classify x to the +1 class; if f(x) < 0, we classify x to the −1 class.

Functional Margin Note that if a point is correctly classified, then its signed distance has the same sign as its label, which implies y_i f(x_i) > 0. If a point is misclassified, then its signed distance has the opposite sign to its label, which implies y_i f(x_i) < 0. We call the product y_i f(x_i) the functional margin of the classifier f at (x_i, y_i).
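A quick numeric check of this sign rule (a NumPy sketch; the hyperplane and the two labeled points are made up):

```python
import numpy as np

# Hypothetical classifier f(x) = beta0 + beta^T x
beta, beta0 = np.array([1.0, -2.0]), 0.5
X = np.array([[2.0, 0.0], [0.0, 1.0]])   # two training points
y = np.array([1, -1])                     # their labels

f = X @ beta + beta0                      # decision values f(x_i)
margins = y * f                           # functional margins y_i f(x_i)
print(margins > 0)                        # True -> correctly classified
```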

[Figure (Elements of Statistical Learning, Hastie, Tibshirani & Friedman 2001, Figure 4.15): The linear algebra of a hyperplane (affine set).]

Rosenblatt's Perceptron Learning Algorithm The perceptron is the foundation for support vector machines (Rosenblatt 1958). Key idea: find a separating hyperplane by minimizing the distance of misclassified points to the decision boundary. Code the points from the two classes as y_i = +1, −1. If y_i = 1 is misclassified, then x_i^T β + β_0 < 0; if y_i = −1 is misclassified, then x_i^T β + β_0 > 0. Since the signed distance from x_i to L is (β^T x_i + β_0)/‖β‖, the distance from a misclassified x_i to the decision boundary is −y_i(β^T x_i + β_0)/‖β‖.

Objective Function for the Perceptron Define M to be the set of misclassified points, M = {(x_i, y_i) : y_i and f(x_i) have opposite signs}. The goal of the perceptron algorithm is to minimize D(β_0, β) = − Σ_{i∈M} y_i(x_i^T β + β_0) with respect to (β_0, β).

Stochastic Gradient Algorithm Assume M is fixed. To minimize D(β_0, β), compute the gradient of D with respect to (β_0, β): ∂D/∂β = − Σ_{i∈M} y_i x_i and ∂D/∂β_0 = − Σ_{i∈M} y_i. Stochastic gradient descent is used to minimize this piecewise-linear criterion: the adjustment of (β_0, β) is made after each misclassified point is visited.

Update Rule The stochastic gradient algorithm updates (β, β_0)_new = (β, β_0)_old + ρ (y_i x_i, y_i), where ρ is the learning rate; for example, we may take ρ = 1. Instead of computing the sum of the gradient contributions of all observations, a step is taken after each misclassified observation is visited. By contrast, the true (batch) gradient algorithm updates β_new = β_old − ρ ∇D(β_old).
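A minimal sketch of this update rule in plain NumPy (the function and variable names are illustrative, not from the lecture; points on the boundary are treated as mistakes):

```python
import numpy as np

def perceptron(X, y, rho=1.0, max_epochs=1000):
    """Stochastic-gradient perceptron: update after each misclassified point."""
    n, d = X.shape
    beta, beta0 = np.zeros(d), 0.0
    for _ in range(max_epochs):
        mistakes = 0
        for xi, yi in zip(X, y):
            if yi * (xi @ beta + beta0) <= 0:      # misclassified (or on boundary)
                beta = beta + rho * yi * xi        # beta   <- beta   + rho * y_i x_i
                beta0 = beta0 + rho * yi           # beta_0 <- beta_0 + rho * y_i
                mistakes += 1
        if mistakes == 0:                          # a full pass with no mistakes:
            return beta, beta0                     # a separating hyperplane is found
    return beta, beta0                             # may not converge if not separable
```

On linearly separable data the loop exits as soon as a full pass produces no mistakes, which corresponds to the finite-step convergence property stated on the next slide.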

Convergence of the Stochastic Gradient Algorithm Convergence property: if the classes are linearly separable, the algorithm converges to a separating hyperplane in a finite number of steps. Assume the training data are linearly separable and use ρ = 1 in the perceptron algorithm. Let β_opt be coefficients satisfying y_i β_opt^T x_i ≥ 1 for all i = 1, ..., n. Then ‖β_new − β_opt‖^2 ≤ ‖β_old − β_opt‖^2 − 1, so the algorithm converges in no more than ‖β_start − β_opt‖^2 steps. (Textbook Exercise 4.6)

About the Perceptron Learning Algorithm There are a number of problems with this algorithm: When the data are perfectly separable by a hyperplane, the solution is not unique; it depends on the starting values. The "finite number of steps" can be very many steps. When the data are not linearly separable, the algorithm does not converge, and the cycles can be long and therefore hard to detect. So we need to add additional constraints to the separating hyperplane and find a hyperplane that is optimal in some sense.

[Figure (Elements of Statistical Learning, Hastie, Tibshirani & Friedman 2001, Figure 4.14, repeated): Two separating hyperplanes found by the perceptron learning algorithm with different random starts.]

Optimal Separating Hyperplane SVM for Separable Case Main motivation: separate the two classes while maximizing the distance to the closest points from either class (Vapnik 1996). This provides a unique solution and leads to better classification performance on test (future) data.

SVM for Separable Case [Figure (Elements of Statistical Learning, Hastie, Tibshirani & Friedman 2001, Figure 4.16): The same data as in Figure 4.14. The shaded region delineates the maximum margin separating the two classes. There are three support points indicated, which lie on the boundary of the margin, and the optimal separating hyperplane (blue line) bisects the slab. Also included is the boundary found using logistic regression (red line), which is very close to the optimal separating hyperplane.]

Objective Function SVM for Separable Case Let f(x) = β_0 + x^T β. Consider the problem: max_{β_0, β, C} C subject to y_i(x_i^T β + β_0)/‖β‖ ≥ C, i = 1, ..., n. All the points are correctly classified, and each has a signed distance of at least C from the separating hyperplane. The maximal C must be positive (for separable data). The quantity y f(x) is called the functional margin; a positive functional margin implies a correct classification of x. The goal is to seek the largest such C and the associated parameters.

Equivalent Formulation SVM for Separable Case max_{β_0, β, C} C subject to y_i(x_i^T β + β_0) ≥ C‖β‖, i = 1, ..., n. If (β, β_0) is a solution to this problem, so is any positively scaled (αβ, αβ_0) with α > 0. For identifiability, we can scale β by requiring ‖β‖ = 1/C, so the objective function becomes C = 1/‖β‖. This leads to max_β 1/‖β‖ subject to y_i(x_i^T β + β_0) ≥ 1, i = 1, ..., n.

Linear SVM SVM for Separable Case The above problem is equivalent to min_{β_0, β} ‖β‖ subject to y_i(x_i^T β + β_0) ≥ 1, i = 1, ..., n. For computational convenience, we further replace ‖β‖ by (1/2)‖β‖^2 (a monotone transformation), which leads to min_{β_0, β} (1/2)‖β‖^2 subject to y_i(x_i^T β + β_0) ≥ 1 for i = 1, ..., n. This is the linear SVM for the perfectly separable case.
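One way to see this optimization problem concretely is to hand it to a generic convex solver. The following sketch assumes CVXPY and a tiny made-up separable dataset; it is not how production SVM software solves the problem, just a direct transcription of the primal above.

```python
import cvxpy as cp
import numpy as np

# Tiny separable toy data (illustrative only)
X = np.array([[2.0, 2.0], [3.0, 3.0], [-2.0, -2.0], [-3.0, -1.0]])
y = np.array([1.0, 1.0, -1.0, -1.0])
n, d = X.shape

beta = cp.Variable(d)
beta0 = cp.Variable()

objective = cp.Minimize(0.5 * cp.sum_squares(beta))        # (1/2) ||beta||^2
constraints = [cp.multiply(y, X @ beta + beta0) >= 1]      # y_i (x_i^T beta + beta0) >= 1
cp.Problem(objective, constraints).solve()

print(beta.value, beta0.value)          # separating hyperplane
print(2 / np.linalg.norm(beta.value))   # margin width 2/||beta|| (next slides)
```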

How to Interpret 1/‖β‖ SVM for Separable Case Take one point x_1 with f(x_1) = +1 and one point x_2 with f(x_2) = −1: β_0 + β^T x_1 = +1 and β_0 + β^T x_2 = −1, so β^T(x_1 − x_2) = 2, and hence (β^T/‖β‖)(x_1 − x_2) = 2/‖β‖. We project the vector x_1 − x_2 onto the unit normal vector of the hyperplane: 2/‖β‖ is the width of the margin between the two hyperplanes f(x) = +1 and f(x) = −1.

SVM for Separable Case [Figure (Elements of Statistical Learning, Hastie, Tibshirani & Friedman 2001, Figure 12.1): Support vector classifiers. The left panel shows the separable case; the decision boundary is the solid line, while broken lines bound the shaded maximal margin of width 2/‖β‖. The right panel shows the nonseparable (overlap) case: points on the wrong side of their margin are labeled by the amount of the violation, points on the correct side have zero violation, and the margin is maximized subject to a total budget on the violations.]

Maximal Margin Classifier SVM for Separable Case Geometric margin: defined as d_+ + d_-, where d_+ (d_-) is the shortest distance from the separating hyperplane to the closest positive (negative) training point. The margin is bounded below by 2/‖β‖. We use the squared quantity for computational convenience. A large margin on the training data tends to lead to good separation on the test data. The margin is well defined only in the separable case.

SVM for Separable Case Solve SVM by Quadratic Programming We need to solve a convex optimization problem with a quadratic objective function and linear inequality constraints. Lagrange (primal) function: introduce Lagrange multipliers α_i ≥ 0 and set L_P(β, β_0, α) = (1/2)‖β‖^2 − Σ_{i=1}^n α_i [y_i(x_i^T β + β_0) − 1]. L_P has to be minimized with respect to the primal variables (β, β_0) and maximized with respect to the dual variables α_i; in other words, we need to find the saddle points of L_P.

Solve the Wolfe Dual Problem SVM for Separable Case At the saddle point, we have ∂L/∂β_0 = 0 and ∂L/∂β = 0, i.e. Σ_i α_i y_i = 0 and β = Σ_{i=1}^n α_i y_i x_i. Substituting both into L_P gives the dual problem: maximize L_D(α) = Σ_{i=1}^n α_i − (1/2) Σ_{i,j} α_i α_j y_i y_j x_i^T x_j subject to α_i ≥ 0, i = 1, ..., n, and Σ_{i=1}^n α_i y_i = 0.

SVM Solutions SVM for Separable Case First, solve the dual problem for α: min_α (1/2) Σ_{i,j} α_i α_j y_i y_j x_i^T x_j − Σ_{i=1}^n α_i, subject to α_i ≥ 0, i = 1, ..., n, and Σ_{i=1}^n α_i y_i = 0. Denote the solution by α̂_i, i = 1, ..., n. The SVM slope is obtained by β̂ = Σ_{i=1}^n α̂_i y_i x_i. The SVM intercept β̂_0 is solved from the KKT condition α̂_i [y_i(x_i^T β̂ + β_0) − 1] = 0, using any of the points with α̂_i > 0 (support vectors). The SVM boundary is given by f̂(x) = β̂_0 + Σ_{i=1}^n α̂_i y_i x_i^T x.
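A sketch of these steps with a generic solver (CVXPY again, on the same made-up toy data as before; the tiny ridge added to the Gram matrix is purely a numerical convenience of this sketch, not part of the lecture's formulation):

```python
import cvxpy as cp
import numpy as np

X = np.array([[2.0, 2.0], [3.0, 3.0], [-2.0, -2.0], [-3.0, -1.0]])
y = np.array([1.0, 1.0, -1.0, -1.0])
n = len(y)

K = X @ X.T                               # Gram matrix x_i^T x_j
Q = np.outer(y, y) * K                    # y_i y_j x_i^T x_j
Q = Q + 1e-9 * np.eye(n)                  # tiny ridge so the solver sees a PSD matrix

alpha = cp.Variable(n)
objective = cp.Minimize(0.5 * cp.quad_form(alpha, Q) - cp.sum(alpha))
constraints = [alpha >= 0, y @ alpha == 0]
cp.Problem(objective, constraints).solve()

a = alpha.value
beta_hat = (a * y) @ X                    # beta_hat = sum_i alpha_i y_i x_i
sv = np.argmax(a)                         # pick one support vector (alpha_i > 0)
beta0_hat = y[sv] - X[sv] @ beta_hat      # from y_i (x_i^T beta + beta0) = 1
print(beta_hat, beta0_hat)
```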

SVM for Separable Case Karush-Kuhn-Tucker (KKT) Optimality Conditions The optimal solution should satisfy the following: (1) Σ_i α_i y_i = 0; (2) β = Σ_{i=1}^n α_i y_i x_i; (3) α_i ≥ 0 for i = 1, ..., n; (4) Σ_{i=1}^n α_i [y_i(x_i^T β + β_0) − 1] = 0; (5) y_i(x_i^T β + β_0) ≥ 1 for i = 1, ..., n.

Interpretation of the Solution SVM for Separable Case Optimality condition (4) is called complementary slackness: Σ_{i=1}^n α_i [y_i(x_i^T β + β_0) − 1] = 0. Since α_i ≥ 0 and y_i f(x_i) ≥ 1 for all i, this implies that the SVM solution must satisfy α_i [y_i(x_i^T β + β_0) − 1] = 0 for every i = 1, ..., n. In other words: if α_i > 0, then y_i(x_i^T β + β_0) = 1, so x_i must lie on the margin; if y_i(x_i^T β + β_0) > 1 (not on the margin), then α_i = 0. No training points fall inside the margin.

Support Vectors SVM for Separable Case Support vectors (SVs): all the points with α̂_i > 0. Define the index set SV = {i : α̂_i > 0, i = 1, ..., n}. The solution and decision boundary can be expressed in terms of the SVs only: β̂ = Σ_{i∈SV} α̂_i y_i x_i and f̂(x) = β̂_0 + Σ_{i∈SV} α̂_i y_i x_i^T x. However, identifying the SV points requires the use of all the data points.
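In practice, off-the-shelf solvers return the support vectors directly. A sketch with scikit-learn on the same toy data (the very large C is an assumption of this sketch: sklearn's SVC always solves the soft-margin problem, and a huge C approximates the hard-margin case described here):

```python
import numpy as np
from sklearn.svm import SVC

X = np.array([[2.0, 2.0], [3.0, 3.0], [-2.0, -2.0], [-3.0, -1.0]])
y = np.array([1, 1, -1, -1])

clf = SVC(kernel="linear", C=1e10).fit(X, y)   # huge C ~ hard margin
print(clf.support_)                            # indices of support vectors (alpha_i > 0)
print(clf.support_vectors_)                    # the SV points themselves
print(clf.coef_, clf.intercept_)               # beta_hat and beta0_hat
print(clf.dual_coef_)                          # alpha_i * y_i for the SVs
```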

Various Classifiers SVM for Separable Case If the classes are really Gaussian, then LDA is optimal; the separating hyperplane pays a price for focusing on the (noisier) data at the boundaries. The optimal separating hyperplane makes fewer assumptions and is thus more robust to model misspecification. In the separable case, the logistic regression solution is similar to the separating hyperplane solution; for a perfectly separable case, the log-likelihood can be driven to zero.