Lecture 16: Modern Classification (I) - Separating Hyperplanes
Outline
1. Separating Hyperplane
2. Binary SVM for the Separable Case
Bayes Rule for Binary Problems
Consider the simplest case: two classes that are perfectly linearly separable.
Linear classifiers: construct a linear decision boundary that separates the data into different classes as well as possible.
Under equal misclassification costs, the Bayes rule is
φ_B(x) = 1 if P(Y = 1 | X = x) > 0.5, and φ_B(x) = 0 if P(Y = 1 | X = x) < P(Y = 0 | X = x).
We would like to mimic the Bayes rule.
Soft classifiers estimate the class (posterior) probabilities first and then make a decision: (1) first estimate the conditional probability P(Y = +1 | X = x); (2) then construct the decision rule as P̂(Y = +1 | x) > 0.5. Examples: LDA, logistic regression, k-NN methods. Hard classifiers do not estimate the class probabilities. Instead, they target the Bayes rule directly: estimate I[P̂(Y = +1 | x) > 0.5] directly. Example: SVM.
Separating Hyperplanes
Construct a linear decision boundary that explicitly tries to separate the data into different classes as well as possible. "Good separation" is defined in a specific mathematical form. Even when the training data can be perfectly separated by hyperplanes, LDA and other linear methods developed under a statistical framework may not achieve perfect separation. For convenience, we label the two classes as +1 and −1 from now on.
[Figure 4.14, Elements of Statistical Learning, Hastie, Tibshirani & Friedman (2001), Chapter 4: A toy example with two classes separable by a hyperplane. The orange line is the least squares solution, which misclassifies one of the training points. Also shown are two blue separating hyperplanes found by the perceptron learning algorithm with different random starts.]
Review of Vector Algebra
Given two vectors a and b in R^d, the vector projection of a onto b is the orthogonal projection of a onto a straight line parallel to b:
proj_b(a) = a_1 b̂, where b̂ = b/‖b‖ and a_1 = ‖a‖ cos θ = aᵀb/‖b‖.
Here ‖a‖ is the length of a and θ is the angle between a and b.
b̂ is the unit vector in the direction of b.
The vector projection a_1 b̂ is a vector parallel to b.
a_1 is a scalar, called the scalar projection of a onto b; it can be positive or negative. If a_1 = 0, we say a is perpendicular to b.
The scalar projection equals the length of the vector projection, with a minus sign if the direction of the projection is opposite to the direction of b.
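As a quick numerical check of these definitions, here is a minimal NumPy sketch (the helper names scalar_proj and vector_proj are ours, not from the lecture):

```python
import numpy as np

def scalar_proj(a, b):
    """Scalar projection of a onto b: ||a|| cos(theta) = a.b / ||b||."""
    return a @ b / np.linalg.norm(b)

def vector_proj(a, b):
    """Vector projection of a onto b: scalar projection times the unit vector b/||b||."""
    b_hat = b / np.linalg.norm(b)
    return scalar_proj(a, b) * b_hat

a = np.array([3.0, 4.0])
b = np.array([1.0, 0.0])
s = scalar_proj(a, b)   # component of a along b
v = vector_proj(a, b)   # vector parallel to b
```

With these toy vectors, the scalar projection is 3 and the vector projection is (3, 0), matching the cosine formula directly.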
Hyperplane and Its Normal Vector
A hyperplane or affine set L is defined by the linear equation
L = {x : β_0 + βᵀx = 0}.
A hyperplane in a d-dimensional space is a flat subset of dimension d − 1. If d = 2, the set L is a line.
A hyperplane separates the space into two half-spaces.
For any two points x_1 and x_2 in L, we have βᵀ(x_1 − x_2) = 0, so β is perpendicular to L.
Define the unit normal vector of L as β* = β/‖β‖.
Signed Distance
For any x_0 ∈ L, we have βᵀx_0 = −β_0.
For any x in R^d, its signed distance to L is the scalar projection of x − x_0 onto the direction β* (the unit normal vector of L), where x_0 ∈ L:
β*ᵀ(x − x_0) = βᵀ(x − x_0)/‖β‖.
The choice of x_0 ∈ L is arbitrary.
The signed distance can be positive or negative, so it can be used for classification: points with positive signed distance are assigned to the +1 class, and points with negative signed distance are assigned to the −1 class.
Linear Classification Boundary
For convenience, define f(x) = β_0 + βᵀx.
From the previous slide, the signed distance of any point x to L is f(x)/‖β‖.
Thus f(x) is proportional to, and has the same sign as, the signed distance from x to L.
If f(x) > 0, we classify x to the +1 class. If f(x) < 0, we classify x to the −1 class.
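A small numeric illustration of the signed distance f(x)/‖β‖, using toy numbers of our choosing:

```python
import numpy as np

beta = np.array([0.0, 2.0])   # normal vector of L
beta0 = -2.0                  # so L is the horizontal line x2 = 1

def f(x):
    """f(x) = beta0 + beta^T x."""
    return beta0 + beta @ x

def signed_distance(x):
    """Signed distance from x to L: f(x) / ||beta||."""
    return f(x) / np.linalg.norm(beta)

x_pos = np.array([0.0, 3.0])  # above the line: signed distance +2, class +1
x_neg = np.array([5.0, 0.0])  # below the line: signed distance -1, class -1
d_pos = signed_distance(x_pos)
d_neg = signed_distance(x_neg)
```

Note that f(x_pos) = 4 while the actual distance is 2: f itself is only proportional to the distance, with proportionality constant ‖β‖.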
Functional Margin
If a point is correctly classified, then its signed distance has the same sign as its label, which implies y_i f(x_i) > 0.
If a point is misclassified, then its signed distance has the opposite sign to its label, which implies y_i f(x_i) < 0.
The product y_i f(x_i) is called the functional margin of the classifier f.
[Figure 4.15, Elements of Statistical Learning, Hastie, Tibshirani & Friedman (2001), Chapter 4: The linear algebra of a hyperplane (affine set), showing x, x_0, β*, and the set β_0 + βᵀx = 0.]
Rosenblatt's Perceptron Learning Algorithm
Perceptron: the foundation for support vector machines (Rosenblatt, 1958).
Key idea: find a separating hyperplane by minimizing the distance of misclassified points to the decision boundary.
Code the points from the two classes as y_i = +1, −1.
If y_i = +1 is misclassified, then x_iᵀβ + β_0 < 0.
If y_i = −1 is misclassified, then x_iᵀβ + β_0 > 0.
Since the signed distance from x_i to L is (βᵀx_i + β_0)/‖β‖, the distance from a misclassified x_i to the decision boundary is −y_i(βᵀx_i + β_0)/‖β‖.
Objective Function for the Perceptron
Define M to be the set of misclassified points:
M = {(x_i, y_i) : y_i and f(x_i) have opposite signs}.
The goal of the perceptron algorithm is to minimize
D(β_0, β) = − Σ_{i ∈ M} y_i (x_iᵀβ + β_0)
with respect to (β_0, β).
Stochastic Gradient Algorithm
Assume M is fixed. To minimize D(β_0, β), we compute the gradient of D with respect to (β_0, β):
∂D/∂β = − Σ_{i ∈ M} y_i x_i,   ∂D/∂β_0 = − Σ_{i ∈ M} y_i.
Stochastic gradient descent is used to minimize this piecewise-linear criterion: the adjustment to (β_0, β) is made after each misclassified point is visited.
Update Rule
The stochastic gradient algorithm updates
(β, β_0)_new = (β, β_0)_old + ρ (y_i x_i, y_i),
where ρ is the learning rate; for example, we may take ρ = 1.
Instead of computing the sum of the gradient contributions over all observations, a step is taken after each misclassified observation is visited.
For comparison, the true (batch) gradient algorithm is β_new = β_old − ρ ∇D(β_old).
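The update rule above can be sketched as follows; this is a minimal implementation on made-up separable data, cycling through the observations and updating at each misclassified point:

```python
import numpy as np

def perceptron(X, y, rho=1.0, max_epochs=100):
    """Perceptron: update (beta, beta0) at each misclassified point."""
    n, d = X.shape
    beta, beta0 = np.zeros(d), 0.0
    for _ in range(max_epochs):
        mistakes = 0
        for i in range(n):
            if y[i] * (X[i] @ beta + beta0) <= 0:  # misclassified (or on boundary)
                beta = beta + rho * y[i] * X[i]    # beta   <- beta   + rho * y_i * x_i
                beta0 = beta0 + rho * y[i]         # beta_0 <- beta_0 + rho * y_i
                mistakes += 1
        if mistakes == 0:                          # full pass with no errors: separated
            break
    return beta, beta0

# Toy separable data (our choice)
X = np.array([[2.0, 2.0], [3.0, 3.0], [-1.0, -1.0], [-2.0, -1.0]])
y = np.array([1, 1, -1, -1])
beta, beta0 = perceptron(X, y)
margins = y * (X @ beta + beta0)   # all positive once a separating hyperplane is found
```

Rerunning with a different visiting order or starting point can return a different separating hyperplane, which is exactly the non-uniqueness issue discussed below.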
Convergence of the Stochastic Gradient Algorithm
Convergence property: if the classes are linearly separable, the algorithm converges to a separating hyperplane in a finite number of steps.
Assume the training data are linearly separable and use ρ = 1 for the perceptron algorithm. Let β_opt be coefficients satisfying y_i β_optᵀ x_i ≥ 1 for all i = 1, ..., n. Then each update satisfies
‖β_new − β_opt‖² ≤ ‖β_old − β_opt‖² − 1.
The algorithm therefore converges in no more than ‖β_start − β_opt‖² steps. (Textbook Exercise 4.6)
About the Perceptron Learning Algorithm
A number of problems with this algorithm:
When the data are perfectly separable by a hyperplane, the solution is not unique: it depends on the starting values, and the "finite number of steps" can be very large.
When the data are not linearly separable, the algorithm does not converge, and the cycles can be long and therefore hard to detect.
So we need to add additional constraints to the separating hyperplane and find a hyperplane that is optimal in some sense.
[Figure 4.14 from Elements of Statistical Learning (Hastie, Tibshirani & Friedman, 2001), repeated: two separable classes, the least squares solution, and two perceptron hyperplanes from different random starts.]
Optimal Separating Hyperplane
SVM for Separable Case
Main motivation: separate the two classes while maximizing the distance to the closest points from either class (Vapnik, 1996).
Provides a unique solution.
Leads to better classification performance on test (future) data.
SVM for Separable Case
[Figure 4.16, Elements of Statistical Learning, Hastie, Tibshirani & Friedman (2001), Chapter 4: The same data as in Figure 4.14. The shaded region delineates the maximum margin separating the two classes. Three support points are indicated, which lie on the boundary of the margin, and the optimal separating hyperplane (blue line) bisects the slab. Also included is the boundary found using logistic regression (red line), which is very close to the optimal separating hyperplane.]
Objective Function
SVM for Separable Case
Let f(x) = β_0 + xᵀβ. Consider the problem:
max_{β_0, β, C} C
subject to y_i(x_iᵀβ + β_0)/‖β‖ ≥ C, i = 1, ..., n.
All the points are correctly classified, and each has signed distance at least C from the separating hyperplane. The maximal C must be positive (for separable data).
The quantity y f(x) is the functional margin; a positive functional margin implies a correct classification of x.
The goal is to seek the largest such C and the associated parameters.
Equivalent Formulation
SVM for Separable Case
max_{β_0, β, C} C
subject to y_i(x_iᵀβ + β_0)/‖β‖ ≥ C, i = 1, ..., n.
If (β, β_0) is a solution to the above problem, so is any positively scaled (αβ, αβ_0) with α > 0. For identifiability, we can scale β by requiring ‖β‖ = 1/C. Then the objective becomes C = 1/‖β‖, which leads to
max_{β_0, β} 1/‖β‖
subject to y_i(x_iᵀβ + β_0) ≥ 1, i = 1, ..., n.
Linear SVM
SVM for Separable Case
The above problem is equivalent to
min_{β_0, β} ‖β‖
subject to y_i(x_iᵀβ + β_0) ≥ 1, i = 1, ..., n.
For computational convenience, we further replace ‖β‖ by (1/2)‖β‖² (a monotone transformation), which leads to
min_{β_0, β} (1/2)‖β‖²
subject to y_i(x_iᵀβ + β_0) ≥ 1, for i = 1, ..., n.
This is the linear SVM for the perfectly separable case.
How to Interpret 1/‖β‖
SVM for Separable Case
Take one point x_1 on the hyperplane f(x) = +1 and one point x_2 on the hyperplane f(x) = −1:
β_0 + βᵀx_1 = +1
β_0 + βᵀx_2 = −1
⟹ βᵀ(x_1 − x_2) = 2 ⟹ (β/‖β‖)ᵀ(x_1 − x_2) = 2/‖β‖.
We projected the vector x_1 − x_2 onto the unit normal vector of the hyperplane: 2/‖β‖ is the width of the margin between the two hyperplanes f(x) = +1 and f(x) = −1.
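A quick numeric sanity check of the 2/‖β‖ width, with numbers of our choosing: take β = (2, 0) and β_0 = 0, so the hyperplanes f(x) = ±1 are the vertical lines x_1 = ±1/2.

```python
import numpy as np

beta = np.array([2.0, 0.0])
beta0 = 0.0

x1 = np.array([0.5, 3.0])    # any point with f(x1) = +1
x2 = np.array([-0.5, -7.0])  # any point with f(x2) = -1
assert beta0 + beta @ x1 == 1.0
assert beta0 + beta @ x2 == -1.0

# Project x1 - x2 onto the unit normal beta/||beta||: this is the margin width
width = (beta / np.linalg.norm(beta)) @ (x1 - x2)
```

The second coordinates of x_1 and x_2 are arbitrary: the projection onto the normal direction kills them, leaving exactly 2/‖β‖ = 1 here.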
SVM for Separable Case
[Figure 12.1, Elements of Statistical Learning, Hastie, Tibshirani & Friedman (2001), Chapter 12: Support vector classifiers. The left panel shows the separable case: the decision boundary is the solid line, while broken lines bound the maximal margin of width 2/‖β‖. The right panel shows the nonseparable (overlap) case: points on the wrong side of their margin are labeled by slack amounts, and the margin is maximized subject to a total budget on the slacks.]
Maximal Margin Classifier
SVM for Separable Case
Geometric margin: defined as d_+ + d_−, where d_+ (d_−) is the shortest distance from the separating hyperplane to the closest positive (negative) training point.
The margin is bounded below by 2/‖β‖.
Use the squared norm for computational convenience.
A large margin on the training data tends to give good separation on the test data.
The margin is well-defined only for separable cases.
SVM for Separable Case
Solve SVM by Quadratic Programming
We need to solve a convex optimization problem: a quadratic objective function with linear inequality constraints.
Lagrange (primal) function: introduce Lagrange multipliers α_i ≥ 0,
L_P(β, β_0, α) = (1/2)‖β‖² − Σ_{i=1}^n α_i [y_i(x_iᵀβ + β_0) − 1].
L_P is to be minimized with respect to the primal variables (β, β_0) and maximized with respect to the dual variables α_i. (In other words, we need to find the saddle point of L_P.)
Solve the Wolfe Dual Problem
SVM for Separable Case
At the saddle point, we have
∂L_P/∂β_0 = 0 and ∂L_P/∂β = 0,
i.e. 0 = Σ_i α_i y_i and β = Σ_{i=1}^n α_i y_i x_i.
Substituting both back into L_P gives the dual problem:
L_D(α) = Σ_{i=1}^n α_i − (1/2) Σ_{i,j} α_i α_j y_i y_j x_iᵀx_j
subject to α_i ≥ 0, i = 1, ..., n; Σ_{i=1}^n α_i y_i = 0.
SVM Solutions
SVM for Separable Case
First, solve the dual problem for α:
min_α (1/2) Σ_{i,j} α_i α_j y_i y_j x_iᵀx_j − Σ_{i=1}^n α_i,
subject to α_i ≥ 0, i = 1, ..., n, and Σ_{i=1}^n α_i y_i = 0.
Denote the solution by α̂_i, i = 1, ..., n.
The SVM slope is obtained as β̂ = Σ_{i=1}^n α̂_i y_i x_i.
The SVM intercept β̂_0 is solved from the KKT condition
α̂_i [y_i(x_iᵀβ̂ + β̂_0) − 1] = 0,
using any point with α̂_i > 0 (a support vector).
The SVM boundary is given by f̂(x) = β̂_0 + Σ_{i=1}^n α̂_i y_i x_iᵀx.
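To make this recipe concrete, here is a sketch that solves the dual numerically with scipy.optimize.minimize (a general-purpose solver, not what production SVM libraries use) on a tiny symmetric dataset we made up, then recovers the slope and intercept as above:

```python
import numpy as np
from scipy.optimize import minimize

# Toy separable data: by symmetry the optimal boundary is x1 + x2 = 0
X = np.array([[2.0, 0.0], [0.0, 2.0], [-2.0, 0.0], [0.0, -2.0]])
y = np.array([1.0, 1.0, -1.0, -1.0])
n = len(y)

Q = (y[:, None] * y[None, :]) * (X @ X.T)   # Q_ij = y_i y_j x_i^T x_j

def dual_obj(alpha):
    """Dual objective to minimize: (1/2) a^T Q a - sum(a)."""
    return 0.5 * alpha @ Q @ alpha - alpha.sum()

res = minimize(dual_obj, x0=np.full(n, 0.1), method="SLSQP",
               bounds=[(0.0, None)] * n,                        # alpha_i >= 0
               constraints=[{"type": "eq",
                             "fun": lambda a: a @ y}])          # sum alpha_i y_i = 0
alpha = res.x

beta = (alpha * y) @ X            # slope: sum_i alpha_i y_i x_i
sv = int(np.argmax(alpha))        # index of one support vector (alpha_i > 0)
beta0 = y[sv] - X[sv] @ beta      # from y_sv (x_sv^T beta + beta0) = 1
```

For this dataset all four points lie on the margin, and the solver should recover β ≈ (0.5, 0.5) and β_0 ≈ 0, i.e. the boundary x_1 + x_2 = 0 with margin width 2/‖β‖.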
SVM for Separable Case
Karush-Kuhn-Tucker (KKT) Optimality Conditions
The optimal solution must satisfy the following:
(1) 0 = Σ_i α_i y_i
(2) β = Σ_{i=1}^n α_i y_i x_i
(3) α_i ≥ 0, for i = 1, ..., n
(4) 0 = Σ_{i=1}^n α_i [y_i(x_iᵀβ + β_0) − 1]
(5) y_i(x_iᵀβ + β_0) ≥ 1, for i = 1, ..., n.
Interpretation of the Solution
SVM for Separable Case
Optimality condition (4) is called complementary slackness:
0 = Σ_{i=1}^n α_i [y_i(x_iᵀβ + β_0) − 1].
Since α_i ≥ 0 and y_i f(x_i) ≥ 1 for all i, this implies that the SVM solution must satisfy
α_i [y_i(x_iᵀβ + β_0) − 1] = 0, i = 1, ..., n.
In other words:
If α_i > 0, then y_i(x_iᵀβ + β_0) = 1, so x_i must lie on the margin.
If y_i(x_iᵀβ + β_0) > 1 (not on the margin), then α_i = 0.
No training points fall inside the margin.
Support Vectors
SVM for Separable Case
Support vectors (SVs): all the points with α̂_i > 0. Define the index set SV = {i : α̂_i > 0, i = 1, ..., n}.
The solution and the decision boundary can be expressed in terms of the SVs alone:
β̂ = Σ_{i ∈ SV} α̂_i y_i x_i,
f̂(x) = β̂_0 + Σ_{i ∈ SV} α̂_i y_i x_iᵀx.
However, identifying the SV points requires using all the data points.
Various Classifiers
SVM for Separable Case
If the classes are truly Gaussian, then LDA is optimal; the separating hyperplane pays a price for focusing on the (noisier) data at the boundaries.
The optimal separating hyperplane makes fewer assumptions, and is thus more robust to model misspecification.
In the separable case, the logistic regression solution is similar to the separating hyperplane solution: for a perfectly separable case, the log-likelihood can be driven to zero.