COS 511: Theoretical Machine Learning
Lecturer: Rob Schapire          Lecture #12
Scribe: Indraneel Mukherjee     March 12, 2008


In the previous lecture, we were introduced to the SVM algorithm and its basic motivation for use in classification tasks. In this lecture we will see how to actually compute a largest margin classifier, and touch upon how to lift the restrictive assumption of linear separability of the data.

Review of SVM

Recall that SVMs try to find a large margin linear classifier for data labeled $+1$ or $-1$ lying in $\mathbb{R}^n$. Formally, if we have $m$ examples $x_i \in \mathbb{R}^n$ with labels $y_i$, then the SVM algorithm outputs $v \in \mathbb{R}^n$ solving the following program:

$$\max \delta \quad \text{s.t.} \quad \lVert v \rVert = 1, \qquad \forall i: y_i (v \cdot x_i) \ge \delta. \tag{1}$$

SVMs are designed to explicitly maximize the margin $\delta$, whereas boosting happens to do so accidentally. This indicates that the two algorithms may be related, and a summary of their similarities and differences is tabulated below. The column labeled "Boosting" requires explanation. The set of all weak hypotheses is denoted $h_1, h_2, \ldots$. We replace each example $x$ by the vector of predictions of the weak hypotheses, $h(x) = \langle h_1(x), h_2(x), \ldots \rangle$; these are its only relevant features for boosting. The final hypothesis takes a weighted majority vote of the different weak hypotheses. These non-negative weights, scaled down to sum to 1, are denoted $a_1, a_2, \ldots$.

|          | SVM                        | Boosting                                                                 |
|----------|----------------------------|--------------------------------------------------------------------------|
| example  | $x \in \mathbb{R}^n$       | $h(x) = \langle h_1(x), h_2(x), \ldots \rangle$                          |
| norm     | $\lVert x \rVert_2 \le 1$  | $\lVert h(x) \rVert_\infty = \max_j \lvert h_j(x) \rvert = 1$            |
| finds    | $v \in \mathbb{R}^n$       | weights $a = \langle a_1, a_2, \ldots \rangle$ on weak hypotheses        |
| scaling  | $\lVert v \rVert_2 = 1$    | $a_i \ge 0$, $\sum_i a_i = 1$ (i.e., $\lVert a \rVert_1 = 1$)            |
| predicts | $\mathrm{sign}(v \cdot x)$ | $\mathrm{sign}\big(\sum_j a_j h_j(x)\big) = \mathrm{sign}(a \cdot h(x))$ |
| margin   | $y \, (v \cdot x)$         | $y \sum_j a_j h_j(x) = y \, (a \cdot h(x))$                              |

Computing the SVM hypothesis

We describe how to solve the optimization problem given in (1). We begin by rewriting (1) as follows:

$$\max \delta \quad \text{s.t.} \quad \lVert v \rVert = 1, \qquad \forall i: y_i \left( \frac{v}{\delta} \cdot x_i \right) \ge 1. \tag{2}$$
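To make program (1) concrete, here is a minimal sketch of evaluating the margin that a candidate unit-norm classifier attains; the toy data and the particular $v$ are hypothetical, not from the notes.

```python
import numpy as np

# Hypothetical toy data in R^2 with labels in {-1, +1}.
X = np.array([[2.0, 1.0], [1.0, 3.0], [-1.0, -2.0], [-2.0, -1.0]])
y = np.array([+1, +1, -1, -1])

def margin(v, X, y):
    """Margin achieved by a candidate linear classifier: min_i y_i (v . x_i),
    after normalizing v so that ||v|| = 1 as program (1) requires."""
    v = v / np.linalg.norm(v)
    return np.min(y * (X @ v))

print(margin(np.array([1.0, 1.0]), X, y))  # the delta this particular v attains
```

Program (1) searches over all unit vectors $v$ for the one maximizing this quantity.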

Letting $w = v/\delta$, we get the relation $\lVert w \rVert = 1/\delta$ (using $\lVert v \rVert = 1$). Hence our objective is to minimize $\frac{1}{2}\lVert w \rVert^2$ subject to the SVM constraints $b_i(w) \ge 0$, where $b_i(w) = y_i (w \cdot x_i) - 1$. Hence our task reduces to solving

$$\min_w \frac{1}{2}\lVert w \rVert^2 \quad \text{s.t.} \quad \forall i: y_i (w \cdot x_i) \ge 1. \tag{3}$$

Following standard techniques, we form the Lagrangian of the above optimization problem by linearly combining the objective function and the constraints:

$$L(w, \alpha) = \frac{1}{2}\lVert w \rVert^2 - \sum_{i=1}^m \alpha_i \left[ y_i (w \cdot x_i) - 1 \right]. \tag{4}$$

The Lagrangian is useful because it converts the constrained optimization task in (3) into the following unconstrained one:

$$\min_w \max_{\alpha \ge 0} L(w, \alpha). \tag{5}$$

Indeed, (5) is the value of a game with two players, Mindy and Max, where Mindy goes first, choosing $w \in \mathbb{R}^n$, and Max, observing Mindy's choice, selects $\alpha \in \mathbb{R}^m_+$ to maximize the resulting value of (5); Mindy, aware of Max's strategy, makes her initial choice to minimize (5). If Mindy's choice of $w$ violated any constraint $b_i(w) \ge 0$, Max could choose $\alpha_i$ sufficiently large to make (5) unbounded. If no $w$ obeying all the constraints in (3) existed, both (3) and (5) would be $+\infty$. Otherwise, Mindy ensures $b_i(w) \ge 0$ for each $i$, and Max chooses $\alpha_i = 0$ whenever $b_i(w)$ is positive, so that $\alpha_i b_i(w) = 0$ for every $i$. Thus, for $w$ obeying all constraints, $L(w, \alpha) = \frac{1}{2}\lVert w \rVert^2 - \sum_i \alpha_i b_i(w) = \frac{1}{2}\lVert w \rVert^2$, and Mindy's strategy boils down to solving (3).

Some minimax theory

Before computing (5), we consider the dual game, where Max goes first. Even if Mindy ignores Max's move and plays the same $w$ in the dual as she would in the primal, she ensures that the value of the dual does not exceed that of the primal. It follows that

$$\max_\alpha \min_w L(w, \alpha) \le \min_w \max_\alpha L(w, \alpha).$$

Minimax theory tells us that, for a large class of functions $L$, the values of both games are in fact equal. One such class of immediate relevance to us is where both arguments of $L$ range over convex domains and $L$ is convex in its first argument and concave in its second. The Lagrangian formed in (4) has these properties, and hence equality holds for our games. Let $w^*, \alpha^*$ be optimal choices of Mindy and Max for the primal and dual games, respectively:

$$w^* = \arg\min_w \max_\alpha L(w, \alpha) \tag{6}$$
$$\alpha^* = \arg\max_\alpha \min_w L(w, \alpha) \tag{7}$$
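The game-theoretic reading of (5) is easy to check numerically. Below is a small sketch, reusing the hypothetical data above, of the Lagrangian (4) and of Max's best response to a fixed $w$:

```python
import numpy as np

def lagrangian(w, alpha, X, y):
    """L(w, alpha) = 0.5 ||w||^2 - sum_i alpha_i [y_i (w . x_i) - 1], as in (4)."""
    b = y * (X @ w) - 1.0              # constraint values b_i(w)
    return 0.5 * (w @ w) - alpha @ b

def max_best_response_value(w, X, y):
    """If some b_i(w) < 0, Max can drive L to infinity; otherwise Max sets
    alpha_i = 0 on every slack constraint, leaving L = 0.5 ||w||^2."""
    b = y * (X @ w) - 1.0
    return np.inf if np.any(b < 0) else 0.5 * (w @ w)
```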

We can derive the following chain of inequalities:

$$L(w^*, \alpha^*) \le \max_\alpha L(w^*, \alpha) = \min_w \max_\alpha L(w, \alpha) \quad \text{(by (6))}$$
$$= \max_\alpha \min_w L(w, \alpha) \quad \text{(minimax theory)}$$
$$= \min_w L(w, \alpha^*) \quad \text{(by (7))}$$
$$\le L(w^*, \alpha^*).$$

Since the first and last terms are the same, equality holds on all lines. As a consequence we obtain the following facts,

$$w^* = \arg\min_w L(w, \alpha^*) \tag{8}$$
$$\alpha^* = \arg\max_\alpha L(w^*, \alpha) \tag{9}$$

showing that $(w^*, \alpha^*)$ is a saddle point of the function $L$. Since $L(\cdot, \alpha)$ is convex, the value of minimizing $L(w, \alpha)$ for a fixed $\alpha$ is obtained by setting derivatives to zero. Eq. (8) now implies

$$\forall j: \frac{\partial L(w^*, \alpha^*)}{\partial w_j} = 0. \tag{10}$$

From our previous discussions and (6), we know that $w^*$ obeys all constraints and $\alpha_i^* b_i(w^*)$ is always zero:

$$\forall i: b_i(w^*) \ge 0, \qquad \alpha_i^* b_i(w^*) = 0. \tag{11}$$

Conditions (10) and (11), together with the nonnegativity constraints $\alpha_i \ge 0$, are known as the Karush-Kuhn-Tucker (KKT) conditions, and they characterize all optimal solutions. Note that our discussion shows that any optimal solution satisfies the KKT conditions. Showing that the converse holds will be a homework problem.

We return to solving the optimization task for SVMs. Recall that it suffices to compute the value of the dual game. As discussed before, the value of minimizing $L(w, \alpha)$ for fixed $\alpha$ can be obtained by setting the derivative to zero:

$$\forall j: \frac{\partial L}{\partial w_j} = w_j - \sum_i \alpha_i y_i x_{ij} = 0 \implies w = \sum_i \alpha_i y_i x_i. \tag{12}$$

Plugging this expression for $w$ into $L$, the value of the game is given by

$$\max_\alpha \sum_i \alpha_i - \frac{1}{2} \sum_{i,j} \alpha_i \alpha_j y_i y_j (x_i \cdot x_j) \quad \text{s.t.} \quad \forall i: \alpha_i \ge 0.$$
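The dual above is a quadratic program in $\alpha$ alone. One concrete instance of a simple hill-climber for it (an illustration, not the notes' prescription) is projected gradient ascent: take gradient steps on the dual objective and clip $\alpha$ at zero. A sketch, assuming linearly separable data and an untuned step size:

```python
import numpy as np

def solve_dual(X, y, steps=20000, lr=1e-3):
    """Maximize sum_i alpha_i - 0.5 sum_{i,j} alpha_i alpha_j y_i y_j (x_i . x_j)
    over alpha >= 0 by projected gradient ascent, then recover w via (12)."""
    Z = y[:, None] * X                            # rows z_i = y_i x_i
    G = Z @ Z.T                                   # G_ij = y_i y_j (x_i . x_j)
    alpha = np.zeros(len(y))
    for _ in range(steps):
        grad = np.ones(len(y)) - G @ alpha        # gradient of the dual objective
        alpha = np.maximum(alpha + lr * grad, 0.0)  # project onto alpha_i >= 0
    w = Z.T @ alpha                               # w = sum_i alpha_i y_i x_i
    return w, alpha
```

On separable data, the entries of $\alpha$ that converge to nonzero values mark the support vectors, by the KKT condition (11).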

[Figure 1: Data in $\mathbb{R}^2$ that is not linearly separable.]

The above program can be solved by standard hill-climbing techniques, which will not be discussed. If $\alpha^*$ solves the above program, (10) and (12) imply that $w^* = \sum_i \alpha_i^* y_i x_i$ is the SVM hypothesis. The KKT condition (11) implies that $\alpha_i^*$ is non-zero only when $y_i (w^* \cdot x_i) = 1$, i.e., when $(x_i, y_i)$ is a support vector. Hence the SVM hypothesis is a linear combination of only the support vectors. By a homework problem, if there are $k$ support vectors, the generalization error of the classifier found by SVM is bounded by $O\!\left(\frac{k \ln m}{m}\right)$ with high probability. This gives an alternative method of analyzing SVMs.

SVMs with non-linear classifiers

If the data $x_1, \ldots, x_m$ are not linearly separable, then no $w \in \mathbb{R}^n$ satisfying the constraints in (3) exists, and the SVM algorithm fails completely. To allow some noisy data, the constraints are often relaxed by a small amount $\xi_i$, and the objective function is penalized by the net deviation $\sum_i \xi_i$ from the constraints. This gives rise to the soft margin SVM:

$$\min_{w, \xi} \frac{1}{2}\lVert w \rVert^2 + \sum_i \xi_i \quad \text{s.t.} \quad \forall i: y_i (w \cdot x_i) \ge 1 - \xi_i, \quad \xi_i \ge 0.$$

Soft margin SVMs are useful when the data are inherently linearly separable but the labels are perturbed by some small amount of noise, a case often arising in practice. However, for some kinds of data (see Figure 1) only higher-dimensional surfaces can classify accurately. The basic SVM theory can still be applied, but the data have to be mapped to a higher-dimensional space first. For example, we can take all possible monomial terms up to a certain degree. To illustrate, a data point $x = (x_1, x_2) \in \mathbb{R}^2$ can be mapped to $\mathbb{R}^6$ as follows:

$$x = (x_1, x_2) \mapsto \psi(x) = (1, x_1, x_2, x_1^2, x_2^2, x_1 x_2).$$

A hyperplane in the new space assumes the form

$$a + b x_1 + c x_2 + d x_1^2 + e x_2^2 + f x_1 x_2 = 0,$$

which is a degree-2 polynomial. This corresponds to a conic section (such as a circle, parabola, etc.) in the original space, which can correctly classify the data in Figure 1. When considering all surfaces up to degree $d$, the dimension of a mapped example blows up as $O(n^d)$, where $n$ was the original dimension.
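To see concretely why this map helps, here is a small sketch; the weight vector below is hand-picked for illustration, assuming (as in Figure 1) that one class is enclosed by a circle around the origin.

```python
import numpy as np

def psi(x):
    """The degree-2 monomial map from R^2 to R^6 given above."""
    x1, x2 = x
    return np.array([1.0, x1, x2, x1**2, x2**2, x1 * x2])

# With w = (1, 0, 0, -1, -1, 0) we get w . psi(x) = 1 - x1^2 - x2^2, which is
# positive inside the unit circle and negative outside: a linear classifier in
# R^6 realizes a circular decision boundary in R^2.
w = np.array([1.0, 0.0, 0.0, -1.0, -1.0, 0.0])
print(np.sign(w @ psi(np.array([0.5, 0.5]))))  # +1: inside the circle
print(np.sign(w @ psi(np.array([2.0, 0.0]))))  # -1: outside the circle
```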

This might create severe computational problems, since naively storing and operating on the projected examples would require huge space and time complexity. Further, the high descriptive complexity of a linear classifier in the expanded space might pose problems of overfitting. SVMs successfully bypass both problems. The statistical problem is overcome by the fact that the optimal classifier is still given by a few support vectors; or, alternatively, by the fact that the VC-dimension of a space of classifiers with large margin does not grow with the dimension of the space. Computational complexity is kept low via what is known as the kernel trick, which will be described after the spring break.
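To make the $O(n^d)$ blowup concrete: the number of monomials of total degree at most $d$ in $n$ variables is $\binom{n+d}{d}$ (a standard count, not derived in the notes; note that $n = 2$, $d = 2$ gives the 6 dimensions of $\psi$ above). A quick tabulation:

```python
from math import comb

# Dimension of the degree-<=d monomial feature space over R^n is C(n+d, d),
# which grows as O(n^d) for fixed d.
for n in (2, 10, 100):
    for d in (2, 3, 5):
        print(f"n={n:3d}  d={d}  dim={comb(n + d, d)}")
```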
