Support Vector Machines

Sridhar Mahadevan
mahadeva@cs.umass.edu
University of Massachusetts
Margin Classifiers

[Figure: a linear classifier with separating hyperplane $\langle w, x \rangle - b = 0$; the margin is the gap between the hyperplane and the closest points on either side.]
Optimal Margin Classification

Consider the problem of finding a set of weights $w$ that produces a hyperplane with the maximum geometric margin:

$$\max_{\gamma, w, b} \ \gamma \quad \text{such that} \quad y_i(\langle w, x_i \rangle - b) \ge \gamma, \ i = 1, \dots, m, \quad \|w\| = 1$$

We eliminate the non-convex constraint $\|w\| = 1$ as follows:

$$\min_{w,b} \ \frac{1}{2}\|w\|^2 \quad \text{such that} \quad y_i(\langle w, x_i \rangle - b) \ge 1, \ i = 1, \dots, m$$
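As a concrete illustration, here is a minimal sketch that solves this primal problem on a toy dataset. The use of numpy and cvxpy, the toy data, and the variable names are our own assumptions, not part of the lecture:

    # Hard-margin primal: min (1/2)||w||^2  s.t.  y_i(<w, x_i> - b) >= 1
    import numpy as np
    import cvxpy as cp

    # Toy linearly separable data with labels in {-1, +1} (illustrative)
    X = np.array([[2.0, 2.0], [3.0, 3.0], [-1.0, -1.0], [-2.0, -1.0]])
    y = np.array([1.0, 1.0, -1.0, -1.0])

    w, b = cp.Variable(2), cp.Variable()
    constraints = [cp.multiply(y, X @ w - b) >= 1]
    cp.Problem(cp.Minimize(0.5 * cp.sum_squares(w)), constraints).solve()
    print(w.value, b.value)   # the geometric margin is 1 / ||w||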
Lagrange Dual Formulation

The primal optimal margin classification problem can be formulated as

$$\min_w f(w) \quad \text{such that} \quad g_i(w) \le 0, \ i = 1, \dots, k \quad \text{and} \quad h_i(w) = 0, \ i = 1, \dots, l$$

The dual problem can be formulated using Lagrange multipliers as $\max_{\alpha, \beta : \alpha \ge 0} L_D(\alpha, \beta)$, where

$$L_D(\alpha, \beta) = \min_w \left( f(w) + \sum_{i=1}^k \alpha_i g_i(w) + \sum_{i=1}^l \beta_i h_i(w) \right)$$
Lagrange Dual Formulation

Weak Duality Theorem: The value of the dual problem is always upper bounded by the value of the primal problem.

Strong Duality Theorem: The solution to the Lagrange dual is exactly the same as the primal solution, assuming that the function $f(w)$ and the constraints $g_i(w)$ are convex, and the $h_i(w)$ are affine functions (meaning $h_i(w) = \langle a_i, w \rangle - b_i$).
Weak Duality Theorem

Suppose $w$ is a feasible solution to the primal problem, and that $\alpha$ and $\beta$ (with $\alpha \ge 0$) are feasible for the dual problem. Then

$$L_D(\alpha, \beta) = \min_u L(u, \alpha, \beta) \le L(w, \alpha, \beta) = f(w) + \sum_i \alpha_i g_i(w) + \sum_i \beta_i h_i(w) \le f(w)$$

This implies the following condition:

$$\max_{\alpha, \beta : \alpha \ge 0} L_D(\alpha, \beta) \ \le \ \min_w \{ f(w) : g_i(w) \le 0, \ h_i(w) = 0 \}$$
Sparsity of Parameters

Corollary: Let $w^*$ be a weight vector that satisfies the primal constraints and $\alpha^*, \beta^*$ be Lagrangian variables that satisfy the dual constraints, such that

$$f(w^*) = L_D(\alpha^*, \beta^*), \quad \text{where} \quad \alpha_i^* \ge 0 \ \text{and} \ g_i(w^*) \le 0, \ h_i(w^*) = 0$$

Then $\alpha_i^* g_i(w^*) = 0$ for $i = 1, \dots, k$. The proof follows easily by noting that the chain of inequalities

$$f(w^*) = L_D(\alpha^*, \beta^*) \ \le \ f(w^*) + \sum_i \alpha_i^* g_i(w^*) + \sum_i \beta_i^* h_i(w^*) \ \le \ f(w^*)$$

collapses to equalities; since each term $\alpha_i^* g_i(w^*) \le 0$, equality forces $\alpha_i^* g_i(w^*) = 0$ for $i = 1, \dots, k$.
Saddle Point Function

[Figure: surface plot of the saddle function $z = x^2 - y^2$ over $x, y \in [-10, 10]$.]
Duality Gap and Saddle Points

Define a saddle point as a triple $(w^*, \alpha^*, \beta^*)$, where $w^* \in \Omega$, $\alpha^* \ge 0$, and

$$L(w^*, \alpha, \beta) \ \le \ L(w^*, \alpha^*, \beta^*) \ \le \ L(w, \alpha^*, \beta^*)$$

Theorem: The triple $(w^*, \alpha^*, \beta^*)$ is a saddle point if and only if $w^*$ is a solution to the primal problem, $(\alpha^*, \beta^*)$ is a solution to the dual problem, and there is no duality gap, so $f(w^*) = L_D(\alpha^*, \beta^*)$.

Strong Duality Theorem: If $f(w)$ is convex, and $w \in \Omega$ where $\Omega$ is a convex set, and the $g_i, h_i$ are affine functions, the duality gap is 0.
Karush-Kuhn-Tucker Conditions

Assume $f(w)$ and the constraints $g_i(w)$ are convex, and the $h_i(w)$ are affine functions (i.e., $h_i(w) = \langle a_i, w \rangle - b_i$). Let there be at least one $w$ such that $g_i(w) < 0$ for all $i$. Then the following KKT conditions assure that the duality gap is 0:

$$\frac{\partial}{\partial w_i} L(w^*, \alpha^*, \beta^*) = 0, \quad i = 1, \dots, n \tag{1}$$
$$\frac{\partial}{\partial \beta_i} L(w^*, \alpha^*, \beta^*) = 0, \quad i = 1, \dots, l \tag{2}$$
$$\alpha_i^* g_i(w^*) = 0, \quad i = 1, \dots, k \tag{3}$$
$$g_i(w^*) \le 0, \quad i = 1, \dots, k \tag{4}$$
$$\alpha_i^* \ge 0, \quad i = 1, \dots, k \tag{5}$$
Support Vectors

We can formulate the classification problem as:

$$\min_w \ \frac{1}{2}\|w\|^2 \quad \text{such that} \quad g_i(w) = -y_i(\langle w, x_i \rangle - b) + 1 \le 0, \ i = 1, \dots, m$$

The KKT conditions imply that the instances with $\alpha_i > 0$ are exactly those whose functional margin is 1 (because then $g_i(w) = 0$). The functional margin of the dataset is the smallest of all the individual margins, which implies that we will only have nonzero $\alpha_i$ for the points closest to the decision boundary! These are called the support vectors.
Dual Form

We can write the Lagrangian for our optimal margin classifier as

$$L(w, b, \alpha) = \frac{1}{2}\|w\|^2 - \sum_i \alpha_i \left( y_i(\langle w, x_i \rangle - b) - 1 \right)$$

To solve the dual form, we first minimize with respect to $w$ and $b$, and then maximize with respect to $\alpha$:

$$\nabla_w L(w, b, \alpha) = w - \sum_{i=1}^m \alpha_i y_i x_i = 0 \quad \Rightarrow \quad w = \sum_{i=1}^m \alpha_i y_i x_i$$

$$\frac{\partial}{\partial b} L(w, b, \alpha) = \sum_{i=1}^m \alpha_i y_i = 0$$
Support Vectors

We can simplify the Lagrangian into the following dual form:

$$\max_\alpha \ \sum_{i=1}^m \alpha_i - \frac{1}{2} \sum_{i,j=1}^m y_i y_j \alpha_i \alpha_j \langle x_i, x_j \rangle$$

$$\text{s.t.} \quad \alpha_i \ge 0 \quad \text{and} \quad \sum_i \alpha_i y_i = 0$$
Support Vectors

Given the maximizing $\alpha_i$, we use the equation $w = \sum_{i=1}^m \alpha_i y_i x_i$ to find the maximizing $w$. A new instance $x$ is classified using a weighted sum of inner products (over only support vectors!):

$$\langle w, x \rangle - b = \sum_{i=1}^m \alpha_i y_i \langle x_i, x \rangle - b = \sum_{i \in SV} \alpha_i y_i \langle x_i, x \rangle - b$$

The intercept term $b$ can be found from the primal constraints:

$$b = \frac{\max_{y_i = -1} \langle w, x_i \rangle + \min_{y_i = 1} \langle w, x_i \rangle}{2}$$
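Putting the last three slides together, here is a sketch of the whole pipeline: solve the dual, read off the support vectors, and classify a new point. The numpy/cvxpy tooling, the toy data, the small ridge on the quadratic form, and the $10^{-6}$ support-vector threshold are all our own illustrative choices:

    # Dual QP, then recover w, b and a decision rule from the support vectors
    import numpy as np
    import cvxpy as cp

    X = np.array([[2.0, 2.0], [3.0, 3.0], [-1.0, -1.0], [-2.0, -1.0]])
    y = np.array([1.0, 1.0, -1.0, -1.0])
    m = len(y)

    # P_ij = y_i y_j <x_i, x_j>; a tiny ridge keeps the QP numerically PSD
    P = np.outer(y, y) * (X @ X.T) + 1e-8 * np.eye(m)
    alpha = cp.Variable(m)
    cp.Problem(cp.Maximize(cp.sum(alpha) - 0.5 * cp.quad_form(alpha, P)),
               [alpha >= 0, y @ alpha == 0]).solve()

    a = alpha.value
    sv = np.where(a > 1e-6)[0]                  # support vectors: alpha_i > 0
    w = (a * y) @ X                             # w = sum_i alpha_i y_i x_i
    b = 0.5 * ((X[y == -1] @ w).max() + (X[y == 1] @ w).min())

    x_new = np.array([1.0, 0.5])
    score = sum(a[i] * y[i] * (X[i] @ x_new) for i in sv) - b
    print(np.sign(score))

Note that the decision rule touches only the support vectors, exactly as the slide's sum over $i \in SV$ suggests.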
Geometric Margin

Theorem: Consider a linearly separable set of instances $(x_1, y_1), \dots, (x_m, y_m)$, and suppose $\alpha^*, b^*$ is a solution to the dual optimization problem. Then the geometric margin can be expressed as

$$\gamma = \frac{1}{\|w^*\|} = \left( \sum_{i \in SV} \alpha_i^* \right)^{-1/2}$$
Geometric Margin

Proof: Due to the KKT conditions, it follows that for all support vectors $j \in SV$,

$$y_j f(x_j; \alpha^*, b^*) = y_j \left( \sum_{i \in SV} y_i \alpha_i^* \langle x_i, x_j \rangle - b^* \right) = 1$$

Therefore

$$\|w^*\|^2 = \left\langle \sum_i \alpha_i^* y_i x_i, \ \sum_j \alpha_j^* y_j x_j \right\rangle = \sum_{j \in SV} \alpha_j^* y_j \sum_{i \in SV} \alpha_i^* y_i \langle x_i, x_j \rangle = \sum_{j \in SV} \alpha_j^* (1 + y_j b^*) = \sum_{j \in SV} \alpha_j^*$$

where the last step uses $\sum_j \alpha_j^* y_j = 0$.
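A quick numerical sanity check of this identity, sketched with scikit-learn (our choice of tool, not the lecture's; a very large $C$ approximates the hard-margin classifier, and `dual_coef_` stores $y_i \alpha_i$ for the support vectors):

    # Check gamma = 1/||w|| = (sum_i alpha_i)^(-1/2) on toy data
    import numpy as np
    from sklearn.svm import SVC

    X = np.array([[2.0, 2.0], [3.0, 3.0], [-1.0, -1.0], [-2.0, -1.0]])
    y = np.array([1, 1, -1, -1])

    clf = SVC(kernel="linear", C=1e6).fit(X, y)
    w = clf.coef_.ravel()
    alphas = np.abs(clf.dual_coef_).ravel()       # |y_i alpha_i| = alpha_i
    print(1.0 / np.linalg.norm(w), alphas.sum() ** -0.5)   # the two should agree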
Dealing with Nonseparable Data

[Figure: three panels of nonseparable data, contrasting a high-variance, low-bias decision boundary with a high-bias, low-variance one.]
Soft Margin Classifiers

We reformulate the concept of margin to allow misclassifications. The slack variable $\xi_i$ represents the extent to which a margin constraint is violated:

$$y_i(\langle w, x_i \rangle - b) \ge 1 - \xi_i, \quad \text{where} \ \xi_i \ge 0, \ i = 1, \dots, l$$
Soft Margin Classifiers

Similar to ridge regression, define a penalty parameter $C$ which represents the extent to which we want to tolerate errors. A soft-margin classifier solves the following constrained optimization problem:

$$\text{Minimize} \quad \frac{1}{2}\|w\|^2 + C \sum_{i=1}^l \xi_i^2$$
$$\text{subject to} \quad y_i(\langle w, x_i \rangle - b) \ge 1 - \xi_i, \ i = 1, \dots, l, \quad \text{where} \ \xi_i \ge 0, \ i = 1, \dots, l$$
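A sketch of this soft-margin primal with squared slacks, in the same assumed cvxpy style as before (the overlapping toy point and the value $C = 1$ are our own choices):

    # Soft-margin primal: min (1/2)||w||^2 + C * sum xi_i^2
    import numpy as np
    import cvxpy as cp

    X = np.array([[2.0, 2.0], [3.0, 3.0], [-1.0, -1.0], [0.5, 0.5]])  # last point overlaps
    y = np.array([1.0, 1.0, -1.0, -1.0])
    C = 1.0

    w, b = cp.Variable(2), cp.Variable()
    xi = cp.Variable(len(y))
    constraints = [cp.multiply(y, X @ w - b) >= 1 - xi, xi >= 0]
    cp.Problem(cp.Minimize(0.5 * cp.sum_squares(w) + C * cp.sum_squares(xi)),
               constraints).solve()
    print(w.value, b.value, xi.value)   # xi_i > 0 flags margin violations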
Sequential Minimal Optimization

SMO uses coordinate ascent. To maximize $F(\alpha_1, \dots, \alpha_m)$, pick some $\alpha_i$ and optimize it while holding all other parameters fixed:

$$\max_\alpha \ \sum_{i=1}^m \alpha_i - \frac{1}{2} \sum_{i,j=1}^m y_i y_j \alpha_i \alpha_j \langle x_i, x_j \rangle$$
$$\text{s.t.} \quad 0 \le \alpha_i \le C \quad \text{and} \quad \sum_i \alpha_i y_i = 0$$

Since $\alpha_1 = -y_1 \sum_{i=2}^m \alpha_i y_i$, we cannot optimize one multiplier alone, but we can pick any two.
SMO

If we pick $\alpha_1$ and $\alpha_2$, we know that

$$y_1 \alpha_1 + y_2 \alpha_2 = -\sum_{i=3}^m y_i \alpha_i = \varsigma$$

This implies that

$$\alpha_1 = y_1 (\varsigma - y_2 \alpha_2)$$

This equation defines a line; $\alpha_1$ and $\alpha_2$ must lie on this line to be a feasible solution. The objective function can then be reformulated as a quadratic function of $\alpha_2$ alone and solved analytically to get the new values of $\alpha_2$ and $\alpha_1$.
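A sketch of one such pairwise update, following Platt's analytic rule (the function, its arguments, and the clipping logic are our reconstruction, not code from the lecture):

    # One SMO step: re-optimize alpha[i], alpha[j] analytically, all else fixed
    import numpy as np

    def smo_step(K, y, alpha, b, i, j, C):
        f = lambda k: (alpha * y) @ K[:, k] - b          # decision value f(x_k)
        E_i, E_j = f(i) - y[i], f(j) - y[j]              # prediction errors
        # Feasible segment [L, H] for alpha[j] on the constraint line
        if y[i] != y[j]:
            L, H = max(0.0, alpha[j] - alpha[i]), min(C, C + alpha[j] - alpha[i])
        else:
            L, H = max(0.0, alpha[i] + alpha[j] - C), min(C, alpha[i] + alpha[j])
        eta = K[i, i] + K[j, j] - 2.0 * K[i, j]          # curvature along the line
        if eta <= 0 or L == H:
            return alpha                                  # degenerate pair: skip
        new = alpha.copy()
        new[j] = np.clip(alpha[j] + y[j] * (E_i - E_j) / eta, L, H)
        new[i] = alpha[i] + y[i] * y[j] * (alpha[j] - new[j])   # keep sum y*alpha fixed
        return new

Clipping to $[L, H]$ enforces the box constraint $0 \le \alpha \le C$ while staying on the constraint line.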
SMO

[Figure: the box constraint $0 \le \alpha_1, \alpha_2 \le C$ intersected with the line $y_1 \alpha_1 + y_2 \alpha_2 = \varsigma$; the endpoints of the intersection segment give the clipping bounds $L$ and $H$.]
$\epsilon$-insensitive loss

[Figure: the $\epsilon$-insensitive loss $L$ plotted against $y - (\langle w, x \rangle - b)$: zero inside a tube of width $2\epsilon$ and increasing outside it.]
SVM Regression

We introduce two slack variables $\xi_i$ and $\hat{\xi}_i$ which represent the penalty for exceeding or being below the target value by more than $\epsilon$. The primal problem can be formulated as

$$\text{Minimize} \quad \|w\|^2 + \lambda \sum_{i=1}^l (\xi_i^2 + \hat{\xi}_i^2)$$
$$\text{subject to} \quad (\langle w, x_i \rangle - b) - y_i \le \epsilon + \xi_i, \ i = 1, \dots, l$$
$$\text{and} \quad y_i - (\langle w, x_i \rangle - b) \le \epsilon + \hat{\xi}_i, \ i = 1, \dots, l$$
$$\text{where} \quad \xi_i, \hat{\xi}_i \ge 0, \ i = 1, \dots, l$$
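A sketch of this regression primal on synthetic 1-D data (the data generator, $\lambda = 1$, and $\epsilon = 0.05$ are all illustrative assumptions):

    # epsilon-insensitive regression: min ||w||^2 + lambda * sum(xi^2 + xi_hat^2)
    import numpy as np
    import cvxpy as cp

    rng = np.random.default_rng(0)
    X = rng.uniform(-1, 1, size=(30, 1))
    y = 2.0 * X.ravel() + 0.1 * rng.normal(size=30)
    lam, eps = 1.0, 0.05

    w, b = cp.Variable(1), cp.Variable()
    xi, xi_hat = cp.Variable(30), cp.Variable(30)
    pred = X @ w - b
    constraints = [pred - y <= eps + xi, y - pred <= eps + xi_hat,
                   xi >= 0, xi_hat >= 0]
    cp.Problem(cp.Minimize(cp.sum_squares(w)
                           + lam * (cp.sum_squares(xi) + cp.sum_squares(xi_hat))),
               constraints).solve()
    print(w.value, b.value)    # w should be close to the true slope 2.0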
Mercer's Theorem

Theorem: Given a function $K : \mathbb{R}^n \times \mathbb{R}^n \to \mathbb{R}$, $K$ constitutes a kernel if for any finite set of instances $x_i$, $1 \le i \le n$, the corresponding kernel (or Gram) matrix is symmetric and positive semi-definite. The Gram matrix of $k(x, z)$ is the matrix $K = (k(x_i, x_j))_{i,j=1}^n$.
Mercer's Theorem

Let us restrict our attention to kernels whose Gram matrices are positive semi-definite, i.e., whose eigenvalues are non-negative. Then we know that

$$K = \lambda_1 v_1 v_1^T + \dots + \lambda_n v_n v_n^T = \sum_{i=1}^n \lambda_i v_i v_i^T$$

Consider the nonlinear mapping $\phi : x_i \mapsto (\sqrt{\lambda_t}\, v_{ti})_{t=1}^n$. Then we can see that

$$\langle \phi(x_i), \phi(x_j) \rangle = \sum_{t=1}^n \lambda_t v_{ti} v_{tj} = K(x_i, x_j)$$
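This construction is easy to check numerically. A sketch with numpy (the polynomial-kernel Gram matrix and the random data are our own example):

    # Recover an explicit feature map from a PSD Gram matrix via eigendecomposition
    import numpy as np

    rng = np.random.default_rng(0)
    X = rng.normal(size=(5, 3))
    K = (X @ X.T + 1) ** 2                     # polynomial kernel Gram matrix (PSD)

    lam, V = np.linalg.eigh(K)                 # K = V diag(lam) V^T, lam >= 0
    Phi = V * np.sqrt(np.clip(lam, 0, None))   # row i is phi(x_i) = (sqrt(lam_t) v_ti)_t
    print(np.allclose(Phi @ Phi.T, K))         # inner products reproduce K -> True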
Making New Kernels from Old

Let $K_1$ and $K_2$ be two kernels defined over the same input space $\mathbb{R}^n \times \mathbb{R}^n \to \mathbb{R}$.

Question: Is $K(x, y) = K_1(x, y) K_2(x, y)$ also a kernel?

Solution: The Gram matrix of $K$ is the elementwise (Schur) product of the two Gram matrices. Since $K_1$ and $K_2$ are kernels, Mercer's theorem gives, for all vectors $\alpha$,

$$\alpha^T K_1 \alpha \ge 0 \qquad \alpha^T K_2 \alpha \ge 0$$

By the Schur product theorem, the Schur product of two positive semi-definite matrices is again positive semi-definite, so

$$\alpha^T (K_1 \circ K_2) \alpha \ge 0$$

This makes $K$ into a kernel as well.
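A quick numerical spot-check of this fact (numpy, the toy data, and the two kernel choices are our own):

    # Schur product of two PSD Gram matrices stays PSD
    import numpy as np

    rng = np.random.default_rng(1)
    X = rng.normal(size=(6, 2))
    K1 = X @ X.T                                             # linear kernel
    sq = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
    K2 = np.exp(-0.5 * sq)                                   # Gaussian kernel
    K = K1 * K2                                              # elementwise product
    print(np.linalg.eigvalsh(K).min() >= -1e-10)             # PSD up to rounding -> True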
Convolution Kernels

Consider an object $x = (x_1, \dots, x_d)$, where each part $x_i \in X_i$. We can define the part-of relation $R(x_1, \dots, x_d, x)$, which holds if and only if $x_1, \dots, x_d$ are indeed the parts of $x$. Of course, there may be more than one way to decompose $x$ into its parts (e.g., think of subsequences of strings, or subtrees of a tree, etc.). Let

$$R^{-1}(x) = \{(x_1, \dots, x_d) : R(x_1, \dots, x_d, x)\}$$
Convolution Kernels

The convolution kernel $k(x, y)$ is defined as

$$k(x, y) = \sum_{R^{-1}(x), \, R^{-1}(y)} \ \prod_{i=1}^d k_i(x_i, y_i)$$

where $k_i(x_i, y_i)$ is a kernel on the $i$th component. Watkins (1999) defined string kernels, which can be seen as an instance of a convolution kernel.
String Kernels

Consider the set of all subsequences of length $n$ of a word; e.g., the length-2 subsequences of bat are ba, at, and b-t (the last with a gap). The length $l(i)$ of an occurrence of a subsequence $u$ is defined as $i_l - i_f + 1$ if the occurrence begins at position $i_f$ of a string $s$ and ends at position $i_l$.

Consider the mapping $\phi : \Sigma^* \to \mathbb{R}^{\Sigma^n}$, where $\Sigma$ is an alphabet, $\Sigma^*$ is the set of all strings, and $\Sigma^n$ is the set of all strings of length $n$.
String Kernels

Given any subsequence $u \in \Sigma^n$, define

$$\phi_u(s) = \sum_{i : u = s[i]} \lambda^{l(i)}$$

where $i$ is the index vector representing the positions at which the subsequence $u$ occurs in string $s$, and $\lambda \in (0, 1)$. The string kernel is defined as

$$K_n(s, t) = \sum_{u \in \Sigma^n} \ \sum_{i : s[i] = u} \ \sum_{j : t[j] = u} \lambda^{l(i) + l(j)}$$

Clearly $K_n(s, t) = \langle \phi(s), \phi(t) \rangle$, and so this is a valid kernel.
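A naive implementation of this kernel, enumerating index vectors directly (fine for short strings; in practice one would use a dynamic-programming recursion instead). The function names and the $\lambda = 0.5$ default are our own:

    # Naive gap-weighted subsequence kernel for short strings
    from itertools import combinations

    def phi(s, n, lam):
        """Map s to a dict: subsequence u -> sum over occurrences of lam^l(i)."""
        feats = {}
        for idx in combinations(range(len(s)), n):   # increasing index vectors i
            u = "".join(s[k] for k in idx)
            span = idx[-1] - idx[0] + 1              # l(i) = i_l - i_f + 1
            feats[u] = feats.get(u, 0.0) + lam ** span
        return feats

    def string_kernel(s, t, n, lam=0.5):
        fs, ft = phi(s, n, lam), phi(t, n, lam)
        return sum(v * ft.get(u, 0.0) for u, v in fs.items())

    print(string_kernel("bat", "bat", 2))   # counts ba, at, and the gapped b-t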
Fisher Kernels

Let $P(X \mid \theta)$ be any generative model (e.g., a hidden Markov model). Consider the Fisher score $U_X = \nabla_\theta \log P(X \mid \theta)$, in other words the gradient of the log-likelihood of a particular input $X$. Define the information matrix $I = E(U_X U_X^T)$, where the expectation is over $P(X \mid \theta)$. The Fisher kernel is

$$K(x, y) = U_x^T I^{-1} U_y$$

The Fisher kernel can be asymptotically approximated as $K(x, y) \approx U_x^T U_y$.
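As a worked toy example (the model choice is entirely our own illustration, not from the lecture): for $X \sim N(\theta, 1)$ the score is $U_x = x - \theta$ and the Fisher information is 1, so the kernel reduces to a product of centered inputs. A sketch estimating $I$ by Monte Carlo:

    # Fisher kernel for a toy Gaussian model X ~ N(theta, 1)
    import numpy as np

    theta = 0.0
    score = lambda x: np.array([x - theta])              # U_x = d/dtheta log p(x|theta)

    xs = np.random.default_rng(0).normal(theta, 1.0, size=1000)
    U = np.stack([score(x) for x in xs])
    I = (U[:, :, None] * U[:, None, :]).mean(axis=0)     # I = E[U U^T], here ~ 1

    K = lambda x, y: float(score(x) @ np.linalg.inv(I) @ score(y))
    print(K(1.5, -0.5))                                  # ~ (1.5)(-0.5) = -0.75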