Machine Learning Techniques

1 Machine Learning Techniques ( 機器學習技法 ) Lecture 2: Dual Support Vector Machine Hsuan-Tien Lin ( 林軒田 ) htlin@csie.ntu.edu.tw Department of Computer Science & Information Engineering National Taiwan University ( 國立台灣大學資訊工程系 ) Hsuan-Tien Lin (NTU CSIE) Machine Learning Techniques 0/23

2 Roadmap
1. Embedding Numerous Features: Kernel Models
   Lecture 1: Linear Support Vector Machine
      linear SVM: more robust and solvable with quadratic programming
   Lecture 2: Dual Support Vector Machine
      Motivation of Dual SVM
      Lagrange Dual SVM
      Solving Dual SVM
      Messages behind Dual SVM
2. Combining Predictive Features: Aggregation Models
3. Distilling Implicit Features: Extraction Models

3 Motivation of Dual SVM
Non-Linear Support Vector Machine Revisited
$$\min_{b,w} \ \tfrac{1}{2} w^T w \quad \text{s.t.} \quad y_n (w^T z_n + b) \ge 1 \ \text{ for } n = 1, 2, \ldots, N, \quad \text{with } z_n = \Phi(x_n) \in \mathbb{R}^d$$
Non-Linear Hard-Margin SVM:
1. $Q = \begin{bmatrix} 0 & 0_d^T \\ 0_d & I_d \end{bmatrix}$; $p = 0_{d+1}$; $a_n^T = y_n \begin{bmatrix} 1 & z_n^T \end{bmatrix}$; $c_n = 1$
2. $\begin{bmatrix} b \\ w \end{bmatrix} \leftarrow \text{QP}(Q, p, A, c)$
3. return $b \in \mathbb{R}$ and $w \in \mathbb{R}^d$ with $g_{\text{SVM}}(x) = \text{sign}(w^T \Phi(x) + b)$
demanded: not many (large-margin), but sophisticated boundary (feature transform)
QP with d + 1 variables and N constraints: challenging if d large, or infinite?! :-)
goal: SVM without dependence on d
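The three-step recipe above can be sketched in plain Python. This is only an illustration under stated assumptions: the four 2-D points, the identity transform Φ(x) = x, and the candidate solution (b, w) = (-1, (1, -1)) are hypothetical toy choices, and no QP solver is called. The sketch builds (Q, p, A, c) and checks the candidate against the constraints.

```python
# Build the primal hard-margin SVM QP for the variable u = [b, w_1, ..., w_d].
# Toy data, identity transform and candidate u are assumptions for illustration.

def build_primal_qp(Z, y):
    """Return (Q, p, A, c) for: min 1/2 u^T Q u + p^T u  s.t.  A u >= c."""
    d = len(Z[0])
    # Q = [[0, 0^T], [0, I_d]]: b is not penalised, every w_i is.
    Q = [[1.0 if (i == j and i > 0) else 0.0 for j in range(d + 1)]
         for i in range(d + 1)]
    p = [0.0] * (d + 1)
    # Row n of A is y_n * [1, z_n^T]; the right-hand side is c_n = 1.
    A = [[yn] + [yn * zi for zi in zn] for zn, yn in zip(Z, y)]
    c = [1.0] * len(Z)
    return Q, p, A, c

def objective(Q, p, u):
    k = len(u)
    quad = sum(u[i] * Q[i][j] * u[j] for i in range(k) for j in range(k))
    return 0.5 * quad + sum(pi * ui for pi, ui in zip(p, u))

def margins(A, u):
    """Left-hand sides A u, i.e. y_n (w^T z_n + b) for every example."""
    return [sum(a * ui for a, ui in zip(row, u)) for row in A]

Z = [(0.0, 0.0), (2.0, 2.0), (2.0, 0.0), (3.0, 0.0)]
y = [-1.0, -1.0, 1.0, 1.0]
Q, p, A, c = build_primal_qp(Z, y)
u = [-1.0, 1.0, -1.0]          # candidate [b, w_1, w_2]
print(margins(A, u))           # every entry should be at least 1
print(objective(Q, p, u))      # equals 1/2 w^T w
```

Note that Q, A and u all grow with d, which is the dependence the rest of the lecture sets out to remove.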

4 Motivation of Dual SVM
Todo: SVM 'without' d
Original SVM: (convex) QP of d + 1 variables, N constraints
Equivalent SVM: (convex) QP of N variables, N + 1 constraints
Warning: Heavy Math!!!!!!
   introduce some necessary math without rigor to help understand SVM deeper
   claim some results if details unnecessary, like how we claimed Hoeffding
Equivalent SVM: based on some dual problem of Original SVM

5 Motivation of Dual SVM
Key Tool: Lagrange Multipliers
Regularization by Constrained-Minimizing $E_{\text{in}}$: $\min_w E_{\text{in}}(w)$ s.t. $w^T w \le C$
Regularization by Minimizing $E_{\text{aug}}$: $\min_w E_{\text{aug}}(w) = E_{\text{in}}(w) + \frac{\lambda}{N} w^T w$
C equivalent to some $\lambda \ge 0$ by checking optimality condition $\nabla E_{\text{in}}(w) + \frac{2\lambda}{N} w = 0$
regularization: view $\lambda$ as given parameter instead of C, and solve 'easily'
dual SVM: view the $\lambda$'s as unknown given the constraints, and solve them as variables instead
how many $\lambda$'s as variables? N, one per constraint

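6 Motivation of Dual SVM
Starting Point: Constrained to 'Unconstrained'
$$\min_{b,w} \ \tfrac{1}{2} w^T w \quad \text{s.t.} \quad y_n (w^T z_n + b) \ge 1 \ \text{ for } n = 1, 2, \ldots, N$$
Lagrange Function, with Lagrange multipliers written $\alpha_n$ instead of $\lambda_n$:
$$L(b, w, \alpha) = \underbrace{\tfrac{1}{2} w^T w}_{\text{objective}} + \sum_{n=1}^{N} \alpha_n \underbrace{\bigl(1 - y_n (w^T z_n + b)\bigr)}_{\text{constraint}}$$
Claim:
$$\text{SVM} \equiv \min_{b,w} \Bigl( \max_{\text{all } \alpha_n \ge 0} L(b, w, \alpha) \Bigr) = \min_{b,w} \Bigl( \infty \text{ if violate}; \ \tfrac{1}{2} w^T w \text{ if feasible} \Bigr)$$
any 'violating' (b, w): $\max_{\text{all } \alpha_n \ge 0} \bigl( \tfrac{1}{2} w^T w + \sum_n \alpha_n (\text{some positive}) \bigr) \to \infty$
any 'feasible' (b, w): $\max_{\text{all } \alpha_n \ge 0} \bigl( \tfrac{1}{2} w^T w + \sum_n \alpha_n (\text{all non-positive}) \bigr) = \tfrac{1}{2} w^T w$
constraints now hidden in max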
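The claim that the inner maximization acts as an infinite penalty can be checked numerically. The following is a minimal sketch, assuming a hypothetical two-example data set and replacing "all α_n ≥ 0" with the box 0 ≤ α_n ≤ scale so that the maximum is finite and can be printed; `inner_max` and `scale` are names introduced here for illustration only.

```python
# L(b, w, alpha) = 1/2 w^T w + sum_n alpha_n * (1 - y_n (w^T z_n + b))

def slack_terms(b, w, Z, y):
    """The constraint terms 1 - y_n (w^T z_n + b), one per example."""
    return [1.0 - yn * (sum(wi * zi for wi, zi in zip(w, zn)) + b)
            for zn, yn in zip(Z, y)]

def lagrangian(b, w, alpha, Z, y):
    objective = 0.5 * sum(wi * wi for wi in w)
    return objective + sum(a * s for a, s in zip(alpha, slack_terms(b, w, Z, y)))

def inner_max(b, w, Z, y, scale):
    """Maximise L over the box 0 <= alpha_n <= scale (stand-in for alpha_n >= 0)."""
    # L is linear in each alpha_n: use the cap when the term is positive, else 0.
    alpha = [scale if s > 0 else 0.0 for s in slack_terms(b, w, Z, y)]
    return lagrangian(b, w, alpha, Z, y)

# Hypothetical toy data: one positive and one negative example.
Z = [(1.0, 0.0), (-1.0, 0.0)]
y = [1.0, -1.0]

feasible = (0.0, (2.0, 0.0))     # satisfies both constraints
violating = (0.0, (0.5, 0.0))    # margin of 0.5 < 1 on both examples

for scale in (10.0, 1000.0):
    print(scale,
          inner_max(*feasible, Z, y, scale),    # stays at 1/2 w^T w
          inner_max(*violating, Z, y, scale))   # grows without bound with scale
```

For the feasible point the maximizer sets every α_n to 0 and recovers the plain objective; for the violating point the value grows linearly with the cap, which is the "∞ if violate" branch of the claim.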

8 Motivation of Dual SVM
Fun Time
Consider two transformed examples $(z_1, +1)$ and $(z_2, -1)$ with $z_1 = z$ and $z_2 = -z$. What is the Lagrange function $L(b, w, \alpha)$ of hard-margin SVM?
1. $\tfrac{1}{2} w^T w + \alpha_1 (1 + w^T z + b) + \alpha_2 (1 + w^T z + b)$
2. $\tfrac{1}{2} w^T w + \alpha_1 (1 - w^T z - b) + \alpha_2 (1 - w^T z + b)$
3. $\tfrac{1}{2} w^T w + \alpha_1 (1 + w^T z + b) + \alpha_2 (1 + w^T z - b)$
4. $\tfrac{1}{2} w^T w + \alpha_1 (1 - w^T z - b) + \alpha_2 (1 - w^T z - b)$
Reference Answer: 2
By definition, $L(b, w, \alpha) = \tfrac{1}{2} w^T w + \alpha_1 \bigl(1 - y_1 (w^T z_1 + b)\bigr) + \alpha_2 \bigl(1 - y_2 (w^T z_2 + b)\bigr)$ with $(z_1, y_1) = (z, +1)$ and $(z_2, y_2) = (-z, -1)$.

9 Lagrange Dual SVM
Lagrange Dual Problem
for any fixed $\alpha'$ with all $\alpha'_n \ge 0$,
$$\min_{b,w} \Bigl( \max_{\text{all } \alpha_n \ge 0} L(b, w, \alpha) \Bigr) \ge \min_{b,w} L(b, w, \alpha')$$
because max ≥ any
for best $\alpha' \ge 0$ on RHS,
$$\min_{b,w} \Bigl( \max_{\text{all } \alpha_n \ge 0} L(b, w, \alpha) \Bigr) \ge \underbrace{\max_{\text{all } \alpha'_n \ge 0} \ \min_{b,w} L(b, w, \alpha')}_{\text{Lagrange dual problem}}$$
because best is one of any
Lagrange dual problem: 'outer' maximization of $\alpha$ on lower bound of original problem

10 Lagrange Dual SVM
Strong Duality of Quadratic Programming
$$\underbrace{\min_{b,w} \Bigl( \max_{\text{all } \alpha_n \ge 0} L(b, w, \alpha) \Bigr)}_{\text{equiv. to original (primal) SVM}} \ \ge \ \underbrace{\max_{\text{all } \alpha_n \ge 0} \Bigl( \min_{b,w} L(b, w, \alpha) \Bigr)}_{\text{Lagrange dual}}$$
'≥': weak duality
'=': strong duality, true for QP if
   convex primal
   feasible primal (true if Φ-separable)
   linear constraints
called constraint qualification
exists primal-dual optimal solution (b, w, α) for both sides
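Weak and strong duality can be seen on a problem small enough to solve by hand. The sketch below uses a hypothetical one-variable QP, minimize ½w² subject to w ≥ 1, which is not from the lecture; it is convex, feasible and has a linear constraint, so the conditions listed above hold. The grids are arbitrary illustrative choices.

```python
# Toy primal: min 1/2 w^2  s.t.  w >= 1      (assumed example, not from the slides)
# Lagrangian: L(w, alpha) = 1/2 w^2 + alpha * (1 - w), with alpha >= 0

def lagrangian(w, alpha):
    return 0.5 * w * w + alpha * (1.0 - w)

def dual_function(alpha):
    # The inner minimisation over w is unconstrained: dL/dw = w - alpha = 0.
    return lagrangian(alpha, alpha)

# Primal optimum by a grid search over the feasible region w in [1, 3].
primal_grid = [1.0 + i / 100.0 for i in range(0, 201)]
primal_opt = min(0.5 * w * w for w in primal_grid)

# Dual optimum by a grid search over alpha in [0, 3].
dual_grid = [i / 100.0 for i in range(0, 301)]
dual_values = [dual_function(a) for a in dual_grid]
dual_opt = max(dual_values)

print(primal_opt, dual_opt)
```

Every dual value is a lower bound on the primal optimum (weak duality), and the best lower bound closes the gap (strong duality) for this convex QP.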

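11 Lagrange Dual SVM
Solving Lagrange Dual: Simplifications (1/2)
$$\max_{\text{all } \alpha_n \ge 0} \Bigl( \min_{b,w} \underbrace{\tfrac{1}{2} w^T w + \sum_{n=1}^{N} \alpha_n \bigl(1 - y_n (w^T z_n + b)\bigr)}_{L(b, w, \alpha)} \Bigr)$$
inner problem 'unconstrained', at optimal: $\frac{\partial L(b, w, \alpha)}{\partial b} = 0 = -\sum_{n=1}^{N} \alpha_n y_n$
no loss of optimality if solving with constraint $\sum_{n=1}^{N} \alpha_n y_n = 0$
but wait, b can be removed: the term $-\bigl(\sum_{n=1}^{N} \alpha_n y_n\bigr) \cdot b$ vanishes under that constraint, leaving
$$\max_{\text{all } \alpha_n \ge 0, \ \sum y_n \alpha_n = 0} \Bigl( \min_{w} \ \tfrac{1}{2} w^T w + \sum_{n=1}^{N} \alpha_n \bigl(1 - y_n w^T z_n\bigr) \Bigr)$$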

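12 Lagrange Dual SVM
Solving Lagrange Dual: Simplifications (2/2)
$$\max_{\text{all } \alpha_n \ge 0, \ \sum y_n \alpha_n = 0} \Bigl( \min_{w} \ \tfrac{1}{2} w^T w + \sum_{n=1}^{N} \alpha_n \bigl(1 - y_n w^T z_n\bigr) \Bigr)$$
inner problem 'unconstrained', at optimal: $\frac{\partial L(b, w, \alpha)}{\partial w_i} = 0 = w_i - \sum_{n=1}^{N} \alpha_n y_n z_{n,i}$
no loss of optimality if solving with constraint $w = \sum_{n=1}^{N} \alpha_n y_n z_n$
but wait!
$$\max_{\text{all } \alpha_n \ge 0, \ \sum y_n \alpha_n = 0, \ w = \sum \alpha_n y_n z_n} \Bigl( \min_{w} \ \tfrac{1}{2} w^T w + \sum_{n=1}^{N} \alpha_n - w^T w \Bigr)$$
$$\Longleftrightarrow \ \max_{\text{all } \alpha_n \ge 0, \ \sum y_n \alpha_n = 0, \ w = \sum \alpha_n y_n z_n} \ -\tfrac{1}{2} \Bigl\| \sum_{n=1}^{N} \alpha_n y_n z_n \Bigr\|^2 + \sum_{n=1}^{N} \alpha_n$$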
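The last substitution can be written out step by step. A worked version, using only the constraint $w = \sum_n \alpha_n y_n z_n$ stated on this slide:

```latex
\begin{align*}
\tfrac{1}{2} w^T w + \sum_{n=1}^{N} \alpha_n \bigl(1 - y_n w^T z_n\bigr)
  &= \tfrac{1}{2} w^T w + \sum_{n=1}^{N} \alpha_n
     - w^T \Bigl( \sum_{n=1}^{N} \alpha_n y_n z_n \Bigr) \\
  &= \tfrac{1}{2} w^T w + \sum_{n=1}^{N} \alpha_n - w^T w \\
  &= -\tfrac{1}{2} \Bigl\| \sum_{n=1}^{N} \alpha_n y_n z_n \Bigr\|^2
     + \sum_{n=1}^{N} \alpha_n
\end{align*}
```

After this step neither b nor w appears as a free variable; only the N multipliers remain.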

13 Lagrange Dual SVM
KKT Optimality Conditions
$$\max_{\text{all } \alpha_n \ge 0, \ \sum y_n \alpha_n = 0, \ w = \sum \alpha_n y_n z_n} \ -\tfrac{1}{2} \Bigl\| \sum_{n=1}^{N} \alpha_n y_n z_n \Bigr\|^2 + \sum_{n=1}^{N} \alpha_n$$
if primal-dual optimal (b, w, α),
   primal feasible: $y_n (w^T z_n + b) \ge 1$
   dual feasible: $\alpha_n \ge 0$
   dual-inner optimal: $\sum y_n \alpha_n = 0$; $w = \sum \alpha_n y_n z_n$
   primal-inner optimal (at optimal all 'Lagrange terms' disappear): $\alpha_n \bigl(1 - y_n (w^T z_n + b)\bigr) = 0$
called Karush-Kuhn-Tucker (KKT) conditions, necessary for optimality [& sufficient here]
will use KKT to 'solve' (b, w) from optimal α

15 Lagrange Dual SVM
Fun Time
For a single variable w, consider minimizing $\tfrac{1}{2} w^2$ subject to two linear constraints $w \ge 1$ and $w \le 3$. We know that the Lagrange function is $L(w, \alpha) = \tfrac{1}{2} w^2 + \alpha_1 (1 - w) + \alpha_2 (w - 3)$. Which of the following equations that contain α are among the KKT conditions of the optimization problem?
1. $\alpha_1 \ge 0$ and $\alpha_2 \ge 0$
2. $w = \alpha_1 - \alpha_2$
3. $\alpha_1 (1 - w) = 0$ and $\alpha_2 (w - 3) = 0$
4. all of the above
Reference Answer: 4
Choice 1 contains dual-feasible constraints; choice 2 contains dual-inner-optimal constraints; choice 3 contains primal-inner-optimal constraints.
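The four groups of KKT conditions for this one-variable problem (minimize ½w² subject to w ≥ 1 and w ≤ 3) can be checked mechanically. The sketch below is illustrative: the function name `kkt_report`, the tolerance, and the two candidate points are assumptions introduced here.

```python
# KKT check for: min 1/2 w^2  s.t.  w >= 1,  w <= 3
# with L(w, alpha) = 1/2 w^2 + alpha_1 (1 - w) + alpha_2 (w - 3).

def kkt_report(w, a1, a2, tol=1e-9):
    """Return which KKT groups hold at the candidate (w, alpha_1, alpha_2)."""
    return {
        "primal_feasible": (w >= 1 - tol) and (w <= 3 + tol),
        "dual_feasible": (a1 >= -tol) and (a2 >= -tol),
        # dL/dw = w - alpha_1 + alpha_2 = 0
        "dual_inner_optimal": abs(w - (a1 - a2)) <= tol,
        # complementary slackness on both constraints
        "primal_inner_optimal": (abs(a1 * (1 - w)) <= tol
                                 and abs(a2 * (w - 3)) <= tol),
    }

good = kkt_report(1.0, 1.0, 0.0)   # candidate optimum: w = 1, alpha = (1, 0)
bad = kkt_report(3.0, 3.0, 0.0)    # feasible and stationary, but not optimal

print(good)
print(bad)
```

The second candidate passes three of the four groups and fails only complementary slackness, which shows why all four groups are needed together.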

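16 Solving Dual SVM
Dual Formulation of Support Vector Machine
$$\max_{\text{all } \alpha_n \ge 0, \ \sum y_n \alpha_n = 0, \ w = \sum \alpha_n y_n z_n} \ -\tfrac{1}{2} \Bigl\| \sum_{n=1}^{N} \alpha_n y_n z_n \Bigr\|^2 + \sum_{n=1}^{N} \alpha_n$$
standard hard-margin SVM dual:
$$\min_{\alpha} \ \tfrac{1}{2} \sum_{n=1}^{N} \sum_{m=1}^{N} \alpha_n \alpha_m y_n y_m z_n^T z_m - \sum_{n=1}^{N} \alpha_n$$
$$\text{subject to} \ \sum_{n=1}^{N} y_n \alpha_n = 0; \quad \alpha_n \ge 0, \ \text{for } n = 1, 2, \ldots, N$$
(convex) QP of N variables & N + 1 constraints, as promised
how to solve? yeah, we know QP! :-)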
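The double-sum form and the squared-norm form of the dual objective describe the same quantity. A minimal sketch that evaluates both on hypothetical toy data follows; the four points and the vector α = (0.5, 0.5, 1, 0) are assumptions chosen for illustration, not values taken from the slides.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def dual_objective_sum(alpha, Z, y):
    """1/2 sum_n sum_m a_n a_m y_n y_m z_n^T z_m  -  sum_n a_n"""
    N = len(Z)
    quad = sum(alpha[n] * alpha[m] * y[n] * y[m] * dot(Z[n], Z[m])
               for n in range(N) for m in range(N))
    return 0.5 * quad - sum(alpha)

def dual_objective_norm(alpha, Z, y):
    """1/2 || sum_n a_n y_n z_n ||^2  -  sum_n a_n"""
    d = len(Z[0])
    w = [sum(alpha[n] * y[n] * Z[n][i] for n in range(len(Z))) for i in range(d)]
    return 0.5 * dot(w, w) - sum(alpha)

# Hypothetical toy data and a feasible alpha (sum_n y_n alpha_n = 0, alpha_n >= 0).
Z = [(0.0, 0.0), (2.0, 2.0), (2.0, 0.0), (3.0, 0.0)]
y = [-1.0, -1.0, 1.0, 1.0]
alpha = [0.5, 0.5, 1.0, 0.0]

print(dual_objective_sum(alpha, Z, y))
print(dual_objective_norm(alpha, Z, y))
```

The double sum touches N² inner products, whereas the norm form first accumulates one d-dimensional vector; both views reappear later when discussing solver cost.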

17 Solving Dual SVM
Dual SVM with QP Solver
optimal α = ?
$$\min_{\alpha} \ \tfrac{1}{2} \sum_{n=1}^{N} \sum_{m=1}^{N} \alpha_n \alpha_m y_n y_m z_n^T z_m - \sum_{n=1}^{N} \alpha_n \quad \text{subject to} \ \sum_{n=1}^{N} y_n \alpha_n = 0; \ \alpha_n \ge 0, \ \text{for } n = 1, 2, \ldots, N$$
optimal α ← QP(Q, p, A, c), where QP solves
$$\min_{\alpha} \ \tfrac{1}{2} \alpha^T Q \alpha + p^T \alpha \quad \text{subject to} \ a_i^T \alpha \ge c_i, \ \text{for } i = 1, 2, \ldots$$
   $q_{n,m} = y_n y_m z_n^T z_m$
   $p = -1_N$
   $a_{\ge} = y$, $a_{\le} = -y$; $a_n^T$ = n-th unit direction
   $c_{\ge} = 0$, $c_{\le} = 0$; $c_n = 0$
note: many solvers treat equality ($a_{\ge}$, $a_{\le}$) & bound ($a_n$) constraints specially for numerical stability
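The mapping from the dual to the generic form QP(Q, p, A, c) can be sketched as follows. This only assembles the inputs and checks a candidate α against them; no solver is called, and the toy data and the candidate α = (0.5, 0.5, 1, 0) are hypothetical choices for illustration.

```python
def build_dual_qp(Z, y):
    """Return (Q, p, A, c) for: min 1/2 a^T Q a + p^T a  s.t.  A a >= c."""
    N = len(Z)
    # q_{n,m} = y_n y_m z_n^T z_m
    Q = [[y[n] * y[m] * sum(s * t for s, t in zip(Z[n], Z[m]))
          for m in range(N)] for n in range(N)]
    p = [-1.0] * N
    # The equality y^T a = 0 is written as two inequalities: y^T a >= 0, -y^T a >= 0.
    A = [list(y), [-yn for yn in y]]
    # Bound constraints a_n >= 0: one unit direction per variable.
    A += [[1.0 if i == n else 0.0 for i in range(N)] for n in range(N)]
    c = [0.0] * (N + 2)
    return Q, p, A, c

def qp_objective(Q, p, a):
    N = len(a)
    quad = sum(a[n] * Q[n][m] * a[m] for n in range(N) for m in range(N))
    return 0.5 * quad + sum(pn * an for pn, an in zip(p, a))

def is_feasible(A, c, a, tol=1e-9):
    return all(sum(r * an for r, an in zip(row, a)) >= ci - tol
               for row, ci in zip(A, c))

Z = [(0.0, 0.0), (2.0, 2.0), (2.0, 0.0), (3.0, 0.0)]   # assumed toy data
y = [-1.0, -1.0, 1.0, 1.0]
Q, p, A, c = build_dual_qp(Z, y)
alpha = [0.5, 0.5, 1.0, 0.0]                           # assumed candidate

print(len(A), is_feasible(A, c, alpha), qp_objective(Q, p, alpha))
```

Splitting the equality into two inequalities gives N + 2 rows in this generic form; solvers that accept equality and bound constraints directly avoid that, in line with the note above.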

18 Solving Dual SVM
Dual SVM with Special QP Solver
optimal α ← QP($Q_D$, p, A, c)
$$\min_{\alpha} \ \tfrac{1}{2} \alpha^T Q_D \alpha + p^T \alpha \quad \text{subject to special equality and bound constraints}$$
$q_{n,m} = y_n y_m z_n^T z_m$, often non-zero
if N = 30,000, dense $Q_D$ (N by N symmetric) takes > 3G RAM
need special solver for
   not storing whole $Q_D$
   utilizing special constraints properly
to scale up to large N
usually better to use special solver in practice
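The memory figure can be reproduced with a short calculation. The sketch assumes 4-byte single-precision entries, which the slide does not state, and also shows the 8-byte case for comparison.

```python
def dense_matrix_bytes(n, bytes_per_entry=4):
    """Bytes needed to store a dense n-by-n matrix (no symmetry exploited)."""
    return n * n * bytes_per_entry

N = 30_000
for width in (4, 8):   # single vs double precision; the entry size is an assumption
    size = dense_matrix_bytes(N, width)
    print(width, size, round(size / 2**30, 2))   # bytes and GiB
```

Storing only the upper triangle would roughly halve these numbers, but the quadratic growth in N remains, which is why specialized solvers avoid materializing the whole matrix.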

19 Solving Dual SVM
Optimal (b, w)
KKT conditions: if primal-dual optimal (b, w, α),
   primal feasible: $y_n (w^T z_n + b) \ge 1$
   dual feasible: $\alpha_n \ge 0$
   dual-inner optimal: $\sum y_n \alpha_n = 0$; $w = \sum \alpha_n y_n z_n$
   primal-inner optimal (at optimal all 'Lagrange terms' disappear): $\alpha_n \bigl(1 - y_n (w^T z_n + b)\bigr) = 0$ (complementary slackness)
optimal α ⟹ optimal w? easy above!
optimal α ⟹ optimal b? a range from primal feasible & equality from comp. slackness: if one $\alpha_n > 0$ ⟹ $b = y_n - w^T z_n$
comp. slackness: $\alpha_n > 0$ ⟹ on fat boundary (SV!)

21 Solving Dual SVM
Fun Time
Consider two transformed examples $(z_1, +1)$ and $(z_2, -1)$ with $z_1 = z$ and $z_2 = -z$. After solving the dual problem of hard-margin SVM, assume that the optimal $\alpha_1$ and $\alpha_2$ are both strictly positive. What is the optimal b? [numeric answer choices not preserved; the last choice reads "not certain with the descriptions above"]
Reference Answer: 2
With the descriptions, at the optimal (b, w), $b = +1 - w^T z = -1 + w^T z$. That is, $w^T z = 1$ and $b = 0$.
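The reference answer can be reproduced numerically. For this two-example setting the equality constraint forces α_1 = α_2 = a, the dual objective becomes 2a²‖z‖² − 2a, and its minimizer is a = 1 / (2‖z‖²). The sketch below assumes z ≠ 0; the function name and the sample vector z = (2, 0) are hypothetical.

```python
def solve_two_point_dual(z):
    """Closed-form hard-margin dual solution for examples (z, +1) and (-z, -1)."""
    zz = sum(zi * zi for zi in z)
    a = 1.0 / (2.0 * zz)            # alpha_1 = alpha_2 = a, strictly positive
    alpha = [a, a]
    Z = [list(z), [-zi for zi in z]]
    y = [1.0, -1.0]
    # w = sum_n alpha_n y_n z_n
    w = [sum(alpha[n] * y[n] * Z[n][i] for n in range(2)) for i in range(len(z))]
    # b = y_n - w^T z_n, evaluated on each support vector
    b_from = [y[n] - sum(wi * zi for wi, zi in zip(w, Z[n])) for n in range(2)]
    return alpha, w, b_from

z = (2.0, 0.0)                      # assumed sample value of z
alpha, w, b_from = solve_two_point_dual(z)
w_dot_z = sum(wi * zi for wi, zi in zip(w, z))
print(alpha, w, w_dot_z, b_from)
```

Both support vectors yield the same b, as complementary slackness requires, and the value agrees with the reference answer.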

22 Messages behind Dual SVM
Support Vectors Revisited
on boundary: 'locates' fattest hyperplane; others: not needed
examples with $\alpha_n > 0$: on boundary
call $\alpha_n > 0$ examples $(z_n, y_n)$ support vectors (rather than mere candidates)
SV (positive $\alpha_n$) ⊆ SV candidates (on boundary)
[Figure: separating hyperplane with the examples on its fat boundary]
only SV needed to compute w: $w = \sum_{n=1}^{N} \alpha_n y_n z_n = \sum_{\text{SV}} \alpha_n y_n z_n$
only SV needed to compute b: $b = y_n - w^T z_n$ with any SV $(z_n, y_n)$
SVM: learn fattest hyperplane by identifying support vectors with dual optimal solution
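A minimal sketch of recovering (b, w) from the support vectors alone is given below. The four toy points and the dual solution α = (0.5, 0.5, 1, 0) are assumptions for illustration; the code does not solve the dual, it only uses the given α.

```python
def dot(u, v):
    return sum(s * t for s, t in zip(u, v))

def weights(alpha, Z, y, indices):
    """w = sum over the given indices of alpha_n y_n z_n."""
    d = len(Z[0])
    return [sum(alpha[n] * y[n] * Z[n][i] for n in indices) for i in range(d)]

Z = [(0.0, 0.0), (2.0, 2.0), (2.0, 0.0), (3.0, 0.0)]   # assumed toy data
y = [-1.0, -1.0, 1.0, 1.0]
alpha = [0.5, 0.5, 1.0, 0.0]                           # assumed dual solution

support = [n for n, a in enumerate(alpha) if a > 0]    # examples with alpha_n > 0
w_all = weights(alpha, Z, y, range(len(Z)))
w_sv = weights(alpha, Z, y, support)
b_values = [y[n] - dot(w_sv, Z[n]) for n in support]   # b from each SV
b = b_values[0]

def g_svm(x):
    """Hypothesis sign(w^T x + b), with the identity transform assumed."""
    return 1.0 if dot(w_sv, x) + b > 0 else -1.0

print(support, w_sv, b_values, [g_svm(z) for z in Z])
```

The fourth example has α_n = 0 and could be dropped from the data without changing w or b, which is the sense in which only the support vectors are needed.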

23 Messages behind Dual SVM
Representation of Fattest Hyperplane
SVM: $w_{\text{SVM}} = \sum_{n=1}^{N} \alpha_n (y_n z_n)$, with $\alpha_n$ from dual solution
PLA: $w_{\text{PLA}} = \sum_{n=1}^{N} \beta_n (y_n z_n)$, with $\beta_n$ by # mistake corrections
w = linear combination of $y_n z_n$
   also true for GD/SGD-based LogReg/LinReg when $w_0 = 0$
   call w 'represented' by data
SVM: represent w by SVs only
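The PLA side of this comparison can be illustrated with a short sketch that counts the corrections made on each example and rebuilds w from those counts. The toy data (with a constant first coordinate standing in for the bias), the update cap, and the choice to always correct the first misclassified example are assumptions made here.

```python
def dot(u, v):
    return sum(s * t for s, t in zip(u, v))

def pla(Z, y, max_updates=1000):
    """Perceptron learning from w = 0; beta[n] counts corrections on example n."""
    w = [0.0] * len(Z[0])
    beta = [0] * len(Z)
    for _ in range(max_updates):
        wrong = [n for n in range(len(Z)) if y[n] * dot(w, Z[n]) <= 0]
        if not wrong:
            break
        n = wrong[0]
        w = [wi + y[n] * zi for wi, zi in zip(w, Z[n])]   # w <- w + y_n z_n
        beta[n] += 1
    return w, beta

# Assumed linearly separable toy data; the leading 1.0 absorbs the bias term.
Z = [(1.0, 0.0, 0.0), (1.0, 2.0, 2.0), (1.0, 2.0, 0.0), (1.0, 3.0, 0.0)]
y = [-1.0, -1.0, 1.0, 1.0]

w, beta = pla(Z, y)
w_rebuilt = [sum(beta[n] * y[n] * Z[n][i] for n in range(len(Z)))
             for i in range(len(Z[0]))]
print(w, beta, w_rebuilt)
```

Unlike the SVM multipliers, nothing forces the counts β_n to be zero away from the boundary, so the PLA representation is generally not sparse in the same way.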

24 Messages behind Dual SVM
Summary: Two Forms of Hard-Margin SVM
Primal Hard-Margin SVM
   $\min_{b,w} \ \tfrac{1}{2} w^T w$ sub. to $y_n (w^T z_n + b) \ge 1$, for n = 1, 2, ..., N
   d + 1 variables, N constraints: suitable when d + 1 small
   physical meaning: locate specially-scaled (b, w)
Dual Hard-Margin SVM
   $\min_{\alpha} \ \tfrac{1}{2} \alpha^T Q_D \alpha - 1^T \alpha$ s.t. $y^T \alpha = 0$; $\alpha_n \ge 0$ for n = 1, ..., N
   N variables, N + 1 simple constraints: suitable when N small
   physical meaning: locate SVs $(z_n, y_n)$ & their $\alpha_n$
both eventually result in optimal (b, w) for fattest hyperplane $g_{\text{SVM}}(x) = \text{sign}(w^T \Phi(x) + b)$

25 Messages behind Dual SVM
Are We Done Yet?
goal: SVM without dependence on d
$$\min_{\alpha} \ \tfrac{1}{2} \alpha^T Q_D \alpha - 1^T \alpha \quad \text{subject to} \ y^T \alpha = 0; \ \alpha_n \ge 0, \ \text{for } n = 1, 2, \ldots, N$$
N variables, N + 1 constraints: no dependence on d?
$q_{n,m} = y_n y_m z_n^T z_m$: inner product in $\mathbb{R}^d$, O(d) via naïve computation!
no dependence only if avoiding naïve computation (next lecture :-))

27 Messages behind Dual SVM
Fun Time
Consider applying dual hard-margin SVM on N = 5566 examples and getting 1126 SVs. Which of the following can be the number of examples that are on the fat boundary, that is, SV candidates? [numeric answer choices not preserved]
Reference Answer: 3
Because SVs are always on the fat boundary, # SVs ≤ # SV candidates ≤ N.

28 Messages behind Dual SVM
Summary
1. Embedding Numerous Features: Kernel Models
   Lecture 2: Dual Support Vector Machine
      Motivation of Dual SVM: want to remove dependence on d
      Lagrange Dual SVM: KKT conditions link primal/dual
      Solving Dual SVM: another QP, better solved with special solver
      Messages behind Dual SVM: SVs represent fattest hyperplane
   next: computing inner product in $\mathbb{R}^d$ efficiently
2. Combining Predictive Features: Aggregation Models
3. Distilling Implicit Features: Extraction Models
