Machine Learning Techniques (機器學習技巧)
Lecture 5: SVM and Logistic Regression
Hsuan-Tien Lin (林軒田), htlin@csie.ntu.edu.tw
Department of Computer Science & Information Engineering, National Taiwan University (國立台灣大學資訊工程系)
Agenda
Lecture 5: SVM and Logistic Regression
1. Soft-Margin SVM as Regularization
2. SVM versus Logistic Regression
3. SVM for Soft Classification
4. Kernel Logistic Regression
Soft-Margin SVM as Regularization
Wrap-Up

Hard-Margin Primal:
  \min_{b,w} \frac{1}{2} w^T w \quad \text{s.t.} \quad y_n(w^T z_n + b) \ge 1

Soft-Margin Primal:
  \min_{b,w,\xi} \frac{1}{2} w^T w + C \sum_{n=1}^N \xi_n \quad \text{s.t.} \quad y_n(w^T z_n + b) \ge 1 - \xi_n, \; \xi_n \ge 0

Hard-Margin Dual:
  \min_{\alpha} \frac{1}{2} \alpha^T Q \alpha - \mathbf{1}^T \alpha \quad \text{s.t.} \quad y^T \alpha = 0, \; 0 \le \alpha_n

Soft-Margin Dual:
  \min_{\alpha} \frac{1}{2} \alpha^T Q \alpha - \mathbf{1}^T \alpha \quad \text{s.t.} \quad y^T \alpha = 0, \; 0 \le \alpha_n \le C

soft-margin SVM preferred in practice; linear: LIBLINEAR; non-linear: LIBSVM
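To make the soft-margin dual concrete, here is a minimal sketch that feeds it to a generic QP solver; cvxopt and the RBF kernel choice are assumptions for illustration, not part of the lecture:

```python
# Sketch: soft-margin SVM dual as a generic QP
#   min_a  (1/2) a^T Q a - 1^T a   s.t.  y^T a = 0,  0 <= a_n <= C
import numpy as np
from cvxopt import matrix, solvers

def rbf_kernel(X1, X2, gamma=1.0):
    # illustrative kernel; any valid kernel K(x, x') works here
    d2 = ((X1[:, None, :] - X2[None, :, :]) ** 2).sum(-1)
    return np.exp(-gamma * d2)

def soft_margin_dual(X, y, C=1.0, gamma=1.0):
    N = len(y)
    K = rbf_kernel(X, X, gamma)
    Q = np.outer(y, y) * K                        # Q_{nm} = y_n y_m K(x_n, x_m)
    P, q = matrix(Q), matrix(-np.ones(N))
    G = matrix(np.vstack([-np.eye(N), np.eye(N)]))            # -a_n <= 0 and a_n <= C
    h = matrix(np.hstack([np.zeros(N), C * np.ones(N)]))
    A, b = matrix(y.astype(float).reshape(1, -1)), matrix(0.0)  # y^T a = 0
    sol = solvers.qp(P, q, G, h, A, b)
    return np.array(sol['x']).ravel()             # optimal alpha_n
```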
Soft-Margin SVM as Regularization
Slack Variables \xi_n

record margin violation by \xi_n and penalize the total violation:
  \min_{b,w,\xi} \frac{1}{2} w^T w + C \sum_{n=1}^N \xi_n
  \text{s.t.} \quad y_n(w^T z_n + b) \ge 1 - \xi_n \text{ and } \xi_n \ge 0 \text{ for all } n

on any (b, w), \xi_n = margin violation = \max(1 - y_n(w^T z_n + b), 0)
- (x_n, y_n) violating margin: \xi_n = 1 - y_n(w^T z_n + b)
- (x_n, y_n) not violating margin: \xi_n = 0

unconstrained form of soft-margin SVM:
  \min_{b,w} \frac{1}{2} w^T w + C \sum_{n=1}^N \max(1 - y_n(w^T z_n + b), 0)
Soft-Margin SVM as Regularization
Unconstrained Form

  \min_{b,w} \frac{1}{2} w^T w + C \sum_{n=1}^N \max(1 - y_n(w^T z_n + b), 0)

familiar? :-) just like L2 regularization
  \min \frac{1}{2} w^T w + C \sum \widehat{err} \quad \text{versus} \quad \min \frac{\lambda}{N} w^T w + \frac{1}{N} \sum err
with shorter w, another trade-off parameter, and a special err

why not just solve this? :-)
- not QP, so no (?) kernel trick
- \max(\cdot, 0) not differentiable, harder to solve
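As a quick illustration of this unconstrained form, the sketch below evaluates the objective and one subgradient directly; the function names and the plain-subgradient treatment of the max(·, 0) kink are illustrative assumptions, not part of the lecture:

```python
import numpy as np

def soft_margin_objective(w, b, Z, y, C):
    """(1/2) w^T w + C * sum_n max(1 - y_n (w^T z_n + b), 0)."""
    margins = y * (Z @ w + b)
    return 0.5 * w @ w + C * np.maximum(1.0 - margins, 0.0).sum()

def soft_margin_subgradient(w, b, Z, y, C):
    """One valid subgradient of the non-differentiable objective."""
    margins = y * (Z @ w + b)
    active = (margins < 1.0).astype(float)   # examples currently violating the margin
    grad_w = w - C * (active * y) @ Z
    grad_b = -C * (active * y).sum()
    return grad_w, grad_b
```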
Soft-Margin SVM as Regularization
SVM as Regularization

                                minimize                              constraint
regularization by constraint:   E_{in}                                w^T w \le C
hard-margin SVM:                w^T w                                 E_{in} = 0 [and more]
L2 regularization:              (\lambda/N) w^T w + E_{in}            (none)
soft-margin SVM:                (1/2) w^T w + C \widehat{E}_{in}      (none)

- large margin ⟺ fewer hyperplanes ⟺ L2 regularization of short w
- soft margin ⟺ special \widehat{err}
- larger C ⟺ smaller \lambda ⟺ less regularization

viewing SVM as regularization: allows extending/connecting to other learning models
SVM versus Logistic Regression
Error Function of SVM

  \min_{b,w} \frac{1}{2} w^T w + C \sum_{n=1}^N \max(1 - y_n(w^T z_n + b), 0)

linear score s = w^T z_n + b
- err_{0/1}(s, y) = [[ys \le 0]]
- \widehat{err}_{SVM}(s, y) = \max(1 - ys, 0): upper bound of err_{0/1}, often called the hinge error measure

[figure: err_{0/1} and \widehat{err}_{SVM} plotted against ys]

\widehat{err}_{SVM}: algorithmic error measure by convex upper bound of err_{0/1}
SVM versus Logistic Regression
Connection between SVM and Logistic Regression

linear score s = w^T z_n + b
- err_{0/1}(s, y) = [[ys \le 0]]
- \widehat{err}_{SVM}(s, y) = \max(1 - ys, 0): upper bound of err_{0/1}
- err_{SCE}(s, y) = \log_2(1 + \exp(-ys)): another upper bound of err_{0/1}, used in logistic regression

[figure: err_{0/1}, \widehat{err}_{SVM}, and scaled cross-entropy plotted against ys]

for ys \to -\infty: \widehat{err}_{SVM}(s, y) \approx -ys and (\ln 2) \cdot err_{SCE}(s, y) \approx -ys
for ys \to +\infty: \widehat{err}_{SVM}(s, y) = 0 and (\ln 2) \cdot err_{SCE}(s, y) \approx 0

SVM \approx L2-regularized logistic regression
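The relationship between the three error measures is easy to check numerically; the short sketch below (numpy assumed, grid values chosen only for illustration) evaluates them on a range of ys:

```python
import numpy as np

ys = np.linspace(-3, 3, 13)                # the ys axis of the plot above
err_01  = (ys <= 0).astype(float)          # err_0/1(s, y) = [[ys <= 0]]
err_svm = np.maximum(1.0 - ys, 0.0)        # hinge error
err_sce = np.log2(1.0 + np.exp(-ys))       # scaled cross-entropy

# both convex measures upper-bound err_0/1 everywhere on this grid
assert np.all(err_svm >= err_01) and np.all(err_sce >= err_01)
```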
SVM versus Logistic Regression
Linear Models for Binary Classification

PLA: minimize err_{0/1} specially
- pros: efficient if linearly separable
- cons: works only if linearly separable, otherwise needs pocket

soft-margin SVM: minimize regularized \widehat{err}_{SVM} by QP
- pros: easy optimization & theoretical guarantee
- cons: loose bound of err_{0/1} for very negative ys

regularized logistic regression for classification: minimize regularized err_{SCE} by GD/SGD/...
- pros: easy optimization & regularization guard
- cons: loose bound of err_{0/1} for very negative ys

regularized LogReg = approximate SVM; SVM = approximate LogReg (?)
SVM for Soft Classification
SVM for Soft Classification

Naïve Idea 1:
1. run SVM and get (b_{SVM}, w_{SVM})
2. return g(x) = \theta(w_{SVM}^T x + b_{SVM})
- direct use of similarity: works reasonably well
- no LogReg flavor

Naïve Idea 2:
1. run SVM and get (b_{SVM}, w_{SVM})
2. run LogReg with (b_{SVM}, w_{SVM}) as w_0
3. return LogReg solution as g(x)
- not really easier than the original LogReg
- SVM flavor (kernel?) lost

want: flavors from both sides
SVM for Soft Classification
A Possible Model: Two-Level Learning

  g(x) = \theta\big(A \cdot (w_{SVM}^T \Phi(x) + b_{SVM}) + B\big)

- SVM flavor: fix hyperplane direction by w_{SVM}; kernel applies
- LogReg flavor: fine-tune hyperplane to match maximum likelihood by scaling (A) and shifting (B)
- often A > 0 if w_{SVM} reasonably good
- often B \approx 0 if b_{SVM} reasonably good

new LogReg problem:
  \min_{A,B} \frac{1}{N} \sum_{n=1}^N \log\Big(1 + \exp\Big(-y_n \big(A \cdot \underbrace{(w_{SVM}^T \Phi(x_n) + b_{SVM})}_{\Phi_{SVM}(x_n)} + B\big)\Big)\Big)

two-level learning: LogReg on SVM-transformed data
SVM for Soft Classification
Probabilistic SVM

Platt's Model of Probabilistic SVM for Soft Classification:
1. run SVM on D to get (b_{SVM}, w_{SVM}) [or the equivalent \alpha], and transform D to z'_n = w_{SVM}^T \Phi(x_n) + b_{SVM}
   (the actual model performs this step more sophisticatedly)
2. run LogReg on \{(z'_n, y_n)\}_{n=1}^N to get (A, B)
   (the actual model adds some special regularization here)
3. return g(x) = \theta\big(A \cdot (w_{SVM}^T \Phi(x) + b_{SVM}) + B\big)

- soft classifier does not share the same boundary as the SVM classifier, because of B
- how to solve LogReg: GD/SGD/or better, because there are only two variables

kernel SVM \Longrightarrow approx. LogReg in Z-space
exact LogReg in Z-space?
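A minimal sketch of this two-level recipe with scikit-learn (an assumption; the lecture does not prescribe a library, and this is plain Platt scaling rather than the more sophisticated variant mentioned above):

```python
import numpy as np
from sklearn.svm import SVC
from sklearn.linear_model import LogisticRegression

def platt_probabilistic_svm(X, y, X_test, C=1.0, gamma=1.0):
    # step 1: kernel SVM produces the score z'_n = w_SVM^T Phi(x_n) + b_SVM
    svm = SVC(C=C, kernel='rbf', gamma=gamma).fit(X, y)
    scores = svm.decision_function(X).reshape(-1, 1)

    # step 2: one-dimensional LogReg learns the scaling A and shift B
    logreg = LogisticRegression().fit(scores, y)

    # step 3: g(x) = theta(A * score(x) + B)
    test_scores = svm.decision_function(X_test).reshape(-1, 1)
    return logreg.predict_proba(test_scores)[:, 1]    # P(y = +1 | x)
```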
Kernel Logistic Regression
Key behind Kernel Trick

one key behind the kernel trick: optimal w_* = \sum_{n=1}^N \beta_n z_n, because then
  w_*^T z = \sum_{n=1}^N \beta_n z_n^T z = \sum_{n=1}^N \beta_n K(x_n, x)

- SVM: w_{SVM} = \sum_n (\alpha_n y_n) z_n, with \alpha_n from the dual solution
- PLA: w_{PLA} = \sum_n (\alpha_n y_n) z_n, with \alpha_n counting mistake corrections
- LogReg by SGD: w_{LOGREG} = \sum_n (\alpha_n y_n) z_n, with \alpha_n from the total SGD moves

when can the optimal w be represented by z_n?
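A tiny illustration (numpy-style Python, names illustrative) of why this representation matters: once w = \sum_n \beta_n z_n, the score on any x needs only kernel evaluations, never the explicit transform z = \Phi(x):

```python
def kernel_score(beta, X_train, x, kernel):
    # w^T z = sum_n beta_n * K(x_n, x), computed without forming z explicitly
    return sum(b_n * kernel(x_n, x) for b_n, x_n in zip(beta, X_train))
```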
Kernel Logistic Regression
Representer Theorem

claim: for any L2-regularized linear model
  \min_w \frac{\lambda}{N} w^T w + \frac{1}{N} \sum_{n=1}^N err(y_n, w^T z_n),
the optimal w_* = \sum_{n=1}^N \beta_n z_n.

- let optimal w_* = w_{\|} + w_{\perp}, where w_{\|} \in span(z_n) and w_{\perp} \perp span(z_n)
- want w_{\perp} = 0; what if not? consider w_{\|}:
  - same err as w_*, since w_{\perp}^T z_n = 0: err(y_n, w_{\|}^T z_n) = err(y_n, (w_{\|} + w_{\perp})^T z_n)
  - smaller regularizer than w_*: w_*^T w_* = w_{\|}^T w_{\|} + 2 w_{\|}^T w_{\perp} + w_{\perp}^T w_{\perp} > w_{\|}^T w_{\|}
  - so w_{\|} is "more optimal" than w_* (contradiction!)

any L2-regularized linear model can be kernelized!
Kernel Logistic Regression
Kernel Logistic Regression

solving L2-regularized logistic regression
  \min_w \frac{\lambda}{N} w^T w + \frac{1}{N} \sum_{n=1}^N \log\big(1 + \exp(-y_n w^T z_n)\big)
yields an optimal solution of the form w_* = \sum_{n=1}^N \beta_n z_n.

without loss of generality, can solve for the optimal \beta instead of w:
  \min_{\beta} \frac{\lambda}{N} \sum_{n=1}^N \sum_{m=1}^N \beta_n \beta_m K(x_n, x_m) + \frac{1}{N} \sum_{n=1}^N \log\Big(1 + \exp\big(-y_n \sum_{m=1}^N \beta_m K(x_m, x_n)\big)\Big)

how? GD/SGD/... for unconstrained optimization

kernel logistic regression: use representer theorem for kernel trick on L2-regularized logistic regression
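A minimal gradient-descent sketch of this β-space objective (numpy assumed; the step size, iteration count, and function names are illustrative choices, not from the lecture):

```python
import numpy as np

def kernel_logistic_regression(K, y, lam, eta=0.01, iters=2000):
    """Minimize (lam/N) b^T K b + (1/N) sum_n log(1 + exp(-y_n (K b)_n)) over beta."""
    N = len(y)
    beta = np.zeros(N)
    for _ in range(iters):
        s = K @ beta                             # s_n = sum_m beta_m K(x_m, x_n)
        sigma = 1.0 / (1.0 + np.exp(y * s))      # sigmoid(-y_n s_n)
        grad = (2.0 * lam / N) * (K @ beta) - (1.0 / N) * K @ (y * sigma)
        beta -= eta * grad
    return beta

def klr_soft_prediction(beta, K_test):
    """theta(sum_m beta_m K(x_m, x)); rows of K_test index test points, columns index x_m."""
    return 1.0 / (1.0 + np.exp(-(K_test @ beta)))
```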
Kernel Logistic Regression
Kernel Logistic Regression: Another View

  \min_{\beta} \frac{\lambda}{N} \sum_{n=1}^N \sum_{m=1}^N \beta_n \beta_m K(x_n, x_m) + \frac{1}{N} \sum_{n=1}^N \log\Big(1 + \exp\big(-y_n \sum_{m=1}^N \beta_m K(x_m, x_n)\big)\Big)

- \sum_{m=1}^N \beta_m K(x_m, x_n): inner product between variables \beta and transformed data (K(x_1, x_n), K(x_2, x_n), ..., K(x_N, x_n))
- \sum_{n=1}^N \sum_{m=1}^N \beta_n \beta_m K(x_n, x_m): a special regularizer \beta^T K \beta

KLR = linear model of \beta with kernel as transform & kernel regularizer
    = linear model of w with kernel-embedded transform & L2 regularizer
(similar for SVM)

different routes to the same destination: allows extension from different views
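To see why the first term is exactly the usual L2 regularizer in disguise, substitute w = \sum_n \beta_n z_n (standard algebra, spelled out here for completeness):

```latex
w^T w = \Big(\sum_{n=1}^N \beta_n z_n\Big)^T \Big(\sum_{m=1}^N \beta_m z_m\Big)
      = \sum_{n=1}^N \sum_{m=1}^N \beta_n \beta_m \, z_n^T z_m
      = \sum_{n=1}^N \sum_{m=1}^N \beta_n \beta_m \, K(x_n, x_m)
      = \beta^T K \beta
```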
Summary
Lecture 5: SVM and Logistic Regression
- Soft-Margin SVM as Regularization: L2 regularization with hinge error measure
- SVM versus Logistic Regression: SVM \approx L2-regularized logistic regression
- SVM for Soft Classification: common approach is two-level learning
- Kernel Logistic Regression: representer theorem on L2-regularized LogReg