Machine Learning Techniques


1 Machine Learning Techniques (機器學習技巧)
Lecture 6: Kernel Models for Regression
Hsuan-Tien Lin (林軒田), htlin@csie.ntu.edu.tw
Department of Computer Science & Information Engineering, National Taiwan University (國立台灣大學資訊工程系)
Hsuan-Tien Lin (NTU CSIE) Machine Learning Techniques 0/23

2 Agenda
Lecture 6: Kernel Models for Regression
Kernel Ridge Regression
Support Vector Regression Primal
Support Vector Regression Dual
Summary of Kernel Models

3 Kernel Ridge Regression
Recall: Representer Theorem
for any L2-regularized linear model
$$\min_{\mathbf{w}} \; \frac{\lambda}{N}\mathbf{w}^T\mathbf{w} + \frac{1}{N}\sum_{n=1}^{N} \mathrm{err}(y_n, \mathbf{w}^T\mathbf{z}_n)$$
optimal $\mathbf{w}_* = \sum_{n=1}^{N} \beta_n \mathbf{z}_n$
any L2-regularized linear model can be kernelized!
regression with squared error $\mathrm{err}(y, \mathbf{w}^T\mathbf{z}) = (y - \mathbf{w}^T\mathbf{z})^2$:
analytic solution for linear/ridge regression
analytic solution for kernel ridge regression?
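
To see why the representer theorem makes kernelization possible, one can substitute the optimal form of $\mathbf{w}$ into the two places where $\mathbf{w}$ appears in the objective. The following is a sketch, assuming the usual kernel definition $K(\mathbf{x}, \mathbf{x}') = \mathbf{z}^T\mathbf{z}'$ (the symbol $\mathbf{x}'$ is introduced here only for illustration):

```latex
\begin{align*}
\mathbf{w}_*^T \mathbf{z}
  &= \sum_{n=1}^{N} \beta_n \mathbf{z}_n^T \mathbf{z}
   = \sum_{n=1}^{N} \beta_n K(\mathbf{x}_n, \mathbf{x}), \\
\mathbf{w}_*^T \mathbf{w}_*
  &= \sum_{n=1}^{N}\sum_{m=1}^{N} \beta_n \beta_m \mathbf{z}_n^T \mathbf{z}_m
   = \sum_{n=1}^{N}\sum_{m=1}^{N} \beta_n \beta_m K(\mathbf{x}_n, \mathbf{x}_m).
\end{align*}
```

Both the score and the regularizer then depend on the data only through kernel evaluations, so $\mathbf{z}$ never has to be computed explicitly.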

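4 Kernel Ridge Regression
Kernel Ridge Regression Problem
solving ridge regression
$$\min_{\mathbf{w}} \; \frac{\lambda}{N}\mathbf{w}^T\mathbf{w} + \frac{1}{N}\sum_{n=1}^{N} (y_n - \mathbf{w}^T\mathbf{z}_n)^2$$
yields optimal solution $\mathbf{w}_* = \sum_{n=1}^{N} \beta_n \mathbf{z}_n$
without loss of generality, can solve for optimal $\boldsymbol{\beta}$ instead of $\mathbf{w}$:
$$\min_{\boldsymbol{\beta}} \; \underbrace{\frac{\lambda}{N}\sum_{n=1}^{N}\sum_{m=1}^{N} \beta_n \beta_m K(\mathbf{x}_n, \mathbf{x}_m)}_{\text{regularization of } \boldsymbol{\beta} \text{ on } K\text{-based regularizer}} + \underbrace{\frac{1}{N}\sum_{n=1}^{N}\Big(y_n - \sum_{m=1}^{N} \beta_m K(\mathbf{x}_n, \mathbf{x}_m)\Big)^2}_{\text{linear regression of } \boldsymbol{\beta} \text{ on } K\text{-based features}}$$
$$= \frac{\lambda}{N}\boldsymbol{\beta}^T\mathbf{K}\boldsymbol{\beta} + \frac{1}{N}\left(\boldsymbol{\beta}^T\mathbf{K}^T\mathbf{K}\boldsymbol{\beta} - 2\boldsymbol{\beta}^T\mathbf{K}^T\mathbf{y} + \mathbf{y}^T\mathbf{y}\right)$$
kernel ridge regression: use representer theorem for kernel trick on ridge regression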
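
The step from the double-sum form to the matrix form can be checked numerically. The snippet below is a small sanity check rather than part of the lecture; the random data, the linear kernel matrix `A @ A.T`, and all variable names are assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(2)
N = 5
A = rng.standard_normal((N, 3))
K = A @ A.T                      # a valid (PSD) kernel matrix, assumed for the demo
beta = rng.standard_normal(N)
y = rng.standard_normal(N)
lam = 0.3

# Double-sum form: regularizer on beta plus squared error on K-based features.
reg_sum = sum(beta[n] * beta[m] * K[n, m] for n in range(N) for m in range(N))
err_sum = sum(
    (y[n] - sum(beta[m] * K[n, m] for m in range(N))) ** 2 for n in range(N)
)
e_sum = lam / N * reg_sum + err_sum / N

# Matrix form: (lam/N) b^T K b + (1/N)(b^T K^T K b - 2 b^T K^T y + y^T y).
e_mat = lam / N * (beta @ K @ beta) + (
    beta @ K.T @ K @ beta - 2.0 * (beta @ K.T @ y) + y @ y
) / N
```

The two values `e_sum` and `e_mat` agree up to floating-point rounding, which is what the matrix rewrite on the slide claims.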

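5 Kernel Ridge Regression
Solving Kernel Ridge Regression
$$E_{\text{aug}}(\boldsymbol{\beta}) = \frac{\lambda}{N}\boldsymbol{\beta}^T\mathbf{K}\boldsymbol{\beta} + \frac{1}{N}\left(\boldsymbol{\beta}^T\mathbf{K}^T\mathbf{K}\boldsymbol{\beta} - 2\boldsymbol{\beta}^T\mathbf{K}^T\mathbf{y} + \mathbf{y}^T\mathbf{y}\right)$$
$$\nabla E_{\text{aug}}(\boldsymbol{\beta}) = \frac{2}{N}\left(\lambda\mathbf{K}^T\mathbf{I}\boldsymbol{\beta} + \mathbf{K}^T\mathbf{K}\boldsymbol{\beta} - \mathbf{K}^T\mathbf{y}\right) = \frac{2}{N}\mathbf{K}^T\left((\lambda\mathbf{I} + \mathbf{K})\boldsymbol{\beta} - \mathbf{y}\right)$$
want $\nabla E_{\text{aug}}(\boldsymbol{\beta}) = \mathbf{0}$: one analytic solution
$$\boldsymbol{\beta} = (\lambda\mathbf{I} + \mathbf{K})^{-1}\mathbf{y}$$
$(\cdot)^{-1}$ always exists for $\lambda > 0$, because $\mathbf{K}$ is positive semi-definite (Mercer's condition, remember? :-))
time complexity: $O(N^3)$ with simple dense matrix inversion
can now do non-linear regression easily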
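
The analytic solution above can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the lecture's own code: the Gaussian kernel, its `gamma` value, the synthetic sine data, and the function names are all chosen here for the example.

```python
import numpy as np

def gaussian_kernel(X1, X2, gamma=1.0):
    """K[i, j] = exp(-gamma * ||X1[i] - X2[j]||^2)."""
    sq = (
        np.sum(X1**2, axis=1)[:, None]
        + np.sum(X2**2, axis=1)[None, :]
        - 2.0 * X1 @ X2.T
    )
    return np.exp(-gamma * np.maximum(sq, 0.0))

def fit_kernel_ridge(K, y, lam):
    """Solve (lam * I + K) beta = y for the N x N kernel matrix K."""
    N = K.shape[0]
    # Solving the linear system avoids forming the inverse explicitly.
    return np.linalg.solve(lam * np.eye(N) + K, y)

def predict_kernel_ridge(K_new_train, beta):
    """g(x) = sum_n beta_n K(x_n, x), one row of K_new_train per new x."""
    return K_new_train @ beta

rng = np.random.default_rng(0)
X = rng.uniform(-3.0, 3.0, size=(40, 1))
y = np.sin(X[:, 0]) + 0.1 * rng.standard_normal(40)

lam = 0.1
K = gaussian_kernel(X, X, gamma=0.5)
beta = fit_kernel_ridge(K, y, lam)

# The gradient from the slide should vanish at the solution.
grad = (2.0 / len(y)) * K.T @ ((lam * np.eye(len(y)) + K) @ beta - y)
train_pred = predict_kernel_ridge(K, beta)
```

Here `np.linalg.solve` still costs $O(N^3)$ for a dense system, matching the complexity noted on the slide, and `beta` generally has no zero entries.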

6 Kernel Ridge Regression
Linear versus Kernel Ridge Regression
linear ridge regression:
  $\mathbf{w} = (\lambda\mathbf{I} + \mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T\mathbf{y}$
  more restricted
  $O(d^3 + d^2 N)$ training; $O(d)$ prediction
  efficient when $N \gg d$
kernel ridge regression:
  $\boldsymbol{\beta} = (\lambda\mathbf{I} + \mathbf{K})^{-1}\mathbf{y}$
  more flexible with $K$
  $O(N^3)$ training; $O(N)$ prediction
  hard for big data
linear versus kernel: trade-off between efficiency and flexibility
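
With the linear kernel $\mathbf{K} = \mathbf{X}\mathbf{X}^T$, the two formulas describe the same model, since $\mathbf{w} = \mathbf{X}^T\boldsymbol{\beta}$. The check below is a sketch on random data (the sizes, seed, and variable names are assumptions for illustration); it only confirms the algebraic identity, not any performance claim.

```python
import numpy as np

rng = np.random.default_rng(1)
N, d, lam = 30, 4, 0.5
X = rng.standard_normal((N, d))
y = rng.standard_normal(N)

# Linear ridge regression: a d x d system, O(d^3 + d^2 N).
w_linear = np.linalg.solve(lam * np.eye(d) + X.T @ X, X.T @ y)

# Kernel ridge regression with the linear kernel: an N x N system, O(N^3).
K = X @ X.T
beta = np.linalg.solve(lam * np.eye(N) + K, y)

# Representer theorem: w is a linear combination of the training inputs.
w_from_beta = X.T @ beta
```

The vectors `w_linear` and `w_from_beta` coincide up to rounding, so the choice between the two routes is about cost ($d$ versus $N$) and about whether a non-linear kernel is wanted.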

7 Kernel Ridge Regression
Fun Time

8 Support Vector Regression Primal
Soft-Margin SVM versus Least-Squares SVM
least-squares SVM (LSSVM) = kernel ridge regression for classification
[Figure: decision boundaries of soft-margin Gaussian SVM and Gaussian LSSVM]
LSSVM: similar boundary, many more SVs $\Longrightarrow$ slower prediction, dense $\boldsymbol{\beta}$ (BIG $g$)
dense $\boldsymbol{\beta}$: LSSVM, kernel LogReg; sparse $\boldsymbol{\alpha}$: standard SVM
want: sparse $\boldsymbol{\beta}$ like standard SVM

9 Support Vector Regression Primal
Tube Regression
will consider a tube:
  within the tube: no error
  outside the tube: error by distance to tube
error measure: $\mathrm{err}(y, s) = \max(0, |s - y| - \epsilon)$
  $|s - y| \le \epsilon$: $0$
  $|s - y| > \epsilon$: $|s - y| - \epsilon$
usually called $\epsilon$-insensitive error with $\epsilon > 0$
todo: L2-regularized tube regression to get sparse $\boldsymbol{\beta}$

10 Support Vector Regression Primal
Tube versus Squared Regression
tube: $\mathrm{err}(y, s) = \max(0, |s - y| - \epsilon)$
squared: $\mathrm{err}(y, s) = (s - y)^2$
[Figure: err versus $s$ for the tube and squared error measures]
tube $\approx$ squared when $|s - y|$ small & less affected by outliers
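
The two error measures can be compared directly on a few scores. This is an illustrative sketch; the target `y = 0`, the tube width `eps = 0.5`, and the sample scores are arbitrary values chosen here.

```python
import numpy as np

def tube_error(y, s, eps):
    """Epsilon-insensitive error: zero inside the tube, linear outside."""
    return np.maximum(0.0, np.abs(s - y) - eps)

def squared_error(y, s):
    """Squared error: grows quadratically with the distance |s - y|."""
    return (s - y) ** 2

y = 0.0
eps = 0.5
scores = np.array([0.0, 0.3, 0.5, 1.0, 5.0])

tube = tube_error(y, scores, eps)   # [0.0, 0.0, 0.0, 0.5, 4.5]
sq = squared_error(y, scores)       # approx. [0.0, 0.09, 0.25, 1.0, 25.0]
```

For the outlying score 5.0 the tube error is 4.5 while the squared error is 25, which is the sense in which tube regression is less affected by outliers.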

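11 Support Vector Regression Primal
L2-Regularized Tube Regression
$$\min_{\mathbf{w}} \; \frac{\lambda}{N}\mathbf{w}^T\mathbf{w} + \frac{1}{N}\sum_{n=1}^{N} \max\left(0, |\mathbf{w}^T\mathbf{z}_n - y_n| - \epsilon\right)$$
Regularized Tube Regr.:
  $\min \; \frac{\lambda}{N}\mathbf{w}^T\mathbf{w} + \frac{1}{N}\sum \text{tube violation}$
  unconstrained, but max not differentiable
  'representer' to kernelize, but no obvious sparsity
standard SVM:
  $\min \; \frac{1}{2}\mathbf{w}^T\mathbf{w} + C\sum \text{margin violation}$
  not differentiable, but QP
  dual to kernelize, KKT conditions $\Longrightarrow$ sparsity
will mimic standard SVM derivation:
$$\min_{b,\mathbf{w}} \; \frac{1}{2}\mathbf{w}^T\mathbf{w} + C\sum_{n=1}^{N} \max\left(0, |\mathbf{w}^T\mathbf{z}_n + b - y_n| - \epsilon\right)$$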

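12 Support Vector Regression Primal
Standard Support Vector Regression Primal
$$\min_{b,\mathbf{w}} \; \frac{1}{2}\mathbf{w}^T\mathbf{w} + C\sum_{n=1}^{N} \max\left(0, |\mathbf{w}^T\mathbf{z}_n + b - y_n| - \epsilon\right)$$
mimicking standard SVM:
$$\min_{b,\mathbf{w},\boldsymbol{\xi}} \; \frac{1}{2}\mathbf{w}^T\mathbf{w} + C\sum_{n=1}^{N}\xi_n \quad \text{s.t. } |\mathbf{w}^T\mathbf{z}_n + b - y_n| \le \epsilon + \xi_n, \;\; \xi_n \ge 0$$
making constraints linear:
$$\min_{b,\mathbf{w},\boldsymbol{\xi}^\vee,\boldsymbol{\xi}^\wedge} \; \frac{1}{2}\mathbf{w}^T\mathbf{w} + C\sum_{n=1}^{N}\left(\xi_n^\vee + \xi_n^\wedge\right) \quad \text{s.t. } -\epsilon - \xi_n^\wedge \le \mathbf{w}^T\mathbf{z}_n + b - y_n \le \epsilon + \xi_n^\vee, \;\; \xi_n^\vee \ge 0, \; \xi_n^\wedge \ge 0$$
Support Vector Regression (SVR) primal: minimize regularizer + upper tube violations $\xi_n^\wedge$ ($y_n$ above the tube) & lower violations $\xi_n^\vee$ ($y_n$ below the tube)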
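
For a fixed $(\mathbf{w}, b)$, the smallest feasible slacks are simply the amounts by which each $y_n$ leaves the tube, so the constrained objective equals the unconstrained max form. The sketch below illustrates this on four hand-picked points; the data, the names `xi_upper` and `xi_lower`, and the convention that the upper slack measures $y_n$ lying above the tube are assumptions made for the example.

```python
import numpy as np

def svr_slacks(Z, y, w, b, eps):
    """Smallest slacks satisfying the linear tube constraints for fixed (w, b)."""
    s = Z @ w + b
    xi_upper = np.maximum(0.0, y - s - eps)  # y_n above the tube
    xi_lower = np.maximum(0.0, s - y - eps)  # y_n below the tube
    return xi_upper, xi_lower

def svr_primal_objective(Z, y, w, b, eps, C):
    """0.5 * w^T w + C * (sum of upper and lower tube violations)."""
    xi_upper, xi_lower = svr_slacks(Z, y, w, b, eps)
    return 0.5 * (w @ w) + C * np.sum(xi_upper + xi_lower)

Z = np.array([[0.0], [1.0], [2.0], [3.0]])
y = np.array([0.0, 2.0, 1.5, 3.1])
w = np.array([1.0])
b, eps, C = 0.0, 0.25, 10.0

xi_upper, xi_lower = svr_slacks(Z, y, w, b, eps)
objective = svr_primal_objective(Z, y, w, b, eps, C)

# Same value from the unconstrained form with max(0, |s - y| - eps).
unconstrained = 0.5 * (w @ w) + C * np.sum(
    np.maximum(0.0, np.abs(Z @ w + b - y) - eps)
)
```

At most one of the two slacks is non-zero for any point, because a target cannot be above and below the tube at once.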

13 Support Vector Regression Primal
Fun Time

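14 Support Vector Regression Dual
Quadratic Programming for SVR
$$\min_{b,\mathbf{w},\boldsymbol{\xi}^\vee,\boldsymbol{\xi}^\wedge} \; \frac{1}{2}\mathbf{w}^T\mathbf{w} + C\sum_{n=1}^{N}\left(\xi_n^\vee + \xi_n^\wedge\right) \quad \text{s.t. } -\epsilon - \xi_n^\wedge \le \mathbf{w}^T\mathbf{z}_n + b - y_n \le \epsilon + \xi_n^\vee, \;\; \xi_n^\vee \ge 0, \; \xi_n^\wedge \ge 0$$
parameter $C$: trade-off of regularization & tube violation
parameter $\epsilon$: vertical tube width; one more parameter to choose!
QP of $\tilde{d} + 1 + 2N$ variables ($\tilde{d}$ for $\mathbf{w}$, 1 for $b$, $2N$ for the slacks), $2N + 2N$ constraints
next: remove dependence on $\tilde{d}$ by SVR primal $\Longrightarrow$ dual?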

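15 Support Vector Regression Dual
Lagrange Multipliers $\alpha^\wedge$ & $\alpha^\vee$
objective function: $\frac{1}{2}\mathbf{w}^T\mathbf{w} + C\sum_{n=1}^{N}\left(\xi_n^\vee + \xi_n^\wedge\right)$
Lagrange multiplier $\alpha_n^\wedge$ for $-\epsilon - \xi_n^\wedge \le \mathbf{w}^T\mathbf{z}_n + b - y_n$
Lagrange multiplier $\alpha_n^\vee$ for $\mathbf{w}^T\mathbf{z}_n + b - y_n \le \epsilon + \xi_n^\vee$
Some of the KKT Conditions
$\frac{\partial L}{\partial w_i} = 0$: $\mathbf{w} = \sum_{n=1}^{N} \underbrace{(\alpha_n^\wedge - \alpha_n^\vee)}_{\beta_n} \mathbf{z}_n$
complementary slackness:
$$\alpha_n^\wedge\left(-\epsilon - \xi_n^\wedge - \mathbf{w}^T\mathbf{z}_n - b + y_n\right) = 0$$
$$\alpha_n^\vee\left(-\epsilon - \xi_n^\vee + \mathbf{w}^T\mathbf{z}_n + b - y_n\right) = 0$$
standard dual can be derived using the same steps as Lecture 20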

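16 Support Vector Regression Dual
SVM Dual and SVR Dual
SVM primal:
$$\min \; \frac{1}{2}\mathbf{w}^T\mathbf{w} + C\sum_{n=1}^{N}\xi_n \quad \text{s.t. } y_n(\mathbf{w}^T\mathbf{z}_n + b) \ge 1 - \xi_n, \;\; \xi_n \ge 0$$
SVR primal:
$$\min \; \frac{1}{2}\mathbf{w}^T\mathbf{w} + C\sum_{n=1}^{N}\left(\xi_n^\wedge + \xi_n^\vee\right) \quad \text{s.t. } 1\cdot(\mathbf{w}^T\mathbf{z}_n + b - y_n) \le \epsilon + \xi_n^\vee, \;\; 1\cdot(y_n - \mathbf{w}^T\mathbf{z}_n - b) \le \epsilon + \xi_n^\wedge, \;\; \xi_n^\wedge \ge 0, \; \xi_n^\vee \ge 0$$
SVM dual:
$$\min \; \frac{1}{2}\sum_{n=1}^{N}\sum_{m=1}^{N} \alpha_n\alpha_m y_n y_m K(\mathbf{x}_n, \mathbf{x}_m) - \sum_{n=1}^{N} 1\cdot\alpha_n \quad \text{s.t. } \sum_{n=1}^{N} y_n\alpha_n = 0, \;\; 0 \le \alpha_n \le C$$
SVR dual, with $k_{n,m} = K(\mathbf{x}_n, \mathbf{x}_m)$:
$$\min \; \frac{1}{2}\sum_{n=1}^{N}\sum_{m=1}^{N} (\alpha_n^\wedge - \alpha_n^\vee)(\alpha_m^\wedge - \alpha_m^\vee) k_{n,m} + \sum_{n=1}^{N}\left((\epsilon + y_n)\cdot\alpha_n^\vee + (\epsilon - y_n)\cdot\alpha_n^\wedge\right) \quad \text{s.t. } \sum_{n=1}^{N} 1\cdot(\alpha_n^\wedge - \alpha_n^\vee) = 0, \;\; 0 \le \alpha_n^\wedge \le C, \; 0 \le \alpha_n^\vee \le C$$
similar QP, solvable by similar solver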
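
To hand the SVR dual to a generic QP solver, it helps to stack the two multiplier vectors into one variable $\mathbf{u} = [\boldsymbol{\alpha}^\wedge; \boldsymbol{\alpha}^\vee]$ and write the objective as $\frac{1}{2}\mathbf{u}^T\mathbf{Q}\mathbf{u} + \mathbf{p}^T\mathbf{u}$. The sketch below only builds the QP data and checks it against the direct formula; it does not call any solver. The function names, the stacking order, and the convention that $\alpha_n^\wedge$ carries the linear coefficient $(\epsilon - y_n)$ are assumptions made here for illustration.

```python
import numpy as np

def svr_dual_qp(K, y, eps, C):
    """QP data for u = [alpha_up; alpha_down]:
    minimize 0.5 u^T Q u + p^T u  s.t.  a_eq^T u = 0,  lower <= u <= upper."""
    N = len(y)
    Q = np.block([[K, -K], [-K, K]])            # encodes (a_up - a_down)^T K (a_up - a_down)
    p = np.concatenate([eps - y, eps + y])      # linear terms for alpha_up, alpha_down
    a_eq = np.concatenate([np.ones(N), -np.ones(N)])
    lower = np.zeros(2 * N)
    upper = np.full(2 * N, C)
    return Q, p, a_eq, lower, upper

def svr_dual_objective(alpha_up, alpha_down, K, y, eps):
    """Direct evaluation of the SVR dual objective."""
    beta = alpha_up - alpha_down
    linear = np.sum((eps - y) * alpha_up + (eps + y) * alpha_down)
    return 0.5 * (beta @ K @ beta) + linear

rng = np.random.default_rng(4)
X = rng.standard_normal((6, 2))
K = X @ X.T                      # linear kernel, assumed for the demo
y = rng.standard_normal(6)

Q, p, a_eq, lower, upper = svr_dual_qp(K, y, eps=0.1, C=1.0)
u = rng.uniform(0.0, 1.0, size=12)
qp_value = 0.5 * (u @ Q @ u) + p @ u
direct_value = svr_dual_objective(u[:6], u[6:], K, y, eps=0.1)
```

The matrix `Q` is positive semi-definite whenever `K` is, so the problem stays convex, and its size depends on $N$ only, not on the dimension of $\mathbf{z}$.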

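17 Support Vector Regression Dual
Sparsity of SVR Solution
$\mathbf{w} = \sum_{n=1}^{N} \underbrace{(\alpha_n^\wedge - \alpha_n^\vee)}_{\beta_n} \mathbf{z}_n$
complementary slackness:
$$\alpha_n^\wedge\left(-\epsilon - \xi_n^\wedge - \mathbf{w}^T\mathbf{z}_n - b + y_n\right) = 0$$
$$\alpha_n^\vee\left(-\epsilon - \xi_n^\vee + \mathbf{w}^T\mathbf{z}_n + b - y_n\right) = 0$$
strictly within tube $|\mathbf{w}^T\mathbf{z}_n + b - y_n| < \epsilon$
  $\Longrightarrow \alpha_n^\wedge = 0$ and $\alpha_n^\vee = 0$
  $\Longrightarrow \beta_n = 0$
SVs ($\beta_n \ne 0$): on or outside tube
SVR: allows sparse $\boldsymbol{\beta}$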
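
The step from "strictly within tube" to zero multipliers goes through complementary slackness. The chain below is a sketch of that reasoning, assuming the convention that $\alpha_n^\wedge$ pairs with the constraint $-\epsilon - \xi_n^\wedge \le \mathbf{w}^T\mathbf{z}_n + b - y_n$ and $\alpha_n^\vee$ with $\mathbf{w}^T\mathbf{z}_n + b - y_n \le \epsilon + \xi_n^\vee$, and that a point strictly inside the tube has zero slack at the optimum:

```latex
\begin{align*}
|\mathbf{w}^T\mathbf{z}_n + b - y_n| < \epsilon
&\implies \xi_n^\wedge = 0 \text{ and } \xi_n^\vee = 0 \\
&\implies -\epsilon - \xi_n^\wedge - \mathbf{w}^T\mathbf{z}_n - b + y_n < 0
   \text{ and }
   -\epsilon - \xi_n^\vee + \mathbf{w}^T\mathbf{z}_n + b - y_n < 0 \\
&\implies \alpha_n^\wedge = 0 \text{ and } \alpha_n^\vee = 0 \\
&\implies \beta_n = \alpha_n^\wedge - \alpha_n^\vee = 0.
\end{align*}
```

Since both bracketed factors are non-zero, the products in the complementary slackness conditions can only vanish if the multipliers do.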

18 Support Vector Regression Dual
Fun Time

19 Summary of Kernel Models
Map of Linear Models
first row:
  PLA/pocket: minimize $\mathrm{err}_{0/1}$ specially
  linear SVR: minimize regularized $\mathrm{err}_{\text{TUBE}}$ by QP
second row:
  linear soft-margin SVM: minimize regularized $\widehat{\mathrm{err}}_{\text{SVM}}$ by QP
  linear ridge regression: minimize regularized $\mathrm{err}_{\text{SQR}}$ analytically
  regularized logistic regression: minimize regularized $\mathrm{err}_{\text{CE}}$ by GD/SGD
second row: popular in LIBLINEAR

20 Summary of Kernel Models
Map of Linear/Kernel Models
first row: PLA/pocket; linear SVR
second row: linear soft-margin SVM; linear ridge regression; regularized logistic regression
third row:
  kernel ridge regression: kernelized linear ridge regression
  kernel logistic regression: kernelized regularized logistic regression
fourth row:
  SVM: minimize SVM dual by QP
  SVR: minimize SVR dual by QP
  probabilistic SVM: run SVM-transformed logistic regression
fourth row: popular in LIBSVM

21 Summary of Kernel Models
Map of Linear/Kernel Models
first row: PLA/pocket; linear SVR
second row: linear soft-margin SVM; linear ridge regression; regularized logistic regression
third row: kernel ridge regression; kernel logistic regression
fourth row: SVM; SVR; probabilistic SVM
first row: less used due to worse performance
third row: less used due to dense $\boldsymbol{\beta}$

22 Summary of Kernel Models
Kernel Models
possible kernels: polynomial, Gaussian, ..., your own kernel satisfying Mercer's condition
coupled with: kernel ridge regression, kernel logistic regression, SVM, SVR, probabilistic SVM
powerful extension of linear models
with great power comes great responsibility (in Spiderman, remember? :-))
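
When designing "your own" kernel, a quick numerical check is to confirm that the kernel matrix on a sample is symmetric and positive semi-definite. This is only a necessary check on one data set, not a proof of Mercer's condition. The helper names, the parameter names `zeta`, `gamma`, `degree`, and the tolerance are assumptions chosen for this sketch.

```python
import numpy as np

def polynomial_kernel(X1, X2, zeta=1.0, gamma=1.0, degree=2):
    """K(x, x') = (zeta + gamma * x^T x')^degree."""
    return (zeta + gamma * X1 @ X2.T) ** degree

def gaussian_kernel(X1, X2, gamma=1.0):
    """K(x, x') = exp(-gamma * ||x - x'||^2)."""
    sq = (
        np.sum(X1**2, axis=1)[:, None]
        + np.sum(X2**2, axis=1)[None, :]
        - 2.0 * X1 @ X2.T
    )
    return np.exp(-gamma * np.maximum(sq, 0.0))

def looks_psd(K, rel_tol=1e-8):
    """True if K is symmetric with no eigenvalue below a small negative tolerance."""
    if not np.allclose(K, K.T):
        return False
    eigvals = np.linalg.eigvalsh((K + K.T) / 2.0)
    return bool(eigvals.min() >= -rel_tol * max(eigvals.max(), 1.0))

rng = np.random.default_rng(3)
X = rng.standard_normal((20, 3))

K_poly = polynomial_kernel(X, X)
K_gauss = gaussian_kernel(X, X, gamma=0.5)
K_bad = -K_gauss  # negating a valid kernel matrix breaks positive semi-definiteness
```

A matrix like `K_bad` would make $(\lambda\mathbf{I} + \mathbf{K})$ potentially singular and the dual QPs non-convex, which is one reason the kernel choice carries responsibility.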

23 Summary of Kernel Models
Fun Time

24 Summary of Kernel Models
Summary
Lecture 6: Kernel Models for Regression
Kernel Ridge Regression: representer theorem on RidgeReg
Support Vector Regression Primal: minimize regularized tube errors
Support Vector Regression Dual: a QP similar to SVM
Summary of Kernel Models: with great power comes great responsibility
