Advanced Topics in Machine Learning


1 Advanced Topics in Machine Learning
1. Learning SVMs / Primal Methods
Lars Schmidt-Thieme, Information Systems and Machine Learning Lab (ISMLL), University of Hildesheim, Germany
1 / 16

2 Outline
10. Linearization of Nonlinear Kernels
2 / 16

4 Subgradient Descent

minimize
$$f(\beta, \beta_0; \mathcal{D}) := \frac{1}{|\mathcal{D}|} \sum_{(x,y) \in \mathcal{D}} [1 - y(\beta^T x + \beta_0)]_+ + \lambda \|\beta\|^2$$

subgradient
$$g_\beta(\beta, \beta_0; \mathcal{D}) := -\frac{1}{|\mathcal{D}|} \sum_{\substack{(x,y) \in \mathcal{D}:\\ y(\beta^T x + \beta_0) < 1}} y\,x + \lambda \beta,
\qquad
g_{\beta_0}(\beta, \beta_0; \mathcal{D}) := -\frac{1}{|\mathcal{D}|} \sum_{\substack{(x,y) \in \mathcal{D}:\\ y(\beta^T x + \beta_0) < 1}} y$$

2 / 16
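
To make the notation concrete, here is a minimal numpy sketch of the objective and of the hinge-part subgradient (the function names and the dense data matrix X are illustrative assumptions, not part of the original slides; the regularizer's contribution is kept separate, matching the (1 − η_t λ) shrinkage used in the pseudocode below):

```python
import numpy as np

def svm_objective(beta, beta0, X, y, lam):
    # f(beta, beta0; D) = mean hinge loss + lam * ||beta||^2
    margins = y * (X @ beta + beta0)
    return np.maximum(0.0, 1.0 - margins).mean() + lam * beta @ beta

def hinge_subgradient(beta, beta0, X, y):
    # Subgradient of the hinge-loss part only; the regularizer is folded
    # into the update as (1 - eta * lam) * beta in the descent loops below.
    viol = y * (X @ beta + beta0) < 1.0          # margin violators
    g_beta = -(y[viol] @ X[viol]) / len(y)
    g_beta0 = -y[viol].sum() / len(y)
    return g_beta, g_beta0
```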

5 Subgradient Descent

learn-linear-svm-subgradient-descent-primal(training predictors x, training targets y,
        regularization λ, accuracy ε, step lengths η_t):
    n := |x|
    β̂ := 0
    β̂₀ := 0
    t := 0
    do
        g := −(1/n) Σ_{i=1..n : y_i(β̂ᵀx_i + β̂₀) < 1} y_i x_i
        g₀ := −(1/n) Σ_{i=1..n : y_i(β̂ᵀx_i + β̂₀) < 1} y_i
        β̂ := (1 − η_t λ) β̂ − η_t g
        β̂₀ := β̂₀ − η_t g₀
        t := t + 1
    while ‖η_t g‖ ≥ ε
    return (β̂, β̂₀)

3 / 16
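
A runnable sketch of this loop, reusing hinge_subgradient from above (passing the step-length schedule as a function eta(t) is my assumption; the slide leaves the schedule abstract):

```python
def learn_linear_svm_subgradient(X, y, lam, eps, eta):
    # Batch subgradient descent on the primal SVM objective.
    beta, beta0 = np.zeros(X.shape[1]), 0.0
    t = 0
    while True:
        g_beta, g_beta0 = hinge_subgradient(beta, beta0, X, y)
        beta = (1.0 - eta(t) * lam) * beta - eta(t) * g_beta
        beta0 -= eta(t) * g_beta0
        t += 1
        if np.linalg.norm(eta(t) * g_beta) < eps:  # update has stalled
            return beta, beta0

# example usage with a hypothetical 1/(1+t) schedule:
# beta, beta0 = learn_linear_svm_subgradient(X, y, lam=0.01, eps=1e-4,
#                                            eta=lambda t: 1.0 / (1.0 + t))
```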

6 Subgradient Descent (subsample approximation)

Idea: do not use all training examples to estimate the error and the gradient, but just a subsample D⁽ᵗ⁾ ⊆ D. The subsample may vary over the steps t; in step t, approximate f(·; D) by f(·; D⁽ᵗ⁾).

Extremes:
- all samples: subgradient descent
- just a single (random) sample: stochastic subgradient descent

4 / 16

7 Stochastic Subgradient Descent

learn-linear-svm-stochastic-subgradient-descent-primal(training predictors x, training targets y,
        regularization λ, accuracy ε, step lengths η_t, stop count t₀):
    n := |x|
    β̂ := 0
    β̂₀ := 0
    t := 0
    l_{t'} := 0 for t' = 0, …, t₀ − 1
    do
        draw i randomly from {1, …, n}
        g := −δ(y_i(β̂ᵀx_i + β̂₀) < 1) · y_i x_i
        g₀ := −δ(y_i(β̂ᵀx_i + β̂₀) < 1) · y_i
        β̂ := (1 − η_t λ) β̂ − η_t g
        β̂₀ := β̂₀ − η_t g₀
        l_{t mod t₀} := ‖η_t g‖
        t := t + 1
    while Σ_{t'=0}^{t₀−1} l_{t'} ≥ ε
    return (β̂, β̂₀)

5 / 16
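
A sketch of the single-example variant, including the rolling window of the last t₀ update magnitudes that replaces the (now too noisy) per-step stopping criterion; the seed argument is my addition for reproducibility:

```python
def learn_linear_svm_sgd(X, y, lam, eps, eta, t0, seed=0):
    # Stochastic subgradient descent: one randomly drawn example per step.
    rng = np.random.default_rng(seed)
    beta, beta0 = np.zeros(X.shape[1]), 0.0
    window = np.zeros(t0)                 # magnitudes of the last t0 updates
    t = 0
    while True:
        i = rng.integers(len(y))
        viol = y[i] * (X[i] @ beta + beta0) < 1.0    # the delta indicator
        g_beta = -y[i] * X[i] if viol else np.zeros(X.shape[1])
        g_beta0 = -y[i] if viol else 0.0
        beta = (1.0 - eta(t) * lam) * beta - eta(t) * g_beta
        beta0 -= eta(t) * g_beta0
        window[t % t0] = np.linalg.norm(eta(t) * g_beta)
        t += 1
        if window.sum() < eps:            # last t0 updates jointly small
            return beta, beta0
```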

8 Subgradient Descent with Subsample Approximation

learn-linear-svm-approx-subgradient-descent-primal(training predictors x, training targets y,
        regularization λ, accuracy ε, step lengths η_t, stop count t₀, subsample size k):
    n := |x|
    β̂ := 0
    β̂₀ := 0
    t := 0
    l_{t'} := 0 for t' = 0, …, t₀ − 1
    do
        draw subset I randomly from {1, …, n} with |I| = k
        g := −(1/k) Σ_{i∈I : y_i(β̂ᵀx_i + β̂₀) < 1} y_i x_i
        g₀ := −(1/k) Σ_{i∈I : y_i(β̂ᵀx_i + β̂₀) < 1} y_i
        β̂ := (1 − η_t λ) β̂ − η_t g
        β̂₀ := β̂₀ − η_t g₀
        l_{t mod t₀} := ‖η_t g‖
        t := t + 1
    while Σ_{t'=0}^{t₀−1} l_{t'} ≥ ε
    return (β̂, β̂₀)

6 / 16
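
Relative to the previous two variants only the subgradient estimate changes; a sketch (the rng argument for sampling is a hypothetical name):

```python
def minibatch_hinge_subgradient(beta, beta0, X, y, k, rng):
    # Hinge subgradient estimated from a random subsample I with |I| = k;
    # k = 1 recovers the stochastic variant, k = n the batch variant.
    I = rng.choice(len(y), size=k, replace=False)
    XI, yI = X[I], y[I]
    viol = yI * (XI @ beta + beta0) < 1.0
    g_beta = -(yI[viol] @ XI[viol]) / k
    g_beta0 = -yI[viol].sum() / k
    return g_beta, g_beta0
```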

9 Subgradient Descent (subsample approximation)

Shalev-Shwartz, Singer, and Srebro 2007 experimented with approximations by samples of fixed size k, i.e., |D⁽ᵗ⁾| = k for all t.

[Figure 3 of Shalev-Shwartz, Singer, and Srebro 2007: the effect of k on the objective value of Pegasos on the Astro-Physics dataset. Left: T is fixed. Right: kT is fixed (kT = 10⁴, 10⁵, 10⁶).]

7 / 16

11 Maintaining Small Parameters

Lemma (Shalev-Shwartz, Singer, and Srebro 2007). The optimal β* satisfies ‖β*‖ ≤ 1/√λ.

Proof. Due to strong duality, for the optimal β*, β₀* and dual optimum α*:
$$f(\beta^*) = \frac{1}{|\mathcal{D}|} \sum_{(x,y) \in \mathcal{D}} [1 - y(\beta^{*T} x + \beta_0^*)]_+ + \lambda \|\beta^*\|^2
\overset{!}{=} f(\alpha^*) = -\frac{1}{2\lambda}\, \alpha^{*T} (XX^T \circ yy^T)\, \alpha^* + \frac{1}{|\mathcal{D}|} \|\alpha^*\|_1$$
with $\beta^* = \frac{1}{\lambda} X^T (y \circ \alpha^*)$.

8 / 16

12 Maintaining Small Parameters

Proof (ctd.). Substituting $\beta^* = \frac{1}{\lambda} X^T (y \circ \alpha^*)$, the quadratic dual term equals $-\frac{1}{2}\lambda \|\beta^*\|^2$, so
$$\frac{1}{|\mathcal{D}|} \sum_{(x,y) \in \mathcal{D}} [1 - y(\beta^{*T} x + \beta_0^*)]_+ + \lambda \|\beta^*\|^2 = -\frac{1}{2} \lambda \|\beta^*\|^2 + \frac{1}{|\mathcal{D}|} \|\alpha^*\|_1 .$$
As the hinge sum is nonnegative, dropping it only decreases the left-hand side:
$$\lambda \|\beta^*\|^2 \le \frac{3}{2} \lambda \|\beta^*\|^2 \le \frac{1}{|\mathcal{D}|} \|\alpha^*\|_1$$
and with $0 \le \alpha^* \le 1$ we have $\frac{1}{|\mathcal{D}|} \|\alpha^*\|_1 \le 1$, hence $\|\beta^*\| \le \frac{1}{\sqrt{\lambda}}$. ∎

9 / 16

13 Primal Estimated sub-gradient SOlver for SVM (PEGASOS)

Basic ideas:
- use the subsample approximation with a fixed subsample size k (but k = 1, i.e., stochastic gradient descent, turns out to be optimal)
- retain ‖β‖ ≤ 1/√λ by rescaling in each step: β := β / max(1, √λ ‖β‖)
- decrease the step size over time: η_t := 1/(λt)

10 / 16

14 Decrease Step Size Over Time

[Figure 2 of Shalev-Shwartz, Singer, and Srebro 2007: comparisons of Pegasos to Norma (left) and of Pegasos to stochastic gradient descent with a fixed learning rate (right) on the Astro-Physics dataset. In the left plot, the solid lines designate the objective value and the dashed lines depict the loss on the test set.]

11 / 16

15 Pegasos

learn-linear-svm-pegasos(training predictors x, training targets y,
        regularization λ, accuracy ε, stop count t₀, subsample size k):
    n := |x|
    β̂ := 0
    β̂₀ := 0
    t := 1
    l_{t'} := 0 for t' = 0, …, t₀ − 1
    do
        draw subset I randomly from {1, …, n} with |I| = k
        g := −(1/k) Σ_{i∈I : y_i(β̂ᵀx_i + β̂₀) < 1} y_i x_i
        g₀ := −(1/k) Σ_{i∈I : y_i(β̂ᵀx_i + β̂₀) < 1} y_i
        η_t := 1/(λt)
        β̂ := (1 − η_t λ) β̂ − η_t g
        β̂₀ := β̂₀ − η_t g₀
        β̂ := β̂ / max(1, √λ ‖β̂‖)
        l_{t mod t₀} := ‖η_t g‖
        t := t + 1
    while Σ_{t'=0}^{t₀−1} l_{t'} ≥ ε
    return (β̂, β̂₀)

12 / 16
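
Putting the pieces together, a sketch of Pegasos that reuses minibatch_hinge_subgradient from above (again, the seed argument is my addition):

```python
def learn_linear_svm_pegasos(X, y, lam, eps, t0, k, seed=0):
    # Pegasos: mini-batch subgradient steps with eta_t = 1 / (lam * t),
    # projecting beta back onto the ball of radius 1 / sqrt(lam).
    rng = np.random.default_rng(seed)
    beta, beta0 = np.zeros(X.shape[1]), 0.0
    window = np.zeros(t0)
    t = 1                            # start at 1 so eta_1 = 1/lam is finite
    while True:
        g_beta, g_beta0 = minibatch_hinge_subgradient(beta, beta0, X, y, k, rng)
        eta = 1.0 / (lam * t)
        beta = (1.0 - eta * lam) * beta - eta * g_beta
        beta0 -= eta * g_beta0
        beta /= max(1.0, np.sqrt(lam) * np.linalg.norm(beta))  # projection
        window[t % t0] = np.linalg.norm(eta * g_beta)
        t += 1
        if window.sum() < eps:
            return beta, beta0
```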

16 Comparison Dual Coordinate Descent vs. Pegasos

[Figure 1 of C. J. Hsieh et al. 2008: time versus the relative error of the objective, for L1-SVM and L2-SVM on the astro-physic, news20, and rcv1 datasets. DCDL1-S and DCDL2-S are DCDL1 and DCDL2 with shrinking.]

13 / 16

17 Comparison Dual Coordinate Descent vs. Pegasos

[Figure 2 of C. J. Hsieh et al. 2008: time versus the difference of testing accuracy between the current model and the reference model, for L1-SVM and L2-SVM on the astro-physic, news20, and rcv1 datasets.]

14 / 16

18 Outline
10. Linearization of Nonlinear Kernels
15 / 16

19 10. Linearization of Nonlinear Kernels: Basic Idea

Instead of using a nonlinear kernel, e.g., the polynomial kernel of degree d,
$$K(x, z) := (\gamma x^T z + r)^d$$
with hyperparameters d, γ, and r for data x, z ∈ ℝⁿ, use the explicit embedding, e.g., for d = 2 and r = 1:
$$\phi(x) := \left(1,\ \sqrt{2\gamma}\, x_1, \ldots, \sqrt{2\gamma}\, x_n,\ \gamma x_1^2, \ldots, \gamma x_n^2,\ \sqrt{2}\,\gamma\, x_1 x_2, \ldots, \sqrt{2}\,\gamma\, x_{n-1} x_n\right)$$
or, more simply,
$$\phi(x) := (1,\ x_1, \ldots, x_n,\ x_1^2, \ldots, x_n^2,\ x_1 x_2, \ldots, x_{n-1} x_n)$$
of dimension $\frac{(n+d)!}{n!\, d!}$.

15 / 16
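
A sketch of the explicit degree-2 embedding; one can check numerically that phi(x) @ phi(z) equals (gamma * x @ z + 1) ** 2:

```python
def poly2_features(X, gamma=1.0):
    # Explicit embedding phi for K(x, z) = (gamma * x^T z + 1)^2,
    # i.e. phi(x) @ phi(z) == K(x, z). X has shape (m, n).
    m, n = X.shape
    cols = [np.ones(m)]                                        # constant term
    cols += [np.sqrt(2 * gamma) * X[:, i] for i in range(n)]   # linear terms
    cols += [gamma * X[:, i] ** 2 for i in range(n)]           # squared terms
    cols += [np.sqrt(2) * gamma * X[:, i] * X[:, j]            # cross terms
             for i in range(n) for j in range(i + 1, n)]
    return np.stack(cols, axis=1)
```

Running any of the linear solvers above on poly2_features(X) then trains the degree-2 polynomial-kernel SVM with explicit features, which is the linearization idea evaluated by Chang et al. 2010.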

20 10. Linearization of Nonlinear Kernels: Comparison Linearized vs. Nonlinear Kernel

[Tables 4 and 5 of Y. Chang et al. 2010, on the datasets a9a, real-sim, ijcnn1, MNIST, covtype, and webspam. Table 4: comparison of linear SVM (LIBLINEAR; columns C, time in seconds, accuracy) and nonlinear SVM with RBF kernel (LIBSVM; columns C, γ, time in seconds, accuracy). Table 5: training time (in seconds) and testing accuracy of using the degree-2 polynomial mapping; the last two columns show the accuracy difference to the results using linear and RBF; NA indicates that programs do not terminate after 300,000 seconds.]

16 / 16

21 References

Chang, Yin-Wen et al. (2010). "Training and Testing Low-degree Polynomial Data Mappings via Linear SVM". In: Journal of Machine Learning Research 11.
Hsieh, C. J. et al. (2008). "A Dual Coordinate Descent Method for Large-scale Linear SVM". In: Proceedings of the 25th International Conference on Machine Learning.
Shalev-Shwartz, S., Y. Singer, and N. Srebro (2007). "Pegasos: Primal Estimated Sub-gradient Solver for SVM". In: Proceedings of the 24th International Conference on Machine Learning.
