Inverse Time Dependency in Convex Regularized Learning

1 Inverse Time Dependency in Convex Regularized Learning. Zeyuan A. Zhu (Tsinghua University), Weizhu Chen (MSRA), Chenguang Zhu (Tsinghua University), Gang Wang (MSRA), Haixun Wang (MSRA), Zheng Chen (MSRA).

2 Observation. 10K samples: 1 hour, 2.3% error. 1M samples: 1 week, 2.29% error. (Example borrowed from Shalev-Shwartz & Srebro's slides.)

3 Observation. 10K samples: 1 hour, 2.3% error. 1M samples: 1 week, 2.29% error, or 1 hour, 2.3% error. (Example borrowed from Shalev-Shwartz & Srebro's slides.)

4 Observation. 10K samples: 1 hour, 2.3% error. 1M samples: 1 week, 2.29% error, or 1 hour, 2.3% error, or 10 minutes, 2.3% error? Can we? (Example borrowed from Shalev-Shwartz & Srebro's slides.)

5 Observation. 10K samples: 1 hour, 2.3% error. 1M samples: 1 week, 2.29% error, or 1 hour, 2.3% error, or 10 minutes, 2.3% error? Can we? The runtime decreases as the number of samples increases, when the desired accuracy is fixed: Inverse Time Dependency. (Example borrowed from Shalev-Shwartz & Srebro's slides.)

6 Our Contribution. We propose a Primal Gradient Solver (PGS) and prove its inverse time dependency property. This work generalizes the state-of-the-art $\ell_2$-SVM result to the $\ell_p$-norm with convex loss functions. By bounding $S$ (the domain of $w$), PGS is able to support more loss functions, for example least square. It is the first to demonstrate that both the logistic loss and the least square loss can be adopted into PGS and achieve the inverse time dependency property.

7 Error Decomposition of err(w). Optimization error: the error due to the early stop of the optimization algorithm. Estimation error: the extra error due to the difference between the training set and the real sample distribution. Approximation error: the best achievable error for the given model. (Example borrowed from Shalev-Shwartz & Srebro's slides.)

8 Error Decomposition of err(w). Optimization error: the error due to the early stop of the optimization algorithm. Estimation error: the extra error due to the difference between the training set and the real sample distribution. Approximation error: the best achievable error for the given model. [Figure: estimation error (blue) and prediction error (green) versus the data set size n and the number of iterations of stochastic gradient descent; their sum is the desired accuracy, and the total running time follows.] (Example borrowed from Shalev-Shwartz & Srebro's slides.)

9 Error Decomposition of err(w). Optimization error: the error due to the early stop of the optimization algorithm. Estimation error: the extra error due to the difference between the training set and the real sample distribution. Approximation error: the best achievable error for the given model. [Figure as on the previous slide.] Remark: Generalization Error (formally defined later) = Optimization Error + Estimation Error. (Example borrowed from Shalev-Shwartz & Srebro's slides.)
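To make the remark precise, a standard way to write the decomposition (the notation here is my own: $\hat w$ denotes the exact minimizer of the empirical objective and $w^\ast$ the best predictor in the model class; the slides do not spell this out) is

    err(\bar w) \;=\; \underbrace{\big(\mathrm{err}(\bar w) - \mathrm{err}(\hat w)\big)}_{\text{optimization error}}
    \;+\; \underbrace{\big(\mathrm{err}(\hat w) - \mathrm{err}(w^\ast)\big)}_{\text{estimation error}}
    \;+\; \underbrace{\mathrm{err}(w^\ast)}_{\text{approximation error}},

so the generalization error in the sense used here is the sum of the first two terms.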

10 Error Decomposition. The optimization error is controlled as $\hat F_\sigma(\bar w) \le \min_w \hat F_\sigma(w) + \frac{C \log T}{\sigma T \delta}$, the generalization error as $l(\bar w) \le l(w_0) + O\big(\frac{1}{\sigma T \delta}\big) + O\big(\frac{1}{\sigma m}\big) + \frac{\sigma}{2(p-1)}\|w_0\|_p^2$, and the number of iterations needed for a target error $\varepsilon$ in err(w) is $T = O\Big(\frac{1/\delta}{2\varepsilon^2(p-1)/\|w_0\|_p^2 - O(1/m)}\Big)$ (these bounds are formalized in Theorems 1 and 2 below).

11 Convex Regularized Learning. Given a training set $\Psi = \{(x_i, y_i)\}_{i=1}^m$ with $x_i \in \mathbb{R}^n$ and $y_i \in \{-1, 1\}$, minimize $F_\sigma(w) = \frac{\sigma}{2}\|w\|^2 + \frac{1}{m}\sum_{i=1}^m \max(0,\, 1 - y_i\langle w, x_i\rangle)$.

12 Convex Regularized Learning. $\Psi = \{(x_i, y_i)\}_{i=1}^m$, $x_i \in \mathbb{R}^n$, $y_i \in \{-1, 1\}$; $F_\sigma(w) = \frac{\sigma}{2}\|w\|^2 + \frac{1}{m}\sum_{i=1}^m \max(0,\, 1 - y_i\langle w, x_i\rangle)$.

13 Convex Regularized Learning. $\Psi = \{(x_i, y_i)\}_{i=1}^m$, $x_i \in \mathbb{R}^n$, $y_i \in \{-1, 1\}$; $F_\sigma(w) = \frac{\sigma}{2}\|w\|^2 + \frac{1}{m}\sum_{i=1}^m \max(0,\, 1 - y_i\langle w, x_i\rangle)$. The first term, $\frac{\sigma}{2}\|w\|^2$, is the regularizer.

14 Convex Regularized Learning. $\Psi = \{(x_i, y_i)\}_{i=1}^m$, $x_i \in \mathbb{R}^n$, $y_i \in \{-1, 1\}$; $F_\sigma(w) = \frac{\sigma}{2}\|w\|^2 + \frac{1}{m}\sum_{i=1}^m \max(0,\, 1 - y_i\langle w, x_i\rangle)$. The second term, $\frac{1}{m}\sum_{i=1}^m \max(0,\, 1 - y_i\langle w, x_i\rangle)$, is the loss.

15 Convex Regularized Learning. Writing $\Psi = \{\theta_i\}_{i=1}^m = \{(x_i, y_i)\}_{i=1}^m$, the objective $F_\sigma(w) = \frac{\sigma}{2}\|w\|^2 + \frac{1}{m}\sum_{i=1}^m \max(0,\, 1 - y_i\langle w, x_i\rangle)$ generalizes to $F_\sigma(w) = \sigma\, r(w) + \frac{1}{m}\sum_{i=1}^m l(\langle w, x_i\rangle; \theta_i)$. The $\ell_p$-norm regularizer: $r(w) = \frac{1}{2(p-1)}\|w\|_p^2$, $p \in (1, 2]$. The SVM hinge loss: $l(\langle w, x\rangle; \theta) = \max(0,\, 1 - y\langle w, x\rangle)$. The logistic loss: $l(\langle w, x\rangle; \theta) = \log\big(1 + e^{-y\langle w, x\rangle}\big)$. The least square loss: $l(\langle w, x\rangle; \theta) = (\langle w, x\rangle - y)^2$.
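As a concrete illustration of these definitions, here is a minimal NumPy sketch of the empirical objective with the $\ell_p$ regularizer and the three losses above; the function names and arguments are my own, not from the paper.

    import numpy as np

    def lp_regularizer(w, p):
        # r(w) = ||w||_p^2 / (2 (p - 1)), strongly convex for p in (1, 2]
        norm_p = np.sum(np.abs(w) ** p) ** (1.0 / p)
        return norm_p ** 2 / (2.0 * (p - 1.0))

    def empirical_objective(w, X, y, sigma, p, loss="hinge"):
        # F_sigma(w) = sigma * r(w) + (1/m) * sum_i l(<w, x_i>; theta_i)
        scores = X @ w
        if loss == "hinge":
            losses = np.maximum(0.0, 1.0 - y * scores)
        elif loss == "logistic":
            losses = np.log1p(np.exp(-y * scores))
        elif loss == "least_square":
            losses = (scores - y) ** 2
        else:
            raise ValueError(loss)
        return sigma * lp_regularizer(w, p) + losses.mean()

For instance, empirical_objective(w, X, y, sigma=1e-4, p=1.8, loss="logistic") evaluates an $\ell_{1.8}$-regularized logistic objective of the kind used in the experiments later.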

16 Generalization. Generalization objective: $F_\sigma(w) = \sigma\, r(w) + \mathbb{E}_{\theta \sim \mathrm{Dist}}\, l(\langle w, x\rangle; \theta)$. Empirical objective: $\hat F_\sigma(w) = \sigma\, r(w) + \frac{1}{m}\sum_{i=1}^m l(\langle w, x_i\rangle; \theta_i)$.

18 Inverse Time Dependency. Generalization objective: $F_\sigma(w) = \sigma\, r(w) + \mathbb{E}_{\theta \sim \mathrm{Dist}}\, l(\langle w, x\rangle; \theta)$. Empirical objective: $\hat F_\sigma(w) = \sigma\, r(w) + \frac{1}{m}\sum_{i=1}^m l(\langle w, x_i\rangle; \theta_i)$.

19 Generalization. Generalization objective: $F_\sigma(w) = \sigma\, r(w) + \mathbb{E}_{\theta \sim \mathrm{Dist}}\, l(\langle w, x\rangle; \theta)$. Empirical objective: $\hat F_\sigma(w) = \sigma\, r(w) + \frac{1}{m}\sum_{i=1}^m l(\langle w, x_i\rangle; \theta_i)$. Optimization error: $\varepsilon_{acc}$, satisfying $\hat F_\sigma(\bar w) \le \min_w \hat F_\sigma(w) + \varepsilon_{acc}$. Generalization error: $\varepsilon$, satisfying $l(\bar w) \le l(w_0) + \varepsilon$.

20 Generalization. Theorem 1 relates the running time to the optimization error, and Theorem 2 relates the optimization error to the generalization error. Optimization error: $\varepsilon_{acc}$, satisfying $\hat F_\sigma(\bar w) \le \min_w \hat F_\sigma(w) + \varepsilon_{acc}$. Generalization error: $\varepsilon$, satisfying $l(\bar w) \le l(w_0) + \varepsilon$.

21 Thm 1 - Primal Gradient Solver.
INPUT: training sample space $\Psi = \{\theta_1, \ldots, \theta_m\}$; $p$, $\sigma$, $T$, $k$.
INITIALIZE: $w_0 \leftarrow 0$, $\lambda \leftarrow 0$, $q \leftarrow 1/(1 - 1/p)$.
FOR $t = 1, 2, \ldots, T$:
  Choose $A_t \subseteq \Psi$ satisfying $|A_t| = k$.
  Set $g_t(w) \leftarrow \frac{1}{|A_t|}\sum_{\theta \in A_t} l(\langle w, x\rangle; \theta)$.
  Choose $\lambda_t \in \partial g_t(w_{t-1})$.
  Let $\lambda \leftarrow \lambda - \lambda_t$.
  Define $w_t \leftarrow \nabla r^\ast\!\big(\frac{\lambda}{(t+1)\sigma}\big)$, where $r(w) = \frac{1}{2(p-1)}\|w\|_p^2$.
RETURN a random $w_i \in \{w_1, \ldots, w_T\}$ as the linear predictor.
We will create an algorithm based on Stochastic Gradient Descent, and then build the relationship between $\varepsilon_{acc}$ and $T$.
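Below is a minimal Python sketch of this loop for the hinge loss in the special case $p = 2$, where the link step $\nabla r^\ast(\lambda/((t+1)\sigma))$ reduces to $\lambda/((t+1)\sigma)$; the function name and the random sampling of $A_t$ are illustrative choices of mine, and the general $p$-norm link function is spelled out on the appendix slide at the end.

    import numpy as np

    def pgs_hinge_l2(X, y, sigma, T, k, seed=0):
        # Sketch of the solver loop for the hinge loss with p = 2, in which case
        # grad r*(lambda / ((t+1) sigma)) is simply lambda / ((t+1) sigma).
        rng = np.random.default_rng(seed)
        m, n = X.shape
        w = np.zeros(n)                  # w_0
        lam = np.zeros(n)                # accumulated negative subgradients
        iterates = []
        for t in range(1, T + 1):
            idx = rng.choice(m, size=k, replace=False)               # A_t with |A_t| = k
            Xb, yb = X[idx], y[idx]
            active = yb * (Xb @ w) < 1.0                             # points where the hinge is active
            lam_t = -(Xb[active] * yb[active][:, None]).sum(axis=0) / k   # subgradient of g_t at w_{t-1}
            lam -= lam_t                                             # lambda <- lambda - lambda_t
            w = lam / ((t + 1) * sigma)                              # w_t (p = 2 link step)
            iterates.append(w.copy())
        return iterates[rng.integers(T)]                             # return a random w_i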

28 Thm 1 - Primal Gradient Solver (algorithm as above). Thm 1: with probability at least $1 - \delta$ over the choices of $A_1, \ldots, A_T$ and the index $i$, $\hat F_\sigma(w_i) \le \min_w \hat F_\sigma(w) + \frac{C \log T}{\sigma T \delta}$.

29 Thm 1 - Primal Gradient Solver, analyzed via online strongly convex optimization. Thm 1: with probability at least $1 - \delta$ over the choices of $A_1, \ldots, A_T$ and the index $i$, $\hat F_\sigma(w_i) \le \min_w \hat F_\sigma(w) + \frac{C \log T}{\sigma T \delta}$.

30 Theorem 2. Thm 2: Suppose $\bar w$ is the predictor optimized by the Primal Gradient Solver. If the desired error rate $\varepsilon$ obeys $l(\bar w) \le l(w_0) + \varepsilon$ for $w_0 \in S$, then the required number of iterations satisfies $T = O\Big(\frac{1/\delta}{2\varepsilon^2(p-1)/\|w_0\|_p^2 - O(1/m)}\Big)$. Sketch of the proof, using an oracle inequality: $l(\bar w) - l(w_0) = \big[F_\sigma(\bar w) - \hat F_\sigma(\bar w)\big] + \big[\hat F_\sigma(\bar w) - F_\sigma(w_0)\big] - \frac{\sigma}{2(p-1)}\|\bar w\|_p^2 + \frac{\sigma}{2(p-1)}\|w_0\|_p^2$.

31 Theorem 2. Thm 2: Suppose $\bar w$ is the predictor optimized by the Primal Gradient Solver. If the desired error rate $\varepsilon$ obeys $l(\bar w) \le l(w_0) + \varepsilon$ for $w_0 \in S$, then the required number of iterations satisfies $T = O\Big(\frac{1/\delta}{2\varepsilon^2(p-1)/\|w_0\|_p^2 - O(1/m)}\Big)$. Sketch of the proof, using an oracle inequality: $l(\bar w) - l(w_0) = \big[F_\sigma(\bar w) - \hat F_\sigma(\bar w)\big] + \big[\hat F_\sigma(\bar w) - F_\sigma(w_0)\big] - \frac{\sigma}{2(p-1)}\|\bar w\|_p^2 + \frac{\sigma}{2(p-1)}\|w_0\|_p^2 \le O\big(\frac{1}{\sigma T \delta}\big) + O\big(\frac{1}{\sigma m}\big) + \big[\min_w \hat F_\sigma(w) - F_\sigma(w_0)\big] - \frac{\sigma}{2(p-1)}\|\bar w\|_p^2 + \frac{\sigma}{2(p-1)}\|w_0\|_p^2$, where the $O\big(\frac{1}{\sigma T \delta}\big)$ term comes from Theorem 1 and the $O\big(\frac{1}{\sigma m}\big)$ term from the oracle inequality of Karthik Sridharan, Nathan Srebro, and Shai Shalev-Shwartz, "Fast Rates for Regularized Objectives," in NIPS, 2008.

32 Theorem 2. Thm 2: Suppose $\bar w$ is the predictor optimized by the Primal Gradient Solver. If the desired error rate $\varepsilon$ obeys $l(\bar w) \le l(w_0) + \varepsilon$ for $w_0 \in S$, then the required number of iterations satisfies $T = O\Big(\frac{1/\delta}{2\varepsilon^2(p-1)/\|w_0\|_p^2 - O(1/m)}\Big)$. Sketch of the proof, using an oracle inequality: $l(\bar w) - l(w_0) = \big[F_\sigma(\bar w) - \hat F_\sigma(\bar w)\big] + \big[\hat F_\sigma(\bar w) - F_\sigma(w_0)\big] - \frac{\sigma}{2(p-1)}\|\bar w\|_p^2 + \frac{\sigma}{2(p-1)}\|w_0\|_p^2$; the term $-\frac{\sigma}{2(p-1)}\|\bar w\|_p^2$ is non-positive and can be dropped.

33 Theorem 2. Thm 2: Suppose $\bar w$ is the predictor optimized by the Primal Gradient Solver. If the desired error rate $\varepsilon$ obeys $l(\bar w) \le l(w_0) + \varepsilon$ for $w_0 \in S$, then the required number of iterations satisfies $T = O\Big(\frac{1/\delta}{2\varepsilon^2(p-1)/\|w_0\|_p^2 - O(1/m)}\Big)$. Sketch of the proof, using an oracle inequality: $l(\bar w) - l(w_0) = \big[F_\sigma(\bar w) - \hat F_\sigma(\bar w)\big] + \big[\hat F_\sigma(\bar w) - F_\sigma(w_0)\big] - \frac{\sigma}{2(p-1)}\|\bar w\|_p^2 + \frac{\sigma}{2(p-1)}\|w_0\|_p^2$, which leads to $l(\bar w) - l(w_0) \le O\big(\frac{1}{\sigma T \delta}\big) + O\big(\frac{1}{\sigma m}\big) + \frac{\sigma}{2(p-1)}\|w_0\|_p^2$.
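One way to read the iteration bound of Theorem 2 off this last inequality (the particular choice of $\sigma$ below is my own illustration of the argument, not taken verbatim from the slides): choose $\sigma = \varepsilon(p-1)/\|w_0\|_p^2$, so that the last term equals $\varepsilon/2$, and require the two remaining terms to total at most $\varepsilon/2$. This gives

    \frac{1}{T\delta} \;\le\; O\!\left(\frac{\varepsilon^2 (p-1)}{\|w_0\|_p^2}\right) - O\!\left(\frac{1}{m}\right)
    \quad\Longrightarrow\quad
    T \;=\; O\!\left(\frac{1/\delta}{\,2\varepsilon^2(p-1)/\|w_0\|_p^2 - O(1/m)\,}\right),

which, for a fixed target $\varepsilon$, decreases as the sample size $m$ grows: the inverse time dependency.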

34 Error Decomposition. Theorem 1: $\hat F_\sigma(\bar w) \le \min_w \hat F_\sigma(w) + \frac{C \log T}{\sigma T \delta}$. Theorem 2: $l(\bar w) \le l(w_0) + O\big(\frac{1}{\sigma T \delta}\big) + O\big(\frac{1}{\sigma m}\big) + \frac{\sigma}{2(p-1)}\|w_0\|_p^2$, hence $T = O\Big(\frac{1/\delta}{2\varepsilon^2(p-1)/\|w_0\|_p^2 - O(1/m)}\Big)$. Recall the definitions: $\delta$ is the confidence parameter, $\varepsilon$ the desired generalization error, $p$ the exponent of the $p$-norm regularizer, $w_0$ the optimal predictor, and $m$ the number of samples.
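To see the inverse time dependency numerically, the short sketch below plugs illustrative values into the bound on $T$; every constant hidden by the $O(\cdot)$ notation is set to 1, which is an assumption made only for this demonstration.

    def required_iterations(eps, delta, p, w0_norm, m):
        # T = O( (1/delta) / ( 2 eps^2 (p-1) / ||w_0||_p^2 - O(1/m) ) ),
        # with all hidden constants set to 1 purely for illustration.
        denom = 2.0 * eps ** 2 * (p - 1.0) / w0_norm ** 2 - 1.0 / m
        if denom <= 0.0:
            raise ValueError("m is too small for this target accuracy eps")
        return (1.0 / delta) / denom

    for m in (10_000, 100_000, 1_000_000):
        print(m, round(required_iterations(eps=0.02, delta=0.1, p=2.0, w0_norm=1.0, m=m)))

The printed iteration count shrinks as m grows, while the target accuracy stays fixed.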

35 Experimental Results. Accuracy of PGS on the CCAT dataset, in comparison with the best achievable accuracy by Quasi-Newton (QN):

Regularizer | Loss                | QN Accuracy (2 hours) | PGS Accuracy | PGS Training Time
l2          | Logistic Regression |                       | ±            | sec
l1.8        | Logistic Regression |                       | ±            | sec
l2          | Least Square        |                       | ±            | sec

Speed of PGS on the CCAT dataset: to achieve an accuracy of 94%, PGS takes 10 seconds for p = 2 and 20 seconds for p = 1.8, while Quasi-Newton takes 600 seconds for both.

36 Experimental Results.

37 Experimental Results.

38 Further Discussion. $p$-norm with $p \in (1, 2]$? Non-linear? Kernel? Welcome to my talk on Tuesday, 2-4 PM: P-packSVM: Parallel Primal gradient descent Kernel SVM. Other applications?

39 Conclusion. A fast Primal Gradient Solver for $\ell_p$-norm regularized convex learning. Generalization error = Optimization error + Estimation error. The running time is inversely dependent on the input data size (inverse time dependency).

40 Thanks. Questions? Acknowledgment: Shai Shalev-Shwartz from the Hebrew University.

41 Thm 1 - Primal Gradient Solver: explicit calculation of $w_t = \nabla r^\ast\!\big(\frac{\lambda}{(t+1)\sigma}\big)$. We use the superscript $(j)$ to denote the $j$-th coordinate of a vector.
1. INPUT: $\lambda$, $p$, $S$. Let $n$ be the feature dimension and $q = 1/(1 - 1/p)$.
2. FOR $j = 1, 2, \ldots, n$:
3.   $w_t^{(j)} \leftarrow \frac{1}{q-1} \Big\|\frac{\lambda}{(t+1)\sigma}\Big\|_q^{2-q} \Big|\frac{\lambda^{(j)}}{(t+1)\sigma}\Big|^{q-1} \operatorname{sgn}\big(\lambda^{(j)}\big)$
4. IF $S = \mathbb{R}^n$, RETURN $w_t$.
5. IF $S = \{w : \|w\|_p \le B\}$:
6.   IF $\|w_t\|_p > B$ THEN $w_t \leftarrow \frac{B}{\|w_t\|_p}\, w_t$.
7. RETURN $w_t$.
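The Python sketch below mirrors this computation: the $p$-norm link function followed by the optional rescaling onto the $\ell_p$ ball of radius $B$. The helper name pnorm_link is my own, and the vectorized form is an illustration of the coordinate-wise steps above.

    import numpy as np

    def pnorm_link(lam, t, sigma, p, B=None):
        # w_t = grad r*( lam / ((t+1) sigma) ) for r(w) = ||w||_p^2 / (2 (p-1)),
        # computed via the dual exponent q = 1 / (1 - 1/p); optionally rescale
        # onto S = { w : ||w||_p <= B }.
        q = 1.0 / (1.0 - 1.0 / p)
        v = lam / ((t + 1) * sigma)
        v_q = np.sum(np.abs(v) ** q) ** (1.0 / q)        # ||v||_q
        if v_q == 0.0:
            return np.zeros_like(v)
        w = (np.abs(v) ** (q - 1.0)) * np.sign(v) * v_q ** (2.0 - q) / (q - 1.0)
        if B is not None:                                 # rescaling step for a bounded S
            w_p = np.sum(np.abs(w) ** p) ** (1.0 / p)
            if w_p > B:
                w *= B / w_p
        return w

For p = 2 this reduces to w = lam / ((t + 1) * sigma), matching the special case used in the earlier sketch.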
