Online Gradient Descent Learning Algorithms


1  DISI, Genova, December 2006

Online Gradient Descent Learning Algorithms

Yiming Ying (joint work with Massimiliano Pontil)
Department of Computer Science, University College London

2  Outline

- Introduction: general learning setting; the online gradient descent algorithm
- Main results: generalization error; implications (consistency); error rates
- Discussions and comparisons
- Conclusions and questions

3  Introduction: the learning theory model

- Input sample space $X$: a subset of Euclidean space $\mathbb{R}^d$.
- Output label space $Y$: a subset of $\mathbb{R}$.
- Distribution $\rho$ on $X \times Y$: $\rho(x, y) = \rho_X(x)\,\rho(y \mid x)$.
- Loss function: $L(f(x), y) = (y - f(x))^2$.
- Statistical assumption: the labelled sample sequence $S = \{z_j = (x_j, y_j) : j = 1, 2, \ldots\}$ is drawn independently and identically distributed according to $\rho$.

4  Goal of learning

Given the sample $S$, find a function $f$ in a suitable hypothesis space such that the true error
$$\mathcal{E}(f) := \int_{X \times Y} (f(x) - y)^2 \, d\rho(x, y)$$
is close to the smallest true error $\mathcal{E}(f_\rho)$, where $f_\rho$ is the regression function
$$f_\rho(x) := \int_Y y \, d\rho(y \mid x), \qquad \mathcal{E}(f_\rho) = \inf\{\mathcal{E}(f) : f \colon X \to \mathbb{R}\}.$$

5  Note that
$$\|f - f_\rho\|_{\rho_X}^2 = \int_X (f(x) - f_\rho(x))^2 \, d\rho_X(x) = \mathcal{E}(f) - \mathcal{E}(f_\rho).$$
Equivalent approximation problem: find an approximator $f$ in a hypothesis space such that $\|f - f_\rho\|_\rho^2$ is small.

Hypothesis space assumption
- Hypothesis space: a reproducing kernel Hilbert space $H_K$ (RKHS).
- Gaussian kernel: $K(x, x') = e^{-\sigma \|x - x'\|^2}$.
- Polynomial kernel: $K(x, x') = (1 + \langle x, x' \rangle)^n$.
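The slides contain no code; as a small illustration only, here is a minimal Python sketch of the two kernels above and of the Gram matrix they induce on a sample. NumPy and the parameter names `sigma` and `degree` are assumptions of the example, not part of the talk.

```python
import numpy as np

def gaussian_kernel(x, xp, sigma=1.0):
    """K(x, x') = exp(-sigma * ||x - x'||^2)."""
    return np.exp(-sigma * np.sum((x - xp) ** 2))

def polynomial_kernel(x, xp, degree=3):
    """K(x, x') = (1 + <x, x'>)^n."""
    return (1.0 + np.dot(x, xp)) ** degree

def gram_matrix(X, kernel, **kwargs):
    """Gram matrix K_ij = K(x_i, x_j) for a sample X of shape (t, d)."""
    t = X.shape[0]
    K = np.empty((t, t))
    for i in range(t):
        for j in range(t):
            K[i, j] = kernel(X[i], X[j], **kwargs)
    return K
```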

6  Batch learning algorithms

Use the whole data set $S_t = \{z_1, \ldots, z_t\}$ at one time.

Tikhonov regularization (Cucker and Smale; Evgeniou-Pontil-Poggio; Smale-Zhou; De Vito, Verri et al., from different perspectives: regularization networks, approximation theory, inverse problems, etc.):
$$f_{S_t, \lambda} = \arg\min_{f \in H_K} \frac{1}{t} \sum_{j=1}^{t} (y_j - f(x_j))^2 + \lambda \|f\|_K^2, \qquad \lambda > 0.$$

Gradient descent boosting (Yao, Rosasco and Caponnetto), with an early stopping rule instead of a regularization term in $H_K$ (a sketch follows below):
$$f_{k+1} = f_k - \frac{\eta_k}{t} \sum_{j=1}^{t} (f_k(x_j) - y_j) K_{x_j}.$$
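A hedged sketch of the gradient descent boosting iteration above, written in the coefficient representation $f_k = \sum_j c_j K_{x_j}$, so that $f_k(x_i) = (Kc)_i$. The constant step size `eta` and stopping time `T` are illustrative choices, not the talk's tuned constants; as the slide notes, early stopping plays the role of the regularization term.

```python
import numpy as np

def batch_gradient_descent(K, y, eta=0.5, T=100):
    """Run T steps of f_{k+1} = f_k - (eta/t) * sum_j (f_k(x_j) - y_j) K_{x_j}."""
    t = len(y)
    c = np.zeros(t)                     # coefficients of f_k in span{K_{x_j}}
    for _ in range(T):
        residual = K @ c - y            # (f_k(x_j) - y_j) for j = 1..t
        c -= (eta / t) * residual       # one full-sample gradient step
    return c                            # f_T(x) = sum_j c[j] * K(x_j, x)
```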

7  Stochastic online learning in an RKHS

Use the data one by one.

Online regularized learning algorithm (Kivinen, Smola and Williamson; Smale and Yao; Ying-Zhou et al.):
$$f_{j+1} = f_j - \eta_j \bigl( (f_j(x_j) - y_j) K_{x_j} + \lambda f_j \bigr), \qquad j \in \mathbb{N}, \quad \text{e.g. } f_1 = 0,$$
where $\lambda > 0$ is the regularization parameter and $\{\eta_j\}$ are the step sizes (learning rates).

8  The online algorithm studied here
$$f_{j+1} = f_j - \eta_j (f_j(x_j) - y_j) K_{x_j}, \qquad j \in \mathbb{N}, \quad \text{e.g. } f_1 = 0. \tag{1}$$
Two types of step sizes: $\{\eta_j : j \in \mathbb{N}\}$ a universal sequence, or $\{\eta_j = \eta(t) : j = 1, \ldots, t\}$ depending on the sample number $t$.

Our analysis purposes for the above algorithm:
- Stochastic generalization error bounds for $\mathbb{E}\bigl[\|f_{t+1} - f_\rho\|_\rho^2\bigr]$ in terms of the step sizes and the approximation property of $H_K$.
- The choice of step sizes that guarantees (weak) consistency: $\mathbb{E}\bigl[\|f_{t+1} - f_\rho\|_\rho^2\bigr] \to \inf_{f \in H_K} \|f - f_\rho\|_\rho^2$ as $t \to \infty$.
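A minimal Python sketch of algorithm (1): one pass over the data, one kernel coefficient appended per example; setting `lam > 0` recovers the regularized variant of the previous slide. The kernel and the step-size schedule `eta_of` are left as inputs and are assumptions of the example, not prescribed by the slides.

```python
def online_gradient_descent(X, y, kernel, eta_of, lam=0.0):
    """f_{j+1} = f_j - eta_j * ((f_j(x_j) - y_j) K_{x_j} + lam * f_j); lam = 0 gives (1)."""
    centers, coeffs = [], []            # f_1 = 0: empty kernel expansion
    for j, (xj, yj) in enumerate(zip(X, y), start=1):
        eta = eta_of(j)
        fj_xj = sum(c * kernel(xc, xj) for c, xc in zip(coeffs, centers))  # f_j(x_j)
        coeffs = [(1.0 - eta * lam) * c for c in coeffs]  # shrinkage from the lam * f_j term
        centers.append(xj)
        coeffs.append(-eta * (fj_xj - yj))                # new coefficient on K_{x_j}
    def f(x):                                             # the output hypothesis f_{t+1}
        return sum(c * kernel(xc, x) for c, xc in zip(coeffs, centers))
    return f
```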

9  Type I: step sizes form a universal sequence

Main results: generalization error. Define the K-functional
$$\mathcal{K}(s, f_\rho) := \inf_{f \in H_K} \bigl\{ \|f - f_\rho\|_\rho + s \|f\|_K \bigr\}, \qquad s > 0.$$

Theorem 1. Let $\theta \in (0, 1)$ and $\eta_j = \frac{1}{\mu} j^{-\theta}$ for all $j$, with some constant $\mu \ge \mu(\theta)$. Then, for any $t$,
$$\mathbb{E}\bigl[\|f_{t+1} - f_\rho\|_\rho^2\bigr] \le \bigl[\mathcal{K}\bigl(b_{\theta,\mu}\, t^{-(1-\theta)/2}, f_\rho\bigr)\bigr]^2 + O\bigl(t^{-\min\{\theta, 1-\theta\}} \ln t\bigr).$$

Implication for consistency: the K-functional $\mathcal{K}(\cdot, f_\rho)$ is non-decreasing, concave, and $\lim_{s \to 0^+} \mathcal{K}(s, f_\rho) = \inf_{f \in H_K} \|f - f_\rho\|_\rho$, which gives consistency:
$$\lim_{t \to \infty} \mathbb{E}\bigl[\|f_{t+1} - f_\rho\|_\rho^2\bigr] = \inf_{f \in H_K} \|f - f_\rho\|_\rho^2.$$

10  Error rates

We assume that $f_\rho$ has some smoothness. Define $L_K \colon L^2_{\rho_X} \to L^2_{\rho_X}$ by
$$L_K f(x) = \int_X K(x, y) f(y) \, d\rho_X(y), \qquad x \in X, \ f \in L^2_{\rho_X}.$$
The fractional range space $L_K^{\beta}(L^2_{\rho_X})$ is the range space of $L_K^{\beta}$.

Theorem 2. Let $\theta \in (0, 1)$ and let $\mu(\theta)$ be an absolute constant depending on $\theta$. If $f_\rho \in L_K^{\beta}(L^2_{\rho_X})$ with some $0 < \beta \le 1/2$ then, by selecting $\eta_j = \frac{1}{\mu(\frac{2\beta}{2\beta+1})}\, j^{-\frac{2\beta}{2\beta+1}}$ for all $j$, for any $t$ there holds
$$\mathbb{E}_{Z^t}\bigl[\|f_{t+1} - f_\rho\|_\rho^2\bigr] = O\bigl(t^{-\frac{2\beta}{2\beta+1}} \ln t\bigr). \tag{2}$$
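The Type I schedule of Theorem 2 is easy to instantiate; here is a sketch, with `mu` standing in for the unspecified constant from the theorem. Only the decay exponent $2\beta/(2\beta+1)$ is dictated by the slide; the numerical values below are illustrative.

```python
def type_one_step_size(beta, mu):
    """Polynomially decaying schedule eta_j = j^{-theta} / mu, theta = 2*beta/(2*beta+1)."""
    theta = 2.0 * beta / (2.0 * beta + 1.0)      # decay exponent matched to smoothness beta
    return lambda j: (1.0 / mu) * j ** (-theta)

# Example: beta = 1/2 gives theta = 1/2 and the rate O(t^{-1/2} ln t) of (2).
eta_of = type_one_step_size(beta=0.5, mu=4.0)
```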

11  Type II: step sizes depending on the sample number

Generalization error.

Theorem 3. Let $\eta_j = \eta$ for all $j$. Then
$$\mathbb{E}\bigl[\|f_{t+1} - f_\rho\|_\rho^2\bigr] \le \bigl[\mathcal{K}\bigl((\eta t)^{-\frac{1}{2}}, f_\rho\bigr)\bigr]^2 + O(\eta \ln t).$$

Rule of early stopping: trade off $\mathcal{K}\bigl((\eta t)^{-\frac{1}{2}}, f_\rho\bigr)$ against $O(\eta \ln t)$, which yields a stopping rule $t = t(\eta)$ ensuring the bound tends to zero as $\eta \to 0^+$. Equivalently, from the perspective of choosing step sizes, $\eta = \eta(t)$.

12  Implication for (weak) consistency: for step sizes depending on the sample number $t$,
$$\lim_{t \to \infty} \eta(t) \ln t = 0 \quad \text{and} \quad \lim_{t \to \infty} t\, \eta(t) = \infty \ \Longrightarrow \ \text{consistency}.$$

Error rates.

Theorem 4. Let $\eta_j = \eta$ for $j = 1, 2, \ldots, t$. If $f_\rho \in L_K^{\beta}(L^2_{\rho_X})$ for some $\beta > 0$ then, by choosing $\eta := \frac{2\beta}{64(1+\kappa)^4 (2\beta+1)}\, t^{-\frac{2\beta}{2\beta+1}}$, we have that
$$\mathbb{E}\bigl[\|f_{t+1} - f_\rho\|_\rho^2\bigr] = O\bigl(t^{-\frac{2\beta}{2\beta+1}} \ln t\bigr).$$
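Similarly, a sketch of the constant (Type II) step size of Theorem 4 together with the two weak-consistency conditions of this slide. The multiplicative constant `C` is an illustrative placeholder for the slide's expression in $\beta$ and $\kappa$; only the scaling $\eta(t) \sim t^{-2\beta/(2\beta+1)}$ is taken from the theorem.

```python
import math

def type_two_step_size(t, beta, C=0.01):
    """Constant step size eta(t) for a sample of size t, scaled as in Theorem 4."""
    return C * t ** (-2.0 * beta / (2.0 * beta + 1.0))

def consistency_conditions(eta_of_t, t):
    """Evaluate eta(t)*ln t (should be small) and t*eta(t) (should be large) at a given t."""
    eta = eta_of_t(t)
    return eta * math.log(t), t * eta

# Example check at t = 10^6 with beta = 1/2:
print(consistency_conditions(lambda s: type_two_step_size(s, beta=0.5), 10**6))
```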

13  Discussions and comparisons

Comparisons are based on the same assumption $f_\rho \in L_K^{\beta}(L^2_{\rho_X})$.

Our error rates for the online gradient descent algorithm (1):
- (I) $O\bigl(t^{-\frac{2\beta}{2\beta+1}} \ln t\bigr)$ with $\beta \in (0, \tfrac{1}{2}]$ for $\{\eta_j, j \in \mathbb{N}\}$ a universal sequence.
- (II) $O\bigl(t^{-\frac{2\beta}{2\beta+1}} \ln t\bigr)$ with $\beta > 0$ for $\{\eta_j = \eta(t) : j = 1, \ldots, t\}$ depending on the sample number.

Batch Tikhonov regularization: $O\bigl(t^{-\frac{2\beta}{2\beta+1}}\bigr)$ with $\beta \in (0, 1]$ (Zhang; Smale and Zhou).

14  Discussions continued

Online regularized algorithm, choosing $\lambda = \lambda(t) > 0$ appropriately:
- Yao and Smale; Ying-Zhou: $O\bigl(t^{-\frac{2\beta}{2\beta+2}} \ln t\bigr)$ with $\beta \in (0, 1]$.
- Pontil and Ying: $O\bigl(t^{-\frac{2\beta}{2\beta+1}} \ln t\bigr)$ with $\beta \in (0, 1]$.

The rate $O\bigl(t^{-\frac{2\beta}{2\beta+1}}\bigr)$ is capacity-independent (eigenvalue-independent) optimal: the only assumption is on $f_\rho$, with no assumption on the decay of the eigenvalues of $L_K$, as implied by Caponnetto and De Vito.

15  Ideas of the proof

Three main steps. Error decomposition: rewrite the online algorithm (1) as
$$f_{j+1} - f_\rho = (I - \eta_j L_K)(f_j - f_\rho) + \eta_j \bigl( L_K(f_j - f_\rho) + (y_j - f_j(x_j)) K_{x_j} \bigr),$$
where $I$ is the identity operator. Define $B(f_j, z_j) := L_K(f_j - f_\rho) + (y_j - f_j(x_j)) K_{x_j}$; then $\mathbb{E}_{z_j}\bigl[B(f_j, z_j)\bigr] = 0$, since $\mathbb{E}_{z_j}\bigl[(y_j - f_j(x_j)) K_{x_j}\bigr] = L_K(f_\rho - f_j)$.

Set $\omega_k^t(L_K) := \prod_{j=k}^{t} (I - \eta_j L_K)$ and $\omega_{t+1}^t(L_K) := I$. Then
$$f_{t+1} - f_\rho = -\,\omega_1^t(L_K) f_\rho + \sum_{j=1}^{t} \eta_j\, \omega_{j+1}^t(L_K)\, B(f_j, z_j).$$

16  Proof continued

Proposition 1. For any $t$, $\mathbb{E}\bigl[\|f_{t+1} - f_\rho\|_\rho^2\bigr]$ is bounded by
$$\underbrace{\|\omega_1^t(L_K) f_\rho\|_\rho^2}_{\text{approximation error}} \;+\; \underbrace{2(1+\kappa)^4 \sum_{k=1}^{t} \mathbb{E}\bigl[\mathcal{E}(f_k)\bigr]\, \eta_k^2 \Big/ \Bigl( \sum_{j=k+1}^{t} \eta_j + 1 \Bigr)}_{\text{cumulative sample error}}.$$

Remark 1. The standard cumulative loss $\sum_{k=1}^{t} (y_k - f_k(x_k))^2$ has been extensively studied in the online learning community: Cesa-Bianchi, Warmuth, Smola, et al. Here a weighted cumulative loss appears instead: $\sum_{k=1}^{t} (y_k - f_k(x_k))^2\, \eta_k^2 \big/ \bigl( \sum_{j=k+1}^{t} \eta_j + 1 \bigr)$.

17  Sketch of proof for Proposition 1

$$\mathbb{E}\bigl[\|f_{t+1} - f_\rho\|_\rho^2\bigr] = \|\omega_1^t(L_K) f_\rho\|_\rho^2 + \mathbb{E}\Bigl[\Bigl\| \sum_{j=1}^{t} \eta_j\, \omega_{j+1}^t(L_K)\, B(f_j, z_j) \Bigr\|_\rho^2\Bigr] - 2\, \underbrace{\mathbb{E}\Bigl[\Bigl\langle \omega_1^t(L_K) f_\rho,\ \sum_{j=1}^{t} \eta_j\, \omega_{j+1}^t(L_K)\, B(f_j, z_j) \Bigr\rangle_\rho\Bigr]}_{\text{zero, since } \mathbb{E}_{z_j}[B(f_j, z_j)] = 0}.$$

$$\mathbb{E}_{Z^t}\Bigl[\Bigl\| \sum_{k \le t} \eta_k\, \omega_{k+1}^t(L_K)\, B(f_k, z_k) \Bigr\|_\rho^2\Bigr] = \sum_{k \le t} \eta_k^2\, \mathbb{E}_{Z^k}\bigl[\|\omega_{k+1}^t(L_K)\, B(f_k, z_k)\|_\rho^2\bigr] \le \sum_{k \le t} \eta_k^2\, \bigl\|\omega_{k+1}^t(L_K) L_K^{\frac{1}{2}}\bigr\|^2\, \mathbb{E}_{Z^k}\bigl[\|B(f_k, z_k)\|_K^2\bigr],$$

$$\mathbb{E}_{z_k}\bigl[\|B(f_k, z_k)\|_K^2\bigr] \le c\, \mathcal{E}(f_k), \qquad \bigl\|\omega_{k+1}^t(L_K) L_K^{\frac{1}{2}}\bigr\|^2 \le 2(1+\kappa)^2 \Big/ \Bigl( \sum_{j=k+1}^{t} \eta_j + 1 \Bigr).$$

18  Proof continued

Approximation error: for any $f \in H_K$,
$$\|\omega_1^t(L_K) f_\rho\|_\rho \le \|\omega_1^t(L_K)(f - f_\rho)\|_\rho + \|\omega_1^t(L_K) f\|_\rho \le \|f - f_\rho\|_\rho + \bigl\|\omega_1^t(L_K) L_K^{\frac{1}{2}}\bigr\|\, \bigl\|L_K^{-\frac{1}{2}} f\bigr\|_\rho \le \|f - f_\rho\|_\rho + 2(1+\kappa) \Bigl( \sum_{k=1}^{t} \eta_k + 1 \Bigr)^{-\frac{1}{2}} \|f\|_K.$$

K-functional: taking the infimum over $f \in H_K$,
$$\|\omega_1^t(L_K) f_\rho\|_\rho \le \mathcal{K}\Bigl( 2(1+\kappa) \Bigl( \sum_{k=1}^{t} \eta_k + 1 \Bigr)^{-\frac{1}{2}},\ f_\rho \Bigr).$$

19  Proof continued

Cumulative sample error:
$$\sum_{k=1}^{t} \mathbb{E}\bigl[\mathcal{E}(f_k)\bigr]\, \eta_k^2 \Big/ \Bigl( \sum_{j=k+1}^{t} \eta_j + 1 \Bigr) \le \Bigl( \sup_{k=1,\ldots,t} \mathbb{E}\bigl[\mathcal{E}(f_k)\bigr] \Bigr) \sum_{k=1}^{t} \eta_k^2 \Big/ \Bigl( \sum_{j=k+1}^{t} \eta_j + 1 \Bigr).$$

- Uniform bound for $\mathbb{E}\bigl[\mathcal{E}(f_k)\bigr]$: obtained using $\|f_\rho\|_\rho$ and $\mathcal{E}(f_\rho)$.
- Estimate of $\sum_{k=1}^{t} \eta_k^2 \big/ \bigl( \sum_{j=k+1}^{t} \eta_j + 1 \bigr)$: for instance, for $\eta_j = O(j^{-\theta})$ with $\theta \in (0, 1)$,
$$\sum_{k=1}^{t} \eta_k^2 \Big/ \Bigl( \sum_{j=k+1}^{t} \eta_j + 1 \Bigr) = O\bigl( t^{-\min\{\theta, 1-\theta\}} \ln t \bigr).$$
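The last estimate is easy to check numerically. Below is a small sketch (not part of the proof) that computes the weighted sum for $\eta_j = j^{-\theta}$ and compares it with $t^{-\min\{\theta, 1-\theta\}} \ln t$; the specific values of $\theta$ and $t$ are illustrative.

```python
import math

def weighted_sum(t, theta):
    """Compute sum_{k=1}^t eta_k^2 / (sum_{j=k+1}^t eta_j + 1) for eta_j = j^{-theta}."""
    eta = [j ** (-theta) for j in range(1, t + 1)]
    tail = 0.0      # running value of sum_{j>k} eta_j
    total = 0.0
    for k in range(t, 0, -1):
        total += eta[k - 1] ** 2 / (tail + 1.0)
        tail += eta[k - 1]
    return total

theta = 0.5
for t in (10**3, 10**4, 10**5):
    bound = t ** (-min(theta, 1 - theta)) * math.log(t)
    print(t, weighted_sum(t, theta), bound)   # the ratio should remain bounded as t grows
```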

20  Conclusions

The online gradient descent algorithm is a simple yet competitive algorithm:
- It is statistically consistent (for both types of step sizes).
- Its error rates are essentially the same as those of classical batch regularized learning and online regularized learning algorithms.
- It achieves optimal capacity-independent error rates.

21  Questions

- $H_K$-norm error $\|f_{t+1} - f_\rho\|_K$ with the universal polynomially decaying step sizes.
- Probability inequality estimates and almost sure convergence (strong consistency).
- Generalization error analysis with non-i.i.d. data.
- Can we directly use standard cumulative error bounds to bound the generalization error?

22  Thank you! Merry Christmas and a Happy New Year! (Grazie! Buon Natale e Felice Anno Nuovo!)
