Online gradient descent learning algorithm


Yiming Ying and Massimiliano Pontil
Department of Computer Science, University College London
Gower Street, London, WC1E 6BT, England, UK
{y.ying,

Abstract

This paper considers the least-square online gradient descent algorithm in a reproducing kernel Hilbert space (RKHS) without an explicit regularization term. We present a novel capacity independent approach to derive error bounds and convergence results for this algorithm. The essential element in our analysis is the interplay between the generalization error and a weighted cumulative error which we define in the paper. We show that, although the algorithm does not involve an explicit RKHS regularization term, choosing the step sizes appropriately yields error rates competitive with those in the literature.

Short Title: Online gradient descent learning

Keywords and Phrases: Learning theory, online learning, reproducing kernel Hilbert space, gradient descent, error analysis.

AMS Subject Classification Numbers: 68Q32, 68T05, 62J02, 62L20.

Contact author: Yiming Ying. Telephone: +44 (0), Fax: +44 (0)

1 Introduction

Let X be a compact subset of the Euclidean space ℝ^d, let Y be a subset of ℝ, and set ℕ_l := {1, ..., l} for any l ∈ ℕ. Let ρ be a fixed but unknown probability measure on Z := X × Y. The generalization error of a function f : X → Y is defined as

    E(f) := ∫_Z (f(x) − y)² dρ(z).    (1.1)

The function minimizing this error is called the regression function; it is given by

    f_ρ(x) = ∫_Y y dρ(y|x),  x ∈ X,    (1.2)

where ρ(·|x) is the conditional probability measure at x induced by ρ. We consider the problem of approximating the regression function from a set of training data drawn from ρ. In this paper, we restrict our attention to online learning algorithms for computing an approximation of f_ρ in a reproducing kernel Hilbert space (RKHS). Let K : X × X → ℝ be a Mercer kernel, that is, a continuous, symmetric and positive semi-definite function. The RKHS H_K associated with K is defined to be the completion of the linear span of the set of functions {K_x(·) := K(x, ·) : x ∈ X}, with inner product satisfying, for any x ∈ X and g ∈ H_K, the reproducing property

    ⟨K_x, g⟩_K = g(x).    (1.3)

Let {z_t = (x_t, y_t) : t ∈ ℕ} be a sequence of random samples independently distributed according to ρ. The online gradient descent algorithm is defined as

    f_1 = 0, and for t ∈ ℕ:  f_{t+1} = f_t − η_t ((f_t(x_t) − y_t) K_{x_t} + λ f_t),    (1.4)

where λ ≥ 0 is called the regularization parameter. We call the sequence {η_t : t ∈ ℕ} the step sizes (or learning rates) and {f_t : t ∈ ℕ} the learning sequence. Clearly, each output f_{t+1} depends only on {z_j : j ∈ ℕ_t}. The class of learning algorithms displayed above is also referred to as stochastic approximation algorithms in the setting of RKHSs. Such stochastic approximation procedures have a long history; see [8, 35] and the references therein for more background material.
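The recursion (1.4) is straightforward to implement by storing the kernel expansion f_t = Σ_j c_j K_{x_j}. The sketch below is illustrative only: the Gaussian kernel, its width, the step-size constants and the toy regression target are our own choices, not taken from the paper.

```python
import numpy as np

def gaussian_kernel(x, xp, sigma=0.2):
    # Gaussian (Mercer) kernel; here kappa = sup_x sqrt(K(x, x)) = 1.
    return np.exp(-(x - xp) ** 2 / sigma ** 2)

def online_gradient_descent(samples, eta, lam=0.0, kernel=gaussian_kernel):
    """f_1 = 0; f_{t+1} = f_t - eta_t*((f_t(x_t) - y_t)*K_{x_t} + lam*f_t).

    f_t is stored via its kernel expansion f_t = sum_j c_j K_{x_j};
    eta is a function t -> eta_t (t is 1-indexed)."""
    xs, cs = [], []                                             # support points and coefficients of f_t
    for t, (x_t, y_t) in enumerate(samples, start=1):
        f_xt = sum(c * kernel(x, x_t) for x, c in zip(xs, cs))  # evaluate f_t(x_t)
        cs = [(1.0 - eta(t) * lam) * c for c in cs]             # shrink step: -eta_t*lam*f_t
        xs.append(x_t)
        cs.append(-eta(t) * (f_xt - y_t))                       # gradient step on the new sample
    return xs, cs

# toy run with polynomially decaying steps eta_t = mu * t^{-theta}, theta = 1/2
rng = np.random.default_rng(0)
data = [(x, np.sin(2 * np.pi * x) + 0.1 * rng.standard_normal()) for x in rng.random(300)]
xs, cs = online_gradient_descent(data, eta=lambda t: 0.5 * t ** -0.5)
f_hat = lambda x: sum(c * gaussian_kernel(xj, x) for xj, c in zip(xs, cs))
```

Setting lam > 0 gives the regularized variant of (1.4); lam = 0 recovers the unregularized algorithm that is the main object of the paper.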

When the parameter λ > 0, we call (1.4) the online regularized algorithm. This algorithm has been well studied in the recent literature, see e.g. [33]. In this paper, we are mainly concerned with the online gradient descent algorithm without an explicit RKHS regularization term, which is given by (1.4) with λ = 0, that is,

    f_1 = 0, and for t ∈ ℕ:  f_{t+1} = f_t − η_t (f_t(x_t) − y_t) K_{x_t}.    (1.5)

We study how well f_{T+1}, the output at iteration step (sample size) T, approximates the regression function f_ρ. The nature of the least-square loss leads to measurement in the metric of L²_{ρ_X}, defined by

    ‖f‖_ρ = ‖f‖_{L²_{ρ_X}} := (∫_X f(x)² dρ_X)^{1/2},

where ρ_X is the marginal distribution of ρ on X. A direct computation yields that

    ‖f_{T+1} − f_ρ‖²_ρ = E(f_{T+1}) − E(f_ρ);    (1.6)

see e.g. [1] for a proof.

Our primary goal is to estimate the error (1.6) for the least-square online algorithm (1.5) by means of properties of ρ and K. We shall show how the choice of the step sizes in the algorithm affects its generalization error. Moreover, we shall derive error rates for algorithm (1.5) which are competitive with those in the literature. We mainly focus on two types of step sizes. The first is a universal polynomially decaying sequence of the form {η_t = O(t^{−θ}) : t ∈ ℕ} with θ ∈ (0, 1). The second type is of the form {η_t = η : t ∈ ℕ_T} with η = η(T) depending on the iteration number (sample size) T. As we shall discuss in Section 2, the function η(T) provides a trade-off between the learning rate η and the number of iterations T, and its choice ensures convergence of the algorithm. This is in the spirit of early stopping as implicit regularization, see e.g. [36].

The rest of the paper is organized as follows. The next section summarizes our main results and Section 3 collects discussions of related work. To formulate our basic ideas, in Section 4 we provide a novel approach to the error analysis of the online algorithm (1.4) under some conditions on the step sizes. The essential element in our analysis is an appealing relation between the generalization error and a weighted form of cumulative sample error (see Proposition 1), a quantity familiar from the online learning literature. In Section 5, we develop error bounds and convergence results for the online algorithm (1.5). Finally, Section 6 establishes explicit error rates for

algorithm (1.5) and, as a byproduct, improves the preceding rates [33] for the online regularized algorithm given by (1.4) with λ > 0.

2 Main results

We begin with some background material and notation. Firstly, let C(X) be the space of continuous functions on X with the supremum norm, and let κ := sup_{x∈X} √K(x, x). By the reproducing property (1.3), for every f ∈ H_K,

    ‖f‖_∞ ≤ κ ‖f‖_K.    (2.1)

In addition, for every t ∈ ℕ we denote the expectation E_{z_1,...,z_t} by E_{Z^t} and use the convention E_{Z^0}[ξ] = ξ for any random variable ξ. Secondly, we introduce the K-functional [5], which plays a central role in approximation theory:

    K(s, f_ρ) := inf_{f∈H_K} { ‖f − f_ρ‖_ρ + s ‖f‖_K },  s > 0.    (2.2)

Next, we define, for any θ ∈ (0, 1), the quantity

    μ(θ) := 1/(48(1+κ)⁴) if θ = 1/2, and μ(θ) := min{θ, 1−θ} / (108(1+κ)⁴ (ln(8/θ) + 2)) otherwise.    (2.3)

Finally, we require that ∫_Z y² dρ(z) < ∞, which implies that the quantity E(f_ρ) + ‖f_ρ‖²_ρ is finite.

We are now ready to state our main results. The first one deals with error bounds for the online gradient descent algorithm (1.5).

Theorem 1. Let θ ∈ (0, 1) and η_t = μ t^{−θ} for t ∈ ℕ, with some constant μ ≤ μ(θ). Define {f_t : t ∈ ℕ} by (1.5). Then, for any T ∈ ℕ, E_{Z^T}[‖f_{T+1} − f_ρ‖²_ρ] is bounded by

    [K(b_θ (μ T^{1−θ})^{−1/2}, f_ρ)]² + c_ρ c_θ μ T^{−min{θ, 1−θ}} ln(8T/θ),    (2.4)

where c_ρ := 4(1+κ)⁴(10‖f_ρ‖²_ρ + 3E(f_ρ)), b_θ := (1+κ) if θ = 1/2 and b_θ := 3(1−θ)^{−1/2} otherwise, and c_θ := 3/θ.

The K-functional term of our upper bound in Theorem 1 concerns the approximation of the function f_ρ in the space L²_{ρ_X} by functions in the

5 space H K. Its polynomial decay can be characterized by requiring that f ρ lies in the interpolation space (L ρ X, H K ) β, defined as (L ρ X, H K ) β, := { f L ρ X : f β, = sup s β K(s, f) < }, (.5) s>0 where β 0, ]. Indeed, it is easy to see that, for some c > 0, K(s, f ρ ) cs β for any s > 0 if and only if f ρ (L ρ X, H K ) β,. Note 5] that (L ρ X, H K ) 0, = L ρ X and (L ρ X, H K ), = H K. Therefore, we can intuitively regard the interpolation space (L ρ X, H K ) β, as an intermediate space between the metric space L ρ X and the much smaller approximation space H K. Similar ideas of using the K-functional to characterize the approximation property of the space H K can be found in,, 3]. In general, we have 5] that lim s 0+ K(s, f ρ ) = inf f HK f f ρ ρ. This gives rise to convergence of the online gradient descent algorithm (.5). Theorem. Let θ (0, ) and µ(θ) be given by (.3). Define {f t : t IN} by (.5). If the step sizes satisfy η t = µ t θ for t IN with some constant µ µ(θ) then we have that lim E ] Z T T ft + f ρ ρ = inf f f ρ f H K ρ. We will give the proofs of Theorems and in Section 5. If the space H K has a good approximation property then the limit in Theorem equals zero. To see this, we say that H K is dense in L ρ X if inf f g ρ = 0, for all g L ρ f H X. (.6) K We note that, in 6], a kernel is called universal if inf f HK f g = 0, for all g C(X). For example, the Gaussian kernel K σ (x, x ) = is universal for every σ > 0. Universal kernels are sufficient to ensure our density condition (.6), since f ρ f and C(X) is dense in L ρ X. Note that f ρ L ρ X. Therefore, an immediate consequence of Theorem is the following corollary. e x x σ Corollary. Suppose the assumptions in Theorem hold true. Moreover, if H K is dense in L ρ X then we have that lim E ] Z T T ft + f ρ ρ = 0. 5

By choosing θ appropriately, we can get explicit error rates when the regression function f_ρ lies in the regularity space L_K^β(L²_{ρ_X}). To define this space, we introduce the integral operator L_K : L²_{ρ_X} → L²_{ρ_X},

    L_K f(x) := ∫_X K(x, x') f(x') dρ_X(x'),  x ∈ X, f ∈ L²_{ρ_X}.

Since K is a Mercer kernel, L_K is compact and positive. Therefore, the fractional power operator L_K^β is well defined for any β > 0. We denote its range space by

    L_K^β(L²_{ρ_X}) := { f = Σ_{j=1}^∞ λ_j^β a_j φ_j : ‖L_K^{−β} f‖²_ρ := Σ_{j=1}^∞ a_j² < ∞ },    (2.7)

where {λ_j : j ∈ ℕ} are the positive eigenvalues of the operator L_K and {φ_j : j ∈ ℕ} are the corresponding orthonormal eigenfunctions. Thus, the smaller β is, the bigger the range space L_K^β(L²_{ρ_X}) will be. In particular, we know [5] that L_K^β(L²_{ρ_X}) ⊆ H_K for β ≥ 1/2 and L_K^{1/2}(L²_{ρ_X}) = H_K, with the norm satisfying

    ‖g‖_K = ‖L_K^{−1/2} g‖_ρ,  g ∈ H_K.    (2.8)

One can find more details in [5].

Of course, one could instead assume that f_ρ ∈ (L²_{ρ_X}, H_K)_{β,∞}. This is a natural assumption, since L_K^{β/2}(L²_{ρ_X}) ⊆ (L²_{ρ_X}, H_K)_{β,∞} for β ∈ [0, 1]; see the Appendix. However, in this paper we concentrate on the hypothesis space L_K^β(L²_{ρ_X}), since this facilitates a comparison of our results with those in the related literature (e.g. [5, 7, 33, 34]).

We are now in a position to state our error rates for algorithm (1.5). Hereafter, the expression a_T = O(b_T) means that there exists an absolute constant c such that a_T ≤ c b_T for all T ∈ ℕ.

Theorem 3. Let μ(·) be given by (2.3) and define {f_t : t ∈ ℕ} by (1.5). If f_ρ ∈ L_K^β(L²_{ρ_X}) with some 0 < β ≤ 1/2 then, by selecting η_t = μ(2β/(2β+1)) t^{−2β/(2β+1)} for t ∈ ℕ, for any T ∈ ℕ there holds

    E_{Z^T}[‖f_{T+1} − f_ρ‖²_ρ] = O(T^{−2β/(2β+1)} ln T).    (2.9)

In Section 6.1, we will derive the error rate (2.9) from some modified error bounds similar to (2.4). Since the best rate of the second term on the right-hand side of (2.4) is O(T^{−1/2} ln T), the associated error rate

(2.9) is never faster than the rate O(T^{−1/2} ln T) achieved for β = 1/2. In contrast, it is known that regularization algorithms in batch learning [5, 7] can achieve better rates than O(T^{−1/2}) if f_ρ ∈ L_K^β(L²_{ρ_X}) with β > 1/2. Unfortunately, for the online algorithm (1.5) with polynomially decaying step sizes, we do not know how to get better rates beyond the range β ∈ (0, 1/2].

We turn our attention to the case where the step sizes of (1.5) are of the form {η_t = η : t ∈ ℕ_T}, the learning rate η depending on the sample number T. In this scenario, the error bounds for algorithm (1.5) read as follows.

Theorem 4. Let {η_t = η : t ∈ ℕ_T} and let {f_t : t ∈ ℕ_{T+1}} be produced by algorithm (1.5). If η ≤ 1/(16(κ+1)⁴ ln(8T)), then E_{Z^T}[‖f_{T+1} − f_ρ‖²_ρ] is bounded by

    [K((1+κ)(ηT)^{−1/2}, f_ρ)]² + c_ρ η ln(8T).    (2.10)

Note that the function K(·, f_ρ) is non-decreasing and lim_{s→0+} K(s, f_ρ) = inf_{f∈H_K} ‖f − f_ρ‖_ρ. Thus, from the error bound (2.10) we see that a large number of iterations T reduces the K-functional term but enlarges the last term in (2.10); on the other hand, a small number of iterations reduces the last term in (2.10) but enlarges the K-functional term. Hence, trading off the two terms in the error bound (2.10) suggests an early stopping rule T = T(η) which guarantees convergence of algorithm (1.5) as η → 0+. However, our purpose in this paper is to derive error rates with respect to the number of iterations (sample number) T by choosing η appropriately, and thus we take the inverse description η = η(T).

In addition to the error ‖f_{T+1} − f_ρ‖_ρ, there are other interesting relevant quantities, such as the error ‖f_{T+1} − f_ρ‖_K (if f_ρ ∈ H_K). Observe that the global error ‖f_{T+1} − f_ρ‖_ρ cannot wholly describe the local behavior of f_{T+1}, which is usually measured by |f_{T+1}(x₀) − f_ρ(x₀)| for x₀ ∈ X. However, if f_ρ ∈ H_K, for every x₀ ∈ X we have

    |f_{T+1}(x₀) − f_ρ(x₀)| ≤ κ ‖f_{T+1} − f_ρ‖_K.

In this case, convergence and error rates in H_K reflect the local performance of the predictor f_{T+1}. Indeed, it was further pointed out in [5] that convergence in H_K yields convergence in C^k(X) under some conditions on K, where C^k(X) denotes the space of all functions whose derivatives up to order k are continuous. (In the batch learning setting mentioned above, the whole batch of examples is used at one time.)
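The choice of θ in Theorem 3 above can be motivated by balancing the two terms of the bound in Theorem 1, using the K-functional decay K(s, f_ρ) ≤ c s^{2β} available when f_ρ ∈ L_K^β(L²_{ρ_X}) with 0 < β ≤ 1/2. A sketch of this balancing, with constants suppressed:

```latex
% First term of the Theorem 1 bound, using K(s, f_\rho) \le c\, s^{2\beta}:
\Big[K\big(b_\theta(\mu T^{1-\theta})^{-1/2}, f_\rho\big)\Big]^2
  = O\big(T^{-2\beta(1-\theta)}\big).
% Second term, for \theta \le 1/2 (so \min\{\theta,1-\theta\} = \theta):
\mu\, T^{-\min\{\theta,1-\theta\}}\ln T = O\big(T^{-\theta}\ln T\big).
% Equating the two exponents:
2\beta(1-\theta) = \theta
\;\Longleftrightarrow\;
\theta = \frac{2\beta}{2\beta+1},
% which yields the rate O\big(T^{-2\beta/(2\beta+1)}\ln T\big) of Theorem 3.
```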

When the step sizes of (1.5) are of the form {η_t = η(T) : t ∈ ℕ_T}, we can get convergence results in L²_{ρ_X} as well as in H_K.

Theorem 5. Let {η_t = η(T) : t ∈ ℕ_T} and let {f_t : t ∈ ℕ_{T+1}} be produced by (1.5). Then the following statements hold true.

(1) If the step size satisfies

    lim_{T→∞} η(T) T = ∞,  lim_{T→∞} η(T) ln T = 0,    (2.11)

then there holds

    lim_{T→∞} E_{Z^T}[‖f_{T+1} − f_ρ‖²_ρ] = inf_{f∈H_K} ‖f − f_ρ‖²_ρ.    (2.12)

(2) If f_ρ ∈ H_K and the step size satisfies

    lim_{T→∞} η(T) T = ∞,  lim_{T→∞} η²(T) T = 0,    (2.13)

then we have that

    lim_{T→∞} E_{Z^T}[‖f_{T+1} − f_ρ‖²_K] = 0.    (2.14)

Note that, if the step size decays in the form η(T) = O(T^{−θ}) with θ > 0, then hypothesis (2.11) allows the choice θ ∈ (0, 1), while (2.13) requires θ ∈ (1/2, 1). We will prove Theorems 4 and 5 in Section 5. In Section 6, we shall establish the following error rates in L²_{ρ_X} as well as in H_K when the step sizes are of the form {η_t = η(T) : t ∈ ℕ_T}.

Theorem 6. Let {η_t = η : t ∈ ℕ_T} and let {f_t : t ∈ ℕ_{T+1}} be produced by (1.5). If f_ρ ∈ L_K^β(L²_{ρ_X}) for some β > 0 then, by choosing

    η := (2β / (64(1+κ)⁴(2β+1))) T^{−2β/(2β+1)},

we have that

    E_{Z^T}[‖f_{T+1} − f_ρ‖²_ρ] = O(T^{−2β/(2β+1)} ln T).    (2.15)

Moreover, if β > 1/2, then there holds

    E_{Z^T}[‖f_{T+1} − f_ρ‖²_K] = O(T^{−(2β−1)/(2β+1)}).    (2.16)

As the last contribution of this paper, we further improve the preceding error rate [33] for the online regularized algorithm (1.4) with λ > 0. The proof will be given in Section 6.2.

Theorem 7. Let λ > 0 and let μ(·) be given by (2.3). Define {f_t : t ∈ ℕ_{T+1}} by (1.4). Assume that f_ρ ∈ L_K^β(L²_{ρ_X}) with some 0 < β ≤ 1/2. For any 0 < ε < 2β/(2β+1), choose η_t = μ(2β/(2β+1)) t^{−2β/(2β+1)} and λ = T^{−1/(2β+1)+ε/(2β)}. Then there holds

    E_{Z^T}[‖f_{T+1} − f_ρ‖²_ρ] = O(T^{−(2β−ε)/(2β+1)}).    (2.17)

The preceding error rates for the online regularized algorithm (1.4) with λ > 0 were established in [33] under the assumption that f_ρ ∈ L_K^β(L²_{ρ_X}) with some β ∈ (0, 1/2]. Namely, for any arbitrarily small ε > 0, by choosing λ := λ(T) appropriately, there holds

    E_{Z^T}[‖f_{T+1} − f_ρ‖²_ρ] = O(T^{−(2β−ε)/(2(2β+1))})    (2.18)

and, for 1/2 < β ≤ 1, one further has

    E_{Z^T}[‖f_{T+1} − f_ρ‖²_K] = O(T^{−(2β−1−ε)/(2(2β+1))}).    (2.19)

By the Cauchy–Schwarz inequality, we see from (2.17) that, for any arbitrarily small ε > 0, there holds

    E_{Z^T}[‖f_{T+1} − f_ρ‖_ρ] ≤ (E_{Z^T}[‖f_{T+1} − f_ρ‖²_ρ])^{1/2} = O(T^{−(2β−ε)/(2(2β+1))}).

Therefore, the rate (2.17) is much better than (2.18).

We end this section with some remarks. Firstly, the above error rates for algorithm (1.5) are capacity independent: beyond the prior requirement that f_ρ ∈ L_K^β(L²_{ρ_X}) with some β > 0, nothing else is assumed. It is an open problem to improve these rates when additional information is available, such as the regularity of the kernel K or a polynomial decay of the eigenvalues of L_K [3, 7]. Secondly, all the rates are proved with respect to the expectation of the norm, and we do not know how to convert them into similar probabilistic bounds such as those in [3, 5]. Finally, we wish to emphasize that when the step sizes form a polynomially decaying sequence as in Theorem 1, the learning scheme (1.5) is a truly online algorithm. Instead, when we use a constant step size η_t = η(T), algorithm (1.5) is only an informal online algorithm, since knowledge of the sample size is required in advance.

3 Related work

In this section, we discuss and compare our results with related work. A main conclusion of this discussion is that the online algorithm (1.5) is competitive with commonly used least-square learning schemes.

3.1 Cumulative loss analysis

Mistake (regret) bounds for the cumulative loss Σ_{t=1}^T (y_t − f_t(x_t))² of general online algorithms have been well studied in the literature; see e.g. [35] and the references therein. Specifically, mistake bounds have been derived for online density estimation. Section 6 of [7] derived, for a learning algorithm different from (1.5), upper bounds on the relative expected instantaneous loss, measuring the prediction ability of the last output in the linear regression problem. The online algorithm studied in the linear regression setting that is closest to our algorithm (1.5) discussed how the choice of the learning rate affects the bound for the cumulative loss. Related mistake bounds in this setting can also be found in [35]. In [9], the authors proposed general online regularized algorithms with kernels and presented cumulative loss bounds for them. For a more detailed review of mistake bounds in this direction, one can refer to [9, Section 5]. There, generalization bounds for the average prediction were also derived from the cumulative loss bounds for certain algorithms in RKHSs. More precisely, for each t ∈ ℕ, let H_t : X → ℝ be the function produced by a prediction algorithm when fed with the independent data {(x_j, y_j) : j ∈ ℕ_t}. Consider the average prediction H̄_{T+1} = (1/(T+1)) Σ_{t=1}^{T+1} H_t, and let D ∈ H_K with D(x) ∈ Y := [−M, M] for all x ∈ X. It was shown [9, Corollary] that there exists a prediction algorithm such that, for any δ > 0 and T ∈ ℕ, with probability at least 1 − δ, there holds

    E(H̄_{T+1}) ≤ E(D) + (M(κ + M)(‖D‖_K + 1) + M² √(2 ln(2/δ))) / √(T + 1).

The main difference of our generalization bounds (2.4) and (2.10) from the above one is that our upper bounds require no assumptions on comparison functions such as D.

Soon after Proposition 1 in Section 4.1, we shall discuss an appealing relationship between the statistical generalization error of f_{T+1} and a weighted modification of the cumulative loss Σ_{t=1}^T (y_t − f_t(x_t))². However, it remains to be understood whether standard cumulative loss bounds for algorithm (1.5) can be directly applied to the study of its generalization error.

3.2 Comparison with other regularization schemes

Although deriving cumulative loss bounds for the online gradient descent algorithm (1.4) is very useful, it is also important to further understand the statistical behavior of f_{T+1}. The performance of f_{T+1} in the H_K-norm, when f_{T+1} is given by the online regularized algorithm (1.4) with λ > 0, has also been studied. More general online regularized schemes of the form (1.4), involving commonly used loss functions in classification and regression, were discussed in [33]. Using quite different methods from ours, [35] obtained generalization bounds similar to (2.10) for the online gradient descent algorithm associated with uniformly Lipschitz loss functions, linear kernels and constant step sizes (learning rates). In what follows, we compare our rates for algorithm (1.5) with the state-of-the-art ones for other least-square learning algorithms under the same hypothesis that f_ρ ∈ L_K^β(L²_{ρ_X}) with β > 0.

Firstly, we compare with the rates for the online regularized algorithms. Here we note that the rates (2.9), (2.15) and (2.16) of algorithm (1.5) are comparable to the rates (2.17) and (2.19) of the online regularized algorithm (1.4).

Secondly, we compare with the rates of the averaged stochastic gradient descent algorithms introduced in [35]. This class of online algorithms and the derived error bounds are stated in the linear kernel setting; however, they can easily be extended to the general kernel setting as follows. Assume κ²η_t < 1 for t ∈ ℕ_T, let r_1 = 0 and f̂_1 = 0, and let {f_t : t ∈ ℕ_{T+1}} be defined by (1.5). For any t ∈ ℕ_T, we update

    r_{t+1} = r_t + η_t / (1 − κ²η_t),  f̂_{t+1} = (r_t / r_{t+1}) f̂_t + ((r_{t+1} − r_t) / r_{t+1}) f_t.

Then the generalization bound (see [35, Theorem 5]) for the average output f̂_{T+1} in the kernel setting can be cast as

    E_{Z^T}[E(f̂_{T+1})] ≤ (1 + κ²σ²_{T+1}/r_{T+1}) inf_{f∈H_K} { E(f) + ‖f‖²_K / r_{T+1} },    (3.1)

where σ²_{T+1} := Σ_{t∈ℕ_T} η²_t. Note that E(f) − E(f_ρ) = ‖f − f_ρ‖²_ρ for any f ∈ L²_{ρ_X}. In view of the regularization error

    D(λ) := inf_{f∈H_K} { ‖f − f_ρ‖²_ρ + λ‖f‖²_K },    (3.2)

the bound (3.1) implies that E_{Z^T}[‖f̂_{T+1} − f_ρ‖²_ρ] is bounded by

    (1 + κ²σ²_{T+1}/r_{T+1}) D(1/(r_{T+1} + κ²σ²_{T+1})) + (κ²σ²_{T+1}/r_{T+1}) E(f_ρ).    (3.3)

When the step sizes are of the form η_t = (1/(4(1+κ²))) t^{−θ} with θ ∈ (0, 1), it is not hard to observe that r_{T+1} = O(T^{1−θ}), that σ²_{T+1}/r_{T+1} = O(T^{−min{θ,1−θ}}) for θ ≠ 1/2, and that σ²_{T+1}/r_{T+1} = O(T^{−1/2} ln T) for θ = 1/2. Furthermore, if f_ρ ∈ L_K^β(L²_{ρ_X}) with some β ∈ (0, 1/2], then we know from [5, Lemma 3] that

    D(λ) ≤ λ^{2β} ‖L_K^{−β} f_ρ‖²_ρ,    (3.4)

which implies that

    D(1/(r_{T+1} + κ²σ²_{T+1})) = O(T^{−2β(1−θ)}).    (3.5)

Therefore, if f_ρ ∈ L_K^β(L²_{ρ_X}) with β ∈ (0, 1/2], choosing θ = 2β/(2β+1) and putting the above observations for r_{T+1}, σ²_{T+1}/r_{T+1} and (3.5) into (3.3) yields

    E_{Z^T}[‖f̂_{T+1} − f_ρ‖²_ρ] = O(T^{−1/2} ln T) for β = 1/2, and O(T^{−2β/(2β+1)}) for β ∈ (0, 1/2).    (3.6)

Similarly, if f_ρ ∈ L_K^β(L²_{ρ_X}) with β ∈ (0, 1/2] and the step sizes have the constant form {η_t = η : t ∈ ℕ_T} then, by choosing η = (1/(4(1+κ²))) T^{−2β/(2β+1)}, we see from (3.3) and (3.4) that

    E_{Z^T}[‖f̂_{T+1} − f_ρ‖²_ρ] = O(T^{−2β/(2β+1)}).    (3.7)

Based on the above discussion, we see that the rates (2.9) and (2.15) for algorithm (1.5) are the same, up to a logarithmic factor, as the corresponding rates (3.6) and (3.7) for the averaged stochastic gradient descent algorithm.
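The averaging recursion above can be sketched as follows. The kernel, data and step-size constants are illustrative assumptions; only the weight recursion r_{t+1} = r_t + η_t/(1 − κ²η_t) and the convex averaging of iterates are taken from the display above.

```python
import numpy as np

def averaged_ogd(xs_data, ys_data, eta, kappa=1.0, kernel=None):
    """Run the unregularized recursion and maintain the weighted average
    fhat_{t+1} = (r_t/r_{t+1}) fhat_t + ((r_{t+1}-r_t)/r_{t+1}) f_t."""
    if kernel is None:
        kernel = lambda x, xp: np.exp(-(x - xp) ** 2 / 0.1)   # kappa = 1 for this kernel
    T = len(xs_data)
    c = np.zeros(T)        # kernel-expansion coefficients of the iterate f_t
    c_avg = np.zeros(T)    # coefficients of the averaged predictor fhat_t
    r = 0.0
    for t in range(T):
        e = eta(t + 1)
        assert kappa ** 2 * e < 1.0                 # standing assumption kappa^2 * eta_t < 1
        r_next = r + e / (1.0 - kappa ** 2 * e)     # r_{t+1} = r_t + eta_t/(1 - kappa^2 eta_t)
        w = (r_next - r) / r_next
        c_avg = (1.0 - w) * c_avg + w * c           # fold the current iterate f_t into the average
        f_xt = sum(c[j] * kernel(xs_data[j], xs_data[t]) for j in range(t))
        c[t] = -e * (f_xt - ys_data[t])             # gradient step producing f_{t+1}
        r = r_next
    return c_avg

xs_data = np.linspace(0.0, 1.0, 100)
ys_data = np.sin(2 * np.pi * xs_data)
c_avg = averaged_ogd(xs_data, ys_data, eta=lambda t: 0.5 * t ** -0.5)
```

The averaged predictor is then f̂(x) = Σ_j c_avg[j] K(x_j, x), a convex combination of the online iterates.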

Thirdly, we move on to the comparison with the Tikhonov regularization algorithm (see e.g. [6]) in batch learning, which is defined by

    f_{z,λ} := arg inf_{f∈H_K} { (1/T) Σ_{t=1}^T (f(x_t) − y_t)² + λ‖f‖²_K }.    (3.8)

The capacity independent generalization bounds for this algorithm in [34] can be expressed as

    E_{Z^T}[E(f_{z,λ})] ≤ (1 + 2κ²/(Tλ))² inf_{f∈H_K} { E(f) + λ‖f‖²_K }.

In terms of the regularization error D(λ), this can equivalently be stated as

    E_{Z^T}[‖f_{z,λ} − f_ρ‖²_ρ] ≤ D(λ) + (E(f_ρ) + D(λ)) { 4κ²/(Tλ) + (2κ²/(Tλ))² }.    (3.9)

If f_ρ ∈ L_K^β(L²_{ρ_X}) with β ∈ (0, 1/2], we know from (3.4) that D(λ) ≤ λ^{2β}‖L_K^{−β}f_ρ‖²_ρ. Putting this into (3.9) and trading off T and λ, for f_ρ ∈ L_K^β(L²_{ρ_X}) with some β ∈ (0, 1/2], the choice λ = T^{−1/(2β+1)} gives the rate

    E_{Z^T}[‖f_{z,λ} − f_ρ‖²_ρ] = O(T^{−2β/(2β+1)}).    (3.10)

When f_ρ ∈ L_K^β(L²_{ρ_X}) with β ∈ (1/2, 1], it was further shown in [5], by choosing λ = λ(T) appropriately, that

    E_{Z^T}[‖f_{z,λ} − f_ρ‖²_ρ] = O(T^{−2β/(2β+1)})    (3.11)

and

    E_{Z^T}[‖f_{z,λ} − f_ρ‖²_K] = O(T^{−(2β−1)/(2β+1)}).    (3.12)

Under the same assumptions, the rates (2.9), (2.15) and (2.16) of the online algorithm (1.5) are almost the same as the corresponding rates (3.10), (3.11) and (3.12) in batch learning.

Finally, we end this subsection with the remark that generalization bounds stated in terms of D(λ), such as (3.9), are quite neat and interesting, but they tend to suffer from a saturation phenomenon. (The rates above are originally stated in the form of probability inequalities; we employ their expectation versions for consistency with the other bounds presented in this paper.)

To explain this phenomenon, observe that

    D(λ) = inf_{f∈H_K} { ‖f − f_ρ‖²_ρ + λ‖f‖²_K } ≥ inf_{f∈H_K} { ‖f − f_ρ‖²_ρ + (λ/κ²)‖f‖²_ρ },

since ‖f‖_ρ ≤ κ‖f‖_K. Because ‖f − f_ρ‖_ρ ≥ |‖f‖_ρ − ‖f_ρ‖_ρ|, by letting t = ‖f‖_ρ we have

    D(λ) ≥ inf_{t∈ℝ} { (t − ‖f_ρ‖_ρ)² + (λ/κ²)t² } = (λ/(κ² + λ)) ‖f_ρ‖²_ρ.    (3.13)

If f_ρ is not identically zero (i.e. ‖f_ρ‖_ρ > 0), inequality (3.13) implies that the optimal decay of D(λ) is O(λ) as λ → 0+, which, by (3.4), is achieved when f_ρ ∈ L_K^{1/2}(L²_{ρ_X}) = H_K. Equivalently, the decay of D(λ) is never faster than O(λ) as λ → 0+, even if f_ρ lies in a space L_K^β(L²_{ρ_X}) with β > 1/2, smaller than H_K. Therefore, in this case the consequent trade-off rate (3.10) derived from (3.9) is at most O(T^{−1/2}) as T → ∞, which is achieved at β = 1/2 and cannot be improved beyond the range β ∈ (0, 1/2]. This is the so-called saturation phenomenon in the context of inverse problems [5]. For similar reasons, the rates (3.6) and (3.7) derived from (3.3) for the averaged stochastic gradient descent algorithm have the same problem. In contrast, our capacity independent rate (2.15) for algorithm (1.5) does not suffer from this drawback and is arbitrarily close to T^{−1} as β → ∞.

3.3 Capacity independent optimality

In this subsection, we explain in two ways that the error rates for algorithm (1.5) are optimal, up to a logarithmic factor, in the capacity independent sense. We first illustrate this by citing the lower bounds in [7] for f_ρ ∈ L_K^β(L²_{ρ_X}) with β ∈ (1/2, 1]. We then work out a specific example to contrast our rates with the best possible rate in the non-parametric statistical literature (see e.g. [6, 7]).

We begin with the lower bound in [7]. Consider f_ρ ∈ L_K^β(L²_{ρ_X}) with β ∈ (1/2, 1], and suppose the eigenvalues (arranged in decreasing order) of L_K decay as λ_i = O(i^{−b}) with b > 1. It was proved in [7] that the rate T^{−2βb/(2βb+1)} in L²_{ρ_X} is then optimal; see [7] for a precise definition of optimal rates. Since K(x, x') = Σ_{i∈ℕ} λ_i φ_i(x)φ_i(x') for any x, x' ∈ X and ∫_X φ_i(x)² dρ_X(x) = 1 for any i ∈ ℕ, it follows that Σ_{i=1}^∞ λ_i = ∫_X K(x, x) dρ_X(x) ≤ κ², see e.g. [1]. Therefore, for any i ∈ ℕ we have

that iλ_i ≤ Σ_{j∈ℕ_i} λ_j ≤ κ², which implies that λ_i = O(i^{−1}) for all kernels. Since our capacity independent rates do not depend on the eigenvalues of L_K, taking b → 1 in the rate T^{−2βb/(2βb+1)} leads to the eigenvalue-independent optimal rate T^{−2β/(2β+1)} (see also [3] for a discussion). In this sense, our rates (2.15) and (2.17) are almost optimal in the capacity independent case under the condition that f_ρ ∈ L_K^β(L²_{ρ_X}) with β ∈ (1/2, 1].

Now we illustrate by a specific example that our rates for algorithm (1.5) are comparable to the optimal rate in non-parametric statistics. For simplicity, we only consider smoothing splines [30] in one dimension. Let X = [0, 1] and dρ_X = dx. For smoothing splines, the reproducing kernel space can be regarded as a fractional Sobolev space H^s[0, 1] with s > 1/2, defined in terms of Fourier coefficients by

    H^s[0, 1] := { f = a₀ + Σ_{k∈ℕ} k^{−s}(a_k sin(2πkx) + b_k cos(2πkx)) : ‖f‖²_s := a₀² + Σ_{k∈ℕ}(a_k² + b_k²) < ∞ }.

It is easy to see that, for any x, x' ∈ X, the reproducing kernel of the space H^s[0, 1] can be represented as

    K_s(x, x') = 1 + Σ_{k∈ℕ} k^{−2s} cos(2πk(x − x')).

In addition, λ₁ = 1, λ_{2j} = λ_{2j+1} = j^{−2s} for j ∈ ℕ, and φ₁(x) = 1, φ_{2j}(x) = sin(2πjx), φ_{2j+1}(x) = cos(2πjx) for j ∈ ℕ. Therefore, in this case the range space L_{K_s}^β(L²_{dx}) with β > 0 defined by (2.7) is identical to the Sobolev space H^{2βs}[0, 1]. It is well known (e.g. [7]) in non-parametric statistics that the rate O(T^{−4βs/(4βs+1)}) is optimal for f_ρ ∈ L_{K_s}^β(L²_{dx}) = H^{2βs}[0, 1]. Therefore, if dρ_X = dx, f_ρ ∈ L_{K_s}^β(L²_{dx}) and H_{K_s} = H^s[0, 1] with s > 1/2, then our rate O(T^{−2β/(2β+1)} ln T) for algorithm (1.5), given by (2.15), is suboptimal; but it is arbitrarily close to the best one, O(T^{−4βs/(4βs+1)}), as s → 1/2+.

4 Error decomposition and basic estimates

In this section, we formulate our basic ideas by providing a novel approach to the error analysis of online gradient descent algorithms in L²_{ρ_X}. As

mentioned in the introduction, the main purpose of this paper is to analyze the online algorithm (1.5). However, we would like to state a unified approach for the general online algorithm (1.4) with any λ ≥ 0, since this will, as a byproduct, allow us to improve the preceding error rate in [33] for algorithm (1.4) with λ > 0, as proved in Section 6.2 below.

We first establish some useful observations. For λ > 0, define the regularizing function by

    f_λ := arg inf_{f∈H_K} { E(f) + λ‖f‖²_K }.    (4.1)

Lemma 1. Let λ > 0 and let f_λ be defined as above. Then we have that

    L_K(f_λ − f_ρ) + λ f_λ = 0    (4.2)

and

    ‖f_λ − f_ρ‖_ρ ≤ ‖f_ρ‖_ρ.    (4.3)

Proof. The first equality is well known; see e.g. [1]. For the inequality (4.3), we know from the equality (1.6) that (4.1) is equivalent to f_λ = arg inf_{f∈H_K} { ‖f − f_ρ‖²_ρ + λ‖f‖²_K }. Taking f = 0 in the definition of f_λ yields the desired estimate ‖f_λ − f_ρ‖²_ρ ≤ ‖0 − f_ρ‖²_ρ = ‖f_ρ‖²_ρ. ∎

With (4.2) at hand, we can rewrite the online algorithm (1.4) with λ ≥ 0 in the following useful form. With a slight abuse of notation, for λ = 0 we write f_λ = f_0 := f_ρ.

Lemma 2. Let λ ≥ 0 and let {f_t : t ∈ ℕ_{T+1}} be defined by (1.4). Then, for any t ∈ ℕ_T we have that

    f_{t+1} − f_λ = (I − η_t(L_K + λI))(f_t − f_λ) + η_t B(f_t, z_t),    (4.4)

where I is the identity operator and the vector-valued random variable B(f_t, z_t) is defined by

    B(f_t, z_t) := L_K(f_t − f_ρ) + (y_t − f_t(x_t)) K_{x_t}.    (4.5)

Proof. The equality (4.4) for the case λ = 0 is easily verified, since we have set f_0 = f_ρ.

For the case λ > 0, we use the property (4.2) of the regularizing function f_λ in Lemma 1. By the definition of f_{t+1} given by (1.4) with λ > 0, we know that

    f_{t+1} − f_λ = f_t − f_λ − η_t(f_t(x_t) − y_t)K_{x_t} − η_t λ f_t
                 = (I − η_t(L_K + λI))(f_t − f_λ) + η_t L_K(f_t − f_λ) − η_t λ f_λ + η_t(y_t − f_t(x_t))K_{x_t}.

But the equality (4.2) tells us that λ f_λ = −L_K(f_λ − f_ρ). Putting this back into the above equation and rearranging yields the desired equality (4.4). ∎

With the help of the formula (4.4), we can describe our approach in three steps, the first of which is referred to as the error decomposition.

4.1 Error decomposition

For λ ≥ 0 and t ∈ ℕ, set the operator ω_k^t(L_K + λI) := Π_{j=k}^t (I − η_j(L_K + λI)) for k ∈ ℕ_t, and ω_{t+1}^t(L_K + λI) := I. Applying induction to the equality (4.4) and noting that f_1 = 0, for any t ∈ ℕ_T we have

    f_{t+1} − f_λ = −ω_1^t(L_K + λI) f_λ + Σ_{j∈ℕ_t} η_j ω_{j+1}^t(L_K + λI) B(f_j, z_j).    (4.6)

Applying (4.6) with t = T, we get the error decomposition

    f_{T+1} − f_λ = −ω_1^T(L_K + λI) f_λ + Σ_{t∈ℕ_T} η_t ω_{t+1}^T(L_K + λI) B(f_t, z_t).    (4.7)

The above error decomposition technique is well known in statistical learning theory as a means of carrying out the error analysis of least-square related learning algorithms, see e.g. [3, 33]. Similar ideas have been used for different learning schemes, see e.g. [4, 5] and the references therein.

We will use the error decomposition (4.7) to estimate the expectation of ‖f_{T+1} − f_ρ‖²_ρ. To this end, we need some preparation. For a linear operator L from L²_{ρ_X} to itself, we use ‖L‖ to denote its operator norm, that is, ‖L‖ := sup_{‖f‖_ρ=1} ‖Lf‖_ρ. We also need the following technical lemma, which provides the essential estimates of our approach.

Lemma 3. Let λ ≥ 0, β > 0 and η_t(κ² + λ) ≤ 1 for every integer t ∈ [j, k]. Then there holds

    ‖ω_j^k(L_K + λI) L_K^β‖ ≤ 2((β/e)^β + κ^{2β}) exp{−λ Σ_{t=j}^k η_t} / (1 + (Σ_{t=j}^k η_t)^β).

Proof. Note that, for any x, x' ∈ X,

    K(x, x') = ⟨K_x, K_{x'}⟩_K ≤ ‖K_x‖_K ‖K_{x'}‖_K ≤ κ².    (4.8)

Combining this with the fact that ρ_X(X) = 1, we see from the Cauchy–Schwarz inequality that, for any f ∈ L²_{ρ_X},

    ‖L_K f‖²_ρ = ∫_X |∫_X K(x, x') f(x') dρ_X(x')|² dρ_X(x)
               ≤ ∫_X (∫_X K(x, x')² dρ_X(x')) (∫_X f(x')² dρ_X(x')) dρ_X(x)
               = ‖f‖²_ρ ∫_X ∫_X K(x, x')² dρ_X(x') dρ_X(x) ≤ κ⁴ ‖f‖²_ρ.    (4.9)

Hence ‖L_K‖ ≤ κ². Since {λ_l : l ∈ ℕ} are the eigenvalues of L_K, it follows that sup_l λ_l ≤ κ² and

    ‖L_K^β‖ := sup_{l∈ℕ} λ_l^β ≤ κ^{2β}.    (4.10)

Moreover, for any 0 ≤ x ≤ (κ² + λ)^{−1},

    ‖I − x(L_K + λI)‖ = sup_{l∈ℕ} (1 − x(λ_l + λ)) ≤ 1 − xλ.    (4.11)

Applying (4.11) with x = η_t iteratively for t ∈ [j, k], together with (4.10), we have that

    ‖ω_j^k(L_K + λI) L_K^β‖ ≤ Π_{t=j}^k ‖I − η_t(L_K + λI)‖ ‖L_K^β‖ ≤ κ^{2β} Π_{t=j}^k (1 − η_t λ) ≤ exp{−λ Σ_{t=j}^k η_t} κ^{2β},    (4.12)

where the elementary inequality 1 − η_t λ ≤ e^{−η_t λ}, valid for any t ∈ [j, k], is used in the last step.

Also,

    ‖ω_j^k(L_K + λI) L_K^β‖ = sup_{l∈ℕ} Π_{t=j}^k (1 − η_t(λ_l + λ)) λ_l^β
        ≤ sup_{l∈ℕ} exp{−(λ + λ_l) Σ_{t=j}^k η_t} λ_l^β
        ≤ exp{−λ Σ_{t=j}^k η_t} sup_{x≥0} e^{−x Σ_{t=j}^k η_t} x^β
        = exp{−λ Σ_{t=j}^k η_t} (β/e)^β (Σ_{t=j}^k η_t)^{−β}.

Consequently, the above inequality and (4.12) imply that ‖ω_j^k(L_K + λI) L_K^β‖ is bounded by

    exp{−λ Σ_{t=j}^k η_t} ((β/e)^β + κ^{2β}) min{1, (Σ_{t=j}^k η_t)^{−β}}.

Combining this with the elementary inequality min{a, b} ≤ 2ab/(a + b) for any a, b > 0 finishes the proof of the lemma. ∎

Now we present the upper bound for E_{Z^T}[‖f_{T+1} − f_ρ‖²_ρ] which is the foundation of the approach introduced in this paper. Hereafter, we adopt the convention that Σ_{j=l+1}^l η_j = 0 for any l ∈ ℕ.

Proposition 1. Let λ ≥ 0 and let {f_t : t ∈ ℕ} be defined by (1.4). Then, for any T ∈ ℕ, E_{Z^T}[‖f_{T+1} − f_ρ‖²_ρ] is bounded by

    ‖f_λ − f_ρ‖²_ρ + ‖ω_1^T(L_K + λI) f_λ‖²_ρ + 2‖ω_1^T(L_K + λI) f_λ‖_ρ ‖f_λ − f_ρ‖_ρ
      + (1 + κ)⁴ Σ_{t∈ℕ_T} η_t² (exp{−λ Σ_{j=t+1}^T η_j} / (1 + Σ_{j=t+1}^T η_j)) E_{Z^t}[E(f_t)].    (4.13)

Proof. Since f_{T+1} − f_ρ = (f_{T+1} − f_λ) + (f_λ − f_ρ), there holds

    E_{Z^T}[‖f_{T+1} − f_ρ‖²_ρ] = E_{Z^T}[‖f_{T+1} − f_λ‖²_ρ] + ‖f_λ − f_ρ‖²_ρ + 2E_{Z^T}[⟨f_{T+1} − f_λ, f_λ − f_ρ⟩_ρ].    (4.14)

In order to use (4.14) to prove (4.13), we need to estimate the first and the last terms on its right-hand side.

For the last term on the right-hand side of (4.14), observe that the data are i.i.d. and f_t depends only on {z_1, ..., z_{t−1}}, not on z_t. Moreover, recall the definition of the regression function: f_ρ(x) = ∫_Y y dρ(y|x). Thus, B(f_t, z_t) has a convenient vanishing property:

E_{z_t}[B(f_t, z_t)] = 0,  ∀ t ∈ IN.    (4.15)

Applying (4.15) to the equality (4.7) implies that E_{Z^T}[f_{T+1} − f_λ] = −ω_1^T(L_K + λI) f_λ. Therefore,

E_{Z^T}[⟨f_{T+1} − f_λ, f_λ − f_ρ⟩_ρ] = ⟨E_{Z^T}[f_{T+1} − f_λ], f_λ − f_ρ⟩_ρ ≤ ‖ω_1^T(L_K + λI) f_λ‖_ρ ‖f_λ − f_ρ‖_ρ.    (4.16)

For the first term on the right-hand side of (4.14), we first use the equality (4.7) to get that

E_{Z^T}[‖f_{T+1} − f_λ‖²_ρ] = ‖ω_1^T(L_K + λI) f_λ‖²_ρ + E_{Z^T}[‖Σ_{t=1}^T η_t ω_{t+1}^T(L_K + λI) B(f_t, z_t)‖²_ρ]
  − 2 E_{Z^T}[⟨Σ_{t=1}^T η_t ω_{t+1}^T(L_K + λI) B(f_t, z_t), ω_1^T(L_K + λI) f_λ⟩_ρ].    (4.17)

Below, we estimate the last two terms on the right-hand side of (4.17) separately.

For the second term on the right-hand side of (4.17), we write it as

Σ_{t, t' ∈ IN_T} η_t η_{t'} E_{Z^T}[⟨ω_{t+1}^T(L_K + λI) B(f_t, z_t), ω_{t'+1}^T(L_K + λI) B(f_{t'}, z_{t'})⟩_ρ].

Therefore, by (4.15), for t' > t,

E_{Z^{t'}}[⟨ω_{t+1}^T(L_K + λI) B(f_t, z_t), ω_{t'+1}^T(L_K + λI) B(f_{t'}, z_{t'})⟩_ρ]
  = E_{Z^{t'−1}}[⟨ω_{t+1}^T(L_K + λI) B(f_t, z_t), ω_{t'+1}^T(L_K + λI) E_{z_{t'}}[B(f_{t'}, z_{t'})]⟩_ρ] = 0.

By the symmetry of t and t', the above equality also holds true for t > t'.
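As an aside, the cross-term cancellation just used is the standard orthogonality of martingale-difference increments: if ξ_t has zero mean given the past, then E‖Σ_t a_t ξ_t‖² = Σ_t a_t² E‖ξ_t‖². A small exact check with hypothetical sign-flip increments (these stand in for the B(f_t, z_t) terms; they are not derived from the algorithm):

```python
import itertools
import numpy as np

rng = np.random.default_rng(0)
T, dim = 8, 5
v = rng.standard_normal((T, dim))     # fixed directions
a = rng.uniform(0.1, 1.0, T)          # weights playing the role of eta_t * omega_{t+1}^T

# Increments xi_t = s_t * v_t with independent uniform signs s_t = +/-1 (zero mean).
# Exact expectation of ||sum_t a_t xi_t||^2 by enumerating all 2^T sign patterns.
lhs = np.mean([np.sum((np.asarray(signs) @ (a[:, None] * v))**2)
               for signs in itertools.product((-1.0, 1.0), repeat=T)])
rhs = float(np.sum(a**2 * np.sum(v**2, axis=1)))   # sum_t a_t^2 ||v_t||^2
print(lhs, rhs)  # equal: the cross terms involving E[s_t s_t'] vanish for t != t'
```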

Consequently, it follows that

E_{Z^T}[‖Σ_{t=1}^T η_t ω_{t+1}^T(L_K + λI) B(f_t, z_t)‖²_ρ] = Σ_{t=1}^T η_t² E_{Z^t}[‖ω_{t+1}^T(L_K + λI) B(f_t, z_t)‖²_ρ]
  ≤ Σ_{t=1}^T η_t² ‖ω_{t+1}^T(L_K + λI) L_K^{1/2}‖² E_{Z^t}[‖L_K^{−1/2} B(f_t, z_t)‖²_ρ]
  = Σ_{t=1}^T η_t² ‖ω_{t+1}^T(L_K + λI) L_K^{1/2}‖² E_{Z^t}[‖B(f_t, z_t)‖²_K],    (4.18)

where we have used equation (1.8) in the last equality. To estimate E_{Z^t}[‖B(f_t, z_t)‖²_K], we note from (4.15) that E_{z_t}[(f_t(x_t) − y_t) K_{x_t}] = L_K(f_t − f_ρ), and thus we can rewrite E_{z_t}[‖B(f_t, z_t)‖²_K] as

E_{z_t}[‖(f_t(x_t) − y_t) K_{x_t}‖²_K] − ‖E_{z_t}[(f_t(x_t) − y_t) K_{x_t}]‖²_K.    (4.19)

Therefore,

E_{Z^t}[‖B(f_t, z_t)‖²_K] ≤ E_{Z^t}[E_{z_t}[‖(f_t(x_t) − y_t) K_{x_t}‖²_K]] ≤ κ² E_{Z^t}[E_{z_t}[(f_t(x_t) − y_t)²]] = κ² E_{Z^t}[E(f_t)].    (4.20)

In addition, applying Lemma 3 with β = 1/2 and j = t + 1, k = T implies that

‖ω_{t+1}^T(L_K + λI) L_K^{1/2}‖² ≤ 2(1 + κ)² exp{−λ Σ_{j=t+1}^T η_j} / (1 + Σ_{j=t+1}^T η_j).

Consequently, plugging this and (4.20) into (4.18) yields that

E_{Z^T}[‖Σ_{t=1}^T η_t ω_{t+1}^T(L_K + λI) B(f_t, z_t)‖²_ρ] ≤ 2(1 + κ)⁴ Σ_{t=1}^T η_t² [exp{−λ Σ_{j=t+1}^T η_j} / (1 + Σ_{j=t+1}^T η_j)] E_{Z^t}[E(f_t)].    (4.21)

For the last term on the right-hand side of (4.17), we use (4.15) to get that

E_{Z^T}[⟨Σ_{t=1}^T η_t ω_{t+1}^T(L_K + λI) B(f_t, z_t), ω_1^T(L_K + λI) f_λ⟩_ρ]
  = Σ_{t=1}^T η_t ⟨ω_{t+1}^T(L_K + λI) E_{Z^{t−1}}[E_{z_t}[B(f_t, z_t)]], ω_1^T(L_K + λI) f_λ⟩_ρ = 0.

Substituting this and (4.21) into (4.17) yields that

E_{Z^T}[‖f_{T+1} − f_λ‖²_ρ] ≤ ‖ω_1^T(L_K + λI) f_λ‖²_ρ + 2(1 + κ)⁴ Σ_{t=1}^T η_t² [exp{−λ Σ_{j=t+1}^T η_j} / (1 + Σ_{j=t+1}^T η_j)] E_{Z^t}[E(f_t)].    (4.22)

Combining this with (4.16) and (4.14) completes the upper bound (4.13).

Now the error decomposition and Proposition 1 reduce the goal of our error analysis to the estimation of the four terms in (4.13). The first three terms of (4.13) are frequently referred to as the approximation error [4, 5]. Since the data are i.i.d. and f_t does not depend on z_t, the last term of (4.13) can be rewritten as

2(1 + κ)⁴ E_{Z^T}[Σ_{t=1}^T η_t² [exp{−λ Σ_{j=t+1}^T η_j} / (1 + Σ_{j=t+1}^T η_j)] (y_t − f_t(x_t))²].

It can be regarded as the expectation of a weighted form of the cumulative loss Σ_t (y_t − f_t(x_t))² from the online learning literature [8, 0, 9]. Thus, we call this last term of (4.13) the cumulative sample error. We estimate these two errors separately in the following two subsections, which constitute the second and last steps of our approach.

4.2 Estimates for the approximation error

Here, we establish some basic estimates for the deterministic approximation errors involving ‖f_λ − f_ρ‖_ρ and ‖ω_1^T(L_K + λI) f_λ‖_ρ, which is the second step of our approach. To estimate the term ‖f_λ − f_ρ‖_ρ, we only need to consider the case λ > 0, since f_0 = f_ρ. For this purpose, recall Lemma 3 in [5].

Lemma 4. Let λ > 0 and f_λ be defined by (4.1). If f_ρ ∈ L_K^β(L²_{ρ_X}) with 0 < β ≤ 1, then there holds:

(a) ‖f_λ − f_ρ‖_ρ ≤ λ^β ‖L_K^{−β} f_ρ‖_ρ;

(b) if moreover 1/2 < β ≤ 1, then ‖f_λ − f_ρ‖_K ≤ λ^{β−1/2} ‖L_K^{−β} f_ρ‖_ρ.

Using the above lemma, we can estimate the quantity ‖ω_1^T(L_K + λI) f_λ‖_ρ as follows.

Lemma 5. Let λ ≥ 0 and η_t(κ² + λ) ≤ 1 for each t ∈ IN_T. Then the following statements hold true.

(a) If λ = 0, then ‖ω_1^T(L_K) f_ρ‖_ρ ≤ K(2(1 + κ)(Σ_{t=1}^T η_t)^{−1/2}, f_ρ);

(b) if λ > 0, then ‖ω_1^T(L_K + λI) f_λ‖_ρ ≤ exp{−λ Σ_{t=1}^T η_t} ‖f_ρ‖_ρ.

Proof. We first prove property (a). Since η_t κ² ≤ 1 for t ∈ IN_T, using (4.11) with λ = 0 and x = η_t for t ∈ IN_T iteratively implies that ‖ω_1^T(L_K)‖ ≤ ∏_{t=1}^T ‖I − η_t L_K‖ ≤ 1. Thus, for any f ∈ H_K there holds

‖ω_1^T(L_K) f_ρ‖_ρ ≤ ‖ω_1^T(L_K)(f − f_ρ)‖_ρ + ‖ω_1^T(L_K) f‖_ρ
  ≤ ‖f − f_ρ‖_ρ + ‖ω_1^T(L_K) L_K^{1/2}‖ ‖L_K^{−1/2} f‖_ρ
  = ‖f − f_ρ‖_ρ + ‖ω_1^T(L_K) L_K^{1/2}‖ ‖f‖_K,    (4.23)

where (1.8) is used in the last equality. Applying Lemma 3 with λ = 0, β = 1/2, j = 1, and k = T implies that ‖ω_1^T(L_K) L_K^{1/2}‖ ≤ 2(1 + κ)(Σ_{t=1}^T η_t)^{−1/2}. Then, substituting this into the right-hand side of (4.23) yields that

‖ω_1^T(L_K) f_ρ‖_ρ ≤ inf_{f ∈ H_K} {‖f − f_ρ‖_ρ + 2(1 + κ)(Σ_{t=1}^T η_t)^{−1/2} ‖f‖_K},    (4.24)

which is exactly property (a) by the definition of the K-functional.

Turning to the proof of property (b), note that η_t(κ² + λ) ≤ 1 for any t ∈ IN_T. Thus, applying (4.11) again with x = η_t for t ∈ IN_T implies that

‖ω_1^T(L_K + λI) f_λ‖_ρ ≤ ∏_{t=1}^T (1 − η_t λ) ‖f_λ‖_ρ ≤ exp{−λ Σ_{t=1}^T η_t} ‖f_λ‖_ρ.

But, by (4.3), ‖f_λ‖_ρ ≤ ‖f_ρ‖_ρ, which finishes property (b).

The last step of our approach is to estimate the cumulative sample error.
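Before estimating the cumulative sample error in the next subsection, its step-size weight can be tabulated numerically. The sketch below uses illustrative choices (λ = 0, so the exponential weight equals 1; θ ∈ {0.25, 0.5, 0.75}; μ = 2) and shows the weighted sum Σ_{t≤T} η_t²/(1 + Σ_{j=t+1}^T η_j) decaying as T grows:

```python
import numpy as np

def weighted_sum(T, theta, mu=2.0, lmbda=0.0):
    # sum_{t<=T} eta_t^2 exp(-lmbda * tail_t) / (1 + tail_t), tail_t = sum_{j=t+1}^T eta_j
    eta = np.arange(1, T + 1, dtype=float)**(-theta) / mu
    tail = np.concatenate([np.cumsum(eta[::-1])[-2::-1], [0.0]])
    return float(np.sum(eta**2 * np.exp(-lmbda * tail) / (1.0 + tail)))

results = {theta: [weighted_sum(T, theta) for T in (40, 400, 4000)]
           for theta in (0.25, 0.5, 0.75)}
for theta, vals in results.items():
    print(theta, vals)   # each row decreases as T grows
```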

4.3 Estimates for the cumulative sample error

In this subsection, we describe basic estimates for the cumulative sample error. Observe that the cumulative sample error (the last term in (4.13)) can be bounded by

2(1 + κ)⁴ sup_{t ∈ IN_T} E_{Z^t}[E(f_t)] × Σ_{t ∈ IN_T} η_t² exp{−λ Σ_{j=t+1}^T η_j} / (1 + Σ_{j=t+1}^T η_j).    (4.25)

Therefore, it remains to estimate separately the summation involving the step sizes and the uniform bounds for the learning sequence. Let us first deal with the summation.

Lemma 6. Let λ ≥ 0, θ ∈ [0, 1) and μ ≥ max{λ, 1} + κ². If the step sizes are of the form η_t = μ^{−1} t^{−θ} for t ∈ IN_T, then, for any l ∈ IN_T, the summation

Σ_{j ∈ IN_l} η_j² exp{−λ Σ_{k=j+1}^l η_k} / (1 + Σ_{k=j+1}^l η_k)

is bounded by

(c_θ / μ) (exp{−λ d_θ l^{1−θ} / μ} l^{−min{θ, 1−θ}} + l^{θ−1}) ln(8 l^{2−θ}),    (4.26)

where d_θ = (1 − 2^{θ−1})/(1 − θ) and c_θ = 1 for θ = 0, c_θ = 3 for θ = 1/2, and c_θ = 3/((1 − θ) 2^{1−θ}) otherwise.

The technical proof is given in the Appendix. Now we can see that, if the step sizes are of the form η_t = μ^{−1} t^{−θ} with θ ∈ (0, 1) and μ ≥ max{λ, 1} + κ², the summation involving the step sizes in (4.25) is bounded by (4.26) with l = T, which tends to zero as T → ∞. Hence, in order to estimate the whole term (4.25), it suffices to establish uniform bounds for sup_{t ∈ IN_{T+1}} E_{Z^t}[E(f_t)]. This is intuitively reasonable, since we expect the learning sequence {f_t : t ∈ IN_{T+1}} to approximate the regression function f_ρ. To this end, we need the following direct consequence of Lemma 6.

Corollary 1. Let λ ≥ 0, θ ∈ [0, 1) and η_t = μ^{−1} t^{−θ} for t ∈ IN_T with μ ≥ κ² + λ. For any T ∈ IN, if μ is larger than the quantity

μ(θ) := { 16(1 + κ)⁴ ln(8T²)  if θ = 0,
          48(1 + κ)⁴          if θ = 1/2,
          108(1 + κ)⁴ (ln 8 + (2 − θ)/min{θ, 1 − θ}) / ((1 − θ) 2^{1−θ})  otherwise, }    (4.27)

then there holds

Σ_{j ∈ IN_t} η_j² exp{−λ Σ_{k=j+1}^t η_k} / (1 + Σ_{k=j+1}^t η_k) ≤ 1 / (8(1 + κ)⁴),  ∀ t ∈ IN_T.    (4.28)

Proof. For any l ∈ IN_T and λ ≥ 0, the quantity (4.26) is bounded by (2 c_θ / μ) l^{−min{θ, 1−θ}} ln(8 l^{2−θ}). Thus, the hypothesis (4.28) holds true as long as μ ≥ κ² + max{λ, 1} and

(16(1 + κ)⁴ c_θ / μ) t^{−min{θ, 1−θ}} ln(8 t^{2−θ}) ≤ 1  for any t ∈ IN_T.

When θ = 0, by the definition of c_0, the above inequality is equivalent to μ ≥ 16(1 + κ)⁴ ln(8T²). For θ ∈ (0, 1), note that

ln t = ln(t^{min{θ, 1−θ}}) / min{θ, 1−θ} ≤ t^{min{θ, 1−θ}} / min{θ, 1−θ}.

Therefore, the inequality (16(1 + κ)⁴ c_θ / μ) t^{−min{θ, 1−θ}} ln(8 t^{2−θ}) ≤ 1 holds whenever

μ ≥ 16(1 + κ)⁴ c_θ (ln 8 + (2 − θ) / min{θ, 1−θ}),

and the definition of c_θ given in Lemma 6 yields the desired result.

We are ready to establish the uniform bounds for the learning sequence.

Proposition 2. Let λ ≥ 0 and {f_t : t ∈ IN_{T+1}} be produced by algorithm (1.4). Moreover, if the step sizes satisfy η_t(κ² + λ) ≤ 1 for t ∈ IN_T and the inequality (4.28), then we have that

E_{Z^t}[E(f_t)] ≤ 20 ‖f_ρ‖²_ρ + 3 E(f_ρ),  ∀ t ∈ IN_{T+1}.    (4.29)

Proof. Since f_1 = 0, from (1.6) it follows that E(f_1) = ‖f_1 − f_ρ‖²_ρ + E(f_ρ) = ‖f_ρ‖²_ρ + E(f_ρ), which implies that (4.29) holds true for t = 1. As the induction assumption, suppose (4.29) holds true for all k ∈ IN_t. To advance the induction, we need to estimate E_{Z^t}[E(f_{t+1})]. Applying (4.13) with T = t ∈ IN, E_{Z^t}[‖f_{t+1} − f_ρ‖²_ρ] is bounded by

‖f_λ − f_ρ‖²_ρ + ‖ω_1^t(L_K + λI) f_λ‖²_ρ + 2 ‖ω_1^t(L_K + λI) f_λ‖_ρ ‖f_λ − f_ρ‖_ρ
  + 2(1 + κ)⁴ Σ_{k ∈ IN_t} η_k² [exp{−λ Σ_{j=k+1}^t η_j} / (1 + Σ_{j=k+1}^t η_j)] E_{Z^k}[E(f_k)].    (4.30)

Note that ‖f_λ − f_ρ‖_ρ ≤ ‖f_ρ‖_ρ by (4.1). Since η_t(κ² + λ) ≤ 1 for t ∈ IN,

‖ω_1^t(L_K + λI)‖ ≤ ∏_{j ∈ IN_t} ‖I − η_j(L_K + λI)‖ ≤ 1.

Hence,

‖ω_1^t(L_K + λI) f_λ‖²_ρ ≤ ‖ω_1^t(L_K + λI)‖² ‖f_λ‖²_ρ ≤ ‖f_λ‖²_ρ ≤ 4 ‖f_ρ‖²_ρ.

Similarly, by (4.3), there holds

2 ‖ω_1^t(L_K + λI) f_λ‖_ρ ‖f_λ − f_ρ‖_ρ ≤ 4 ‖f_ρ‖²_ρ.

Putting the above estimates, the induction assumption, and hypothesis (4.28) into (4.30) implies that

E_{Z^t}[‖f_{t+1} − f_ρ‖²_ρ] ≤ 9 ‖f_ρ‖²_ρ + (20 ‖f_ρ‖²_ρ + 3 E(f_ρ)) / 4.

Note that E(f_{t+1}) = E(f_ρ) + ‖f_{t+1} − f_ρ‖²_ρ by (1.6), and thus

E_{Z^t}[E(f_{t+1})] ≤ 20 ‖f_ρ‖²_ρ + 3 E(f_ρ),

which advances the induction and completes the proof.

In the following sections, we apply Proposition 2 and the basic estimates for the approximation error and the cumulative sample error to prove our main results stated in Section 2.

5 Error bounds and convergence

In this section, we mainly develop error bounds and convergence results for algorithm (1.5) based on Proposition 1. In this case, we have λ = 0 and f_0 = f_ρ. Therefore, the error decomposition formula (4.7) reduces to

f_{T+1} − f_ρ = −ω_1^T(L_K) f_ρ + Σ_{t=1}^T η_t ω_{t+1}^T(L_K) B(f_t, z_t).    (5.1)

By Proposition 1, E_{Z^T}[‖f_{T+1} − f_ρ‖²_ρ] is bounded by

‖ω_1^T(L_K) f_ρ‖²_ρ + 2(1 + κ)⁴ Σ_{t ∈ IN_T} η_t² (1 + Σ_{j=t+1}^T η_j)^{−1} E_{Z^t}[E(f_t)].    (5.2)
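The unregularized iteration studied in this section, f_{t+1} = f_t − η_t (f_t(x_t) − y_t) K_{x_t} with f_1 = 0, is straightforward to simulate, since f_{t+1} lies in the span of {K_{x_1}, ..., K_{x_t}}. The sketch below uses assumed choices that are not tied to the analysis: a Gaussian kernel (so κ = 1), target f_ρ(x) = sin(2πx), and step sizes η_t = μ^{−1} t^{−1/2} with μ = 2, so that η_t κ² ≤ 1.

```python
import numpy as np

rng = np.random.default_rng(0)

def kernel(a, b, sigma=0.3):
    # Gaussian (Mercer) kernel on X = [0, 1]; kappa = sup_x sqrt(K(x, x)) = 1
    return np.exp(-(a - b)**2 / (2.0 * sigma**2))

def f_star(x):
    # regression function used to generate the data (an assumption for the demo)
    return np.sin(2.0 * np.pi * x)

T, theta, mu = 2000, 0.5, 2.0        # eta_t = t^{-theta} / mu
X = rng.uniform(0.0, 1.0, T)
Y = f_star(X) + 0.1 * rng.standard_normal(T)

# f_1 = 0 and f_{t+1} = f_t - eta_t (f_t(x_t) - y_t) K_{x_t},
# stored through coefficients: f_{t+1} = sum_{s <= t} coef[s] K(X[s], .)
coef = np.zeros(T)
for t in range(T):
    eta = (t + 1.0)**(-theta) / mu
    residual = coef[:t] @ kernel(X[:t], X[t]) - Y[t]   # f_t(x_t) - y_t
    coef[t] = -eta * residual

grid = np.linspace(0.0, 1.0, 200)
f_hat = coef @ kernel(X[:, None], grid[None, :])
rms = float(np.sqrt(np.mean((f_hat - f_star(grid))**2)))
print(f"RMS error of f_(T+1) against f_rho on a grid: {rms:.3f}")
```

For the regularized iteration (1.4), one would additionally shrink the past coefficients by the factor (1 − η_t λ) before each update.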

5.1 Proofs of error bounds

This subsection proves the error bounds stated in Theorems 1 and 4. The essential estimates will also be used to derive error rates in Section 6. We start with the following useful lemma; one can find a modified form of it in [3].

Lemma 7. If f ∈ L_K^β(L²_{ρ_X}) with some β > 0, then

‖ω_1^T(L_K) f‖_ρ ≤ 2((β/e)^β + κ^{2β}) ‖L_K^{−β} f‖_ρ (Σ_{t=1}^T η_t)^{−β}.    (5.3)

In addition, if β > 1/2, then ‖ω_1^T(L_K) f‖_K is bounded by

2(((β − 1/2)/e)^{β−1/2} + κ^{2β−1}) ‖L_K^{−β} f‖_ρ (Σ_{t=1}^T η_t)^{−(β−1/2)}.    (5.4)

Proof. Applying Lemma 3 with λ = 0, j = 1, and k = T (with exponents β and β − 1/2, respectively), the desired estimates (5.3) and (5.4) follow immediately from the observations

‖ω_1^T(L_K) f‖_ρ ≤ ‖ω_1^T(L_K) L_K^β‖ ‖L_K^{−β} f‖_ρ

and

‖ω_1^T(L_K) f‖_K = ‖ω_1^T(L_K) L_K^{−1/2} f‖_ρ ≤ ‖ω_1^T(L_K) L_K^{β−1/2}‖ ‖L_K^{−β} f‖_ρ.

This completes our lemma.

Recalling the definitions of μ(θ) in Corollary 1 and c_θ in Lemma 6, our error bounds read as follows.

Proposition 3. Let λ = 0, θ ∈ [0, 1) and η_t = μ^{−1} t^{−θ} for t ∈ IN with some constant μ ≥ μ(θ). Define {f_t : t ∈ IN} by (1.5). Then, for any T ∈ IN, we have that

E_{Z^T}[‖f_{T+1} − f_ρ‖²_ρ] ≤ ‖ω_1^T(L_K) f_ρ‖²_ρ + (c_ρ c_θ / μ) T^{−min{θ, 1−θ}} ln(8 T^{2−θ}),

where c_ρ := 4(1 + κ)⁴ (20 ‖f_ρ‖²_ρ + 3 E(f_ρ)).

Proof. Since η_t = μ^{−1} t^{−θ} with μ ≥ μ(θ), by Corollary 1 we know that the inequality (4.28) on the step sizes holds true for any

28 t IN T. Hence, the bound (4.9) in Proposition holds true. Now, the desired result follows by applying (4.6) with λ = 0 and l = T to the right-hand side of (5.). To apply Proposition 3 to prove Theorem, we need an easy observation, that is, if 0 < θ < then, for any t IN T, T j=t j θ (T + ) θ t θ. (5.5) θ We are in a position to establish Theorem stated in Section. Proof of Theorem. Recall that η t = µ t θ for θ (0, ). Therefore, by T (5.5) we have that η t θ ( θ)µ T θ. Putting this and property (a) in t= Lemma 5 together, the bound (.4) immediately follows from Proposition 3 with θ (0, ). It still remains a question whether we can estimate the stronger error f T + f ρ K when the step sizes have the form {η t = O(t θ ) : t IN} with θ (0, ) and f ρ L β K (L ρ X ) with β >. We turn our attention to the case when the step sizes are in the form of {η t = η : t IN T } with η = η(t ) depending on T. In this case, we can deal with the expectation of the error f T + f ρ K if f ρ H K. Lemma 8. Let f ρ H K, η t = η for t IN T with ηκ and {f t : t IN T + } be defined as (.5). Then we have that ] E Z T ft + f ρ K ω T (L K )f ρ K + κ η E(ft ) ]. E Z t Proof. Since the argument here is essentially similar to the proof of Proposition except the norm ρ is replaced by K in H K, we only sketch its proof. From the error decomposition (5.), we know that ] E Z T ft + f ρ K] = ω T (L K )f ρ K + η E Z T ωt+(l T K )B(f t, z t ) K η ω T (L K )f ρ, E Z T ω T t+ (L K )B(f t, z t ) ] K. (5.6) 8

By the vanishing property (4.15) of B(f_t, z_t), the second term on the right-hand side of the above equality is estimated as follows:

E_{Z^T}[‖Σ_{t=1}^T ω_{t+1}^T(L_K) B(f_t, z_t)‖²_K] = Σ_{t=1}^T E_{Z^t}[‖ω_{t+1}^T(L_K) B(f_t, z_t)‖²_K]
  ≤ Σ_{t=1}^T ‖ω_{t+1}^T(L_K)‖² E_{Z^t}[‖B(f_t, z_t)‖²_K].    (5.7)

By (4.20), E_{Z^t}[‖B(f_t, z_t)‖²_K] ≤ κ² E_{Z^t}[E(f_t)]. Also, since ηκ² ≤ 1, the estimate (4.11) yields that ‖ω_{t+1}^T(L_K)‖ ≤ 1 for any t ∈ IN_T. Putting these estimates back into (5.7), we have that

E_{Z^T}[‖Σ_{t=1}^T ω_{t+1}^T(L_K) B(f_t, z_t)‖²_K] ≤ κ² Σ_{t=1}^T E_{Z^t}[E(f_t)].    (5.8)

Using (4.15) again, the last term on the right-hand side of (5.6) vanishes:

E_{Z^T}[⟨ω_1^T(L_K) f_ρ, Σ_{t=1}^T ω_{t+1}^T(L_K) B(f_t, z_t)⟩_K] = 0.

Cascading this equality, (5.6) and (5.8) yields the desired estimate.

We are ready to present error bounds when the step sizes are of the form η_t = η for t ∈ IN_T, with η = η(T) depending on T.

Proposition 4. Let η_t = η for t ∈ IN_T and {f_t : t ∈ IN_{T+1}} be defined by (1.5). If 16(κ + 1)⁴ ln(8T²) η ≤ 1, then there holds

E_{Z^T}[‖f_{T+1} − f_ρ‖²_ρ] ≤ ‖ω_1^T(L_K) f_ρ‖²_ρ + c_ρ η ln(8T²).    (5.9)

In addition, if f_ρ ∈ H_K, then we have that

E_{Z^T}[‖f_{T+1} − f_ρ‖²_K] ≤ ‖ω_1^T(L_K) f_ρ‖²_K + c_ρ η² T.    (5.10)

Proof. We regard η as μ^{−1}. Then, the inequality 16(κ + 1)⁴ ln(8T²) η ≤ 1 means that μ ≥ 16(1 + κ)⁴ ln(8T²). By Corollary 1, the inequality (4.28)

on the step sizes is satisfied. Hence, the estimate (4.29) in Proposition 2 holds true. The first estimate (5.9) follows from Proposition 3 in the case θ = 0. Combining Lemma 8 and (4.29) yields (5.10).

From the above proposition, we can prove Theorem 4 stated in Section 2.

Proof of Theorem 4. The error bound of Theorem 4 is derived from (5.9) and property (a) in Lemma 5.

5.2 Proofs of convergence results

We are in a position to apply the error bounds in Theorems 1 and 4 to prove Theorems 2 and 5, while Propositions 3 and 4 will be used to derive the explicit rates in the next section. To prove Theorem 2, we introduce the following well-known property of the K-functional, see [5]. We include its proof for completeness.

Lemma 9. Let s > 0 and let K(s, f_ρ) be the K-functional defined in Section 2. Then there holds

lim_{s→0+} K(s, f_ρ) = inf_{f ∈ H_K} ‖f − f_ρ‖_ρ.

Proof. By the definition of the K-functional, it is straightforward that lim_{s→0+} K(s, f_ρ) ≥ inf_{f ∈ H_K} ‖f − f_ρ‖_ρ. Hence, we only need to show that

lim_{s→0+} K(s, f_ρ) ≤ inf_{f ∈ H_K} ‖f − f_ρ‖_ρ.    (5.11)

To this end, for any ε > 0 we know that there exists f_ε ∈ H_K such that ‖f_ε − f_ρ‖_ρ ≤ ε + inf_{f ∈ H_K} ‖f − f_ρ‖_ρ. Therefore, by taking f = f_ε, we know, for any s > 0, that K(s, f_ρ) is bounded by ε + inf_{f ∈ H_K} ‖f − f_ρ‖_ρ + s ‖f_ε‖_K. Letting s → 0+ yields (5.11), and hence completes the lemma.

We are ready to prove Theorem 2 based on Lemma 9 and Theorem 1.

Proof of Theorem 2. Since θ ∈ (0, 1) and η_t = μ^{−1} t^{−θ} for any t ∈ IN with μ ≥ μ(θ), Theorem 1 holds true. Applying Lemma 9 with s = b_θ μ^{1/2} T^{−(1−θ)/2} to the bound of Theorem 1 yields that

lim sup_{T→∞} E_{Z^T}[‖f_{T+1} − f_ρ‖²_ρ] ≤ inf_{f ∈ H_K} ‖f − f_ρ‖²_ρ.

On the other hand, since f_{T+1} ∈ H_K for any T ∈ IN and any sample z = {z_t : t ∈ IN_T}, we know that inf_{f ∈ H_K} ‖f − f_ρ‖²_ρ ≤ ‖f_{T+1} − f_ρ‖²_ρ, which leads to

inf_{f ∈ H_K} ‖f − f_ρ‖²_ρ ≤ lim inf_{T→∞} E_{Z^T}[‖f_{T+1} − f_ρ‖²_ρ].

This completes Theorem 2.

To prove Theorem 5, we further need the following observation.

Lemma 10. Let f_ρ ∈ H_K, and let the step sizes {η_t = η(T) : t ∈ IN_T} satisfy η(T)κ² ≤ 1 for any T ∈ IN and lim_{T→∞} η(T) T = ∞. Then we have that

lim_{T→∞} ‖ω_1^T(L_K) f_ρ‖_K = 0.

Proof. Recall the definition of L_K^β(L²_{ρ_X}) with β > 0 and the fact that H_K = L_K^{1/2}(L²_{ρ_X}). Hence, if f_ρ ∈ H_K, then

f_ρ = Σ_{j=1}^∞ λ_j^{1/2} a_j φ_j,  for some {a_j : j ∈ IN} ∈ ℓ².

Before we move to the next step, it will be helpful to sketch the main idea: in what follows, we construct functions in L_K(L²_{ρ_X}) which approximate f_ρ arbitrarily well in H_K, while for functions in L_K(L²_{ρ_X}) we apply (5.4) in Lemma 7 with β = 1.

Now, write

f_ρ = Σ_{λ_j>0, j=1}^N λ_j^{1/2} a_j φ_j + Σ_{λ_j>0, j=N+1}^∞ λ_j^{1/2} a_j φ_j.

Then f^N := Σ_{λ_j>0, j=1}^N λ_j (λ_j^{−1/2} a_j) φ_j ∈ L_K(L²_{ρ_X}) for any N ∈ IN, because

‖L_K^{−1} f^N‖²_ρ = Σ_{λ_j>0, j=1}^N (λ_j^{−1/2} a_j)² < ∞.

On the other hand, since Σ_j a_j² < ∞, for any ε > 0 there exists N(ε) ∈ IN such that Σ_{j=N(ε)+1}^∞ a_j² ≤ ε². Therefore,

‖f_ρ − f^{N(ε)}‖²_K = ‖Σ_{j=N(ε)+1}^∞ λ_j^{1/2} a_j φ_j‖²_K = Σ_{j=N(ε)+1}^∞ a_j² ≤ ε².

By (1.8), we have that

‖(I − η(T) L_K) f‖_K = ‖(I − η(T) L_K) L_K^{−1/2} f‖_ρ ≤ ‖I − η(T) L_K‖ ‖L_K^{−1/2} f‖_ρ = ‖I − η(T) L_K‖ ‖f‖_K.

Also, since η(T)κ² ≤ 1, applying (4.11) with x = η(T) and λ = 0 implies that ‖I − η(T) L_K‖ ≤ 1. Hence, for any f ∈ H_K, there holds ‖(I − η(T) L_K) f‖_K ≤ ‖f‖_K. Consequently, applying the above inequality iteratively with f = f_ρ − f^{N(ε)}

tells us that

‖(I − η(T) L_K)^T f_ρ‖_K ≤ ‖(I − η(T) L_K)^T f^{N(ε)}‖_K + ‖(I − η(T) L_K)^T (f_ρ − f^{N(ε)})‖_K
  ≤ ‖(I − η(T) L_K)^T f^{N(ε)}‖_K + ‖f_ρ − f^{N(ε)}‖_K
  ≤ ‖(I − η(T) L_K)^T f^{N(ε)}‖_K + ε.    (5.12)

To estimate ‖(I − η(T) L_K)^T f^{N(ε)}‖_K, applying (5.4) with η_t = η(T) for any t ∈ IN_T, f = f^{N(ε)}, and β = 1 yields that

‖(I − η(T) L_K)^T f^{N(ε)}‖_K ≤ 2(1 + κ) ‖L_K^{−1} f^{N(ε)}‖_ρ (T η(T))^{−1/2}.

Since lim_{T→∞} T η(T) = ∞, the above estimate implies that there exists an integer T(ε) ≥ N(ε) such that

‖(I − η(T) L_K)^T f^{N(ε)}‖_K ≤ ε,  ∀ T ≥ T(ε).    (5.13)

Putting (5.12) and (5.13) together, we know that

‖ω_1^T(L_K) f_ρ‖_K = ‖(I − η(T) L_K)^T f_ρ‖_K ≤ 2ε,  ∀ T ≥ T(ε).

Since ε > 0 is arbitrary, we obtain the desired claim.

We now move on to the proof of Theorem 5.

Proof of Theorem 5. Observe that, whenever lim_{T→∞} η(T) ln T = 0 or lim_{T→∞} T η²(T) = 0, there exists T₀ ∈ IN such that 16(κ + 1)⁴ η(T) ln(8T²) ≤ 1 for all T ≥ T₀. The hypotheses in Lemma 8 and Theorem 4 are therefore satisfied, and the bound of Theorem 4 holds true, for any T ≥ T₀.

We first prove the convergence in L²_{ρ_X}. Applying Lemma 9 with s = 2(1 + κ)(η(T) T)^{−1/2} to the bound of Theorem 4, it follows from the hypothesis lim_{T→∞} T η(T) = ∞ that

lim sup_{T→∞} E_{Z^T}[‖f_{T+1} − f_ρ‖²_ρ] ≤ inf_{f ∈ H_K} ‖f − f_ρ‖²_ρ.

On the other hand, since f_{T+1} ∈ H_K, we have inf_{f ∈ H_K} ‖f − f_ρ‖²_ρ ≤ ‖f_{T+1} − f_ρ‖²_ρ for any data z = {z_t : t ∈ IN_T}, which leads to

inf_{f ∈ H_K} ‖f − f_ρ‖²_ρ ≤ lim inf_{T→∞} E_{Z^T}[‖f_{T+1} − f_ρ‖²_ρ].

This verifies the first part of Theorem 5.

The proof for the second part is similar. Since lim_{T→∞} T η²(T) = 0, combining (5.10) in Proposition 4 with Lemma 10 yields that

lim_{T→∞} E_{Z^T}[‖f_{T+1} − f_ρ‖²_K] = 0,

which completes Theorem 5.
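The mechanism behind Lemma 10 can also be checked exactly on a diagonal example. With hypothetical eigenvalues λ_j = j^{−2} and f_ρ = Σ_j λ_j^{1/2} a_j φ_j for a_j = 1/j (so {a_j} ∈ ℓ² and f_ρ ∈ H_K), one has ‖(I − η(T)L_K)^T f_ρ‖²_K = Σ_j (1 − η(T)λ_j)^{2T} a_j². Taking η(T) = 1/(2√T), so that η(T)κ² ≤ 1 and Tη(T) → ∞, this quantity is driven to zero:

```python
import numpy as np

j = np.arange(1, 20001, dtype=float)
lam = j**(-2.0)      # assumed eigenvalues of L_K (so kappa^2 = 1)
a = 1.0 / j          # coefficients of f_rho in the basis {lam_j^{1/2} phi_j}

def hk_norm_sq(T):
    # ||omega_1^T(L_K) f_rho||_K^2 with the constant step eta(T) = 1 / (2 sqrt(T))
    eta = 0.5 / np.sqrt(T)
    return float(np.sum((1.0 - eta * lam)**(2 * T) * a**2))

vals = [hk_norm_sq(T) for T in (10, 100, 1000, 10000)]
print(vals)   # decreases toward 0 as T grows, as Lemma 10 predicts
```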


Unsupervised Learning Techniques Class 07, 1 March 2006 Andrea Caponnetto Unsupervised Learning Techniques 9.520 Class 07, 1 March 2006 Andrea Caponnetto About this class Goal To introduce some methods for unsupervised learning: Gaussian Mixtures, K-Means, ISOMAP, HLLE, Laplacian

More information

arxiv: v4 [math.oc] 5 Jan 2016

arxiv: v4 [math.oc] 5 Jan 2016 Restarted SGD: Beating SGD without Smoothness and/or Strong Convexity arxiv:151.03107v4 [math.oc] 5 Jan 016 Tianbao Yang, Qihang Lin Department of Computer Science Department of Management Sciences The

More information

Fourier Transform & Sobolev Spaces

Fourier Transform & Sobolev Spaces Fourier Transform & Sobolev Spaces Michael Reiter, Arthur Schuster Summer Term 2008 Abstract We introduce the concept of weak derivative that allows us to define new interesting Hilbert spaces the Sobolev

More information

Metric Spaces and Topology

Metric Spaces and Topology Chapter 2 Metric Spaces and Topology From an engineering perspective, the most important way to construct a topology on a set is to define the topology in terms of a metric on the set. This approach underlies

More information

A SYNOPSIS OF HILBERT SPACE THEORY

A SYNOPSIS OF HILBERT SPACE THEORY A SYNOPSIS OF HILBERT SPACE THEORY Below is a summary of Hilbert space theory that you find in more detail in any book on functional analysis, like the one Akhiezer and Glazman, the one by Kreiszig or

More information

Solving the 3D Laplace Equation by Meshless Collocation via Harmonic Kernels

Solving the 3D Laplace Equation by Meshless Collocation via Harmonic Kernels Solving the 3D Laplace Equation by Meshless Collocation via Harmonic Kernels Y.C. Hon and R. Schaback April 9, Abstract This paper solves the Laplace equation u = on domains Ω R 3 by meshless collocation

More information

Cross-Selling in a Call Center with a Heterogeneous Customer Population. Technical Appendix

Cross-Selling in a Call Center with a Heterogeneous Customer Population. Technical Appendix Cross-Selling in a Call Center with a Heterogeneous Customer opulation Technical Appendix Itay Gurvich Mor Armony Costis Maglaras This technical appendix is dedicated to the completion of the proof of

More information

RANDOM FIELDS AND GEOMETRY. Robert Adler and Jonathan Taylor

RANDOM FIELDS AND GEOMETRY. Robert Adler and Jonathan Taylor RANDOM FIELDS AND GEOMETRY from the book of the same name by Robert Adler and Jonathan Taylor IE&M, Technion, Israel, Statistics, Stanford, US. ie.technion.ac.il/adler.phtml www-stat.stanford.edu/ jtaylor

More information

Reproducing Kernel Hilbert Spaces

Reproducing Kernel Hilbert Spaces Reproducing Kernel Hilbert Spaces Lorenzo Rosasco 9.520 Class 03 February 12, 2007 About this class Goal To introduce a particularly useful family of hypothesis spaces called Reproducing Kernel Hilbert

More information

Structured Prediction

Structured Prediction Structured Prediction Ningshan Zhang Advanced Machine Learning, Spring 2016 Outline Ensemble Methods for Structured Prediction[1] On-line learning Boosting AGeneralizedKernelApproachtoStructuredOutputLearning[2]

More information

Economics 204 Summer/Fall 2011 Lecture 5 Friday July 29, 2011

Economics 204 Summer/Fall 2011 Lecture 5 Friday July 29, 2011 Economics 204 Summer/Fall 2011 Lecture 5 Friday July 29, 2011 Section 2.6 (cont.) Properties of Real Functions Here we first study properties of functions from R to R, making use of the additional structure

More information

Introduction to Empirical Processes and Semiparametric Inference Lecture 08: Stochastic Convergence

Introduction to Empirical Processes and Semiparametric Inference Lecture 08: Stochastic Convergence Introduction to Empirical Processes and Semiparametric Inference Lecture 08: Stochastic Convergence Michael R. Kosorok, Ph.D. Professor and Chair of Biostatistics Professor of Statistics and Operations

More information

Perceptron Mistake Bounds

Perceptron Mistake Bounds Perceptron Mistake Bounds Mehryar Mohri, and Afshin Rostamizadeh Google Research Courant Institute of Mathematical Sciences Abstract. We present a brief survey of existing mistake bounds and introduce

More information

The Dirichlet s P rinciple. In this lecture we discuss an alternative formulation of the Dirichlet problem for the Laplace equation:

The Dirichlet s P rinciple. In this lecture we discuss an alternative formulation of the Dirichlet problem for the Laplace equation: Oct. 1 The Dirichlet s P rinciple In this lecture we discuss an alternative formulation of the Dirichlet problem for the Laplace equation: 1. Dirichlet s Principle. u = in, u = g on. ( 1 ) If we multiply

More information

Fourier Series. 1. Review of Linear Algebra

Fourier Series. 1. Review of Linear Algebra Fourier Series In this section we give a short introduction to Fourier Analysis. If you are interested in Fourier analysis and would like to know more detail, I highly recommend the following book: Fourier

More information

Simultaneous Model Selection and Optimization through Parameter-free Stochastic Learning

Simultaneous Model Selection and Optimization through Parameter-free Stochastic Learning Simultaneous Model Selection and Optimization through Parameter-free Stochastic Learning Francesco Orabona Yahoo! Labs New York, USA francesco@orabona.com Abstract Stochastic gradient descent algorithms

More information

Math 350 Fall 2011 Notes about inner product spaces. In this notes we state and prove some important properties of inner product spaces.

Math 350 Fall 2011 Notes about inner product spaces. In this notes we state and prove some important properties of inner product spaces. Math 350 Fall 2011 Notes about inner product spaces In this notes we state and prove some important properties of inner product spaces. First, recall the dot product on R n : if x, y R n, say x = (x 1,...,

More information

An introduction to Birkhoff normal form

An introduction to Birkhoff normal form An introduction to Birkhoff normal form Dario Bambusi Dipartimento di Matematica, Universitá di Milano via Saldini 50, 0133 Milano (Italy) 19.11.14 1 Introduction The aim of this note is to present an

More information

Appendix A Functional Analysis

Appendix A Functional Analysis Appendix A Functional Analysis A.1 Metric Spaces, Banach Spaces, and Hilbert Spaces Definition A.1. Metric space. Let X be a set. A map d : X X R is called metric on X if for all x,y,z X it is i) d(x,y)

More information

Stochastic optimization in Hilbert spaces

Stochastic optimization in Hilbert spaces Stochastic optimization in Hilbert spaces Aymeric Dieuleveut Aymeric Dieuleveut Stochastic optimization Hilbert spaces 1 / 48 Outline Learning vs Statistics Aymeric Dieuleveut Stochastic optimization Hilbert

More information

Notions such as convergent sequence and Cauchy sequence make sense for any metric space. Convergent Sequences are Cauchy

Notions such as convergent sequence and Cauchy sequence make sense for any metric space. Convergent Sequences are Cauchy Banach Spaces These notes provide an introduction to Banach spaces, which are complete normed vector spaces. For the purposes of these notes, all vector spaces are assumed to be over the real numbers.

More information

Support Vector Machines

Support Vector Machines Wien, June, 2010 Paul Hofmarcher, Stefan Theussl, WU Wien Hofmarcher/Theussl SVM 1/21 Linear Separable Separating Hyperplanes Non-Linear Separable Soft-Margin Hyperplanes Hofmarcher/Theussl SVM 2/21 (SVM)

More information

Geometry on Probability Spaces

Geometry on Probability Spaces Geometry on Probability Spaces Steve Smale Toyota Technological Institute at Chicago 427 East 60th Street, Chicago, IL 60637, USA E-mail: smale@math.berkeley.edu Ding-Xuan Zhou Department of Mathematics,

More information

Brownian motion. Samy Tindel. Purdue University. Probability Theory 2 - MA 539

Brownian motion. Samy Tindel. Purdue University. Probability Theory 2 - MA 539 Brownian motion Samy Tindel Purdue University Probability Theory 2 - MA 539 Mostly taken from Brownian Motion and Stochastic Calculus by I. Karatzas and S. Shreve Samy T. Brownian motion Probability Theory

More information

Statistical Convergence of Kernel CCA

Statistical Convergence of Kernel CCA Statistical Convergence of Kernel CCA Kenji Fukumizu Institute of Statistical Mathematics Tokyo 106-8569 Japan fukumizu@ism.ac.jp Francis R. Bach Centre de Morphologie Mathematique Ecole des Mines de Paris,

More information

Real Variables # 10 : Hilbert Spaces II

Real Variables # 10 : Hilbert Spaces II randon ehring Real Variables # 0 : Hilbert Spaces II Exercise 20 For any sequence {f n } in H with f n = for all n, there exists f H and a subsequence {f nk } such that for all g H, one has lim (f n k,

More information

An introduction to some aspects of functional analysis

An introduction to some aspects of functional analysis An introduction to some aspects of functional analysis Stephen Semmes Rice University Abstract These informal notes deal with some very basic objects in functional analysis, including norms and seminorms

More information

1 Directional Derivatives and Differentiability

1 Directional Derivatives and Differentiability Wednesday, January 18, 2012 1 Directional Derivatives and Differentiability Let E R N, let f : E R and let x 0 E. Given a direction v R N, let L be the line through x 0 in the direction v, that is, L :=

More information

Regularization via Spectral Filtering

Regularization via Spectral Filtering Regularization via Spectral Filtering Lorenzo Rosasco MIT, 9.520 Class 7 About this class Goal To discuss how a class of regularization methods originally designed for solving ill-posed inverse problems,

More information

PROPERTIES OF A CLASS OF APPROXIMATELY SHRINKING OPERATORS AND THEIR APPLICATIONS

PROPERTIES OF A CLASS OF APPROXIMATELY SHRINKING OPERATORS AND THEIR APPLICATIONS Fixed Point Theory, 15(2014), No. 2, 399-426 http://www.math.ubbcluj.ro/ nodeacj/sfptcj.html PROPERTIES OF A CLASS OF APPROXIMATELY SHRINKING OPERATORS AND THEIR APPLICATIONS ANDRZEJ CEGIELSKI AND RAFA

More information

ON EARLY STOPPING IN GRADIENT DESCENT LEARNING. 1. Introduction

ON EARLY STOPPING IN GRADIENT DESCENT LEARNING. 1. Introduction ON EARLY STOPPING IN GRADIENT DESCENT LEARNING YUAN YAO, LORENZO ROSASCO, AND ANDREA CAPONNETTO Abstract. In this paper, we study a family of gradient descent algorithms to approximate the regression function

More information

1.5 Approximate Identities

1.5 Approximate Identities 38 1 The Fourier Transform on L 1 (R) which are dense subspaces of L p (R). On these domains, P : D P L p (R) and M : D M L p (R). Show, however, that P and M are unbounded even when restricted to these

More information

7: FOURIER SERIES STEVEN HEILMAN

7: FOURIER SERIES STEVEN HEILMAN 7: FOURIER SERIES STEVE HEILMA Contents 1. Review 1 2. Introduction 1 3. Periodic Functions 2 4. Inner Products on Periodic Functions 3 5. Trigonometric Polynomials 5 6. Periodic Convolutions 7 7. Fourier

More information

MA677 Assignment #3 Morgan Schreffler Due 09/19/12 Exercise 1 Using Hölder s inequality, prove Minkowski s inequality for f, g L p (R d ), p 1:

MA677 Assignment #3 Morgan Schreffler Due 09/19/12 Exercise 1 Using Hölder s inequality, prove Minkowski s inequality for f, g L p (R d ), p 1: Exercise 1 Using Hölder s inequality, prove Minkowski s inequality for f, g L p (R d ), p 1: f + g p f p + g p. Proof. If f, g L p (R d ), then since f(x) + g(x) max {f(x), g(x)}, we have f(x) + g(x) p

More information

Hilbert Spaces. Contents

Hilbert Spaces. Contents Hilbert Spaces Contents 1 Introducing Hilbert Spaces 1 1.1 Basic definitions........................... 1 1.2 Results about norms and inner products.............. 3 1.3 Banach and Hilbert spaces......................

More information

Spectral Geometry of Riemann Surfaces

Spectral Geometry of Riemann Surfaces Spectral Geometry of Riemann Surfaces These are rough notes on Spectral Geometry and their application to hyperbolic riemann surfaces. They are based on Buser s text Geometry and Spectra of Compact Riemann

More information

EECS 598: Statistical Learning Theory, Winter 2014 Topic 11. Kernels

EECS 598: Statistical Learning Theory, Winter 2014 Topic 11. Kernels EECS 598: Statistical Learning Theory, Winter 2014 Topic 11 Kernels Lecturer: Clayton Scott Scribe: Jun Guo, Soumik Chatterjee Disclaimer: These notes have not been subjected to the usual scrutiny reserved

More information

Convergence rates of spectral methods for statistical inverse learning problems

Convergence rates of spectral methods for statistical inverse learning problems Convergence rates of spectral methods for statistical inverse learning problems G. Blanchard Universtität Potsdam UCL/Gatsby unit, 04/11/2015 Joint work with N. Mücke (U. Potsdam); N. Krämer (U. München)

More information

MIT 9.520/6.860, Fall 2017 Statistical Learning Theory and Applications. Class 19: Data Representation by Design

MIT 9.520/6.860, Fall 2017 Statistical Learning Theory and Applications. Class 19: Data Representation by Design MIT 9.520/6.860, Fall 2017 Statistical Learning Theory and Applications Class 19: Data Representation by Design What is data representation? Let X be a data-space X M (M) F (M) X A data representation

More information

Empirical Processes: General Weak Convergence Theory

Empirical Processes: General Weak Convergence Theory Empirical Processes: General Weak Convergence Theory Moulinath Banerjee May 18, 2010 1 Extended Weak Convergence The lack of measurability of the empirical process with respect to the sigma-field generated

More information

9.520: Class 20. Bayesian Interpretations. Tomaso Poggio and Sayan Mukherjee

9.520: Class 20. Bayesian Interpretations. Tomaso Poggio and Sayan Mukherjee 9.520: Class 20 Bayesian Interpretations Tomaso Poggio and Sayan Mukherjee Plan Bayesian interpretation of Regularization Bayesian interpretation of the regularizer Bayesian interpretation of quadratic

More information

First-Return Integrals

First-Return Integrals First-Return ntegrals Marianna Csörnyei, Udayan B. Darji, Michael J. Evans, Paul D. Humke Abstract Properties of first-return integrals of real functions defined on the unit interval are explored. n particular,

More information

Spectral Regularization

Spectral Regularization Spectral Regularization Lorenzo Rosasco 9.520 Class 07 February 27, 2008 About this class Goal To discuss how a class of regularization methods originally designed for solving ill-posed inverse problems,

More information

Lecture 35: December The fundamental statistical distances

Lecture 35: December The fundamental statistical distances 36-705: Intermediate Statistics Fall 207 Lecturer: Siva Balakrishnan Lecture 35: December 4 Today we will discuss distances and metrics between distributions that are useful in statistics. I will be lose

More information

Fourier Series. ,..., e ixn ). Conversely, each 2π-periodic function φ : R n C induces a unique φ : T n C for which φ(e ix 1

Fourier Series. ,..., e ixn ). Conversely, each 2π-periodic function φ : R n C induces a unique φ : T n C for which φ(e ix 1 Fourier Series Let {e j : 1 j n} be the standard basis in R n. We say f : R n C is π-periodic in each variable if f(x + πe j ) = f(x) x R n, 1 j n. We can identify π-periodic functions with functions on

More information

On convergent power series

On convergent power series Peter Roquette 17. Juli 1996 On convergent power series We consider the following situation: K a field equipped with a non-archimedean absolute value which is assumed to be complete K[[T ]] the ring of

More information

Optimization methods

Optimization methods Lecture notes 3 February 8, 016 1 Introduction Optimization methods In these notes we provide an overview of a selection of optimization methods. We focus on methods which rely on first-order information,

More information

THEOREMS, ETC., FOR MATH 515

THEOREMS, ETC., FOR MATH 515 THEOREMS, ETC., FOR MATH 515 Proposition 1 (=comment on page 17). If A is an algebra, then any finite union or finite intersection of sets in A is also in A. Proposition 2 (=Proposition 1.1). For every

More information

Adaptive Sampling Under Low Noise Conditions 1

Adaptive Sampling Under Low Noise Conditions 1 Manuscrit auteur, publié dans "41èmes Journées de Statistique, SFdS, Bordeaux (2009)" Adaptive Sampling Under Low Noise Conditions 1 Nicolò Cesa-Bianchi Dipartimento di Scienze dell Informazione Università

More information

Lecture Notes on Metric Spaces

Lecture Notes on Metric Spaces Lecture Notes on Metric Spaces Math 117: Summer 2007 John Douglas Moore Our goal of these notes is to explain a few facts regarding metric spaces not included in the first few chapters of the text [1],

More information

MATH 590: Meshfree Methods

MATH 590: Meshfree Methods MATH 590: Meshfree Methods Chapter 2 Part 3: Native Space for Positive Definite Kernels Greg Fasshauer Department of Applied Mathematics Illinois Institute of Technology Fall 2014 fasshauer@iit.edu MATH

More information

Weak convergence and Brownian Motion. (telegram style notes) P.J.C. Spreij

Weak convergence and Brownian Motion. (telegram style notes) P.J.C. Spreij Weak convergence and Brownian Motion (telegram style notes) P.J.C. Spreij this version: December 8, 2006 1 The space C[0, ) In this section we summarize some facts concerning the space C[0, ) of real

More information