Online gradient descent learning algorithm


Yiming Ying and Massimiliano Pontil
Department of Computer Science, University College London
Gower Street, London, WC1E 6BT, England, UK
{y.ying,

Abstract

This paper considers the least-square online gradient descent algorithm in a reproducing kernel Hilbert space (RKHS) without an explicit regularization term. We present a novel capacity independent approach to derive error bounds and convergence results for this algorithm. The essential element in our analysis is the interplay between the generalization error and a weighted cumulative error which we define in the paper. We show that, although the algorithm does not involve an explicit RKHS regularization term, choosing the step sizes appropriately yields error rates competitive with those in the literature.

Short Title: Online gradient descent learning

Keywords and Phrases: Learning theory, online learning, reproducing kernel Hilbert space, gradient descent, error analysis.

AMS Subject Classification Numbers: 68Q32, 68T05, 62J02, 62L20.

Contact author: Yiming Ying. Telephone: +44 (0), Fax: +44 (0)

1 Introduction

Let X be a compact subset of the Euclidean space ℝ^d, let Y be a subset of ℝ, and set ℕ_l := {1, ..., l} for any l ∈ ℕ. Let ρ be a fixed but unknown probability measure on Z := X × Y. The generalization error of a function f : X → Y is defined as

    E(f) := ∫_Z (f(x) − y)² dρ(z).    (1.1)

The function minimizing this error is called the regression function; it is given by

    f_ρ(x) = ∫_Y y dρ(y|x),  x ∈ X,    (1.2)

where ρ(·|x) is the conditional probability measure at x induced by ρ. We consider the problem of approximating the regression function from a set of training data drawn from ρ. In this paper, we restrict our attention to online learning algorithms for computing an approximation of f_ρ in a reproducing kernel Hilbert space (RKHS). Let K : X × X → ℝ be a Mercer kernel, that is, a continuous, symmetric and positive semi-definite function. The RKHS H_K associated with K is defined to be the completion of the linear span of the set of functions {K_x(·) := K(x, ·) : x ∈ X}, with inner product satisfying, for any x ∈ X and g ∈ H_K, the reproducing property

    ⟨K_x, g⟩_K = g(x).    (1.3)

Let {z_t = (x_t, y_t) : t ∈ ℕ} be a sequence of random samples independently distributed according to ρ. The online gradient descent algorithm is defined as

    f_1 = 0, and for t ∈ ℕ:  f_{t+1} = f_t − η_t ((f_t(x_t) − y_t) K_{x_t} + λ f_t),    (1.4)

where λ ≥ 0 is called the regularization parameter. We call the sequence {η_t : t ∈ ℕ} the step sizes (or learning rates) and {f_t : t ∈ ℕ} the learning sequence. Clearly, each output f_{t+1} depends only on {z_j : j ∈ ℕ_t}. The class of learning algorithms displayed above is also referred to as stochastic approximation algorithms in the setting of RKHSs. Such stochastic approximation procedures have a long history; see [8, 35] and the references therein for more background material.
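The recursion (1.4) is straightforward to implement by storing the kernel expansion f_t = Σ_j c_j K_{x_j}. The sketch below is illustrative only: the Gaussian kernel, its width, the step-size constants and the toy regression target are our own choices, not taken from the paper.

```python
import numpy as np

def gaussian_kernel(x, xp, sigma=0.2):
    # Gaussian (Mercer) kernel; here kappa = sup_x sqrt(K(x, x)) = 1.
    return np.exp(-(x - xp) ** 2 / sigma ** 2)

def online_gradient_descent(samples, eta, lam=0.0, kernel=gaussian_kernel):
    """f_1 = 0; f_{t+1} = f_t - eta_t*((f_t(x_t) - y_t)*K_{x_t} + lam*f_t).

    f_t is stored via its kernel expansion f_t = sum_j c_j K_{x_j};
    eta is a function t -> eta_t (t is 1-indexed)."""
    xs, cs = [], []                                             # support points and coefficients of f_t
    for t, (x_t, y_t) in enumerate(samples, start=1):
        f_xt = sum(c * kernel(x, x_t) for x, c in zip(xs, cs))  # evaluate f_t(x_t)
        cs = [(1.0 - eta(t) * lam) * c for c in cs]             # shrink step: -eta_t*lam*f_t
        xs.append(x_t)
        cs.append(-eta(t) * (f_xt - y_t))                       # gradient step on the new sample
    return xs, cs

# toy run with polynomially decaying steps eta_t = mu * t^{-theta}, theta = 1/2
rng = np.random.default_rng(0)
data = [(x, np.sin(2 * np.pi * x) + 0.1 * rng.standard_normal()) for x in rng.random(300)]
xs, cs = online_gradient_descent(data, eta=lambda t: 0.5 * t ** -0.5)
f_hat = lambda x: sum(c * gaussian_kernel(xj, x) for xj, c in zip(xs, cs))
```

Setting lam > 0 gives the regularized variant of (1.4); lam = 0 recovers the unregularized algorithm that is the main object of the paper.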

When the parameter λ > 0, we call (1.4) the online regularized algorithm. This algorithm has been well studied in the recent literature, see e.g. [33]. In this paper, we are mainly concerned with the online gradient descent algorithm without an explicit RKHS regularization term, which is given by (1.4) with λ = 0, that is,

    f_1 = 0, and for t ∈ ℕ:  f_{t+1} = f_t − η_t (f_t(x_t) − y_t) K_{x_t}.    (1.5)

We study how well f_{T+1}, the output at iteration step (sample size) T, approximates the regression function f_ρ. The nature of the least-square loss leads to measurement in the metric of L²_{ρ_X}, defined by

    ‖f‖_ρ = ‖f‖_{L²_{ρ_X}} := (∫_X f(x)² dρ_X)^{1/2},

where ρ_X is the marginal distribution of ρ on X. A direct computation yields that

    ‖f_{T+1} − f_ρ‖²_ρ = E(f_{T+1}) − E(f_ρ);    (1.6)

see e.g. [1] for a proof.

Our primary goal is to estimate the error (1.6) for the least-square online algorithm (1.5) by means of properties of ρ and K. We shall show how the choice of the step sizes in the algorithm affects its generalization error. Moreover, we shall derive error rates for algorithm (1.5) which are competitive with those in the literature. We mainly focus on two types of step sizes. The first is a universal polynomially decaying sequence of the form {η_t = O(t^{−θ}) : t ∈ ℕ} with θ ∈ (0, 1). The second type is of the form {η_t = η : t ∈ ℕ_T} with η = η(T) depending on the iteration number (sample size) T. As we shall discuss in Section 2, the function η(T) provides a trade-off between the learning rate η and the number of iterations T, and its choice ensures convergence of the algorithm. This is in the spirit of early stopping as implicit regularization, see e.g. [36].

The rest of the paper is organized as follows. The next section summarizes our main results and Section 3 collects discussions of related work. To formulate our basic ideas, in Section 4 we provide a novel approach to the error analysis of the online algorithm (1.4) under some conditions on the step sizes. The essential element in our analysis is an appealing relation between the generalization error and a weighted form of cumulative sample error (see Proposition 1), a quantity familiar from the online learning literature. In Section 5, we develop error bounds and convergence results for the online algorithm (1.5). Finally, Section 6 establishes explicit error rates for

algorithm (1.5) and, as a byproduct, improves the preceding rates [33] for the online regularized algorithm given by (1.4) with λ > 0.

2 Main results

We begin with some background material and notation. Firstly, let C(X) be the space of continuous functions on X with the supremum norm, and let κ := sup_{x∈X} √K(x, x). By the reproducing property (1.3), for every f ∈ H_K,

    ‖f‖_∞ ≤ κ ‖f‖_K.    (2.1)

In addition, for every t ∈ ℕ we denote the expectation E_{z_1,...,z_t} by E_{Z^t} and use the convention E_{Z^0}[ξ] = ξ for any random variable ξ. Secondly, we introduce the K-functional [5], which plays a central role in approximation theory:

    K(s, f_ρ) := inf_{f∈H_K} { ‖f − f_ρ‖_ρ + s ‖f‖_K },  s > 0.    (2.2)

Next, we define, for any θ ∈ (0, 1), the quantity

    μ(θ) := 1/(48(1+κ)⁴) if θ = 1/2, and μ(θ) := min{θ, 1−θ} / (108(1+κ)⁴ (ln(8/θ) + 2)) otherwise.    (2.3)

Finally, we require that ∫_Z y² dρ(z) < ∞, which implies that the quantity E(f_ρ) + ‖f_ρ‖²_ρ is finite.

We are now ready to state our main results. The first one deals with error bounds for the online gradient descent algorithm (1.5).

Theorem 1. Let θ ∈ (0, 1) and η_t = μ t^{−θ} for t ∈ ℕ, with some constant μ ≤ μ(θ). Define {f_t : t ∈ ℕ} by (1.5). Then, for any T ∈ ℕ, E_{Z^T}[‖f_{T+1} − f_ρ‖²_ρ] is bounded by

    [K(b_θ (μ T^{1−θ})^{−1/2}, f_ρ)]² + c_ρ c_θ μ T^{−min{θ, 1−θ}} ln(8T/θ),    (2.4)

where c_ρ := 4(1+κ)⁴(10‖f_ρ‖²_ρ + 3E(f_ρ)), b_θ := (1+κ) if θ = 1/2 and b_θ := 3(1−θ)^{−1/2} otherwise, and c_θ := 3/θ.

The K-functional term of our upper bound in Theorem 1 concerns the approximation of the function f_ρ in the space L²_{ρ_X} by functions in the

5 space H K. Its polynomial decay can be characterized by requiring that f ρ lies in the interpolation space (L ρ X, H K ) β, defined as (L ρ X, H K ) β, := { f L ρ X : f β, = sup s β K(s, f) < }, (.5) s>0 where β 0, ]. Indeed, it is easy to see that, for some c > 0, K(s, f ρ ) cs β for any s > 0 if and only if f ρ (L ρ X, H K ) β,. Note 5] that (L ρ X, H K ) 0, = L ρ X and (L ρ X, H K ), = H K. Therefore, we can intuitively regard the interpolation space (L ρ X, H K ) β, as an intermediate space between the metric space L ρ X and the much smaller approximation space H K. Similar ideas of using the K-functional to characterize the approximation property of the space H K can be found in,, 3]. In general, we have 5] that lim s 0+ K(s, f ρ ) = inf f HK f f ρ ρ. This gives rise to convergence of the online gradient descent algorithm (.5). Theorem. Let θ (0, ) and µ(θ) be given by (.3). Define {f t : t IN} by (.5). If the step sizes satisfy η t = µ t θ for t IN with some constant µ µ(θ) then we have that lim E ] Z T T ft + f ρ ρ = inf f f ρ f H K ρ. We will give the proofs of Theorems and in Section 5. If the space H K has a good approximation property then the limit in Theorem equals zero. To see this, we say that H K is dense in L ρ X if inf f g ρ = 0, for all g L ρ f H X. (.6) K We note that, in 6], a kernel is called universal if inf f HK f g = 0, for all g C(X). For example, the Gaussian kernel K σ (x, x ) = is universal for every σ > 0. Universal kernels are sufficient to ensure our density condition (.6), since f ρ f and C(X) is dense in L ρ X. Note that f ρ L ρ X. Therefore, an immediate consequence of Theorem is the following corollary. e x x σ Corollary. Suppose the assumptions in Theorem hold true. Moreover, if H K is dense in L ρ X then we have that lim E ] Z T T ft + f ρ ρ = 0. 5

By choosing θ appropriately, we can get explicit error rates when the regression function f_ρ lies in the regularity space L_K^β(L²_{ρ_X}). To define this space, we introduce the integral operator L_K : L²_{ρ_X} → L²_{ρ_X},

    L_K f(x) := ∫_X K(x, x') f(x') dρ_X(x'),  x ∈ X, f ∈ L²_{ρ_X}.

Since K is a Mercer kernel, L_K is compact and positive. Therefore, the fractional power operator L_K^β is well defined for any β > 0. We denote its range space by

    L_K^β(L²_{ρ_X}) := { f = Σ_{j=1}^∞ λ_j^β a_j φ_j : ‖L_K^{−β} f‖²_ρ := Σ_{j=1}^∞ a_j² < ∞ },    (2.7)

where {λ_j : j ∈ ℕ} are the positive eigenvalues of the operator L_K and {φ_j : j ∈ ℕ} are the corresponding orthonormal eigenfunctions. Thus, the smaller β is, the bigger the range space L_K^β(L²_{ρ_X}) will be. In particular, we know [5] that L_K^β(L²_{ρ_X}) ⊆ H_K for β ≥ 1/2 and L_K^{1/2}(L²_{ρ_X}) = H_K, with the norm satisfying

    ‖g‖_K = ‖L_K^{−1/2} g‖_ρ,  g ∈ H_K.    (2.8)

One can find more details in [5].

Of course, one could instead assume that f_ρ ∈ (L²_{ρ_X}, H_K)_{β,∞}. This is a natural assumption, since L_K^{β/2}(L²_{ρ_X}) ⊆ (L²_{ρ_X}, H_K)_{β,∞} for β ∈ [0, 1]; see the Appendix. However, in this paper we concentrate on the hypothesis space L_K^β(L²_{ρ_X}), since this facilitates a comparison of our results with those in the related literature (e.g. [5, 7, 33, 34]).

We are now in a position to state our error rates for algorithm (1.5). Hereafter, the expression a_T = O(b_T) means that there exists an absolute constant c such that a_T ≤ c b_T for all T ∈ ℕ.

Theorem 3. Let μ(·) be given by (2.3) and define {f_t : t ∈ ℕ} by (1.5). If f_ρ ∈ L_K^β(L²_{ρ_X}) with some 0 < β ≤ 1/2 then, by selecting η_t = μ(2β/(2β+1)) t^{−2β/(2β+1)} for t ∈ ℕ, for any T ∈ ℕ there holds

    E_{Z^T}[‖f_{T+1} − f_ρ‖²_ρ] = O(T^{−2β/(2β+1)} ln T).    (2.9)

In Section 6.1, we will derive the error rate (2.9) from some modified error bounds similar to (2.4). Since the best rate of the second term on the right-hand side of (2.4) is O(T^{−1/2} ln T), the associated error rate

(2.9) is never faster than the rate O(T^{−1/2} ln T) achieved for β = 1/2. In contrast, it is known that regularization algorithms in batch learning [5, 7] can achieve better rates than O(T^{−1/2}) if f_ρ ∈ L_K^β(L²_{ρ_X}) with β > 1/2. Unfortunately, for the online algorithm (1.5) with polynomially decaying step sizes, we do not know how to get better rates beyond the range β ∈ (0, 1/2].

We turn our attention to the case where the step sizes of (1.5) are of the form {η_t = η : t ∈ ℕ_T}, the learning rate η depending on the sample number T. In this scenario, the error bounds for algorithm (1.5) read as follows.

Theorem 4. Let {η_t = η : t ∈ ℕ_T} and let {f_t : t ∈ ℕ_{T+1}} be produced by algorithm (1.5). If η ≤ 1/(16(κ+1)⁴ ln(8T)), then E_{Z^T}[‖f_{T+1} − f_ρ‖²_ρ] is bounded by

    [K((1+κ)(ηT)^{−1/2}, f_ρ)]² + c_ρ η ln(8T).    (2.10)

Note that the function K(·, f_ρ) is non-decreasing and lim_{s→0+} K(s, f_ρ) = inf_{f∈H_K} ‖f − f_ρ‖_ρ. Thus, from the error bound (2.10) we see that a large number of iterations T reduces the K-functional term but enlarges the last term in (2.10); on the other hand, a small number of iterations reduces the last term in (2.10) but enlarges the K-functional term. Hence, trading off the two terms in the error bound (2.10) suggests an early stopping rule T = T(η) which guarantees convergence of algorithm (1.5) as η → 0+. However, our purpose in this paper is to derive error rates with respect to the number of iterations (sample number) T by choosing η appropriately, and thus we take the inverse description η = η(T).

In addition to the error ‖f_{T+1} − f_ρ‖_ρ, there are other interesting relevant quantities, such as the error ‖f_{T+1} − f_ρ‖_K (if f_ρ ∈ H_K). Observe that the global error ‖f_{T+1} − f_ρ‖_ρ cannot wholly describe the local behavior of f_{T+1}, which is usually measured by |f_{T+1}(x₀) − f_ρ(x₀)| for x₀ ∈ X. However, if f_ρ ∈ H_K, for every x₀ ∈ X we have

    |f_{T+1}(x₀) − f_ρ(x₀)| ≤ κ ‖f_{T+1} − f_ρ‖_K.

In this case, convergence and error rates in H_K reflect the local performance of the predictor f_{T+1}. Indeed, it was further pointed out in [5] that convergence in H_K yields convergence in C^k(X) under some conditions on K, where C^k(X) denotes the space of all functions whose derivatives up to order k are continuous. (In the batch learning setting mentioned above, the whole batch of examples is used at one time.)
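The choice of θ in Theorem 3 above can be motivated by balancing the two terms of the bound in Theorem 1, using the K-functional decay K(s, f_ρ) ≤ c s^{2β} available when f_ρ ∈ L_K^β(L²_{ρ_X}) with 0 < β ≤ 1/2. A sketch of this balancing, with constants suppressed:

```latex
% First term of the Theorem 1 bound, using K(s, f_\rho) \le c\, s^{2\beta}:
\Big[K\big(b_\theta(\mu T^{1-\theta})^{-1/2}, f_\rho\big)\Big]^2
  = O\big(T^{-2\beta(1-\theta)}\big).
% Second term, for \theta \le 1/2 (so \min\{\theta,1-\theta\} = \theta):
\mu\, T^{-\min\{\theta,1-\theta\}}\ln T = O\big(T^{-\theta}\ln T\big).
% Equating the two exponents:
2\beta(1-\theta) = \theta
\;\Longleftrightarrow\;
\theta = \frac{2\beta}{2\beta+1},
% which yields the rate O\big(T^{-2\beta/(2\beta+1)}\ln T\big) of Theorem 3.
```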

When the step sizes of (1.5) are of the form {η_t = η(T) : t ∈ ℕ_T}, we can get convergence results in L²_{ρ_X} as well as in H_K.

Theorem 5. Let {η_t = η(T) : t ∈ ℕ_T} and let {f_t : t ∈ ℕ_{T+1}} be produced by (1.5). Then the following statements hold true.

(1) If the step size satisfies

    lim_{T→∞} η(T) T = ∞,  lim_{T→∞} η(T) ln T = 0,    (2.11)

then there holds

    lim_{T→∞} E_{Z^T}[‖f_{T+1} − f_ρ‖²_ρ] = inf_{f∈H_K} ‖f − f_ρ‖²_ρ.    (2.12)

(2) If f_ρ ∈ H_K and the step size satisfies

    lim_{T→∞} η(T) T = ∞,  lim_{T→∞} η²(T) T = 0,    (2.13)

then we have that

    lim_{T→∞} E_{Z^T}[‖f_{T+1} − f_ρ‖²_K] = 0.    (2.14)

Note that, if the step size decays in the form η(T) = O(T^{−θ}) with θ > 0, then hypothesis (2.11) allows the choice θ ∈ (0, 1), while (2.13) requires θ ∈ (1/2, 1). We will prove Theorems 4 and 5 in Section 5. In Section 6, we shall establish the following error rates in L²_{ρ_X} as well as in H_K when the step sizes are of the form {η_t = η(T) : t ∈ ℕ_T}.

Theorem 6. Let {η_t = η : t ∈ ℕ_T} and let {f_t : t ∈ ℕ_{T+1}} be produced by (1.5). If f_ρ ∈ L_K^β(L²_{ρ_X}) for some β > 0 then, by choosing

    η := (2β / (64(1+κ)⁴(2β+1))) T^{−2β/(2β+1)},

we have that

    E_{Z^T}[‖f_{T+1} − f_ρ‖²_ρ] = O(T^{−2β/(2β+1)} ln T).    (2.15)

Moreover, if β > 1/2, then there holds

    E_{Z^T}[‖f_{T+1} − f_ρ‖²_K] = O(T^{−(2β−1)/(2β+1)}).    (2.16)

As the last contribution of this paper, we further improve the preceding error rate [33] for the online regularized algorithm (1.4) with λ > 0. The proof will be given in Section 6.2.

Theorem 7. Let λ > 0 and let μ(·) be given by (2.3). Define {f_t : t ∈ ℕ_{T+1}} by (1.4). Assume that f_ρ ∈ L_K^β(L²_{ρ_X}) with some 0 < β ≤ 1/2. For any 0 < ε < 2β/(2β+1), choose η_t = μ(2β/(2β+1)) t^{−2β/(2β+1)} and λ = T^{−1/(2β+1)+ε/(2β)}. Then there holds

    E_{Z^T}[‖f_{T+1} − f_ρ‖²_ρ] = O(T^{−(2β−ε)/(2β+1)}).    (2.17)

The preceding error rates for the online regularized algorithm (1.4) with λ > 0 were established in [33] under the assumption that f_ρ ∈ L_K^β(L²_{ρ_X}) with some β ∈ (0, 1/2]. Namely, for any arbitrarily small ε > 0, by choosing λ := λ(T) appropriately, there holds

    E_{Z^T}[‖f_{T+1} − f_ρ‖²_ρ] = O(T^{−(2β−ε)/(2(2β+1))})    (2.18)

and, for 1/2 < β ≤ 1, one further has

    E_{Z^T}[‖f_{T+1} − f_ρ‖²_K] = O(T^{−(2β−1−ε)/(2(2β+1))}).    (2.19)

By the Cauchy–Schwarz inequality, we see from (2.17) that, for any arbitrarily small ε > 0, there holds

    E_{Z^T}[‖f_{T+1} − f_ρ‖_ρ] ≤ (E_{Z^T}[‖f_{T+1} − f_ρ‖²_ρ])^{1/2} = O(T^{−(2β−ε)/(2(2β+1))}).

Therefore, the rate (2.17) is much better than (2.18).

We end this section with some remarks. Firstly, the above error rates for algorithm (1.5) are capacity independent: beyond the prior requirement that f_ρ ∈ L_K^β(L²_{ρ_X}) with some β > 0, nothing else is assumed. It is an open problem to improve these rates when additional information is available, such as the regularity of the kernel K or a polynomial decay of the eigenvalues of L_K [3, 7]. Secondly, all the rates are proved with respect to the expectation of the norm, and we do not know how to convert them into similar probabilistic bounds such as those in [3, 5]. Finally, we wish to emphasize that when the step sizes form a polynomially decaying sequence as in Theorem 1, the learning scheme (1.5) is a truly online algorithm. Instead, when we use a constant step size η_t = η(T), algorithm (1.5) is only an informal online algorithm, since knowledge of the sample size is required in advance.

3 Related work

In this section, we discuss and compare our results with related work. A main conclusion of this discussion is that the online algorithm (1.5) is competitive with commonly used least-square learning schemes.

3.1 Cumulative loss analysis

Mistake (regret) bounds for the cumulative loss Σ_{t=1}^T (y_t − f_t(x_t))² of general online algorithms have been well studied in the literature; see e.g. [35] and the references therein. Specifically, mistake bounds have been derived for online density estimation. Section 6 of [7] derived, for a learning algorithm different from (1.5), upper bounds on the relative expected instantaneous loss, measuring the prediction ability of the last output in the linear regression problem. The online algorithm studied in the linear regression setting that is closest to our algorithm (1.5) discussed how the choice of the learning rate affects the bound for the cumulative loss. Related mistake bounds in this setting can also be found in [35]. In [9], the authors proposed general online regularized algorithms with kernels and presented cumulative loss bounds for them. For a more detailed review of mistake bounds in this direction, one can refer to [9, Section 5]. There, generalization bounds for the average prediction were also derived from the cumulative loss bounds for certain algorithms in RKHSs. More precisely, for each t ∈ ℕ, let H_t : X → ℝ be the function produced by a prediction algorithm when fed with the independent data {(x_j, y_j) : j ∈ ℕ_t}. Consider the average prediction H̄_{T+1} = (1/(T+1)) Σ_{t=1}^{T+1} H_t, and let D ∈ H_K with D(x) ∈ Y := [−M, M] for all x ∈ X. It was shown [9, Corollary] that there exists a prediction algorithm such that, for any δ > 0 and T ∈ ℕ, with probability at least 1 − δ, there holds

    E(H̄_{T+1}) ≤ E(D) + (M(κ + M)(‖D‖_K + 1) + M² √(2 ln(2/δ))) / √(T + 1).

The main difference of our generalization bounds (2.4) and (2.10) from the above one is that our upper bounds require no assumptions on comparison functions such as D.

Soon after Proposition 1 in Section 4.1, we shall discuss an appealing relationship between the statistical generalization error of f_{T+1} and a weighted modification of the cumulative loss Σ_{t=1}^T (y_t − f_t(x_t))². However, it remains to be understood whether standard cumulative loss bounds for algorithm (1.5) can be directly applied to the study of its generalization error.

3.2 Comparison with other regularization schemes

Although deriving cumulative loss bounds for the online gradient descent algorithm (1.4) is very useful, it is also important to further understand the statistical behavior of f_{T+1}. The performance of f_{T+1} in the H_K-norm, when f_{T+1} is given by the online regularized algorithm (1.4) with λ > 0, has also been studied. More general online regularized schemes of the form (1.4), involving commonly used loss functions in classification and regression, were discussed in [33]. Using quite different methods from ours, [35] obtained generalization bounds similar to (2.10) for the online gradient descent algorithm associated with uniformly Lipschitz loss functions, linear kernels and constant step sizes (learning rates). In what follows, we compare our rates for algorithm (1.5) with the state-of-the-art ones for other least-square learning algorithms under the same hypothesis that f_ρ ∈ L_K^β(L²_{ρ_X}) with β > 0.

Firstly, we compare with the rates for the online regularized algorithms. Here we note that the rates (2.9), (2.15) and (2.16) of algorithm (1.5) are comparable to the rates (2.17) and (2.19) of the online regularized algorithm (1.4).

Secondly, we compare with the rates of the averaged stochastic gradient descent algorithms introduced in [35]. This class of online algorithms and the derived error bounds are stated in the linear kernel setting; however, they can easily be extended to the general kernel setting as follows. Assume κ²η_t < 1 for t ∈ ℕ_T, let r_1 = 0 and f̂_1 = 0, and let {f_t : t ∈ ℕ_{T+1}} be defined by (1.5). For any t ∈ ℕ_T, we update

    r_{t+1} = r_t + η_t / (1 − κ²η_t),  f̂_{t+1} = (r_t / r_{t+1}) f̂_t + ((r_{t+1} − r_t) / r_{t+1}) f_t.

Then the generalization bound (see [35, Theorem 5]) for the average output f̂_{T+1} in the kernel setting can be cast as

    E_{Z^T}[E(f̂_{T+1})] ≤ (1 + κ²σ²_{T+1}/r_{T+1}) inf_{f∈H_K} { E(f) + ‖f‖²_K / r_{T+1} },    (3.1)

where σ²_{T+1} := Σ_{t∈ℕ_T} η²_t. Note that E(f) − E(f_ρ) = ‖f − f_ρ‖²_ρ for any f ∈ L²_{ρ_X}. In view of the regularization error

    D(λ) := inf_{f∈H_K} { ‖f − f_ρ‖²_ρ + λ‖f‖²_K },    (3.2)

the bound (3.1) implies that E_{Z^T}[‖f̂_{T+1} − f_ρ‖²_ρ] is bounded by

    (1 + κ²σ²_{T+1}/r_{T+1}) D(1/(r_{T+1} + κ²σ²_{T+1})) + (κ²σ²_{T+1}/r_{T+1}) E(f_ρ).    (3.3)

When the step sizes are of the form η_t = (1/(4(1+κ²))) t^{−θ} with θ ∈ (0, 1), it is not hard to observe that r_{T+1} = O(T^{1−θ}), that σ²_{T+1}/r_{T+1} = O(T^{−min{θ,1−θ}}) for θ ≠ 1/2, and that σ²_{T+1}/r_{T+1} = O(T^{−1/2} ln T) for θ = 1/2. Furthermore, if f_ρ ∈ L_K^β(L²_{ρ_X}) with some β ∈ (0, 1/2], then we know from [5, Lemma 3] that

    D(λ) ≤ λ^{2β} ‖L_K^{−β} f_ρ‖²_ρ,    (3.4)

which implies that

    D(1/(r_{T+1} + κ²σ²_{T+1})) = O(T^{−2β(1−θ)}).    (3.5)

Therefore, if f_ρ ∈ L_K^β(L²_{ρ_X}) with β ∈ (0, 1/2], choosing θ = 2β/(2β+1) and putting the above observations for r_{T+1}, σ²_{T+1}/r_{T+1} and (3.5) into (3.3) yields

    E_{Z^T}[‖f̂_{T+1} − f_ρ‖²_ρ] = O(T^{−1/2} ln T) for β = 1/2, and O(T^{−2β/(2β+1)}) for β ∈ (0, 1/2).    (3.6)

Similarly, if f_ρ ∈ L_K^β(L²_{ρ_X}) with β ∈ (0, 1/2] and the step sizes have the constant form {η_t = η : t ∈ ℕ_T} then, by choosing η = (1/(4(1+κ²))) T^{−2β/(2β+1)}, we see from (3.3) and (3.4) that

    E_{Z^T}[‖f̂_{T+1} − f_ρ‖²_ρ] = O(T^{−2β/(2β+1)}).    (3.7)

Based on the above discussion, we see that the rates (2.9) and (2.15) for algorithm (1.5) are the same, up to a logarithmic factor, as the corresponding rates (3.6) and (3.7) for the averaged stochastic gradient descent algorithm.
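The averaging recursion above can be sketched as follows. The kernel, data and step-size constants are illustrative assumptions; only the weight recursion r_{t+1} = r_t + η_t/(1 − κ²η_t) and the convex averaging of iterates are taken from the display above.

```python
import numpy as np

def averaged_ogd(xs_data, ys_data, eta, kappa=1.0, kernel=None):
    """Run the unregularized recursion and maintain the weighted average
    fhat_{t+1} = (r_t/r_{t+1}) fhat_t + ((r_{t+1}-r_t)/r_{t+1}) f_t."""
    if kernel is None:
        kernel = lambda x, xp: np.exp(-(x - xp) ** 2 / 0.1)   # kappa = 1 for this kernel
    T = len(xs_data)
    c = np.zeros(T)        # kernel-expansion coefficients of the iterate f_t
    c_avg = np.zeros(T)    # coefficients of the averaged predictor fhat_t
    r = 0.0
    for t in range(T):
        e = eta(t + 1)
        assert kappa ** 2 * e < 1.0                 # standing assumption kappa^2 * eta_t < 1
        r_next = r + e / (1.0 - kappa ** 2 * e)     # r_{t+1} = r_t + eta_t/(1 - kappa^2 eta_t)
        w = (r_next - r) / r_next
        c_avg = (1.0 - w) * c_avg + w * c           # fold the current iterate f_t into the average
        f_xt = sum(c[j] * kernel(xs_data[j], xs_data[t]) for j in range(t))
        c[t] = -e * (f_xt - ys_data[t])             # gradient step producing f_{t+1}
        r = r_next
    return c_avg

xs_data = np.linspace(0.0, 1.0, 100)
ys_data = np.sin(2 * np.pi * xs_data)
c_avg = averaged_ogd(xs_data, ys_data, eta=lambda t: 0.5 * t ** -0.5)
```

The averaged predictor is then f̂(x) = Σ_j c_avg[j] K(x_j, x), a convex combination of the online iterates.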

Thirdly, we move on to the comparison with the Tikhonov regularization algorithm (see e.g. [6]) in batch learning, which is defined by

    f_{z,λ} := arg inf_{f∈H_K} { (1/T) Σ_{t=1}^T (f(x_t) − y_t)² + λ‖f‖²_K }.    (3.8)

The capacity independent generalization bounds for this algorithm in [34] can be expressed as

    E_{Z^T}[E(f_{z,λ})] ≤ (1 + 2κ²/(Tλ))² inf_{f∈H_K} { E(f) + λ‖f‖²_K }.

In terms of the regularization error D(λ), this can equivalently be stated as

    E_{Z^T}[‖f_{z,λ} − f_ρ‖²_ρ] ≤ D(λ) + (E(f_ρ) + D(λ)) { 4κ²/(Tλ) + (2κ²/(Tλ))² }.    (3.9)

If f_ρ ∈ L_K^β(L²_{ρ_X}) with β ∈ (0, 1/2], we know from (3.4) that D(λ) ≤ λ^{2β}‖L_K^{−β}f_ρ‖²_ρ. Putting this into (3.9) and trading off T and λ, for f_ρ ∈ L_K^β(L²_{ρ_X}) with some β ∈ (0, 1/2], the choice λ = T^{−1/(2β+1)} gives the rate

    E_{Z^T}[‖f_{z,λ} − f_ρ‖²_ρ] = O(T^{−2β/(2β+1)}).    (3.10)

When f_ρ ∈ L_K^β(L²_{ρ_X}) with β ∈ (1/2, 1], it was further shown in [5], by choosing λ = λ(T) appropriately, that

    E_{Z^T}[‖f_{z,λ} − f_ρ‖²_ρ] = O(T^{−2β/(2β+1)})    (3.11)

and

    E_{Z^T}[‖f_{z,λ} − f_ρ‖²_K] = O(T^{−(2β−1)/(2β+1)}).    (3.12)

Under the same assumptions, the rates (2.9), (2.15) and (2.16) of the online algorithm (1.5) are almost the same as the corresponding rates (3.10), (3.11) and (3.12) in batch learning.

Finally, we end this subsection with the remark that generalization bounds stated in terms of D(λ), such as (3.9), are quite neat and interesting, but they tend to suffer from a saturation phenomenon. (The rates above are originally stated in the form of probability inequalities; we employ their expectation versions for consistency with the other bounds presented in this paper.)

To explain this phenomenon, observe that

    D(λ) = inf_{f∈H_K} { ‖f − f_ρ‖²_ρ + λ‖f‖²_K } ≥ inf_{f∈H_K} { ‖f − f_ρ‖²_ρ + (λ/κ²)‖f‖²_ρ },

since ‖f‖_ρ ≤ κ‖f‖_K. Because ‖f − f_ρ‖_ρ ≥ |‖f‖_ρ − ‖f_ρ‖_ρ|, by letting t = ‖f‖_ρ we have

    D(λ) ≥ inf_{t∈ℝ} { (t − ‖f_ρ‖_ρ)² + (λ/κ²)t² } = (λ/(κ² + λ)) ‖f_ρ‖²_ρ.    (3.13)

If f_ρ is not identically zero (i.e. ‖f_ρ‖_ρ > 0), inequality (3.13) implies that the optimal decay of D(λ) is O(λ) as λ → 0+, which, by (3.4), is achieved when f_ρ ∈ L_K^{1/2}(L²_{ρ_X}) = H_K. Equivalently, the decay of D(λ) is never faster than O(λ) as λ → 0+, even if f_ρ lies in a space L_K^β(L²_{ρ_X}) with β > 1/2, smaller than H_K. Therefore, in this case the consequent trade-off rate (3.10) derived from (3.9) is at most O(T^{−1/2}) as T → ∞, which is achieved at β = 1/2 and cannot be improved beyond the range β ∈ (0, 1/2]. This is the so-called saturation phenomenon in the context of inverse problems [5]. For similar reasons, the rates (3.6) and (3.7) derived from (3.3) for the averaged stochastic gradient descent algorithm have the same problem. In contrast, our capacity independent rate (2.15) for algorithm (1.5) does not suffer from this drawback and is arbitrarily close to T^{−1} as β → ∞.

3.3 Capacity independent optimality

In this subsection, we explain in two ways that the error rates for algorithm (1.5) are optimal, up to a logarithmic factor, in the capacity independent sense. We first illustrate this by citing the lower bounds in [7] for f_ρ ∈ L_K^β(L²_{ρ_X}) with β ∈ (1/2, 1]. We then work out a specific example to contrast our rates with the best possible rate in the non-parametric statistical literature (see e.g. [6, 7]).

We begin with the lower bound in [7]. Consider f_ρ ∈ L_K^β(L²_{ρ_X}) with β ∈ (1/2, 1], and suppose the eigenvalues (arranged in decreasing order) of L_K decay as λ_i = O(i^{−b}) with b > 1. It was proved in [7] that the rate T^{−2βb/(2βb+1)} in L²_{ρ_X} is then optimal; see [7] for a precise definition of optimal rates. Since K(x, x') = Σ_{i∈ℕ} λ_i φ_i(x)φ_i(x') for any x, x' ∈ X and ∫_X φ_i(x)² dρ_X(x) = 1 for any i ∈ ℕ, it follows that Σ_{i=1}^∞ λ_i = ∫_X K(x, x) dρ_X(x) ≤ κ², see e.g. [1]. Therefore, for any i ∈ ℕ we have

that iλ_i ≤ Σ_{j∈ℕ_i} λ_j ≤ κ², which implies that λ_i = O(i^{−1}) for all kernels. Since our capacity independent rates do not depend on the eigenvalues of L_K, taking b → 1 in the rate T^{−2βb/(2βb+1)} leads to the eigenvalue-independent optimal rate T^{−2β/(2β+1)} (see also [3] for a discussion). In this sense, our rates (2.15) and (2.17) are almost optimal in the capacity independent case under the condition that f_ρ ∈ L_K^β(L²_{ρ_X}) with β ∈ (1/2, 1].

Now we illustrate by a specific example that our rates for algorithm (1.5) are comparable to the optimal rate in non-parametric statistics. For simplicity, we only consider smoothing splines [30] in one dimension. Let X = [0, 1] and dρ_X = dx. For smoothing splines, the reproducing kernel space can be regarded as a fractional Sobolev space H^s[0, 1] with s > 1/2, defined in terms of Fourier coefficients by

    H^s[0, 1] := { f = a₀ + Σ_{k∈ℕ} k^{−s}(a_k sin(2πkx) + b_k cos(2πkx)) : ‖f‖²_s := a₀² + Σ_{k∈ℕ}(a_k² + b_k²) < ∞ }.

It is easy to see that, for any x, x' ∈ X, the reproducing kernel of the space H^s[0, 1] can be represented as

    K_s(x, x') = 1 + Σ_{k∈ℕ} k^{−2s} cos(2πk(x − x')).

In addition, λ₁ = 1, λ_{2j} = λ_{2j+1} = j^{−2s} for j ∈ ℕ, and φ₁(x) = 1, φ_{2j}(x) = sin(2πjx), φ_{2j+1}(x) = cos(2πjx) for j ∈ ℕ. Therefore, in this case the range space L_{K_s}^β(L²_{dx}) with β > 0 defined by (2.7) is identical to the Sobolev space H^{2βs}[0, 1]. It is well known (e.g. [7]) in non-parametric statistics that the rate O(T^{−4βs/(4βs+1)}) is optimal for f_ρ ∈ L_{K_s}^β(L²_{dx}) = H^{2βs}[0, 1]. Therefore, if dρ_X = dx, f_ρ ∈ L_{K_s}^β(L²_{dx}) and H_{K_s} = H^s[0, 1] with s > 1/2, then our rate O(T^{−2β/(2β+1)} ln T) for algorithm (1.5), given by (2.15), is suboptimal; but it is arbitrarily close to the best one, O(T^{−4βs/(4βs+1)}), as s → 1/2+.

4 Error decomposition and basic estimates

In this section, we formulate our basic ideas by providing a novel approach to the error analysis of online gradient descent algorithms in L²_{ρ_X}. As

mentioned in the introduction, the main purpose of this paper is to analyze the online algorithm (1.5). However, we would like to state a unified approach for the general online algorithm (1.4) with any λ ≥ 0, since this will, as a byproduct, allow us to improve the preceding error rate in [33] for algorithm (1.4) with λ > 0, as proved in Section 6.2 below.

We first establish some useful observations. For λ > 0, define the regularizing function by

    f_λ := arg inf_{f∈H_K} { E(f) + λ‖f‖²_K }.    (4.1)

Lemma 1. Let λ > 0 and let f_λ be defined as above. Then we have that

    L_K(f_λ − f_ρ) + λ f_λ = 0    (4.2)

and

    ‖f_λ − f_ρ‖_ρ ≤ ‖f_ρ‖_ρ.    (4.3)

Proof. The first equality is well known; see e.g. [1]. For the inequality (4.3), we know from the equality (1.6) that (4.1) is equivalent to f_λ = arg inf_{f∈H_K} { ‖f − f_ρ‖²_ρ + λ‖f‖²_K }. Taking f = 0 in the definition of f_λ yields the desired estimate ‖f_λ − f_ρ‖²_ρ ≤ ‖0 − f_ρ‖²_ρ = ‖f_ρ‖²_ρ. ∎

With (4.2) at hand, we can rewrite the online algorithm (1.4) with λ ≥ 0 in the following useful form. With a slight abuse of notation, for λ = 0 we write f_λ = f_0 := f_ρ.

Lemma 2. Let λ ≥ 0 and let {f_t : t ∈ ℕ_{T+1}} be defined by (1.4). Then, for any t ∈ ℕ_T we have that

    f_{t+1} − f_λ = (I − η_t(L_K + λI))(f_t − f_λ) + η_t B(f_t, z_t),    (4.4)

where I is the identity operator and the vector-valued random variable B(f_t, z_t) is defined by

    B(f_t, z_t) := L_K(f_t − f_ρ) + (y_t − f_t(x_t)) K_{x_t}.    (4.5)

Proof. The equality (4.4) for the case λ = 0 is easily verified, since we have set f_0 = f_ρ.

For the case λ > 0, we use the property (4.2) of the regularizing function f_λ in Lemma 1. By the definition of f_{t+1} given by (1.4) with λ > 0, we know that

    f_{t+1} − f_λ = f_t − f_λ − η_t(f_t(x_t) − y_t)K_{x_t} − η_t λ f_t
                 = (I − η_t(L_K + λI))(f_t − f_λ) + η_t L_K(f_t − f_λ) − η_t λ f_λ + η_t(y_t − f_t(x_t))K_{x_t}.

But the equality (4.2) tells us that λ f_λ = −L_K(f_λ − f_ρ). Putting this back into the above equation and rearranging yields the desired equality (4.4). ∎

With the help of the formula (4.4), we can describe our approach in three steps, the first of which is referred to as the error decomposition.

4.1 Error decomposition

For λ ≥ 0 and t ∈ ℕ, set the operator ω_k^t(L_K + λI) := Π_{j=k}^t (I − η_j(L_K + λI)) for k ∈ ℕ_t, and ω_{t+1}^t(L_K + λI) := I. Applying induction to the equality (4.4) and noting that f_1 = 0, for any t ∈ ℕ_T we have

    f_{t+1} − f_λ = −ω_1^t(L_K + λI) f_λ + Σ_{j∈ℕ_t} η_j ω_{j+1}^t(L_K + λI) B(f_j, z_j).    (4.6)

Applying (4.6) with t = T, we get the error decomposition

    f_{T+1} − f_λ = −ω_1^T(L_K + λI) f_λ + Σ_{t∈ℕ_T} η_t ω_{t+1}^T(L_K + λI) B(f_t, z_t).    (4.7)

The above error decomposition technique is well known in statistical learning theory as a means of carrying out the error analysis of least-square related learning algorithms, see e.g. [3, 33]. Similar ideas have been used for different learning schemes, see e.g. [4, 5] and the references therein.

We will use the error decomposition (4.7) to estimate the expectation of ‖f_{T+1} − f_ρ‖²_ρ. To this end, we need some preparation. For a linear operator L from L²_{ρ_X} to itself, we use ‖L‖ to denote its operator norm, that is, ‖L‖ := sup_{‖f‖_ρ=1} ‖Lf‖_ρ. We also need the following technical lemma, which provides the essential estimates of our approach.

Lemma 3. Let λ ≥ 0, β > 0 and η_t(κ² + λ) ≤ 1 for every integer t ∈ [j, k]. Then there holds

    ‖ω_j^k(L_K + λI) L_K^β‖ ≤ 2((β/e)^β + κ^{2β}) exp{−λ Σ_{t=j}^k η_t} / (1 + (Σ_{t=j}^k η_t)^β).

Proof. Note that, for any x, x' ∈ X,

    K(x, x') = ⟨K_x, K_{x'}⟩_K ≤ ‖K_x‖_K ‖K_{x'}‖_K ≤ κ².    (4.8)

Combining this with the fact that ρ_X(X) = 1, we see from the Cauchy–Schwarz inequality that, for any f ∈ L²_{ρ_X},

    ‖L_K f‖²_ρ = ∫_X |∫_X K(x, x') f(x') dρ_X(x')|² dρ_X(x)
               ≤ ∫_X (∫_X K(x, x')² dρ_X(x')) (∫_X f(x')² dρ_X(x')) dρ_X(x)
               = ‖f‖²_ρ ∫_X ∫_X K(x, x')² dρ_X(x') dρ_X(x) ≤ κ⁴ ‖f‖²_ρ.    (4.9)

Hence ‖L_K‖ ≤ κ². Since {λ_l : l ∈ ℕ} are the eigenvalues of L_K, it follows that sup_l λ_l ≤ κ² and

    ‖L_K^β‖ := sup_{l∈ℕ} λ_l^β ≤ κ^{2β}.    (4.10)

Moreover, for any 0 ≤ x ≤ (κ² + λ)^{−1},

    ‖I − x(L_K + λI)‖ = sup_{l∈ℕ} (1 − x(λ_l + λ)) ≤ 1 − xλ.    (4.11)

Applying (4.11) with x = η_t iteratively for t ∈ [j, k], together with (4.10), we have that

    ‖ω_j^k(L_K + λI) L_K^β‖ ≤ Π_{t=j}^k ‖I − η_t(L_K + λI)‖ ‖L_K^β‖ ≤ κ^{2β} Π_{t=j}^k (1 − η_t λ) ≤ exp{−λ Σ_{t=j}^k η_t} κ^{2β},    (4.12)

where the elementary inequality 1 − η_t λ ≤ e^{−η_t λ}, valid for any t ∈ [j, k], is used in the last step.

Also,

    ‖ω_j^k(L_K + λI) L_K^β‖ = sup_{l∈ℕ} Π_{t=j}^k (1 − η_t(λ_l + λ)) λ_l^β
        ≤ sup_{l∈ℕ} exp{−(λ + λ_l) Σ_{t=j}^k η_t} λ_l^β
        ≤ exp{−λ Σ_{t=j}^k η_t} sup_{x≥0} e^{−x Σ_{t=j}^k η_t} x^β
        = exp{−λ Σ_{t=j}^k η_t} (β/e)^β (Σ_{t=j}^k η_t)^{−β}.

Consequently, the above inequality and (4.12) imply that ‖ω_j^k(L_K + λI) L_K^β‖ is bounded by

    exp{−λ Σ_{t=j}^k η_t} ((β/e)^β + κ^{2β}) min{1, (Σ_{t=j}^k η_t)^{−β}}.

Combining this with the elementary inequality min{a, b} ≤ 2ab/(a + b) for any a, b > 0 finishes the proof of the lemma. ∎

Now we present the upper bound for E_{Z^T}[‖f_{T+1} − f_ρ‖²_ρ] which is the foundation of the approach introduced in this paper. Hereafter, we adopt the convention that Σ_{j=l+1}^l η_j = 0 for any l ∈ ℕ.

Proposition 1. Let λ ≥ 0 and let {f_t : t ∈ ℕ} be defined by (1.4). Then, for any T ∈ ℕ, E_{Z^T}[‖f_{T+1} − f_ρ‖²_ρ] is bounded by

    ‖f_λ − f_ρ‖²_ρ + ‖ω_1^T(L_K + λI) f_λ‖²_ρ + 2‖ω_1^T(L_K + λI) f_λ‖_ρ ‖f_λ − f_ρ‖_ρ
      + (1 + κ)⁴ Σ_{t∈ℕ_T} η_t² (exp{−λ Σ_{j=t+1}^T η_j} / (1 + Σ_{j=t+1}^T η_j)) E_{Z^t}[E(f_t)].    (4.13)

Proof. Since f_{T+1} − f_ρ = (f_{T+1} − f_λ) + (f_λ − f_ρ), there holds

    E_{Z^T}[‖f_{T+1} − f_ρ‖²_ρ] = E_{Z^T}[‖f_{T+1} − f_λ‖²_ρ] + ‖f_λ − f_ρ‖²_ρ + 2E_{Z^T}[⟨f_{T+1} − f_λ, f_λ − f_ρ⟩_ρ].    (4.14)

In order to use (4.14) to prove (4.13), we need to estimate the first and the last terms on its right-hand side.

For the last term on the right-hand side of (4.14), observe that the data are i.i.d. and f_t depends only on {z_1, ..., z_{t−1}}, not on z_t. Moreover, recall the definition of the regression function: f_ρ(x) = ∫_Y y dρ(y|x). Thus, B(f_t, z_t) has a convenient vanishing property:

E_{z_t}[B(f_t, z_t)] = 0,  ∀ t ∈ IN.    (4.15)

Applying (4.15) to the equality (4.7) implies that E_{Z^T}[f_{T+1} − f_λ] = −ω_1^T(L_K + λI) f_λ. Therefore,

E_{Z^T}[⟨f_{T+1} − f_λ, f_λ − f_ρ⟩_ρ] = ⟨E_{Z^T}[f_{T+1} − f_λ], f_λ − f_ρ⟩_ρ ≤ ‖ω_1^T(L_K + λI) f_λ‖_ρ ‖f_λ − f_ρ‖_ρ.    (4.16)

For the first term on the right-hand side of (4.14), we first use the equality (4.7) to get that

E_{Z^T}[‖f_{T+1} − f_λ‖²_ρ] = ‖ω_1^T(L_K + λI) f_λ‖²_ρ + E_{Z^T}[‖Σ_{t=1}^T η_t ω_{t+1}^T(L_K + λI) B(f_t, z_t)‖²_ρ]
  − 2 E_{Z^T}[⟨Σ_{t=1}^T η_t ω_{t+1}^T(L_K + λI) B(f_t, z_t), ω_1^T(L_K + λI) f_λ⟩_ρ].    (4.17)

Below, we estimate the last two terms on the right-hand side of (4.17) separately.

For the second term on the right-hand side of (4.17), we write it as

Σ_{t, t' ∈ IN_T} η_t η_{t'} E_{Z^T}[⟨ω_{t+1}^T(L_K + λI) B(f_t, z_t), ω_{t'+1}^T(L_K + λI) B(f_{t'}, z_{t'})⟩_ρ].

Therefore, by (4.15), for t' > t,

E_{Z^{t'}}[⟨ω_{t+1}^T(L_K + λI) B(f_t, z_t), ω_{t'+1}^T(L_K + λI) B(f_{t'}, z_{t'})⟩_ρ]
  = E_{Z^{t'−1}}[⟨ω_{t+1}^T(L_K + λI) B(f_t, z_t), ω_{t'+1}^T(L_K + λI) E_{z_{t'}}[B(f_{t'}, z_{t'})]⟩_ρ] = 0.

By the symmetry of t and t', the above equality also holds true for t > t'.
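As an aside, the cross-term cancellation just used is the standard orthogonality of martingale-difference increments: if ξ_t has zero mean given the past, then E‖Σ_t a_t ξ_t‖² = Σ_t a_t² E‖ξ_t‖². A small exact check with hypothetical sign-flip increments (these stand in for the B(f_t, z_t) terms; they are not derived from the algorithm):

```python
import itertools
import numpy as np

rng = np.random.default_rng(0)
T, dim = 8, 5
v = rng.standard_normal((T, dim))     # fixed directions
a = rng.uniform(0.1, 1.0, T)          # weights playing the role of eta_t * omega_{t+1}^T

# Increments xi_t = s_t * v_t with independent uniform signs s_t = +/-1 (zero mean).
# Exact expectation of ||sum_t a_t xi_t||^2 by enumerating all 2^T sign patterns.
lhs = np.mean([np.sum((np.asarray(signs) @ (a[:, None] * v))**2)
               for signs in itertools.product((-1.0, 1.0), repeat=T)])
rhs = float(np.sum(a**2 * np.sum(v**2, axis=1)))   # sum_t a_t^2 ||v_t||^2
print(lhs, rhs)  # equal: the cross terms involving E[s_t s_t'] vanish for t != t'
```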

Consequently, it follows that

E_{Z^T}[‖Σ_{t=1}^T η_t ω_{t+1}^T(L_K + λI) B(f_t, z_t)‖²_ρ] = Σ_{t=1}^T η_t² E_{Z^t}[‖ω_{t+1}^T(L_K + λI) B(f_t, z_t)‖²_ρ]
  ≤ Σ_{t=1}^T η_t² ‖ω_{t+1}^T(L_K + λI) L_K^{1/2}‖² E_{Z^t}[‖L_K^{−1/2} B(f_t, z_t)‖²_ρ]
  = Σ_{t=1}^T η_t² ‖ω_{t+1}^T(L_K + λI) L_K^{1/2}‖² E_{Z^t}[‖B(f_t, z_t)‖²_K],    (4.18)

where we have used equation (1.8) in the last equality. To estimate E_{Z^t}[‖B(f_t, z_t)‖²_K], we note from (4.15) that E_{z_t}[(f_t(x_t) − y_t) K_{x_t}] = L_K(f_t − f_ρ), and thus we can rewrite E_{z_t}[‖B(f_t, z_t)‖²_K] as

E_{z_t}[‖(f_t(x_t) − y_t) K_{x_t}‖²_K] − ‖E_{z_t}[(f_t(x_t) − y_t) K_{x_t}]‖²_K.    (4.19)

Therefore,

E_{Z^t}[‖B(f_t, z_t)‖²_K] ≤ E_{Z^t}[E_{z_t}[‖(f_t(x_t) − y_t) K_{x_t}‖²_K]] ≤ κ² E_{Z^t}[E_{z_t}[(f_t(x_t) − y_t)²]] = κ² E_{Z^t}[E(f_t)].    (4.20)

In addition, applying Lemma 3 with β = 1/2 and j = t + 1, k = T implies that

‖ω_{t+1}^T(L_K + λI) L_K^{1/2}‖² ≤ 2(1 + κ)² exp{−λ Σ_{j=t+1}^T η_j} / (1 + Σ_{j=t+1}^T η_j).

Consequently, plugging this and (4.20) into (4.18) yields that

E_{Z^T}[‖Σ_{t=1}^T η_t ω_{t+1}^T(L_K + λI) B(f_t, z_t)‖²_ρ] ≤ 2(1 + κ)⁴ Σ_{t=1}^T η_t² [exp{−λ Σ_{j=t+1}^T η_j} / (1 + Σ_{j=t+1}^T η_j)] E_{Z^t}[E(f_t)].    (4.21)

For the last term on the right-hand side of (4.17), we use (4.15) to get that

E_{Z^T}[⟨Σ_{t=1}^T η_t ω_{t+1}^T(L_K + λI) B(f_t, z_t), ω_1^T(L_K + λI) f_λ⟩_ρ]
  = Σ_{t=1}^T η_t ⟨ω_{t+1}^T(L_K + λI) E_{Z^{t−1}}[E_{z_t}[B(f_t, z_t)]], ω_1^T(L_K + λI) f_λ⟩_ρ = 0.

Substituting this and (4.21) into (4.17) yields that

E_{Z^T}[‖f_{T+1} − f_λ‖²_ρ] ≤ ‖ω_1^T(L_K + λI) f_λ‖²_ρ + 2(1 + κ)⁴ Σ_{t=1}^T η_t² [exp{−λ Σ_{j=t+1}^T η_j} / (1 + Σ_{j=t+1}^T η_j)] E_{Z^t}[E(f_t)].    (4.22)

Combining this with (4.16) and (4.14) completes the upper bound (4.13).

Now the error decomposition and Proposition 1 reduce the goal of our error analysis to the estimation of the four terms in (4.13). The first three terms of (4.13) are frequently referred to as the approximation error [4, 5]. Since the data are i.i.d. and f_t does not depend on z_t, the last term of (4.13) can be rewritten as

2(1 + κ)⁴ E_{Z^T}[Σ_{t=1}^T η_t² [exp{−λ Σ_{j=t+1}^T η_j} / (1 + Σ_{j=t+1}^T η_j)] (y_t − f_t(x_t))²].

It can be regarded as the expectation of a weighted form of the cumulative loss Σ_t (y_t − f_t(x_t))² from the online learning literature [8, 0, 9]. Thus, we call this last term of (4.13) the cumulative sample error. We estimate these two errors separately in the following two subsections, which constitute the second and last steps of our approach.

4.2 Estimates for the approximation error

Here, we establish some basic estimates for the deterministic approximation errors involving ‖f_λ − f_ρ‖_ρ and ‖ω_1^T(L_K + λI) f_λ‖_ρ, which is the second step of our approach. To estimate the term ‖f_λ − f_ρ‖_ρ, we only need to consider the case λ > 0, since f_0 = f_ρ. For this purpose, recall Lemma 3 in [5].

Lemma 4. Let λ > 0 and f_λ be defined by (4.1). If f_ρ ∈ L_K^β(L²_{ρ_X}) with 0 < β ≤ 1, then there holds:

(a) ‖f_λ − f_ρ‖_ρ ≤ λ^β ‖L_K^{−β} f_ρ‖_ρ;

(b) if moreover 1/2 < β ≤ 1, then ‖f_λ − f_ρ‖_K ≤ λ^{β−1/2} ‖L_K^{−β} f_ρ‖_ρ.

Using the above lemma, we can estimate the quantity ‖ω_1^T(L_K + λI) f_λ‖_ρ as follows.

Lemma 5. Let λ ≥ 0 and η_t(κ² + λ) ≤ 1 for each t ∈ IN_T. Then the following statements hold true.

(a) If λ = 0, then ‖ω_1^T(L_K) f_ρ‖_ρ ≤ K(2(1 + κ)(Σ_{t=1}^T η_t)^{−1/2}, f_ρ);

(b) if λ > 0, then ‖ω_1^T(L_K + λI) f_λ‖_ρ ≤ exp{−λ Σ_{t=1}^T η_t} ‖f_ρ‖_ρ.

Proof. We first prove property (a). Since η_t κ² ≤ 1 for t ∈ IN_T, using (4.11) with λ = 0 and x = η_t for t ∈ IN_T iteratively implies that ‖ω_1^T(L_K)‖ ≤ ∏_{t=1}^T ‖I − η_t L_K‖ ≤ 1. Thus, for any f ∈ H_K there holds

‖ω_1^T(L_K) f_ρ‖_ρ ≤ ‖ω_1^T(L_K)(f − f_ρ)‖_ρ + ‖ω_1^T(L_K) f‖_ρ
  ≤ ‖f − f_ρ‖_ρ + ‖ω_1^T(L_K) L_K^{1/2}‖ ‖L_K^{−1/2} f‖_ρ
  = ‖f − f_ρ‖_ρ + ‖ω_1^T(L_K) L_K^{1/2}‖ ‖f‖_K,    (4.23)

where (1.8) is used in the last equality. Applying Lemma 3 with λ = 0, β = 1/2, j = 1, and k = T implies that ‖ω_1^T(L_K) L_K^{1/2}‖ ≤ 2(1 + κ)(Σ_{t=1}^T η_t)^{−1/2}. Then, substituting this into the right-hand side of (4.23) yields that

‖ω_1^T(L_K) f_ρ‖_ρ ≤ inf_{f ∈ H_K} {‖f − f_ρ‖_ρ + 2(1 + κ)(Σ_{t=1}^T η_t)^{−1/2} ‖f‖_K},    (4.24)

which is exactly property (a) by the definition of the K-functional.

Turning to the proof of property (b), note that η_t(κ² + λ) ≤ 1 for any t ∈ IN_T. Thus, applying (4.11) again with x = η_t for t ∈ IN_T implies that

‖ω_1^T(L_K + λI) f_λ‖_ρ ≤ ∏_{t=1}^T (1 − η_t λ) ‖f_λ‖_ρ ≤ exp{−λ Σ_{t=1}^T η_t} ‖f_λ‖_ρ.

But, by (4.3), ‖f_λ‖_ρ ≤ ‖f_ρ‖_ρ, which finishes property (b).

The last step of our approach is to estimate the cumulative sample error.
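Before estimating the cumulative sample error in the next subsection, its step-size weight can be tabulated numerically. The sketch below uses illustrative choices (λ = 0, so the exponential weight equals 1; θ ∈ {0.25, 0.5, 0.75}; μ = 2) and shows the weighted sum Σ_{t≤T} η_t²/(1 + Σ_{j=t+1}^T η_j) decaying as T grows:

```python
import numpy as np

def weighted_sum(T, theta, mu=2.0, lmbda=0.0):
    # sum_{t<=T} eta_t^2 exp(-lmbda * tail_t) / (1 + tail_t), tail_t = sum_{j=t+1}^T eta_j
    eta = np.arange(1, T + 1, dtype=float)**(-theta) / mu
    tail = np.concatenate([np.cumsum(eta[::-1])[-2::-1], [0.0]])
    return float(np.sum(eta**2 * np.exp(-lmbda * tail) / (1.0 + tail)))

results = {theta: [weighted_sum(T, theta) for T in (40, 400, 4000)]
           for theta in (0.25, 0.5, 0.75)}
for theta, vals in results.items():
    print(theta, vals)   # each row decreases as T grows
```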

4.3 Estimates for the cumulative sample error

In this subsection, we describe basic estimates for the cumulative sample error. Observe that the cumulative sample error (the last term in (4.13)) can be bounded by

2(1 + κ)⁴ sup_{t ∈ IN_T} E_{Z^t}[E(f_t)] × Σ_{t ∈ IN_T} η_t² exp{−λ Σ_{j=t+1}^T η_j} / (1 + Σ_{j=t+1}^T η_j).    (4.25)

Therefore, it remains to estimate separately the summation involving the step sizes and the uniform bounds for the learning sequence. Let us first deal with the summation.

Lemma 6. Let λ ≥ 0, θ ∈ [0, 1) and μ ≥ max{λ, 1} + κ². If the step sizes are of the form η_t = μ^{−1} t^{−θ} for t ∈ IN_T, then, for any l ∈ IN_T, the summation

Σ_{j ∈ IN_l} η_j² exp{−λ Σ_{k=j+1}^l η_k} / (1 + Σ_{k=j+1}^l η_k)

is bounded by

(c_θ / μ) (exp{−λ d_θ l^{1−θ} / μ} l^{−min{θ, 1−θ}} + l^{θ−1}) ln(8 l^{2−θ}),    (4.26)

where d_θ = (1 − 2^{θ−1})/(1 − θ) and c_θ = 1 for θ = 0, c_θ = 3 for θ = 1/2, and c_θ = 3/((1 − θ) 2^{1−θ}) otherwise.

The technical proof is given in the Appendix. Now we can see that, if the step sizes are of the form η_t = μ^{−1} t^{−θ} with θ ∈ (0, 1) and μ ≥ max{λ, 1} + κ², the summation involving the step sizes in (4.25) is bounded by (4.26) with l = T, which tends to zero as T → ∞. Hence, in order to estimate the whole term (4.25), it suffices to establish uniform bounds for sup_{t ∈ IN_{T+1}} E_{Z^t}[E(f_t)]. This is intuitively reasonable, since we expect the learning sequence {f_t : t ∈ IN_{T+1}} to approximate the regression function f_ρ. To this end, we need the following direct consequence of Lemma 6.

Corollary 1. Let λ ≥ 0, θ ∈ [0, 1) and η_t = μ^{−1} t^{−θ} for t ∈ IN_T with μ ≥ κ² + λ. For any T ∈ IN, if μ is larger than the quantity

μ(θ) := { 16(1 + κ)⁴ ln(8T²)  if θ = 0,
          48(1 + κ)⁴          if θ = 1/2,
          108(1 + κ)⁴ (ln 8 + (2 − θ)/min{θ, 1 − θ}) / ((1 − θ) 2^{1−θ})  otherwise, }    (4.27)

then there holds

Σ_{j ∈ IN_t} η_j² exp{−λ Σ_{k=j+1}^t η_k} / (1 + Σ_{k=j+1}^t η_k) ≤ 1 / (8(1 + κ)⁴),  ∀ t ∈ IN_T.    (4.28)

Proof. For any l ∈ IN_T and λ ≥ 0, the quantity (4.26) is bounded by (2 c_θ / μ) l^{−min{θ, 1−θ}} ln(8 l^{2−θ}). Thus, the hypothesis (4.28) holds true as long as μ ≥ κ² + max{λ, 1} and

(16(1 + κ)⁴ c_θ / μ) t^{−min{θ, 1−θ}} ln(8 t^{2−θ}) ≤ 1  for any t ∈ IN_T.

When θ = 0, by the definition of c_0, the above inequality is equivalent to μ ≥ 16(1 + κ)⁴ ln(8T²). For θ ∈ (0, 1), note that

ln t = ln(t^{min{θ, 1−θ}}) / min{θ, 1−θ} ≤ t^{min{θ, 1−θ}} / min{θ, 1−θ}.

Therefore, the inequality (16(1 + κ)⁴ c_θ / μ) t^{−min{θ, 1−θ}} ln(8 t^{2−θ}) ≤ 1 holds whenever

μ ≥ 16(1 + κ)⁴ c_θ (ln 8 + (2 − θ) / min{θ, 1−θ}),

and the definition of c_θ given in Lemma 6 yields the desired result.

We are ready to establish the uniform bounds for the learning sequence.

Proposition 2. Let λ ≥ 0 and {f_t : t ∈ IN_{T+1}} be produced by algorithm (1.4). Moreover, if the step sizes satisfy η_t(κ² + λ) ≤ 1 for t ∈ IN_T and the inequality (4.28), then we have that

E_{Z^t}[E(f_t)] ≤ 20 ‖f_ρ‖²_ρ + 3 E(f_ρ),  ∀ t ∈ IN_{T+1}.    (4.29)

Proof. Since f_1 = 0, from (1.6) it follows that E(f_1) = ‖f_1 − f_ρ‖²_ρ + E(f_ρ) = ‖f_ρ‖²_ρ + E(f_ρ), which implies that (4.29) holds true for t = 1. As the induction assumption, suppose (4.29) holds true for all k ∈ IN_t. To advance the induction, we need to estimate E_{Z^t}[E(f_{t+1})]. Applying (4.13) with T = t ∈ IN, E_{Z^t}[‖f_{t+1} − f_ρ‖²_ρ] is bounded by

‖f_λ − f_ρ‖²_ρ + ‖ω_1^t(L_K + λI) f_λ‖²_ρ + 2 ‖ω_1^t(L_K + λI) f_λ‖_ρ ‖f_λ − f_ρ‖_ρ
  + 2(1 + κ)⁴ Σ_{k ∈ IN_t} η_k² [exp{−λ Σ_{j=k+1}^t η_j} / (1 + Σ_{j=k+1}^t η_j)] E_{Z^k}[E(f_k)].    (4.30)

Note that ‖f_λ − f_ρ‖_ρ ≤ ‖f_ρ‖_ρ by (4.1). Since η_t(κ² + λ) ≤ 1 for t ∈ IN,

‖ω_1^t(L_K + λI)‖ ≤ ∏_{j ∈ IN_t} ‖I − η_j(L_K + λI)‖ ≤ 1.

Hence,

‖ω_1^t(L_K + λI) f_λ‖²_ρ ≤ ‖ω_1^t(L_K + λI)‖² ‖f_λ‖²_ρ ≤ ‖f_λ‖²_ρ ≤ 4 ‖f_ρ‖²_ρ.

Similarly, by (4.3), there holds

2 ‖ω_1^t(L_K + λI) f_λ‖_ρ ‖f_λ − f_ρ‖_ρ ≤ 4 ‖f_ρ‖²_ρ.

Putting the above estimates, the induction assumption, and hypothesis (4.28) into (4.30) implies that

E_{Z^t}[‖f_{t+1} − f_ρ‖²_ρ] ≤ 9 ‖f_ρ‖²_ρ + (20 ‖f_ρ‖²_ρ + 3 E(f_ρ)) / 4.

Note that E(f_{t+1}) = E(f_ρ) + ‖f_{t+1} − f_ρ‖²_ρ by (1.6), and thus

E_{Z^t}[E(f_{t+1})] ≤ 20 ‖f_ρ‖²_ρ + 3 E(f_ρ),

which advances the induction and completes the proof.

In the following sections, we apply Proposition 2 and the basic estimates for the approximation error and the cumulative sample error to prove our main results stated in Section 2.

5 Error bounds and convergence

In this section, we mainly develop error bounds and convergence results for algorithm (1.5) based on Proposition 1. In this case, we have λ = 0 and f_0 = f_ρ. Therefore, the error decomposition formula (4.7) reduces to

f_{T+1} − f_ρ = −ω_1^T(L_K) f_ρ + Σ_{t=1}^T η_t ω_{t+1}^T(L_K) B(f_t, z_t).    (5.1)

By Proposition 1, E_{Z^T}[‖f_{T+1} − f_ρ‖²_ρ] is bounded by

‖ω_1^T(L_K) f_ρ‖²_ρ + 2(1 + κ)⁴ Σ_{t ∈ IN_T} η_t² (1 + Σ_{j=t+1}^T η_j)^{−1} E_{Z^t}[E(f_t)].    (5.2)
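The unregularized iteration studied in this section, f_{t+1} = f_t − η_t (f_t(x_t) − y_t) K_{x_t} with f_1 = 0, is straightforward to simulate, since f_{t+1} lies in the span of {K_{x_1}, ..., K_{x_t}}. The sketch below uses assumed choices that are not tied to the analysis: a Gaussian kernel (so κ = 1), target f_ρ(x) = sin(2πx), and step sizes η_t = μ^{−1} t^{−1/2} with μ = 2, so that η_t κ² ≤ 1.

```python
import numpy as np

rng = np.random.default_rng(0)

def kernel(a, b, sigma=0.3):
    # Gaussian (Mercer) kernel on X = [0, 1]; kappa = sup_x sqrt(K(x, x)) = 1
    return np.exp(-(a - b)**2 / (2.0 * sigma**2))

def f_star(x):
    # regression function used to generate the data (an assumption for the demo)
    return np.sin(2.0 * np.pi * x)

T, theta, mu = 2000, 0.5, 2.0        # eta_t = t^{-theta} / mu
X = rng.uniform(0.0, 1.0, T)
Y = f_star(X) + 0.1 * rng.standard_normal(T)

# f_1 = 0 and f_{t+1} = f_t - eta_t (f_t(x_t) - y_t) K_{x_t},
# stored through coefficients: f_{t+1} = sum_{s <= t} coef[s] K(X[s], .)
coef = np.zeros(T)
for t in range(T):
    eta = (t + 1.0)**(-theta) / mu
    residual = coef[:t] @ kernel(X[:t], X[t]) - Y[t]   # f_t(x_t) - y_t
    coef[t] = -eta * residual

grid = np.linspace(0.0, 1.0, 200)
f_hat = coef @ kernel(X[:, None], grid[None, :])
rms = float(np.sqrt(np.mean((f_hat - f_star(grid))**2)))
print(f"RMS error of f_(T+1) against f_rho on a grid: {rms:.3f}")
```

For the regularized iteration (1.4), one would additionally shrink the past coefficients by the factor (1 − η_t λ) before each update.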

5.1 Proofs of error bounds

This subsection proves the error bounds stated in Theorems 1 and 4. The essential estimates will also be used to derive error rates in Section 6. We start with the following useful lemma; one can find a modified form of it in [3].

Lemma 7. If f ∈ L_K^β(L²_{ρ_X}) with some β > 0, then

‖ω_1^T(L_K) f‖_ρ ≤ 2((β/e)^β + κ^{2β}) ‖L_K^{−β} f‖_ρ (Σ_{t=1}^T η_t)^{−β}.    (5.3)

In addition, if β > 1/2, then ‖ω_1^T(L_K) f‖_K is bounded by

2(((β − 1/2)/e)^{β−1/2} + κ^{2β−1}) ‖L_K^{−β} f‖_ρ (Σ_{t=1}^T η_t)^{−(β−1/2)}.    (5.4)

Proof. Applying Lemma 3 with λ = 0, j = 1, and k = T (with exponents β and β − 1/2, respectively), the desired estimates (5.3) and (5.4) follow immediately from the observations

‖ω_1^T(L_K) f‖_ρ ≤ ‖ω_1^T(L_K) L_K^β‖ ‖L_K^{−β} f‖_ρ

and

‖ω_1^T(L_K) f‖_K = ‖ω_1^T(L_K) L_K^{−1/2} f‖_ρ ≤ ‖ω_1^T(L_K) L_K^{β−1/2}‖ ‖L_K^{−β} f‖_ρ.

This completes our lemma.

Recalling the definitions of μ(θ) in Corollary 1 and c_θ in Lemma 6, our error bounds read as follows.

Proposition 3. Let λ = 0, θ ∈ [0, 1) and η_t = μ^{−1} t^{−θ} for t ∈ IN with some constant μ ≥ μ(θ). Define {f_t : t ∈ IN} by (1.5). Then, for any T ∈ IN, we have that

E_{Z^T}[‖f_{T+1} − f_ρ‖²_ρ] ≤ ‖ω_1^T(L_K) f_ρ‖²_ρ + (c_ρ c_θ / μ) T^{−min{θ, 1−θ}} ln(8 T^{2−θ}),

where c_ρ := 4(1 + κ)⁴ (20 ‖f_ρ‖²_ρ + 3 E(f_ρ)).

Proof. Since η_t = μ^{−1} t^{−θ} with μ ≥ μ(θ), by Corollary 1 we know that the inequality (4.28) on the step sizes holds true for any

28 t IN T. Hence, the bound (4.9) in Proposition holds true. Now, the desired result follows by applying (4.6) with λ = 0 and l = T to the right-hand side of (5.). To apply Proposition 3 to prove Theorem, we need an easy observation, that is, if 0 < θ < then, for any t IN T, T j=t j θ (T + ) θ t θ. (5.5) θ We are in a position to establish Theorem stated in Section. Proof of Theorem. Recall that η t = µ t θ for θ (0, ). Therefore, by T (5.5) we have that η t θ ( θ)µ T θ. Putting this and property (a) in t= Lemma 5 together, the bound (.4) immediately follows from Proposition 3 with θ (0, ). It still remains a question whether we can estimate the stronger error f T + f ρ K when the step sizes have the form {η t = O(t θ ) : t IN} with θ (0, ) and f ρ L β K (L ρ X ) with β >. We turn our attention to the case when the step sizes are in the form of {η t = η : t IN T } with η = η(t ) depending on T. In this case, we can deal with the expectation of the error f T + f ρ K if f ρ H K. Lemma 8. Let f ρ H K, η t = η for t IN T with ηκ and {f t : t IN T + } be defined as (.5). Then we have that ] E Z T ft + f ρ K ω T (L K )f ρ K + κ η E(ft ) ]. E Z t Proof. Since the argument here is essentially similar to the proof of Proposition except the norm ρ is replaced by K in H K, we only sketch its proof. From the error decomposition (5.), we know that ] E Z T ft + f ρ K] = ω T (L K )f ρ K + η E Z T ωt+(l T K )B(f t, z t ) K η ω T (L K )f ρ, E Z T ω T t+ (L K )B(f t, z t ) ] K. (5.6) 8

By the vanishing property (4.15) of B(f_t, z_t), the second term on the right-hand side of the above equality is estimated as follows:

E_{Z^T}[‖Σ_{t=1}^T ω_{t+1}^T(L_K) B(f_t, z_t)‖²_K] = Σ_{t=1}^T E_{Z^t}[‖ω_{t+1}^T(L_K) B(f_t, z_t)‖²_K]
  ≤ Σ_{t=1}^T ‖ω_{t+1}^T(L_K)‖² E_{Z^t}[‖B(f_t, z_t)‖²_K].    (5.7)

By (4.20), E_{Z^t}[‖B(f_t, z_t)‖²_K] ≤ κ² E_{Z^t}[E(f_t)]. Also, since ηκ² ≤ 1, the estimate (4.11) yields that ‖ω_{t+1}^T(L_K)‖ ≤ 1 for any t ∈ IN_T. Putting these estimates back into (5.7), we have that

E_{Z^T}[‖Σ_{t=1}^T ω_{t+1}^T(L_K) B(f_t, z_t)‖²_K] ≤ κ² Σ_{t=1}^T E_{Z^t}[E(f_t)].    (5.8)

Using (4.15) again, the last term on the right-hand side of (5.6) vanishes:

E_{Z^T}[⟨ω_1^T(L_K) f_ρ, Σ_{t=1}^T ω_{t+1}^T(L_K) B(f_t, z_t)⟩_K] = 0.

Cascading this equality, (5.6) and (5.8) yields the desired estimate.

We are ready to present error bounds when the step sizes are of the form η_t = η for t ∈ IN_T, with η = η(T) depending on T.

Proposition 4. Let η_t = η for t ∈ IN_T and {f_t : t ∈ IN_{T+1}} be defined by (1.5). If 16(κ + 1)⁴ ln(8T²) η ≤ 1, then there holds

E_{Z^T}[‖f_{T+1} − f_ρ‖²_ρ] ≤ ‖ω_1^T(L_K) f_ρ‖²_ρ + c_ρ η ln(8T²).    (5.9)

In addition, if f_ρ ∈ H_K, then we have that

E_{Z^T}[‖f_{T+1} − f_ρ‖²_K] ≤ ‖ω_1^T(L_K) f_ρ‖²_K + c_ρ η² T.    (5.10)

Proof. We regard η as μ^{−1}. Then, the inequality 16(κ + 1)⁴ ln(8T²) η ≤ 1 means that μ ≥ 16(1 + κ)⁴ ln(8T²). By Corollary 1, the inequality (4.28)

on the step sizes is satisfied. Hence, the estimate (4.29) in Proposition 2 holds true. The first estimate (5.9) follows from Proposition 3 in the case θ = 0. Combining Lemma 8 and (4.29) yields (5.10).

From the above proposition, we can prove Theorem 4 stated in Section 2.

Proof of Theorem 4. The error bound of Theorem 4 is derived from (5.9) and property (a) in Lemma 5.

5.2 Proofs of convergence results

We are in a position to apply the error bounds in Theorems 1 and 4 to prove Theorems 2 and 5, while Propositions 3 and 4 will be used to derive the explicit rates in the next section. To prove Theorem 2, we introduce the following well-known property of the K-functional, see [5]. We include its proof for completeness.

Lemma 9. Let s > 0 and let K(s, f_ρ) be the K-functional defined in Section 2. Then there holds

lim_{s→0+} K(s, f_ρ) = inf_{f ∈ H_K} ‖f − f_ρ‖_ρ.

Proof. By the definition of the K-functional, it is straightforward that lim_{s→0+} K(s, f_ρ) ≥ inf_{f ∈ H_K} ‖f − f_ρ‖_ρ. Hence, we only need to show that

lim_{s→0+} K(s, f_ρ) ≤ inf_{f ∈ H_K} ‖f − f_ρ‖_ρ.    (5.11)

To this end, for any ε > 0 we know that there exists f_ε ∈ H_K such that ‖f_ε − f_ρ‖_ρ ≤ ε + inf_{f ∈ H_K} ‖f − f_ρ‖_ρ. Therefore, by taking f = f_ε, we know, for any s > 0, that K(s, f_ρ) is bounded by ε + inf_{f ∈ H_K} ‖f − f_ρ‖_ρ + s ‖f_ε‖_K. Letting s → 0+ yields (5.11), and hence completes the lemma.

We are ready to prove Theorem 2 based on Lemma 9 and Theorem 1.

Proof of Theorem 2. Since θ ∈ (0, 1) and η_t = μ^{−1} t^{−θ} for any t ∈ IN with μ ≥ μ(θ), Theorem 1 holds true. Applying Lemma 9 with s = b_θ μ^{1/2} T^{−(1−θ)/2} to the bound of Theorem 1 yields that

lim sup_{T→∞} E_{Z^T}[‖f_{T+1} − f_ρ‖²_ρ] ≤ inf_{f ∈ H_K} ‖f − f_ρ‖²_ρ.

On the other hand, since f_{T+1} ∈ H_K for any T ∈ IN and any sample z = {z_t : t ∈ IN_T}, we know that inf_{f ∈ H_K} ‖f − f_ρ‖²_ρ ≤ ‖f_{T+1} − f_ρ‖²_ρ, which leads to

inf_{f ∈ H_K} ‖f − f_ρ‖²_ρ ≤ lim inf_{T→∞} E_{Z^T}[‖f_{T+1} − f_ρ‖²_ρ].

This completes Theorem 2.

To prove Theorem 5, we further need the following observation.

Lemma 10. Let f_ρ ∈ H_K, and let the step sizes {η_t = η(T) : t ∈ IN_T} satisfy η(T)κ² ≤ 1 for any T ∈ IN and lim_{T→∞} η(T) T = ∞. Then we have that

lim_{T→∞} ‖ω_1^T(L_K) f_ρ‖_K = 0.

Proof. Recall the definition of L_K^β(L²_{ρ_X}) with β > 0 and the fact that H_K = L_K^{1/2}(L²_{ρ_X}). Hence, if f_ρ ∈ H_K, then

f_ρ = Σ_{j=1}^∞ λ_j^{1/2} a_j φ_j,  for some {a_j : j ∈ IN} ∈ ℓ².

Before we move to the next step, it will be helpful to sketch the main idea: in what follows, we construct functions in L_K(L²_{ρ_X}) which approximate f_ρ arbitrarily well in H_K, while for functions in L_K(L²_{ρ_X}) we apply (5.4) in Lemma 7 with β = 1.

Now, write

f_ρ = Σ_{λ_j>0, j=1}^N λ_j^{1/2} a_j φ_j + Σ_{λ_j>0, j=N+1}^∞ λ_j^{1/2} a_j φ_j.

Then f^N := Σ_{λ_j>0, j=1}^N λ_j (λ_j^{−1/2} a_j) φ_j ∈ L_K(L²_{ρ_X}) for any N ∈ IN, because

‖L_K^{−1} f^N‖²_ρ = Σ_{λ_j>0, j=1}^N (λ_j^{−1/2} a_j)² < ∞.

On the other hand, since Σ_j a_j² < ∞, for any ε > 0 there exists N(ε) ∈ IN such that Σ_{j=N(ε)+1}^∞ a_j² ≤ ε². Therefore,

‖f_ρ − f^{N(ε)}‖²_K = ‖Σ_{j=N(ε)+1}^∞ λ_j^{1/2} a_j φ_j‖²_K = Σ_{j=N(ε)+1}^∞ a_j² ≤ ε².

By (1.8), we have that

‖(I − η(T) L_K) f‖_K = ‖(I − η(T) L_K) L_K^{−1/2} f‖_ρ ≤ ‖I − η(T) L_K‖ ‖L_K^{−1/2} f‖_ρ = ‖I − η(T) L_K‖ ‖f‖_K.

Also, since η(T)κ² ≤ 1, applying (4.11) with x = η(T) and λ = 0 implies that ‖I − η(T) L_K‖ ≤ 1. Hence, for any f ∈ H_K, there holds ‖(I − η(T) L_K) f‖_K ≤ ‖f‖_K. Consequently, applying the above inequality iteratively with f = f_ρ − f^{N(ε)}

tells us that

‖(I − η(T) L_K)^T f_ρ‖_K ≤ ‖(I − η(T) L_K)^T f^{N(ε)}‖_K + ‖(I − η(T) L_K)^T (f_ρ − f^{N(ε)})‖_K
  ≤ ‖(I − η(T) L_K)^T f^{N(ε)}‖_K + ‖f_ρ − f^{N(ε)}‖_K
  ≤ ‖(I − η(T) L_K)^T f^{N(ε)}‖_K + ε.    (5.12)

To estimate ‖(I − η(T) L_K)^T f^{N(ε)}‖_K, applying (5.4) with η_t = η(T) for any t ∈ IN_T, f = f^{N(ε)}, and β = 1 yields that

‖(I − η(T) L_K)^T f^{N(ε)}‖_K ≤ 2(1 + κ) ‖L_K^{−1} f^{N(ε)}‖_ρ (T η(T))^{−1/2}.

Since lim_{T→∞} T η(T) = ∞, the above estimate implies that there exists an integer T(ε) ≥ N(ε) such that

‖(I − η(T) L_K)^T f^{N(ε)}‖_K ≤ ε,  ∀ T ≥ T(ε).    (5.13)

Putting (5.12) and (5.13) together, we know that

‖ω_1^T(L_K) f_ρ‖_K = ‖(I − η(T) L_K)^T f_ρ‖_K ≤ 2ε,  ∀ T ≥ T(ε).

Since ε > 0 is arbitrary, we obtain the desired claim.

We now move on to the proof of Theorem 5.

Proof of Theorem 5. Observe that, whenever lim_{T→∞} η(T) ln T = 0 or lim_{T→∞} T η²(T) = 0, there exists T₀ ∈ IN such that 16(κ + 1)⁴ η(T) ln(8T²) ≤ 1 for all T ≥ T₀. The hypotheses in Lemma 8 and Theorem 4 are therefore satisfied, and the bound of Theorem 4 holds true, for any T ≥ T₀.

We first prove the convergence in L²_{ρ_X}. Applying Lemma 9 with s = 2(1 + κ)(η(T) T)^{−1/2} to the bound of Theorem 4, it follows from the hypothesis lim_{T→∞} T η(T) = ∞ that

lim sup_{T→∞} E_{Z^T}[‖f_{T+1} − f_ρ‖²_ρ] ≤ inf_{f ∈ H_K} ‖f − f_ρ‖²_ρ.

On the other hand, since f_{T+1} ∈ H_K, we have inf_{f ∈ H_K} ‖f − f_ρ‖²_ρ ≤ ‖f_{T+1} − f_ρ‖²_ρ for any data z = {z_t : t ∈ IN_T}, which leads to

inf_{f ∈ H_K} ‖f − f_ρ‖²_ρ ≤ lim inf_{T→∞} E_{Z^T}[‖f_{T+1} − f_ρ‖²_ρ].

This verifies the first part of Theorem 5.

The proof for the second part is similar. Since lim_{T→∞} T η²(T) = 0, combining (5.10) in Proposition 4 with Lemma 10 yields that

lim_{T→∞} E_{Z^T}[‖f_{T+1} − f_ρ‖²_K] = 0,

which completes Theorem 5.
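The mechanism behind Lemma 10 can also be checked exactly on a diagonal example. With hypothetical eigenvalues λ_j = j^{−2} and f_ρ = Σ_j λ_j^{1/2} a_j φ_j for a_j = 1/j (so {a_j} ∈ ℓ² and f_ρ ∈ H_K), one has ‖(I − η(T)L_K)^T f_ρ‖²_K = Σ_j (1 − η(T)λ_j)^{2T} a_j². Taking η(T) = 1/(2√T), so that η(T)κ² ≤ 1 and Tη(T) → ∞, this quantity is driven to zero:

```python
import numpy as np

j = np.arange(1, 20001, dtype=float)
lam = j**(-2.0)      # assumed eigenvalues of L_K (so kappa^2 = 1)
a = 1.0 / j          # coefficients of f_rho in the basis {lam_j^{1/2} phi_j}

def hk_norm_sq(T):
    # ||omega_1^T(L_K) f_rho||_K^2 with the constant step eta(T) = 1 / (2 sqrt(T))
    eta = 0.5 / np.sqrt(T)
    return float(np.sum((1.0 - eta * lam)**(2 * T) * a**2))

vals = [hk_norm_sq(T) for T in (10, 100, 1000, 10000)]
print(vals)   # decreases toward 0 as T grows, as Lemma 10 predicts
```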


Unsupervised Learning Techniques Class 07, 1 March 2006 Andrea Caponnetto Unsupervised Learning Techniques 9.520 Class 07, 1 March 2006 Andrea Caponnetto About this class Goal To introduce some methods for unsupervised learning: Gaussian Mixtures, K-Means, ISOMAP, HLLE, Laplacian

More information

arxiv: v4 [math.oc] 5 Jan 2016

arxiv: v4 [math.oc] 5 Jan 2016 Restarted SGD: Beating SGD without Smoothness and/or Strong Convexity arxiv:151.03107v4 [math.oc] 5 Jan 016 Tianbao Yang, Qihang Lin Department of Computer Science Department of Management Sciences The

More information

Fourier Transform & Sobolev Spaces

Fourier Transform & Sobolev Spaces Fourier Transform & Sobolev Spaces Michael Reiter, Arthur Schuster Summer Term 2008 Abstract We introduce the concept of weak derivative that allows us to define new interesting Hilbert spaces the Sobolev

More information

Metric Spaces and Topology

Metric Spaces and Topology Chapter 2 Metric Spaces and Topology From an engineering perspective, the most important way to construct a topology on a set is to define the topology in terms of a metric on the set. This approach underlies

More information

A SYNOPSIS OF HILBERT SPACE THEORY

A SYNOPSIS OF HILBERT SPACE THEORY A SYNOPSIS OF HILBERT SPACE THEORY Below is a summary of Hilbert space theory that you find in more detail in any book on functional analysis, like the one Akhiezer and Glazman, the one by Kreiszig or

More information

Solving the 3D Laplace Equation by Meshless Collocation via Harmonic Kernels

Solving the 3D Laplace Equation by Meshless Collocation via Harmonic Kernels Solving the 3D Laplace Equation by Meshless Collocation via Harmonic Kernels Y.C. Hon and R. Schaback April 9, Abstract This paper solves the Laplace equation u = on domains Ω R 3 by meshless collocation

More information

Cross-Selling in a Call Center with a Heterogeneous Customer Population. Technical Appendix

Cross-Selling in a Call Center with a Heterogeneous Customer Population. Technical Appendix Cross-Selling in a Call Center with a Heterogeneous Customer opulation Technical Appendix Itay Gurvich Mor Armony Costis Maglaras This technical appendix is dedicated to the completion of the proof of

More information

RANDOM FIELDS AND GEOMETRY. Robert Adler and Jonathan Taylor

RANDOM FIELDS AND GEOMETRY. Robert Adler and Jonathan Taylor RANDOM FIELDS AND GEOMETRY from the book of the same name by Robert Adler and Jonathan Taylor IE&M, Technion, Israel, Statistics, Stanford, US. ie.technion.ac.il/adler.phtml www-stat.stanford.edu/ jtaylor

More information

Reproducing Kernel Hilbert Spaces

Reproducing Kernel Hilbert Spaces Reproducing Kernel Hilbert Spaces Lorenzo Rosasco 9.520 Class 03 February 12, 2007 About this class Goal To introduce a particularly useful family of hypothesis spaces called Reproducing Kernel Hilbert

More information

Structured Prediction

Structured Prediction Structured Prediction Ningshan Zhang Advanced Machine Learning, Spring 2016 Outline Ensemble Methods for Structured Prediction[1] On-line learning Boosting AGeneralizedKernelApproachtoStructuredOutputLearning[2]

More information

Economics 204 Summer/Fall 2011 Lecture 5 Friday July 29, 2011

Economics 204 Summer/Fall 2011 Lecture 5 Friday July 29, 2011 Economics 204 Summer/Fall 2011 Lecture 5 Friday July 29, 2011 Section 2.6 (cont.) Properties of Real Functions Here we first study properties of functions from R to R, making use of the additional structure

More information

Introduction to Empirical Processes and Semiparametric Inference Lecture 08: Stochastic Convergence

Introduction to Empirical Processes and Semiparametric Inference Lecture 08: Stochastic Convergence Introduction to Empirical Processes and Semiparametric Inference Lecture 08: Stochastic Convergence Michael R. Kosorok, Ph.D. Professor and Chair of Biostatistics Professor of Statistics and Operations

More information

Perceptron Mistake Bounds

Perceptron Mistake Bounds Perceptron Mistake Bounds Mehryar Mohri, and Afshin Rostamizadeh Google Research Courant Institute of Mathematical Sciences Abstract. We present a brief survey of existing mistake bounds and introduce

More information

The Dirichlet s P rinciple. In this lecture we discuss an alternative formulation of the Dirichlet problem for the Laplace equation:

The Dirichlet s P rinciple. In this lecture we discuss an alternative formulation of the Dirichlet problem for the Laplace equation: Oct. 1 The Dirichlet s P rinciple In this lecture we discuss an alternative formulation of the Dirichlet problem for the Laplace equation: 1. Dirichlet s Principle. u = in, u = g on. ( 1 ) If we multiply

More information

Fourier Series. 1. Review of Linear Algebra

Fourier Series. 1. Review of Linear Algebra Fourier Series In this section we give a short introduction to Fourier Analysis. If you are interested in Fourier analysis and would like to know more detail, I highly recommend the following book: Fourier

More information

Simultaneous Model Selection and Optimization through Parameter-free Stochastic Learning

Simultaneous Model Selection and Optimization through Parameter-free Stochastic Learning Simultaneous Model Selection and Optimization through Parameter-free Stochastic Learning Francesco Orabona Yahoo! Labs New York, USA francesco@orabona.com Abstract Stochastic gradient descent algorithms

More information

Math 350 Fall 2011 Notes about inner product spaces. In this notes we state and prove some important properties of inner product spaces.

Math 350 Fall 2011 Notes about inner product spaces. In this notes we state and prove some important properties of inner product spaces. Math 350 Fall 2011 Notes about inner product spaces In this notes we state and prove some important properties of inner product spaces. First, recall the dot product on R n : if x, y R n, say x = (x 1,...,

More information

An introduction to Birkhoff normal form

An introduction to Birkhoff normal form An introduction to Birkhoff normal form Dario Bambusi Dipartimento di Matematica, Universitá di Milano via Saldini 50, 0133 Milano (Italy) 19.11.14 1 Introduction The aim of this note is to present an

More information

Appendix A Functional Analysis

Appendix A Functional Analysis Appendix A Functional Analysis A.1 Metric Spaces, Banach Spaces, and Hilbert Spaces Definition A.1. Metric space. Let X be a set. A map d : X X R is called metric on X if for all x,y,z X it is i) d(x,y)

More information

Stochastic optimization in Hilbert spaces

Stochastic optimization in Hilbert spaces Stochastic optimization in Hilbert spaces Aymeric Dieuleveut Aymeric Dieuleveut Stochastic optimization Hilbert spaces 1 / 48 Outline Learning vs Statistics Aymeric Dieuleveut Stochastic optimization Hilbert

More information

Notions such as convergent sequence and Cauchy sequence make sense for any metric space. Convergent Sequences are Cauchy

Notions such as convergent sequence and Cauchy sequence make sense for any metric space. Convergent Sequences are Cauchy Banach Spaces These notes provide an introduction to Banach spaces, which are complete normed vector spaces. For the purposes of these notes, all vector spaces are assumed to be over the real numbers.

More information

Support Vector Machines

Support Vector Machines Wien, June, 2010 Paul Hofmarcher, Stefan Theussl, WU Wien Hofmarcher/Theussl SVM 1/21 Linear Separable Separating Hyperplanes Non-Linear Separable Soft-Margin Hyperplanes Hofmarcher/Theussl SVM 2/21 (SVM)

More information

Geometry on Probability Spaces

Geometry on Probability Spaces Geometry on Probability Spaces Steve Smale Toyota Technological Institute at Chicago 427 East 60th Street, Chicago, IL 60637, USA E-mail: smale@math.berkeley.edu Ding-Xuan Zhou Department of Mathematics,

More information

Brownian motion. Samy Tindel. Purdue University. Probability Theory 2 - MA 539

Brownian motion. Samy Tindel. Purdue University. Probability Theory 2 - MA 539 Brownian motion Samy Tindel Purdue University Probability Theory 2 - MA 539 Mostly taken from Brownian Motion and Stochastic Calculus by I. Karatzas and S. Shreve Samy T. Brownian motion Probability Theory

More information

Statistical Convergence of Kernel CCA

Statistical Convergence of Kernel CCA Statistical Convergence of Kernel CCA Kenji Fukumizu Institute of Statistical Mathematics Tokyo 106-8569 Japan fukumizu@ism.ac.jp Francis R. Bach Centre de Morphologie Mathematique Ecole des Mines de Paris,

More information

Real Variables # 10 : Hilbert Spaces II

Real Variables # 10 : Hilbert Spaces II randon ehring Real Variables # 0 : Hilbert Spaces II Exercise 20 For any sequence {f n } in H with f n = for all n, there exists f H and a subsequence {f nk } such that for all g H, one has lim (f n k,

More information

An introduction to some aspects of functional analysis

An introduction to some aspects of functional analysis An introduction to some aspects of functional analysis Stephen Semmes Rice University Abstract These informal notes deal with some very basic objects in functional analysis, including norms and seminorms

More information

1 Directional Derivatives and Differentiability

1 Directional Derivatives and Differentiability Wednesday, January 18, 2012 1 Directional Derivatives and Differentiability Let E R N, let f : E R and let x 0 E. Given a direction v R N, let L be the line through x 0 in the direction v, that is, L :=

More information

Regularization via Spectral Filtering

Regularization via Spectral Filtering Regularization via Spectral Filtering Lorenzo Rosasco MIT, 9.520 Class 7 About this class Goal To discuss how a class of regularization methods originally designed for solving ill-posed inverse problems,

More information

PROPERTIES OF A CLASS OF APPROXIMATELY SHRINKING OPERATORS AND THEIR APPLICATIONS

PROPERTIES OF A CLASS OF APPROXIMATELY SHRINKING OPERATORS AND THEIR APPLICATIONS Fixed Point Theory, 15(2014), No. 2, 399-426 http://www.math.ubbcluj.ro/ nodeacj/sfptcj.html PROPERTIES OF A CLASS OF APPROXIMATELY SHRINKING OPERATORS AND THEIR APPLICATIONS ANDRZEJ CEGIELSKI AND RAFA

More information

ON EARLY STOPPING IN GRADIENT DESCENT LEARNING. 1. Introduction

ON EARLY STOPPING IN GRADIENT DESCENT LEARNING. 1. Introduction ON EARLY STOPPING IN GRADIENT DESCENT LEARNING YUAN YAO, LORENZO ROSASCO, AND ANDREA CAPONNETTO Abstract. In this paper, we study a family of gradient descent algorithms to approximate the regression function

More information

1.5 Approximate Identities

1.5 Approximate Identities 38 1 The Fourier Transform on L 1 (R) which are dense subspaces of L p (R). On these domains, P : D P L p (R) and M : D M L p (R). Show, however, that P and M are unbounded even when restricted to these

More information

7: FOURIER SERIES STEVEN HEILMAN

7: FOURIER SERIES STEVEN HEILMAN 7: FOURIER SERIES STEVE HEILMA Contents 1. Review 1 2. Introduction 1 3. Periodic Functions 2 4. Inner Products on Periodic Functions 3 5. Trigonometric Polynomials 5 6. Periodic Convolutions 7 7. Fourier

More information

MA677 Assignment #3 Morgan Schreffler Due 09/19/12 Exercise 1 Using Hölder s inequality, prove Minkowski s inequality for f, g L p (R d ), p 1:

MA677 Assignment #3 Morgan Schreffler Due 09/19/12 Exercise 1 Using Hölder s inequality, prove Minkowski s inequality for f, g L p (R d ), p 1: Exercise 1 Using Hölder s inequality, prove Minkowski s inequality for f, g L p (R d ), p 1: f + g p f p + g p. Proof. If f, g L p (R d ), then since f(x) + g(x) max {f(x), g(x)}, we have f(x) + g(x) p

More information

Hilbert Spaces. Contents

Hilbert Spaces. Contents Hilbert Spaces Contents 1 Introducing Hilbert Spaces 1 1.1 Basic definitions........................... 1 1.2 Results about norms and inner products.............. 3 1.3 Banach and Hilbert spaces......................

More information

Spectral Geometry of Riemann Surfaces

Spectral Geometry of Riemann Surfaces Spectral Geometry of Riemann Surfaces These are rough notes on Spectral Geometry and their application to hyperbolic riemann surfaces. They are based on Buser s text Geometry and Spectra of Compact Riemann

More information

EECS 598: Statistical Learning Theory, Winter 2014 Topic 11. Kernels

EECS 598: Statistical Learning Theory, Winter 2014 Topic 11. Kernels EECS 598: Statistical Learning Theory, Winter 2014 Topic 11 Kernels Lecturer: Clayton Scott Scribe: Jun Guo, Soumik Chatterjee Disclaimer: These notes have not been subjected to the usual scrutiny reserved

More information

Convergence rates of spectral methods for statistical inverse learning problems

Convergence rates of spectral methods for statistical inverse learning problems Convergence rates of spectral methods for statistical inverse learning problems G. Blanchard Universtität Potsdam UCL/Gatsby unit, 04/11/2015 Joint work with N. Mücke (U. Potsdam); N. Krämer (U. München)

More information

MIT 9.520/6.860, Fall 2017 Statistical Learning Theory and Applications. Class 19: Data Representation by Design

MIT 9.520/6.860, Fall 2017 Statistical Learning Theory and Applications. Class 19: Data Representation by Design MIT 9.520/6.860, Fall 2017 Statistical Learning Theory and Applications Class 19: Data Representation by Design What is data representation? Let X be a data-space X M (M) F (M) X A data representation

More information

Empirical Processes: General Weak Convergence Theory

Empirical Processes: General Weak Convergence Theory Empirical Processes: General Weak Convergence Theory Moulinath Banerjee May 18, 2010 1 Extended Weak Convergence The lack of measurability of the empirical process with respect to the sigma-field generated

More information

9.520: Class 20. Bayesian Interpretations. Tomaso Poggio and Sayan Mukherjee

9.520: Class 20. Bayesian Interpretations. Tomaso Poggio and Sayan Mukherjee 9.520: Class 20 Bayesian Interpretations Tomaso Poggio and Sayan Mukherjee Plan Bayesian interpretation of Regularization Bayesian interpretation of the regularizer Bayesian interpretation of quadratic

More information

First-Return Integrals

First-Return Integrals First-Return ntegrals Marianna Csörnyei, Udayan B. Darji, Michael J. Evans, Paul D. Humke Abstract Properties of first-return integrals of real functions defined on the unit interval are explored. n particular,

More information

Spectral Regularization

Spectral Regularization Spectral Regularization Lorenzo Rosasco 9.520 Class 07 February 27, 2008 About this class Goal To discuss how a class of regularization methods originally designed for solving ill-posed inverse problems,

More information

Lecture 35: December The fundamental statistical distances

Lecture 35: December The fundamental statistical distances 36-705: Intermediate Statistics Fall 207 Lecturer: Siva Balakrishnan Lecture 35: December 4 Today we will discuss distances and metrics between distributions that are useful in statistics. I will be lose

More information

Fourier Series. ,..., e ixn ). Conversely, each 2π-periodic function φ : R n C induces a unique φ : T n C for which φ(e ix 1

Fourier Series. ,..., e ixn ). Conversely, each 2π-periodic function φ : R n C induces a unique φ : T n C for which φ(e ix 1 Fourier Series Let {e j : 1 j n} be the standard basis in R n. We say f : R n C is π-periodic in each variable if f(x + πe j ) = f(x) x R n, 1 j n. We can identify π-periodic functions with functions on

More information

On convergent power series

On convergent power series Peter Roquette 17. Juli 1996 On convergent power series We consider the following situation: K a field equipped with a non-archimedean absolute value which is assumed to be complete K[[T ]] the ring of

More information

Optimization methods

Optimization methods Lecture notes 3 February 8, 016 1 Introduction Optimization methods In these notes we provide an overview of a selection of optimization methods. We focus on methods which rely on first-order information,

More information

THEOREMS, ETC., FOR MATH 515

THEOREMS, ETC., FOR MATH 515 THEOREMS, ETC., FOR MATH 515 Proposition 1 (=comment on page 17). If A is an algebra, then any finite union or finite intersection of sets in A is also in A. Proposition 2 (=Proposition 1.1). For every

More information

Adaptive Sampling Under Low Noise Conditions 1

Adaptive Sampling Under Low Noise Conditions 1 Manuscrit auteur, publié dans "41èmes Journées de Statistique, SFdS, Bordeaux (2009)" Adaptive Sampling Under Low Noise Conditions 1 Nicolò Cesa-Bianchi Dipartimento di Scienze dell Informazione Università

More information

Lecture Notes on Metric Spaces

Lecture Notes on Metric Spaces Lecture Notes on Metric Spaces Math 117: Summer 2007 John Douglas Moore Our goal of these notes is to explain a few facts regarding metric spaces not included in the first few chapters of the text [1],

More information

MATH 590: Meshfree Methods

MATH 590: Meshfree Methods MATH 590: Meshfree Methods Chapter 2 Part 3: Native Space for Positive Definite Kernels Greg Fasshauer Department of Applied Mathematics Illinois Institute of Technology Fall 2014 fasshauer@iit.edu MATH

More information

Weak convergence and Brownian Motion. (telegram style notes) P.J.C. Spreij

Weak convergence and Brownian Motion. (telegram style notes) P.J.C. Spreij Weak convergence and Brownian Motion (telegram style notes) P.J.C. Spreij this version: December 8, 2006 1 The space C[0, ) In this section we summarize some facts concerning the space C[0, ) of real

More information