Online Learning with Samples Drawn from Non-identical Distributions
Journal of Machine Learning Research 10 (2009). Submitted 12/08; Revised 7/09; Published 12/09.

Ting Hu
Ding-Xuan Zhou
Department of Mathematics, City University of Hong Kong, Tat Chee Avenue, Kowloon, Hong Kong, China

Editor: Peter Bartlett

Abstract

Learning algorithms are based on samples which are often drawn independently from an identical distribution (i.i.d.). In this paper we consider a different setting, with samples drawn according to a non-identical sequence of probability distributions: at each time step the sample is drawn from a different distribution. In this setting we investigate a fully online learning algorithm associated with a general convex loss function and a reproducing kernel Hilbert space (RKHS). Error analysis is conducted under the assumption that the sequence of marginal distributions converges polynomially in the dual of a Hölder space. For regression with least square or insensitive loss, learning rates are given in both the RKHS norm and the L² norm. For classification with hinge loss and support vector machine q-norm loss, rates are explicitly stated with respect to the excess misclassification error.

Keywords: sampling with non-identical distributions, online learning, classification with a general convex loss, regression with insensitive loss and least square loss, reproducing kernel Hilbert space

1. Introduction

In the literature of learning theory, samples for algorithms are often assumed to be drawn independently from an identical distribution. Here we consider a setting with samples drawn from non-identical distributions. Such a framework was introduced in Smale and Zhou (2009) and Steinwart et al. (2008), where online learning for least square regression and off-line support vector machines are investigated. We shall follow this framework and study a kernel-based online learning algorithm associated with a general convex loss function.
Our analysis can be applied for various purposes including regression and classification.

1.1 Sampling with Non-identical Distributions

Let (X, d) be a metric space, called the input space for the learning problem. Let Y be a compact subset of ℝ (the output space) and Z = X × Y. In our online learning setting, at each step t = 1, 2, ..., a pair z_t = (x_t, y_t) is drawn from a probability distribution ρ^(t) on Z. The sampling sequence of probability distributions {ρ^(t)}_{t=1,2,...} is not identical. For convergence analysis, we shall assume that the sequence {ρ_X^(t)}_{t=1,2,...} of marginal distributions on X converges polynomially in the dual of the Hölder space C^s(X) for some 0 < s ≤ 1.

© 2009 Ting Hu and Ding-Xuan Zhou.
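Before turning to the Hölder space formalism, the drifting-marginals model can be made concrete with a small numerical sketch in the spirit of the perturbation example given below: a fixed distribution on a finite grid is perturbed at time t by a bounded function scaled by t^{−b}, and the gap between expectations decays at rate t^{−b}. The grid, the perturbation h, the test function f and all constants are our own illustrative choices, not from the paper.

```python
# Numerical illustration of the polynomial decay of marginals: a base
# distribution rho_X on a finite grid is perturbed as
# d rho^(t) = (1 + t^(-b) h(x)) d rho_X with h bounded and of mean zero.
b = 1.5                       # power index b of the decay condition
grid = [0.0, 0.25, 0.5, 0.75, 1.0]
w = [0.2] * 5                 # base distribution rho_X: uniform weights

mean = sum(x * p for x, p in zip(grid, w))
h = [x - mean for x in grid]  # bounded perturbation with integral zero
f = [x * x for x in grid]     # a test function

def gap(t):
    """|integral of f d rho^(t) - integral of f d rho_X|."""
    wt = [p * (1.0 + t ** (-b) * hv) for p, hv in zip(w, h)]
    return abs(sum(fv * pv for fv, pv in zip(f, wt))
               - sum(fv * pv for fv, pv in zip(f, w)))

sup_h = max(abs(v) for v in h)
sup_f = max(abs(v) for v in f)
for t in [1, 2, 4, 8, 16]:
    # the gap is bounded by t^(-b) * sup|h| * sup|f|, a crude form of the
    # decay requirement on the marginals
    assert gap(t) <= t ** (-b) * sup_h * sup_f + 1e-12
```

Here the bound sup|h|·sup|f| plays the role of the constant C times the Hölder norm of f in the decay condition stated below.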
Here the Hölder space C^s(X) is defined to be the space of all continuous functions on X with the norm ‖f‖_{C^s(X)} = ‖f‖_{C(X)} + |f|_{C^s(X)} finite, where

|f|_{C^s(X)} := sup_{x≠y} |f(x) − f(y)| / (d(x,y))^s.

Definition 1 We say that the sequence {ρ_X^(t)}_{t=1,2,...} converges polynomially to a probability distribution ρ_X in (C^s(X))* (0 ≤ s ≤ 1) if there exist C > 0 and b > 0 such that

‖ρ_X^(t) − ρ_X‖_{(C^s(X))*} ≤ C t^{−b},  ∀t ∈ ℕ.  (1)

By the definition of the dual space (C^s(X))*, decay condition (1) can be expressed as

|∫_X f(x) dρ_X^(t) − ∫_X f(x) dρ_X| ≤ C t^{−b} ‖f‖_{C^s(X)},  ∀f ∈ C^s(X), t ∈ ℕ.  (2)

What measures quantitatively the difference between our non-identical setting and the i.i.d. case is the power index b. Its impact on the performance of online learning algorithms will be studied in this paper. The i.i.d. case corresponds to b = ∞.

We describe three situations in which decay condition (1) is satisfied. The first is when a distribution ρ_X is perturbed by some noise and the noise level decreases as t increases.

Example 1 Let {h^(t)} be a sequence of bounded functions on X such that sup_{x∈X} |h^(t)(x)| ≤ C t^{−b}. Then the sequence {ρ_X^(t)}_{t=1,2,...} defined by dρ_X^(t) = dρ_X + h^(t)(x) dρ_X satisfies (1) for any 0 ≤ s ≤ 1. The proof follows from

|∫_X f(x) h^(t)(x) dρ_X| ≤ sup_{x∈X} |h^(t)(x)| ‖f‖_{C(X)} ≤ C t^{−b} ‖f‖_{C^s(X)}.

In this example, h^(t) is the density function of the noise distribution and we assume its bound (the noise level) to decay polynomially as t increases.

The second situation in which decay condition (1) is satisfied is generated by iterated actions of an integral operator associated with a stochastic density kernel. We demonstrate this situation by an example on X = S^{n−1}, the unit sphere of ℝ^n with n ≥ 2. Let ds be the normalized surface element of S^{n−1}. The corresponding space L²(S^{n−1}) has an orthonormal basis {Y_{l,k} : l ∈ ℤ_+, k = 1, ..., N(n,l)} with N(n,0) = 1 and N(n,l) = ((2l+n−2)/l) (l+n−3)! / ((n−2)!(l−1)!). Here Y_{l,k} is a spherical harmonic of order l, which is the restriction onto S^{n−1} of a homogeneous polynomial f of degree l in ℝ^n satisfying the Laplace equation Δf = 0. In particular, Y_{0,0} ≡ 1.
Example 2 Let X = S^{n−1}, 0 < α < 1, and ψ ∈ C(X × X) be given by

ψ(x,u) = 1 + Σ_{l=1}^∞ Σ_{k=1}^{N(n,l)} a_{l,k} Y_{l,k}(x) Y_{l,k}(u),  where 0 ≤ a_{l,k} ≤ α,  Σ_{l=1}^∞ Σ_{k=1}^{N(n,l)} a_{l,k} ‖Y_{l,k}‖²_{C(X)} < 1.

If h^(1) is a square integrable density function on X and a sequence of density functions {h^(t)} is defined by

h^(t+1)(x) = ∫_X ψ(x,u) h^(t)(u) ds(u),  x ∈ X, t ∈ ℕ,

then we know

h^(t) = Y_{0,0} + Σ_{l=1}^∞ Σ_{k=1}^{N(n,l)} a_{l,k}^{t−1} ⟨h^(1), Y_{l,k}⟩_{L²(S^{n−1})} Y_{l,k}

and ‖h^(t) − Y_{0,0}‖_{L²(S^{n−1})} ≤ α^{t−1} ‖h^(1)‖_{L²(S^{n−1})}. It follows that the sequence {dρ_X^(t) = h^(t)(x) ds}_{t=1,2,...} of probability distributions on X converges polynomially to the uniform distribution dρ_X = ds on X and satisfies (1) for any 0 ≤ s ≤ 1.
In general, if ν is a strictly positive probability distribution on X, and if ψ ∈ C(X × X) is strictly positive and satisfies ∫_X ψ(x,u) dν(u) = 1 for each x ∈ X, then the sequence {ρ^(t)} defined by

ρ^(t)(Γ) = ∫_Γ [∫_X ψ(x,u) dρ^(t−1)(x)] dν(u)  on Borel sets Γ ⊆ X

satisfies ‖ρ^(t) − ρ_X‖_{(C(X))*} ≤ C α^t for some (strictly positive) probability distribution ρ_X on X and constants C > 0, 0 < α < 1. Hence decay condition (1) is valid for any 0 ≤ s ≤ 1. For details, see Smale and Zhou (2009).

The third situation realizing decay condition (1) is to induce distributions by dynamical systems. Here we present a simple example.

Example 3 Let X = [−1/2, 1/2] and for each t ∈ ℕ, let the probability distribution ρ_X^(t) on X have support [−2^{−t}, 2^{−t}] and uniform density 2^{t−1} on its support. Then with δ_0 being the delta distribution at the origin, for each 0 < s ≤ 1 we have

|∫_X f(x) dρ_X^(t) − ∫_X f(x) dδ_0| ≤ 2^{t−1} ∫_{−2^{−t}}^{2^{−t}} |f(x) − f(0)| dx ≤ (2^{−s})^t |f|_{C^s(X)}.

Remark 2 Since ‖f‖_{C(X)} ≤ ‖f‖_{C^s(X)}, we see from (2) that decay condition (1) with any 0 < s ≤ 1 is satisfied whenever this polynomial convergence requirement is valid in the case s = 0. This happens in Examples 1 and 2. Note that when s = 0, the dual space (C(X))* is exactly the space of signed finite measures on X. Each signed finite measure µ on X lies in (C(X))* ⊆ (C^s(X))* and satisfies ‖µ‖_{(C^s(X))*} ≤ ‖µ‖_{(C(X))*} ≤ ∫_X d|µ|.

1.2 Fully Online Learning Algorithm

In this paper we study a family of online learning algorithms associated with reproducing kernel Hilbert spaces and a general convex loss function. A reproducing kernel Hilbert space (RKHS) is induced by a Mercer kernel K : X × X → ℝ, a continuous and symmetric function such that the matrix (K(x_i, x_j))_{i,j=1}^l is positive semidefinite for any finite set of points {x_1, ..., x_l} ⊂ X. The RKHS H_K is the completion (Aronszajn, 1950) of the span of the set of functions {K_x = K(x, ·) : x ∈ X}, with the inner product given by ⟨K_x, K_y⟩_K = K(x,y).

Definition 3 We say that V : Y × ℝ → ℝ_+ is a convex loss function if for each y ∈ Y, the univariate function V(y, ·) : ℝ → ℝ_+ is convex.
The convexity tells us (Rockafellar, 1970) that for each f ∈ ℝ and y ∈ Y, the left derivative lim_{δ→0−} (V(y, f+δ) − V(y, f))/δ exists and is no more than the right derivative lim_{δ→0+} (V(y, f+δ) − V(y, f))/δ. An arbitrary number between them (which is a subgradient) will be taken and denoted as V'(y, f) in our algorithm. For the least square regression problem, we can take V(y, f) = (y − f)². For the binary classification problem, we can take V(y, f) = φ(y f) with φ : ℝ → ℝ_+ a convex function. The online algorithm associated with the RKHS H_K and the convex loss V is a stochastic gradient descent method (Cesa-Bianchi et al., 1996; Kivinen et al., 2004; Smale and Yao, 2006; Ying and Zhou, 2006; Ying, 2007).
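The subgradient V'(y, f) can be written down explicitly for the losses treated in this paper. A minimal sketch (the function names are ours; at a kink, any value between the one-sided derivatives is admissible, and we pick one):

```python
# Subgradient choices V'(y, f) for three convex losses.
def V_prime_ls(y, f):
    # least square loss V(y, f) = (y - f)^2 is smooth: V'(y, f) = 2(f - y)
    return 2.0 * (f - y)

def V_prime_hinge(y, f):
    # hinge loss V(y, f) = max(1 - y f, 0); kink at y f = 1, where we pick 0
    if y * f < 1.0:
        return -y
    return 0.0

def V_prime_abs(y, f):
    # insensitive loss V(y, f) = |y - f|; kink at f = y, where we pick 0
    if f > y:
        return 1.0
    if f < y:
        return -1.0
    return 0.0
```

Each choice satisfies the subgradient inequality V(y, g) ≥ V(y, f) + V'(y, f)(g − f), which is the only property the analysis uses.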
Definition 4 The fully online learning algorithm is defined by f_1 = 0 and

f_{t+1} = f_t − η_t (V'(y_t, f_t(x_t)) K_{x_t} + λ_t f_t),  for t = 1, 2, ...,  (3)

where λ_t > 0 is called the regularization parameter and η_t > 0 the step size.

In this fully online algorithm, the regularization parameter λ_t changes with the learning step t. Throughout the paper we assume that λ_{t+1} ≤ λ_t for each t ∈ ℕ. When the regularization parameter λ_t ≡ λ_1 does not change as the steps develop, we call scheme (3) partially online.

The goal of this paper is to investigate the fully online learning algorithm (3) when the sampling sequence is not identical. We will show that learning rates in the non-identical setting can be the same as those in the i.i.d. case when the power index b in polynomial decay condition (1) is large enough, that is, when ρ_X^(t) converges fast to ρ_X. When b is small, the non-identical effect becomes crucial and the learning rates will depend essentially on b.

2. Error Bounds for Regression and Classification

As in the work on least square regression (Smale and Zhou, 2009), we assume for the sampling sequence {ρ^(t)}_{t=1,2,...} that the conditional distribution ρ^(t)(y|x) of each ρ^(t) at x ∈ X is independent of t, denoted as ρ_x. Throughout the paper we assume independence of the sampling, that is, {z_t = (x_t, y_t)}_t is a sequence of samples drawn from the product probability space Π_{t=1,2,...} (Z, ρ^(t)). Error analysis will be conducted for fully online learning algorithm (3) under polynomial decay condition (1) for the sequence of marginal distributions {ρ_X^(t)}. Let ρ be the probability distribution on Z given by the marginal distribution ρ_X and the conditional distributions {ρ_x}. The essential difficulty in our non-identical setting is caused by the deviation of ρ^(t) from ρ. The first novelty of our analysis is to deal with an error quantity ∆_t involving ρ^(t) − ρ (defined by (15) below) which occurs only in the non-identical setting.
This is handled for a general loss function V and output space Y by Lemma 18 in Section 3, under decay condition (1) for the marginal distributions {ρ_X^(t)} and Lipschitz-s continuity of the conditional distributions {ρ_x : x ∈ X}.

Definition 5 We say that the set of distributions {ρ_x : x ∈ X} is Lipschitz-s in (C^s(Y))* if there exists a constant C_ρ ≥ 0 such that

‖ρ_x − ρ_u‖_{(C^s(Y))*} ≤ C_ρ (d(x,u))^s,  ∀x, u ∈ X.  (4)

Notice that on the compact subset Y of ℝ, the Hölder space C^s(Y) and its dual (C^s(Y))* are well defined. Each ρ_x belongs to (C^s(Y))*.

The second novelty of our analysis is to show for the least square loss (described in Section 3) and for binary classification that Lipschitz-s continuity (4) of {ρ_x : x ∈ X} is the same as requiring f_ρ ∈ C^s(X), where f_ρ is the regression function defined by

f_ρ(x) = ∫_Y y dρ_x(y),  x ∈ X.  (5)

Proposition 6 Let 0 < s ≤ 1. Condition (4) implies f_ρ ∈ C^s(X) with ‖f_ρ‖_{C^s(X)} ≤ C_ρ (1 + 2^{1−s}) sup_{y∈Y} |y|. When Y = {−1, 1}, f_ρ ∈ C^s(X) also implies (4), with C_ρ ≤ ‖f_ρ‖_{C^s(X)}.
The two-point nature of the output space for binary classification plays a crucial role in our observation. The second statement of Proposition 6 is not true for a general output space. Here is one example.

Example 4 Let 0 < s ≤ 1 and Y = {−1, 1, 0}. Then condition (4) holds if and only if f_ρ ∈ C^s(X) and f_{ρ,−1} ∈ C^s(X), where f_{ρ,−1} is the function on X given by f_{ρ,−1}(x) = ρ_x({−1}).

Proofs of Proposition 6 and Example 4 will be given in the appendix.

Our third novelty is to understand some essential differences between our non-identical setting and the classical i.i.d. setting by pointing out the key role played by the power index b of polynomial decay condition ‖ρ_X^(t) − ρ_X‖_{(C^s(X))*} ≤ C t^{−b} in the convergence rates derived in Theorems 7 and 10 for regression and Theorem 11 for classification. Even for least square regression our result improves the error analysis in Smale and Zhou (2009), where a stronger exponential decay condition ‖ρ_X^(t) − ρ_X‖_{(C^s(X))*} ≤ C α^t is assumed.

Our error bounds for fully online algorithm (3) are comparable with those for a batch learning algorithm generated by the off-line regularization scheme in H_K, defined with a sample z := {z_t = (x_t, y_t)}_{t=1}^T and a regularization parameter λ > 0 as

f_{z,λ} = arg min_{f ∈ H_K} { (1/T) Σ_{t=1}^T V(y_t, f(x_t)) + (λ/2) ‖f‖²_K }.  (6)

Let us demonstrate our error analysis by learning rates for regression with least square loss and insensitive loss and for binary classification with hinge loss.

2.1 Learning Rates for Least Square Regression

Here we take Y = [−M, M] for some M > 0 and the least square loss V = V_ls given by V_ls(y, f) = (y − f)². Then the algorithm takes the form

f_{t+1} = f_t − η_t ((f_t(x_t) − y_t) K_{x_t} + λ_t f_t),  for t = 1, 2, ...

The following learning rates are derived by the procedure in Smale and Zhou (2009), where an exponential decay condition is assumed. Here we only impose the much weaker polynomial decay condition (1).
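The least square update above can be implemented directly by tracking coefficients, since f_t always lies in the span of {K_{x_1}, ..., K_{x_{t−1}}}. The following sketch uses a Gaussian kernel and a toy data stream; the kernel, the stream and the schedule constants are our own illustrative assumptions:

```python
# Sketch of the least square instance of the online scheme:
# f_{t+1} = f_t - eta_t ((f_t(x_t) - y_t) K_{x_t} + lambda_t f_t),
# stored through coefficients c_i with f_t = sum_i c_i K(x_i, .).
import math

def K(x, u, width=1.0):
    # illustrative Gaussian (Mercer) kernel on the real line
    return math.exp(-((x - u) ** 2) / (2.0 * width ** 2))

def run_online_ls(stream, lam1=0.01, eta1=0.1, gamma=0.2, alpha=0.4):
    xs, cs = [], []                       # support points and coefficients
    for t, (xt, yt) in enumerate(stream, start=1):
        lam_t = lam1 * t ** (-gamma)      # lambda_t = lambda_1 t^(-gamma)
        eta_t = eta1 * t ** (-alpha)      # eta_t = eta_1 t^(-alpha)
        ft_xt = sum(c * K(x, xt) for c, x in zip(cs, xs))
        cs = [(1.0 - eta_t * lam_t) * c for c in cs]   # shrink by -eta_t lam_t f_t
        xs.append(xt)
        cs.append(-eta_t * (ft_xt - yt))  # gradient step on the new center
    return xs, cs

def predict(xs, cs, x):
    return sum(c * K(xi, x) for c, xi in zip(cs, xs))
```

On a simple stream with constant target, e.g. run_online_ls([((i % 10) / 20.0, 1.0) for i in range(300)]), the iterate's predictions approach the constant target, as one expects from a stochastic gradient method with decaying steps.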
We also assume the regularity condition (of order r > 0)

f_ρ = L_K^r (g_ρ)  for some g_ρ ∈ L²_{ρ_X}(X),  (7)

where L_K is the integral operator on L²_{ρ_X}(X) defined by L_K f(x) = ∫_X K(x,v) f(v) dρ_X(v), x ∈ X, with L_K^r well defined since L_K is a compact positive operator.

Theorem 7 Let 0 < s ≤ 1 and 1/2 < r ≤ 3/2. Assume K ∈ C^{2s}(X × X), regularity condition (7) for f_ρ, and (1) with b > (2r+2)/(2r+1) for {ρ_X^(t)}. Take

λ_t = λ_1 t^{−1/(2r+1)},  η_t = η_1 t^{−2r/(2r+1)}
with λ_1 η_1 > (2r−1)/(4r+2). Then

IE_{z_1,...,z_T} (‖f_{T+1} − f_ρ‖_K) ≤ C T^{−(2r−1)/(4r+2)},

where C is a constant independent of T.

Denote the constant κ = sup_{x∈X} √(K(x,x)). From the reproducing property of the RKHS H_K,

⟨K_x, f⟩_K = f(x),  ∀x ∈ X, f ∈ H_K,  (8)

we see that ‖f‖_{C(X)} ≤ κ ‖f‖_K for every f ∈ H_K. Most error analysis in the literature of least square regression (Zhang, 2004; De Vito et al., 2005; Smale and Yao, 2006; Wu et al., 2007) concerns the L²-norm ‖f_{T+1} − f_ρ‖_{L²_{ρ_X}} or the risk in the i.i.d. case. From a predictive viewpoint, in the non-identical setting, the error f_{T+1} − f_ρ should be measured with respect to the distribution ρ_X^(T), not the limit ρ_X. This can be done by bounding ‖f_{T+1} − f_ρ‖_{C(X)} (since ρ_X^(T) changes with T), which follows from estimates for ‖f_{T+1} − f_ρ‖_K. So our bounds for the error in the H_K-norm provide useful predictive information about the learning ability of fully online algorithm (3) in the non-identical setting.

Remark 8 When X ⊂ ℝ^n and K ∈ C^{2m}(X × X) for some m ∈ ℕ, we know from Zhou (2003, 2008) and Theorem 7 that IE_{z_1,...,z_T} (‖f_{T+1} − f_ρ‖_{C^m(X)}) = O(T^{−(2r−1)/(4r+2)}). So the regression function is learned efficiently by the online algorithm not only in the usual L²_{ρ_X} space, but also strongly in the space C^m(X), implying the learning of gradients (Mukherjee and Wu, 2006).

In the special case r = 3/2, the learning rate in Theorem 7 is IE_{z_1,...,z_T} (‖f_{T+1} − f_ρ‖_K) = O(T^{−1/4}), the same as those in the literature (Smale and Zhou, 2009; Tarrès and Yao, 2005; Smale and Zhou, 2007). Here we assume polynomial convergence condition (1) with a large index b > (2r+2)/(2r+1), so the influence of the non-identical distributions {ρ^(t)} does not appear in the learning rates (it is involved in the constant C). Instead of refining the analysis for smaller b in Theorem 7, we shall show the influence of the index b on learning rates in the settings of regression with insensitive loss and binary classification.
2.2 Learning Rates for Regression with Insensitive Loss

A large family of loss functions for regression take the form V(y, f) = ψ(y − f), where ψ : ℝ → ℝ_+ is an even, convex and continuous function satisfying ψ(0) = 0. One example is the ε-insensitive loss (Vapnik, 1998) with ε ≥ 0, where ψ(u) = max{|u| − ε, 0}. We consider the case ε = 0. In this case the loss is called least absolute deviation or least absolute error in the statistics literature and finds applications in some important problems because of its robustness.

Definition 9 The insensitive loss V = V_in is given by V_in(y, f) = |y − f|.

Algorithm (3) now takes the form

f_{t+1} = (1 − η_t λ_t) f_t − η_t K_{x_t},  if f_t(x_t) ≥ y_t,
f_{t+1} = (1 − η_t λ_t) f_t + η_t K_{x_t},  if f_t(x_t) < y_t.

The following learning rates are new; they will be proved later in the paper.
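Before stating the rates, note that the sign-based update above can be written in the same coefficient representation used for least squares: only the sign of f_t(x_t) − y_t enters each step. A one-step sketch (the helper names are ours; kernel stands for any Mercer kernel):

```python
# One step of the online scheme for the insensitive loss, with
# f_t = sum_i c_i K(x_i, .) stored as support points xs and coefficients cs.
def insensitive_step(xs, cs, xt, yt, eta_t, lam_t, kernel):
    ft_xt = sum(c * kernel(x, xt) for c, x in zip(cs, xs))
    cs = [(1.0 - eta_t * lam_t) * c for c in cs]   # (1 - eta_t lam_t) f_t
    # f_{t+1} gains -eta_t K_{x_t} if f_t(x_t) >= y_t, and +eta_t K_{x_t} otherwise
    cs.append(-eta_t if ft_xt >= yt else eta_t)
    xs = xs + [xt]
    return xs, cs
```

The update magnitude is independent of the size of the residual, which is the source of the robustness of this loss noted above.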
Theorem 10 Let 0 < s ≤ 1 and K ∈ C^{2s}(X × X). Assume regularity condition (7) for f_ρ with r > 1/2, and polynomial convergence condition (1) for {ρ_X^(t)}. Suppose that for each x ∈ X, ρ_x is the uniform distribution on the interval [f_ρ(x) − 1, f_ρ(x) + 1]. If λ_1 ≤ (κ ‖g_ρ‖_{L²_{ρ_X}})^{2/(1−2r)} / 2 and η_1 > 0, then with a constant C independent of T, when 1/2 < r ≤ 3/2 we have

IE_{z_1,...,z_T} (‖f_{T+1} − f_ρ‖_K) ≤ C T^{−min{(2r−1)/(6r+1), b/2 − 2/(6r+1)}}

by taking λ_t = λ_1 t^{−2/(6r+1)}, η_t = η_1 t^{−4r/(6r+1)}; and when 1/2 < r ≤ 1, we have

IE_{z_1,...,z_T} (‖f_{T+1} − f_ρ‖²_{L²_{ρ_X}}) ≤ C T^{−min{2r/(3r+2), (2b−1)/(3r+2)}}

with λ_t = λ_1 t^{−2/(3r+2)}, η_t = η_1 t^{−(2r+1)/(3r+2)}.

Again, when b > (4r+2)/(6r+1), X ⊂ ℝ^n and K ∈ C^{2m}(X × X) for some m ∈ ℕ, Theorem 10 tells us that IE_{z_1,...,z_T} (‖f_{T+1} − f_ρ‖_{C^m(X)}) = O(T^{−(2r−1)/(6r+1)}). So the regression function is learned efficiently by the online algorithm not only with respect to the risk, but also strongly in the space C^m(X), implying the learning of gradients.

Consider the case r = 3/2. When {ρ^(t)} converges slowly, with b < 4/5, we see that IE_{z_1,...,z_T} (‖f_{T+1} − f_ρ‖_K) = O(T^{−(b/2 − 1/5)}), which depends heavily on the index b representing quantitatively the deviation of ρ^(t) from ρ. When b is large enough, with b ≥ 4/5, the learning rate IE_{z_1,...,z_T} (‖f_{T+1} − f_ρ‖_K) is of order T^{−1/5}, which is independent of b.

It would be interesting to extend Theorem 10 to situations when {ρ_x : x ∈ X} are more general bounded symmetric distributions.

2.3 Learning Rates for Binary Classification

The output space for the binary classification problem is Y = {−1, 1}, representing the two classes. A binary classifier C : X → Y makes a prediction y = C(x) for each point x ∈ X. A real valued function f : X → ℝ can be used to generate a classifier C(x) = sgn(f(x)), where sgn(f(x)) = 1 if f(x) ≥ 0 and sgn(f(x)) = −1 if f(x) < 0. A classifying convex loss φ : ℝ → ℝ_+ is often used for the real valued function f, to measure the local error φ(y f(x)) suffered from the use of sgn(f) as a model for the process producing y at x. Take V(y, f) = φ(y f) in our setting.
Off-line classification algorithm (6) has been extensively studied in the literature. In particular, the error analysis is well understood when the sample z is assumed to be drawn independently from an identical probability measure ρ on Z. See, for example, Evgeniou et al. (2000), Steinwart (2002), Zhang (2004) and Wu et al. (2007). Some analysis for off-line support vector machines with dependent samples can be found in Steinwart et al. (2008).

When the sample size T is very large, algorithm (6) might be practically challenging. Then online learning algorithms can be applied, which provide more efficient methods for classification. These algorithms are generalizations of the perceptron, which has a long history. The error analysis for online algorithm (3) has also been conducted for classification in the i.i.d. setting; see, for example, Cesa-Bianchi et al. (2004), Ying and Zhou (2006), Ying (2007) and Ye and Zhou (2007). Here we are interested in the error analysis for fully online algorithm (3) in the non-identical setting. The error is measured by the misclassification error R(C), defined to be the probability of the event {C(x) ≠ y} for a classifier C:

R(C) = ∫_X ρ_x(y ≠ C(x)) dρ_X(x) = ∫_X ρ_x(−C(x)) dρ_X(x).
The best classifier minimizing the misclassification error is called the Bayes rule (e.g., Devroye et al., 1997) and can be expressed as f_c = sgn(f_ρ). We are interested in the classifier sgn(f_{T+1}) produced by the real valued function f_{T+1} from fully online learning algorithm (3). So our error analysis aims at the excess misclassification error R(sgn(f_{T+1})) − R(f_c).

We demonstrate our error analysis by a result, proved in Section 5, for the hinge loss φ(x) = (1 − x)_+ = max{1 − x, 0}. For this loss, the online algorithm (3) can be expressed as f_1 = 0 and

f_{t+1} = (1 − η_t λ_t) f_t,  if y_t f_t(x_t) > 1,
f_{t+1} = (1 − η_t λ_t) f_t + η_t y_t K_{x_t},  if y_t f_t(x_t) ≤ 1.

Theorem 11 Let V(y, f) = (1 − y f)_+ and K ∈ C^{2s}(X × X) for some 0 < s ≤ 1. Assume f_ρ ∈ C^s(X) and that the triple (ρ, f_c, K) satisfies

inf_{f ∈ H_K} { ‖f − f_c‖_{L¹_{ρ_X}} + (λ/2) ‖f‖²_K } ≤ D_0 λ^β,  ∀λ > 0  (9)

for some 0 < β ≤ 1 and D_0 > 0. Take

λ_t = λ_1 t^{−1/4},  η_t = η_1 t^{−(1/2 + β/12)},

where λ_1 > 0 and 0 < η_1 ≤ 1/(κ² + λ_1). If {ρ_X^(t)}_{t=1,2,...} satisfies (1) with b > 1/2, then

IE_{z_1,...,z_T} (R(sgn(f_{T+1})) − R(f_c)) ≤ C_{β,s,b} T^{−min{β/4, 1/8 + β/24, (2b−1)/4}},  (10)

where C_{β,s,b} = C_{η_1,λ_1,κ,D_0,β,s,b} is a constant depending on η_1, λ_1, κ, D_0, β, s and b.

In the i.i.d. case with b = ∞, the learning rate for fully online algorithm (3) stated in Theorem 11 is of the form O(T^{−min{β/4, 1/8 + β/24}}), which is better than that in Ye and Zhou (2007) of order O(T^{−min{β/4, 1/8 − ε}}) with an arbitrarily small ε > 0. This improvement is realized by technical novelty in our error analysis, as pointed out in Remark 27. So even in the i.i.d. case, our learning rate for fully online classification with the hinge loss under approximation error assumption (9) is the best in the literature.

Let us discuss the role played in the learning rate in Theorem 11 by the power index b for the convergence of ρ^(t) to ρ. Consider the case β ≤ 3/5.
When b ≥ (1+β)/2, meaning fast convergence of ρ^(t) to ρ, learning rate (10) takes the form O(T^{−β/4}), which depends only on the approximation ability of H_K with respect to the function f_c and ρ. When 1/2 < b < (1+β)/2, representing slow convergence of ρ^(t) to ρ, the learning rate takes the form O(T^{−(2b−1)/4}), which depends only on b. When β > 3/5, learning rate (10) is of order T^{−(1/8 + β/24)} for b > 3/4 + β/12 (fast convergence of ρ^(t)) and of order T^{−(2b−1)/4} for 1/2 < b ≤ 3/4 + β/12 (slow convergence of ρ^(t)).

It would be interesting to know how online learning algorithms adapt when the time dependent distribution drifts sufficiently slowly (corresponding to very small b).
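The hinge loss update displayed before Theorem 11 is a regularized perceptron-type rule: points with margin y_t f_t(x_t) > 1 only shrink the coefficients, while the others also add η_t y_t K_{x_t}. A sketch in the same coefficient representation as before (helper names are ours):

```python
# One step of the online scheme for the hinge loss, plus the induced
# classifier sgn(f).
def hinge_step(xs, cs, xt, yt, eta_t, lam_t, kernel):
    ft_xt = sum(c * kernel(x, xt) for c, x in zip(cs, xs))
    cs = [(1.0 - eta_t * lam_t) * c for c in cs]   # (1 - eta_t lam_t) f_t
    if yt * ft_xt <= 1.0:                # margin violated: perceptron-like step
        xs, cs = xs + [xt], cs + [eta_t * yt]
    return xs, cs

def classify(xs, cs, x, kernel):
    f = sum(c * kernel(xi, x) for c, xi in zip(cs, xs))
    return 1 if f >= 0 else -1           # sgn(f), the classifier C(x)
```

With a linear kernel kernel(a, b) = a * b and a stream of labelled points ±1, two steps already suffice to classify both training points correctly in this toy setting.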
2.4 Approximation Error Involving a General Loss

Condition (9) concerns the approximation of the function f_c in the space L¹_{ρ_X} by functions from the RKHS H_K. In particular, when H_K is a dense subset of C(X) (i.e., K is a universal kernel), the quantity on the left-hand side of (9) tends to 0 as λ → 0. So (9) is a reasonable assumption, which can be characterized by an interpolation space condition (Smale and Zhou, 2003; Chen et al., 2004). Assumptions like (9) are necessary to determine the regularization parameter for achieving learning rate (10). This can be seen from the literature (Wu et al., 2007; Zhang, 2004; Caponnetto et al., 2007) on off-line algorithm (6): learning rates are obtained by suitable choices of the regularization parameter λ = λ(T), according to the behavior of the approximation error estimated from a priori conditions on the distribution ρ and the space H_K. For a general loss function V, conditions like (9) can be stated in terms of the approximation error or regularization error.

Definition 12 The approximation error D(λ) associated with the triple (K, V, ρ) is

D(λ) = inf_{f ∈ H_K} { E(f) − E(f_ρ^V) + (λ/2) ‖f‖²_K },  (11)

where we define the generalization error for f : X → ℝ as

E(f) = ∫_Z V(y, f(x)) dρ = ∫_X ∫_Y V(y, f(x)) dρ_x(y) dρ_X

and f_ρ^V is a minimizer of E(f).

The approximation error measures the approximation ability of the space H_K with respect to the learning process involving V and ρ. The denseness of H_K in C(X) implies lim_{λ→0} D(λ) = 0. A natural assumption would be

D(λ) ≤ D_0 λ^β  for some 0 ≤ β ≤ 1 and D_0 > 0.  (12)

Throughout the paper we assume for the general loss function V that

|V| := sup_{y∈Y} V(y, 0) + sup { |V(y, f) − V(y, 0)| / |f| : y ∈ Y, 0 < |f| ≤ 1 } < ∞.

Remark 13 Since D(λ) ≤ E(0) + 0 ≤ |V| for any λ > 0, we see that (12) always holds with β = 0 and D_0 = |V|.
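In the least square case treated next, the infimum defining D(λ) has a closed form once everything is restricted to a finite grid, which makes the monotonicity and decay of the approximation error easy to observe. A numerical sketch, with an illustrative Gaussian kernel and target standing in for K and f_ρ (the empirical objective below is only a stand-in for (11)):

```python
# Numerical sketch of the approximation error for the least square loss:
# minimize (1/n)||K c - f_rho||^2 + (lam/2) c^T K c over coefficients c.
import numpy as np

x = np.linspace(0.0, 1.0, 40)
Kmat = np.exp(-(x[:, None] - x[None, :]) ** 2 / 0.1)   # Gaussian Gram matrix
f_rho = np.sin(2.0 * np.pi * x)                        # illustrative target

def D(lam):
    n = len(x)
    # stationarity condition: (K^2/n + (lam/2) K) c = K f_rho / n
    c = np.linalg.solve(Kmat @ Kmat / n + 0.5 * lam * Kmat + 1e-9 * np.eye(n),
                        Kmat @ f_rho / n)
    f = Kmat @ c
    return float(np.mean((f - f_rho) ** 2) + 0.5 * lam * (c @ Kmat @ c))

vals = [D(lam) for lam in (1.0, 0.1, 0.01)]
# D is non-decreasing in lambda, so the values decrease as lambda shrinks
assert vals[0] + 1e-8 >= vals[1] >= vals[2] - 1e-8 >= -1e-8
```

The monotone decrease as λ → 0 is exactly the behavior lim_{λ→0} D(λ) = 0 exploited when choosing λ = λ(T).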
For the least square loss V(y, f) = (y − f)², the minimizer f_ρ^V of E(f) is exactly the regression function defined by (5), and approximation error (11) takes the form D(λ) = inf_{f ∈ H_K} { ‖f − f_ρ‖²_{L²_{ρ_X}} + (λ/2) ‖f‖²_K }, which measures the approximation of f_ρ in L²_{ρ_X} by functions from the RKHS H_K. For a general loss function V, the minimizer f_ρ^V of E(f) is in general different from f_ρ. Moreover, the approximation error D(λ) involves the approximation of f_ρ^V by H_K in some function spaces which need not be L²_{ρ_X}.

Example 5 For the hinge loss V(y, f) = (1 − y f)_+, the minimizer f_ρ^V is the Bayes rule f_ρ^V = f_c (Devroye et al., 1997). Moreover, the uniform Lipschitz continuity of V(·, f) implies (Chen et al., 2004)

E(f) − E(f_ρ^V) = ∫_Z φ(y f(x)) − φ(y f_c(x)) dρ ≤ ∫_X |f(x) − f_c(x)| dρ_X = ‖f − f_c‖_{L¹_{ρ_X}}.
So approximation error (11) can be estimated by approximation in the space L¹_{ρ_X}, and condition (9) implies (12).

Consider the insensitive loss V = V_in. We can easily see that for each x ∈ X, f_ρ^{V_in}(x) equals the median of the probability distribution ρ_x on Y. That is, f_ρ^{V_in}(x) is determined by ρ_x({y : y ≥ f_ρ^{V_in}(x)}) ≥ 1/2 and ρ_x({y : y ≤ f_ρ^{V_in}(x)}) ≥ 1/2. When ρ_x is symmetric about its mean f_ρ(x), we have f_ρ^{V_in}(x) = f_ρ(x).

Example 6 Let V = V_in, f_ρ ∈ C(X) and p ≥ 1. If for each x ∈ X, the conditional distribution ρ_x is given by

dρ_x(y) = (p/2) |y − f_ρ(x)|^{p−1} dy  if |y − f_ρ(x)| ≤ 1,  and dρ_x(y) = 0  if |y − f_ρ(x)| > 1,

then we have

E(f) − E(f_ρ^{V_in}) = (1/(p+1)) ‖f − f_ρ‖^{p+1}_{L^{p+1}_{ρ_X}} − (1/(p+1)) ∫_{{x : |f(x) − f_ρ(x)| > 1}} { |f(x) − f_ρ(x)|^{p+1} + p − (1+p) |f(x) − f_ρ(x)| } dρ_X.

It follows that

D(λ) ≤ inf_{f ∈ H_K} { ‖f − f_ρ‖^{p+1}_{L^{p+1}_{ρ_X}} + (λ/2) ‖f‖²_K }.

The conclusion of Example 6 will be proved in the appendix. The following general result follows from the same argument as in Wu et al. (2006).

Proposition 14 If the loss function V satisfies |V(y, f_1) − V(y, f_2)| ≤ |f_1 − f_2|^c for some 0 < c ≤ 1 and any y ∈ Y, f_1, f_2 ∈ ℝ, then

D(λ) ≤ inf_{f ∈ H_K} { ‖f − f_ρ^V‖^c_{L^c_{ρ_X}} + (λ/2) ‖f‖²_K } ≤ inf_{f ∈ H_K} { ‖f − f_ρ^V‖^c_{L²_{ρ_X}} + (λ/2) ‖f‖²_K }.

If the univariate function V(y, ·) is C¹ and satisfies |V'(y, f_1) − V'(y, f_2)| ≤ |f_1 − f_2|^c for some 0 < c ≤ 1 and any y ∈ Y, f_1, f_2 ∈ ℝ, then

D(λ) ≤ inf_{f ∈ H_K} { ‖f − f_ρ^V‖^{1+c}_{L^{1+c}_{ρ_X}} + (λ/2) ‖f‖²_K } ≤ inf_{f ∈ H_K} { ‖f − f_ρ^V‖^{1+c}_{L²_{ρ_X}} + (λ/2) ‖f‖²_K }.

3. Key Analysis for the Fully Online Non-identical Setting

Learning rates for fully online learning algorithm (3), such as those stated in the last section for regression and classification, are obtained through analysis of the approximation error and the sample error. The approximation error has been well understood (Smale and Zhou, 2003; Chen et al., 2004). The sample error for (3) will be estimated in the following two sections. It is expressed as ‖f_{T+1} − f_{λ_T}^V‖_K, where f_{λ_T}^V is a regularizing function.
Definition 15 For λ > 0, the regularizing function f_λ^V ∈ H_K is defined by

f_λ^V = arg inf_{f ∈ H_K} { E(f) + (λ/2) ‖f‖²_K }.  (13)

Observe that f_λ^V is a sample-free limit of f_{z,λ} defined by (6). It is natural to expect that the function f_{T+1} produced by online algorithm (3) approximates the regularizing function f_{λ_T}^V well. This is actually the case. Our main result on sample error analysis estimates the difference f_{T+1} − f_{λ_T}^V in the H_K-norm in quite general situations. The estimation follows from an iteration procedure, developed for online classification algorithms in Ying and Zhou (2006), Ying (2007) and Ye and Zhou (2007), together with some novelty provided in the proof of Theorem 26 in the next section. It is based on a one-step iteration bounding ‖f_{t+1} − f_{λ_t}^V‖_K in terms of ‖f_t − f_{λ_{t−1}}^V‖_K. The goal of this section is to present our key analysis for the one-step iteration, tackling two barriers arising from the fully online non-identical setting.

3.1 Bounding the Error Term Caused by the Non-identical Sequence of Distributions

The first part of our key analysis for the one-step iteration is to tackle an extra error term ∆_t caused by the non-identical sequence of distributions when bounding ‖f_{t+1} − f_{λ_t}^V‖_K by means of ‖f_t − f_{λ_t}^V‖_K with the fixed regularization parameter λ_t.

Lemma 16 Define {f_t} by (3). Then we have

IE_{z_t} (‖f_{t+1} − f_{λ_t}^V‖²_K) ≤ (1 − η_t λ_t) ‖f_t − f_{λ_t}^V‖²_K + 2 η_t ∆_t + η_t² IE_{z_t} ‖V'(y_t, f_t(x_t)) K_{x_t} + λ_t f_t‖²_K,  (14)

where ∆_t is defined by

∆_t = ∫_Z [ V(y, f_{λ_t}^V(x)) − V(y, f_t(x)) ] d(ρ^(t) − ρ).  (15)

Lemma 16 follows from the same procedure as in the proofs of Lemma 3 in Ying and Zhou (2006) and in Ye and Zhou (2007). A crucial estimate is

⟨V'(y_t, f_t(x_t)) K_{x_t} + λ_t f_t, f_{λ_t}^V − f_t⟩_K ≤ [V(y_t, f_{λ_t}^V(x_t)) + (λ_t/2) ‖f_{λ_t}^V‖²_K] − [V(y_t, f_t(x_t)) + (λ_t/2) ‖f_t‖²_K].

When we take the expectation with respect to z_t = (x_t, y_t), we get ∫_Z V(y, f_{λ_t}^V(x)) dρ^(t) (and ∫_Z V(y, f_t(x)) dρ^(t)) on the right-hand side, not E(f_{λ_t}^V) = ∫_Z V(y, f_{λ_t}^V(x)) dρ.
So compared with results in the i.i.d. case (Smale and Yao, 2006; Ying and Zhou, 2006; Ying, 2007; Tarrès and Yao, 2005; Ye and Zhou, 2007), an extra term ∆_t involving the difference measure ρ^(t) − ρ appears. This is the first barrier we need to tackle here. When V is the least square loss, it can be easily handled by assuming f_ρ ∈ C^s(X). In fact, from V(y, f) = (y − f)², we see that

∆_t = ∫_Z [f_{λ_t}^V(x) − f_t(x)] [f_{λ_t}^V(x) + f_t(x) − 2y] dρ_x(y) d(ρ_X^(t) − ρ_X)
    = ∫_X [f_{λ_t}^V(x) − f_t(x)] [f_{λ_t}^V(x) + f_t(x) − 2 f_ρ(x)] d(ρ_X^(t) − ρ_X).

This together with the relation ‖hg‖_{C^s(X)} ≤ ‖h‖_{C^s(X)} ‖g‖_{C^s(X)} yields the following bound.
Proposition 17 Let V(y, f) = (y − f)² and f_ρ ∈ C^s(X). Then we have

∆_t ≤ ( ‖f_{λ_t}^V‖_{C^s(X)} + ‖f_t‖_{C^s(X)} + 2 ‖f_ρ‖_{C^s(X)} ) ‖ρ_X^(t) − ρ_X‖_{(C^s(X))*} ‖f_{λ_t}^V − f_t‖_{C^s(X)}.

For a general loss function and output space, ∆_t can be bounded under assumption (4). It is a special case of the following general result with h = f_{λ_t}^V and g = f_t.

Lemma 18 Let h, g ∈ C^s(X). If (4) holds, then we have

∫_Z [V(y, h(x)) − V(y, g(x))] d(ρ^(t) − ρ) ≤ ( B_{h,g} (‖h‖_{C^s(X)} + ‖g‖_{C^s(X)}) + C_ρ B̃_{h,g} ) ‖ρ_X^(t) − ρ_X‖_{(C^s(X))*},

where B_{h,g} and B̃_{h,g} are constants given by

B_{h,g} = sup { |V'(y, f)| : y ∈ Y, |f| ≤ max{‖h‖_{C(X)}, ‖g‖_{C(X)}} }

and

B̃_{h,g} = sup { ‖V(·, f)‖_{C^s(Y)} : |f| ≤ max{‖h‖_{C(X)}, ‖g‖_{C(X)}} }.

Proof By decomposing the probability distributions on Z into marginal and conditional distributions, we see that

∫_Z [V(y, h(x)) − V(y, g(x))] d(ρ^(t) − ρ) = ∫_X ∫_Y [V(y, h(x)) − V(y, g(x))] dρ_x(y) d(ρ_X^(t) − ρ_X).

By the definition of the norm in (C^s(X))*, we obtain

∫_Z [V(y, h(x)) − V(y, g(x))] d(ρ^(t) − ρ) ≤ ‖ρ_X^(t) − ρ_X‖_{(C^s(X))*} ‖J‖_{C^s(X)},

where J is the function on X defined by J(x) = ∫_Y [V(y, h(x)) − V(y, g(x))] dρ_x(y), x ∈ X.

The definition of B_{h,g} tells us that |V(y, h(x)) − V(y, g(x))| ≤ B_{h,g} |h(x) − g(x)| for each y ∈ Y. Hence ‖J‖_{C(X)} ≤ B_{h,g} ‖h − g‖_{C(X)}. To bound |J|_{C^s(X)}, let x, u ∈ X. We can decompose J(x) − J(u) as

J(x) − J(u) = ∫_Y { [V(y, h(x)) − V(y, g(x))] − [V(y, h(u)) − V(y, g(u))] } dρ_x(y) + ∫_Y [V(y, h(u)) − V(y, g(u))] d[ρ_x − ρ_u](y).

By the definition of B_{h,g}, part of the first term above can be bounded as

|V(y, h(x)) − V(y, h(u))| ≤ B_{h,g} |h(x) − h(u)| ≤ B_{h,g} |h|_{C^s(X)} (d(x,u))^s.
The same bound holds for the other part of the first term. So we get

|∫_Y { [V(y, h(x)) − V(y, g(x))] − [V(y, h(u)) − V(y, g(u))] } dρ_x(y)| ≤ B_{h,g} ( |h|_{C^s(X)} + |g|_{C^s(X)} ) (d(x,u))^s.

For the other term of the expression for J(x) − J(u), we apply condition (4) to get

|∫_Y [V(y, h(u)) − V(y, g(u))] d[ρ_x − ρ_u](y)| ≤ ‖ρ_x − ρ_u‖_{(C^s(Y))*} ‖V(·, h(u)) − V(·, g(u))‖_{C^s(Y)}

and find that

|∫_Y [V(y, h(u)) − V(y, g(u))] d[ρ_x − ρ_u](y)| ≤ C_ρ (d(x,u))^s B̃_{h,g}.

Combining the above two bounds, we see that

|J|_{C^s(X)} = sup_{x≠u} |J(x) − J(u)| / (d(x,u))^s ≤ B_{h,g} ( |h|_{C^s(X)} + |g|_{C^s(X)} ) + C_ρ B̃_{h,g}.

Then the desired bound follows and the lemma is proved.

3.2 Bounding the Error Term Caused by Varying Regularization Parameters

The second part of our key analysis is to estimate the error term ‖f_{λ_t}^V − f_{λ_{t−1}}^V‖_K, called the drift error, caused by the change of the regularization parameter from λ_{t−1} to λ_t in our fully online algorithm. This is the second barrier we need to tackle here.

Definition 19 The drift error is defined as d_t = ‖f_{λ_t}^V − f_{λ_{t−1}}^V‖_K.

The drift error can be estimated by the approximation error, which has been studied for regression (Smale and Zhou, 2009) and for classification (Ye and Zhou, 2007).

Theorem 20 Let V be a convex loss function, f_λ^V be defined by (13) and µ > λ > 0. We have

‖f_λ^V − f_µ^V‖_K ≤ (µ/2) (1/λ − 1/µ) ( ‖f_λ^V‖_K + ‖f_µ^V‖_K ) ≤ (µ/2) (1/λ − 1/µ) ( √(2D(λ)/λ) + √(2D(µ)/µ) ).

In particular, if with some 0 < γ ≤ 1 we take λ_t = λ_1 t^{−γ} for t ≥ 1, then

d_{t+1} ≤ 2γ t^{γ/2 − 1} √(2 D(λ_1 t^{−γ}) / λ_1).

Proof Taking the derivative of the functional E(f) + (λ/2) ‖f‖²_K at the minimizer f_λ^V defined in (13), we see that

∫_Z V'(y, f_λ^V(x)) K_x dρ + λ f_λ^V = 0.
It follows that

f_λ^V − f_µ^V = (1/µ) ∫_Z V'(y, f_µ^V(x)) K_x dρ − (1/λ) ∫_Z V'(y, f_λ^V(x)) K_x dρ.

Combining with the reproducing property (8), we know that ‖f_λ^V − f_µ^V‖²_K = ⟨f_λ^V − f_µ^V, f_λ^V − f_µ^V⟩_K can be expressed as

‖f_λ^V − f_µ^V‖²_K = (1/µ) ∫_Z V'(y, f_µ^V(x)) (f_λ^V − f_µ^V)(x) dρ − (1/λ) ∫_Z V'(y, f_λ^V(x)) (f_λ^V − f_µ^V)(x) dρ.

The convexity of the function V(y, ·) on ℝ tells us that

V'(y, f_µ^V(x)) (f_λ^V − f_µ^V)(x) ≤ V(y, f_λ^V(x)) − V(y, f_µ^V(x))

and

V'(y, f_λ^V(x)) (f_µ^V − f_λ^V)(x) ≤ V(y, f_µ^V(x)) − V(y, f_λ^V(x)).

Hence

‖f_λ^V − f_µ^V‖²_K ≤ (1/λ − 1/µ) ( E(f_µ^V) − E(f_λ^V) ).

From the definition of f_µ^V, we see that E(f_µ^V) + (µ/2) ‖f_µ^V‖²_K − ( E(f_λ^V) + (µ/2) ‖f_λ^V‖²_K ) ≤ 0. It follows that

E(f_µ^V) − E(f_λ^V) ≤ (µ/2) ( ‖f_λ^V‖²_K − ‖f_µ^V‖²_K ) ≤ (µ/2) ‖f_λ^V − f_µ^V‖_K ( ‖f_λ^V‖_K + ‖f_µ^V‖_K ).

Then the desired inequality follows.

4. Bounds for the Sample Error in the H_K-norm

We are in a position to present our main result on the sample error, measured by the H_K-norm of the difference f_{T+1} − f_{λ_T}^V. This will be done by applying iteratively the key analysis in Lemma 16, Lemma 18 and Theorem 20. When applying Lemma 18, we need to bound ‖g‖_{C^s(X)} in terms of ‖g‖_K.

Definition 21 We say that the Mercer kernel K satisfies the kernel condition of order s if K ∈ C^s(X × X) and for some κ_s > 0,

K(x,x) − 2K(x,u) + K(u,u) ≤ κ_s² (d(x,u))^{2s},  ∀x, u ∈ X.  (16)

When 0 < s ≤ 1 and K ∈ C^{2s}(X × X), (16) holds true. The following result follows directly from Zhou (2003), Zhou (2008) and Smale and Zhou (2009).

Lemma 22 If K satisfies the kernel condition of order s with (16) valid, then we have ‖g‖_{C^s(X)} ≤ (κ + κ_s) ‖g‖_K for every g ∈ H_K.

When using Lemma 16 and Lemma 18, we need incremental behaviors of the loss function V to bound ‖f_t‖_K.
Definition 23 Denote

$$N(\lambda) = \sup\Big\{|V'(y,f)| : y \in Y,\ |f| \le \max\big\{\kappa\sqrt{2\bar V/\lambda},\ \kappa\kappa_V/\lambda\big\}\Big\}$$

and

$$\tilde N(\lambda) = \sup\Big\{\|V(\cdot,f)\|_{C^s(Y)} : |f| \le \max\big\{\kappa\sqrt{2\bar V/\lambda},\ \kappa\kappa_V/\lambda\big\}\Big\}.$$

We say that $V$ has incremental exponent $p \ge 0$ if for some $N_1 > 0$ and $\lambda_1 > 0$ we have

$$N(\lambda) \le N_1\Big(\frac{1}{\lambda}\Big)^p \quad\hbox{and}\quad \tilde N(\lambda) \le N_1\Big(\frac{1}{\lambda}\Big)^{p+1}, \qquad 0 < \lambda \le \lambda_1. \tag{17}$$

We say that $V$ is locally Lipschitz at the origin if

$$M_0 := \sup\Big\{\frac{|V(y,f) - V(y,0)|}{|f|} : y \in Y,\ |f| \le 1\Big\} < \infty. \tag{18}$$

The following result can be proved by exactly the same procedure as those in Ying and Zhou (2006), Ying (2007) and Ye and Zhou (2007).

Lemma 24 Assume that $V$ is locally Lipschitz at the origin. Define $f_t$ by (3). If

$$\eta_t\big(\kappa^2(M_0 + N(\lambda_t)) + \lambda_t\big) \le 1 \qquad\hbox{for } t = 1,\dots,T, \tag{19}$$

then we have

$$\|f_t\|_K \le \kappa_V/\lambda_t, \qquad t = 1,\dots,T+1. \tag{20}$$

For the insensitive loss, (18) is not satisfied, but $|V'(y,f)|$ is uniformly bounded by 1. For such loss functions we can apply the following bound.

Lemma 25 Assume $|V'|_\infty := \sup_{y \in Y,\,f \in \mathbb R}|V'(y,f)| < \infty$. Define $f_t$ by (3). If for some $\lambda_1, \eta_1 > 0$ and $\gamma, \alpha > 0$ with $\gamma + \alpha < 1$, we take $\lambda_t = \lambda_1 t^{-\gamma}$, $\eta_t = \eta_1 t^{-\alpha}$ with $t = 1,\dots,T$, then we have

$$\|f_t\|_K \le C_{V,\gamma,\alpha}/\lambda_t, \qquad t = 1,\dots,T+1, \tag{21}$$

where $C_{V,\gamma,\alpha}$ is the constant given by

$$C_{V,\gamma,\alpha} = \kappa|V'|_\infty\,\lambda_1\Big(\eta_1 + 2^{\gamma+\alpha}/(\lambda_1\eta_1) + \big((1+\alpha)/[e\lambda_1\eta_1(1 - 2^{\gamma+\alpha-1})]\big)^{\frac{1+\alpha}{1-\gamma-\alpha}}\Big).$$

Proof By taking norms in (3) we see that

$$\|f_{t+1}\|_K \le (1 - \eta_t\lambda_t)\|f_t\|_K + \eta_t\kappa|V'|_\infty.$$

By iterating and the choice $f_1 = 0$ we find

$$\|f_{t+1}\|_K \le \sum_{i=1}^t \Pi_{j=i+1}^t(1 - \eta_j\lambda_j)\,\eta_i\,\kappa|V'|_\infty.$$
But $1 - \eta_j\lambda_j \le \exp\{-\eta_j\lambda_j\}$. It follows that

$$\|f_{t+1}\|_K \le \sum_{i=1}^t \exp\Big\{-\sum_{j=i+1}^t \eta_j\lambda_j\Big\}\,\eta_i\,\kappa|V'|_\infty.$$

Now we need the following elementary inequality with $c > 0$, $q \ge 0$ and $0 < q_1 < 1$:

$$\sum_{i=1}^{t-1} i^{-q}\exp\Big\{-c\sum_{j=i+1}^t j^{-q_1}\Big\} \le 2^{1+q}\Big(\frac{1+q}{c} + \Big(\frac{1+q}{ec(1 - 2^{q_1-1})}\Big)^{\frac{1+q}{1-q_1}}\Big)t^{q_1-q}. \tag{22}$$

This elementary inequality can be found in, for example, Smale and Zhou (2009). Taking $q = \alpha$, $q_1 = \gamma+\alpha$ and $c = \lambda_1\eta_1$ we know that $\|f_{t+1}\|_K$ is bounded by

$$\kappa|V'|_\infty\Big(\eta_1 t^{-\alpha} + \Big[2^{\gamma+\alpha}/(\lambda_1\eta_1) + \big((1+\alpha)/[e\lambda_1\eta_1(1 - 2^{\gamma+\alpha-1})]\big)^{\frac{1+\alpha}{1-\gamma-\alpha}}\Big]t^{\gamma}\Big).$$

Then our desired bound holds true.

Now we can present our bound for the sample error $\|f_{T+1} - f^V_{\lambda_T}\|_K$.

Theorem 26 Suppose the following assumptions hold:

1. the kernel $K$ satisfies the kernel condition of order $s$ ($0 < s \le 1$) with (16) valid;
2. $V$ has incremental exponent $p \ge 0$ with (17) valid, and $V$ satisfies (18);
3. $\{\rho^{(t)}\}_{t=1,2,\dots}$ converges polynomially to $\rho$ in $(C^s(X))^*$ with (11) valid;
4. the distributions $\{\rho_x : x \in X\}$ are Lipschitz $s$ in $(C^s(Y))^*$ with (4) valid;
5. the triple $(K,V,\rho)$ has the approximation ability of power $0 < \beta \le 1$ stated by (12).

Take

$$\lambda_t = \lambda_1 t^{-\gamma}, \qquad \eta_t = \eta_1 t^{-\alpha} \tag{23}$$

with some $\lambda_1, \eta_1 > 0$ and $\gamma, \alpha$ satisfying

$$0 < \gamma < \frac{2}{5+4p-\beta}, \qquad (2p+1)\gamma < \alpha < 1 - \frac{\gamma(3-\beta)}{2}, \qquad \eta_1 \le \frac{1}{\kappa^2 M_0 + \kappa^2 N_1\lambda_1^{-p} + \lambda_1}. \tag{24}$$

Then we have

$$\mathbb E_{z_1,z_2,\dots,z_T}\big(\|f_{T+1} - f^V_{\lambda_T}\|_K^2\big) \le C_{K,V,\rho,b,\beta,s}\,T^{-\theta} \tag{25}$$

where the power index $\theta$ is given by

$$\theta := \min\big\{2 - \gamma(3-\beta) - 2\alpha,\ \alpha - \gamma(2p+1),\ b - \gamma(2+p)\big\}, \tag{26}$$

and $C_{K,V,\rho,b,\beta,s}$ is a constant independent of $T$ given explicitly in the proof.
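The elementary inequality (22) lends itself to a direct numerical check. The sketch below uses the constant in the form $2^{1+q}\big(\frac{1+q}{c} + \big(\frac{1+q}{ec(1-2^{q_1-1})}\big)^{(1+q)/(1-q_1)}\big)$ as written here, and the parameter triples are arbitrary illustrative choices:

```python
import math

def lhs(t, q, q1, c):
    """Left-hand side of (22): sum_{i=1}^{t-1} i^{-q} exp(-c sum_{j=i+1}^{t} j^{-q1})."""
    total = 0.0
    for i in range(1, t):
        tail = sum(j ** (-q1) for j in range(i + 1, t + 1))
        total += i ** (-q) * math.exp(-c * tail)
    return total

def rhs(t, q, q1, c):
    """Right-hand side of (22), constant written in the form used in this section."""
    const = 2 ** (1 + q) * ((1 + q) / c
            + ((1 + q) / (math.e * c * (1 - 2 ** (q1 - 1)))) ** ((1 + q) / (1 - q1)))
    return const * t ** (q1 - q)

# spot checks over a few parameter choices and horizons
for q, q1, c in [(0.5, 0.5, 1.0), (0.0, 0.75, 0.3), (1.0, 0.25, 2.0)]:
    for t in (10, 100, 1000):
        assert lhs(t, q, q1, c) <= rhs(t, q, q1, c)
```

Such a check is no proof, of course, but it is a useful guard when transcribing the exponents $(q, q_1)$ into the applications below.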
Proof We divide the proof into four steps. First we bound $\Delta_t$. Since $\lambda_t$ and $\eta_t$ take the form (23), we see from the lower bound for $\alpha$ in (24) that $2p\gamma \le (2p+1)\gamma \le \alpha$. Hence for $t \in \mathbb N$,

$$\eta_t\big(\kappa^2(M_0 + N(\lambda_t)) + \lambda_t\big) \le \eta_1 t^{-\alpha}\big(\kappa^2 M_0 + \kappa^2 N_1\lambda_1^{-p}t^{p\gamma} + \lambda_1 t^{-\gamma}\big) \le \eta_1\big(\kappa^2 M_0 + \kappa^2 N_1\lambda_1^{-p} + \lambda_1\big).$$

So the last restriction of (24) implies (19), and by Lemma 24, we know that (20) holds true. Taking $f = 0$ in (13) yields $\|f^V_{\lambda_t}\|_K \le \sqrt{2\bar V/\lambda_t}$ for $t = 1,\dots,T$. Putting these bounds into the definition of the constants $B_{h,g}, \tilde B_{h,g}$, we see by the notion $N(\lambda)$ that $B_{f^V_{\lambda_t}, f_t} \le N(\lambda_t)$ and $\tilde B_{f^V_{\lambda_t}, f_t} \le \tilde N(\lambda_t)$. So by Lemma 18 and Lemma 22,

$$\Delta_t \le \tilde B_t := \Big[(\kappa + \kappa_{2s})\big(\sqrt{2\bar V/\lambda_t} + \kappa_V/\lambda_t\big)N(\lambda_t) + C_\rho\tilde N(\lambda_t)\Big]\,\|\rho^{(t)} - \rho\|_{(C^s(X))^*}.$$

Putting this bound and (20) into (14) yields

$$\mathbb E_{z_1,z_2,\dots,z_t}\big(\|f_{t+1} - f^V_{\lambda_t}\|_K^2\big) \le (1 - \eta_t\lambda_t)\,\mathbb E_{z_1,z_2,\dots,z_{t-1}}\big(\|f_t - f^V_{\lambda_t}\|_K^2\big) + \kappa^2\eta_t^2\big(N(\lambda_t) + \bar V\big)^2 + 2\eta_t\tilde B_t.$$

Next we derive explicit bounds for the one-step iteration. Recall that $d_t = \|f^V_{\lambda_t} - f^V_{\lambda_{t-1}}\|_K$. It gives

$$\|f_t - f^V_{\lambda_t}\|_K^2 \le \|f_t - f^V_{\lambda_{t-1}}\|_K^2 + 2\|f_t - f^V_{\lambda_{t-1}}\|_K\,d_t + d_t^2.$$

Take $\tau = \frac{\gamma+\alpha}{1 - \gamma(1-\beta)/2}$. By the upper bound for $\alpha$ in (24), we find $0 < \tau < 1$. Take $A_1 = \eta_1\lambda_1^{1+\tau(1-\beta)/2}/\big(2^{1+2\tau}(2D_0)^{\tau/2}\big) > 0$. Applying the elementary inequality

$$2ab = 2\big[\sqrt{A_1}\,a\,b^{\tau/2}\big]\big[b^{1-\tau/2}/\sqrt{A_1}\big] \le A_1 a^2 b^\tau + b^{2-\tau}/A_1 \tag{27}$$

to $a = \|f_t - f^V_{\lambda_{t-1}}\|_K$ and $b = d_t$, we know that

$$\mathbb E_{z_1,\dots,z_{t-1}}\big(\|f_t - f^V_{\lambda_t}\|_K^2\big) \le (1 + A_1 d_t^\tau)\,\mathbb E_{z_1,\dots,z_{t-1}}\big(\|f_t - f^V_{\lambda_{t-1}}\|_K^2\big) + d_t^{2-\tau}/A_1 + d_t^2.$$

Using this bound and noticing the inequality $(1 - \eta_t\lambda_t)(1 + A_1 d_t^\tau) \le 1 + A_1 d_t^\tau - \eta_t\lambda_t$, we obtain

$$\mathbb E_{z_1,\dots,z_t}\big(\|f_{t+1} - f^V_{\lambda_t}\|_K^2\big) \le (1 + A_1 d_t^\tau - \eta_t\lambda_t)\,\mathbb E_{z_1,\dots,z_{t-1}}\big(\|f_t - f^V_{\lambda_{t-1}}\|_K^2\big) + d_t^{2-\tau}/A_1 + d_t^2 + \kappa^2\eta_t^2\big(N(\lambda_t) + \bar V\big)^2 + 2\eta_t\tilde B_t.$$

By Theorem 20, condition (12) for the approximation error yields

$$d_t \le 2\sqrt{2D_0}\,\lambda_1^{(\beta-1)/2}(t-1)^{\gamma(1-\beta)/2 - 1} \le A_2\,t^{\gamma(1-\beta)/2 - 1} \qquad\hbox{where } A_2 = 4\sqrt{2D_0}\,\lambda_1^{(\beta-1)/2}.$$

Inserting the parameter form $\lambda_t = \lambda_1 t^{-\gamma}$ into assumption (17) and applying condition (11), we can bound $\tilde B_t$ as

$$\tilde B_t \le A_3\,t^{\gamma(1+p) - b} \qquad\hbox{where } A_3 = \Big[(\kappa + \kappa_{2s})\big(\sqrt{2\bar V/\lambda_1} + \kappa_V/\lambda_1\big) + C_\rho/\lambda_1\Big]N_1\lambda_1^{-p}\,C.$$
Therefore, for the one-step iteration, by denoting $f^V_{\lambda_1}$ as $f^V_{\lambda_0}$ when $t = 1$, we have for each $t = 1,\dots,T$,

$$\mathbb E_{z_1,z_2,\dots,z_t}\big(\|f_{t+1} - f^V_{\lambda_t}\|_K^2\big) \le (1 + A_1 d_t^\tau - \eta_t\lambda_t)\,\mathbb E_{z_1,z_2,\dots,z_{t-1}}\big(\|f_t - f^V_{\lambda_{t-1}}\|_K^2\big) + A_4\,t^{-\tilde\theta}, \tag{28}$$

where $\tilde\theta = \min\{2 - \gamma(2-\beta) - \alpha,\ 2(\alpha - p\gamma),\ \alpha + b - \gamma(1+p)\}$ and

$$A_4 = A_2^{2-\tau}/A_1 + A_2^2 + \kappa^2\eta_1^2\big(N_1\lambda_1^{-p} + \bar V\big)^2 + 2\eta_1 A_3.$$

Then we iterate the above one-step analysis. Inserting the parameter forms for $\lambda_t$ and $\eta_t$, we see from the definition of the constant $A_1$ that

$$1 + A_1 d_t^\tau - \eta_t\lambda_t \le 1 + A_1 A_2^\tau\,t^{\tau(\gamma(1-\beta)/2 - 1)} - \eta_1\lambda_1 t^{-\gamma-\alpha} = 1 - \frac{\eta_1\lambda_1}{2}\,t^{-\gamma-\alpha}. \tag{29}$$

So the one-step analysis (28) yields

$$\mathbb E_{z_1,z_2,\dots,z_t}\big(\|f_{t+1} - f^V_{\lambda_t}\|_K^2\big) \le \Big(1 - \frac{\eta_1\lambda_1}{2}\,t^{-\gamma-\alpha}\Big)\mathbb E_{z_1,z_2,\dots,z_{t-1}}\big(\|f_t - f^V_{\lambda_{t-1}}\|_K^2\big) + A_4\,t^{-\tilde\theta}.$$

Applying this bound iteratively for $t = 1,\dots,T$ implies

$$\mathbb E_{z_1,z_2,\dots,z_T}\big(\|f_{T+1} - f^V_{\lambda_T}\|_K^2\big) \le A_4\sum_{t=1}^T \Pi_{j=t+1}^T\Big(1 - \frac{\eta_1\lambda_1}{2}j^{-\gamma-\alpha}\Big)t^{-\tilde\theta} + \Pi_{t=1}^T\Big(1 - \frac{\eta_1\lambda_1}{2}t^{-\gamma-\alpha}\Big)\|f_1 - f^V_{\lambda_1}\|_K^2.$$

Finally we bound the above expressions by two elementary inequalities. The first one is (22). Applying this inequality with $c = \frac{\eta_1\lambda_1}{2}$, $q_1 = \gamma+\alpha$ and $q = \tilde\theta$, since $1 - u \le e^{-u}$ for any $u \ge 0$, the first expression above can be bounded as

$$\sum_{t=1}^T \Pi_{j=t+1}^T\Big(1 - \frac{\eta_1\lambda_1}{2}j^{-\gamma-\alpha}\Big)t^{-\tilde\theta} \le \sum_{t=1}^T \exp\Big\{-\frac{\eta_1\lambda_1}{2}\sum_{j=t+1}^T j^{-\gamma-\alpha}\Big\}t^{-\tilde\theta} \le A_5\,T^{\gamma+\alpha-\tilde\theta},$$

where $A_5$ is the constant given by

$$A_5 = 2^{\gamma+\alpha+\tilde\theta+1}\Big(\frac{2(1+\tilde\theta)}{\eta_1\lambda_1} + \Big(\frac{2(1+\tilde\theta)}{e\eta_1\lambda_1(1 - 2^{\gamma+\alpha-1})}\Big)^{\frac{1+\tilde\theta}{1-\gamma-\alpha}}\Big).$$

For the second expression above, we have

$$\Pi_{t=1}^T\Big(1 - \frac{\eta_1\lambda_1}{2}t^{-\gamma-\alpha}\Big) \le \exp\Big\{-\frac{\lambda_1\eta_1}{2}\sum_{t=1}^T t^{-\gamma-\alpha}\Big\} \le \exp\Big\{-\frac{\lambda_1\eta_1}{2}\int_1^{T+1} x^{-\gamma-\alpha}\,dx\Big\} \le \exp\Big\{\frac{\lambda_1\eta_1}{2(1-\gamma-\alpha)}\Big\}\exp\Big\{-\frac{\lambda_1\eta_1}{2(1-\gamma-\alpha)}(T+1)^{1-\gamma-\alpha}\Big\}.$$
Applying another elementary inequality

$$\exp\{-cx\} \le \Big(\frac{a}{ec}\Big)^a x^{-a}, \qquad c, a, x > 0,$$

with $c = \frac{\lambda_1\eta_1}{2(1-\gamma-\alpha)}$, $a = \frac{2}{1-\gamma-\alpha}$ and $x = (T+1)^{1-\gamma-\alpha}$ yields

$$\Pi_{t=1}^T\Big(1 - \frac{\eta_1\lambda_1}{2}t^{-\gamma-\alpha}\Big) \le \exp\Big\{\frac{\lambda_1\eta_1}{2(1-\gamma-\alpha)}\Big\}\Big(\frac{4}{(1-\gamma-\alpha)e\lambda_1\eta_1}\Big)^{\frac{2}{1-\gamma-\alpha}}T^{-2}.$$

The above two estimates give the desired bound (25) with $\theta = \tilde\theta - \gamma - \alpha$ and the constant $C_{K,V,\rho,b,\beta,s}$ given by

$$C_{K,V,\rho,b,\beta,s} = A_4 A_5 + \exp\Big\{\frac{\lambda_1\eta_1}{2(1-\gamma-\alpha)}\Big\}\Big(\frac{4}{(1-\gamma-\alpha)e\lambda_1\eta_1}\Big)^{\frac{2}{1-\gamma-\alpha}}\frac{2\bar V}{\lambda_1}.$$

This proves the theorem.

Remark 27 Some ideas in the above proof are from Ying and Zhou (2006), Ye and Zhou (2007) and Smale and Zhou (2009). Two novel points are presented for the first time here. One is the bound for $\Delta_t$, dealing with $\rho^{(t)} \ne \rho$, given in the first step of our proof in order to tackle the technical difficulty arising from the non-identical sampling process. The same difficulty for the least square regression was overcome in Smale and Zhou (2009) by the special linear feature and explicit expressions offered by the least square loss. The second technical novelty is to introduce a parameter $A_1$ into the elementary inequality (27). With this parameter, we can bound $1 + A_1 d_t^\tau - \eta_t\lambda_t$ by $1 - \frac12\eta_t\lambda_t$, as shown in (29). This improves the error bound even in the i.i.d. case presented in Ye and Zhou (2007) for the fully online algorithm.

Let us discuss the role of the parameters in Theorem 26. When $\gamma$ is small enough and $b > 2/3$, the fully online algorithm (3) becomes very close to the partially online scheme with $\lambda_t \equiv \lambda_1$. By taking $\alpha = 2/3$, the rates in (25) are of order $O(T^{-(2/3-\varepsilon)})$ with $\varepsilon$ arbitrarily small, which is a nice bound for the sample error $\|f_{T+1} - f^V_{\lambda_T}\|_K$. In this case, the difference between $f^V_{\lambda_T}$ and $f^V_\rho$, measured by the approximation error, increases since $\lambda_T = \lambda_1 T^{-\gamma}$. To estimate the total error between $f_{T+1}$ and $f^V_\rho$, we should take a balance for the index $\gamma$ of the regularization parameter, as shown in Theorem 11.

For the insensitive loss, (18) is not satisfied.
We can apply Lemma 25 and obtain bounds for $\|f_{T+1} - f^V_{\lambda_T}\|_K$ by the same proof as that for Theorem 26.

Proposition 28 Assume $|V'|_\infty < \infty$ and all the conditions of Theorem 26 except (18). Take $\lambda_t, \eta_t$ by (23) with the restriction (24) without the last inequality. Then the same convergence rate (25) holds true with the power index $\theta$ given by (26).

5. Bounds for Binary Classification and Regression with Insensitive Loss

We demonstrate how to apply Theorem 26 by deriving learning rates of the fully online algorithm (3) for binary classification and regression with the insensitive loss.
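The trade-off discussed in Remark 27 between the regularization index $\gamma$, the step-size index $\alpha$ and the convergence index $b$ can be made concrete by evaluating the power index $\theta$ directly. In this sketch the formula for $\theta$ follows (26) as read here, and the parameter values are illustrative:

```python
def theta(gamma, alpha, p, beta, b):
    """Power index of Theorem 26: the smallest of the three competing exponents."""
    return min(2 - gamma * (3 - beta) - 2 * alpha,  # drift of the regularization path
               alpha - gamma * (2 * p + 1),         # variance of the online updates
               b - gamma * (2 + p))                 # non-identical sampling

# with alpha = 2/3 and b > 2/3, theta tends to 2/3 as gamma shrinks,
# recovering the O(T^{-(2/3 - eps)}) sample-error rate of Remark 27
p, beta, b = 0, 1.0, 4.0
for gamma in (0.2, 0.1, 0.01):
    print(gamma, round(theta(gamma, 2 / 3, p, beta, b), 3))
```

The printout shows the rate improving monotonically as $\gamma$ decreases, while the approximation error (not counted in $\theta$) deteriorates, which is exactly the balance resolved in the theorems below.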
Theorem 29 Let $V(y,f) = \varphi(yf)$ where $\varphi : \mathbb R \to \mathbb R_+$ is a convex function satisfying

$$|\varphi'(u)| \le N_\varphi|u|^p, \qquad \varphi(u) \le N_\varphi|u|^{p+1}, \qquad |u| \ge 1 \tag{30}$$

for some $p \ge 0$ and $N_\varphi > 0$. Suppose $M_\varphi := \sup\{|\varphi'(u) - \varphi'(0)|/|u| : |u| \le 1\} < \infty$. Assume (16) for $K$, (11) for $\{\rho^{(t)}\}$, and (12) for $(K,\varphi,\rho)$. If $f_\rho \in C^s(X)$ and we choose $\lambda_t, \eta_t$ as (23) with

$$0 < \gamma < \frac{2}{5+10p-\beta}, \qquad \alpha = \frac{2 + \gamma(2p+\beta-2)}{3}, \qquad b > \gamma(2+2p),$$

and $\eta_1 \le \eta_0$, $\lambda_1 \le \kappa^2\big(\varphi(0) + \|\varphi'\|_{L^\infty[-1,1]}\big)/2$, then we have

$$\mathbb E_{z_1,\dots,z_T}\big(\mathcal E(f_{T+1}) - \mathcal E(f^V_\rho)\big) \le C_{\varphi,\beta,\gamma}\,T^{-\min\big\{\frac{2-\gamma(5+10p-\beta)}{6},\ \frac{b-\gamma(2+3p)}{2},\ \gamma\beta\big\}},$$

where $\eta_0 := \frac{1}{\kappa^2 M_\varphi + \kappa^2 N_1\lambda_1^{-p} + \lambda_1}$ with $N_1 = \big(M_\varphi + N_\varphi + \varphi(0) + \|\varphi'\|_{L^\infty[-1,1]}\big)\kappa^p\big(2(\varphi(0) + \|\varphi'\|_{L^\infty[-1,1]})\big)^p$, and $C_{\varphi,\beta,\gamma}$ is a constant depending on $\eta_1, \lambda_1, \kappa, D_0, \beta, \varphi$ and $s$.

Proof By the bounds for $\|f^V_{\lambda_T}\|_K$ and $\|f_{T+1}\|_K$, we know from (30) that

$$\mathcal E(f_{T+1}) - \mathcal E(f^V_{\lambda_T}) = \int_Z\big[\varphi(y f_{T+1}(x)) - \varphi(y f^V_{\lambda_T}(x))\big]d\rho \le C_{K,\varphi}\lambda_T^{-p}\|f_{T+1} - f^V_{\lambda_T}\|_{C(X)} \le \kappa C_{K,\varphi}\lambda_T^{-p}\|f_{T+1} - f^V_{\lambda_T}\|_K,$$

where $C_{K,\varphi}$ is a constant depending on $K$ and $\varphi$. It is easy to check that the loss $V(y,f) = \varphi(yf)$ satisfies (17) with incremental exponent $p$. By Theorem 26 with $0 < \gamma < \frac{2}{5+10p-\beta}$, $\alpha = \frac{2+\gamma(2p+\beta-2)}{3}$ and $b > \gamma(2+2p)$, we have

$$\mathbb E_{z_1,z_2,\dots,z_T}\big(\|f_{T+1} - f^V_{\lambda_T}\|_K\big) \le \sqrt{C_{K,V,\rho,b,\beta,s}}\;T^{-\min\big\{\frac{2-\gamma(5+4p-\beta)}{6},\ \frac{b-\gamma(2+p)}{2}\big\}}.$$

Also, we have $\mathcal E(f^\varphi_{\lambda_T}) - \mathcal E(f^V_\rho) \le D(\lambda_T) \le D_0\lambda_T^\beta$. Thus we get a bound for the excess generalization error

$$\mathbb E_{z_1,\dots,z_T}\big(\mathcal E(f_{T+1}) - \mathcal E(f^V_\rho)\big) \le C_{\varphi,\beta,\gamma}\,T^{-\min\big\{\frac{2-\gamma(5+10p-\beta)}{6},\ \gamma\beta,\ \frac{b-\gamma(2+3p)}{2}\big\}},$$

where $C_{\varphi,\beta,\gamma} = \kappa C_{K,\varphi}\lambda_1^{-p}\sqrt{C_{K,V,\rho,b,\beta,s}} + D_0\lambda_1^\beta$. This verifies the desired bound.

Theorem 29 yields concrete learning rates with various loss functions. When $\varphi$ is chosen to be the hinge loss $\varphi(x) = (1-x)_+$, we can prove Theorem 11.

5.1 Proof of Theorem 11

When $0 < s \le 1$ and $K \in C^{2s}(X \times X)$, (16) holds true. The loss function $\varphi(x) = (1-x)_+$ satisfies $\varphi'(x) = -1$ for $x < 1$ and $\varphi'(x) = 0$ for $x > 1$. It follows that (17) holds true with $p = 0$ and $M_\varphi = 0$. By Example 5, (9) implies (11). Thus all the conditions in Theorem 29 are satisfied and by taking $p = 0$ and $\gamma = \frac14$, we have

$$\mathbb E_{z_1,\dots,z_T}\big(\mathcal E(f_{T+1}) - \mathcal E(f^V_\rho)\big) \le C_{\varphi,\beta,\gamma}\,T^{-\min\big\{\frac18 + \frac{\beta}{24},\ \frac{\beta}{4},\ \frac{b}{2} - \frac14\big\}}.$$
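The hinge-loss exponent just derived ($\gamma = 1/4$, $p = 0$) can be evaluated for concrete values of the approximation power $\beta$ and the convergence index $b$; the values below are illustrative only:

```python
def hinge_rate(beta, b):
    """Excess generalization exponent for the hinge loss with gamma = 1/4, p = 0:
    min{1/8 + beta/24, beta/4, b/2 - 1/4}. Larger means a faster rate."""
    return min(1 / 8 + beta / 24, beta / 4, b / 2 - 1 / 4)

# for small beta the approximation term beta/4 is the bottleneck;
# as beta -> 1 the sample-error term 1/8 + beta/24 takes over
for beta in (0.25, 0.5, 1.0):
    print(beta, hinge_rate(beta, b=2.0))
```

With $b$ moderately large, the non-identical-sampling term $b/2 - 1/4$ never binds, so the rate is governed entirely by the approximation power $\beta$.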
An important relation concerning the hinge loss is the one (Zhang, 2004) between the excess misclassification error and the excess generalization error, given for any measurable function $f : X \to \mathbb R$ as

$$\mathcal R(\mathrm{sgn}(f)) - \mathcal R(f_c) \le \mathcal E(f) - \mathcal E(f_c).$$

Combining this relation with the above bound for the excess generalization error proves the conclusion of Theorem 11.

Turn to the general loss $\varphi$. We give an additional assumption that $\varphi''(0)$ exists and is positive. Under this assumption it was proved in Chen et al. (2004) and Bartlett et al. (2006) that there exists a constant $c_\varphi$ depending only on $\varphi$ such that for any measurable function $f : X \to \mathbb R$,

$$\mathcal R(\mathrm{sgn}(f)) - \mathcal R(f_c) \le c_\varphi\sqrt{\mathcal E(f) - \mathcal E(f^\varphi_\rho)}.$$

Then Theorem 29 gives the following learning rate.

Corollary 30 Let $\varphi$ be a loss function such that $\varphi''(0)$ exists and is positive. Under the assumptions of Theorem 29, if $\gamma = \frac{2}{5+10p+5\beta}$, we have

$$\mathbb E_{z_1,\dots,z_T}\big(\mathcal R(\mathrm{sgn}(f_{T+1})) - \mathcal R(f_c)\big) \le C_{\varphi,\beta}\,T^{-\min\big\{\frac{\beta}{5+10p+5\beta},\ \frac{b}{4} - \frac{2+3p}{10+20p+10\beta}\big\}},$$

where $C_{\varphi,\beta}$ is a constant independent of $T$.

As an example, the $q$-norm SVM loss $\varphi(x) = ((1-x)_+)^q$ with $q > 1$ satisfies $\varphi''(0) > 0$ and (17) with $p = q-1$. So Corollary 30 yields the following rates.

Example 7 Let $\varphi(x) = ((1-x)_+)^q$ with $q > 1$. Under the assumptions of Theorem 29, if $\gamma = \frac{2}{10q-5+5\beta}$, $\alpha = \frac{8q-6+4\beta}{10q-5+5\beta}$ and $b > \frac{6q}{10q-5+5\beta}$, then

$$\mathbb E_{z_1,\dots,z_T}\big(\mathcal R(\mathrm{sgn}(f_{T+1})) - \mathcal R(f_c)\big) = O\Big(T^{-\min\big\{\frac{\beta}{10q-5+5\beta},\ \frac{b}{4} - \frac{3q-1}{20q-10+10\beta}\big\}}\Big).$$

Finally we verify the learning rates for regression with the insensitive loss stated in Section 2.

5.2 Proof of Theorem 10

We need the regularizing function $f^{V_{ls}}_\lambda$ defined by (13) with the least square loss $V = V_{ls}$. It can be found, for example, in Smale and Zhou (2007) that the regularity condition (7) implies

$$\|f^{V_{ls}}_\lambda - f_\rho\|_K \le \Big(\frac{\lambda}{2}\Big)^{r-1/2}\|g_\rho\|_{L^2_{\rho_X}} \quad\hbox{when } \frac12 \le r \le \frac32, \qquad \|f^{V_{ls}}_\lambda - f_\rho\|_{L^2_{\rho_X}} \le \Big(\frac{\lambda}{2}\Big)^r\|g_\rho\|_{L^2_{\rho_X}} \quad\hbox{when } 0 < r \le 1.$$

It follows that when $\lambda \le \big(\kappa\|g_\rho\|_{L^2_{\rho_X}}\big)^{2/(1-2r)}$, we have $\|f^{V_{ls}}_\lambda - f_\rho\|_{C(X)} \le 1$.
Thus, by the special form of the conditional distributions $\rho_x$, we see from the conclusion of Example 6 with $p = 1$ that
$f^{V_{in}}_\lambda = f^{V_{ls}}_\lambda$, and the bounds for $\|f^{V_{in}}_\lambda - f_\rho\|_K$ and $\|f^{V_{in}}_\lambda - f_\rho\|_{L^2_{\rho_X}}$ follow. Moreover, condition (12) for $D(\lambda)$ is valid with $\beta = 1$.

Now we check the other conditions of Proposition 28. Condition (16) is valid because $K \in C^{2s}(X \times X)$ with $0 < s \le 1$. By a simple computation, the incremental condition (17) is verified with exponent $p = 0$. Note that $\|f_\rho\|_K \le \kappa^{2r-1}\|g_\rho\|_{L^2_{\rho_X}}$ and $\|f_\rho\|_{C^s(X)} \le (\kappa + \kappa_{2s})\|f_\rho\|_K$. Then for any $x, u \in X$ and $g \in C^s(Y)$, we see from the uniform distributions $\rho_x$ and $\rho_u$ that

$$\int_Y g(y)\,d(\rho_x - \rho_u) = \frac12\Big[\int_{f_\rho(x)-1}^{f_\rho(x)+1} g(y)\,dy - \int_{f_\rho(u)-1}^{f_\rho(u)+1} g(y)\,dy\Big] \le \|g\|_{C(Y)}\,|f_\rho(x) - f_\rho(u)| \le \|g\|_{C(Y)}(\kappa + \kappa_{2s})\kappa^{2r-1}\|g_\rho\|_{L^2_{\rho_X}}(d(x,u))^s.$$

This verifies the Lipschitz $s$ condition (4) for $\{\rho_x\}$ with the constant $C_\rho = (\kappa + \kappa_{2s})\kappa^{2r-1}\|g_\rho\|_{L^2_{\rho_X}}$. Thus all the conditions of Proposition 28 are satisfied and we obtain

$$\mathbb E_{z_1,z_2,\dots,z_T}\big(\|f_{T+1} - f^{V_{in}}_{\lambda_T}\|_K^2\big) \le C_{K,V,\rho,b,\beta,s}\,T^{-\theta} \qquad\hbox{where } \theta := \min\{2 - 2\gamma - 2\alpha,\ \alpha - \gamma,\ b - 2\gamma\}.$$

Finally we get

$$\mathbb E_{z_1,z_2,\dots,z_T}\big(\|f_{T+1} - f_\rho\|_K\big) \le \sqrt{C_{K,V,\rho,b,\beta,s}}\,T^{-\theta/2} + \Big(\frac{\lambda_{T+1}}{2}\Big)^{r-1/2}\|g_\rho\|_{L^2_{\rho_X}} \quad\hbox{when } \frac12 < r \le \frac32,$$

and

$$\mathbb E_{z_1,z_2,\dots,z_T}\big(\|f_{T+1} - f_\rho\|_{L^2_{\rho_X}}\big) \le \kappa\sqrt{C_{K,V,\rho,b,\beta,s}}\,T^{-\theta/2} + \Big(\frac{\lambda_{T+1}}{2}\Big)^r\|g_\rho\|_{L^2_{\rho_X}} \quad\hbox{when } 0 < r \le 1.$$

Then our desired learning rates follow.

Acknowledgments

We would like to thank the referees for their constructive suggestions and comments. The work described in this paper was partially supported by a grant from the Research Grants Council of Hong Kong [Project No. CityU ] and the National Science Fund for Distinguished Young Scholars of China [Project No. ]. The corresponding author is Ding-Xuan Zhou.

Appendix A.

This appendix includes some detailed proofs.
A.1 Proof of Proposition 6

The first statement follows by taking $g(y) = y$ on $Y$ because

$$|f_\rho(x) - f_\rho(u)| = \Big|\int_Y g(y)\,d(\rho_x - \rho_u)(y)\Big| \le \|\rho_x - \rho_u\|_{(C^s(Y))^*}\,\|g\|_{C^s(Y)}$$

and

$$\|g\|_{C^s(Y)} = \|g\|_{C(Y)} + |g|_{C^s(Y)} \le \sup_{y \in Y}|y| + 2^{1-s}\sup_{y \in Y}|y|.$$

Actually the above estimates tell us that $f_\rho$ is continuous and belongs to $C^s(X)$ with

$$\|f_\rho\|_{C^s(X)} \le C_\rho\big(1 + 2^{1-s}\big)\sup_{y \in Y}|y|.$$

For the second statement, since $Y = \{1,-1\}$, we have $f_\rho(x) = \rho_x(1) - \rho_x(-1)$. It follows that for each $y \in Y$ and $x \in X$, there holds $\rho_x(y) = \frac{1 + y f_\rho(x)}{2}$. So for any $g \in C^s(Y)$ and $x, u \in X$,

$$\int_Y g(y)\,d(\rho_x - \rho_u)(y) = \sum_{y \in Y} g(y)\,\frac{y\,[f_\rho(x) - f_\rho(u)]}{2} = \Big(\sum_{y \in Y}\frac{y\,g(y)}{2}\Big)\big[f_\rho(x) - f_\rho(u)\big].$$

Now the conclusion follows from

$$\Big|\int_Y g(y)\,d(\rho_x - \rho_u)(y)\Big| \le \|g\|_{C(Y)}\,|f_\rho(x) - f_\rho(u)| \le \|f_\rho\|_{C^s(X)}(d(x,u))^s\,\|g\|_{C^s(Y)}.$$

This proves Proposition 6.

A.2 Proof of Example 4

For $x \in X$, we have $\rho_x(1) = f_{\rho,-1}(x) + f_\rho(x)$ and $\rho_x(0) = 1 - 2f_{\rho,-1}(x) - f_\rho(x)$. Hence for any $g \in C^s(Y)$,

$$\int_Y g(y)\,d\rho_x = f_{\rho,-1}(x)\big[g(1) - 2g(0) + g(-1)\big] + f_\rho(x)\big[g(1) - g(0)\big] + g(0)$$

and for $u \in X$,

$$\int_Y g(y)\,d(\rho_x - \rho_u) = \big[f_{\rho,-1}(x) - f_{\rho,-1}(u)\big]\big[g(1) - 2g(0) + g(-1)\big] + \big[f_\rho(x) - f_\rho(u)\big]\big[g(1) - g(0)\big].$$

Then our statement follows from the first part of Proposition 6. This proves the conclusion of Example 4.

A.3 Proof of Example 6

Let $x \in X$. When $f(x) \ge f^{V_{in}}_\rho(x)$, we see from the explicit form of the insensitive loss $V_{in}$ that

$$\int_Y V_{in}(y, f(x))\,d\rho_x(y) - \int_Y V_{in}(y, f^{V_{in}}_\rho(x))\,d\rho_x(y) = \int_{\{y \le f^{V_{in}}_\rho(x)\}}\big[f(x) - f^{V_{in}}_\rho(x)\big]d\rho_x - \int_{\{y \ge f(x)\}}\big[f(x) - f^{V_{in}}_\rho(x)\big]d\rho_x + \int_{\{f^{V_{in}}_\rho(x) < y < f(x)\}}\big[f(x) + f^{V_{in}}_\rho(x) - 2y\big]d\rho_x.$$

It follows that

$$\mathcal E(f) - \mathcal E(f^{V_{in}}_\rho) = \int_X\Big\{\big[f(x) - f^{V_{in}}_\rho(x)\big]\Delta_x + 2\int_{I_x}|f(x) - y|\,d\rho_x\Big\}\,d\rho_X \tag{31}$$
where

$$\Delta_x := \rho_x\big(\{y : y \le f^{V_{in}}_\rho(x)\}\big) - \rho_x\big(\{y : y > f^{V_{in}}_\rho(x)\}\big)$$

and $I_x$ is the open interval between $f^{V_{in}}_\rho(x)$ and $f(x)$. The same relation (31) also holds when $f(x) < f^{V_{in}}_\rho(x)$.

Now we use the special assumption on the conditional distributions and see that the median and the mean of $\rho_x$ are equal: $f^{V_{in}}_\rho(x) = f_\rho(x)$ for each $x \in X$. Moreover, $\Delta_x = 0$, and when $f_\rho(x) \le f(x) \le f_\rho(x) + 1$, we have

$$\int_{I_x}|f(x) - y|\,d\rho_x = \frac{p}{2}\int_0^{f(x) - f_\rho(x)}\big(f(x) - f_\rho(x) - u\big)u^{p-1}\,du = \frac{|f(x) - f_\rho(x)|^{p+1}}{2(p+1)}.$$

The same expression holds true when $f_\rho(x) - 1 \le f(x) < f_\rho(x)$. When $|f(x) - f_\rho(x)| > 1$, since $\rho_x$ vanishes outside $[f_\rho(x) - 1,\ f_\rho(x) + 1]$, we have

$$\int_{I_x}|f(x) - y|\,d\rho_x = \frac{|f(x) - f_\rho(x)|}{2} - \frac{p}{2(p+1)}.$$

Therefore, (31) is the same as

$$\mathcal E(f) - \mathcal E(f^{V_{in}}_\rho) = \int_{\{x : |f(x) - f_\rho(x)| \le 1\}}\frac{|f(x) - f_\rho(x)|^{p+1}}{p+1}\,d\rho_X + \int_{\{x : |f(x) - f_\rho(x)| > 1\}}\Big[|f(x) - f_\rho(x)| - \frac{p}{p+1}\Big]\,d\rho_X.$$

This proves the desired expression for $\mathcal E(f) - \mathcal E(f^{V_{in}}_\rho)$ and hence the bound for $D(\lambda)$.

References

N. Aronszajn. Theory of reproducing kernels. Transactions of the American Mathematical Society, 68, 1950.

P. L. Bartlett, M. I. Jordan, and J. D. McAuliffe. Convexity, classification, and risk bounds. Journal of the American Statistical Association, 101, 2006.

A. Caponnetto, L. Rosasco and Y. Yao. On early stopping in gradient descent learning. Constructive Approximation, 26:289-315, 2007.

D.-R. Chen, Q. Wu, Y. Ying and D.-X. Zhou. Support vector machine soft margin classifiers: error analysis. Journal of Machine Learning Research, 5, 2004.

N. Cesa-Bianchi, P. Long and M. K. Warmuth. Worst-case quadratic loss bounds for prediction using linear functions and gradient descent. IEEE Transactions on Neural Networks, 7, 1996.

N. Cesa-Bianchi, A. Conconi and C. Gentile. On the generalization ability of on-line learning algorithms. IEEE Transactions on Information Theory, 50, 2004.

E. De Vito, A. Caponnetto, and L. Rosasco. Model selection for regularized least-squares algorithm in learning theory. Foundations of Computational Mathematics, 5:59-85, 2005.

E. De Vito, L. Rosasco, A.
Caponnetto, M. Piana, and A. Verri. Some properties of regularized kernel methods. Journal of Machine Learning Research, 5, 2004.
More informationIndirect Rule Learning: Support Vector Machines. Donglin Zeng, Department of Biostatistics, University of North Carolina
Indirect Rule Learning: Support Vector Machines Indirect learning: loss optimization It doesn t estimate the prediction rule f (x) directly, since most loss functions do not have explicit optimizers. Indirection
More informationBINARY CLASSIFICATION
BINARY CLASSIFICATION MAXIM RAGINSY The problem of binary classification can be stated as follows. We have a random couple Z = X, Y ), where X R d is called the feature vector and Y {, } is called the
More informationSurrogate loss functions, divergences and decentralized detection
Surrogate loss functions, divergences and decentralized detection XuanLong Nguyen Department of Electrical Engineering and Computer Science U.C. Berkeley Advisors: Michael Jordan & Martin Wainwright 1
More informationOnline Passive-Aggressive Algorithms
Online Passive-Aggressive Algorithms Koby Crammer Ofer Dekel Shai Shalev-Shwartz Yoram Singer School of Computer Science & Engineering The Hebrew University, Jerusalem 91904, Israel {kobics,oferd,shais,singer}@cs.huji.ac.il
More informationLecture 14 : Online Learning, Stochastic Gradient Descent, Perceptron
CS446: Machine Learning, Fall 2017 Lecture 14 : Online Learning, Stochastic Gradient Descent, Perceptron Lecturer: Sanmi Koyejo Scribe: Ke Wang, Oct. 24th, 2017 Agenda Recap: SVM and Hinge loss, Representer
More informationRelationships between upper exhausters and the basic subdifferential in variational analysis
J. Math. Anal. Appl. 334 (2007) 261 272 www.elsevier.com/locate/jmaa Relationships between upper exhausters and the basic subdifferential in variational analysis Vera Roshchina City University of Hong
More informationBack to the future: Radial Basis Function networks revisited
Back to the future: Radial Basis Function networks revisited Qichao Que, Mikhail Belkin Department of Computer Science and Engineering Ohio State University Columbus, OH 4310 que, mbelkin@cse.ohio-state.edu
More informationSecond Order Optimization Algorithms I
Second Order Optimization Algorithms I Yinyu Ye Department of Management Science and Engineering Stanford University Stanford, CA 94305, U.S.A. http://www.stanford.edu/ yyye Chapters 7, 8, 9 and 10 1 The
More informationReproducing Kernel Hilbert Spaces
Reproducing Kernel Hilbert Spaces Lorenzo Rosasco 9.520 Class 03 February 12, 2007 About this class Goal To introduce a particularly useful family of hypothesis spaces called Reproducing Kernel Hilbert
More informationOslo Class 2 Tikhonov regularization and kernels
RegML2017@SIMULA Oslo Class 2 Tikhonov regularization and kernels Lorenzo Rosasco UNIGE-MIT-IIT May 3, 2017 Learning problem Problem For H {f f : X Y }, solve min E(f), f H dρ(x, y)l(f(x), y) given S n
More informationDivergences, surrogate loss functions and experimental design
Divergences, surrogate loss functions and experimental design XuanLong Nguyen University of California Berkeley, CA 94720 xuanlong@cs.berkeley.edu Martin J. Wainwright University of California Berkeley,
More information9.2 Support Vector Machines 159
9.2 Support Vector Machines 159 9.2.3 Kernel Methods We have all the tools together now to make an exciting step. Let us summarize our findings. We are interested in regularized estimation problems of
More informationSupport Vector Machines
Wien, June, 2010 Paul Hofmarcher, Stefan Theussl, WU Wien Hofmarcher/Theussl SVM 1/21 Linear Separable Separating Hyperplanes Non-Linear Separable Soft-Margin Hyperplanes Hofmarcher/Theussl SVM 2/21 (SVM)
More information2. Function spaces and approximation
2.1 2. Function spaces and approximation 2.1. The space of test functions. Notation and prerequisites are collected in Appendix A. Let Ω be an open subset of R n. The space C0 (Ω), consisting of the C
More informationLearning with Rejection
Learning with Rejection Corinna Cortes 1, Giulia DeSalvo 2, and Mehryar Mohri 2,1 1 Google Research, 111 8th Avenue, New York, NY 2 Courant Institute of Mathematical Sciences, 251 Mercer Street, New York,
More informationForchheimer Equations in Porous Media - Part III
Forchheimer Equations in Porous Media - Part III Luan Hoang, Akif Ibragimov Department of Mathematics and Statistics, Texas Tech niversity http://www.math.umn.edu/ lhoang/ luan.hoang@ttu.edu Applied Mathematics
More informationRobust Support Vector Machines for Probability Distributions
Robust Support Vector Machines for Probability Distributions Andreas Christmann joint work with Ingo Steinwart (Los Alamos National Lab) ICORS 2008, Antalya, Turkey, September 8-12, 2008 Andreas Christmann,
More informationSparseness of Support Vector Machines
Journal of Machine Learning Research 4 2003) 07-05 Submitted 0/03; Published /03 Sparseness of Support Vector Machines Ingo Steinwart Modeling, Algorithms, and Informatics Group, CCS-3 Mail Stop B256 Los
More information(Kernels +) Support Vector Machines
(Kernels +) Support Vector Machines Machine Learning Torsten Möller Reading Chapter 5 of Machine Learning An Algorithmic Perspective by Marsland Chapter 6+7 of Pattern Recognition and Machine Learning
More informationSEPARABILITY AND COMPLETENESS FOR THE WASSERSTEIN DISTANCE
SEPARABILITY AND COMPLETENESS FOR THE WASSERSTEIN DISTANCE FRANÇOIS BOLLEY Abstract. In this note we prove in an elementary way that the Wasserstein distances, which play a basic role in optimal transportation
More informationHomework Assignment #2 for Prob-Stats, Fall 2018 Due date: Monday, October 22, 2018
Homework Assignment #2 for Prob-Stats, Fall 2018 Due date: Monday, October 22, 2018 Topics: consistent estimators; sub-σ-fields and partial observations; Doob s theorem about sub-σ-field measurability;
More informationReflected Brownian Motion
Chapter 6 Reflected Brownian Motion Often we encounter Diffusions in regions with boundary. If the process can reach the boundary from the interior in finite time with positive probability we need to decide
More informationThe Kernel Trick. Robert M. Haralick. Computer Science, Graduate Center City University of New York
The Kernel Trick Robert M. Haralick Computer Science, Graduate Center City University of New York Outline SVM Classification < (x 1, c 1 ),..., (x Z, c Z ) > is the training data c 1,..., c Z { 1, 1} specifies
More informationKernels for Multi task Learning
Kernels for Multi task Learning Charles A Micchelli Department of Mathematics and Statistics State University of New York, The University at Albany 1400 Washington Avenue, Albany, NY, 12222, USA Massimiliano
More informationLinear & nonlinear classifiers
Linear & nonlinear classifiers Machine Learning Hamid Beigy Sharif University of Technology Fall 1394 Hamid Beigy (Sharif University of Technology) Linear & nonlinear classifiers Fall 1394 1 / 34 Table
More informationSingular Integrals. 1 Calderon-Zygmund decomposition
Singular Integrals Analysis III Calderon-Zygmund decomposition Let f be an integrable function f dx 0, f = g + b with g Cα almost everywhere, with b
More informationPerceptron Revisited: Linear Separators. Support Vector Machines
Support Vector Machines Perceptron Revisited: Linear Separators Binary classification can be viewed as the task of separating classes in feature space: w T x + b > 0 w T x + b = 0 w T x + b < 0 Department
More informationWorst-Case Analysis of the Perceptron and Exponentiated Update Algorithms
Worst-Case Analysis of the Perceptron and Exponentiated Update Algorithms Tom Bylander Division of Computer Science The University of Texas at San Antonio San Antonio, Texas 7849 bylander@cs.utsa.edu April
More informationSparseness Versus Estimating Conditional Probabilities: Some Asymptotic Results
Sparseness Versus Estimating Conditional Probabilities: Some Asymptotic Results Peter L. Bartlett 1 and Ambuj Tewari 2 1 Division of Computer Science and Department of Statistics University of California,
More informationFunctional Analysis. Franck Sueur Metric spaces Definitions Completeness Compactness Separability...
Functional Analysis Franck Sueur 2018-2019 Contents 1 Metric spaces 1 1.1 Definitions........................................ 1 1.2 Completeness...................................... 3 1.3 Compactness......................................
More information9.520: Class 20. Bayesian Interpretations. Tomaso Poggio and Sayan Mukherjee
9.520: Class 20 Bayesian Interpretations Tomaso Poggio and Sayan Mukherjee Plan Bayesian interpretation of Regularization Bayesian interpretation of the regularizer Bayesian interpretation of quadratic
More informationPartial Differential Equations
Part II Partial Differential Equations Year 2015 2014 2013 2012 2011 2010 2009 2008 2007 2006 2005 2015 Paper 4, Section II 29E Partial Differential Equations 72 (a) Show that the Cauchy problem for u(x,
More informationRANDOM FIELDS AND GEOMETRY. Robert Adler and Jonathan Taylor
RANDOM FIELDS AND GEOMETRY from the book of the same name by Robert Adler and Jonathan Taylor IE&M, Technion, Israel, Statistics, Stanford, US. ie.technion.ac.il/adler.phtml www-stat.stanford.edu/ jtaylor
More informationNONLINEAR CLASSIFICATION AND REGRESSION. J. Elder CSE 4404/5327 Introduction to Machine Learning and Pattern Recognition
NONLINEAR CLASSIFICATION AND REGRESSION Nonlinear Classification and Regression: Outline 2 Multi-Layer Perceptrons The Back-Propagation Learning Algorithm Generalized Linear Models Radial Basis Function
More informationConvergence Rates of Kernel Quadrature Rules
Convergence Rates of Kernel Quadrature Rules Francis Bach INRIA - Ecole Normale Supérieure, Paris, France ÉCOLE NORMALE SUPÉRIEURE NIPS workshop on probabilistic integration - Dec. 2015 Outline Introduction
More informationRichard F. Bass Krzysztof Burdzy University of Washington
ON DOMAIN MONOTONICITY OF THE NEUMANN HEAT KERNEL Richard F. Bass Krzysztof Burdzy University of Washington Abstract. Some examples are given of convex domains for which domain monotonicity of the Neumann
More informationELEMENTS OF PROBABILITY THEORY
ELEMENTS OF PROBABILITY THEORY Elements of Probability Theory A collection of subsets of a set Ω is called a σ algebra if it contains Ω and is closed under the operations of taking complements and countable
More informationAn introduction to some aspects of functional analysis
An introduction to some aspects of functional analysis Stephen Semmes Rice University Abstract These informal notes deal with some very basic objects in functional analysis, including norms and seminorms
More informationLecture No 1 Introduction to Diffusion equations The heat equat
Lecture No 1 Introduction to Diffusion equations The heat equation Columbia University IAS summer program June, 2009 Outline of the lectures We will discuss some basic models of diffusion equations and
More informationFull-information Online Learning
Introduction Expert Advice OCO LM A DA NANJING UNIVERSITY Full-information Lijun Zhang Nanjing University, China June 2, 2017 Outline Introduction Expert Advice OCO 1 Introduction Definitions Regret 2
More informationStochastic and online algorithms
Stochastic and online algorithms stochastic gradient method online optimization and dual averaging method minimizing finite average Stochastic and online optimization 6 1 Stochastic optimization problem
More information