Online Learning with Samples Drawn from Non-identical Distributions

Journal of Machine Learning Research 10 (2009) Submitted 12/08; Revised 7/09; Published 12/09

Online Learning with Samples Drawn from Non-identical Distributions

Ting Hu
Ding-Xuan Zhou
Department of Mathematics
City University of Hong Kong
Tat Chee Avenue, Kowloon, Hong Kong, China

Editor: Peter Bartlett

Abstract

Learning algorithms are based on samples which are often drawn independently from an identical distribution (i.i.d.). In this paper we consider a different setting, with samples drawn according to a non-identical sequence of probability distributions: each time a sample is drawn from a different distribution. In this setting we investigate a fully online learning algorithm associated with a general convex loss function and a reproducing kernel Hilbert space (RKHS). Error analysis is conducted under the assumption that the sequence of marginal distributions converges polynomially in the dual of a Hölder space. For regression with least square or insensitive loss, learning rates are given in both the RKHS norm and the L² norm. For classification with the hinge loss and the support vector machine q-norm loss, rates are stated explicitly with respect to the excess misclassification error.

Keywords: sampling with non-identical distributions, online learning, classification with a general convex loss, regression with insensitive loss and least square loss, reproducing kernel Hilbert space

1. Introduction

In the literature of learning theory, samples for algorithms are often assumed to be drawn independently from an identical distribution. Here we consider a setting with samples drawn from non-identical distributions. Such a framework was introduced in Smale and Zhou (2009) and Steinwart et al. (2008), where online learning for least square regression and off-line support vector machines are investigated. We shall follow this framework and study a kernel based online learning algorithm associated with a general convex loss function.
Our analysis can be applied for various purposes including regression and classification.

1.1 Sampling with Non-identical Distributions

Let (X, d) be a metric space, called the input space for the learning problem. Let Y be a compact subset of R (the output space) and Z = X × Y. In our online learning setting, at each step t = 1, 2, ..., a pair z_t = (x_t, y_t) is drawn from a probability distribution ρ^(t) on Z. The sampling sequence of probability distributions {ρ^(t)}_{t=1,2,...} is not identical. For convergence analysis, we shall assume that the sequence {ρ_X^(t)}_{t=1,2,...} of marginal distributions on X converges polynomially in the dual of the Hölder space C^s(X) for some 0 < s ≤ 1.

© 2009 Ting Hu and Ding-Xuan Zhou.

Here the Hölder space C^s(X) is defined to be the space of all continuous functions on X with the norm ‖f‖_{C^s(X)} = ‖f‖_{C(X)} + |f|_{C^s(X)} finite, where

|f|_{C^s(X)} := sup_{x ≠ y} |f(x) − f(y)| / (d(x,y))^s.

Definition 1 We say that the sequence {ρ_X^(t)}_{t=1,2,...} converges polynomially to a probability distribution ρ_X in (C^s(X))^* (0 < s ≤ 1) if there exist C > 0 and b > 0 such that

‖ρ_X^(t) − ρ_X‖_{(C^s(X))^*} ≤ C t^{−b},  for all t ∈ N.  (1)

By the definition of the dual space (C^s(X))^*, decay condition (1) can be expressed as

| ∫_X f(x) dρ_X^(t) − ∫_X f(x) dρ_X | ≤ C t^{−b} ‖f‖_{C^s(X)},  for all f ∈ C^s(X), t ∈ N.  (2)

What measures quantitatively the difference between our non-identical setting and the i.i.d. case is the power index b. Its impact on the performance of online learning algorithms will be studied in this paper. The i.i.d. case corresponds to b = ∞.

We describe three situations in which decay condition (1) is satisfied. The first is when a distribution ρ_X is perturbed by some noise and the noise level decreases as t increases.

Example 1 Let {h^(t)} be a sequence of bounded functions on X such that sup_{x∈X} |h^(t)(x)| ≤ C t^{−b}. Then the sequence {ρ_X^(t)}_{t=1,2,...} defined by dρ_X^(t) = dρ_X + h^(t)(x) dρ_X satisfies (1) for any 0 < s ≤ 1. The proof follows from

| ∫_X f(x) h^(t)(x) dρ_X | ≤ sup_{x∈X} |h^(t)(x)| ‖f‖_{C(X)} ≤ C t^{−b} ‖f‖_{C^s(X)}.

In this example, h^(t) is the density function of the noise distribution and we assume its bound (the noise level) to decay polynomially as t increases.

The second situation in which decay condition (1) is satisfied is generated by iterative actions of an integral operator associated with a stochastic density kernel. We demonstrate this situation by an example on X = S^{n−1}, the unit sphere of R^n with n ≥ 2. Let ds be the normalized surface element of S^{n−1}. The corresponding space L²(S^{n−1}) has an orthonormal basis {Y_{l,k} : l ∈ Z_+, k = 1, ..., N(n,l)} with N(n,0) = 1 and N(n,l) = (2l+n−2) (l+n−3)! / ((n−2)! l!). Here Y_{l,k} is a spherical harmonic of order l, that is, the restriction onto S^{n−1} of a homogeneous polynomial in R^n of degree l satisfying the Laplace equation Δf = 0. In particular, Y_{0,0} ≡ 1.
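As a quick numerical sanity check (our own addition, not from the paper), the dimension count N(n, l) above agrees with the standard binomial-difference formula for the space of spherical harmonics of order l on S^{n−1}; the helper names below are ours:

```python
from math import comb, factorial

def N_formula(n, l):
    """N(n, l) = (2l + n - 2) (l + n - 3)! / ((n - 2)! l!) for l >= 1, and N(n, 0) = 1."""
    if l == 0:
        return 1
    return (2 * l + n - 2) * factorial(l + n - 3) // (factorial(n - 2) * factorial(l))

def N_binomial(n, l):
    """Equivalent count: dimension of degree-l harmonic polynomials in R^n,
    binom(n + l - 1, l) - binom(n + l - 3, l - 2)."""
    return comb(n + l - 1, l) - (comb(n + l - 3, l - 2) if l >= 2 else 0)
```

For n = 3 (the sphere S²) both expressions reduce to the familiar N = 2l + 1.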
Example 2 Let X = S^{n−1}, 0 < α < 1, and let ψ ∈ C(X × X) be given by

ψ(x,u) = 1 + Σ_{l=1}^∞ Σ_{k=1}^{N(n,l)} a_{l,k} Y_{l,k}(x) Y_{l,k}(u),

where 0 ≤ a_{l,k} ≤ α and ‖Σ_{l=1}^∞ Σ_{k=1}^{N(n,l)} a_{l,k} Y_{l,k} Y_{l,k}‖_{C(X×X)} < 1. If h^(1) is a square integrable density function on X and a sequence of density functions {h^(t)} is defined by

h^(t+1)(x) = ∫_X ψ(x,u) h^(t)(u) ds(u),  x ∈ X, t ∈ N,

then we know

h^(t) = Y_{0,0} + Σ_{l=1}^∞ Σ_{k=1}^{N(n,l)} a_{l,k}^{t−1} ⟨h^(1), Y_{l,k}⟩_{L²(S^{n−1})} Y_{l,k}

and ‖h^(t) − Y_{0,0}‖_{L²(S^{n−1})} ≤ α^{t−1} ‖h^(1)‖_{L²(S^{n−1})}. It follows that the sequence {ρ_X^(t) = h^(t)(x) ds}_{t=1,2,...} of probability distributions on X converges polynomially to the uniform distribution dρ_X = ds on X and satisfies (1) for any 0 < s ≤ 1.

In general, if ν is a strictly positive probability distribution on X, and if ψ ∈ C(X × X) is strictly positive satisfying ∫_X ψ(x,u) dν(u) = 1 for each x ∈ X, then the sequence {ρ_X^(t)} defined on Borel sets Γ ⊆ X by

ρ_X^(t)(Γ) = ∫_Γ ( ∫_X ψ(x,u) dρ_X^(t−1)(x) ) dν(u)

satisfies ‖ρ_X^(t) − ρ_X‖_{(C(X))^*} ≤ C α^t for some (strictly positive) probability distribution ρ_X on X and constants C > 0, 0 < α < 1. Hence decay condition (1) is valid for any 0 < s ≤ 1. For details, see Smale and Zhou (2009).

The third situation realizing decay condition (1) is to induce distributions by dynamical systems. Here we present a simple example.

Example 3 Let X = [−1/2, 1/2] and for each t ∈ N let the probability distribution ρ_X^(t) on X have support [−2^{−t}, 2^{−t}] and uniform density 2^{t−1} on its support. Then with δ_0 the delta distribution at the origin, for each 0 < s ≤ 1 we have

| ∫_X f(x) dρ_X^(t) − ∫_X f(x) dδ_0 | ≤ 2^{t−1} ∫_{−2^{−t}}^{2^{−t}} |f(x) − f(0)| dx ≤ (1+s)^{−1} 2^{−st} ‖f‖_{C^s(X)}.

Remark 2 Since ‖f‖_{C(X)} ≤ ‖f‖_{C^s(X)}, we see from (2) that decay condition (1) with any 0 < s ≤ 1 is satisfied whenever the polynomial convergence requirement is valid in the case s = 0. This happens in Examples 1 and 2. Note that when s = 0, the dual space (C(X))^* is exactly the space of signed finite measures on X. Each signed finite measure µ on X lies in (C(X))^* ⊆ (C^s(X))^* and satisfies ‖µ‖_{(C^s(X))^*} ≤ ‖µ‖_{(C(X))^*} ≤ ∫_X d|µ|.

1.2 Fully Online Learning Algorithm

In this paper we study a family of online learning algorithms associated with reproducing kernel Hilbert spaces and a general convex loss function. A reproducing kernel Hilbert space (RKHS) is induced by a Mercer kernel K : X × X → R, a continuous and symmetric function such that the matrix (K(x_i, x_j))_{i,j=1}^l is positive semidefinite for any finite set of points {x_1, ..., x_l} ⊂ X. The RKHS H_K is the completion (Aronszajn, 1950) of the span of the set of functions {K_x = K(x, ·) : x ∈ X} with the inner product given by ⟨K_x, K_y⟩_K = K(x,y).

Definition 3 We say that V : Y × R → R_+ is a convex loss function if for each y ∈ Y, the univariate function V(y, ·) : R → R_+ is convex.
The convexity tells us (Rockafellar, 1970) that for each f ∈ R and y ∈ Y, the left derivative lim_{δ→0−} (V(y, f + δ) − V(y, f))/δ exists and is no more than the right derivative lim_{δ→0+} (V(y, f + δ) − V(y, f))/δ. An arbitrary number between them (a subgradient) will be taken and denoted as V′(y, f) in our algorithm. For the least square regression problem, we can take V(y, f) = (y − f)². For the binary classification problem, we can take V(y, f) = φ(y f) with φ : R → R_+ a convex function. The online algorithm associated with the RKHS H_K and the convex loss V is a stochastic gradient descent method (Cesa-Bianchi et al., 1996; Kivinen et al., 2004; Smale and Yao, 2006; Ying and Zhou, 2006; Ying, 2007).

Definition 4 The fully online learning algorithm is defined by f_1 = 0 and

f_{t+1} = f_t − η_t { V′(y_t, f_t(x_t)) K_{x_t} + λ_t f_t },  for t = 1, 2, ...,  (3)

where λ_t > 0 is called the regularization parameter and η_t > 0 the step size.

In this fully online algorithm, the regularization parameter λ_t changes with the learning step t. Throughout the paper we assume that λ_{t+1} ≤ λ_t for each t ∈ N. When the regularization parameter λ_t ≡ λ_1 does not change as the step t develops, we call scheme (3) partially online.

The goal of this paper is to investigate the fully online learning algorithm (3) when the sampling sequence is not identical. We will show that learning rates in the non-identical setting can be the same as those in the i.i.d. case when the power index b in polynomial decay condition (1) is large enough, that is, when ρ^(t) converges fast to ρ. When b is small, the non-identical effect becomes crucial and the learning rates depend essentially on b.

2. Error Bounds for Regression and Classification

As in the work on least square regression (Smale and Zhou, 2009), we assume for the sampling sequence {ρ^(t)}_{t=1,2,...} that the conditional distribution ρ^(t)(y|x) of each ρ^(t) at x ∈ X is independent of t, denoted as ρ_x. Throughout the paper we assume independence of the sampling, that is, {z_t = (x_t, y_t)}_t is a sequence of samples drawn from the product probability space Π_{t=1,2,...} (Z, ρ^(t)). Error analysis will be conducted for fully online learning algorithm (3) under polynomial decay condition (1) for the sequence of marginal distributions {ρ_X^(t)}. Let ρ be the probability distribution on Z given by the marginal distribution ρ_X and the conditional distributions {ρ_x}. The essential difficulty in our non-identical setting is caused by the deviation of ρ^(t) from ρ. The first novelty of our analysis is to deal with an error quantity Δ_t involving ρ^(t) − ρ (defined by (15) below) which occurs only in the non-identical setting.
This is handled for a general loss function V and output space Y by Lemma 18 in Section 3, under decay condition (1) for the marginal distributions {ρ_X^(t)} and Lipschitz-s continuity of the conditional distributions {ρ_x : x ∈ X}.

Definition 5 We say that the set of distributions {ρ_x : x ∈ X} is Lipschitz-s in (C^s(Y))^* if there exists a constant C_ρ ≥ 0 such that

‖ρ_x − ρ_u‖_{(C^s(Y))^*} ≤ C_ρ (d(x,u))^s,  for all x, u ∈ X.  (4)

Notice that on the compact subset Y of R, the Hölder space C^s(Y) and its dual (C^s(Y))^* are well defined, and each ρ_x belongs to (C^s(Y))^*.

The second novelty of our analysis is to show, for the least square loss (described in Section 3) and for binary classification, that Lipschitz-s continuity (4) of {ρ_x : x ∈ X} is the same as requiring f_ρ ∈ C^s(X), where f_ρ is the regression function defined by

f_ρ(x) = ∫_Y y dρ_x(y),  x ∈ X.  (5)

Proposition 6 Let 0 < s ≤ 1. Condition (4) implies f_ρ ∈ C^s(X) with ‖f_ρ‖_{C^s(X)} ≤ C_ρ (1 + 1/s) sup_{y∈Y} |y|. When Y = {1, −1}, f_ρ ∈ C^s(X) also implies (4), with C_ρ ≤ 2 ‖f_ρ‖_{C^s(X)}.

The two-point nature of the output space Y for binary classification plays a crucial role in our observation. The second statement of Proposition 6 is not true for a general output space. Here is one example.

Example 4 Let 0 < s ≤ 1 and Y = {1, −1, 0}. Then condition (4) holds if and only if f_ρ ∈ C^s(X) and f_{ρ,−1} ∈ C^s(X), where f_{ρ,−1} is the function on X given by f_{ρ,−1}(x) = ρ_x({−1}).

Proofs of Proposition 6 and Example 4 will be given in the appendix.

Our third novelty is to understand some essential differences between our non-identical setting and the classical i.i.d. setting by pointing out the key role played by the power index b of polynomial decay condition ‖ρ_X^(t) − ρ_X‖_{(C^s(X))^*} ≤ C t^{−b} in the convergence rates derived in Theorems 7 and 10 for regression and Theorem 11 for classification. Even for least square regression our result improves the error analysis in Smale and Zhou (2009), where a stronger exponential decay condition ‖ρ_X^(t) − ρ_X‖_{(C^s(X))^*} ≤ C α^t is assumed.

Our error bounds for fully online algorithm (3) are comparable with those for a batch learning algorithm generated by the off-line regularization scheme in H_K, defined with a sample z := {z_t = (x_t, y_t)}_{t=1}^T and a regularization parameter λ > 0 as

f_{z,λ} = arg min_{f ∈ H_K} { (1/T) Σ_{t=1}^T V(y_t, f(x_t)) + (λ/2) ‖f‖_K² }.  (6)

Let us demonstrate our error analysis by learning rates for regression with least square loss and insensitive loss and for binary classification with the hinge loss.

2.1 Learning Rates for Least Square Regression

Here we take Y = [−M, M] for some M > 0 and the least square loss V = V^ls as V^ls(y, f) = (y − f)². Then the algorithm takes the form

f_{t+1} = f_t − η_t { (f_t(x_t) − y_t) K_{x_t} + λ_t f_t },  for t = 1, 2, ....

The following learning rates are derived by the procedure in Smale and Zhou (2009), where an exponential decay condition is assumed; here we only impose the much weaker polynomial decay condition (1).
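To make the least square update concrete, here is a minimal sketch of scheme (3) for this loss (our own illustration, not code from the paper): f_t is represented by coefficients on the kernel sections of the points seen so far, since f_{t+1} = (1 − η_t λ_t) f_t − η_t (f_t(x_t) − y_t) K_{x_t}. The Gaussian kernel and all parameter values are arbitrary choices.

```python
import numpy as np

def kernel(x, u, sigma=0.5):
    # A Mercer kernel; the Gaussian RBF kernel is one convenient choice.
    return np.exp(-((x - u) ** 2) / (2 * sigma ** 2))

def online_least_squares(samples, lam, eta):
    """f_1 = 0; f_{t+1} = f_t - eta_t * ((f_t(x_t) - y_t) K_{x_t} + lam_t * f_t).
    f_t is stored as coefficients c_i on the kernel sections K_{x_i}."""
    xs, cs = [], []
    for t, (x, y) in enumerate(samples, start=1):
        f_x = sum(c * kernel(u, x) for c, u in zip(cs, xs))   # f_t(x_t)
        cs = [(1.0 - eta(t) * lam(t)) * c for c in cs]        # the -eta_t*lam_t*f_t part
        xs.append(x)
        cs.append(-eta(t) * (f_x - y))                        # gradient step on the loss
    return xs, cs

def predict(xs, cs, x):
    return sum(c * kernel(u, x) for c, u in zip(cs, xs))
```

With the schedules of Theorem 7, e.g. r = 3/2, one would take lam = lambda t: lam1 * t ** -0.25 and eta = lambda t: eta1 * t ** -0.75 for small positive constants lam1, eta1.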
We also assume the regularity condition (of order r > 0)

f_ρ = L_K^r(g_ρ)  for some g_ρ ∈ L²_{ρ_X},  (7)

where L_K is the integral operator on L²_{ρ_X} defined by L_K f(x) = ∫_X K(x,v) f(v) dρ_X(v), with the power L_K^r of this compact operator well defined.

Theorem 7 Let 0 < s ≤ 1 and 1/2 < r ≤ 3/2. Assume K ∈ C^{2s}(X × X), regularity condition (7) for f_ρ, and (1) with b > (2r+2)/(2r+1) for {ρ_X^(t)}. Take

λ_t = λ_1 t^{−1/(2r+1)},  η_t = η_1 t^{−2r/(2r+1)}

with λ_1 η_1 > (2r−1)/(4r+2). Then

E_{z_1,...,z_T}( ‖f_{T+1} − f_ρ‖_K ) ≤ C T^{−(2r−1)/(4r+2)},

where C is a constant independent of T.

Denote the constant κ = sup_{x∈X} √(K(x,x)). From the reproducing property of the RKHS H_K,

⟨K_x, f⟩_K = f(x),  for all x ∈ X, f ∈ H_K,  (8)

we see that ‖f‖_{C(X)} ≤ κ ‖f‖_K for all f ∈ H_K.

Most error analysis in the literature of least square regression (Zhang, 2004; De Vito et al., 2005; Smale and Yao, 2006; Wu et al., 2007) concerns the L²-norm ‖f_{T+1} − f_ρ‖_{L²_{ρ_X}} or the risk in the i.i.d. case. From a predictive viewpoint, in the non-identical setting the error f_{T+1} − f_ρ should be measured with respect to the distribution ρ^(T), not the limit ρ. This can be done by bounding ‖f_{T+1} − f_ρ‖_{C(X)} (since ρ^(T) changes with T), which follows from estimates for ‖f_{T+1} − f_ρ‖_K. So our bounds for the error in the H_K-norm provide useful predictive information about the learning ability of fully online algorithm (3) in the non-identical setting.

Remark 8 When X ⊂ R^n and K ∈ C^{2m}(X × X) for some m ∈ N, we know from Zhou (2003, 2008) and Theorem 7 that E_{z_1,...,z_T}( ‖f_{T+1} − f_ρ‖_{C^m(X)} ) = O(T^{−(2r−1)/(4r+2)}). So the regression function is learned efficiently by the online algorithm not only in the usual L²_{ρ_X} space, but also strongly in the space C^m(X), implying the learning of gradients (Mukherjee and Wu, 2006). In the special case r = 3/2, the learning rate in Theorem 7 is E_{z_1,...,z_T}( ‖f_{T+1} − f_ρ‖_K ) = O(T^{−1/4}), the same as those in the literature (Smale and Zhou, 2009; Tarrès and Yao, 2005; Smale and Zhou, 2007). Here we assume polynomial convergence condition (1) with a large index b > (2r+2)/(2r+1), so the influence of the non-identical distributions ρ^(t) does not appear in the learning rates (it is involved in the constant C). Instead of refining the analysis for smaller b in Theorem 7, we shall show the influence of the index b on learning rates in the settings of regression with insensitive loss and binary classification.
2.2 Learning Rates for Regression with Insensitive Loss

A large family of loss functions for regression take the form V(y, f) = ψ(y − f), where ψ : R → R_+ is an even, convex and continuous function satisfying ψ(0) = 0. One example is the ε-insensitive loss (Vapnik, 1998) with ε ≥ 0, where ψ(u) = max{|u| − ε, 0}. We consider the case ε = 0. In this case the loss is called least absolute deviation or least absolute error in the statistics literature, and it finds applications in some important problems because of its robustness.

Definition 9 The insensitive loss V = V^in is given by V^in(y, f) = |y − f|.

Algorithm (3) now takes the form

f_{t+1} = (1 − η_t λ_t) f_t − η_t K_{x_t},  if f_t(x_t) ≥ y_t,
f_{t+1} = (1 − η_t λ_t) f_t + η_t K_{x_t},  if f_t(x_t) < y_t.

The following learning rates are new; their proofs are given later in the paper.

Theorem 10 Let 0 < s ≤ 1 and K ∈ C^{2s}(X × X). Assume regularity condition (7) for f_ρ with r > 1/2, and polynomial convergence condition (1) for {ρ_X^(t)}. Suppose that for each x ∈ X, ρ_x is the uniform distribution on the interval [f_ρ(x) − 1, f_ρ(x) + 1]. If λ_1 ≤ (κ ‖g_ρ‖_{L²_{ρ_X}})^{2/(1−2r)} / 2 and η_1 > 0, then with a constant C independent of T, when 1/2 < r ≤ 3/2 we have

E_{z_1,...,z_T}( ‖f_{T+1} − f_ρ‖_K ) ≤ C T^{−min{ (2r−1)/(6r+1), b − (2r+3)/(6r+1) }}

by taking λ_t = λ_1 t^{−2/(6r+1)}, η_t = η_1 t^{−4r/(6r+1)}; and when 0 < r ≤ 1/2 we have

E_{z_1,...,z_T}( ‖f_{T+1} − f_ρ‖_{L²_{ρ_X}} ) ≤ C T^{−min{ 2r/(3r+2), (2b−1)/(3r+2) }}

with λ_t = λ_1 t^{−2/(3r+2)}, η_t = η_1 t^{−(2r+1)/(3r+2)}.

Again, when b > (4r+2)/(6r+1), X ⊂ R^n and K ∈ C^{2m}(X × X) for some m ∈ N, Theorem 10 tells us that E_{z_1,...,z_T}( ‖f_{T+1} − f_ρ‖_{C^m(X)} ) = O(T^{−(2r−1)/(6r+1)}). So the regression function is learned efficiently by the online algorithm not only with respect to the risk, but also strongly in the space C^m(X), implying the learning of gradients.

Consider the case r = 3/2. When ρ^(t) converges slowly with b < 4/5, we see that E_{z_1,...,z_T}( ‖f_{T+1} − f_ρ‖_K ) = O(T^{−(b−3/5)}), which depends heavily on the index b representing quantitatively the deviation of ρ^(t) from ρ. When b is large enough with b ≥ 4/5, the learning rate E_{z_1,...,z_T}( ‖f_{T+1} − f_ρ‖_K ) is of order T^{−1/5}, which is independent of b. It would be interesting to extend Theorem 10 to situations where {ρ_x : x ∈ X} are more general bounded symmetric distributions.

2.3 Learning Rates for Binary Classification

The output space for the binary classification problem is Y = {1, −1}, representing the two classes. A binary classifier C : X → Y makes a prediction y = C(x) for each point x ∈ X. A real valued function f : X → R can be used to generate a classifier C(x) = sgn(f(x)), where sgn(f(x)) = 1 if f(x) ≥ 0 and sgn(f(x)) = −1 if f(x) < 0. A classifying convex loss φ : R → R_+ is often used for the real valued function f to measure the local error φ(y f(x)) suffered from the use of sgn(f) as a model for the process producing y at x. Take V(y, f) = φ(y f) in our setting.
Offline classification algorithm (6) has been extensively studied in the literature. In particular, the error analysis is well understood when the sample z is assumed to be drawn identically and independently from a probability measure ρ on Z; see, for example, Evgeniou et al. (2000), Steinwart (2002), Zhang (2004) and Wu et al. (2007). Some analysis for off-line support vector machines with dependent samples can be found in Steinwart et al. (2008). When the sample size T is very large, algorithm (6) might be practically challenging. Then online learning algorithms can be applied, which provide more efficient methods for classification. These algorithms are generalizations of the perceptron, which has a long history. The error analysis for online algorithm (3) has also been conducted for classification in the i.i.d. setting; see, for example, Cesa-Bianchi et al. (2004), Ying and Zhou (2006), Ying (2007) and Ye and Zhou (2007). Here we are interested in the error analysis for fully online algorithm (3) in the non-identical setting. The error is measured by the misclassification error R(C), defined to be the probability of the event {C(x) ≠ y} for a classifier C:

R(C) = ∫_X ρ_x( y ≠ C(x) ) dρ_X(x) = ∫_X ρ_x({ −C(x) }) dρ_X(x).

The best classifier minimizing the misclassification error is called the Bayes rule (e.g., Devroye et al. 1997) and can be expressed as f_c = sgn(f_ρ). We are interested in the classifier sgn(f_{T+1}) produced by the real valued function f_{T+1} from fully online learning algorithm (3). So our error analysis aims at the excess misclassification error R(sgn(f_{T+1})) − R(f_c).

We demonstrate our error analysis by a result, proved in Section 5, for the hinge loss φ(x) = (1 − x)_+ = max{1 − x, 0}. For this loss, online algorithm (3) can be expressed as f_1 = 0 and

f_{t+1} = (1 − η_t λ_t) f_t,  if y_t f_t(x_t) > 1,
f_{t+1} = (1 − η_t λ_t) f_t + η_t y_t K_{x_t},  if y_t f_t(x_t) ≤ 1.

Theorem 11 Let V(y, f) = (1 − y f)_+ and K ∈ C^s(X × X) for some 0 < s ≤ 1. Assume f_ρ ∈ C^s(X) and that the triple (ρ, f_c, K) satisfies

inf_{f ∈ H_K} { ‖f − f_c‖_{L¹_{ρ_X}} + λ ‖f‖_K² } ≤ D_0 λ^β  for all λ > 0  (9)

for some 0 < β ≤ 1 and D_0 > 0. Take

λ_t = λ_1 t^{−1/4},  η_t = η_1 t^{−(1/2 + β/12)},

where λ_1 > 0 and 0 < η_1 ≤ 1/(κ² + λ_1). If {ρ_X^(t)}_{t=1,2,...} satisfies (1) with b > 1/2, then

E_{z_1,...,z_T}( R(sgn(f_{T+1})) − R(f_c) ) ≤ C̃_{β,s,b} T^{−min{ β/4, (2−β)/8, (2b−1)/4 }},  (10)

where C̃_{β,s,b} = C_{η_1,λ_1,κ,D_0,β,s,b} is a constant depending on η_1, λ_1, κ, D_0, β, s and b.

In the i.i.d. case with b = ∞, the learning rate for fully online algorithm (3) stated in Theorem 11 is of the form O(T^{−min{β/4, (2−β)/8}}), which is better than the rate of order O(T^{−min{β/4, 1/8−ε}}) with arbitrarily small ε > 0 in Ye and Zhou (2007). This improvement is realized by technical novelty in our error analysis, as pointed out in Remark 27. So even in the i.i.d. case, our learning rate for fully online classification with the hinge loss under approximation error assumption (9) is the best in the literature.

Let us discuss the role played in the learning rate in Theorem 11 by the power index b for the convergence of ρ^(t) to ρ. Consider the case β ≤ 2/3.
When b ≥ (1+β)/2, meaning fast convergence of ρ^(t) to ρ, learning rate (10) takes the form O(T^{−β/4}), which depends only on the approximation ability of H_K with respect to the function f_c and ρ. When 1/2 < b < (1+β)/2, representing slow convergence of ρ^(t) to ρ, the learning rate takes the form O(T^{−(2b−1)/4}), which depends only on b. When β > 2/3, learning rate (10) is of order T^{−(2−β)/8} for b > (4−β)/4 (fast convergence of ρ^(t)) and of order T^{−(2b−1)/4} for 1/2 < b ≤ (4−β)/4 (slow convergence of ρ^(t)). It would be interesting to know how online learning algorithms adapt when the time dependent distribution drifts sufficiently slowly (corresponding to very small b).
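For concreteness, the hinge-loss update of algorithm (3) displayed above can be sketched in the same finite kernel expansion form used earlier (our own illustration, not code from the paper; the Gaussian kernel, the toy data and all constants are arbitrary choices):

```python
import numpy as np

def kernel(x, u, sigma=0.5):
    # A Mercer kernel; the Gaussian RBF kernel is one convenient choice.
    return np.exp(-((x - u) ** 2) / (2 * sigma ** 2))

def online_hinge(samples, lam, eta):
    """f_{t+1} = (1 - eta_t*lam_t) f_t                       if y_t f_t(x_t) > 1,
       f_{t+1} = (1 - eta_t*lam_t) f_t + eta_t y_t K_{x_t}   if y_t f_t(x_t) <= 1."""
    xs, cs = [], []
    for t, (x, y) in enumerate(samples, start=1):
        f_x = sum(c * kernel(u, x) for c, u in zip(cs, xs))  # f_t(x_t)
        cs = [(1.0 - eta(t) * lam(t)) * c for c in cs]       # shrink by (1 - eta_t*lam_t)
        if y * f_x <= 1:                                     # margin violated
            xs.append(x)
            cs.append(eta(t) * y)                            # add eta_t * y_t * K_{x_t}
    return xs, cs

def decision(xs, cs, x):
    return sum(c * kernel(u, x) for c, u in zip(cs, xs))
```

Only margin-violating samples enter the expansion, mirroring the two cases of the update; the induced classifier is sgn(decision(...)).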

2.4 Approximation Error Involving a General Loss

Condition (9) concerns the approximation of the function f_c in the space L¹_{ρ_X} by functions from the RKHS H_K. In particular, when H_K is a dense subset of C(X) (i.e., K is a universal kernel), the quantity on the left-hand side of (9) tends to 0 as λ → 0. So (9) is a reasonable assumption, which can be characterized by an interpolation space condition (Smale and Zhou, 2003; Chen et al., 2004). Assumptions like (9) are necessary to determine the regularization parameter for achieving learning rate (10). This can be seen from the literature (Wu et al., 2007; Zhang, 2004; Caponnetto et al., 2007) on off-line algorithm (6): learning rates are obtained by suitable choices of the regularization parameter λ = λ(T), according to the behavior of the approximation error estimated from a priori conditions on the distribution ρ and the space H_K.

For a general loss function V, conditions like (9) can be stated in terms of the approximation error or regularization error.

Definition 12 The approximation error D(λ) associated with the triple (K, V, ρ) is

D(λ) = inf_{f ∈ H_K} { E(f) − E(f_ρ^V) + λ ‖f‖_K² },  (11)

where the generalization error for f : X → R is defined as

E(f) = ∫_Z V(y, f(x)) dρ = ∫_X ∫_Y V(y, f(x)) dρ_x(y) dρ_X

and f_ρ^V is a minimizer of E(f).

The approximation error measures the approximation ability of the space H_K with respect to the learning process involving V and ρ. The denseness of H_K in C(X) implies lim_{λ→0} D(λ) = 0. A natural assumption would be

D(λ) ≤ D_0 λ^β  for some 0 ≤ β ≤ 1 and D_0 > 0.  (12)

Throughout the paper we assume for the general loss function V that

|V| := sup_{y∈Y} V(y, 0) + sup{ |V(y, f) − V(y, 0)| / |f| : y ∈ Y, 0 < |f| ≤ 1 } < ∞.

Remark 13 Since D(λ) ≤ E(0) + 0 ≤ |V| for any λ > 0, we see that (12) always holds with β = 0 and D_0 = |V|.
For the least square loss V(y, f) = (y − f)², the minimizer f_ρ^V of E(f) is exactly the regression function defined by (5), and approximation error (11) takes the form D(λ) = inf_{f ∈ H_K} { ‖f − f_ρ‖²_{L²_{ρ_X}} + λ ‖f‖_K² }, which measures the approximation of f_ρ in L²_{ρ_X} by functions from the RKHS H_K. For a general loss function V, the minimizer f_ρ^V of E(f) is in general different from f_ρ. Moreover, the approximation error D(λ) involves the approximation of f_ρ^V by H_K in function spaces which need not be L²_{ρ_X}.

Example 5 For the hinge loss V(y, f) = (1 − y f)_+, the minimizer f_ρ^V is the Bayes rule f_ρ^V = f_c (Devroye et al., 1997). Moreover, the uniform Lipschitz continuity of V(·, f) implies (Chen et al., 2004)

E(f) − E(f_ρ^V) = ∫_Z φ(y f(x)) − φ(y f_c(x)) dρ ≤ ∫_X |f(x) − f_c(x)| dρ_X = ‖f − f_c‖_{L¹_{ρ_X}}.

So approximation error (11) can be estimated by approximation in the space L¹_{ρ_X}, and condition (9) implies (12).

Consider the insensitive loss V = V^in. We can easily see that for each x ∈ X, f_ρ^{V^in}(x) equals the median of the probability distribution ρ_x on Y; that is, f_ρ^{V^in}(x) is determined by ρ_x({y : y ≥ f_ρ^{V^in}(x)}) ≥ 1/2 and ρ_x({y : y ≤ f_ρ^{V^in}(x)}) ≥ 1/2. When ρ_x is symmetric about its mean f_ρ(x), we have f_ρ^{V^in}(x) = f_ρ(x).

Example 6 Let V = V^in, f_ρ ∈ C(X) and p ≥ 1. If for each x ∈ X the conditional distribution ρ_x is given by

dρ_x(y) = (p/2) |y − f_ρ(x)|^{p−1} dy,  if |y − f_ρ(x)| ≤ 1;  0,  if |y − f_ρ(x)| > 1,

then we have

E(f) − E(f_ρ^{V^in}) = (1/(p+1)) ‖f − f_ρ‖^{p+1}_{L^{p+1}_{ρ_X}} − ∫_{{x : |f(x) − f_ρ(x)| > 1}} { |f(x) − f_ρ(x)|^{p+1}/(1+p) − |f(x) − f_ρ(x)| + p/(1+p) } dρ_X.

It follows that

D(λ) ≤ inf_{f ∈ H_K} { (1/(p+1)) ‖f − f_ρ‖^{p+1}_{L^{p+1}_{ρ_X}} + λ ‖f‖_K² }.

The conclusion of Example 6 will be proved in the appendix. The following general result follows from the same argument as in Wu et al. (2006).

Proposition 14 If the loss function V satisfies |V(y, f_1) − V(y, f_2)| ≤ |f_1 − f_2|^c for some 0 < c ≤ 1 and any y ∈ Y, f_1, f_2 ∈ R, then

D(λ) ≤ inf_{f ∈ H_K} { ‖f − f_ρ^V‖^c_{L¹_{ρ_X}} + λ ‖f‖_K² } ≤ inf_{f ∈ H_K} { ‖f − f_ρ^V‖^c_{L²_{ρ_X}} + λ ‖f‖_K² }.

If the univariate function V(y, ·) is C¹ and satisfies |V′(y, f_1) − V′(y, f_2)| ≤ |f_1 − f_2|^c for some 0 < c ≤ 1 and any y ∈ Y, f_1, f_2 ∈ R, then

D(λ) ≤ inf_{f ∈ H_K} { ‖f − f_ρ^V‖^{1+c}_{L^{1+c}_{ρ_X}} + λ ‖f‖_K² } ≤ inf_{f ∈ H_K} { ‖f − f_ρ^V‖^{1+c}_{L²_{ρ_X}} + λ ‖f‖_K² }.

3. Key Analysis for the Fully Online Non-identical Setting

Learning rates for fully online learning algorithm (3), such as those stated in the last section for regression and classification, are obtained through analysis of the approximation error and the sample error. The approximation error has been well understood (Smale and Zhou, 2003; Chen et al., 2004). The sample error for (3) will be estimated in the following two sections. It is expressed as ‖f_{T+1} − f_{λ_T}^V‖_K, where f_{λ_T}^V is a regularizing function.

Definition 15 For λ > 0 the regularizing function f_λ^V ∈ H_K is defined by

f_λ^V = arg inf_{f ∈ H_K} { E(f) + (λ/2) ‖f‖_K² }.  (13)

Observe that f_λ^V is a sample-free limit of f_{z,λ} defined by (6). It is natural to expect that the function f_{T+1} produced by online algorithm (3) approximates the regularizing function f_{λ_T}^V well. This is actually the case. Our main result on sample error analysis estimates the difference f_{T+1} − f_{λ_T}^V in the H_K-norm in quite general situations. The estimation follows from an iteration procedure, developed for online classification algorithms in Ying and Zhou (2006), Ying (2007) and Ye and Zhou (2007), together with some novelty provided in the proof of Theorem 26 in the next section. It is based on a one-step iteration bounding ‖f_{t+1} − f_{λ_t}^V‖_K in terms of ‖f_t − f_{λ_{t−1}}^V‖_K. The goal of this section is to present our key analysis for the one-step iteration, tackling two barriers arising from the fully online non-identical setting.

3.1 Bounding the Error Term Caused by the Non-identical Sequence of Distributions

The first part of our key analysis for the one-step iteration is to tackle an extra error term Δ_t caused by the non-identical sequence of distributions when bounding ‖f_{t+1} − f_{λ_t}^V‖_K by means of ‖f_t − f_{λ_t}^V‖_K with the fixed regularization parameter λ_t.

Lemma 16 Define {f_t} by (3). Then we have

E_{z_t}( ‖f_{t+1} − f_{λ_t}^V‖_K² ) ≤ (1 − η_t λ_t) ‖f_t − f_{λ_t}^V‖_K² + 2 η_t Δ_t + η_t² E_{z_t}( ‖V′(y_t, f_t(x_t)) K_{x_t} + λ_t f_t‖_K² ),  (14)

where Δ_t is defined by

Δ_t = ∫_Z [ V(y, f_{λ_t}^V(x)) − V(y, f_t(x)) ] d[ρ^(t) − ρ].  (15)

Lemma 16 follows from the same procedure as in the proofs of Lemma 3 in Ying and Zhou (2006) and in Ye and Zhou (2007). A crucial estimate is

⟨V′(y_t, f_t(x_t)) K_{x_t} + λ_t f_t, f_{λ_t}^V − f_t⟩_K ≤ [ V(y_t, f_{λ_t}^V(x_t)) + (λ_t/2) ‖f_{λ_t}^V‖_K² ] − [ V(y_t, f_t(x_t)) + (λ_t/2) ‖f_t‖_K² ].

When we take the expectation with respect to z_t = (x_t, y_t), we get ∫_Z V(y, f_{λ_t}^V(x)) dρ^(t) (and ∫_Z V(y, f_t(x)) dρ^(t)) on the right-hand side, not E(f_{λ_t}^V) = ∫_Z V(y, f_{λ_t}^V(x)) dρ.
So compared with results in the i.i.d. case (Smale and Yao, 2006; Ying and Zhou, 2006; Ying, 2007; Tarrès and Yao, 2005; Ye and Zhou, 2007), an extra term Δ_t involving the difference measure ρ^(t) − ρ appears. This is the first barrier we need to tackle here. When V is the least square loss, it can be easily handled by assuming f_ρ ∈ C^s(X). In fact, from V(y, f) = (y − f)², we see that

Δ_t = ∫_Z [ f_{λ_t}^V(x) − f_t(x) ][ f_{λ_t}^V(x) + f_t(x) − 2y ] dρ_x(y) d[ρ_X^(t) − ρ_X]
    = ∫_X [ f_{λ_t}^V(x) − f_t(x) ][ f_{λ_t}^V(x) + f_t(x) − 2 f_ρ(x) ] d[ρ_X^(t) − ρ_X].

This together with the relation ‖hg‖_{C^s(X)} ≤ ‖h‖_{C^s(X)} ‖g‖_{C^s(X)} yields the following bound.

Proposition 17 Let V(y, f) = (y − f)² and f_ρ ∈ C^s(X). Then we have

|Δ_t| ≤ ( ‖f_{λ_t}^V‖_{C^s(X)} + ‖f_t‖_{C^s(X)} + 2 ‖f_ρ‖_{C^s(X)} ) ‖ρ_X^(t) − ρ_X‖_{(C^s(X))^*} ‖f_{λ_t}^V − f_t‖_{C^s(X)}.

For a general loss function and output space, Δ_t can be bounded under assumption (4). This is a special case of the following general result with h = f_{λ_t}^V and g = f_t.

Lemma 18 Let h, g ∈ C^s(X). If (4) holds, then we have

| ∫_Z [ V(y, h(x)) − V(y, g(x)) ] d[ρ^(t) − ρ] | ≤ { B_{h,g} ( ‖h‖_{C^s(X)} + ‖g‖_{C^s(X)} ) + C_ρ B̃_{h,g} } ‖ρ_X^(t) − ρ_X‖_{(C^s(X))^*},

where B_{h,g} and B̃_{h,g} are constants given by

B_{h,g} = sup{ |V′(y, f)| : y ∈ Y, |f| ≤ max{ ‖h‖_{C(X)}, ‖g‖_{C(X)} } }

and

B̃_{h,g} = sup{ ‖V(·, f)‖_{C^s(Y)} : |f| ≤ max{ ‖h‖_{C(X)}, ‖g‖_{C(X)} } }.

Proof By decomposing the probability distributions on Z into marginal and conditional distributions, we see

∫_Z [ V(y,h(x)) − V(y,g(x)) ] d[ρ^(t) − ρ] = ∫_X ∫_Y [ V(y,h(x)) − V(y,g(x)) ] dρ_x(y) d[ρ_X^(t) − ρ_X].

By the definition of the norm in (C^s(X))^*, we obtain

| ∫_Z [ V(y,h(x)) − V(y,g(x)) ] d[ρ^(t) − ρ] | ≤ ‖ρ_X^(t) − ρ_X‖_{(C^s(X))^*} ‖J‖_{C^s(X)},

where J is the function on X defined by

J(x) = ∫_Y [ V(y,h(x)) − V(y,g(x)) ] dρ_x(y),  x ∈ X.

The definition of B_{h,g} tells us that |V(y,h(x)) − V(y,g(x))| ≤ B_{h,g} |h(x) − g(x)| for each y ∈ Y. Hence ‖J‖_{C(X)} ≤ B_{h,g} ‖h − g‖_{C(X)}. To bound |J|_{C^s(X)}, let x, u ∈ X. We decompose J(x) − J(u) as

J(x) − J(u) = ∫_Y [ V(y,h(x)) − V(y,g(x)) ] − [ V(y,h(u)) − V(y,g(u)) ] dρ_x(y) + ∫_Y [ V(y,h(u)) − V(y,g(u)) ] d[ρ_x − ρ_u](y).

By the definition of B_{h,g}, part of the first term above can be bounded as

|V(y,h(x)) − V(y,h(u))| ≤ B_{h,g} |h(x) − h(u)| ≤ B_{h,g} |h|_{C^s(X)} (d(x,u))^s.

The same bound holds for the other part of the first term, so we get

| ∫_Y [ V(y,h(x)) − V(y,g(x)) ] − [ V(y,h(u)) − V(y,g(u)) ] dρ_x(y) | ≤ B_{h,g} ( |h|_{C^s(X)} + |g|_{C^s(X)} ) (d(x,u))^s.

For the other term of the expression for J(x) − J(u), we apply condition (4) to

| ∫_Y [ V(y,h(u)) − V(y,g(u)) ] d[ρ_x − ρ_u](y) | ≤ ‖ρ_x − ρ_u‖_{(C^s(Y))^*} ‖V(·,h(u)) − V(·,g(u))‖_{C^s(Y)}

and find that

| ∫_Y [ V(y,h(u)) − V(y,g(u)) ] d[ρ_x − ρ_u](y) | ≤ C_ρ (d(x,u))^s B̃_{h,g}.

Combining the above two bounds, we see that

|J|_{C^s(X)} = sup_{x ≠ u} |J(x) − J(u)| / (d(x,u))^s ≤ B_{h,g} ( |h|_{C^s(X)} + |g|_{C^s(X)} ) + C_ρ B̃_{h,g}.

Then the desired bound follows and the lemma is proved.

3.2 Bounding the Error Term Caused by Varying Regularization Parameters

The second part of our key analysis is to estimate the error term ‖f_{λ_t}^V − f_{λ_{t−1}}^V‖_K, called the drift error, caused by the change of the regularization parameter from λ_{t−1} to λ_t in our fully online algorithm. This is the second barrier we need to tackle here.

Definition 19 The drift error is defined as d_t = ‖f_{λ_t}^V − f_{λ_{t−1}}^V‖_K.

The drift error can be estimated by the approximation error, which has been studied for regression (Smale and Zhou, 2009) and for classification (Ye and Zhou, 2007).

Theorem 20 Let V be a convex loss function, f_λ^V be defined by (13), and µ > λ > 0. We have

‖f_λ^V − f_µ^V‖_K ≤ (µ/2) (1/λ − 1/µ) ( ‖f_λ^V‖_K + ‖f_µ^V‖_K ) ≤ (µ/2) (1/λ − 1/µ) ( √(2D(λ)/λ) + √(2D(µ)/µ) ).

In particular, if with some 0 < γ < 1 we take λ_t = λ_1 t^{−γ} for t ≥ 1, then d_{t+1} ≤ 2γ t^{γ/2 − 1} √(D(λ_1 t^{−γ})/λ_1).

Proof Taking the derivative of the functional E(f) + (λ/2) ‖f‖_K² at the minimizer f_λ^V defined in (13), we see that

∫_Z V′(y, f_λ^V(x)) K_x dρ + λ f_λ^V = 0.

It follows that

f_λ^V − f_µ^V = (1/µ) ∫_Z V′(y, f_µ^V(x)) K_x dρ − (1/λ) ∫_Z V′(y, f_λ^V(x)) K_x dρ.

Combining with the reproducing property (8), we know that ‖f_λ^V − f_µ^V‖_K² = ⟨f_λ^V − f_µ^V, f_λ^V − f_µ^V⟩_K can be expressed as

‖f_λ^V − f_µ^V‖_K² = (1/µ) ∫_Z V′(y, f_µ^V(x)) (f_λ^V − f_µ^V)(x) dρ − (1/λ) ∫_Z V′(y, f_λ^V(x)) (f_λ^V − f_µ^V)(x) dρ.

The convexity of the function V(y, ·) on R tells us that

V′(y, f_µ^V(x)) (f_λ^V − f_µ^V)(x) ≤ V(y, f_λ^V(x)) − V(y, f_µ^V(x))

and

V′(y, f_λ^V(x)) (f_µ^V − f_λ^V)(x) ≤ V(y, f_µ^V(x)) − V(y, f_λ^V(x)).

Hence

‖f_λ^V − f_µ^V‖_K² ≤ (1/λ − 1/µ) ( E(f_µ^V) − E(f_λ^V) ).

From the definition of f_µ^V, we see that E(f_µ^V) + (µ/2) ‖f_µ^V‖_K² − ( E(f_λ^V) + (µ/2) ‖f_λ^V‖_K² ) ≤ 0. It follows that

E(f_µ^V) − E(f_λ^V) ≤ (µ/2) ( ‖f_λ^V‖_K² − ‖f_µ^V‖_K² ) ≤ (µ/2) ‖f_λ^V − f_µ^V‖_K ( ‖f_λ^V‖_K + ‖f_µ^V‖_K ).

Then the desired inequality follows.

4. Bounds for the Sample Error in the H_K-norm

We are in a position to present our main result on the sample error, measured by the H_K-norm of the difference f_{T+1} − f_{λ_T}^V. This will be done by applying iteratively the key analysis in Lemma 16, Lemma 18 and Theorem 20. When applying Lemma 18, we need to bound ‖g‖_{C^s(X)} in terms of ‖g‖_K.

Definition 21 We say that the Mercer kernel K satisfies the kernel condition of order s if K ∈ C^s(X × X) and for some κ_s > 0,

K(x,x) − 2K(x,u) + K(u,u) ≤ κ_s² (d(x,u))^{2s},  for all x, u ∈ X.  (16)

When 0 < s ≤ 1/2 and K ∈ C^{2s}(X × X), (16) holds true. The following result follows directly from Zhou (2003), Zhou (2008) and Smale and Zhou (2009).

Lemma 22 If K satisfies the kernel condition of order s with (16) valid, then we have

‖g‖_{C^s(X)} ≤ (κ + κ_s) ‖g‖_K,  for all g ∈ H_K.

When using Lemma 16 and Lemma 18, we need incremental behaviors of the loss function V to bound ‖f_t‖_K.

Definition 23 Denote

N(λ) = sup{ |V′(y, f)| : y ∈ Y, |f| ≤ max{κ√(2V̄/λ), κ²V̄/λ} }

and

Ñ(λ) = sup{ ‖V(·, f)‖_{C^s(Y)} : |f| ≤ max{κ√(2V̄/λ), κ²V̄/λ} }.

We say that V has incremental exponent p ≥ 0 if for some N_1 > 0 and λ_1 > 0 we have

N(λ) ≤ N_1 (1/λ)^p and Ñ(λ) ≤ N_1 (1/λ)^{p+1}, ∀ 0 < λ ≤ λ_1. (17)

We say that V is locally Lipschitz at the origin if

M_0 := sup{ |V′(y, f) − V′(y, 0)| / |f| : y ∈ Y, 0 < |f| ≤ 1 } < ∞. (18)

The following result can be proved by exactly the same procedure as those in Ying and Zhou (2006), Ying (2007) and Ye and Zhou (2007).

Lemma 24 Assume that V is locally Lipschitz at the origin. Define {f_t} by (3). If

η_t (κ²(M_0 + N(λ_t)) + λ_t) ≤ 1 for t = 1, …, T, (19)

then we have

‖f_t‖_K ≤ κ V̄ / λ_t, t = 1, …, T + 1. (20)

For the insensitive loss, (18) is not satisfied, but |V′(y, f)| is uniformly bounded by 1. For such loss functions we can apply the following bound.

Lemma 25 Assume |V′|_∞ := sup_{y ∈ Y, f ∈ R} |V′(y, f)| < ∞. Define {f_t} by (3). If for some λ_1, η_1 > 0 and γ, α > 0 with γ + α < 1, we take λ_t = λ_1 t^{−γ}, η_t = η_1 t^{−α} with t = 1, …, T, then we have

‖f_t‖_K ≤ C_{V,γ,α} / λ_t, t = 1, …, T + 1, (21)

where C_{V,γ,α} is the constant given by

C_{V,γ,α} = κ |V′|_∞ λ_1 ( η_1 + 2^{γ+α}/(λ_1 η_1) + ( 2(1+α)/[e λ_1 η_1 (1 − 2^{γ+α−1})] )^{(1+α)/(1−γ−α)} ).

Proof By taking H_K-norms in (3) we see that

‖f_{t+1}‖_K ≤ (1 − η_t λ_t) ‖f_t‖_K + η_t κ |V′|_∞.

By iterating and the choice f_1 = 0 we find

‖f_{t+1}‖_K ≤ Σ_{i=1}^{t} Π_{j=i+1}^{t} (1 − η_j λ_j) η_i κ |V′|_∞.

But 1 − η_j λ_j ≤ exp{−η_j λ_j}. It follows that

‖f_{t+1}‖_K ≤ κ |V′|_∞ Σ_{i=1}^{t} exp{ − Σ_{j=i+1}^{t} η_j λ_j } η_i.

Now we need the following elementary inequality with c > 0, q ≥ 0 and 0 < q_1 < 1:

Σ_{i=1}^{t−1} i^{−q} exp{ −c Σ_{j=i+1}^{t} j^{−q_1} } ≤ ( 2^{q_1+q}/c + ( 2(1+q)/[ec(1 − 2^{q_1−1})] )^{(1+q)/(1−q_1)} ) t^{q_1−q}. (22)

This elementary inequality can be found in, for example, Smale and Zhou (2009). Taking q = α, q_1 = γ+α and c = λ_1 η_1, we know that ‖f_{t+1}‖_K is bounded by

κ |V′|_∞ ( η_1 t^{−α} + [ 2^{γ+α}/(λ_1 η_1) + ( 2(1+α)/[e λ_1 η_1 (1 − 2^{γ+α−1})] )^{(1+α)/(1−γ−α)} ] t^{γ} ).

Then our desired bound holds true.

Now we can present our bound for the sample error ‖f_{T+1} − f^V_{λ_T}‖_K.

Theorem 26 Suppose the following assumptions hold:

1. the kernel K satisfies the kernel condition of order s (0 < s ≤ 1) with (16) valid.
2. V has incremental exponent p ≥ 0 with (17) valid and V satisfies (18).
3. {ρ^{(t)}}_{t=1,2,…} converges polynomially to ρ in (C^s(X))^* with (11) valid.
4. the set of distributions {ρ_x : x ∈ X} is Lipschitz s in (C^s(Y))^* with (4) valid.
5. the triple (K, V, ρ) has the approximation ability of power 0 < β ≤ 1 stated by (12).

Take

λ_t = λ_1 t^{−γ}, η_t = η_1 t^{−α} (23)

with some λ_1, η_1 > 0 and γ, α satisfying

0 < γ < 2/(5+4p−β), (p+1)γ < α < 1 − γ(3−β)/2, η_1 ≤ 1/(κ² M_0 + κ² N_1 λ_1^{−p} + λ_1). (24)

Then we have

E_{z_1,z_2,…,z_T}( ‖f_{T+1} − f^V_{λ_T}‖²_K ) ≤ C_{K,V,ρ,b,β,s} T^{−θ} (25)

where the power index θ is given by

θ := min{ 2 − γ(3−β) − 2α, α − γ(2p+1), b − γ(2+p) }, (26)

and C_{K,V,ρ,b,β,s} is a constant independent of T given explicitly in the proof.

Proof We divide the proof into four steps. First we bound Δ_t. Since λ_t and η_t take the form (23), we see from the lower bound for α in (24) that pγ < (p+1)γ < α. Hence for t ∈ N,

η_t (κ²(M_0 + N(λ_t)) + λ_t) ≤ η_1 t^{−α} (κ² M_0 + κ² N_1 λ_1^{−p} t^{pγ} + λ_1 t^{−γ}) ≤ η_1 (κ² M_0 + κ² N_1 λ_1^{−p} + λ_1).

So the last restriction of (24) implies (19), and by Lemma 24, we know that (20) holds true. Taking f = 0 in (13) yields

‖f^V_{λ_t}‖_K ≤ √(2V̄/λ_t), t = 1, …, T.

Putting these bounds into the definitions of the constants B_{h,g}, B̃_{h,g}, we see by the notion N(λ) that

B_{f^V_{λ_t}, f_t} ≤ N(λ_t), B̃_{f^V_{λ_t}, f_t} ≤ Ñ(λ_t).

So by Lemma 18 and Lemma 22,

Δ_t ≤ B̃_t := [ (κ+κ_s)(√(2V̄/λ_t) + κV̄/λ_t) N(λ_t) + C_ρ Ñ(λ_t) ] ‖ρ^{(t)} − ρ‖_{(C^s(X))^*}.

Putting this bound and (20) into (14) yields

E_{z_1,z_2,…,z_t}( ‖f_{t+1} − f^V_{λ_t}‖²_K ) ≤ (1 − η_t λ_t) E_{z_1,z_2,…,z_{t−1}}( ‖f_t − f^V_{λ_t}‖²_K ) + κ² η_t² (N(λ_t) + V̄)² + 2 η_t B̃_t.

Next we derive explicit bounds for the one-step iteration. Recall that d_t = ‖f^V_{λ_t} − f^V_{λ_{t−1}}‖_K. It gives

‖f_t − f^V_{λ_t}‖²_K ≤ ‖f_t − f^V_{λ_{t−1}}‖²_K + 2 ‖f_t − f^V_{λ_{t−1}}‖_K d_t + d_t².

Take τ = (γ+α)/(1 − γ(1−β)/2). By the upper bound for α in (24), we find 0 < τ < 1. Take A_1 = η_1 λ_1^{1+τ(1−β)/2} / (2^{1+2τ} D_0^{τ/2}) > 0. Applying the elementary inequality

2ab = [ √A_1 a b^{τ/2} ] [ 2 b^{1−τ/2} / √A_1 ] ≤ A_1 a² b^τ + b^{2−τ}/A_1 (27)

to a = ‖f_t − f^V_{λ_{t−1}}‖_K and b = d_t, we know that

E_{z_1,z_2,…,z_{t−1}}( ‖f_t − f^V_{λ_t}‖²_K ) ≤ (1 + A_1 d_t^τ) E_{z_1,z_2,…,z_{t−1}}( ‖f_t − f^V_{λ_{t−1}}‖²_K ) + d_t^{2−τ}/A_1 + d_t².

Using this bound and noticing the inequality (1 − η_t λ_t)(1 + A_1 d_t^τ) ≤ 1 + A_1 d_t^τ − η_t λ_t, we obtain

E_{z_1,z_2,…,z_t}( ‖f_{t+1} − f^V_{λ_t}‖²_K ) ≤ (1 + A_1 d_t^τ − η_t λ_t) E_{z_1,z_2,…,z_{t−1}}( ‖f_t − f^V_{λ_{t−1}}‖²_K ) + d_t^{2−τ}/A_1 + d_t² + κ² η_t² (N(λ_t) + V̄)² + 2 η_t B̃_t.

By Theorem 20, condition (12) for the approximation error yields

d_t ≤ 2γ √D_0 λ_1^{(β−1)/2} (t−1)^{γ(1−β)/2 − 1} ≤ A_2 t^{γ(1−β)/2 − 1}, where A_2 = 4 √D_0 λ_1^{(β−1)/2}.

Inserting the parameter form λ_t = λ_1 t^{−γ} into assumption (17) and applying condition (11), we can bound B̃_t as

B̃_t ≤ A_3 t^{γ(1+p) − b}, where A_3 = [ (κ+κ_s)(√(2V̄/λ_1) + κV̄/λ_1) + C_ρ/λ_1 ] N_1 λ_1^{−p} C.

Therefore, for the one-step iteration, by denoting f^V_{λ_0} := f^V_{λ_1} when t = 1, we have for each t = 1, …, T,

E_{z_1,z_2,…,z_t}( ‖f_{t+1} − f^V_{λ_t}‖²_K ) ≤ (1 + A_1 d_t^τ − η_t λ_t) E_{z_1,z_2,…,z_{t−1}}( ‖f_t − f^V_{λ_{t−1}}‖²_K ) + A_4 t^{−θ̃}, (28)

where

θ̃ = min{ 2 − γ(2−β) − α, 2(α − pγ), α + b − γ(1+p) }

and

A_4 = A_2^{2−τ}/A_1 + A_2² + κ² η_1² (N_1 λ_1^{−p} + V̄)² + 2 η_1 A_3.

Then we iterate the above one-step analysis. Inserting the parameter forms for λ_t and η_t, we see from the definition of the constant A_1 that

1 + A_1 d_t^τ − η_t λ_t ≤ 1 + A_1 A_2^τ t^{τ(γ(1−β)/2 − 1)} − η_1 λ_1 t^{−γ−α} = 1 − (η_1 λ_1/2) t^{−γ−α}. (29)

So the one-step analysis (28) yields

E_{z_1,z_2,…,z_t}( ‖f_{t+1} − f^V_{λ_t}‖²_K ) ≤ (1 − (η_1 λ_1/2) t^{−γ−α}) E_{z_1,z_2,…,z_{t−1}}( ‖f_t − f^V_{λ_{t−1}}‖²_K ) + A_4 t^{−θ̃}.

Applying this bound iteratively for t = 1, …, T implies

E_{z_1,z_2,…,z_T}( ‖f_{T+1} − f^V_{λ_T}‖²_K ) ≤ A_4 Σ_{t=1}^{T} Π_{j=t+1}^{T} (1 − (η_1 λ_1/2) j^{−γ−α}) t^{−θ̃} + Π_{t=1}^{T} (1 − (η_1 λ_1/2) t^{−γ−α}) ‖f_1 − f^V_{λ_1}‖²_K.

Finally we bound the above two expressions by two elementary inequalities. The first one is (22). Applying this inequality with c = η_1 λ_1/2, q_1 = γ+α and q = θ̃, since 1 − u ≤ e^{−u} for any u ≥ 0, the first expression above can be bounded as

Σ_{t=1}^{T} Π_{j=t+1}^{T} (1 − (η_1 λ_1/2) j^{−γ−α}) t^{−θ̃} ≤ Σ_{t=1}^{T} exp{ −(η_1 λ_1/2) Σ_{j=t+1}^{T} j^{−γ−α} } t^{−θ̃} ≤ A_5 T^{γ+α−θ̃},

where A_5 is the constant given by

A_5 = 2^{γ+α+θ̃+1}/(η_1 λ_1) + ( 4(1+θ̃)/[e η_1 λ_1 (1 − 2^{γ+α−1})] )^{(1+θ̃)/(1−γ−α)}.

For the second expression above, we have

Π_{t=1}^{T} (1 − (η_1 λ_1/2) t^{−γ−α}) ≤ exp{ −(η_1 λ_1/2) Σ_{t=1}^{T} t^{−γ−α} } ≤ exp{ −(η_1 λ_1/2) ∫_1^{T+1} x^{−γ−α} dx } = exp{ η_1 λ_1/(2(1−γ−α)) } exp{ −(η_1 λ_1/(2(1−γ−α))) (T+1)^{1−γ−α} }.

Applying another elementary inequality

exp{−cx} ≤ ( a/(ec) )^a x^{−a}, ∀ c, a, x > 0,

with c = λ_1 η_1/(2(1−γ−α)), a = (γ+α)/(1−γ−α) and x = (T+1)^{1−γ−α} yields

Π_{t=1}^{T} (1 − (η_1 λ_1/2) t^{−γ−α}) ≤ exp{ λ_1 η_1/(2(1−γ−α)) } ( 2(γ+α)/(e λ_1 η_1) )^{(γ+α)/(1−γ−α)} T^{−γ−α}.

The above two estimates give the desired bound (25) with θ = θ̃ − γ − α and the constant C_{K,V,ρ,b,β,s} given by

C_{K,V,ρ,b,β,s} = A_4 A_5 + exp{ λ_1 η_1/(2(1−γ−α)) } ( 2(γ+α)/(e λ_1 η_1) )^{(γ+α)/(1−γ−α)} (2V̄/λ_1).

This proves the theorem.

Remark 27 Some ideas in the above proof are from Ying and Zhou (2006), Ye and Zhou (2007) and Smale and Zhou (2009). Two novel points are presented for the first time here. One is the bound for Δ_t, dealing with ρ^{(t)} ≠ ρ, given in the first step of our proof in order to tackle the technical difficulty arising from the non-identical sampling process. The same difficulty for the least square regression was overcome in Smale and Zhou (2009) by the special linear feature and explicit expressions offered by the least square loss. The second technical novelty is to introduce a parameter A_1 into the elementary inequality (27). With this parameter, we can bound 1 + A_1 d_t^τ − η_t λ_t by 1 − (1/2) η_t λ_t, shown in (29). This improves the error bound even in the i.i.d. case presented in Ye and Zhou (2007) for the fully online algorithm.

Let us discuss the role of parameters in Theorem 26. When γ is small enough and b > 2/3, the fully online algorithm (3) becomes very close to the partially online scheme with λ_t ≡ λ_1. By taking α = 2/3, the rates in (25) are of order O(T^{−(2/3 − ε)}) with ε arbitrarily small, which is a nice bound for the sample error. In this case, the difference between f^V_{λ_T} and f^V_ρ, measured by the approximation error, increases since λ_T = λ_1 T^{−γ}. To estimate the total error between f_{T+1} and f^V_ρ, we should take a balance for the index γ of the regularization parameter, as shown in Theorem 11.

For the insensitive loss, (18) is not satisfied. We can apply Lemma 25 and obtain bounds for ‖f_{T+1} − f^V_{λ_T}‖_K by the same proof as that for Theorem 26.

Proposition 28 Assume |V′|_∞ < ∞ and all the conditions of Theorem 26 except (18). Take λ_t, η_t by (23) with the restriction (24) without the last inequality. Then the same convergence rate (25) holds true with the power index θ given by (26).

5. Bounds for Binary Classification and Regression with Insensitive Loss

We demonstrate how to apply Theorem 26 by deriving learning rates of the fully online algorithm (3) for binary classification and regression with insensitive loss.
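As a concrete instance of this setting, the scheme (3) with the hinge loss can be run on samples whose marginal distribution drifts polynomially toward a limit. Everything in the sketch below (Gaussian kernel, drift schedule with b = 2, the synthetic labeling rule) is an illustrative assumption made here, not an experiment from the paper.

```python
import numpy as np

rng = np.random.default_rng(2)

def K(x, u):
    """Gaussian kernel (illustrative choice)."""
    return np.exp(-((x - u) ** 2) / 0.2)

lam1, eta1, gamma, alpha = 0.1, 0.5, 0.25, 0.5
T = 400
xs, cs = np.zeros(T), np.zeros(T)
for t in range(1, T + 1):
    # Non-identical sampling: the marginal at step t is uniform on [-1, 1]
    # shifted by 0.5 t^{-2}, converging polynomially to its limit (b = 2).
    x = rng.uniform(-1.0, 1.0) + 0.5 * t ** -2.0
    y = 1.0 if x > 0 else -1.0
    lam_t, eta_t = lam1 * t ** -gamma, eta1 * t ** -alpha
    fx = sum(c * K(xp, x) for xp, c in zip(xs[: t - 1], cs[: t - 1]))
    cs[: t - 1] *= 1.0 - eta_t * lam_t        # shrinkage from the regularization term
    if y * fx < 1.0:                          # hinge loss: phi'(u) = -1 for u < 1
        cs[t - 1] = eta_t * y                 # margin violated: add eta_t * y * K_x
    xs[t - 1] = x

# Evaluate the sign of f_{T+1} on points away from the decision boundary.
test_x = np.array([-0.9, -0.7, -0.5, -0.3, 0.3, 0.5, 0.7, 0.9])
pred = np.array([np.sign(sum(c * K(xp, tx) for xp, c in zip(xs, cs))) for tx in test_x])
acc = np.mean(pred == np.sign(test_x))
print(acc >= 0.75)
```

After a few hundred steps the sign of the iterate recovers the underlying threshold rule on points with a clear margin, even though no two samples came from the same distribution.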

Theorem 29 Let V(y, f) = φ(y f) where φ : R → R_+ is a convex function satisfying

|φ′(u)| ≤ N_φ |u|^p, φ(u) ≤ N_φ |u|^{p+1}, ∀ |u| ≥ 1 (30)

for some p ≥ 0 and N_φ > 0. Suppose M_φ := sup{ |φ′(u) − φ′(0)|/|u| : 0 < |u| ≤ 1 } < ∞. Assume (16) for K, (11) for {ρ^{(t)}}, and (12) for (K, φ, ρ). If f_ρ ∈ C^s(X) and we choose λ_t, η_t as (23) with

0 < γ < 2/(5+10p−β), α = (2 + γ(2p−2+β))/3 and b > γ(2+p),

and η_1 ≤ η_0, λ_1 ≤ κ²(φ(0) + ‖φ′‖_{L^∞[−1,1]})/2, then we have

E_{z_1,…,z_T}( E(f_{T+1}) − E(f^V_ρ) ) ≤ C̃_{φ,β,γ} T^{−min{ (2−γ(5+10p−β))/6, (b−γ(2+3p))/2, γβ }},

where η_0 := 1/(κ² M_φ + κ² N_1 λ_1^{−p} + λ_1) with N_1 = (M_φ + N_φ + φ(0) + ‖φ′‖_{L^∞[−1,1]}) κ^p (φ(0) + ‖φ′‖_{L^∞[−1,1]})^p, and C̃_{φ,β,γ} is a constant depending on η_1, λ_1, κ, D_0, β, φ and s.

Proof By the bounds for ‖f^V_{λ_T}‖_K and ‖f_{T+1}‖_K, we know from (30) that

E(f_{T+1}) − E(f^V_{λ_T}) = ∫_Z [ φ(y f_{T+1}(x)) − φ(y f^V_{λ_T}(x)) ] dρ ≤ C_{K,φ} λ_T^{−p} ‖f_{T+1} − f^V_{λ_T}‖_{C(X)} ≤ κ C_{K,φ} λ_T^{−p} ‖f_{T+1} − f^V_{λ_T}‖_K,

where C_{K,φ} is a constant depending on K and φ. It is easy to check that the loss V(y, f) = φ(y f) satisfies (17) with incremental exponent p.

By Theorem 26 with 0 < γ < 2/(5+10p−β), α = (2 + γ(2p−2+β))/3 and b > γ(2+p), we have

E_{z_1,z_2,…,z_T}( ‖f_{T+1} − f^V_{λ_T}‖_K ) ≤ √(C_{K,V,ρ,b,β,s}) T^{−min{ [2−γ(5+4p−β)]/6, [b−γ(2+p)]/2 }}.

Also, we have E(f^φ_{λ_T}) − E(f^V_ρ) ≤ D(λ_T) ≤ D_0 λ_T^β. Thus we get a bound for the excess generalization error

E_{z_1,…,z_T}( E(f_{T+1}) − E(f^V_ρ) ) ≤ C̃_{φ,β,γ} T^{−min{ [2−γ(5+10p−β)]/6, γβ, [b−γ(2+3p)]/2 }},

where C̃_{φ,β,γ} = κ C_{K,φ} λ_1^{−p} √(C_{K,V,ρ,b,β,s}) + D_0 λ_1^β. This verifies the desired bound.

Theorem 29 yields concrete learning rates with various loss functions. When φ is chosen to be the hinge loss φ(x) = (1−x)_+, we can prove Theorem 11.

5.1 Proof of Theorem 11

When 0 < s ≤ 1 and K ∈ C^{2s}(X × X), (16) holds true. The loss function φ(x) = (1−x)_+ satisfies φ′(x) = −1 for x < 1 and 0 otherwise. It follows that (17) holds true with p = 0 and M_φ = 0. By Example 5, (9) implies (11). Thus all conditions in Theorem 29 are satisfied and by taking p = 0 and γ = 1/4, we have

E_{z_1,…,z_T}( E(f_{T+1}) − E(f^V_ρ) ) ≤ C̃_{φ,β,γ} T^{−min{ 1/8 + β/24, β/4, b/2 − 1/4 }}.

An important relation concerning the hinge loss is the one (Zhang, 2004) between the excess misclassification error and the excess generalization error given for any measurable function f : X → R as

R(sgn(f)) − R(f_c) ≤ E(f) − E(f_c).

Combining this relation with the above bound for the excess generalization error proves the conclusion of Theorem 11.

Turn to the general loss φ. We give an additional assumption that φ′′(0) exists and is positive. Under this assumption it was proved in Chen et al. (2004) and Bartlett et al. (2006) that there exists a constant c_φ depending only on φ such that for any measurable function f : X → R,

R(sgn(f)) − R(f_c) ≤ c_φ √( E(f) − E(f^φ_ρ) ).

Then Theorem 29 gives the following learning rate.

Corollary 30 Let φ be a loss function such that φ′′(0) exists and is positive. Under the assumptions of Theorem 29, if γ = 2/(5+10p+5β), we have

E_{z_1,…,z_T}( R(sgn(f_{T+1})) − R(f_c) ) ≤ C_{φ,β} T^{−min{ β/(5+10p+5β), b/4 − (2+3p)/(10+20p+10β) }},

where C_{φ,β} is a constant independent of T.

As an example, the q-norm SVM loss φ(x) = ((1−x)_+)^q with q > 1 satisfies φ′′(0) > 0 and (17) with p = q−1. So Corollary 30 yields the following rates.

Example 7 Let φ(x) = ((1−x)_+)^q with q > 1. Under the assumptions of Theorem 29, if γ = 2/(10q−5+5β), α = (8q−6+4β)/(10q−5+5β) and b > (6q−2)/(10q−5+5β), then

E_{z_1,…,z_T}( R(sgn(f_{T+1})) − R(f_c) ) = O( T^{−min{ β/(10q−5+5β), b/4 − (3q−1)/(20q−10+10β) }} ).

Finally we verify the learning rates for regression with insensitive loss stated in Section 2.

5.2 Proof of Theorem 10

We need the regularizing function f^{V_ls}_λ defined by (13) with the least square loss V = V_ls. It can be found, for example, in Smale and Zhou (2007) that the regularity condition (7) implies

‖f^{V_ls}_λ − f_ρ‖_K ≤ (λ/2)^{r−1/2} ‖g_ρ‖_{L²_ρ}, when 1/2 < r ≤ 3/2,

and

‖f^{V_ls}_λ − f_ρ‖_{L²_ρ} ≤ (λ/2)^r ‖g_ρ‖_{L²_ρ}, when 0 < r ≤ 1.

It follows that when λ ≤ 2 (κ ‖g_ρ‖_{L²_ρ})^{2/(1−2r)}, we have ‖f^{V_ls}_λ − f_ρ‖_{C(X)} ≤ 1. Thus by the special form of the conditional distributions ρ_x, we see from the conclusion of Example 6 with p = 1 that

f^{V_in}_λ = f^{V_ls}_λ, and bounds for ‖f^{V_in}_λ − f_ρ‖_K and ‖f^{V_in}_λ − f_ρ‖_{L²_ρ} follow. Moreover, condition (12) for D(λ) is valid with β = 1.

Now we check the other conditions of Proposition 28. Condition (16) is valid because K ∈ C^{2s}(X × X) with 0 < s ≤ 1. By a simple computation, the incremental condition (17) is verified with exponent p = 0. Note that ‖f_ρ‖_K ≤ κ^{2r−1} ‖g_ρ‖_{L²_ρ} and ‖f_ρ‖_{C^s(X)} ≤ (κ + κ_s) ‖f_ρ‖_K. Then for any x, u ∈ X and g ∈ C^s(Y), we see from the uniform distributions ρ_x and ρ_u that

|∫_Y g(y) d(ρ_x − ρ_u)| = (1/2) | ∫_{f_ρ(x)−1}^{f_ρ(x)+1} g(y) dy − ∫_{f_ρ(u)−1}^{f_ρ(u)+1} g(y) dy | ≤ ‖g‖_{C(Y)} |f_ρ(x) − f_ρ(u)| ≤ ‖g‖_{C(Y)} (κ+κ_s) κ^{2r−1} ‖g_ρ‖_{L²_ρ} (d(x,u))^s.

This verifies the Lipschitz s continuity condition (4) for {ρ_x} with constant C_ρ = (κ+κ_s) κ^{2r−1} ‖g_ρ‖_{L²_ρ}. Thus all conditions of Proposition 28 are satisfied and we obtain

E_{z_1,z_2,…,z_T}( ‖f_{T+1} − f^{V_in}_{λ_T}‖²_K ) ≤ C_{K,V,ρ,b,β,s} T^{−θ}, where θ := min{ 2 − 2γ − 2α, α − γ, b − 2γ }.

Finally we get

E_{z_1,z_2,…,z_T}( ‖f_{T+1} − f_ρ‖_K ) ≤ √(C_{K,V,ρ,b,β,s}) T^{−θ/2} + (λ_T/2)^{r−1/2} ‖g_ρ‖_{L²_ρ}, when 1/2 < r ≤ 3/2,

and

E_{z_1,z_2,…,z_T}( ‖f_{T+1} − f_ρ‖_{L²_ρ} ) ≤ κ √(C_{K,V,ρ,b,β,s}) T^{−θ/2} + (λ_T/2)^r ‖g_ρ‖_{L²_ρ}, when 0 < r ≤ 1.

Then our desired learning rates follow.

Acknowledgments

We would like to thank the referees for their constructive suggestions and comments. The work described in this paper was partially supported by a grant from the Research Grants Council of Hong Kong [Project No. CityU ] and the National Science Fund for Distinguished Young Scholars of China [Project No. ]. The corresponding author is Ding-Xuan Zhou.

Appendix A.

This appendix includes some detailed proofs.

A.1 Proof of Proposition 6

The first statement follows by taking g(y) = y on Y because

|f_ρ(x) − f_ρ(u)| = | ∫_Y g(y) d(ρ_x − ρ_u)(y) | ≤ ‖ρ_x − ρ_u‖_{(C^s(Y))^*} ‖g‖_{C^s(Y)}

and

‖g‖_{C^s(Y)} = ‖g‖_{C(Y)} + |g|_{C^s(Y)} ≤ sup_{y∈Y} |y| + (2 sup_{y∈Y} |y|)^{1−s}.

Actually the above estimates tell us that f_ρ is continuous and belongs to C^s(X) with

‖f_ρ‖_{C^s(X)} ≤ C_ρ ( sup_{y∈Y} |y| + (2 sup_{y∈Y} |y|)^{1−s} ).

For the second statement, since Y = {1, −1}, we have f_ρ(x) = ρ_x(1) − ρ_x(−1). It follows that for each y ∈ Y and x ∈ X, there holds ρ_x(y) = (1 + y f_ρ(x))/2. So for any g ∈ C^s(Y) and x, u ∈ X,

∫_Y g(y) d(ρ_x − ρ_u)(y) = Σ_{y=±1} g(y) (y/2) [f_ρ(x) − f_ρ(u)] = ( [g(1) − g(−1)]/2 ) [f_ρ(x) − f_ρ(u)].

Now the conclusion follows from

| ∫_Y g(y) d(ρ_x − ρ_u)(y) | ≤ ‖g‖_{C(Y)} |f_ρ(x) − f_ρ(u)| ≤ ‖f_ρ‖_{C^s(X)} (d(x,u))^s ‖g‖_{C^s(Y)}.

This proves Proposition 6.

A.2 Proof of Example 4

For x ∈ X, we have ρ_x(1) = f_{ρ,−1}(x) + f_ρ(x) and ρ_x(0) = 1 − 2 f_{ρ,−1}(x) − f_ρ(x). Hence for any g ∈ C^s(Y),

∫_Y g(y) dρ_x = f_{ρ,−1}(x) [g(1) − 2g(0) + g(−1)] + f_ρ(x) [g(1) − g(0)] + g(0)

and for u ∈ X,

∫_Y g(y) d(ρ_x − ρ_u) = [f_{ρ,−1}(x) − f_{ρ,−1}(u)] [g(1) − 2g(0) + g(−1)] + [f_ρ(x) − f_ρ(u)] [g(1) − g(0)].

Then our statement follows from the first part of Proposition 6. This proves the conclusion of Example 4.

A.3 Proof of Example 6

Let x ∈ X. When f(x) ≥ f^{V_in}_ρ(x), we see from the explicit form of the insensitive loss V_in that

∫_Y V_in(y, f(x)) dρ_x(y) − ∫_Y V_in(y, f^{V_in}_ρ(x)) dρ_x(y) = ∫_{y ≤ f^{V_in}_ρ(x)} [f(x) − f^{V_in}_ρ(x)] dρ_x + ∫_{f^{V_in}_ρ(x) < y < f(x)} [f(x) + f^{V_in}_ρ(x) − 2y] dρ_x − ∫_{y ≥ f(x)} [f(x) − f^{V_in}_ρ(x)] dρ_x.

It follows that

E(f) − E(f^{V_in}_ρ) = ∫_X { |f(x) − f^{V_in}_ρ(x)| Δ_x + 2 ∫_{I_x} |f(x) − y| dρ_x } dρ_X (31)

where

Δ_x := ρ_x({y : y ≤ f^{V_in}_ρ(x)}) − ρ_x({y : y > f^{V_in}_ρ(x)})

and I_x is the open interval between f^{V_in}_ρ(x) and f(x). The same relation (31) also holds when f(x) < f^{V_in}_ρ(x).

Now we use the special assumption on the conditional distributions and see that the median and the mean of ρ_x are equal: f^{V_in}_ρ(x) = f_ρ(x) for each x ∈ X. Moreover, Δ_x = 0 and when f_ρ(x) ≤ f(x) ≤ f_ρ(x) + 1, we have

2 ∫_{I_x} |f(x) − y| dρ_x = p ∫_0^{|f(x)−f_ρ(x)|} ( |f(x) − f_ρ(x)| − u ) u^{p−1} du = |f(x) − f_ρ(x)|^{p+1} / (p+1).

The same expression holds true when f_ρ(x) − 1 ≤ f(x) < f_ρ(x). When |f(x) − f_ρ(x)| > 1, since ρ_x vanishes outside [f_ρ(x) − 1, f_ρ(x) + 1], we have

2 ∫_{I_x} |f(x) − y| dρ_x = |f(x) − f_ρ(x)| − p/(p+1).

Therefore, (31) is the same as

E(f) − E(f^{V_in}_ρ) = ∫_{{x : |f(x)−f_ρ(x)| ≤ 1}} |f(x) − f_ρ(x)|^{p+1}/(p+1) dρ_X + ∫_{{x : |f(x)−f_ρ(x)| > 1}} ( |f(x) − f_ρ(x)| − p/(p+1) ) dρ_X.

This proves the desired expression for E(f) − E(f^{V_in}_ρ) and hence the bound for D(λ).

References

N. Aronszajn. Theory of reproducing kernels. Transactions of the American Mathematical Society, 68:337–404, 1950.

P. L. Bartlett, M. I. Jordan, and J. D. McAuliffe. Convexity, classification, and risk bounds. Journal of the American Statistical Association, 101:138–156, 2006.

A. Caponnetto, L. Rosasco and Y. Yao. On early stopping in gradient descent learning. Constructive Approximation, 26:289–315, 2007.

D. R. Chen, Q. Wu, Y. Ying and D. X. Zhou. Support vector machine soft margin classifiers: error analysis. Journal of Machine Learning Research, 5:1143–1175, 2004.

N. Cesa-Bianchi, P. Long and M. K. Warmuth. Worst-case quadratic loss bounds for prediction using linear functions and gradient descent. IEEE Transactions on Neural Networks, 7:604–619, 1996.

N. Cesa-Bianchi, A. Conconi and C. Gentile. On the generalization ability of on-line learning algorithms. IEEE Transactions on Information Theory, 50:2050–2057, 2004.

E. De Vito, A. Caponnetto, and L. Rosasco. Model selection for regularized least-squares algorithm in learning theory. Foundations of Computational Mathematics, 5:59–85, 2005.

E. De Vito, L. Rosasco, A. Caponnetto, M. Piana, and A. Verri. Some properties of regularized kernel methods. Journal of Machine Learning Research, 5:1363–1390, 2004.


More information

DATA MINING AND MACHINE LEARNING

DATA MINING AND MACHINE LEARNING DATA MINING AND MACHINE LEARNING Lecture 5: Regularization and loss functions Lecturer: Simone Scardapane Academic Year 2016/2017 Table of contents Loss functions Loss functions for regression problems

More information

Reproducing Kernel Banach Spaces for Machine Learning

Reproducing Kernel Banach Spaces for Machine Learning Proceedings of International Joint Conference on Neural Networks, Atlanta, Georgia, USA, June 14-19, 2009 Reproducing Kernel Banach Spaces for Machine Learning Haizhang Zhang, Yuesheng Xu and Jun Zhang

More information

9 Classification. 9.1 Linear Classifiers

9 Classification. 9.1 Linear Classifiers 9 Classification This topic returns to prediction. Unlike linear regression where we were predicting a numeric value, in this case we are predicting a class: winner or loser, yes or no, rich or poor, positive

More information

Mercer s Theorem, Feature Maps, and Smoothing

Mercer s Theorem, Feature Maps, and Smoothing Mercer s Theorem, Feature Maps, and Smoothing Ha Quang Minh, Partha Niyogi, and Yuan Yao Department of Computer Science, University of Chicago 00 East 58th St, Chicago, IL 60637, USA Department of Mathematics,

More information

Bits of Machine Learning Part 1: Supervised Learning

Bits of Machine Learning Part 1: Supervised Learning Bits of Machine Learning Part 1: Supervised Learning Alexandre Proutiere and Vahan Petrosyan KTH (The Royal Institute of Technology) Outline of the Course 1. Supervised Learning Regression and Classification

More information

Stochastic optimization in Hilbert spaces

Stochastic optimization in Hilbert spaces Stochastic optimization in Hilbert spaces Aymeric Dieuleveut Aymeric Dieuleveut Stochastic optimization Hilbert spaces 1 / 48 Outline Learning vs Statistics Aymeric Dieuleveut Stochastic optimization Hilbert

More information

Sobolev spaces. May 18

Sobolev spaces. May 18 Sobolev spaces May 18 2015 1 Weak derivatives The purpose of these notes is to give a very basic introduction to Sobolev spaces. More extensive treatments can e.g. be found in the classical references

More information

Empirical Risk Minimization

Empirical Risk Minimization Empirical Risk Minimization Fabrice Rossi SAMM Université Paris 1 Panthéon Sorbonne 2018 Outline Introduction PAC learning ERM in practice 2 General setting Data X the input space and Y the output space

More information

Fast Rates for Estimation Error and Oracle Inequalities for Model Selection

Fast Rates for Estimation Error and Oracle Inequalities for Model Selection Fast Rates for Estimation Error and Oracle Inequalities for Model Selection Peter L. Bartlett Computer Science Division and Department of Statistics University of California, Berkeley bartlett@cs.berkeley.edu

More information

Online Passive-Aggressive Algorithms

Online Passive-Aggressive Algorithms Online Passive-Aggressive Algorithms Koby Crammer Ofer Dekel Shai Shalev-Shwartz Yoram Singer School of Computer Science & Engineering The Hebrew University, Jerusalem 91904, Israel {kobics,oferd,shais,singer}@cs.huji.ac.il

More information

Strictly Positive Definite Functions on a Real Inner Product Space

Strictly Positive Definite Functions on a Real Inner Product Space Strictly Positive Definite Functions on a Real Inner Product Space Allan Pinkus Abstract. If ft) = a kt k converges for all t IR with all coefficients a k 0, then the function f< x, y >) is positive definite

More information

Indirect Rule Learning: Support Vector Machines. Donglin Zeng, Department of Biostatistics, University of North Carolina

Indirect Rule Learning: Support Vector Machines. Donglin Zeng, Department of Biostatistics, University of North Carolina Indirect Rule Learning: Support Vector Machines Indirect learning: loss optimization It doesn t estimate the prediction rule f (x) directly, since most loss functions do not have explicit optimizers. Indirection

More information

BINARY CLASSIFICATION

BINARY CLASSIFICATION BINARY CLASSIFICATION MAXIM RAGINSY The problem of binary classification can be stated as follows. We have a random couple Z = X, Y ), where X R d is called the feature vector and Y {, } is called the

More information

Surrogate loss functions, divergences and decentralized detection

Surrogate loss functions, divergences and decentralized detection Surrogate loss functions, divergences and decentralized detection XuanLong Nguyen Department of Electrical Engineering and Computer Science U.C. Berkeley Advisors: Michael Jordan & Martin Wainwright 1

More information

Online Passive-Aggressive Algorithms

Online Passive-Aggressive Algorithms Online Passive-Aggressive Algorithms Koby Crammer Ofer Dekel Shai Shalev-Shwartz Yoram Singer School of Computer Science & Engineering The Hebrew University, Jerusalem 91904, Israel {kobics,oferd,shais,singer}@cs.huji.ac.il

More information

Lecture 14 : Online Learning, Stochastic Gradient Descent, Perceptron

Lecture 14 : Online Learning, Stochastic Gradient Descent, Perceptron CS446: Machine Learning, Fall 2017 Lecture 14 : Online Learning, Stochastic Gradient Descent, Perceptron Lecturer: Sanmi Koyejo Scribe: Ke Wang, Oct. 24th, 2017 Agenda Recap: SVM and Hinge loss, Representer

More information

Relationships between upper exhausters and the basic subdifferential in variational analysis

Relationships between upper exhausters and the basic subdifferential in variational analysis J. Math. Anal. Appl. 334 (2007) 261 272 www.elsevier.com/locate/jmaa Relationships between upper exhausters and the basic subdifferential in variational analysis Vera Roshchina City University of Hong

More information

Back to the future: Radial Basis Function networks revisited

Back to the future: Radial Basis Function networks revisited Back to the future: Radial Basis Function networks revisited Qichao Que, Mikhail Belkin Department of Computer Science and Engineering Ohio State University Columbus, OH 4310 que, mbelkin@cse.ohio-state.edu

More information

Second Order Optimization Algorithms I

Second Order Optimization Algorithms I Second Order Optimization Algorithms I Yinyu Ye Department of Management Science and Engineering Stanford University Stanford, CA 94305, U.S.A. http://www.stanford.edu/ yyye Chapters 7, 8, 9 and 10 1 The

More information

Reproducing Kernel Hilbert Spaces

Reproducing Kernel Hilbert Spaces Reproducing Kernel Hilbert Spaces Lorenzo Rosasco 9.520 Class 03 February 12, 2007 About this class Goal To introduce a particularly useful family of hypothesis spaces called Reproducing Kernel Hilbert

More information

Oslo Class 2 Tikhonov regularization and kernels

Oslo Class 2 Tikhonov regularization and kernels RegML2017@SIMULA Oslo Class 2 Tikhonov regularization and kernels Lorenzo Rosasco UNIGE-MIT-IIT May 3, 2017 Learning problem Problem For H {f f : X Y }, solve min E(f), f H dρ(x, y)l(f(x), y) given S n

More information

Divergences, surrogate loss functions and experimental design

Divergences, surrogate loss functions and experimental design Divergences, surrogate loss functions and experimental design XuanLong Nguyen University of California Berkeley, CA 94720 xuanlong@cs.berkeley.edu Martin J. Wainwright University of California Berkeley,

More information

9.2 Support Vector Machines 159

9.2 Support Vector Machines 159 9.2 Support Vector Machines 159 9.2.3 Kernel Methods We have all the tools together now to make an exciting step. Let us summarize our findings. We are interested in regularized estimation problems of

More information

Support Vector Machines

Support Vector Machines Wien, June, 2010 Paul Hofmarcher, Stefan Theussl, WU Wien Hofmarcher/Theussl SVM 1/21 Linear Separable Separating Hyperplanes Non-Linear Separable Soft-Margin Hyperplanes Hofmarcher/Theussl SVM 2/21 (SVM)

More information

2. Function spaces and approximation

2. Function spaces and approximation 2.1 2. Function spaces and approximation 2.1. The space of test functions. Notation and prerequisites are collected in Appendix A. Let Ω be an open subset of R n. The space C0 (Ω), consisting of the C

More information

Learning with Rejection

Learning with Rejection Learning with Rejection Corinna Cortes 1, Giulia DeSalvo 2, and Mehryar Mohri 2,1 1 Google Research, 111 8th Avenue, New York, NY 2 Courant Institute of Mathematical Sciences, 251 Mercer Street, New York,

More information

Forchheimer Equations in Porous Media - Part III

Forchheimer Equations in Porous Media - Part III Forchheimer Equations in Porous Media - Part III Luan Hoang, Akif Ibragimov Department of Mathematics and Statistics, Texas Tech niversity http://www.math.umn.edu/ lhoang/ luan.hoang@ttu.edu Applied Mathematics

More information

Robust Support Vector Machines for Probability Distributions

Robust Support Vector Machines for Probability Distributions Robust Support Vector Machines for Probability Distributions Andreas Christmann joint work with Ingo Steinwart (Los Alamos National Lab) ICORS 2008, Antalya, Turkey, September 8-12, 2008 Andreas Christmann,

More information

Sparseness of Support Vector Machines

Sparseness of Support Vector Machines Journal of Machine Learning Research 4 2003) 07-05 Submitted 0/03; Published /03 Sparseness of Support Vector Machines Ingo Steinwart Modeling, Algorithms, and Informatics Group, CCS-3 Mail Stop B256 Los

More information

(Kernels +) Support Vector Machines

(Kernels +) Support Vector Machines (Kernels +) Support Vector Machines Machine Learning Torsten Möller Reading Chapter 5 of Machine Learning An Algorithmic Perspective by Marsland Chapter 6+7 of Pattern Recognition and Machine Learning

More information

SEPARABILITY AND COMPLETENESS FOR THE WASSERSTEIN DISTANCE

SEPARABILITY AND COMPLETENESS FOR THE WASSERSTEIN DISTANCE SEPARABILITY AND COMPLETENESS FOR THE WASSERSTEIN DISTANCE FRANÇOIS BOLLEY Abstract. In this note we prove in an elementary way that the Wasserstein distances, which play a basic role in optimal transportation

More information

Homework Assignment #2 for Prob-Stats, Fall 2018 Due date: Monday, October 22, 2018

Homework Assignment #2 for Prob-Stats, Fall 2018 Due date: Monday, October 22, 2018 Homework Assignment #2 for Prob-Stats, Fall 2018 Due date: Monday, October 22, 2018 Topics: consistent estimators; sub-σ-fields and partial observations; Doob s theorem about sub-σ-field measurability;

More information

Reflected Brownian Motion

Reflected Brownian Motion Chapter 6 Reflected Brownian Motion Often we encounter Diffusions in regions with boundary. If the process can reach the boundary from the interior in finite time with positive probability we need to decide

More information

The Kernel Trick. Robert M. Haralick. Computer Science, Graduate Center City University of New York

The Kernel Trick. Robert M. Haralick. Computer Science, Graduate Center City University of New York The Kernel Trick Robert M. Haralick Computer Science, Graduate Center City University of New York Outline SVM Classification < (x 1, c 1 ),..., (x Z, c Z ) > is the training data c 1,..., c Z { 1, 1} specifies

More information

Kernels for Multi task Learning

Kernels for Multi task Learning Kernels for Multi task Learning Charles A Micchelli Department of Mathematics and Statistics State University of New York, The University at Albany 1400 Washington Avenue, Albany, NY, 12222, USA Massimiliano

More information

Linear & nonlinear classifiers

Linear & nonlinear classifiers Linear & nonlinear classifiers Machine Learning Hamid Beigy Sharif University of Technology Fall 1394 Hamid Beigy (Sharif University of Technology) Linear & nonlinear classifiers Fall 1394 1 / 34 Table

More information

Singular Integrals. 1 Calderon-Zygmund decomposition

Singular Integrals. 1 Calderon-Zygmund decomposition Singular Integrals Analysis III Calderon-Zygmund decomposition Let f be an integrable function f dx 0, f = g + b with g Cα almost everywhere, with b

More information

Perceptron Revisited: Linear Separators. Support Vector Machines

Perceptron Revisited: Linear Separators. Support Vector Machines Support Vector Machines Perceptron Revisited: Linear Separators Binary classification can be viewed as the task of separating classes in feature space: w T x + b > 0 w T x + b = 0 w T x + b < 0 Department

More information

Worst-Case Analysis of the Perceptron and Exponentiated Update Algorithms

Worst-Case Analysis of the Perceptron and Exponentiated Update Algorithms Worst-Case Analysis of the Perceptron and Exponentiated Update Algorithms Tom Bylander Division of Computer Science The University of Texas at San Antonio San Antonio, Texas 7849 bylander@cs.utsa.edu April

More information

Sparseness Versus Estimating Conditional Probabilities: Some Asymptotic Results

Sparseness Versus Estimating Conditional Probabilities: Some Asymptotic Results Sparseness Versus Estimating Conditional Probabilities: Some Asymptotic Results Peter L. Bartlett 1 and Ambuj Tewari 2 1 Division of Computer Science and Department of Statistics University of California,

More information

Functional Analysis. Franck Sueur Metric spaces Definitions Completeness Compactness Separability...

Functional Analysis. Franck Sueur Metric spaces Definitions Completeness Compactness Separability... Functional Analysis Franck Sueur 2018-2019 Contents 1 Metric spaces 1 1.1 Definitions........................................ 1 1.2 Completeness...................................... 3 1.3 Compactness......................................

More information

9.520: Class 20. Bayesian Interpretations. Tomaso Poggio and Sayan Mukherjee

9.520: Class 20. Bayesian Interpretations. Tomaso Poggio and Sayan Mukherjee 9.520: Class 20 Bayesian Interpretations Tomaso Poggio and Sayan Mukherjee Plan Bayesian interpretation of Regularization Bayesian interpretation of the regularizer Bayesian interpretation of quadratic

More information

Partial Differential Equations

Partial Differential Equations Part II Partial Differential Equations Year 2015 2014 2013 2012 2011 2010 2009 2008 2007 2006 2005 2015 Paper 4, Section II 29E Partial Differential Equations 72 (a) Show that the Cauchy problem for u(x,

More information

RANDOM FIELDS AND GEOMETRY. Robert Adler and Jonathan Taylor

RANDOM FIELDS AND GEOMETRY. Robert Adler and Jonathan Taylor RANDOM FIELDS AND GEOMETRY from the book of the same name by Robert Adler and Jonathan Taylor IE&M, Technion, Israel, Statistics, Stanford, US. ie.technion.ac.il/adler.phtml www-stat.stanford.edu/ jtaylor

More information

NONLINEAR CLASSIFICATION AND REGRESSION. J. Elder CSE 4404/5327 Introduction to Machine Learning and Pattern Recognition

NONLINEAR CLASSIFICATION AND REGRESSION. J. Elder CSE 4404/5327 Introduction to Machine Learning and Pattern Recognition NONLINEAR CLASSIFICATION AND REGRESSION Nonlinear Classification and Regression: Outline 2 Multi-Layer Perceptrons The Back-Propagation Learning Algorithm Generalized Linear Models Radial Basis Function

More information

Convergence Rates of Kernel Quadrature Rules

Convergence Rates of Kernel Quadrature Rules Convergence Rates of Kernel Quadrature Rules Francis Bach INRIA - Ecole Normale Supérieure, Paris, France ÉCOLE NORMALE SUPÉRIEURE NIPS workshop on probabilistic integration - Dec. 2015 Outline Introduction

More information

Richard F. Bass Krzysztof Burdzy University of Washington

Richard F. Bass Krzysztof Burdzy University of Washington ON DOMAIN MONOTONICITY OF THE NEUMANN HEAT KERNEL Richard F. Bass Krzysztof Burdzy University of Washington Abstract. Some examples are given of convex domains for which domain monotonicity of the Neumann

More information

ELEMENTS OF PROBABILITY THEORY

ELEMENTS OF PROBABILITY THEORY ELEMENTS OF PROBABILITY THEORY Elements of Probability Theory A collection of subsets of a set Ω is called a σ algebra if it contains Ω and is closed under the operations of taking complements and countable

More information

An introduction to some aspects of functional analysis

An introduction to some aspects of functional analysis An introduction to some aspects of functional analysis Stephen Semmes Rice University Abstract These informal notes deal with some very basic objects in functional analysis, including norms and seminorms

More information

Lecture No 1 Introduction to Diffusion equations The heat equat

Lecture No 1 Introduction to Diffusion equations The heat equat Lecture No 1 Introduction to Diffusion equations The heat equation Columbia University IAS summer program June, 2009 Outline of the lectures We will discuss some basic models of diffusion equations and

More information

Full-information Online Learning

Full-information Online Learning Introduction Expert Advice OCO LM A DA NANJING UNIVERSITY Full-information Lijun Zhang Nanjing University, China June 2, 2017 Outline Introduction Expert Advice OCO 1 Introduction Definitions Regret 2

More information

Stochastic and online algorithms

Stochastic and online algorithms Stochastic and online algorithms stochastic gradient method online optimization and dual averaging method minimizing finite average Stochastic and online optimization 6 1 Stochastic optimization problem

More information