Online Learning with Samples Drawn from Non-identical Distributions
Journal of Machine Learning Research 10 (2009). Submitted 12/08; Revised 7/09; Published 12/09.

Ting Hu
Ding-Xuan Zhou
Department of Mathematics, City University of Hong Kong, Tat Chee Avenue, Kowloon, Hong Kong, China

Editor: Peter Bartlett

Abstract

Learning algorithms are based on samples which are often drawn independently from an identical distribution (i.i.d.). In this paper we consider a different setting, with samples drawn according to a non-identical sequence of probability distributions: at each time step the sample is drawn from a different distribution. In this setting we investigate a fully online learning algorithm associated with a general convex loss function and a reproducing kernel Hilbert space (RKHS). Error analysis is conducted under the assumption that the sequence of marginal distributions converges polynomially in the dual of a Hölder space. For regression with least square or insensitive loss, learning rates are given in both the RKHS norm and the L² norm. For classification with hinge loss and support vector machine q-norm loss, rates are explicitly stated with respect to the excess misclassification error.

Keywords: sampling with non-identical distributions, online learning, classification with a general convex loss, regression with insensitive loss and least square loss, reproducing kernel Hilbert space

1. Introduction

In the literature of learning theory, samples for algorithms are often assumed to be drawn independently from an identical distribution. Here we consider a setting with samples drawn from non-identical distributions. Such a framework was introduced in Smale and Zhou (2009) and Steinwart et al. (2008), where online learning for least square regression and off-line support vector machines are investigated. We shall follow this framework and study a kernel-based online learning algorithm associated with a general convex loss function.
Our analysis can be applied for various purposes including regression and classification.

1.1 Sampling with Non-identical Distributions

Let (X, d) be a metric space, called the input space for the learning problem. Let Y be a compact subset of ℝ (the output space) and Z = X × Y. In our online learning setting, at each step t = 1, 2, ..., a pair z_t = (x_t, y_t) is drawn from a probability distribution ρ^(t) on Z. The sampling sequence of probability distributions {ρ^(t)}_{t=1,2,...} is not identical. For convergence analysis, we shall assume that the sequence {ρ_X^(t)}_{t=1,2,...} of marginal distributions on X converges polynomially in the dual of the Hölder space C^s(X) for some 0 < s ≤ 1.

© 2009 Ting Hu and Ding-Xuan Zhou.
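Before turning to the Hölder space formalism, the drifting-marginals model can be made concrete with a small numerical sketch in the spirit of the perturbation example given below: a fixed distribution on a finite grid is perturbed at time t by a bounded function scaled by t^{−b}, and the gap between expectations decays at rate t^{−b}. The grid, the perturbation h, the test function f and all constants are our own illustrative choices, not from the paper.

```python
# Numerical illustration of the polynomial decay of marginals: a base
# distribution rho_X on a finite grid is perturbed as
# d rho^(t) = (1 + t^(-b) h(x)) d rho_X with h bounded and of mean zero.
b = 1.5                       # power index b of the decay condition
grid = [0.0, 0.25, 0.5, 0.75, 1.0]
w = [0.2] * 5                 # base distribution rho_X: uniform weights

mean = sum(x * p for x, p in zip(grid, w))
h = [x - mean for x in grid]  # bounded perturbation with integral zero
f = [x * x for x in grid]     # a test function

def gap(t):
    """|integral of f d rho^(t) - integral of f d rho_X|."""
    wt = [p * (1.0 + t ** (-b) * hv) for p, hv in zip(w, h)]
    return abs(sum(fv * pv for fv, pv in zip(f, wt))
               - sum(fv * pv for fv, pv in zip(f, w)))

sup_h = max(abs(v) for v in h)
sup_f = max(abs(v) for v in f)
for t in [1, 2, 4, 8, 16]:
    # the gap is bounded by t^(-b) * sup|h| * sup|f|, a crude form of the
    # decay requirement on the marginals
    assert gap(t) <= t ** (-b) * sup_h * sup_f + 1e-12
```

Here the bound sup|h|·sup|f| plays the role of the constant C times the Hölder norm of f in the decay condition stated below.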
Here the Hölder space C^s(X) is defined to be the space of all continuous functions on X with the norm ‖f‖_{C^s(X)} = ‖f‖_{C(X)} + |f|_{C^s(X)} finite, where

|f|_{C^s(X)} := sup_{x≠y} |f(x) − f(y)| / (d(x,y))^s.

Definition 1 We say that the sequence {ρ_X^(t)}_{t=1,2,...} converges polynomially to a probability distribution ρ_X in (C^s(X))* (0 ≤ s ≤ 1) if there exist C > 0 and b > 0 such that

‖ρ_X^(t) − ρ_X‖_{(C^s(X))*} ≤ C t^{−b},  ∀t ∈ ℕ.  (1)

By the definition of the dual space (C^s(X))*, decay condition (1) can be expressed as

|∫_X f(x) dρ_X^(t) − ∫_X f(x) dρ_X| ≤ C t^{−b} ‖f‖_{C^s(X)},  ∀f ∈ C^s(X), t ∈ ℕ.  (2)

What measures quantitatively the difference between our non-identical setting and the i.i.d. case is the power index b. Its impact on the performance of online learning algorithms will be studied in this paper. The i.i.d. case corresponds to b = ∞.

We describe three situations in which decay condition (1) is satisfied. The first is when a distribution ρ_X is perturbed by some noise and the noise level decreases as t increases.

Example 1 Let {h^(t)} be a sequence of bounded functions on X such that sup_{x∈X} |h^(t)(x)| ≤ C t^{−b}. Then the sequence {ρ_X^(t)}_{t=1,2,...} defined by dρ_X^(t) = dρ_X + h^(t)(x) dρ_X satisfies (1) for any 0 ≤ s ≤ 1. The proof follows from

|∫_X f(x) h^(t)(x) dρ_X| ≤ sup_{x∈X} |h^(t)(x)| ‖f‖_{C(X)} ≤ C t^{−b} ‖f‖_{C^s(X)}.

In this example, h^(t) is the density function of the noise distribution and we assume its bound (the noise level) to decay polynomially as t increases.

The second situation in which decay condition (1) is satisfied is generated by iterated actions of an integral operator associated with a stochastic density kernel. We demonstrate this situation by an example on X = S^{n−1}, the unit sphere of ℝ^n with n ≥ 2. Let ds be the normalized surface element of S^{n−1}. The corresponding space L²(S^{n−1}) has an orthonormal basis {Y_{l,k} : l ∈ ℤ_+, k = 1, ..., N(n,l)} with N(n,0) = 1 and N(n,l) = ((2l+n−2)/l) (l+n−3)! / ((n−2)!(l−1)!). Here Y_{l,k} is a spherical harmonic of order l, which is the restriction onto S^{n−1} of a homogeneous polynomial f of degree l in ℝ^n satisfying the Laplace equation Δf = 0. In particular, Y_{0,0} ≡ 1.
Example 2 Let X = S^{n−1}, 0 < α < 1, and ψ ∈ C(X × X) be given by

ψ(x,u) = 1 + Σ_{l=1}^∞ Σ_{k=1}^{N(n,l)} a_{l,k} Y_{l,k}(x) Y_{l,k}(u),  where 0 ≤ a_{l,k} ≤ α,  Σ_{l=1}^∞ Σ_{k=1}^{N(n,l)} a_{l,k} ‖Y_{l,k}‖²_{C(X)} < 1.

If h^(1) is a square integrable density function on X and a sequence of density functions {h^(t)} is defined by

h^(t+1)(x) = ∫_X ψ(x,u) h^(t)(u) ds(u),  x ∈ X, t ∈ ℕ,

then we know

h^(t) = Y_{0,0} + Σ_{l=1}^∞ Σ_{k=1}^{N(n,l)} a_{l,k}^{t−1} ⟨h^(1), Y_{l,k}⟩_{L²(S^{n−1})} Y_{l,k}

and ‖h^(t) − Y_{0,0}‖_{L²(S^{n−1})} ≤ α^{t−1} ‖h^(1)‖_{L²(S^{n−1})}. It follows that the sequence {dρ_X^(t) = h^(t)(x) ds}_{t=1,2,...} of probability distributions on X converges polynomially to the uniform distribution dρ_X = ds on X and satisfies (1) for any 0 ≤ s ≤ 1.
In general, if ν is a strictly positive probability distribution on X, and if ψ ∈ C(X × X) is strictly positive and satisfies ∫_X ψ(x,u) dν(u) = 1 for each x ∈ X, then the sequence {ρ^(t)} defined by

ρ^(t)(Γ) = ∫_Γ [∫_X ψ(x,u) dρ^(t−1)(x)] dν(u)  on Borel sets Γ ⊆ X

satisfies ‖ρ^(t) − ρ_X‖_{(C(X))*} ≤ C α^t for some (strictly positive) probability distribution ρ_X on X and constants C > 0, 0 < α < 1. Hence decay condition (1) is valid for any 0 ≤ s ≤ 1. For details, see Smale and Zhou (2009).

The third situation realizing decay condition (1) is to induce distributions by dynamical systems. Here we present a simple example.

Example 3 Let X = [−1/2, 1/2] and for each t ∈ ℕ, let the probability distribution ρ_X^(t) on X have support [−2^{−t}, 2^{−t}] and uniform density 2^{t−1} on its support. Then with δ_0 being the delta distribution at the origin, for each 0 < s ≤ 1 we have

|∫_X f(x) dρ_X^(t) − ∫_X f(x) dδ_0| ≤ 2^{t−1} ∫_{−2^{−t}}^{2^{−t}} |f(x) − f(0)| dx ≤ (2^{−s})^t |f|_{C^s(X)}.

Remark 2 Since ‖f‖_{C(X)} ≤ ‖f‖_{C^s(X)}, we see from (2) that decay condition (1) with any 0 < s ≤ 1 is satisfied whenever this polynomial convergence requirement is valid in the case s = 0. This happens in Examples 1 and 2. Note that when s = 0, the dual space (C(X))* is exactly the space of signed finite measures on X. Each signed finite measure µ on X lies in (C(X))* ⊆ (C^s(X))* and satisfies ‖µ‖_{(C^s(X))*} ≤ ‖µ‖_{(C(X))*} ≤ ∫_X d|µ|.

1.2 Fully Online Learning Algorithm

In this paper we study a family of online learning algorithms associated with reproducing kernel Hilbert spaces and a general convex loss function. A reproducing kernel Hilbert space (RKHS) is induced by a Mercer kernel K : X × X → ℝ, a continuous and symmetric function such that the matrix (K(x_i, x_j))_{i,j=1}^l is positive semidefinite for any finite set of points {x_1, ..., x_l} ⊂ X. The RKHS H_K is the completion (Aronszajn, 1950) of the span of the set of functions {K_x = K(x, ·) : x ∈ X}, with the inner product given by ⟨K_x, K_y⟩_K = K(x,y).

Definition 3 We say that V : Y × ℝ → ℝ_+ is a convex loss function if for each y ∈ Y, the univariate function V(y, ·) : ℝ → ℝ_+ is convex.
The convexity tells us (Rockafellar, 1970) that for each f ∈ ℝ and y ∈ Y, the left derivative lim_{δ→0−} (V(y, f+δ) − V(y, f))/δ exists and is no more than the right derivative lim_{δ→0+} (V(y, f+δ) − V(y, f))/δ. An arbitrary number between them (which is a subgradient) will be taken and denoted as V'(y, f) in our algorithm. For the least square regression problem, we can take V(y, f) = (y − f)². For the binary classification problem, we can take V(y, f) = φ(y f) with φ : ℝ → ℝ_+ a convex function. The online algorithm associated with the RKHS H_K and the convex loss V is a stochastic gradient descent method (Cesa-Bianchi et al., 1996; Kivinen et al., 2004; Smale and Yao, 2006; Ying and Zhou, 2006; Ying, 2007).
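The subgradient V'(y, f) can be written down explicitly for the losses treated in this paper. A minimal sketch (the function names are ours; at a kink, any value between the one-sided derivatives is admissible, and we pick one):

```python
# Subgradient choices V'(y, f) for three convex losses.
def V_prime_ls(y, f):
    # least square loss V(y, f) = (y - f)^2 is smooth: V'(y, f) = 2(f - y)
    return 2.0 * (f - y)

def V_prime_hinge(y, f):
    # hinge loss V(y, f) = max(1 - y f, 0); kink at y f = 1, where we pick 0
    if y * f < 1.0:
        return -y
    return 0.0

def V_prime_abs(y, f):
    # insensitive loss V(y, f) = |y - f|; kink at f = y, where we pick 0
    if f > y:
        return 1.0
    if f < y:
        return -1.0
    return 0.0
```

Each choice satisfies the subgradient inequality V(y, g) ≥ V(y, f) + V'(y, f)(g − f), which is the only property the analysis uses.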
Definition 4 The fully online learning algorithm is defined by f_1 = 0 and

f_{t+1} = f_t − η_t (V'(y_t, f_t(x_t)) K_{x_t} + λ_t f_t),  for t = 1, 2, ...,  (3)

where λ_t > 0 is called the regularization parameter and η_t > 0 the step size.

In this fully online algorithm, the regularization parameter λ_t changes with the learning step t. Throughout the paper we assume that λ_{t+1} ≤ λ_t for each t ∈ ℕ. When the regularization parameter λ_t ≡ λ_1 does not change as the steps develop, we call scheme (3) partially online.

The goal of this paper is to investigate the fully online learning algorithm (3) when the sampling sequence is not identical. We will show that learning rates in the non-identical setting can be the same as those in the i.i.d. case when the power index b in polynomial decay condition (1) is large enough, that is, when ρ_X^(t) converges fast to ρ_X. When b is small, the non-identical effect becomes crucial and the learning rates will depend essentially on b.

2. Error Bounds for Regression and Classification

As in the work on least square regression (Smale and Zhou, 2009), we assume for the sampling sequence {ρ^(t)}_{t=1,2,...} that the conditional distribution ρ^(t)(y|x) of each ρ^(t) at x ∈ X is independent of t, denoted as ρ_x. Throughout the paper we assume independence of the sampling, that is, {z_t = (x_t, y_t)}_t is a sequence of samples drawn from the product probability space Π_{t=1,2,...} (Z, ρ^(t)). Error analysis will be conducted for fully online learning algorithm (3) under polynomial decay condition (1) for the sequence of marginal distributions {ρ_X^(t)}. Let ρ be the probability distribution on Z given by the marginal distribution ρ_X and the conditional distributions {ρ_x}. The essential difficulty in our non-identical setting is caused by the deviation of ρ^(t) from ρ. The first novelty of our analysis is to deal with an error quantity ∆_t involving ρ^(t) − ρ (defined by (15) below) which occurs only in the non-identical setting.
This is handled for a general loss function V and output space Y by Lemma 18 in Section 3, under decay condition (1) for the marginal distributions {ρ_X^(t)} and Lipschitz-s continuity of the conditional distributions {ρ_x : x ∈ X}.

Definition 5 We say that the set of distributions {ρ_x : x ∈ X} is Lipschitz-s in (C^s(Y))* if there exists a constant C_ρ ≥ 0 such that

‖ρ_x − ρ_u‖_{(C^s(Y))*} ≤ C_ρ (d(x,u))^s,  ∀x, u ∈ X.  (4)

Notice that on the compact subset Y of ℝ, the Hölder space C^s(Y) and its dual (C^s(Y))* are well defined. Each ρ_x belongs to (C^s(Y))*.

The second novelty of our analysis is to show for the least square loss (described in Section 3) and for binary classification that Lipschitz-s continuity (4) of {ρ_x : x ∈ X} is the same as requiring f_ρ ∈ C^s(X), where f_ρ is the regression function defined by

f_ρ(x) = ∫_Y y dρ_x(y),  x ∈ X.  (5)

Proposition 6 Let 0 < s ≤ 1. Condition (4) implies f_ρ ∈ C^s(X) with ‖f_ρ‖_{C^s(X)} ≤ C_ρ (1 + 2^{1−s}) sup_{y∈Y} |y|. When Y = {−1, 1}, f_ρ ∈ C^s(X) also implies (4), with C_ρ ≤ ‖f_ρ‖_{C^s(X)}.
The two-point nature of the output space for binary classification plays a crucial role in our observation. The second statement of Proposition 6 is not true for a general output space. Here is one example.

Example 4 Let 0 < s ≤ 1 and Y = {−1, 1, 0}. Then condition (4) holds if and only if f_ρ ∈ C^s(X) and f_{ρ,−1} ∈ C^s(X), where f_{ρ,−1} is the function on X given by f_{ρ,−1}(x) = ρ_x({−1}).

Proofs of Proposition 6 and Example 4 will be given in the appendix.

Our third novelty is to understand some essential differences between our non-identical setting and the classical i.i.d. setting by pointing out the key role played by the power index b of polynomial decay condition ‖ρ_X^(t) − ρ_X‖_{(C^s(X))*} ≤ C t^{−b} in the convergence rates derived in Theorems 7 and 10 for regression and Theorem 11 for classification. Even for least square regression our result improves the error analysis in Smale and Zhou (2009), where a stronger exponential decay condition ‖ρ_X^(t) − ρ_X‖_{(C^s(X))*} ≤ C α^t is assumed.

Our error bounds for fully online algorithm (3) are comparable with those for a batch learning algorithm generated by the off-line regularization scheme in H_K, defined with a sample z := {z_t = (x_t, y_t)}_{t=1}^T and a regularization parameter λ > 0 as

f_{z,λ} = arg min_{f ∈ H_K} { (1/T) Σ_{t=1}^T V(y_t, f(x_t)) + (λ/2) ‖f‖²_K }.  (6)

Let us demonstrate our error analysis by learning rates for regression with least square loss and insensitive loss and for binary classification with hinge loss.

2.1 Learning Rates for Least Square Regression

Here we take Y = [−M, M] for some M > 0 and the least square loss V = V_ls given by V_ls(y, f) = (y − f)². Then the algorithm takes the form

f_{t+1} = f_t − η_t ((f_t(x_t) − y_t) K_{x_t} + λ_t f_t),  for t = 1, 2, ...

The following learning rates are derived by the procedure in Smale and Zhou (2009), where an exponential decay condition is assumed. Here we only impose the much weaker polynomial decay condition (1).
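The least square update above can be implemented directly by tracking coefficients, since f_t always lies in the span of {K_{x_1}, ..., K_{x_{t−1}}}. The following sketch uses a Gaussian kernel and a toy data stream; the kernel, the stream and the schedule constants are our own illustrative assumptions:

```python
# Sketch of the least square instance of the online scheme:
# f_{t+1} = f_t - eta_t ((f_t(x_t) - y_t) K_{x_t} + lambda_t f_t),
# stored through coefficients c_i with f_t = sum_i c_i K(x_i, .).
import math

def K(x, u, width=1.0):
    # illustrative Gaussian (Mercer) kernel on the real line
    return math.exp(-((x - u) ** 2) / (2.0 * width ** 2))

def run_online_ls(stream, lam1=0.01, eta1=0.1, gamma=0.2, alpha=0.4):
    xs, cs = [], []                       # support points and coefficients
    for t, (xt, yt) in enumerate(stream, start=1):
        lam_t = lam1 * t ** (-gamma)      # lambda_t = lambda_1 t^(-gamma)
        eta_t = eta1 * t ** (-alpha)      # eta_t = eta_1 t^(-alpha)
        ft_xt = sum(c * K(x, xt) for c, x in zip(cs, xs))
        cs = [(1.0 - eta_t * lam_t) * c for c in cs]   # shrink by -eta_t lam_t f_t
        xs.append(xt)
        cs.append(-eta_t * (ft_xt - yt))  # gradient step on the new center
    return xs, cs

def predict(xs, cs, x):
    return sum(c * K(xi, x) for c, xi in zip(cs, xs))
```

On a simple stream with constant target, e.g. run_online_ls([((i % 10) / 20.0, 1.0) for i in range(300)]), the iterate's predictions approach the constant target, as one expects from a stochastic gradient method with decaying steps.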
We also assume the regularity condition (of order r > 0)

f_ρ = L_K^r (g_ρ)  for some g_ρ ∈ L²_{ρ_X}(X),  (7)

where L_K is the integral operator on L²_{ρ_X}(X) defined by L_K f(x) = ∫_X K(x,v) f(v) dρ_X(v), x ∈ X, with L_K^r well defined since L_K is a compact positive operator.

Theorem 7 Let 0 < s ≤ 1 and 1/2 < r ≤ 3/2. Assume K ∈ C^{2s}(X × X), regularity condition (7) for f_ρ, and (1) with b > (2r+2)/(2r+1) for {ρ_X^(t)}. Take

λ_t = λ_1 t^{−1/(2r+1)},  η_t = η_1 t^{−2r/(2r+1)}
with λ_1 η_1 > (2r−1)/(4r+2). Then

IE_{z_1,...,z_T} (‖f_{T+1} − f_ρ‖_K) ≤ C T^{−(2r−1)/(4r+2)},

where C is a constant independent of T.

Denote the constant κ = sup_{x∈X} √(K(x,x)). From the reproducing property of the RKHS H_K,

⟨K_x, f⟩_K = f(x),  ∀x ∈ X, f ∈ H_K,  (8)

we see that ‖f‖_{C(X)} ≤ κ ‖f‖_K for every f ∈ H_K. Most error analysis in the literature of least square regression (Zhang, 2004; De Vito et al., 2005; Smale and Yao, 2006; Wu et al., 2007) concerns the L²-norm ‖f_{T+1} − f_ρ‖_{L²_{ρ_X}} or the risk in the i.i.d. case. From a predictive viewpoint, in the non-identical setting, the error f_{T+1} − f_ρ should be measured with respect to the distribution ρ_X^(T), not the limit ρ_X. This can be done by bounding ‖f_{T+1} − f_ρ‖_{C(X)} (since ρ_X^(T) changes with T), which follows from estimates for ‖f_{T+1} − f_ρ‖_K. So our bounds for the error in the H_K-norm provide useful predictive information about the learning ability of fully online algorithm (3) in the non-identical setting.

Remark 8 When X ⊂ ℝ^n and K ∈ C^{2m}(X × X) for some m ∈ ℕ, we know from Zhou (2003, 2008) and Theorem 7 that IE_{z_1,...,z_T} (‖f_{T+1} − f_ρ‖_{C^m(X)}) = O(T^{−(2r−1)/(4r+2)}). So the regression function is learned efficiently by the online algorithm not only in the usual L²_{ρ_X} space, but also strongly in the space C^m(X), implying the learning of gradients (Mukherjee and Wu, 2006).

In the special case r = 3/2, the learning rate in Theorem 7 is IE_{z_1,...,z_T} (‖f_{T+1} − f_ρ‖_K) = O(T^{−1/4}), the same as those in the literature (Smale and Zhou, 2009; Tarrès and Yao, 2005; Smale and Zhou, 2007). Here we assume polynomial convergence condition (1) with a large index b > (2r+2)/(2r+1), so the influence of the non-identical distributions {ρ^(t)} does not appear in the learning rates (it is involved in the constant C). Instead of refining the analysis for smaller b in Theorem 7, we shall show the influence of the index b on learning rates in the settings of regression with insensitive loss and binary classification.
2.2 Learning Rates for Regression with Insensitive Loss

A large family of loss functions for regression take the form V(y, f) = ψ(y − f), where ψ : ℝ → ℝ_+ is an even, convex and continuous function satisfying ψ(0) = 0. One example is the ε-insensitive loss (Vapnik, 1998) with ε ≥ 0, where ψ(u) = max{|u| − ε, 0}. We consider the case ε = 0. In this case the loss is called least absolute deviation or least absolute error in the statistics literature and finds applications in some important problems because of its robustness.

Definition 9 The insensitive loss V = V_in is given by V_in(y, f) = |y − f|.

Algorithm (3) now takes the form

f_{t+1} = (1 − η_t λ_t) f_t − η_t K_{x_t},  if f_t(x_t) ≥ y_t,
f_{t+1} = (1 − η_t λ_t) f_t + η_t K_{x_t},  if f_t(x_t) < y_t.

The following learning rates are new; they will be proved later in the paper.
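Before stating the rates, note that the sign-based update above can be written in the same coefficient representation used for least squares: only the sign of f_t(x_t) − y_t enters each step. A one-step sketch (the helper names are ours; kernel stands for any Mercer kernel):

```python
# One step of the online scheme for the insensitive loss, with
# f_t = sum_i c_i K(x_i, .) stored as support points xs and coefficients cs.
def insensitive_step(xs, cs, xt, yt, eta_t, lam_t, kernel):
    ft_xt = sum(c * kernel(x, xt) for c, x in zip(cs, xs))
    cs = [(1.0 - eta_t * lam_t) * c for c in cs]   # (1 - eta_t lam_t) f_t
    # f_{t+1} gains -eta_t K_{x_t} if f_t(x_t) >= y_t, and +eta_t K_{x_t} otherwise
    cs.append(-eta_t if ft_xt >= yt else eta_t)
    xs = xs + [xt]
    return xs, cs
```

The update magnitude is independent of the size of the residual, which is the source of the robustness of this loss noted above.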
Theorem 10 Let 0 < s ≤ 1 and K ∈ C^{2s}(X × X). Assume regularity condition (7) for f_ρ with r > 1/2, and polynomial convergence condition (1) for {ρ_X^(t)}. Suppose that for each x ∈ X, ρ_x is the uniform distribution on the interval [f_ρ(x) − 1, f_ρ(x) + 1]. If λ_1 ≤ (κ ‖g_ρ‖_{L²_{ρ_X}})^{2/(1−2r)} / 2 and η_1 > 0, then with a constant C independent of T, when 1/2 < r ≤ 3/2 we have

IE_{z_1,...,z_T} (‖f_{T+1} − f_ρ‖_K) ≤ C T^{−min{(2r−1)/(6r+1), b/2 − 2/(6r+1)}}

by taking λ_t = λ_1 t^{−2/(6r+1)}, η_t = η_1 t^{−4r/(6r+1)}; and when 1/2 < r ≤ 1, we have

IE_{z_1,...,z_T} (‖f_{T+1} − f_ρ‖²_{L²_{ρ_X}}) ≤ C T^{−min{2r/(3r+2), (2b−1)/(3r+2)}}

with λ_t = λ_1 t^{−2/(3r+2)}, η_t = η_1 t^{−(2r+1)/(3r+2)}.

Again, when b > (4r+2)/(6r+1), X ⊂ ℝ^n and K ∈ C^{2m}(X × X) for some m ∈ ℕ, Theorem 10 tells us that IE_{z_1,...,z_T} (‖f_{T+1} − f_ρ‖_{C^m(X)}) = O(T^{−(2r−1)/(6r+1)}). So the regression function is learned efficiently by the online algorithm not only with respect to the risk, but also strongly in the space C^m(X), implying the learning of gradients.

Consider the case r = 3/2. When {ρ^(t)} converges slowly, with b < 4/5, we see that IE_{z_1,...,z_T} (‖f_{T+1} − f_ρ‖_K) = O(T^{−(b/2 − 1/5)}), which depends heavily on the index b representing quantitatively the deviation of ρ^(t) from ρ. When b is large enough, with b ≥ 4/5, the learning rate IE_{z_1,...,z_T} (‖f_{T+1} − f_ρ‖_K) is of order T^{−1/5}, which is independent of b.

It would be interesting to extend Theorem 10 to situations when {ρ_x : x ∈ X} are more general bounded symmetric distributions.

2.3 Learning Rates for Binary Classification

The output space for the binary classification problem is Y = {−1, 1}, representing the two classes. A binary classifier C : X → Y makes a prediction y = C(x) for each point x ∈ X. A real valued function f : X → ℝ can be used to generate a classifier C(x) = sgn(f(x)), where sgn(f(x)) = 1 if f(x) ≥ 0 and sgn(f(x)) = −1 if f(x) < 0. A classifying convex loss φ : ℝ → ℝ_+ is often used for the real valued function f, to measure the local error φ(y f(x)) suffered from the use of sgn(f) as a model for the process producing y at x. Take V(y, f) = φ(y f) in our setting.
Off-line classification algorithm (6) has been extensively studied in the literature. In particular, the error analysis is well understood when the sample z is assumed to be drawn independently from an identical probability measure ρ on Z. See, for example, Evgeniou et al. (2000), Steinwart (2002), Zhang (2004) and Wu et al. (2007). Some analysis for off-line support vector machines with dependent samples can be found in Steinwart et al. (2008).

When the sample size T is very large, algorithm (6) might be practically challenging. Then online learning algorithms can be applied, which provide more efficient methods for classification. These algorithms are generalizations of the perceptron, which has a long history. The error analysis for online algorithm (3) has also been conducted for classification in the i.i.d. setting; see, for example, Cesa-Bianchi et al. (2004), Ying and Zhou (2006), Ying (2007) and Ye and Zhou (2007). Here we are interested in the error analysis for fully online algorithm (3) in the non-identical setting. The error is measured by the misclassification error R(C), defined to be the probability of the event {C(x) ≠ y} for a classifier C:

R(C) = ∫_X ρ_x(y ≠ C(x)) dρ_X(x) = ∫_X ρ_x(−C(x)) dρ_X(x).
The best classifier minimizing the misclassification error is called the Bayes rule (e.g., Devroye et al., 1997) and can be expressed as f_c = sgn(f_ρ). We are interested in the classifier sgn(f_{T+1}) produced by the real valued function f_{T+1} from fully online learning algorithm (3). So our error analysis aims at the excess misclassification error R(sgn(f_{T+1})) − R(f_c).

We demonstrate our error analysis by a result, proved in Section 5, for the hinge loss φ(x) = (1 − x)_+ = max{1 − x, 0}. For this loss, the online algorithm (3) can be expressed as f_1 = 0 and

f_{t+1} = (1 − η_t λ_t) f_t,  if y_t f_t(x_t) > 1,
f_{t+1} = (1 − η_t λ_t) f_t + η_t y_t K_{x_t},  if y_t f_t(x_t) ≤ 1.

Theorem 11 Let V(y, f) = (1 − y f)_+ and K ∈ C^{2s}(X × X) for some 0 < s ≤ 1. Assume f_ρ ∈ C^s(X) and that the triple (ρ, f_c, K) satisfies

inf_{f ∈ H_K} { ‖f − f_c‖_{L¹_{ρ_X}} + (λ/2) ‖f‖²_K } ≤ D_0 λ^β,  ∀λ > 0  (9)

for some 0 < β ≤ 1 and D_0 > 0. Take

λ_t = λ_1 t^{−1/4},  η_t = η_1 t^{−(1/2 + β/12)},

where λ_1 > 0 and 0 < η_1 ≤ 1/(κ² + λ_1). If {ρ_X^(t)}_{t=1,2,...} satisfies (1) with b > 1/2, then

IE_{z_1,...,z_T} (R(sgn(f_{T+1})) − R(f_c)) ≤ C_{β,s,b} T^{−min{β/4, 1/8 + β/24, (2b−1)/4}},  (10)

where C_{β,s,b} = C_{η_1,λ_1,κ,D_0,β,s,b} is a constant depending on η_1, λ_1, κ, D_0, β, s and b.

In the i.i.d. case with b = ∞, the learning rate for fully online algorithm (3) stated in Theorem 11 is of the form O(T^{−min{β/4, 1/8 + β/24}}), which is better than that in Ye and Zhou (2007) of order O(T^{−min{β/4, 1/8 − ε}}) with an arbitrarily small ε > 0. This improvement is realized by technical novelty in our error analysis, as pointed out in Remark 27. So even in the i.i.d. case, our learning rate for fully online classification with the hinge loss under approximation error assumption (9) is the best in the literature.

Let us discuss the role played in the learning rate in Theorem 11 by the power index b for the convergence of ρ^(t) to ρ. Consider the case β ≤ 3/5.
When b ≥ (1+β)/2, meaning fast convergence of ρ^(t) to ρ, learning rate (10) takes the form O(T^{−β/4}), which depends only on the approximation ability of H_K with respect to the function f_c and ρ. When 1/2 < b < (1+β)/2, representing slow convergence of ρ^(t) to ρ, the learning rate takes the form O(T^{−(2b−1)/4}), which depends only on b. When β > 3/5, learning rate (10) is of order T^{−(1/8 + β/24)} for b > 3/4 + β/12 (fast convergence of ρ^(t)) and of order T^{−(2b−1)/4} for 1/2 < b ≤ 3/4 + β/12 (slow convergence of ρ^(t)).

It would be interesting to know how online learning algorithms adapt when the time dependent distribution drifts sufficiently slowly (corresponding to very small b).
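The hinge loss update displayed before Theorem 11 is a regularized perceptron-type rule: points with margin y_t f_t(x_t) > 1 only shrink the coefficients, while the others also add η_t y_t K_{x_t}. A sketch in the same coefficient representation as before (helper names are ours):

```python
# One step of the online scheme for the hinge loss, plus the induced
# classifier sgn(f).
def hinge_step(xs, cs, xt, yt, eta_t, lam_t, kernel):
    ft_xt = sum(c * kernel(x, xt) for c, x in zip(cs, xs))
    cs = [(1.0 - eta_t * lam_t) * c for c in cs]   # (1 - eta_t lam_t) f_t
    if yt * ft_xt <= 1.0:                # margin violated: perceptron-like step
        xs, cs = xs + [xt], cs + [eta_t * yt]
    return xs, cs

def classify(xs, cs, x, kernel):
    f = sum(c * kernel(xi, x) for c, xi in zip(cs, xs))
    return 1 if f >= 0 else -1           # sgn(f), the classifier C(x)
```

With a linear kernel kernel(a, b) = a * b and a stream of labelled points ±1, two steps already suffice to classify both training points correctly in this toy setting.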
2.4 Approximation Error Involving a General Loss

Condition (9) concerns the approximation of the function f_c in the space L¹_{ρ_X} by functions from the RKHS H_K. In particular, when H_K is a dense subset of C(X) (i.e., K is a universal kernel), the quantity on the left-hand side of (9) tends to 0 as λ → 0. So (9) is a reasonable assumption, which can be characterized by an interpolation space condition (Smale and Zhou, 2003; Chen et al., 2004). Assumptions like (9) are necessary to determine the regularization parameter for achieving learning rate (10). This can be seen from the literature (Wu et al., 2007; Zhang, 2004; Caponnetto et al., 2007) on off-line algorithm (6): learning rates are obtained by suitable choices of the regularization parameter λ = λ(T), according to the behavior of the approximation error estimated from a priori conditions on the distribution ρ and the space H_K. For a general loss function V, conditions like (9) can be stated in terms of the approximation error or regularization error.

Definition 12 The approximation error D(λ) associated with the triple (K, V, ρ) is

D(λ) = inf_{f ∈ H_K} { E(f) − E(f_ρ^V) + (λ/2) ‖f‖²_K },  (11)

where we define the generalization error for f : X → ℝ as

E(f) = ∫_Z V(y, f(x)) dρ = ∫_X ∫_Y V(y, f(x)) dρ_x(y) dρ_X

and f_ρ^V is a minimizer of E(f).

The approximation error measures the approximation ability of the space H_K with respect to the learning process involving V and ρ. The denseness of H_K in C(X) implies lim_{λ→0} D(λ) = 0. A natural assumption would be

D(λ) ≤ D_0 λ^β  for some 0 ≤ β ≤ 1 and D_0 > 0.  (12)

Throughout the paper we assume for the general loss function V that

|V| := sup_{y∈Y} V(y, 0) + sup { |V(y, f) − V(y, 0)| / |f| : y ∈ Y, 0 < |f| ≤ 1 } < ∞.

Remark 13 Since D(λ) ≤ E(0) + 0 ≤ |V| for any λ > 0, we see that (12) always holds with β = 0 and D_0 = |V|.
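In the least square case treated next, the infimum defining D(λ) has a closed form once everything is restricted to a finite grid, which makes the monotonicity and decay of the approximation error easy to observe. A numerical sketch, with an illustrative Gaussian kernel and target standing in for K and f_ρ (the empirical objective below is only a stand-in for (11)):

```python
# Numerical sketch of the approximation error for the least square loss:
# minimize (1/n)||K c - f_rho||^2 + (lam/2) c^T K c over coefficients c.
import numpy as np

x = np.linspace(0.0, 1.0, 40)
Kmat = np.exp(-(x[:, None] - x[None, :]) ** 2 / 0.1)   # Gaussian Gram matrix
f_rho = np.sin(2.0 * np.pi * x)                        # illustrative target

def D(lam):
    n = len(x)
    # stationarity condition: (K^2/n + (lam/2) K) c = K f_rho / n
    c = np.linalg.solve(Kmat @ Kmat / n + 0.5 * lam * Kmat + 1e-9 * np.eye(n),
                        Kmat @ f_rho / n)
    f = Kmat @ c
    return float(np.mean((f - f_rho) ** 2) + 0.5 * lam * (c @ Kmat @ c))

vals = [D(lam) for lam in (1.0, 0.1, 0.01)]
# D is non-decreasing in lambda, so the values decrease as lambda shrinks
assert vals[0] + 1e-8 >= vals[1] >= vals[2] - 1e-8 >= -1e-8
```

The monotone decrease as λ → 0 is exactly the behavior lim_{λ→0} D(λ) = 0 exploited when choosing λ = λ(T).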
For the least square loss V(y, f) = (y − f)², the minimizer f_ρ^V of E(f) is exactly the regression function defined by (5), and approximation error (11) takes the form D(λ) = inf_{f ∈ H_K} { ‖f − f_ρ‖²_{L²_{ρ_X}} + (λ/2) ‖f‖²_K }, which measures the approximation of f_ρ in L²_{ρ_X} by functions from the RKHS H_K. For a general loss function V, the minimizer f_ρ^V of E(f) is in general different from f_ρ. Moreover, the approximation error D(λ) involves the approximation of f_ρ^V by H_K in some function spaces which need not be L²_{ρ_X}.

Example 5 For the hinge loss V(y, f) = (1 − y f)_+, the minimizer f_ρ^V is the Bayes rule f_ρ^V = f_c (Devroye et al., 1997). Moreover, the uniform Lipschitz continuity of V(·, f) implies (Chen et al., 2004)

E(f) − E(f_ρ^V) = ∫_Z φ(y f(x)) − φ(y f_c(x)) dρ ≤ ∫_X |f(x) − f_c(x)| dρ_X = ‖f − f_c‖_{L¹_{ρ_X}}.
So approximation error (11) can be estimated by approximation in the space L¹_{ρ_X}, and condition (9) implies (12).

Consider the insensitive loss V = V_in. We can easily see that for each x ∈ X, f_ρ^{V_in}(x) equals the median of the probability distribution ρ_x on Y. That is, f_ρ^{V_in}(x) is determined by ρ_x({y : y ≥ f_ρ^{V_in}(x)}) ≥ 1/2 and ρ_x({y : y ≤ f_ρ^{V_in}(x)}) ≥ 1/2. When ρ_x is symmetric about its mean f_ρ(x), we have f_ρ^{V_in}(x) = f_ρ(x).

Example 6 Let V = V_in, f_ρ ∈ C(X) and p ≥ 1. If for each x ∈ X, the conditional distribution ρ_x is given by

dρ_x(y) = (p/2) |y − f_ρ(x)|^{p−1} dy  if |y − f_ρ(x)| ≤ 1,  and dρ_x(y) = 0  if |y − f_ρ(x)| > 1,

then we have

E(f) − E(f_ρ^{V_in}) = (1/(p+1)) ‖f − f_ρ‖^{p+1}_{L^{p+1}_{ρ_X}} − (1/(p+1)) ∫_{{x : |f(x) − f_ρ(x)| > 1}} { |f(x) − f_ρ(x)|^{p+1} + p − (1+p) |f(x) − f_ρ(x)| } dρ_X.

It follows that

D(λ) ≤ inf_{f ∈ H_K} { ‖f − f_ρ‖^{p+1}_{L^{p+1}_{ρ_X}} + (λ/2) ‖f‖²_K }.

The conclusion of Example 6 will be proved in the appendix. The following general result follows from the same argument as in Wu et al. (2006).

Proposition 14 If the loss function V satisfies |V(y, f_1) − V(y, f_2)| ≤ |f_1 − f_2|^c for some 0 < c ≤ 1 and any y ∈ Y, f_1, f_2 ∈ ℝ, then

D(λ) ≤ inf_{f ∈ H_K} { ‖f − f_ρ^V‖^c_{L^c_{ρ_X}} + (λ/2) ‖f‖²_K } ≤ inf_{f ∈ H_K} { ‖f − f_ρ^V‖^c_{L²_{ρ_X}} + (λ/2) ‖f‖²_K }.

If the univariate function V(y, ·) is C¹ and satisfies |V'(y, f_1) − V'(y, f_2)| ≤ |f_1 − f_2|^c for some 0 < c ≤ 1 and any y ∈ Y, f_1, f_2 ∈ ℝ, then

D(λ) ≤ inf_{f ∈ H_K} { ‖f − f_ρ^V‖^{1+c}_{L^{1+c}_{ρ_X}} + (λ/2) ‖f‖²_K } ≤ inf_{f ∈ H_K} { ‖f − f_ρ^V‖^{1+c}_{L²_{ρ_X}} + (λ/2) ‖f‖²_K }.

3. Key Analysis for the Fully Online Non-identical Setting

Learning rates for fully online learning algorithm (3), such as those stated in the last section for regression and classification, are obtained through analysis of the approximation error and the sample error. The approximation error has been well understood (Smale and Zhou, 2003; Chen et al., 2004). The sample error for (3) will be estimated in the following two sections. It is expressed as ‖f_{T+1} − f_{λ_T}^V‖_K, where f_{λ_T}^V is a regularizing function.
Definition 15 For λ > 0, the regularizing function f_λ^V ∈ H_K is defined by

f_λ^V = arg inf_{f ∈ H_K} { E(f) + (λ/2) ‖f‖²_K }.  (13)

Observe that f_λ^V is a sample-free limit of f_{z,λ} defined by (6). It is natural to expect that the function f_{T+1} produced by online algorithm (3) approximates the regularizing function f_{λ_T}^V well. This is actually the case. Our main result on sample error analysis estimates the difference f_{T+1} − f_{λ_T}^V in the H_K-norm in quite general situations. The estimation follows from an iteration procedure, developed for online classification algorithms in Ying and Zhou (2006), Ying (2007) and Ye and Zhou (2007), together with some novelty provided in the proof of Theorem 26 in the next section. It is based on a one-step iteration bounding ‖f_{t+1} − f_{λ_t}^V‖_K in terms of ‖f_t − f_{λ_{t−1}}^V‖_K. The goal of this section is to present our key analysis for the one-step iteration, tackling two barriers arising from the fully online non-identical setting.

3.1 Bounding the Error Term Caused by the Non-identical Sequence of Distributions

The first part of our key analysis for the one-step iteration is to tackle an extra error term ∆_t caused by the non-identical sequence of distributions when bounding ‖f_{t+1} − f_{λ_t}^V‖_K by means of ‖f_t − f_{λ_t}^V‖_K with the fixed regularization parameter λ_t.

Lemma 16 Define {f_t} by (3). Then we have

IE_{z_t} (‖f_{t+1} − f_{λ_t}^V‖²_K) ≤ (1 − η_t λ_t) ‖f_t − f_{λ_t}^V‖²_K + 2 η_t ∆_t + η_t² IE_{z_t} ‖V'(y_t, f_t(x_t)) K_{x_t} + λ_t f_t‖²_K,  (14)

where ∆_t is defined by

∆_t = ∫_Z [ V(y, f_{λ_t}^V(x)) − V(y, f_t(x)) ] d(ρ^(t) − ρ).  (15)

Lemma 16 follows from the same procedure as in the proofs of Lemma 3 in Ying and Zhou (2006) and in Ye and Zhou (2007). A crucial estimate is

⟨V'(y_t, f_t(x_t)) K_{x_t} + λ_t f_t, f_{λ_t}^V − f_t⟩_K ≤ [V(y_t, f_{λ_t}^V(x_t)) + (λ_t/2) ‖f_{λ_t}^V‖²_K] − [V(y_t, f_t(x_t)) + (λ_t/2) ‖f_t‖²_K].

When we take the expectation with respect to z_t = (x_t, y_t), we get ∫_Z V(y, f_{λ_t}^V(x)) dρ^(t) (and ∫_Z V(y, f_t(x)) dρ^(t)) on the right-hand side, not E(f_{λ_t}^V) = ∫_Z V(y, f_{λ_t}^V(x)) dρ.
So compared with results in the i.i.d. case (Smale and Yao, 2006; Ying and Zhou, 2006; Ying, 2007; Tarrès and Yao, 2005; Ye and Zhou, 2007), an extra term ∆_t involving the difference measure ρ^(t) − ρ appears. This is the first barrier we need to tackle here. When V is the least square loss, it can be easily handled by assuming f_ρ ∈ C^s(X). In fact, from V(y, f) = (y − f)², we see that

∆_t = ∫_Z [f_{λ_t}^V(x) − f_t(x)] [f_{λ_t}^V(x) + f_t(x) − 2y] dρ_x(y) d(ρ_X^(t) − ρ_X)
    = ∫_X [f_{λ_t}^V(x) − f_t(x)] [f_{λ_t}^V(x) + f_t(x) − 2 f_ρ(x)] d(ρ_X^(t) − ρ_X).

This together with the relation ‖hg‖_{C^s(X)} ≤ ‖h‖_{C^s(X)} ‖g‖_{C^s(X)} yields the following bound.
Proposition 17 Let V(y, f) = (y − f)² and f_ρ ∈ C^s(X). Then we have

∆_t ≤ ( ‖f_{λ_t}^V‖_{C^s(X)} + ‖f_t‖_{C^s(X)} + 2 ‖f_ρ‖_{C^s(X)} ) ‖ρ_X^(t) − ρ_X‖_{(C^s(X))*} ‖f_{λ_t}^V − f_t‖_{C^s(X)}.

For a general loss function and output space, ∆_t can be bounded under assumption (4). It is a special case of the following general result with h = f_{λ_t}^V and g = f_t.

Lemma 18 Let h, g ∈ C^s(X). If (4) holds, then we have

∫_Z [V(y, h(x)) − V(y, g(x))] d(ρ^(t) − ρ) ≤ ( B_{h,g} (‖h‖_{C^s(X)} + ‖g‖_{C^s(X)}) + C_ρ B̃_{h,g} ) ‖ρ_X^(t) − ρ_X‖_{(C^s(X))*},

where B_{h,g} and B̃_{h,g} are constants given by

B_{h,g} = sup { |V'(y, f)| : y ∈ Y, |f| ≤ max{‖h‖_{C(X)}, ‖g‖_{C(X)}} }

and

B̃_{h,g} = sup { ‖V(·, f)‖_{C^s(Y)} : |f| ≤ max{‖h‖_{C(X)}, ‖g‖_{C(X)}} }.

Proof By decomposing the probability distributions on Z into marginal and conditional distributions, we see that

∫_Z [V(y, h(x)) − V(y, g(x))] d(ρ^(t) − ρ) = ∫_X ∫_Y [V(y, h(x)) − V(y, g(x))] dρ_x(y) d(ρ_X^(t) − ρ_X).

By the definition of the norm in (C^s(X))*, we obtain

∫_Z [V(y, h(x)) − V(y, g(x))] d(ρ^(t) − ρ) ≤ ‖ρ_X^(t) − ρ_X‖_{(C^s(X))*} ‖J‖_{C^s(X)},

where J is the function on X defined by J(x) = ∫_Y [V(y, h(x)) − V(y, g(x))] dρ_x(y), x ∈ X.

The definition of B_{h,g} tells us that |V(y, h(x)) − V(y, g(x))| ≤ B_{h,g} |h(x) − g(x)| for each y ∈ Y. Hence ‖J‖_{C(X)} ≤ B_{h,g} ‖h − g‖_{C(X)}. To bound |J|_{C^s(X)}, let x, u ∈ X. We can decompose J(x) − J(u) as

J(x) − J(u) = ∫_Y { [V(y, h(x)) − V(y, g(x))] − [V(y, h(u)) − V(y, g(u))] } dρ_x(y) + ∫_Y [V(y, h(u)) − V(y, g(u))] d[ρ_x − ρ_u](y).

By the definition of B_{h,g}, part of the first term above can be bounded as

|V(y, h(x)) − V(y, h(u))| ≤ B_{h,g} |h(x) − h(u)| ≤ B_{h,g} |h|_{C^s(X)} (d(x,u))^s.
The same bound holds for the other part of the first term. So we get

|∫_Y { [V(y, h(x)) − V(y, g(x))] − [V(y, h(u)) − V(y, g(u))] } dρ_x(y)| ≤ B_{h,g} ( |h|_{C^s(X)} + |g|_{C^s(X)} ) (d(x,u))^s.

For the other term of the expression for J(x) − J(u), we apply condition (4) to get

|∫_Y [V(y, h(u)) − V(y, g(u))] d[ρ_x − ρ_u](y)| ≤ ‖ρ_x − ρ_u‖_{(C^s(Y))*} ‖V(·, h(u)) − V(·, g(u))‖_{C^s(Y)}

and find that

|∫_Y [V(y, h(u)) − V(y, g(u))] d[ρ_x − ρ_u](y)| ≤ C_ρ (d(x,u))^s B̃_{h,g}.

Combining the above two bounds, we see that

|J|_{C^s(X)} = sup_{x≠u} |J(x) − J(u)| / (d(x,u))^s ≤ B_{h,g} ( |h|_{C^s(X)} + |g|_{C^s(X)} ) + C_ρ B̃_{h,g}.

Then the desired bound follows and the lemma is proved.

3.2 Bounding the Error Term Caused by Varying Regularization Parameters

The second part of our key analysis is to estimate the error term ‖f_{λ_t}^V − f_{λ_{t−1}}^V‖_K, called the drift error, caused by the change of the regularization parameter from λ_{t−1} to λ_t in our fully online algorithm. This is the second barrier we need to tackle here.

Definition 19 The drift error is defined as d_t = ‖f_{λ_t}^V − f_{λ_{t−1}}^V‖_K.

The drift error can be estimated by the approximation error, which has been studied for regression (Smale and Zhou, 2009) and for classification (Ye and Zhou, 2007).

Theorem 20 Let V be a convex loss function, f_λ^V be defined by (13) and µ > λ > 0. We have

‖f_λ^V − f_µ^V‖_K ≤ (µ/2) (1/λ − 1/µ) ( ‖f_λ^V‖_K + ‖f_µ^V‖_K ) ≤ (µ/2) (1/λ − 1/µ) ( √(2D(λ)/λ) + √(2D(µ)/µ) ).

In particular, if with some 0 < γ ≤ 1 we take λ_t = λ_1 t^{−γ} for t ≥ 1, then

d_{t+1} ≤ 2γ t^{γ/2 − 1} √(2 D(λ_1 t^{−γ}) / λ_1).

Proof Taking the derivative of the functional E(f) + (λ/2) ‖f‖²_K at the minimizer f_λ^V defined in (13), we see that

∫_Z V'(y, f_λ^V(x)) K_x dρ + λ f_λ^V = 0.
It follows that

f_λ^V − f_µ^V = (1/µ) ∫_Z V'(y, f_µ^V(x)) K_x dρ − (1/λ) ∫_Z V'(y, f_λ^V(x)) K_x dρ.

Combining with the reproducing property (8), we know that ‖f_λ^V − f_µ^V‖²_K = ⟨f_λ^V − f_µ^V, f_λ^V − f_µ^V⟩_K can be expressed as

‖f_λ^V − f_µ^V‖²_K = (1/µ) ∫_Z V'(y, f_µ^V(x)) (f_λ^V − f_µ^V)(x) dρ − (1/λ) ∫_Z V'(y, f_λ^V(x)) (f_λ^V − f_µ^V)(x) dρ.

The convexity of the function V(y, ·) on ℝ tells us that

V'(y, f_µ^V(x)) (f_λ^V − f_µ^V)(x) ≤ V(y, f_λ^V(x)) − V(y, f_µ^V(x))

and

V'(y, f_λ^V(x)) (f_µ^V − f_λ^V)(x) ≤ V(y, f_µ^V(x)) − V(y, f_λ^V(x)).

Hence

‖f_λ^V − f_µ^V‖²_K ≤ (1/λ − 1/µ) ( E(f_µ^V) − E(f_λ^V) ).

From the definition of f_µ^V, we see that E(f_µ^V) + (µ/2) ‖f_µ^V‖²_K − ( E(f_λ^V) + (µ/2) ‖f_λ^V‖²_K ) ≤ 0. It follows that

E(f_µ^V) − E(f_λ^V) ≤ (µ/2) ( ‖f_λ^V‖²_K − ‖f_µ^V‖²_K ) ≤ (µ/2) ‖f_λ^V − f_µ^V‖_K ( ‖f_λ^V‖_K + ‖f_µ^V‖_K ).

Then the desired inequality follows.

4. Bounds for the Sample Error in the H_K-norm

We are in a position to present our main result on the sample error, measured by the H_K-norm of the difference f_{T+1} − f_{λ_T}^V. This will be done by applying iteratively the key analysis in Lemma 16, Lemma 18 and Theorem 20. When applying Lemma 18, we need to bound ‖g‖_{C^s(X)} in terms of ‖g‖_K.

Definition 21 We say that the Mercer kernel K satisfies the kernel condition of order s if K ∈ C^s(X × X) and for some κ_s > 0,

K(x,x) − 2K(x,u) + K(u,u) ≤ κ_s² (d(x,u))^{2s},  ∀x, u ∈ X.  (16)

When 0 < s ≤ 1 and K ∈ C^{2s}(X × X), (16) holds true. The following result follows directly from Zhou (2003), Zhou (2008) and Smale and Zhou (2009).

Lemma 22 If K satisfies the kernel condition of order s with (16) valid, then we have ‖g‖_{C^s(X)} ≤ (κ + κ_s) ‖g‖_K for every g ∈ H_K.

When using Lemma 16 and Lemma 18, we need incremental behaviors of the loss function V to bound ‖f_t‖_K.
Definition 23 Denote

$$N(\lambda) = \sup\Big\{|V'(y,f)| : y \in Y,\ |f| \le \max\big\{\kappa\sqrt{2\bar V/\lambda},\ \kappa\kappa_V/\lambda\big\}\Big\}$$

and

$$\tilde N(\lambda) = \sup\Big\{\|V(\cdot,f)\|_{C^s(Y)} : |f| \le \max\big\{\kappa\sqrt{2\bar V/\lambda},\ \kappa\kappa_V/\lambda\big\}\Big\}.$$

We say that $V$ has incremental exponent $p \ge 0$ if for some $N_1 > 0$ and $\lambda_1 > 0$ we have

$$N(\lambda) \le N_1\Big(\frac{1}{\lambda}\Big)^p \quad\hbox{and}\quad \tilde N(\lambda) \le N_1\Big(\frac{1}{\lambda}\Big)^{p+1}, \qquad 0 < \lambda \le \lambda_1. \tag{17}$$

We say that $V$ is locally Lipschitz at the origin if

$$M_0 := \sup\Big\{\frac{|V(y,f) - V(y,0)|}{|f|} : y \in Y,\ |f| \le 1\Big\} < \infty. \tag{18}$$

The following result can be proved by exactly the same procedure as those in Ying and Zhou (2006), Ying (2007) and Ye and Zhou (2007).

Lemma 24 Assume that $V$ is locally Lipschitz at the origin. Define $f_t$ by (3). If

$$\eta_t\big(\kappa^2(M_0 + N(\lambda_t)) + \lambda_t\big) \le 1 \qquad\hbox{for } t = 1,\dots,T, \tag{19}$$

then we have

$$\|f_t\|_K \le \kappa_V/\lambda_t, \qquad t = 1,\dots,T+1. \tag{20}$$

For the insensitive loss, (18) is not satisfied, but $|V'(y,f)|$ is uniformly bounded by 1. For such loss functions we can apply the following bound.

Lemma 25 Assume $|V'|_\infty := \sup_{y \in Y,\,f \in \mathbb R}|V'(y,f)| < \infty$. Define $f_t$ by (3). If for some $\lambda_1, \eta_1 > 0$ and $\gamma, \alpha > 0$ with $\gamma + \alpha < 1$, we take $\lambda_t = \lambda_1 t^{-\gamma}$, $\eta_t = \eta_1 t^{-\alpha}$ with $t = 1,\dots,T$, then we have

$$\|f_t\|_K \le C_{V,\gamma,\alpha}/\lambda_t, \qquad t = 1,\dots,T+1, \tag{21}$$

where $C_{V,\gamma,\alpha}$ is the constant given by

$$C_{V,\gamma,\alpha} = \kappa|V'|_\infty\,\lambda_1\Big(\eta_1 + 2^{\gamma+\alpha}/(\lambda_1\eta_1) + \big((1+\alpha)/[e\lambda_1\eta_1(1 - 2^{\gamma+\alpha-1})]\big)^{\frac{1+\alpha}{1-\gamma-\alpha}}\Big).$$

Proof By taking norms in (3) we see that

$$\|f_{t+1}\|_K \le (1 - \eta_t\lambda_t)\|f_t\|_K + \eta_t\kappa|V'|_\infty.$$

By iterating and the choice $f_1 = 0$ we find

$$\|f_{t+1}\|_K \le \sum_{i=1}^t \Pi_{j=i+1}^t(1 - \eta_j\lambda_j)\,\eta_i\,\kappa|V'|_\infty.$$
But $1 - \eta_j\lambda_j \le \exp\{-\eta_j\lambda_j\}$. It follows that

$$\|f_{t+1}\|_K \le \sum_{i=1}^t \exp\Big\{-\sum_{j=i+1}^t \eta_j\lambda_j\Big\}\,\eta_i\,\kappa|V'|_\infty.$$

Now we need the following elementary inequality with $c > 0$, $q \ge 0$ and $0 < q_1 < 1$:

$$\sum_{i=1}^{t-1} i^{-q}\exp\Big\{-c\sum_{j=i+1}^t j^{-q_1}\Big\} \le 2^{1+q}\Big(\frac{1+q}{c} + \Big(\frac{1+q}{ec(1 - 2^{q_1-1})}\Big)^{\frac{1+q}{1-q_1}}\Big)t^{q_1-q}. \tag{22}$$

This elementary inequality can be found in, for example, Smale and Zhou (2009). Taking $q = \alpha$, $q_1 = \gamma+\alpha$ and $c = \lambda_1\eta_1$ we know that $\|f_{t+1}\|_K$ is bounded by

$$\kappa|V'|_\infty\Big(\eta_1 t^{-\alpha} + \Big[2^{\gamma+\alpha}/(\lambda_1\eta_1) + \big((1+\alpha)/[e\lambda_1\eta_1(1 - 2^{\gamma+\alpha-1})]\big)^{\frac{1+\alpha}{1-\gamma-\alpha}}\Big]t^{\gamma}\Big).$$

Then our desired bound holds true.

Now we can present our bound for the sample error $\|f_{T+1} - f^V_{\lambda_T}\|_K$.

Theorem 26 Suppose the following assumptions hold:

1. the kernel $K$ satisfies the kernel condition of order $s$ ($0 < s \le 1$) with (16) valid;
2. $V$ has incremental exponent $p \ge 0$ with (17) valid, and $V$ satisfies (18);
3. $\{\rho^{(t)}\}_{t=1,2,\dots}$ converges polynomially to $\rho$ in $(C^s(X))^*$ with (11) valid;
4. the distributions $\{\rho_x : x \in X\}$ are Lipschitz $s$ in $(C^s(Y))^*$ with (4) valid;
5. the triple $(K,V,\rho)$ has the approximation ability of power $0 < \beta \le 1$ stated by (12).

Take

$$\lambda_t = \lambda_1 t^{-\gamma}, \qquad \eta_t = \eta_1 t^{-\alpha} \tag{23}$$

with some $\lambda_1, \eta_1 > 0$ and $\gamma, \alpha$ satisfying

$$0 < \gamma < \frac{2}{5+4p-\beta}, \qquad (2p+1)\gamma < \alpha < 1 - \frac{\gamma(3-\beta)}{2}, \qquad \eta_1 \le \frac{1}{\kappa^2 M_0 + \kappa^2 N_1\lambda_1^{-p} + \lambda_1}. \tag{24}$$

Then we have

$$\mathbb E_{z_1,z_2,\dots,z_T}\big(\|f_{T+1} - f^V_{\lambda_T}\|_K^2\big) \le C_{K,V,\rho,b,\beta,s}\,T^{-\theta} \tag{25}$$

where the power index $\theta$ is given by

$$\theta := \min\big\{2 - \gamma(3-\beta) - 2\alpha,\ \alpha - \gamma(2p+1),\ b - \gamma(2+p)\big\}, \tag{26}$$

and $C_{K,V,\rho,b,\beta,s}$ is a constant independent of $T$ given explicitly in the proof.
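The elementary inequality (22) lends itself to a direct numerical check. The sketch below uses the constant in the form $2^{1+q}\big(\frac{1+q}{c} + \big(\frac{1+q}{ec(1-2^{q_1-1})}\big)^{(1+q)/(1-q_1)}\big)$ as written here, and the parameter triples are arbitrary illustrative choices:

```python
import math

def lhs(t, q, q1, c):
    """Left-hand side of (22): sum_{i=1}^{t-1} i^{-q} exp(-c sum_{j=i+1}^{t} j^{-q1})."""
    total = 0.0
    for i in range(1, t):
        tail = sum(j ** (-q1) for j in range(i + 1, t + 1))
        total += i ** (-q) * math.exp(-c * tail)
    return total

def rhs(t, q, q1, c):
    """Right-hand side of (22), constant written in the form used in this section."""
    const = 2 ** (1 + q) * ((1 + q) / c
            + ((1 + q) / (math.e * c * (1 - 2 ** (q1 - 1)))) ** ((1 + q) / (1 - q1)))
    return const * t ** (q1 - q)

# spot checks over a few parameter choices and horizons
for q, q1, c in [(0.5, 0.5, 1.0), (0.0, 0.75, 0.3), (1.0, 0.25, 2.0)]:
    for t in (10, 100, 1000):
        assert lhs(t, q, q1, c) <= rhs(t, q, q1, c)
```

Such a check is no proof, of course, but it is a useful guard when transcribing the exponents $(q, q_1)$ into the applications below.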
Proof We divide the proof into four steps. First we bound $\Delta_t$. Since $\lambda_t$ and $\eta_t$ take the form (23), we see from the lower bound for $\alpha$ in (24) that $2p\gamma \le (2p+1)\gamma \le \alpha$. Hence for $t \in \mathbb N$,

$$\eta_t\big(\kappa^2(M_0 + N(\lambda_t)) + \lambda_t\big) \le \eta_1 t^{-\alpha}\big(\kappa^2 M_0 + \kappa^2 N_1\lambda_1^{-p}t^{p\gamma} + \lambda_1 t^{-\gamma}\big) \le \eta_1\big(\kappa^2 M_0 + \kappa^2 N_1\lambda_1^{-p} + \lambda_1\big).$$

So the last restriction of (24) implies (19), and by Lemma 24, we know that (20) holds true. Taking $f = 0$ in (13) yields $\|f^V_{\lambda_t}\|_K \le \sqrt{2\bar V/\lambda_t}$ for $t = 1,\dots,T$. Putting these bounds into the definition of the constants $B_{h,g}, \tilde B_{h,g}$, we see by the notion $N(\lambda)$ that $B_{f^V_{\lambda_t}, f_t} \le N(\lambda_t)$ and $\tilde B_{f^V_{\lambda_t}, f_t} \le \tilde N(\lambda_t)$. So by Lemma 18 and Lemma 22,

$$\Delta_t \le \tilde B_t := \Big[(\kappa + \kappa_{2s})\big(\sqrt{2\bar V/\lambda_t} + \kappa_V/\lambda_t\big)N(\lambda_t) + C_\rho\tilde N(\lambda_t)\Big]\,\|\rho^{(t)} - \rho\|_{(C^s(X))^*}.$$

Putting this bound and (20) into (14) yields

$$\mathbb E_{z_1,z_2,\dots,z_t}\big(\|f_{t+1} - f^V_{\lambda_t}\|_K^2\big) \le (1 - \eta_t\lambda_t)\,\mathbb E_{z_1,z_2,\dots,z_{t-1}}\big(\|f_t - f^V_{\lambda_t}\|_K^2\big) + \kappa^2\eta_t^2\big(N(\lambda_t) + \bar V\big)^2 + 2\eta_t\tilde B_t.$$

Next we derive explicit bounds for the one-step iteration. Recall that $d_t = \|f^V_{\lambda_t} - f^V_{\lambda_{t-1}}\|_K$. It gives

$$\|f_t - f^V_{\lambda_t}\|_K^2 \le \|f_t - f^V_{\lambda_{t-1}}\|_K^2 + 2\|f_t - f^V_{\lambda_{t-1}}\|_K\,d_t + d_t^2.$$

Take $\tau = \frac{\gamma+\alpha}{1 - \gamma(1-\beta)/2}$. By the upper bound for $\alpha$ in (24), we find $0 < \tau < 1$. Take $A_1 = \eta_1\lambda_1^{1+\tau(1-\beta)/2}/\big(2^{1+2\tau}(2D_0)^{\tau/2}\big) > 0$. Applying the elementary inequality

$$2ab = 2\big[\sqrt{A_1}\,a\,b^{\tau/2}\big]\big[b^{1-\tau/2}/\sqrt{A_1}\big] \le A_1 a^2 b^\tau + b^{2-\tau}/A_1 \tag{27}$$

to $a = \|f_t - f^V_{\lambda_{t-1}}\|_K$ and $b = d_t$, we know that

$$\mathbb E_{z_1,\dots,z_{t-1}}\big(\|f_t - f^V_{\lambda_t}\|_K^2\big) \le (1 + A_1 d_t^\tau)\,\mathbb E_{z_1,\dots,z_{t-1}}\big(\|f_t - f^V_{\lambda_{t-1}}\|_K^2\big) + d_t^{2-\tau}/A_1 + d_t^2.$$

Using this bound and noticing the inequality $(1 - \eta_t\lambda_t)(1 + A_1 d_t^\tau) \le 1 + A_1 d_t^\tau - \eta_t\lambda_t$, we obtain

$$\mathbb E_{z_1,\dots,z_t}\big(\|f_{t+1} - f^V_{\lambda_t}\|_K^2\big) \le (1 + A_1 d_t^\tau - \eta_t\lambda_t)\,\mathbb E_{z_1,\dots,z_{t-1}}\big(\|f_t - f^V_{\lambda_{t-1}}\|_K^2\big) + d_t^{2-\tau}/A_1 + d_t^2 + \kappa^2\eta_t^2\big(N(\lambda_t) + \bar V\big)^2 + 2\eta_t\tilde B_t.$$

By Theorem 20, condition (12) for the approximation error yields

$$d_t \le 2\sqrt{2D_0}\,\lambda_1^{(\beta-1)/2}(t-1)^{\gamma(1-\beta)/2 - 1} \le A_2\,t^{\gamma(1-\beta)/2 - 1} \qquad\hbox{where } A_2 = 4\sqrt{2D_0}\,\lambda_1^{(\beta-1)/2}.$$

Inserting the parameter form $\lambda_t = \lambda_1 t^{-\gamma}$ into assumption (17) and applying condition (11), we can bound $\tilde B_t$ as

$$\tilde B_t \le A_3\,t^{\gamma(1+p) - b} \qquad\hbox{where } A_3 = \Big[(\kappa + \kappa_{2s})\big(\sqrt{2\bar V/\lambda_1} + \kappa_V/\lambda_1\big) + C_\rho/\lambda_1\Big]N_1\lambda_1^{-p}\,C.$$
Therefore, for the one-step iteration, by denoting $f^V_{\lambda_1}$ as $f^V_{\lambda_0}$ when $t = 1$, we have for each $t = 1,\dots,T$,

$$\mathbb E_{z_1,z_2,\dots,z_t}\big(\|f_{t+1} - f^V_{\lambda_t}\|_K^2\big) \le (1 + A_1 d_t^\tau - \eta_t\lambda_t)\,\mathbb E_{z_1,z_2,\dots,z_{t-1}}\big(\|f_t - f^V_{\lambda_{t-1}}\|_K^2\big) + A_4\,t^{-\tilde\theta}, \tag{28}$$

where $\tilde\theta = \min\{2 - \gamma(2-\beta) - \alpha,\ 2(\alpha - p\gamma),\ \alpha + b - \gamma(1+p)\}$ and

$$A_4 = A_2^{2-\tau}/A_1 + A_2^2 + \kappa^2\eta_1^2\big(N_1\lambda_1^{-p} + \bar V\big)^2 + 2\eta_1 A_3.$$

Then we iterate the above one-step analysis. Inserting the parameter forms for $\lambda_t$ and $\eta_t$, we see from the definition of the constant $A_1$ that

$$1 + A_1 d_t^\tau - \eta_t\lambda_t \le 1 + A_1 A_2^\tau\,t^{\tau(\gamma(1-\beta)/2 - 1)} - \eta_1\lambda_1 t^{-\gamma-\alpha} = 1 - \frac{\eta_1\lambda_1}{2}\,t^{-\gamma-\alpha}. \tag{29}$$

So the one-step analysis (28) yields

$$\mathbb E_{z_1,z_2,\dots,z_t}\big(\|f_{t+1} - f^V_{\lambda_t}\|_K^2\big) \le \Big(1 - \frac{\eta_1\lambda_1}{2}\,t^{-\gamma-\alpha}\Big)\mathbb E_{z_1,z_2,\dots,z_{t-1}}\big(\|f_t - f^V_{\lambda_{t-1}}\|_K^2\big) + A_4\,t^{-\tilde\theta}.$$

Applying this bound iteratively for $t = 1,\dots,T$ implies

$$\mathbb E_{z_1,z_2,\dots,z_T}\big(\|f_{T+1} - f^V_{\lambda_T}\|_K^2\big) \le A_4\sum_{t=1}^T \Pi_{j=t+1}^T\Big(1 - \frac{\eta_1\lambda_1}{2}j^{-\gamma-\alpha}\Big)t^{-\tilde\theta} + \Pi_{t=1}^T\Big(1 - \frac{\eta_1\lambda_1}{2}t^{-\gamma-\alpha}\Big)\|f_1 - f^V_{\lambda_1}\|_K^2.$$

Finally we bound the above expressions by two elementary inequalities. The first one is (22). Applying this inequality with $c = \frac{\eta_1\lambda_1}{2}$, $q_1 = \gamma+\alpha$ and $q = \tilde\theta$, since $1 - u \le e^{-u}$ for any $u \ge 0$, the first expression above can be bounded as

$$\sum_{t=1}^T \Pi_{j=t+1}^T\Big(1 - \frac{\eta_1\lambda_1}{2}j^{-\gamma-\alpha}\Big)t^{-\tilde\theta} \le \sum_{t=1}^T \exp\Big\{-\frac{\eta_1\lambda_1}{2}\sum_{j=t+1}^T j^{-\gamma-\alpha}\Big\}t^{-\tilde\theta} \le A_5\,T^{\gamma+\alpha-\tilde\theta},$$

where $A_5$ is the constant given by

$$A_5 = 2^{\gamma+\alpha+\tilde\theta+1}\Big(\frac{2(1+\tilde\theta)}{\eta_1\lambda_1} + \Big(\frac{2(1+\tilde\theta)}{e\eta_1\lambda_1(1 - 2^{\gamma+\alpha-1})}\Big)^{\frac{1+\tilde\theta}{1-\gamma-\alpha}}\Big).$$

For the second expression above, we have

$$\Pi_{t=1}^T\Big(1 - \frac{\eta_1\lambda_1}{2}t^{-\gamma-\alpha}\Big) \le \exp\Big\{-\frac{\lambda_1\eta_1}{2}\sum_{t=1}^T t^{-\gamma-\alpha}\Big\} \le \exp\Big\{-\frac{\lambda_1\eta_1}{2}\int_1^{T+1} x^{-\gamma-\alpha}\,dx\Big\} \le \exp\Big\{\frac{\lambda_1\eta_1}{2(1-\gamma-\alpha)}\Big\}\exp\Big\{-\frac{\lambda_1\eta_1}{2(1-\gamma-\alpha)}(T+1)^{1-\gamma-\alpha}\Big\}.$$
Applying another elementary inequality

$$\exp\{-cx\} \le \Big(\frac{a}{ec}\Big)^a x^{-a}, \qquad c, a, x > 0,$$

with $c = \frac{\lambda_1\eta_1}{2(1-\gamma-\alpha)}$, $a = \frac{2}{1-\gamma-\alpha}$ and $x = (T+1)^{1-\gamma-\alpha}$ yields

$$\Pi_{t=1}^T\Big(1 - \frac{\eta_1\lambda_1}{2}t^{-\gamma-\alpha}\Big) \le \exp\Big\{\frac{\lambda_1\eta_1}{2(1-\gamma-\alpha)}\Big\}\Big(\frac{4}{(1-\gamma-\alpha)e\lambda_1\eta_1}\Big)^{\frac{2}{1-\gamma-\alpha}}T^{-2}.$$

The above two estimates give the desired bound (25) with $\theta = \tilde\theta - \gamma - \alpha$ and the constant $C_{K,V,\rho,b,\beta,s}$ given by

$$C_{K,V,\rho,b,\beta,s} = A_4 A_5 + \exp\Big\{\frac{\lambda_1\eta_1}{2(1-\gamma-\alpha)}\Big\}\Big(\frac{4}{(1-\gamma-\alpha)e\lambda_1\eta_1}\Big)^{\frac{2}{1-\gamma-\alpha}}\frac{2\bar V}{\lambda_1}.$$

This proves the theorem.

Remark 27 Some ideas in the above proof are from Ying and Zhou (2006), Ye and Zhou (2007) and Smale and Zhou (2009). Two novel points are presented for the first time here. One is the bound for $\Delta_t$, dealing with $\rho^{(t)} \ne \rho$, given in the first step of our proof in order to tackle the technical difficulty arising from the non-identical sampling process. The same difficulty for the least square regression was overcome in Smale and Zhou (2009) by the special linear feature and explicit expressions offered by the least square loss. The second technical novelty is to introduce a parameter $A_1$ into the elementary inequality (27). With this parameter, we can bound $1 + A_1 d_t^\tau - \eta_t\lambda_t$ by $1 - \frac12\eta_t\lambda_t$, as shown in (29). This improves the error bound even in the i.i.d. case presented in Ye and Zhou (2007) for the fully online algorithm.

Let us discuss the role of the parameters in Theorem 26. When $\gamma$ is small enough and $b > 2/3$, the fully online algorithm (3) becomes very close to the partially online scheme with $\lambda_t \equiv \lambda_1$. By taking $\alpha = 2/3$, the rates in (25) are of order $O(T^{-(2/3-\varepsilon)})$ with $\varepsilon$ arbitrarily small, which is a nice bound for the sample error $\|f_{T+1} - f^V_{\lambda_T}\|_K$. In this case, the difference between $f^V_{\lambda_T}$ and $f^V_\rho$, measured by the approximation error, increases since $\lambda_T = \lambda_1 T^{-\gamma}$. To estimate the total error between $f_{T+1}$ and $f^V_\rho$, we should take a balance for the index $\gamma$ of the regularization parameter, as shown in Theorem 11.

For the insensitive loss, (18) is not satisfied.
We can apply Lemma 25 and obtain bounds for $\|f_{T+1} - f^V_{\lambda_T}\|_K$ by the same proof as that for Theorem 26.

Proposition 28 Assume $|V'|_\infty < \infty$ and all the conditions of Theorem 26 except (18). Take $\lambda_t, \eta_t$ by (23) with the restriction (24) without the last inequality. Then the same convergence rate (25) holds true with the power index $\theta$ given by (26).

5. Bounds for Binary Classification and Regression with Insensitive Loss

We demonstrate how to apply Theorem 26 by deriving learning rates of the fully online algorithm (3) for binary classification and regression with the insensitive loss.
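The trade-off discussed in Remark 27 between the regularization index $\gamma$, the step-size index $\alpha$ and the convergence index $b$ can be made concrete by evaluating the power index $\theta$ directly. In this sketch the formula for $\theta$ follows (26) as read here, and the parameter values are illustrative:

```python
def theta(gamma, alpha, p, beta, b):
    """Power index of Theorem 26: the smallest of the three competing exponents."""
    return min(2 - gamma * (3 - beta) - 2 * alpha,  # drift of the regularization path
               alpha - gamma * (2 * p + 1),         # variance of the online updates
               b - gamma * (2 + p))                 # non-identical sampling

# with alpha = 2/3 and b > 2/3, theta tends to 2/3 as gamma shrinks,
# recovering the O(T^{-(2/3 - eps)}) sample-error rate of Remark 27
p, beta, b = 0, 1.0, 4.0
for gamma in (0.2, 0.1, 0.01):
    print(gamma, round(theta(gamma, 2 / 3, p, beta, b), 3))
```

The printout shows the rate improving monotonically as $\gamma$ decreases, while the approximation error (not counted in $\theta$) deteriorates, which is exactly the balance resolved in the theorems below.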
Theorem 29 Let $V(y,f) = \varphi(yf)$ where $\varphi : \mathbb R \to \mathbb R_+$ is a convex function satisfying

$$|\varphi'(u)| \le N_\varphi|u|^p, \qquad \varphi(u) \le N_\varphi|u|^{p+1}, \qquad |u| \ge 1 \tag{30}$$

for some $p \ge 0$ and $N_\varphi > 0$. Suppose $M_\varphi := \sup\{|\varphi'(u) - \varphi'(0)|/|u| : |u| \le 1\} < \infty$. Assume (16) for $K$, (11) for $\{\rho^{(t)}\}$, and (12) for $(K,\varphi,\rho)$. If $f_\rho \in C^s(X)$ and we choose $\lambda_t, \eta_t$ as (23) with

$$0 < \gamma < \frac{2}{5+10p-\beta}, \qquad \alpha = \frac{2 + \gamma(2p+\beta-2)}{3}, \qquad b > \gamma(2+2p),$$

and $\eta_1 \le \eta_0$, $\lambda_1 \le \kappa^2\big(\varphi(0) + \|\varphi'\|_{L^\infty[-1,1]}\big)/2$, then we have

$$\mathbb E_{z_1,\dots,z_T}\big(\mathcal E(f_{T+1}) - \mathcal E(f^V_\rho)\big) \le C_{\varphi,\beta,\gamma}\,T^{-\min\big\{\frac{2-\gamma(5+10p-\beta)}{6},\ \frac{b-\gamma(2+3p)}{2},\ \gamma\beta\big\}},$$

where $\eta_0 := \frac{1}{\kappa^2 M_\varphi + \kappa^2 N_1\lambda_1^{-p} + \lambda_1}$ with $N_1 = \big(M_\varphi + N_\varphi + \varphi(0) + \|\varphi'\|_{L^\infty[-1,1]}\big)\kappa^p\big(2(\varphi(0) + \|\varphi'\|_{L^\infty[-1,1]})\big)^p$, and $C_{\varphi,\beta,\gamma}$ is a constant depending on $\eta_1, \lambda_1, \kappa, D_0, \beta, \varphi$ and $s$.

Proof By the bounds for $\|f^V_{\lambda_T}\|_K$ and $\|f_{T+1}\|_K$, we know from (30) that

$$\mathcal E(f_{T+1}) - \mathcal E(f^V_{\lambda_T}) = \int_Z\big[\varphi(y f_{T+1}(x)) - \varphi(y f^V_{\lambda_T}(x))\big]d\rho \le C_{K,\varphi}\lambda_T^{-p}\|f_{T+1} - f^V_{\lambda_T}\|_{C(X)} \le \kappa C_{K,\varphi}\lambda_T^{-p}\|f_{T+1} - f^V_{\lambda_T}\|_K,$$

where $C_{K,\varphi}$ is a constant depending on $K$ and $\varphi$. It is easy to check that the loss $V(y,f) = \varphi(yf)$ satisfies (17) with incremental exponent $p$. By Theorem 26 with $0 < \gamma < \frac{2}{5+10p-\beta}$, $\alpha = \frac{2+\gamma(2p+\beta-2)}{3}$ and $b > \gamma(2+2p)$, we have

$$\mathbb E_{z_1,z_2,\dots,z_T}\big(\|f_{T+1} - f^V_{\lambda_T}\|_K\big) \le \sqrt{C_{K,V,\rho,b,\beta,s}}\;T^{-\min\big\{\frac{2-\gamma(5+4p-\beta)}{6},\ \frac{b-\gamma(2+p)}{2}\big\}}.$$

Also, we have $\mathcal E(f^\varphi_{\lambda_T}) - \mathcal E(f^V_\rho) \le D(\lambda_T) \le D_0\lambda_T^\beta$. Thus we get a bound for the excess generalization error

$$\mathbb E_{z_1,\dots,z_T}\big(\mathcal E(f_{T+1}) - \mathcal E(f^V_\rho)\big) \le C_{\varphi,\beta,\gamma}\,T^{-\min\big\{\frac{2-\gamma(5+10p-\beta)}{6},\ \gamma\beta,\ \frac{b-\gamma(2+3p)}{2}\big\}},$$

where $C_{\varphi,\beta,\gamma} = \kappa C_{K,\varphi}\lambda_1^{-p}\sqrt{C_{K,V,\rho,b,\beta,s}} + D_0\lambda_1^\beta$. This verifies the desired bound.

Theorem 29 yields concrete learning rates with various loss functions. When $\varphi$ is chosen to be the hinge loss $\varphi(x) = (1-x)_+$, we can prove Theorem 11.

5.1 Proof of Theorem 11

When $0 < s \le 1$ and $K \in C^{2s}(X \times X)$, (16) holds true. The loss function $\varphi(x) = (1-x)_+$ satisfies $\varphi'(x) = -1$ for $x < 1$ and $\varphi'(x) = 0$ for $x > 1$. It follows that (17) holds true with $p = 0$ and $M_\varphi = 0$. By Example 5, (9) implies (11). Thus all the conditions in Theorem 29 are satisfied and by taking $p = 0$ and $\gamma = \frac14$, we have

$$\mathbb E_{z_1,\dots,z_T}\big(\mathcal E(f_{T+1}) - \mathcal E(f^V_\rho)\big) \le C_{\varphi,\beta,\gamma}\,T^{-\min\big\{\frac18 + \frac{\beta}{24},\ \frac{\beta}{4},\ \frac{b}{2} - \frac14\big\}}.$$
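The hinge-loss exponent just derived ($\gamma = 1/4$, $p = 0$) can be evaluated for concrete values of the approximation power $\beta$ and the convergence index $b$; the values below are illustrative only:

```python
def hinge_rate(beta, b):
    """Excess generalization exponent for the hinge loss with gamma = 1/4, p = 0:
    min{1/8 + beta/24, beta/4, b/2 - 1/4}. Larger means a faster rate."""
    return min(1 / 8 + beta / 24, beta / 4, b / 2 - 1 / 4)

# for small beta the approximation term beta/4 is the bottleneck;
# as beta -> 1 the sample-error term 1/8 + beta/24 takes over
for beta in (0.25, 0.5, 1.0):
    print(beta, hinge_rate(beta, b=2.0))
```

With $b$ moderately large, the non-identical-sampling term $b/2 - 1/4$ never binds, so the rate is governed entirely by the approximation power $\beta$.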
An important relation concerning the hinge loss is the one (Zhang, 2004) between the excess misclassification error and the excess generalization error, given for any measurable function $f : X \to \mathbb R$ as

$$\mathcal R(\mathrm{sgn}(f)) - \mathcal R(f_c) \le \mathcal E(f) - \mathcal E(f_c).$$

Combining this relation with the above bound for the excess generalization error proves the conclusion of Theorem 11.

Turn to the general loss $\varphi$. We give an additional assumption that $\varphi''(0)$ exists and is positive. Under this assumption it was proved in Chen et al. (2004) and Bartlett et al. (2006) that there exists a constant $c_\varphi$ depending only on $\varphi$ such that for any measurable function $f : X \to \mathbb R$,

$$\mathcal R(\mathrm{sgn}(f)) - \mathcal R(f_c) \le c_\varphi\sqrt{\mathcal E(f) - \mathcal E(f^\varphi_\rho)}.$$

Then Theorem 29 gives the following learning rate.

Corollary 30 Let $\varphi$ be a loss function such that $\varphi''(0)$ exists and is positive. Under the assumptions of Theorem 29, if $\gamma = \frac{2}{5+10p+5\beta}$, we have

$$\mathbb E_{z_1,\dots,z_T}\big(\mathcal R(\mathrm{sgn}(f_{T+1})) - \mathcal R(f_c)\big) \le C_{\varphi,\beta}\,T^{-\min\big\{\frac{\beta}{5+10p+5\beta},\ \frac{b}{4} - \frac{2+3p}{10+20p+10\beta}\big\}},$$

where $C_{\varphi,\beta}$ is a constant independent of $T$.

As an example, the $q$-norm SVM loss $\varphi(x) = ((1-x)_+)^q$ with $q > 1$ satisfies $\varphi''(0) > 0$ and (17) with $p = q-1$. So Corollary 30 yields the following rates.

Example 7 Let $\varphi(x) = ((1-x)_+)^q$ with $q > 1$. Under the assumptions of Theorem 29, if $\gamma = \frac{2}{10q-5+5\beta}$, $\alpha = \frac{8q-6+4\beta}{10q-5+5\beta}$ and $b > \frac{6q}{10q-5+5\beta}$, then

$$\mathbb E_{z_1,\dots,z_T}\big(\mathcal R(\mathrm{sgn}(f_{T+1})) - \mathcal R(f_c)\big) = O\Big(T^{-\min\big\{\frac{\beta}{10q-5+5\beta},\ \frac{b}{4} - \frac{3q-1}{20q-10+10\beta}\big\}}\Big).$$

Finally we verify the learning rates for regression with the insensitive loss stated in Section 2.

5.2 Proof of Theorem 10

We need the regularizing function $f^{V_{ls}}_\lambda$ defined by (13) with the least square loss $V = V_{ls}$. It can be found, for example, in Smale and Zhou (2007) that the regularity condition (7) implies

$$\|f^{V_{ls}}_\lambda - f_\rho\|_K \le \Big(\frac{\lambda}{2}\Big)^{r-1/2}\|g_\rho\|_{L^2_{\rho_X}} \quad\hbox{when } \frac12 \le r \le \frac32, \qquad \|f^{V_{ls}}_\lambda - f_\rho\|_{L^2_{\rho_X}} \le \Big(\frac{\lambda}{2}\Big)^r\|g_\rho\|_{L^2_{\rho_X}} \quad\hbox{when } 0 < r \le 1.$$

It follows that when $\lambda \le \big(\kappa\|g_\rho\|_{L^2_{\rho_X}}\big)^{2/(1-2r)}$, we have $\|f^{V_{ls}}_\lambda - f_\rho\|_{C(X)} \le 1$.
Thus, by the special form of the conditional distributions $\rho_x$, we see from the conclusion of Example 6 with $p = 1$ that
$f^{V_{in}}_\lambda = f^{V_{ls}}_\lambda$, and the bounds for $\|f^{V_{in}}_\lambda - f_\rho\|_K$ and $\|f^{V_{in}}_\lambda - f_\rho\|_{L^2_{\rho_X}}$ follow. Moreover, condition (12) for $D(\lambda)$ is valid with $\beta = 1$.

Now we check the other conditions of Proposition 28. Condition (16) is valid because $K \in C^{2s}(X \times X)$ with $0 < s \le 1$. By a simple computation, the incremental condition (17) is verified with exponent $p = 0$. Note that $\|f_\rho\|_K \le \kappa^{2r-1}\|g_\rho\|_{L^2_{\rho_X}}$ and $\|f_\rho\|_{C^s(X)} \le (\kappa + \kappa_{2s})\|f_\rho\|_K$. Then for any $x, u \in X$ and $g \in C^s(Y)$, we see from the uniform distributions $\rho_x$ and $\rho_u$ that

$$\int_Y g(y)\,d(\rho_x - \rho_u) = \frac12\Big[\int_{f_\rho(x)-1}^{f_\rho(x)+1} g(y)\,dy - \int_{f_\rho(u)-1}^{f_\rho(u)+1} g(y)\,dy\Big] \le \|g\|_{C(Y)}\,|f_\rho(x) - f_\rho(u)| \le \|g\|_{C(Y)}(\kappa + \kappa_{2s})\kappa^{2r-1}\|g_\rho\|_{L^2_{\rho_X}}(d(x,u))^s.$$

This verifies the Lipschitz $s$ condition (4) for $\{\rho_x\}$ with the constant $C_\rho = (\kappa + \kappa_{2s})\kappa^{2r-1}\|g_\rho\|_{L^2_{\rho_X}}$. Thus all the conditions of Proposition 28 are satisfied and we obtain

$$\mathbb E_{z_1,z_2,\dots,z_T}\big(\|f_{T+1} - f^{V_{in}}_{\lambda_T}\|_K^2\big) \le C_{K,V,\rho,b,\beta,s}\,T^{-\theta} \qquad\hbox{where } \theta := \min\{2 - 2\gamma - 2\alpha,\ \alpha - \gamma,\ b - 2\gamma\}.$$

Finally we get

$$\mathbb E_{z_1,z_2,\dots,z_T}\big(\|f_{T+1} - f_\rho\|_K\big) \le \sqrt{C_{K,V,\rho,b,\beta,s}}\,T^{-\theta/2} + \Big(\frac{\lambda_{T+1}}{2}\Big)^{r-1/2}\|g_\rho\|_{L^2_{\rho_X}} \quad\hbox{when } \frac12 < r \le \frac32,$$

and

$$\mathbb E_{z_1,z_2,\dots,z_T}\big(\|f_{T+1} - f_\rho\|_{L^2_{\rho_X}}\big) \le \kappa\sqrt{C_{K,V,\rho,b,\beta,s}}\,T^{-\theta/2} + \Big(\frac{\lambda_{T+1}}{2}\Big)^r\|g_\rho\|_{L^2_{\rho_X}} \quad\hbox{when } 0 < r \le 1.$$

Then our desired learning rates follow.

Acknowledgments

We would like to thank the referees for their constructive suggestions and comments. The work described in this paper was partially supported by a grant from the Research Grants Council of Hong Kong [Project No. CityU ] and the National Science Fund for Distinguished Young Scholars of China [Project No. ]. The corresponding author is Ding-Xuan Zhou.

Appendix A.

This appendix includes some detailed proofs.
A.1 Proof of Proposition 6

The first statement follows by taking $g(y) = y$ on $Y$ because

$$|f_\rho(x) - f_\rho(u)| = \Big|\int_Y g(y)\,d(\rho_x - \rho_u)(y)\Big| \le \|\rho_x - \rho_u\|_{(C^s(Y))^*}\,\|g\|_{C^s(Y)}$$

and

$$\|g\|_{C^s(Y)} = \|g\|_{C(Y)} + |g|_{C^s(Y)} \le \sup_{y \in Y}|y| + 2^{1-s}\sup_{y \in Y}|y|.$$

Actually the above estimates tell us that $f_\rho$ is continuous and belongs to $C^s(X)$ with

$$\|f_\rho\|_{C^s(X)} \le C_\rho\big(1 + 2^{1-s}\big)\sup_{y \in Y}|y|.$$

For the second statement, since $Y = \{1,-1\}$, we have $f_\rho(x) = \rho_x(1) - \rho_x(-1)$. It follows that for each $y \in Y$ and $x \in X$, there holds $\rho_x(y) = \frac{1 + y f_\rho(x)}{2}$. So for any $g \in C^s(Y)$ and $x, u \in X$,

$$\int_Y g(y)\,d(\rho_x - \rho_u)(y) = \sum_{y \in Y} g(y)\,\frac{y\,[f_\rho(x) - f_\rho(u)]}{2} = \Big(\sum_{y \in Y}\frac{y\,g(y)}{2}\Big)\big[f_\rho(x) - f_\rho(u)\big].$$

Now the conclusion follows from

$$\Big|\int_Y g(y)\,d(\rho_x - \rho_u)(y)\Big| \le \|g\|_{C(Y)}\,|f_\rho(x) - f_\rho(u)| \le \|f_\rho\|_{C^s(X)}(d(x,u))^s\,\|g\|_{C^s(Y)}.$$

This proves Proposition 6.

A.2 Proof of Example 4

For $x \in X$, we have $\rho_x(1) = f_{\rho,-1}(x) + f_\rho(x)$ and $\rho_x(0) = 1 - 2f_{\rho,-1}(x) - f_\rho(x)$. Hence for any $g \in C^s(Y)$,

$$\int_Y g(y)\,d\rho_x = f_{\rho,-1}(x)\big[g(1) - 2g(0) + g(-1)\big] + f_\rho(x)\big[g(1) - g(0)\big] + g(0)$$

and for $u \in X$,

$$\int_Y g(y)\,d(\rho_x - \rho_u) = \big[f_{\rho,-1}(x) - f_{\rho,-1}(u)\big]\big[g(1) - 2g(0) + g(-1)\big] + \big[f_\rho(x) - f_\rho(u)\big]\big[g(1) - g(0)\big].$$

Then our statement follows from the first part of Proposition 6. This proves the conclusion of Example 4.

A.3 Proof of Example 6

Let $x \in X$. When $f(x) \ge f^{V_{in}}_\rho(x)$, we see from the explicit form of the insensitive loss $V_{in}$ that

$$\int_Y V_{in}(y, f(x))\,d\rho_x(y) - \int_Y V_{in}(y, f^{V_{in}}_\rho(x))\,d\rho_x(y) = \int_{\{y \le f^{V_{in}}_\rho(x)\}}\big[f(x) - f^{V_{in}}_\rho(x)\big]d\rho_x - \int_{\{y \ge f(x)\}}\big[f(x) - f^{V_{in}}_\rho(x)\big]d\rho_x + \int_{\{f^{V_{in}}_\rho(x) < y < f(x)\}}\big[f(x) + f^{V_{in}}_\rho(x) - 2y\big]d\rho_x.$$

It follows that

$$\mathcal E(f) - \mathcal E(f^{V_{in}}_\rho) = \int_X\Big\{\big[f(x) - f^{V_{in}}_\rho(x)\big]\Delta_x + 2\int_{I_x}|f(x) - y|\,d\rho_x\Big\}\,d\rho_X \tag{31}$$
where

$$\Delta_x := \rho_x\big(\{y : y \le f^{V_{in}}_\rho(x)\}\big) - \rho_x\big(\{y : y > f^{V_{in}}_\rho(x)\}\big)$$

and $I_x$ is the open interval between $f^{V_{in}}_\rho(x)$ and $f(x)$. The same relation (31) also holds when $f(x) < f^{V_{in}}_\rho(x)$.

Now we use the special assumption on the conditional distributions and see that the median and the mean of $\rho_x$ are equal: $f^{V_{in}}_\rho(x) = f_\rho(x)$ for each $x \in X$. Moreover, $\Delta_x = 0$, and when $f_\rho(x) \le f(x) \le f_\rho(x) + 1$, we have

$$\int_{I_x}|f(x) - y|\,d\rho_x = \frac{p}{2}\int_0^{f(x) - f_\rho(x)}\big(f(x) - f_\rho(x) - u\big)u^{p-1}\,du = \frac{|f(x) - f_\rho(x)|^{p+1}}{2(p+1)}.$$

The same expression holds true when $f_\rho(x) - 1 \le f(x) < f_\rho(x)$. When $|f(x) - f_\rho(x)| > 1$, since $\rho_x$ vanishes outside $[f_\rho(x) - 1,\ f_\rho(x) + 1]$, we have

$$\int_{I_x}|f(x) - y|\,d\rho_x = \frac{|f(x) - f_\rho(x)|}{2} - \frac{p}{2(p+1)}.$$

Therefore, (31) is the same as

$$\mathcal E(f) - \mathcal E(f^{V_{in}}_\rho) = \int_{\{x : |f(x) - f_\rho(x)| \le 1\}}\frac{|f(x) - f_\rho(x)|^{p+1}}{p+1}\,d\rho_X + \int_{\{x : |f(x) - f_\rho(x)| > 1\}}\Big[|f(x) - f_\rho(x)| - \frac{p}{p+1}\Big]\,d\rho_X.$$

This proves the desired expression for $\mathcal E(f) - \mathcal E(f^{V_{in}}_\rho)$ and hence the bound for $D(\lambda)$.

References

N. Aronszajn. Theory of reproducing kernels. Transactions of the American Mathematical Society, 68, 1950.

P. L. Bartlett, M. I. Jordan, and J. D. McAuliffe. Convexity, classification, and risk bounds. Journal of the American Statistical Association, 101, 2006.

A. Caponnetto, L. Rosasco and Y. Yao. On early stopping in gradient descent learning. Constructive Approximation, 26:289-315, 2007.

D.-R. Chen, Q. Wu, Y. Ying and D.-X. Zhou. Support vector machine soft margin classifiers: error analysis. Journal of Machine Learning Research, 5, 2004.

N. Cesa-Bianchi, P. Long and M. K. Warmuth. Worst-case quadratic loss bounds for prediction using linear functions and gradient descent. IEEE Transactions on Neural Networks, 7, 1996.

N. Cesa-Bianchi, A. Conconi and C. Gentile. On the generalization ability of on-line learning algorithms. IEEE Transactions on Information Theory, 50, 2004.

E. De Vito, A. Caponnetto, and L. Rosasco. Model selection for regularized least-squares algorithm in learning theory. Foundations of Computational Mathematics, 5:59-85, 2005.

E. De Vito, L. Rosasco, A.
Caponnetto, M. Piana, and A. Verri. Some properties of regularized kernel methods. Journal of Machine Learning Research, 5, 2004.
More informationIndirect Rule Learning: Support Vector Machines. Donglin Zeng, Department of Biostatistics, University of North Carolina
Indirect Rule Learning: Support Vector Machines Indirect learning: loss optimization It doesn t estimate the prediction rule f (x) directly, since most loss functions do not have explicit optimizers. Indirection
More informationBINARY CLASSIFICATION
BINARY CLASSIFICATION MAXIM RAGINSY The problem of binary classification can be stated as follows. We have a random couple Z = X, Y ), where X R d is called the feature vector and Y {, } is called the
More informationSurrogate loss functions, divergences and decentralized detection
Surrogate loss functions, divergences and decentralized detection XuanLong Nguyen Department of Electrical Engineering and Computer Science U.C. Berkeley Advisors: Michael Jordan & Martin Wainwright 1
More informationOnline Passive-Aggressive Algorithms
Online Passive-Aggressive Algorithms Koby Crammer Ofer Dekel Shai Shalev-Shwartz Yoram Singer School of Computer Science & Engineering The Hebrew University, Jerusalem 91904, Israel {kobics,oferd,shais,singer}@cs.huji.ac.il
More informationLecture 14 : Online Learning, Stochastic Gradient Descent, Perceptron
CS446: Machine Learning, Fall 2017 Lecture 14 : Online Learning, Stochastic Gradient Descent, Perceptron Lecturer: Sanmi Koyejo Scribe: Ke Wang, Oct. 24th, 2017 Agenda Recap: SVM and Hinge loss, Representer
More informationRelationships between upper exhausters and the basic subdifferential in variational analysis
J. Math. Anal. Appl. 334 (2007) 261 272 www.elsevier.com/locate/jmaa Relationships between upper exhausters and the basic subdifferential in variational analysis Vera Roshchina City University of Hong
More informationBack to the future: Radial Basis Function networks revisited
Back to the future: Radial Basis Function networks revisited Qichao Que, Mikhail Belkin Department of Computer Science and Engineering Ohio State University Columbus, OH 4310 que, mbelkin@cse.ohio-state.edu
More informationSecond Order Optimization Algorithms I
Second Order Optimization Algorithms I Yinyu Ye Department of Management Science and Engineering Stanford University Stanford, CA 94305, U.S.A. http://www.stanford.edu/ yyye Chapters 7, 8, 9 and 10 1 The
More informationReproducing Kernel Hilbert Spaces
Reproducing Kernel Hilbert Spaces Lorenzo Rosasco 9.520 Class 03 February 12, 2007 About this class Goal To introduce a particularly useful family of hypothesis spaces called Reproducing Kernel Hilbert
More informationOslo Class 2 Tikhonov regularization and kernels
RegML2017@SIMULA Oslo Class 2 Tikhonov regularization and kernels Lorenzo Rosasco UNIGE-MIT-IIT May 3, 2017 Learning problem Problem For H {f f : X Y }, solve min E(f), f H dρ(x, y)l(f(x), y) given S n
More informationDivergences, surrogate loss functions and experimental design
Divergences, surrogate loss functions and experimental design XuanLong Nguyen University of California Berkeley, CA 94720 xuanlong@cs.berkeley.edu Martin J. Wainwright University of California Berkeley,
More information9.2 Support Vector Machines 159
9.2 Support Vector Machines 159 9.2.3 Kernel Methods We have all the tools together now to make an exciting step. Let us summarize our findings. We are interested in regularized estimation problems of
More informationSupport Vector Machines
Wien, June, 2010 Paul Hofmarcher, Stefan Theussl, WU Wien Hofmarcher/Theussl SVM 1/21 Linear Separable Separating Hyperplanes Non-Linear Separable Soft-Margin Hyperplanes Hofmarcher/Theussl SVM 2/21 (SVM)
More information2. Function spaces and approximation
2.1 2. Function spaces and approximation 2.1. The space of test functions. Notation and prerequisites are collected in Appendix A. Let Ω be an open subset of R n. The space C0 (Ω), consisting of the C
More informationLearning with Rejection
Learning with Rejection Corinna Cortes 1, Giulia DeSalvo 2, and Mehryar Mohri 2,1 1 Google Research, 111 8th Avenue, New York, NY 2 Courant Institute of Mathematical Sciences, 251 Mercer Street, New York,
More informationForchheimer Equations in Porous Media - Part III
Forchheimer Equations in Porous Media - Part III Luan Hoang, Akif Ibragimov Department of Mathematics and Statistics, Texas Tech niversity http://www.math.umn.edu/ lhoang/ luan.hoang@ttu.edu Applied Mathematics
More informationRobust Support Vector Machines for Probability Distributions
Robust Support Vector Machines for Probability Distributions Andreas Christmann joint work with Ingo Steinwart (Los Alamos National Lab) ICORS 2008, Antalya, Turkey, September 8-12, 2008 Andreas Christmann,
More informationSparseness of Support Vector Machines
Journal of Machine Learning Research 4 2003) 07-05 Submitted 0/03; Published /03 Sparseness of Support Vector Machines Ingo Steinwart Modeling, Algorithms, and Informatics Group, CCS-3 Mail Stop B256 Los
More information(Kernels +) Support Vector Machines
(Kernels +) Support Vector Machines Machine Learning Torsten Möller Reading Chapter 5 of Machine Learning An Algorithmic Perspective by Marsland Chapter 6+7 of Pattern Recognition and Machine Learning
More informationSEPARABILITY AND COMPLETENESS FOR THE WASSERSTEIN DISTANCE
SEPARABILITY AND COMPLETENESS FOR THE WASSERSTEIN DISTANCE FRANÇOIS BOLLEY Abstract. In this note we prove in an elementary way that the Wasserstein distances, which play a basic role in optimal transportation
More informationHomework Assignment #2 for Prob-Stats, Fall 2018 Due date: Monday, October 22, 2018
Homework Assignment #2 for Prob-Stats, Fall 2018 Due date: Monday, October 22, 2018 Topics: consistent estimators; sub-σ-fields and partial observations; Doob s theorem about sub-σ-field measurability;
More informationReflected Brownian Motion
Chapter 6 Reflected Brownian Motion Often we encounter Diffusions in regions with boundary. If the process can reach the boundary from the interior in finite time with positive probability we need to decide
More informationThe Kernel Trick. Robert M. Haralick. Computer Science, Graduate Center City University of New York
The Kernel Trick Robert M. Haralick Computer Science, Graduate Center City University of New York Outline SVM Classification < (x 1, c 1 ),..., (x Z, c Z ) > is the training data c 1,..., c Z { 1, 1} specifies
More informationKernels for Multi task Learning
Kernels for Multi task Learning Charles A Micchelli Department of Mathematics and Statistics State University of New York, The University at Albany 1400 Washington Avenue, Albany, NY, 12222, USA Massimiliano
More informationLinear & nonlinear classifiers
Linear & nonlinear classifiers Machine Learning Hamid Beigy Sharif University of Technology Fall 1394 Hamid Beigy (Sharif University of Technology) Linear & nonlinear classifiers Fall 1394 1 / 34 Table
More informationSingular Integrals. 1 Calderon-Zygmund decomposition
Singular Integrals Analysis III Calderon-Zygmund decomposition Let f be an integrable function f dx 0, f = g + b with g Cα almost everywhere, with b
More informationPerceptron Revisited: Linear Separators. Support Vector Machines
Support Vector Machines Perceptron Revisited: Linear Separators Binary classification can be viewed as the task of separating classes in feature space: w T x + b > 0 w T x + b = 0 w T x + b < 0 Department
More informationWorst-Case Analysis of the Perceptron and Exponentiated Update Algorithms
Worst-Case Analysis of the Perceptron and Exponentiated Update Algorithms Tom Bylander Division of Computer Science The University of Texas at San Antonio San Antonio, Texas 7849 bylander@cs.utsa.edu April
More informationSparseness Versus Estimating Conditional Probabilities: Some Asymptotic Results
Sparseness Versus Estimating Conditional Probabilities: Some Asymptotic Results Peter L. Bartlett 1 and Ambuj Tewari 2 1 Division of Computer Science and Department of Statistics University of California,
More informationFunctional Analysis. Franck Sueur Metric spaces Definitions Completeness Compactness Separability...
Functional Analysis Franck Sueur 2018-2019 Contents 1 Metric spaces 1 1.1 Definitions........................................ 1 1.2 Completeness...................................... 3 1.3 Compactness......................................
More information9.520: Class 20. Bayesian Interpretations. Tomaso Poggio and Sayan Mukherjee
9.520: Class 20 Bayesian Interpretations Tomaso Poggio and Sayan Mukherjee Plan Bayesian interpretation of Regularization Bayesian interpretation of the regularizer Bayesian interpretation of quadratic
More informationPartial Differential Equations
Part II Partial Differential Equations Year 2015 2014 2013 2012 2011 2010 2009 2008 2007 2006 2005 2015 Paper 4, Section II 29E Partial Differential Equations 72 (a) Show that the Cauchy problem for u(x,
More informationRANDOM FIELDS AND GEOMETRY. Robert Adler and Jonathan Taylor
RANDOM FIELDS AND GEOMETRY from the book of the same name by Robert Adler and Jonathan Taylor IE&M, Technion, Israel, Statistics, Stanford, US. ie.technion.ac.il/adler.phtml www-stat.stanford.edu/ jtaylor
More informationNONLINEAR CLASSIFICATION AND REGRESSION. J. Elder CSE 4404/5327 Introduction to Machine Learning and Pattern Recognition
NONLINEAR CLASSIFICATION AND REGRESSION Nonlinear Classification and Regression: Outline 2 Multi-Layer Perceptrons The Back-Propagation Learning Algorithm Generalized Linear Models Radial Basis Function
More informationConvergence Rates of Kernel Quadrature Rules
Convergence Rates of Kernel Quadrature Rules Francis Bach INRIA - Ecole Normale Supérieure, Paris, France ÉCOLE NORMALE SUPÉRIEURE NIPS workshop on probabilistic integration - Dec. 2015 Outline Introduction
More informationRichard F. Bass Krzysztof Burdzy University of Washington
ON DOMAIN MONOTONICITY OF THE NEUMANN HEAT KERNEL Richard F. Bass Krzysztof Burdzy University of Washington Abstract. Some examples are given of convex domains for which domain monotonicity of the Neumann
More informationELEMENTS OF PROBABILITY THEORY
ELEMENTS OF PROBABILITY THEORY Elements of Probability Theory A collection of subsets of a set Ω is called a σ algebra if it contains Ω and is closed under the operations of taking complements and countable
More informationAn introduction to some aspects of functional analysis
An introduction to some aspects of functional analysis Stephen Semmes Rice University Abstract These informal notes deal with some very basic objects in functional analysis, including norms and seminorms
More informationLecture No 1 Introduction to Diffusion equations The heat equat
Lecture No 1 Introduction to Diffusion equations The heat equation Columbia University IAS summer program June, 2009 Outline of the lectures We will discuss some basic models of diffusion equations and
More informationFull-information Online Learning
Introduction Expert Advice OCO LM A DA NANJING UNIVERSITY Full-information Lijun Zhang Nanjing University, China June 2, 2017 Outline Introduction Expert Advice OCO 1 Introduction Definitions Regret 2
More informationStochastic and online algorithms
Stochastic and online algorithms stochastic gradient method online optimization and dual averaging method minimizing finite average Stochastic and online optimization 6 1 Stochastic optimization problem
More information