Learnability of Gaussians with Flexible Variances

Ding-Xuan Zhou
City University of Hong Kong
E-mail: mazhou@cityu.edu.hk

Supported in part by the Research Grants Council of Hong Kong

October 20, 2007
Least-Square Regularized Regression

Learn $f : X \to Y$ from random samples $\mathbf{z} = \{(x_i, y_i)\}_{i=1}^m$.

Take $X$ to be a compact subset of $\mathbb{R}^n$ and $Y = \mathbb{R}$, with $y \approx f(x)$. Due to noise or other uncertainty, we assume an (unknown) probability measure $\rho$ on $Z = X \times Y$ governs the sampling.

marginal distribution $\rho_X$ on $X$: $\{x_i\}_{i=1}^m$ drawn according to $\rho_X$
conditional distribution $\rho(\cdot \,|\, x)$ at $x \in X$

Learning the regression function: $f_\rho(x) = \int_Y y \, d\rho(y \,|\, x)$, so $y_i \approx f_\rho(x_i)$. (For example, if $y = f^*(x) + \epsilon$ with mean-zero noise $\epsilon$, then $f_\rho = f^*$.)
Learning with a Fixed Gaussian

$$f_{\mathbf{z},\lambda,\sigma} := \arg\min_{f \in \mathcal{H}_{K_\sigma}} \left\{ \frac{1}{m} \sum_{i=1}^m (f(x_i) - y_i)^2 + \lambda \|f\|_{K_\sigma}^2 \right\}, \qquad (1)$$

where $\lambda = \lambda(m) > 0$, and $K_\sigma(x, y) = e^{-\frac{|x-y|^2}{2\sigma^2}}$ is a Gaussian kernel on $X$.

Reproducing Kernel Hilbert Space (RKHS) $\mathcal{H}_{K_\sigma}$: completion of $\mathrm{span}\{(K_\sigma)_t := K_\sigma(t, \cdot) : t \in X\}$ with the inner product $\langle \cdot, \cdot \rangle_{K_\sigma}$ satisfying $\langle (K_\sigma)_x, (K_\sigma)_y \rangle_{K_\sigma} = K_\sigma(x, y)$.
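As a concrete illustration of scheme (1): by the representer theorem, the minimizer has the form $f_{\mathbf{z},\lambda,\sigma} = \sum_{i=1}^m c_i (K_\sigma)_{x_i}$ with $(K + \lambda m I)c = y$, where $K = (K_\sigma(x_i, x_j))_{i,j}$. A minimal numerical sketch in Python; the synthetic data and parameter values are my own illustrative choices, not from the talk.

```python
import numpy as np

def gaussian_kernel(A, B, sigma):
    """Gram matrix of K_sigma(a, b) = exp(-|a - b|^2 / (2 sigma^2))."""
    sq_dist = ((A[:, None, :] - B[None, :, :]) ** 2).sum(axis=2)
    return np.exp(-sq_dist / (2 * sigma ** 2))

def regularized_least_squares(X, y, lam, sigma):
    """Scheme (1): by the representer theorem the minimizer is
    f = sum_i c_i K_sigma(x_i, .) with coefficients (K + lam*m*I) c = y."""
    m = len(y)
    K = gaussian_kernel(X, X, sigma)
    c = np.linalg.solve(K + lam * m * np.eye(m), y)
    return lambda T: gaussian_kernel(T, X, sigma) @ c

# Illustrative data: noisy samples of a smooth regression function.
rng = np.random.default_rng(0)
X = rng.uniform(0.0, 1.0, size=(100, 1))
y = np.sin(2 * np.pi * X[:, 0]) + 0.1 * rng.standard_normal(100)
f = regularized_least_squares(X, y, lam=1e-3, sigma=0.1)
print(f(np.array([[0.25]])))  # close to f_rho(0.25) = sin(pi/2) = 1
```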
Theorem 1 (Smale-Zhou, Constr. Approx. 2007). Assume $|y| \le M$ and that $f_\rho = \int_X K_\sigma(x, y) g(y) \, d\rho_X(y)$ for some $g \in L^2_{\rho_X}$. For any $0 < \delta < 1$, with confidence $1 - \delta$,

$$\|f_{\mathbf{z},\lambda,\sigma} - f_\rho\|_{L^2_{\rho_X}} \le 2 \log(4/\delta) \, (12M)^{2/3} \, \|g\|_{L^2_{\rho_X}}^{1/3} \left( \frac{1}{m} \right)^{1/3},$$

where $\lambda = \lambda(m) = \log(4/\delta) \left( 12M / \|g\|_{L^2_{\rho_X}} \right)^{2/3} (1/m)^{1/3}$.

In Theorem 1, $f_\rho \in C^\infty$.
RKHS $\mathcal{H}_{K_\sigma}$ generated by a Gaussian kernel on $X$:

$$\mathcal{H}_{K_\sigma} = \mathcal{H}_{K_\sigma}(\mathbb{R}^n)\big|_X,$$

where $\mathcal{H}_{K_\sigma}(\mathbb{R}^n)$ is the RKHS generated by $K_\sigma$ as a Mercer kernel on $\mathbb{R}^n$:

$$\mathcal{H}_{K_\sigma}(\mathbb{R}^n) = \left\{ f \in L^2(\mathbb{R}^n) : \|f\|_{\mathcal{H}_{K_\sigma}(\mathbb{R}^n)} < \infty \right\},$$

where

$$\|f\|_{\mathcal{H}_{K_\sigma}(\mathbb{R}^n)} = \left( \int_{\mathbb{R}^n} \frac{|\hat{f}(\xi)|^2}{(\sqrt{2\pi}\,\sigma)^n} \, e^{\frac{\sigma^2 |\xi|^2}{2}} \, d\xi \right)^{1/2}.$$

Thus $\mathcal{H}_{K_\sigma}(\mathbb{R}^n) \subset C^\infty(\mathbb{R}^n)$ (Steinwart).
If $X$ is a domain with piecewise smooth boundary and $d\rho_X(x) \ge c_0 \, dx$ for some $c_0 > 0$, then for any $\beta > 0$,

$$\mathcal{D}_\sigma(\lambda) := \inf_{f \in \mathcal{H}_{K_\sigma}} \left\{ \|f - f_\rho\|_{L^2_{\rho_X}}^2 + \lambda \|f\|_{K_\sigma}^2 \right\} = O(\lambda^\beta)$$

implies $f_\rho \in C^\infty(X)$.

Note $\|f - f_\rho\|_{L^2_{\rho_X}}^2 = \mathcal{E}(f) - \mathcal{E}(f_\rho)$, where $\mathcal{E}(f) := \int_Z (f(x) - y)^2 \, d\rho$. Denote $\mathcal{E}_{\mathbf{z}}(f) = \frac{1}{m} \sum_{i=1}^m (f(x_i) - y_i)^2 \approx \mathcal{E}(f)$. Then

$$f_{\mathbf{z},\lambda,\sigma} = \arg\min_{f \in \mathcal{H}_{K_\sigma}} \left\{ \mathcal{E}_{\mathbf{z}}(f) + \lambda \|f\|_{K_\sigma}^2 \right\}.$$
If we define $f_{\lambda,\sigma} = \arg\min_{f \in \mathcal{H}_{K_\sigma}} \left\{ \mathcal{E}(f) + \lambda \|f\|_{K_\sigma}^2 \right\}$, then $f_{\mathbf{z},\lambda,\sigma} \approx f_{\lambda,\sigma}$, and the error can be estimated in terms of $\lambda$ by the theory of uniform convergence over the compact function set $B_{M/\sqrt{\lambda}} := \{ f \in \mathcal{H}_{K_\sigma} : \|f\|_{K_\sigma} \le M/\sqrt{\lambda} \}$, since $f_{\mathbf{z},\lambda,\sigma} \in B_{M/\sqrt{\lambda}}$.

But $\|f_{\lambda,\sigma} - f_\rho\|_{L^2_{\rho_X}}^2 = O(\lambda^\beta)$ for any $\beta > 0$ implies $f_\rho \in C^\infty(X)$. So the learning ability of a single Gaussian is weak.

One may choose a less smooth kernel, but we would like radial basis kernels for manifold learning.

One way to increase the learning ability of Gaussian kernels: let $\sigma$ depend on $m$, with $\sigma = \sigma(m) \to 0$ as $m \to \infty$. Steinwart-Scovel, Xiang-Zhou, ...
Another way: allow all possible variances $\sigma \in (0, \infty)$.

Regularization Schemes with Flexible Gaussians: Zhou, Wu-Ying-Zhou, Ying-Zhou, Micchelli-Pontil-Wu-Zhou, ...

$$f_{\mathbf{z},\lambda} := \arg\min_{0 < \sigma < \infty} \; \min_{f \in \mathcal{H}_{K_\sigma}} \left\{ \frac{1}{m} \sum_{i=1}^m (f(x_i) - y_i)^2 + \lambda \|f\|_{K_\sigma}^2 \right\}$$

Theorem 2 (Ying-Zhou, J. Mach. Learning Res. 2007). Let $\rho_X$ be the Lebesgue measure on a domain $X$ in $\mathbb{R}^n$ with minimally smooth boundary. If $f_\rho \in H^s(X)$ for some $s \ge 2$ and $\lambda = m^{-\frac{2s+n}{4(4s+n)}}$, then we have

$$E_{\mathbf{z} \in Z^m} \left( \|f_{\mathbf{z},\lambda} - f_\rho\|_{L^2}^2 \right) = O\left( m^{-\frac{s}{2(4s+n)}} \log m \right).$$
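Numerically, the outer minimum over $\sigma \in (0, \infty)$ can only be approximated, e.g. by searching a finite grid of variances and keeping the pair $(\sigma, f)$ with the smallest regularized empirical risk. A hedged sketch continuing the earlier snippet (it reuses `gaussian_kernel` and the data `X`, `y` defined there); the grid is an illustrative stand-in for the continuum.

```python
def flexible_gaussian_rls(X, y, lam, sigmas):
    """Approximate the two-level scheme: solve the inner problem in each
    H_{K_sigma}, then take the sigma minimizing E_z(f) + lam*|f|_K^2."""
    m = len(y)
    best_risk, best_sigma, best_c = np.inf, None, None
    for sigma in sigmas:
        K = gaussian_kernel(X, X, sigma)
        c = np.linalg.solve(K + lam * m * np.eye(m), y)
        # |f|_K^2 = c^T K c for f = sum_i c_i K_sigma(x_i, .)
        risk = ((K @ c - y) ** 2).mean() + lam * (c @ K @ c)
        if risk < best_risk:
            best_risk, best_sigma, best_c = risk, sigma, c
    return best_sigma, lambda T: gaussian_kernel(T, X, best_sigma) @ best_c

sigma_star, f = flexible_gaussian_rls(X, y, lam=1e-3,
                                      sigmas=np.logspace(-2, 1, 20))
```

(On real data the grid search over $\sigma$ is often replaced by cross-validation; the sketch follows the scheme as stated.)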
Major difficulty: is the function set

$$\mathcal{H} = \bigcup_{0 < \sigma < \infty} \left\{ f \in \mathcal{H}_{K_\sigma} : \|f\|_{K_\sigma} \le R \right\}$$

with $R > 0$ learnable? That is, is this function set a uniform Glivenko-Cantelli class? Its closure is not a compact subset of $C(X)$.

Theory of uniform convergence for $\sup_{f \in \mathcal{H}} |\mathcal{E}_{\mathbf{z}}(f) - \mathcal{E}(f)|$: given a bounded set $\mathcal{H}$ of functions on $X$, when do we have

$$\lim_{m \to \infty} \, \sup_\rho \, \mathrm{Prob} \left\{ \sup_{l \ge m} \, \sup_{f \in \mathcal{H}} \left| \frac{1}{l} \sum_{i=1}^l f(x_i) - \int_X f(x) \, d\rho_X \right| > \epsilon \right\} = 0, \qquad \forall \epsilon > 0?$$

Such a set is called a uniform Glivenko-Cantelli (UGC) class. Characterizations: Vapnik-Chervonenkis, and Alon, Ben-David, Cesa-Bianchi, Haussler (1997).
Our quantitative estimates: If $V : Y \times \mathbb{R} \to \mathbb{R}_+$ is convex with respect to the second variable, $M = \|V(y, 0)\|_{L^\infty_\rho(Z)} < \infty$, and $C_R = \sup\{\max\{|V'_-(y, t)|, |V'_+(y, t)|\} : y \in Y, |t| \le R\} < \infty$, then we have

$$E_{\mathbf{z} \in Z^m} \left\{ \sup_{f \in \mathcal{H}} \left| \frac{1}{m} \sum_{i=1}^m V(y_i, f(x_i)) - \int_Z V(y, f(x)) \, d\rho \right| \right\} \le C \, C_R R \, \frac{\log m}{m^{1/4}} + \frac{2M}{\sqrt{m}},$$

where $C$ is a constant depending on $n$.

Ideas: reducing the estimates for $\mathcal{H}$ to a much smaller subset $\mathcal{F} = \{(K_\sigma)_x : x \in X, \, 0 < \sigma < \infty\}$, then bounding empirical covering numbers. The UGC property follows from the characterization of Dudley-Giné-Zinn.
Improve the learning rates when $X$ is a manifold of dimension $d$ with $d$ much smaller than the dimension $n$ of the underlying Euclidean space.

Approximation by Gaussians on Riemannian manifolds

Let $X$ be a $d$-dimensional connected compact $C^\infty$ submanifold of $\mathbb{R}^n$ without boundary. The approximation scheme is given by a family of linear operators $\{I_\sigma : C(X) \to C(X)\}_{\sigma > 0}$ as

$$I_\sigma(f)(x) = \frac{1}{(\sqrt{2\pi}\,\sigma)^d} \int_X K_\sigma(x, y) f(y) \, dV(y) = \frac{1}{(\sqrt{2\pi}\,\sigma)^d} \int_X \exp\left\{ -\frac{|x-y|^2}{2\sigma^2} \right\} f(y) \, dV(y), \qquad x \in X,$$

where $V$ is the Riemannian volume measure of $X$.
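For intuition, on the unit circle ($d = 1$ in $\mathbb{R}^2$) the integral defining $I_\sigma$ can be discretized by a Riemann sum over the arc-length (volume) measure. A toy Python sketch continuing the earlier snippets; the quadrature grid and test function are my own choices.

```python
def I_sigma_circle(f, sigma, num_nodes=2000):
    """Discretize I_sigma(f)(x) on the unit circle in R^2:
    (sqrt(2*pi)*sigma)^(-d) * int_X exp(-|x-y|^2/(2 sigma^2)) f(y) dV(y),
    with d = 1 and dV the arc-length measure, via a Riemann sum."""
    t = np.linspace(0.0, 2 * np.pi, num_nodes, endpoint=False)
    nodes = np.stack([np.cos(t), np.sin(t)], axis=1)  # sample points y on X
    weight = 2 * np.pi / num_nodes                    # arc length per node
    def If(x):
        sq_dist = ((nodes - x) ** 2).sum(axis=1)      # Euclidean |x - y|^2
        vals = np.exp(-sq_dist / (2 * sigma ** 2)) * f(nodes)
        return vals.sum() * weight / (np.sqrt(2 * np.pi) * sigma)
    return If

# f(y1, y2) = y1 is Lip(1) on the circle, so |I_sigma(f) - f| = O(sigma).
If = I_sigma_circle(lambda P: P[:, 0], sigma=0.05)
print(If(np.array([1.0, 0.0])))  # approximately f(1, 0) = 1
```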
Theorem 3 (Ye-Zhou, Adv. Comput. Math. 2007). If $f_\rho \in \mathrm{Lip}(s)$ with $0 < s \le 1$, then

$$\|I_\sigma(f_\rho) - f_\rho\|_{C(X)} \le C_X \|f_\rho\|_{\mathrm{Lip}(s)} \, \sigma^s \qquad \forall \sigma > 0, \qquad (2)$$

where $C_X$ is a positive constant independent of $f_\rho$ or $\sigma$. By taking $\lambda = \left( \frac{\log^2 m}{m} \right)^{\frac{s+d}{8s+4d}}$, we have

$$E_{\mathbf{z} \in Z^m} \left\{ \|f_{\mathbf{z},\lambda} - f_\rho\|_{L^2_{\rho_X}}^2 \right\} = O\left( \left( \frac{\log^2 m}{m} \right)^{\frac{s}{8s+4d}} \right).$$

The index $\frac{s}{8s+4d}$ in Theorem 3 is larger than the index $\frac{s}{2(4s+n)}$ in Theorem 2 when the manifold dimension $d$ is much smaller than $n$, so the rate is faster: for instance, with $s = 1$, $d = 2$, $n = 100$, the indices are $1/16$ and $1/208$.
Classification by Gaussians on Riemannian manifolds

Let $\phi(t) = \max\{1 - t, 0\}$ be the hinge loss for support vector machine classification. Define

$$f_{\mathbf{z},\lambda} = \arg\min_{\sigma \in (0, +\infty)} \; \min_{f \in \mathcal{H}_{K_\sigma}} \left\{ \frac{1}{m} \sum_{i=1}^m \phi(y_i f(x_i)) + \lambda \|f\|_{K_\sigma}^2 \right\}.$$

By using $I_\sigma : L^p(X) \to L^p(X)$, we obtain learning rates for binary classification to learn the Bayes rule:

$$f_c(x) = \begin{cases} 1, & \text{if } \rho(y = 1 \,|\, x) \ge \rho(y = -1 \,|\, x), \\ -1, & \text{if } \rho(y = 1 \,|\, x) < \rho(y = -1 \,|\, x). \end{cases}$$

Here $Y = \{1, -1\}$ represents the two classes. The misclassification error is defined as $\mathcal{R}(f) := \mathrm{Prob}\{y \ne f(x)\}$, and $\mathcal{R}(f) \ge \mathcal{R}(f_c)$ for any $f : X \to Y$.
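A practical counterpart of this scheme is the soft-margin SVM with Gaussian (RBF) kernel, where the flexible variance is searched over a grid: in scikit-learn's parametrization, `gamma` $= 1/(2\sigma^2)$ and `C` corresponds to the regularization parameter via $\lambda \approx 1/(2Cm)$. The sketch below uses standard scikit-learn calls on synthetic two-class data of my own making; it is an analogue of the scheme, not the exact estimator analyzed here.

```python
import numpy as np
from sklearn.svm import SVC
from sklearn.model_selection import GridSearchCV

# Illustrative two-class data living near a 1-dimensional manifold in R^2.
rng = np.random.default_rng(1)
t = rng.uniform(0.0, 2 * np.pi, 200)
X = np.stack([np.cos(t), np.sin(t)], axis=1) + 0.05 * rng.standard_normal((200, 2))
y = np.where(np.sin(2 * t) > 0, 1, -1)  # labels from a sign pattern on the circle

# Searching gamma = 1/(2 sigma^2) over a grid mimics the minimization over the
# variance sigma; cross-validation stands in for the theoretical choice of lambda.
search = GridSearchCV(SVC(kernel="rbf"),
                      {"gamma": np.logspace(-2, 2, 10), "C": [0.1, 1.0, 10.0]},
                      cv=5)
search.fit(X, y)
print(search.best_params_, search.score(X, y))
```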
The Sobolev space $H^k_p(X)$ is the completion of $C^\infty(X)$ with respect to the norm

$$\|f\|_{H^k_p(X)} = \sum_{j=0}^k \left( \int_X |\nabla^j f|^p \, dV \right)^{1/p},$$

where $\nabla^j f$ denotes the $j$th covariant derivative of $f$.

Theorem 4. If $f_c$ lies in the interpolation space $(L^1(X), H^1_2(X))_\theta$ for some $0 < \theta \le 1$, then by taking $\lambda = \left( \frac{\log^2 m}{m} \right)^{\frac{2\theta+d}{12\theta+2d}}$, we have

$$E_{\mathbf{z} \in Z^m} \left\{ \mathcal{R}(\mathrm{sgn}(f_{\mathbf{z},\lambda})) - \mathcal{R}(f_c) \right\} \le C \left( \frac{\log^2 m}{m} \right)^{\frac{\theta}{6\theta+d}},$$

where $C$ is a constant independent of $m$.
Ongoing topics:

variable selection
dimensionality reduction
graph Laplacian
diffusion map