Learnability of Gaussians with flexible variances


Learnability of Gaussians with flexible variances

Ding-Xuan Zhou, City University of Hong Kong

Supported in part by the Research Grants Council of Hong Kong

October 20, 2007

Least-square Regularized Regression

Learn $f: X \to Y$ from random samples $z = \{(x_i, y_i)\}_{i=1}^m$.

Take $X$ to be a compact subset of $\mathbb{R}^n$ and $Y = \mathbb{R}$, with $y \approx f(x)$. Due to noise or other uncertainty, we assume an (unknown) probability measure $\rho$ on $Z = X \times Y$ governs the sampling.

Marginal distribution $\rho_X$ on $X$: $\{x_i\}_{i=1}^m$ drawn according to $\rho_X$; conditional distribution $\rho(\cdot \mid x)$ at $x \in X$.

Learning the regression function: $f_\rho(x) = \int_Y y \, d\rho(y \mid x)$, so that $y_i \approx f_\rho(x_i)$.
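To make the sampling model concrete, here is a minimal numerical sketch (not from the slides): a hypothetical regression function on $X = [0,1]$ with uniform $\rho_X$ and Gaussian noise in $\rho(\cdot \mid x)$, so that $f_\rho(x) = E[y \mid x]$ is exactly the function chosen below.

```python
import numpy as np

rng = np.random.default_rng(0)

def f_rho(x):
    # hypothetical regression function f_rho(x) = E[y | x]
    return np.sin(2 * np.pi * x)

m = 200
x = rng.uniform(0.0, 1.0, size=m)              # {x_i}_{i=1}^m drawn from rho_X (uniform here)
y = f_rho(x) + 0.1 * rng.standard_normal(m)    # y_i ~ rho(.|x_i): noisy observation of f_rho(x_i)
z = list(zip(x, y))                            # the sample z = {(x_i, y_i)}_{i=1}^m
```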

Learning with a Fixed Gaussian

$$f_{z,\lambda,\sigma} := \arg\min_{f \in \mathcal{H}_{K_\sigma}} \frac{1}{m}\sum_{i=1}^m (f(x_i) - y_i)^2 + \lambda \|f\|_{K_\sigma}^2, \qquad (1)$$

where $\lambda = \lambda(m) > 0$ and $K_\sigma(x, y) = e^{-\frac{|x-y|^2}{2\sigma^2}}$ is a Gaussian kernel on $X$.

The Reproducing Kernel Hilbert Space (RKHS) $\mathcal{H}_{K_\sigma}$ is the completion of $\mathrm{span}\{(K_\sigma)_t := K_\sigma(t, \cdot) : t \in X\}$ with the inner product $\langle \cdot, \cdot \rangle_{K_\sigma}$ satisfying $\langle (K_\sigma)_x, (K_\sigma)_y \rangle_{K_\sigma} = K_\sigma(x, y)$.
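By the representer theorem, the minimizer of scheme (1) has the form $f_{z,\lambda,\sigma}(x) = \sum_j c_j K_\sigma(x, x_j)$ with $(K + \lambda m I)c = y$, where $K$ is the kernel matrix on the sample. The following Python sketch (an illustration, not the authors' code; function and variable names are ours) implements this for a fixed $\sigma$.

```python
import numpy as np

def gaussian_kernel(A, B, sigma):
    """K_sigma(a, b) = exp(-|a - b|^2 / (2 sigma^2)) for all rows a of A and b of B."""
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(axis=2)
    return np.exp(-d2 / (2.0 * sigma ** 2))

def fit_fixed_gaussian(X, y, sigma, lam):
    """Minimize (1/m) sum_i (f(x_i)-y_i)^2 + lam * ||f||_K^2 over H_{K_sigma}.
    The representer theorem gives f = sum_j c_j K_sigma(., x_j) with
    (K + lam * m * I) c = y."""
    m = len(y)
    K = gaussian_kernel(X, X, sigma)
    c = np.linalg.solve(K + lam * m * np.eye(m), y)
    return c

def predict(X_train, c, sigma, X_new):
    """Evaluate f_{z,lambda,sigma} at new points."""
    return gaussian_kernel(X_new, X_train, sigma) @ c
```

For the toy sample above, one would call `c = fit_fixed_gaussian(x.reshape(-1, 1), y, sigma=0.1, lam=1e-3)` and then `predict(x.reshape(-1, 1), c, 0.1, X_new)`.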

Theorem 1 (Smale-Zhou, Constr. Approx. 2007). Assume $|y| \le M$ and that $f_\rho = \int_X K_\sigma(x, y)\, g(y)\, d\rho_X(y)$ for some $g \in L^2_{\rho_X}$. For any $0 < \delta < 1$, with confidence $1 - \delta$,

$$\|f_{z,\lambda,\sigma} - f_\rho\|_{L^2_{\rho_X}} \le 2 \log(4/\delta)\, (12M)^{2/3}\, \|g\|_{L^2_{\rho_X}}^{1/3} \left(\frac{1}{m}\right)^{1/3},$$

where $\lambda = \lambda(m) = \log(4/\delta)\, \big(12M / \|g\|_{L^2_{\rho_X}}\big)^{2/3} (1/m)^{1/3}$.

In Theorem 1, the assumption forces $f_\rho \in C^\infty$.

RKHS $\mathcal{H}_{K_\sigma}$ generated by a Gaussian kernel on $X$:

$$\mathcal{H}_{K_\sigma} = \mathcal{H}_{K_\sigma}(\mathbb{R}^n)\big|_X,$$

where $\mathcal{H}_{K_\sigma}(\mathbb{R}^n)$ is the RKHS generated by $K_\sigma$ as a Mercer kernel on $\mathbb{R}^n$:

$$\mathcal{H}_{K_\sigma}(\mathbb{R}^n) = \Big\{ f \in L^2(\mathbb{R}^n) : \|f\|_{\mathcal{H}_{K_\sigma}(\mathbb{R}^n)} < \infty \Big\},$$

where

$$\|f\|_{\mathcal{H}_{K_\sigma}(\mathbb{R}^n)} = \left( \frac{1}{(\sqrt{2\pi}\,\sigma)^n} \int_{\mathbb{R}^n} |\hat f(\xi)|^2 \, e^{\frac{\sigma^2 |\xi|^2}{2}} \, d\xi \right)^{1/2}.$$

Thus $\mathcal{H}_{K_\sigma}(\mathbb{R}^n) \subset C^\infty(\mathbb{R}^n)$ (Steinwart).

If $X$ is a domain with piecewise smooth boundary and $d\rho_X(x) \ge c_0\, dx$ for some $c_0 > 0$, then for any $\beta > 0$,

$$D_\sigma(\lambda) := \inf_{f \in \mathcal{H}_{K_\sigma}} \Big\{ \|f - f_\rho\|^2_{L^2_{\rho_X}} + \lambda \|f\|^2_{K_\sigma} \Big\} = O(\lambda^\beta)$$

implies $f_\rho \in C^\infty(X)$.

Note $\|f - f_\rho\|^2_{L^2_{\rho_X}} = \mathcal{E}(f) - \mathcal{E}(f_\rho)$, where $\mathcal{E}(f) := \int_Z (f(x) - y)^2\, d\rho$. Denote $\mathcal{E}_z(f) = \frac{1}{m}\sum_{i=1}^m (f(x_i) - y_i)^2 \approx \mathcal{E}(f)$. Then $f_{z,\lambda,\sigma} = \arg\min_{f \in \mathcal{H}_{K_\sigma}} \big\{ \mathcal{E}_z(f) + \lambda \|f\|^2_{K_\sigma} \big\}$.

If we define $f_{\lambda,\sigma} = \arg\min_{f \in \mathcal{H}_{K_\sigma}} \big\{ \mathcal{E}(f) + \lambda \|f\|^2_{K_\sigma} \big\}$, then $f_{z,\lambda,\sigma} \approx f_{\lambda,\sigma}$ and the error can be estimated in terms of $\lambda$ and $m$ by the theory of uniform convergence over the compact function set $B_{M/\sqrt{\lambda}} := \{ f \in \mathcal{H}_{K_\sigma} : \|f\|_{K_\sigma} \le M/\sqrt{\lambda} \}$, since $f_{z,\lambda,\sigma} \in B_{M/\sqrt{\lambda}}$.

But $\|f_{\lambda,\sigma} - f_\rho\|^2_{L^2_{\rho_X}} = O(\lambda^\beta)$ for any $\beta > 0$ implies $f_\rho \in C^\infty(X)$. So the learning ability of a single Gaussian is weak.

One may choose a less smooth kernel, but we would like radial basis kernels for manifold learning.

One way to increase the learning ability of Gaussian kernels: let $\sigma$ depend on $m$, with $\sigma = \sigma(m) \to 0$ as $m \to \infty$. Steinwart-Scovel, Xiang-Zhou, ...

Another way: allow all possible variances $\sigma \in (0, \infty)$.

Regularization schemes with flexible Gaussians: Zhou, Wu-Ying-Zhou, Ying-Zhou, Micchelli-Pontil-Wu-Zhou, ...

$$f_{z,\lambda} := \arg\min_{0 < \sigma < \infty}\ \min_{f \in \mathcal{H}_{K_\sigma}} \frac{1}{m}\sum_{i=1}^m (f(x_i) - y_i)^2 + \lambda \|f\|^2_{K_\sigma}$$

Theorem 2 (Ying-Zhou, J. Mach. Learning Res. 2007). Let $\rho_X$ be the Lebesgue measure on a domain $X$ in $\mathbb{R}^n$ with minimally smooth boundary. If $f_\rho \in H^s(X)$ for some $s \ge 2$ and $\lambda = m^{-\frac{2s+n}{4(4s+n)}}$, then we have

$$E_{z \in Z^m}\Big( \|f_{z,\lambda} - f_\rho\|^2_{L^2} \Big) = O\Big( m^{-\frac{s}{2(4s+n)}} \log m \Big).$$
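In practice the outer minimization over $\sigma \in (0, \infty)$ can be approximated by a finite grid of candidate variances: fit the fixed-$\sigma$ scheme for each candidate and keep the pair $(\sigma, f)$ with the smallest regularized empirical error. The sketch below is our illustration of this grid-search reading of the scheme (it reuses `gaussian_kernel` from the earlier sketch; all names are ours).

```python
import numpy as np

def fit_flexible_gaussian(X, y, lam, sigmas):
    """Approximate the flexible-variance scheme by minimizing the regularized
    empirical error over a finite grid of sigma values and over f in H_{K_sigma}."""
    m = len(y)
    best = None
    for sigma in sigmas:
        K = gaussian_kernel(X, X, sigma)                  # from the earlier sketch
        c = np.linalg.solve(K + lam * m * np.eye(m), y)
        fx = K @ c                                        # f(x_i) on the sample
        obj = np.mean((fx - y) ** 2) + lam * (c @ K @ c)  # ||f||_K^2 = c^T K c
        if best is None or obj < best[0]:
            best = (obj, sigma, c)
    return best[1], best[2]

# usage:
# sigma_hat, c_hat = fit_flexible_gaussian(x.reshape(-1, 1), y, lam=1e-3,
#                                          sigmas=np.logspace(-2, 1, 20))
```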

Major difficulty: is the function set $\mathcal{H} = \bigcup_{0 < \sigma < \infty} \{ f \in \mathcal{H}_{K_\sigma} : \|f\|_{K_\sigma} \le R \}$ with $R > 0$ learnable? That is, is this function set a uniform Glivenko-Cantelli class? Its closure is not a compact subset of $C(X)$.

Theory of uniform convergence for $\sup_{f \in \mathcal{H}} |\mathcal{E}_z(f) - \mathcal{E}(f)|$.

Given a bounded set $\mathcal{H}$ of functions on $X$, when do we have

$$\lim_{l \to \infty}\ \sup_\rho\ \mathrm{Prob}\left\{ \sup_{m \ge l}\ \sup_{f \in \mathcal{H}} \left| \frac{1}{m}\sum_{i=1}^m f(x_i) - \int f(x)\, d\rho_X \right| > \epsilon \right\} = 0, \qquad \forall \epsilon > 0?$$

Such a set is called a uniform Glivenko-Cantelli (UGC) class.

Characterizations: Vapnik-Chervonenkis, and Alon, Ben-David, Cesa-Bianchi, Haussler (1997).

Our quantitative estimates: If $V: Y \times \mathbb{R} \to \mathbb{R}_+$ is convex with respect to the second variable, $M = \|V(y, 0)\|_{L^\infty_\rho(Z)} < \infty$, and $C_R = \sup\{ \max\{ |V'_-(y, t)|, |V'_+(y, t)| \} : y \in Y,\ |t| \le R \} < \infty$, then we have

$$E_{z \in Z^m}\left\{ \sup_{f \in \mathcal{H}} \left| \frac{1}{m}\sum_{i=1}^m V(y_i, f(x_i)) - \int_Z V(y, f(x))\, d\rho \right| \right\} \le \tilde{C}\, C_R R\, \frac{(\log m)^{1/4}}{m^{1/4}} + \frac{2M}{\sqrt{m}},$$

where $\tilde{C}$ is a constant depending on $n$.

Ideas: reduce the estimates for $\mathcal{H}$ to a much smaller subset $\mathcal{F} = \{ (K_\sigma)_x : x \in X,\ 0 < \sigma < \infty \}$, then bound empirical covering numbers. The UGC property follows from the characterization of Dudley-Giné-Zinn.

Improve the learning rates when $X$ is a manifold of dimension $d$ with $d$ much smaller than the dimension $n$ of the underlying Euclidean space.

Approximation by Gaussians on Riemannian manifolds

Let $X$ be a $d$-dimensional connected compact $C^\infty$ submanifold of $\mathbb{R}^n$ without boundary. The approximation scheme is given by a family of linear operators $\{I_\sigma: C(X) \to C(X)\}_{\sigma > 0}$ as

$$I_\sigma(f)(x) = \frac{1}{(\sqrt{2\pi}\,\sigma)^d} \int_X K_\sigma(x, y)\, f(y)\, dV(y) = \frac{1}{(\sqrt{2\pi}\,\sigma)^d} \int_X \exp\left\{ -\frac{|x - y|^2}{2\sigma^2} \right\} f(y)\, dV(y), \qquad x \in X,$$

where $V$ is the Riemannian volume measure of $X$.
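As a numerical illustration (not from the slides), $I_\sigma$ can be approximated on a concrete manifold by replacing the integral against $dV$ with a quadrature sum. The sketch below does this on the unit circle, a $d = 1$ submanifold of $\mathbb{R}^2$; the function names and the choice of manifold are ours.

```python
import numpy as np

def I_sigma_circle(f, x, sigma, N=2000):
    """Approximate I_sigma(f)(x) on the unit circle in R^2 by a uniform grid of
    N points, each carrying arc-length weight 2*pi/N for dV(y)."""
    d = 1                                                 # intrinsic dimension of the circle
    t = np.linspace(0.0, 2.0 * np.pi, N, endpoint=False)
    pts = np.stack([np.cos(t), np.sin(t)], axis=1)        # quadrature points y on X
    w = 2.0 * np.pi / N                                   # quadrature weight for dV(y)
    d2 = ((pts - x) ** 2).sum(axis=1)                     # Euclidean |x - y|^2 in R^2
    kernel = np.exp(-d2 / (2.0 * sigma ** 2))
    return (kernel * f(pts)).sum() * w / (np.sqrt(2.0 * np.pi) * sigma) ** d

# e.g. with f(y) = y_1 restricted to the circle, I_sigma(f)(x) -> f(x) as sigma -> 0:
# I_sigma_circle(lambda p: p[:, 0], np.array([1.0, 0.0]), sigma=0.05)   # close to 1
```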

Theorem 3 (Ye-Zhou, Adv. Comput. Math. 2007). If $f_\rho \in \mathrm{Lip}(s)$ with $0 < s \le 1$, then

$$\|I_\sigma(f_\rho) - f_\rho\|_{C(X)} \le C_X\, \|f_\rho\|_{\mathrm{Lip}(s)}\, \sigma^s \qquad \forall \sigma > 0, \qquad (2)$$

where $C_X$ is a positive constant independent of $f_\rho$ or $\sigma$. By taking $\lambda = \big( \tfrac{(\log m)^2}{m} \big)^{\frac{s+d}{8s+4d}}$, we have

$$E_{z \in Z^m}\Big\{ \|f_{z,\lambda} - f_\rho\|^2_{L^2_{\rho_X}} \Big\} = O\left( \Big( \frac{(\log m)^2}{m} \Big)^{\frac{s}{8s+4d}} \right).$$

The index $8s + 4d$ in Theorem 3 is smaller than $2(4s + n)$ in Theorem 2 when the manifold dimension $d$ is much smaller than $n$, so the exponent $\frac{s}{8s+4d}$ is larger and the learning rate improves.

Classification by Gaussians on Riemannian manifolds

Let $\phi(t) = \max\{1 - t, 0\}$ be the hinge loss for support vector machine classification. Define

$$f_{z,\lambda} = \arg\min_{\sigma \in (0, +\infty)}\ \min_{f \in \mathcal{H}_{K_\sigma}} \frac{1}{m}\sum_{i=1}^m \phi\big(y_i f(x_i)\big) + \lambda \|f\|^2_{K_\sigma}.$$

By using $I_\sigma: L^p(X) \to L^p(X)$, we obtain learning rates for binary classification to learn the Bayes rule:

$$f_c(x) = \begin{cases} 1, & \text{if } \rho(y = 1 \mid x) \ge \rho(y = -1 \mid x), \\ -1, & \text{if } \rho(y = 1 \mid x) < \rho(y = -1 \mid x). \end{cases}$$

Here $Y = \{1, -1\}$ represents two classes. The misclassification error is defined as $\mathcal{R}(f) := \mathrm{Prob}\{ y \ne f(x) \}$, and $\mathcal{R}(f) \ge \mathcal{R}(f_c)$ for any $f: X \to Y$.
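For experimentation, a practical proxy for this scheme (not the authors' implementation) is a soft-margin SVM with an RBF kernel: SVC minimizes $\frac{1}{2}\|f\|^2_K + C \sum_i \phi(y_i f(x_i))$, so $C$ plays roughly the role of $1/(2\lambda m)$, and searching over $\gamma = 1/(2\sigma^2)$ mimics letting the variance be flexible. In the sketch below the variance is selected by cross-validation rather than by the regularized objective itself; all names are ours.

```python
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

def fit_flexible_gaussian_svm(X, y, lam, sigmas):
    """Hinge-loss regularization with a grid of Gaussian variances, solved as a
    soft-margin SVM; y takes values in {1, -1}."""
    m = len(y)
    C = 1.0 / (2.0 * lam * m)                       # rough correspondence C ~ 1/(2*lam*m)
    gammas = 1.0 / (2.0 * np.asarray(sigmas) ** 2)  # gamma = 1/(2*sigma^2)
    search = GridSearchCV(SVC(C=C, kernel="rbf"),
                          param_grid={"gamma": gammas}, cv=5)
    search.fit(X, y)
    return search.best_estimator_                   # sign of its decision function approximates f_c
```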

The Sobolev space $H^k_p(X)$ is the completion of $C^\infty(X)$ with respect to the norm

$$\|f\|_{H^k_p(X)} = \sum_{j=0}^{k} \left( \int_X |\nabla^j f|^p\, dV \right)^{1/p},$$

where $\nabla^j f$ denotes the $j$th covariant derivative of $f$.

Theorem 4. If $f_c$ lies in the interpolation space $(L^1(X), H^1_2(X))_\theta$ for some $0 < \theta \le 1$, then by taking $\lambda = \big( \tfrac{(\log m)^2}{m} \big)^{\frac{2\theta + d}{12\theta + 2d}}$, we have

$$E_{z \in Z^m}\Big\{ \mathcal{R}\big(\mathrm{sgn}(f_{z,\lambda})\big) - \mathcal{R}(f_c) \Big\} \le C \left( \frac{(\log m)^2}{m} \right)^{\frac{\theta}{6\theta + d}},$$

where $C$ is a constant independent of $m$.

Ongoing topics: variable selection, dimensionality reduction, graph Laplacian, diffusion maps.
