Shannon Sampling II. Connections to Learning Theory


Shannon Sampling II: Connections to Learning Theory

Steve Smale
Toyota Technological Institute at Chicago
1427 East 60th Street, Chicago, IL 60637, USA
E-mail:

Ding-Xuan Zhou
Department of Mathematics, City University of Hong Kong
Tat Chee Avenue, Kowloon, Hong Kong, CHINA
E-mail:

July 2004

Abstract

We continue our study [12] of Shannon sampling and function reconstruction. In this paper, the error analysis is improved. Then the problem of function reconstruction is extended to a more general setting with frames beyond point evaluation. Then we show how our approach can be applied to learning theory: a functional analysis framework is presented; sharp, dimension independent probability estimates are given not only for the error in the $L^2$ spaces, but also for the error in the reproducing kernel Hilbert space where the learning algorithm is performed. Covering number arguments are replaced by estimates of integral operators.

Keywords and Phrases: Shannon sampling, function reconstruction, learning theory, reproducing kernel Hilbert space, frames.

1 Introduction

This paper considers regularization schemes associated with the least square loss and Hilbert spaces $H$ of continuous functions. Our target is to provide a unified approach for two topics: interpolation theory, or more generally, function reconstruction in Shannon sampling theory, with $H$ being a space of band-limited functions or functions with certain decay; and the regression problem in learning theory, with $H$ being a reproducing kernel Hilbert space $H_K$.

The first author is partially supported by an NSF grant. The second author is supported partially by the Research Grants Council of Hong Kong [Project No. CityU ] and by City University of Hong Kong [Project No. ].

First, we improve the probability estimates in [12] with a simplified development. Then we apply the technique for function reconstruction to learning theory. In particular, we show that a regression function $f_\rho$ can be approximated by a regularization scheme $f_{\mathbf z,\lambda}$ in $H_K$. Dimension independent exponential probability estimates are given for the error $\|f_{\mathbf z,\lambda}-f_\rho\|_K$. Our error bounds provide clues to the asymptotic choice of the regularization parameter $\gamma$ or $\lambda$.

2 Sampling Operator

Let $H$ be a Hilbert space of continuous functions on a complete metric space $X$ such that the inclusion $J: H \to C(X)$ is bounded with $\|J\| < \infty$. Then for each $x \in X$, the point evaluation functional $f \mapsto f(x)$ is bounded on $H$ with norm at most $\|J\|$. Hence there exists an element $E_x \in H$ such that
$$ f(x) = \langle f, E_x \rangle_H, \qquad \forall f \in H. \tag{2.1} $$
Let $\mathbf x$ be a discrete subset of $X$. Define the sampling operator $S_{\mathbf x}: H \to \ell^2(\mathbf x)$ by $S_{\mathbf x} f = (f(x))_{x \in \mathbf x}$. We shall always assume that $S_{\mathbf x}$ is bounded. Denote $S_{\mathbf x}^*$ as the adjoint of $S_{\mathbf x}$. Then for each $c \in \ell^2(\mathbf x)$, there holds
$$ \langle f, S_{\mathbf x}^* c\rangle_H = \langle S_{\mathbf x} f, c\rangle_{\ell^2(\mathbf x)} = \sum_{x\in\mathbf x} c_x f(x) = \Big\langle f, \sum_{x\in\mathbf x} c_x E_x\Big\rangle_H, \qquad \forall f\in H. $$
It follows that
$$ S_{\mathbf x}^* c = \sum_{x\in\mathbf x} c_x E_x, \qquad c\in\ell^2(\mathbf x). $$
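The sampling operator and its adjoint become concrete when $H$ is a reproducing kernel Hilbert space, so that $E_x = K_x$ (the setting of Section 7). The following is a minimal numerical sketch of the duality $\langle S_{\mathbf x}f, c\rangle = \langle f, S_{\mathbf x}^*c\rangle_H$; the Gaussian kernel, the sample points, and the test function are assumptions for illustration, not part of the paper.

```python
import numpy as np

# Assumed setting: H is the RKHS of a Gaussian kernel on X = [0, 1],
# so the representer of point evaluation is E_x = K_x = K(x, .).
def K(s, t, width=0.2):
    return np.exp(-(s - t) ** 2 / (2 * width ** 2))

x = np.linspace(0.0, 1.0, 9)          # the discrete sampling set x

# S_x acts on f in H by S_x f = (f(x))_{x in x}.
def sample(f):
    return np.array([f(t) for t in x])

# S_x* acts on c in l^2(x) by S_x* c = sum_x c_x E_x, a function on X.
def adjoint(c):
    return lambda t: sum(ci * K(xi, t) for ci, xi in zip(c, x))

# The duality <S_x f, c> = <f, S_x* c>_H is easy to verify when f itself
# is a finite kernel expansion f = sum_j a_j K_{u_j}, since then
# <f, S_x* c>_H = sum_{j, x} a_j c_x K(u_j, x) by the reproducing property.
u = np.array([0.15, 0.5, 0.85])
a = np.array([1.0, -0.5, 2.0])
f = lambda t: sum(aj * K(uj, t) for aj, uj in zip(a, u))

c = np.array([0.3] * len(x))
lhs = sample(f) @ c                                   # <S_x f, c>_{l^2(x)}
rhs = sum(a[j] * c[i] * K(u[j], x[i])
          for j in range(len(u)) for i in range(len(x)))  # <f, S_x* c>_H
print(abs(lhs - rhs))                                 # ~ 0 up to rounding
```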

3 Algorithm

To allow noise, we make the following assumption.

Special Assumption. The sampled values $y=(y_x)_{x\in\mathbf x}$ have the form: for some $f^*\in H$ and each $x\in\mathbf x$,
$$ y_x = f^*(x) + \eta_x, \qquad \text{where } \eta_x \text{ is drawn from } \rho_x. \tag{3.1} $$
Here for each $x\in X$, $\rho_x$ is a probability measure with zero mean, and its variance $\sigma_x^2$ satisfies $\sigma^2 := \sum_{x\in\mathbf x}\sigma_x^2 < \infty$.

Note that $\sum_{x\in\mathbf x}|f^*(x)|^2 = \|S_{\mathbf x} f^*\|^2_{\ell^2(\mathbf x)} \le \|S_{\mathbf x}\|^2\|f^*\|^2_H < \infty$.

The Markov inequality for a nonnegative random variable $\xi$ asserts that
$$ \operatorname{Prob}\Big\{\xi \ge \frac{E\xi}{\delta}\Big\} \le \delta, \qquad 0<\delta<1. \tag{3.2} $$
It tells us that for every $\varepsilon>0$,
$$ \operatorname{Prob}\big\{\|(\eta_x)\|_{\ell^2(\mathbf x)} > \varepsilon\big\} \le \frac{E\|(\eta_x)\|^2_{\ell^2(\mathbf x)}}{\varepsilon^2} = \frac{\sigma^2}{\varepsilon^2}. $$
By taking $\varepsilon\to\infty$, we see that $(\eta_x)\in\ell^2(\mathbf x)$, and hence $y\in\ell^2(\mathbf x)$, in probability.

Let $\gamma\ge0$. With the sample $\mathbf z := (x, y_x)_{x\in\mathbf x}$, consider the algorithm
$$ \text{(Function reconstruction)}\qquad f_{\mathbf z,\gamma} := \arg\min_{f\in H}\Big\{\sum_{x\in\mathbf x}\big(f(x)-y_x\big)^2 + \gamma\|f\|^2_H\Big\}. \tag{3.3} $$

Theorem 1. If $S_{\mathbf x}^* S_{\mathbf x}+\gamma I$ is invertible, then $f_{\mathbf z,\gamma}$ exists, is unique, and
$$ f_{\mathbf z,\gamma} = Ly, \qquad L := \big(S_{\mathbf x}^* S_{\mathbf x}+\gamma I\big)^{-1} S_{\mathbf x}^*. $$

Proof. Denote $E_{\mathbf z}(f) := \sum_{x\in\mathbf x}(f(x)-y_x)^2$. Since $\sum_{x\in\mathbf x}(f(x))^2 = \|S_{\mathbf x} f\|^2_{\ell^2(\mathbf x)} = \langle S_{\mathbf x}^*S_{\mathbf x} f, f\rangle_H$, we know that for $f\in H$,
$$ E_{\mathbf z}(f) + \gamma\|f\|^2_H = \big\langle(S_{\mathbf x}^*S_{\mathbf x}+\gamma I)f,\, f\big\rangle_H - 2\langle S_{\mathbf x}^* y, f\rangle_H + \|y\|^2_{\ell^2(\mathbf x)}. $$
Taking the functional derivative [10] for $f\in H$, we see that any minimizer $f_{\mathbf z,\gamma}$ of (3.3) satisfies $(S_{\mathbf x}^*S_{\mathbf x}+\gamma I)f_{\mathbf z,\gamma} = S_{\mathbf x}^* y$. This proves Theorem 1.

The invertibility of the operator $S_{\mathbf x}^*S_{\mathbf x}+\gamma I$ is valid for rich data.
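Theorem 1 reduces the variational problem (3.3) to a linear system: writing $f_{\mathbf z,\gamma}=\sum_{x\in\mathbf x}c_xE_x$ and using $\langle E_x,E_{x'}\rangle_H=E_{x'}(x)$, the normal equation $(S_{\mathbf x}^*S_{\mathbf x}+\gamma I)f_{\mathbf z,\gamma}=S_{\mathbf x}^*y$ becomes $(G+\gamma I)c=y$ with Gram matrix $G=(E_{x'}(x))_{x,x'}$. Below is a hedged sketch under the assumption that $H$ is a Gaussian-kernel RKHS (so $E_x=K_x$); the target function, noise level, and $\gamma$ are illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(0)

def K(s, t, width=0.2):
    return np.exp(-(s - t) ** 2 / (2 * width ** 2))

# Samples y_x = f*(x) + eta_x at a discrete set x (Special Assumption).
x = np.linspace(0.0, 1.0, 25)
f_star = lambda t: np.sin(2 * np.pi * t)
y = f_star(x) + 0.1 * rng.standard_normal(x.size)

gamma = 1e-2
G = K(x[:, None], x[None, :])            # Gram matrix <E_x, E_x'>_H = K(x, x')
c = np.linalg.solve(G + gamma * np.eye(x.size), y)

# The reconstruction f_{z,gamma}(t) = sum_x c_x K(x, t), per Theorem 1.
f_rec = lambda t: K(x, t) @ c

t = np.linspace(0.0, 1.0, 200)
err = np.max(np.abs(np.array([f_rec(ti) for ti in t]) - f_star(t)))
print("sup norm error on the grid:", err)
```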

Definition 1. We say that $\mathbf x$ provides rich data with respect to $H$ if
$$ \lambda_{\mathbf x} := \inf_{f\in H}\|S_{\mathbf x} f\|^2_{\ell^2(\mathbf x)}\big/\|f\|^2_H \tag{3.4} $$
is positive. It provides poor data if $\lambda_{\mathbf x}=0$.

The problem of function reconstruction here is to estimate the error $\|f_{\mathbf z,\gamma}-f^*\|_H$. In this paper we shall show in Corollary 2 below that in the rich data case, with $\gamma=0$, for every $0<\delta<1$, with probability $1-\delta$, there holds
$$ \|f_{\mathbf z,\gamma}-f^*\|_H \le \frac{\|J\|\sqrt{\sigma^2/\delta}}{\lambda_{\mathbf x}}. \tag{3.5} $$
This estimate does not require the boundedness of the noise $\rho_x$. Moreover, under the stronger condition (see [12]) that $|\eta_x|\le M$ for each $x\in\mathbf x$, we shall use the McDiarmid inequality and prove in Theorem 5 below that for every $0<\delta<1$, with probability $1-\delta$,
$$ \|f_{\mathbf z,\gamma}-f^*\|_H \le \frac{\|J\|}{\lambda_{\mathbf x}}\Big(\sigma + \sqrt{8\sigma^2\log\tfrac1\delta} + \tfrac{4M}{3}\log\tfrac1\delta\Big). \tag{3.6} $$
The two estimates, (3.5) and (3.6), improve the bounds in [12]. It turns out that Theorem 4 in [12] is a consequence of the remark which follows it about the Markov inequality. Conversations with David McAllester were important to clarify this point.
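Definition 1 can be computed exactly when $H$ is finite dimensional: if $\{\varphi_1,\dots,\varphi_n\}$ is an orthonormal basis of $H$ and $M$ is the evaluation matrix $M_{x,j}=\varphi_j(x)$, then $\|S_{\mathbf x}f\|^2/\|f\|^2_H$ equals $\|Ma\|^2/\|a\|^2$ for $f=\sum_j a_j\varphi_j$, so $\lambda_{\mathbf x}$ is the smallest eigenvalue of $M^\top M$. A small sketch follows; the trigonometric space and the sample sets are assumptions chosen only to contrast rich and poor data.

```python
import numpy as np

# Assumed H: real trigonometric polynomials of degree <= N on [0, 2*pi),
# with the orthonormal-basis coefficients carrying the H norm.
N = 3

def eval_matrix(points):
    # Columns: 1, cos(kt), sin(kt) for k = 1..N, evaluated at the sample points.
    cols = [np.ones_like(points)]
    for k in range(1, N + 1):
        cols.append(np.cos(k * points))
        cols.append(np.sin(k * points))
    return np.stack(cols, axis=1)

def richness(points):
    M = eval_matrix(points)
    return np.linalg.eigvalsh(M.T @ M)[0]   # lambda_x = smallest eigenvalue

rich = np.linspace(0, 2 * np.pi, 20, endpoint=False)   # more points than dim(H)
poor = np.linspace(0, 2 * np.pi, 4, endpoint=False)    # fewer points than dim(H)
print("rich data lambda_x:", richness(rich))            # strictly positive
print("poor data lambda_x:", richness(poor))            # ~ 0 (rank deficient)

# Proposition 1 in this finite-dimensional picture: the operator norm of
# (S_x* S_x + gamma I)^{-1} is at most 1 / (lambda_x + gamma); here it is
# in fact an equality.
gamma = 0.1
M = eval_matrix(rich)
inv_norm = np.linalg.norm(np.linalg.inv(M.T @ M + gamma * np.eye(2 * N + 1)), 2)
print(inv_norm, "<=", 1.0 / (richness(rich) + gamma))
```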

4 Sample Error

Define
$$ f_{\mathbf x,\gamma} := L S_{\mathbf x} f^*. \tag{4.1} $$
The sample error takes the form $\|f_{\mathbf z,\gamma}-f_{\mathbf x,\gamma}\|_H$.

Theorem 2. If $S_{\mathbf x}^*S_{\mathbf x}+\gamma I$ is invertible and the Special Assumption holds, then for every $0<\delta<1$, with probability $1-\delta$, there holds
$$ \|f_{\mathbf z,\gamma}-f_{\mathbf x,\gamma}\|_H \le \big\|(S_{\mathbf x}^*S_{\mathbf x}+\gamma I)^{-1}\big\|\,\|J\|\sqrt{\frac{\sigma^2}{\delta}}. $$
If $|\eta_x|\le M$ for some $M\ge0$ and each $x\in\mathbf x$, then for every $\varepsilon>0$, we have
$$ \operatorname{Prob}_y\Big\{\|f_{\mathbf z,\gamma}-f_{\mathbf x,\gamma}\|_H \le \|L\|\,\sigma\sqrt{1+\varepsilon}\Big\} \ge 1-\exp\Big\{-\frac{\varepsilon\sigma^2}{2M^2}\log(1+\varepsilon)\Big\}. $$

Proof. Write $\|f_{\mathbf z,\gamma}-f_{\mathbf x,\gamma}\|_H = \|L(y - S_{\mathbf x}f^*)\|_H \le \|(S_{\mathbf x}^*S_{\mathbf x}+\gamma I)^{-1}\|\,\|S_{\mathbf x}^* y - S_{\mathbf x}^*S_{\mathbf x} f^*\|_H$. But
$$ S_{\mathbf x}^* y - S_{\mathbf x}^*S_{\mathbf x} f^* = \sum_{x\in\mathbf x}\big(y_x - f^*(x)\big)E_x. $$
Hence
$$ \|S_{\mathbf x}^* y - S_{\mathbf x}^*S_{\mathbf x} f^*\|^2_H = \sum_{x\in\mathbf x}\sum_{x'\in\mathbf x}\big(y_x-f^*(x)\big)\big(y_{x'}-f^*(x')\big)\langle E_x, E_{x'}\rangle_H. $$
By the independence of the samples and $E\{y_x-f^*(x)\}=0$, $E\{(y_x-f^*(x))^2\}=\sigma_x^2$, its expected value is
$$ E\|S_{\mathbf x}^* y - S_{\mathbf x}^*S_{\mathbf x} f^*\|^2_H = \sum_{x\in\mathbf x}\sigma_x^2\langle E_x, E_x\rangle_H. $$
Now $\langle E_x,E_x\rangle_H = \|E_x\|^2_H \le \|J\|^2$. Then the expected value of the squared sample error can be bounded as
$$ E\|f_{\mathbf z,\gamma}-f_{\mathbf x,\gamma}\|^2_H \le \big\|(S_{\mathbf x}^*S_{\mathbf x}+\gamma I)^{-1}\big\|^2\|J\|^2\sigma^2. $$
The first desired probability estimate follows from the Markov inequality (3.2).

For the second estimate, we apply Theorem 3 from [12] with $w\equiv1$ to the random variables $\{\eta_x\}_{x\in\mathbf x}$. The Special Assumption tells us that $E\eta_x=0$, which implies $E\eta_x^2=\sigma_x^2$. Then we see that for every $\varepsilon>0$,
$$ \operatorname{Prob}_y\Big\{\sum_{x\in\mathbf x}\big(\eta_x^2-\sigma_x^2\big) > \varepsilon\Big\} \le \exp\Big\{-\frac{\varepsilon}{2M^2}\log\Big(1+\frac{M^2\varepsilon}{\sum_{x\in\mathbf x}\sigma^2(\eta_x^2)}\Big)\Big\}. $$
Here we have used the condition $|\eta_x|\le M$, which implies $|\eta_x^2-\sigma_x^2|\le M^2$ and $\sum_{x\in\mathbf x}\sigma^2(\eta_x^2)\le\sum_{x\in\mathbf x}E\eta_x^4\le M^2\sigma^2<\infty$. The desired bound then follows from $\|f_{\mathbf z,\gamma}-f_{\mathbf x,\gamma}\|_H\le\|L\|\,\|(\eta_x)\|_{\ell^2(\mathbf x)}$.

Remark. When $\mathbf x$ contains $m$ elements, we can take $\sigma^2\le mM^2<\infty$.

Proposition 1. The sampling operator $S_{\mathbf x}$ satisfies
$$ \big\|(S_{\mathbf x}^*S_{\mathbf x}+\gamma I)^{-1}\big\| \le \frac{1}{\lambda_{\mathbf x}+\gamma}. $$
For the operator $L$, we have
$$ \|L\| \le \frac{\|S_{\mathbf x}\|}{\lambda_{\mathbf x}+\gamma}. $$

Proof. Let $v\in H$ and $u=(S_{\mathbf x}^*S_{\mathbf x}+\gamma I)^{-1}v$. Then $(S_{\mathbf x}^*S_{\mathbf x}+\gamma I)u=v$. Taking inner products on both sides with $u$, we have
$$ \langle S_{\mathbf x}u, S_{\mathbf x}u\rangle_{\ell^2(\mathbf x)} + \gamma\|u\|^2_H = \langle v,u\rangle_H \le \|v\|_H\|u\|_H. $$
The definition of the richness $\lambda_{\mathbf x}$ tells us that $\langle S_{\mathbf x}u,S_{\mathbf x}u\rangle_{\ell^2(\mathbf x)} = \|S_{\mathbf x}u\|^2_{\ell^2(\mathbf x)} \ge \lambda_{\mathbf x}\|u\|^2_H$. It follows that
$$ (\lambda_{\mathbf x}+\gamma)\|u\|^2_H \le \|v\|_H\|u\|_H. $$
Hence $\|u\|_H \le (\lambda_{\mathbf x}+\gamma)^{-1}\|v\|_H$. This is true for every $v\in H$. Then the bound for the first operator follows. The second inequality is trivial.

Corollary 1. If $S_{\mathbf x}^*S_{\mathbf x}+\gamma I$ is invertible and the Special Assumption holds, then for every $0<\delta<1$, with probability $1-\delta$, there holds
$$ \|f_{\mathbf z,\gamma}-f_{\mathbf x,\gamma}\|_H \le \frac{\|J\|}{\lambda_{\mathbf x}+\gamma}\sqrt{\frac{\sigma^2}{\delta}}. $$

5 Integration Error

Recall that $f_{\mathbf x,\gamma} = LS_{\mathbf x}f^* = (S_{\mathbf x}^*S_{\mathbf x}+\gamma I)^{-1}S_{\mathbf x}^*S_{\mathbf x}f^*$. Then
$$ f_{\mathbf x,\gamma} = (S_{\mathbf x}^*S_{\mathbf x}+\gamma I)^{-1}\big((S_{\mathbf x}^*S_{\mathbf x}+\gamma I)-\gamma I\big)f^* = f^* - \gamma(S_{\mathbf x}^*S_{\mathbf x}+\gamma I)^{-1}f^*. \tag{5.1} $$
This in connection with Proposition 1 proves the following proposition.

Proposition 2. If $S_{\mathbf x}^*S_{\mathbf x}+\gamma I$ is invertible, then
$$ \|f_{\mathbf x,\gamma}-f^*\|_H \le \frac{\gamma\|f^*\|_H}{\lambda_{\mathbf x}+\gamma}. $$

Corollary 2. If $\lambda_{\mathbf x}>0$ and $\gamma=0$, then for every $0<\delta<1$, with probability $1-\delta$, (3.5) holds:
$$ \|f_{\mathbf z,\gamma}-f^*\|_H \le \frac{\|J\|\sqrt{\sigma^2/\delta}}{\lambda_{\mathbf x}}. $$

For the poor data case $\lambda_{\mathbf x}=0$, we need to estimate $\gamma\|(S_{\mathbf x}^*S_{\mathbf x}+\gamma I)^{-1}f^*\|_H$ according to (5.1). Recall that for a positive self-adjoint linear operator $L$ on a Hilbert space $H$, there holds
$$ \big\|\gamma(L+\gamma I)^{-1}f\big\|_H = \big\|\gamma(L+\gamma I)^{-1}(f-Lg)+\gamma(L+\gamma I)^{-1}Lg\big\|_H \le \|f-Lg\|_H + \gamma\|g\|_H $$
for every $g\in H$. Taking the infimum over $g\in H$, we have
$$ \big\|\gamma(L+\gamma I)^{-1}f\big\|_H \le K(f,\gamma) := \inf_{g\in H}\big\{\|f-Lg\|_H+\gamma\|g\|_H\big\}, \qquad f\in H,\ \gamma>0. \tag{5.2} $$
This is the K-functional between $H$ and the range of $L$. Thus, when the range of $L$ is dense in $H$, we have $\lim_{\gamma\to0}\|\gamma(L+\gamma I)^{-1}f\|_H=0$ for every $f\in H$. If $f$ is in the range of $L^r$ for some $0<r\le1$, then
$$ \big\|\gamma(L+\gamma I)^{-1}f\big\|_H \le \|L^{-r}f\|_H\,\gamma^r. $$
See [11].

Using (5.2) for $L=S_{\mathbf x}^*S_{\mathbf x}$, we can use a K-functional between $H$ and the range of $S_{\mathbf x}^*S_{\mathbf x}$ to get the convergence rate.

Proposition 3. Define $f_\gamma$ as
$$ f_\gamma := \arg\inf_{g\in H}\big\{\|f^*-S_{\mathbf x}^*S_{\mathbf x}g\|_H+\gamma\|g\|_H\big\}, \qquad \gamma>0; $$
then there holds
$$ \|f_{\mathbf x,\gamma}-f^*\|_H \le \|f^*-S_{\mathbf x}^*S_{\mathbf x}f_\gamma\|_H + \gamma\|f_\gamma\|_H. $$
In particular, if $f^*$ lies in the closure of the range of $S_{\mathbf x}^*S_{\mathbf x}$, then $\lim_{\gamma\to0}\|f_{\mathbf x,\gamma}-f^*\|_H=0$. If $f^*$ is in the range of $(S_{\mathbf x}^*S_{\mathbf x})^r$ for some $0<r\le1$, then
$$ \|f_{\mathbf x,\gamma}-f^*\|_H \le \big\|(S_{\mathbf x}^*S_{\mathbf x})^{-r}f^*\big\|_H\,\gamma^r. $$
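The bound $\|\gamma(L+\gamma I)^{-1}f\|_H\le\|L^{-r}f\|_H\,\gamma^r$ behind Proposition 3 is easy to check numerically when a finite positive matrix plays the role of $L$. The matrix, the exponent $r$, and the vector $g$ below are assumptions chosen for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)

# A positive self-adjoint "operator" L on a finite-dimensional H.
A = rng.standard_normal((8, 8))
L = A @ A.T + 1e-3 * np.eye(8)

r = 0.5
g = rng.standard_normal(8)
# Put f in the range of L^r:  f = L^r g, so ||L^{-r} f|| = ||g||.
w, V = np.linalg.eigh(L)
L_r = V @ np.diag(w ** r) @ V.T
f = L_r @ g

for gamma in [1e-1, 1e-2, 1e-3, 1e-4]:
    lhs = np.linalg.norm(gamma * np.linalg.solve(L + gamma * np.eye(8), f))
    rhs = np.linalg.norm(g) * gamma ** r
    print(f"gamma={gamma:.0e}  gamma*||(L+gamma I)^-1 f|| = {lhs:.3e}  bound = {rhs:.3e}")
```

In the eigenbasis of $L$ the left-hand side has coefficients $\gamma\lambda_i^r g_i/(\lambda_i+\gamma)$, each of modulus at most $\gamma^r|g_i|$, which is exactly the decay the loop exhibits as $\gamma\to0$.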

Compared with Corollary 2, Proposition 3 in connection with Corollary 1 gives an error estimate for the poor data case when $f^*$ is in the range of $(S_{\mathbf x}^*S_{\mathbf x})^r$: for every $0<\delta<1$, with probability $1-\delta$, there holds
$$ \|f_{\mathbf z,\gamma}-f^*\|_H \le \frac{\|J\|}{\gamma}\sqrt{\frac{\sigma^2}{\delta}} + \big\|(S_{\mathbf x}^*S_{\mathbf x})^{-r}f^*\big\|_H\,\gamma^r. $$

6 More General Setting of Function Reconstruction

From (2.1) we see that the boundedness of $S_{\mathbf x}$ is equivalent to the Bessel sequence property of the family $\{E_x\}_{x\in\mathbf x}$ of elements in $H$, i.e., there is a positive constant $B$ such that
$$ \sum_{x\in\mathbf x}\big|\langle f,E_x\rangle_H\big|^2 \le B\|f\|^2_H, \qquad \forall f\in H. \tag{6.1} $$
Moreover, $\mathbf x$ provides rich data if and only if this family forms a frame of $H$, i.e., there are two positive constants $A\le B$ (called frame bounds) such that
$$ A\|f\|^2_H \le \sum_{x\in\mathbf x}\big|\langle f,E_x\rangle_H\big|^2 \le B\|f\|^2_H, \qquad \forall f\in H. $$
In this case, the operator $S_{\mathbf x}^*S_{\mathbf x}$ is called the frame operator. Its inverse is usually difficult to compute, but it satisfies the reconstruction property:
$$ f = \sum_{x\in\mathbf x}\big\langle f,(S_{\mathbf x}^*S_{\mathbf x})^{-1}E_x\big\rangle_H\,E_x, \qquad \forall f\in H. $$
For these basic facts about frames, see [17].

The function reconstruction algorithm studied in the previous sections can be generalized to a setting with a Bessel sequence $\{E_x\}_{x\in\mathbf x}$ in $H$ satisfying (6.1). Here the point evaluation (2.1) is replaced by the functional $\langle f,E_x\rangle_H$ and the algorithm becomes
$$ f_{\mathbf z,\gamma} := \arg\min_{f\in H}\Big\{\sum_{x\in\mathbf x}\big(\langle f,E_x\rangle_H-y_x\big)^2 + \gamma\|f\|^2_H\Big\}. \tag{6.2} $$
The sample values are given by $y_x=\langle f^*,E_x\rangle_H+\eta_x$. If we replace the sampling operator $S_{\mathbf x}$ by the operator from $H$ to $\ell^2(\mathbf x)$ mapping $f$ to $(\langle f,E_x\rangle_H)_{x\in\mathbf x}$, then the algorithm can be analyzed in the same way as above and all the error bounds hold true.

Concrete examples for this generalized setting can be found in the literature of image processing, inverse problems [6] and sampling theory [1]: the Fredholm integral equation of the first kind, the moment problem, and the function reconstruction from weighted averages.

One can even consider more general function reconstruction schemes: replacing the least-square loss in (6.2) by some other loss function and $\|\cdot\|_H$ by some other norm. For example, if we choose Vapnik's $\epsilon$-insensitive loss $|t|_\epsilon := \max\{|t|-\epsilon,\,0\}$, and a function space $H_2$ included in $H$ such as a Sobolev space in $L^2$, then a function reconstruction scheme becomes
$$ f_{\mathbf z,\gamma} := \arg\min_{f\in H_2}\Big\{\sum_{x\in\mathbf x}\big|\langle f,E_x\rangle_H-y_x\big|_\epsilon + \gamma\|f\|^2_{H_2}\Big\}. $$

The rich data requirement is reasonable for function reconstruction such as sampling theory [13]. On the other hand, in learning theory, the situation of poor data (or poor frame bounds: $A\to0$ as the number of points in $\mathbf x$ increases) often happens. For such situations, we take $\mathbf x$ to be random samples of some probability distribution.

7 Learning Theory

From now on we assume that $X$ is compact. Let $\rho$ be a probability measure on $Z:=X\times Y$ with $Y:=\mathbb R$. The error for a function $f:X\to Y$ is given by
$$ \mathcal E(f) := \int_Z\big(f(x)-y\big)^2\,d\rho. $$
The function minimizing the error is called the regression function and is given by
$$ f_\rho(x) := \int_Y y\,d\rho(y|x), \qquad x\in X. $$
Here $\rho(y|x)$ is the conditional distribution at $x$ induced by $\rho$. The marginal distribution on $X$ is denoted as $\rho_X$. We assume that $f_\rho\in L^2_{\rho_X}$. Denote $\|f\|_\rho:=\|f\|_{L^2_{\rho_X}}$ and $\sigma^2_\rho$ as the variance of $\rho$.

The purpose of the regression problem in learning theory [3, 7, 9, 14, 15] is to find good approximations of the regression function from a set of random samples $\mathbf z:=\{(x_i,y_i)\}_{i=1}^m$ drawn independently according to $\rho$. This purpose is achieved in Corollaries 3, 4, and 5 below.

Here we consider kernel based learning algorithms. Let $K:X\times X\to\mathbb R$ be continuous, symmetric and positive semidefinite, i.e., for any finite set of distinct points $\{x_1,\dots,x_l\}\subset X$, the matrix $\big(K(x_i,x_j)\big)_{i,j=1}^l$ is positive semidefinite. Such a kernel is called a Mercer kernel.
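A quick way to see the positive semidefiniteness required of a Mercer kernel is to form the kernel matrix on an arbitrary finite point set and inspect its spectrum. The Gaussian kernel used below is a standard example of a Mercer kernel; the point set and width are arbitrary choices.

```python
import numpy as np

rng = np.random.default_rng(2)

def gaussian_kernel(X, width=0.5):
    # K(x, t) = exp(-|x - t|^2 / (2 width^2)) on R^d, a Mercer kernel.
    d2 = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
    return np.exp(-d2 / (2 * width ** 2))

X = rng.uniform(size=(30, 2))               # 30 arbitrary points in [0, 1]^2
Kmat = gaussian_kernel(X)
eigs = np.linalg.eigvalsh(Kmat)
print("smallest eigenvalue:", eigs.min())    # >= 0 up to rounding error
print("kappa = sup sqrt(K(x, x)):", np.sqrt(np.max(np.diag(Kmat))))  # = 1 here
```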

The Reproducing Kernel Hilbert Space (RKHS) $H_K$ associated with the kernel $K$ is defined to be the closure [2] of the linear span of the set of functions $\{K_x:=K(x,\cdot):x\in X\}$, with the inner product $\langle\cdot,\cdot\rangle_K$ satisfying $\langle K_x,K_y\rangle_K=K(x,y)$. The reproducing property takes the form
$$ \langle K_x,g\rangle_K = g(x), \qquad \forall x\in X,\ g\in H_K. \tag{7.1} $$
The learning algorithm we study here is a regularized one:
$$ \text{(Learning Scheme)}\qquad f_{\mathbf z,\lambda} := \arg\min_{f\in H_K}\Big\{\frac1m\sum_{i=1}^m\big(f(x_i)-y_i\big)^2 + \lambda\|f\|^2_K\Big\}. \tag{7.2} $$
We shall investigate how $f_{\mathbf z,\lambda}$ approximates $f_\rho$ and how the choice of the regularization parameter $\lambda$ leads to optimal convergence rates. The convergence in $L^2_{\rho_X}$ has been considered in [4, 5, 18]. The purpose of this section is to present a simple functional analysis approach, and to provide convergence rates in the space $H_K$ as well as sharper, dimension independent probability estimates in $L^2_{\rho_X}$.

The reproducing kernel property (7.1) tells us that the minimizer of (7.2) lies in $H_{K,\mathbf z}:=\operatorname{span}\{K_{x_i}\}_{i=1}^m$ (by projection onto this subspace). Thus, the algorithm can be written in the same way as (3.3). To see this, we denote $\mathbf x=\{x_i\}_{i=1}^m$, and for $x\in X$ let $\rho_x$ be the distribution of $y-f_\rho(x)$ with $y$ drawn from $\rho(\cdot|x)$. Then $E_x=K_x$ for $x\in\mathbf x$, the Special Assumption holds, and (3.1) is true except that $f^*\in H$ is replaced by $f^*=f_\rho$. Denote $y=(y_i)_{i=1}^m$. The learning scheme (7.2) becomes
$$ f_{\mathbf z,\lambda} := \arg\min_{f\in H_{K,\mathbf z}}\Big\{\sum_{x\in\mathbf x}\big(f(x)-y_x\big)^2 + \gamma\|f\|^2_K\Big\}, \qquad \gamma=m\lambda. $$
Therefore, Theorem 1 still holds and we have
$$ f_{\mathbf z,\lambda} = \big(S_{\mathbf x}^*S_{\mathbf x}+m\lambda I\big)^{-1}S_{\mathbf x}^*y. $$
This implies the expression (see, e.g., [3]) that $f_{\mathbf z,\lambda}=\sum_{i=1}^m c_iK_{x_i}$ with $c=(c_i)_{i=1}^m$ satisfying
$$ \Big(\big(K(x_i,x_j)\big)_{i,j=1}^m + m\lambda I\Big)c = y. $$
Denote $\kappa:=\sup_{x\in X}\sqrt{K(x,x)}$ and $f_\rho(\mathbf x):=(f_\rho(x))_{x\in\mathbf x}$. Define
$$ f_{\mathbf x,\lambda} := \big(S_{\mathbf x}^*S_{\mathbf x}+m\lambda I\big)^{-1}S_{\mathbf x}^*f_\rho(\mathbf x). $$
Observe that $S_{\mathbf x}^*:\mathbb R^m\to H_{K,\mathbf z}$ is given by $S_{\mathbf x}^*c=\sum_{i=1}^m c_iK_{x_i}$.
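In matrix form, the learning scheme (7.2) therefore only requires solving $\big((K(x_i,x_j))_{i,j}+m\lambda I\big)c=y$ and setting $f_{\mathbf z,\lambda}=\sum_i c_iK_{x_i}$. The sketch below draws $(x_i,y_i)$ from an assumed model, solves the system, and estimates $\|f_{\mathbf z,\lambda}-f_\rho\|_\rho$ by Monte Carlo; the kernel, the regression function, and $\lambda$ are illustrative assumptions, not prescriptions from the paper.

```python
import numpy as np

rng = np.random.default_rng(3)

def K(s, t, width=0.15):
    return np.exp(-(s - t) ** 2 / (2 * width ** 2))

# Assumed model: rho_X uniform on [0, 1], y = f_rho(x) + bounded noise.
f_rho = lambda t: np.sin(2 * np.pi * t)
m, lam = 200, 1e-3
x = rng.uniform(size=m)
y = f_rho(x) + 0.2 * rng.uniform(-1, 1, size=m)

Kmat = K(x[:, None], x[None, :])
c = np.linalg.solve(Kmat + m * lam * np.eye(m), y)       # (K + m*lambda*I) c = y

f_z = lambda t: K(t[:, None], x[None, :]) @ c             # f_{z,lambda} = sum_i c_i K_{x_i}

# Monte Carlo estimate of the L^2(rho_X) error and the RKHS norm of f_{z,lambda}.
t = rng.uniform(size=20000)
l2_err = np.sqrt(np.mean((f_z(t) - f_rho(t)) ** 2))
rkhs_norm = np.sqrt(c @ Kmat @ c)                         # ||f_{z,lambda}||_K^2 = c^T K c
print("||f_z - f_rho||_rho ~", l2_err, "   ||f_z||_K =", rkhs_norm)
```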

Then $\frac1m S_{\mathbf x}^*S_{\mathbf x}$ satisfies
$$ \frac1m S_{\mathbf x}^*S_{\mathbf x}f = \frac1m\sum_{x\in\mathbf x}f(x)K_x = L_{K,\mathbf x}(S_{\mathbf x}f), \qquad f\in H_{K,\mathbf z}, $$
where $L_{K,\mathbf x}:\ell^2(\mathbf x)\to H_K$ is defined as $L_{K,\mathbf x}c:=\frac1m\sum_{i=1}^m c_iK_{x_i}$. It is a good approximation of the integral operator $L_K:L^2_{\rho_X}\to H_K$ defined by
$$ L_Kf(x) := \int_X K(x,y)f(y)\,d\rho_X(y), \qquad x\in X. $$
The operator $L_K$ can also be defined as a self-adjoint operator on $H_K$ or on $L^2_{\rho_X}$. We shall use the same notion $L_K$ for these operators defined on different domains. As operators on $H_K$, $L_{K,\mathbf x}S_{\mathbf x}$ approximates $L_K$ well. In fact, it was shown in [5] that $E\|L_{K,\mathbf x}S_{\mathbf x}-L_K\|_{H_K\to H_K}\le\kappa^2/\sqrt m$. To get sharper error bounds, we need to get estimates for the operators with domain $L^2_{\rho_X}$.

Lemma 1. Let $\mathbf x\in X^m$ be randomly drawn according to $\rho_X$. Then for any $f\in L^2_{\rho_X}$,
$$ E\big\|L_{K,\mathbf x}f(\mathbf x)-L_Kf\big\|_K = E\Big\|\frac1m\sum_{i=1}^m f(x_i)K_{x_i}-L_Kf\Big\|_K \le \frac{\kappa\|f\|_\rho}{\sqrt m}. $$

Proof. Define $\xi$ to be the $H_K$-valued random variable $\xi:=f(x)K_x$ on $(X,\rho_X)$. Then $\frac1m\sum_{i=1}^m f(x_i)K_{x_i}-L_Kf=\frac1m\sum_{i=1}^m\big(\xi(x_i)-E\xi\big)$. We know that
$$ \Big\{E\Big\|\frac1m\sum_{i=1}^m\big(\xi(x_i)-E\xi\big)\Big\|_K\Big\}^2 \le E\Big\|\frac1m\sum_{i=1}^m\big(\xi(x_i)-E\xi\big)\Big\|^2_K = \frac1m\big(E\|\xi\|^2_K-\|E\xi\|^2_K\big), $$
which is bounded by $\kappa^2\|f\|^2_\rho/m$.

The function $f_{\mathbf x,\lambda}$ may be considered as an approximation of $f_\lambda$, where $f_\lambda$ is defined by
$$ f_\lambda := (L_K+\lambda I)^{-1}L_Kf_\rho. \tag{7.3} $$
In fact, $f_\lambda$ is a minimizer of the optimization problem:
$$ f_\lambda := \arg\min_{f\in H_K}\big\{\|f-f_\rho\|^2_\rho+\lambda\|f\|^2_K\big\} = \arg\min_{f\in H_K}\big\{\mathcal E(f)-\mathcal E(f_\rho)+\lambda\|f\|^2_K\big\}. \tag{7.4} $$
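Lemma 1 controls the $H_K$ distance between the empirical object $\frac1m\sum_i f(x_i)K_{x_i}$ and $L_Kf$. Both are (or can be approximated by) finite kernel expansions, so the $H_K$ norm of their difference reduces to kernel sums. In the sketch below $L_Kf$ is itself approximated by a large independent quadrature sample, and the averaged distance is compared with $\kappa\|f\|_\rho/\sqrt m$; the kernel, the marginal distribution, and $f$ are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(4)

def K(s, t, width=0.3):
    return np.exp(-(s - t) ** 2 / (2 * width ** 2))

f = lambda t: np.cos(3 * t)          # an assumed test function
kappa = 1.0                          # sup_x sqrt(K(x, x)) for this kernel

def expansion_sq_norm(a_pts, a_wts, b_pts, b_wts):
    # ||sum_i a_i K_{p_i} - sum_k b_k K_{q_k}||_K^2 via kernel sums.
    pts = np.concatenate([a_pts, b_pts])
    wts = np.concatenate([a_wts, -b_wts])
    return wts @ K(pts[:, None], pts[None, :]) @ wts

m, N, trials = 50, 2000, 30
# rho_X: uniform on [0, 1]; L_K f is approximated by a large quadrature sample u.
u = rng.uniform(size=N)
dists = []
for _ in range(trials):
    x = rng.uniform(size=m)
    d2 = expansion_sq_norm(x, f(x) / m, u, f(u) / N)
    dists.append(np.sqrt(max(d2, 0.0)))

f_rho_norm = np.sqrt(np.mean(f(u) ** 2))     # ||f||_rho, Monte Carlo
print("average H_K distance :", np.mean(dists))
print("Lemma 1 bound         :", kappa * f_rho_norm / np.sqrt(m))
```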

Theorem 3. Let $\mathbf z$ be randomly drawn according to $\rho$. Then
$$ E_{\mathbf z\in Z^m}\|f_{\mathbf z,\lambda}-f_{\mathbf x,\lambda}\|_K \le \frac{\kappa\sigma_\rho}{\lambda\sqrt m} $$
and
$$ E_{\mathbf x\in X^m}\|f_{\mathbf x,\lambda}-f_\lambda\|_K \le \frac{3\kappa\|f_\rho\|_\rho}{\lambda\sqrt m}. $$

Proof. The same proof as that of Theorem 2 and Proposition 1 shows that
$$ E_y\|f_{\mathbf z,\lambda}-f_{\mathbf x,\lambda}\|_K \le \frac{\kappa\sqrt{\sum_{i=1}^m\sigma^2_{x_i}}}{\lambda_{\mathbf x}+m\lambda}. $$
But $E_{\mathbf x}\sum_{i=1}^m\sigma^2_{x_i}=m\sigma^2_\rho$. Then the first statement follows.

To see the second statement we write $f_{\mathbf x,\lambda}-f_\lambda$ as $(f_{\mathbf x,\lambda}-\bar f_\lambda)+(\bar f_\lambda-f_\lambda)$, where $\bar f_\lambda$ is defined by
$$ \bar f_\lambda := \big(L_{K,\mathbf x}S_{\mathbf x}+\lambda I\big)^{-1}L_Kf_\rho. \tag{7.5} $$
Since
$$ f_{\mathbf x,\lambda}-\bar f_\lambda = \big(L_{K,\mathbf x}S_{\mathbf x}+\lambda I\big)^{-1}\Big(\frac1mS_{\mathbf x}^*f_\rho(\mathbf x)-L_Kf_\rho\Big), \tag{7.6} $$
Lemma 1 tells us that
$$ E\|f_{\mathbf x,\lambda}-\bar f_\lambda\|_K \le \frac1\lambda E\Big\|\frac1mS_{\mathbf x}^*f_\rho(\mathbf x)-L_Kf_\rho\Big\|_K \le \frac{\kappa\|f_\rho\|_\rho}{\lambda\sqrt m}. $$
To estimate $\bar f_\lambda-f_\lambda$, we write $L_Kf_\rho$ as $(L_K+\lambda I)f_\lambda$. Then
$$ \bar f_\lambda-f_\lambda = \big(L_{K,\mathbf x}S_{\mathbf x}+\lambda I\big)^{-1}(L_K+\lambda I)f_\lambda-f_\lambda = \big(L_{K,\mathbf x}S_{\mathbf x}+\lambda I\big)^{-1}\big(L_Kf_\lambda-L_{K,\mathbf x}S_{\mathbf x}f_\lambda\big). $$
Hence
$$ \|\bar f_\lambda-f_\lambda\|_K \le \frac1\lambda\|L_Kf_\lambda-L_{K,\mathbf x}S_{\mathbf x}f_\lambda\|_K. \tag{7.7} $$
Applying Lemma 1 again, we see that
$$ E\|\bar f_\lambda-f_\lambda\|_K \le \frac1\lambda E\|L_Kf_\lambda-L_{K,\mathbf x}S_{\mathbf x}f_\lambda\|_K \le \frac{\kappa\|f_\lambda\|_\rho}{\lambda\sqrt m}. $$
Note that $f_\lambda$ is a minimizer of (7.4). Taking $f=0$ yields $\|f_\lambda-f_\rho\|^2_\rho+\lambda\|f_\lambda\|^2_K\le\|f_\rho\|^2_\rho$. Hence
$$ \|f_\lambda\|_\rho \le 2\|f_\rho\|_\rho \quad\text{and}\quad \|f_\lambda\|_K \le \|f_\rho\|_\rho/\sqrt\lambda. \tag{7.8} $$
Therefore, our second estimate follows.

The last step is to estimate the approximation error $f_\lambda-f_\rho$.

Theorem 4. Define $f_\lambda$ by (7.3). If $L_K^{-r}f_\rho\in L^2_{\rho_X}$ for some $0<r\le1$, then
$$ \|f_\lambda-f_\rho\|_\rho \le \lambda^r\big\|L_K^{-r}f_\rho\big\|_\rho. \tag{7.9} $$
When $\tfrac12<r\le1$, we have
$$ \|f_\lambda-f_\rho\|_K \le \lambda^{r-\frac12}\big\|L_K^{-r}f_\rho\big\|_\rho. \tag{7.10} $$

We follow the same line as we did in [11]. Estimates similar to (7.9) can be found in [3, Theorem 3.1]: for a self-adjoint strictly positive compact operator $A$ on a Hilbert space $H$, there holds for $0<r<s$,
$$ \inf_{b\in H}\big\{\|b-a\|^2+\gamma\|A^{-s}b\|^2\big\} \le \gamma^{r/s}\|A^{-r}a\|^2. \tag{7.11} $$
A mistake was made in [3] when scaling from $s=1$ to general $s>0$: $r$ should be $<1$ in the general situation. A proof of (7.9) was given in [5]. Here we provide a complete proof because the idea is used for verifying (7.10).

Proof of Theorem 4. If $\{(\lambda_i,\psi_i)\}_{i\ge1}$ are the normalized eigenpairs of the integral operator $L_K:L^2_{\rho_X}\to L^2_{\rho_X}$, then $\|\sqrt{\lambda_i}\,\psi_i\|_K=1$ when $\lambda_i>0$. Write $f_\rho=L_K^rg$ for some $g=\sum_{i\ge1}d_i\psi_i$ with $\|(d_i)\|_{\ell^2}=\|g\|_\rho<\infty$. Then $f_\rho=\sum_{i\ge1}\lambda_i^rd_i\psi_i$ and by (7.3),
$$ f_\lambda-f_\rho = (L_K+\lambda I)^{-1}L_Kf_\rho-f_\rho = -\sum_{i\ge1}\frac{\lambda}{\lambda_i+\lambda}\lambda_i^rd_i\psi_i. $$
It follows that
$$ \|f_\lambda-f_\rho\|_\rho = \Big\{\sum_{i\ge1}\Big(\frac{\lambda}{\lambda_i+\lambda}\lambda_i^rd_i\Big)^2\Big\}^{1/2} = \lambda^r\Big\{\sum_{i\ge1}\Big[\Big(\frac{\lambda}{\lambda+\lambda_i}\Big)^{1-r}\Big(\frac{\lambda_i}{\lambda_i+\lambda}\Big)^{r}d_i\Big]^2\Big\}^{1/2}. $$
This is bounded by $\lambda^r\|(d_i)\|_{\ell^2}=\lambda^r\|g\|_\rho=\lambda^r\|L_K^{-r}f_\rho\|_\rho$. Hence (7.9) holds.

When $r>\tfrac12$, we have
$$ \|f_\lambda-f_\rho\|^2_K = \sum_{\lambda_i>0}\frac1{\lambda_i}\Big(\frac{\lambda}{\lambda_i+\lambda}\lambda_i^rd_i\Big)^2 = \lambda^{2r-1}\sum_{i\ge1}\Big(\frac{\lambda}{\lambda+\lambda_i}\Big)^{3-2r}\Big(\frac{\lambda_i}{\lambda_i+\lambda}\Big)^{2r-1}d_i^2. $$
This is again bounded by $\lambda^{2r-1}\|(d_i)\|^2_{\ell^2}=\lambda^{2r-1}\|L_K^{-r}f_\rho\|^2_\rho$. The second statement (7.10) has been verified.

Combining Theorems 3 and 4, we find the expected value of the error $\|f_{\mathbf z,\lambda}-f_\rho\|$. By choosing the optimal parameter in this bound, we get the following convergence rates.

Corollary 3. Let $\mathbf z$ be randomly drawn according to $\rho$. Assume $L_K^{-r}f_\rho\in L^2_{\rho_X}$ for some $\tfrac12<r\le1$. Then
$$ E_{\mathbf z\in Z^m}\|f_{\mathbf z,\lambda}-f_\rho\|_K \le C_{\rho,K}\Big\{\frac1{\lambda\sqrt m}+\lambda^{r-\frac12}\Big\}, \tag{7.12} $$
where $C_{\rho,K}:=\kappa\sigma_\rho+3\kappa\|f_\rho\|_\rho+\|L_K^{-r}f_\rho\|_\rho$ is independent of the dimension. Hence
$$ \lambda = m^{-\frac1{2r+1}} \ \Longrightarrow\ E_{\mathbf z\in Z^m}\|f_{\mathbf z,\lambda}-f_\rho\|_K \le 2C_{\rho,K}\,m^{-\frac{2r-1}{4r+2}}. \tag{7.13} $$

Remark. Corollary 3 provides estimates for the $H_K$-norm error of $f_{\mathbf z,\lambda}-f_\rho$. So we require $f_\rho\in H_K$, which is equivalent to $L_K^{-1/2}f_\rho\in L^2_{\rho_X}$. To get convergence rates we assume a stronger condition $L_K^{-r}f_\rho\in L^2_{\rho_X}$ for some $\tfrac12<r\le1$. The optimal rate derived from Corollary 3 is achieved by $r=1$. In this case, $E_{\mathbf z\in Z^m}\|f_{\mathbf z,\lambda}-f_\rho\|_K=O(m^{-1/6})$. The norm $\|f_{\mathbf z,\lambda}-f_\rho\|_K$ cannot be bounded by the error $\mathcal E(f_{\mathbf z,\lambda})-\mathcal E(f_\rho)$; hence our bound for the $H_K$-norm error is new in learning theory.

Corollary 4. Let $\mathbf z$ be randomly drawn according to $\rho$. If $L_K^{-r}f_\rho\in L^2_{\rho_X}$ for some $0<r\le1$, then
$$ E_{\mathbf z\in Z^m}\|f_{\mathbf z,\lambda}-f_\rho\|_\rho \le C_{\rho,K}\Big\{\frac1{\lambda\sqrt m}+\lambda^r\Big\}, \tag{7.14} $$
where $C_{\rho,K}:=\kappa^2\sigma_\rho+3\kappa^2\|f_\rho\|_\rho+\|L_K^{-r}f_\rho\|_\rho$. Thus,
$$ \lambda = m^{-\frac1{2r+2}} \ \Longrightarrow\ E_{\mathbf z\in Z^m}\|f_{\mathbf z,\lambda}-f_\rho\|_\rho \le 2C_{\rho,K}\,m^{-\frac r{2r+2}}. \tag{7.15} $$

Remark. The convergence rate (7.15) for the $L^2_{\rho_X}$-norm is obtained by optimizing the regularization parameter $\lambda$ in (7.14). The sharp rate derived from Corollary 4 is $m^{-1/4}$, which is achieved by $r=1$.
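The parameter choices in (7.13) and (7.15) simply balance the two terms of (7.12) and (7.14). A small helper, under the assumption that the constant $C_{\rho,K}$ is normalized away, makes the arithmetic explicit.

```python
def balance_K_norm(m, r):
    # (7.12): 1/(lambda*sqrt(m)) + lambda^(r - 1/2), for 1/2 < r <= 1.
    lam = m ** (-1.0 / (2 * r + 1))
    rate = m ** (-(2 * r - 1) / (4 * r + 2))
    return lam, rate

def balance_rho_norm(m, r):
    # (7.14): 1/(lambda*sqrt(m)) + lambda^r, for 0 < r <= 1.
    lam = m ** (-1.0 / (2 * r + 2))
    rate = m ** (-r / (2 * r + 2))
    return lam, rate

for m in [10**3, 10**4, 10**5]:
    lam, rate = balance_K_norm(m, r=1.0)
    print(f"m={m:>6}  lambda={lam:.3e}  H_K-norm rate m^(-1/6)={rate:.3e}")
```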

In [18], a leave-one-out technique was used to derive the expected value of learning schemes. For the scheme (7.2), the result can be expressed as
$$ E_{\mathbf z\in Z^m}\mathcal E(f_{\mathbf z,\lambda}) \le \Big(1+\frac{\kappa^2}{m\lambda}\Big)^2\inf_{f\in H_K}\big\{\mathcal E(f)+\lambda\|f\|^2_K\big\}. \tag{7.16} $$
Notice that $\mathcal E(f)-\mathcal E(f_\rho)=\|f-f_\rho\|^2_\rho$. If we denote the regularization error (see [12]) as
$$ D(\lambda) := \inf_{f\in H_K}\big\{\mathcal E(f)-\mathcal E(f_\rho)+\lambda\|f\|^2_K\big\} = \inf_{f\in H_K}\big\{\|f-f_\rho\|^2_\rho+\lambda\|f\|^2_K\big\}, \tag{7.17} $$
then the bound (7.16) can be restated as
$$ E_{\mathbf z\in Z^m}\|f_{\mathbf z,\lambda/2}-f_\rho\|^2_\rho \le D(\lambda/2) + \big(\mathcal E(f_\rho)+D(\lambda/2)\big)\Big(\frac{4\kappa^2}{m\lambda}+\frac{4\kappa^4}{m^2\lambda^2}\Big). $$
One can then derive the convergence rate in expectation when $f_\rho\in H_K$ and $\mathcal E(f_\rho)>0$. In fact, (7.11) with $H=L^2_{\rho_X}$, $A=L_K$ holds for $r=s=1/2$, which yields the best rate for the regularization error: $D(\lambda)\le\|f_\rho\|^2_K\,\lambda$. One can thus get $E_{\mathbf z\in Z^m}\|f_{\mathbf z,\lambda}-f_\rho\|^2_\rho=O(m^{-1/2})$ by taking $\lambda=m^{-1/2}$. Applying (3.2), one can have the probability estimate
$$ \|f_{\mathbf z,\lambda}-f_\rho\|_\rho \le \frac{C}{\sqrt\delta}\,m^{-1/4} $$
for the confidence $1-\delta$.

In [5], a functional analysis approach was employed for the error analysis of the scheme (7.2). The main result asserts that for any $0<\delta<1$, with confidence $1-\delta$,
$$ \mathcal E(f_{\mathbf z,\lambda})-\mathcal E(f_\lambda) \le \frac{M\kappa}{\sqrt m\,\lambda}\Big(1+\frac{\kappa}{\sqrt\lambda}\Big)\big(1+\log(2/\delta)\big). \tag{7.18} $$
Convergence rates were also derived in [5, Corollary 1] by combining (7.18) with (7.9): when $f_\rho$ lies in the range of $L_K$, for any $0<\delta<1$, with confidence $1-\delta$, there holds
$$ \|f_{\mathbf z,\lambda}-f_\rho\|_\rho \le C\log(2/\delta)\,m^{-1/5}, \qquad \text{if } \lambda = \Big(\frac{\log(2/\delta)}{\sqrt m}\Big)^{2/5}. $$
Thus the confidence is improved from $1/\delta$ to $\log(2/\delta)$, while the rate is weakened to $m^{-1/5}$. In the next section we shall verify the same confidence estimate while the sharp rate is kept. Our approach is short and neat, without involving the leave-one-out technique. Moreover, we can derive convergence rates in the space $H_K$.

8 Probability Estimates by McDiarmid Inequalities

In this section we apply some McDiarmid inequalities to improve the probability estimates derived from expected values by the Markov inequality.

Let $(\Omega,\rho)$ be a probability space. For $t=(t_1,\dots,t_m)\in\Omega^m$ and $t'_i\in\Omega$, we denote $t^i:=(t_1,\dots,t_{i-1},t'_i,t_{i+1},\dots,t_m)$.

Lemma 2. Let $\{t_i,t'_i\}$ be i.i.d. draws of the probability distribution $\rho$ on $\Omega$, and $F:\Omega^m\to\mathbb R$ be a measurable function.

(1) If for each $i$ there is $c_i$ such that $\sup_{t\in\Omega^m,\,t'_i\in\Omega}|F(t)-F(t^i)|\le c_i$, then
$$ \operatorname{Prob}_{t\in\Omega^m}\big\{F(t)-E_tF(t)\ge\varepsilon\big\} \le \exp\Big\{-\frac{2\varepsilon^2}{\sum_{i=1}^mc_i^2}\Big\}, \qquad \varepsilon>0. \tag{8.1} $$

(2) If there is $B\ge0$ such that $\sup_{t\in\Omega^m,\,1\le i\le m}|F(t)-E_{t_i}F(t)|\le B$, then
$$ \operatorname{Prob}_{t\in\Omega^m}\big\{F(t)-E_tF(t)\ge\varepsilon\big\} \le \exp\Big\{-\frac{\varepsilon^2}{2\big(B\varepsilon/3+\sum_{i=1}^m\sigma_i^2(F)\big)}\Big\}, \qquad \varepsilon>0, \tag{8.2} $$
where $\sigma_i^2(F):=\sup_{t\setminus\{t_i\}\in\Omega^{m-1}}E_{t_i}\big\{\big(F(t)-E_{t_i}F(t)\big)^2\big\}$.

The first inequality is the McDiarmid inequality; see [8]. The second inequality is its Bernstein form, which is presented in [16].
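A quick simulation illustrates the first inequality of Lemma 2: for an empirical mean of $m$ independent $[0,1]$-valued variables, each coordinate changes $F$ by at most $c_i=1/m$, so the deviation probability is at most $\exp(-2m\varepsilon^2)$. The Beta distribution below is an arbitrary choice for illustration.

```python
import numpy as np

rng = np.random.default_rng(5)

m, trials, eps = 200, 20000, 0.05
# F(t) = empirical mean of m independent Beta(2, 5) variables in [0, 1];
# changing one coordinate changes F by at most c_i = 1/m, and E F = 2/7.
samples = rng.beta(2, 5, size=(trials, m))
F = samples.mean(axis=1)
deviation_prob = np.mean(F - 2 / 7 >= eps)

mcdiarmid_bound = np.exp(-2 * eps ** 2 / (m * (1 / m) ** 2))   # exp(-2 m eps^2)
print("empirical  Prob{F - EF >= eps} =", deviation_prob)
print("McDiarmid bound                =", mcdiarmid_bound)
```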

First, we show how the probability estimate for function reconstruction stated in Theorem 2 can be improved.

Theorem 5. If $S_{\mathbf x}^*S_{\mathbf x}+\gamma I$ is invertible and the Special Assumption holds, then under the condition that $|y_x-f^*(x)|\le M$ for each $x\in\mathbf x$, we have for every $0<\delta<1$, with probability $1-\delta$,
$$ \|f_{\mathbf z,\gamma}-f_{\mathbf x,\gamma}\|_H \le \big\|(S_{\mathbf x}^*S_{\mathbf x}+\gamma I)^{-1}\big\|\,\|J\|\Big(\sigma+\sqrt{8\sigma^2\log\tfrac1\delta}+\tfrac{4M}3\log\tfrac1\delta\Big) \le \frac{\|J\|}{\lambda_{\mathbf x}+\gamma}\Big(\sigma+\sqrt{8\sigma^2\log\tfrac1\delta}+\tfrac{4M}3\log\tfrac1\delta\Big). $$

Proof. Write $\|f_{\mathbf z,\gamma}-f_{\mathbf x,\gamma}\|_H=\|L(y-S_{\mathbf x}f^*)\|_H\le\|(S_{\mathbf x}^*S_{\mathbf x}+\gamma I)^{-1}\|\,\|S_{\mathbf x}^*y-S_{\mathbf x}^*S_{\mathbf x}f^*\|_H$. Consider the function $F:\ell^2(\mathbf x)\to\mathbb R$ defined by $F(y)=\|S_{\mathbf x}^*y-S_{\mathbf x}^*S_{\mathbf x}f^*\|_H$. Recall from the proof of Theorem 2 that $F(y)=\|\sum_{x\in\mathbf x}(y_x-f^*(x))E_x\|_H$ and
$$ \big(E_yF\big)^2 \le E_yF^2 = \sum_{x\in\mathbf x}\sigma_x^2\langle E_x,E_x\rangle_H \le \|J\|^2\sigma^2. \tag{8.3} $$
Then we can apply the McDiarmid inequality. Let $x_0\in\mathbf x$ and $y'_{x_0}$ be a new sample at $x_0$. We have
$$ \big|F(y)-F(y^{x_0})\big| \le \big\|S_{\mathbf x}^*(y-y^{x_0})\big\|_H. $$
The bound equals $|y_{x_0}-y'_{x_0}|\,\|E_{x_0}\|_H\le|y_{x_0}-y'_{x_0}|\,\|J\|$. Since $|y_x-f^*(x)|\le M$ for each $x\in\mathbf x$, it can be bounded by $2M\|J\|$, which can be taken as $B$ in Lemma 2. Also,
$$ E_{y_{x_0}}\big\{\big(F(y)-E_{y_{x_0}}F(y)\big)^2\big\} \le \iint\big(|y_{x_0}-y'_{x_0}|\,\|J\|\big)^2\,d\rho_{x_0}(y_{x_0})\,d\rho_{x_0}(y'_{x_0}) \le 4\|J\|^2\sigma_{x_0}^2. $$
This yields $\sum_{x_0\in\mathbf x}\sigma_{x_0}^2(F)\le4\|J\|^2\sigma^2$. Thus Lemma 2 tells us that for every $\varepsilon>0$,
$$ \operatorname{Prob}_{y\in Y^{\mathbf x}}\big\{F(y)-E_yF(y)\ge\varepsilon\big\} \le \exp\Big\{-\frac{\varepsilon^2}{2\big(2M\|J\|\varepsilon/3+4\|J\|^2\sigma^2\big)}\Big\}. $$
Solving the quadratic equation
$$ \frac{\varepsilon^2}{2\big(2M\|J\|\varepsilon/3+4\|J\|^2\sigma^2\big)} = \log\frac1\delta $$
gives the probability estimate
$$ F(y) \le E_yF + \|J\|\Big(\sqrt{8\sigma^2\log\tfrac1\delta}+\tfrac{4M}3\log\tfrac1\delta\Big) $$
with confidence $1-\delta$. This in connection with (8.3) proves Theorem 5.

Then we turn to the learning theory estimates. The purpose is to improve the bound in Theorem 3 by applying the McDiarmid inequality. To this end, we refine Lemma 1 to a probability estimate.

Lemma 3. Let $\mathbf x\in X^m$ be randomly drawn according to $\rho_X$. Then for any $f\in L^\infty_{\rho_X}$ and $0<\delta<1$, with confidence $1-\delta$, there holds
$$ \Big\|\frac1m\sum_{i=1}^mf(x_i)K_{x_i}-L_Kf\Big\|_K \le \frac{4\kappa\|f\|_\infty}{3m}\log\frac1\delta + \frac{\kappa\|f\|_\rho}{\sqrt m}\Big(1+\sqrt{8\log\frac1\delta}\Big). $$

Proof. Define a function $F:X^m\to\mathbb R$ as
$$ F(\mathbf x)=F(x_1,\dots,x_m)=\Big\|\frac1m\sum_{i=1}^mf(x_i)K_{x_i}-L_Kf\Big\|_K. $$
For $j\in\{1,\dots,m\}$, we have
$$ \big|F(\mathbf x)-F(\mathbf x^j)\big| \le \frac1m\big\|f(x_j)K_{x_j}-f(x'_j)K_{x'_j}\big\|_K \le \frac\kappa m\big|f(x_j)-f(x'_j)\big|. $$
It follows that $|F(\mathbf x)-E_{x_j}F(\mathbf x)|\le\frac{2\kappa\|f\|_\infty}m=:B$. Moreover,
$$ E_{x_j}\big\{\big(F(\mathbf x)-E_{x_j}F(\mathbf x)\big)^2\big\} \le \frac{\kappa^2}{m^2}\iint\big|f(x_j)-f(x'_j)\big|^2\,d\rho_X(x_j)\,d\rho_X(x'_j) \le \frac{2\kappa^2}{m^2}\iint\big(|f(x_j)|^2+|f(x'_j)|^2\big)\,d\rho_X(x_j)\,d\rho_X(x'_j) = \frac{4\kappa^2\|f\|^2_\rho}{m^2}. $$
Then we have $\sum_{j=1}^m\sigma_j^2(F)\le\frac{4\kappa^2\|f\|^2_\rho}m$. Thus we can apply Lemma 2 to the function $F$ and find that
$$ \operatorname{Prob}_{\mathbf x\in X^m}\big\{F(\mathbf x)-E_{\mathbf x}F(\mathbf x)\ge\varepsilon\big\} \le \exp\Big\{-\frac{\varepsilon^2}{2\big(\frac{2\kappa\|f\|_\infty}{3m}\varepsilon+\frac{4\kappa^2\|f\|^2_\rho}m\big)}\Big\}. $$
Solving a quadratic equation again, we see that with confidence $1-\delta$, we have
$$ F(\mathbf x) \le E_{\mathbf x}F(\mathbf x) + \frac{4\kappa\|f\|_\infty}{3m}\log\frac1\delta + \kappa\|f\|_\rho\sqrt{\frac{8\log\frac1\delta}m}. $$
Lemma 1 says that $E_{\mathbf x}F(\mathbf x)\le\kappa\|f\|_\rho/\sqrt m$. Then our conclusion follows.

Theorem 6. Let $\mathbf z$ be randomly drawn according to $\rho$ satisfying $|y|\le M$ almost everywhere. Then for any $0<\delta<1$, with confidence $1-\delta$ we have
$$ \|f_{\mathbf z,\lambda}-f_\lambda\|_K \le \kappa M\log\frac4\delta\Big(\frac{2\kappa}{m\lambda^{3/2}}+\frac{22}{\sqrt m\,\lambda}\Big). $$

Proof. Since $|y|\le M$ almost everywhere, we know that $\|f_\rho\|_\rho\le\|f_\rho\|_\infty\le M$. Recall the function $\bar f_\lambda$ defined by (7.5). It satisfies (7.6). Hence
$$ \|f_{\mathbf x,\lambda}-\bar f_\lambda\|_K \le \frac1\lambda\Big\|\frac1m\sum_{i=1}^mf_\rho(x_i)K_{x_i}-L_Kf_\rho\Big\|_K. $$

Applying Lemma 3 to the function $f_\rho$, we have with confidence $1-\delta$,
$$ \|f_{\mathbf x,\lambda}-\bar f_\lambda\|_K \le \frac{4\kappa M}{3\lambda m}\log\frac1\delta + \frac{\kappa M}{\lambda\sqrt m}\Big(1+\sqrt{8\log\frac1\delta}\Big). $$
In the same way, by Lemma 3 with the function $f_\lambda$ and (7.7), we find
$$ \operatorname{Prob}_{\mathbf x\in X^m}\Big\{\|\bar f_\lambda-f_\lambda\|_K \le \frac{4\kappa\|f_\lambda\|_\infty}{3\lambda m}\log\frac1\delta + \frac{\kappa\|f_\lambda\|_\rho}{\lambda\sqrt m}\Big(1+\sqrt{8\log\frac1\delta}\Big)\Big\} \ge 1-\delta. $$
By (7.8), we have $\|f_\lambda\|_\rho\le2M$ and $\|f_\lambda\|_\infty\le\kappa\|f_\lambda\|_K\le\kappa M/\sqrt\lambda$. Therefore, with confidence $1-\delta$, there holds
$$ \|\bar f_\lambda-f_\lambda\|_K \le \frac{4\kappa^2M}{3\lambda\sqrt\lambda\,m}\log\frac1\delta + \frac{2\kappa M}{\lambda\sqrt m}\Big(1+\sqrt{8\log\frac1\delta}\Big). $$
Finally, we apply Theorem 5. For each $\mathbf x\in X^m$, there holds with confidence $1-\delta$,
$$ \|f_{\mathbf z,\lambda}-f_{\mathbf x,\lambda}\|_K \le \frac\kappa{m\lambda}\Big(\sigma+\sqrt{8\sigma^2\log\tfrac1\delta}+\tfrac{8M}3\log\tfrac1\delta\Big). \tag{8.4} $$
Here $\sigma^2=\sum_{i=1}^m\sigma_{x_i}^2$. Apply the Bernstein inequality
$$ \operatorname{Prob}_{\mathbf x\in X^m}\Big\{\frac1m\sum_{i=1}^m\xi(x_i)-E\xi\ge\varepsilon\Big\} \le \exp\Big\{-\frac{m\varepsilon^2}{2\big(B\varepsilon/3+\sigma^2(\xi)\big)}\Big\} $$
to the random variable $\xi(x)=\int_Y\big(y-f_\rho(x)\big)^2\,d\rho(y|x)$. It satisfies $0\le\xi\le4M^2$, $E\xi=\sigma_\rho^2$, and $\sigma^2(\xi)\le4M^2\sigma_\rho^2$. Then we see that
$$ \operatorname{Prob}_{\mathbf x\in X^m}\Big\{\frac1m\sum_{i=1}^m\sigma_{x_i}^2 \ge \sigma_\rho^2+\frac{8M^2\log(1/\delta)}{3m}+\Big(\frac{8M^2\sigma_\rho^2\log(1/\delta)}m\Big)^{1/2}\Big\} \le \delta. $$
Hence
$$ \frac\sigma{\sqrt m} \le \Big\{\sigma_\rho^2+\frac{8M^2\log(1/\delta)}{3m}+\Big(\frac{8M^2\sigma_\rho^2\log(1/\delta)}m\Big)^{1/2}\Big\}^{1/2}, $$
which is bounded by $\sigma_\rho+M\sqrt{\frac{3\log(1/\delta)}m}$. Together with (8.4), we see that with probability $1-2\delta$ in $Z^m$, we have the bound
$$ \|f_{\mathbf z,\lambda}-f_{\mathbf x,\lambda}\|_K \le \frac{4\kappa\sigma_\rho}\lambda\sqrt{\frac{\log(1/\delta)}m} + \frac{34\kappa M}{3\lambda m}\log\frac1\delta. $$

Combining the above three bounds, we know that for $0<\delta<1/4$, with confidence $1-4\delta$, $\|f_{\mathbf z,\lambda}-f_\lambda\|_K$ is bounded by
$$ \frac{\kappa M}\lambda\Big\{\frac{13\log(1/\delta)}m+\frac{3\big(1+\sqrt{8\log(1/\delta)}\big)}{\sqrt m}\Big\} + \frac{4\kappa\sigma_\rho}\lambda\sqrt{\frac{\log(1/\delta)}m} + \frac{4\kappa^2M}{3m\lambda^{3/2}}\log\frac1\delta. $$
But $\sigma_\rho\le M$. Replacing $\delta$ by $\delta/4$, our conclusion follows.

We are in a position to state our convergence rates in both the $\|\cdot\|_K$ and $\|\cdot\|_\rho$ norms.

Corollary 5. Let $\mathbf z$ be randomly drawn according to $\rho$ satisfying $|y|\le M$ almost everywhere. If $f_\rho$ is in the range of $L_K$, then for any $0<\delta<1$, with confidence $1-\delta$ we have
$$ \|f_{\mathbf z,\lambda}-f_\rho\|_K \le C\log(4/\delta)\,m^{-1/6}, \qquad \text{by taking } \lambda=\Big(\frac{\log(4/\delta)}m\Big)^{1/3}, \tag{8.5} $$
and
$$ \|f_{\mathbf z,\lambda}-f_\rho\|_\rho \le C\log(4/\delta)\,m^{-1/4}, \qquad \text{by taking } \lambda=\Big(\frac{\log^2(4/\delta)}m\Big)^{1/4}, \tag{8.6} $$
where $C$ is a constant independent of the dimension: $C:=30\kappa M+2\kappa^2M+\|L_K^{-1}f_\rho\|_\rho$.

References

[1] A. Aldroubi and K. Gröchenig, Non-uniform sampling and reconstruction in shift-invariant spaces, SIAM Review.

[2] N. Aronszajn, Theory of reproducing kernels, Trans. Amer. Math. Soc.

[3] F. Cucker and S. Smale, On the mathematical foundations of learning, Bull. Amer. Math. Soc., 1-49.

[4] F. Cucker and S. Smale, Best choices for regularization parameters in learning theory, Found. Comput. Math.

[5] E. De Vito, A. Caponnetto, and L. Rosasco, Model selection for regularized least-squares algorithm in learning theory, Found. Comput. Math., to appear.

[6] H. W. Engl, M. Hanke, and A. Neubauer, Regularization of Inverse Problems, Mathematics and Its Applications 375, Kluwer, Dordrecht, 1996.

[7] T. Evgeniou, M. Pontil, and T. Poggio, Regularization networks and support vector machines, Adv. Comput. Math., 1-50.

[8] C. McDiarmid, Concentration, in: Probabilistic Methods for Algorithmic Discrete Mathematics, Springer-Verlag, Berlin, 1998.

[9] P. Niyogi, The Informational Complexity of Learning, Kluwer, 1998.

[10] T. Poggio and S. Smale, The mathematics of learning: dealing with data, Notices Amer. Math. Soc.

[11] S. Smale and D. X. Zhou, Estimating the approximation error in learning theory, Anal. Appl. 1 (2003).

[12] S. Smale and D. X. Zhou, Shannon sampling and function reconstruction from point values, Bull. Amer. Math. Soc.

[13] M. Unser, Sampling-50 years after Shannon, Proc. IEEE.

[14] V. Vapnik, Statistical Learning Theory, John Wiley & Sons, 1998.

[15] G. Wahba, Spline Models for Observational Data, SIAM, 1990.

[16] Y. Ying, McDiarmid inequalities of Bernstein and Bennett forms, Technical Report, City University of Hong Kong, 2004.

[17] R. M. Young, An Introduction to Non-Harmonic Fourier Series, Academic Press, New York, 1980.

[18] T. Zhang, Leave-one-out bounds for kernel methods, Neural Comp.
