Shannon Sampling II: Connections to Learning Theory


Steve Smale
Toyota Technological Institute at Chicago
1427 East 60th Street, Chicago, IL 60637, USA
E-mail: smale@math.berkeley.edu

Ding-Xuan Zhou
Department of Mathematics, City University of Hong Kong
Tat Chee Avenue, Kowloon, Hong Kong, CHINA
E-mail: mazhou@math.cityu.edu.hk

July 2004

The first author is partially supported by NSF grant 035113. The second author is supported partially by the Research Grants Council of Hong Kong [Project No. CityU 103704] and by City University of Hong Kong [Project No. 700144].

Abstract

We continue our study [12] of Shannon sampling and function reconstruction. In this paper, the error analysis is improved. Then the problem of function reconstruction is extended to a more general setting with frames beyond point evaluation. Then we show how our approach can be applied to learning theory: a functional analysis framework is presented; sharp, dimension independent probability estimates are given not only for the error in the $L^2$ spaces, but also for the error in the reproducing kernel Hilbert space where the learning algorithm is performed. Covering number arguments are replaced by estimates of integral operators.

Keywords and Phrases: Shannon sampling, function reconstruction, learning theory, reproducing kernel Hilbert space, frames.

1. Introduction

This paper considers regularization schemes associated with the least square loss and Hilbert spaces $H$ of continuous functions.

Our target is to provide a unified approach for two topics: interpolation theory, or more generally, function reconstruction in Shannon sampling theory, with $H$ being a space of band-limited functions or functions with certain decay; and the regression problem in learning theory, with $H$ being a reproducing kernel Hilbert space $H_K$.

First, we improve the probability estimates in [12] with a simplified development. Then we apply the technique for function reconstruction to learning theory. In particular, we show that a regression function $f_\rho$ can be approximated by a regularization scheme $f_{z,\lambda}$ in $H_K$. Dimension independent exponential probability estimates are given for the error $\|f_{z,\lambda} - f_\rho\|_K$. Our error bounds provide clues to the asymptotic choice of the regularization parameter $\gamma$ or $\lambda$.

2. Sampling Operator

Let $H$ be a Hilbert space of continuous functions on a complete metric space $X$ such that the inclusion $J: H \to C(X)$ is bounded with $\|J\| < \infty$. Then for each $x \in X$, the point evaluation functional $f \mapsto f(x)$ is bounded on $H$ with norm at most $\|J\|$. Hence there exists an element $E_x \in H$ such that

$$f(x) = \langle f, E_x \rangle_H, \qquad \forall f \in H. \tag{2.1}$$

Let $\mathbf{x}$ be a discrete subset of $X$. Define the sampling operator $S_{\mathbf{x}}: H \to \ell^2(\mathbf{x})$ by

$$S_{\mathbf{x}} f = \big( f(x) \big)_{x \in \mathbf{x}}.$$

We shall always assume that $S_{\mathbf{x}}$ is bounded. Denote $S_{\mathbf{x}}^*$ as the adjoint of $S_{\mathbf{x}}$. Then for each $c \in \ell^2(\mathbf{x})$, there holds

$$\langle f, S_{\mathbf{x}}^* c \rangle_H = \langle S_{\mathbf{x}} f, c \rangle_{\ell^2(\mathbf{x})} = \sum_{x \in \mathbf{x}} c_x f(x) = \Big\langle f, \sum_{x \in \mathbf{x}} c_x E_x \Big\rangle_H, \qquad \forall f \in H.$$

It follows that

$$S_{\mathbf{x}}^* c = \sum_{x \in \mathbf{x}} c_x E_x, \qquad \forall c \in \ell^2(\mathbf{x}).$$
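When $H$ is finite dimensional the operators above become plain matrices. The following minimal Python sketch (not from the paper; the trigonometric basis and all variable names are illustrative assumptions) checks the adjoint identity $\langle S_{\mathbf{x}} f, c \rangle_{\ell^2(\mathbf{x})} = \langle f, S_{\mathbf{x}}^* c \rangle_H$ numerically.

```python
import numpy as np

# Minimal sketch (not from the paper): take H to be the span of an orthonormal
# trigonometric basis phi_1, ..., phi_n on [0, 1], so every f in H is identified
# with its coefficient vector a and <f, g>_H = a . b.  Then the sampling operator
# S_x is the matrix with rows (phi_j(x_i))_j, and its adjoint is the ordinary
# matrix transpose, matching S_x^* c = sum_i c_i E_{x_i}.

rng = np.random.default_rng(0)
n, m = 5, 12                                   # dim(H), number of sample points
x = np.sort(rng.uniform(0, 1, size=m))         # the discrete sample set x

def basis(t):
    """Orthonormal basis of H in L^2[0,1]: 1, sqrt(2)cos(2 pi k t), sqrt(2)sin(2 pi k t)."""
    return np.array([1.0,
                     np.sqrt(2) * np.cos(2 * np.pi * t), np.sqrt(2) * np.sin(2 * np.pi * t),
                     np.sqrt(2) * np.cos(4 * np.pi * t), np.sqrt(2) * np.sin(4 * np.pi * t)])

S = np.array([basis(t) for t in x])            # S_x as an m x n matrix acting on coefficients

a = rng.standard_normal(n)                     # coefficients of some f in H
c = rng.standard_normal(m)                     # some c in l^2(x)

lhs = (S @ a) @ c                              # <S_x f, c>_{l^2(x)}
rhs = a @ (S.T @ c)                            # <f, S_x^* c>_H
print(np.isclose(lhs, rhs))                    # True: adjoint identity holds
```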

3. Algorithm

To allow noise, we make the following assumption.

Special Assumption. The sampled values $y = (y_x)_{x \in \mathbf{x}}$ have the form: for some $f^* \in H$ and each $x \in \mathbf{x}$,

$$y_x = f^*(x) + \eta_x, \quad \text{where } \eta_x \text{ is drawn from } \rho_x. \tag{3.1}$$

Here for each $x \in X$, $\rho_x$ is a probability measure with zero mean, and its variance $\sigma_x^2$ satisfies $\sigma^2 := \sum_{x \in \mathbf{x}} \sigma_x^2 < \infty$.

Note that $\sum_{x \in \mathbf{x}} (f^*(x))^2 = \|S_{\mathbf{x}} f^*\|_{\ell^2(\mathbf{x})}^2 \le \|S_{\mathbf{x}}\|^2 \|f^*\|_H^2 < \infty$.

The Markov inequality for a nonnegative random variable $\xi$ asserts that

$$\mathrm{Prob}\{\xi \ge E\xi / \delta\} \le \delta, \qquad \forall\, 0 < \delta < 1. \tag{3.2}$$

It tells us that for every $\varepsilon > 0$,

$$\mathrm{Prob}\big\{ \|(\eta_x)\|_{\ell^2(\mathbf{x})}^2 > \varepsilon \big\} \le \frac{E \|(\eta_x)\|_{\ell^2(\mathbf{x})}^2}{\varepsilon} = \frac{\sigma^2}{\varepsilon}.$$

By taking $\varepsilon \to \infty$, we see that $(\eta_x) \in \ell^2(\mathbf{x})$ and hence $y \in \ell^2(\mathbf{x})$ in probability.

Let $\gamma \ge 0$. With the sample $z := (x, y_x)_{x \in \mathbf{x}}$, consider the algorithm

$$\text{Function reconstruction:} \quad f_{z,\gamma} := \arg\min_{f \in H} \Big\{ \sum_{x \in \mathbf{x}} \big( f(x) - y_x \big)^2 + \gamma \|f\|_H^2 \Big\}. \tag{3.3}$$

Theorem 1. If $S_{\mathbf{x}}^* S_{\mathbf{x}} + \gamma I$ is invertible, then $f_{z,\gamma}$ exists, is unique, and

$$f_{z,\gamma} = L y, \qquad L := (S_{\mathbf{x}}^* S_{\mathbf{x}} + \gamma I)^{-1} S_{\mathbf{x}}^*.$$

Proof. Denote $\mathcal{E}_z(f) := \sum_{x \in \mathbf{x}} (f(x) - y_x)^2$. Since $\sum_{x \in \mathbf{x}} (f(x))^2 = \|S_{\mathbf{x}} f\|_{\ell^2(\mathbf{x})}^2 = \langle S_{\mathbf{x}}^* S_{\mathbf{x}} f, f \rangle_H$, we know that for $f \in H$,

$$\mathcal{E}_z(f) + \gamma \|f\|_H^2 = \big\langle (S_{\mathbf{x}}^* S_{\mathbf{x}} + \gamma I) f, f \big\rangle_H - 2 \langle S_{\mathbf{x}}^* y, f \rangle_H + \|y\|_{\ell^2(\mathbf{x})}^2.$$

Taking the functional derivative [10] for $f \in H$, we see that any minimizer $f_{z,\gamma}$ of (3.3) satisfies

$$(S_{\mathbf{x}}^* S_{\mathbf{x}} + \gamma I) f_{z,\gamma} = S_{\mathbf{x}}^* y.$$

This proves Theorem 1. The invertibility of the operator $S_{\mathbf{x}}^* S_{\mathbf{x}} + \gamma I$ is valid for rich data.
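In the same finite-dimensional picture as before, Theorem 1 reduces to a ridge-regression solve. A minimal sketch (illustrative assumptions throughout, not the paper's code) compares $(S_{\mathbf{x}}^* S_{\mathbf{x}} + \gamma I)^{-1} S_{\mathbf{x}}^* y$ with a direct least-squares formulation of (3.3).

```python
import numpy as np

# Minimal sketch of Theorem 1 (illustrative, not the paper's code): with H finite
# dimensional and S the m x n sampling matrix in an orthonormal basis of H, the
# regularized reconstruction f_{z,gamma} = (S^* S + gamma I)^{-1} S^* y is just a
# ridge-regression solve; we check it against a direct least-squares formulation.

rng = np.random.default_rng(1)
m, n, gamma = 20, 6, 0.1
S = rng.standard_normal((m, n))                  # sampling matrix (rows: E_{x_i} coordinates)
y = S @ rng.standard_normal(n) + 0.05 * rng.standard_normal(m)   # noisy samples as in (3.1)

f_ridge = np.linalg.solve(S.T @ S + gamma * np.eye(n), S.T @ y)

# Equivalent augmented least-squares problem: min ||S f - y||^2 + gamma ||f||^2
A = np.vstack([S, np.sqrt(gamma) * np.eye(n)])
b = np.concatenate([y, np.zeros(n)])
f_lsq = np.linalg.lstsq(A, b, rcond=None)[0]

print(np.allclose(f_ridge, f_lsq))               # True
```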

Definition 1. We say that $\mathbf{x}$ provides rich data with respect to $H$ if

$$\lambda_{\mathbf{x}} := \inf_{f \in H} \|S_{\mathbf{x}} f\|_{\ell^2(\mathbf{x})}^2 / \|f\|_H^2 \tag{3.4}$$

is positive. It provides poor data if $\lambda_{\mathbf{x}} = 0$.

The problem of function reconstruction here is to estimate the error $\|f_{z,\gamma} - f^*\|_H$. In this paper we shall show in Corollary 2 below that in the rich data case, with $\gamma = 0$, for every $0 < \delta < 1$, with probability $1 - \delta$, there holds

$$\|f_{z,0} - f^*\|_H \le \frac{\|J\|}{\lambda_{\mathbf{x}}} \sqrt{\frac{\sigma^2}{\delta}}. \tag{3.5}$$

This estimate does not require the boundedness of the noise $\rho_x$. Moreover, under the stronger condition (see [12]) that $|\eta_x| \le M$ for each $x \in \mathbf{x}$, we shall use the McDiarmid inequality and prove in Theorem 5 below that for every $0 < \delta < 1$, with probability $1 - \delta$,

$$\|f_{z,0} - f^*\|_H \le \frac{\|J\|}{\lambda_{\mathbf{x}}} \Big( \sqrt{8 \sigma^2 \log \tfrac{1}{\delta}} + \tfrac{4}{3} M \log \tfrac{1}{\delta} \Big). \tag{3.6}$$

The two estimates, (3.5) and (3.6), improve the bounds in [12]. It turns out that Theorem 4 in [12] is a consequence of the remark which follows it about the Markov inequality. Conversations with David McAllester were important to clarify this point.
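In finite dimensions the richness constant of Definition 1 is simply the smallest eigenvalue of $S_{\mathbf{x}}^* S_{\mathbf{x}}$, i.e. the squared smallest singular value of the sampling matrix. The following sketch (illustrative, not from the paper) contrasts a rich and a poor sample set.

```python
import numpy as np

# Illustrative sketch of Definition 1 (not from the paper): in a finite-dimensional
# H represented by an m x n sampling matrix S, the richness constant
# lambda_x = inf ||S f||^2 / ||f||^2 is the smallest eigenvalue of S^* S.

rng = np.random.default_rng(2)
n = 6

for m in (12, 4):                                # more points than dim(H), then fewer
    S = rng.standard_normal((m, n))
    lam_x = np.linalg.eigvalsh(S.T @ S).min()    # = sigma_min(S)^2 (up to round-off)
    print(f"m = {m:2d}: lambda_x = {lam_x:.4f}",
          "(rich data)" if lam_x > 1e-10 else "(poor data)")
```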

4. Sample Error

Define

$$f_{\mathbf{x},\gamma} := L S_{\mathbf{x}} f^*. \tag{4.1}$$

The sample error takes the form $\|f_{z,\gamma} - f_{\mathbf{x},\gamma}\|_H$.

Theorem 2. If $S_{\mathbf{x}}^* S_{\mathbf{x}} + \gamma I$ is invertible and the Special Assumption holds, then for every $0 < \delta < 1$, with probability $1 - \delta$, there holds

$$\|f_{z,\gamma} - f_{\mathbf{x},\gamma}\|_H \le \big\| (S_{\mathbf{x}}^* S_{\mathbf{x}} + \gamma I)^{-1} \big\| \, \|J\| \sqrt{\frac{\sigma^2}{\delta}}.$$

If $|\eta_x| \le M$ for some $M \ge 0$ and each $x \in \mathbf{x}$, then for every $\varepsilon > 0$, we have

$$\mathrm{Prob}_y \Big\{ \|f_{z,\gamma} - f_{\mathbf{x},\gamma}\|_H \le \|L\| \, \sigma \sqrt{1 + \varepsilon} \Big\} \ge 1 - \exp\Big\{ -\frac{\varepsilon \sigma^2}{2M^2} \log(1 + \varepsilon) \Big\}.$$

Proof. Write $\|f_{z,\gamma} - f_{\mathbf{x},\gamma}\|_H = \|L(y - S_{\mathbf{x}} f^*)\|_H \le \|(S_{\mathbf{x}}^* S_{\mathbf{x}} + \gamma I)^{-1}\| \, \|S_{\mathbf{x}}^*(y - S_{\mathbf{x}} f^*)\|_H$. But

$$S_{\mathbf{x}}^*(y - S_{\mathbf{x}} f^*) = \sum_{x \in \mathbf{x}} \big( y_x - f^*(x) \big) E_x.$$

Hence

$$\|S_{\mathbf{x}}^*(y - S_{\mathbf{x}} f^*)\|_H^2 = \sum_{x \in \mathbf{x}} \sum_{x' \in \mathbf{x}} \big( y_x - f^*(x) \big) \big( y_{x'} - f^*(x') \big) \langle E_x, E_{x'} \rangle_H.$$

By the independence of the samples and $E(y_x - f^*(x)) = 0$, $E(y_x - f^*(x))^2 = \sigma_x^2$, its expected value is

$$E \|S_{\mathbf{x}}^*(y - S_{\mathbf{x}} f^*)\|_H^2 = \sum_{x \in \mathbf{x}} \sigma_x^2 \langle E_x, E_x \rangle_H.$$

Now $\langle E_x, E_x \rangle_H = \|E_x\|_H^2 \le \|J\|^2$. Then the expected value of the squared sample error can be bounded as

$$E \|f_{z,\gamma} - f_{\mathbf{x},\gamma}\|_H^2 \le \big\| (S_{\mathbf{x}}^* S_{\mathbf{x}} + \gamma I)^{-1} \big\|^2 \, \|J\|^2 \sigma^2.$$

The first desired probability estimate follows from the Markov inequality (3.2).

For the second estimate, we apply Theorem 3 from [12] with $w \equiv 1$ to the random variables $\{\eta_x\}_{x \in \mathbf{x}}$. The Special Assumption tells us that $E \eta_x = 0$, which implies $E \eta_x^2 = \sigma_x^2$. Then we see that for every $\varepsilon > 0$,

$$\mathrm{Prob}_y \Big\{ \sum_{x \in \mathbf{x}} \big( \eta_x^2 - \sigma_x^2 \big) > \varepsilon \Big\} \le \exp\Big\{ -\frac{\varepsilon}{2M^2} \log\Big( 1 + \frac{M^2 \varepsilon}{\sum_{x \in \mathbf{x}} \sigma^2(\eta_x^2)} \Big) \Big\}.$$

Here we have used the condition $|\eta_x| \le M$, which implies $|\eta_x^2 - \sigma_x^2| \le M^2$ and

$$\sum_{x \in \mathbf{x}} \sigma^2(\eta_x^2) \le \sum_{x \in \mathbf{x}} E \eta_x^4 \le M^2 \sigma^2 < \infty.$$

The desired bound then follows from $\|f_{z,\gamma} - f_{\mathbf{x},\gamma}\|_H \le \|L\| \, \|(\eta_x)\|_{\ell^2(\mathbf{x})}$.

Remark. When $\mathbf{x}$ contains $m$ elements, we can take $\sigma^2 \le m M^2 < \infty$.

Proposition 1. The sampling operator $S_{\mathbf{x}}$ satisfies

$$\big\| (S_{\mathbf{x}}^* S_{\mathbf{x}} + \gamma I)^{-1} \big\| \le \frac{1}{\lambda_{\mathbf{x}} + \gamma}.$$

For the operator $L$, we have

$$\|L\| \le \frac{\|S_{\mathbf{x}}\|}{\lambda_{\mathbf{x}} + \gamma}.$$

Proof. Let $v \in H$ and $u = (S_{\mathbf{x}}^* S_{\mathbf{x}} + \gamma I)^{-1} v$. Then $(S_{\mathbf{x}}^* S_{\mathbf{x}} + \gamma I) u = v$. Taking inner products on both sides with $u$, we have

$$\langle S_{\mathbf{x}} u, S_{\mathbf{x}} u \rangle_{\ell^2(\mathbf{x})} + \gamma \|u\|_H^2 = \langle v, u \rangle_H \le \|v\|_H \|u\|_H.$$

The definition of the richness $\lambda_{\mathbf{x}}$ tells us that $\langle S_{\mathbf{x}} u, S_{\mathbf{x}} u \rangle_{\ell^2(\mathbf{x})} = \|S_{\mathbf{x}} u\|_{\ell^2(\mathbf{x})}^2 \ge \lambda_{\mathbf{x}} \|u\|_H^2$. It follows that

$$(\lambda_{\mathbf{x}} + \gamma) \|u\|_H^2 \le \|v\|_H \|u\|_H.$$

Hence $\|u\|_H \le (\lambda_{\mathbf{x}} + \gamma)^{-1} \|v\|_H$. This is true for every $v \in H$. Then the bound for the first operator follows. The second inequality is trivial.

Corollary 1. If $S_{\mathbf{x}}^* S_{\mathbf{x}} + \gamma I$ is invertible and the Special Assumption holds, then for every $0 < \delta < 1$, with probability $1 - \delta$, there holds

$$\|f_{z,\gamma} - f_{\mathbf{x},\gamma}\|_H \le \frac{\|J\|}{\lambda_{\mathbf{x}} + \gamma} \sqrt{\frac{\sigma^2}{\delta}}.$$

5. Integration Error

Recall that $f_{\mathbf{x},\gamma} = L S_{\mathbf{x}} f^* = (S_{\mathbf{x}}^* S_{\mathbf{x}} + \gamma I)^{-1} S_{\mathbf{x}}^* S_{\mathbf{x}} f^*$. Then

$$f_{\mathbf{x},\gamma} = (S_{\mathbf{x}}^* S_{\mathbf{x}} + \gamma I)^{-1} \big( S_{\mathbf{x}}^* S_{\mathbf{x}} + \gamma I - \gamma I \big) f^* = f^* - \gamma (S_{\mathbf{x}}^* S_{\mathbf{x}} + \gamma I)^{-1} f^*. \tag{5.1}$$

This in connection with Proposition 1 proves the following proposition.

Proposition If S x S x + γi is invertible, then f x,γ f H γ f H λ x + γ Corollary If λ x > 0 and γ = 0, then for every 0 < δ < 1, with probability 1 δ, there holds 51 holds f f H J σ /δ λ x For the poor data case λ x = 0, we need to estiate γ S x S x + γi 1 f according to Recall that for a positive self-adjoint linear operator L on a Hilbert space H, there γ L + γi 1 f H = γ L + γi 1 f Lg + Lg H f Lg H + γ g H for every g H aking the infiu over g H, we have γ L + γi 1 { f H Kf, γ := inf f Lg H + γ g H, f H, γ > 0 5 g H his is the K-functional between H and the range of L hus, when the range of L is dense in H, we have li γ 0 γ L + γi 1 f H = 0 for every f H If f is in the range of L r for soe 0 < r 1, then γ L + γi 1 f H L r f H γ r See [11] Using 5 for L = S x S x, we can use a K-functional between H and the range of S x S x to get the convergence rate Proposition 3 Define f γ as fγ { := arg inf f S x S x g H + γ g H, γ > 0, g H then there holds f x,γ f H f S x S x f γ H + γ f γ H In particular, if f lies in the closure of the range of Sx S x, then li γ 0 f x,γ f H = 0 If f is in the range of Sx S r x for soe 0 < r 1, then fx,γ f H rf Sx S x H γ r Copared with Corollary, Proposition 3 in connection with Corollary 1 gives an error estiate for the poor data case when f is in the range of r Sx S x For every 7

6. More General Setting of Function Reconstruction

From (2.1) we see that the boundedness of $S_{\mathbf{x}}$ is equivalent to the Bessel sequence property of the family $\{E_x\}_{x \in \mathbf{x}}$ of elements in $H$, i.e., there is a positive constant $B$ such that

$$\sum_{x \in \mathbf{x}} \big| \langle f, E_x \rangle_H \big|^2 \le B \|f\|_H^2, \qquad \forall f \in H. \tag{6.1}$$

Moreover, $\mathbf{x}$ provides rich data if and only if this family forms a frame of $H$, i.e., there are two positive constants $A \le B$ (called frame bounds) such that

$$A \|f\|_H^2 \le \sum_{x \in \mathbf{x}} \big| \langle f, E_x \rangle_H \big|^2 \le B \|f\|_H^2, \qquad \forall f \in H.$$

In this case, the operator $S_{\mathbf{x}}^* S_{\mathbf{x}}$ is called the frame operator. Its inverse is usually difficult to compute, but it satisfies the reconstruction property:

$$f = \sum_{x \in \mathbf{x}} \big\langle f, (S_{\mathbf{x}}^* S_{\mathbf{x}})^{-1} E_x \big\rangle_H \, E_x, \qquad \forall f \in H.$$

For these basic facts about frames, see [17].

The function reconstruction algorithm studied in the previous sections can be generalized to a setting with a Bessel sequence $\{E_x\}_{x \in \mathbf{x}}$ in $H$ satisfying (6.1). Here the point evaluation (2.1) is replaced by the functional $\langle f, E_x \rangle_H$ and the algorithm becomes

$$f_{z,\gamma} := \arg\min_{f \in H} \Big\{ \sum_{x \in \mathbf{x}} \big( \langle f, E_x \rangle_H - y_x \big)^2 + \gamma \|f\|_H^2 \Big\}. \tag{6.2}$$

The sample values are given by $y_x = \langle f^*, E_x \rangle_H + \eta_x$. If we replace the sampling operator $S_{\mathbf{x}}$ by the operator from $H$ to $\ell^2(\mathbf{x})$ mapping $f$ to $\big( \langle f, E_x \rangle_H \big)_{x \in \mathbf{x}}$, then the algorithm can be analyzed in the same way as above and all the error bounds hold true.

Concrete examples for this generalized setting can be found in the literature of image processing, inverse problems [6] and sampling theory [1]: the Fredholm integral equation of the first kind, the moment problem, and function reconstruction from weighted averages.

One can even consider more general function reconstruction schemes: replacing the least-square loss in (6.2) by some other loss function and the norm $\|f\|_H^2$ by some other norm. For example, if we choose Vapnik's $\epsilon$-insensitive loss $|t|_\epsilon := \max\{|t| - \epsilon, 0\}$, and a function space $\mathcal{H}$ included in $H$ such as a Sobolev space in $L^2$, then a function reconstruction scheme becomes

$$f_{z,\gamma} := \arg\min_{f \in \mathcal{H}} \Big\{ \sum_{x \in \mathbf{x}} \big| \langle f, E_x \rangle_H - y_x \big|_\epsilon + \gamma \|f\|_{\mathcal{H}}^2 \Big\}.$$

The rich data requirement is reasonable for function reconstruction such as sampling theory [13]. On the other hand, in learning theory, the situation of poor data or poor frame bounds ($A \to 0$ as the number of points in $\mathbf{x}$ increases) often happens. For such situations, we take $\mathbf{x}$ to be random samples of some probability distribution.

7. Learning Theory

From now on we assume that $X$ is compact. Let $\rho$ be a probability measure on $Z := X \times Y$ with $Y := \mathbb{R}$. The error for a function $f: X \to Y$ is given by

$$\mathcal{E}(f) := \int_Z \big( f(x) - y \big)^2 \, d\rho.$$

The function minimizing the error is called the regression function and is given by

$$f_\rho(x) := \int_Y y \, d\rho(y|x), \qquad x \in X.$$

Here $\rho(y|x)$ is the conditional distribution at $x$ induced by $\rho$. The marginal distribution on $X$ is denoted as $\rho_X$. We assume that $f_\rho \in L^2_{\rho_X}$. Denote $\|f\|_\rho := \|f\|_{L^2_{\rho_X}}$ and the variance of $\rho$ as $\sigma_\rho^2$.

The purpose of the regression problem in learning theory [3, 7, 9, 14, 15] is to find good approximations of the regression function from a set of random samples $z := \{(x_i, y_i)\}_{i=1}^m$ drawn independently according to $\rho$. This purpose is achieved in Corollaries 3, 4, and 5 below. Here we consider kernel based learning algorithms.

Let $K: X \times X \to \mathbb{R}$ be continuous, symmetric and positive semidefinite, i.e., for any finite set of distinct points $\{x_1, \ldots, x_l\} \subset X$, the matrix $\big( K(x_i, x_j) \big)_{i,j=1}^l$ is positive semidefinite. Such a kernel is called a Mercer kernel.

The Reproducing Kernel Hilbert Space (RKHS) $H_K$ associated with the kernel $K$ is defined to be the closure [2] of the linear span of the set of functions $\{K_x := K(x, \cdot) : x \in X\}$ with the inner product $\langle \cdot, \cdot \rangle_K$ satisfying $\langle K_x, K_y \rangle_K = K(x, y)$. The reproducing property takes the form

$$\langle K_x, g \rangle_K = g(x), \qquad \forall x \in X, \ g \in H_K. \tag{7.1}$$

The learning algorithm we study here is a regularized one:

$$\text{Learning scheme:} \quad f_{z,\lambda} := \arg\min_{f \in H_K} \Big\{ \frac{1}{m} \sum_{i=1}^m \big( f(x_i) - y_i \big)^2 + \lambda \|f\|_K^2 \Big\}. \tag{7.2}$$

We shall investigate how $f_{z,\lambda}$ approximates $f_\rho$ and how the choice of the regularization parameter $\lambda$ leads to optimal convergence rates. The convergence in $L^2_{\rho_X}$ has been considered in [4, 5, 18]. The purpose of this section is to present a simple functional analysis approach, and to provide the convergence rates in the space $H_K$ as well as sharper, dimension independent probability estimates in $L^2_{\rho_X}$.

The reproducing kernel property (7.1) tells us that the minimizer of (7.2) lies in $H_{K,z} := \mathrm{span}\{K_{x_i}\}_{i=1}^m$, by projection onto this subspace. Thus, the algorithm can be written in the same way as (3.3). To see this, we denote $\mathbf{x} = \{x_i\}_{i=1}^m$ and, for $x \in X$, let $\rho_x$ be the conditional distribution $\rho(\cdot|x)$ shifted by $-f_\rho(x)$. Then $E_x = K_x$ for $x \in \mathbf{x}$, the Special Assumption holds, and (3.1) is true except that $f^* \in H$ is replaced by $f^* = f_\rho$. Denote $y = (y_i)_{i=1}^m$. The learning scheme (7.2) becomes

$$f_{z,\lambda} := \arg\min_{f \in H_{K,z}} \Big\{ \sum_{x \in \mathbf{x}} \big( f(x) - y_x \big)^2 + \gamma \|f\|_K^2 \Big\}, \qquad \gamma = \lambda m.$$

Therefore, Theorem 1 still holds and we have

$$f_{z,\lambda} = (S_{\mathbf{x}}^* S_{\mathbf{x}} + \lambda m I)^{-1} S_{\mathbf{x}}^* y.$$

This implies the expression (see, e.g., [3]) that $f_{z,\lambda} = \sum_{i=1}^m c_i K_{x_i}$ with $c = (c_i)_{i=1}^m$ satisfying

$$\Big( \big( K(x_i, x_j) \big)_{i,j=1}^m + \lambda m I \Big) c = y.$$
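In practice the linear system above is all one needs to compute $f_{z,\lambda}$. A minimal sketch follows (not the paper's code; the Gaussian kernel, the regression function and all parameters are illustrative assumptions).

```python
import numpy as np

# Minimal sketch of the learning scheme (7.2) (illustrative, not the paper's code):
# f_{z,lambda} = sum_i c_i K_{x_i} with ((K(x_i, x_j))_{i,j} + lambda*m*I) c = y,
# here with a Gaussian Mercer kernel and a noisy regression function on [0, 1].

rng = np.random.default_rng(4)
m, lam = 100, 1e-3
K = lambda s, t: np.exp(-(s[:, None] - t[None, :]) ** 2 / 0.05)   # Mercer kernel
f_rho = lambda t: np.sin(2 * np.pi * t)

x = rng.uniform(0, 1, m)
y = f_rho(x) + 0.1 * rng.standard_normal(m)        # samples z = (x_i, y_i)

c = np.linalg.solve(K(x, x) + lam * m * np.eye(m), y)   # coefficient vector

f_z = lambda t: K(t, x) @ c                        # f_{z,lambda}(t) = sum_i c_i K(x_i, t)

t = np.linspace(0, 1, 200)
print("RMS error of f_{z,lambda} against f_rho:",
      np.sqrt(np.mean((f_z(t) - f_rho(t)) ** 2)))
```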

Denote $\kappa := \sup_{x \in X} \sqrt{K(x, x)}$ and $f_\rho(\mathbf{x}) := \big( f_\rho(x) \big)_{x \in \mathbf{x}}$. Define

$$f_{\mathbf{x},\lambda} := (S_{\mathbf{x}}^* S_{\mathbf{x}} + \lambda m I)^{-1} S_{\mathbf{x}}^* f_\rho(\mathbf{x}).$$

Observe that $S_{\mathbf{x}}^*: \mathbb{R}^m \to H_{K,z}$ is given by $S_{\mathbf{x}}^* c = \sum_{i=1}^m c_i K_{x_i}$. Then $S_{\mathbf{x}}^* S_{\mathbf{x}}$ satisfies

$$S_{\mathbf{x}}^* S_{\mathbf{x}} f = \sum_{x \in \mathbf{x}} f(x) K_x = m \, L_{K,\mathbf{x}} S_{\mathbf{x}} f, \qquad f \in H_{K,z},$$

where $L_{K,\mathbf{x}}: \ell^2(\mathbf{x}) \to H_K$ is defined as

$$L_{K,\mathbf{x}} c := \frac{1}{m} \sum_{i=1}^m c_i K_{x_i}.$$

It is a good approximation of the integral operator $L_K: L^2_{\rho_X} \to H_K$ defined by

$$L_K f(x) := \int_X K(x, y) f(y) \, d\rho_X(y), \qquad x \in X.$$

The operator $L_K$ can also be defined as a self-adjoint operator on $H_K$ or on $L^2_{\rho_X}$. We shall use the same notation $L_K$ for these operators defined on different domains. As operators on $H_K$, $L_{K,\mathbf{x}} S_{\mathbf{x}}$ approximates $L_K$ well. In fact, it was shown in [5] that $E \|L_{K,\mathbf{x}} S_{\mathbf{x}} - L_K\|_{H_K \to H_K} \le \kappa^2 / \sqrt{m}$. To get sharper error bounds, we need estimates for the operators with domain $L^2_{\rho_X}$.

Lemma 1. Let $\mathbf{x} \subset X$ be randomly drawn according to $\rho_X$. Then for any $f \in L^2_{\rho_X}$,

$$E \big\| L_{K,\mathbf{x}} (f|_{\mathbf{x}}) - L_K f \big\|_K^2 = E \Big\| \frac{1}{m} \sum_{i=1}^m f(x_i) K_{x_i} - L_K f \Big\|_K^2 \le \frac{\kappa^2 \|f\|_\rho^2}{m}.$$

Proof. Define $\xi$ to be the $H_K$-valued random variable $\xi := f(x) K_x$ over $(X, \rho_X)$. Then

$$\frac{1}{m} \sum_{i=1}^m f(x_i) K_{x_i} - L_K f = \frac{1}{m} \sum_{i=1}^m \big( \xi(x_i) - E\xi \big).$$

We know that

$$E \Big\| \frac{1}{m} \sum_{i=1}^m \big( \xi(x_i) - E\xi \big) \Big\|_K^2 = \frac{1}{m} E \|\xi - E\xi\|_K^2 = \frac{1}{m} \Big( E \|\xi\|_K^2 - \|E\xi\|_K^2 \Big),$$

which is bounded by $\kappa^2 \|f\|_\rho^2 / m$, since $E \|\xi\|_K^2 = \int_X f(y)^2 K(y, y) \, d\rho_X(y) \le \kappa^2 \|f\|_\rho^2$.
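A quick Monte Carlo sanity check of Lemma 1 (illustrative only; $L_K f$ is approximated by a fine quadrature rule, so the comparison is itself approximate, and all choices below are assumptions, not from the paper):

```python
import numpy as np

# Monte Carlo check of Lemma 1 (illustrative): for X = [0,1] with rho_X uniform and
# a Gaussian kernel, compare E ||(1/m) sum_i f(x_i) K_{x_i} - L_K f||_K^2 with
# kappa^2 ||f||_rho^2 / m.  RKHS norms of finite kernel expansions g = sum_j a_j K_{t_j}
# are computed as a^T K(t, t) a; L_K f is replaced by a fine quadrature approximation.

rng = np.random.default_rng(3)
K = lambda s, t: np.exp(-(s[:, None] - t[None, :]) ** 2 / 0.1)   # Mercer kernel
f = lambda t: np.sin(2 * np.pi * t)

t = (np.arange(400) + 0.5) / 400          # quadrature nodes for L_K f
w_quad = np.full(400, 1.0 / 400)          # quadrature weights (uniform rho_X)

m, trials, errs = 50, 200, []
for _ in range(trials):
    x = rng.uniform(0, 1, m)
    pts = np.concatenate([x, t])          # expansion points of the difference
    coef = np.concatenate([f(x) / m, -w_quad * f(t)])
    G = K(pts, pts)                       # Gram matrix
    errs.append(coef @ G @ coef)          # ||.||_K^2 of the difference

kappa2 = 1.0                              # sup_x K(x, x) for this kernel
f_rho2 = np.sum(w_quad * f(t) ** 2)       # ||f||_rho^2 (same quadrature)
print(np.mean(errs), "<=", kappa2 * f_rho2 / m)
```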

The function $f_{\mathbf{x},\lambda}$ may be considered as an approximation of $f_\lambda$, where $f_\lambda$ is defined by

$$f_\lambda := (L_K + \lambda I)^{-1} L_K f_\rho. \tag{7.3}$$

In fact, $f_\lambda$ is a minimizer of the optimization problem:

$$f_\lambda := \arg\min_{f \in H_K} \big\{ \|f - f_\rho\|_\rho^2 + \lambda \|f\|_K^2 \big\} = \arg\min_{f \in H_K} \big\{ \mathcal{E}(f) - \mathcal{E}(f_\rho) + \lambda \|f\|_K^2 \big\}. \tag{7.4}$$

Theorem 3. Let $z$ be randomly drawn according to $\rho$. Then

$$E_{z \in Z^m} \|f_{z,\lambda} - f_{\mathbf{x},\lambda}\|_K \le \frac{\kappa \sigma_\rho}{\lambda \sqrt{m}}$$

and

$$E_{\mathbf{x} \in X^m} \|f_{\mathbf{x},\lambda} - f_\lambda\|_K \le \frac{3 \kappa \|f_\rho\|_\rho}{\lambda \sqrt{m}}.$$

Proof. The same proof as that of Theorem 2 and Proposition 1 shows that

$$E_y \|f_{z,\lambda} - f_{\mathbf{x},\lambda}\|_K^2 \le \frac{\kappa^2 \sum_{i=1}^m \sigma_{x_i}^2}{(\lambda_{\mathbf{x}} + \lambda m)^2}.$$

But $E_{\mathbf{x}} \sum_{i=1}^m \sigma_{x_i}^2 = m \sigma_\rho^2$. Then the first statement follows.

To see the second statement we write $f_{\mathbf{x},\lambda} - f_\lambda$ as $(f_{\mathbf{x},\lambda} - \tilde f_\lambda) + (\tilde f_\lambda - f_\lambda)$, where $\tilde f_\lambda$ is defined by

$$\tilde f_\lambda := (L_{K,\mathbf{x}} S_{\mathbf{x}} + \lambda I)^{-1} L_K f_\rho. \tag{7.5}$$

Since

$$f_{\mathbf{x},\lambda} - \tilde f_\lambda = (L_{K,\mathbf{x}} S_{\mathbf{x}} + \lambda I)^{-1} \Big( \frac{1}{m} S_{\mathbf{x}}^* f_\rho(\mathbf{x}) - L_K f_\rho \Big), \tag{7.6}$$

Lemma 1 tells us that

$$E \|f_{\mathbf{x},\lambda} - \tilde f_\lambda\|_K \le \frac{1}{\lambda} E \Big\| \frac{1}{m} S_{\mathbf{x}}^* f_\rho(\mathbf{x}) - L_K f_\rho \Big\|_K \le \frac{\kappa \|f_\rho\|_\rho}{\lambda \sqrt{m}}.$$

To estimate $\tilde f_\lambda - f_\lambda$, we write $L_K f_\rho$ as $(L_K + \lambda I) f_\lambda$. Then

$$\tilde f_\lambda - f_\lambda = (L_{K,\mathbf{x}} S_{\mathbf{x}} + \lambda I)^{-1} (L_K + \lambda I) f_\lambda - f_\lambda = (L_{K,\mathbf{x}} S_{\mathbf{x}} + \lambda I)^{-1} \big( L_K f_\lambda - L_{K,\mathbf{x}} S_{\mathbf{x}} f_\lambda \big).$$

Hence

$$\|\tilde f_\lambda - f_\lambda\|_K \le \frac{1}{\lambda} \|L_K f_\lambda - L_{K,\mathbf{x}} S_{\mathbf{x}} f_\lambda\|_K. \tag{7.7}$$

Applying Lemma 1 again, we see that

$$E \|\tilde f_\lambda - f_\lambda\|_K \le \frac{1}{\lambda} E \|L_K f_\lambda - L_{K,\mathbf{x}} S_{\mathbf{x}} f_\lambda\|_K \le \frac{\kappa \|f_\lambda\|_\rho}{\lambda \sqrt{m}}.$$

Note that $f_\lambda$ is a minimizer of (7.4). Taking $f = 0$ yields $\|f_\lambda - f_\rho\|_\rho^2 + \lambda \|f_\lambda\|_K^2 \le \|f_\rho\|_\rho^2$. Hence

$$\|f_\lambda\|_\rho \le 2 \|f_\rho\|_\rho \quad \text{and} \quad \|f_\lambda\|_K \le \|f_\rho\|_\rho / \sqrt{\lambda}. \tag{7.8}$$

Therefore, our second estimate follows.

The last step is to estimate the approximation error $f_\lambda - f_\rho$.

Theorem 4. Define $f_\lambda$ by (7.3). If $L_K^{-r} f_\rho \in L^2_{\rho_X}$ for some $0 < r \le 1$, then

$$\|f_\lambda - f_\rho\|_\rho \le \lambda^r \|L_K^{-r} f_\rho\|_\rho. \tag{7.9}$$

When $\tfrac{1}{2} < r \le 1$, we have

$$\|f_\lambda - f_\rho\|_K \le \lambda^{r - 1/2} \|L_K^{-r} f_\rho\|_\rho. \tag{7.10}$$

We follow the same line as we did in [11]. Estimates similar to (7.9) can be found in [3, Theorem 3.1]: for a self-adjoint strictly positive compact operator $A$ on a Hilbert space $H$, there holds for $0 < r < s$,

$$\inf_{b \in H} \big\{ \|b - a\| + \gamma \|A^{-s} b\| \big\} \le \gamma^{r/s} \|A^{-r} a\|. \tag{7.11}$$

A mistake was made in [3] when scaling from $s = 1$ to general $s > 0$: $r$ should be $< 1$ in the general situation. A proof of (7.9) was given in [5]. Here we provide a complete proof because the idea is used for verifying (7.10).

Proof of Theorem 4. If $\{(\lambda_i, \psi_i)\}_{i \ge 1}$ are the normalized eigenpairs of the integral operator $L_K: L^2_{\rho_X} \to L^2_{\rho_X}$, then $\|\sqrt{\lambda_i}\, \psi_i\|_K = 1$ when $\lambda_i > 0$. Write $f_\rho = L_K^r g$ for some $g = \sum_{i \ge 1} d_i \psi_i$ with $\|(d_i)\|_{\ell^2} = \|g\|_\rho < \infty$. Then $f_\rho = \sum_{i \ge 1} \lambda_i^r d_i \psi_i$ and by (7.3),

$$f_\lambda - f_\rho = (L_K + \lambda I)^{-1} L_K f_\rho - f_\rho = -\sum_{i \ge 1} \frac{\lambda}{\lambda_i + \lambda} \lambda_i^r d_i \psi_i.$$

It follows that

$$\|f_\lambda - f_\rho\|_\rho = \Big\{ \sum_{i \ge 1} \Big( \frac{\lambda}{\lambda_i + \lambda} \lambda_i^r d_i \Big)^2 \Big\}^{1/2} = \lambda^r \Big\{ \sum_{i \ge 1} \Big( \Big( \frac{\lambda_i}{\lambda_i + \lambda} \Big)^r \Big( \frac{\lambda}{\lambda + \lambda_i} \Big)^{1 - r} d_i \Big)^2 \Big\}^{1/2}.$$

This is bounded by $\lambda^r \|(d_i)\|_{\ell^2} = \lambda^r \|g\|_\rho = \lambda^r \|L_K^{-r} f_\rho\|_\rho$. Hence (7.9) holds.

When $r > \tfrac{1}{2}$, we have

$$\|f_\lambda - f_\rho\|_K^2 = \sum_{\lambda_i > 0} \frac{1}{\lambda_i} \Big( \frac{\lambda}{\lambda_i + \lambda} \lambda_i^r d_i \Big)^2 = \lambda^{2r - 1} \sum_{i \ge 1} \Big( \frac{\lambda_i}{\lambda_i + \lambda} \Big)^{2r - 1} \Big( \frac{\lambda}{\lambda + \lambda_i} \Big)^{3 - 2r} d_i^2.$$

This is again bounded by $\lambda^{2r - 1} \|(d_i)\|_{\ell^2}^2 = \lambda^{2r - 1} \|L_K^{-r} f_\rho\|_\rho^2$. The second statement (7.10) has been verified.

Combining Theorems 3 and 4, we find the expected value of the error $f_{z,\lambda} - f_\rho$. By choosing the optimal parameter in this bound, we get the following convergence rates.

Corollary 3. Let $z$ be randomly drawn according to $\rho$. Assume $L_K^{-r} f_\rho \in L^2_{\rho_X}$ for some $\tfrac{1}{2} < r \le 1$. Then

$$E_{z \in Z^m} \|f_{z,\lambda} - f_\rho\|_K \le C_{\rho,K} \Big( \frac{1}{\lambda \sqrt{m}} + \lambda^{r - 1/2} \Big), \tag{7.12}$$

where $C_{\rho,K} := \kappa \sigma_\rho + 3\kappa \|f_\rho\|_\rho + \|L_K^{-r} f_\rho\|_\rho$ is independent of the dimension. Hence

$$\lambda = \Big( \frac{1}{m} \Big)^{\frac{1}{2r+1}} \implies E_{z \in Z^m} \|f_{z,\lambda} - f_\rho\|_K \le 2 C_{\rho,K} \Big( \frac{1}{m} \Big)^{\frac{2r-1}{4r+2}}. \tag{7.13}$$

Remark. Corollary 3 provides estimates for the $H_K$-norm error of $f_{z,\lambda} - f_\rho$. So we require $f_\rho \in H_K$, which is equivalent to $L_K^{-1/2} f_\rho \in L^2_{\rho_X}$. To get convergence rates we assume a stronger condition $L_K^{-r} f_\rho \in L^2_{\rho_X}$ for some $\tfrac{1}{2} < r \le 1$. The optimal rate derived from Corollary 3 is achieved by $r = 1$. In this case, $E_{z \in Z^m} \|f_{z,\lambda} - f_\rho\|_K = O\big( (1/m)^{1/6} \big)$. The norm $\|f_{z,\lambda} - f_\rho\|_K$ cannot be bounded by the error $\mathcal{E}(f_{z,\lambda}) - \mathcal{E}(f_\rho)$, hence our bound for the $H_K$-norm error is new in learning theory.

Corollary 4. Let $z$ be randomly drawn according to $\rho$. If $L_K^{-r} f_\rho \in L^2_{\rho_X}$ for some $0 < r \le 1$, then

$$E_{z \in Z^m} \|f_{z,\lambda} - f_\rho\|_\rho \le C_{\rho,K} \Big( \frac{1}{\lambda \sqrt{m}} + \lambda^r \Big), \tag{7.14}$$

where $C_{\rho,K} := \kappa \sigma_\rho + 3\kappa \|f_\rho\|_\rho + \|L_K^{-r} f_\rho\|_\rho$. Thus,

$$\lambda = \Big( \frac{1}{m} \Big)^{\frac{1}{2r+2}} \implies E_{z \in Z^m} \|f_{z,\lambda} - f_\rho\|_\rho \le 2 C_{\rho,K} \Big( \frac{1}{m} \Big)^{\frac{r}{2r+2}}. \tag{7.15}$$

Remark. The convergence rate (7.15) for the $L^2_{\rho_X}$-norm is obtained by optimizing the regularization parameter $\lambda$ in (7.14). The sharp rate derived from Corollary 4 is $(1/m)^{1/4}$, which is achieved by $r = 1$.
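The parameter choices in (7.13) and (7.15) simply balance the two terms of (7.12) and (7.14). A small grid search (with $C_{\rho,K}$ set to 1, purely for illustration and not from the paper) recovers the same orders of magnitude.

```python
import numpy as np

# Illustrative check of the parameter choices in (7.13) and (7.15): minimize the
# right-hand sides of (7.12) and (7.14) over lambda on a grid (with C_{rho,K} = 1)
# and compare with lambda = m^{-1/(2r+1)} and lambda = m^{-1/(2r+2)}.

m, r = 10_000, 1.0
lams = np.logspace(-6, 0, 4001)                             # grid of candidate lambdas

bound_K   = 1.0 / (lams * np.sqrt(m)) + lams ** (r - 0.5)   # right-hand side of (7.12)
bound_rho = 1.0 / (lams * np.sqrt(m)) + lams ** r           # right-hand side of (7.14)

print("H_K norm: grid argmin =", lams[bound_K.argmin()],
      " stated choice m^(-1/(2r+1)) =", m ** (-1 / (2 * r + 1)))
print("rho norm: grid argmin =", lams[bound_rho.argmin()],
      " stated choice m^(-1/(2r+2)) =", m ** (-1 / (2 * r + 2)))
```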

In [18], a leave-one-out technique was used to derive the expected value of the error of learning schemes. For the scheme (7.2), the result can be expressed as

$$E_{z \in Z^m} \big( \mathcal{E}(f_{z,\lambda}) \big) \le \Big( 1 + \frac{2\kappa^2}{\lambda m} \Big)^2 \inf_{f \in H_K} \Big\{ \mathcal{E}(f) + \frac{\lambda}{2} \|f\|_K^2 \Big\}. \tag{7.16}$$

Notice that $\mathcal{E}(f) - \mathcal{E}(f_\rho) = \|f - f_\rho\|_\rho^2$. If we denote the regularization error (see [12]) as

$$D(\lambda) := \inf_{f \in H_K} \big\{ \mathcal{E}(f) - \mathcal{E}(f_\rho) + \lambda \|f\|_K^2 \big\} = \inf_{f \in H_K} \big\{ \|f - f_\rho\|_\rho^2 + \lambda \|f\|_K^2 \big\}, \tag{7.17}$$

then the bound (7.16) can be restated as

$$E_{z \in Z^m} \|f_{z,\lambda} - f_\rho\|_\rho^2 \le D(\lambda/2) + \big( \mathcal{E}(f_\rho) + D(\lambda/2) \big) \Big( \frac{4\kappa^2}{\lambda m} + \frac{4\kappa^4}{\lambda^2 m^2} \Big).$$

One can then derive the convergence rate $(1/m)^{1/4}$ in expectation when $f_\rho \in H_K$ and $\mathcal{E}(f_\rho) > 0$. In fact, (7.11) with $H = L^2_{\rho_X}$, $A = L_K^{1/2}$ holds for $r = s = 1/2$, which yields the best rate for the regularization error: $D(\lambda) \le \|f_\rho\|_K^2 \, \lambda$. One can thus get $E_{z \in Z^m} \|f_{z,\lambda} - f_\rho\|_\rho^2 = O(1/\sqrt{m})$ by taking $\lambda = 1/\sqrt{m}$. Applying (3.2), one can have the probability estimate $\|f_{z,\lambda} - f_\rho\|_\rho \le C (1/m)^{1/4} / \sqrt{\delta}$ for the confidence $1 - \delta$.

In [5], a functional analysis approach was employed for the error analysis of the scheme (7.2). The main result asserts that for any $0 < \delta < 1$, with confidence $1 - \delta$,

$$\mathcal{E}(f_{z,\lambda}) - \mathcal{E}(f_\lambda) \le \frac{M^2 \kappa^2}{\lambda^2 m} \Big( 1 + \frac{\kappa}{\sqrt{\lambda}} \Big)^2 \Big( 1 + \sqrt{2 \log(2/\delta)} \Big)^2. \tag{7.18}$$

Convergence rates were also derived in [5, Corollary 1] by combining (7.18) with (7.9): when $f_\rho$ lies in the range of $L_K$, for any $0 < \delta < 1$, with confidence $1 - \delta$, there holds

$$\|f_{z,\lambda} - f_\rho\|_\rho \le C \Big( \frac{\log(2/\delta)}{m} \Big)^{1/5}, \quad \text{if } \lambda = \Big( \frac{\log(2/\delta)}{m} \Big)^{1/5}.$$

Thus the confidence is improved from $1/\sqrt{\delta}$ to $\log(2/\delta)$, while the rate is weakened to $(1/m)^{1/5}$. In the next section we shall verify the same confidence estimate while the sharp rate is kept. Our approach is short and neat, without involving the leave-one-out technique. Moreover, we can derive convergence rates in the space $H_K$.

8. Probability Estimates by McDiarmid Inequalities

In this section we apply some McDiarmid inequalities to improve the probability estimates derived from expected values by the Markov inequality.

Let $(\Omega, \rho)$ be a probability space. For $t = (t_1, \ldots, t_m) \in \Omega^m$ and $t'_i \in \Omega$, we denote

$$t^i := (t_1, \ldots, t_{i-1}, t'_i, t_{i+1}, \ldots, t_m).$$

Lemma 2. Let $\{t_i\}$, $t'_i$ be i.i.d. draws of the probability distribution $\rho$ on $\Omega$, and $F: \Omega^m \to \mathbb{R}$ be a measurable function.

1. If for each $i$ there is $c_i$ such that $\sup_{t \in \Omega^m, t'_i \in \Omega} |F(t) - F(t^i)| \le c_i$, then

$$\mathrm{Prob}_{t \in \Omega^m} \big\{ F(t) - E_t F(t) \ge \varepsilon \big\} \le \exp\Big\{ -\frac{2 \varepsilon^2}{\sum_{i=1}^m c_i^2} \Big\}, \qquad \forall \varepsilon > 0. \tag{8.1}$$

2. If there is $B \ge 0$ such that $\sup_{t \in \Omega^m, 1 \le i \le m} |F(t) - E_{t_i} F(t)| \le B$, then

$$\mathrm{Prob}_{t \in \Omega^m} \big\{ F(t) - E_t F(t) \ge \varepsilon \big\} \le \exp\Big\{ -\frac{\varepsilon^2}{2 \big( B\varepsilon/3 + \sum_{i=1}^m \sigma_i^2(F) \big)} \Big\}, \qquad \forall \varepsilon > 0, \tag{8.2}$$

where

$$\sigma_i^2(F) := \sup_{t \setminus \{t_i\} \in \Omega^{m-1}} E_{t_i} \big( F(t) - E_{t_i} F(t) \big)^2.$$

The first inequality is the McDiarmid inequality; see [8]. The second inequality is its Bernstein form, which is presented in [16].
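For a concrete feel of (8.1), the following simulation (illustrative only, not from the paper) compares the empirical tail of an empirical mean of bounded variables with the McDiarmid bound.

```python
import numpy as np

# Quick simulation of the McDiarmid inequality (8.1) (illustrative only): for
# F(t) = mean of m bounded variables, each coordinate change moves F by at most
# c_i = 1/m, so Prob{F - E F >= eps} <= exp(-2 eps^2 m).  We compare the empirical
# tail with this bound for uniform [0, 1] variables.

rng = np.random.default_rng(5)
m, trials, eps = 50, 200_000, 0.1

F = rng.uniform(0, 1, size=(trials, m)).mean(axis=1)     # F(t) for many draws of t
empirical_tail = np.mean(F - 0.5 >= eps)                 # E_t F = 1/2 here
mcdiarmid_bound = np.exp(-2 * eps ** 2 * m)              # exp(-2 eps^2 / sum c_i^2)

print(f"empirical Prob(F - EF >= {eps}) = {empirical_tail:.4f}")
print(f"McDiarmid bound               = {mcdiarmid_bound:.4f}")
```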

First, we show how the probability estimate for function reconstruction stated in Theorem 2 can be improved.

Theorem 5. If $S_{\mathbf{x}}^* S_{\mathbf{x}} + \gamma I$ is invertible and the Special Assumption holds, then under the condition that $|y_x - f^*(x)| \le M$ for each $x \in \mathbf{x}$, we have for every $0 < \delta < 1$, with probability $1 - \delta$,

$$\|f_{z,\gamma} - f_{\mathbf{x},\gamma}\|_H \le \big\| (S_{\mathbf{x}}^* S_{\mathbf{x}} + \gamma I)^{-1} \big\| \, \|J\| \Big( \sqrt{8 \sigma^2 \log \tfrac{1}{\delta}} + \tfrac{4}{3} M \log \tfrac{1}{\delta} \Big) \le \frac{\|J\|}{\lambda_{\mathbf{x}} + \gamma} \Big( \sqrt{8 \sigma^2 \log \tfrac{1}{\delta}} + \tfrac{4}{3} M \log \tfrac{1}{\delta} \Big).$$

Proof. Write $\|f_{z,\gamma} - f_{\mathbf{x},\gamma}\|_H = \|L(y - S_{\mathbf{x}} f^*)\|_H \le \|(S_{\mathbf{x}}^* S_{\mathbf{x}} + \gamma I)^{-1}\| \, \|S_{\mathbf{x}}^*(y - S_{\mathbf{x}} f^*)\|_H$. Consider the function $F: \ell^2(\mathbf{x}) \to \mathbb{R}$ defined by $F(y) = \|S_{\mathbf{x}}^*(y - S_{\mathbf{x}} f^*)\|_H$. Recall from the proof of Theorem 2 that $F(y) = \| \sum_{x \in \mathbf{x}} (y_x - f^*(x)) E_x \|_H$ and

$$\big( E_y F \big)^2 \le E_y F^2 = \sum_{x \in \mathbf{x}} \sigma_x^2 \langle E_x, E_x \rangle_H \le \|J\|^2 \sigma^2. \tag{8.3}$$

Then we can apply the McDiarmid inequality. Let $x_0 \in \mathbf{x}$ and $y'_{x_0}$ be a new sample at $x_0$. We have

$$|F(y) - F(y^{x_0})| = \Big| \|S_{\mathbf{x}}^*(y - S_{\mathbf{x}} f^*)\|_H - \|S_{\mathbf{x}}^*(y^{x_0} - S_{\mathbf{x}} f^*)\|_H \Big| \le \|S_{\mathbf{x}}^*(y - y^{x_0})\|_H.$$

The bound equals $|y_{x_0} - y'_{x_0}| \, \|E_{x_0}\|_H \le |y_{x_0} - y'_{x_0}| \, \|J\|$. Since $|y_x - f^*(x)| \le M$ for each $x \in \mathbf{x}$, it can be bounded by $2M\|J\|$, which can be taken as $B$ in Lemma 2. Also,

$$E_{y'_{x_0}} \big( F(y) - E_{y'_{x_0}} F(y^{x_0}) \big)^2 \le \int_Y \!\! \int_Y \big( |y_{x_0} - y'_{x_0}| \, \|J\| \big)^2 \, d\rho_{x_0}(y'_{x_0}) \, d\rho_{x_0}(y_{x_0}) \le 4 \|J\|^2 \sigma_{x_0}^2.$$

This yields $\sum_{x_0 \in \mathbf{x}} \sigma_{x_0}^2(F) \le 4 \|J\|^2 \sigma^2$. Thus Lemma 2 tells us that for every $\varepsilon > 0$,

$$\mathrm{Prob}_{y \in Y^{\mathbf{x}}} \big\{ F(y) \ge E_y F(y) + \varepsilon \big\} \le \exp\Big\{ -\frac{\varepsilon^2}{2 \big( 2M\|J\|\varepsilon/3 + 4\|J\|^2 \sigma^2 \big)} \Big\}.$$

Solving the quadratic equation

$$\frac{\varepsilon^2}{\tfrac{4}{3} M \|J\| \varepsilon + 8 \|J\|^2 \sigma^2} = \log \frac{1}{\delta}$$

gives the probability estimate

$$F(y) \le E_y F + \|J\| \Big( \sqrt{8 \sigma^2 \log \tfrac{1}{\delta}} + \tfrac{4}{3} M \log \tfrac{1}{\delta} \Big)$$

with confidence $1 - \delta$. This in connection with (8.3) proves Theorem 5.

Then we turn to the learning theory estimates. The purpose is to improve the bound in Theorem 3 by applying the McDiarmid inequality. To this end, we refine Lemma 1 to a probability estimate.

Lemma 3. Let $\mathbf{x} \subset X$ be randomly drawn according to $\rho_X$. Then for any $f \in L^2_{\rho_X}$ and $0 < \delta < 1$, with confidence $1 - \delta$, there holds

$$\Big\| \frac{1}{m} \sum_{i=1}^m f(x_i) K_{x_i} - L_K f \Big\|_K \le \frac{4 \kappa \|f\|_\infty}{3m} \log \frac{1}{\delta} + \frac{\kappa \|f\|_\rho}{\sqrt{m}} \Big( 1 + \sqrt{8 \log \frac{1}{\delta}} \Big).$$

Proof. Define a function $F: X^m \to \mathbb{R}$ as

$$F(\mathbf{x}) = F(x_1, \ldots, x_m) = \Big\| \frac{1}{m} \sum_{i=1}^m f(x_i) K_{x_i} - L_K f \Big\|_K.$$

For $j \in \{1, \ldots, m\}$, we have

$$|F(\mathbf{x}) - F(\mathbf{x}^j)| \le \frac{1}{m} \big\| f(x_j) K_{x_j} - f(x'_j) K_{x'_j} \big\|_K \le \frac{\kappa}{m} \big( |f(x_j)| + |f(x'_j)| \big).$$

It follows that $|F(\mathbf{x}) - E_{x_j} F(\mathbf{x})| \le 2\kappa \|f\|_\infty / m =: B$. Moreover,

$$E_{x'_j} \big( F(\mathbf{x}^j) - E_{x'_j} F(\mathbf{x}^j) \big)^2 \le \frac{\kappa^2}{m^2} \int_X \!\! \int_X \big( |f(x_j)| + |f(x'_j)| \big)^2 \, d\rho_X(x_j) \, d\rho_X(x'_j) \le \frac{4 \kappa^2 \|f\|_\rho^2}{m^2}.$$

Then we have $\sum_{j=1}^m \sigma_j^2(F) \le \frac{4 \kappa^2 \|f\|_\rho^2}{m}$. Thus we can apply Lemma 2 to the function $F$ and find that

$$\mathrm{Prob}_{\mathbf{x} \in X^m} \big\{ F(\mathbf{x}) \ge E_{\mathbf{x}} F(\mathbf{x}) + \varepsilon \big\} \le \exp\Big\{ -\frac{\varepsilon^2}{2 \big( \frac{2\kappa \|f\|_\infty}{3m} \varepsilon + \frac{4 \kappa^2 \|f\|_\rho^2}{m} \big)} \Big\}.$$

Solving a quadratic equation again, we see that with confidence $1 - \delta$, we have

$$F(\mathbf{x}) \le E_{\mathbf{x}} F(\mathbf{x}) + \frac{4 \kappa \|f\|_\infty}{3m} \log \frac{1}{\delta} + \frac{\kappa \|f\|_\rho}{\sqrt{m}} \sqrt{8 \log \frac{1}{\delta}}.$$

Lemma 1 says that $E_{\mathbf{x}} F(\mathbf{x}) \le \kappa \|f\|_\rho / \sqrt{m}$. Then our conclusion follows.

Theorem 6. Let $z$ be randomly drawn according to $\rho$ satisfying $|y| \le M$ almost everywhere. Then for any $0 < \delta < 1$, with confidence $1 - \delta$ we have

$$\|f_{z,\lambda} - f_\lambda\|_K \le \kappa M \log \frac{4}{\delta} \Big( \frac{30}{\lambda \sqrt{m}} + \frac{4\kappa}{3 \lambda^{3/2} m} \Big).$$

Proof. Since $|y| \le M$ almost everywhere, we know that $\|f_\rho\|_\rho \le \|f_\rho\|_\infty \le M$. Recall the function $\tilde f_\lambda$ defined by (7.5). It satisfies (7.6). Hence

$$\|f_{\mathbf{x},\lambda} - \tilde f_\lambda\|_K \le \frac{1}{\lambda} \Big\| \frac{1}{m} \sum_{i=1}^m f_\rho(x_i) K_{x_i} - L_K f_\rho \Big\|_K.$$

Applying Lemma 3 to the function $f_\rho$, we have with confidence $1 - \delta$,

$$\|f_{\mathbf{x},\lambda} - \tilde f_\lambda\|_K \le \frac{4 \kappa M}{3 \lambda m} \log \frac{1}{\delta} + \frac{\kappa M}{\lambda \sqrt{m}} \Big( 1 + \sqrt{8 \log \frac{1}{\delta}} \Big).$$

In the same way, by Lemma 3 with the function $f_\lambda$ and (7.7), we find

$$\mathrm{Prob}_{\mathbf{x} \in X^m} \Big\{ \|\tilde f_\lambda - f_\lambda\|_K \ge \frac{4 \kappa \|f_\lambda\|_\infty}{3 \lambda m} \log \frac{1}{\delta} + \frac{\kappa \|f_\lambda\|_\rho}{\lambda \sqrt{m}} \Big( 1 + \sqrt{8 \log \frac{1}{\delta}} \Big) \Big\} \le \delta.$$

By (7.8), we have $\|f_\lambda\|_\rho \le 2M$ and $\|f_\lambda\|_\infty \le \kappa \|f_\lambda\|_K \le \kappa M / \sqrt{\lambda}$. Therefore, with confidence $1 - \delta$, there holds

$$\|\tilde f_\lambda - f_\lambda\|_K \le \frac{4 \kappa^2 M}{3 \lambda^{3/2} m} \log \frac{1}{\delta} + \frac{2 \kappa M}{\lambda \sqrt{m}} \Big( 1 + \sqrt{8 \log \frac{1}{\delta}} \Big).$$

Finally, we apply Theorem 5. For each $\mathbf{x} \in X^m$, there holds with confidence $1 - \delta$,

$$\|f_{z,\lambda} - f_{\mathbf{x},\lambda}\|_K \le \frac{\kappa}{\lambda m} \Big( \sqrt{8 \sigma^2 \log \frac{1}{\delta}} + \frac{4}{3} M \log \frac{1}{\delta} \Big). \tag{8.4}$$

Here $\sigma^2 = \sum_{i=1}^m \sigma_{x_i}^2$. Apply the Bernstein inequality

$$\mathrm{Prob}_{\mathbf{x} \in X^m} \Big\{ \frac{1}{m} \sum_{i=1}^m \big( \xi(x_i) - E\xi \big) \ge \varepsilon \Big\} \le \exp\Big\{ -\frac{m \varepsilon^2}{2 \big( B\varepsilon/3 + \sigma^2(\xi) \big)} \Big\}$$

to the random variable $\xi(x) = \int_Y (y - f_\rho(x))^2 \, d\rho(y|x)$. It satisfies $0 \le \xi \le 4M^2$, $E\xi = \sigma_\rho^2$, and $\sigma^2(\xi) \le 4M^2 \sigma_\rho^2$. Then we see that

$$\mathrm{Prob}_{\mathbf{x} \in X^m} \Big\{ \frac{1}{m} \sum_{i=1}^m \sigma_{x_i}^2 \ge \sigma_\rho^2 + \frac{8 M^2 \log(1/\delta)}{3m} + \sqrt{\frac{8 M^2 \sigma_\rho^2 \log(1/\delta)}{m}} \Big\} \le \delta.$$

Hence, with confidence $1 - \delta$,

$$\sigma \le \sqrt{m}\, \sigma_\rho + M \sqrt{\tfrac{8}{3} \log \tfrac{1}{\delta}} + \big( 8 m M^2 \sigma_\rho^2 \log \tfrac{1}{\delta} \big)^{1/4}.$$

Together with (8.4), we see that with probability $1 - 2\delta$ in $Z^m$, we have the bound

$$\|f_{z,\lambda} - f_{\mathbf{x},\lambda}\|_K \le \frac{4 \kappa \sigma_\rho \sqrt{\log(1/\delta)}}{\lambda \sqrt{m}} + \frac{34 \kappa M \log(1/\delta)}{3 \lambda m}.$$

Combining the above three bounds, we know that for $0 < \delta < 1/4$, with confidence $1 - 4\delta$, $\|f_{z,\lambda} - f_\lambda\|_K$ is bounded by

$$\frac{\kappa M}{\lambda \sqrt{m}} \Big\{ 3 + 3 \sqrt{8 \log \tfrac{1}{\delta}} + \frac{4 \sigma_\rho}{M} \sqrt{\log \tfrac{1}{\delta}} + \frac{38}{3} \log \tfrac{1}{\delta} \Big\} + \frac{4 \kappa^2 M}{3 \lambda^{3/2} m} \log \tfrac{1}{\delta}.$$

Since $\sigma_\rho \le M$ and $\log(1/\delta) \ge \log 4$, this is bounded by $\kappa M \log \tfrac{1}{\delta} \big( \tfrac{30}{\lambda \sqrt{m}} + \tfrac{4\kappa}{3 \lambda^{3/2} m} \big)$. Replacing $\delta$ by $\delta/4$, our conclusion follows.

We are in a position to state our convergence rates in both the $\|\cdot\|_K$ and $\|\cdot\|_\rho$ norms.

Corollary 5. Let $z$ be randomly drawn according to $\rho$ satisfying $|y| \le M$ almost everywhere. If $f_\rho$ is in the range of $L_K$, then for any $0 < \delta < 1$, with confidence $1 - \delta$ we have

$$\|f_{z,\lambda} - f_\rho\|_K \le C \log \frac{4}{\delta} \Big( \frac{1}{m} \Big)^{1/6}, \quad \text{by taking } \lambda = \Big( \frac{1}{m} \Big)^{1/3}, \tag{8.5}$$

and

$$\|f_{z,\lambda} - f_\rho\|_\rho \le C \log \frac{4}{\delta} \Big( \frac{1}{m} \Big)^{1/4}, \quad \text{by taking } \lambda = \Big( \frac{1}{m} \Big)^{1/4}, \tag{8.6}$$

where $C$ is a constant independent of the dimension: $C := 30 \kappa M + 2 \kappa^2 M + \|L_K^{-1} f_\rho\|_\rho$.

References

[1] A. Aldroubi and K. Gröchenig, Non-uniform sampling and reconstruction in shift-invariant spaces, SIAM Review 43 (2001), 585-620.

[2] N. Aronszajn, Theory of reproducing kernels, Trans. Amer. Math. Soc. 68 (1950), 337-404.

[3] F. Cucker and S. Smale, On the mathematical foundations of learning, Bull. Amer. Math. Soc. 39 (2001), 1-49.

[4] F. Cucker and S. Smale, Best choices for regularization parameters in learning theory, Found. Comput. Math. 2 (2002), 413-428.

[5] E. De Vito, A. Caponnetto, and L. Rosasco, Model selection for regularized least-squares algorithm in learning theory, Found. Comput. Math., to appear.

[6] H. W. Engl, M. Hanke, and A. Neubauer, Regularization of Inverse Problems, Mathematics and Its Applications 375, Kluwer, Dordrecht, 1996.

[7] T. Evgeniou, M. Pontil, and T. Poggio, Regularization networks and support vector machines, Adv. Comput. Math. 13 (2000), 1-50.

[8] C. McDiarmid, Concentration, in Probabilistic Methods for Algorithmic Discrete Mathematics, Springer-Verlag, Berlin, 1998, pp. 195-248.

[9] P. Niyogi, The Informational Complexity of Learning, Kluwer, 1998.

[10] T. Poggio and S. Smale, The mathematics of learning: dealing with data, Notices Amer. Math. Soc. 50 (2003), 537-544.

[11] S. Smale and D. X. Zhou, Estimating the approximation error in learning theory, Anal. Appl. 1 (2003), 17-41.

[12] S. Smale and D. X. Zhou, Shannon sampling and function reconstruction from point values, Bull. Amer. Math. Soc. 41 (2004), 279-305.

[13] M. Unser, Sampling—50 years after Shannon, Proc. IEEE 88 (2000), 569-587.

[14] V. Vapnik, Statistical Learning Theory, John Wiley & Sons, 1998.

[15] G. Wahba, Spline Models for Observational Data, SIAM, 1990.

[16] Y. Ying, McDiarmid inequalities of Bernstein and Bennett forms, Technical Report, City University of Hong Kong, 2004.

[17] R. M. Young, An Introduction to Non-Harmonic Fourier Series, Academic Press, New York, 1980.

[18] T. Zhang, Leave-one-out bounds for kernel methods, Neural Comput. 15 (2003), 1397-1437.