Shannon Sampling II: Connections to Learning Theory


Steve Smale
Toyota Technological Institute at Chicago
1427 East 60th Street, Chicago, IL 60637, USA
E-mail: smale@math.berkeley.edu

Ding-Xuan Zhou
Department of Mathematics, City University of Hong Kong
Tat Chee Avenue, Kowloon, Hong Kong, CHINA
E-mail: mazhou@math.cityu.edu.hk

July 2004

The first author is partially supported by NSF grant 035113. The second author is supported partially by the Research Grants Council of Hong Kong [Project No. CityU 103704] and by City University of Hong Kong [Project No. 700144].

Abstract

We continue our study [12] of Shannon sampling and function reconstruction. In this paper, the error analysis is improved. Then the problem of function reconstruction is extended to a more general setting with frames beyond point evaluation. Then we show how our approach can be applied to learning theory: a functional analysis framework is presented; sharp, dimension independent probability estimates are given not only for the error in the $L^2$ spaces, but also for the error in the reproducing kernel Hilbert space where the learning algorithm is performed. Covering number arguments are replaced by estimates of integral operators.

Keywords and Phrases: Shannon sampling, function reconstruction, learning theory, reproducing kernel Hilbert space, frames.

1. Introduction

This paper considers regularization schemes associated with the least square loss and Hilbert spaces $H$ of continuous functions.

Our target is to provide a unified approach for two topics: interpolation theory, or more generally, function reconstruction in Shannon sampling theory, with $H$ being a space of band-limited functions or functions with certain decay; and the regression problem in learning theory, with $H$ being a reproducing kernel Hilbert space $H_K$.

First, we improve the probability estimates in [12] with a simplified development. Then we apply the technique for function reconstruction to learning theory. In particular, we show that a regression function $f_\rho$ can be approximated by a regularization scheme $f_{z,\lambda}$ in $H_K$. Dimension independent exponential probability estimates are given for the error $\|f_{z,\lambda} - f_\rho\|_K$. Our error bounds provide clues to the asymptotic choice of the regularization parameter $\gamma$ or $\lambda$.

2. Sampling Operator

Let $H$ be a Hilbert space of continuous functions on a complete metric space $X$ such that the inclusion $J: H \to C(X)$ is bounded with $\|J\| < \infty$. Then for each $x \in X$, the point evaluation functional $f \mapsto f(x)$ is bounded on $H$ with norm at most $\|J\|$. Hence there exists an element $E_x \in H$ such that

$$f(x) = \langle f, E_x \rangle_H, \qquad \forall f \in H. \tag{2.1}$$

Let $\mathbf{x}$ be a discrete subset of $X$. Define the sampling operator $S_{\mathbf{x}}: H \to \ell^2(\mathbf{x})$ by

$$S_{\mathbf{x}} f = \big( f(x) \big)_{x \in \mathbf{x}}.$$

We shall always assume that $S_{\mathbf{x}}$ is bounded. Denote $S_{\mathbf{x}}^*$ as the adjoint of $S_{\mathbf{x}}$. Then for each $c \in \ell^2(\mathbf{x})$, there holds

$$\langle f, S_{\mathbf{x}}^* c \rangle_H = \langle S_{\mathbf{x}} f, c \rangle_{\ell^2(\mathbf{x})} = \sum_{x \in \mathbf{x}} c_x f(x) = \Big\langle f, \sum_{x \in \mathbf{x}} c_x E_x \Big\rangle_H, \qquad \forall f \in H.$$

It follows that

$$S_{\mathbf{x}}^* c = \sum_{x \in \mathbf{x}} c_x E_x, \qquad \forall c \in \ell^2(\mathbf{x}).$$
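When $H$ is finite dimensional the operators above become plain matrices. The following minimal Python sketch (not from the paper; the trigonometric basis and all variable names are illustrative assumptions) checks the adjoint identity $\langle S_{\mathbf{x}} f, c \rangle_{\ell^2(\mathbf{x})} = \langle f, S_{\mathbf{x}}^* c \rangle_H$ numerically.

```python
import numpy as np

# Minimal sketch (not from the paper): take H to be the span of an orthonormal
# trigonometric basis phi_1, ..., phi_n on [0, 1], so every f in H is identified
# with its coefficient vector a and <f, g>_H = a . b.  Then the sampling operator
# S_x is the matrix with rows (phi_j(x_i))_j, and its adjoint is the ordinary
# matrix transpose, matching S_x^* c = sum_i c_i E_{x_i}.

rng = np.random.default_rng(0)
n, m = 5, 12                                   # dim(H), number of sample points
x = np.sort(rng.uniform(0, 1, size=m))         # the discrete sample set x

def basis(t):
    """Orthonormal basis of H in L^2[0,1]: 1, sqrt(2)cos(2 pi k t), sqrt(2)sin(2 pi k t)."""
    return np.array([1.0,
                     np.sqrt(2) * np.cos(2 * np.pi * t), np.sqrt(2) * np.sin(2 * np.pi * t),
                     np.sqrt(2) * np.cos(4 * np.pi * t), np.sqrt(2) * np.sin(4 * np.pi * t)])

S = np.array([basis(t) for t in x])            # S_x as an m x n matrix acting on coefficients

a = rng.standard_normal(n)                     # coefficients of some f in H
c = rng.standard_normal(m)                     # some c in l^2(x)

lhs = (S @ a) @ c                              # <S_x f, c>_{l^2(x)}
rhs = a @ (S.T @ c)                            # <f, S_x^* c>_H
print(np.isclose(lhs, rhs))                    # True: adjoint identity holds
```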

3. Algorithm

To allow noise, we make the following assumption.

Special Assumption. The sampled values $y = (y_x)_{x \in \mathbf{x}}$ have the form: for some $f^* \in H$ and each $x \in \mathbf{x}$,

$$y_x = f^*(x) + \eta_x, \quad \text{where } \eta_x \text{ is drawn from } \rho_x. \tag{3.1}$$

Here for each $x \in X$, $\rho_x$ is a probability measure with zero mean, and its variance $\sigma_x^2$ satisfies $\sigma^2 := \sum_{x \in \mathbf{x}} \sigma_x^2 < \infty$.

Note that $\sum_{x \in \mathbf{x}} (f^*(x))^2 = \|S_{\mathbf{x}} f^*\|_{\ell^2(\mathbf{x})}^2 \le \|S_{\mathbf{x}}\|^2 \|f^*\|_H^2 < \infty$.

The Markov inequality for a nonnegative random variable $\xi$ asserts that

$$\mathrm{Prob}\{\xi \ge E\xi / \delta\} \le \delta, \qquad \forall\, 0 < \delta < 1. \tag{3.2}$$

It tells us that for every $\varepsilon > 0$,

$$\mathrm{Prob}\big\{ \|(\eta_x)\|_{\ell^2(\mathbf{x})}^2 > \varepsilon \big\} \le \frac{E \|(\eta_x)\|_{\ell^2(\mathbf{x})}^2}{\varepsilon} = \frac{\sigma^2}{\varepsilon}.$$

By taking $\varepsilon \to \infty$, we see that $(\eta_x) \in \ell^2(\mathbf{x})$ and hence $y \in \ell^2(\mathbf{x})$ in probability.

Let $\gamma \ge 0$. With the sample $z := (x, y_x)_{x \in \mathbf{x}}$, consider the algorithm

$$\text{Function reconstruction:} \quad f_{z,\gamma} := \arg\min_{f \in H} \Big\{ \sum_{x \in \mathbf{x}} \big( f(x) - y_x \big)^2 + \gamma \|f\|_H^2 \Big\}. \tag{3.3}$$

Theorem 1. If $S_{\mathbf{x}}^* S_{\mathbf{x}} + \gamma I$ is invertible, then $f_{z,\gamma}$ exists, is unique, and

$$f_{z,\gamma} = L y, \qquad L := (S_{\mathbf{x}}^* S_{\mathbf{x}} + \gamma I)^{-1} S_{\mathbf{x}}^*.$$

Proof. Denote $\mathcal{E}_z(f) := \sum_{x \in \mathbf{x}} (f(x) - y_x)^2$. Since $\sum_{x \in \mathbf{x}} (f(x))^2 = \|S_{\mathbf{x}} f\|_{\ell^2(\mathbf{x})}^2 = \langle S_{\mathbf{x}}^* S_{\mathbf{x}} f, f \rangle_H$, we know that for $f \in H$,

$$\mathcal{E}_z(f) + \gamma \|f\|_H^2 = \big\langle (S_{\mathbf{x}}^* S_{\mathbf{x}} + \gamma I) f, f \big\rangle_H - 2 \langle S_{\mathbf{x}}^* y, f \rangle_H + \|y\|_{\ell^2(\mathbf{x})}^2.$$

Taking the functional derivative [10] for $f \in H$, we see that any minimizer $f_{z,\gamma}$ of (3.3) satisfies

$$(S_{\mathbf{x}}^* S_{\mathbf{x}} + \gamma I) f_{z,\gamma} = S_{\mathbf{x}}^* y.$$

This proves Theorem 1. The invertibility of the operator $S_{\mathbf{x}}^* S_{\mathbf{x}} + \gamma I$ is valid for rich data.
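In the same finite-dimensional picture as before, Theorem 1 reduces to a ridge-regression solve. A minimal sketch (illustrative assumptions throughout, not the paper's code) compares $(S_{\mathbf{x}}^* S_{\mathbf{x}} + \gamma I)^{-1} S_{\mathbf{x}}^* y$ with a direct least-squares formulation of (3.3).

```python
import numpy as np

# Minimal sketch of Theorem 1 (illustrative, not the paper's code): with H finite
# dimensional and S the m x n sampling matrix in an orthonormal basis of H, the
# regularized reconstruction f_{z,gamma} = (S^* S + gamma I)^{-1} S^* y is just a
# ridge-regression solve; we check it against a direct least-squares formulation.

rng = np.random.default_rng(1)
m, n, gamma = 20, 6, 0.1
S = rng.standard_normal((m, n))                  # sampling matrix (rows: E_{x_i} coordinates)
y = S @ rng.standard_normal(n) + 0.05 * rng.standard_normal(m)   # noisy samples as in (3.1)

f_ridge = np.linalg.solve(S.T @ S + gamma * np.eye(n), S.T @ y)

# Equivalent augmented least-squares problem: min ||S f - y||^2 + gamma ||f||^2
A = np.vstack([S, np.sqrt(gamma) * np.eye(n)])
b = np.concatenate([y, np.zeros(n)])
f_lsq = np.linalg.lstsq(A, b, rcond=None)[0]

print(np.allclose(f_ridge, f_lsq))               # True
```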

Definition 1. We say that $\mathbf{x}$ provides rich data with respect to $H$ if

$$\lambda_{\mathbf{x}} := \inf_{f \in H} \|S_{\mathbf{x}} f\|_{\ell^2(\mathbf{x})}^2 / \|f\|_H^2 \tag{3.4}$$

is positive. It provides poor data if $\lambda_{\mathbf{x}} = 0$.

The problem of function reconstruction here is to estimate the error $\|f_{z,\gamma} - f^*\|_H$. In this paper we shall show in Corollary 2 below that in the rich data case, with $\gamma = 0$, for every $0 < \delta < 1$, with probability $1 - \delta$, there holds

$$\|f_{z,0} - f^*\|_H \le \frac{\|J\|}{\lambda_{\mathbf{x}}} \sqrt{\frac{\sigma^2}{\delta}}. \tag{3.5}$$

This estimate does not require the boundedness of the noise $\rho_x$. Moreover, under the stronger condition (see [12]) that $|\eta_x| \le M$ for each $x \in \mathbf{x}$, we shall use the McDiarmid inequality and prove in Theorem 5 below that for every $0 < \delta < 1$, with probability $1 - \delta$,

$$\|f_{z,0} - f^*\|_H \le \frac{\|J\|}{\lambda_{\mathbf{x}}} \Big( \sqrt{8 \sigma^2 \log \tfrac{1}{\delta}} + \tfrac{4}{3} M \log \tfrac{1}{\delta} \Big). \tag{3.6}$$

The two estimates, (3.5) and (3.6), improve the bounds in [12]. It turns out that Theorem 4 in [12] is a consequence of the remark which follows it about the Markov inequality. Conversations with David McAllester were important to clarify this point.
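In finite dimensions the richness constant of Definition 1 is simply the smallest eigenvalue of $S_{\mathbf{x}}^* S_{\mathbf{x}}$, i.e. the squared smallest singular value of the sampling matrix. The following sketch (illustrative, not from the paper) contrasts a rich and a poor sample set.

```python
import numpy as np

# Illustrative sketch of Definition 1 (not from the paper): in a finite-dimensional
# H represented by an m x n sampling matrix S, the richness constant
# lambda_x = inf ||S f||^2 / ||f||^2 is the smallest eigenvalue of S^* S.

rng = np.random.default_rng(2)
n = 6

for m in (12, 4):                                # more points than dim(H), then fewer
    S = rng.standard_normal((m, n))
    lam_x = np.linalg.eigvalsh(S.T @ S).min()    # = sigma_min(S)^2 (up to round-off)
    print(f"m = {m:2d}: lambda_x = {lam_x:.4f}",
          "(rich data)" if lam_x > 1e-10 else "(poor data)")
```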

4. Sample Error

Define

$$f_{\mathbf{x},\gamma} := L S_{\mathbf{x}} f^*. \tag{4.1}$$

The sample error takes the form $\|f_{z,\gamma} - f_{\mathbf{x},\gamma}\|_H$.

Theorem 2. If $S_{\mathbf{x}}^* S_{\mathbf{x}} + \gamma I$ is invertible and the Special Assumption holds, then for every $0 < \delta < 1$, with probability $1 - \delta$, there holds

$$\|f_{z,\gamma} - f_{\mathbf{x},\gamma}\|_H \le \big\| (S_{\mathbf{x}}^* S_{\mathbf{x}} + \gamma I)^{-1} \big\| \, \|J\| \sqrt{\frac{\sigma^2}{\delta}}.$$

If $|\eta_x| \le M$ for some $M \ge 0$ and each $x \in \mathbf{x}$, then for every $\varepsilon > 0$, we have

$$\mathrm{Prob}_y \Big\{ \|f_{z,\gamma} - f_{\mathbf{x},\gamma}\|_H \le \|L\| \, \sigma \sqrt{1 + \varepsilon} \Big\} \ge 1 - \exp\Big\{ -\frac{\varepsilon \sigma^2}{2M^2} \log(1 + \varepsilon) \Big\}.$$

Proof. Write $\|f_{z,\gamma} - f_{\mathbf{x},\gamma}\|_H = \|L(y - S_{\mathbf{x}} f^*)\|_H \le \|(S_{\mathbf{x}}^* S_{\mathbf{x}} + \gamma I)^{-1}\| \, \|S_{\mathbf{x}}^*(y - S_{\mathbf{x}} f^*)\|_H$. But

$$S_{\mathbf{x}}^*(y - S_{\mathbf{x}} f^*) = \sum_{x \in \mathbf{x}} \big( y_x - f^*(x) \big) E_x.$$

Hence

$$\|S_{\mathbf{x}}^*(y - S_{\mathbf{x}} f^*)\|_H^2 = \sum_{x \in \mathbf{x}} \sum_{x' \in \mathbf{x}} \big( y_x - f^*(x) \big) \big( y_{x'} - f^*(x') \big) \langle E_x, E_{x'} \rangle_H.$$

By the independence of the samples and $E(y_x - f^*(x)) = 0$, $E(y_x - f^*(x))^2 = \sigma_x^2$, its expected value is

$$E \|S_{\mathbf{x}}^*(y - S_{\mathbf{x}} f^*)\|_H^2 = \sum_{x \in \mathbf{x}} \sigma_x^2 \langle E_x, E_x \rangle_H.$$

Now $\langle E_x, E_x \rangle_H = \|E_x\|_H^2 \le \|J\|^2$. Then the expected value of the squared sample error can be bounded as

$$E \|f_{z,\gamma} - f_{\mathbf{x},\gamma}\|_H^2 \le \big\| (S_{\mathbf{x}}^* S_{\mathbf{x}} + \gamma I)^{-1} \big\|^2 \, \|J\|^2 \sigma^2.$$

The first desired probability estimate follows from the Markov inequality (3.2).

For the second estimate, we apply Theorem 3 from [12] with $w \equiv 1$ to the random variables $\{\eta_x\}_{x \in \mathbf{x}}$. The Special Assumption tells us that $E \eta_x = 0$, which implies $E \eta_x^2 = \sigma_x^2$. Then we see that for every $\varepsilon > 0$,

$$\mathrm{Prob}_y \Big\{ \sum_{x \in \mathbf{x}} \big( \eta_x^2 - \sigma_x^2 \big) > \varepsilon \Big\} \le \exp\Big\{ -\frac{\varepsilon}{2M^2} \log\Big( 1 + \frac{M^2 \varepsilon}{\sum_{x \in \mathbf{x}} \sigma^2(\eta_x^2)} \Big) \Big\}.$$

Here we have used the condition $|\eta_x| \le M$, which implies $|\eta_x^2 - \sigma_x^2| \le M^2$ and

$$\sum_{x \in \mathbf{x}} \sigma^2(\eta_x^2) \le \sum_{x \in \mathbf{x}} E \eta_x^4 \le M^2 \sigma^2 < \infty.$$

The desired bound then follows from $\|f_{z,\gamma} - f_{\mathbf{x},\gamma}\|_H \le \|L\| \, \|(\eta_x)\|_{\ell^2(\mathbf{x})}$.

Remark. When $\mathbf{x}$ contains $m$ elements, we can take $\sigma^2 \le m M^2 < \infty$.

Proposition 1. The sampling operator $S_{\mathbf{x}}$ satisfies

$$\big\| (S_{\mathbf{x}}^* S_{\mathbf{x}} + \gamma I)^{-1} \big\| \le \frac{1}{\lambda_{\mathbf{x}} + \gamma}.$$

For the operator $L$, we have

$$\|L\| \le \frac{\|S_{\mathbf{x}}\|}{\lambda_{\mathbf{x}} + \gamma}.$$

Proof. Let $v \in H$ and $u = (S_{\mathbf{x}}^* S_{\mathbf{x}} + \gamma I)^{-1} v$. Then $(S_{\mathbf{x}}^* S_{\mathbf{x}} + \gamma I) u = v$. Taking inner products on both sides with $u$, we have

$$\langle S_{\mathbf{x}} u, S_{\mathbf{x}} u \rangle_{\ell^2(\mathbf{x})} + \gamma \|u\|_H^2 = \langle v, u \rangle_H \le \|v\|_H \|u\|_H.$$

The definition of the richness $\lambda_{\mathbf{x}}$ tells us that $\langle S_{\mathbf{x}} u, S_{\mathbf{x}} u \rangle_{\ell^2(\mathbf{x})} = \|S_{\mathbf{x}} u\|_{\ell^2(\mathbf{x})}^2 \ge \lambda_{\mathbf{x}} \|u\|_H^2$. It follows that

$$(\lambda_{\mathbf{x}} + \gamma) \|u\|_H^2 \le \|v\|_H \|u\|_H.$$

Hence $\|u\|_H \le (\lambda_{\mathbf{x}} + \gamma)^{-1} \|v\|_H$. This is true for every $v \in H$. Then the bound for the first operator follows. The second inequality is trivial.

Corollary 1. If $S_{\mathbf{x}}^* S_{\mathbf{x}} + \gamma I$ is invertible and the Special Assumption holds, then for every $0 < \delta < 1$, with probability $1 - \delta$, there holds

$$\|f_{z,\gamma} - f_{\mathbf{x},\gamma}\|_H \le \frac{\|J\|}{\lambda_{\mathbf{x}} + \gamma} \sqrt{\frac{\sigma^2}{\delta}}.$$

5. Integration Error

Recall that $f_{\mathbf{x},\gamma} = L S_{\mathbf{x}} f^* = (S_{\mathbf{x}}^* S_{\mathbf{x}} + \gamma I)^{-1} S_{\mathbf{x}}^* S_{\mathbf{x}} f^*$. Then

$$f_{\mathbf{x},\gamma} = (S_{\mathbf{x}}^* S_{\mathbf{x}} + \gamma I)^{-1} \big( S_{\mathbf{x}}^* S_{\mathbf{x}} + \gamma I - \gamma I \big) f^* = f^* - \gamma (S_{\mathbf{x}}^* S_{\mathbf{x}} + \gamma I)^{-1} f^*. \tag{5.1}$$

This in connection with Proposition 1 proves the following proposition.

Proposition If S x S x + γi is invertible, then f x,γ f H γ f H λ x + γ Corollary If λ x > 0 and γ = 0, then for every 0 < δ < 1, with probability 1 δ, there holds 51 holds f f H J σ /δ λ x For the poor data case λ x = 0, we need to estiate γ S x S x + γi 1 f according to Recall that for a positive self-adjoint linear operator L on a Hilbert space H, there γ L + γi 1 f H = γ L + γi 1 f Lg + Lg H f Lg H + γ g H for every g H aking the infiu over g H, we have γ L + γi 1 { f H Kf, γ := inf f Lg H + γ g H, f H, γ > 0 5 g H his is the K-functional between H and the range of L hus, when the range of L is dense in H, we have li γ 0 γ L + γi 1 f H = 0 for every f H If f is in the range of L r for soe 0 < r 1, then γ L + γi 1 f H L r f H γ r See [11] Using 5 for L = S x S x, we can use a K-functional between H and the range of S x S x to get the convergence rate Proposition 3 Define f γ as fγ { := arg inf f S x S x g H + γ g H, γ > 0, g H then there holds f x,γ f H f S x S x f γ H + γ f γ H In particular, if f lies in the closure of the range of Sx S x, then li γ 0 f x,γ f H = 0 If f is in the range of Sx S r x for soe 0 < r 1, then fx,γ f H rf Sx S x H γ r Copared with Corollary, Proposition 3 in connection with Corollary 1 gives an error estiate for the poor data case when f is in the range of r Sx S x For every 7

6. More General Setting of Function Reconstruction

From (2.1) we see that the boundedness of $S_{\mathbf{x}}$ is equivalent to the Bessel sequence property of the family $\{E_x\}_{x \in \mathbf{x}}$ of elements in $H$, i.e., there is a positive constant $B$ such that

$$\sum_{x \in \mathbf{x}} \big| \langle f, E_x \rangle_H \big|^2 \le B \|f\|_H^2, \qquad \forall f \in H. \tag{6.1}$$

Moreover, $\mathbf{x}$ provides rich data if and only if this family forms a frame of $H$, i.e., there are two positive constants $A \le B$ (called frame bounds) such that

$$A \|f\|_H^2 \le \sum_{x \in \mathbf{x}} \big| \langle f, E_x \rangle_H \big|^2 \le B \|f\|_H^2, \qquad \forall f \in H.$$

In this case, the operator $S_{\mathbf{x}}^* S_{\mathbf{x}}$ is called the frame operator. Its inverse is usually difficult to compute, but it satisfies the reconstruction property:

$$f = \sum_{x \in \mathbf{x}} \big\langle f, (S_{\mathbf{x}}^* S_{\mathbf{x}})^{-1} E_x \big\rangle_H \, E_x, \qquad \forall f \in H.$$

For these basic facts about frames, see [17].

The function reconstruction algorithm studied in the previous sections can be generalized to a setting with a Bessel sequence $\{E_x\}_{x \in \mathbf{x}}$ in $H$ satisfying (6.1). Here the point evaluation (2.1) is replaced by the functional $\langle f, E_x \rangle_H$ and the algorithm becomes

$$f_{z,\gamma} := \arg\min_{f \in H} \Big\{ \sum_{x \in \mathbf{x}} \big( \langle f, E_x \rangle_H - y_x \big)^2 + \gamma \|f\|_H^2 \Big\}. \tag{6.2}$$

The sample values are given by $y_x = \langle f^*, E_x \rangle_H + \eta_x$. If we replace the sampling operator $S_{\mathbf{x}}$ by the operator from $H$ to $\ell^2(\mathbf{x})$ mapping $f$ to $\big( \langle f, E_x \rangle_H \big)_{x \in \mathbf{x}}$, then the algorithm can be analyzed in the same way as above and all the error bounds hold true.

Concrete examples for this generalized setting can be found in the literature of image processing, inverse problems [6] and sampling theory [1]: the Fredholm integral equation of the first kind, the moment problem, and function reconstruction from weighted averages.

One can even consider more general function reconstruction schemes: replacing the least-square loss in (6.2) by some other loss function and the norm $\|f\|_H^2$ by some other norm. For example, if we choose Vapnik's $\epsilon$-insensitive loss $|t|_\epsilon := \max\{|t| - \epsilon, 0\}$, and a function space $\mathcal{H}$ included in $H$ such as a Sobolev space in $L^2$, then a function reconstruction scheme becomes

$$f_{z,\gamma} := \arg\min_{f \in \mathcal{H}} \Big\{ \sum_{x \in \mathbf{x}} \big| \langle f, E_x \rangle_H - y_x \big|_\epsilon + \gamma \|f\|_{\mathcal{H}}^2 \Big\}.$$

The rich data requirement is reasonable for function reconstruction such as sampling theory [13]. On the other hand, in learning theory, the situation of poor data or poor frame bounds ($A \to 0$ as the number of points in $\mathbf{x}$ increases) often happens. For such situations, we take $\mathbf{x}$ to be random samples of some probability distribution.

7. Learning Theory

From now on we assume that $X$ is compact. Let $\rho$ be a probability measure on $Z := X \times Y$ with $Y := \mathbb{R}$. The error for a function $f: X \to Y$ is given by

$$\mathcal{E}(f) := \int_Z \big( f(x) - y \big)^2 \, d\rho.$$

The function minimizing the error is called the regression function and is given by

$$f_\rho(x) := \int_Y y \, d\rho(y|x), \qquad x \in X.$$

Here $\rho(y|x)$ is the conditional distribution at $x$ induced by $\rho$. The marginal distribution on $X$ is denoted as $\rho_X$. We assume that $f_\rho \in L^2_{\rho_X}$. Denote $\|f\|_\rho := \|f\|_{L^2_{\rho_X}}$ and the variance of $\rho$ as $\sigma_\rho^2$.

The purpose of the regression problem in learning theory [3, 7, 9, 14, 15] is to find good approximations of the regression function from a set of random samples $z := \{(x_i, y_i)\}_{i=1}^m$ drawn independently according to $\rho$. This purpose is achieved in Corollaries 3, 4, and 5 below. Here we consider kernel based learning algorithms.

Let $K: X \times X \to \mathbb{R}$ be continuous, symmetric and positive semidefinite, i.e., for any finite set of distinct points $\{x_1, \ldots, x_l\} \subset X$, the matrix $\big( K(x_i, x_j) \big)_{i,j=1}^l$ is positive semidefinite. Such a kernel is called a Mercer kernel.

The Reproducing Kernel Hilbert Space (RKHS) $H_K$ associated with the kernel $K$ is defined to be the closure [2] of the linear span of the set of functions $\{K_x := K(x, \cdot) : x \in X\}$ with the inner product $\langle \cdot, \cdot \rangle_K$ satisfying $\langle K_x, K_y \rangle_K = K(x, y)$. The reproducing property takes the form

$$\langle K_x, g \rangle_K = g(x), \qquad \forall x \in X, \ g \in H_K. \tag{7.1}$$

The learning algorithm we study here is a regularized one:

$$\text{Learning scheme:} \quad f_{z,\lambda} := \arg\min_{f \in H_K} \Big\{ \frac{1}{m} \sum_{i=1}^m \big( f(x_i) - y_i \big)^2 + \lambda \|f\|_K^2 \Big\}. \tag{7.2}$$

We shall investigate how $f_{z,\lambda}$ approximates $f_\rho$ and how the choice of the regularization parameter $\lambda$ leads to optimal convergence rates. The convergence in $L^2_{\rho_X}$ has been considered in [4, 5, 18]. The purpose of this section is to present a simple functional analysis approach, and to provide the convergence rates in the space $H_K$ as well as sharper, dimension independent probability estimates in $L^2_{\rho_X}$.

The reproducing kernel property (7.1) tells us that the minimizer of (7.2) lies in $H_{K,z} := \mathrm{span}\{K_{x_i}\}_{i=1}^m$, by projection onto this subspace. Thus, the algorithm can be written in the same way as (3.3). To see this, we denote $\mathbf{x} = \{x_i\}_{i=1}^m$ and, for $x \in X$, let $\rho_x$ be the conditional distribution $\rho(\cdot|x)$ shifted by $-f_\rho(x)$. Then $E_x = K_x$ for $x \in \mathbf{x}$, the Special Assumption holds, and (3.1) is true except that $f^* \in H$ is replaced by $f^* = f_\rho$. Denote $y = (y_i)_{i=1}^m$. The learning scheme (7.2) becomes

$$f_{z,\lambda} := \arg\min_{f \in H_{K,z}} \Big\{ \sum_{x \in \mathbf{x}} \big( f(x) - y_x \big)^2 + \gamma \|f\|_K^2 \Big\}, \qquad \gamma = \lambda m.$$

Therefore, Theorem 1 still holds and we have

$$f_{z,\lambda} = (S_{\mathbf{x}}^* S_{\mathbf{x}} + \lambda m I)^{-1} S_{\mathbf{x}}^* y.$$

This implies the expression (see, e.g., [3]) that $f_{z,\lambda} = \sum_{i=1}^m c_i K_{x_i}$ with $c = (c_i)_{i=1}^m$ satisfying

$$\Big( \big( K(x_i, x_j) \big)_{i,j=1}^m + \lambda m I \Big) c = y.$$
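In practice the linear system above is all one needs to compute $f_{z,\lambda}$. A minimal sketch follows (not the paper's code; the Gaussian kernel, the regression function and all parameters are illustrative assumptions).

```python
import numpy as np

# Minimal sketch of the learning scheme (7.2) (illustrative, not the paper's code):
# f_{z,lambda} = sum_i c_i K_{x_i} with ((K(x_i, x_j))_{i,j} + lambda*m*I) c = y,
# here with a Gaussian Mercer kernel and a noisy regression function on [0, 1].

rng = np.random.default_rng(4)
m, lam = 100, 1e-3
K = lambda s, t: np.exp(-(s[:, None] - t[None, :]) ** 2 / 0.05)   # Mercer kernel
f_rho = lambda t: np.sin(2 * np.pi * t)

x = rng.uniform(0, 1, m)
y = f_rho(x) + 0.1 * rng.standard_normal(m)        # samples z = (x_i, y_i)

c = np.linalg.solve(K(x, x) + lam * m * np.eye(m), y)   # coefficient vector

f_z = lambda t: K(t, x) @ c                        # f_{z,lambda}(t) = sum_i c_i K(x_i, t)

t = np.linspace(0, 1, 200)
print("RMS error of f_{z,lambda} against f_rho:",
      np.sqrt(np.mean((f_z(t) - f_rho(t)) ** 2)))
```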

Denote $\kappa := \sup_{x \in X} \sqrt{K(x, x)}$ and $f_\rho(\mathbf{x}) := \big( f_\rho(x) \big)_{x \in \mathbf{x}}$. Define

$$f_{\mathbf{x},\lambda} := (S_{\mathbf{x}}^* S_{\mathbf{x}} + \lambda m I)^{-1} S_{\mathbf{x}}^* f_\rho(\mathbf{x}).$$

Observe that $S_{\mathbf{x}}^*: \mathbb{R}^m \to H_{K,z}$ is given by $S_{\mathbf{x}}^* c = \sum_{i=1}^m c_i K_{x_i}$. Then $S_{\mathbf{x}}^* S_{\mathbf{x}}$ satisfies

$$S_{\mathbf{x}}^* S_{\mathbf{x}} f = \sum_{x \in \mathbf{x}} f(x) K_x = m \, L_{K,\mathbf{x}} S_{\mathbf{x}} f, \qquad f \in H_{K,z},$$

where $L_{K,\mathbf{x}}: \ell^2(\mathbf{x}) \to H_K$ is defined as

$$L_{K,\mathbf{x}} c := \frac{1}{m} \sum_{i=1}^m c_i K_{x_i}.$$

It is a good approximation of the integral operator $L_K: L^2_{\rho_X} \to H_K$ defined by

$$L_K f(x) := \int_X K(x, y) f(y) \, d\rho_X(y), \qquad x \in X.$$

The operator $L_K$ can also be defined as a self-adjoint operator on $H_K$ or on $L^2_{\rho_X}$. We shall use the same notation $L_K$ for these operators defined on different domains. As operators on $H_K$, $L_{K,\mathbf{x}} S_{\mathbf{x}}$ approximates $L_K$ well. In fact, it was shown in [5] that $E \|L_{K,\mathbf{x}} S_{\mathbf{x}} - L_K\|_{H_K \to H_K} \le \kappa^2 / \sqrt{m}$. To get sharper error bounds, we need estimates for the operators with domain $L^2_{\rho_X}$.

Lemma 1. Let $\mathbf{x} \subset X$ be randomly drawn according to $\rho_X$. Then for any $f \in L^2_{\rho_X}$,

$$E \big\| L_{K,\mathbf{x}} (f|_{\mathbf{x}}) - L_K f \big\|_K^2 = E \Big\| \frac{1}{m} \sum_{i=1}^m f(x_i) K_{x_i} - L_K f \Big\|_K^2 \le \frac{\kappa^2 \|f\|_\rho^2}{m}.$$

Proof. Define $\xi$ to be the $H_K$-valued random variable $\xi := f(x) K_x$ over $(X, \rho_X)$. Then

$$\frac{1}{m} \sum_{i=1}^m f(x_i) K_{x_i} - L_K f = \frac{1}{m} \sum_{i=1}^m \big( \xi(x_i) - E\xi \big).$$

We know that

$$E \Big\| \frac{1}{m} \sum_{i=1}^m \big( \xi(x_i) - E\xi \big) \Big\|_K^2 = \frac{1}{m} E \|\xi - E\xi\|_K^2 = \frac{1}{m} \Big( E \|\xi\|_K^2 - \|E\xi\|_K^2 \Big),$$

which is bounded by $\kappa^2 \|f\|_\rho^2 / m$, since $E \|\xi\|_K^2 = \int_X f(y)^2 K(y, y) \, d\rho_X(y) \le \kappa^2 \|f\|_\rho^2$.
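A quick Monte Carlo sanity check of Lemma 1 (illustrative only; $L_K f$ is approximated by a fine quadrature rule, so the comparison is itself approximate, and all choices below are assumptions, not from the paper):

```python
import numpy as np

# Monte Carlo check of Lemma 1 (illustrative): for X = [0,1] with rho_X uniform and
# a Gaussian kernel, compare E ||(1/m) sum_i f(x_i) K_{x_i} - L_K f||_K^2 with
# kappa^2 ||f||_rho^2 / m.  RKHS norms of finite kernel expansions g = sum_j a_j K_{t_j}
# are computed as a^T K(t, t) a; L_K f is replaced by a fine quadrature approximation.

rng = np.random.default_rng(3)
K = lambda s, t: np.exp(-(s[:, None] - t[None, :]) ** 2 / 0.1)   # Mercer kernel
f = lambda t: np.sin(2 * np.pi * t)

t = (np.arange(400) + 0.5) / 400          # quadrature nodes for L_K f
w_quad = np.full(400, 1.0 / 400)          # quadrature weights (uniform rho_X)

m, trials, errs = 50, 200, []
for _ in range(trials):
    x = rng.uniform(0, 1, m)
    pts = np.concatenate([x, t])          # expansion points of the difference
    coef = np.concatenate([f(x) / m, -w_quad * f(t)])
    G = K(pts, pts)                       # Gram matrix
    errs.append(coef @ G @ coef)          # ||.||_K^2 of the difference

kappa2 = 1.0                              # sup_x K(x, x) for this kernel
f_rho2 = np.sum(w_quad * f(t) ** 2)       # ||f||_rho^2 (same quadrature)
print(np.mean(errs), "<=", kappa2 * f_rho2 / m)
```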

The function $f_{\mathbf{x},\lambda}$ may be considered as an approximation of $f_\lambda$, where $f_\lambda$ is defined by

$$f_\lambda := (L_K + \lambda I)^{-1} L_K f_\rho. \tag{7.3}$$

In fact, $f_\lambda$ is a minimizer of the optimization problem:

$$f_\lambda := \arg\min_{f \in H_K} \big\{ \|f - f_\rho\|_\rho^2 + \lambda \|f\|_K^2 \big\} = \arg\min_{f \in H_K} \big\{ \mathcal{E}(f) - \mathcal{E}(f_\rho) + \lambda \|f\|_K^2 \big\}. \tag{7.4}$$

Theorem 3. Let $z$ be randomly drawn according to $\rho$. Then

$$E_{z \in Z^m} \|f_{z,\lambda} - f_{\mathbf{x},\lambda}\|_K \le \frac{\kappa \sigma_\rho}{\lambda \sqrt{m}}$$

and

$$E_{\mathbf{x} \in X^m} \|f_{\mathbf{x},\lambda} - f_\lambda\|_K \le \frac{3 \kappa \|f_\rho\|_\rho}{\lambda \sqrt{m}}.$$

Proof. The same proof as that of Theorem 2 and Proposition 1 shows that

$$E_y \|f_{z,\lambda} - f_{\mathbf{x},\lambda}\|_K^2 \le \frac{\kappa^2 \sum_{i=1}^m \sigma_{x_i}^2}{(\lambda_{\mathbf{x}} + \lambda m)^2}.$$

But $E_{\mathbf{x}} \sum_{i=1}^m \sigma_{x_i}^2 = m \sigma_\rho^2$. Then the first statement follows.

To see the second statement we write $f_{\mathbf{x},\lambda} - f_\lambda$ as $(f_{\mathbf{x},\lambda} - \tilde f_\lambda) + (\tilde f_\lambda - f_\lambda)$, where $\tilde f_\lambda$ is defined by

$$\tilde f_\lambda := (L_{K,\mathbf{x}} S_{\mathbf{x}} + \lambda I)^{-1} L_K f_\rho. \tag{7.5}$$

Since

$$f_{\mathbf{x},\lambda} - \tilde f_\lambda = (L_{K,\mathbf{x}} S_{\mathbf{x}} + \lambda I)^{-1} \Big( \frac{1}{m} S_{\mathbf{x}}^* f_\rho(\mathbf{x}) - L_K f_\rho \Big), \tag{7.6}$$

Lemma 1 tells us that

$$E \|f_{\mathbf{x},\lambda} - \tilde f_\lambda\|_K \le \frac{1}{\lambda} E \Big\| \frac{1}{m} S_{\mathbf{x}}^* f_\rho(\mathbf{x}) - L_K f_\rho \Big\|_K \le \frac{\kappa \|f_\rho\|_\rho}{\lambda \sqrt{m}}.$$

To estimate $\tilde f_\lambda - f_\lambda$, we write $L_K f_\rho$ as $(L_K + \lambda I) f_\lambda$. Then

$$\tilde f_\lambda - f_\lambda = (L_{K,\mathbf{x}} S_{\mathbf{x}} + \lambda I)^{-1} (L_K + \lambda I) f_\lambda - f_\lambda = (L_{K,\mathbf{x}} S_{\mathbf{x}} + \lambda I)^{-1} \big( L_K f_\lambda - L_{K,\mathbf{x}} S_{\mathbf{x}} f_\lambda \big).$$

Hence

$$\|\tilde f_\lambda - f_\lambda\|_K \le \frac{1}{\lambda} \|L_K f_\lambda - L_{K,\mathbf{x}} S_{\mathbf{x}} f_\lambda\|_K. \tag{7.7}$$

Applying Lemma 1 again, we see that

$$E \|\tilde f_\lambda - f_\lambda\|_K \le \frac{1}{\lambda} E \|L_K f_\lambda - L_{K,\mathbf{x}} S_{\mathbf{x}} f_\lambda\|_K \le \frac{\kappa \|f_\lambda\|_\rho}{\lambda \sqrt{m}}.$$

Note that $f_\lambda$ is a minimizer of (7.4). Taking $f = 0$ yields $\|f_\lambda - f_\rho\|_\rho^2 + \lambda \|f_\lambda\|_K^2 \le \|f_\rho\|_\rho^2$. Hence

$$\|f_\lambda\|_\rho \le 2 \|f_\rho\|_\rho \quad \text{and} \quad \|f_\lambda\|_K \le \|f_\rho\|_\rho / \sqrt{\lambda}. \tag{7.8}$$

Therefore, our second estimate follows.

The last step is to estimate the approximation error $f_\lambda - f_\rho$.

Theorem 4. Define $f_\lambda$ by (7.3). If $L_K^{-r} f_\rho \in L^2_{\rho_X}$ for some $0 < r \le 1$, then

$$\|f_\lambda - f_\rho\|_\rho \le \lambda^r \|L_K^{-r} f_\rho\|_\rho. \tag{7.9}$$

When $\tfrac{1}{2} < r \le 1$, we have

$$\|f_\lambda - f_\rho\|_K \le \lambda^{r - 1/2} \|L_K^{-r} f_\rho\|_\rho. \tag{7.10}$$

We follow the same line as we did in [11]. Estimates similar to (7.9) can be found in [3, Theorem 3.1]: for a self-adjoint strictly positive compact operator $A$ on a Hilbert space $H$, there holds for $0 < r < s$,

$$\inf_{b \in H} \big\{ \|b - a\| + \gamma \|A^{-s} b\| \big\} \le \gamma^{r/s} \|A^{-r} a\|. \tag{7.11}$$

A mistake was made in [3] when scaling from $s = 1$ to general $s > 0$: $r$ should be $< 1$ in the general situation. A proof of (7.9) was given in [5]. Here we provide a complete proof because the idea is used for verifying (7.10).

Proof of Theorem 4. If $\{(\lambda_i, \psi_i)\}_{i \ge 1}$ are the normalized eigenpairs of the integral operator $L_K: L^2_{\rho_X} \to L^2_{\rho_X}$, then $\|\sqrt{\lambda_i}\, \psi_i\|_K = 1$ when $\lambda_i > 0$. Write $f_\rho = L_K^r g$ for some $g = \sum_{i \ge 1} d_i \psi_i$ with $\|(d_i)\|_{\ell^2} = \|g\|_\rho < \infty$. Then $f_\rho = \sum_{i \ge 1} \lambda_i^r d_i \psi_i$ and by (7.3),

$$f_\lambda - f_\rho = (L_K + \lambda I)^{-1} L_K f_\rho - f_\rho = -\sum_{i \ge 1} \frac{\lambda}{\lambda_i + \lambda} \lambda_i^r d_i \psi_i.$$

It follows that

$$\|f_\lambda - f_\rho\|_\rho = \Big\{ \sum_{i \ge 1} \Big( \frac{\lambda}{\lambda_i + \lambda} \lambda_i^r d_i \Big)^2 \Big\}^{1/2} = \lambda^r \Big\{ \sum_{i \ge 1} \Big( \Big( \frac{\lambda_i}{\lambda_i + \lambda} \Big)^r \Big( \frac{\lambda}{\lambda + \lambda_i} \Big)^{1 - r} d_i \Big)^2 \Big\}^{1/2}.$$

This is bounded by $\lambda^r \|(d_i)\|_{\ell^2} = \lambda^r \|g\|_\rho = \lambda^r \|L_K^{-r} f_\rho\|_\rho$. Hence (7.9) holds.

When $r > \tfrac{1}{2}$, we have

$$\|f_\lambda - f_\rho\|_K^2 = \sum_{\lambda_i > 0} \frac{1}{\lambda_i} \Big( \frac{\lambda}{\lambda_i + \lambda} \lambda_i^r d_i \Big)^2 = \lambda^{2r - 1} \sum_{i \ge 1} \Big( \frac{\lambda_i}{\lambda_i + \lambda} \Big)^{2r - 1} \Big( \frac{\lambda}{\lambda + \lambda_i} \Big)^{3 - 2r} d_i^2.$$

This is again bounded by $\lambda^{2r - 1} \|(d_i)\|_{\ell^2}^2 = \lambda^{2r - 1} \|L_K^{-r} f_\rho\|_\rho^2$. The second statement (7.10) has been verified.

Combining Theorems 3 and 4, we find the expected value of the error $f_{z,\lambda} - f_\rho$. By choosing the optimal parameter in this bound, we get the following convergence rates.

Corollary 3. Let $z$ be randomly drawn according to $\rho$. Assume $L_K^{-r} f_\rho \in L^2_{\rho_X}$ for some $\tfrac{1}{2} < r \le 1$. Then

$$E_{z \in Z^m} \|f_{z,\lambda} - f_\rho\|_K \le C_{\rho,K} \Big( \frac{1}{\lambda \sqrt{m}} + \lambda^{r - 1/2} \Big), \tag{7.12}$$

where $C_{\rho,K} := \kappa \sigma_\rho + 3\kappa \|f_\rho\|_\rho + \|L_K^{-r} f_\rho\|_\rho$ is independent of the dimension. Hence

$$\lambda = \Big( \frac{1}{m} \Big)^{\frac{1}{2r+1}} \implies E_{z \in Z^m} \|f_{z,\lambda} - f_\rho\|_K \le 2 C_{\rho,K} \Big( \frac{1}{m} \Big)^{\frac{2r-1}{4r+2}}. \tag{7.13}$$

Remark. Corollary 3 provides estimates for the $H_K$-norm error of $f_{z,\lambda} - f_\rho$. So we require $f_\rho \in H_K$, which is equivalent to $L_K^{-1/2} f_\rho \in L^2_{\rho_X}$. To get convergence rates we assume a stronger condition $L_K^{-r} f_\rho \in L^2_{\rho_X}$ for some $\tfrac{1}{2} < r \le 1$. The optimal rate derived from Corollary 3 is achieved by $r = 1$. In this case, $E_{z \in Z^m} \|f_{z,\lambda} - f_\rho\|_K = O\big( (1/m)^{1/6} \big)$. The norm $\|f_{z,\lambda} - f_\rho\|_K$ cannot be bounded by the error $\mathcal{E}(f_{z,\lambda}) - \mathcal{E}(f_\rho)$, hence our bound for the $H_K$-norm error is new in learning theory.

Corollary 4. Let $z$ be randomly drawn according to $\rho$. If $L_K^{-r} f_\rho \in L^2_{\rho_X}$ for some $0 < r \le 1$, then

$$E_{z \in Z^m} \|f_{z,\lambda} - f_\rho\|_\rho \le C_{\rho,K} \Big( \frac{1}{\lambda \sqrt{m}} + \lambda^r \Big), \tag{7.14}$$

where $C_{\rho,K} := \kappa \sigma_\rho + 3\kappa \|f_\rho\|_\rho + \|L_K^{-r} f_\rho\|_\rho$. Thus,

$$\lambda = \Big( \frac{1}{m} \Big)^{\frac{1}{2r+2}} \implies E_{z \in Z^m} \|f_{z,\lambda} - f_\rho\|_\rho \le 2 C_{\rho,K} \Big( \frac{1}{m} \Big)^{\frac{r}{2r+2}}. \tag{7.15}$$

Remark. The convergence rate (7.15) for the $L^2_{\rho_X}$-norm is obtained by optimizing the regularization parameter $\lambda$ in (7.14). The sharp rate derived from Corollary 4 is $(1/m)^{1/4}$, which is achieved by $r = 1$.
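The parameter choices in (7.13) and (7.15) simply balance the two terms of (7.12) and (7.14). A small grid search (with $C_{\rho,K}$ set to 1, purely for illustration and not from the paper) recovers the same orders of magnitude.

```python
import numpy as np

# Illustrative check of the parameter choices in (7.13) and (7.15): minimize the
# right-hand sides of (7.12) and (7.14) over lambda on a grid (with C_{rho,K} = 1)
# and compare with lambda = m^{-1/(2r+1)} and lambda = m^{-1/(2r+2)}.

m, r = 10_000, 1.0
lams = np.logspace(-6, 0, 4001)                             # grid of candidate lambdas

bound_K   = 1.0 / (lams * np.sqrt(m)) + lams ** (r - 0.5)   # right-hand side of (7.12)
bound_rho = 1.0 / (lams * np.sqrt(m)) + lams ** r           # right-hand side of (7.14)

print("H_K norm: grid argmin =", lams[bound_K.argmin()],
      " stated choice m^(-1/(2r+1)) =", m ** (-1 / (2 * r + 1)))
print("rho norm: grid argmin =", lams[bound_rho.argmin()],
      " stated choice m^(-1/(2r+2)) =", m ** (-1 / (2 * r + 2)))
```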

In [18], a leave-one-out technique was used to derive the expected value of the error of learning schemes. For the scheme (7.2), the result can be expressed as

$$E_{z \in Z^m} \big( \mathcal{E}(f_{z,\lambda}) \big) \le \Big( 1 + \frac{2\kappa^2}{\lambda m} \Big)^2 \inf_{f \in H_K} \Big\{ \mathcal{E}(f) + \frac{\lambda}{2} \|f\|_K^2 \Big\}. \tag{7.16}$$

Notice that $\mathcal{E}(f) - \mathcal{E}(f_\rho) = \|f - f_\rho\|_\rho^2$. If we denote the regularization error (see [12]) as

$$D(\lambda) := \inf_{f \in H_K} \big\{ \mathcal{E}(f) - \mathcal{E}(f_\rho) + \lambda \|f\|_K^2 \big\} = \inf_{f \in H_K} \big\{ \|f - f_\rho\|_\rho^2 + \lambda \|f\|_K^2 \big\}, \tag{7.17}$$

then the bound (7.16) can be restated as

$$E_{z \in Z^m} \|f_{z,\lambda} - f_\rho\|_\rho^2 \le D(\lambda/2) + \big( \mathcal{E}(f_\rho) + D(\lambda/2) \big) \Big( \frac{4\kappa^2}{\lambda m} + \frac{4\kappa^4}{\lambda^2 m^2} \Big).$$

One can then derive the convergence rate $(1/m)^{1/4}$ in expectation when $f_\rho \in H_K$ and $\mathcal{E}(f_\rho) > 0$. In fact, (7.11) with $H = L^2_{\rho_X}$, $A = L_K^{1/2}$ holds for $r = s = 1/2$, which yields the best rate for the regularization error: $D(\lambda) \le \|f_\rho\|_K^2 \, \lambda$. One can thus get $E_{z \in Z^m} \|f_{z,\lambda} - f_\rho\|_\rho^2 = O(1/\sqrt{m})$ by taking $\lambda = 1/\sqrt{m}$. Applying (3.2), one can have the probability estimate $\|f_{z,\lambda} - f_\rho\|_\rho \le C (1/m)^{1/4} / \sqrt{\delta}$ for the confidence $1 - \delta$.

In [5], a functional analysis approach was employed for the error analysis of the scheme (7.2). The main result asserts that for any $0 < \delta < 1$, with confidence $1 - \delta$,

$$\mathcal{E}(f_{z,\lambda}) - \mathcal{E}(f_\lambda) \le \frac{M^2 \kappa^2}{\lambda^2 m} \Big( 1 + \frac{\kappa}{\sqrt{\lambda}} \Big)^2 \Big( 1 + \sqrt{2 \log(2/\delta)} \Big)^2. \tag{7.18}$$

Convergence rates were also derived in [5, Corollary 1] by combining (7.18) with (7.9): when $f_\rho$ lies in the range of $L_K$, for any $0 < \delta < 1$, with confidence $1 - \delta$, there holds

$$\|f_{z,\lambda} - f_\rho\|_\rho \le C \Big( \frac{\log(2/\delta)}{m} \Big)^{1/5}, \quad \text{if } \lambda = \Big( \frac{\log(2/\delta)}{m} \Big)^{1/5}.$$

Thus the confidence is improved from $1/\sqrt{\delta}$ to $\log(2/\delta)$, while the rate is weakened to $(1/m)^{1/5}$. In the next section we shall verify the same confidence estimate while the sharp rate is kept. Our approach is short and neat, without involving the leave-one-out technique. Moreover, we can derive convergence rates in the space $H_K$.

8. Probability Estimates by McDiarmid Inequalities

In this section we apply some McDiarmid inequalities to improve the probability estimates derived from expected values by the Markov inequality.

Let $(\Omega, \rho)$ be a probability space. For $t = (t_1, \ldots, t_m) \in \Omega^m$ and $t'_i \in \Omega$, we denote

$$t^i := (t_1, \ldots, t_{i-1}, t'_i, t_{i+1}, \ldots, t_m).$$

Lemma 2. Let $\{t_i\}$, $t'_i$ be i.i.d. draws of the probability distribution $\rho$ on $\Omega$, and $F: \Omega^m \to \mathbb{R}$ be a measurable function.

1. If for each $i$ there is $c_i$ such that $\sup_{t \in \Omega^m, t'_i \in \Omega} |F(t) - F(t^i)| \le c_i$, then

$$\mathrm{Prob}_{t \in \Omega^m} \big\{ F(t) - E_t F(t) \ge \varepsilon \big\} \le \exp\Big\{ -\frac{2 \varepsilon^2}{\sum_{i=1}^m c_i^2} \Big\}, \qquad \forall \varepsilon > 0. \tag{8.1}$$

2. If there is $B \ge 0$ such that $\sup_{t \in \Omega^m, 1 \le i \le m} |F(t) - E_{t_i} F(t)| \le B$, then

$$\mathrm{Prob}_{t \in \Omega^m} \big\{ F(t) - E_t F(t) \ge \varepsilon \big\} \le \exp\Big\{ -\frac{\varepsilon^2}{2 \big( B\varepsilon/3 + \sum_{i=1}^m \sigma_i^2(F) \big)} \Big\}, \qquad \forall \varepsilon > 0, \tag{8.2}$$

where

$$\sigma_i^2(F) := \sup_{t \setminus \{t_i\} \in \Omega^{m-1}} E_{t_i} \big( F(t) - E_{t_i} F(t) \big)^2.$$

The first inequality is the McDiarmid inequality; see [8]. The second inequality is its Bernstein form, which is presented in [16].
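For a concrete feel of (8.1), the following simulation (illustrative only, not from the paper) compares the empirical tail of an empirical mean of bounded variables with the McDiarmid bound.

```python
import numpy as np

# Quick simulation of the McDiarmid inequality (8.1) (illustrative only): for
# F(t) = mean of m bounded variables, each coordinate change moves F by at most
# c_i = 1/m, so Prob{F - E F >= eps} <= exp(-2 eps^2 m).  We compare the empirical
# tail with this bound for uniform [0, 1] variables.

rng = np.random.default_rng(5)
m, trials, eps = 50, 200_000, 0.1

F = rng.uniform(0, 1, size=(trials, m)).mean(axis=1)     # F(t) for many draws of t
empirical_tail = np.mean(F - 0.5 >= eps)                 # E_t F = 1/2 here
mcdiarmid_bound = np.exp(-2 * eps ** 2 * m)              # exp(-2 eps^2 / sum c_i^2)

print(f"empirical Prob(F - EF >= {eps}) = {empirical_tail:.4f}")
print(f"McDiarmid bound               = {mcdiarmid_bound:.4f}")
```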

First, we show how the probability estimate for function reconstruction stated in Theorem 2 can be improved.

Theorem 5. If $S_{\mathbf{x}}^* S_{\mathbf{x}} + \gamma I$ is invertible and the Special Assumption holds, then under the condition that $|y_x - f^*(x)| \le M$ for each $x \in \mathbf{x}$, we have for every $0 < \delta < 1$, with probability $1 - \delta$,

$$\|f_{z,\gamma} - f_{\mathbf{x},\gamma}\|_H \le \big\| (S_{\mathbf{x}}^* S_{\mathbf{x}} + \gamma I)^{-1} \big\| \, \|J\| \Big( \sqrt{8 \sigma^2 \log \tfrac{1}{\delta}} + \tfrac{4}{3} M \log \tfrac{1}{\delta} \Big) \le \frac{\|J\|}{\lambda_{\mathbf{x}} + \gamma} \Big( \sqrt{8 \sigma^2 \log \tfrac{1}{\delta}} + \tfrac{4}{3} M \log \tfrac{1}{\delta} \Big).$$

Proof. Write $\|f_{z,\gamma} - f_{\mathbf{x},\gamma}\|_H = \|L(y - S_{\mathbf{x}} f^*)\|_H \le \|(S_{\mathbf{x}}^* S_{\mathbf{x}} + \gamma I)^{-1}\| \, \|S_{\mathbf{x}}^*(y - S_{\mathbf{x}} f^*)\|_H$. Consider the function $F: \ell^2(\mathbf{x}) \to \mathbb{R}$ defined by $F(y) = \|S_{\mathbf{x}}^*(y - S_{\mathbf{x}} f^*)\|_H$. Recall from the proof of Theorem 2 that $F(y) = \| \sum_{x \in \mathbf{x}} (y_x - f^*(x)) E_x \|_H$ and

$$\big( E_y F \big)^2 \le E_y F^2 = \sum_{x \in \mathbf{x}} \sigma_x^2 \langle E_x, E_x \rangle_H \le \|J\|^2 \sigma^2. \tag{8.3}$$

Then we can apply the McDiarmid inequality. Let $x_0 \in \mathbf{x}$ and $y'_{x_0}$ be a new sample at $x_0$. We have

$$|F(y) - F(y^{x_0})| = \Big| \|S_{\mathbf{x}}^*(y - S_{\mathbf{x}} f^*)\|_H - \|S_{\mathbf{x}}^*(y^{x_0} - S_{\mathbf{x}} f^*)\|_H \Big| \le \|S_{\mathbf{x}}^*(y - y^{x_0})\|_H.$$

The bound equals $|y_{x_0} - y'_{x_0}| \, \|E_{x_0}\|_H \le |y_{x_0} - y'_{x_0}| \, \|J\|$. Since $|y_x - f^*(x)| \le M$ for each $x \in \mathbf{x}$, it can be bounded by $2M\|J\|$, which can be taken as $B$ in Lemma 2. Also,

$$E_{y'_{x_0}} \big( F(y) - E_{y'_{x_0}} F(y^{x_0}) \big)^2 \le \int_Y \!\! \int_Y \big( |y_{x_0} - y'_{x_0}| \, \|J\| \big)^2 \, d\rho_{x_0}(y'_{x_0}) \, d\rho_{x_0}(y_{x_0}) \le 4 \|J\|^2 \sigma_{x_0}^2.$$

This yields $\sum_{x_0 \in \mathbf{x}} \sigma_{x_0}^2(F) \le 4 \|J\|^2 \sigma^2$. Thus Lemma 2 tells us that for every $\varepsilon > 0$,

$$\mathrm{Prob}_{y \in Y^{\mathbf{x}}} \big\{ F(y) \ge E_y F(y) + \varepsilon \big\} \le \exp\Big\{ -\frac{\varepsilon^2}{2 \big( 2M\|J\|\varepsilon/3 + 4\|J\|^2 \sigma^2 \big)} \Big\}.$$

Solving the quadratic equation

$$\frac{\varepsilon^2}{\tfrac{4}{3} M \|J\| \varepsilon + 8 \|J\|^2 \sigma^2} = \log \frac{1}{\delta}$$

gives the probability estimate

$$F(y) \le E_y F + \|J\| \Big( \sqrt{8 \sigma^2 \log \tfrac{1}{\delta}} + \tfrac{4}{3} M \log \tfrac{1}{\delta} \Big)$$

with confidence $1 - \delta$. This in connection with (8.3) proves Theorem 5.

Then we turn to the learning theory estimates. The purpose is to improve the bound in Theorem 3 by applying the McDiarmid inequality. To this end, we refine Lemma 1 to a probability estimate.

Lemma 3. Let $\mathbf{x} \subset X$ be randomly drawn according to $\rho_X$. Then for any $f \in L^2_{\rho_X}$ and $0 < \delta < 1$, with confidence $1 - \delta$, there holds

$$\Big\| \frac{1}{m} \sum_{i=1}^m f(x_i) K_{x_i} - L_K f \Big\|_K \le \frac{4 \kappa \|f\|_\infty}{3m} \log \frac{1}{\delta} + \frac{\kappa \|f\|_\rho}{\sqrt{m}} \Big( 1 + \sqrt{8 \log \frac{1}{\delta}} \Big).$$

Proof. Define a function $F: X^m \to \mathbb{R}$ as

$$F(\mathbf{x}) = F(x_1, \ldots, x_m) = \Big\| \frac{1}{m} \sum_{i=1}^m f(x_i) K_{x_i} - L_K f \Big\|_K.$$

For $j \in \{1, \ldots, m\}$, we have

$$|F(\mathbf{x}) - F(\mathbf{x}^j)| \le \frac{1}{m} \big\| f(x_j) K_{x_j} - f(x'_j) K_{x'_j} \big\|_K \le \frac{\kappa}{m} \big( |f(x_j)| + |f(x'_j)| \big).$$

It follows that $|F(\mathbf{x}) - E_{x_j} F(\mathbf{x})| \le 2\kappa \|f\|_\infty / m =: B$. Moreover,

$$E_{x'_j} \big( F(\mathbf{x}^j) - E_{x'_j} F(\mathbf{x}^j) \big)^2 \le \frac{\kappa^2}{m^2} \int_X \!\! \int_X \big( |f(x_j)| + |f(x'_j)| \big)^2 \, d\rho_X(x_j) \, d\rho_X(x'_j) \le \frac{4 \kappa^2 \|f\|_\rho^2}{m^2}.$$

Then we have $\sum_{j=1}^m \sigma_j^2(F) \le \frac{4 \kappa^2 \|f\|_\rho^2}{m}$. Thus we can apply Lemma 2 to the function $F$ and find that

$$\mathrm{Prob}_{\mathbf{x} \in X^m} \big\{ F(\mathbf{x}) \ge E_{\mathbf{x}} F(\mathbf{x}) + \varepsilon \big\} \le \exp\Big\{ -\frac{\varepsilon^2}{2 \big( \frac{2\kappa \|f\|_\infty}{3m} \varepsilon + \frac{4 \kappa^2 \|f\|_\rho^2}{m} \big)} \Big\}.$$

Solving a quadratic equation again, we see that with confidence $1 - \delta$, we have

$$F(\mathbf{x}) \le E_{\mathbf{x}} F(\mathbf{x}) + \frac{4 \kappa \|f\|_\infty}{3m} \log \frac{1}{\delta} + \frac{\kappa \|f\|_\rho}{\sqrt{m}} \sqrt{8 \log \frac{1}{\delta}}.$$

Lemma 1 says that $E_{\mathbf{x}} F(\mathbf{x}) \le \kappa \|f\|_\rho / \sqrt{m}$. Then our conclusion follows.

Theorem 6. Let $z$ be randomly drawn according to $\rho$ satisfying $|y| \le M$ almost everywhere. Then for any $0 < \delta < 1$, with confidence $1 - \delta$ we have

$$\|f_{z,\lambda} - f_\lambda\|_K \le \kappa M \log \frac{4}{\delta} \Big( \frac{30}{\lambda \sqrt{m}} + \frac{4\kappa}{3 \lambda^{3/2} m} \Big).$$

Proof. Since $|y| \le M$ almost everywhere, we know that $\|f_\rho\|_\rho \le \|f_\rho\|_\infty \le M$. Recall the function $\tilde f_\lambda$ defined by (7.5). It satisfies (7.6). Hence

$$\|f_{\mathbf{x},\lambda} - \tilde f_\lambda\|_K \le \frac{1}{\lambda} \Big\| \frac{1}{m} \sum_{i=1}^m f_\rho(x_i) K_{x_i} - L_K f_\rho \Big\|_K.$$

Applying Lemma 3 to the function $f_\rho$, we have with confidence $1 - \delta$,

$$\|f_{\mathbf{x},\lambda} - \tilde f_\lambda\|_K \le \frac{4 \kappa M}{3 \lambda m} \log \frac{1}{\delta} + \frac{\kappa M}{\lambda \sqrt{m}} \Big( 1 + \sqrt{8 \log \frac{1}{\delta}} \Big).$$

In the same way, by Lemma 3 with the function $f_\lambda$ and (7.7), we find

$$\mathrm{Prob}_{\mathbf{x} \in X^m} \Big\{ \|\tilde f_\lambda - f_\lambda\|_K \ge \frac{4 \kappa \|f_\lambda\|_\infty}{3 \lambda m} \log \frac{1}{\delta} + \frac{\kappa \|f_\lambda\|_\rho}{\lambda \sqrt{m}} \Big( 1 + \sqrt{8 \log \frac{1}{\delta}} \Big) \Big\} \le \delta.$$

By (7.8), we have $\|f_\lambda\|_\rho \le 2M$ and $\|f_\lambda\|_\infty \le \kappa \|f_\lambda\|_K \le \kappa M / \sqrt{\lambda}$. Therefore, with confidence $1 - \delta$, there holds

$$\|\tilde f_\lambda - f_\lambda\|_K \le \frac{4 \kappa^2 M}{3 \lambda^{3/2} m} \log \frac{1}{\delta} + \frac{2 \kappa M}{\lambda \sqrt{m}} \Big( 1 + \sqrt{8 \log \frac{1}{\delta}} \Big).$$

Finally, we apply Theorem 5. For each $\mathbf{x} \in X^m$, there holds with confidence $1 - \delta$,

$$\|f_{z,\lambda} - f_{\mathbf{x},\lambda}\|_K \le \frac{\kappa}{\lambda m} \Big( \sqrt{8 \sigma^2 \log \frac{1}{\delta}} + \frac{4}{3} M \log \frac{1}{\delta} \Big). \tag{8.4}$$

Here $\sigma^2 = \sum_{i=1}^m \sigma_{x_i}^2$. Apply the Bernstein inequality

$$\mathrm{Prob}_{\mathbf{x} \in X^m} \Big\{ \frac{1}{m} \sum_{i=1}^m \big( \xi(x_i) - E\xi \big) \ge \varepsilon \Big\} \le \exp\Big\{ -\frac{m \varepsilon^2}{2 \big( B\varepsilon/3 + \sigma^2(\xi) \big)} \Big\}$$

to the random variable $\xi(x) = \int_Y (y - f_\rho(x))^2 \, d\rho(y|x)$. It satisfies $0 \le \xi \le 4M^2$, $E\xi = \sigma_\rho^2$, and $\sigma^2(\xi) \le 4M^2 \sigma_\rho^2$. Then we see that

$$\mathrm{Prob}_{\mathbf{x} \in X^m} \Big\{ \frac{1}{m} \sum_{i=1}^m \sigma_{x_i}^2 \ge \sigma_\rho^2 + \frac{8 M^2 \log(1/\delta)}{3m} + \sqrt{\frac{8 M^2 \sigma_\rho^2 \log(1/\delta)}{m}} \Big\} \le \delta.$$

Hence, with confidence $1 - \delta$,

$$\sigma \le \sqrt{m}\, \sigma_\rho + M \sqrt{\tfrac{8}{3} \log \tfrac{1}{\delta}} + \big( 8 m M^2 \sigma_\rho^2 \log \tfrac{1}{\delta} \big)^{1/4}.$$

Together with (8.4), we see that with probability $1 - 2\delta$ in $Z^m$, we have the bound

$$\|f_{z,\lambda} - f_{\mathbf{x},\lambda}\|_K \le \frac{4 \kappa \sigma_\rho \sqrt{\log(1/\delta)}}{\lambda \sqrt{m}} + \frac{34 \kappa M \log(1/\delta)}{3 \lambda m}.$$

Combining the above three bounds, we know that for $0 < \delta < 1/4$, with confidence $1 - 4\delta$, $\|f_{z,\lambda} - f_\lambda\|_K$ is bounded by

$$\frac{\kappa M}{\lambda \sqrt{m}} \Big\{ 3 + 3 \sqrt{8 \log \tfrac{1}{\delta}} + \frac{4 \sigma_\rho}{M} \sqrt{\log \tfrac{1}{\delta}} + \frac{38}{3} \log \tfrac{1}{\delta} \Big\} + \frac{4 \kappa^2 M}{3 \lambda^{3/2} m} \log \tfrac{1}{\delta}.$$

Since $\sigma_\rho \le M$ and $\log(1/\delta) \ge \log 4$, this is bounded by $\kappa M \log \tfrac{1}{\delta} \big( \tfrac{30}{\lambda \sqrt{m}} + \tfrac{4\kappa}{3 \lambda^{3/2} m} \big)$. Replacing $\delta$ by $\delta/4$, our conclusion follows.

We are in a position to state our convergence rates in both the $\|\cdot\|_K$ and $\|\cdot\|_\rho$ norms.

Corollary 5. Let $z$ be randomly drawn according to $\rho$ satisfying $|y| \le M$ almost everywhere. If $f_\rho$ is in the range of $L_K$, then for any $0 < \delta < 1$, with confidence $1 - \delta$ we have

$$\|f_{z,\lambda} - f_\rho\|_K \le C \log \frac{4}{\delta} \Big( \frac{1}{m} \Big)^{1/6}, \quad \text{by taking } \lambda = \Big( \frac{1}{m} \Big)^{1/3}, \tag{8.5}$$

and

$$\|f_{z,\lambda} - f_\rho\|_\rho \le C \log \frac{4}{\delta} \Big( \frac{1}{m} \Big)^{1/4}, \quad \text{by taking } \lambda = \Big( \frac{1}{m} \Big)^{1/4}, \tag{8.6}$$

where $C$ is a constant independent of the dimension: $C := 30 \kappa M + 2 \kappa^2 M + \|L_K^{-1} f_\rho\|_\rho$.

References

[1] A. Aldroubi and K. Gröchenig, Non-uniform sampling and reconstruction in shift-invariant spaces, SIAM Review 43 (2001), 585-620.

[2] N. Aronszajn, Theory of reproducing kernels, Trans. Amer. Math. Soc. 68 (1950), 337-404.

[3] F. Cucker and S. Smale, On the mathematical foundations of learning, Bull. Amer. Math. Soc. 39 (2001), 1-49.

[4] F. Cucker and S. Smale, Best choices for regularization parameters in learning theory, Found. Comput. Math. 2 (2002), 413-428.

[5] E. De Vito, A. Caponnetto, and L. Rosasco, Model selection for regularized least-squares algorithm in learning theory, Found. Comput. Math., to appear.

[6] H. W. Engl, M. Hanke, and A. Neubauer, Regularization of Inverse Problems, Mathematics and Its Applications 375, Kluwer, Dordrecht, 1996.

[7] T. Evgeniou, M. Pontil, and T. Poggio, Regularization networks and support vector machines, Adv. Comput. Math. 13 (2000), 1-50.

[8] C. McDiarmid, Concentration, in Probabilistic Methods for Algorithmic Discrete Mathematics, Springer-Verlag, Berlin, 1998, pp. 195-248.

[9] P. Niyogi, The Informational Complexity of Learning, Kluwer, 1998.

[10] T. Poggio and S. Smale, The mathematics of learning: dealing with data, Notices Amer. Math. Soc. 50 (2003), 537-544.

[11] S. Smale and D. X. Zhou, Estimating the approximation error in learning theory, Anal. Appl. 1 (2003), 17-41.

[12] S. Smale and D. X. Zhou, Shannon sampling and function reconstruction from point values, Bull. Amer. Math. Soc. 41 (2004), 279-305.

[13] M. Unser, Sampling—50 years after Shannon, Proc. IEEE 88 (2000), 569-587.

[14] V. Vapnik, Statistical Learning Theory, John Wiley & Sons, 1998.

[15] G. Wahba, Spline Models for Observational Data, SIAM, 1990.

[16] Y. Ying, McDiarmid inequalities of Bernstein and Bennett forms, Technical Report, City University of Hong Kong, 2004.

[17] R. M. Young, An Introduction to Non-Harmonic Fourier Series, Academic Press, New York, 1980.

[18] T. Zhang, Leave-one-out bounds for kernel methods, Neural Comput. 15 (2003), 1397-1437.