ISSN , Volume 32, Number 2

Size: px
Start display at page:

Download "ISSN , Volume 32, Number 2"

Transcription

1 ISSN , Volume 3, Number This article was published in the above mentioned Springer issue The material, including all portions thereof, is protected by copyright; all rights are held exclusively by Springer Science + Business Media The material is for personal use only; commercial use is not permitted Unauthorized reproduction, transfer and/or use may be a violation of criminal as well as civil law

2 Constr Approx (00) 3: DOI 0007/s Some Properties of Gaussian Reproducing Kernel Hilbert Spaces and Their Implications for Function Approximation and Learning Theory Ha Quang Minh Received: 7 August 008 / Revised: 6 May 009 / Accepted: 9 July 009 / Published online: 9 December 009 Springer Science+Business Media, LLC 009 Abstract We give several properties of the reproducing kernel Hilbert space induced by the Gaussian kernel, along with their implications for recent results in the complexity of the regularized least square algorithm in learning theory Keywords Reproducing kernel Hilbert spaces Gaussian kernel Eigenvalues Learning theory Regularized least square algorithm Mathematics Subject Classification (000) 68T05 68P30 Introduction The theory of reproducing kernel Hilbert spaces (RKHS) has recently emerged as a powerful framework for the problem of learning from data, both from algorithmic and theoretical perspectives (comprehensive treatments are found in, for example, [, 0, ]) Of the many kernels being utilized, the Gaussian kernel is the most widely used and in many cases gives the best performance It is thus of crucial importance, for both theoretical and practical purposes, to have a deep understanding of this kernel and the Hilbert space it induces This paper describes some properties that resulted from our study of learning theory problems While we provide several implications of these properties for learning theory and function approximation, the results are of mathematical interest in their own right For many other interesting properties of the Gaussian RKHS that have appeared elsewhere in the literature, we refer to [5, 7] and the many references therein Communicated by Wolfgang Dahmen HQ Minh ( ) Humboldt Universität zu Berlin, Invalidenstrasse 43, 05 Berlin, Germany minhhaquang@staffhu-berlinde

3 308 Constr Approx (00) 3: Reproducing Kernel Hilbert Spaces Before we state our main results, let us briefly recall RKHS The general theory of reproducing kernel Hilbert spaces was developed by Aronszajn [] Let X be an arbitrary nonempty set Let K : X X R be a symmetric function satisfying: for any finite set of points {x i } N i= in X and real numbers {a i} N i=, N a i a j K(x i,x j ) 0 i,j= K is said to be a positive definite kernel on X There exists a unique Hilbert space H K of functions on X satisfying: K x H K for all x X, where K x (t) = K(x,t); span{k x } x X is dense in H K ; the inner product, K of H K satisfies: f(x)= f,k x K (reproducing property), for all f H K H K is called the Reproducing Kernel Hilbert Space with reproducing kernel K The Gaussian kernel is K(x,t) = exp( x t ), where X R n and σ>0 σ Organization We will state the main results we wish to report in Sect and give their proofs in Sect 3 A discussion of some of their implications for learning theory and function approximation will be given in Sect 4 The proof of Theorem 0, which is necessary for Theorem 5, will be given in Appendix B Finally, Appendix C contains some technical results on the Gamma function that we will need at various points in the paper Main Results of the Paper Notation Let α = (α,,α n ) (N {0}) n, α = n j= α j, x α = x α xα n n, and Cα d = α d!!α n!, the multinomial coefficients Also, by writing Lp (X), dx, we assume that the Lebesgue measure is being used Theorem Let X R n be any set with nonempty interior Let K(x,t) = exp( x t ) Then dim(h σ K ) = and { H K = f = e x } σ w α x α : f K = k! wα (/σ ) k < () α =0 C k α =k α

4 Constr Approx (00) 3: The inner product, K on H K is given by for f = e x σ for H K is f,g K = α =0 w α x α,g = e x σ k! (/σ ) k α =k w α v α C k α α =0 v α x α H K An orthonormal basis { (/σ φ α (x) = ) k Cα k } e x σ k! x α () α =k, Remark Though an orthonormal basis for the RKHS induced by the Gaussian kernels K(x,t) = exp( x t ) has been known in the literature (for example [5] and σ references therein), our approach below using the Weyl inner product leads to a much shorter proof Following are some of the properties of the Gaussian RKHS H K that will be derived from Theorem Theorem Let X R n be any set with nonempty interior Let K(x,t) = exp( x t ) Then H σ K does not contain any polynomial on X, including the nonzero constant function This theorem may be somewhat surprising, given the fact that if X is compact, then the H K induced by the Gaussian kernel is dense in the space C(X) of continuous functions on X (see [4]) This generalizes a result from [5], which shows that H K does not contain the nonzero constant function, using a different method Theorem 3 Let X R n be any set with nonempty interior Let K(x,z) = exp( x z ) The Hilbert space H σ K induced by K on X contains the function exp( μ x ) if and only if 0 <μ< For such μ, the corresponding functions have σ norms given by ( ) exp μ x [ ] n = μ( μ) σ K To discuss Theorem 3, let us use the notation H K,σ Then exp( x ) H σ K, σ, but exp( x )/ H σ K,σ This is not necessarily surprising, since the two Hilbert spaces contain functions that decay at different rates Essentially, Theorem 3 states that the function space H K,σ contains functions with decay rates within a fixed band Remark Setting μ = 0 in Theorem 3 gives another proof that the constant function does not belong to the RKHS H K

5 30 Constr Approx (00) 3: Theorem 4 Let K(x,t) = exp( x t ) on R n R n Then H σ K L (R n ) for any σ>0 This result is in contrast to the following fact: H K (R n ( ) = {f C 0 R n ) R L ( R n) : f K = n e σ ξ (π) n (σ π) n 4 f(ξ) dξ < where f is the Fourier-Plancherel transform of f, that is, H K (R n ) is an infinite-order Sobolev space Thus functions in H K (R n ), which are smooth, are not necessarily integrable Remark 3 We will give two different proofs for Theorem 4 The first proof follows from Theorem and constructs an explicit function in H K that does not belong to L (R n ) The second proof was suggested to the author by one of the referees and invokes a general result from [3] While nonconstructive, it has the advantage of not having to use any explicit computation To state our next results, we need the following connection between the theory of reproducing kernels and integral operators, manifested via Mercer s theorem Let X be a complete, separable metric space, equipped with a finite, Borel measure μ, that is μ(x) <, with supp(μ) = X, ie, the measure of each nonempty open subset is nonzero Let K : X X R be a continuous, symmetric, positive definite kernel satisfying κ = sup K(x,x) < x X Consider the integral operator L K : L μ (X) L μ (X) defined by (L K f )(x) = K(x,t)f(t)dμ(t) X This is a self-adjoint, compact operator with eigenvalues λ λ 0, with the corresponding L μ -normalized eigenfunctions {φ k} k= forming an orthonormal basis for L μ (X) Mercer s theorem (we refer to [4] for more detail) states that K(x,t) = λ k φ k (x)φ k (t), k= where the series converges absolutely for each (x, t) X X and uniformly on compact subsets of X X It follows from Mercer s theorem that H K = Im ( { L / ) } K = f = a k φ k : f K = ak < λ k k=,λ k >0 k=,λ k >0 and the set { λ k φ k } k=,λ k >0 forms an orthonormal basis for H K },

6 Constr Approx (00) 3: Remark 4 Note that for the compactness of L K and Mercer s theorem, we can assume that X is complete and not compact, as in the original version due to Mercer (the interval [0, ],see[7]), or in the treatment given in [4] It suffices to have (see [6]) for all x X and X X K x L μ (X) K(x,t) dμ(x)dμ(t) < By assuming that sup x X K(x,x) < and μ(x) <, as we do here, both of these conditions are satisfied For the Gaussian kernel, the second condition will fail if X = R n and μ is the Lebesgue measure For the polynomial kernels, the second condition could also fail if X = R n (hence K(x,x) is unbounded), evenifμ is a probability measure Let S n ={x R n : x =} be the n-dimensional unit sphere with surface area S n = π n Ɣ( n ) Theorem 5 Let n N, n be fixed Let X = S n Let μ be the uniform measure on S n Let f 0 : S n R be defined by { if x S n + f 0 (x) = (x n 0), if x S n (x n < 0) Let K :[, ] R be a continuous function giving rise to the Mercer kernel K(x,t) = K( x,t ) on S n S n () If n 3, then f 0 / Im(L r K ) for any r n () If K(x,t) = exp( x t ), then f σ 0 / Im(L r K ) for any r>0 Remark 5 The function f 0 above is the Bayes classifier corresponding to the binary classification problem where the two classes lie on the upper and lower hemispheres, respectively, with decision boundary x n = 0, with P ( y = x S+ n ) ( =, P y = x S n ) + = 0, P ( y = x S n ) ( = 0, P y = x S n ) = Remark 6 We recall that if K is continuous, then all functions in H K are continuous (see for example [4]) Since H K = Im(L / K ), we can immediately state that as a discontinuous function, f 0 / Im(L r K ) for any r / Theorem 5 thus extends this result to all r>0 for the Gaussian kernel and all r n (n 3) for a general continuous kernel Our next results will be on the eigenvalues and eigenfunctions of L K on S n The first two give the general formula and the rate of decay of the eigenvalues corresponding to a continuous, symmetric, positive definite kernel on S n, while the

7 3 Constr Approx (00) 3: third computes the eigenvalues corresponding to the Gaussian kernel itself explicitly Together they imply Theorem 5, but the results are of interest by themselves Recall that the space of spherical harmonics of order k on S n, denoted by Y k (n), has dimension (see for example [9]) dim Y k (n) = N(n,k)= (k + n )(k + n 3)! k!(n )! and an orthonormal basis denoted by {Y k,j (n; x)} N(n,k) j= Theorem 6 Let n N, n be fixed Let K :[, ] R be a continuous function giving rise to a continuous, positive definite kernel K(x,t) = K( x,t ) on S n S n Let μ be the Lebesgue measure on S n The eigenvalues λ k of L K : L μ (Sn ) L μ (Sn ) are given by λ k = S n K(t)P k (n; t) ( n 3 t ) dt, each with multiplicity N(n,k), for k Z, k 0 The corresponding eigenfunctions for each λ k are the spherical harmonics {Y k,j (n; x)} N(n,k) j= of order k Theorem 7 Let n N, n 3 be fixed Let K :[, ] R be a continuous function giving rise to a continuous, positive definite kernel K(x,t) = K( x,t ) on S n S n Let μ be the Lebesgue measure on S n The eigenvalues λ k of L K : L μ (Sn ) L μ (Sn ) satisfy λ k κ Sn κ Sn (n )! N(n,k) (k + ) n Theorem 8 Let n N, n, be fixed Let X = S n and μ be the uniform probability distribution on S n For K(x,t) = exp( x t ), σ>0, the eigenvalues of σ L K : L μ (X) L μ (X) are: ( ) ( ) λ k = e /σ n σ n I k+n/ σ Ɣ for all k N {0}, where I denotes the modified Bessel function of the first kind In all three cases, each λ k occurs with multiplicity N(n,k) The corresponding eigenfunctions are the spherical harmonics of order k on S n The λ k s satisfy and are decreasing if σ ( n )/ λ k λ k+ >(k+ n/)σ

8 Constr Approx (00) 3: Proofs of Main Results 3 The Weyl Inner Product and Orthonormal Basis of the Gaussian RKHS Let us prove Theorem Itwasshownin[4] that for X = R n, n N, and K(x,t) = x,t d, d N, wehaveh K = H d (R n ), the linear space of all homogeneous polynomialsofdegreed in R n, with the inner product, being the Weyl inner product on H d (R n ): f,g K = w α v α Cα d α =d for f = α =d w αt α, g = α =d v αt α H K Theorem 9 (Aronszajn) Let H be a separable Hilbert space of functions over X with orthonormal basis {φ k } H is a reproducing kernel Hilbert space if and only if φ k (x) < for all x X The unique kernel K is defined by K(x,y) = φ k (x)φ k (y) Proof of Theorem We will show that the inner product, K in H K is simply a generalization of the Weyl inner product for the homogeneous polynomial space H d (R n ), d N Consider the following expansion: ) x t K(x,t) = exp ( σ = e x σ e t (/σ ) k σ Cα k k! xα t α α =k Let { H 0 = f = e x σ α =0 w α x α k! (/σ ) k w α C k α =k α } < For f H 0, g = e x σ α =0 v α x α H 0, we define the inner product f,g K,0 = k! (/σ ) k α =k w α v α Cα k Let us show that H 0 is itself a Hilbert space under, K,0 For simplicity let n = Then { H 0 = f = e x } σ w k x k k! (/σ ) k w k <

9 34 Constr Approx (00) 3: It is clear that H 0 is an inner product space under, K,0 Its completeness under the induced norm K,0 is equivalent to the completeness of the weighted l sequence space l σ {(w = k ) : ( (wk ) ) / } k! l = σ (/σ ) k w k, which is itself a Hilbert space Thus (H 0, K,0 ) is a Hilbert space If X R n has nonempty interior, then the mononomials x α, α 0, are all distinct It follows from the definition of the inner product, K,0 that the φ α s, as given in (), are orthonormal under, K,0 Since H 0 = span{φ α } α, it follows that the φ α s form an orthonormal basis for (H 0, K,0 ) By Theorem 9 and the relations φ α (x)φ α (t) = K(x,t), α =k φα (x) = K(x,x) = <, α =k it follows that (H 0, K,0 ) is a reproducing kernel Hilbert space of functions on X with kernel K(x,t) Since the RKHS induced by a kernel K on a set X is unique, we must have (H 0, K,0 ) = (H K, K ) 3 Proofs of Theorems, 3, and 4 Proof of Theorem It suffices for us to show the case n = On subsets of R with nonempty interior, we have { H K = f = e x σ w k x k : f K = } σ k k! k wk < Let d Z, d 0 be given but arbitrary Consider the polynomial p(x) = a 0 + a x + +a d x d for arbitrary coefficients a i R Then we have p(x) = e x σ p(x)e x σ = e x σ d i=0 a i x k+i σ k k! Let b j = σ j j! and b j+ = 0forj Z, j 0 Then we can rewrite p(x) as p(x) = e x σ ( 0 i d,j 0,i+j=k a i b j )x k

10 Constr Approx (00) 3: Let w k = 0 i d,j 0,i+j=k a ib j Then p(x) = e x σ w k x k Then we have σ k k! k w k = σ k ( k! k 0 i d,j 0,i+j=k a i b j ) (a) Assume for now that a i 0 for all 0 i d, with a d > 0 (what follows will also be true if a i 0 for all 0 i d, with a d < 0) Then σ k k! k w k k=d = a d σ k k! k a d b k d = a d σ (k+d) (k + d)! k+d b k σ (k+d) ( ) (k + d)! k+d σ k = a d σ d k! d (k + d)! k (k!) The inequality becomes an equality if a d > 0 and a i = 0 for 0 i d, so that this lower bound is sharp Recall Stirling s formula, which states that n! lim n = Then for c πn( n e ) n k = (k+d)!,wehaveford : k (k!) ( ) k + d lim c k + d k k = lim (k + d) d = k k π k k c k / = k For d = 0, we have lim k π In both cases we have c k =,showing that p(x) cannot be a member of H K (b) Consider now the case in which the coefficients a i s have mixed signs, with a d 0 We will make use of two elementary inequalities, the first being that (a + b) ( a b ) for all a,b R and the second, (a b) (a c) for all a c b By the first inequality, we have ( ) a i b j ( a d b k d a d b k d+ + +a 0 b k ) 0 i d,j 0,i+j=k (b) Let d be even By definition of the b j s, we have b j+ = b j+ = 0 Thus, if k d 0 is even, then b j σ (j+) a d b k d+ + +a 0 b k ( a d + + a + a 0 ) σ ( k d b k d + ) a d b k d when k satisfies k 4( a d + + a + a 0 ) + d = A σ a d d Then for k even, k max{d,a d },wehave ( ) a i b j 4 a d b k d 0 i d,j 0,i+j=k and

11 36 Constr Approx (00) 3: by the second elementary inequality above Hence we have σ k k! k w k 4 k even,k max{d,a d } σ k k! k a d b k d, which diverges as in part (a), showing that p(x) is not in H K (b) The case when d is odd is entirely similar Proof of Theorem 3 Let us first consider the case n = Then { H K = f = e x } σ w k x k : f K = σ k k! k wk < Consider the function e μx σ, which is e μx σ = e x σ e (μ )x σ = e x σ ( ) k (μ )k x k σ k k! Thus, w k = ( )k (μ ) k, and w σ k k! j = 0forj k Then σ k k! k w k = If μ 0orμ, then σ 4k (k)! (μ ) k k σ 4k (k!) = (μ ) k (k)! k (k!) (μ ) k (k)! k (k!) (k)! k (k!) = as in the Proof of Theorem above, showing that f / H K in those cases If 0 < μ<, then (μ ) k (k)! k (k )!! k (k!) = + (μ ), (k)!! which converges by the Ratio Test Hence we have σ k k! w k k <, showing that e μx σ H K for 0 <μ<, with norm e μx σ K = For any n N,wehave (μ ) k (k)! k (k!) = k= = (μ ) μ( μ) e μ x σ = e x σ e (μ ) x σ = e x σ n i= k i =0 w ki x k i i = e x σ n k,,k n i= w ki x k i i,

12 Constr Approx (00) 3: giving us e μ x σ K = n {k,,k n }=0 i= σ k i k i! k i w k i = n i= k i =0 σ k i k i! k w i k i The result then follows from the one-dimensional case above by symmetry Proof of Theorem 4 We have for n =, { H K (R) = f = e x σ w k x k : f K = } σ k k! k wk < By formula (), if f(x ) H K (R), then g = ( n j= e x j σ )f (x ) H K (R n ) Since R n R n ( n j= ( n j= e x j σ ) dx dx n = ( σ π ) n <, ) e x j ( ) π n σ dx dx n = σ <, it suffices to show that H K = H K (R) / L (R) Fork 0, let where q> Then σ k k! k w k = w k = k k! (k + ) q/ σ k, (k + ) q < f = e We will show that f/ L (R) for <q 3/ We have R f(x) dx f(x) dx = = = 0 w k e x σ x k dx 0 ( ) k + w k σ k+ Ɣ = σ 0 e x σ x σ w k x k H K w k x k dx since w k 0 for all k, by the Monotone Convergence Theorem, k k! (k + ) q/ Ɣ ( k + )

13 38 Constr Approx (00) 3: Now k!=ɣ(k + ) = π k Ɣ( k+ )Ɣ( k + ) Thus f(x) π /4 σ dx R ( Ɣ( k+ ) ) / Ɣ( k + ) (k + ) q/ By Lemma 0, wehave R f(x) (π) /4 e /4 σ dx > Ɣ( k+ ) Ɣ( k +) = Sk+ π S k > e / (k+) / It thus follows that (k + ) q+ 4 = for 0 <q 3/ Thus f/ L (R) as required This completes the proof 33 A Different Proof for Theorem 4 The proof of Theorem 4 given above uses Theorem and constructs an explicit function in H K that is not in L (R n ) Let us now present a nonconstructive proof, based on results from [3], which have a very general setting For our purpose, let X R n, and μ a Borel measure on X LetK : X X R be a measurable, positive definite kernel For a fixed p, q = p p, the function K is said to be p-bounded if: () the function K x L q μ(x) for almost all x X, () the function L K f L q μ(x), where L K f(x)= X K(x,y)f(y)dμ(y), for all f L p μ(x) We will need the following result from [3] Proposition Assume that the Hilbert space H K induced by K is separable Given p, the following two conditions are equivalent: () H K L p μ(x) () The reproducing kernel K is q-bounded, with q = p p Proof of Theorem 4 For our present setting, X = R n, and μ is the Lebesgue measure on R n We can prove that H K (R n ) is not a proper subset of L (R n ) if we can show that K is not -bounded Let us verify the two conditions for -boundedness For simplicity, it suffices for us to prove for n = and σ = For the Gaussian kernel K(x,y) = exp( (x y) ), it is obvious that K x L (R n ), so condition () is satisfied For condition (), let φ = L (R n ); then L K φ(x)= exp ( (x y) ) dy = π, R which is a constant function and obviously not a part of L (R n ) Thus condition () of -boundedness is not satisfied, and so H K (R n ) is not a proper subset of L (R n )

14 Constr Approx (00) 3: Proofs of Theorems 6, 7, and 8 Proof of Theorem 6 From the Funk-Hecke formula (see Theorem in Appendix A), it immediately follows that spherical harmonics are eigenfunctions of L K, and we also obtain the analytical formula of the corresponding eigenvalues λ k We also know that the normalized eigenfunctions of L K form an orthonormal basis for L μ (Sn ) Since the spherical harmonics {{Y k,j (n; x)} N(n,k) j= } indeed form an orthonormal basis for L μ (Sn ), they are the only eigenfunctions of L K Proof of Theorem 7 We have from the Funk-Hecke formula that λ k = S n K(t)P k (n; t) ( n 3 t ) dt We will make use of the following identities [9]: Pk (n; t) ( t ) n 3 dt = Sn S n N(n,k) ( t ) n 3 dt = Sn S n and By the Cauchy-Schwarz inequality, K(t)P k (n; t) ( n 3 t ) dt ( sup t K(t) ) ( P k (n; t) ( t ( κ P k (n; t) ( t ) )( n 3 dt = κ Sn S n S n N(n,k) S n = κ ) n 3 ) dt ( S n S n ( t ) ) n 3 dt ) N(n,k) Since the λ k s are nonnegative, it follows that λ k κ Sn N(n,k), which is the first inequality We have N(n,k) = (k + n )(k + n 3)! k!(n )! (k + )n, (n )! (k + )(k + )(k + ) (k + n 3) (n )! giving us the second inequality

15 30 Constr Approx (00) 3: Lemma Let K(t) = e rt Then K(t)P k (n; t) ( n 3 t ) dt = πɣ ( n )( ) n/ I k+n/ (r), r where I is the modified Bessel function of the first kind, defined by I ν (x) = j=0 ( ) x ν+j j!ɣ(ν + j + ) Proof We will need the following formula from [6]: ) ν / Ɣ(ν)I ν /(r), e rt( t ) ν dt = π( r where r>0 and ν>0 Recall Rodrigues rule (see [9]), which states that for f C k ([, ]), f(t)p k (n; t) ( n 3 t ) dt = R k (n) f (k) (t) ( t ) k+ n 3 dt, where R k (n) is called the Rodrigues constant and is given by: Applying Rodrigues rule, we have R k (n) = k e rt P k (n; t) ( n 3 t ) dt = R k (n)r k = R k (n)r k π e rt( t ) k+ n 3 ( r Ɣ( n ) Ɣ(k + n ) ) k+n/ Ɣ( k + n ) I k+n/ (r) Substituting in the values of R k (n) and r, we obtain the desired answer Proof of Theorem 8 On S n,wehave ) ( x t exp ( σ = exp ) σ exp ( x,t The explicit formula for λ k thus follows from Theorem 7 and Lemma, with r = σ σ )

16 Constr Approx (00) 3: By definition of the modified Bessel function, we have ( ) ( ) k+n/ ( ) j σ I k+n/ σ = σ j!ɣ(j + k + n/ + ) j=0 ( ) k+n/ ( ) j σ = σ j!(j + k + n/)ɣ(j + k + n/) j=0 ( ) k+n/ < σ k + n/ j=0 ( σ ) j j!ɣ(j + k + n/) ( ) = σ (k + n/) I k+n/ σ, from which we have the inequality λ k λ k+ >(k+ n/)σ The inequality λ k λ k+ thus is satisfied if σ (k +n/) for all k 0 It suffices to require that it holds for k = 0, that is, σ n/ σ ( n )/ 35 Spherical Harmonics Expansions on the Sphere and Proof of Theorem 5 The proof is based on the rate of decay of the eigenvalues λ k of L K : L μ (Sn ) L μ (Sn ) and the Fourier expansion of f 0 on S n in terms of spherical harmonics, which is stated below Theorem 0 Let n N,n be fixed For n =, the standard Fourier expansion on S holds: f 0 (θ) = 4 π sin(k + )θ k + = 4 π(k + ) sin(k + )θ π For n 3, the Fourier expansion of f 0 in the spherical harmonics on S n is: where f 0 (x) = S n A(0, k +,n)= A(0, k +,n)y k+,0, (n; x), π(4k + n)(k + n )! n (k + )! Ɣ( k)ɣ(k + n+ ), and {{Y k,m,j (n; x)} N(n,m) j= } k m=0 is an orthonormal basis for the space Y k(n) of spherical harmonics of order k, as described in Proposition in Appendix B For all n N, n, 3 π 3/ (k + ) S n < S n A (0, k +,n)< 5 π (k + ) 3/ S n

17 3 Constr Approx (00) 3: Proof of Theorem 0 This is given in Appendix B Proof of Theorem 5 From Theorem 0, the Fourier expansion of f 0 on S n consists of precisely one function Y k+,0, (n; x) from each spherical space Y k+ (n) of odd order k +, k 0 If f 0 Im(L r K ), then its Fourier expansion must have the form f 0 = λ r k+ a k+y k+,0, (n; x) Comparing this expansion with the Fourier expansion of f 0, we obtain S a k+ = n A(0, k +,n) λ r k+ We must have ak+ < Since Sn A (0, k +,n)> 3,itfollows that π 3/ (k+) ak+ > 3 Sn π 3/ (k + ) λ r = 3 Sn π k+ 3/ b k For part, where n 3, we have from Theorem 7 that for r n,wehave ( b k κ S n (n )! ) r λ k+ k + (k + ) =, n (k+) κ S n (n )! Thus showing that f 0 / Im(L r K ) when r n For the Gaussian kernel K(x,t) = e x t σ on S n S n in part, we have for all k N, ( n by Theorem 8, with ( ) λ k+ = e /σ σ n I k+n/ σ Ɣ ) S n, λ k+ λ k+3 >(k + n/ + )(k + n/ + )σ 4 for all k 0 We now apply the Ratio Test to the series b k : b k+ lim = lim k b k k lim k ( ) r λk+ (k + ) (k + 3) λ k+3 ( σ 4 (k + n/ + )(k + n/ + ) ) r = for any r>0, showing that the series diverges Thus f 0 / Im(L r K ) for any r>0

18 Constr Approx (00) 3: Several Implications for Learning Theory and Function Approximation 4 Implication of Theorem 4 The main implication of this theorem is the nonfeasibility of L (R n ) norm optimization or regularization in H K (R n ) However, this could still be done on subsets of finite linear combinations of the basis functions K x Furthermore, this phenomenon does not arise when we talk about H K (X), where X is a bounded subset of R n 4 Implication of Theorems 5 and 3 for the Regularized Least Square Algorithm in RKHS and the Regression and Binary Classification Problems in Learning Theory We will first need to discuss the learning setting Assume that the input space X admits as mathematical representation a subset or manifold in some Euclidean space R n, and the output space Y is a subset of the real numbers R It is assumed that there is an unknown probability distribution ρ on Z = X Y with ρ(x,y) = ρ X (x)ρ(y x) In the regression problem (see [4, 0] and references therein), we aim to estimate the regression function f ρ (x) = ydρ(y x), Y which minimizes the least square error ( ) ε(f ) = f(x) y dρ Z Let L ρ X (X) denote the Hilbert space of square integrable functions on X, with norm denoted by ρ With the assumption that f ρ L ρ X (X), for every f L ρ X (X), ε(f ) ε(f ρ ) = f f ρ ρ Let z = (x i,y i ) m i= be a finite random sample of size m, m N, drawn independently according to ρ Our task is to construct functions f z, using the finite sample z, such that lim f z f ρ m ρ = 0 with high probability In the binary classification problem, where Y ={, }, the optimal binary classifier is the Bayes classifier: sgn f ρ (x) = { ifp(y= x) P(y= x), ifp(y= x) < P(y = x)

19 34 Constr Approx (00) 3: Let f : X R be a real-valued function, which induces the binary classifier sgn(f ) : X {, }, defined by sgn(f )(x) = sgn(f (x)) The error of sgn(f ) with respect to the Bayes classifier is ρ X (X f ) = sgn(f ) sgn(f ρ ) 4 ρ, where X f ={x X : sgn(f (x)) sgn(f ρ (x))} Our task is to construct binary classifiers sgn(f z ), using the finite sample z, such that lim m sgn(f z ) sgn(f ρ ) 4 ρ = 0 with high probability The regularized least square algorithm (see [] and references therein) attempts to solve both problems of least square regression and binary classification by the following minimization procedure Algorithm Let K : X X R denote a continuous, positive definite kernel Let H K be the corresponding RKHS, with norm K For each λ>0, let { m ( ( f z,λ = arg min f x i ) } y i) + λ f f H K m K i= (A) For the least square regression problem, f z,λ is taken to be the empirical version of f ρ, which approximates f ρ in the ρ norm (B) For the binary classification problem, sgn(f z,λ ) is taken to be the empirical version of sgn(f ρ ), which approximates sgn(f ρ ) in the ρ norm For the purpose of our argument, we state here two typical results obtained recently regarding the complexity of the above algorithm (we refer to [5, 3, ] and follow-up works for detail) Assume that μ = ρ X, f ρ Im(L r K ) for 0 <r, and that y M almost surely Then according to [3], for any 0 <δ<, for an appropriate choice of λ, ( f z,λ f ρ ρ log 4 ) (κm) r/(+r) L r δ K f ρ for / <r, and ( f z,λ f ρ ρ log 4 ) (8M + 8 r κ r L r δ /(+r) ρ K f ρ ( ) r/(+r) (3) m ) ( ) r/ ρ (4) m for 0 <r /, with probability at least δ, where the quantity L r K f ρ ρ plays a crucial role This in turn will lead to convergence in the binary classification problem, via the following []: ρ X (X fz,λ ) = sgn(f z,λ ) sgn(f ρ ) q 4 ρ 4(B q+ q + ) f z,λ f ρ ρ, (5)

20 Constr Approx (00) 3: where 0 q, provided that the Tsybakov s noise condition [9] is satisfied: ρ X ({ x : f ρ (x) L }) B q L q, 0 L It is the purpose of this section to discuss the applicability of these results in certain settings We note that f 0 = f ρ is the regression function resulting from the conditional probability P ( y = x S+ n ) ( =, P y = x S n ) + = 0, P ( y = x S n ) ( = 0, P y = x S n ) = Recall that it is the Bayes classifier corresponding to the binary classification problem where the two classes lie on the upper and lower hemispheres, respectively, with decision boundary x n = 0 Theorem 5 shows that complexity results such as the ones just mentioned cannot be applied in this case at all (for the Gaussian kernel) or with very small r (for the general continuous case) However, this is a noise-free binary classification problem, which can be solved successfully precisely by the regularized least square algorithm above, with the linear kernel K(x,t) = x,t, such that ρ X (X fz,λ ) = sgn(f z,λ ) sgn(f ρ ) ( n 4 ρ O m log ) (6) δ Remark 7 We stress that expression (6) does not follow from the preceding discussion, but is a result obtained by the author in [8], where a more general result is proved In general, it has been observed that the classification problem is significantly easier than regression if the function η(x) = P(y= x) = + f ρ(x) is far away from (or equivalently, f ρ is far away from 0), with high probability (we refer to the survey [] and the extensive references therein) We remark that convergence rates of the form O(m r/(r+) ) as in (3) are optimal for the regression problem (we refer to [8] and the references therein, where r represents a different concept) It would be interesting to establish the corresponding optimality of (3) and (4), which we will leave for a future work We would like now to discuss a different complexity result from [3], where the multi-kernel least square regularization algorithm is proposed, which computes f z,λ = arg min min σ f H K σ { m m ( ( f x i ) } y i) + λ f K σ Here K σ (x, t) = exp( n (x i t i ) i= ), with σ = (σ,,σ n ) = (0, ) n σ i i=

21 36 Constr Approx (00) 3: Example of [3] states that if X R n is a domain with Lipschitz boundary and f ρ H s (X), the Hilbert-Sobolev space of order s>0, then for an appropriate choice of λ: E ( f z,λ f ρ ) ( L = O (log m) / m s n ɛ ) 4(4s n ɛ) (7) ρ X if n <s n +, 0 <ɛ<s n IfX is bounded and ρ X is the Lebesgue measure on X, then: E ( f z,λ f ρ ) ( L = O (log m) / m s ) (4s+n) (8) ρ X if 0 <s If f H s (X), with n <s n +, then f is necessarily continuous and thus bound (7) cannot be applied for our present function f ρ, which is discontinuous As for bound (8), it could be applied for our case only when 0 <s min{, n }We note that for such s, the best possible rate offered by (8)isO((log m) / m α ), where α = min{, 8+n } This is also the rate for the binary classification error ρ X(X fz,λ ), using (5), when q = One must exercise caution when reading the O notation here however, since the constants within may contain the dimension n There are two interesting issues here whose exploration we will leave for a future work The first is to find 0 <s min{, n } such that f ρ H s (S n ) The second is to investigate connections between this multi-kernel approach and our Theorem 3 By this theorem, using kernels with flexible variances allows us to capture functions with multi-level rates of decay It is also possible that it is not necessary to use a continuous set as above, but a discrete version of it 43 Derivatives in the Gaussian RKHS Let X be a closed subset of R n with nonempty interior The method of proofs for Theorems, 3, and 4 can be applied to show that t α K x (t) H K for any multinomial index α, and hence p(t)k x (t) H K for any polynomial p : X R This implies that D α K x H K for any α, where D α denotes the partial derivative with multiindex α This order of reasoning is the reverse of that in the proof of Corollary of [7], where they use D α K x H K for all α to imply that p(t)k x H K for any p Note that [7] and references therein provide general treatments of the derivatives in RKHS by analytic kernels Furthermore, if n =, then for f(t)= t d K x (t) = t d exp( (x t) ),wehave σ f K = e x σ σ d d ( ) x k (k + ) (k + d) < (9) k! σ Using this formula and the binomial theorem, one can obtain the K-norm of the derivative of any order for K x Then-dimensional case can be worked out similarly Acknowledgements The present paper developed from a part of the author s PhD thesis with Steve Smale, whose advice and support is gratefully acknowledged He also wishes to thank the referees for their many valuable comments and suggestions This work was partially supported by the Vienna Fund for Science and Technology (Wiener Wissenschafts-, Forschungs- und Technologiefonds) and the German Research Foundation (Deutsche Forschungsgemeinschaft, grant DFG:GZ WI 55/-)

22 Constr Approx (00) 3: Appendix A: The Funk-Hecke Formula For our computations on the sphere S n, we need the following fundamental result from the theory of spherical harmonics (see [9]) For consistency of notation with the spherical harmonics literature, we use ds n for the surface measure of S n Normalizing by dividing by S n, we get the uniform probability measure on S n Theorem (Funk-Hecke Formula) Let K :[, ] R be a continuous function giving rise to an inner product kernel K(x,t) = K( x,t ) on S n S n Let Y k Y k (n) for k 0 Then for any x S n, K ( x,t ) Y k (t) ds n (t) = λ k Y k (x), S n where λ k = S n K(t)P k (n; t) ( n 3 t ) dt, where P k (n; t) denotes the Legendre polynomial of degree k in dimension n Appendix B: Fourier Expansion on S n In this section, we will prove Theorem 0 in Sect 35, that is, we will compute the Fourier expansion of { if x S n + f 0 (x) = (x n 0), if x S n (x n < 0), on S n, in terms of the spherical harmonics The case n = is just the usual Fourier expansion on the circle Lemma The Fourier series of f 0 on L (S ) is given by: f 0 (θ) = 4 π sin(k + )θ k + = 4 π(k + ) sin(k + )θ π Consider n 3 Let Y k (n) denote the space of spherical harmonics of order k on the sphere S n It turns out that working directly with an explicit orthonormal basis of spherical harmonics on S n is highly complicated analytically We shall find that it is much better for us to utilize an inductive construction of orthonormal bases of Y k (n) Let us first describe one such construction, as given in [9] Let e,,e n be the canonical basis of R n Letx S n We write ( ) x(n ) x = te n + t, 0

23 38 Constr Approx (00) 3: where t [, ] and x (n ) S n, (x (n ), 0) T span{e,,e n } We then have [9] ds n ( te n + t ) ( x (n ) = t ) n 3 dt ds n ( ) x (n ), or more compactly, ds n = ( n 3 t ) dt ds n (0) Recall that the normed associated Legendre functions are defined by: A m k (n; t)= n (k + n )(k m)!(k + n + m 3)! k!ɣ( n ) Pk m (n; t), (n; t) is the associated Legendre function of degree k, order m, and dimen- where Pk m sion n Proposition (Orthonormal basis of Y k (n) [9]) Suppose that for m = 0,,,k, the orthonormal bases Y m,j, j =,,N(n,m), of Y m (n ) are given Then the functions { Yk,m,j (n; x) = A m k (n; t)y m,j (n ; x (n ) ) : j =,,,N(n,m) } form an orthonormal basis for Y k (n), starting with the Fourier basis for n = (the circle S ) We will now expand f 0 in terms of the orthonormal spherical harmonics of Y k,m,j (n; x) of Proposition Lemma 3 f0,y k,m,j (n; ) = A(m, k, n) Y m,j (n ; x (n ) )ds n, S n where A(m, k, n) = Proof By definition, we have 0 A m k (n; t)( n 3 0 t ) dt A m k (n; t)( n 3 t ) dt { ( f 0 (x) = f 0 ten + t ) if 0 t, x (n ) = if t<0,

24 Constr Approx (00) 3: independent of the first n coordinates Thus by (0), we have f 0 (x)y k,m,j (n; x)ds n S n as desired = t= ( f 0 ten + t ) x (n ) S n A m k (n; t)y m,j (n ; x (n ) ) ( n 3 t ) dt ds n = A(m, k, n) Y m,j (n ; x (n ) )ds n, S n The expression obtained in the above lemma is considerably simplified with the aid of the Funk-Hecke formula (Theorem ), which states that for α S n, Y k Y k (n), and f C([, ]), f ( α, x ) Y k (x) ds n (x) = C k Y k (α), S n where C k = S n P k (n; t)f(t) ( n 3 t ) dt In particular, for f, this implies Y k (x) ds n (x) = C k Y k (α) S n The right-hand side is thus independent of α This implies that they are both identically zero for k For k = 0, we have Y 0, and hence S n C 0 = Y 0 (x) ds n (x) = S n S n Corollary The only nonzero Fourier coefficients of f 0 are: f0,y k,0, (n; ) = A(0,k,n) S n Proof From the above Funk-Hecke formula, we have { 0 if m, Y m,j (n ; x (n ) )ds n = S n S n if m = 0 Thus S n f 0 Y k,m,j (n; x)ds n is only nonzero when m = 0 Since N(n,0) =, j takes only the value This gives us the desired result

25 330 Constr Approx (00) 3: It thus remains for us to evaluate A(0,k,n) We will do this using the Gegenbauer polynomials and the aid of [6] Definition (Gegenbauer polynomials [9]) Let 0 r<, t [, ] For each integer k 0, λ>0, the Gegenbauer polynomial Ck λ (t) is defined to be the coefficient in the expansion ( rt + r ) λ = r k Ck λ (t) Ck λ (t) is an odd function when k is odd, and even when k is even Lemma 4 [9] For n 3, let λ = n Then C λ k (k + n 3)! (t) = P k (n; t) k!(n 3)! In particular, for n = 3, we have C λ k (t) = P k(3; t) From this and the formula for A 0 k (n; t) in the definition of associated Legendre functions, we obtain, for λ = n, A 0 k (n; t)= (n 3)! Ɣ( n ) Corollary Let D k and λ be as above Then (k + n )k! n (k + n 3)! Cλ k (t) = D kc λ k (t) { 0 if k is even, A(0,k,n)= D k 0 Cλ k (t)( t ) λ dt if k is odd Proof For λ = n n 3,wehave = λ We then have A(0,k,n)= 0 = D k [ A 0 k (n; t)( n 3 0 t ) dt 0 A 0 k (n; t)( n 3 t ) dt Ck λ (t)( t ) λ 0 dt Ck λ (t)( t ) ] λ dt When k is even, Ck λ (t) is even, hence the two integrals are equal and cancel out When k is odd, Ck λ (t) is odd, thus two integrals have the same magnitude but opposite signs, giving us the desired expression Let us now evaluate 0 Cλ k (t)( t ) λ dt We have the following two results at our disposal:

26 Constr Approx (00) 3: Lemma 5 [6] Let Pν μ (t) denote the classical associated Legendre function, with μ and ν complex numbers (for natural numbers m, k, Pk m(t) = P k m (3; t)) Let λ>0 Then Ck λ (t) = λ Ɣ(λ + k)ɣ(λ + ) P λ (t) ( t ) Ɣ(λ)Ɣ(k + ) λ+k 4 λ Lemma 6 [6] Let μ C be such that Re(μ) < Then Let E k = λ 0 ( t ) μ Pν μ (t) dt = μ π Ɣ( μ+ν ν μ+3 )Ɣ( ) Ɣ(λ+k)Ɣ(λ+ ) Ɣ(λ)Ɣ(k+) 0 Then we have C λ k (t)( t ) λ dt = E k 0 P λ ( t ) λ λ+k 4 dt We now apply the above lemma with μ = λ< and ν = λ + k to obtain Ck λ (t)( t ) λ λ π dt = E k Ɣ( k )Ɣ(λ + + k ) 0 Substituting λ = n and odd values of k, we get: Corollary 3 For k 0, 0 C n k+ (t)( n 3 t ) dt = π(n+ k )!Ɣ( n ) (n 3)!(k + )!Ɣ( k)ɣ(k + n+ ) Corollary 4 Let n 3 be fixed The Fourier expansion of f 0 in the spherical harmonics on S n is: f 0 (x) = S n A(0, k +,n)y k+,0, (n; x), where A(0, k +,n)is given by: π(4k + n)(k + n )! A(0, k +,n)= n (k + )! for k N {0} Example (n = 3) We have f 0 L (S ) = S =4π and A(0, k +, 3) = 4k + 3 ( ) k (k )!! 4 6 (k + ) = Ɣ( 4k + 3 k)ɣ(k + n+ ) ( ) k (k )!! k (k + )!

27 33 Constr Approx (00) 3: We check that S A(0, k +, 3) = π 4k + 3 k+ The sum of the last series above follows from the identity [6]: 4k + 3 k+ (0 θ π ), where we set θ = 0 ( ) (k )!! = 4π = f 0 (k + )! L (S ) ( ) (k )!! P k+(cos θ)= 4θ (k + )! π, Remark 8 (n = ) For n =, noting that S 0 = {±} =, we obtain S 0 A(0, k +, ) = 4( )k, π(k + ) which differ from the Fourier coefficients in the expansion of Lemma only by the factor ( ) k B0 General Case n We now move to the general case n Let us consider two separate cases: n is even and n is odd Lemma 7 Let n = m, m We have A(0, k +, m) = 8 π(k + ) k + m k + m (k + )!! (k + m )!! (k)!! (k + m )!! Proof We have ( ) ( m + Ɣ = Ɣ m + ) = (m )!! π m Thus (k )!! (m + ) (m + k )Ɣ( m+ ) = (m )!! m (k + ) (k + m ) (m )!! π On the other hand, = m π(k + ) (k + m ) (4k + m)(k + m )! m (k + )! = 8(k + m)(k + )(k + 3) (k + m ) m Combining these two expressions, we get A(0, k +, m)

28 Constr Approx (00) 3: Lemma 8 Let n = m +, m Then A (0, k +, m + ) = Proof We have A(0, k +, m + ) = (4k + m + )(k + m )! m (k + )! 4k + m + (k + ) k + m (k + )!! (k)!! (k + m )!! (k + m)!! [(k )!!] [(k + )(k + 4) (k + m)m!] [(k )!!] = (4k + m + )(k + )(k + 3) (k + m ) [(k + m)!!] (4k + m + )(k + )(k + 3) (k + m ) [(k + )!!] = (k + ) (k + )(k + 4) (k + m) [(k)!!(k + )!!] = 4k + m + (k + 3)(k + 5) (k + m ) [(k + )!!] (k + ) k + m (k + 4)(k + 6) (k + m) [(k)!!(k + )!!], from which the desired expression follows To find upper and lower bounds for the coefficients A(0, k +,n), we will apply Lemmas 0 and in Appendix C, which are consequences of Stirling s formula Corollary 5 Let n = m, m Then 4A π(m ) / (k + ) <A(0, k +, 8A m) < π(m ) / (k + ) 3/, where A = e 5/36 ( 7 8e )/ and A = A Proof We have A(0, k +, m) = From Lemma,wehave 8 π(k + ) ( ) 4eπ / e /8 7 (k + m ) / ( π (k + m )!! < (k + m )!! <e/ ) / (k + ) / < e / ( π k + m k + m (k + )!! (k + m )!! (k)!! (k + m )!! ) / (k + m ) /, (k + )!! (k)!! <e /8 ( 7 4eπ ) / (k + ) /

29 334 Constr Approx (00) 3: Combining these gives us ( ) 8e / e 5/36 (k + ) / (k + m )!! (k + )!! < 7 (k + m ) / (k + m )!! (k)!! Now <e 5/36 ( 7 8e ) / (k + ) / (k + m ) / (m ) / (k + ) / (k + )/ (k + m ) / (m ) / and < m m k + m k + m We finally have 4A π(m ) / (k + ) <A(0, k +, 8A m) < π(m ) / (k + ) 3/, where A = e 5/36 ( 7 8e )/ and A = A as required Corollary 6 Let n = m +, m Then e /3 π(m) / (k + ) <A (0, k +, m + )< Proof We have A (0, k +, m + ) = 4k + m + (k + ) k + m ( ) 7 / 4e /9 e π(m) / (k + ) 3/ (k + )!! (k)!! (k + m )!! (k + m)!! From Lemma, ( ) e / ( ) e / (k + m )!! / < <e /6 π (k + m) / (k + m)!! π (k + m) /, ( ) / ( ) e / (k + ) / (k + )!! 7 / < <e /8 (k + ) / π (k)!! 4eπ Combining these gives us e /3 π Now (k + ) / (k + m) (k + m )!! < / (k + m)!! (k + )!! (k)!! (k + )/ (k + )/ (m) / (k + m) / (m) / and < m + m ( ) 7 / <e /9 (k + ) / e π (k + m) / 4k + m + k + m < 4

30 Constr Approx (00) 3: We finally have e /3 π(m) / (k + ) <A (0, k +, m + ) ( ) 7 / <e /9 4 e π(m) / (k + ) 3/, as required Combining both cases of n odd and even, we have: Corollary 7 For all n N, n, e /3 π(n ) / (k + ) <A (0, k +,n)< ( ) 7 / 4e /9 e π(n ) / (k + ) 3/, 3 S n < S n A π 3/ (k + ) (0, k +,n)< 5 S n π (k + ) 3/ Proof The first inequality follows by combining both cases of n even and odd For the second one, we have that S n A (0, k +,n)= S n A (0, k +,n) Sn S n Now ( ) n / ( ) e /6 < Sn n / eπ S n <e/ π Combine this with the first inequality and simplify, and we obtain the second inequality Appendix C: The Gamma Function and Stirling s Formula Consider Stirling s series for a>0: Ɣ(a + ) = ( ) a a [ πa + e a + 88a 39 ] 5840a 3 + Thus for all a>0 we can write Ɣ(a + ) = e A(a) ( ) a a πa, e where 0 <A(a)< a

31 336 Constr Approx (00) 3: Lemma 9 For all n N, ( ) n+ n n!! = Stir(n), e where Stir(n) = e A(n/) πe if n is even, and Stir(n) = e A(n/) e if n is odd Proof (a) By Stirling s formula, we have n!=ɣ(n + ) = e A(n) ( ) n n πn e It thus follows that (n)!! = n n!=e A(n) ( ) n n πn = e A(n) ( ) n n+/ πe, e e from which we have the first identity (b) From the formula Ɣ(n + ) = (n )!! π n,wehave (n + )!! = (n + ) π n Ɣ(n + /) = n+ π Ɣ(n + 3/) = n+ e A(n+/) ( n + / π(n+ /) π e = e A(n+/) n+ ( ) n + / n+ e e = e A(n+/) ( ) n + n+ e, e ) n+/ which is the second identity Lemma 0 Let n N, n be fixed Then ( ) n / ( e B(n) Sn n eπ S n <eb(n) π ) / for some B(n) satisfying 6(n ) <B(n)< 6n Consequently for all n N, n, Proof We have ( ) n / ( ) e /6 < Sn n / eπ S n <e/ π S n S n = π n Ɣ( n ) Ɣ( n ) π n = π Ɣ( n ) Ɣ( n )

32 Constr Approx (00) 3: From Stirling s formula, we have Ɣ(a + ) = e A(a) πa( a e )a for all a>0, where 0 <A(a)</a Thus for all a>0, Ɣ(a) = e A(a) π e ( ) a a / e Hence we have Ɣ(n/) Ɣ( n ) = e A(n/) π e ( e n ) n = e B(n) n A( e ) π e ( n e ) n / ( n e ) / ( + ) n, n where B(n) = A(n/) A( n ), easily seen to satisfy 6(n ) <B(n)< 6n We have that for all n, ( / + ) n <e / n Combining this with the last expression, we obtain the desired result Lemma For all n N, ( ) 4eπ / ( ) e B(n) (n)!! π / 7 (n + ) / (n + )!! <eb(n) (n + ) /, for some function B(n) satisfying n+6 <B(n)< n Consequently for all n N, Similarly, ( ) 4eπ / e /8 (n)!! 7 (n + ) / (n + )!! <e/ ( π ) / (n + ) / ( ) e / ( ) e / (n )!! / < <e /6 π (n) / (n)!! π (n) / Proof We proceed as in the above lemma From Stirling s formula, we have ( ) n+ n (n)!! = Stir(n) = e A(n) ( ) n+ n πe, e e ( ) n+ n + (n + )!! = Stir(n + ) = e A(n+/) ( ) n+ n + e e e Thus it follows that ( ) (n)!! π e / ( (n + )!! = eb(n) ) n+, n + n +

33 338 Constr Approx (00) 3: where B(n) = A(n) A(n + /) is easily seen to satisfy have that for all n, ( ) 8 / = 7 ( ) 3/ ( ) n+ < 3 n + e / n+6 <B(n)< n We Combining this with the last expression, we obtain the desired result The other inequality is proven similarly References Aronszajn, N: Theory of reproducing kernels Trans Am Math Soc 68, (950) Boucheron, S, Bousquet, O, Lugosi, G: Theory of classification: a survey of recent advances ESAIM: Prob Stat 9, (005) 3 Carmeli, C, De Vito, E, Toigo, A: Vector valued reproducing kernel Hilbert spaces of integrable functions and Mercer theorem Anal Appl 4, (006) 4 Cucker, F, Smale, S: On the mathematical foundations of learning Bull Am Math Soc 39(), 49 (00) 5 De Vito, E, Caponnetto, A, Rosasco, L: Model selection for regularized least-squares algorithm in learning theory Found Comput Math 5(), (005) 6 Gradshteyn, IS, Ryzhik, IM: Table of Integrals, Series, Products, 6th edn Academic Press, San Diego (000) 7 Mercer, J: Functions of positive and negative type, and their connection with the theory of integral equations Philos Trans R Soc Lond, Ser A 09, (909) 8 Minh, HQ: The regularized least square algorithm and the problem of learning halfspaces Submitted preprint (007) 9 Müller, C: Analysis of Spherical Symmetries in Euclidean Spaces Applied Mathematical Sciences, vol 9 Springer, New York (997) 0 Niyogi, P, Girosi, F: Generalization bounds for function approximation from scattered noisy data Adv Comput Math 0, 5 80 (999) Poggio, T, Smale, S: The mathematics of learning: dealing with data Not Am Math Soc 50(5), (003) Schölkopf, B, Smola, AJ: Learning with Kernels MIT Press, Cambridge (00) 3 Smale, S, Zhou, DX: Learning theory estimates via integral operators and their approximations Constr Approx 6(), 53 7 (007) 4 Steinwart, I: On the influence of the kernel on the consistency of support vector machines J Mach Learn Res, (00) 5 Steinwart, I, Hush, D, Scovel, C: An explicit description of the reproducing kernel Hilbert spaces of Gaussian RBF kernels IEEE Trans Inf Theory 5, (006) 6 Sun, HW: Mercer theorem for RKHS on noncompact sets J Complex, (005) 7 Sun, HW, Zhou, DX: Reproducing kernel Hilbert spaces associated with analytic translationinvariant Mercer kernels J Fourier Anal Appl 4, 89 0 (008) 8 Temlyakov, VN: Approximation in learning theory Constr Approx 7, (008) 9 Tsybakov, AB: Optimal aggregation of classifiers in statistical learning Ann Stat 3(), (004) 0 Vapnik, V: Statistical Learning Theory Wiley, New York (998) Wahba, G: Spline Models for Observational Data CBMS-NSF Regional Conference Series in Applied Mathematics Society for Industrial and Applied Mathematics, Philadelphia (990) Yao, Y: Early stopping in gradient descent learning Constr Approx 6(), (007) 3 Ying, Y, Zhou, DX: Learnability of Gaussians with flexible variances J Mach Learn Res 8, (007)

Mercer s Theorem, Feature Maps, and Smoothing

Mercer s Theorem, Feature Maps, and Smoothing Mercer s Theorem, Feature Maps, and Smoothing Ha Quang Minh, Partha Niyogi, and Yuan Yao Department of Computer Science, University of Chicago 00 East 58th St, Chicago, IL 60637, USA Department of Mathematics,

More information

Geometry on Probability Spaces

Geometry on Probability Spaces Geometry on Probability Spaces Steve Smale Toyota Technological Institute at Chicago 427 East 60th Street, Chicago, IL 60637, USA E-mail: smale@math.berkeley.edu Ding-Xuan Zhou Department of Mathematics,

More information

Online Gradient Descent Learning Algorithms

Online Gradient Descent Learning Algorithms DISI, Genova, December 2006 Online Gradient Descent Learning Algorithms Yiming Ying (joint work with Massimiliano Pontil) Department of Computer Science, University College London Introduction Outline

More information

RKHS, Mercer s theorem, Unbounded domains, Frames and Wavelets Class 22, 2004 Tomaso Poggio and Sayan Mukherjee

RKHS, Mercer s theorem, Unbounded domains, Frames and Wavelets Class 22, 2004 Tomaso Poggio and Sayan Mukherjee RKHS, Mercer s theorem, Unbounded domains, Frames and Wavelets 9.520 Class 22, 2004 Tomaso Poggio and Sayan Mukherjee About this class Goal To introduce an alternate perspective of RKHS via integral operators

More information

Finite-dimensional spaces. C n is the space of n-tuples x = (x 1,..., x n ) of complex numbers. It is a Hilbert space with the inner product

Finite-dimensional spaces. C n is the space of n-tuples x = (x 1,..., x n ) of complex numbers. It is a Hilbert space with the inner product Chapter 4 Hilbert Spaces 4.1 Inner Product Spaces Inner Product Space. A complex vector space E is called an inner product space (or a pre-hilbert space, or a unitary space) if there is a mapping (, )

More information

Strictly Positive Definite Functions on a Real Inner Product Space

Strictly Positive Definite Functions on a Real Inner Product Space Strictly Positive Definite Functions on a Real Inner Product Space Allan Pinkus Abstract. If ft) = a kt k converges for all t IR with all coefficients a k 0, then the function f< x, y >) is positive definite

More information

Learnability of Gaussians with flexible variances

Learnability of Gaussians with flexible variances Learnability of Gaussians with flexible variances Ding-Xuan Zhou City University of Hong Kong E-ail: azhou@cityu.edu.hk Supported in part by Research Grants Council of Hong Kong Start October 20, 2007

More information

Reproducing Kernel Hilbert Spaces Class 03, 15 February 2006 Andrea Caponnetto

Reproducing Kernel Hilbert Spaces Class 03, 15 February 2006 Andrea Caponnetto Reproducing Kernel Hilbert Spaces 9.520 Class 03, 15 February 2006 Andrea Caponnetto About this class Goal To introduce a particularly useful family of hypothesis spaces called Reproducing Kernel Hilbert

More information

Strictly positive definite functions on a real inner product space

Strictly positive definite functions on a real inner product space Advances in Computational Mathematics 20: 263 271, 2004. 2004 Kluwer Academic Publishers. Printed in the Netherlands. Strictly positive definite functions on a real inner product space Allan Pinkus Department

More information

Derivative reproducing properties for kernel methods in learning theory

Derivative reproducing properties for kernel methods in learning theory Journal of Computational and Applied Mathematics 220 (2008) 456 463 www.elsevier.com/locate/cam Derivative reproducing properties for kernel methods in learning theory Ding-Xuan Zhou Department of Mathematics,

More information

Kernel Method: Data Analysis with Positive Definite Kernels

Kernel Method: Data Analysis with Positive Definite Kernels Kernel Method: Data Analysis with Positive Definite Kernels 2. Positive Definite Kernel and Reproducing Kernel Hilbert Space Kenji Fukumizu The Institute of Statistical Mathematics. Graduate University

More information

Reproducing Kernel Hilbert Spaces

Reproducing Kernel Hilbert Spaces Reproducing Kernel Hilbert Spaces Lorenzo Rosasco 9.520 Class 03 February 11, 2009 About this class Goal To introduce a particularly useful family of hypothesis spaces called Reproducing Kernel Hilbert

More information

Approximation Theoretical Questions for SVMs

Approximation Theoretical Questions for SVMs Ingo Steinwart LA-UR 07-7056 October 20, 2007 Statistical Learning Theory: an Overview Support Vector Machines Informal Description of the Learning Goal X space of input samples Y space of labels, usually

More information

EECS 598: Statistical Learning Theory, Winter 2014 Topic 11. Kernels

EECS 598: Statistical Learning Theory, Winter 2014 Topic 11. Kernels EECS 598: Statistical Learning Theory, Winter 2014 Topic 11 Kernels Lecturer: Clayton Scott Scribe: Jun Guo, Soumik Chatterjee Disclaimer: These notes have not been subjected to the usual scrutiny reserved

More information

Reproducing Kernel Hilbert Spaces

Reproducing Kernel Hilbert Spaces 9.520: Statistical Learning Theory and Applications February 10th, 2010 Reproducing Kernel Hilbert Spaces Lecturer: Lorenzo Rosasco Scribe: Greg Durrett 1 Introduction In the previous two lectures, we

More information

Metric Spaces and Topology

Metric Spaces and Topology Chapter 2 Metric Spaces and Topology From an engineering perspective, the most important way to construct a topology on a set is to define the topology in terms of a metric on the set. This approach underlies

More information

Online gradient descent learning algorithm

Online gradient descent learning algorithm Online gradient descent learning algorithm Yiming Ying and Massimiliano Pontil Department of Computer Science, University College London Gower Street, London, WCE 6BT, England, UK {y.ying, m.pontil}@cs.ucl.ac.uk

More information

MAT 570 REAL ANALYSIS LECTURE NOTES. Contents. 1. Sets Functions Countability Axiom of choice Equivalence relations 9

MAT 570 REAL ANALYSIS LECTURE NOTES. Contents. 1. Sets Functions Countability Axiom of choice Equivalence relations 9 MAT 570 REAL ANALYSIS LECTURE NOTES PROFESSOR: JOHN QUIGG SEMESTER: FALL 204 Contents. Sets 2 2. Functions 5 3. Countability 7 4. Axiom of choice 8 5. Equivalence relations 9 6. Real numbers 9 7. Extended

More information

THEOREMS, ETC., FOR MATH 515

THEOREMS, ETC., FOR MATH 515 THEOREMS, ETC., FOR MATH 515 Proposition 1 (=comment on page 17). If A is an algebra, then any finite union or finite intersection of sets in A is also in A. Proposition 2 (=Proposition 1.1). For every

More information

LEGENDRE POLYNOMIALS AND APPLICATIONS. We construct Legendre polynomials and apply them to solve Dirichlet problems in spherical coordinates.
