Mercer's Theorem, Feature Maps, and Smoothing


Ha Quang Minh, Partha Niyogi, and Yuan Yao

Department of Computer Science, University of Chicago, 1100 East 58th St, Chicago, IL 60637, USA
Department of Mathematics, University of California, Berkeley, 970 Evans Hall, Berkeley, CA 94720, USA
minh,niyogi@cs.uchicago.edu, yao@math.berkeley.edu

Abstract. We study Mercer's theorem and feature maps for several positive definite kernels that are widely used in practice. The smoothing properties of these kernels will also be explored.

1 Introduction

Kernel-based methods have become increasingly popular and important in machine learning. The central idea behind the so-called kernel trick is that a closed-form Mercer kernel allows one to efficiently solve a variety of non-linear optimization problems that arise in regression, classification, inverse problems, and the like. It is well known in the machine learning community that kernels are associated with feature maps, and that a kernel-based procedure may be interpreted as mapping the data from the original input space into a potentially higher-dimensional feature space, where linear methods may then be used. One finds many accounts of this idea where the input space $X$ is mapped by a feature map $\Phi : X \to \mathcal{H}$ (where $\mathcal{H}$ is a Hilbert space) so that for any two points $x, y \in X$, we have $K(x,y) = \langle \Phi(x), \Phi(y)\rangle_{\mathcal{H}}$.

Yet, while much has been written about kernels and many different kinds of kernels have been discussed in the literature, much less has been explicitly written about their associated feature maps. In general, we do not have a clear and concrete understanding of what exactly these feature maps are. Our goal in this paper is to take steps toward a better understanding of feature maps by explicitly computing them for a number of popular kernels on a variety of domains. By doing so, we hope to clarify the precise nature of feature maps in very concrete terms, so that machine learning researchers may have a better feel for them. Following are the main points and new results of our paper:

1. As we will illustrate, feature maps and feature spaces are not unique. For a given domain $X$ and a fixed kernel $K$ on $X \times X$, there exist in fact infinitely many feature maps associated with $K$. Although these maps are essentially equivalent, in a sense to be made precise in Section 4.3, there are subtleties that we wish to emphasize.

For a given kernel $K$, the feature maps of $K$ induced by Mercer's theorem depend fundamentally on the domain $X$, as will be seen in the examples of Section 2. Moreover, feature maps do not necessarily arise from Mercer's theorem, examples of which will be given in Section 4.2. The importance of Mercer's theorem, however, goes far beyond the feature maps that it induces: the eigenvalues and eigenfunctions associated with $K$ play a central role in obtaining error estimates in learning theory; see for example [8], [4]. For this reason, the determination of the spectrum of $K$, which is highly nontrivial in general, is crucially important in its own right. Theorems 2 and 3 in Section 2 give the complete spectrum of the polynomial and Gaussian kernels on $S^{n-1}$, including sharp rates of decay of their eigenvalues. Theorem 4 gives the eigenfunctions and a recursive formula for the computation of eigenvalues of the polynomial kernel on the hypercube $\{-1,1\}^n$.

2. One domain that we particularly focus on is the unit sphere $S^{n-1}$ in $\mathbb{R}^n$, for several reasons. First, it is a special example of a compact Riemannian manifold, and the problem of learning on manifolds has attracted attention recently; see for example [2], [3]. Second, its symmetric and homogeneous nature allows us to obtain complete and explicit results in many cases. We believe that $S^{n-1}$, together with kernels defined on it, is a fruitful source of examples and counterexamples for the theoretical analysis of kernel-based learning. We will point out that intuitions based on low dimensions such as $n = 2$ in general do not carry over to higher dimensions; Theorem 5 in Section 3 gives an important example along this line. We will also consider the unit ball $B^n$, the hypercube $\{-1,1\}^n$, and $\mathbb{R}^n$ itself.

3. We will also try to understand the smoothness properties of kernels on $S^{n-1}$. In particular, we will show that the polynomial and Gaussian kernels define Hilbert spaces of functions whose norms may be interpreted as smoothness functionals, similar to those of splines on $S^{n-1}$. We will obtain precise and sharp results on this question in the paper; this is the content of Section 5. The smoothness implications allow us to better understand the applicability of such kernels in solving smoothing problems.

Notation: For $X \subset \mathbb{R}^n$ and $\mu$ a Borel measure on $X$, $L^2_\mu(X) = \{f : X \to \mathbb{C} : \int_X |f(x)|^2\,d\mu(x) < \infty\}$. We will also use $L^2(X)$ for $L^2_\mu(X)$ and $dx$ for $d\mu(x)$ if $\mu$ is the Lebesgue measure on $X$. The surface area of the unit sphere $S^{n-1}$ is denoted by $|S^{n-1}| = \frac{2\pi^{n/2}}{\Gamma(n/2)}$.

2 Mercer's Theorem

One of the fundamental mathematical results underlying learning theory with kernels is Mercer's theorem.

Let $X$ be a closed subset of $\mathbb{R}^n$, $n \in \mathbb{N}$, $\mu$ a Borel measure on $X$, and $K : X \times X \to \mathbb{R}$ a symmetric function satisfying: for any finite set of points $\{x_i\}_{i=1}^N$ in $X$ and real numbers $\{a_i\}_{i=1}^N$,

$$\sum_{i,j=1}^{N} a_i a_j K(x_i, x_j) \geq 0. \qquad (1)$$

$K$ is then said to be a positive definite kernel on $X$. Assume further that

$$\int_X \int_X K(x,t)^2\, d\mu(x)\, d\mu(t) < \infty. \qquad (2)$$

Consider the induced integral operator $L_K : L^2_\mu(X) \to L^2_\mu(X)$ defined by

$$L_K f(x) = \int_X K(x,t)\, f(t)\, d\mu(t). \qquad (3)$$

This is a self-adjoint, positive, compact operator with a countable system of nonnegative eigenvalues $\{\lambda_k\}_{k=1}^\infty$ satisfying $\sum_{k=1}^\infty \lambda_k^2 < \infty$; $L_K$ is said to be Hilbert-Schmidt. The corresponding $L^2_\mu(X)$-normalized eigenfunctions $\{\phi_k\}_{k=1}^\infty$ form an orthonormal basis of $L^2_\mu(X)$. We recall that a Borel measure $\mu$ on $X$ is said to be strictly positive if the measure of every nonempty open subset of $X$ is positive, an example being the Lebesgue measure on $\mathbb{R}^n$.

Theorem 1 (Mercer). Let $X \subset \mathbb{R}^n$ be closed, $\mu$ a strictly positive Borel measure on $X$, and $K$ a continuous function on $X \times X$ satisfying (1) and (2). Then

$$K(x,t) = \sum_{k=1}^{\infty} \lambda_k\, \phi_k(x)\, \phi_k(t), \qquad (4)$$

where the series converges absolutely for each pair $(x,t) \in X \times X$ and uniformly on each compact subset of $X$.

Mercer's theorem still holds if $X$ is a finite set $\{x_i\}$, such as $X = \{-1,1\}^n$, $K$ is pointwise-defined and positive definite, and $\mu(x_i) > 0$ for each $i$.
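As a concrete illustration of Theorem 1, the following minimal Python sketch (my own, assuming NumPy; the domain, grid size, and Gaussian bandwidth are illustrative choices) discretizes $L_K$ on $X = [0,1]$ with Lebesgue measure and checks the expansion (4) numerically.

```python
# Minimal illustrative sketch: discretize L_K on a midpoint grid over [0, 1]
# and recover Mercer's expansion (4) numerically on that grid.
import numpy as np

sigma, m = 0.5, 400
x = (np.arange(m) + 0.5) / m                 # midpoint grid on [0, 1]
w = 1.0 / m                                  # quadrature weight per cell

K = np.exp(-((x[:, None] - x[None, :]) ** 2) / sigma**2)

# (L_K f)(x_i) ~ sum_j K(x_i, x_j) f(x_j) w; eigenvalues of K*w approximate
# the lambda_k, and eigenvectors give L^2-normalized phi_k(x_i) ~ v_k[i]/sqrt(w).
evals, evecs = np.linalg.eigh(K * w)
evals, phi = evals[::-1], evecs[:, ::-1] / np.sqrt(w)

K_hat = (phi * evals) @ phi.T                # sum_k lambda_k phi_k(x) phi_k(t)
print("leading eigenvalues:", np.round(evals[:5], 6))
print("max deviation from K on the grid:", np.abs(K_hat - K).max())
```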

2.1 Examples on the Sphere $S^{n-1}$

We will give explicit examples of the eigenvalues and eigenfunctions in Mercer's theorem on the unit sphere $S^{n-1}$ for the polynomial and Gaussian kernels. We need the concept of spherical harmonics, a modern and authoritative account of which is [6]. Some of the material below was first reported in the kernel learning literature in [9], where the eigenvalues for the polynomial kernels with $n = 2, 3$ were computed. In this section, we will carry out computations for a general $n \in \mathbb{N}$, $n \geq 2$.

Definition 1 (Spherical Harmonics). Let $\Delta_n = \sum_{i=1}^n \frac{\partial^2}{\partial x_i^2}$ denote the Laplacian operator on $\mathbb{R}^n$. A homogeneous polynomial of degree $k$ in $\mathbb{R}^n$ whose Laplacian vanishes is called a homogeneous harmonic of order $k$. Let $\mathbb{Y}_k(n)$ denote the subspace of all homogeneous harmonics of order $k$ restricted to the unit sphere $S^{n-1}$ in $\mathbb{R}^n$. The functions in $\mathbb{Y}_k(n)$ are called spherical harmonics of order $k$. We will denote by $\{Y_{k,j}(n;x)\}_{j=1}^{N(n,k)}$ any fixed orthonormal basis of $\mathbb{Y}_k(n)$, where

$$N(n,k) = \dim \mathbb{Y}_k(n) = \frac{(2k+n-2)(k+n-3)!}{k!\,(n-2)!}, \quad k \geq 0.$$

Theorem 2. Let $X = S^{n-1}$, $n \in \mathbb{N}$, $n \geq 2$. Let $\mu$ be the uniform probability distribution on $S^{n-1}$. For $K(x,t) = \exp(-\frac{\|x-t\|^2}{\sigma^2})$, $\sigma > 0$,

$$\lambda_k = e^{-2/\sigma^2}\, \sigma^{n-2}\, I_{k+\frac{n-2}{2}}\!\left(\frac{2}{\sigma^2}\right) \Gamma\!\left(\frac{n}{2}\right) \qquad (5)$$

for all $k \in \mathbb{N} \cup \{0\}$, where $I$ denotes the modified Bessel function of the first kind, defined below. Each $\lambda_k$ occurs with multiplicity $N(n,k)$, with the corresponding eigenfunctions being spherical harmonics of order $k$ on $S^{n-1}$. The $\lambda_k$'s are decreasing if $\sigma \geq (2/n)^{1/2}$. Furthermore,

$$A_1\, \frac{(e/\sigma^2)^k}{(k+\frac{n-2}{2})^{k+\frac{n-1}{2}}} \;<\; \lambda_k \;<\; A_2\, \frac{(e/\sigma^2)^k}{(k+\frac{n-2}{2})^{k+\frac{n-1}{2}}} \qquad (6)$$

for constants $A_1, A_2$ depending on $\sigma$ and $n$ given below.

Remark 1. $A_1 = \frac{e^{-2/\sigma^2 - 1/6}\, e^{(n-2)/2}\, \Gamma(n/2)}{\sqrt{2\pi}}$ and $A_2 = \frac{e^{-2/\sigma^2 + 1/\sigma^4}\, e^{(n-2)/2}\, \Gamma(n/2)}{\sqrt{2\pi}}$; these arise from the series expansion of $I_\nu$ and Stirling's formula (29), as in Appendix A. For $\nu, z \in \mathbb{C}$,

$$I_\nu(z) = \sum_{j=0}^{\infty} \frac{1}{j!\,\Gamma(\nu+j+1)} \left(\frac{z}{2}\right)^{\nu+2j}.$$

Theorem 3. Let $X = S^{n-1}$, $n \in \mathbb{N}$, $n \geq 2$, and $d \in \mathbb{N}$. Let $\mu$ be the uniform probability distribution on $S^{n-1}$. For $K(x,t) = (1 + \langle x,t\rangle)^d$, the nonzero eigenvalues of $L_K : L^2_\mu(X) \to L^2_\mu(X)$ are

$$\lambda_k = \frac{2^{d+n-2}\, d!\;\Gamma(d + \frac{n-1}{2})\,\Gamma(\frac{n}{2})}{\sqrt{\pi}\,(d-k)!\;\Gamma(d+k+n-1)} \qquad (7)$$

for $0 \leq k \leq d$. Each $\lambda_k$ occurs with multiplicity $N(n,k)$, with the corresponding eigenfunctions being spherical harmonics of order $k$ on $S^{n-1}$. Furthermore, the $\lambda_k$'s form a decreasing sequence and

$$\frac{B_1}{(k+d+\frac{n-1}{2})^{d+\frac{n-3}{2}}} \;<\; \lambda_k \;<\; \frac{B_2}{(k+d+\frac{n-1}{2})^{d+\frac{n-3}{2}}} \qquad (8)$$

where $0 \leq k \leq d$, for constants $B_1, B_2$ depending on $d$ and $n$ given below.

Remark 2. The constants $B_1$ and $B_2$ depend only on $d$ and $n$; like $A_1$ and $A_2$, they are obtained from Stirling's formula (29) applied to the Gamma functions in (7).
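The closed form (5) is easy to evaluate numerically. The sketch below (my own, assuming SciPy for the Bessel and Gamma functions) computes the Gaussian eigenvalues on $S^{n-1}$ and checks the trace identity $\sum_k N(n,k)\,\lambda_k = \int_{S^{n-1}} K(x,x)\,d\mu(x) = 1$, which holds since $\mu$ is a probability measure and $K(x,x) = 1$.

```python
# Evaluate the eigenvalues (5) of the Gaussian kernel on S^{n-1} and check
# the trace identity sum_k N(n,k) lambda_k = K(x,x) = 1.
from math import factorial
import numpy as np
from scipy.special import iv, gamma

def N(n, k):
    # N(n,k) = (2k+n-2)(k+n-3)! / (k! (n-2)!), with N(n,0) = 1
    if k == 0:
        return 1
    return (2 * k + n - 2) * factorial(k + n - 3) // (factorial(k) * factorial(n - 2))

def lam(n, k, sigma):
    # lambda_k = e^{-2/sigma^2} sigma^{n-2} I_{k+(n-2)/2}(2/sigma^2) Gamma(n/2)
    return (np.exp(-2 / sigma**2) * sigma ** (n - 2)
            * iv(k + (n - 2) / 2, 2 / sigma**2) * gamma(n / 2))

n, sigma = 3, 1.0
eigs = [lam(n, k, sigma) for k in range(60)]
print("lambda_0..3:", np.round(eigs[:4], 6))                   # rapidly decreasing
print("trace:", sum(N(n, k) * e for k, e in enumerate(eigs)))  # ~ 1.0
```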

2.2 Example on the Hypercube $\{-1,1\}^n$

We will now give an example on the hypercube $\{-1,1\}^n$. Let $M_k = \{\alpha = (\alpha_i)_{i=1}^n : \alpha_i \in \{0,1\},\; |\alpha| = \alpha_1 + \cdots + \alpha_n = k\}$; then the set $\{x^\alpha\}_{\alpha \in M_k,\, 0 \leq k \leq n}$ consists of the multilinear monomials $\{1, x_1, x_1x_2, \ldots, x_1 x_2 \cdots x_n\}$.

Theorem 4. Let $X = \{-1,1\}^n$. Let $d \in \mathbb{N}$, $d \leq n$, be fixed. Let $K(x,t) = (1 + \langle x,t\rangle)^d$ on $X \times X$. Let $\mu$ be the uniform distribution on $X$. Then the nonzero eigenvalues $\lambda^d_k$ of $L_K : L^2_\mu(X) \to L^2_\mu(X)$ satisfy

$$\lambda^{d+1}_k = k\,\lambda^d_{k-1} + \lambda^d_k + (n-k)\,\lambda^d_{k+1}, \qquad (9)$$

$$\lambda^d_0 \geq \lambda^d_1 \geq \cdots \geq \lambda^d_{d-1} = \lambda^d_d = d!, \qquad (10)$$

and $\lambda^d_k = 0$ for $k > d$. The corresponding $L^2_\mu(X)$-normalized eigenfunctions for each $\lambda_k$ are $\{x^\alpha\}_{\alpha \in M_k}$.

Example 1 ($d = 2$). The recurrence relation (9) involves two indices, and hence a closed analytic expression for $\lambda^d_k$ is hard to find for large $d$. It is straightforward, however, to write a computer program for computing $\lambda^d_k$; a short sketch follows this example. For $d = 2$,

$$\lambda_0 = n+1, \qquad \lambda_1 = \lambda_2 = 2,$$

with corresponding eigenfunctions $1$, $\{x_1, \ldots, x_n\}$, and $\{x_1x_2, x_1x_3, \ldots, x_{n-1}x_n\}$, respectively.
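A minimal implementation of the recursion (9) (my own sketch, plain Python; the final assertion is an additional sanity test using $\sum_k \binom{n}{k}\lambda^d_k = (1+n)^d$, which follows from taking the trace of $L_K$):

```python
# Compute the eigenvalues lambda_k^d on {-1,1}^n from the recursion (9),
# starting from the base case d = 1, where lambda_0^1 = lambda_1^1 = 1.
from math import comb

def hypercube_eigenvalues(n, d):
    lam = [0.0] * (n + 2)                  # pad so lam[k+1] always exists
    lam[0] = lam[1] = 1.0                  # d = 1
    for _ in range(d - 1):                 # (9): lam_k <- k lam_{k-1} + lam_k + (n-k) lam_{k+1}
        lam = [(k * lam[k - 1] if k > 0 else 0.0) + lam[k] + (n - k) * lam[k + 1]
               for k in range(n + 1)] + [0.0]
    return lam[: n + 1]

eigs = hypercube_eigenvalues(n=4, d=2)
print(eigs)                                # [5.0, 2.0, 2.0, 0.0, 0.0]: Example 1 with n = 4
# Sanity check: sum_k |M_k| lambda_k^d = (1 + <x,x>)^d = (1+n)^d, |M_k| = C(n,k).
assert abs(sum(comb(4, k) * e for k, e in enumerate(eigs)) - 5**2) < 1e-9
```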

2.3 Example on the Unit Ball $B^n$

Except for the homogeneous polynomial kernel $K(x,t) = \langle x,t\rangle^d$, the computation of the spectrum of $L_K$ on the unit ball $B^n$ is much more difficult analytically than that on $S^{n-1}$. For $K(x,t) = (1+\langle x,t\rangle)^d$ and small values of $d$, it is still possible, however, to obtain explicit answers.

Example 2 ($X = B^n$, $K(x,t) = (1+\langle x,t\rangle)^2$). Let $\mu$ be the uniform measure on $B^n$. The eigenspace spanned by $\{x_1, \ldots, x_n\}$ corresponds to the eigenvalue $\lambda_1 = \frac{2}{n+2}$. The eigenspace spanned by $\{\|x\|^2\, Y_{2,j}(n; \frac{x}{\|x\|})\}_{j=1}^{N(n,2)}$ corresponds to the eigenvalue $\lambda_2 = \frac{2}{(n+2)(n+4)}$. The eigenvalues that correspond to $\mathrm{span}\{1, \|x\|^2\}$ are

$$\lambda_{0,1} = \frac{(n+2)(n+5) + \sqrt{D}}{2(n+2)(n+4)}, \qquad \lambda_{0,2} = \frac{(n+2)(n+5) - \sqrt{D}}{2(n+2)(n+4)},$$

where $D = (n+2)^2(n+5)^2 - 16(n+4)$.

3 Unboundedness of Normalized Eigenfunctions

It is known that the $L^2_\mu$-normalized eigenfunctions $\{\phi_k\}$ are generally unbounded, that is, in general

$$\sup_{k \in \mathbb{N}} \|\phi_k\|_\infty = \infty.$$

This was first pointed out by Smale, with the first counterexample given in [14]. This phenomenon is very common, however, as the following result shows.

Theorem 5. Let $X = S^{n-1}$, $n \geq 3$. Let $\mu$ be the Lebesgue measure on $S^{n-1}$. Let $f : [-1,1] \to \mathbb{R}$ be a continuous function, giving rise to a Mercer kernel $K(x,t) = f(\langle x,t\rangle)$ on $S^{n-1} \times S^{n-1}$. If infinitely many of the eigenvalues of $L_K : L^2_\mu(S^{n-1}) \to L^2_\mu(S^{n-1})$ are nonzero, then for the set of corresponding $L^2_\mu$-normalized eigenfunctions $\{\phi_k\}_{k=1}^\infty$,

$$\sup_{k \in \mathbb{N}} \|\phi_k\|_\infty = \infty. \qquad (11)$$

Remark 3. This is in sharp contrast with the case $n = 2$, where we will show that $\sup_k \|\phi_k\|_\infty \leq \frac{1}{\sqrt{\pi}}$, with the supremum attained on the functions $\{\frac{\cos k\theta}{\sqrt{\pi}}, \frac{\sin k\theta}{\sqrt{\pi}}\}_{k \in \mathbb{N}}$.

Theorem 5 applies in particular to the Gaussian kernel $K(x,t) = \exp(-\frac{\|x-t\|^2}{\sigma^2})$. Hence care needs to be taken in applying analysis that requires $C_K = \sup_k \|\phi_k\|_\infty < \infty$; see for example [5].

4 Feature Maps

4.1 Examples of Feature Maps via Mercer's Theorem

A natural feature map that arises immediately from Mercer's theorem is

$$\Phi_\mu : X \to \ell^2, \qquad \Phi_\mu(x) = \left(\sqrt{\lambda_k}\,\phi_k(x)\right)_{k=1}^\infty, \qquad (12)$$

where, if only $N < \infty$ of the eigenvalues are strictly positive, then $\Phi_\mu : X \to \mathbb{R}^N$. This is the map that is most often covered in the machine learning literature.

Example 3 ($n = 2$, $d = 2$, $X = S^1$). Theorem 3 gives the eigenvalues $(\frac{3}{2}, 1, \frac{1}{4})$, with normalized eigenfunctions $(1, \sqrt{2}\,x_1, \sqrt{2}\,x_2, \sqrt{2}\,(x_1^2 - x_2^2), 2\sqrt{2}\,x_1x_2) = (1, \sqrt{2}\cos\theta, \sqrt{2}\sin\theta, \sqrt{2}\cos 2\theta, \sqrt{2}\sin 2\theta)$, where $x_1 = \cos\theta$, $x_2 = \sin\theta$, giving rise to the feature map

$$\Phi_\mu(x) = \left(\sqrt{\tfrac{3}{2}},\; \sqrt{2}\,x_1,\; \sqrt{2}\,x_2,\; \tfrac{1}{\sqrt{2}}(x_1^2 - x_2^2),\; \sqrt{2}\,x_1x_2\right).$$

Example 4 ($n = 2$, $d = 2$, $X = \{-1,1\}^2$). Theorem 4 gives

$$\Phi_\mu(x) = \left(\sqrt{3},\; \sqrt{2}\,x_1,\; \sqrt{2}\,x_2,\; \sqrt{2}\,x_1x_2\right).$$

Observation. (i) As our notation suggests, $\Phi_\mu$ depends on the particular measure $\mu$ in the definition of the operator $L_K$ and thus is not unique. Each measure $\mu$ gives rise to a different system of eigenvalues and eigenfunctions $(\lambda_k, \phi_k)_{k=1}^\infty$ and therefore a different $\Phi_\mu$. (ii) In Theorems 2 and 3, the multiplicity of the $\lambda_k$'s means that for each choice of orthonormal basis of the space $\mathbb{Y}_k(n)$ of spherical harmonics of order $k$, there is a different feature map. Thus there are infinitely many feature maps arising from the uniform probability distribution on $S^{n-1}$ alone.
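The finite-dimensional feature map of Example 4 can be verified directly; a small sketch (my own, plain Python, exhaustive over all pairs of points):

```python
# Verify Example 4: on X = {-1,1}^2, Phi(x) = (sqrt(3), sqrt(2) x_1, sqrt(2) x_2,
# sqrt(2) x_1 x_2) satisfies <Phi(x), Phi(t)> = (1 + <x,t>)^2 for all x, t.
import itertools, math

def phi(x):
    return (math.sqrt(3), math.sqrt(2) * x[0], math.sqrt(2) * x[1],
            math.sqrt(2) * x[0] * x[1])

cube = list(itertools.product((-1, 1), repeat=2))
for x in cube:
    for t in cube:
        K = (1 + x[0] * t[0] + x[1] * t[1]) ** 2
        assert abs(K - sum(a * b for a, b in zip(phi(x), phi(t)))) < 1e-12
print("Phi reproduces K on all pairs of points")
```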

4.2 Examples of Feature Maps not via Mercer's Theorem

Feature maps do not necessarily arise from Mercer's theorem. Consider any set $X$ and any pointwise-defined, positive definite kernel $K$ on $X \times X$. For each $x \in X$, let $K_x : X \to \mathbb{R}$ be defined by $K_x(t) = K(x,t)$, and let

$$\mathcal{H}_K = \overline{\mathrm{span}}\{K_x : x \in X\} \qquad (13)$$

be the Reproducing Kernel Hilbert Space (RKHS) induced by $K$, with the inner product satisfying $\langle K_x, K_t\rangle_K = K(x,t)$; see [1]. The following feature map is then immediate:

$$\Phi_K : X \to \mathcal{H}_K, \qquad \Phi_K(x) = K_x. \qquad (14)$$

In this section we discuss, via examples, two other methods for obtaining feature maps. Let $X \subset \mathbb{R}^n$ be any subset. Consider the Gaussian kernel $K(x,t) = \exp(-\frac{\|x-t\|^2}{\sigma^2})$ on $X \times X$, which admits the expansion

$$K(x,t) = e^{-\|x\|^2/\sigma^2}\, e^{-\|t\|^2/\sigma^2} \sum_{k=0}^{\infty} \frac{(2/\sigma^2)^k}{k!} \sum_{|\alpha|=k} C^k_\alpha\, x^\alpha t^\alpha, \qquad (15)$$

where $C^k_\alpha = \frac{k!}{\alpha_1!\cdots\alpha_n!}$. This implies the feature map $\Phi_g : X \to \ell^2$, where

$$\Phi_g(x) = e^{-\|x\|^2/\sigma^2} \left( \sqrt{\frac{(2/\sigma^2)^k\, C^k_\alpha}{k!}}\; x^\alpha \right)_{|\alpha|=k,\; k=0}^{\infty}.$$

Remark 4. The standard polynomial feature maps in machine learning, see for example ([7], page 8), are obtained in exactly the same way.
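Truncating the expansion (15) at a finite total degree gives a computable finite-dimensional approximation of $\Phi_g$; the sketch below (my own, plain Python/NumPy, with illustrative points and bandwidth) shows the inner products converging to the Gaussian kernel as the truncation degree grows.

```python
# Truncate the feature map Phi_g of (15) at total degree k_max and compare
# <Phi_g(x), Phi_g(t)> with the exact Gaussian kernel.
import itertools, math
import numpy as np

def phi_g(x, sigma, k_max):
    feats = []
    for k in range(k_max + 1):
        for alpha in itertools.product(range(k + 1), repeat=len(x)):
            if sum(alpha) != k:
                continue
            C = math.factorial(k) / math.prod(math.factorial(a) for a in alpha)
            coef = math.sqrt((2 / sigma**2) ** k * C / math.factorial(k))
            feats.append(coef * math.prod(xi**a for xi, a in zip(x, alpha)))
    return math.exp(-np.dot(x, x) / sigma**2) * np.array(feats)

x, t, sigma = np.array([0.3, -0.2]), np.array([0.1, 0.5]), 1.0
exact = math.exp(-np.sum((x - t) ** 2) / sigma**2)
for k_max in (1, 3, 6):
    approx = float(np.dot(phi_g(x, sigma, k_max), phi_g(t, sigma, k_max)))
    print(k_max, approx - exact)           # error -> 0 as k_max grows
```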

Consider next a special class of kernels that is widely used in practice, called convolution kernels. We recall that for a function $f \in L^1(\mathbb{R}^n)$, its Fourier transform is defined to be

$$\hat{f}(\xi) = \int_{\mathbb{R}^n} f(x)\, e^{-i\langle\xi, x\rangle}\, dx.$$

By a Fourier transform computation, it may be shown that if $\mu : \mathbb{R}^n \to \mathbb{R}$ is even and nonnegative, with $\mu, \sqrt{\mu} \in L^1(\mathbb{R}^n)$, then the kernel $K : \mathbb{R}^n \times \mathbb{R}^n \to \mathbb{R}$ defined by

$$K(x,t) = \int_{\mathbb{R}^n} \mu(u)\, e^{i\langle x-t,\, u\rangle}\, du \qquad (16)$$

is continuous, symmetric, and positive definite. Furthermore, for any $x, t \in \mathbb{R}^n$,

$$K(x,t) = \frac{1}{(2\pi)^n} \int_{\mathbb{R}^n} \widehat{\sqrt{\mu}}(x-u)\; \widehat{\sqrt{\mu}}(t-u)\, du. \qquad (17)$$

The following then is a feature map of $K$ on $X \times X$, $X \subset \mathbb{R}^n$:

$$\Phi_{\mathrm{conv}} : X \to L^2(\mathbb{R}^n), \qquad (\Phi_{\mathrm{conv}}(x))(u) = \frac{1}{(2\pi)^{n/2}}\, \widehat{\sqrt{\mu}}(x-u). \qquad (18)$$

For the Gaussian kernel $e^{-\|x-t\|^2/\sigma^2} = \int_{\mathbb{R}^n} \mu(u)\, e^{i\langle x-t,\, u\rangle}\, du$, we have

$$\mu(u) = \left(\frac{\sigma}{2\sqrt{\pi}}\right)^n e^{-\sigma^2\|u\|^2/4} \qquad\text{and}\qquad (\Phi_{\mathrm{conv}}(x))(u) = \left(\frac{4}{\pi\sigma^2}\right)^{n/4} e^{-2\|x-u\|^2/\sigma^2}.$$

One can similarly obtain feature maps for the inverse multiquadric, exponential, or B-spline kernels. The identity

$$e^{-\|x-t\|^2/\sigma^2} = \left(\frac{4}{\pi\sigma^2}\right)^{n/2} \int_{\mathbb{R}^n} e^{-2\|x-u\|^2/\sigma^2}\, e^{-2\|t-u\|^2/\sigma^2}\, du$$

can also be verified directly, as done in [10], where implications of the Gaussian feature map $\Phi_{\mathrm{conv}}(x)$ above are also discussed.
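The last identity is easy to confirm by numerical quadrature; a one-dimensional check (my own sketch, assuming SciPy):

```python
# Check, in one dimension, the factorization behind Phi_conv for the Gaussian:
# exp(-(x-t)^2/sigma^2)
#   = sqrt(4/(pi sigma^2)) * Int exp(-2(x-u)^2/sigma^2) exp(-2(t-u)^2/sigma^2) du.
import numpy as np
from scipy.integrate import quad

x, t, sigma = 0.7, -0.4, 1.3
integral, _ = quad(lambda u: np.exp(-2 * (x - u) ** 2 / sigma**2)
                             * np.exp(-2 * (t - u) ** 2 / sigma**2),
                   -np.inf, np.inf)
lhs = np.exp(-((x - t) ** 2) / sigma**2)
rhs = np.sqrt(4 / (np.pi * sigma**2)) * integral
print(lhs, rhs)    # the two values agree
```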

4.3 Equivalence of Feature Maps

It is known ([7], page 39) that, given a set $X$ and a pointwise-defined, symmetric, positive definite kernel $K$ on $X \times X$, all feature maps from $X$ into Hilbert spaces are essentially equivalent. In this section, we make this statement precise. Let $\mathcal{H}$ be a Hilbert space and $\Phi : X \to \mathcal{H}$ be such that $\langle \Phi_x, \Phi_t\rangle_{\mathcal{H}} = K(x,t)$ for all $x, t \in X$, where $\Phi_x = \Phi(x)$. The evaluation functional $L_x : \mathcal{H} \to \mathbb{R}$ given by $L_x v = \langle v, \Phi_x\rangle_{\mathcal{H}}$, where $x$ varies over $X$, defines an inclusion map

$$L_\Phi : \mathcal{H} \to \mathbb{R}^X, \qquad (L_\Phi v)(x) = \langle v, \Phi_x\rangle_{\mathcal{H}},$$

where $\mathbb{R}^X$ denotes the vector space of pointwise-defined, real-valued functions on $X$. Observe that, as a vector space of functions, $\mathcal{H}_K \subset \mathbb{R}^X$.

Proposition 1. Let $\mathcal{H}_\Phi = \overline{\mathrm{span}}\{\Phi_x : x \in X\}$, a subspace of $\mathcal{H}$. The restriction of $L_\Phi$ to $\mathcal{H}_\Phi$ is an isometric isomorphism between $\mathcal{H}_\Phi$ and $\mathcal{H}_K$.

Proof. First, $L_\Phi$ is bijective from $\mathcal{H}_\Phi$ onto the image $L_\Phi(\mathcal{H}_\Phi)$, since $\ker L_\Phi = \mathcal{H}_\Phi^\perp$. Under the map $L_\Phi$, for each $x, t \in X$, $(L_\Phi \Phi_x)(t) = \langle \Phi_x, \Phi_t\rangle = K_x(t)$; thus $K_x \equiv L_\Phi \Phi_x$ as functions on $X$. This implies that $\mathrm{span}\{K_x : x \in X\}$ is isomorphic to $\mathrm{span}\{\Phi_x : x \in X\}$ as vector spaces. The isometric isomorphism of $\mathcal{H}_\Phi = \overline{\mathrm{span}}\{\Phi_x : x \in X\}$ and $\mathcal{H}_K = \overline{\mathrm{span}}\{K_x : x \in X\}$ then follows from $\langle \Phi_x, \Phi_t\rangle_{\mathcal{H}} = K(x,t) = \langle K_x, K_t\rangle_K$. This completes the proof.

Remark 5. Each choice of $\Phi$ is thus equivalent to a factorization of the map $\Phi_K : x \mapsto K_x \in \mathcal{H}_K$ through $\mathcal{H}_\Phi$; that is, the diagram

$$x \in X \;\xrightarrow{\;\Phi\;}\; \Phi_x \in \mathcal{H}_\Phi \;\xrightarrow{\;L_\Phi\;}\; K_x \in \mathcal{H}_K \qquad (19)$$

commutes: $\Phi_K = L_\Phi \circ \Phi$. We will call $\Phi_K : x \mapsto K_x \in \mathcal{H}_K$ the canonical feature map associated with $K$.

5 Smoothing Properties of Kernels on the Sphere

Having discussed feature maps, we will in this section analyze the smoothing properties of the polynomial and Gaussian kernels and compare them with those of spline kernels on the sphere $S^{n-1}$. In the spline smoothing problem on $S^1$ as described in [12], one solves the minimization problem

$$\min_f\; \frac{1}{m}\sum_{i=1}^m (f(x_i) - y_i)^2 + \lambda \int_0^1 (f^{(m)}(t))^2\, dt \qquad (20)$$

for $x_i \in [0,1]$ and $f \in W_m$, where $J_m(f) = \int_0^1 (f^{(m)}(t))^2\, dt$ is the square norm of the RKHS

$$W_m^0 = \{f : f, f', \ldots, f^{(m-1)} \text{ absolutely continuous},\; f^{(m)} \in L^2[0,1],\; f^{(j)}(0) = f^{(j)}(1),\; j = 0, 1, \ldots, m-1\}.$$

The space $W_m$ is then the RKHS

$$W_m = \{1\} \oplus W_m^0 = \left\{f : \|f\|_K^2 = \left(\int_0^1 f(t)\, dt\right)^2 + \int_0^1 (f^{(m)}(t))^2\, dt < \infty \right\}$$

induced by a kernel $K$. One particular feature of spline smoothing, on $S^1$, $S^2$, or $\mathbb{R}^n$, is that in general the RKHS $W_m$ does not have a closed-form kernel $K$ that is efficiently computable. This is in contrast with the RKHS used in kernel machine learning, all of which correspond to closed-form kernels that can be evaluated efficiently. It is not clear, however, whether the norms in these RKHS correspond to smoothness functionals. In this section, we will show that for the polynomial and Gaussian kernels on $S^{n-1}$, they do.

5.1 The Iterated Laplacian and Splines on the Sphere $S^{n-1}$

Splines on $S^{n-1}$ for $n = 2$ and $n = 3$, as treated by Wahba [11], [12], can be generalized to any $n \in \mathbb{N}$ via the iterated Laplacian (also called the Laplace-Beltrami operator) on $S^{n-1}$. The RKHS corresponding to $W_m$ in (20) is a subspace of $L^2(S^{n-1})$ described by

$$\mathcal{H}_K = \left\{f : \|f\|_K^2 = \left(\int_{S^{n-1}} f(x)\, dx\right)^2 + \int_{S^{n-1}} f(x)\, \Delta^m f(x)\, dx < \infty \right\}.$$

The Laplacian on $S^{n-1}$ has eigenvalues $\lambda_k = k(k+n-2)$, $k \geq 0$, with corresponding eigenfunctions $\{Y_{k,j}(n;x)\}_{j=1}^{N(n,k)}$, which form an orthonormal basis of the space $\mathbb{Y}_k(n)$ of spherical harmonics of order $k$. For $f \in L^2(S^{n-1})$, if we use the expansion

$$f = \frac{a_0}{\sqrt{|S^{n-1}|}} + \sum_{k=1}^{\infty}\sum_{j=1}^{N(n,k)} a_{k,j}\, Y_{k,j}(n;x),$$

then the space $\mathcal{H}_K$ takes the form

$$\mathcal{H}_K = \left\{f \in L^2(S^{n-1}) : \|f\|_K^2 = a_0^2\, |S^{n-1}| + \sum_{k=1}^{\infty}\sum_{j=1}^{N(n,k)} [k(k+n-2)]^m\, a_{k,j}^2 < \infty \right\}$$

and thus the corresponding kernel is

$$K(x,t) = 1 + \sum_{k=1}^{\infty}\sum_{j=1}^{N(n,k)} \frac{Y_{k,j}(n;x)\, Y_{k,j}(n;t)}{[k(k+n-2)]^m}, \qquad (21)$$

which is well-defined iff $m > \frac{n-1}{2}$. Let $P_k(n;t)$ denote the Legendre polynomial of degree $k$ in dimension $n$; then (21) takes the form

$$K(x,t) = 1 + \frac{1}{|S^{n-1}|}\sum_{k=1}^{\infty} \frac{N(n,k)}{[k(k+n-2)]^m}\, P_k(n; \langle x,t\rangle), \qquad (22)$$

which does not have a closed form in general; for the case $n = 3$, see [11].

Remark 6. Clearly $m$ can be replaced by any real number $s > \frac{n-1}{2}$.
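Although (22) has no closed form in general, the truncated series is straightforward to evaluate; a sketch for $n = 3$, where $P_k(3;t)$ is the classical Legendre polynomial and $N(3,k) = 2k+1$ (my own, assuming SciPy):

```python
# Evaluate the spline kernel (22) on S^2 (n = 3) by truncating the series.
# Convergence requires m > (n-1)/2 = 1; the coefficients decay like k^{1-2m}.
import numpy as np
from scipy.special import eval_legendre

def spline_kernel_S2(cos_gamma, m=2, k_max=500):
    k = np.arange(1, k_max + 1)
    coeff = (2 * k + 1) / ((k * (k + 1.0)) ** m * 4 * np.pi)
    return 1.0 + np.sum(coeff * eval_legendre(k, cos_gamma))

for c in (1.0, 0.0, -1.0):
    print(c, spline_kernel_S2(c))          # kernel as a function of <x,t> = cos(gamma)
```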

5.2 Smoothing Properties of Polynomial and Gaussian Kernels

Let $\nabla_n$ denote the gradient operator on $S^{n-1}$ (also called the first-order Beltrami operator; see [6], page 79, for a definition). Let $Y_k \in \mathbb{Y}_k(n)$, $k \geq 0$, be $L^2$-normalized; then

$$\|\nabla_n Y_k\|^2_{L^2(S^{n-1})} = \int_{S^{n-1}} |\nabla_n Y_k(x)|^2\, dS^{n-1}(x) = k(k+n-2). \qquad (23)$$

This shows that spherical harmonics of higher order are less smooth. This is particularly transparent in the case $n = 2$ with the Fourier basis functions $\{1, \cos k\theta, \sin k\theta\}_{k \in \mathbb{N}}$: as $k$ increases, the functions oscillate more rapidly. It follows that any regularization term $\|f\|_K^2$ in problems such as (20), where $K$ possesses a decreasing spectrum $\lambda_k$ ($k$ corresponds to the order of the spherical harmonics), will have a smoothing effect: the higher-order spherical harmonics, which are less smooth, will be penalized more. The decreasing-spectrum property holds for the spline kernels, the polynomial kernel $(1+\langle x,t\rangle)^d$, and the Gaussian kernel for $\sigma \geq (2/n)^{1/2}$, as we showed in Theorems 2 and 3. Hence all these kernels possess smoothing properties on $S^{n-1}$. Furthermore, Theorem 2 shows that for the Gaussian kernel, for all $k$,

$$A_1\, \frac{(e/\sigma^2)^k}{(k+\frac{n-2}{2})^{k+\frac{n-1}{2}}} \;<\; \lambda_k \;<\; A_2\, \frac{(e/\sigma^2)^k}{(k+\frac{n-2}{2})^{k+\frac{n-1}{2}}},$$

and Theorem 3 shows that for the polynomial kernel $(1+\langle x,t\rangle)^d$,

$$\frac{B_1}{(k+d+\frac{n-1}{2})^{d+\frac{n-3}{2}}} \;<\; \lambda_k \;<\; \frac{B_2}{(k+d+\frac{n-1}{2})^{d+\frac{n-3}{2}}}$$

for $0 \leq k \leq d$. Comparing these with the eigenvalues of the spline kernels,

$$\lambda_k = \frac{1}{[k(k+n-2)]^m}$$

for $k \geq 1$, we see that the Gaussian kernel has the sharpest smoothing property, as can be seen from the exponential decay of its eigenvalues. For $K(x,t) = (1+\langle x,t\rangle)^d$, if $d > 2m - \frac{n-3}{2}$, then $K$ has a sharper smoothing property than a spline kernel of order $m$. Moreover, all spherical harmonics of order greater than $d$ are filtered out; hence choosing $K$ amounts to choosing a hypothesis space of bandlimited functions on $S^{n-1}$.

A Proofs of Results

The proofs of the results on $S^{n-1}$ all make use of properties of spherical harmonics on $S^{n-1}$, which can be found in [6]. We will prove Theorem 2 (Theorem 3 is similar) and Theorem 5.

A.1 Proof of Theorem 2

Let $f : [-1,1] \to \mathbb{R}$ be a continuous function and let $Y_k \in \mathbb{Y}_k(n)$ for $k \geq 0$. Then the Funk-Hecke formula ([6], page 30) states that for any $x \in S^{n-1}$,

$$\int_{S^{n-1}} f(\langle x,t\rangle)\, Y_k(t)\, dS^{n-1}(t) = \lambda_k\, Y_k(x), \qquad (24)$$

where

$$\lambda_k = |S^{n-2}| \int_{-1}^{1} f(t)\, P_k(n;t)\, (1-t^2)^{\frac{n-3}{2}}\, dt \qquad (25)$$

and $P_k(n;t)$ denotes the Legendre polynomial of degree $k$ in dimension $n$. Since the spherical harmonics $\{\{Y_{k,j}(n;x)\}_{j=1}^{N(n,k)}\}_{k=0}^{\infty}$ form an orthonormal basis of $L^2(S^{n-1})$, an immediate consequence of the Funk-Hecke formula is that if $K$ on $S^{n-1} \times S^{n-1}$ is defined by $K(x,t) = f(\langle x,t\rangle)$ and $\mu$ is the Lebesgue measure on $S^{n-1}$, then the eigenvalues of $L_K : L^2_\mu(S^{n-1}) \to L^2_\mu(S^{n-1})$ are given precisely by (25), with the corresponding orthonormal eigenfunctions of $\lambda_k$ being $\{Y_{k,j}(n;x)\}_{j=1}^{N(n,k)}$. The multiplicity of $\lambda_k$ is therefore $N(n,k) = \dim \mathbb{Y}_k(n)$.

On $S^{n-1}$, $e^{-\|x-t\|^2/\sigma^2} = e^{-2/\sigma^2}\, e^{2\langle x,t\rangle/\sigma^2}$. Thus

$$\lambda_k = e^{-2/\sigma^2}\, |S^{n-2}| \int_{-1}^1 e^{2t/\sigma^2}\, P_k(n;t)\, (1-t^2)^{\frac{n-3}{2}}\, dt = e^{-2/\sigma^2}\, |S^{n-2}|\, \Gamma\!\left(\tfrac{1}{2}\right)\Gamma\!\left(\tfrac{n-1}{2}\right) \sigma^{n-2}\, I_{k+\frac{n-2}{2}}\!\left(\tfrac{2}{\sigma^2}\right)$$

by Lemma 1, and this equals

$$e^{-2/\sigma^2}\, \sigma^{n-2}\, I_{k+\frac{n-2}{2}}\!\left(\tfrac{2}{\sigma^2}\right) \Gamma\!\left(\tfrac{n}{2}\right) |S^{n-1}|.$$

Normalizing so that $\mu(S^{n-1}) = 1$ gives the required expression for $\lambda_k$ in (5).

Lemma 1. Let $f(t) = e^{rt}$. Then

$$\int_{-1}^1 f(t)\, P_k(n;t)\, (1-t^2)^{\frac{n-3}{2}}\, dt = \Gamma\!\left(\tfrac{1}{2}\right)\Gamma\!\left(\tfrac{n-1}{2}\right)\left(\tfrac{2}{r}\right)^{\frac{n-2}{2}} I_{k+\frac{n-2}{2}}(r). \qquad (26)$$

Proof. We apply the following, which follows from ([13], page 79, formula 9):

$$\int_{-1}^1 e^{rt}\, (1-t^2)^{\nu - \frac{1}{2}}\, dt = \Gamma\!\left(\tfrac{1}{2}\right)\Gamma\!\left(\nu + \tfrac{1}{2}\right)\left(\tfrac{2}{r}\right)^{\nu} I_\nu(r), \qquad (27)$$

and the Rodrigues rule ([6], page 3), which states that for $f \in C^k([-1,1])$,

$$\int_{-1}^1 f(t)\, P_k(n;t)\, (1-t^2)^{\frac{n-3}{2}}\, dt = R_k(n) \int_{-1}^1 f^{(k)}(t)\, (1-t^2)^{k+\frac{n-3}{2}}\, dt, \qquad (28)$$

where $R_k(n) = \frac{\Gamma(\frac{n-1}{2})}{2^k\, \Gamma(k+\frac{n-1}{2})}$. For $f(t) = e^{rt}$, we have

$$\int_{-1}^1 e^{rt}\, P_k(n;t)\, (1-t^2)^{\frac{n-3}{2}}\, dt = R_k(n)\, r^k \int_{-1}^1 e^{rt}\, (1-t^2)^{k+\frac{n-3}{2}}\, dt = R_k(n)\, r^k\, \Gamma\!\left(\tfrac{1}{2}\right)\Gamma\!\left(k+\tfrac{n-1}{2}\right)\left(\tfrac{2}{r}\right)^{k+\frac{n-2}{2}} I_{k+\frac{n-2}{2}}(r).$$

Substituting in the value of $R_k(n)$ gives the desired answer.

Lemma 2. The sequence $\{\lambda_k\}_{k=0}^{\infty}$ is decreasing if $\sigma \geq (2/n)^{1/2}$.

Proof. We will first prove that $\frac{\lambda_k}{\lambda_{k+1}} > (k + \frac{n}{2})\sigma^2$. We have

$$I_{k+\frac{n}{2}}\!\left(\tfrac{2}{\sigma^2}\right) = \left(\tfrac{1}{\sigma^2}\right)^{k+\frac{n}{2}} \sum_{j=0}^{\infty} \frac{(1/\sigma^4)^j}{j!\,\Gamma(j+k+\frac{n}{2}+1)} = \left(\tfrac{1}{\sigma^2}\right)^{k+\frac{n}{2}} \sum_{j=0}^{\infty} \frac{(1/\sigma^4)^j}{j!\,(j+k+\frac{n}{2})\,\Gamma(j+k+\frac{n}{2})}$$

$$< \frac{(1/\sigma^2)^{k+\frac{n}{2}}}{k+\frac{n}{2}} \sum_{j=0}^{\infty} \frac{(1/\sigma^4)^j}{j!\,\Gamma(j+k+\frac{n}{2})} = \frac{\sigma^{-2}}{k+\frac{n}{2}}\, I_{k+\frac{n-2}{2}}\!\left(\tfrac{2}{\sigma^2}\right),$$

which implies $\frac{\lambda_k}{\lambda_{k+1}} > (k+\frac{n}{2})\sigma^2$. The inequality $\lambda_k \geq \lambda_{k+1}$ is thus satisfied if $\sigma^2(k+\frac{n}{2}) \geq 1$ for all $k \geq 0$. It suffices to require that it holds for $k = 0$, that is, $\sigma^2 \geq \frac{2}{n}$, i.e., $\sigma \geq (2/n)^{1/2}$.

Proof (of (6)). By the definition of $I_\nu(z)$, we have for $z > 0$,

$$I_\nu(z) < \frac{(z/2)^\nu}{\Gamma(\nu+1)} \sum_{j=0}^{\infty} \frac{(z^2/4)^j}{j!} = \frac{(z/2)^\nu\, e^{z^2/4}}{\Gamma(\nu+1)}.$$

Then for $\nu = k + \frac{n-2}{2}$ and $z = \frac{2}{\sigma^2}$:

$$I_{k+\frac{n-2}{2}}\!\left(\tfrac{2}{\sigma^2}\right) < \frac{(1/\sigma^2)^{k+\frac{n-2}{2}}\, e^{1/\sigma^4}}{\Gamma(k+\frac{n}{2})}.$$

Consider Stirling's series for $a > 0$:

$$\Gamma(a+1) = \sqrt{2\pi a}\left(\frac{a}{e}\right)^a \left[1 + \frac{1}{12a} + \frac{1}{288a^2} - \frac{139}{51840a^3} - \cdots\right]. \qquad (29)$$

Thus for all $a > 0$ we can write $\Gamma(a+1) = e^{A(a)}\, \sqrt{2\pi}\left(\frac{a}{e}\right)^a \sqrt{a}$, where $0 < A(a) < \frac{1}{12a}$. Hence, for $a = k + \frac{n-2}{2}$,

$$\Gamma\!\left(k+\tfrac{n}{2}\right) = e^{A(k,n)}\, \sqrt{2\pi}\left(\frac{k+\frac{n-2}{2}}{e}\right)^{k+\frac{n-2}{2}} \sqrt{k+\tfrac{n-2}{2}},$$

where $0 < A(k,n) < \frac{1}{12(k+\frac{n-2}{2})}$. Then

$$I_{k+\frac{n-2}{2}}\!\left(\tfrac{2}{\sigma^2}\right) < \frac{e^{1/\sigma^4}\, (e/\sigma^2)^{k+\frac{n-2}{2}}}{\sqrt{2\pi}\,\left(k+\frac{n-2}{2}\right)^{k+\frac{n-1}{2}}},$$

implying the upper bound in (6). The other direction is obtained similarly.
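Lemma 1 can be spot-checked by quadrature; for $n = 3$, $P_k(3;t)$ is the classical Legendre polynomial and (26) reads $\int_{-1}^1 e^{rt}\,P_k(t)\,dt = \sqrt{\pi}\,(2/r)^{1/2}\,I_{k+1/2}(r)$ (my own sketch, assuming SciPy):

```python
# Numerical spot-check of Lemma 1 for n = 3, where (1-t^2)^{(n-3)/2} = 1 and
# Gamma(1/2) Gamma((n-1)/2) = sqrt(pi): compare the integral with the Bessel form.
import numpy as np
from scipy.integrate import quad
from scipy.special import eval_legendre, iv

r, k = 2.0, 3
lhs, _ = quad(lambda t: np.exp(r * t) * eval_legendre(k, t), -1.0, 1.0)
rhs = np.sqrt(np.pi) * np.sqrt(2.0 / r) * iv(k + 0.5, r)
print(lhs, rhs)    # both sides agree
```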

A.2 Proof of Theorem 5

We will first show an upper bound for $\|Y_k\|_\infty$, where $Y_k$ is any $L^2(S^{n-1})$-normalized function in $\mathbb{Y}_k(n)$, and then exhibit a one-dimensional subspace of functions in $\mathbb{Y}_k(n)$ that attain this upper bound. Observe that $Y_k$ belongs to an orthonormal basis $\{Y_{k,j}(n;x)\}_{j=1}^{N(n,k)}$ of $\mathbb{Y}_k(n)$. The following highlights the crucial difference between the cases $n = 2$ and $n \geq 3$.

Lemma 3. For any $n \geq 2$, $k \geq 0$, and all $j \in \mathbb{N}$, $j \leq N(n,k)$,

$$\|Y_{k,j}(n;\cdot)\|_\infty \leq \sqrt{\frac{N(n,k)}{|S^{n-1}|}}. \qquad (30)$$

In particular, for $n = 2$ and all $k \geq 0$: $\|Y_{k,j}(2;\cdot)\|_\infty \leq \frac{1}{\sqrt{\pi}}$.

Proof. The Addition Theorem for spherical harmonics ([6], page 8) states that for any $x, \alpha \in S^{n-1}$,

$$\sum_{j=1}^{N(n,k)} Y_{k,j}(n;x)\, Y_{k,j}(n;\alpha) = \frac{N(n,k)}{|S^{n-1}|}\, P_k(n; \langle x,\alpha\rangle),$$

which implies that for any $x \in S^{n-1}$,

$$Y_{k,j}(n;x)^2 \leq \sum_{j=1}^{N(n,k)} Y_{k,j}(n;x)^2 = \frac{N(n,k)}{|S^{n-1}|}\, P_k(n; \langle x,x\rangle) = \frac{N(n,k)}{|S^{n-1}|}\, P_k(n;1) = \frac{N(n,k)}{|S^{n-1}|},$$

giving us the first result. For $n = 2$, we have $N(2,0) = 1$, $N(2,k) = 2$ for $k \geq 1$, and $|S^1| = 2\pi$, giving us the second result.

Definition 2. Consider the group $O(n)$ of all orthogonal transformations of $\mathbb{R}^n$, that is, $O(n) = \{A \in \mathbb{R}^{n \times n} : A^TA = AA^T = I\}$. A function $f : S^{n-1} \to \mathbb{R}$ is said to be invariant under a transformation $A \in O(n)$ if $f_A(x) = f(Ax) = f(x)$ for all $x \in S^{n-1}$. Let $\alpha \in S^{n-1}$. The isotropy group $J_{n,\alpha}$ is defined by

$$J_{n,\alpha} = \{A \in O(n) : A\alpha = \alpha\}.$$

Lemma 4. Assume that $Y_k \in \mathbb{Y}_k(n)$ is invariant with respect to $J_{n,\alpha}$ and satisfies $\int_{S^{n-1}} Y_k(x)^2\, dS^{n-1}(x) = 1$. Then $Y_k$ is unique up to a multiplicative constant $C_{\alpha,n,k}$ with $|C_{\alpha,n,k}| = 1$, and

$$\|Y_k\|_\infty = |Y_k(\alpha)| = \sqrt{\frac{N(n,k)}{|S^{n-1}|}}. \qquad (31)$$

Proof. If $Y_k$ is invariant with respect to $J_{n,\alpha}$, then by ([6], Lemma 3, page 7) it must satisfy $Y_k(x) = Y_k(\alpha)\, P_k(n; \langle x,\alpha\rangle)$, showing that the subspace of $\mathbb{Y}_k(n)$ invariant with respect to $J_{n,\alpha}$ is one-dimensional. The Addition Theorem implies that for any $\alpha \in S^{n-1}$,

$$\int_{S^{n-1}} P_k(n; \langle x,\alpha\rangle)^2\, dS^{n-1}(x) = \frac{|S^{n-1}|}{N(n,k)}.$$

By assumption, we then have

$$1 = \int_{S^{n-1}} Y_k(x)^2\, dS^{n-1}(x) = Y_k(\alpha)^2 \int_{S^{n-1}} P_k(n;\langle x,\alpha\rangle)^2\, dS^{n-1}(x) = Y_k(\alpha)^2\, \frac{|S^{n-1}|}{N(n,k)},$$

giving us $Y_k(\alpha)^2 = \frac{N(n,k)}{|S^{n-1}|}$. Thus for all $x \in S^{n-1}$,

$$|Y_k(x)| = \sqrt{\frac{N(n,k)}{|S^{n-1}|}}\; |P_k(n;\langle x,\alpha\rangle)| \leq \sqrt{\frac{N(n,k)}{|S^{n-1}|}}$$

by the property $|P_k(n;t)| \leq 1$ for $|t| \leq 1$. Thus $\|Y_k\|_\infty = \sqrt{\frac{N(n,k)}{|S^{n-1}|}}$, as desired.

Proposition 2 (Orthonormal basis of $\mathbb{Y}_k(n)$ [6]). Let $n \geq 3$. Let $e_1, \ldots, e_n$ be the canonical basis of $\mathbb{R}^n$. Let $x \in S^{n-1}$. We write $x = t\,e_n + \sqrt{1-t^2}\,(x_{(n-1)}, 0)^T$, where $t \in [-1,1]$ and $x_{(n-1)} \in S^{n-2}$, $(x_{(n-1)}, 0)^T \in \mathrm{span}\{e_1, \ldots, e_{n-1}\}$. Suppose that for $m = 0, 1, \ldots, k$ the orthonormal bases $Y_{m,j}$, $j = 1, \ldots, N(n-1,m)$, of $\mathbb{Y}_m(n-1)$ are given; then an orthonormal basis of $\mathbb{Y}_k(n)$ is

$$Y_{k,m,j}(n;x) = A^m_k(n;t)\, Y_{m,j}(n-1;\, x_{(n-1)}), \quad j = 1, \ldots, N(n-1,m), \qquad (32)$$

starting with the Fourier basis for $n = 2$, where $P^m_k(n;t)$ denotes the associated Legendre function and

$$A^m_k(n;t) = \frac{1}{k!\,\Gamma(\frac{n-1}{2})}\sqrt{\frac{(2k+n-2)\,(k-m)!\,(k+n+m-3)!}{2^{n-2}}}\; P^m_k(n;t). \qquad (33)$$

Proposition 3. Let $n \in \mathbb{N}$, $n \geq 3$. Let $\mu$ be the Lebesgue measure on $S^{n-1}$. For each $k \geq 0$, any orthonormal basis of the space $\mathbb{Y}_k(n)$ of spherical harmonics of order $k$ contains an $L^2_\mu$-normalized spherical harmonic $Y_k$ such that

$$\|Y_k\|_\infty = \sqrt{\frac{N(n,k)}{|S^{n-1}|}} = \sqrt{\frac{(2k+n-2)(k+n-3)!}{k!\,(n-2)!\,|S^{n-1}|}} \to \infty \qquad (34)$$

as $k \to \infty$, where $|S^{n-1}| = \frac{2\pi^{n/2}}{\Gamma(n/2)}$ is the surface area of $S^{n-1}$.

Proof. Let $x = t\,e_n + \sqrt{1-t^2}\,(x_{(n-1)}, 0)^T$, $-1 \leq t \leq 1$. For each $k \geq 0$, the orthonormal basis of $\mathbb{Y}_k(n)$ in Proposition 2 contains the function

$$Y_{k,0,1}(n;x) = A^0_k(t)\, Y_{0,1}(n-1;\, x_{(n-1)}) = \frac{A^0_k(t)}{\sqrt{|S^{n-2}|}}, \qquad (35)$$

and one checks that

$$Y_{k,0,1}(n;x) = \sqrt{\frac{N(n,k)}{|S^{n-1}|}}\; P_k(n;t).$$

Then $Y_{k,0,1}(n;x)$ is invariant with respect to $J_{n,\alpha}$, where $\alpha = (0, \ldots, 0, 1)$. Thus $Y_k = Y_{k,0,1}$ is the desired function for this particular orthonormal basis. For an arbitrary orthonormal basis of $\mathbb{Y}_k(n)$, the result follows from Lemma 4 and rotational symmetry on the sphere.
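The Addition Theorem underlying Lemma 3 and Proposition 3 can also be checked numerically on $S^2$ ($n = 3$, $N(3,k) = 2k+1$, $|S^2| = 4\pi$), using the complex spherical harmonics available in SciPy (my own sketch; the two test points are arbitrary):

```python
# Check the Addition Theorem on S^2: sum_m Y_{km}(x) conj(Y_{km}(y))
#   = N(3,k)/|S^2| * P_k(<x,y>) = (2k+1)/(4 pi) * P_k(cos gamma).
# scipy.special.sph_harm(m, k, theta, phi) uses theta = azimuth, phi = polar angle.
import numpy as np
from scipy.special import sph_harm, eval_legendre

def unit(theta, phi):
    return np.array([np.sin(phi) * np.cos(theta),
                     np.sin(phi) * np.sin(theta),
                     np.cos(phi)])

k = 4
(t1, p1), (t2, p2) = (0.3, 1.1), (2.0, 0.7)
lhs = sum(sph_harm(m, k, t1, p1) * np.conj(sph_harm(m, k, t2, p2))
          for m in range(-k, k + 1))
rhs = (2 * k + 1) / (4 * np.pi) * eval_legendre(k, unit(t1, p1) @ unit(t2, p2))
print(lhs.real, rhs)    # agree; the imaginary part of lhs is ~ 0
```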

Proof (of Theorem 5). By the Funk-Hecke formula, all spherical harmonics of order $k$ are eigenfunctions corresponding to the eigenvalue $\lambda_k$ given by (25). If infinitely many of the $\lambda_k$'s are nonzero, then the corresponding set of $L^2(S^{n-1})$-orthonormal eigenfunctions $\{\phi_k\}$, being an orthonormal basis of $L^2(S^{n-1})$, contains a spherical harmonic $Y_k$ satisfying (34) for infinitely many $k$. It then follows from Proposition 3 that $\sup_k \|\phi_k\|_\infty = \infty$.

References

1. N. Aronszajn. Theory of Reproducing Kernels. Transactions of the American Mathematical Society, vol. 68, pages 337-404, 1950.
2. M. Belkin, P. Niyogi, and V. Sindhwani. Manifold Regularization: A Geometric Framework for Learning from Examples. University of Chicago Computer Science Technical Report, 2004, accepted for publication.
3. M. Belkin and P. Niyogi. Semi-supervised Learning on Riemannian Manifolds. Machine Learning, Special Issue on Clustering, vol. 56, pages 209-239, 2004.
4. E. De Vito, A. Caponnetto, and L. Rosasco. Model Selection for Regularized Least-Squares Algorithm in Learning Theory. Foundations of Computational Mathematics, vol. 5, no. 1, pages 59-85, 2005.
5. J. Lafferty and G. Lebanon. Diffusion Kernels on Statistical Manifolds. Journal of Machine Learning Research, vol. 6, pages 129-163, 2005.
6. C. Müller. Analysis of Spherical Symmetries in Euclidean Spaces. Applied Mathematical Sciences 129, Springer, New York, 1998.
7. B. Schölkopf and A.J. Smola. Learning with Kernels. The MIT Press, Cambridge, Massachusetts, 2002.
8. S. Smale and D.X. Zhou. Learning Theory Estimates via Integral Operators and Their Approximations, 2005, to appear.
9. A.J. Smola, Z.L. Ovari, and R.C. Williamson. Regularization with Dot-Product Kernels. Advances in Neural Information Processing Systems, 2000.
10. I. Steinwart, D. Hush, and C. Scovel. An Explicit Description of the Reproducing Kernel Hilbert Spaces of Gaussian RBF Kernels. Los Alamos National Laboratory Technical Report LA-UR, December 2005.
11. G. Wahba. Spline Interpolation and Smoothing on the Sphere. SIAM Journal on Scientific and Statistical Computing, vol. 2, pages 5-16, 1981.
12. G. Wahba. Spline Models for Observational Data. CBMS-NSF Regional Conference Series in Applied Mathematics 59, Society for Industrial and Applied Mathematics, Philadelphia, 1990.
13. G.N. Watson. A Treatise on the Theory of Bessel Functions, 2nd edition, Cambridge University Press, Cambridge, England, 1944.
14. D.X. Zhou. The Covering Number in Learning Theory. Journal of Complexity, vol. 18, pages 739-767, 2002.
