Mercer's Theorem, Feature Maps, and Smoothing


Ha Quang Minh, Partha Niyogi, and Yuan Yao

Department of Computer Science, University of Chicago, 1100 East 58th St, Chicago, IL 60637, USA. Department of Mathematics, University of California, Berkeley, 970 Evans Hall, Berkeley, CA 94720, USA. minh,niyogi@cs.uchicago.edu, yao@math.berkeley.edu

Abstract. We study Mercer's theorem and feature maps for several positive definite kernels that are widely used in practice. The smoothing properties of these kernels will also be explored.

1 Introduction

Kernel-based methods have become increasingly popular and important in machine learning. The central idea behind the so-called kernel trick is that a closed-form Mercer kernel allows one to efficiently solve a variety of non-linear optimization problems that arise in regression, classification, inverse problems, and the like. It is well known in the machine learning community that kernels are associated with feature maps, and that a kernel-based procedure may be interpreted as mapping the data from the original input space into a potentially higher-dimensional feature space where linear methods may then be used. One finds many accounts of this idea in which the input space $X$ is mapped by a feature map $\Phi : X \to \mathcal{H}$ (where $\mathcal{H}$ is a Hilbert space) so that for any two points $x, y \in X$ we have $K(x,y) = \langle \Phi(x), \Phi(y)\rangle_{\mathcal{H}}$.

Yet, while much has been written about kernels and many different kinds of kernels have been discussed in the literature, much less has been written explicitly about their associated feature maps. In general, we do not have a clear and concrete understanding of what exactly these feature maps are. Our goal in this paper is to take steps toward a better understanding of feature maps by explicitly computing them for a number of popular kernels on a variety of domains. By doing so, we hope to clarify the precise nature of feature maps in very concrete terms, so that machine learning researchers may have a better feel for them. The following are the main points and new results of our paper.

1. As we will illustrate, feature maps and feature spaces are not unique. For a given domain $X$ and a fixed kernel $K$ on $X \times X$, there exist in fact infinitely many feature maps associated with $K$. Although these maps are essentially equivalent, in a sense to be made precise in Section 4.3, there are subtleties

that we wish to emphasize. For a given kernel $K$, the feature maps of $K$ induced by Mercer's theorem depend fundamentally on the domain $X$, as will be seen in the examples of Section 2. Moreover, feature maps do not necessarily arise from Mercer's theorem; examples will be given in Section 4. The importance of Mercer's theorem, however, goes far beyond the feature maps that it induces: the eigenvalues and eigenfunctions associated with $K$ play a central role in obtaining error estimates in learning theory, see for example [8], [4]. For this reason, the determination of the spectrum of $K$, which is highly nontrivial in general, is crucially important in its own right. Theorems 2 and 3 in Section 2 give the complete spectrum of the polynomial and Gaussian kernels on $S^{n-1}$, including sharp rates of decay of their eigenvalues. Theorem 4 gives the eigenfunctions and a recursive formula for the computation of the eigenvalues of the polynomial kernel on the hypercube $\{-1,1\}^n$.

2. One domain that we particularly focus on is the unit sphere $S^{n-1}$ in $\mathbb{R}^n$, for several reasons. First, it is a special example of a compact Riemannian manifold, and the problem of learning on manifolds has attracted attention recently, see for example [2], [3]. Second, its symmetric and homogeneous nature allows us to obtain complete and explicit results in many cases. We believe that $S^{n-1}$, together with kernels defined on it, is a fruitful source of examples and counterexamples for the theoretical analysis of kernel-based learning. We will point out that intuitions based on low dimensions such as $n = 2$ in general do not carry over to higher dimensions; Theorem 5 in Section 3 gives an important example along this line. We will also consider the unit ball $B^n$, the hypercube $\{-1,1\}^n$, and $\mathbb{R}^n$ itself.

3. We will also try to understand the smoothness properties of kernels on $S^{n-1}$. In particular, we will show that the polynomial and Gaussian kernels define Hilbert spaces of functions whose norms may be interpreted as smoothness functionals, similar to those of splines on $S^{n-1}$. We will obtain precise and sharp results on this question; this is the content of Section 5. The smoothness implications allow us to better understand the applicability of such kernels in solving smoothing problems.

Notation. For $X \subset \mathbb{R}^n$ and $\mu$ a Borel measure on $X$, $L^2_\mu(X) = \{f : X \to \mathbb{C} : \int_X |f(x)|^2 \, d\mu(x) < \infty\}$. We will also write $L^2(X)$ for $L^2_\mu(X)$ and $dx$ for $d\mu(x)$ when $\mu$ is the Lebesgue measure on $X$. The surface area of the unit sphere $S^{n-1}$ is denoted by $|S^{n-1}| = \frac{2\pi^{n/2}}{\Gamma(n/2)}$.

2 Mercer's Theorem

One of the fundamental mathematical results underlying learning theory with kernels is Mercer's theorem. Let $X$ be a closed subset of $\mathbb{R}^n$, $n \in \mathbb{N}$, $\mu$ a Borel measure on $X$, and $K : X \times X \to \mathbb{R}$ a symmetric function satisfying: for any

finite set of points $\{x_i\}_{i=1}^N$ in $X$ and real numbers $\{a_i\}_{i=1}^N$,

$$\sum_{i,j=1}^N a_i a_j K(x_i, x_j) \ge 0. \qquad (1)$$

$K$ is then said to be a positive definite kernel on $X$. Assume further that

$$\int_X \int_X K(x,t)^2 \, d\mu(x)\, d\mu(t) < \infty. \qquad (2)$$

Consider the induced integral operator $L_K : L^2_\mu(X) \to L^2_\mu(X)$ defined by

$$L_K f(x) = \int_X K(x,t)\, f(t)\, d\mu(t). \qquad (3)$$

This is a self-adjoint, positive, compact operator with a countable system of nonnegative eigenvalues $\{\lambda_k\}_{k=1}^\infty$ satisfying $\sum_{k=1}^\infty \lambda_k^2 < \infty$; $L_K$ is said to be Hilbert-Schmidt. The corresponding $L^2_\mu(X)$-normalized eigenfunctions $\{\phi_k\}_{k=1}^\infty$ form an orthonormal basis of $L^2_\mu(X)$. We recall that a Borel measure $\mu$ on $X$ is said to be strictly positive if the measure of every nonempty open subset of $X$ is positive, an example being the Lebesgue measure on $\mathbb{R}^n$.

Theorem 1 (Mercer). Let $X \subset \mathbb{R}^n$ be closed, $\mu$ a strictly positive Borel measure on $X$, and $K$ a continuous function on $X \times X$ satisfying (1) and (2). Then

$$K(x,t) = \sum_{k=1}^\infty \lambda_k\, \phi_k(x)\, \phi_k(t), \qquad (4)$$

where the series converges absolutely for each pair $(x,t) \in X \times X$ and uniformly on each compact subset of $X$.

Mercer's theorem still holds if $X$ is a finite set $\{x_i\}$, such as $X = \{-1,1\}^n$, $K$ is a pointwise-defined positive definite kernel, and $\mu(x_i) > 0$ for each $i$.
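The spectrum of $L_K$ can be approximated numerically even when no closed form is available. The following short Python sketch (my own illustration, not part of the paper; all function and variable names are mine) discretizes $L_K$: for i.i.d. samples $x_1,\ldots,x_m$ drawn from $\mu$, the eigenvalues of the $m \times m$ matrix $\frac{1}{m}[K(x_i,x_j)]$ approximate the leading Mercer eigenvalues. It is shown here for a Gaussian kernel on the circle $S^1$.

import numpy as np

def approx_mercer_spectrum(K_matrix, top=10):
    # eigenvalues of (1/m) K approximate the leading eigenvalues of L_K
    m = K_matrix.shape[0]
    vals = np.linalg.eigvalsh(K_matrix / m)
    return np.sort(vals)[::-1][:top]

rng = np.random.default_rng(0)
theta = rng.uniform(0.0, 2.0 * np.pi, size=1000)      # uniform samples on S^1
X = np.stack([np.cos(theta), np.sin(theta)], axis=1)
sq_dists = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
K = np.exp(-sq_dists / 1.0)                            # Gaussian kernel, sigma^2 = 1
print(approx_mercer_spectrum(K))

The largest few values computed this way can be compared against the closed-form eigenvalues derived in the next subsection.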

2.1 Examples on the Sphere $S^{n-1}$

We will give explicit examples of the eigenvalues and eigenfunctions in Mercer's theorem on the unit sphere $S^{n-1}$ for the polynomial and Gaussian kernels. We need the concept of spherical harmonics, a modern and authoritative account of which is [6]. Some of the material below was first reported in the kernel learning literature in [9], where the eigenvalues of the polynomial kernels were computed for $n = 2, 3$. In this section we carry out the computations for general $n \in \mathbb{N}$, $n \ge 2$.

Definition 1 (Spherical Harmonics). Let $\Delta_n = \frac{\partial^2}{\partial x_1^2} + \cdots + \frac{\partial^2}{\partial x_n^2}$ denote the Laplacian operator on $\mathbb{R}^n$. A homogeneous polynomial of degree $k$ in $\mathbb{R}^n$ whose Laplacian vanishes is called a homogeneous harmonic of order $k$. Let $\mathbb{Y}_k(n)$ denote the subspace of all homogeneous harmonics of order $k$ restricted to the unit sphere $S^{n-1}$ in $\mathbb{R}^n$. The functions in $\mathbb{Y}_k(n)$ are called spherical harmonics of order $k$. We will denote by $\{Y_{k,j}(n;x)\}_{j=1}^{N(n,k)}$ any fixed orthonormal basis of $\mathbb{Y}_k(n)$, where

$$N(n,k) = \dim \mathbb{Y}_k(n) = \frac{(2k+n-2)(k+n-3)!}{k!\,(n-2)!}, \qquad k \ge 0.$$

Theorem 2. Let $X = S^{n-1}$, $n \in \mathbb{N}$, $n \ge 2$. Let $\mu$ be the uniform probability distribution on $S^{n-1}$. For $K(x,t) = \exp(-\frac{\|x-t\|^2}{\sigma^2})$, $\sigma > 0$,

$$\lambda_k = e^{-2/\sigma^2}\, \sigma^{n-2}\, I_{k+\frac{n-2}{2}}\!\left(\frac{2}{\sigma^2}\right) \Gamma\!\left(\frac{n}{2}\right) \qquad (5)$$

for all $k \in \mathbb{N} \cup \{0\}$, where $I_\nu$ denotes the modified Bessel function of the first kind, defined below. Each $\lambda_k$ occurs with multiplicity $N(n,k)$, with the corresponding eigenfunctions being the spherical harmonics of order $k$ on $S^{n-1}$. The $\lambda_k$'s are decreasing if $\sigma \ge (2/n)^{1/2}$. Furthermore,

$$A_1 \left(\frac{2e}{\sigma^2}\right)^k \frac{1}{(2k+n-2)^{k+\frac{n-1}{2}}} \;<\; \lambda_k \;<\; A_2 \left(\frac{2e}{\sigma^2}\right)^k \frac{1}{(2k+n-2)^{k+\frac{n-1}{2}}} \qquad (6)$$

for constants $A_1, A_2$ depending on $\sigma$ and $n$, given below.

Remark 1. $A_1 = \frac{e^{-2/\sigma^2}(2e)^{\frac{n-1}{2}}\Gamma(\frac{n}{2})}{\sqrt{2\pi e}}$ and $A_2 = \frac{e^{-2/\sigma^2 + 1/\sigma^4}(2e)^{\frac{n-1}{2}}\Gamma(\frac{n}{2})}{\sqrt{2\pi e}}$. For $\nu \ge 0$ and $z \in \mathbb{C}$,

$$I_\nu(z) = \sum_{j=0}^\infty \frac{1}{j!\,\Gamma(\nu+j+1)} \left(\frac{z}{2}\right)^{2j+\nu}.$$

Theorem 3. Let $X = S^{n-1}$, $n \in \mathbb{N}$, $n \ge 2$, and $d \in \mathbb{N}$. Let $\mu$ be the uniform probability distribution on $S^{n-1}$. For $K(x,t) = (1+\langle x,t\rangle)^d$, the nonzero eigenvalues of $L_K : L^2_\mu(X) \to L^2_\mu(X)$ are

$$\lambda_k = \frac{2^{d+n-2}\, d!\, \Gamma\!\left(d+\frac{n-1}{2}\right) \Gamma\!\left(\frac{n}{2}\right)}{\sqrt{\pi}\,(d-k)!\, \Gamma(d+k+n-1)} \qquad (7)$$

for $0 \le k \le d$. Each $\lambda_k$ occurs with multiplicity $N(n,k)$, with the corresponding eigenfunctions being the spherical harmonics of order $k$ on $S^{n-1}$. Furthermore, the $\lambda_k$'s form a decreasing sequence and

$$\frac{B_1}{(k+d+n-2)^{d+\frac{n-3}{2}}} \;<\; \lambda_k \;<\; \frac{B_2}{(k+d+n-2)^{d+\frac{n-3}{2}}} \qquad (8)$$

for $0 \le k \le d$, with constants $B_1, B_2$ depending on $d$ and $n$.

Remark 2. $B_1$ and $B_2$ are explicit constants depending only on $d$ and $n$; they are obtained from (7) by applying Stirling's series to the Gamma functions, in the same way as $A_1$ and $A_2$ above.
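As a quick numerical illustration (my own script, not from the paper), the closed form (5) can be evaluated with scipy's modified Bessel function, and Mercer's theorem provides a built-in consistency check: since $K(x,x) = 1$ and the probability-normalized harmonics satisfy $\sum_j Y_{k,j}(x)^2 = N(n,k)$, the weighted sum $\sum_k N(n,k)\lambda_k$ must equal $1$.

import math
from scipy.special import iv

def multiplicity(n, k):
    # N(n,k) = dim of the space of spherical harmonics of order k on S^{n-1}
    if k == 0:
        return 1
    return (2 * k + n - 2) * math.factorial(k + n - 3) // (math.factorial(k) * math.factorial(n - 2))

def gaussian_eigenvalue(n, k, sigma2):
    # Theorem 2: lambda_k = e^{-2/sigma^2} sigma^{n-2} Gamma(n/2) I_{k+(n-2)/2}(2/sigma^2)
    return (math.exp(-2.0 / sigma2) * sigma2 ** ((n - 2) / 2.0)
            * math.gamma(n / 2.0) * iv(k + (n - 2) / 2.0, 2.0 / sigma2))

n, sigma2 = 3, 1.0
lam = [gaussian_eigenvalue(n, k, sigma2) for k in range(30)]
print("first eigenvalues:", lam[:5])
print("trace check:", sum(multiplicity(n, k) * lam[k] for k in range(30)))  # approximately 1.0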

2.2 Example on the Hypercube $\{-1,1\}^n$

We now give an example on the hypercube $\{-1,1\}^n$. Let $M_k = \{\alpha = (\alpha_i)_{i=1}^n : \alpha_i \in \{0,1\},\ |\alpha| = \alpha_1 + \cdots + \alpha_n = k\}$; then the set $\{x^\alpha\}_{\alpha \in M_k,\ 0 \le k \le n}$ consists of the multilinear monomials $\{1, x_1, x_1 x_2, \ldots, x_1 \cdots x_n\}$.

Theorem 4. Let $X = \{-1,1\}^n$. Let $d \in \mathbb{N}$, $d \le n$, be fixed. Let $K(x,t) = (1+\langle x,t\rangle)^d$ on $X \times X$, and let $\mu$ be the uniform distribution on $X$. Then the nonzero eigenvalues $\lambda^d_k$ of $L_K : L^2_\mu(X) \to L^2_\mu(X)$ satisfy

$$\lambda^{d+1}_k = k\,\lambda^d_{k-1} + \lambda^d_k + (n-k)\,\lambda^d_{k+1}, \qquad (9)$$

$$\lambda^d_0 \ge \lambda^d_1 \ge \cdots \ge \lambda^d_{d-1} = \lambda^d_d = d!, \qquad (10)$$

and $\lambda^d_k = 0$ for $k > d$. The corresponding $L^2_\mu(X)$-normalized eigenfunctions for each $\lambda_k$ are $\{x^\alpha\}_{\alpha \in M_k}$.

Example 1 ($d = 2$). The recurrence relation (9) is nonlinear in the two indices, and hence a closed analytic expression for $\lambda^d_k$ is hard to find for large $d$. It is straightforward, however, to write a computer program for computing $\lambda^d_k$; a sketch is given below. For $d = 2$,

$$\lambda_0 = n+1, \qquad \lambda_1 = 2, \qquad \lambda_2 = 2,$$

with corresponding eigenfunctions $1$, $\{x_1,\ldots,x_n\}$, and $\{x_1 x_2, x_1 x_3, \ldots, x_{n-1} x_n\}$, respectively.
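The following small script (my own, not the paper's) implements the recursion (9)-(10), starting from $d = 1$ where $\lambda^1_0 = \lambda^1_1 = 1$, and cross-checks the result against a brute-force eigendecomposition of the $2^n \times 2^n$ kernel matrix.

import itertools
import numpy as np

def hypercube_eigenvalues(n, d):
    # lam[k] = lambda_k^d; start from d = 1, where lambda_0^1 = lambda_1^1 = 1
    lam = [1.0, 1.0] + [0.0] * (n - 1)            # indices k = 0..n
    for _ in range(d - 1):
        new = [0.0] * (n + 1)
        for k in range(n + 1):
            lower = lam[k - 1] if k >= 1 else 0.0
            upper = lam[k + 1] if k + 1 <= n else 0.0
            new[k] = k * lower + lam[k] + (n - k) * upper   # recurrence (9)
        lam = new
    return lam

n, d = 4, 3
print("recursion:", hypercube_eigenvalues(n, d))

# brute force: eigenvalues of L_K, i.e. of (1/2^n) times the kernel matrix;
# each lambda_k^d appears with multiplicity C(n,k)
X = np.array(list(itertools.product([-1, 1], repeat=n)), dtype=float)
K = (1.0 + X @ X.T) ** d
print("matrix spectrum:", np.unique(np.round(np.linalg.eigvalsh(K / 2 ** n), 8))[::-1])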

2.3 Example on the Unit Ball $B^n$

Except for the homogeneous polynomial kernel $K(x,t) = \langle x,t\rangle^d$, the computation of the spectrum of $L_K$ on the unit ball $B^n$ is analytically much more difficult than that on $S^{n-1}$. For $K(x,t) = (1+\langle x,t\rangle)^d$ and small values of $d$, it is still possible, however, to obtain explicit answers.

Example 2 ($X = B^n$, $K(x,t) = (1+\langle x,t\rangle)^2$). Let $\mu$ be the uniform measure on $B^n$. The eigenspace spanned by $\{x_1,\ldots,x_n\}$ corresponds to the eigenvalue

$$\lambda_1 = \frac{2}{n+2}.$$

The eigenspace spanned by $\{\|x\|^2\, Y_{2,j}(n; \tfrac{x}{\|x\|})\}_{j=1}^{N(n,2)}$ corresponds to the eigenvalue

$$\lambda_2 = \frac{2}{(n+2)(n+4)}.$$

The eigenvalues that correspond to $\mathrm{span}\{1, \|x\|^2\}$ are

$$\lambda_{0,1} = \frac{(n+2)(n+5) + D}{2(n+2)(n+4)}, \qquad \lambda_{0,2} = \frac{(n+2)(n+5) - D}{2(n+2)(n+4)},$$

where $D = \sqrt{(n+2)^2(n+5)^2 - 16(n+4)}$.

3 Unboundedness of Normalized Eigenfunctions

It is known that the $L^2_\mu$-normalized eigenfunctions $\{\phi_k\}$ are in general unbounded, that is, in general

$$\sup_{k \in \mathbb{N}} \|\phi_k\|_\infty = \infty.$$

This was first pointed out by Smale, with the first counterexample given in [14]. This phenomenon is very common, however, as the following result shows.

Theorem 5. Let $X = S^{n-1}$, $n \ge 3$. Let $\mu$ be the Lebesgue measure on $S^{n-1}$. Let $f : [-1,1] \to \mathbb{R}$ be a continuous function, giving rise to a Mercer kernel $K(x,t) = f(\langle x,t\rangle)$ on $S^{n-1} \times S^{n-1}$. If infinitely many of the eigenvalues of $L_K : L^2_\mu(S^{n-1}) \to L^2_\mu(S^{n-1})$ are nonzero, then for the set of corresponding $L^2_\mu$-normalized eigenfunctions $\{\phi_k\}_{k=1}^\infty$,

$$\sup_{k \in \mathbb{N}} \|\phi_k\|_\infty = \infty. \qquad (11)$$

Remark 3. This is in sharp contrast with the case $n = 2$, where we will show that $\sup_k \|\phi_k\|_\infty = \sqrt{2}$, with the supremum attained by the functions $\{\sqrt{2}\cos k\theta, \sqrt{2}\sin k\theta\}_{k \in \mathbb{N}}$.

Theorem 5 applies in particular to the Gaussian kernel $K(x,t) = \exp(-\frac{\|x-t\|^2}{\sigma^2})$. Hence care needs to be taken in applying analysis that requires $C_K = \sup_k \|\phi_k\|_\infty < \infty$, for example [5].

4 Feature Maps

4.1 Examples of Feature Maps via Mercer's Theorem

A natural feature map that arises immediately from Mercer's theorem is $\Phi_\mu : X \to \ell^2$,

$$\Phi_\mu(x) = \left(\sqrt{\lambda_k}\,\phi_k(x)\right)_{k=1}^\infty, \qquad (12)$$

where, if only $N < \infty$ of the eigenvalues are strictly positive, then $\Phi_\mu : X \to \mathbb{R}^N$. This is the map that is often covered in the machine learning literature.

Example 3 ($n = d = 2$, $X = S^1$). Theorem 3 gives the eigenvalues $(\frac{3}{2}, 1, \frac{1}{4})$, with eigenfunctions spanned by $(1, x_1, x_2, 2x_1x_2, x_1^2 - x_2^2) = (1, \cos\theta, \sin\theta, \sin 2\theta, \cos 2\theta)$, where $x_1 = \cos\theta$, $x_2 = \sin\theta$, giving rise to the feature map

$$\Phi_\mu(x) = \left(\sqrt{\tfrac{3}{2}},\ \sqrt{2}\,x_1,\ \sqrt{2}\,x_2,\ \sqrt{2}\,x_1 x_2,\ \tfrac{x_1^2 - x_2^2}{\sqrt{2}}\right).$$

Example 4 ($n = d = 2$, $X = \{-1,1\}^2$). Theorem 4 gives

$$\Phi_\mu(x) = \left(\sqrt{3},\ \sqrt{2}\,x_1,\ \sqrt{2}\,x_2,\ \sqrt{2}\,x_1 x_2\right).$$

Observation. (i) As our notation suggests, $\Phi_\mu$ depends on the particular measure $\mu$ appearing in the definition of the operator $L_K$ and thus is not unique: each measure $\mu$ gives rise to a different system of eigenvalues and eigenfunctions $(\lambda_k, \phi_k)_{k=1}^\infty$ and therefore a different $\Phi_\mu$. (ii) In Theorems 2 and 3, the multiplicity of the $\lambda_k$'s means that for each choice of orthonormal basis of the space $\mathbb{Y}_k(n)$ of spherical harmonics of order $k$, there is a different feature map. Thus there are infinitely many feature maps arising from the uniform probability distribution on $S^{n-1}$ alone. Both examples above are checked numerically in the sketch below.
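The following self-check (my own script, not part of the paper) verifies that the explicit feature maps of Examples 3 and 4 reproduce the kernel $K(x,t) = (1+\langle x,t\rangle)^2$ on $S^1$ and on $\{-1,1\}^2$.

import itertools
import numpy as np

def phi_circle(x):       # Example 3, X = S^1
    x1, x2 = x
    return np.array([np.sqrt(1.5), np.sqrt(2) * x1, np.sqrt(2) * x2,
                     np.sqrt(2) * x1 * x2, (x1 ** 2 - x2 ** 2) / np.sqrt(2)])

def phi_cube(x):         # Example 4, X = {-1,1}^2
    x1, x2 = x
    return np.array([np.sqrt(3), np.sqrt(2) * x1, np.sqrt(2) * x2, np.sqrt(2) * x1 * x2])

rng = np.random.default_rng(1)
for _ in range(5):
    a, b = rng.uniform(0, 2 * np.pi, size=2)
    x, t = (np.cos(a), np.sin(a)), (np.cos(b), np.sin(b))
    assert np.isclose(phi_circle(x) @ phi_circle(t), (1 + np.dot(x, t)) ** 2)

for x, t in itertools.product(itertools.product([-1, 1], repeat=2), repeat=2):
    assert np.isclose(phi_cube(x) @ phi_cube(t), (1 + np.dot(x, t)) ** 2)
print("both feature maps reproduce K(x,t) = (1 + <x,t>)^2")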

4.2 Examples of Feature Maps Not via Mercer's Theorem

Feature maps do not necessarily arise from Mercer's theorem. Consider any set $X$ and any pointwise-defined, positive definite kernel $K$ on $X \times X$. For each $x \in X$, let $K_x : X \to \mathbb{R}$ be defined by $K_x(t) = K(x,t)$, and let

$$\mathcal{H}_K = \overline{\mathrm{span}}\{K_x : x \in X\} \qquad (13)$$

be the reproducing kernel Hilbert space (RKHS) induced by $K$, with the inner product determined by $\langle K_x, K_t\rangle_K = K(x,t)$; see [1]. The following feature map is then immediate:

$$\Phi_K : X \to \mathcal{H}_K, \qquad \Phi_K(x) = K_x. \qquad (14)$$

In this section we discuss, via examples, two other methods for obtaining feature maps. Let $X \subset \mathbb{R}^n$ be any subset. Consider the Gaussian kernel $K(x,t) = \exp(-\frac{\|x-t\|^2}{\sigma^2})$ on $X \times X$, which admits the expansion

$$K(x,t) = e^{-\frac{\|x-t\|^2}{\sigma^2}} = e^{-\frac{\|x\|^2}{\sigma^2}}\, e^{-\frac{\|t\|^2}{\sigma^2}} \sum_{k=0}^\infty \frac{(2/\sigma^2)^k}{k!} \sum_{|\alpha|=k} C^k_\alpha\, x^\alpha t^\alpha, \qquad (15)$$

where $C^k_\alpha = \frac{k!}{\alpha_1! \cdots \alpha_n!}$. This implies the feature map $\Phi_g : X \to \ell^2$,

$$\Phi_g(x) = e^{-\frac{\|x\|^2}{\sigma^2}} \left(\sqrt{\frac{(2/\sigma^2)^k\, C^k_\alpha}{k!}}\; x^\alpha\right)_{|\alpha|=k,\; k=0,1,2,\ldots}$$

Remark 4. The standard polynomial feature maps in machine learning, see for example ([7], page 8), are obtained in exactly the same way.
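A minimal sketch of $\Phi_g$ (my own code, not the paper's): truncating the series (15) at total degree $k \le K$ gives a finite-dimensional feature vector whose inner products converge rapidly to the exact kernel value as $K$ grows.

import itertools
import math
import numpy as np

def phi_g(x, sigma2, K):
    # truncated Gaussian feature map from the Taylor expansion (15)
    x = np.asarray(x, dtype=float)
    feats = []
    for k in range(K + 1):
        for alpha in itertools.product(range(k + 1), repeat=len(x)):
            if sum(alpha) != k:
                continue
            C = math.factorial(k) / math.prod(math.factorial(a) for a in alpha)
            feats.append(math.sqrt((2.0 / sigma2) ** k * C / math.factorial(k))
                         * math.prod(xi ** a for xi, a in zip(x, alpha)))
    return math.exp(-np.dot(x, x) / sigma2) * np.array(feats)

x, t, sigma2 = np.array([0.3, -0.5]), np.array([0.1, 0.7]), 1.0
exact = math.exp(-np.sum((x - t) ** 2) / sigma2)
for K in (2, 4, 8):
    print(K, phi_g(x, sigma2, K) @ phi_g(t, sigma2, K), exact)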

Consider next a special class of kernels that is widely used in practice, the convolution kernels. Recall that for a function $f \in L^1(\mathbb{R}^n)$, its Fourier transform is defined to be

$$\hat{f}(\xi) = \int_{\mathbb{R}^n} f(x)\, e^{-i\langle \xi, x\rangle}\, dx.$$

By a Fourier transform computation, it may be shown that if $\mu : \mathbb{R}^n \to \mathbb{R}$ is even, nonnegative, and such that $\mu, \sqrt{\mu} \in L^1(\mathbb{R}^n)$, then the kernel $K : \mathbb{R}^n \times \mathbb{R}^n \to \mathbb{R}$ defined by

$$K(x,t) = \int_{\mathbb{R}^n} \mu(u)\, e^{i\langle x-t,\, u\rangle}\, du \qquad (16)$$

is continuous, symmetric, and positive definite. Furthermore, for any $x, t \in \mathbb{R}^n$,

$$K(x,t) = \frac{1}{(2\pi)^n} \int_{\mathbb{R}^n} \widehat{\sqrt{\mu}}(x-u)\, \widehat{\sqrt{\mu}}(t-u)\, du. \qquad (17)$$

The following is then a feature map of $K$ on $X \times X$, $X \subset \mathbb{R}^n$:

$$\Phi_{\mathrm{conv}} : X \to L^2(\mathbb{R}^n), \qquad \Phi_{\mathrm{conv}}(x)(u) = \frac{1}{(2\pi)^{n/2}}\, \widehat{\sqrt{\mu}}(x-u). \qquad (18)$$

For the Gaussian kernel $e^{-\frac{\|x-t\|^2}{\sigma^2}} = \int_{\mathbb{R}^n} \mu(u)\, e^{i\langle x-t,u\rangle}\, du$, with $\mu(u) = \left(\frac{\sigma^2}{4\pi}\right)^{n/2} e^{-\frac{\sigma^2\|u\|^2}{4}}$, this gives

$$(\Phi_{\mathrm{conv}}(x))(u) = \left(\frac{4}{\pi\sigma^2}\right)^{n/4} e^{-\frac{2\|x-u\|^2}{\sigma^2}}.$$

One can similarly obtain feature maps for the inverse multiquadric, exponential, or B-spline kernels. The identity

$$e^{-\frac{\|x-t\|^2}{\sigma^2}} = \left(\frac{4}{\pi\sigma^2}\right)^{n/2} \int_{\mathbb{R}^n} e^{-\frac{2\|x-u\|^2}{\sigma^2}}\, e^{-\frac{2\|t-u\|^2}{\sigma^2}}\, du$$

can also be verified directly, as done in [10], where implications of the Gaussian feature map $\Phi_{\mathrm{conv}}(x)$ above are also discussed.

4.3 Equivalence of Feature Maps

It is known ([7], page 39) that, given a set $X$ and a pointwise-defined, symmetric, positive definite kernel $K$ on $X \times X$, all feature maps from $X$ into Hilbert spaces are essentially equivalent. In this section we make this statement precise. Let $\mathcal{H}$ be a Hilbert space and $\Phi : X \to \mathcal{H}$ be such that $\langle \Phi_x, \Phi_t\rangle_{\mathcal{H}} = K(x,t)$ for all $x, t \in X$, where $\Phi_x = \Phi(x)$. The evaluation functional $L_x : \mathcal{H} \to \mathbb{R}$ given by $L_x v = \langle v, \Phi_x\rangle_{\mathcal{H}}$, where $x$ varies over $X$, defines an inclusion map $L_\Phi : \mathcal{H} \to \mathbb{R}^X$,

$$(L_\Phi v)(x) = \langle v, \Phi_x\rangle_{\mathcal{H}},$$

where $\mathbb{R}^X$ denotes the vector space of pointwise-defined, real-valued functions on $X$. Observe that, as a vector space of functions, $\mathcal{H}_K \subset \mathbb{R}^X$.

Proposition 1. Let $\mathcal{H}_\Phi = \overline{\mathrm{span}}\{\Phi_x : x \in X\}$, a subspace of $\mathcal{H}$. The restriction of $L_\Phi$ to $\mathcal{H}_\Phi$ is an isometric isomorphism between $\mathcal{H}_\Phi$ and $\mathcal{H}_K$.

Proof. First, $L_\Phi$ is bijective from $\mathcal{H}_\Phi$ onto its image $L_\Phi(\mathcal{H}_\Phi)$, since $\ker L_\Phi = \mathcal{H}_\Phi^{\perp}$. Under the map $L_\Phi$, for each $x, t \in X$, $(L_\Phi \Phi_x)(t) = \langle \Phi_x, \Phi_t\rangle = K_x(t)$; thus $K_x \equiv L_\Phi \Phi_x$ as functions on $X$. This implies that $\mathrm{span}\{K_x : x \in X\}$ is isomorphic to $\mathrm{span}\{\Phi_x : x \in X\}$ as vector spaces. The isometric isomorphism of $\mathcal{H}_\Phi = \overline{\mathrm{span}}\{\Phi_x : x \in X\}$ and $\mathcal{H}_K = \overline{\mathrm{span}}\{K_x : x \in X\}$ then follows from $\langle \Phi_x, \Phi_t\rangle_{\mathcal{H}} = K(x,t) = \langle K_x, K_t\rangle_K$. This completes the proof.

Remark 5. Each choice of $\Phi$ is thus equivalent to a factorization of the map $\Phi_K : x \mapsto K_x \in \mathcal{H}_K$, that is, to the commutative diagram

$$X \xrightarrow{\ \Phi\ } \mathcal{H}_\Phi \xrightarrow{\ L_\Phi\ } \mathcal{H}_K, \qquad \Phi_K = L_\Phi \circ \Phi. \qquad (19)$$

We will call $\Phi_K : x \mapsto K_x \in \mathcal{H}_K$ the canonical feature map associated with $K$.
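Returning to the convolution feature map of Section 4.2, a quick quadrature check in dimension $n = 1$ (my own script, using the normalization constant as reconstructed above) confirms that $\langle \Phi_{\mathrm{conv}}(x), \Phi_{\mathrm{conv}}(t)\rangle_{L^2(\mathbb{R})}$ reproduces the Gaussian kernel.

import math
from scipy.integrate import quad

def phi_conv(x, sigma2):
    # Phi_conv(x)(u) = (4/(pi*sigma^2))^{1/4} exp(-2(x-u)^2/sigma^2), n = 1
    c = (4.0 / (math.pi * sigma2)) ** 0.25
    return lambda u: c * math.exp(-2.0 * (x - u) ** 2 / sigma2)

x, t, sigma2 = 0.4, -0.9, 1.3
inner, _ = quad(lambda u: phi_conv(x, sigma2)(u) * phi_conv(t, sigma2)(u),
                -math.inf, math.inf)
print(inner, math.exp(-(x - t) ** 2 / sigma2))   # the two numbers agree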

5 Smoothing Properties of Kernels on the Sphere

Having discussed feature maps, we analyze in this section the smoothing properties of the polynomial and Gaussian kernels and compare them with those of spline kernels on the sphere $S^{n-1}$. In the spline smoothing problem on $S^1$ as described in [12], one solves the minimization problem

$$\min_{f \in W_m}\ \frac{1}{m}\sum_{i=1}^m (f(x_i) - y_i)^2 + \lambda \int_0^1 (f^{(m)}(t))^2\, dt \qquad (20)$$

for $x_i \in [0,1]$ and $f \in W_m$, where $J_m(f) = \int_0^1 (f^{(m)}(t))^2\, dt$ is the squared norm of the RKHS

$$W_m^0 = \{f : f, f', \ldots, f^{(m-1)} \text{ absolutely continuous},\ f^{(m)} \in L^2[0,1],\ f^{(j)}(0) = f^{(j)}(1),\ j = 0,1,\ldots,m-1\}.$$

The space $W_m$ is then the RKHS

$$W_m = \{1\} \oplus W_m^0 = \left\{f : \|f\|_K^2 = \left(\int_0^1 f(t)\, dt\right)^2 + \int_0^1 (f^{(m)}(t))^2\, dt < \infty\right\}$$

induced by a kernel $K$. One particular feature of spline smoothing, on $S^1$, $S^2$, or $\mathbb{R}^n$, is that in general the RKHS $W_m$ does not have a closed-form kernel $K$ that is efficiently computable. This is in contrast with the RKHS used in kernel machine learning, all of which correspond to closed-form kernels that can be evaluated efficiently. It is not clear, however, whether the norms in these RKHS correspond to smoothness functionals. In this section we show that, for the polynomial and Gaussian kernels on $S^{n-1}$, they do.

5.1 The Iterated Laplacian and Splines on the Sphere $S^{n-1}$

Splines on $S^{n-1}$ for $n = 2$ and $n = 3$, as treated by Wahba [11], [12], can be generalized to any $n \in \mathbb{N}$ via the iterated Laplacian (also called the Laplace-Beltrami operator) $\Delta$ on $S^{n-1}$. The RKHS corresponding to $W_m$ in (20) is a subspace of $L^2(S^{n-1})$ described by

$$\mathcal{H}_K = \left\{f : \|f\|_K^2 = \left(\frac{1}{|S^{n-1}|}\int_{S^{n-1}} f(x)\, dx\right)^2 + \int_{S^{n-1}} f(x)\, \Delta^m f(x)\, dx < \infty\right\}.$$

The Laplacian on $S^{n-1}$ has eigenvalues $\lambda_k = k(k+n-2)$, $k \ge 0$, with corresponding eigenfunctions $\{Y_{k,j}(n;x)\}_{j=1}^{N(n,k)}$, which form an orthonormal basis of the space $\mathbb{Y}_k(n)$ of spherical harmonics of order $k$. For $f \in L^2(S^{n-1})$, if we use the expansion

$$f = \frac{a_0}{\sqrt{|S^{n-1}|}} + \sum_{k=1}^\infty \sum_{j=1}^{N(n,k)} a_{k,j}\, Y_{k,j}(n;x),$$

then the space $\mathcal{H}_K$ takes the form

$$\mathcal{H}_K = \left\{f \in L^2(S^{n-1}) : \|f\|_K^2 = \frac{a_0^2}{|S^{n-1}|} + \sum_{k=1}^\infty \sum_{j=1}^{N(n,k)} [k(k+n-2)]^m\, a_{k,j}^2 < \infty\right\},$$

and thus the corresponding kernel is

$$K(x,t) = 1 + \sum_{k=1}^\infty \frac{1}{[k(k+n-2)]^m} \sum_{j=1}^{N(n,k)} Y_{k,j}(n;x)\, Y_{k,j}(n;t), \qquad (21)$$

which is well defined iff $m > \frac{n-1}{2}$. Let $P_k(n;t)$ denote the Legendre polynomial of degree $k$ in dimension $n$; then (21) takes the form

$$K(x,t) = 1 + \frac{1}{|S^{n-1}|} \sum_{k=1}^\infty \frac{N(n,k)}{[k(k+n-2)]^m}\, P_k(n; \langle x,t\rangle), \qquad (22)$$

which does not have a closed form in general; for the case $n = 3$, see [11].

Remark 6. Clearly $m$ can be replaced by any real number $s > \frac{n-1}{2}$.
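Although (22) has no closed form, it is easy to evaluate numerically by truncating the series. The following sketch (mine, not from the paper) does this on $S^2$ ($n = 3$), where $P_k(3;t)$ is the classical Legendre polynomial, $N(3,k) = 2k+1$, and $|S^2| = 4\pi$.

import numpy as np
from scipy.special import eval_legendre

def sphere_spline_kernel(cos_angle, m, kmax=200):
    # truncated series (22) on S^2; converges since the terms decay like k^(1-2m)
    k = np.arange(1, kmax + 1)
    coef = (2 * k + 1) / (k * (k + 1)) ** m / (4 * np.pi)
    return 1.0 + np.sum(coef * eval_legendre(k, cos_angle))

for s in (-0.5, 0.0, 0.9):
    print(s, sphere_spline_kernel(s, m=2))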

5.2 Smoothing Properties of Polynomial and Gaussian Kernels

Let $\nabla_n$ denote the gradient operator on $S^{n-1}$ (also called the first-order Beltrami operator; see [6], page 79, for a definition). Let $Y_k \in \mathbb{Y}_k(n)$, $k \ge 0$, be $L^2$-normalized; then

$$\|\nabla_n Y_k\|^2_{L^2(S^{n-1})} = \int_{S^{n-1}} |\nabla_n Y_k(x)|^2\, dS^{n-1}(x) = k(k+n-2). \qquad (23)$$

This shows that spherical harmonics of higher order are less smooth. This is particularly transparent in the case $n = 2$ with the Fourier basis functions $\{1, \cos k\theta, \sin k\theta\}_{k \in \mathbb{N}}$: as $k$ increases, the functions oscillate more rapidly. It follows that any regularization term $\|f\|_K^2$ in problems such as (20), where $K$ possesses a decreasing spectrum $\lambda_k$ ($k$ corresponding to the order of the spherical harmonics), will have a smoothing effect: the higher-order spherical harmonics, which are less smooth, are penalized more. The decreasing-spectrum property holds for the spline kernels, for the polynomial kernel $(1+\langle x,t\rangle)^d$, and for the Gaussian kernel with $\sigma \ge (2/n)^{1/2}$, as we showed in Theorems 2 and 3. Hence all these kernels possess smoothing properties on $S^{n-1}$.

Furthermore, Theorem 2 shows that for the Gaussian kernel, for all $k$,

$$A_1 \left(\frac{2e}{\sigma^2}\right)^k \frac{1}{(2k+n-2)^{k+\frac{n-1}{2}}} \;<\; \lambda_k \;<\; A_2 \left(\frac{2e}{\sigma^2}\right)^k \frac{1}{(2k+n-2)^{k+\frac{n-1}{2}}},$$

and Theorem 3 shows that for the polynomial kernel $(1+\langle x,t\rangle)^d$,

$$\frac{B_1}{(k+d+n-2)^{d+\frac{n-3}{2}}} \;<\; \lambda_k \;<\; \frac{B_2}{(k+d+n-2)^{d+\frac{n-3}{2}}}$$

for $0 \le k \le d$. Comparing these with the eigenvalues of the spline kernels,

$$\lambda_k = \frac{1}{[k(k+n-2)]^m}$$

for $k \ge 1$, we see that the Gaussian kernel has the sharpest smoothing property, as is apparent from the exponential decay of its eigenvalues. For $K(x,t) = (1+\langle x,t\rangle)^d$, if $d + \frac{n-3}{2} > 2m$, then $K$ has a sharper smoothing property than a spline kernel of order $m$. Moreover, all spherical harmonics of order greater than $d$ are filtered out; hence choosing $K$ amounts to choosing a hypothesis space of bandlimited functions on $S^{n-1}$. The decay of the three spectra is illustrated numerically below.
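The following short comparison (my own illustration) prints the three spectra on $S^2$ ($n = 3$): the spline eigenvalues $[k(k+1)]^{-m}$, the polynomial-kernel eigenvalues from Theorem 3 (zero beyond $k = d$), and the Gaussian eigenvalues from Theorem 2.

import math
from scipy.special import iv

n, m, d, sigma2 = 3, 2, 5, 1.0
for k in range(1, 11):
    spline = 1.0 / (k * (k + 1)) ** m
    poly = (2 ** (d + n - 2) * math.factorial(d) * math.gamma(d + (n - 1) / 2)
            * math.gamma(n / 2)
            / (math.sqrt(math.pi) * math.factorial(d - k) * math.gamma(d + k + n - 1))
            ) if k <= d else 0.0
    gauss = (math.exp(-2 / sigma2) * sigma2 ** ((n - 2) / 2) * math.gamma(n / 2)
             * iv(k + (n - 2) / 2, 2 / sigma2))
    print(f"k={k:2d}  spline={spline:.3e}  poly={poly:.3e}  gauss={gauss:.3e}")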

A Proofs of Results

The proofs of the results on $S^{n-1}$ all make use of properties of spherical harmonics on $S^{n-1}$, which can be found in [6]. We will prove Theorem 2 (the proof of Theorem 3 is similar) and Theorem 5.

A.1 Proof of Theorem 2

Let $f : [-1,1] \to \mathbb{R}$ be a continuous function and let $Y_k \in \mathbb{Y}_k(n)$ for $k \ge 0$. The Funk-Hecke formula ([6], page 30) states that for any $x \in S^{n-1}$:

Proof. We apply the following which follows from ([3], page 79, formula 9) e rt ( t ) ν dt = ( ) ν / Γ (ν)i ν /(r) (7) r and Rodrigues rule ([6], page 3), which states that for f C k ([, ]) f(t)p k (n; t)( t ) n 3 dt = Rk (n) where R k (n) = Γ (k+ n ). For f(t) = ert, we have ert P k (n; t)( t ) n 3 dt = R k (n)r k ert ( t n 3 k+ ) = R k (n)r k ( ) k+n/ r Γ (k + n )I k+n/ (r) Substituting in the values of R k (n) gives the desired answer. Lemma. The sequence {λ k } k=0 is decreasing if σ ( /. n) Γ ( n k ) Proof. We will first prove that I k+n/ ( σ ) = ( σ ) k+n/ j=0 = ( σ ) k+n/ j=0 λ k λ k+ > (k + n/)σ. We have ( σ )j j!γ (j+k+n/+) ( σ )j j!(j+k+n/)γ (j+k+n/) < ( σ ) k+n/ ( σ )j k+n/ j=0 j!γ (j+k+n/) = σ (k+n/) I k+n/ ( σ ) f (k) (t)( t n 3 k+ ) dt (8) λ which implies k λ k+ > (k + n/)σ. The inequality λ k λ k+ thus is satisfied if σ (k + n/) for all k 0. It suffices to require that it holds for k = 0, that is σ n/ σ ( /. n) Proof (of (6)). By definition of I ν (z), we have for z > 0 I ν (z) < ( z )ν ( z )j Γ (ν+) j=0 j! = ( z )ν /4 Γ (ν+) ez Then for ν = k + n and z = σ : I k+ n ( σ ) < Γ (k+ n )( σ )k+n e /σ4. Consider Stirling s series for a > 0 Γ (a + ) = ( a ) [ a a + e a + 88a 39 ] 5840a 3 +... (9) Thus for all a > 0 we can write Γ (a + ) = e A(a) e ( a e. Hence for all k a Γ (k + n ) = ea(k,n) e( k + n n k+ ) e where 0 < A(k, n) < (k+ n ). Then I k+ n ( σ ) < (e) k+ n k+ n (k+n ) The other direction is obtained similarly. ) a+ where 0 < A(a) < = e A(k,n) e( k + n e ( σ )k+n e /σ4 implying (6). n k+ )

A. Proof of Theorem 5 We will first show an upper bound for Y k, where Y k is any L (S n )- normalized function in Y k (n), then exhibit a one-dimensional subspace of functions in Y k (n) that attain this upper bound. Observe that Y k belongs to an orthonormal basis {Y k,j (n; x)} N(n,k) j= of Y k (n). The following highlights the crucial difference between the case n = and n 3. Lemma 3. For any n, k 0, for all j N, j N(n, k) Y k,j (n;.) N(n, k) S n (30) In particular, for n = and all k 0: Y k,j (n;.). Proof. The Addition Theorem for spherical harmonics ([6], page 8) states that for any x, α S n N(n,k) j= Y k,j (n; x)y k,j (n; α) = which implies that for any x S n Y k,j (n; x) N(n, k) S n P k(n; x, α ) N(n, k) S n P N(n, k) k(n; x, x ) = S n P N(n, k) k(n; ) = S n giving us the first result. For n =, we have N(n, k) = for k = 0, N(n, k) = for k, and S =, giving us the second result. Definition. Consider the group O(n) of all orthogonal transformations in R n, that is O(n) = {A R n n : A T A = AA T = I}. A function f : S n R is said to be invariant under a transformation A O(n) if f A (x) = f(ax) = f(x) for all x S n. Let α S n. The isotropy group J n,α is defined by J n,α = {A O(n) : Aα = α}. Lemma 4. Assume that Y k Y k (n) is invariant with respect to J n,α and satisfies S n Y k (x) ds n (x) =. Then Y k is unique up to a multiplicative constant C α,n,k with C α,n,k = and Y k = Y k (α) = N(n, k) S n (3) Proof. If Y k is invariant with respect to J n,α, then by ([6], Lemma 3, page 7), it must satisfy Y k (x) = Y k (α)p k (n; x, α ), showing that the subspace of Y k (n) invariant with respect to J n,α is one-dimensional. The Addition Theorem implies that for any α S n

By assumption, we then have S n P k (n; x, α ) ds n (x) = Sn N(n,k) = S n Y k (x) ds n (x) = Y k (α) S n P k (n; x, α ) ds n (x) = Y k (α) S n N(n,k), giving us Y k(α) = Y k (x) = N(n,k) S n N(n,k) S n P k(n; x, α ) by the property P k (n; t) for t. Thus Y k =. Thus we for all x Sn N(n,k) S n N(n,k) S n as desired. Proposition (Orthonormal basis of Y k (n) [6]). Let n 3. Let e,..., e n be the canonical basis of R n. Let x S n. We write x = te n + ( ) t x(n ) 0 where t [, ] and x (n ) S n, (x (n ), 0) T span{e,..., e n }. Suppose that for m = 0,,..., k, the orthonormal bases Y m,j, j =,..., N(n, m) of Y m (n ) are given, then an orthonormal basis for Y k (n) is Y k,m,j (n; x) = A m k (n; t)y m,j (n ; x (n ) ) : j =,..., N(n, m) (3) starting with the Fourier basis for n =, where A m n (k + n )(k m)!(k + n + m 3)! k (n; t) = k!γ ( n ) Pk m (n; t) (33) Proposition 3. Let n N, n 3. Let µ be the Lebesgue measure on S n. For each k 0, any orthonormal basis of the space Y k (n) of spherical harmonics of order k contains an L µ-normalized spherical harmonic Y k such that Y k = N(n, k) S n = (k + n )(k + n 3)! k!(n )! S n (34) as k, where S n = n Γ ( n ) is the surface area of Sn. Proof. Let x = te n + ( ) t x(n ), t. For each k 0, the 0 orthonormal basis for Y k (n) in Proposition contains the function Y k,0, (n; x) = A 0 k(t)y 0, (n ; x (n ) ) = A 0 k(t) S n (35) Y k,0, (n; x) = (k+n )(k+n 3)! Γ ( n ) n k! S n P k (n; t) = N(n,k) S n P k(n; t) Then Y k,0, (n; x) is invariant with respect to J n,α where α = (0,..., 0, ). Thus Y k = Y k,0, is the desired function for the current orthonormal basis. For any orthonormal basis of Y k (n), the result follows by Lemma 4 and rotational symmetry on the sphere.

Proof (of Theorem 5). By the Funk-Hecke formula, all spherical harmonics of order k are eigenfunctions corresponding to the eigenvalue λ k as given by (5). If infinitely many of the λ k s are nonzero, then the corresponding set of L (S n )- orthonormal eigenfunctions {φ k }, being an orthonormal basis of L (S n ), contains a spherical harmonic Y k satisfying (34), for infinitely many k. It follows from Proposition 3 then that sup k φ k =. References. N. Aronszajn. Theory of Reproducing Kernels. Transactions of the American Mathematical Society, vol. 68, pages 337-404, 950.. M. Belkin, P. Niyogi, and V. Sindwani. Manifold Regularization: a Geometric Framework for Learning from Examples. University of Chicago Computer Science Technical Report TR-004-06, 004, accepted for publication. 3. M. Belkin and P. Niyogi. Semi-supervised Learning on Riemannian Manifolds. Machine Learning, Special Issue on Clustering, vol. 56, pages 09-39, 004. 4. E. De Vito, A. Caponnetto, and L. Rosasco. Model Selection for Regularized Least-Squares Algorithm in Learning Theory. Foundations of Computational Mathematics, vol. 5, no., pages 59-85, 005. 5. J. Lafferty and G. Lebanon. Diffusion Kernels on Statistical Manifolds. Journal of Machine Learning Research, vol. 6, pages 9-63, 005. 6. C. Müller. Analysis of Spherical Symmetries in Euclidean Spaces. Applied Mathematical Sciences 9, Springer, New York, 997. 7. B. Schölkopf and A.J. Smola. Learning with Kernels. The MIT Press, Cambridge, Massachusetts, 00. 8. S. Smale and D.X. Zhou. Learning Theory Estimates via Integral Operators and Their Approximations, 005, to appear. 9. A.J. Smola, Z.L. Ovari and R.C. Williamson. Regularization with Dot-Product Kernels. Advances in Information Processing Systems, 000. 0. I. Steinwart, D. Hush, and C. Scovel. An Explicit Description of the Reproducing Kernel Hilbert Spaces of Gaussian RBF Kernels Kernels. Los Alamos National Laboratory Technical Report LA-UR-04-874, December 005.. G. Wahba. Spline Interpolation and Smoothing on the Sphere. SIAM Journal of Scientific and Statistical Computing, vol., pages 5-6, 98.. G. Wahba. Spline Models for Observational Data. CBMS-NSF Regional Conference Series in Applied Mathematics 59, Society for Industrial and Applied Mathematics, Philadelphia, 990. 3. G.N. Watson. A Treatise on the Theory of Bessel Functions, nd edition, Cambridge University Press, Cambridge, England, 944. 4. D.X. Zhou. The Covering Number in Learning Theory. Journal of Complexity, vol. 8, pages 739-767, 00.