Kernel Choice and Classifiability for RKHS Embeddings of Probability Distributions
Bharath K. Sriperumbudur, Department of ECE, UC San Diego, La Jolla, USA
Kenji Fukumizu, The Institute of Statistical Mathematics, Tokyo, Japan
Arthur Gretton, Carnegie Mellon University and MPI for Biological Cybernetics
Gert R. G. Lanckriet, Department of ECE, UC San Diego, La Jolla, USA
Bernhard Schölkopf, MPI for Biological Cybernetics, Tübingen, Germany

Abstract

Embeddings of probability measures into reproducing kernel Hilbert spaces have been proposed as a straightforward and practical means of representing and comparing probabilities. In particular, the distance between embeddings (the maximum mean discrepancy, or MMD) has several key advantages over many classical metrics on distributions, namely easy computability, fast convergence and low bias of finite sample estimates. An important requirement of the embedding RKHS is that it be characteristic: in this case, the MMD between two distributions is zero if and only if the distributions coincide. Three new results on the MMD are introduced in the present study. First, it is established that MMD corresponds to the optimal risk of a kernel classifier, thus forming a natural link between the distance between distributions and their ease of classification. An important consequence is that a kernel must be characteristic to guarantee classifiability between distributions in the RKHS. Second, the class of characteristic kernels is broadened to incorporate all strictly positive definite kernels: these include non-translation invariant kernels and kernels on non-compact domains. Third, a generalization of the MMD is proposed for families of kernels, as the supremum over MMDs on a class of kernels (for instance the Gaussian kernels with different bandwidths). This extension is necessary to obtain a single distance measure if a large selection or class of characteristic kernels is potentially appropriate.
This generalization is reasonable, given that it corresponds to the problem of learning the kernel by minimizing the risk of the corresponding kernel classifier. The generalized MMD is shown to have consistent finite sample estimates, and its performance is demonstrated on a homogeneity testing example.

1 Introduction

Kernel methods are broadly established as a useful way of constructing nonlinear algorithms from linear ones, by embedding points into higher dimensional reproducing kernel Hilbert spaces (RKHSs) [2]. A generalization of this idea is to embed probability distributions into RKHSs, giving us a linear method for dealing with higher order statistics [8, 5, 7]. More specifically, suppose we are given the set P of all Borel probability measures defined on the topological space M, and the RKHS (H, k) of functions on M with k as its reproducing kernel (r.k.). For P ∈ P, denote by Pk := ∫_M k(·, x) dP(x). If k is measurable and bounded, then we may define the embedding of P in H as Pk ∈ H. The RKHS distance between two such mappings associated with P, Q ∈ P is called the maximum mean discrepancy (MMD) [8, 7], and is written

γ_k(P, Q) = ‖Pk − Qk‖_H.  (1)

We say that k is characteristic [6, 7] if the mapping P ↦ Pk is injective, in which case (1) is zero if and only if P = Q, i.e., γ_k is a metric on P. An immediate application of the MMD is to problems of comparing distributions based on finite samples: examples include tests of homogeneity [8], independence [9], and conditional independence [6]. In this application domain, the question of whether k is characteristic is key: without this property, the algorithms can fail through inability to distinguish between particular distributions.

Characteristic kernels are important in binary classification: the problem of distinguishing distributions is strongly related to binary classification, since one would expect easily distinguishable distributions to be easily classifiable. The link between these two problems is especially direct in the case of the MMD: in Section 2, we show that γ_k is the negative of the optimal risk (corresponding to a linear loss function) associated with the Parzen window classifier [2, 4] (also called the kernel classification rule [4, Chapter 10]), where the Parzen window turns out to be k. We also show that γ_k is an upper bound on the margin of a hard-margin support vector machine (SVM). The importance of using characteristic RKHSs is further underlined by this link: if the property does not hold, then there exist distributions that are unclassifiable in the RKHS H.
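The empirical counterpart of (1) is straightforward to compute from samples. The following is a minimal sketch, not taken from the paper: the Gaussian kernel, the bandwidth parameter, and the function name `mmd2` are all choices made here for illustration.

```python
import numpy as np

def mmd2(X, Y, sigma=1.0):
    """Biased empirical estimate of the squared MMD, gamma_k(P_m, Q_n)^2,
    for the Gaussian kernel k(x, y) = exp(-sigma * ||x - y||^2).
    X: (m, d) sample from P; Y: (n, d) sample from Q."""
    def gram(A, B):
        # Pairwise squared Euclidean distances, then the kernel matrix.
        sq = np.sum(A**2, 1)[:, None] + np.sum(B**2, 1)[None, :] - 2 * A @ B.T
        return np.exp(-sigma * np.maximum(sq, 0.0))
    m, n = len(X), len(Y)
    # ||P_m k - Q_n k||_H^2 expanded into three Gram-matrix averages.
    return (gram(X, X).sum() / m**2
            + gram(Y, Y).sum() / n**2
            - 2 * gram(X, Y).sum() / (m * n))
```

Being a squared RKHS norm, this estimate is always nonnegative, and it is exactly zero when the two samples coincide.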
We further strengthen this by showing that characteristic kernels are necessary (and sufficient under certain conditions) to achieve the Bayes risk in kernel-based classification algorithms.

Characterization of characteristic kernels: Given the centrality of the characteristic property to both RKHS classification and RKHS distribution testing, we should take particular care in establishing which kernels satisfy this requirement. Early results in this direction include [8], where k is shown to be characteristic on compact M if it is universal in the sense of Steinwart [8, Definition 4]; and [6, 7], which address the case of non-compact M, and show that k is characteristic if and only if H + R is dense in the Banach space of p-power (p ≥ 1) integrable functions. The conditions in both these studies can be difficult to check and interpret, however, and the restriction of the first to compact M is limiting. In the case of translation invariant kernels, [7] proved the kernel to be characteristic if and only if the support of the Fourier transform of k is the entire R^d, which is a much easier condition to verify. Similar sufficient conditions are obtained by [7] for translation invariant kernels on groups and semi-groups. In Section 3, we expand the class of characteristic kernels to include kernels that may or may not be translation invariant, with the introduction of a novel criterion: strictly positive definite kernels (see Definition 3) on M are characteristic.

Choice of characteristic kernels: In expanding the families of allowable characteristic kernels, we have so far neglected the question of which characteristic kernel to choose. A practitioner asking by how much two samples differ does not want to receive a blizzard of answers for every conceivable kernel and bandwidth setting, but a single measure that satisfies some reasonable notion of distance across the family of kernels considered.
Thus, in Section 4, we propose a generalization of the MMD, yielding a new distance measure between P and Q defined as

γ(P, Q) = sup{γ_k(P, Q) : k ∈ K} = sup{‖Pk − Qk‖_H : k ∈ K},  (2)

which is the maximal RKHS distance between P and Q over a family, K, of positive definite kernels. For example, K can be the family of Gaussian kernels on R^d indexed by the bandwidth parameter. This distance measure is very natural in the light of our results on binary classification (in Section 2): most directly, this corresponds to the problem of learning the kernel by minimizing the risk of the associated Parzen-based classifier. As a less direct justification, we also increase the upper bound on the margin allowed for a hard margin SVM between the samples. To apply the generalized MMD in practice, we must ensure its empirical estimator is consistent. In our main result of Section 4, we provide an empirical estimate of γ(P, Q) based on finite samples, and show that many popular kernels like the Gaussian, Laplacian, and the entire Matérn class on R^d yield consistent estimates of γ(P, Q).¹ The proof is based on bounding the Rademacher chaos complexity of K, which can be understood as the U-process equivalent of Rademacher complexity [3]. Finally, in Section 5, we provide a simple experimental demonstration that the generalized MMD can be applied in practice to the problem of homogeneity testing. Specifically, we show that when two distributions differ on particular length scales, the kernel selected by the generalized MMD is appropriate to this difference, and the resulting hypothesis test outperforms the heuristic kernel choice employed in earlier studies [8]. The proofs of the results in Sections 2-4 are provided in the appendix.

¹ There is a subtlety here, since unlike the problem of testing for differences in distributions, classification suffers from slow learning rates. See [4, Chapter 7] for details.

2 Characteristic Kernels and Binary Classification

One of the most important applications of the maximum mean discrepancy is in nonparametric hypothesis testing [8, 9, 6], where the characteristic property of k is required to distinguish between probability measures. In the following, we show how MMD naturally appears in binary classification, with reference to the Parzen window classifier and the hard-margin SVM. This motivates the need for characteristic k to guarantee that classes arising from different distributions can be classified by kernel-based algorithms. To this end, let us consider the binary classification problem with X being an M-valued random variable, Y being a {−1, +1}-valued random variable and the product space, M × {−1, +1}, being endowed with an induced Borel probability measure µ. A discriminant function, f, is a real valued measurable function on M, whose sign is used to make a classification decision. Given a loss function L : {−1, +1} × R → R, the goal is to choose an f that minimizes the risk associated with L, with the optimal L-risk being defined as

R^L_F = inf_{f∈F} ∫ L(y, f(x)) dµ(x, y) = inf_{f∈F} { ε ∫ L_1(f) dP + (1 − ε) ∫ L_{−1}(f) dQ },  (3)

where F is the set of all measurable functions on M, L_1(α) := L(1, α), L_{−1}(α) := L(−1, α), P(X) := µ(X | Y = +1), Q(X) := µ(X | Y = −1), and ε := µ(M, Y = +1).
Here, P and Q represent the class-conditional distributions and ε is the prior probability of class +1. Now, we present the result that relates γ_k to the optimal risk associated with the Parzen window classifier.

Theorem 1 (γ_k and Parzen classification). Let L_1(α) = −α/ε and L_{−1}(α) = α/(1 − ε). Then, γ_k(P, Q) = −R^L_{F_k}, where F_k = {f : ‖f‖_H ≤ 1} and H is an RKHS with a measurable and bounded k. Suppose {(X_i, Y_i)}_{i=1}^N, X_i ∈ M, Y_i ∈ {−1, +1}, ∀i, is a training sample drawn i.i.d. from µ, and let m = |{i : Y_i = +1}| and n = N − m. If f̂ ∈ F_k is an empirical minimizer of (3) (where F is replaced by F_k in (3)), then

sign(f̂(x)) = { +1, if (1/m) Σ_{Y_i=+1} k(x, X_i) > (1/n) Σ_{Y_i=−1} k(x, X_i); −1, otherwise,  (4)

which is the Parzen window classifier.

Theorem 1 shows that γ_k is the negative of the optimal L-risk (where L is the linear loss as defined in Theorem 1) associated with the Parzen window classifier. Therefore, if k is not characteristic, which means γ_k(P, Q) = 0 for some P ≠ Q, then R^L_{F_k} = 0, i.e., the risk is maximum (note that since γ_k(P, Q) = −R^L_{F_k} ≥ 0, the maximum risk is zero). In other words, if k is characteristic, then the maximum risk is obtained only when P = Q. This motivates the importance of characteristic kernels in binary classification. In the following, we provide another result that gives a similar motivation for the importance of characteristic kernels in binary classification, wherein we relate γ_k to the margin of a hard-margin SVM.

Theorem 2 (γ_k and hard-margin SVM). Suppose {(X_i, Y_i)}_{i=1}^N, X_i ∈ M, Y_i ∈ {−1, +1}, ∀i, is a training sample drawn i.i.d. from µ. Assuming the training sample is separable, let f_svm be the solution to the program inf{‖f‖_H : Y_i f(X_i) ≥ 1, ∀i}, where H is an RKHS with measurable and bounded k. If k is characteristic, then

1/‖f_svm‖_H ≤ γ_k(P_m, Q_n)/2,  (5)

where P_m := (1/m) Σ_{Y_i=+1} δ_{X_i}, Q_n := (1/n) Σ_{Y_i=−1} δ_{X_i}, m = |{i : Y_i = +1}| and n = N − m. δ_x represents the Dirac measure at x.
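As an illustration of the Parzen window rule in (4), a direct implementation might look as follows. This is a sketch, not code from the paper: the Gaussian kernel and the function name `parzen_classify` are assumptions made here.

```python
import numpy as np

def parzen_classify(x, X, labels, sigma=1.0):
    """Parzen window (kernel) classification rule of eq. (4):
    predict +1 iff the average kernel similarity of x to the positive
    training points exceeds that to the negative ones.
    X: list/array of training points; labels: entries in {-1, +1}."""
    k = lambda a, b: np.exp(-sigma * np.sum((a - b) ** 2))  # assumed kernel
    pos = [Xi for Xi, yi in zip(X, labels) if yi == +1]
    neg = [Xi for Xi, yi in zip(X, labels) if yi == -1]
    s_pos = sum(k(x, Xi) for Xi in pos) / len(pos)  # similarity to class +1
    s_neg = sum(k(x, Xi) for Xi in neg) / len(neg)  # similarity to class -1
    return 1 if s_pos > s_neg else -1
```

Note that this is exactly a comparison of the test point's kernel evaluation against the two empirical mean embeddings, which is the "mean classifier" view taken up again in Section 3.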
Theorem 2 provides a bound on the margin of the hard-margin SVM in terms of MMD. (5) shows that a smaller MMD between P_m and Q_n enforces a smaller margin (i.e., a less smooth classifier, f_svm, with smoothness measured as ‖f_svm‖_H). We can observe that the bound in (5) may be loose if the number of support vectors is small. Suppose k is not characteristic; then γ_k(P_m, Q_n) can be zero for P_m ≠ Q_n and therefore the margin is zero, which means even unlike distributions can become inseparable in this feature representation. Another justification for using characteristic kernels in kernel-based classification algorithms can be provided by studying the conditions on H for which the Bayes risk is realized for all µ. Steinwart and Christmann [9, Corollary 5.37] have shown that under certain conditions on L, the Bayes risk is achieved for all µ if and only if H is dense in L_p(M, η) for all η, where η = εP + (1 − ε)Q. Here, L_p(M, η) represents the Banach space of p-power integrable functions, where p ∈ [1, ∞) depends on the loss function, L. Denseness of H in L_p(M, η) implies H + R is dense in L_p(M, η), which therefore yields that k is characteristic [6, 7]. On the other hand, if constant functions are included in H, then it is easy to show that the characteristic property of k is also sufficient to achieve the Bayes risk. As an example, it can be shown that characteristic kernels are necessary (and sufficient if constant functions are in H) for SVMs to achieve the Bayes risk [9, Example 5.4]. Therefore, the characteristic property of k is fundamental in kernel-based classification algorithms. Having shown how characteristic kernels play a role in kernel-based classification, in the following section we provide a novel characterization for them.

3 Novel Characterization of Characteristic Kernels

A positive definite (pd) kernel, k, is said to be characteristic to P if and only if γ_k(P, Q) = 0 ⟺ P = Q, for all P, Q ∈ P.
The following result provides a novel characterization of characteristic kernels, which shows that strictly pd kernels are characteristic to P. An advantage of this characterization is that it holds for an arbitrary topological space M, unlike the earlier characterizations where a group structure on M is assumed [7, 17]. First, we define strictly pd kernels as follows.

Definition 3 (Strictly positive definite kernels). Let M be a topological space. A measurable and bounded kernel, k, is said to be strictly positive definite if and only if ∫∫_M k(x, y) dµ(x) dµ(y) > 0 for all finite non-zero signed Borel measures, µ, defined on M.

Note that the above definition is not equivalent to the usual definition of strictly pd kernels that involves finite sums [9, Definition 4.5]. The above definition is a generalization of integrally strictly positive definite functions [2, Section 6]: ∫∫ k(x, y) f(x) f(y) dx dy > 0 for all f ∈ L_2(R^d), which is the strict positive definiteness of the integral operator given by the kernel. Definition 3 is stronger than the finite sum definition, as [9, Theorem 4.62] shows a kernel that is strictly pd in the finite sum sense but not in the integral sense.

Theorem 4 (Strictly pd kernels are characteristic). If k is strictly positive definite on M, then k is characteristic to P.

The proof idea is to derive necessary and sufficient conditions for a kernel not to be characteristic. We show that choosing k to be strictly pd violates these conditions, and k is therefore characteristic to P. Examples of strictly pd kernels on R^d include exp(−σ‖x − y‖_2²), σ > 0; exp(−σ‖x − y‖_1), σ > 0; (c² + ‖x − y‖_2²)^{−β}, β > 0, c > 0; B_{2l+1}-splines, etc. Note that k̃(x, y) = f(x)k(x, y)f(y) is a strictly pd kernel if k is strictly pd, where f : M → R is a bounded continuous function. Therefore, translation-variant strictly pd kernels can be obtained by choosing k to be a translation invariant strictly pd kernel.
A simple example of a translation-variant kernel that is strictly pd on compact sets of R^d is k̃(x, y) = exp(σ xᵀy), σ > 0, where we have chosen f(·) = exp(σ‖·‖_2²/2) and k(x, y) = exp(−σ‖x − y‖_2²/2), σ > 0. Therefore, k̃ is characteristic on compact sets of R^d, which is the same result that follows from the universality of k̃ [8, Section 3, Example 1]. The following result in [3], which is based on the usual definition of strictly pd kernels, can be obtained as a corollary to Theorem 4.

Corollary 5 ([3]). Let X = {x_i}_{i=1}^m, Y = {y_j}_{j=1}^n and assume that x_i ≠ x_j, y_i ≠ y_j, ∀i ≠ j. Suppose k is strictly positive definite. Then Σ_{i=1}^m α_i k(·, x_i) = Σ_{j=1}^n β_j k(·, y_j) for some α_i, β_j ∈ R\{0} if and only if X = Y.

Suppose we choose α_i = 1/m, ∀i, and β_j = 1/n, ∀j, in Corollary 5. Then Σ_{i=1}^m α_i k(·, x_i) and Σ_{j=1}^n β_j k(·, y_j) represent the mean functions in H. Note that the Parzen classifier in (4) is a mean classifier (one that separates the mean functions) in H, i.e., sign(⟨k(·, x), w⟩_H), where w = (1/m) Σ_{i=1}^m k(·, x_i) − (1/n) Σ_{i=1}^n k(·, y_i). Suppose k is strictly pd (more generally, suppose k is characteristic). Then, by Corollary 5, the normal vector, w, to the hyperplane in H passing through the origin is zero, i.e., the mean functions coincide (and are therefore not classifiable), if and only if X = Y.

4 Generalizing the MMD for Classes of Characteristic Kernels

The discussion so far has been related to the characteristic property of k that makes γ_k a metric on P. We have seen that this characteristic property is of prime importance both in distribution testing, and to ensure classifiability of dissimilar distributions in the RKHS. We have not yet addressed how to choose among a selection/family of characteristic kernels, given a particular pair of distributions we wish to discriminate between. We introduce one approach to this problem in the present section. Let M = R^d and k_σ(x, y) = exp(−σ‖x − y‖_2²), σ ∈ R_+, where σ represents the bandwidth parameter. {k_σ : σ ∈ R_+} is the family of Gaussian kernels and {γ_{k_σ} : σ ∈ R_+} is the family of MMDs indexed by the kernel parameter, σ. Note that k_σ is characteristic for any σ ∈ R_{++} and therefore γ_{k_σ} is a metric on P for any σ ∈ R_{++}. However, in practice, one would prefer a single number that defines the distance between P and Q. The question therefore to be addressed is how to choose an appropriate σ. The choice of σ has important implications for the statistical behavior of γ_{k_σ}. Note that as σ → 0, k_σ → 1 and as σ → ∞, k_σ → 0 a.e., which means γ_{k_σ}(P, Q) → 0 as σ → 0 or σ → ∞ for all P, Q ∈ P (this behavior is also exhibited by k_σ(x, y) = exp(−σ‖x − y‖_1) and k_σ(x, y) = σ²/(σ² + ‖x − y‖_2²), which are also characteristic). This means choosing a sufficiently small or sufficiently large σ (depending on P and Q) makes γ_{k_σ}(P, Q) arbitrarily small. Therefore, σ has to be chosen appropriately in applications to effectively distinguish between P and Q.
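This degeneracy of γ_{k_σ} at extreme bandwidths can be checked numerically. The sketch below is an illustration, not code from the paper: it uses the unbiased U-statistic estimate of γ²_{k_σ} (an implementation choice made here, since the population quantity is what vanishes; the biased estimate retains diagonal Gram terms of order 1/m + 1/n at large σ).

```python
import numpy as np

def mmd2_unbiased(X, Y, sigma):
    """Unbiased estimate of gamma_{k_sigma}(P, Q)^2 with the Gaussian
    kernel k_sigma(x, y) = exp(-sigma * ||x - y||^2); diagonal Gram
    entries are excluded from the within-sample averages."""
    def gram(A, B):
        sq = np.sum(A**2, 1)[:, None] + np.sum(B**2, 1)[None, :] - 2 * A @ B.T
        return np.exp(-sigma * np.maximum(sq, 0.0))
    m, n = len(X), len(Y)
    Kxx, Kyy = gram(X, X), gram(Y, Y)
    return ((Kxx.sum() - np.trace(Kxx)) / (m * (m - 1))
            + (Kyy.sum() - np.trace(Kyy)) / (n * (n - 1))
            - 2 * gram(X, Y).sum() / (m * n))
```

For two fixed samples from distinct distributions, evaluating this estimate at a very small and at a very large σ yields values near zero, while an intermediate σ gives a clearly positive value, matching the behavior described above.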
Presently, the applications involving MMD set σ heuristically [8, 9]. To generalize the MMD to families of kernels, we propose the following modification to γ_k, which yields a pseudometric on P,

γ(P, Q) = sup{γ_k(P, Q) : k ∈ K} = sup{‖Pk − Qk‖_H : k ∈ K}.  (6)

Note that γ is the maximal RKHS distance between P and Q over a family, K, of positive definite kernels. It is easy to check that if any k ∈ K is characteristic, then γ is a metric on P. Examples for K include: K_g := {e^{−σ‖x−y‖_2²}, x, y ∈ R^d : σ ∈ R_+}; K_l := {e^{−σ‖x−y‖_1}, x, y ∈ R^d : σ ∈ R_+}; K_ψ := {e^{−σψ(x,y)}, x, y ∈ M : σ ∈ R_+}, where ψ : M × M → R is a negative definite kernel; K_rbf := {∫_0^∞ e^{−λ‖x−y‖_2²} dµ_σ(λ), x, y ∈ R^d, µ_σ ∈ M_+ : σ ∈ Σ ⊂ R^d}, where M_+ is the set of all finite nonnegative Borel measures, µ_σ, on R_+ that are not concentrated at zero, etc.

The proposal of γ(P, Q) in (6) can be motivated by the connection that we have established in Section 2 between γ_k and the Parzen window classifier. Since the Parzen window classifier depends on the kernel, k, one can propose to learn the kernel as is done for support vector machines [1], wherein the kernel is chosen such that R^L_{F_k} in Theorem 1 is minimized over k ∈ K, i.e., inf_{k∈K} R^L_{F_k} = −sup_{k∈K} γ_k(P, Q) = −γ(P, Q). A similar motivation for γ can be provided based on (5), as learning the kernel in a hard-margin SVM by maximizing its margin. At this point, we briefly discuss the issue of normalized vs. unnormalized kernel families, K, in (6). We say a translation-invariant kernel, k, on R^d is normalized if ∫_{R^d} ψ(y) dy = c (some positive constant independent of the kernel parameter), where k(x, y) = ψ(x − y). K is a normalized kernel family if every kernel in K is normalized; if K is not normalized, we say it is unnormalized. For example, it is easy to see that K_g and K_l are unnormalized kernel families. Let us consider the normalized Gaussian family, K_g^n = {(πσ)^{−d/2} e^{−‖x−y‖_2²/σ}, x, y ∈ R^d : σ ∈ [σ_0, ∞)}. It can be shown that for any k_σ, k_τ ∈ K_g^n, 0 < σ_0 ≤ σ < τ < ∞, we have γ_{k_σ}(P, Q) ≥ γ_{k_τ}(P, Q), which means γ(P, Q) = γ_{k_{σ_0}}(P, Q).
Therefore, the generalized MMD reduces to a single kernel MMD. A similar result also holds for the normalized inverse-quadratic kernel family on R, e.g., {σ/(π(σ² + (x − y)²)), x, y ∈ R : σ ∈ [σ_0, ∞)}. These examples show that the generalized MMD definition is usually not very useful if K is a normalized kernel family. In addition, σ_0 should be chosen beforehand, which is equivalent to heuristically setting the kernel parameter in γ_k. Note that σ_0 cannot be zero because in the limiting case of σ → 0, the kernels approach a Dirac distribution, which means the limiting kernel is not bounded and therefore the definition of MMD in (1) does not hold. So, in this work, we consider unnormalized kernel families to render the definition of the generalized MMD in (6) useful.
To use γ in statistical applications where P and Q are known only through i.i.d. samples {X_i}_{i=1}^m and {Y_j}_{j=1}^n respectively, we require its estimator γ(P_m, Q_n) to be consistent, where P_m and Q_n represent the empirical measures based on {X_i}_{i=1}^m and {Y_j}_{j=1}^n. For k measurable and bounded, [8, 5] have shown that γ_k(P_m, Q_n) is a √(mn/(m + n))-consistent estimator of γ_k(P, Q). The statistical consistency of γ(P_m, Q_n) is established in the following theorem, which uses tools from U-process theory [3, Chapters 3, 5]. We begin with the following definition.

Definition 6 (Rademacher chaos). Let G be a class of functions on M × M and {ρ_i}_{i=1}^n be independent Rademacher random variables, i.e., Pr(ρ_i = +1) = Pr(ρ_i = −1) = 1/2. The homogeneous Rademacher chaos process of order two with respect to {ρ_i}_{i=1}^n is defined as {(1/n) Σ_{i<j} ρ_i ρ_j g(x_i, x_j) : g ∈ G} for some {x_i}_{i=1}^n ⊂ M. The Rademacher chaos complexity over G is defined as

U_n(G; {x_i}_{i=1}^n) := E_ρ sup_{g∈G} | (1/n) Σ_{i<j} ρ_i ρ_j g(x_i, x_j) |.  (7)

We now provide the main result of the present section.

Theorem 7 (Consistency of γ(P_m, Q_n)). Let every k ∈ K be measurable and bounded with ν := sup_{k∈K, x∈M} k(x, x) < ∞. Then, with probability at least 1 − δ, |γ(P_m, Q_n) − γ(P, Q)| ≤ A, where

A = √(6 U_m(K; {X_i})/m) + √(6 U_n(K; {Y_i})/n) + (√(8ν) + √(36ν log(4/δ))) √((m + n)/(mn)).  (8)

From (8), it is clear that if U_m(K; {X_i}) = O_P(1) and U_n(K; {Y_i}) = O_Q(1), then γ(P_m, Q_n) → γ(P, Q) a.s. The following result provides a bound on U_m(K; {X_i}) in terms of the entropy integral.

Lemma 8 (Entropy bound). For any K as in Theorem 7 with 0 ∈ K, there exists a universal constant C such that

U_m(K; {X_i}_{i=1}^m) ≤ C ∫_0^ν log N(K, D, ε) dε,  (9)

where D(k_1, k_2) = (1/m) [Σ_{i<j} (k_1(X_i, X_j) − k_2(X_i, X_j))²]^{1/2} and N(K, D, ε) represents the ε-covering number of K with respect to the metric D.

Assuming K to be a VC-subgraph class, the following result, as a corollary to Lemma 8, provides an estimate of U_m(K; {X_i}_{i=1}^m). Before presenting the result, we first provide the definition of a VC-subgraph class.
Definition 9 (VC-subgraph class). The subgraph of a function g : M → R is the subset of M × R given by {(x, t) : t < g(x)}. A collection G of measurable functions on a sample space is called a VC-subgraph class if the collection of all subgraphs of the functions in G forms a VC-class of sets (in M × R). The VC-index (also called the VC-dimension) of a VC-subgraph class, G, is the same as the pseudodimension of G. See [1] for details.

Corollary 10 (U_m(K; {X_i}) for a VC-subgraph class K). Suppose K is a VC-subgraph class with V(K) being the VC-index. Assume K satisfies the conditions in Theorem 7 and 0 ∈ K. Then

U_m(K; {X_i}) ≤ C ν log(C_1 V(K)(6e^9)^{V(K)}),  (10)

for some universal constants C and C_1.

Using (10) in (8), we have γ(P_m, Q_n) − γ(P, Q) = O_{P,Q}(√((m + n)/(mn))) and, by the Borel-Cantelli lemma, |γ(P_m, Q_n) − γ(P, Q)| → 0 a.s. Now, the question reduces to which of the kernel classes, K, have V(K) < ∞. [22, Lemma 2] showed that V(K_g) = 1 (also see [23]) and U_m(K_rbf) ≤ C_2 U_m(K_g), where C_2 < ∞. It can be shown that V(K_ψ) = 1 and V(K_l) = 1. All these classes satisfy the conditions of Theorem 7 and Corollary 10 and therefore provide consistent estimates of γ(P, Q) for any P, Q ∈ P. Examples of kernels on R^d that are covered by these classes include the Gaussian, Laplacian, inverse multiquadratics, Matérn class, etc. Other choices for K that are popular in machine learning are linear combinations of kernels, K_lin := {k_λ = Σ_{i=1}^l λ_i k_i : k_λ is pd, Σ_{i=1}^l λ_i = 1} and K_con := {k_λ = Σ_{i=1}^l λ_i k_i : λ_i ≥ 0, Σ_{i=1}^l λ_i = 1}. [6, Lemma 7] have shown that V(K_con) ≤ V(K_lin) ≤ l. Therefore, instead of using a class based on a fixed, parameterized kernel, one can also use a finite linear combination of kernels to compute γ.
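When K is replaced by a finite collection of candidate kernels, the supremum in (6) reduces to a simple maximum over single-kernel MMD estimates. A minimal sketch follows; the finite Gaussian bandwidth grid and all function names are assumptions of this illustration, not part of the paper.

```python
import numpy as np

def mmd2_biased(X, Y, k):
    """Biased estimate of gamma_k(P_m, Q_n)^2 for a kernel function k."""
    m, n = len(X), len(Y)
    return (k(X, X).sum() / m**2
            + k(Y, Y).sum() / n**2
            - 2 * k(X, Y).sum() / (m * n))

def gauss(sigma):
    """Gaussian kernel k(x, y) = exp(-sigma * ||x - y||^2) as a Gram function."""
    def k(A, B):
        sq = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
        return np.exp(-sigma * sq)
    return k

def generalized_mmd(X, Y, sigmas):
    """Approximate gamma(P_m, Q_n) over the Gaussian family by replacing
    the supremum over sigma in R_+ with a finite grid (an assumption made
    here; the optimum can instead be found by a one-dimensional search)."""
    vals = {s: mmd2_biased(X, Y, gauss(s)) for s in sigmas}
    best = max(vals, key=vals.get)
    return np.sqrt(max(vals[best], 0.0)), best
```

The same maximum-over-kernels computation covers a finite bank of fixed kernels, which matches the closed form for convex combinations discussed next.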
So far, we have presented the metric property and the statistical consistency (of the empirical estimator) of γ. The question now is how to compute γ(P_m, Q_n) in practice. To show this, we present two examples.

Example 1. Suppose K = K_g. Then γ(P_m, Q_n) can be written as

γ²(P_m, Q_n) = sup_{σ∈R_+} [ (1/m²) Σ_{i,j=1}^m e^{−σ‖X_i−X_j‖_2²} + (1/n²) Σ_{i,j=1}^n e^{−σ‖Y_i−Y_j‖_2²} − (2/(mn)) Σ_{i,j=1}^{m,n} e^{−σ‖X_i−Y_j‖_2²} ].  (11)

The optimum σ can be obtained by solving (11), and γ(P_m, Q_n) = ‖P_m k_σ − Q_n k_σ‖_{H_σ} at that optimum.

Example 2. Suppose K = K_con. Then γ(P_m, Q_n) becomes

γ²(P_m, Q_n) = sup_{k∈K_con} ‖P_m k − Q_n k‖²_H = sup_{k∈K_con} ∫∫ k d(P_m − Q_n) d(P_m − Q_n) = sup{λᵀa : λᵀ1 = 1, λ ≥ 0},  (12)

where we have replaced k by Σ_{i=1}^l λ_i k_i. Here λ = (λ_1, ..., λ_l) and (a)_i = ‖P_m k_i − Q_n k_i‖²_{H_i} = (1/m²) Σ_{a,b=1}^m k_i(X_a, X_b) + (1/n²) Σ_{a,b=1}^n k_i(Y_a, Y_b) − (2/(mn)) Σ_{a,b=1}^{m,n} k_i(X_a, Y_b). It is easy to see that γ²(P_m, Q_n) = max_{1≤i≤l} (a)_i. Similar examples can be provided for other K, where γ(P_m, Q_n) can be computed by solving a semidefinite program (K = K_lin) or by constrained gradient descent (K = K_l, K_rbf).

Finally, while the approach in (6) to generalizing γ_k is our focus in this paper, an alternative Bayesian strategy would be to define a non-negative finite measure λ over K, and to average γ_k over that measure, i.e., β(P, Q) := ∫_K γ_k(P, Q) dλ(k). This also yields a pseudometric on P. That said, β(P, Q) ≤ λ(K) γ(P, Q), ∀P, Q, which means that if P and Q can be distinguished by β, they can be distinguished by γ, but not vice-versa. In this sense, γ is stronger than β. One further complication with the Bayesian approach is in defining a sensible λ over K. Note that γ_{k_1} (the single kernel MMD based on k_1) can be obtained by defining λ(k) = δ(k − k_1) in β(P, Q).

5 Experiments

In this section, we present a benchmark experiment illustrating that the generalized MMD proposed in Section 4 is preferable to the single kernel MMD with a heuristically set kernel parameter. The experimental setup is as follows. Let p = N(0, σ_p²), a normal distribution in R with zero mean and variance σ_p².
Let q be the perturbed version of p, given as q(x) = p(x)(1 + sin νx). Here p and q are the densities associated with P and Q respectively. It is easy to see that q differs from p at increasing frequencies with increasing ν. Let k(x, y) = exp(−(x − y)²/σ). Now, the goal is, given random samples drawn i.i.d. from P and Q (with ν fixed), to test H_0 : P = Q vs. H_1 : P ≠ Q. The idea is that as ν increases, it will be harder to distinguish between P and Q for a fixed sample size. Therefore, using this setup we can verify whether the adaptive bandwidth selection achieved by γ (as the test statistic) helps to distinguish between P and Q at higher ν compared to γ_k with a heuristic σ. To this end, using γ(P_m, Q_n) and γ_k(P_m, Q_n) (with various σ) as test statistics T_{m,n}, we design a test that returns H_1 if T_{m,n} > c_{m,n}, and H_0 otherwise. The problem therefore reduces to finding c_{m,n}, which is determined as the (1 − α) quantile of the asymptotic distribution of T_{m,n} under H_0; this fixes the type-I error (the probability of rejecting H_0 when it is true) to α. The consistency of this test under γ_k (for any fixed σ) is proved in [8]. A similar result can be shown for γ under some conditions on K; we skip the details here. In our experiments, we set m = n = 1000, σ_p² = 10, and draw two sets of independent random samples from Q. The distribution of T_{m,n} is estimated by bootstrapping on these samples (250 bootstrap iterations are performed) and the associated 95th quantile (we choose α = 0.05) is computed. Since the performance of the test is judged by its type-II error (the probability of accepting H_0 when H_1 is true), we draw a random sample, one each from P and Q, and test whether P = Q. This process is repeated 300 times, and estimates of the type-I and type-II errors are obtained for both γ and γ_k. 14 different values for σ are considered: a logarithmic scale of base 2 with exponents (−3, −2, −1, 0, 1, 3/2, 2, 5/2, 3, 7/2, 4, 5, 6), along with the median distance between samples as one more choice.
Five different choices for ν are considered: (1/2, 3/4, 1, 5/4, 3/2).
[Figure 1 appears here; the panel graphics are not reproduced in this transcription.]

Figure 1: (a) Type-I and type-II errors (in %) for γ for varying ν. (b, c) Type-I and type-II errors (in %) for γ_k (with different σ) for varying ν. The dotted line in (c) corresponds to the median heuristic, which shows that its associated type-II error is very large at large ν. (d) Box plot of log σ grouped by ν, where σ is selected by γ. (e) Box plot of the median distance between points (which is also a choice for σ), grouped by ν. Refer to Section 5 for details.

Figure 1(a) shows the estimated type-I and type-II errors using γ as the test statistic for varying ν. Note that the type-I error is close to its design value of 5%, while the type-II error is zero for all ν, which means γ distinguishes between P and Q for all perturbations. Figures 1(b, c) show the estimates of the type-I and type-II errors using γ_k as the test statistic for different σ and ν. Figure 1(d) shows the box plot for log σ, grouped by ν, where σ is the bandwidth selected by γ. Figure 1(e) shows the box plot of the median distance between points (which is also a choice for σ), grouped by ν. From Figures 1(c) and 1(e), it is easy to see that the median heuristic exhibits a high type-II error for ν = 3/2, while γ exhibits zero type-II error (from Figure 1(a)). Figure 1(c) also shows that heuristic choices of σ can result in high type-II errors. It is intuitive that as ν increases (which means the characteristic function of Q differs from that of P at higher frequencies), a smaller σ is needed to detect these changes. The advantage of using γ is that it selects σ in a distribution-dependent fashion, and its behavior in the box plot shown in Figure 1(d) matches this intuition about the behavior of σ with respect to ν.
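The testing protocol of this section can be sketched as follows. Note that this illustration approximates the null distribution by recomputing the statistic on random splits of the pooled sample (a permutation-style variant, an assumption made here), rather than the bootstrap-from-Q procedure described above; the function name and defaults are likewise illustrative.

```python
import numpy as np

def two_sample_test(X, Y, stat, alpha=0.05, B=250, seed=0):
    """Return True (reject H0: P = Q) when stat(X, Y) exceeds the
    (1 - alpha) quantile of the statistic under H0, approximated by
    recomputing it on B random splits of the pooled sample."""
    rng = np.random.default_rng(seed)
    pooled = np.concatenate([X, Y])
    m = len(X)
    null_stats = []
    for _ in range(B):
        # Random relabeling of the pooled sample simulates H0.
        idx = rng.permutation(len(pooled))
        null_stats.append(stat(pooled[idx[:m]], pooled[idx[m:]]))
    threshold = np.quantile(null_stats, 1 - alpha)  # critical value c
    return stat(X, Y) > threshold
```

Any of the MMD estimates discussed earlier can be plugged in as `stat`; the same wrapper serves for both the single-kernel and the generalized statistic.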
These results demonstrate the validity of using γ as a distance measure in applications.

6 Conclusions

In this work, we have shown how MMD appears in binary classification, and thus that characteristic kernels are important in kernel-based classification algorithms. We have broadened the class of characteristic RKHSs to include those induced by strictly positive definite kernels (with particular application to kernels on non-compact domains, and/or kernels that are not translation invariant). We have further provided a convergent generalization of MMD over families of kernel functions, which becomes necessary even when considering relatively simple families of kernels (such as the Gaussian kernels parameterized by their bandwidth). The usefulness of the generalized MMD is illustrated experimentally with a two-sample testing problem.

Acknowledgments

The authors thank the anonymous reviewers for their constructive comments, and especially the reviewer who pointed out the connection between characteristic kernels and the achievability of the Bayes risk. B. K. S. was supported by the MPI for Biological Cybernetics, the National Science Foundation (grant DMS-MSPA 0625409), the Fair Isaac Corporation and the University of California MICRO program. A. G. was supported by grants DARPA IPTO FA, ONR MURI N000140710747, and ARO MURI W911NF
A Proofs

We provide proofs for the results in Sections 2-4.

A.1 Proof of Theorem 1

To prove Theorem 1, we need the following result from [7].

Theorem 13 ([7]). Let F_k := {f : ‖f‖_H ≤ 1}, where (H, k) is an RKHS defined on a measurable space M with k measurable and bounded. Then

γ_k(P, Q) = sup_{f∈F_k} |Pf − Qf| = ‖Pk − Qk‖_H,  (13)

where ‖·‖_H represents the RKHS norm.

Proof of Theorem 1: From (3), we have

ε ∫ L_1(f) dP + (1 − ε) ∫ L_{−1}(f) dQ = −∫ f dP + ∫ f dQ = Qf − Pf.

Therefore, R^L_{F_k} = inf_{f∈F_k} (Qf − Pf) = −sup_{f∈F_k} (Pf − Qf) = −sup_{f∈F_k} |Pf − Qf| = −γ_k(P, Q), which follows from Theorem 13. Given {(X_i, Y_i)}_{i=1}^N drawn i.i.d. from µ, the empirical equivalent of (3) is given by

inf { −(1/m) Σ_{Y_i=+1} f(X_i) + (1/n) Σ_{Y_i=−1} f(X_i) : f ∈ F_k }.

Solving this for f gives

f̂ = [ (1/m) Σ_{Y_i=+1} k(·, X_i) − (1/n) Σ_{Y_i=−1} k(·, X_i) ] / ‖ (1/m) Σ_{Y_i=+1} k(·, X_i) − (1/n) Σ_{Y_i=−1} k(·, X_i) ‖_H,

and the result in (4) follows.

A.2 Proof of Theorem 2

Before we prove Theorem 2, we present a lemma which we will use in the proof.

Lemma 14. Let θ : V → R and ψ : V → R be convex functions on a real vector space V. Suppose

a = sup{θ(x) : ψ(x) ≤ b},  (14)

where θ is not constant on {x : ψ(x) ≤ b} and a < ∞. Then

b = inf{ψ(x) : θ(x) ≥ a}.  (15)

We need the following result from [1, Theorem 32.1] to prove Lemma 14.

Theorem 15 ([1]). Let f be a convex function, and let C be a convex set contained in the domain of f. If f attains its supremum relative to C at some point of the relative interior of C, then f is actually constant throughout C.

Proof of Lemma 14: Note that A := {x : ψ(x) ≤ b} is a convex subset of V. Since θ is not constant on A, by Theorem 15, θ attains its supremum on the boundary of A. Therefore, any solution, x*, to (14) satisfies θ(x*) = a and ψ(x*) = b. Let G := {x : θ(x) > a}. For any x ∈ G, ψ(x) > b; if this were not the case, then x* would not be a solution to (14). Let H := {x : θ(x) = a}. Clearly, x* ∈ H, and so there exists an x ∈ H for which ψ(x) = b. Suppose inf{ψ(x) : x ∈ H} = c < b, which means some x ∈ H lies in the relative interior of A. From (14), this implies θ attains its supremum relative to A at some point of the relative interior of A.
By Theorem 15, this implies $\theta$ is constant on $A$, leading to a contradiction. Therefore, $\inf\{\psi(x) : x \in H\} = b$ and the result in (15) follows.
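As an aside, the empirical minimizer $f$ derived in the proof of Theorem 1 above is, up to normalization, a Parzen-window classifier: it assigns $x$ to the class whose empirical kernel mean is larger at $x$. A minimal sketch; the Gaussian kernel, toy data, and function names are our illustrative choices, not the paper's code.

```python
import math
import random

def gaussian_kernel(x, y, sigma=1.0):
    return math.exp(-(x - y) ** 2 / (2 * sigma ** 2))

def witness(pos, neg, k):
    # f(x) proportional to (1/m) sum_{Y_i=1} k(x, X_i) - (1/n) sum_{Y_i=-1} k(x, X_i);
    # sign(f) is the Parzen-window rule from the proof of Theorem 1
    m, n = len(pos), len(neg)
    def f(x):
        return (sum(k(x, xi) for xi in pos) / m
                - sum(k(x, xi) for xi in neg) / n)
    return f

random.seed(1)
pos = [random.gauss(+2.0, 0.5) for _ in range(50)]
neg = [random.gauss(-2.0, 0.5) for _ in range(50)]
f = witness(pos, neg, gaussian_kernel)
print(f(2.0) > 0, f(-2.0) < 0)
```

The denominator in the closed-form $f$ only rescales the decision function, so the sign (the classification) can be computed without it.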
Proof of Theorem 2: By Theorem 13, $\gamma_k(P, Q) = \sup\{Pf - Qf : f \in \mathcal{F}_k\}$. Note that $Pf - Qf$ and $\|f\|_{\mathcal{H}}$ are convex functionals on $\mathcal{H}$. For $P \ne Q$, $Pf - Qf$ is not constant on $\mathcal{F}_k$, since $k$ is characteristic. Therefore, by Lemma 14, we have
$$1 = \inf\{\|f\|_{\mathcal{H}} : Pf - Qf \ge \gamma_k(P, Q),\ f \in \mathcal{H}\}.$$
Since this holds for all $P \ne Q$, it holds in particular for $P_m$ and $Q_n$. Therefore, we have
$$\frac{2}{\gamma_k(P_m, Q_n)} = \inf\{\|f\|_{\mathcal{H}} : P_m f - Q_n f \ge 2,\ f \in \mathcal{H}\}.$$
Consider
$$\{f \in \mathcal{H} : P_m f - Q_n f \ge 2\} = \Big\{f \in \mathcal{H} : \tfrac{1}{m}\textstyle\sum_{i : Y_i = 1} f(X_i) - \tfrac{1}{n}\sum_{i : Y_i = -1} f(X_i) \ge 2\Big\} \supset \{f \in \mathcal{H} : Y_i f(X_i) \ge 1,\ \forall i\},$$
which means $\frac{2}{\gamma_k(P_m, Q_n)} \le \|f_{\mathrm{svm}}\|_{\mathcal{H}}$.

A.3 Proof of Theorem 4

To prove Theorem 4, we need the following lemma, which provides necessary and sufficient conditions for a kernel not to be characteristic.

Lemma 16. Let $k$ be measurable and bounded on $M$. Then there exist $P \ne Q$, $P, Q \in \mathcal{P}$, such that $\gamma_k(P, Q) = 0$ if and only if there exists a finite non-zero signed Borel measure $\mu$ that satisfies: (i) $\int\!\!\int k(x, y)\, d\mu(x)\, d\mu(y) = 0$, (ii) $\mu(M) = 0$.

Proof. ($\Leftarrow$) Suppose there exists a finite non-zero signed Borel measure $\mu$ that satisfies (i) and (ii) in Lemma 16. By the Jordan decomposition theorem [5, Theorem 5.6.1], there exist unique positive measures $\mu^+$ and $\mu^-$ such that $\mu = \mu^+ - \mu^-$ and $\mu^+ \perp \mu^-$ ($\mu^+$ and $\mu^-$ are singular). By (ii), we have $\mu^+(M) = \mu^-(M) =: \alpha$. Define $P = \alpha^{-1}\mu^+$ and $Q = \alpha^{-1}\mu^-$. Clearly, $P \ne Q$ and $P, Q \in \mathcal{P}$. Then
$$\gamma_k^2(P, Q) = \|Pk - Qk\|_{\mathcal{H}}^2 = \alpha^{-2}\|\mu k\|_{\mathcal{H}}^2 = \alpha^{-2}\langle \mu k, \mu k\rangle_{\mathcal{H}}. \qquad (16)$$
From the proof of Theorem 13 (see Theorem 3 in [17]), we have $\langle \mu k, \mu k\rangle_{\mathcal{H}} = \int\!\!\int k(x, y)\, d\mu(x)\, d\mu(y)$, and therefore, by (i), $\gamma_k(P, Q) = 0$. So, we have constructed $P \ne Q$ such that $\gamma_k(P, Q) = 0$.

($\Rightarrow$) Suppose there exist $P \ne Q$, $P, Q \in \mathcal{P}$, such that $\gamma_k(P, Q) = 0$. Let $\mu = P - Q$. Clearly, $\mu$ is a finite non-zero signed Borel measure that satisfies $\mu(M) = 0$. Note that $\gamma_k^2(P, Q) = \|Pk - Qk\|_{\mathcal{H}}^2 = \|\mu k\|_{\mathcal{H}}^2 = \int\!\!\int k(x, y)\, d\mu(x)\, d\mu(y)$, and therefore (i) follows.

Proof of Theorem 4: Since $k$ is strictly pd on $M$, we have $\int\!\!\int k(x, y)\, d\eta(x)\, d\eta(y) > 0$ for any finite non-zero signed Borel measure $\eta$. This means there does not exist a finite non-zero signed Borel measure that satisfies (i) in Lemma 16.
Therefore, by Lemma 16, there do not exist $P \ne Q$, $P, Q \in \mathcal{P}$, such that $\gamma_k(P, Q) = 0$, which implies $k$ is characteristic to $\mathcal{P}$.

A.4 Proof of Corollary 5

Consider $\sum_{i} \alpha_i k(\cdot, x_i) = \int k(\cdot, x)\, d\mu_X(x) = \mu_X k$, where $\mu_X = \sum_{i} \alpha_i \delta_{x_i}$. Here $\delta_{x_i}$ represents the Dirac measure at $x_i$. So, we have $\mu_X k = \mu_Y k$, which is equivalent to
$$\int\!\!\int k(x, y)\, d(\mu_X - \mu_Y)(x)\, d(\mu_X - \mu_Y)(y) = 0. \qquad (17)$$
Since $k$ is strictly pd, by Lemma 16, we have $\mu_X k = \mu_Y k \Leftrightarrow \mu_X = \mu_Y \Leftrightarrow X = Y$.
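Lemma 16 and the proof of Corollary 5 turn on the sign of $\int\!\!\int k\, d\mu\, d\mu$ for signed measures with $\mu(M) = 0$. The following sketch evaluates this quadratic form for a discrete signed measure, contrasting the (strictly pd) Gaussian kernel with the linear kernel $k(x, y) = xy$, which is pd but not strictly pd; the example measures and names are ours, not from the paper.

```python
import math

def quad_form(kernel, points, weights):
    # integral integral k(x, y) dmu(x) dmu(y) for the discrete signed
    # measure mu = sum_i w_i * delta_{x_i}
    return sum(wi * wj * kernel(xi, xj)
               for xi, wi in zip(points, weights)
               for xj, wj in zip(points, weights))

gauss = lambda x, y: math.exp(-(x - y) ** 2 / 2)   # strictly pd
linear = lambda x, y: x * y                        # pd, but not strictly pd

# mu = P - Q with mu(M) = 0: P uniform on {-1, +1}, Q a point mass at 0.
# Both have mean 0, so the linear kernel cannot distinguish them.
points = [-1.0, 1.0, 0.0]
weights = [0.5, 0.5, -1.0]

print(quad_form(gauss, points, weights))   # strictly positive: P and Q separated
print(quad_form(linear, points, weights))  # zero: gamma_k(P, Q) = 0 although P != Q
```

For the linear kernel the quadratic form reduces to $(\sum_i w_i x_i)^2$, so any two distributions with equal first moments collapse to distance zero, exactly the failure mode that strict positive definiteness rules out.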
A.5 Proof of Theorem 7

Consider
$$\gamma(P_m, Q_n) - \gamma(P, Q) = \sup_{k \in \mathcal{K}} \|P_m k - Q_n k\|_{\mathcal{H}} - \sup_{k \in \mathcal{K}} \|Pk - Qk\|_{\mathcal{H}}$$
$$\le \sup_{k \in \mathcal{K}} \big[\|P_m k - Q_n k\|_{\mathcal{H}} - \|Pk - Qk\|_{\mathcal{H}}\big] \le \sup_{k \in \mathcal{K}} \|P_m k - Q_n k - Pk + Qk\|_{\mathcal{H}}$$
$$\le \sup_{k \in \mathcal{K}} \big[\|P_m k - Pk\|_{\mathcal{H}} + \|Q_n k - Qk\|_{\mathcal{H}}\big] \le \sup_{k \in \mathcal{K}} \|P_m k - Pk\|_{\mathcal{H}} + \sup_{k \in \mathcal{K}} \|Q_n k - Qk\|_{\mathcal{H}}. \qquad (18)$$

We now bound the terms $\sup_{k \in \mathcal{K}} \|P_m k - Pk\|_{\mathcal{H}}$ and $\sup_{k \in \mathcal{K}} \|Q_n k - Qk\|_{\mathcal{H}}$. Since $\sup_{k \in \mathcal{K}} \|P_m k - Pk\|_{\mathcal{H}}$ satisfies the bounded difference property, McDiarmid's inequality gives that, with probability at least $1 - \frac{\delta}{4}$ over the choice of $\{X_i\}$, we have
$$\sup_{k \in \mathcal{K}} \|P_m k - Pk\|_{\mathcal{H}} \le \mathbb{E} \sup_{k \in \mathcal{K}} \|P_m k - Pk\|_{\mathcal{H}} + \sqrt{\frac{2\nu}{m} \log\frac{4}{\delta}}. \qquad (19)$$
By invoking symmetrization for $\mathbb{E} \sup_{k \in \mathcal{K}} \|P_m k - Pk\|_{\mathcal{H}}$, we have
$$\mathbb{E} \sup_{k \in \mathcal{K}} \|P_m k - Pk\|_{\mathcal{H}} \le \frac{2}{m}\, \mathbb{E}\,\mathbb{E}_\rho \sup_{k \in \mathcal{K}} \Big\|\sum_{i=1}^m \rho_i k(\cdot, X_i)\Big\|_{\mathcal{H}}, \qquad (20)$$
where $\{\rho_i\}_{i=1}^m$ represent i.i.d. Rademacher random variables and $\mathbb{E}_\rho$ represents the expectation w.r.t. $\{\rho_i\}$ conditioned on $\{X_i\}$. Since $\mathbb{E}_\rho \sup_{k \in \mathcal{K}} \|\sum_{i=1}^m \rho_i k(\cdot, X_i)\|_{\mathcal{H}}$ satisfies the bounded difference property, by McDiarmid's inequality, with probability at least $1 - \frac{\delta}{4}$ over the choice of the random sample of size $m$, we have
$$\mathbb{E}\,\mathbb{E}_\rho \sup_{k \in \mathcal{K}} \Big\|\sum_{i=1}^m \rho_i k(\cdot, X_i)\Big\|_{\mathcal{H}} \le \mathbb{E}_\rho \sup_{k \in \mathcal{K}} \Big\|\sum_{i=1}^m \rho_i k(\cdot, X_i)\Big\|_{\mathcal{H}} + \sqrt{2\nu m \log\frac{4}{\delta}}. \qquad (21)$$
By writing
$$\Big\|\sum_{i=1}^m \rho_i k(\cdot, X_i)\Big\|_{\mathcal{H}} = \Big(\sum_{i,j=1}^m \rho_i \rho_j k(X_i, X_j)\Big)^{1/2} \le \Big(2\sum_{i<j} \rho_i \rho_j k(X_i, X_j) + \nu m\Big)^{1/2}, \qquad (22)$$
we have that, with probability at least $1 - \frac{\delta}{4}$, the following holds:
$$\mathbb{E}\,\mathbb{E}_\rho \sup_{k \in \mathcal{K}} \Big\|\sum_{i=1}^m \rho_i k(\cdot, X_i)\Big\|_{\mathcal{H}} \le \sqrt{2\, U_m(\mathcal{K}; \{X_i\})} + \sqrt{\nu m} + \sqrt{2\nu m \log\frac{4}{\delta}}. \qquad (23)$$
Tying (19)–(23) together, we have that, with probability at least $1 - \frac{\delta}{2}$ over the choice of $\{X_i\}$, the following holds:
$$\sup_{k \in \mathcal{K}} \|P_m k - Pk\|_{\mathcal{H}} \le \frac{\sqrt{8\, U_m(\mathcal{K}; \{X_i\})}}{m} + 2\sqrt{\frac{\nu}{m}} + \sqrt{\frac{18\nu}{m} \log\frac{4}{\delta}}. \qquad (24)$$
Performing a similar analysis for $\sup_{k \in \mathcal{K}} \|Q_n k - Qk\|_{\mathcal{H}}$, we have that, with probability at least $1 - \frac{\delta}{2}$ over the choice of $\{Y_i\}$,
$$\sup_{k \in \mathcal{K}} \|Q_n k - Qk\|_{\mathcal{H}} \le \frac{\sqrt{8\, U_n(\mathcal{K}; \{Y_i\})}}{n} + 2\sqrt{\frac{\nu}{n}} + \sqrt{\frac{18\nu}{n} \log\frac{4}{\delta}}. \qquad (25)$$
Using (24) and (25) with $\sqrt{a} + \sqrt{b} \le \sqrt{2(a + b)}$ provides the result.
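The Rademacher quantity controlled in (20)–(23) is directly computable for finite data, since $\|\sum_i \rho_i k(\cdot, X_i)\|_{\mathcal{H}}^2 = \sum_{i,j} \rho_i \rho_j k(X_i, X_j)$. The following Monte Carlo sketch estimates $\mathbb{E}_\rho \sup_{k \in \mathcal{K}} \|\sum_i \rho_i k(\cdot, X_i)\|_{\mathcal{H}}$ over a small Gaussian-bandwidth family; the setup (data, bandwidths, trial count) is our illustrative choice, and here $\nu = 1$ since the Gaussian kernel is bounded by 1.

```python
import math
import random

def rademacher_sup_norm(xs, kernels, trials=200, seed=0):
    # Monte Carlo estimate of E_rho sup_k || sum_i rho_i k(., X_i) ||_H,
    # using || sum_i rho_i k(., X_i) ||_H^2 = sum_{i,j} rho_i rho_j k(X_i, X_j)
    rng = random.Random(seed)
    m, total = len(xs), 0.0
    for _ in range(trials):
        rho = [rng.choice((-1.0, 1.0)) for _ in range(m)]
        best = 0.0
        for k in kernels:
            sq = sum(rho[i] * rho[j] * k(xs[i], xs[j])
                     for i in range(m) for j in range(m))
            best = max(best, sq)
        total += math.sqrt(max(best, 0.0))
    return total / trials

random.seed(3)
xs = [random.gauss(0.0, 1.0) for _ in range(30)]
kernels = [lambda a, b, s=s: math.exp(-(a - b) ** 2 / (2 * s * s))
           for s in (0.5, 1.0, 2.0)]
est = rademacher_sup_norm(xs, kernels)
print(est)
```

Enlarging the kernel family can only increase the per-trial supremum, which mirrors the role of $U_m(\mathcal{K}; \{X_i\})$ as a complexity penalty for richer families in (24).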
A.6 Proof of Lemma 8

From [2, Propositions 2.2 and 2.6] (see also [3, Corollary 5.1.8]), we have that there exists a universal constant $C < \infty$ such that
$$U_m(\mathcal{K}; \{X_i\}) \le C m \int_0^{\nu} \sqrt{\log N(\mathcal{K}, D, \epsilon)}\, d\epsilon,$$
where
$$D^2(k_1, k_2) = \frac{1}{m^2}\, \mathbb{E}_\rho \Big(\sum_{i<j} \rho_i \rho_j h(X_i, X_j)\Big)^2 = \frac{1}{m^2} \sum_{i<j,\, r<s} \mathbb{E}_\rho[\rho_i \rho_j \rho_r \rho_s]\, h(X_i, X_j)\, h(X_r, X_s) = \frac{1}{m^2} \sum_{i<j} h^2(X_i, X_j),$$
with $h(X_i, X_j) = k_1(X_i, X_j) - k_2(X_i, X_j)$.

A.7 Proof of Corollary 10

The result follows by bounding the uniform covering number of the VC-subgraph class $\mathcal{K}$. By [21, Theorem 2.6.7], we have
$$N(\mathcal{K}, D, \epsilon) \le C_1 V(\mathcal{K}) \Big(\frac{16 e \nu^2}{\epsilon^2}\Big)^{V(\mathcal{K})}, \qquad (26)$$
where $V(\mathcal{K})$ is the VC-index of $\mathcal{K}$. Therefore,
$$U_m(\mathcal{K}; \{X_i\}) \le C m \int_0^{\nu} \sqrt{\log N(\mathcal{K}, D, \epsilon)}\, d\epsilon \le C m \int_0^{\nu} \sqrt{4 V(\mathcal{K}) \log\frac{\nu}{\epsilon}}\, d\epsilon + C m \nu \sqrt{V(\mathcal{K}) \log(16 e)} + C m \nu \sqrt{\log(C_1 V(\mathcal{K}))}. \qquad (27)$$
Note that $\int_0^{\nu} \sqrt{\log(\nu/\epsilon)}\, d\epsilon \le 2\nu$. Using this in (27) and rearranging the terms provides the result.

References

[1] M. Anthony and P. L. Bartlett. Neural Network Learning: Theoretical Foundations. Cambridge University Press, Cambridge, UK, 1999.
[2] M. A. Arcones and E. Giné. Limit theorems for U-processes. Annals of Probability, 21:1494–1542, 1993.
[3] V. H. de la Peña and E. Giné. Decoupling: From Dependence to Independence. Springer-Verlag, New York, 1999.
[4] L. Devroye, L. Györfi, and G. Lugosi. A Probabilistic Theory of Pattern Recognition. Springer-Verlag, New York, 1996.
[5] R. M. Dudley. Real Analysis and Probability. Cambridge University Press, Cambridge, UK, 2002.
[6] K. Fukumizu, A. Gretton, X. Sun, and B. Schölkopf. Kernel measures of conditional dependence. In J. C. Platt, D. Koller, Y. Singer, and S. Roweis, editors, Advances in Neural Information Processing Systems 20, pages 489–496, Cambridge, MA, 2008. MIT Press.
[7] K. Fukumizu, B. K. Sriperumbudur, A. Gretton, and B. Schölkopf. Characteristic kernels on groups and semigroups. In D. Koller, D. Schuurmans, Y. Bengio, and L. Bottou, editors, Advances in Neural Information Processing Systems 21, pages 473–480, 2009.
[8] A. Gretton, K. M. Borgwardt, M. Rasch, B. Schölkopf, and A. Smola. A kernel method for the two sample problem. In B. Schölkopf, J. Platt, and T. Hoffman, editors, Advances in Neural Information Processing Systems 19, pages 513–520. MIT Press, 2007.
[9] A. Gretton, K. Fukumizu, C.-H. Teo, L. Song, B. Schölkopf, and A. Smola. A kernel statistical test of independence. In Advances in Neural Information Processing Systems 20, pages 585–592. MIT Press, 2008.
[10] G. R. G. Lanckriet, N. Cristianini, P. Bartlett, L. El Ghaoui, and M. I. Jordan. Learning the kernel matrix with semidefinite programming. Journal of Machine Learning Research, 5:27–72, 2004.
[11] R. T. Rockafellar. Convex Analysis. Princeton University Press, Princeton, NJ, 1970.
[12] B. Schölkopf and A. J. Smola. Learning with Kernels. MIT Press, Cambridge, MA, 2002.
[13] B. Schölkopf, B. K. Sriperumbudur, A. Gretton, and K. Fukumizu. RKHS representation of measures. In Learning Theory and Approximation Workshop, Oberwolfach, Germany, 2008.
[14] J. Shawe-Taylor and N. Cristianini. Kernel Methods for Pattern Analysis. Cambridge University Press, Cambridge, UK, 2004.
[15] A. J. Smola, A. Gretton, L. Song, and B. Schölkopf. A Hilbert space embedding for distributions. In Proc. 18th International Conference on Algorithmic Learning Theory, pages 13–31. Springer-Verlag, Berlin, Germany, 2007.
[16] N. Srebro and S. Ben-David. Learning bounds for support vector machines with learned kernels. In G. Lugosi and H. U. Simon, editors, Proc. of the 19th Annual Conference on Learning Theory, pages 169–183, 2006.
[17] B. K. Sriperumbudur, A. Gretton, K. Fukumizu, G. R. G. Lanckriet, and B. Schölkopf. Injective Hilbert space embeddings of probability measures. In R. Servedio and T. Zhang, editors, Proc. of the 21st Annual Conference on Learning Theory, pages 111–122, 2008.
[18] I. Steinwart. On the influence of the kernel on the consistency of support vector machines. Journal of Machine Learning Research, 2:67–93, 2002.
[19] I. Steinwart and A. Christmann. Support Vector Machines. Springer, 2008.
[20] J. Stewart. Positive definite functions and generalizations, an historical survey. Rocky Mountain Journal of Mathematics, 6(3):409–434, 1976.
[21] A. W. van der Vaart and J. A. Wellner. Weak Convergence and Empirical Processes. Springer-Verlag, New York, 1996.
[22] Y. Ying and C. Campbell. Generalization bounds for learning the kernel. In Proc. of the 22nd Annual Conference on Learning Theory, 2009.
[23] Y. Ying and D. X. Zhou. Learnability of Gaussians with flexible variances. Journal of Machine Learning Research, 8:249–276, 2007.
City University of New York (CUNY) CUNY Acadeic Works International Conference on Hydroinforatics 8-1-2014 Experiental Design For Model Discriination And Precise Paraeter Estiation In WDS Analysis Giovanna
More informationNonmonotonic Networks. a. IRST, I Povo (Trento) Italy, b. Univ. of Trento, Physics Dept., I Povo (Trento) Italy
Storage Capacity and Dynaics of Nononotonic Networks Bruno Crespi a and Ignazio Lazzizzera b a. IRST, I-38050 Povo (Trento) Italy, b. Univ. of Trento, Physics Dept., I-38050 Povo (Trento) Italy INFN Gruppo
More informationStability Bounds for Non-i.i.d. Processes
tability Bounds for Non-i.i.d. Processes Mehryar Mohri Courant Institute of Matheatical ciences and Google Research 25 Mercer treet New York, NY 002 ohri@cis.nyu.edu Afshin Rostaiadeh Departent of Coputer
More informationarxiv: v1 [cs.ds] 3 Feb 2014
arxiv:40.043v [cs.ds] 3 Feb 04 A Bound on the Expected Optiality of Rando Feasible Solutions to Cobinatorial Optiization Probles Evan A. Sultani The Johns Hopins University APL evan@sultani.co http://www.sultani.co/
More information3.3 Variational Characterization of Singular Values
3.3. Variational Characterization of Singular Values 61 3.3 Variational Characterization of Singular Values Since the singular values are square roots of the eigenvalues of the Heritian atrices A A and
More informationESTIMATING AND FORMING CONFIDENCE INTERVALS FOR EXTREMA OF RANDOM POLYNOMIALS. A Thesis. Presented to. The Faculty of the Department of Mathematics
ESTIMATING AND FORMING CONFIDENCE INTERVALS FOR EXTREMA OF RANDOM POLYNOMIALS A Thesis Presented to The Faculty of the Departent of Matheatics San Jose State University In Partial Fulfillent of the Requireents
More informationNew upper bound for the B-spline basis condition number II. K. Scherer. Institut fur Angewandte Mathematik, Universitat Bonn, Bonn, Germany.
New upper bound for the B-spline basis condition nuber II. A proof of de Boor's 2 -conjecture K. Scherer Institut fur Angewandte Matheati, Universitat Bonn, 535 Bonn, Gerany and A. Yu. Shadrin Coputing
More informationBootstrapping Dependent Data
Bootstrapping Dependent Data One of the key issues confronting bootstrap resapling approxiations is how to deal with dependent data. Consider a sequence fx t g n t= of dependent rando variables. Clearly
More information