Kernel Choice and Classifiability for RKHS Embeddings of Probability Distributions
Bharath K. Sriperumbudur, Department of ECE, UC San Diego, La Jolla, USA
Kenji Fukumizu, The Institute of Statistical Mathematics, Tokyo, Japan
Arthur Gretton, Carnegie Mellon University and MPI for Biological Cybernetics
Gert R. G. Lanckriet, Department of ECE, UC San Diego, La Jolla, USA
Bernhard Schölkopf, MPI for Biological Cybernetics, Tübingen, Germany

Abstract

Embeddings of probability measures into reproducing kernel Hilbert spaces have been proposed as a straightforward and practical means of representing and comparing probabilities. In particular, the distance between embeddings (the maximum mean discrepancy, or MMD) has several key advantages over many classical metrics on distributions, namely easy computability, fast convergence and low bias of finite sample estimates. An important requirement of the embedding RKHS is that it be characteristic: in this case, the MMD between two distributions is zero if and only if the distributions coincide. Three new results on the MMD are introduced in the present study. First, it is established that MMD corresponds to the optimal risk of a kernel classifier, thus forming a natural link between the distance between distributions and their ease of classification. An important consequence is that a kernel must be characteristic to guarantee classifiability between distributions in the RKHS. Second, the class of characteristic kernels is broadened to incorporate all strictly positive definite kernels: these include non-translation invariant kernels and kernels on non-compact domains. Third, a generalization of the MMD is proposed for families of kernels, as the supremum over MMDs on a class of kernels (for instance the Gaussian kernels with different bandwidths). This extension is necessary to obtain a single distance measure if a large selection or class of characteristic kernels is potentially appropriate.
This generalization is reasonable, given that it corresponds to the problem of learning the kernel by minimizing the risk of the corresponding kernel classifier. The generalized MMD is shown to have consistent finite sample estimates, and its performance is demonstrated on a homogeneity testing example.

1 Introduction

Kernel methods are broadly established as a useful way of constructing nonlinear algorithms from linear ones, by embedding points into higher dimensional reproducing kernel Hilbert spaces (RKHSs) [2]. A generalization of this idea is to embed probability distributions into RKHSs, giving us a linear method for dealing with higher order statistics [8, 5, 7]. More specifically, suppose we are given the set P of all Borel probability measures defined on the topological space M, and the RKHS (H, k) of functions on M with k as its reproducing kernel (r.k.). For P ∈ P, denote by Pk := ∫_M k(·, x) dP(x). If k is measurable and bounded, then we may define the embedding of P in H as Pk ∈ H. The RKHS distance between two such mappings associated with P, Q ∈ P is called the maximum mean discrepancy (MMD) [8, 7], and is written

γ_k(P, Q) = ‖Pk − Qk‖_H.  (1)

We say that k is characteristic [6, 7] if the mapping P ↦ Pk is injective, in which case (1) is zero if and only if P = Q, i.e., γ_k is a metric on P. An immediate application of the MMD is to problems of comparing distributions based on finite samples: examples include tests of homogeneity [8], independence [9], and conditional independence [6]. In this application domain, the question of whether k is characteristic is key: without this property, the algorithms can fail through inability to distinguish between particular distributions.

Characteristic kernels are important in binary classification: the problem of distinguishing distributions is strongly related to binary classification, since one would expect easily distinguishable distributions to be easily classifiable. The link between these two problems is especially direct in the case of the MMD: in Section 2, we show that γ_k is the negative of the optimal risk (corresponding to a linear loss function) associated with the Parzen window classifier [2, 4] (also called the kernel classification rule [4, Chapter 10]), where the Parzen window turns out to be k. We also show that γ_k is an upper bound on the margin of a hard-margin support vector machine (SVM). The importance of using characteristic RKHSs is further underlined by this link: if the property does not hold, then there exist distributions that are unclassifiable in the RKHS H.
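The empirical counterpart of (1) is straightforward to compute from samples. The following is a minimal sketch, not taken from the paper: the Gaussian kernel, the bandwidth parameter, and the function name `mmd2` are all choices made here for illustration.

```python
import numpy as np

def mmd2(X, Y, sigma=1.0):
    """Biased empirical estimate of the squared MMD, gamma_k(P_m, Q_n)^2,
    for the Gaussian kernel k(x, y) = exp(-sigma * ||x - y||^2).
    X: (m, d) sample from P; Y: (n, d) sample from Q."""
    def gram(A, B):
        # Pairwise squared Euclidean distances, then the kernel matrix.
        sq = np.sum(A**2, 1)[:, None] + np.sum(B**2, 1)[None, :] - 2 * A @ B.T
        return np.exp(-sigma * np.maximum(sq, 0.0))
    m, n = len(X), len(Y)
    # ||P_m k - Q_n k||_H^2 expanded into three Gram-matrix averages.
    return (gram(X, X).sum() / m**2
            + gram(Y, Y).sum() / n**2
            - 2 * gram(X, Y).sum() / (m * n))
```

Being a squared RKHS norm, this estimate is always nonnegative, and it is exactly zero when the two samples coincide.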
We further strengthen this by showing that characteristic kernels are necessary (and sufficient under certain conditions) to achieve the Bayes risk in kernel-based classification algorithms.

Characterization of characteristic kernels: Given the centrality of the characteristic property to both RKHS classification and RKHS distribution testing, we should take particular care in establishing which kernels satisfy this requirement. Early results in this direction include [8], where k is shown to be characteristic on compact M if it is universal in the sense of Steinwart [8, Definition 4]; and [6, 7], which address the case of non-compact M, and show that k is characteristic if and only if H + R is dense in the Banach space of p-power (p ≥ 1) integrable functions. The conditions in both these studies can be difficult to check and interpret, however, and the restriction of the first to compact M is limiting. In the case of translation invariant kernels, [7] proved the kernel to be characteristic if and only if the support of the Fourier transform of k is the entire R^d, which is a much easier condition to verify. Similar sufficient conditions are obtained by [7] for translation invariant kernels on groups and semi-groups. In Section 3, we expand the class of characteristic kernels to include kernels that may or may not be translation invariant, with the introduction of a novel criterion: strictly positive definite kernels (see Definition 3) on M are characteristic.

Choice of characteristic kernels: In expanding the families of allowable characteristic kernels, we have so far neglected the question of which characteristic kernel to choose. A practitioner asking by how much two samples differ does not want to receive a blizzard of answers for every conceivable kernel and bandwidth setting, but a single measure that satisfies some reasonable notion of distance across the family of kernels considered.
Thus, in Section 4, we propose a generalization of the MMD, yielding a new distance measure between P and Q defined as

γ(P, Q) = sup{γ_k(P, Q) : k ∈ K} = sup{‖Pk − Qk‖_H : k ∈ K},  (2)

which is the maximal RKHS distance between P and Q over a family, K, of positive definite kernels. For example, K can be the family of Gaussian kernels on R^d indexed by the bandwidth parameter. This distance measure is very natural in the light of our results on binary classification (in Section 2): most directly, this corresponds to the problem of learning the kernel by minimizing the risk of the associated Parzen-based classifier. As a less direct justification, we also increase the upper bound on the margin allowed for a hard margin SVM between the samples. To apply the generalized MMD in practice, we must ensure its empirical estimator is consistent. In our main result of Section 4, we provide an empirical estimate of γ(P, Q) based on finite samples, and show that many popular kernels like the Gaussian, Laplacian, and the entire Matérn class on R^d yield consistent estimates of γ(P, Q).¹ The proof is based on bounding the Rademacher chaos complexity of K, which can be understood as the U-process equivalent of Rademacher complexity [3]. Finally, in Section 5, we provide a simple experimental demonstration that the generalized MMD can be applied in practice to the problem of homogeneity testing. Specifically, we show that when two distributions differ on particular length scales, the kernel selected by the generalized MMD is appropriate to this difference, and the resulting hypothesis test outperforms the heuristic kernel choice employed in earlier studies [8]. The proofs of the results in Sections 2-4 are provided in the appendix.

¹ There is a subtlety here, since unlike the problem of testing for differences in distributions, classification suffers from slow learning rates. See [4, Chapter 7] for details.

2 Characteristic Kernels and Binary Classification

One of the most important applications of the maximum mean discrepancy is in nonparametric hypothesis testing [8, 9, 6], where the characteristic property of k is required to distinguish between probability measures. In the following, we show how MMD naturally appears in binary classification, with reference to the Parzen window classifier and the hard-margin SVM. This motivates the need for characteristic k to guarantee that classes arising from different distributions can be classified by kernel-based algorithms. To this end, let us consider the binary classification problem with X being an M-valued random variable, Y being a {−1, +1}-valued random variable and the product space, M × {−1, +1}, being endowed with an induced Borel probability measure µ. A discriminant function, f, is a real valued measurable function on M, whose sign is used to make a classification decision. Given a loss function L : {−1, +1} × R → R, the goal is to choose an f that minimizes the risk associated with L, with the optimal L-risk being defined as

R^L_F = inf_{f∈F} ∫ L(y, f(x)) dµ(x, y) = inf_{f∈F} { ε ∫ L_1(f) dP + (1 − ε) ∫ L_{−1}(f) dQ },  (3)

where F is the set of all measurable functions on M, L_1(α) := L(1, α), L_{−1}(α) := L(−1, α), P(X) := µ(X | Y = +1), Q(X) := µ(X | Y = −1), and ε := µ(M, Y = +1).
Here, P and Q represent the class-conditional distributions and ε is the prior probability of class +1. Now, we present the result that relates γ_k to the optimal risk associated with the Parzen window classifier.

Theorem 1 (γ_k and Parzen classification). Let L_1(α) = −α/ε and L_{−1}(α) = α/(1 − ε). Then, γ_k(P, Q) = −R^L_{F_k}, where F_k = {f : ‖f‖_H ≤ 1} and H is an RKHS with a measurable and bounded k. Suppose {(X_i, Y_i)}_{i=1}^N, X_i ∈ M, Y_i ∈ {−1, +1}, ∀i, is a training sample drawn i.i.d. from µ, and let m = |{i : Y_i = +1}| and n = N − m. If f̂ ∈ F_k is an empirical minimizer of (3) (where F is replaced by F_k in (3)), then

sign(f̂(x)) = { +1, if (1/m) Σ_{Y_i=+1} k(x, X_i) > (1/n) Σ_{Y_i=−1} k(x, X_i); −1, otherwise,  (4)

which is the Parzen window classifier.

Theorem 1 shows that γ_k is the negative of the optimal L-risk (where L is the linear loss as defined in Theorem 1) associated with the Parzen window classifier. Therefore, if k is not characteristic, which means γ_k(P, Q) = 0 for some P ≠ Q, then R^L_{F_k} = 0, i.e., the risk is maximum (note that since γ_k(P, Q) = −R^L_{F_k} ≥ 0, the maximum risk is zero). In other words, if k is characteristic, then the maximum risk is obtained only when P = Q. This motivates the importance of characteristic kernels in binary classification. In the following, we provide another result that gives a similar motivation for the importance of characteristic kernels in binary classification, wherein we relate γ_k to the margin of a hard-margin SVM.

Theorem 2 (γ_k and hard-margin SVM). Suppose {(X_i, Y_i)}_{i=1}^N, X_i ∈ M, Y_i ∈ {−1, +1}, ∀i, is a training sample drawn i.i.d. from µ. Assuming the training sample is separable, let f_svm be the solution to the program inf{‖f‖_H : Y_i f(X_i) ≥ 1, ∀i}, where H is an RKHS with measurable and bounded k. If k is characteristic, then

1/‖f_svm‖_H ≤ γ_k(P_m, Q_n)/2,  (5)

where P_m := (1/m) Σ_{Y_i=+1} δ_{X_i}, Q_n := (1/n) Σ_{Y_i=−1} δ_{X_i}, m = |{i : Y_i = +1}| and n = N − m. δ_x represents the Dirac measure at x.
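As an illustration of the Parzen window rule in (4), a direct implementation might look as follows. This is a sketch, not code from the paper: the Gaussian kernel and the function name `parzen_classify` are assumptions made here.

```python
import numpy as np

def parzen_classify(x, X, labels, sigma=1.0):
    """Parzen window (kernel) classification rule of eq. (4):
    predict +1 iff the average kernel similarity of x to the positive
    training points exceeds that to the negative ones.
    X: list/array of training points; labels: entries in {-1, +1}."""
    k = lambda a, b: np.exp(-sigma * np.sum((a - b) ** 2))  # assumed kernel
    pos = [Xi for Xi, yi in zip(X, labels) if yi == +1]
    neg = [Xi for Xi, yi in zip(X, labels) if yi == -1]
    s_pos = sum(k(x, Xi) for Xi in pos) / len(pos)  # similarity to class +1
    s_neg = sum(k(x, Xi) for Xi in neg) / len(neg)  # similarity to class -1
    return 1 if s_pos > s_neg else -1
```

Note that this is exactly a comparison of the test point's kernel evaluation against the two empirical mean embeddings, which is the "mean classifier" view taken up again in Section 3.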
Theorem 2 provides a bound on the margin of the hard-margin SVM in terms of MMD. (5) shows that a smaller MMD between P_m and Q_n enforces a smaller margin (i.e., a less smooth classifier, f_svm, with smoothness measured as ‖f_svm‖_H). We can observe that the bound in (5) may be loose if the number of support vectors is small. Suppose k is not characteristic; then γ_k(P_m, Q_n) can be zero for P_m ≠ Q_n and therefore the margin is zero, which means even unlike distributions can become inseparable in this feature representation. Another justification for using characteristic kernels in kernel-based classification algorithms can be provided by studying the conditions on H for which the Bayes risk is realized for all µ. Steinwart and Christmann [9, Corollary 5.37] have shown that under certain conditions on L, the Bayes risk is achieved for all µ if and only if H is dense in L_p(M, η) for all η, where η = εP + (1 − ε)Q. Here, L_p(M, η) represents the Banach space of p-power integrable functions, where p ∈ [1, ∞) depends on the loss function, L. Denseness of H in L_p(M, η) implies H + R is dense in L_p(M, η), which therefore yields that k is characteristic [6, 7]. On the other hand, if constant functions are included in H, then it is easy to show that the characteristic property of k is also sufficient to achieve the Bayes risk. As an example, it can be shown that characteristic kernels are necessary (and sufficient if constant functions are in H) for SVMs to achieve the Bayes risk [9, Example 5.4]. Therefore, the characteristic property of k is fundamental in kernel-based classification algorithms. Having shown how characteristic kernels play a role in kernel-based classification, in the following section we provide a novel characterization for them.

3 Novel Characterization of Characteristic Kernels

A positive definite (pd) kernel, k, is said to be characteristic to P if and only if γ_k(P, Q) = 0 ⟺ P = Q, for all P, Q ∈ P.
The following result provides a novel characterization of characteristic kernels, which shows that strictly pd kernels are characteristic to P. An advantage of this characterization is that it holds for an arbitrary topological space M, unlike the earlier characterizations where a group structure on M is assumed [7, 17]. First, we define strictly pd kernels as follows.

Definition 3 (Strictly positive definite kernels). Let M be a topological space. A measurable and bounded kernel, k, is said to be strictly positive definite if and only if ∫∫_M k(x, y) dµ(x) dµ(y) > 0 for all finite non-zero signed Borel measures, µ, defined on M.

Note that the above definition is not equivalent to the usual definition of strictly pd kernels that involves finite sums [9, Definition 4.5]. The above definition is a generalization of integrally strictly positive definite functions [2, Section 6]: ∫∫ k(x, y) f(x) f(y) dx dy > 0 for all f ∈ L_2(R^d), which is the strict positive definiteness of the integral operator given by the kernel. Definition 3 is stronger than the finite sum definition, as [9, Theorem 4.62] shows a kernel that is strictly pd in the finite sum sense but not in the integral sense.

Theorem 4 (Strictly pd kernels are characteristic). If k is strictly positive definite on M, then k is characteristic to P.

The proof idea is to derive necessary and sufficient conditions for a kernel not to be characteristic. We show that choosing k to be strictly pd violates these conditions, and k is therefore characteristic to P. Examples of strictly pd kernels on R^d include exp(−σ‖x − y‖_2²), σ > 0; exp(−σ‖x − y‖_1), σ > 0; (c² + ‖x − y‖_2²)^{−β}, β > 0, c > 0; B_{2l+1}-splines, etc. Note that k̃(x, y) = f(x)k(x, y)f(y) is a strictly pd kernel if k is strictly pd, where f : M → R is a bounded continuous function. Therefore, translation-variant strictly pd kernels can be obtained by choosing k to be a translation invariant strictly pd kernel.
A simple example of a translation-variant kernel that is strictly pd on compact sets of R^d is k̃(x, y) = exp(σ xᵀy), σ > 0, where we have chosen f(·) = exp(σ‖·‖_2²/2) and k(x, y) = exp(−σ‖x − y‖_2²/2), σ > 0. Therefore, k̃ is characteristic on compact sets of R^d, which is the same result that follows from the universality of k̃ [8, Section 3, Example 1]. The following result in [3], which is based on the usual definition of strictly pd kernels, can be obtained as a corollary to Theorem 4.

Corollary 5 ([3]). Let X = {x_i}_{i=1}^m, Y = {y_j}_{j=1}^n and assume that x_i ≠ x_j, y_i ≠ y_j, ∀i ≠ j. Suppose k is strictly positive definite. Then Σ_{i=1}^m α_i k(·, x_i) = Σ_{j=1}^n β_j k(·, y_j) for some α_i, β_j ∈ R\{0} if and only if X = Y.

Suppose we choose α_i = 1/m, ∀i, and β_j = 1/n, ∀j, in Corollary 5. Then Σ_{i=1}^m α_i k(·, x_i) and Σ_{j=1}^n β_j k(·, y_j) represent the mean functions in H. Note that the Parzen classifier in (4) is a mean classifier (one that separates the mean functions) in H, i.e., sign(⟨k(·, x), w⟩_H), where w = (1/m) Σ_{i=1}^m k(·, x_i) − (1/n) Σ_{i=1}^n k(·, y_i). Suppose k is strictly pd (more generally, suppose k is characteristic). Then, by Corollary 5, the normal vector, w, to the hyperplane in H passing through the origin is zero, i.e., the mean functions coincide (and are therefore not classifiable), if and only if X = Y.

4 Generalizing the MMD for Classes of Characteristic Kernels

The discussion so far has been related to the characteristic property of k that makes γ_k a metric on P. We have seen that this characteristic property is of prime importance both in distribution testing, and to ensure classifiability of dissimilar distributions in the RKHS. We have not yet addressed how to choose among a selection/family of characteristic kernels, given a particular pair of distributions we wish to discriminate between. We introduce one approach to this problem in the present section. Let M = R^d and k_σ(x, y) = exp(−σ‖x − y‖_2²), σ ∈ R_+, where σ represents the bandwidth parameter. {k_σ : σ ∈ R_+} is the family of Gaussian kernels and {γ_{k_σ} : σ ∈ R_+} is the family of MMDs indexed by the kernel parameter, σ. Note that k_σ is characteristic for any σ ∈ R_{++} and therefore γ_{k_σ} is a metric on P for any σ ∈ R_{++}. However, in practice, one would prefer a single number that defines the distance between P and Q. The question therefore to be addressed is how to choose an appropriate σ. The choice of σ has important implications for the statistical behavior of γ_{k_σ}. Note that as σ → 0, k_σ → 1 and as σ → ∞, k_σ → 0 a.e., which means γ_{k_σ}(P, Q) → 0 as σ → 0 or σ → ∞ for all P, Q ∈ P (this behavior is also exhibited by k_σ(x, y) = exp(−σ‖x − y‖_1) and k_σ(x, y) = σ²/(σ² + ‖x − y‖_2²), which are also characteristic). This means choosing a sufficiently small or sufficiently large σ (depending on P and Q) makes γ_{k_σ}(P, Q) arbitrarily small. Therefore, σ has to be chosen appropriately in applications to effectively distinguish between P and Q.
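This degeneracy of γ_{k_σ} at extreme bandwidths can be checked numerically. The sketch below is an illustration, not code from the paper: it uses the unbiased U-statistic estimate of γ²_{k_σ} (an implementation choice made here, since the population quantity is what vanishes; the biased estimate retains diagonal Gram terms of order 1/m + 1/n at large σ).

```python
import numpy as np

def mmd2_unbiased(X, Y, sigma):
    """Unbiased estimate of gamma_{k_sigma}(P, Q)^2 with the Gaussian
    kernel k_sigma(x, y) = exp(-sigma * ||x - y||^2); diagonal Gram
    entries are excluded from the within-sample averages."""
    def gram(A, B):
        sq = np.sum(A**2, 1)[:, None] + np.sum(B**2, 1)[None, :] - 2 * A @ B.T
        return np.exp(-sigma * np.maximum(sq, 0.0))
    m, n = len(X), len(Y)
    Kxx, Kyy = gram(X, X), gram(Y, Y)
    return ((Kxx.sum() - np.trace(Kxx)) / (m * (m - 1))
            + (Kyy.sum() - np.trace(Kyy)) / (n * (n - 1))
            - 2 * gram(X, Y).sum() / (m * n))
```

For two fixed samples from distinct distributions, evaluating this estimate at a very small and at a very large σ yields values near zero, while an intermediate σ gives a clearly positive value, matching the behavior described above.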
Presently, the applications involving MMD set σ heuristically [8, 9]. To generalize the MMD to families of kernels, we propose the following modification to γ_k, which yields a pseudometric on P,

γ(P, Q) = sup{γ_k(P, Q) : k ∈ K} = sup{‖Pk − Qk‖_H : k ∈ K}.  (6)

Note that γ is the maximal RKHS distance between P and Q over a family, K, of positive definite kernels. It is easy to check that if any k ∈ K is characteristic, then γ is a metric on P. Examples for K include: K_g := {e^{−σ‖x−y‖_2²}, x, y ∈ R^d : σ ∈ R_+}; K_l := {e^{−σ‖x−y‖_1}, x, y ∈ R^d : σ ∈ R_+}; K_ψ := {e^{−σψ(x,y)}, x, y ∈ M : σ ∈ R_+}, where ψ : M × M → R is a negative definite kernel; K_rbf := {∫_0^∞ e^{−λ‖x−y‖_2²} dµ_σ(λ), x, y ∈ R^d, µ_σ ∈ M_+ : σ ∈ Σ ⊂ R^d}, where M_+ is the set of all finite nonnegative Borel measures, µ_σ, on R_+ that are not concentrated at zero, etc.

The proposal of γ(P, Q) in (6) can be motivated by the connection that we have established in Section 2 between γ_k and the Parzen window classifier. Since the Parzen window classifier depends on the kernel, k, one can propose to learn the kernel as is done for support vector machines [1], wherein the kernel is chosen such that R^L_{F_k} in Theorem 1 is minimized over k ∈ K, i.e., inf_{k∈K} R^L_{F_k} = −sup_{k∈K} γ_k(P, Q) = −γ(P, Q). A similar motivation for γ can be provided based on (5), as learning the kernel in a hard-margin SVM by maximizing its margin. At this point, we briefly discuss the issue of normalized vs. unnormalized kernel families, K, in (6). We say a translation-invariant kernel, k, on R^d is normalized if ∫_{R^d} ψ(y) dy = c (some positive constant independent of the kernel parameter), where k(x, y) = ψ(x − y). K is a normalized kernel family if every kernel in K is normalized; if K is not normalized, we say it is unnormalized. For example, it is easy to see that K_g and K_l are unnormalized kernel families. Let us consider the normalized Gaussian family, K_g^n = {(πσ)^{−d/2} e^{−‖x−y‖_2²/σ}, x, y ∈ R^d : σ ∈ [σ_0, ∞)}. It can be shown that for any k_σ, k_τ ∈ K_g^n, 0 < σ_0 ≤ σ < τ < ∞, we have γ_{k_σ}(P, Q) ≥ γ_{k_τ}(P, Q), which means γ(P, Q) = γ_{k_{σ_0}}(P, Q).
Therefore, the generalized MMD reduces to a single kernel MMD. A similar result also holds for the normalized inverse-quadratic kernel family on R, e.g., {σ/(π(σ² + (x − y)²)), x, y ∈ R : σ ∈ [σ_0, ∞)}. These examples show that the generalized MMD definition is usually not very useful if K is a normalized kernel family. In addition, σ_0 should be chosen beforehand, which is equivalent to heuristically setting the kernel parameter in γ_k. Note that σ_0 cannot be zero because in the limiting case of σ → 0, the kernels approach a Dirac distribution, which means the limiting kernel is not bounded and therefore the definition of MMD in (1) does not hold. So, in this work, we consider unnormalized kernel families to render the definition of the generalized MMD in (6) useful.
To use γ in statistical applications where P and Q are known only through i.i.d. samples {X_i}_{i=1}^m and {Y_j}_{j=1}^n respectively, we require its estimator γ(P_m, Q_n) to be consistent, where P_m and Q_n represent the empirical measures based on {X_i}_{i=1}^m and {Y_j}_{j=1}^n. For k measurable and bounded, [8, 5] have shown that γ_k(P_m, Q_n) is a √(mn/(m + n))-consistent estimator of γ_k(P, Q). The statistical consistency of γ(P_m, Q_n) is established in the following theorem, which uses tools from U-process theory [3, Chapters 3, 5]. We begin with the following definition.

Definition 6 (Rademacher chaos). Let G be a class of functions on M × M and {ρ_i}_{i=1}^n be independent Rademacher random variables, i.e., Pr(ρ_i = +1) = Pr(ρ_i = −1) = 1/2. The homogeneous Rademacher chaos process of order two with respect to {ρ_i}_{i=1}^n is defined as {(1/n) Σ_{i<j} ρ_i ρ_j g(x_i, x_j) : g ∈ G} for some {x_i}_{i=1}^n ⊂ M. The Rademacher chaos complexity over G is defined as

U_n(G; {x_i}_{i=1}^n) := E_ρ sup_{g∈G} | (1/n) Σ_{i<j} ρ_i ρ_j g(x_i, x_j) |.  (7)

We now provide the main result of the present section.

Theorem 7 (Consistency of γ(P_m, Q_n)). Let every k ∈ K be measurable and bounded with ν := sup_{k∈K, x∈M} k(x, x) < ∞. Then, with probability at least 1 − δ, |γ(P_m, Q_n) − γ(P, Q)| ≤ A, where

A = √(6 U_m(K; {X_i})/m) + √(6 U_n(K; {Y_i})/n) + (√(8ν) + √(36ν log(4/δ))) √((m + n)/(mn)).  (8)

From (8), it is clear that if U_m(K; {X_i}) = O_P(1) and U_n(K; {Y_i}) = O_Q(1), then γ(P_m, Q_n) → γ(P, Q) a.s. The following result provides a bound on U_m(K; {X_i}) in terms of the entropy integral.

Lemma 8 (Entropy bound). For any K as in Theorem 7 with 0 ∈ K, there exists a universal constant C such that

U_m(K; {X_i}_{i=1}^m) ≤ C ∫_0^ν log N(K, D, ε) dε,  (9)

where D(k_1, k_2) = (1/m) [Σ_{i<j} (k_1(X_i, X_j) − k_2(X_i, X_j))²]^{1/2} and N(K, D, ε) represents the ε-covering number of K with respect to the metric D.

Assuming K to be a VC-subgraph class, the following result, as a corollary to Lemma 8, provides an estimate of U_m(K; {X_i}_{i=1}^m). Before presenting the result, we first provide the definition of a VC-subgraph class.
Definition 9 (VC-subgraph class). The subgraph of a function g : M → R is the subset of M × R given by {(x, t) : t < g(x)}. A collection G of measurable functions on a sample space is called a VC-subgraph class if the collection of all subgraphs of the functions in G forms a VC-class of sets (in M × R). The VC-index (also called the VC-dimension) of a VC-subgraph class, G, is the same as the pseudodimension of G. See [1] for details.

Corollary 10 (U_m(K; {X_i}) for a VC-subgraph class K). Suppose K is a VC-subgraph class with V(K) being the VC-index. Assume K satisfies the conditions in Theorem 7 and 0 ∈ K. Then

U_m(K; {X_i}) ≤ C ν log(C_1 V(K)(6e^9)^{V(K)}),  (10)

for some universal constants C and C_1.

Using (10) in (8), we have γ(P_m, Q_n) − γ(P, Q) = O_{P,Q}(√((m + n)/(mn))) and, by the Borel-Cantelli lemma, |γ(P_m, Q_n) − γ(P, Q)| → 0 a.s. Now, the question reduces to which of the kernel classes, K, have V(K) < ∞. [22, Lemma 2] showed that V(K_g) = 1 (also see [23]) and U_m(K_rbf) ≤ C_2 U_m(K_g), where C_2 < ∞. It can be shown that V(K_ψ) = 1 and V(K_l) = 1. All these classes satisfy the conditions of Theorem 7 and Corollary 10 and therefore provide consistent estimates of γ(P, Q) for any P, Q ∈ P. Examples of kernels on R^d that are covered by these classes include the Gaussian, Laplacian, inverse multiquadratics, Matérn class, etc. Other choices for K that are popular in machine learning are linear combinations of kernels, K_lin := {k_λ = Σ_{i=1}^l λ_i k_i : k_λ is pd, Σ_{i=1}^l λ_i = 1} and K_con := {k_λ = Σ_{i=1}^l λ_i k_i : λ_i ≥ 0, Σ_{i=1}^l λ_i = 1}. [6, Lemma 7] have shown that V(K_con) ≤ V(K_lin) ≤ l. Therefore, instead of using a class based on a fixed, parameterized kernel, one can also use a finite linear combination of kernels to compute γ.
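When K is replaced by a finite collection of candidate kernels, the supremum in (6) reduces to a simple maximum over single-kernel MMD estimates. A minimal sketch follows; the finite Gaussian bandwidth grid and all function names are assumptions of this illustration, not part of the paper.

```python
import numpy as np

def mmd2_biased(X, Y, k):
    """Biased estimate of gamma_k(P_m, Q_n)^2 for a kernel function k."""
    m, n = len(X), len(Y)
    return (k(X, X).sum() / m**2
            + k(Y, Y).sum() / n**2
            - 2 * k(X, Y).sum() / (m * n))

def gauss(sigma):
    """Gaussian kernel k(x, y) = exp(-sigma * ||x - y||^2) as a Gram function."""
    def k(A, B):
        sq = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
        return np.exp(-sigma * sq)
    return k

def generalized_mmd(X, Y, sigmas):
    """Approximate gamma(P_m, Q_n) over the Gaussian family by replacing
    the supremum over sigma in R_+ with a finite grid (an assumption made
    here; the optimum can instead be found by a one-dimensional search)."""
    vals = {s: mmd2_biased(X, Y, gauss(s)) for s in sigmas}
    best = max(vals, key=vals.get)
    return np.sqrt(max(vals[best], 0.0)), best
```

The same maximum-over-kernels computation covers a finite bank of fixed kernels, which matches the closed form for convex combinations discussed next.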
So far, we have presented the metric property and the statistical consistency (of the empirical estimator) of γ. The question now is how to compute γ(P_m, Q_n) in practice. To show this, we present two examples.

Example 1. Suppose K = K_g. Then γ(P_m, Q_n) can be written as

γ²(P_m, Q_n) = sup_{σ∈R_+} [ (1/m²) Σ_{i,j=1}^m e^{−σ‖X_i−X_j‖_2²} + (1/n²) Σ_{i,j=1}^n e^{−σ‖Y_i−Y_j‖_2²} − (2/(mn)) Σ_{i,j=1}^{m,n} e^{−σ‖X_i−Y_j‖_2²} ].  (11)

The optimum σ can be obtained by solving (11), and γ(P_m, Q_n) = ‖P_m k_σ − Q_n k_σ‖_{H_σ} at that optimum.

Example 2. Suppose K = K_con. Then γ(P_m, Q_n) becomes

γ²(P_m, Q_n) = sup_{k∈K_con} ‖P_m k − Q_n k‖²_H = sup_{k∈K_con} ∫∫ k d(P_m − Q_n) d(P_m − Q_n) = sup{λᵀa : λᵀ1 = 1, λ ≥ 0},  (12)

where we have replaced k by Σ_{i=1}^l λ_i k_i. Here λ = (λ_1, ..., λ_l) and (a)_i = ‖P_m k_i − Q_n k_i‖²_{H_i} = (1/m²) Σ_{a,b=1}^m k_i(X_a, X_b) + (1/n²) Σ_{a,b=1}^n k_i(Y_a, Y_b) − (2/(mn)) Σ_{a,b=1}^{m,n} k_i(X_a, Y_b). It is easy to see that γ²(P_m, Q_n) = max_{1≤i≤l} (a)_i. Similar examples can be provided for other K, where γ(P_m, Q_n) can be computed by solving a semidefinite program (K = K_lin) or by constrained gradient descent (K = K_l, K_rbf).

Finally, while the approach in (6) to generalizing γ_k is our focus in this paper, an alternative Bayesian strategy would be to define a non-negative finite measure λ over K, and to average γ_k over that measure, i.e., β(P, Q) := ∫_K γ_k(P, Q) dλ(k). This also yields a pseudometric on P. That said, β(P, Q) ≤ λ(K) γ(P, Q), ∀P, Q, which means that if P and Q can be distinguished by β, they can be distinguished by γ, but not vice-versa. In this sense, γ is stronger than β. One further complication with the Bayesian approach is in defining a sensible λ over K. Note that γ_{k_1} (the single kernel MMD based on k_1) can be obtained by defining λ(k) = δ(k − k_1) in β(P, Q).

5 Experiments

In this section, we present a benchmark experiment illustrating that the generalized MMD proposed in Section 4 is preferable to the single kernel MMD with a heuristically set kernel parameter. The experimental setup is as follows. Let p = N(0, σ_p²), a normal distribution in R with zero mean and variance σ_p².
Let q be the perturbed version of p, given as q(x) = p(x)(1 + sin νx). Here p and q are the densities associated with P and Q respectively. It is easy to see that q differs from p at increasing frequencies with increasing ν. Let k(x, y) = exp(−(x − y)²/σ). Now, the goal is, given random samples drawn i.i.d. from P and Q (with ν fixed), to test H_0 : P = Q vs. H_1 : P ≠ Q. The idea is that as ν increases, it will be harder to distinguish between P and Q for a fixed sample size. Therefore, using this setup we can verify whether the adaptive bandwidth selection achieved by γ (as the test statistic) helps to distinguish between P and Q at higher ν compared to γ_k with a heuristic σ. To this end, using γ(P_m, Q_n) and γ_k(P_m, Q_n) (with various σ) as test statistics T_{m,n}, we design a test that returns H_1 if T_{m,n} > c_{m,n}, and H_0 otherwise. The problem therefore reduces to finding c_{m,n}, which is determined as the (1 − α) quantile of the asymptotic distribution of T_{m,n} under H_0; this fixes the type-I error (the probability of rejecting H_0 when it is true) to α. The consistency of this test under γ_k (for any fixed σ) is proved in [8]. A similar result can be shown for γ under some conditions on K; we skip the details here. In our experiments, we set m = n = 1000, σ_p² = 10, and draw two sets of independent random samples from Q. The distribution of T_{m,n} is estimated by bootstrapping on these samples (250 bootstrap iterations are performed) and the associated 95th quantile (we choose α = 0.05) is computed. Since the performance of the test is judged by its type-II error (the probability of accepting H_0 when H_1 is true), we draw a random sample, one each from P and Q, and test whether P = Q. This process is repeated 300 times, and estimates of the type-I and type-II errors are obtained for both γ and γ_k. 14 different values for σ are considered: a logarithmic scale of base 2 with exponents (−3, −2, −1, 0, 1, 3/2, 2, 5/2, 3, 7/2, 4, 5, 6), along with the median distance between samples as one more choice.
Five different choices for ν are considered: (1/2, 3/4, 1, 5/4, 3/2).
[Figure 1 appears here; the panel graphics are not reproduced in this transcription.]

Figure 1: (a) Type-I and type-II errors (in %) for γ for varying ν. (b, c) Type-I and type-II errors (in %) for γ_k (with different σ) for varying ν. The dotted line in (c) corresponds to the median heuristic, which shows that its associated type-II error is very large at large ν. (d) Box plot of log σ grouped by ν, where σ is selected by γ. (e) Box plot of the median distance between points (which is also a choice for σ), grouped by ν. Refer to Section 5 for details.

Figure 1(a) shows the estimated type-I and type-II errors using γ as the test statistic for varying ν. Note that the type-I error is close to its design value of 5%, while the type-II error is zero for all ν, which means γ distinguishes between P and Q for all perturbations. Figures 1(b, c) show the estimates of the type-I and type-II errors using γ_k as the test statistic for different σ and ν. Figure 1(d) shows the box plot for log σ, grouped by ν, where σ is the bandwidth selected by γ. Figure 1(e) shows the box plot of the median distance between points (which is also a choice for σ), grouped by ν. From Figures 1(c) and 1(e), it is easy to see that the median heuristic exhibits a high type-II error for ν = 3/2, while γ exhibits zero type-II error (from Figure 1(a)). Figure 1(c) also shows that heuristic choices of σ can result in high type-II errors. It is intuitive that as ν increases (which means the characteristic function of Q differs from that of P at higher frequencies), a smaller σ is needed to detect these changes. The advantage of using γ is that it selects σ in a distribution-dependent fashion, and its behavior in the box plot shown in Figure 1(d) matches this intuition about the behavior of σ with respect to ν.
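The testing protocol of this section can be sketched as follows. Note that this illustration approximates the null distribution by recomputing the statistic on random splits of the pooled sample (a permutation-style variant, an assumption made here), rather than the bootstrap-from-Q procedure described above; the function name and defaults are likewise illustrative.

```python
import numpy as np

def two_sample_test(X, Y, stat, alpha=0.05, B=250, seed=0):
    """Return True (reject H0: P = Q) when stat(X, Y) exceeds the
    (1 - alpha) quantile of the statistic under H0, approximated by
    recomputing it on B random splits of the pooled sample."""
    rng = np.random.default_rng(seed)
    pooled = np.concatenate([X, Y])
    m = len(X)
    null_stats = []
    for _ in range(B):
        # Random relabeling of the pooled sample simulates H0.
        idx = rng.permutation(len(pooled))
        null_stats.append(stat(pooled[idx[:m]], pooled[idx[m:]]))
    threshold = np.quantile(null_stats, 1 - alpha)  # critical value c
    return stat(X, Y) > threshold
```

Any of the MMD estimates discussed earlier can be plugged in as `stat`; the same wrapper serves for both the single-kernel and the generalized statistic.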
These results demonstrate the validity of using γ as a distance measure in applications.

6 Conclusions

In this work, we have shown how MMD appears in binary classification, and thus that characteristic kernels are important in kernel-based classification algorithms. We have broadened the class of characteristic RKHSs to include those induced by strictly positive definite kernels (with particular application to kernels on non-compact domains, and/or kernels that are not translation invariant). We have further provided a convergent generalization of MMD over families of kernel functions, which becomes necessary even when considering relatively simple families of kernels (such as the Gaussian kernels parameterized by their bandwidth). The usefulness of the generalized MMD is illustrated experimentally with a two-sample testing problem.

Acknowledgments

The authors thank the anonymous reviewers for their constructive comments, and especially the reviewer who pointed out the connection between characteristic kernels and the achievability of the Bayes risk. B. K. S. was supported by the MPI for Biological Cybernetics, the National Science Foundation (grant DMS-MSPA 0625409), the Fair Isaac Corporation and the University of California MICRO program. A. G. was supported by grants DARPA IPTO FA, ONR MURI N000140710747, and ARO MURI W911NF
A Proofs

We provide proofs for the results in Sections 2-4.

A.1 Proof of Theorem 1

To prove Theorem 1, we need the following result from [7].

Theorem 13 ([7]). Let F_k := {f : ‖f‖_H ≤ 1}, where (H, k) is an RKHS defined on a measurable space M with k measurable and bounded. Then

γ_k(P, Q) = sup_{f∈F_k} |Pf − Qf| = ‖Pk − Qk‖_H,  (13)

where ‖·‖_H represents the RKHS norm.

Proof of Theorem 1: From (3), we have

ε ∫ L_1(f) dP + (1 − ε) ∫ L_{−1}(f) dQ = −∫ f dP + ∫ f dQ = Qf − Pf.

Therefore, R^L_{F_k} = inf_{f∈F_k} (Qf − Pf) = −sup_{f∈F_k} (Pf − Qf) = −sup_{f∈F_k} |Pf − Qf| = −γ_k(P, Q), which follows from Theorem 13. Given {(X_i, Y_i)}_{i=1}^N drawn i.i.d. from µ, the empirical equivalent of (3) is given by

inf { −(1/m) Σ_{Y_i=+1} f(X_i) + (1/n) Σ_{Y_i=−1} f(X_i) : f ∈ F_k }.

Solving this for f gives

f̂ = [ (1/m) Σ_{Y_i=+1} k(·, X_i) − (1/n) Σ_{Y_i=−1} k(·, X_i) ] / ‖ (1/m) Σ_{Y_i=+1} k(·, X_i) − (1/n) Σ_{Y_i=−1} k(·, X_i) ‖_H,

and the result in (4) follows.

A.2 Proof of Theorem 2

Before we prove Theorem 2, we present a lemma which we will use in the proof.

Lemma 14. Let θ : V → R and ψ : V → R be convex functions on a real vector space V. Suppose

a = sup{θ(x) : ψ(x) ≤ b},  (14)

where θ is not constant on {x : ψ(x) ≤ b} and a < ∞. Then

b = inf{ψ(x) : θ(x) ≥ a}.  (15)

We need the following result from [1, Theorem 32.1] to prove Lemma 14.

Theorem 15 ([1]). Let f be a convex function, and let C be a convex set contained in the domain of f. If f attains its supremum relative to C at some point of the relative interior of C, then f is actually constant throughout C.

Proof of Lemma 14: Note that A := {x : ψ(x) ≤ b} is a convex subset of V. Since θ is not constant on A, by Theorem 15, θ attains its supremum on the boundary of A. Therefore, any solution, x*, to (14) satisfies θ(x*) = a and ψ(x*) = b. Let G := {x : θ(x) > a}. For any x ∈ G, ψ(x) > b; if this were not the case, then x* would not be a solution to (14). Let H := {x : θ(x) = a}. Clearly, x* ∈ H, and so there exists an x ∈ H for which ψ(x) = b. Suppose inf{ψ(x) : x ∈ H} = c < b, which means some x ∈ H lies in the relative interior of A. From (14), this implies θ attains its supremum relative to A at some point of the relative interior of A.
By Theorem 15, this implies $\theta$ is constant on $A$, leading to a contradiction. Therefore, $\inf\{\psi(x) : x \in H\} = b$ and the result in (15) follows.
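As an aside, the empirical minimizer $f$ derived in the proof of Theorem 1 above is, up to normalization, a Parzen-window classifier: it assigns $x$ to the class whose empirical kernel mean is larger at $x$. A minimal sketch; the Gaussian kernel, toy data, and function names are our illustrative choices, not the paper's code.

```python
import math
import random

def gaussian_kernel(x, y, sigma=1.0):
    return math.exp(-(x - y) ** 2 / (2 * sigma ** 2))

def witness(pos, neg, k):
    # f(x) proportional to (1/m) sum_{Y_i=1} k(x, X_i) - (1/n) sum_{Y_i=-1} k(x, X_i);
    # sign(f) is the Parzen-window rule from the proof of Theorem 1
    m, n = len(pos), len(neg)
    def f(x):
        return (sum(k(x, xi) for xi in pos) / m
                - sum(k(x, xi) for xi in neg) / n)
    return f

random.seed(1)
pos = [random.gauss(+2.0, 0.5) for _ in range(50)]
neg = [random.gauss(-2.0, 0.5) for _ in range(50)]
f = witness(pos, neg, gaussian_kernel)
print(f(2.0) > 0, f(-2.0) < 0)
```

The denominator in the closed-form $f$ only rescales the decision function, so the sign (the classification) can be computed without it.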
Proof of Theorem 2: By Theorem 13, $\gamma_k(P, Q) = \sup\{Pf - Qf : f \in \mathcal{F}_k\}$. Note that $Pf - Qf$ and $\|f\|_{\mathcal{H}}$ are convex functionals on $\mathcal{H}$. For $P \ne Q$, $Pf - Qf$ is not constant on $\mathcal{F}_k$, since $k$ is characteristic. Therefore, by Lemma 14, we have
$$1 = \inf\{\|f\|_{\mathcal{H}} : Pf - Qf \ge \gamma_k(P, Q),\ f \in \mathcal{H}\}.$$
Since this holds for all $P \ne Q$, it holds in particular for $P_m$ and $Q_n$. Therefore, we have
$$\frac{2}{\gamma_k(P_m, Q_n)} = \inf\{\|f\|_{\mathcal{H}} : P_m f - Q_n f \ge 2,\ f \in \mathcal{H}\}.$$
Consider
$$\{f \in \mathcal{H} : P_m f - Q_n f \ge 2\} = \Big\{f \in \mathcal{H} : \tfrac{1}{m}\textstyle\sum_{i : Y_i = 1} f(X_i) - \tfrac{1}{n}\sum_{i : Y_i = -1} f(X_i) \ge 2\Big\} \supset \{f \in \mathcal{H} : Y_i f(X_i) \ge 1,\ \forall i\},$$
which means $\frac{2}{\gamma_k(P_m, Q_n)} \le \|f_{\mathrm{svm}}\|_{\mathcal{H}}$.

A.3 Proof of Theorem 4

To prove Theorem 4, we need the following lemma, which provides necessary and sufficient conditions for a kernel not to be characteristic.

Lemma 16. Let $k$ be measurable and bounded on $M$. Then there exist $P \ne Q$, $P, Q \in \mathcal{P}$, such that $\gamma_k(P, Q) = 0$ if and only if there exists a finite non-zero signed Borel measure $\mu$ that satisfies: (i) $\int\!\!\int k(x, y)\, d\mu(x)\, d\mu(y) = 0$, (ii) $\mu(M) = 0$.

Proof. ($\Leftarrow$) Suppose there exists a finite non-zero signed Borel measure $\mu$ that satisfies (i) and (ii) in Lemma 16. By the Jordan decomposition theorem [5, Theorem 5.6.1], there exist unique positive measures $\mu^+$ and $\mu^-$ such that $\mu = \mu^+ - \mu^-$ and $\mu^+ \perp \mu^-$ ($\mu^+$ and $\mu^-$ are singular). By (ii), we have $\mu^+(M) = \mu^-(M) =: \alpha$. Define $P = \alpha^{-1}\mu^+$ and $Q = \alpha^{-1}\mu^-$. Clearly, $P \ne Q$ and $P, Q \in \mathcal{P}$. Then
$$\gamma_k^2(P, Q) = \|Pk - Qk\|_{\mathcal{H}}^2 = \alpha^{-2}\|\mu k\|_{\mathcal{H}}^2 = \alpha^{-2}\langle \mu k, \mu k\rangle_{\mathcal{H}}. \qquad (16)$$
From the proof of Theorem 13 (see Theorem 3 in [17]), we have $\langle \mu k, \mu k\rangle_{\mathcal{H}} = \int\!\!\int k(x, y)\, d\mu(x)\, d\mu(y)$, and therefore, by (i), $\gamma_k(P, Q) = 0$. So, we have constructed $P \ne Q$ such that $\gamma_k(P, Q) = 0$.

($\Rightarrow$) Suppose there exist $P \ne Q$, $P, Q \in \mathcal{P}$, such that $\gamma_k(P, Q) = 0$. Let $\mu = P - Q$. Clearly, $\mu$ is a finite non-zero signed Borel measure that satisfies $\mu(M) = 0$. Note that $\gamma_k^2(P, Q) = \|Pk - Qk\|_{\mathcal{H}}^2 = \|\mu k\|_{\mathcal{H}}^2 = \int\!\!\int k(x, y)\, d\mu(x)\, d\mu(y)$, and therefore (i) follows.

Proof of Theorem 4: Since $k$ is strictly pd on $M$, we have $\int\!\!\int k(x, y)\, d\eta(x)\, d\eta(y) > 0$ for any finite non-zero signed Borel measure $\eta$. This means there does not exist a finite non-zero signed Borel measure that satisfies (i) in Lemma 16.
Therefore, by Lemma 16, there do not exist $P \ne Q$, $P, Q \in \mathcal{P}$, such that $\gamma_k(P, Q) = 0$, which implies $k$ is characteristic to $\mathcal{P}$.

A.4 Proof of Corollary 5

Consider $\sum_{i} \alpha_i k(\cdot, x_i) = \int k(\cdot, x)\, d\mu_X(x) = \mu_X k$, where $\mu_X = \sum_{i} \alpha_i \delta_{x_i}$. Here $\delta_{x_i}$ represents the Dirac measure at $x_i$. So, we have $\mu_X k = \mu_Y k$, which is equivalent to
$$\int\!\!\int k(x, y)\, d(\mu_X - \mu_Y)(x)\, d(\mu_X - \mu_Y)(y) = 0. \qquad (17)$$
Since $k$ is strictly pd, by Lemma 16, we have $\mu_X k = \mu_Y k \Leftrightarrow \mu_X = \mu_Y \Leftrightarrow X = Y$.
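Lemma 16 and the proof of Corollary 5 turn on the sign of $\int\!\!\int k\, d\mu\, d\mu$ for signed measures with $\mu(M) = 0$. The following sketch evaluates this quadratic form for a discrete signed measure, contrasting the (strictly pd) Gaussian kernel with the linear kernel $k(x, y) = xy$, which is pd but not strictly pd; the example measures and names are ours, not from the paper.

```python
import math

def quad_form(kernel, points, weights):
    # integral integral k(x, y) dmu(x) dmu(y) for the discrete signed
    # measure mu = sum_i w_i * delta_{x_i}
    return sum(wi * wj * kernel(xi, xj)
               for xi, wi in zip(points, weights)
               for xj, wj in zip(points, weights))

gauss = lambda x, y: math.exp(-(x - y) ** 2 / 2)   # strictly pd
linear = lambda x, y: x * y                        # pd, but not strictly pd

# mu = P - Q with mu(M) = 0: P uniform on {-1, +1}, Q a point mass at 0.
# Both have mean 0, so the linear kernel cannot distinguish them.
points = [-1.0, 1.0, 0.0]
weights = [0.5, 0.5, -1.0]

print(quad_form(gauss, points, weights))   # strictly positive: P and Q separated
print(quad_form(linear, points, weights))  # zero: gamma_k(P, Q) = 0 although P != Q
```

For the linear kernel the quadratic form reduces to $(\sum_i w_i x_i)^2$, so any two distributions with equal first moments collapse to distance zero, exactly the failure mode that strict positive definiteness rules out.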
A.5 Proof of Theorem 7

Consider
$$\gamma(P_m, Q_n) - \gamma(P, Q) = \sup_{k \in \mathcal{K}} \|P_m k - Q_n k\|_{\mathcal{H}} - \sup_{k \in \mathcal{K}} \|Pk - Qk\|_{\mathcal{H}}$$
$$\le \sup_{k \in \mathcal{K}} \big[\|P_m k - Q_n k\|_{\mathcal{H}} - \|Pk - Qk\|_{\mathcal{H}}\big] \le \sup_{k \in \mathcal{K}} \|P_m k - Q_n k - Pk + Qk\|_{\mathcal{H}}$$
$$\le \sup_{k \in \mathcal{K}} \big[\|P_m k - Pk\|_{\mathcal{H}} + \|Q_n k - Qk\|_{\mathcal{H}}\big] \le \sup_{k \in \mathcal{K}} \|P_m k - Pk\|_{\mathcal{H}} + \sup_{k \in \mathcal{K}} \|Q_n k - Qk\|_{\mathcal{H}}. \qquad (18)$$

We now bound the terms $\sup_{k \in \mathcal{K}} \|P_m k - Pk\|_{\mathcal{H}}$ and $\sup_{k \in \mathcal{K}} \|Q_n k - Qk\|_{\mathcal{H}}$. Since $\sup_{k \in \mathcal{K}} \|P_m k - Pk\|_{\mathcal{H}}$ satisfies the bounded difference property, McDiarmid's inequality gives that, with probability at least $1 - \frac{\delta}{4}$ over the choice of $\{X_i\}$, we have
$$\sup_{k \in \mathcal{K}} \|P_m k - Pk\|_{\mathcal{H}} \le \mathbb{E} \sup_{k \in \mathcal{K}} \|P_m k - Pk\|_{\mathcal{H}} + \sqrt{\frac{2\nu}{m} \log\frac{4}{\delta}}. \qquad (19)$$
By invoking symmetrization for $\mathbb{E} \sup_{k \in \mathcal{K}} \|P_m k - Pk\|_{\mathcal{H}}$, we have
$$\mathbb{E} \sup_{k \in \mathcal{K}} \|P_m k - Pk\|_{\mathcal{H}} \le \frac{2}{m}\, \mathbb{E}\,\mathbb{E}_\rho \sup_{k \in \mathcal{K}} \Big\|\sum_{i=1}^m \rho_i k(\cdot, X_i)\Big\|_{\mathcal{H}}, \qquad (20)$$
where $\{\rho_i\}_{i=1}^m$ represent i.i.d. Rademacher random variables and $\mathbb{E}_\rho$ represents the expectation w.r.t. $\{\rho_i\}$ conditioned on $\{X_i\}$. Since $\mathbb{E}_\rho \sup_{k \in \mathcal{K}} \|\sum_{i=1}^m \rho_i k(\cdot, X_i)\|_{\mathcal{H}}$ satisfies the bounded difference property, by McDiarmid's inequality, with probability at least $1 - \frac{\delta}{4}$ over the choice of the random sample of size $m$, we have
$$\mathbb{E}\,\mathbb{E}_\rho \sup_{k \in \mathcal{K}} \Big\|\sum_{i=1}^m \rho_i k(\cdot, X_i)\Big\|_{\mathcal{H}} \le \mathbb{E}_\rho \sup_{k \in \mathcal{K}} \Big\|\sum_{i=1}^m \rho_i k(\cdot, X_i)\Big\|_{\mathcal{H}} + \sqrt{2\nu m \log\frac{4}{\delta}}. \qquad (21)$$
By writing
$$\Big\|\sum_{i=1}^m \rho_i k(\cdot, X_i)\Big\|_{\mathcal{H}} = \Big(\sum_{i,j=1}^m \rho_i \rho_j k(X_i, X_j)\Big)^{1/2} \le \Big(2\sum_{i<j} \rho_i \rho_j k(X_i, X_j) + \nu m\Big)^{1/2}, \qquad (22)$$
we have that, with probability at least $1 - \frac{\delta}{4}$, the following holds:
$$\mathbb{E}\,\mathbb{E}_\rho \sup_{k \in \mathcal{K}} \Big\|\sum_{i=1}^m \rho_i k(\cdot, X_i)\Big\|_{\mathcal{H}} \le \sqrt{2\, U_m(\mathcal{K}; \{X_i\})} + \sqrt{\nu m} + \sqrt{2\nu m \log\frac{4}{\delta}}. \qquad (23)$$
Tying (19)–(23) together, we have that, with probability at least $1 - \frac{\delta}{2}$ over the choice of $\{X_i\}$, the following holds:
$$\sup_{k \in \mathcal{K}} \|P_m k - Pk\|_{\mathcal{H}} \le \frac{\sqrt{8\, U_m(\mathcal{K}; \{X_i\})}}{m} + 2\sqrt{\frac{\nu}{m}} + \sqrt{\frac{18\nu}{m} \log\frac{4}{\delta}}. \qquad (24)$$
Performing a similar analysis for $\sup_{k \in \mathcal{K}} \|Q_n k - Qk\|_{\mathcal{H}}$, we have that, with probability at least $1 - \frac{\delta}{2}$ over the choice of $\{Y_i\}$,
$$\sup_{k \in \mathcal{K}} \|Q_n k - Qk\|_{\mathcal{H}} \le \frac{\sqrt{8\, U_n(\mathcal{K}; \{Y_i\})}}{n} + 2\sqrt{\frac{\nu}{n}} + \sqrt{\frac{18\nu}{n} \log\frac{4}{\delta}}. \qquad (25)$$
Using (24) and (25) with $\sqrt{a} + \sqrt{b} \le \sqrt{2(a + b)}$ provides the result.
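The Rademacher quantity controlled in (20)–(23) is directly computable for finite data, since $\|\sum_i \rho_i k(\cdot, X_i)\|_{\mathcal{H}}^2 = \sum_{i,j} \rho_i \rho_j k(X_i, X_j)$. The following Monte Carlo sketch estimates $\mathbb{E}_\rho \sup_{k \in \mathcal{K}} \|\sum_i \rho_i k(\cdot, X_i)\|_{\mathcal{H}}$ over a small Gaussian-bandwidth family; the setup (data, bandwidths, trial count) is our illustrative choice, and here $\nu = 1$ since the Gaussian kernel is bounded by 1.

```python
import math
import random

def rademacher_sup_norm(xs, kernels, trials=200, seed=0):
    # Monte Carlo estimate of E_rho sup_k || sum_i rho_i k(., X_i) ||_H,
    # using || sum_i rho_i k(., X_i) ||_H^2 = sum_{i,j} rho_i rho_j k(X_i, X_j)
    rng = random.Random(seed)
    m, total = len(xs), 0.0
    for _ in range(trials):
        rho = [rng.choice((-1.0, 1.0)) for _ in range(m)]
        best = 0.0
        for k in kernels:
            sq = sum(rho[i] * rho[j] * k(xs[i], xs[j])
                     for i in range(m) for j in range(m))
            best = max(best, sq)
        total += math.sqrt(max(best, 0.0))
    return total / trials

random.seed(3)
xs = [random.gauss(0.0, 1.0) for _ in range(30)]
kernels = [lambda a, b, s=s: math.exp(-(a - b) ** 2 / (2 * s * s))
           for s in (0.5, 1.0, 2.0)]
est = rademacher_sup_norm(xs, kernels)
print(est)
```

Enlarging the kernel family can only increase the per-trial supremum, which mirrors the role of $U_m(\mathcal{K}; \{X_i\})$ as a complexity penalty for richer families in (24).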
A.6 Proof of Lemma 8

From [2, Propositions 2.2 and 2.6] (see also [3, Corollary 5.1.8]), we have that there exists a universal constant $C < \infty$ such that
$$U_m(\mathcal{K}; \{X_i\}) \le C m \int_0^{\nu} \sqrt{\log N(\mathcal{K}, D, \epsilon)}\, d\epsilon,$$
where
$$D^2(k_1, k_2) = \frac{1}{m^2}\, \mathbb{E}_\rho \Big(\sum_{i<j} \rho_i \rho_j h(X_i, X_j)\Big)^2 = \frac{1}{m^2} \sum_{i<j,\, r<s} \mathbb{E}_\rho[\rho_i \rho_j \rho_r \rho_s]\, h(X_i, X_j)\, h(X_r, X_s) = \frac{1}{m^2} \sum_{i<j} h^2(X_i, X_j),$$
with $h(X_i, X_j) = k_1(X_i, X_j) - k_2(X_i, X_j)$.

A.7 Proof of Corollary 10

The result follows by bounding the uniform covering number of the VC-subgraph class $\mathcal{K}$. By [21, Theorem 2.6.7], we have
$$N(\mathcal{K}, D, \epsilon) \le C_1 V(\mathcal{K}) \Big(\frac{16 e \nu^2}{\epsilon^2}\Big)^{V(\mathcal{K})}, \qquad (26)$$
where $V(\mathcal{K})$ is the VC-index of $\mathcal{K}$. Therefore,
$$U_m(\mathcal{K}; \{X_i\}) \le C m \int_0^{\nu} \sqrt{\log N(\mathcal{K}, D, \epsilon)}\, d\epsilon \le C m \int_0^{\nu} \sqrt{4 V(\mathcal{K}) \log\frac{\nu}{\epsilon}}\, d\epsilon + C m \nu \sqrt{V(\mathcal{K}) \log(16 e)} + C m \nu \sqrt{\log(C_1 V(\mathcal{K}))}. \qquad (27)$$
Note that $\int_0^{\nu} \sqrt{\log(\nu/\epsilon)}\, d\epsilon \le 2\nu$. Using this in (27) and rearranging the terms provides the result.

References

[1] M. Anthony and P. L. Bartlett. Neural Network Learning: Theoretical Foundations. Cambridge University Press, Cambridge, UK, 1999.
[2] M. A. Arcones and E. Giné. Limit theorems for U-processes. Annals of Probability, 21:1494–1542, 1993.
[3] V. H. de la Peña and E. Giné. Decoupling: From Dependence to Independence. Springer-Verlag, New York, 1999.
[4] L. Devroye, L. Györfi, and G. Lugosi. A Probabilistic Theory of Pattern Recognition. Springer-Verlag, New York, 1996.
[5] R. M. Dudley. Real Analysis and Probability. Cambridge University Press, Cambridge, UK, 2002.
[6] K. Fukumizu, A. Gretton, X. Sun, and B. Schölkopf. Kernel measures of conditional dependence. In J. C. Platt, D. Koller, Y. Singer, and S. Roweis, editors, Advances in Neural Information Processing Systems 20, pages 489–496, Cambridge, MA, 2008. MIT Press.
[7] K. Fukumizu, B. K. Sriperumbudur, A. Gretton, and B. Schölkopf. Characteristic kernels on groups and semigroups. In D. Koller, D. Schuurmans, Y. Bengio, and L. Bottou, editors, Advances in Neural Information Processing Systems 21, pages 473–480, 2009.
[8] A. Gretton, K. M. Borgwardt, M. Rasch, B. Schölkopf, and A. Smola. A kernel method for the two sample problem. In B. Schölkopf, J. Platt, and T. Hoffman, editors, Advances in Neural Information Processing Systems 19, pages 513–520. MIT Press, 2007.
[9] A. Gretton, K. Fukumizu, C.-H. Teo, L. Song, B. Schölkopf, and A. Smola. A kernel statistical test of independence. In Advances in Neural Information Processing Systems 20, pages 585–592. MIT Press, 2008.
[10] G. R. G. Lanckriet, N. Cristianini, P. Bartlett, L. El Ghaoui, and M. I. Jordan. Learning the kernel matrix with semidefinite programming. Journal of Machine Learning Research, 5:27–72, 2004.
[11] R. T. Rockafellar. Convex Analysis. Princeton University Press, Princeton, NJ, 1970.
[12] B. Schölkopf and A. J. Smola. Learning with Kernels. MIT Press, Cambridge, MA, 2002.
[13] B. Schölkopf, B. K. Sriperumbudur, A. Gretton, and K. Fukumizu. RKHS representation of measures. In Learning Theory and Approximation Workshop, Oberwolfach, Germany, 2008.
[14] J. Shawe-Taylor and N. Cristianini. Kernel Methods for Pattern Analysis. Cambridge University Press, Cambridge, UK, 2004.
[15] A. J. Smola, A. Gretton, L. Song, and B. Schölkopf. A Hilbert space embedding for distributions. In Proc. 18th International Conference on Algorithmic Learning Theory, pages 13–31. Springer-Verlag, Berlin, Germany, 2007.
[16] N. Srebro and S. Ben-David. Learning bounds for support vector machines with learned kernels. In G. Lugosi and H. U. Simon, editors, Proc. of the 19th Annual Conference on Learning Theory, pages 169–183, 2006.
[17] B. K. Sriperumbudur, A. Gretton, K. Fukumizu, G. R. G. Lanckriet, and B. Schölkopf. Injective Hilbert space embeddings of probability measures. In R. Servedio and T. Zhang, editors, Proc. of the 21st Annual Conference on Learning Theory, pages 111–122, 2008.
[18] I. Steinwart. On the influence of the kernel on the consistency of support vector machines. Journal of Machine Learning Research, 2:67–93, 2002.
[19] I. Steinwart and A. Christmann. Support Vector Machines. Springer, 2008.
[20] J. Stewart. Positive definite functions and generalizations, an historical survey. Rocky Mountain Journal of Mathematics, 6(3):409–434, 1976.
[21] A. W. van der Vaart and J. A. Wellner. Weak Convergence and Empirical Processes. Springer-Verlag, New York, 1996.
[22] Y. Ying and C. Campbell. Generalization bounds for learning the kernel. In Proc. of the 22nd Annual Conference on Learning Theory, 2009.
[23] Y. Ying and D. X. Zhou. Learnability of Gaussians with flexible variances. Journal of Machine Learning Research, 8:249–276, 2007.
City University of New York (CUNY) CUNY Acadeic Works International Conference on Hydroinforatics 8-1-2014 Experiental Design For Model Discriination And Precise Paraeter Estiation In WDS Analysis Giovanna
More informationNonmonotonic Networks. a. IRST, I Povo (Trento) Italy, b. Univ. of Trento, Physics Dept., I Povo (Trento) Italy
Storage Capacity and Dynaics of Nononotonic Networks Bruno Crespi a and Ignazio Lazzizzera b a. IRST, I-38050 Povo (Trento) Italy, b. Univ. of Trento, Physics Dept., I-38050 Povo (Trento) Italy INFN Gruppo
More informationStability Bounds for Non-i.i.d. Processes
tability Bounds for Non-i.i.d. Processes Mehryar Mohri Courant Institute of Matheatical ciences and Google Research 25 Mercer treet New York, NY 002 ohri@cis.nyu.edu Afshin Rostaiadeh Departent of Coputer
More informationarxiv: v1 [cs.ds] 3 Feb 2014
arxiv:40.043v [cs.ds] 3 Feb 04 A Bound on the Expected Optiality of Rando Feasible Solutions to Cobinatorial Optiization Probles Evan A. Sultani The Johns Hopins University APL evan@sultani.co http://www.sultani.co/
More information3.3 Variational Characterization of Singular Values
3.3. Variational Characterization of Singular Values 61 3.3 Variational Characterization of Singular Values Since the singular values are square roots of the eigenvalues of the Heritian atrices A A and
More informationESTIMATING AND FORMING CONFIDENCE INTERVALS FOR EXTREMA OF RANDOM POLYNOMIALS. A Thesis. Presented to. The Faculty of the Department of Mathematics
ESTIMATING AND FORMING CONFIDENCE INTERVALS FOR EXTREMA OF RANDOM POLYNOMIALS A Thesis Presented to The Faculty of the Departent of Matheatics San Jose State University In Partial Fulfillent of the Requireents
More informationNew upper bound for the B-spline basis condition number II. K. Scherer. Institut fur Angewandte Mathematik, Universitat Bonn, Bonn, Germany.
New upper bound for the B-spline basis condition nuber II. A proof of de Boor's 2 -conjecture K. Scherer Institut fur Angewandte Matheati, Universitat Bonn, 535 Bonn, Gerany and A. Yu. Shadrin Coputing
More informationBootstrapping Dependent Data
Bootstrapping Dependent Data One of the key issues confronting bootstrap resapling approxiations is how to deal with dependent data. Consider a sequence fx t g n t= of dependent rando variables. Clearly
More information