Kernel Choice and Classifiability for RKHS Embeddings of Probability Distributions

Bharath K. Sriperumbudur, Department of ECE, UC San Diego, La Jolla, USA
Kenji Fukumizu, The Institute of Statistical Mathematics, Tokyo, Japan
Arthur Gretton, Carnegie Mellon University and MPI for Biological Cybernetics
Gert R. G. Lanckriet, Department of ECE, UC San Diego, La Jolla, USA
Bernhard Schölkopf, MPI for Biological Cybernetics, Tübingen, Germany

Abstract

Embeddings of probability measures into reproducing kernel Hilbert spaces have been proposed as a straightforward and practical means of representing and comparing probabilities. In particular, the distance between embeddings (the maximum mean discrepancy, or MMD) has several key advantages over many classical metrics on distributions, namely easy computability, fast convergence and low bias of finite sample estimates. An important requirement of the embedding RKHS is that it be characteristic: in this case, the MMD between two distributions is zero if and only if the distributions coincide. Three new results on the MMD are introduced in the present study. First, it is established that MMD corresponds to the optimal risk of a kernel classifier, thus forming a natural link between the distance between distributions and their ease of classification. An important consequence is that a kernel must be characteristic to guarantee classifiability between distributions in the RKHS. Second, the class of characteristic kernels is broadened to incorporate all strictly positive definite kernels: these include non-translation invariant kernels and kernels on non-compact domains. Third, a generalization of the MMD is proposed for families of kernels, as the supremum over MMDs on a class of kernels (for instance the Gaussian kernels with different bandwidths). This extension is necessary to obtain a single distance measure if a large selection or class of characteristic kernels is potentially appropriate. This generalization is reasonable, given that it corresponds to the problem of learning the kernel by minimizing the risk of the corresponding kernel classifier. The generalized MMD is shown to have consistent finite sample estimates, and its performance is demonstrated on a homogeneity testing example.

1 Introduction

Kernel methods are broadly established as a useful way of constructing nonlinear algorithms from linear ones, by embedding points into higher dimensional reproducing kernel Hilbert spaces (RKHSs) [12].

A generalization of this idea is to embed probability distributions into RKHSs, giving us a linear method for dealing with higher order statistics [8, 15, 17]. More specifically, suppose we are given the set P of all Borel probability measures defined on the topological space M, and the RKHS (H, k) of functions on M with k as its reproducing kernel (r.k.). For P ∈ P, denote by Pk := ∫ k(·, x) dP(x). If k is measurable and bounded, then we may define the embedding of P in H as Pk ∈ H. The RKHS distance between two such mappings associated with P, Q ∈ P is called the maximum mean discrepancy (MMD) [8, 17], and is written

γ_k(P, Q) = ||Pk − Qk||_H.    (1)

We say that k is characteristic [6, 7] if the mapping P ↦ Pk is injective, in which case (1) is zero if and only if P = Q, i.e., γ_k is a metric on P. An immediate application of the MMD is to problems of comparing distributions based on finite samples: examples include tests of homogeneity [8], independence [9], and conditional independence [6]. In this application domain, the question of whether k is characteristic is key: without this property, the algorithms can fail through inability to distinguish between particular distributions.

Characteristic kernels are important in binary classification: The problem of distinguishing distributions is strongly related to binary classification: indeed, one would expect easily distinguishable distributions to be easily classifiable. The link between these two problems is especially direct in the case of the MMD: in Section 2, we show that γ_k is the negative of the optimal risk (corresponding to a linear loss function) associated with the Parzen window classifier [12, 14] (also called the kernel classification rule [4, Chapter 10]), where the Parzen window turns out to be k. We also show that γ_k is an upper bound on the margin of a hard-margin support vector machine (SVM). The importance of using characteristic RKHSs is further underlined by this link: if the property does not hold, then there exist distributions that are unclassifiable in the RKHS H. We further strengthen this by showing that characteristic kernels are necessary (and sufficient under certain conditions) to achieve the Bayes risk in kernel-based classification algorithms.

Characterization of characteristic kernels: Given the centrality of the characteristic property to both RKHS classification and RKHS distribution testing, we should take particular care in establishing which kernels satisfy this requirement. Early results in this direction include [8], where k is shown to be characteristic on compact M if it is universal in the sense of Steinwart [18, Definition 4]; and [6, 7], which address the case of non-compact M, and show that k is characteristic if and only if H + R is dense in the Banach space of p-power (p ≥ 1) integrable functions. The conditions in both these studies can be difficult to check and interpret, however, and the restriction of the first to compact M is limiting. In the case of translation invariant kernels, [17] proved the kernel to be characteristic if and only if the support of the Fourier transform of k is the entire R^d, which is a much easier condition to verify. Similar sufficient conditions are obtained by [7] for translation invariant kernels on groups and semi-groups. In Section 3, we expand the class of characteristic kernels to include kernels that may or may not be translation invariant, with the introduction of a novel criterion: strictly positive definite kernels (see Definition 3) on M are characteristic.
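As a concrete illustration of the embedding distance in (1) (this sketch is not part of the paper), the plug-in estimate of γ_k between two samples can be computed from three Gram matrices. The Gaussian kernel, the function names, and the bandwidth value below are illustrative assumptions.

```python
import numpy as np

def gaussian_gram(X, Y, sigma):
    """Gram matrix of k(x, y) = exp(-sigma * ||x - y||^2) between rows of X and Y."""
    sq = np.sum(X**2, 1)[:, None] + np.sum(Y**2, 1)[None, :] - 2 * X @ Y.T
    return np.exp(-sigma * np.maximum(sq, 0.0))

def mmd(X, Y, sigma=1.0):
    """Plug-in estimate of gamma_k(P_m, Q_n) = ||P_m k - Q_n k||_H."""
    m, n = len(X), len(Y)
    val = (gaussian_gram(X, X, sigma).sum() / m**2
           + gaussian_gram(Y, Y, sigma).sum() / n**2
           - 2 * gaussian_gram(X, Y, sigma).sum() / (m * n))
    return np.sqrt(max(val, 0.0))

# toy usage: two samples in R^2 with slightly different means
rng = np.random.default_rng(0)
X = rng.normal(0.0, 1.0, size=(200, 2))
Y = rng.normal(0.5, 1.0, size=(200, 2))
print(mmd(X, Y, sigma=1.0))
```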
Choice of characteristic kernels: In expanding the families of allowable characteristic kernels, we have so far neglected the question of which characteristic kernel to choose. A practitioner asking by how much two samples differ does not want to receive a blizzard of answers for every conceivable kernel and bandwidth setting, but a single measure that satisfies some reasonable notion of distance across the family of kernels considered. Thus, in Section 4, we propose a generalization of the MMD, yielding a new distance measure between P and Q defined as

γ(P, Q) = sup{γ_k(P, Q) : k ∈ K} = sup{||Pk − Qk||_H : k ∈ K},    (2)

which is the maximal RKHS distance between P and Q over a family, K, of positive definite kernels. For example, K can be the family of Gaussian kernels on R^d indexed by the bandwidth parameter. This distance measure is very natural in the light of our results on binary classification (in Section 2): most directly, it corresponds to the problem of learning the kernel by minimizing the risk of the associated Parzen-based classifier. As a less direct justification, we also increase the upper bound on the margin allowed for a hard-margin SVM between the samples. To apply the generalized MMD in practice, we must ensure its empirical estimator is consistent. In our main result of Section 4, we provide an empirical estimate of γ(P, Q) based on finite samples, and show that many popular kernels like the Gaussian, Laplacian, and the entire Matérn class on R^d yield consistent estimates of γ(P, Q).

(Footnote: There is a subtlety here, since unlike the problem of testing for differences in distributions, classification suffers from slow learning rates; see [4, Chapter 7] for details.)

3 of γ(p, Q). The proof is based on bounding the Radeacher chaos coplexity of K, which can be understood as the U-process equivalent of Radeacher coplexity [3]. Finally, in Section 5, we provide a siple experiental deonstration that the generalized D can be applied in practice to the proble of hoogeneity testing. Specifically, we show that when two distributions differ on particular length scales, the kernel selected by the generalized D is appropriate to this difference, and the resulting hypothesis test outperfors the heuristic kernel choice eployed in earlier studies [8]. The proofs of the results in Sections 2-4 are provided in the appendix. 2 Characteristic Kernels and Binary Classification One of the ost iportant applications of the axiu ean discrepancy is in nonparaetric hypothesis testing [8, 9, 6], where the characteristic property of k is required to distinguish between probability easures. In the following, we show how D naturally appears in binary classification, with reference to the Parzen window classifier and hard-argin SV. This otivates the need for characteristic k to guarantee that classes arising fro different distributions can be classified by kernel-based algoriths. To this end, let us consider the binary classification proble with X being a -valued rando variable, Y being a {, +}-valued rando variable and the product space, {, +}, being endowed with an induced Borel probability easure µ. A discriinant function, f is a real valued easurable function on, whose sign is used to ake a classification decision. Given a loss function L : {, +} R R, the goal is to choose an f that iniizes the risk associated with L, with the optial L-risk being defined as R L F = inf f F { L(y, f(x)) dµ(x, y) = inf ε f F } L (f) dp + ( ε) L (f) dq, (3) where F is the set of all easurable functions on, L (α) := L(, α), L (α) := L(, α), P(X) := µ(x Y = +), Q(X) := µ(x Y = ), ε := µ(, Y = +). Here, P and Q represent the class-conditional distributions and ε is the prior distribution of class +. Now, we present the result that relates γ k to the optial risk associated with the Parzen window classifier. Theore (γ k and Parzen classification). Let L (α) = α ε and L (α) = α ε. Then, γ k(p, Q) = RF L k, where F k = {f : f H } and H is an RKHS with a easurable and bounded k. Suppose {(X i, Y i )} N i=, X i, Y i {, +}, i is a training saple drawn i.i.d. fro µ and = {i : Y i = }. If f F k is an epirical iniizer of (3) (where F is replaced by F k in (3)), then sign( f(x)) = {, Y i = k(x, X i) >, which is the Parzen window classifier. N Y i = k(x, X i) N Y i = k(x, X i) Y i = k(x, X i), (4) Theore shows that γ k is the negative of the optial L-risk (where L is the linear loss as defined in Theore ) associated with the Parzen window classifier. Therefore, if k is not characteristic, which eans γ k (P, Q) = for soe P Q, then RF L k =, i.e., the risk is axiu (note that since γ k (P, Q) = RF L k, the axiu risk is zero). In other words, if k is characteristic, then the axiu risk is obtained only when P = Q. This otivates the iportance of characteristic kernels in binary classification. In the following, we provide another result which provides a siilar otivation for the iportance of characteristic kernels in binary classification, wherein we relate γ k to the argin of a hard-argin SV. Theore 2 (γ k and hard-argin SV). Suppose {(X i, Y i )} N i=, X i, Y i {, +}, i is a training saple drawn i.i.d. fro µ. 
Theorem 2 (γ_k and hard-margin SVM). Suppose {(X_i, Y_i)}_{i=1}^N, X_i ∈ M, Y_i ∈ {−1, +1}, ∀i, is a training sample drawn i.i.d. from µ. Assuming the training sample is separable, let f_sv be the solution to the program inf{||f||_H : Y_i f(X_i) ≥ 1, ∀i}, where H is an RKHS with measurable and bounded k. If k is characteristic, then

1/||f_sv||_H ≤ γ_k(P_m, Q_n)/2,    (5)

where P_m := (1/m) Σ_{Y_i=1} δ_{X_i}, Q_n := (1/n) Σ_{Y_i=−1} δ_{X_i}, m = |{i : Y_i = 1}| and n = N − m. Here δ_x represents the Dirac measure at x.
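To make the link in Theorem 1 concrete, the following small sketch (an illustration, not the authors' code) implements the classifier of Eq. (4): a new point is labelled by comparing the two empirical mean embeddings evaluated at that point. The Gaussian kernel, bandwidth, and names are assumptions of the example.

```python
import numpy as np

def gaussian_gram(X, Y, sigma=1.0):
    sq = np.sum(X**2, 1)[:, None] + np.sum(Y**2, 1)[None, :] - 2 * X @ Y.T
    return np.exp(-sigma * np.maximum(sq, 0.0))

def parzen_predict(x_new, X, y, sigma=1.0):
    """Eq. (4): sign of the difference between the class mean embeddings at x_new."""
    pos, neg = X[y == 1], X[y == -1]
    f = (gaussian_gram(x_new, pos, sigma).mean(axis=1)
         - gaussian_gram(x_new, neg, sigma).mean(axis=1))
    return np.where(f > 0, 1, -1)

# toy usage on a two-class training sample in R^2
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(-1, 1, (100, 2)), rng.normal(+1, 1, (100, 2))])
y = np.array([1] * 100 + [-1] * 100)
print(parzen_predict(np.array([[2.0, 2.0], [-2.0, -2.0]]), X, y))
```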

Theorem 2 provides a bound on the margin of the hard-margin SVM in terms of the MMD. (5) shows that a smaller MMD between P_m and Q_n enforces a smaller margin (i.e., a less smooth classifier, f_sv, where smoothness is measured as ||f_sv||_H). We can observe that the bound in (5) may be loose if the number of support vectors is small. Suppose k is not characteristic; then γ_k(P_m, Q_n) can be zero for P_m ≠ Q_n and therefore the margin is zero, which means even unlike distributions can become inseparable in this feature representation.

Another justification for using characteristic kernels in kernel-based classification algorithms can be provided by studying the conditions on H for which the Bayes risk is realized for all µ. Steinwart and Christmann [19, Corollary 5.37] have shown that under certain conditions on L, the Bayes risk is achieved for all µ if and only if H is dense in L^p(M, η) for all η, where η = εP + (1 − ε)Q. Here, L^p(M, η) represents the Banach space of p-power integrable functions, where p ∈ [1, ∞) depends on the loss function, L. Denseness of H in L^p(M, η) implies H + R is dense in L^p(M, η), which therefore yields that k is characteristic [6, 7]. On the other hand, if constant functions are included in H, then it is easy to show that the characteristic property of k is also sufficient to achieve the Bayes risk. As an example, it can be shown that characteristic kernels are necessary (and sufficient if constant functions are in H) for SVMs to achieve the Bayes risk [19, Example 5.4]. Therefore, the characteristic property of k is fundamental in kernel-based classification algorithms. Having shown how characteristic kernels play a role in kernel-based classification, in the following section we provide a novel characterization for them.

3 Novel Characterization for Characteristic Kernels

A positive definite (pd) kernel, k, is said to be characteristic to P if and only if γ_k(P, Q) = 0 ⇔ P = Q, for all P, Q ∈ P. The following result provides a novel characterization for characteristic kernels, which shows that strictly pd kernels are characteristic to P. An advantage of this characterization is that it holds for an arbitrary topological space M, unlike the earlier characterizations where a group structure on M is assumed [7, 17]. First, we define strictly pd kernels as follows.

Definition 3 (Strictly positive definite kernels). Let M be a topological space. A measurable and bounded kernel, k, is said to be strictly positive definite if and only if ∫∫ k(x, y) dµ(x) dµ(y) > 0 for all finite non-zero signed Borel measures, µ, defined on M.

Note that the above definition is not equivalent to the usual definition of strictly pd kernels that involves finite sums [19, Definition 4.5]. The above definition is a generalization of integrally strictly positive definite functions [20, Section 6]: ∫∫ k(x, y) f(x) f(y) dx dy > 0 for all f ∈ L²(R^d), which is the strict positive definiteness of the integral operator given by the kernel. Definition 3 is stronger than the finite sum definition, as [19, Theorem 4.62] shows a kernel that is strictly pd in the finite sum sense but not in the integral sense.

Theorem 4 (Strictly pd kernels are characteristic). If k is strictly positive definite on M, then k is characteristic to P.

The proof idea is to derive necessary and sufficient conditions for a kernel not to be characteristic. We show that choosing k to be strictly pd violates these conditions and k is therefore characteristic to P. Examples of strictly pd kernels on R^d include exp(−σ||x − y||²₂), σ > 0; exp(−σ||x − y||₁), σ > 0; (c² + ||x − y||²₂)^{−β}, β > 0, c > 0; B_{2l+1}-splines, etc.
Note that k̃(x, y) = f(x) k(x, y) f(y) is a strictly pd kernel if k is strictly pd, where f : M → R is a bounded continuous function. Therefore, translation-variant strictly pd kernels can be obtained by choosing k to be a translation invariant strictly pd kernel. A simple example of a translation-variant kernel that is strictly pd on compact sets of R^d is k̃(x, y) = exp(σ xᵀy), σ > 0, where we have chosen f(·) = exp(σ||·||²₂/2) and k(x, y) = exp(−σ||x − y||²₂/2), σ > 0. Therefore, k̃ is characteristic on compact sets of R^d, which is the same result that follows from the universality of k̃ [18, Section 3, Example 1]. The following result from [13], which is based on the usual definition of strictly pd kernels, can be obtained as a corollary to Theorem 4.

Corollary 5 ([13]). Let X = {x_i}_{i=1}^m, Y = {y_j}_{j=1}^n, and assume that x_i ≠ x_j, y_i ≠ y_j, ∀i ≠ j. Suppose k is strictly positive definite. Then Σ_{i=1}^m α_i k(·, x_i) = Σ_{j=1}^n β_j k(·, y_j) for some α_i, β_j ∈ R \ {0} if and only if X = Y.

Suppose we choose α_i = 1/m, ∀i, and β_j = 1/n, ∀j, in Corollary 5. Then Σ_{i=1}^m α_i k(·, x_i) and Σ_{j=1}^n β_j k(·, y_j) represent the mean functions in H. Note that the Parzen classifier in (4) is a mean classifier (one that separates the mean functions) in H, i.e., sign(⟨k(·, x), w⟩_H), where w = (1/m) Σ_{i=1}^m k(·, x_i) − (1/n) Σ_{i=1}^n k(·, y_i).

Suppose k is strictly pd (more generally, suppose k is characteristic). Then, by Corollary 5, the normal vector, w, to the hyperplane in H passing through the origin is zero, i.e., the mean functions coincide (and are therefore not classifiable), if and only if X = Y.

4 Generalizing the MMD for Classes of Characteristic Kernels

The discussion so far has been related to the characteristic property of k that makes γ_k a metric on P. We have seen that this characteristic property is of prime importance both in distribution testing, and to ensure classifiability of dissimilar distributions in the RKHS. We have not yet addressed how to choose among a selection/family of characteristic kernels, given a particular pair of distributions we wish to discriminate between. We introduce one approach to this problem in the present section.

Let M = R^d and k_σ(x, y) = exp(−σ||x − y||²₂), σ ∈ R_+, where σ represents the bandwidth parameter. {k_σ : σ ∈ R_+} is the family of Gaussian kernels and {γ_{k_σ} : σ ∈ R_+} is the family of MMDs indexed by the kernel parameter, σ. Note that k_σ is characteristic for any σ ∈ R_{++} and therefore γ_{k_σ} is a metric on P for any σ ∈ R_{++}. However, in practice, one would prefer a single number that defines the distance between P and Q. The question therefore to be addressed is how to choose an appropriate σ. The choice of σ has important implications for the statistical behavior of γ_{k_σ}. Note that as σ → 0, k_σ → 1 and as σ → ∞, k_σ → 0 a.e., which means γ_{k_σ}(P, Q) → 0 as σ → 0 or σ → ∞ for all P, Q ∈ P (this behavior is also exhibited by k_σ(x, y) = exp(−σ||x − y||₁) and k_σ(x, y) = σ²/(σ² + ||x − y||²₂), which are also characteristic). This means choosing sufficiently small or sufficiently large σ (depending on P and Q) makes γ_{k_σ}(P, Q) arbitrarily small. Therefore, σ has to be chosen appropriately in applications to effectively distinguish between P and Q. Presently, the applications involving MMD set σ heuristically [8, 9].

To generalize the MMD to families of kernels, we propose the following modification to γ_k, which yields a pseudometric on P,

γ(P, Q) = sup{γ_k(P, Q) : k ∈ K} = sup{||Pk − Qk||_H : k ∈ K}.    (6)

Note that γ is the maximal RKHS distance between P and Q over a family, K, of positive definite kernels. It is easy to check that if any k ∈ K is characteristic, then γ is a metric on P. Examples for K include: K_g := {e^{−σ||x−y||²₂}, x, y ∈ R^d : σ ∈ R_+}; K_l := {e^{−σ||x−y||₁}, x, y ∈ R^d : σ ∈ R_+}; K_ψ := {e^{−σψ(x,y)}, x, y ∈ M : σ ∈ R_+}, where ψ : M × M → R is a negative definite kernel; K_rbf := {∫_0^∞ e^{−λ||x−y||²₂} dµ_σ(λ), x, y ∈ R^d, µ_σ ∈ M_+ : σ ∈ Σ ⊂ R^d}, where M_+ is the set of all finite nonnegative Borel measures, µ_σ, on R_+ that are not concentrated at zero; etc.

The proposal of γ(P, Q) in (6) can be motivated by the connection that we have established in Section 2 between γ_k and the Parzen window classifier. Since the Parzen window classifier depends on the kernel, k, one can propose to learn the kernel as in support vector machines [10], wherein the kernel is chosen such that R^L_{F_k} in Theorem 1 is minimized over k ∈ K, i.e., inf_{k∈K} R^L_{F_k} = −sup_{k∈K} γ_k(P, Q) = −γ(P, Q). A similar motivation for γ can be provided based on (5), as learning the kernel in a hard-margin SVM by maximizing its margin.

At this point, we briefly discuss the issue of normalized vs. unnormalized kernel families, K, in (6). We say a translation-invariant kernel, k, on R^d is normalized if ∫_{R^d} ψ(y) dy = c (some positive constant independent of the kernel parameter), where k(x, y) = ψ(x − y).
K is a normalized kernel family if every kernel in K is normalized; if K is not normalized, we say it is unnormalized. For example, it is easy to see that K_g and K_l are unnormalized kernel families. Let us consider the normalized Gaussian family, K_g^n = {(πσ)^{−d/2} e^{−||x−y||²₂/σ}, x, y ∈ R^d : σ ∈ [σ_0, ∞)}, σ_0 > 0, in which every kernel integrates to one. It can be shown that for any k_σ, k_τ ∈ K_g^n with σ_0 ≤ σ < τ < ∞, we have γ_{k_σ}(P, Q) ≥ γ_{k_τ}(P, Q), which means γ(P, Q) = γ_{k_{σ_0}}(P, Q). Therefore, the generalized MMD reduces to a single kernel MMD. A similar result also holds for the normalized inverse-quadratic kernel family, {2σ/(π(σ² + (x − y)²)), x, y ∈ R : σ ∈ [σ_0, ∞)}. These examples show that the generalized MMD definition is usually not very useful if K is a normalized kernel family. In addition, σ_0 should be chosen beforehand, which is equivalent to heuristically setting the kernel parameter in γ_k. Note that σ_0 cannot be zero because in the limiting case of σ_0 → 0, the kernels approach a Dirac distribution, which means the limiting kernel is not bounded and therefore the definition of the MMD in (1) does not hold. So, in this work, we consider unnormalized kernel families to render the definition of the generalized MMD in (6) useful.
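As an illustration of (6) for the unnormalized Gaussian family K_g (a sketch, not the paper's implementation), the generalized MMD can be approximated by maximizing the single-kernel estimate over a finite grid of candidate bandwidths; the grid is an assumption that stands in for the supremum over the continuous family.

```python
import numpy as np

def mmd_sq(X, Y, sigma):
    """Squared empirical MMD for the unnormalized Gaussian kernel e^{-sigma*||x-y||^2}."""
    def gram(A, B):
        sq = np.sum(A**2, 1)[:, None] + np.sum(B**2, 1)[None, :] - 2 * A @ B.T
        return np.exp(-sigma * np.maximum(sq, 0.0))
    m, n = len(X), len(Y)
    return (gram(X, X).sum() / m**2 + gram(Y, Y).sum() / n**2
            - 2 * gram(X, Y).sum() / (m * n))

def generalized_mmd(X, Y, sigmas):
    """gamma(P_m, Q_n): maximum of gamma_{k_sigma}(P_m, Q_n) over the listed bandwidths."""
    vals = [np.sqrt(max(mmd_sq(X, Y, s), 0.0)) for s in sigmas]
    best = int(np.argmax(vals))
    return vals[best], sigmas[best]

rng = np.random.default_rng(0)
X = rng.normal(0, 1.0, (300, 1))
Y = rng.normal(0, 1.5, (300, 1))
gamma, sigma_star = generalized_mmd(X, Y, sigmas=2.0 ** np.arange(-3, 7))
print(gamma, sigma_star)
```

The exact one-dimensional optimization over σ corresponding to this grid search is given as Example 1 below.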

To use γ in statistical applications where P and Q are known only through i.i.d. samples {X_i}_{i=1}^m and {Y_j}_{j=1}^n respectively, we require its estimator γ(P_m, Q_n) to be consistent, where P_m and Q_n represent the empirical measures based on {X_i}_{i=1}^m and {Y_j}_{j=1}^n. For k measurable and bounded, [8, 15] have shown that γ_k(P_m, Q_n) is a √(mn/(m + n))-consistent estimator of γ_k(P, Q). The statistical consistency of γ(P_m, Q_n) is established in the following theorem, which uses tools from U-process theory [3, Chapters 3, 5]. We begin with the following definition.

Definition 6 (Rademacher chaos). Let G be a class of functions on M × M and {ρ_i}_{i=1}^n be independent Rademacher random variables, i.e., Pr(ρ_i = 1) = Pr(ρ_i = −1) = 1/2. The homogeneous Rademacher chaos process of order two with respect to {ρ_i}_{i=1}^n is defined as {(1/n) Σ_{i<j} ρ_i ρ_j g(x_i, x_j) : g ∈ G} for some {x_i}_{i=1}^n. The Rademacher chaos complexity over G is defined as

U_n(G; {x_i}_{i=1}^n) := E_ρ sup_{g∈G} | (1/n) Σ_{i<j} ρ_i ρ_j g(x_i, x_j) |.    (7)

We now provide the main result of the present section.

Theorem 7 (Consistency of γ(P_m, Q_n)). Let every k ∈ K be measurable and bounded with ν := sup_{k∈K, x∈M} k(x, x) < ∞. Then, with probability at least 1 − δ, |γ(P_m, Q_n) − γ(P, Q)| ≤ A, where

A = √(16 U_m(K; {X_i})/m) + √(16 U_n(K; {Y_i})/n) + (√(8ν) + √(36ν log(4/δ))) √((m + n)/(mn)).    (8)

From (8), it is clear that if U_m(K; {X_i}) = O_P(1) and U_n(K; {Y_i}) = O_Q(1), then γ(P_m, Q_n) → γ(P, Q) a.s. The following result provides a bound on U_m(K; {X_i}) in terms of the entropy integral.

Lemma 8 (Entropy bound). For any K as in Theorem 7, there exists a universal constant C such that

U_m(K; {X_i}_{i=1}^m) ≤ C ∫_0^ν √(log N(K, D, ε)) dε,    (9)

where D(k_1, k_2) = (1/m) [Σ_{i<j} (k_1(X_i, X_j) − k_2(X_i, X_j))²]^{1/2} and N(K, D, ε) represents the ε-covering number of K with respect to the metric D.

Assuming K to be a VC-subgraph class, the following result, as a corollary to Lemma 8, provides an estimate of U_m(K; {X_i}_{i=1}^m). Before presenting the result, we first provide the definition of a VC-subgraph class.

Definition 9 (VC-subgraph class). The subgraph of a function g : M → R is the subset of M × R given by {(x, t) : t < g(x)}. A collection G of measurable functions on a sample space is called a VC-subgraph class if the collection of all subgraphs of the functions in G forms a VC-class of sets (in M × R). The VC-index (also called the VC-dimension) of a VC-subgraph class, G, is the same as the pseudodimension of G. See [1] for details.

Corollary 10 (U_m(K; {X_i}) for VC-subgraph K). Suppose K is a VC-subgraph class with V(K) being the VC-index. Assume K satisfies the conditions in Theorem 7. Then

U_m(K; {X_i}) ≤ C ν √(log(C_1 V(K) (16e^9)^{V(K)})),    (10)

for some universal constants C and C_1.

Using (10) in (8), we have |γ(P_m, Q_n) − γ(P, Q)| = O_{P,Q}(√((m + n)/(mn))) and, by the Borel–Cantelli lemma, |γ(P_m, Q_n) − γ(P, Q)| → 0 a.s. Now, the question reduces to which of the kernel classes, K, have V(K) < ∞. [22, Lemma 2] showed that V(K_g) = 1 (also see [23]) and U_m(K_rbf) ≤ C_2 U_m(K_g), where C_2 < ∞. It can be shown that V(K_ψ) = 1 and V(K_l) = 1. All these classes satisfy the conditions of Theorem 7 and Corollary 10 and therefore provide consistent estimates of γ(P, Q) for any P, Q ∈ P. Examples of kernels on R^d that are covered by these classes include the Gaussian, Laplacian, inverse multiquadratics, Matérn class, etc. Other choices for K that are popular in machine learning are linear combinations of kernels, K_lin := {k_λ = Σ_{i=1}^l λ_i k_i : k_λ is pd, Σ_{i=1}^l λ_i = 1} and K_con := {k_λ = Σ_{i=1}^l λ_i k_i : λ_i ≥ 0, Σ_{i=1}^l λ_i = 1}. [16, Lemma 7] have shown that V(K_con) ≤ V(K_lin) ≤ l.
Therefore, instead of using a class based on a fixed, parameterized kernel, one can also use a finite linear combination of kernels to compute γ.

So far, we have presented the metric property and the statistical consistency (of the empirical estimator) of γ. Now, the question is how to compute γ(P_m, Q_n) in practice. To show this, we present two examples.

Example 1. Suppose K = K_g. Then γ(P_m, Q_n) can be written as

γ²(P_m, Q_n) = sup_{σ∈R_+} [ (1/m²) Σ_{i,j=1}^m e^{−σ||X_i−X_j||²₂} + (1/n²) Σ_{i,j=1}^n e^{−σ||Y_i−Y_j||²₂} − (2/(mn)) Σ_{i=1}^m Σ_{j=1}^n e^{−σ||X_i−Y_j||²₂} ].    (11)

The optimum σ can be obtained by solving (11), and γ(P_m, Q_n) = ||P_m k_σ − Q_n k_σ||_{H_σ} at that optimum.

Example 2. Suppose K = K_con. Then γ(P_m, Q_n) becomes

γ²(P_m, Q_n) = sup_{k∈K_con} ||P_m k − Q_n k||²_H = sup_{k∈K_con} ∫∫ k d(P_m − Q_n) d(P_m − Q_n) = sup{λᵀa : λᵀ1 = 1, λ ≥ 0},    (12)

where we have replaced k by Σ_{i=1}^l λ_i k_i. Here λ = (λ_1, ..., λ_l) and (a)_i = ||P_m k_i − Q_n k_i||²_{H_i} = (1/m²) Σ_{a,b=1}^m k_i(X_a, X_b) + (1/n²) Σ_{a,b=1}^n k_i(Y_a, Y_b) − (2/(mn)) Σ_{a=1}^m Σ_{b=1}^n k_i(X_a, Y_b). It is easy to see that γ²(P_m, Q_n) = max_{1≤i≤l} (a)_i. Similar examples can be provided for other K, where γ(P_m, Q_n) can be computed by solving a semidefinite program (K = K_lin) or by constrained gradient descent (K = K_l, K_rbf).

Finally, while the approach in (6) to generalizing γ_k is our focus in this paper, an alternative Bayesian strategy would be to define a non-negative finite measure λ over K, and to average γ_k over that measure, i.e., β(P, Q) := ∫_K γ_k(P, Q) dλ(k). This also yields a pseudometric on P. That said, β(P, Q) ≤ λ(K) γ(P, Q), ∀P, Q, which means if P and Q can be distinguished by β, they can be distinguished by γ, but not vice-versa. In this sense, γ is stronger than β. One further complication with the Bayesian approach is in defining a sensible λ over K. Note that γ_{k_1} (the single kernel MMD based on k_1) can be obtained by defining λ(k) = δ(k − k_1) in β(P, Q).
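For the convex-combination family of Example 2 above, the computation reduces to a maximum over the per-kernel squared MMDs (a)_i. The sketch below (not from the paper) uses three fixed Gaussian base kernels as purely illustrative choices of k_1, ..., k_l.

```python
import numpy as np

def sqdist(A, B):
    return np.maximum(np.sum(A**2, 1)[:, None] + np.sum(B**2, 1)[None, :] - 2 * A @ B.T, 0.0)

def mmd_sq(X, Y, kernel):
    """(a)_i of Example 2: squared MMD under one base kernel."""
    m, n = len(X), len(Y)
    return (kernel(X, X).sum() / m**2 + kernel(Y, Y).sum() / n**2
            - 2 * kernel(X, Y).sum() / (m * n))

# hypothetical base kernels k_1, ..., k_l: Gaussians with a few fixed bandwidths
kernels = [lambda A, B, s=s: np.exp(-s * sqdist(A, B)) for s in (0.1, 1.0, 10.0)]

rng = np.random.default_rng(0)
X, Y = rng.normal(0, 1, (200, 1)), rng.normal(0.3, 1, (200, 1))
a = np.array([mmd_sq(X, Y, k) for k in kernels])
gamma_sq = a.max()   # gamma^2(P_m, Q_n) = max_i (a)_i when K = K_con
print(a, gamma_sq)
```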
5 Experiments

In this section, we present a benchmark experiment illustrating that the generalized MMD proposed in Section 4 is preferable to the single kernel MMD with a heuristically set kernel parameter. The experimental setup is as follows. Let p = N(0, σ_p²), a normal distribution in R with zero mean and variance σ_p². Let q be the perturbed version of p, given as q(x) = p(x)(1 + sin(νx)). Here p and q are the densities associated with P and Q respectively. It is easy to see that q differs from p at increasing frequencies with increasing ν. Let k(x, y) = exp(−(x − y)²/σ). The goal is, given random samples drawn i.i.d. from P and Q (with ν fixed), to test H_0 : P = Q vs. H_1 : P ≠ Q. The idea is that as ν increases, it will be harder to distinguish between P and Q for a fixed sample size. Therefore, using this setup, we can verify whether the adaptive bandwidth selection achieved by γ (as the test statistic) helps to distinguish between P and Q at higher ν compared to γ_k with a heuristic σ. To this end, using γ(P_m, Q_n) and γ_k(P_m, Q_n) (with various σ) as test statistics T_{m,n}, we design a test that returns H_0 if T_{m,n} ≤ c_{m,n}, and H_1 otherwise. The problem therefore reduces to finding c_{m,n}, which is determined as the (1 − α) quantile of the distribution of T_{m,n} under H_0; this fixes the type-I error (the probability of rejecting H_0 when it is true) to α. The consistency of this test under γ_k (for any fixed σ) is proved in [8]. A similar result can be shown for γ under some conditions on K; we skip the details here. In our experiments, we fix m = n and σ_p², and draw two sets of independent random samples from Q. The distribution of T_{m,n} under H_0 is estimated by bootstrapping on these samples, and the associated 95th quantile (we choose α = 0.05) is computed. Since the performance of the test is judged by its type-II error (the probability of accepting H_0 when H_1 is true), we draw a random sample, one each from P and Q, and test whether P = Q. This process is repeated many times, and estimates of type-I and type-II errors are obtained for both γ and γ_k. 14 different values for σ are considered: values on a logarithmic scale of base 2 with exponents (−3, −2, −1, 0, 1, 3/2, 2, 5/2, 3, 7/2, 4, 5, 6), along with the median distance between samples as one more choice. 5 different choices for ν are considered: (1/2, 3/4, 1, 5/4, 3/2).
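The following sketch shows the shape of such a test. It is not the authors' implementation: the null distribution of the statistic is estimated here by permuting the pooled sample rather than by the bootstrap on two independent samples from Q described above, and all names are illustrative.

```python
import numpy as np

def test_homogeneity(X, Y, statistic, alpha=0.05, n_resamples=200, seed=0):
    """Reject H0: P = Q when statistic(X, Y) exceeds the (1 - alpha) quantile
    of its estimated null distribution (obtained by permuting the pooled sample)."""
    rng = np.random.default_rng(seed)
    t_obs = statistic(X, Y)
    pooled = np.vstack([X, Y])
    m = len(X)
    null = []
    for _ in range(n_resamples):
        idx = rng.permutation(len(pooled))
        null.append(statistic(pooled[idx[:m]], pooled[idx[m:]]))
    threshold = np.quantile(null, 1 - alpha)
    return t_obs > threshold, t_obs, threshold

# usage, with the generalized MMD from the earlier sketch as the statistic:
# reject, t, c = test_homogeneity(
#     X, Y, lambda A, B: generalized_mmd(A, B, 2.0 ** np.arange(-3, 7))[0])
```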

Figure 1: (a) Type-I and type-II errors (in %) for γ for varying ν. (b, c) Type-I and type-II errors (in %) for γ_k (with different σ) for varying ν. The dotted line in (c) corresponds to the median heuristic, which shows that its associated type-II error is very large at large ν. (d) Box plot of log σ grouped by ν, where σ is selected by γ. (e) Box plot of the median distance between points (which is also a choice for σ), grouped by ν. Refer to Section 5 for details.

Figure 1(a) shows the estimated type-I and type-II errors using γ as the test statistic for varying ν. Note that the type-I error is close to its design value of 5%, while the type-II error is zero for all ν, which means γ distinguishes between P and Q for all perturbations. Figures 1(b, c) show the estimates of type-I and type-II errors using γ_k as the test statistic for different σ and ν. Figure 1(d) shows the box plot for log σ, grouped by ν, where σ is the bandwidth selected by γ. Figure 1(e) shows the box plot of the median distance between points (which is also a choice for σ), grouped by ν. From Figures 1(c) and 1(e), it is easy to see that the median heuristic exhibits high type-II error for ν = 3/2, while γ exhibits zero type-II error (from Figure 1(a)). Figure 1(c) also shows that heuristic choices of σ can result in high type-II errors. It is intuitive that as ν increases (which means the characteristic function of Q differs from that of P at higher frequencies), a smaller σ is needed to detect these changes. The advantage of using γ is that it selects σ in a distribution-dependent fashion, and its behavior in the box plot shown in Figure 1(d) matches the above intuition about the behavior of σ with respect to ν. These results demonstrate the validity of using γ as a distance measure in applications.

6 Conclusions

In this work, we have shown how the MMD appears in binary classification, and thus that characteristic kernels are important in kernel-based classification algorithms. We have broadened the class of characteristic RKHSs to include those induced by strictly positive definite kernels (with particular application to kernels on non-compact domains, and/or kernels that are not translation invariant). We have further provided a convergent generalization of the MMD over families of kernel functions, which becomes necessary even when considering relatively simple families of kernels (such as the Gaussian kernels parameterized by their bandwidth). The usefulness of the generalized MMD is illustrated experimentally with a two-sample testing problem.

Acknowledgments

The authors thank the anonymous reviewers for their constructive comments, and especially the reviewer who pointed out the connection between characteristic kernels and the achievability of the Bayes risk. B. K. S. was supported by the MPI for Biological Cybernetics, the National Science Foundation (grant DMS-MSPA 0625409), the Fair Isaac Corporation and the University of California MICRO program. A. G. was supported by grants from DARPA IPTO, ONR MURI N00014-07-1-0747, and ARO MURI.

A Proofs

We provide proofs for the results in Sections 2-4.

A.1 Proof of Theorem 1

To prove Theorem 1, we need the following result from [17].

Theorem 13 ([17]). Let F_k := {f : ||f||_H ≤ 1}, where (H, k) is an RKHS defined on a measurable space M with k measurable and bounded. Then

γ_k(P, Q) = sup_{f∈F_k} |Pf − Qf| = ||Pk − Qk||_H,    (13)

where ||·||_H represents the RKHS norm.

Proof of Theorem 1: From (3), we have

ε ∫ L_1(f) dP + (1 − ε) ∫ L_{−1}(f) dQ = −∫ f dP + ∫ f dQ = Qf − Pf.

Therefore, R^L_{F_k} = inf_{f∈F_k} (Qf − Pf) = −sup_{f∈F_k} (Pf − Qf) = −sup_{f∈F_k} |Pf − Qf| = −γ_k(P, Q), which follows from Theorem 13. Given {(X_i, Y_i)}_{i=1}^N drawn i.i.d. from µ, the empirical equivalent of (3) is given by

inf { −(1/m) Σ_{Y_i=1} f(X_i) + (1/(N − m)) Σ_{Y_i=−1} f(X_i) : f ∈ F_k }.

Solving this for f gives

f̂ = [ (1/m) Σ_{Y_i=1} k(·, X_i) − (1/(N − m)) Σ_{Y_i=−1} k(·, X_i) ] / || (1/m) Σ_{Y_i=1} k(·, X_i) − (1/(N − m)) Σ_{Y_i=−1} k(·, X_i) ||_H,

and the result in (4) follows.

A.2 Proof of Theorem 2

Before we prove Theorem 2, we present a lemma which we will use in the proof.

Lemma 14. Let θ : V → R and ψ : V → R be convex functions on a real vector space V. Suppose

a = sup{θ(x) : ψ(x) ≤ b},    (14)

where θ is not constant on {x : ψ(x) ≤ b} and a < ∞. Then

b = inf{ψ(x) : θ(x) ≥ a}.    (15)

We need the following result from [11, Theorem 32.1] to prove Lemma 14.

Theorem 15 ([11]). Let f be a convex function, and let C be a convex set contained in the domain of f. If f attains its supremum relative to C at some point of the relative interior of C, then f is constant throughout C.

Proof of Lemma 14: Note that A := {x : ψ(x) ≤ b} is a convex subset of V. Since θ is not constant on A, by Theorem 15, θ attains its supremum on the boundary of A. Therefore, any solution, x*, to (14) satisfies θ(x*) = a and ψ(x*) = b. Let G := {x : θ(x) > a}. For any x ∈ G, ψ(x) > b; if this were not the case, then x* would not be a solution to (14). Let H := {x : θ(x) = a}. Clearly, x* ∈ H, and so there exists an x ∈ H for which ψ(x) = b. Suppose inf{ψ(x) : x ∈ H} = c < b, which means that for some x ∈ H, ψ(x) < b, i.e., x lies in the relative interior of A. From (14), this implies θ attains its supremum relative to A at some point of the relative interior of A. By Theorem 15, this implies θ is constant on A, leading to a contradiction. Therefore, inf{ψ(x) : x ∈ H} = b and the result in (15) follows.

Proof of Theorem 2: By Theorem 13, γ_k(P, Q) = sup{Pf − Qf : ||f||_H ≤ 1}. Note that Pf − Qf and ||f||_H are convex functionals on H. For P ≠ Q, Pf − Qf is not constant on F_k, since k is characteristic. Therefore, by Lemma 14, we have

1 = inf{||f||_H : Pf − Qf ≥ γ_k(P, Q), f ∈ H}.

Since this holds for all P ≠ Q, it holds for P_m and Q_n. Therefore, we have

2/γ_k(P_m, Q_n) = inf{||f||_H : P_m f − Q_n f ≥ 2, f ∈ H}.

Consider

{f ∈ H : P_m f − Q_n f ≥ 2} = {f ∈ H : (1/m) Σ_{Y_i=1} f(X_i) − (1/n) Σ_{Y_i=−1} f(X_i) ≥ 2} ⊇ {f ∈ H : Y_i f(X_i) ≥ 1, ∀i},

which means 2/γ_k(P_m, Q_n) ≤ ||f_sv||_H.

A.3 Proof of Theorem 4

To prove Theorem 4, we need the following lemma, which provides necessary and sufficient conditions for a kernel not to be characteristic.

Lemma 16. Let k be measurable and bounded on M. Then there exist P ≠ Q, P, Q ∈ P, such that γ_k(P, Q) = 0 if and only if there exists a finite non-zero signed Borel measure µ that satisfies: (i) ∫∫ k(x, y) dµ(x) dµ(y) = 0, (ii) µ(M) = 0.

Proof. (⇐) Suppose there exists a finite non-zero signed Borel measure, µ, that satisfies (i) and (ii) in Lemma 16. By the Jordan decomposition theorem [5, Theorem 5.6.1], there exist unique positive measures µ⁺ and µ⁻ such that µ = µ⁺ − µ⁻ and µ⁺ ⊥ µ⁻ (µ⁺ and µ⁻ are singular). By (ii), we have µ⁺(M) = µ⁻(M) =: α. Define P = α⁻¹µ⁺ and Q = α⁻¹µ⁻. Clearly, P ≠ Q and P, Q ∈ P. Then,

γ_k²(P, Q) = ||Pk − Qk||²_H = α⁻² ||µk||²_H = α⁻² ⟨µk, µk⟩_H.    (16)

From the proof of Theorem 13 (see Theorem 3 in [17]), we have ⟨µk, µk⟩_H = ∫∫ k(x, y) dµ(x) dµ(y), and therefore, by (i), γ_k(P, Q) = 0. So, we have constructed P ≠ Q such that γ_k(P, Q) = 0.

(⇒) Suppose there exist P ≠ Q, P, Q ∈ P, such that γ_k(P, Q) = 0. Let µ = P − Q. Clearly µ is a finite non-zero signed Borel measure that satisfies µ(M) = 0. Note that γ_k²(P, Q) = ||Pk − Qk||²_H = ||µk||²_H = ∫∫ k(x, y) dµ(x) dµ(y), and therefore (i) follows.

Proof of Theorem 4: Since k is strictly pd on M, we have ∫∫ k(x, y) dη(x) dη(y) > 0 for any finite non-zero signed Borel measure η. This means there does not exist a finite non-zero signed Borel measure that satisfies (i) in Lemma 16. Therefore, by Lemma 16, there do not exist P ≠ Q, P, Q ∈ P, such that γ_k(P, Q) = 0, which implies k is characteristic to P.

A.4 Proof of Corollary 5

Consider Σ_{i=1}^m α_i k(·, x_i) = ∫ k(·, x) dµ_X(x) = µ_X k, where µ_X = Σ_{i=1}^m α_i δ_{x_i}. Here δ_{x_i} represents the Dirac measure at x_i. So, we have µ_X k = µ_Y k, which is equivalent to

∫∫ k(x, y) d(µ_X − µ_Y)(x) d(µ_X − µ_Y)(y) = 0.    (17)

Since k is strictly pd, by Lemma 16, we have µ_X k = µ_Y k ⇔ µ_X = µ_Y ⇔ X = Y.

A.5 Proof of Theorem 7

Consider

|γ(P_m, Q_n) − γ(P, Q)| = | sup_{k∈K} ||P_m k − Q_n k||_H − sup_{k∈K} ||Pk − Qk||_H |
≤ sup_{k∈K} | ||P_m k − Q_n k||_H − ||Pk − Qk||_H |
≤ sup_{k∈K} ||P_m k − Q_n k − Pk + Qk||_H
≤ sup_{k∈K} [ ||P_m k − Pk||_H + ||Q_n k − Qk||_H ]
≤ sup_{k∈K} ||P_m k − Pk||_H + sup_{k∈K} ||Q_n k − Qk||_H.    (18)

We now bound the terms sup_{k∈K} ||P_m k − Pk||_H and sup_{k∈K} ||Q_n k − Qk||_H. Since sup_{k∈K} ||P_m k − Pk||_H satisfies the bounded difference property, McDiarmid's inequality gives that, with probability at least 1 − δ/4 over the choice of {X_i},

sup_{k∈K} ||P_m k − Pk||_H ≤ E sup_{k∈K} ||P_m k − Pk||_H + √(2ν log(4/δ)/m).    (19)

By invoking symmetrization for E sup_{k∈K} ||P_m k − Pk||_H, we have

E sup_{k∈K} ||P_m k − Pk||_H ≤ 2 E E_ρ sup_{k∈K} || (1/m) Σ_{i=1}^m ρ_i k(·, X_i) ||_H,    (20)

where {ρ_i}_{i=1}^m represent i.i.d. Rademacher random variables and E_ρ represents the expectation w.r.t. {ρ_i} conditioned on {X_i}. Since E_ρ sup_{k∈K} ||(1/m) Σ_{i=1}^m ρ_i k(·, X_i)||_H satisfies the bounded difference property, by McDiarmid's inequality, with probability at least 1 − δ/4 over the choice of the random sample of size m, we have

E E_ρ sup_{k∈K} || (1/m) Σ_{i=1}^m ρ_i k(·, X_i) ||_H ≤ E_ρ sup_{k∈K} || (1/m) Σ_{i=1}^m ρ_i k(·, X_i) ||_H + √(2ν log(4/δ)/m).    (21)

By writing

|| (1/m) Σ_{i=1}^m ρ_i k(·, X_i) ||²_H = (1/m²) Σ_{i,j=1}^m ρ_i ρ_j k(X_i, X_j) ≤ (2/m²) | Σ_{i<j} ρ_i ρ_j k(X_i, X_j) | + ν/m,    (22)

we have that, with probability at least 1 − δ/4, the following holds:

E E_ρ sup_{k∈K} || (1/m) Σ_{i=1}^m ρ_i k(·, X_i) ||_H ≤ √(2 U_m(K; {X_i})/m + ν/m) + √(2ν log(4/δ)/m).    (23)

Tying (19)-(23) together, we have that, with probability at least 1 − δ/2 over the choice of {X_i}, the following holds:

sup_{k∈K} ||P_m k − Pk||_H ≤ √(8 U_m(K; {X_i})/m) + 2√(ν/m) + √(8ν log(4/δ)/m).    (24)

Performing a similar analysis for sup_{k∈K} ||Q_n k − Qk||_H, we have that, with probability at least 1 − δ/2 over the choice of {Y_i},

sup_{k∈K} ||Q_n k − Qk||_H ≤ √(8 U_n(K; {Y_i})/n) + 2√(ν/n) + √(8ν log(4/δ)/n).    (25)

Using (24) and (25) with √a + √b ≤ √(2(a + b)) provides the result.

A.6 Proof of Lemma 8

From [2, Proposition 2.2, Proposition 2.6] (also see [3, Corollary 5.1.8]), there exists a universal constant C < ∞ such that

U_m(K; {X_i}) ≤ C ∫_0^ν √(log N(K, D, ε)) dε,

where

D²(k_1, k_2) = E_ρ | (1/m) Σ_{i<j} ρ_i ρ_j h(X_i, X_j) |² = (1/m²) E_ρ Σ_{i<j, r<s} ρ_i ρ_j ρ_r ρ_s h(X_i, X_j) h(X_r, X_s) = (1/m²) Σ_{i<j} h²(X_i, X_j),

with h(X_i, X_j) = k_1(X_i, X_j) − k_2(X_i, X_j).

A.7 Proof of Corollary 10

The result follows by bounding the uniform covering number of the VC-subgraph class, K. By [21, Theorem 2.6], we have

N(K, D, ε) ≤ C_1 V(K) (16eν²/ε²)^{V(K)},    (26)

where V(K) is the VC-index of K. Therefore,

U_m(K; {X_i}) ≤ C ∫_0^ν √(log N(K, D, ε)) dε ≤ 2C √(V(K)) ∫_0^ν √(log(ν/ε)) dε + Cν √(V(K) log(16e)) + Cν √(log(C_1 V(K))).    (27)

Note that ∫_0^ν √(log(ν/ε)) dε ≤ 2ν. Using this in (27) and rearranging the terms provides the result.

References

[1] M. Anthony and P. L. Bartlett. Neural Network Learning: Theoretical Foundations. Cambridge University Press, UK, 1999.
[2] M. A. Arcones and E. Giné. Limit theorems for U-processes. Annals of Probability, 21, 1993.
[3] V. H. de la Peña and E. Giné. Decoupling: From Dependence to Independence. Springer-Verlag, NY, 1999.
[4] L. Devroye, L. Gyorfi, and G. Lugosi. A Probabilistic Theory of Pattern Recognition. Springer-Verlag, New York, 1996.
[5] R. M. Dudley. Real Analysis and Probability. Cambridge University Press, Cambridge, UK, 2002.
[6] K. Fukumizu, A. Gretton, X. Sun, and B. Schölkopf. Kernel measures of conditional dependence. In J. C. Platt, D. Koller, Y. Singer, and S. Roweis, editors, Advances in Neural Information Processing Systems 20, Cambridge, MA, 2008. MIT Press.
[7] K. Fukumizu, B. K. Sriperumbudur, A. Gretton, and B. Schölkopf. Characteristic kernels on groups and semigroups. In D. Koller, D. Schuurmans, Y. Bengio, and L. Bottou, editors, Advances in Neural Information Processing Systems 21, 2009.
[8] A. Gretton, K. M. Borgwardt, M. Rasch, B. Schölkopf, and A. Smola. A kernel method for the two-sample problem. In B. Schölkopf, J. Platt, and T. Hoffman, editors, Advances in Neural Information Processing Systems 19. MIT Press, 2007.
[9] A. Gretton, K. Fukumizu, C.-H. Teo, L. Song, B. Schölkopf, and A. Smola. A kernel statistical test of independence. In Advances in Neural Information Processing Systems 20. MIT Press, 2008.
[10] G. R. G. Lanckriet, N. Cristianini, P. Bartlett, L. El Ghaoui, and M. I. Jordan. Learning the kernel matrix with semidefinite programming. Journal of Machine Learning Research, 5:27-72, 2004.
[11] R. T. Rockafellar. Convex Analysis. Princeton University Press, Princeton, NJ, 1970.

[12] B. Schölkopf and A. J. Smola. Learning with Kernels. MIT Press, Cambridge, MA, 2002.
[13] B. Schölkopf, B. K. Sriperumbudur, A. Gretton, and K. Fukumizu. RKHS representation of measures. In Learning Theory and Approximation Workshop, Oberwolfach, Germany, 2008.
[14] J. Shawe-Taylor and N. Cristianini. Kernel Methods for Pattern Analysis. Cambridge University Press, UK, 2004.
[15] A. J. Smola, A. Gretton, L. Song, and B. Schölkopf. A Hilbert space embedding for distributions. In Proc. 18th International Conference on Algorithmic Learning Theory, pages 13-31. Springer-Verlag, Berlin, Germany, 2007.
[16] N. Srebro and S. Ben-David. Learning bounds for support vector machines with learned kernels. In G. Lugosi and H. U. Simon, editors, Proc. of the 19th Annual Conference on Learning Theory, pages 169-183, 2006.
[17] B. K. Sriperumbudur, A. Gretton, K. Fukumizu, G. R. G. Lanckriet, and B. Schölkopf. Injective Hilbert space embeddings of probability measures. In R. Servedio and T. Zhang, editors, Proc. of the 21st Annual Conference on Learning Theory, pages 111-122, 2008.
[18] I. Steinwart. On the influence of the kernel on the consistency of support vector machines. Journal of Machine Learning Research, 2:67-93, 2002.
[19] I. Steinwart and A. Christmann. Support Vector Machines. Springer, 2008.
[20] J. Stewart. Positive definite functions and generalizations, an historical survey. Rocky Mountain Journal of Mathematics, 6(3):409-433, 1976.
[21] A. W. van der Vaart and J. A. Wellner. Weak Convergence and Empirical Processes. Springer-Verlag, New York, 1996.
[22] Y. Ying and C. Campbell. Generalization bounds for learning the kernel. In Proc. of the 22nd Annual Conference on Learning Theory, 2009.
[23] Y. Ying and D. X. Zhou. Learnability of Gaussians with flexible variances. Journal of Machine Learning Research, 8, 2007.


More information

arxiv: v1 [cs.lg] 8 Jan 2019

arxiv: v1 [cs.lg] 8 Jan 2019 Data Masking with Privacy Guarantees Anh T. Pha Oregon State University phatheanhbka@gail.co Shalini Ghosh Sasung Research shalini.ghosh@gail.co Vinod Yegneswaran SRI international vinod@csl.sri.co arxiv:90.085v

More information

Stochastic Subgradient Methods

Stochastic Subgradient Methods Stochastic Subgradient Methods Lingjie Weng Yutian Chen Bren School of Inforation and Coputer Science University of California, Irvine {wengl, yutianc}@ics.uci.edu Abstract Stochastic subgradient ethods

More information

New Bounds for Learning Intervals with Implications for Semi-Supervised Learning

New Bounds for Learning Intervals with Implications for Semi-Supervised Learning JMLR: Workshop and Conference Proceedings vol (1) 1 15 New Bounds for Learning Intervals with Iplications for Sei-Supervised Learning David P. Helbold dph@soe.ucsc.edu Departent of Coputer Science, University

More information

Computable Shell Decomposition Bounds

Computable Shell Decomposition Bounds Coputable Shell Decoposition Bounds John Langford TTI-Chicago jcl@cs.cu.edu David McAllester TTI-Chicago dac@autoreason.co Editor: Leslie Pack Kaelbling and David Cohn Abstract Haussler, Kearns, Seung

More information

Ensemble Based on Data Envelopment Analysis

Ensemble Based on Data Envelopment Analysis Enseble Based on Data Envelopent Analysis So Young Sohn & Hong Choi Departent of Coputer Science & Industrial Systes Engineering, Yonsei University, Seoul, Korea Tel) 82-2-223-404, Fax) 82-2- 364-7807

More information

Research Article Robust ε-support Vector Regression

Research Article Robust ε-support Vector Regression Matheatical Probles in Engineering, Article ID 373571, 5 pages http://dx.doi.org/10.1155/2014/373571 Research Article Robust ε-support Vector Regression Yuan Lv and Zhong Gan School of Mechanical Engineering,

More information

Lecture 21. Interior Point Methods Setup and Algorithm

Lecture 21. Interior Point Methods Setup and Algorithm Lecture 21 Interior Point Methods In 1984, Kararkar introduced a new weakly polynoial tie algorith for solving LPs [Kar84a], [Kar84b]. His algorith was theoretically faster than the ellipsoid ethod and

More information

Multi-kernel Regularized Classifiers

Multi-kernel Regularized Classifiers Multi-kernel Regularized Classifiers Qiang Wu, Yiing Ying, and Ding-Xuan Zhou Departent of Matheatics, City University of Hong Kong Kowloon, Hong Kong, CHINA wu.qiang@student.cityu.edu.hk, yying@cityu.edu.hk,

More information

The Weierstrass Approximation Theorem

The Weierstrass Approximation Theorem 36 The Weierstrass Approxiation Theore Recall that the fundaental idea underlying the construction of the real nubers is approxiation by the sipler rational nubers. Firstly, nubers are often deterined

More information

Supplementary to Learning Discriminative Bayesian Networks from High-dimensional Continuous Neuroimaging Data

Supplementary to Learning Discriminative Bayesian Networks from High-dimensional Continuous Neuroimaging Data Suppleentary to Learning Discriinative Bayesian Networks fro High-diensional Continuous Neuroiaging Data Luping Zhou, Lei Wang, Lingqiao Liu, Philip Ogunbona, and Dinggang Shen Proposition. Given a sparse

More information

This article appeared in a journal published by Elsevier. The attached copy is furnished to the author for internal non-commercial research and

This article appeared in a journal published by Elsevier. The attached copy is furnished to the author for internal non-commercial research and This article appeared in a ournal published by Elsevier. The attached copy is furnished to the author for internal non-coercial research and education use, including for instruction at the authors institution

More information

Support Vector Machines. Machine Learning Series Jerry Jeychandra Blohm Lab

Support Vector Machines. Machine Learning Series Jerry Jeychandra Blohm Lab Support Vector Machines Machine Learning Series Jerry Jeychandra Bloh Lab Outline Main goal: To understand how support vector achines (SVMs) perfor optial classification for labelled data sets, also a

More information

Supplement to: Subsampling Methods for Persistent Homology

Supplement to: Subsampling Methods for Persistent Homology Suppleent to: Subsapling Methods for Persistent Hoology A. Technical results In this section, we present soe technical results that will be used to prove the ain theores. First, we expand the notation

More information

Yongquan Zhang a, Feilong Cao b & Zongben Xu a a Institute for Information and System Sciences, Xi'an Jiaotong. Available online: 11 Mar 2011

Yongquan Zhang a, Feilong Cao b & Zongben Xu a a Institute for Information and System Sciences, Xi'an Jiaotong. Available online: 11 Mar 2011 This article was downloaded by: [Xi'an Jiaotong University] On: 15 Noveber 2011, At: 18:34 Publisher: Taylor & Francis Infora Ltd Registered in England and Wales Registered Nuber: 1072954 Registered office:

More information

are equal to zero, where, q = p 1. For each gene j, the pairwise null and alternative hypotheses are,

are equal to zero, where, q = p 1. For each gene j, the pairwise null and alternative hypotheses are, Page of 8 Suppleentary Materials: A ultiple testing procedure for ulti-diensional pairwise coparisons with application to gene expression studies Anjana Grandhi, Wenge Guo, Shyaal D. Peddada S Notations

More information

The Methods of Solution for Constrained Nonlinear Programming

The Methods of Solution for Constrained Nonlinear Programming Research Inventy: International Journal Of Engineering And Science Vol.4, Issue 3(March 2014), PP 01-06 Issn (e): 2278-4721, Issn (p):2319-6483, www.researchinventy.co The Methods of Solution for Constrained

More information

A Smoothed Boosting Algorithm Using Probabilistic Output Codes

A Smoothed Boosting Algorithm Using Probabilistic Output Codes A Soothed Boosting Algorith Using Probabilistic Output Codes Rong Jin rongjin@cse.su.edu Dept. of Coputer Science and Engineering, Michigan State University, MI 48824, USA Jian Zhang jian.zhang@cs.cu.edu

More information

1 Generalization bounds based on Rademacher complexity

1 Generalization bounds based on Rademacher complexity COS 5: Theoretical Machine Learning Lecturer: Rob Schapire Lecture #0 Scribe: Suqi Liu March 07, 08 Last tie we started proving this very general result about how quickly the epirical average converges

More information

Pattern Recognition and Machine Learning. Artificial Neural networks

Pattern Recognition and Machine Learning. Artificial Neural networks Pattern Recognition and Machine Learning Jaes L. Crowley ENSIMAG 3 - MMIS Fall Seester 2017 Lessons 7 20 Dec 2017 Outline Artificial Neural networks Notation...2 Introduction...3 Key Equations... 3 Artificial

More information

A Note on the Applied Use of MDL Approximations

A Note on the Applied Use of MDL Approximations A Note on the Applied Use of MDL Approxiations Daniel J. Navarro Departent of Psychology Ohio State University Abstract An applied proble is discussed in which two nested psychological odels of retention

More information

Asynchronous Gossip Algorithms for Stochastic Optimization

Asynchronous Gossip Algorithms for Stochastic Optimization Asynchronous Gossip Algoriths for Stochastic Optiization S. Sundhar Ra ECE Dept. University of Illinois Urbana, IL 680 ssrini@illinois.edu A. Nedić IESE Dept. University of Illinois Urbana, IL 680 angelia@illinois.edu

More information

COS 424: Interacting with Data. Written Exercises

COS 424: Interacting with Data. Written Exercises COS 424: Interacting with Data Hoework #4 Spring 2007 Regression Due: Wednesday, April 18 Written Exercises See the course website for iportant inforation about collaboration and late policies, as well

More information

Rademacher Complexity Margin Bounds for Learning with a Large Number of Classes

Rademacher Complexity Margin Bounds for Learning with a Large Number of Classes Radeacher Coplexity Margin Bounds for Learning with a Large Nuber of Classes Vitaly Kuznetsov Courant Institute of Matheatical Sciences, 25 Mercer street, New York, NY, 002 Mehryar Mohri Courant Institute

More information

Nyström Method vs Random Fourier Features: A Theoretical and Empirical Comparison

Nyström Method vs Random Fourier Features: A Theoretical and Empirical Comparison yströ Method vs : A Theoretical and Epirical Coparison Tianbao Yang, Yu-Feng Li, Mehrdad Mahdavi, Rong Jin, Zhi-Hua Zhou Machine Learning Lab, GE Global Research, San Raon, CA 94583 Michigan State University,

More information

arxiv: v1 [math.pr] 17 May 2009

arxiv: v1 [math.pr] 17 May 2009 A strong law of large nubers for artingale arrays Yves F. Atchadé arxiv:0905.2761v1 [ath.pr] 17 May 2009 March 2009 Abstract: We prove a artingale triangular array generalization of the Chow-Birnbau- Marshall

More information

Chapter 6 1-D Continuous Groups

Chapter 6 1-D Continuous Groups Chapter 6 1-D Continuous Groups Continuous groups consist of group eleents labelled by one or ore continuous variables, say a 1, a 2,, a r, where each variable has a well- defined range. This chapter explores:

More information

Non-parametric Estimation of Integral Probability Metrics

Non-parametric Estimation of Integral Probability Metrics Non-parametric Estimation of Integral Probability Metrics Bharath K. Sriperumbudur, Kenji Fukumizu 2, Arthur Gretton 3,4, Bernhard Schölkopf 4 and Gert R. G. Lanckriet Department of ECE, UC San Diego,

More information

Learnability and Stability in the General Learning Setting

Learnability and Stability in the General Learning Setting Learnability and Stability in the General Learning Setting Shai Shalev-Shwartz TTI-Chicago shai@tti-c.org Ohad Shair The Hebrew University ohadsh@cs.huji.ac.il Nathan Srebro TTI-Chicago nati@uchicago.edu

More information

Fairness via priority scheduling

Fairness via priority scheduling Fairness via priority scheduling Veeraruna Kavitha, N Heachandra and Debayan Das IEOR, IIT Bobay, Mubai, 400076, India vavitha,nh,debayan}@iitbacin Abstract In the context of ulti-agent resource allocation

More information

A Note on Scheduling Tall/Small Multiprocessor Tasks with Unit Processing Time to Minimize Maximum Tardiness

A Note on Scheduling Tall/Small Multiprocessor Tasks with Unit Processing Time to Minimize Maximum Tardiness A Note on Scheduling Tall/Sall Multiprocessor Tasks with Unit Processing Tie to Miniize Maxiu Tardiness Philippe Baptiste and Baruch Schieber IBM T.J. Watson Research Center P.O. Box 218, Yorktown Heights,

More information

Metric Entropy of Convex Hulls

Metric Entropy of Convex Hulls Metric Entropy of Convex Hulls Fuchang Gao University of Idaho Abstract Let T be a precopact subset of a Hilbert space. The etric entropy of the convex hull of T is estiated in ters of the etric entropy

More information

Computable Shell Decomposition Bounds

Computable Shell Decomposition Bounds Journal of Machine Learning Research 5 (2004) 529-547 Subitted 1/03; Revised 8/03; Published 5/04 Coputable Shell Decoposition Bounds John Langford David McAllester Toyota Technology Institute at Chicago

More information

The degree of a typical vertex in generalized random intersection graph models

The degree of a typical vertex in generalized random intersection graph models Discrete Matheatics 306 006 15 165 www.elsevier.co/locate/disc The degree of a typical vertex in generalized rando intersection graph odels Jerzy Jaworski a, Michał Karoński a, Dudley Stark b a Departent

More information

Intelligent Systems: Reasoning and Recognition. Artificial Neural Networks

Intelligent Systems: Reasoning and Recognition. Artificial Neural Networks Intelligent Systes: Reasoning and Recognition Jaes L. Crowley MOSIG M1 Winter Seester 2018 Lesson 7 1 March 2018 Outline Artificial Neural Networks Notation...2 Introduction...3 Key Equations... 3 Artificial

More information

Detection and Estimation Theory

Detection and Estimation Theory ESE 54 Detection and Estiation Theory Joseph A. O Sullivan Sauel C. Sachs Professor Electronic Systes and Signals Research Laboratory Electrical and Systes Engineering Washington University 11 Urbauer

More information

arxiv: v1 [stat.ml] 27 Oct 2018

arxiv: v1 [stat.ml] 27 Oct 2018 Stein Variational Gradient Descent as Moent Matching arxiv:80.693v stat.ml] 27 Oct 208 Qiang Liu, Dilin Wang Departent of Coputer Science The University of Texas at Austin Austin, TX 7872 {lqiang, dilin}@cs.utexas.edu

More information

A Theoretical Framework for Deep Transfer Learning

A Theoretical Framework for Deep Transfer Learning A Theoretical Fraewor for Deep Transfer Learning Toer Galanti The School of Coputer Science Tel Aviv University toer22g@gail.co Lior Wolf The School of Coputer Science Tel Aviv University wolf@cs.tau.ac.il

More information

Experimental Design For Model Discrimination And Precise Parameter Estimation In WDS Analysis

Experimental Design For Model Discrimination And Precise Parameter Estimation In WDS Analysis City University of New York (CUNY) CUNY Acadeic Works International Conference on Hydroinforatics 8-1-2014 Experiental Design For Model Discriination And Precise Paraeter Estiation In WDS Analysis Giovanna

More information

Nonmonotonic Networks. a. IRST, I Povo (Trento) Italy, b. Univ. of Trento, Physics Dept., I Povo (Trento) Italy

Nonmonotonic Networks. a. IRST, I Povo (Trento) Italy, b. Univ. of Trento, Physics Dept., I Povo (Trento) Italy Storage Capacity and Dynaics of Nononotonic Networks Bruno Crespi a and Ignazio Lazzizzera b a. IRST, I-38050 Povo (Trento) Italy, b. Univ. of Trento, Physics Dept., I-38050 Povo (Trento) Italy INFN Gruppo

More information

Stability Bounds for Non-i.i.d. Processes

Stability Bounds for Non-i.i.d. Processes tability Bounds for Non-i.i.d. Processes Mehryar Mohri Courant Institute of Matheatical ciences and Google Research 25 Mercer treet New York, NY 002 ohri@cis.nyu.edu Afshin Rostaiadeh Departent of Coputer

More information

arxiv: v1 [cs.ds] 3 Feb 2014

arxiv: v1 [cs.ds] 3 Feb 2014 arxiv:40.043v [cs.ds] 3 Feb 04 A Bound on the Expected Optiality of Rando Feasible Solutions to Cobinatorial Optiization Probles Evan A. Sultani The Johns Hopins University APL evan@sultani.co http://www.sultani.co/

More information

3.3 Variational Characterization of Singular Values

3.3 Variational Characterization of Singular Values 3.3. Variational Characterization of Singular Values 61 3.3 Variational Characterization of Singular Values Since the singular values are square roots of the eigenvalues of the Heritian atrices A A and

More information

ESTIMATING AND FORMING CONFIDENCE INTERVALS FOR EXTREMA OF RANDOM POLYNOMIALS. A Thesis. Presented to. The Faculty of the Department of Mathematics

ESTIMATING AND FORMING CONFIDENCE INTERVALS FOR EXTREMA OF RANDOM POLYNOMIALS. A Thesis. Presented to. The Faculty of the Department of Mathematics ESTIMATING AND FORMING CONFIDENCE INTERVALS FOR EXTREMA OF RANDOM POLYNOMIALS A Thesis Presented to The Faculty of the Departent of Matheatics San Jose State University In Partial Fulfillent of the Requireents

More information

New upper bound for the B-spline basis condition number II. K. Scherer. Institut fur Angewandte Mathematik, Universitat Bonn, Bonn, Germany.

New upper bound for the B-spline basis condition number II. K. Scherer. Institut fur Angewandte Mathematik, Universitat Bonn, Bonn, Germany. New upper bound for the B-spline basis condition nuber II. A proof of de Boor's 2 -conjecture K. Scherer Institut fur Angewandte Matheati, Universitat Bonn, 535 Bonn, Gerany and A. Yu. Shadrin Coputing

More information

Bootstrapping Dependent Data

Bootstrapping Dependent Data Bootstrapping Dependent Data One of the key issues confronting bootstrap resapling approxiations is how to deal with dependent data. Consider a sequence fx t g n t= of dependent rando variables. Clearly

More information