A New Perspective on Learning Linear Separators with Large L_q L_p Margins


Maria-Florina Balcan, Georgia Institute of Technology
Christopher Berlind, Georgia Institute of Technology

Appearing in Proceedings of the 17th International Conference on Artificial Intelligence and Statistics (AISTATS) 2014, Reykjavik, Iceland. JMLR: W&CP volume 33. Copyright 2014 by the authors.

Abstract

We give theoretical and empirical results that provide new insights into large margin learning. We prove a bound on the generalization error of learning linear separators with large L_q L_p margins (where L_q and L_p are dual norms) for any finite p ≥ 1. The bound leads to a simple data-dependent sufficient condition for fast learning, in addition to extending and improving upon previous results. We also provide the first study that shows the benefits of taking advantage of margins with p < 2 over margins with p ≥ 2. Our experiments confirm that our theoretical results are relevant in practice.

1 INTRODUCTION

The notion of margin arises naturally in many areas of machine learning. Margins have long been used to motivate the design of algorithms, to give sufficient conditions for fast learning, and to explain unexpected performance of algorithms in practice. Here we are concerned with learning the class of homogeneous linear separators in R^d over distributions with large margins. We use a general notion of margin, the L_q L_p margin, that captures, among others, the notions used in the analyses of Perceptron (p = q = 2) and Winnow (p = ∞, q = 1). For p, q ∈ [1, ∞] with 1/p + 1/q = 1, the L_q L_p margin of a linear classifier x ↦ sign(w · x) with respect to a distribution D is defined as

    γ_{q,p}(D, w) = inf_{x∼D} |w · x| / (‖w‖_q ‖x‖_p).

While previous work has addressed the case of p ≥ 2 both theoretically (Grove et al., 2001; Servedio, 2000; Gentile, 2003) and experimentally (Zhang, 2002), the p < 2 case has been mentioned but much less explored. This gap in the literature is possibly due to the fact that when p < 2 a large margin alone does not guarantee small sample complexity (see Example 1 below for such a case). This leads to the question of whether large L_q L_p margins with p < 2 can lead to small sample complexity, and if so, under what conditions this will happen. Furthermore, are there situations in which taking advantage of margins with p < 2 can lead to better performance than using margins with p ≥ 2?
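For concreteness, the empirical L_q L_p margin of a finite labeled sample can be computed as in the following sketch (illustrative NumPy code, not from the paper; the helper name lq_lp_margin is ours).

```python
import numpy as np

def lq_lp_margin(X, y, w, p):
    """X: (n, d) array of points, y: labels in {-1, +1}, w: weight vector, p: norm exponent."""
    q = np.inf if p == 1 else (1.0 if np.isinf(p) else p / (p - 1.0))  # dual exponent, 1/p + 1/q = 1
    point_norms = np.linalg.norm(X, ord=p, axis=1)   # ||x_i||_p for each point
    functional = y * (X @ w)                         # y_i (w . x_i); equals |w . x_i| when w is consistent
    return np.min(functional / point_norms) / np.linalg.norm(w, ord=q)

# Example: standard basis vectors in R^d labeled by a sign-vector target.
d = 8
w_star = np.ones(d)
X = np.eye(d)
y = np.sign(X @ w_star)
print(lq_lp_margin(X, y, w_star, p=1))       # L_inf L_1 margin: 1.0
print(lq_lp_margin(X, y, w_star, p=np.inf))  # L_1 L_inf margin: 1/d
```

On the standard basis vectors labeled by a sign vector, this gives an L_∞ L_1 margin of 1 and an L_1 L_∞ margin of 1/d, illustrating how strongly the value depends on the choice of p.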
In this work, we answer these three questions using both theoretical and empirical evidence. We first give a bound on the generalization error of linear separators with large L_q L_p margins that holds for any finite p ≥ 1. The result is proved through a new bound on the fat-shattering dimension of linear separators with bounded L_q norm. The bound improves upon previous results by removing a factor of log d when 2 ≤ p < ∞ and extends the previously known bounds to the case of 1 ≤ p < 2. A highlight of this theoretical result is that it gives a simple sufficient condition for fast learning even for the p < 2 case. The condition is related to the L_{2,p} norm of the data matrix and can be estimated from the data.

We then give a concrete family of learning problems in which using the L_∞ L_1 margin gives significantly better sample complexity guarantees than for L_q L_p margins with p > 1. We define a family of distributions over labeled examples and consider the sample complexity of learning the class W_p of linear separators with large L_q L_p margins. By bounding covering numbers, we upper bound the sample complexity of learning W_1 and lower bound the complexity of learning W_p when p > 1, and we show that the upper bound can be significantly smaller than the lower bound.

In addition, we give experimental results supporting our claim that taking advantage of large L_∞ L_1 margins can lead to faster learning. We observe that in the realizable case, the problem of finding a consistent linear separator that maximizes the L_q L_p margin is a convex program (similar to SVM). An extension of this method to the non-realizable case is equivalent to minimizing the L_q-norm regularized hinge loss. We apply these margin-maximization algorithms to both synthetic and real-world data sets and find that maximizing the L_∞ L_1 margin can result in better classifiers than maximizing other margins. We also show that the theoretical condition for fast learning that appears in our generalization bound is favorably satisfied on many real-world data sets.

Related Work. It has long been known that the classic algorithms Perceptron (Rosenblatt, 1958) and Winnow (Littlestone, 1988) have mistake bounds of 1/γ_{2,2}² and Õ(1/γ_{1,∞}²), respectively. A family of quasi-additive algorithms (Grove et al., 2001) interpolates between the behavior of Perceptron and Winnow by defining a Perceptron-like algorithm for any p > 1. While this gives an algorithm for any 1 < p ≤ ∞, the mistake bound of Õ(1/γ_{q,p}²) only applies for p ≥ 2. For small values of p, these algorithms can be used to learn nonlinear separators by using factorizable kernels (Gentile, 2013). A related family (Servedio, 2000) was designed for learning in the PAC model rather than the mistake bound model, but again, guarantees were only given for p ≥ 2.

Other works (Kakade et al., 2009; Cortes et al., 2010; Kloft and Blanchard, 2012; Maurer and Pontil, 2012) bound the Rademacher complexity of classes of linear separators under general forms of regularization. Special cases of each of these regularization methods correspond to L_q-norm regularization, which is closely related to maximizing the L_q L_p margin. Kakade et al. (2009) directly consider the case of L_q-norm regularization but only give Rademacher complexity bounds for the case of p ≥ 2. Both Cortes et al. (2010) and Kloft and Blanchard (2012) give Rademacher complexity bounds that cover the entire range 1 ≤ p ≤ ∞ in the context of multiple kernel learning, but their discussion of excess risk bounds for different choices of p is limited to the p ≥ 2 case, while our work discusses the generalization error over the entire range. Maurer and Pontil (2012) consider the more general setting of block-norm regularized linear classes but only give bounds for the case of p ≥ 2. In contrast to our work, none of the above works give lower bounds on the sample complexity or give concrete evidence of when some values of p will result in faster learning than others.

There are a few other cases in the literature where the p < 2 regime is discussed. Zhang (2002) deals with algorithms for L_q-norm regularized loss minimization and discusses cases in which L_∞-norm regularization may be appropriate. Balcan et al. (2013) study the problem of learning two-sided disjunctions in the semi-supervised setting and note that the regularity assumption inherent in the problem can be interpreted as a large L_∞ L_1 margin assumption. A related problem posed by Blum and Balcan (2007), the two-sided majority with margins problem, has a slightly modified form which constitutes another natural occurrence of a large L_∞ L_1 margin (see supplementary material).

2 PRELIMINARIES

Let D be a distribution over a bounded instance space X ⊆ R^d. A linear separator over X is a classifier h(x) = sign(w · x) for some weight vector w ∈ R^d. We use h* and w* to denote the target function and weight vector, respectively, so that h*(x) = sign(w* · x) gives the label for any instance x, and err_D(h) = Pr_{x∼D}[h(x) ≠ h*(x)] is the generalization error of any hypothesis h. We will often abuse notation and refer to a classifier and its corresponding weight vector interchangeably. We will overload the notation X to represent either a set of n points in R^d or the d × n matrix containing one point per column.

For any point x = (x_1, ..., x_d) ∈ R^d and p ≥ 1, the L_p-norm of x is ‖x‖_p = (Σ_{i=1}^d |x_i|^p)^{1/p} and the L_∞-norm is ‖x‖_∞ = max_i |x_i|. Let X_p denote sup_{x∈X} ‖x‖_p, which is finite for any p by our assumption that X is bounded. The L_q-norm is the dual of the L_p-norm if 1/p + 1/q = 1 (so the L_∞-norm and the L_1-norm are duals). In this work, p and q will always denote dual norms. For any weight vector w, the L_q L_p margin of w with respect to D is defined as

    γ_{q,p}(D, w) = inf_{x∼D} |w · x| / (‖w‖_q ‖x‖_p).

We can similarly define γ_{q,p}(X, w) for a set X. We will drop the first argument when referring to the distribution-based definition and the distribution is clear from context. We assume that D has a positive margin; that is, there exists w* such that γ_{q,p}(D, w*) > 0. Note that by Hölder's inequality, |w · x| ≤ ‖w‖_q ‖x‖_p for dual p and q, so γ_{q,p}(D, w) ≤ 1. We also define the L_{a,b} matrix norm

    ‖M‖_{a,b} = ( Σ_{i=1}^{r} ( Σ_{j=1}^{c} |m_{ij}|^a )^{b/a} )^{1/b}

for any r × c matrix M = (m_{ij}). In other words, we take the L_a-norm of each row in the matrix and then take the L_b-norm of the resulting vector of L_a-norms.

2.1 L_q L_p Support Vector Machines

Given a linearly separable set X of n labeled examples, we can solve the convex program

    min_w   ‖w‖_q
    s.t.    y_i (w · x_i) / ‖x_i‖_p ≥ 1,   1 ≤ i ≤ n            (1)

to maximize the L_q L_p margin. Observe that a solution ŵ to this problem has γ_{q,p}(X, ŵ) = 1/‖ŵ‖_q. We call an algorithm that outputs a solution to (1) an L_q L_p SVM due to the close relationship between this problem and the standard support vector machine. If X is not linearly separable, we can introduce nonnegative slack variables in the usual way and solve

    min_{w, ξ≥0}   ‖w‖_q + C Σ_{i=1}^n ξ_i
    s.t.           y_i (w · x_i) / ‖x_i‖_p ≥ 1 − ξ_i,   1 ≤ i ≤ n            (2)

which is equivalent to minimizing the hinge loss with respect to an L_p-normalized data set using L_q-norm regularization on the weight vector.

3 GENERALIZATION BOUND

In this section we give an upper bound on the generalization error of learning linear separators over distributions with large L_q L_p margins. The proof follows from combining a theorem of Bartlett and Shawe-Taylor (1999) with a new bound on the fat-shattering dimension of the class of linear separators with small L_q-norm. We begin with the following definitions.

Definition. For a set F of real-valued functions on X, a finite set {x_1, ..., x_n} ⊆ X is said to be γ-shattered by F if there are real numbers r_1, ..., r_n such that for all b = (b_1, ..., b_n) ∈ {−1, 1}^n, there is a function f_b ∈ F such that

    f_b(x_i) ≥ r_i + γ  if b_i = 1,
    f_b(x_i) ≤ r_i − γ  if b_i = −1.

The fat-shattering dimension of F at scale γ, denoted fat_F(γ), is the size of the largest subset of X which is γ-shattered by F.

Our bound on the fat-shattering dimension will use two lemmas analogous to Lemmas 11 and 12 in (Servedio, 2000).

Lemma 1. Let F = {x ↦ w · x : ‖w‖_q ≤ W_q} with p ≥ 1. If the set {x_1, ..., x_n} ∈ X^n is γ-shattered by F then every b = (b_1, ..., b_n) ∈ {−1, 1}^n satisfies ‖Σ_{i=1}^n b_i x_i‖_p ≥ γn / W_q.

Proof. The proof is identical to that of Lemma 11 in (Servedio, 2000), replacing the radius 1/X_p of F in their lemma with W_q.

The next lemma will depend on the following classical result from probability theory known as the Khintchine inequality.

Theorem 1 (Khintchine). If the random variable σ = (σ_1, ..., σ_n) is uniform over {−1, 1}^n and 0 < p < ∞, then any finite set {z_1, ..., z_n} ⊆ C satisfies

    A_p ( Σ_{i=1}^n |z_i|² )^{1/2}  ≤  ( E |Σ_{i=1}^n σ_i z_i|^p )^{1/p}  ≤  B_p ( Σ_{i=1}^n |z_i|² )^{1/2}

where A_p and B_p are constants depending only on p. The precise optimal constants for A_p and B_p were found by Haagerup (1982), but for our purposes it suffices to note that when p ≥ 1 we have 1/√2 ≤ A_p ≤ 1 and 1 ≤ B_p ≤ √p.

Lemma 2. For any set X = {x_1, ..., x_n} ∈ X^n and any 1 ≤ p < ∞, there is some b = (b_1, ..., b_n) ∈ {−1, 1}^n such that ‖Σ_{i=1}^n b_i x_i‖_p ≤ B_p ‖X‖_{2,p}.

Proof. We will bound the expectation of ‖Σ_i b_i x_i‖_p when b = (b_1, ..., b_n) is uniformly distributed over {−1, 1}^n. We have

    E ‖Σ_{i=1}^n b_i x_i‖_p = E [ ( Σ_{j=1}^d | Σ_{i=1}^n b_i x_{i,j} |^p )^{1/p} ]
                            ≤ ( Σ_{j=1}^d E | Σ_{i=1}^n b_i x_{i,j} |^p )^{1/p}
                            ≤ ( Σ_{j=1}^d B_p^p ( Σ_{i=1}^n x_{i,j}² )^{p/2} )^{1/p}
                            = B_p ‖X‖_{2,p},

where the first inequality is an application of Jensen's inequality and the second uses the Khintchine inequality. The proof is completed by noting that there must be some choice of b for which the value of ‖Σ_i b_i x_i‖_p is smaller than its expectation.

We can use these two lemmas to give an upper bound on the fat-shattering dimension for any finite p.
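As an illustration of how the convex program (1) can be solved in practice, here is a minimal sketch using the cvxpy modeling package (an assumption on our part; the paper only says that standard convex optimization software was used, and the function name lq_lp_svm is ours).

```python
# A hedged sketch of the hard-margin L_q L_p SVM in program (1):
# minimize ||w||_q subject to y_i (w . x_i) / ||x_i||_p >= 1.
# Assumes cvxpy with a conic solver that supports p-norms (e.g., SCS).
import numpy as np
import cvxpy as cp

def lq_lp_svm(X, y, p):
    """X: (n, d) array of points, y: labels in {-1, +1}, p: primal norm exponent."""
    n, d = X.shape
    q = np.inf if p == 1 else (1.0 if np.isinf(p) else p / (p - 1.0))  # dual exponent
    Xn = X / np.linalg.norm(X, ord=p, axis=1, keepdims=True)           # L_p-normalize each point
    w = cp.Variable(d)
    constraints = [cp.multiply(y, Xn @ w) >= 1]                        # functional margin >= 1
    cp.Problem(cp.Minimize(cp.norm(w, q)), constraints).solve(solver=cp.SCS)
    w_hat = w.value
    return w_hat, 1.0 / np.linalg.norm(w_hat, ord=q)                   # gamma_{q,p}(X, w_hat) = 1/||w_hat||_q
```

The soft-margin program (2) can be handled the same way by adding nonnegative slack variables and the penalty term C Σ_i ξ_i to the objective.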

Theorem 2. Let F = {x ↦ w · x : ‖w‖_q ≤ W_q} with 1 ≤ p < ∞. If there is a constant C = C(d, p) independent of n such that ‖X‖_{2,p} ≤ C n^α X_p for any set X of n examples drawn from D, then

    fat_F(γ) ≤ ( C B_p W_q X_p / γ )^{1/(1−α)}.

Proof. Combining Lemmas 1 and 2, we have that any set X = {x_1, ..., x_n} ∈ X^n that is γ-shattered by F satisfies γn / W_q ≤ B_p ‖X‖_{2,p} ≤ C B_p n^α X_p. Solving for n gives us

    n ≤ ( C B_p W_q X_p / γ )^{1/(1−α)}

as an upper bound on the maximum size of any γ-shattered set.

This bound extends and improves upon Theorem 8 in (Servedio, 2000). In their specific setting, W_q = 1/X_p, so we can directly compare their bound

    fat_F(γ) ≤ p² log(4d) / γ²            (3)

for p ≥ 2 to our bound

    fat_F(γ) ≤ ( C B_p / γ )^{1/(1−α)}            (4)

for 1 ≤ p < ∞. Observe that by Minkowski's inequality, any set X satisfies ‖X‖_{2,p} ≤ n^{1/2} X_p if p ≥ 2. In this case, (4) simplifies to (B_p/γ)², which is dimension-independent and improves upon (3) by a factor of log d when p is constant. When 1 ≤ p < 2, (3) does not apply, but (4) still gives a bound that can be small in many cases depending on the relationship between ‖X‖_{2,p} and γ. We will give specific examples in Section 3.1.

The fat-shattering dimension is relevant due to the following theorem of Bartlett and Shawe-Taylor (1999) that relates the generalization performance of a classifier with large margin to the fat-shattering dimension of the associated real-valued function class at a scale of roughly the margin of the classifier.

Theorem 3 (Bartlett & Shawe-Taylor). Let F be a collection of real-valued functions on a set X and let D be a distribution over X. Let X = {x_1, ..., x_n} be a set of examples drawn i.i.d. from D with labels y_i = h*(x_i) for each i. With probability at least 1 − δ, if a classifier h(x) = sign(f(x)) with f ∈ F satisfies y_i f(x_i) ≥ γ > 0 for each x_i ∈ X, then

    err_D(h) ≤ (2/n) ( k log(8en/k) log(32n) + log(8n/δ) ),

where k = fat_F(γ/16).

Now we can state and prove the following theorem which bounds the generalization performance of the L_q L_p SVM algorithm.

Theorem 4. For any distribution D and target w* with γ_{q,p}(D, w*) ≥ γ_{q,p}, if there is a constant C = C(d, p) such that ‖X‖_{2,p} ≤ C n^α X_p for any set X of n examples from D, then there is a polynomial time algorithm that outputs, with probability at least 1 − δ, a classifier h such that

    err_D(h) = O( (1/n) ( ( C B_p / γ_{q,p} )^{1/(1−α)} log² n + log(n/δ) ) ).

Proof. By the definition of the L_q L_p margin, there exists a w (namely, w*/‖w*‖_q) with ‖w‖_q = 1 that achieves margin γ_{q,p} with respect to D. This w has margin at least γ_{q,p} with respect to any set X of n examples from D. A vector ŵ satisfying these properties can be found in polynomial time by solving the convex program (1) and normalizing the solution. Notice that if the sample is normalized to have ‖x‖_p = 1 for every x ∈ X then the L_q L_p margin of ŵ does not change but becomes equal to the functional margin y(ŵ · x) appearing in Theorem 3. Applying Theorem 2 with W_q = 1 and X_p = 1 yields fat_F(γ) ≤ (C B_p / γ)^{1/(1−α)}, and applying Theorem 3 to ŵ gives us the desired result.

Theorem 4 tells us that if the quantity

    ( C B_p / γ_{q,p} )^{1/(1−α)}            (5)

is small for a certain choice of p and q then the L_q L_p SVM will have good generalization. This gives us a data-dependent bound, as (5) depends on the data; specifically, C and α depend on the distribution D alone, while γ_{q,p} depends on the relationship between D and the target w*. As mentioned, if p ≥ 2 then we can use C = 1 and α = 1/2 for any distribution, in which case the bound depends solely on the margin γ_{q,p} (and to a lesser extent on B_p). If p ≤ 2 then any set has ‖X‖_{2,p} ≤ n^{1/p} X_p (this follows by subadditivity of the function z ↦ z^{p/2} when p ≤ 2) and we can obtain a similar dimension-independent bound with C = 1 and α = 1/p. Achieving dimension independence for all distributions comes at the price of the bound becoming uninformative as p → 1, as (5) simplifies to (B_p/γ_{q,p})^q for these values. More interesting situations arise when we consider the quantity (5) for specific distributions, as we will show in the next section.

3.1 Examples

Here we will give some specific learning problems showing when large margins can be helpful and when they are not helpful. We focus on the p < 2 case, as large margins are always helpful when p ≥ 2.
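The data-dependent quantities in Theorem 4 can be computed directly on a sample. The sketch below (illustrative NumPy code, not from the paper; the helper names are ours) computes ‖X‖_{2,p} for a d × n data matrix and the exponent α that makes ‖X‖_{2,p} = n^α X_p hold with C = 1.

```python
import numpy as np

def l2p_matrix_norm(M, p):
    """L_{2,p} norm of a matrix M: L_2 norm of each row, then L_p norm of the row norms."""
    row_norms = np.linalg.norm(M, ord=2, axis=1)
    return np.linalg.norm(row_norms, ord=p)

def implied_alpha(points, p):
    """points: (n, d) array; alpha such that ||X||_{2,p} = n^alpha X_p with C = 1,
    where X is the d x n matrix with one point per column and X_p = max_i ||x_i||_p."""
    n = points.shape[0]
    X_p = np.max(np.linalg.norm(points, ord=p, axis=1))
    return np.log(l2p_matrix_norm(points.T, p) / X_p) / np.log(n)

# For i.i.d. standard Gaussian data the implied alpha is roughly 1/2 for any p >= 1.
pts = np.random.default_rng(0).normal(size=(200, 30))
print(implied_alpha(pts, p=1), implied_alpha(pts, p=2))
```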

Example 1 (Unhelpful margins). First, let D_1 be the uniform distribution over the standard basis vectors in R^d and let w* be any weight vector in {−1, 1}^d. In this case γ_{q,p} = d^{−1/q}, which is a large margin for small p. If n ≤ d, then ‖X‖_{2,p} is roughly n^{1/p} X_p (ignoring log factors), and (5) simplifies to B_p^q d. We could also choose to simplify (5) using C = d^{1/p} and α = 0, which gives us B_p d. Either way, the bound in Theorem 4 becomes Õ(d/n), which is uninformative since n ≤ d. If we take n ≥ d, we can still obtain a bound of Õ(d/n), but this is the same as the worst-case bound based on VC dimension, so the large margin has no advantage. In fact, this example provides a lower bound: even if an algorithm knows the distribution D_1 and is allowed a 1/2 probability of failure, an error of ε cannot be guaranteed with fewer than (1 − 2ε)d examples, because any algorithm can hope for at best an error rate of 1/2 on the examples it has not yet seen.

Example 2 (Helpful margins). As another example, divide the d coordinates into k = o(d) disjoint blocks of equal size and let D_2 be the uniform distribution over examples that have 1's for all coordinates within some block and 0's elsewhere. Taking w* to be a vector in {−1, 1}^d that has the same sign within each block, we have γ_{q,p} = k^{−1/q}. If k < n < d then ‖X‖_{2,p} is roughly k^{1/p − 1/2} √n X_p, and (5) simplifies to B_p² k. When k = o(d) this is a significant improvement over worst-case bounds for any constant choice of p.

Example 3 (An advantage for p < 2). Consider a distribution that is a combination of the previous two examples: with probability 1/2 it returns an example drawn from D_1 and otherwise returns an example from D_2. By including the basis vectors, we have made the margin γ_{q,p} = d^{−1/q}, but as long as k = o(d) the bound on ‖X‖_{2,p} does not change significantly from Example 2, and we can still use C = k^{1/p − 1/2} and α = 1/2. Now (5) simplifies to B_p² k for p = 1, but becomes B_p² k^{2/p − 1} d^{2/q} in general. When k = √d this gives us an error bound of Õ(√d/n) for p = 1 but Õ(d/n) or worse for p ≥ 2. While this upper bound does not imply that generalization error will be worse for p ≥ 2 than it is for p = 1, we show in the next section that for a slightly modified version of this distribution we can obtain sample complexity lower bounds for large margin algorithms with p ≥ 2 that are significantly greater than the upper bound for p = 1.

4 THE CASE FOR L_∞ L_1 MARGINS

Here we give a family of learning problems to show the benefits of using L_∞ L_1 margins over other margins. We do this by defining a distribution D over unlabeled examples in R^d that can be consistently labeled by a variety of potential target functions w*. We then consider a family of large L_q L_p margin concept classes W_p and bound the sample complexity of learning a concept in W_p using covering number bounds. We show that learning W_1 can be much easier than learning W_p for p > 1; for example, with certain parameters for D, having O(√d) examples is sufficient for learning W_1, while learning any other W_p requires Ω(d) examples.

Specifically, let W_p = {w ∈ R^d : ‖w‖_∞ = 1, γ_{q,p}(w) ≥ γ_{q,p}(w*)}, where w* maximizes the L_∞ L_1 margin with respect to D. We restrict our discussion to weight vectors with unit L_∞ norm because normalization does not change the margin (nor does it affect the output of the corresponding classifier). Let the covering number N(ε, W_p, D) be the size of the smallest set V ⊆ W_p such that for every w ∈ W_p there exists a v ∈ V with d_D(w, v) := Pr_{x∼D}[sign(w · x) ≠ sign(v · x)] ≤ ε.

Define the distribution D over {0, 1}^d as follows. Divide the d coordinates into k disjoint blocks of size d/k (assume d/(2k) is an odd integer). Flip a fair coin. If heads, pick a random block and return an example with exactly d/(2k) randomly chosen coordinates set to 1 within the chosen block and all other coordinates set to 0. If tails, return a standard basis vector (exactly one coordinate set to 1) chosen uniformly at random. The target function will be determined by any weight vector w* achieving the maximum L_∞ L_1 margin with respect to D. As we will see, w* can be any vector in {−1, 1}^d with complete agreement within each block; a sampler for this construction is sketched below.

We first give an upper bound on the covering number of W_1.

Proposition 1. For any ε > 0, N(ε, W_1, D) ≤ 2^k.

Proof. Let V_k be the following set. Divide the d coordinates into k disjoint blocks of size d/k. A vector v ∈ {−1, 1}^d is a member of V_k if and only if each block in v is entirely +1's or entirely −1's. We will show that W_1 = V_k, and since |V_k| = 2^k we will have N(ε, W_1, D) ≤ 2^k for any ε. Note that by Hölder's inequality, γ_{q,p}(w) ≤ 1 for any w ∈ R^d. For any w ∈ V_k and any example x drawn from D, we have |w · x| = ‖x‖_1, so γ_{∞,1}(w) = 1, the maximum margin. If w ∉ V_k then either w ∉ {−1, 1}^d or w has sign disagreement within at least one of the k blocks. If w ∉ {−1, 1}^d then γ_{∞,1}(w) = min_x |w · x| / ‖x‖_1 ≤ min_i |w · e_i| = min_i |w_i| < 1. If w has sign disagreement within a block then |w · x| < ‖x‖_1 for any x with 1's in disagreeing coordinates of w, and this results in a margin strictly less than 1.

Now we will give lower bounds on the covering numbers for W_p with p > 1. In the following let H(α) = −α log(α) − (1 − α) log(1 − α), the binary entropy function.
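For concreteness, the following sketch (illustrative NumPy code, not from the paper; helper names are ours) draws samples from the distribution D just defined and builds a block-sign target w*; the last line reuses the hypothetical lq_lp_margin helper from the earlier sketch to check that w* attains the maximum L_∞ L_1 margin.

```python
# An illustrative sampler for the distribution D of Section 4 and a block-sign target.
# With probability 1/2: exactly d/(2k) ones inside a uniformly chosen block;
# otherwise: a uniformly chosen standard basis vector.
import numpy as np

rng = np.random.default_rng(0)

def sample_D(d, k, n):
    block = d // k                                   # block size (the paper assumes d/(2k) is odd)
    X = np.zeros((n, d))
    for i in range(n):
        if rng.random() < 0.5:
            b = rng.integers(k)                      # heads: pick a block
            ones = rng.choice(block, size=block // 2, replace=False)
            X[i, b * block + ones] = 1.0
        else:
            X[i, rng.integers(d)] = 1.0              # tails: standard basis vector
    return X

def block_target(d, k):
    return np.repeat(rng.choice([-1.0, 1.0], size=k), d // k)   # one sign per block

X = sample_D(d=40, k=4, n=200)       # block size 10, so d/(2k) = 5 is odd
w_star = block_target(40, 4)
y = np.sign(X @ w_star)
print(lq_lp_margin(X, y, w_star, p=1))   # L_inf L_1 margin of w* on the sample: 1.0
```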

Proposition 2. If 1 < p < ∞ then for any ε > 0,

    N(ε, W_p, D) ≥ 2^{(1/2 − H(2ε))d − k^{1/q}(d/2)^{1/p} − k}.

Proof. First we show that |W_p| ≥ 2^{d/2 − k^{1/q}(d/2)^{1/p} − k}. Any w* ∈ W_1 has margin γ_{q,p}(w*) = d^{−1/q}, so W_p = {w ∈ R^d : γ_{q,p}(w) ≥ d^{−1/q}}. Note that W_p ⊆ {−1, 1}^d, because if w ∉ {−1, 1}^d then γ_{q,p}(w) ≤ min_i |w · e_i| / ‖w‖_q = min_i |w_i| / ‖w‖_q < d^{−1/q}. Let w ∈ {−1, 1}^d be a weight vector such that in each block there are at least d/k − r positive values and at most r negative values, or vice versa (there are at most r values with whichever sign is in the minority). Clearly w has large margin with respect to any of the basis vectors drawn from D. For the rest of D we have inf_x |w · x| = max(1, d/(2k) − 2r), so w ∈ W_p if and only if max(1, d/(2k) − 2r) ≥ (d/(2k))^{1/p}. For p < ∞ and d > 2k, this happens if and only if

    r ≤ (1/2) ( d/(2k) − (d/(2k))^{1/p} ).

Letting r = ⌊ (1/2) ( d/(2k) − (d/(2k))^{1/p} ) ⌋, we have

    |W_p| = ( 2 Σ_{i=0}^{r} C(d/k, i) )^k ≥ 2^{d/2 − k^{1/q}(d/2)^{1/p} − k}.

Now we can lower bound the covering number using a volume argument by noting that if m is the cardinality of the largest ε-ball around any w ∈ W_p, then N(ε, W_p, D) ≥ |W_p|/m. For any pair w, w' ∈ W_p, d_D(w, w') ≥ h(w, w')/(2d), where h(w, w') is the Hamming distance (the number of coordinates in which w and w' disagree). Therefore, in order for d_D(w, w') ≤ ε we need h(w, w') ≤ 2εd. For any w ∈ W_p the number of w' such that h(w, w') ≤ 2εd is at most Σ_{i=0}^{2εd} C(d, i) ≤ 2^{H(2ε)d}. It follows that

    N(ε, W_p, D) ≥ |W_p| / 2^{H(2ε)d} ≥ 2^{d/2 − H(2ε)d − k^{1/q}(d/2)^{1/p} − k}.

Proposition 3. For any ε > 0, N(ε, W_∞, D) ≥ 2^{(1 − H(2ε))d}.

Proof. First we show that W_∞ = {−1, 1}^d. Any w* ∈ W_1 has margin γ_{1,∞}(w*) = 1/d, so W_∞ = {w ∈ R^d : γ_{1,∞}(w) ≥ 1/d}. For any w ∈ {−1, 1}^d and example x drawn from D, we have ‖w‖_1 = d, ‖x‖_∞ = 1, and |w · x| ≥ 1 (since x has an odd number of coordinates set to 1), resulting in margin γ_{1,∞}(w) ≥ 1/d. If w ∉ {−1, 1}^d then γ_{1,∞}(w) = min_x |w · x| / ‖w‖_1 ≤ min_i |w · e_i| / ‖w‖_1 = min_i |w_i| / ‖w‖_1 < 1/d. To bound the covering number, we use the same volume argument as above. Again, the size of the largest ε-ball around any w ∈ W_∞ is at most 2^{H(2ε)d} (since this bound only requires that every pair w, w' ∈ W_∞ has d_D(w, w') ≥ h(w, w')/(2d)). It follows that N(ε, W_∞, D) ≥ 2^d / 2^{H(2ε)d}.

Using standard distribution-specific sample complexity bounds based on covering numbers (Itai and Benedek, 1991), we have an upper bound of O((1/ε) ln(N(ε, W, D)/δ)) and a lower bound of ln((1 − δ) N(2ε, W, D)) for learning, with probability at least 1 − δ, a concept in W to within ε error. Thus, we have the following results for the sample complexity m of learning W_p with respect to D. If p = 1 then

    m ≤ O( (1/ε) ( k + ln(1/δ) ) );

if 1 < p < ∞ then

    m ≥ (1/2 − H(4ε)) d − k^{1/q} (d/2)^{1/p} − k + ln(1 − δ);

and if p = ∞ then

    m ≥ (1 − H(4ε)) d + ln(1 − δ).

For appropriate values of k and ε relative to d, the sample complexity can be much smaller in the p = 1 case. For example, if k = O(d^{1/4}) and Ω(d^{−1/4}) ≤ ε ≤ 1/40, then (assuming δ is a small constant) having O(√d) examples is sufficient for learning W_1 while at least Ω(d) examples are required to learn W_p for any p > 1.

5 EXPERIMENTS

We performed two empirical studies to support our theoretical results. First, to give further evidence that using L_∞ L_1 margins can lead to faster learning than other margins, we ran experiments on both synthetic and real-world data sets. Using the L_q L_p SVM formulation defined in (1) for linearly separable data and the formulation defined in (2) for non-separable data, both implemented using standard convex optimization software, we ran our algorithms for a range of values of p and a range of sample sizes n on each data set. We report several cases in which maximizing the L_∞ L_1 margin results in faster learning (i.e., smaller sample complexity) than maximizing other margins.

Figure 1 shows results on two synthetic data sets. One is generated using the Blocks distribution family from Section 4 with d = 90 and k = 9. The other uses examples generated from a standard Gaussian distribution in R^100, subject to having L_∞ L_1 margin at least 0.075 with respect to a fixed random target vector in {−1, 1}^d (in other words, Gaussian samples with margin smaller than 0.075 are rejected). In both cases, the error decreases much faster for p < 2 than for large p.

Figure 2 shows results on three data sets from the UCI Machine Learning Repository (Bache and Lichman, 2013). The Fertility data set consists of 100 training examples in R^10, the SPECTF Heart data set has 267 examples in R^44, and we used a subset of the CNAE-9 data set with 240 examples in R^857. In all three cases, better performance was achieved by algorithms with p < 2 than by those with p > 2.
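A rough sketch of the kind of comparison behind Figure 1 (top) is given below; it is illustrative only, not the authors' experimental code, and it reuses the hypothetical helpers sample_D, block_target, and lq_lp_svm from the earlier sketches.

```python
# Train the L_q L_p SVM on the Blocks family (d = 90, k = 9) for several values of p
# and several training set sizes, and estimate the test error on fresh samples.
import numpy as np

d, k, trials = 90, 9, 10
w_star = block_target(d, k)
for p in [1.0, 1.5, 2.0, 5.0, np.inf]:
    for n in [50, 100, 200]:
        errs = []
        for _ in range(trials):
            X_tr = sample_D(d, k, n)
            y_tr = np.sign(X_tr @ w_star)
            w_hat, _ = lq_lp_svm(X_tr, y_tr, p)
            X_te = sample_D(d, k, 2000)
            y_te = np.sign(X_te @ w_star)
            errs.append(np.mean(np.sign(X_te @ w_hat) != y_te))
        print(f"p = {p}, n = {n}, error = {np.mean(errs):.3f}")
```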

Figure 1: Synthetic data results for the Blocks distribution (top) and the Gaussian with margin (bottom). The left column plots generalization error (averaged over 500 trials with different training sets) versus the number of training examples n, while the right column plots error versus p.

The goal of our second experiment was to determine, for real-world data, what parameter α can be used in the bound on ‖X‖_{2,p} in Theorem 4. Specifically, for each data set we want to find α_min = inf{α : ‖X‖_{2,p} ≤ n^α X_p}, the smallest value of α such that the bound holds with C = 1. Recall that for p = 1, α_min can theoretically be as great as 1, while for p ≥ 2 it is at most 1/2. We would like to see whether α_min is often small in real data sets or whether it is close to the theoretical upper bound.

We can estimate α_min for a given set of data by creating a sequence {X_m}_{m=1}^{n} of data matrices, adding to the matrix one point from the data set at a time. For each point in the sequence we can compute

    α_m = log( ‖X_m‖_{2,p} / X_p ) / log m,

a value of α that realizes the bound with equality for this particular data matrix. We repeat this process T times, each with a different random ordering of the data, to find T sequences α_m^i, where 1 ≤ i ≤ T and 1 ≤ m ≤ n. We can then compute α̂_min = max_{i,m} α_m^i, a value of α which realizes the bound for every data matrix considered and which causes the bound to hold with equality in at least one instance. Figure 3 shows a histogram of the resulting estimates on a variety of data sets and for three values of p.

Figure 3: A histogram showing the values of α̂_min on 47 data sets from the UCI repository, for p = 1, p = 2, and p = ∞.

Notice that in the vast majority of cases, the estimate of α_min is less than 1/2. As expected there are more values above 1/2 for p = 1 than for p ≥ 2, but none of the estimates were above 0.7. This gives us evidence that many real data sets are much more favorable for learning with large L_∞ L_1 margins than the worst-case bounds may suggest.
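The estimation procedure just described can be sketched as follows (illustrative code, not the authors'; whether X_p is taken over the prefix or over the full data set, and the choice to start at m = 2 to avoid dividing by log 1, are our assumptions). It reuses the hypothetical l2p_matrix_norm helper from the earlier sketch.

```python
import numpy as np

def estimate_alpha_min(points, p, T=10, seed=0):
    """points: (n, d) array; returns the maximum over T random orderings and prefix sizes m
    of alpha_m = log(||X_m||_{2,p} / X_p) / log m."""
    rng = np.random.default_rng(seed)
    n = points.shape[0]
    best = -np.inf
    for _ in range(T):
        order = rng.permutation(n)
        for m in range(2, n + 1):                 # m = 1 would divide by log 1 = 0
            Xm = points[order[:m]]                # first m points under this ordering
            X_p = np.max(np.linalg.norm(Xm, ord=p, axis=1))
            alpha_m = np.log(l2p_matrix_norm(Xm.T, p) / X_p) / np.log(m)
            best = max(best, alpha_m)
    return best
```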

Figure 2: Results for the Fertility data (top), SPECTF Heart data (middle), and CNAE-9 data (bottom). The left column plots error (averaged over 100 trials with different training sets and tested on all non-training examples) versus the number of training examples n, while the right column plots error versus p.

6 DISCUSSION

Theorem 4 applies to the realizable case in which the two classes are linearly separable by a positive L_q L_p margin. The result can be extended to the non-realizable case by bounding the empirical Rademacher complexity in terms of the L_{2,p}-norm of the data using the Khintchine inequality. This bound can be seen as a special case of Proposition 2 of Kloft and Blanchard (2012), as our setting is a special case of multiple kernel learning (see supplementary material for details). Kloft and Blanchard (2012) also prove bounds on the population and local Rademacher complexities, although in doing so they introduce an explicit dependence on the dimension (number of kernels).

This highlights a difference in goals between our work and much of the MKL literature. The MKL literature provides bounds that apply to all data distributions while focusing on the p ≥ 2 regime, as this is the most relevant range for MKL. On the other hand, we are interested in understanding what kinds of data lead to fast (and dimension-independent) learning for different notions of margin, and furthermore when one type of margin can be provably better than another. We make steps toward understanding this through our condition on the L_{2,p}-norm and concrete lower bounds on the generalization error, neither of which has been explored in the context of MKL. Carrying over these perspectives into more general settings such as MKL is an exciting direction for further research.

Acknowledgments

We thank Ying Xiao for insightful discussions and the anonymous reviewers for their helpful comments. This work was supported in part by NSF grants CCF and CCF, AFOSR grant FA, ONR grant N, and a Microsoft Research Faculty Fellowship.

References

K. Bache and M. Lichman. UCI machine learning repository, 2013. URL http://archive.ics.uci.edu/ml.

M.-F. Balcan, C. Berlind, S. Ehrlich, and Y. Liang. Efficient Semi-supervised and Active Learning of Disjunctions. In Proceedings of the International Conference on Machine Learning (ICML), 2013.

P. L. Bartlett and J. Shawe-Taylor. Generalization performance of support vector machines and other pattern classifiers. In B. Schölkopf, C. J. C. Burges, and A. J. Smola, editors, Advances in Kernel Methods. MIT Press, 1999.

A. Blum and M.-F. Balcan. Open Problems in Efficient Semi-Supervised PAC Learning. In Proceedings of the Conference on Learning Theory (COLT), 2007.

C. Cortes, M. Mohri, and A. Rostamizadeh. Generalization Bounds for Learning Kernels. In Proceedings of the International Conference on Machine Learning (ICML), 2010.

C. Gentile. The Robustness of the p-Norm Algorithms. Machine Learning, 53, 2003.

C. Gentile. Personal communication, 2013.

A. J. Grove, N. Littlestone, and D. Schuurmans. General Convergence Results for Linear Discriminant Updates. Machine Learning, 43, 2001.

U. Haagerup. The best constants in the Khintchine inequality. Studia Mathematica, 1982.

A. Itai and G. M. Benedek. Learnability with respect to fixed distributions. Theoretical Computer Science, 86(2), 1991.

S. M. Kakade, K. Sridharan, and A. Tewari. On the Complexity of Linear Prediction: Risk Bounds, Margin Bounds, and Regularization. In Proceedings of Neural Information Processing Systems (NIPS), 2009.

M. Kloft and G. Blanchard. On the Convergence Rate of l_p-Norm Multiple Kernel Learning. Journal of Machine Learning Research, 13, 2012.

N. Littlestone. Learning quickly when irrelevant attributes abound: A new linear-threshold algorithm. Machine Learning, 2(4), 1988.

A. Maurer and M. Pontil. Structured Sparsity and Generalization. Journal of Machine Learning Research, 13, 2012.

F. Rosenblatt. The perceptron: a probabilistic model for information storage and organization in the brain. Psychological Review, 65(6), November 1958.

R. A. Servedio. PAC Analogues of Perceptron and Winnow via Boosting the Margin. In Proceedings of the Conference on Learning Theory (COLT), 2000.

T. Zhang. On the dual formulation of regularized linear systems with convex risks. Machine Learning, 46:91-129, 2002.


More information

Universal Finite Memory Coding of Binary Sequences

Universal Finite Memory Coding of Binary Sequences Deartment of Electrical Engineering Systems Universal Finite Memory Coding of Binary Sequences Thesis submitted towards the degree of Master of Science in Electrical and Electronic Engineering in Tel-Aviv

More information

DETC2003/DAC AN EFFICIENT ALGORITHM FOR CONSTRUCTING OPTIMAL DESIGN OF COMPUTER EXPERIMENTS

DETC2003/DAC AN EFFICIENT ALGORITHM FOR CONSTRUCTING OPTIMAL DESIGN OF COMPUTER EXPERIMENTS Proceedings of DETC 03 ASME 003 Design Engineering Technical Conferences and Comuters and Information in Engineering Conference Chicago, Illinois USA, Setember -6, 003 DETC003/DAC-48760 AN EFFICIENT ALGORITHM

More information

Improvement on the Decay of Crossing Numbers

Improvement on the Decay of Crossing Numbers Grahs and Combinatorics 2013) 29:365 371 DOI 10.1007/s00373-012-1137-3 ORIGINAL PAPER Imrovement on the Decay of Crossing Numbers Jakub Černý Jan Kynčl Géza Tóth Received: 24 Aril 2007 / Revised: 1 November

More information

Introduction to Banach Spaces

Introduction to Banach Spaces CHAPTER 8 Introduction to Banach Saces 1. Uniform and Absolute Convergence As a rearation we begin by reviewing some familiar roerties of Cauchy sequences and uniform limits in the setting of metric saces.

More information

Combining Logistic Regression with Kriging for Mapping the Risk of Occurrence of Unexploded Ordnance (UXO)

Combining Logistic Regression with Kriging for Mapping the Risk of Occurrence of Unexploded Ordnance (UXO) Combining Logistic Regression with Kriging for Maing the Risk of Occurrence of Unexloded Ordnance (UXO) H. Saito (), P. Goovaerts (), S. A. McKenna (2) Environmental and Water Resources Engineering, Deartment

More information

LORENZO BRANDOLESE AND MARIA E. SCHONBEK

LORENZO BRANDOLESE AND MARIA E. SCHONBEK LARGE TIME DECAY AND GROWTH FOR SOLUTIONS OF A VISCOUS BOUSSINESQ SYSTEM LORENZO BRANDOLESE AND MARIA E. SCHONBEK Abstract. In this aer we analyze the decay and the growth for large time of weak and strong

More information

Coding Along Hermite Polynomials for Gaussian Noise Channels

Coding Along Hermite Polynomials for Gaussian Noise Channels Coding Along Hermite olynomials for Gaussian Noise Channels Emmanuel A. Abbe IG, EFL Lausanne, 1015 CH Email: emmanuel.abbe@efl.ch Lizhong Zheng LIDS, MIT Cambridge, MA 0139 Email: lizhong@mit.edu Abstract

More information

15-451/651: Design & Analysis of Algorithms October 23, 2018 Lecture #17: Prediction from Expert Advice last changed: October 25, 2018

15-451/651: Design & Analysis of Algorithms October 23, 2018 Lecture #17: Prediction from Expert Advice last changed: October 25, 2018 5-45/65: Design & Analysis of Algorithms October 23, 208 Lecture #7: Prediction from Exert Advice last changed: October 25, 208 Prediction with Exert Advice Today we ll study the roblem of making redictions

More information

A Qualitative Event-based Approach to Multiple Fault Diagnosis in Continuous Systems using Structural Model Decomposition

A Qualitative Event-based Approach to Multiple Fault Diagnosis in Continuous Systems using Structural Model Decomposition A Qualitative Event-based Aroach to Multile Fault Diagnosis in Continuous Systems using Structural Model Decomosition Matthew J. Daigle a,,, Anibal Bregon b,, Xenofon Koutsoukos c, Gautam Biswas c, Belarmino

More information

1 Probability Spaces and Random Variables

1 Probability Spaces and Random Variables 1 Probability Saces and Random Variables 1.1 Probability saces Ω: samle sace consisting of elementary events (or samle oints). F : the set of events P: robability 1.2 Kolmogorov s axioms Definition 1.2.1

More information

B8.1 Martingales Through Measure Theory. Concept of independence

B8.1 Martingales Through Measure Theory. Concept of independence B8.1 Martingales Through Measure Theory Concet of indeendence Motivated by the notion of indeendent events in relims robability, we have generalized the concet of indeendence to families of σ-algebras.

More information

Online Learning of Noisy Data with Kernels

Online Learning of Noisy Data with Kernels Online Learning of Noisy Data with Kernels Nicolò Cesa-Bianchi Università degli Studi di Milano cesa-bianchi@dsiunimiit Shai Shalev Shwartz The Hebrew University shais@cshujiacil Ohad Shamir The Hebrew

More information

A Social Welfare Optimal Sequential Allocation Procedure

A Social Welfare Optimal Sequential Allocation Procedure A Social Welfare Otimal Sequential Allocation Procedure Thomas Kalinowsi Universität Rostoc, Germany Nina Narodytsa and Toby Walsh NICTA and UNSW, Australia May 2, 201 Abstract We consider a simle sequential

More information

Best approximation by linear combinations of characteristic functions of half-spaces

Best approximation by linear combinations of characteristic functions of half-spaces Best aroximation by linear combinations of characteristic functions of half-saces Paul C. Kainen Deartment of Mathematics Georgetown University Washington, D.C. 20057-1233, USA Věra Kůrková Institute of

More information

Quantitative estimates of propagation of chaos for stochastic systems with W 1, kernels

Quantitative estimates of propagation of chaos for stochastic systems with W 1, kernels oname manuscrit o. will be inserted by the editor) Quantitative estimates of roagation of chaos for stochastic systems with W, kernels Pierre-Emmanuel Jabin Zhenfu Wang Received: date / Acceted: date Abstract

More information

CONVOLVED SUBSAMPLING ESTIMATION WITH APPLICATIONS TO BLOCK BOOTSTRAP

CONVOLVED SUBSAMPLING ESTIMATION WITH APPLICATIONS TO BLOCK BOOTSTRAP Submitted to the Annals of Statistics arxiv: arxiv:1706.07237 CONVOLVED SUBSAMPLING ESTIMATION WITH APPLICATIONS TO BLOCK BOOTSTRAP By Johannes Tewes, Dimitris N. Politis and Daniel J. Nordman Ruhr-Universität

More information

7.2 Inference for comparing means of two populations where the samples are independent

7.2 Inference for comparing means of two populations where the samples are independent Objectives 7.2 Inference for comaring means of two oulations where the samles are indeendent Two-samle t significance test (we give three examles) Two-samle t confidence interval htt://onlinestatbook.com/2/tests_of_means/difference_means.ht

More information

AKRON: An Algorithm for Approximating Sparse Kernel Reconstruction

AKRON: An Algorithm for Approximating Sparse Kernel Reconstruction : An Algorithm for Aroximating Sarse Kernel Reconstruction Gregory Ditzler Det. of Electrical and Comuter Engineering The University of Arizona Tucson, AZ 8572 USA ditzler@email.arizona.edu Nidhal Carla

More information

Econometrica Supplementary Material

Econometrica Supplementary Material Econometrica Sulementary Material SUPPLEMENT TO WEAKLY BELIEF-FREE EQUILIBRIA IN REPEATED GAMES WITH PRIVATE MONITORING (Econometrica, Vol. 79, No. 3, May 2011, 877 892) BY KANDORI,MICHIHIRO IN THIS SUPPLEMENT,

More information