Approximating and Testing k-histogram Distributions in Sub-linear time


Electronic Colloquium on Computational Complexity, Report No. 171 (2011)

Approximating and Testing k-histogram Distributions in Sub-linear Time

Piotr Indyk*   Reut Levi†   Ronitt Rubinfeld‡

November 9, 2011

Abstract

A discrete distribution p, over [n], is a k-histogram if its probability distribution function can be represented as a piece-wise constant function with k pieces. Such a function is represented by a list of k intervals and k corresponding values. We consider the following problem: given a collection of samples from a distribution p, find a k-histogram that approximately minimizes the ℓ₂ distance to the distribution p. We give time and sample efficient algorithms for this problem. We further provide algorithms that distinguish distributions that have the property of being a k-histogram from distributions that are ɛ-far from any k-histogram in the ℓ₁ distance and ℓ₂ distance respectively.

* CSAIL, MIT, Cambridge MA. E-mail: indyk@theory.lcs.mit.edu. This material is based upon work supported by the David and Lucille Packard Fellowship, MADALGO (Center for Massive Data Algorithmics, funded by the Danish National Research Association) and NSF grant CCF.
† School of Computer Science, Tel Aviv University. E-mail: reuti.levi@gmail.com. Research supported by the Israel Science Foundation grant nos. 1147/09 and 46/08.
‡ CSAIL, MIT, Cambridge MA 02139 and the Blavatnik School of Computer Science, Tel Aviv University. E-mail: ronitt@csail.mit.edu. Research supported by NSF grants, Marie Curie Reintegration grant PIRG03-GA and the Israel Science Foundation grant nos. 1147/09 and 1675/09.

1 Introduction

The ubiquity of massive data sets is a phenomenon that began over a decade ago, and is becoming more and more pervasive. As a result, there has recently been significant interest in constructing succinct representations of the data. Ideally, such representations should take little space and computation time to operate on, while approximately preserving the desired properties of the data.

One of the most natural and useful succinct representations of the data are histograms. For a data set D whose elements come from the universe [n], a k-histogram H is a piecewise constant function defined over [n] consisting of k pieces. Note that a k-histogram can be described using O(k) numbers. A good k-histogram is such that (a) the value H(i) is a good approximation of the total number of times an element i occurs in the data set (denoted by P(i)) and (b) the value of k is small.

Histograms are a popular and flexible way to approximate the distribution of data attributes (e.g., employees' age or salary) in databases. They can be used for data visualization, analysis and approximate query answering. As a result, computing and maintaining histograms of the data has attracted a substantial amount of interest in databases and beyond; see e.g. [GMP97, JPK+98, GKS06, CMN98, TGIK02, GGI+02], or the survey [Ioa03].

A popular criterion for fitting a histogram to a distribution P is the least-squares criterion. Specifically, the goal is to find H that minimizes the ℓ₂ norm ‖P − H‖₂. Such histograms are often called v-optimal histograms, with v standing for variance. There has been a substantial amount of work on algorithms, approximate or exact, that compute the optimal k-histogram H given P and k [JPK+98, GKS06]. However, since these algorithms need to read the whole input to compute H, their running times are at least linear in n. A more efficient way to construct data histograms is to use random samples from the data set D. There have been some results on this front as well [CMN98, GMP97].
However, they have been restricted to so-called equi-depth histograms (which are essentially approximate quantiles of the data distribution) or so-called compressed histograms. Although the names by which they are referred to sound similar, both of these representations are quite different from the representations considered in this paper. We are not aware of any work on constructing v-optimal histograms from random samples with provable guarantees.

The problem of constructing an approximate histogram from random samples can be formulated in the framework of distribution property testing and estimation [Rub06, Ron08]. In this framework, an algorithm is given access to i.i.d. samples from an unknown probability distribution p, and its goal is to characterize or estimate various properties of p. In our case we define p = P/‖P‖₁. Then choosing a random element from the data set D corresponds to choosing i ∈ [n] according to the distribution p. In this paper we propose several algorithms for constructing and testing for the existence of good histograms approximating a given distribution p.

1.1 Histogram taxonomy

Formally, a histogram is a function H : [n] → [0, 1] that is defined by a sequence of intervals I₁, ..., I_k and a corresponding sequence of values v₁, ..., v_k. For t ∈ [n], H(t) represents an estimate of p(t). We consider the following classes of histograms (see [TGIK02] for a full list of classes):

1. Tiling histograms: the intervals form a tiling of [n] (i.e., they are disjoint and cover the whole domain). For any t we have H(t) = v_i, where t ∈ I_i. In practice we represent a tiling k-histogram as a sequence {(I₁, v₁), ..., (I_k, v_k)}.

2. Priority histograms: the intervals can overlap. For any t we have H(t) = v_i, where i is the largest index such that t ∈ I_i; if none exists, H(t) = 0. In practice we represent a priority k-histogram as

{(I₁, v₁, r₁), ..., (I_k, v_k, r_k)}, where r₁, ..., r_k correspond to the priorities of the intervals.

Note that if a function has a tiling k-histogram representation then it has a priority k-histogram representation. Conversely, if it has a priority k-histogram representation then it has a tiling 2k-histogram representation.

1.2 Results

The following algorithms receive as input a distribution p over [n], an accuracy parameter ɛ and an integer k.

1. In Section 3, we describe an algorithm which outputs a priority k-histogram that is closest to p in the ℓ₂ distance, up to an ɛ-additive error. The algorithm is a greedy algorithm: at each step it enumerates over all possible intervals and adds the interval which minimizes the approximated ℓ₂ distance. The sample complexity of the algorithm is Õ((k²/ɛ²) · ln n) and the running time is Õ((k²/ɛ²) · n²). We then improve the running time substantially to Õ((k²/ɛ²) · ln n) by enumerating over a partial set of intervals.

2. In Section 4, we provide a testing algorithm for the property of being a tiling k-histogram with respect to the ℓ₁ norm. The sample complexity of the algorithm is Õ(√(kn)/ɛ⁵). We provide a similar test for the ℓ₂ norm that has sample complexity O(ln² n/ɛ⁴). We prove that testing if a distribution is a tiling k-histogram in the ℓ₁-norm requires Ω(√(kn)) samples, for every k ≤ 1/ɛ.

1.3 Related Work

Our formulation of the problem falls within the framework of property testing [RS96, GGR98, BFR+00]. Properties of single and of pairs of distributions have been studied quite extensively in the past (see [BFR+10, BFF+01, AAK+07, BDKR05, GMP97, BKR04, RRSS09, Val08, VV11]). One question that has received much attention in property testing is to determine whether or not two distributions are similar. The problem referred to as Identity testing assumes that the algorithm is given access to samples of distribution p and an explicit description of distribution q. The goal is to distinguish a pair of distributions that are identical from a pair of distributions that are far from each other.
A special case of Identity testing is Uniformity testing, where the fixed distribution, q, is the uniform distribution. A uniform distribution can be represented by a tiling 1-histogram, and therefore the study of uniformity testing is closely related to our study. Goldreich and Ron [GR00] study uniformity testing in the context of approximating graph expansion. They show that counting pairwise collisions in a sample can be used to approximate the ℓ₂-norm of the probability distribution from which the sample was drawn. Several more recent works, including this one, make use of this technical tool. Batu et al. [BFR+10] note that running the [GR00] algorithm with Õ(√n) samples yields an algorithm for uniformity testing in the ℓ₁-norm. Paninski [Pan08] gives an optimal algorithm in this setting that takes a sample of size O(√n) and proves a matching lower bound of Ω(√n). Valiant [Val08] shows that a tolerant tester for uniformity with constant precision would require n^(1−o(1)) samples. Several works in property testing of distributions approximate the distribution by a small histogram distribution and use this representation in an essential way in their algorithms [BKR04, BFF+01]. Histograms were the subject of extensive research in the data stream literature; see [TGIK02, GGI+02] and the references therein. Our algorithm in Section 3 is inspired by a streaming algorithm in [TGIK02].
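To make the two histogram classes of Section 1.1 concrete, the following small Python sketch (our own illustration; helper names and the inclusive-endpoint, 0-indexed convention are assumptions, not the paper's) evaluates a tiling and a priority histogram at a point:

```python
# Illustrative sketch of the two histogram classes from Section 1.1.
# Interval endpoints are assumed inclusive and 0-indexed.

def eval_tiling(hist, t):
    """Tiling k-histogram: disjoint intervals covering the whole domain.
    hist is a list of ((a, b), v) pairs; H(t) = v_i for the tile containing t."""
    for (a, b), v in hist:
        if a <= t <= b:
            return v
    raise ValueError("tiles must cover the whole domain")

def eval_priority(hist, t):
    """Priority k-histogram: intervals may overlap.
    hist is a list of ((a, b), v, r) triples; H(t) is the value of the
    highest-priority interval containing t, or 0 if none contains t."""
    best_r, best_v = -1, 0.0
    for (a, b), v, r in hist:
        if a <= t <= b and r > best_r:
            best_r, best_v = r, v
    return best_v

# A priority 2-histogram and an equivalent tiling 3-histogram over [0, 9],
# illustrating that a priority representation converts to a tiling one
# with more pieces:
priority = [((0, 9), 0.05, 1), ((3, 5), 0.15, 2)]
tiling = [((0, 2), 0.05), ((3, 5), 0.15), ((6, 9), 0.05)]
assert all(eval_priority(priority, t) == eval_tiling(tiling, t) for t in range(10))
```

The example pair also shows why the conversion from priority to tiling form can increase the number of pieces: the overlapping interval splits the enclosing one.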

2 Preliminaries

Denote by D_n the set of all discrete distributions over [n]. A property of discrete distributions is a subset P ⊆ D_n. We say that a distribution p ∈ D_n is ɛ-far from p' ∈ D_n in the ℓ₁ distance (ℓ₂ distance) if ‖p − p'‖₁ > ɛ (‖p − p'‖₂ > ɛ). We say that an algorithm, A, is a testing algorithm for the property P if, given an accuracy parameter ɛ and a distribution p:

1. if p ∈ P, A accepts p with probability at least 2/3;
2. if p is ɛ-far (according to the specified distance measure) from every distribution in P, A rejects p with probability at least 2/3.

Let p ∈ D_n; then for every i ∈ [n], denote by p_i the probability of the i-th element. For every I ⊆ [n], let p(I) denote the weight of I, i.e. ∑_{i∈I} p_i. For every I ⊆ [n] such that p(I) ≠ 0, let p_I denote the distribution of p restricted to I, i.e. p_I(i) = p_i / p(I). Call an interval I flat if p_I is uniform or p(I) = 0. Given a set of samples from p, S, denote by S_I the samples that fall in the interval I. Define the observed collision probability of interval I as colls(S_I)/C(|S|,2), where colls(S_I) := ∑_{i∈I} C(occ(i,S_I),2), occ(i,S_I) is the number of occurrences of i in S_I, and C(m,2) = m(m−1)/2. In [GR00], in the proof of Lemma 1, it was shown that

E[colls(S_I)/C(|S_I|,2)] = ‖p_I‖₂²   (1)

and that

Pr[ |colls(S_I)/C(|S_I|,2) − ‖p_I‖₂²| > δ·‖p_I‖₂² ] < 4/(δ² · √|S_I| · ‖p_I‖₂).   (2)

In particular, since ‖p_I‖₂ ≤ 1, we also have that

Pr[ |colls(S_I)/C(|S_I|,2) − ‖p_I‖₂²| > ɛ ] < 4/(ɛ² · √|S_I|).   (3)

In a similar fashion we prove the following lemma.

Lemma 1 (Based on [GR00]) If we take m = 4/ɛ² samples, S, then, for every interval I,

Pr[ |colls(S_I)/C(|S|,2) − ∑_{i∈I} p_i²| ≤ ɛ·p(I) ] > 3/4.   (4)

Proof: For every i < j define an indicator variable C_{i,j} so that C_{i,j} = 1 if the i-th sample is equal to the j-th sample and is in the interval I. For every i < j, μ := E[C_{i,j}] = ∑_{i∈I} p_i². Let P := {(i,j) : 1 ≤ i < j ≤ m}. By Chebyshev's inequality:

Pr[ |∑_{(i,j)∈P} C_{i,j} − |P|·μ| > ɛ·p(I)·|P| ] ≤ Var[∑_{(i,j)∈P} C_{i,j}] / (ɛ² · p(I)² · |P|²).

From [GR00] we know that

Var[∑_{(i,j)∈P} C_{i,j}] ≤ |P|·μ + |P|^(3/2)·μ^(3/2),   (5)

and since μ ≤ p(I)² we have Var[∑_{(i,j)∈P} C_{i,j}] ≤ p(I)²·|P| + |P|^(3/2)·p(I)²·μ^(1/2); thus

Pr[ |∑_{(i,j)∈P} C_{i,j} − |P|·μ| > ɛ·p(I)·|P| ] < (|P| + |P|^(3/2)·μ^(1/2)) / (ɛ²·|P|²)   (6)
≤ 1/(ɛ²·|P|) + p(I)/(ɛ²·√|P|) ≤ 1/4.   (7)

3 Near-optimal Priority k-histogram

In this section we give an algorithm that, given p ∈ D_n, outputs a priority k-histogram which is close in the ℓ₂ distance to an optimal tiling k-histogram that describes p. The algorithm, based on a sketching algorithm in [TGIK02], takes a greedy strategy. Initially the algorithm starts with an empty priority histogram. It then proceeds by doing 2k ln(1/ɛ) iterations, where in each iteration it goes over all O(n²) possible intervals and adds the best one, i.e. the interval I ⊆ [n] which minimizes the distance between p and H when added to the currently constructed priority histogram H. The algorithm has an efficient sample complexity, with only logarithmic dependence on n, but the running time has polynomial dependence on n. This polynomial dependence is due to the exhaustive search for the interval which minimizes the distance between p and H.

Theorem 1 Let p ∈ D_n be the distribution and let H* be the tiling k-histogram which minimizes ‖p − H*‖₂. The priority histogram H reported by Algorithm 1 satisfies ‖p − H‖₂² ≤ ‖p − H*‖₂² + 5ɛ. The sample complexity of Algorithm 1 is Õ((k²/ɛ²) · ln n). The running time complexity of Algorithm 1 is Õ((k²/ɛ²) · n²).

Proof: By Chernoff's bound and a union bound over the intervals in [n], with high constant probability, for every I,

|y_I − p(I)| ≤ ξ.   (8)

By Lemma 1 and Chernoff's bound, with high constant probability, for every I,

|z_I − ∑_{i∈I} p_i²| ≤ ξ·p(I).   (9)

Henceforth, we assume that the estimates obtained by the algorithm are good. It is clear that we can transform any priority histogram H into any tiling k-histogram H* by adding the k intervals of H* to H with the highest priority. This implies that there exists an interval J and a value y_J such that adding them to H as described in Algorithm 1 decreases the error in the following way:

‖p − H_{J,y_J}‖₂² − ‖p − H*‖₂² ≤ (1 − 1/k)·(‖p − H‖₂² − ‖p − H*‖₂²),   (10)

Algorithm 1: Greedy algorithm for priority k-histogram
1  Obtain l = ln(12n)/ξ² samples, S, from p, where ξ := ɛ/(2k ln(1/ɛ));
2  For each interval I ⊆ [n] set y_I := |S_I|/l;
3  Obtain r = ln(6n²) sets of samples, S¹, ..., S^r, each of size m = 4/ξ², from p;
4  For each interval I ⊆ [n] set z_I := median(colls(S¹_I)/C(m,2), ..., colls(S^r_I)/C(m,2));
5  Initialize the priority histogram H to empty;
6  for i := 1 to 2k ln(1/ɛ) do
7      foreach interval J ⊆ [n] do
8          Create H_{J,y_J}, obtained by:
               adding the interval J to H with the value y_J/|J|;
               recomputing the interval to the left (right) of J, I_L (I_R), so it would not intersect with J;
               adding I_L (I_R) with the value y_{I_L} (y_{I_R}) to H;
           c_J := ∑_{I ∈ H_{J,y_J}} (z_I − y_I²/|I|);
9      Let J_min be the interval with the smallest value of c_J;
10     Update H to be H_{J_min, y_{J_min}};
11 return H;

where H_{J,y_J} is defined in Algorithm 1. Next, we would like to write the distance between H_{J,y_J} and p as a function of ∑_{i∈I} p_i² and p(I), for I ∈ H_{J,y_J}. We note that the value of x that minimizes the sum ∑_{i∈I} (p_i − x)² is x = p(I)/|I|; therefore

‖p − H_{J,y_J}‖₂² = ∑_{I∈H_{J,y_J}} ∑_{i∈I} (p_i − p(I)/|I|)²   (11)
= ∑_{I∈H_{J,y_J}} ( ∑_{i∈I} p_i² − 2·(p(I)/|I|)·p(I) + p(I)²/|I| )   (12)
= ∑_{I∈H_{J,y_J}} ( ∑_{i∈I} p_i² − p(I)²/|I| ).   (13)

Since c_J = ∑_{I∈H_{J,y_J}} (z_I − y_I²/|I|), by applying the triangle inequality twice we get that

c_J ≤ ∑_{I∈H_{J,y_J}} |z_I − ∑_{i∈I} p_i²| + ∑_{I∈H_{J,y_J}} |∑_{i∈I} p_i² − y_I²/|I||   (14)
≤ ∑_{I∈H_{J,y_J}} |z_I − ∑_{i∈I} p_i²| + ∑_{I∈H_{J,y_J}} ( ∑_{i∈I} p_i² − p(I)²/|I| ) + ∑_{I∈H_{J,y_J}} |p(I)² − y_I²|/|I|.   (15)

So after reordering the terms in Equation (15) we obtain that

c_J ≤ ∑_{I∈H_{J,y_J}} ( ∑_{i∈I} p_i² − p(I)²/|I| ) + ∑_{I∈H_{J,y_J}} |z_I − ∑_{i∈I} p_i²| + ∑_{I∈H_{J,y_J}} |y_I² − p(I)²|/|I|.   (16)

From the fact that |y_I² − p(I)²| = |y_I − p(I)|·(y_I + p(I)) and Equation (8) it follows that

|y_I² − p(I)²| ≤ ξ·(ξ + 2p(I)).   (17)

Therefore we obtain from Equations (9), (13), (16) and (17) that

c_J ≤ ‖p − H_{J,y_J}‖₂² + ∑_{I∈H_{J,y_J}} ξ·p(I) + ∑_{I∈H_{J,y_J}} ξ·(ξ + 2p(I))   (18)
≤ ‖p − H_{J,y_J}‖₂² + 3ξ + |{I ∈ H_{J,y_J}}|·ξ².   (19)

Since the algorithm calculates c_J for every interval J, we derive from Equations (10) and (19) that at the q-th step

‖p − H_{J_min,y_{J_min}}‖₂² − ‖p − H*‖₂² ≤ (1 − 1/k)·(‖p − H‖₂² − ‖p − H*‖₂²) + 3ξ + 2qξ².   (20)

So for H obtained by the algorithm after q steps we have ‖p − H‖₂² ≤ ‖p − H*‖₂² + (1 − 1/k)^q + q·(3ξ + 2qξ²). Setting q = 2k ln(1/ɛ) we obtain that ‖p − H‖₂² ≤ ‖p − H*‖₂² + 5ɛ, as desired.

3.1 Improving the Running Time

We now turn to improving the running time complexity to match the sample complexity. Instead of going over all possible intervals in [n] in search of an interval I ⊆ [n] to add to the constructed priority histogram H, we search for I over a much smaller subset of intervals; in particular, only those intervals whose endpoints are samples or neighbors of samples. In Lemma 2 we prove that if we decrease the value a histogram H assigns to an interval I, then the square of the distance between H and p in the ℓ₂-norm can grow by at most 2p(I). The lemma implies that we can treat light-weight intervals as atomic components in our search, because they do not affect the distance between H and p by much. While the running time is reduced significantly, we prove that the histogram this algorithm outputs is still close to being optimal.

Lemma 2 Let p ∈ D_n and let I be an interval in [n]. For 0 ≤ β₂ < β₁ ≤ 1,

∑_{i∈I} (p_i − β₂)² ≤ ∑_{i∈I} (p_i − β₁)² + 2p(I).

Theorem 2 Let p and H* be as in Theorem 1. There is an algorithm that outputs a priority histogram H which satisfies ‖p − H‖₂² ≤ ‖p − H*‖₂² + 8ɛ. The sample complexity of the algorithm and the running time complexity of the algorithm are Õ((k²/ɛ²) · ln n).

Proof: In the improved algorithm, as in Algorithm 1, we take l = ln(12n)/ξ² samples, T. Instead of going over all J ⊆ [n] in Step 7 we consider only a small subset of intervals as candidates. We denote this subset of intervals by 𝒯.
Let T̂ be the set of all elements in T and those that are at distance one away, i.e.

T̂ = {min{i + 1, n}, i, max{i − 1, 0} : i ∈ T}. Then 𝒯 is the set of all intervals between pairs of elements in T̂, i.e. [a, b] ∈ 𝒯 if and only if a ≤ b and a, b ∈ T̂. Thus, the size of 𝒯 is bounded above by (3l + 1)². Therefore we decrease the number of iterations in Step 7 from O(n²) to at most (3l + 1)².

It is easy to see that intervals which are not in 𝒯 have small weight. Formally, let I be an interval such that p(I) > ξ. The probability that I has no hits after taking l samples is at most (1 − ξ)^l < 1/n². Therefore, by a union bound over all the intervals in [n], with high constant probability, for every interval which has no hits after taking l samples, the weight of the interval is at most ξ.

Next we see why in Step 7 we can ignore intervals which have small weight. Consider a single run of the loop in Step 7 in Algorithm 1. Let H be the histogram constructed by the algorithm so far and let J_min be the interval added to H at the end of the run. We shall see that there is an interval J' ∈ 𝒯 such that

‖p − H_{J', p(J')/|J'|}‖₂² ≤ ‖p − H_{J_min, y_{J_min}}‖₂² + 4ξ.   (22)

Denote the endpoints of J_min by a and b, where a < b. Let I₁ = [a₁, b₁] be the largest interval in 𝒯 such that I₁ ⊆ J_min, and let I₂ = [a₂, b₂] be the smallest interval in 𝒯 such that J_min ⊆ I₂. Therefore, for every interval J = [x, y] where x ∈ {a₁, a₂} and y ∈ {b₁, b₂} we have that ∑_{i ∈ J△J_min} p_i ≤ 2ξ, where J△J_min is the symmetric difference of J and J_min. Let β₁ (β₂) be the value assigned to i ∈ [a₂, a₁] (i ∈ [b₁, b₂]) by H_{J_min, y_{J_min}}, respectively. Notice that the algorithm only assigns values to intervals with endpoints in T̂; therefore β₁ and β₂ are well defined. Take J' to be as follows. If β₁ > y_{J_min}/|J_min| then take the start-point of J' to be a₁, otherwise take it to be a₂. If β₂ > y_{J_min}/|J_min| then take the end-point of J' to be b₁, otherwise take it to be b₂. By Lemma 2 it follows that

‖p − H_{J', y_{J_min}}‖₂² ≤ ‖p − H_{J_min, y_{J_min}}‖₂² + 2·∑_{i ∈ J'△J_min} p_i ≤ ‖p − H_{J_min, y_{J_min}}‖₂² + 4ξ.   (23)
Thus, we obtain Equation (22) from the fact that

‖p − H_{J', p(J')/|J'|}‖₂² = min_δ ‖p − H_{J', δ}‖₂².   (24)

Thus, by similar calculations as in the proof of Theorem 1, after q steps,

‖p − H‖₂² ≤ ‖p − H*‖₂² + (1 − 1/k)^q + q·(3ξ + 2qξ² + 4ξ).

Setting q = 2k ln(1/ɛ) we obtain that ‖p − H‖₂² ≤ ‖p − H*‖₂² + 8ɛ.

Proof of Lemma 2:

∑_{i∈I} (p_i − β₂)² = ∑_{i∈I} ((p_i − β₁) + (β₁ − β₂))²   (25)
= ∑_{i∈I} (p_i − β₁)² + ∑_{i∈I} (β₁ − β₂)·(2p_i − β₁ − β₂)   (26)
≤ ∑_{i∈I} (p_i − β₁)² + 2p(I).   (27)

4 Testing whether a Distribution is a Tiling k-histogram

In this section we provide testing algorithms for the property of being a tiling k-histogram. The testing algorithms attempt to partition [n] into k intervals which are flat according to p (recall that an interval is flat if it has a uniform conditional distribution or it has no weight). If the algorithm fails to do so then it rejects p. Intervals that are close to being flat can be detected because either they have light weight, in which case they can be

found via sampling, or they are not light weight, in which case they have a small ℓ₂-norm. A small ℓ₂-norm can in turn be detected via estimates of the collision probability. Thus an interval that has an overall small number of samples, or alternatively a small number of pairwise collisions, is considered by the algorithm to be a flat interval. The search for the boundaries of the flat intervals is performed in a manner similar to the search for a value in a binary search. The efficiency of our testing algorithms is stated in the following theorems:

Theorem 3 Algorithm 2 is a testing algorithm for the property of being a tiling k-histogram for the ℓ₂ distance measure. The sample complexity of the algorithm is O(ln² n/ɛ⁴). The running time complexity of the algorithm is O(k ln³ n/ɛ⁴).

Theorem 4 There exists a testing algorithm for the property of being a tiling k-histogram for the ℓ₁ distance measure. The sample complexity of the algorithm is Õ(√(kn)/ɛ⁵). The running time complexity of the algorithm is Õ(k·√(kn)/ɛ⁵).

Algorithm 2: Test Tiling k-histogram
1  Obtain r = 16 ln(6n²) sets of samples, S¹, ..., S^r, each of size m = 64 ln n/ɛ⁴, from p;
2  Set previous := 1, low := 1, high := n;
3  for i := 1 to k do
4      while high ≥ low do
5          mid := low + ⌊(high − low)/2⌋;
6          if testflatness-ℓ₂([previous, mid], S¹, ..., S^r, ɛ) then
7              low := mid + 1;
8          else
9              high := mid − 1;
10     previous := low;
11     high := n;
12 If previous = n then return ACCEPT;
13 return REJECT;

Algorithm 3: testflatness-ℓ₂(I, S¹, ..., S^r, ɛ)
1  For each i ∈ [r] set p̂_i(I) := |S^i_I|/m;
2  If there exists i ∈ [r] such that |S^i_I| < ɛ²m/2 then return ACCEPT;
3  If median(colls(S¹_I)/C(|S¹_I|,2), ..., colls(S^r_I)/C(|S^r_I|,2)) ≤ 1/|I| + max_i{(ɛ²/2)·p̂_i(I)} then return ACCEPT;
4  return REJECT;

Proof of Theorem 3: Let I be an interval in [n]. We first show that

Pr[ |median{colls(S¹_I)/C(|S¹_I|,2), ..., colls(S^r_I)/C(|S^r_I|,2)} − ‖p_I‖₂²| ≤ max_i{(ɛ²/2)·p̂_i(I)} ] > 1 − 1/(6n²).   (28)

Recall that p̂_i(I) = |S^i_I|/m, where m = 64 ln n/ɛ⁴; hence |S^i_I| = 64 ln n · p̂_i(I)/ɛ⁴. By Equation (3), for each i ∈ [r],

Pr[ |colls(S^i_I)/C(|S^i_I|,2) − ‖p_I‖₂²| > (ɛ²/2)·p̂_i(I) ] ≤ 1/4.   (29)

Since each estimate colls(S^i_I)/C(|S^i_I|,2) is close to ‖p_I‖₂² with high constant probability, we get from Chernoff's bound that, for r = 16 ln(6n²), the median of the r results is close to ‖p_I‖₂² with very high probability, as stated in Equation (28). By a union bound over all the intervals in [n], with high constant probability, the following holds for every one of the at most n² intervals I in [n]:

|median{colls(S¹_I)/C(|S¹_I|,2), ..., colls(S^r_I)/C(|S^r_I|,2)} − ‖p_I‖₂²| ≤ max_i{(ɛ²/2)·p̂_i(I)}.   (30)

So henceforth we assume that this is the case.

Assume the algorithm rejects p. When this occurs it implies that there are at least k distinct intervals such that for each interval the test testflatness-ℓ₂ returned REJECT. For each of these intervals I we have p(I) ≠ 0 and median{colls(S¹_I)/C(|S¹_I|,2), ..., colls(S^r_I)/C(|S^r_I|,2)} > 1/|I| + max_i{(ɛ²/2)·p̂_i(I)}. In this case ‖p_I‖₂² > 1/|I|, and so I is not flat and contains at least one bucket boundary. Thus, there are at least k internal bucket boundaries; therefore p is not a tiling k-histogram.

Assume the algorithm accepts p. When this occurs there is a partition of [n] into k intervals, 𝓘, such that for each interval I ∈ 𝓘, testflatness-ℓ₂ returned ACCEPT. Define p' to be p(I)/|I| on the intervals obtained by the algorithm. For every I ∈ 𝓘: if it is the case that there exists i ∈ [r] such that |S^i_I| < ɛ²m/2, then by Fact 1 below, p(I) < ɛ². Therefore, from the fact that ∑_{i∈I} (p_i − x)² is minimized by x = p(I)/|I| and the Cauchy–Schwarz inequality, we get that

∑_{i∈I} (p_i − p(I)/|I|)² ≤ ∑_{i∈I} p_i² ≤ p(I)²   (31)
≤ ɛ²·p(I).   (32)

Otherwise, |S^i_I| ≥ ɛ²m/2 for every i ∈ [r], and then by the second item in Fact 1, p(I) > ɛ²/4. By the first item in Fact 1, p̂_i(I) = |S^i_I|/m ≤ 3p(I)/2, and therefore

median{colls(S¹_I)/C(|S¹_I|,2), ..., colls(S^r_I)/C(|S^r_I|,2)} ≤ 1/|I| + (3ɛ²/4)·p(I).   (33)

This implies that ‖p_I‖₂² ≤ 1/|I| + (3ɛ²/2)·p(I), i.e. ‖p_I − u‖₂² ≤ (3ɛ²/2)·p(I), where u is the uniform distribution over I. Since ∑_{i∈I} (p_i − p(I)/|I|)² = p(I)²·‖p_I − u‖₂², we get that ∑_{i∈I} (p_i − p(I)/|I|)² ≤ ɛ²·p(I). Hence ∑_{I∈𝓘} ∑_{i∈I} (p_i − p(I)/|I|)² ≤ ɛ²; thus p is ɛ-close to p' in the ℓ₂-norm.
Fact 1 If we take m = 48 ln(n²/γ)/ɛ² samples, S, then with probability greater than 1 − γ:

1. For any I such that p(I) ≥ ɛ²/4: p(I)/2 ≤ |S_I|/m ≤ 3p(I)/2.
2. For any I such that |S_I|/m ≥ ɛ²/2: p(I) > ɛ²/4.
3. For any I such that |S_I|/m < ɛ²/2: p(I) < ɛ².

Proof: Fix I. If p(I) ≥ ɛ²/4, then by Chernoff's bound, with probability greater than 1 − e^(−ɛ²m/48),

p(I)/2 ≤ |S_I|/m ≤ 3p(I)/2.   (34)

In particular, if p(I) = ɛ²/4, then |S_I|/m ≤ 3ɛ²/8; thus if |S_I|/m ≥ ɛ²/2 > 3ɛ²/8 then p(I) > ɛ²/4. If |S_I|/m < ɛ²/2 then either p(I) < ɛ²/4, or p(I) ≥ ɛ²/4 but then p(I) ≤ 2|S_I|/m < ɛ². By the union bound, with probability greater than 1 − n²·e^(−ɛ²m/48) > 1 − γ, the above is true for every I.

Algorithm 4: testflatness-ℓ₁(I, S¹, ..., S^r, ɛ)
1  If there exists i ∈ [r] such that |S^i_I| < 16³/ɛ⁴ then return ACCEPT;
2  If median(colls(S¹_I)/C(|S¹_I|,2), ..., colls(S^r_I)/C(|S^r_I|,2)) ≤ (1 + ɛ²/4)/|I| then return ACCEPT;
3  return REJECT;

Proof of Theorem 4: Apply Algorithm 2 with the following changes: take each set of samples S^i to be of size m = 2¹³·√(kn)/ɛ⁵ and replace testflatness-ℓ₂ with testflatness-ℓ₁. By Equation (2),

Pr[ |colls(S_I)/C(|S_I|,2) − ‖p_I‖₂²| > δ·‖p_I‖₂² ] < 4/(δ² · √|S_I| · ‖p_I‖₂).   (35)

Thus, if S_I is such that √|S_I| ≥ 64/(δ²·‖p_I‖₂), then

Pr[ |colls(S_I)/C(|S_I|,2) − ‖p_I‖₂²| ≤ δ·‖p_I‖₂² ] > 15/16.   (36)

By the additive Chernoff bound and the union bound, for r = 16 ln(6n²) and δ = ɛ²/16, with high constant probability, for every interval I that passes Step 1 in Algorithm 4 it holds that |colls(S_I)/C(|S_I|,2) − ‖p_I‖₂²| ≤ δ·‖p_I‖₂² (the total number of intervals in [n] is less than n²). So from this point on we assume that the algorithm obtains a δ-multiplicative approximation of ‖p_I‖₂² for every I that passes Step 1.

Assume the algorithm rejects p; then there are at least k distinct intervals such that for each interval the test testflatness-ℓ₁ returned REJECT. By our assumption each of these intervals is not flat and thus contains at least one bucket boundary. Thus, there are at least k internal bucket boundaries; therefore p is not a tiling k-histogram.

Assume the algorithm accepts p; then there is a partition of [n] into k intervals, 𝓘, such that for each interval I ∈ 𝓘, testflatness-ℓ₁ returned ACCEPT. Define p' to be p(I)/|I| on the intervals obtained by the algorithm. For

any interval I for which testflatness-ℓ₁ returned ACCEPT on Step 2 (and which passed Step 1), it holds that ‖p_I − u‖₂² < ɛ²/(2|I|), and thus, by the Cauchy–Schwarz inequality, ∑_{i∈I} |p_i − p(I)/|I|| ≤ ɛ·p(I)/√2. Denote by L the set of intervals for which testflatness-ℓ₁ returned ACCEPT on Step 1. By Chernoff's bound, for every I ∈ L, with probability greater than 1 − e^(−ɛ²m/(32k)), either p(I) ≤ ɛ²/(4k) or p(I) ≤ 2·16³/(ɛ⁴m). Hence, with probability greater than 1 − n²·r·e^(−ɛ²m/(32k)) > 1 − 1/6, the total weight of the intervals in L is

∑_{I∈L} p(I) ≤ ∑_{I∈L} max{ 2·16³/(ɛ⁴m), ɛ²/(4k) } ≤ ɛ·√(k/n) + ɛ²/4 ≤ ɛ/2,   (37)

where the last inequalities follow from the facts that |L| ≤ k and m = 2¹³·√(kn)/ɛ⁵. Therefore, p is ɛ-close to p' in the ℓ₁-norm.

4.1 Lower Bound

We prove that for every k ≤ 1/ɛ, the upper bound in Theorem 4 is tight in terms of the dependence on k and n. We note that for k = n, testing whether a distribution is a tiling k-histogram is trivial, i.e. every distribution is a tiling n-histogram. Hence, we cannot expect to have a lower bound for every k. We also note that the testing lower bound is also an approximation lower bound.

Theorem 5 Given a distribution D, testing if D is a tiling k-histogram in the ℓ₁-norm requires Ω(√(kn)) samples, for every k ≤ 1/ɛ.

Proof: Divide [n] into k intervals of equal size (up to ±1). In the YES instance the total probability of each interval alternates between 0 and 2/k, and within each interval the elements have equal probability. The NO instance is defined similarly, with one exception: randomly pick one of the intervals that have total probability 2/k, I, and within I randomly pick half of the elements to have probability 0 and the other half of the elements to have twice the probability of the corresponding elements in the YES instance. In the proof of the lower bound for testing uniformity it is shown that distinguishing a uniform distribution from a distribution that is uniform on a random half of the elements and has 0 weight on the other half requires Ω(√n) samples. Since the number of elements in I is Θ(n/k), by a similar argument we know that at least Ω(√(n/k)) samples are required from I in order to distinguish the YES instance from the NO instance.
From the fact that the total probability of I is Θ(1/k), we know that in order to obtain Θ(√(n/k)) hits in I we are required to take a total number of samples of order k·√(n/k) = √(nk); thus we obtain a lower bound of Ω(√(nk)).

References

[AAK+07] N. Alon, A. Andoni, T. Kaufman, K. Matulef, R. Rubinfeld, and N. Xie. Testing k-wise and almost k-wise independence. In Proceedings of the Thirty-Ninth Annual ACM Symposium on the Theory of Computing (STOC), 2007.

[BDKR05] T. Batu, S. Dasgupta, R. Kumar, and R. Rubinfeld. The complexity of approximating the entropy. SIAM Journal on Computing, 35(1):132–150, 2005.

[BFF+01] T. Batu, L. Fortnow, E. Fischer, R. Kumar, R. Rubinfeld, and P. White. Testing random variables for independence and identity. In Proceedings of the Forty-Second Annual Symposium on Foundations of Computer Science (FOCS), 2001.

[BFR+00] T. Batu, L. Fortnow, R. Rubinfeld, W. D. Smith, and P. White. Testing that distributions are close. In Proceedings of the Forty-First Annual Symposium on Foundations of Computer Science (FOCS), pages 259–269, Los Alamitos, CA, USA, 2000. IEEE Computer Society.

[BFR+10] T. Batu, L. Fortnow, R. Rubinfeld, W. D. Smith, and P. White. Testing closeness of discrete distributions. CoRR, 2010. This is a long version of [BFR+00].

[BKR04] T. Batu, R. Kumar, and R. Rubinfeld. Sublinear algorithms for testing monotone and unimodal distributions. In Proceedings of the Thirty-Sixth Annual ACM Symposium on the Theory of Computing (STOC), 2004.

[CMN98] S. Chaudhuri, R. Motwani, and V. Narasayya. Random sampling for histogram construction: how much is enough? SIGMOD, 1998.

[GGI+02] A. Gilbert, S. Guha, P. Indyk, Y. Kotidis, M. Muthukrishnan, and M. Strauss. Fast, small-space algorithms for approximate histogram maintenance. STOC, 2002.

[GGR98] O. Goldreich, S. Goldwasser, and D. Ron. Property testing and its connection to learning and approximation. Journal of the ACM, 45(4):653–750, 1998.

[GKS06] S. Guha, N. Koudas, and K. Shim. Approximation and streaming algorithms for histogram construction problems. ACM Transactions on Database Systems (TODS), 31(1), 2006.

[GMP97] P. B. Gibbons, Y. Matias, and V. Poosala. Fast incremental maintenance of approximate histograms. VLDB, 1997.

[GR00] O. Goldreich and D. Ron. On testing expansion in bounded-degree graphs. Electronic Colloquium on Computational Complexity, 7(20), 2000.

[Ioa03] Y. Ioannidis. The history of histograms (abridged). VLDB, 2003.

[JPK+98] H. V. Jagadish, V. Poosala, N. Koudas, K. Sevcik, S. Muthukrishnan, and T. Suel. Optimal histograms with quality guarantees. VLDB, 1998.

[Pan08] L. Paninski. Testing for uniformity given very sparsely-sampled discrete data. IEEE Transactions on Information Theory, 54(10), 2008.

[Ron08] D. Ron. Property testing: A learning theory perspective. Foundations and Trends in Machine Learning, 1(3):307–402, 2008.

[RRSS09] S. Raskhodnikova, D. Ron, A. Shpilka, and A. Smith. Strong lower bounds for approximating distribution support size and the distinct elements problem. SIAM Journal on Computing, 39(3):813–842, 2009.

[RS96] R. Rubinfeld and M. Sudan. Robust characterization of polynomials with applications to program testing. SIAM Journal on Computing, 25(2):252–271, 1996.

[Rub06] R. Rubinfeld. Sublinear time algorithms. In Proc. International Congress of Mathematicians, volume 3, 2006.

[TGIK02] Nitin Thaper, Sudipto Guha, Piotr Indyk, and Nick Koudas. Dynamic multidimensional histograms. In SIGMOD Conference, 2002.

[Val08] P. Valiant. Testing symmetric properties of distributions. In Proceedings of the Fortieth Annual ACM Symposium on the Theory of Computing (STOC), 2008.

[VV11] G. Valiant and P. Valiant. Estimating the unseen: an n/log(n)-sample estimator for entropy and support size, shown optimal via new CLTs. In Proceedings of the Forty-Third Annual ACM Symposium on the Theory of Computing (STOC), 2011. See also the accompanying ECCC technical reports.


Electronic Colloquium on Computational Complexity, Revision 1 of Report No. 171 (2011): Approximating and Testing k-histogram Distributions in Sub-linear Time. Piotr Indyk, Reut Levi, Ronitt Rubinfeld. July 31, 2014.


1 Proof of learning bounds

1 Proof of learning bounds COS 511: Theoretical Machine Learning Lecturer: Rob Schapire Lecture #4 Scribe: Akshay Mittal February 13, 2013 1 Proof of learning bounds For intuition of the following theore, suppose there exists a

More information

A Note on Scheduling Tall/Small Multiprocessor Tasks with Unit Processing Time to Minimize Maximum Tardiness

A Note on Scheduling Tall/Small Multiprocessor Tasks with Unit Processing Time to Minimize Maximum Tardiness A Note on Scheduling Tall/Sall Multiprocessor Tasks with Unit Processing Tie to Miniize Maxiu Tardiness Philippe Baptiste and Baruch Schieber IBM T.J. Watson Research Center P.O. Box 218, Yorktown Heights,

More information

arxiv: v1 [cs.ds] 17 Mar 2016

arxiv: v1 [cs.ds] 17 Mar 2016 Tight Bounds for Single-Pass Streaing Coplexity of the Set Cover Proble Sepehr Assadi Sanjeev Khanna Yang Li Abstract arxiv:1603.05715v1 [cs.ds] 17 Mar 2016 We resolve the space coplexity of single-pass

More information

A Theoretical Analysis of a Warm Start Technique

A Theoretical Analysis of a Warm Start Technique A Theoretical Analysis of a War Start Technique Martin A. Zinkevich Yahoo! Labs 701 First Avenue Sunnyvale, CA Abstract Batch gradient descent looks at every data point for every step, which is wasteful

More information

Bipartite subgraphs and the smallest eigenvalue

Bipartite subgraphs and the smallest eigenvalue Bipartite subgraphs and the sallest eigenvalue Noga Alon Benny Sudaov Abstract Two results dealing with the relation between the sallest eigenvalue of a graph and its bipartite subgraphs are obtained.

More information

Convex Programming for Scheduling Unrelated Parallel Machines

Convex Programming for Scheduling Unrelated Parallel Machines Convex Prograing for Scheduling Unrelated Parallel Machines Yossi Azar Air Epstein Abstract We consider the classical proble of scheduling parallel unrelated achines. Each job is to be processed by exactly

More information

e-companion ONLY AVAILABLE IN ELECTRONIC FORM

e-companion ONLY AVAILABLE IN ELECTRONIC FORM OPERATIONS RESEARCH doi 10.1287/opre.1070.0427ec pp. ec1 ec5 e-copanion ONLY AVAILABLE IN ELECTRONIC FORM infors 07 INFORMS Electronic Copanion A Learning Approach for Interactive Marketing to a Custoer

More information

This model assumes that the probability of a gap has size i is proportional to 1/i. i.e., i log m e. j=1. E[gap size] = i P r(i) = N f t.

This model assumes that the probability of a gap has size i is proportional to 1/i. i.e., i log m e. j=1. E[gap size] = i P r(i) = N f t. CS 493: Algoriths for Massive Data Sets Feb 2, 2002 Local Models, Bloo Filter Scribe: Qin Lv Local Models In global odels, every inverted file entry is copressed with the sae odel. This work wells when

More information

Computable Shell Decomposition Bounds

Computable Shell Decomposition Bounds Coputable Shell Decoposition Bounds John Langford TTI-Chicago jcl@cs.cu.edu David McAllester TTI-Chicago dac@autoreason.co Editor: Leslie Pack Kaelbling and David Cohn Abstract Haussler, Kearns, Seung

More information

ASSUME a source over an alphabet size m, from which a sequence of n independent samples are drawn. The classical

ASSUME a source over an alphabet size m, from which a sequence of n independent samples are drawn. The classical IEEE TRANSACTIONS ON INFORMATION THEORY Large Alphabet Source Coding using Independent Coponent Analysis Aichai Painsky, Meber, IEEE, Saharon Rosset and Meir Feder, Fellow, IEEE arxiv:67.7v [cs.it] Jul

More information

Course Notes for EE227C (Spring 2018): Convex Optimization and Approximation

Course Notes for EE227C (Spring 2018): Convex Optimization and Approximation Course Notes for EE227C (Spring 2018): Convex Optiization and Approxiation Instructor: Moritz Hardt Eail: hardt+ee227c@berkeley.edu Graduate Instructor: Max Sichowitz Eail: sichow+ee227c@berkeley.edu October

More information

Lecture 9 November 23, 2015

Lecture 9 November 23, 2015 CSC244: Discrepancy Theory in Coputer Science Fall 25 Aleksandar Nikolov Lecture 9 Noveber 23, 25 Scribe: Nick Spooner Properties of γ 2 Recall that γ 2 (A) is defined for A R n as follows: γ 2 (A) = in{r(u)

More information

List Scheduling and LPT Oliver Braun (09/05/2017)

List Scheduling and LPT Oliver Braun (09/05/2017) List Scheduling and LPT Oliver Braun (09/05/207) We investigate the classical scheduling proble P ax where a set of n independent jobs has to be processed on 2 parallel and identical processors (achines)

More information

Block designs and statistics

Block designs and statistics Bloc designs and statistics Notes for Math 447 May 3, 2011 The ain paraeters of a bloc design are nuber of varieties v, bloc size, nuber of blocs b. A design is built on a set of v eleents. Each eleent

More information

Quantum algorithms (CO 781, Winter 2008) Prof. Andrew Childs, University of Waterloo LECTURE 15: Unstructured search and spatial search

Quantum algorithms (CO 781, Winter 2008) Prof. Andrew Childs, University of Waterloo LECTURE 15: Unstructured search and spatial search Quantu algoriths (CO 781, Winter 2008) Prof Andrew Childs, University of Waterloo LECTURE 15: Unstructured search and spatial search ow we begin to discuss applications of quantu walks to search algoriths

More information

Computational and Statistical Learning Theory

Computational and Statistical Learning Theory Coputational and Statistical Learning Theory Proble sets 5 and 6 Due: Noveber th Please send your solutions to learning-subissions@ttic.edu Notations/Definitions Recall the definition of saple based Radeacher

More information

arxiv: v1 [cs.ds] 3 Feb 2014

arxiv: v1 [cs.ds] 3 Feb 2014 arxiv:40.043v [cs.ds] 3 Feb 04 A Bound on the Expected Optiality of Rando Feasible Solutions to Cobinatorial Optiization Probles Evan A. Sultani The Johns Hopins University APL evan@sultani.co http://www.sultani.co/

More information

Improved Guarantees for Agnostic Learning of Disjunctions

Improved Guarantees for Agnostic Learning of Disjunctions Iproved Guarantees for Agnostic Learning of Disjunctions Pranjal Awasthi Carnegie Mellon University pawasthi@cs.cu.edu Avri Blu Carnegie Mellon University avri@cs.cu.edu Or Sheffet Carnegie Mellon University

More information

Tight Information-Theoretic Lower Bounds for Welfare Maximization in Combinatorial Auctions

Tight Information-Theoretic Lower Bounds for Welfare Maximization in Combinatorial Auctions Tight Inforation-Theoretic Lower Bounds for Welfare Maxiization in Cobinatorial Auctions Vahab Mirrokni Jan Vondrák Theory Group, Microsoft Dept of Matheatics Research Princeton University Redond, WA 9805

More information

Model Fitting. CURM Background Material, Fall 2014 Dr. Doreen De Leon

Model Fitting. CURM Background Material, Fall 2014 Dr. Doreen De Leon Model Fitting CURM Background Material, Fall 014 Dr. Doreen De Leon 1 Introduction Given a set of data points, we often want to fit a selected odel or type to the data (e.g., we suspect an exponential

More information

LogLog-Beta and More: A New Algorithm for Cardinality Estimation Based on LogLog Counting

LogLog-Beta and More: A New Algorithm for Cardinality Estimation Based on LogLog Counting LogLog-Beta and More: A New Algorith for Cardinality Estiation Based on LogLog Counting Jason Qin, Denys Ki, Yuei Tung The AOLP Core Data Service, AOL, 22000 AOL Way Dulles, VA 20163 E-ail: jasonqin@teaaolco

More information

Algorithms for parallel processor scheduling with distinct due windows and unit-time jobs

Algorithms for parallel processor scheduling with distinct due windows and unit-time jobs BULLETIN OF THE POLISH ACADEMY OF SCIENCES TECHNICAL SCIENCES Vol. 57, No. 3, 2009 Algoriths for parallel processor scheduling with distinct due windows and unit-tie obs A. JANIAK 1, W.A. JANIAK 2, and

More information

Computable Shell Decomposition Bounds

Computable Shell Decomposition Bounds Journal of Machine Learning Research 5 (2004) 529-547 Subitted 1/03; Revised 8/03; Published 5/04 Coputable Shell Decoposition Bounds John Langford David McAllester Toyota Technology Institute at Chicago

More information

Understanding Machine Learning Solution Manual

Understanding Machine Learning Solution Manual Understanding Machine Learning Solution Manual Written by Alon Gonen Edited by Dana Rubinstein Noveber 17, 2014 2 Gentle Start 1. Given S = ((x i, y i )), define the ultivariate polynoial p S (x) = i []:y

More information

Fixed-to-Variable Length Distribution Matching

Fixed-to-Variable Length Distribution Matching Fixed-to-Variable Length Distribution Matching Rana Ali Ajad and Georg Böcherer Institute for Counications Engineering Technische Universität München, Gerany Eail: raa2463@gail.co,georg.boecherer@tu.de

More information

3.8 Three Types of Convergence

3.8 Three Types of Convergence 3.8 Three Types of Convergence 3.8 Three Types of Convergence 93 Suppose that we are given a sequence functions {f k } k N on a set X and another function f on X. What does it ean for f k to converge to

More information

Stochastic Subgradient Methods

Stochastic Subgradient Methods Stochastic Subgradient Methods Lingjie Weng Yutian Chen Bren School of Inforation and Coputer Science University of California, Irvine {wengl, yutianc}@ics.uci.edu Abstract Stochastic subgradient ethods

More information

Homework 3 Solutions CSE 101 Summer 2017

Homework 3 Solutions CSE 101 Summer 2017 Hoework 3 Solutions CSE 0 Suer 207. Scheduling algoriths The following n = 2 jobs with given processing ties have to be scheduled on = 3 parallel and identical processors with the objective of iniizing

More information

Research Article On the Isolated Vertices and Connectivity in Random Intersection Graphs

Research Article On the Isolated Vertices and Connectivity in Random Intersection Graphs International Cobinatorics Volue 2011, Article ID 872703, 9 pages doi:10.1155/2011/872703 Research Article On the Isolated Vertices and Connectivity in Rando Intersection Graphs Yilun Shang Institute for

More information

Lecture 20 November 7, 2013

Lecture 20 November 7, 2013 CS 229r: Algoriths for Big Data Fall 2013 Prof. Jelani Nelson Lecture 20 Noveber 7, 2013 Scribe: Yun Willia Yu 1 Introduction Today we re going to go through the analysis of atrix copletion. First though,

More information

On Constant Power Water-filling

On Constant Power Water-filling On Constant Power Water-filling Wei Yu and John M. Cioffi Electrical Engineering Departent Stanford University, Stanford, CA94305, U.S.A. eails: {weiyu,cioffi}@stanford.edu Abstract This paper derives

More information

Randomized Recovery for Boolean Compressed Sensing

Randomized Recovery for Boolean Compressed Sensing Randoized Recovery for Boolean Copressed Sensing Mitra Fatei and Martin Vetterli Laboratory of Audiovisual Counication École Polytechnique Fédéral de Lausanne (EPFL) Eail: {itra.fatei, artin.vetterli}@epfl.ch

More information

Necessity of low effective dimension

Necessity of low effective dimension Necessity of low effective diension Art B. Owen Stanford University October 2002, Orig: July 2002 Abstract Practitioners have long noticed that quasi-monte Carlo ethods work very well on functions that

More information

Randomized Accuracy-Aware Program Transformations For Efficient Approximate Computations

Randomized Accuracy-Aware Program Transformations For Efficient Approximate Computations Randoized Accuracy-Aware Progra Transforations For Efficient Approxiate Coputations Zeyuan Allen Zhu Sasa Misailovic Jonathan A. Kelner Martin Rinard MIT CSAIL zeyuan@csail.it.edu isailo@it.edu kelner@it.edu

More information

1 Generalization bounds based on Rademacher complexity

1 Generalization bounds based on Rademacher complexity COS 5: Theoretical Machine Learning Lecturer: Rob Schapire Lecture #0 Scribe: Suqi Liu March 07, 08 Last tie we started proving this very general result about how quickly the epirical average converges

More information

Computational and Statistical Learning Theory

Computational and Statistical Learning Theory Coputational and Statistical Learning Theory TTIC 31120 Prof. Nati Srebro Lecture 2: PAC Learning and VC Theory I Fro Adversarial Online to Statistical Three reasons to ove fro worst-case deterinistic

More information

Handout 7. and Pr [M(x) = χ L (x) M(x) =? ] = 1.

Handout 7. and Pr [M(x) = χ L (x) M(x) =? ] = 1. Notes on Coplexity Theory Last updated: October, 2005 Jonathan Katz Handout 7 1 More on Randoized Coplexity Classes Reinder: so far we have seen RP,coRP, and BPP. We introduce two ore tie-bounded randoized

More information

Polygonal Designs: Existence and Construction

Polygonal Designs: Existence and Construction Polygonal Designs: Existence and Construction John Hegean Departent of Matheatics, Stanford University, Stanford, CA 9405 Jeff Langford Departent of Matheatics, Drake University, Des Moines, IA 5011 G

More information

Optimal Jamming Over Additive Noise: Vector Source-Channel Case

Optimal Jamming Over Additive Noise: Vector Source-Channel Case Fifty-first Annual Allerton Conference Allerton House, UIUC, Illinois, USA October 2-3, 2013 Optial Jaing Over Additive Noise: Vector Source-Channel Case Erah Akyol and Kenneth Rose Abstract This paper

More information

Exact tensor completion with sum-of-squares

Exact tensor completion with sum-of-squares Proceedings of Machine Learning Research vol 65:1 54, 2017 30th Annual Conference on Learning Theory Exact tensor copletion with su-of-squares Aaron Potechin Institute for Advanced Study, Princeton David

More information

Note on generating all subsets of a finite set with disjoint unions

Note on generating all subsets of a finite set with disjoint unions Note on generating all subsets of a finite set with disjoint unions David Ellis e-ail: dce27@ca.ac.uk Subitted: Dec 2, 2008; Accepted: May 12, 2009; Published: May 20, 2009 Matheatics Subject Classification:

More information

Mixed Robust/Average Submodular Partitioning

Mixed Robust/Average Submodular Partitioning Mixed Robust/Average Subodular Partitioning Kai Wei 1 Rishabh Iyer 1 Shengjie Wang 2 Wenruo Bai 1 Jeff Biles 1 1 Departent of Electrical Engineering, University of Washington 2 Departent of Coputer Science,

More information

Course Notes for EE227C (Spring 2018): Convex Optimization and Approximation

Course Notes for EE227C (Spring 2018): Convex Optimization and Approximation Course Notes for EE7C (Spring 018: Convex Optiization and Approxiation Instructor: Moritz Hardt Eail: hardt+ee7c@berkeley.edu Graduate Instructor: Max Sichowitz Eail: sichow+ee7c@berkeley.edu October 15,

More information

Hamming Compressed Sensing

Hamming Compressed Sensing Haing Copressed Sensing Tianyi Zhou, and Dacheng Tao, Meber, IEEE Abstract arxiv:.73v2 [cs.it] Oct 2 Copressed sensing CS and -bit CS cannot directly recover quantized signals and require tie consuing

More information

Intelligent Systems: Reasoning and Recognition. Perceptrons and Support Vector Machines

Intelligent Systems: Reasoning and Recognition. Perceptrons and Support Vector Machines Intelligent Systes: Reasoning and Recognition Jaes L. Crowley osig 1 Winter Seester 2018 Lesson 6 27 February 2018 Outline Perceptrons and Support Vector achines Notation...2 Linear odels...3 Lines, Planes

More information

A note on the multiplication of sparse matrices

A note on the multiplication of sparse matrices Cent. Eur. J. Cop. Sci. 41) 2014 1-11 DOI: 10.2478/s13537-014-0201-x Central European Journal of Coputer Science A note on the ultiplication of sparse atrices Research Article Keivan Borna 12, Sohrab Aboozarkhani

More information

When Short Runs Beat Long Runs

When Short Runs Beat Long Runs When Short Runs Beat Long Runs Sean Luke George Mason University http://www.cs.gu.edu/ sean/ Abstract What will yield the best results: doing one run n generations long or doing runs n/ generations long

More information

Weighted- 1 minimization with multiple weighting sets

Weighted- 1 minimization with multiple weighting sets Weighted- 1 iniization with ultiple weighting sets Hassan Mansour a,b and Özgür Yılaza a Matheatics Departent, University of British Colubia, Vancouver - BC, Canada; b Coputer Science Departent, University

More information

arxiv: v3 [cs.lg] 7 Jan 2016

arxiv: v3 [cs.lg] 7 Jan 2016 Efficient and Parsionious Agnostic Active Learning Tzu-Kuo Huang Alekh Agarwal Daniel J. Hsu tkhuang@icrosoft.co alekha@icrosoft.co djhsu@cs.colubia.edu John Langford Robert E. Schapire jcl@icrosoft.co

More information

Sharp Time Data Tradeoffs for Linear Inverse Problems

Sharp Time Data Tradeoffs for Linear Inverse Problems Sharp Tie Data Tradeoffs for Linear Inverse Probles Saet Oyak Benjain Recht Mahdi Soltanolkotabi January 016 Abstract In this paper we characterize sharp tie-data tradeoffs for optiization probles used

More information

Probability Distributions

Probability Distributions Probability Distributions In Chapter, we ephasized the central role played by probability theory in the solution of pattern recognition probles. We turn now to an exploration of soe particular exaples

More information

The Transactional Nature of Quantum Information

The Transactional Nature of Quantum Information The Transactional Nature of Quantu Inforation Subhash Kak Departent of Coputer Science Oklahoa State University Stillwater, OK 7478 ABSTRACT Inforation, in its counications sense, is a transactional property.

More information

Combining Classifiers

Combining Classifiers Cobining Classifiers Generic ethods of generating and cobining ultiple classifiers Bagging Boosting References: Duda, Hart & Stork, pg 475-480. Hastie, Tibsharini, Friedan, pg 246-256 and Chapter 10. http://www.boosting.org/

More information

Kernel Methods and Support Vector Machines

Kernel Methods and Support Vector Machines Intelligent Systes: Reasoning and Recognition Jaes L. Crowley ENSIAG 2 / osig 1 Second Seester 2012/2013 Lesson 20 2 ay 2013 Kernel ethods and Support Vector achines Contents Kernel Functions...2 Quadratic

More information

Fundamental Limits of Database Alignment

Fundamental Limits of Database Alignment Fundaental Liits of Database Alignent Daniel Cullina Dept of Electrical Engineering Princeton University dcullina@princetonedu Prateek Mittal Dept of Electrical Engineering Princeton University pittal@princetonedu

More information

Support Vector Machine Classification of Uncertain and Imbalanced data using Robust Optimization

Support Vector Machine Classification of Uncertain and Imbalanced data using Robust Optimization Recent Researches in Coputer Science Support Vector Machine Classification of Uncertain and Ibalanced data using Robust Optiization RAGHAV PAT, THEODORE B. TRAFALIS, KASH BARKER School of Industrial Engineering

More information

On Conditions for Linearity of Optimal Estimation

On Conditions for Linearity of Optimal Estimation On Conditions for Linearity of Optial Estiation Erah Akyol, Kuar Viswanatha and Kenneth Rose {eakyol, kuar, rose}@ece.ucsb.edu Departent of Electrical and Coputer Engineering University of California at

More information

1 Identical Parallel Machines

1 Identical Parallel Machines FB3: Matheatik/Inforatik Dr. Syaantak Das Winter 2017/18 Optiizing under Uncertainty Lecture Notes 3: Scheduling to Miniize Makespan In any standard scheduling proble, we are given a set of jobs J = {j

More information

New Slack-Monotonic Schedulability Analysis of Real-Time Tasks on Multiprocessors

New Slack-Monotonic Schedulability Analysis of Real-Time Tasks on Multiprocessors New Slack-Monotonic Schedulability Analysis of Real-Tie Tasks on Multiprocessors Risat Mahud Pathan and Jan Jonsson Chalers University of Technology SE-41 96, Göteborg, Sweden {risat, janjo}@chalers.se

More information

Detection and Estimation Theory

Detection and Estimation Theory ESE 54 Detection and Estiation Theory Joseph A. O Sullivan Sauel C. Sachs Professor Electronic Systes and Signals Research Laboratory Electrical and Systes Engineering Washington University 11 Urbauer

More information

Pattern Recognition and Machine Learning. Learning and Evaluation for Pattern Recognition

Pattern Recognition and Machine Learning. Learning and Evaluation for Pattern Recognition Pattern Recognition and Machine Learning Jaes L. Crowley ENSIMAG 3 - MMIS Fall Seester 2017 Lesson 1 4 October 2017 Outline Learning and Evaluation for Pattern Recognition Notation...2 1. The Pattern Recognition

More information

Nonmonotonic Networks. a. IRST, I Povo (Trento) Italy, b. Univ. of Trento, Physics Dept., I Povo (Trento) Italy

Nonmonotonic Networks. a. IRST, I Povo (Trento) Italy, b. Univ. of Trento, Physics Dept., I Povo (Trento) Italy Storage Capacity and Dynaics of Nononotonic Networks Bruno Crespi a and Ignazio Lazzizzera b a. IRST, I-38050 Povo (Trento) Italy, b. Univ. of Trento, Physics Dept., I-38050 Povo (Trento) Italy INFN Gruppo

More information

Keywords: Estimator, Bias, Mean-squared error, normality, generalized Pareto distribution

Keywords: Estimator, Bias, Mean-squared error, normality, generalized Pareto distribution Testing approxiate norality of an estiator using the estiated MSE and bias with an application to the shape paraeter of the generalized Pareto distribution J. Martin van Zyl Abstract In this work the norality

More information

Asynchronous Gossip Algorithms for Stochastic Optimization

Asynchronous Gossip Algorithms for Stochastic Optimization Asynchronous Gossip Algoriths for Stochastic Optiization S. Sundhar Ra ECE Dept. University of Illinois Urbana, IL 680 ssrini@illinois.edu A. Nedić IESE Dept. University of Illinois Urbana, IL 680 angelia@illinois.edu

More information

On the Use of A Priori Information for Sparse Signal Approximations

On the Use of A Priori Information for Sparse Signal Approximations ITS TECHNICAL REPORT NO. 3/4 On the Use of A Priori Inforation for Sparse Signal Approxiations Oscar Divorra Escoda, Lorenzo Granai and Pierre Vandergheynst Signal Processing Institute ITS) Ecole Polytechnique

More information

Testing equivalence between distributions using conditional samples

Testing equivalence between distributions using conditional samples Testing equivalence between distributions using conditional samples (Extended Abstract) Clément Canonne Dana Ron Rocco A. Servedio July 5, 2013 Abstract We study a recently introduced framework [7, 8]

More information

New Bounds for Learning Intervals with Implications for Semi-Supervised Learning

New Bounds for Learning Intervals with Implications for Semi-Supervised Learning JMLR: Workshop and Conference Proceedings vol (1) 1 15 New Bounds for Learning Intervals with Iplications for Sei-Supervised Learning David P. Helbold dph@soe.ucsc.edu Departent of Coputer Science, University

More information

Estimating properties of distributions. Ronitt Rubinfeld MIT and Tel Aviv University

Estimating properties of distributions. Ronitt Rubinfeld MIT and Tel Aviv University Estimating properties of distributions Ronitt Rubinfeld MIT and Tel Aviv University Distributions are everywhere What properties do your distributions have? Play the lottery? Is it independent? Is it uniform?

More information

A Better Algorithm For an Ancient Scheduling Problem. David R. Karger Steven J. Phillips Eric Torng. Department of Computer Science

A Better Algorithm For an Ancient Scheduling Problem. David R. Karger Steven J. Phillips Eric Torng. Department of Computer Science A Better Algorith For an Ancient Scheduling Proble David R. Karger Steven J. Phillips Eric Torng Departent of Coputer Science Stanford University Stanford, CA 9435-4 Abstract One of the oldest and siplest

More information

Non-Parametric Non-Line-of-Sight Identification 1

Non-Parametric Non-Line-of-Sight Identification 1 Non-Paraetric Non-Line-of-Sight Identification Sinan Gezici, Hisashi Kobayashi and H. Vincent Poor Departent of Electrical Engineering School of Engineering and Applied Science Princeton University, Princeton,

More information

In this chapter, we consider several graph-theoretic and probabilistic models

In this chapter, we consider several graph-theoretic and probabilistic models THREE ONE GRAPH-THEORETIC AND STATISTICAL MODELS 3.1 INTRODUCTION In this chapter, we consider several graph-theoretic and probabilistic odels for a social network, which we do under different assuptions

More information

1 Bounding the Margin

1 Bounding the Margin COS 511: Theoretical Machine Learning Lecturer: Rob Schapire Lecture #12 Scribe: Jian Min Si March 14, 2013 1 Bounding the Margin We are continuing the proof of a bound on the generalization error of AdaBoost

More information

The Weierstrass Approximation Theorem

The Weierstrass Approximation Theorem 36 The Weierstrass Approxiation Theore Recall that the fundaental idea underlying the construction of the real nubers is approxiation by the sipler rational nubers. Firstly, nubers are often deterined

More information

Tail estimates for norms of sums of log-concave random vectors

Tail estimates for norms of sums of log-concave random vectors Tail estiates for nors of sus of log-concave rando vectors Rados law Adaczak Rafa l Lata la Alexander E. Litvak Alain Pajor Nicole Toczak-Jaegerann Abstract We establish new tail estiates for order statistics

More information

A Pass-Efficient Algorithm for Clustering Census Data

A Pass-Efficient Algorithm for Clustering Census Data A Pass-Efficient Algorithm for Clustering Census Data Kevin Chang Yale University Ravi Kannan Yale University Abstract We present a number of streaming algorithms for a basic clustering problem for massive

More information

CSE525: Randomized Algorithms and Probabilistic Analysis May 16, Lecture 13

CSE525: Randomized Algorithms and Probabilistic Analysis May 16, Lecture 13 CSE55: Randoied Algoriths and obabilistic Analysis May 6, Lecture Lecturer: Anna Karlin Scribe: Noah Siegel, Jonathan Shi Rando walks and Markov chains This lecture discusses Markov chains, which capture

More information

Feature Extraction Techniques

Feature Extraction Techniques Feature Extraction Techniques Unsupervised Learning II Feature Extraction Unsupervised ethods can also be used to find features which can be useful for categorization. There are unsupervised ethods that

More information

The Simplex Method is Strongly Polynomial for the Markov Decision Problem with a Fixed Discount Rate

The Simplex Method is Strongly Polynomial for the Markov Decision Problem with a Fixed Discount Rate The Siplex Method is Strongly Polynoial for the Markov Decision Proble with a Fixed Discount Rate Yinyu Ye April 20, 2010 Abstract In this note we prove that the classic siplex ethod with the ost-negativereduced-cost

More information

Constant-Space String-Matching. in Sublinear Average Time. (Extended Abstract) Wojciech Rytter z. Warsaw University. and. University of Liverpool

Constant-Space String-Matching. in Sublinear Average Time. (Extended Abstract) Wojciech Rytter z. Warsaw University. and. University of Liverpool Constant-Space String-Matching in Sublinear Average Tie (Extended Abstract) Maxie Crocheore Universite de Marne-la-Vallee Leszek Gasieniec y Max-Planck Institut fur Inforatik Wojciech Rytter z Warsaw University

More information

1 Distribution Property Testing

1 Distribution Property Testing ECE 6980 An Algorithmic and Information-Theoretic Toolbox for Massive Data Instructor: Jayadev Acharya Lecture #10-11 Scribe: JA 27, 29 September, 2016 Please send errors to acharya@cornell.edu 1 Distribution

More information

Prerequisites. We recall: Theorem 2 A subset of a countably innite set is countable.

Prerequisites. We recall: Theorem 2 A subset of a countably innite set is countable. Prerequisites 1 Set Theory We recall the basic facts about countable and uncountable sets, union and intersection of sets and iages and preiages of functions. 1.1 Countable and uncountable sets We can

More information

Birthday Paradox Calculations and Approximation

Birthday Paradox Calculations and Approximation Birthday Paradox Calculations and Approxiation Joshua E. Hill InfoGard Laboratories -March- v. Birthday Proble In the birthday proble, we have a group of n randoly selected people. If we assue that birthdays

More information

Adaptive Stabilization of a Class of Nonlinear Systems With Nonparametric Uncertainty

Adaptive Stabilization of a Class of Nonlinear Systems With Nonparametric Uncertainty IEEE TRANSACTIONS ON AUTOMATIC CONTROL, VOL. 46, NO. 11, NOVEMBER 2001 1821 Adaptive Stabilization of a Class of Nonlinear Systes With Nonparaetric Uncertainty Aleander V. Roup and Dennis S. Bernstein

More information

Using EM To Estimate A Probablity Density With A Mixture Of Gaussians

Using EM To Estimate A Probablity Density With A Mixture Of Gaussians Using EM To Estiate A Probablity Density With A Mixture Of Gaussians Aaron A. D Souza adsouza@usc.edu Introduction The proble we are trying to address in this note is siple. Given a set of data points

More information

Using a De-Convolution Window for Operating Modal Analysis

Using a De-Convolution Window for Operating Modal Analysis Using a De-Convolution Window for Operating Modal Analysis Brian Schwarz Vibrant Technology, Inc. Scotts Valley, CA Mark Richardson Vibrant Technology, Inc. Scotts Valley, CA Abstract Operating Modal Analysis

More information

Best Arm Identification: A Unified Approach to Fixed Budget and Fixed Confidence

Best Arm Identification: A Unified Approach to Fixed Budget and Fixed Confidence Best Ar Identification: A Unified Approach to Fixed Budget and Fixed Confidence Victor Gabillon Mohaad Ghavazadeh Alessandro Lazaric INRIA Lille - Nord Europe, Tea SequeL {victor.gabillon,ohaad.ghavazadeh,alessandro.lazaric}@inria.fr

More information

Design of Spatially Coupled LDPC Codes over GF(q) for Windowed Decoding

Design of Spatially Coupled LDPC Codes over GF(q) for Windowed Decoding IEEE TRANSACTIONS ON INFORMATION THEORY (SUBMITTED PAPER) 1 Design of Spatially Coupled LDPC Codes over GF(q) for Windowed Decoding Lai Wei, Student Meber, IEEE, David G. M. Mitchell, Meber, IEEE, Thoas

More information