Pay for a Sliding Bloom Filter and Get Counting, Distinct Elements, and Entropy for Free


Eran Assaf (Hebrew University), Ran Ben Basat (Technion), Gil Einziger (Nokia Bell Labs), Roy Friedman (Technion)

Abstract

For many networking applications, recent data is more significant than older data, motivating the need for sliding window solutions. Various capabilities, such as DDoS detection and load balancing, require insights about multiple metrics including Bloom filters, per-flow counting, count distinct and entropy estimation. In this work, we present a unified construction that solves all the above problems in the sliding window model. Our single solution offers a better space to accuracy tradeoff than the state-of-the-art for each of these individual problems! We show this both analytically and by running multiple real Internet backbone and datacenter packet traces.

1 Introduction

Network measurements are at the core of many applications, such as load balancing, quality of service, anomaly/intrusion detection, and caching [1, 14, 17, 25, 32]. Measurement algorithms are required to cope with the throughput demands of modern links, forcing them to rely on scarcely available fast SRAM memory. However, such memory is limited in size [35], which motivates approximate solutions that conserve space.

Network algorithms often find recent data useful. For example, anomaly detection systems attempt to detect manifesting anomalies, and a load balancer needs to balance the current load rather than the historical one. Hence, the sliding window model is an active research field [3, 34, 37, 41, 42].

The desired measurement types differ from one application to the other. For example, a load balancer may be interested in the heavy hitter flows [1], which are responsible for a large portion of the traffic. Additionally, anomaly detection systems often monitor the number of distinct elements [25] and entropy [38], or use Bloom filters [5]. Yet, existing algorithms usually provide just a single utility at a time, e.g., approximate set membership (Bloom filters) [6], per-flow counting [13, 18], count distinct [16, 20, 21] and entropy [38]. Therefore, as networks' complexity grows, multiple measurement types may be required. However, employing multiple stand-alone solutions incurs the additive combined cost of each of them, which is inefficient in both memory and computation.

In this work, we suggest the Sliding Window Approximate Measurement Protocol (SWAMP), an algorithm that bundles together four commonly used measurement types. Specifically, it approximates set membership, per-flow counting, distinct elements and entropy in the sliding window model. As illustrated in Figure 1, SWAMP stores flows' fingerprints (a fingerprint is a short random string obtained by hashing an ID) in a cyclic buffer, while their frequencies are maintained in a compact fingerprint hash table named TinyTable [18]. On each packet arrival, its corresponding fingerprint replaces the oldest one in the buffer. We then update the table, decrementing the departing fingerprint's frequency and incrementing that of the arriving one. An additional counter Z maintains the number of distinct fingerprints in the window and is updated every time a fingerprint's frequency is reduced to 0 or increased to 1. Intuitively, the number of distinct fingerprints provides a good estimation of the number of distinct elements. Additionally, the scalar Ĥ (not illustrated) maintains the fingerprints' distribution entropy and approximates the real entropy.
1.1 Contribution

We present SWAMP, a sliding window algorithm for approximate set membership (Bloom filters), per-flow counting, distinct elements and entropy measurements. We prove that SWAMP operates in constant time and provides accuracy guarantees for each of the supported problems. Despite its versatility, SWAMP improves the state of the art for each of them.

Figure 1: An overview of SWAMP: Fingerprints are stored in a cyclic fingerprint buffer (CFB), and their frequencies are maintained by TinyTable. Upon item x_n's arrival, we update CFB and the table by removing the oldest item's (x_{n-W}) fingerprint (in black) and adding that of x_n (in red). We also maintain an estimate of the number of distinct fingerprints (Z). Since the black fingerprint's count is now zero, we decrement Z.

For approximate set membership, SWAMP is memory succinct when the false positive rate is constant and requires up to 40% less space than [34]. SWAMP is also succinct for per-flow counting and is more accurate than [3] on real packet traces. When compared with (1 + ε) count distinct approximation algorithms [11, 23], SWAMP asymptotically improves the query time from O(ε^-2) to a constant. It is also up to 1000 times more accurate on real packet traces. For entropy, SWAMP asymptotically improves the runtime to a constant and provides accurate estimations in practice.

Algorithm   | Space                 | Time        | Counts
SWAMP       | (1 + o(1)) · W log W  | O(1)        | yes
SWBF [34]   | (2 + o(1)) · W log W  | O(1)        | no
TBF [42]    | O(W log W · log ε^-1) | O(log ε^-1) | no

Table 1: Comparison of sliding window set membership algorithms, for ε^-1 = W^{o(1)}. The notations we use can be found in Table 3.

1.2 Paper organization

Related work on the problems covered by this work is found in Section 2. Section 3 provides formal definitions and introduces SWAMP. Section 4 describes an empirical evaluation of SWAMP and previously suggested algorithms. Section 5 includes a formal analysis of SWAMP, which is briefly summarized in Table 2. Finally, we conclude with a short discussion in Section 6.

Problem                              | Estimator        | Guarantee                                                      | Reference
(W, ε)-approximate Set Membership    | IsMember         | Pr[true | x ∈ S^W] = 1 and Pr[true | x ∉ S^W] ≤ ε              | Corollary 5.1
(W, ε)-approximate Set Multiplicity  | Frequency        | Pr[f̂_x^W ≥ f_x^W] = 1 and Pr[f̂_x^W ≠ f_x^W] ≤ ε              | Theorem 5.2
(W, ε, δ)-approximate Count Distinct | DistinctLB (Z)   | Pr[D ≥ Z] = 1 and Pr[D − Z ≥ ½·ε·D·log δ^-1] ≤ 2δ             | Theorem 5.6
                                     | DistinctMLE (D̂)  | Pr[|D̂ − D| ≥ ½·ε·D·log δ^-1] ≤ 2δ                             | Theorem 5.8
(W, ε, δ)-Entropy Estimation         | Entropy (Ĥ)      | Pr[H ≥ Ĥ] = 1 and Pr[H − Ĥ ≥ ε·δ^-1] ≤ δ                      | Theorem 5.10

Table 2: Summary of SWAMP's accuracy guarantees.

2 Related work

2.1 Set Membership and Counting

A Bloom filter [6] is an efficient data structure that encodes an approximate set. Given an item, a Bloom filter can be queried whether that item is a part of the set. An answer of "no" is always correct, while an answer of "yes" may be false with a certain probability; this case is called a false positive. Plain Bloom filters do not support removals or counting, and many algorithms fill this gap. For example, some alternatives support removals [7, 18, 19, 31, 33, 40] and others support multiplicity queries [13, 18]. Additionally, some works use aging [41] and others compute the approximate set with regard to a sliding window [34, 42].

SWBF [34] uses a Cuckoo hash table to build a sliding Bloom filter, which is more space efficient than the previously suggested Timing Bloom filter (TBF) [42]. The Cuckoo table is allocated with 2W entries such that each entry stores a fingerprint and a timestamp. Cuckoo tables require that W entries remain empty to avoid cycles, and this is achieved implicitly by treating cells containing outdated items as empty. Finally, a cleanup process is used to remove outdated items and allow timestamps to be wrapped around. A comparison of SWAMP, TBF and SWBF appears in Table 1.

2.2 Count Distinct

The number of distinct elements provides a useful indicator for anomaly detection algorithms. Accurate count distinct is impractical due to the massive scale of the data [2], and thus most approaches resort to approximate solutions [2, 12, 16]. Approximate algorithms typically use a hash function H : ID → {0, 1}^∞ that maps IDs to infinite bit strings. In practice, finite bit strings are used, and 32-bit integers suffice to reach estimations of over 10^9 [22]. These algorithms look for certain observables in the hashes. For example, some algorithms [2, 26] treat the minimal observed hash value as a real number in [0, 1] and exploit the fact that E(min H(M)) = 1/(D + 1), where D is the real number of distinct items in the multi-set M. Alternatively, one can seek patterns of the form 0^{β-1}1 [16, 22] and exploit the fact that such a pattern is encountered on average once per every 2^β unique elements.

Monitoring observables reduces the required amount of space, as we only need to maintain a single one. In practice, the variance of such methods is large and hence multiple observables are maintained. In principle, one could repeat the process and perform m independent experiments, but this has significant computational overheads. Instead, stochastic averaging [21] is used to mimic the effect of multiple experiments with a single hash calculation. In any case, using m repetitions reduces the standard deviation by a factor of √m.

The state of the art count distinct algorithm is HyperLogLog (HLL) [22], which is used in multiple Google projects [28]. HLL requires m bytes and its standard deviation is σ ≈ 1.04/√m.
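To make the observable concrete, here is a minimal, illustrative sketch (ours, not from the paper) of the 0^{β-1}1 pattern with stochastic averaging: one hash call splits into a bucket index and a remainder, and each bucket keeps the largest β it has seen. The bias-correction constant below is LogLog's asymptotic one; HLL refines the averaging with a harmonic mean.

    import hashlib

    def h64(x):
        # Stand-in for the idealized random hash H : ID -> {0,1}^64.
        d = hashlib.blake2b(repr(x).encode(), digest_size=8).digest()
        return int.from_bytes(d, 'big')

    class PatternSketch:
        def __init__(self, m_bits=6):
            self.m_bits = m_bits
            self.regs = [0] * (1 << m_bits)   # one observable per bucket

        def add(self, x):
            v = h64(x)
            bucket = v >> (64 - self.m_bits)  # leading bits pick the bucket
            rest = v & ((1 << (64 - self.m_bits)) - 1)
            # beta = position of the leftmost 1-bit, i.e. the hash starts with
            # 0^{beta-1}1; seen about once per 2^beta unique elements.
            beta = (64 - self.m_bits) - rest.bit_length() + 1
            self.regs[bucket] = max(self.regs[bucket], beta)

        def estimate(self):
            m = len(self.regs)
            return 0.39701 * m * 2 ** (sum(self.regs) / m)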

SWHLL extends HLL to sliding windows [11, 23], and was used to detect attacks such as port scans [10]. SWAMP's space requirement is proportional to W, and thus it is only comparable in space to HLL when ε = O(1/√W). However, when multiple functionalities are required, the residual space overhead of SWAMP is only log W bits, which is considerably less than any standalone alternative.

2.3 Entropy Detection

Entropy is commonly used as a signal for anomaly detection [38]. Intuitively, it can be viewed as a summary of the entire traffic histogram. The benefit of entropy based approaches is that they require no exact understanding of the attack's mechanism. Instead, such a solution assumes that a sharp change in the entropy is caused by anomalies. An (ε, δ) approximation of the entropy of a stream can be calculated in O(ε^-2 · log δ^-1) space [27], an algorithm that was also extended to sliding windows using priority sampling [15]. That sliding window algorithm is improved by [8], whose algorithm requires O(ε^-2 · log δ^-1 · log N) memory.

2.4 Preliminaries: Compact Set Multiplicity

Our work requires a compact set multiplicity structure that supports both set membership and multiplicity queries. TinyTable [18] and CQF [39] fit the description, while other structures [7, 31] are naturally extendable to multiplicity queries at the expense of additional space. We choose TinyTable [18] as its code is publicly available as open source. TinyTable encodes W fingerprints of size L using (1 + α)·W·(L − log W + 3) + o(W) bits, where α is a small constant that affects update speed; when α grows, TinyTable becomes faster but also consumes more space.

3 SWAMP Algorithm

3.1 Model

We consider a stream S of IDs where at each step an ID is added to S. The last W elements in S are denoted S^W. Given an ID y, the notation f_y^W represents the frequency of y in S^W. Similarly, f̂_y^W is an approximation of f_y^W. For ease of reference, notations are summarized in Table 3.

Symbol | Meaning
W      | Sliding window size.
S      | Stream.
S^W    | Last W elements in the stream.
ε      | Accuracy parameter for set membership.
ε_D    | Accuracy parameter for count distinct.
ε_H    | Accuracy parameter for entropy estimation.
δ      | Confidence for count distinct.
L      | Fingerprint length in bits, L ≜ log(W·ε^-1).
f_y^W  | Frequency of ID y in S^W.
f̂_y^W  | An approximation of f_y^W.
D      | Number of distinct elements in S^W.
Z      | Number of distinct fingerprints.
D̂      | Estimation provided by the DistinctMLE function.
α      | TinyTable's parameter, trading update speed for space.
h      | A pairwise independent hash function.
F      | The set of fingerprints stored in CFB.

Table 3: List of symbols
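As a concrete (hypothetical) instantiation of h, the classic (a·y + b) mod p construction is pairwise independent and needs only two field elements of state, matching the O(log W) space claim made in Section 3.3. A minimal sketch, not the paper's code:

    import random

    P = (1 << 61) - 1   # a Mersenne prime, assumed larger than the ID universe

    class FingerprintHash:
        # Pairwise-independent L-bit fingerprints. Truncating to L bits is the
        # usual practical shortcut; collisions happen with probability ~2^-L.
        def __init__(self, L):
            self.L = L
            self.a = random.randrange(1, P)
            self.b = random.randrange(P)

        def __call__(self, y):
            return ((self.a * y + self.b) % P) & ((1 << self.L) - 1)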

Algorithm 1 SWAMP

Require: TinyTable TT, fingerprint array CFB, integer curr, integer Z
Initialization: CFB ← 0̄; curr ← 0 (initial index is 0); Z ← 0 (0 distinct fingerprints); Ĥ ← 0 (0 entropy).

1:  function UPDATE(ID x)
2:    prevFreq ← TT.frequency(CFB[curr])
3:    TT.remove(CFB[curr])
4:    CFB[curr] ← h(x)
5:    TT.add(CFB[curr])
6:    xFreq ← TT.frequency(CFB[curr])
7:    UPDATECD(prevFreq, xFreq)
8:    UPDATEENTROPY(prevFreq, xFreq)
9:    curr ← (curr + 1) mod W
10: procedure UPDATECD(prevFreq, xFreq)
11:   if prevFreq = 1 then
12:     Z ← Z − 1
13:   if xFreq = 1 then
14:     Z ← Z + 1
15: procedure UPDATEENTROPY(prevFreq, xFreq)
16:   PP ← prevFreq / W                 ▷ the previous probability
17:   CP ← (prevFreq − 1) / W           ▷ the current probability
18:   Ĥ ← Ĥ + PP·log(PP) − CP·log(CP)
19:   xPP ← (xFreq − 1) / W             ▷ x's previous probability
20:   xCP ← xFreq / W                   ▷ x's current probability
21:   Ĥ ← Ĥ + xPP·log(xPP) − xCP·log(xCP)
22: function ISMEMBER(ID x)
23:   if TT.frequency(h(x)) > 0 then
24:     return true
25:   return false
26: function FREQUENCY(ID x)
27:   return TT.frequency(h(x))
28: function DISTINCTLB
29:   return Z
30: function DISTINCTMLE
31:   D̂ ← ln(1 − Z·2^-L) / ln(1 − 2^-L)
32:   return D̂
33: function ENTROPY
34:   return Ĥ
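The pseudocode translates almost line for line into the following sketch (ours, not the authors' code); a plain Python dict stands in for TinyTable, providing the same add/remove/frequency interface without the compact encoding:

    from math import log2

    class SWAMP:
        def __init__(self, W, h):
            self.W, self.h = W, h      # window size, fingerprint function
            self.cfb = [None] * W      # cyclic fingerprint buffer (CFB)
            self.curr = 0              # index of the oldest fingerprint
            self.tt = {}               # TinyTable stand-in: fingerprint -> count
            self.Z = 0                 # number of distinct fingerprints
            self.H = 0.0               # entropy estimator H-hat

        def _plogp(self, c):
            return (c / self.W) * log2(c / self.W) if c > 0 else 0.0

        def update(self, x):
            old = self.cfb[self.curr]
            if old is not None:                 # remove departing fingerprint
                prev = self.tt[old]
                self.tt[old] = prev - 1
                if prev == 1:
                    self.Z -= 1                 # its count dropped to zero
                self.H += self._plogp(prev) - self._plogp(prev - 1)
            f = self.h(x)                       # add arriving fingerprint
            self.cfb[self.curr] = f
            cur = self.tt.get(f, 0) + 1
            self.tt[f] = cur
            if cur == 1:
                self.Z += 1
            self.H += self._plogp(cur - 1) - self._plogp(cur)
            self.curr = (self.curr + 1) % self.W

        def frequency(self, x):
            return self.tt.get(self.h(x), 0)

        def is_member(self, x):
            return self.frequency(x) > 0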

3.2 Problem definitions

We start by formally defining the approximate set membership problem.

Definition 3.1. We say that an algorithm solves the (W, ε)-approximate Set Membership problem if, given an ID y, it returns true if y ∈ S^W, and if y ∉ S^W, it returns false with probability of at least 1 − ε.

The above problem is solved by SWBF [34] and by the Timing Bloom filter (TBF) [42]. In practice, SWAMP solves the stronger (W, ε)-approximate Set Multiplicity problem, as defined below:

Definition 3.2. We say that an algorithm solves the (W, ε)-approximate Set Multiplicity problem if, given an ID y, it returns an estimation f̂_y^W s.t. f̂_y^W ≥ f_y^W and, with probability of at least 1 − ε, f̂_y^W = f_y^W.

Intuitively, the (W, ε)-approximate Set Multiplicity problem guarantees that we always get an over approximation of the frequency and that with probability of at least 1 − ε we get the exact window frequency. A simple observation shows that any algorithm that solves the (W, ε)-approximate Set Multiplicity problem also solves the (W, ε)-approximate Set Membership problem. Specifically, if y ∈ S^W, then f̂_y^W ≥ f_y^W implies that f̂_y^W ≥ 1 and we can return true. On the other hand, if y ∉ S^W, then f_y^W = 0 and with probability of at least 1 − ε we get f̂_y^W = 0 = f_y^W. Thus, the IsMember estimator simply returns true if f̂_y^W > 0 and false otherwise. We later show that this estimator solves the (W, ε)-approximate Set Membership problem.

The goal of the (W, ε, δ)-approximate Count Distinct problem is to maintain an estimation of the number of distinct elements in S^W; we denote their number by D.

Definition 3.3. We say that an algorithm solves the (W, ε, δ)-approximate Count Distinct problem if it returns an estimation D̂ such that D̂ ≤ D and, with probability 1 − δ, D̂ ≥ (1 − ε)·D.

Intuitively, an algorithm that solves the (W, ε, δ)-approximate Count Distinct problem is able to conservatively estimate the number of distinct elements in the window and, with probability of 1 − δ, this estimate is close to the real number of distinct elements.

The entropy of a window is defined as H ≜ −Σ_{i=1}^{D} (f_i/W)·log(f_i/W), where D is the number of distinct elements in the window, W is the total number of packets, and f_i is the frequency of flow i. We define the window entropy estimation problem as:

Definition 3.4. An algorithm solves the (W, ε, δ)-Entropy Estimation problem if it provides an estimator Ĥ so that Ĥ ≤ H and Pr[H − Ĥ ≥ ε] ≤ δ.

Figure 2: Empirical error and theoretical guarantee for multiple functionalities (random inputs): (a) Sliding Bloom filter, (b) Count Distinct, (c) Entropy.

3.3 SWAMP algorithm

We now present the Sliding Window Approximate Measurement Protocol (SWAMP). SWAMP uses a single hash function h which, given an ID y, generates L ≜ log(W·ε^-1) random bits h(y) that are called its fingerprint. We note that h only needs to be pairwise-independent and can thus be efficiently implemented using only O(log W) space. Fingerprints are then stored in a cyclic fingerprint buffer of length W that is denoted CFB. The variable curr always points to the oldest entry in the buffer. Fingerprints are also stored in TinyTable [18], which provides compact encoding and multiplicity information.

The update operation replaces the oldest fingerprint in the window with that of the newly arriving item. To do so, it updates both the cyclic fingerprint buffer (CFB) and TinyTable. In CFB, the fingerprint at location curr is replaced with the newly arriving fingerprint. In TinyTable, we remove one occurrence of the oldest fingerprint and add the newly arriving fingerprint. The update method also updates the variable Z, which measures the number of distinct fingerprints in the window. Z is incremented every time a new unique fingerprint is added to TinyTable, i.e., FREQUENCY(y) changes from 0 to 1, where the method FREQUENCY(y) receives a fingerprint and returns its frequency as provided by TinyTable. Similarly, denoting by x the item whose fingerprint is removed, if FREQUENCY(x) changes to 0, we decrement Z.

SWAMP has two methods to estimate the number of distinct flows. DistinctLB simply returns Z, which yields a conservative estimator of D, while DistinctMLE is an approximation of its Maximum Likelihood Estimator. Clearly, DistinctMLE is more accurate than DistinctLB, but its estimation error is two sided. A pseudo code is provided in Algorithm 1 and an illustration is given by Figure 1.
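Both estimators read nothing but Z (and the fingerprint length L). A minimal sketch (ours): DistinctLB returns Z itself, while DistinctMLE inverts the expectation E(Z) = 2^L·(1 − (1 − 2^-L)^D), derived as formula (1) in Section 5.2:

    from math import log

    def distinct_lb(Z):
        return Z                       # one sided: never exceeds the true D

    def distinct_mle(Z, L):
        # Solve E(Z) = 2^L * (1 - (1 - 2^-L)^D) for D, with Z in place of E(Z).
        return log(1.0 - Z * 2.0 ** -L) / log(1.0 - 2.0 ** -L)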

Figure 3: Memory consumption of sliding Bloom filters as a function of W and ε: (a) window size 2^16 and varying ε; (b) ε = 2^-10 and varying window sizes.

4 Empirical Evaluation

4.1 Overview

We evaluate SWAMP's various functionalities, each against its known solutions. We start with the (W, ε)-approximate Set Membership problem, where we compare SWAMP to SWBF [34] and the Timing Bloom filter (TBF) [42], which only solve (W, ε)-approximate Set Membership. For counting, we compare SWAMP to Window Compact Space Saving (WCSS) [3], which solves heavy hitters identification on a sliding window. WCSS provides a different accuracy guarantee, and therefore we evaluate their empirical accuracy when both algorithms are given the same amount of space. For the distinct elements problem, we compare SWAMP against Sliding HyperLogLog [11, 23], denoted SWHLL, whose authors proposed running HyperLogLog and LogLog on a sliding window. In our settings, small range correction is active and thus HyperLogLog and LogLog devolve into the same algorithm. Additionally, since we already know that small range correction is active, we can allocate only a single bit rather than 5 to each counter. This option, denoted SWLC, slightly improves the space/accuracy ratio.

In all measurements, we use a window size of W = 2^16 unless specified otherwise. In addition, the underlying TinyTable uses α = 0.2 as recommended by its authors [18]. Our evaluation includes six Internet packet traces consisting of backbone router and data center traffic. The backbone traces contain a mix of 1 billion UDP, TCP and ICMP packets collected from two major routers in Chicago [30] and San Jose [29] during the years 2013-2016. The dataset Chicago 16 refers to data collected from the Chicago router in 2016, San Jose 14 to data collected from the San Jose router in 2014, etc. The datacenter packet traces are taken from two university datacenters that consist of up to 1,000 servers [4]. These traces are denoted DC1 and DC2.

Figure 4: Space/accuracy trade-off in estimating per-flow packet counts (window size: W = 2^16), on the (a) Chicago 16, (b) San Jose 14, (c) DC1, (d) Chicago 15, (e) San Jose 13 and (f) DC2 datasets.

4.2 Evaluation of analytical guarantees

Figure 2 evaluates the accuracy of our analysis from Section 5 on random inputs. As can be observed, the analysis of the sliding Bloom filter (Figure 2a) and count distinct (Figure 2b) is accurate. For entropy (Figure 2c), the accuracy is better than anticipated, indicating that our analysis here is just an upper bound, but the trend line is nearly identical.

4.3 Set membership on sliding windows

We now compare SWAMP to TBF [42] and SWBF [34]. Our evaluation focuses on two aspects: fixing ε and changing the window size (Figure 3b), as well as fixing the window size and changing ε (Figure 3a). As can be observed, SWAMP is considerably more space efficient than the alternatives in both cases, for a wide range of window sizes and for a wide range of error probabilities. In the tested range, it is 25%-40% smaller than the best alternative.

4.4 Per-flow counting on sliding windows

Next, we evaluate SWAMP for its per-flow counting functionality. We compare SWAMP to WCSS [3], which solves heavy hitters on a sliding window. Our evaluation uses the On Arrival model, which was used to evaluate WCSS. In that model, we perform a query for each incoming packet and then calculate the Root Mean Square Error. We repeated each experiment 5 times with different seeds and computed 95% confidence intervals for SWAMP. Note that WCSS is a deterministic algorithm and as such was run only once.
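For reference, the On Arrival measurement loop is just the following (a sketch; true_freq stands for an exact sliding-window frequency oracle computed offline, which is our assumption, not an artifact of the paper):

    from math import sqrt

    def on_arrival_rmse(stream, algo, true_freq):
        sq_err = 0.0
        for i, pkt in enumerate(stream):
            err = algo.frequency(pkt) - true_freq(i, pkt)  # query on arrival
            sq_err += err * err
            algo.update(pkt)
        return sqrt(sq_err / len(stream))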

Figure 5: Accuracy of SWAMP's count distinct functionality compared to alternatives, on the (a) Chicago 16, (b) San Jose 14, (c) DC1, (d) Chicago 15, (e) San Jose 13 and (f) DC2 datasets.

The results appear in Figure 4. Note that the space consumption of WCSS is proportional to ε^-1 and that of SWAMP to W. Thus, SWAMP cannot be run for the entire range. Yet, when it is feasible, SWAMP's error is lower on average than that of WCSS. Additionally, in many of the configurations we are able to show statistical significance for this improvement. Note that SWAMP becomes accurate with high probability using about 300KB of memory, while WCSS requires about 8.3MB to provide the same accuracy; that is, an improvement of about x28.

4.5 Count distinct on sliding windows

Next, we evaluate the count distinct functionality in terms of accuracy vs. space on the different datasets. We performed 5 runs and averaged the results. We evaluate two functionalities: SWAMP-LB and SWAMP-MLE. SWAMP-LB corresponds to the function DistinctLB in Algorithm 1 and provides a one sided estimation, while SWAMP-MLE corresponds to the function DistinctMLE in Algorithm 1 and provides an unbiased estimator. Figure 5 shows the results of this evaluation. As can be observed, SWAMP-MLE is up to x1000 more accurate than the alternatives. Additionally, SWAMP-LB also outperforms the alternatives for parts of the range. Note that SWAMP-LB is the only one sided estimator in this evaluation.

4.6 Entropy estimation on sliding windows

Figure 6 shows the results for entropy estimation. As shown, SWAMP provides a very accurate entropy estimation in its entire operational range. Moreover, our analysis in Section 5.3 is conservative and SWAMP is much more accurate in practice.

Figure 6: Accuracy of SWAMP's entropy functionality compared to the theoretical guarantee, on the (a) Chicago 16, (b) San Jose 14, (c) DC1, (d) Chicago 15, (e) San Jose 13 and (f) DC2 datasets.

5 Analysis of SWAMP

This section is dedicated to the analysis of SWAMP and is partitioned into three subsections. Section 5.1 shows that SWAMP solves the (W, ε)-approximate Set Membership and the (W, ε)-approximate Set Multiplicity problems and explains the conditions under which it is succinct. It also shows that SWAMP operates in constant runtime. Next, Section 5.2 shows that SWAMP solves the (W, ε, δ)-approximate Count Distinct problem using DistinctLB and that DistinctMLE approximates its Maximum Likelihood Estimator. Finally, Section 5.3 shows that SWAMP solves (W, ε, δ)-Entropy Estimation.

5.1 Analysis of per flow counting

We start by showing that SWAMP's runtime is constant.

Theorem 5.1. SWAMP's runtime is O(1) with high probability.

Proof. UPDATE: updating the cyclic buffer requires two TinyTable operations, add and remove, both performed in constant time with high probability for any constant α. The manipulations of Ĥ and Z are also done in constant time. ISMEMBER: satisfied by TinyTable in constant time. FREQUENCY: satisfied by TinyTable in constant time. DISTINCTLB: satisfied by returning an integer. DISTINCTMLE: satisfied with a simple calculation. ENTROPY: satisfied by returning a floating point.

Next, we prove that SWAMP solves the (W, ε)-approximate Set Multiplicity problem with regard to the function FREQUENCY in Algorithm 1.

Theorem 5.2. SWAMP solves the (W, ε)-approximate Set Multiplicity problem. That is, given an ID y, FREQUENCY(y) provides an estimation f̂_y^W such that f̂_y^W ≥ f_y^W and Pr[f̂_y^W = f_y^W] ≥ 1 − ε.

Proof. SWAMP's CFB variable stores W fingerprints, each of length L = log(W·ε^-1). For fingerprints of size L, the probability that two specific fingerprints collide is 2^-L. There are W − 1 fingerprints that h(y) may collide with, and f̂_y^W > f_y^W only if h(y) collides with one of the fingerprints in CFB. Next, we use the Union Bound to get:

Pr[f̂_y^W ≠ f_y^W] ≤ W · 2^{−log(W·ε^-1)} = ε.

Thus, given an item y, with probability 1 − ε its fingerprint h(y) is unique and TinyTable accurately measures f_y^W. Any collision of h(y) with other fingerprints only increases f̂_y^W, thus f̂_y^W ≥ f_y^W in all cases.

Theorem 5.2 shows that SWAMP solves the (W, ε)-approximate Set Multiplicity problem, which includes the per flow counting and sliding Bloom filter functionalities. Our next step is to analyze the space consumption of SWAMP. This enables us to show that SWAMP is memory optimal for the (W, ε)-approximate Set Multiplicity problem, or more precisely, that it is succinct according to Definition 5.1.

Definition 5.1. Given an information theoretic lower bound B, an algorithm is called succinct if it uses B·(1 + o(1)) bits.

We now analyze the space consumption of SWAMP.

Theorem 5.3. Given a window size W and an accuracy parameter ε, the number of bits required by SWAMP is W·log(W·ε^-1) + (1 + α)·W·(log ε^-1 + 3) + o(W).

Proof. Our cyclic buffer CFB stores W fingerprints, each of size log(W·ε^-1), for an overall space of W·log(W·ε^-1) bits. Additionally, TinyTable requires (1 + α)·W·(log(W·ε^-1) − log W + 3) + o(W) = (1 + α)·W·(log ε^-1 + 3) + o(W) bits. Each of the variables curr, Z and Ĥ requires log W = o(W) bits, and thus SWAMP's memory consumption is W·log(W·ε^-1) + (1 + α)·W·(log ε^-1 + 3) + o(W).

The work of [37] implies a lower bound for the (W, ε)-approximate Set Membership problem of B ≜ W·(log ε^-1 + max(log log ε^-1, log W)) bits. This lower bound also applies to our case, as SWAMP solves the (W, ε)-approximate Set Membership problem by returning whether or not f̂_y^W > 0.

Corollary 5.1. SWAMP solves the (W, ε)-approximate Set Membership problem.

We now show that SWAMP is succinct for ε^-1 = W^{o(1)}.

Theorem 5.4. Let ε^-1 = W^{o(1)}. Then SWAMP is succinct for the (W, ε)-approximate Set Membership and (W, ε)-approximate Set Multiplicity problems.

Proof. Theorem 5.3 shows that the required space is

S ≤ W·log W + (2 + α)·W·log ε^-1 + 3·(1 + α)·W + o(W) = W·log W · (1 + ((2 + α)·log ε^-1 + 3·(1 + α)) / log W + o(1)).

We show that this implies succinctness according to Definition 5.1. Under our assumption that ε^-1 = W^{o(1)}, we get that log ε^-1 = log W^{o(1)} = o(log W); hence SWAMP uses W·log W·(1 + o(1)) ≤ B·(1 + o(1)) bits and is succinct.
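To put Theorems 5.2-5.4 into numbers (our example, not the paper's): with W = 2^16 and ε = 2^-10, fingerprints have L = log(W·ε^-1) = 26 bits, and the union bound of Theorem 5.2 gives Pr[f̂_y^W ≠ f_y^W] ≤ W·2^-L = 2^{16−26} = 2^-10 = ε. The space bound of Theorem 5.3 then evaluates to roughly 2^16·26 ≈ 1.70M bits (≈213KB) for CFB, plus (1 + 0.2)·2^16·(10 + 3) ≈ 1.02M bits (≈128KB) for TinyTable; and log ε^-1 = 10 is indeed small relative to log W = 16, the regime that Theorem 5.4 formalizes asymptotically.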

5.2 Analysis of the count distinct functionality

We now move to analyze SWAMP's count distinct functionality. Recall that SWAMP has two estimators: the one sided DistinctLB and the more accurate DistinctMLE. Also, recall that Z monitors the number of distinct fingerprints and that D is the real number of distinct flows in the window.

5.2.1 Analysis of DistinctLB

We now present an analysis of DistinctLB, showing that it solves the (W, ε, δ)-approximate Count Distinct problem.

Theorem 5.5. Z ≤ D and E(Z) ≥ D·(1 − ε/2).

Proof. Clearly, Z ≤ D always, since any two identical IDs map to the same fingerprint. Next, for any 0 ≤ i ≤ 2^L − 1, denote by Z_i the indicator random variable indicating whether there is some item in the window whose fingerprint is i. Then Z = Σ_{i=0}^{2^L−1} Z_i, hence E(Z) = Σ_{i=0}^{2^L−1} E(Z_i). The probability that a specific fingerprint is exactly i is 2^-L, and thus for any i, E(Z_i) = Pr[Z_i = 1] = 1 − (1 − 2^-L)^D. Therefore:

E(Z) = 2^L · (1 − (1 − 2^-L)^D).    (1)

Since 0 < 2^-L ≤ 1, we have (1 − 2^-L)^D < 1 − D·2^-L + (D²/2)·2^{−2L}, which, in turn, implies E(Z) > D − D²·2^{−L−1}. Finally, we note that 2^-L = ε/W and that D ≤ W. Hence, E(Z) > D − D²·ε/(2W) ≥ D·(1 − ε/2).

Our next step is a probabilistic bound on the error of Z, namely on X ≜ D − Z. This is required for Theorem 5.6, which is the main result and shows that Z provides an (ε, δ) approximation for the distinct elements problem. We model the problem as a balls and bins problem where the D balls (distinct items) are placed into 2^L = W·ε^-1 bins (fingerprints), and consider the bins that contain at least 2 balls. For any 0 ≤ i ≤ 2^L − 1, denote by X_i the random variable counting the number of balls in the i-th bin. Note that the variables X_i are dependent on each other and are difficult to reason about directly. Luckily, X is monotonically increasing in the number of balls and we can use a Poisson approximation. To do so, we denote by Y_i ~ Poisson(D/2^L) the corresponding independent Poisson variables, and by Y the Poisson approximation of X, that is, the number of Y_i's with value 2 or more. Our goal is to apply Lemma 5.1, which links the Poisson approximation to the exact case. In our case, it allows us to derive insight about X by analyzing Y.

Lemma 5.1 (Corollary 5.11, page 103 of [36]). Let E be an event whose probability is either monotonically increasing or decreasing with the number of balls. If E has probability p in the Poisson case, then E has probability at most 2p in the exact case.

Here, E is an event depending on X_0, ..., X_{2^L−1} in the exact case, and the probability of the event in the Poisson case is obtained by computing E using Y_0, ..., Y_{2^L−1}. That is, to bound a deviation probability of the X_i's, it suffices to bound the corresponding probability for the Y_i's, losing at most a factor of 2.

We now denote by Ỹ_i ≜ 1_{Y_i ≥ 2} the indicator variables, indicating whether Y_i has value at least 2. By definition, Y = Σ_{i=0}^{2^L−1} Ỹ_i. It follows that E(Y) = Σ_{i=0}^{2^L−1} E(Ỹ_i) = 2^L · Pr[Y_i ≥ 2]. As Y_i ~ Poisson(D/2^L), we have

Pr[Y_i ≥ 2] = 1 − Pr[Y_i = 0] − Pr[Y_i = 1],

and thus:

E(Y) = 2^L · (1 − e^{−D/2^L}·(1 + D/2^L)).

Since e^{−x}·(1 + x) > 1 − x²/2 for 0 < x ≤ 1, we get E(Y) < 2^L · D²/2^{2L+1} = D²/2^{L+1}. Recall that 2^L = W·ε^-1 and D ≤ W, and thus E(Y) < D·ε/2. Note also that the Ỹ_i are independent, since the Y_i are independent, and as they are indicator (Bernoulli) variables, in particular they are Poisson trials. Therefore, we may use a Chernoff bound on Y, as it is the sum of independent Poisson trials. We use the following Chernoff bound to continue:

Lemma 5.2 (Theorem 4.4, page 64 of [36]). Let X_1, ..., X_n be independent Poisson trials such that Pr[X_i = 1] = p_i, and let X = Σ_{i=1}^n X_i. Then for R ≥ 6·E(X), Pr[X ≥ R] ≤ 2^{−R}.

Lemma 5.2 allows us to approach Theorem 5.6, which is the main result for DistinctLB and shows that it provides an (ε, δ) approximation of D, which we can then use in Corollary 5.2 to show that it solves the count distinct problem.

Theorem 5.6. Let δ ≤ 1/128. Then:

Pr[D − Z ≥ ½·ε·D·log δ^-1] ≤ 2δ.

Proof. For δ ≤ 1/128, log δ^-1 ≥ 7 > 6 and thus, as ½·ε·D·log δ^-1 ≥ 6·E(Y), we can use Lemma 5.2 to get that:

Pr[Y ≥ ½·ε·D·log δ^-1] ≤ 2^{−½·ε·D·log δ^-1} = δ^{εD/2} ≤ δ.

As X monotonically increases with the number of balls, we use Lemma 5.1 to conclude that Pr[X ≥ ½·ε·D·log δ^-1] ≤ 2δ.

The following readily follows, as claimed:

Corollary 5.2. DistinctLB solves the (W, ε_D, 2δ)-approximate Count Distinct problem for ε_D = ½·ε·log δ^-1 and any δ ≤ 1/128. That is, for such δ we get: Pr[Z ≥ (1 − ε_D)·D] ≥ 1 − 2δ.

5.2.2 Analysis of DistinctMLE

We now provide an analysis of the DistinctMLE estimation method, which is more accurate but offers a two sided error. Our goal is to derive confidence intervals around DistinctMLE, and for this we require a tighter analysis of Z, which is provided by Lemma 5.3.

Lemma 5.3. D − D²·2^{−L−1} ≤ E(Z) ≤ D − D·(D − 1)·2^{−L−1} + D³·2^{−2L}/6.

Proof. This follows simply from (1) by expanding the binomial.

We also need to analyze the second moment of Z; Lemma 5.4 does just that.

Lemma 5.4. E(Z²) = 2^{2L} − (2^{L+1}·(2^L − 1) + 2^L)·(1 − 2^-L)^D + 2^L·(2^L − 1)·(1 − 2^{1−L})^D.

Proof. As in Theorem 5.5, we write Z as a sum of indicator variables, Z = Σ_{i=0}^{2^L−1} Z_i. Then

Z² = Σ_{i=0}^{2^L−1} Z_i² + Σ_{i≠j} Z_i·Z_j.

By linearity of the expectation, this implies

E(Z²) = Σ_{i=0}^{2^L−1} E(Z_i²) + Σ_{i≠j} E(Z_i·Z_j).    (3)

We note that for any i ≠ j, (1 − Z_i)·(1 − Z_j) is also an indicator variable, attaining the value 1 with probability (1 − 2^{1−L})^D, as it requires that no ball fall in either bin. Therefore, for any i ≠ j, we have by linearity of the expectation that

(1 − 2^{1−L})^D = E((1 − Z_i)·(1 − Z_j)) = 1 − E(Z_i) − E(Z_j) + E(Z_i·Z_j).

Since E(Z_i) = E(Z_j) = 1 − (1 − 2^-L)^D, it follows that

E(Z_i·Z_j) = (1 − 2^{1−L})^D − 2·(1 − 2^-L)^D + 1.

Plugging this back into Equation (3), and using Z_i² = Z_i, we obtain

E(Z²) = 2^L·(1 − (1 − 2^-L)^D) + 2^L·(2^L − 1)·((1 − 2^{1−L})^D − 2·(1 − 2^-L)^D + 1).

Finally, expanding and rearranging, we obtain the claim of the lemma.

As our interest lies in the confidence intervals, we shall only need an approximation, described in the following simple corollary.

Corollary 5.3. D² − (4/3)·D³·2^-L < E(Z²) < D² + (1/3)·D³·2^-L.

Proof. Expanding the two binomial powers in Lemma 5.4 up to the third order, the first and second order terms cancel, leaving E(Z²) = D² + O(D³·2^-L); tracking the third order coefficients on each side yields the stated constants. Thus, we have established the corollary.
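Formula (1) and Lemma 5.4 are easy to sanity-check numerically; the following quick simulation (ours, not from the paper) throws D balls into 2^L bins and compares the empirical moments of Z against the closed forms:

    import random

    def sample_Z(D, L):
        # Z = number of occupied bins after D uniform throws.
        return len({random.getrandbits(L) for _ in range(D)})

    L, D, runs = 16, 30000, 200
    zs = [sample_Z(D, L) for _ in range(runs)]
    mean = sum(zs) / runs
    mean_sq = sum(z * z for z in zs) / runs
    q = (1 - 2.0 ** -L) ** D
    print(mean, 2 ** L * (1 - q))                    # formula (1)
    print(mean_sq, 2 ** (2 * L)                      # Lemma 5.4
                   - (2 ** (L + 1) * (2 ** L - 1) + 2 ** L) * q
                   + 2 ** L * (2 ** L - 1) * (1 - 2.0 ** (1 - L)) ** D)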

Using Corollary 5.3 and Lemma 5.3, we can finally estimate E(D̂), as described in the following theorem, which shows that D̂ is unbiased up to an O(ε + D·ε²) additive factor.

Theorem 5.7. For ε ≤ 1/2, |E(D̂) − D| ≤ ε + 3·D·ε².

Proof. We first note that for x ∈ [0, 1) one has

x + x²/2 ≤ −ln(1 − x) ≤ x + x²/2 + x³/(3·(1 − x)³).    (5)

Applying (5) to x = Z·2^-L yields

Z·2^-L + Z²·2^{−2L−1} ≤ −ln(1 − Z·2^-L) ≤ Z·2^-L + Z²·2^{−2L−1} + Z³/(3·(2^L − Z)³).

Since always Z ≤ D, and D ≤ W = ε·2^L, we can take expectations and obtain, by the linearity and monotonicity of expectation, that

E(Z)·2^-L + E(Z²)·2^{−2L−1} ≤ E(−ln(1 − Z·2^-L)) ≤ E(Z)·2^-L + E(Z²)·2^{−2L−1} + D³/(3·(2^L − D)³).    (6)

We can now substitute Lemma 5.3 and Corollary 5.3 into (6); the second order terms cancel on both sides, leaving

D·2^-L − (2/3)·D³·2^{−3L} ≤ E(−ln(1 − Z·2^-L)) ≤ D·2^-L + D·2^{−2L−1} + (1/3)·D³·2^{−3L} + D³/(3·(2^L − D)³).

Recalling (5), we also see that 2^-L + 2^{−2L−1} ≤ −ln(1 − 2^-L) ≤ 2^-L·(1 + 2^-L). Since D̂ = ln(1 − Z·2^-L)/ln(1 − 2^-L), dividing the two estimates and using D³·2^{−3L} ≤ ε²·D·2^-L and D³/(2^L − D)³ ≤ ε²·D·2^-L/(1 − ε)³ (both because D ≤ W = ε·2^L), and finally (1 − ε)^{−3} ≤ 8 for ε ≤ 1/2, gives the claimed bounds.

Next, we bound the error probability. This is done in the following theorem.

Theorem 5.8. Let δ ≤ 1/128 and denote ε_D = ½·ε·log δ^-1. Then

Pr[|D̂ − D| ≥ ε_D·D] ≤ 2δ.

Proof. We first note that, for any a,

Pr[ln(1 − Z·2^-L) ≤ ln((1 − 2^-L)^a)] = Pr[Z ≥ 2^L·(1 − (1 − 2^-L)^a)],

i.e., Pr[D̂ ≥ a] = Pr[Z ≥ 2^L·(1 − (1 − 2^-L)^a)].

Now, using Corollary 5.2 and the fact that 2^L·(1 − (1 − 2^-L)^a) ≤ a, the result immediately follows.

Theorem 5.8 shows the soundness of the DistinctMLE estimator by proving that it is at least as accurate as the DistinctLB estimator.

5.3 Analysis of the entropy estimation functionality

We now turn to analyze the entropy estimation. Recall that SWAMP uses the estimator Ĥ. If we denote by F the set of distinct fingerprints in the last W elements, it is given by

Ĥ = −Σ_{h∈F} (n_h/W)·log(n_h/W),

where n_h is the number of occurrences of fingerprint h in the window of the last W elements. We begin by showing that E(Ĥ) approximates H, where

H = −Σ_{y∈D} (f_y^W/W)·log(f_y^W/W)

is the entropy of the window.

Theorem 5.9. Ĥ ≤ H and E(Ĥ) is at least H − ε.

Proof. For any y ∈ D, let h(y) be its fingerprint. Recall that f̂_y^W is the frequency n_{h(y)} of h(y); hence we have

f̂_y^W = Σ_{y' : h(y') = h(y)} f_{y'}^W.

For ease of notation, we denote p_y ≜ f_y^W/W and p_h ≜ n_h/W. Thus p_h = Σ_{y : h(y) = h} p_y. It follows that

Ĥ = −Σ_{h∈F} p_h·log(p_h) = −Σ_{h∈F} Σ_{y : h(y) = h} p_y · log(Σ_{y' : h(y') = h} p_{y'}).

Now, since p_{y'} ≥ 0 for all y', it follows that for any y, Σ_{y' : h(y') = h(y)} p_{y'} ≥ p_y. Hence, by the monotonicity of the logarithm,

Ĥ ≤ −Σ_{y∈D} p_y·log(p_y) = H,

proving the first part of our claim. Conversely, denote, for any y ∈ D and any fingerprint h, by I_{y,h} the Bernoulli random variable attaining the value 1 if h(y) = h. Then we see that

Ĥ = −Σ_{y∈D} p_y · log(Σ_{y'∈D} p_{y'}·I_{y',h(y)}).

We see, by using Jensen's inequality, the concavity of the logarithm and the linearity of expectation, that for any y:

E(log(Σ_{y'} p_{y'}·I_{y',h(y)})) ≤ log(Σ_{y'} p_{y'}·E(I_{y',h(y)})).

Since for any y' ≠ y we have E(I_{y',h(y)}) = 2^-L, we see that

E(log(Σ_{y'} p_{y'}·I_{y',h(y)})) ≤ log(2^-L·Σ_{y'≠y} p_{y'} + p_y) ≤ log(p_y + 2^-L·(1 − p_y)).

Summing this over y, and using the linearity of expectation once more, we obtain

E(Ĥ) ≥ −Σ_{y∈D} p_y·log(p_y + 2^-L·(1 − p_y)).

Subtracting H yields

E(Ĥ) − H ≥ −Σ_{y∈D} p_y·(log(p_y + 2^-L·(1 − p_y)) − log(p_y)) = −Σ_{y∈D} p_y·log(1 + 2^-L·(1/p_y − 1)).

Note that 2^-L = ε/W and 1/W ≤ p_y < 1. Hence, 0 < 2^-L·(1/p_y − 1) < ε. Since log(1 + x) < x for any 0 < x, we get

E(Ĥ) − H ≥ −Σ_{y∈D} 2^-L·(1 − p_y) = −(ε/W)·(D − 1).

As D ≤ W, we get that E(Ĥ) > H − ε, as claimed.

We proceed with a confidence interval derivation:

Theorem 5.10. Let ε_H = ε·δ^-1; then Ĥ solves the (W, ε_H, δ)-Entropy Estimation problem. That is:

Pr[H − Ĥ ≥ ε_H] ≤ δ.

Proof. Denote by X ≜ H − Ĥ the random variable of the estimation error. According to Theorem 5.9, X is non-negative and E(X) ≤ ε. Therefore, according to Markov's inequality we have

Pr[H ≤ Ĥ + ε·δ^-1] = 1 − Pr[X ≥ ε·δ^-1] ≥ 1 − E(X)/(ε·δ^-1) ≥ 1 − δ.

Setting ε_H = ε·δ^-1 and rearranging the expression completes the proof.

6 Discussion

In modern networks, operators are likely to require multiple measurement types. To that end, this work suggests SWAMP, a unified algorithm that monitors four common measurement metrics in constant time and compact space. Specifically, SWAMP approximates the following metrics on a sliding window: Bloom filters, per-flow counting, count distinct and entropy estimation. For all problems, we proved formal accuracy guarantees and demonstrated them on real Internet traces.

Despite being a general algorithm, SWAMP advances the state of the art for all these problems. For sliding Bloom filters, we showed that SWAMP is memory succinct for constant false positive rates and that it reduces the required space by 25%-40% compared to previous approaches [34]. In per-flow counting, our algorithm outperforms WCSS [3], a state of the art window algorithm. When compared with (1 + ε) approximation count distinct algorithms [11, 23], SWAMP asymptotically improves the query time from O(ε^-2) to a constant. It is also up to 1000 times more accurate on real packet traces. For entropy estimation on a sliding window [24], SWAMP reduces the update time to a constant.

While SWAMP benefits from the compactness of TinyTable [18], most of its space reductions inherently come from using fingerprints rather than sketches. For example, all existing count distinct and entropy algorithms require Ω(ε^-2) space for computing a (1 + ε) approximation. SWAMP can compute the exact answers using O(W·log W) bits. Thus, for a small ε value, we get an asymptotic reduction by storing the fingerprints in any compact table.

Finally, while the formal analysis of SWAMP is mathematically involved, the actual code is short and simple to implement. This facilitates its adoption in network devices and SDN. In particular, OpenBox [9] demonstrated that sharing common measurement results across multiple network functionalities is feasible and efficient. Our work fits into this trend.

References

[1] M. Alizadeh, T. Edsall, S. Dharmapurikar, R. Vaidyanathan, K. Chu, A. Fingerhut, V. T. Lam, F. Matus, R. Pan, N. Yadav, and G. Varghese. CONGA: Distributed congestion-aware load balancing for datacenters. In ACM SIGCOMM, 2014.

[2] Z. Bar-Yossef, T. S. Jayram, R. Kumar, D. Sivakumar, and L. Trevisan. Counting distinct elements in a data stream. In RANDOM, 2002.

[3] R. Ben-Basat, G. Einziger, R. Friedman, and Y. Kassner. Heavy hitters in streams and sliding windows. In IEEE INFOCOM, 2016.

[4] T. Benson, A. Akella, and D. A. Maltz. Network traffic characteristics of data centers in the wild. In ACM IMC, 2010.

[5] G. Bianchi, N. d'Heureuse, and S. Niccolini. On-demand time-decaying bloom filters for telemarketer detection. ACM SIGCOMM CCR, 2011.

[6] B. H. Bloom. Space/time trade-offs in hash coding with allowable errors. Commun. ACM, 1970.

[7] F. Bonomi, M. Mitzenmacher, R. Panigrahy, S. Singh, and G. Varghese. An improved construction for counting bloom filters. In ESA, 2006.

[8] V. Braverman, R. Ostrovsky, and C. Zaniolo. Optimal sampling from sliding windows. In ACM PODS, 2009.

[9] A. Bremler-Barr, Y. Harchol, and D. Hay. OpenBox: A software-defined framework for developing, deploying, and managing network functions. In ACM SIGCOMM, 2016.

[10] Y. Chabchoub, R. Chiky, and B. Dogan. How can sliding HyperLogLog and EWMA detect port scan attacks in IP traffic? EURASIP Journal on Information Security, 2014.

[11] Y. Chabchoub and G. Hebrail. Sliding HyperLogLog: Estimating cardinality in a data stream over a sliding window. In IEEE ICDM Workshops, 2010.

[12] P. Chassaing and L. Gerin. Efficient estimation of the cardinality of large data sets. In DMTCS, 2006.

[13] S. Cohen and Y. Matias. Spectral bloom filters. In ACM SIGMOD, 2003.

[14] A. R. Curtis, J. C. Mogul, J. Tourrilhes, P. Yalagandula, P. Sharma, and S. Banerjee. DevoFlow: Scaling flow management for high-performance networks. In ACM SIGCOMM, 2011.

[15] N. Duffield, C. Lund, and M. Thorup. Priority sampling for estimation of arbitrary subset sums. J. ACM, 54(6), 2007.

[16] M. Durand and P. Flajolet. Loglog counting of large cardinalities. In ESA, 2003.

[17] G. Einziger and R. Friedman. TinyLFU: A highly efficient cache admission policy. In Euromicro PDP, 2014.

18 While SWAMP benefits from the compactness of TinyTable [18], most of its space reductions inherently come from using fingerprints rather than sketches. For example, all existing count distinct and entropy algorithms require Ωε space for computing a 1 + ε approximation. SWAMP can compute the exact answers using OW log W bits. Thus, for a small ε value, we get an asymptotic reduction by storing the fingerprints on any compact table. Finally, while the formal analysis of SWAMP is mathematically involved, the actual code is short and simple to implement. This facilitates its adoption in network devices and SDN. In particular, OpenBox [9] demonstrated that sharing common measurement results across multiple network functionalities is feasible and efficient. Our work fits into this trend. References [1] M. Alizadeh, T. Edsall, S. Dharmapurikar, R. Vaidyanathan, K. Chu, A. Fingerhut, V. T. Lam, F. Matus, R. Pan, N. Yadav, and G. Varghese. CONGA: Distributed Congestion-aware Load Balancing for Datacenters. In ACM SIGCOMM, 014. [] Z. Bar-Yossef, T. S. Jayram, R. Kumar, D. Sivakumar, and L. Trevisan. Counting distinct elements in a data stream. In RANDOM, 00. [3] R. Ben-Basat, G. Einziger, R. Friedman, and Y. Kassner. Heavy Hitters in Streams and Sliding Windows. In IEEE INFOCOM, 016. [4] T. Benson, A. Akella, and D. A. Maltz. Network traffic characteristics of data centers in the wild. In ACM IMC 010, pages [5] G. Bianchi, N. d Heureuse, and S. Niccolini. On-demand time-decaying bloom filters for telemarketer detection. ACM SIGCOMM CCR, 011. [6] B. H. Bloom. Space/time trade-offs in hash coding with allowable errors. Commun. ACM, [7] F. Bonomi, M. Mitzenmacher, R. Panigrahy, S. Singh, and G. Varghese. An improved construction for counting bloom filters. In ESA, 006. [8] V. Braverman, R. Ostrovsky, and C. Zaniolo. Optimal sampling from sliding windows. In ACM PODS, 009. [9] A. Bremler-Barr, Y. Harchol, and D. Hay. Openbox: A software-defined framework for developing, deploying, and managing network functions. In ACM SIGCOMM 016, pages [10] Y. Chabchoub, R. Chiky, and B. Dogan. How can sliding hyperloglog and ewma detect port scan attacks in ip traffic? EURASIP Journal on Information Security, 014. [11] Y. Chabchoub and G. Hebrail. Sliding hyperloglog: Estimating cardinality in a data stream over a sliding window. In IEEE ICDM Workshops, 010. [1] P. Chassaing and L. Gerin. Efficient estimation of the cardinality of large data sets. In DMTCS, 006. [13] S. Cohen and Y. Matias. Spectral bloom filters. In Proc. of the 003 ACM SIGMOD Int. Conf. on Management of Data, 003. [14] A. R. Curtis, J. C. Mogul, J. Tourrilhes, P. Yalagandula, P. Sharma, and S. Banerjee. DevoFlow: Scaling Flow Management for High-performance Networks. In ACM SIGCOMM, 011. [15] N. Duffield, C. Lund, and M. Thorup. Priority sampling for estimation of arbitrary subset sums. J. ACM, 546. [16] M. Durand and P. Flajolet. Loglog counting of large cardinalities. In ESA, 003. [17] G. Einziger and R. Friedman. TinyLFU: A highly efficient cache admission policy. In Euromicro PDP,

19 [18] G. Einziger and R. Friedman. Counting with TinyTable: Every Bit Counts! In ACM ICDCN, 016. [19] G. Einziger and R. Friedman. Tinyset - an access efficient self adjusting bloom filter construction. IEEE/ACM Transactions on Networking, 017. [0] C. Estan, G. Varghese, and M. Fisk. Bitmap algorithms for counting active flows on high speed links. In ACM IMC, 003. [1] P. Flajolet and G. N. Martin. Probabilistic counting algorithms for data base applications. J. Comput. Syst. Sci., [] P. Flajolet, ric Fusy, O. Gandouet, and et al. Hyperloglog: The analysis of a near-optimal cardinality estimation algorithm. In AOFA, 007. [3] E. Fusy and F. Giroire. Estimating the number of Active Flows in a Data Stream over a Sliding Window, pages [4] M. Gabel, D. Keren, and A. Schuster. Anarchists, unite: Practical entropy approximation for distributed streams. In KDD, 017. [5] P. Garcia-Teodoro, J. E. Diaz-Verdejo, G. Macia-Fernandez, and E. Vazquez. Anomaly-Based Network Intrusion Detection: Techniques, Systems and Challenges. Computers and Security, 009. [6] F. Giroire. Order statistics and estimating cardinalities of massive data sets. Discrete Applied Mathematics, 009. [7] S. Guha, A. McGregor, and S. Venkatasubramanian. Streaming and sublinear approximation of entropy and information distances. In ACM SODA, pages , 006. [8] S. Heule, M. Nunkesser, and A. Hall. Hyperloglog in practice: Algorithmic engineering of a state of the art cardinality estimation algorithm. In ACM EDBT, 013. [9] P. Hick. CAIDA Anonymized Internet Trace, equinix-sanjose :00-13:05 UTC, Direction B., 014. [30] P. Hick. CAIDA Anonymized Internet Trace, equinix-chicago :00-13:05 UTC, Direction A., 016. [31] N. Hua, H. C. Zhao, B. Lin, and J. Xu. Rank-indexed hashing: A compact construction of bloom filters and variants. In ICNP. IEEE, 008. [3] A. Kabbani, M. Alizadeh, M. Yasuda, R. Pan, and B. Prabhakar. AF-QCN: Approximate Fairness with Quantized Congestion Notification for Multi-tenanted Data Centers. In IEEE HOTI, 010. [33] J. Kubiatowicz, D. Bindel, Y. Chen, S. Czerwinski, P. Eaton, D. Geels, R. Gummadi, S. Rhea, H. Weatherspoon, W. Weimer, C. Wells, and B. Zhao. Oceanstore: An architecture for global-scale persistent storage. SIGPLAN Not., 3511, Nov [34] Y. Liu, W. Chen, and Y. Guan. Near-optimal approximate membership query over time-decaying windows. In IEEE INFOCOM, 013. [35] Y. Lu, A. Montanari, B. Prabhakar, S. Dharmapurikar, and A. Kabbani. Counter braids: a novel counter architecture for per-flow measurement. In ACM SIGMETRICS, 008. [36] M. Mitzenmacher and E. Upfal. Probability and Computing: Randomized Algorithms and Probabilistic Analysis. Cambridge University Press, New York, NY, USA, 005. [37] M. Naor and E. Yogev. Sliding Bloom Filter. In ISAAC,

[38] G. Nychis, V. Sekar, D. G. Andersen, H. Kim, and H. Zhang. An empirical evaluation of entropy-based traffic anomaly detection. In ACM IMC, 2008.

[39] P. Pandey, M. A. Bender, R. Johnson, and R. Patro. A general-purpose counting filter: Making every bit count. In ACM SIGMOD, 2017.

[40] O. Rottenstreich, Y. Kanizo, and I. Keslassy. The variable-increment counting bloom filter. In IEEE INFOCOM, 2012.

[41] M. Yoon. Aging bloom filter with two active buffers for dynamic sets. IEEE Trans. on Knowl. and Data Eng., 2010.

[42] L. Zhang and Y. Guan. Detecting click fraud in pay-per-click streams of online advertising networks. In IEEE ICDCS, 2008.


More information

Beyond Distinct Counting: LogLog Composable Sketches of frequency statistics

Beyond Distinct Counting: LogLog Composable Sketches of frequency statistics Beyond Distinct Counting: LogLog Composable Sketches of frequency statistics Edith Cohen Google Research Tel Aviv University TOCA-SV 5/12/2017 8 Un-aggregated data 3 Data element e D has key and value

More information

Lecture 10. Sublinear Time Algorithms (contd) CSC2420 Allan Borodin & Nisarg Shah 1

Lecture 10. Sublinear Time Algorithms (contd) CSC2420 Allan Borodin & Nisarg Shah 1 Lecture 10 Sublinear Time Algorithms (contd) CSC2420 Allan Borodin & Nisarg Shah 1 Recap Sublinear time algorithms Deterministic + exact: binary search Deterministic + inexact: estimating diameter in a

More information

2 How many distinct elements are in a stream?

2 How many distinct elements are in a stream? Dealing with Massive Data January 31, 2011 Lecture 2: Distinct Element Counting Lecturer: Sergei Vassilvitskii Scribe:Ido Rosen & Yoonji Shin 1 Introduction We begin by defining the stream formally. Definition

More information

Frequency Estimators

Frequency Estimators Frequency Estimators Outline for Today Randomized Data Structures Our next approach to improving performance. Count-Min Sketches A simple and powerful data structure for estimating frequencies. Count Sketches

More information

6 Filtering and Streaming

6 Filtering and Streaming Casus ubique valet; semper tibi pendeat hamus: Quo minime credas gurgite, piscis erit. [Luck affects everything. Let your hook always be cast. Where you least expect it, there will be a fish.] Publius

More information

Identifying and Estimating Persistent Items in Data Streams

Identifying and Estimating Persistent Items in Data Streams 1 Identifying and Estimating Persistent Items in Data Streams Haipeng Dai Muhammad Shahzad Alex X. Liu Meng Li Yuankun Zhong Guihai Chen Abstract This paper addresses the fundamental problem of finding

More information

Lecture 1 September 3, 2013

Lecture 1 September 3, 2013 CS 229r: Algorithms for Big Data Fall 2013 Prof. Jelani Nelson Lecture 1 September 3, 2013 Scribes: Andrew Wang and Andrew Liu 1 Course Logistics The problem sets can be found on the course website: http://people.seas.harvard.edu/~minilek/cs229r/index.html

More information

Network-Wide Routing-Oblivious Heavy Hitters

Network-Wide Routing-Oblivious Heavy Hitters Network-Wide Routing-Oblivious Heavy Hitters Ran Ben Basat Gil Einziger Shir Landau Feibish Technion sran@cs.technion.ac.il Nokia Bell Labs gil.einziger@nokia.com Princeton University sfeibish@cs.princeton.edu

More information

A Statistical Analysis of Probabilistic Counting Algorithms

A Statistical Analysis of Probabilistic Counting Algorithms Probabilistic counting algorithms 1 A Statistical Analysis of Probabilistic Counting Algorithms PETER CLIFFORD Department of Statistics, University of Oxford arxiv:0801.3552v3 [stat.co] 7 Nov 2010 IOANA

More information

B669 Sublinear Algorithms for Big Data

B669 Sublinear Algorithms for Big Data B669 Sublinear Algorithms for Big Data Qin Zhang 1-1 2-1 Part 1: Sublinear in Space The model and challenge The data stream model (Alon, Matias and Szegedy 1996) a n a 2 a 1 RAM CPU Why hard? Cannot store

More information

An Early Traffic Sampling Algorithm

An Early Traffic Sampling Algorithm An Early Traffic Sampling Algorithm Hou Ying ( ), Huang Hai, Chen Dan, Wang ShengNan, and Li Peng National Digital Switching System Engineering & Techological R&D center, ZhengZhou, 450002, China ndschy@139.com,

More information

7 Algorithms for Massive Data Problems

7 Algorithms for Massive Data Problems 7 Algorithms for Massive Data Problems Massive Data, Sampling This chapter deals with massive data problems where the input data (a graph, a matrix or some other object) is too large to be stored in random

More information

The space complexity of approximating the frequency moments

The space complexity of approximating the frequency moments The space complexity of approximating the frequency moments Felix Biermeier November 24, 2015 1 Overview Introduction Approximations of frequency moments lower bounds 2 Frequency moments Problem Estimate

More information

Hashing. Martin Babka. January 12, 2011

Hashing. Martin Babka. January 12, 2011 Hashing Martin Babka January 12, 2011 Hashing Hashing, Universal hashing, Perfect hashing Input data is uniformly distributed. A dynamic set is stored. Universal hashing Randomised algorithm uniform choice

More information

Space-efficient Tracking of Persistent Items in a Massive Data Stream

Space-efficient Tracking of Persistent Items in a Massive Data Stream Space-efficient Tracking of Persistent Items in a Massive Data Stream Bibudh Lahiri Dept. of ECE Iowa State University Ames, IA, USA 50011 bibudh@iastate.edu* Jaideep Chandrashekar Intel Labs Berkeley

More information

Lecture 6 September 13, 2016

Lecture 6 September 13, 2016 CS 395T: Sublinear Algorithms Fall 206 Prof. Eric Price Lecture 6 September 3, 206 Scribe: Shanshan Wu, Yitao Chen Overview Recap of last lecture. We talked about Johnson-Lindenstrauss (JL) lemma [JL84]

More information

Maximizing Data Locality in Distributed Systems

Maximizing Data Locality in Distributed Systems Maximizing Data Locality in Distributed Systems Fan Chung a, Ronald Graham a, Ranjita Bhagwan b,, Stefan Savage a, Geoffrey M. Voelker a a Department of Computer Science and Engineering University of California,

More information

Weighted Bloom Filter

Weighted Bloom Filter Weighted Bloom Filter Jehoshua Bruck Jie Gao Anxiao (Andrew) Jiang Department of Electrical Engineering, California Institute of Technology. bruck@paradise.caltech.edu. Department of Computer Science,

More information

Self-Tuning the Parameter of Adaptive Non-Linear Sampling Method for Flow Statistics

Self-Tuning the Parameter of Adaptive Non-Linear Sampling Method for Flow Statistics 2009 International Conference on Computational Science and Engineering Self-Tuning the Parameter of Adaptive Non-Linear Sampling Method for Flow Statistics Chengchen Hu 1,2, Bin Liu 1 1 Department of Computer

More information

Hash tables. Hash tables

Hash tables. Hash tables Dictionary Definition A dictionary is a data-structure that stores a set of elements where each element has a unique key, and supports the following operations: Search(S, k) Return the element whose key

More information

Space efficient tracking of persistent items in a massive data stream

Space efficient tracking of persistent items in a massive data stream Electrical and Computer Engineering Publications Electrical and Computer Engineering 2014 Space efficient tracking of persistent items in a massive data stream Impetus Technologies Srikanta Tirthapura

More information

11 Heavy Hitters Streaming Majority

11 Heavy Hitters Streaming Majority 11 Heavy Hitters A core mining problem is to find items that occur more than one would expect. These may be called outliers, anomalies, or other terms. Statistical models can be layered on top of or underneath

More information

Processing Top-k Queries from Samples

Processing Top-k Queries from Samples TEL-AVIV UNIVERSITY RAYMOND AND BEVERLY SACKLER FACULTY OF EXACT SCIENCES SCHOOL OF COMPUTER SCIENCE Processing Top-k Queries from Samples This thesis was submitted in partial fulfillment of the requirements

More information

Data Sketches for Disaggregated Subset Sum and Frequent Item Estimation

Data Sketches for Disaggregated Subset Sum and Frequent Item Estimation Data Sketches for Disaggregated Subset Sum and Frequent Item Estimation Daniel Ting Tableau Software Seattle, Washington dting@tableau.com ABSTRACT We introduce and study a new data sketch for processing

More information

Hash tables. Hash tables

Hash tables. Hash tables Dictionary Definition A dictionary is a data-structure that stores a set of elements where each element has a unique key, and supports the following operations: Search(S, k) Return the element whose key

More information

Optimal Random Sampling from Distributed Streams Revisited

Optimal Random Sampling from Distributed Streams Revisited Optimal Random Sampling from Distributed Streams Revisited Srikanta Tirthapura 1 and David P. Woodruff 2 1 Dept. of ECE, Iowa State University, Ames, IA, 50011, USA. snt@iastate.edu 2 IBM Almaden Research

More information

Multiple-Site Distributed Spatial Query Optimization using Spatial Semijoins

Multiple-Site Distributed Spatial Query Optimization using Spatial Semijoins 11 Multiple-Site Distributed Spatial Query Optimization using Spatial Semijoins Wendy OSBORN a, 1 and Saad ZAAMOUT a a Department of Mathematics and Computer Science, University of Lethbridge, Lethbridge,

More information

Improved Algorithms for Distributed Entropy Monitoring

Improved Algorithms for Distributed Entropy Monitoring Improved Algorithms for Distributed Entropy Monitoring Jiecao Chen Indiana University Bloomington jiecchen@umail.iu.edu Qin Zhang Indiana University Bloomington qzhangcs@indiana.edu Abstract Modern data

More information

Wavelet decomposition of data streams. by Dragana Veljkovic

Wavelet decomposition of data streams. by Dragana Veljkovic Wavelet decomposition of data streams by Dragana Veljkovic Motivation Continuous data streams arise naturally in: telecommunication and internet traffic retail and banking transactions web server log records

More information

4/26/2017. More algorithms for streams: Each element of data stream is a tuple Given a list of keys S Determine which tuples of stream are in S

4/26/2017. More algorithms for streams: Each element of data stream is a tuple Given a list of keys S Determine which tuples of stream are in S Note to other teachers and users of these slides: We would be delighted if you found this our material useful in giving your own lectures. Feel free to use these slides verbatim, or to modify them to fit

More information

CS 580: Algorithm Design and Analysis

CS 580: Algorithm Design and Analysis CS 580: Algorithm Design and Analysis Jeremiah Blocki Purdue University Spring 2018 Announcements: Homework 6 deadline extended to April 24 th at 11:59 PM Course Evaluation Survey: Live until 4/29/2018

More information

A Deterministic Algorithm for Summarizing Asynchronous Streams over a Sliding Window

A Deterministic Algorithm for Summarizing Asynchronous Streams over a Sliding Window A Deterministic Algorithm for Summarizing Asynchronous Streams over a Sliding Window Costas Busch 1 and Srikanta Tirthapura 2 1 Department of Computer Science Rensselaer Polytechnic Institute, Troy, NY

More information

A Near-Optimal Algorithm for Computing the Entropy of a Stream

A Near-Optimal Algorithm for Computing the Entropy of a Stream A Near-Optimal Algorithm for Computing the Entropy of a Stream Amit Chakrabarti ac@cs.dartmouth.edu Graham Cormode graham@research.att.com Andrew McGregor andrewm@seas.upenn.edu Abstract We describe a

More information

As mentioned, we will relax the conditions of our dictionary data structure. The relaxations we

As mentioned, we will relax the conditions of our dictionary data structure. The relaxations we CSE 203A: Advanced Algorithms Prof. Daniel Kane Lecture : Dictionary Data Structures and Load Balancing Lecture Date: 10/27 P Chitimireddi Recap This lecture continues the discussion of dictionary data

More information

Lecture 2 Sept. 8, 2015

Lecture 2 Sept. 8, 2015 CS 9r: Algorithms for Big Data Fall 5 Prof. Jelani Nelson Lecture Sept. 8, 5 Scribe: Jeffrey Ling Probability Recap Chebyshev: P ( X EX > λ) < V ar[x] λ Chernoff: For X,..., X n independent in [, ],

More information

Leveraging Big Data: Lecture 13

Leveraging Big Data: Lecture 13 Leveraging Big Data: Lecture 13 http://www.cohenwang.com/edith/bigdataclass2013 Instructors: Edith Cohen Amos Fiat Haim Kaplan Tova Milo What are Linear Sketches? Linear Transformations of the input vector

More information

Ping Li Compressed Counting, Data Streams March 2009 DIMACS Compressed Counting. Ping Li

Ping Li Compressed Counting, Data Streams March 2009 DIMACS Compressed Counting. Ping Li Ping Li Compressed Counting, Data Streams March 2009 DIMACS 2009 1 Compressed Counting Ping Li Department of Statistical Science Faculty of Computing and Information Science Cornell University Ithaca,

More information

Approximate Fairness with Quantized Congestion Notification for Multi-tenanted Data Centers

Approximate Fairness with Quantized Congestion Notification for Multi-tenanted Data Centers Approximate Fairness with Quantized Congestion Notification for Multi-tenanted Data Centers Abdul Kabbani Stanford University Joint work with: Mohammad Alizadeh, Masato Yasuda, Rong Pan, and Balaji Prabhakar

More information

ECEN 689 Special Topics in Data Science for Communications Networks

ECEN 689 Special Topics in Data Science for Communications Networks ECEN 689 Special Topics in Data Science for Communications Networks Nick Duffield Department of Electrical & Computer Engineering Texas A&M University Lecture 3 Estimation and Bounds Estimation from Packet

More information

What s New: Finding Significant Differences in Network Data Streams

What s New: Finding Significant Differences in Network Data Streams What s New: Finding Significant Differences in Network Data Streams Graham Cormode S. Muthukrishnan Abstract Monitoring and analyzing network traffic usage patterns is vital for managing IP Networks. An

More information

Small active counters

Small active counters Small active counters Rade Stanojević Hamilton Institute, NUIM, Irel rade.stanojevic@nuim.ie Abstract The need for efficient counter architecture has arisen for the following two reasons. Firstly, a number

More information

New cardinality estimation algorithms for HyperLogLog sketches

New cardinality estimation algorithms for HyperLogLog sketches New cardinality estimation algorithms for HyperLogLog sketches Otmar Ertl otmar.ertl@gmail.com April 3, 2017 This paper presents new methods to estimate the cardinalities of data sets recorded by HyperLogLog

More information

Competitive Management of Non-Preemptive Queues with Multiple Values

Competitive Management of Non-Preemptive Queues with Multiple Values Competitive Management of Non-Preemptive Queues with Multiple Values Nir Andelman and Yishay Mansour School of Computer Science, Tel-Aviv University, Tel-Aviv, Israel Abstract. We consider the online problem

More information

Randomized Algorithms

Randomized Algorithms Randomized Algorithms Prof. Tapio Elomaa tapio.elomaa@tut.fi Course Basics A new 4 credit unit course Part of Theoretical Computer Science courses at the Department of Mathematics There will be 4 hours

More information

Lecture 4: Two-point Sampling, Coupon Collector s problem

Lecture 4: Two-point Sampling, Coupon Collector s problem Randomized Algorithms Lecture 4: Two-point Sampling, Coupon Collector s problem Sotiris Nikoletseas Associate Professor CEID - ETY Course 2013-2014 Sotiris Nikoletseas, Associate Professor Randomized Algorithms

More information

1 Approximate Quantiles and Summaries

1 Approximate Quantiles and Summaries CS 598CSC: Algorithms for Big Data Lecture date: Sept 25, 2014 Instructor: Chandra Chekuri Scribe: Chandra Chekuri Suppose we have a stream a 1, a 2,..., a n of objects from an ordered universe. For simplicity

More information

Using the Power of Two Choices to Improve Bloom Filters

Using the Power of Two Choices to Improve Bloom Filters Internet Mathematics Vol. 4, No. 1: 17-33 Using the Power of Two Choices to Improve Bloom Filters Steve Lumetta and Michael Mitzenmacher Abstract. We consider the combination of two ideas from the hashing

More information

Maintaining Significant Stream Statistics over Sliding Windows

Maintaining Significant Stream Statistics over Sliding Windows Maintaining Significant Stream Statistics over Sliding Windows L.K. Lee H.F. Ting Abstract In this paper, we introduce the Significant One Counting problem. Let ε and θ be respectively some user-specified

More information

Application: Bucket Sort

Application: Bucket Sort 5.2.2. Application: Bucket Sort Bucket sort breaks the log) lower bound for standard comparison-based sorting, under certain assumptions on the input We want to sort a set of =2 integers chosen I+U@R from

More information

The first bound is the strongest, the other two bounds are often easier to state and compute. Proof: Applying Markov's inequality, for any >0 we have

The first bound is the strongest, the other two bounds are often easier to state and compute. Proof: Applying Markov's inequality, for any >0 we have The first bound is the strongest, the other two bounds are often easier to state and compute Proof: Applying Markov's inequality, for any >0 we have Pr (1 + ) = Pr For any >0, we can set = ln 1+ (4.4.1):

More information

Bloom Filters and Locality-Sensitive Hashing

Bloom Filters and Locality-Sensitive Hashing Randomized Algorithms, Summer 2016 Bloom Filters and Locality-Sensitive Hashing Instructor: Thomas Kesselheim and Kurt Mehlhorn 1 Notation Lecture 4 (6 pages) When e talk about the probability of an event,

More information

Sparse Fourier Transform (lecture 1)

Sparse Fourier Transform (lecture 1) 1 / 73 Sparse Fourier Transform (lecture 1) Michael Kapralov 1 1 IBM Watson EPFL St. Petersburg CS Club November 2015 2 / 73 Given x C n, compute the Discrete Fourier Transform (DFT) of x: x i = 1 x n

More information