Notes on Frequency Estimation in Data Streams

In (one of) the data streaming model(s), the data is a sequence of arrivals a_1, a_2, ..., a_m of the form a_j = (i, v), where i is the identity of the item and belongs to the domain {1, ..., n}, and v is the change in the frequency of the item: if v ≥ 1 then the meaning is v additions of item i, and if v ≤ -1 then the meaning is |v| deletions of item i. The goal is to compute some function while using space that is sublinear in the length of the stream. This is relevant both when the data is literally obtained as a long stream of signals, where the stream is too long to keep in memory, and when the data resides on some external device and reading it in one pass is much more efficient than allowing random access.

A natural special case is that v = +1 for every element. In this case the stream is simply a sequence of items (with repetitions) a_j = i for i ∈ {1, ..., n}.

One of the first problems studied in this model (with the special case of single additions) is computing frequency moments. Namely, let m_i = |{j : a_j = i}| denote the number of occurrences of i in the stream. Then for each k ≥ 0 we define

    F_k = \sum_{i=1}^{n} (m_i)^k.    (1)

In particular, F_1 equals m, the length of the sequence; F_0 is the number of distinct elements appearing in the sequence (since if m_i > 0 then m_i^0 = 1 and if m_i = 0 then m_i^0 = 0); and F_2 is the repeat rate, or Gini's index of homogeneity, needed in order to compute the surprise index of the sequence. Finally, for k = ∞ we define

    F_∞ = \max_{1 ≤ i ≤ n} m_i.    (2)

Given an approximation parameter ε and a confidence parameter δ, the algorithm should compute an estimate \hat{F}_k such that the probability that |\hat{F}_k - F_k| > ε F_k is at most δ.
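
Before turning to sublinear-space algorithms, it may help to fix the target quantities with a tiny, non-streaming reference computation (it uses linear space, one counter per item); the function name and the toy stream below are purely illustrative.

    from collections import Counter

    def exact_frequency_moment(stream, k):
        # Exact F_k = sum over i of (m_i)^k, using full per-item counts.
        counts = Counter(stream)
        return sum(m ** k for m in counts.values())

    # Toy stream: m_1 = 3, m_2 = 2, m_3 = 1.
    stream = [1, 2, 1, 3, 1, 2]
    print(exact_frequency_moment(stream, 0))   # F_0 = 3 (distinct elements)
    print(exact_frequency_moment(stream, 1))   # F_1 = 6 (length of the stream)
    print(exact_frequency_moment(stream, 2))   # F_2 = 3^2 + 2^2 + 1^2 = 14
    print(max(Counter(stream).values()))       # F_infinity = 3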

What is known?

1. There is a lower bound of n^{1-2/k} (for constant ε and δ), which in particular means that for k ≥ 3 the lower bound is of the form n^α for a constant α (that approaches 1 as k increases).

2. There is a (recent) upper bound whose dependence on n is Õ(n^{1-2/k}), so that it roughly matches the lower bound (the exact expression is O((k^2 log(1/δ) / ε^{2+4/k}) · n^{1-2/k} · log^2 m · (log m + log n))).

3. For the special case of k = 1, clearly the exact value of F_1 = m can be computed using space log m. To get an estimate, O(log log m + log(1/ε)) bits suffice.

4. For the special case of k = 0 it is possible to compute an estimate that is within a factor 1/c and a factor c of F_0 with probability at least 1 - 2/c, where c > 2, using O(log n) bits.

5. For the special case of k = 2 it suffices to use O((log(1/δ)/ε^2) · (log n + log m)) bits.

6. Estimating F_∞ requires space Ω(n) for m = O(n) and constant ε and δ.

7. Randomness is crucial: for k ≠ 1, every deterministic algorithm that computes an estimate of F_k with constant ε must use Ω(n) space.

We shall discuss the original result of Alon et al., whose dependence on n is Õ(n^{1-1/k}) (to be precise: O((k log(1/δ)/ε^2) · n^{1-1/k} · (log n + log m))). If time permits we will talk about some of the special cases.

Assume first that the length of the sequence, m, is known in advance. This assumption is removed later. Let s_1 = (8/ε^2) · k · n^{1-1/k} and s_2 = 2 log(1/δ). The algorithm computes s_2 random variables Y_1, ..., Y_{s_2} and outputs their median (this is a standard technique for going from a constant probability of deviating by more than the allowed amount to only a δ probability that this event occurs, so the interesting part is in defining and analyzing the behavior of the Y_t's). Each Y_t is the average of s_1 random variables X_{t,j}, where 1 ≤ j ≤ s_1. The X_{t,j}'s are independent, identically distributed random variables.

In order to explain how each X_{t,j} = X is distributed, we introduce some notation. For each p ∈ {1, ..., m}, let

    r(p) = |{q : q ≥ p, a_q = a_p}|    (3)

denote the number of occurrences of a_p among the elements in the sequence that follow a_p, including a_p itself (so that r(p) ≥ 1). Next define

    R_k(p) = m · ((r(p))^k - (r(p) - 1)^k).    (4)

Each variable X_{t,j} = X is determined (independently) by selecting an index p ∈ {1, ..., m} uniformly at random and letting X = R_k(p). Note that in order to compute r(p), and hence X = R_k(p), it suffices to use log m bits to select p and count up to p, and then to maintain the log n bits representing a_p, the log m bits representing r(p), and the log m bits representing R_k(p).

By the definition of X (recall that m_i = |{j : a_j = i}|),

    Exp[X] = (1/m) \sum_{p=1}^{m} R_k(p)    (5)
           = (1/m) \sum_{p=1}^{m} m · ((r(p))^k - (r(p) - 1)^k)    (6)
           = \sum_{i=1}^{n} [((m_i)^k - (m_i - 1)^k) + ((m_i - 1)^k - (m_i - 2)^k) + ... + (2^k - 1^k) + (1^k - 0^k)]    (7)
           = \sum_{i=1}^{n} (m_i)^k = F_k.    (8)

Thus we have an unbiased estimator of F_k. What remains to be done is to bound the deviation of the average of the X_{t,j}'s from this correct expected value. (The X_{t,j}'s are independent, so we could apply Chernoff. However, their range is very big, so we would not get a very good bound.) To this end we bound the variance Var[X] = Exp[X^2] - Exp^2[X] and apply Chebyshev:

    Pr[|X - Exp[X]| ≥ t · Var^{1/2}[X]] ≤ 1/t^2,

so that

    Pr[|X - Exp[X]| ≥ T] ≤ Var[X] / T^2.

In order to bound Exp[X^2] we shall use the following inequality, which holds for any pair of numbers a > b > 0:

    a^k - b^k = (a - b)(a^{k-1} + a^{k-2} b + ... + a b^{k-2} + b^{k-1})    (9)
              ≤ (a - b) · k · a^{k-1}.    (10)

(You may be familiar with the special case a^2 - b^2 = (a - b)(a + b).) We use this inequality with a = b + 1, so that a^k - (a - 1)^k ≤ k · a^{k-1}, and get:

    Exp[X^2] = (1/m) \sum_{p=1}^{m} (R_k(p))^2 = m \sum_{i=1}^{n} \sum_{r=1}^{m_i} (r^k - (r-1)^k)^2    (11)
             ≤ m \sum_{i=1}^{n} \sum_{r=1}^{m_i} k · r^{k-1} · (r^k - (r-1)^k)    (12)
             = k · m \sum_{i=1}^{n} [((m_i)^{2k-1} - (m_i)^{k-1}(m_i - 1)^k) + ((m_i - 1)^{2k-1} - (m_i - 1)^{k-1}(m_i - 2)^k) + ... + (2^{2k-1} - 2^{k-1} · 1^k) + 1^{2k-1}]    (13)
             ≤ k · m \sum_{i=1}^{n} (m_i)^{2k-1}    (14)
             = k · m · F_{2k-1} = k · F_1 · F_{2k-1},    (15)

where (14) uses r^{k-1}(r-1)^k ≥ (r-1)^{2k-1}, so that the sum in (13) telescopes to at most (m_i)^{2k-1}. It can be shown (and is given as an exercise) that

    F_1 · F_{2k-1} ≤ n^{1-1/k} (F_k)^2,    (16)

where one uses the inequality ((1/n) \sum_{i=1}^{n} m_i)^k ≤ (1/n) \sum_{i=1}^{n} m_i^k. Therefore,

    Var[X] ≤ Exp[X^2] ≤ k · F_1 · F_{2k-1} ≤ k · n^{1-1/k} · F_k^2,    (17)

and so

    Var[Y_t] = Var[(1/s_1) \sum_{j=1}^{s_1} X_{t,j}] = (1/s_1) Var[X] ≤ k · n^{1-1/k} · F_k^2 / s_1,    (18)

whereas

    Exp[Y_t] = Exp[(1/s_1) \sum_{j=1}^{s_1} X_{t,j}] = Exp[X] = F_k.    (19)

By Chebyshev's inequality,

    Pr[|Y_t - F_k| > ε F_k] ≤ Var[Y_t] / (ε^2 F_k^2) ≤ k · n^{1-1/k} · F_k^2 / (s_1 · ε^2 · F_k^2).    (20)

By our choice of s_1 = (8/ε^2) · k · n^{1-1/k}, this is at most 1/8. As mentioned before, a standard analysis transforms the constant probability of a small deviation for each Y_t into a high probability of a small deviation for their median (given as an exercise).

Dealing with an unknown m. In this case we start computing the random variable X under the assumption that m = 1, so that necessarily a_p = a_1 (and we get that r(p) = 1 and X = 1 · (1^k - 0^k) = 1). If indeed m = 1 the process ends (note that if m = 1 then F_k = 1 for every k). Otherwise, the value of m is updated to 2, and p = 1 is replaced by p = 2 with probability 1/2. In either case, r(p) is modified accordingly. In general, after viewing the first t - 1 items, there is a current choice of p_{t-1} and a corresponding value of r(p_{t-1}). If a new item arrives, the belief for m is changed to t, and p_t is set to t with probability 1/t and remains p_{t-1} with probability 1 - 1/t. In the former case we have that r(p_t) = 1, and in the latter case r(p_t) is r(p_{t-1}) + 1 if a_t = a_{p_t}, and is r(p_{t-1}) otherwise. As in the case that m is known, the algorithm only needs to remember a_{p_t} and r(p_t) at each step, at a cost of O(log n + log m) bits, and flipping a coin with bias 1/t takes O(log m) bits as well.

On the relation between m and n. If m = poly(n) then the factor of (log n + log m) is simply O(log n). When m is very large, then instead of computing r(p) exactly, we can estimate it using log log m + log(1/ε) bits.
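
The following is a minimal, illustrative sketch (in Python) of the above estimator for the single-additions case, combining the median-of-averages scheme with the unknown-m (reservoir-style) update of p and r(p). For readability it keeps all s_1 · s_2 basic estimators in ordinary lists rather than packing each one into O(log n + log m) bits; the function name and parameter choices are ours, not from the notes.

    import random
    from statistics import median

    def ams_fk_estimate(stream, k, s1, s2, seed=None):
        # Each of the s1 * s2 basic estimators keeps a uniformly random stream
        # position via reservoir sampling: on the t-th item it adopts that item
        # with probability 1/t (resetting r to 1), and otherwise increments r
        # whenever the current item equals its sampled item a_p.
        rng = random.Random(seed)
        num = s1 * s2
        sampled = [None] * num   # a_p for each basic estimator
        r = [0] * num            # r(p) for each basic estimator

        m = 0
        for item in stream:      # a single pass over the data
            m += 1
            for e in range(num):
                if rng.random() < 1.0 / m:
                    sampled[e], r[e] = item, 1
                elif item == sampled[e]:
                    r[e] += 1

        # X = m * (r^k - (r-1)^k); Y_t = average of s1 X's; output the median.
        xs = [m * (r[e] ** k - (r[e] - 1) ** k) for e in range(num)]
        ys = [sum(xs[t * s1:(t + 1) * s1]) / s1 for t in range(s2)]
        return median(ys)

    # Illustrative use; in the notes s1 = (8/eps^2) * k * n^(1-1/k) and
    # s2 = 2 log(1/delta).
    stream = [1, 2, 1, 3, 1, 2, 2, 4, 1, 1] * 50
    print(ams_fk_estimate(stream, k=2, s1=200, s2=5, seed=0))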

Improved Estimation of F_2

If we plug k = 2 into the aforementioned expression, we get a dependence on n that grows like Õ(√n). We next show how to get an estimate using only O((log(1/δ)/ε^2) · (log n + log m)) memory bits.

We set s_2 = 2 log(1/δ) as before, and s_1 = 16/ε^2. Here too the output is the median of s_2 random variables Y_1, ..., Y_{s_2}, where each Y_t is the average of X_{t,j} for j = 1, ..., s_1. Each X_{t,j} = X is computed as follows. A central idea is to use a set V = {v_1, ..., v_h} of vectors of length n with +1, -1 entries that are four-wise independent. That is, for every four distinct coordinates 1 ≤ i_1 < i_2 < i_3 < i_4 ≤ n and for every choice of γ_1, ..., γ_4 ∈ {-1, +1}, exactly a (1/16)-fraction of the vectors in V have γ_j in their i_j-th coordinate for every j = 1, ..., 4. (Note that 4-wise independence implies that for each coordinate, half of the vectors in V have +1 in that coordinate and half have -1, and it implies s-wise independence for s = 2 and s = 3.) Such sets, of size only h = O(n^2), not only exist, but it is possible to compute each particular coordinate of any v_p of our choice using O(log n) space.

To compute X, we first select 1 ≤ p ≤ h uniformly at random (this requires O(log n) bits of space). This determines v_p = (β_1, ..., β_n) (where we compute the coordinates of v_p only when we need them). Let Z = \sum_{i=1}^{n} β_i m_i. Computing Z can be done in one pass using O(log n + log m) space: initially Z = 0; for each a_j, j = 1, ..., m, if β_{a_j} = +1 then Z is incremented by 1, and if β_{a_j} = -1 then it is decremented by 1. Computing each β_{a_j} takes O(log n) space, and maintaining Z takes O(log m) space. When the sequence terminates, we set X = Z^2.

As in the proof for general k, we next compute Exp[X] and Var[X]. Before doing so, we make a few observations that follow from the fact that each β_i ∈ {-1, +1} and from the 4-wise independence:

1. For every i, β_i^2 = β_i^4 = 1, while β_i^3 = β_i.

2. For every i ≠ j, Exp[β_i β_j] = (1/4)(+1)(+1) + (1/4)(+1)(-1) + (1/4)(-1)(+1) + (1/4)(-1)(-1) = 0.

3. Similarly, for every three distinct indices i, j, k, Exp[β_i β_j β_k] = 0, and for every four distinct indices i, j, k, l, Exp[β_i β_j β_k β_l] = 0.

Using the first two properties:

    Exp[X] = Exp[Z^2] = Exp[(\sum_{i=1}^{n} β_i m_i)^2]    (21)
           = Exp[\sum_{i,j} β_i β_j m_i m_j]    (22)
           = \sum_{i} (m_i)^2 Exp[β_i^2] + \sum_{i ≠ j} m_i m_j Exp[β_i β_j]    (23)
           = \sum_{i} (m_i)^2 = F_2.    (24)

Similarly (though a bit more tediously), with all sums below over distinct indices:

    Exp[X^2] = Exp[(\sum_{i=1}^{n} β_i m_i)^4]    (25)
             = \sum_{i} (m_i)^4 Exp[β_i^4] + 4 \sum_{i ≠ j} (m_i)^3 m_j Exp[β_i^3 β_j] + 12 \sum_{j < k, i ≠ j,k} (m_i)^2 m_j m_k Exp[β_i^2 β_j β_k]
               + 6 \sum_{i < j} (m_i)^2 (m_j)^2 Exp[β_i^2 β_j^2] + 24 \sum_{i < j < k < l} m_i m_j m_k m_l Exp[β_i β_j β_k β_l]    (26)
             = \sum_{i} (m_i)^4 + 6 \sum_{i < j} (m_i)^2 (m_j)^2.    (27)

It follows that

    Var[X] = Exp[X^2] - (Exp[X])^2    (28)
           = \sum_{i} (m_i)^4 + 6 \sum_{i < j} (m_i)^2 (m_j)^2 - (\sum_{i} (m_i)^2)^2    (29)
           = \sum_{i} (m_i)^4 + 6 \sum_{i < j} (m_i)^2 (m_j)^2 - \sum_{i} (m_i)^4 - 2 \sum_{i < j} (m_i)^2 (m_j)^2    (30)
           = 4 \sum_{i < j} (m_i)^2 (m_j)^2 ≤ 2 F_2^2.    (31)

By Chebyshev, for each 1 ≤ t ≤ s_2,

    Pr[|Y_t - F_2| > ε F_2] ≤ Var[Y_t] / (ε^2 F_2^2) ≤ 2 F_2^2 / (s_1 · ε^2 · F_2^2) = 1/8,    (32)

and we complete the argument as before.
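
As a minimal illustration of this F_2 estimator (again our own sketch, not code from the notes): instead of the explicit O(n^2)-size family V, the coordinate β_i is generated on the fly from a random degree-3 polynomial over a prime field, which gives ±1 signs that are four-wise independent up to an O(1/P) bias; the exact GF(2^d)-based construction is sketched in the last section.

    import random
    from statistics import median

    P = 2_147_483_647  # the Mersenne prime 2^31 - 1

    def fourwise_sign(coeffs, i):
        # A random degree-3 polynomial over Z_P evaluated at i, mapped to +/-1
        # via its low bit (four-wise independent up to an O(1/P) bias).
        c0, c1, c2, c3 = coeffs
        h = (c0 + c1 * i + c2 * i * i + c3 * i * i * i) % P
        return 1 if h & 1 else -1

    def ams_f2_estimate(stream, s1, s2, seed=None):
        # Each basic estimator maintains Z = sum_i beta_i m_i in one pass and
        # reports X = Z^2; the output is the median of s2 averages of s1 X's.
        rng = random.Random(seed)
        num = s1 * s2
        coeffs = [tuple(rng.randrange(P) for _ in range(4)) for _ in range(num)]
        z = [0] * num
        for item in stream:
            for e in range(num):
                z[e] += fourwise_sign(coeffs[e], item)
        xs = [zz * zz for zz in z]
        ys = [sum(xs[t * s1:(t + 1) * s1]) / s1 for t in range(s2)]
        return median(ys)

    # Illustrative use; in the notes s1 = 16/eps^2 and s2 = 2 log(1/delta).
    stream = [1, 2, 1, 3, 1, 2, 2, 4] * 100
    print(ams_f2_estimate(stream, s1=64, s2=5, seed=0))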

Estimating F_0 to within a constant factor

Here we will only give the idea, without the full analysis. Let F = GF(2^d), where d = log n. We view each a_j in the sequence as an element of the field F. To compute an estimate for F_0 (the number of distinct elements in the sequence), the algorithm selects α, β uniformly at random in F. For each a_j, the algorithm computes z_j = z(a_j) = α · a_j + β, and considers the representation of z_j as a d-bit vector z_{j,1}, ..., z_{j,d}. It then sets r_j = r(z_j) to be the largest index such that all r(z_j) rightmost bits of z_j are 0. It maintains R as the maximum over all r_j, and when the sequence terminates it outputs Y = 2^R.

The underlying idea is that for each fixed l ∈ F, z(l) is uniformly distributed in F (over the choice of α and β). That is, for every l, l' ∈ F, Pr_{α,β}[z(l) = l'] = 1/|F|, and so, for any r, the probability that the r rightmost bits of z(l) are 0 is 2^{-r}. Now, if F_0 < 2^r / c, then by Markov's inequality, the probability that any one of the F_0 different values l appearing in the stream gives z(l) with r(z(l)) ≥ r is less than 1/c.

(A few more details: let B denote the subset of elements that appear in the stream, so that we want to estimate |B|. Let F_r denote the set of all elements of F whose r rightmost bits are 0, so that |F_r| = 2^{d-r}. For each element l ∈ F, let X_l be a 0/1 random variable that is 1 if and only if z(l) ∈ F_r. Now, Pr[X_l = 1] = 2^{-r}, so that Exp[\sum_{l ∈ B} X_l] = |B| · 2^{-r}. If |B| < 2^r / c then this expectation is less than 1/c, and so the probability that the sum is at least 1 (i.e., at least c times its expectation) is less than 1/c.)

For the other direction (showing that if F_0 > c · 2^r then the probability that none of the F_0 different l's gives r(z(l)) ≥ r is less than 1/c), one applies Chebyshev, using the fact that for any pair l ≠ l', the probability that both r(z(l)) ≥ r and r(z(l')) ≥ r is 2^{-2r}.
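
A small illustrative sketch of this idea follows (ours, with one substitution that should be flagged: the pairwise-independent map is taken over a prime field Z_P rather than GF(2^d), so "the r rightmost bits are 0" happens only with probability approximately 2^{-r}).

    import random

    def f0_estimate(stream, seed=None):
        # Track the maximum number R of trailing zero bits of z(a) = alpha*a + beta
        # over the stream, and output 2^R as a constant-factor estimate of F_0.
        P = 2_147_483_647                   # prime modulus (assumed > n)
        rng = random.Random(seed)
        alpha = rng.randrange(1, P)
        beta = rng.randrange(P)

        R = 0
        for a in stream:                    # one pass, O(log n) bits of state
            z = (alpha * a + beta) % P
            r = 0
            while z % 2 == 0 and r < 31:    # count trailing zero bits (capped)
                z //= 2
                r += 1
            R = max(R, r)
        return 2 ** R

    # Illustrative use: 1000 distinct items, each repeated 20 times.
    stream = list(range(1, 1001)) * 20
    print(f0_estimate(stream, seed=0))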

Constructing k-wise Independent Sample Spaces

In the estimation of F_2 we built on the existence of a set of n-dimensional ±1 vectors, of size O(n^2), that are 4-wise independent. Here we shall show a general (but slightly weaker) construction of k-wise independent sample spaces of size O(n^k). Here too let F = GF(2^d), where d = log n. We shall actually construct a set of n^k vectors over F^n (if we want to get binary vectors we can take the least significant bit of each coordinate). Let w_1, ..., w_n denote the elements of the field F. For each choice of k elements c_0, ..., c_{k-1} ∈ F we define the vector v^{c_0,...,c_{k-1}} by

    v_i^{c_0,...,c_{k-1}} = \sum_{j=0}^{k-1} c_j w_i^j.

In other words, if we define the (univariate) polynomial p_{c_0,...,c_{k-1}}(x) = \sum_{j=0}^{k-1} c_j x^j, then v_i^{c_0,...,c_{k-1}} = p_{c_0,...,c_{k-1}}(w_i). By construction there are n^k vectors, and each coordinate of any given vector can be computed using O(log n) bits. To see why we get a k-wise independent sample space, consider the n × k Vandermonde matrix M, where M_{i,j} = w_i^j. Then each vector v^{c_0,...,c_{k-1}} is the result of multiplying the matrix M by the vector (c_0, ..., c_{k-1}). Any choice of k rows of M is linearly independent, implying the desired k-wise independence.
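
A minimal sketch of this construction, assuming for concreteness the field GF(2^8) (i.e., n ≤ 256) with the irreducible polynomial x^8 + x^4 + x^3 + x + 1. A single random vector corresponds to one choice of coefficients c_0, ..., c_{k-1}, and any single coordinate can be computed on its own, which is what the streaming algorithms above rely on.

    import random

    IRRED = 0x11B   # x^8 + x^4 + x^3 + x + 1, irreducible over GF(2)

    def gf_mul(a, b):
        # Carry-less multiplication in GF(2^8), reduced modulo IRRED.
        result = 0
        while b:
            if b & 1:
                result ^= a
            b >>= 1
            a <<= 1
            if a & 0x100:
                a ^= IRRED
        return result

    def kwise_coordinate(coeffs, w):
        # Coordinate at field element w: p(w) = c_0 + c_1*w + ... + c_{k-1}*w^{k-1}.
        value, power = 0, 1
        for c in coeffs:
            value ^= gf_mul(c, power)
            power = gf_mul(power, w)
        return value

    # One random vector from a 4-wise independent sample space on n = 16
    # coordinates (the field elements w_i are taken to be 0, 1, ..., n-1).
    rng = random.Random(0)
    coeffs = [rng.randrange(256) for _ in range(4)]      # c_0, ..., c_3
    vector = [kwise_coordinate(coeffs, w) for w in range(16)]
    bits = [x & 1 for x in vector]                       # k-wise independent bits
    print(vector)
    print(bits)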