Lecture 4 February 16, 2016

MIT 6.854/18.415: Advanced Algorithms, Spring 2016
Prof. Ankur Moitra
Lecture 4, February 16, 2016
Scribes: Ben Eysenbach, Devin Neal

1 Last Time

Consistent Hashing - hash functions that evolve well
Random Trees - routing schemes that deal with inconsistent views

Today: Distinct Elements and Count-Min Sketch. What can we do if we can't store the data, only stream it?

2 Distinct Elements

Problem: Count the number of distinct elements in a sequence X_1, X_2, ..., X_n. For example, how many unique words did Shakespeare use?

Naively this problem takes O(N) space, where N is the number of distinct elements in the sequence. For Shakespeare's total vocabulary, N is roughly 35,000. However, it turns out that you can do much better than the naive method if you are willing to accept some level of approximation. There's a famous quote from a 2003 paper by Marianne Durand and Philippe Flajolet: "Using only memory equivalent to 5 lines of printed text, you can estimate with a typical accuracy of 5% and in a single pass the total vocabulary of Shakespeare."

2.1 Using a Single Hash Function

Idea: Choose a random hash function h : U -> [0, 1] and pass once through the data, hashing each item and storing only the minimum of h(X_1), h(X_2), ..., h(X_i), .... Let Y = min_i h(X_i) be that minimum, and let N be the true number of distinct elements.
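As a concrete illustration (not from the original notes), here is a minimal Python sketch of this single-hash estimator; the class name and the use of a salted SHA-256 to simulate a random hash h : U -> [0, 1] are my own choices:

    import hashlib
    import random

    class SingleHashDistinctElements:
        """Stream the data once, keeping only Y = min_i h(X_i)."""

        def __init__(self, seed=None):
            # The salt plays the role of picking a random hash function h.
            self.salt = str(seed if seed is not None else random.random())
            self.min_hash = 1.0  # Y, the minimum hash value seen so far

        def _h(self, item):
            # Map an item to a pseudo-random real in [0, 1).
            digest = hashlib.sha256((self.salt + repr(item)).encode()).hexdigest()
            return int(digest, 16) / 16 ** len(digest)

        def update(self, item):
            self.min_hash = min(self.min_hash, self._h(item))

        def estimate(self):
            # E[Y] = 1/(N+1) (Lemma 1 below), so return 1/Y - 1.
            return 1.0 / self.min_hash - 1.0

    est = SingleHashDistinctElements(seed=42)
    for x in list(range(1000)) * 3:   # 1000 distinct items, each seen 3 times
        est.update(x)
    print(est.estimate())   # correct in expectation, but very noisy with one hash function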

Lemma 1. E[Y] = 1/(N+1).

Proof.

    E[Y] = ∫_0^1 P[Y ≥ z] dz = ∫_0^1 (1 - z)^N dz = [ -(1 - z)^(N+1) / (N+1) ]_0^1 = 1/(N+1)

With some thought, you can confirm that E[Y] is the same as the probability of choosing N+1 numbers in the interval [0, 1] and having the last number be the minimum. By symmetry, this probability is 1/(N+1).

So, if we estimate N by 1/Y - 1, we'll at least get the right answer in expectation. To show that we also get the right answer with good probability, we're going to use the Chebyshev tail bound to bound the probability that our estimate for Y is close to its expectation. To do this, we must first compute the variance of Y.

Lemma 2. Var[Y] ≤ (1/(N+1))^2.

Proof.

    Var[Y] = E[Y^2] - E[Y]^2
    E[Y^2] = ∫_0^1 z^2 · N(1 - z)^(N-1) dz = 2/((N+1)(N+2))
    Var[Y] = 2/((N+1)(N+2)) - 1/(N+1)^2 ≤ 2/(N+1)^2 - 1/(N+1)^2 = (1/(N+1))^2

Unfortunately, we cannot apply Chebyshev directly because the variance of Y is too large. In particular, Chebyshev only gives error resolution down to the size of Y's standard deviation (i.e. sqrt(Var[Y]), which is about 1/(N+1)). Zero is one standard deviation below E[Y], so Chebyshev would only tell us that P[Y = 0] is at most some constant. This is bad because we cannot solve N = 1/Y - 1 when Y = 0. We want P[Y = 0] to be very small.

2.2 k Hash Functions

Fortunately, we can apply the standard technique of reducing the variance of our estimate via repetition.

Idea [Flajolet-Martin]: Use k hash functions h_1, ..., h_k : U -> [0, 1]. Now, evaluate each hash function on each item in the sequence, storing the minimum for each hash function separately. Let Ȳ = (1/k) Σ_{i=1}^k Y_i be the average minimum.

The variance of a sum of independent random variables is the sum of their variances. Thus,

    Var[Ȳ] = (1/k^2) Σ_{i=1}^k Var[Y_i] ≤ 1/(k(N+1)^2),

where we've used Lemma 2 for the inequality. Applying Chebyshev:

    P[ |Ȳ - 1/(N+1)| ≥ ɛ/(N+1) ] ≤ Var[Ȳ] / (ɛ/(N+1))^2 ≤ 1/(kɛ^2)

Thus, our estimate for the number of distinct elements, 1/Ȳ - 1, satisfies (N+1)/(1+ɛ) ≤ 1/Ȳ ≤ (N+1)/(1-ɛ) with probability at least 1 - 1/(kɛ^2). For small ɛ, this guarantee is equivalent to:

    (1 - O(ɛ))N ≤ 1/Ȳ - 1 ≤ (1 + O(ɛ))N

We need to set k = O(1/ɛ^2) to get an ɛ-accurate estimate with probability 9/10, for example.

In practice, we don't need our hash functions to map to arbitrary real numbers. It is sufficient to use length-O(log n) binary strings. For more details, see [1].
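To make the variance reduction concrete, here is a self-contained sketch of the k-hash-function estimator described above (again an illustration, not the scribes' code; the k hash functions are simulated with indexed SHA-256):

    import hashlib

    class KMinDistinctElements:
        """Flajolet-Martin-style estimator: keep the minimum hash value under each
        of k independent hash functions, average them, and return 1/Ybar - 1."""

        def __init__(self, k):
            self.k = k
            self.mins = [1.0] * k   # Y_1, ..., Y_k

        def _h(self, i, item):
            # i-th hash function, mapping an item to [0, 1).
            digest = hashlib.sha256(f"{i}|{item}".encode()).hexdigest()
            return int(digest, 16) / 16 ** len(digest)

        def update(self, item):
            for i in range(self.k):
                self.mins[i] = min(self.mins[i], self._h(i, item))

        def estimate(self):
            y_bar = sum(self.mins) / self.k   # Ybar, concentrated around 1/(N+1)
            return 1.0 / y_bar - 1.0

    est = KMinDistinctElements(k=400)          # k = O(1/eps^2); roughly eps = 0.05
    for x in list(range(1000)) * 3:
        est.update(x)
    print(est.estimate())                      # should land within a few percent of 1000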

3 Heavy Hitters

We can compute many statistics about a stream of data beyond the number of distinct elements. One particularly popular and practically useful goal is to find elements that appear frequently in the stream. These items are simply called frequent items or sometimes heavy hitters.

3.1 Misra-Gries, 1982 [3]

We begin with a straightforward version of the heavy hitters problem: Given a sequence of n elements X_1, X_2, ..., X_n, output a list with at most k values, ensuring that every element which occurs at least n/(k+1) + 1 times in the sequence is on the list. Note that there can be only k such items, although there could be fewer. We allow false positives in the list. Here's the algorithm:

    initialize empty list
    for each item:
        if item on list:
            increment its counter
        else if length(list) < k:
            add item to list
            set item's counter to 1
        else:
            throw away item
            decrement counter of every item in list
            delete items in list with counter = 0

Fact: Since the Misra-Gries algorithm stores k counters, each with value at most n, it uses O(k log n) space.

Lemma 3. Let f_x denote the frequency of item x. When Misra-Gries terminates, the counter for x is at least f_x - n/(k+1). Note that x could have a counter equal to 0, in which case it will not be on the list.

Proof. The final value of x's counter is equal to the number of times it appears in our sequence, f_x, minus the number of times x was thrown out because the list was full at the time. We argue that this number can't be higher than n/(k+1). When an item is thrown out, each element contained in the list is decremented. Additionally, x's virtual counter is effectively decremented from 1 to 0. Since the entire list must be filled, this corresponds to k + 1 tokens being destroyed every time x is thrown out. There are a total of n tokens received, so this event can occur at most n/(k+1) times. We conclude that the counter for x is at least f_x - n/(k+1). So, if f_x > n/(k+1), it will appear on the list at the end of the algorithm, as desired.
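A compact Python rendering of the pseudocode above (an illustrative sketch; a dict plays the role of the list of at most k counters):

    def misra_gries(stream, k):
        """Return at most k candidate items with their counters. Every item whose
        frequency exceeds n/(k+1) is guaranteed to be among the keys."""
        counters = {}
        for item in stream:
            if item in counters:
                counters[item] += 1
            elif len(counters) < k:
                counters[item] = 1
            else:
                # List is full: throw the new item away, decrement every counter,
                # and delete items whose counter reaches 0.
                for key in list(counters):
                    counters[key] -= 1
                    if counters[key] == 0:
                        del counters[key]
        return counters

    stream = ['a'] * 60 + ['b'] * 20 + ['c'] * 15 + ['d'] * 5   # n = 100
    print(misra_gries(stream, k=3))   # 'a' occurs 60 > 100/4 times, so it must appear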

3.2 Count-Min Sketch

We can also solve a more ambitious version of the heavy hitters problem: Given a sequence X_1, X_2, ..., X_n, compute f_x to within additive error ɛn. Note that an additive approximation is more meaningful for heavier items in the list and becomes meaningless when f_x ≤ ɛn.

We will use the Count-Min Sketch (Cormode, Muthukrishnan 2005 [2]). Choose l random hash functions h_1, ..., h_l : U -> {1, 2, ..., b}. Initialize an l x b array CMS[l][b] with zeros. Then, as elements X_i in the sequence are streamed in, for each hash function h_j, increment the (j, h_j(X_i)) entry of the table. To estimate the frequency of item x, compute

    Count(x) = min_j { CMS[j][h_j(x)] }

Claim 4. For any fixed item x and index j, CMS[j][h_j(x)] ≥ f_x.

Proof. We increment CMS[j][h_j(x)] each time we see x. However, this entry will also be incremented if h_j(y) = h_j(x) for some other item y ≠ x that also appears in the stream. Hence the inequality.

3.2.1 Analysis

Let z_j = CMS[j][h_j(x)], and note that z_j = f_x + Σ_{y ≠ x, h_j(y) = h_j(x)} f_y. We want to examine the expected value of z_j:

    E[z_j] = f_x + Σ_{y ≠ x} f_y · P[h_j(x) = h_j(y)] = f_x + (1/b) Σ_{y ≠ x} f_y ≤ f_x + (1/b) Σ_y f_y = f_x + n/b

Note that z_j is a biased estimator: its expected value is greater than f_x, the quantity we hope to estimate. Now, we want to show that z_j is close to f_x. Using the Markov bound when b = 2/ɛ,

    P[z_j - f_x ≥ ɛn] ≤ 1/2

When we have l hash functions, our estimate is the minimum z_j. Our estimate is bad iff every z_j is much larger than f_x:

    P[(min_j z_j) ≥ f_x + ɛn] = P[for all j, z_j ≥ f_x + ɛn] ≤ Π_j 1/2 = 2^(-l)

Setting l = O(log n), we get P[(min_j z_j) ≥ f_x + ɛn] ≤ 1/n. Accordingly, our estimate is accurate up to ɛn error with high probability.

3.2.2 Comparison with Misra-Gries

By setting ɛ = 1/k, the Count-Min Sketch solves almost the same problem as Misra-Gries. The key difference is that the Count-Min Sketch guarantees that for every x, if Count(x) is large then x is a heavy hitter. In Misra-Gries, an item on the returned list might not be a heavy hitter. We pay for this guarantee in space: Misra-Gries takes O(k log n) space, while the Count-Min Sketch requires O(k log^2 n) space (setting ɛ = 1/k and noting that each counter stores a value of at most n). Also note that Count-Min only obtains its solution with high probability; Misra-Gries is deterministic and so it always succeeds.
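The table-based description above translates almost directly into code. The following is an illustrative sketch (row hash functions simulated with indexed SHA-256), not the reference implementation from [2]:

    import hashlib

    class CountMinSketch:
        """An l x b table of counters; Count(x) never underestimates f_x and
        overestimates it by at most eps*n with high probability (b = 2/eps)."""

        def __init__(self, l, b):
            self.l = l
            self.b = b
            self.table = [[0] * b for _ in range(l)]

        def _h(self, j, item):
            # j-th hash function, mapping an item to a bucket in {0, ..., b-1}.
            digest = hashlib.sha256(f"{j}|{item}".encode()).hexdigest()
            return int(digest, 16) % self.b

        def update(self, item):
            for j in range(self.l):
                self.table[j][self._h(j, item)] += 1

        def count(self, item):
            return min(self.table[j][self._h(j, item)] for j in range(self.l))

    cms = CountMinSketch(l=5, b=20)                             # b = 2/eps with eps = 0.1
    stream = ['a'] * 60 + ['b'] * 20 + ['c'] * 15 + ['d'] * 5   # n = 100
    for x in stream:
        cms.update(x)
    print(cms.count('a'))   # at least 60, and at most 60 + eps*n = 70 w.h.p.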

References

[1] Flajolet, Philippe and Martin, Nigel. 1985. Probabilistic counting algorithms for data base applications. In Journal of Computer and System Sciences, Volume 31, Number 2, pp. 182-209.

[2] Cormode, Graham and Muthukrishnan, S. 2005. An improved data stream summary: the count-min sketch and its applications. In Journal of Algorithms, pp. 58-75.

[3] Misra, Jayadev and Gries, David. 1982. Finding repeated elements. In Science of Computer Programming, pp. 143-152.