
CS 591, Lecture 9 Data Analytics: Theory and Applications Boston University Babis Tsourakakis February 22nd, 2017

Announcement
We will cover Monday's 2/20 lecture (Presidents' Day) by appending half an hour to next Monday's and Wednesday's lectures.
Office hours next week: 4.15-5.00pm.

Reminder: Reservoir Sampling
Input: Stream of $n$ items $x_1, \ldots, x_n$ that does not fit in memory.
Goal: Keep a uniform random sample.
Simple approach: Draw a random integer from $\{1, \ldots, n\}$ and select that item.
Issue: We don't know the length $n$ of the stream a priori.
Problem: Find a uniform sample $s$ from a stream of unknown length.

Algorithm: Reservoir Sampling
1. Initially, set $s \leftarrow x_1$.
2. When the $i$-th element arrives, set $s \leftarrow x_i$ with probability $\frac{1}{i}$.
Correctness: What is the probability that $s = x_i$ at some time $t \geq i$?

Algorithm: Reservoir Sampling
Correctness: What is the probability that $s = x_i$ at some time $t \geq i$? The product telescopes:
$$\Pr[s = x_i] = \frac{1}{i} \left(1 - \frac{1}{i+1}\right) \cdots \left(1 - \frac{1}{t}\right) = \frac{1}{t}.$$
Question: What about the space complexity?
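
To make the two-line algorithm concrete, here is a minimal Python sketch of the single-item version (the function name is ours; the $k$-item variant appears later in these slides):

    import random

    def reservoir_sample_one(stream):
        """Return one uniformly random item from a stream of unknown length."""
        s = None
        for i, x in enumerate(stream, start=1):
            # Replace the current sample with probability 1/i.
            if random.randrange(i) == 0:
                s = x
        return s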

Extension: Weighted Reservoir Sampling
Input: Stream of weights $w_1, w_2, \ldots$
Goal: Select index $i$ with probability proportional to $w_i$.
Again, we don't know the length $n$ of the stream a priori.
Define for each index $i$: $W_i = w_1 + \cdots + w_i$.
Any ideas?


Algorithm: Weighted Reservoir Sampling
1. Initially, set $s \leftarrow x_1$.
2. When the $i$-th element arrives, set $s \leftarrow x_i$ with probability $\frac{w_i}{W_{i-1} + w_i} = \frac{w_i}{W_i}$.
Correctness: On seeing $w_{i+1}$, we switch to $w_{i+1}$ with probability $\frac{w_{i+1}}{W_{i+1}}$. If we don't switch, the current winner $j \leq i$ remains with probability
$$\Pr[s = w_j] = \left(1 - \frac{w_{i+1}}{W_{i+1}}\right) \frac{w_j}{W_i} = \frac{w_j}{W_{i+1}}.$$
Question: What about the space complexity?
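
A minimal Python sketch of this weighted variant (function name ours), which tracks only the running total $W_i$ and the current winner:

    import random

    def weighted_reservoir_sample(weights):
        """Return one index i, chosen with probability proportional to w_i."""
        winner, total = None, 0.0
        for i, w in enumerate(weights):
            total += w  # total is now W_i = w_1 + ... + w_i
            # Switch to item i with probability w_i / W_i.
            if random.random() * total < w:
                winner = i
        return winner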

Reservoir Sampling, k items
Input: Stream of items $x_1, x_2, \ldots$; parameter $k \geq 1$.
Goal: Keep $k$ uniform random samples.
Ideas?

Reservoir Sampling, k items
Intuition: When item $i$ arrives, a uniformly random $k$-subset of $\{1, \ldots, i\}$ does not contain $i$ with probability
$$\frac{\binom{i-1}{k}}{\binom{i}{k}} = 1 - \frac{k}{i},$$
so we keep the new item with probability $\frac{k}{i}$ and, if kept, it evicts a uniformly random current sample.

Python implementation

    import random

    def reservoir_sampling(stream, k):
        S = []
        counter = 1
        for x in stream:
            if counter <= k:
                # Fill the reservoir with the first k items.
                S.append(x)
            else:
                # Keep the new item with probability k / counter,
                # evicting a uniformly random current sample.
                c = random.randint(0, counter - 1)
                if c < k:
                    S[c] = x
            counter += 1
        return S
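
Usage sketch: reservoir_sampling(range(10**6), 5) makes one pass over a million items and returns 5 of them, with every 5-element subset equally likely, using only O(k) memory.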

Probabilistic Counting
Researchers at Bell Labs needed to count overlapping three-letter substrings. Why? To develop a statistics-based spell checker; this spell checker, typo, was included in early Unix distributions.
Robert Morris came up with an elegant way of counting each item approximately [Morris, 1978].
Why approximately? 8-bit counters can hold counts only up to 255 (too small a range)...

Probabilistic Counting
Increments are performed in a probabilistic manner.
Suppose we use a constant probability, i.e., we increase the counter from $i$ to $i+1$ with probability $p$. Why is this problematic? The expected counter value is $pn$, so a constant $p$ only rescales the count; the counter still needs $\Theta(\log n)$ bits.
Morris's idea: the increment probability $\Pr[i \to i+1] = f(i)$ depends on the actual counter value.
Outline: Morris, Morris+, Morris++. (We will follow Jelani Nelson's exposition.)

Morris algorithm
1. Initialize $X \leftarrow 0$.
2. For each event, increment $X$ with probability $2^{-X}$.
3. Output $\tilde{n} = 2^X - 1$.
Notice that our data structure just maintains an integer! We will see that this integer grows as $\log n$, hence we need $O(\log \log n)$ bits.
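
A minimal Python sketch of the Morris counter (class and method names are ours):

    import random

    class MorrisCounter:
        """Approximate counter: stores only X, which grows like log n."""
        def __init__(self):
            self.X = 0

        def update(self):
            # Increment X with probability 2^(-X).
            if random.random() < 2.0 ** (-self.X):
                self.X += 1

        def estimate(self):
            # E[2^X] = n + 1 after n updates, so 2^X - 1 is unbiased.
            return 2 ** self.X - 1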

Example: Morris algorithm (source: [Flajolet, 1985])
Distribution of the counter after $n = 4$ events:
$$\Pr[X_4 = 1] = \frac{8}{64}, \quad \Pr[X_4 = 2] = \frac{38}{64}, \quad \Pr[X_4 = 3] = \frac{17}{64}, \quad \Pr[X_4 = 4] = \frac{1}{64}.$$
(Check: $E[2^{X_4}] = \frac{8 \cdot 2 + 38 \cdot 4 + 17 \cdot 8 + 1 \cdot 16}{64} = \frac{320}{64} = 5 = n + 1$.)

Morris algorithm: Analysis
$X_n$ := Morris counter after $n$ updates. Let's compute $E[2^{X_n}]$.
Idea 1: Apply the formula
$$E[2^{X_n}] = \sum_{l=0}^{n} 2^l \, p_{n,l},$$
where $p_{n,l} = \Pr[X_n = l]$.
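
Although the closed form on the next slide is tedious, $p_{n,l}$ is easy to compute numerically. A small Python sketch (function name ours) that also lets one verify the $n = 4$ example and the claim $E[2^{X_n}] = n + 1$:

    def morris_distribution(n):
        """Return p with p[l] = Pr[X_n = l], by dynamic programming."""
        p = [1.0]  # X_0 = 0 with probability 1
        for _ in range(n):
            q = [0.0] * (len(p) + 1)
            for l, pl in enumerate(p):
                q[l] += pl * (1.0 - 2.0 ** (-l))  # stay with prob 1 - 2^-l
                q[l + 1] += pl * 2.0 ** (-l)      # increment with prob 2^-l
            p = q
        return p

    p = morris_distribution(4)
    print(sum((2 ** l) * pl for l, pl in enumerate(p)))  # prints 5.0 = n + 1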

Morris algorithm: Analysis
Suppose we reached state $l$ via $n_1$ transitions from state 1 to state 1, $n_2$ from state 2 to state 2, ..., $n_l$ from state $l$ to $l$. Such a path has probability
$$(1 - 2^{-1})^{n_1} \, 2^{-1} \, (1 - 2^{-2})^{n_2} \, 2^{-2} \cdots (1 - 2^{-(l-1)})^{n_{l-1}} \, 2^{-(l-1)} \, (1 - 2^{-l})^{n_l},$$
with the condition that $n_1 + \cdots + n_l + l - 1 = n$; summing over all such paths gives $p_{n,l}$.
Tedious approach... Let's try an inductive approach instead.

Morris algorithm: Analysis
Claim: $E[2^{X_n}] = n + 1$.
1. Base case: obviously true for $n = 0$, since $E[2^{X_0}] = 2^0 = 1$.
2. Induction step:
$$E[2^{X_{n+1}}] = \sum_{j=0}^{\infty} \Pr[X_n = j] \, E[2^{X_{n+1}} \mid X_n = j]
= \sum_{j=0}^{\infty} \Pr[X_n = j] \left( 2^j \left(1 - \frac{1}{2^j}\right) + \frac{1}{2^j} \, 2^{j+1} \right)
= \sum_{j=0}^{\infty} \Pr[X_n = j] \, 2^j + \sum_{j=0}^{\infty} \Pr[X_n = j]
= E[2^{X_n}] + 1 = (n + 1) + 1.$$

Morris algorithm: Analysis
In a similar way, we can show that
$$E[2^{2X_n}] = \frac{3}{2} n^2 + \frac{3}{2} n + 1.$$
Therefore
$$\operatorname{Var}[2^{X_n}] = E[2^{2X_n}] - \left(E[2^{X_n}]\right)^2 = \frac{1}{2} n^2 - \frac{1}{2} n < \frac{1}{2} n^2.$$
Chebyshev's inequality:
$$\Pr[|\tilde{n} - n| \geq \epsilon n] < \frac{1}{2\epsilon^2}.$$

Details
For the second moment, again we use induction:
$$E[2^{2X_n}] = \sum_{j \geq 0} 2^{2j} \Pr[X_n = j]
= \sum_{j \geq 0} 2^{2j} \left( \Pr[X_{n-1} = j] \left(1 - 2^{-j}\right) + \Pr[X_{n-1} = j-1] \, 2^{-(j-1)} \right)
= \cdots = E[2^{2X_{n-1}}] + 3 E[2^{X_{n-1}}]
= \frac{3}{2}(n-1)^2 + \frac{3}{2}(n-1) + 1 + 3n
= \frac{3}{2} n^2 + \frac{3}{2} n + 1.$$

Morris algorithm
[Figure: exact probability distribution of the Morris counter for $n = 1024$; source: Flajolet, 1985.]
The variance is not bad, but we want to do better: Morris+!

Morris+ algorithm
We instantiate $s$ independent copies of Morris and average their outputs:
$$s = \frac{1}{2\epsilon^2 \delta}, \qquad \tilde{n} = \frac{1}{s} \sum_{i=1}^{s} \tilde{n}_i.$$
Now Chebyshev's inequality gives
$$\Pr[|\tilde{n} - n| \geq \epsilon n] < \frac{1}{2s\epsilon^2} \leq \delta.$$
Next step: Morris++!
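
A minimal Python sketch of Morris+, built on top of the MorrisCounter class sketched earlier (class name ours):

    class MorrisPlus:
        """Average of s = 1/(2 eps^2 delta) independent Morris counters."""
        def __init__(self, s):
            self.copies = [MorrisCounter() for _ in range(s)]

        def update(self):
            for c in self.copies:
                c.update()

        def estimate(self):
            return sum(c.estimate() for c in self.copies) / len(self.copies)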

Morris++ algorithm
Run $t = 36 \lg \frac{1}{\delta}$ copies of Morris+, each with failure probability $\frac{1}{3}$.
Output the median estimate of the $t$ copies of Morris+.
In other words, we have $t$ coin tosses, each resulting in 1 (failure) with probability at most $\frac{1}{3}$ and in 0 (success) with probability at least $\frac{2}{3}$. The median estimate fails only if more than half of the coins are 1. Chernoff!

Morris++ algorithm: analysis
Let
$$Y_i = \begin{cases} 1, & \text{if the } i\text{-th Morris+ copy fails,} \\ 0, & \text{otherwise.} \end{cases}$$
Since $E\left[\sum_i Y_i\right] \leq t/3$, the Chernoff bound gives
$$\Pr\left[\sum_i Y_i > \frac{t}{2}\right] \leq \Pr\left[\sum_i Y_i - E\left[\sum_i Y_i\right] > \frac{t}{6}\right] < \delta.$$
Hence the Morris++ estimate $\tilde{n}$ $(1 \pm \epsilon)$-approximates $n$ with probability at least $1 - \delta$.
Question: What is the space complexity that we achieved?
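
A matching Python sketch of Morris++ (class name ours), taking the median of $t$ Morris+ estimates:

    import math
    from statistics import median

    class MorrisPlusPlus:
        """Median of t Morris+ copies, each run with failure probability 1/3."""
        def __init__(self, eps, delta):
            s = math.ceil(3.0 / (2.0 * eps ** 2))       # copies per Morris+, for delta' = 1/3
            t = math.ceil(36 * math.log2(1.0 / delta))  # number of Morris+ instances
            self.instances = [MorrisPlus(s) for _ in range(t)]

        def update(self):
            for inst in self.instances:
                inst.update()

        def estimate(self):
            return median(inst.estimate() for inst in self.instances)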

References
Flajolet, P. (1985). Approximate counting: a detailed analysis. BIT Numerical Mathematics, 25(1):113-134.
Morris, R. (1978). Counting large numbers of events in small registers. Communications of the ACM, 21(10):840-842.