Lecture Note 2

Material covered today is from Chapter 1 and Chapter 4.

1 Bonferroni Principle

1.1 Idea

Get an idea of how often events occur when things are purely random.

1 billion $= 10^9$ people
Each person has a 1% chance to stay in a hotel on a given day
Each hotel has 100 single rooms
$\frac{1}{100} \times 10^9$: #people staying in hotels today
#hotels $= \frac{1}{100} \times \frac{1}{100} \times 10^9 = 10^5$, or 100,000
Sample a window of 1000 days

1.2 Want

Suspected terrorists meet and stay in the same hotel twice in the 1000-day period. How many such events might we expect? In other words, we want to identify a pair of persons $(A, B)$ who stay at the same hotel on two different days $d_1, d_2$.

1.2.1 Probability

$\Pr[A \text{ stays at a hotel on } d_1] = \frac{1}{100}$
$\Pr[B \text{ stays at a hotel on } d_1] = \frac{1}{100}$
The probability that they both visit a hotel on $d_1$ is $\frac{1}{100} \cdot \frac{1}{100} = 10^{-4}$.
The probability that they stay at the same hotel: $10^{-4} \cdot 10^{-5} = 10^{-9}$, since there are $10^5$ hotels for them to agree on.
It happens on both $d_1$ and $d_2$: $p = (10^{-4} \cdot 10^{-5})^2 = 10^{-18}$.

1.2.2 Number of events

#ways to select a pair of individuals is $\binom{n}{2} = \frac{n(n-1)}{2} \approx \frac{n^2}{2}$
#choices for $(A, B, d_1, d_2)$: $\approx \frac{n^2}{2} \cdot \frac{d^2}{2}$, where $d = 1000$ is the number of days
Final answer: $\frac{n^2}{2} \cdot \frac{d^2}{2} \cdot p = \frac{1}{4} \times 10^6 = 250{,}000$
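The final count is just a product of the quantities above; here is a minimal sketch reproducing the arithmetic (Python, my choice since the notes contain no code; the variable names are mine).

```python
# Expected number of "suspicious" pairs under pure randomness
# (Bonferroni principle example, parameters as in the notes).
n = 10**9                            # people
d = 1000                             # days in the window
p = (10**-4 * 10**-5) ** 2           # same hotel on two fixed days

pairs_of_people = n * n / 2          # ~ C(n, 2)
pairs_of_days = d * d / 2            # ~ C(d, 2)

print(pairs_of_people * pairs_of_days * p)   # 250000.0
```

Even with nothing suspicious going on, we should expect a quarter of a million such "events": far too many to investigate.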

1.2.3 Exercise

What if the number of days is raised to 2000?

2 How do you sample when you do not know the size of the population?

2.1 Reservoir Sampling

2.2 Algorithm

Go to the first house; we pick it with probability 1 and call it our sample. If there is no more house, we report this house as our answer. If there are still more houses, we continue and pick the second house with probability 1/2, making it our sample and kicking out the first house. If there is no more house, we report our sample; otherwise we keep going. For the $k$-th house, we pick it with probability $\frac{1}{k}$ and replace the existing choice. (A code sketch appears at the end of this section.)

2.3 Analysis

We prove by induction that every element is chosen with equal probability (which is weaker than what Reservoir Sampling actually provides). When there is only one house, we select it with probability 100%, so the claim is true when $n = 1$. Suppose the claim is true for $n$ houses; we now prove it also works for $n + 1$. By the induction hypothesis, before seeing the $(n+1)$-th house, each house from 1 to $n$ has probability $\frac{1}{n}$ of being our sample. Upon seeing the new house, we select it with probability $\frac{1}{n+1}$, which means it becomes our sample with probability $\frac{1}{n+1}$. On the other hand, with probability $\frac{n}{n+1}$ the original sample persists. So the probability that a given house from 1 to $n$ persists as our sample is $\frac{1}{n} \cdot \frac{n}{n+1} = \frac{1}{n+1}$.

2.4 Reservoir Sampling: multiple items

Goal: pick $p$ items randomly out of $n$ items, with no knowledge of $n$. Add the $n$-th new item with probability $\frac{p}{n}$. If we end up not adding this new item, we do nothing. If the item gets chosen, one of our old samples is kicked out uniformly at random to make space for the new item.

2.5 Property of Reservoir Sampling

When selecting a single item, Reservoir Sampling guarantees that every item has the same chance of being chosen. However, when we pick more than one item, Reservoir Sampling provides more than that: it guarantees that every subset of size $p$ is chosen with the same probability.

2.5.1 What does it mean?

We have A, B, C, D, with $p = 2$ and $n = 4$, and our algorithm is as follows: toss a coin, return (A, B) if we see heads, and (C, D) otherwise. Each letter is sampled with probability 1/2, but we will never see (A, C). So Reservoir Sampling is a bit stronger than per-item uniformity.
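A minimal sketch of the multiple-item version from 2.4 (Python; the function and variable names are mine). Setting p = 1 recovers the single-item algorithm of 2.2.

```python
import random

def reservoir_sample(stream, p):
    """Maintain a uniform sample of p items from a stream of unknown length."""
    reservoir = []
    for n, item in enumerate(stream, start=1):
        if n <= p:
            reservoir.append(item)        # the first p items always enter
        elif random.random() < p / n:     # keep the n-th item w.p. p/n
            evict = random.randrange(p)   # kick out an old sample uniformly
            reservoir[evict] = item
    return reservoir

print(reservoir_sample(range(100000), p=5))
```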

3 Sampling from a Stream

We want to store 1/10-th of the stream for analysis.

4 Hash function

4.1 A common hash function

$h(x) = (ax + b) \bmod p$, where $p$ is a prime.

4.1.1 Bad example

$h(x) = x \bmod B$, where $B = 10$. Then hashing even numbers will always produce even values.

4.2 An obvious approach

Generate a random number from 0 to 9, and save the query if it is 0.

4.2.1 Bad example

This fails to answer the following query: among all the different queries, what fraction are duplicate queries?

Suppose a user has issued $s$ search queries one time and $d$ search queries twice, and no queries more than twice. We have $s + 2d$ queries in total, and the correct answer would be $\frac{d}{s+d}$ duplicates. If we use the previous algorithm, we get a biased estimate. Only queries that have duplicates will ever appear twice in the sample; furthermore, such a query appears twice if and only if both of its occurrences got sampled, which happens with probability $\frac{1}{10} \cdot \frac{1}{10} = \frac{1}{100}$. With $d$ queries that have duplicates, we would expect $\frac{d}{100}$ such appearances. Of the $d$ queries with duplicates, $\frac{18d}{100}$ will appear exactly once: either only the first occurrence got sampled, or only the second, so the probability that a certain duplicate query appears exactly once is $\frac{1}{10} \cdot \frac{9}{10} + \frac{9}{10} \cdot \frac{1}{10} = \frac{18}{100}$, giving an expected total of $\frac{18d}{100}$. The sample thus contains about $\frac{s}{10}$ singleton queries and $\frac{d}{100} + \frac{18d}{100} = \frac{19d}{100}$ distinct queries that had duplicates. In all, we would get $\frac{d/100}{s/10 + 19d/100} = \frac{d}{10s + 19d}$, not $\frac{d}{s+d}$.

4.3 Use a hash to sample 1/10 of the users

Use $h(\text{user})$, and select the set $\{u \mid h(u) = 0\}$.

4.4 Algorithm

Use a hash function on the user id (so that if a user comes again, we know whether they were previously sampled or not). Sample a user when they hash to 0. By this method, we guarantee that each user is sampled with probability 1/10.
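A small sketch of the per-user sampling from 4.3-4.4 (Python; `hash_bucket`, the md5 construction, and the toy data are my stand-ins, not from the lecture). Because a user's queries are kept or dropped as a unit, duplicate statistics within the sample stay representative.

```python
import hashlib

def hash_bucket(user_id: str, buckets: int = 10) -> int:
    """Deterministically map a user id to one of `buckets` buckets."""
    return int(hashlib.md5(user_id.encode()).hexdigest(), 16) % buckets

def keep(user_id: str) -> bool:
    # Keep ALL queries of the ~1/10 of users hashing to bucket 0,
    # rather than a random 1/10 of individual queries.
    return hash_bucket(user_id) == 0

stream = [("alice", "cats"), ("bob", "dogs"), ("alice", "cats")]
print([(u, q) for (u, q) in stream if keep(u)])
# alice's duplicate query is kept or dropped as a unit
```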

5 Bloom Filters

The goal is to answer membership queries in $S$:
If $x \in S$, always say yes.
If $x \notin S$, might say yes (with small probability of error).

5.1 Example

1 billion items; $|S| = m$; a hash table (bit array) $a$ of size $n$; $y$ different values to insert.
When some value $v$ comes, set $a[h(v)]$ to 1.
When asked whether a certain value $v$ is in $S$, answer yes if and only if $a[h(v)]$ is 1.

5.2 Analysis

What is the chance that a certain cell of $a$ is 0? Think of throwing $y$ darts at $x$ targets.
Probability that a single dart does not touch a specific entry: $\frac{x-1}{x} = 1 - \frac{1}{x}$.
Probability that all $y$ darts miss this entry: $\left(1 - \frac{1}{x}\right)^y = \left(\left(1 - \frac{1}{x}\right)^x\right)^{y/x} \approx e^{-y/x}$ for large $x$.
If $x = 8 \times 10^9$ and $y = 10^9$, then the probability that a given entry is hit is $1 - e^{-1/8} \approx 0.1175$.

5.3 What if we have k hash functions?

Suppose $|S| = m$, the table has $n$ cells, and we have $k$ hash functions, so $y = km$ and $x = n$.
On seeing $v$, we set $a[h_i(v)]$ to 1 for all hash functions $h_1, \ldots, h_k$.
When querying $v$, we answer yes if and only if $a[h_i(v)] = 1$ for all $k$ hash functions.
Chance that a certain cell of $a$ is 0: $e^{-km/n}$.
Chance of a false positive, i.e. a value $v$ is not in $S$ but we say it is: $\left(1 - e^{-km/n}\right)^k$.
If $k = 2$ and $n = 10m$ (so $y = km = 2m$), the probability is $\left(1 - e^{-1/5}\right)^2 \approx 0.0329$.
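A minimal Bloom filter sketch matching 5.3 (Python; deriving the k indexes from md5 and sha1 is my implementation trick, while the lecture simply assumes k independent hash functions).

```python
import hashlib

class BloomFilter:
    def __init__(self, n_cells: int, k: int):
        self.n, self.k = n_cells, k
        self.bits = bytearray(n_cells)  # one byte per cell, all 0

    def _indexes(self, value: str):
        # k pseudo-independent cell indexes from two base hashes
        h1 = int(hashlib.md5(value.encode()).hexdigest(), 16)
        h2 = int(hashlib.sha1(value.encode()).hexdigest(), 16)
        return [(h1 + i * h2) % self.n for i in range(self.k)]

    def add(self, value: str):
        for i in self._indexes(value):
            self.bits[i] = 1

    def __contains__(self, value: str) -> bool:
        # False negatives are impossible; false positives occur
        # with probability about (1 - e^{-km/n})^k.
        return all(self.bits[i] for i in self._indexes(value))

bf = BloomFilter(n_cells=10_000, k=2)
bf.add("alice@example.com")
print("alice@example.com" in bf)    # True
print("mallory@example.com" in bf)  # almost certainly False
```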

6 Count-Min Sketch

We want to count the frequency of numbers in a stream.

6.1 Algorithm

Initialize an array $a$ with 0s. When a number $x$ comes, use 2 hash functions $h_1, h_2$ and increase $a[h_1(x)]$ and $a[h_2(x)]$. When asked the frequency of $x$, answer $\min\{a[h_1(x)], a[h_2(x)]\}$. The guarantee is
$$\text{truth} \le \text{estimate} \le (1 + \epsilon) \cdot \text{truth}$$
with high probability. (A code sketch appears at the end of these notes.)

7 Verify matrix multiplication

7.1 Freivalds' Algorithm

To check whether $A = B \cdot C$, check whether $Ar = B(Cr)$ for a random binary vector $r$. (See the second sketch at the end of these notes.)

8 Closest Pair

Instead of $O(N \log N)$, we can get expected running time $O(N)$ using randomization (the answer is always correct, but the algorithm might take longer) [Rabin; independently Khuller & Matias].
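First, a sketch of the two-hash Count-Min Sketch from 6.1 (Python; the array size and hash construction are my choices). Collisions can only inflate a counter, which is why the minimum never underestimates the true count.

```python
import hashlib

SIZE = 1000
a = [0] * SIZE

def h(i: int, x: int) -> int:
    # Stand-in for the lecture's hash functions h_1, h_2 (my construction)
    return int(hashlib.md5(f"{i}:{x}".encode()).hexdigest(), 16) % SIZE

def add(x: int):
    a[h(1, x)] += 1
    a[h(2, x)] += 1

def estimate(x: int) -> int:
    return min(a[h(1, x)], a[h(2, x)])  # >= true count, close w.h.p.

for value in [7, 7, 7, 42]:
    add(value)
print(estimate(7), estimate(42))  # 3 1, barring collisions
```

And Freivalds' check from 7.1 (Python with nested lists to stay dependency-free; the repetition is my addition: one round errs with probability at most 1/2, so `rounds` independent rounds push the error below $2^{-\text{rounds}}$).

```python
import random

def matvec(M, v):
    return [sum(row[j] * v[j] for j in range(len(v))) for row in M]

def freivalds(A, B, C, rounds: int = 20) -> bool:
    """Probabilistically check A == B*C in O(rounds * n^2) time."""
    n = len(A)
    for _ in range(rounds):
        r = [random.randint(0, 1) for _ in range(n)]
        if matvec(A, r) != matvec(B, matvec(C, r)):
            return False  # found a witness: certainly A != B*C
    return True           # wrong with probability <= 2**-rounds

B = [[1, 2], [3, 4]]
C = [[5, 6], [7, 8]]
A = [[19, 22], [43, 50]]   # the true product B*C
print(freivalds(A, B, C))  # True
```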