Hashing

Dictionaries:
- Model operations: makeset, insert, delete, find.
- Keys are integers in M = {1, ..., m} (so assume the machine word size, i.e. unit time, is log m).
- Can store in an array of size m, using our power: arithmetic, indirect addressing (compare to comparison- and pointer-based sorting, binary trees).
- Problem: space.

Hashing:
- Find a function h mapping M into a table of size n << m.
- Note some items get mapped to the same place: a collision. Use a linked list etc.; search and insert cost equals the size of the linked list.
- Goal: keep the linked lists small, i.e. few collisions.

Hash families:
- Problem: for any fixed hash function there is some bad input (m possible keys land in n buckets, so some bucket receives m/n of them; choose the n items from that bucket and all collide).
- Solution: build a family of functions and choose one that works well.
- The set of all functions? Idea: choose the function that stores the items in sorted order without collisions. Problem: to evaluate the function we must examine all the data, so evaluation time is Ω(log n) and description size is Ω(n log m).
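The chained dictionary just described can be sketched as follows. This is a minimal sketch with a placeholder modular hash function — a deliberate assumption, since choosing the function well is exactly what the rest of the notes address:

```python
# Sketch of a chained hash table: one list ("chain") per slot, with
# search/insert/delete cost equal to the chain length. The hash function
# below is a fixed placeholder, not the lecture's construction -- an
# adversary can make it behave badly.

class ChainedHashTable:
    def __init__(self, n):
        self.n = n
        self.buckets = [[] for _ in range(n)]  # one chain per slot

    def _h(self, key):
        return key % self.n  # placeholder hash

    def insert(self, key):
        b = self.buckets[self._h(key)]
        if key not in b:           # cost = length of the chain
            b.append(key)

    def find(self, key):
        return key in self.buckets[self._h(key)]

    def delete(self, key):
        b = self.buckets[self._h(key)]
        if key in b:
            b.remove(key)

t = ChainedHashTable(8)
for k in [3, 11, 19, 5]:
    t.insert(k)
assert t.find(11) and not t.find(4)
```

Note that 3, 11, and 19 all land in slot 3 here, so searches for them pay the chain length — the "few collisions" goal in action.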

Better goal: choose a function that can be evaluated in constant time without looking at the data (except the query key).

How about a random function?
- Set S of s items. If s = n, this is balls in bins: O((log n)/(log log n)) collisions in some bucket w.h.p., and some bucket matches that bound.
- But we care more about average collisions over many operations.
- Let C_ij = 1 if items i, j collide. The time to find i is Σ_j C_ij, with expected value (n-1)/n ≤ 1.
- More generally, the expected search time for an item (present or not) is O(s/n) = O(1) if s = n.

Problem: there are n^m functions (specify one of n places for each of the m possible keys):
- Too much space to specify (m log n bits), hard to evaluate.
- For O(1) search time, we need to identify the function in O(1) time, so the function description must fit in O(1) machine words (assuming log m-bit words).
- So a fixed number of cells can only distinguish poly(m) functions; this bounds the size of the hash family we can choose from.

(Our analysis: sloppier constants but more intuitive than the book's.)

2-universal family [Carter-Wegman]:
- How much independence was used above? Pairwise (the search item versus each other item).
- So: OK if items land pairwise independently.
- Pick a prime p in the range m, ..., 2m (not random); pick random a, b.

- Map x to ((ax + b) mod p) mod n.
- (ax + b) mod p is pairwise independent and uniform before the mod n; so the result is pairwise independent and near-uniform after the mod n: at most 2 uniform values map to the same place.
- The argument above holds: O(1) expected search time.
- Represent the function with two O(log m)-bit integers: a hash family of poly size.

Max load?
- The expected load in a bin is 1, so it is O(√n) with probability 1 - 1/n (Chebyshev); this bounds the expected max load.
- Some item may have a bad load, but it is unlikely to be the requested one.
- One can show the √n max load is in fact achieved for some 2-universal families.

Perfect hash families:
- A perfect hash function: no collisions on S.
- Want: for any S of size s ≤ n, a perfect h in the family (e.g. the set of all functions works); but the hash choice must fit in a table, so we want an m^O(1)-size family. One exists iff m = 2^Ω(n) (probabilistic method; also hard computationally):
  - A random function is perfect with probability n!/n^n ≈ e^{-n}. So take n^n/n! ≈ e^n functions: Pr(all bad) ≈ 1/e.
  - The number of subsets is at most m^n. So take e^n · ln(m^n) = n e^n ln m functions: Pr(all bad for a fixed subset) ≤ 1/m^n.
  - So with nonzero probability, no set has all bad functions (union bound).
  - Number of functions: n e^n ln m, which is m^O(1) if m = 2^Ω(n).
- Too bad: this only fits sets of log m items (note one machine word contains n bits, one per item). Also, finding the function is hard computationally.

Alternative try: use more space. How big can s be for a random hash of s items into n buckets to avoid collisions?
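The ((ax + b) mod p) mod n construction above can be sketched as follows; the key range, table size, and trial-division primality test are illustrative assumptions:

```python
import random

# Sketch of the Carter-Wegman 2-universal family:
#   h(x) = ((a*x + b) mod p) mod n,  p prime in [m, 2m], a and b random.
# Trial-division primality is fine at this toy scale.

def is_prime(q):
    return q > 1 and all(q % d for d in range(2, int(q ** 0.5) + 1))

def make_hash(m, n):
    p = next(q for q in range(m, 2 * m + 1) if is_prime(q))  # p need not be random
    a = random.randrange(1, p)   # random multiplier (nonzero)
    b = random.randrange(p)      # random offset
    return lambda x: ((a * x + b) % p) % n

m, n = 100, 10
h = make_hash(m, n)
table = [[] for _ in range(n)]
keys = list(range(0, m, 7))
for x in keys:
    table[h(x)].append(x)
assert sum(len(bucket) for bucket in table) == len(keys)
```

Since only a and b need storing, the function fits in two O(log m)-bit words, matching the representation bound above.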

- The expected number of collisions is E[Σ C_ij] = (s choose 2)(1/n) ≈ s²/2n.
- So s = √n works with probability 1/2 (Markov).
- Is this best possible? Birthday problem: Pr(no collision) = (1 - 1/n)(1 - 2/n) ··· (1 - s/n) ≈ e^{-(1/n + 2/n + ··· + s/n)} ≈ e^{-s²/2n}. So already at s ≈ √n there is an Ω(1) chance of a collision (23 for birthdays).

Two-level hashing solves the problem:
- Hash the s items into O(s) space 2-universally.
- Build a quadratic-size hash table on the contents of each bucket.
- Bound: if bucket k receives b_k items, then Σ_k b_k² = Σ_k (Σ_i [i → b_k])² = Σ_i C_ii + Σ_{i≠j} C_ij, with expected value O(s). So try until you get O(s) (Markov).
- Then build collision-free quadratic tables inside each bucket; try until you get them.
- Polynomial time in s; a Las Vegas algorithm.
- Easy: 6s cells. Hard: s + o(s) cells (bit fiddling).

Derandomization:
- With probability 1/2 the top-level function works.
- There are only m² top-level functions. Try them all!
- Polynomial in m (not n): a deterministic algorithm.

Fingerprinting

Basic idea: compare two things from a big universe U.
- Doing this directly generally takes log |U| bits, which could be huge.
- Better: randomly map U into a smaller V and compare elements of V. Probability(same) ≈ 1/|V|.

- Intuition: log |V| bits to compare, error probability 1/|V|.
- We work with fields: add, subtract, multiply, divide; 0 and 1 elements. E.g. the reals and the rationals (not the integers). We will talk about Z_p; which field often won't matter.

Verifying matrix multiplication:
- Claim: AB = C. Checking by multiplying takes n³, or n^2.376 with deep math. Freivalds': O(n²). Good to apply at the end of a complex algorithm (check the answer).
- Freivalds' technique: choose a random r ∈ {0,1}^n and check that ABr = Cr (computed as A(Br)); time O(n²).
- If AB = C, fine. What if AB ≠ C? Trouble only if (AB − C)r = 0 although D = AB − C ≠ 0.
- Take some nonzero row (d_1, ..., d_n) of D; wlog d_1 ≠ 0. Trouble requires Σ d_i r_i = 0, i.e. r_1 = −(Σ_{i>1} d_i r_i)/d_1.
- Principle of deferred decisions: choose all r_i for i ≥ 2 first; then there is exactly one bad value for r_1, and the probability we pick it is at most 1/2.
- How to improve the detection probability? k trials make the failure probability 1/2^k. Or choosing each r_i in [1, s] makes it 1/s.
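Freivalds' check above can be sketched as follows (the 2x2 matrices at the bottom are just illustrative):

```python
import random

# Sketch of Freivalds' verification: test A(Br) == Cr for random r in {0,1}^n.
# Each trial costs O(n^2); k independent trials drive the failure
# probability down to 1/2^k.

def mat_vec(M, v):
    return [sum(M[i][j] * v[j] for j in range(len(v))) for i in range(len(M))]

def freivalds(A, B, C, k=20):
    n = len(A)
    for _ in range(k):
        r = [random.randint(0, 1) for _ in range(n)]
        if mat_vec(A, mat_vec(B, r)) != mat_vec(C, r):
            return False   # witness found: certainly AB != C
    return True            # probably AB == C (error prob. <= 1/2^k)

A = [[1, 2], [3, 4]]
B = [[5, 6], [7, 8]]
C = [[19, 22], [43, 50]]   # the true product AB
assert freivalds(A, B, C)
```

Note the parenthesization A(Br): two matrix-vector products instead of one matrix-matrix product is what keeps each trial at O(n²).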

- This doesn't just check matrix multiplication: it checks any matrix identity claim. Useful when the matrices are implicit (e.g. AB).
- We are mapping matrices (n² entries) to vectors (n entries).

String matching

Checksums:
- Alice and Bob have bit strings of length n; think of them as n-bit integers a, b.
- Take a prime number p and compare a mod p with b mod p, using log p bits.
- Trouble if a ≡ b (mod p) although a ≠ b. How likely? c = a − b is an n-bit integer, so it has at most n prime factors.
- How many primes below k? Θ(k/ln k). So limit the primes to those below 2n² log n: about n² of them. Then a random one fails with probability ≤ 1/n, and takes O(log n) bits to send.
- Implement by add/subtract, no multiply or divide!
- How to find a prime? A randomly chosen number is prime with probability about 1/ln n, so just try a few. How to know it's prime? A simple randomized test (later).

Pattern matching in strings:
- m-bit pattern, n-bit string; work mod a prime p of size at most t.
- The probability of an error at a particular alignment is at most m/(t/log t): the difference between pattern and window has at most m prime factors, and there are about t/log t primes below t.
- So pick a big t and take a union bound over alignments.
- Again, implement by add/subtract, no multiply or divide!
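The fingerprint matcher above can be sketched as follows. Two hedges: this sketch fixes one prime p for reproducibility rather than choosing it at random, and it verifies each fingerprint hit directly, so the random-prime error analysis above is sidestepped rather than implemented:

```python
# Sketch of fingerprint-based pattern matching: compare the m-bit pattern's
# value mod p against each length-m window of the text mod p, sliding the
# window fingerprint in O(1) per step.

def find_pattern(text, pattern, p=1000003):
    n, m = len(text), len(pattern)
    if m > n:
        return []
    fp_pat = fp_win = 0
    for i in range(m):                      # pattern and first-window fingerprints
        fp_pat = (2 * fp_pat + int(pattern[i])) % p
        fp_win = (2 * fp_win + int(text[i])) % p
    top = pow(2, m - 1, p)                  # weight of the bit leaving the window
    hits = []
    for s in range(n - m + 1):
        if fp_win == fp_pat and text[s:s + m] == pattern:  # verify on fingerprint match
            hits.append(s)
        if s + m < n:                       # slide window by one bit
            fp_win = (2 * (fp_win - int(text[s]) * top) + int(text[s + m])) % p
    return hits

assert find_pattern("0110101101", "1101") == [1, 6]
```

The sliding update (drop the leading bit, shift, append the next bit) is the add/subtract-style incremental work that makes the whole scan linear time.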

Fingerprints by Polynomials

Good for fingerprinting composable data objects.
- Check whether P(x)Q(x) = R(x), with P and Q of degree n (so R should have degree at most 2n). Multiplication takes O(n log n) using the FFT; evaluation at a fixed point takes O(n) time.
- Random test: fix S ⊆ F, pick a random r ∈ S, and evaluate P(r)Q(r) − R(r).
- Suppose this polynomial is not identically 0. It has degree at most 2n, so it has at most 2n roots. Thus Pr(r is a root, i.e. we fail to discover the difference) ≤ 2n/|S|.
- So it is sufficient, e.g., to pick a random integer in [0, 4n]. Note: no prime is needed (though Z_p is sometimes needed).
- Again, the major benefit comes when the polynomial is implicitly specified.
- String checksum: treat the string as a degree-n polynomial; evaluate it at a random O(log n)-bit input; the probability of getting 0 is small.

Multivariate: n variables.
- Degree of a term: sum of the variables' degrees. Total degree d: max degree of a term.
- Schwartz-Zippel: fix S ⊆ F and let each r_i be random in S. Then Pr[Q(r_1, ..., r_n) = 0 | Q ≢ 0] ≤ d/|S|.
- Note: no dependence on the number of variables!
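The univariate test above can be sketched as follows, with polynomials as coefficient lists, lowest degree first — a representation chosen here for convenience:

```python
import random

# Sketch of the randomized identity test for P(x)Q(x) = R(x): evaluate both
# sides at a random integer r in [0, 4n]. If the identity fails, P*Q - R has
# at most 2n roots among the 4n+1 candidates, so one trial errs with
# probability at most about 1/2; repeat to shrink it.

def poly_eval(coeffs, x):
    v = 0
    for c in reversed(coeffs):   # Horner's rule, O(n)
        v = v * x + c
    return v

def pq_equals_r(P, Q, R, trials=20):
    n = max(len(P), len(Q)) - 1
    for _ in range(trials):
        r = random.randint(0, 4 * n)
        if poly_eval(P, r) * poly_eval(Q, r) != poly_eval(R, r):
            return False         # certainly P*Q != R
    return True                  # probably P*Q == R

P = [1, 1]             # 1 + x
Q = [1, -1]            # 1 - x
R = [1, 0, -1]         # 1 - x^2 = (1 + x)(1 - x)
assert pq_equals_r(P, Q, R)
assert not pq_equals_r(P, Q, [1, 0, 1])   # 1 + x^2 is not (1 + x)(1 - x)
```

Each trial costs three O(n) evaluations, versus O(n log n) for actually multiplying P and Q.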

Proof: induction on the number of variables; the base case (univariate) is done above. So suppose Q ≢ 0, and pick some variable, say x_1, that affects Q.
- Write Q = Σ_{i ≤ k} x_1^i · Q_i(x_2, ..., x_n), where k is the highest power of x_1, so Q_k ≢ 0 by the choice of k.
- Q_k has total degree at most d − k. By induction, the probability that Q_k(r_2, ..., r_n) evaluates to 0 is at most (d − k)/|S|.
- Suppose it didn't. Then q(x_1) = Σ_i x_1^i · Q_i(r_2, ..., r_n) is a nonzero univariate polynomial of degree k; by the base case, the probability it evaluates to 0 is at most k/|S|.
- Add: get d/|S|. Why can we add? Pr[E_1] = Pr[E_1 ∩ E_2] + Pr[E_1 ∩ ¬E_2] ≤ Pr[E_2] + Pr[E_1 | ¬E_2].

Small problem: a degree-n polynomial can generate huge values from small inputs.
- Solution 1: if the polynomial is over Z_p, do all the math mod p. Need p exceeding the coefficients and the degree; p need not be random.
- Solution 2: work over Z, but do all computation mod a random q (as in string matching).

Perfect matching

Define the Edmonds matrix of a bipartite graph: entry (i, j) is a variable x_ij if edge (u_i, v_j) is present, and 0 otherwise.
- The determinant is nonzero as a polynomial iff there is a perfect matching. So apply Schwartz-Zippel.
- The degree is n, so random values r_ij ∈ {1, ..., n²} yield 0 (despite a nonzero determinant) with probability at most 1/n.

The determinant may be huge!
- We picked random inputs of size r and know the determinant evaluated to nonzero, but it may be a huge number. How big? About n! · r^n, so it has only O(n log n + n log r) prime divisors (it is a string of that many bits).
- So compute the determinant mod a random prime p of size O((n log n + n log r)²); this needs only O(log n + log log r) bits.
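Putting the pieces together, the matching test can be sketched as: build the Edmonds matrix with random entries in {1, ..., n²} and compute its determinant mod a prime p by Gaussian elimination. The edge encoding and the fixed choice of p here are illustrative assumptions:

```python
import random

# Sketch of the algebraic perfect-matching test for a bipartite graph:
# substitute random values for the variables x_ij of the Edmonds matrix and
# compute the determinant mod a prime p (avoiding huge integers). A nonzero
# result certifies a perfect matching; when a matching exists, a zero result
# is a false negative with small probability.

def det_mod(M, p):
    # Gaussian elimination over Z_p.
    M = [row[:] for row in M]
    n = len(M)
    det = 1
    for c in range(n):
        piv = next((r for r in range(c, n) if M[r][c] % p), None)
        if piv is None:
            return 0                      # singular mod p
        if piv != c:
            M[c], M[piv] = M[piv], M[c]   # row swap flips the sign
            det = -det
        det = det * M[c][c] % p
        inv = pow(M[c][c], p - 2, p)      # inverse via Fermat, p prime
        for r in range(c + 1, n):
            f = M[r][c] * inv % p
            for j in range(c, n):
                M[r][j] = (M[r][j] - f * M[c][j]) % p
    return det % p

def maybe_has_perfect_matching(edges, n, p=1000003):
    # edges: set of pairs (i, j) meaning u_i is adjacent to v_j.
    M = [[random.randrange(1, n * n + 1) if (i, j) in edges else 0
          for j in range(n)] for i in range(n)]
    return det_mod(M, p) != 0

assert maybe_has_perfect_matching({(0, 0), (1, 1)}, 2)        # has a matching
assert not maybe_has_perfect_matching({(0, 0), (0, 1)}, 2)    # u_1 isolated
```

The determinant sums, with signs, one product of entries per permutation, i.e. per candidate perfect matching — which is why it is symbolically nonzero exactly when a perfect matching exists.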