Lecture 4: Universal Hash Functions/Streaming Cont'd


CSE 521: Design and Analysis of Algorithms I                                                   Spring 2016

Lecture 4: Universal Hash Functions/Streaming Cont'd

Lecturer: Shayan Oveis Gharan        April 6th        Scribe: Jacob Schreiber

Disclaimer: These notes have not been subjected to the usual scrutiny reserved for formal publications.

4.1 Hash Functions

Suppose we want to maintain a data structure for a set of elements $x_1, \dots, x_m$ from a universe $U$, e.g., images, that can perform insertion/deletion/search operations. A simple strategy would be to have one bucket for every possible image, i.e., for each element of $U$, and indicate in each bucket whether or not the corresponding image appeared. Unfortunately, $|U|$ can be much, much larger than the space available in our computers; for example, if $U$ represents the set of all possible images, $|U|$ is as big as $2^{1000000}$.

Instead, one may use a hash function. A hash function $h : U \to [B]$ maps elements of $U$ to integers in $[B]$. For every element $x_i$ of the sequence we mark the cell $h(x_i)$ with $x_i$. When a query $x$ arrives, we go to the cell $h(x)$; if no element is stored there, $x$ is not in our sequence. Otherwise, we go over all elements stored in cell $h(x)$ and see if any of them is equal to $x$. Observe that the search operation thus depends on the number of elements stored in cell $h(x)$. Ideally, we would like to have a hash function that stores at most one element in every cell of $[B]$.

Fix a function $h$. Observe that $h$ maps a $1/B$ fraction of all elements of $U$ to the same number $i \in [B]$. Therefore, the search operation in the worst case is very slow. We can mitigate this problem by choosing a hash function uniformly at random from the family of all functions that map $U$ to $[B]$: let $\mathcal{H} = \{h : U \to [B]\}$, and let $h \in \mathcal{H}$ be chosen uniformly at random. Now, if the length of the sequence satisfies $m \ll \sqrt{B}$, then, by the birthday paradox phenomenon, with high probability no two elements of the sequence map to the same cell. In other words, there are no collisions. However, observe that $\mathcal{H}$ has $B^{|U|}$ many functions, so even describing $h$ requires $\log B^{|U|} = |U| \log B$ bits of memory. Recall that we assumed $|U| \approx 2^{1000000}$, so we cannot efficiently represent $h$. Instead, we are going to work with much smaller families of functions, say $\mathcal{H}'$; such a family can only guarantee weaker notions of independence, but because $|\mathcal{H}'| \ll |\mathcal{H}|$, it is much easier to describe a randomly chosen function from $\mathcal{H}'$.
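As a concrete illustration of the bucketing scheme above, the following Python sketch implements a hash table with chaining; the hash function is passed in as a black box, and the class and variable names are ours, not part of the lecture.

```python
import random

class ChainedHashTable:
    """Hash table with chaining: bucket h(x) stores every inserted x that maps to it."""

    def __init__(self, B, h):
        self.B = B                      # number of buckets
        self.h = h                      # hash function h: U -> [B]
        self.buckets = [[] for _ in range(B)]

    def insert(self, x):
        cell = self.buckets[self.h(x)]
        if x not in cell:               # avoid storing duplicates
            cell.append(x)

    def delete(self, x):
        cell = self.buckets[self.h(x)]
        if x in cell:
            cell.remove(x)

    def search(self, x):
        # Cost is proportional to the number of elements stored in bucket h(x).
        return x in self.buckets[self.h(x)]

# A "truly random" hash function, tabulated lazily.  Storing this table explicitly
# is exactly the |U| log B memory cost discussed above.
B = 16
table = {}
def random_h(x):
    if x not in table:
        table[x] = random.randrange(B)
    return table[x]

T = ChainedHashTable(B, random_h)
for x in [3, 1, 4, 1, 5, 9, 2, 6]:
    T.insert(x)
print(T.search(4), T.search(7))  # True False
```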

4.2 2-Universal Functions

In this section, we describe a family of hash functions that only guarantees pairwise independence. Let $p$ be a prime number, and let
$$\mathcal{H} = \{h_{a,b} : [p] \to [p] \mid h_{a,b}(x) = ax + b \bmod p,\ a, b \in [p]\}.$$
Observe that any function $h_{a,b} \in \mathcal{H}$ can be represented in $O(\log p)$ bits of memory just by recording $a, b \in [p]$. Next, we show that a uniformly random function $h \in \mathcal{H}$ is pairwise independent.

Lemma 4.1. For any $x, y, c, d \in [p]$ with $x \neq y$,
$$\Pr[h(x) = c,\ h(y) = d] = \frac{1}{p^2}.$$

Proof. Suppose that for some $x \neq y$ we have $h(x) = c$ and $h(y) = d$. Equivalently, we can write
$$ax + b \equiv c \pmod{p}, \qquad ay + b \equiv d \pmod{p}.$$
Using the laws of modular equations, we can write
$$a(x - y) \equiv (c - b) - (d - b) \equiv c - d \pmod{p}.$$
Since $p$ is a prime, any nonzero number $z \in [p]$ has a multiplicative inverse, i.e., there is a number $z^{-1} \in [p]$ such that $z \cdot z^{-1} \equiv 1 \pmod{p}$. Since $x \neq y$, we have $x - y \neq 0$; therefore it has a multiplicative inverse, and we can write
$$a = (x - y)^{-1}(c - d) \bmod p, \quad \text{which gives} \quad b = d - ay \bmod p.$$
In words, fixing $x, y, c, d$ uniquely defines $a, b$. Since there are $p^2$ possibilities for $(a, b)$, we get $\Pr[h(x) = c, h(y) = d] = 1/p^2$.

For our application in estimating $F_0$, we first need to choose a prime number $p > n$. Then, we can use a hash function $h : [n] \to [B]$ where for any $0 \le x \le n - 1$,
$$h(x) = ((ax + b) \bmod p) \bmod B.$$
It is easy to see that such a function is almost pairwise independent, which is good enough for our application in estimating $F_0$.

We can extend the above construction to a family of $k$-wise independent hash functions. We say a hash function $h : [p] \to [p]$ is $k$-wise independent if for all distinct $x_0, \dots, x_{k-1}$ and all $c_0, \dots, c_{k-1}$,
$$\Pr[h(x_i) = c_i \ \forall i] = \frac{1}{p^k}.$$
Such a hash function $h$ can be constructed by choosing $a_0, a_1, \dots, a_{k-1}$ uniformly and independently from $[p]$ and letting
$$h(x) = a_{k-1}x^{k-1} + a_{k-2}x^{k-2} + \dots + a_1 x + a_0 \bmod p.$$
We are not proving that this gives a $k$-wise independent hash function; instead, we just give the high-level idea. Let $h$ be a 4-wise independent hash function, let $x_0, x_1, x_2, x_3 \in [p]$ be distinct, and let $c_0, c_1, c_2, c_3 \in [p]$. We need to show that there is a unique tuple $a_0, a_1, a_2, a_3$ for which $h(x_i) = c_i$ for all $i$. To find $a_0, a_1, a_2, a_3$ it is enough to solve the following system of linear equations:
$$\begin{bmatrix} x_0^3 & x_0^2 & x_0 & 1 \\ x_1^3 & x_1^2 & x_1 & 1 \\ x_2^3 & x_2^2 & x_2 & 1 \\ x_3^3 & x_3^2 & x_3 & 1 \end{bmatrix} \begin{bmatrix} a_3 \\ a_2 \\ a_1 \\ a_0 \end{bmatrix} = \begin{bmatrix} c_0 \\ c_1 \\ c_2 \\ c_3 \end{bmatrix}.$$
It turns out that the matrix on the LHS has a nonzero determinant if $x_0, x_1, x_2, x_3$ are distinct. In such a case, it is invertible, and we can use the inverse to uniquely define $a_0, a_1, a_2, a_3$.
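As an illustration (not part of the original notes), here is a minimal Python sketch of sampling $h_{a,b}$ and composing it with the extra "mod B" step used for the $F_0$ application; the particular prime $p = 2^{31} - 1$ is an assumption made here just to have a concrete prime larger than $n$.

```python
import random

def sample_2_universal(n, B):
    """Sample h(x) = ((a*x + b) mod p) mod B from the 2-universal family above.

    p = 2**31 - 1 is a Mersenne prime chosen for this sketch; any prime p > n works.
    """
    p = 2**31 - 1
    assert n < p
    a = random.randrange(p)
    b = random.randrange(p)
    return lambda x: ((a * x + b) % p) % B

# Usage: hash the universe {0, ..., n-1} into B buckets.
h = sample_2_universal(n=10**6, B=1024)
print(h(42), h(43))   # two (almost) pairwise-independent values in {0, ..., 1023}
```

Note that storing $h$ amounts to storing the two integers $a$ and $b$, i.e., $O(\log p)$ bits, in contrast to the $|U| \log B$ bits needed for a truly random function.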

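Similarly, here is a sketch of the degree-$(k-1)$ polynomial construction; the prime and the Horner-style evaluation are implementation choices of ours, not something fixed by the notes.

```python
import random

def sample_k_wise(k, p=2**31 - 1):
    """Sample h(x) = a_{k-1} x^{k-1} + ... + a_1 x + a_0 (mod p),
    with a_0, ..., a_{k-1} drawn uniformly and independently from [p]."""
    coeffs = [random.randrange(p) for _ in range(k)]   # a_0, ..., a_{k-1}

    def h(x):
        # Horner's rule: (((a_{k-1} x + a_{k-2}) x + ...) x + a_0) mod p
        value = 0
        for a in reversed(coeffs):
            value = (value * x + a) % p
        return value

    return h

h4 = sample_k_wise(k=4)       # 4-wise independent; enough for the F_2 sketch below
print(h4(0), h4(1), h4(2), h4(3))
```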
4.3 $F_2$ Moment

Before designing a streaming algorithm that estimates $F_2$, let us revisit the random walk example that we saw a few lectures ago. Let $X = \sum_i X_i$ where, for each $i$,
$$X_i = \begin{cases} +1 & \text{w.p. } 1/2, \\ -1 & \text{w.p. } 1/2. \end{cases}$$
Using the Hoeffding bound, we previously showed that for any $c > 1$,
$$\Pr[X \ge c\sqrt{n}] \le e^{-c^2/2}.$$
Is this bound tight? Can we show that $|X| \ge \Omega(\sqrt{n})$ with constant probability? The answer is yes. More generally, it follows from the central limit theorem, but instead of using such a heavy tool there is a more elementary argument that we can use. To show that $|X| \ge \Omega(\sqrt{n})$ with constant probability, it is enough to show that $\mathbb{E}[X^2] = n$:
$$\mathbb{E}[X^2] = \mathbb{E}\Big[\Big(\sum_i X_i\Big)^2\Big] = \mathbb{E}\Big[\sum_{i,j} X_i X_j\Big] = \sum_{i,j}\mathbb{E}[X_i X_j] = \sum_i \mathbb{E}[X_i^2] = n,$$
where in the second-to-last equality we use that $X_i$ and $X_j$ are independent, so $\mathbb{E}[X_i X_j] \neq 0$ only when $i = j$, and in the last equality we use $\mathbb{E}[X_i^2] = 1$ for all $i$.

Now back to estimating $F_2$. We want to use a similar idea. Let $x_1, x_2, \dots, x_m \in [n]$ be the input sequence. For each $i \in [n]$ let $m_i := \#\{j : x_j = i\}$. Recall that
$$F_2 := \sum_{i=1}^n m_i^2.$$
Let $h : [n] \to \{+1, -1\}$ where, for any $i \in [n]$,
$$h(i) = \begin{cases} +1 & \text{w.p. } 1/2, \\ -1 & \text{w.p. } 1/2, \end{cases}$$
chosen independently. Consider the following algorithm: Start with $Y = 0$. After reading each $x_i$, let $Y = Y + h(x_i)$. Return $Y^2$.

Before analyzing the algorithm, let us study two extreme cases. First assume that $x_1 = x_2 = \dots = x_m$. Then $|Y| = m$ and $Y^2 = m^2$, as desired. Now assume that $x_1, x_2, \dots, x_m$ are mutually distinct; then the distribution of $Y$ is the same as a random walk of length $m$, so by the previous observation $|Y| \approx \sqrt{m}$ and $Y^2 \approx m$, as desired.

Lemma 4.2. $Y^2$ is an unbiased estimator of $F_2$, i.e., $\mathbb{E}[Y^2] = F_2$.

Proof. First, observe that $Y = \sum_i m_i h(i)$. Therefore,
$$\mathbb{E}[Y^2] = \mathbb{E}\Big[\sum_{i,j} m_i m_j h(i) h(j)\Big] = \sum_{i,j} m_i m_j\, \mathbb{E}[h(i)h(j)] = \sum_i m_i^2\, \mathbb{E}[h(i)^2] = \sum_i m_i^2 = F_2,$$
where the second-to-last equality uses that $h(i)$ is independent of $h(j)$ for all $i \neq j$.

Now, all we need to do is to estimate the expectation of $Y^2$ within a $1 \pm \epsilon$ factor. By Chebyshev's inequality, all we need to show is that $Y^2$ has a small variance.
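Before bounding the variance, here is what a single copy of the estimator looks like in Python (our own sketch, not from the notes). For clarity, $h$ is tabulated lazily as a truly random $\pm 1$ assignment, which uses more memory than the 4-wise independent construction the notes ultimately rely on.

```python
import random

def f2_single_estimate(stream):
    """One copy of the estimator: Y = sum_i m_i * h(i); return Y^2 (unbiased for F_2).

    h assigns an independent uniform +-1 to each i in [n], as assumed in Lemma 4.2;
    a 4-wise independent h (previous sketch) would suffice.
    """
    sign = {}                            # lazily tabulated h: [n] -> {+1, -1}
    def h(i):
        if i not in sign:
            sign[i] = random.choice((+1, -1))
        return sign[i]

    Y = 0
    for x in stream:                     # one pass over the stream
        Y += h(x)
    return Y * Y

stream = [1, 3, 1, 2, 1, 3]              # m_1 = 3, m_2 = 1, m_3 = 2, so F_2 = 14
print(f2_single_estimate(stream))        # a single, high-variance estimate of 14
```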

Lemma 4.3. $\mathrm{Var}(Y^2) \le 2\,\mathbb{E}[Y^2]^2$.

Proof. First, we calculate $\mathbb{E}[Y^4]$. The idea is similar to before; we just use the independence of the $h(i)$'s:
$$\mathbb{E}[Y^4] = \mathbb{E}\Big[\sum_{i,j,k,l} m_i m_j m_k m_l\, h(i)h(j)h(k)h(l)\Big] = \sum_{i,j,k,l} m_i m_j m_k m_l\, \mathbb{E}[h(i)h(j)h(k)h(l)] = \sum_i m_i^4\, \mathbb{E}[h(i)^4] + 6\sum_{i<j} m_i^2 m_j^2\, \mathbb{E}[h(i)^2 h(j)^2].$$
To see the last equality, observe that for any 4-tuple $i, j, k, l$, $\mathbb{E}[h(i)h(j)h(k)h(l)]$ is nonzero only if each index shows up an even number of times. In other words, there are only two cases where $\mathbb{E}[h(i)h(j)h(k)h(l)]$ is nonzero: (1) when $i = j = k = l$, and (2) when two of these four indices are equal and the other two are also equal. Since for each $i$, $\mathbb{E}[h(i)^2] = \mathbb{E}[h(i)^4] = 1$, we have
$$\mathbb{E}[Y^4] = \sum_{i=1}^n m_i^4 + 6\sum_{i<j} m_i^2 m_j^2.$$
Now, using Lemma 4.2, we can write
$$\mathrm{Var}(Y^2) = \mathbb{E}[Y^4] - \mathbb{E}[Y^2]^2 = 4\sum_{i<j} m_i^2 m_j^2 \le 2\,\mathbb{E}[Y^2]^2,$$
as desired.

Now, all we need to do is to use independent samples of $Y^2$ to reduce the variance. Suppose we take $k$ independent samples of $Y^2$ using $k$ independently chosen hash functions $h_1, \dots, h_k$, i.e., we run the following algorithm: Start with $Y_1 = Y_2 = \dots = Y_k = 0$. After reading $x_i$, let $Y_j = Y_j + h_j(x_i)$ for all $1 \le j \le k$. Then,
$$\mathrm{Var}\Big(\frac{1}{k}(Y_1^2 + \dots + Y_k^2)\Big) = \frac{1}{k}\,\mathrm{Var}(Y^2).$$
Therefore, by Chebyshev's inequality, we can write
$$\Pr\Big[\Big|\frac{1}{k}\sum_{j=1}^k Y_j^2 - \mathbb{E}[Y^2]\Big| \ge \epsilon\,\mathbb{E}[Y^2]\Big] \le \frac{\mathrm{Var}\big(\frac{1}{k}\sum_{j=1}^k Y_j^2\big)}{\epsilon^2\,\mathbb{E}[Y^2]^2} \le \frac{2\,\mathbb{E}[Y^2]^2}{k\,\epsilon^2\,\mathbb{E}[Y^2]^2} = \frac{2}{\epsilon^2 k}.$$
So, $k = 25/\epsilon^2$ many samples is enough to approximate $F_2$ within a $1 \pm \epsilon$ factor with probability at least $9/10$.

Note that in the above construction we assumed that $h(\cdot)$ assigns independent values to all integers in $[n]$. But it can be seen from the proof that we only used 4-wise independence: the only place that we used independence was to show that $\mathbb{E}[h(i)h(j)h(k)h(l)] = 0$ when $i, j, k, l$ are mutually distinct, and that is of course true even if $h(\cdot)$ is just a 4-wise independent function. Taking that into account, we can run the above algorithm with space $O(\log(n)/\epsilon^2)$. In addition, we can turn the above probabilistic guarantee into a $1 - \delta$ probability using $O\big(\frac{\log(1/\delta)}{\epsilon^2}\big)$ many samples. We refrain from giving the details; for a more detailed discussion we refer to [AMS96].
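Putting the pieces together, the following sketch (ours, not from the notes) runs $k = 25/\epsilon^2$ copies of the estimator with independently chosen hash functions and averages the $Y_j^2$. The $\pm 1$ values are obtained from a 4-wise independent polynomial hash via its parity, an implementation shortcut that is only approximately unbiased (since $p$ is odd); it is used here purely for illustration.

```python
import random

def sample_pm1_hash(k=4, p=2**31 - 1):
    """A +-1 valued hash built from a random degree-(k-1) polynomial mod p.

    Mapping the value to +-1 via its parity is a shortcut for this sketch; it is
    only approximately unbiased because p is odd.
    """
    coeffs = [random.randrange(p) for _ in range(k)]
    def h(x):
        value = 0
        for a in reversed(coeffs):        # Horner's rule
            value = (value * x + a) % p
        return +1 if value % 2 == 0 else -1
    return h

def f2_estimate(stream, k):
    """Average of k independent copies of Y^2; the variance drops by a factor of k."""
    hs = [sample_pm1_hash() for _ in range(k)]
    Y = [0] * k
    for x in stream:                      # single pass: update all k counters
        for j in range(k):
            Y[j] += hs[j](x)
    return sum(y * y for y in Y) / k

stream = [1, 3, 1, 2, 1, 3] * 100         # m_1 = 300, m_2 = 100, m_3 = 200, F_2 = 140000
eps = 0.1
k = int(25 / eps**2)                      # k = 25/eps^2 copies, as in the analysis above
print(f2_estimate(stream, k))             # with prob. >= 9/10, within (1 +- eps) of 140000
```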

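The standard way to obtain the $1 - \delta$ guarantee mentioned above is to take the median of $O(\log(1/\delta))$ independent averaged estimates (median of means). The sketch below assumes the function f2_estimate from the previous block is in scope, uses an illustrative constant in the choice of the number of copies, and re-reads the stream once per copy for simplicity; a true one-pass implementation would update all counters simultaneously.

```python
import math

def f2_estimate_boosted(stream, eps, delta):
    """Median of t = O(log(1/delta)) independent averaged estimates.

    Each call to f2_estimate fails with probability at most 1/10, so by a
    Chernoff bound the median of t copies fails with probability at most delta
    for t = c*log(1/delta) with a suitable constant (c = 8 is a safe choice
    for this sketch).
    """
    stream = list(stream)                     # re-read the stream once per copy
    k = int(25 / eps**2)
    t = max(1, math.ceil(8 * math.log(1 / delta)))
    estimates = sorted(f2_estimate(stream, k) for _ in range(t))
    return estimates[t // 2]                  # the median estimate

# Example: a (1 +- 0.2)-approximation of F_2 = 1400, correct with prob. >= 0.95.
print(f2_estimate_boosted([1, 3, 1, 2, 1, 3] * 10, eps=0.2, delta=0.05))
```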
References

[AMS96] N. Alon, Y. Matias, and M. Szegedy. "The space complexity of approximating the frequency moments." In: STOC. ACM, 1996, pp. 20-29.