Declaring Independence via the Sketching of Sketches. Until August Hire Me!

1 Declaring Independence via the Sketching of Sketches
Piotr Indyk, Massachusetts Institute of Technology
Andrew McGregor, University of California, San Diego ("Until August. Hire Me!")

5 The Problem
The Centers for Disease Control (CDC) has massive amounts of data on disease occurrences and their locations. How correlated is your zip code with the diseases you'll catch this year?
Sample (sub-linear time): how many samples are required to distinguish independence from being ε-far from independence? [Batu et al. 01], [Alon et al. 07], [Valiant 08]
Stream (sub-linear space): access the pairs sequentially or online, using limited memory.

11 Formulation
Stream of m pairs in [n] x [n]: (3,5), (5,3), (2,7), (3,4), (7,1), (1,2), (3,9), (6,6), ...
Define the empirical distributions:
  Marginals: p = (p_1, ..., p_n) and q = (q_1, ..., q_n)
  Joint: r = (r_11, r_12, ..., r_nn)
  Product: s = (s_11, s_12, ..., s_nn) where s_ij = p_i q_j
Question: how correlated are the first and second terms? E.g.,
  L1(s - r) = Σ_{i,j} |s_ij - r_ij|
  L2(s - r) = (Σ_{i,j} (s_ij - r_ij)²)^{1/2}
  I(s, r) = H(p) - H(p | q)
Previous work: can estimate L1 and L2 between the marginals. [Alon, Matias, Szegedy 96], [Feigenbaum et al. 99], [Indyk 00], [Guha, Indyk, McGregor 07], [Ganguly, Cormode 07]
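
The definitions above can be checked directly offline (a brute-force, linear-space baseline; the whole point of the talk is to avoid materializing r and s, and the function names here are mine, with pairs assumed 1-based):

```python
def empirical_dists(stream, n):
    """Empirical marginals p, q, joint r, and product s from a stream of pairs in [n] x [n]."""
    m = len(stream)
    p, q = [0.0] * n, [0.0] * n
    r = [[0.0] * n for _ in range(n)]
    for i, j in stream:                     # pairs are 1-based
        p[i - 1] += 1.0 / m
        q[j - 1] += 1.0 / m
        r[i - 1][j - 1] += 1.0 / m
    s = [[p[i] * q[j] for j in range(n)] for i in range(n)]
    return p, q, r, s

def l1_dist(s, r):
    """L1(s - r): zero iff the two coordinates are (empirically) independent."""
    n = len(s)
    return sum(abs(s[i][j] - r[i][j]) for i in range(n) for j in range(n))

stream = [(3, 5), (5, 3), (2, 7), (3, 4), (7, 1), (1, 2), (3, 9), (6, 6)]
p, q, r, s = empirical_dists(stream, 9)
dist = l1_dist(s, r)   # > 0: this short stream is far from independent
```

This takes O(n²) space; the results below get estimates of these distances in space polylogarithmic in n.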

15 Our Results
Estimating L2(s - r): (1+ε)-factor approximation in Õ(ε⁻² ln δ⁻¹) space. A neat result extending AMS sketches.
Estimating L1(s - r): O(ln n)-factor approximation in Õ(ln δ⁻¹) space. Sketches of sketches and sketches/embeddings.
Other results:
  L1(s - r): additive approximations.
  Mutual information: additive but not (1+ε)-factor approximation.
  Distributed model: the pairs are observed by different parties.

16 a) Neat Result for L2 b) Sketching Sketches c) Other Results

25 First Attempt
Random projection: let z ∈ {-1, 1}^{n x n} where the z_ij are unbiased and 4-wise independent. [Alon, Matias, Szegedy 96]
Estimator: suppose we can compute T = (z.r - z.s)².
This is correct in expectation and has small variance (writing a_ij = r_ij - s_ij):
  E[T] = Σ_{i1,j1,i2,j2} E[z_{i1 j1} z_{i2 j2}] a_{i1 j1} a_{i2 j2} = (L2(r - s))²
  Var[T] ≤ E[T²] = Σ_{i1,j1,i2,j2,i3,j3,i4,j4} E[z_{i1 j1} z_{i2 j2} z_{i3 j3} z_{i4 j4}] a_{i1 j1} a_{i2 j2} a_{i3 j3} a_{i4 j4} ≤ 3 E[T]²
Repeat O(ε⁻² ln δ⁻¹) times and take the mean.
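
A small simulation of this estimator (a sketch: fully independent random signs stand in for 4-wise independence, and the vector `a` is a made-up stand-in for the flattened a_ij = r_ij - s_ij):

```python
import random

def ams_mean_estimate(a, reps, rng):
    """Average of T = (z.a)^2 over independent random sign vectors z.
    Fully random signs are used for simplicity; 4-wise independence would suffice."""
    total = 0.0
    for _ in range(reps):
        z = [rng.choice((-1.0, 1.0)) for _ in a]
        total += sum(zi * ai for zi, ai in zip(z, a)) ** 2
    return total / reps

rng = random.Random(0)
a = [0.3, -0.1, 0.2, -0.4]                    # stand-in for the flattened r - s
exact = sum(x * x for x in a)                 # (L2(r - s))^2 = 0.30
estimate = ams_mean_estimate(a, 20000, rng)   # concentrates around 0.30
```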

31 Computing Estimator
Need to compute z.r and z.s.
Good news: the first term is easy:
  1) Let A = 0
  2) For each stream element:
     2.1) If the stream element is (i,j) then A ← A + z_ij/m
Bad news: we can't compute the second term!
Good news: use a bilinear sketch. If z_ij = x_i y_j for x, y ∈ {-1, 1}^n, then
  z.s = Σ_{i,j} z_ij s_ij = (x.p)(y.q)
i.e., the product of the sketches is a sketch of the product.
Bad news: z is no longer 4-wise independent, even if x and y are fully random, e.g.,
  z_11 z_12 z_21 z_22 = (x_1)² (x_2)² (y_1)² (y_2)² = 1
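
The one-pass computation described above fits in a few lines (a sketch, assuming 1-based pairs and fully random signs; `bilinear_T` and the toy streams are illustrative):

```python
import random

def bilinear_T(stream, n, rng):
    """One copy of T = (z.r - z.s)^2 with z_ij = x_i * y_j, computed in one pass:
    z.r is accumulated online, and z.s = (x.p)(y.q) factors into two marginal sketches."""
    x = [rng.choice((-1.0, 1.0)) for _ in range(n)]
    y = [rng.choice((-1.0, 1.0)) for _ in range(n)]
    m = len(stream)
    zr = xp = yq = 0.0
    for i, j in stream:                 # pairs are 1-based
        zr += x[i - 1] * y[j - 1] / m   # sketch of the joint r
        xp += x[i - 1] / m              # sketch of the marginal p
        yq += y[j - 1] / m              # sketch of the marginal q
    return (zr - xp * yq) ** 2

rng = random.Random(1)
indep = [(1, 1), (1, 2), (2, 1), (2, 2)]   # empirically independent: r = s, so T = 0
corr = [(1, 1), (2, 2)]                    # perfectly correlated: (L2(r - s))^2 = 0.25
t_indep = bilinear_T(indep, 2, rng)
t_corr = sum(bilinear_T(corr, 2, rng) for _ in range(2000)) / 2000
```

Each copy stores only x, y, and three running scalars; the n² signs z_ij are never written down, which is the point of the bilinear trick.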

39 Still Get Low Variance
Lemma: the variance has at most tripled.
Proof: write z as the n x n matrix
  z = [ x_1 y_1  x_2 y_1  ...  x_n y_1 ]
      [ x_1 y_2  x_2 y_2  ...  x_n y_2 ]
      [   ...      ...    ...    ...   ]
      [ x_1 y_n  x_2 y_n  ...  x_n y_n ]
The product of four entries is biased iff the four entries lie in a rectangle. Hence,
  Var[T] ≤ Σ_{(i1,j1),(i2,j2),(i3,j3),(i4,j4) in a rectangle} |a_{i1 j1} a_{i2 j2} a_{i3 j3} a_{i4 j4}| ≤ 3 E[T]²
since a rectangle is uniquely specified by a diagonal and
  2 a_{i1 j1} a_{i2 j2} a_{i3 j3} a_{i4 j4} ≤ (a_{i1 j1} a_{i2 j2})² + (a_{i3 j3} a_{i4 j4})²
Less independence is also useful for range-sums. [Rusu, Dobra 06]

40 Summary of L2 Result
Thm: (1+ε)-factor approximation (w/p 1-δ) in Õ(ε⁻² ln δ⁻¹) space.
Proof ideas:
  1) First attempt: use the AMS technique.
  2) Road block: we can't sketch the product distribution.
  3) Bilinear sketch: the product of sketches was a sketch of the product!
  4) PANIC: z is no longer 4-wise independent.
  5) Relax: we didn't need full 4-wise independence.

41 a) Neat Result for L2 b) Sketching Sketches c) Other Results

46 L1 Result
Thm: O(ln n)-factor approximation of L1(s - r) in Õ(ln δ⁻¹) space.
Why not a (1+ε)-factor approximation using Indyk's p-stable technique? [Indyk 00]
Review of L1 sketching:
  Let the entries of z be Cauchy(0,1).
  Compute the estimator z.a.
  Repeat k = O(ε⁻² ln δ⁻¹) times with different z.
  Take the median and appeal to concentration lemmas.
N.B. If the median were a mean, we'd have a dimensionality-reduction result that doesn't exist. [Brinkman, Charikar 03]
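
The review above can be sketched as follows (Cauchy(0,1) samples drawn via the inverse CDF; the function name and the test vector are mine):

```python
import math
import random

def cauchy_l1_estimate(a, k, rng):
    """Indyk-style L1 sketch: when the entries of z are independent Cauchy(0,1),
    z.a is Cauchy with scale ||a||_1, so the median of |z.a| over k copies
    estimates ||a||_1."""
    samples = []
    for _ in range(k):
        z = [math.tan(math.pi * (rng.random() - 0.5)) for _ in a]  # Cauchy(0,1) draws
        samples.append(abs(sum(zi * ai for zi, ai in zip(z, a))))
    samples.sort()
    return samples[k // 2]

rng = random.Random(2)
a = [0.5, -0.25, 0.25]                   # ||a||_1 = 1.0
est = cauchy_l1_estimate(a, 2001, rng)   # close to 1.0
```

The median (rather than a mean) is essential, since a Cauchy variable has no mean, and it is exactly this median step that makes composing two such sketches awkward, as the next slides discuss.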

52 Sketching Sketches
To sketch the product distribution we need z = y M_x, where M_x is the n x n² block matrix with a copy of x in each diagonal block:
  M_x = [ x 0 ... 0 ]
        [ 0 x ... 0 ]
        [   ...     ]
        [ 0 0 ... x ]
Sketch of sketches:
  Inner sketch: R^{n²} → R^n, a ↦ M_x a
  Outer sketch: R^n → R, M_x a ↦ y.(M_x a)
The problem: we need to take the median of multiple inner sketches before taking the outer sketch, and the inner sketch is large.
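
The factorization in this picture can be checked numerically (a toy sketch; the entries of A, x, and y are arbitrary, not drawn from any particular distribution):

```python
def sketch_of_sketch(A, x, y):
    """Inner sketch maps the n^2 entries of a (the rows of A) to the n values
    A x = M_x a; the outer sketch maps those n values to the scalar y.(M_x a)."""
    inner = [sum(A[i][j] * x[j] for j in range(len(x))) for i in range(len(A))]
    return sum(y[i] * inner[i] for i in range(len(y)))

A = [[0.1, 0.2], [0.3, 0.4]]   # a in R^{n^2}, written as an n x n matrix
x = [1.0, -2.0]
y = [0.5, 3.0]
# The single big sketch z.a with z_ij = y_i * x_j:
combined = sum(y[i] * x[j] * A[i][j] for i in range(2) for j in range(2))
# sketch_of_sketch(A, x, y) agrees with combined, up to float rounding
```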

57 L1 Result
Thm: O(ln n)-factor approximation of L1(s - r) in Õ(ln δ⁻¹) space.
Proof:
  Outer sketch: the entries of y are Cauchy(0,1).
  Inner sketch: the entries of x are truncated Cauchy(0,1).
  Then Pr[ Ω(1)·||a||_1 ≤ ||M_x a||_1 ≤ O(log n)·||a||_1 ] ≥ 9/10.
Repeat Õ(ln δ⁻¹) times and take the median.

58 a) Neat Result for L2 b) Sketching Sketches c) Other Results

62 Other Results
Mutual information:
  Can't (1+ε)-factor approximate in o(n) space.
  Can approximate to within ±ε using algorithms for approximating entropy. [Chakrabarti, Cormode, McGregor 07]
Distributed model:
  Player 1 sees (3,.), (5,.), (2,.), (3,.), (7,.), (1,.), (3,.), (6,.), ...
  Player 2 sees (.,5), (.,3), (.,7), (.,4), (.,1), (.,2), (.,9), (.,6), ...
  Very hard in general, e.g., we can't even check whether L1(s - r) = 0.
Additive approximation for L1(s - r):
  L1(s - r) = Σ_i p_i L1(q - q_i), where q_i is q conditioned on the first term equaling i. [Guha, McGregor, Venkatasubramanian 06]
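
The identity used for the additive approximation can be sanity-checked numerically (a toy check; the joint distribution r here is made up):

```python
def l1_product_vs_joint(r):
    """Check L1(s - r) = sum_i p_i * L1(q - q_i), where q_i is the distribution of
    the second coordinate conditioned on the first coordinate equaling i."""
    n = len(r)
    p = [sum(row) for row in r]
    q = [sum(r[i][j] for i in range(n)) for j in range(n)]
    lhs = sum(abs(p[i] * q[j] - r[i][j]) for i in range(n) for j in range(n))
    rhs = sum(p[i] * sum(abs(q[j] - r[i][j] / p[i]) for j in range(n))
              for i in range(n) if p[i] > 0)
    return lhs, rhs

lhs, rhs = l1_product_vs_joint([[0.4, 0.1], [0.1, 0.4]])
# lhs and rhs both come to 0.6 here, up to rounding
```

The identity holds because |p_i q_j - r_ij| = p_i |q_j - r_ij / p_i| term by term.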

63 Main Results
Can estimate L2(r - s) well using a neat extension of the AMS sketch.
Can estimate L1(r - s) up to an O(log n) factor using p-stable distributions.
Can estimate mutual information additively using entropy algorithms.
Questions?


More information

Similarity Search in High Dimensions II. Piotr Indyk MIT

Similarity Search in High Dimensions II. Piotr Indyk MIT Similarity Search in High Dimensions II Piotr Indyk MIT Approximate Near(est) Neighbor c-approximate Nearest Neighbor: build data structure which, for any query q returns p P, p-q cr, where r is the distance

More information

Nonlinear Estimators and Tail Bounds for Dimension Reduction in l 1 Using Cauchy Random Projections

Nonlinear Estimators and Tail Bounds for Dimension Reduction in l 1 Using Cauchy Random Projections Journal of Machine Learning Research 007 Submitted 0/06; Published Nonlinear Estimators and Tail Bounds for Dimension Reduction in l Using Cauchy Random Projections Ping Li Department of Statistical Science

More information

Trusting the Cloud with Practical Interactive Proofs

Trusting the Cloud with Practical Interactive Proofs Trusting the Cloud with Practical Interactive Proofs Graham Cormode G.Cormode@warwick.ac.uk Amit Chakrabarti (Dartmouth) Andrew McGregor (U Mass Amherst) Justin Thaler (Harvard/Yahoo!) Suresh Venkatasubramanian

More information

Beyond Distinct Counting: LogLog Composable Sketches of frequency statistics

Beyond Distinct Counting: LogLog Composable Sketches of frequency statistics Beyond Distinct Counting: LogLog Composable Sketches of frequency statistics Edith Cohen Google Research Tel Aviv University TOCA-SV 5/12/2017 8 Un-aggregated data 3 Data element e D has key and value

More information

Graph Sparsification I : Effective Resistance Sampling

Graph Sparsification I : Effective Resistance Sampling Graph Sparsification I : Effective Resistance Sampling Nikhil Srivastava Microsoft Research India Simons Institute, August 26 2014 Graphs G G=(V,E,w) undirected V = n w: E R + Sparsification Approximate

More information

b. ( ) ( ) ( ) ( ) ( ) 5. Independence: Two events (A & B) are independent if one of the conditions listed below is satisfied; ( ) ( ) ( )

b. ( ) ( ) ( ) ( ) ( ) 5. Independence: Two events (A & B) are independent if one of the conditions listed below is satisfied; ( ) ( ) ( ) 1. Set a. b. 2. Definitions a. Random Experiment: An experiment that can result in different outcomes, even though it is performed under the same conditions and in the same manner. b. Sample Space: This

More information

Finding Frequent Items in Data Streams

Finding Frequent Items in Data Streams Finding Frequent Items in Data Streams Moses Charikar 1, Kevin Chen 2, and Martin Farach-Colton 3 1 Princeton University moses@cs.princeton.edu 2 UC Berkeley kevinc@cs.berkeley.edu 3 Rutgers University

More information

Lecture 15: Random Projections

Lecture 15: Random Projections Lecture 15: Random Projections Introduction to Learning and Analysis of Big Data Kontorovich and Sabato (BGU) Lecture 15 1 / 11 Review of PCA Unsupervised learning technique Performs dimensionality reduction

More information

Tail Inequalities. The Chernoff bound works for random variables that are a sum of indicator variables with the same distribution (Bernoulli trials).

Tail Inequalities. The Chernoff bound works for random variables that are a sum of indicator variables with the same distribution (Bernoulli trials). Tail Inequalities William Hunt Lane Department of Computer Science and Electrical Engineering, West Virginia University, Morgantown, WV William.Hunt@mail.wvu.edu Introduction In this chapter, we are interested

More information

Algorithms for Calculating Statistical Properties on Moving Points

Algorithms for Calculating Statistical Properties on Moving Points Algorithms for Calculating Statistical Properties on Moving Points Dissertation Proposal Sorelle Friedler Committee: David Mount (Chair), William Gasarch Samir Khuller, Amitabh Varshney January 14, 2009

More information

Second main application of Chernoff: analysis of load balancing. Already saw balls in bins example. oblivious algorithms only consider self packet.

Second main application of Chernoff: analysis of load balancing. Already saw balls in bins example. oblivious algorithms only consider self packet. Routing Second main application of Chernoff: analysis of load balancing. Already saw balls in bins example synchronous message passing bidirectional links, one message per step queues on links permutation

More information

CSC2420 Fall 2012: Algorithm Design, Analysis and Theory Lecture 9

CSC2420 Fall 2012: Algorithm Design, Analysis and Theory Lecture 9 CSC2420 Fall 2012: Algorithm Design, Analysis and Theory Lecture 9 Allan Borodin March 13, 2016 1 / 33 Lecture 9 Announcements 1 I have the assignments graded by Lalla. 2 I have now posted five questions

More information

Grothendieck s Inequality

Grothendieck s Inequality Grothendieck s Inequality Leqi Zhu 1 Introduction Let A = (A ij ) R m n be an m n matrix. Then A defines a linear operator between normed spaces (R m, p ) and (R n, q ), for 1 p, q. The (p q)-norm of A

More information

CS Foundations of Communication Complexity

CS Foundations of Communication Complexity CS 49 - Foundations of Communication Complexity Lecturer: Toniann Pitassi 1 The Discrepancy Method Cont d In the previous lecture we ve outlined the discrepancy method, which is a method for getting lower

More information

Density estimation in linear time (+approximating L 1 -distances) Satyaki Mahalanabis Daniel Štefankovič. University of Rochester

Density estimation in linear time (+approximating L 1 -distances) Satyaki Mahalanabis Daniel Štefankovič. University of Rochester Density estimation in linear time (+approximating L 1 -distances) Satyaki Mahalanabis Daniel Štefankovič University of Rochester Density estimation f 1 f 2 f 6 + DATA f 4 f f f 4 3 f 5 F = a family of

More information

Robust exponential convergence of hp-fem for singularly perturbed systems of reaction-diffusion equations

Robust exponential convergence of hp-fem for singularly perturbed systems of reaction-diffusion equations Robust exponential convergence of hp-fem for singularly perturbed systems of reaction-diffusion equations Christos Xenophontos Department of Mathematics and Statistics University of Cyprus joint work with

More information

2. Matrix Algebra and Random Vectors

2. Matrix Algebra and Random Vectors 2. Matrix Algebra and Random Vectors 2.1 Introduction Multivariate data can be conveniently display as array of numbers. In general, a rectangular array of numbers with, for instance, n rows and p columns

More information

Random Feature Maps for Dot Product Kernels Supplementary Material

Random Feature Maps for Dot Product Kernels Supplementary Material Random Feature Maps for Dot Product Kernels Supplementary Material Purushottam Kar and Harish Karnick Indian Institute of Technology Kanpur, INDIA {purushot,hk}@cse.iitk.ac.in Abstract This document contains

More information

Approximate Counting of Cycles in Streams

Approximate Counting of Cycles in Streams Approximate Counting of Cycles in Streams Madhusudan Manjunath 1, Kurt Mehlhorn 1, Konstantinos Panagiotou 1, and He Sun 1,2 1 Max Planck Institute for Informatics, Saarbrücken, Germany 2 Fudan University,

More information

Heavy Hitters. Piotr Indyk MIT. Lecture 4

Heavy Hitters. Piotr Indyk MIT. Lecture 4 Heavy Hitters Piotr Indyk MIT Last Few Lectures Recap (last few lectures) Update a vector x Maintain a linear sketch Can compute L p norm of x (in zillion different ways) Questions: Can we do anything

More information

Improved Range-Summable Random Variable Construction Algorithms

Improved Range-Summable Random Variable Construction Algorithms Improved Range-Summable Random Variable Construction Algorithms A. R. Calderbank A. Gilbert K. Levchenko S. Muthukrishnan M. Strauss January 19, 2005 Abstract Range-summable universal hash functions, also

More information

Norms of Random Matrices & Low-Rank via Sampling

Norms of Random Matrices & Low-Rank via Sampling CS369M: Algorithms for Modern Massive Data Set Analysis Lecture 4-10/05/2009 Norms of Random Matrices & Low-Rank via Sampling Lecturer: Michael Mahoney Scribes: Jacob Bien and Noah Youngs *Unedited Notes

More information

THE LIGHT BULB PROBLEM 1

THE LIGHT BULB PROBLEM 1 THE LIGHT BULB PROBLEM 1 Ramamohan Paturi 2 Sanguthevar Rajasekaran 3 John Reif 3 Univ. of California, San Diego Univ. of Pennsylvania Duke University 1 Running Title: The Light Bulb Problem Corresponding

More information

CR-precis: A deterministic summary structure for update data streams

CR-precis: A deterministic summary structure for update data streams CR-precis: A deterministic summary structure for update data streams Sumit Ganguly 1 and Anirban Majumder 2 1 Indian Institute of Technology, Kanpur 2 Lucent Technologies, Bangalore Abstract. We present

More information

Ad Placement Strategies

Ad Placement Strategies Case Study 1: Estimating Click Probabilities Tackling an Unknown Number of Features with Sketching Machine Learning for Big Data CSE547/STAT548, University of Washington Emily Fox 2014 Emily Fox January

More information

Orthogonal tensor decomposition

Orthogonal tensor decomposition Orthogonal tensor decomposition Daniel Hsu Columbia University Largely based on 2012 arxiv report Tensor decompositions for learning latent variable models, with Anandkumar, Ge, Kakade, and Telgarsky.

More information

An Improved Frequent Items Algorithm with Applications to Web Caching

An Improved Frequent Items Algorithm with Applications to Web Caching An Improved Frequent Items Algorithm with Applications to Web Caching Kevin Chen and Satish Rao UC Berkeley Abstract. We present a simple, intuitive algorithm for the problem of finding an approximate

More information

Spring 2012 Math 541B Exam 1

Spring 2012 Math 541B Exam 1 Spring 2012 Math 541B Exam 1 1. A sample of size n is drawn without replacement from an urn containing N balls, m of which are red and N m are black; the balls are otherwise indistinguishable. Let X denote

More information

Turnstile Streaming Algorithms Might as Well Be Linear Sketches

Turnstile Streaming Algorithms Might as Well Be Linear Sketches Turnstile Streaming Algorithms Might as Well Be Linear Sketches Yi Li Max-Planck Institute for Informatics yli@mpi-inf.mpg.de David P. Woodruff IBM Research, Almaden dpwoodru@us.ibm.com Huy L. Nguy ên

More information

Review. DS GA 1002 Statistical and Mathematical Models. Carlos Fernandez-Granda

Review. DS GA 1002 Statistical and Mathematical Models.   Carlos Fernandez-Granda Review DS GA 1002 Statistical and Mathematical Models http://www.cims.nyu.edu/~cfgranda/pages/dsga1002_fall16 Carlos Fernandez-Granda Probability and statistics Probability: Framework for dealing with

More information

CS246: Mining Massive Datasets Jure Leskovec, Stanford University

CS246: Mining Massive Datasets Jure Leskovec, Stanford University CS246: Mining Massive Datasets Jure Leskovec, Stanford University http://cs246.stanford.edu 2/26/2013 Jure Leskovec, Stanford CS246: Mining Massive Datasets, http://cs246.stanford.edu 2 More algorithms

More information

4/26/2017. More algorithms for streams: Each element of data stream is a tuple Given a list of keys S Determine which tuples of stream are in S

4/26/2017. More algorithms for streams: Each element of data stream is a tuple Given a list of keys S Determine which tuples of stream are in S Note to other teachers and users of these slides: We would be delighted if you found this our material useful in giving your own lectures. Feel free to use these slides verbatim, or to modify them to fit

More information

Stream-Order and Order-Statistics

Stream-Order and Order-Statistics Stream-Order and Order-Statistics Sudipto Guha Andrew McGregor Abstract When processing a data-stream in small space, how important is the order in which the data arrives? Are there problems that are unsolvable

More information

The main results about probability measures are the following two facts:

The main results about probability measures are the following two facts: Chapter 2 Probability measures The main results about probability measures are the following two facts: Theorem 2.1 (extension). If P is a (continuous) probability measure on a field F 0 then it has a

More information

Algorithms for Querying Noisy Distributed/Streaming Datasets

Algorithms for Querying Noisy Distributed/Streaming Datasets Algorithms for Querying Noisy Distributed/Streaming Datasets Qin Zhang Indiana University Bloomington Sublinear Algo Workshop @ JHU Jan 9, 2016 1-1 The big data models The streaming model (Alon, Matias

More information

Uniformly X Intersecting Families. Noga Alon and Eyal Lubetzky

Uniformly X Intersecting Families. Noga Alon and Eyal Lubetzky Uniformly X Intersecting Families Noga Alon and Eyal Lubetzky April 2007 X Intersecting Families Let A, B denote two families of subsets of [n]. The pair (A,B) is called iff `-cross-intersecting A B =`

More information

14.1 Finding frequent elements in stream

14.1 Finding frequent elements in stream Chapter 14 Streaming Data Model 14.1 Finding frequent elements in stream A very useful statistics for many applications is to keep track of elements that occur more frequently. It can come in many flavours

More information

Exponential Metrics. Lecture by Sarah Cannon and Emma Cohen. November 18th, 2014

Exponential Metrics. Lecture by Sarah Cannon and Emma Cohen. November 18th, 2014 Exponential Metrics Lecture by Sarah Cannon and Emma Cohen November 18th, 2014 1 Background The coupling theorem we proved in class: Theorem 1.1. Let φ t be a random variable [distance metric] satisfying:

More information

Lecture 5. Max-cut, Expansion and Grothendieck s Inequality

Lecture 5. Max-cut, Expansion and Grothendieck s Inequality CS369H: Hierarchies of Integer Programming Relaxations Spring 2016-2017 Lecture 5. Max-cut, Expansion and Grothendieck s Inequality Professor Moses Charikar Scribes: Kiran Shiragur Overview Here we derive

More information

Online (and Distributed) Learning with Information Constraints. Ohad Shamir

Online (and Distributed) Learning with Information Constraints. Ohad Shamir Online (and Distributed) Learning with Information Constraints Ohad Shamir Weizmann Institute of Science Online Algorithms and Learning Workshop Leiden, November 2014 Ohad Shamir Learning with Information

More information

Succinct Approximate Counting of Skewed Data

Succinct Approximate Counting of Skewed Data Succinct Approximate Counting of Skewed Data David Talbot Google Inc., Mountain View, CA, USA talbot@google.com Abstract Practical data analysis relies on the ability to count observations of objects succinctly

More information

Near-Optimal Algorithms for Maximum Constraint Satisfaction Problems

Near-Optimal Algorithms for Maximum Constraint Satisfaction Problems Near-Optimal Algorithms for Maximum Constraint Satisfaction Problems Moses Charikar Konstantin Makarychev Yury Makarychev Princeton University Abstract In this paper we present approximation algorithms

More information

Conditional Independence

Conditional Independence H. Nooitgedagt Conditional Independence Bachelorscriptie, 18 augustus 2008 Scriptiebegeleider: prof.dr. R. Gill Mathematisch Instituut, Universiteit Leiden Preface In this thesis I ll discuss Conditional

More information