Declaring Independence via the Sketching of Sketches
Declaring Independence via the Sketching of Sketches
Piotr Indyk (Massachusetts Institute of Technology)
Andrew McGregor (University of California, San Diego, until August. Hire me!)
The Problem
The Center for Disease Control (CDC) has massive amounts of data on disease occurrences and their locations. How correlated is your zip code with the diseases you'll catch this year?
Sample (sub-linear time): How many samples are required to distinguish independence from being ε-far from independence? [Batu et al. 01], [Alon et al. 07], [Valiant 08]
Stream (sub-linear space): Access the pairs sequentially or online, with limited memory.
Formulation
Stream of m pairs in [n] x [n]: (3,5), (5,3), (2,7), (3,4), (7,1), (1,2), (3,9), (6,6), ...
Define empirical distributions:
Marginals: (p1, ..., pn) and (q1, ..., qn)
Joint: (r11, r12, ..., rnn)
Product: (s11, s12, ..., snn) where sij = pi qj
Question: How correlated are the first and second terms? E.g.,
L1(s-r) = Σ_{i,j} |sij - rij|
L2(s-r) = (Σ_{i,j} (sij - rij)^2)^{1/2}
I(s, r) = H(p) - H(p | q)
Previous work: can estimate L1 and L2 between marginals. [Alon, Matias, Szegedy 96], [Feigenbaum et al. 99], [Indyk 00], [Guha, Indyk, McGregor 07], [Ganguly, Cormode 07]
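For reference, the distributions and the L1 distance just defined can be computed exactly in linear space (the streaming algorithms below approximate them in sublinear space). This is only a sketch; the function and variable names are illustrative:

```python
from collections import Counter

def empirical_dists(stream, n):
    """Exact (linear-space) empirical distributions from a stream of
    pairs in [n] x [n], 1-indexed: marginals p, q, joint r, product s."""
    m = len(stream)
    p = [0.0] * (n + 1)
    q = [0.0] * (n + 1)
    r = Counter()  # missing entries read as 0
    for (i, j) in stream:
        p[i] += 1.0 / m
        q[j] += 1.0 / m
        r[(i, j)] += 1.0 / m
    s = {(i, j): p[i] * q[j] for i in range(1, n + 1) for j in range(1, n + 1)}
    return p, q, r, s

def l1_dist(s, r, n):
    """L1(s - r): zero iff the empirical joint equals the product of the marginals."""
    return sum(abs(s[(i, j)] - r[(i, j)])
               for i in range(1, n + 1) for j in range(1, n + 1))
```

For example, the perfectly correlated stream [(1,1), (2,2)] gives L1(s-r) = 1, while a stream hitting all four pairs of [2] x [2] equally gives L1(s-r) = 0.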
Our Results
Estimating L2(s-r): (1+ε)-factor approx. in Õ(ε^-2 ln δ^-1) space. Neat result extending AMS sketches.
Estimating L1(s-r): O(ln n)-factor approx. in Õ(ln δ^-1) space. Sketches of sketches and sketches/embeddings.
Other results:
L1(s-r): Additive approximations.
Mutual information: Additive but not (1+ε)-factor approx.
Distributed model: Pairs are observed by different parties.
a) Neat Result for L2 b) Sketching Sketches c) Other Results
First Attempt
Random projection: Let z ∈ {-1, +1}^(n x n) where the zij are unbiased 4-wise independent signs. [Alon, Matias, Szegedy 96]
Estimator: Suppose we can compute T = (z.r - z.s)^2.
This is correct in expectation and has small variance (writing aij = rij - sij):
E[T] = Σ_{i1,j1,i2,j2} E[z_{i1 j1} z_{i2 j2}] a_{i1 j1} a_{i2 j2} = (L2(r-s))^2
Var[T] = E[T^2] - E[T]^2 = Σ_{i1,j1,i2,j2,i3,j3,i4,j4} E[z_{i1 j1} z_{i2 j2} z_{i3 j3} z_{i4 j4}] a_{i1 j1} a_{i2 j2} a_{i3 j3} a_{i4 j4} - E[T]^2
Repeat O(ε^-2 ln δ^-1) times and take the mean.
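The "correct in expectation" claim can be sanity-checked exhaustively on a tiny instance: averaging T over all 16 sign matrices on a 2x2 example (full independence implies 4-wise independence) recovers (L2(r-s))^2 exactly, since the cross terms cancel. The values of a below are illustrative:

```python
from itertools import product

def estimator_T(z, a):
    """One copy of the estimator: T = (z.a)^2 with a_ij = r_ij - s_ij,
    since z.r - z.s = z.(r - s) by linearity."""
    return sum(z[c] * a[c] for c in a) ** 2

# toy 2x2 difference matrix a = r - s (illustrative values)
a = {(0, 0): 0.3, (0, 1): -0.1, (1, 0): -0.1, (1, 1): -0.1}
l2_squared = sum(v * v for v in a.values())

# exact E[T]: average over all 16 sign matrices z in {-1,+1}^{2x2}
cells = sorted(a)
samples = [estimator_T(dict(zip(cells, signs)), a)
           for signs in product([-1, 1], repeat=len(cells))]
mean_T = sum(samples) / len(samples)
# mean_T equals l2_squared: the cross terms cancel in expectation
```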
Computing the Estimator
Need to compute z.r and z.s.
Good news: The first term is easy:
1) Let A = 0.
2) For each stream element:
2.1) If the stream element is (i,j), then A ← A + zij/m.
Bad news: Can't compute the second term!
Good news: Use a bilinear sketch. If zij = xi yj, then
z.s = Σ_{i,j} zij sij = (x.p)(y.q) for x, y ∈ {-1, +1}^n,
i.e., the product of sketches is the sketch of the product.
Bad news: z is no longer 4-wise independent even if x and y are fully random, e.g.,
z11 z12 z21 z22 = (x1)^2 (x2)^2 (y1)^2 (y2)^2 = 1.
Still Get Low Variance
Lemma: The variance has at most tripled.
Proof: z is the outer product of x and y:

z = [ x1 y1  x2 y1  ...  xn y1
      x1 y2  x2 y2  ...  xn y2
      ...
      x1 yn  x2 yn  ...  xn yn ]

The product of four entries is biased iff the entries lie in a rectangle. Hence,

Var[T] ≤ Σ_{(i1,j1),(i2,j2),(i3,j3),(i4,j4) in a rectangle} |a_{i1 j1} a_{i2 j2} a_{i3 j3} a_{i4 j4}| ≤ 3 E[T]^2

since a rectangle is uniquely specified by a diagonal and
2 a_{i1 j1} a_{i2 j2} a_{i3 j3} a_{i4 j4} ≤ (a_{i1 j1} a_{i2 j2})^2 + (a_{i3 j3} a_{i4 j4})^2.

Less independence is also useful for range-sums. [Rusu, Dobra 06]
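The lemma can be verified exhaustively on a small instance: enumerating all sign vectors x, y gives the exact mean and variance of T under z_ij = x_i y_j. The toy matrix a below is illustrative; the bound Var[T] ≤ 3 E[T]^2 holds, and is tight for the all-ones matrix:

```python
from itertools import product

def exact_moments(a):
    """Exact mean and variance of T = (sum_ij x_i y_j a[i][j])^2 over
    all sign vectors x, y in {-1,+1}^n, i.e. under z_ij = x_i * y_j."""
    n = len(a)
    ts = []
    for x in product([-1, 1], repeat=n):
        for y in product([-1, 1], repeat=n):
            dot = sum(x[i] * y[j] * a[i][j] for i in range(n) for j in range(n))
            ts.append(dot ** 2)
    mean = sum(ts) / len(ts)
    var = sum(t * t for t in ts) / len(ts) - mean ** 2
    return mean, var

a = [[0.2, -0.1], [-0.05, 0.15]]   # toy a = r - s
mean, var = exact_moments(a)
# mean equals L2(a)^2 exactly, and var <= 3 * mean^2 per the lemma
```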
Summary of L2 Result
Thm: (1+ε)-factor approx. (w/p 1-δ) in Õ(ε^-2 ln δ^-1) space.
Proof ideas:
1) First attempt: Use the AMS technique.
2) Road block: Can't sketch the product distribution.
3) Bilinear sketch: Product of sketches was sketch of product!
4) PANIC: No longer 4-wise independent.
5) Relax: We didn't need full 4-wise independence.
a) Neat Result for L2 b) Sketching Sketches c) Other Results
L1 Result
Thm: O(ln n)-factor approx. of L1(s-r) in Õ(ln δ^-1) space.
Why not a (1+ε)-factor using Indyk's p-stable technique? [Indyk 00]
Review of L1 sketching:
Let the entries of z be Cauchy(0,1).
Compute the estimator z.a.
Repeat k = O(ε^-2 ln δ^-1) times with a different z.
Take the median and appeal to concentration lemmas.
N.B. If the median were a mean, we'd have an L1 dimensionality-reduction result that doesn't exist. [Brinkman, Charikar 03]
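The review above rests on 1-stability: z.a with i.i.d. Cauchy(0,1) entries is distributed as Cauchy(0, L1(a)), and the median of |Cauchy(0,1)| is tan(pi/4) = 1, so the median of |z.a| over repetitions estimates L1(a). A minimal sketch, with illustrative parameters:

```python
import math
import random

def cauchy_l1_estimate(a, k, rng):
    """Estimate L1(a) = sum_i |a_i|: each repetition computes z.a with
    i.i.d. Cauchy(0,1) entries z; by 1-stability z.a ~ Cauchy(0, L1(a)),
    and the median of |z.a| over k repetitions concentrates at L1(a)."""
    ests = []
    for _ in range(k):
        # inverse-CDF sampling of Cauchy(0,1)
        z = [math.tan(math.pi * (rng.random() - 0.5)) for _ in a]
        ests.append(abs(sum(zi * ai for zi, ai in zip(z, a))))
    ests.sort()
    return ests[len(ests) // 2]

rng = random.Random(0)
est = cauchy_l1_estimate([1.0, -2.0, 3.0], 5001, rng)   # true L1 = 6
```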
Sketching Sketches
To sketch the product distribution we need z = y M_x, where M_x is the n x n^2 block-diagonal matrix with n copies of x:

M_x = [ x 0 ... 0
        0 x ... 0
        ...
        0 0 ... x ]

Sketch in two stages:
Inner sketch: R^{n^2} → R^n, a ↦ M_x a
Outer sketch: R^n → R, M_x a ↦ y M_x a
The problem: We need to take the median of multiple inner sketches before taking the outer sketch, and the size of the inner sketch is large.
L1 Result
Thm: O(ln n)-factor approx. of L1(s-r) in Õ(ln δ^-1) space.
Proof:
Outer sketch: entries of y are Cauchy(0,1).
Inner sketch: entries of x are truncated Cauchy(0,1).
Then

Pr[ Ω(1) ≤ ||M_x a||_1 / ||a||_1 ≤ O(log n) ] ≥ 9/10

Repeat Õ(ln δ^-1) times and take the median.
a) Neat Result for L2 b) Sketching Sketches c) Other Results
Other Results
Mutual information:
Can't (1+ε)-factor approximate in o(n) space.
Can approximate to ±ε using algorithms for approximating entropy. [Chakrabarti, Cormode, McGregor 07]
Distributed model:
Player 1 sees (3,·), (5,·), (2,·), (3,·), (7,·), (1,·), (3,·), (6,·), ...
Player 2 sees (·,5), (·,3), (·,7), (·,4), (·,1), (·,2), (·,9), (·,6), ...
Very hard in general, e.g., can't even check whether L1(s-r) = 0.
Additive approximation for L1(s-r): uses the identity
L1(s-r) = Σ_i pi L1(q - q^(i))
where q^(i) is q conditioned on the first term equaling i. [Guha, McGregor, Venkatasubramanian 06]
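The identity behind the additive approximation follows from sij - rij = pi (qj - q^(i)_j), since rij = pi q^(i)_j. A direct numerical check on a toy stream (0-indexed; names are illustrative):

```python
from collections import Counter

def decomposition_sides(stream, n):
    """Return both sides of L1(s - r) = sum_i p_i * L1(q - q^(i)),
    where q^(i) is the empirical distribution of the second coordinate
    conditioned on the first coordinate equaling i."""
    m = len(stream)
    p = Counter(i for i, _ in stream)   # unnormalized marginal counts
    q = Counter(j for _, j in stream)
    r = Counter(stream)                 # unnormalized joint counts
    lhs = sum(abs((p[i] / m) * (q[j] / m) - r[(i, j)] / m)
              for i in range(n) for j in range(n))
    rhs = sum((p[i] / m) *
              sum(abs(q[j] / m - r[(i, j)] / p[i]) for j in range(n))
              for i in range(n) if p[i] > 0)
    return lhs, rhs

lhs, rhs = decomposition_sides([(0, 0), (0, 1), (1, 1), (1, 1)], 2)
# lhs == rhs == 0.5 on this stream
```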
Main Results
Can estimate L2(r-s) well using a neat extension of the AMS sketch.
Can estimate L1(r-s) up to an O(log n) factor using p-stable distributions.
Can estimate mutual information additively using entropy algorithms.
Questions?
More informationGrothendieck s Inequality
Grothendieck s Inequality Leqi Zhu 1 Introduction Let A = (A ij ) R m n be an m n matrix. Then A defines a linear operator between normed spaces (R m, p ) and (R n, q ), for 1 p, q. The (p q)-norm of A
More informationCS Foundations of Communication Complexity
CS 49 - Foundations of Communication Complexity Lecturer: Toniann Pitassi 1 The Discrepancy Method Cont d In the previous lecture we ve outlined the discrepancy method, which is a method for getting lower
More informationDensity estimation in linear time (+approximating L 1 -distances) Satyaki Mahalanabis Daniel Štefankovič. University of Rochester
Density estimation in linear time (+approximating L 1 -distances) Satyaki Mahalanabis Daniel Štefankovič University of Rochester Density estimation f 1 f 2 f 6 + DATA f 4 f f f 4 3 f 5 F = a family of
More informationRobust exponential convergence of hp-fem for singularly perturbed systems of reaction-diffusion equations
Robust exponential convergence of hp-fem for singularly perturbed systems of reaction-diffusion equations Christos Xenophontos Department of Mathematics and Statistics University of Cyprus joint work with
More information2. Matrix Algebra and Random Vectors
2. Matrix Algebra and Random Vectors 2.1 Introduction Multivariate data can be conveniently display as array of numbers. In general, a rectangular array of numbers with, for instance, n rows and p columns
More informationRandom Feature Maps for Dot Product Kernels Supplementary Material
Random Feature Maps for Dot Product Kernels Supplementary Material Purushottam Kar and Harish Karnick Indian Institute of Technology Kanpur, INDIA {purushot,hk}@cse.iitk.ac.in Abstract This document contains
More informationApproximate Counting of Cycles in Streams
Approximate Counting of Cycles in Streams Madhusudan Manjunath 1, Kurt Mehlhorn 1, Konstantinos Panagiotou 1, and He Sun 1,2 1 Max Planck Institute for Informatics, Saarbrücken, Germany 2 Fudan University,
More informationHeavy Hitters. Piotr Indyk MIT. Lecture 4
Heavy Hitters Piotr Indyk MIT Last Few Lectures Recap (last few lectures) Update a vector x Maintain a linear sketch Can compute L p norm of x (in zillion different ways) Questions: Can we do anything
More informationImproved Range-Summable Random Variable Construction Algorithms
Improved Range-Summable Random Variable Construction Algorithms A. R. Calderbank A. Gilbert K. Levchenko S. Muthukrishnan M. Strauss January 19, 2005 Abstract Range-summable universal hash functions, also
More informationNorms of Random Matrices & Low-Rank via Sampling
CS369M: Algorithms for Modern Massive Data Set Analysis Lecture 4-10/05/2009 Norms of Random Matrices & Low-Rank via Sampling Lecturer: Michael Mahoney Scribes: Jacob Bien and Noah Youngs *Unedited Notes
More informationTHE LIGHT BULB PROBLEM 1
THE LIGHT BULB PROBLEM 1 Ramamohan Paturi 2 Sanguthevar Rajasekaran 3 John Reif 3 Univ. of California, San Diego Univ. of Pennsylvania Duke University 1 Running Title: The Light Bulb Problem Corresponding
More informationCR-precis: A deterministic summary structure for update data streams
CR-precis: A deterministic summary structure for update data streams Sumit Ganguly 1 and Anirban Majumder 2 1 Indian Institute of Technology, Kanpur 2 Lucent Technologies, Bangalore Abstract. We present
More informationAd Placement Strategies
Case Study 1: Estimating Click Probabilities Tackling an Unknown Number of Features with Sketching Machine Learning for Big Data CSE547/STAT548, University of Washington Emily Fox 2014 Emily Fox January
More informationOrthogonal tensor decomposition
Orthogonal tensor decomposition Daniel Hsu Columbia University Largely based on 2012 arxiv report Tensor decompositions for learning latent variable models, with Anandkumar, Ge, Kakade, and Telgarsky.
More informationAn Improved Frequent Items Algorithm with Applications to Web Caching
An Improved Frequent Items Algorithm with Applications to Web Caching Kevin Chen and Satish Rao UC Berkeley Abstract. We present a simple, intuitive algorithm for the problem of finding an approximate
More informationSpring 2012 Math 541B Exam 1
Spring 2012 Math 541B Exam 1 1. A sample of size n is drawn without replacement from an urn containing N balls, m of which are red and N m are black; the balls are otherwise indistinguishable. Let X denote
More informationTurnstile Streaming Algorithms Might as Well Be Linear Sketches
Turnstile Streaming Algorithms Might as Well Be Linear Sketches Yi Li Max-Planck Institute for Informatics yli@mpi-inf.mpg.de David P. Woodruff IBM Research, Almaden dpwoodru@us.ibm.com Huy L. Nguy ên
More informationReview. DS GA 1002 Statistical and Mathematical Models. Carlos Fernandez-Granda
Review DS GA 1002 Statistical and Mathematical Models http://www.cims.nyu.edu/~cfgranda/pages/dsga1002_fall16 Carlos Fernandez-Granda Probability and statistics Probability: Framework for dealing with
More informationCS246: Mining Massive Datasets Jure Leskovec, Stanford University
CS246: Mining Massive Datasets Jure Leskovec, Stanford University http://cs246.stanford.edu 2/26/2013 Jure Leskovec, Stanford CS246: Mining Massive Datasets, http://cs246.stanford.edu 2 More algorithms
More information4/26/2017. More algorithms for streams: Each element of data stream is a tuple Given a list of keys S Determine which tuples of stream are in S
Note to other teachers and users of these slides: We would be delighted if you found this our material useful in giving your own lectures. Feel free to use these slides verbatim, or to modify them to fit
More informationStream-Order and Order-Statistics
Stream-Order and Order-Statistics Sudipto Guha Andrew McGregor Abstract When processing a data-stream in small space, how important is the order in which the data arrives? Are there problems that are unsolvable
More informationThe main results about probability measures are the following two facts:
Chapter 2 Probability measures The main results about probability measures are the following two facts: Theorem 2.1 (extension). If P is a (continuous) probability measure on a field F 0 then it has a
More informationAlgorithms for Querying Noisy Distributed/Streaming Datasets
Algorithms for Querying Noisy Distributed/Streaming Datasets Qin Zhang Indiana University Bloomington Sublinear Algo Workshop @ JHU Jan 9, 2016 1-1 The big data models The streaming model (Alon, Matias
More informationUniformly X Intersecting Families. Noga Alon and Eyal Lubetzky
Uniformly X Intersecting Families Noga Alon and Eyal Lubetzky April 2007 X Intersecting Families Let A, B denote two families of subsets of [n]. The pair (A,B) is called iff `-cross-intersecting A B =`
More information14.1 Finding frequent elements in stream
Chapter 14 Streaming Data Model 14.1 Finding frequent elements in stream A very useful statistics for many applications is to keep track of elements that occur more frequently. It can come in many flavours
More informationExponential Metrics. Lecture by Sarah Cannon and Emma Cohen. November 18th, 2014
Exponential Metrics Lecture by Sarah Cannon and Emma Cohen November 18th, 2014 1 Background The coupling theorem we proved in class: Theorem 1.1. Let φ t be a random variable [distance metric] satisfying:
More informationLecture 5. Max-cut, Expansion and Grothendieck s Inequality
CS369H: Hierarchies of Integer Programming Relaxations Spring 2016-2017 Lecture 5. Max-cut, Expansion and Grothendieck s Inequality Professor Moses Charikar Scribes: Kiran Shiragur Overview Here we derive
More informationOnline (and Distributed) Learning with Information Constraints. Ohad Shamir
Online (and Distributed) Learning with Information Constraints Ohad Shamir Weizmann Institute of Science Online Algorithms and Learning Workshop Leiden, November 2014 Ohad Shamir Learning with Information
More informationSuccinct Approximate Counting of Skewed Data
Succinct Approximate Counting of Skewed Data David Talbot Google Inc., Mountain View, CA, USA talbot@google.com Abstract Practical data analysis relies on the ability to count observations of objects succinctly
More informationNear-Optimal Algorithms for Maximum Constraint Satisfaction Problems
Near-Optimal Algorithms for Maximum Constraint Satisfaction Problems Moses Charikar Konstantin Makarychev Yury Makarychev Princeton University Abstract In this paper we present approximation algorithms
More informationConditional Independence
H. Nooitgedagt Conditional Independence Bachelorscriptie, 18 augustus 2008 Scriptiebegeleider: prof.dr. R. Gill Mathematisch Instituut, Universiteit Leiden Preface In this thesis I ll discuss Conditional
More information