Improved Concentration Bounds for Count-Sketch


1 Improved Concentration Bounds for Count-Sketch
Gregory T. Minton (MIT → MSR New England), Eric Price (MIT → IBM Almaden → UT Austin)

2 Count-Sketch: a classic streaming algorithm [Charikar, Chen, Farach-Colton 2002]
- Solves the heavy hitters problem.
- Estimates a vector x ∈ R^n from a low-dimensional sketch Ax ∈ R^m.
- Nice algorithm: simple, and used in Google's MapReduce standard library.
[CCF02] bounds the maximum error over all coordinates. We show, for the same algorithm:
- Most coordinates have asymptotically better estimation accuracy.
- The average accuracy over many coordinates is asymptotically better with high probability.
- Experiments show our asymptotics are correct.
Caveat: we assume fully independent hash functions.

3 Outline
1. Robust Estimation of Symmetric Variables (Lemma; Relevance to Count-Sketch)
2. Electoral Colleges and Direct Elections (Lemma; Relevance to Count-Sketch)
3. Experiments!

4 Outline
1. Robust Estimation of Symmetric Variables (Lemma; Relevance to Count-Sketch)
2. Electoral Colleges and Direct Elections (Lemma; Relevance to Count-Sketch)
3. Experiments!

5 Estimating a symmetric random variable's mean
[Figure: a symmetric two-point distribution X with mass 1/2 at each of μ ± σ; mean μ, standard deviation σ.]
Unknown distribution X over R, symmetric about an unknown μ.
Given samples x_1, …, x_R ∼ X, how do we estimate μ?

6 Estimating a symmetric random variable's mean
[Figure: the same two-point distribution with mass 1/2 at each of μ ± σ.]
Unknown distribution X over R, symmetric about an unknown μ.
Given samples x_1, …, x_R ∼ X, how do we estimate μ?
- Mean: converges to μ at rate σ/√R.

7 Estimating a symmetric random variable's mean
[Figure: the same two-point distribution with mass 1/2 at each of μ ± σ.]
Unknown distribution X over R, symmetric about an unknown μ.
Given samples x_1, …, x_R ∼ X, how do we estimate μ?
- Mean: converges to μ at rate σ/√R, but has no robustness to outliers.

8 Estimating a symmetric random variable's mean
[Figure: the same two-point distribution with mass 1/2 at each of μ ± σ.]
Unknown distribution X over R, symmetric about an unknown μ.
Given samples x_1, …, x_R ∼ X, how do we estimate μ?
- Mean: converges to μ at rate σ/√R, but has no robustness to outliers.
- Median: extremely robust, but doesn't necessarily converge to μ.

9 Estimating a symmetric random variable's mean
[Figure: the distribution with atoms at μ − σ and μ + σ; the sample median doesn't converge.]
Consider instead the median of pairwise means:
  μ̂ = median over i ∈ {1, 3, 5, …} of (x_i + x_{i+1})/2.
This converges at rate O(σ/√R), even with outliers. That is: the median of (X + X′)/2 converges. [See also: the Hodges-Lehmann estimator.]
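
As a quick illustration of why this matters, here is a minimal simulation (mine, not from the talk) using the two-point distribution X = μ ± σ from the figure: the plain sample median stays about σ away from μ, while the median of disjoint pairwise means locks onto μ.

```python
import numpy as np

rng = np.random.default_rng(0)
mu, sigma, R, trials = 5.0, 1.0, 1000, 200

err_med, err_pair = [], []
for _ in range(trials):
    x = mu + sigma * rng.choice([-1.0, 1.0], size=R)   # X = mu +/- sigma, prob 1/2 each
    err_med.append(abs(np.median(x) - mu))
    pair_means = (x[0::2] + x[1::2]) / 2               # means of disjoint pairs
    err_pair.append(abs(np.median(pair_means) - mu))

print("plain sample median, mean error:", np.mean(err_med))    # stays near sigma
print("median of pairwise means, error:", np.mean(err_pair))   # essentially 0
```

For this discrete X the pairwise-mean median lands exactly on μ with high probability; for continuous distributions it converges at the O(σ/√R) rate claimed on the slide.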

10 Why does the median converge for X + X′?
WLOG μ = 0. Define the Fourier transform F_X of X:
  F_X(t) = E_{x∼X}[cos(τxt)],  where τ = 2π ≈ 6.28
(the standard Fourier transform of the PDF, specialized to symmetric X).
Convolution becomes multiplication: F_{X+X′}(t) = (F_X(t))² ≥ 0 for all t.

Theorem. Let Y be symmetric about 0 with F_Y(t) ≥ 0 for all t and E[Y²] = σ². Then for all ε ≤ 1, Pr[|Y| ≤ εσ] ≳ ε.

Standard Chernoff bounds then give: the median of y_1, …, y_R converges at rate σ/√R.

11 Proof
Theorem. Let F_Y(t) ≥ 0 for all t and E[Y²] = 1. Then for all ε ≤ 1, Pr[|Y| ≤ ε] ≳ ε.

Since cos θ ≥ 1 − θ²/2, we have F_Y(t) = E[cos(τyt)] ≥ 1 − τ²t²/2, so F_Y(t) ≥ 1/2 for |t| ≤ 1/τ. Lower-bound the indicator of [−ε, ε] by the triangle function (1 − |y|/ε)₊, whose Fourier transform is non-negative, has height Θ(ε), and is spread over |t| ≲ 1/ε. Because F_Y ≥ 0 everywhere,
  Pr[|y| ≤ ε] ≥ E[(1 − |y|/ε)₊]
is at least the contribution to the Fourier-side integral from |t| ≤ 1/τ ≤ 1/ε, which is Ω(ε).
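
A small Monte Carlo check of the theorem (my sketch, not from the talk): taking X uniform on [−1, 1], the sum Y = X + X′ has non-negative Fourier transform (the square of X's transform), and the observed Pr[|Y| ≤ εσ] is indeed Θ(ε).

```python
import numpy as np

rng = np.random.default_rng(0)
N = 10**6
y = rng.uniform(-1, 1, N) + rng.uniform(-1, 1, N)   # Y = X + X', X symmetric about 0
sigma = np.sqrt(np.mean(y**2))

for eps in [0.5, 0.2, 0.1, 0.05, 0.01]:
    p = np.mean(np.abs(y) <= eps * sigma)           # empirical Pr[|Y| <= eps*sigma]
    print(f"eps={eps:<5}: Pr[|Y| <= eps*sigma] = {p:.5f}, ratio p/eps = {p/eps:.2f}")
```

The printed ratio p/ε stays bounded away from 0 as ε shrinks, matching the Ω(ε) anti-concentration bound.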

12 Outline
1. Robust Estimation of Symmetric Variables (Lemma; Relevance to Count-Sketch)
2. Electoral Colleges and Direct Elections (Lemma; Relevance to Count-Sketch)
3. Experiments!

13 Count-Sketch
Want to estimate x ∈ R^n from a small sketch.
Hash to k buckets and sum up with random signs:
- Choose random h : [n] → [k] and s : [n] → {±1}.
- Store y_j = Σ_{i : h(i)=j} s(i) x_i.
- Estimate x_i by x̂_i = y_{h(i)} · s(i).
[Figure: an R × k array of buckets, one hash table per row.]

14 Count-Sketch
Want to estimate x ∈ R^n from a small sketch.
Hash to k buckets and sum up with random signs:
- Choose random h : [n] → [k] and s : [n] → {±1}.
- Store y_j = Σ_{i : h(i)=j} s(i) x_i.
- Estimate x_i by x̂_i = y_{h(i)} · s(i).
- Repeat R times independently and take the median.
For each row,
  x̂_i − x_i = Σ_{j ≠ i} δ_j,  where δ_j = ±x_j with probability 1/k (uniform random sign) and 0 otherwise.
This error is symmetric with a non-negative Fourier transform.
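
A minimal, self-contained sketch of the algorithm as described above (function and variable names are mine; fully independent hashing is simulated by storing explicit random tables):

```python
import numpy as np

def count_sketch(x, k, R, rng):
    """Sketch x into R independent rows of k signed buckets."""
    n = len(x)
    h = rng.integers(0, k, size=(R, n))       # row-r hash h_r : [n] -> [k]
    s = rng.choice([-1.0, 1.0], size=(R, n))  # row-r signs s_r : [n] -> {+-1}
    y = np.zeros((R, k))
    for r in range(R):
        np.add.at(y[r], h[r], s[r] * x)       # y[r, j] = sum_{i: h_r(i)=j} s_r(i) x_i
    return y, h, s

def cs_estimate(y, h, s):
    """x_hat_i = median over rows r of s_r(i) * y[r, h_r(i)]."""
    R = h.shape[0]
    per_row = s * y[np.arange(R)[:, None], h]  # shape (R, n): one estimate per row
    return np.median(per_row, axis=0)

rng = np.random.default_rng(0)
x = rng.standard_normal(10_000)
x[:10] += 100                                  # plant a few heavy hitters
y, h, s = count_sketch(x, k=200, R=11, rng=rng)
x_hat = cs_estimate(y, h, s)
print("max heavy-hitter error:", np.abs(x_hat[:10] - x[:10]).max())
```

The sketch uses R·k numbers instead of n; the heavy hitters are recovered to within roughly σ/√R, as the next slide quantifies.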

15 Count-Sketch Analysis
Let σ² = (1/k) · min_{k-sparse x^(k)} ‖x − x^(k)‖₂² be the typical error for a single row of Count-Sketch with k columns.

Theorem. For any coordinate i and all t ≤ √R,
  Pr[|x̂_i − x_i| > (t/√R) σ] ≤ e^{−Ω(t²)}.
([CCF02] is the case t = √R with R = O(log n), i.e. ‖x̂ − x‖_∞ ≤ σ w.h.p.)

Corollary. Excluding events of probability e^{−Ω(R)}, we have for each i that E[(x̂_i − x_i)²] = O(σ²/R).

16 Estimation of multiple coordinates?
What about the average error on a set S of k coordinates?
Linearity of expectation: E[‖x̂_S − x_S‖₂²] = O(kσ²/R). Does it concentrate?
  Pr[‖x̂_S − x_S‖₂² > O(kσ²/R)] < p = ???
- By expectation (Markov): p = Θ(1).
- If the errors were independent: p = e^{−Ω(k)}.
- It is a sum of many variables, but not independent ones…
- Chebyshev's inequality, bounding the covariance of the errors, is feasible to analyze (though kind of nasty). Ideally this gives p = 1/√k; we can get p = 1/k^{1/14}.
Can we at least get high probability, i.e. 1/k^c for an arbitrary constant c?

17 Boosting the error probability in a black-box manner
We know that ‖x̂_S − x_S‖₂ is small with all but k^{−1/14} probability.
A way to get all but k^{−c} probability: repeat 100c times and take the median of the results.
- With all but k^{−c} probability, more than 75c of the estimates x̂_S^(i) have small error.
- The median of the results then has error at most 3 times that small error.
But the resulting algorithm is stupid:
- Run Count-Sketch with R′ = O(cR) rows.
- Arbitrarily partition the rows into blocks of R.
- The estimate is the median (over blocks) of the median (within each block) of the individual estimates.
Can we show that the direct median is as good as this median-of-medians?
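
For concreteness, here is a sketch of the black-box median-of-medians estimator (my code, assuming per-row estimates like the per_row array in the Count-Sketch sketch above):

```python
import numpy as np

def median_of_medians(per_row_estimates, block_size):
    """Median (over blocks) of median (within each block) of per-row estimates.

    per_row_estimates: array of shape (R_total, n), one estimate of every
    coordinate from each Count-Sketch row.
    """
    R_total, n = per_row_estimates.shape
    nblocks = R_total // block_size
    blocks = per_row_estimates[:nblocks * block_size].reshape(nblocks, block_size, n)
    return np.median(np.median(blocks, axis=1), axis=0)

# The direct estimator is simply np.median(per_row_estimates, axis=0); the
# question on this slide is whether it inherits the median-of-medians bound.
```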

18 Outline
1. Robust Estimation of Symmetric Variables (Lemma; Relevance to Count-Sketch)
2. Electoral Colleges and Direct Elections (Lemma; Relevance to Count-Sketch)
3. Experiments!

19 Electoral Colleges
Suppose you have a two-party election for k offices.
- Voters come from a distribution X over {0,1}^k.
- There is a true majority slate of candidates x ∈ {0,1}^k.
- On election day, we receive ballots x_1, …, x_n ∼ X.
How do we best estimate x? For each office, compare:
- x_majority: the coordinate-wise majority over all ballots x_1, …, x_n.
- x_electoral: the majority over states (CA, TX, …) of each state's internal majority.
[Figure: the ballots x_1, …, x_n grouped into states CA, TX, ….]
Is x_majority better than x_electoral in every way? Is
  Pr[‖x_majority − x‖ > α] ≤ Pr[‖x_electoral − x‖ > α]
for all α?

20 Electoral Colleges
Is x_majority better than x_electoral in every way, so that
  Pr[‖x_majority − x‖ > α] ≤ Pr[‖x_electoral − x‖ > α]
for all α? We don't know, but:

Theorem. Pr[‖x_majority − x‖ > 3α] ≤ 4 Pr[‖x_electoral − x‖ > α] for all p-norms.

21 Proof
Theorem. Pr[‖x_majority − x‖ > 3α] ≤ 4 Pr[‖x_electoral − x‖ > α] for all p-norms.
Follows easily from:

Lemma (median³). For any x_1, …, x_n ∈ R^k,
  median over partitions into states of ( median over states of ( median within each state of x_i ) ) = median over the populace of x_i.

(With failure probability 4p, at least 3/4 of the partitions have error at most α; their median, which by the lemma is x_majority, then has error at most 3α.)
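
A toy Monte Carlo comparison of the two estimators from the election setup (my illustration; the ballot model, a true slate corrupted by i.i.d. per-vote noise, and all parameter choices are assumptions for demonstration only):

```python
import numpy as np

rng = np.random.default_rng(0)
k, n_states, state_size, trials, p_wrong = 25, 30, 30, 2000, 0.45
n = n_states * state_size
x_true = rng.integers(0, 2, size=k)                   # true slate in {0,1}^k

wrong_maj = wrong_ec = 0
for _ in range(trials):
    flip = rng.random((n, k)) < p_wrong               # each vote wrong w.p. p_wrong
    ballots = np.where(flip, 1 - x_true, x_true)
    x_maj = (ballots.mean(axis=0) > 0.5).astype(int)  # direct coordinate-wise majority
    state_win = ballots.reshape(n_states, state_size, k).mean(axis=1) > 0.5
    x_ec = (state_win.mean(axis=0) > 0.5).astype(int) # majority of state majorities
    wrong_maj += (x_maj != x_true).sum()
    wrong_ec += (x_ec != x_true).sum()

print("avg offices wrong, x_majority: ", wrong_maj / trials)
print("avg offices wrong, x_electoral:", wrong_ec / trials)
```

In this model the direct majority makes no more errors than the electoral estimate, consistent with the theorem's one-directional comparison.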

22 Outline
1. Robust Estimation of Symmetric Variables (Lemma; Relevance to Count-Sketch)
2. Electoral Colleges and Direct Elections (Lemma; Relevance to Count-Sketch)
3. Experiments!

23 Concentration for sets
We know that a median-of-medians variant of Count-Sketch would give good estimation of sets with high probability. Therefore the standard Count-Sketch does as well.

Theorem. For any constant c and any set S of coordinates,
  Pr[‖x̂_S − x_S‖₂ > O(√(|S|/R) · σ)] ≤ |S|^{−c}.

24 Outline
1. Robust Estimation of Symmetric Variables (Lemma; Relevance to Count-Sketch)
2. Electoral Colleges and Direct Elections (Lemma; Relevance to Count-Sketch)
3. Experiments!

25 Experiments
Claims:
1. Individual coordinates have error that concentrates like a Gaussian with standard deviation σ/√R.
2. Sets of k coordinates have error O(σ√(k/R)) with high probability.
We evaluate on a power-law distribution with typical parameters.
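
A sketch of this experiment under illustrative parameters (my code, reusing count_sketch and cs_estimate from the sketch after slide 14; the printed ratio normalizes the error by σ/√R, in the spirit of the m_{R,C} normalization used in the plots that follow):

```python
import numpy as np

rng = np.random.default_rng(0)
n, k = 10_000, 50
x = (1.0 + np.arange(n)) ** -0.8                   # power-law coordinate magnitudes
sigma = np.sqrt(np.sum(np.sort(x**2)[:-k]) / k)    # tail energy per bucket (slide 15)

for R in [5, 20, 80]:
    y, h, s = count_sketch(x, k=k, R=R, rng=rng)
    err = np.abs(cs_estimate(y, h, s) - x)
    ratio = np.median(err) / (sigma / np.sqrt(R))
    print(f"R={R:3d}: median |err| / (sigma/sqrt(R)) = {ratio:.2f}")
```

If claim 1 holds, the printed ratio should stay roughly constant as R grows.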

26 Experiments
Claim 1: individual coordinates have error that concentrates like a Gaussian with standard deviation σ/√R. Compare the observed error to the expected error for various R, C.
[Plot: probability density of the ratio |x̂_i − x_i| / m_{R,C}, with one curve per parameter pair, for R ∈ {20, 50, 100, 200, 500, 1000} and C ranging from 20 up to 10000.]

27 Experiments
Claim 2: sets of coordinates have error O(σ√(k/R)) with high probability (for large enough R, C). Compare the observed error to the expected error for various R, C.
[Left plot: probability density of E_k / (m_{R,C} √k) for various C ∈ {20, 50, 100, 200, 500, 1000} with n = 10000, k = 25, and fixed R. Right plot: the same for various R ∈ {10, 20, 50, 100, 200, 500, 1000, 2000} with n = 10000, k = 25, and fixed C.]

28 Conclusions
- We present an improved analysis of Count-Sketch, a classic algorithm used in practice.
- Experiments show it gives the right asymptotics.
- More applications of our lemmas? Independence?
Thank you!

