Estimating properties of distributions. Ronitt Rubinfeld MIT and Tel Aviv University

Size: px
Start display at page:

Download "Estimating properties of distributions. Ronitt Rubinfeld MIT and Tel Aviv University"

Transcription

1 Estimating properties of distributions Ronitt Rubinfeld MIT and Tel Aviv University

2 Distributions are everywhere

3 What properties do your distributions have?

4 Play the lottery? Is it independent? Is it uniform?

5 Is the lottery unfair? From Hitlotto.com: Lottery experts agree, past number histories can be the key to predicting future winners.

6 True Story! Polish lottery Multilotek Choose uniformly at random distinct 20 numbers out of 1 to 80. Initial machine biased e.g., probability of too small Past results:

7 New Jersey Pick 3,4 Lottery New Jersey Pick k ( =3,4) Lottery. Pick k digits in order. 10 k possible values. Assume lottery draws iid Data: Pick results from 5/22/75 to 10/15/00 χ 2 -test gives 42% confidence Pick results from 9/1/77 to 10/15/00. fewer results than possible values χ 2 -test gives no confidence

8 Testing Independence: Shopping patterns: Independent of zip code?

9 Testing closeness of two distributions: Transactions of yr olds Transactions of yr olds trend change?

10 Outbreak of diseases Similar patterns? Correlated with income level? More prevalent near large airports? Flu 2011 Flu 2012

11 Information in neural spike trails [Strong, Koberle, de Ruyter van Steveninck, Bialek 98] Neural signals Each application of stimuli gives sample of signal (spike trail) time Entropy of (discretised) signal indicates which neurons respond to stimuli

12 Compressibility of data

13 Worm detection find ``heavy hitters nodes that send to many distinct addresses

14 Testing properties of distributions: Decisions based on samples of distribution No assumptions on shape of distribution i.e., smoothness, monotonicity, Normal distribution, Focus on large domains Can sample complexity be sublinear in size of the domain? Rules out standard statistical techniques, learning distribution

15 Model: p p is arbitrary black-box distribution over [n], generates iid samples. samples p i = Prob[ p outputs i ] Test Sample complexity in terms of n? Pass/Fail?

16 Some properties Similarities of distributions: Testing uniformity Testing identity Testing closeness Entropy estimation Support size Independence and limited independence properties Monotonicity

17 Similarities of distributions Are p and q close or far? q is known to the tester q is uniform q is given via samples

18 Two Distances L 1 Distance: p q 1 = p i q i L 2 Distance: i ε D p q 2 = ( ( p i q i ) 2 ) 1/2 i ε D

19 Is p uniform? p Test samples Theorem: ([Goldreich Ron] [Paninski]) Sample complexity of distinguishing p = U from p U 1 > ε is θ(n 1/ 2 ) Nearly same complexity to test if p is any known distribution [Batu Fischer Fortnow Kumar R. White]: Testing identity Pass/Fail?

20 Is p uniform? p samples Theorem: ([Goldreich Ron] [Paninski]) Sample complexity of distinguishing p = U from p U 1 > ε is θ(n 1/ 2 ) Test Pass/Fail?

21 Upper bound for L 2 distance [Goldreich Ron] L 2 distance: p q 2 2 = pi q i 2 p-u 2 2 = Σ(p i -1/n) 2 = Σp i 2-2Σp i /n + Σ1/n 2 ======== = Σp i 2-1/n Estimate collision probability to estimate L 2 distance from uniform

22 Estimating collision probabilities [GR]? Naïve idea: Use independent samples to estimate collision probability Get O(k) samples of collision probability from k samples of p?? Better idea: Use all pairs in sample to estimate collision probabilities Get O(k 2 ) samples of collision probability from k samples of p Not all independent, use variance to bound accuracy??????

23 Testing uniformity [GR] Upper bound: Estimate collision probability Issues: Collision probability of uniform is 1/n Use O(sqrt(n)) samples Relation between L 1 and L 2 norms Comment: [P] uses different estimator Easy lower bound: Ω(n ½ ) Can get Ω (n ½ /ε 2 ) [P]

24 Back to the lottery plenty of samples!

25 Is p uniform? p samples Theorem: ([Goldreich Ron][Batu Fortnow R. Smith White] [Paninski]) Sample complexity of distinguishing p=u from p-u 1 >ε is θ(n 1/2 ) Test Pass/Fail? Nearly same complexity to test if p is any known distribution [Batu Fischer Fortnow Kumar R. White]: Testing identity

26 This image cannot currently be displayed. This image cannot currently be displayed. Testing identity via testing uniformity on subdomains: (Relabel domain so that q monotone) Partition domain into O(log n) groups, so that each group almost flat -- differ by <(1+ε) multiplicative factor q close to uniform over each group Test: Test that p close to uniform over each group Test that p assigns approximately correct total weights to each group q (known)

27 Testing closeness p Test q Theorem: ([BFRSW] [P. Valiant]) Sample complexity of distinguishing p=q from p-q 1 >ε ~ is θ(n 2/3 ) Pass/Fail?

28 Why so different? Collision statistics are all that matter Collisions on heavy elements can hide collision statistics of rest of the domain Construct pairs of distributions where heavy elements are identical, but light elements are either identical or very different

29 Lower bound p: n 2/3 1/2 n/4 n/4 1/2 Claim: No algorithm can distinguish (p,p) from (p,q) with only o(n 2/3 ) samples. p i =1/2n 2/3 p i =0 p i =2/n p i =0 q: 1/2 1/2 q i =1/2n 2/3 heavy elements q i =0 q i =0 light elements q i =2/n p-q 1 =1

30 Lower Bound Argument Assume algorithm is symmetric or we can permute the underlying domain. Only collision statistics matter! If we use only o(n 2/3 ) samples 3 + -way collisions are only in heavy elements and only a small fraction of these. Can focus only on 2-way collisions (one from each source)

31 Lower Bound (cont.) (p,p) case # 2-way collisions = H + L (p,q) case # 2-way collisions = H H: # 2-way collisions in heavy elements L: # 2-way collisions in light elements When sample size s=o(n 2/3 ), St.Dev.(H) = Θ(s n -1/3 ) >> E[L] = o(s n -1/3 ) = Var[L] Cannot distinguish H+L from H!!!

32 P. Valiant s characterization for symmetric properties: Collisions tell all! Canonical tester identifies if there is a distribution with the property that expects observed collision statistics Difficulty in analysis: Collision statistics aren t independent Low frequency elements can be ignored? Applies to symmetric properties with continuity condition

33 Upper Bound for L2 distance: Theorem: [Batu Fortnow R. Smith White] There is an algorithm which on input distributions p and q satisfies: if p-q 2 < ε/2 output Pass if p-q 2 > ε output Fail error probability < 1/3 sample, time complexity O(ε -4 ) No dependence on n?!?!

34 What about L 1 distance?

35 Naïve nearly linear time algorithm learning approach 1. Estimate probabilities p i and q i s using frequencies 2. Estimate L 1 distance directly Uses O(n log n) samples in general.

36 L 2 distance test Use O(ε -4 ) algorithm for estimating L 2 distance to within ε Naïve use of relationship between L 1 and L 2 distance leads to O(n 2 ) algorithm

37 Naïve nearly linear time algorithm learning approach (revisited) 1. Estimate probabilities p i and q i s using frequencies 2. Estimate L 1 distance directly Uses O(n log n) samples in general. If minimum nonzero p i or q i is > b, then need only O(1/b log n) samples. When b=n -2/3, need Õ(n 2/3 ) samples

38 L 2 distance test (revisited) Use O(ε -4 ) algorithm for estimating L 2 distance to within ε Naïve use of relationship between L 1 and L 2 distance leads to O(n 2 ) algorithm Can analyze tighter variance bound for L 2 test in terms of maximum probability b of p i and q i s. When b<n -2/3, need Õ(n 2/3 ) samples for L 1 distance

39 Algorithm Identify all ``heavy elements Estimate probabilities of heavy elements and fail if not close Filter out heavy elements Use L 2 test and fail if not close

40 Theorem: There exists an algorithm which on input distributions p and q uses O(ε -4 n 2/3 log n) samples and satisfies: if p-q 1 < ε/n 1/ 3 outputs Pass if p-q 1 > ε outputs Fail probability of error is < 1/3.

41 A historical note: Interest in [GR] and [BFRSW] sparked by search for property testers for expanders Eventual success! [Czumaj Sohler, Kale Seshadri, Nachmias Shapira] Used to give O(n 2/3 ) time property testers for rapidly mixing Markov chains [BFRSW]

42 Approximating the distance between two Estimating distributions? p q 1 requires Ө(n/log n) samples! [P. Valiant 08] [G. Valiant and P. Valiant STOC11, FOCS11]

43 Information theoretic quantities Entropy Support size

44 Can we approximate the entropy? In general, not to within a multiplicative factor... 0 entropy distributions are hard to distinguish (even in superlinear time) What if entropy is bigger? Can γ-multiplicatively approximate the entropy with Õ(n 1/γ2 ) samples (when entropy >2γ/ε) [Batu Dasgupta R. Kumar] requires Ω(n 1/γ2 ) [Valiant] better bounds when support size is small [Brautbar Samorodnitsky] Similar bounds for estimating support size

45 What about additive approximations? Intense interest in information theory, neurobiology, physics, machine learning. Previously, O(n) algorithms made the above people very happy. (better than trivial O(n log n)!) O(n/log n) algorithms possible! [Raskhodnikova, Ron, Shpilka, Smith] [Valiant Valiant] And Ω(n/log n) samples required [Valiant Valiant] Similar story for support size

46 Estimating Compressibility of Data [Raskhodnikova Ron Rubinfeld Smith] General question undecidable Run-length encoding Huffman coding Entropy Lempel-Ziv ``Color number = Number of elements with probability at least 1/n (support size!)

47 Testing Independence: Shopping patterns: Independent of zip code?

48 Independence of pairs p is joint distribution on pairs <a,b> from [n] x [m] (wlog n m) Marginal distributions p 1,p 2 p independent if p = p 1 x p 2, that is p (a,b) =(p 1 ) a (p 2 ) b for all a,b

49 Independence vs. product of marginals Lemma: [Sahai Vadhan] If A,B, such that p AxB 1 <ε/3 then p- p 1 x p 2 1 <ε

50 Testing Independence [Batu Fischer Fortnow Kumar R. White] p samples Independence Test Goal: If p = p 1 x p 2 then PASS If p p 1 x p 2 1 >ε then FAIL Pass/Fail?

51 1st try: use closeness test p p 1 x p 2 Simulate p 1 and p 2, and check p- p 1 x p 2 1 <ε. samples Closeness Test Pass/Fail? Behavior: If p- p 1 x p 2 1 <ε/n 1/3 then PASS If p- p 1 x p 2 1 >ε then FAIL Sample complexity: Õ((nm) 2/3 )

52 2nd try: Use identity test Algorithm: Approximate marginal distributions f 1 p 1 and f 2 p 2 Use Identity testing algorithm to test that p f 1 x f 2 Comments: use care when showing that good distributions pass Sample complexity: Õ(n+m + (nm) 1/2 ) Can combine with previous using filtering ideas identity test works well on distribution restricted to ``heavy prefixes from p 1 closeness test works well if max probability element is bounded from above

53 Theorem: [Batu Fischer Fortnow Kumar R. White] There exists an algorithm for testing independence with sample complexity O(n 2/3 m 1/3 poly(log n, ε -1 )) s.t. If p=p 1 x p 2, it outputs PASS If p-q 1 >ε for any independent q, it outputs FAIL This is tight! [Levi Ron Rubinfeld]

54 An open question: What is the complexity of testing independence of distributions over k-tuples from [n 1 ]x x[n k ]? Easy Ω( n i 1/2 ) lower bound

55 Testing the monotonicity of distributions: Does the occurrence of cancer decrease with distance from the nuclear reactor?

56 Monotone distributions p is monotone if 0.4 i<j implies p i p j pi 0.1 Many distributions are monotone or are made of small number of monotone distributions pi

57 First Monotone distributions over totally ordered domains [1..n] pi

58 Form of test? Idea: test that average weight of distribution in range [i..j] less than average weight of distribution in [i j ] for various choices of i<i,j<j Problem: uniform distribution on even numbers passes such tests ARGHHHHHHHHHH!

59 Lower bound [Batu Kumar R.] Lemma: Testing monotonicity requires Ω( n) samples Proof: p close to uniform iff p, p R = reversal of p, are both close to monotone p Rate p R Rate

60 Algorithm idea: Approximate distribution by k-flat distribution: Properties: Partition domain into k intervals Conditional distribution uniform in each Questions: Does it exist for k=o(polylog(n))? How do you find interval boundaries? Check if k-flat distribution close to monotone Solve linear program on O(polylog(n)) variables

61 Upper bound [Batu Kumar R.] Lemma: There is an algorithm for testing monotonicity over totally ordered domains which uses Õ(n 1/2 ε -2 ) samples s.t. (with probability 2/3) If p monotone, outputs PASS If ε far from monotone, outputs FAIL Can also test unimodal distributions

62 More recent work: K-flat distributions [Levi Indyk R.] K-modal distributions [Daskalakis Diakonikolas Servedio] Poisson Binomial Distributions [Daskalakis Diakonikolas Servedio] Monotonicity over general posets [Bhattacharyya Fischer R. Valiant]

63 Other properties? Higher dimensional flat distributions Mixtures of k Gaussians Junta -distributions

64 Getting past the lower bounds Special distributions e.g, uniform on a subset, monotone Other query models Queries to probabilities of elements Other distance measures

65 Flat distributions Entropy can be estimated somewhat faster when distribution is uniform on a subset of the elements [Batu Dasgupta Kumar R.][Brautbar Samorodnitsky]

66 Monotone distributions over totally ordered domains Test uniformity with O(1) samples [Batu Kumar R.] Other tasks doable with polylogarithmic samples: [Batu Dasgupta Kumar R.][BKR] Examples: Testing closeness Testing independence Estimating entropy Algorithm: Use k-flat partitions to approximate distributions Test property on approximation Do these big wins carry over to other partial orders? Yes and no.[r. Servedio] [Bhattacharyya Fischer R. Valiant] Still open: can you get sublinear sample algorithms?

67 Other models: Distribution given explicitly -- i.e. can query probability of element in one step! Entropy [BDKR] Distribution given both by samples and explicitly e.g. data in sorted order, Google n-gram data Entropy, closeness [BDKR][RS]

68 Other distance measures: KL-divergence [Guha McGregor Venkatasubramanian] EMD [Doba Nguyen Nguyen R.]

69 What about many distributions?

70 Collection of distributions: Further refinement: Known or unknown distribution on i s? Two models: D 1 D 2 D m Sampling model: samples Get (i,x) for random i, x D i Query model: Get (i,x) for query i and x D i Test Pass/Fail? Sample complexity in terms of n,m?

71 Some questions (and answers): Are they all equal? Can they be clustered into k groups of similar distributions? Do they all have the same mean? See [Levi Ron R. 2011, Levi Ron R. 2012]

72 Conclusions and Future Directions Distribution property testing problems are everywhere Several useful techniques known Other properties for which sublinear tests exist? Special classes of distributions? Other query models? Non-iid samples?

73 Thank you

Taming Big Probability Distributions

Taming Big Probability Distributions Taming Big Probability Distributions Ronitt Rubinfeld June 18, 2012 CSAIL, Massachusetts Institute of Technology, Cambridge MA 02139 and the Blavatnik School of Computer Science, Tel Aviv University. Research

More information

Testing Distributions Candidacy Talk

Testing Distributions Candidacy Talk Testing Distributions Candidacy Talk Clément Canonne Columbia University 2015 Introduction Introduction Property testing: what can we say about an object while barely looking at it? P 3 / 20 Introduction

More information

Testing that distributions are close

Testing that distributions are close Tugkan Batu, Lance Fortnow, Ronitt Rubinfeld, Warren D. Smith, Patrick White May 26, 2010 Anat Ganor Introduction Goal Given two distributions over an n element set, we wish to check whether these distributions

More information

What can we do in sublinear time?

What can we do in sublinear time? What can we do in sublinear time? 0368.4612 Seminar on Sublinear Time Algorithms Lecture 1 Ronitt Rubinfeld Today s goal Motivation Overview of main definitions Examples (Disclaimer this is not a complete

More information

1 Distribution Property Testing

1 Distribution Property Testing ECE 6980 An Algorithmic and Information-Theoretic Toolbox for Massive Data Instructor: Jayadev Acharya Lecture #10-11 Scribe: JA 27, 29 September, 2016 Please send errors to acharya@cornell.edu 1 Distribution

More information

Testing Closeness of Discrete Distributions

Testing Closeness of Discrete Distributions Testing Closeness of Discrete Distributions Tuğkan Batu Lance Fortnow Ronitt Rubinfeld Warren D. Smith Patrick White May 29, 2018 arxiv:1009.5397v2 [cs.ds] 4 Nov 2010 Abstract Given samples from two distributions

More information

Testing equivalence between distributions using conditional samples

Testing equivalence between distributions using conditional samples Testing equivalence between distributions using conditional samples (Extended Abstract) Clément Canonne Dana Ron Rocco A. Servedio July 5, 2013 Abstract We study a recently introduced framework [7, 8]

More information

Testing Cluster Structure of Graphs. Artur Czumaj

Testing Cluster Structure of Graphs. Artur Czumaj Testing Cluster Structure of Graphs Artur Czumaj DIMAP and Department of Computer Science University of Warwick Joint work with Pan Peng and Christian Sohler (TU Dortmund) Dealing with BigData in Graphs

More information

distributed simulation and distributed inference

distributed simulation and distributed inference distributed simulation and distributed inference Algorithms, Tradeoffs, and a Conjecture Clément Canonne (Stanford University) April 13, 2018 Joint work with Jayadev Acharya (Cornell University) and Himanshu

More information

Testing random variables for independence and identity

Testing random variables for independence and identity Testing random variables for independence and identity Tuğkan Batu Eldar Fischer Lance Fortnow Ravi Kumar Ronitt Rubinfeld Patrick White January 10, 2003 Abstract Given access to independent samples of

More information

Testing that distributions are close

Testing that distributions are close Testing that distributions are close Tuğkan Batu Lance Fortnow Ronitt Rubinfeld Warren D. Smith Patrick White October 12, 2005 Abstract Given two distributions over an n element set, we wish to check whether

More information

With P. Berman and S. Raskhodnikova (STOC 14+). Grigory Yaroslavtsev Warren Center for Network and Data Sciences

With P. Berman and S. Raskhodnikova (STOC 14+). Grigory Yaroslavtsev Warren Center for Network and Data Sciences L p -Testing With P. Berman and S. Raskhodnikova (STOC 14+). Grigory Yaroslavtsev Warren Center for Network and Data Sciences http://grigory.us Property Testing [Goldreich, Goldwasser, Ron; Rubinfeld,

More information

COMS E F15. Lecture 22: Linearity Testing Sparse Fourier Transform

COMS E F15. Lecture 22: Linearity Testing Sparse Fourier Transform COMS E6998-9 F15 Lecture 22: Linearity Testing Sparse Fourier Transform 1 Administrivia, Plan Thu: no class. Happy Thanksgiving! Tue, Dec 1 st : Sergei Vassilvitskii (Google Research) on MapReduce model

More information

Probability Revealing Samples

Probability Revealing Samples Krzysztof Onak IBM Research Xiaorui Sun Microsoft Research Abstract In the most popular distribution testing and parameter estimation model, one can obtain information about an underlying distribution

More information

Testing Monotone High-Dimensional Distributions

Testing Monotone High-Dimensional Distributions Testing Monotone High-Dimensional Distributions Ronitt Rubinfeld Computer Science & Artificial Intelligence Lab. MIT Cambridge, MA 02139 ronitt@theory.lcs.mit.edu Rocco A. Servedio Department of Computer

More information

TESTING PROPERTIES OF DISTRIBUTIONS

TESTING PROPERTIES OF DISTRIBUTIONS TESTING PROPERTIES OF DISTRIBUTIONS A Dissertation Presented to the Faculty of the Graduate School of Cornell University in Partial Fulfillment of the Requirements for the Degree of Doctor of Philosophy

More information

Tolerant Versus Intolerant Testing for Boolean Properties

Tolerant Versus Intolerant Testing for Boolean Properties Tolerant Versus Intolerant Testing for Boolean Properties Eldar Fischer Faculty of Computer Science Technion Israel Institute of Technology Technion City, Haifa 32000, Israel. eldar@cs.technion.ac.il Lance

More information

Learning and Testing Structured Distributions

Learning and Testing Structured Distributions Learning and Testing Structured Distributions Ilias Diakonikolas University of Edinburgh Simons Institute March 2015 This Talk Algorithmic Framework for Distribution Estimation: Leads to fast & robust

More information

Lecture 10. Sublinear Time Algorithms (contd) CSC2420 Allan Borodin & Nisarg Shah 1

Lecture 10. Sublinear Time Algorithms (contd) CSC2420 Allan Borodin & Nisarg Shah 1 Lecture 10 Sublinear Time Algorithms (contd) CSC2420 Allan Borodin & Nisarg Shah 1 Recap Sublinear time algorithms Deterministic + exact: binary search Deterministic + inexact: estimating diameter in a

More information

Estimation of information-theoretic quantities

Estimation of information-theoretic quantities Estimation of information-theoretic quantities Liam Paninski Gatsby Computational Neuroscience Unit University College London http://www.gatsby.ucl.ac.uk/ liam liam@gatsby.ucl.ac.uk November 16, 2004 Some

More information

Computing the Entropy of a Stream

Computing the Entropy of a Stream Computing the Entropy of a Stream To appear in SODA 2007 Graham Cormode graham@research.att.com Amit Chakrabarti Dartmouth College Andrew McGregor U. Penn / UCSD Outline Introduction Entropy Upper Bound

More information

Tolerant Versus Intolerant Testing for Boolean Properties

Tolerant Versus Intolerant Testing for Boolean Properties Electronic Colloquium on Computational Complexity, Report No. 105 (2004) Tolerant Versus Intolerant Testing for Boolean Properties Eldar Fischer Lance Fortnow November 18, 2004 Abstract A property tester

More information

Quantum algorithms for testing properties of distributions

Quantum algorithms for testing properties of distributions Quantum algorithms for testing properties of distributions Sergey Bravyi, Aram W. Harrow, and Avinatan Hassidim July 8, 2009 Abstract Suppose one has access to oracles generating samples from two unknown

More information

An Automatic Inequality Prover and Instance Optimal Identity Testing

An Automatic Inequality Prover and Instance Optimal Identity Testing 04 IEEE Annual Symposium on Foundations of Computer Science An Automatic Inequality Prover and Instance Optimal Identity Testing Gregory Valiant Stanford University valiant@stanford.edu Paul Valiant Brown

More information

Classy sample correctors

Classy sample correctors Classy sample correctors 1 Ronitt Rubinfeld MIT and Tel Aviv University joint work with Clement Canonne (Columbia) and Themis Gouleakis (MIT) 1 thanks to Clement and G for inspiring this classy title Our

More information

Testing the Expansion of a Graph

Testing the Expansion of a Graph Testing the Expansion of a Graph Asaf Nachmias Asaf Shapira Abstract We study the problem of testing the expansion of graphs with bounded degree d in sublinear time. A graph is said to be an α- expander

More information

Testing monotonicity of distributions over general partial orders

Testing monotonicity of distributions over general partial orders Testing monotonicity of distributions over general partial orders Arnab Bhattacharyya Eldar Fischer 2 Ronitt Rubinfeld 3 Paul Valiant 4 MIT CSAIL. 2 Technion. 3 MIT CSAIL, Tel Aviv University. 4 U.C. Berkeley.

More information

Declaring Independence via the Sketching of Sketches. Until August Hire Me!

Declaring Independence via the Sketching of Sketches. Until August Hire Me! Declaring Independence via the Sketching of Sketches Piotr Indyk Andrew McGregor Massachusetts Institute of Technology University of California, San Diego Until August 08 -- Hire Me! The Problem The Problem

More information

Lecture 1: 01/22/2014

Lecture 1: 01/22/2014 COMS 6998-3: Sub-Linear Algorithms in Learning and Testing Lecturer: Rocco Servedio Lecture 1: 01/22/2014 Spring 2014 Scribes: Clément Canonne and Richard Stark 1 Today High-level overview Administrative

More information

An Optimal Lower Bound for Monotonicity Testing over Hypergrids

An Optimal Lower Bound for Monotonicity Testing over Hypergrids An Optimal Lower Bound for Monotonicity Testing over Hypergrids Deeparnab Chakrabarty 1 and C. Seshadhri 2 1 Microsoft Research India dechakr@microsoft.com 2 Sandia National Labs, Livermore scomand@sandia.gov

More information

Proclaiming Dictators and Juntas or Testing Boolean Formulae

Proclaiming Dictators and Juntas or Testing Boolean Formulae Proclaiming Dictators and Juntas or Testing Boolean Formulae Michal Parnas The Academic College of Tel-Aviv-Yaffo Tel-Aviv, ISRAEL michalp@mta.ac.il Dana Ron Department of EE Systems Tel-Aviv University

More information

Sublinear Time Algorithms for Earth Mover s Distance

Sublinear Time Algorithms for Earth Mover s Distance Sublinear Time Algorithms for Earth Mover s Distance Khanh Do Ba MIT, CSAIL doba@mit.edu Huy L. Nguyen MIT hlnguyen@mit.edu Huy N. Nguyen MIT, CSAIL huyn@mit.edu Ronitt Rubinfeld MIT, CSAIL ronitt@csail.mit.edu

More information

CSC2420 Fall 2012: Algorithm Design, Analysis and Theory Lecture 9

CSC2420 Fall 2012: Algorithm Design, Analysis and Theory Lecture 9 CSC2420 Fall 2012: Algorithm Design, Analysis and Theory Lecture 9 Allan Borodin March 13, 2016 1 / 33 Lecture 9 Announcements 1 I have the assignments graded by Lalla. 2 I have now posted five questions

More information

Square Hellinger Subadditivity for Bayesian Networks and its Applications to Identity Testing. Extended Abstract 1

Square Hellinger Subadditivity for Bayesian Networks and its Applications to Identity Testing. Extended Abstract 1 Proceedings of Machine Learning Research vol 65:1 7, 2017 Square Hellinger Subadditivity for Bayesian Networks and its Applications to Identity Testing Constantinos Daskslakis EECS and CSAIL, MIT Qinxuan

More information

Optimal Testing for Properties of Distributions

Optimal Testing for Properties of Distributions Optimal Testing for Properties of Distributions Jayadev Acharya, Constantinos Daskalakis, Gautam Kamath EECS, MIT {jayadev, costis, g}@mit.edu Abstract Given samples from an unknown discrete distribution

More information

Testing Expansion in Bounded-Degree Graphs

Testing Expansion in Bounded-Degree Graphs Testing Expansion in Bounded-Degree Graphs Artur Czumaj Department of Computer Science University of Warwick Coventry CV4 7AL, UK czumaj@dcs.warwick.ac.uk Christian Sohler Department of Computer Science

More information

Lecture 5: Derandomization (Part II)

Lecture 5: Derandomization (Part II) CS369E: Expanders May 1, 005 Lecture 5: Derandomization (Part II) Lecturer: Prahladh Harsha Scribe: Adam Barth Today we will use expanders to derandomize the algorithm for linearity test. Before presenting

More information

Testing that distributions are close

Testing that distributions are close Testing that distributions are close Tuğkan Batu Computer Science Department Cornell University Ithaca, NY 14853 batu@cs.cornell.edu Warren D. Smith NEC Research Institute 4 Independence Way Princeton,

More information

Multi-join Query Evaluation on Big Data Lecture 2

Multi-join Query Evaluation on Big Data Lecture 2 Multi-join Query Evaluation on Big Data Lecture 2 Dan Suciu March, 2015 Dan Suciu Multi-Joins Lecture 2 March, 2015 1 / 34 Multi-join Query Evaluation Outline Part 1 Optimal Sequential Algorithms. Thursday

More information

Testing Booleanity and the Uncertainty Principle

Testing Booleanity and the Uncertainty Principle Testing Booleanity and the Uncertainty Principle Tom Gur Weizmann Institute of Science tom.gur@weizmann.ac.il Omer Tamuz Weizmann Institute of Science omer.tamuz@weizmann.ac.il Abstract Let f : { 1, 1}

More information

Testing Linear-Invariant Function Isomorphism

Testing Linear-Invariant Function Isomorphism Testing Linear-Invariant Function Isomorphism Karl Wimmer and Yuichi Yoshida 2 Duquesne University 2 National Institute of Informatics and Preferred Infrastructure, Inc. wimmerk@duq.edu,yyoshida@nii.ac.jp

More information

Learning Sums of Independent Integer Random Variables

Learning Sums of Independent Integer Random Variables Learning Sums of Independent Integer Random Variables Cos8s Daskalakis (MIT) Ilias Diakonikolas (Edinburgh) Ryan O Donnell (CMU) Rocco Servedio (Columbia) Li- Yang Tan (Columbia) FOCS 2013, Berkeley CA

More information

CPSC 467: Cryptography and Computer Security

CPSC 467: Cryptography and Computer Security CPSC 467: Cryptography and Computer Security Michael J. Fischer Lecture 16 October 30, 2017 CPSC 467, Lecture 16 1/52 Properties of Hash Functions Hash functions do not always look random Relations among

More information

Approximate Pattern Matching and the Query Complexity of Edit Distance

Approximate Pattern Matching and the Query Complexity of Edit Distance Krzysztof Onak Approximate Pattern Matching p. 1/20 Approximate Pattern Matching and the Query Complexity of Edit Distance Joint work with: Krzysztof Onak MIT Alexandr Andoni (CCI) Robert Krauthgamer (Weizmann

More information

Testing by Implicit Learning

Testing by Implicit Learning Testing by Implicit Learning Rocco Servedio Columbia University ITCS Property Testing Workshop Beijing January 2010 What this talk is about 1. Testing by Implicit Learning: method for testing classes of

More information

A Self-Tester for Linear Functions over the Integers with an Elementary Proof of Correctness

A Self-Tester for Linear Functions over the Integers with an Elementary Proof of Correctness A Self-Tester for Linear Functions over the Integers with an Elementary Proof of Correctness The MIT Faculty has made this article openly available. Please share how this access benefits you. Your story

More information

Lower Bounds for Testing Bipartiteness in Dense Graphs

Lower Bounds for Testing Bipartiteness in Dense Graphs Lower Bounds for Testing Bipartiteness in Dense Graphs Andrej Bogdanov Luca Trevisan Abstract We consider the problem of testing bipartiteness in the adjacency matrix model. The best known algorithm, due

More information

On Learning Powers of Poisson Binomial Distributions and Graph Binomial Distributions

On Learning Powers of Poisson Binomial Distributions and Graph Binomial Distributions On Learning Powers of Poisson Binomial Distributions and Graph Binomial Distributions Dimitris Fotakis Yahoo Research NY and National Technical University of Athens Joint work with Vasilis Kontonis (NTU

More information

Sofya Raskhodnikova

Sofya Raskhodnikova Sofya Raskhodnikova sofya@wisdom.weizmann.ac.il Work: Department of Computer Science and Applied Mathematics Weizmann Institute of Science Rehovot, Israel 76100 +972 8 934 3573 http://theory.lcs.mit.edu/

More information

Lecture Lecture 9 October 1, 2015

Lecture Lecture 9 October 1, 2015 CS 229r: Algorithms for Big Data Fall 2015 Lecture Lecture 9 October 1, 2015 Prof. Jelani Nelson Scribe: Rachit Singh 1 Overview In the last lecture we covered the distance to monotonicity (DTM) and longest

More information

Recoverabilty Conditions for Rankings Under Partial Information

Recoverabilty Conditions for Rankings Under Partial Information Recoverabilty Conditions for Rankings Under Partial Information Srikanth Jagabathula Devavrat Shah Abstract We consider the problem of exact recovery of a function, defined on the space of permutations

More information

Sublinear Algorithms Lecture 3. Sofya Raskhodnikova Penn State University

Sublinear Algorithms Lecture 3. Sofya Raskhodnikova Penn State University Sublinear Algorithms Lecture 3 Sofya Raskhodnikova Penn State University 1 Graph Properties Testing if a Graph is Connected [Goldreich Ron] Input: a graph G = (V, E) on n vertices in adjacency lists representation

More information

DRAFT. Algebraic computation models. Chapter 14

DRAFT. Algebraic computation models. Chapter 14 Chapter 14 Algebraic computation models Somewhat rough We think of numerical algorithms root-finding, gaussian elimination etc. as operating over R or C, even though the underlying representation of the

More information

Learning and Testing Submodular Functions

Learning and Testing Submodular Functions Learning and Testing Submodular Functions Grigory Yaroslavtsev http://grigory.us Slides at http://grigory.us/cis625/lecture3.pdf CIS 625: Computational Learning Theory Submodularity Discrete analog of

More information

CS369E: Communication Complexity (for Algorithm Designers) Lecture #8: Lower Bounds in Property Testing

CS369E: Communication Complexity (for Algorithm Designers) Lecture #8: Lower Bounds in Property Testing CS369E: Communication Complexity (for Algorithm Designers) Lecture #8: Lower Bounds in Property Testing Tim Roughgarden March 12, 2015 1 Property Testing We begin in this section with a brief introduction

More information

Testing Problems with Sub-Learning Sample Complexity

Testing Problems with Sub-Learning Sample Complexity Testing Problems with Sub-Learning Sample Complexity Michael Kearns AT&T Labs Research 180 Park Avenue Florham Park, NJ, 07932 mkearns@researchattcom Dana Ron Laboratory for Computer Science, MIT 545 Technology

More information

Settling the Query Complexity of Non-Adaptive Junta Testing

Settling the Query Complexity of Non-Adaptive Junta Testing Settling the Query Complexity of Non-Adaptive Junta Testing Erik Waingarten, Columbia University Based on joint work with Xi Chen (Columbia University) Rocco Servedio (Columbia University) Li-Yang Tan

More information

Faster and Sample Near-Optimal Algorithms for Proper Learning Mixtures of Gaussians. Constantinos Daskalakis, MIT Gautam Kamath, MIT

Faster and Sample Near-Optimal Algorithms for Proper Learning Mixtures of Gaussians. Constantinos Daskalakis, MIT Gautam Kamath, MIT Faster and Sample Near-Optimal Algorithms for Proper Learning Mixtures of Gaussians Constantinos Daskalakis, MIT Gautam Kamath, MIT What s a Gaussian Mixture Model (GMM)? Interpretation 1: PDF is a convex

More information

Optimal compression of approximate Euclidean distances

Optimal compression of approximate Euclidean distances Optimal compression of approximate Euclidean distances Noga Alon 1 Bo az Klartag 2 Abstract Let X be a set of n points of norm at most 1 in the Euclidean space R k, and suppose ε > 0. An ε-distance sketch

More information

Learning and Fourier Analysis

Learning and Fourier Analysis Learning and Fourier Analysis Grigory Yaroslavtsev http://grigory.us CIS 625: Computational Learning Theory Fourier Analysis and Learning Powerful tool for PAC-style learning under uniform distribution

More information

Approximate Inference Part 1 of 2

Approximate Inference Part 1 of 2 Approximate Inference Part 1 of 2 Tom Minka Microsoft Research, Cambridge, UK Machine Learning Summer School 2009 http://mlg.eng.cam.ac.uk/mlss09/ Bayesian paradigm Consistent use of probability theory

More information

Testing Booleanity and the Uncertainty Principle

Testing Booleanity and the Uncertainty Principle Electronic Colloquium on Computational Complexity, Report No. 31 (2012) Testing Booleanity and the Uncertainty Principle Tom Gur and Omer Tamuz April 4, 2012 Abstract Let f : { 1, 1} n R be a real function

More information

Sparse Fourier Transform (lecture 1)

Sparse Fourier Transform (lecture 1) 1 / 73 Sparse Fourier Transform (lecture 1) Michael Kapralov 1 1 IBM Watson EPFL St. Petersburg CS Club November 2015 2 / 73 Given x C n, compute the Discrete Fourier Transform (DFT) of x: x i = 1 x n

More information

Information Theory. Lecture 5 Entropy rate and Markov sources STEFAN HÖST

Information Theory. Lecture 5 Entropy rate and Markov sources STEFAN HÖST Information Theory Lecture 5 Entropy rate and Markov sources STEFAN HÖST Universal Source Coding Huffman coding is optimal, what is the problem? In the previous coding schemes (Huffman and Shannon-Fano)it

More information

Learning mixtures of structured distributions over discrete domains

Learning mixtures of structured distributions over discrete domains Learning mixtures of structured distributions over discrete domains Siu-On Chan UC Berkeley siuon@cs.berkeley.edu. Ilias Diakonikolas University of Edinburgh ilias.d@ed.ac.uk. Xiaorui Sun Columbia University

More information

Active Testing! Liu Yang! Joint work with Nina Balcan,! Eric Blais and Avrim Blum! Liu Yang 2011

Active Testing! Liu Yang! Joint work with Nina Balcan,! Eric Blais and Avrim Blum! Liu Yang 2011 ! Active Testing! Liu Yang! Joint work with Nina Balcan,! Eric Blais and Avrim Blum! 1 Property Testing Instance space X = R n (Distri D over X)! Tested function f : X->{0,1}! A property P of Boolean fn

More information

Property Testing Bounds for Linear and Quadratic Functions via Parity Decision Trees

Property Testing Bounds for Linear and Quadratic Functions via Parity Decision Trees Electronic Colloquium on Computational Complexity, Revision 2 of Report No. 142 (2013) Property Testing Bounds for Linear and Quadratic Functions via Parity Decision Trees Abhishek Bhrushundi 1, Sourav

More information

Testing k-modal Distributions: Optimal Algorithms via Reductions

Testing k-modal Distributions: Optimal Algorithms via Reductions Testing k-modal Distributions: Optimal Algorithms via Reductions Constantinos Daskalakis MIT Ilias Diakonikolas UC Berkeley Paul Valiant Brown University October 3, 2012 Rocco A. Servedio Columbia University

More information

Data Structures and Algorithms CMPSC 465

Data Structures and Algorithms CMPSC 465 Data Structures and Algorithms CMPSC 465 LECTURE 9 Solving recurrences Substitution method Adam Smith S. Raskhodnikova and A. Smith; based on slides by E. Demaine and C. Leiserson Review question Draw

More information

A Combinatorial Characterization of the Testable Graph Properties: It s All About Regularity

A Combinatorial Characterization of the Testable Graph Properties: It s All About Regularity A Combinatorial Characterization of the Testable Graph Properties: It s All About Regularity ABSTRACT Noga Alon Tel-Aviv University and IAS nogaa@tau.ac.il Ilan Newman Haifa University ilan@cs.haifa.ac.il

More information

The Origin of Deep Learning. Lili Mou Jan, 2015

The Origin of Deep Learning. Lili Mou Jan, 2015 The Origin of Deep Learning Lili Mou Jan, 2015 Acknowledgment Most of the materials come from G. E. Hinton s online course. Outline Introduction Preliminary Boltzmann Machines and RBMs Deep Belief Nets

More information

Earthmover resilience & testing in ordered structures

Earthmover resilience & testing in ordered structures Earthmover resilience & testing in ordered structures Omri Ben-Eliezer Tel-Aviv University Eldar Fischer Technion Computational Complexity Conference 28 UCSD, San Diago Property testing (RS96, GGR98) Meta

More information

Two Decades of Property Testing

Two Decades of Property Testing Two Decades of Property Testing Madhu Sudan Microsoft Research 12/09/2014 Invariance in Property Testing @MIT 1 of 29 Kepler s Big Data Problem Tycho Brahe (~1550-1600): Wished to measure planetary motion

More information

t-closeness: Privacy beyond k-anonymity and l-diversity

t-closeness: Privacy beyond k-anonymity and l-diversity t-closeness: Privacy beyond k-anonymity and l-diversity paper by Ninghui Li, Tiancheng Li, and Suresh Venkatasubramanian presentation by Caitlin Lustig for CS 295D Definitions Attributes: explicit identifiers,

More information

Succinct Data Structures for the Range Minimum Query Problems

Succinct Data Structures for the Range Minimum Query Problems Succinct Data Structures for the Range Minimum Query Problems S. Srinivasa Rao Seoul National University Joint work with Gerth Brodal, Pooya Davoodi, Mordecai Golin, Roberto Grossi John Iacono, Danny Kryzanc,

More information

Testing for Concise Representations

Testing for Concise Representations Testing for Concise Representations Ilias Diakonikolas Columbia University ilias@cs.columbia.edu Ronitt Rubinfeld MIT ronitt@theory.csail.mit.edu Homin K. Lee Columbia University homin@cs.columbia.edu

More information

CPSC 467: Cryptography and Computer Security

CPSC 467: Cryptography and Computer Security CPSC 467: Cryptography and Computer Security Michael J. Fischer Lecture 14 October 16, 2013 CPSC 467, Lecture 14 1/45 Message Digest / Cryptographic Hash Functions Hash Function Constructions Extending

More information

Robustness Meets Algorithms

Robustness Meets Algorithms Robustness Meets Algorithms Ankur Moitra (MIT) ICML 2017 Tutorial, August 6 th CLASSIC PARAMETER ESTIMATION Given samples from an unknown distribution in some class e.g. a 1-D Gaussian can we accurately

More information

Succinct Data Structures for Approximating Convex Functions with Applications

Succinct Data Structures for Approximating Convex Functions with Applications Succinct Data Structures for Approximating Convex Functions with Applications Prosenjit Bose, 1 Luc Devroye and Pat Morin 1 1 School of Computer Science, Carleton University, Ottawa, Canada, K1S 5B6, {jit,morin}@cs.carleton.ca

More information

Lecture 5. 1 Review (Pairwise Independence and Derandomization)

Lecture 5. 1 Review (Pairwise Independence and Derandomization) 6.842 Randomness and Computation September 20, 2017 Lecture 5 Lecturer: Ronitt Rubinfeld Scribe: Tom Kolokotrones 1 Review (Pairwise Independence and Derandomization) As we discussed last time, we can

More information

CSC 8301 Design & Analysis of Algorithms: Lower Bounds

CSC 8301 Design & Analysis of Algorithms: Lower Bounds CSC 8301 Design & Analysis of Algorithms: Lower Bounds Professor Henry Carter Fall 2016 Recap Iterative improvement algorithms take a feasible solution and iteratively improve it until optimized Simplex

More information

Lecture 16 Oct. 26, 2017

Lecture 16 Oct. 26, 2017 Sketching Algorithms for Big Data Fall 2017 Prof. Piotr Indyk Lecture 16 Oct. 26, 2017 Scribe: Chi-Ning Chou 1 Overview In the last lecture we constructed sparse RIP 1 matrix via expander and showed that

More information

Spectral Methods for Testing Clusters in Graphs

Spectral Methods for Testing Clusters in Graphs Spectral Methods for Testing Clusters in Graphs Sandeep Silwal, Jonathan Tidor Mentor: Jonathan Tidor August, 8 Abstract We study the problem of testing if a graph has a cluster structure in the framework

More information

Fast Approximate PCPs for Multidimensional Bin-Packing Problems

Fast Approximate PCPs for Multidimensional Bin-Packing Problems Fast Approximate PCPs for Multidimensional Bin-Packing Problems Tuğkan Batu 1 School of Computing Science, Simon Fraser University, Burnaby, BC, Canada V5A 1S6 Ronitt Rubinfeld 2 CSAIL, MIT, Cambridge,

More information

Randomized Algorithms

Randomized Algorithms Randomized Algorithms Saniv Kumar, Google Research, NY EECS-6898, Columbia University - Fall, 010 Saniv Kumar 9/13/010 EECS6898 Large Scale Machine Learning 1 Curse of Dimensionality Gaussian Mixture Models

More information

Randomized Algorithms

Randomized Algorithms Randomized Algorithms Prof. Tapio Elomaa tapio.elomaa@tut.fi Course Basics A new 4 credit unit course Part of Theoretical Computer Science courses at the Department of Mathematics There will be 4 hours

More information

Information Theory. Mark van Rossum. January 24, School of Informatics, University of Edinburgh 1 / 35

Information Theory. Mark van Rossum. January 24, School of Informatics, University of Edinburgh 1 / 35 1 / 35 Information Theory Mark van Rossum School of Informatics, University of Edinburgh January 24, 2018 0 Version: January 24, 2018 Why information theory 2 / 35 Understanding the neural code. Encoding

More information

Learning symmetric non-monotone submodular functions

Learning symmetric non-monotone submodular functions Learning symmetric non-monotone submodular functions Maria-Florina Balcan Georgia Institute of Technology ninamf@cc.gatech.edu Nicholas J. A. Harvey University of British Columbia nickhar@cs.ubc.ca Satoru

More information

Learning and Fourier Analysis

Learning and Fourier Analysis Learning and Fourier Analysis Grigory Yaroslavtsev http://grigory.us Slides at http://grigory.us/cis625/lecture2.pdf CIS 625: Computational Learning Theory Fourier Analysis and Learning Powerful tool for

More information

Final Exam, Machine Learning, Spring 2009

Final Exam, Machine Learning, Spring 2009 Name: Andrew ID: Final Exam, 10701 Machine Learning, Spring 2009 - The exam is open-book, open-notes, no electronics other than calculators. - The maximum possible score on this exam is 100. You have 3

More information

Approximate Inference Part 1 of 2

Approximate Inference Part 1 of 2 Approximate Inference Part 1 of 2 Tom Minka Microsoft Research, Cambridge, UK Machine Learning Summer School 2009 http://mlg.eng.cam.ac.uk/mlss09/ 1 Bayesian paradigm Consistent use of probability theory

More information

Learning Energy-Based Models of High-Dimensional Data

Learning Energy-Based Models of High-Dimensional Data Learning Energy-Based Models of High-Dimensional Data Geoffrey Hinton Max Welling Yee-Whye Teh Simon Osindero www.cs.toronto.edu/~hinton/energybasedmodelsweb.htm Discovering causal structure as a goal

More information

Enumeration Schemes for Words Avoiding Permutations

Enumeration Schemes for Words Avoiding Permutations Enumeration Schemes for Words Avoiding Permutations Lara Pudwell November 27, 2007 Abstract The enumeration of permutation classes has been accomplished with a variety of techniques. One wide-reaching

More information

STA 4273H: Statistical Machine Learning

STA 4273H: Statistical Machine Learning STA 4273H: Statistical Machine Learning Russ Salakhutdinov Department of Statistics! rsalakhu@utstat.toronto.edu! http://www.utstat.utoronto.ca/~rsalakhu/ Sidney Smith Hall, Room 6002 Lecture 7 Approximate

More information

What Can We Learn Privately?

What Can We Learn Privately? What Can We Learn Privately? Sofya Raskhodnikova Penn State University Joint work with Shiva Kasiviswanathan Homin Lee Kobbi Nissim Adam Smith Los Alamos UT Austin Ben Gurion Penn State To appear in SICOMP

More information

On Universal Types. Gadiel Seroussi Hewlett-Packard Laboratories Palo Alto, California, USA. University of Minnesota, September 14, 2004

On Universal Types. Gadiel Seroussi Hewlett-Packard Laboratories Palo Alto, California, USA. University of Minnesota, September 14, 2004 On Universal Types Gadiel Seroussi Hewlett-Packard Laboratories Palo Alto, California, USA University of Minnesota, September 14, 2004 Types for Parametric Probability Distributions A = finite alphabet,

More information

Stephen Scott.

Stephen Scott. 1 / 35 (Adapted from Ethem Alpaydin and Tom Mitchell) sscott@cse.unl.edu In Homework 1, you are (supposedly) 1 Choosing a data set 2 Extracting a test set of size > 30 3 Building a tree on the training

More information

Lecture 21: Algebraic Computation Models

Lecture 21: Algebraic Computation Models princeton university cos 522: computational complexity Lecture 21: Algebraic Computation Models Lecturer: Sanjeev Arora Scribe:Loukas Georgiadis We think of numerical algorithms root-finding, gaussian

More information

compare to comparison and pointer based sorting, binary trees

compare to comparison and pointer based sorting, binary trees Admin Hashing Dictionaries Model Operations. makeset, insert, delete, find keys are integers in M = {1,..., m} (so assume machine word size, or unit time, is log m) can store in array of size M using power:

More information

Efficient Compression of Monotone and m-modal Distributions

Efficient Compression of Monotone and m-modal Distributions Efficient Compression of Monotone and m-modal Distriutions Jayadev Acharya, Ashkan Jafarpour, Alon Orlitsky, and Ananda Theertha Suresh ECE Department, UCSD {jacharya, ashkan, alon, asuresh}@ucsd.edu Astract

More information