Estimating properties of distributions. Ronitt Rubinfeld MIT and Tel Aviv University

Size: px

Start display at page:

Download "Estimating properties of distributions. Ronitt Rubinfeld MIT and Tel Aviv University"

Patricia Charles
5 years ago
Views:

1 Estimating properties of distributions Ronitt Rubinfeld MIT and Tel Aviv University

2 Distributions are everywhere

3 What properties do your distributions have?

4 Play the lottery? Is it independent? Is it uniform?

5 Is the lottery unfair? From Hitlotto.com: Lottery experts agree, past number histories can be the key to predicting future winners.

6 True Story! Polish lottery Multilotek Choose uniformly at random distinct 20 numbers out of 1 to 80. Initial machine biased e.g., probability of too small Past results:

7 New Jersey Pick 3,4 Lottery New Jersey Pick k ( =3,4) Lottery. Pick k digits in order. 10 k possible values. Assume lottery draws iid Data: Pick results from 5/22/75 to 10/15/00 χ 2 -test gives 42% confidence Pick results from 9/1/77 to 10/15/00. fewer results than possible values χ 2 -test gives no confidence

8 Testing Independence: Shopping patterns: Independent of zip code?

9 Testing closeness of two distributions: Transactions of yr olds Transactions of yr olds trend change?

10 Outbreak of diseases Similar patterns? Correlated with income level? More prevalent near large airports? Flu 2011 Flu 2012

11 Information in neural spike trails [Strong, Koberle, de Ruyter van Steveninck, Bialek 98] Neural signals Each application of stimuli gives sample of signal (spike trail) time Entropy of (discretised) signal indicates which neurons respond to stimuli

12 Compressibility of data

13 Worm detection find ``heavy hitters nodes that send to many distinct addresses

14 Testing properties of distributions: Decisions based on samples of distribution No assumptions on shape of distribution i.e., smoothness, monotonicity, Normal distribution, Focus on large domains Can sample complexity be sublinear in size of the domain? Rules out standard statistical techniques, learning distribution

15 Model: p p is arbitrary black-box distribution over [n], generates iid samples. samples p i = Prob[ p outputs i ] Test Sample complexity in terms of n? Pass/Fail?

16 Some properties Similarities of distributions: Testing uniformity Testing identity Testing closeness Entropy estimation Support size Independence and limited independence properties Monotonicity

17 Similarities of distributions Are p and q close or far? q is known to the tester q is uniform q is given via samples

18 Two Distances L 1 Distance: p q 1 = p i q i L 2 Distance: i ε D p q 2 = ( ( p i q i ) 2 ) 1/2 i ε D

19 Is p uniform? p Test samples Theorem: ([Goldreich Ron] [Paninski]) Sample complexity of distinguishing p = U from p U 1 > ε is θ(n 1/ 2 ) Nearly same complexity to test if p is any known distribution [Batu Fischer Fortnow Kumar R. White]: Testing identity Pass/Fail?

20 Is p uniform? p samples Theorem: ([Goldreich Ron] [Paninski]) Sample complexity of distinguishing p = U from p U 1 > ε is θ(n 1/ 2 ) Test Pass/Fail?

21 Upper bound for L 2 distance [Goldreich Ron] L 2 distance: p q 2 2 = pi q i 2 p-u 2 2 = Σ(p i -1/n) 2 = Σp i 2-2Σp i /n + Σ1/n 2 ======== = Σp i 2-1/n Estimate collision probability to estimate L 2 distance from uniform

22 Estimating collision probabilities [GR]? Naïve idea: Use independent samples to estimate collision probability Get O(k) samples of collision probability from k samples of p?? Better idea: Use all pairs in sample to estimate collision probabilities Get O(k 2 ) samples of collision probability from k samples of p Not all independent, use variance to bound accuracy??????

23 Testing uniformity [GR] Upper bound: Estimate collision probability Issues: Collision probability of uniform is 1/n Use O(sqrt(n)) samples Relation between L 1 and L 2 norms Comment: [P] uses different estimator Easy lower bound: Ω(n ½ ) Can get Ω (n ½ /ε 2 ) [P]

24 Back to the lottery plenty of samples!

25 Is p uniform? p samples Theorem: ([Goldreich Ron][Batu Fortnow R. Smith White] [Paninski]) Sample complexity of distinguishing p=u from p-u 1 >ε is θ(n 1/2 ) Test Pass/Fail? Nearly same complexity to test if p is any known distribution [Batu Fischer Fortnow Kumar R. White]: Testing identity

26 This image cannot currently be displayed. This image cannot currently be displayed. Testing identity via testing uniformity on subdomains: (Relabel domain so that q monotone) Partition domain into O(log n) groups, so that each group almost flat -- differ by <(1+ε) multiplicative factor q close to uniform over each group Test: Test that p close to uniform over each group Test that p assigns approximately correct total weights to each group q (known)

27 Testing closeness p Test q Theorem: ([BFRSW] [P. Valiant]) Sample complexity of distinguishing p=q from p-q 1 >ε ~ is θ(n 2/3 ) Pass/Fail?

28 Why so different? Collision statistics are all that matter Collisions on heavy elements can hide collision statistics of rest of the domain Construct pairs of distributions where heavy elements are identical, but light elements are either identical or very different

29 Lower bound p: n 2/3 1/2 n/4 n/4 1/2 Claim: No algorithm can distinguish (p,p) from (p,q) with only o(n 2/3 ) samples. p i =1/2n 2/3 p i =0 p i =2/n p i =0 q: 1/2 1/2 q i =1/2n 2/3 heavy elements q i =0 q i =0 light elements q i =2/n p-q 1 =1

30 Lower Bound Argument Assume algorithm is symmetric or we can permute the underlying domain. Only collision statistics matter! If we use only o(n 2/3 ) samples 3 + -way collisions are only in heavy elements and only a small fraction of these. Can focus only on 2-way collisions (one from each source)

31 Lower Bound (cont.) (p,p) case # 2-way collisions = H + L (p,q) case # 2-way collisions = H H: # 2-way collisions in heavy elements L: # 2-way collisions in light elements When sample size s=o(n 2/3 ), St.Dev.(H) = Θ(s n -1/3 ) >> E[L] = o(s n -1/3 ) = Var[L] Cannot distinguish H+L from H!!!

32 P. Valiant s characterization for symmetric properties: Collisions tell all! Canonical tester identifies if there is a distribution with the property that expects observed collision statistics Difficulty in analysis: Collision statistics aren t independent Low frequency elements can be ignored? Applies to symmetric properties with continuity condition

33 Upper Bound for L2 distance: Theorem: [Batu Fortnow R. Smith White] There is an algorithm which on input distributions p and q satisfies: if p-q 2 < ε/2 output Pass if p-q 2 > ε output Fail error probability < 1/3 sample, time complexity O(ε -4 ) No dependence on n?!?!

34 What about L 1 distance?

35 Naïve nearly linear time algorithm learning approach 1. Estimate probabilities p i and q i s using frequencies 2. Estimate L 1 distance directly Uses O(n log n) samples in general.

36 L 2 distance test Use O(ε -4 ) algorithm for estimating L 2 distance to within ε Naïve use of relationship between L 1 and L 2 distance leads to O(n 2 ) algorithm

37 Naïve nearly linear time algorithm learning approach (revisited) 1. Estimate probabilities p i and q i s using frequencies 2. Estimate L 1 distance directly Uses O(n log n) samples in general. If minimum nonzero p i or q i is > b, then need only O(1/b log n) samples. When b=n -2/3, need Õ(n 2/3 ) samples

38 L 2 distance test (revisited) Use O(ε -4 ) algorithm for estimating L 2 distance to within ε Naïve use of relationship between L 1 and L 2 distance leads to O(n 2 ) algorithm Can analyze tighter variance bound for L 2 test in terms of maximum probability b of p i and q i s. When b<n -2/3, need Õ(n 2/3 ) samples for L 1 distance

39 Algorithm Identify all ``heavy elements Estimate probabilities of heavy elements and fail if not close Filter out heavy elements Use L 2 test and fail if not close

40 Theorem: There exists an algorithm which on input distributions p and q uses O(ε -4 n 2/3 log n) samples and satisfies: if p-q 1 < ε/n 1/ 3 outputs Pass if p-q 1 > ε outputs Fail probability of error is < 1/3.

41 A historical note: Interest in [GR] and [BFRSW] sparked by search for property testers for expanders Eventual success! [Czumaj Sohler, Kale Seshadri, Nachmias Shapira] Used to give O(n 2/3 ) time property testers for rapidly mixing Markov chains [BFRSW]

42 Approximating the distance between two Estimating distributions? p q 1 requires Ө(n/log n) samples! [P. Valiant 08] [G. Valiant and P. Valiant STOC11, FOCS11]

43 Information theoretic quantities Entropy Support size

44 Can we approximate the entropy? In general, not to within a multiplicative factor... 0 entropy distributions are hard to distinguish (even in superlinear time) What if entropy is bigger? Can γ-multiplicatively approximate the entropy with Õ(n 1/γ2 ) samples (when entropy >2γ/ε) [Batu Dasgupta R. Kumar] requires Ω(n 1/γ2 ) [Valiant] better bounds when support size is small [Brautbar Samorodnitsky] Similar bounds for estimating support size

45 What about additive approximations? Intense interest in information theory, neurobiology, physics, machine learning. Previously, O(n) algorithms made the above people very happy. (better than trivial O(n log n)!) O(n/log n) algorithms possible! [Raskhodnikova, Ron, Shpilka, Smith] [Valiant Valiant] And Ω(n/log n) samples required [Valiant Valiant] Similar story for support size

46 Estimating Compressibility of Data [Raskhodnikova Ron Rubinfeld Smith] General question undecidable Run-length encoding Huffman coding Entropy Lempel-Ziv ``Color number = Number of elements with probability at least 1/n (support size!)

47 Testing Independence: Shopping patterns: Independent of zip code?

48 Independence of pairs p is joint distribution on pairs <a,b> from [n] x [m] (wlog n m) Marginal distributions p 1,p 2 p independent if p = p 1 x p 2, that is p (a,b) =(p 1 ) a (p 2 ) b for all a,b

49 Independence vs. product of marginals Lemma: [Sahai Vadhan] If A,B, such that p AxB 1 <ε/3 then p- p 1 x p 2 1 <ε

50 Testing Independence [Batu Fischer Fortnow Kumar R. White] p samples Independence Test Goal: If p = p 1 x p 2 then PASS If p p 1 x p 2 1 >ε then FAIL Pass/Fail?

51 1st try: use closeness test p p 1 x p 2 Simulate p 1 and p 2, and check p- p 1 x p 2 1 <ε. samples Closeness Test Pass/Fail? Behavior: If p- p 1 x p 2 1 <ε/n 1/3 then PASS If p- p 1 x p 2 1 >ε then FAIL Sample complexity: Õ((nm) 2/3 )

52 2nd try: Use identity test Algorithm: Approximate marginal distributions f 1 p 1 and f 2 p 2 Use Identity testing algorithm to test that p f 1 x f 2 Comments: use care when showing that good distributions pass Sample complexity: Õ(n+m + (nm) 1/2 ) Can combine with previous using filtering ideas identity test works well on distribution restricted to ``heavy prefixes from p 1 closeness test works well if max probability element is bounded from above

53 Theorem: [Batu Fischer Fortnow Kumar R. White] There exists an algorithm for testing independence with sample complexity O(n 2/3 m 1/3 poly(log n, ε -1 )) s.t. If p=p 1 x p 2, it outputs PASS If p-q 1 >ε for any independent q, it outputs FAIL This is tight! [Levi Ron Rubinfeld]

54 An open question: What is the complexity of testing independence of distributions over k-tuples from [n 1 ]x x[n k ]? Easy Ω( n i 1/2 ) lower bound

55 Testing the monotonicity of distributions: Does the occurrence of cancer decrease with distance from the nuclear reactor?

56 Monotone distributions p is monotone if 0.4 i<j implies p i p j pi 0.1 Many distributions are monotone or are made of small number of monotone distributions pi

57 First Monotone distributions over totally ordered domains [1..n] pi

58 Form of test? Idea: test that average weight of distribution in range [i..j] less than average weight of distribution in [i j ] for various choices of i<i,j<j Problem: uniform distribution on even numbers passes such tests ARGHHHHHHHHHH!

59 Lower bound [Batu Kumar R.] Lemma: Testing monotonicity requires Ω( n) samples Proof: p close to uniform iff p, p R = reversal of p, are both close to monotone p Rate p R Rate

60 Algorithm idea: Approximate distribution by k-flat distribution: Properties: Partition domain into k intervals Conditional distribution uniform in each Questions: Does it exist for k=o(polylog(n))? How do you find interval boundaries? Check if k-flat distribution close to monotone Solve linear program on O(polylog(n)) variables

61 Upper bound [Batu Kumar R.] Lemma: There is an algorithm for testing monotonicity over totally ordered domains which uses Õ(n 1/2 ε -2 ) samples s.t. (with probability 2/3) If p monotone, outputs PASS If ε far from monotone, outputs FAIL Can also test unimodal distributions

62 More recent work: K-flat distributions [Levi Indyk R.] K-modal distributions [Daskalakis Diakonikolas Servedio] Poisson Binomial Distributions [Daskalakis Diakonikolas Servedio] Monotonicity over general posets [Bhattacharyya Fischer R. Valiant]

63 Other properties? Higher dimensional flat distributions Mixtures of k Gaussians Junta -distributions

64 Getting past the lower bounds Special distributions e.g, uniform on a subset, monotone Other query models Queries to probabilities of elements Other distance measures

65 Flat distributions Entropy can be estimated somewhat faster when distribution is uniform on a subset of the elements [Batu Dasgupta Kumar R.][Brautbar Samorodnitsky]

66 Monotone distributions over totally ordered domains Test uniformity with O(1) samples [Batu Kumar R.] Other tasks doable with polylogarithmic samples: [Batu Dasgupta Kumar R.][BKR] Examples: Testing closeness Testing independence Estimating entropy Algorithm: Use k-flat partitions to approximate distributions Test property on approximation Do these big wins carry over to other partial orders? Yes and no.[r. Servedio] [Bhattacharyya Fischer R. Valiant] Still open: can you get sublinear sample algorithms?

67 Other models: Distribution given explicitly -- i.e. can query probability of element in one step! Entropy [BDKR] Distribution given both by samples and explicitly e.g. data in sorted order, Google n-gram data Entropy, closeness [BDKR][RS]

68 Other distance measures: KL-divergence [Guha McGregor Venkatasubramanian] EMD [Doba Nguyen Nguyen R.]

69 What about many distributions?

70 Collection of distributions: Further refinement: Known or unknown distribution on i s? Two models: D 1 D 2 D m Sampling model: samples Get (i,x) for random i, x D i Query model: Get (i,x) for query i and x D i Test Pass/Fail? Sample complexity in terms of n,m?

71 Some questions (and answers): Are they all equal? Can they be clustered into k groups of similar distributions? Do they all have the same mean? See [Levi Ron R. 2011, Levi Ron R. 2012]

72 Conclusions and Future Directions Distribution property testing problems are everywhere Several useful techniques known Other properties for which sublinear tests exist? Special classes of distributions? Other query models? Non-iid samples?

73 Thank you

Taming Big Probability Distributions

Taming Big Probability Distributions Ronitt Rubinfeld June 18, 2012 CSAIL, Massachusetts Institute of Technology, Cambridge MA 02139 and the Blavatnik School of Computer Science, Tel Aviv University. Research