Monotone Estimation Framework and Applications for Scalable Analytics of Large Data Sets


1 Monotone Estimation Framework and Applications for Scalable Analytics of Large Data Sets Edith Cohen Google Research Tel Aviv University


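2 A Monotone Sampling Scheme. Data domain $V \subseteq \mathbb{R}_{\ge 0}^r$, data $v \in V$, random seed $u \sim U[0,1]$. Outcome $S(v,u)$: a function of the data $v$ and the seed $u$. The seed value $u$ is available with the outcome. $S(v,u)$ can be interpreted as the set of all data vectors consistent with the outcome and $u$. Monotonicity: fixing $v$, the information in $S(v,u)$ is non-increasing with $u$ (equivalently, the set of consistent data vectors can only grow as $u$ increases).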
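A minimal Python sketch may make the definition concrete. It assumes a threshold scheme in which entry $v_i$ is revealed exactly when $v_i \ge \tau u$ (the PPS-style rule used later in the talk). The names `outcome` and `consistent_set`, the data vector, and the candidate vectors are hypothetical and chosen only for illustration.

```python
def outcome(v, u, tau=1.0):
    """Outcome S(v, u): reveal v_i when v_i >= tau * u, otherwise None.

    A hidden entry is only known to lie in [0, tau * u).
    """
    return tuple(x if x >= tau * u else None for x in v)


def consistent_set(v, u, candidates, tau=1.0):
    """Candidate data vectors that produce the same outcome as v at seed u."""
    s = outcome(v, u, tau)
    return {w for w in candidates if outcome(w, u, tau) == s}


# Hypothetical data vector and candidates (not taken from the slides).
v = (0.6, 0.2)
candidates = [(0.6, 0.2), (0.6, 0.1), (0.6, 0.0), (0.5, 0.2), (0.3, 0.3)]

# A lower seed gives a smaller consistent set, i.e. more information about v.
sets_by_seed = {u: consistent_set(v, u, candidates) for u in (0.1, 0.4, 0.8)}
for u, ws in sets_by_seed.items():
    print(u, sorted(ws))
```

Under these assumptions the consistent sets are nested: every vector consistent at a low seed stays consistent at a higher seed.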

3 Monotone Estimation Problem (MEP). A monotone sampling scheme $(V, S)$: data domain $V \subseteq \mathbb{R}_{\ge 0}^r$; sampling scheme $S: V \times [0,1]$. A nonnegative function $f: V \to \mathbb{R}_{\ge 0}$. Goal: estimate $f(v)$ from the sample. [Diagram: data $v$ → sample $S$ → estimator; query: $f(v)$? answered by $\hat f(S(v,u),u)$.] Desired properties of the estimator $\hat f(S)$. Unbiased: $\forall v$, $\int_0^1 \hat f(S(v,x),x)\,dx = f(v)$ (useful with sums of MEPs). Nonnegative: keeps the estimate $\hat f$ in the same domain as $f$. (Pareto) optimal (admissible): any estimator with smaller $\mathrm{Var}_{u \sim U[0,1]}[\hat f(S(v,u),u)]$ for some $v$ has larger variance for some other data vector.

4 Bounds on f(v) from S and u. Fix the data v. The lower the seed u is, the more we know about v and hence about f(v). [Figure: information on f(v) plotted against the seed u, up to u = 1.]

5 Estimators for MEPs. Desired properties: unbiased, nonnegative, bounded variance; admissible (Pareto optimal in terms of variance). Results preview: explicit expressions for estimators for any MEP for which such an estimator exists. The solution is not unique, so we consider estimators with natural properties: if we require monotonicity ($\hat f(S,u)$ is non-increasing with $u$), we get uniqueness. We also define a notion of competitiveness of estimators. We will come back to this, but first see some applications.

6 MEP applications in data analysis. Scalable computation of approximate statistics and queries over large data sets: data is sampled (composable, distributed scheme), and the sample is used to estimate statistics/queries expressed as a sum of multiple MEPs. Key-value pairs with multiple sets of values (instances): take coordinated samples of instances; we get a MEP for each key. Sketching graph-based influence and similarity functions: distance-sketch the utility values (relations of a node to all others); we get a MEP for each target node from the sketches of seed nodes. Sketching generalized coverage functions: coordinated weighted sample of the utility vector of each element; a MEP for each item.

7 Social/Communication data. An activity value $v(x)$ is associated with each key $x = (b, c)$ (e.g. number of messages, communication from b to c). Take a weighted sample of keys, for example bottom-k ("weighted reservoir") or PPS (Probability Proportional to Size). Monday activity: (a,b) 40, (f,g) 5, (h,c) 20, (a,z) 10, (h,f) 10, (f,s) 10. For $\tau > 0$ and i.i.d. seeds $u(x) \sim U[0,1]$: $x \in S \iff v(x) \ge \tau\, u(x)$. Monday sample: (a,b) 40, (a,z) 10, .., (f,s) 10. With bottom-k, $\tau$ is set to obtain a fixed sample size $k$. Without-replacement sampling: $v(x) \ge -\tau \ln u(x)$. This is a fully composable sampling scheme.
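The PPS rule and the bottom-k variant can be sketched in a few lines of Python. The Monday values come from the slide; the seeds are hypothetical (the slide does not list them), $\tau = 100$ is borrowed from the later slides, and bottom-k is read here as "keep the k keys with the smallest rank u(x)/v(x)", which is one common formulation and an assumption of this sketch.

```python
def pps_sample(values, seeds, tau):
    """Keep key x exactly when v(x) >= tau * u(x)."""
    return {x: v for x, v in values.items() if v >= tau * seeds[x]}


def bottom_k_sample(values, seeds, k):
    """Keep the k keys with the smallest rank u(x) / v(x) (zero values are skipped)."""
    ranked = sorted(
        (x for x, v in values.items() if v > 0),
        key=lambda x: seeds[x] / values[x],
    )
    return {x: values[x] for x in ranked[:k]}


monday = {
    ("a", "b"): 40, ("f", "g"): 5, ("h", "c"): 20,
    ("a", "z"): 10, ("h", "f"): 10, ("f", "s"): 10,
}
# Hypothetical seeds u(x); in practice they would be random or hash-based.
seeds = {
    ("a", "b"): 0.22, ("f", "g"): 0.75, ("h", "c"): 0.47,
    ("a", "z"): 0.04, ("h", "f"): 0.92, ("f", "s"): 0.07,
}

sample = pps_sample(monday, seeds, tau=100)
sample_k3 = bottom_k_sample(monday, seeds, k=3)
print(sample)
print(sample_k3)
```

With these particular seeds the fixed-threshold sample and the bottom-3 sample contain the same keys, which illustrates how bottom-k amounts to choosing $\tau$ to hit a target sample size.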

8 Samples of multiple days. Coordinated samples: different values for different days, but each key is sampled with the same seed $u(x)$ on every day. Monday activity: (a,b) 40, (f,g) 5, (h,c) 20, (a,z) 10, (h,f) 10, (f,s) 10. Monday sample: (a,b) 40, (a,z) 10, .., (f,s) 10. Tuesday activity: (a,b) 3, (f,g) 5, (g,c) 10, (a,z) 50, (s,f) 20, (g,h) 10. Tuesday sample: (g,c) 10, (a,z) 50, .., (g,h) 10. Wednesday activity: (a,b) 30, (g,c) 5, (h,c) 10, (a,z) 10, (b,f) 20, (d,h) 10. Wednesday sample: (a,b) 30, (b,f) 20, .., (d,h) 10.
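One way to obtain the same seed for a key on every day is to derive it from a hash of the key. The sketch below assumes this hash-based approach (the slides do not say how seeds are shared); `seed_of` and `coordinated_samples` are hypothetical names, and $\tau = 100$ is borrowed from the later slides.

```python
import hashlib


def seed_of(key):
    """Map a key to a reproducible pseudo-random seed in [0, 1)."""
    digest = hashlib.sha256(repr(key).encode("utf-8")).digest()
    # 48 bits fit exactly in a float, so the result is strictly below 1.
    return int.from_bytes(digest[:6], "big") / 2**48


def coordinated_samples(instances, tau):
    """PPS-sample every instance with the shared per-key seed."""
    return {
        day: {x: v for x, v in values.items() if v >= tau * seed_of(x)}
        for day, values in instances.items()
    }


days = {
    "Monday": {("a", "b"): 40, ("f", "g"): 5, ("h", "c"): 20,
               ("a", "z"): 10, ("h", "f"): 10, ("f", "s"): 10},
    "Tuesday": {("a", "b"): 3, ("f", "g"): 5, ("g", "c"): 10,
                ("a", "z"): 50, ("s", "f"): 20, ("g", "h"): 10},
}
samples = coordinated_samples(days, tau=100)
for day, sample in samples.items():
    print(day, sample)
```

Because the seed is shared, a key that is sampled on a day where its value is small is also sampled on any day where its value is at least as large.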

9 Matrix view: keys × instances. In our example, keys x = (a, b) are user pairs and instances are days. [Table: rows are the keys (a,b), (g,c), (h,c), (a,z), (h,f), (f,s), (d,h); columns are the days Su, Mo, Tu, We, Th, Fr, Sa.]

10 Example Statistics. Specify a segment of the keys $Y \subseteq X$; examples: one user in CA and one in NY; Apple device to Android. Queries/statistics: $\sum_{x \in Y} f(v_1(x), v_2(x), \ldots, v_r(x))$. Total communication of the segment on Wednesday: $\sum_{x \in Y} v_1(x)$. $L_p^p$ distance / weighted Jaccard change in activity of the segment between Friday and Saturday: $\sum_{x \in Y} |v_1(x) - v_2(x)|^p$. $L_p^p$ increase/decrease: $\sum_{x \in Y} \max\{0, v_1(x) - v_2(x)\}^p$. Coverage of segment $Y$ in days $D$: $\sum_{x \in Y} \max_{i \in D} v_i(x)$. Average/sum of median/max/min/top-3/concave aggregate of activity values over days $D$. We would like to compute an estimate from the sample.

11 Matrix view: keys × instances. Coordinated PPS sample, $\tau = 100$ for all entries. [Table: seed u and the days Su, Mo, Tu, We, Th, Fr, Sa for the keys (a,b), (g,c), (h,c), (a,z), (h,f), (f,s), (d,h).]

12 Estimate sum statistics, one key at a time. $\sum_{x \in Y} f(v(x))$: sum over keys $x \in Y$ of $f(v(x))$, where $v(x) = (v_1(x), v_2(x), \ldots)$. For $L_p^p$ distance: $f(v) = |v_1 - v_2|^p$. Estimate one key at a time: $\sum_{x \in Y} \hat f(S(v(x)))$. The estimator for $f(v)$ is applied to the sample of $v$.

13 Easy statistics: sum over entries. Estimate a single entry at a time. Example: total communication of segment $Y$ on Monday. Inverse probability estimate (Horvitz-Thompson) [HT52]: sum over sampled $x \in Y$ of $v_{\text{monday}}(x) / p_{\text{monday}}(x)$. The inclusion probability $p_{\text{monday}}(x)$ can be computed from $v(x)$ and $\tau$: since $x \in S \iff v(x) \ge \tau\, u(x)$, we have $p_i(x) = \Pr_{u \sim U[0,1]}[v_i(x) \ge \tau_i\, u(x)] = \min\{1, v_i(x)/\tau_i\}$.

14 HT estimator (single-instance). Coordinated PPS sample, $\tau = 100$. [Table: seed u and the days Su, Mo, Tu, We, Th, Fr, Sa for the keys (a,b), (g,c), (h,c), (a,z), (h,f), (f,s), (d,h).]

15 HT estimator (single-instance). $\tau = 100$. Day: Wednesday. Segment: CA-NY. [Table: seed u and the days Su, Mo, Tu, We, Th, Fr, Sa for the keys (a,b), (g,c), (h,c), (a,z), (h,f), (f,s), (d,h).]

16 HT estimator for single-instance. $\tau = 100$. Day: Wednesday. Segment: CA-NY. Wednesday values: (a,b) 43, (g,c) 0, (h,c) 60, (a,z) 24, (h,f) 3, (f,s) 20, (d,h) 0. Inclusion probabilities of the sampled segment keys: p = 0.43 and p = 0.20. Exact: 43 + 60 + 20 = 123. The HT estimate is 0 for keys that are not sampled and v/p when the key is sampled. HT estimate: 43/0.43 + 20/0.20 = 200.
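The Wednesday example can be replayed with a short Python sketch. The values, $\tau = 100$, and the totals 123 and 200 are from the slide. The segment membership (the three keys whose values sum to 123) is inferred from those totals, and the seeds are hypothetical, chosen so that (a,b) and (f,s) are sampled and (h,c) is not.

```python
import math

TAU = 100.0
wednesday = {
    ("a", "b"): 43, ("g", "c"): 0, ("h", "c"): 60, ("a", "z"): 24,
    ("h", "f"): 3, ("f", "s"): 20, ("d", "h"): 0,
}
# Assumed CA-NY segment: the keys whose Wednesday values sum to 123.
segment = [("a", "b"), ("h", "c"), ("f", "s")]
# Hypothetical seeds; the seed column of the slide is not available.
seeds = {("a", "b"): 0.32, ("h", "c"): 0.85, ("f", "s"): 0.11}


def inclusion_probability(v, tau=TAU):
    """PPS inclusion probability Pr[v >= tau * u] for u ~ U[0, 1]."""
    return min(1.0, v / tau)


def ht_estimate(values, seeds, keys, tau=TAU):
    """Sum of v / p over the sampled keys; unsampled keys contribute 0."""
    total = 0.0
    for x in keys:
        v = values[x]
        if v > 0 and v >= tau * seeds[x]:
            total += v / inclusion_probability(v, tau)
    return total


exact = sum(wednesday[x] for x in segment)
estimate = ht_estimate(wednesday, seeds, segment)
print(exact, estimate)
```

Each sampled key with value below $\tau$ contributes $v/(v/\tau) = \tau$, which is why two sampled keys give an estimate of about 200.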

17 Inverse-Probability (HT) estimator. ✓ Unbiased: $(1 - p(x)) \cdot 0 + p(x) \cdot \frac{f(v(x))}{p(x)} = f(v(x))$. ✓ Nonnegative: $v(x) \ge 0$, so $\frac{f(v(x))}{p(x)} \ge 0$. ✓ Bounded variance (for all v). ✓ Monotone: more information ⇒ higher estimate. ✓ Optimal: UMVU, the unique minimum variance (unbiased, nonnegative, sum) estimator. Works when f depends on a single entry. What about general f?

18 Queries involving multiple columns. $L_p^p$ distance: $f(v) = |v_1 - v_2|^p$. $L_p^p$ increase: $f(v) = \max\{0, v_1 - v_2\}^p$. The HT estimate is positive only when we know $f(v) = |v_1 - v_2|$ from the sample. But when $v_2 = 0$ and $v_1 > 0$, we have $f(v) > 0$ yet the sample never reveals $f(v)$, because the second entry is never sampled. Thus HT is biased. Even when unbiased, HT may not be optimal: for example, when $v_1$ is sampled and we can deduce from $\tau_2$ and $u$ that $v_2 \le a < v_1$, we know that $f(v) \ge v_1 - a$. An optimal estimator will use this incomplete information. We want estimators with the same nice properties as HT, plus optimality.
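The bias can be checked numerically with a small Python sketch. It assumes a simplified coordinated scheme with one shared seed and threshold 1, so that an entry $v_i \in [0,1]$ is revealed exactly when $v_i \ge u$. The data vectors and the function names are hypothetical.

```python
def f(v):
    """One-sided difference max{0, v1 - v2}."""
    return max(0.0, v[0] - v[1])


def ht_estimate(v, u):
    """HT-style estimate: f(v) / p when both entries are revealed, else 0.

    With a shared seed, both entries are revealed iff u <= min(v1, v2),
    which happens with probability p = min(v1, v2).
    """
    p = min(1.0, min(v))
    if p > 0 and u <= p:
        return f(v) / p
    return 0.0


def expected(estimator, v, m=1000):
    """Midpoint-rule approximation of the integral over the seed u in (0, 1)."""
    return sum(estimator(v, (k + 0.5) / m) for k in range(m)) / m


unbiased_case = expected(ht_estimate, (0.6, 0.2))
biased_case = expected(ht_estimate, (0.6, 0.0))
print(unbiased_case, f((0.6, 0.2)))
print(biased_case, f((0.6, 0.0)))
```

For (0.6, 0.2) the average matches $f(v) = 0.4$, while for (0.6, 0.0) the estimate is always 0 even though $f(v) = 0.6$.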

19 Sampled data. Coordinated PPS sample, $\tau = 100$. [Table: seed u and the days Su, Mo, Tu, We, Th, Fr, Sa for the keys (a,b), (g,c), (h,c), (a,z), (h,f), (f,s), (d,h).] We want to estimate the squared difference between two instances; let's look at key (a,z).

20 Information on f. Fix the data v. The lower u is, the more we know about v and about $f(v) = (v_1 - v_2)^2 = 81$. We plot the lower bound we have on f(v) as a function of the seed u. [Figure: lower bound on f(v) against u.]

21 This is a MEP! Monotone Estimation Problem. A monotone sampling scheme $(V, S)$: data domain $V \subseteq \mathbb{R}_{\ge 0}^r$, here $(v_1, v_2) \in \mathbb{R}_{\ge 0}^2$; sampling scheme $S: V \times [0,1]$, here $S((v_1, v_2), u)$ reveals $v_i$ when $v_i \ge 100u$. A nonnegative function $f: V \to \mathbb{R}_{\ge 0}$, here $|v_1 - v_2|^2$. Goal: estimate $f(v)$, that is, specify an estimator $\hat f(S, u)$ that is unbiased, nonnegative, has bounded variance, and is admissible (optimal). The solution is not unique.

22 The optimal (admissible) range. We see $S(v,u)$ and $u$, and we know what $S(v,x)$ is for all $x > u$. Suppose we have fixed the cumulative estimate for $x > u$: $M = \int_u^1 \hat f(S(v,x),x)\,dx$.

23 MEP Estimators. Order-optimal estimators: for an order $\prec$ on the data domain $V$, any estimator with lower variance on $v$ must have higher variance on some $z \prec v$. The L* estimator: the unique admissible monotone estimator; order optimal for $z \prec v \iff f(z) < f(v)$; 4-variance competitive (defined shortly). The U* estimator: order optimal for $z \prec v \iff f(z) > f(v)$. The choice of estimator depends on the properties we want, possibly depending on the typical data distribution. L* is a good default (monotone and competitive).

24 Variance Competitiveness [CK13]. A worst-case (over data) theoretical indicator for estimator quality. For each $v$, we can consider the minimum $E_{u \sim U[0,1]}[\hat f^2(S(v,u),u)]$ attainable by an estimator that is unbiased and nonnegative for all other $v$. We use such an optimal estimator $\hat f^{(v)}$ for $v$ as a reference point. An estimator $\hat f(S,u)$ is c-competitive if, for any data $v$, the expectation of the square is within a factor $c$ of the minimum possible for $v$ (by an unbiased and nonnegative estimator): for all unbiased nonnegative $g$, $E_{u \sim U[0,1]}[\hat f^2(S(v,u),u)] \le c\, E_{u \sim U[0,1]}[g^2(S(v,u),u)]$. The L* estimator is 4-competitive and this is tight: for some MEPs, the ratio is 4.

25 Optimal estimator $\hat f^{(v)}$ for data v (unbiased and nonnegative for all data). The optimal estimates $\hat f^{(v)}$ are the negated derivative of the lower hull of the lower bound function. [Figure: lower bound function for v and its lower hull, plotted against u up to 1.] Intuition: the lower bound tells us, on outcome S, how high we can go with the estimate in order to optimize variance for v while still being nonnegative on all other consistent data vectors.
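The "negated derivative of the lower hull" can be sketched numerically in Python. This is an illustration under stated assumptions: $f(v) = |v_1 - v_2|$, data $v = (0.6, 0.2)$ as in the example slides, a scheme that reveals $v_i$ exactly when $v_i \ge u$, and a hull computed on a uniform grid with the standard monotone-chain method (so hull vertices are only resolved to grid precision). All function names are hypothetical.

```python
def lower_bound(u, v=(0.6, 0.2)):
    """Lower bound on f(v) = |v1 - v2| given S(v, u); v_i is revealed iff v_i >= u."""
    known = [x for x in v if x >= u]
    if len(known) == 2:
        return abs(v[0] - v[1])
    if len(known) == 1:
        return max(0.0, known[0] - u)  # the hidden entry lies in [0, u)
    return 0.0


def lower_hull(points):
    """Lower convex hull of points sorted by x (monotone chain)."""
    hull = []
    for p in points:
        while len(hull) >= 2:
            (x1, y1), (x2, y2) = hull[-2], hull[-1]
            cross = (x2 - x1) * (p[1] - y1) - (y2 - y1) * (p[0] - x1)
            if cross <= 0:  # hull[-1] is on or above the segment hull[-2] -> p
                hull.pop()
            else:
                break
        hull.append(p)
    return hull


def v_optimal_estimate(hull, u):
    """Negated slope of the hull segment that contains u."""
    for (x1, y1), (x2, y2) in zip(hull, hull[1:]):
        if x1 <= u <= x2:
            return (y1 - y2) / (x2 - x1)
    raise ValueError("u must lie in [0, 1]")


n = 1000
grid = [(k / n, lower_bound(k / n)) for k in range(n + 1)]
hull = lower_hull(grid)
print(hull)
print(v_optimal_estimate(hull, 0.3), v_optimal_estimate(hull, 0.8))
```

For this data the hull has a single sloped segment from u = 0 to u = 0.6 followed by a flat segment, so the v-optimal estimate is constant below 0.6 and zero above it.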

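26 The L* estimator. We see $S(v,u)$ and $u$, and we know what $S(v,x)$ is for all $x > u$. With $\underline{f}(S(v,x),x)$ denoting the lower bound on $f$ from the outcome at seed $x$: $\hat f^{(L)}(S(v,u),u) = \frac{\underline{f}(S(v,u),u)}{u} - \int_u^1 \frac{\underline{f}(S(v,x),x)}{x^2}\,dx$.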
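The L* expression can be evaluated numerically. The Python sketch below assumes $f(v) = |v_1 - v_2|$ with $v = (0.6, 0.2)$ and a scheme that reveals $v_i$ exactly when $v_i \ge u$ (the setting of the example slides). The integral is computed with the substitution $t = 1/x$, which turns $\int_u^1 \underline{f}(x)/x^2\,dx$ into $\int_1^{1/u} \underline{f}(1/t)\,dt$ and keeps the integrand bounded; that substitution and the function names are choices of this sketch, not part of the talk.

```python
import math


def lower_bound(v, u):
    """Lower bound on f(v) = |v1 - v2| given S(v, u); v_i is revealed iff v_i >= u."""
    known = [x for x in v if x >= u]
    if len(known) == 2:
        return abs(v[0] - v[1])
    if len(known) == 1:
        return max(0.0, known[0] - u)  # the hidden entry lies in [0, u)
    return 0.0


def l_star(v, u, dt=0.005):
    """L* estimate LB(u)/u - int_u^1 LB(x)/x^2 dx, integrating over t = 1/x."""
    t_hi = 1.0 / u
    n = max(1, math.ceil((t_hi - 1.0) / dt))
    h = (t_hi - 1.0) / n
    tail = h * sum(lower_bound(v, 1.0 / (1.0 + (k + 0.5) * h)) for k in range(n))
    return lower_bound(v, u) / u - tail


def expectation(v, m=200):
    """Midpoint-rule approximation of the integral of L* over u in (0, 1)."""
    return sum(l_star(v, (k + 0.5) / m) for k in range(m)) / m


v = (0.6, 0.2)
estimates = {u: l_star(v, u) for u in (0.1, 0.3, 0.8)}
mean_estimate = expectation(v)
print(estimates)
print(mean_estimate)  # should be close to f(v) = 0.4
```

In this example the estimate is non-increasing in the seed and averages to $f(v)$, consistent with the monotonicity and unbiasedness claimed for L*.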

27 $L_1$ estimation example. Estimators for $f(v_1, v_2) = |v_1 - v_2|$. Scheme: $v_i \ge 0$ is sampled if $v_i \ge u$. [Figure: the lower bound (LB) on $f(0.6, 0.2)$ from S and u; the lower hull of LB; the L*, U*, and v-optimal estimators.] U* is optimized for the vector (0.6, 0.0), which is always consistent with S. L* is optimized locally for the vector (0.6, u), the consistent vector with the smallest f.
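To connect this example with variance competitiveness, the following Python sketch compares the expected squares of three estimators on $v = (0.6, 0.2)$ with $f(v) = |v_1 - v_2|$. The closed forms used here were worked out by hand for this one data vector (L* from the L* integral expression, the v-optimal estimate from the slope of the lower hull, and HT as $f(v)/p$ when both entries are revealed); they are assumptions of this sketch rather than formulas shown in the talk.

```python
import math

V1, V2 = 0.6, 0.2  # data vector of the example; f(v) = |V1 - V2| = 0.4


def l_star(u):
    """Hand-derived L* estimate for this vector: ln(V1 / max(u, V2)) while V1 is revealed."""
    return math.log(V1 / max(u, V2)) if u <= V1 else 0.0


def v_optimal(u):
    """Hand-derived v-optimal estimate: negated hull slope f(v)/V1 while V1 is revealed."""
    return (V1 - V2) / V1 if u <= V1 else 0.0


def ht(u):
    """HT-style estimate: f(v)/p with p = V2, positive only when both entries are revealed."""
    return (V1 - V2) / V2 if u <= V2 else 0.0


def moment(estimator, power, m=1000):
    """Midpoint-rule approximation of E[estimate ** power] over u ~ U[0, 1]."""
    return sum(estimator((k + 0.5) / m) ** power for k in range(m)) / m


means = {e.__name__: moment(e, 1) for e in (l_star, v_optimal, ht)}
squares = {e.__name__: moment(e, 2) for e in (l_star, v_optimal, ht)}
ratios = {name: squares[name] / squares["v_optimal"] for name in squares}
print(means)
print(ratios)
```

All three estimators average to about 0.4 on this vector. The ratio of expected squares to the v-optimal reference is well below 4 for L* and noticeably larger for HT, which illustrates (for one data vector only) what the competitive ratio measures.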

28 $L_2^2$ estimation example. Estimators for $f(v_1, v_2) = |v_1 - v_2|^2$. Scheme: $v_i \ge 0$ is sampled if $v_i \ge u$. [Figure: the lower bound (LB) on $f(0.6, 0.2)$ from S and u; the lower hull of LB; the L*, U*, and v-optimal estimators.] U* is optimized for the vector (0.6, 0.0), which is always consistent with S. L* is optimized locally for the vector (0.6, u), the consistent vector with the smallest f.

29 Summary. Defined Monotone Estimation Problems (MEPs), motivated by coordinated sampling. Derived Pareto optimal (admissible) unbiased and nonnegative estimators (for any MEP when they exist): L* (lower end of range: the unique monotone estimator, dominates HT), U* (upper end of range), and order-optimal estimators (optimized for certain data patterns).

30 Applications. Estimators for Euclidean and Manhattan distances from samples [C KDD 14]. Sketch-based closeness similarity in social networks [CDFGGW COSN 13] (similarity of the sets of long-range interactions). Sketching generalized coverage functions, including graph-based influence functions [CDPW 14, C 16].

31 Future. Tighter bounds on the universal ratio: L* is 4-competitive, can do … competitive, lower bound is 1.44-competitive. Instance-optimal competitiveness: give an efficient construction for any MEP. Multi-dimensional MEPs: have multiple independent seeds (independent samples of "columns"); there are some initial derivations for d = 2 and for coverage and distance functions [CK 12, C 14], but the full picture is missing.

32 $L_1$ distance [C KDD14]. Independent vs. coordinated PPS sampling. Data: number of IP flows to a destination in two time periods.

33 $L_2^2$ distance [C KDD14]. Independent vs. coordinated PPS sampling. Data: surname occurrences in 2007 and 2008 books (Google ngrams).

35 Coordination of samples. A very powerful tool for big data analysis, with applications well beyond what [Brewer, Early, Joyce 1972] could envision. Locality Sensitive Hashing (LSH): similar weight vectors have similar samples/sketches. Multi-objective samples (universal samples): a single sample (as small as possible) that provides statistical guarantees for multiple sets of weights. Statistics/domain queries that span multiple instances (Jaccard similarity, $L_p$ distances, distinct counts, union size, ...). MinHash sketches are a special case with 0/1 weights. Coordination also facilitates faster computation of samples. Example: [C 97] sketching/sampling reachability sets and neighborhoods of all nodes in a graph in near-linear time.
