Multi-Objective Weighted Sampling


Edith Cohen
Google Research, Mountain View, CA, USA
arXiv v6 [cs.DB] 13 Jun 2017

Abstract

Multi-objective samples are powerful and versatile summaries of large data sets. For a set of keys X and associated values f_x ≥ 0, a weighted sample taken with respect to f allows us to approximate segment-sum statistics sum(f; H) = Σ_{x∈H} f_x, for any subset H of the keys, with statistically guaranteed quality that depends on the sample size and the relative weight of H. When estimating sum(g; H) for g ≠ f, however, the quality guarantees are lost. A multi-objective sample with respect to a set of functions F provides for each f ∈ F the same statistical guarantees as a dedicated weighted sample while minimizing the summary size. We analyze properties of multi-objective samples and present sampling schemes and meta-algorithms for estimation and optimization while showcasing two important application domains. The first is key-value data sets, where different functions f ∈ F applied to the values correspond to different statistics such as moments, thresholds, capping, and sum; a multi-objective sample allows us to approximate all statistics in F. The second is metric spaces, where keys are points and each f ∈ F is defined by a set of points C, with f_x being the service cost of x by C, so that sum(f; X) models the centrality or clustering cost of C; a multi-objective sample allows us to estimate the cost of each f ∈ F. In these domains, multi-objective samples are often of small size, are efficient to construct, and enable scalable estimation and optimization. We aim here to facilitate further applications of this powerful technique.

1 INTRODUCTION

Random sampling is a powerful tool for working with very large data sets on which exact computation, even of simple statistics, can be time and resource consuming. A small sample of the data allows us to efficiently obtain approximate answers. Consider data in the form of key-value pairs {(x, f_x)}, where the keys x are from some universe X, f_x ≥ 0, and we define f_x ≡ 0 for keys x ∈ X that are not present in the data. Very common statistics over such data are segment-sum statistics sum(f; H) = Σ_{x∈H} f_x, where H ⊂ X is a segment of X. (The alternative term selection is used in the DB literature and the term domain is used in the statistics literature.) Examples of such data sets are IP flow keys and bytes, users and activity, or customers and distance to the nearest facility. Segments may correspond to a certain demographic, location, or other meta-data of the keys. The segment statistics in these examples correspond, respectively, to the total traffic, activity, or service cost of the segment.

When the data set is large, we can compute a weighted sample, which includes each key with probability (roughly) proportional to f_x and allows us to estimate segment-sum statistics ŝum(f; H) for query segments H. Popular weighted sampling schemes [30], [32] include Poisson Probability Proportional to Size (pps) [21], VarOpt [5], [12], and the bottom-k schemes [28], [14], [15]: Sequential Poisson (priority) [24], [17] and pps without replacement (ppswor) [27]. These weighted samples provide us with nonnegative unbiased estimates ŝum(g; H) for any segment and any g ≥ 0 (provided that f_x > 0 whenever g_x > 0). For all statistics sum(f; H) we obtain statistical guarantees on estimation quality: the error, measured by the coefficient of variation (CV), which is the standard deviation divided by the mean, is at most the inverse of the square root of the sample size multiplied by the fraction sum(f; H)/sum(f; X) of the weight that is due to the segment H.
This trade-off of quality (across segments) and sample size is (worst-case) optimal. Moreover, the estimates are well concentrated in the Chernoff-Bernstein sense: the probability of an error that is c times the CV decreases exponentially in c.

In many applications, such as the following examples, there are multiple sets of values f ∈ F that are associated with the keys: (i) Data records can come with explicit multiple weights, as with activity summaries of customers/jobs that specify both bandwidth and computation consumption. (ii) Metric objectives, such as our service-cost example, where each configuration of facility locations induces a different set of distances and hence service costs. (iii) The raw data can be specified as a set of key-value pairs {(x, w_x)}, but we are interested in different functions f_x = f(w_x) of the values that correspond to different statistics:

  Statistic              Function f(w)
  count                  f(w) = 1 for w > 0
  sum                    f(w) = w
  threshold with T > 0   thresh_T(w) = I(w ≥ T)
  moment with p > 0      f(w) = w^p
  capping with T > 0     cap_T(w) = min{T, w}

Example 1.1. Consider a toy data set D:
  (u1, 5), (u3, 100), (u10, 23), (u12, 7), (u17, 1), (u24, 5), (u31, 220), (u42, 19), (u43, 3), (u55, 2).
For a segment H with H ∩ D = {u3, u12, u42, u55}, we have sum(H) = 128, count(H) = 4, thresh_10(H) = 2, cap_5(H) = 17, and 2-moment(H) = 10414.
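The statistics of Example 1.1 are straightforward to verify. The following minimal Python sketch (an illustration, not code from the paper; the data set and segment are the ones above) computes each listed statistic as a segment sum of f(w_x):

```python
# Toy data set D and segment H from Example 1.1.
D = {"u1": 5, "u3": 100, "u10": 23, "u12": 7, "u17": 1,
     "u24": 5, "u31": 220, "u42": 19, "u43": 3, "u55": 2}
H = {"u3", "u12", "u42", "u55"}

def seg_sum(f, data, segment):
    """sum(f; H) = sum of f(w_x) over the keys x of the segment."""
    return sum(f(w) for x, w in data.items() if x in segment)

stats = {
    "count":     lambda w: 1 if w > 0 else 0,
    "sum":       lambda w: w,
    "thresh_10": lambda w: 1 if w >= 10 else 0,
    "cap_5":     lambda w: min(5, w),
    "2-moment":  lambda w: w ** 2,
}

for name, f in stats.items():
    print(name, seg_sum(f, D, H))
# Prints: count 4, sum 128, thresh_10 2, cap_5 17, 2-moment 10414.
```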

For these applications, we are interested in a summary that can provide us with estimates with statistically guaranteed quality for each f ∈ F. The naive solutions are not satisfactory: we can compute a weighted sample taken with respect to a particular f ∈ F, but the quality of the estimates ŝum(g; H) rapidly degrades with the dissimilarity between g and f; we can compute a dedicated sample for each f ∈ F, but the total summary size can be much larger than necessary. Multi-objective samples, a notion crystallized in [16] (the collocated model), provide us with the desired statistical guarantees on quality with minimal summary size. Multi-objective samples build on the classic notion of sample coordination [23], [4], [29], [7], [28], [25]. In a nutshell, coordinated samples are locality-sensitive hashes of f, mapping similar f to similar samples. A multi-objective sample is (roughly) the union S^(F) = ∪_{f∈F} S^(f) of all the keys that are included in the coordinated weighted samples S^(f) for f ∈ F. Because the samples are coordinated, the number of distinct keys included, and hence the size of S^(F), is (roughly) as small as possible. Since for each f ∈ F the sample S^(F) includes the dedicated sample S^(f), the estimate quality from S^(F) dominates that of S^(f).

In this paper, we review the definition of multi-objective samples, study their properties, and present efficient sampling schemes. We consider both general sets F of objectives and families F with special structure. By exploiting special structure, we can bound the overhead, which is the increase factor in sample size necessary to meet the multi-objective quality guarantees, and obtain efficient sampling schemes that avoid dependence of the computation on |F|.

1.1 Organization

In Section 2 we review (single-objective) weighted sampling, focusing on the Poisson pps and bottom-k sampling schemes. We then review the definitions [16] and establish properties of multi-objective samples. In Section 3 we study multi-objective pps samples. We show that the multi-objective sample size is also necessary for meeting the quality guarantees for segment statistics for all f ∈ F. We also show that the guarantees are met when we use upper bounds on the multi-objective pps sampling probabilities instead of working with the exact values. In Section 4 we study multi-objective bottom-k samples. In Section 5 we establish a fundamental property of multi-objective samples: we define the sampling closure F̄ of a set of objectives F as all functions f for which a multi-objective sample S^(F) meets the quality guarantees for segment statistics. Clearly F ⊆ F̄, but we show that the closure F̄ also includes every f that is a non-negative linear combination of functions from F. In Section 6 we consider data sets in the form of key-value pairs and the family M of all monotone non-decreasing functions of the values. This family includes most natural statistics, such as our examples of count, sum, threshold, moments, and capping. Since M is infinite, it is inefficient to apply a generic multi-objective sampling algorithm to compute S^(M). We present efficient near-linear sampling schemes for S^(M) which also apply over streamed or distributed data. Moreover, we establish a bound on the sample size of E[|S^(M)|] ≤ k ln n, where n is the number of keys in our data set and k is the reference size of the single-objective samples S^(f) for each f ∈ M. The design is based on a surprising relation to All-Distances Sketches [7], [8]. Furthermore, we establish that (when key weights are unique) a sample of size Ω(k ln n) is necessary: intuitively, the hardness stems from the need to support all threshold functions. In Section 7 we study the set C = {cap_T | T > 0} of all capping functions. The closure C̄ includes all concave f ∈ M with at most a linear growth (satisfying f'(x) ≤ 1 and f''(x) ≤ 0). Since C ⊂ M, the multi-objective sample S^(M) includes S^(C) and provides estimates with statistical guarantees for all f in the closure C̄.
The more specialized sample S^(C), however, can be much smaller than S^(M). We design an efficient algorithm for computing S^(C) samples. In Section 8 we discuss metric objectives and multi-objective samples as summaries of a set of points that allow us to approximate such objectives. In Section 9 we discuss different types of statistical guarantees across functions f in settings where we are only interested in statistics sum(f; X) over the full data. Our basic multi-objective sample analysis addresses the sample size required for ForEach, where the statistical guarantees apply to each estimate ŝum(f; H) in isolation, and in particular also to each estimate over the full data set. ForAll is much stronger and bounds the (distribution of the) maximum relative error of the estimates ŝum(f; X) over all f ∈ F. Meeting ForAll typically necessitates a larger multi-objective sample size than meeting ForEach. In Section 10 we present a meta-algorithm for optimization over samples. The goal is to maximize a (smooth) function of sum(f; X) over f ∈ F. When X is large, we can instead perform the optimization over a small multi-objective sample of X. This framework has important applications to metric objectives and to estimating the loss of a model from examples. The ForOpt guarantee is for a sample size that facilitates such optimization, that is, the approximate maximizer over the sample is an approximate maximizer over the data set. This guarantee is stronger than ForEach but generally weaker than ForAll. We make the key observation that with a ForEach sample we are only prone to testable one-sided errors on the optimization result. Based on that, we present an adaptive algorithm where the sample size is increased until ForOpt is met. This framework unifies and generalizes previous work on optimization over coordinated samples [13], [11]. We conclude in Section 11.

2 WEIGHTED SAMPLING (SINGLE OBJECTIVE)

We review weighted sampling schemes with respect to a set of values f, focusing on preparation for the multi-objective generalization. The schemes are specified in terms of a sample-size parameter k which allows us to trade off representation size and estimation quality.

2.1 Poisson Probability Proportional to Size (pps)

The pps sample S^(f,k) includes each key x independently with probability

  p_x^(f,k) = min{1, k f_x / Σ_y f_y}.   (1)

Example 2.1. The table below lists the pps sampling probabilities p_x^(f,3) (k = 3, rounded to the nearest hundredth) for the keys in our example data, for sum (f_x = w_x), thresh_10 (f_x = I(w_x ≥ 10)), and cap_5 (f_x = min{5, w_x}). The number in parentheses is sum(f; X) = Σ_x f_x. We can see that the sampling probabilities vary highly between the functions f.

  key            u1    u3    u10   u12   u17   u24   u31   u42   u43   u55
  w_x            5     100   23    7     1     5     220   19    3     2
  sum (385)      0.04  0.78  0.18  0.05  0.01  0.04  1.00  0.15  0.02  0.02
  thresh_10 (4)  0     0.75  0.75  0     0     0     0.75  0.75  0     0
  cap_5 (41)     0.37  0.37  0.37  0.37  0.07  0.37  0.37  0.37  0.22  0.15
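As a concrete illustration of Eq. (1), the following minimal Python sketch (an illustration under the stated formula and the toy data above, not the paper's code) recomputes the probabilities in the table:

```python
D = {"u1": 5, "u3": 100, "u10": 23, "u12": 7, "u17": 1,
     "u24": 5, "u31": 220, "u42": 19, "u43": 3, "u55": 2}

def pps_probabilities(values, k):
    """Return {key: min(1, k * f_x / sum(f))} for the given f-values, Eq. (1)."""
    total = sum(values.values())
    return {x: min(1.0, k * f / total) for x, f in values.items()}

# The three objectives of Example 2.1, applied to the raw weights w_x.
objectives = {
    "sum":       {x: w for x, w in D.items()},
    "thresh_10": {x: (1 if w >= 10 else 0) for x, w in D.items()},
    "cap_5":     {x: min(5, w) for x, w in D.items()},
}

for name, vals in objectives.items():
    p = pps_probabilities(vals, k=3)
    print(name, {x: round(q, 2) for x, q in p.items()})
```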

PPS samples can be computed by associating a random value u_x ∼ U[0, 1] with each key and including the key in the sample if u_x ≤ p_x^(f,k). This formulation is useful to us when there are multiple objectives, as it facilitates the coordination of samples taken with respect to the different objectives. Coordination is achieved by using the same set {u_x}.

2.2 Bottom-k (order) sampling

Bottom-k sampling unifies priority (sequential Poisson) [24], [17] and pps without replacement (ppswor) [27] sampling. To obtain a bottom-k sample for f, we associate a random value u_x ∼ U[0, 1] with each key. To obtain a ppswor sample we use r_x ≡ −ln(1 − u_x), and to obtain a priority sample we use r_x ≡ u_x. The bottom-k sample S^(f,k) for f contains the k keys with minimum f-seed, where f-seed(x) ≡ r_x/f_x. To support estimation, we also retain the threshold τ^(f,k), which is defined to be the (k+1)-st smallest f-seed.

2.3 Estimators

We estimate a statistic sum(g; H) from a weighted sample S^(f,k) using the inverse-probability estimator [22]:

  ŝum(g; H) = Σ_{x ∈ H ∩ S} g_x / p_x^(f,k).   (2)

The estimate is always nonnegative and is unbiased when the functions satisfy g_x > 0 ⟹ f_x > 0 (which ensures that any key with g_x > 0 is sampled with positive probability). To apply this estimator, we need to compute p_x^(f,k) for x ∈ S. To do so with pps samples (1), we include the sum Σ_x f_x with S as auxiliary information. For bottom-k samples, inclusion probabilities of keys are not readily available. We therefore use the inverse-probability estimator (2) with conditional probabilities p_x^(f,k) [17], [15]: a key x, fixing the randomization u_y for all other keys y, is sampled if and only if f-seed(x) < t, where t is the k-th smallest f-seed among the keys y ≠ x. For x ∈ S^(f,k), the k-th smallest f-seed among the other keys is t = τ^(f,k), and thus

  p_x^(f,k) = Pr_{u ∼ U[0,1]}[ r/f_x < τ^(f,k) ].   (3)

Note that the right-hand-side expression for the probability is equal to 1 − e^{−f_x t} with ppswor and to min{1, f_x t} with priority sampling.

2.4 Estimation quality

We consider the variance and concentration of our estimates. A natural measure of estimation quality of our unbiased estimates is the coefficient of variation (CV), which is the ratio of the standard deviation to the mean. We can upper bound the CV of our estimates (2) of sum(g; H) in terms of the (expected) sample size k and the relative g-weight of the segment H, defined as q^(g)(H) = sum(g; H)/sum(g; X). To be able to express a bound on the CV when we estimate a statistic sum(g; H) using a weighted sample taken with respect to f, we define the disparity between f and g as

  ρ(f, g) = max_x (f_x/g_x) · max_x (g_x/f_x).

The disparity always satisfies ρ(f, g) ≥ 1, and we have equality ρ(f, g) = 1 only when g is a scaling of f, that is, g = cf for some c > 0. We obtain the following upper bound:

Theorem 2.1. For pps samples and the estimator (2), for all g and H,

  CV[ŝum(g; H)] ≤ ρ(f, g) / √(q^(g)(H) k).

For bottom-k samples, we replace k by k − 1.

The proof for ρ = 1 is standard for pps, provided in [7], [8] for ppswor, and in [31] for priority samples. The proof for ρ ≥ 1 for ppswor is provided in Theorem A.1. The proof for pps is simpler, using a subset of the arguments. The proof for priority can be obtained by generalizing [31]. Moreover, the estimates obtained from these weighted samples are concentrated in the Chernoff-Hoeffding-Bernstein sense. We provide the proof for the multiplicative form of the bound and for Poisson pps samples:

Theorem 2.2. For δ ≤ 1,

  Pr[ |ŝum(g; H) − sum(g; H)| > δ sum(g; H) ] ≤ 2 exp(−q^(g)(H) k δ² / (3ρ²)).

For δ > 1,

  Pr[ ŝum(g; H) − sum(g; H) > δ sum(g; H) ] ≤ exp(−q^(g)(H) k δ / (3ρ²)).

Proof. Consider Poisson pps sampling and the inverse-probability estimator. The contribution of keys that are sampled with p_x^(f,k) = 1 is computed exactly.
Let the contribution of these keys be (1 − α) sum(g; H), for some α ∈ [0, 1]. If α = 0, the estimate is the exact sum and we are done. Otherwise, it suffices to estimate the remaining α sum(g; H) with relative error δ' = δ/α. Consider the remaining keys, which have inclusion probabilities p_x^(f,k) ≥ k f_x / sum(f). The contribution of such a key x to the estimate is 0 if x is not sampled and is g_x / p_x^(f,k) ≤ ρ(f, g) sum(f)/k when x is sampled. Note that by definition sum(g; H) = q^(g)(H) sum(g) ≥ q^(g)(H) sum(f)/ρ(f, g). We apply the concentration bounds to the sum of random variables in the range [0, ρ(f, g) sum(f)/k]. To use the standard form, we normalize our random variables to have range [0, 1] and accordingly normalize the expectation α sum(g; H) to obtain

  µ = α sum(g; H) / (ρ(f, g) sum(f)/k) ≥ α q^(g)(H) k / ρ(f, g)².

We can now apply multiplicative Chernoff bounds for random variables in the range [0, 1] with δ' = δ/α and expectation µ. The formula bounds the probability of a relative error that exceeds δ' by 2 exp(−δ'² µ/3) when δ' ≤ 1 and by exp(−δ' µ/3) when δ' > 1.
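To make Sections 2.2 and 2.3 concrete, here is a minimal Python sketch (assumptions: ppswor seeds, a hash-derived u_x, small in-memory data; an illustration, not the paper's implementation) of a bottom-k sample and the conditional inverse-probability estimate (2)-(3):

```python
import hashlib
import math

def u_value(key):
    """Pseudo-random u_x in (0, 1), derived from a hash of the key so that it is
    consistent across data sets (this is what enables coordination later on)."""
    h = int(hashlib.sha256(key.encode()).hexdigest(), 16)
    return (h % (2**53)) / 2**53 or 2**-53

def bottom_k_sample(data, f, k):
    """Return (sample, tau): the k keys with smallest ppswor f-seed r_x / f_x,
    where r_x = -ln(1 - u_x), and tau, the (k+1)-st smallest f-seed."""
    seeds = []
    for x, w in data.items():
        fx = f(w)
        if fx > 0:
            r = -math.log(1.0 - u_value(x))
            seeds.append((r / fx, x, fx))
    seeds.sort()
    sample = {x: fx for _, x, fx in seeds[:k]}
    tau = seeds[k][0] if len(seeds) > k else math.inf
    return sample, tau

def estimate_sum(sample, tau, g, data, segment):
    """Inverse-probability estimate of sum(g; H) with conditional inclusion
    probabilities p_x = 1 - exp(-f_x * tau) (the ppswor case of Eq. (3))."""
    est = 0.0
    for x, fx in sample.items():
        if x in segment:
            p = 1.0 if tau == math.inf else 1.0 - math.exp(-fx * tau)
            est += g(data[x]) / p
    return est

D = {"u1": 5, "u3": 100, "u10": 23, "u12": 7, "u17": 1,
     "u24": 5, "u31": 220, "u42": 19, "u43": 3, "u55": 2}
S, tau = bottom_k_sample(D, f=lambda w: w, k=3)
print(estimate_sum(S, tau, g=lambda w: w, data=D,
                   segment={"u3", "u12", "u42", "u55"}))
```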

2.5 Computing the sample

Consider data presented as streamed or distributed elements of the form of key-value pairs (x, f_x), where x ∈ X and f_x > 0. We define f_x ≡ 0 for keys that are not in the data. An important property of our samples (bottom-k or pps) is that they are composable (mergeable), meaning that a sample of the union of two data sets can be computed from the samples of the data sets. Composability facilitates efficient streamed or distributed computation. The sampling algorithms can use a random hash function applied to the key x to generate u_x, so seed values can be computed on the fly from (x, f_x) and do not need to be stored.

With bottom-k sampling we permit keys to occur in multiple elements, in which case we define f_x to be the maximum value over the elements with key x. The sample S(D) of a set D of elements contains the pairs (x, f_x) for the k+1 (unique) keys with smallest f-seeds. (When keys are unique to elements it suffices to keep only the (k+1)-st smallest f-seed without the pair (x, f_x).) The sample of the union ∪_i D_i is obtained from ∪_i S(D_i) by first replacing multiple occurrences of a key with the one with the largest f(w) and then returning the pairs for the k+1 keys with smallest f-seeds.

With pps sampling, the information we store with our sample S(D) includes the sum sum(f; D) ≡ Σ_{x∈D} f_x and the sampled pairs (x, f_x), which are those with u_x ≤ k f_x / sum(f; D). Because we need to accurately track the sum, we require that elements have unique keys. The sample of a union D = ∪_i D_i is obtained using the sum sum(f; D) = Σ_i sum(f; D_i) and retaining only the keys in ∪_i S(D_i) that satisfy u_x ≤ k f_x / sum(f; D).

3 MULTI-OBJECTIVE PPS SAMPLES

Our objectives are specified as pairs (f, k_f), where f ∈ F is a function and k_f specifies a desired estimation quality for sum(f; H) statistics, stated in terms of the quality (Theorem 2.1 and Theorem 2.2) provided by a single-objective sample for f with size parameter k_f. To simplify notation, we sometimes omit k_f when it is clear from context. A multi-objective sample S^(F) [16] is defined by considering dedicated samples S^(f,k_f) for each objective that are coordinated. The dedicated samples are coordinated by using the same randomization, which is the association of u_x ∼ U[0, 1] with the keys. The multi-objective sample S^(F) = ∪_{f∈F} S^(f,k_f) contains all keys that are included in at least one of the coordinated dedicated samples. In the remaining part of this section we study pps samples; multi-objective bottom-k samples are studied in the next section.

Lemma 3.1. A multi-objective pps sample for F includes each key x independently with probability

  p_x^(F) = min{1, max_{f∈F} k_f f_x / sum(f)}.   (4)

Proof. Consider coordinated dedicated pps samples for f ∈ F obtained using the same set {u_x}. The key x is included in at least one of the samples if and only if the value u_x is at most the maximum over the objectives (f, k_f) of the pps inclusion probability for that objective:

  u_x ≤ max_{f∈F} p_x^(f,k_f) = max_{f∈F} min{1, k_f f_x / sum(f)} = min{1, max_{f∈F} k_f f_x / sum(f)}.

Since the u_x are independent, so are the inclusions of different keys.
Poisson sample using π, and appl the respective inverse-probabilit estimator ŝum(g; H) = S H g π (6) Side note: It is sometime useful to use other sampling schemes, in particular, VarOpt (dependent) sampling [5], [20], [12] to obtain a fied sample size The estimation qualit bounds on the CV and concentration also hold with VarOpt (which has negative covariances) 31 Multi-objective pps estimation qualit We show that the estimation qualit, in terms of the bounds on the CV and concentration, of the estimator (5) is at least as good as that of the estimate we obtain from the dedicated samples To do so we prove a more general claim that holds for an Poisson sampling scheme that includes each ke in the sample S with probabilit π p (f,k f ) and the respective inverse probabilit estimator (6) The following lemma shows that estimate qualit can onl improve when inclusion probabilities increase: Lemma 33 The variance var[ŝum(g; H)] of (6) and hence CV [ŝum(g; H)] are non-increasing in π Proof For each ke consider the inverse probabilit estimator ĝ = g /π when is sampled and ĝ = 0 otherwise Note that var[ĝ ] = g 2 (1/π 1), which is decreasing with π We have ŝum(g; H)] = H ĝ When covariances between ĝ are nonpositive, which is the case in particular with independet inclusions, we have var[ŝum(g; H)] = H var[ĝ ] and the claim follows as it applies to each summand We net consider concentration of the estimates Lemma 34 The concentration claim of Theorem 22 carries over when we use the inverse probabilit estimator with an sampling probabilities that satisf π p (f,k) for all Proof The generalization of the proof is immediate, as the range of the random variables ˆf can onl decrease when we increase the inclusion probabilit

3.2 Uniform guarantees across objectives

An important special case is when k_f = k for all f ∈ F, that is, we seek uniform statistical guarantees for all our objectives. We use the notation S^(F,k) for the respective multi-objective sample. We can write the multi-objective pps probabilities (4) as

  p_x^(F,k) = min{1, k max_{f∈F} f_x/sum(f)} = min{1, k p_x^(F,1)}.   (7)

The last equality follows when recalling the definition of the base pps probabilities p_x^(f,1) = f_x/Σ_y f_y and hence p_x^(F,1) ≡ max_{f∈F} p_x^(f,1) = max_{f∈F} f_x/sum(f). We refer to p_x^(f,1) as the base pps probabilities for f. Note that the base pps probabilities are a rescaling of f, and pps probabilities are invariant to this scaling. We refer to p_x^(F,1) as the multi-objective base pps probabilities for F. Finally, for a reason that will soon be clear, we refer to the sum h^(pps)(F) ≡ Σ_x p_x^(F,1) as the multi-objective pps overhead of F. It is easy to see that h(F) ∈ [1, |F|] and that h(F) is closer to 1 when the objectives in F are more similar. We can write k p_x^(F,1) = (k h(F)) p_x^(F,1) / Σ_y p_y^(F,1). That is, the multi-objective pps probabilities (7) are equivalent to single-objective pps probabilities with size parameter k h(F), computed with respect to the base probability weights g_x ≡ p_x^(F,1).

Side note: We can apply any single-objective weighted sampling scheme, such as VarOpt or bottom-k, to the weights p_x^(F,1), or to upper bounds π_x ≥ p_x^(F,1) on these weights while adjusting the sample-size parameter to k Σ_x π_x.

3.3 Lower bound on multi-objective sample size

The following theorem shows that any multi-objective sample for F that meets the quality guarantees on all domain queries for (f, k_f) must include each key x with probability at least p_x^(f,1) k_f / (p_x^(f,1) k_f + 1) ≥ (1/2) p_x^(f,k_f). Moreover, when p_x^(f,1) k_f ≪ 1, the lower bound on the inclusion probability is close to p_x^(f,k_f). This implies that the multi-objective pps sample size is necessary to meet the quality guarantees we seek (Theorem 2.1).

Theorem 3.1. Consider a sampling scheme that for weights f supports estimators that satisfy, for each segment H, CV[ŝum(f; H)] ≤ 1/√(q^(f)(H) k). Then the inclusion probability of a key x must be at least

  p_x ≥ p_x^(f,1) k / (p_x^(f,1) k + 1) ≥ (1/2) min{1, p_x^(f,1) k}.

Proof. Consider a segment consisting of a single key, H = {x}. Then q ≡ q^(f)(H) = f_x/Σ_y f_y = p_x^(f,1). The best nonnegative unbiased sum estimator is the HT estimator: when the key is not sampled, there is no evidence of the segment and the estimate must be 0; when it is, a uniform estimate minimizes the variance. If the key is included with probability p, the CV of the estimate is CV[ŝum(f; {x})] = (1/p − 1)^{0.5}. From the requirement (1/p − 1)^{0.5} ≤ 1/(qk)^{0.5}, we obtain that p ≥ qk/(qk + 1).

4 MULTI-OBJECTIVE BOTTOM-k SAMPLES

The sample S^(F) is defined with respect to random {u_x}. Each dedicated sample S^(f,k_f) includes the k_f lowest f-seeds, computed using {u_x}. S^(F) accordingly includes all keys that have one of the k_f lowest f-seeds for at least one f ∈ F. To estimate statistics sum(g; H) from a bottom-k S^(F), we again apply the inverse-probability estimator (5), but here we use the conditional inclusion probability p_x^(F) for each key x ∈ S^(F) [16]. This is the probability (over u_x ∼ U[0, 1]) that x ∈ S^(F) when fixing u_y for all y ≠ x to be as in the current sample. Note that p_x^(F) = max_{f∈F} p_x^(f), where the p_x^(f) are as defined in (3). In order to compute the probabilities p_x^(F) for x ∈ S^(F), it always suffices to maintain the slightly larger sample ∪_{f∈F} S^(f,k_f+1). For completeness, we show that it suffices to instead maintain, together with S^(F) ≡ ∪_{f∈F} S^(f,k_f), a smaller (possibly empty) set Z ⊂ ∪_{f∈F} S^(f,k_f+1) \ S^(F) of auxiliary keys. We now define the set Z and show how the inclusion probabilities can be computed from S^(F) ∪ Z.
For a key x ∈ S^(F), we denote by g^(x) = arg max_{f∈F : x∈S^(f)} p_x^(f) the objective with the most forgiving threshold for x. If p_x^(g^(x)) < 1, let y_x be the key with the (k+1)-st smallest g^(x)-seed (otherwise y_x is not defined). The auxiliary keys are then Z = {y_x | x ∈ S^(F)} \ S^(F).

We use the sample and auxiliary keys S^(F) ∪ Z as follows to compute the inclusion probabilities. We first compute, for each f ∈ F, the value τ_f, which is the (k_f+1)-st smallest f-seed of the keys in S^(F) ∪ Z. For each x ∈ S^(F), we then use p_x^(F) = min{1, max_{f∈F} f(w_x) τ_f} (for priority) or p_x^(F) = 1 − exp(−max_{f∈F} f(w_x) τ_f) (for ppswor). To see that the p_x^(F) are correctly computed, note that while we can have τ_f > τ^(f,k_f) for some f ∈ F (Z may not include the threshold keys of all the dedicated samples S^(f,k_f)), our definition of Z ensures that τ_f = τ^(f,k_f) for every f such that there is at least one x with f = g^(x) and p_x^(g^(x)) < 1.

Composability: Note that multi-objective samples S^(F) are composable, since they are a union of (composable) single-objective samples S^(f). It is not hard to see that composability applies with the auxiliary keys: the set of auxiliary keys in the composed sample must be a subset of the sampled and auxiliary keys of the components. Therefore, the sample itself includes all the necessary state for streaming or distributed computation.

Estimate quality: We can verify that for any f ∈ F and x, and for any random assignment {u_y} for y ≠ x, we have p_x^(F) ≥ p_x^(f). Therefore (applying Lemma 3.3 and noting zero covariances [15]), the variance and the CV are at most those of the estimator (2) applied to the bottom-k_f sample S^(f).
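A minimal Python sketch of the construction itself (assumptions: ppswor seeds, a hash-derived u_x, in-memory data; the conditional probabilities and the auxiliary set Z are omitted) shows how coordination makes S^(F) a simple union of the dedicated samples:

```python
import hashlib
import math

def u_value(key):
    """Hash-derived u_x in (0, 1); the same u_x drives every objective,
    which is exactly what coordinates the dedicated samples."""
    h = int(hashlib.sha256(key.encode()).hexdigest(), 16)
    return (h % (2**53)) / 2**53 or 2**-53

def f_seed(key, fx):
    return -math.log(1.0 - u_value(key)) / fx     # ppswor f-seed r_x / f_x

def multi_objective_bottom_k(data, objectives):
    """objectives: {name: (f, k_f)}.  Returns the union of the coordinated
    dedicated bottom-k_f samples, i.e. the key set of S^(F)."""
    union = set()
    for name, (f, k) in objectives.items():
        seeds = sorted((f_seed(x, f(w)), x)
                       for x, w in data.items() if f(w) > 0)
        union |= {x for _, x in seeds[:k]}
    return union

D = {"u1": 5, "u3": 100, "u10": 23, "u12": 7, "u17": 1,
     "u24": 5, "u31": 220, "u42": 19, "u43": 3, "u55": 2}
objectives = {"sum": (lambda w: w, 3),
              "thresh_10": (lambda w: 1 if w >= 10 else 0, 3),
              "cap_5": (lambda w: min(5, w), 3)}
print(sorted(multi_objective_bottom_k(D, objectives)))
```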

To summarize, we obtain the following statistical guarantees on estimate quality with multi-objective samples:

Theorem 4.1. For each H and g, the inverse-probability estimator applied to a multi-objective pps sample S^(F) has

  CV[ŝum(g; H)] ≤ min_{f∈F} ρ(f, g) / √(q^(g)(H) k_f).

The estimator applied to a multi-objective bottom-k sample has the same guarantee, but with (k_f − 1) replacing k_f.

Sample-size overhead: We always have E[|S^(F)|] ≤ Σ_{f∈F} k_f. The worst case, where the size of S^(F) is the sum of the sizes of the dedicated samples, materializes when the functions f ∈ F have disjoint supports. The sample size, however, can be much smaller when the functions are more related. With uniform guarantees (k_f ≡ k), we define the multi-objective bottom-k overhead to be h^(botk)(F) ≡ E[|S^(F)|]/k. This is the sample-size overhead of a multi-objective versus a dedicated bottom-k sample.

pps versus bottom-k multi-objective sample size: For some sets F, with the same parameter k, we can have a much larger multi-objective overhead with bottom-k than with pps. A multi-objective pps sample is the smallest sample that can include a pps sample for each f; a multi-objective bottom-k sample must include a bottom-k_f sample for each f. Consider a set of n ≫ k keys. For each subset of n/2 keys we define a function f that is uniform on the subset and 0 elsewhere. It is easy to see that in this case h^(pps)(F) = 2, whereas h^(botk)(F) ≥ (n/2 + k)/k.

Computation: When the data has the form of elements (x, w_x) with unique keys and f_x = f(w_x) for a set of functions F, then, short of further structural assumptions on F, the sampling algorithm that computes S^(F) must apply all functions f ∈ F to all elements. The computation is thus Ω(|F| n), and can be done in O(|F| n + |S^(F)| log k) time by identifying, for each f ∈ F, the k keys with smallest f-seed(x). In the sequel we will see examples of large or infinite sets F with special structure that allows us to compute a multi-objective sample efficiently.

5 THE SAMPLING CLOSURE

We define the sampling closure F̄ of a set of functions F to be the set of all functions f such that for all k and for all H, the estimate of sum(f; H) from S^(F,k) has the CV bound of Theorem 2.1. Note that this definition is with respect to uniform guarantees (the same size parameter k for all objectives). We show that the closure of F contains all non-negative linear combinations of functions from F.

Theorem 5.1. Any f = Σ_{g∈F} α_g g, where α_g ≥ 0, is in F̄.

Proof. We first consider pps samples, where we establish the stronger claim S^(F∪{f},k) = S^(F,k), or equivalently, for all keys x,

  p_x^(f,k) ≤ p_x^(F,k).   (8)

For a function g, we use the notation g(X) ≡ Σ_x g_x, and recall that p_x^(g,k) = min{1, k g_x/g(X)}. We first consider f = cg for some g ∈ F. In this case, p_x^(f,k) = p_x^(g,k) ≤ p_x^(F,k) and (8) follows. To complete the proof, it suffices to establish (8) for f = g^(1) + g^(2) such that g^(1), g^(2) ∈ F. Let c be such that g^(2)_x / g^(2)(X) = c g^(1)_x / g^(1)(X); we can assume WLOG that c ≤ 1 (otherwise reverse the roles of g^(1) and g^(2)). For convenience, denote α = g^(2)(X)/g^(1)(X). Then we can write

  f_x / f(X) = (g^(1)_x + g^(2)_x) / (g^(1)(X) + g^(2)(X)) = (1 + cα) g^(1)_x / ((1 + α) g^(1)(X)) ≤ g^(1)_x / g^(1)(X) = max{ g^(1)_x/g^(1)(X), g^(2)_x/g^(2)(X) }.

Therefore p_x^(f,k) ≤ max{ p_x^(g^(1),k), p_x^(g^(2),k) } ≤ p_x^(F,k).

The proof for multi-objective bottom-k samples is more involved and is deferred to the full version. Note that the multi-objective bottom-k sample S^(F,k) may not include a bottom-k sample S^(f,k), but it is still possible to bound the CV.

6 THE UNIVERSAL SAMPLE FOR MONOTONE STATISTICS
In this section we consider the (infinite) set M of all monotone non-decreasing functions and the objectives (f, k) for all f ∈ M. We show that the multi-objective pps and bottom-k samples S^(M,k), which we refer to as the universal monotone sample, are larger than a single dedicated weighted sample by at most a logarithmic factor in the number of keys. We will also show that this is tight, and we will present an efficient universal monotone bottom-k sampling scheme for streamed or distributed data.

We take the following steps. We consider the multi-objective sample S^(thresh,k) for the set thresh of all threshold functions (recall that thresh_T(x) = 1 if x ≥ T and thresh_T(x) = 0 otherwise). We express the inclusion probabilities p_x^(thresh,k) and bound the sample size. Since all threshold functions are monotone, thresh ⊂ M. We will establish that S^(thresh,k) = S^(M,k). (We also observe that M̄ = thresh̄: this follows from Theorem 5.1, after noticing that any f ∈ M can be expressed as a non-negative combination of threshold functions, f(x) = ∫_0^∞ α(T) thresh_T(x) dT, for some function α(T) ≥ 0. Here we establish the stronger relation S^(thresh,k) = S^(M,k).) We start with the simpler case of pps and then move on to bottom-k samples.

6.1 Universal monotone pps

Theorem 6.1. Consider a data set D = {(x, w_x)} of n keys and the sorted order of the keys by non-increasing w_x. Then a key x that is in position i in the sorted order has base multi-objective pps probability p_x^(thresh,1) = p_x^(M,1) ≤ 1/i. When all keys have unique weights, equality holds.

Proof. Consider the function thresh_{w_x}. This function has weight 1 on all of the (at least i) keys y with weight w_y ≥ w_x. Therefore, the base pps probability is p_x^(thresh_{w_x},1) ≤ 1/i. When the keys have unique weights, there are exactly i keys y with weight w_y ≥ w_x and we have p_x^(thresh_{w_x},1) = 1/i. If we consider all threshold functions, then p_x^(thresh_T,1) = 0 when T > w_x and p_x^(thresh_T,1) ≤ p_x^(thresh_{w_x},1) ≤ 1/i when T ≤ w_x. Therefore, p_x^(thresh,1) = max_T p_x^(thresh_T,1) = p_x^(thresh_{w_x},1). We now consider an arbitrary monotone function f_x = f(w_x). From monotonicity, there are at least i keys y with f_y ≥ f_x; therefore f_x/Σ_y f_y ≤ 1/i. Thus p_x^(f,1) = f_x/Σ_y f_y ≤ 1/i and p_x^(M,1) = max_{f∈M} p_x^(f,1) ≤ 1/i.
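Theorem 6.1 directly yields the probabilities of the simple main-memory scheme described next: p_x^(M,k) = min{1, k/c_x}, where c_x = |{y : w_y ≥ w_x}|. A minimal Python sketch (an illustration assuming the data fits in memory, not the paper's code):

```python
import bisect

def universal_monotone_pps(data, k):
    """p_x^(M,k) = min{1, k / c_x}, where c_x counts the keys y with w_y >= w_x.
    For unique weights c_x is simply the rank of x by non-increasing weight."""
    weights = sorted(data.values())                  # ascending
    n = len(weights)
    probs = {}
    for x, w in data.items():
        c = n - bisect.bisect_left(weights, w)       # number of keys with weight >= w
        probs[x] = min(1.0, k / c)
    return probs

D = {"u1": 5, "u3": 100, "u10": 23, "u12": 7, "u17": 1,
     "u24": 5, "u31": 220, "u42": 19, "u43": 3, "u55": 2}
print({x: round(p, 2) for x, p in universal_monotone_pps(D, k=3).items()})
```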

There is a simple main-memory sampling scheme where we sort the keys, compute the probabilities p_x^(M,1) and then p_x^(M,k) = min{1, k p_x^(M,1)}, and compute a sample accordingly. We next present universal monotone bottom-k samples and sampling schemes that are efficient on streamed or distributed data.

6.2 Universal monotone bottom-k

Theorem 6.2. Consider a data set D = {(x, w_x)} of n keys. The universal monotone bottom-k sample has expected size E[|S^(M,k)|] ≤ k ln n and can be computed using O(n + k log n log k) operations.

For a particular T, the bottom-k sample S^(thresh_T,k) is the set of the k keys with smallest u_x among the keys with w_x ≥ T. The set of keys in the multi-objective sample is S^(thresh,k) = ∪_{T>0} S^(thresh_T,k). We show that a key is in the multi-objective sample for thresh if and only if it is in the bottom-k sample for thresh_{w_x}:

Lemma 6.1. Fixing {u_x}, for any key x, x ∈ S^(thresh,k) ⟺ x ∈ S^(thresh_{w_x},k).

Proof. Consider the position t(x, T) of x in the ordering of the keys induced by thresh_T-seed(). We claim that if for a key y we have thresh_T-seed(y) < thresh_T-seed(x) for some T > 0, then this must hold for T = w_x. The claim can be established by separately considering w_y ≥ w_x and w_y < w_x. The claim implies that t(x, T) is minimized for T = w_x.

We now consider the auxiliary keys Z associated with this sample. Recall that these keys are not technically part of the sample, but the information (u_x, w_x) for x ∈ Z is needed in order to compute the conditional inclusion probabilities p_x^(thresh,k) for x ∈ S. Note that it follows from Lemma 6.1 that for all keys x, p_x^(thresh,k) = p_x^(thresh_{w_x},k). For a key x, let Y_x = {y ≠ x | w_y ≥ w_x} be the set of keys other than x that have weight at least that of x. Let y_x be the key with the k-th smallest u value in Y_x, when |Y_x| ≥ k. The auxiliary keys are Z = {y_x | x ∈ S} \ S. A key x is included in the sample with probability 1 if y_x is not defined (which means x has one of the k largest weights); otherwise, it is (conditionally) included if and only if u_x < u_{y_x}.

To compute the inclusion probability p_x^(thresh,k) from S ∪ Z, we do as follows. If there are k or fewer keys in S ∪ Z with weight at least w_x, then p_x^(thresh,k) = 1. (For correctness, note that in this case all keys with weight ≥ w_x would be in S.) Otherwise, observe that y_x is the key with the (k+1)-st smallest u value in S ∪ Z among all keys with weight ≥ w_x. We compute y_x from the sample and use p_x^(thresh,k) = u_{y_x}. Note that when the weights are unique, Z = ∅.

The definition of S^(thresh,k) is equivalent to that of an All-Distances Sketch (ADS) computed with respect to the weights w_x (as inverse distances) [7], [8], and we can apply some of its algorithms and analysis. In particular, we obtain that E[|S^(thresh,k)|] ≤ k ln n and that the size is well concentrated around this expectation. The argument is simple: consider the keys ordered by decreasing weight. The probability that the i-th key has one of the k smallest u values among the first i keys, and thus is a member of S^(thresh,k), is at most min{1, k/i}. Summing the expectations over all keys we obtain Σ_{i=1}^n min{1, k/i} < k ln n. We shall see that this bound is asymptotically tight when weights are unique. With repeated weights, however, the sample size can be much smaller. (The sample we would obtain with repeated weights is always a subset of the sample we would have obtained with tie breaking; in particular, the sample size can be at most k times the number of unique weights.)

Lemma 6.2. For any data set {(x, w_x)}, when using the same randomization {u_x} to generate both samples, S^(M,k) = S^(thresh,k).
Proof. Consider f ∈ M and the samples obtained for some fixed randomization u_x of all the keys. Suppose that a key x is in the bottom-k sample S^(f,k). By definition, f-seed(x) = r_x/f(w_x) is among the k smallest f-seeds of all keys. Therefore, it must be among the k smallest f-seeds within the set Y_x of keys y with w_y ≥ w_x. From the monotonicity of f, this implies that r_x must be one of the k smallest values in {r_y | y ∈ Y_x}, which is the same as u_x being one of the k smallest values in {u_y | y ∈ Y_x}. This implies that x ∈ S^(thresh_{w_x},k).

6.3 Estimation quality

The estimator (5) with the conditional inclusion probabilities p_x^(M,k) generalizes the HIP estimator of [8] to sketches computed for non-unique weights. Theorem 4.1 implies that for any f ∈ M and H, CV[ŝum(f; H)] ≤ 1/√(q^(f)(H)(k−1)). When weights are unique and we estimate statistics over all keys, we have the tighter bound CV[ŝum(f; X)] ≤ 1/√(2k−1) [8].

6.4 Sampling algorithms

The samples, including the auxiliary information, are composable. Composability holds even when we allow multiple elements to have the same key and interpret w_x as the maximum weight over the elements with key x. To do so, we use a random hash function to generate u_x consistently for multiple elements with the same key. To compose multiple samples, we take the union of the elements, replace multiple elements with the same key by the one with maximum weight, and apply a sampling algorithm to the set of remaining elements. The updated inclusion probabilities can be computed from the composed sample.

We present two algorithms that compute the sample S^(M,k) along with the auxiliary keys Z and the inclusion probabilities p_x^(M,k) for x ∈ S^(M,k). The algorithms process the elements either in order of decreasing w_x or in order of increasing u_x. These two orderings may be suitable for different applications, and it is worthwhile to present both: in the related context of all-distances sketches, both orderings on distances were used [7], [26], [3], [8]. The algorithms are correct when applied to any set of n elements that includes S ∪ Z. Recall that the inclusion probability p_x^(M,k) is u_{y_x}, where y_x has the k-th smallest u value among the keys y ≠ x with w_y ≥ w_x. Therefore, all keys with the same weight have the same inclusion probability.

For convenience, we thus express the probabilities as a function p(w) of the weights.

Algorithm 1: Universal monotone sampling, scan by weight
  Initialize an empty max-heap H of size k        // the k smallest u values processed so far
  ptau ← +∞; prevw ← ⊥                            // ** omit with unique weights
  foreach (x, w_x), by decreasing w_x and then increasing u_x, do
    if |H| < k then
      S ← S ∪ {x}; insert u_x into H; p(w_x) ← 1; continue
    if u_x < max(H) then                          // x is sampled
      S ← S ∪ {x}; p(w_x) ← max(H)
      ptau ← max(H); prevw ← w_x                  // **
      delete max(H) from H; insert u_x into H
    else                                          // **
      if u_x < ptau and w_x = prevw then Z ← Z ∪ {x}; p(w_x) ← u_x

Algorithm 1 processes the keys in order of decreasing weight, breaking ties by increasing u_x. We maintain a max-heap H of the k smallest u values processed so far. When processing a current key x, we include x in S if u_x < max(H). If we include x, we delete max(H) and insert u_x into H. Correctness follows from H containing the k smallest u values of the keys with weight at least w_x. When weights are unique, the probability p(w_x) is max(H) just before u_x is inserted, that is, the k-th smallest u value among the previously processed keys. When weights are not unique, we also need to compute Z. To do so, we track the previous max(H), which we call ptau. If the current key x has u_x ∈ (max(H), ptau), we include x in Z; it is easy to verify that in this case p(w_x) = u_x. Note that the algorithm may overwrite p(w) multiple times, as keys with weight w are inserted into the sample or into Z.

Algorithm 2 processes the keys in order of increasing u_x. The algorithm maintains a min-heap H of the k largest weights processed so far. With unique weights, the current key x is included in the sample if and only if w_x > min(H). If x is included, we delete from the heap H the key y with weight min(H) and insert x. When weights are not unique, we also track the weight of the key previously removed from H; when processing a key x, if w_x equals the minimum weight in H and w_x > prevw, then x is inserted into Z. The computed inclusion probability p(w), with unique weights, is u_z, where z is the key whose processing triggered the deletion of the key y with weight w from H. To establish correctness, consider the heap H just after y is deleted. By construction, H contains the k keys with weight larger than w_y that have the smallest u values. Therefore, p(w_y) = max_{z∈H} u_z. Since we process the keys by increasing u, this maximum u value in H is the value of the most recently inserted key, that is, the key whose insertion triggered the removal of y. Finally, the keys that remain in H at the end are those with the k largest weights; the algorithm correctly assigns p(w_x) = 1 for these keys. When multiple keys can have the same weight, p(w) is the minimum of max_{z∈H} u_z after the first key of weight w is evicted and the minimum u_z of a key z with w_z = w that was not sampled. If the minimum is realized at such a key z, that key is included in Z, and the algorithm sets p(w) accordingly when z is processed. If p(w) has not already been set when the first key of weight w is deleted from H, the algorithm correctly assigns p(w) to be max_{z∈H} u_z. After all keys are processed, p(w_x) is set for the remaining keys x ∈ H for which a key with the same weight was previously deleted from H; the other keys are assigned p(w) = 1.

We now analyze the number of operations. With both algorithms, the cost of processing a key is O(1) if the key is not inserted and O(log k) if the key is included in the sample. Using the bound on the sample size, we obtain a bound of O(n + k ln n log k) on the processing cost. The sorting requires O(n log n) computation, which dominates the computation (since typically k ≪ n). When the u_x are assigned randomly, however, we can generate them already sorted by u_x in O(n) time, enabling a faster O(n + k log k log n) computation.
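The following Python sketch is a simplified port of Algorithm 1 (assumptions: unique weights, so the auxiliary set Z is empty, and the data fits in memory); it is meant only to illustrate the scan order and the heap maintenance, not to replace the full algorithm:

```python
import heapq

def universal_monotone_by_weight(data, u, k):
    """data: {key: weight}; u: {key: u_x in (0,1)} (e.g. hash-derived).
    Returns (sample, p) where p[x] is the conditional inclusion probability."""
    heap = []                       # max-heap (negated values) of the k smallest u's seen
    sample, p = [], {}
    for x, w in sorted(data.items(), key=lambda kv: (-kv[1], u[kv[0]])):
        if len(heap) < k:
            sample.append(x); p[x] = 1.0
            heapq.heappush(heap, -u[x])
        elif u[x] < -heap[0]:
            sample.append(x); p[x] = -heap[0]   # k-th smallest u among heavier keys
            heapq.heapreplace(heap, -u[x])      # evict max(H), insert u_x
    return sample, p

# Example use with the toy data set and arbitrary fixed u values.
D = {"u1": 5, "u3": 100, "u10": 23, "u12": 7, "u17": 1,
     "u24": 5, "u31": 220, "u42": 19, "u43": 3, "u55": 2}
u = {x: (i + 1) / 11 for i, x in enumerate(sorted(D))}   # placeholder u_x values
print(universal_monotone_by_weight(D, u, k=3))
```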
Algorithm 2: Universal monotone sampling, scan by u
  H ← ∅          // min-heap of size k, prioritized by lexicographic order on (w_x, u_x), containing the keys with the largest priorities processed so far
  prevw ← ⊥
  foreach x, by increasing u_x, do
    if |H| < k then
      S ← S ∪ {x}; insert x into H; continue
    y ← arg min_{z∈H} (w_z, u_z)                 // the min-weight key in H with largest u_z
    if w_x > w_y then
      S ← S ∪ {x}                                // add x to the sample
      prevw ← w_y
      if p(prevw) = ⊥ then p(prevw) ← u_x
      delete y from H; insert x into H
    else                                         // **
      if w_x = w_y and w_x > prevw then Z ← Z ∪ {x}; p(w_x) ← u_x
  foreach x ∈ H do                               // the keys with the largest weights
    if p(w_x) = ⊥ then p(w_x) ← 1

6.5 Lower bound on sample size

We now show that the worst-case factor of ln n on the size of the universal monotone sample is, in a sense, necessary. It suffices to show this for threshold functions:

Theorem 6.3. Consider data sets where all keys have unique weights. Any sampling scheme with a nonnegative unbiased estimator that for all T > 0 and H has CV[ŝum(thresh_T; H)] ≤ 1/√(q^(thresh_T)(H) k) must have samples of expected size Ω(k ln n).

Proof. We use Theorem 3.1, which relates estimation quality to sampling probabilities. Consider the key x with the i-th heaviest weight. Applying Theorem 3.1 to x and thresh_{w_x}, we obtain that p_x ≥ k/(k+i). Summing the sampling probabilities over all keys i ∈ [n] to bound the expected sample size, we obtain Σ_x p_x ≥ Σ_{i=1}^n k/(k+i) = k(H_{n+k} − H_k) ≥ k(ln n − ln k).

7 THE UNIVERSAL CAPPING SAMPLE

An important strict subset of the monotone functions is the set C = {cap_T | T > 0} of capping functions. We study the multi-objective bottom-k sample S^(C,k), which we refer to as the universal capping sample.

From Theorem 5.1, the closure of C includes all functions of the form f(x) = ∫_0^∞ α(T) cap_T(x) dT, for some α(T) ≥ 0. This is the set of all non-decreasing concave functions with at most a linear growth, that is, functions f(w) that satisfy df/dw ≤ 1 and d²f/dw² ≤ 0. We show that the sample S^(C,k) can be computed using O(n + k log n log k) operations from any D' ⊆ D that is a superset of the keys in S^(C,k).

We start with properties of S^(C,k) which we will use to design our sampling algorithm. For a key x, let h_x be the number of keys y with w_y ≥ w_x and u_y < u_x. Let l_x be the number of keys y with w_y < w_x and r_y/w_y < r_x/w_x. For a key x and T > 0, and fixing the assignment {u_y} for all keys, let t(x, T) be the position of cap_T-seed(x) in the list of values cap_T-seed(y) for all y. The function t has the following properties:

Lemma 7.1. For a key x, t(x, T) is minimized for T = w_x. Moreover, t(x, T) is non-increasing for T ≤ w_x and non-decreasing for T ≥ w_x.

Proof. We can verify that for any key y such that there is a T > 0 with cap_T-seed(y) < cap_T-seed(x), we must have cap_{w_x}-seed(y) < cap_{w_x}-seed(x). Moreover, the set of T values for which cap_T-seed(y) < cap_T-seed(x) is an interval that contains T = w_x. The claim can be established by separately considering the cases w_y ≥ w_x and w_y < w_x.

As a corollary, we obtain that a key is in the universal capping sample only if it is in the bottom-k sample for cap_{w_x}:

Corollary 7.2. Fixing {u_x}, for any key x, x ∈ S^(C,k) ⟹ x ∈ S^(cap_{w_x},k).

Lemma 7.3. Fixing the assignment {u_x}, x ∈ S^(C,k) ⟺ l_x + h_x < k.

Proof. From Lemma 7.1, a key x is in a bottom-k cap_T sample for some T if and only if it is in the bottom-k sample for cap_{w_x}. The keys with a lower cap_{w_x}-seed than x are those with w_y ≥ w_x and u_y < u_x, which are counted in h_x, and those with w_y < w_x and r_y/w_y < r_x/w_x, which are counted in l_x. Therefore, a key x is in S^(cap_{w_x},k) if and only if there are fewer than k keys with a lower cap_{w_x}-seed, which is the same as having h_x + l_x < k. For reference, a key x is in the universal monotone sample S^(M,k) if and only if it satisfies the weaker condition h_x < k.

Lemma 7.4. A key x can be auxiliary only if l_x + h_x = k.

Proof. A key x is auxiliary (in the set Z) only if, for some y ∈ S, it has the k-th smallest cap_{w_y}-seed among all keys other than y; this means it has the (k+1)-st smallest cap_{w_y}-seed. The number of keys with a seed smaller than that of x is minimized for T = w_x. If the cap_{w_x}-seed of x is one of the k smallest ones, x is included in the sample. Therefore, to be auxiliary, x must have the (k+1)-st smallest cap_{w_x}-seed, that is, l_x + h_x = k.

7.1 Sampling algorithm

We are ready to present our sampling algorithm. We first process the data so that multiple elements with the same key are replaced by the one with maximum weight. The next step is to identify all keys with h_x ≤ k; it suffices to compute S^(M,k) of the data together with the auxiliary keys. We can apply a variant of Algorithm 1: we process the keys in order of decreasing weight, breaking ties by increasing rank, while maintaining a binary search tree H of size k which contains the k lowest u values of the processed keys. When processing x, if u_x > max(H) then h_x > k and the key is discarded; otherwise, h_x is the position of u_x in H, and u_x is inserted into H while max(H) is removed from H.

We now only consider keys with h_x ≤ k; note that in expectation there are at most k ln n such keys. The algorithm then computes l_x for all keys with l_x ≤ k. This is done by scanning the keys in order of increasing weight, tracking in a binary search tree H' the (at most) k smallest r_y/w_y values. When processing x, if r_x/w_x < max(H'), then l_x is the position of r_x/w_x in H'; we then delete max(H') and insert r_x/w_x. The keys that have l_x + h_x < k then constitute the sample S, and the keys with l_x + h_x = k are retained as potentially being auxiliary.
Finally, we perform another pass over the sampled and potentially auxiliary keys. For each key, we determine the (k+1)-st smallest cap_{w_x}-seed, which is τ^(cap_{w_x},k). Using Corollary 7.2, we can use (3) to compute p_x^(cap_{w_x},k) = p_x^(C,k). At the same time we can also determine the precise set of auxiliary keys, by removing those that are not the (k+1)-st smallest seed for any cap_{w_x} with x ∈ S.

7.2 Size of S^(C,k)

The sample S^(C,k) is contained in S^(M,k), but it can be much smaller. Intuitively, this is because two keys with similar, but not necessarily identical, weights are likely to have the same relation between their f-seeds across all f ∈ C. This is not true for M: for a threshold T between the two weights, the thresh_T-seed would always be lower for the higher-weight key, whereas for lower values of T the relation can go either way with almost equal probabilities. In particular, we obtain a bound on |S^(C,k)| that does not depend on n:

Theorem 7.1. E[|S^(C,k)|] ≤ e k ln (max_x w_x / min_x w_x).

Proof. Consider a set of keys Y such that max_{y∈Y} w_y / min_{y∈Y} w_y ≤ ρ. We show that the expected number of keys in Y that, for at least one T > 0, have one of the bottom-k cap_T-seeds is at most ρk. The claim then follows by partitioning the keys into ln(max_x w_x / min_x w_x) groups, within each of which the weights vary by at most a factor of e, and then noticing that the bottom-k over all groups must be a subset of the union of the bottom-k sets within the individual groups.

We now prove the claim for Y. Denote by τ the (k+1)-st smallest r_y value in Y. The set of the k keys with r_y < τ is the bottom-k sample for cap_T whenever T ≤ min_{y∈Y} w_y. Consider a key x ∈ Y. From Lemma 7.1, we have x ∈ S^(C,k) only if x ∈ S^(cap_{w_x},k). A necessary condition for the latter is that r_x/w_x < τ/min_{y∈Y} w_y. The probability of this event is at most

  Pr_{u_x ∼ U[0,1]}[ r_x/w_x < τ/min_{y∈Y} w_y ] ≤ (w_x / min_{y∈Y} w_y) Pr_{u_x ∼ U[0,1]}[ r_x < τ ] ≤ ρ k/|Y|.

Thus, the expected number of keys in Y that satisfy this condition is at most ρk.
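The membership condition of Lemma 7.3 can be checked directly, which is handy for testing. The brute-force Python sketch below (assumptions: hash-derived u_x and quadratic time; it checks the condition h_x + l_x < k and is not the O(n + k log n log k) algorithm of Section 7.1):

```python
import hashlib
import math

def u_value(key):
    h = int(hashlib.sha256(key.encode()).hexdigest(), 16)
    return (h % (2**53)) / 2**53 or 2**-53

def universal_capping_sample(data, k):
    """Keys x with h_x + l_x < k, where h_x counts keys y with w_y >= w_x and
    u_y < u_x, and l_x counts keys y with w_y < w_x and r_y/w_y < r_x/w_x."""
    u = {x: u_value(x) for x in data}
    r = {x: -math.log(1.0 - u[x]) for x in data}
    sample = set()
    for x, w in data.items():
        h = sum(1 for y, wy in data.items()
                if y != x and wy >= w and u[y] < u[x])
        l = sum(1 for y, wy in data.items()
                if y != x and wy < w and r[y] / wy < r[x] / w)
        if h + l < k:
            sample.add(x)
    return sample

D = {"u1": 5, "u3": 100, "u10": 23, "u12": 7, "u17": 1,
     "u24": 5, "u31": 220, "u42": 19, "u43": 3, "u55": 2}
print(sorted(universal_capping_sample(D, k=3)))
```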

8 METRIC OBJECTIVES

In this section we discuss the application of multi-objective sampling to additive cost objectives. The formulation has a set of keys X, a set of models Q, and a nonnegative cost function c(Q, x) of servicing x ∈ X by Q ∈ Q. In metric settings, the keys X are points in a metric space M, Q ∈ Q is a configuration of facilities (which can also be points Q ⊂ M), and c(Q, x) is distance based and is the cost of servicing x by Q. For each Q ∈ Q we are interested in the total cost of servicing X, which is c(Q, X) = Σ_{x∈X} c(Q, x). A concrete example is the k-means clustering cost function, where Q is a set of points (centers) of size k and c(Q, x) = min_{q∈Q} d(q, x)².

In this formulation, we are interested in computing a small summary of X that would allow us to estimate c(Q, X) for each Q ∈ Q. Such summaries in a metric setting are sometimes referred to as coresets [1]. Multi-objective samples can be used as such a summary. Each Q ∈ Q has a corresponding function f_x ≡ c(Q, x), and a multi-objective sample for the set F of all these functions for Q ∈ Q allows us to estimate sum(f) = c(Q, X) for each Q. In particular, a sample of size h(F) ε^{-2} allows us to estimate c(Q, X) for each Q ∈ Q with CV at most ε and with good concentration. The challenges, for a domain of such problems, are to (i) upper bound the multi-objective overhead h(F) as a function of the parameters of the domain (X, c, and the structure of Q ∈ Q); the overhead is a fundamental property of the problem domain; and (ii) efficiently compute upper bounds π_x on the multi-objective sampling probabilities p_x^(F,1) so that the sum Σ_x π_x is not much larger than h(F). We are interested in obtaining these bounds without enumerating over Q ∈ Q (which can be infinite or very large).

Recently, we [6] applied multi-objective sampling to the problem of centrality estimation in metric spaces. Here M is a general metric space, X ⊂ M is a set of points, each Q is a single point in M, and the cost function is c(Q, x) = d(Q, x)^p. The centrality of Q is the sum Σ_x c(Q, x). We established that the multi-objective overhead is constant and that upper-bound probabilities (with constant overhead) can be computed very efficiently using O(|X|) distance computations. More recently [10], we generalized the result to the k-means objective, where the Q are subsets of size at most k, and established that the overhead is O(k).

9 FOREACH, FORALL

Our multi-objective sampling probabilities provide statistical guarantees that hold for each f and H: Theorem 4.1 states that the estimate ŝum(f; H) has the stated CV and concentration bounds over the sample distribution S ∼ p (a sample S that includes each x ∈ X independently (or VarOpt) with probability p_x). In this section we focus on uniform per-objective guarantees (k_f = k for all f ∈ F) and statistics sum(f; X) ≡ sum(f) over the full data set. For F and probabilities p, we define the ForEach normalized mean squared error (NMSE)

  NMSE_e(F, p) = max_{f∈F} E_{S∼p} (ŝum(f)/sum(f) − 1)²,   (9)

and the ForAll NMSE

  NMSE_a(F, p) = E_{S∼p} max_{f∈F} (ŝum(f)/sum(f) − 1)².   (10)

The respective normalized root MSE (NRMSE) values are the square roots of the NMSE. Note that ForAll is stronger than ForEach, as it requires a simultaneous good approximation of sum(f) for all f ∈ F: for all p, NMSE_e(F, p) ≤ NMSE_a(F, p). We are interested in the trade-off between the expected size of a sample, which is sum(p) ≡ Σ_x p_x, and the NRMSE. The multi-objective pps probabilities are such that for all l > 1, NMSE_e(F, p^(F,l)) ≤ 1/l. For a parameter l ≥ 1, we can also consider the ForAll error NMSE_a(F, p^(F,l)) and ask for a bound on l so that NRMSE_a ≤ ε.
A union-bound argument establishes that l = ε^{-2} log |F| always suffices. Moreover, when F is the sampling closure of a smaller subset F', then l = ε^{-2} log |F'| suffices. If we only need to bound the maximum error on any subset of F of size m, we can use l = ε^{-2} log m. When F is the set of all monotone functions over n keys, then l = O(ε^{-2} log log n) suffices. To see this intuitively, recall that it suffices to consider all threshold functions, since all monotone functions are nonnegative combinations of threshold functions; there are n threshold functions, but these functions have only O(log n) points where the value changes significantly (by a constant factor).

We provide an example of a family F where the sample-size gap between NMSE_e and NMSE_a is linear in the support size. Consider a set of n keys and define an f for each subset of n/2 keys so that f is uniform on the subset and 0 outside it. The multi-objective base pps sampling probabilities are p_x^(F,1) = 2/n for all x, and hence the overhead is h(F) = 2. Therefore, p of size 2ε^{-2} has NRMSE_e(F, p) ≤ ε. In contrast, any p with NRMSE_a(F, p) ≤ 1/2 must contain at least one key from the support of each f in (almost) all samples, implying an expected sample size of at least about n/2.

When we seek to bound NRMSE_e, probabilities of the form p^(F,l) essentially optimize the size-quality trade-off. For NRMSE_a, however, the minimum-size p that meets a certain error NRMSE_a(F, p) ≤ ε can be much smaller than the minimum-size p that is restricted to the form p^(F,l). Consider ε > 0 and a set F that has k parts F_i with disjoint supports of equal sizes n/k. All the parts F_i except F_1 have a single f that is uniform on its support, which means that with a uniform p of size ε^{-2} we have NRMSE_e(F_i, p) = NRMSE_a(F_i, p) = ε. The part F_1 has a structure similar to our previous example, which means that any p with NRMSE_a(F_1, p) ≤ 1/2 has size at least n/(2k), whereas a uniform p of size 2ε^{-2} has NRMSE_e(F_1, p) = ε. The multi-objective base sampling probabilities are therefore p_x^(F,1) = k/n for keys in the supports of the F_i with i > 1 and p_x^(F,1) = 2k/n for keys in the support of F_1, and thus the overhead is k + 1. The minimum-size p for NRMSE_a(F, p) = 1/2 must have value at least 1/2 for keys in the support of F_1 and value about (k/n) log k for the other keys (there is a ForAll requirement for each part, and a logarithmic factor due to a (tight) union bound). Therefore the sample size is O(n/k + k log k). In contrast, to have NRMSE_a(F, p^(F,l)) = 1/2, we have to use l = Ω(n/k), obtaining a sample size of Ω(n).


More information

INF Anne Solberg One of the most challenging topics in image analysis is recognizing a specific object in an image.

INF Anne Solberg One of the most challenging topics in image analysis is recognizing a specific object in an image. INF 4300 700 Introduction to classifiction Anne Solberg anne@ifiuiono Based on Chapter -6 6inDuda and Hart: attern Classification 303 INF 4300 Introduction to classification One of the most challenging

More information

Review of Essential Skills and Knowledge

Review of Essential Skills and Knowledge Review of Essential Skills and Knowledge R Eponent Laws...50 R Epanding and Simplifing Polnomial Epressions...5 R 3 Factoring Polnomial Epressions...5 R Working with Rational Epressions...55 R 5 Slope

More information

0.24 adults 2. (c) Prove that, regardless of the possible values of and, the covariance between X and Y is equal to zero. Show all work.

0.24 adults 2. (c) Prove that, regardless of the possible values of and, the covariance between X and Y is equal to zero. Show all work. 1 A socioeconomic stud analzes two discrete random variables in a certain population of households = number of adult residents and = number of child residents It is found that their joint probabilit mass

More information

Journal of Inequalities in Pure and Applied Mathematics

Journal of Inequalities in Pure and Applied Mathematics Journal of Inequalities in Pure and Applied Mathematics http://jipam.vu.edu.au/ Volume 2, Issue 2, Article 15, 21 ON SOME FUNDAMENTAL INTEGRAL INEQUALITIES AND THEIR DISCRETE ANALOGUES B.G. PACHPATTE DEPARTMENT

More information

where a 0 and the base b is a positive number other

where a 0 and the base b is a positive number other 7. Graph Eponential growth functions No graphing calculators!!!! EXPONENTIAL FUNCTION A function of the form than one. a b where a 0 and the base b is a positive number other a = b = HA = Horizontal Asmptote:

More information

8 Differential Calculus 1 Introduction

8 Differential Calculus 1 Introduction 8 Differential Calculus Introduction The ideas that are the basis for calculus have been with us for a ver long time. Between 5 BC and 5 BC, Greek mathematicians were working on problems that would find

More information

Summary of Random Variable Concepts March 17, 2000

Summary of Random Variable Concepts March 17, 2000 Summar of Random Variable Concepts March 17, 2000 This is a list of important concepts we have covered, rather than a review that devives or eplains them. Tpes of random variables discrete A random variable

More information

8.1 Exponents and Roots

8.1 Exponents and Roots Section 8. Eponents and Roots 75 8. Eponents and Roots Before defining the net famil of functions, the eponential functions, we will need to discuss eponent notation in detail. As we shall see, eponents

More information

Strain Transformation and Rosette Gage Theory

Strain Transformation and Rosette Gage Theory Strain Transformation and Rosette Gage Theor It is often desired to measure the full state of strain on the surface of a part, that is to measure not onl the two etensional strains, and, but also the shear

More information

Functions of Several Variables

Functions of Several Variables Chapter 1 Functions of Several Variables 1.1 Introduction A real valued function of n variables is a function f : R, where the domain is a subset of R n. So: for each ( 1,,..., n ) in, the value of f is

More information

An Information Theory For Preferences

An Information Theory For Preferences An Information Theor For Preferences Ali E. Abbas Department of Management Science and Engineering, Stanford Universit, Stanford, Ca, 94305 Abstract. Recent literature in the last Maimum Entrop workshop

More information

In applications, we encounter many constrained optimization problems. Examples Basis pursuit: exact sparse recovery problem

In applications, we encounter many constrained optimization problems. Examples Basis pursuit: exact sparse recovery problem 1 Conve Analsis Main references: Vandenberghe UCLA): EECS236C - Optimiation methods for large scale sstems, http://www.seas.ucla.edu/ vandenbe/ee236c.html Parikh and Bod, Proimal algorithms, slides and

More information

All-Distances Sketches, Revisited: HIP Estimators for Massive Graphs Analysis

All-Distances Sketches, Revisited: HIP Estimators for Massive Graphs Analysis 1 All-Distances Sketches, Revisited: HIP Estimators for Massive Graphs Analysis Edith Cohen arxiv:136.3284v7 [cs.ds] 17 Jan 215 Abstract Graph datasets with billions of edges, such as social and Web graphs,

More information

Logarithms. Bacteria like Staph aureus are very common.

Logarithms. Bacteria like Staph aureus are very common. UNIT 10 Eponentials and Logarithms Bacteria like Staph aureus are ver common. Copright 009, K1 Inc. All rights reserved. This material ma not be reproduced in whole or in part, including illustrations,

More information

Mathematical Statistics. Gregg Waterman Oregon Institute of Technology

Mathematical Statistics. Gregg Waterman Oregon Institute of Technology Mathematical Statistics Gregg Waterman Oregon Institute of Technolog c Gregg Waterman This work is licensed under the Creative Commons Attribution. International license. The essence of the license is

More information

Roberto s Notes on Integral Calculus Chapter 3: Basics of differential equations Section 3. Separable ODE s

Roberto s Notes on Integral Calculus Chapter 3: Basics of differential equations Section 3. Separable ODE s Roberto s Notes on Integral Calculus Chapter 3: Basics of differential equations Section 3 Separable ODE s What ou need to know alread: What an ODE is and how to solve an eponential ODE. What ou can learn

More information

The Magic of Random Sampling: From Surveys to Big Data

The Magic of Random Sampling: From Surveys to Big Data The Magic of Random Sampling: From Surveys to Big Data Edith Cohen Google Research Tel Aviv University Disclaimer: Random sampling is classic and well studied tool with enormous impact across disciplines.

More information

SEPARABLE EQUATIONS 2.2

SEPARABLE EQUATIONS 2.2 46 CHAPTER FIRST-ORDER DIFFERENTIAL EQUATIONS 4. Chemical Reactions When certain kinds of chemicals are combined, the rate at which the new compound is formed is modeled b the autonomous differential equation

More information

Correlation analysis 2: Measures of correlation

Correlation analysis 2: Measures of correlation Correlation analsis 2: Measures of correlation Ran Tibshirani Data Mining: 36-462/36-662 Februar 19 2013 1 Review: correlation Pearson s correlation is a measure of linear association In the population:

More information

CSE 546 Midterm Exam, Fall 2014

CSE 546 Midterm Exam, Fall 2014 CSE 546 Midterm Eam, Fall 2014 1. Personal info: Name: UW NetID: Student ID: 2. There should be 14 numbered pages in this eam (including this cover sheet). 3. You can use an material ou brought: an book,

More information

1 Approximate Quantiles and Summaries

1 Approximate Quantiles and Summaries CS 598CSC: Algorithms for Big Data Lecture date: Sept 25, 2014 Instructor: Chandra Chekuri Scribe: Chandra Chekuri Suppose we have a stream a 1, a 2,..., a n of objects from an ordered universe. For simplicity

More information

Stochastic Interpretation for the Arimoto Algorithm

Stochastic Interpretation for the Arimoto Algorithm Stochastic Interpretation for the Arimoto Algorithm Serge Tridenski EE - Sstems Department Tel Aviv Universit Tel Aviv, Israel Email: sergetr@post.tau.ac.il Ram Zamir EE - Sstems Department Tel Aviv Universit

More information

Estimators in simple random sampling: Searls approach

Estimators in simple random sampling: Searls approach Songklanakarin J Sci Technol 35 (6 749-760 Nov - Dec 013 http://wwwsjstpsuacth Original Article Estimators in simple random sampling: Searls approach Jirawan Jitthavech* and Vichit Lorchirachoonkul School

More information

8. BOOLEAN ALGEBRAS x x

8. BOOLEAN ALGEBRAS x x 8. BOOLEAN ALGEBRAS 8.1. Definition of a Boolean Algebra There are man sstems of interest to computing scientists that have a common underling structure. It makes sense to describe such a mathematical

More information

2.2 SEPARABLE VARIABLES

2.2 SEPARABLE VARIABLES 44 CHAPTER FIRST-ORDER DIFFERENTIAL EQUATIONS 6 Consider the autonomous DE 6 Use our ideas from Problem 5 to find intervals on the -ais for which solution curves are concave up and intervals for which

More information

6 = 1 2. The right endpoints of the subintervals are then 2 5, 3, 7 2, 4, 2 9, 5, while the left endpoints are 2, 5 2, 3, 7 2, 4, 9 2.

6 = 1 2. The right endpoints of the subintervals are then 2 5, 3, 7 2, 4, 2 9, 5, while the left endpoints are 2, 5 2, 3, 7 2, 4, 9 2. 5 THE ITEGRAL 5. Approimating and Computing Area Preliminar Questions. What are the right and left endpoints if [, 5] is divided into si subintervals? If the interval [, 5] is divided into si subintervals,

More information

Derivatives 2: The Derivative at a Point

Derivatives 2: The Derivative at a Point Derivatives 2: The Derivative at a Point 69 Derivatives 2: The Derivative at a Point Model 1: Review of Velocit In the previous activit we eplored position functions (distance versus time) and learned

More information

The Fusion of Parametric and Non-Parametric Hypothesis Tests

The Fusion of Parametric and Non-Parametric Hypothesis Tests The Fusion of arametric and Non-arametric pothesis Tests aul Frank Singer, h.d. Ratheon Co. EO/E1/A173.O. Bo 902 El Segundo, CA 90245 310-647-2643 FSinger@ratheon.com Abstract: This paper considers the

More information

Language and Statistics II

Language and Statistics II Language and Statistics II Lecture 19: EM for Models of Structure Noah Smith Epectation-Maimization E step: i,, q i # p r $ t = p r i % ' $ t i, p r $ t i,' soft assignment or voting M step: r t +1 # argma

More information

Mathematics 309 Conic sections and their applicationsn. Chapter 2. Quadric figures. ai,j x i x j + b i x i + c =0. 1. Coordinate changes

Mathematics 309 Conic sections and their applicationsn. Chapter 2. Quadric figures. ai,j x i x j + b i x i + c =0. 1. Coordinate changes Mathematics 309 Conic sections and their applicationsn Chapter 2. Quadric figures In this chapter want to outline quickl how to decide what figure associated in 2D and 3D to quadratic equations look like.

More information

MA123, Chapter 8: Idea of the Integral (pp , Gootman)

MA123, Chapter 8: Idea of the Integral (pp , Gootman) MA13, Chapter 8: Idea of the Integral (pp. 155-187, Gootman) Chapter Goals: Understand the relationship between the area under a curve and the definite integral. Understand the relationship between velocit

More information

MAT389 Fall 2016, Problem Set 2

MAT389 Fall 2016, Problem Set 2 MAT389 Fall 2016, Problem Set 2 Circles in the Riemann sphere Recall that the Riemann sphere is defined as the set Let P be the plane defined b Σ = { (a, b, c) R 3 a 2 + b 2 + c 2 = 1 } P = { (a, b, c)

More information

Generalized Least-Squares Regressions III: Further Theory and Classication

Generalized Least-Squares Regressions III: Further Theory and Classication Generalized Least-Squares Regressions III: Further Theor Classication NATANIEL GREENE Department of Mathematics Computer Science Kingsborough Communit College, CUNY Oriental Boulevard, Brookln, NY UNITED

More information

6.4 graphs OF logarithmic FUnCTIOnS

6.4 graphs OF logarithmic FUnCTIOnS SECTION 6. graphs of logarithmic functions 9 9 learning ObjeCTIveS In this section, ou will: Identif the domain of a logarithmic function. Graph logarithmic functions. 6. graphs OF logarithmic FUnCTIOnS

More information

10.2 The Unit Circle: Cosine and Sine

10.2 The Unit Circle: Cosine and Sine 0. The Unit Circle: Cosine and Sine 77 0. The Unit Circle: Cosine and Sine In Section 0.., we introduced circular motion and derived a formula which describes the linear velocit of an object moving on

More information

Recall Discrete Distribution. 5.2 Continuous Random Variable. A probability histogram. Density Function 3/27/2012

Recall Discrete Distribution. 5.2 Continuous Random Variable. A probability histogram. Density Function 3/27/2012 3/7/ Recall Discrete Distribution 5. Continuous Random Variable For a discrete distribution, for eample Binomial distribution with n=5, and p=.4, the probabilit distribution is f().7776.59.3456.34.768.4.3

More information

Math 115 First Midterm February 8, 2017

Math 115 First Midterm February 8, 2017 EXAM SOLUTIONS Math First Midterm Februar 8, 07. Do not open this eam until ou are told to do so.. Do not write our name anwhere on this eam.. This eam has pages including this cover. There are problems.

More information

From the help desk: It s all about the sampling

From the help desk: It s all about the sampling The Stata Journal (2002) 2, Number 2, pp. 90 20 From the help desk: It s all about the sampling Allen McDowell Stata Corporation amcdowell@stata.com Jeff Pitblado Stata Corporation jsp@stata.com Abstract.

More information

LINEAR REGRESSION ANALYSIS

LINEAR REGRESSION ANALYSIS LINEAR REGRESSION ANALYSIS MODULE V Lecture - 2 Correcting Model Inadequacies Through Transformation and Weighting Dr. Shalabh Department of Mathematics and Statistics Indian Institute of Technolog Kanpur

More information

Ordinal One-Switch Utility Functions

Ordinal One-Switch Utility Functions Ordinal One-Switch Utilit Functions Ali E. Abbas Universit of Southern California, Los Angeles, California 90089, aliabbas@usc.edu David E. Bell Harvard Business School, Boston, Massachusetts 0163, dbell@hbs.edu

More information

Section 1.2: A Catalog of Functions

Section 1.2: A Catalog of Functions Section 1.: A Catalog of Functions As we discussed in the last section, in the sciences, we often tr to find an equation which models some given phenomenon in the real world - for eample, temperature as

More information

15. Eigenvalues, Eigenvectors

15. Eigenvalues, Eigenvectors 5 Eigenvalues, Eigenvectors Matri of a Linear Transformation Consider a linear ( transformation ) L : a b R 2 R 2 Suppose we know that L and L Then c d because of linearit, we can determine what L does

More information

2 Ordinary Differential Equations: Initial Value Problems

2 Ordinary Differential Equations: Initial Value Problems Ordinar Differential Equations: Initial Value Problems Read sections 9., (9. for information), 9.3, 9.3., 9.3. (up to p. 396), 9.3.6. Review questions 9.3, 9.4, 9.8, 9.9, 9.4 9.6.. Two Examples.. Foxes

More information

2 Push and Pull Source Sink Push

2 Push and Pull Source Sink Push 1 Plaing Push vs Pull: Models and Algorithms for Disseminating Dnamic Data in Networks. R.C. Chakinala, A. Kumarasubramanian, Kofi A. Laing R. Manokaran, C. Pandu Rangan, R. Rajaraman Push and Pull 2 Source

More information

4.317 d 4 y. 4 dx d 2 y dy. 20. dt d 2 x. 21. y 3y 3y y y 6y 12y 8y y (4) y y y (4) 2y y 0. d 4 y 26.

4.317 d 4 y. 4 dx d 2 y dy. 20. dt d 2 x. 21. y 3y 3y y y 6y 12y 8y y (4) y y y (4) 2y y 0. d 4 y 26. 38 CHAPTER 4 HIGHER-ORDER DIFFERENTIAL EQUATIONS sstems are also able, b means of their dsolve commands, to provide eplicit solutions of homogeneous linear constant-coefficient differential equations.

More information

14.1 Finding frequent elements in stream

14.1 Finding frequent elements in stream Chapter 14 Streaming Data Model 14.1 Finding frequent elements in stream A very useful statistics for many applications is to keep track of elements that occur more frequently. It can come in many flavours

More information

D u f f x h f y k. Applying this theorem a second time, we have. f xx h f yx k h f xy h f yy k k. f xx h 2 2 f xy hk f yy k 2

D u f f x h f y k. Applying this theorem a second time, we have. f xx h f yx k h f xy h f yy k k. f xx h 2 2 f xy hk f yy k 2 93 CHAPTER 4 PARTIAL DERIVATIVES We close this section b giving a proof of the first part of the Second Derivatives Test. Part (b) has a similar proof. PROOF OF THEOREM 3, PART (A) We compute the second-order

More information

Additional Material On Recursive Sequences

Additional Material On Recursive Sequences Penn State Altoona MATH 141 Additional Material On Recursive Sequences 1. Graphical Analsis Cobweb Diagrams Consider a generic recursive sequence { an+1 = f(a n ), n = 1,, 3,..., = Given initial value.

More information

NATIONAL UNIVERSITY OF SINGAPORE Department of Mathematics MA4247 Complex Analysis II Lecture Notes Part I

NATIONAL UNIVERSITY OF SINGAPORE Department of Mathematics MA4247 Complex Analysis II Lecture Notes Part I NATIONAL UNIVERSITY OF SINGAPORE Department of Mathematics MA4247 Comple Analsis II Lecture Notes Part I Chapter 1 Preliminar results/review of Comple Analsis I These are more detailed notes for the results

More information

Introduction to Differential Equations. National Chiao Tung University Chun-Jen Tsai 9/14/2011

Introduction to Differential Equations. National Chiao Tung University Chun-Jen Tsai 9/14/2011 Introduction to Differential Equations National Chiao Tung Universit Chun-Jen Tsai 9/14/011 Differential Equations Definition: An equation containing the derivatives of one or more dependent variables,

More information

Interspecific Segregation and Phase Transition in a Lattice Ecosystem with Intraspecific Competition

Interspecific Segregation and Phase Transition in a Lattice Ecosystem with Intraspecific Competition Interspecific Segregation and Phase Transition in a Lattice Ecosstem with Intraspecific Competition K. Tainaka a, M. Kushida a, Y. Ito a and J. Yoshimura a,b,c a Department of Sstems Engineering, Shizuoka

More information

Inferential Statistics and Methods for Tuning and Configuring

Inferential Statistics and Methods for Tuning and Configuring DM63 HEURISTICS FOR COMBINATORIAL OPTIMIZATION Lecture 13 and Methods for Tuning and Configuring Marco Chiarandini 1 Summar 2 3 Eperimental Designs 4 Sequential Testing (Racing) Summar 1 Summar 2 3 Eperimental

More information

University of Toronto Mississauga

University of Toronto Mississauga Surname: First Name: Student Number: Tutorial: Universit of Toronto Mississauga Mathematical and Computational Sciences MATY5Y Term Test Duration - 0 minutes No Aids Permitted This eam contains pages (including

More information

Review Topics for MATH 1400 Elements of Calculus Table of Contents

Review Topics for MATH 1400 Elements of Calculus Table of Contents Math 1400 - Mano Table of Contents - Review - page 1 of 2 Review Topics for MATH 1400 Elements of Calculus Table of Contents MATH 1400 Elements of Calculus is one of the Marquette Core Courses for Mathematical

More information

Computation of Csiszár s Mutual Information of Order α

Computation of Csiszár s Mutual Information of Order α Computation of Csiszár s Mutual Information of Order Damianos Karakos, Sanjeev Khudanpur and Care E. Priebe Department of Electrical and Computer Engineering and Center for Language and Speech Processing

More information

Fuzzy Topology On Fuzzy Sets: Regularity and Separation Axioms

Fuzzy Topology On Fuzzy Sets: Regularity and Separation Axioms wwwaasrcorg/aasrj American Academic & Scholarl Research Journal Vol 4, No 2, March 212 Fuzz Topolog n Fuzz Sets: Regularit and Separation Aioms AKandil 1, S Saleh 2 and MM Yakout 3 1 Mathematics Department,

More information

CHAPTER 2: Partial Derivatives. 2.2 Increments and Differential

CHAPTER 2: Partial Derivatives. 2.2 Increments and Differential CHAPTER : Partial Derivatives.1 Definition of a Partial Derivative. Increments and Differential.3 Chain Rules.4 Local Etrema.5 Absolute Etrema 1 Chapter : Partial Derivatives.1 Definition of a Partial

More information

11.4 Polar Coordinates

11.4 Polar Coordinates 11. Polar Coordinates 917 11. Polar Coordinates In Section 1.1, we introduced the Cartesian coordinates of a point in the plane as a means of assigning ordered pairs of numbers to points in the plane.

More information

INTRODUCTION TO DIOPHANTINE EQUATIONS

INTRODUCTION TO DIOPHANTINE EQUATIONS INTRODUCTION TO DIOPHANTINE EQUATIONS In the earl 20th centur, Thue made an important breakthrough in the stud of diophantine equations. His proof is one of the first eamples of the polnomial method. His

More information

8.4. If we let x denote the number of gallons pumped, then the price y in dollars can $ $1.70 $ $1.70 $ $1.70 $ $1.

8.4. If we let x denote the number of gallons pumped, then the price y in dollars can $ $1.70 $ $1.70 $ $1.70 $ $1. 8.4 An Introduction to Functions: Linear Functions, Applications, and Models We often describe one quantit in terms of another; for eample, the growth of a plant is related to the amount of light it receives,

More information

Copyright, 2008, R.E. Kass, E.N. Brown, and U. Eden REPRODUCTION OR CIRCULATION REQUIRES PERMISSION OF THE AUTHORS

Copyright, 2008, R.E. Kass, E.N. Brown, and U. Eden REPRODUCTION OR CIRCULATION REQUIRES PERMISSION OF THE AUTHORS Copright, 8, RE Kass, EN Brown, and U Eden REPRODUCTION OR CIRCULATION REQUIRES PERMISSION OF THE AUTHORS Chapter 6 Random Vectors and Multivariate Distributions 6 Random Vectors In Section?? we etended

More information

Lecture 7: Advanced Coupling & Mixing Time via Eigenvalues

Lecture 7: Advanced Coupling & Mixing Time via Eigenvalues Counting and Sampling Fall 07 Lecture 7: Advanced Coupling & Miing Time via Eigenvalues Lecturer: Shaan Oveis Gharan October 8 Disclaimer: These notes have not been subjected to the usual scrutin reserved

More information

Inference about the Slope and Intercept

Inference about the Slope and Intercept Inference about the Slope and Intercept Recall, we have established that the least square estimates and 0 are linear combinations of the Y i s. Further, we have showed that the are unbiased and have the

More information

Solutions to Two Interesting Problems

Solutions to Two Interesting Problems Solutions to Two Interesting Problems Save the Lemming On each square of an n n chessboard is an arrow pointing to one of its eight neighbors (or off the board, if it s an edge square). However, arrows

More information

UNIT 2 QUADRATIC FUNCTIONS AND MODELING Lesson 2: Interpreting Quadratic Functions Instruction

UNIT 2 QUADRATIC FUNCTIONS AND MODELING Lesson 2: Interpreting Quadratic Functions Instruction Prerequisite Skills This lesson requires the use of the following skills: knowing the standard form of quadratic functions using graphing technolog to model quadratic functions Introduction The tourism

More information

Chapter 6* The PPSZ Algorithm

Chapter 6* The PPSZ Algorithm Chapter 6* The PPSZ Algorithm In this chapter we present an improvement of the pp algorithm. As pp is called after its inventors, Paturi, Pudlak, and Zane [PPZ99], the improved version, pps, is called

More information

x y plane is the plane in which the stresses act, yy xy xy Figure 3.5.1: non-zero stress components acting in the x y plane

x y plane is the plane in which the stresses act, yy xy xy Figure 3.5.1: non-zero stress components acting in the x y plane 3.5 Plane Stress This section is concerned with a special two-dimensional state of stress called plane stress. It is important for two reasons: () it arises in real components (particularl in thin components

More information

Perturbation Theory for Variational Inference

Perturbation Theory for Variational Inference Perturbation heor for Variational Inference Manfred Opper U Berlin Marco Fraccaro echnical Universit of Denmark Ulrich Paquet Apple Ale Susemihl U Berlin Ole Winther echnical Universit of Denmark Abstract

More information

A. Derivation of regularized ERM duality

A. Derivation of regularized ERM duality A. Derivation of regularized ERM dualit For completeness, in this section we derive the dual 5 to the problem of computing proximal operator for the ERM objective 3. We can rewrite the primal problem as

More information

Beyond Distinct Counting: LogLog Composable Sketches of frequency statistics

Beyond Distinct Counting: LogLog Composable Sketches of frequency statistics Beyond Distinct Counting: LogLog Composable Sketches of frequency statistics Edith Cohen Google Research Tel Aviv University TOCA-SV 5/12/2017 8 Un-aggregated data 3 Data element e D has key and value

More information

Module 3, Section 4 Analytic Geometry II

Module 3, Section 4 Analytic Geometry II Principles of Mathematics 11 Section, Introduction 01 Introduction, Section Analtic Geometr II As the lesson titles show, this section etends what ou have learned about Analtic Geometr to several related

More information

DIFFERENTIAL EQUATIONS First Order Differential Equations. Paul Dawkins

DIFFERENTIAL EQUATIONS First Order Differential Equations. Paul Dawkins DIFFERENTIAL EQUATIONS First Order Paul Dawkins Table of Contents Preface... First Order... 3 Introduction... 3 Linear... 4 Separable... 7 Eact... 8 Bernoulli... 39 Substitutions... 46 Intervals of Validit...

More information

Research Design - - Topic 15a Introduction to Multivariate Analyses 2009 R.C. Gardner, Ph.D.

Research Design - - Topic 15a Introduction to Multivariate Analyses 2009 R.C. Gardner, Ph.D. Research Design - - Topic 15a Introduction to Multivariate Analses 009 R.C. Gardner, Ph.D. Major Characteristics of Multivariate Procedures Overview of Multivariate Techniques Bivariate Regression and

More information

Study Guide and Intervention

Study Guide and Intervention 6- NAME DATE PERID Stud Guide and Intervention Graphing Quadratic Functions Graph Quadratic Functions Quadratic Function A function defined b an equation of the form f () a b c, where a 0 b Graph of a

More information

A Linear Round Lower Bound for Lovasz-Schrijver SDP Relaxations of Vertex Cover

A Linear Round Lower Bound for Lovasz-Schrijver SDP Relaxations of Vertex Cover A Linear Round Lower Bound for Lovasz-Schrijver SDP Relaxations of Vertex Cover Grant Schoenebeck Luca Trevisan Madhur Tulsiani Abstract We study semidefinite programming relaxations of Vertex Cover arising

More information

UNCORRECTED SAMPLE PAGES. 3Quadratics. Chapter 3. Objectives

UNCORRECTED SAMPLE PAGES. 3Quadratics. Chapter 3. Objectives Chapter 3 3Quadratics Objectives To recognise and sketch the graphs of quadratic polnomials. To find the ke features of the graph of a quadratic polnomial: ais intercepts, turning point and ais of smmetr.

More information

ECEN 689 Special Topics in Data Science for Communications Networks

ECEN 689 Special Topics in Data Science for Communications Networks ECEN 689 Special Topics in Data Science for Communications Networks Nick Duffield Department of Electrical & Computer Engineering Texas A&M University Lecture 5 Optimizing Fixed Size Samples Sampling as

More information

LESSON #11 - FORMS OF A LINE COMMON CORE ALGEBRA II

LESSON #11 - FORMS OF A LINE COMMON CORE ALGEBRA II LESSON # - FORMS OF A LINE COMMON CORE ALGEBRA II Linear functions come in a variet of forms. The two shown below have been introduced in Common Core Algebra I and Common Core Geometr. TWO COMMON FORMS

More information

10.5 Graphs of the Trigonometric Functions

10.5 Graphs of the Trigonometric Functions 790 Foundations of Trigonometr 0.5 Graphs of the Trigonometric Functions In this section, we return to our discussion of the circular (trigonometric functions as functions of real numbers and pick up where

More information

Unit 3 NOTES Honors Common Core Math 2 1. Day 1: Properties of Exponents

Unit 3 NOTES Honors Common Core Math 2 1. Day 1: Properties of Exponents Unit NOTES Honors Common Core Math Da : Properties of Eponents Warm-Up: Before we begin toda s lesson, how much do ou remember about eponents? Use epanded form to write the rules for the eponents. OBJECTIVE

More information

The Steiner Ratio for Obstacle-Avoiding Rectilinear Steiner Trees

The Steiner Ratio for Obstacle-Avoiding Rectilinear Steiner Trees The Steiner Ratio for Obstacle-Avoiding Rectilinear Steiner Trees Anna Lubiw Mina Razaghpour Abstract We consider the problem of finding a shortest rectilinear Steiner tree for a given set of pointn the

More information

Multiple Integration

Multiple Integration Contents 7 Multiple Integration 7. Introduction to Surface Integrals 7. Multiple Integrals over Non-rectangular Regions 7. Volume Integrals 4 7.4 Changing Coordinates 66 Learning outcomes In this Workbook

More information