Biased Quantiles. Flip Korn Graham Cormode S. Muthukrishnan

Similar documents
Effective computation of biased quantiles over data streams

1 Approximate Quantiles and Summaries

A Deterministic Algorithm for Summarizing Asynchronous Streams over a Sliding Window

Disjoint Sets : ADT. Disjoint-Set : Find-Set

Estimating Dominance Norms of Multiple Data Streams Graham Cormode Joint work with S. Muthukrishnan

Dictionary: an abstract data type

past balancing schemes require maintenance of balance info at all times, are aggresive use work of searches to pay for work of rebalancing

11 Heavy Hitters Streaming Majority

Dictionary: an abstract data type

Disjoint-Set Forests

An Optimal Algorithm for l 1 -Heavy Hitters in Insertion Streams and Related Problems

Chapter 5 Data Structures Algorithm Theory WS 2017/18 Fabian Kuhn

Distributed Summaries

Optimal Color Range Reporting in One Dimension

Fast Sorting and Selection. A Lower Bound for Worst Case

Data Structures and Algorithms

ENS Lyon Camp. Day 2. Basic group. Cartesian Tree. 26 October

The Complexity of Constructing Evolutionary Trees Using Experiments

Optimal Tree-decomposition Balancing and Reachability on Low Treewidth Graphs

Notes on Logarithmic Lower Bounds in the Cell Probe Model

Advanced Implementations of Tables: Balanced Search Trees and Hashing

Randomized Sorting Algorithms Quick sort can be converted to a randomized algorithm by picking the pivot element randomly. In this case we can show th

Range Medians. 1 Introduction. Sariel Har-Peled 1 and S. Muthukrishnan 2

Time-Decaying Aggregates in Out-of-order Streams

Lecture 17: Trees and Merge Sort 10:00 AM, Oct 15, 2018

Binary Search Trees. Dictionary Operations: Additional operations: find(key) insert(key, value) erase(key)

Splay Trees. CMSC 420: Lecture 8

Lecture 18 April 26, 2012

Assignment 5: Solutions

CS4311 Design and Analysis of Algorithms. Analysis of Union-By-Rank with Path Compression

CPSC 320 Sample Final Examination December 2013

Fundamental Algorithms

Jeffrey D. Ullman Stanford University

Data Stream Methods. Graham Cormode S. Muthukrishnan

Quiz 1 Solutions. Problem 2. Asymptotics & Recurrences [20 points] (3 parts)

1 Terminology and setup

Search Trees. Chapter 10. CSE 2011 Prof. J. Elder Last Updated: :52 AM

Transforming Hierarchical Trees on Metric Spaces

Analysis of Algorithms. Outline 1 Introduction Basic Definitions Ordered Trees. Fibonacci Heaps. Andres Mendez-Vazquez. October 29, Notes.

6.854 Advanced Algorithms

Data Structures and Algorithms " Search Trees!!

CS 161: Design and Analysis of Algorithms

1 Probability Review. CS 124 Section #8 Hashing, Skip Lists 3/20/17. Expectation (weighted average): the expectation of a random quantity X is:

Time-Decaying Aggregates in Out-of-order Streams

Chapter 11. Approximation Algorithms. Slides by Kevin Wayne Pearson-Addison Wesley. All rights reserved.

Review Of Topics. Review: Induction

Lower Bounds for Dynamic Connectivity (2004; Pǎtraşcu, Demaine)

Sorting. Chapter 11. CSE 2011 Prof. J. Elder Last Updated: :11 AM

The Union of Probabilistic Boxes: Maintaining the Volume

4.8 Huffman Codes. These lecture slides are supplied by Mathijs de Weerd

TASM: Top-k Approximate Subtree Matching

A Mergeable Summaries

Computing Connected Components Given a graph G = (V; E) compute the connected components of G. Connected-Components(G) 1 for each vertex v 2 V [G] 2 d

Integer Sorting on the word-ram

INF2220: algorithms and data structures Series 1

CS361 Homework #3 Solutions

N/4 + N/2 + N = 2N 2.

Sorting Algorithms. We have already seen: Selection-sort Insertion-sort Heap-sort. We will see: Bubble-sort Merge-sort Quick-sort

Lecture 2 September 4, 2014

Divide-and-conquer: Order Statistics. Curs: Fall 2017

Amortized Complexity Main Idea

Space-optimal Heavy Hitters with Strong Error Bounds

Dynamic Structures for Top-k Queries on Uncertain Data

8: Splay Trees. Move-to-Front Heuristic. Splaying. Splay(12) CSE326 Spring April 16, Search for Pebbles: Move found item to front of list

A Mergeable Summaries

Problem 5. Use mathematical induction to show that when n is an exact power of two, the solution of the recurrence

12 Count-Min Sketch and Apriori Algorithm (and Bloom Filters)

Dynamic Ordered Sets with Exponential Search Trees

Algorithms for pattern involvement in permutations

Data selection. Lower complexity bound for sorting

Algorithms, Design and Analysis. Order of growth. Table 2.1. Big-oh. Asymptotic growth rate. Types of formulas for basic operation count

15-451/651: Design & Analysis of Algorithms September 13, 2018 Lecture #6: Streaming Algorithms last changed: August 30, 2018

Priority queues implemented via heaps

Recitation 7. Treaps and Combining BSTs. 7.1 Announcements. FingerLab is due Friday afternoon. It s worth 125 points.

Self-improving Algorithms for Coordinate-Wise Maxima and Convex Hulls

5. DIVIDE AND CONQUER I

Lecture 5: Splay Trees

Interval Selection in the streaming model

b + O(n d ) where a 1, b > 1, then O(n d log n) if a = b d d ) if a < b d O(n log b a ) if a > b d

Concrete models and tight upper/lower bounds

arxiv: v2 [cs.dc] 28 Apr 2013

CSE 4502/5717: Big Data Analytics

Divide-and-Conquer Algorithms Part Two

Title: Count-Min Sketch Name: Graham Cormode 1 Affil./Addr. Department of Computer Science, University of Warwick,

Succinct Data Structures for Approximating Convex Functions with Applications

5 Spatial Access Methods

CSE 202: Design and Analysis of Algorithms Lecture 3

Online Sorted Range Reporting and Approximating the Mode

1 Maintaining a Dictionary

Presentation for use with the textbook, Algorithm Design and Applications, by M. T. Goodrich and R. Tamassia, Wiley, Union-Find Structures

2.2 Asymptotic Order of Growth. definitions and notation (2.2) examples (2.4) properties (2.2)

Analysis of Algorithms [Reading: CLRS 2.2, 3] Laura Toma, csci2200, Bowdoin College

CMPT 307 : Divide-and-Conqer (Study Guide) Should be read in conjunction with the text June 2, 2015

Correlated subqueries. Query Optimization. Magic decorrelation. COUNT bug. Magic example (slide 2) Magic example (slide 1)

Search Trees. EECS 2011 Prof. J. Elder Last Updated: 24 March 2015

Skip Lists. What is a Skip List. Skip Lists 3/19/14

COL 730: Parallel Programming

Linear Sketches A Useful Tool in Streaming and Compressive Sensing

25. Minimum Spanning Trees

25. Minimum Spanning Trees

Transcription:

Biased Quantiles Graham Cormode cormode@bell-labs.com S. Muthukrishnan muthu@cs.rutgers.edu Flip Korn flip@research.att.com Divesh Srivastava divesh@research.att.com

Quantiles Quantiles summarize data distribution concisely. Given N items, the φ quantile is the item with rank φn in the sorted order. Eg. The median is the 0.5-quantile, the minimum is the 0-quantile. Equidepth histograms put bucket boundaries on regular quantile values, eg 0.1, 0.2 0.9 Quantiles are a robust and rich summary: median is less affected by outliers than mean

Quantiles over Data Streams Data stream consists of N items in arbitrary order. Models many data sources eg network traffic, each packet is one item. Requires linear space to compute quantiles exactly in one pass, Ω(N 1/p ) in p passes. ε-approximate computation in sub-linear space Φ-quantile: item with rank between (Φ-ε)N and (Φ+ε)N [GK01]: insertions only, space O(1/ε log(εn)) [CM04]: insertions & deletions, space O(1/ε log U log 1/δ)

Why Biased Quantiles? IP network traffic is very skewed Long tails of great interest Eg: 0.9, 0.95, 0.99-quantiles of TCP round trip times Issue: uniform error guarantees ε = 0.05: okay for median, but not 0.99-quantile ε = 0.001: okay for both, but needs too much space Goal: support relative error guarantees in small space Low-biased quantiles: φ-quantiles in ranks φ(1±ε)n High-biased quantiles: (1-φ)-quantiles in ranks (1-(1±ε)φ)N

Prior Work Sampling approach due to Gupta & Zane [GZ03] Keep O(1/ε log N) samplers at different sample rates, each keeping a sample of O(1/ε 2 ) items Total space: O(1/ε 3 ), probabilistic algorithm Deterministic alg [CKMS05] Worst case input causes linear space usage Showed lower bound of Ω(1/ε log εn) Improved probabilistic alg of Zhang+ [ZLXKW05] Needs O(1/ε 2 polylog N) space and time

Our Approach Domain-oriented approach: items drawn from [1 U], want space to depend on O(log U) Impose binary tree structure over domain Maintain counts c w on (subset of) nodes Count represents input items from that subtree So counts to left of a leaf are from items strictly less; uncertainty in rank of item is from ancestors Similar to [SBAS04] approach for uniform quantiles

Functions over the tree We define some functions to measure counts over the tree. lf(x) = leftmost leaf in subtree x anc(x) = set of ancestors of node x L(v) = lf(w) < lf(v) c w (Left count) L(v) A(x) A(x) = w anc(x) c w (Ancestor count) v lf(x) x

Accuracy Invariants To ensure accurate answers, we maintain two invariants over the set of counts: x. L(x) A(x) rank(x) L(x) ensures we can deterministically bound ranks v. v lf(v) (c v α L(v)) ensures range of possible ranks is bounded To guarantee ε-accurate ranks, will set α = ε/log U (since we use ❷ summed over log U ancestors) Claim: any summary satisfying ❶ and ❷ allows us to find r (x) so r (x) rank(x) ε rank(x) ❶ ❷

Maintenance Need to show how to maintain the accuracy invariants, while guaranteeing space is bounded and updates are fast. Will Insert each update x. Insert will be defined to maintain accuracy, but space may grow Periodically will run a linear scan of data structure to Compress it. Will argue that these two together maintain space and time bounds.

Data Structure Store subset of nodes and counts as bq-summary Nodes with count 0 do not need to be stored Split bq into two: bq-leaves (bql) and bq-tree (bqt). This division is needed to get tightest space bounds. bq-leaves is a subset of leaf nodes only bq-tree is subset of nodes strictly to right of bq-leaves bql bqt

Space Conditions We will maintain four additional conditions to ensure space is bounded. Set z = max u bql u. z < lf(par(v bqt)) c par(v) αl(par(v)) Ensures parents of the bqt nodes are full 1/α log(αn) bql min(n,1/α) ❹ Ensures not too many or too few bql nodes z < min lf(v) v bqt Ensures bq-leaves to left of bq-tree nodes c v bql bqt v = N Sanity check on conservation of counts ❸ ❺ ❻

Space Bound Outline Will show that maintaining all six conditions ensures that space is tightly bounded Main effort is in proving size of bqt is bounded Will divide bqt into equivalence classes based on increasing L() values Since each L() value of class must increase by a multiplicative factor, can bound total space Equivalence classes

Equivalence Classes Only consider full nodes V in bqt (with at least one child present): by ❸, for v V, c v = αl(v) Partition V into equivalence classes based on L(v) E i is set of nodes in i th equivalence class, with L value = L i 3 5 1 L 1 is sum of bq-leaves: L 1 = v bql c v 1 3 2 1 L 1 =4 L 2 = 6 L 3 = 10 L 4 = 15 Example with α = ½

Space Bound By ❹ we have bql = L 1 1/α The L i s increase exponentially, can show L i+1 L 1 Π j=1i (1+α E j ) Consider item U+1, so rank(u+1)=n. By ❻, N = L(U+1) 1/α Π j=1q (1 + α E j ) Taking logs allows us to bound size of bqt So total space = bql + bqt = O(1/ε log (εn) log U)

Insert Procedure Must show we can maintain data structure quickly Insert allows space constraints to lapse slightly by using old (pre-calculated and stored) L() values. Given update item x: Compare to z = max u bql u If x z, place x in bql in time O(1) If x > z place x in bqt in time O(log log U): Find closest materialized ancestor y of x in bqt Add 1 to c y unless this would make c y > αl(y), if so then create child of y with count = 1

Accuracy of Insert Insert procedure maintains ❶, ❷, ❺, and ❻ Fairly easy to check each of these, e.g. x. L(x) A(x) rank(x) L(x) ❶ Inserting into bq, increases L(x) and rank(x) for everything to the right of inserted item. Other conditions preserved either by inspection, or by design of Insert routine (eg inserting into child node if inserting into y would break ❷)

Compress If we keep Inserting, space can grow without limit, but in worst case, we add one new node per insert, so Compress when space doubles Need to periodically recompute L() values for nodes, and merge together nodes when possible First, resize bq-leaves so bql = min(n,1/α) Recompute z = max v bql v in time linear in bql, Insert leaves removed from bql into bqt. Tricky part is compressing bq-tree

Compress Tree Compress Tree operation takes a (sub)tree in bqt, ensures that each node becomes full (has count = αl(v)) by pulling up weight from below For node v compute L(v) and wt(v) = v anc(w) c w Set c v as big as possible by borrowing from wt(v) If c v = αl(v), then recurse on children in order Else, we have accounted for all weight below, so delete all descendents With care, Compress Tree takes time O( bqt ) and computes L(v) incrementally as a side effect Can show that Compress maintains conditions ❶, ❷, ❺, and ❻ and restores conditions ❸ and ❹

Final Result Can answer rank queries with error ε rank(x), using space O(1/ε log εn log U), and amortized update time O(log log U). Lower bound on space = O(1/ε log (εn)) To answer queries, need latest values of L(v), so need time O(1/ε log εn log U) to preprocess Can then answer queries in time O(log U) each Alternatively, spend O(log U) time on updates and allow L(v) values to be computed in time O(log U) Quantile queries can be answered by binary searching for item with desired rank

Extensions Partially biased algorithm Sometimes only need accuracy down to some ε N Can reduce space slightly for this weaker guarantee Space required is O(1/ε log (ε/ε ) log U) Uniform algorithm The Compress Tree idea can be applied to εn error bq-leaves not needed, space used is O(1/ε log U) Time is O(log log U) amortized as before

Experimental Results CKMS, MRC = prior work, SBQ = this work Outperforms prior work in both time and space

Commentary Took some amount of effort to get the invariants and conditions just right : Small changes to conditions meant either space or time bounds would break bq-leaves needed to ensure that space bounds are as tight as possible Easy to merge together summaries to get summary of union (for distributed computations) Linearity of L and A means everything goes through

Conclusions Close to optimal space bounds What about faster updates, less work for queries? Made crucial use of tree-structure over universe Any way to drop U and work over arbitrary domains?