Edo Liberty Principal Scientist Amazon Web Services. Streaming Quantiles


Streaming Quantiles
- Manku, Rajagopalan, Lindsay. Random sampling techniques for space efficient online computation of order statistics of large datasets.
- Munro, Paterson. Selection and sorting with limited storage.
- Greenwald, Khanna. Space-efficient online computation of quantile summaries.
- Wang, Luo, Yi, Cormode. Quantiles over data streams: An experimental study.
- Greenwald, Khanna. Quantiles and equidepth histograms over streams.
- Agarwal, Cormode, Huang, Phillips, Wei, Yi. Mergeable summaries.
- Felber, Ostrovsky. A randomized online quantile summary in O((1/ε) log(1/ε)) words.
- Karnin, Lang, Liberty. Optimal Quantile Approximation in Streams.

Problem Definition
[Figure: the rank function R rises from 0 to n over the sorted stream; the marked point has R(·) = 0.6n.]
Create a sketch R′ such that |R′(x) − R(x)| ≤ εn.
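To make the guarantee concrete, here is a minimal Python sketch (function names are ours, not from the talk) of the rank function R(x) and the ε-approximation check:

```python
import bisect

def rank(sorted_data, x):
    """R(x): the number of stream items strictly smaller than x."""
    return bisect.bisect_left(sorted_data, x)

def is_eps_approx(sorted_data, estimate, x, eps):
    """Does a candidate estimate R'(x) satisfy |R'(x) - R(x)| <= eps * n?"""
    n = len(sorted_data)
    return abs(estimate - rank(sorted_data, x)) <= eps * n

data = sorted([5, 1, 4, 7, 0, 3])      # the buffer used in the next slides
print(rank(data, 4))                    # 3: items 0, 1, 3 are smaller than 4
print(is_eps_approx(data, 4, 4, 0.2))   # True: |4 - 3| = 1 <= 0.2 * 6
```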

Solutions
- Uniform sampling: fast and simple, fully mergeable, space Õ(1/ε²).
- Greenwald-Khanna (GK) sketch: slow, complex, not mergeable, space log(n)/ε.
- Felber-Ostrovsky (2015), combines sampling and GK: slow, complex, not mergeable, space log(1/ε)/ε.
log(1/ε)/ε was previously conjectured space optimal for all algorithms; it is a lower bound for deterministic algorithms by Hung and Ting 2010.

Solutions cont.
- Uniform sampling: fast, simple, fully mergeable, space Õ(1/ε²).
- Greenwald-Khanna (GK) sketch: slow, complex, not mergeable, space log(n)/ε.
- Felber-Ostrovsky (2015), combines sampling and GK: slow, complex, not mergeable, space log(1/ε)/ε.
- Manku-Rajagopalan-Lindsay (MRL): fast, simple, fully mergeable, space log²(n)/ε.
- Agarwal, Cormode, Huang, Phillips, Wei, Yi: fast, complex, fully mergeable, space log^{3/2}(1/ε)/ε.
The last two are buffer based solutions.

The basic buffer idea
A buffer of size k stores k stream entries: 5 1 4 7 0 3. The buffer sorts the k entries: 7 5 4 3 1 0. It then deletes every other item and outputs the rest with double the weight: 5 3 0.

The basic buffer idea
Example: the sorted stream is 0 1 3 4 5 7 and the weight-2 survivors are 0 3 5. Where R(x) = 2 the sketch answers R′(x) = 2; at x = 4, R′(x) = 4; where R(x) = 5, R′(x) = 6.

The basic buffer idea
Repeat n/k times until the end of the stream; the surviving n/2 items satisfy |R′(x) − R(x)| < n/k.
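This one-buffer scheme can be sketched in a few lines of Python (a simplified illustration under our naming, not the talk's code): consume the stream in chunks of k, compact each chunk, and answer rank queries from the weight-2 survivors.

```python
def compact(chunk):
    """Sort a full buffer and keep every other item;
    each survivor now represents two stream items."""
    return sorted(chunk)[0::2]

def one_level_sketch(stream, k):
    """Repeat the compaction n/k times until the end of the stream."""
    survivors = []
    for i in range(0, len(stream), k):
        survivors.extend(compact(stream[i:i + k]))
    return survivors

def approx_rank(survivors, x):
    """R'(x): weighted count of survivors smaller than x."""
    return 2 * sum(1 for y in survivors if y < x)

s = one_level_sketch([5, 1, 4, 7, 0, 3], k=6)
print(s)                   # [0, 3, 5], as on the slide
print(approx_rank(s, 4))   # 4, versus the true rank R(4) = 3
```

Each compaction perturbs any rank by at most one item, which is where the |R′(x) − R(x)| < n/k bound on the slide comes from.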

Manku-Rajagopalan-Lindsay (MRL) sketch
H = log₂(n) buffers of size k, one per weight level, give |R′(x) − R(x)| ≤ (n/k) log₂(n).

Manku-Rajagopalan-Lindsay (MRL) sketch
If we set k = log₂(n)/ε we get |R′(x) − R(x)| ≤ εn while maintaining at most H · k ≤ log₂²(n)/ε stream items. MRL is fast and simple, fully mergeable, space log₂²(n)/ε.
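The MRL cascade above can be sketched as follows (our simplified illustration, not the authors' code): a full buffer at level h is sorted, halved, and pushed to level h+1, where items carry twice the weight.

```python
def mrl_sketch(stream, k):
    """Cascade of buffers: level h holds items of weight 2**h.
    A full level is sorted, halved, and pushed one level up."""
    levels = [[]]
    for x in stream:
        levels[0].append(x)
        h = 0
        while len(levels[h]) >= k:           # compact any full level
            survivors = sorted(levels[h])[0::2]
            levels[h] = []
            if h + 1 == len(levels):
                levels.append([])
            levels[h + 1].extend(survivors)  # weight doubles
            h += 1
    return levels

def approx_rank(levels, x):
    """R'(x): weighted count of stored items smaller than x."""
    return sum((2 ** h) * sum(1 for y in level if y < x)
               for h, level in enumerate(levels))

levels = mrl_sketch(list(range(1024)), k=32)
print(sum(len(level) for level in levels))   # far fewer than 1024 items
print(approx_rank(levels, 512))              # close to the true rank 512
```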

Agarwal, Cormode, Huang, Phillips, Wei, Yi (1)
Use only log(1/ε) buffers of size k and start sampling after O(1/ε²) items. Reduces space usage to log²(1/ε)/ε items from the stream.

Agarwal, Cormode, Huang, Phillips, Wei, Yi (2)
Compact at random: keep the odd- or even-indexed items with probability 1/2 each. In the two-item example 5 7, where R(x) = 1, the estimate is R′(x) = 2 if 5 survives and R′(x) = 0 if 7 survives. R′(x) is a random variable now and E[R′(x)] = R(x). Reduces space usage to log^{3/2}(1/ε)/ε items from the stream.
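The random compaction step can be illustrated like this (our sketch; the coin flip chooses between the odd- and even-indexed survivors):

```python
import random

def randomized_compact(buffer):
    """Keep the even- or odd-indexed items of the sorted buffer, each
    with probability 1/2; survivors get double weight. For any x the
    two outcomes over- and under-count R(x) symmetrically, so the
    weighted rank estimate is unbiased: E[R'(x)] = R(x)."""
    return sorted(buffer)[random.randint(0, 1)::2]

# The slide's example: the buffer holds 5 and 7, and for an x between
# them R(x) = 1. The estimate R'(x) is 2 if 5 survives and 0 if 7
# survives -- each with probability 1/2, so the mean is exactly 1.
random.seed(0)
print(randomized_compact([5, 7]))  # either [5] or [7]
```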

Recap
- Uniform sampling: fast, simple, fully mergeable, space Õ(1/ε²).
- Greenwald-Khanna (GK) sketch: slow, complex, not mergeable, space log(n)/ε.
- Felber-Ostrovsky (2015), combines sampling and GK: slow, complex, not mergeable, space log(1/ε)/ε.
- Manku-Rajagopalan-Lindsay (MRL): fast, simple, fully mergeable, space log²(n)/ε.
- Agarwal, Cormode, Huang, Phillips, Wei, Yi: fast, complex, fully mergeable, space log^{3/2}(1/ε)/ε.

Our goal
- Uniform sampling: fast, simple, fully mergeable, space Õ(1/ε²).
- Greenwald-Khanna (GK) sketch: slow, complex, not mergeable, space log(n)/ε.
- Felber-Ostrovsky (2015), combines sampling and GK: slow, complex, not mergeable, space log(1/ε)/ε.
- Manku-Rajagopalan-Lindsay (MRL): fast, simple, fully mergeable, space log²(n)/ε.
- Agarwal, Cormode, Huang, Phillips, Wei, Yi: fast, complex, fully mergeable, space log^{3/2}(1/ε)/ε.
- Karnin, Lang, Liberty: fast, simple, fully mergeable, space log(1/ε)/ε.

Observation
The first buffers contribute very little to the error; they are too good. With levels h = 1, 2, ..., H: the weight of items in level h is w_h = 2^{h−1} and the number of compactions at level h is m_h = 2^{H−h}.

Idea
Let buffers shrink at-most-exponentially: k_h ≈ k · c^{H−h}, with c to be determined later. Then H ≤ log(n/ck) + 2 and m_h ≤ (2/c)^{H−h}, while w_h = 2^{h−1} as before.

Analysis
Let R(h, x) be the rank of x among (1) the items yielded by the compactor at height h and (2) all the items stored in the compactors of heights h′ ≤ h. Claim: with C = c²(2c − 1),
Pr[|R(x, H₀) − R(x)| ≥ εn] ≤ 2 exp(−C ε² k² 2^{2(H−H₀)}).
Proof: use Hoeffding's inequality on Σ_{h=1}^{H} [R(x, h) − R(x, h−1)].
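For reference, the Hoeffding step can be written out in our notation (a hedged restatement; the slide's constant C absorbs the substitution of m_h and w_h):

```latex
% Hoeffding: independent zero-mean $X_i$ with $|X_i| \le b_i$ satisfy
\Pr\Big[\big|\textstyle\sum_i X_i\big| \ge t\Big]
  \le 2\exp\Big(-\frac{t^2}{2\sum_i b_i^2}\Big).
% The telescoping sum $\sum_{h=1}^{H}\,[R(x,h)-R(x,h-1)]$ contains, at
% height $h$, $m_h$ independent zero-mean compaction errors each bounded
% by $w_h$, so taking $t = \varepsilon n$ gives
\Pr\big[\,|R(x,H) - R(x)| \ge \varepsilon n\,\big]
  \le 2\exp\Big(-\frac{\varepsilon^2 n^2}{2\sum_h m_h w_h^2}\Big),
% and substituting $m_h \le (2/c)^{H-h}$ and $w_h = 2^{h-1}$ yields the claim.
```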

Solution 1
Set c = 2/3 and k_h = ⌈k c^{H−h}⌉ + 1.
Karnin, Lang, Liberty (1): fast, simple, fully mergeable, space √(log(1/ε))/ε. Better than the previously conjectured optimal!
[Diagram: log(n) buffers of exponentially decreasing capacity; a sampler replaces all buffers of size 2.]
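A compact illustration of this solution in Python (a simplified sketch under our assumptions, not the authors' code): compactor capacities shrink geometrically below the top level, compactions choose odd/even survivors at random, and we simply floor capacities at 2 instead of switching to a sampler.

```python
import random

def kll_sketch(stream, k, c=2/3):
    """Compactor at height h has capacity ~ k * c**(H-1-h), where H is
    the current number of levels: near k at the top, shrinking toward a
    constant at the bottom (the talk replaces size-2 buffers with a
    sampler; here we just floor capacities at 2)."""
    compactors = [[]]
    for x in stream:
        compactors[0].append(x)
        h = 0
        while h < len(compactors):
            H = len(compactors)
            cap = max(2, int(k * c ** (H - 1 - h)) + 1)
            if len(compactors[h]) >= cap:
                survivors = sorted(compactors[h])[random.randint(0, 1)::2]
                compactors[h] = []
                if h + 1 == len(compactors):
                    compactors.append([])
                compactors[h + 1].extend(survivors)
            h += 1
    return compactors

def approx_rank(compactors, x):
    """R'(x): weighted count of stored items smaller than x."""
    return sum((2 ** h) * sum(1 for y in level if y < x)
               for h, level in enumerate(compactors))

random.seed(42)
sketch = kll_sketch(list(range(100_000)), k=64)
print(sum(len(level) for level in sketch))         # far fewer than 100,000
print(abs(approx_rank(sketch, 50_000) - 50_000))   # small relative to n
```

The randomized compaction keeps the per-level errors zero-mean, which is what lets the shrinking capacities get away with so little space.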

Solution 2 (KLL + MRL)
Set c = 2/3 and k_h = ⌈k c^{H−h}⌉ + 1, except that the top log log(1/ε) buffers all have capacity k.
Karnin, Lang, Liberty (2): fast, simple, fully mergeable, space log² log(1/ε)/ε.
[Diagram: log log(1/ε) buffers of capacity k above log(n) buffers of exponentially decreasing capacity; a sampler replaces all buffers of size 2.]

Solution 3 (KLL + GK)
Set c = 2/3 and k_h = ⌈k c^{H−h}⌉ + 1, and replace the top log log(1/ε) levels with a GK sketch.
Karnin, Lang, Liberty (3): fast, simple, fully mergeable, space log log(1/ε)/ε.
[Diagram: a GK sketch replaces the top log log(1/ε) levels; below it, log(n) buffers of exponentially decreasing capacity; a sampler replaces all buffers of size 2.]

Count Distinct (Demo Only)

Assume you need to estimate the distribution of numbers in a file.

$ head data.csv
0
1
0
3
0
2
3
7
3
2

In this one, row i takes a value from [0, i] uniformly at random.

Some stats: there are 10,000,000 such numbers in this ~76MB file.

$ time wc -lc data.csv
10000000 76046666 data.csv

real 0m0.101s
user 0m0.072s
sys 0m0.021s

Reading the file takes ~1/10 of a second. We don't foresee IO being an issue.

In python it looks like this:

$ cat quantiles.py
import sys
ints = sorted([int(x) for x in sys.stdin])
for i in range(0, len(ints), int(len(ints)/100)):
    print(str(ints[i]))

$ time cat data.csv | python quantiles.py > /dev/null

real 0m13.406s
user 0m12.937s
sys 0m0.407s

This is the way to do this with the sketching library:

$ time cat data.csv | sketch rank > /dev/null

real 0m1.495s
user 0m1.878s
sys 0m0.141s

Too fast to use the system monitor UI... It uses ~4KB of memory!

[Charts: "exact and approximate quantiles" and "exact vs approximate quantiles", both over values 0 to 10,000,000, comparing the sketch's quantile estimates to the exact ones.]

Some experimental results
[Charts: "Lazy KLL versus (Sketch Library and Two Variants)". Left: error (0 to 0.01); right: space used for storing samples (0 to 4000); both as a function of the number of items in a randomly permuted stream (100 to 1e+06), for the Sketch Library, Variant 1, Variant 2, and Lazy KLL.]

Thank you!