Approximate Sorting of Data Streams with Limited Storage

Similar documents
Compression in the Space of Permutations

1 Approximate Quantiles and Summaries

Edo Liberty Principal Scientist Amazon Web Services. Streaming Quantiles

Lecture Lecture 9 October 1, 2015

Lower Bounds for Quantile Estimation in Random-Order and Multi-Pass Streaming

Web Structure Mining Nodes, Links and Influence

2013 IEEE International Symposium on Information Theory

BECAUSE of their data reduction properties, independence

Centrality Measures. Leonid E. Zhukov

STREAM ORDER AND ORDER STATISTICS: QUANTILE ESTIMATION IN RANDOM-ORDER STREAMS

1 Estimating Frequency Moments in Streams

Stream-Order and Order-Statistics

Sketching and Streaming for Distributions

Divide-and-Conquer. Consequence. Brute force: n 2. Divide-and-conquer: n log n. Divide et impera. Veni, vidi, vici.

Chapter 5. Divide and Conquer. Slides by Kevin Wayne. Copyright 2005 Pearson-Addison Wesley. All rights reserved.

Week 5: Quicksort, Lower bound, Greedy

2012 IEEE International Symposium on Information Theory Proceedings

Session III: New ETSI Model on Wideband Speech and Noise Transmission Quality Phase II. STF Validation results

Permutation distance measures for memetic algorithms with population management

Rank and Rating Aggregation. Amy Langville Mathematics Department College of Charleston

The Online Space Complexity of Probabilistic Languages

Approximate Counts and Quantiles over Sliding Windows

Computing the Entropy of a Stream

Weather Company Energy and Power Products

A Novel Distance-Based Approach to Constrained Rank Aggregation

Lecture 2 Sept. 8, 2015

Data Streams & Communication Complexity

Mathematical Embeddings of Complex Systems

Centrality Measures. Leonid E. Zhukov

Floyd-Hoare Style Program Verification

A Unified Approach to Combinatorial Key Predistribution Schemes for Sensor Networks

2017 Scientific Ocean Drilling Bibliographic Database Report

A Las Vegas approximation algorithm for metric 1-median selection

Preference-Based Rank Elicitation using Statistical Models: The Case of Mallows

Similarity searching, or how to find your neighbors efficiently

21st Century Global Learning

Permutation distance measures for memetic algorithms with population management

A General Lower Bound on the I/O-Complexity of Comparison-based Algorithms

Linear Algebra Section 2.6 : LU Decomposition Section 2.7 : Permutations and transposes Wednesday, February 13th Math 301 Week #4

Lecture 3 Sept. 4, 2014

Unit 14: Nonparametric Statistical Methods

Link Analysis Information Retrieval and Data Mining. Prof. Matteo Matteucci

Essential Maths Skills. for AS/A-level. Geography. Helen Harris. Series Editor Heather Davis Educational Consultant with Cornwall Learning

Data selection. Lower complexity bound for sorting

How to display data badly

How Philippe Flipped Coins to Count Data

! Break up problem into several parts. ! Solve each part recursively. ! Combine solutions to sub-problems into overall solution.

Square Kilometer Array:

Lecture 2: Streaming Algorithms

Copyright 2000, Kevin Wayne 1

Counting Distance Permutations

Standard Form Scientific Notation Numbers $ 10 8,000, Numbers $ 1 and, Numbers. 0 and, 1 0.

Lecture 1: Asymptotics, Recurrences, Elementary Sorting

11 Heavy Hitters Streaming Majority

Max-Sum Diversification, Monotone Submodular Functions and Semi-metric Spaces

Maintaining Significant Stream Statistics over Sliding Windows

Node and Link Analysis

UNIT 4 RANK CORRELATION (Rho AND KENDALL RANK CORRELATION

Divide-and-conquer. Curs 2015

Objectives: 5pts Graph the average ecological footprints of several countries. Select two countries with different sized footprints and research

Lecture 1: Introduction to Sublinear Algorithms

? 11.5 Perfect hashing. Exercises

Computing Trusted Authority Scores in Peer-to-Peer Web Search Networks

Europe, Canada, Latin America, & Australia

Using Web Maps to Measure the Development of Global Scale Cognitive Maps

The Five Themes of Geography

K-Lists. Anindya Sen and Corrie Scalisi. June 11, University of California, Santa Cruz. Anindya Sen and Corrie Scalisi (UCSC) K-Lists 1 / 25

Establishment of Space Weather Information Service

Large-scale Collaborative Ranking in Near-Linear Time

Lecture 1 September 3, 2013

Data Representation for Flash Memories 53

Optimal exact tests for complex alternative hypotheses on cross tabulated data

Sliding Window Algorithms for Regular Languages

1 ONE SAMPLE TEST FOR MEDIAN: THE SIGN TEST

Divide-Conquer-Glue Algorithms

IEEE Transactions on Image Processing EiC Report

CHAPTER 6 : LITERATURE REVIEW

P E R E N C O - C H R I S T M A S P A R T Y

Section 4.1 Solving Systems of Linear Inequalities

CS 4407 Algorithms Lecture 2: Iterative and Divide and Conquer Algorithms

CS246 Final Exam, Winter 2011

Joint Coding for Flash Memory Storage

ClasSi: Measuring Ranking Quality in the Presence of Object Classes with Similarity Information

United Kingdom. Total MAP Caseload. Average time needed to close MAP cases (in months)

Mathematics. Pre-Leaving Certificate Examination, Paper 2 Higher Level Time: 2 hours, 30 minutes. 300 marks L.20 NAME SCHOOL TEACHER

Improved Concentration Bounds for Count-Sketch

On the longest increasing subsequence of a circular list

Randomized Algorithms

High Dimensional Geometry, Curse of Dimensionality, Dimension Reduction

On Top-k Structural. Similarity Search. Pei Lee, Laks V.S. Lakshmanan University of British Columbia Vancouver, BC, Canada

Lower Bounds for Testing Bipartiteness in Dense Graphs

5. DIVIDE AND CONQUER I

MSc / PhD Course Advanced Biostatistics. dr. P. Nazarov

Weighted Bloom Filter

Dimensionality Reduction

Bivariate Paired Numerical Data

2. Probability. Chris Piech and Mehran Sahami. Oct 2017

DS504/CS586: Big Data Analytics Graph Mining II

Efficient Spatial Data Structure for Multiversion Management of Engineering Drawings

From Non-Negative Matrix Factorization to Deep Learning

Transcription:

Approximate Sorting of Data Streams with Limited Storage Farzad Farnoud! Eitan Yakoobi! Jehoshua Bruck 20th International Computing and Combinatorics Conference (COCOON'14)

Sorting with Limited Storage Source Sensors Netflix, Amazon servers Data Stream Sensor data:! Tbs/day Movies:! 1/week User with limited storage Application Rank Correlation Testing (Spearman s, Kendall s) Preference Learning

Problem Statement Stream s s s {z } 2 1 Algorithm s n {z } X: permutation defined! by the ordering of s m storage cells Y:! approximation of X If s i <s j then i appears before j in X! To store stream elements, m cells are available; no limitation on other types of storage! Algorithm can compare any two elements residing in storage! Deterministic algorithms, X is a random permutation! Performance measure: permutation distortion between X and Y

Example Suppose X=253461 (s2<s5<s3<s4<s6<s1) and m=3 s 6 s 5 s 4 s 3 s 2 s 1 {z } storage cells s2<s1 s2<s3<s1 s2<s3<s4 s2<s5<s4 s2<s4<s6 Output, e.g. Y=235146 (s2<s3<s5<s1<s4<s6)

Related Work! J. Munro and M. Paterson. Selection and sorting with limited storage. Theoretical Computer Science, 12(3):315 323, 1980.! G. S. Manku, S. Rajagopalan, and B. G. Lindsay. Approximate medians and other quantiles in one pass and with limited storage. ACM SIGMOD 1998! Sudipto Guha and Andrew McGregor. Approximate quantiles and the order of the stream. In Proc. 25th ACM Symposium on Principles of Database Systems, pp. 273 279, 2006.! A. Chakrabarti, T. S. Jayram, and M. Patrascu. Tight lower bounds for selection in randomly ordered streams. SODA 2008

Performance Measures Stream s {z } s 2 s 1 s n {z } X: permutation defined! by the ordering of s Algorithm m storage cells Y:! approximation of X Kendall tau distortion:! Counts # of pairwise disagreements ( = # of transpositions of adjacent elements taking X to Y )! Example: d τ (312,123)=2 since 312 132 123 Weighted Kendall distortion! Chebyshev distortion:! Maximum error in the rank of any element! Example: d c (35124,12345)=3

Universal Bounds: Kendall Distortion Theorem: For any algorithm with storage m=µn and average Kendall distortion D=δn, if δ is bounded away from zero, then! µ ( + ) + ( + ( )) y = xe x y 2.5 2.0 1.5 1.0 0.5-5 -4-3 -2-1 1 x -1-2 -3-4 -5 x 1 x = W 0 HyL 0.5 1.0 1.5 2.0 2.5 x = W -1 HyL y

Universal Bounds: Kendall Distortion Theorem: For any algorithm with storage m=µn and average Kendall distortion D=δn, if δ is bounded away from zero, then! µ ( + ) + ( + ( )) 1.0 m As δ increases, we asymptotically have µ 1/(e 2 δ)(1+o(1)) 0.8 0.6 0.4 0.2 0.0 0.5 1.0 1.5 2.0 2.5 3.0 d

Universal Bounds: Kendall Distortion Proof outline:! Let C={y1,y 2,y 3, } denote the set of possible Y s for a given algorithm! Since algorithm is deterministic, C m!m n-m! C can be viewed as a rate-distortion code [Wang et al 13, Farnoud et al 14]! For distortion D, B(D): size of ball of radius D! If we have B(D), eliminating C gives the result

Universal Bounds: Chebyshev Distortion Theorem: For any algorithm with storage m=µn and average Chebyshev distortion D=δn, with 2/n δ 1/2, ( / ) µ ( + ( )) For any fixed δ as n increases, storage requirement becomes a vanishing fraction of n Constant distortion needs at least constant μ m 0.35 0.30 0.25 0.20 0.15 0.10 0.05 n =100 n =1000 0.0 0.1 0.2 0.3 0.4 0.5 d

Algorithm A simple algorithm:! Store the first m-1 elements of the stream, s1,,s m-1, as pivots! Compare each new element with the pivots! Example: Suppose X=263415 (s2 <s 6 <s 3 <s 4 <s 1 <s 5 ) and m=3: s 6 s 5 s 4 s 3 s 2 s 1 {z } storage cells s2<s1 s2<s3<s1 s2<s4<s1 s2<s1<s5 s2<s6<s1 Output Y=234615, d τ (263415,234615)=2, d c (263415,234615)=2

Algorithm: Kendall Distortion Theorem: The algorithm asymptotically requires at most a constant factor as much storage as an optimal algorithm for the same Kendall distortion. For small δ, Proposed alg: µ 1! Opt. alg: µ bounded away from 0! For large δ, Proposed alg: µ 1/(2δ) (1+o(1))! Opt. alg: µ 1/(e 2 δ) (1+o(1)) 1.0 m 0.8 0.6 0.4 0.2 Proposed algorithm Lower bound 0.0 0.5 1.0 1.5 2.0 2.5 3.0 d

Algorithm: Chebyshev Distortion Theorem: If the proposed algorithm has storage m=µn and average Chebyshev distortion D=δn, with δ 1/2 and δ bounded away from 0, then μ W-1(-δ/e)/(δn). If δ is bounded away from 0, we need at most a constant times as much storage! For vanishing distortion, better algorithm and/or bounds are needed m 1.0 0.8 0.6 0.4 0.2 Proposed algorithm Lower bound n =500 0.1 0.2 0.3 0.4 0.5 d

Algorithm: Chebyshev Distortion Proof outline:! Given Y, X is unknown only in segments bounded by pivots: If Y=2347156, then Xϵ{2347156, 2437156, 2374165, 2437165, } Chebyshev distortion is bounded by the length of the longest segment! Coupling with a randomly broken stick of length n into m parts! Statistics of the length of the longest piece are well known [Holst 80]! δn = E[d c (X,Y)] E[length of longest piece of stick] n ln(me) / m (-mδ) e (-mδ) -δ/e

Weighted Kendall: Why? Ranking of Wikipedia pages:! Rankings with different methods for important items are all very similar! This is not reflected by the Kendall tau correlation! The correlation coefficient is affected by items with low ranks Ind. PR Katz Indegree 1 0.75 0.90 PageRank 0.75 1 0.75 Katz 0.90 0.75 1 Harmonic 0.62 0.61 0.70 Indegree PageRank Katz United States United States United States List of sovereign states Animal List of sovereign states Animal List of sovereign states United Kingdom England France France France Germany Animal Association football Association football World War II United Kingdom England England Germany India Association football Canada United Kingdom Germany World War II Canada Canada India Arthropod India Australia Insect Australia London World War II London Japan Japan Italy Italy Australia Japan Arthropod Village New York City Insect Italy English language New York City Poland China English language English language Poland Village Nationa Reg. of Hist. Places World War I Rankings of Wiki pages with different methods: correlations and top-20 pages [Vigna 14]

Distortion with Weighted Kendall Weighted Kendall distortion: [F, Milenkovic 13]! Weight wi for transposing ith and (i+1)st elements! Can be used to penalize mistakes in higher positions more! Example: w1 =2, w 2 =1, d w (312,123)=3 since 312 132 123 Axiomatically derived based on Kemeny s axioms! Time Hin msl 600 Ê No need for ground truth! Fast computation: O(n lg n) with small constant for monotonically decreasing weights 500 400 300 200 100 Ê Ê Ê Ê ÊÊ Length Hin 1000sL 100 200 300 400 500

Distortion with Weighted Kendall What should the ranks of pivots be if errors in higher positions are penalized more?! Example: Linearly decreasing weight function: wi = 1+ c (n-i-1) with c>0: The pivots are chosen more closely at the top to provide better accuracy for highly ranked items! Optimum positions for pivots is asymptotically independent from c! 0" 0.1" 0.2" 0.3" 0.4" 0.5" 0.6" 0.7" 0.8" 0.9" 1" m=2" m=3" m=4" m=5"

Conclusion Provided bounds on the performance of algorithms for sorting with limited storage! Proposed an algorithm and showed that it is asymptotically optimal for! Kendall distortion (up to a constant factor), and! Chebyshev distortion (up to a constant factor and for δ bounded away from 0)! Future work and open problems:! What is the best possible algorithm if only the last m are remembered?! Tighter bounds for Kendall distortion! Better algorithm/bounds for Chebyshev distortion when δ is small

Thank You!