ECEN 689 Special Topics in Data Science for Communications Networks

Similar documents
Leveraging Big Data: Lecture 13

CS246: Mining Massive Datasets Jure Leskovec, Stanford University

Priority queues implemented via heaps

1 Approximate Quantiles and Summaries

4/26/2017. More algorithms for streams: Each element of data stream is a tuple Given a list of keys S Determine which tuples of stream are in S

CS341 info session is on Thu 3/1 5pm in Gates415. CS246: Mining Massive Datasets Jure Leskovec, Stanford University

Lecture 5: Hashing. David Woodruff Carnegie Mellon University

RANDOMIZED ALGORITHMS

Binary Search Trees. Lecture 29 Section Robb T. Koether. Hampden-Sydney College. Fri, Apr 8, 2016

Search Trees. Chapter 10. CSE 2011 Prof. J. Elder Last Updated: :52 AM

Notes on Logarithmic Lower Bounds in the Cell Probe Model

1 Probability Review. CS 124 Section #8 Hashing, Skip Lists 3/20/17. Expectation (weighted average): the expectation of a random quantity X is:

15-451/651: Design & Analysis of Algorithms September 13, 2018 Lecture #6: Streaming Algorithms last changed: August 30, 2018

ECEN 689 Special Topics in Data Science for Communications Networks

Stream sampling for variance-optimal estimation of subset sums

A Deterministic Algorithm for Summarizing Asynchronous Streams over a Sliding Window

WHAT IS A SKIP LIST? S 2 S 1. A skip list for a set S of distinct (key, element) items is a series of lists

12 Count-Min Sketch and Apriori Algorithm (and Bloom Filters)

INF2220: algorithms and data structures Series 1

Dictionary: an abstract data type

ENS Lyon Camp. Day 2. Basic group. Cartesian Tree. 26 October

Lecture 2 September 4, 2014

Data-Intensive Distributed Computing

Probabilistic Near-Duplicate. Detection Using Simhash

Design of discrete-event simulations

arxiv: v2 [cs.ds] 15 Nov 2010

14.1 Finding frequent elements in stream

Skip Lists. What is a Skip List. Skip Lists 3/19/14

Introduction to discrete probability. The rules Sample space (finite except for one example)

Lecture 3. 1 Polynomial-time algorithms for the maximum flow problem

Lecture 10. Sublinear Time Algorithms (contd) CSC2420 Allan Borodin & Nisarg Shah 1

CS Data Structures and Algorithm Analysis

Priority sampling for estimation of arbitrary subset sums

Advanced Implementations of Tables: Balanced Search Trees and Hashing

11.1 Set Cover ILP formulation of set cover Deterministic rounding

Search Trees. EECS 2011 Prof. J. Elder Last Updated: 24 March 2015

CS 5321: Advanced Algorithms Amortized Analysis of Data Structures. Motivations. Motivation cont d

Part 4. Decomposition Algorithms

Lecture 59 : Instance Compression and Succinct PCP s for NP

Introduction. An Introduction to Algorithms and Data Structures

Lecture 3 Sept. 4, 2014

Integer Programming. Wolfram Wiesemann. December 6, 2007

Impression Store: Compressive Sensing-based Storage for. Big Data Analytics

Jeffrey D. Ullman Stanford University

CSE 202: Design and Analysis of Algorithms Lecture 3

Dictionary: an abstract data type

Wavelet decomposition of data streams. by Dragana Veljkovic

Optimal Combination of Sampled Network Measurements

Online Learning, Mistake Bounds, Perceptron Algorithm

Charging from Sampled Network Usage

25 Minimum bandwidth: Approximation via volume respecting embeddings

Discrete-event simulations

Data Sketches for Disaggregated Subset Sum and Frequent Item Estimation

Lecture 21: Counting and Sampling Problems

CS 240 Data Structures and Data Management. Module 4: Dictionaries

Multi-Dimensional Online Tracking

Outline for today. Information Retrieval. Cosine similarity between query and document. tf-idf weighting

CSE 202 Homework 4 Matthias Springer, A

compare to comparison and pointer based sorting, binary trees

Maximum Likelihood Estimation of the Flow Size Distribution Tail Index from Sampled Packet Data

Computing the Entropy of a Stream

Weight-balanced Binary Search Trees

Real-Time Workload Models with Efficient Analysis

Streaming - 2. Bloom Filters, Distinct Item counting, Computing moments. credits:

UNIVERSITY OF YORK. MSc Examinations 2004 MATHEMATICS Networks. Time Allowed: 3 hours.

Efficient Binary Fuzzy Measure Representation and Choquet Integral Learning

Introduction to Randomized Algorithms III

Algorithms Exam TIN093 /DIT602

Graph-theoretic Problems

Lecture 8 HASHING!!!!!

Learn More, Sample Less: Control of Volume and Variance in Network Measurement

CS 5614: (Big) Data Management Systems. B. Aditya Prakash Lecture #15: Mining Streams 2

Zebo Peng Embedded Systems Laboratory IDA, Linköping University

Weight-balanced Binary Search Trees

The Magic of Random Sampling: From Surveys to Big Data

Lecture 25 of 42. PAC Learning, VC Dimension, and Mistake Bounds

Don t Let The Negatives Bring You Down: Sampling from Streams of Signed Updates

Analysis of 2x2 Cross-Over Designs using T-Tests

past balancing schemes require maintenance of balance info at all times, are aggresive use work of searches to pay for work of rebalancing

AVL Trees. Properties Insertion. October 17, 2017 Cinda Heeren / Geoffrey Tien 1

Chapter 5 Divide and Conquer

Extended IR Models. Johan Bollen Old Dominion University Department of Computer Science

Each internal node v with d(v) children stores d 1 keys. k i 1 < key in i-th sub-tree k i, where we use k 0 = and k d =.

Incremental Learning and Concept Drift: Overview

Monotone Estimation Framework and Applications for Scalable Analytics of Large Data Sets

CS5314 Randomized Algorithms. Lecture 15: Balls, Bins, Random Graphs (Hashing)

(tree searching technique) (Boolean formulas) satisfying assignment: (X 1, X 2 )

1 Maintaining a Dictionary

Big Data. Big data arises in many forms: Common themes:

Linear Sketches A Useful Tool in Streaming and Compressive Sensing

On Two Class-Constrained Versions of the Multiple Knapsack Problem

STA 4273H: Statistical Machine Learning

Decision Tree Learning

where X is the feasible region, i.e., the set of the feasible solutions.

Randomization in Algorithms and Data Structures

Window-aware Load Shedding for Aggregation Queries over Data Streams

Chapter 4. Greedy Algorithms. Slides by Kevin Wayne. Copyright 2005 Pearson-Addison Wesley. All rights reserved.

Treedy: A Heuristic for Counting and Sampling Subsets

Scheduling: Queues & Computation

Heaps and Priority Queues

Transcription:

ECEN 689 Special Topics in Data Science for Communications Networks
Nick Duffield, Department of Electrical & Computer Engineering, Texas A&M University
Lecture 5: Optimizing Fixed Size Samples

Sampling as a Mediator of Constraints
[Diagram: sampling mediates between data characteristics (heavy tails, correlations), resource constraints (bandwidth, storage, CPU), and query requirements (ad hoc, accuracy, aggregates, speed)]

Why Summarize (ISP) Big Data?
When transmission bandwidth for measurements is limited (not such a big issue in ISPs with in-band collection).
When raw accumulation of high-rate streaming data is not feasible (even for nation states).
To maintain historical summaries for baselining and time series analysis.
To facilitate fast queries, when it is infeasible to run exploratory queries over the full data.
As part of a hierarchical query infrastructure: maintain full data over a limited-duration window, and drill down into the full data through one or more layers of summarization.

The need to limit sample volume
[Diagram: multiple samplers feeding a shared output pipe]

Fixed Size Sample
Given N objects, select k < N objects uniformly at random: each subset of k objects should be equally likely.
One way: sample uniformly from the remaining objects, k times without replacement, so successive draws have probabilities 1/N, 1/(N-1), 1/(N-2), ... until k objects are selected.
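A minimal sketch of this procedure (the function name is mine); every size-k subset comes out equally likely:

```python
import random

def sample_without_replacement(items, k):
    """Select k of the N given items uniformly at random, by drawing
    uniformly from the remaining items k times without replacement
    (probabilities 1/N, 1/(N-1), ..., 1/(N-k+1))."""
    pool = list(items)                    # working copy of the N objects
    sample = []
    for _ in range(k):
        j = random.randrange(len(pool))   # uniform over what remains
        sample.append(pool.pop(j))
    return sample

print(sample_without_replacement(range(10), 3))
```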

Fixed Size Stream Sample: Reservoirs
Stream constraint: see each item once; discard permanently if not selected.
Assume a reservoir of capacity k items is available (reasonable, since k = final sample size).
Items from the stream can be provisionally included in the reservoir: take the first k items w.p. 1; provisionally included items can be discarded later, in order to select different items from later in the stream.
[Diagram: a reservoir of k = 3 items]

Reservoir Sampling in Practice
How to achieve a uniform sampling distribution? No need to know the stream size in advance:
Include the first k items w.p. 1.
Include item n > k w.p. p_n = k/n; if included, evict one item: pick j uniformly from {1, 2, ..., n}; if j ≤ k, swap item n into location j in the reservoir and discard the replaced item.
A neat proof shows the uniformity of the sampling method. Let S_n = the sample set after n arrivals. The new item n is selected with probability Prob[n ∈ S_n] = p_n := k/n by construction. A previously sampled item m < n is in S_n if it was in S_{n-1} (probability p_{n-1}, by induction) and is not evicted by item n (item n enters w.p. p_n and evicts item m w.p. 1/k), so
Prob[m ∈ S_n] = p_{n-1} · (1 − p_n/k) = (k/(n−1)) · (1 − 1/n) = k/n = p_n.
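A sketch of the method just described, essentially the classic reservoir algorithm (the function name is mine):

```python
import random

def reservoir_sample(stream, k):
    """Maintain a uniform sample of k items from a stream seen once.
    Item n (1-based) is included with probability k/n; if included,
    it replaces a uniformly chosen occupant of the reservoir."""
    reservoir = []
    for n, item in enumerate(stream, start=1):
        if n <= k:
            reservoir.append(item)        # first k items taken w.p. 1
        else:
            j = random.randrange(n)       # uniform on {0, ..., n-1}
            if j < k:                     # happens w.p. k/n
                reservoir[j] = item       # evict a uniform victim
    return reservoir

print(reservoir_sample(range(1000), 7))
```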

Reservoir Sampling: Skip Counting
Simple approach: check each item in turn, O(1) per item. Fine if computation time < interarrival time; otherwise a computation backlog of O(N) builds up.
Better: skip counting. Find the random index m(n) > n of the next selection, with distribution
Prob[m(n) ≤ m] = 1 − (1 − p_{n+1})(1 − p_{n+2}) ··· (1 − p_m).
The expected number of selections from the stream is k + Σ_{k<m≤N} p_m = k + Σ_{k<m≤N} k/m = O(k(1 + ln(N/k))), and there is an algorithm with this average running time.
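A sketch that draws the skip by inverse-CDF sampling of the distribution above (function name mine). A practical skip-counting sampler, e.g., Vitter's Algorithm Z, generates the skip in O(1) expected time; this loop merely demonstrates the distribution:

```python
import random

def next_selection(n, k):
    """Given that item n has just been processed (n >= k), return the
    index m(n) of the next selected item, by inverse-CDF sampling of
    Prob[m(n) <= m] = 1 - prod_{j=n+1..m} (1 - k/j)."""
    u = random.random()
    survive = 1.0                      # running product of (1 - p_j)
    m = n
    while True:
        m += 1
        survive *= 1.0 - k / m
        if 1.0 - survive >= u:         # CDF has reached u: select item m
            return m

# Selections thin out as the stream grows, since p_m = k/m shrinks.
print(next_selection(100, 7))
```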

IPPS Stream Reservoir Sampling
For each arriving item: provisionally include it in the reservoir; if the reservoir now holds m+1 items, discard 1 item randomly (equivalently, retain m items randomly). Choose the inclusion probabilities to be IPPS:
Calculate the threshold z that includes m items on average: z solves Σ_i p_z(x_i) = Σ_i min{1, x_i/z} = m.
Discard item i with probability q_i = 1 − p_z(x_i) = 1 − min{1, x_i/z}.
Adjust the m surviving weights x_i with Horvitz-Thompson: x'_i = x_i / p_z(x_i) = max{x_i, z}.
[Diagram: example with m = 9 and provisional items x_1, ..., x_10; recalculate the threshold z so that Σ_{i=1}^{10} min{1, x_i/z} = 9, discard one item with probability q_i = 1 − min{1, x_i/z}, and adjust the surviving weights to x'_i = max{x_i, z}]
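The step just described can be sketched in code under stated assumptions: weights are positive, the threshold is recomputed from scratch by sorting (the BST machinery later in the lecture avoids this), and the names are mine, not from the cited paper:

```python
import random

def ipps_reservoir_step(reservoir, new_item, m):
    """One IPPS reservoir step (simplified sort-based sketch).
    reservoir: list of m (id, weight) pairs with positive weights;
    new_item: an (id, weight) pair.  Returns the m survivors with
    Horvitz-Thompson adjusted weights max(x_i, z)."""
    items = reservoir + [new_item]              # m+1 provisional items
    xs = sorted(w for _, w in items)            # ascending x_(1) .. x_(m+1)
    prefixes, total = [], 0.0
    for x in xs:
        total += x
        prefixes.append(total)                  # prefixes[t-1] = sum_{i<=t} x_(i)
    # t* = max{ t : sum_{i<=t} x_(i) >= (t-1) * x_(t) }; for positive
    # weights t* >= 2, so the division below is safe.
    t_star = max(t for t in range(1, m + 2)
                 if prefixes[t - 1] >= (t - 1) * xs[t - 1])
    z = prefixes[t_star - 1] / (t_star - 1)     # solves sum_{i<=t*} x_(i)/z = t*-1
    # Discard probabilities q_i = 1 - min(1, x_i/z) sum to (m+1) - m = 1.
    qs = [1.0 - min(1.0, w / z) for _, w in items]
    victim = random.choices(range(m + 1), weights=qs)[0]
    return [(ident, max(w, z))                  # survivors: x'_i = max{x_i, z}
            for i, (ident, w) in enumerate(items) if i != victim]

reservoir = [("a", 1.0), ("b", 2.0), ("c", 10.0)]
print(ipps_reservoir_step(reservoir, ("d", 1.5), m=3))
```

Each survivor carries the Horvitz-Thompson estimate x_i / p_z(x_i), which keeps subset-sum estimates unbiased.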

Computation in IPPS Stream Sampling
How to calculate the threshold z that includes m items on average, i.e., solve Σ_i p_z(x_i) = m?
Put the m+1 provisional items in weight order: x_(1) ≤ x_(2) ≤ ... ≤ x_(m+1). For any z, call x_(i) small if x_(i) ≤ z and large if x_(i) > z.
If x_(t) is a small item, then:
m = Σ_i p_z(x_(i)) = Σ_i min{1, x_(i)/z}   (implicit definition of z, using the form of p_z)
  ≤ Σ_{i≤t} x_(i)/z + (m+1−t)   (since p_z(x_(i)) ≤ 1 for the m+1−t terms with (i) > t)
  ≤ Σ_{i≤t} x_(i)/x_(t) + (m+1−t)   (since x_(t) is small and hence ≤ z)
In other words, Σ_{i≤t} x_(i)/x_(t) ≥ t−1 whenever x_(t) is small. The largest possible index t for a small item is therefore
t* = max{ t : Σ_{i≤t} x_(i)/x_(t) ≥ t−1 }.
Then find z from Σ_{i≤t*} x_(i)/z = t*−1. (Why? The m+1−t* large items each contribute p_z = 1 to the sum, leaving t*−1 to come from the small items.)
How to find t*?

How to find t*?
t* = max{ t : Σ_{i≤t} x_(i)/x_(t) ≥ t−1 }.
Exercise: show that g(t) = Σ_{i≤t} x_(i)/x_(t) − (t−1) is nonincreasing in t, and show that this makes t* easier to find.
[Plot: g(t) decreasing through zero at t*]
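Granting the exercise, g(t) ≥ 0 holds up to t* and fails beyond it, so t* can be found by binary search rather than a linear scan. A sketch, assuming the weights are already sorted ascending and prefix sums are precomputed (in the streaming algorithm below, a BST supplies them):

```python
def find_t_star(xs):
    """xs: weights sorted ascending.  Binary search for
    t* = max{ t : g(t) >= 0 }, where g(t) = sum_{i<=t} x_(i)/x_(t) - (t-1)
    is nonincreasing in t (the exercise above)."""
    prefix = [0.0]
    for x in xs:
        prefix.append(prefix[-1] + x)      # prefix[t] = sum_{i<=t} x_(i)
    lo, hi = 1, len(xs)                    # g(1) >= 0 always holds
    while lo < hi:
        mid = (lo + hi + 1) // 2           # round up so lo always advances
        if prefix[mid] >= (mid - 1) * xs[mid - 1]:   # g(mid) >= 0
            lo = mid                       # t* >= mid
        else:
            hi = mid - 1                   # t* < mid
    return lo

print(find_t_star([1.0, 1.0, 10.0]))       # -> 2 in this example
```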

Monotonic Functions: Searching for Changepoints
Let f be a binary function on {1, 2, ..., n}: f(t) is either 0 or 1. Say f is monotonic if, for some t*, f(t) = 1 if and only if t > t*.
Example: f(t) = 0 if Σ_{i≤t} x_(i)/x_(t) ≥ t−1, and f(t) = 1 if Σ_{i≤t} x_(i)/x_(t) < t−1. Then t* = max{t : f(t) = 0} = max{ t : Σ_{i≤t} x_(i)/x_(t) ≥ t−1 }.
Task: find the changepoint t*, where f(t) = 0 if t ≤ t* and f(t) = 1 if t > t*.
Simple approach: inspect f(t) for t = 1, 2, ...; then t* is the last t for which f(t) = 0. Computational cost O(n): we might have to inspect all n values.

Binary Search for Changepoint
Initialize: set t at the center of the interval. Iterate (until t can move no further right):
If f(t) = 1, we know t* < t: restrict attention to the lower half subinterval and set t to its center.
If f(t) = 0, we know t* ≥ t: restrict attention to the upper half subinterval and set t to its center.
If h is the number of halvings, then 2^h ≈ n, so #iterations = h = O(log n), far fewer than O(n).
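A sketch of the halving procedure for a generic monotone binary f on {1, ..., n} (function name mine):

```python
def changepoint(f, n):
    """Find t* = max{ t in 1..n : f(t) == 0 } for monotone binary f
    (f(t) = 0 iff t <= t*), using O(log n) evaluations of f.
    Returns 0 if f(1) == 1, i.e., no t with f(t) == 0."""
    lo, hi = 0, n                  # invariant: f == 0 on [1..lo], f == 1 on (hi..n]
    while lo < hi:
        mid = (lo + hi + 1) // 2   # round up so lo always advances
        if f(mid) == 0:
            lo = mid               # t* >= mid: search the upper half
        else:
            hi = mid - 1           # t* < mid: search the lower half
    return lo

# Toy example with t* = 6:
print(changepoint(lambda t: 0 if t <= 6 else 1, 10))   # -> 6
```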

Need to generalize binary search
The domain may be a point subset of the real numbers, e.g., the set of weights x_i, not simply {1, 2, ..., n}.
The domain can change with time (database insertions and deletions); for a given f, this can move the changepoint, which then needs to be located again.
We need a general way to store and retrieve data points that enables finding function changepoints, abstracts the idea of binary search away from any particular setting, and is efficient.

Introducing binary search trees
A data structure to store and retrieve points of an ordered set, e.g., numbers with the order <. Points do not need to be added in any particular order: no presorting is needed.
The tree structure is very efficient: it stores n = 2^h − 1 items at depth h, and the computational cost to retrieve any item is O(log n).
We will use it to locate the changepoint of a binary monotonic function.
[Diagram: a balanced example tree with root 17; second level 4, 23; third level 2, 6, 20, 36; leaves 1, 3, 5, 9, 18, 22, 25, 98]

Storing in a Binary Search Tree
Storing an ordered (multi)set, e.g., {17, 4, 6, 23, 36, 4}:
Store the first element at the root (here 17). Pass each further element down the tree: to the left child if element ≤ node, to the right child if element > node.
[Diagram: 4 ≤ 17 goes left of the root; 23 > 17 goes right; 6 goes left at 17, then right at 4; and so on]
The tree has depth O(log m) for m items, and the computational complexity to add, delete, or retrieve an item is O(log m). Binary search is easy: go left/right until found, in O(log m) steps.
Need to rebalance: maintain bounded depth (i.e., approximate symmetry).
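A minimal unbalanced BST sketch of this insertion rule (class and function names mine; a production tree would also rebalance):

```python
class Node:
    def __init__(self, key):
        self.key, self.left, self.right = key, None, None

def insert(root, key):
    """Pass the key down: left child if key <= node, right child if key > node."""
    if root is None:
        return Node(key)                  # empty spot found: store here
    if key <= root.key:
        root.left = insert(root.left, key)
    else:
        root.right = insert(root.right, key)
    return root

root = None
for key in [17, 4, 6, 23, 36, 4, 9, 25]:  # the lecture's example items
    root = insert(root, key)
```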

Search on a Binary Tree
Search for some node (and whatever information is attached to it), say node 9. Start at the root and iterate until found: branch left if 9 ≤ node, branch right if 9 > node.
In the example tree: 9 ≤ 17, branch left to 4; 9 > 4, branch right to 6; 9 > 6, branch right; found node 9!
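Search, continuing the Node/insert sketch above:

```python
def search(root, key):
    """Iterative BST search: branch left if key <= node, right if key > node."""
    node = root
    while node is not None:
        if key == node.key:
            return node                    # found it
        node = node.left if key < node.key else node.right
    return None                            # key not present

print(search(root, 9).key)                 # path: 17 -> 4 -> 6 -> 9
```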

Monotone Functions and Binary Search Trees
A function f on nodes is monotone if f(node_1) ≤ f(node_2) whenever node_1 < node_2; example: f(node) = the weight stored at the node.
For a monotone binary function (taking values 0 or 1), seek node* = the largest node with f(node) = 0.
Example: seek the largest node with weight ≤ 10, i.e., f(node) = 0 if its weight ≤ 10 and 1 otherwise. (The answer happens to be node 9, but this is not known at the start.)
Start at the root and iterate until we can move no further right:
if f(node) = 1, then node > node*, so branch left;
if f(node) = 0, then node ≤ node*, so branch right.
In the example: f(17) = 1, branch left; f(4) = 0, branch right; f(6) = 0, branch right; f(9) = 0 and we can move no further right, so node* = 9.
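A sketch of this descent, continuing the Node sketch above and, purely for illustration, taking each node's weight to be its own key (so f(key) = 0 iff key ≤ 10):

```python
def rightmost_zero(root, f):
    """Largest key with f(key) == 0, for binary f that is monotone
    nondecreasing in the key (all 0s, then all 1s)."""
    best, node = None, root
    while node is not None:
        if f(node.key) == 1:
            node = node.left               # node > node*: branch left
        else:
            best = node                    # node <= node*: candidate
            node = node.right              # try to move further right
    return best

print(rightmost_zero(root, lambda k: 0 if k <= 10 else 1).key)   # -> 9
```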

Back to IPPS: maintaining partial sums Σ_{i≤t} x_(i)
Insert: each node x_(i) maintains partial sums S_(i) of the weights x_(j) ≤ x_(i) that traverse it.
Recovery: from x_(i), add up the partial sums S_(j) over all nodes x_(j) ≤ x_(i) on the path back to the root.
Example (tree storing 17, 4, 6, 23, 36, 4, 9, 25): weight of items ≤ 25 is 25+23+17+4+6+4+9; weight of items ≤ 9 is 9+6+4+4.
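One standard way to realize these partial sums is to augment each node with the total weight of its subtree; a prefix sum Σ_{x_(j) ≤ x} x_(j) is then recovered along a single root-to-leaf path. A self-contained sketch (names mine):

```python
class SumNode:
    def __init__(self, key):
        self.key, self.left, self.right = key, None, None
        self.subtree_sum = key                 # total weight in this subtree

def insert_weight(root, key):
    """BST insert, updating subtree sums along the insertion path."""
    if root is None:
        return SumNode(key)
    root.subtree_sum += key
    if key <= root.key:
        root.left = insert_weight(root.left, key)
    else:
        root.right = insert_weight(root.right, key)
    return root

def prefix_sum(root, key):
    """Total weight of stored items <= key, in O(depth) time."""
    total, node = 0.0, root
    while node is not None:
        if key < node.key:
            node = node.left               # node and its right subtree exceed key
        else:                              # node and its left subtree all count
            total += node.key
            if node.left is not None:
                total += node.left.subtree_sum
            node = node.right
    return total

root = None
for w in [17, 4, 6, 23, 36, 4, 9, 25]:        # the lecture's example weights
    root = insert_weight(root, w)
print(prefix_sum(root, 9))                     # 9 + 6 + 4 + 4 = 23
print(prefix_sum(root, 25))                    # 25 + 23 + 17 + 4 + 6 + 4 + 9 = 88
```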

The Iteration and its Computational Cost
State: the current reservoir of unbiased estimates and the threshold z. The reservoir is maintained in two BSTs, one for small items (x_i ≤ z) and one for large items (x_i > z), each maintaining partial sums for the number and total weight of its items.
Insert a new weight:
Binary tree search to find t* = max{ t : Σ_{i≤t} x_(i)/x_(t) ≥ t−1 }, computing Σ_{i≤t} x_(i) and t by adding up the appropriate partial sums for weights and counts.
Compute the new threshold z from Σ_{i≤t*} x_(i)/z = t*−1.
Transfer items between the small and large BSTs as needed.
Discard one item at random using the discard probabilities: generate r uniformly on [0,1] and find the smallest d ≤ t* such that Σ_{i≤d} (1 − x_(i)/z) ≥ r, by binary search on the small items (the sum is a function of the partial counts and weight sums).
Basic computational takeaway: computing in BSTs gives O(log m) complexity per arriving item, and actually O(log log m) averaged over m arrivals.

Summary: IPPS Sampling on Data Streams
Motivation: computer networking, storing sampled flow records for later analysis.
Ingredients: probability/statistics, algorithms, data structures.
Statistical properties: a statistically optimal trade-off between sample size and variance.
IPPS stream reservoir sampling: an iterative algorithm that maintains only current unbiased estimates in a fixed-size reservoir of size m.
Implementation: used binary search trees to get O(log m) computational cost per arrival.

Additional references
Binary trees in general: M. Goodrich et al., Data Structures and Algorithms in Python, Ch. 8 (TAMU Library, online). These notes describe insertion and search; BSTs also provide deletion, rebalancing, and other operations.
IPPS stream sampling: E. Cohen et al., "Efficient Stream Sampling for Variance-Optimal Estimation of Subset Sums," SIAM J. Comput. 40(5): 1402-1431 (2011). http://epubs.siam.org/doi/pdf/10.1137/10079817x