Beyond Distinct Counting: LogLog Composable Sketches of frequency statistics


1 Beyond Distinct Counting: LogLog Composable Sketches of frequency statistics Edith Cohen Google Research Tel Aviv University TOCA-SV 5/12/2017

2 Un-aggregated data. A data element $e \in D$ has a key and a value: $(e.\mathrm{key}, e.\mathrm{value})$. Weight (frequency) of a key x (the aggregated view): $w_x = \sum_{e : e.\mathrm{key} = x} e.\mathrm{value}$. W is the density of key frequencies. We will also use the max: $m_x = \max_{e : e.\mathrm{key} = x} e.\mathrm{value}$. f-statistics: $f(W) = \sum_x f(w_x)$.

3 f-statistics: $f(W) = \sum_x f(w_x)$.
Distinct: $f(w) = 1_{\{w > 0\}}$ (number of distinct keys). Sum: $f(w) = w$. Frequency moments: $f(w) = w^p$. Cap: $f(w) = \min(T, w)$. Complement Laplace transform: $f(w) = 1 - e^{-tw}$. Other: $f(x) = \log(1 + x)$, $f(x) = \min(x^{0.5}, T)$.
Applications: search queries, online interactions, online transactions, word/term (co-)occurrences in a corpus, network traffic.
Issue: the aggregated view W is costly (data movement, storage, computation). Use sketches!

4 Sketches. A sketch for f is a lossy summary of the data D from which we can approximate (estimate) $f(D) = f(W)$: Data D, then Sketch(D), then an estimator for $f(W)$.
Sketch structure design goals: optimize sketch size vs. estimate quality. Can we get size $O(\varepsilon^{-2} + \log\log n)$? Composable/mergeable: from Sketch(A) and Sketch(B) of data sets A and B we can obtain Sketch(A ∪ B).

5 Why is composability useful? Distributed data / parallel computation: sketch each part separately, then merge sketches pairwise (Sketch 1 with Sketch 2, Sketch 3 with Sketch 4, and so on) up to a sketch of the full data. Streamed data: maintain a single sketch and fold in each arriving element.

6 Approximate statistics via small sketches. A data element $e \in D$ has a key and a value $(e.\mathrm{key}, e.\mathrm{value})$. The weight of key x is the sum of its element values: $w_x = \sum_{e : e.\mathrm{key} = x} e.\mathrm{value}$. f-statistics: $f(D) = f(W) = \sum_x f(w_x)$. Quality: coefficient of variation $\sigma/\mu = \varepsilon$, concentration?
Distinct, $f(w) = 1_{\{w > 0\}}$: [Flajolet Martin 85, Flajolet et al 07], $O(\varepsilon^{-2} + \log\log n)$.
Sum, $f(w) = w$: [Morris 77], $O(\varepsilon^{-2} + \log\log n)$.
Frequency moments, $f(w) = w^p$: [Alon Matias Szegedy 99, Indyk 01], $O(\varepsilon^{-2} \log n)$.
Capping, $f(w) = \min(T, w)$: [C 15] (via sampling), $O(\varepsilon^{-2} \log n)$.
Others: universal sketches [Braverman Ostrovsky 10], $\mathrm{polynomial}(\varepsilon^{-1}, \log n)$.

7 Sum: sketching $\sum_x w_x$.
With $\log n$ bits: a single register keeping the sum; clearly composable.
With $O(\varepsilon^{-2} + \log\log n)$ bits: [Morris 1977] + [Flajolet 1985]. Maintain only an exponent t, initialized $t \leftarrow 0$. Estimate: return $\frac{(1+\varepsilon)^t - 1}{\varepsilon}$. Add Y: increase t by the maximum amount such that the estimate increases by some $Z \le Y$; let $\Delta = Y - Z$ and increment t once more with probability $\Delta / (1+\varepsilon)^t$ (the remaining amount divided by the estimate increase of one more increment). Merge $t_1, t_2$: same as adding $\frac{(1+\varepsilon)^{t_2} - 1}{\varepsilon}$ to counter $t_1$. Composable version: [C 15]. (A small code sketch follows.)
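Below is a minimal Python sketch of the weighted Morris-style counter described above, assuming base $1+\varepsilon$ and the estimate $((1+\varepsilon)^t - 1)/\varepsilon$; the class and method names are illustrative and not taken from the talk or [C 15].

```python
import random

class MorrisCounter:
    """Approximate sum counter in the spirit of Morris '77, following the
    slide's description (an illustrative sketch, not a reference implementation)."""

    def __init__(self, eps=0.1):
        self.eps = eps
        self.t = 0  # only the exponent t is stored

    def estimate(self):
        # Estimate of the total sum: ((1+eps)^t - 1) / eps.
        return ((1 + self.eps) ** self.t - 1) / self.eps

    def add(self, y):
        # Deterministic part: raise t while the estimate stays <= old estimate + y.
        target = self.estimate() + y
        while ((1 + self.eps) ** (self.t + 1) - 1) / self.eps <= target:
            self.t += 1
        # Randomized part: one more increment with probability
        # (remaining amount) / (estimate increase of one more increment),
        # so the expected increase of the estimate is exactly y.
        delta = target - self.estimate()
        step = (1 + self.eps) ** self.t
        if random.random() < delta / step:
            self.t += 1

    def merge(self, other):
        # Merging counters = adding the other counter's estimate (composability).
        self.add(other.estimate())
```

Merging is just an Add of the other counter's estimate, which is what makes the structure composable.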

8 Distinct count sketches. HyperLogLog [Flajolet et al 2007]: optimal size $O(\varepsilon^{-2} + \log\log n)$ for CV $\sigma/\mu = \varepsilon$ with n distinct keys. Idea: store $k = \varepsilon^{-2}$ exponents of hashes; the exponent values are concentrated, so store one exponent plus k small offsets.
HIP estimators [Cohen 14, Ting 15]: halve the variance, to about $\frac{1}{2k}$. Idea: track an estimated count c alongside the sketch structure; whenever the structure is modified, add the inverse modification probability to c: if structure S is modified, $c \mathrel{+}= 1/\Pr[\text{modified}]$.

9 Distinct count sketches (a simplified version [C 94]). Initialize: $k = \varepsilon^{-2}$ registers $c_1, \dots, c_k \leftarrow \infty$; hash functions $H_i(x) \sim \mathrm{Exp}[1]$. Process element e.key: for $i \le k$: $c_i \leftarrow \min(c_i, H_i(e.\mathrm{key}))$. Estimate: $(k-1)/\sum_i c_i$.
Analysis: each $c_i$ is the minimum over active keys of independent Exp[1] values, so $c_i \sim \mathrm{Exp}[\mathrm{Distinct}(D)]$; estimating the count is a parameter estimation problem. Composability: the minimum is composable.
Reduce size: keep only the exponents of the $c_i$: one shared exponent and $\varepsilon^{-2}$ constant-size offsets. (A code sketch follows.)
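A minimal Python rendering of this simplified sketch, together with a HIP-style estimator in the spirit of slide 8; the class names, the blake2b-derived Exp[1] hash values, and the modification probability $1 - e^{-\sum_i c_i}$ used below are illustrative assumptions rather than the talk's exact construction.

```python
import hashlib, math

class DistinctSketch:
    """Simplified min-of-exponentials distinct-count sketch (per slide 9),
    without the exponent/offset size reduction."""

    def __init__(self, k=256):
        self.k = k
        self.c = [math.inf] * k  # c_i = min over processed keys of H_i(key) ~ Exp[1]

    def _exp_hash(self, key, i):
        # Deterministic pseudo-random Exp[1] value for (key, i).
        h = hashlib.blake2b(f"{i}:{key}".encode(), digest_size=8).digest()
        u = (int.from_bytes(h, "big") + 1) / 2**64  # uniform in (0, 1]
        return -math.log(u)

    def add(self, key):
        for i in range(self.k):
            self.c[i] = min(self.c[i], self._exp_hash(key, i))

    def merge(self, other):
        self.c = [min(a, b) for a, b in zip(self.c, other.c)]

    def estimate(self):
        # c_i ~ Exp[n] for n distinct keys, so (k-1)/sum(c_i) is unbiased for n.
        return (self.k - 1) / sum(self.c)


class HIPDistinctSketch(DistinctSketch):
    """HIP-style estimator layered on top: keep a running count and, whenever
    the structure is modified, add the inverse modification probability."""

    def __init__(self, k=256):
        super().__init__(k)
        self.count = 0.0

    def add(self, key):
        before = list(self.c)
        super().add(key)
        if self.c != before:
            # Probability that a not-yet-seen key modifies the current state:
            # register i changes with prob 1 - e^{-c_i}, independently over i,
            # so p = 1 - exp(-sum_i c_i).  (exp(-inf) == 0.0 covers the empty sketch.)
            p = 1.0 - math.exp(-sum(before))
            self.count += 1.0 / p
```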

10 MaxDistinct sketches. Max value of an element with key x: $m_x = \max_{e : e.\mathrm{key} = x} e.\mathrm{value}$; $\mathrm{MaxDistinct}(D) = \sum_x m_x$.
Initialize: $k = \varepsilon^{-2}$ registers $c_1, \dots, c_k \leftarrow \infty$; hash functions $H_i(x) \sim \mathrm{Exp}[1]$. Process element (e.key, e.value): for $i \le k$: $c_i \leftarrow \min(c_i, H_i(e.\mathrm{key})/e.\mathrm{value})$. Estimate: $(k-1)/\sum_i c_i$.
Analysis: for each key x, the minimum of $H_i(e.\mathrm{key})/e.\mathrm{value} \sim \mathrm{Exp}[e.\mathrm{value}]$ over that key's elements is $H_i(x)/m_x \sim \mathrm{Exp}[m_x]$; hence $c_i$ is the minimum over keys x of independent $\mathrm{Exp}[m_x]$ values, so $c_i \sim \mathrm{Exp}[\mathrm{MaxDistinct}(D)]$. (A code sketch follows.)
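Extending the illustrative DistinctSketch above to MaxDistinct only changes how an element updates the registers, using the fact that $H_i(x)/v \sim \mathrm{Exp}[v]$ when $H_i(x) \sim \mathrm{Exp}[1]$; again a hedged sketch, not the talk's code.

```python
class MaxDistinctSketch(DistinctSketch):
    """MaxDistinct sketch: each element contributes H_i(key)/value, so the
    register minimum is distributed Exp[sum_x m_x]."""

    def add(self, key, value):
        for i in range(self.k):
            # H_i(key)/value ~ Exp[value]; the min over a key's elements is Exp[m_x].
            self.c[i] = min(self.c[i], self._exp_hash(key, i) / value)

    # merge() and estimate() are inherited unchanged:
    # (k-1)/sum(c_i) now estimates MaxDistinct(D) = sum_x m_x.
```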

11 Element processing framework. Each data element is fed through a randomized MAP that emits output elements (0 or more); the output elements go into a composable Distinct / MaxDistinct sketch.
Goal: $E\big[\mathrm{MaxDistinct}\big(\bigcup_{e \in D} \mathrm{MAP}(e)\big)\big] = f(W)$, plus concentration.
Q: For which f can we do this? How (i.e., how to specify MAP)?

12 (Soft) cap functions. Hard cap: $\mathrm{cap}_T(w) = \min(T, w)$; soft cap: $\widetilde{\mathrm{cap}}_T(w) = T(1 - e^{-w/T})$. On the aggregated data: $\mathrm{cap}_T(W) = \sum_x \min(T, w_x)$ and $\widetilde{\mathrm{cap}}_T(W) = T \sum_x (1 - e^{-w_x/T})$. Example (for the frequency density W pictured on the slide, with T = 3): $\mathrm{cap}_3(W) = 10$, $\widetilde{\mathrm{cap}}_3(W) \approx 8.38$.

13 Warm up: sketching soft-cap statistics. We work with $f_t(w) = 1 - e^{-tw}$. The statistic is the complement-Laplace ($\mathrm{Laplace}^c$) transform of W (the density function of frequencies): $L^c[W](t) = \sum_x \big(1 - e^{-t w_x}\big)$. The soft-cap statistic is T times the $\mathrm{Laplace}^c$ transform at the point $t = 1/T$, since $\widetilde{\mathrm{cap}}_T(w) = T f_{1/T}(w)$:
$\widetilde{\mathrm{cap}}_T(W) = \sum_x \widetilde{\mathrm{cap}}_T(w_x) = T \sum_x f_{1/T}(w_x) = T\, L^c[W](1/T)$.

14 Sketching the $\mathrm{Laplace}^c$ transform of W at a point t, $f_t(w) = 1 - e^{-tw}$.
MAP, for input $e = (e.\mathrm{key}, e.\mathrm{value})$: for $i = 1, \dots, r$: draw $y \sim \mathrm{Exp}[e.\mathrm{value}]$; if $y \le t$: output e.key#i.
Output: the (approximate) number of distinct output keys, i.e. a Distinct sketch! The output sketch size is barely affected by r; only element processing is.

15 Claim (correctness): $\frac{1}{r}\, E\big[\mathrm{Distinct}\big(\bigcup_{e \in D} \mathrm{MAP}(e)\big)\big] = \sum_x \big(1 - e^{-t w_x}\big) = L^c[W](t)$.
Proof idea: compute the probability that output key x#i is generated, i.e. that $y \le t$ for at least one element $e \in D$ with $e.\mathrm{key} = x$. The minimum of the independent $\mathrm{Exp}[e.\mathrm{value}]$ draws over these elements is distributed $\mathrm{Exp}[w_x]$, so this probability is $1 - e^{-t w_x}$. Each input key x and $i \le r$ has a unique potential output key x#i; sum over all x, i (Poisson rvs) to establish the claim. (A small simulation follows.)
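A small Python simulation of the element map from slides 14 and 15, with a Monte Carlo check of the claim; the function name and the toy data set are assumptions for illustration, and the distinct count of output keys is computed exactly rather than with a sketch.

```python
import math, random

def laplace_c_map(element, t, r):
    """MAP of slide 14: for each replica i, draw y ~ Exp[e.value] and emit the
    output key (key, i) iff y <= t."""
    key, value = element
    out = set()
    for i in range(r):
        y = random.expovariate(value)  # Exp[e.value]
        if y <= t:
            out.add((key, i))
    return out

# Check (1/r) E[Distinct(union of MAP(e))] = sum_x (1 - exp(-t * w_x)) on a toy input.
data = [("a", 2.0), ("a", 3.0), ("b", 1.0)]  # w_a = 5, w_b = 1
t, r, trials = 0.5, 200, 50
exact = sum(1 - math.exp(-t * w) for w in (5.0, 1.0))
avg = 0.0
for _ in range(trials):
    keys = set()
    for e in data:
        keys |= laplace_c_map(e, t, r)
    avg += len(keys) / r
print(f"estimate {avg / trials:.3f}  vs  exact {exact:.3f}")
```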

16 Subtlety: we need enough (about $\varepsilon^{-2}$) distinct output keys for low error. We set $r = O(\varepsilon^{-2})$, but still need to address small t: $\frac{1}{r} E\big[\mathrm{Distinct}\big(\bigcup_e \mathrm{MAP}(e)\big)\big] = \sum_x (1 - e^{-t w_x}) = L^c[W](t)$.
Example density W of frequencies: 10 keys with $w_x = 1$, 2 keys with $w_x = 5$, 1 key with $w_x = 10$; $\mathrm{Sum}(W) = 30$, $\mathrm{Distinct}(W) = 13$, and $L^c[W](t) = 13 - 10e^{-t} - 2e^{-5t} - e^{-10t}$.
In the regime with a low distinct count we can use $L^c[W](t) \approx t\,\mathrm{Sum}(W)$.

17 f(w) in the nonnegative span of $g_t(w) = 1 - e^{-tw}$: any function f(w) that can be expressed, with $a(t) \ge 0$, as
$f(w) = \int_0^\infty a(t)\,\big(1 - e^{-tw}\big)\,dt = L^c[a](w)$.
We get $a(t) = \frac{1}{t}\, \mathcal{L}^{-1}\big[\tfrac{df}{dw}\big](t)$.
The span includes all concave sublinear f without sharp corners: low frequency moments ($p \le 1$), logarithms, soft capping functions.
  f(w) = $T(1 - e^{-w/T})$:  a(t) = $T\,\delta(t - 1/T)$
  f(w) = $\sqrt{w}$:  a(t) = $\frac{1}{2\sqrt{\pi}}\, t^{-3/2}$
  f(w) = $\log(1 + w)$:  a(t) = $e^{-t}/t$
(A quick numerical check follows.)
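As a sanity check of the table, the pair $f(w) = \log(1+w)$ with $a(t) = e^{-t}/t$ can be verified numerically (a throwaway script assuming SciPy is available):

```python
import math
from scipy.integrate import quad

def integrand(t, w):
    # a(t) * (1 - e^{-wt}) with a(t) = e^{-t}/t; the t -> 0 limit equals w.
    if t == 0.0:
        return w
    return math.exp(-t) * (1 - math.exp(-w * t)) / t

for w in (0.5, 3.0, 20.0):
    val, _err = quad(integrand, 0, math.inf, args=(w,))
    print(f"w={w:5.1f}  integral={val:.6f}  log(1+w)={math.log(1 + w):.6f}")
```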

18 Sketching statistics for f in the nonnegative span. With $f(w) = \int_0^\infty a(t)(1 - e^{-tw})\,dt$, $a(t) \ge 0$:
$f(W) = \int_0^\infty f(w)\, W(w)\,dw = \int_0^\infty W(w) \int_0^\infty a(t)\big(1 - e^{-tw}\big)\,dt\,dw = \int_0^\infty a(t) \int_0^\infty W(w)\big(1 - e^{-tw}\big)\,dw\,dt = \int_0^\infty a(t)\, L^c[W](t)\,dt$.
We can sketch $L^c[W](t)$, so the statistic f(W) is expressed in terms of the $L^c$ transform at many points t. But we will see a better way.

19 Sketching $f(W) = \int_0^\infty a(t)\, L^c[W](t)\,dt$. Idea: modify the element map for a single point t to work with the weighted combination of t values (slightly simplified).
Point t (feeds a Distinct count sketch): input $e = (e.\mathrm{key}, e.\mathrm{value})$; for $i = 1, \dots, r$: $y \sim \mathrm{Exp}[e.\mathrm{value}]$; if $y \le t$: output e.key#i.
Combination a(t) (feeds a MaxDistinct sketch): input $e = (e.\mathrm{key}, e.\mathrm{value})$; for $i = 1, \dots, r$: $y \sim \mathrm{Exp}[e.\mathrm{value}]$; output $\big(e.\mathrm{key}\#i,\ \int_y^\infty a(t)\,dt\big)$. (A toy end-to-end example follows.)
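Putting the pieces together for $f(w) = \log(1+w)$, where $\int_y^\infty a(t)\,dt = \int_y^\infty e^{-t}/t\,dt = E_1(y)$: the toy below reuses the illustrative MaxDistinctSketch from earlier and SciPy's exp1; all names and parameters are assumptions for illustration, not the talk's pipeline.

```python
import math, random
from scipy.special import exp1  # E1(y) = integral_y^inf e^{-t}/t dt

def span_map(element, r):
    """Combined element map of slide 19 for a(t) = e^{-t}/t:
    each output element carries the value E1(y) = integral_y^inf a(t) dt."""
    key, value = element
    for i in range(r):
        y = random.expovariate(value)         # Exp[e.value]
        yield (f"{key}#{i}", float(exp1(y)))  # output element (key#i, E1(y))

# Estimate f(W) = sum_x log(1 + w_x) from un-aggregated elements.
data = [("a", 2.0), ("a", 3.0), ("b", 1.0)]   # w_a = 5, w_b = 1
r = 500
sk = MaxDistinctSketch(k=256)
for e in data:
    for out_key, out_val in span_map(e, r):
        sk.add(out_key, out_val)
print("estimate:", sk.estimate() / r, "  exact:", math.log(6) + math.log(2))
```

Per output key, the MaxDistinct sketch effectively keeps the maximum emitted value, whose expectation is $f(w_x)$, so dividing its estimate by r recovers $f(W)$.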

20 Extension: multi-objective sketch of the span. Idea: if we can sketch all t values together, we can use one sketch for every statistic in the span, at a logarithmic-factor increase in sketch size (analysis of all-distance sketches [C 94, 15]).
Point t (Distinct count sketch): input $e = (e.\mathrm{key}, e.\mathrm{value})$; for $i = 1, \dots, r$: $y \sim \mathrm{Exp}[e.\mathrm{value}]$; if $y \le t$: output e.key#i.
All t (All-threshold sketch): input $e = (e.\mathrm{key}, e.\mathrm{value})$; for $i = 1, \dots, r$: $y \sim \mathrm{Exp}[e.\mathrm{value}]$; output (e.key#i, y).

21 All-threshold Distinct sketches. Input elements (e.key, e.value); the sketch lets us approximate, for any t, $\mathrm{Distinct}\{e.\mathrm{key} : e.\mathrm{value} \le t\}$.
Distinct counting sketch (as before): initialize $k = \varepsilon^{-2}$ registers $c_1, \dots, c_k \leftarrow \infty$; hash functions $H_i(x) \sim \mathrm{Exp}[1]$; process element e.key: for $i \le k$: $c_i \leftarrow \min(c_i, H_i(e.\mathrm{key}))$; estimate: $(k-1)/\sum_i c_i$.
All-threshold Distinct sketch: remember, for each register $c_i$, all breakpoints (values of t) at which its minimum changes. There are logarithmically many breakpoints [C 94]. Any sample-based distinct sketch can be similarly extended [C 15]. (A simplified code sketch follows.)
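A simplified Python sketch of the all-threshold structure: each register stores the breakpoints of its minimum as a function of the threshold t (the minimum over keys with value <= t is non-increasing in t), so a query at any t reuses the $(k-1)/\sum_i c_i$ estimator. The breakpoint bookkeeping below is an illustrative assumption and does not reproduce the logarithmic-size analysis cited above.

```python
import bisect, hashlib, math

class AllThresholdDistinctSketch:
    """For every threshold t, approximates Distinct{e.key : e.value <= t}."""

    def __init__(self, k=256):
        self.k = k
        # Per register: sorted breakpoints (t, min of H_i over keys with value <= t).
        self.bp = [[] for _ in range(k)]

    def _exp_hash(self, key, i):
        h = hashlib.blake2b(f"{i}:{key}".encode(), digest_size=8).digest()
        u = (int.from_bytes(h, "big") + 1) / 2**64
        return -math.log(u)

    def _min_at(self, bps, t):
        j = bisect.bisect_right(bps, (t, math.inf))
        return bps[j - 1][1] if j else math.inf

    def add(self, key, value):
        for i in range(self.k):
            h = self._exp_hash(key, i)
            bps = self.bp[i]
            if h < self._min_at(bps, value):
                # The element lowers the minimum for all thresholds >= value:
                # keep earlier breakpoints and later ones already below h,
                # and record the new breakpoint (value, h).
                bps = [b for b in bps if b[0] < value or b[1] < h] + [(value, h)]
                bps.sort()
                self.bp[i] = bps

    def query(self, t):
        """Estimate Distinct{e.key : e.value <= t}."""
        total = sum(self._min_at(bps, t) for bps in self.bp)
        return (self.k - 1) / total
```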

22 Handling sharp corners: all concave sublinear f. Reduces to sketching $\mathrm{cap}_1(w) = \min(1, w)$. The soft cap gives a $1 - 1/e$ approximation. Better approximation: use a signed inverse transform to approximate $\mathrm{cap}_1(w)$ while controlling $\|a\|_1 = \int_0^\infty |a(t)|\,dt$; estimate the negative and positive components separately. A grid search on 3 points gives error within 12%. Open question: achieving error $\varepsilon$ for arbitrarily small $\varepsilon > 0$.

23 Conclusion. Summary: a simple, practical design of composable, double-logarithmic-size sketches for concave sublinear statistics; the results are theoretically novel even at $O(\varepsilon^{-2} \log n)$ size.
Open: handling statistics with sharp corners; loglog-size sketches in the super-linear regime (second moment?).

24 Thank you!
