Beyond Distinct Counting: LogLog Composable Sketches of frequency statistics
1 Beyond Distinct Counting: LogLog Composable Sketches of Frequency Statistics
Edith Cohen, Google Research and Tel Aviv University. TOCA-SV, 5/12/2017
2 Un-aggregated data
Data element e in D has a key and a value: (e.key, e.value).
Weight (frequency) of a key x: w_x = sum over elements e with e.key = x of e.value (the aggregated view; W is the density of frequencies of keys).
Will also use the max: m_x = max over elements e with e.key = x of e.value.
f-statistics: f(W) = sum over keys x of f(w_x).
3 f-statistics: f(W) = sum_x f(w_x)
- Distinct: f(w) = 1[w > 0] (number of distinct keys)
- Sum: f(w) = w
- Frequency moments: f(w) = w^p
- Cap: f(w) = min(t, w)
- Complement Laplace transform: f(w) = 1 - e^{-tw}
- Other: f(x) = log(1 + x), f(x) = min(x^{0.5}, T)
Applications: search queries, online interactions, online transactions, word/term (co-)occurrences in a corpus, network traffic.
Issue: the aggregated view W is costly to compute: data movement, storage, computation. Use sketches!
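As a concrete illustration of these definitions (toy data, not from the talk), the f-statistics above can be computed directly from the aggregated view:

```python
from collections import defaultdict

# Un-aggregated data elements: (key, value) pairs; keys may repeat.
elements = [("q1", 2.0), ("q2", 1.0), ("q1", 3.0), ("q3", 4.0)]

# Aggregated view: w_x = sum of values of elements with key x.
w = defaultdict(float)
for key, value in elements:
    w[key] += value          # w = {"q1": 5.0, "q2": 1.0, "q3": 4.0}

def f_statistic(f):
    """f-statistic: f(W) = sum over keys x of f(w_x)."""
    return sum(f(wx) for wx in w.values())

distinct = f_statistic(lambda x: 1.0)          # number of distinct keys
total = f_statistic(lambda x: x)               # sum of all values
moment_half = f_statistic(lambda x: x ** 0.5)  # frequency moment, p = 0.5
cap2 = f_statistic(lambda x: min(2.0, x))      # cap with t = 2
```

The point of the talk is to approximate such statistics without materializing the aggregated view w at all.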
4 Sketches
A sketch of the data D is a lossy summary from which we can approximate (estimate) f(D) = f(W): Data D -> Sketch(D) -> estimator. Query: f(W)?
Sketch structure design goals:
- Optimize sketch size vs. estimate quality. Can we get size O(eps^{-2} + log log n)?
- Composable/mergeable: from Sketch(A) and Sketch(B) we can obtain Sketch(A u B).
5 Why is composability useful?
- Distributed data / parallelized computation: sketch each part separately (Sketch 1, ..., Sketch 5), then merge pairwise into a sketch of the union.
- Streamed data: keep a single sketch and fold each arriving element into it.
6 Approximate statistics via small sketches
Data element e in D has key and value (e.key, e.value). The weight of key x is the sum of its element values: w_x = sum_{e: e.key = x} e.value. f-statistics: f(D) = f(W) = sum_x f(w_x).
Quality: coefficient of variation (CV) eps; concentration?
- Distinct, f(w) = 1[w > 0]: [Flajolet Martin 85, Flajolet et al 07] O(eps^{-2} + log log n)
- Sum, f(w) = w: [Morris 77] O(eps^{-2} + log log n)
- Frequency moments, f(w) = w^p: [Alon Matias Szegedy 99, Indyk 01] O(eps^{-2} log n)
- Capping, f(w) = min(t, w): [C 15] (via sampling) O(eps^{-2} log n)
- Others: universal sketches [Braverman Ostrovsky 10], polynomial(eps^{-1}, log n)
7 Sum: sum_x f(w_x) with f(w) = w
- O(log n): a single register keeping the sum. Clearly composable.
- O(eps^{-2} + log log n): [Morris 1977] + [Flajolet 1985]. Maintain only an exponent t, initialized t <- 0.
  Estimate: return (1 + eps)^t - 1 (suitably scaled).
  Add y: increase t deterministically by the maximum amount such that the estimate increases by at most y. Let Delta be the remaining amount; increment t once more with probability Delta / ((1+eps)^{t+1} - (1+eps)^t).
  Merge t_1, t_2: same as adding (1 + eps)^{t_2} - 1 to the counter t_1.
Composable version: [C 15].
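A minimal sketch of such a Morris-style counter, using the common scaling ((1+a)^t - 1)/a (class and parameter names are illustrative, not from the talk):

```python
import random

class MorrisCounter:
    """Approximate sum counter that stores only an exponent t (O(log log n) bits).
    The estimate is ((1+a)^t - 1) / a; each add is unbiased in expectation."""

    def __init__(self, a=0.5, rng=None):
        self.a = a
        self.t = 0
        self.rng = rng if rng is not None else random.Random(0)

    def estimate(self):
        return ((1 + self.a) ** self.t - 1) / self.a

    def add(self, y=1.0):
        # Deterministic part: raise t while a full step increases the estimate by <= y.
        step = (1 + self.a) ** self.t   # estimate increase when t -> t+1
        while step <= y:
            self.t += 1
            y -= step
            step = (1 + self.a) ** self.t
        # Probabilistic part: consume the remainder Delta = y with probability Delta/step,
        # which keeps the expected estimate increase exactly equal to the amount added.
        if y > 0 and self.rng.random() < y / step:
            self.t += 1

    def merge(self, other):
        # Merging two counters = adding the other counter's estimate to this one.
        self.add(other.estimate())
```

For a = 0.5, add(100.0) on a fresh counter deterministically reaches t = 9 and then bumps to t = 10 with the residual probability, so the estimate is either about 74.9 or about 113.3, with mean 100.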
8 Distinct count sketches
HyperLogLog [Flajolet et al 2007]: optimal size O(eps^{-2} + log log n) for CV eps with n distinct keys.
Idea: store k = eps^{-2} exponents of hashes. The exponent values are concentrated, so store one exponent and k small offsets.
HIP estimators [Cohen 14, Ting 15]: halve the variance.
Idea: track an estimated count c alongside the sketch structure. Whenever the structure S is modified, add the inverse modification probability to c: c += 1/p(modified). The sketch then reports c as the approximate count.
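A sketch of the HIP idea, applied to the Exp-hash min-register structure of the next slide rather than to HyperLogLog (a hypothetical illustration; structure and constants are mine): each time an element actually modifies a register, add to c the inverse of the probability that a fresh key would have modified the current sketch.

```python
import math
import random

class HipDistinct:
    """k min-registers with Exp(1) hashes, plus a HIP counter c: on each
    sketch modification, c grows by the inverse modification probability."""

    def __init__(self, k=128):
        self.k = k
        self.regs = [float("inf")] * k
        self.c = 0.0                      # HIP estimate of the distinct count

    def _hash(self, i, key):
        # Deterministic Exp(1) "hash" of (register, key), simulated by seeding a PRNG.
        return random.Random(f"{i}:{key}").expovariate(1.0)

    def add(self, key):
        # P[a fresh key modifies no register] = prod_i P[Exp(1) >= c_i] = prod_i e^{-c_i}.
        p_mod = 1.0 - math.prod(math.exp(-r) for r in self.regs)
        modified = False
        for i in range(self.k):
            h = self._hash(i, key)
            if h < self.regs[i]:
                self.regs[i] = h
                modified = True
        if modified:
            self.c += 1.0 / p_mod
```

A previously seen key hashes to the same values, so it never modifies the sketch and never inflates c; a new key modifies it with probability exactly p_mod, making each increment unbiased.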
9 Distinct count sketches (simplified version [C 94])
Initialize: k = eps^{-2} registers c_1, ..., c_k <- infinity; hash functions H_i(x) ~ Exp[1].
Process element e: for i <= k: c_i <- min(c_i, H_i(e.key)).
Estimate: (k - 1) / sum_i c_i.
Analysis: c_i is the minimum, over the active keys, of independent Exp[1] draws, so c_i ~ Exp[Distinct(D)] — a parameter estimation problem.
Composability: the minimum is composable.
Reduce size: keep exponents only of the c_i — one exponent and eps^{-2} constant-size offsets.
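The simplified sketch can be coded directly; here the Exp(1) hash of a key is simulated by seeding a PRNG per (register, key) pair, and the sizes are illustrative:

```python
import random

class MinExpDistinct:
    """k registers; register i keeps the minimum of an Exp(1) hash over keys seen."""

    def __init__(self, k=200):
        self.k = k
        self.regs = [float("inf")] * k

    def _hash(self, i, key):
        # Deterministic Exp(1) "hash": same key always draws the same value.
        return random.Random(f"{i}:{key}").expovariate(1.0)

    def add(self, key):
        for i in range(self.k):
            h = self._hash(i, key)
            if h < self.regs[i]:
                self.regs[i] = h

    def merge(self, other):
        # Minimum is composable: register-wise min gives the sketch of the union.
        out = MinExpDistinct(self.k)
        out.regs = [min(a, b) for a, b in zip(self.regs, other.regs)]
        return out

    def estimate(self):
        # Each register ~ Exp(n); the sum is Gamma(k, n), so (k-1)/sum is unbiased for n.
        return (self.k - 1) / sum(self.regs)
```

Duplicates hash identically and never change a register, so the sketch depends only on the set of distinct keys, and merging the sketches of two parts yields exactly the sketch of their union.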
10 MaxDistinct sketches
Max value of an element with key x: m_x = max over elements e with e.key = x of e.value. MaxDistinct(D) = sum_x m_x.
Initialize: k = eps^{-2} registers c_1, ..., c_k <- infinity; hash functions H_i(x) ~ Exp[1].
Process element (e.key, e.value): for i <= k: c_i <- min(c_i, H_i(e.key) / e.value).
Estimate: (k - 1) / sum_i c_i.
Analysis: for each key x, the minimum over its elements of H_i(x)/e.value equals H_i(x)/m_x ~ Exp[m_x]. So c_i is the minimum over keys x of independent Exp[m_x], giving c_i ~ Exp[MaxDistinct(D)].
11 Element processing framework
Data element -> randomized MAP -> output elements (0 or more) -> composable Distinct / MaxDistinct sketch of the output elements.
Goal: E[MaxDistinct(union over e in D of MAP(e))] = f(W), plus concentration.
Q: For which f can we do this? How (i.e., how do we specify MAP)?
12 (Soft) cap functions
Cap: cap_T(w) = min(T, w). Soft cap: scap_T(w) = T(1 - e^{-w/T}).
Example (aggregated D): cap_3(W) = sum_x min(3, w_x) = 10, while scap_3(W) = sum_x 3(1 - e^{-w_x/3}) = 8.38.
13 Warm up: sketching cap-statistics
We work with f_t(w) = 1 - e^{-tw}. The statistic is the complement Laplace (Laplace^c) transform of W (the density function of frequencies): L^c[W](t) = sum_x (1 - e^{-t w_x}).
The soft cap_T-statistic is T times the Laplace^c transform at the point t = 1/T, since scap_T(w) = T f_{1/T}(w):
scap_T(W) = sum_x scap_T(w_x) = T sum_x f_{1/T}(w_x) = T * L^c[W](1/T).
14 Sketching the Laplace^c transform of W at a point t, f_t(w) = 1 - e^{-tw}
Input: e = (e.key, e.value). For i = 1, ..., r: draw y ~ Exp[e.value]; if y <= t: output e.key#i.
Output: the (approximate) number of distinct output keys, divided by r — a distinct sketch!
The output sketch size is barely affected by r; only element processing cost is.
15 Claim (correctness): (1/r) E[Distinct(union over e in D of MAP(e))] = sum_x (1 - e^{-t w_x}) = L^c[W](t)
(MAP: for i = 1, ..., r: draw y ~ Exp[e.value]; if y <= t: output e.key#i.)
We compute the probability that the output key x#i is generated: y <= t for at least one element e in D with e.key = x. The minimum over these elements of independent Exp[e.value] draws is distributed Exp[w_x], so the probability is P[Exp[w_x] <= t] = 1 - e^{-t w_x}.
Each input key x and i <= r has a unique potential output key x#i; sum over all x, i (Poisson random variables) to establish the claim.
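The claim can be checked by Monte Carlo simulation, using exact distinct counting of the MAP output in place of a sketch (the element data and parameters here are illustrative):

```python
import math
import random

def map_and_count(elements, t, r, rng):
    """Apply the slide's MAP to every element; return Distinct(output)/r."""
    out = set()
    for key, value in elements:
        for i in range(r):
            y = rng.expovariate(value)   # y ~ Exp[e.value]
            if y <= t:
                out.add((key, i))        # the output key "key#i"
    return len(out) / r

# Two elements share key "b", so w_a = 1, w_b = 2 + 3 = 5, w_c = 10.
elements = [("a", 1.0), ("b", 2.0), ("b", 3.0), ("c", 10.0)]
t = 0.5
truth = sum(1 - math.exp(-t * w) for w in (1.0, 5.0, 10.0))  # L^c[W](t)
```

Note that key "b" appears in two elements: the output key (b, i) is generated when the minimum of Exp(2) and Exp(3) falls below t, and that minimum is distributed Exp(5) — exactly the aggregation the claim relies on.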
16 Subtlety: small t
We need about eps^{-2} distinct output keys for low relative error. We set r = O(eps^{-2}), but still need to address small t, where E[Distinct(union MAP(e))] / r = sum_x (1 - e^{-t w_x}) = L^c[W](t) is small.
Example density W: 10 keys with w_x = 1, 2 keys with w_x = 5, 1 key with w_x = 10. Then Sum(W) = 30, Distinct(W) = 13, and L^c[W](t) = 13 - 10 e^{-t} - 2 e^{-5t} - e^{-10t}.
In the regime with a low expected distinct count we can use L^c[W](t) ~= t * Sum(W).
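The example's closed form and the small-t approximation check out numerically:

```python
import math

# The slide's example density: 10 keys with w = 1, 2 keys with w = 5, 1 key with w = 10.
weights = [1.0] * 10 + [5.0] * 2 + [10.0]

def laplace_c(t):
    """L^c[W](t) = sum over keys of (1 - e^{-t w_x})."""
    return sum(1 - math.exp(-t * w) for w in weights)

def closed_form(t):
    # From the slide: L^c[W](t) = 13 - 10 e^{-t} - 2 e^{-5t} - e^{-10t}.
    return 13 - 10 * math.exp(-t) - 2 * math.exp(-5 * t) - math.exp(-10 * t)
```

At small t the transform behaves like t * Sum(W) = 30t, and as t grows it approaches Distinct(W) = 13.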
17 f(w) in the nonnegative span of g_t(w) = 1 - e^{-tw}
Any function f(w) that can be expressed with some a(t) >= 0 as f(w) = integral from 0 to infinity of a(t) (1 - e^{-tw}) dt.
We get a(t) = (1/t) L^{-1}[f'](t), where L is the Laplace transform; equivalently f'(w) = L[t a(t)](w).
The span includes all concave sublinear f without sharp corners: low frequency moments (p < 1), logarithms, soft capping functions.
f(w) and its transform a(t):
- f(w) = T(1 - e^{-w/T})  <->  a(t) = T delta(t - 1/T)
- f(w) = sqrt(w)          <->  a(t) = (1 / (2 sqrt(pi))) t^{-1.5}
- f(w) = log(1 + w)       <->  a(t) = e^{-t} / t
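The transform pairs in the table can be verified by numerical integration on a logarithmic grid (a rough quadrature sketch; the grid bounds and resolution are ad hoc choices of mine):

```python
import math

def f_from_a(a, w, lo=1e-9, hi=1e9, n=100000):
    """Evaluate integral_0^infinity a(t) (1 - e^{-t w}) dt
    by the trapezoid rule in u = log t."""
    ulo, uhi = math.log(lo), math.log(hi)
    du = (uhi - ulo) / n
    total = 0.0
    for j in range(n + 1):
        t = math.exp(ulo + j * du)
        g = a(t) * (1 - math.exp(-t * w)) * t   # extra factor t: Jacobian dt = t du
        total += g * (0.5 if j in (0, n) else 1.0)
    return total * du

# log(1 + w) pair: a(t) = e^{-t}/t, evaluated at w = 3 (expect log 4).
log_pair = f_from_a(lambda t: math.exp(-t) / t, w=3.0)
# sqrt(w) pair: a(t) = t^{-1.5} / (2 sqrt(pi)), evaluated at w = 2 (expect sqrt 2).
sqrt_pair = f_from_a(lambda t: t ** -1.5 / (2 * math.sqrt(math.pi)), w=2.0)
```

The log grid matters: both a(t) kernels concentrate mass near t = 0 and have heavy or exponential tails, which a uniform grid over [0, infinity) would miss.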
18 Sketching statistics for f in the nonnegative span
f(w) = integral_0^infinity a(t) (1 - e^{-tw}) dt with a(t) >= 0. Then
f(W) = integral_0^infinity f(w) W(w) dw
     = integral_0^infinity W(w) [ integral_0^infinity a(t) (1 - e^{-tw}) dt ] dw
     = integral_0^infinity a(t) [ integral_0^infinity W(w) (1 - e^{-tw}) dw ] dt
     = integral_0^infinity a(t) L^c[W](t) dt.
We can sketch L^c[W](t) at many points t, so the statistic f(W) can be expressed in terms of the L^c transform at many points. But we will see a better way.
19 Sketching f(W) = integral_0^infinity a(t) L^c[W](t) dt
Idea: modify the element map for a single point t to work with the weighted combination a(t) of all t values (slightly simplified):
- Point t (into a Distinct count sketch): for i = 1, ..., r: draw y ~ Exp[e.value]; if y <= t: output e.key#i.
- Combination a(t) (into a MaxDistinct sketch): for i = 1, ..., r: draw y ~ Exp[e.value]; output (e.key#i, integral from y to infinity of a(t) dt).
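For the soft cap the weighting is especially simple: with a(t) = T delta(t - 1/T), the output weight A(y) = integral from y to infinity of a(t) dt equals T when y <= 1/T and 0 otherwise. A simulation of the weighted map, using exact aggregation in place of a MaxDistinct sketch (toy weights, illustrative only):

```python
import math
import random

def softcap_via_weighted_map(weights, T, r, rng):
    """Estimate scap_T(W) = sum_x T (1 - e^{-w_x / T}) via the weighted element map.
    For a(t) = T * delta(t - 1/T): A(y) = T if y <= 1/T else 0."""
    total = 0.0
    for w in weights.values():
        for _ in range(r):
            y = rng.expovariate(w)    # min over the key's elements ~ Exp(w_x)
            if y <= 1.0 / T:
                total += T            # A(y): the weight of output key x#i
    return total / r

weights = {"a": 1.0, "b": 5.0, "c": 10.0}
truth = sum(3.0 * (1 - math.exp(-w / 3.0)) for w in weights.values())
```

Since E[A(Y)] with Y ~ Exp(w_x) equals integral of a(t) P[Y <= t] dt = integral of a(t)(1 - e^{-t w_x}) dt = f(w_x), averaging the weights over the r copies is unbiased for f(W).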
20 Extension: multi-objective sketch of the span
Idea: if we can sketch all t values together, we can use one sketch for all statistics in the span, at a logarithmic factor increase in sketch size (by the analysis of all-distances sketches [C 94, 15]).
- Point t (into a Distinct count sketch): for i = 1, ..., r: draw y ~ Exp[e.value]; if y <= t: output e.key#i.
- All t (into an all-threshold sketch): for i = 1, ..., r: draw y ~ Exp[e.value]; output (e.key#i, y).
21 All-threshold Distinct sketches
Input: elements (e.key, e.value). The sketch allows us to approximate, for any t: Distinct{e.key : e.value >= t}.
Distinct counting sketch: initialize k = eps^{-2} registers c_1, ..., c_k <- infinity; hash functions H_i(x) ~ Exp[1]; process element e: for i <= k: c_i <- min(c_i, H_i(e.key)); estimate (k - 1) / sum_i c_i.
All-threshold Distinct sketch: remember, for each register c_i, all breakpoints where the minimum changes as a function of the threshold t. There is a logarithmic number of breakpoints [C 94]. Any sample-based distinct sketch can be similarly extended [C 15].
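One way to keep a register's breakpoints is a pruned staircase of (value, hash) pairs: a pair is kept only if no stored pair has both a value at least as large and a hash at least as small (a sketch of the idea; the class and method names are my own):

```python
import random

class AllThresholdMinRegister:
    """Tracks min over {H(key) : value >= t} for every threshold t at once,
    keeping only the undominated (value, hash) breakpoints."""

    def __init__(self):
        self.points = []   # pruned list of (value, hash) pairs

    def add(self, value, h):
        # Dominated: a stored point covers at least this value with a smaller hash.
        if any(v >= value and hh <= h for v, hh in self.points):
            return
        # Remove points the new pair dominates, then keep it.
        self.points = [(v, hh) for v, hh in self.points
                       if not (value >= v and h <= hh)]
        self.points.append((value, h))

    def min_for_threshold(self, t):
        cands = [hh for v, hh in self.points if v >= t]
        return min(cands) if cands else float("inf")
```

For independent random (value, hash) pairs the expected number of stored breakpoints is harmonic, i.e. logarithmic in the number of insertions, matching the slide's bound.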
22 Handling sharp corners: all concave sublinear f
Reduces to sketching cap_1(w) = min(1, w). The soft cap gives a (1 - 1/e)-approximation.
Better approximation: use a signed inverse transform to approximate cap_1(w), controlling the L_1 norm of the transform, integral of |a(t)| dt. Estimate the negative and positive components separately. A grid search on 3 points brings the approximation error down to 12%.
Open question: obtain a (1 +- eps)-approximation.
23 Conclusion
Summary: a simple, practical design of composable, double-logarithmic-size sketches for concave sublinear statistics. The results are novel theoretically even with O(eps^{-2} log n) size.
Open: handling statistics with sharp corners; loglog sketches in the super-linear regime (second moment?).
24 Thank you!
More informationFormulas for probability theory and linear models SF2941
Formulas for probability theory and linear models SF2941 These pages + Appendix 2 of Gut) are permitted as assistance at the exam. 11 maj 2008 Selected formulae of probability Bivariate probability Transforms
More informationChapter 8. Some Approximations to Probability Distributions: Limit Theorems
Chapter 8. Some Approximations to Probability Distributions: Limit Theorems Sections 8.2 -- 8.3: Convergence in Probability and in Distribution Jiaping Wang Department of Mathematical Science 04/22/2013,
More informationA Near-Optimal Algorithm for Computing the Entropy of a Stream
A Near-Optimal Algorithm for Computing the Entropy of a Stream Amit Chakrabarti ac@cs.dartmouth.edu Graham Cormode graham@research.att.com Andrew McGregor andrewm@seas.upenn.edu Abstract We describe a
More informationTabulation-Based Hashing
Mihai Pǎtraşcu Mikkel Thorup April 23, 2010 Applications of Hashing Hash tables: chaining x a t v f s r Applications of Hashing Hash tables: chaining linear probing m c a f t x y t Applications of Hashing
More informationFinding Frequent Items in Data Streams
Finding Frequent Items in Data Streams Moses Charikar 1, Kevin Chen 2, and Martin Farach-Colton 3 1 Princeton University moses@cs.princeton.edu 2 UC Berkeley kevinc@cs.berkeley.edu 3 Rutgers University
More informationSuper-Gaussian directions of random vectors
Weizmann Institute & Tel Aviv University IMU-INdAM Conference in Analysis, Tel Aviv, June 2017 Gaussian approximation Many distributions in R n, for large n, have approximately Gaussian marginals. Classical
More informationPractice Exam 1. (A) (B) (C) (D) (E) You are given the following data on loss sizes:
Practice Exam 1 1. Losses for an insurance coverage have the following cumulative distribution function: F(0) = 0 F(1,000) = 0.2 F(5,000) = 0.4 F(10,000) = 0.9 F(100,000) = 1 with linear interpolation
More informationThe Count-Min-Sketch and its Applications
The Count-Min-Sketch and its Applications Jannik Sundermeier Abstract In this thesis, we want to reveal how to get rid of a huge amount of data which is at least dicult or even impossible to store in local
More information