U.C. Berkeley CS170: Algorithms, Handout LN-11-22
Christos Papadimitriou & Luca Trevisan, November 22, 2016

Streaming algorithms

In this lecture and the next one we study memory-efficient algorithms that process a stream (i.e. a sequence) of data items and are able, in real time, to compute useful features of the data. Streaming algorithms are useful in many problem domains, including graph algorithms, but we will restrict ourselves to the following important family of problems: we get a sequence of data items (for example sales records, or web hits), and each data item has a data field of interest, which we will call a label (for example, the product id of the item that was sold, or the IP address that we get a hit from), and we want to compute various statistics about the sequence of labels.

Abstracting away all the data except the labels, we can think of our input as being a stream x_1, x_2, ..., x_n where each x_i ∈ Σ is a label, and Σ is the set of all possible labels (e.g. all product identifiers, or all IP addresses). For a given stream x_1, ..., x_n, and for a label a ∈ Σ, we will call f_a the frequency of a in the stream, that is, the number of times a appears in the stream. We will be interested in the following three problems:

- Finding heavy hitters, that is, labels for which f_a is large (e.g. find best-selling products, or IP addresses that visit our site most often)

- Finding the number of distinct labels in the stream (e.g. finding how many different products we have sold, or how many unique visitors we have)

- Finding the sum-of-squares quantity \sum_{a ∈ Σ} f_a^2, which is usually denoted F_2, and also called the second frequency moment of the stream

It should be self-evident that the first two problems are important. To understand the quantity F_2, consider two extreme cases that we can have in a stream of n items. If all the n labels in the stream are different, then we have \sum_{a ∈ Σ} f_a^2 = n, because we have n frequencies that are 1 and all the others are zero. If all the n labels are identical, then \sum_{a ∈ Σ} f_a^2 = n^2.

In all other cases, the value of F_2 for a stream of n items will always be between n and n^2; smaller values correspond to streams with several distinct labels, each occurring not too often, and larger values correspond to streams with few distinct labels occurring often. So F_2 is a one-parameter measure of how diversified the stream is.

All three problems have simple solutions in which we store the entire stream. We can store a list of all the distinct labels we have seen, and for each of them also store the number of occurrences. That is, the data structure would contain a pair (a, f_a) for every label a that appears at least once in the stream. Such a data structure can be maintained using O(k · (log n + log |Σ|)) ≤ O(n · (log n + log |Σ|)) bits of memory, where k is the number of distinct labels in the stream. Alternatively, one could have an array of size |Σ| storing f_a for every label a, using O(|Σ| · log n) bits of memory. If the labels are IPv6 addresses, then |Σ| = 2^128, so the second solution is definitely not feasible; in a setup in which a stream might contain hundreds of millions of items or more, even a data structure of the first type is rather large.

We will give a solution to the heavy hitters problem that uses O((log n)^2) bits of memory (in realistic implementations, the data structure requires only an array of size about 300-600, containing 32-bit integers) and solutions to the other problems that use O(log n) bits of memory (in realistic settings, the data structures are arrays of size ranging from a few dozen to a few thousand 32-bit integers). These solutions will be randomized and will only provide approximate answers. Both features are necessary: it is possible to prove that randomized exact algorithms and deterministic approximate algorithms must use an amount of memory of the same order as the trivial solution of storing all the values f_a. We will not prove these impossibility results, but next week we will prove that any deterministic and exact algorithm for each of the above three problems must use Ω(min{|Σ|, n}) bits of memory.

Today we see algorithms for the heavy hitters and distinct elements problems. Next week we will see the sum-of-squares algorithm.

1  Probability Review

Since we are going to do a probabilistic analysis of randomized algorithms, let us review the handful of notions from discrete probability that we are going to use.

Suppose that A and B are two events, that is, things that may or may not happen. Then we have the union bound

    Pr[A ∪ B] ≤ Pr[A] + Pr[B]

If A and B are independent, then we also have

    Pr[A ∩ B] = Pr[A] · Pr[B]
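As a quick worked illustration of these two facts (an added example, not in the original handout): roll two independent dice, let A be the event that the first die shows a 6, and let B be the event that the second die shows a 6. Then

    Pr[A ∪ B] ≤ Pr[A] + Pr[B] = 1/6 + 1/6 = 1/3    (the exact value is 11/36)

    Pr[A ∩ B] = Pr[A] · Pr[B] = 1/6 · 1/6 = 1/36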

Informally, a random variable is an outcome of a probabilistic experiment, which has various possible values with various probabilities. (Formally, it is a function from the sample space to the reals, though the formal definition does not help intuition very much.) The expectation of a random variable X is

    E X = \sum_v v · Pr[X = v]

where v ranges over all possible values that the random variable can take. We are going to make use of the following three useful facts about random variables:

- Linearity of expectation: if X and Y are two random variables, then E[X + Y] = E X + E Y, and if v is a number, then E[vX] = v · E X.

- Product formula for independent random variables: if X and Y are independent, then E[XY] = (E X) · (E Y).

- Markov's inequality: if X is a random variable that is always ≥ 0, then, for every t > 0, we have Pr[X ≥ t] ≤ (E X) / t.

Linearity of expectation and the product rule greatly simplify the computation of the expectation of random variables that come up in applications (which are often sums or products of simpler random variables). Markov's inequality is useful because, once we compute the expectation of a random variable, we are able to make high-probability statements about it. For example, if we know that X is a non-negative random variable of expectation 100, we can say that there is at least a 90% probability that X ≤ 1,000.

Finally, the variance of a random variable X is

    Var X := E (X − E X)^2

and the standard deviation of X is √(Var X). The significance of this definition is that, if the variance is small, we can prove that X has, with high probability, values close to its expectation. That is, for every t > 0,

    Pr[|X − E X| ≥ t] = Pr[(X − E X)^2 ≥ t^2] ≤ (Var X) / t^2

and, after the change of variable t = c · √(Var X),

    Pr[|X − E X| ≥ c · √(Var X)] ≤ 1 / c^2

so, for example, there is at least a 99% probability that X is within 10 standard deviations of its expectation. The above inequality is called Chebyshev's inequality. (There are a dozen alternative spellings of Chebyshev's name; you may have encountered this inequality before, spelled differently.)
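To see how much knowing the variance helps (a worked example added here, not in the original handout), suppose X is a non-negative random variable with E X = 100 and Var X = 100, so that its standard deviation is 10. Markov's inequality only gives Pr[X ≥ 200] ≤ 100/200 = 1/2, while Chebyshev's inequality gives

    Pr[X ≥ 200] ≤ Pr[|X − E X| ≥ 100] ≤ (Var X) / 100^2 = 100 / 10,000 = 1%

so the variance sharpens the bound considerably.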

If one knows more about X, then it is possible to say a lot more about the probability that X deviates from its expectation. In several interesting cases, for example, the probability of deviating by c standard deviations is exponentially, rather than polynomially, small in c^2. The advantage of Chebyshev's inequality is that one does not need to know anything about X other than its expectation and its variance.

Two final things about variance: if X_1, ..., X_n are pairwise independent random variables, then

    Var[X_1 + ... + X_n] = Var X_1 + ... + Var X_n

and if X is a random variable whose only possible values are 0 and 1, then

    E X = Pr[X = 1]   and   Var X = E X^2 − (E X)^2 ≤ E X^2 = E X = Pr[X = 1]

2  The Heavy Hitters Problem

From Problem 5 in HW1, you may remember the problem of finding a label a such that f_a > n/2 in a stream of length n. The solution is an algorithm that is deterministic and uses only two variables (and log |Σ| + log n bits of memory), but its analysis relies on the fact that we are interested in finding a label occurring a majority of the times.

Suppose, instead, that we are interested in finding all labels, if any, that occur at least .3n times. Next week, we will show that this requires Ω(min{|Σ|, n}) bits of memory. The data structure that we present in this section, however, is able to do the following using O((log n)^2) bits of memory: come up with a list that includes all labels that occur at least .3n times and which includes, with high probability, none of the labels that occur less than .2n times. Of course .2 and .3 could be replaced by any other two constants. Even more impressively, every time the algorithm sees an item in the stream, it is able to approximate the number of times it has seen that element so far, up to an additive error of .1n, and, again, .1 could be replaced by any other positive constant. The algorithm is called Count-Min, or Count-Min Sketch, and it has been implemented in a number of libraries.

The algorithm relies on two parameters ℓ and B. We choose ℓ = 2 log n and B = 20. In general, if we want to be able to approximate frequencies up to an additive error of εn, we will choose B = 2/ε.

The following version of the algorithm reports, for each label of the stream that it processes, an approximation of the number of times it has seen that label so far:

    Initialize an ℓ × B array M to all zeroes
    Pick ℓ random functions h_1, ..., h_ℓ, where h_i : Σ → {1, ..., B}
    while not end-of-stream:
        read a label x from the stream
        for i = 1 to ℓ:
            M[i, h_i(x)]++
        print "estimated number of times", x, "occurred so far is", min_{i=1,...,ℓ} M[i, h_i(x)]

And the following version of the algorithm constructs a list that includes all the labels that occur ≥ .3n times in the stream and, with high probability, includes none of the labels that occur < .3n − 2n/B times in the stream. (If B = 20, then .3n − 2n/B = .2n.)

    Initialize an ℓ × B array M to all zeroes
    Initialize L to an empty list
    Pick ℓ random functions h_1, ..., h_ℓ, where h_i : Σ → {1, ..., B}
    while not end-of-stream:
        read a label x from the stream
        for i = 1 to ℓ:
            M[i, h_i(x)]++
        if min_{i=1,...,ℓ} M[i, h_i(x)] > .3n:
            add x to L, if not already present
    return L
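To make the data structure concrete, here is a minimal Python sketch of Count-Min (an added illustration, not from the original handout). The truly random functions h_i assumed above are approximated by hashing each label together with a per-row random salt; the class name, the constructor parameter n_estimate, and the use of SHA-256 are assumptions of this sketch, not part of the algorithm as stated.

    import hashlib
    import math
    import random

    class CountMin:
        def __init__(self, n_estimate, B=20):
            # ell = 2 log n rows of B counters, as in the notes
            self.B = B
            self.ell = 2 * max(1, math.ceil(math.log2(max(2, n_estimate))))
            # one random salt per row plays the role of the random function h_i
            self.salts = [random.getrandbits(64) for _ in range(self.ell)]
            self.M = [[0] * B for _ in range(self.ell)]

        def _h(self, i, x):
            # salted hash of x into {0, ..., B-1}, standing in for h_i(x)
            digest = hashlib.sha256(f"{self.salts[i]}|{x}".encode()).digest()
            return int.from_bytes(digest[:8], "big") % self.B

        def update(self, x):
            # process one stream item: increment one counter in every row
            for i in range(self.ell):
                self.M[i][self._h(i, x)] += 1

        def estimate(self, x):
            # never below f_x; with high probability at most f_x + 2n/B
            return min(self.M[i][self._h(i, x)] for i in range(self.ell))

Feeding a stream through update() and, on each arrival of x, checking whether estimate(x) > .3n reproduces the list-building version above.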

Let us analyze the above algorithm. The first observation is that, when we see a label a, we increase all the ℓ values M[1, h_1(a)], M[2, h_2(a)], ..., M[ℓ, h_ℓ(a)], and so, at the end of the stream, having done the above f_a times, it follows that all the above values are at least f_a, and so

    f_a ≤ min_{i=1,...,ℓ} M[i, h_i(a)]

In particular, this means that if a is a label such that f_a ≥ .3n then, by the last time we see an occurrence of a, it must be that min_{i=1,...,ℓ} M[i, h_i(a)] > .3n, and so, in the second algorithm, we see that a is certainly added to the list. The problem is that min_{i=1,...,ℓ} M[i, h_i(a)] could be much bigger than f_a, and so the first algorithm could deliver a poor approximation and the second algorithm could add some light hitters to the list. We need to prove that this failure mode has a low probability of happening.

First of all, we see that, for a fixed choice of the functions h_i,

    M[i, h_i(a)] = f_a + \sum_{b ≠ a : h_i(b) = h_i(a)} f_b

Now let us compute the expectation of M[i, h_i(a)] over the random choice of the function h_i. We see that it is

    E M[i, h_i(a)] = f_a + \sum_{b ≠ a} Pr[h_i(a) = h_i(b)] · f_b = f_a + (1/B) · \sum_{b ≠ a} f_b ≤ f_a + n/B

where we use the fact that, for a random function h_i : Σ → {1, ..., B}, the probability that h_i(a) = h_i(b) is exactly 1/B, and the fact that \sum_{b ≠ a} f_b = n − f_a ≤ n.

So far, we have proved that the content of M[i, h_i(a)] is always at least f_a and, in expectation, is at most f_a + n/B. Let us now translate this statement about expectations into a statement about probabilities. Applying Markov's inequality to the (non-negative!) random variable M[i, h_i(a)] − f_a, we have

    Pr[M[i, h_i(a)] > f_a + 2n/B] ≤ 1/2

and, using the independence of the h_i,

    Pr[min_{i=1,...,ℓ} M[i, h_i(a)] > f_a + 2n/B]
        = Pr[(M[1, h_1(a)] > f_a + 2n/B) ∧ ... ∧ (M[ℓ, h_ℓ(a)] > f_a + 2n/B)]
        = Pr[M[1, h_1(a)] > f_a + 2n/B] · ... · Pr[M[ℓ, h_ℓ(a)] > f_a + 2n/B]
        ≤ 1/2^ℓ

So, if we choose B = 20 and ℓ = 2 log n, we have that the estimate min_i M[i, h_i(a)] is between f_a and f_a + .1n, except with probability at most 1/n^2. In particular, there is a probability at least 1 − 1/n that we get a good estimate n times in a row.

3  Counting Distinct Elements

Here the algorithm is a lot simpler. We pick a random function h : Σ → [0, 1] (we will see later how to appropriately discretize this choice so that the function can be stored in memory), and, for every item x in the stream, we evaluate h(x), keeping track of the smallest hash value obtained so far. If min is the minimum hash value seen at the end of the stream, then 1/min is our estimate of the number of distinct elements in the stream.

The intuition for the algorithm is the following: suppose that the stream has k distinct labels. Then, when we evaluate h at every element of the stream, we are evaluating h at k distinct inputs, although some inputs may be repeated several times. If h is a random function, we are getting k random values in the interval [0, 1]. Intuitively, we would expect k random numbers selected in the interval [0, 1] to be evenly distributed, and so the minimum of these numbers should be about 1/k. Thus, the inverse of the minimum should be around k.
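Here is a minimal Python sketch of this estimator (again an added illustration); the random function h is emulated with a fixed salted SHA-256 hash, which is an assumption of the sketch rather than part of the algorithm.

    import hashlib

    def h(x, salt="demo-salt"):
        # stand-in for a random function h : Sigma -> [0, 1]
        digest = hashlib.sha256(f"{salt}|{x}".encode()).digest()
        return int.from_bytes(digest[:8], "big") / 2**64

    def estimate_distinct(stream):
        # one pass, constant extra memory: track the smallest hash value seen so far
        smallest = min(h(x) for x in stream)
        return 1.0 / smallest

Since h is deterministic, repeated labels map to the same value, so they cannot pull the minimum any lower than their first occurrence already did.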

Here is a simple calculation showing that there is at least a 60% probability that the algorithm achieves a constant-factor approximation. Suppose that the stream has k distinct labels, and call r_1, ..., r_k the random numbers in [0, 1] corresponding to the evaluation of h at the k distinct labels of the stream. Then

    Pr[Algorithm's output ≤ k/2]
        = Pr[min{r_1, ..., r_k} ≥ 2/k]
        = Pr[r_1 ≥ 2/k ∧ ... ∧ r_k ≥ 2/k]
        = Pr[r_1 ≥ 2/k] · ... · Pr[r_k ≥ 2/k]
        = (1 − 2/k)^k ≤ e^{−2} ≈ .14

and

    Pr[Algorithm's output ≥ 4k]
        = Pr[min{r_1, ..., r_k} ≤ 1/(4k)]
        = Pr[r_1 ≤ 1/(4k) ∨ ... ∨ r_k ≤ 1/(4k)]
        ≤ \sum_{i=1}^{k} Pr[r_i ≤ 1/(4k)]
        = k · 1/(4k) = 1/4

so that

    Pr[k/2 ≤ Algorithm's output ≤ 4k] ≥ .61

There are various ways to improve the quality of the approximation and to increase the probability of success. One of the simplest ways is to keep track not of the smallest hash value encountered so far, but of the t distinct labels with the t smallest hash values. This is a set of labels and values that can be kept in a data structure of size O(t · log |Σ|), and processing an element from the stream takes time O(log t) if the labels are put in a priority queue. Then, if s_t is the t-th smallest hash value in the stream, our estimate for the number of distinct values is t/s_t.

The intuition is that, as before, if we have k distinct labels, their hashed values will be k random points in the interval [0, 1], which we would expect to be roughly uniformly spaced, so that the t-th smallest would have a value of about t/k. Thus its inverse, multiplied by t, should be an estimate of k.
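Here is a minimal Python sketch of this refinement (an added illustration, with the same salted-hash stand-in for h; the default t = 3,000 matches the worked numbers in the next subsection).

    import hashlib
    import heapq

    def h(x, salt="demo-salt"):
        # same stand-in for a random function h : Sigma -> [0, 1] as before
        digest = hashlib.sha256(f"{salt}|{x}".encode()).digest()
        return int.from_bytes(digest[:8], "big") / 2**64

    def estimate_distinct_bottom_t(stream, t=3000):
        # keep the t distinct labels with the t smallest hash values
        heap, kept = [], {}                # heap holds (-hash, label); kept maps label -> hash
        for x in stream:
            if x in kept:
                continue                   # each distinct label contributes one hash value
            v = h(x)
            if len(kept) < t:
                kept[x] = v
                heapq.heappush(heap, (-v, x))
            elif v < -heap[0][0]:          # smaller than the largest kept hash: swap it in
                _, evicted = heapq.heappushpop(heap, (-v, x))
                del kept[evicted]
                kept[x] = v
        if len(kept) < t:
            return float(len(kept))        # fewer than t distinct labels: we saw them all
        return t / (-heap[0][0])           # t divided by the t-th smallest hash value

The dictionary kept never holds more than t labels, matching the O(t · log |Σ|) space bound mentioned above.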

Why would it help to work with the t-th smallest hash instead of the smallest? The intuition is that it takes only one outlier to skew the minimum, but one needs to have t outliers to skew the t-th smallest, and the latter is a more unlikely event.

3.1  * Rigorous Analysis of the t-th Smallest Algorithm

Let us sketch a more rigorous analysis. These calculations are a bit more complicated than the rest of the content of this lecture, so it is ok to skip them.

We will see that we can get an estimate that is, with high probability, between k − εk and k + εk, by choosing t to be of the order of 1/ε^2. For example, choosing t = 30/ε^2 there is at least a 93% probability of getting an ε-approximation. For concreteness, we will see that, with t = 3,000, we get, with probability at least 93%, an error that is at most 10%.

As before, we let r_1, ..., r_k be the hashes of the k distinct labels in the stream. We let s_t be the t-th smallest of r_1, ..., r_k. Then

    Pr[Algorithm's output ≥ 1.1k] = Pr[t/s_t ≥ 1.1k] = Pr[#{i : r_i ≤ t/(1.1k)} ≥ t]

Now let us study the number of indices i such that r_i ≤ t/(1.1k), and let us give it a name:

    N := #{i : r_i ≤ t/(1.1k)}

This is a random variable whose expectation is easy to compute. If we define S_i to be 1 if r_i ≤ t/(1.1k) and 0 otherwise, then N = \sum_i S_i and so

    E N = \sum_i E S_i = \sum_i Pr[r_i ≤ t/(1.1k)] = k · t/(1.1k) = t/1.1

The variance of N is also easy to compute:

    Var N = \sum_i Var S_i ≤ t/1.1

So N has an average of about .91 t and a standard deviation of less than √t, so by Chebyshev's inequality

    Pr[Algorithm's output ≥ 1.1k] = Pr[N ≥ t]
        = Pr[N − E N ≥ t − t/1.1]
        ≤ (Var N) / (t/11)^2
        ≤ (t/1.1) · 121 / t^2 = 110/t = 110/3,000 < 3.7%

Similarly,

    Pr[Algorithm's output ≤ .9k] = Pr[t/s_t ≤ .9k] = Pr[#{i : r_i ≤ t/(.9k)} ≤ t]

and if we call

    N' := #{i : r_i ≤ t/(.9k)}

we see that

    E N' = t/.9   and   Var N' ≤ t/.9

and

    Pr[Algorithm's output ≤ .9k] = Pr[N' ≤ t]
        = Pr[E N' − N' ≥ t/.9 − t]
        ≤ (Var N') / (t/9)^2
        ≤ (t/.9) · 81 / t^2 = 90/t = 90/3,000 = 3%

So all together we have

    Pr[.9k ≤ Algorithm's output ≤ 1.1k] ≥ 1 − 3.7% − 3% ≥ 93%
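As a sanity check on these numbers (an added illustration, not part of the handout), one can simulate the estimator directly: only the hashes of the k distinct labels matter, so it suffices to draw k uniform values and look at the t-th smallest. Chebyshev's inequality guarantees that at least 93% of the trials land within 10% of k; in practice essentially all of them do.

    import random

    def trial(k=100_000, t=3_000, rng=random):
        # hashes of the k distinct labels; the rest of the stream is irrelevant here
        hashes = sorted(rng.random() for _ in range(k))
        return t / hashes[t - 1]           # estimate = t / (t-th smallest hash)

    random.seed(0)
    estimates = [trial() for _ in range(200)]
    good = sum(1 for e in estimates if 0.9 * 100_000 <= e <= 1.1 * 100_000)
    print(f"{good}/200 trials within 10% of the true count")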