CS 493: Algorithms for Massive Data Sets    Feb 2, 2002
Local Models, Bloom Filter    Scribe: Qin Lv

1 Local Models

In global models, every inverted file entry is compressed with the same model. This works well when assuming a uniform distribution of gaps, i.e., the gap sizes are randomly distributed. However, the distribution of gaps depends on the listing. If we sort URLs into lexical order, URLs from the same site are likely to be close to each other, so the distribution of gaps tends to have a clustering property. In order to exploit this property, people use local models, where each inverted file entry uses its own model. Generally, local models outperform global models but are more complex to implement.

Golomb code: works well if most gaps are close to $b$, but it is not good for large gaps. The $\gamma$-code works better with large gaps. Skewed-Bernoulli is a combination of the two, using the bucket vector $V_T = (b, 2b, 4b, \ldots, 2^i b, \ldots)$.

1.1 Local Hyperbolic Model

This model assumes that the probability of a gap of size $i$ is proportional to $1/i$, i.e.,

    $\Pr(i) \propto \frac{1}{i}, \quad \text{for } i = 1, 2, \ldots, m.$

Let $m$ be the truncation point of the gap size; then

    $\Pr(i) = \frac{1/i}{\sum_{j=1}^{m} 1/j} \approx \frac{1}{i \log_e m}.$

What should $m$ be? We require

    $E[\text{gap size}] = \sum_i i \Pr(i) \approx \sum_{i=1}^{m} i \cdot \frac{1}{i \log_e m} = \frac{m}{\log_e m} = \frac{N}{f_t},$

where $N$ is the number of documents and $f_t$ is the frequency of term $t$. We can then use observed frequencies to build the code. But one problem with this scheme is that we have to do it on a term-by-term basis.
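As a concrete illustration (not from the original notes; the function names and the bisection routine are my own sketch), the following Python snippet solves $m / \log_e m = N / f_t$ for the truncation point $m$ and computes the ideal code length $-\log_2 \Pr(i)$ for each gap size:

    import math

    def truncation_point(N, f_t):
        # Solve m / ln(m) = N / f_t for m by bisection; m / ln(m) is
        # increasing for m > e, so we assume N / f_t > e.
        target = N / f_t
        assert target > math.e, "expects N / f_t > e"
        lo, hi = 3.0, 4.0
        while hi / math.log(hi) < target:   # grow the upper bracket
            hi *= 2
        for _ in range(60):                 # bisect to convergence
            mid = (lo + hi) / 2
            if mid / math.log(mid) < target:
                lo = mid
            else:
                hi = mid
        return int(round(hi))

    def gap_code_lengths(m):
        # Ideal code length in bits for gap i under Pr(i) = (1/i) / H_m,
        # where H_m is the m-th harmonic number (the exact normalizer).
        H_m = sum(1.0 / j for j in range(1, m + 1))
        return [-math.log2((1.0 / i) / H_m) for i in range(1, m + 1)]

For example, with $N = 1000$ documents and $f_t = 50$, truncation_point(1000, 50) solves $m / \log_e m = 20$ and returns $m \approx 90$.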

To solve this problem, we can batch terms (with the same frequencies) together. An even better method is to use logarithmic batching, i.e., we group terms according to logarithmic intervals $[2^i, 2^{i+1})$. In practice, this method uses only about 20 codes.

All the models we have looked at so far are zero-order models, which means that each symbol is independent and no context information is used. We might be able to design better schemes if we can use information about the previous gap distribution. Such compression schemes are called context-sensitive compression schemes. For a parameterized global model such as the Golomb code, we can choose the parameter $b$ to be the average of the previous $K$ gap sizes. But there is a problem: suppose we have many small gaps followed by a large gap. Then $b$ could become 1 for the small gaps, which is the same as unary coding, and when the large gap arrives, a huge code will be used for it. This problem can be alleviated by using the Skewed-Bernoulli code.

1.2 Interpolative Code

With this scheme, we assume that we know the complete inverted file entry, and we recursively calculate ranges and code them with the minimal number of bits. For example, assume we have 20 documents in total, and a given inverted file entry has 7 documents:

    documents: <7; 3, 8, 9, 11, 12, 13, 17>
    gaps:      <7; 3, 5, 1, 2, 1, 1, 4>

Note that if we know the 4th number is 11 and the 6th number is 13, there is only one possible value (12) for the 5th number. We recursively halve the range and calculate the lowest and highest possible values for the number in the middle. So for the example above, we first look at the 4th number 11, then 8 (the middle number among the left 3 documents), then 3, 9, then 13 (the middle number among the right 3 documents), and finally 12, 17. We use $(x, lo, hi)$ to denote that $x$ lies in the interval $[lo, hi]$. Then for the example above, we get the following sequence:

    (11, 4, 17)   since 11 is the 4th number among the 7 documents, and there are 20 documents in total;
    (8, 2, 9); (3, 1, 7); (9, 9, 10); (13, 13, 19); (12, 12, 12); (17, 14, 20)

The interpolative code can achieve the best compression ratio, but it has longer decoding time. Why is the $(x, lo, hi)$ coding the best? Intuitively, for a sorted sequence of numbers between 1 and $N$, the median is likely to be around $N/2$, and the probability $\Pr(x \text{ is the median})$ is much higher if $x$ is close to $N/2$. Then, using Huffman coding, we can get very short codes.
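The recursion is easy to express in code. Here is a minimal Python sketch (my own illustration, not from the notes) that reproduces the $(x, lo, hi)$ triples for the example above; an actual coder would then emit each $x$ using roughly $\lceil \log_2(hi - lo + 1) \rceil$ bits:

    def interpolative_triples(docs, lo, hi):
        # Recursively emit (x, lo, hi) triples for a sorted list of doc
        # ids, where each x is known to lie in the interval [lo, hi].
        if not docs:
            return []
        mid = len(docs) // 2
        x = docs[mid]
        # x has `mid` smaller ids and len(docs)-1-mid larger ids, which
        # tightens its range on both sides.
        x_lo = lo + mid
        x_hi = hi - (len(docs) - 1 - mid)
        return ([(x, x_lo, x_hi)]
                + interpolative_triples(docs[:mid], lo, x - 1)
                + interpolative_triples(docs[mid + 1:], x + 1, hi))

    # The example above: 7 documents out of N = 20.
    print(interpolative_triples([3, 8, 9, 11, 12, 13, 17], 1, 20))
    # [(11, 4, 17), (8, 2, 9), (3, 1, 7), (9, 9, 10),
    #  (13, 13, 19), (12, 12, 12), (17, 14, 20)]

Note that the triple $(12, 12, 12)$ needs zero bits, exactly the forced 5th number observed above.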

2 Performance of Index Compression

Since local models use more information (such as the previous distribution of gaps), they perform better than global models. The global observed-frequency model is only slightly better than the $\gamma$ or $\delta$ codes. Both the $\gamma$ and $\delta$ codes have performance close to the local schemes, but are outperformed by them. The performance of the various local schemes is ordered as:

    Bernoulli < Hyperbolic < Skewed-Bernoulli < batched frequencies < Interpolative code

3 Hashing Based Signature Schemes

For the purpose of indexing, a document is just a set of terms, and the database is some representation of such sets so that queries can be answered efficiently. While sometimes we want exact answers to our queries, there are many situations where some probability of mismatch can be allowed. We now discuss a scheme that is used for the latter case.

3.1 Bloom filters

(Reference: Michael Mitzenmacher, Compressed Bloom Filters.)

A Bloom filter represents a set $S$ of $n$ elements by a bit vector $V$ of size $m$, initially all 0. A Bloom filter uses $k$ independent hash functions $h_1, h_2, \ldots, h_k$, each with range $\{1, 2, \ldots, m\}$. We assume these hash functions map each item in the universe to a random number uniform over $\{1, 2, \ldots, m\}$. For each element $e \in S$, the bits $h_i(e)$ are set to 1 for $1 \le i \le k$. To check whether an element $x$ is in $S$, we check whether all the bits $h_i(x)$ ($1 \le i \le k$) are set to 1. If not, $x$ is definitely NOT in $S$. Otherwise, we assume that $x$ is in $S$. However, it is possible that $x$ is actually not in the set, but all its $k$ corresponding bits are set (because some other elements hashed to those bits). So a Bloom filter may yield a false positive: it suggests that an element $x$ is in $S$ even though it is not. For many applications this is acceptable as long as the probability of a false positive is sufficiently small. Bloom filters are widely used in many applications such as distributed caches.

Now, given the scheme, what is the probability of a false positive? Since there are $n$ elements and $k$ hash functions, we set at most $nk$ bits. Assuming purely random hash functions, each time we set a bit, the probability that a particular bit is not set is $1 - 1/m$.
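Before continuing the derivation, here is a minimal Python sketch of the scheme just described (my own illustration; the notes assume ideal uniform random hash functions, approximated here by salted SHA-1):

    import hashlib

    class BloomFilter:
        def __init__(self, m, k):
            # m bits, all initially 0, and k hash functions.
            self.m, self.k = m, k
            self.bits = [0] * m

        def _positions(self, item):
            # Derive k pseudo-random positions in {0, ..., m-1} by
            # hashing the item with k different salts.
            for i in range(self.k):
                h = hashlib.sha1(f"{i}:{item}".encode()).hexdigest()
                yield int(h, 16) % self.m

        def add(self, item):
            for pos in self._positions(item):
                self.bits[pos] = 1

        def __contains__(self, item):
            # All k bits set -> "probably in S"; any bit 0 -> definitely not.
            return all(self.bits[pos] for pos in self._positions(item))

    bf = BloomFilter(m=1000, k=7)
    bf.add("princeton.edu")
    print("princeton.edu" in bf)   # True
    print("yale.edu" in bf)        # False, except with small probability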

So, after all the elements of $S$ are hashed into the Bloom filter, the probability that a particular bit is still not set is

    $\left(1 - \frac{1}{m}\right)^{nk} \approx e^{-nk/m}.$

Let $p = e^{-nk/m}$. Then the probability of a false positive (this is when all $k$ corresponding bits are set) is

    $\left(1 - \left(1 - \frac{1}{m}\right)^{nk}\right)^k \approx \left(1 - e^{-nk/m}\right)^k = (1 - p)^k.$

Let $f = (1 - p)^k$. We now use the asymptotic approximations $p$ and $f$ to represent, respectively, the probability of a bit not being set and the probability of a false positive.

Suppose the numbers $n$ and $m$ are fixed. How can we choose the number of hash functions $k$ such that the probability of a false positive $f$ is minimized? From

    $p = e^{-nk/m} \implies \ln p = -\frac{nk}{m} \implies k = -\frac{m}{n} \ln p,$

we get

    $f = (1 - p)^k = (1 - p)^{-\frac{m}{n} \ln p} = e^{-\frac{m}{n} \ln(1-p) \ln p}.$

Since $m$ and $n$ are fixed, $f$ is minimized when $\ln(1-p) \ln p$ is maximized. Setting

    $\frac{d}{dp}\left[\ln(1-p) \ln p\right] = \frac{\ln(1-p)}{p} - \frac{\ln p}{1-p} = 0$

gives $p = 1/2$ (the two terms balance exactly when $p = 1 - p$). So the optimal (i.e., minimal) $f$ is achieved when $p = 1/2$:

    $\frac{1}{2} = p = e^{-nk/m} \implies \ln 2 = \frac{nk}{m} \implies k = \ln 2 \cdot \frac{m}{n},$

    $f = (1 - p)^k = \left(\frac{1}{2}\right)^{\ln 2 \cdot \frac{m}{n}} \approx (0.6185)^{m/n}.$
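A quick numeric sanity check of this result (my own sketch, with hypothetical sizes $m = 8000$, $n = 1000$, i.e., 8 bits per element):

    import math

    def false_positive_rate(m, n, k):
        # Asymptotic false-positive probability f = (1 - e^{-nk/m})^k.
        return (1 - math.exp(-n * k / m)) ** k

    m, n = 8000, 1000
    k_opt = round(math.log(2) * m / n)   # k = ln 2 * (m/n), rounded
    print(k_opt, false_positive_rate(m, n, k_opt))
    # prints 6 and about 0.0215, close to (0.6185)^(m/n) = 0.6185**8 ~ 0.0214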

3.2 Compressed Bloom filter

Many applications using Bloom filters may need to pass the filter around as a message, and the transmission size $Z$ ($Z \le m$) can become a limiting factor. If every bit is equally likely to be 0 or 1 (i.e., $p = 1/2$), the Bloom filter cannot be compressed ($Z = m$). Suppose, however, that we choose $k$ such that $p$, the probability of a bit not being set, is not $1/2$. Then we may be able to compress the Bloom filter before sending it out, thus reducing the transmission size $Z$. The lower bound on $Z$ is $m \cdot H(p, 1-p)$, where

    $H(p, 1-p) = -p \log_2 p - (1-p) \log_2 (1-p)$

is the entropy of the distribution $\{p, 1-p\}$.

Suppose $n$ and the transmission size $Z$ are fixed. What is the best $k$ that minimizes $f$?

[to be continued in next class...]
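A small illustration of the entropy bound (my own sketch, not from the notes):

    import math

    def entropy_bits(p):
        # H(p, 1-p) = -p log2(p) - (1-p) log2(1-p), in bits per filter bit.
        if p <= 0.0 or p >= 1.0:
            return 0.0
        return -p * math.log2(p) - (1 - p) * math.log2(1 - p)

    # At p = 1/2 each bit carries a full bit of entropy, so Z = m and the
    # filter is incompressible; with fewer hash functions p > 1/2, and the
    # same m physical bits compress to roughly m * H(p, 1-p) bits.
    print(entropy_bits(0.5))   # 1.0
    print(entropy_bits(0.8))   # about 0.72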