UC Berkeley CS 170: Efficient Algorithms and Intractable Problems          Handout 17
Lecturer: David Wagner                                                  April 3, 2003

Notes 17 for CS 170

1  The Lempel-Ziv algorithm

There is a sense in which Huffman coding was optimal, but this is under several assumptions:

1. The compression is lossless, i.e., uncompressing the compressed file yields exactly the original file. When lossy compression is permitted, as for video, other algorithms can achieve much greater compression, and this is a very active area of research because people want to be able to send video and audio over the Web.

2. We know all the frequencies f(i) with which each character appears. How do we get this information? We could make two passes over the data, the first to compute the f(i), and the second to encode the file. But this can be much more expensive than passing over the data once for large files residing on disk or tape. One way to do just one pass over the data is to assume that the fractions f(i)/n of each character in the file are similar to files you have compressed before. For example, you could assume all Java programs (or English text, or PowerPoint files, or ...) have about the same fractions of characters appearing. A second, cleverer way is to estimate the fractions f(i)/n on the fly as you process the file. One can make Huffman coding adaptive this way.

3. We know the set of characters (the alphabet) appearing in the file. This may seem obvious, but there is a lot of freedom of choice. For example, the alphabet could be the characters on a keyboard, or it could be the keywords and variable names appearing in a program. To see what difference this can make, suppose we have a file consisting of n strings "aaaa" and n strings "bbbb" concatenated in some order. If we choose the alphabet {a, b} then 8n bits are needed to encode the file. But if we choose the alphabet {aaaa, bbbb} then only 2n bits are needed.

Picking the correct alphabet turns out to be crucial in practical compression algorithms. Both the UNIX compress and GNU gzip algorithms use a greedy algorithm due to Lempel and Ziv to compute a good alphabet in one pass while compressing. Here is how it works.

If s and t are two bit strings, we will use the notation s ◦ t to mean the bit string gotten by concatenating s and t. We let f be the file we want to compress, and think of it just as a string of bits, that is, 0's and 1's. We will build an alphabet A of common bit strings encountered in f, and use it to compress f. Given A, we will break f into shorter bit strings like

    f = A(1) ◦ 0 ◦ A(2) ◦ 1 ◦ ... ◦ A(7) ◦ 0 ◦ ... ◦ A(5) ◦ 1 ◦ ... ◦ A(i) ◦ j ◦ ...

and encode this by

    1 ◦ 0 ◦ 2 ◦ 1 ◦ ... ◦ 7 ◦ 0 ◦ ... ◦ 5 ◦ 1 ◦ ... ◦ i ◦ j ◦ ...

    F = 0 0 0 1 1 1 1 0 1 0 1 1 0 1 0 0 0

    A(1) = A(0) ◦ 0 = 0
    A(2) = A(1) ◦ 0 = 00
    A(3) = A(0) ◦ 1 = 1
    A(4) = A(3) ◦ 1 = 11
    A(5) = A(3) ◦ 0 = 10
    A(6) = A(5) ◦ 1 = 101      <- set A is full
           A(6) ◦ 0 = 1010     (not added)
           A(1) ◦ 0 = 00       (not added)

    Encoded F = (0,0), (1,0), (0,1), (3,1), (3,0), (5,1), (6,0), (1,0)
              = 0000 0010 0001 0111 0110 1011 1100 0010

    Figure 1: An example of the Lempel-Ziv algorithm.

The indices i of A(i) are in turn encoded as fixed-length binary integers, and the bits j are just bits. Given the fixed length (say r) of the binary integers, we decode by taking every group of r + 1 bits of a compressed file, using the first r bits to look up a string in A, and concatenating the last bit. So when storing (or sending) an encoded file, a header containing A is also stored (or sent). Notice that while Huffman's algorithm encodes blocks of fixed size into binary sequences of variable length, Lempel-Ziv encodes blocks of varying length into blocks of fixed size.

Here is the algorithm for encoding, including building A. Typically a fixed size is available for A, and once it fills up, the algorithm stops looking for new characters.

    A = { "" }    ... start with an alphabet containing only the empty string
    i = 0         ... points to the next place in file f to start encoding
    repeat
        find the A(k) in the current alphabet that matches as many leading bits
        of f_i f_{i+1} f_{i+2} ... as possible
                  ... initially only A(0) = empty string matches
        let b be the number of bits in A(k)
        if A is not full, add A(k) ◦ f_{i+b} to A
                  ... f_{i+b} is the first bit unmatched by A(k)
        output k ◦ f_{i+b}
        i = i + b + 1
    until i > length(f)

Note that A is built greedily, based on the beginning of the file. Thus there are no optimality guarantees for this algorithm. It can perform badly if the nature of the file changes substantially after A is filled up; however, the algorithm makes only one pass through the file. (There are other possible implementations: A may be unbounded, and the index k would then be encoded with a variable-length code itself.) In Figure 1 there is an example of the algorithm running, where the alphabet A fills up after 6 characters are inserted. In this small example no compression is obtained, but if A were large, and the same long bit strings appeared frequently, compression would be substantial.
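For concreteness, here is a short Python sketch of the encoder above. The name lz_encode and the parameters r and capacity are invented for this sketch, not part of the notes; the dictionary is declared full once it holds capacity entries, and capacity = 7 (the empty string plus six inserted strings) reproduces Figure 1. The sketch also resolves a corner case the pseudocode leaves implicit: the matched A(k) is never allowed to consume the entire rest of the file, so an unmatched literal bit always follows, as in the last step of Figure 1.

    def lz_encode(f, r=3, capacity=7):
        # f        : the file, as a string of '0'/'1' characters
        # r        : bits used for each dictionary index in the packed output
        # capacity : maximum number of entries in A, counting A(0) = ""
        # Returns the list of (k, bit) pairs and the packed output bits.
        A = [""]                                 # A(0) is the empty string
        pairs = []
        i = 0
        while i < len(f):
            # longest A(k) matching leading bits of f[i:] while leaving at least
            # one bit over, so a literal unmatched bit always follows
            k = max((j for j in range(len(A))
                     if len(A[j]) < len(f) - i and f.startswith(A[j], i)),
                    key=lambda j: len(A[j]))
            b = len(A[k])
            bit = f[i + b]                       # first bit not matched by A(k)
            if len(A) < capacity:                # if A is not full, add A(k) followed by that bit
                A.append(A[k] + bit)
            pairs.append((k, bit))
            i = i + b + 1
        packed = "".join(format(k, "0{}b".format(r)) + bit for k, bit in pairs)
        return pairs, packed

    pairs, packed = lz_encode("00011110101101000")   # the 17-bit file of Figure 1
    print(pairs)    # [(0,'0'), (1,'0'), (0,'1'), (3,'1'), (3,'0'), (5,'1'), (6,'0'), (1,'0')]
    print(packed)   # 00000010000101110110101111000010  (the 32 bits of Figure 1)

Packing each pair into r + 1 = 4 bits turns the 17-bit input into 32 bits, which matches the observation that this small example achieves no compression.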

The gzip manpage claims that source code and English text are typically compressed 60%-70%. To observe an example, we took a LaTeX file of 74,892 bytes. Running Huffman's algorithm, with bytes used as blocks, we could have compressed the file to 36,757 bytes, plus the space needed to specify the code. The Unix program compress produced an encoding of size 34,385 bytes, while gzip produced an encoding of size 22,815 bytes.

2  Lower bounds on data compression

2.1  Simple Results

How much can we compress a file without loss? We present some results that give lower bounds for any compression algorithm. Let us start from a worst-case analysis.

Theorem 1  Let C : {0,1}^n → {0,1}^* be an encoding algorithm that allows lossless decoding (i.e., let C be an injective function mapping n bits into a sequence of bits). Then there is a file f ∈ {0,1}^n such that |C(f)| ≥ n.

In words, for any lossless compression algorithm there is always a file that the algorithm is unable to compress.

Proof: Suppose, by contradiction, that there is a compression algorithm C such that, for all f ∈ {0,1}^n, |C(f)| ≤ n − 1. Then the set {C(f) : f ∈ {0,1}^n} has 2^n elements because C is injective, but it is also a set of strings of length at most n − 1, and so it has at most Σ_{l=1}^{n−1} 2^l = 2^n − 2 elements, which gives a contradiction.

While the previous analysis showed the existence of incompressible files, the next theorem shows that random files are hard to compress, thus giving an average-case analysis.

Theorem 2  Let C : {0,1}^n → {0,1}^* be an encoding algorithm that allows lossless decoding (i.e., let C be an injective function mapping n bits into a sequence of bits). Let f ∈ {0,1}^n be a sequence of n randomly and uniformly selected bits. Then, for every t,

    Pr[ |C(f)| ≤ n − t ] ≤ 1 / 2^{t−1}.

For example, there is less than a chance in a million of compressing an input file of n bits into an output file of length n − 21, and less than a chance in eight million that the output will be at least 3 bytes shorter than the input.

Proof: We can write

    Pr[ |C(f)| ≤ n − t ] = |{ f : |C(f)| ≤ n − t }| / 2^n.

Regarding the numerator, it is the size of a set that contains only strings of length n − t or less, so it is no more than Σ_{l=1}^{n−t} 2^l, which is at most 2^{n−t+1} − 2 < 2^{n−t+1} = 2^n / 2^{t−1}.
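The counting in these two proofs is easy to check numerically. Here is a minimal sketch (the value n = 100 is an arbitrary choice for illustration): it counts the binary strings of length at most n − t, which bounds how many of the 2^n input files any injective C can map to outputs that are at least t bits shorter.

    # Upper-bound the fraction of n-bit files compressible by at least t bits,
    # for any injective encoding C, and compare with the 1/2**(t-1) bound.
    n = 100
    for t in (1, 2, 8, 21, 24):
        short_outputs = sum(2 ** l for l in range(1, n - t + 1))  # strings of length 1..n-t
        print(t, short_outputs / 2 ** n, 1 / 2 ** (t - 1))
    # t = 21 gives a bound just under one in a million, and t = 24 (three bytes)
    # gives a bound just under one in eight million, matching the example above.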

The following result is harder to prove, and we will just state it.

Theorem 3  Let C : {0,1}^n → {0,1}^* be a prefix-free encoding, and let f be a random file of n bits. Then E[ |C(f)| ] ≥ n.

This means that, from the average point of view, the optimum prefix-free encoding of a random file is just to leave the file as it is.

In practice, however, files are not completely random. Once we formalize the notion of a not-completely-random file, we can show that some compression is possible, but not below a certain limit. First, we observe that even if not all n-bit strings are possible files, we still have lower bounds.

Theorem 4  Let F ⊆ {0,1}^n be a set of possible files, and let C : {0,1}^n → {0,1}^* be an injective function. Then:

1. There is a file f ∈ F such that |C(f)| ≥ log_2 |F|.

2. If we pick a file f uniformly at random from F, then for every t we have

       Pr[ |C(f)| ≤ (log_2 |F|) − t ] ≤ 1 / 2^{t−1}.

3. If C is prefix-free, then when we pick a file f uniformly at random from F we have E[ |C(f)| ] ≥ log_2 |F|.

Proof: Parts 1 and 2 are proved with the same ideas as in Theorem 1 and Theorem 2. Part 3 has a more complicated proof that we omit.

2.2  Introduction to Entropy

Suppose now that we are in the following setting:

- the file contains n characters;
- there are c different characters possible;
- character i has probability p(i) of appearing in the file.

What can we say about the probable and expected length of the output of an encoding algorithm? Let us first do a very rough approximate calculation. When we pick a file according to the above distribution, very likely there will be about n·p(i) characters equal to i. Each file with these typical frequencies has a probability of about p = Π_i p(i)^{n·p(i)} of being generated. Since files with typical frequencies make up almost all the probability mass, there must be about 1/p = Π_i (1/p(i))^{n·p(i)} files with typical frequencies.

Now we are in a setting which is similar to the one of parts 2 and 3 of Theorem 4, where F is the set of files with typical frequencies. We then expect the encoding to be of length at least

    log_2 Π_i (1/p(i))^{n·p(i)} = n · Σ_i p(i) log_2 (1/p(i)).

The quantity Σ_i p(i) log_2 (1/p(i)) is the expected number of bits that it takes to encode each character, and is called the entropy of the distribution over the characters.

The notion of entropy, the discovery of several of its properties, (a formal version of) the calculation above, as well as an (inefficient) optimal compression algorithm, and much, much more, are due to Shannon, and appeared in the late 1940s in one of the most influential research papers ever written.
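As a quick illustration of the rough calculation above, the following sketch computes the entropy for an example distribution (the numbers p = [0.5, 0.25, 0.125, 0.125] and n = 10000 are assumptions made up for illustration, not taken from the notes):

    from math import log2

    p = [0.5, 0.25, 0.125, 0.125]          # assumed distribution over c = 4 characters
    n = 10000                              # file length in characters

    H = sum(q * log2(1 / q) for q in p)    # entropy: bits per character
    print(H)                               # 1.75 for this p
    print(n * H)                           # ~17500: log2 of the number of typical files,
                                           # hence roughly the least number of bits needed
    print(n * log2(len(p)))                # 20000: the naive 2 bits/character encoding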

2.3  A Calculation

Making the above calculation precise would be long, and involve a lot of ε's. Instead, we will formalize a slightly different setting. Consider the set F of files such that:

- the file contains n characters;
- there are c different characters possible;
- character i occurs n·p(i) times in the file.

We will show that F contains roughly 2^{n Σ_i p(i) log_2 (1/p(i))} files, and so a random element of F cannot be compressed to fewer than n Σ_i p(i) log_2 (1/p(i)) bits. Picking a random element of F is almost, but not quite, the setting that we described before, but it is close enough, and interesting in its own right.

Let us call f(i) = n·p(i) the number of occurrences of character i in the file. We need two results, both from Math 55. The first gives a formula for |F|:

    |F| = n! / ( f(1)! · f(2)! · · · f(c)! )

Here is a sketch of the proof of this formula. There are n! permutations of n characters, but many are the same because there are only c different characters. In particular, the f(1) appearances of character 1 are the same, so all f(1)! orderings of these locations are identical. Thus we need to divide n! by f(1)!. The same argument leads us to divide by all the other f(i)!.

Now we have an exact formula for |F|, but it is hard to interpret, so we replace it by a simpler approximation. We need a second result from Math 55, namely Stirling's formula for approximating n!:

    n! ≈ √(2π) · n^{n + .5} · e^{−n}

This is a good approximation in the sense that the ratio n! / [√(2π) n^{n+.5} e^{−n}] approaches 1 quickly as n grows. (In Math 55 we motivated this formula by the approximation log n! = Σ_{i=2}^{n} log i ≈ ∫_1^n log x dx.) We will use Stirling's formula in the form

    log_2 n! ≈ log_2 √(2π) + (n + .5) log_2 n − n log_2 e.
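Before going through the algebra, here is a small numerical sanity check of these two facts (the distribution and the length n = 800 are assumptions, chosen so that every f(i) = n·p(i) is an integer): it compares the exact value of (1/n) log_2 |F|, the Stirling-based estimate, and the entropy Σ_i p(i) log_2 (1/p(i)) that the calculation below arrives at.

    from math import factorial, log2, pi, e

    p = [0.5, 0.25, 0.125, 0.125]        # assumed example distribution
    n = 800                              # every f(i) = n * p(i) is an integer
    f = [int(n * q) for q in p]          # occurrence counts f(i)

    # exact: (1/n) * log2 |F|, with |F| = n! / (f(1)! ... f(c)!)
    size_F = factorial(n)
    for fi in f:
        size_F //= factorial(fi)
    exact = log2(size_F) / n

    # Stirling: log2 m! ~ log2 sqrt(2*pi) + (m + .5) log2 m - m log2 e
    def stirling_log2(m):
        return log2((2 * pi) ** 0.5) + (m + 0.5) * log2(m) - m * log2(e)

    approx = (stirling_log2(n) - sum(stirling_log2(fi) for fi in f)) / n
    H = sum(q * log2(1 / q) for q in p)

    print(exact, approx, H)   # the three values agree to within a few hundredths of a bit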

Stirling's formula is accurate for large arguments, so we will be interested in approximating log_2 |F| for large n. Furthermore, we will actually estimate (1/n) log_2 |F|, which can be interpreted as the average number of bits per character needed to send a long file. Here goes:

    (1/n) log_2 |F|
        = (1/n) log_2 ( n! / (f(1)! · · · f(c)!) )
        = (1/n) [ log_2 n! − Σ_{i=1}^{c} log_2 f(i)! ]
        ≈ (1/n) [ log_2 √(2π) + (n + .5) log_2 n − n log_2 e
                  − Σ_{i=1}^{c} ( log_2 √(2π) + (f(i) + .5) log_2 f(i) − f(i) log_2 e ) ]
        = (1/n) [ n log_2 n − Σ_i f(i) log_2 f(i)
                  + (1 − c) log_2 √(2π) + .5 log_2 n − .5 Σ_i log_2 f(i) ]
        = log_2 n − Σ_i (f(i)/n) log_2 f(i)
          + ((1 − c) log_2 √(2π)) / n + (.5 log_2 n) / n − (.5 Σ_{i=1}^{c} log_2 f(i)) / n

(the terms n log_2 e and Σ_i f(i) log_2 e cancel, since Σ_i f(i) = n). As n gets large, the three fractions on the last line above all go to zero: the first term looks like O(1/n), and the last two terms look like O((log_2 n)/n). This lets us simplify to get

    (1/n) log_2 |F| ≈ log_2 n − Σ_i (f(i)/n) log_2 f(i)
                    = log_2 n − Σ_i p(i) log_2 (n·p(i))
                    = log_2 n − Σ_i p(i) log_2 n − Σ_i p(i) log_2 p(i)
                    = Σ_i p(i) log_2 (1/p(i)),

using Σ_i p(i) = 1 in the last step. Normally, the quantity Σ_i p(i) log_2 (1/p(i)) is denoted by H.

How much more space can Huffman coding take to encode a file than Shannon's lower bound of n·H bits? A theorem of Gallager (1978) shows that at worst Huffman will take n(p_max + .086) bits more than n·H, where p_max is the largest of the p(i). But it often does much better. Furthermore, if we take blocks of k characters and encode them using Huffman's algorithm, then, for large k and for n tending to infinity, the average length of the encoding tends to n times the entropy.
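To see the Gallager bound in action, here is a minimal sketch (not from the notes; the distribution is an assumed example) that builds a Huffman code with Python's heapq module and checks that its expected length per character lies between the entropy H and H + p_max + .086:

    import heapq
    from math import log2

    def huffman_lengths(p):
        # Returns the codeword length of each symbol in a Huffman code for p.
        # Heap entries: (probability, tie-breaker, symbols in this subtree).
        heap = [(q, i, [i]) for i, q in enumerate(p)]
        heapq.heapify(heap)
        length = [0] * len(p)
        tie = len(p)
        while len(heap) > 1:
            q1, _, s1 = heapq.heappop(heap)
            q2, _, s2 = heapq.heappop(heap)
            for i in s1 + s2:          # each merge adds one bit to every symbol below it
                length[i] += 1
            heapq.heappush(heap, (q1 + q2, tie, s1 + s2))
            tie += 1
        return length

    p = [0.4, 0.3, 0.2, 0.05, 0.05]            # assumed example distribution
    lengths = huffman_lengths(p)
    avg = sum(q * l for q, l in zip(p, lengths))
    H = sum(q * log2(1 / q) for q in p)
    print(lengths, avg, H)                     # [1, 2, 3, 4, 4]  2.0  ~1.95
    print(H <= avg <= H + max(p) + 0.086)      # True: within Gallager's bound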