COS597D: Information Theory in Computer Science                                October 19, 2011

Lecture 10

Lecturer: Mark Braverman                                              Scribe: Andrej Risteski

(Portions of these notes are based on Elements of Information Theory by T. M. Cover and J. A. Thomas.)

1  Kolmogorov Complexity

In the previous lectures we became acquainted with the notion of Shannon entropy, which is designed to capture distributions $X$ over sets, i.e. the relative frequencies of elements in a given set. Intuitively, however, we would like a notion that captures facts such as:

111111... is not very random.

3.14159... is not very random (we can write a short program outputting $\pi$ to arbitrary accuracy, and so predict the next bit easily).

784952... is fairly random.

In other words, we want to describe how compressible, or how hard to describe, a string is.

Definition 1. Let $U$ be a universal computer. A valid execution of $U$ on $p$ is one that terminates and reads the entire string $p$.

Note that the programs on which $U$ executes validly form a prefix-free set. Indeed, if $p$ is a proper prefix of $p'$ and $U$ is valid on $p$, then $U$ terminates once it is done reading $p$, so it never reads all of $p'$, and hence the execution on $p'$ cannot be valid.

Definition 2. The Kolmogorov complexity $K_U(x)$ of a string $x$ is defined as
$$K_U(x) = \min_{p : U(p) = x} l(p),$$
where $l(p)$ denotes the length of the string $p$.

In other words, the Kolmogorov complexity of $x$ is the length of the shortest program outputting $x$. Let us note that this definition is fairly robust.

Claim 3. If $U$ is a universal computer and $A$ is another computer, then there is a constant $C_A$ such that $K_U(x) \le K_A(x) + C_A$ for all strings $x$.

Proof. This is almost immediate: one can write a compiler translating programs for $A$ into programs for $U$. The compiler is some constant-sized program, so the claim holds.

With the claim above in mind, we will henceforth drop the subscript $U$ in the notation for Kolmogorov complexity. Now, let us note a few properties of Kolmogorov complexity.

Claim 4. $K(x \mid l(x)) \le l(x) + c$.

Proof. Given the length, just use the trivial encoding: a program that outputs the bits of $x$ literally.

Claim 5. $K(x) \le l(x) + 2 \log l(x) + c$.

Proof. We now also need to encode the length of the string, after which we can use the trivial encoding as above. We can do this by writing the length of the string in binary, doubling each digit of that binary representation, and terminating with the string 01. For example, if the length is 6, i.e. 110 in binary, we encode it as 11110001. This gives us the bound we need.

As a remark, we can keep applying the trick in the proof above recursively. For instance, by doing it one more time, we get $K(x) \le l(x) + \log l(x) + 2 \log\log l(x) + c$. By continuing the recursion for as long as the terms remain positive, we get an encoding of size $l(x) + \log l(x) + \log\log l(x) + \dots + c$.
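To make the encoding in the proof of Claim 5 concrete, here is a minimal Python sketch (ours, not part of the original notes) of the self-delimiting header, assuming the convention described above: the length written in binary with every bit doubled, terminated by 01.

```python
def encode_with_length(x: str) -> str:
    """Prefix-free encoding from the proof of Claim 5: write l(x) in binary,
    double every bit, terminate with '01', then append x itself.
    The total length is l(x) + 2*log l(x) + O(1)."""
    length_bits = bin(len(x))[2:]                  # l(x) written in binary
    header = "".join(b + b for b in length_bits) + "01"
    return header + x

def decode_with_length(code: str) -> str:
    """Inverse of encode_with_length: read doubled bits until the '01' marker."""
    i, length_bits = 0, ""
    while code[i:i + 2] != "01":
        assert code[i] == code[i + 1], "malformed header"
        length_bits += code[i]
        i += 2
    i += 2                                         # skip the '01' terminator
    n = int(length_bits, 2)
    return code[i:i + n]

assert decode_with_length(encode_with_length("10110")) == "10110"
assert encode_with_length("110110").startswith("11110001")   # the length-6 example
```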

Claim 6. For every $k$, the number of strings of small complexity satisfies $|\{x : K(x) < k\}| < 2^k$.

Proof. Trivial: there are fewer than $2^k$ programs of length less than $k$, so they output fewer than $2^k$ distinct strings.

Theorem 7. Kolmogorov complexity is not computable.

Proof. Suppose there is a program $Q(x)$ that outputs $K(x)$. Then consider the following program:

P(n): for each $x \in \{0,1\}^n$ in lexicographic order, if $x$ is the first string with $K(x) \ge n$ (using the description of $Q$ to check this), output $x$ and halt.

Such a string exists by Claim 6. Now, if $x_n$ denotes the string output by $P$ on input $n$, then $K(x_n) \ge n$. However, $K(x_n) \le c + 2 \log n$, by using the program above, which has a constant-size description (the description of $Q$ is just a constant as well), together with $2 \log n$ bits to encode $n$. For large enough $n$ this is a contradiction.

2  Kolmogorov Complexity and Entropy

We note here a relationship between Kolmogorov complexity and entropy.

Lemma 8. For any computer $U$,
$$\sum_{p :\, U(p)\ \text{halts}} 2^{-l(p)} \le 1.$$

Proof. This is fairly simple, and is based on the proof of Kraft's inequality in Elements of Information Theory. Take any finite set of strings $p$ such that $U(p)$ halts; we will call these strings valid in the proof to follow. We can view the valid strings as nodes in a full binary tree, where each string is located at the node reached by tracing out its sequence of bits from the root. Let $l_{\max}$ be the maximal length of any valid string, i.e. the maximal depth of the tree at which some node represents a valid string.

Consider all nodes at depth $l_{\max}$ of the tree. Each of them either represents a valid string, is a descendant of a valid string, or is neither. A valid string at depth $l$ has exactly $2^{l_{\max} - l}$ descendants at depth $l_{\max}$, and these descendant sets are disjoint, since the valid strings are prefix-free. So
$$\sum_{p\ \text{valid}} 2^{l_{\max} - l(p)} \le 2^{l_{\max}}, \qquad \text{i.e.} \qquad \sum_{p\ \text{valid}} 2^{-l(p)} \le 1,$$
as we need. Since this holds for every finite subset of the strings $p$ such that $U(p)$ halts, it holds in the infinite case as well.
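Lemma 8 is exactly Kraft's inequality applied to the prefix-free set of halting programs. As a toy illustration (not part of the notes), the following Python helper checks that a finite set of binary codewords is prefix-free and computes its Kraft sum, which by the lemma never exceeds 1.

```python
def kraft_sum(codewords):
    """Verify the codewords form a prefix-free set and return the sum of
    2^(-length) over them; for a prefix-free set this sum is at most 1."""
    for a in codewords:
        for b in codewords:
            if a != b and b.startswith(a):
                raise ValueError(f"{a!r} is a prefix of {b!r}: not prefix-free")
    return sum(2.0 ** -len(c) for c in codewords)

print(kraft_sum(["0", "10", "110", "111"]))   # 1.0  (the tree is completely used)
print(kraft_sum(["00", "01", "10"]))          # 0.75 (some volume is left free)
```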

Lemma 9. If the $x_i$ are i.i.d. random variables, then $\frac{1}{n} E[K(x_1, x_2, \dots, x_n)] \to H(x_1)$ as $n \to \infty$.

This will not be proven here; we only show the following partial result.

Theorem 10. Let $x_1, x_2, \dots, x_n$ be i.i.d. Bernoulli(1/2). Then $\Pr[K(x_1, x_2, \dots, x_n) < n - k] < 2^{-k}$.

Proof.
$$\Pr[K(x_1, x_2, \dots, x_n) < n - k] = \frac{|\{x_1 x_2 \cdots x_n : K(x_1, x_2, \dots, x_n) < n - k\}|}{2^n} < \frac{2^{n-k}}{2^n} = 2^{-k}.$$

Finally, we give a few definitions to indicate how one might generalize these notions to infinite strings.

Definition 11. A sequence $x_1, x_2, \dots, x_n$ is algorithmically random if $K(x_1, x_2, \dots, x_n \mid n) \ge n$.

Definition 12. An infinite string $x$ is incompressible if $\lim_{n \to \infty} \frac{K(x_1, x_2, \dots, x_n \mid n)}{n} = 1$.

3  Universal Probability

In this section we define universal probability, a probability distribution on all binary strings that favors the ones with short programs outputting them. The name universal is natural here: in nature we usually prefer shorter and simpler explanations (i.e. Occam's Razor), and this notion captures exactly that. Equivalently, it captures the following idea: if a monkey were to start hitting a typewriter, and the text it types out were interpreted as a program, which output strings of these programs are more likely to be produced?

Definition 13. The universal probability $P_U(x)$ of a string $x$ is defined as
$$P_U(x) = \sum_{p :\, U(p) = x} 2^{-l(p)} = \Pr[U(p) = x],$$
where the probability is over a program $p$ whose bits are obtained by fair coin flips.

Note that, by the same compiler argument as in Claim 3, $P_U(x) \ge 2^{-C_A} P_A(x)$ for any other universal computer $A$.

3.1  Kolmogorov Complexity vs Universal Probability

We now prove the main result regarding universal probability, relating it to Kolmogorov complexity.

Theorem 14. For some constant $C$,
$$2^{-K(x)} \le P_U(x) \le C \cdot 2^{-K(x)}.$$

Proof. The proof here is based on the exposition in Elements of Information Theory. The part $P_U(x) \ge 2^{-K(x)}$ is trivial: by definition, the sum defining $P_U(x)$ includes the term $2^{-K(x)}$. We can rewrite the second part as $K(x) \le \log \frac{1}{P_U(x)} + C$. In other words, we wish to find a short program for strings with high $P_U(x)$. The problem is that $P_U(x)$ is not computable, so despite the fact that we know it will be large for strings with short programs, we cannot just brute-force over all programs of length about $\log \frac{1}{P_U(x)}$ and find one. So we have to be somewhat clever.

The program witnessing the bound on $K(x)$ will simulate the universal computer $U$ on all programs in lexicographic order, keeping an estimate of $P_U(x')$ for every string $x'$ as it goes along. Then we will specify a particular short string which allows the program to recover $x$ from these estimates.

We can simulate $U$ on all programs as follows. Let $p_1, p_2, \dots$ be the programs in lexicographic order, and let $x_1, x_2, \dots$ be the corresponding outputs after running them (if they terminate, of course). We run the simulation in stages: in stage $i$, we run programs $p_1, p_2, \dots, p_i$ for $i$ steps each. That way, each program that terminates will eventually be run until it terminates. We handle the terminations of programs as follows. If the program $p_k$ terminates in exactly $i$ steps in the $i$-th stage (i.e. we see it terminate for the first time), we increase the estimate $\tilde{P}_U(x_k)$ by $2^{-l(p_k)}$, and we recompute $n_k = \lceil \log \frac{1}{\tilde{P}_U(x_k)} \rceil$.
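The staged ("dovetailed") simulation just described can be sketched in Python. This is only a schematic illustration: run_U(p, steps) is a hypothetical interpreter for the universal computer U, assumed to run program p for the given number of steps and return its output, or None if it has not halted yet; no such interpreter is supplied here.

```python
import itertools

def programs_in_order():
    """Yield all binary strings: shorter first, lexicographic within a length."""
    for n in itertools.count(1):
        for bits in itertools.product("01", repeat=n):
            yield "".join(bits)

def dovetail(run_U, num_stages):
    """In stage i, run programs p_1, ..., p_i for i steps each.  Whenever a
    program p is first seen to halt with output x, the running estimate of
    P_U(x) grows by 2^(-l(p)).  Returns the estimates after num_stages stages."""
    estimate = {}                              # x -> current lower bound on P_U(x)
    halted = set()
    progs = list(itertools.islice(programs_in_order(), num_stages))
    for i in range(1, num_stages + 1):
        for p in progs[:i]:
            if p in halted:
                continue
            x = run_U(p, i)                    # None means "not halted within i steps"
            if x is not None:
                halted.add(p)
                estimate[x] = estimate.get(x, 0.0) + 2.0 ** (-len(p))
    return estimate
```

In the proof the simulation never stops; the short program for $x$ only needs to run it until the tree node named by its description (built next) gets assigned.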

Now, we build a binary tree in whose nodes triplets $(p_k, n_k, x_k)$ will live. We assign the triplets to the nodes incrementally. Namely, every time a program $p_k$ outputs a string $x_k$, we check whether $n_k$ has changed after updating $\tilde{P}_U(x_k)$. If it has, we find an unassigned node on level $n_k + 1$ and assign it the triplet $(p_k, n_k, x_k)$.

There are two things we need to prove. One is that using this construction we get a program of the required size outputting $x$. The other is that it is always possible to place the triplets as specified above (i.e. whenever we need to put a node on some level, there is at least one unassigned node on that level).

For the first part, the encoding of $x$ can be the path to the lowest-depth node (i.e. the one closest to the root) containing a triplet $(p_k, n_k, x_k)$ with $x_k = x$. By construction, that depth is at most $\lceil \log \frac{1}{P_U(x)} \rceil + 1 \le \log \frac{1}{P_U(x)} + 2$, so $K(x) \le \log \frac{1}{P_U(x)} + C$, as we need.

For the second part, the proof resembles the Kraft inequality proof mentioned above. We sketch it here as well, because the idea is simple but elegant, and it is used in other places too. We state it in the form of a general lemma.

Lemma 15. Let $r_1, r_2, \dots, r_n$ be a sequence of requests, where request $i$ asks to place item $i$ at depth $r_i$ of a binary coding tree so that the assigned nodes form a prefix-free set. Then it is possible to satisfy all of the requests online if $\sum_{i=1}^{n} 2^{-r_i} \le 1$.

Proof. We can intuitively think of $2^{-r_i}$ as the volume of request $i$, since its node blocks all nodes below it at greater depths. So, really, the statement of the lemma says that the obvious necessary condition on the volumes is also sufficient. We prove that the trivial placement strategy, greedy leftmost, actually works: whenever we get a new request $r_i$, we place the item at the leftmost available node at depth $r_i$.

The proof is by induction on $n$. The case $n = 1$ is trivial. So suppose the claim holds whenever the number of requests is less than $n$, and suppose we get $n$ requests. Suppose also, for the sake of contradiction, that we get stuck inserting $r_n$. Let $k$ be the depth of the deepest element in the right subtree $R$ of the root (WLOG assume $R$ is not empty; otherwise the claim is trivial, since we could just insert the request in the right subtree and get a contradiction immediately). We prove that in this case both subtrees have to be fairly full, and the volume constraint is violated.

Neither an element at depth $k$ nor the request $r_n$ fit into the left subtree $L$: otherwise, in the former case greedy leftmost would have put the depth-$k$ element in the left subtree, and in the latter case we could have satisfied $r_n$ by putting it in the left subtree. Let us define $l := \max(k, r_n)$, and let $d_1, d_2, \dots, d_m$ be the depths of the nodes assigned to the left subtree $L$, with total volume $w(L) = \sum_{i=1}^{m} 2^{-d_i}$. Then, since neither an element at depth $k$ nor one at depth $r_n$ fits into the left subtree, $w(L) + 2^{-l} > \frac{1}{2}$. Similarly, since $r_n$ did not fit in $R$, $w(R) + 2^{-r_n} > \frac{1}{2}$. Furthermore, all elements of $R$ are at depth at most $l$, so $w(R)$ and $2^{-r_n}$ are both integer multiples of $2^{-l}$; hence $w(R) + 2^{-r_n} \ge \frac{1}{2} + 2^{-l}$. Putting everything together,
$$w(L) + w(R) + 2^{-r_n} > \left(\frac{1}{2} - 2^{-l}\right) + \left(\frac{1}{2} + 2^{-l}\right) = 1,$$
which contradicts the volume constraint. Hence the claim follows by induction.
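Here is a small Python sketch (our illustration, with made-up function names) of the greedy-leftmost strategy from the proof of Lemma 15. The free subtree roots are kept in left-to-right order; a request at depth r takes the leftmost free subtree that still contains a node at depth r and walks down its leftmost path, returning the right siblings passed along the way to the free list.

```python
def place_requests(depths):
    """Serve placement requests online as in Lemma 15.  `depths` is the
    sequence r_1, r_2, ...; each request is assigned a node, written as its
    root-to-node path over {0,1}, and the assigned nodes are prefix-free.
    This succeeds whenever sum(2**-r for r in depths) <= 1."""
    free = [""]                                   # free subtree roots, leftmost first
    assigned = []
    for r in depths:
        i = next((j for j, node in enumerate(free) if len(node) <= r), None)
        if i is None:
            raise ValueError("request does not fit: volume constraint violated")
        node = free.pop(i)
        released = []
        while len(node) < r:                      # walk down the leftmost path
            released.append(node + "1")           # the right sibling becomes free
            node += "0"
        free[i:i] = reversed(released)            # keep the free list left-to-right
        assigned.append(node)
    for a in assigned:                            # sanity check: prefix-free
        for b in assigned:
            assert a == b or not b.startswith(a)
    return assigned

print(place_requests([2, 1, 3, 3]))               # ['00', '1', '010', '011']
```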

With the lemma above, the last piece of the proof is almost complete. Once we verify that the sequence of requests arising in our particular case satisfies the constraint of Lemma 15, we are done. But this is easy. In our case the requests ask to place triplets at levels $n_k + 1$, so
$$\sum_{i} 2^{-r_i} = \sum_{x \in \{0,1\}^*} \ \sum_{n_k :\ \text{there is a request to insert } x \text{ at level } n_k + 1} 2^{-(n_k + 1)} = \frac{1}{2} \sum_{x \in \{0,1\}^*} \ \sum_{n_k :\ \text{there is a request to insert } x \text{ at level } n_k + 1} 2^{-n_k}.$$

But now, since we insert a new triplet for $x$ only when the estimate $n_k$ changes, and the estimate stops changing once the estimated value $\tilde{P}_U(x)$ reaches its actual value $P_U(x)$, the values $n_k$ in the inner sum are distinct integers, each at least $\lceil \log \frac{1}{P_U(x)} \rceil$. So the above summation is at most
$$\frac{1}{2} \sum_{x \in \{0,1\}^*} \sum_{i=0}^{\infty} 2^{-\lceil \log \frac{1}{P_U(x)} \rceil - i} = \sum_{x \in \{0,1\}^*} 2^{-\lceil \log \frac{1}{P_U(x)} \rceil} \le \sum_{x \in \{0,1\}^*} P_U(x) \le 1,$$
where the last inequality comes from Lemma 8. This finishes the proof.
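As a small sanity check of the volume computation above (a toy example with made-up estimate histories, not part of the notes), one can list the request depths $n_k + 1$ generated for a few strings and confirm that their total volume stays below 1, so Lemma 15 indeed applies; the resulting depths could be fed directly to the place_requests sketch given after Lemma 15.

```python
import math

def request_depths(estimates):
    """Given the increasing sequence of estimates of P_U(x) for one string x,
    return the depths n_k + 1 of the placement requests issued for x: a new
    request is made whenever n_k = ceil(log2(1/estimate)) decreases."""
    depths, last_n = [], None
    for est in estimates:
        n_k = math.ceil(math.log2(1.0 / est))
        if last_n is None or n_k < last_n:
            depths.append(n_k + 1)
            last_n = n_k
    return depths

# hypothetical estimate histories for two strings, each growing toward P_U(x)
depths = request_depths([1 / 64, 1 / 16, 1 / 5]) + request_depths([1 / 32, 1 / 3])
print(depths)                                     # [7, 5, 4, 6, 3]
assert sum(2.0 ** -d for d in depths) <= 1.0      # the condition of Lemma 15 holds
```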