Lecture 1 (September 25, 2017): A quick reminder about random variables and convexity


Information and Coding Theory (Autumn 2017)
Lecture 1: September 25, 2017
Lecturer: Madhur Tulsiani

1 Administrivia

This course will cover some basic concepts in information and coding theory, and their applications to statistics, machine learning and theoretical computer science. The course will have 4-5 homeworks (60 % of the grade) and a final (60 %). The homeworks will be posted on the course homepage and announced in class, and will be due about one week after they are posted.

The only pre-requisite for the course is familiarity with discrete probability and random variables. Some knowledge of finite fields will help with the coding theory part, though we will briefly review the relevant concepts from algebra. We will not follow any single textbook, though the book Elements of Information Theory by T. M. Cover and J. A. Thomas is a good reference for most of the material we will cover. The Resources section on the course page also contains links to some other similar courses. Check the course webpage regularly for updates.

2 A quick reminder about random variables and convexity

2.1 Random variables

Let Ω be a finite set, and let µ : Ω → [0, 1] be a function such that

    ∑_{ω ∈ Ω} µ(ω) = 1.

We often refer to Ω as a sample space and to the function µ as a probability distribution on this space. A real-valued random variable over Ω is any function X : Ω → R, and we define its expectation as

    E[X] = ∑_{ω ∈ Ω} µ(ω) · X(ω).
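To make these definitions concrete, here is a minimal Python sketch that computes E[X] directly from the definition above. The sample space, the distribution µ and the random variable X in it are made-up illustrative choices, not examples from the course.

```python
# A minimal sketch of the definitions above. The sample space, the
# distribution mu and the random variable X are illustrative choices,
# not taken from the course.

omega = ["HH", "HT", "TH", "TT"]           # finite sample space (two fair coin flips)
mu = {w: 0.25 for w in omega}              # probability distribution on omega
X = {w: w.count("H") for w in omega}       # random variable X(w) = number of heads

assert abs(sum(mu.values()) - 1.0) < 1e-9  # mu sums to 1, so it is a distribution

# E[X] = sum over the sample space of mu(w) * X(w)
expectation = sum(mu[w] * X[w] for w in omega)
print(expectation)                         # 1.0
```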

We will also think of a random variable X as given by its distribution: if U is the finite set of values taken by X, we can think of the probability distribution on U given by

    p(x) = P[X = x] = ∑_{ω : X(ω) = x} µ(ω)    for all values x ∈ U.

Throughout these notes, we will use uppercase letters to denote random variables and lowercase letters to denote their values.

2.2 Convexity and Jensen's inequality

A set S ⊆ R^n is said to be a convex subset of R^n if the line segment joining any two points in S lies entirely in S, i.e., for all x, y ∈ S and for all α ∈ [0, 1],

    αx + (1 − α)y ∈ S.

For a convex set S ⊆ R^n, a function f : S → R is said to be a convex function on S if for all x, y ∈ S and for all α ∈ [0, 1], we have

    f(αx + (1 − α)y) ≤ αf(x) + (1 − α)f(y).

Equivalently, we say that the function f is convex if the set S_f = {(x, z) : z ≥ f(x)} is a convex subset of R^{n+1}. A function which satisfies the opposite inequality, i.e., for all x, y ∈ S and α ∈ [0, 1],

    f(αx + (1 − α)y) ≥ αf(x) + (1 − α)f(y),

is said to be a concave function. Note that if f is a convex function then −f is a concave function, and vice versa. For a single-variable function f : R → R which is twice differentiable, we can also use the easier criterion that f is convex on S ⊆ R if and only if f''(x) ≥ 0 for all x ∈ S.

We will frequently use the following inequality about convex functions.

Lemma 2.1 (Jensen's inequality) Let S ⊆ R^n be a convex set and let X be a random variable taking values only inside S. Then, for a convex function f : S → R, we have E[f(X)] ≥ f(E[X]). Equivalently, for a concave function f : S → R, we have E[f(X)] ≤ f(E[X]).

Note that the definition of convexity is the same as the statement of Jensen's inequality for a random variable taking only two values: x with probability α and y with probability 1 − α. You can try the following exercises to familiarize yourself with this inequality.

Exercise 2.2 Prove Jensen's inequality when the random variable X has finite support.

Exercise 2.3 Check that f(x) = x^2 is a convex function on R.

Exercise 2.4 Prove the Cauchy-Schwarz inequality using Jensen's inequality.
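As a quick numerical sanity check of Lemma 2.1 (and of Exercise 2.3), the following sketch compares E[f(X)] with f(E[X]) for the convex function f(x) = x^2. The values and probabilities below are arbitrary illustrative choices, not data from the notes.

```python
# Numerical sanity check of Jensen's inequality for the convex function
# f(x) = x^2: we expect E[f(X)] >= f(E[X]).

values = [-1.0, 0.5, 2.0, 3.0]
probs  = [0.1, 0.4, 0.3, 0.2]          # a probability distribution (sums to 1)

f = lambda x: x ** 2                   # convex on R, since f''(x) = 2 >= 0

E_X  = sum(p * x for p, x in zip(probs, values))
E_fX = sum(p * f(x) for p, x in zip(probs, values))

print(E_fX, f(E_X))                    # E[f(X)] and f(E[X])
assert E_fX >= f(E_X)                  # Jensen: E[f(X)] >= f(E[X])
```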

3 Entropy

The concepts from information theory are applicable in many areas, as information theory gives a precise mathematical way of stating and answering the following question: how much information is revealed by the outcome of a random event?

Let us begin with a few simple examples. Let X be a random variable which takes the value a with probability 1/2 and b with probability 1/2. We can then describe the value of X using one bit, say 0 for a and 1 for b. Similarly, if X takes one of the n values {a_1, ..., a_n}, each with probability 1/n, then we can describe the outcome using ⌈log_2 n⌉ bits.

The concept of entropy is basically an extrapolation of this idea to the case when the different outcomes do not occur with equal probability. We think of the information content of an event that occurs with probability p as being log_2(1/p). If a random variable X is distributed over a universe U = {a_1, ..., a_n} such that it takes value x ∈ U with probability p(x), then we define the entropy of the random variable X as

    H(X) = ∑_{x ∈ U} p(x) · log(1/p(x)),

where log denotes log_2. The following basic property of entropy is extremely useful in applications to counting problems.

Proposition 3.1 Let X be a random variable taking values in a finite set U as above. Then 0 ≤ H(X) ≤ log |U|.

Proof: Since p(x) ≤ 1, we have log(1/p(x)) ≥ 0 for all x ∈ U, and hence H(X) ≥ 0. For the upper bound, consider a random variable Y which takes the value 1/p(x) with probability p(x). Since log is a concave function, we can use Jensen's inequality to say that

    H(X) = ∑_{x ∈ U} p(x) · log(1/p(x)) = E[log Y] ≤ log E[Y] = log(∑_{x ∈ U} p(x) · (1/p(x))) = log |U|.
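The sketch below computes H(X) directly from the definition and checks the bounds of Proposition 3.1; it also checks that the uniform distribution attains the upper bound, which is where the Jensen step in the proof is tight. The helper entropy() and the particular distribution are my own illustrative choices.

```python
import math

# Entropy of a distribution p over a finite universe U, with log base 2.
def entropy(p):
    # H(X) = sum_x p(x) * log2(1 / p(x)); terms with p(x) = 0 contribute 0.
    return sum(px * math.log2(1.0 / px) for px in p.values() if px > 0)

# An illustrative distribution over U = {a, b, c, d} (not from the notes).
p = {"a": 0.5, "b": 0.25, "c": 0.125, "d": 0.125}

H = entropy(p)
print(H)                                   # 1.75 bits for this distribution

# Proposition 3.1: 0 <= H(X) <= log2 |U|, with equality on the right
# for the uniform distribution.
assert 0 <= H <= math.log2(len(p)) + 1e-9
uniform = {x: 1.0 / len(p) for x in p}
assert abs(entropy(uniform) - math.log2(len(p))) < 1e-9
```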

4 Source Coding

We will now attempt to make precise the intuition that a random variable X takes H(X) bits to describe on average. We shall need the notion of prefix-free codes, as defined below.

Definition 4.1 A code for a set U over an alphabet Σ is a map C : U → Σ* which maps each element of U to a finite string over the alphabet Σ. We say that a code is prefix-free if for any x, y ∈ U such that x ≠ y, C(x) is not a prefix of C(y), i.e., C(y) ≠ C(x)σ for any σ ∈ Σ*.

For now, we will just use Σ = {0, 1}, and for the rest of the lecture, "prefix-free code" will mean a prefix-free code over {0, 1}. The image C(x) of an element x is also referred to as the codeword for x.

Note that a prefix-free code has the convenient property that if we are receiving a stream of coded symbols, we can decode them online: as soon as we see C(x) for some x ∈ U, we know that what we have received so far cannot be a prefix of C(y) for any y ≠ x.
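As an illustration of this online-decoding property, here is a small sketch that first verifies that a code is prefix-free and then decodes a concatenated bit stream one codeword at a time. The code C and the helper decode() are made-up examples of mine, not taken from the lecture.

```python
from itertools import combinations

# An illustrative prefix-free code over {0, 1} (a made-up example).
C = {"a": "0", "b": "10", "c": "110", "d": "111"}

# Check the prefix-free property: no codeword is a prefix of another.
assert all(not C[x].startswith(C[y]) and not C[y].startswith(C[x])
           for x, y in combinations(C, 2))

def decode(bits, code):
    """Online decoding: output a symbol as soon as the buffer equals a codeword."""
    inverse = {w: x for x, w in code.items()}
    out, buf = [], ""
    for b in bits:
        buf += b
        if buf in inverse:        # a full codeword has been received
            out.append(inverse[buf])
            buf = ""
    assert buf == ""              # the stream ends on a codeword boundary
    return out

stream = "".join(C[x] for x in "abcdba")
print(decode(stream, C))          # ['a', 'b', 'c', 'd', 'b', 'a']
```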

The following inequality gives a characterization of the lengths of the codewords in a prefix-free code. It will help us prove both upper and lower bounds, in terms of entropy, on the expected length of a codeword in a prefix-free code.

Proposition 4.2 (Kraft's inequality) Let |U| = n. There exists a prefix-free code for U over {0, 1} with codeword lengths l_1, ..., l_n if and only if

    ∑_{i=1}^{n} 2^{−l_i} ≤ 1.

For codes over a larger alphabet Σ, we replace 2^{−l_i} above by |Σ|^{−l_i}.

Proof: We first prove the "only if" part. Let C be a prefix-free code with codeword lengths l_1, ..., l_n, and let l = max{l_1, ..., l_n}. Consider an experiment where we generate l random bits. For x ∈ U, let E_x denote the event that the first |C(x)| bits we generate are equal to C(x). Note that since C is a prefix-free code, E_x and E_y are mutually exclusive for x ≠ y. Moreover, the probability that E_x happens is exactly 1/2^{|C(x)|}. This gives

    1 ≥ P[⋃_{x ∈ U} E_x] = ∑_{x ∈ U} P[E_x] = ∑_{x ∈ U} 2^{−|C(x)|} = ∑_{i=1}^{n} 2^{−l_i}.

For the "if" part, given l_1, ..., l_n satisfying ∑_i 2^{−l_i} ≤ 1, we will construct a prefix-free code with these codeword lengths. Without loss of generality, we can assume that l_1 ≤ l_2 ≤ ... ≤ l_n = l. It will be useful here to think of all binary strings of length at most l as a complete binary tree: the root corresponds to the empty string, and each node at depth d corresponds to a string of length d. For a node corresponding to a string s, its left and right children correspond respectively to the strings s0 and s1. The tree has 2^l leaves, corresponding to all strings in {0, 1}^l.

We will now construct our code by choosing nodes at depths l_1, ..., l_n in this tree. When we select a node, we will delete the entire subtree below it. This will maintain the prefix-free property of the code. We first choose an arbitrary node s_1 at depth l_1 as the codeword of length l_1 and delete the subtree below it. This deletes a 1/2^{l_1} fraction of the leaves. Since there are still leaves left in the tree, there exists a surviving node, say s_2, at depth l_2. Also, s_1 cannot be a prefix of s_2, since s_2 does not lie in the subtree below s_1. We can similarly proceed to choose the other codewords: at each step, some leaves are left in the tree, since ∑_i 2^{−l_i} ≤ 1.

As was noted in the lecture, we need to carry out this argument in increasing order of lengths. Otherwise, if we choose longer codewords first, we may later have to choose a shorter codeword which does not lie on the path from the root to any of the longer codewords, and this may not always be possible: e.g., there exists a code with lengths 1, 2, 2, but if we choose the strings 01 and 10 first, then there is no way to choose a codeword of length 1 which is not a prefix of one of them.
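The tree argument in the "if" direction translates into a short greedy procedure: process the lengths in increasing order and always take the leftmost node of the required depth that has not been deleted. The sketch below is one way to implement that idea (a sketch of the proof, not code from the course; the function name is mine); the assertion fails exactly when Kraft's inequality is violated.

```python
from fractions import Fraction

def prefix_free_code_from_lengths(lengths):
    """Build binary codewords with the given lengths, following the proof of
    Proposition 4.2: assign codewords in increasing order of length, always
    taking the leftmost available node of the binary tree at that depth."""
    assert sum(Fraction(1, 2 ** l) for l in lengths) <= 1, "Kraft's inequality fails"
    order = sorted(range(len(lengths)), key=lambda i: lengths[i])
    codewords = [None] * len(lengths)
    value, prev_len = 0, 0
    for i in order:
        l = lengths[i]
        value <<= (l - prev_len)          # descend to depth l in the tree
        codewords[i] = format(value, "b").zfill(l)
        value += 1                        # the subtree below this node is now deleted
        prev_len = l
    return codewords

# Lengths 1, 2, 2 from the remark above do admit a prefix-free code.
print(prefix_free_code_from_lengths([1, 2, 2]))   # ['0', '10', '11']
```

For lengths such as 1, 1, 2, which violate Kraft's inequality, the assertion fires, in line with the "only if" direction.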

The Shannon code: Given a random variable X taking values in U, we now construct a prefix-free code for conveying the value of X using at most H(X) + 1 bits on average over the distribution of X. For an element x ∈ U which occurs with probability p(x), we will use a codeword of length ⌈log(1/p(x))⌉. By Proposition 4.2, there exists a prefix-free code with these codeword lengths, since

    ∑_{x ∈ U} 2^{−|C(x)|} = ∑_{x ∈ U} 2^{−⌈log(1/p(x))⌉} ≤ ∑_{x ∈ U} 2^{−log(1/p(x))} = ∑_{x ∈ U} p(x) = 1.

Also, the expected number of bits used is

    ∑_{x ∈ U} p(x) · ⌈log(1/p(x))⌉ ≤ ∑_{x ∈ U} p(x) · (log(1/p(x)) + 1) = H(X) + 1.

This code is known as the Shannon code.

Of course, the upper bound above would not be very impressive if it were possible to do much better using some other code. However, we will now show that any prefix-free code must use at least H(X) bits on average.

Claim 4.3 Let X be a random variable taking values in U and let C be a prefix-free code for U. Then the expected number of bits used by C to communicate the value of X is at least H(X).

Proof: The expected number of bits used is ∑_{x ∈ U} p(x) · |C(x)|. We consider the quantity

    H(X) − ∑_{x ∈ U} p(x) · |C(x)| = ∑_{x ∈ U} p(x) · log(1/p(x)) − ∑_{x ∈ U} p(x) · |C(x)| = ∑_{x ∈ U} p(x) · log(2^{−|C(x)|}/p(x)).

Consider a random variable Y which takes the value 2^{−|C(x)|}/p(x) with probability p(x). The above expression then becomes E[log Y]. Using Jensen's inequality gives

    E[log Y] ≤ log E[Y] = log(∑_{x ∈ U} p(x) · 2^{−|C(x)|}/p(x)) = log(∑_{x ∈ U} 2^{−|C(x)|}),

which is non-positive since ∑_{x ∈ U} 2^{−|C(x)|} ≤ 1 by Proposition 4.2. Hence the expected number of bits used by C is at least H(X).
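Combining the Shannon code's upper bound with Claim 4.3, the expected codeword length of the Shannon code is sandwiched between H(X) and H(X) + 1. The following sketch checks this numerically; the distribution used is an arbitrary illustrative choice, not one from the notes.

```python
import math

# An illustrative distribution (not from the notes).
p = {"a": 0.4, "b": 0.3, "c": 0.2, "d": 0.1}

# Shannon code: element x gets a codeword of length ceil(log2(1 / p(x))).
lengths = {x: math.ceil(math.log2(1.0 / px)) for x, px in p.items()}

# These lengths satisfy Kraft's inequality, so a prefix-free code exists (Prop. 4.2).
assert sum(2.0 ** -l for l in lengths.values()) <= 1.0

H = sum(px * math.log2(1.0 / px) for px in p.values())
expected_length = sum(px * lengths[x] for x, px in p.items())

# Claim 4.3 and the Shannon-code bound: H(X) <= E[|C(X)|] <= H(X) + 1.
print(H, expected_length)
assert H <= expected_length <= H + 1
```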