Coding and Information Theory
Chris Williams, School of Informatics, University of Edinburgh

Overview
- What is information theory?
- Entropy
- Coding
- Rate-distortion theory
- Mutual information
- Channel capacity

Information Theory
Shannon (1948). Information theory is concerned with:
- Source coding: reducing redundancy by modelling the structure in the data
- Channel coding: how to deal with noisy transmission
The key idea is prediction:
- Source coding: redundancy means that the rest of the data is predictable given part of it
- Channel coding: predict what was sent given what was received

Information Theory Textbooks
- Elements of Information Theory. T. M. Cover and J. A. Thomas. Wiley, 1991. [comprehensive]
- Coding and Information Theory. R. W. Hamming. Prentice-Hall, 1980. [introductory]
- Information Theory, Inference and Learning Algorithms. D. J. C. MacKay. CUP, 2003. [available online (viewing only): http://www.inference.phy.cam.ac.uk/mackay/itila]

Entropy
A discrete random variable X takes on values from an alphabet 𝒳 and has probability mass function P(x) = P(X = x) for x ∈ 𝒳. The entropy H(X) of X is defined as

H(X) = -\sum_{x \in \mathcal{X}} P(x) \log P(x)

with the convention that for P(x) = 0, 0 log(1/0) ≡ 0. The entropy measures the information content or uncertainty of X. Units: bits for log base 2, nats for log base e.
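As a concrete illustration of the definition, here is a minimal Python sketch that computes the entropy of a discrete pmf, using the 0 log(1/0) ≡ 0 convention; the example probabilities are assumed purely for illustration.

import math

def entropy(pmf, base=2.0):
    # H(X) = -sum_x P(x) log P(x), with the convention 0 log(1/0) = 0.
    return -sum(p * math.log(p, base) for p in pmf if p > 0)

print(entropy([0.5, 0.5]))   # fair coin: 1.0 bit
print(entropy([0.9, 0.1]))   # biased coin: about 0.47 bits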

Joint entropy, conditional entropy

H(X, Y) = -\sum_{x,y} P(x, y) \log P(x, y)

H(Y|X) = \sum_x P(x) H(Y|X = x)
       = -\sum_x P(x) \sum_y P(y|x) \log P(y|x)
       = -E_{P(x,y)}[\log P(y|x)]

Chain rule: H(X, Y) = H(X) + H(Y|X). If X and Y are independent, H(X, Y) = H(X) + H(Y).

Coding theory
A coding scheme C assigns a codeword C(x) to every symbol x; C(x) has length l(x). The expected code length L(C) of the code is

L(C) = \sum_{x \in \mathcal{X}} p(x) l(x)

Theorem 1 (noiseless coding theorem): the expected length L(C) of any instantaneous code for X is bounded below by H(X), i.e. L(C) ≥ H(X).

Theorem 2: there exists an instantaneous code such that H(X) ≤ L(C) < H(X) + 1.

Practical coding methods
How can we come close to the lower bound?
- Huffman coding achieves H(X) ≤ L(C) < H(X) + 1; blocking symbols together reduces the extra bit per symbol to an arbitrarily small amount.
- Arithmetic coding.
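To make Theorems 1 and 2 concrete, the sketch below builds a Huffman code with Python's heapq and compares its expected length to the entropy. The distribution is an assumed toy example, and the routine returns only the codeword lengths, not the codewords themselves.

import heapq
import math

def huffman_code_lengths(pmf):
    # Build a Huffman tree over {symbol: prob} and return {symbol: codeword length}.
    # Heap entries are (probability, tiebreak counter, {symbol: depth so far}).
    heap = [(p, i, {s: 0}) for i, (s, p) in enumerate(pmf.items())]
    heapq.heapify(heap)
    count = len(heap)
    while len(heap) > 1:
        p1, _, a = heapq.heappop(heap)
        p2, _, b = heapq.heappop(heap)
        merged = {s: l + 1 for s, l in {**a, **b}.items()}  # merging adds one bit of depth
        heapq.heappush(heap, (p1 + p2, count, merged))
        count += 1
    return heap[0][2]

pmf = {'a': 0.5, 'b': 0.25, 'c': 0.125, 'd': 0.125}
lengths = huffman_code_lengths(pmf)
L = sum(pmf[s] * lengths[s] for s in pmf)            # expected code length L(C)
H = -sum(p * math.log2(p) for p in pmf.values())     # entropy H(X) in bits
print(lengths, L, H)   # for dyadic probabilities L equals H exactly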

Coding with the wrong probabilities
Say we use the wrong probabilities q_i to construct a code. Then the expected code length is

L(C_q) = -\sum_i p_i \log q_i

But

\sum_i p_i \log \frac{p_i}{q_i} > 0  if q ≠ p,

so L(C_q) - H(X) > 0, i.e. using the wrong probabilities increases the minimum attainable average code length.

Coding real data
So far we have discussed coding sequences of iid random variables. But, for example, the pixels in an image are not iid RVs. So what do we do? Consider an image having N pixels, each of which can take on k grey-level values, as a single RV taking on k^N values. We would then need to estimate probabilities for all k^N different images in order to code a particular image properly, which is rather difficult for large k and N. One solution is to chop images into blocks, e.g. 8 × 8 pixels, and code each block separately. Another is predictive encoding: try to predict the current pixel value given nearby context. Successful prediction reduces uncertainty, since H(X_1, X_2) = H(X_1) + H(X_2|X_1).

Rate-distortion theory
What happens if we can't afford enough bits to code all of the symbols exactly? We must be prepared for lossy compression, where two different symbols are assigned the same code. In order to minimize the errors this causes, we need a distortion function d(x_i, x_j) which measures how much error is caused when symbol x_i codes for x_j. The k-means algorithm is a method of choosing codebook vectors so as to minimize the expected distortion.
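A small numerical sketch of the wrong-probabilities penalty, with distributions assumed only for illustration: the expected length under the wrong model is the cross-entropy, and the excess over H(X) is the relative entropy KL(p||q).

import math

def entropy_bits(p):
    return -sum(pi * math.log2(pi) for pi in p if pi > 0)

def cross_entropy_bits(p, q):
    # Expected code length when symbols with true probabilities p are coded
    # with lengths -log2 q(x) (ignoring integer-length rounding).
    return -sum(pi * math.log2(qi) for pi, qi in zip(p, q) if pi > 0)

p = [0.5, 0.25, 0.25]   # true source distribution (illustrative)
q = [1/3, 1/3, 1/3]     # wrong model used to build the code
H = entropy_bits(p)
L = cross_entropy_bits(p, q)
print(H, L, L - H)      # the gap L - H is KL(p || q) >= 0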

Source coding
Patterns that we observe have a lot of structure; e.g. visual scenes that we care about don't look like snow on a TV. This gives rise to redundancy, i.e. observing part of a scene helps us predict other parts. This redundancy can be exploited to code the data efficiently (lossless compression).

Q: Why is coding so important?
A: Because of the lossless coding theorem: the best probabilistic model of the data will have the shortest code. Source coding thus gives us a way of comparing and evaluating different models of data, and of searching for good ones. Usually we will build models with hidden variables, giving a new representation of the data.

Mutual information

I(X; Y) = KL(p(x, y) || p(x)p(y)) ≥ 0
        = \sum_{x,y} p(x, y) \log \frac{p(x, y)}{p(x) p(y)} = I(Y; X)
        = \sum_{x,y} p(x, y) \log \frac{p(x|y)}{p(x)}
        = H(X) - H(X|Y)
        = H(X) + H(Y) - H(X, Y)

Mutual information is a measure of the amount of information that one RV contains about another. It is the reduction in uncertainty of one RV due to knowledge of the other. The mutual information is zero if X and Y are independent.
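As a sketch of the first expression above, the following Python function computes I(X; Y) directly from a joint pmf table; the two example tables are assumed for illustration.

import math

def mutual_information_bits(joint):
    # I(X;Y) in bits from a joint pmf given as a 2-D list joint[x][y].
    px = [sum(row) for row in joint]
    py = [sum(col) for col in zip(*joint)]
    I = 0.0
    for i, row in enumerate(joint):
        for j, pxy in enumerate(row):
            if pxy > 0:
                I += pxy * math.log2(pxy / (px[i] * py[j]))
    return I

# Independent variables give I = 0; a deterministic relation gives I = H(X).
print(mutual_information_bits([[0.25, 0.25], [0.25, 0.25]]))  # 0.0
print(mutual_information_bits([[0.5, 0.0], [0.0, 0.5]]))      # 1.0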

Mutual Information Examples
Y_1 = smoking status (rows), Y_2 = lung disease status (columns).

Example 1 (complete dependence):

               lung disease   no lung disease
  smoker           1/3              0
  non-smoker        0              2/3

Example 2 (independence):

               lung disease   no lung disease
  smoker           1/9             2/9
  non-smoker       2/9             4/9

Continuous variables
For continuous random variables the mutual information is

I(Y_1; Y_2) = \int P(y_1, y_2) \log \frac{P(y_1, y_2)}{P(y_1) P(y_2)} \, dy_1 \, dy_2

For jointly Gaussian Y_1, Y_2 with correlation coefficient ρ this gives I(Y_1; Y_2) = -\frac{1}{2} \log(1 - ρ^2).

PCA and mutual information
Linsker (1988), principle of maximum information preservation. Consider a random variable Y = a^T X + ε, with a^T a = 1. How do we maximize I(Y; X)?

I(Y; X) = H(Y) - H(Y|X)

But H(Y|X) is just the entropy of the noise term ε. If X has a joint multivariate Gaussian distribution then Y will have a Gaussian distribution. The (differential) entropy of a Gaussian N(µ, σ^2) is \frac{1}{2} \log(2πeσ^2). Hence we maximize information preservation by choosing a to give Y maximum variance, subject to the constraint a^T a = 1.
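A small numerical sketch of the Linsker argument, with the covariance and noise variance assumed as toy values: for Y = a^T X + ε with Gaussian X and ε, I(Y; X) = H(Y) - H(Y|X) reduces to \frac{1}{2} \log(1 + a^T Σ a / σ_ε^2), so the unit-norm direction a maximizing the variance of Y (the leading eigenvector of Σ, i.e. the first principal component) also maximizes the information.

import numpy as np

rng = np.random.default_rng(0)

# Toy setup (assumed for illustration): zero-mean Gaussian X, additive Gaussian noise.
Sigma = np.array([[3.0, 1.0], [1.0, 1.0]])
X = rng.multivariate_normal([0.0, 0.0], Sigma, size=5000)
noise_var = 0.1

def gaussian_info(a, X, noise_var):
    # I(Y;X) = H(Y) - H(Y|X) for Y = a^T X + eps; H(Y|X) is just the noise entropy,
    # so the difference is 0.5 * log(1 + a^T Cov(X) a / noise_var).
    a = a / np.linalg.norm(a)                  # enforce the constraint a^T a = 1
    signal_var = a @ np.cov(X.T) @ a
    return 0.5 * np.log(1.0 + signal_var / noise_var)

# The leading eigenvector of the sample covariance (the first PC) maximizes
# the variance of Y over unit-norm directions a, and hence the information.
evals, evecs = np.linalg.eigh(np.cov(X.T))
a_pca = evecs[:, -1]
print(gaussian_info(a_pca, X, noise_var))                  # largest value
print(gaussian_info(np.array([0.0, 1.0]), X, noise_var))   # smaller for another direction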

Channel capacity
The channel capacity of a discrete memoryless channel is defined as

C = \max_{p(x)} I(X; Y)

Noisy channel coding theorem (informal statement): error-free communication at rates above the channel capacity is impossible; communication at bit rates below C is possible with arbitrarily small error.
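As a sketch of this definition, the snippet below approximates the capacity of a binary symmetric channel (crossover probability chosen only for illustration) by maximizing I(X; Y) over input distributions with a coarse grid search, and compares the result with the closed form 1 - H_2(f).

import math

def mutual_info_bits(p_x, channel):
    # I(X;Y) in bits for input pmf p_x and channel matrix channel[x][y] = p(y|x).
    p_y = [sum(p_x[x] * channel[x][y] for x in range(len(p_x)))
           for y in range(len(channel[0]))]
    I = 0.0
    for x, px in enumerate(p_x):
        for y, pyx in enumerate(channel[x]):
            if px > 0 and pyx > 0:
                I += px * pyx * math.log2(pyx / p_y[y])
    return I

f = 0.1                                   # crossover probability (illustrative)
bsc = [[1 - f, f], [f, 1 - f]]

# Maximize I(X;Y) over binary input distributions by a coarse grid search.
C = max(mutual_info_bits([a, 1 - a], bsc) for a in [i / 1000 for i in range(1001)])
print(C)                                                   # about 0.531 bits
print(1 + f * math.log2(f) + (1 - f) * math.log2(1 - f))   # closed form 1 - H2(f)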