Multimedia Communications. Mathematical Preliminaries for Lossless Compression


What we will see in this chapter
Definition of information and entropy
Modeling a data source
Definition of coding and when a code is decodable

Why information theory?
Compression schemes can be divided into two classes: lossy and lossless.
Lossy compression involves loss of some information; data that have been compressed lossily generally cannot be recovered exactly.
Lossless schemes compress the data without loss of information, and the original data can be recovered exactly from the compressed data.
There is a close relation between lossless compression and the notions of information and entropy.

Information Theory
Consider a discrete information source with N symbols (the set of symbols is often called the alphabet) A_N = {a_1, ..., a_N}.
The probability function p : A_N -> [0,1] gives the probability of occurrence of each symbol (p(a_1) = p_1, ..., p(a_N) = p_N).
When we receive one of the symbols, how much information do we get?
If p_1 = 1, there is no surprise (no information), since we know what the message must be.
If the probabilities are very different, then when a symbol with a low probability arrives, we feel more surprised and get more information.
Information is therefore inversely related to probability.

Information
The self-information of a symbol x in A_N,
  i : A_N -> R,  i(x) = -log p(x),
is a measure of the information one receives upon being told that symbol x has occurred.
i(x) increases to infinity as the probability of the symbol decreases to zero.
Logarithm base 2: unit of information = bit (our choice); base e: unit = nat; base 10: unit = hartley.
Example: flipping a fair coin, P(H) = P(T) = 1/2, so i(H) = i(T) = 1 bit.

Entropy
The entropy of an information source is the expected (average) value of its self-information:
  H(X) = sum_i p_i i(a_i) = - sum_{i=1}^{N} p_i log_2 p_i
H(X) is the average amount of information we get from a symbol of the source.
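
As an illustration, here is a minimal Python sketch (not part of the original slides) of the first-order entropy formula above; the function name entropy is our own choice:

    import math

    def entropy(probs):
        """First-order entropy H = -sum p_i log2 p_i, in bits per symbol."""
        return -sum(p * math.log2(p) for p in probs if p > 0)

    # Fair coin: 1 bit per symbol
    print(entropy([0.5, 0.5]))                     # 1.0
    # Four symbols with probabilities 0.5, 0.25, 0.125, 0.125
    print(entropy([0.5, 0.25, 0.125, 0.125]))      # 1.75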

Entropy
The entropy defined on the previous slide is in fact the first-order entropy H_1.
If X = {X_1, ..., X_m} is a sequence of outputs of an information source S, the entropy of S is
  H = lim_{m -> infinity} (1/m) H(X_1, ..., X_m)
For i.i.d. (independent, identically distributed) sources, H = H_1. For most sources, H is not equal to H_1.

Entropy
In general, it is not possible to know the actual entropy of a physical source; we have to estimate it.
The estimate of the entropy depends on our assumption about the structure of the source.
Example: 1 2 3 2 3 4 5 4 5 6 7 8 9 8 9 10
Assumption 1: i.i.d. source, with probabilities estimated from the relative frequencies:
  P(1) = P(6) = P(7) = P(10) = 1/16, P(2) = P(3) = P(4) = P(5) = P(8) = P(9) = 2/16, giving H = 3.25 bits.
Assumption 2: sample-to-sample correlation; encode the differences between successive samples:
  1 1 1 -1 1 1 1 -1 1 1 1 1 1 -1 1 1
  P(1) = 13/16, P(-1) = 3/16, giving H = 0.7 bits.
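
A short Python sketch (illustration only, assuming the sample-to-sample model x_n = x_{n-1} + r_n) that reproduces the two entropy estimates for this example:

    from collections import Counter
    import math

    def first_order_entropy(samples):
        """Estimate H from the relative frequencies of the samples."""
        counts = Counter(samples)
        n = len(samples)
        return -sum((c / n) * math.log2(c / n) for c in counts.values())

    seq = [1, 2, 3, 2, 3, 4, 5, 4, 5, 6, 7, 8, 9, 8, 9, 10]

    # Assumption 1: i.i.d. source
    print(first_order_entropy(seq))        # ~3.25 bits

    # Assumption 2: encode the residuals r_n = x_n - x_{n-1}
    residuals = [seq[0]] + [b - a for a, b in zip(seq, seq[1:])]
    print(residuals)                       # thirteen 1s and three -1s
    print(first_order_entropy(residuals))  # ~0.70 bits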

Entropy
Our assumptions about the structure of the source are called models.
In the previous example the model is x_n = x_{n-1} + r_n, where r_n is the residual that is actually encoded.
This is a static model: its parameters do not change with n.
In adaptive models, the parameters change (adapt) with n to follow the changing characteristics of the data.

Properties of Entropy
  0 <= H(X) <= log_2 N
The entropy is zero when one of the symbols occurs with probability 1.
The entropy is maximum when all symbols occur with equal probability.
H is a continuous function of the probabilities: a small change in a probability causes a small change in the average information.
If all symbols are equally likely, increasing the number of symbols increases H; the more possible outcomes there are, the more information is contained in the occurrence of any particular outcome.

Models for Information Sources
Good models for sources lead to more efficient compression algorithms.
Physical models: if we know something about the physics of the data generation, we can use that information to construct a model. Example: the physics of speech production.
Probability models:
The simplest statistical model assumes that each symbol generated by the source is independent of every other symbol and that all symbols occur with the same probability (the ignorance model).
The next step: keep the independence assumption, but assign a probability to each symbol.
The next step: discard the independence assumption and come up with a description of the dependency between symbols.

Models for Information Sources
One of the most popular ways of representing dependence in data is through the use of Markov models.
kth-order Markov model: P(x_n | x_{n-1}, ..., x_{n-k}) = P(x_n | x_{n-1}, ..., x_{n-k}, ...)
Knowledge of the past k symbols is equivalent to knowledge of the entire past.
If x_n belongs to a discrete set, the model is also called a finite-state process; the set of values taken by {x_{n-1}, ..., x_{n-k}} is called the state of the Markov process.
If the size of the source alphabet is N, the number of states is N^k.
First-order Markov model: P(x_n | x_{n-1}).

Models for Information Sources
How can we describe the dependency between samples? Linear models, Markov chains.
Entropy of a finite-state process:
  H = sum_i P(S_i) H(X | S_i)
where S_i is the ith state of the Markov model.

Example: Binary image
The image has two types of pixels: white and black.
The type of the next pixel depends on whether the current pixel is white or black, so we can model the pixels as a first-order discrete Markov chain.
[Two-state diagram: states S_w and S_b, transition probabilities P(b|w) and P(w|b), self-loops P(w|w) and P(b|b).]
P(S_w) = 30/31, P(S_b) = 1/31, P(b|w) = 0.01, P(w|b) = 0.3
Entropy based on the i.i.d. assumption:
  H = -(30/31) log(30/31) - (1/31) log(1/31) = 0.206 bits
Entropy based on the Markov model:
  H(X|S_b) = -0.3 log 0.3 - 0.7 log 0.7 = 0.881 bits
  H(X|S_w) = -0.01 log 0.01 - 0.99 log 0.99 = 0.081 bits
  H(X) = (30/31)*0.081 + (1/31)*0.881 = 0.107 bits
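
A small Python sketch (illustrative, using the probabilities given in the example) that computes the Markov-model entropy of the binary image:

    import math

    def h2(p):
        """Binary entropy of the two-outcome distribution {p, 1 - p}, in bits."""
        return -p * math.log2(p) - (1 - p) * math.log2(1 - p)

    P_Sw, P_Sb = 30 / 31, 1 / 31        # stationary state probabilities
    P_b_given_w, P_w_given_b = 0.01, 0.3

    H_given_w = h2(P_b_given_w)         # ~0.081 bits
    H_given_b = h2(P_w_given_b)         # ~0.881 bits
    H_markov = P_Sw * H_given_w + P_Sb * H_given_b
    print(H_markov)                     # ~0.107 bits, versus ~0.206 bits for the i.i.d. model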

Formal definition of encoding
An encoding scheme for a source alphabet S = {s_1, s_2, ..., s_N} in terms of a code alphabet A = {a_1, ..., a_M} is a list of mappings
  s_1 -> w_1, s_2 -> w_2, ..., s_N -> w_N
in which w_1, ..., w_N are elements of A+, where A+ is defined as
  A+ = union over k >= 1 of A^k
and A^k is the Cartesian product of A with itself k times.

Formal definition of encoding
Example: for S = {a, b, c} and code alphabet A = {0, 1}, the scheme
  a -> 01, b -> 10, c -> 111
is an encoding scheme.
Suppose that s_i -> w_i (w_i in A+) is an encoding scheme for a source alphabet S = {s_1, ..., s_N}, and that the source letters s_1, ..., s_N occur with relative frequencies (probabilities) f_1, ..., f_N respectively. The average code word length of the code is defined as
  l_avg = sum_{i=1}^{N} f_i l_i
where l_i is the length of w_i.

Fixed Length Codes
If the source has an alphabet with N symbols, these can be encoded using a fixed-length coder using B bits per symbol, where
  B = ceil(log_2 N)

Optimal codes
The average number of code letters required to encode a source text consisting of P source letters is P * l_avg.
It may be expensive and time consuming to transmit long sequences of code letters, so it is desirable for l_avg to be as small as possible.
Common sense (or intuition) suggests that, in order to minimize l_avg, we ought to represent the frequently occurring source letters by short code words and reserve the longer code words for rarely occurring source letters, i.e. use variable-length codes.
When using variable-length codes, we must make sure that the code is decodable.

Variable Length Codes: Examples

  Letters   P(a_k)   Code I   Code II   Code III   Code IV
  a_1       0.5      0        0         0          0
  a_2       0.25     0        1         10         01
  a_3       0.125    0        00        110        011
  a_4       0.125    10       11        111        0111
  Average length     1.125    1.25      1.75       1.875

Code I: not uniquely decodable. Code II: not uniquely decodable.
Code III: uniquely decodable (note: its rate is exactly equal to H).
Code IV: uniquely decodable.
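
A quick Python check (a sketch, not part of the slides) of the average lengths in the table above:

    probs = [0.5, 0.25, 0.125, 0.125]
    codes = {
        "Code I":   ["0", "0", "0", "10"],
        "Code II":  ["0", "1", "00", "11"],
        "Code III": ["0", "10", "110", "111"],
        "Code IV":  ["0", "01", "011", "0111"],
    }
    for name, words in codes.items():
        avg = sum(p * len(w) for p, w in zip(probs, words))
        print(name, avg)    # 1.125, 1.25, 1.75, 1.875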

A test for unique decodability
Consider two binary codewords a (k bits long) and b (n bits long), with n > k.
If the first k bits of b are identical to a, then a is called a prefix of b, and the last n - k bits of b are called the dangling suffix.
Construct a list of all the codewords.
Examine all pairs in the list to see whether any codeword is a prefix of another entry.
Whenever there is such a pair, add the dangling suffix to the list (unless it is already there), and repeat the examination on the enlarged list.
Continue until one of two things happens:
A dangling suffix is itself a codeword: the code is not uniquely decodable.
There are no more new dangling suffixes: the code is uniquely decodable.

A test for unique decodability
Example: {0, 01, 11}.
The codeword 0 is a prefix of 01, with dangling suffix 1; the list becomes {0, 01, 11, 1}.
No new dangling suffixes appear and no dangling suffix is a codeword, so the code is uniquely decodable.
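
The test above is often called the Sardinas-Patterson test. Here is an illustrative Python sketch of it (our own code, not from the slides):

    def is_uniquely_decodable(codewords):
        """Return True if no dangling suffix generated from the code is itself a codeword."""
        code = set(codewords)
        suffixes = set()
        # initial dangling suffixes from codeword pairs
        for a in code:
            for b in code:
                if a != b and b.startswith(a):
                    suffixes.add(b[len(a):])
        while True:
            new = set()
            for s in suffixes:
                if s in code:
                    return False          # a dangling suffix is a codeword
                for c in code:
                    if c.startswith(s):
                        new.add(c[len(s):])
                    if s.startswith(c):
                        new.add(s[len(c):])
            new -= suffixes
            new.discard("")
            if not new:
                return True               # no new dangling suffixes
            suffixes |= new

    print(is_uniquely_decodable(["0", "01", "11"]))   # True
    print(is_uniquely_decodable(["0", "01", "10"]))   # False ("010" has two parsings)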

Prefix codes
One type of code in which we never face the possibility of a dangling suffix being a codeword is a code in which no codeword is a prefix of another.
Such codes are called prefix codes.
A simple way to check whether a code is a prefix code is to draw the binary tree of the code.
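
Equivalently, the prefix condition can be checked directly by comparing codeword pairs; a small Python sketch (not from the slides):

    def is_prefix_code(codewords):
        """True if no codeword is a prefix of another (prefix condition)."""
        for a in codewords:
            for b in codewords:
                if a != b and b.startswith(a):
                    return False
        return True

    print(is_prefix_code(["0", "10", "110", "111"]))   # True  (Code III)
    print(is_prefix_code(["0", "01", "011", "0111"]))  # False (Code IV is uniquely decodable but not prefix)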

Tree Representation of Codes
[Binary code trees for Code III (codewords 0, 10, 110, 111) and Code IV (codewords 0, 01, 011, 0111).]
In a prefix code, all code words are external nodes (leaves).

Instantaneously Decodable Codes
Instantaneous codes allow each symbol to be decoded as soon as its codeword has been received, which simplifies the decoding logic.
It is both necessary and sufficient for a code to be instantaneous that no code word is a prefix of another code word (the prefix condition).
Prefix codes are therefore a subset of uniquely decodable codes.

McMillan and Kraft theorems
Theorem (McMillan's inequality): If |S| = N, |A| = M, and s_i -> w_i with w_i in A^(l_i), i = 1, 2, ..., N, is an encoding scheme resulting in a uniquely decodable code, then
  sum_{i=1}^{N} M^(-l_i) <= 1
For binary codes (M = 2) the condition is
  sum_{i=1}^{N} 2^(-l_i) <= 1
Theorem (Kraft's inequality): Suppose that S = {s_1, ..., s_N} is a source alphabet, A = {a_1, ..., a_M} is a code alphabet, and l_1, l_2, ..., l_N are positive integers. Then there is an encoding scheme s_i -> w_i, i = 1, 2, ..., N, for S in terms of A satisfying the prefix condition with length(w_i) = l_i if and only if
  sum_{i=1}^{N} M^(-l_i) <= 1

McMillan and Kraft theorems
Uniquely decodable code => (McMillan) sum_i M^(-l_i) <= 1 => (Kraft) there exists a prefix encoding scheme with lengths l_i.

Kraft-McMillan inequalities
Note that the theorems refer to the existence of such a code, not to a particular code. A particular code may obey the Kraft inequality and still not be instantaneous, but there will exist codes with the same lengths l_i that are instantaneous.
Example 1: Is there an instantaneous code with code lengths 1, 2, 2, 3?
Kraft inequality: 2^(-1) + 2^(-2) + 2^(-2) + 2^(-3) = 1.125 > 1, so no.
It is convenient to work with prefix codes; are we losing something (in terms of codeword length) if we restrict ourselves to prefix codes? No. If there is a code which is uniquely decodable and non-prefix, its lengths l_1, l_2, ..., l_N satisfy the Kraft-McMillan inequality. Thus, by Kraft's theorem, a prefix code with the same codeword lengths also exists.
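
A one-line Python helper (illustrative sketch) for evaluating the left-hand side of the Kraft-McMillan inequality:

    def kraft_sum(lengths, M=2):
        """Left-hand side of the Kraft-McMillan inequality: sum of M^(-l_i)."""
        return sum(M ** (-l) for l in lengths)

    print(kraft_sum([1, 2, 2, 3]))                        # 1.125 > 1: no instantaneous code exists
    print(kraft_sum([1, 2, 3, 3]))                        # 1.0 <= 1: such a code exists (e.g. 0, 10, 110, 111)
    print(kraft_sum([1, 2, 2, 2, 2, 2, 3, 3, 3], M=3))    # 1.0 <= 1 (the ternary example on the next slide)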

Kraft-McMillan inequalities
If a set of lengths {l_i} is available that obeys the Kraft inequality, an instantaneous code can be systematically built.
Example: M = 3, l = {1, 2, 2, 2, 2, 2, 3, 3, 3}; find an instantaneous code.
[Ternary code tree:] a_1 = 0; a_2 = 10, a_3 = 11, a_4 = 12, a_5 = 20, a_6 = 21; a_7 = 220, a_8 = 221, a_9 = 222.
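
A Python sketch of one systematic construction (sort the lengths and assign codewords in lexicographic order); it assumes the lengths satisfy the Kraft inequality and that M <= 10:

    def build_prefix_code(lengths, M=2):
        """Build a prefix (instantaneous) code with the given lengths over an M-ary alphabet."""
        digits = "0123456789"
        lengths = sorted(lengths)
        codewords, value = [], 0
        for i, l in enumerate(lengths):
            if i > 0:
                # advance past the previous codeword and scale up to the new length
                value = (value + 1) * M ** (l - lengths[i - 1])
            # write `value` in base M using exactly l digits
            w, v = [], value
            for _ in range(l):
                w.append(digits[v % M])
                v //= M
            codewords.append("".join(reversed(w)))
        return codewords

    print(build_prefix_code([1, 2, 2, 2, 2, 2, 3, 3, 3], M=3))
    # ['0', '10', '11', '12', '20', '21', '220', '221', '222']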

Kraft-McMillan inequalities
One approach to building a uniquely decodable code is (a sketch of this search is given below):
1. For the particular values of M and N, find all sets of lengths {l_i} that satisfy the Kraft inequality.
2. Systematically build the codewords.
3. Assign the shorter codewords to the source letters with higher probability (relative frequency) and the longer codewords to the less likely letters.
4. Find the average codeword length l_avg of each of the above codes.
5. Pick the code that has the minimum l_avg.
This brute-force approach is useful in mixed optimization problems, in which we want to keep l_avg small while also serving some other purpose.
Where minimization of l_avg is our only objective, a faster and more elegant approach is available: the Huffman algorithm.
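
A brute-force Python sketch of steps 1, 4 and 5 (illustrative only; max_len is a search bound we introduce, and for small alphabets the enumeration is cheap):

    from itertools import product

    def best_lengths(probs, max_len=4, M=2):
        """Among all length assignments satisfying the Kraft inequality,
        return the one minimizing the average codeword length."""
        best = None
        for lengths in product(range(1, max_len + 1), repeat=len(probs)):
            if sum(M ** (-l) for l in lengths) <= 1:
                avg = sum(p * l for p, l in zip(probs, lengths))
                if best is None or avg < best[0]:
                    best = (avg, lengths)
        return best

    # Source with probabilities 0.5, 0.25, 0.125, 0.125 (the earlier example)
    print(best_lengths([0.5, 0.25, 0.125, 0.125]))   # (1.75, (1, 2, 3, 3)), matching Code III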

Lossless Source Coding Theorem
Consider a source with entropy H. Then for every encoding scheme for S, in terms of A, resulting in a uniquely decodable code, the average code word length satisfies
  l_avg >= H
(in bits per symbol, for a binary code alphabet).
It is possible to code the source, without distortion, using H + epsilon bits per symbol, where epsilon is an arbitrarily small positive number. However, it is not possible to code the source using B bits per symbol with B < H.
The theorem does not tell us how such a coder can be constructed.

Kolmogorov complexity
The Kolmogorov complexity K(x) of a sequence x is the size of the smallest program needed to generate x.
In this size we include all inputs that might be needed by the program.
We do not specify the programming language, since it is always possible to translate a program in one language into a program in another language.
If x is a random sequence with no structure, the only program that could generate it would contain the sequence itself.
There is a correspondence between the size of the smallest program and the amount of compression that can be obtained.
Problem: there is no systematic way of computing (or even approximating) the Kolmogorov complexity.

Minimum Description Length
Let M_j be a model, from a set of models, that attempts to characterize the structure in a sequence x.
Let D_{M_j} be the number of bits required to describe the model M_j. For example, if M_j is described by a set of coefficients, then D_{M_j} will depend on how many coefficients the model has and how many bits are used to represent each one.
Let R_{M_j}(x) be the number of bits required to represent x with respect to the model M_j.
The minimum description length is then given by
  MDL(x) = min_j ( D_{M_j} + R_{M_j}(x) )