Multimedia Communications Mathematical Preliminaries for Lossless Compression
What we will see in this chapter: the definition of information and entropy; modeling a data source; the definition of coding and when a code is uniquely decodable.
Why information theory? Compression schemes can be divided into two classes: lossy and lossless. Lossy compression involves the loss of some information: data that have been compressed generally cannot be recovered exactly. Lossless schemes compress the data without loss of information, and the original data can be recovered exactly from the compressed data. There is a close relation between lossless compression and the notions of information and entropy.
Information Theory. Consider a discrete information source with N symbols (the set of symbols is often called the alphabet) A_N = {a_1, ..., a_N}. The probability function p: A_N -> [0,1] gives the probability of occurrence of each symbol (p(a_1) = p_1, ..., p(a_N) = p_N). When we receive one of the symbols, how much information do we get? If p_1 = 1, there is no surprise (no information), since we know what the message must be. If the probabilities are very different, then when a symbol with low probability arrives we feel more surprised and get more information. Information is inversely related to probability.
Information. The self-information of a symbol x in A_N, i: A_N -> R, i(x) = -log p(x), is a measure of the information one receives upon being told that symbol x has occurred. i(x) increases to infinity as the probability of the symbol decreases to zero. Logarithm base 2: unit of information = BIT (our choice); base e: unit of information = NAT; base 10: unit of information = HARTLEY. Example: flipping a fair coin, P(H) = P(T) = 1/2: i(H) = i(T) = -log2(1/2) = 1 bit.
Entropy. The entropy of an information source is the expected (average) value of its self-information: H(X) = sum_i p_i i(a_i) = -sum_i p_i log2 p_i. H(X) is the average amount of information we get from a symbol of the source.
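The first-order entropy formula above can be sketched directly in code (the helper name `entropy` is our own choice, not from the slides):

```python
import math

def entropy(probs):
    """First-order entropy in bits: H = -sum p_i * log2(p_i), with 0*log 0 = 0."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

# Fair coin: each flip carries exactly one bit of information.
print(entropy([0.5, 0.5]))  # 1.0
```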
Entropy. The entropy defined in the previous slide is in fact the first-order entropy H_1. If X = {X_1, ..., X_m} is a sequence of outputs of an information source S, the entropy of S is H = lim_{m -> infinity} (1/m) H(X_1, ..., X_m). For i.i.d. (independent, identically distributed) sources, H = H_1. For most sources, H is not equal to H_1.
Entropy. In general, it is not possible to know the actual entropy of a physical source; we have to estimate it, and the estimate depends on our assumption about the structure of the source. Example: 1 2 3 2 3 4 5 4 5 6 7 8 9 8 9 10. Assumption 1: i.i.d. source, with P(1) = P(6) = P(7) = P(10) = 1/16 and P(2) = P(3) = P(4) = P(5) = P(8) = P(9) = 2/16, giving H = 3.25 bits. Assumption 2: sample-to-sample correlation; encode the residuals 1 1 1 -1 1 1 1 -1 1 1 1 1 1 -1 1 1, with P(1) = 13/16 and P(-1) = 3/16, giving H = 0.7 bits.
Entropy. Our assumptions about the structure of the source are called models. In the previous example the model is x_n = x_{n-1} + r_n, and we encode the residual r_n. This is a static model: its parameters do not change with n. In adaptive models, the parameters change, or adapt, with n to the changing characteristics of the data.
Properties of Entropy. 0 <= H(X) <= log2 N. The entropy is zero when one of the symbols occurs with probability 1. The entropy is maximum (log2 N) when all symbols occur with equal probability. H is a continuous function of the probabilities: a small change in a probability causes a small change in the average information. If all symbols are equally likely, increasing the number of symbols increases H: the more possible outcomes there are, the more information is contained in the occurrence of any particular outcome.
Models for Information Sources. Good models for sources lead to more efficient compression algorithms. Physical models: if we know something about the physics of the data generation, we can use that information to construct a model (example: the physics of speech production). Probability models. The simplest statistical model: each symbol generated by the source is independent of every other symbol, and each occurs with the same probability (the ignorance model). Next step: symbols are independent, with a separate probability for each symbol. Next step: discard the independence assumption and come up with a description of the dependency.
Models for Information Sources. One of the most popular ways of representing dependence in data is through the use of Markov models. kth-order Markov: P(x_n | x_{n-1}, ..., x_{n-k}) = P(x_n | x_{n-1}, x_{n-2}, ...), i.e., knowledge of the past k symbols is equivalent to knowledge of the entire past. If x_n belongs to a discrete set, this is also called a finite-state process: the values taken by {x_{n-1}, ..., x_{n-k}} form the state of the Markov process. If the size of the source alphabet is N, the number of states is N^k. First-order Markov model: P(x_n | x_{n-1}).
Models for Information Sources. How to describe the dependency between samples? Linear models; Markov chains. Entropy of a finite-state process: H = sum_i P(S_i) H(X | S_i), where S_i is the ith state of the Markov model.
Example: Binary image. The image has two types of pixels, white and black, and the type of the next pixel depends on whether the current pixel is white or black. We can model the pixels as a first-order discrete Markov chain with states S_w and S_b and transition probabilities P(b|w), P(w|w), P(w|b), P(b|b). Let P(S_w) = 30/31, P(S_b) = 1/31, P(b|w) = 0.01, P(w|b) = 0.3. Entropy based on the i.i.d. assumption: H = -(30/31) log(30/31) - (1/31) log(1/31) = 0.206 bits. Markov entropy: H(X|S_b) = -0.3 log 0.3 - 0.7 log 0.7 = 0.881 bits; H(X|S_w) = -0.01 log 0.01 - 0.99 log 0.99 = 0.081 bits; H(X) = (30/31)*0.081 + (1/31)*0.881 = 0.107 bits.
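The binary-image numbers above can be reproduced with a few lines of code (a sketch using the slide's probabilities):

```python
import math

def h(probs):
    """Entropy in bits of a probability distribution."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

p_w, p_b = 30 / 31, 1 / 31        # stationary state probabilities
h_w = h([0.01, 0.99])             # H(X | S_w): next pixel given current is white
h_b = h([0.3, 0.7])               # H(X | S_b): next pixel given current is black

h_iid = h([p_w, p_b])             # entropy under the i.i.d. assumption
h_markov = p_w * h_w + p_b * h_b  # entropy of the first-order Markov model

print(round(h_iid, 3), round(h_markov, 3))  # 0.206 0.107
```

The Markov model roughly halves the entropy estimate, which is why modeling the dependence pays off in compression.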
Formal definition of encoding. An encoding scheme for a source alphabet S = {s_1, s_2, ..., s_N} in terms of a code alphabet A = {a_1, ..., a_M} is a list of mappings s_1 -> w_1, s_2 -> w_2, ..., s_N -> w_N, in which w_1, ..., w_N are in A+, where A+ is defined as the union of A^k over all k >= 1, and A^k is the Cartesian product of A with itself k times.
Formal definition of encoding. Example: S = {a, b, c}, code alphabet A = {0, 1}; the scheme a -> 01, b -> 10, c -> 111 is an encoding scheme. Suppose that s_i -> w_i in A+ is an encoding scheme for a source alphabet S = {s_1, ..., s_N}, and that the source letters s_1, ..., s_N occur with relative frequencies (probabilities) f_1, ..., f_N respectively. The average code word length of the code is defined as l_avg = sum_i f_i l_i, where l_i is the length of w_i.
Fixed-Length Codes. If the source has an alphabet with N symbols, these can be encoded using a fixed-length coder using B bits per symbol, where B = ceil(log2 N).
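As a quick sketch of the fixed-length bound (the helper name is ours):

```python
import math

def fixed_length_bits(N):
    """Bits per symbol for a fixed-length code over an N-symbol alphabet."""
    return math.ceil(math.log2(N))

print(fixed_length_bits(10))  # 4, since 2^3 = 8 < 10 <= 16 = 2^4
```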
Optimal codes. The average number of code letters required to encode a source text consisting of P source letters is P * l_avg. It may be expensive and time-consuming to transmit long sequences of code letters, so it is desirable for l_avg to be as small as possible. Common sense or intuition suggests that, in order to minimize l_avg, we ought to represent frequently occurring source letters by short code words and reserve the longer code words for rarely occurring source letters (use variable-length codes). When using variable-length codes, we must make sure that the code is decodable.
Variable-Length Codes: Examples

Letter   P(a_k)   Code I   Code II   Code III   Code IV
a_1      0.5      0        0         0          0
a_2      0.25     0        1         10         01
a_3      0.125    0        00        110        011
a_4      0.125    10       11        111        0111
Avg. length       1.125    1.25      1.75       1.875

Code I: not uniquely decodable. Code II: not uniquely decodable. Code III: uniquely decodable (note: its rate is exactly equal to H). Code IV: uniquely decodable.
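The average-length column of the table can be verified against the definition l_avg = sum_i f_i l_i (a sketch; helper names are ours):

```python
def average_length(probs, codewords):
    """Average codeword length: sum of probability times codeword length."""
    return sum(p * len(w) for p, w in zip(probs, codewords))

probs = [0.5, 0.25, 0.125, 0.125]
code_iii = ['0', '10', '110', '111']
code_iv = ['0', '01', '011', '0111']

print(average_length(probs, code_iii))  # 1.75, exactly the entropy H
print(average_length(probs, code_iv))   # 1.875
```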
A test for unique decodability. Consider two binary codewords a (k bits long) and b (n bits long) with n > k. If the first k bits of b are identical to a, then a is called a prefix of b, and the last n-k bits of b are called the dangling suffix. The test: construct a list of all the codewords; examine all pairs of items in the list to see if any item is a prefix of another; whenever there is such a pair, add the dangling suffix to the list (unless it is already there). Continue until either a dangling suffix is itself a codeword (the code is not uniquely decodable) or no new dangling suffixes appear (the code is uniquely decodable).
A test for unique decodability. Example: {0, 01, 11}. The codeword 0 is a prefix of 01, with dangling suffix 1, so the list becomes {0, 01, 11, 1}. The suffix 1 is a prefix of 11, but its dangling suffix (again 1) is already in the list. No new dangling suffixes appear and no dangling suffix is a codeword: the code is uniquely decodable.
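The test described above (the Sardinas-Patterson procedure) can be sketched as follows; the function name and the second example code {0, 01, 10} are our own additions:

```python
def is_uniquely_decodable(codewords):
    """Dangling-suffix test for unique decodability (Sardinas-Patterson)."""
    cw = set(codewords)
    seen = set()
    # Initial dangling suffixes: one codeword is a prefix of another.
    pending = {b[len(a):] for a in cw for b in cw if a != b and b.startswith(a)}
    while pending:
        if pending & cw:           # a dangling suffix is itself a codeword
            return False
        seen |= pending
        nxt = set()
        for s in seen:             # compare suffixes against codewords, both ways
            for c in cw:
                if c != s and c.startswith(s):
                    nxt.add(c[len(s):])
                if s != c and s.startswith(c):
                    nxt.add(s[len(c):])
        pending = nxt - seen       # stop when no new suffixes appear
    return True

print(is_uniquely_decodable(['0', '01', '11']))  # True
print(is_uniquely_decodable(['0', '01', '10']))  # False: 010 = 0|10 = 01|0
```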
Prefix codes. One type of code in which we never face the possibility of a dangling suffix being a codeword is a code in which no codeword is a prefix of another. Such codes are called prefix codes. A simple way to check whether a code is a prefix code is to draw the binary tree of the code.
Tree Representation of Codes. [Binary code trees for Code III (a_1 = 0, a_2 = 10, a_3 = 110, a_4 = 111) and Code IV (a_1 = 0, a_2 = 01, a_3 = 011, a_4 = 0111).] In a prefix code, all code words are external nodes (leaves); in Code IV, some codewords are internal nodes of the tree.
Instantaneously Decodable Codes. Instantaneous codes decode a symbol as soon as its code is received, which simplifies the decoding logic. It is both necessary and sufficient that an instantaneous code have no code word that is a prefix of another code word (the prefix condition). Prefix codes are a subset of uniquely decodable codes: every prefix code is uniquely decodable.
McMillan and Kraft theorems. Theorem (McMillan's inequality): if |S| = N, |A| = M, and s_i -> w_i in A^{l_i}, i = 1, 2, ..., N, is an encoding scheme resulting in a uniquely decodable code, then sum_{i=1}^{N} M^{-l_i} <= 1. For binary codes the condition is sum_{i=1}^{N} 2^{-l_i} <= 1. Theorem (Kraft's inequality): suppose that S = {s_1, ..., s_N} is a source alphabet, A = {a_1, ..., a_M} is a code alphabet, and l_1, l_2, ..., l_N are positive integers. Then there is an encoding scheme s_i -> w_i, i = 1, 2, ..., N, for S in terms of A satisfying the prefix condition with length(w_i) = l_i if and only if sum_{i=1}^{N} M^{-l_i} <= 1.
McMillan and Kraft theorems. Uniquely decodable code => (McMillan) the lengths l_i satisfy the inequality <=> (Kraft) there exists a prefix encoding scheme with lengths l_i.
Kraft-McMillan inequalities. Note that the theorem refers to the existence of such a code, not to a particular code. A particular code may obey the Kraft inequality and still not be instantaneous, but there will exist codes with the same lengths l_i that are instantaneous. Example 1: is there an instantaneous code with code lengths 1, 2, 2, 3? Kraft inequality: 2^{-1} + 2^{-2} + 2^{-2} + 2^{-3} = 9/8 > 1: no. It is nice to work with prefix codes, but are we losing something (in terms of codeword length) if we restrict ourselves to them? No: if there is a uniquely decodable, non-prefix code, the values l_1, l_2, ..., l_N for this code satisfy the Kraft-McMillan inequality by McMillan's theorem, so by Kraft's theorem a prefix code with the same codeword lengths also exists.
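Checking the Kraft inequality for a candidate set of lengths is a one-liner (a sketch; the second length set is our own example of a feasible case):

```python
def kraft_sum(lengths, M=2):
    """Left-hand side of the Kraft inequality for an M-ary code alphabet."""
    return sum(M ** -l for l in lengths)

print(kraft_sum([1, 2, 2, 3]))  # 1.125 > 1: no instantaneous code exists
print(kraft_sum([1, 2, 3, 3]))  # 1.0 <= 1: one exists, e.g. 0, 10, 110, 111
```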
Kraft-McMillan inequalities. If a set of lengths {l_i} is available that obeys the Kraft inequality, an instantaneous code can be built systematically. Example: M = 3, l = {1, 2, 2, 2, 2, 2, 3, 3, 3}; find an instantaneous code. [Ternary code tree with branches labeled 0, 1, 2: a_1 takes one branch at depth 1; a_2 through a_6 take five depth-2 nodes; a_7, a_8, a_9 take the three children of the remaining depth-2 node.] Note that 3^{-1} + 5*3^{-2} + 3*3^{-3} = 1, so the Kraft inequality holds with equality.
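One way to perform this systematic construction in code is a canonical assignment: walk the lengths in increasing order and take the next available node at each depth. This is a sketch of the idea, not necessarily the exact tree drawn on the slide:

```python
def canonical_code(lengths, M=2):
    """Build a prefix code for the given lengths; assumes the Kraft inequality holds."""
    codes = []
    val, prev = 0, min(lengths)
    for l in sorted(lengths):
        val *= M ** (l - prev)   # descend from depth prev to depth l
        prev = l
        digits, v = [], val      # write val as a base-M string of length l
        for _ in range(l):
            digits.append(str(v % M))
            v //= M
        codes.append(''.join(reversed(digits)))
        val += 1                 # move to the next sibling at this depth
    return codes

print(canonical_code([1, 2, 2, 2, 2, 2, 3, 3, 3], M=3))
# ['0', '10', '11', '12', '20', '21', '220', '221', '222']
```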
Kraft-McMillan inequalities. One approach to building a uniquely decodable code: 1. For the particular values of M and N, find all sets of lengths {l_i} that satisfy the Kraft inequality. 2. Systematically build the codewords. 3. Assign the shorter codewords to the source letters with higher probability (relative frequency) and the longer codewords to the less likely letters. 4. Find the average codeword length l_avg of each of these codes. 5. Pick the code with the minimum l_avg. This brute-force approach is useful in mixed optimization problems, in which we want to keep l_avg small while also serving some other purpose. When minimization of l_avg is our only objective, a faster and more elegant approach is available (the Huffman algorithm).
Lossless Source Coding Theorem. Consider a source with entropy H. Then for every encoding scheme for S, in terms of A, resulting in a uniquely decodable code, the average code word length satisfies l_avg >= H (H in bits for a binary code alphabet). It is possible to code the source, without distortion, using H + epsilon bits per symbol, where epsilon is an arbitrarily small positive number. However, it is not possible to code the source using B bits per symbol with B < H. The theorem does not tell us how such a coder can be constructed.
Kolmogorov complexity. The Kolmogorov complexity K(x) of a sequence x is the size of the smallest program needed to generate x; in this size we include all inputs that might be needed by the program. We do not specify the programming language, since it is always possible to translate a program in one language into a program in another. If x is a random sequence with no structure, the only program that could generate it would contain the sequence itself. There is a correspondence between the size of the smallest program and the amount of compression that can be obtained. Problem: there is no systematic way of computing (or even approximating) the Kolmogorov complexity.
Minimum Description Length. Let M_j be a model, from a set of models, that attempts to characterize the structure in a sequence x. Let D_{M_j} be the number of bits required to describe the model M_j; for example, if M_j has coefficients, then D_{M_j} will depend on how many coefficients the model has and how many bits are used to represent each. Let R_{M_j}(x) be the number of bits required to represent x with respect to the model M_j. The minimum description length is then given by min_j [ D_{M_j} + R_{M_j}(x) ].