Information Theory
Lecture 5: Entropy rate and Markov sources
Stefan Höst
Universal Source Coding

Huffman coding is optimal, so what is the problem? The previous coding schemes (Huffman and Shannon-Fano) assumed that
- the source statistics are known, and
- the source symbols are i.i.d.

Normally this is not the case. How much can the source then be compressed, and how can that be achieved?
Random process

Definition (Random process). A random process $\{X_i\}_{i=1}^{n}$ is a sequence of random variables. There can be an arbitrary dependence among the variables, and the process is characterized by the joint probability function
$$P(X_1, X_2, \ldots, X_n = x_1, x_2, \ldots, x_n) = p(x_1, x_2, \ldots, x_n), \quad n = 1, 2, \ldots$$

Definition (Stationary random process). A random process is stationary if it is invariant in time,
$$P(X_1, \ldots, X_n = x_1, \ldots, x_n) = P(X_{q+1}, \ldots, X_{q+n} = x_1, \ldots, x_n)$$
for all time shifts $q$.
Entropy rate

Definition. The entropy rate of a random process is defined as
$$H_\infty(X) = \lim_{n\to\infty} \frac{1}{n} H(X_1 X_2 \ldots X_n)$$
Define the alternative entropy rate for a random process as
$$H(X|X^\infty) = \lim_{n\to\infty} H(X_n | X_1 X_2 \ldots X_{n-1})$$

Theorem. For a stationary random process the entropy rate and the alternative entropy rate are equivalent,
$$H_\infty(X) = H(X|X^\infty)$$
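To see the two limits side by side, here is a small numerical sketch; it is not from the slides, and it uses a two-state Markov source (a model introduced later in this lecture) with assumed transition probabilities. Started in its stationary distribution, the source has constant $H(X_n | X_1 \ldots X_{n-1})$, and $\frac{1}{n}H(X_1 \ldots X_n)$ decreases toward the same value.

```python
from math import log2

def h(x):                       # binary entropy function
    return -x * log2(x) - (1 - x) * log2(1 - x)

a, b = 0.1, 0.3                 # assumed: P(1|0) = a, P(0|1) = b
pi0 = b / (a + b)               # stationary probability of state 0
H_rate = pi0 * h(a) + (1 - pi0) * h(b)   # H(X_n | X_{n-1}), constant over n
H1 = h(pi0)                     # H(X_1) when started in the stationary distribution

for n in (1, 2, 5, 10, 100):
    # Chain rule for a Markov source: H(X_1...X_n) = H(X_1) + (n-1) H(X_n | X_{n-1})
    print(f"n={n:3d}: (1/n)H(X_1..X_n) = {(H1 + (n - 1) * H_rate) / n:.4f}"
          f"   H(X_n|past) = {H_rate:.4f}")
```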
Entropy rate

Theorem. For a stationary stochastic process the entropy rate is bounded by
$$H_\infty(X) \leq H(X) \leq \log k$$
More precisely, for each $n$,
$$\log k \geq H(X) \geq H(X_n | X_1 \ldots X_{n-1}) \geq H_\infty(X),$$
where $H(X_n | X_1 \ldots X_{n-1})$ is non-increasing in $n$ and tends to $H_\infty(X)$.
Source coding for random processes

Optimal coding of a process. Let $X = (X_1, \ldots, X_N)$ be a vector of $N$ symbols from a random process, and use an optimal source code to encode the vector. Then
$$H(X_1 \ldots X_N) \leq L^{(N)} < H(X_1 \ldots X_N) + 1$$
which gives, for the average codeword length per symbol $\bar{L} = \frac{1}{N} L^{(N)}$,
$$\frac{H(X_1 \ldots X_N)}{N} \leq \bar{L} < \frac{H(X_1 \ldots X_N)}{N} + \frac{1}{N}$$
In the limit as $N \to \infty$ the optimal codeword length per symbol becomes
$$\lim_{N\to\infty} \bar{L} = H_\infty(X)$$
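As a sketch of how these bounds behave, the following example (assumed, not from the lecture) Huffman-codes blocks of $N$ symbols from an i.i.d. binary source with $p = (0.9, 0.1)$. For an i.i.d. source $H(X_1 \ldots X_N)/N = H(X)$, so the printed per-symbol length $\bar{L}$ must fall between $H(X)$ and $H(X) + 1/N$.

```python
import heapq
from itertools import product
from math import log2, prod

p = [0.9, 0.1]                      # assumed biased binary source
H = -sum(q * log2(q) for q in p)    # H(X) ~ 0.4690 bits/symbol

def huffman_avg_length(probs):
    # The expected codeword length of a Huffman code equals the sum of
    # the merged probabilities over all merges of the algorithm.
    heap = list(probs)
    heapq.heapify(heap)
    total = 0.0
    while len(heap) > 1:
        a, b = heapq.heappop(heap), heapq.heappop(heap)
        total += a + b
        heapq.heappush(heap, a + b)
    return total

for N in (1, 2, 4, 8):
    blocks = [prod(blk) for blk in product(p, repeat=N)]  # i.i.d. block probabilities
    L_bar = huffman_avg_length(blocks) / N                # per-symbol length
    print(f"N={N}: {H:.4f} <= {L_bar:.4f} < {H + 1/N:.4f}")
```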
Markov chain

Definition (Markov chain). A Markov chain, or Markov process, is a random process with unit memory,
$$P(x_n | x_1, \ldots, x_{n-1}) = P(x_n | x_{n-1}), \quad \text{for all } x_i$$

Definition (Stationary). A Markov chain is stationary (time invariant) if the conditional probabilities are independent of the time,
$$P(X_n = x_a | X_{n-1} = x_b) = P(X_{n+l} = x_a | X_{n+l-1} = x_b)$$
for all relevant $n$, $l$, $x_a$ and $x_b$.
Markov chain

Theorem. For a Markov chain the joint probability function is
$$p(x_1, x_2, \ldots, x_n) = \prod_{i=1}^{n} p(x_i | x_1, x_2, \ldots, x_{i-1}) = \prod_{i=1}^{n} p(x_i | x_{i-1}) = p(x_1) p(x_2|x_1) p(x_3|x_2) \cdots p(x_n|x_{n-1})$$
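A minimal sketch of this factorization in code: the probability of a whole path is $p(x_1)$ times the one-step transition probabilities along the path. The transition probabilities match the example two slides ahead; the uniform initial distribution is an assumption for illustration.

```python
P = {'x1': {'x1': 1/3, 'x2': 2/3},
     'x2': {'x1': 1/4, 'x3': 3/4},
     'x3': {'x1': 1/2, 'x2': 1/2}}
p1 = {'x1': 1/3, 'x2': 1/3, 'x3': 1/3}        # assumed initial distribution

def path_probability(path):
    prob = p1[path[0]]                         # p(x_1)
    for prev, cur in zip(path, path[1:]):
        prob *= P[prev].get(cur, 0.0)          # p(x_i | x_{i-1})
    return prob

# p(x1, x2, x3, x1) = p(x1) p(x2|x1) p(x3|x2) p(x1|x3) = 1/3 * 2/3 * 3/4 * 1/2
print(path_probability(['x1', 'x2', 'x3', 'x1']))
```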
Markov chain characterization

Definition. A Markov chain is characterized by
- A state transition matrix
$$P = \left[ p(x_j|x_i) \right]_{i,j \in \{1,2,\ldots,k\}} = [p_{ij}]_{i,j \in \{1,2,\ldots,k\}}$$
where $p_{ij} \geq 0$ and $\sum_j p_{ij} = 1$.
- A finite set of states $X \in \{x_1, x_2, \ldots, x_k\}$, where the state determines everything about the past.

The state transition graph describes the behaviour of the process.
Example

A Markov chain with state space $X \in \{x_1, x_2, x_3\}$ and state transition matrix
$$P = \begin{pmatrix} 1/3 & 2/3 & 0 \\ 1/4 & 0 & 3/4 \\ 1/2 & 1/2 & 0 \end{pmatrix}$$
[Figure: state transition graph with states $x_1$, $x_2$, $x_3$ and edge labels $1/3$ ($x_1 \to x_1$), $2/3$ ($x_1 \to x_2$), $1/4$ ($x_2 \to x_1$), $3/4$ ($x_2 \to x_3$), $1/2$ ($x_3 \to x_1$), $1/2$ ($x_3 \to x_2$).]
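One way to get a feel for the chain is to simulate it; the sketch below (not part of the slides) samples a path using only the current state, which the unit-memory property permits.

```python
import random

P = {'x1': {'x1': 1/3, 'x2': 2/3},
     'x2': {'x1': 1/4, 'x3': 3/4},
     'x3': {'x1': 1/2, 'x2': 1/2}}

def sample_path(start, n):
    path = [start]
    for _ in range(n - 1):
        states, probs = zip(*P[path[-1]].items())
        path.append(random.choices(states, probs)[0])  # depends only on the last state
    return path

print(sample_path('x1', 12))
```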
Markov chain

Theorem. Given a Markov chain with $k$ states, let the distribution for the states at time $n$ be
$$\pi^{(n)} = \left(\pi_1^{(n)}\ \pi_2^{(n)}\ \ldots\ \pi_k^{(n)}\right)$$
Then
$$\pi^{(n)} = \pi^{(1)} P^{n-1}$$
where $\pi^{(1)}$ is the initial distribution at time 1.
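A short numpy sketch of the theorem for the example chain, with an assumed start in state $x_1$; repeated multiplication by $P$ propagates the state distribution.

```python
import numpy as np

P = np.array([[1/3, 2/3, 0],
              [1/4, 0,   3/4],
              [1/2, 1/2, 0]])
pi = np.array([1.0, 0.0, 0.0])       # assumed: the chain starts in state x1

for n in range(1, 9):
    print(f"pi^({n}) = {np.round(pi, 4)}")
    pi = pi @ P                       # pi^(n+1) = pi^(n) P
```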
Example, asymptotic distribution

$$P^2 = \begin{pmatrix} 1/3 & 2/3 & 0 \\ 1/4 & 0 & 3/4 \\ 1/2 & 1/2 & 0 \end{pmatrix}^2 = \frac{1}{72}\begin{pmatrix} 20 & 16 & 36 \\ 33 & 39 & 0 \\ 21 & 24 & 27 \end{pmatrix}$$

$$P^4 = \left(P^2\right)^2 = \frac{1}{5184}\begin{pmatrix} 1684 & 1808 & 1692 \\ 1947 & 2049 & 1188 \\ 1779 & 1920 & 1485 \end{pmatrix}$$

$$P^8 \approx \begin{pmatrix} 0.3485 & 0.3720 & 0.2794 \\ 0.3491 & 0.3721 & 0.2788 \\ 0.3489 & 0.3722 & 0.2789 \end{pmatrix}$$
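The powers above can be reproduced with numpy's matrix_power; note how the rows of $P^n$ approach a common vector, the asymptotic distribution.

```python
import numpy as np

P = np.array([[1/3, 2/3, 0],
              [1/4, 0,   3/4],
              [1/2, 1/2, 0]])

for n in (2, 4, 8):
    print(f"P^{n} =\n{np.round(np.linalg.matrix_power(P, n), 4)}")
```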
Markov chain

Theorem. Let $\pi = (\pi_1 \ldots \pi_k)$ be an asymptotic distribution of the state probabilities. Then
- $\sum_j \pi_j = 1$
- $\pi$ is a stationary distribution, i.e. $\pi P = \pi$
- $\pi$ is the unique stationary distribution for the source.
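Since $\pi P = \pi$ says that $\pi$ is a left eigenvector of $P$ with eigenvalue 1, one way to compute it is by eigendecomposition, as in this sketch:

```python
import numpy as np

P = np.array([[1/3, 2/3, 0],
              [1/4, 0,   3/4],
              [1/2, 1/2, 0]])

w, V = np.linalg.eig(P.T)                     # left eigenvectors of P
pi = np.real(V[:, np.argmin(np.abs(w - 1))])  # eigenvector for eigenvalue 1
pi /= pi.sum()                                # normalize to a distribution
print(pi)                                     # [0.3488 0.3721 0.2791] = (15/43, 16/43, 12/43)
```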
Entropy rate of Markov chain

Theorem. For a stationary Markov chain with stationary distribution $\pi$ and transition matrix $P$, the entropy rate can be derived as
$$H_\infty(X) = \sum_i \pi_i H(X_2 | X_1 = x_i)$$
where
$$H(X_2 | X_1 = x_i) = -\sum_j p_{ij} \log p_{ij}$$
is the entropy of row $i$ in $P$.
Example, Entropy rate

$$P = \begin{pmatrix} 1/3 & 2/3 & 0 \\ 1/4 & 0 & 3/4 \\ 1/2 & 1/2 & 0 \end{pmatrix}$$
Entropy per row:
$$H(X_2|X_1 = x_1) = h(1/3)$$
$$H(X_2|X_1 = x_2) = h(1/4)$$
$$H(X_2|X_1 = x_3) = h(1/2) = 1$$
Hence, with the stationary distribution $\pi = (15/43\ \ 16/43\ \ 12/43)$,
$$H_\infty(X) = \frac{15}{43} h(1/3) + \frac{16}{43} h(1/4) + \frac{12}{43} h(1/2) \approx 0.9013 \text{ bit/source symbol}$$
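A quick numerical check of this example (a sketch, not from the slides):

```python
import numpy as np

P = np.array([[1/3, 2/3, 0],
              [1/4, 0,   3/4],
              [1/2, 1/2, 0]])
pi = np.array([15/43, 16/43, 12/43])

def row_entropy(row):
    q = row[row > 0]                 # use the convention 0 log 0 = 0
    return -np.sum(q * np.log2(q))

H_rate = sum(p_i * row_entropy(row) for p_i, row in zip(pi, P))
print(f"H = {H_rate:.4f} bit/source symbol")   # ~ 0.9013
```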
Data processing lemma

X → Process A → Y → Process B → Z

Lemma (Data Processing Lemma). If the random variables X, Y and Z form a Markov chain, X → Y → Z, we have
$$I(X; Z) \leq I(X; Y)$$
$$I(X; Z) \leq I(Y; Z)$$

Conclusion: The amount of information cannot increase by data processing, neither pre- nor post-processing. It can only be transformed (or destroyed).
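A small numerical illustration with assumed channels: $Z$ depends on $Y$ only, so $X \to Y \to Z$ is a Markov chain, and the computed mutual informations obey both inequalities of the lemma.

```python
import numpy as np

def mutual_information(Pxy):
    """I(X;Y) in bits from a joint probability matrix."""
    px = Pxy.sum(axis=1, keepdims=True)
    py = Pxy.sum(axis=0, keepdims=True)
    mask = Pxy > 0
    return float(np.sum(Pxy[mask] * np.log2((Pxy / (px * py))[mask])))

px = np.array([0.5, 0.5])                  # source X (assumed)
A = np.array([[0.9, 0.1], [0.2, 0.8]])     # process A: p(y|x) (assumed)
B = np.array([[0.7, 0.3], [0.3, 0.7]])     # process B: p(z|y) (assumed)

Pxy = px[:, None] * A                      # joint p(x,y)
Pyz = Pxy.sum(axis=0)[:, None] * B         # joint p(y,z)
Pxz = Pxy @ B                              # joint p(x,z), summing over y

print(f"I(X;Y) = {mutual_information(Pxy):.4f}")
print(f"I(Y;Z) = {mutual_information(Pyz):.4f}")
print(f"I(X;Z) = {mutual_information(Pxz):.4f}  # never exceeds the other two")
```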