LECTURE 3

Last time: Mutual information; convexity and concavity; Jensen's inequality; the information inequality; the data-processing theorem; Fano's inequality.

Lecture outline: Stochastic processes and entropy rate; Markov chains; random walks on graphs; hidden Markov models.

Reading: Chapter 4.
Quick Review

Mutual information:
I(X;Y) = H(X) − H(X|Y) = Σ_{x,y} P_{X,Y}(x,y) log [ P_{X,Y}(x,y) / (P_X(x) P_Y(y)) ] = D(P_{X,Y} || P_X P_Y)

Chain rule of mutual information: I(X_1, X_2; Y) = I(X_1; Y) + I(X_2; Y | X_1).

Information inequality: D(p || q) ≥ 0.

Entropy H(X) is concave in P_X; mutual information I(X;Y) is concave in P_X for fixed P_{Y|X}, and convex in P_{Y|X} for fixed P_X.

Data processing: if X → Y → Z, then I(X;Y) ≥ I(X;Z).
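As a sanity check, the two expressions for mutual information can be verified numerically. A minimal sketch, using a small hypothetical joint PMF (the numbers are illustrative, not from the lecture):

```python
# Sketch: verify I(X;Y) = H(X) - H(X|Y) = D(P_XY || P_X P_Y) on a small
# hypothetical joint PMF (values chosen arbitrarily for illustration).
from math import log2

P = {(0, 0): 0.4, (0, 1): 0.1, (1, 0): 0.2, (1, 1): 0.3}   # P_{X,Y}
Px = {x: sum(p for (a, b), p in P.items() if a == x) for x in (0, 1)}
Py = {y: sum(p for (a, b), p in P.items() if b == y) for y in (0, 1)}

# I(X;Y) as the divergence D(P_{X,Y} || P_X P_Y)
I_div = sum(p * log2(p / (Px[x] * Py[y])) for (x, y), p in P.items())

# I(X;Y) as H(X) - H(X|Y)
H_X = -sum(p * log2(p) for p in Px.values())
H_X_given_Y = -sum(p * log2(p / Py[y]) for (x, y), p in P.items())

assert abs(I_div - (H_X - H_X_given_Y)) < 1e-12
assert I_div >= 0               # the information inequality D(p||q) >= 0
```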
Fano's Lemma

Suppose we have r.v.s X and Y; Fano's lemma bounds the error we expect when estimating X from Y.

We form an estimator of X, X̂ = g(Y). The probability of error is P_e = Pr(X̂ ≠ X).

Define the error indicator E, which is 0 when X̂ = X and 1 otherwise; thus P_e = P(E = 1).

Fano's lemma: H(E) + P_e log(|X| − 1) ≥ H(X|Y).
Proof of Fano's Lemma

Expand H(E, X | Y) two ways. Since E is determined by X and X̂ = g(Y),
H(E, X | Y) = H(X|Y) + H(E | X, Y) = H(X|Y).
On the other hand,
H(E, X | Y) = H(E|Y) + H(X | E, Y)
            ≤ H(E) + H(X | E, Y)
            = H(E) + P_e H(X | E = 1, Y) + (1 − P_e) H(X | E = 0, Y)
            = H(E) + P_e H(X | E = 1, Y)      (given E = 0, X = g(Y) is known)
            ≤ H(E) + P_e log(|X| − 1),
since given E = 1, X differs from g(Y) and therefore takes at most |X| − 1 values. The bound works well (is tight) when |X| is large.
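The inequality can be checked numerically. A sketch on a hypothetical symmetric channel (alphabet size and flip probability chosen for illustration), where the bound happens to hold with equality:

```python
# Sketch: check Fano's inequality H(E) + P_e*log(|X|-1) >= H(X|Y) on a
# hypothetical symmetric channel (eps and m are illustrative choices).
from math import log2

eps = 0.2                         # probability of flipping to each wrong symbol
m = 3                             # alphabet size |X|
# X uniform; P(Y = y | X = x) = 1 - (m-1)*eps if y == x, else eps
P = {(x, y): (1 / m) * ((1 - (m - 1) * eps) if x == y else eps)
     for x in range(m) for y in range(m)}
py = [sum(P[(x, y)] for x in range(m)) for y in range(m)]

H_X_given_Y = -sum(p * log2(p / py[y]) for (x, y), p in P.items() if p > 0)

# estimator: guess X-hat = g(Y) = Y; E = 1 on error
Pe = sum(p for (x, y), p in P.items() if x != y)
H_E = -Pe * log2(Pe) - (1 - Pe) * log2(1 - Pe)

# Fano's bound (tight for this symmetric example)
assert H_E + Pe * log2(m - 1) >= H_X_given_Y - 1e-12
```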
Stochastic Processes

A stochastic process is an indexed sequence of r.v.s X_0, X_1, ..., i.e., a map from Ω to X^∞.

A stochastic process is characterized by its joint PMFs:
P_{X_0, X_1, ..., X_n}(x_0, x_1, ..., x_n), (x_0, x_1, ..., x_n) ∈ X^{n+1}, for n = 0, 1, ...

The entropy of a stochastic process (by the chain rule):
H(X_1, X_2, ...) = H(X_1) + H(X_2 | X_1) + ... + H(X_i | X_1, ..., X_{i−1}) + ...

Difficulties: the sum goes to infinity, and in general all terms are different.
Entropy Rate

The entropy rate of a random process, if it exists, is
lim_{n→∞} (1/n) H(X_1, ..., X_n).

Examples: an i.i.d. sequence of r.v.s; i.i.d. blocks of r.v.s.

A stochastic process is stationary if
P_{X_0, X_1, ..., X_n}(x_0, x_1, ..., x_n) = P_{X_l, X_{l+1}, ..., X_{l+n}}(x_0, x_1, ..., x_n)
for every shift l and all (x_0, x_1, ..., x_n).

For stationary processes, the limit exists.
Entropy Rate of Stationary Processes

Chain rule:
(1/n) H(X_1, X_2, ..., X_n) = (1/n) Σ_{i=1}^n H(X_i | X_1, ..., X_{i−1}).

By the Cesàro mean theorem, the limit on the LHS exists if the individual terms on the RHS have a limit.

For a stationary process,
H(X_{n+1} | X_1^n) ≤ H(X_{n+1} | X_2^n) = H(X_n | X_1^{n−1}),
where the inequality holds because conditioning reduces entropy and the equality holds by stationarity. Therefore the sequence H(X_n | X_1^{n−1}) is non-increasing and non-negative, so its limit exists.

Theorem. For stationary processes, the entropy rate
lim_{n→∞} (1/n) H(X_1^n) = lim_{n→∞} H(X_n | X_1^{n−1}).
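The theorem can be illustrated numerically. A sketch for a hypothetical stationary 2-state Markov chain, where the chain rule gives H(X_1^n) = H(π) + (n−1) H(X_2|X_1) exactly, so the Cesàro mean can be evaluated in closed form:

```python
# Sketch: Cesaro convergence of (1/n) H(X_1^n) to lim H(X_n | X_1^{n-1}) for a
# hypothetical stationary 2-state Markov chain (P and pi are illustrative).
from math import log2

P = [[0.9, 0.1],
     [0.5, 0.5]]
pi = [5/6, 1/6]                      # stationary distribution: pi = pi P

H_pi = -sum(p * log2(p) for p in pi)                        # H(X_1)
H_cond = -sum(pi[i] * P[i][j] * log2(P[i][j])               # H(X_2 | X_1)
              for i in (0, 1) for j in (0, 1))

def avg_entropy(n):
    """(1/n) H(X_1, ..., X_n) via the chain rule for a stationary Markov chain."""
    return (H_pi + (n - 1) * H_cond) / n

# the per-symbol entropy decreases monotonically to the entropy rate H_cond
assert avg_entropy(1) > avg_entropy(10) > avg_entropy(100) > H_cond
assert abs(avg_entropy(10**6) - H_cond) < 1e-6
```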
Markov Chain

A discrete stochastic process is a Markov chain if
P_{X_n | X_0, ..., X_{n−1}}(x_n | x_0, ..., x_{n−1}) = P_{X_n | X_{n−1}}(x_n | x_{n−1})
for n = 1, 2, ... and all (x_0, x_1, ..., x_n).

X_n, the state after n transitions, belongs to a finite set, e.g., {1, ..., m}; X_0 is either given or random.
Time-Invariant Markov Processes

The transition probability is time-invariant:
p_{i,j} = P(X_{n+1} = j | X_n = i) = P(X_{n+1} = j | X_n = i, X_{n−1}, ..., X_0).

The Markov chain is characterized by the probability transition matrix P = [p_{i,j}].

Question: stationary vs. time-invariant — a time-invariant chain need not be a stationary process; it depends on the initial distribution.

Let r_i(n) = P(X_n = i), conditioned on an initial state or averaged over a random initial state. Key recursion:
r_j(n+1) = Σ_i r_i(n) p_{i,j},   i.e.,   r(n+1) = r(n) P.
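The key recursion can be iterated directly. A minimal sketch for a hypothetical 2-state chain (the transition probabilities are illustrative):

```python
# Sketch: iterate the key recursion r(n+1) = r(n) P for a hypothetical
# 2-state time-invariant Markov chain.
P = [[0.9, 0.1],
     [0.5, 0.5]]

def step(r, P):
    """One application of r_j(n+1) = sum_i r_i(n) * p_{i,j}."""
    return [sum(r[i] * P[i][j] for i in range(len(r))) for j in range(len(P[0]))]

r = [1.0, 0.0]           # start deterministically in state 0
for _ in range(100):     # iterate the recursion
    r = step(r, P)

# the state distribution converges to the stationary pi solving pi = pi P;
# for this chain, pi = (5/6, 1/6)
assert abs(r[0] - 5/6) < 1e-9 and abs(r[1] - 1/6) < 1e-9
```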
Review of Markov Chains

Is there always a probability vector solving π = πP? Is that solution unique? Starting from any (possibly random) initial state, does the state distribution converge to π?

For a Markov chain with a single class of recurrent aperiodic states, there is a unique stationary distribution π, and each row of P^n converges to π.

The entropy rate of such a Markov chain:
lim_{n→∞} (1/n) H(X_1^n) = lim_{n→∞} H(X_n | X_{n−1}) = −Σ_{i,j} π_i p_{i,j} log p_{i,j}.
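The entropy-rate formula is a π-weighted average of row entropies of P. A sketch for the same hypothetical 2-state chain:

```python
# Sketch: entropy rate of a stationary Markov chain,
# -sum_{i,j} pi_i p_{i,j} log p_{i,j}, for a hypothetical 2-state chain.
from math import log2

P = [[0.9, 0.1],
     [0.5, 0.5]]
pi = [5/6, 1/6]    # stationary distribution: pi = pi P

rate = -sum(pi[i] * P[i][j] * log2(P[i][j])
            for i in range(2) for j in range(2) if P[i][j] > 0)

# equivalently, the rate is the pi-weighted average of the row entropies
H_row = [-sum(p * log2(p) for p in row if p > 0) for row in P]
assert abs(rate - (pi[0] * H_row[0] + pi[1] * H_row[1])) < 1e-12
assert rate > 0
```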
Random Walk on a Graph

Example: random walk on a 3 × 3 chessboard with squares numbered

1 2 3
4 5 6
7 8 9

where the walker moves like a king (to any adjacent square, including diagonals), uniformly at random. From square 2,
p_{2,j} = (1/5, 0, 1/5, 1/5, 1/5, 1/5, 0, 0, 0).

Conditioned on X_{n−1} = 2, observing X_n gives log_2 5 bits of information.

Entropy rate: 4π_1 log 3 + 4π_2 log 5 + π_5 log 8 (four corners, four edge squares, and the center, by symmetry).

For an n × n chessboard with n → ∞, the entropy rate approaches log 8.
Random Walk on a Graph

Consider an undirected graph G = (N, E, W), where N, E, W are the nodes, edges, and weights. With each edge (i, j) there is an associated weight W_{i,j} = W_{j,i}. Define
W_i = Σ_j W_{i,j},   W = Σ_{i,j : j>i} W_{i,j},   so that 2W = Σ_i W_i.

We call a random walk the Markov chain in which the states are the nodes of the graph and
p_{i,j} = W_{i,j} / W_i,   with stationary distribution π_i = W_i / (2W).
Random Walk on a Graph

Check: Σ_i π_i = 1, and
Σ_i π_i p_{i,j} = Σ_i (W_i / 2W)(W_{i,j} / W_i) = Σ_i W_{i,j} / 2W = W_j / 2W = π_j.

Back to the example: for the random walk on the 3 × 3 chessboard, W_{i,j} = 1 for all connected i, j, and 2W = 40, so
π_1 = 3/40,   π_2 = 5/40,   π_5 = 8/40.
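These stationary probabilities can be verified by building the chessboard graph explicitly. A sketch assuming king moves and unit edge weights, as in the example:

```python
# Sketch: random walk on the 3x3 chessboard with king moves and unit weights
# W_{i,j} = 1; verify 2W = 40, pi_i = W_i / 2W, and the check pi P = pi.
board = [(r, c) for r in range(3) for c in range(3)]    # squares 1..9
adj = {u: [v for v in board
           if v != u and abs(u[0] - v[0]) <= 1 and abs(u[1] - v[1]) <= 1]
       for u in board}                                  # king-move adjacency

W = {u: len(adj[u]) for u in board}     # W_i = sum_j W_{i,j}
twoW = sum(W.values())
pi = {u: W[u] / twoW for u in board}

assert twoW == 40
assert abs(sum(pi.values()) - 1) < 1e-12
assert abs(pi[(0, 0)] - 3/40) < 1e-12   # corner, square 1
assert abs(pi[(0, 1)] - 5/40) < 1e-12   # edge square, square 2
assert abs(pi[(1, 1)] - 8/40) < 1e-12   # center, square 5

# stationarity check: sum_i pi_i p_{i,j} = pi_j with p_{i,j} = 1/W_i
for j in board:
    assert abs(sum(pi[i] / W[i] for i in board if j in adj[i]) - pi[j]) < 1e-12
```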
Random Walk on a Graph

H(X_2 | X_1) = −Σ_i π_i Σ_j p_{i,j} log p_{i,j}
             = −Σ_i (W_i / 2W) Σ_j (W_{i,j} / W_i) log (W_{i,j} / W_i)
             = −Σ_{i,j} (W_{i,j} / 2W) log (W_{i,j} / W_i)
             = −Σ_{i,j} (W_{i,j} / 2W) log (W_{i,j} / 2W) + Σ_{i,j} (W_{i,j} / 2W) log (W_i / 2W)
             = −Σ_{i,j} (W_{i,j} / 2W) log (W_{i,j} / 2W) + Σ_i (W_i / 2W) log (W_i / 2W).

The entropy rate is the difference of two entropies: the entropy of the edge distribution (W_{i,j} / 2W) minus the entropy of the node distribution (W_i / 2W).
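The identity can be checked on the chessboard example. A sketch, again assuming king moves and unit edge weights:

```python
# Sketch: verify that the entropy rate of the unit-weight 3x3 chessboard walk
# equals H(edge distribution W_{i,j}/2W) - H(node distribution W_i/2W).
from math import log2

board = [(r, c) for r in range(3) for c in range(3)]
adj = {u: [v for v in board
           if v != u and abs(u[0] - v[0]) <= 1 and abs(u[1] - v[1]) <= 1]
       for u in board}
W = {u: len(adj[u]) for u in board}     # node weights W_i (unit edge weights)
twoW = sum(W.values())                  # 2W = 40

# direct: H(X_2|X_1) = -sum_i pi_i sum_j p_{i,j} log p_{i,j}, p_{i,j} = 1/W_i
direct = -sum((W[i] / twoW) * (1 / W[i]) * log2(1 / W[i])
              for i in board for j in adj[i])

# difference of two entropies over directed edges (i, j)
H_edges = -sum((1 / twoW) * log2(1 / twoW) for i in board for j in adj[i])
H_nodes = -sum((W[i] / twoW) * log2(W[i] / twoW) for i in board)

assert abs(direct - (H_edges - H_nodes)) < 1e-12
```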
Hidden Markov Models

Consider an ALOHA wireless model: M users share the same radio channel to transmit packets to a base station.

During each time slot, a packet arrives at a user's queue with probability p, independently of the other M − 1 users.

Also, at the beginning of each time slot, if a user has at least one packet in its queue, it transmits a packet with probability q, independently of all other users.

If two packets collide at the receiver, they are not successfully transmitted and remain in their respective queues.
Hidden Markov Models

Let X_i = (n_1, n_2, ..., n_M) denote the random vector at time i, where n_m is the number of packets in user m's queue. X_i is a Markov chain.

Consider the random vector Y_i = (y_1, y_2, ..., y_M), where y_m = 1 if user m transmits during time slot i and y_m = 0 otherwise.

Is Y_i Markov?
Hidden Markov Processes

Let ..., X_1, X_2, ... be a stationary Markov chain and let Y_i = φ(X_i) be a process each term of which is a function of the corresponding state in the Markov chain.

..., Y_1, Y_2, ... form a hidden Markov process, which is not always a Markov chain, but is still stationary. What is its entropy rate?

We can compute H(Y_n | Y_1^{n−1}), which monotonically decreases with n and so upper-bounds the entropy rate. We also need a lower bound on the entropy rate.
Genie Trick

We want to construct another sequence that lower-bounds the entropy rate lim_{n→∞} H(Y_n | Y_1^{n−1}) yet has the same limit.

To lower-bound an entropy, condition on extra information from a genie. The genie information has to be small, yet enough to tip the scale. Choose H(Y_n | Y_1^{n−1}, X_1).

Claim: H(Y_n | Y_1^{n−1}, X_1) ≤ lim_{n→∞} H(Y_n | Y_1^{n−1}).

Indeed, for any k ≥ 0,
H(Y_n | Y_1^{n−1}, X_1) = H(Y_n | Y_1^{n−1}, X_1, X_0, ..., X_{−k})     (Markov property)
                        = H(Y_n | Y_1^{n−1}, X_{−k}^1)
                        ≤ H(Y_n | Y_1^{n−1}, Y_{−k}^0)     (Y_{−k}^0 is a function of X_{−k}^0)
                        = H(Y_{n+k+1} | Y_1^{n+k})     (stationarity),
and letting k → ∞ gives the claim. All the information about the history is captured in X_1.
Hidden Markov Processes

Claim: H(Y_n | Y_1^{n−1}) − H(Y_n | Y_1^{n−1}, X_1) = I(X_1; Y_n | Y_1^{n−1}) → 0.

Indeed,
lim_{n→∞} I(X_1; Y_1^n) = lim_{n→∞} Σ_{i=1}^n I(X_1; Y_i | Y_1^{i−1}) = Σ_{i=1}^∞ I(X_1; Y_i | Y_1^{i−1}) ≤ H(X_1).

Since we have an infinite sum of non-negative terms that is upper-bounded by H(X_1), the terms must tend to 0. Hence the upper and lower bounds converge to the same limit, the entropy rate of the hidden Markov process.
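The two bounds can be computed exactly for a small chain by enumerating output strings. A sketch for a hypothetical 2-state hidden Markov process (transition, stationary, and emission probabilities are all illustrative choices, not from the lecture):

```python
# Sketch: bracket the entropy rate of a hidden Markov process between the
# upper bound H(Y_n | Y_1^{n-1}) and the genie lower bound
# H(Y_n | Y_1^{n-1}, X_1), computed exactly for a hypothetical 2-state HMM.
from itertools import product
from math import log2

P = [[0.9, 0.1], [0.5, 0.5]]      # hidden-state transition matrix
pi = [5/6, 1/6]                   # its stationary distribution: pi = pi P
emit = [[0.8, 0.2], [0.3, 0.7]]   # emit[x][y] = P(Y = y | X = x)

def joint(ys, x1=None):
    """P(Y_1..Y_n [, X_1 = x1]) via the forward recursion over hidden states."""
    total = 0.0
    for s0 in ([x1] if x1 is not None else [0, 1]):
        alpha = [0.0, 0.0]
        alpha[s0] = pi[s0] * emit[s0][ys[0]]
        for y in ys[1:]:
            alpha = [sum(alpha[i] * P[i][j] for i in (0, 1)) * emit[j][y]
                     for j in (0, 1)]
        total += sum(alpha)
    return total

def H(n, genie=False):
    """H(Y_n | Y_1^{n-1}) or, with genie=True, H(Y_n | Y_1^{n-1}, X_1)."""
    h = 0.0
    for ys in product((0, 1), repeat=n):
        for x1 in ((0, 1) if genie else (None,)):
            p = joint(ys, x1)
            q = joint(ys[:-1], x1) if n > 1 else (pi[x1] if genie else 1.0)
            if p > 0:
                h -= p * log2(p / q)
    return h

upper, lower = H(6), H(6, genie=True)
# the bounds bracket the entropy rate, and the upper bound is non-increasing
assert lower <= upper + 1e-12
assert upper <= H(3) + 1e-12
```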