Lecture 6: Entropy Rate

- Entropy rate $H(\mathcal{X})$
- Random walk on a graph

Dr. Yao Xie, ECE587, Information Theory, Duke University
Coin tossing versus poker

- Toss a fair coin and see a sequence Head, Tail, Tail, Head, ...: the typical sequences $(x_1, x_2, \ldots, x_n)$ number about $2^{nH(X)}$
- Play a card game with a friend and see a sequence A, K, Q, J, 10, ...: what can we say about $(x_1, x_2, \ldots, x_n)$ when the symbols are dependent?
How to model dependence: Markov chain

- A stochastic process $X_1, X_2, \ldots$, where each state $X_i \in \mathcal{X}$
- The next step depends only on the previous state: the transition probability satisfies $p(x_{n+1} \mid x_n, \ldots, x_1) = p(x_{n+1} \mid x_n)$
- $P_{ij}$: the transition probability of $i \to j$
- $p(x_{n+1}) = \sum_{x_n} p(x_n)\, p(x_{n+1} \mid x_n)$
- $p(x_1, x_2, \ldots, x_n) = p(x_1)\, p(x_2 \mid x_1) \cdots p(x_n \mid x_{n-1})$
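The chain rule for a Markov chain can be sketched in a few lines of Python. This is a minimal illustration with a made-up two-state chain (the transition probabilities and initial distribution below are assumptions, not from the slides):

```python
import random

# Hypothetical two-state chain (states 0 and 1) with assumed transition
# probabilities, to illustrate p(x1,...,xn) = p(x1) * prod_i p(x_{i+1} | x_i).
P = [[0.9, 0.1],   # p(next | current = 0)
     [0.3, 0.7]]   # p(next | current = 1)
p0 = [0.5, 0.5]    # assumed initial distribution p(x1)

def sample_chain(n, seed=0):
    """Draw (x1, ..., xn) from the Markov chain."""
    rng = random.Random(seed)
    x = [rng.choices([0, 1], weights=p0)[0]]
    for _ in range(n - 1):
        # next state depends only on the current state x[-1]
        x.append(rng.choices([0, 1], weights=P[x[-1]])[0])
    return x

def joint_prob(xs):
    """Chain rule: p(x1) * p(x2 | x1) * ... * p(xn | x_{n-1})."""
    prob = p0[xs[0]]
    for prev, cur in zip(xs, xs[1:]):
        prob *= P[prev][cur]
    return prob

print(joint_prob([0, 0, 1, 1]))  # 0.5 * 0.9 * 0.1 * 0.7
```

The point of `joint_prob` is that the full joint distribution needs only the initial distribution and the one-step transition probabilities.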
Hidden Markov model (HMM)

- Used extensively in speech recognition, handwriting recognition, and machine learning
- The Markov process $X_1, X_2, \ldots, X_n$ is unobservable
- We observe a random process $Y_1, Y_2, \ldots, Y_n$ such that $Y_i \sim p(y_i \mid x_i)$
- We can build a probability model
  $p(x^n, y^n) = p(x_1) \prod_{i=1}^{n-1} p(x_{i+1} \mid x_i) \prod_{i=1}^{n} p(y_i \mid x_i)$
Time-invariant Markov chain

A Markov chain is time invariant if the conditional probability $p(x_n \mid x_{n-1})$ does not depend on $n$:
$p(X_{n+1} = b \mid X_n = a) = p(X_2 = b \mid X_1 = a), \quad \text{for all } a, b \in \mathcal{X}$

For this kind of Markov chain, define the transition matrix
$P = \begin{bmatrix} P_{11} & \cdots & P_{1n} \\ \vdots & & \vdots \\ P_{n1} & \cdots & P_{nn} \end{bmatrix}$
Simple weather model

- $\mathcal{X} = \{\text{Sunny } S, \text{Rainy } R\}$
- $p(S \mid S) = 1 - \beta$, $p(R \mid R) = 1 - \alpha$, $p(R \mid S) = \beta$, $p(S \mid R) = \alpha$
- $P = \begin{bmatrix} 1-\beta & \beta \\ \alpha & 1-\alpha \end{bmatrix}$

(Figure: state diagram of the two-state weather chain.)
Probability of seeing a sequence SSRR:
$p(SSRR) = p(S)\, p(S \mid S)\, p(R \mid S)\, p(R \mid R) = p(S)(1-\beta)\beta(1-\alpha)$

- How will such sequences behave after many days of observations?
- Which sequences of observations are more typical?
- What is the probability of seeing a typical sequence?
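The SSRR calculation above can be checked numerically. The values $\alpha = 0.4$, $\beta = 0.2$ and the choice of initial $p(S)$ are assumptions (the slide leaves them symbolic); here the first day is started from the stationary value $\alpha/(\alpha+\beta)$:

```python
# Sketch of the slide's calculation p(SSRR) = p(S) p(S|S) p(R|S) p(R|R),
# with assumed alpha = 0.4, beta = 0.2 and an assumed initial p(S).
alpha, beta = 0.4, 0.2
p_S = alpha / (alpha + beta)                    # assumed p(S) for day 1
p_SSRR = p_S * (1 - beta) * beta * (1 - alpha)  # p(S)(1-beta) beta (1-alpha)
print(p_SSRR)  # (2/3) * 0.8 * 0.2 * 0.6
```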
Stationary distribution

A stationary distribution is a distribution $\mu$ on the states such that the distribution at time $n+1$ is the same as the distribution at time $n$.

Our weather example: $P = \begin{bmatrix} 1-\beta & \beta \\ \alpha & 1-\alpha \end{bmatrix}$. If $\mu(S) = \frac{\alpha}{\alpha+\beta}$ and $\mu(R) = \frac{\beta}{\alpha+\beta}$, then
$p(X_{n+1} = S) = p(S \mid S)\mu(S) + p(S \mid R)\mu(R) = (1-\beta)\frac{\alpha}{\alpha+\beta} + \alpha\frac{\beta}{\alpha+\beta} = \frac{\alpha}{\alpha+\beta} = \mu(S).$
How to calculate the stationary distribution

The stationary distribution $\mu_i$, $i = 1, \ldots, |\mathcal{X}|$, satisfies
$\mu_i = \sum_j \mu_j P_{ji} \quad (\mu = \mu P), \qquad \sum_{i=1}^{|\mathcal{X}|} \mu_i = 1.$

Detailed balance: $\mu_i P_{ij} = \mu_j P_{ji}$.

(Figure: probability flow between two states, illustrating detailed balance.)
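One simple way to solve $\mu = \mu P$ is to iterate the map $\mu \mapsto \mu P$ until it converges (power iteration). A minimal sketch, using the weather chain with assumed values $\alpha = 0.4$, $\beta = 0.2$:

```python
def stationary(P, iters=1000):
    """Solve mu = mu P by power iteration.
    P is a row-stochastic matrix given as a list of lists."""
    n = len(P)
    mu = [1.0 / n] * n  # start from the uniform distribution
    for _ in range(iters):
        mu = [sum(mu[i] * P[i][j] for i in range(n)) for j in range(n)]
    return mu

# Weather chain from the slides, with assumed alpha = 0.4, beta = 0.2:
alpha, beta = 0.4, 0.2
P = [[1 - beta, beta],
     [alpha, 1 - alpha]]
mu = stationary(P)
print(mu)  # approaches [alpha/(alpha+beta), beta/(alpha+beta)] = [2/3, 1/3]
```

Power iteration converges here because the chain is irreducible and aperiodic; for larger chains one would typically solve the linear system or compute the left eigenvector for eigenvalue 1 instead.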
Stationary process

A stochastic process is stationary if the joint distribution of any subset is invariant to time shifts:
$p(X_1 = x_1, \ldots, X_n = x_n) = p(X_2 = x_1, \ldots, X_{n+1} = x_n).$

Example: coin tossing,
$p(X_1 = \text{head}, X_2 = \text{tail}) = p(X_2 = \text{head}, X_3 = \text{tail}) = p(1-p).$
Entropy rate

- When the $X_i$ are i.i.d., the entropy is $H(X^n) = H(X_1, \ldots, X_n) = \sum_{i=1}^n H(X_i) = nH(X)$
- With a dependent sequence $X_i$, how does $H(X^n)$ grow with $n$? Still linearly?
- The entropy rate characterizes the growth rate
- Definition 1: average entropy per symbol
  $H(\mathcal{X}) = \lim_{n\to\infty} \frac{H(X^n)}{n}$
- Definition 2: rate of information innovation
  $H'(\mathcal{X}) = \lim_{n\to\infty} H(X_n \mid X_{n-1}, \ldots, X_1)$
$H'(\mathcal{X})$ exists for stationary $X_i$

$H(X_n \mid X_1, \ldots, X_{n-1}) \le H(X_n \mid X_2, \ldots, X_{n-1}) \quad$ (conditioning reduces entropy)
$= H(X_{n-1} \mid X_1, \ldots, X_{n-2}) \quad$ (stationarity)

- So $H(X_n \mid X_1, \ldots, X_{n-1})$ decreases as $n$ increases
- It is bounded below: $H(X_n \mid X_1, \ldots, X_{n-1}) \ge 0$
- Hence the limit must exist
$H(\mathcal{X}) = H'(\mathcal{X})$ for stationary $X_i$

By the chain rule,
$\frac{1}{n} H(X_1, \ldots, X_n) = \frac{1}{n} \sum_{i=1}^{n} H(X_i \mid X_{i-1}, \ldots, X_1)$

- Each term $H(X_n \mid X_1, \ldots, X_{n-1}) \to H'(\mathcal{X})$
- Cesàro mean: if $a_n \to a$ and $b_n = \frac{1}{n}\sum_{i=1}^{n} a_i$, then $b_n \to a$
- So $\frac{1}{n} H(X_1, \ldots, X_n) \to H'(\mathcal{X})$
AEP for stationary processes

$-\frac{1}{n} \log p(X_1, \ldots, X_n) \to H(\mathcal{X})$
$p(X_1, \ldots, X_n) \approx 2^{-nH(\mathcal{X})}$

- Typical sequences lie in a typical set of size about $2^{nH(\mathcal{X})}$
- We can use $nH(\mathcal{X})$ bits to represent a typical sequence
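The AEP can be observed empirically: for one long sample path of the weather chain, $-\frac{1}{n}\log_2 p(X_1,\ldots,X_n)$ should be close to the entropy rate. A sketch with assumed $\alpha = 0.4$, $\beta = 0.2$ (the closed-form entropy rate used for comparison is the weighted-binary-entropy formula derived later in the lecture):

```python
import math
import random

# Weather chain with assumed alpha = 0.4, beta = 0.2.
alpha, beta = 0.4, 0.2
P = {('S', 'S'): 1 - beta, ('S', 'R'): beta,
     ('R', 'S'): alpha,    ('R', 'R'): 1 - alpha}
mu = {'S': alpha / (alpha + beta), 'R': beta / (alpha + beta)}

def H2(p):
    """Binary entropy in bits."""
    return -p * math.log2(p) - (1 - p) * math.log2(1 - p)

rng = random.Random(1)
n = 200_000
x = rng.choices(['S', 'R'], weights=[mu['S'], mu['R']])[0]
log_p = math.log2(mu[x])               # log2 p(x1), starting in stationarity
for _ in range(n - 1):
    nxt = rng.choices(['S', 'R'], weights=[P[(x, 'S')], P[(x, 'R')]])[0]
    log_p += math.log2(P[(x, nxt)])    # accumulate log2 p(x_{i+1} | x_i)
    x = nxt

entropy_rate = mu['S'] * H2(beta) + mu['R'] * H2(alpha)
print(-log_p / n, entropy_rate)        # the two numbers should be close
```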
Entropy rate of a Markov chain

For a stationary Markov chain,
$H(\mathcal{X}) = \lim H(X_n \mid X_{n-1}, \ldots, X_1) = \lim H(X_n \mid X_{n-1}) = H(X_2 \mid X_1)$

By definition, $p(X_2 = j \mid X_1 = i) = P_{ij}$, so the entropy rate of the Markov chain is
$H(\mathcal{X}) = -\sum_{ij} \mu_i P_{ij} \log P_{ij}$
Calculating the entropy rate is fairly easy:

1. Find the stationary distribution $\mu_i$
2. Use the transition probabilities $P_{ij}$:
   $H(\mathcal{X}) = -\sum_{ij} \mu_i P_{ij} \log P_{ij}$
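The two-step recipe translates directly into code. A minimal sketch, again using the weather chain with assumed $\alpha = 0.4$, $\beta = 0.2$ and the stationary distribution in closed form:

```python
import math

def entropy_rate(P, mu):
    """H(X) = -sum_ij mu_i P_ij log2 P_ij, skipping zero entries."""
    n = len(P)
    return -sum(mu[i] * P[i][j] * math.log2(P[i][j])
                for i in range(n) for j in range(n)
                if P[i][j] > 0)

# Weather chain with assumed alpha = 0.4, beta = 0.2:
alpha, beta = 0.4, 0.2
P = [[1 - beta, beta],
     [alpha, 1 - alpha]]
mu = [alpha / (alpha + beta), beta / (alpha + beta)]  # step 1
print(entropy_rate(P, mu))                            # step 2, in bits/symbol
```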
Entropy rate of the weather model

Stationary distribution $\mu(S) = \frac{\alpha}{\alpha+\beta}$, $\mu(R) = \frac{\beta}{\alpha+\beta}$, with $P = \begin{bmatrix} 1-\beta & \beta \\ \alpha & 1-\alpha \end{bmatrix}$.

$H(\mathcal{X}) = -\frac{\alpha}{\alpha+\beta}\left[\beta \log \beta + (1-\beta)\log(1-\beta)\right] - \frac{\beta}{\alpha+\beta}\left[\alpha \log \alpha + (1-\alpha)\log(1-\alpha)\right]$
$= \frac{\alpha}{\alpha+\beta} H(\beta) + \frac{\beta}{\alpha+\beta} H(\alpha) \le H\!\left(\frac{2\alpha\beta}{\alpha+\beta}\right)$

(the inequality follows from the concavity of the binary entropy function)

Maximum when $\alpha = \beta = 1/2$: the chain degenerates to an independent process.
Random walk on a graph

- An undirected graph with $m$ nodes $\{1, \ldots, m\}$
- Edge $i \leftrightarrow j$ has weight $W_{ij} \ge 0$ ($W_{ij} = W_{ji}$)
- A particle walks randomly from node to node
- Random walk $X_1, X_2, \ldots$: a sequence of vertices
- Given $X_n = i$, the next step is chosen from the neighboring nodes with probability
  $P_{ij} = \frac{W_{ij}}{\sum_k W_{ik}}$
Entropy rate of a random walk on a graph

- Let $W_i = \sum_j W_{ij}$ and $W = \sum_{i,j:\, i > j} W_{ij}$
- The stationary distribution is $\mu_i = \frac{W_i}{2W}$
- Can verify this is a stationary distribution: $\mu P = \mu$
- The stationary distribution depends only on the total weight of the edges emanating from node $i$ (locality)
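The claim $\mu_i = W_i / (2W)$ can be checked numerically. A small sketch on an assumed weighted triangle graph (the weights below are arbitrary, chosen only for illustration):

```python
# Check mu_i = W_i / (2W) on an assumed 3-node weighted graph.
W = [[0, 1, 2],
     [1, 0, 3],
     [2, 3, 0]]  # symmetric weights: W[i][j] == W[j][i]
m = len(W)

Wi = [sum(row) for row in W]    # W_i = sum_j W_ij
Wtot = sum(Wi) / 2              # W = total edge weight, each edge counted once
mu = [wi / (2 * Wtot) for wi in Wi]

# Transition probabilities P_ij = W_ij / sum_k W_ik:
P = [[W[i][j] / Wi[i] for j in range(m)] for i in range(m)]

# Verify stationarity: mu P should equal mu.
muP = [sum(mu[i] * P[i][j] for i in range(m)) for j in range(m)]
print(mu, muP)
```

The verification works for any symmetric weight matrix because the chain satisfies detailed balance: $\mu_i P_{ij} = \frac{W_i}{2W}\frac{W_{ij}}{W_i} = \frac{W_{ij}}{2W} = \mu_j P_{ji}$.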
Summary

- AEP for a stationary process $X_1, X_2, \ldots$, $X_i \in \mathcal{X}$: as $n \to \infty$,
  $p(x_1, \ldots, x_n) \approx 2^{-nH(\mathcal{X})}$
- Entropy rate
  $H(\mathcal{X}) = \lim_{n\to\infty} H(X_n \mid X_{n-1}, \ldots, X_1) = \lim_{n\to\infty} \frac{1}{n} H(X_1, \ldots, X_n)$
- Random walk on a graph: $\mu_i = \frac{W_i}{2W}$