Lecture 6: Entropy Rate


Lecture 6: Entropy Rate
Entropy rate $H(\mathcal{X})$
Random walk on a graph
Dr. Yao Xie, ECE587, Information Theory, Duke University

Coin tossing versus poker

Toss a fair coin and see a sequence Head, Tail, Tail, Head, ...: by the AEP, $p(x_1, x_2, \ldots, x_n) \approx 2^{-nH(X)}$.
Play a card game with a friend and see a sequence A, K, Q, J, 10, ...: what is $p(x_1, x_2, \ldots, x_n)$? The cards are dependent, so the i.i.d. AEP no longer applies directly.

How to model dependence: Markov chain

A stochastic process $X_1, X_2, \ldots$, where each state $X_i \in \mathcal{X}$.
The next step depends only on the previous state (Markov property):
$p(x_{n+1} \mid x_n, \ldots, x_1) = p(x_{n+1} \mid x_n)$
$P_{ij}$: the transition probability of $i \to j$.
Marginal update: $p(x_{n+1}) = \sum_{x_n} p(x_n)\, p(x_{n+1} \mid x_n)$
Joint factorization: $p(x_1, x_2, \ldots, x_n) = p(x_1)\, p(x_2 \mid x_1) \cdots p(x_n \mid x_{n-1})$
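The marginal update above is just a row vector times the transition matrix. A minimal sketch with a two-state chain (the matrix $P$ and initial distribution are made-up illustration values, not from the lecture):

```python
import numpy as np

# Made-up two-state transition matrix: row i gives p(x_{n+1} | x_n = i).
P = np.array([[0.9, 0.1],
              [0.5, 0.5]])
mu0 = np.array([1.0, 0.0])   # start deterministically in state 0

def step(mu, P):
    # p(x_{n+1}) = sum_{x_n} p(x_n) p(x_{n+1} | x_n)  ==  mu @ P
    return mu @ P

mu1 = step(mu0, P)   # distribution of X_2
mu2 = step(mu1, P)   # distribution of X_3
print(mu1)           # [0.9 0.1]
print(mu2)           # [0.86 0.14]
```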

Hidden Markov model (HMM)

Used extensively in speech recognition, handwriting recognition, and machine learning.
The Markov process $X_1, X_2, \ldots, X_n$ is unobservable.
We observe a random process $Y_1, Y_2, \ldots, Y_n$ such that $Y_i \sim p(y_i \mid x_i)$.
We can build a probability model
$p(x^n, y^n) = p(x_1) \prod_{i=1}^{n-1} p(x_{i+1} \mid x_i) \prod_{i=1}^{n} p(y_i \mid x_i)$
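The factorization suggests how to sample from an HMM: draw the hidden chain forward, emitting an observation at each step. A minimal sketch, where the transition matrix `A`, emission matrix `B`, and initial distribution are made-up illustration values:

```python
import numpy as np

rng = np.random.default_rng(0)

A = np.array([[0.9, 0.1],    # A[i, j] = p(x_{n+1} = j | x_n = i)  (assumed)
              [0.2, 0.8]])
B = np.array([[0.7, 0.3],    # B[i, k] = p(y = k | x = i)          (assumed)
              [0.1, 0.9]])
pi0 = np.array([0.5, 0.5])   # initial state distribution          (assumed)

def sample_hmm(n):
    """Sample (hidden states, observations) of length n from the HMM."""
    xs, ys = [], []
    x = rng.choice(2, p=pi0)
    for _ in range(n):
        xs.append(int(x))
        ys.append(int(rng.choice(2, p=B[x])))  # emit from p(y | x)
        x = rng.choice(2, p=A[x])              # hidden transition
    return xs, ys

xs, ys = sample_hmm(10)
print(ys)   # only the observations would be visible in practice
```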

Time-invariant Markov chain

A Markov chain is time invariant if the conditional probability $p(x_n \mid x_{n-1})$ does not depend on $n$:
$p(X_{n+1} = b \mid X_n = a) = p(X_2 = b \mid X_1 = a)$, for all $a, b \in \mathcal{X}$
For this kind of Markov chain, define the transition matrix
$P = \begin{bmatrix} P_{11} & \cdots & P_{1n} \\ \vdots & & \vdots \\ P_{n1} & \cdots & P_{nn} \end{bmatrix}$

Simple weather model

$\mathcal{X} = \{\text{Sunny } S,\ \text{Rainy } R\}$
$p(S \mid S) = 1 - \beta$, $p(R \mid R) = 1 - \alpha$, $p(R \mid S) = \beta$, $p(S \mid R) = \alpha$
$P = \begin{bmatrix} 1-\beta & \beta \\ \alpha & 1-\alpha \end{bmatrix}$
(State diagram: nodes $S$ and $R$ with self-loops $1-\beta$ and $1-\alpha$ and crossing arrows $\beta$ and $\alpha$.)

Probability of seeing the sequence SSRR:
$p(SSRR) = p(S)\,p(S \mid S)\,p(R \mid S)\,p(R \mid R) = p(S)(1-\beta)\,\beta\,(1-\alpha)$
How will such sequences behave after many days of observations? Which sequences of observations are more typical? What is the probability of seeing a typical sequence?
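The chain-rule computation above is mechanical. A minimal sketch, where $\alpha$, $\beta$, and the initial $p(S)$ are made-up illustration values:

```python
# Transition probabilities of the weather chain, keyed by (prev, cur).
alpha, beta = 0.3, 0.2               # assumed values
P = {('S', 'S'): 1 - beta, ('S', 'R'): beta,
     ('R', 'R'): 1 - alpha, ('R', 'S'): alpha}

def seq_prob(seq, p_init):
    """p(x_1..x_n) = p(x_1) * prod_i p(x_i | x_{i-1})."""
    prob = p_init[seq[0]]
    for prev, cur in zip(seq, seq[1:]):
        prob *= P[(prev, cur)]
    return prob

p_init = {'S': 0.6, 'R': 0.4}        # assumed initial distribution
print(seq_prob('SSRR', p_init))      # 0.6 * 0.8 * 0.2 * 0.7 = 0.0672
```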

Stationary distribution

Stationary distribution: a distribution $\mu$ on the states such that the distribution at time $n+1$ is the same as the distribution at time $n$.
Our weather example, with $P = \begin{bmatrix} 1-\beta & \beta \\ \alpha & 1-\alpha \end{bmatrix}$: if $\mu(S) = \frac{\alpha}{\alpha+\beta}$ and $\mu(R) = \frac{\beta}{\alpha+\beta}$, then
$p(X_{n+1} = S) = p(S \mid S)\mu(S) + p(S \mid R)\mu(R) = (1-\beta)\frac{\alpha}{\alpha+\beta} + \alpha\frac{\beta}{\alpha+\beta} = \frac{\alpha}{\alpha+\beta} = \mu(S)$

How to calculate the stationary distribution

The stationary distribution $\mu_i$, $i = 1, \ldots, |\mathcal{X}|$, satisfies
$\mu_i = \sum_j \mu_j P_{ji}$ (i.e. $\mu = \mu P$), and $\sum_{i=1}^{|\mathcal{X}|} \mu_i = 1$
Detailed balancing: $\mu_i P_{ij} = \mu_j P_{ji}$ across each pair of states.
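Numerically, $\mu = \mu P$ together with the normalization $\sum_i \mu_i = 1$ is a small linear system. A minimal sketch (the matrix `P` is a made-up illustration value; it is the weather chain with $\alpha = 0.3$, $\beta = 0.2$):

```python
import numpy as np

P = np.array([[0.8, 0.2],    # assumed: weather chain, beta = 0.2
              [0.3, 0.7]])   # alpha = 0.3

n = P.shape[0]
# Stack the balance equations (P^T - I) mu = 0 with sum(mu) = 1
# and solve in the least-squares sense.
A = np.vstack([P.T - np.eye(n), np.ones(n)])
b = np.concatenate([np.zeros(n), [1.0]])
mu, *_ = np.linalg.lstsq(A, b, rcond=None)
print(mu)   # alpha/(alpha+beta) = 0.6, beta/(alpha+beta) = 0.4
```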

Stationary process

A stochastic process is stationary if the joint distribution of any subset is invariant to time shifts:
$p(X_1 = x_1, \ldots, X_n = x_n) = p(X_2 = x_1, \ldots, X_{n+1} = x_n)$
Example, i.i.d. coin tossing:
$p(X_1 = \text{head}, X_2 = \text{tail}) = p(X_2 = \text{head}, X_3 = \text{tail}) = p(1-p)$

Entropy rate

When the $X_i$ are i.i.d., the entropy is $H(X^n) = H(X_1, \ldots, X_n) = \sum_{i=1}^n H(X_i) = nH(X)$.
With a dependent sequence $X_i$, how does $H(X^n)$ grow with $n$? Still linearly?
The entropy rate characterizes the growth rate.
Definition 1, average entropy per symbol: $H(\mathcal{X}) = \lim_{n\to\infty} \frac{H(X^n)}{n}$
Definition 2, rate of information innovation: $H'(\mathcal{X}) = \lim_{n\to\infty} H(X_n \mid X_{n-1}, \ldots, X_1)$

$H'(\mathcal{X})$ exists for stationary $X_i$:
$H(X_n \mid X_1, \ldots, X_{n-1}) \le H(X_n \mid X_2, \ldots, X_{n-1})$ (conditioning reduces entropy) (1)
$= H(X_{n-1} \mid X_1, \ldots, X_{n-2})$ (stationarity) (2)
So $H(X_n \mid X_1, \ldots, X_{n-1})$ decreases as $n$ increases, and it is bounded below by 0.
Hence the limit must exist.

$H(\mathcal{X}) = H'(\mathcal{X})$ for stationary $X_i$:
$\frac{1}{n} H(X_1, \ldots, X_n) = \frac{1}{n} \sum_{i=1}^n H(X_i \mid X_{i-1}, \ldots, X_1)$ (chain rule)
Each term $H(X_i \mid X_{i-1}, \ldots, X_1) \to H'(\mathcal{X})$.
Cesàro mean: if $a_n \to a$ and $b_n = \frac{1}{n}\sum_{i=1}^n a_i$, then $b_n \to a$.
So $\frac{1}{n} H(X_1, \ldots, X_n) \to H'(\mathcal{X})$.

AEP for stationary processes

$-\frac{1}{n} \log p(X_1, \ldots, X_n) \to H(\mathcal{X})$
$p(X_1, \ldots, X_n) \approx 2^{-nH(\mathcal{X})}$
Typical sequences lie in a typical set of size $\approx 2^{nH(\mathcal{X})}$.
We can use $nH(\mathcal{X})$ bits to represent a typical sequence.

Entropy rate of a Markov chain

For a stationary Markov chain,
$H(\mathcal{X}) = \lim H(X_n \mid X_{n-1}, \ldots, X_1) = \lim H(X_n \mid X_{n-1}) = H(X_2 \mid X_1)$
By definition $p(X_2 = j \mid X_1 = i) = P_{ij}$, so the entropy rate of a Markov chain is
$H(\mathcal{X}) = -\sum_{ij} \mu_i P_{ij} \log P_{ij}$

Calculating the entropy rate is fairly easy:
1. Find the stationary distribution $\mu_i$.
2. Use the transition probabilities $P_{ij}$:
$H(\mathcal{X}) = -\sum_{ij} \mu_i P_{ij} \log P_{ij}$
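The two-step recipe can be sketched directly (the matrix `P` is a made-up illustration value):

```python
import numpy as np

P = np.array([[0.8, 0.2],    # assumed two-state transition matrix
              [0.3, 0.7]])

# Step 1: stationary distribution = left eigenvector of P for eigenvalue 1.
vals, vecs = np.linalg.eig(P.T)
mu = np.real(vecs[:, np.argmin(np.abs(vals - 1))])
mu = mu / mu.sum()

# Step 2: H = -sum_ij mu_i P_ij log2 P_ij, with 0 log 0 = 0 by convention.
with np.errstate(divide='ignore', invalid='ignore'):
    terms = np.where(P > 0, P * np.log2(P), 0.0)
H = -np.sum(mu[:, None] * terms)
print(H)   # ~0.7857 bits per symbol for this P
```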

Entropy rate of the weather model

Stationary distribution $\mu(S) = \frac{\alpha}{\alpha+\beta}$, $\mu(R) = \frac{\beta}{\alpha+\beta}$, with $P = \begin{bmatrix} 1-\beta & \beta \\ \alpha & 1-\alpha \end{bmatrix}$
$H(\mathcal{X}) = -\frac{\alpha}{\alpha+\beta}\left[\beta \log \beta + (1-\beta) \log(1-\beta)\right] - \frac{\beta}{\alpha+\beta}\left[\alpha \log \alpha + (1-\alpha) \log(1-\alpha)\right]$
$= \frac{\alpha}{\alpha+\beta} H(\beta) + \frac{\beta}{\alpha+\beta} H(\alpha) \le H\!\left(\frac{2\alpha\beta}{\alpha+\beta}\right)$ (by concavity of the binary entropy)
Maximum when $\alpha = \beta = 1/2$: the chain degenerates to an independent process.
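The closed form can be checked against the generic formula $-\sum_{ij} \mu_i P_{ij} \log P_{ij}$; a minimal sketch, with $\alpha$ and $\beta$ as made-up illustration values:

```python
import numpy as np

alpha, beta = 0.3, 0.2   # assumed values

def h2(p):
    """Binary entropy in bits."""
    return -p * np.log2(p) - (1 - p) * np.log2(1 - p)

# Closed form: alpha/(alpha+beta) H(beta) + beta/(alpha+beta) H(alpha)
closed = alpha / (alpha + beta) * h2(beta) + beta / (alpha + beta) * h2(alpha)

# Direct formula from the stationary distribution and transition matrix.
P = np.array([[1 - beta, beta],
              [alpha, 1 - alpha]])
mu = np.array([alpha, beta]) / (alpha + beta)
direct = -np.sum(mu[:, None] * P * np.log2(P))

print(closed, direct)   # ~0.7857 bits; the two agree
```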

Random walk on a graph

An undirected graph with $m$ nodes $\{1, \ldots, m\}$; edge $i \sim j$ has weight $W_{ij} \ge 0$ ($W_{ij} = W_{ji}$).
A particle walks randomly from node to node.
Random walk $X_1, X_2, \ldots$: a sequence of vertices.
Given $X_n = i$, the next step is chosen from the neighboring nodes with probability
$P_{ij} = \frac{W_{ij}}{\sum_k W_{ik}}$

Entropy rate of a random walk on a graph

Let $W_i = \sum_j W_{ij}$ and $W = \sum_{i,j:\, i > j} W_{ij}$ (the total edge weight).
The stationary distribution is $\mu_i = \frac{W_i}{2W}$.
One can verify this is a stationary distribution: $\mu P = \mu$.
The stationary distribution depends only on the total weight of edges emanating from node $i$ (locality).
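The claim $\mu_i = W_i / 2W$ is easy to verify numerically; a minimal sketch on a small triangle graph with made-up weights:

```python
import numpy as np

# Symmetric weight matrix of an undirected 3-node graph (assumed weights).
W = np.array([[0., 2., 1.],
              [2., 0., 3.],
              [1., 3., 0.]])

Wi = W.sum(axis=1)        # W_i: total weight emanating from node i
total = Wi.sum() / 2      # W: each edge counted once
mu = Wi / (2 * total)     # claimed stationary distribution

P = W / Wi[:, None]       # P_ij = W_ij / sum_k W_ik
print(np.allclose(mu @ P, mu))   # True: mu is stationary
```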

Summary

AEP for a stationary process $X_1, X_2, \ldots$, $X_i \in \mathcal{X}$: as $n \to \infty$, $p(x_1, \ldots, x_n) \approx 2^{-nH(\mathcal{X})}$
Entropy rate: $H(\mathcal{X}) = \lim_{n\to\infty} H(X_n \mid X_{n-1}, \ldots, X_1) = \lim_{n\to\infty} \frac{1}{n} H(X_1, \ldots, X_n)$
Random walk on a graph: $\mu_i = \frac{W_i}{2W}$