Hidden Markov Models Part 1: Introduction

CSE 6363 Machine Learning
Vassilis Athitsos
Computer Science and Engineering Department
University of Texas at Arlington

Modeling Sequential Data

Suppose that we have weather data for several days: x_1, x_2, …, x_N. Each x_n is a binary value: x_n = 1 if it rains on day n, and x_n = 0 if it does not rain on day n (we call that a "sunny day"). We want to learn a model that predicts whether it is going to rain on a certain day, based on this data. What options do we have? Lots, as usual in machine learning.

Predicting Rain: Assuming Independence

One option is to assume that the weather in any day is independent of the weather in any previous day. Thus:

p(x_n | x_1, …, x_{n-1}) = p(x_n)

Then, how can we compute p(x)? If the weather of the past tells us nothing about the weather of the next day, then we can simply use the data to calculate how often it rains:

p(x = 1) = (1/N) Σ_{n=1}^{N} x_n

So, the probability that it rains on any day is simply the fraction of days in the training data when it rained.

Predicting Rain: Assuming Independence

If the weather of the past tells us nothing about the weather of the next day, then we can simply use the data to calculate how often it rains:

p(x = 1) = (1/N) Σ_{n=1}^{N} x_n

So, the probability that it rains on any day is simply the fraction of days in the training data when it rained.

Advantages of this approach:
Easy to apply. Only one parameter is estimated.

Disadvantages:
Not using all the information in the data. Past weather does correlate with the weather of the next day.
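As a concrete illustration, here is a minimal Matlab sketch (not from the slides; the data values are made up) that estimates p(x = 1) as the fraction of rainy days in a training sequence of 0/1 rain indicators:

% Estimate p(rain) from a hypothetical sequence x of daily rain indicators.
x = [0 1 1 0 0 1 0 1 1 1];   % example data: 1 = rain, 0 = sun
p_rain = mean(x);            % same as sum(x) / numel(x)
fprintf('Estimated p(rain) = %.2f\n', p_rain);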

Predicting Rain: Modeling Dependence

The other extreme is to assume that the weather of any day depends on the weather of the K previous days. Thus, we have to learn the whole probability distribution:

p(x_n | x_1, …, x_{n-1}) = p(x_n | x_{n-K}, …, x_{n-1})

Advantages of this approach:
It builds a more complex model that can capture more information about how past weather influences the weather of the next day.

Disadvantages:
The amount of data needed to reliably learn such a distribution is exponential in K. Even for relatively small values of K, like K = 5, you may need thousands of training examples to learn the probabilities reliably.

Predicting Rain: Markov Chains

p(x_n | x_1, …, x_{n-1}) = p(x_n | x_{n-K}, …, x_{n-1})

This probabilistic model, where an observation depends on the preceding K observations, is called a K-th order Markov chain. K = 0 leads to a model that is too simple and inaccurate (it assumes that the weather of any day does not depend on the weather of the previous days). A large value of K may require more training data than we have. Choosing a good value of K depends on the application and on the amount of training data.

Predicting Rain: 1st-Order Markov Chain

It is very common to use 1st-order Markov chains to model temporal dependencies:

p(x_n | x_1, …, x_{n-1}) = p(x_n | x_{n-1})

For the rain example, learning this model consists of estimating four values:
p(x_n = 0 | x_{n-1} = 0): probability of a sunny day after a sunny day.
p(x_n = 1 | x_{n-1} = 0): probability of a rainy day after a sunny day.
p(x_n = 0 | x_{n-1} = 1): probability of a sunny day after a rainy day.
p(x_n = 1 | x_{n-1} = 1): probability of a rainy day after a rainy day.
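These four values can be estimated by counting transitions in the training sequence. A minimal Matlab sketch (assumed, not from the slides; the data are made up):

% Estimate the 1st-order transition probabilities from a 0/1 rain sequence x.
x = [0 1 1 0 0 1 0 1 1 1];    % hypothetical training data: 1 = rain, 0 = sun
counts = zeros(2, 2);         % counts(i, j): number of transitions from value i-1 to value j-1
for n = 2:numel(x)
    counts(x(n-1) + 1, x(n) + 1) = counts(x(n-1) + 1, x(n) + 1) + 1;
end
T = counts ./ sum(counts, 2); % each row normalized: T(i, j) estimates p(x_n = j-1 | x_{n-1} = i-1)
disp(T)                       % T(1,1) = p(sun after sun), T(2,2) = p(rain after rain), etc.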

Visualizing a 1st-Order Markov Chain

[State transition diagram: two states, "Rainy day" and "Sunny day", with edges labeled p(rain after rain), p(sun after rain), p(rain after sun), and p(sun after sun).]

This is called a state transition diagram. There are two states: rain and no rain. There are four transition probabilities, defining the probability of the next state given the previous one.

Hidden States

In our previous example, a state ("rainy day" or "sunny day") is observable. When that day comes, you can observe and find out if that day is rainy or sunny. In those cases, the learning problem can be how to predict future states, before we see them.

There are also cases where the states are hidden. We cannot directly observe the value of a state. However, we can observe some features that depend on the state, and that can help us estimate the state. In those cases, the learning problem can be how to figure out the values of the states, given the observations.

Tree Rings and Temperatures

Tree growth rings are visible in a cross-section of the tree trunk. Every year, the tree grows a new ring on the outside. [Image of a tree trunk cross-section; source: Wikipedia.]

Counting the rings can tell us about the age of the tree. The width of each ring contains information about the weather conditions that year (temperature, moisture, …).

Modeling Tree Rings

At this point, we stop worrying about the actual science of how exactly tree ring width correlates with climate. For the sake of illustration, we will make a simple assumption: the tree ring tends to be wider when the average temperature for that year is higher. So, the trunk of a 1,000-year-old tree gives us information about the mean temperature for each of the last 1,000 years.

How do we model that information? We have two sequences:
Sequence of observations: a sequence of widths x_1, x_2, …, x_N.
Sequence of hidden states: a sequence of temperatures z_1, z_2, …, z_N.
We want to find the most likely sequence of state values z_1, z_2, …, z_N, given the observations x_1, x_2, …, x_N.

Modeling Tree Rings

We have two sequences:
Sequence of observations: a sequence of widths x_1, x_2, …, x_N.
Sequence of hidden states: a sequence of temperatures z_1, z_2, …, z_N.
We want to find the most likely sequence of state values z_1, z_2, …, z_N, given the observations x_1, x_2, …, x_N.

Assume that we have training data: other sequences of tree ring widths, for which we know the corresponding temperatures. What can we learn from this training data? One approach is to learn p(z_n | x_n): the probability of the mean temperature z_n for some year given the ring width x_n for that year. Then, for each z_n we pick the value maximizing p(z_n | x_n). Can we build a better model than this?

Hidden Markov Model

The previous model simply estimated p(z | x). It ignored the fact that the mean temperature in a year depends on the mean temperature of the previous year. Taking that dependency into account, we can estimate temperatures with better accuracy. We can use the training data to learn a better model, as follows:
Learn p(x | z): the probability of a tree ring width given the mean temperature for that year.
Learn p(z_n | z_{n-1}): the probability of the mean temperature for a year given the mean temperature for the previous year.
Such a model is called a Hidden Markov Model.

Hidden Markov Model

A Hidden Markov Model (HMM) is a model for how sequential data evolves. An HMM makes the following assumptions:

States are hidden.

States are modeled as a 1st-order Markov chain. That is:
p(z_n | z_1, …, z_{n-1}) = p(z_n | z_{n-1})

Observation x_n is conditionally independent of all other states and observations, given the value of state z_n. That is:
p(x_n | x_1, …, x_{n-1}, x_{n+1}, …, x_N, z_1, …, z_N) = p(x_n | z_n)

Hidden Markov Model

Given the previous assumptions, an HMM consists of:

A set of states s_1, …, s_K.

In the tree ring example, the states can be intervals of temperatures. For example, s_k can be the state corresponding to the mean temperature (in Celsius) being in the [k, k + 1] interval.

Hidden Markov Model

Given the previous assumptions, an HMM consists of:

A set of states s_1, …, s_K.
An initial state probability function π_k = p(z_1 = s_k).

π_k defines the probability that, when we are given a new set of observations x_1, …, x_N, the initial state z_1 is equal to s_k. For the tree ring example, π_k can be defined as the probability that the mean temperature in the first year is equal to k.

Hidden Markov Model

Given the previous assumptions, an HMM consists of:

A set of states s_1, …, s_K.
An initial state probability function π_k = p(z_1 = s_k).
A state transition matrix A, of size K × K, where A_{k,j} = p(z_n = s_j | z_{n-1} = s_k).

Values A_{k,j} are called transition probabilities. For the tree ring example, A_{k,j} is the conditional probability that the mean temperature for a certain year is j, if the mean temperature in the previous year is k.

Hidden Markov Model

Given the previous assumptions, an HMM consists of:

A set of states s_1, …, s_K.
An initial state probability function π_k = p(z_1 = s_k).
A state transition matrix A, of size K × K, where A_{k,j} = p(z_n = s_j | z_{n-1} = s_k).
Observation probability functions, also called emission probabilities, defined as: φ_k(x) = p(x_n = x | z_n = s_k).

For the tree ring example, φ_k(x) is the probability of getting ring width x in a specific year, if the temperature for that year is k.
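To make these components concrete, here is a small Matlab sketch (assumed, not from the slides) that packages an HMM with discrete observations into a struct; the field names and the toy numbers are made up for illustration:

% Hypothetical HMM with K = 2 states and observations taking one of two discrete values.
hmm.pi  = [0.6 0.4];            % initial state probabilities, pi(k) = p(z_1 = s_k)
hmm.A   = [0.7 0.3;             % transition matrix, A(k, j) = p(z_n = s_j | z_{n-1} = s_k)
           0.2 0.8];
hmm.phi = [0.9 0.1;             % emission probabilities, phi(k, v) = p(x_n = v | z_n = s_k)
           0.3 0.7];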

Visualizing the Tree Ring HMM

Assumption: the temperature is discretized to four values, so that we have four state values. The vertices show the four states. The edges show legal transitions between states.

Visualizing the Tree Ring HMM

The edges show legal transitions between states. Each directed edge has its own probability (not shown here). This is a fully connected model, where any state can follow any other state. An HMM does not have to be fully connected.

Joint Probability Model

A fully specified HMM defines a joint probability function p(X, Z):
X is the sequence of observations x_1, …, x_N.
Z is the sequence of hidden state values z_1, …, z_N.

p(X, Z) = p(x_1, …, x_N, z_1, …, z_N) = p(z_1, …, z_N) ∏_{n=1}^{N} p(x_n | z_n)

Why? Because of the assumption that x_n is conditionally independent of all other observations and states, given z_n.

Joint Probability Model

A fully specified HMM defines a joint probability function p(X, Z):

p(X, Z) = p(z_1, …, z_N) ∏_{n=1}^{N} p(x_n | z_n) = p(z_1) ∏_{n=2}^{N} p(z_n | z_{n-1}) ∏_{n=1}^{N} p(x_n | z_n)

Why? Because states are modeled as a 1st-order Markov chain, so that p(z_n | z_1, …, z_{n-1}) = p(z_n | z_{n-1}).

Joint Probability Model

A fully specified HMM defines a joint probability function p(X, Z):

p(X, Z) = p(z_1) ∏_{n=2}^{N} p(z_n | z_{n-1}) ∏_{n=1}^{N} p(x_n | z_n)

p(z_1) is computed using the values π_k.
p(z_n | z_{n-1}) is computed using the transition matrix A.
p(x_n | z_n) is computed using the observation probabilities φ_k(x).
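As an illustration, this joint probability can be evaluated directly from π, A, and φ. A minimal Matlab sketch (assumed code, not from the slides), working in log space to avoid underflow, for a discrete-observation HMM stored as in the earlier struct sketch:

% Log of p(X, Z): z is a vector of state indices (1..K), x a vector of observation indices (1..V).
function lp = hmm_log_joint(hmm, z, x)
    lp = log(hmm.pi(z(1))) + log(hmm.phi(z(1), x(1)));
    for n = 2:numel(z)
        lp = lp + log(hmm.A(z(n-1), z(n))) + log(hmm.phi(z(n), x(n)));
    end
end

For example, hmm_log_joint(hmm, [1 2 2], [1 2 2]) returns the log-probability of that particular pair of state and observation sequences.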

Modeling the Digit 2

Suppose that we want to model the motion of a hand, as it traces in the air the shape of the digit "2". Here is one possible model:
We represent the shape of the digit "2" as five line segments.
Each line segment corresponds to a hidden state. This gives us five hidden states.
We will also have a special end state, which signifies "end of observations".

Modeling the Digit 2

Suppose that we want to model the motion of a hand, as it traces in the air the shape of the digit "2". Here is one possible model:
We represent the shape of the digit "2" as five line segments.
Each line segment corresponds to a hidden state.
We end up with five states, plus the end state.

This HMM is a forward model: if z_n = s_k, then z_{n+1} = s_k or z_{n+1} = s_{k+1}. This is similar to the monotonicity rule in DTW.

Modeling the Digit 2

This HMM is a forward model: if z_n = s_k, then z_{n+1} = s_k or z_{n+1} = s_{k+1}. Therefore, A_{k,j} = 0, except when k = j or k + 1 = j. Remember, A_{k,j} = p(z_n = s_j | z_{n-1} = s_k).

The feature vector at each video frame n can be the displacement vector: the difference between the pixel location of the hand at frame n and the pixel location of the hand at frame n - 1.

Modeling the Digit 2

[Figure: the digit "2" drawn as a sequence of displacement vectors, with two of them labeled x_9 and x_10.]

So, each observation x_n is a 2D vector. It will be convenient to describe each x_n with these two numbers:
Its length l_n, measured in pixels.
Its orientation θ_n, measured in degrees.

Lengths l_n come from a Gaussian distribution with mean μ_{l,k} and variance σ_{l,k} that depend on the state s_k. Orientations θ_n come from a Gaussian distribution with mean μ_{θ,k} and variance σ_{θ,k} that also depend on the state s_k.

Modeling the Digit 2

The decisions we have made so far are often made by a human designer of the system:
The number of states.
The topology of the model (fully connected, forward, or other variations).
The features that we want to use.
The way to model observation probabilities (e.g., using Gaussians, Gaussian mixtures, histograms, etc.).

Once those decisions have been made, the actual probabilities are typically learned using training data:
The initial state probability function π_k = p(z_1 = s_k).
The transition matrix A, where A_{k,j} = p(z_n = s_j | z_{n-1} = s_k).
The observation probabilities φ_k(x) = p(x_n = x | z_n = s_k).

Modeling the Digit 2

The actual probabilities are typically learned using training data:
The initial state probability function π_k = p(z_1 = s_k).
The transition matrix A, where A_{k,j} = p(z_n = s_j | z_{n-1} = s_k).
The observation probabilities φ_k(x) = p(x_n = x | z_n = s_k).

Before we see the algorithm for learning these probabilities, we will first see how we can use an HMM after it has been trained, that is, after all these probabilities have been estimated. To do that, we will look at an example where we just specify these probabilities manually.

Defining the Probabilities

An HMM is defined by specifying:
A set of states s_1, …, s_K.
An initial state probability function π_k.
A state transition matrix A.
Observation probability functions.

In this case, we have five states, s_1, …, s_5. How do we define π_k?

Defining the Probabilities

An HMM is defined by specifying:
A set of states s_1, …, s_K.
An initial state probability function π_k.
A state transition matrix A.
Observation probability functions.

In this case, we have five states, s_1, …, s_5. How do we define π_k?
π_1 = 1, and π_k = 0 for k > 1.

Defining the Probabilities

An HMM is defined by specifying:
A set of states s_1, …, s_K.
An initial state probability function π_k.
A state transition matrix A.
Observation probability functions.

How do we define the transition matrix A?

Defining the Probabilities

An HMM is defined by specifying:
A set of states s_1, …, s_K.
An initial state probability function π_k.
A state transition matrix A.
Observation probability functions.

We need to decide on values for each A_{k,k}. In this model, we spend more time on states s_4 and s_5 than on the other states. This can be modeled by having higher values for A_{4,4} and A_{5,5} than for A_{1,1}, A_{2,2}, A_{3,3}. This way, if z_n = s_4, then z_{n+1} is more likely to also be s_4, and overall state s_4 lasts longer than states s_1, s_2, s_3.

Defining the Probabilities

An HMM is defined by specifying:
A set of states s_1, …, s_K.
An initial state probability function π_k.
A state transition matrix A.
Observation probability functions.

We need to decide on values for each A_{k,k}. In this model, we spend more time on states s_4 and s_5 than on the other states. Here is a set of values that can represent that:
A_{1,1} = 0.4
A_{2,2} = 0.4
A_{3,3} = 0.4
A_{4,4} = 0.8
A_{5,5} = 0.7

Defining the Probabilities

An HMM is defined by specifying:
A set of states s_1, …, s_K.
An initial state probability function π_k.
A state transition matrix A.
Observation probability functions.

Here is the resulting transition matrix A:

0.4  0.6  0.0  0.0  0.0  0.0
0.0  0.4  0.6  0.0  0.0  0.0
0.0  0.0  0.4  0.6  0.0  0.0
0.0  0.0  0.0  0.8  0.2  0.0
0.0  0.0  0.0  0.0  0.7  0.3
0.0  0.0  0.0  0.0  0.0  0.0
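The same matrix, together with the initial probabilities defined earlier, can be written down directly in code; this Matlab snippet is just a transcription of those numbers (the variable names are assumptions):

% Transition matrix for the digit-2 HMM; the sixth state is the end state.
A = [0.4 0.6 0.0 0.0 0.0 0.0;
     0.0 0.4 0.6 0.0 0.0 0.0;
     0.0 0.0 0.4 0.6 0.0 0.0;
     0.0 0.0 0.0 0.8 0.2 0.0;
     0.0 0.0 0.0 0.0 0.7 0.3;
     0.0 0.0 0.0 0.0 0.0 0.0];
pi_init = [1 0 0 0 0 0];   % pi_1 = 1, and pi_k = 0 for k > 1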

Defining the Probabilities

As we said before, each observation x_n is a 2D vector, described by l_n and θ_n:
l_n is the length, measured in pixels.
θ_n is the orientation, measured in degrees.

We can model p(l_n) as a Gaussian N_l, with:
mean μ_l = 10 pixels.
variance σ_l = 1.5 pixels.
Both the mean and the variance do not depend on the state.

We can model θ_n as a Gaussian N_{θ,k}, with:
mean μ_{θ,k} that depends on the state s_k. Obviously, each state corresponds to moving at a different orientation.
variance σ_θ = 10 degrees. This way, σ_θ does not depend on the state.

Defining the Probabilities

We define the observation probability functions φ_k as:

φ_k(x_n) = [1 / (σ_l √(2π))] exp(-(l_n - μ_l)² / (2σ_l²)) · [1 / (σ_θ √(2π))] exp(-(θ_n - μ_{θ,k})² / (2σ_θ²))

For the parameters in the above formula, we (manually) pick these values:
μ_l = 10 pixels.
σ_l = 1.5 pixels.
σ_θ = 10 degrees.
μ_{θ,1} = 45 degrees.
μ_{θ,2} = 0 degrees.
μ_{θ,3} = 60 degrees.
μ_{θ,4} = 120 degrees.
μ_{θ,5} = 0 degrees.

Defining the Probabilities

As we said before, each observation x_n is a 2D vector, described by l_n and θ_n:
l_n is the length, measured in pixels.
θ_n is the orientation, measured in degrees.

We define the observation probability functions φ_k as:

φ_k(x_n) = N_l(l_n) · N_{θ,k}(θ_n)

φ_k(x_n) = [1 / (σ_l √(2π))] exp(-(l_n - μ_l)² / (2σ_l²)) · [1 / (σ_θ √(2π))] exp(-(θ_n - μ_{θ,k})² / (2σ_θ²))

Note: in the above formula for φ_k(x_n), the only part that depends on the state s_k is μ_{θ,k}.
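A small Matlab sketch of this emission density (assumed code, not from the slides). Following the sampling line shown later in the slides, 1.5 and 10 are treated here as variances; the per-state orientation means are the values listed above.

% Emission density phi_k(x) for the digit-2 HMM, where x = [length, orientation].
function p = phi(k, x)
    mu_l     = 10;   var_l = 1.5;     % length: mean (pixels) and variance
    var_th   = 10;                    % orientation variance (degrees)
    mu_theta = [45 0 60 120 0];       % per-state orientation means, as listed on the slide
    p = gauss_pdf(x(1), mu_l, var_l) * gauss_pdf(x(2), mu_theta(k), var_th);
end

function p = gauss_pdf(v, mu, variance)
    % Univariate Gaussian density with the given mean and variance.
    p = exp(-(v - mu)^2 / (2 * variance)) / sqrt(2 * pi * variance);
end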

An HMM as a Generative Model

If we have an HMM whose parameters have already been learned, we can use that HMM to generate data randomly sampled from the joint distribution defined by the HMM:

p(X, Z) = p(z_1) ∏_{n=2}^{N} p(z_n | z_{n-1}) ∏_{n=1}^{N} p(x_n | z_n)

We will now see how to jointly generate a random observation sequence x_1, …, x_N and a random hidden state sequence z_1, …, z_N, based on the distribution p(X, Z) defined by the HMM.

Generating Random Data

p(X, Z) = p(z_1) ∏_{n=2}^{N} p(z_n | z_{n-1}) ∏_{n=1}^{N} p(x_n | z_n)

X = ( )
Z = ( )

Step 1: pick a random z_1, based on the initial state probabilities π_k. Remember: π_k = p(z_1 = s_k). What values of z_1 are legal in our example?

Generating Random Data

p(X, Z) = p(z_1) ∏_{n=2}^{N} p(z_n | z_{n-1}) ∏_{n=1}^{N} p(x_n | z_n)

X = ( )
Z = (s_1)

Step 1: pick a random z_1, based on the initial state probabilities π_k. Remember: π_k = p(z_1 = s_k). What values of z_1 are legal in our example? π_k > 0 only for k = 1. Therefore, it has to be that z_1 = s_1.

Generating Random Data

p(X, Z) = p(z_1) ∏_{n=2}^{N} p(z_n | z_{n-1}) ∏_{n=1}^{N} p(x_n | z_n)

X = ( )
Z = (s_1)

Next step: pick a random x_1, based on the observation probabilities φ_k(x). Which φ_k should we use?

Generating Random Data

p(X, Z) = p(z_1) ∏_{n=2}^{N} p(z_n | z_{n-1}) ∏_{n=1}^{N} p(x_n | z_n)

X = ( )
Z = (s_1)

Next step: pick a random x_1, based on the observation probabilities φ_k(x); since z_1 = s_1, we use φ_1. We choose an l_1 randomly from Gaussian N_l, with mean 10 pixels and variance 1.5 pixels. In Matlab, you can do this with this line:

l1 = randn(1)*sqrt(1.5) + 10

Generating Random Data

p(X, Z) = p(z_1) ∏_{n=2}^{N} p(z_n | z_{n-1}) ∏_{n=1}^{N} p(x_n | z_n)

X = ((8.4, ?))
Z = (s_1)

Next step: pick a random x_1, based on the observation probability φ_1(x). We choose an l_1 randomly from Gaussian N_l, with mean 10 pixels and variance 1.5 pixels. Result (obviously, it will differ each time): 8.4 pixels.

Generating Random Data

p(X, Z) = p(z_1) ∏_{n=2}^{N} p(z_n | z_{n-1}) ∏_{n=1}^{N} p(x_n | z_n)

X = ((8.4, 54))
Z = (s_1)

Next step: pick a random x_1, based on the observation probability φ_1(x). We choose a θ_1 randomly from Gaussian N_{θ,1}, with mean 45 degrees and variance 10 degrees. Result (obviously, it will differ each time): 54 degrees.

Generating Random Data

p(X, Z) = p(z_1) ∏_{n=2}^{N} p(z_n | z_{n-1}) ∏_{n=1}^{N} p(x_n | z_n)

X = ((8.4, 54))
Z = (s_1)

Next step: pick a random z_2. What distribution should we draw z_2 from?

Generating Random Data

p(X, Z) = p(z_1) ∏_{n=2}^{N} p(z_n | z_{n-1}) ∏_{n=1}^{N} p(x_n | z_n)

X = ((8.4, 54))
Z = (s_1)

Next step: pick a random z_2. What distribution should we draw z_2 from? We should use p(z_2 | z_1 = s_1). Where is that stored?

Generating Random Data

p(X, Z) = p(z_1) ∏_{n=2}^{N} p(z_n | z_{n-1}) ∏_{n=1}^{N} p(x_n | z_n)

X = ((8.4, 54))
Z = (s_1)

Next step: pick a random z_2. What distribution should we draw z_2 from? We should use p(z_2 | z_1 = s_1). Where is that stored? On the first row of the state transition matrix A.

Generating Random Data

p(X, Z) = p(z_1) ∏_{n=2}^{N} p(z_n | z_{n-1}) ∏_{n=1}^{N} p(x_n | z_n)

X = ((8.4, 54))
Z = (s_1, s_2)

Next step: pick a random z_2, from the distribution p(z_2 | z_1 = s_1). The relevant values are:
A_{1,1} = p(z_2 = s_1 | z_1 = s_1) = 0.4
A_{1,2} = p(z_2 = s_2 | z_1 = s_1) = 0.6
Picking randomly, we get z_2 = s_2.
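This random pick amounts to sampling from the discrete distribution given by one row of A. A minimal Matlab sketch (variable names assumed), using the transition matrix written down earlier:

% Sample the next state given the current state k, using row k of A.
k = 1;                                         % current state (here z_1 = s_1)
next_state = find(rand <= cumsum(A(k, :)), 1); % inverse-CDF sampling over the states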

Generating Random Data

p(X, Z) = p(z_1) ∏_{n=2}^{N} p(z_n | z_{n-1}) ∏_{n=1}^{N} p(x_n | z_n)

X = ((8.4, 54))
Z = (s_1, s_2)

Next step?

Generating Random Data

p(X, Z) = p(z_1) ∏_{n=2}^{N} p(z_n | z_{n-1}) ∏_{n=1}^{N} p(x_n | z_n)

X = ((8.4, 54), (10.8, ?))
Z = (s_1, s_2)

Next step: pick a random x_2, based on the observation density φ_2(x). We choose an l_2 randomly from Gaussian N_l, with mean 10 pixels and variance 1.5 pixels. Result: 10.8 pixels.

Generating Random Data

p(X, Z) = p(z_1) ∏_{n=2}^{N} p(z_n | z_{n-1}) ∏_{n=1}^{N} p(x_n | z_n)

X = ((8.4, 54), (10.8, 2))
Z = (s_1, s_2)

Next step: pick a random x_2, based on the observation density φ_2(x). We choose a θ_2 randomly from Gaussian N_{θ,2}, with mean 0 degrees and variance 10 degrees. Result: 2 degrees.

Generating Random Data

p(X, Z) = p(z_1) ∏_{n=2}^{N} p(z_n | z_{n-1}) ∏_{n=1}^{N} p(x_n | z_n)

X = ((8.4, 54), (10.8, 2))
Z = (s_1, s_2)

Next step?

Generating Random Data

p(X, Z) = p(z_1) ∏_{n=2}^{N} p(z_n | z_{n-1}) ∏_{n=1}^{N} p(x_n | z_n)

X = ((8.4, 54), (10.8, 2))
Z = (s_1, s_2, s_2)

Next step: pick a random z_3, from the distribution p(z_3 | z_2 = s_2). The relevant values are:
A_{2,2} = p(z_3 = s_2 | z_2 = s_2) = 0.4
A_{2,3} = p(z_3 = s_3 | z_2 = s_2) = 0.6
Picking randomly, we get z_3 = s_2.

Generating Random Data

p(X, Z) = p(z_1) ∏_{n=2}^{N} p(z_n | z_{n-1}) ∏_{n=1}^{N} p(x_n | z_n)

X = ((8.4, 54), (10.8, 2), (11.3, ?))
Z = (s_1, s_2, s_2)

Next step: pick a random x_3, based on the observation density φ_2(x). We choose an l_3 randomly from Gaussian N_l, with mean 10 pixels and variance 1.5 pixels. Result: 11.3 pixels.

Generating Random Data

p(X, Z) = p(z_1) ∏_{n=2}^{N} p(z_n | z_{n-1}) ∏_{n=1}^{N} p(x_n | z_n)

X = ((8.4, 54), (10.8, 2), (11.3, -3))
Z = (s_1, s_2, s_2)

Next step: pick a random x_3, based on the observation density φ_2(x). We choose a θ_3 randomly from Gaussian N_{θ,2}, with mean 0 degrees and variance 10 degrees. Result: -3 degrees.

Generating Random Data

p(X, Z) = p(z_1) ∏_{n=2}^{N} p(z_n | z_{n-1}) ∏_{n=1}^{N} p(x_n | z_n)

X = ((8.4, 54), (10.8, 2), (11.3, -3))
Z = (s_1, s_2, s_2, …)

Next step: pick a random z_4, from the distribution p(z_4 | z_3 = s_2). The relevant values are:
A_{2,2} = p(z_4 = s_2 | z_3 = s_2) = 0.4
A_{2,3} = p(z_4 = s_3 | z_3 = s_2) = 0.6
Picking randomly, we get a value for z_4 (either s_2 or s_3).

Generating Random Data

p(X, Z) = p(z_1) ∏_{n=2}^{N} p(z_n | z_{n-1}) ∏_{n=1}^{N} p(x_n | z_n)

X = ((8.4, 54), (10.8, 2), (11.3, -3), …)
Z = (s_1, s_2, s_2, …)

Overall, this is an iterative process:
Pick randomly a new state z_n = s_k, based on the state transition probabilities.
Pick randomly a new observation x_n, based on the observation density φ_k(x).
When do we stop?

Generating Random Data

p(X, Z) = p(z_1) ∏_{n=2}^{N} p(z_n | z_{n-1}) ∏_{n=1}^{N} p(x_n | z_n)

X = ((8.4, 54), (10.8, 2), (11.3, -3), …)
Z = (s_1, s_2, s_2, …, s_6)

Overall, this is an iterative process:
Pick randomly a new state z_n = s_k, based on the state transition probabilities.
Pick randomly a new observation x_n, based on the observation density φ_k(x).
We stop when we get z_n = s_6, since s_6 is the end state.
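Putting the whole iterative process together, here is a sketch of a sampler for this HMM (assumed code, not from the slides). It uses the A, pi_init, and orientation means written down earlier, and stops when the end state (state 6) is reached.

% Generate one random (X, Z) pair from the digit-2 HMM.
% Z is a vector of state indices; X is an M-by-2 matrix of [length, orientation]
% observations, one row per non-end state visited.
function [X, Z] = sample_digit2_hmm(A, pi_init, mu_theta)
    mu_l = 10;  var_l = 1.5;  var_th = 10;     % emission parameters from the slides
    Z = find(rand <= cumsum(pi_init), 1);      % sample z_1 from the initial distribution
    X = [];
    while Z(end) ~= 6                          % state 6 is the end state
        k = Z(end);
        l     = randn * sqrt(var_l)  + mu_l;            % sample the segment length
        theta = randn * sqrt(var_th) + mu_theta(k);     % sample the orientation for state k
        X = [X; l, theta];
        Z(end+1) = find(rand <= cumsum(A(k, :)), 1);    % sample the next state
    end
end

For example, [X, Z] = sample_digit2_hmm(A, pi_init, [45 0 60 120 0]) produces one synthetic trace of the digit "2", along the lines of the synthetic examples discussed next.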

An Example of Synthetic Data

The textbook shows a figure as an example of how synthetic data is generated. The top row of that figure shows some of the training images used to train a model of the digit "2" (not identical to the model we described before, but along the same lines). The bottom row shows three examples of synthetic patterns, generated using the approach we just described. What do you notice in the synthetic data?

An Example of Synthetic Data

The synthetic data is not very realistic. The problem is that some states last longer than they should, and some states last a shorter time than they should. For example:
In the leftmost synthetic example, the top curve is too big relative to the rest of the pattern.
In the middle synthetic example, the diagonal line at the middle is too long.
In the rightmost synthetic example, the top curve is too small relative to the bottom horizontal line.

An Example of Synthetic Data

Why do we get this problem of disproportionate parts? As we saw earlier, each next state is chosen randomly, based on the transition probabilities. There is no "memory" to say that, e.g., if the top curve is big (or small), the rest of the pattern should be proportional to that. This is the price we pay for the Markovian assumption: that the future is independent of the past, given the current state. The benefit of the Markovian assumption is efficient learning and classification algorithms, as we will see.

HMMs: Next Steps

We have seen how HMMs are defined:
Set of states.
Initial state probabilities.
State transition matrix.
Observation probabilities.

We have seen how an HMM defines a probability distribution p(X, Z). We have also seen how to generate random samples from that distribution.

Next we will see:
How to use HMMs for various tasks, like classification.
How to learn HMMs from training data.