CS 7180: Behavioral Modeling and Decision-making in AI
Hidden Markov Models
Prof. Amy Sliva
October 26, 2012

Partially observable temporal domains
POMDPs represented uncertainty about the state
  Belief states give the probability of a state, given the actions and observations
  Considered observations in states, but not time or the sequence of events
Sometimes order is important, but the underlying states are still hidden
  Computational linguistics (speech recognition, part-of-speech tagging)
  Vision (facial expression and human behavior recognition from video)
  Bioinformatics (gene finding)
  Computer security (attack detection and prediction, anomaly detection)
  International politics (conflict recognition and forecasting)
Hidden Markov models
  Temporal probabilistic model: a special case of dynamic Bayesian networks (DBNs) with hidden states
  Structure allows for an elegant matrix implementation

Hidden Markov model
Stochastic system represented by three matrices
  N = number of states; state sequence $Q = \{q_1, \dots, q_T\}$
    Environmental context: sequence of states from times 1 to T
  M = number of observation symbols; observation sequence $O = \{o_1, \dots, o_T\}$
    Sequence of evidence the agent observes at times 1 to T
  A = transition model, $a_{ij} = P(q_{t+1} = j \mid q_t = i)$
    State transition probability matrix
  B = observation model, $b_j(k) = P(o_t = k \mid q_t = j)$
    Probability distribution over observations (probability of seeing observation k in state j)
  π = prior state distribution, $\pi_i = P(q_1 = i)$
    Probability that state i is the start state
Full HMM is a triple $\lambda = (A, B, \pi)$
  First-order Markov transition model
  Stationary transition and observation models

Graphical representation of HMMs
[Figure: dynamic Bayesian network with hidden states $\text{Rain}_{t-1} \to \text{Rain}_t \to \text{Rain}_{t+1}$ and observed evidence $\text{Umbrella}_{t-1}$, $\text{Umbrella}_t$, $\text{Umbrella}_{t+1}$, each $\text{Umbrella}_t$ depending on $\text{Rain}_t$]
  $P(R_t = \text{true} \mid R_{t-1} = \text{true}) = 0.7$, $P(R_t = \text{true} \mid R_{t-1} = \text{false}) = 0.3$
  $P(U_t = \text{true} \mid R_t = \text{true}) = 0.9$
Assume each state is represented by a single random variable
  Same as Bayesian networks
If states have more than one variable, use a state space representation
  Megavariable for the state equal to the tuple of all values of the individual variables

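Matrix representation of HMM
HMM is a triple $\lambda = (A, B, \pi)$
Prior π (one entry per state)
  $$\pi = \begin{pmatrix} P(q_1) \\ P(q_2) \\ \vdots \\ P(q_n) \end{pmatrix}$$
Transition model A (rows index the next state, columns the current state, so each column sums to 1)
  $$A = \begin{pmatrix} P(q_1 \mid q_1) & P(q_1 \mid q_2) & \cdots & P(q_1 \mid q_n) \\ P(q_2 \mid q_1) & P(q_2 \mid q_2) & \cdots & P(q_2 \mid q_n) \\ \vdots & \vdots & \ddots & \vdots \\ P(q_n \mid q_1) & P(q_n \mid q_2) & \cdots & P(q_n \mid q_n) \end{pmatrix}$$
Observation model B (rows index the observation, columns the state)
  $$B = \begin{pmatrix} P(o_1 \mid q_1) & P(o_1 \mid q_2) & \cdots & P(o_1 \mid q_n) \\ P(o_2 \mid q_1) & P(o_2 \mid q_2) & \cdots & P(o_2 \mid q_n) \\ \vdots & \vdots & \ddots & \vdots \\ P(o_m \mid q_1) & P(o_m \mid q_2) & \cdots & P(o_m \mid q_n) \end{pmatrix}$$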
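
As a rough illustration of the triple λ = (A, B, π), the sketch below writes the umbrella model as plain Python lists and checks that each distribution sums to 1. It follows the index convention $a_{ij} = P(q_{t+1} = j \mid q_t = i)$, so rows (not columns) are distributions here. The uniform prior and the value P(umbrella | no rain) = 0.2 are assumptions that do not appear in the slides, and the helper names `is_distribution` and `validate_hmm` are hypothetical.

```python
# Umbrella HMM written as the triple lambda = (A, B, pi).
# State 0 = rain, state 1 = no rain; symbol 0 = umbrella, symbol 1 = no umbrella.
# ASSUMED values: the uniform prior and P(umbrella | no rain) = 0.2
# are not given in the slides.
pi = [0.5, 0.5]
A = [[0.7, 0.3],   # A[i][j] = P(q_{t+1} = j | q_t = i)
     [0.3, 0.7]]
B = [[0.9, 0.1],   # B[j][k] = P(o_t = k | q_t = j)
     [0.2, 0.8]]


def is_distribution(row, tol=1e-9):
    """True if the entries are non-negative and sum to 1 (within tol)."""
    return all(p >= 0 for p in row) and abs(sum(row) - 1.0) < tol


def validate_hmm(A, B, pi):
    """Check shapes and that pi and every row of A and B are distributions."""
    n = len(pi)
    return (is_distribution(pi)
            and len(A) == n
            and all(len(r) == n and is_distribution(r) for r in A)
            and len(B) == n
            and all(is_distribution(r) for r in B))


model_ok = validate_hmm(A, B, pi)
```

Storing one distribution per row keeps the code aligned with the $a_{ij}$ and $b_j(k)$ subscripts used in the recursions.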

Example: recognizing human behaviors
Use HMM to classify human actions in time-sequential images (Yamato et al., 1992)
Recognize sports activities from images
  Backhand volley
  Backhand stroke
  Forehand volley
  Forehand stroke
  Smash
  Service
Why use an HMM?
  Temporal: each activity is characterized by temporally related stances
  Observable: each activity is associated with observable symbols, one for each characteristic stance

Example: recognizing human behaviors
[Figure: processing pipeline with the stages background subtraction, feature extraction, feature vector sequence, and observation sequence for the HMM]

Example: Robot localization
Observations?
Hidden states?

Three basic HMM problems
Evaluation
  Given an observation sequence $O = \{o_1, \dots, o_T\}$ and an HMM $\lambda = (A, B, \pi)$, how do we compute the probability of O given the model, $P(O \mid \lambda)$?
Decoding
  Given an observation sequence $O = \{o_1, \dots, o_T\}$ and an HMM $\lambda = (A, B, \pi)$, how do we find the state sequence $Q = \{q_1, \dots, q_T\}$ that best explains the observations, $\arg\max_Q P(Q \mid O, \lambda)$?
Learning
  How do we adjust the model parameters $\lambda = (A, B, \pi)$ to best fit the observation sequence, $\arg\max_\lambda P(O \mid \lambda)$?

Probability of an observation sequence
What is $P(O \mid \lambda)$?
  Useful in sequence classification: which model most likely generated the observations?
The probability of an observation sequence is the sum of the probabilities over all possible state sequences in the HMM
  $P(O \mid \lambda) = \sum_{q} P(O \mid q, \lambda)\, P(q \mid \lambda)$
Naïve computation is very expensive
  Given T observations and N states, there are $N^T$ possible state sequences
  Even for small HMMs, e.g. T = 10 and N = 10, there are 10 billion different paths
Compute more efficiently using dynamic programming (DP)
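
The naïve sum over state sequences can be sketched directly, which makes the $N^T$ cost visible. This is a minimal sketch, not a practical method: it reuses an umbrella-style model whose uniform prior and P(umbrella | no rain) = 0.2 are assumptions not given in the slides, and the function name `naive_likelihood` is hypothetical.

```python
from itertools import product

# ASSUMED toy model (state 0 = rain, 1 = no rain; symbol 0 = umbrella).
# The prior and the second row of B are illustrative assumptions.
pi = [0.5, 0.5]
A = [[0.7, 0.3], [0.3, 0.7]]   # A[i][j] = P(q_{t+1} = j | q_t = i)
B = [[0.9, 0.1], [0.2, 0.8]]   # B[j][k] = P(o_t = k | q_t = j)


def naive_likelihood(obs, A, B, pi):
    """P(O | lambda) by enumerating all N**T state sequences (needs T >= 1)."""
    n = len(pi)
    total = 0.0
    for q in product(range(n), repeat=len(obs)):
        # P(O | q, lambda) * P(q | lambda) for this one path
        p = pi[q[0]] * B[q[0]][obs[0]]
        for t in range(1, len(obs)):
            p *= A[q[t - 1]][q[t]] * B[q[t]][obs[t]]
        total += p
    return total


# Umbrella observed on two consecutive days.
likelihood = naive_likelihood([0, 0], A, B, pi)
```

Each additional time step multiplies the number of enumerated paths by N, which is what the DP approach avoids.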

Forward probabilities
Auxiliary probabilities needed for the DP algorithm
What is the probability, given HMM λ, that at time t the state is i and the partial observation $o_1 \dots o_t$ has been generated?
  $\alpha_t(i) = P(o_1 \dots o_t, q_t = i \mid \lambda)$
  This is the forward probability
Recursive definition
  $\alpha_t(j) = \left[\sum_{i=1}^{N} \alpha_{t-1}(i)\, a_{ij}\right] b_j(o_t)$
  $\alpha_{t-1}(i)$: forward probability of each possible prior state at t-1
  $a_{ij}$: transition probability from state i to j
  $b_j(o_t)$: observation probability of seeing $o_t$ in state j
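
The recursive definition can be derived from the definition of the forward probability in a few steps. The derivation below is a sketch that marginalizes over the state at time t-1 and then applies the two HMM assumptions (first-order Markov transitions, and observations that depend only on the current state).

```latex
\begin{aligned}
\alpha_t(j) &= P(o_1 \dots o_t,\ q_t = j \mid \lambda) \\
  &= \sum_{i=1}^{N} P(o_1 \dots o_{t-1},\ q_{t-1} = i,\ q_t = j,\ o_t \mid \lambda) \\
  &= \sum_{i=1}^{N} P(o_1 \dots o_{t-1},\ q_{t-1} = i \mid \lambda)\,
     P(q_t = j \mid q_{t-1} = i, \lambda)\,
     P(o_t \mid q_t = j, \lambda) \\
  &= \left[\sum_{i=1}^{N} \alpha_{t-1}(i)\, a_{ij}\right] b_j(o_t)
\end{aligned}
```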

Forward algorithm
Dynamic programming for $P(O \mid \lambda)$ using the forward probability
1. Initialization: for each state i compute the probability at time 1
   $\alpha_1(i) = \pi_i\, b_i(o_1)$, $1 \le i \le N$
2. Induction: compute the forward probability for every state j
   $\alpha_t(j) = \left[\sum_{i=1}^{N} \alpha_{t-1}(i)\, a_{ij}\right] b_j(o_t)$, $2 \le t \le T$, $1 \le j \le N$
3. Termination: sum of the forward probabilities at time T
   $P(O \mid \lambda) = \sum_{i=1}^{N} \alpha_T(i)$
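
The three steps above can be sketched in a few lines of Python. This is an illustrative sketch rather than a reference implementation: time is 0-based in the code, the function name `forward` is hypothetical, and the toy model's uniform prior and P(umbrella | no rain) = 0.2 are assumptions not given in the slides.

```python
# ASSUMED toy model (state 0 = rain, 1 = no rain; symbol 0 = umbrella).
pi = [0.5, 0.5]
A = [[0.7, 0.3], [0.3, 0.7]]   # A[i][j] = P(q_{t+1} = j | q_t = i)
B = [[0.9, 0.1], [0.2, 0.8]]   # B[j][k] = P(o_t = k | q_t = j)


def forward(obs, A, B, pi):
    """Return (alpha, P(O | lambda)); alpha[t][i] uses 0-based time t."""
    n = len(pi)
    # 1. Initialization: alpha_1(i) = pi_i * b_i(o_1)
    alpha = [[pi[i] * B[i][obs[0]] for i in range(n)]]
    # 2. Induction: sum over all prior states, then weight by the emission
    for t in range(1, len(obs)):
        prev = alpha[-1]
        alpha.append([
            sum(prev[i] * A[i][j] for i in range(n)) * B[j][obs[t]]
            for j in range(n)
        ])
    # 3. Termination: sum the forward probabilities at the final time
    return alpha, sum(alpha[-1])


alpha, likelihood = forward([0, 0], A, B, pi)
```

For long sequences the products underflow in floating point, so practical implementations usually rescale each time step or work in log space.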

Analysis of the forward algorithm
Naïve approach to solving the evaluation problem
  Takes $O(2T \cdot N^T)$ computations
Forward algorithm
  Takes $O(N^2 T)$ computations
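
To make the gap concrete, the two bounds can be evaluated for the small case N = 10, T = 10. These are order-of-magnitude operation counts that ignore constant factors, not measured running times.

```latex
\begin{aligned}
\text{Na\"ive:}   \quad & 2T \cdot N^T = 2 \cdot 10 \cdot 10^{10} = 2 \times 10^{11} \\
\text{Forward:} \quad & N^2\, T = 10^2 \cdot 10 = 10^{3} \\
\text{Ratio:}   \quad & \frac{2 \times 10^{11}}{10^{3}} = 2 \times 10^{8}
\end{aligned}
```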

Alternative solution: backward probability
Forward probability used the observation sequence seen so far to determine the probability of state i
Backward probability: another auxiliary probability
  Uses the future observation sequence to determine the probability of state i
What is the probability, given an HMM λ and the state i at time t, that the partial observation $o_{t+1} \dots o_T$ is generated?
  $\beta_t(i) = P(o_{t+1} \dots o_T \mid q_t = i, \lambda)$

Backward probabilities
Backward probability
  $\beta_t(i) = P(o_{t+1} \dots o_T \mid q_t = i, \lambda)$
Recursive definition
  $\beta_t(i) = \sum_{j=1}^{N} a_{ij}\, b_j(o_{t+1})\, \beta_{t+1}(j)$
  $a_{ij}$: transition probability from state i to j
  $b_j(o_{t+1})$: observation probability of seeing $o_{t+1}$ in state j
  $\beta_{t+1}(j)$: backward probability of each possible next state at t+1

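Backward algorithm
Dynamic programming for $P(O \mid \lambda)$ using the backward probability
1. Initialization: for all states i at time T
   $\beta_T(i) = 1$, $1 \le i \le N$
2. Induction: for all states i, work backward and compute the backward probability
   $\beta_t(i) = \sum_{j=1}^{N} a_{ij}\, b_j(o_{t+1})\, \beta_{t+1}(j)$, $t = T-1, \dots, 1$, $1 \le i \le N$
3. Termination: sum over all backward probabilities at time 1, weighted by the prior and the first observation
   $P(O \mid \lambda) = \sum_{i=1}^{N} \pi_i\, b_i(o_1)\, \beta_1(i)$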
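
The backward pass can be sketched the same way as the forward pass, running from time T down to time 1. This is an illustrative sketch with 0-based time; the function name `backward` is hypothetical, and the toy model's uniform prior and P(umbrella | no rain) = 0.2 are assumptions not given in the slides. Note that the termination step weights $\beta_1(i)$ by $\pi_i\, b_i(o_1)$, since $\beta_1$ does not account for the first observation.

```python
# ASSUMED toy model (state 0 = rain, 1 = no rain; symbol 0 = umbrella).
pi = [0.5, 0.5]
A = [[0.7, 0.3], [0.3, 0.7]]   # A[i][j] = P(q_{t+1} = j | q_t = i)
B = [[0.9, 0.1], [0.2, 0.8]]   # B[j][k] = P(o_t = k | q_t = j)


def backward(obs, A, B, pi):
    """Return (beta, P(O | lambda)); beta[t][i] uses 0-based time t."""
    n, T = len(pi), len(obs)
    # 1. Initialization: beta_T(i) = 1 for every state
    beta = [[1.0] * n for _ in range(T)]
    # 2. Induction: walk backward from the second-to-last time step
    for t in range(T - 2, -1, -1):
        beta[t] = [
            sum(A[i][j] * B[j][obs[t + 1]] * beta[t + 1][j] for j in range(n))
            for i in range(n)
        ]
    # 3. Termination: combine beta at the first step with pi and b_i(o_1)
    likelihood = sum(pi[i] * B[i][obs[0]] * beta[0][i] for i in range(n))
    return beta, likelihood


beta, likelihood = backward([0, 0], A, B, pi)
```

On the same model and observations, this value should agree with the sum of forward probabilities at time T, which is a convenient consistency check.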

Decoding problem
Forward and backward solutions to the evaluation problem efficiently give the sum over all paths through the HMM
In the decoding problem we want to find the highest-probability path
  What is the state sequence $Q^* = q_1 \dots q_T$ such that $Q^* = \arg\max_Q P(Q \mid O, \lambda)$?
Viterbi algorithm
  Inductive DP algorithm that keeps the best state sequence at each instance

Viterbi algorithm
Similar to computing the forward probabilities
  Instead of summing over transitions from incoming states, compute the maximum
Forward recursion
  $\alpha_t(j) = \left[\sum_{i=1}^{N} \alpha_{t-1}(i)\, a_{ij}\right] b_j(o_t)$
Viterbi recursion
  $\delta_t(j) = \left[\max_{1 \le i \le N} \delta_{t-1}(i)\, a_{ij}\right] b_j(o_t)$

Viterbi algorithm
Use DP to compute $\arg\max_Q P(Q \mid O, \lambda)$
1. Initialization: for all states i at time 1
   $\delta_1(i) = \pi_i\, b_i(o_1)$, $1 \le i \le N$
2. Induction: for all states j compute the max path probability and the previous state that produced it
   $\delta_t(j) = \left[\max_{1 \le i \le N} \delta_{t-1}(i)\, a_{ij}\right] b_j(o_t)$
   $\psi_t(j) = \arg\max_{1 \le i \le N} \delta_{t-1}(i)\, a_{ij}$, $2 \le t \le T$, $1 \le j \le N$
3. Termination: max value of the probabilities at time T
   $p^* = \max_{1 \le i \le N} \delta_T(i)$
   $q_T^* = \arg\max_{1 \le i \le N} \delta_T(i)$
4. Read out the maximal path: get the maximal state at each time point
   $q_t^* = \psi_{t+1}(q_{t+1}^*)$, $t = T-1, \dots, 1$
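
The four steps above can be sketched as follows. This is a minimal sketch, not a tuned implementation: time and state indices are 0-based, ties in the max are broken toward the lowest state index, the function name `viterbi` is hypothetical, and the toy model's uniform prior and P(umbrella | no rain) = 0.2 are assumptions not given in the slides.

```python
# ASSUMED toy model (state 0 = rain, 1 = no rain; symbol 0 = umbrella).
pi = [0.5, 0.5]
A = [[0.7, 0.3], [0.3, 0.7]]   # A[i][j] = P(q_{t+1} = j | q_t = i)
B = [[0.9, 0.1], [0.2, 0.8]]   # B[j][k] = P(o_t = k | q_t = j)


def viterbi(obs, A, B, pi):
    """Return (best state path, its joint probability p*) for T >= 1."""
    n, T = len(pi), len(obs)
    # 1. Initialization: delta_1(i) = pi_i * b_i(o_1)
    delta = [[pi[i] * B[i][obs[0]] for i in range(n)]]
    psi = [[0] * n]  # psi[0] is unused padding so indices line up with delta
    # 2. Induction: keep the best predecessor of every state j
    for t in range(1, T):
        prev = delta[-1]
        best = [max(range(n), key=lambda i: prev[i] * A[i][j])
                for j in range(n)]
        psi.append(best)
        delta.append([prev[best[j]] * A[best[j]][j] * B[j][obs[t]]
                      for j in range(n)])
    # 3. Termination: best final state and its probability
    last = max(range(n), key=lambda i: delta[-1][i])
    # 4. Read out the path by following the back-pointers
    path = [last]
    for t in range(T - 1, 0, -1):
        path.append(psi[t][path[-1]])
    path.reverse()
    return path, delta[-1][last]


# Umbrella seen on days 1, 2, 4, 5 but not on day 3.
path, p_star = viterbi([0, 0, 1, 0, 0], A, B, pi)
```

Storing only the back-pointers ψ, rather than whole paths per state, is what keeps the memory cost at O(N T).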

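Trellis structure of Viterbi paths
[Figure: trellis with hidden states 1, 2, ..., i, j, ..., N on the vertical axis and time t-1, t, t+1, t+2, ..., T on the horizontal axis]
  $\delta_t(j) = \left[\max_{1 \le i \le N} \delta_{t-1}(i)\, a_{ij}\right] b_j(o_t)$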

Using Viterbi to find maximal sequence
Viterbi algorithm actually consists of two phases
  Forward pass to find the maximal probabilities at each time
  Backward pass to extract the maximizing state sequence
[Figure: trellis with hidden states 1, 2, ..., i, j, ..., N on the vertical axis and time t-1, t, t+1, t+2, ..., T on the horizontal axis]

Using Viterbi to find maximal sequence
[Figure: trellis over hidden states q = 1, ..., 4 and times t = 1, ..., 4, filled in one step per slide; node values shown include 0.3, 0.4, and 0.6]
Forward pass
  $\delta_1(i) = \pi_i\, b_i(o_1)$
  $\delta_2(j) = \left[\max_{1 \le i \le 4} \delta_1(i)\, a_{ij}\right] b_j(o_2)$
  $\delta_3(j) = \left[\max_{1 \le i \le 4} \delta_2(i)\, a_{ij}\right] b_j(o_3)$
  $p^* = \max_{1 \le i \le 4} \delta_4(i)$
Backward pass
  $q_4^* = \arg\max_{1 \le i \le 4} \delta_4(i)$
  $q_3^* = \psi_4(q_4^*)$
  $q_2^* = \psi_3(q_3^*)$
  $q_1^* = \psi_2(q_2^*)$
Result: $Q^* = \{2, 1, 2, 4\}$