8: Hidden Markov Models

Machine Learning and Real-world Data
Helen Yannakoudakis (based on slides created by Simone Teufel)
Computer Laboratory, University of Cambridge
Lent

So far we've looked at (statistical) classification.
We experimented with different ideas for sentiment detection.
Let us now talk about... the weather!

Weather prediction
Two types of weather: rainy and cloudy.
The weather doesn't change within the day.
Can we guess what the weather will be like tomorrow?
We can use a history of weather observations:
  $P(w_t = \text{Rainy} \mid w_{t-1} = \text{Rainy}, w_{t-2} = \text{Cloudy}, w_{t-3} = \text{Cloudy}, w_{t-4} = \text{Rainy})$
Markov Assumption (first order):
  $P(w_t \mid w_{t-1}, w_{t-2}, \ldots, w_1) \approx P(w_t \mid w_{t-1})$
The joint probability of a sequence of observations / events is then:
  $P(w_1, w_2, \ldots, w_n) = \prod_{t=1}^{n} P(w_t \mid w_{t-1})$

Markov Chains
Transition probability matrix (rows: today; columns: tomorrow):

                  Tomorrow
                  Rainy    Cloudy
Today   Rainy     0.7      0.3
        Cloudy    0.3      0.7

Two states: rainy and cloudy. In the state diagram, each state has a self-loop with probability 0.7 and an arc to the other state with probability 0.3.
A Markov Chain is a stochastic process that embodies the Markov Assumption.
It can be viewed as a probabilistic finite-state automaton.
States are fully observable, finite and discrete; transitions are labelled with transition probabilities.
It models sequential problems: your current situation depends on what happened in the past.
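As a rough illustration of how the transition matrix is used, the sketch below multiplies transition probabilities along a sequence of weather states. The slides give no probability for the first day, so the uniform `UNIFORM_START` distribution and the function name `sequence_probability` are assumptions made only for this example.

```python
# Transition probabilities from the slide: P(tomorrow | today).
TRANSITIONS = {
    "Rainy": {"Rainy": 0.7, "Cloudy": 0.3},
    "Cloudy": {"Rainy": 0.3, "Cloudy": 0.7},
}

# Assumption: both kinds of weather are equally likely on the first day.
UNIFORM_START = {"Rainy": 0.5, "Cloudy": 0.5}


def sequence_probability(states, transitions, initial):
    """Probability of a state sequence under a first-order Markov chain."""
    prob = initial[states[0]]
    # Each step depends only on the previous state (Markov Assumption).
    for prev, curr in zip(states, states[1:]):
        prob *= transitions[prev][curr]
    return prob


p = sequence_probability(["Rainy", "Rainy", "Cloudy"], TRANSITIONS, UNIFORM_START)
print(p)  # 0.5 * 0.7 * 0.3, i.e. approximately 0.105
```

Because every state is directly observed, scoring a sequence needs nothing more than a table lookup per step.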

Markov Chains
Useful for modeling the probability of a sequence of events that can be unambiguously observed:
  Valid phone sequences in speech recognition
  Sequences of speech acts in dialog systems (answering, ordering, opposing)
  Predictive texting
What if we are interested in events that are not unambiguously observed?

Markov Model

Markov Model: A Time-elapsed view

Hidden Markov Model: A Time-elapsed view
The diagram has two layers: hidden states and observed outputs.
There is an underlying Markov Chain over the hidden states.
We only have access to the observations at each time step.
There is no 1:1 mapping between observations and hidden states.
A number of hidden states can be associated with a particular observation, but the association of states and observations is governed by statistical behaviour.
We now have to infer the sequence of hidden states that corresponds to a sequence of observations.

Hidden Markov Model: A Time-elapsed view
Transition probabilities $P(w_t \mid w_{t-1})$: a matrix with rows Rainy, Cloudy and columns Rainy, Cloudy.
Emission probabilities $P(o_t \mid w_t)$ (observation likelihoods): a matrix with rows Rainy, Cloudy and columns Umbrella, No umbrella.

Hidden Markov Model: A Time-elapsed view, with start and end states $s_0$ and $s_f$
We could use an initial probability distribution over hidden states.
Instead, for simplicity, we will also model this probability as a transition, and we will explicitly add a special start state.
Similarly, we will add a special end state to explicitly model the end of the sequence.
The special start and end states are not associated with real observations.

More formal definition of Hidden Markov Models; States and Observations
  $S_e = \{s_1, \ldots, s_N\}$: a set of N emitting hidden states
  $s_0$: a special start state
  $s_f$: a special end state
  $K = \{k_1, \ldots, k_M\}$: an output alphabet of M observations ("vocabulary")
  $k_0$: a special start symbol
  $k_f$: a special end symbol
  $O = O_1 \ldots O_T$: a sequence of T observations, each one drawn from K
  $X = X_1 \ldots X_T$: a sequence of T states, each one drawn from $S_e$

More formal definition of Hidden Markov Models; First-order Hidden Markov Model
1. Markov Assumption (Limited Horizon): transitions depend only on the current state:
  $P(X_t \mid X_1 \ldots X_{t-1}) \approx P(X_t \mid X_{t-1})$
2. Output Independence: the probability of an output observation depends only on the current state and not on any other states or any other observations:
  $P(O_t \mid X_1 \ldots X_t, \ldots, X_T, O_1, \ldots, O_t, \ldots, O_T) \approx P(O_t \mid X_t)$
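Taken together, the two assumptions let the joint probability of a state sequence and an observation sequence factorise into per-step terms. The derivation below is a sketch under two conventions that are assumptions of this example: $X_0$ stands for the special start state $s_0$, and the final transition into $s_f$ is left out (it would contribute one extra factor).

```latex
\begin{align*}
P(O_1 \ldots O_T, X_1 \ldots X_T)
  &= P(X_1 \ldots X_T)\, P(O_1 \ldots O_T \mid X_1 \ldots X_T) \\
  &\approx \prod_{t=1}^{T} P(X_t \mid X_{t-1}) \; \prod_{t=1}^{T} P(O_t \mid X_t) \\
  &= \prod_{t=1}^{T} P(X_t \mid X_{t-1})\, P(O_t \mid X_t)
\end{align*}
```

The first line is the product rule; the second applies the Markov Assumption to the state sequence and Output Independence to the observations.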

More formal definition of Hidden Markov Models; State Transition Probabilities
A: a state transition probability matrix of size $(N + 2) \times (N + 2)$ (entries marked - are undefined).

$$
A = \begin{bmatrix}
- & a_{01} & a_{02} & a_{03} & \cdots & a_{0N} & - \\
- & a_{11} & a_{12} & a_{13} & \cdots & a_{1N} & a_{1f} \\
- & a_{21} & a_{22} & a_{23} & \cdots & a_{2N} & a_{2f} \\
\vdots & \vdots & \vdots & \vdots & \ddots & \vdots & \vdots \\
- & a_{N1} & a_{N2} & a_{N3} & \cdots & a_{NN} & a_{Nf} \\
- & - & - & - & \cdots & - & -
\end{bmatrix}
$$

$a_{ij}$ is the probability of moving from state $s_i$ to state $s_j$:
  $a_{ij} = P(X_t = s_j \mid X_{t-1} = s_i)$
  $\forall i: \; \sum_{j=0}^{N+1} a_{ij} = 1$

More formal definition of Hidden Markov Models; Start state $s_0$ and end state $s_f$
They are not associated with real observations.
$a_{0i}$ describes the transition probabilities out of the start state into state $s_i$.
$a_{if}$ describes the transition probabilities from state $s_i$ into the end state.
Transitions into the start state ($a_{i0}$) and out of the end state ($a_{fi}$) are undefined.

More formal definition of Hidden Markov Models; Emission Probabilities
B: an emission probability matrix of size $(M + 2) \times (N + 2)$ (entries marked - are undefined).

$$
B = \begin{bmatrix}
b_0(k_0) & - & - & - & \cdots & - & - \\
- & b_1(k_1) & b_2(k_1) & b_3(k_1) & \cdots & b_N(k_1) & - \\
- & b_1(k_2) & b_2(k_2) & b_3(k_2) & \cdots & b_N(k_2) & - \\
\vdots & \vdots & \vdots & \vdots & \ddots & \vdots & \vdots \\
- & b_1(k_M) & b_2(k_M) & b_3(k_M) & \cdots & b_N(k_M) & - \\
- & - & - & - & \cdots & - & b_f(k_f)
\end{bmatrix}
$$

$b_i(k_j)$ is the probability of emitting vocabulary item $k_j$ from state $s_i$:
  $b_i(k_j) = P(O_t = k_j \mid X_t = s_i)$
Our HMM is defined by its parameters $\mu = (A, B)$.

Examples where states are hidden
Speech recognition
  Observations: audio signal
  States: phonemes
Part-of-speech tagging (assigning tags like Noun and Verb to words)
  Observations: words
  States: part-of-speech tags
Machine translation
  Observations: target words
  States: source words

Today's task: the dice HMM
Imagine a fraudulent croupier in a casino where customers bet on dice outcomes.
She has two dice: a fair one and a loaded one.
The fair one has the usual uniform distribution of outcomes: $P(O) = \frac{1}{6}$ for each number 1 to 6.
The loaded one has a different distribution.
She secretly switches between the two dice.
You don't know which die is currently in use. You can only observe the numbers that are thrown.

Today's task: the dice HMM
[Figure: state diagram. States: $s_0$ (start), $s_1$ (loaded), $s_2$ (fair), $s_f$ (end). Transitions: $a_{01}$, $a_{02}$, $a_{11}$, $a_{12}$, $a_{21}$, $a_{22}$, $a_{1f}$, $a_{2f}$. Emissions shown: $b_0(k_0)$ and $b_f(k_f)$ for the special states; $b_1(5)$, $b_1(2)$, $b_1(4)$, $b_1(6)$ for the loaded die; $b_2(5) = 1/6$ and $b_2(6) = 1/6$ for the fair die.]
Example observation sequence: $O_0 = k_0$, $O_1 = 5$, $O_2 = 2$, $O_3 = 4$, $O_4 = 6$, $O_f = k_f$.
There are two states (fair and loaded), and two special states (start $s_0$ and end $s_f$).
The distribution of observations differs between the states.
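To make the roles of the $a_{ij}$ and $b_i(k_j)$ labels concrete, the sketch below scores the observation sequence 5, 2, 4, 6 against two candidate state paths by multiplying transition and emission probabilities along each path. Only the fair die's emission probability of 1/6 comes from the slides; every other number, the state labels `"s0"`, `"L"`, `"F"`, `"sf"`, and the function name `joint_probability` are hypothetical values chosen for illustration.

```python
from fractions import Fraction as F

# Hypothetical transition probabilities (not from the slides).
A = {
    "s0": {"L": F(1, 2), "F": F(1, 2)},
    "L": {"L": F(7, 10), "F": F(2, 10), "sf": F(1, 10)},
    "F": {"L": F(2, 10), "F": F(7, 10), "sf": F(1, 10)},
}
# The fair die is uniform (from the slides); the loaded die's
# distribution is made up and simply favours a six.
B = {
    "L": {1: F(1, 10), 2: F(1, 10), 3: F(1, 10), 4: F(1, 10), 5: F(1, 10), 6: F(1, 2)},
    "F": {k: F(1, 6) for k in range(1, 7)},
}


def joint_probability(states, observations, A, B):
    """Product of transition and emission probabilities along one path."""
    prob = F(1)
    prev = "s0"  # every path starts in the special start state
    for state, obs in zip(states, observations):
        prob *= A[prev][state] * B[state][obs]
        prev = state
    return prob * A[prev]["sf"]  # final transition into the end state


observations = [5, 2, 4, 6]  # O_1 .. O_4 from the slide
p_fair = joint_probability(["F", "F", "F", "F"], observations, A, B)
p_loaded = joint_probability(["L", "L", "L", "L"], observations, A, B)
print(p_fair, p_loaded)
```

With these made-up parameters the all-fair path scores higher than the all-loaded path, because only one of the four throws is a six.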

Fundamental tasks with HMMs
Problem 1 (Labelled Learning): Given a parallel observation and state sequence O and X, learn the HMM parameters A and B. (today)
Problem 2 (Unlabelled Learning): Given an observation sequence O (and only the set of emitting states $S_e$), learn the HMM parameters A and B.
Problem 3 (Likelihood): Given an HMM $\mu = (A, B)$ and an observation sequence O, determine the likelihood $P(O \mid \mu)$.
Problem 4 (Decoding): Given an observation sequence O and an HMM $\mu = (A, B)$, discover the best hidden state sequence X. (Task 8)

Your Task today
Task 7: Your implementation performs labelled HMM learning, i.e. it has
Input: dual tape of state and observation (dice outcome) sequences X and O.
  States X:       ($s_0$) F F F F L L L F F F F L L L L F F ($s_f$)
  Observations O: ($k_0$) ($k_f$)
Output: HMM parameters A, B.
Note: in a later task you will use your code for an HMM with more than two states. Either plan ahead now or modify your code later.

Parameter estimation of HMM parameters A, B
Transition matrix A consists of transition probabilities $a_{ij}$:
  $a_{ij} = P(X_{t+1} = s_j \mid X_t = s_i) = \frac{\text{count}_{\text{trans}}(X_t = s_i, X_{t+1} = s_j)}{\text{count}_{\text{trans}}(X_t = s_i)}$
Emission matrix B consists of emission probabilities $b_i(k_j)$:
  $b_i(k_j) = P(O_t = k_j \mid X_t = s_i) = \frac{\text{count}_{\text{emission}}(O_t = k_j, X_t = s_i)}{\text{count}_{\text{emission}}(X_t = s_i)}$
(Add-one smoothed versions of these)
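The count ratios above can be sketched in a few lines of code. This is one possible layout rather than the course's reference solution: the function name `estimate_hmm`, the labels `"s0"` and `"sf"` for the special states, and the toy data are all assumptions made for illustration, and the add-one smoothing mentioned on the slide is left out.

```python
from collections import Counter, defaultdict


def estimate_hmm(state_seqs, obs_seqs, start="s0", end="sf"):
    """Relative-frequency estimates of A and B from parallel sequences."""
    trans_counts = defaultdict(Counter)
    emit_counts = defaultdict(Counter)

    for states, observations in zip(state_seqs, obs_seqs):
        # Wrap the state sequence in the special start and end states.
        path = [start] + list(states) + [end]
        for prev, curr in zip(path, path[1:]):
            trans_counts[prev][curr] += 1
        # Only the emitting states are paired with observations.
        for state, obs in zip(states, observations):
            emit_counts[state][obs] += 1

    def normalise(counts):
        # Divide each count by its row total, as in the formulas above.
        return {
            src: {dst: n / sum(row.values()) for dst, n in row.items()}
            for src, row in counts.items()
        }

    return normalise(trans_counts), normalise(emit_counts)


# Toy data, made up for illustration: F = fair, L = loaded.
states = [list("FFLLF")]
rolls = [[1, 4, 6, 6, 2]]
A, B = estimate_hmm(states, rolls)
print(A["F"], B["L"])
```

Keying the tables by state label rather than by a fixed index means the same code works unchanged for an HMM with more than two states. Transitions and emissions that never occur in the data are simply absent from the returned dictionaries, which is the gap that smoothing would fill.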

Literature
Manning and Schütze (2000). Foundations of Statistical Natural Language Processing. MIT Press. Chapters 9.1, 9.2.
  We use a state-emission HMM instead of an arc-emission HMM.
  We avoid the initial state probability vector $\pi$ by using explicit start and end states ($s_0$ and $s_f$) and incorporating the corresponding probabilities into the transition matrix A.
Jurafsky and Martin, 2nd Edition, Chapter 6.2 (but careful, notation!).
Fosler-Lussier, Eric (1998). Markov Models and Hidden Markov Models: A Brief Tutorial. TR
Smith, Noah A. (2004). Hidden Markov Models: All the Glorious Gory Details.
Bockmayr and Reinert (2011). Markov chains and Hidden Markov Models. Discrete Math for Bioinformatics WS 10/11.
