Hidden Markov Models Part 2: Algorithms

1 Hidden Markov Models Part 2: Algorithms CSE 6363 Machine Learning Vassilis Athitsos Computer Science and Engineering Department University of Texas at Arlington

2 Hidden Markov Model An HMM consists of: A set of states s_1, ..., s_K. An initial state probability function π_k = p(z_1 = s_k). A state transition matrix A, of size K × K, where A_{k,j} = p(z_n = s_j | z_{n-1} = s_k). Observation probability functions, also called emission probabilities, defined as: φ_k(x) = p(x_n = x | z_n = s_k).
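
To make the notation above concrete, here is a minimal sketch (not from the slides) of how the parameters θ = (π, A, φ) might be stored for the case of discrete observations, where each φ_k becomes a row of a K × R matrix; the names pi, A, B and the random initialization are illustrative assumptions.

```python
import numpy as np

K = 3   # number of states s_1, ..., s_K
R = 4   # number of distinct observation symbols (discrete-observation case; assumed here)

rng = np.random.default_rng(0)

# pi[k] = p(z_1 = s_k): initial state probabilities
pi = rng.random(K)
pi /= pi.sum()

# A[k, j] = p(z_n = s_j | z_{n-1} = s_k): state transition matrix (rows sum to 1)
A = rng.random((K, K))
A /= A.sum(axis=1, keepdims=True)

# B[k, r] = phi_k(y_r) = p(x_n = y_r | z_n = s_k): emission probabilities (rows sum to 1)
B = rng.random((K, R))
B /= B.sum(axis=1, keepdims=True)
```

The later code sketches in these notes reuse this (pi, A, B) representation.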

3 The Basic HMM Problems We will next see algorithms for the four problems that we usually want to solve when using HMMs. Problem 1: learn an HMM using training data. This involves: Learning the initial state probabilities π_k. Learning the state transition matrix A. Learning the observation probability functions φ_k(x). Problems 2, 3, 4 assume we have already learned an HMM. Problem 2: given a sequence of observations X = (x_1, ..., x_N), compute p(X | HMM): the probability of X given the model. Problem 3: given an observation sequence X = (x_1, ..., x_N), find the hidden state sequence Z = (z_1, ..., z_N) maximizing p(Z | X, HMM). Problem 4: given an observation sequence X = (x_1, ..., x_N), compute, for any n, k, the probability p(z_n = s_k | X, HMM).

4 The Basic HMM Problems Problem 1: learn an HMM using training data. Problem 2: given a sequence of observations X = (x_1, ..., x_N), compute p(X | HMM): the probability of X given the model. Problem 3: given an observation sequence X = (x_1, ..., x_N), find the hidden state sequence Z = (z_1, ..., z_N) maximizing p(Z | X, HMM). Problem 4: given an observation sequence X = (x_1, ..., x_N), compute, for any n, k, the probability p(z_n = s_k | X, HMM). We will first look at algorithms for problems 2, 3, 4. Last, we will look at the standard algorithm for learning an HMM using training data.

5 Probability of Observations Problem 2: given an HMM model θ, and a sequence of observations X = (x_1, ..., x_N), compute p(X | θ): the probability of X given the model. Inputs: θ = (π, A, φ): a trained HMM, with probability functions π, A, φ_k specified. A sequence of observations X = (x_1, ..., x_N). Output: p(X | θ).

6 Probability of Observations Problem 2: given a sequence of observations X = (x_1, ..., x_N), compute p(X | θ). Why do we care about this problem? What would be an example application?

7 Probability of Observations Problem 2: given a sequence of observations X = (x_1, ..., x_N), compute p(X | θ). Why do we care about this problem? We need to compute p(X | θ) to classify X. Suppose we have multiple models θ_1, ..., θ_C. For example, we can have one model for digit 2, one model for digit 3, one model for digit 4. A Bayesian classifier classifies X by finding the θ_c that maximizes p(θ_c | X). Using Bayes rule, we must maximize p(X | θ_c) p(θ_c) / p(X). Therefore, we need to compute p(X | θ_c) for each c.
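
As a small illustration of this classification recipe, here is a sketch (not from the slides) that picks the class c maximizing p(X | θ_c) p(θ_c); it assumes a helper forward_likelihood(theta, X) returning p(X | θ), such as the forward algorithm sketched later, and the function and argument names are illustrative.

```python
import numpy as np

def classify(X, models, priors, forward_likelihood):
    """Return the index c of the model maximizing p(X | theta_c) * p(theta_c)."""
    scores = [forward_likelihood(theta_c, X) * prior_c
              for theta_c, prior_c in zip(models, priors)]
    return int(np.argmax(scores))
```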

8 The Sum Rule The sum rule, which we saw earlier in the course, states that: p(X) = Σ_{y ∈ Y} p(X, Y = y). We can use the sum rule to compute p(X | θ) as follows: p(X | θ) = Σ_Z p(X, Z | θ). According to this formula, we can compute p(X | θ) by summing p(X, Z | θ) over all possible state sequences Z.

9 The Sum Rule p(X | θ) = Σ_Z p(X, Z | θ). According to this formula, we can compute p(X | θ) by summing p(X, Z | θ) over all possible state sequences Z. An HMM with parameters θ specifies a joint distribution function p(X, Z | θ) as: p(X, Z | θ) = p(z_1 | θ) [Π_{n=2}^{N} p(z_n | z_{n-1}, θ)] [Π_{n=1}^{N} p(x_n | z_n, θ)]. Therefore: p(X | θ) = Σ_Z p(z_1 | θ) [Π_{n=2}^{N} p(z_n | z_{n-1}, θ)] [Π_{n=1}^{N} p(x_n | z_n, θ)].

10 The Sum Rule p(X | θ) = Σ_Z p(z_1 | θ) [Π_{n=2}^{N} p(z_n | z_{n-1}, θ)] [Π_{n=1}^{N} p(x_n | z_n, θ)]. According to this formula, we can compute p(X | θ) by summing p(X, Z | θ) over all possible state sequences Z. What would be the time complexity of doing this computation directly? It would be linear in the number of all possible state sequences Z, which is typically exponential in N (the length of X). Luckily, there is a polynomial-time algorithm for computing p(X | θ) that uses dynamic programming.

11 Dynamic Programming for p(X | θ) p(X | θ) = Σ_Z p(z_1 | θ) [Π_{n=2}^{N} p(z_n | z_{n-1}, θ)] [Π_{n=1}^{N} p(x_n | z_n, θ)]. To compute p(X | θ), we use a dynamic programming algorithm that is called the forward algorithm. We define a 2D array of problems, of size N × K. N: length of observation sequence X. K: number of states in the HMM. Problem (n, k): Compute α[n, k] = p(x_1, ..., x_n, z_n = s_k | θ).

12 The Forward Algorithm - Initialization p(X | θ) = Σ_Z p(z_1 | θ) [Π_{n=2}^{N} p(z_n | z_{n-1}, θ)] [Π_{n=1}^{N} p(x_n | z_n, θ)]. Problem (n, k): Compute α[n, k] = p(x_1, ..., x_n, z_n = s_k | θ). Problem (1, k):???

13 The Forward Algorithm - Initialization p(X | θ) = Σ_Z p(z_1 | θ) [Π_{n=2}^{N} p(z_n | z_{n-1}, θ)] [Π_{n=1}^{N} p(x_n | z_n, θ)]. Problem (n, k): Compute α[n, k] = p(x_1, ..., x_n, z_n = s_k | θ). Problem (1, k): Compute α[1, k] = p(x_1, z_1 = s_k | θ). Solution:???

14 The Forward Algorithm - Initialization p(X | θ) = Σ_Z p(z_1 | θ) [Π_{n=2}^{N} p(z_n | z_{n-1}, θ)] [Π_{n=1}^{N} p(x_n | z_n, θ)]. Problem (n, k): Compute α[n, k] = p(x_1, ..., x_n, z_n = s_k | θ). Problem (1, k): Compute α[1, k] = p(x_1, z_1 = s_k | θ). Solution: α[1, k] = p(x_1, z_1 = s_k | θ) = π_k φ_k(x_1).

15 The Forward Algorithm - Main Loop Problem (n, k): Compute α[n, k] = p(x_1, ..., x_n, z_n = s_k | θ). Solution:

16 The Forward Algorithm - Main Loop Problem (n, k): Compute α[n, k] = p(x_1, ..., x_n, z_n = s_k | θ). Solution: p(x_1, ..., x_n, z_n = s_k | θ) = Σ_{j=1}^{K} p(x_1, ..., x_n, z_{n-1} = s_j, z_n = s_k | θ) = φ_k(x_n) Σ_{j=1}^{K} p(x_1, ..., x_{n-1}, z_{n-1} = s_j, z_n = s_k | θ) = φ_k(x_n) Σ_{j=1}^{K} p(x_1, ..., x_{n-1}, z_{n-1} = s_j | θ) A_{j,k}.

17 The Forward Algorithm - Main Loop Problem (n, k): Compute α[n, k] = p(x_1, ..., x_n, z_n = s_k | θ). Solution: p(x_1, ..., x_n, z_n = s_k | θ) = φ_k(x_n) Σ_{j=1}^{K} p(x_1, ..., x_{n-1}, z_{n-1} = s_j | θ) A_{j,k} = φ_k(x_n) Σ_{j=1}^{K} α[n-1, j] A_{j,k}.

18 The Forward Algorithm - Main Loop Problem (n, k): Compute α[n, k] = p(x_1, ..., x_n, z_n = s_k | θ). Solution: p(x_1, ..., x_n, z_n = s_k | θ) = φ_k(x_n) Σ_{j=1}^{K} α[n-1, j] A_{j,k}. Thus, α[n, k] = φ_k(x_n) Σ_{j=1}^{K} α[n-1, j] A_{j,k}. α[n, k] is easy to compute once all α[n-1, j] values have been computed.

19 The Forward Algorithm Our aim was to compute p(X | θ). Using the dynamic programming algorithm we just described, we can compute α[N, k] = p(X, z_N = s_k | θ). How can we use those α[N, k] values to compute p(X | θ)? We use the sum rule: p(X | θ) = Σ_{k=1}^{K} p(X, z_N = s_k | θ) = Σ_{k=1}^{K} α[N, k].
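
Here is a minimal sketch of the forward algorithm for the discrete-observation (pi, A, B) representation assumed earlier; indices are 0-based, and no scaling is included to prevent numerical underflow, which a practical implementation would add.

```python
import numpy as np

def forward(pi, A, B, X):
    """Return alpha (N x K), where row n corresponds to the slides' alpha[n+1, .]
    (0-based indexing), and p(X | theta) = sum_k alpha[N, k]."""
    N, K = len(X), len(pi)
    alpha = np.zeros((N, K))
    alpha[0] = pi * B[:, X[0]]                 # alpha[1, k] = pi_k * phi_k(x_1)
    for n in range(1, N):
        # alpha[n, k] = phi_k(x_n) * sum_j alpha[n-1, j] * A[j, k]
        alpha[n] = B[:, X[n]] * (alpha[n - 1] @ A)
    return alpha, alpha[-1].sum()
```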

20 Problem 3: Find the Best Z Problem 3: Inputs: a trained HMM θ, and an observation sequence X = (x_1, ..., x_N). Output: the state sequence Z = (z_1, ..., z_N) maximizing p(Z | θ, X). Why do we care? Sometimes, we want to use HMMs for classification. In those cases, we do not really care about maximizing p(Z | θ, X); we just want to find the HMM parameters θ that maximize p(θ | X). Sometimes, we want to use HMMs to figure out the most likely values of the hidden states given the observations. In the tree ring example, our goal was to figure out the average temperature for each year. Those average temperatures were the hidden states. Solving problem 3 provides the most likely sequence of average temperatures.

21 Problem 3: Find the Best Z Problem 3: Input: an observation sequence X = (x_1, ..., x_N). Output: the state sequence Z = (z_1, ..., z_N) maximizing p(Z | θ, X). We want to maximize p(Z | θ, X). Using the definition of conditional probabilities: p(Z | θ, X) = p(X, Z | θ) / p(X | θ). Since X is known, p(X | θ) will be the same over all possible state sequences Z. Therefore, to find the Z that maximizes p(Z | θ, X), it suffices to find the Z that maximizes p(Z, X | θ).

22 Problem 3: Find the Best Z Problem 3: Input: an observation sequence X = (x_1, ..., x_N). Output: the state sequence Z = (z_1, ..., z_N) maximizing p(Z | θ, X), which is the same as maximizing p(Z, X | θ). Solution: (again) dynamic programming. Problem (n, k): Compute G[n, k] and H[n, k], where: G[n, k] = argmax_{z_1, ..., z_{n-1}} p(x_1, ..., x_n, z_1, ..., z_{n-1}, z_n = s_k | θ). Note that: We optimize over all possible values z_1, ..., z_{n-1}. However, z_n is constrained to be equal to s_k. Values x_1, ..., x_n are known, and thus they are fixed.

23 Problem 3: Find the Best Z Problem 3: Input: an observation sequence X = (x_1, ..., x_N). Output: the state sequence Z = (z_1, ..., z_N) maximizing p(Z | θ, X), which is the same as maximizing p(Z, X | θ). Solution: (again) dynamic programming. Problem (n, k): Compute G[n, k] and H[n, k], where: H[n, k] = p(x_1, ..., x_n, G[n, k] | θ). H[n, k] is the joint probability of: the first n observations x_1, ..., x_n; the sequence we stored in G[n, k].

24 Problem 3: Find the Best Z Problem 3: Input: an observation sequence X = (x_1, ..., x_N). Output: the state sequence Z = (z_1, ..., z_N) maximizing p(Z | θ, X), which is the same as maximizing p(Z, X | θ). Solution: (again) dynamic programming. Problem (n, k): Compute G[n, k] and H[n, k], where: H[n, k] = p(x_1, ..., x_n, G[n, k] | θ). H[n, k] is the joint probability of: the first n observations x_1, ..., x_n; the sequence we stored in G[n, k]. G[n, k] stores a sequence of state values. H[n, k] stores a number (a probability).

25 The Viterbi Algorithm Problem (n, k): Compute G[n, k] and H[n, k], where: G[n, k] = argmax_{z_1, ..., z_{n-1}} p(x_1, ..., x_n, z_1, ..., z_{n-1}, z_n = s_k | θ). H[n, k] = p(x_1, ..., x_n, G[n, k] | θ). We use a dynamic programming algorithm to compute G[n, k] and H[n, k] for all n, k such that 1 ≤ n ≤ N, 1 ≤ k ≤ K. This dynamic programming algorithm is called the Viterbi algorithm.

26 The Viterbi Algorithm - Initialization Problem (1, k): Compute G[1, k] and H[1, k], where: G[1, k] = argmax_{z_1, ..., z_{n-1}} p(x_1, z_1 = s_k | θ). H[1, k] = p(x_1, z_1 = s_k | θ). What is G[1, k]? It is the empty sequence. We must maximize over the previous states z_1, ..., z_{n-1}, but n = 1, so there are no previous states. What is H[1, k]? H[1, k] = p(x_1, z_1 = s_k | θ) = π_k φ_k(x_1).

27 The Viterbi Algorithm Main Loop Problem (n, k): Compute G[n, k] and H[n, k], where: G[n, k] = argmax_{z_1, ..., z_{n-1}} p(x_1, ..., x_n, z_1, ..., z_{n-1}, z_n = s_k | θ). H[n, k] = p(x_1, ..., x_n, G[n, k] | θ). How do we find G[n, k]? G[n, k] = G′ ⊕ s_j for some shorter sequence G′ and some j (where ⊕ denotes appending a state to a sequence). The last element of G[n, k] is some s_j. Therefore, G[n, k] is formed by appending that s_j to some sequence G′.

28 The Viterbi Algorithm Main Loop Problem (n, k): Compute G[n, k] and H[n, k], where: G[n, k] = argmax_{z_1, ..., z_{n-1}} p(x_1, ..., x_n, z_1, ..., z_{n-1}, z_n = s_k | θ). H[n, k] = p(x_1, ..., x_n, G[n, k] | θ). G[n, k] = G′ ⊕ s_j for some sequence G′ and some j. We can prove that G′ = G[n-1, j], for that j. The proof is very similar to the proof that DTW finds the optimal alignment: if G[n-1, j] were better than G′, then G[n-1, j] ⊕ s_j would be better than G[n, k], which is a contradiction, since G[n, k] is optimal.

29 The Viterbi Algorithm Main Loop Problem (n, k): Compute G[n, k] and H[n, k], where: G[n, k] = argmax_{z_1, ..., z_{n-1}} p(x_1, ..., x_n, z_1, ..., z_{n-1}, z_n = s_k | θ). H[n, k] = p(x_1, ..., x_n, G[n, k] | θ). G[n, k] = G[n-1, j] ⊕ s_j for some j. Let's define j* as: j* = argmax_{j=1,...,K} H[n-1, j] A_{j,k} φ_k(x_n). Then: G[n, k] = G[n-1, j*] ⊕ s_{j*}, and H[n, k] = H[n-1, j*] A_{j*,k} φ_k(x_n).

30 The Viterbi Algorithm Output Our goal is to find the Z maximizing p(Z | θ, X), which is the same as finding the Z maximizing p(Z, X | θ). The Viterbi algorithm computes G[n, k] and H[n, k], where: G[n, k] = argmax_{z_1, ..., z_{n-1}} p(x_1, ..., x_n, z_1, ..., z_{n-1}, z_n = s_k | θ). H[n, k] = p(x_1, ..., x_n, G[n, k] | θ). Let's define k* as: k* = argmax_{k=1,...,K} H[N, k]. Then, the Z maximizing p(Z, X | θ) is G[N, k*] ⊕ s_{k*}.
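
Here is a sketch of the Viterbi algorithm under the same (pi, A, B) assumptions; instead of storing the full sequences G[n, k], it keeps a table of backpointers and traces the best sequence back at the end, which is an equivalent but more memory-efficient choice.

```python
import numpy as np

def viterbi(pi, A, B, X):
    """Return the state sequence Z (0-based state indices) maximizing p(Z, X | theta)."""
    N, K = len(X), len(pi)
    H = np.zeros((N, K))                 # H[n, k]: best joint probability ending in state k at time n
    back = np.zeros((N, K), dtype=int)   # back[n, k]: best predecessor state j* for (n, k)
    H[0] = pi * B[:, X[0]]               # H[1, k] = pi_k * phi_k(x_1)
    for n in range(1, N):
        # For each k, pick j* maximizing H[n-1, j] * A[j, k], then multiply by phi_k(x_n)
        scores = H[n - 1][:, None] * A           # scores[j, k]
        back[n] = np.argmax(scores, axis=0)
        H[n] = scores[back[n], np.arange(K)] * B[:, X[n]]
    # Start from k* = argmax_k H[N, k] and follow the backpointers
    Z = [int(np.argmax(H[-1]))]
    for n in range(N - 1, 0, -1):
        Z.append(int(back[n, Z[-1]]))
    return Z[::-1]
```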

31 State Probabilities at Specific Times Problem 4: Inputs: a trained HMM θ, and an observation sequence X = (x_1, ..., x_N). Output: an array γ, of size N × K, where γ[n, k] = p(z_n = s_k | X, θ). In words: given an observation sequence X, we want to compute, for any moment in time n and any state s_k, the probability that the hidden state at that moment is s_k.

32 State Probabilities at Specific Times Problem 4: Inputs: a trained HMM θ, and an observation sequence X = (x_1, ..., x_N). Output: an array γ, of size N × K, where γ[n, k] = p(z_n = s_k | X, θ). We have seen, in our solution for problem 2, that the forward algorithm computes an array α, of size N × K, where α[n, k] = p(x_1, ..., x_n, z_n = s_k | θ). We can also define another array β, of size N × K, where β[n, k] = p(x_{n+1}, ..., x_N | θ, z_n = s_k). In words, β[n, k] is the probability of all observations after moment n, given that the state at moment n is s_k.

33 State Probabilities at Specific Times Problem 4: Inputs: a trained HMM θ, and an observation sequence X = (x_1, ..., x_N). Output: an array γ, of size N × K, where γ[n, k] = p(z_n = s_k | X, θ). We have seen, in our solution for problem 2, that the forward algorithm computes an array α, of size N × K, where α[n, k] = p(x_1, ..., x_n, z_n = s_k | θ). We can also define another array β, of size N × K, where β[n, k] = p(x_{n+1}, ..., x_N | θ, z_n = s_k). Then, α[n, k] β[n, k] = p(x_1, ..., x_n, z_n = s_k | θ) p(x_{n+1}, ..., x_N | θ, z_n = s_k) = p(X, z_n = s_k | θ). Therefore: γ[n, k] = p(z_n = s_k | X, θ) = α[n, k] β[n, k] / p(X | θ).

34 State Probabilities at Specific Times The forward algorithm computes an array α, of size N × K, where α[n, k] = p(x_1, ..., x_n, z_n = s_k | θ). We define another array β, of size N × K, where β[n, k] = p(x_{n+1}, ..., x_N | θ, z_n = s_k). Then, γ[n, k] = p(z_n = s_k | X, θ) = α[n, k] β[n, k] / p(X | θ). In the above equation: α[n, k] and p(X | θ) are computed by the forward algorithm. We need to compute β[n, k]. We compute β[n, k] in a way very similar to the forward algorithm, but going backwards this time. The resulting combination of the forward and backward algorithms is called (not surprisingly) the forward-backward algorithm.

35 The Backward Algorithm We want to compute the values of an array β, of size N × K, where β[n, k] = p(x_{n+1}, ..., x_N | θ, z_n = s_k). Again, we will use dynamic programming. Problem (n, k) is to compute value β[n, k]. However, this time we start from the end of the observations. First, compute β[N, k] = p({} | θ, z_N = s_k). In this case, the observation sequence x_{n+1}, ..., x_N is empty, because n = N. What is the probability of an empty sequence?

36 Backward Algorithm - Initialization We want to compute the values of an array β, of size N × K, where β[n, k] = p(x_{n+1}, ..., x_N | θ, z_n = s_k). Again, we will use dynamic programming. Problem (n, k) is to compute value β[n, k]. However, this time we start from the end of the observations. First, compute β[N, k] = p({} | θ, z_N = s_k). In this case, the observation sequence x_{n+1}, ..., x_N is empty, because n = N. What is the probability of an empty sequence? β[N, k] = 1.

37 Backward Algorithm - Initialization Next: compute β[N-1, k]. β[N-1, k] = p(x_N | θ, z_{N-1} = s_k) = Σ_{j=1}^{K} p(x_N, z_N = s_j | θ, z_{N-1} = s_k) = Σ_{j=1}^{K} p(x_N | z_N = s_j) p(z_N = s_j | θ, z_{N-1} = s_k).

38 Backward Algorithm - Initialization Next: compute β[N-1, k]. β[N-1, k] = p(x_N | θ, z_{N-1} = s_k) = Σ_{j=1}^{K} p(x_N | z_N = s_j) p(z_N = s_j | θ, z_{N-1} = s_k) = Σ_{j=1}^{K} φ_j(x_N) A_{k,j}.

39 Backward Algorithm Main Loop Next: compute β[n, k], for n < N-1. β[n, k] = p(x_{n+1}, ..., x_N | θ, z_n = s_k) = Σ_{j=1}^{K} p(x_{n+1}, ..., x_N, z_{n+1} = s_j | θ, z_n = s_k) = Σ_{j=1}^{K} p(x_{n+1} | z_{n+1} = s_j) p(x_{n+2}, ..., x_N, z_{n+1} = s_j | θ, z_n = s_k).

40 Backward Algorithm Main Loop Next: compute β[n, k], for n < N-1. β[n, k] = p(x_{n+1}, ..., x_N | θ, z_n = s_k) = Σ_{j=1}^{K} p(x_{n+1} | z_{n+1} = s_j) p(x_{n+2}, ..., x_N, z_{n+1} = s_j | θ, z_n = s_k) = Σ_{j=1}^{K} φ_j(x_{n+1}) p(x_{n+2}, ..., x_N | θ, z_{n+1} = s_j) A_{k,j}. We will take a closer look at the last step.

41 Backward Algorithm Main Loop Σ_{j=1}^{K} p(x_{n+1} | z_{n+1} = s_j) p(x_{n+2}, ..., x_N, z_{n+1} = s_j | θ, z_n = s_k) = Σ_{j=1}^{K} φ_j(x_{n+1}) p(x_{n+2}, ..., x_N | θ, z_{n+1} = s_j) A_{k,j}. p(x_{n+1} | z_{n+1} = s_j) = φ_j(x_{n+1}), by definition of φ_j.

42 Backward Algorithm Main Loop Σ_{j=1}^{K} p(x_{n+1} | z_{n+1} = s_j) p(x_{n+2}, ..., x_N, z_{n+1} = s_j | θ, z_n = s_k) = Σ_{j=1}^{K} φ_j(x_{n+1}) p(x_{n+2}, ..., x_N | θ, z_{n+1} = s_j) A_{k,j}. p(x_{n+2}, ..., x_N, z_{n+1} = s_j | θ, z_n = s_k) = p(x_{n+2}, ..., x_N | θ, z_n = s_k, z_{n+1} = s_j) p(z_{n+1} = s_j | z_n = s_k), by application of the chain rule: P(A, B) = P(A | B) P(B), which can also be written as P(A, B | C) = P(A | B, C) P(B | C).

43 Backward Algorithm Main Loop Σ_{j=1}^{K} p(x_{n+1} | z_{n+1} = s_j) p(x_{n+2}, ..., x_N, z_{n+1} = s_j | θ, z_n = s_k) = Σ_{j=1}^{K} φ_j(x_{n+1}) p(x_{n+2}, ..., x_N | θ, z_{n+1} = s_j) A_{k,j}. p(x_{n+2}, ..., x_N, z_{n+1} = s_j | θ, z_n = s_k) = p(x_{n+2}, ..., x_N | θ, z_n = s_k, z_{n+1} = s_j) p(z_{n+1} = s_j | z_n = s_k) = p(x_{n+2}, ..., x_N | θ, z_n = s_k, z_{n+1} = s_j) A_{k,j} (by definition of A_{k,j}).

44 Backward Algorithm Main Loop Σ_{j=1}^{K} p(x_{n+1} | z_{n+1} = s_j) p(x_{n+2}, ..., x_N, z_{n+1} = s_j | θ, z_n = s_k) = Σ_{j=1}^{K} φ_j(x_{n+1}) p(x_{n+2}, ..., x_N | θ, z_{n+1} = s_j) A_{k,j}. p(x_{n+2}, ..., x_N, z_{n+1} = s_j | θ, z_n = s_k) = p(x_{n+2}, ..., x_N | θ, z_n = s_k, z_{n+1} = s_j) p(z_{n+1} = s_j | z_n = s_k) = p(x_{n+2}, ..., x_N | θ, z_n = s_k, z_{n+1} = s_j) A_{k,j} = p(x_{n+2}, ..., x_N | θ, z_{n+1} = s_j) A_{k,j} (x_{n+2}, ..., x_N are conditionally independent of z_n given z_{n+1}).

45 Backward Algorithm Main Loop In the previous slides, we showed that β[n, k] = p(x_{n+1}, ..., x_N | θ, z_n = s_k) = Σ_{j=1}^{K} φ_j(x_{n+1}) p(x_{n+2}, ..., x_N | θ, z_{n+1} = s_j) A_{k,j}. Note that p(x_{n+2}, ..., x_N | θ, z_{n+1} = s_j) is β[n+1, j]. Consequently, we obtain the recurrence: β[n, k] = Σ_{j=1}^{K} φ_j(x_{n+1}) β[n+1, j] A_{k,j}.

46 Backward Algorithm Main Loop β[n, k] = Σ_{j=1}^{K} φ_j(x_{n+1}) β[n+1, j] A_{k,j}. Thus, we can easily compute all values β[n, k] using this recurrence. The key thing is to proceed in decreasing order of n, since values β[n, ·] depend on values β[n+1, ·].
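
Here is a sketch of the backward algorithm under the same (pi, A, B) assumptions, again with 0-based indices and without the scaling a practical implementation would use.

```python
import numpy as np

def backward(A, B, X):
    """Return beta (N x K), where row n corresponds to the slides' beta[n+1, .]
    (0-based indexing)."""
    N, K = len(X), A.shape[0]
    beta = np.zeros((N, K))
    beta[-1] = 1.0                       # beta[N, k] = 1: probability of the empty sequence
    for n in range(N - 2, -1, -1):
        # beta[n, k] = sum_j phi_j(x_{n+1}) * beta[n+1, j] * A[k, j]
        beta[n] = A @ (B[:, X[n + 1]] * beta[n + 1])
    return beta
```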

47 The Forward-Backward Algorithm Inputs: A trained HMM θ. An observation sequence X = (x_1, ..., x_N). Output: An array γ, of size N × K, where γ[n, k] = p(z_n = s_k | X, θ). Algorithm: Use the forward algorithm to compute α[n, k]. Use the backward algorithm to compute β[n, k]. For n = 1 to N: For k = 1 to K: γ[n, k] = p(z_n = s_k | X, θ) = α[n, k] β[n, k] / p(X | θ).
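
Combining the two sketches above gives the forward-backward computation of γ; this assumes the forward() and backward() helpers defined earlier.

```python
def state_posteriors(pi, A, B, X):
    """Return gamma (N x K): row n corresponds to the slides' gamma[n+1, .] (0-based indexing)."""
    alpha, p_X = forward(pi, A, B, X)    # forward algorithm: alpha and p(X | theta)
    beta = backward(A, B, X)             # backward algorithm: beta
    return alpha * beta / p_X            # gamma[n, k] = alpha[n, k] * beta[n, k] / p(X | theta)
```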

48 Problem 1: Training an HMM Goal: Estimate parameters θ = (π_k, A_{k,j}, φ_k). The training data can consist of multiple observation sequences X_1, X_2, X_3, ..., X_M. We denote the length of training sequence X_m as N_m. We denote the elements of each observation sequence X_m as: X_m = (x_{m,1}, x_{m,2}, ..., x_{m,N_m}). Before we start training, we need to decide on: The number of states. The transitions that we will allow.

49 Problem 1: Training an HMM While we are given X_1, X_2, X_3, ..., X_M, we are not given the corresponding hidden state sequences Z_1, Z_2, Z_3, ..., Z_M. We denote the elements of each hidden state sequence Z_m as: Z_m = (z_{m,1}, z_{m,2}, ..., z_{m,N_m}). The training algorithm is called the Baum-Welch algorithm. It is an Expectation-Maximization algorithm, similar to the algorithm we saw for learning Gaussian mixtures.

50 Expectation-Maximization When we wanted to learn a mixture of Gaussians, we had the following problem: If we knew the probability of each object belonging to each Gaussian, we could estimate the parameters of each Gaussian. If we knew the parameters of each Gaussian, we could estimate the probability of each object belonging to each Gaussian. However, we know neither of these pieces of information. The EM algorithm resolved this problem using: An initialization of Gaussian parameters to some random or non-random values. A main loop where: The current values of the Gaussian parameters are used to estimate new weights of membership of every training object to every Gaussian. The current estimated membership weights are used to estimate new parameters (mean and covariance matrix) for each Gaussian.

51 Expectation-Maximization When we want to learn an HMM model using observation sequences as training data, we have the following problem: If we knew, for each observation sequence, the probabilities of the hidden state values, we could estimate the parameters θ. If we knew the parameters θ, we could estimate the probabilities of the hidden state values. However, we know neither of these pieces of information. The Baum-Welch algorithm resolves this problem using EM: At initialization, parameters θ are given (mostly) random values. In the main loop, these two steps are performed repeatedly: The current values of parameters θ are used to estimate new probabilities for the hidden state values. The current probabilities for the hidden state values are used to estimate new values for parameters θ.

52 Baum-Welch: Initialization As we said before, before we start training, we need to decide on: The number of states. The transitions that we will allow. When we start training, we initialize θ = (π_k, A_{k,j}, φ_k) to random values, with these exceptions: For any s_k that we do not want to allow to ever be an initial state, we set the corresponding π_k to 0. The rest of the training will keep these π_k values always equal to 0. For any s_k, s_j such that we do not want to allow transitions from s_k to s_j, we set the corresponding A_{k,j} to 0. The rest of the training will keep these A_{k,j} values always equal to 0.

53 Baum-Welch: Initialization When we start training, we initialize θ = (π_k, A_{k,j}, φ_k) to random values, with these exceptions: For any s_k that we do not want to allow to ever be an initial state, we set the corresponding π_k to 0. The rest of the training will keep these π_k values always equal to 0. For any s_k, s_j such that we do not want to allow transitions from s_k to s_j, we set the corresponding A_{k,j} to 0. The rest of the training will keep these A_{k,j} values always equal to 0. These initial choices constrain the topology of the resulting HMM model. For example, they can force the model to be fully connected, to be a forward model, or to be some other variation.

54 Baum-Welch: Expectation Step Before we start the expectation step, we have some current values for the parameters of θ = (π_k, A_{k,j}, φ_k). The goal of the expectation step is to use those values to estimate two arrays: γ[m, n, k] = p(z_{m,n} = s_k | θ, X_m). This is the same γ[n, k] that we computed earlier, using the forward-backward algorithm. Here we just need to make the array three-dimensional, because we have multiple observation sequences X_m. We can compute γ[m, ·, ·] by running the forward-backward algorithm with θ and X_m as inputs. ξ[m, n, j, k] = p(z_{m,n-1} = s_j, z_{m,n} = s_k | θ, X_m).

55 Baum-Welch: Expectation Step To facilitate our computations, we will also extend arrays α and β to three dimensions, in the same way that we extended array γ. α[m, n, k] = p(x_{m,1}, ..., x_{m,n}, z_{m,n} = s_k | θ). β[m, n, k] = p(x_{m,n+1}, ..., x_{m,N_m} | θ, z_{m,n} = s_k). Where needed, values α[m, ·, ·], β[m, ·, ·], γ[m, ·, ·] can be computed by running the forward-backward algorithm with inputs θ and X_m.

56 Computing the ξ Values Using Bayes rule: ξ[m, n, j, k] = p(z_{m,n-1} = s_j, z_{m,n} = s_k | θ, X_m) = p(X_m | θ, z_{m,n-1} = s_j, z_{m,n} = s_k) p(z_{m,n-1} = s_j, z_{m,n} = s_k | θ) / p(X_m | θ). p(X_m | θ) can be computed using the forward algorithm. We will see how to simplify and compute: p(X_m | θ, z_{m,n-1} = s_j, z_{m,n} = s_k) p(z_{m,n-1} = s_j, z_{m,n} = s_k | θ).

57 Computing the ξ Values ξ[m, n, j, k] = p(z_{m,n-1} = s_j, z_{m,n} = s_k | θ, X_m) = p(X_m | θ, z_{m,n-1} = s_j, z_{m,n} = s_k) p(z_{m,n-1} = s_j, z_{m,n} = s_k | θ) / p(X_m | θ). p(X_m | θ, z_{m,n-1} = s_j, z_{m,n} = s_k) = p(x_{m,1}, ..., x_{m,n-1} | θ, z_{m,n-1} = s_j) p(x_{m,n}, ..., x_{m,N_m} | θ, z_{m,n} = s_k). Why? Earlier observations x_{m,1}, ..., x_{m,n-1} are conditionally independent of state z_{m,n} given state z_{m,n-1}. Later observations x_{m,n}, ..., x_{m,N_m} are conditionally independent of state z_{m,n-1} given state z_{m,n}.

58 Computing the ξ Values ξ[m, n, j, k] = p(z_{m,n-1} = s_j, z_{m,n} = s_k | θ, X_m) = p(X_m | θ, z_{m,n-1} = s_j, z_{m,n} = s_k) p(z_{m,n-1} = s_j, z_{m,n} = s_k | θ) / p(X_m | θ). p(X_m | θ, z_{m,n-1} = s_j, z_{m,n} = s_k) = p(x_{m,1}, ..., x_{m,n-1} | θ, z_{m,n-1} = s_j) p(x_{m,n}, ..., x_{m,N_m} | θ, z_{m,n} = s_k). p(x_{m,n}, ..., x_{m,N_m} | θ, z_{m,n} = s_k) = p(x_{m,n} | θ, z_{m,n} = s_k) p(x_{m,n+1}, ..., x_{m,N_m} | θ, z_{m,n} = s_k) = φ_k(x_{m,n}) β[m, n, k].

59 Computing the ξ Values ξ[m, n, j, k] = p(z_{m,n-1} = s_j, z_{m,n} = s_k | θ, X_m) = p(X_m | θ, z_{m,n-1} = s_j, z_{m,n} = s_k) p(z_{m,n-1} = s_j, z_{m,n} = s_k | θ) / p(X_m | θ). p(X_m | θ, z_{m,n-1} = s_j, z_{m,n} = s_k) = p(x_{m,1}, ..., x_{m,n-1} | θ, z_{m,n-1} = s_j) φ_k(x_{m,n}) β[m, n, k] = [p(x_{m,1}, ..., x_{m,n-1}, z_{m,n-1} = s_j | θ) / p(z_{m,n-1} = s_j | θ)] φ_k(x_{m,n}) β[m, n, k] = [α[m, n-1, j] / p(z_{m,n-1} = s_j | θ)] φ_k(x_{m,n}) β[m, n, k].

60 Computing the ξ Values ξ[m, n, j, k] = [α[m, n-1, j] / p(z_{m,n-1} = s_j | θ)] φ_k(x_{m,n}) β[m, n, k] p(z_{m,n-1} = s_j, z_{m,n} = s_k | θ) / p(X_m | θ). We can further process p(z_{m,n-1} = s_j, z_{m,n} = s_k | θ): p(z_{m,n-1} = s_j, z_{m,n} = s_k | θ) = p(z_{m,n} = s_k | θ, z_{m,n-1} = s_j) p(z_{m,n-1} = s_j | θ) = A_{j,k} p(z_{m,n-1} = s_j | θ).

61 Computing the ξ Values So, we get: ξ[m, n, j, k] = [α[m, n-1, j] / p(z_{m,n-1} = s_j | θ)] φ_k(x_{m,n}) β[m, n, k] A_{j,k} p(z_{m,n-1} = s_j | θ) / p(X_m | θ) = α[m, n-1, j] φ_k(x_{m,n}) β[m, n, k] A_{j,k} / p(X_m | θ), since the p(z_{m,n-1} = s_j | θ) factors cancel.

62 Computing the ξ Values Based on the previous slides, the final formula for ξ is: ξ[m, n, j, k] = α[m, n-1, j] φ_k(x_{m,n}) β[m, n, k] A_{j,k} / p(X_m | θ). This formula can be computed using the parameters θ and the algorithms we have covered: α[m, n-1, j] is computed with the forward algorithm. β[m, n, k] is computed by the backward algorithm. p(X_m | θ) is computed with the forward algorithm. φ_k and A_{j,k} are specified by θ.

63 Baum-Welch: Summary of E Step Inputs: Training observation sequences X_1, X_2, X_3, ..., X_M. Current values of parameters θ = (π_k, A_{k,j}, φ_k). Algorithm: For m = 1 to M: Run the forward-backward algorithm to compute: α[m, n, k] for n and k such that 1 ≤ n ≤ N_m, 1 ≤ k ≤ K. β[m, n, k] for n and k such that 1 ≤ n ≤ N_m, 1 ≤ k ≤ K. γ[m, n, k] for n and k such that 1 ≤ n ≤ N_m, 1 ≤ k ≤ K. p(X_m | θ). For n, j, k such that 2 ≤ n ≤ N_m, 1 ≤ j, k ≤ K: ξ[m, n, j, k] = α[m, n-1, j] φ_k(x_{m,n}) β[m, n, k] A_{j,k} / p(X_m | θ).
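
Here is a sketch of the ξ computation for a single training sequence X_m, assuming the alpha and beta arrays and p(X_m | θ) from the forward and backward sketches above; with 0-based indexing, xi[n-1, j, k] corresponds to the slides' ξ[m, n, j, k].

```python
import numpy as np

def compute_xi(A, B, X, alpha, beta, p_X):
    """xi[n-1, j, k] = p(z_{n-1} = s_j, z_n = s_k | theta, X) for n = 2, ..., N."""
    N, K = len(X), A.shape[0]
    xi = np.zeros((N - 1, K, K))
    for n in range(1, N):
        # xi = alpha[n-1, j] * A[j, k] * phi_k(x_n) * beta[n, k] / p(X | theta)
        xi[n - 1] = (alpha[n - 1][:, None] * A * (B[:, X[n]] * beta[n])[None, :]) / p_X
    return xi
```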

64 Baum-Welch: Maximization Step Before we start the maximization step, we have some current values for: γ[m, n, k] = p(z_{m,n} = s_k | θ, X_m). ξ[m, n, j, k] = p(z_{m,n-1} = s_j, z_{m,n} = s_k | θ, X_m). The goal of the maximization step is to use those values to estimate new values for θ = (π_k, A_{k,j}, φ_k). In other words, we want to estimate new values for: Initial state probabilities π_k. State transition probabilities A_{k,j}. Observation probabilities φ_k.

65 Baum-Welch: Maximization Step Initial state probabilities π_k are computed as: π_k = [Σ_{m=1}^{M} γ[m, 1, k]] / [Σ_{m=1}^{M} Σ_{j=1}^{K} γ[m, 1, j]]. In words, we compute for each k the ratio of these two quantities: The sum of probabilities, over all observation sequences X_m, that state s_k was the initial state for X_m. The sum of probabilities, over all observation sequences X_m and all states s_j, that state s_j was the initial state for X_m.

66 Baum-Welch: Maximization Step State transition probabilities A_{k,j} are computed as: A_{k,j} = [Σ_{m=1}^{M} Σ_{n=2}^{N_m} ξ[m, n, k, j]] / [Σ_{m=1}^{M} Σ_{n=2}^{N_m} Σ_{i=1}^{K} ξ[m, n, k, i]]. In words, we compute, for each k, j, the ratio of these two quantities: The sum of probabilities, over all state transitions of all observation sequences X_m, that state s_k was followed by s_j. The sum of probabilities, over all state transitions of all observation sequences X_m and all states s_i, that state s_k was followed by state s_i.

67 Baum-Welch: Maximization Step Computing the observation probabilities φ_k depends on how we model those probabilities. For example, we could choose: Discrete probabilities. Histograms. Gaussians. Mixtures of Gaussians. We will show how to compute distributions φ_k for the cases of: Discrete probabilities. Gaussians.

68 Baum-Welch: Maximization Step Computing the observation probabilities φ_k depends on how we model those probabilities. In all cases, the key idea is that we treat observation x_{m,n} as partially assigned to distribution φ_k. γ[m, n, k] = p(z_{m,n} = s_k | θ, X_m) is the weight of the assignment of x_{m,n} to φ_k. In other words: We don't know what hidden state x_{m,n} corresponds to, but we have computed, for each state s_k, the probability γ[m, n, k] that x_{m,n} corresponds to s_k. Thus, x_{m,n} influences our estimate of distribution φ_k with weight proportional to γ[m, n, k].

69 Baum-Welch: Maximization Step Suppose that the observations are discrete, and come from a finite set Y = {y_1, ..., y_R}. In that case, each x_{m,n} is an element of Y. Define an auxiliary function Eq(x, y) as: Eq(x, y) = 1 if x = y, and Eq(x, y) = 0 if x ≠ y. Then, φ_k(y_r) = p(x_{m,n} = y_r | z_{m,n} = s_k), and it can be computed as: φ_k(y_r) = [Σ_{m=1}^{M} Σ_{n=1}^{N_m} γ[m, n, k] Eq(x_{m,n}, y_r)] / [Σ_{m=1}^{M} Σ_{n=1}^{N_m} γ[m, n, k]].

70 Baum-Welch: Maximization Step If observations come from a finite set Y = {y_1, ..., y_R}: φ_k(y_r) = [Σ_{m=1}^{M} Σ_{n=1}^{N_m} γ[m, n, k] Eq(x_{m,n}, y_r)] / [Σ_{m=1}^{M} Σ_{n=1}^{N_m} γ[m, n, k]]. The above formula can be seen as a weighted average of the times x_{m,n} is equal to y_r, where the weight of each x_{m,n} is the probability that x_{m,n} corresponds to hidden state s_k.

71 Baum-Welch: Maximization Step Suppose that observations are vectors in R^D, and that we model φ_k as a Gaussian distribution. In that case, for each φ_k we need to estimate a mean μ_k and a covariance matrix S_k. μ_k = [Σ_{m=1}^{M} Σ_{n=1}^{N_m} γ[m, n, k] x_{m,n}] / [Σ_{m=1}^{M} Σ_{n=1}^{N_m} γ[m, n, k]]. Thus, μ_k is a weighted average of values x_{m,n}, where the weight of each x_{m,n} is the probability that x_{m,n} corresponds to hidden state s_k.

72 Baum-Welch: Maximization Step Suppose that observations are vectors in R^D, and that we model φ_k as a Gaussian distribution. In that case, for each φ_k we need to estimate a mean μ_k and a covariance matrix S_k. μ_k = [Σ_{m=1}^{M} Σ_{n=1}^{N_m} γ[m, n, k] x_{m,n}] / [Σ_{m=1}^{M} Σ_{n=1}^{N_m} γ[m, n, k]]. S_k = [Σ_{m=1}^{M} Σ_{n=1}^{N_m} γ[m, n, k] (x_{m,n} - μ_k)(x_{m,n} - μ_k)^T] / [Σ_{m=1}^{M} Σ_{n=1}^{N_m} γ[m, n, k]].

73 Baum-Welch: Summary of M Step π_k = [Σ_{m=1}^{M} γ[m, 1, k]] / [Σ_{m=1}^{M} Σ_{j=1}^{K} γ[m, 1, j]]. A_{k,j} = [Σ_{m=1}^{M} Σ_{n=2}^{N_m} ξ[m, n, k, j]] / [Σ_{m=1}^{M} Σ_{n=2}^{N_m} Σ_{i=1}^{K} ξ[m, n, k, i]]. For discrete observations: φ_k(y_r) = [Σ_{m=1}^{M} Σ_{n=1}^{N_m} γ[m, n, k] Eq(x_{m,n}, y_r)] / [Σ_{m=1}^{M} Σ_{n=1}^{N_m} γ[m, n, k]].
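
Here is a sketch of these M-step updates for the discrete-observation case, assuming per-sequence arrays gammas[m] (of shape N_m × K) and xis[m] (of shape (N_m - 1) × K × K) computed with the E-step sketches above, and observation sequences Xs[m] whose symbols are integers 0, ..., R-1.

```python
import numpy as np

def m_step_discrete(gammas, xis, Xs, K, R):
    # pi_k: expected counts of starting in state k, normalized over k
    pi = sum(g[0] for g in gammas)
    pi /= pi.sum()
    # A[k, j]: expected number of transitions k -> j, normalized over destination states j
    A = sum(xi_m.sum(axis=0) for xi_m in xis)
    A /= A.sum(axis=1, keepdims=True)
    # B[k, r]: weighted counts of symbol r under state k, normalized over symbols r
    B = np.zeros((K, R))
    for g, X in zip(gammas, Xs):
        for n, x_n in enumerate(X):
            B[:, x_n] += g[n]            # gamma[m, n, k] is the weight of x_{m,n} for state k
    B /= B.sum(axis=1, keepdims=True)
    return pi, A, B
```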

74 Baum-Welch: Summary of M Step π_k = [Σ_{m=1}^{M} γ[m, 1, k]] / [Σ_{m=1}^{M} Σ_{j=1}^{K} γ[m, 1, j]]. A_{k,j} = [Σ_{m=1}^{M} Σ_{n=2}^{N_m} ξ[m, n, k, j]] / [Σ_{m=1}^{M} Σ_{n=2}^{N_m} Σ_{i=1}^{K} ξ[m, n, k, i]]. For Gaussian observation distributions: φ_k is a Gaussian with mean μ_k and covariance matrix S_k, where: μ_k = [Σ_{m=1}^{M} Σ_{n=1}^{N_m} γ[m, n, k] x_{m,n}] / [Σ_{m=1}^{M} Σ_{n=1}^{N_m} γ[m, n, k]]. S_k = [Σ_{m=1}^{M} Σ_{n=1}^{N_m} γ[m, n, k] (x_{m,n} - μ_k)(x_{m,n} - μ_k)^T] / [Σ_{m=1}^{M} Σ_{n=1}^{N_m} γ[m, n, k]].
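
For Gaussian observation models, here is a corresponding M-step sketch, assuming the same gammas[m] arrays and D-dimensional observation sequences Xs[m] stored as arrays of shape (N_m, D).

```python
import numpy as np

def m_step_gaussian(gammas, Xs, K, D):
    mu = np.zeros((K, D))
    S = np.zeros((K, D, D))
    w = np.zeros(K)                              # total weight sum_m sum_n gamma[m, n, k]
    for g, X in zip(gammas, Xs):
        w += g.sum(axis=0)
        mu += g.T @ X                            # sum_n gamma[m, n, k] * x_{m,n}
    mu /= w[:, None]
    for g, X in zip(gammas, Xs):
        for k in range(K):
            diff = X - mu[k]                     # (N_m, D) matrix of x_{m,n} - mu_k
            S[k] += (g[:, k][:, None] * diff).T @ diff
    S /= w[:, None, None]
    return mu, S
```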

75 Baum-Welch Summary Initialization: Initialize θ = (π_k, A_{k,j}, φ_k) with random values, but set to 0 the values of π_k and A_{k,j} needed to specify the network topology. Main loop: E-step: Compute all values p(X_m | θ), α[m, n, k], β[m, n, k], γ[m, n, k] using the forward-backward algorithm. Compute all values ξ[m, n, j, k] = α[m, n-1, j] φ_k(x_{m,n}) β[m, n, k] A_{j,k} / p(X_m | θ). M-step: Update θ = (π_k, A_{k,j}, φ_k), using the values γ[m, n, k] and ξ[m, n, j, k] computed at the E-step.
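
Tying the pieces together, here is a sketch of the full Baum-Welch loop for the discrete-observation case, reusing the forward, backward, compute_xi, and m_step_discrete helpers sketched earlier; the random initialization, fixed iteration count, and lack of a convergence test are simplifications (a practical implementation would also use scaled or log-space probabilities).

```python
import numpy as np

def baum_welch(Xs, K, R, n_iter=50, seed=0):
    rng = np.random.default_rng(seed)
    # Random initialization of theta = (pi, A, B); zero out entries here to constrain the topology
    pi = rng.random(K); pi /= pi.sum()
    A = rng.random((K, K)); A /= A.sum(axis=1, keepdims=True)
    B = rng.random((K, R)); B /= B.sum(axis=1, keepdims=True)
    for _ in range(n_iter):
        gammas, xis = [], []
        for X in Xs:                                           # E-step
            alpha, p_X = forward(pi, A, B, X)
            beta = backward(A, B, X)
            gammas.append(alpha * beta / p_X)
            xis.append(compute_xi(A, B, X, alpha, beta, p_X))
        pi, A, B = m_step_discrete(gammas, xis, Xs, K, R)      # M-step
    return pi, A, B
```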

76 Hidden Markov Models - Recap HMMs are widely used to model temporal sequences. In HMMs, an observation sequence corresponds to a sequence of hidden states. An HMM is defined by parameters θ = (π_k, A_{k,j}, φ_k), and defines a joint probability p(X, Z | θ). An HMM can be used as a generative model, to produce synthetic samples from distribution p(X, Z | θ). HMMs can be used for Bayesian classification: One HMM θ_c per class. Find the class c that maximizes p(θ_c | X).

77 Hidden Markov Models - Recap HMMs can be used for Bayesian classification: One HMM θ_c per class. Find the class c that maximizes p(θ_c | X), using probabilities p(X | θ_c) calculated by the forward algorithm. HMMs can be used to find the most likely hidden state sequence Z for an observation sequence X. This is done using the Viterbi algorithm. HMMs can be used to find the most likely hidden state z_n for a specific value x_n of an observation sequence X. This is done using the forward-backward algorithm. HMMs are trained using the Baum-Welch algorithm, which is an Expectation-Maximization algorithm.


10. Hidden Markov Models (HMM) for Speech Processing. (some slides taken from Glass and Zue course) 10. Hidden Markov Models (HMM) for Speech Processing (some slides taken from Glass and Zue course) Definition of an HMM The HMM are powerful statistical methods to characterize the observed samples of

More information

Statistical Sequence Recognition and Training: An Introduction to HMMs

Statistical Sequence Recognition and Training: An Introduction to HMMs Statistical Sequence Recognition and Training: An Introduction to HMMs EECS 225D Nikki Mirghafori nikki@icsi.berkeley.edu March 7, 2005 Credit: many of the HMM slides have been borrowed and adapted, with

More information

Hidden Markov Modelling

Hidden Markov Modelling Hidden Markov Modelling Introduction Problem formulation Forward-Backward algorithm Viterbi search Baum-Welch parameter estimation Other considerations Multiple observation sequences Phone-based models

More information

p(d θ ) l(θ ) 1.2 x x x

p(d θ ) l(θ ) 1.2 x x x p(d θ ).2 x 0-7 0.8 x 0-7 0.4 x 0-7 l(θ ) -20-40 -60-80 -00 2 3 4 5 6 7 θ ˆ 2 3 4 5 6 7 θ ˆ 2 3 4 5 6 7 θ θ x FIGURE 3.. The top graph shows several training points in one dimension, known or assumed to

More information

Statistical Methods for NLP

Statistical Methods for NLP Statistical Methods for NLP Information Extraction, Hidden Markov Models Sameer Maskey Week 5, Oct 3, 2012 *many slides provided by Bhuvana Ramabhadran, Stanley Chen, Michael Picheny Speech Recognition

More information

Data Mining in Bioinformatics HMM

Data Mining in Bioinformatics HMM Data Mining in Bioinformatics HMM Microarray Problem: Major Objective n Major Objective: Discover a comprehensive theory of life s organization at the molecular level 2 1 Data Mining in Bioinformatics

More information

Temporal Modeling and Basic Speech Recognition

Temporal Modeling and Basic Speech Recognition UNIVERSITY ILLINOIS @ URBANA-CHAMPAIGN OF CS 498PS Audio Computing Lab Temporal Modeling and Basic Speech Recognition Paris Smaragdis paris@illinois.edu paris.cs.illinois.edu Today s lecture Recognizing

More information

Generative Clustering, Topic Modeling, & Bayesian Inference

Generative Clustering, Topic Modeling, & Bayesian Inference Generative Clustering, Topic Modeling, & Bayesian Inference INFO-4604, Applied Machine Learning University of Colorado Boulder December 12-14, 2017 Prof. Michael Paul Unsupervised Naïve Bayes Last week

More information

State-Space Methods for Inferring Spike Trains from Calcium Imaging

State-Space Methods for Inferring Spike Trains from Calcium Imaging State-Space Methods for Inferring Spike Trains from Calcium Imaging Joshua Vogelstein Johns Hopkins April 23, 2009 Joshua Vogelstein (Johns Hopkins) State-Space Calcium Imaging April 23, 2009 1 / 78 Outline

More information

Applications of Hidden Markov Models

Applications of Hidden Markov Models 18.417 Introduction to Computational Molecular Biology Lecture 18: November 9, 2004 Scribe: Chris Peikert Lecturer: Ross Lippert Editor: Chris Peikert Applications of Hidden Markov Models Review of Notation

More information

order is number of previous outputs

order is number of previous outputs Markov Models Lecture : Markov and Hidden Markov Models PSfrag Use past replacements as state. Next output depends on previous output(s): y t = f[y t, y t,...] order is number of previous outputs y t y

More information

MACHINE LEARNING 2 UGM,HMMS Lecture 7

MACHINE LEARNING 2 UGM,HMMS Lecture 7 LOREM I P S U M Royal Institute of Technology MACHINE LEARNING 2 UGM,HMMS Lecture 7 THIS LECTURE DGM semantics UGM De-noising HMMs Applications (interesting probabilities) DP for generation probability

More information

Parametric Models. Dr. Shuang LIANG. School of Software Engineering TongJi University Fall, 2012

Parametric Models. Dr. Shuang LIANG. School of Software Engineering TongJi University Fall, 2012 Parametric Models Dr. Shuang LIANG School of Software Engineering TongJi University Fall, 2012 Today s Topics Maximum Likelihood Estimation Bayesian Density Estimation Today s Topics Maximum Likelihood

More information

CS838-1 Advanced NLP: Hidden Markov Models

CS838-1 Advanced NLP: Hidden Markov Models CS838-1 Advanced NLP: Hidden Markov Models Xiaojin Zhu 2007 Send comments to jerryzhu@cs.wisc.edu 1 Part of Speech Tagging Tag each word in a sentence with its part-of-speech, e.g., The/AT representative/nn

More information

COMP90051 Statistical Machine Learning

COMP90051 Statistical Machine Learning COMP90051 Statistical Machine Learning Semester 2, 2017 Lecturer: Trevor Cohn 24. Hidden Markov Models & message passing Looking back Representation of joint distributions Conditional/marginal independence

More information

University of Cambridge. MPhil in Computer Speech Text & Internet Technology. Module: Speech Processing II. Lecture 2: Hidden Markov Models I

University of Cambridge. MPhil in Computer Speech Text & Internet Technology. Module: Speech Processing II. Lecture 2: Hidden Markov Models I University of Cambridge MPhil in Computer Speech Text & Internet Technology Module: Speech Processing II Lecture 2: Hidden Markov Models I o o o o o 1 2 3 4 T 1 b 2 () a 12 2 a 3 a 4 5 34 a 23 b () b ()

More information