Directed Probabilistic Graphical Models CMSC 678 UMBC

1 Directed Probabilistic Graphical Models CMSC 678 UMBC

2 Announcement 1: Assignment 3 Due Wednesday April 11th, 11:59 AM. Any questions?

3 Announcement 2: Progress Report on Project Due Monday April 16th, 11:59 AM. Build on the proposal: update it to address comments, discuss the progress you've made, discuss what remains to be done, and discuss any new blocks you've experienced (or anticipate experiencing). Any questions?

4 Outline: Recap of EM; Math: Lagrange Multipliers for constrained optimization; Probabilistic Modeling Example: Die Rolling; Directed Graphical Models: Naïve Bayes, Hidden Markov Models; Message Passing: Directed Graphical Model Inference (most likely sequence, total (marginal) probability); EM in D-PGMs

5 Recap from last time

6 Expectation Maximization (EM): E-step. 0. Assume some value for your parameters. EM is a two-step, iterative algorithm. 1. E-step: count under uncertainty, assuming these parameters: use p^(t)(z) to compute estimated counts count(z_i, w_i). 2. M-step: maximize the log-likelihood, assuming these uncertain counts: use the estimated counts to re-estimate p^(t+1)(z).

7 EM Math. E-step: count under uncertainty, using the posterior distribution under the old parameters. M-step: maximize the log-likelihood to get new parameters: θ^(t+1) = argmax_θ E_{z ~ p^(t)(· | w)}[log p_θ(z, w)]. Writing M(θ) = marginal log-likelihood of the observed data X, C(θ) = log-likelihood of the complete data (X, Y), and P(θ) = posterior log-likelihood of the incomplete data Y, we have M(θ) = E_{Y | θ^(t)}[C(θ) | X] − E_{Y | θ^(t)}[P(θ) | X]. EM does not decrease the marginal log-likelihood.

8 Outline: Recap of EM; Math: Lagrange Multipliers for constrained optimization; Probabilistic Modeling Example: Die Rolling; Directed Graphical Models: Naïve Bayes, Hidden Markov Models; Message Passing: Directed Graphical Model Inference (most likely sequence, total (marginal) probability); EM in D-PGMs

9 Lagrange multipliers Assume an original optimization problem

10 Lagrange multipliers Assume an original optimization problem We convert it to a new optimization problem:

11 Lagrange multipliers: an equivalent problem?

12 Lagrange multipliers: an equivalent problem?

13 Lagrange multipliers: an equivalent problem?

14 Outline: Recap of EM; Math: Lagrange Multipliers for constrained optimization; Probabilistic Modeling Example: Die Rolling; Directed Graphical Models: Naïve Bayes, Hidden Markov Models; Message Passing: Directed Graphical Model Inference (most likely sequence, total (marginal) probability); EM in D-PGMs

15 Probabilistic Estimation of Rolling a Die. N different (independent) rolls, e.g. w_1 = 1, w_2 = 5, w_3 = 4: p(w_1, w_2, ..., w_N) = p(w_1) p(w_2) ... p(w_N) = ∏_i p(w_i). Generative story: for roll i = 1 to N: w_i ~ Cat(θ), where θ is a probability distribution over the 6 sides of the die: ∑_{k=1}^{6} θ_k = 1 and 0 ≤ θ_k ≤ 1 for all k.

16 Probabilistic Estimation of Rolling a Die. N different (independent) rolls: p(w_1, w_2, ..., w_N) = ∏_i p(w_i). Generative story: for roll i = 1 to N: w_i ~ Cat(θ). Maximize the log-likelihood: L(θ) = ∑_i log p_θ(w_i) = ∑_i log θ_{w_i}.

17 Probabilistic Estimation of Rolling a Die. N different (independent) rolls: p(w_1, ..., w_N) = ∏_i p(w_i); generative story: w_i ~ Cat(θ). Maximize the log-likelihood: L(θ) = ∑_i log θ_{w_i}. Q: What's an easy way to maximize this, as written exactly (even without calculus)?

18 Probabilistic Estimation of Rolling a Die. Maximize the log-likelihood L(θ) = ∑_i log θ_{w_i}. Q: What's an easy way to maximize this, as written exactly (even without calculus)? A: Just keep increasing θ_k (we know θ must be a distribution, but that's not specified in the objective as written).

19 Probabilistic Estimation of Rolling a Die. N different (independent) rolls: p(w_1, ..., w_N) = ∏_i p(w_i). Maximize the log-likelihood (with distribution constraints): L(θ) = ∑_i log θ_{w_i} s.t. ∑_{k=1}^{6} θ_k = 1. (We could also include the inequality constraints 0 ≤ θ_k, but they complicate the problem and, right now, are not needed.) Solve using Lagrange multipliers.

20 Probabilistic Estimation of Rolling a Die. Maximize the log-likelihood (with distribution constraints) via the Lagrangian F(θ) = ∑_i log θ_{w_i} − λ(∑_{k=1}^{6} θ_k − 1). Then ∂F/∂θ_k = ∑_{i: w_i = k} 1/θ_k − λ and ∂F/∂λ = −∑_{k=1}^{6} θ_k + 1.

21 Probabilistic Estimation of Rolling a Die. Setting ∂F/∂θ_k = 0 gives θ_k = (∑_{i: w_i = k} 1) / λ; λ takes its optimal value when ∑_{k=1}^{6} θ_k = 1.

22 Probabilistic Estimation of Rolling a Die. Substituting the constraint gives θ_k = (∑_{i: w_i = k} 1) / (∑_k ∑_{i: w_i = k} 1) = N_k / N, the empirical fraction of rolls that came up k.
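
(Aside, not from the slides: a minimal Python sketch of the closed-form estimate θ_k = N_k / N derived above. The function name and the example rolls are illustrative.)

    from collections import Counter

    def mle_categorical(rolls, num_sides=6):
        # closed-form maximum-likelihood estimate: theta_k = N_k / N
        counts = Counter(rolls)
        n = len(rolls)
        return [counts.get(k, 0) / n for k in range(1, num_sides + 1)]

    # e.g. the three rolls from the slide: w_1 = 1, w_2 = 5, w_3 = 4
    print(mle_categorical([1, 5, 4]))  # [0.333..., 0.0, 0.0, 0.333..., 0.333..., 0.0]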

23 Example: Conditionally Rolling a Die. Instead of p(w_1, ..., w_N) = ∏_i p(w_i), add complexity to better explain what we see: p(z_1, w_1, z_2, w_2, ..., z_N, w_N) = p(z_1) p(w_1 | z_1) ... p(z_N) p(w_N | z_N) = ∏_i p(z_i) p(w_i | z_i), e.g. z_1 = H, w_1 = 1, z_2 = T, w_2 = 5. Three coins: a penny with p(heads) = λ, p(tails) = 1 − λ; a dollar coin with p(heads) = γ, p(tails) = 1 − γ; and a dime with p(heads) = ψ, p(tails) = 1 − ψ.

24 Example: Conditionally Rolling a Die. p(z_1, w_1, ..., z_N, w_N) = ∏_i p(z_i) p(w_i | z_i). Generative story: λ = distribution over the penny, γ = distribution for the dollar coin, ψ = distribution over the dime; for item i = 1 to N: z_i ~ Bernoulli(λ); if z_i = H: w_i ~ Bernoulli(γ); else: w_i ~ Bernoulli(ψ).
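
(Aside, not from the slides: a small Python sketch that samples (z_i, w_i) pairs from this generative story. The values chosen for λ, γ, ψ are made up for illustration.)

    import random

    def sample_coins(n, lam=0.5, gamma=0.7, psi=0.3):
        # z_i ~ Bernoulli(lam): which coin generates w_i
        # w_i ~ Bernoulli(gamma) if z_i = H, else Bernoulli(psi)
        data = []
        for _ in range(n):
            z = 'H' if random.random() < lam else 'T'
            p_heads = gamma if z == 'H' else psi
            w = 'H' if random.random() < p_heads else 'T'
            data.append((z, w))
        return data

    print(sample_coins(5))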

25 Outline: Recap of EM; Math: Lagrange Multipliers for constrained optimization; Probabilistic Modeling Example: Die Rolling; Directed Graphical Models: Naïve Bayes, Hidden Markov Models; Message Passing: Directed Graphical Model Inference (most likely sequence, total (marginal) probability); EM in D-PGMs

26 Classify with Bayes Rule: argmax_Y p(Y | X) = argmax_Y [log p(X | Y) + log p(Y)], i.e. likelihood plus prior.

27 The Bag of Words Representation Adapted from Jurafsky & Martin (draft)

28 The Bag of Words Representation Adapted from Jurafsky & Martin (draft)

29 The Bag of Words Representation Adapted from Jurafsky & Martin (draft)

30 Bag of Words Representation (figure: a document reduced to word counts, e.g. seen 2, sweet 1, whimsical 1, recommend 1, happy 1, fed to a classifier γ(·) = c). Adapted from Jurafsky & Martin (draft)

31 Naïve Bayes: A Generative Story. Generative story (global parameters): φ = distribution over K labels; for label k = 1 to K: θ_k = generate parameters.

32 Naïve Bayes: A Generative Story. Generative story: φ = distribution over K labels; for label k = 1 to K: θ_k = generate parameters; for item i = 1 to N: y_i ~ Cat(φ).

33 Naïve Bayes: A Generative Story. Generative story: φ = distribution over K labels; for label k = 1 to K: θ_k = generate parameters; for item i = 1 to N: y_i ~ Cat(φ); for each feature j: x_ij ~ F_j(θ_{y_i}) (local variables; graph: label y_i with children x_i1, ..., x_i5).

34 Naïve Bayes: A Generative Story. Same generative story; each x_ij is conditionally independent of the others (given the label).

35 Naïve Bayes: A Generative Story. Maximize the log-likelihood: L(θ) = ∑_i ∑_j log F_{y_i}(x_ij; θ_{y_i}) + ∑_i log φ_{y_i}, s.t. ∑_k φ_k = 1 and each θ_k is valid for F_k.

36 Multinomial Naïve Bayes: A Generative Story. Generative story: φ = distribution over K labels; for label k = 1 to K: θ_k = distribution over J feature values; for item i = 1 to N: y_i ~ Cat(φ); for each feature j: x_ij ~ Cat(θ_{y_i}). Maximize the log-likelihood: L(θ) = ∑_i ∑_j log θ_{y_i, x_ij} + ∑_i log φ_{y_i}, s.t. ∑_k φ_k = 1 and ∑_j θ_kj = 1 for all k.

37 Multinomial Naïve Bayes: A Generative Story. Maximize the log-likelihood via Lagrange multipliers: F(θ) = ∑_i ∑_j log θ_{y_i, x_ij} + ∑_i log φ_{y_i} − μ(∑_k φ_k − 1) − ∑_k λ_k(∑_j θ_kj − 1).

38 Multinomial Naïve Bayes: Learning. Calculate class priors: for each k, items_k = all items with class = k, and p(k) = |items_k| / (# items). Calculate feature generation terms: for each k, obs_k = a single object containing all items labeled as k; for each feature j, n_kj = # of occurrences of j in obs_k, and p(j | k) = n_kj / ∑_{j'} n_kj'.
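
(Aside, not from the slides: a minimal Python sketch of this counting procedure, without smoothing. Items are assumed to be lists of feature tokens; the names are illustrative.)

    from collections import Counter, defaultdict

    def train_multinomial_nb(items, labels):
        # class priors: p(k) = |items_k| / # items
        n = len(items)
        priors = {k: c / n for k, c in Counter(labels).items()}
        # feature generation terms: p(j | k) = n_kj / sum_j' n_kj'
        pooled = defaultdict(Counter)          # obs_k: all items labeled k pooled together
        for features, k in zip(items, labels):
            pooled[k].update(features)
        likelihoods = {k: {j: c / sum(cnt.values()) for j, c in cnt.items()}
                       for k, cnt in pooled.items()}
        return priors, likelihoods

    priors, likelihoods = train_multinomial_nb(
        [["fun", "great"], ["boring", "slow"], ["great", "great"]],
        ["pos", "neg", "pos"])
    print(priors["pos"], likelihoods["pos"]["great"])   # 0.666..., 0.75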

39 Brill and Banko (2001): with enough data, the classifier may not matter. Adapted from Jurafsky & Martin (draft)

40 Summary: Naïve Bayes is not so naïve, but not without issues. Pro: very fast, with low storage requirements; robust to irrelevant features; very good in domains with many equally important features; optimal if the independence assumptions hold; a dependable baseline for text classification (but often not the best). Con: why not model the posterior in one go (e.g., use conditional maxent)? Are the features really uncorrelated? Are plain counts always appropriate? Are there better (automated, more principled) ways of handling missing/noisy data? Adapted from Jurafsky & Martin (draft)

41 Outline: Recap of EM; Math: Lagrange Multipliers for constrained optimization; Probabilistic Modeling Example: Die Rolling; Directed Graphical Models: Naïve Bayes, Hidden Markov Models; Message Passing: Directed Graphical Model Inference (most likely sequence, total (marginal) probability); EM in D-PGMs

42 Hidden Markov Models. Two possible tag sequences for "British Left Waffles on Falkland Islands": (i) Adjective Noun Verb Prep Noun Noun; (ii) Noun Verb Noun Prep Noun Noun. A class-based model: a bigram model of the classes, modeling all class sequences z_1, ..., z_N via p(z_1, w_1, z_2, w_2, ..., z_N, w_N) with factors p(w_i | z_i) and p(z_i | z_{i-1}).

43 Hidden Markov Model. p(z_1, w_1, z_2, w_2, ..., z_N, w_N) = p(z_1 | z_0) p(w_1 | z_1) ... p(z_N | z_{N-1}) p(w_N | z_N) = ∏_i p(w_i | z_i) p(z_i | z_{i-1}). Goal: maximize the (log-)likelihood. In practice we don't actually observe these z values; we just see the words w.

44 Hidden Markov Model. p(z_1, w_1, ..., z_N, w_N) = ∏_i p(w_i | z_i) p(z_i | z_{i-1}). Goal: maximize the (log-)likelihood. In practice we don't actually observe these z values; we just see the words w. If we did observe z, estimating the probability parameters would be easy, but we don't! If we knew the probability parameters, then we could estimate z and evaluate the likelihood, but we don't!

45 Hidden Markov Model Terminology. p(z_1, w_1, ..., z_N, w_N) = p(z_1 | z_0) p(w_1 | z_1) ... p(z_N | z_{N-1}) p(w_N | z_N) = ∏_i p(w_i | z_i) p(z_i | z_{i-1}). Each z_i can take the value of one of K latent states.

46 Hidden Markov Model Terminology. The factors p(z_i | z_{i-1}) are the transition probabilities/parameters. Each z_i can take the value of one of K latent states.

47 Hidden Markov Model Terminology. The factors p(w_i | z_i) are the emission probabilities/parameters; the factors p(z_i | z_{i-1}) are the transition probabilities/parameters. Each z_i can take the value of one of K latent states.

48 Hidden Markov Model Terminology. Each z_i can take the value of one of K latent states; the transition and emission distributions do not change across positions.

49 Hidden Markov Model Terminology. Q: How many different probability values are there with K states and V vocab items?

50 Hidden Markov Model Terminology. Q: How many different probability values are there with K states and V vocab items? A: VK emission values and K^2 transition values.

51 Hidden Markov Model Representation. p(z_1, w_1, ..., z_N, w_N) = ∏_i p(w_i | z_i) p(z_i | z_{i-1}), with emission parameters p(w_i | z_i) and transition parameters p(z_i | z_{i-1}). Represent the probabilities and independence assumptions in a graph: z_1 → z_2 → z_3 → z_4, with each z_i emitting w_i.

52 Hidden Markov Model Representation. Same factorization, with an initial starting distribution p(z_1 | z_0), where z_0 is a start symbol (SEQ_SYM). Graph: transition arcs z_1 → z_2 → z_3 → z_4 labeled p(z_2 | z_1), p(z_3 | z_2), p(z_4 | z_3), and emission arcs z_i → w_i labeled p(w_i | z_i). Each z_i can take the value of one of K latent states; the transition and emission distributions do not change.

53 Example: 2-state Hidden Markov Model as a Lattice (figure: a lattice with states z_i = N and z_i = V for positions i = 1..4, each emitting w_i, plus start and end states).

54-56 Example: 2-state Hidden Markov Model as a Lattice (the same figure, incrementally annotated with emission probabilities p(w_i | N), p(w_i | V) and transition probabilities p(N | start), p(V | start), p(N | N), p(V | N), p(N | V), p(V | V)).

57 A Latent Sequence is a Path through the Graph (figure: one highlighted path through the lattice, using transitions p(N | start), p(V | N), p(V | V), p(N | V) and emissions p(w_1 | N), p(w_2 | V), p(w_3 | V), p(w_4 | N)). Q: What's the probability of (N, w_1), (V, w_2), (V, w_3), (N, w_4)? A: (.7 × .7) × (.8 × .6) × (.35 × .1) × (.6 × .05) ≈ 0.000247

58 Outline: Recap of EM; Math: Lagrange Multipliers for constrained optimization; Probabilistic Modeling Example: Die Rolling; Directed Graphical Models: Naïve Bayes, Hidden Markov Models; Message Passing: Directed Graphical Model Inference (most likely sequence, total (marginal) probability); EM in D-PGMs

59 Message Passing: Count the Soldiers If you are the front soldier in the line, say the number one to the soldier behind you. If you are the rearmost soldier in the line, say the number one to the soldier in front of you. If a soldier ahead of or behind you says a number to you, add one to it, and say the new number to the soldier on the other side ITILA, Ch 16

60-62 Message Passing: Count the Soldiers (slides 60-62 repeat the same instructions as slide 59 while the figure advances). ITILA, Ch 16

63-68 What's the Maximum Weighted Path? (title-only slides that work through a weighted lattice figure step by step)

69 Outline: Recap of EM; Math: Lagrange Multipliers for constrained optimization; Probabilistic Modeling Example: Die Rolling; Directed Graphical Models: Naïve Bayes, Hidden Markov Models; Message Passing: Directed Graphical Model Inference (most likely sequence, total (marginal) probability); EM in D-PGMs

70 What's the Maximum Value? (lattice with states A, B, C at steps i-1 and i) Consider any shared path ending with B (A→B, B→B, or C→B) and maximize across the previous hidden state values: v(i, B) = max_s v(i-1, s) p(B | s) p(obs at i | B). v(i, B) is the maximum probability of any path to state B from the beginning (and emitting the observation).

71 What's the Maximum Value? (lattice extended back to step i-2) Consider any shared path ending with B (A→B, B→B, or C→B) and maximize across the previous hidden state values: v(i, B) = max_s v(i-1, s) p(B | s) p(obs at i | B). v(i, B) is the maximum probability of any path to state B from the beginning (and emitting the observation).

72 What's the Maximum Value? Computing v at time i-1 will correctly incorporate (maximize over) paths through time i-2: we correctly obey the Markov property. The recurrence v(i, B) = max_s v(i-1, s) p(B | s) p(obs at i | B) is the Viterbi algorithm; v(i, B) is the maximum probability of any path to state B from the beginning (and emitting the observation).

73 v = double[n+2][k*] b = int[n+2][k*] Viterbi Algorithm v[*][*] = 0 v[0][start] = 1 backpointers/ book-keeping for(i = 1; i N+1; ++i) { for(state = 0; state < K*; ++state) { p obs = p emission (obs i state) for(old = 0; old < K*; ++old) { p move = p transition (state old) if(v[i-1][old] * p obs * p move > v[i][state]) { } } } } v[i][state] = v[i-1][old] * p obs * p move b[i][state] = old

74 Outline: Recap of EM; Math: Lagrange Multipliers for constrained optimization; Probabilistic Modeling Example: Die Rolling; Directed Graphical Models: Naïve Bayes, Hidden Markov Models; Message Passing: Directed Graphical Model Inference (most likely sequence, total (marginal) probability); EM in D-PGMs

75 Forward Probability (lattice with states A, B, C at steps i-2, i-1, i). α(i, B) is the total probability of all paths to state B from the beginning.

76 Forward Probability. Marginalize across the previous hidden state values: α(i, B) = ∑_s α(i-1, s) p(B | s) p(obs at i | B). Computing α at time i-1 will correctly incorporate paths through time i-2: we correctly obey the Markov property. α(i, B) is the total probability of all paths to state B from the beginning.

77 Forward Probability. α(i, s) = ∑_{s'} α(i-1, s') p(s | s') p(obs at i | s). α(i, s) is the total probability of all paths that: 1. start from the beginning; 2. end (currently) in s at step i; 3. emit the observation obs at i.

78 Forward Probability. In α(i, s) = ∑_{s'} α(i-1, s') p(s | s') p(obs at i | s), the sum over s' covers the immediate ways to get into state s, α(i-1, s') is the total probability up until now, and p(s | s') p(obs at i | s) is how likely it is to get into state s this way.

79 Forward Probability. Q: What do we return? (How do we return the likelihood of the sequence?) A: α[N+1][END].
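
(Aside, not from the slides: the α recursion as a runnable Python sketch, using the same assumed dictionary interface as the Viterbi sketch; the second returned value plays the role of α[N+1][END].)

    def forward(obs, states, p_emit, p_trans, start="START", end="END"):
        # alpha[i][s]: total probability of all paths that end in state s after emitting obs[:i]
        alpha = [{start: 1.0}]
        for i, w in enumerate(obs, start=1):
            alpha.append({})
            for s in states:
                total = sum(alpha[i - 1][prev] * p_trans[prev].get(s, 0.0)
                            for prev in alpha[i - 1])
                alpha[i][s] = total * p_emit[s].get(w, 0.0)
        # transitioning into the END state gives the sequence likelihood, alpha(N+1, END)
        return alpha, sum(alpha[-1][s] * p_trans[s].get(end, 0.0) for s in alpha[-1])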

80 Outline: Recap of EM; Math: Lagrange Multipliers for constrained optimization; Probabilistic Modeling Example: Die Rolling; Directed Graphical Models: Naïve Bayes, Hidden Markov Models; Message Passing: Directed Graphical Model Inference (most likely sequence, total (marginal) probability); EM in D-PGMs

81 Forward & Backward Message Passing (lattice with states A, B, C at steps i-1, i, i+1). α(i, s) is the total probability of all paths that: 1. start from the beginning; 2. end (currently) in s at step i; 3. emit the observation obs at i. β(i, s) is the total probability of all paths that: 1. start at step i at state s; 2. terminate at the end; 3. (emit the observation obs at i+1).

82 Forward & Backward Message Passing. α(i, B) summarizes the paths arriving at state B at step i; β(i, B) summarizes the paths leaving state B at step i.

83 Forward & Backward Message Passing. α(i, B) × β(i, B) = total probability of paths through state B at step i.

84 Forward & Backward Message Passing. Combining α(i, B) with β(i+1, s') lets us score the arc from B at step i to s' at step i+1.

85 Forward & Backward Message Passing. α(i, B) × p(s' | B) × p(obs at i+1 | s') × β(i+1, s') = total probability of paths through the B → s' arc (at time i).

86 With Both Forward and Backward Values. α(i, s) × β(i, s) = total probability of paths through state s at step i, so p(z_i = s | w_1, ..., w_N) = α(i, s) β(i, s) / α(N+1, END). α(i, s) × p(s' | s) × p(obs at i+1 | s') × β(i+1, s') = total probability of paths through the s → s' arc (at time i), so p(z_i = s, z_{i+1} = s' | w_1, ..., w_N) = α(i, s) p(s' | s) p(obs at i+1 | s') β(i + 1, s') / α(N+1, END).
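
(Aside, not from the slides: a minimal sketch of the per-position state posteriors, assuming α and β are stored as lists of per-step dictionaries as in the sketches above and that likelihood = α(N+1, END).)

    def state_posteriors(alpha, beta, likelihood):
        # gamma[i][s] = p(z_i = s | w_1..w_N) = alpha(i, s) * beta(i, s) / likelihood
        return [{s: a_i[s] * b_i.get(s, 0.0) / likelihood for s in a_i}
                for a_i, b_i in zip(alpha, beta)]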

87 Expectation Maximization (EM). 0. Assume some value for your parameters. Two-step, iterative algorithm: 1. E-step: count under uncertainty (estimated counts), assuming these parameters. 2. M-step: maximize the log-likelihood, assuming these uncertain counts.

88 Expectation Maximization (EM). 0. Assume some value for your parameters, here p_obs(w | s) and p_trans(s' | s). 1. E-step: count under uncertainty, assuming these parameters. 2. M-step: maximize the log-likelihood, assuming these uncertain counts.

89 Expectation Maximization (EM). 0. Assume some value for p_obs(w | s) and p_trans(s' | s). 1. E-step: count under uncertainty, assuming these parameters, using p(z_i = s | w_1, ..., w_N) = α(i, s) β(i, s) / α(N+1, END) and p(z_i = s, z_{i+1} = s' | w_1, ..., w_N) = α(i, s) p(s' | s) p(obs at i+1 | s') β(i+1, s') / α(N+1, END). 2. M-step: maximize the log-likelihood, assuming these uncertain counts.

90 M-Step: maximize the log-likelihood, assuming these uncertain counts. If we observed the hidden transitions, we would set p_new(s' | s) = c(s → s') / ∑_x c(s → x).

91 M-Step: maximize the log-likelihood, assuming these uncertain counts. We don't observe the hidden transitions, but we can approximately count: p_new(s' | s) = E[c(s → s')] / ∑_x E[c(s → x)].

92 M-Step: maximize the log-likelihood, assuming these uncertain counts. p_new(s' | s) = E[c(s → s')] / ∑_x E[c(s → x)]; we compute these expected counts in the E-step, with our α and β values.
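
(Aside, not from the slides: a minimal sketch of this M-step update, normalizing expected transition counts gathered in the E-step; the nested-dictionary layout expected_counts[s][s2] is an assumed convention.)

    def m_step_transitions(expected_counts):
        # p_new(s2 | s) = E[c(s -> s2)] / sum_x E[c(s -> x)]
        p_trans = {}
        for s, row in expected_counts.items():
            total = sum(row.values())
            p_trans[s] = {s2: c / total for s2, c in row.items()}
        return p_trans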

93 EM For HMMs (Baum-Welch Algorithm)

    α = computeForwards()
    β = computeBackwards()
    L = α[N+1][END]
    for (i = N; i >= 0; --i) {
      for (next = 0; next < K*; ++next) {
        c_obs(obs_{i+1} | next) += α[i+1][next] * β[i+1][next] / L
        for (state = 0; state < K*; ++state) {
          u = p_obs(obs_{i+1} | next) * p_trans(next | state)
          c_trans(next | state) += α[i][state] * u * β[i+1][next] / L
        }
      }
    }

94 Bayesian Networks: Directed Acyclic Graphs. p(x_1, x_2, x_3, ..., x_N) = ∏_i p(x_i | π(x_i)), where π(x_i) denotes the parents of x_i (the variables can be ordered by a topological sort of the graph).

95 Bayesian Networks: Directed Acyclic Graphs. p(x_1, ..., x_N) = ∏_i p(x_i | π(x_i)). Exact inference in general DAGs is NP-hard; inference in trees can be exact.

96 D-Separation: Testing for Conditional Independence. X & Y are conditionally independent given Z if all (undirected) paths from (any variable in) X to (any variable in) Y are d-separated by Z: independence corresponds to d-separation. X & Y are d-separated if for every path P, one of the following is true: P has a chain (X → Z → Y) with an observed middle node; P has a fork (X ← Z → Y) with an observed parent node; or P includes a v-structure or collider (X → Z ← Y) with all descendants unobserved.

97 D-Separation: Testing for Conditional Independence. Chain (X → Z → Y): observing Z blocks the path from X to Y. Fork (X ← Z → Y): observing Z blocks the path from X to Y. V-structure or collider (X → Z ← Y) with all descendants unobserved: NOT observing Z blocks the path from X to Y.

98 D-Separation: Testing for Conditional Independence. For the v-structure, p(x, y, z) = p(x) p(y) p(z | x, y), so p(x, y) = ∑_z p(x) p(y) p(z | x, y) = p(x) p(y): X and Y are marginally independent, but observing Z couples them.
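
(Aside, not from the slides: a small numeric check of the collider case with made-up probability tables; X and Y come out marginally independent but become dependent once Z is observed, i.e. explaining away.)

    from itertools import product

    p_x = {0: 0.5, 1: 0.5}
    p_y = {0: 0.5, 1: 0.5}
    # made-up CPT for the v-structure X -> Z <- Y
    p_z = {(x, y): {1: 0.9 if (x or y) else 0.1, 0: 0.1 if (x or y) else 0.9}
           for x, y in product([0, 1], repeat=2)}
    joint = {(x, y, z): p_x[x] * p_y[y] * p_z[(x, y)][z]
             for x, y, z in product([0, 1], repeat=3)}

    # marginally: p(x, y) = p(x) p(y)
    p_xy = {(x, y): joint[(x, y, 0)] + joint[(x, y, 1)] for x, y in product([0, 1], repeat=2)}
    print(all(abs(p_xy[xy] - p_x[xy[0]] * p_y[xy[1]]) < 1e-12 for xy in p_xy))   # True

    # conditioned on Z = 1: p(x, y | z) != p(x | z) p(y | z)
    pz1 = sum(joint[(x, y, 1)] for x, y in product([0, 1], repeat=2))
    p_xy_z1 = {(x, y): joint[(x, y, 1)] / pz1 for x, y in product([0, 1], repeat=2)}
    p_x_z1 = {x: p_xy_z1[(x, 0)] + p_xy_z1[(x, 1)] for x in (0, 1)}
    p_y_z1 = {y: p_xy_z1[(0, y)] + p_xy_z1[(1, y)] for y in (0, 1)}
    print(abs(p_xy_z1[(0, 0)] - p_x_z1[0] * p_y_z1[0]) > 1e-6)                   # True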

99 Markov Blanket: the set of nodes needed to form the complete conditional for a variable x_i. Using the factorization of the graph, p(x_i | x_{j≠i}) = p(x_1, ..., x_N) / ∫ p(x_1, ..., x_N) dx_i = ∏_k p(x_k | π(x_k)) / ∫ ∏_k p(x_k | π(x_k)) dx_i; factoring out terms that do not depend on x_i leaves ∏_{k: k=i or i ∈ π(x_k)} p(x_k | π(x_k)) / ∫ ∏_{k: k=i or i ∈ π(x_k)} p(x_k | π(x_k)) dx_i. The Markov blanket of a node x is its parents, children, and children's parents.
