Probability and Structure in Natural Language Processing


1 Probability and Structure in Natural Language Processing Noah Smith Heidelberg University, November 2014

2

3 Statistical methods in NLP arrived ~20 years ago and now dominate. Mercer was right: there's no data like more data. And there's more and more data. Lots of new techniques; it's formidable to learn and keep up with all of them.

4 Thesis Most of the main ideas are related and similar to each other. Different approaches to decoding. Different learning criteria. Supervised and unsupervised learning. Umbrella: reasoning about discrete structures. This is good news!

5 Plan 1. Graphical models and inference Monday 2. Decoding and structures Tuesday 3. Supervised learning Wednesday 4. Hidden variables Thursday

6 The content is formal, but the style doesn't need to be. Ask questions! Help me find the right pace. Lecture 4 can be dropped or reduced if needed.

7 Lecture 1: Graphical Models and Inference

8 Random Variables Probability is usually defined over events. Events are complicated! We tend to group events by attributes. Person: Age, Grade, HairColor. Random variables formalize attributes: "Grade = A" is shorthand for the event {ω : f_Grade(ω) = A}. Properties of a random variable X: Val(X) = possible values of X. For discrete (categorical): Σ_{x ∈ Val(X)} P(X = x) = 1. For continuous: ∫ P(X = x) dx = 1. Nonnegativity: ∀ x ∈ Val(X), P(X = x) ≥ 0.

9 Conditional Probability After learning that α is true, how do we feel about β? P(β | α). (Venn diagram of the sample space Ω with events α and β.)

10 Chain Rule P(α ∩ β) = P(α) P(β | α). More generally, P(α_1 ∩ ··· ∩ α_k) = P(α_1) P(α_2 | α_1) ··· P(α_k | α_1 ∩ ··· ∩ α_{k-1}).

11 Bayes Rule P(α | β) = P(β | α) P(α) / P(β); posterior = likelihood × prior / normalization constant. With an additional conditioning event γ (an external event): P(α | β ∩ γ) = P(β | α ∩ γ) P(α | γ) / P(β | γ).

12 Independence α and β are independent if P(β | α) = P(β); shorthand: P ⊨ (α ⊥ β). Proposition: α and β are independent if and only if P(α ∩ β) = P(α) P(β).

13 Conditional Independence Independence is rarely true. α and β are conditionally independent given γ if P(β | α ∩ γ) = P(β | γ); shorthand: P ⊨ (α ⊥ β | γ). Proposition: P ⊨ (α ⊥ β | γ) if and only if P(α ∩ β | γ) = P(α | γ) P(β | γ).

14 Joint and Marginal Distributions P(Grade, Intelligence) given as a table over Grade ∈ {A, B} and Intelligence ∈ {high, very high}. (Table values not recovered.) How do we compute the marginal over each individual random variable?

15 General Case P(X_1 = x) = Σ_{x_2 ∈ Val(X_2)} ··· Σ_{x_n ∈ Val(X_n)} P(X_1 = x, X_2 = x_2, ..., X_n = x_n). How many terms are in the sum?
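Computed by brute force, this marginal makes the cost concrete: with n binary variables, the inner sum ranges over 2^(n-1) terms. A minimal Python sketch (the joint table here is a made-up uniform example, not from the slides):

import itertools

# Hypothetical joint table over three binary variables X1, X2, X3 (uniform here).
P = {assign: 1.0 / 8 for assign in itertools.product([0, 1], repeat=3)}

def marginal_x1(x, joint, n_vars=3):
    # Sum the joint over all values of X2, ..., Xn with X1 fixed to x.
    total = 0.0
    for rest in itertools.product([0, 1], repeat=n_vars - 1):
        total += joint[(x,) + rest]
    return total  # the loop has 2**(n_vars - 1) terms

print(marginal_x1(0, P))  # 0.5 for the uniform joint above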

16 Basic Concepts So Far Atomic outcomes: assignment of x_1, ..., x_n to X_1, ..., X_n. Conditional probability: P(X, Y) = P(X) P(Y | X). Bayes rule: P(X | Y) = P(Y | X) P(X) / P(Y). Chain rule: P(X_1, ..., X_n) = P(X_1) P(X_2 | X_1) ··· P(X_n | X_1, ..., X_{n-1}). Marginals: deriving P(X = x) from P(X, Y).

17 Sets of Variables Sets of variables X, Y, Z. X is independent of Y given Z if P ⊨ (X = x ⊥ Y = y | Z = z) for all x ∈ Val(X), y ∈ Val(Y), z ∈ Val(Z). Shorthand for this conditional independence: P ⊨ (X ⊥ Y | Z). For P ⊨ (X ⊥ Y | ∅), write P ⊨ (X ⊥ Y). Proposition: P satisfies (X ⊥ Y | Z) if and only if P(X, Y | Z) = P(X | Z) P(Y | Z).

18 Factor Graphs

19 Factor Graphs Random variable nodes (circles). Factor nodes (squares). Edge between a variable and a factor if the factor depends on that variable; the graph is bipartite. A factor is a function from tuples of r.v. values to nonnegative numbers. Example: variables X_1, X_2, X_3 with factors φ_1, φ_2, φ_3, φ_4. P(X = x) ∝ ∏_j φ_j(x_j).
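As a concrete illustration of P(X = x) ∝ ∏_j φ_j(x_j), a factor can be stored as a table over its scope, and the unnormalized score of a full assignment is the product of the matching entries. A Python sketch with made-up factor values:

# Each factor: (scope, table mapping value tuples to nonnegative numbers). Values are illustrative.
phi1 = (("X1", "X2"), {(0, 0): 1.0, (0, 1): 2.0, (1, 0): 0.5, (1, 1): 3.0})
phi2 = (("X2", "X3"), {(0, 0): 4.0, (0, 1): 1.0, (1, 0): 1.0, (1, 1): 2.0})
factors = [phi1, phi2]

def unnormalized_score(assignment, factors):
    # Product of the factor entries that match the assignment (a dict: variable -> value).
    score = 1.0
    for scope, table in factors:
        score *= table[tuple(assignment[v] for v in scope)]
    return score

print(unnormalized_score({"X1": 1, "X2": 1, "X3": 0}, factors))  # 3.0 * 1.0 = 3.0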

20 Two Kinds of Factors Conditional probability tables, e.g., P(X_2 | X_1, X_3): leads to Bayesian networks and causal explanations. Potential functions (arbitrary positive scores): leads to Markov networks. (Same example graph: X_1, X_2, X_3 with φ_1, φ_2, φ_3, φ_4.)

21 Example: Bayesian Network The flu causes sinus inflammation. Allergies also cause sinus inflammation. Sinus inflammation causes a runny nose. Sinus inflammation causes headaches. (Variables: Flu, All., S.I., R.N., H.)

22 The same story as a factor graph: the flu causes sinus inflammation; allergies also cause sinus inflammation; sinus inflammation causes a runny nose; sinus inflammation causes headaches. Factors: ϕ_F(Flu), ϕ_A(All.), ϕ_FAS(Flu, All., S.I.), ϕ_SR(S.I., R.N.), ϕ_SH(S.I., H.). Some local configurations are more likely than others.

23 The same factor graph, now with the factor tables ϕ_F(F), ϕ_A(A), ϕ_FAS(F, A, S), ϕ_SR(S, R), ϕ_SH(S, H) shown. (Table values not recovered.) Some local configurations are more likely than others.

24 Example: Markov Network "Swinging couples" or confused students: (A ⊥ C | B, D) and (B ⊥ D | A, C). (Graph: A, B, C, D arranged in a ring.)

25 Example: Markov Network Each random variable is a vertex. Undirected edges. Factors are associated with subsets of nodes that form cliques. A factor maps assignments of its nodes to nonnegative values. (Graph: A, B, C, D.)

26 In this example, associate a factor with each edge: ϕ_1(A, B), ϕ_2(B, C), ϕ_3(C, D), ϕ_4(A, D). Could also have factors for single nodes! (Table values not recovered.)

27 Markov Networks Probability: P(a, b, c, d) ∝ ϕ_1(a, b) ϕ_2(b, c) ϕ_3(c, d) ϕ_4(a, d), i.e., P(a, b, c, d) = ϕ_1(a, b) ϕ_2(b, c) ϕ_3(c, d) ϕ_4(a, d) / Z, where Z = Σ_{a,b,c,d} ϕ_1(a, b) ϕ_2(b, c) ϕ_3(c, d) ϕ_4(a, d).

28 Example: Markov Network With the factor tables on the slide, Z = Σ_{a,b,c,d} ϕ_1(a, b) ϕ_2(b, c) ϕ_3(c, d) ϕ_4(a, d) = 7,201,840.

29 Example: Markov Network For instance, P(0, 1, 1, 0) = 5,000,000 / Z ≈ 0.69.

30 Example: Markov Network And P(1, 1, 0, 0) = 10 / Z ≈ 1.4 × 10⁻⁶.
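This whole example fits in a few lines of Python. The factor tables below are filled in with the standard values for this example (the well-known Koller & Friedman tables); they reproduce the numbers on the slides, Z = 7,201,840, P(0, 1, 1, 0) ≈ 0.69, and P(1, 1, 0, 0) ≈ 1.4 × 10⁻⁶.

import itertools

# Pairwise factor tables on the ring A-B, B-C, C-D, A-D (standard values for this example).
phi = {
    ("A", "B"): {(0, 0): 30.0, (0, 1): 5.0, (1, 0): 1.0, (1, 1): 10.0},
    ("B", "C"): {(0, 0): 100.0, (0, 1): 1.0, (1, 0): 1.0, (1, 1): 100.0},
    ("C", "D"): {(0, 0): 1.0, (0, 1): 100.0, (1, 0): 100.0, (1, 1): 1.0},
    ("A", "D"): {(0, 0): 100.0, (0, 1): 1.0, (1, 0): 1.0, (1, 1): 100.0},
}

def score(a, b, c, d):
    # Unnormalized score: product of the four pairwise factors.
    vals = {"A": a, "B": b, "C": c, "D": d}
    s = 1.0
    for (u, v), table in phi.items():
        s *= table[(vals[u], vals[v])]
    return s

Z = sum(score(*assign) for assign in itertools.product([0, 1], repeat=4))
print(Z)                        # 7201840.0
print(score(0, 1, 1, 0) / Z)    # ~0.69
print(score(1, 1, 0, 0) / Z)    # ~1.4e-06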

31 Independence and Structure There's a lot of theory about how BNs and MNs encode conditional independence assumptions. BNs: a variable X is independent of its non-descendants given its parents. MNs: conditional independence derived from Markov blanket and separation properties. Local configurations can be used to check all conditional independence questions; almost no need to look at the values in the factors!

32 Independence Spectrum At one extreme, full independence assumptions: P(x) = ∏_i ϕ_i(x_i). At the other extreme, a single factor ϕ(x): everything is dependent. Various graphs lie in between.

33 Products of Factors Given two factors with different scopes, we can calculate a new factor equal to their product: ϕ_product(x, y) = ϕ_1(x) ϕ_2(y).

34 Products of Factors Given two factors with different scopes, we can calculate a new factor equal to their product. Example: ϕ_1(A, B) × ϕ_2(B, C) = ϕ_3(A, B, C). (Table values not recovered.)
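A factor product is easy to implement over tables: the scope of the result is the union of the two scopes, and each entry is the product of the matching entries. A Python sketch (the factor values are illustrative, not the slide's):

import itertools

def factor_product(scope1, table1, scope2, table2, domain=(0, 1)):
    # Result scope: union of the two scopes, keeping order of first appearance.
    scope = list(scope1) + [v for v in scope2 if v not in scope1]
    table = {}
    for vals in itertools.product(domain, repeat=len(scope)):
        assign = dict(zip(scope, vals))
        key1 = tuple(assign[v] for v in scope1)
        key2 = tuple(assign[v] for v in scope2)
        table[vals] = table1[key1] * table2[key2]
    return scope, table

# phi1(A, B) * phi2(B, C) = phi3(A, B, C)
phi1 = {(0, 0): 0.5, (0, 1): 0.8, (1, 0): 0.1, (1, 1): 0.0}
phi2 = {(0, 0): 0.5, (0, 1): 0.7, (1, 0): 0.1, (1, 1): 0.2}
scope3, phi3 = factor_product(("A", "B"), phi1, ("B", "C"), phi2)
print(scope3, phi3[(0, 1, 0)])  # ['A', 'B', 'C'] 0.08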

35 Factor Maximization Given X and Y (Y ∉ X), we can turn a factor ϕ(X, Y) into a factor ψ(X) via maximization: ψ(X) = max_Y ϕ(X, Y). We can refer to this new factor as max_Y ϕ.

36 Factor Maximization Example: maximizing out B from ϕ(A, B, C) gives ψ(A, C) = max_B ϕ(A, B, C). (Table values not recovered.)

37 Factor Marginalization Given X and Y (Y ∉ X), we can turn a factor ϕ(X, Y) into a factor ψ(X) via marginalization: ψ(X) = Σ_{y ∈ Val(Y)} ϕ(X, y).

38 Factor Marginalization Example: summing out B from ϕ(A, B, C) gives ψ(A, C) = Σ_b ϕ(A, b, C). (Table values not recovered.)

39 Factor Marginalization Example: summing out C from ϕ(A, B, C) gives ψ(A, B) = Σ_c ϕ(A, B, c). (Table values not recovered.)

40 Factor Marginalization We can refer to the new factor ψ(X) = Σ_{y ∈ Val(Y)} ϕ(X, y) as Σ_Y ϕ.
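Summing out and maxing out have the same table-level implementation; they differ only in whether entries that agree on the remaining variables are combined with + or max. A sketch, assuming factors are stored as dicts over value tuples (values made up):

def eliminate_var(scope, table, var, combine):
    # Remove `var` from the factor by combining entries (sum or max) over its values.
    idx = scope.index(var)
    new_scope = [v for v in scope if v != var]
    new_table = {}
    for vals, score in table.items():
        key = vals[:idx] + vals[idx + 1:]
        new_table[key] = combine(new_table.get(key), score)
    return new_scope, new_table

def add(old, x):
    return x if old is None else old + x

def take_max(old, x):
    return x if old is None else max(old, x)

# An illustrative factor phi(A, B, C) over binary variables.
phi = {(a, b, c): (a + 1) * (b + 2) * (c + 3)
       for a in (0, 1) for b in (0, 1) for c in (0, 1)}
print(eliminate_var(["A", "B", "C"], phi, "B", add))       # psi(A, C) = sum_B phi
print(eliminate_var(["A", "B", "C"], phi, "B", take_max))  # psi(A, C) = max_B phi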

41 Marginalizing Everything? Take a factor graph's "everything factor" by multiplying all of its factors. Sum out all the variables (one by one). What do you get?

42 Factors Are Like Numbers Products are commutative: ϕ_1 ϕ_2 = ϕ_2 ϕ_1. Products are associative: (ϕ_1 ϕ_2) ϕ_3 = ϕ_1 (ϕ_2 ϕ_3). Sums are commutative: Σ_X Σ_Y ϕ = Σ_Y Σ_X ϕ (max, too). Distributivity of multiplication over marginalization and maximization: if X ∉ Scope(ϕ_1), then Σ_X (ϕ_1 ϕ_2) = ϕ_1 Σ_X ϕ_2 and max_X (ϕ_1 ϕ_2) = ϕ_1 max_X ϕ_2.
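These identities are easy to sanity-check numerically. A tiny sketch with made-up tables, where ϕ_1 has scope {A} and ϕ_2 has scope {A, B}:

# phi1 has scope {A}, phi2 has scope {A, B}; values are made up, variables binary.
phi1 = {0: 2.0, 1: 5.0}
phi2 = {(0, 0): 0.3, (0, 1): 0.7, (1, 0): 0.4, (1, 1): 0.6}

for a in (0, 1):
    lhs = sum(phi1[a] * phi2[(a, b)] for b in (0, 1))    # sum_B (phi1 * phi2)
    rhs = phi1[a] * sum(phi2[(a, b)] for b in (0, 1))    # phi1 * (sum_B phi2)
    assert abs(lhs - rhs) < 1e-12
print("distributivity holds on this example")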

43 Inference

44 Querying the Model Inference (e.g., do you have allergies or the flu?). What's the best explanation for your symptoms? Active data collection (what is the next best r.v. to observe?). (Example graph: Flu, All., S.I., R.N., H. with factors ϕ_F, ϕ_A, ϕ_FAS, ϕ_SR, ϕ_SH.)

45 A Bigger Example: Your Car The car doesn't start. What do we conclude about the battery age? 18 random variables; 2^18 possible scenarios.

46 Inference: A Ubiquitous Obstacle Decoding is inference (lecture 2). Learning is inference (lectures 3 and 4). Exact inference is #P-complete. Even approximations within a given absolute or relative error are hard.

47 Inference Problems Given values for some random variables (X ⊆ V): Most probable explanation: what are the most probable values of the rest of the r.v.s, V \ X? (More generally:) Maximum a posteriori (MAP): what are the most probable values of some other r.v.s, Y ⊆ (V \ X)? Random sampling from the posterior over values of Y. Full posterior over values of Y. Marginal probabilities from the posterior over Y. Minimum Bayes risk: what is the Y with the lowest expected cost? Cost-augmented decoding: what is the most dangerous Y?

48 Approaches to Inference (taxonomy diagram) Exact: variable elimination, dynamic programming, ILP. Approximate: randomized methods (MCMC, Gibbs sampling, importance sampling, randomized search, simulated annealing) and others (loopy belief propagation, LP relaxation, local search, mean field, dual decomposition, beam search). The diagram also marks which methods are covered today vs. tomorrow, and which are hard-inference methods, soft-inference methods, or both.

49 Exact Marginal Inference for Y This will be a generalization of algorithms you may already have seen: the forward and backward algorithms. The general name is variable elimination. After we see it for the marginal, we'll see how to use it for the MAP.

50 Inference Example Goal: P(D). Chain-structured BN A → B → C → D with P(A) = ϕ_A(A), P(B | A) = ϕ_AB(A, B), P(C | B) = ϕ_BC(B, C), P(D | C) = ϕ_CD(C, D). (Table values not recovered.)

51 Inference Example Let's calculate P(B) first. (Chain A → B → C → D.)

52 Inference Example Let's calculate P(B) first: P(B) = Σ_{a ∈ Val(A)} P(A = a) P(B | A = a), using ϕ_A(A) = P(A) and ϕ_AB(A, B) = P(B | A).

53 Inference Example Let's calculate P(B) first: P(B) = Σ_{a ∈ Val(A)} P(A = a) P(B | A = a). Note: C and D don't matter.

54 Inference Example P(B) = Σ_{a ∈ Val(A)} P(A = a) P(B | A = a) defines a new factor ϕ_B(B) = P(B). (Table values not recovered.)

55 Inference Example New model in which A is eliminated; defines P(B, C, D). Chain B → C → D with ϕ_B(B) = P(B), ϕ_BC(B, C) = P(C | B), ϕ_CD(C, D) = P(D | C).

56 Inference Example Same thing to eliminate B: P(C) = Σ_{b ∈ Val(B)} P(B = b) P(C | B = b), giving ϕ_C(C) = P(C).

57 Inference Example New model in which B is eliminated; defines P(C, D). Chain C → D with ϕ_C(C) = P(C) and ϕ_CD(C, D) = P(D | C).

58 Simple Inference Example Last step to get P(D): P(D) = Σ_{c ∈ Val(C)} P(C = c) P(D | C = c), giving ϕ_D(D) = P(D).

59 Simple Inference Example Notice that the same step happened for each random variable: we created a new factor over the variable and its successor, then summed out (marginalized) the variable. P(D) = Σ_{a ∈ Val(A)} Σ_{b ∈ Val(B)} Σ_{c ∈ Val(C)} P(A = a) P(B = b | A = a) P(C = c | B = b) P(D | C = c) = Σ_{c ∈ Val(C)} P(D | C = c) Σ_{b ∈ Val(B)} P(C = c | B = b) Σ_{a ∈ Val(A)} P(A = a) P(B = b | A = a).

60 That Was Variable Elimination We reused work from previous steps and avoided doing the same work more than once. Dynamic programming à la the forward algorithm! We exploited the graph structure (each subexpression only depends on a small number of variables). Exponential blowup avoided!
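The chain computation above fits in a few lines once each elimination step is written as a small sum. A Python sketch with made-up CPT values (the slides' actual numbers are not recovered):

# P(D) = sum_c P(D|c) sum_b P(C=c|b) sum_a P(a) P(B=b|a), computed inside-out so that
# no sum ever ranges over more than one variable at a time.
p_A = {0: 0.6, 1: 0.4}
p_B_given_A = {(0, 0): 0.7, (1, 0): 0.3, (0, 1): 0.2, (1, 1): 0.8}   # key: (b, a)
p_C_given_B = {(0, 0): 0.5, (1, 0): 0.5, (0, 1): 0.1, (1, 1): 0.9}   # key: (c, b)
p_D_given_C = {(0, 0): 0.9, (1, 0): 0.1, (0, 1): 0.4, (1, 1): 0.6}   # key: (d, c)

# Eliminate A, then B, then C: each step produces a one-variable factor.
p_B = {b: sum(p_A[a] * p_B_given_A[(b, a)] for a in (0, 1)) for b in (0, 1)}
p_C = {c: sum(p_B[b] * p_C_given_B[(c, b)] for b in (0, 1)) for c in (0, 1)}
p_D = {d: sum(p_C[c] * p_D_given_C[(d, c)] for c in (0, 1)) for d in (0, 1)}
print(p_D)  # sums to 1.0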

61 What Remains Variable elimination in general. The max-product version (for MAP inference). A bit about approximate inference.

62 Eliminating One Variable Input: set of factors Φ, variable Z to eliminate. Output: new set of factors Ψ. 1. Let Φ' = {ϕ ∈ Φ : Z ∈ Scope(ϕ)}. 2. Let Ψ = {ϕ ∈ Φ : Z ∉ Scope(ϕ)}. 3. Let ψ be Σ_Z ∏_{ϕ ∈ Φ'} ϕ. 4. Return Ψ ∪ {ψ}.
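A direct transcription of this procedure, under the factor-as-table representation used in the sketches above (variable domains are assumed binary for simplicity):

import itertools

def eliminate_one(factors, Z, domain=(0, 1)):
    # factors: list of (scope, table) pairs; Z: name of the variable to eliminate.
    relevant = [f for f in factors if Z in f[0]]        # step 1: Phi'
    rest = [f for f in factors if Z not in f[0]]        # step 2: Psi
    scope = sorted({v for s, _ in relevant for v in s}) # scope of the product of Phi'
    new_scope = [v for v in scope if v != Z]
    psi = {}
    for vals in itertools.product(domain, repeat=len(scope)):
        assign = dict(zip(scope, vals))
        prod = 1.0
        for s, t in relevant:
            prod *= t[tuple(assign[v] for v in s)]
        key = tuple(assign[v] for v in new_scope)
        psi[key] = psi.get(key, 0.0) + prod             # step 3: sum out Z
    return rest + [(new_scope, psi)]                    # step 4: Psi plus the new factor

# Eliminating A from { phi_A(A), phi_AB(A, B) } leaves one factor over B (values made up).
factors = [(("A",), {(0,): 0.6, (1,): 0.4}),
           (("A", "B"), {(0, 0): 0.7, (0, 1): 0.3, (1, 0): 0.2, (1, 1): 0.8})]
print(eliminate_one(factors, "A"))   # [(['B'], {(0,): 0.5, (1,): 0.5})]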

63 Example Query: P(Flu | runny nose). (Graph: Flu, All., S.I., R.N., H. with factors ϕ_F, ϕ_A, ϕ_FAS, ϕ_SR, ϕ_SH.) Let's eliminate H.

64 Example Query: P(Flu | runny nose). Let's eliminate H. 1. Φ' = {ϕ_SH}. 2. Ψ = {ϕ_F, ϕ_A, ϕ_FAS, ϕ_SR}. 3. ψ = Σ_H ∏_{ϕ ∈ Φ'} ϕ. 4. Return Ψ ∪ {ψ}.

65 Example Query: P(Flu | runny nose). Same as the previous slide, with step 3 made concrete: ψ = Σ_H ϕ_SH.

66 Example Query: P(Flu | runny nose). The new factor is ψ(S) = Σ_H ϕ_SH(S, H). (Table values not recovered.)

67 Example Query: P(Flu | runny nose). The new factor ψ(S) replaces ϕ_SH, and H no longer appears in the graph.

68 Example Query: P(Flu | runny nose). We can actually ignore the new factor ψ(S), equivalently just deleting H! Why? In some cases eliminating a variable is really easy!

69 Example Query: P(Flu | runny nose). H is already eliminated. Let's now eliminate S.

70 Example Query: P(Flu | runny nose). Eliminating S: 1. Φ' = {ϕ_SR, ϕ_FAS}. 2. Ψ = {ϕ_F, ϕ_A}. 3. ψ_FAR = Σ_S ∏_{ϕ ∈ Φ'} ϕ. 4. Return Ψ ∪ {ψ_FAR}.

71 Example Query: P(Flu | runny nose). Same as the previous slide, with step 3 made concrete: ψ_FAR = Σ_S ϕ_SR ϕ_FAS.

72 Example Query: P(Flu | runny nose). After eliminating S, the graph contains Flu, All., and R.N., joined by the new factor ψ_FAR (plus ϕ_F and ϕ_A).

73 Example Query: P(Flu | runny nose). Finally, eliminate A.

74 Example Query: P(Flu | runny nose). Eliminating A: 1. Φ' = {ϕ_A, ψ_FAR}. 2. Ψ = {ϕ_F}. 3. ψ_FR = Σ_A ϕ_A ψ_FAR. 4. Return Ψ ∪ {ψ_FR}.

75 Example Query: P(Flu | runny nose). After eliminating A, only Flu and R.N. remain, with factors ϕ_F and ψ_FR.

76 Chain, Again Goal: P(D). Chain A → B → C → D with P(A) = ϕ_A(A), P(B | A) = ϕ_AB(A, B), P(C | B) = ϕ_BC(B, C), P(D | C) = ϕ_CD(C, D). Earlier, we eliminated A, then B, then C.

77 Chain, Again Goal: P(D). Earlier, we eliminated A, then B, then C. Let's start with C instead.


79 Chain, Again Eliminating C uses ϕ_BC(B, C) and ϕ_CD(C, D); their product is ϕ_BCD(B, C, D). (Table values not recovered.)

80 Chain, Again Eliminating C: ψ(B, D) = Σ_C ϕ_BCD(B, C, D). (Table values not recovered.)

81 Chain, Again Eliminating B will be similarly complex: it involves ϕ_A(A), ϕ_AB(A, B), and ψ(B, D).

82 Variable Elimination Comments Can prune away all non-ancestors of the query variables. Ordering makes a difference!

83 What about Evidence? So far, we've just considered the posterior/marginal P(Y). Next: P(Y | X = x). It's almost the same: the additional step is to reduce factors to respect the evidence.

84 Example Query: P(Flu | runny nose). Let's reduce to R = true (runny nose): the rows of ϕ_SR(S, R) with R = true become a new factor ϕ'_S(S). (Table values not recovered.)

85 Example Query: P(Flu | runny nose). After reducing to R = true, the factors are ϕ_F, ϕ_A, ϕ_FAS, ϕ'_S, ϕ_SH, and R.N. no longer appears in the graph.
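Reducing a factor by evidence just keeps the rows consistent with the observed value and drops the observed variable from the scope. A Python sketch, continuing the factor-as-table representation (the values used for ϕ_SR in the demonstration are made up):

def reduce_by_evidence(factors, evidence):
    # Keep only rows consistent with the evidence, then drop the observed variables.
    reduced = []
    for scope, table in factors:
        new_scope = [v for v in scope if v not in evidence]
        new_table = {}
        for vals, score in table.items():
            assign = dict(zip(scope, vals))
            if all(assign[v] == x for v, x in evidence.items() if v in scope):
                new_table[tuple(assign[v] for v in new_scope)] = score
        reduced.append((new_scope, new_table))
    return reduced

# phi_SR(S, R) reduced by R = 1 becomes phi'_S(S).
phi_SR = (("S", "R"), {(0, 0): 0.9, (0, 1): 0.1, (1, 0): 0.2, (1, 1): 0.8})
print(reduce_by_evidence([phi_SR], {"R": 1}))   # [(['S'], {(0,): 0.1, (1,): 0.8})]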

86 Example Query: P(Flu | runny nose). Now run variable elimination all the way down to one factor (for F). Eliminate H.

87 Example Query: P(Flu | runny nose). Now run variable elimination all the way down to one factor (for F). Eliminate S.

88 Example Query: P(Flu | runny nose). Eliminating S leaves Flu and All. with the new factor ψ_FA. Eliminate A.

89 Example Query: P(Flu | runny nose). Only Flu remains, with factors ϕ_F and ψ_F. Take the final product.

90 Example Query: P(Flu | runny nose). The result of running variable elimination all the way down is a single factor over Flu: ϕ_F ψ_F.

91 Comments Runtime depends on the size of the intermediate factors. Hence, variable ordering matters a lot. But it's NP-hard to find the best one. For MNs, chordal graphs permit inference linear in the size of the original factors. For BNs, polytree structures do the same. If you can avoid big intermediate factors, you can make inference linear in the size of the original factors.

92 Variable Elimination for P(Y | X = x) Input: graphical model on V, set of query variables Y, evidence X = x. Output: factor ϕ and scalar α. 1. Φ = factors in the model. 2. Reduce factors in Φ by X = x. 3. Choose a variable ordering π on Z = V \ Y \ X. 4. ϕ = Variable-Elimination(Φ, Z, π). 5. α = Σ_{y ∈ Val(Y)} ϕ(y). 6. Return ϕ, α.
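Putting the pieces together, a sketch of this procedure in Python, reusing eliminate_one and reduce_by_evidence from the earlier sketches (the factor representation and the caller-supplied ordering are assumptions, not the slides'):

import itertools

def query(factors, query_vars, evidence, ordering, domain=(0, 1)):
    # Steps 1-2: start from the model's factors, reduced by the evidence X = x.
    factors = reduce_by_evidence(factors, evidence)
    # Steps 3-4: eliminate the non-query, non-evidence variables in the given order.
    for Z in ordering:
        factors = eliminate_one(factors, Z)
    # Multiply what is left into a single factor phi over the query variables.
    phi = {}
    for vals in itertools.product(domain, repeat=len(query_vars)):
        assign = dict(zip(query_vars, vals))
        prod = 1.0
        for scope, table in factors:
            prod *= table[tuple(assign[v] for v in scope)]
        phi[vals] = prod
    # Step 5: the normalizer; P(query_vars = vals | evidence) = phi[vals] / alpha.
    alpha = sum(phi.values())
    return phi, alpha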

93 Getting Back to NLP Structured NLP models were chosen with these properties in mind: HMMs, PCFGs (with a little work). But not: IBM Model 3. To decode, we need MAP inference. When models get complicated, we need approximations!

94 From Marginals to MAP Replace factor marginalization steps with maximization. Add bookkeeping to keep track of the maximizing values. Add a traceback at the end to recover the solution. This is analogous to the connection between the forward algorithm and the Viterbi algorithm. The ordering challenge is the same.

95 Variable Elimination (Max-Product Version with Decoding) Input: set of factors Φ, ordered list of variables Z to eliminate. Output: new factor and decoded assignment. 1. For each Z_i ∈ Z (in order): let (Φ, ψ_{Z_i}) = Eliminate-One(Φ, Z_i). 2. Return ∏_{ϕ ∈ Φ} ϕ and Traceback({ψ_{Z_i}}).

96 Eliminating One Variable (Max-Product Version with Bookkeeping) Input: set of factors Φ, variable Z to eliminate. Output: new set of factors Ψ and bookkeeping factor ψ. 1. Let Φ' = {ϕ ∈ Φ : Z ∈ Scope(ϕ)}. 2. Let Ψ = {ϕ ∈ Φ : Z ∉ Scope(ϕ)}. 3. Let τ be max_Z ∏_{ϕ ∈ Φ'} ϕ, and let ψ be ∏_{ϕ ∈ Φ'} ϕ (bookkeeping). 4. Return Ψ ∪ {τ}, ψ.

97 Traceback Input: sequence of factors with associated variables (ψ_{Z_1}, ..., ψ_{Z_k}). Output: z*. Each ψ_{Z_i} is a factor with scope including Z_i and variables eliminated after Z_i. Work backwards from i = k to 1: let z*_i = arg max_z ψ_{Z_i}(z, z*_{i+1}, z*_{i+2}, ..., z*_k). Return z*.
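A sketch of the max-product version with bookkeeping and traceback, again over factors stored as (scope, table) pairs with binary variables; the factor values in the usage example are made up:

import itertools

def eliminate_one_max(factors, Z, domain=(0, 1)):
    # Max-product analogue of Eliminate-One: returns the new factor set and the bookkeeping factor psi.
    relevant = [f for f in factors if Z in f[0]]
    rest = [f for f in factors if Z not in f[0]]
    scope = sorted({v for s, _ in relevant for v in s})
    psi = {}
    for vals in itertools.product(domain, repeat=len(scope)):
        assign = dict(zip(scope, vals))
        prod = 1.0
        for s, t in relevant:
            prod *= t[tuple(assign[v] for v in s)]
        psi[vals] = prod
    new_scope = [v for v in scope if v != Z]
    tau = {}
    for vals, score in psi.items():
        key = tuple(v for v, name in zip(vals, scope) if name != Z)
        tau[key] = max(tau.get(key, 0.0), score)
    return rest + [(new_scope, tau)], (scope, psi)

def max_decode(factors, ordering, domain=(0, 1)):
    bookkeeping = []
    for Z in ordering:
        factors, psi = eliminate_one_max(factors, Z)
        bookkeeping.append((Z, psi))
    # Traceback: fix the last-eliminated variable first, then work backwards.
    best = {}
    for Z, (scope, psi) in reversed(bookkeeping):
        best[Z] = max(domain, key=lambda v: psi[tuple(best.get(name, v) for name in scope)])
    return best

factors = [(("A",), {(0,): 0.3, (1,): 0.7}),
           (("A", "B"), {(0, 0): 0.9, (0, 1): 0.1, (1, 0): 0.4, (1, 1): 0.6})]
print(max_decode(factors, ["A", "B"]))   # A = 1, B = 1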

98 About the Traceback No extra expense: a linear traversal over the intermediate factors. The factor operations used by both sum-product VE and max-product VE can be generalized. Example: get the K most likely assignments.

99 Variable Elimination Tips Any ordering will be correct. Most orderings will be too expensive. There are heuristics for choosing an ordering. If the graph is chain-like, work from one end toward the other.

100 (Rocket Science: True MAP) Evidence: X = x. Query: Y. Other variables: Z = V \ X \ Y. y* = arg max_{y ∈ Val(Y)} P(Y = y | X = x) = arg max_{y ∈ Val(Y)} Σ_{z ∈ Val(Z)} P(Y = y, Z = z | X = x). First marginalize out Z, then do MAP inference over Y given X = x. This is not usually attempted in NLP, with some exceptions.

101 Parting Shots You will probably never implement the general variable elimination algorithm. You will rarely use exact inference. Understand what the inference problem would look like in exact form; then approximate. Sometimes you get lucky. You'll appreciate the approximations better as they come along.
