Graphical Models and Kernel Methods


1 Graphical Models and Kernel Methods Jerry Zhu, Department of Computer Sciences, University of Wisconsin-Madison, USA. MLSS, June 17. 1 / 123

2 Outline Graphical Models Probabilistic Inference Directed vs. Undirected Graphical Models Inference Parameter Estimation Kernel Methods Support Vector Machines Kernel PCA Reproducing Kernel Hilbert Spaces 2 / 123

3 Outline Graphical Models Probabilistic Inference Directed vs. Undirected Graphical Models Inference Parameter Estimation Kernel Methods Support Vector Machines Kernel PCA Reproducing Kernel Hilbert Spaces 3 / 123

4 Outline Graphical Models Probabilistic Inference Directed vs. Undirected Graphical Models Inference Parameter Estimation Kernel Methods Support Vector Machines Kernel PCA Reproducing Kernel Hilbert Spaces 4 / 123

5 The envelope quiz red ball = $$$ 5 / 123

6 The envelope quiz red ball = $$$ You randomly picked an envelope, randomly took out a ball and it was black 5 / 123

7 The envelope quiz red ball = $$$ You randomly picked an envelope, randomly took out a ball and it was black Should you choose this envelope or the other envelope? 5 / 123

8 The envelope quiz Probabilistic inference 6 / 123

9 The envelope quiz Probabilistic inference Joint distribution on E ∈ {1, 0}, B ∈ {r, b}: P(E, B) = P(E) P(B | E) 6 / 123

10 The envelope quiz Probabilistic inference Joint distribution on E ∈ {1, 0}, B ∈ {r, b}: P(E, B) = P(E) P(B | E) P(E = 1) = P(E = 0) = 1/2 6 / 123

11 The envelope quiz Probabilistic inference Joint distribution on E ∈ {1, 0}, B ∈ {r, b}: P(E, B) = P(E) P(B | E) P(E = 1) = P(E = 0) = 1/2 P(B = r | E = 1) = 1/2, P(B = r | E = 0) = 0 6 / 123

12 The envelope quiz Probabilistic inference Joint distribution on E ∈ {1, 0}, B ∈ {r, b}: P(E, B) = P(E) P(B | E) P(E = 1) = P(E = 0) = 1/2 P(B = r | E = 1) = 1/2, P(B = r | E = 0) = 0 The graphical model: E → B 6 / 123

13 The envelope quiz Probabilistic inference Joint distribution on E ∈ {1, 0}, B ∈ {r, b}: P(E, B) = P(E) P(B | E) P(E = 1) = P(E = 0) = 1/2 P(B = r | E = 1) = 1/2, P(B = r | E = 0) = 0 The graphical model: E → B Statistical decision theory: switch if P(E = 1 | B = b) < 1/2 6 / 123

14 The envelope quiz Probabilistic inference Joint distribution on E ∈ {1, 0}, B ∈ {r, b}: P(E, B) = P(E) P(B | E) P(E = 1) = P(E = 0) = 1/2 P(B = r | E = 1) = 1/2, P(B = r | E = 0) = 0 The graphical model: E → B Statistical decision theory: switch if P(E = 1 | B = b) < 1/2 P(E = 1 | B = b) = P(B = b | E = 1) P(E = 1) / P(B = b) = (1/2 · 1/2) / (3/4) = 1/3. Switch. 6 / 123
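
A minimal sketch (Python, not from the slides) of the same inference done by brute-force enumeration over the joint P(E, B); the ball contents per envelope are chosen to match the CPDs above:

```python
# Hypothetical sketch: enumerate the joint P(E, B) = P(E) P(B | E) and condition on B = "b".
P_E = {1: 0.5, 0: 0.5}                      # prior over envelopes
P_B_given_E = {1: {"r": 0.5, "b": 0.5},     # envelope 1: one red ball, one black ball
               0: {"r": 0.0, "b": 1.0}}     # envelope 0: two black balls

joint = {(e, b): P_E[e] * P_B_given_E[e][b] for e in P_E for b in ("r", "b")}

p_b = sum(p for (e, b), p in joint.items() if b == "b")   # P(B = b) = 3/4
p_e1_given_b = joint[(1, "b")] / p_b                      # P(E = 1 | B = b) = 1/3
print(p_e1_given_b)   # 0.333... < 1/2, so switching is the better decision
```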

15 Reasoning with uncertainty The world is reduced to a set of random variables x_1, ..., x_d 7 / 123

16 Reasoning with uncertainty The world is reduced to a set of random variables x_1, ..., x_d e.g. (x_1, ..., x_{d−1}) a feature vector, x_d ≡ y the class label 7 / 123

17 Reasoning with uncertainty The world is reduced to a set of random variables x_1, ..., x_d e.g. (x_1, ..., x_{d−1}) a feature vector, x_d ≡ y the class label Inference: given the joint distribution p(x_1, ..., x_d), compute p(X_Q | X_E) where X_Q ∪ X_E ⊆ {x_1, ..., x_d} 7 / 123

18 Reasoning with uncertainty The world is reduced to a set of random variables x_1, ..., x_d e.g. (x_1, ..., x_{d−1}) a feature vector, x_d ≡ y the class label Inference: given the joint distribution p(x_1, ..., x_d), compute p(X_Q | X_E) where X_Q ∪ X_E ⊆ {x_1, ..., x_d} e.g. Q = {d}, E = {1, ..., d−1}; by the definition of conditional probability, p(x_d | x_1, ..., x_{d−1}) = p(x_1, ..., x_{d−1}, x_d) / Σ_v p(x_1, ..., x_{d−1}, x_d = v) 7 / 123

19 Reasoning with uncertainty The world is reduced to a set of random variables x_1, ..., x_d e.g. (x_1, ..., x_{d−1}) a feature vector, x_d ≡ y the class label Inference: given the joint distribution p(x_1, ..., x_d), compute p(X_Q | X_E) where X_Q ∪ X_E ⊆ {x_1, ..., x_d} e.g. Q = {d}, E = {1, ..., d−1}; by the definition of conditional probability, p(x_d | x_1, ..., x_{d−1}) = p(x_1, ..., x_{d−1}, x_d) / Σ_v p(x_1, ..., x_{d−1}, x_d = v) Learning: estimate p(x_1, ..., x_d) from training data X^(1), ..., X^(N), where X^(i) = (x_1^(i), ..., x_d^(i)) 7 / 123

20 It is difficult to reason with uncertainty joint distribution p(x_1, ..., x_d) 8 / 123

21 It is difficult to reason with uncertainty joint distribution p(x_1, ..., x_d) exponential naïve storage (2^d for binary r.v.) 8 / 123

22 It is difficult to reason with uncertainty joint distribution p(x_1, ..., x_d) exponential naïve storage (2^d for binary r.v.) hard to interpret (conditional independence) 8 / 123

23 It is difficult to reason with uncertainty joint distribution p(x_1, ..., x_d) exponential naïve storage (2^d for binary r.v.) hard to interpret (conditional independence) inference p(X_Q | X_E) 8 / 123

24 It is difficult to reason with uncertainty joint distribution p(x_1, ..., x_d) exponential naïve storage (2^d for binary r.v.) hard to interpret (conditional independence) inference p(X_Q | X_E) Often can't afford to do it by brute force 8 / 123

25 It is difficult to reason with uncertainty joint distribution p(x_1, ..., x_d) exponential naïve storage (2^d for binary r.v.) hard to interpret (conditional independence) inference p(X_Q | X_E) Often can't afford to do it by brute force If p(x_1, ..., x_d) not given, estimate it from data 8 / 123

26 It is difficult to reason with uncertainty joint distribution p(x_1, ..., x_d) exponential naïve storage (2^d for binary r.v.) hard to interpret (conditional independence) inference p(X_Q | X_E) Often can't afford to do it by brute force If p(x_1, ..., x_d) not given, estimate it from data Often can't afford to do it by brute force 8 / 123

27 It is difficult to reason with uncertainty joint distribution p(x_1, ..., x_d) exponential naïve storage (2^d for binary r.v.) hard to interpret (conditional independence) inference p(X_Q | X_E) Often can't afford to do it by brute force If p(x_1, ..., x_d) not given, estimate it from data Often can't afford to do it by brute force Graphical model: efficient representation, inference, and learning on p(x_1, ..., x_d), exactly or approximately 8 / 123

28 What are graphical models? Graphical model = joint distribution p(x 1,..., x d ) 9 / 123

29 What are graphical models? Graphical model = joint distribution p(x 1,..., x d ) Bayesian network or Markov random field 9 / 123

30 What are graphical models? Graphical model = joint distribution p(x 1,..., x d ) Bayesian network or Markov random field conditional independence 9 / 123

31 What are graphical models? Graphical model = joint distribution p(x 1,..., x d ) Bayesian network or Markov random field conditional independence Inference = p(x Q X E ), in general X Q X E {x 1... x d } 9 / 123

32 What are graphical models? Graphical model = joint distribution p(x 1,..., x d ) Bayesian network or Markov random field conditional independence Inference = p(x Q X E ), in general X Q X E {x 1... x d } exact, MCMC, variational 9 / 123

33 What are graphical models? Graphical model = joint distribution p(x 1,..., x d ) Bayesian network or Markov random field conditional independence Inference = p(x Q X E ), in general X Q X E {x 1... x d } exact, MCMC, variational If p(x 1,..., x d ) not given, estimate it from data 9 / 123

34 What are graphical models? Graphical model = joint distribution p(x 1,..., x d ) Bayesian network or Markov random field conditional independence Inference = p(x Q X E ), in general X Q X E {x 1... x d } exact, MCMC, variational If p(x 1,..., x d ) not given, estimate it from data parameter and structure learning 9 / 123

35 Graphical-Model-Nots Graphical models are the study of probabilistic models 10 / 123

36 Graphical-Model-Nots Graphical models are the study of probabilistic models Just because there are nodes and edges doesn't mean it's a graphical model 10 / 123

37 Graphical-Model-Nots Graphical models are the study of probabilistic models Just because there are nodes and edges doesn't mean it's a graphical model These are not graphical models: neural network, decision tree, network flow, HMM template (but HMMs are!) 10 / 123

38 Outline Graphical Models Probabilistic Inference Directed vs. Undirected Graphical Models Inference Parameter Estimation Kernel Methods Support Vector Machines Kernel PCA Reproducing Kernel Hilbert Spaces 11 / 123

39 Directed graphical models 12 / 123

40 Directed graphical models Also called Bayesian networks 13 / 123

41 Directed graphical models Also called Bayesian networks A directed graph has nodes x_1, ..., x_d, some of them connected by directed edges x_i → x_j 13 / 123

42 Directed graphical models Also called Bayesian networks A directed graph has nodes x_1, ..., x_d, some of them connected by directed edges x_i → x_j A cycle is a directed path x_1 → ... → x_k where x_1 = x_k 13 / 123

43 Directed graphical models Also called Bayesian networks A directed graph has nodes x_1, ..., x_d, some of them connected by directed edges x_i → x_j A cycle is a directed path x_1 → ... → x_k where x_1 = x_k A directed acyclic graph (DAG) contains no cycles 13 / 123

44 Directed graphical models A Bayesian network on the DAG is a family of distributions satisfying { p : p(x_1, ..., x_d) = Π_i p(x_i | Pa(x_i)) }, where Pa(x_i) is the set of parents of x_i. 14 / 123

45 Directed graphical models A Bayesian network on the DAG is a family of distributions satisfying { p : p(x_1, ..., x_d) = Π_i p(x_i | Pa(x_i)) }, where Pa(x_i) is the set of parents of x_i. p(x_i | Pa(x_i)) is the conditional probability distribution (CPD) at x_i 14 / 123

46 Directed graphical models A Bayesian network on the DAG is a family of distributions satisfying { p : p(x_1, ..., x_d) = Π_i p(x_i | Pa(x_i)) }, where Pa(x_i) is the set of parents of x_i. p(x_i | Pa(x_i)) is the conditional probability distribution (CPD) at x_i By specifying the CPDs for all i, we specify a joint distribution p(x_1, ..., x_d) 14 / 123

47 Example: Burglary, Earthquake, Alarm, John and Mary Binary variables; network B → A ← E, A → J, A → M. CPTs: P(B) = 0.001, P(E) = 0.002; P(A | B, E) = 0.95, P(A | B, ~E) = 0.94, P(A | ~B, E) = 0.29, P(A | ~B, ~E) = 0.001; P(J | A) = 0.9, P(J | ~A) = 0.05; P(M | A) = 0.7, P(M | ~A) = 0.01. Example: P(B, ~E, A, J, ~M) = P(B) P(~E) P(A | B, ~E) P(J | A) P(~M | A) = 0.001 · 0.998 · 0.94 · 0.9 · (1 − 0.7) 15 / 123
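
A small sketch (Python, not part of the slides) of this factored joint, using the CPT values shown above; the function name is illustrative only:

```python
# Sketch: P(B, E, A, J, M) = P(B) P(E) P(A|B,E) P(J|A) P(M|A) for the alarm network above.
P_B, P_E = 0.001, 0.002
P_A = {(1, 1): 0.95, (1, 0): 0.94, (0, 1): 0.29, (0, 0): 0.001}   # P(A = 1 | B, E)
P_J = {1: 0.9, 0: 0.05}                                           # P(J = 1 | A)
P_M = {1: 0.7, 0: 0.01}                                           # P(M = 1 | A)

def joint(B, E, A, J, M):
    """Product of the five local CPDs for one binary assignment."""
    p  = P_B if B else 1 - P_B
    p *= P_E if E else 1 - P_E
    p *= P_A[(B, E)] if A else 1 - P_A[(B, E)]
    p *= P_J[A] if J else 1 - P_J[A]
    p *= P_M[A] if M else 1 - P_M[A]
    return p

print(joint(B=1, E=0, A=1, J=1, M=0))   # 0.001 * 0.998 * 0.94 * 0.9 * 0.3
```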

48 Example: Naive Bayes [Figure: y with children x_1, ..., x_d, and the equivalent plate representation] p(y, x_1, ..., x_d) = p(y) Π_{i=1}^d p(x_i | y) 16 / 123

49 Example: Naive Bayes [Figure: y with children x_1, ..., x_d, and the equivalent plate representation] p(y, x_1, ..., x_d) = p(y) Π_{i=1}^d p(x_i | y) Plate representation on the right 16 / 123

50 Example: Naive Bayes [Figure: y with children x_1, ..., x_d, and the equivalent plate representation] p(y, x_1, ..., x_d) = p(y) Π_{i=1}^d p(x_i | y) Plate representation on the right p(y) multinomial 16 / 123

51 Example: Naive Bayes [Figure: y with children x_1, ..., x_d, and the equivalent plate representation] p(y, x_1, ..., x_d) = p(y) Π_{i=1}^d p(x_i | y) Plate representation on the right p(y) multinomial p(x_i | y) depends on the feature type: multinomial (count x_i), Gaussian (continuous x_i), etc. 16 / 123

52 No Causality Whatsoever [Figure: two equivalent BNs, A → B with P(A) = a, P(B | A) = b, P(B | ~A) = c, and B → A with P(B) = ab + (1−a)c, P(A | B) = ab / (ab + (1−a)c), P(A | ~B) = a(1−b) / (1 − ab − (1−a)c)] The two BNs are equivalent in all respects Do not read causality from Bayesian networks 17 / 123

53 No Causality Whatsoever [Figure: the same two equivalent BNs] The two BNs are equivalent in all respects Do not read causality from Bayesian networks They only represent correlation (joint probability distribution) 17 / 123

54 No Causality Whatsoever [Figure: the same two equivalent BNs] The two BNs are equivalent in all respects Do not read causality from Bayesian networks They only represent correlation (joint probability distribution) However, it is perfectly fine to design BNs causally 17 / 123

55 What do we need probabilistic models for? Make predictions. p(y x) plus decision theory 18 / 123

56 What do we need probabilistic models for? Make predictions. p(y x) plus decision theory Interpret models. Very natural to include latent variables 18 / 123

57 Example: Latent Dirichlet Allocation (LDA) [Plate diagram: α → θ → z → w ← φ ← β, with plates over N word positions, D documents, and T topics] A generative model for p(φ, θ, z, w | α, β): For each topic t, φ_t ~ Dirichlet(β). For each document d, θ ~ Dirichlet(α). For each word position in d, topic z ~ Multinomial(θ), word w ~ Multinomial(φ_z). Inference goals: p(z | w, α, β), argmax_{φ,θ} p(φ, θ | w, α, β) 19 / 123
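
A minimal sketch (Python/numpy, not from the slides) of this generative process; the corpus sizes and hyperparameter values are made up for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
T, W, D, N = 3, 20, 5, 50          # topics, vocabulary size, documents, words per doc (made-up)
alpha, beta = 0.1, 0.01

phi = rng.dirichlet(np.full(W, beta), size=T)      # phi_t ~ Dirichlet(beta), one row per topic
docs = []
for d in range(D):
    theta = rng.dirichlet(np.full(T, alpha))       # theta ~ Dirichlet(alpha) for document d
    z = rng.choice(T, size=N, p=theta)             # topic for each word position
    w = np.array([rng.choice(W, p=phi[t]) for t in z])   # word ~ Multinomial(phi_z)
    docs.append((z, w))
```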

58 Conditional Independence Two r.v.s A, B are independent if P(A, B) = P(A)P(B), equivalently P(A | B) = P(A), equivalently P(B | A) = P(B) 20 / 123

59 Conditional Independence Two r.v.s A, B are independent if P(A, B) = P(A)P(B), equivalently P(A | B) = P(A), equivalently P(B | A) = P(B) Two r.v.s A, B are conditionally independent given C if P(A, B | C) = P(A | C)P(B | C), equivalently P(A | B, C) = P(A | C), equivalently P(B | A, C) = P(B | C) 20 / 123

60 Conditional Independence Two r.v.s A, B are independent if P(A, B) = P(A)P(B), equivalently P(A | B) = P(A), equivalently P(B | A) = P(B) Two r.v.s A, B are conditionally independent given C if P(A, B | C) = P(A | C)P(B | C), equivalently P(A | B, C) = P(A | C), equivalently P(B | A, C) = P(B | C) This extends to groups of r.v.s 20 / 123

61 Conditional Independence Two r.v.s A, B are independent if P(A, B) = P(A)P(B), equivalently P(A | B) = P(A), equivalently P(B | A) = P(B) Two r.v.s A, B are conditionally independent given C if P(A, B | C) = P(A | C)P(B | C), equivalently P(A | B, C) = P(A | C), equivalently P(B | A, C) = P(B | C) This extends to groups of r.v.s Conditional independence in a BN is precisely specified by d-separation ("directed separation") 20 / 123

62 d-separation Case 1: Tail-to-Tail [Figure: A ← C → B, with C unobserved and with C observed] A, B in general dependent 21 / 123

63 d-separation Case 1: Tail-to-Tail [Figure: A ← C → B, with C unobserved and with C observed] A, B in general dependent A, B conditionally independent given C (observed nodes are shaded) 21 / 123

64 d-separation Case 1: Tail-to-Tail [Figure: A ← C → B, with C unobserved and with C observed] A, B in general dependent A, B conditionally independent given C (observed nodes are shaded) An observed C is a tail-to-tail node; it blocks the undirected path A-B 21 / 123

65 d-separation Case 2: Head-to-Tail [Figure: A → C → B, with C unobserved and with C observed] A, B in general dependent 22 / 123

66 d-separation Case 2: Head-to-Tail [Figure: A → C → B, with C unobserved and with C observed] A, B in general dependent A, B conditionally independent given C 22 / 123

67 d-separation Case 2: Head-to-Tail [Figure: A → C → B, with C unobserved and with C observed] A, B in general dependent A, B conditionally independent given C An observed C is a head-to-tail node; it blocks the path A-B 22 / 123

68 d-separation Case 3: Head-to-Head [Figure: A → C ← B, with C unobserved and with C observed] A, B in general independent 23 / 123

69 d-separation Case 3: Head-to-Head [Figure: A → C ← B, with C unobserved and with C observed] A, B in general independent A, B conditionally dependent given C, or given any of C's descendants 23 / 123

70 d-separation Case 3: Head-to-Head [Figure: A → C ← B, with C unobserved and with C observed] A, B in general independent A, B conditionally dependent given C, or given any of C's descendants An observed C is a head-to-head node; it unblocks the path A-B 23 / 123

71 d-separation Variable groups A and B are conditionally independent given C, if all undirected paths from nodes in A to nodes in B are blocked 24 / 123

72 d-separation Example 1 The undirected path from A to B is unblocked by E (because of C), and is not blocked by F [Figure: a graph on nodes A, F, E, B with E's descendant C observed] 25 / 123

73 d-separation Example 1 The undirected path from A to B is unblocked by E (because of C), and is not blocked by F A, B dependent given C [Figure: a graph on nodes A, F, E, B with E's descendant C observed] 25 / 123

74 d-separation Example 2 The path from A to B is blocked both at E and F [Figure: the same graph, now with F observed instead of C] 26 / 123

75 d-separation Example 2 The path from A to B is blocked both at E and F A, B conditionally independent given F [Figure: the same graph, now with F observed instead of C] 26 / 123

76 Undirected graphical models 27 / 123

77 Undirected graphical models Also known as Markov Random Fields 28 / 123

78 Undirected graphical models Also known as Markov Random Fields Recall directed graphical models require a DAG and locally normalized CPDs 28 / 123

79 Undirected graphical models Also known as Markov Random Fields Recall directed graphical models require a DAG and locally normalized CPDs efficient computation 28 / 123

80 Undirected graphical models Also known as Markov Random Fields Recall directed graphical models require a DAG and locally normalized CPDs efficient computation but restrictive 28 / 123

81 Undirected graphical models Also known as Markov Random Fields Recall directed graphical models require a DAG and locally normalized CPDs efficient computation but restrictive A clique C in an undirected graph is a set of fully connected nodes (full of loops!) 28 / 123

82 Undirected graphical models Also known as Markov Random Fields Recall directed graphical models require a DAG and locally normalized CPDs efficient computation but restrictive A clique C in an undirected graph is a set of fully connected nodes (full of loops!) Define a nonnegative potential function ψ_C : X_C → R_+ 28 / 123

83 Undirected graphical models Also known as Markov Random Fields Recall directed graphical models require a DAG and locally normalized CPDs efficient computation but restrictive A clique C in an undirected graph is a set of fully connected nodes (full of loops!) Define a nonnegative potential function ψ_C : X_C → R_+ An undirected graphical model is a family of distributions satisfying { p : p(X) = (1/Z) Π_C ψ_C(X_C) } 28 / 123

84 Undirected graphical models Also known as Markov Random Fields Recall directed graphical models require a DAG and locally normalized CPDs efficient computation but restrictive A clique C in an undirected graph is a set of fully connected nodes (full of loops!) Define a nonnegative potential function ψ_C : X_C → R_+ An undirected graphical model is a family of distributions satisfying { p : p(X) = (1/Z) Π_C ψ_C(X_C) } Z = ∫ Π_C ψ_C(X_C) dX is the partition function 28 / 123

85 Example: A Tiny Markov Random Field [Figure: two nodes x_1 - x_2 forming a single clique C] x_1, x_2 ∈ {−1, 1} 29 / 123

86 Example: A Tiny Markov Random Field [Figure: two nodes x_1 - x_2 forming a single clique C] x_1, x_2 ∈ {−1, 1} A single clique ψ_C(x_1, x_2) = e^{a x_1 x_2} 29 / 123

87 Example: A Tiny Markov Random Field [Figure: two nodes x_1 - x_2 forming a single clique C] x_1, x_2 ∈ {−1, 1} A single clique ψ_C(x_1, x_2) = e^{a x_1 x_2} p(x_1, x_2) = (1/Z) e^{a x_1 x_2} 29 / 123

88 Example: A Tiny Markov Random Field [Figure: two nodes x_1 - x_2 forming a single clique C] x_1, x_2 ∈ {−1, 1} A single clique ψ_C(x_1, x_2) = e^{a x_1 x_2} p(x_1, x_2) = (1/Z) e^{a x_1 x_2} Z = (e^a + e^{−a} + e^{−a} + e^a) 29 / 123

89 Example: A Tiny Markov Random Field [Figure: two nodes x_1 - x_2 forming a single clique C] x_1, x_2 ∈ {−1, 1} A single clique ψ_C(x_1, x_2) = e^{a x_1 x_2} p(x_1, x_2) = (1/Z) e^{a x_1 x_2} Z = (e^a + e^{−a} + e^{−a} + e^a) p(1, 1) = p(−1, −1) = e^a / (2e^a + 2e^{−a}) 29 / 123

90 Example: A Tiny Markov Random Field [Figure: two nodes x_1 - x_2 forming a single clique C] x_1, x_2 ∈ {−1, 1} A single clique ψ_C(x_1, x_2) = e^{a x_1 x_2} p(x_1, x_2) = (1/Z) e^{a x_1 x_2} Z = (e^a + e^{−a} + e^{−a} + e^a) p(1, 1) = p(−1, −1) = e^a / (2e^a + 2e^{−a}) p(−1, 1) = p(1, −1) = e^{−a} / (2e^a + 2e^{−a}) 29 / 123

91 Example: A Tiny Markov Random Field [Figure: two nodes x_1 - x_2 forming a single clique C] x_1, x_2 ∈ {−1, 1} A single clique ψ_C(x_1, x_2) = e^{a x_1 x_2} p(x_1, x_2) = (1/Z) e^{a x_1 x_2} Z = (e^a + e^{−a} + e^{−a} + e^a) p(1, 1) = p(−1, −1) = e^a / (2e^a + 2e^{−a}) p(−1, 1) = p(1, −1) = e^{−a} / (2e^a + 2e^{−a}) When the parameter a > 0, favor homogeneous chains 29 / 123

92 Example: A Tiny Markov Random Field [Figure: two nodes x_1 - x_2 forming a single clique C] x_1, x_2 ∈ {−1, 1} A single clique ψ_C(x_1, x_2) = e^{a x_1 x_2} p(x_1, x_2) = (1/Z) e^{a x_1 x_2} Z = (e^a + e^{−a} + e^{−a} + e^a) p(1, 1) = p(−1, −1) = e^a / (2e^a + 2e^{−a}) p(−1, 1) = p(1, −1) = e^{−a} / (2e^a + 2e^{−a}) When the parameter a > 0, favor homogeneous chains When the parameter a < 0, favor inhomogeneous chains 29 / 123
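
A tiny numerical sketch (Python/numpy, not from the slides) of this two-node MRF, computing Z and the four probabilities directly:

```python
import numpy as np

a = 1.0                                          # clique parameter; a > 0 favors x1 == x2
states = [-1, 1]
psi = {(x1, x2): np.exp(a * x1 * x2) for x1 in states for x2 in states}
Z = sum(psi.values())                            # 2 e^a + 2 e^{-a}
p = {xy: v / Z for xy, v in psi.items()}
print(p[(1, 1)], p[(1, -1)])                     # homogeneous pairs get the larger probability when a > 0
```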

93 Log-Linear Models Real-valued feature functions f_1(X), ..., f_k(X) 30 / 123

94 Log-Linear Models Real-valued feature functions f_1(X), ..., f_k(X) Real-valued weights w_1, ..., w_k: p(X) = (1/Z) exp( Σ_{i=1}^k w_i f_i(X) ) 30 / 123

95 Log-Linear Models Real-valued feature functions f_1(X), ..., f_k(X) Real-valued weights w_1, ..., w_k: p(X) = (1/Z) exp( Σ_{i=1}^k w_i f_i(X) ) Equivalent to an MRF p(X) = (1/Z) Π_C ψ_C(X_C) with ψ_C(X_C) = exp(w_C f_C(X_C)) 30 / 123

96 Example: Ising Model [Figure: grid node x_s with unary parameter θ_s and edge parameter θ_st to neighbor x_t] This is an undirected model with x ∈ {0, 1}: p_θ(x) = (1/Z) exp( Σ_{s ∈ V} θ_s x_s + Σ_{(s,t) ∈ E} θ_st x_s x_t ), with features f_s(X) = x_s, f_st(X) = x_s x_t 31 / 123

97 Example: Image Denoising [From Bishop PRML; figure: noisy image Y and the restored image argmax_X P(X | Y)] p_θ(X | Y) = (1/Z) exp( Σ_{s ∈ V} θ_s x_s + Σ_{(s,t) ∈ E} θ_st x_s x_t ), where θ_s = c if y_s = 1, θ_s = −c if y_s = 0, and θ_st > 0 32 / 123

98 Example: Gaussian Random Field p(X) ~ N(µ, Σ) = (1 / ((2π)^{n/2} |Σ|^{1/2})) exp( −(1/2)(X − µ)^T Σ^{−1} (X − µ) ), a multivariate Gaussian 33 / 123

99 Example: Gaussian Random Field p(X) ~ N(µ, Σ) = (1 / ((2π)^{n/2} |Σ|^{1/2})) exp( −(1/2)(X − µ)^T Σ^{−1} (X − µ) ), a multivariate Gaussian The n × n covariance matrix Σ is positive semi-definite 33 / 123

100 Example: Gaussian Random Field p(X) ~ N(µ, Σ) = (1 / ((2π)^{n/2} |Σ|^{1/2})) exp( −(1/2)(X − µ)^T Σ^{−1} (X − µ) ), a multivariate Gaussian The n × n covariance matrix Σ is positive semi-definite Let Ω = Σ^{−1} be the precision matrix 33 / 123

101 Example: Gaussian Random Field p(X) ~ N(µ, Σ) = (1 / ((2π)^{n/2} |Σ|^{1/2})) exp( −(1/2)(X − µ)^T Σ^{−1} (X − µ) ), a multivariate Gaussian The n × n covariance matrix Σ is positive semi-definite Let Ω = Σ^{−1} be the precision matrix x_i, x_j are conditionally independent given all other variables, if and only if Ω_ij = 0 33 / 123

102 Example: Gaussian Random Field p(X) ~ N(µ, Σ) = (1 / ((2π)^{n/2} |Σ|^{1/2})) exp( −(1/2)(X − µ)^T Σ^{−1} (X − µ) ), a multivariate Gaussian The n × n covariance matrix Σ is positive semi-definite Let Ω = Σ^{−1} be the precision matrix x_i, x_j are conditionally independent given all other variables, if and only if Ω_ij = 0 When Ω_ij ≠ 0, there is an edge between x_i, x_j 33 / 123
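
A small numerical check (Python/numpy, not from the slides) that a zero in the precision matrix encodes conditional independence; the precision values are made up for a three-node chain x_1 - x_2 - x_3:

```python
import numpy as np

# Precision matrix of a Gaussian chain x1 - x2 - x3: Omega[0, 2] = 0, so x1 ⊥ x3 | x2.
Omega = np.array([[ 2.0, -1.0,  0.0],
                  [-1.0,  2.0, -1.0],
                  [ 0.0, -1.0,  2.0]])
Sigma = np.linalg.inv(Omega)
print(Sigma[0, 2])        # nonzero: x1 and x3 are marginally dependent

# Conditional covariance of (x1, x3) given x2: Schur complement of the x2 block.
A, B = [0, 2], [1]
cond = Sigma[np.ix_(A, A)] - Sigma[np.ix_(A, B)] @ np.linalg.inv(Sigma[np.ix_(B, B)]) @ Sigma[np.ix_(B, A)]
print(cond[0, 1])         # ~0: conditionally independent, matching Omega[0, 2] = 0
```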

103 Conditional Independence in Markov Random Fields Two groups of variables A, B are conditionally independent given another group C, if A and B become disconnected by removing C and all edges involving C [Figure: node groups A - C - B] 34 / 123

104 Outline Graphical Models Probabilistic Inference Directed vs. Undirected Graphical Models Inference Parameter Estimation Kernel Methods Support Vector Machines Kernel PCA Reproducing Kernel Hilbert Spaces 35 / 123

105 Exact Inference 36 / 123

106 Inference by Enumeration Let X = (X_Q, X_E, X_O) for query, evidence, and other variables. 37 / 123

107 Inference by Enumeration Let X = (X_Q, X_E, X_O) for query, evidence, and other variables. Goal: P(X_Q | X_E) 37 / 123

108 Inference by Enumeration Let X = (X_Q, X_E, X_O) for query, evidence, and other variables. Goal: P(X_Q | X_E) P(X_Q | X_E) = P(X_Q, X_E) / P(X_E) = Σ_{X_O} P(X_Q, X_E, X_O) / Σ_{X_Q, X_O} P(X_Q, X_E, X_O) 37 / 123

109 Inference by Enumeration Let X = (X_Q, X_E, X_O) for query, evidence, and other variables. Goal: P(X_Q | X_E) P(X_Q | X_E) = P(X_Q, X_E) / P(X_E) = Σ_{X_O} P(X_Q, X_E, X_O) / Σ_{X_Q, X_O} P(X_Q, X_E, X_O) Summing an exponential number of terms: with k variables in X_O each taking r values, there are r^k terms 37 / 123

110 Inference by Enumeration Let X = (X_Q, X_E, X_O) for query, evidence, and other variables. Goal: P(X_Q | X_E) P(X_Q | X_E) = P(X_Q, X_E) / P(X_E) = Σ_{X_O} P(X_Q, X_E, X_O) / Σ_{X_Q, X_O} P(X_Q, X_E, X_O) Summing an exponential number of terms: with k variables in X_O each taking r values, there are r^k terms Not covered: variable elimination and junction tree (aka clique tree) 37 / 123
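
A brute-force enumeration sketch (Python/numpy, not from the slides): the joint is stored as a full d-dimensional table, evidence axes are sliced in, the other axes are summed out, and the result is normalized. Function and argument names are illustrative; the example table is the envelope-quiz joint from earlier.

```python
import numpy as np

def infer(joint, query_axes, evidence):
    """P(X_Q | X_E) by enumeration over a full joint table.

    joint:      d-dimensional array, joint[x_1, ..., x_d] = P(x_1, ..., x_d)
    query_axes: list of axes Q to keep
    evidence:   dict {axis: observed value}
    """
    idx = [slice(None)] * joint.ndim
    for ax, val in evidence.items():          # slice in the evidence
        idx[ax] = val
    table = joint[tuple(idx)]
    remaining = [ax for ax in range(joint.ndim) if ax not in evidence]
    other = tuple(i for i, ax in enumerate(remaining) if ax not in query_axes)
    table = table.sum(axis=other)             # sum out X_O
    return table / table.sum()                # normalize

# Envelope quiz joint P(E, B) with B observed as "black" (index 1).
joint = np.array([[0.25, 0.25],   # E = 1: (B = r, B = b)
                  [0.00, 0.50]])  # E = 0
print(infer(joint, query_axes=[0], evidence={1: 1}))   # [1/3, 2/3]
```

This scales as r^k in the number of summed-out variables, which is exactly the cost the slide points out.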

111 Markov Chain Monte Carlo 38 / 123

112 Markov Chain Monte Carlo Forward sampling 39 / 123

113 Markov Chain Monte Carlo Forward sampling Gibbs sampling 39 / 123

114 Markov Chain Monte Carlo Forward sampling Gibbs sampling Collapsed Gibbs sampling 39 / 123

115 Markov Chain Monte Carlo Forward sampling Gibbs sampling Collapsed Gibbs sampling Not covered: block Gibbs, Metropolis-Hastings, etc. 39 / 123

116 Markov Chain Monte Carlo Forward sampling Gibbs sampling Collapsed Gibbs sampling Not covered: block Gibbs, Metropolis-Hastings, etc. Unbiased (after burn-in), but can have high variance 39 / 123

117 Monte Carlo Methods Consider the inference problem p(X_Q = c_Q | X_E) where X_Q ∪ X_E ⊆ {x_1, ..., x_d}: p(X_Q = c_Q | X_E) = ∫ 1(X_Q = c_Q) p(X_Q | X_E) dX_Q 40 / 123

118 Monte Carlo Methods Consider the inference problem p(X_Q = c_Q | X_E) where X_Q ∪ X_E ⊆ {x_1, ..., x_d}: p(X_Q = c_Q | X_E) = ∫ 1(X_Q = c_Q) p(X_Q | X_E) dX_Q If we can draw samples X_Q^(1), ..., X_Q^(m) ~ p(X_Q | X_E), an unbiased estimator is p(X_Q = c_Q | X_E) ≈ (1/m) Σ_{i=1}^m 1(X_Q^(i) = c_Q) 40 / 123

119 Monte Carlo Methods Consider the inference problem p(X_Q = c_Q | X_E) where X_Q ∪ X_E ⊆ {x_1, ..., x_d}: p(X_Q = c_Q | X_E) = ∫ 1(X_Q = c_Q) p(X_Q | X_E) dX_Q If we can draw samples X_Q^(1), ..., X_Q^(m) ~ p(X_Q | X_E), an unbiased estimator is p(X_Q = c_Q | X_E) ≈ (1/m) Σ_{i=1}^m 1(X_Q^(i) = c_Q) The variance of the estimator decreases as O(1/m) 40 / 123

120 Monte Carlo Methods Consider the inference problem p(X_Q = c_Q | X_E) where X_Q ∪ X_E ⊆ {x_1, ..., x_d}: p(X_Q = c_Q | X_E) = ∫ 1(X_Q = c_Q) p(X_Q | X_E) dX_Q If we can draw samples X_Q^(1), ..., X_Q^(m) ~ p(X_Q | X_E), an unbiased estimator is p(X_Q = c_Q | X_E) ≈ (1/m) Σ_{i=1}^m 1(X_Q^(i) = c_Q) The variance of the estimator decreases as O(1/m) Inference reduces to sampling from p(X_Q | X_E) 40 / 123

121 Forward Sampling Draw X P (X) 41 / 123

122 Forward Sampling Draw X P (X) Throw away X if it doesn t match the evidence X E 41 / 123

123 Forward Sampling: Example [Alarm network figure with the CPTs as before] To generate a sample X = (B, E, A, J, M): 1. Sample B ~ Ber(0.001): r ~ U(0, 1); if (r < 0.001) then B = 1 else B = 0 42 / 123

124 Forward Sampling: Example [Alarm network figure with the CPTs as before] To generate a sample X = (B, E, A, J, M): 1. Sample B ~ Ber(0.001): r ~ U(0, 1); if (r < 0.001) then B = 1 else B = 0 2. Sample E ~ Ber(0.002) 42 / 123

125 Forward Sampling: Example [Alarm network figure with the CPTs as before] To generate a sample X = (B, E, A, J, M): 1. Sample B ~ Ber(0.001): r ~ U(0, 1); if (r < 0.001) then B = 1 else B = 0 2. Sample E ~ Ber(0.002) 3. If B = 1 and E = 1, sample A ~ Ber(0.95), and so on 42 / 123

126 Forward Sampling: Example [Alarm network figure with the CPTs as before] To generate a sample X = (B, E, A, J, M): 1. Sample B ~ Ber(0.001): r ~ U(0, 1); if (r < 0.001) then B = 1 else B = 0 2. Sample E ~ Ber(0.002) 3. If B = 1 and E = 1, sample A ~ Ber(0.95), and so on 4. If A = 1 sample J ~ Ber(0.9) else J ~ Ber(0.05) 42 / 123

127 Forward Sampling: Example [Alarm network figure with the CPTs as before] To generate a sample X = (B, E, A, J, M): 1. Sample B ~ Ber(0.001): r ~ U(0, 1); if (r < 0.001) then B = 1 else B = 0 2. Sample E ~ Ber(0.002) 3. If B = 1 and E = 1, sample A ~ Ber(0.95), and so on 4. If A = 1 sample J ~ Ber(0.9) else J ~ Ber(0.05) 5. If A = 1 sample M ~ Ber(0.7) else M ~ Ber(0.01) 42 / 123

128 Inference with Forward Sampling Say the inference task is P(B = 1 | E = 1, M = 1) 43 / 123

129 Inference with Forward Sampling Say the inference task is P(B = 1 | E = 1, M = 1) Throw away all samples except those with (E = 1, M = 1): p(B = 1 | E = 1, M = 1) ≈ (1/m) Σ_{i=1}^m 1(B^(i) = 1), where m is the number of surviving samples 43 / 123

130 Inference with Forward Sampling Say the inference task is P(B = 1 | E = 1, M = 1) Throw away all samples except those with (E = 1, M = 1): p(B = 1 | E = 1, M = 1) ≈ (1/m) Σ_{i=1}^m 1(B^(i) = 1), where m is the number of surviving samples Can be highly inefficient (note P(E = 1) is tiny) 43 / 123

131 Inference with Forward Sampling Say the inference task is P(B = 1 | E = 1, M = 1) Throw away all samples except those with (E = 1, M = 1): p(B = 1 | E = 1, M = 1) ≈ (1/m) Σ_{i=1}^m 1(B^(i) = 1), where m is the number of surviving samples Can be highly inefficient (note P(E = 1) is tiny) Does not work for Markov Random Fields (can't sample from P(X)) 43 / 123
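
A sketch (Python, not from the slides) of forward sampling on the alarm network followed by this rejection step; the sample budget is arbitrary, and most samples are thrown away because P(E = 1) = 0.002:

```python
import random

# Forward (ancestral) sampling, then rejection, to estimate P(B = 1 | E = 1, M = 1).
P_A = {(1, 1): 0.95, (1, 0): 0.94, (0, 1): 0.29, (0, 0): 0.001}

def forward_sample():
    B = random.random() < 0.001
    E = random.random() < 0.002
    A = random.random() < P_A[(B, E)]
    J = random.random() < (0.9 if A else 0.05)
    M = random.random() < (0.7 if A else 0.01)
    return B, E, A, J, M

kept, hits = 0, 0
for _ in range(1_000_000):
    B, E, A, J, M = forward_sample()
    if E and M:                 # reject samples that do not match the evidence
        kept += 1
        hits += B
print(kept, hits / max(kept, 1))   # very few survivors, hence the high variance
```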

132 Gibbs Sampling: Example P(B = 1 | E = 1, M = 1) Gibbs sampling is a Markov Chain Monte Carlo (MCMC) method. [Alarm network figure with E = 1, M = 1 observed and the CPTs as before] 44 / 123

133 Gibbs Sampling: Example P(B = 1 | E = 1, M = 1) Gibbs sampling is a Markov Chain Monte Carlo (MCMC) method. Directly sample from p(X_Q | X_E) [Alarm network figure with E = 1, M = 1 observed and the CPTs as before] 44 / 123

134 Gibbs Sampling: Example P(B = 1 | E = 1, M = 1) Gibbs sampling is a Markov Chain Monte Carlo (MCMC) method. Directly sample from p(X_Q | X_E) Works for both directed and undirected graphical models [Alarm network figure with E = 1, M = 1 observed and the CPTs as before] 44 / 123

135 Gibbs Sampling: Example P(B = 1 | E = 1, M = 1) Gibbs sampling is a Markov Chain Monte Carlo (MCMC) method. Directly sample from p(X_Q | X_E) Works for both directed and undirected graphical models Initialization: [Alarm network figure with E = 1, M = 1 observed and the CPTs as before] 44 / 123

136 Gibbs Sampling: Example P(B = 1 | E = 1, M = 1) Gibbs sampling is a Markov Chain Monte Carlo (MCMC) method. Directly sample from p(X_Q | X_E) Works for both directed and undirected graphical models Initialization: Fix the evidence; randomly set the other variables [Alarm network figure with E = 1, M = 1 observed and the CPTs as before] 44 / 123

137 Gibbs Sampling: Example P(B = 1 | E = 1, M = 1) Gibbs sampling is a Markov Chain Monte Carlo (MCMC) method. Directly sample from p(X_Q | X_E) Works for both directed and undirected graphical models Initialization: Fix the evidence; randomly set the other variables e.g. X^(0) = (B = 0, E = 1, A = 0, J = 0, M = 1) [Alarm network figure with E = 1, M = 1 observed and the CPTs as before] 44 / 123

138 Gibbs Sampling For each non-evidence variable x_i, fixing all other nodes X_{−i}, resample its value x_i ~ P(x_i | X_{−i}) [Alarm network figure with E = 1, M = 1 observed] 45 / 123

139 Gibbs Sampling For each non-evidence variable x_i, fixing all other nodes X_{−i}, resample its value x_i ~ P(x_i | X_{−i}) This is equivalent to x_i ~ P(x_i | MarkovBlanket(x_i)) [Alarm network figure with E = 1, M = 1 observed] 45 / 123

140 Gibbs Sampling For each non-evidence variable x_i, fixing all other nodes X_{−i}, resample its value x_i ~ P(x_i | X_{−i}) This is equivalent to x_i ~ P(x_i | MarkovBlanket(x_i)) For a Bayesian network MarkovBlanket(x_i) includes x_i's parents, spouses, and children: P(x_i | MarkovBlanket(x_i)) ∝ P(x_i | Pa(x_i)) Π_{y ∈ C(x_i)} P(y | Pa(y)), where Pa(x) are the parents of x, and C(x) the children of x. [Alarm network figure with E = 1, M = 1 observed] 45 / 123

141 Gibbs Sampling For each non-evidence variable x_i, fixing all other nodes X_{−i}, resample its value x_i ~ P(x_i | X_{−i}) This is equivalent to x_i ~ P(x_i | MarkovBlanket(x_i)) For a Bayesian network MarkovBlanket(x_i) includes x_i's parents, spouses, and children: P(x_i | MarkovBlanket(x_i)) ∝ P(x_i | Pa(x_i)) Π_{y ∈ C(x_i)} P(y | Pa(y)), where Pa(x) are the parents of x, and C(x) the children of x. For many graphical models the Markov blanket is small. [Alarm network figure with E = 1, M = 1 observed] 45 / 123

142 Gibbs Sampling For each non-evidence variable x_i, fixing all other nodes X_{−i}, resample its value x_i ~ P(x_i | X_{−i}) This is equivalent to x_i ~ P(x_i | MarkovBlanket(x_i)) For a Bayesian network MarkovBlanket(x_i) includes x_i's parents, spouses, and children: P(x_i | MarkovBlanket(x_i)) ∝ P(x_i | Pa(x_i)) Π_{y ∈ C(x_i)} P(y | Pa(y)), where Pa(x) are the parents of x, and C(x) the children of x. For many graphical models the Markov blanket is small. For example, B ~ P(B | E = 1, A = 0) ∝ P(B) P(A = 0 | B, E = 1) [Alarm network figure with E = 1, M = 1 observed] 45 / 123

143 Gibbs Sampling Say we sampled B = 1. Then X^(1) = (B = 1, E = 1, A = 0, J = 0, M = 1) [Alarm network figure with E = 1, M = 1 observed and the CPTs as before] 46 / 123

144 Gibbs Sampling Say we sampled B = 1. Then X^(1) = (B = 1, E = 1, A = 0, J = 0, M = 1) Starting from X^(1), sample A ~ P(A | B = 1, E = 1, J = 0, M = 1) to get X^(2) [Alarm network figure with E = 1, M = 1 observed and the CPTs as before] 46 / 123

145 Gibbs Sampling Say we sampled B = 1. Then X^(1) = (B = 1, E = 1, A = 0, J = 0, M = 1) Starting from X^(1), sample A ~ P(A | B = 1, E = 1, J = 0, M = 1) to get X^(2) Move on to J, then repeat B, A, J, B, A, J... [Alarm network figure with E = 1, M = 1 observed and the CPTs as before] 46 / 123

146 Gibbs Sampling Say we sampled B = 1. Then X^(1) = (B = 1, E = 1, A = 0, J = 0, M = 1) Starting from X^(1), sample A ~ P(A | B = 1, E = 1, J = 0, M = 1) to get X^(2) Move on to J, then repeat B, A, J, B, A, J... Keep all samples after burn-in. P(B = 1 | E = 1, M = 1) is the fraction of samples with B = 1. [Alarm network figure with E = 1, M = 1 observed and the CPTs as before] 46 / 123
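
A sketch (Python, not from the slides) of this Gibbs scan over B, A, J with E = 1, M = 1 fixed; each variable is resampled from its Markov-blanket conditional, and burn-in length and sample count are arbitrary choices:

```python
import random

# Gibbs sampling sketch for P(B = 1 | E = 1, M = 1) on the alarm network (CPTs from the slides).
P_A = {(1, 1): 0.95, (1, 0): 0.94, (0, 1): 0.29, (0, 0): 0.001}
P_J = {1: 0.9, 0: 0.05}
P_M = {1: 0.7, 0: 0.01}

def bern(score1, score0):
    """Sample 1 with probability score1 / (score1 + score0), i.e. the normalized blanket score."""
    return int(random.random() < score1 / (score1 + score0))

E, M = 1, 1                       # evidence, held fixed
B, A, J = 0, 0, 0                 # arbitrary initialization of the non-evidence variables
burn_in, n_samples, hits = 1000, 50_000, 0

for t in range(burn_in + n_samples):
    # B ~ P(B | blanket) ∝ P(B) P(A | B, E)
    B = bern(0.001 * (P_A[(1, E)] if A else 1 - P_A[(1, E)]),
             0.999 * (P_A[(0, E)] if A else 1 - P_A[(0, E)]))
    # A ~ P(A | blanket) ∝ P(A | B, E) P(J | A) P(M = 1 | A)
    A = bern(P_A[(B, E)] * (P_J[1] if J else 1 - P_J[1]) * P_M[1],
             (1 - P_A[(B, E)]) * (P_J[0] if J else 1 - P_J[0]) * P_M[0])
    # J ~ P(J | blanket) ∝ P(J | A)
    J = bern(P_J[A], 1 - P_J[A])
    if t >= burn_in:
        hits += B
print(hits / n_samples)           # estimate of P(B = 1 | E = 1, M = 1)
```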

147 Gibbs Sampling Example 2: The Ising Model [Figure: node x_s with neighbors A, B, C, D] This is an undirected model with x ∈ {0, 1}: p_θ(x) = (1/Z) exp( Σ_{s ∈ V} θ_s x_s + Σ_{(s,t) ∈ E} θ_st x_s x_t ) 47 / 123

148 Gibbs Example 2: The Ising Model [Figure: node x_s with neighbors A, B, C, D] The Markov blanket of x_s is A, B, C, D 48 / 123

149 Gibbs Example 2: The Ising Model [Figure: node x_s with neighbors A, B, C, D] The Markov blanket of x_s is A, B, C, D In general for undirected graphical models p(x_s | x_{−s}) = p(x_s | x_N(s)), where N(s) are the neighbors of s. 48 / 123

150 Gibbs Example 2: The Ising Model [Figure: node x_s with neighbors A, B, C, D] The Markov blanket of x_s is A, B, C, D In general for undirected graphical models p(x_s | x_{−s}) = p(x_s | x_N(s)), where N(s) are the neighbors of s. The Gibbs update is p(x_s = 1 | x_N(s)) = 1 / (1 + exp(−(θ_s + Σ_{t ∈ N(s)} θ_st x_t))) 48 / 123
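
A sketch (Python/numpy, not from the slides) of this update as a raster-scan Gibbs sweep on a grid Ising model; grid size, parameter values, and sweep count are made up:

```python
import numpy as np

def gibbs_sweep(x, theta_s, theta_st, rng):
    """One Gibbs sweep over a grid Ising model with x in {0, 1}.

    x:        2-D integer array of current values
    theta_s:  scalar unary parameter shared by all nodes
    theta_st: scalar pairwise parameter shared by all grid edges
    """
    H, W = x.shape
    for i in range(H):
        for j in range(W):
            nbr = sum(x[a, b] for a, b in ((i-1, j), (i+1, j), (i, j-1), (i, j+1))
                      if 0 <= a < H and 0 <= b < W)
            p1 = 1.0 / (1.0 + np.exp(-(theta_s + theta_st * nbr)))   # p(x_s = 1 | neighbors)
            x[i, j] = int(rng.random() < p1)
    return x

rng = np.random.default_rng(0)
x = rng.integers(0, 2, size=(20, 20))
for _ in range(100):
    gibbs_sweep(x, theta_s=0.0, theta_st=0.8, rng=rng)   # theta_st > 0 favors homogeneous neighbors
```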

151 Gibbs Sampling as a Markov Chain A Markov chain is defined by a transition matrix T(X' | X) 49 / 123

152 Gibbs Sampling as a Markov Chain A Markov chain is defined by a transition matrix T(X' | X) Certain Markov chains have a stationary distribution π such that π = Tπ 49 / 123

153 Gibbs Sampling as a Markov Chain A Markov chain is defined by a transition matrix T(X' | X) Certain Markov chains have a stationary distribution π such that π = Tπ The Gibbs sampler is such a Markov chain with T_i((X_{−i}, x_i) → (X_{−i}, x_i')) = p(x_i' | X_{−i}), and stationary distribution p(X_Q | X_E) 49 / 123

154 Gibbs Sampling as a Markov Chain A Markov chain is defined by a transition matrix T(X' | X) Certain Markov chains have a stationary distribution π such that π = Tπ The Gibbs sampler is such a Markov chain with T_i((X_{−i}, x_i) → (X_{−i}, x_i')) = p(x_i' | X_{−i}), and stationary distribution p(X_Q | X_E) But it takes time for the chain to reach its stationary distribution (mix) 49 / 123

155 Gibbs Sampling as a Markov Chain A Markov chain is defined by a transition matrix T(X' | X) Certain Markov chains have a stationary distribution π such that π = Tπ The Gibbs sampler is such a Markov chain with T_i((X_{−i}, x_i) → (X_{−i}, x_i')) = p(x_i' | X_{−i}), and stationary distribution p(X_Q | X_E) But it takes time for the chain to reach its stationary distribution (mix) Can be difficult to assert mixing 49 / 123

156 Gibbs Sampling as a Markov Chain A Markov chain is defined by a transition matrix T(X' | X) Certain Markov chains have a stationary distribution π such that π = Tπ The Gibbs sampler is such a Markov chain with T_i((X_{−i}, x_i) → (X_{−i}, x_i')) = p(x_i' | X_{−i}), and stationary distribution p(X_Q | X_E) But it takes time for the chain to reach its stationary distribution (mix) Can be difficult to assert mixing In practice, burn-in: discard X^(0), ..., X^(T) 49 / 123

157 Gibbs Sampling as a Markov Chain A Markov chain is defined by a transition matrix T(X' | X) Certain Markov chains have a stationary distribution π such that π = Tπ The Gibbs sampler is such a Markov chain with T_i((X_{−i}, x_i) → (X_{−i}, x_i')) = p(x_i' | X_{−i}), and stationary distribution p(X_Q | X_E) But it takes time for the chain to reach its stationary distribution (mix) Can be difficult to assert mixing In practice, burn-in: discard X^(0), ..., X^(T) Use all of X^(T+1), ... for inference (they are correlated); do not thin 49 / 123

158 Collapsed Gibbs Sampling In general, E_p[f(X)] ≈ (1/m) Σ_{i=1}^m f(X^(i)) for X^(i) ~ p 50 / 123

159 Collapsed Gibbs Sampling In general, E_p[f(X)] ≈ (1/m) Σ_{i=1}^m f(X^(i)) for X^(i) ~ p Sometimes X = (Y, Z) where E_{Z|Y} has a closed form 50 / 123

160 Collapsed Gibbs Sampling In general, E_p[f(X)] ≈ (1/m) Σ_{i=1}^m f(X^(i)) for X^(i) ~ p Sometimes X = (Y, Z) where E_{Z|Y} has a closed form If so, for Y^(i) ~ p(Y), E_p[f(X)] = E_{p(Y)} E_{p(Z|Y)}[f(Y, Z)] ≈ (1/m) Σ_{i=1}^m E_{p(Z|Y^(i))}[f(Y^(i), Z)] 50 / 123

161 Collapsed Gibbs Sampling In general, E_p[f(X)] ≈ (1/m) Σ_{i=1}^m f(X^(i)) for X^(i) ~ p Sometimes X = (Y, Z) where E_{Z|Y} has a closed form If so, for Y^(i) ~ p(Y), E_p[f(X)] = E_{p(Y)} E_{p(Z|Y)}[f(Y, Z)] ≈ (1/m) Σ_{i=1}^m E_{p(Z|Y^(i))}[f(Y^(i), Z)] No need to sample Z: it is collapsed 50 / 123

162 Collapsed Gibbs Sampling In general, E_p[f(X)] ≈ (1/m) Σ_{i=1}^m f(X^(i)) for X^(i) ~ p Sometimes X = (Y, Z) where E_{Z|Y} has a closed form If so, for Y^(i) ~ p(Y), E_p[f(X)] = E_{p(Y)} E_{p(Z|Y)}[f(Y, Z)] ≈ (1/m) Σ_{i=1}^m E_{p(Z|Y^(i))}[f(Y^(i), Z)] No need to sample Z: it is collapsed Collapsed Gibbs sampler T_i((Y_{−i}, y_i) → (Y_{−i}, y_i')) = p(y_i' | Y_{−i}) 50 / 123

163 Collapsed Gibbs Sampling In general, E_p[f(X)] ≈ (1/m) Σ_{i=1}^m f(X^(i)) for X^(i) ~ p Sometimes X = (Y, Z) where E_{Z|Y} has a closed form If so, for Y^(i) ~ p(Y), E_p[f(X)] = E_{p(Y)} E_{p(Z|Y)}[f(Y, Z)] ≈ (1/m) Σ_{i=1}^m E_{p(Z|Y^(i))}[f(Y^(i), Z)] No need to sample Z: it is collapsed Collapsed Gibbs sampler T_i((Y_{−i}, y_i) → (Y_{−i}, y_i')) = p(y_i' | Y_{−i}) Note p(y_i' | Y_{−i}) = ∫ p(y_i', Z | Y_{−i}) dZ 50 / 123

164 Example: Collapsed Gibbs Sampling for LDA Collapse θ, φ; the Gibbs update is P(z_i = j | z_{−i}, w) ∝ (n^(w_i)_{−i,j} + β) / (n^(·)_{−i,j} + Wβ) · (n^(d_i)_{−i,j} + α) / (n^(d_i)_{−i,·} + Tα), where n^(w_i)_{−i,j} is the number of times word w_i has been assigned to topic j, excluding the current position 51 / 123

165 Example: Collapsed Gibbs Sampling for LDA Collapse θ, φ; the Gibbs update is P(z_i = j | z_{−i}, w) ∝ (n^(w_i)_{−i,j} + β) / (n^(·)_{−i,j} + Wβ) · (n^(d_i)_{−i,j} + α) / (n^(d_i)_{−i,·} + Tα), where n^(w_i)_{−i,j} is the number of times word w_i has been assigned to topic j, excluding the current position n^(d_i)_{−i,j}: the number of times a word from document d_i has been assigned to topic j, excluding the current position 51 / 123

166 Example: Collapsed Gibbs Sampling for LDA Collapse θ, φ; the Gibbs update is P(z_i = j | z_{−i}, w) ∝ (n^(w_i)_{−i,j} + β) / (n^(·)_{−i,j} + Wβ) · (n^(d_i)_{−i,j} + α) / (n^(d_i)_{−i,·} + Tα), where n^(w_i)_{−i,j} is the number of times word w_i has been assigned to topic j, excluding the current position n^(d_i)_{−i,j}: the number of times a word from document d_i has been assigned to topic j, excluding the current position n^(·)_{−i,j}: the number of times any word has been assigned to topic j, excluding the current position 51 / 123

167 Example: Collapsed Gibbs Sampling for LDA Collapse θ, φ; the Gibbs update is P(z_i = j | z_{−i}, w) ∝ (n^(w_i)_{−i,j} + β) / (n^(·)_{−i,j} + Wβ) · (n^(d_i)_{−i,j} + α) / (n^(d_i)_{−i,·} + Tα), where n^(w_i)_{−i,j} is the number of times word w_i has been assigned to topic j, excluding the current position n^(d_i)_{−i,j}: the number of times a word from document d_i has been assigned to topic j, excluding the current position n^(·)_{−i,j}: the number of times any word has been assigned to topic j, excluding the current position n^(d_i)_{−i,·}: the length of document d_i, excluding the current position 51 / 123
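
A minimal sketch (Python/numpy, not from the slides) of this collapsed Gibbs update with the usual count matrices; the document-length denominator is constant in j and is dropped, so the score is only computed up to normalization. The toy corpus and hyperparameters are made up.

```python
import numpy as np

def lda_collapsed_gibbs(docs, W, T, alpha, beta, n_iters, seed=0):
    """Minimal collapsed Gibbs sampler for LDA; docs is a list of word-id lists."""
    rng = np.random.default_rng(seed)
    n_wt = np.zeros((W, T))                 # word-topic counts
    n_dt = np.zeros((len(docs), T))         # document-topic counts
    n_t = np.zeros(T)                       # total words per topic
    z = [rng.integers(T, size=len(doc)) for doc in docs]   # random initial assignments
    for d, doc in enumerate(docs):
        for i, w in enumerate(doc):
            n_wt[w, z[d][i]] += 1; n_dt[d, z[d][i]] += 1; n_t[z[d][i]] += 1

    for _ in range(n_iters):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                t = z[d][i]                 # remove the current assignment from the counts
                n_wt[w, t] -= 1; n_dt[d, t] -= 1; n_t[t] -= 1
                # P(z_i = j | z_-i, w) from the slide, up to a constant factor
                p = (n_wt[w] + beta) / (n_t + W * beta) * (n_dt[d] + alpha)
                t = rng.choice(T, p=p / p.sum())
                z[d][i] = t
                n_wt[w, t] += 1; n_dt[d, t] += 1; n_t[t] += 1
    return z, n_wt, n_dt

z, n_wt, n_dt = lda_collapsed_gibbs(docs=[[0, 1, 2, 1], [2, 3, 3, 0]], W=4, T=2,
                                    alpha=0.1, beta=0.01, n_iters=200)
```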

168 Belief Propagation 52 / 123

169 Factor Graph For both directed and undirected graphical models [Figure: an undirected clique potential ψ(A, B, C) and a directed model P(A)P(B)P(C | A, B), each redrawn as a factor graph with a factor node f connected to A, B, C] 53 / 123

170 Factor Graph For both directed and undirected graphical models Bipartite: edges between a variable node and a factor node [Figure: the same two factor-graph conversions] 53 / 123

171 Factor Graph For both directed and undirected graphical models Bipartite: edges between a variable node and a factor node Factors represent computation [Figure: the same two factor-graph conversions] 53 / 123

172 The Sum-Product Algorithm Also known as belief propagation (BP) 54 / 123

173 The Sum-Product Algorithm Also known as belief propagation (BP) Exact if the graph is a tree; otherwise known as loopy BP, approximate 54 / 123

174 The Sum-Product Algorithm Also known as belief propagation (BP) Exact if the graph is a tree; otherwise known as loopy BP, approximate The algorithm involves passing messages on the factor graph 54 / 123

175 The Sum-Product Algorithm Also known as belief propagation (BP) Exact if the graph is a tree; otherwise known as loopy BP, approximate The algorithm involves passing messages on the factor graph Alternative view: variational approximation (more later) 54 / 123

176 Example: A Simple HMM The Hidden Markov Model template (not a graphical model) [Figure: two hidden states 1, 2 with self-transition probabilities 1/4 and 1/2; emissions P(x | z = 1) = (1/2, 1/4, 1/4) and P(x | z = 2) = (1/4, 1/2, 1/4) over R, G, B; initial probabilities π_1 = π_2 = 1/2] 55 / 123

177 Example: A Simple HMM Observing x_1 = R, x_2 = G, the directed graphical model is z_1 → z_2 with observed children x_1 = R, x_2 = G 56 / 123

178 Example: A Simple HMM Observing x_1 = R, x_2 = G, the directed graphical model is z_1 → z_2 with observed children x_1 = R, x_2 = G Factor graph f_1 - z_1 - f_2 - z_2, with f_1(z_1) = P(z_1) P(x_1 | z_1) and f_2(z_1, z_2) = P(z_2 | z_1) P(x_2 | z_2) 56 / 123

179 Messages A message is a vector of length K, where K is the number of values x takes. 57 / 123

180 Messages A message is a vector of length K, where K is the number of values x takes. There are two types of messages: 57 / 123

181 Messages A message is a vector of length K, where K is the number of values x takes. There are two types of messages: 1. µ f x : message from a factor node f to a variable node x µ f x (i) is the ith element, i = 1... K. 57 / 123

182 Messages A message is a vector of length K, where K is the number of values x takes. There are two types of messages: 1. µ f x : message from a factor node f to a variable node x µ f x (i) is the ith element, i = 1... K. 2. µ x f : message from a variable node x to a factor node f 57 / 123

183 Leaf Messages Assume a tree factor graph. Pick an arbitrary root, say z_2 [Factor graph and HMM parameters as above] 58 / 123

184 Leaf Messages Assume a tree factor graph. Pick an arbitrary root, say z_2 Start messages at the leaves. [Factor graph and HMM parameters as above] 58 / 123

185 Leaf Messages Assume a tree factor graph. Pick an arbitrary root, say z_2 Start messages at the leaves. If a leaf is a factor node f, µ_{f→x}(x) = f(x): µ_{f1→z1}(z_1 = 1) = P(z_1 = 1) P(R | z_1 = 1) = 1/2 · 1/2 = 1/4, µ_{f1→z1}(z_1 = 2) = P(z_1 = 2) P(R | z_1 = 2) = 1/2 · 1/4 = 1/8 [Factor graph and HMM parameters as above] 58 / 123

186 Leaf Messages Assume a tree factor graph. Pick an arbitrary root, say z_2 Start messages at the leaves. If a leaf is a factor node f, µ_{f→x}(x) = f(x): µ_{f1→z1}(z_1 = 1) = P(z_1 = 1) P(R | z_1 = 1) = 1/2 · 1/2 = 1/4, µ_{f1→z1}(z_1 = 2) = P(z_1 = 2) P(R | z_1 = 2) = 1/2 · 1/4 = 1/8 If a leaf is a variable node x, µ_{x→f}(x) = 1 [Factor graph and HMM parameters as above] 58 / 123

187 Message from Variable to Factor A node (factor or variable) can send out a message once all other incoming messages have arrived [Factor graph and HMM parameters as above] 59 / 123

188 Message from Variable to Factor A node (factor or variable) can send out a message once all other incoming messages have arrived Let x be in factor f_s, and let ne(x)\f_s be the factors connected to x excluding f_s: µ_{x→f_s}(x) = Π_{f ∈ ne(x)\f_s} µ_{f→x}(x), so µ_{z1→f2}(z_1 = 1) = 1/4, µ_{z1→f2}(z_1 = 2) = 1/8 [Factor graph and HMM parameters as above] 59 / 123

189 Message from Factor to Variable Let x be in factor f_s, and let the other variables in f_s be x_{1:M}: µ_{f_s→x}(x) = Σ_{x_1} ... Σ_{x_M} f_s(x, x_1, ..., x_M) Π_{m=1}^M µ_{x_m→f_s}(x_m) [Factor graph as above] 60 / 123

190 Message from Factor to Variable Let x be in factor f_s, and let the other variables in f_s be x_{1:M}: µ_{f_s→x}(x) = Σ_{x_1} ... Σ_{x_M} f_s(x, x_1, ..., x_M) Π_{m=1}^M µ_{x_m→f_s}(x_m) In this example µ_{f2→z2}(s) = Σ_{s'=1}^2 µ_{z1→f2}(s') f_2(z_1 = s', z_2 = s) = 1/4 · P(z_2 = s | z_1 = 1) P(x_2 = G | z_2 = s) + 1/8 · P(z_2 = s | z_1 = 2) P(x_2 = G | z_2 = s) [Factor graph as above] 60 / 123

191 Message from Factor to Variable Let x be in factor f_s, and let the other variables in f_s be x_{1:M}: µ_{f_s→x}(x) = Σ_{x_1} ... Σ_{x_M} f_s(x, x_1, ..., x_M) Π_{m=1}^M µ_{x_m→f_s}(x_m) In this example µ_{f2→z2}(s) = Σ_{s'=1}^2 µ_{z1→f2}(s') f_2(z_1 = s', z_2 = s) = 1/4 · P(z_2 = s | z_1 = 1) P(x_2 = G | z_2 = s) + 1/8 · P(z_2 = s | z_1 = 2) P(x_2 = G | z_2 = s) We get µ_{f2→z2}(z_2 = 1) = 1/32, µ_{f2→z2}(z_2 = 2) = 1/8 [Factor graph as above] 60 / 123

192 Up to Root, Back Down The message has reached the root; pass it back down: µ_{z2→f2}(z_2 = 1) = 1, µ_{z2→f2}(z_2 = 2) = 1 [Factor graph and HMM parameters as above] 61 / 123

193 Keep Passing Down µ_{f2→z1}(s) = Σ_{s'=1}^2 µ_{z2→f2}(s') f_2(z_1 = s, z_2 = s') = 1 · P(z_2 = 1 | z_1 = s) P(x_2 = G | z_2 = 1) + 1 · P(z_2 = 2 | z_1 = s) P(x_2 = G | z_2 = 2) [Factor graph and HMM parameters as above] 62 / 123

194 Keep Passing Down µ_{f2→z1}(s) = Σ_{s'=1}^2 µ_{z2→f2}(s') f_2(z_1 = s, z_2 = s') = 1 · P(z_2 = 1 | z_1 = s) P(x_2 = G | z_2 = 1) + 1 · P(z_2 = 2 | z_1 = s) P(x_2 = G | z_2 = 2) We get µ_{f2→z1}(z_1 = 1) = 7/16, µ_{f2→z1}(z_1 = 2) = 3/8 [Factor graph and HMM parameters as above] 62 / 123

195 From Messages to Marginals Once a variable has received all incoming messages, we compute its marginal as p(x) ∝ Π_{f ∈ ne(x)} µ_{f→x}(x) 63 / 123

196 From Messages to Marginals Once a variable has received all incoming messages, we compute its marginal as p(x) ∝ Π_{f ∈ ne(x)} µ_{f→x}(x) In this example P(z_1 | x_1, x_2) ∝ µ_{f1→z1} · µ_{f2→z1} = (1/4, 1/8) ⊙ (7/16, 3/8) = (7/64, 3/64); P(z_2 | x_1, x_2) ∝ µ_{f2→z2} = (1/32, 1/8) ∝ (0.2, 0.8) 63 / 123

197 From Messages to Marginals Once a variable has received all incoming messages, we compute its marginal as p(x) ∝ Π_{f ∈ ne(x)} µ_{f→x}(x) In this example P(z_1 | x_1, x_2) ∝ µ_{f1→z1} · µ_{f2→z1} = (1/4, 1/8) ⊙ (7/16, 3/8) = (7/64, 3/64); P(z_2 | x_1, x_2) ∝ µ_{f2→z2} = (1/32, 1/8) ∝ (0.2, 0.8) One can also compute the marginal of the set of variables x_s involved in a factor f_s: p(x_s) ∝ f_s(x_s) Π_{x ∈ ne(f)} µ_{x→f}(x) 63 / 123
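
A sketch (Python/numpy, not from the slides) that reproduces this message-passing example; the transition matrix is read off the state diagram (self-transitions 1/4 and 1/2), which is consistent with the message values on the slides:

```python
import numpy as np

# Sum-product on the two-variable factor graph: f1(z1) = P(z1) P(x1=R | z1),
# f2(z1, z2) = P(z2 | z1) P(x2=G | z2).  States are {1, 2}; emission alphabet is (R, G, B).
pi = np.array([0.5, 0.5])
A = np.array([[0.25, 0.75],              # P(z' | z = 1): self-transition 1/4
              [0.50, 0.50]])             # P(z' | z = 2): self-transition 1/2
Em = np.array([[0.5, 0.25, 0.25],        # P(x | z = 1) over (R, G, B)
               [0.25, 0.5, 0.25]])       # P(x | z = 2)
R, G = 0, 1

f1 = pi * Em[:, R]                       # leaf message mu_{f1->z1}: [1/4, 1/8]
f2 = A * Em[:, G][None, :]               # f2[s, s'] = P(z2=s' | z1=s) P(G | z2=s')

mu_f2_to_z2 = f1 @ f2                    # upward pass:   [1/32, 1/8]
mu_f2_to_z1 = f2 @ np.ones(2)            # downward pass: [7/16, 3/8]

p_z2 = mu_f2_to_z2 / mu_f2_to_z2.sum()   # [0.2, 0.8]
p_z1 = f1 * mu_f2_to_z1
p_z1 /= p_z1.sum()                       # proportional to (7/64, 3/64)
print(p_z1, p_z2)
```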

198 Handling Evidence Observing x = v, 64 / 123

199 Handling Evidence Observing x = v, we can absorb it in the factor (as we did); or 64 / 123

200 Handling Evidence Observing x = v, we can absorb it in the factor (as we did); or set messages µ_{x→f}(x) = 0 for all x ≠ v 64 / 123

201 Handling Evidence Observing x = v, we can absorb it in the factor (as we did); or set messages µ_{x→f}(x) = 0 for all x ≠ v Observing X_E, 64 / 123

202 Handling Evidence Observing x = v, we can absorb it in the factor (as we did); or set messages µ_{x→f}(x) = 0 for all x ≠ v Observing X_E, multiplying the incoming messages to x ∉ X_E gives the joint (not p(x | X_E)): p(x, X_E) ∝ Π_{f ∈ ne(x)} µ_{f→x}(x) 64 / 123

203 Handling Evidence Observing x = v, we can absorb it in the factor (as we did); or set messages µ_{x→f}(x) = 0 for all x ≠ v Observing X_E, multiplying the incoming messages to x ∉ X_E gives the joint (not p(x | X_E)): p(x, X_E) ∝ Π_{f ∈ ne(x)} µ_{f→x}(x) The conditional is easily obtained by normalization: p(x | X_E) = p(x, X_E) / Σ_x p(x, X_E) 64 / 123

204 Loopy Belief Propagation So far, we assumed a tree graph 65 / 123

205 Loopy Belief Propagation So far, we assumed a tree graph When the factor graph contains loops, pass messages indefinitely until convergence 65 / 123

206 Loopy Belief Propagation So far, we assumed a tree graph When the factor graph contains loops, pass messages indefinitely until convergence Loopy BP may not converge, but works in many cases 65 / 123

207 Outline Graphical Models Probabilistic Inference Directed vs. Undirected Graphical Models Inference Parameter Estimation Kernel Methods Support Vector Machines Kernel PCA Reproducing Kernel Hilbert Spaces 66 / 123

208 Parameter Learning Assume the graph structure is given 67 / 123

209 Parameter Learning Assume the graph structure is given Parameters: 67 / 123

210 Parameter Learning Assume the graph structure is given Parameters: θ_i in the CPDs p(x_i | Pa(x_i), θ_i) of directed graphical models: p(X) = Π_i p(x_i | Pa(x_i), θ_i) 67 / 123

211 Parameter Learning Assume the graph structure is given Parameters: θ_i in the CPDs p(x_i | Pa(x_i), θ_i) of directed graphical models: p(X) = Π_i p(x_i | Pa(x_i), θ_i) Weights w_i in undirected graphical models: p(X) = (1/Z) exp( Σ_{i=1}^k w_i f_i(X) ) 67 / 123

212 Parameter Learning Assume the graph structure is given Parameters: θ_i in the CPDs p(x_i | Pa(x_i), θ_i) of directed graphical models: p(X) = Π_i p(x_i | Pa(x_i), θ_i) Weights w_i in undirected graphical models: p(X) = (1/Z) exp( Σ_{i=1}^k w_i f_i(X) ) Principle: maximum likelihood estimate 67 / 123

213 Parameter Learning: Maximum Likelihood Estimate fully observed: all dimensions of X are observed 68 / 123

214 Parameter Learning: Maximum Likelihood Estimate fully observed: all dimensions of X are observed given X^1, ..., X^n, the MLE is θ̂ = argmax_θ Σ_{i=1}^n log p(X^i | θ) 68 / 123

215 Parameter Learning: Maximum Likelihood Estimate fully observed: all dimensions of X are observed given X^1, ..., X^n, the MLE is θ̂ = argmax_θ Σ_{i=1}^n log p(X^i | θ) the log likelihood factorizes for directed models (easy) 68 / 123

216 Parameter Learning: Maximum Likelihood Estimate fully observed: all dimensions of X are observed given X^1, ..., X^n, the MLE is θ̂ = argmax_θ Σ_{i=1}^n log p(X^i | θ) the log likelihood factorizes for directed models (easy) gradient method for undirected models 68 / 123

217 Parameter Learning: Maximum Likelihood Estimate fully observed: all dimensions of X are observed given X^1, ..., X^n, the MLE is θ̂ = argmax_θ Σ_{i=1}^n log p(X^i | θ) the log likelihood factorizes for directed models (easy) gradient method for undirected models partially observed: X = (X_o, X_h) where X_h is unobserved 68 / 123

218 Parameter Learning: Maximum Likelihood Estimate fully observed: all dimensions of X are observed given X^1, ..., X^n, the MLE is θ̂ = argmax_θ Σ_{i=1}^n log p(X^i | θ) the log likelihood factorizes for directed models (easy) gradient method for undirected models partially observed: X = (X_o, X_h) where X_h is unobserved given X_o^1, ..., X_o^n, the MLE is θ̂ = argmax_θ Σ_{i=1}^n log ( Σ_{X_h} p(X_o^i, X_h | θ) ) 68 / 123
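
A small sketch (Python/numpy, not from the slides) of the fully observed directed case, where the factorized log likelihood makes the MLE of each CPD a simple counting estimate; the two-node network and data are made up:

```python
import numpy as np

# MLE for a fully observed two-node network x1 -> x2 with binary variables:
# estimate P(x1 = 1) and theta_v = P(x2 = 1 | x1 = v) by counting.
data = np.array([[1, 1], [1, 0], [1, 1], [0, 0], [0, 0], [0, 1]])   # rows are (x1, x2)

p_x1 = data[:, 0].mean()                                        # MLE of P(x1 = 1)
theta = {v: data[data[:, 0] == v, 1].mean() for v in (0, 1)}    # MLE of P(x2 = 1 | x1 = v)
print(p_x1, theta)
```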


STA 4273H: Statistical Machine Learning STA 4273H: Statistical Machine Learning Russ Salakhutdinov Department of Computer Science! Department of Statistical Sciences! rsalakhu@cs.toronto.edu! h0p://www.cs.utoronto.ca/~rsalakhu/ Lecture 7 Approximate

More information

Probabilistic Graphical Models

Probabilistic Graphical Models Probabilistic Graphical Models Brown University CSCI 295-P, Spring 213 Prof. Erik Sudderth Lecture 11: Inference & Learning Overview, Gaussian Graphical Models Some figures courtesy Michael Jordan s draft

More information

Directed and Undirected Graphical Models

Directed and Undirected Graphical Models Directed and Undirected Graphical Models Adrian Weller MLSALT4 Lecture Feb 26, 2016 With thanks to David Sontag (NYU) and Tony Jebara (Columbia) for use of many slides and illustrations For more information,

More information

Probabilistic Graphical Models: Representation and Inference

Probabilistic Graphical Models: Representation and Inference Probabilistic Graphical Models: Representation and Inference Aaron C. Courville Université de Montréal Note: Material for the slides is taken directly from a presentation prepared by Andrew Moore 1 Overview

More information

Directed and Undirected Graphical Models

Directed and Undirected Graphical Models Directed and Undirected Davide Bacciu Dipartimento di Informatica Università di Pisa bacciu@di.unipi.it Machine Learning: Neural Networks and Advanced Models (AA2) Last Lecture Refresher Lecture Plan Directed

More information

Lecture 13 : Variational Inference: Mean Field Approximation

Lecture 13 : Variational Inference: Mean Field Approximation 10-708: Probabilistic Graphical Models 10-708, Spring 2017 Lecture 13 : Variational Inference: Mean Field Approximation Lecturer: Willie Neiswanger Scribes: Xupeng Tong, Minxing Liu 1 Problem Setup 1.1

More information

Graphical Models - Part II

Graphical Models - Part II Graphical Models - Part II Bishop PRML Ch. 8 Alireza Ghane Outline Probabilistic Models Bayesian Networks Markov Random Fields Inference Graphical Models Alireza Ghane / Greg Mori 1 Outline Probabilistic

More information

An Introduction to Bayesian Machine Learning

An Introduction to Bayesian Machine Learning 1 An Introduction to Bayesian Machine Learning José Miguel Hernández-Lobato Department of Engineering, Cambridge University April 8, 2013 2 What is Machine Learning? The design of computational systems

More information

Bayes Networks 6.872/HST.950

Bayes Networks 6.872/HST.950 Bayes Networks 6.872/HST.950 What Probabilistic Models Should We Use? Full joint distribution Completely expressive Hugely data-hungry Exponential computational complexity Naive Bayes (full conditional

More information

Probabilistic Graphical Models

Probabilistic Graphical Models 2016 Robert Nowak Probabilistic Graphical Models 1 Introduction We have focused mainly on linear models for signals, in particular the subspace model x = Uθ, where U is a n k matrix and θ R k is a vector

More information

Bayesian Networks Introduction to Machine Learning. Matt Gormley Lecture 24 April 9, 2018

Bayesian Networks Introduction to Machine Learning. Matt Gormley Lecture 24 April 9, 2018 10-601 Introduction to Machine Learning Machine Learning Department School of Computer Science Carnegie Mellon University Bayesian Networks Matt Gormley Lecture 24 April 9, 2018 1 Homework 7: HMMs Reminders

More information

Bayesian Networks. Motivation

Bayesian Networks. Motivation Bayesian Networks Computer Sciences 760 Spring 2014 http://pages.cs.wisc.edu/~dpage/cs760/ Motivation Assume we have five Boolean variables,,,, The joint probability is,,,, How many state configurations

More information

Machine Learning for Data Science (CS4786) Lecture 24

Machine Learning for Data Science (CS4786) Lecture 24 Machine Learning for Data Science (CS4786) Lecture 24 Graphical Models: Approximate Inference Course Webpage : http://www.cs.cornell.edu/courses/cs4786/2016sp/ BELIEF PROPAGATION OR MESSAGE PASSING Each

More information

Bayesian Machine Learning

Bayesian Machine Learning Bayesian Machine Learning Andrew Gordon Wilson ORIE 6741 Lecture 4 Occam s Razor, Model Construction, and Directed Graphical Models https://people.orie.cornell.edu/andrew/orie6741 Cornell University September

More information

Bayesian networks: approximate inference

Bayesian networks: approximate inference Bayesian networks: approximate inference Machine Intelligence Thomas D. Nielsen September 2008 Approximative inference September 2008 1 / 25 Motivation Because of the (worst-case) intractability of exact

More information

Based on slides by Richard Zemel

Based on slides by Richard Zemel CSC 412/2506 Winter 2018 Probabilistic Learning and Reasoning Lecture 3: Directed Graphical Models and Latent Variables Based on slides by Richard Zemel Learning outcomes What aspects of a model can we

More information

The Particle Filter. PD Dr. Rudolph Triebel Computer Vision Group. Machine Learning for Computer Vision

The Particle Filter. PD Dr. Rudolph Triebel Computer Vision Group. Machine Learning for Computer Vision The Particle Filter Non-parametric implementation of Bayes filter Represents the belief (posterior) random state samples. by a set of This representation is approximate. Can represent distributions that

More information

Rapid Introduction to Machine Learning/ Deep Learning

Rapid Introduction to Machine Learning/ Deep Learning Rapid Introduction to Machine Learning/ Deep Learning Hyeong In Choi Seoul National University 1/32 Lecture 5a Bayesian network April 14, 2016 2/32 Table of contents 1 1. Objectives of Lecture 5a 2 2.Bayesian

More information

Bayesian Methods for Machine Learning

Bayesian Methods for Machine Learning Bayesian Methods for Machine Learning CS 584: Big Data Analytics Material adapted from Radford Neal s tutorial (http://ftp.cs.utoronto.ca/pub/radford/bayes-tut.pdf), Zoubin Ghahramni (http://hunch.net/~coms-4771/zoubin_ghahramani_bayesian_learning.pdf),

More information

Graphical Models. Andrea Passerini Statistical relational learning. Graphical Models

Graphical Models. Andrea Passerini Statistical relational learning. Graphical Models Andrea Passerini passerini@disi.unitn.it Statistical relational learning Probability distributions Bernoulli distribution Two possible values (outcomes): 1 (success), 0 (failure). Parameters: p probability

More information

PROBABILISTIC REASONING SYSTEMS

PROBABILISTIC REASONING SYSTEMS PROBABILISTIC REASONING SYSTEMS In which we explain how to build reasoning systems that use network models to reason with uncertainty according to the laws of probability theory. Outline Knowledge in uncertain

More information

p L yi z n m x N n xi

p L yi z n m x N n xi y i z n x n N x i Overview Directed and undirected graphs Conditional independence Exact inference Latent variables and EM Variational inference Books statistical perspective Graphical Models, S. Lauritzen

More information

Bayesian networks. Chapter Chapter

Bayesian networks. Chapter Chapter Bayesian networks Chapter 14.1 3 Chapter 14.1 3 1 Outline Syntax Semantics Parameterized distributions Chapter 14.1 3 2 Bayesian networks A simple, graphical notation for conditional independence assertions

More information

Directed Graphical Models

Directed Graphical Models Directed Graphical Models Instructor: Alan Ritter Many Slides from Tom Mitchell Graphical Models Key Idea: Conditional independence assumptions useful but Naïve Bayes is extreme! Graphical models express

More information

6.867 Machine learning, lecture 23 (Jaakkola)

6.867 Machine learning, lecture 23 (Jaakkola) Lecture topics: Markov Random Fields Probabilistic inference Markov Random Fields We will briefly go over undirected graphical models or Markov Random Fields (MRFs) as they will be needed in the context

More information

Pattern Recognition and Machine Learning

Pattern Recognition and Machine Learning Christopher M. Bishop Pattern Recognition and Machine Learning ÖSpri inger Contents Preface Mathematical notation Contents vii xi xiii 1 Introduction 1 1.1 Example: Polynomial Curve Fitting 4 1.2 Probability

More information

Latent Dirichlet Allocation

Latent Dirichlet Allocation Latent Dirichlet Allocation 1 Directed Graphical Models William W. Cohen Machine Learning 10-601 2 DGMs: The Burglar Alarm example Node ~ random variable Burglar Earthquake Arcs define form of probability

More information

Variational Inference (11/04/13)

Variational Inference (11/04/13) STA561: Probabilistic machine learning Variational Inference (11/04/13) Lecturer: Barbara Engelhardt Scribes: Matt Dickenson, Alireza Samany, Tracy Schifeling 1 Introduction In this lecture we will further

More information

Generative and Discriminative Approaches to Graphical Models CMSC Topics in AI

Generative and Discriminative Approaches to Graphical Models CMSC Topics in AI Generative and Discriminative Approaches to Graphical Models CMSC 35900 Topics in AI Lecture 2 Yasemin Altun January 26, 2007 Review of Inference on Graphical Models Elimination algorithm finds single

More information

CPSC 540: Machine Learning

CPSC 540: Machine Learning CPSC 540: Machine Learning MCMC and Non-Parametric Bayes Mark Schmidt University of British Columbia Winter 2016 Admin I went through project proposals: Some of you got a message on Piazza. No news is

More information

Cheng Soon Ong & Christian Walder. Canberra February June 2018

Cheng Soon Ong & Christian Walder. Canberra February June 2018 Cheng Soon Ong & Christian Walder Research Group and College of Engineering and Computer Science Canberra February June 2018 Outlines Overview Introduction Linear Algebra Probability Linear Regression

More information

Probabilistic Machine Learning

Probabilistic Machine Learning Probabilistic Machine Learning Bayesian Nets, MCMC, and more Marek Petrik 4/18/2017 Based on: P. Murphy, K. (2012). Machine Learning: A Probabilistic Perspective. Chapter 10. Conditional Independence Independent

More information

Probability. CS 3793/5233 Artificial Intelligence Probability 1

Probability. CS 3793/5233 Artificial Intelligence Probability 1 CS 3793/5233 Artificial Intelligence 1 Motivation Motivation Random Variables Semantics Dice Example Joint Dist. Ex. Axioms Agents don t have complete knowledge about the world. Agents need to make decisions

More information

Undirected Graphical Models: Markov Random Fields

Undirected Graphical Models: Markov Random Fields Undirected Graphical Models: Markov Random Fields 40-956 Advanced Topics in AI: Probabilistic Graphical Models Sharif University of Technology Soleymani Spring 2015 Markov Random Field Structure: undirected

More information

27 : Distributed Monte Carlo Markov Chain. 1 Recap of MCMC and Naive Parallel Gibbs Sampling

27 : Distributed Monte Carlo Markov Chain. 1 Recap of MCMC and Naive Parallel Gibbs Sampling 10-708: Probabilistic Graphical Models 10-708, Spring 2014 27 : Distributed Monte Carlo Markov Chain Lecturer: Eric P. Xing Scribes: Pengtao Xie, Khoa Luu In this scribe, we are going to review the Parallel

More information

Directed Probabilistic Graphical Models CMSC 678 UMBC

Directed Probabilistic Graphical Models CMSC 678 UMBC Directed Probabilistic Graphical Models CMSC 678 UMBC Announcement 1: Assignment 3 Due Wednesday April 11 th, 11:59 AM Any questions? Announcement 2: Progress Report on Project Due Monday April 16 th,

More information

Sampling Rejection Sampling Importance Sampling Markov Chain Monte Carlo. Sampling Methods. Oliver Schulte - CMPT 419/726. Bishop PRML Ch.

Sampling Rejection Sampling Importance Sampling Markov Chain Monte Carlo. Sampling Methods. Oliver Schulte - CMPT 419/726. Bishop PRML Ch. Sampling Methods Oliver Schulte - CMP 419/726 Bishop PRML Ch. 11 Recall Inference or General Graphs Junction tree algorithm is an exact inference method for arbitrary graphs A particular tree structure

More information

Bayesian Approaches Data Mining Selected Technique

Bayesian Approaches Data Mining Selected Technique Bayesian Approaches Data Mining Selected Technique Henry Xiao xiao@cs.queensu.ca School of Computing Queen s University Henry Xiao CISC 873 Data Mining p. 1/17 Probabilistic Bases Review the fundamentals

More information

STA 4273H: Statistical Machine Learning

STA 4273H: Statistical Machine Learning STA 4273H: Statistical Machine Learning Russ Salakhutdinov Department of Statistics! rsalakhu@utstat.toronto.edu! http://www.utstat.utoronto.ca/~rsalakhu/ Sidney Smith Hall, Room 6002 Lecture 3 Linear

More information

Variational Scoring of Graphical Model Structures

Variational Scoring of Graphical Model Structures Variational Scoring of Graphical Model Structures Matthew J. Beal Work with Zoubin Ghahramani & Carl Rasmussen, Toronto. 15th September 2003 Overview Bayesian model selection Approximations using Variational

More information

4 : Exact Inference: Variable Elimination

4 : Exact Inference: Variable Elimination 10-708: Probabilistic Graphical Models 10-708, Spring 2014 4 : Exact Inference: Variable Elimination Lecturer: Eric P. ing Scribes: Soumya Batra, Pradeep Dasigi, Manzil Zaheer 1 Probabilistic Inference

More information

Recall from last time: Conditional probabilities. Lecture 2: Belief (Bayesian) networks. Bayes ball. Example (continued) Example: Inference problem

Recall from last time: Conditional probabilities. Lecture 2: Belief (Bayesian) networks. Bayes ball. Example (continued) Example: Inference problem Recall from last time: Conditional probabilities Our probabilistic models will compute and manipulate conditional probabilities. Given two random variables X, Y, we denote by Lecture 2: Belief (Bayesian)

More information

Bayes Nets: Independence

Bayes Nets: Independence Bayes Nets: Independence [These slides were created by Dan Klein and Pieter Abbeel for CS188 Intro to AI at UC Berkeley. All CS188 materials are available at http://ai.berkeley.edu.] Bayes Nets A Bayes

More information

Bayesian Network. Outline. Bayesian Network. Syntax Semantics Exact inference by enumeration Exact inference by variable elimination

Bayesian Network. Outline. Bayesian Network. Syntax Semantics Exact inference by enumeration Exact inference by variable elimination Outline Syntax Semantics Exact inference by enumeration Exact inference by variable elimination s A simple, graphical notation for conditional independence assertions and hence for compact specication

More information

6.047 / Computational Biology: Genomes, Networks, Evolution Fall 2008

6.047 / Computational Biology: Genomes, Networks, Evolution Fall 2008 MIT OpenCourseWare http://ocw.mit.edu 6.047 / 6.878 Computational Biology: Genomes, Networks, Evolution Fall 2008 For information about citing these materials or our Terms of Use, visit: http://ocw.mit.edu/terms.

More information

CPSC 540: Machine Learning

CPSC 540: Machine Learning CPSC 540: Machine Learning Undirected Graphical Models Mark Schmidt University of British Columbia Winter 2016 Admin Assignment 3: 2 late days to hand it in today, Thursday is final day. Assignment 4:

More information

Intelligent Systems:

Intelligent Systems: Intelligent Systems: Undirected Graphical models (Factor Graphs) (2 lectures) Carsten Rother 15/01/2015 Intelligent Systems: Probabilistic Inference in DGM and UGM Roadmap for next two lectures Definition

More information

Computer Vision Group Prof. Daniel Cremers. 10a. Markov Chain Monte Carlo

Computer Vision Group Prof. Daniel Cremers. 10a. Markov Chain Monte Carlo Group Prof. Daniel Cremers 10a. Markov Chain Monte Carlo Markov Chain Monte Carlo In high-dimensional spaces, rejection sampling and importance sampling are very inefficient An alternative is Markov Chain

More information

CS 188: Artificial Intelligence. Bayes Nets

CS 188: Artificial Intelligence. Bayes Nets CS 188: Artificial Intelligence Probabilistic Inference: Enumeration, Variable Elimination, Sampling Pieter Abbeel UC Berkeley Many slides over this course adapted from Dan Klein, Stuart Russell, Andrew

More information

CS 343: Artificial Intelligence

CS 343: Artificial Intelligence CS 343: Artificial Intelligence Bayes Nets: Sampling Prof. Scott Niekum The University of Texas at Austin [These slides based on those of Dan Klein and Pieter Abbeel for CS188 Intro to AI at UC Berkeley.

More information

p(x) p(x Z) = y p(y X, Z) = αp(x Y, Z)p(Y Z)

p(x) p(x Z) = y p(y X, Z) = αp(x Y, Z)p(Y Z) Graphical Models Foundations of Data Analysis Torsten Möller Möller/Mori 1 Reading Chapter 8 Pattern Recognition and Machine Learning by Bishop some slides from Russell and Norvig AIMA2e Möller/Mori 2

More information

Inference in Graphical Models Variable Elimination and Message Passing Algorithm

Inference in Graphical Models Variable Elimination and Message Passing Algorithm Inference in Graphical Models Variable Elimination and Message Passing lgorithm Le Song Machine Learning II: dvanced Topics SE 8803ML, Spring 2012 onditional Independence ssumptions Local Markov ssumption

More information

Conditional Independence and Factorization

Conditional Independence and Factorization Conditional Independence and Factorization Seungjin Choi Department of Computer Science and Engineering Pohang University of Science and Technology 77 Cheongam-ro, Nam-gu, Pohang 37673, Korea seungjin@postech.ac.kr

More information

2 : Directed GMs: Bayesian Networks

2 : Directed GMs: Bayesian Networks 10-708: Probabilistic Graphical Models 10-708, Spring 2017 2 : Directed GMs: Bayesian Networks Lecturer: Eric P. Xing Scribes: Jayanth Koushik, Hiroaki Hayashi, Christian Perez Topic: Directed GMs 1 Types

More information

Outline. CSE 573: Artificial Intelligence Autumn Bayes Nets: Big Picture. Bayes Net Semantics. Hidden Markov Models. Example Bayes Net: Car

Outline. CSE 573: Artificial Intelligence Autumn Bayes Nets: Big Picture. Bayes Net Semantics. Hidden Markov Models. Example Bayes Net: Car CSE 573: Artificial Intelligence Autumn 2012 Bayesian Networks Dan Weld Many slides adapted from Dan Klein, Stuart Russell, Andrew Moore & Luke Zettlemoyer Outline Probabilistic models (and inference)

More information

Undirected Graphical Models

Undirected Graphical Models Undirected Graphical Models 1 Conditional Independence Graphs Let G = (V, E) be an undirected graph with vertex set V and edge set E, and let A, B, and C be subsets of vertices. We say that C separates

More information

Probabilistic Models Bayesian Networks Markov Random Fields Inference. Graphical Models. Foundations of Data Analysis

Probabilistic Models Bayesian Networks Markov Random Fields Inference. Graphical Models. Foundations of Data Analysis Graphical Models Foundations of Data Analysis Torsten Möller and Thomas Torsney-Weir Möller/Mori 1 Reading Chapter 8 Pattern Recognition and Machine Learning by Bishop some slides from Russell and Norvig

More information

Conditional Independence

Conditional Independence Conditional Independence Sargur Srihari srihari@cedar.buffalo.edu 1 Conditional Independence Topics 1. What is Conditional Independence? Factorization of probability distribution into marginals 2. Why

More information

Artificial Intelligence Bayes Nets: Independence

Artificial Intelligence Bayes Nets: Independence Artificial Intelligence Bayes Nets: Independence Instructors: David Suter and Qince Li Course Delivered @ Harbin Institute of Technology [Many slides adapted from those created by Dan Klein and Pieter

More information

Introduction to Probabilistic Graphical Models

Introduction to Probabilistic Graphical Models Introduction to Probabilistic Graphical Models Sargur Srihari srihari@cedar.buffalo.edu 1 Topics 1. What are probabilistic graphical models (PGMs) 2. Use of PGMs Engineering and AI 3. Directionality in

More information

Computer Vision Group Prof. Daniel Cremers. 14. Sampling Methods

Computer Vision Group Prof. Daniel Cremers. 14. Sampling Methods Prof. Daniel Cremers 14. Sampling Methods Sampling Methods Sampling Methods are widely used in Computer Science as an approximation of a deterministic algorithm to represent uncertainty without a parametric

More information

Introduction to Machine Learning

Introduction to Machine Learning Introduction to Machine Learning Brown University CSCI 1950-F, Spring 2012 Prof. Erik Sudderth Lecture 25: Markov Chain Monte Carlo (MCMC) Course Review and Advanced Topics Many figures courtesy Kevin

More information

CS Lecture 4. Markov Random Fields

CS Lecture 4. Markov Random Fields CS 6347 Lecture 4 Markov Random Fields Recap Announcements First homework is available on elearning Reminder: Office hours Tuesday from 10am-11am Last Time Bayesian networks Today Markov random fields

More information

Sampling Rejection Sampling Importance Sampling Markov Chain Monte Carlo. Sampling Methods. Machine Learning. Torsten Möller.

Sampling Rejection Sampling Importance Sampling Markov Chain Monte Carlo. Sampling Methods. Machine Learning. Torsten Möller. Sampling Methods Machine Learning orsten Möller Möller/Mori 1 Recall Inference or General Graphs Junction tree algorithm is an exact inference method for arbitrary graphs A particular tree structure defined

More information

Probabilistic Graphical Models

Probabilistic Graphical Models Probabilistic Graphical Models Lecture 9 Undirected Models CS/CNS/EE 155 Andreas Krause Announcements Homework 2 due next Wednesday (Nov 4) in class Start early!!! Project milestones due Monday (Nov 9)

More information

Probabilistic Graphical Models: MRFs and CRFs. CSE628: Natural Language Processing Guest Lecturer: Veselin Stoyanov

Probabilistic Graphical Models: MRFs and CRFs. CSE628: Natural Language Processing Guest Lecturer: Veselin Stoyanov Probabilistic Graphical Models: MRFs and CRFs CSE628: Natural Language Processing Guest Lecturer: Veselin Stoyanov Why PGMs? PGMs can model joint probabilities of many events. many techniques commonly

More information

Objectives. Probabilistic Reasoning Systems. Outline. Independence. Conditional independence. Conditional independence II.

Objectives. Probabilistic Reasoning Systems. Outline. Independence. Conditional independence. Conditional independence II. Copyright Richard J. Povinelli rev 1.0, 10/1//2001 Page 1 Probabilistic Reasoning Systems Dr. Richard J. Povinelli Objectives You should be able to apply belief networks to model a problem with uncertainty.

More information

13 : Variational Inference: Loopy Belief Propagation and Mean Field

13 : Variational Inference: Loopy Belief Propagation and Mean Field 10-708: Probabilistic Graphical Models 10-708, Spring 2012 13 : Variational Inference: Loopy Belief Propagation and Mean Field Lecturer: Eric P. Xing Scribes: Peter Schulam and William Wang 1 Introduction

More information

Undirected graphical models

Undirected graphical models Undirected graphical models Semantics of probabilistic models over undirected graphs Parameters of undirected models Example applications COMP-652 and ECSE-608, February 16, 2017 1 Undirected graphical

More information

Probabilistic Reasoning Systems

Probabilistic Reasoning Systems Probabilistic Reasoning Systems Dr. Richard J. Povinelli Copyright Richard J. Povinelli rev 1.0, 10/7/2001 Page 1 Objectives You should be able to apply belief networks to model a problem with uncertainty.

More information

The Ising model and Markov chain Monte Carlo

The Ising model and Markov chain Monte Carlo The Ising model and Markov chain Monte Carlo Ramesh Sridharan These notes give a short description of the Ising model for images and an introduction to Metropolis-Hastings and Gibbs Markov Chain Monte

More information

Bayesian Inference for Dirichlet-Multinomials

Bayesian Inference for Dirichlet-Multinomials Bayesian Inference for Dirichlet-Multinomials Mark Johnson Macquarie University Sydney, Australia MLSS Summer School 1 / 50 Random variables and distributed according to notation A probability distribution

More information