Graphical Models and Kernel Methods
Jerry Zhu, Department of Computer Sciences, University of Wisconsin-Madison, USA. MLSS, June 17.
Outline
- Graphical Models
  - Probabilistic Inference
  - Directed vs. Undirected Graphical Models
  - Inference
  - Parameter Estimation
- Kernel Methods
  - Support Vector Machines
  - Kernel PCA
  - Reproducing Kernel Hilbert Spaces
The envelope quiz
- red ball = $$$
- You randomly picked an envelope, randomly took out a ball, and it was black.
- Should you choose this envelope or the other envelope?

The envelope quiz: probabilistic inference
- Joint distribution on E ∈ {1, 0}, B ∈ {r, b}: P(E, B) = P(E)P(B | E)
- P(E = 1) = P(E = 0) = 1/2
- P(B = r | E = 1) = 1/2, P(B = r | E = 0) = 0
- The graphical model: E → B
- Statistical decision theory: switch if P(E = 1 | B = b) < 1/2
- P(E = 1 | B = b) = P(B = b | E = 1)P(E = 1) / P(B = b) = (1/2 · 1/2) / (3/4) = 1/3. Switch.
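A quick numeric check of this Bayes-rule computation (a minimal sketch; the variable names are ours):

```python
# Posterior P(E=1 | B=black) for the envelope quiz via Bayes rule.
p_E1 = 0.5                     # prior P(E = 1)
p_black_given_E1 = 0.5         # P(B = b | E = 1)
p_black_given_E0 = 1.0         # P(B = b | E = 0)

p_black = p_black_given_E1 * p_E1 + p_black_given_E0 * (1 - p_E1)   # = 3/4
p_E1_given_black = p_black_given_E1 * p_E1 / p_black                # = 1/3
print(p_E1_given_black)        # 0.333... < 1/2, so switch
```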
Reasoning with uncertainty
- The world is reduced to a set of random variables x_1, ..., x_d
  - e.g. (x_1, ..., x_{d-1}) a feature vector, x_d ≡ y the class label
- Inference: given the joint distribution p(x_1, ..., x_d), compute p(X_Q | X_E), where X_Q, X_E ⊆ {x_1, ..., x_d}
  - e.g. Q = {d}, E = {1, ..., d-1}; by the definition of conditional probability,
    p(x_d | x_1, ..., x_{d-1}) = p(x_1, ..., x_{d-1}, x_d) / Σ_v p(x_1, ..., x_{d-1}, x_d = v)
- Learning: estimate p(x_1, ..., x_d) from training data X^(1), ..., X^(N), where X^(i) = (x_1^(i), ..., x_d^(i))
It is difficult to reason with uncertainty
- The joint distribution p(x_1, ..., x_d):
  - exponential naive storage (2^d for binary r.v.)
  - hard to interpret (conditional independence)
- Inference p(X_Q | X_E):
  - often can't afford to do it by brute force
- If p(x_1, ..., x_d) is not given, estimate it from data:
  - often can't afford to do it by brute force
- Graphical models: efficient representation, inference, and learning on p(x_1, ..., x_d), exactly or approximately
What are graphical models?
- Graphical model = joint distribution p(x_1, ..., x_d)
  - Bayesian network or Markov random field
  - conditional independence
- Inference = p(X_Q | X_E), in general X_Q, X_E ⊆ {x_1, ..., x_d}
  - exact, MCMC, variational
- If p(x_1, ..., x_d) is not given, estimate it from data
  - parameter and structure learning
Graphical-Model-Nots
- Graphical models are the study of probabilistic models
- Just because there are nodes and edges doesn't mean it's a graphical model
- These are not graphical models: neural networks, decision trees, network flows, the HMM template (but HMMs are!)
Outline
- Graphical Models
  - Probabilistic Inference
  - Directed vs. Undirected Graphical Models
  - Inference
  - Parameter Estimation
- Kernel Methods
  - Support Vector Machines
  - Kernel PCA
  - Reproducing Kernel Hilbert Spaces
Directed graphical models
- Also called Bayesian networks
- A directed graph has nodes x_1, ..., x_d, some of them connected by directed edges x_i → x_j
- A cycle is a directed path x_1 → ... → x_k where x_1 = x_k
- A directed acyclic graph (DAG) contains no cycles
- A Bayesian network on the DAG is the family of distributions satisfying
  {p | p(x_1, ..., x_d) = Π_i p(x_i | Pa(x_i))}
  where Pa(x_i) is the set of parents of x_i
- p(x_i | Pa(x_i)) is the conditional probability distribution (CPD) at x_i
- By specifying the CPDs for all i, we specify the joint distribution p(x_1, ..., x_d)
Example: Burglary, Earthquake, Alarm, John and Mary
Binary variables, with the DAG B → A ← E, A → J, A → M and CPDs:
- P(B) = 0.001, P(E) = 0.002
- P(A | B, E) = 0.95, P(A | B, ~E) = 0.94, P(A | ~B, E) = 0.29, P(A | ~B, ~E) = 0.001
- P(J | A) = 0.9, P(J | ~A) = 0.05
- P(M | A) = 0.7, P(M | ~A) = 0.01
Example joint probability:
P(B, ~E, A, J, ~M) = P(B) P(~E) P(A | B, ~E) P(J | A) P(~M | A) = 0.001 · 0.998 · 0.94 · 0.9 · (1 − 0.7)
Example: Naive Bayes
(Graph: class label y with children x_1, ..., x_d; equivalently a plate over the x_i.)
- p(y, x_1, ..., x_d) = p(y) Π_{i=1}^d p(x_i | y)
- Plate representation on the right
- p(y) multinomial
- p(x_i | y) depends on the feature type: multinomial (count x_i), Gaussian (continuous x_i), etc.
No Causality Whatsoever
The network A → B with P(A) = a, P(B | A) = b, P(B | ~A) = c is equivalent to the network B → A with
P(B) = ab + (1 − a)c, P(A | B) = ab / (ab + (1 − a)c), P(A | ~B) = a(1 − b) / (1 − ab − (1 − a)c)
- The two BNs are equivalent in all respects
- Do not read causality from Bayesian networks; they only represent correlation (a joint probability distribution)
- However, it is perfectly fine to design BNs causally
What do we need probabilistic models for?
- Make predictions: p(y | x) plus decision theory
- Interpret models: very natural to include latent variables
Example: Latent Dirichlet Allocation (LDA)
(Plate diagram: topics φ with prior β in a plate of size T; per-document θ with prior α in a plate of size D; per-word topic z and word w in a plate of size N_d.)
A generative model for p(φ, θ, z, w | α, β):
- For each topic t: φ_t ~ Dirichlet(β)
- For each document d: θ ~ Dirichlet(α)
- For each word position in d: topic z ~ Multinomial(θ), word w ~ Multinomial(φ_z)
Inference goals: p(z | w, α, β), argmax_{φ,θ} p(φ, θ | w, α, β)
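As a concrete illustration of the generative process above, here is a minimal sampling sketch in Python/NumPy; the corpus sizes and hyperparameter values are made-up toy numbers, not anything from the slides:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy sizes: T topics, D documents, W vocabulary words, N words per document.
T, D, W, N = 3, 5, 10, 20
alpha, beta = 0.1, 0.01

phi = rng.dirichlet(np.full(W, beta), size=T)        # phi_t ~ Dirichlet(beta), one word distribution per topic
docs = []
for d in range(D):
    theta = rng.dirichlet(np.full(T, alpha))         # theta ~ Dirichlet(alpha), per-document topic proportions
    z = rng.choice(T, size=N, p=theta)               # topic z ~ Multinomial(theta) at each word position
    w = np.array([rng.choice(W, p=phi[t]) for t in z])  # word w ~ Multinomial(phi_z)
    docs.append((z, w))
```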
Conditional Independence
- Two r.v.s A, B are independent if P(A, B) = P(A)P(B), equivalently P(A | B) = P(A), equivalently P(B | A) = P(B)
- Two r.v.s A, B are conditionally independent given C if P(A, B | C) = P(A | C)P(B | C), equivalently P(A | B, C) = P(A | C), equivalently P(B | A, C) = P(B | C)
- This extends to groups of r.v.s
- Conditional independence in a BN is precisely specified by d-separation ("directed separation")
d-separation, Case 1: Tail-to-Tail (A ← C → B)
- A, B in general dependent
- A, B conditionally independent given C (observed nodes are shaded)
- An observed C is a tail-to-tail node; it blocks the undirected path A–B
d-separation, Case 2: Head-to-Tail (A → C → B)
- A, B in general dependent
- A, B conditionally independent given C
- An observed C is a head-to-tail node; it blocks the path A–B
d-separation, Case 3: Head-to-Head (A → C ← B)
- A, B in general independent
- A, B conditionally dependent given C, or given any of C's descendants
- An observed C is a head-to-head node; it unblocks the path A–B
d-separation
Variable groups A and B are conditionally independent given C if all undirected paths from nodes in A to nodes in B are blocked.
d-separation, Example 1
(Graph with nodes A, B, C, E, F.) The undirected path from A to B is unblocked at E (because of the observed descendant C) and is not blocked at F.
- A, B are dependent given C

d-separation, Example 2
With F observed instead, the path from A to B is blocked both at E and at F.
- A, B are conditionally independent given F
Undirected graphical models
- Also known as Markov random fields
- Recall directed graphical models require a DAG and locally normalized CPDs
  - efficient computation, but restrictive
- A clique C in an undirected graph is a set of fully connected nodes (full of loops!)
- Define a nonnegative potential function ψ_C : X_C → R+
- An undirected graphical model is the family of distributions satisfying
  {p | p(X) = (1/Z) Π_C ψ_C(X_C)}
- Z = ∫ Π_C ψ_C(X_C) dX is the partition function
Example: A Tiny Markov Random Field
Two nodes x_1, x_2 ∈ {−1, 1} joined by an edge, forming a single clique C:
- ψ_C(x_1, x_2) = e^{a x_1 x_2}
- p(x_1, x_2) = (1/Z) e^{a x_1 x_2}
- Z = e^a + e^{−a} + e^{−a} + e^a
- p(1, 1) = p(−1, −1) = e^a / (2e^a + 2e^{−a})
- p(−1, 1) = p(1, −1) = e^{−a} / (2e^a + 2e^{−a})
- When the parameter a > 0, the model favors homogeneous assignments (x_1 = x_2); when a < 0, it favors inhomogeneous ones
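The partition function and the four joint probabilities above are easy to check by brute-force enumeration; a minimal sketch (the value of a is arbitrary):

```python
import math

# Tiny two-node MRF: enumerate Z and the four joint probabilities.
a = 1.0                                   # clique parameter (any real value)
states = [-1, 1]
psi = lambda x1, x2: math.exp(a * x1 * x2)               # single clique potential
Z = sum(psi(x1, x2) for x1 in states for x2 in states)   # partition function
p = {(x1, x2): psi(x1, x2) / Z for x1 in states for x2 in states}
print(Z, p[(1, 1)], p[(1, -1)])           # with a > 0, agreeing states are more likely
```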
Log-Linear Models
- Real-valued feature functions f_1(X), ..., f_k(X)
- Real-valued weights w_1, ..., w_k:
  p(X) = (1/Z) exp(Σ_{i=1}^k w_i f_i(X))
- Equivalent to an MRF p(X) = (1/Z) Π_C ψ_C(X_C) with ψ_C(X_C) = exp(w_C f_C(X))
Example: Ising Model
(Figure: a grid of nodes x_s with node parameters θ_s and edge parameters θ_st.)
This is an undirected model with x ∈ {0, 1}:
p_θ(x) = (1/Z) exp(Σ_{s∈V} θ_s x_s + Σ_{(s,t)∈E} θ_st x_s x_t)
The features are f_s(X) = x_s and f_st(X) = x_s x_t.
Example: Image Denoising [From Bishop PRML]
(Figure: a noisy binary image Y and the restored image argmax_X P(X | Y).)
p_θ(X | Y) = (1/Z) exp(Σ_{s∈V} θ_s x_s + Σ_{(s,t)∈E} θ_st x_s x_t)
with θ_s = c if y_s = 1, θ_s = −c if y_s = 0, and θ_st > 0.
Example: Gaussian Random Field
- p(X) = N(µ, Σ) = (1 / ((2π)^{n/2} |Σ|^{1/2})) exp(−(1/2)(X − µ)^T Σ^{−1} (X − µ)), a multivariate Gaussian
- The n × n covariance matrix Σ is positive definite (so its inverse exists)
- Let Ω = Σ^{−1} be the precision matrix
- x_i, x_j are conditionally independent given all other variables if and only if Ω_ij = 0
- When Ω_ij ≠ 0, there is an edge between x_i and x_j
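A small numeric illustration of the precision-matrix statement, using a hypothetical 3-variable chain x_1 — x_2 — x_3 (our own toy example, not from the slides):

```python
import numpy as np

# x_1 and x_3 interact only through x_2: the precision matrix Omega is
# tridiagonal (Omega_13 = 0), so x_1, x_3 are conditionally independent
# given x_2, even though their covariance Sigma_13 is nonzero.
Omega = np.array([[ 2.0, -1.0,  0.0],
                  [-1.0,  2.0, -1.0],
                  [ 0.0, -1.0,  2.0]])
Sigma = np.linalg.inv(Omega)
print(Sigma[0, 2])   # nonzero: x_1, x_3 are marginally dependent
print(Omega[0, 2])   # zero: no edge between x_1 and x_3 in the graph
```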
Conditional Independence in Markov Random Fields
Two groups of variables A and B are conditionally independent given a third group C if A and B become disconnected after removing C and all edges involving C.
Outline
- Graphical Models
  - Probabilistic Inference
  - Directed vs. Undirected Graphical Models
  - Inference
  - Parameter Estimation
- Kernel Methods
  - Support Vector Machines
  - Kernel PCA
  - Reproducing Kernel Hilbert Spaces
Exact Inference
Inference by Enumeration
- Let X = (X_Q, X_E, X_O) for query, evidence, and other variables
- Goal: P(X_Q | X_E)
  P(X_Q | X_E) = P(X_Q, X_E) / P(X_E) = Σ_{X_O} P(X_Q, X_E, X_O) / Σ_{X_Q, X_O} P(X_Q, X_E, X_O)
- This sums an exponential number of terms: with k variables in X_O each taking r values, there are r^k terms
- Not covered: variable elimination and the junction tree (aka clique tree) algorithm
Markov Chain Monte Carlo
- Forward sampling
- Gibbs sampling
- Collapsed Gibbs sampling
- Not covered: block Gibbs, Metropolis-Hastings, etc.
- Unbiased (after burn-in), but can have high variance
Monte Carlo Methods
- Consider the inference problem p(X_Q = c_Q | X_E), where X_Q, X_E ⊆ {x_1, ..., x_d}:
  p(X_Q = c_Q | X_E) = ∫ 1_(X_Q = c_Q) p(X_Q | X_E) dX_Q
- If we can draw samples X_Q^(1), ..., X_Q^(m) ~ p(X_Q | X_E), an unbiased estimator is
  p(X_Q = c_Q | X_E) ≈ (1/m) Σ_{i=1}^m 1_(X_Q^(i) = c_Q)
- The variance of the estimator decreases as O(1/m)
- Inference reduces to sampling from p(X_Q | X_E)
Forward Sampling
- Draw X ~ P(X)
- Throw away X if it doesn't match the evidence X_E
Forward Sampling: Example
(Same alarm network and CPTs as before.) To generate a sample X = (B, E, A, J, M):
1. Sample B ~ Ber(0.001): draw r ~ U(0, 1); if r < 0.001 then B = 1, else B = 0
2. Sample E ~ Ber(0.002)
3. If B = 1 and E = 1, sample A ~ Ber(0.95), and so on for the other parent configurations
4. If A = 1, sample J ~ Ber(0.9), else J ~ Ber(0.05)
5. If A = 1, sample M ~ Ber(0.7), else M ~ Ber(0.01)
Inference with Forward Sampling
- Say the inference task is P(B = 1 | E = 1, M = 1)
- Throw away all samples except those with (E = 1, M = 1):
  p(B = 1 | E = 1, M = 1) ≈ (1/m) Σ_{i=1}^m 1_(B^(i) = 1)
  where m is the number of surviving samples
- Can be highly inefficient (note P(E = 1) is tiny); see the sketch below
- Does not work for Markov random fields (we can't sample from P(X) directly)
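A sketch of forward sampling plus rejection for this query; the function and variable names are ours, and the CPTs are those of the alarm network above, except P(A | ~B, ~E), whose value is cut off in the transcription and is assumed here to be 0.001 as in the standard textbook version of this example:

```python
import random

# Forward-sample the alarm network, then reject samples that do not match
# the evidence (E=1, M=1). With P(E=1)=0.002 almost everything is rejected,
# which is exactly the inefficiency noted above.
def sample_once():
    B = random.random() < 0.001
    E = random.random() < 0.002
    pA = {(1, 1): 0.95, (1, 0): 0.94, (0, 1): 0.29, (0, 0): 0.001}[(int(B), int(E))]
    A = random.random() < pA
    J = random.random() < (0.9 if A else 0.05)
    M = random.random() < (0.7 if A else 0.01)
    return B, E, A, J, M

kept, hits = 0, 0
for _ in range(1_000_000):
    B, E, A, J, M = sample_once()
    if E and M:                 # keep only samples matching the evidence
        kept += 1
        hits += B
print(kept, hits / max(kept, 1))   # estimate of P(B=1 | E=1, M=1)
```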
Gibbs Sampling: Example P(B = 1 | E = 1, M = 1)
- Gibbs sampling is a Markov chain Monte Carlo (MCMC) method
- Directly sample from p(X_Q | X_E)
- Works for both directed and undirected graphical models
- Initialization: fix the evidence; randomly set the other variables,
  e.g. X^(0) = (B = 0, E = 1, A = 0, J = 0, M = 1)
Gibbs Sampling
- For each non-evidence variable x_i, fixing all other nodes X_{−i}, resample its value x_i ~ P(x_i | X_{−i})
- This is equivalent to x_i ~ P(x_i | MarkovBlanket(x_i))
- For a Bayesian network, MarkovBlanket(x_i) includes x_i's parents, spouses, and children:
  P(x_i | MarkovBlanket(x_i)) ∝ P(x_i | Pa(x_i)) Π_{y ∈ C(x_i)} P(y | Pa(y))
  where Pa(x) are the parents of x and C(x) the children of x
- For many graphical models the Markov blanket is small
- For example, B ~ P(B | E = 1, A = 0) ∝ P(B) P(A = 0 | B, E = 1) (computed numerically in the sketch below)
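For the last bullet, the Markov-blanket conditional is just two products of CPT entries followed by normalization; a minimal sketch using the alarm-network CPTs from the earlier example:

```python
# Gibbs update for B given its Markov blanket (E=1, A=0):
# P(B | E=1, A=0) ∝ P(B) P(A=0 | B, E=1).
pB = {1: 0.001, 0: 0.999}
pA1_given = {(1, 1): 0.95, (0, 1): 0.29}      # P(A=1 | B, E=1)
unnorm = {b: pB[b] * (1 - pA1_given[(b, 1)]) for b in (0, 1)}
Z = sum(unnorm.values())
p_B_given_blanket = {b: v / Z for b, v in unnorm.items()}
print(p_B_given_blanket)   # distribution used to resample B
```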
Gibbs Sampling (continued)
- Say we sampled B = 1. Then X^(1) = (B = 1, E = 1, A = 0, J = 0, M = 1)
- Starting from X^(1), sample A ~ P(A | B = 1, E = 1, J = 0, M = 1) to get X^(2)
- Move on to J, then repeat B, A, J, B, A, J, ...
- Keep all samples after burn-in. P(B = 1 | E = 1, M = 1) is then the fraction of samples with B = 1
Gibbs Sampling Example 2: The Ising Model
(Figure: a node x_s with neighbors A, B, C, D.)
This is an undirected model with x ∈ {0, 1}:
p_θ(x) = (1/Z) exp(Σ_{s∈V} θ_s x_s + Σ_{(s,t)∈E} θ_st x_s x_t)
- The Markov blanket of x_s is {A, B, C, D}
- In general, for undirected graphical models, p(x_s | x_{−s}) = p(x_s | x_{N(s)}), where N(s) is the set of neighbors of s
- The Gibbs update is
  p(x_s = 1 | x_{N(s)}) = 1 / (1 + exp(−(θ_s + Σ_{t∈N(s)} θ_st x_t)))
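A sketch of one Gibbs sweep over a small Ising grid using this update; the grid size and the parameter values θ_s, θ_st are arbitrary illustrative choices:

```python
import math
import random

# One Gibbs sweep over an n x n Ising grid with x in {0, 1}, using
# p(x_s = 1 | x_N(s)) = 1 / (1 + exp(-(theta_s + sum_t theta_st * x_t))).
n = 8
theta_s, theta_st = 0.0, 0.8                  # node and edge parameters (toy values)
x = [[random.randint(0, 1) for _ in range(n)] for _ in range(n)]

def neighbors(i, j):
    for di, dj in ((-1, 0), (1, 0), (0, -1), (0, 1)):
        if 0 <= i + di < n and 0 <= j + dj < n:
            yield i + di, j + dj

for i in range(n):
    for j in range(n):
        u = theta_s + theta_st * sum(x[a][b] for a, b in neighbors(i, j))
        p1 = 1.0 / (1.0 + math.exp(-u))       # p(x_s = 1 | neighbors)
        x[i][j] = 1 if random.random() < p1 else 0
```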
Gibbs Sampling as a Markov Chain
- A Markov chain is defined by a transition matrix T(X' | X)
- Certain Markov chains have a stationary distribution π such that π = Tπ
- The Gibbs sampler is such a Markov chain, with T_i((X_{−i}, x_i) → (X_{−i}, x_i')) = p(x_i' | X_{−i}) and stationary distribution p(X_Q | X_E)
- But it takes time for the chain to reach its stationary distribution (to mix)
- It can be difficult to assert mixing
- In practice, "burn-in": discard X^(0), ..., X^(T)
- Use all of X^(T+1), ... for inference (they are correlated); do not thin
Collapsed Gibbs Sampling
- In general, E_p[f(X)] ≈ (1/m) Σ_{i=1}^m f(X^(i)) for X^(i) ~ p
- Sometimes X = (Y, Z), where the expectation over Z given Y has a closed form
- If so, for Y^(i) ~ p(Y),
  E_p[f(X)] = E_{p(Y)} E_{p(Z|Y)}[f(Y, Z)] ≈ (1/m) Σ_{i=1}^m E_{p(Z|Y^(i))}[f(Y^(i), Z)]
- No need to sample Z: it is "collapsed"
- The collapsed Gibbs sampler has T_i((Y_{−i}, y_i) → (Y_{−i}, y_i')) = p(y_i' | Y_{−i})
- Note p(y_i' | Y_{−i}) = ∫ p(y_i', Z | Y_{−i}) dZ
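A toy illustration of why collapsing helps, using a made-up model (Y ~ Bernoulli(0.3), Z | Y ~ N(Y, 1), f(X) = Z) in which E[Z | Y] = Y is available in closed form:

```python
import numpy as np

# Both estimators target E[Z] = 0.3; the collapsed one replaces the sampled
# Z with its closed-form conditional expectation E[Z | Y] = Y, so it never
# samples Z and typically has lower variance.
rng = np.random.default_rng(0)
m = 1000
Y = rng.binomial(1, 0.3, size=m)
Z = rng.normal(Y, 1.0)

naive = Z.mean()            # averages f(Y, Z) over joint samples
collapsed = Y.mean()        # averages E[f(Y, Z) | Y] over samples of Y only
print(naive, collapsed)
```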
Example: Collapsed Gibbs Sampling for LDA
Collapse θ and φ; the Gibbs update is
P(z_i = j | z_{−i}, w) ∝ ((n^(w_i)_{−i,j} + β) / (n^(·)_{−i,j} + Wβ)) · ((n^(d_i)_{−i,j} + α) / (n^(d_i)_{−i,·} + Tα))
- n^(w_i)_{−i,j}: number of times word w_i has been assigned to topic j, excluding the current position
- n^(d_i)_{−i,j}: number of times a word from document d_i has been assigned to topic j, excluding the current position
- n^(·)_{−i,j}: number of times any word has been assigned to topic j, excluding the current position
- n^(d_i)_{−i,·}: length of document d_i, excluding the current position
(W is the vocabulary size and T the number of topics.)
Belief Propagation
Factor Graph
- Works for both directed and undirected graphical models
- Bipartite: edges only between a variable node and a factor node
- Factors represent computation
(Figure: an undirected clique on A, B, C becomes a factor node f = ψ(A, B, C) connected to A, B, C; the directed model A → C ← B becomes a factor node f = P(A)P(B)P(C | A, B) connected to A, B, C.)
The Sum-Product Algorithm
- Also known as belief propagation (BP)
- Exact if the graph is a tree; otherwise known as loopy BP, which is approximate
- The algorithm involves passing messages on the factor graph
- Alternative view: variational approximation (more later)
Example: A Simple HMM
The Hidden Markov Model template (not a graphical model):
(Figure: two hidden states 1 and 2 with self-transition probabilities 1/4 and 1/2; initial probabilities π_1 = π_2 = 1/2; emission distributions over (R, G, B): P(x | z = 1) = (1/2, 1/4, 1/4), P(x | z = 2) = (1/4, 1/2, 1/4).)
Observing x_1 = R, x_2 = G, the directed graphical model is z_1 → z_2 with observed children x_1 = R and x_2 = G.
The corresponding factor graph is f_1 — z_1 — f_2 — z_2, with factors
f_1(z_1) = P(z_1) P(x_1 = R | z_1) and f_2(z_1, z_2) = P(z_2 | z_1) P(x_2 = G | z_2).
Messages
- A message is a vector of length K, where K is the number of values x takes
- There are two types of messages:
  1. µ_{f→x}: message from a factor node f to a variable node x; µ_{f→x}(i) is its ith element, i = 1, ..., K
  2. µ_{x→f}: message from a variable node x to a factor node f
Leaf Messages
- Assume a tree factor graph. Pick an arbitrary root, say z_2
- Start messages at the leaves
- If a leaf is a factor node f, µ_{f→x}(x) = f(x):
  µ_{f_1→z_1}(z_1 = 1) = P(z_1 = 1) P(R | z_1 = 1) = 1/2 · 1/2 = 1/4
  µ_{f_1→z_1}(z_1 = 2) = P(z_1 = 2) P(R | z_1 = 2) = 1/2 · 1/4 = 1/8
- If a leaf is a variable node x, µ_{x→f}(x) = 1
Message from Variable to Factor
- A node (factor or variable) can send out a message once all its other incoming messages have arrived
- Let x be in factor f_s, and let ne(x)\f_s be the factors connected to x excluding f_s:
  µ_{x→f_s}(x) = Π_{f ∈ ne(x)\f_s} µ_{f→x}(x)
- Here µ_{z_1→f_2}(z_1 = 1) = 1/4 and µ_{z_1→f_2}(z_1 = 2) = 1/8
Message from Factor to Variable
- Let x be in factor f_s, and let the other variables in f_s be x_{1:M}:
  µ_{f_s→x}(x) = Σ_{x_1} ... Σ_{x_M} f_s(x, x_1, ..., x_M) Π_{m=1}^M µ_{x_m→f_s}(x_m)
- In this example
  µ_{f_2→z_2}(s) = Σ_{s'=1}^2 µ_{z_1→f_2}(s') f_2(z_1 = s', z_2 = s)
                 = 1/4 · P(z_2 = s | z_1 = 1) P(x_2 = G | z_2 = s) + 1/8 · P(z_2 = s | z_1 = 2) P(x_2 = G | z_2 = s)
- We get µ_{f_2→z_2}(z_2 = 1) = 1/32 and µ_{f_2→z_2}(z_2 = 2) = 1/8
Up to Root, Back Down
The message has reached the root; now pass messages back down:
µ_{z_2→f_2}(z_2 = 1) = 1, µ_{z_2→f_2}(z_2 = 2) = 1
Keep Passing Down
µ_{f_2→z_1}(s) = Σ_{s'=1}^2 µ_{z_2→f_2}(s') f_2(z_1 = s, z_2 = s')
               = 1 · P(z_2 = 1 | z_1 = s) P(x_2 = G | z_2 = 1) + 1 · P(z_2 = 2 | z_1 = s) P(x_2 = G | z_2 = 2)
We get µ_{f_2→z_1}(z_1 = 1) = 7/16 and µ_{f_2→z_1}(z_1 = 2) = 3/8.
From Messages to Marginals
- Once a variable has received all incoming messages, its marginal is
  p(x) ∝ Π_{f ∈ ne(x)} µ_{f→x}(x)
- In this example
  P(z_1 | x_1, x_2) ∝ µ_{f_1→z_1} ⊙ µ_{f_2→z_1} = (1/4, 1/8) ⊙ (7/16, 3/8) = (7/64, 3/64), i.e. (0.7, 0.3)
  P(z_2 | x_1, x_2) ∝ µ_{f_2→z_2} = (1/32, 1/8), i.e. (0.2, 0.8)
- One can also compute the marginal of the set of variables x_s involved in a factor f_s:
  p(x_s) ∝ f_s(x_s) Π_{x ∈ ne(f_s)} µ_{x→f_s}(x)
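These message values and marginals can be reproduced numerically. The sketch below assumes the transition probabilities read off the HMM figure (state 1 self-transition 1/4, state 2 self-transition 1/2), which is also what the message values above imply:

```python
import numpy as np

# Sum-product messages for the two-step HMM with x1 = R, x2 = G.
pi = np.array([0.5, 0.5])
trans = np.array([[0.25, 0.75],        # P(z2 | z1 = 1)
                  [0.50, 0.50]])       # P(z2 | z1 = 2)
emit = np.array([[0.5, 0.25, 0.25],    # P(x | z = 1) over (R, G, B)
                 [0.25, 0.5, 0.25]])   # P(x | z = 2)
R, G = 0, 1

mu_f1_z1 = pi * emit[:, R]                         # (1/4, 1/8)
mu_z1_f2 = mu_f1_z1
mu_f2_z2 = (trans * emit[:, G]).T @ mu_z1_f2       # (1/32, 1/8)
mu_z2_f2 = np.ones(2)
mu_f2_z1 = trans @ (emit[:, G] * mu_z2_f2)         # (7/16, 3/8)

p_z1 = mu_f1_z1 * mu_f2_z1; p_z1 /= p_z1.sum()     # (0.7, 0.3)
p_z2 = mu_f2_z2 / mu_f2_z2.sum()                   # (0.2, 0.8)
print(p_z1, p_z2)
```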
Handling Evidence
- Observing x = v:
  - we can absorb it in the factor (as we did); or
  - set the messages µ_{x→f}(x) = 0 for all x ≠ v
- Observing X_E, multiplying the incoming messages to x ∉ X_E gives the joint (not p(x | X_E)):
  p(x, X_E) ∝ Π_{f ∈ ne(x)} µ_{f→x}(x)
- The conditional is easily obtained by normalization:
  p(x | X_E) = p(x, X_E) / Σ_x p(x, X_E)
Loopy Belief Propagation
- So far, we assumed a tree factor graph
- When the factor graph contains loops, pass messages repeatedly until convergence
- Loopy BP may not converge, but it works well in many cases
Outline
- Graphical Models
  - Probabilistic Inference
  - Directed vs. Undirected Graphical Models
  - Inference
  - Parameter Estimation
- Kernel Methods
  - Support Vector Machines
  - Kernel PCA
  - Reproducing Kernel Hilbert Spaces
Parameter Learning
- Assume the graph structure is given
- Parameters:
  - θ_i in the CPDs p(x_i | Pa(x_i), θ_i) of directed graphical models, p(X) = Π_i p(x_i | Pa(x_i), θ_i)
  - weights w_i in undirected graphical models, p(X) = (1/Z) exp(Σ_{i=1}^k w_i f_i(X))
- Principle: maximum likelihood estimation
Parameter Learning: Maximum Likelihood Estimate
- Fully observed: all dimensions of X are observed
  - given X^(1), ..., X^(n), the MLE is θ̂ = argmax_θ Σ_{i=1}^n log p(X^(i) | θ)
  - the log likelihood factorizes for directed models (easy)
  - gradient methods for undirected models
- Partially observed: X = (X_o, X_h), where X_h is unobserved
  - given X_o^(1), ..., X_o^(n), the MLE is θ̂ = argmax_θ Σ_{i=1}^n log (Σ_{X_h} p(X_o^(i), X_h | θ))
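For the fully observed directed case, the factorized MLE reduces to per-CPD counting; a minimal sketch for the two-node network E → B from the envelope quiz, with made-up toy data:

```python
# MLE by counting: estimate P(E=1) and each row of P(B | E) from
# fully observed (E, B) pairs (the data below are invented for illustration).
data = [(1, 'r'), (1, 'b'), (0, 'b'), (1, 'r'), (0, 'b'), (1, 'b')]

n = len(data)
p_E1 = sum(e for e, _ in data) / n                     # MLE of P(E = 1)
for e in (0, 1):
    balls = [b for ee, b in data if ee == e]
    p_r = balls.count('r') / len(balls) if balls else float('nan')
    print(f"P(B=r | E={e}) = {p_r:.3f}")               # MLE of each CPD row
print(f"P(E=1) = {p_E1:.3f}")
```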
More informationBayes Networks 6.872/HST.950
Bayes Networks 6.872/HST.950 What Probabilistic Models Should We Use? Full joint distribution Completely expressive Hugely data-hungry Exponential computational complexity Naive Bayes (full conditional
More informationProbabilistic Graphical Models
2016 Robert Nowak Probabilistic Graphical Models 1 Introduction We have focused mainly on linear models for signals, in particular the subspace model x = Uθ, where U is a n k matrix and θ R k is a vector
More informationBayesian Networks Introduction to Machine Learning. Matt Gormley Lecture 24 April 9, 2018
10-601 Introduction to Machine Learning Machine Learning Department School of Computer Science Carnegie Mellon University Bayesian Networks Matt Gormley Lecture 24 April 9, 2018 1 Homework 7: HMMs Reminders
More informationBayesian Networks. Motivation
Bayesian Networks Computer Sciences 760 Spring 2014 http://pages.cs.wisc.edu/~dpage/cs760/ Motivation Assume we have five Boolean variables,,,, The joint probability is,,,, How many state configurations
More informationMachine Learning for Data Science (CS4786) Lecture 24
Machine Learning for Data Science (CS4786) Lecture 24 Graphical Models: Approximate Inference Course Webpage : http://www.cs.cornell.edu/courses/cs4786/2016sp/ BELIEF PROPAGATION OR MESSAGE PASSING Each
More informationBayesian Machine Learning
Bayesian Machine Learning Andrew Gordon Wilson ORIE 6741 Lecture 4 Occam s Razor, Model Construction, and Directed Graphical Models https://people.orie.cornell.edu/andrew/orie6741 Cornell University September
More informationBayesian networks: approximate inference
Bayesian networks: approximate inference Machine Intelligence Thomas D. Nielsen September 2008 Approximative inference September 2008 1 / 25 Motivation Because of the (worst-case) intractability of exact
More informationBased on slides by Richard Zemel
CSC 412/2506 Winter 2018 Probabilistic Learning and Reasoning Lecture 3: Directed Graphical Models and Latent Variables Based on slides by Richard Zemel Learning outcomes What aspects of a model can we
More informationThe Particle Filter. PD Dr. Rudolph Triebel Computer Vision Group. Machine Learning for Computer Vision
The Particle Filter Non-parametric implementation of Bayes filter Represents the belief (posterior) random state samples. by a set of This representation is approximate. Can represent distributions that
More informationRapid Introduction to Machine Learning/ Deep Learning
Rapid Introduction to Machine Learning/ Deep Learning Hyeong In Choi Seoul National University 1/32 Lecture 5a Bayesian network April 14, 2016 2/32 Table of contents 1 1. Objectives of Lecture 5a 2 2.Bayesian
More informationBayesian Methods for Machine Learning
Bayesian Methods for Machine Learning CS 584: Big Data Analytics Material adapted from Radford Neal s tutorial (http://ftp.cs.utoronto.ca/pub/radford/bayes-tut.pdf), Zoubin Ghahramni (http://hunch.net/~coms-4771/zoubin_ghahramani_bayesian_learning.pdf),
More informationGraphical Models. Andrea Passerini Statistical relational learning. Graphical Models
Andrea Passerini passerini@disi.unitn.it Statistical relational learning Probability distributions Bernoulli distribution Two possible values (outcomes): 1 (success), 0 (failure). Parameters: p probability
More informationPROBABILISTIC REASONING SYSTEMS
PROBABILISTIC REASONING SYSTEMS In which we explain how to build reasoning systems that use network models to reason with uncertainty according to the laws of probability theory. Outline Knowledge in uncertain
More informationp L yi z n m x N n xi
y i z n x n N x i Overview Directed and undirected graphs Conditional independence Exact inference Latent variables and EM Variational inference Books statistical perspective Graphical Models, S. Lauritzen
More informationBayesian networks. Chapter Chapter
Bayesian networks Chapter 14.1 3 Chapter 14.1 3 1 Outline Syntax Semantics Parameterized distributions Chapter 14.1 3 2 Bayesian networks A simple, graphical notation for conditional independence assertions
More informationDirected Graphical Models
Directed Graphical Models Instructor: Alan Ritter Many Slides from Tom Mitchell Graphical Models Key Idea: Conditional independence assumptions useful but Naïve Bayes is extreme! Graphical models express
More information6.867 Machine learning, lecture 23 (Jaakkola)
Lecture topics: Markov Random Fields Probabilistic inference Markov Random Fields We will briefly go over undirected graphical models or Markov Random Fields (MRFs) as they will be needed in the context
More informationPattern Recognition and Machine Learning
Christopher M. Bishop Pattern Recognition and Machine Learning ÖSpri inger Contents Preface Mathematical notation Contents vii xi xiii 1 Introduction 1 1.1 Example: Polynomial Curve Fitting 4 1.2 Probability
More informationLatent Dirichlet Allocation
Latent Dirichlet Allocation 1 Directed Graphical Models William W. Cohen Machine Learning 10-601 2 DGMs: The Burglar Alarm example Node ~ random variable Burglar Earthquake Arcs define form of probability
More informationVariational Inference (11/04/13)
STA561: Probabilistic machine learning Variational Inference (11/04/13) Lecturer: Barbara Engelhardt Scribes: Matt Dickenson, Alireza Samany, Tracy Schifeling 1 Introduction In this lecture we will further
More informationGenerative and Discriminative Approaches to Graphical Models CMSC Topics in AI
Generative and Discriminative Approaches to Graphical Models CMSC 35900 Topics in AI Lecture 2 Yasemin Altun January 26, 2007 Review of Inference on Graphical Models Elimination algorithm finds single
More informationCPSC 540: Machine Learning
CPSC 540: Machine Learning MCMC and Non-Parametric Bayes Mark Schmidt University of British Columbia Winter 2016 Admin I went through project proposals: Some of you got a message on Piazza. No news is
More informationCheng Soon Ong & Christian Walder. Canberra February June 2018
Cheng Soon Ong & Christian Walder Research Group and College of Engineering and Computer Science Canberra February June 2018 Outlines Overview Introduction Linear Algebra Probability Linear Regression
More informationProbabilistic Machine Learning
Probabilistic Machine Learning Bayesian Nets, MCMC, and more Marek Petrik 4/18/2017 Based on: P. Murphy, K. (2012). Machine Learning: A Probabilistic Perspective. Chapter 10. Conditional Independence Independent
More informationProbability. CS 3793/5233 Artificial Intelligence Probability 1
CS 3793/5233 Artificial Intelligence 1 Motivation Motivation Random Variables Semantics Dice Example Joint Dist. Ex. Axioms Agents don t have complete knowledge about the world. Agents need to make decisions
More informationUndirected Graphical Models: Markov Random Fields
Undirected Graphical Models: Markov Random Fields 40-956 Advanced Topics in AI: Probabilistic Graphical Models Sharif University of Technology Soleymani Spring 2015 Markov Random Field Structure: undirected
More information27 : Distributed Monte Carlo Markov Chain. 1 Recap of MCMC and Naive Parallel Gibbs Sampling
10-708: Probabilistic Graphical Models 10-708, Spring 2014 27 : Distributed Monte Carlo Markov Chain Lecturer: Eric P. Xing Scribes: Pengtao Xie, Khoa Luu In this scribe, we are going to review the Parallel
More informationDirected Probabilistic Graphical Models CMSC 678 UMBC
Directed Probabilistic Graphical Models CMSC 678 UMBC Announcement 1: Assignment 3 Due Wednesday April 11 th, 11:59 AM Any questions? Announcement 2: Progress Report on Project Due Monday April 16 th,
More informationSampling Rejection Sampling Importance Sampling Markov Chain Monte Carlo. Sampling Methods. Oliver Schulte - CMPT 419/726. Bishop PRML Ch.
Sampling Methods Oliver Schulte - CMP 419/726 Bishop PRML Ch. 11 Recall Inference or General Graphs Junction tree algorithm is an exact inference method for arbitrary graphs A particular tree structure
More informationBayesian Approaches Data Mining Selected Technique
Bayesian Approaches Data Mining Selected Technique Henry Xiao xiao@cs.queensu.ca School of Computing Queen s University Henry Xiao CISC 873 Data Mining p. 1/17 Probabilistic Bases Review the fundamentals
More informationSTA 4273H: Statistical Machine Learning
STA 4273H: Statistical Machine Learning Russ Salakhutdinov Department of Statistics! rsalakhu@utstat.toronto.edu! http://www.utstat.utoronto.ca/~rsalakhu/ Sidney Smith Hall, Room 6002 Lecture 3 Linear
More informationVariational Scoring of Graphical Model Structures
Variational Scoring of Graphical Model Structures Matthew J. Beal Work with Zoubin Ghahramani & Carl Rasmussen, Toronto. 15th September 2003 Overview Bayesian model selection Approximations using Variational
More information4 : Exact Inference: Variable Elimination
10-708: Probabilistic Graphical Models 10-708, Spring 2014 4 : Exact Inference: Variable Elimination Lecturer: Eric P. ing Scribes: Soumya Batra, Pradeep Dasigi, Manzil Zaheer 1 Probabilistic Inference
More informationRecall from last time: Conditional probabilities. Lecture 2: Belief (Bayesian) networks. Bayes ball. Example (continued) Example: Inference problem
Recall from last time: Conditional probabilities Our probabilistic models will compute and manipulate conditional probabilities. Given two random variables X, Y, we denote by Lecture 2: Belief (Bayesian)
More informationBayes Nets: Independence
Bayes Nets: Independence [These slides were created by Dan Klein and Pieter Abbeel for CS188 Intro to AI at UC Berkeley. All CS188 materials are available at http://ai.berkeley.edu.] Bayes Nets A Bayes
More informationBayesian Network. Outline. Bayesian Network. Syntax Semantics Exact inference by enumeration Exact inference by variable elimination
Outline Syntax Semantics Exact inference by enumeration Exact inference by variable elimination s A simple, graphical notation for conditional independence assertions and hence for compact specication
More information6.047 / Computational Biology: Genomes, Networks, Evolution Fall 2008
MIT OpenCourseWare http://ocw.mit.edu 6.047 / 6.878 Computational Biology: Genomes, Networks, Evolution Fall 2008 For information about citing these materials or our Terms of Use, visit: http://ocw.mit.edu/terms.
More informationCPSC 540: Machine Learning
CPSC 540: Machine Learning Undirected Graphical Models Mark Schmidt University of British Columbia Winter 2016 Admin Assignment 3: 2 late days to hand it in today, Thursday is final day. Assignment 4:
More informationIntelligent Systems:
Intelligent Systems: Undirected Graphical models (Factor Graphs) (2 lectures) Carsten Rother 15/01/2015 Intelligent Systems: Probabilistic Inference in DGM and UGM Roadmap for next two lectures Definition
More informationComputer Vision Group Prof. Daniel Cremers. 10a. Markov Chain Monte Carlo
Group Prof. Daniel Cremers 10a. Markov Chain Monte Carlo Markov Chain Monte Carlo In high-dimensional spaces, rejection sampling and importance sampling are very inefficient An alternative is Markov Chain
More informationCS 188: Artificial Intelligence. Bayes Nets
CS 188: Artificial Intelligence Probabilistic Inference: Enumeration, Variable Elimination, Sampling Pieter Abbeel UC Berkeley Many slides over this course adapted from Dan Klein, Stuart Russell, Andrew
More informationCS 343: Artificial Intelligence
CS 343: Artificial Intelligence Bayes Nets: Sampling Prof. Scott Niekum The University of Texas at Austin [These slides based on those of Dan Klein and Pieter Abbeel for CS188 Intro to AI at UC Berkeley.
More informationp(x) p(x Z) = y p(y X, Z) = αp(x Y, Z)p(Y Z)
Graphical Models Foundations of Data Analysis Torsten Möller Möller/Mori 1 Reading Chapter 8 Pattern Recognition and Machine Learning by Bishop some slides from Russell and Norvig AIMA2e Möller/Mori 2
More informationInference in Graphical Models Variable Elimination and Message Passing Algorithm
Inference in Graphical Models Variable Elimination and Message Passing lgorithm Le Song Machine Learning II: dvanced Topics SE 8803ML, Spring 2012 onditional Independence ssumptions Local Markov ssumption
More informationConditional Independence and Factorization
Conditional Independence and Factorization Seungjin Choi Department of Computer Science and Engineering Pohang University of Science and Technology 77 Cheongam-ro, Nam-gu, Pohang 37673, Korea seungjin@postech.ac.kr
More information2 : Directed GMs: Bayesian Networks
10-708: Probabilistic Graphical Models 10-708, Spring 2017 2 : Directed GMs: Bayesian Networks Lecturer: Eric P. Xing Scribes: Jayanth Koushik, Hiroaki Hayashi, Christian Perez Topic: Directed GMs 1 Types
More informationOutline. CSE 573: Artificial Intelligence Autumn Bayes Nets: Big Picture. Bayes Net Semantics. Hidden Markov Models. Example Bayes Net: Car
CSE 573: Artificial Intelligence Autumn 2012 Bayesian Networks Dan Weld Many slides adapted from Dan Klein, Stuart Russell, Andrew Moore & Luke Zettlemoyer Outline Probabilistic models (and inference)
More informationUndirected Graphical Models
Undirected Graphical Models 1 Conditional Independence Graphs Let G = (V, E) be an undirected graph with vertex set V and edge set E, and let A, B, and C be subsets of vertices. We say that C separates
More informationProbabilistic Models Bayesian Networks Markov Random Fields Inference. Graphical Models. Foundations of Data Analysis
Graphical Models Foundations of Data Analysis Torsten Möller and Thomas Torsney-Weir Möller/Mori 1 Reading Chapter 8 Pattern Recognition and Machine Learning by Bishop some slides from Russell and Norvig
More informationConditional Independence
Conditional Independence Sargur Srihari srihari@cedar.buffalo.edu 1 Conditional Independence Topics 1. What is Conditional Independence? Factorization of probability distribution into marginals 2. Why
More informationArtificial Intelligence Bayes Nets: Independence
Artificial Intelligence Bayes Nets: Independence Instructors: David Suter and Qince Li Course Delivered @ Harbin Institute of Technology [Many slides adapted from those created by Dan Klein and Pieter
More informationIntroduction to Probabilistic Graphical Models
Introduction to Probabilistic Graphical Models Sargur Srihari srihari@cedar.buffalo.edu 1 Topics 1. What are probabilistic graphical models (PGMs) 2. Use of PGMs Engineering and AI 3. Directionality in
More informationComputer Vision Group Prof. Daniel Cremers. 14. Sampling Methods
Prof. Daniel Cremers 14. Sampling Methods Sampling Methods Sampling Methods are widely used in Computer Science as an approximation of a deterministic algorithm to represent uncertainty without a parametric
More informationIntroduction to Machine Learning
Introduction to Machine Learning Brown University CSCI 1950-F, Spring 2012 Prof. Erik Sudderth Lecture 25: Markov Chain Monte Carlo (MCMC) Course Review and Advanced Topics Many figures courtesy Kevin
More informationCS Lecture 4. Markov Random Fields
CS 6347 Lecture 4 Markov Random Fields Recap Announcements First homework is available on elearning Reminder: Office hours Tuesday from 10am-11am Last Time Bayesian networks Today Markov random fields
More informationSampling Rejection Sampling Importance Sampling Markov Chain Monte Carlo. Sampling Methods. Machine Learning. Torsten Möller.
Sampling Methods Machine Learning orsten Möller Möller/Mori 1 Recall Inference or General Graphs Junction tree algorithm is an exact inference method for arbitrary graphs A particular tree structure defined
More informationProbabilistic Graphical Models
Probabilistic Graphical Models Lecture 9 Undirected Models CS/CNS/EE 155 Andreas Krause Announcements Homework 2 due next Wednesday (Nov 4) in class Start early!!! Project milestones due Monday (Nov 9)
More informationProbabilistic Graphical Models: MRFs and CRFs. CSE628: Natural Language Processing Guest Lecturer: Veselin Stoyanov
Probabilistic Graphical Models: MRFs and CRFs CSE628: Natural Language Processing Guest Lecturer: Veselin Stoyanov Why PGMs? PGMs can model joint probabilities of many events. many techniques commonly
More informationObjectives. Probabilistic Reasoning Systems. Outline. Independence. Conditional independence. Conditional independence II.
Copyright Richard J. Povinelli rev 1.0, 10/1//2001 Page 1 Probabilistic Reasoning Systems Dr. Richard J. Povinelli Objectives You should be able to apply belief networks to model a problem with uncertainty.
More information13 : Variational Inference: Loopy Belief Propagation and Mean Field
10-708: Probabilistic Graphical Models 10-708, Spring 2012 13 : Variational Inference: Loopy Belief Propagation and Mean Field Lecturer: Eric P. Xing Scribes: Peter Schulam and William Wang 1 Introduction
More informationUndirected graphical models
Undirected graphical models Semantics of probabilistic models over undirected graphs Parameters of undirected models Example applications COMP-652 and ECSE-608, February 16, 2017 1 Undirected graphical
More informationProbabilistic Reasoning Systems
Probabilistic Reasoning Systems Dr. Richard J. Povinelli Copyright Richard J. Povinelli rev 1.0, 10/7/2001 Page 1 Objectives You should be able to apply belief networks to model a problem with uncertainty.
More informationThe Ising model and Markov chain Monte Carlo
The Ising model and Markov chain Monte Carlo Ramesh Sridharan These notes give a short description of the Ising model for images and an introduction to Metropolis-Hastings and Gibbs Markov Chain Monte
More informationBayesian Inference for Dirichlet-Multinomials
Bayesian Inference for Dirichlet-Multinomials Mark Johnson Macquarie University Sydney, Australia MLSS Summer School 1 / 50 Random variables and distributed according to notation A probability distribution
More information