Graphical Models and Kernel Methods


Graphical Models and Kernel Methods
Jerry Zhu, Department of Computer Sciences, University of Wisconsin-Madison, USA
MLSS, June 17, 2014

Outline
Graphical Models: Probabilistic Inference; Directed vs. Undirected Graphical Models; Inference; Parameter Estimation
Kernel Methods: Support Vector Machines; Kernel PCA; Reproducing Kernel Hilbert Spaces


The envelope quiz (red ball = $$$)
You randomly picked an envelope, randomly took out a ball, and it was black. Should you choose this envelope or the other envelope?

The envelope quiz: probabilistic inference
Joint distribution on E ∈ {1, 0}, B ∈ {r, b}: P(E, B) = P(E) P(B | E)
P(E=1) = P(E=0) = 1/2
P(B=r | E=1) = 1/2, P(B=r | E=0) = 0
The graphical model: E → B
Statistical decision theory: switch if P(E=1 | B=b) < 1/2
P(E=1 | B=b) = P(B=b | E=1) P(E=1) / P(B=b) = (1/2 · 1/2) / (3/4) = 1/3. Switch.
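The posterior computation can be checked directly; a minimal Python sketch using the numbers from the slide:

```python
# Envelope quiz posterior via Bayes' rule.
p_e1 = 0.5              # P(E=1): prior on picking the envelope with the red ball
p_b_given_e1 = 0.5      # P(B=b | E=1): red or black, equally likely
p_b_given_e0 = 1.0      # P(B=b | E=0): only black balls

# P(B=b) by the law of total probability.
p_b = p_b_given_e1 * p_e1 + p_b_given_e0 * (1 - p_e1)  # = 3/4

# Posterior P(E=1 | B=b).
posterior = p_b_given_e1 * p_e1 / p_b
print(posterior)  # 1/3: below 1/2, so switch envelopes
```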

Reasoning with uncertainty
The world is reduced to a set of random variables x_1, ..., x_d
  e.g. (x_1, ..., x_{d-1}) a feature vector, x_d = y the class label
Inference: given the joint distribution p(x_1, ..., x_d), compute p(X_Q | X_E) where X_Q ∪ X_E ⊆ {x_1, ..., x_d}
  e.g. Q = {d}, E = {1, ..., d-1}: by the definition of conditional probability,
  p(x_d | x_1, ..., x_{d-1}) = p(x_1, ..., x_{d-1}, x_d) / Σ_v p(x_1, ..., x_{d-1}, x_d = v)
Learning: estimate p(x_1, ..., x_d) from training data X^(1), ..., X^(N), where X^(i) = (x_1^(i), ..., x_d^(i))

It is difficult to reason with uncertainty
Joint distribution p(x_1, ..., x_d):
  exponential naive storage (2^d for binary r.v.)
  hard to interpret (conditional independence)
Inference p(X_Q | X_E):
  often can't afford to do it by brute force
If p(x_1, ..., x_d) is not given, estimate it from data:
  often can't afford to do it by brute force
Graphical models: efficient representation, inference, and learning on p(x_1, ..., x_d), exactly or approximately

What are graphical models?
Graphical model = joint distribution p(x_1, ..., x_d)
  Bayesian network or Markov random field
  conditional independence
Inference = p(X_Q | X_E), in general X_Q ∪ X_E ⊆ {x_1, ..., x_d}
  exact, MCMC, variational
If p(x_1, ..., x_d) is not given, estimate it from data
  parameter and structure learning

Graphical-Model-Nots
Graphical models are the study of probabilistic models.
Just because something has nodes and edges doesn't mean it's a graphical model.
These are not graphical models: neural networks, decision trees, network flows, HMM templates (but HMMs themselves are!)


Directed graphical models

Directed graphical models
Also called Bayesian networks
A directed graph has nodes x_1, ..., x_d, some of them connected by directed edges x_i → x_j
A cycle is a directed path x_1 → ... → x_k where x_1 = x_k
A directed acyclic graph (DAG) contains no cycles

Directed graphical models
A Bayesian network on a DAG is the family of distributions that factorize over it:
  { p : p(x_1, ..., x_d) = Π_i p(x_i | Pa(x_i)) }
where Pa(x_i) is the set of parents of x_i.
p(x_i | Pa(x_i)) is the conditional probability distribution (CPD) at x_i.
By specifying the CPDs for all i, we specify a joint distribution p(x_1, ..., x_d).

Example: Burglary, Earthquake, Alarm, John and Mary
Binary variables; graph: B → A ← E, with A → J and A → M.
P(B) = 0.001, P(E) = 0.002
P(A | B, E) = 0.95, P(A | B, ~E) = 0.94, P(A | ~B, E) = 0.29, P(A | ~B, ~E) = 0.001
P(J | A) = 0.9, P(J | ~A) = 0.05
P(M | A) = 0.7, P(M | ~A) = 0.01
P(B, ~E, A, J, ~M) = P(B) P(~E) P(A | B, ~E) P(J | A) P(~M | A)
  = 0.001 × (1 - 0.002) × 0.94 × 0.9 × (1 - 0.7) ≈ 0.000253
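The factorization reduces the joint to a product of local terms; a quick check of the arithmetic above:

```python
# Joint probability in the burglary network via the BN factorization:
# P(B, ~E, A, J, ~M) = P(B) P(~E) P(A | B, ~E) P(J | A) P(~M | A)
p_b = 0.001          # P(B)
p_not_e = 1 - 0.002  # P(~E)
p_a = 0.94           # P(A | B, ~E)
p_j = 0.9            # P(J | A)
p_not_m = 1 - 0.7    # P(~M | A)

joint = p_b * p_not_e * p_a * p_j * p_not_m
print(round(joint, 6))  # 0.000253
```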

Example: Naive Bayes
(Slide shows the graph y → x_1, ..., x_d, with its plate representation on the right.)
p(y, x_1, ..., x_d) = p(y) Π_{i=1}^d p(x_i | y)
p(y) multinomial
p(x_i | y) depends on the feature type: multinomial (count x_i), Gaussian (continuous x_i), etc.

No Causality Whatsoever
Two equivalent BNs over binary A, B:
  A → B with P(A) = a, P(B | A) = b, P(B | ~A) = c
  B → A with P(B) = ab + (1 - a)c, P(A | B) = ab / (ab + (1 - a)c), P(A | ~B) = a(1 - b) / (1 - ab - (1 - a)c)
The two BNs are equivalent in all respects. Do not read causality from Bayesian networks; they only represent correlation (a joint probability distribution).
However, it is perfectly fine to design BNs causally.

What do we need probabilistic models for?
Make predictions: p(y | x) plus decision theory
Interpret models: very natural to include latent variables

Example: Latent Dirichlet Allocation (LDA)
(Plate diagram: hyperparameters α, β; topics φ_t for t = 1..T; per-document proportions θ_d for d = 1..D; topic assignment z and word w for each of the N word positions.)
A generative model for p(φ, θ, z, w | α, β):
  For each topic t: φ_t ~ Dirichlet(β)
  For each document d: θ_d ~ Dirichlet(α)
  For each word position in d: topic z ~ Multinomial(θ_d), word w ~ Multinomial(φ_z)
Inference goals: p(z | w, α, β) and argmax_{φ,θ} p(φ, θ | w, α, β)

Conditional Independence
Two r.v.s A, B are independent if P(A, B) = P(A) P(B); equivalently, P(A | B) = P(A), or P(B | A) = P(B).
Two r.v.s A, B are conditionally independent given C if P(A, B | C) = P(A | C) P(B | C); equivalently, P(A | B, C) = P(A | C), or P(B | A, C) = P(B | C).
This extends to groups of r.v.s.
Conditional independence in a BN is precisely specified by d-separation ("directed separation").

d-separation Case 1: Tail-to-Tail
(Graph: A ← C → B.)
A, B are in general dependent; A, B are conditionally independent given C (observed nodes are shaded on the slides).
An observed C is a tail-to-tail node; it blocks the undirected path A-B.

d-separation Case 2: Head-to-Tail
(Graph: A → C → B.)
A, B are in general dependent; A, B are conditionally independent given C.
An observed C is a head-to-tail node; it blocks the path A-B.

d-separation Case 3: Head-to-Head
(Graph: A → C ← B.)
A, B are in general independent; A, B become conditionally dependent given C, or given any of C's descendants.
An observed C is a head-to-head node; it unblocks the path A-B.
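The head-to-head ("explaining away") case can be verified numerically. A minimal sketch, assuming a hypothetical network A → C ← B in which A and B are independent fair coins and C = OR(A, B) (this network is an illustration, not one from the slides):

```python
from itertools import product

# Head-to-head network A -> C <- B:
# A, B ~ Bernoulli(0.5) independent, C = OR(A, B) deterministically.
def joint(a, b, c):
    p_c = 1.0 if c == (a or b) else 0.0
    return 0.5 * 0.5 * p_c

# Marginally: P(A=1, B=1) = 1/4 = P(A=1) P(B=1)  -> independent.
p_ab = sum(joint(1, 1, c) for c in (0, 1))

# Conditioned on C=1, compare P(A=1, B=1 | C=1) with P(A=1 | C=1)^2.
p_c1 = sum(joint(a, b, 1) for a, b in product((0, 1), repeat=2))  # 3/4
p_ab_c1 = joint(1, 1, 1) / p_c1                                    # 1/3
p_a_c1 = sum(joint(1, b, 1) for b in (0, 1)) / p_c1                # 2/3
print(p_ab_c1, p_a_c1 ** 2)  # 1/3 != 4/9: dependent given C
```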

d-separation
Variable groups A and B are conditionally independent given C if all undirected paths from nodes in A to nodes in B are blocked.

d-separation Example 1
(Graph: A → E ← F → B, with E → C.)
The undirected path from A to B is unblocked at E (because E's descendant C is observed) and is not blocked at F.
A, B are dependent given C.

d-separation Example 2
(Same graph as Example 1.) The path from A to B is blocked both at E and at F.
A, B are conditionally independent given F.

Undirected graphical models

Undirected graphical models
Also known as Markov random fields.
Recall directed graphical models require a DAG and locally normalized CPDs:
  efficient computation, but restrictive.
A clique C in an undirected graph is a set of fully connected nodes (full of loops!).
Define a nonnegative potential function ψ_C : X_C → R_+.
An undirected graphical model is a family of distributions satisfying
  { p : p(X) = (1/Z) Π_C ψ_C(X_C) }
where Z = ∫ Π_C ψ_C(X_C) dX is the partition function.

Example: A Tiny Markov Random Field
(Graph: a single edge x_1 - x_2, forming the clique C.) x_1, x_2 ∈ {-1, 1}
A single clique ψ_C(x_1, x_2) = e^{a x_1 x_2}
p(x_1, x_2) = (1/Z) e^{a x_1 x_2}
Z = e^a + e^{-a} + e^{-a} + e^a = 2e^a + 2e^{-a}
p(1, 1) = p(-1, -1) = e^a / (2e^a + 2e^{-a})
p(-1, 1) = p(1, -1) = e^{-a} / (2e^a + 2e^{-a})
When the parameter a > 0, homogeneous (agreeing) configurations are favored; when a < 0, inhomogeneous configurations are favored.
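These probabilities are easy to tabulate; a minimal sketch (the value a = 1.0 is an arbitrary choice):

```python
import math
from itertools import product

# Tiny MRF: p(x1, x2) = exp(a * x1 * x2) / Z over {-1, +1}^2.
a = 1.0  # coupling parameter; a > 0 favors agreeing values

states = list(product((-1, 1), repeat=2))
Z = sum(math.exp(a * x1 * x2) for x1, x2 in states)  # = 2e^a + 2e^{-a}

def p(x1, x2):
    return math.exp(a * x1 * x2) / Z

print(p(1, 1), p(1, -1))  # agreeing pairs carry the larger probability
```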

Log-Linear Models
Real-valued feature functions f_1(X), ..., f_k(X) and real-valued weights w_1, ..., w_k:
  p(X) = (1/Z) exp( Σ_{i=1}^k w_i f_i(X) )
Equivalent to an MRF p(X) = (1/Z) Π_C ψ_C(X_C) with ψ_C(X_C) = exp(w_C f_C(X_C)).

Example: Ising Model
This is an undirected model over a graph (V, E) with x_s ∈ {0, 1}:
  p_θ(x) = (1/Z) exp( Σ_{s∈V} θ_s x_s + Σ_{(s,t)∈E} θ_st x_s x_t )
The features are f_s(X) = x_s and f_st(X) = x_s x_t.

Example: Image Denoising [From Bishop PRML]
Given a noisy image Y, compute argmax_X P(X | Y) with
  p_θ(X | Y) = (1/Z) exp( Σ_{s∈V} θ_s x_s + Σ_{(s,t)∈E} θ_st x_s x_t )
where θ_s = c if y_s = 1, θ_s = -c if y_s = 0, and θ_st > 0.

Example: Gaussian Random Field
p(X) = N(μ, Σ) = (1 / ((2π)^{n/2} |Σ|^{1/2})) exp( -(1/2) (X - μ)^T Σ^{-1} (X - μ) )
Multivariate Gaussian; the n × n covariance matrix Σ is positive semi-definite.
Let Ω = Σ^{-1} be the precision matrix.
x_i, x_j are conditionally independent given all other variables if and only if Ω_ij = 0.
When Ω_ij ≠ 0, there is an edge between x_i and x_j.
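A quick numerical check of this reading; a sketch assuming a chain-structured precision matrix (x_1 - x_2 - x_3, so Ω_13 = 0; the matrix entries are an arbitrary positive-definite choice):

```python
import numpy as np

# Precision matrix for a Gaussian chain x1 - x2 - x3: no (1,3) edge.
Omega = np.array([[ 2.0, -1.0,  0.0],
                  [-1.0,  2.0, -1.0],
                  [ 0.0, -1.0,  2.0]])
Sigma = np.linalg.inv(Omega)

# Omega[0, 2] == 0: x1 and x3 are conditionally independent given x2 ...
print(Omega[0, 2])   # 0.0
# ... yet marginally dependent: the covariance Sigma[0, 2] is nonzero.
print(Sigma[0, 2])   # 0.25
```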

Conditional Independence in Markov Random Fields
Two groups of variables A, B are conditionally independent given another group C if A and B become disconnected after removing C and all edges involving C.


Exact Inference

Inference by Enumeration
Let X = (X_Q, X_E, X_O) for query, evidence, and other variables. Goal: P(X_Q | X_E).
  P(X_Q | X_E) = P(X_Q, X_E) / P(X_E) = Σ_{X_O} P(X_Q, X_E, X_O) / Σ_{X_Q, X_O} P(X_Q, X_E, X_O)
This sums an exponential number of terms: with k variables in X_O each taking r values, there are r^k terms.
Not covered: variable elimination and the junction tree (aka clique tree).
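To make the sums concrete, a sketch that enumerates the burglary network's joint for the query P(B=1 | E=1, M=1), with the CPTs from the earlier slide (feasible only because the network is tiny):

```python
from itertools import product

# Inference by enumeration: P(B=1 | E=1, M=1),
# summing the full joint over the unobserved variables A and J.
P_A = {(1, 1): 0.95, (1, 0): 0.94, (0, 1): 0.29, (0, 0): 0.001}
P_J = {1: 0.9, 0: 0.05}   # P(J=1 | A)
P_M = {1: 0.7, 0: 0.01}   # P(M=1 | A)

def bern(p, v):
    """P(X = v) for X ~ Bernoulli(p), v in {0, 1}."""
    return p if v == 1 else 1 - p

def joint(b, e, a, j, m):
    return (bern(0.001, b) * bern(0.002, e) * bern(P_A[(b, e)], a)
            * bern(P_J[a], j) * bern(P_M[a], m))

numer = sum(joint(1, 1, a, j, 1) for a, j in product((0, 1), repeat=2))
denom = sum(joint(b, 1, a, j, 1) for b, a, j in product((0, 1), repeat=3))
print(numer / denom)  # ~0.0032: with E=1 observed, B stays very unlikely
```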

Markov Chain Monte Carlo

Markov Chain Monte Carlo
Forward sampling
Gibbs sampling
Collapsed Gibbs sampling
Not covered: block Gibbs, Metropolis-Hastings, etc.
Unbiased (after burn-in), but can have high variance.

Monte Carlo Methods
Consider the inference problem p(X_Q = c_Q | X_E), where X_Q ∪ X_E ⊆ {x_1, ..., x_d}:
  p(X_Q = c_Q | X_E) = ∫ 1_{(X_Q = c_Q)} p(X_Q | X_E) dX_Q
If we can draw samples X_Q^(1), ..., X_Q^(m) ~ p(X_Q | X_E), an unbiased estimator is
  p(X_Q = c_Q | X_E) ≈ (1/m) Σ_{i=1}^m 1_{(X_Q^(i) = c_Q)}
The variance of the estimator decreases as O(1/m).
Inference reduces to sampling from p(X_Q | X_E).

Forward Sampling
Draw X ~ P(X); throw away X if it doesn't match the evidence X_E.

Forward Sampling: Example
(Same burglary network and CPTs as before.) To generate a sample X = (B, E, A, J, M):
1. Sample B ~ Ber(0.001): draw r ~ U(0, 1); if r < 0.001 then B = 1, else B = 0
2. Sample E ~ Ber(0.002)
3. If B = 1 and E = 1, sample A ~ Ber(0.95), and so on
4. If A = 1, sample J ~ Ber(0.9), else J ~ Ber(0.05)
5. If A = 1, sample M ~ Ber(0.7), else M ~ Ber(0.01)

Inference with Forward Sampling

Say the inference task is P(B = 1 | E = 1, M = 1).
Throw away all samples except those with (E = 1, M = 1):

  P(B = 1 | E = 1, M = 1) ≈ (1/m) Σ_{i=1}^m 1(B^(i) = 1)

where m is the number of surviving samples.
- Can be highly inefficient (note that P(E = 1) is tiny)
- Does not work for Markov random fields (we cannot sample from P(X) directly)
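A minimal sketch of this rejection-style estimate (self-contained, reusing the slide's CPT values; note how few of the 200,000 samples survive the evidence, which is exactly the inefficiency noted above):

```python
import random

def forward_sample(rng=random):
    # Forward (ancestral) sampling from the alarm network; CPTs from the slide.
    B = int(rng.random() < 0.001)
    E = int(rng.random() < 0.002)
    p_A = {(1, 1): 0.95, (1, 0): 0.94, (0, 1): 0.29, (0, 0): 0.001}[(B, E)]
    A = int(rng.random() < p_A)
    J = int(rng.random() < (0.9 if A else 0.05))
    M = int(rng.random() < (0.7 if A else 0.01))
    return B, E, A, J, M

random.seed(1)
n = 200_000
samples = [forward_sample() for _ in range(n)]
# Keep only samples consistent with the evidence E = 1, M = 1.
survivors = [s for s in samples if s[1] == 1 and s[4] == 1]
m = len(survivors)
estimate = sum(s[0] for s in survivors) / m if m else float("nan")
print(m, estimate)  # m is tiny compared to n: almost all the work is wasted
```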

Gibbs Sampling: Example

Task: P(B = 1 | E = 1, M = 1). [Same alarm network; evidence E = 1, M = 1.]

Gibbs sampling is a Markov chain Monte Carlo (MCMC) method.
- Directly samples from p(X_Q | X_E)
- Works for both directed and undirected graphical models
- Initialization: fix the evidence; randomly set the other variables,
  e.g. X^(0) = (B = 0, E = 1, A = 0, J = 0, M = 1)

Gibbs Sampling

For each non-evidence variable x_i, fixing all other nodes X_{-i}, resample its value x_i ~ P(x_i | X_{-i}).
This is equivalent to x_i ~ P(x_i | MarkovBlanket(x_i)).
For a Bayesian network, MarkovBlanket(x_i) consists of x_i's parents, spouses, and children:

  P(x_i | MarkovBlanket(x_i)) ∝ P(x_i | Pa(x_i)) ∏_{y ∈ C(x_i)} P(y | Pa(y))

where Pa(x) are the parents of x and C(x) the children of x.
For many graphical models the Markov blanket is small.
For example, B ~ P(B | E = 1, A = 0) ∝ P(B) P(A = 0 | B, E = 1).

Gibbs Sampling

Say we sampled B = 1; then X^(1) = (B = 1, E = 1, A = 0, J = 0, M = 1).
Starting from X^(1), sample A ~ P(A | B = 1, E = 1, J = 0, M = 1) to get X^(2).
Move on to J, then repeat B, A, J, B, A, J, ...
Keep all samples after burn-in; P(B = 1 | E = 1, M = 1) is estimated by the fraction of kept samples with B = 1.
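The whole procedure can be sketched as a small Gibbs sampler for this network (a minimal illustration; each conditional is formed from the Markov-blanket factors, and the helper names are my own):

```python
import random

# CPT values from the slides.
pB, pE = 0.001, 0.002
pA = {(1, 1): 0.95, (1, 0): 0.94, (0, 1): 0.29, (0, 0): 0.001}  # P(A=1 | B, E)
pJ = {1: 0.9, 0: 0.05}   # P(J=1 | A)
pM = {1: 0.7, 0: 0.01}   # P(M=1 | A)

def bernoulli_from_odds(w1, w0, rng):
    # Sample 1 with probability w1 / (w1 + w0).
    return int(rng.random() < w1 / (w1 + w0))

def gibbs(n_samples, burn_in=1000, rng=random):
    E, M = 1, 1                # evidence, held fixed
    B, A, J = 0, 0, 0          # arbitrary initialization of the rest
    count_B = 0
    for t in range(burn_in + n_samples):
        # B: P(B | MB) ∝ P(B) P(A | B, E)
        w1 = pB * (pA[(1, E)] if A else 1 - pA[(1, E)])
        w0 = (1 - pB) * (pA[(0, E)] if A else 1 - pA[(0, E)])
        B = bernoulli_from_odds(w1, w0, rng)
        # A: P(A | MB) ∝ P(A | B, E) P(J | A) P(M = 1 | A)
        w1 = pA[(B, E)] * (pJ[1] if J else 1 - pJ[1]) * pM[1]
        w0 = (1 - pA[(B, E)]) * (pJ[0] if J else 1 - pJ[0]) * pM[0]
        A = bernoulli_from_odds(w1, w0, rng)
        # J: no children, so P(J | MB) = P(J | A)
        J = int(rng.random() < pJ[A])
        if t >= burn_in:
            count_B += B
    return count_B / n_samples

random.seed(0)
print(gibbs(50_000))  # estimate of P(B = 1 | E = 1, M = 1)
```

With the burn-in discarded, the estimate should settle near the exact posterior P(B = 1 | E = 1, M = 1) ≈ 0.003 (computable here by summing out A).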

Gibbs Sampling Example 2: The Ising Model

[Figure: a grid node x_s with neighbors A, B, C, D]

This is an undirected model with x_s ∈ {0, 1}:

  p_θ(x) = (1/Z) exp( Σ_{s ∈ V} θ_s x_s + Σ_{(s,t) ∈ E} θ_st x_s x_t )

Gibbs Example 2: The Ising Model

The Markov blanket of x_s is {A, B, C, D}.
In general, for undirected graphical models,

  p(x_s | x_{-s}) = p(x_s | x_N(s))

where N(s) is the set of neighbors of s. The Gibbs update is

  p(x_s = 1 | x_N(s)) = 1 / ( exp( -(θ_s + Σ_{t ∈ N(s)} θ_st x_t) ) + 1 )
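Assuming constant parameters θ_s and θ_st (arbitrary demo values), the update above can be sketched on a small grid:

```python
import math
import random

def gibbs_sweep(x, theta_s, theta_st, rng=random):
    """One Gibbs sweep over an n x n grid Ising model with x_s in {0, 1}.
    Each site is resampled from
    p(x_s = 1 | neighbors) = 1 / (1 + exp(-(theta_s + theta_st * sum of neighbor values)))."""
    n = len(x)
    for i in range(n):
        for j in range(n):
            # 4-neighborhood N(s): the Markov blanket of site s = (i, j)
            nb = sum(x[a][b] for a, b in ((i - 1, j), (i + 1, j), (i, j - 1), (i, j + 1))
                     if 0 <= a < n and 0 <= b < n)
            p1 = 1.0 / (1.0 + math.exp(-(theta_s + theta_st * nb)))
            x[i][j] = int(rng.random() < p1)
    return x

random.seed(0)
n = 4
x = [[random.randint(0, 1) for _ in range(n)] for _ in range(n)]
for _ in range(100):
    gibbs_sweep(x, theta_s=-0.5, theta_st=1.0)
print(x)  # after many sweeps, approximately a sample from p_theta
```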

Gibbs Sampling as a Markov Chain

- A Markov chain is defined by a transition matrix T(X' | X)
- Certain Markov chains have a stationary distribution π such that π = Tπ
- The Gibbs sampler is such a Markov chain, with T_i((X_{-i}, x_i) → (X_{-i}, x_i')) = p(x_i' | X_{-i}) and stationary distribution p(X_Q | X_E)
- But it takes time for the chain to reach its stationary distribution, i.e. to "mix"
- It can be difficult to assess whether the chain has mixed
- In practice, "burn-in": discard X^(0), ..., X^(T)
- Use all of X^(T+1), ... for inference (they are correlated); do not thin
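A tiny numerical check of π = Tπ on a hypothetical 2-state chain (here T[x'][x] is the probability of moving from x to x', so the columns of T sum to 1):

```python
def apply_T(T, pi):
    # (T pi)[x'] = sum_x T[x'][x] * pi[x]
    return [sum(T[xp][x] * pi[x] for x in range(len(pi))) for xp in range(len(T))]

# A made-up 2-state transition matrix; T[.][x] is a distribution over the next state.
T = [[0.9, 0.2],
     [0.1, 0.8]]

pi = [0.5, 0.5]
for _ in range(200):       # power iteration: repeated application converges
    pi = apply_T(T, pi)

print(pi)  # approaches the stationary distribution (2/3, 1/3), which satisfies pi = T pi
```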

Collapsed Gibbs Sampling

In general, E_p[f(X)] ≈ (1/m) Σ_{i=1}^m f(X^(i)) for X^(i) ~ p.
Sometimes X = (Y, Z), where the expectation over Z given Y has a closed form. If so, for Y^(i) ~ p(Y),

  E_p[f(X)] = E_{p(Y)} E_{p(Z|Y)}[f(Y, Z)] ≈ (1/m) Σ_{i=1}^m E_{p(Z|Y^(i))}[f(Y^(i), Z)]

There is no need to sample Z: it is "collapsed".
The collapsed Gibbs sampler has transitions T_i((Y_{-i}, y_i) → (Y_{-i}, y_i')) = p(y_i' | Y_{-i}).
Note p(y_i | Y_{-i}) = ∫ p(y_i, Z | Y_{-i}) dZ.

Example: Collapsed Gibbs Sampling for LDA

Collapse θ, φ; the Gibbs update is

  P(z_i = j | z_{-i}, w) ∝ [ (n^(w_i)_{-i,j} + β) / (n^(·)_{-i,j} + Wβ) ] · [ (n^(d_i)_{-i,j} + α) / (n^(d_i)_{-i,·} + Tα) ]

where W is the vocabulary size, T the number of topics, and
- n^(w_i)_{-i,j}: number of times word w_i has been assigned to topic j, excluding the current position
- n^(d_i)_{-i,j}: number of times a word from document d_i has been assigned to topic j, excluding the current position
- n^(·)_{-i,j}: number of times any word has been assigned to topic j, excluding the current position
- n^(d_i)_{-i,·}: length of document d_i, excluding the current position
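A hedged sketch of this update on a toy corpus (the corpus, hyperparameters, and function name are made up for illustration; W is the vocabulary size and T the number of topics):

```python
import random

def collapsed_gibbs_pass(docs, z, W, T, alpha, beta, rng=random):
    """One pass of collapsed Gibbs sampling for LDA over all token positions.
    docs[d]: list of word ids in document d; z[d][i]: topic of the i-th token."""
    D = len(docs)
    # Count tables: word-topic, document-topic, and topic totals.
    n_wj = [[0] * T for _ in range(W)]
    n_dj = [[0] * T for _ in range(D)]
    n_j = [0] * T
    for d in range(D):
        for i, w in enumerate(docs[d]):
            j = z[d][i]
            n_wj[w][j] += 1; n_dj[d][j] += 1; n_j[j] += 1
    for d in range(D):
        for i, w in enumerate(docs[d]):
            j = z[d][i]
            # Exclude the current position from all counts.
            n_wj[w][j] -= 1; n_dj[d][j] -= 1; n_j[j] -= 1
            # P(z_i = j | z_-i, w), dropping the document-length denominator,
            # which is constant in j and cancels under normalization.
            weights = [(n_wj[w][k] + beta) / (n_j[k] + W * beta) * (n_dj[d][k] + alpha)
                       for k in range(T)]
            r = rng.random() * sum(weights)
            j, cum = T - 1, 0.0
            for k in range(T):
                cum += weights[k]
                if r <= cum:
                    j = k
                    break
            z[d][i] = j
            n_wj[w][j] += 1; n_dj[d][j] += 1; n_j[j] += 1
    return z

# Toy corpus: 4-word vocabulary, 2 topics (values are arbitrary for the demo).
random.seed(0)
docs = [[0, 1, 2, 0], [2, 3, 3, 1]]
z = [[random.randrange(2) for _ in doc] for doc in docs]
for _ in range(20):
    z = collapsed_gibbs_pass(docs, z, W=4, T=2, alpha=0.1, beta=0.01)
print(z)
```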

Belief Propagation

Factor Graph

[Figures: an undirected model with potential ψ(A, B, C) and a directed model P(A)P(B)P(C | A, B), each converted into a factor graph with a factor node f connected to A, B, C]

- Applies to both directed and undirected graphical models
- Bipartite: edges connect only a variable node and a factor node
- Factors represent computation

The Sum-Product Algorithm

- Also known as belief propagation (BP)
- Exact if the graph is a tree; otherwise known as loopy BP, which is approximate
- The algorithm involves passing messages on the factor graph
- Alternative view: variational approximation (more later)

Example: A Simple HMM

The hidden Markov model template (not a graphical model):
[Figure: two states (1, 2) with initial probabilities π_1 = π_2 = 1/2 and self-transition probabilities 1/4 (state 1) and 1/2 (state 2)]
Emission distributions over (R, G, B):
  P(x | z = 1) = (1/2, 1/4, 1/4)
  P(x | z = 2) = (1/4, 1/2, 1/4)

Example: A Simple HMM

Observing x_1 = R, x_2 = G, the directed graphical model is the chain z_1 → z_2 with observed children x_1 = R and x_2 = G.
The corresponding factor graph is

  f_1 — z_1 — f_2 — z_2

with f_1(z_1) = P(z_1) P(x_1 | z_1) and f_2(z_1, z_2) = P(z_2 | z_1) P(x_2 | z_2).

Messages

A message is a vector of length K, where K is the number of values x takes. There are two types of messages:
1. μ_{f→x}: message from a factor node f to a variable node x; μ_{f→x}(i) is the ith element, i = 1, ..., K
2. μ_{x→f}: message from a variable node x to a factor node f

Leaf Messages

Assume a tree factor graph. Pick an arbitrary root, say z_2, and start the messages at the leaves.
If a leaf is a factor node f, then μ_{f→x}(x) = f(x):

  μ_{f1→z1}(z_1 = 1) = P(z_1 = 1) P(R | z_1 = 1) = 1/2 · 1/2 = 1/4
  μ_{f1→z1}(z_1 = 2) = P(z_1 = 2) P(R | z_1 = 2) = 1/2 · 1/4 = 1/8

If a leaf is a variable node x, then μ_{x→f}(x) = 1.

Message from Variable to Factor

A node (factor or variable) can send out a message once all of its other incoming messages have arrived.
Let x be a variable in factor f_s, and let ne(x)\f_s be the factors connected to x excluding f_s:

  μ_{x→f_s}(x) = ∏_{f ∈ ne(x)\f_s} μ_{f→x}(x)

In the example: μ_{z1→f2}(z_1 = 1) = 1/4 and μ_{z1→f2}(z_1 = 2) = 1/8.

Message from Factor to Variable

Let x be a variable in factor f_s, and let the other variables in f_s be x_1, ..., x_M:

  μ_{f_s→x}(x) = Σ_{x_1} ... Σ_{x_M} f_s(x, x_1, ..., x_M) ∏_{m=1}^M μ_{x_m→f_s}(x_m)

In this example,

  μ_{f2→z2}(s) = Σ_{s'=1}^2 μ_{z1→f2}(s') f_2(z_1 = s', z_2 = s)
               = 1/4 · P(z_2 = s | z_1 = 1) P(x_2 = G | z_2 = s) + 1/8 · P(z_2 = s | z_1 = 2) P(x_2 = G | z_2 = s)

We get μ_{f2→z2}(z_2 = 1) = 1/32 and μ_{f2→z2}(z_2 = 2) = 1/8.

Up to Root, Back Down

The message has reached the root; now pass messages back down:

  μ_{z2→f2}(z_2 = 1) = 1
  μ_{z2→f2}(z_2 = 2) = 1

Keep Passing Down

  μ_{f2→z1}(s) = Σ_{s'=1}^2 μ_{z2→f2}(s') f_2(z_1 = s, z_2 = s')
               = 1 · P(z_2 = 1 | z_1 = s) P(x_2 = G | z_2 = 1) + 1 · P(z_2 = 2 | z_1 = s) P(x_2 = G | z_2 = 2)

We get μ_{f2→z1}(z_1 = 1) = 7/16 and μ_{f2→z1}(z_1 = 2) = 3/8.

From Messages to Marginals

Once a variable has received all of its incoming messages, we compute its marginal as

  p(x) ∝ ∏_{f ∈ ne(x)} μ_{f→x}(x)

In this example (elementwise products, then normalization):

  P(z_1 | x_1, x_2) ∝ μ_{f1→z1} ∘ μ_{f2→z1} = (1/4, 1/8) ∘ (7/16, 3/8) = (7/64, 3/64) → (0.7, 0.3)
  P(z_2 | x_1, x_2) ∝ μ_{f2→z2} = (1/32, 1/8) → (0.2, 0.8)

One can also compute the marginal of the set of variables x_s involved in a factor f_s:

  p(x_s) ∝ f_s(x_s) ∏_{x ∈ ne(f_s)} μ_{x→f_s}(x)
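The whole two-node computation can be verified numerically; in this sketch the transition probabilities are reconstructed as P(z_2 | z_1 = 1) = (1/4, 3/4) and P(z_2 | z_1 = 2) = (1/2, 1/2), which is consistent with the message values on these slides:

```python
# Model from the slides: two states, pi = (1/2, 1/2),
# emissions over (R, G, B), observations x1 = R, x2 = G.
pi = [0.5, 0.5]
trans = [[0.25, 0.75],   # P(z2 | z1 = 1)
         [0.50, 0.50]]   # P(z2 | z1 = 2)
emit = [{"R": 0.5, "G": 0.25, "B": 0.25},   # P(x | z = 1)
        {"R": 0.25, "G": 0.5, "B": 0.25}]   # P(x | z = 2)

def normalize(v):
    s = sum(v)
    return [x / s for x in v]

# Leaf factor message: mu_{f1 -> z1}(z1) = P(z1) P(x1 = R | z1)
mu_f1_z1 = [pi[z] * emit[z]["R"] for z in range(2)]            # (1/4, 1/8)

# Factor f2(z1, z2) = P(z2 | z1) P(x2 = G | z2)
f2 = [[trans[z1][z2] * emit[z2]["G"] for z2 in range(2)] for z1 in range(2)]

# Upward pass: mu_{z1 -> f2} = mu_{f1 -> z1}; sum out z1 at the factor.
mu_f2_z2 = [sum(mu_f1_z1[z1] * f2[z1][z2] for z1 in range(2)) for z2 in range(2)]  # (1/32, 1/8)

# Downward pass: mu_{z2 -> f2} = (1, 1); sum out z2 at the factor.
mu_f2_z1 = [sum(f2[z1][z2] for z2 in range(2)) for z1 in range(2)]                 # (7/16, 3/8)

# Marginals: product of all incoming factor messages, then normalize.
p_z1 = normalize([mu_f1_z1[z] * mu_f2_z1[z] for z in range(2)])
p_z2 = normalize(mu_f2_z2)
print(p_z1, p_z2)  # (0.7, 0.3) and (0.2, 0.8)
```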

Handling Evidence

Observing x = v, we can either
- absorb the evidence into the factor (as we did); or
- set the messages μ_{x→f}(x') = 0 for all values x' ≠ v

Observing X_E, multiplying the incoming messages at a variable x ∉ X_E gives the joint (not p(x | X_E)):

  p(x, X_E) ∝ ∏_{f ∈ ne(x)} μ_{f→x}(x)

The conditional is then easily obtained by normalization:

  p(x | X_E) = p(x, X_E) / Σ_x p(x, X_E)

Loopy Belief Propagation

- So far, we assumed a tree-structured graph
- When the factor graph contains loops, keep passing messages iteratively until convergence
- Loopy BP may not converge, but it works well in many cases

Outline

Graphical Models
  Probabilistic Inference
  Directed vs. Undirected Graphical Models
  Inference
  Parameter Estimation
Kernel Methods
  Support Vector Machines
  Kernel PCA
  Reproducing Kernel Hilbert Spaces

Parameter Learning

Assume the graph structure is given. The parameters are:
- θ_i in the CPDs p(x_i | Pa(x_i), θ_i) of a directed graphical model: p(X) = ∏_i p(x_i | Pa(x_i), θ_i)
- the weights w_i in an undirected graphical model: p(X) = (1/Z) exp( Σ_{i=1}^k w_i f_i(X) )

Principle: the maximum likelihood estimate.

Parameter Learning: Maximum Likelihood Estimate

Fully observed: all dimensions of X are observed.
- Given X^1, ..., X^n, the MLE is θ̂ = argmax_θ Σ_{i=1}^n log p(X^i | θ)
- The log-likelihood factorizes for directed models (easy)
- Use a gradient method for undirected models

Partially observed: X = (X_o, X_h), where X_h is unobserved.
- Given X_o^1, ..., X_o^n, the MLE is θ̂ = argmax_θ Σ_{i=1}^n log Σ_{X_h} p(X_o^i, X_h | θ)
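In the fully observed directed case, the factorized log-likelihood means the MLE reduces to per-CPD counting; a minimal sketch on a hypothetical two-node binary network B → A:

```python
from collections import Counter

def mle_cpds(data):
    """MLE for a two-node network B -> A from fully observed samples (b, a).
    Since p(B, A) = p(B) p(A | B), the log-likelihood splits per CPD and the
    MLE is just the empirical (conditional) frequencies."""
    n = len(data)
    b_counts = Counter(b for b, _ in data)
    ba_counts = Counter(data)
    p_B = {b: b_counts[b] / n for b in (0, 1)}
    # P(A = 1 | B = b) = #(B = b, A = 1) / #(B = b)
    p_A1_given_B = {b: (ba_counts[(b, 1)] / b_counts[b]) if b_counts[b] else 0.0
                    for b in (0, 1)}
    return p_B, p_A1_given_B

data = [(0, 0), (0, 0), (0, 1), (1, 1), (1, 1), (1, 0)]
p_B, p_A1 = mle_cpds(data)
print(p_B[1], p_A1[0], p_A1[1])  # 0.5, 1/3, 2/3
```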