Lecture 3: Undirected Graphical Models
Sam Roweis, January 2004

Review: Directed Models (Bayes Nets)
Semantics: $x \perp y \mid z$ if $z$ d-separates $x$ and $y$.
d-separation: $z$ d-separates $x$ from $y$ if along every undirected path between $x$ and $y$ there is a node $w$ such that either:
1. $w$ has converging arrows along the path ($\rightarrow w \leftarrow$) and neither $w$ nor its descendants are in $z$, or
2. $w$ does not have converging arrows along the path and $w \in z$.
The Bayes-Ball algorithm can be used to check d-separation.
It is always possible to find a distribution consistent with the graph. The most general such distribution is a product of parent-conditionals:
$$P(x_1, \ldots, x_n) = \prod_i P(x_i \mid x_{\pi_i})$$

Review: Goal of Graphical Models
Graphical models aim to provide compact factorizations of large joint probability distributions. These factorizations are achieved using local functions which exploit conditional independencies in the models.
The graph tells us a basic set of conditional independencies that must be true. From these we can derive more that also must be true. These independencies are crucial to developing efficient algorithms valid for all numerical settings of the local functions.
Local functions tell us the quantitative details of the distribution. Certain numerical settings of the distribution may have more independencies present, but these do not come from the graph.

Undirected Models
Also graphs with one node per random variable and edges that connect pairs of nodes, but now the edges are undirected.
Semantics: every node is conditionally independent of its non-neighbours given its neighbours, i.e. $x_A \perp x_C \mid x_B$ if every path between $x_A$ and $x_C$ goes through $x_B$.
Can model symmetric interactions that directed models cannot.
Also known as Markov Random Fields, Markov Networks, Boltzmann Machines, Spin Glasses, Ising Models.
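To make the parent-conditional factorization concrete, here is a minimal sketch for a hypothetical three-node binary chain $x_1 \rightarrow x_2 \rightarrow x_3$ with made-up conditional probability tables; the names and numbers are illustrative, not from the lecture.

```python
# Minimal sketch (hypothetical example): the parent-conditional factorization
# P(x1, x2, x3) = P(x1) P(x2 | x1) P(x3 | x2) for a 3-node binary chain.
import itertools
import numpy as np

p_x1 = np.array([0.6, 0.4])                  # P(x1)
p_x2_given_x1 = np.array([[0.7, 0.3],        # P(x2 | x1): rows index x1
                          [0.2, 0.8]])
p_x3_given_x2 = np.array([[0.9, 0.1],        # P(x3 | x2): rows index x2
                          [0.5, 0.5]])

def joint(x1, x2, x3):
    """Product of parent-conditionals for one configuration."""
    return p_x1[x1] * p_x2_given_x1[x1, x2] * p_x3_given_x2[x2, x3]

# The factorization defines a valid distribution: probabilities sum to 1
# with no global normalization step needed.
total = sum(joint(*cfg) for cfg in itertools.product([0, 1], repeat=3))
print(total)  # 1.0 (up to floating point)
```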

Simple Graph Separation
In undirected models, simple graph separation (as opposed to d-separation) tells us about conditional independencies: $x_A \perp x_C \mid x_B$ if every path between $x_A$ and $x_C$ is blocked by some node in $x_B$.
Markov Ball algorithm: remove $x_B$ and see if there is any path from $x_A$ to $x_C$ (sketched in code after these slides).

Conditional Parameterization?
In directed models, we started with $p(x) = \prod_i p(x_i \mid x_{\pi_i})$ and we derived the d-separation semantics from that.
Undirected models: we have the semantics, but need a parameterization.
What about this conditional parameterization?
$$p(x) = \prod_i p(x_i \mid x_{\text{neighbours}(i)})$$
Good: product of local functions. Good: each one has a simple conditional interpretation.
Bad: local functions cannot be arbitrary, but must agree properly in order to define a valid distribution.

Markov Blanket
$b$ is a Markov blanket for $x$ iff $x \perp y \mid b$ for all $y \notin b$.
Markov boundary: the minimal Markov blanket. For undirected models, this is the set of neighbours.
Q: What is the Markov blanket (boundary) in a directed model?
A: {parents + children + parents-of-children}.

Marginal Parameterization?
OK, what about this marginal parameterization?
$$p(x) = \prod_i p(x_i, x_{\text{neighbours}(i)})$$
Good: product of local functions. Good: each one has a simple marginal interpretation.
Bad: only very few pathological marginals on overlapping nodes can be multiplied to give a valid joint.
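Here is a minimal sketch of the graph-separation ("Markov Ball") check mentioned above, assuming an adjacency-set representation of the undirected graph; the example chain graph is hypothetical.

```python
# Minimal sketch: check x_A ⟂ x_C | x_B by deleting the nodes in x_B and
# testing whether any path still connects x_A to x_C (breadth-first search).
from collections import deque

def separated(graph, A, B, C):
    """Return True if B blocks every path between A and C.

    graph: dict mapping node -> set of neighbour nodes (undirected).
    A, B, C: disjoint sets of nodes.
    """
    blocked = set(B)
    frontier = deque(a for a in A if a not in blocked)
    visited = set(frontier)
    while frontier:
        node = frontier.popleft()
        if node in C:
            return False            # found an unblocked path into C
        for nbr in graph[node]:
            if nbr not in blocked and nbr not in visited:
                visited.add(nbr)
                frontier.append(nbr)
    return True

# Chain a - b - c: b separates a from c.
chain = {"a": {"b"}, "b": {"a", "c"}, "c": {"b"}}
print(separated(chain, {"a"}, {"b"}, {"c"}))  # True
print(separated(chain, {"a"}, set(), {"c"}))  # False
```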

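The directed-model answer above, {parents + children + parents-of-children}, is easy to state in code. A minimal sketch, assuming the directed graph is stored as a dict from each node to its set of parents; the example graph is made up for illustration.

```python
# Minimal sketch (hypothetical graph, parents-dict representation): the Markov
# blanket of a node in a directed model is parents + children + parents-of-children.
def markov_blanket(parents, node):
    children = {c for c, ps in parents.items() if node in ps}
    coparents = {p for c in children for p in parents[c]}
    return (parents[node] | children | coparents) - {node}

# Example: w -> y <- x, x -> z
parents = {"w": set(), "x": set(), "y": {"w", "x"}, "z": {"x"}}
print(markov_blanket(parents, "x"))   # {'w', 'y', 'z'}
```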
Clique Potentials
Whatever factorization we pick, we know that only connected nodes can be arguments of a single local function. A clique is a fully connected subset of nodes.
Thus, consider using a product of positive clique potentials:
$$P(X) = \frac{1}{Z} \prod_{\text{cliques } c} \psi_c(x_c), \qquad Z = \sum_X \prod_{\text{cliques } c} \psi_c(x_c)$$
This is a product of functions that don't need to agree with each other, yet it still factors in the way that the graph semantics demand.
Without loss of generality we can restrict ourselves to maximal cliques. (Why?)

Boltzmann Distributions
We often represent the clique potentials using their logs:
$$\psi_C(x_C) = \exp\{-H_C(x_C)\}$$
for arbitrary real-valued energy functions $H_C(x_C)$. The negative sign is a standard convention.
This gives the joint a nice additive structure:
$$P(X) = \frac{1}{Z} \exp\Big\{-\sum_{\text{cliques } C} H_C(x_C)\Big\} = \frac{1}{Z} \exp\{-H(X)\}$$
where the sum in the exponent is called the "free energy": $H(X) = \sum_C H_C(x_C)$.
This way of defining a probability distribution based on energies is the Boltzmann distribution from statistical physics.

Examples of Clique Potentials
[Figure: (a) a chain of nodes $x_{i-1}$, $x_i$, $x_{i+1}$; (b) example pairwise potential tables over neighbouring nodes.]

Partition Function
The normalizer $Z$ above is called the partition function. Computing the normalizer and its derivatives can often be the hardest part of inference and learning in undirected models.
Often the factored structure of the distribution makes it possible to efficiently do the sums/integrals required to compute $Z$.
We don't always have to compute $Z$, e.g. for conditional probabilities.
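To make the partition function concrete, here is a minimal sketch for a hypothetical three-node chain $x_1 - x_2 - x_3$ with made-up pairwise potentials; $Z$ is computed by brute-force enumeration, which is only feasible for tiny models.

```python
# Minimal sketch: an undirected joint as a normalized product of positive
# clique potentials, P(X) = (1/Z) * prod_c psi_c(x_c), with Z by enumeration.
import itertools
import numpy as np

psi_12 = np.array([[1.5, 0.2],     # psi(x1, x2): favours agreement (made-up values)
                   [0.2, 1.5]])
psi_23 = np.array([[1.5, 0.2],     # psi(x2, x3)
                   [0.2, 1.5]])

def unnormalized(x1, x2, x3):
    return psi_12[x1, x2] * psi_23[x2, x3]

# Partition function: sum of the unnormalized score over all configurations.
Z = sum(unnormalized(*cfg) for cfg in itertools.product([0, 1], repeat=3))

def joint(x1, x2, x3):
    return unnormalized(x1, x2, x3) / Z

# Conditionals such as P(x1 | x2) never need Z: it cancels in the ratio.
print(Z, joint(0, 0, 0))
```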

Interpretation of Clique Potentials
Consider the chain $x - y - z$. The model implies $x \perp z \mid y$. We can write this as:
$$p(x, y, z) = p(y)\, p(x \mid y)\, p(z \mid y)$$
$$p(x, y, z) = p(x, y)\, p(z \mid y) = \psi_{xy}(x, y)\, \psi_{yz}(y, z)$$
$$p(x, y, z) = p(x \mid y)\, p(z, y) = \psi_{xy}(x, y)\, \psi_{yz}(y, z)$$
So we cannot have all potentials be marginals, and we cannot have all potentials be conditionals.
The positive clique potentials can only be thought of as general "compatibility", "goodness" or "happiness" functions over their variables, but not as probability distributions.

Expressive Power
Can we always convert directed $\leftrightarrow$ undirected? No.
No directed model can represent these and only these independencies: $x \perp y \mid \{w, z\}$ and $w \perp z \mid \{x, y\}$ (the four-cycle over $x, w, y, z$).
No undirected model can represent these and only these independencies: $x \perp y$ (the v-structure $x \rightarrow z \leftarrow y$).

Hammersley-Clifford Theorem (1971)
The H-C theorem tells us that the family of distributions defined by the conditional independence semantics on the graph and the family defined by products of potential functions on maximal cliques are the same.
Tricky point: the potential functions are arbitrary real-valued functions, but must be strictly positive.
For directed models, there is a version of this theorem which tells us that the family of distributions defined by the conditional independence semantics of the directed graph and the family defined by products of parent-conditionals are the same.
Notice the crucial difference between the graph, which tells us independencies that are true no matter what local functions we choose, and the numerical functions, which could introduce some extra independencies once we know them.

Example: Ising Models
A common model for binary nodes is the spin-glass / Ising lattice. Nodes are arranged in a regular topology (often a regular packing grid) and connected only to their geometric neighbours.
For example, if we think of each node as a pixel, we might want to encourage nearby pixels to have similar intensities.
The energy is of the form:
$$H(x) = \sum_{ij} \beta_{ij} x_i x_j + \sum_i \alpha_i x_i$$
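The Ising energy above is easy to evaluate for a grid-structured model. A minimal sketch, assuming a hypothetical 4x4 grid of $\pm 1$ "pixels" with uniform nearest-neighbour couplings; the parameter values are illustrative only.

```python
# Minimal sketch: the Ising energy H(x) = sum_ij beta_ij x_i x_j + sum_i alpha_i x_i
# restricted to nearest-neighbour pairs on a grid, and the Boltzmann weight exp(-H).
import numpy as np

rng = np.random.default_rng(0)
x = rng.choice([-1, 1], size=(4, 4))   # spin / pixel configuration

beta = -1.0    # negative beta lowers the energy of agreeing neighbours,
alpha = 0.0    # so under P(x) ∝ exp(-H(x)) smooth images are more probable

def ising_energy(x, beta, alpha):
    # Sum over right and down neighbours so each edge is counted once.
    pair_term = beta * (np.sum(x[:, :-1] * x[:, 1:]) + np.sum(x[:-1, :] * x[1:, :]))
    field_term = alpha * np.sum(x)
    return pair_term + field_term

H = ising_energy(x, beta, alpha)
print(H, np.exp(-H))   # energy and unnormalized probability weight
```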

Example: Gaussian Distribution
The most common and important undirected graphical model on a set of continuous-valued nodes is the Gaussian (normal) distribution.
It uses pairwise potentials between every pair of nodes to define an energy identical to the Ising model, but for continuous values:
$$H(x) = \frac{1}{2} \sum_{ij} (x_i - \mu_i)\, V_{ij}\, (x_j - \mu_j)$$
where $\mu$ is the mean and $V$ is the inverse covariance matrix. It is like a fully connected lattice.
Also, the Gaussian is the maximum entropy distribution consistent with the mean and covariance defined by $\mu$ and $V$.

Example: Boltzmann Machines
Fully observed Boltzmann machines are the binary equivalent of a Gaussian distribution: fully connected Ising models on a set of binary random variables. (Also maximum entropy.)
The energy is the same:
$$H(x) = \sum_{ij} \beta_{ij} x_i x_j + \sum_i \alpha_i x_i$$
Boltzmann machines also add the possibility of having some units (random variables) which are never observed. These are called hidden units or latent variables, and we will see much more about them later.
For continuous variables, the equivalent of a Boltzmann machine with hidden units is called a factor analysis model.
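As a sanity check on the Gaussian-as-undirected-model view, here is a minimal sketch with a made-up 2-D mean and covariance: normalizing $\exp\{-H(x)\}$ with the pairwise energy above recovers the usual multivariate normal density.

```python
# Minimal sketch: Gaussian energy H(x) = (1/2) sum_ij (x_i - mu_i) V_ij (x_j - mu_j),
# with V the inverse covariance; exp(-H)/Z matches the standard normal density.
import numpy as np
from scipy.stats import multivariate_normal

mu = np.array([1.0, -2.0])
cov = np.array([[2.0, 0.6],
                [0.6, 1.0]])
V = np.linalg.inv(cov)          # inverse covariance (precision) matrix

def energy(x):
    d = x - mu
    return 0.5 * d @ V @ d      # pairwise energy over all pairs of nodes

def density(x):
    d = len(mu)
    Z = (2 * np.pi) ** (d / 2) / np.sqrt(np.linalg.det(V))   # partition function
    return np.exp(-energy(x)) / Z

x = np.array([0.5, -1.0])
print(np.isclose(density(x), multivariate_normal(mean=mu, cov=cov).pdf(x)))  # True
```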