Cheng Soon Ong & Christian Walder
Research Group and College of Engineering and Computer Science
Canberra, February - June 2018

Course outline: Overview; Introduction; Linear Algebra; Probability; Linear Regression 1; Linear Regression 2; Linear Classification 1; Linear Classification 2; Kernel Methods; Sparse Kernel Methods; Mixture Models and EM 1; Mixture Models and EM 2; Neural Networks 1; Neural Networks 2; Principal Component Analysis; Autoencoders; Graphical Models 1; Graphical Models 2; Graphical Models 3; Sampling; Sequential Data 1; Sequential Data 2.

(Many figures from C. M. Bishop, "Pattern Recognition and Machine Learning")

Finding p(a, b | f) - Analytical method

[Figure: directed graph with edges a → e, f → e, f → b and e → c.]

1. Start from the directed graph above, with f observed.
2. Write down the conditional probability with the given observable(s): p(a, b, c, e | f).
3. Rewrite it as a joint distribution over all variables:
   p(a, b, c, e | f) = p(a, b, c, e, f) / p(f)
4. Factorise the joint probability according to the graph:
   p(a, b, c, e | f) = p(a, b, c, e, f) / p(f) = p(a) p(f) p(e | a, f) p(b | f) p(c | e) / p(f)
5. Marginalise over all variables we don't care about:
   p(a, b | f) = Σ_{c,e} p(a) p(f) p(e | a, f) p(b | f) p(c | e) / p(f) = p(a) p(b | f)
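The same marginalisation can be checked numerically by brute-force enumeration. The sketch below is illustrative only: the conditional probability tables are arbitrary assumptions, and only the factorisation given by the graph is taken from the slide.

```python
# Minimal numerical check of the analytical result p(a, b | f) = p(a) p(b | f).
# The CPT values are arbitrary (assumed for illustration); only the structure
# a -> e <- f, f -> b, e -> c from the factorisation above is used.
import itertools
import numpy as np

rng = np.random.default_rng(0)

def random_cpt(n_parents):
    """Random conditional probability table over a binary variable."""
    t = rng.random((2,) * (n_parents + 1))
    return t / t.sum(axis=-1, keepdims=True)   # normalise over the child's states

p_a = random_cpt(0)          # p(a)
p_f = random_cpt(0)          # p(f)
p_e_af = random_cpt(2)       # p(e | a, f), indexed [a, f, e]
p_b_f = random_cpt(1)        # p(b | f),    indexed [f, b]
p_c_e = random_cpt(1)        # p(c | e),    indexed [e, c]

# Joint p(a, b, c, e, f) from the factorisation given by the graph.
joint = np.zeros((2,) * 5)   # indexed [a, b, c, e, f]
for a, b, c, e, f in itertools.product(range(2), repeat=5):
    joint[a, b, c, e, f] = p_a[a] * p_f[f] * p_e_af[a, f, e] * p_b_f[f, b] * p_c_e[e, c]

f_obs = 1
p_abf = joint[:, :, :, :, f_obs].sum(axis=(2, 3))        # sum over c and e
p_ab_given_f = p_abf / joint[:, :, :, :, f_obs].sum()    # divide by p(f)

# Compare with p(a) p(b | f): the two agree, confirming the derivation.
expected = np.outer(p_a, p_b_f[f_obs])
print(np.allclose(p_ab_given_f, expected))               # True
```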

Finding p(a, b | f) - Graphical method

[Figure: the same directed graph over a, b, c, e and f.]

1. Start from the same directed graph, with f observed.
2. Check whether a ⊥⊥ b | f holds or not.
   Result: a ⊥⊥ b | f holds.
   Reason: the path from a to b is blocked by f because f is a TT-node (tail-to-tail) and observed. Therefore a ⊥⊥ b | f.
3. Write down the factorisation:
   p(a, b | f) = p(a | f) p(b | f) = p(a) p(b | f)

Bayesian Network - D-separation: a third example

Is x_3 d-separated from x_6 given x_1 and x_5? Mark x_1 and x_5 as observed.

[Figure: the directed graph over x_1, ..., x_6.]

4 junctions of interest between x_3 and x_6:
- {x_3, x_1, x_2} is blocked because x_1 is a TT-node (tail-to-tail) and observed.
- {x_3, x_5, x_6} is blocked because x_5 is an HT-node (head-to-tail) and observed.
- {x_3, x_5, x_2} is not blocked because x_5 is an HH-node (head-to-head) and observed.
- {x_5, x_2, x_6} is not blocked because x_2 is a TT-node and unobserved.

The path {x_3, x_5, x_2, x_6} therefore prevents d-separation. Hence x_3 is not d-separated from x_6 given x_1 and x_5, as not all paths between x_3 and x_6 are blocked.
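The per-junction reasoning above can be captured in a few lines of code. The sketch below is a hypothetical helper (not lecture code) that encodes the three blocking rules for a single junction on a path, given the edge directions and the observed set.

```python
# Hypothetical sketch of the junction-blocking rules used above.
# A junction is three consecutive nodes (a, m, b) on an undirected path through a DAG.
def junction_blocked(a, m, b, edges, observed, descendants):
    """Return True if the junction a - m - b is blocked.

    edges       : set of directed edges (parent, child)
    observed    : set of observed nodes
    descendants : dict mapping each node to the set of its descendants
    """
    head_a = (a, m) in edges        # arrow enters m from the a side
    head_b = (b, m) in edges        # arrow enters m from the b side

    if head_a and head_b:           # head-to-head (HH): blocked unless m or a
                                    # descendant of m is observed
        return m not in observed and not (descendants[m] & observed)
    else:                           # tail-to-tail (TT) or head-to-tail (HT):
                                    # blocked iff m is observed
        return m in observed
```

For the junction {x_3, x_5, x_2}, for instance, x_5 is head-to-head and observed, so the function returns False (not blocked), matching the analysis above.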

Part XIV: Probabilistic Graphical Models 2

Markov Random Fields (MRFs), also called Markov networks or undirected graphical models, are defined over a graph with undirected edges. MRFs allow for different conditional independence statements than Bayesian networks. In a Bayesian network, the definition of a blocked path was subtle for an HH-node, because it required that neither the node nor any of its descendants be observed. Is there an alternative graphical semantics for probability distributions such that conditional independence is determined by simple graph separation? Yes: removing the direction from the edges removes the asymmetry between parent and child nodes, and with it the subtleties associated with the HH-node.

Graph Separation

Definition (Graph Separation). In an undirected graph G with A, B and C disjoint subsets of nodes, if every path from A to B includes at least one node from C, then C is said to separate A from B in G.

[Figure: a node set C separating A from B.]

Conditional Independence in MRFs

Definition (Conditional Independence in a Markov Random Field). In an undirected graph G with A, B and C disjoint subsets of nodes, A is conditionally independent of B given C (written A ⊥⊥ B | C) iff C separates A from B in G.

[Figure: C separating A from B in an undirected graph.]
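Graph separation itself is just reachability after deleting the conditioning set. The sketch below is a minimal pure-Python check (an assumed helper, not from the lecture) using breadth-first search.

```python
from collections import deque

def separates(adj, A, B, C):
    """Return True if C separates A from B in the undirected graph `adj`.

    adj : dict mapping each node to the set of its neighbours
    A, B, C : disjoint sets of nodes
    Separation holds iff no node of B is reachable from A once C is removed.
    """
    blocked = set(C)
    frontier = deque(a for a in A if a not in blocked)
    visited = set(frontier)
    while frontier:
        node = frontier.popleft()
        if node in B:
            return False                      # found an unblocked path from A to B
        for nbr in adj[node]:
            if nbr not in visited and nbr not in blocked:
                visited.add(nbr)
                frontier.append(nbr)
    return True

# Example: a path graph 1 - 2 - 3; node {2} separates {1} from {3}.
adj = {1: {2}, 2: {1, 3}, 3: {2}}
print(separates(adj, {1}, {3}, {2}))          # True
print(separates(adj, {1}, {3}, set()))        # False
```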

Markov Random Field

Definition (Markov Random Field). A Markov Random Field is a set of probability distributions { p(x) : p(x) > 0 for all x } such that there exists an undirected graph G with disjoint subsets of nodes A, B and C, in which whenever C separates A from B in G, A ⊥⊥ B | C.

Although we sometimes say "the MRF is such an undirected graph", we mean "the MRF represents the set of all probability distributions whose conditional independence statements are precisely those given by graph separation in the graph".

Factorisation in MRFs

Assume two nodes x_i and x_j that are not connected by an edge. Given all other nodes in the graph, x_i and x_j must be conditionally independent, as all paths between x_i and x_j are blocked by observed nodes:

p(x_i, x_j | x_\{i,j\}) = p(x_i | x_\{i,j\}) p(x_j | x_\{i,j\})

where x_\{i,j\} denotes the set of all variables x with x_i and x_j removed. This is suggestive of the importance of considering sets of connected nodes (cliques). Can we use this for the factorisation of the graph?

Cliques in a graph

A clique is a subset of nodes in a graph such that there exists an edge between all pairs of nodes in the subset. (The nodes in a clique are fully connected.)

A maximal clique of a graph is a clique which is not a proper subset of another clique. (No other node of the graph can be added to a maximal clique without destroying the property of full connectedness.)

[Figure: a four-node graph over x_1, x_2, x_3, x_4 illustrating cliques and maximal cliques.]
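For small graphs, maximal cliques can be enumerated by brute force. The sketch below checks every subset of nodes; the edge set is an assumption chosen for illustration (two overlapping triangles), not necessarily the graph in the lecture figure.

```python
from itertools import combinations

def maximal_cliques(nodes, edges):
    """Brute-force maximal cliques of a small undirected graph."""
    edges = {frozenset(e) for e in edges}
    is_clique = lambda s: all(frozenset(p) in edges for p in combinations(s, 2))
    cliques = [set(s) for r in range(1, len(nodes) + 1)
               for s in combinations(nodes, r) if is_clique(s)]
    # Keep only cliques that are not a proper subset of another clique.
    return [c for c in cliques if not any(c < d for d in cliques)]

# Assumed example graph: x1-x2, x1-x3, x2-x3, x2-x4, x3-x4.
nodes = ["x1", "x2", "x3", "x4"]
edges = [("x1", "x2"), ("x1", "x3"), ("x2", "x3"), ("x2", "x4"), ("x3", "x4")]
print(maximal_cliques(nodes, edges))   # the two triangles {x1,x2,x3} and {x2,x3,x4}
```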

Factorisation using Cliques

We can express the factorisation with the help of the cliques. In fact, we only need the maximal cliques, as any function over the variables of a subset of a maximal clique can be expressed as a function of the members of the maximal clique.

Denote by 𝒞 the set of maximal cliques of a graph. For a maximal clique C ∈ 𝒞, denote by x_C the subset of the variables x which belong to C.

[Figure: the same four-node graph with its maximal cliques highlighted.]

Factorisation using Cliques

A probability distribution p(x) factorises with respect to a given undirected graph G if it can be written as

p(x) = (1/Z) ∏_{C ∈ 𝒞} ψ_C(x_C)

where 𝒞 is the set of maximal cliques of G and the potential functions satisfy ψ_C(x_C) ≥ 0. The constant

Z = Σ_x ∏_{C ∈ 𝒞} ψ_C(x_C)

ensures the correct normalisation of p(x).
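For small discrete models, Z can be computed by brute-force enumeration. The sketch below builds a toy binary MRF on the two maximal cliques of the earlier example; the potential tables are arbitrary assumptions for illustration.

```python
# Toy illustration of p(x) = (1/Z) * prod_C psi_C(x_C) for binary variables
# x1..x4 with maximal cliques {x1,x2,x3} and {x2,x3,x4} (assumed potentials).
import itertools
import numpy as np

rng = np.random.default_rng(0)
psi_123 = rng.random((2, 2, 2)) + 0.1   # psi(x1, x2, x3) >= 0
psi_234 = rng.random((2, 2, 2)) + 0.1   # psi(x2, x3, x4) >= 0

def unnormalised(x1, x2, x3, x4):
    return psi_123[x1, x2, x3] * psi_234[x2, x3, x4]

# Partition function: sum of the unnormalised product over all joint states.
Z = sum(unnormalised(*x) for x in itertools.product(range(2), repeat=4))

def p(x1, x2, x3, x4):
    return unnormalised(x1, x2, x3, x4) / Z

# The normalised probabilities sum to one.
print(sum(p(*x) for x in itertools.product(range(2), repeat=4)))  # approx 1.0
```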

Conditional Independence and Factorisation

Theorem (Factorisation ⇒ Conditional Independence). If a probability distribution factorises according to an undirected graph, and if A, B and C are disjoint subsets of nodes such that C separates A from B in the graph, then the distribution satisfies A ⊥⊥ B | C.

Theorem (Conditional Independence ⇒ Factorisation; Hammersley-Clifford Theorem). If a strictly positive probability distribution p(x) > 0 for all x satisfies the conditional independence statements implied by graph separation over a particular undirected graph, then it also factorises according to the graph.

Factorisation with strictly positive potential functions

As the potential functions ψ_C(x_C) are strictly positive, one can express them as the exponential

ψ_C(x_C) = exp{-E(x_C)}

of an energy function E(x_C). This exponential form is called the Boltzmann distribution. The joint distribution is defined as the product of the potentials, and so the total energy is obtained by adding the energies of each of the maximal cliques:

p(x) = (1/Z) ∏_{C ∈ 𝒞} ψ_C(x_C) = (1/Z) exp{ - Σ_{C ∈ 𝒞} E(x_C) }

Example of Image Denoising

Given an unknown, noise-free image described by binary pixels x_i ∈ {-1, +1} for i = 1, ..., D. Randomly flip the sign of some pixels with some small probability, and denote the pixels of the noisy image by y_i for i = 1, ..., D.

Goal: recover the original noise-free image.

[Figures: the original image, and the image after randomly changing 10% of the pixels.]

Example of Image Denoising

Prior knowledge 1: we know the noise level is small. Therefore there is a strong correlation between original pixels x_i and noisy pixels y_i.

Prior knowledge 2: neighbouring pixels x_i and x_j in an image are strongly correlated (given a decent resolution).

This prior knowledge can be captured in a Markov Random Field.

[Figure: the denoising MRF, with each observed noisy pixel y_i connected to its latent pixel x_i, and neighbouring x_i connected to each other.]

Example of Image Denoising

Cliques {x_i, y_i}: choose an energy function of the form -η x_i y_i for η > 0, resulting in a potential function exp{η x_i y_i}. This favours equal signs for x_i and y_i.

Cliques {x_i, x_j}, where i and j are indices of neighbouring pixels: choose an energy function of the form -β x_i x_j for β > 0, again favouring equal signs for x_i and x_j.

We also need to accommodate the fact that one pixel value may occur more often than the other. This can be done by adding a term h x_i to the energy for each pixel in the noise-free image. (The potential function is an arbitrary, non-negative function over maximal cliques, so we are allowed to multiply it by any non-negative function of subsets of the clique.)

Example of Image Denoising

Energy function:

E(x, y) = h Σ_i x_i - β Σ_{i,j} x_i x_j - η Σ_i x_i y_i

which defines a joint distribution over x and y:

p(x, y) = (1/Z) exp{-E(x, y)}

[Figure: the denoising MRF with nodes y_i and x_i.]
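A direct translation of this energy into code, for a binary image stored as a NumPy array, might look as follows. This is a sketch; the parameter values are example choices, not prescribed by the slide.

```python
import numpy as np

def energy(x, y, h=0.0, beta=1.0, eta=2.1):
    """E(x, y) = h*sum_i x_i - beta*sum_{i,j} x_i x_j - eta*sum_i x_i y_i
    for images x, y with entries in {-1, +1}; the pair sum runs over
    horizontally and vertically neighbouring pixels."""
    pair_term = (x[:, :-1] * x[:, 1:]).sum() + (x[:-1, :] * x[1:, :]).sum()
    return h * x.sum() - beta * pair_term - eta * (x * y).sum()

# Tiny usage example with a random +/-1 "image" and x initialised to y.
rng = np.random.default_rng(0)
y = rng.choice([-1, 1], size=(4, 4))
print(energy(y.copy(), y))
```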

Example - Iterated Conditional Modes (ICM)

1. Fix the elements of y as we have them observed. (This implicitly defines p(x | y).)
2. Initialise x_i = y_i for i = 1, ..., D.
3. Take one node x_j and evaluate the total energy for both possible states x_j ∈ {-1, +1}, keeping all other variables fixed. Set x_j to the state having the lower energy.
4. Repeat for another node until a stopping criterion is satisfied.

[Figures: the restored image at a local minimum found by ICM, and at the global minimum found by graph-cut.]
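A minimal ICM sweep for the energy above might look like the sketch below; because only terms involving x_j change, each update can compare the two states through their local energy contributions. This is an illustrative implementation under assumed parameter values, not the lecture's code.

```python
import numpy as np

def icm_denoise(y, h=0.0, beta=1.0, eta=2.1, sweeps=10):
    """Iterated Conditional Modes for the pairwise denoising energy.
    y is a 2-D array with entries in {-1, +1}; returns the denoised x."""
    x = y.copy()
    rows, cols = x.shape
    for _ in range(sweeps):
        for i in range(rows):
            for j in range(cols):
                # Sum of the neighbouring x values (4-neighbourhood).
                nbr = sum(x[a, b] for a, b in ((i-1, j), (i+1, j), (i, j-1), (i, j+1))
                          if 0 <= a < rows and 0 <= b < cols)
                # Energy contribution of x[i, j] in state s is
                # h*s - beta*s*nbr - eta*s*y[i, j]; pick the lower-energy state.
                e_plus = h - beta * nbr - eta * y[i, j]
                e_minus = -h + beta * nbr + eta * y[i, j]
                x[i, j] = 1 if e_plus <= e_minus else -1
    return x

# Usage: flip 10% of the pixels of a simple block image, then denoise.
rng = np.random.default_rng(0)
clean = -np.ones((32, 32), dtype=int)
clean[8:24, 8:24] = 1
noisy = clean * np.where(rng.random(clean.shape) < 0.1, -1, 1)
restored = icm_denoise(noisy)
print((restored == clean).mean())   # fraction of pixels recovered
```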

Two frameworks for graphical models: can we convert between them?

Bayesian Network (a chain x_1 → x_2 → ... → x_N):
p(x) = p(x_1) p(x_2 | x_1) p(x_3 | x_2) ... p(x_N | x_{N-1})

Markov Random Field (a chain x_1 - x_2 - ... - x_N):
p(x) = (1/Z) ψ_{1,2}(x_1, x_2) ψ_{2,3}(x_2, x_3) ... ψ_{N-1,N}(x_{N-1}, x_N)

Chains are the same

Bayesian Network:
p(x) = p(x_1) p(x_2 | x_1) p(x_3 | x_2) ... p(x_N | x_{N-1})

Markov Random Field:
p(x) = (1/Z) ψ_{1,2}(x_1, x_2) ψ_{2,3}(x_2, x_3) ... ψ_{N-1,N}(x_{N-1}, x_N)

Find corresponding terms:
ψ_{1,2}(x_1, x_2) = p(x_1) p(x_2 | x_1)
ψ_{2,3}(x_2, x_3) = p(x_3 | x_2)
...
ψ_{N-1,N}(x_{N-1}, x_N) = p(x_N | x_{N-1})

and note that Z = 1 in this case.
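This identification can be checked numerically for a short binary chain. The sketch below builds the chain potentials from arbitrary (assumed) conditional tables and confirms that the unnormalised MRF product already sums to one, i.e. Z = 1.

```python
import itertools
import numpy as np

rng = np.random.default_rng(1)

def random_conditional():
    """Random table q[parent, child] with rows summing to one."""
    t = rng.random((2, 2))
    return t / t.sum(axis=1, keepdims=True)

N = 4
p_x1 = np.array([0.3, 0.7])                          # p(x1)
cond = [random_conditional() for _ in range(N - 1)]  # p(x_{k+1} | x_k)

# Chain potentials: psi_{1,2} absorbs p(x1), the rest are the conditionals.
def psi(k, a, b):                                    # k = 0 .. N-2
    return (p_x1[a] if k == 0 else 1.0) * cond[k][a, b]

# Z = sum over all joint states of the product of potentials; it equals 1.
Z = sum(np.prod([psi(k, x[k], x[k + 1]) for k in range(N - 1)])
        for x in itertools.product(range(2), repeat=N))
print(Z)   # 1.0 (up to floating point)
```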

Moralisation

For other kinds of Bayesian Networks (BNs), create the cliques of an MRF by adding undirected edges between all pairs of parents for each node in the graph. This process of "marrying the parents" is called moralisation, and the result is a moral graph.

BUT the resulting MRF may represent different conditional independence statements than the original BN.

Example: the moralised MRF below is fully connected and exhibits NO conditional independence properties, in contrast to the original directed graph.

[Figure: a directed graph with parents x_1, x_2, x_3 of a common child x_4, and its fully connected moral graph.]
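Moralisation itself is mechanical: marry the parents of every node, then drop the edge directions. A minimal sketch (an assumed helper, not lecture code):

```python
from itertools import combinations

def moralise(parents):
    """Moralise a DAG given as {node: set of parents}; return undirected edges."""
    edges = set()
    for child, pa in parents.items():
        for p in pa:                                   # drop edge directions
            edges.add(frozenset((p, child)))
        for u, v in combinations(sorted(pa), 2):       # marry the parents
            edges.add(frozenset((u, v)))
    return edges

# The example above: x1, x2, x3 are all parents of x4.
dag = {"x1": set(), "x2": set(), "x3": set(), "x4": {"x1", "x2", "x3"}}
print(sorted(tuple(sorted(e)) for e in moralise(dag)))
# all 6 undirected edges: the moral graph is fully connected
```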

Distribution vs Graph

Definition (D-map). A graph is a D-map (dependency map) of a distribution if every conditional independence statement satisfied by the distribution is reflected in the graph.

Definition (I-map). A graph is an I-map (independence map) of a distribution if every conditional independence statement implied by the graph is satisfied by the distribution.

Definition (P-map). A graph is a P-map (perfect map) of a distribution if it is both a D-map and an I-map for the distribution.

Consider probability distributions depending on N variables:
- P: the set of all probability distributions,
- D: the set of all probability distributions that can be represented as a perfect map by a directed graph,
- U: the set of all probability distributions that can be represented as a perfect map by an undirected graph.

[Figure: Venn diagram of the sets D and U inside P.]

Only as Bayesian Network

[Figure: a directed graph over A, B and C.]

A directed graph whose conditional independence properties cannot be expressed using an undirected graph over the same three variables.

Only as Markov Random Field

[Figure: an undirected graph over A, B, C and D.]

An undirected graph whose conditional independence properties cannot be expressed using a directed graph over the same four variables.

BNs vs. MRFs: Similarities

In both types of graphical models:
- A relationship between the conditional independence statements satisfied by a distribution and the associated simplified algebraic structure of the distribution is made in terms of graphical objects.
- The conditional independence statements are related to concepts of separation between variables in the graph.
- The simplified algebraic structure (the factorisation of p(x)) is related to local pieces of the graph (a child plus its parents in BNs, cliques in MRFs).

BNs vs. MRFs: Differences

- The set of probability distributions that can be represented as MRFs is different from the set that can be represented as BNs.
- Although both MRFs and BNs are expressed as a factorisation of local functions on the graph, the MRF has a normalisation constant Z = Σ_x ∏_{C ∈ 𝒞} ψ_C(x_C) that couples all factors, whereas the BN does not.
- The local pieces of a BN are probability distributions themselves, whereas in MRFs they need only be non-negative functions (i.e. they need not have range [0, 1] as probabilities do).