Statistical Learning


Statistical Learning, Lecture 5: Bayesian Networks and Graphical Models
Mário A. T. Figueiredo, Instituto Superior Técnico & Instituto de Telecomunicações, University of Lisbon, Portugal. May 2018.

Bayesian Networks and Graphical Models
Bayes nets in a nutshell:
- Structured probability (density/mass) functions $f_X(x;\theta)$.
- Provide a graph-based language/grammar to express conditional independence.
- Allow formalizing the problem of inferring a subset of components of $X$ from another subset thereof.
- Allow formalizing the problem of learning $\theta$ from observed i.i.d. realizations of $X$: $x_1, \ldots, x_n$.
Bayes nets are one type of graphical model, based on directed graphs. Other types (more on them later):
- Markov random fields (MRF), based on undirected graphs.
- Factor graphs.

Bayesian Networks: Introduction
Notation: we use the more compact $p(x)$ notation instead of the more correct $f_X(x)$.
For $X \in \mathbb{R}^n$, the pdf/pmf $p(x)$ can be factored by Bayes' law:
$$p(x) = p(x_1 \mid x_2,\ldots,x_n)\, p(x_2,\ldots,x_n) = p(x_1 \mid x_2,\ldots,x_n)\, p(x_2 \mid x_3,\ldots,x_n)\, p(x_3,\ldots,x_n) = \cdots = p(x_1 \mid x_2,\ldots,x_n)\cdots p(x_{n-1} \mid x_n)\, p(x_n)$$
Of course, this can be done in $n!$ different ways; e.g.,
$$p(x) = p(x_n \mid x_{n-1},\ldots,x_1)\, p(x_{n-1},\ldots,x_1) = p(x_n \mid x_{n-1},\ldots,x_1)\, p(x_{n-1} \mid x_{n-2},\ldots,x_1)\, p(x_{n-2},\ldots,x_1) = \cdots = p(x_n \mid x_{n-1},\ldots,x_1)\cdots p(x_2 \mid x_1)\, p(x_1)$$

Bayesian Networks: Introduction
For $X \in \mathbb{R}^n$, the pdf/pmf $p(x)$ can be factored by Bayes' law:
$$p(x) = p(x_n \mid x_{n-1},\ldots,x_1)\cdots p(x_2 \mid x_1)\, p(x_1)$$
In general, this is not more compact than $p(x)$.
- Example: if $x_i \in \{1,\ldots,K\}$, a general $p(x)$ has $K^n - 1 \approx K^n$ parameters.
- But $p(x_n \mid x_{n-1},\ldots,x_1)$ alone has $(K-1)K^{n-1} \approx K^n$ parameters!
...unless there are some conditional independencies.
- Example: $X$ is a Markov chain: $p(x_i \mid x_{i-1},\ldots,x_1) = p(x_i \mid x_{i-1})$; in this case, $p(x) = p(x_n \mid x_{n-1})\, p(x_{n-1} \mid x_{n-2})\cdots p(x_2 \mid x_1)\, p(x_1)$ has $n\,K(K-1) + K \approx n K^2$ parameters: linear in $n$, rather than exponential! (See the sketch below.)
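A quick arithmetic check of the two counts above; the values of $n$ and $K$ are illustrative choices, not from the lecture.

```python
# Parameter counts: full joint table vs. first-order Markov chain,
# for n discrete variables, each taking K values.
n, K = 20, 4  # illustrative values

full_joint = K**n - 1               # ~ K^n, exponential in n
markov_chain = n * K * (K - 1) + K  # ~ n K^2, linear in n

print(f"full joint:   {full_joint:,}")
print(f"Markov chain: {markov_chain:,}")
```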

Conditional Independence
Bayes nets are built on conditional independence. Random variables $X$ and $Y$ are conditionally independent, given $Z$, if
$$f_{X,Y \mid Z}(x, y \mid z) = f_{X \mid Z}(x \mid z)\, f_{Y \mid Z}(y \mid z)$$
Naturally, $X$, $Y$, and $Z$ can be groups of random variables. Notation: $X \perp Y \mid Z$ (also written $X \perp\!\!\!\perp Y \mid Z$).
Equivalent relationship (in short notation):
$$p(x \mid y, z) = \frac{p(x, y \mid z)}{p(y \mid z)} = \frac{p(x \mid z)\, p(y \mid z)}{p(y \mid z)} = p(x \mid z)$$
Factorization: if $X = (X_1, X_2, X_3)$ and $X_1 \perp X_3 \mid X_2$, then
$$p(x) = p(x_3 \mid x_2, x_1)\, p(x_2 \mid x_1)\, p(x_1) = p(x_3 \mid x_2)\, p(x_2 \mid x_1)\, p(x_1)$$

Graphical Models
Graph-based representations of the joint pdf/pmf $p(x)$. Each node $i$ represents a random variable $X_i$. The conditional independence properties are encoded by the presence/absence of edges in the graph.
Example: $X_1 \perp X_3 \mid X_2$ is represented by one of the following:
- $p(x) = p(x_3 \mid x_2)\, p(x_2 \mid x_1)\, p(x_1)$ (chain $X_1 \to X_2 \to X_3$)
- $p(x) = p(x_3 \mid x_2)\, p(x_1 \mid x_2)\, p(x_2) = p(x_3 \mid x_2)\, p(x_2 \mid x_1)\, p(x_1)$ (fork $X_1 \leftarrow X_2 \to X_3$)
- $p(x) = p(x_1 \mid x_2)\, p(x_2 \mid x_3)\, p(x_3) = p(x_1 \mid x_2)\, p(x_3 \mid x_2)\, p(x_2) = p(x_3 \mid x_2)\, p(x_2 \mid x_1)\, p(x_1)$ (chain $X_1 \leftarrow X_2 \leftarrow X_3$)

(Directed) Graph Concepts
Directed graph $G = (\mathcal{V}, \mathcal{E})$, where:
- set of nodes or vertices $\mathcal{V} = \{1,\ldots,V\}$;
- set of edges $\mathcal{E} \subseteq \mathcal{V} \times \mathcal{V}$; i.e., each element of $\mathcal{E}$ has the form $(s, t)$, with $s, t \in \mathcal{V}$;
- in this context, we assume $(v, v) \notin \mathcal{E}$, for all $v \in \mathcal{V}$.
(Running example, from the figure: the five-node DAG with edges $(1,2), (1,3), (2,4), (3,4), (3,5)$.)
Note: in an undirected graph, each edge is a 2-element multiset, i.e., has the form $\{u, v\}$, where $u, v \in \mathcal{V}$.
Parents of a node: $\mathrm{pa}(v) = \{s \in \mathcal{V} : (s, v) \in \mathcal{E}\}$. Example: $\mathrm{pa}(4) = \{2, 3\}$.
Children of a node: $\mathrm{ch}(v) = \{t \in \mathcal{V} : (v, t) \in \mathcal{E}\}$. Example: $\mathrm{ch}(3) = \{4, 5\}$.

(Directed) Graph Concepts (cont.)
Root: a node $v$ s.t. $\mathrm{pa}(v) = \emptyset$. Example: 1 is a root.
Leaf: a node $v$ s.t. $\mathrm{ch}(v) = \emptyset$. Example: 4 and 5 are leaves.
Reachability: node $t$ is reachable from $s$ if there is a sequence of edges $\big((v_{s_1}, v_{t_1}),\ldots,(v_{s_n}, v_{t_n})\big)$ s.t. $v_{s_1} = s$, $v_{s_i} = v_{t_{i-1}}$ for $i = 2,\ldots,n$, and $v_{t_n} = t$. Example: 5 is reachable from 1; 2 is not reachable from 5.
Ancestors of a node: $\mathrm{anc}(v) = \{s : v \text{ is reachable from } s\}$. Examples: $\mathrm{anc}(5) = \{1, 3\}$; $\mathrm{anc}(4) = \{1, 2, 3\}$; $\mathrm{anc}(1) = \emptyset$.
Descendants of a node: $\mathrm{desc}(v) = \{t : t \text{ is reachable from } v\}$. Examples: $\mathrm{desc}(1) = \{2, 3, 4, 5\}$; $\mathrm{desc}(2) = \{4\}$; $\mathrm{desc}(5) = \emptyset$.

(Directed) Graph Concepts (cont.)
Neighborhood of a node: $\mathrm{nbr}(v) = \{u : (u, v) \in \mathcal{E} \vee (v, u) \in \mathcal{E}\}$. Example: $\mathrm{nbr}(3) = \{1, 4, 5\}$.
In-degree of node $v$: the cardinality of $\mathrm{pa}(v)$. Example: the in-degree of 4 is 2.
Out-degree of node $v$: the cardinality of $\mathrm{ch}(v)$. Example: the out-degree of 3 is 2.
Cycle (or loop): a sequence $(v_1, v_2, \ldots, v_n)$ with $v_1 = v_n$ and $(v_i, v_{i+1}) \in \mathcal{E}$. Example: the graph shown above has no loops/cycles.
Directed acyclic graph (DAG): a directed graph with no loops/cycles.
Directed tree: a DAG where each node has 1 or 0 parents.
Subgraph of $G = (\mathcal{V}, \mathcal{E})$ induced by a subset of nodes $S \subseteq \mathcal{V}$: $G_S = (S, \mathcal{E}_S)$, where $\mathcal{E}_S = \{(u, v) \in \mathcal{E} : u, v \in S\}$. Example: $G_{\{1,3,5\}} = (\{1,3,5\}, \{(1,3), (3,5)\})$.

Directed Graphical Models (DGM)
DGM, a.k.a. Bayesian networks, belief networks, causal networks.
Consider $X = (X_1,\ldots,X_V)$ with pdf/pmf $p(x) = p(x_1,\ldots,x_V)$, and a graph $G = (\mathcal{V}, \mathcal{E})$ with $\mathcal{V} = \{1,\ldots,V\}$.
$G$ is a DGM for $X$ if (with $x_S = \{x_v,\ v \in S\}$)
$$p(x) = \prod_{v=1}^{V} p(x_v \mid x_{\mathrm{pa}(v)})$$
where each factor is a conditional probability distribution (CPD).
Example (for the five-node DAG above): $p(x) = p(x_1)\, p(x_2 \mid x_1)\, p(x_3 \mid x_1)\, p(x_4 \mid x_2, x_3)\, p(x_5 \mid x_3)$.
The DGM is not unique (example in slide 6).
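A minimal sketch of this factorization for the five-node example with binary variables; the CPT values below are made up for illustration only.

```python
import numpy as np

# p(x) = p(x1) p(x2|x1) p(x3|x1) p(x4|x2,x3) p(x5|x3), all variables in {0, 1}.
p1 = np.array([0.6, 0.4])                                  # p(x1)
p2_1 = np.array([[0.7, 0.3], [0.2, 0.8]])                  # p(x2 | x1), rows indexed by x1
p3_1 = np.array([[0.5, 0.5], [0.1, 0.9]])                  # p(x3 | x1)
p4_23 = np.array([[[0.6, 0.4], [0.3, 0.7]],
                  [[0.8, 0.2], [0.5, 0.5]]])               # p(x4 | x2, x3), indexed [x2, x3, x4]
p5_3 = np.array([[0.9, 0.1], [0.3, 0.7]])                  # p(x5 | x3)

def joint(x1, x2, x3, x4, x5):
    """p(x) as the product of one CPD per node."""
    return (p1[x1] * p2_1[x1, x2] * p3_1[x1, x3]
            * p4_23[x2, x3, x4] * p5_3[x3, x5])

# Sanity check: the factorization sums to 1 over all 2^5 configurations.
total = sum(joint(a, b, c, d, e)
            for a in (0, 1) for b in (0, 1) for c in (0, 1)
            for d in (0, 1) for e in (0, 1))
print(total)  # ~1.0
```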

Directed Graphical Models: Examples
Naïve Bayes classification (generative model): class variable $Y \in \{1,\ldots,K\}$, with prior $p(y)$; class-conditional pdf/pmf $p(x \mid y)$:
$$p(y, x) = p(x \mid y)\, p(y) = p(y) \prod_{v=1}^{V} p(x_v \mid y)$$
(graph: $Y$ is the common parent of $X_1,\ldots,X_V$).
Tree-augmented naïve Bayes (TAN) classification: class variable $Y \in \{1,\ldots,K\}$, with prior $p(y)$; class-conditional pdf/pmf $p(x \mid y)$:
$$p(y, x) = p(y) \prod_{v=1}^{V} p(x_v \mid x_{\mathrm{pa}(v)}, y)$$
where the DAG over $X_1,\ldots,X_V$ is a tree.

Directed Graphical Models: More Examples
First-order Markov chain: $p(x) = p(x_1)\, p(x_2 \mid x_1)\cdots p(x_V \mid x_{V-1})$.
Second-order Markov chain: $p(x) = p(x_1, x_2)\, p(x_3 \mid x_2, x_1)\cdots p(x_V \mid x_{V-1}, x_{V-2})$.
Hidden Markov model (HMM): $(Z, X) = (Z_1,\ldots,Z_T, X_1,\ldots,X_T)$,
$$p(z, x) = p(z_1)\, p(x_1 \mid z_1) \prod_{v=2}^{T} p(x_v \mid z_v)\, p(z_v \mid z_{v-1})$$

Inference in Directed Graphical Models
Visible and hidden variables: $x = (x_v, x_h)$, with joint pdf/pmf $p(x_v, x_h \mid \theta)$.
Inferring the hidden variables from the visible ones:
$$p(x_h \mid x_v, \theta) = \frac{p(x_h, x_v \mid \theta)}{p(x_v \mid \theta)} = \frac{p(x_h, x_v \mid \theta)}{\sum_{x_h'} p(x_h', x_v \mid \theta)}$$
...with an integral instead of a sum, if $x_h$ has real components.
Sometimes, only a subset $x_q$ of $x_h$ is of interest, $x_h = (x_q, x_n)$:
$$p(x_q \mid x_v, \theta) = \sum_{x_n} p(x_q, x_n \mid x_v, \theta)$$
...$x_n$ are sometimes called nuisance variables.

Learning in Directed Graphical Models
Observe samples $x_1,\ldots,x_N$ of $N$ i.i.d. copies of $X \sim p(x \mid \theta)$.
MAP estimate of $\theta$:
$$\hat{\theta} = \arg\max_{\theta}\Big(\log p(\theta) + \sum_{i=1}^{N} \log p(x_i \mid \theta)\Big)$$
Plate notation: a compact representation of a collection of i.i.d. copies of some variable (the figure shows the $N$ nodes $X_1,\ldots,X_N$, all with parent $\theta$, collapsed into a single node $X_i$ inside a plate labeled $N$).
If there are hidden variables, $x$ should be understood as denoting only the visible ones.

Learning in Directed Graphical Models (cont.)
For $N$ i.i.d. observations $D = (x_1,\ldots,x_N)$,
$$p(D \mid \theta) = \prod_{i=1}^{N} p(x_i \mid \theta) = \prod_{i=1}^{N}\prod_{v=1}^{V} p(x_{iv} \mid x_{i,\mathrm{pa}(v)}, \theta_v) = \prod_{v=1}^{V} p(D_v \mid \theta_v)$$
where $D_v$ is the data associated with node $v$ and $\theta_v$ the corresponding parameters.
If the prior factorizes, $p(\theta) = \prod_v p(\theta_v)$, the posterior also factorizes:
$$p(\theta \mid D) \propto p(\theta)\, p(D \mid \theta) = \prod_{v=1}^{V} p(\theta_v)\, p(D_v \mid \theta_v) \propto \prod_{v=1}^{V} p(\theta_v \mid D_v)$$

Learning in Directed Graphical Models: Categorical CPD
Each $x_v \in \{1,\ldots,K_v\}$. The number of configurations of $x_{\mathrm{pa}(v)}$ is $C_v = \prod_{s \in \mathrm{pa}(v)} K_s$.
Abusing notation, write $x_{\mathrm{pa}(v)} = c$, for $c \in \{1,\ldots,C_v\}$, to denote that $x_{\mathrm{pa}(v)}$ takes the $c$-th configuration.
Parameters: $\theta_{vck} = P(x_v = k \mid x_{\mathrm{pa}(v)} = c)$ (of course, $\sum_{k=1}^{K_v}\theta_{vck} = 1$). Denote $\theta_{vc} = (\theta_{vc1},\ldots,\theta_{vcK_v})$.
Counts: $N_{vck} = \sum_{i=1}^{N} 1(x_{iv} = k,\ x_{i,\mathrm{pa}(v)} = c)$.
Maximum likelihood estimate: $\hat{\theta}_{vck} = \dfrac{N_{vck}}{\sum_{j=1}^{K_v} N_{vcj}}$.
MAP estimate with a $\mathrm{Dirichlet}(\alpha_{vc})$ prior: $\hat{\theta}_{vck} = \dfrac{N_{vck} + \alpha_{vck}}{\sum_{j=1}^{K_v}(N_{vcj} + \alpha_{vcj})}$.
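A minimal sketch of these count-based estimates for a single node with one parent; the data-generating process below is synthetic and chosen only for illustration.

```python
import numpy as np

# theta_{vck} = P(x_v = k | x_pa(v) = c), estimated from counts N_{vck}.
rng = np.random.default_rng(0)
K_parent, K_v, N = 3, 2, 500
parent = rng.integers(K_parent, size=N)                     # observed parent configurations c
child = (rng.random(N) < 0.2 + 0.3 * parent).astype(int)    # observed x_v in {0, 1}

counts = np.zeros((K_parent, K_v))                          # N_{vck}
np.add.at(counts, (parent, child), 1)

theta_ml = counts / counts.sum(axis=1, keepdims=True)       # maximum likelihood estimate
alpha = np.ones((K_parent, K_v))                            # Dirichlet prior counts
theta_map = (counts + alpha) / (counts + alpha).sum(axis=1, keepdims=True)  # slide's MAP formula

print(theta_ml)
print(theta_map)
```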

Conditional Independence Properties
Consider $X = (X_1,\ldots,X_V) \sim p(x)$, and $\mathcal{V} = \{1,\ldots,V\}$.
Let $x_A \perp x_B \mid x_C$ be a true conditional independence (CI) statement about $p(x)$, where $A$, $B$, $C$ are disjoint subsets of $\mathcal{V}$. Let $I(p)$ be the set of all (true) CI statements about $p(x)$.
A graph $G = (\mathcal{V}, \mathcal{E})$ expresses a collection $I(G)$ of CI statements $x_A \perp_G x_B \mid x_C$ (explained below).
$G$ is an I-map (independence map) of $p(x)$ if $I(G) \subseteq I(p)$.
Example: if $G$ is fully connected, $I(G) = \emptyset$, thus $I(G) \subseteq I(p)$ for any $p$.

Conditional Independence Properties (cont.)
What type of conditional independence (CI) statements are expressed by some graph $G$?
In a path through some node $m$, there are three possible orientation structures: tail-to-tail, head-to-tail, head-to-head.
Before proceeding to the general statement, we next exemplify which type of CI corresponds to each of these structures.

Conditional Independence Structures
Tail-to-tail ($X \leftarrow Z \rightarrow Y$): $X \perp Y \mid Z$, since
$$p(x, y \mid z) = \frac{p(x, y, z)}{p(z)} = \frac{p(x \mid z)\, p(y \mid z)\, p(z)}{p(z)} = p(x \mid z)\, p(y \mid z)$$
Head-to-tail ($X \rightarrow Z \rightarrow Y$): $X \perp Y \mid Z$, since
$$p(x, y \mid z) = \frac{p(y \mid z)\, p(z \mid x)\, p(x)}{p(z)} = p(y \mid z)\, p(x \mid z)$$

Conditional Independence Structures
Head-to-head ($X \rightarrow Z \leftarrow Y$): in general, $p(x, y \mid z) \neq p(x \mid z)\, p(y \mid z)$; $X$ and $Y$, which are marginally independent, become dependent once $Z$ is observed.
Classical example: the noisy fuel gauge. Binary variables: $X$ = battery OK, $Y$ = full tank, $Z$ = fuel gauge on. $P(X{=}1) = P(Y{=}1) = 0.9$, and

x  y  P(Z=1 | x, y)
1  1  0.8
1  0  0.2
0  1  0.2
0  0  0.1

The gauge is off ($Z = 0$); is the tank empty? $P(Y{=}0 \mid Z{=}0) \approx 0.257$.
...but if the battery is also dead, $P(Y{=}0 \mid Z{=}0, X{=}0) \approx 0.11$.
The dead battery "explains away" the empty tank.
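A quick check of the fuel-gauge numbers above, by brute-force enumeration of the joint $P(x, y, z) = P(x)P(y)P(z \mid x, y)$ for the head-to-head DAG $X \rightarrow Z \leftarrow Y$:

```python
from itertools import product

p_x = {1: 0.9, 0: 0.1}          # battery OK
p_y = {1: 0.9, 0: 0.1}          # full tank
p_z1 = {(1, 1): 0.8, (1, 0): 0.2, (0, 1): 0.2, (0, 0): 0.1}  # P(Z=1 | x, y)

def joint(x, y, z):
    pz = p_z1[(x, y)] if z == 1 else 1 - p_z1[(x, y)]
    return p_x[x] * p_y[y] * pz

def cond(query, evidence):
    """P(query | evidence); both are dicts over the names 'x', 'y', 'z'."""
    num = den = 0.0
    for x, y, z in product((0, 1), repeat=3):
        val = {'x': x, 'y': y, 'z': z}
        if all(val[k] == v for k, v in evidence.items()):
            p = joint(x, y, z)
            den += p
            if all(val[k] == v for k, v in query.items()):
                num += p
    return num / den

print(cond({'y': 0}, {'z': 0}))            # ~0.257
print(cond({'y': 0}, {'z': 0, 'x': 0}))    # ~0.111  (explaining away)
```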

D-Separation
Graph $G = (\mathcal{V}, \mathcal{E})$ and three disjoint subsets of $\mathcal{V}$: $A$, $B$, and $C$.
An (undirected) path from a node in $A$ to a node in $B$ is blocked by $C$ if it includes a node $m$ such that either:
- $m \in C$ and the arrows meet head-to-tail or tail-to-tail at $m$; or
- the arrows meet head-to-head at $m$, $m \notin C$, and $\mathrm{desc}(m) \cap C = \emptyset$.
$C$ d-separates $A$ from $B$, and $x_A \perp_G x_B \mid x_C$, if every path from a node in $A$ to a node in $B$ is blocked by $C$.
Examples (for the DAG drawn on the slide, with nodes 1–7; figure not reproduced here):
- $x_4 \perp_G x_5 \mid x_1$ (tail-to-tail);
- $x_1 \perp_G x_2 \mid x_4$ (head-to-head at $5 \notin \{4\}$);
- $x_5 \perp_G x_6 \mid x_{\{2,3\}}$ (tail-to-tail at 2 and head-to-tail at 3);
- $x_1 \perp_G x_2 \mid x_7$ is false (head-to-head at 5, but $7 \in \mathrm{desc}(5)$).
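D-separation can also be tested programmatically with an equivalent criterion (not the path-blocking definition above): restrict to the ancestral graph of $A \cup B \cup C$, moralize it, delete $C$, and check whether $A$ and $B$ are disconnected. A sketch, using the five-node example DAG from the graph-concepts slides:

```python
from itertools import combinations

# Edges of the running five-node example DAG, as (parent, child) pairs.
edges = [(1, 2), (1, 3), (2, 4), (3, 4), (3, 5)]

def ancestors(S):
    """Nodes from which some node in S is reachable, plus S itself."""
    result = set(S)
    changed = True
    while changed:
        changed = False
        for (u, v) in edges:
            if v in result and u not in result:
                result.add(u)
                changed = True
    return result

def d_separated(A, B, C):
    # 1) Restrict to the ancestral graph of A ∪ B ∪ C.
    keep = ancestors(A | B | C)
    dag = [(u, v) for (u, v) in edges if u in keep and v in keep]
    # 2) Moralize: undirected skeleton plus edges between co-parents.
    und = {frozenset(e) for e in dag}
    for w in keep:
        parents = [u for (u, v) in dag if v == w]
        und |= {frozenset(p) for p in combinations(parents, 2)}
    # 3) Delete C, then check whether A can reach B in the undirected graph.
    und = {e for e in und if not (e & C)}
    reach, frontier = set(A), set(A)
    while frontier:
        new = set()
        for e in und:
            u, v = tuple(e)
            if u in frontier and v not in reach:
                new.add(v)
            if v in frontier and u not in reach:
                new.add(u)
        reach |= new
        frontier = new
    return not (reach & B)

print(d_separated({4}, {5}, {3}))   # True:  x4 and x5 are d-separated by {3}
print(d_separated({2}, {3}, {4}))   # False: conditioning on the collider 4 opens 2 -> 4 <- 3
```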

Markov Blanket
The Markov blanket of node $m$, $\mathrm{mb}(m)$, is the set of nodes, conditioned on which $x_m$ is independent of all other nodes. Which nodes belong to $\mathrm{mb}(m)$?
$$p(x_m \mid \{x_j : j \neq m\}) = \frac{\prod_{i=1}^{V} p(x_i \mid x_{\mathrm{pa}(i)})}{\sum_{x_m}\prod_{i=1}^{V} p(x_i \mid x_{\mathrm{pa}(i)})} = \frac{\prod_{j:\, j = m \,\vee\, m \in \mathrm{pa}(j)} p(x_j \mid x_{\mathrm{pa}(j)})}{\sum_{x_m}\prod_{j:\, j = m \,\vee\, m \in \mathrm{pa}(j)} p(x_j \mid x_{\mathrm{pa}(j)})}$$
since the factors $p(x_i \mid x_{\mathrm{pa}(i)})$ that do not involve $x_m$ cancel between numerator and denominator.
...thus $\mathrm{mb}(m) = \mathrm{pa}(m) \cup \mathrm{ch}(m) \cup \mathrm{pa}(\mathrm{ch}(m))$, the last set being the co-parents.

Undirected Graphical Models (Markov Random Fields)
MRFs are based on undirected graphs $G = (\mathcal{V}, \mathcal{E})$, where each edge $\{u, v\}$ is a sub-multiset of $\mathcal{V}$ of cardinality 2.
Conditional independence statements result from a simple separation definition (simpler than d-separation).
Let $A$, $B$, $C$ be disjoint subsets of $\mathcal{V}$. If every path from a node in $A$ to a node in $B$ goes through $C$, it is said that $C$ separates $A$ from $B$.
Graph $G$ is an I-map for $X = (X_1,\ldots,X_V) \sim p(x)$ if: $C$ separates $A$ from $B$ $\Rightarrow$ $x_A \perp x_B \mid x_C$.
...a perfect I-map if $\Rightarrow$ can be replaced by $\Leftrightarrow$.
A complete graph is an I-map for any $p(x)$.

Markov Random Fields (cont.)
Neighborhood: $N(i) = \{j : \{i, j\} \in \mathcal{E}\}$.
In an MRF, the Markov blanket is simply the neighborhood, $\mathrm{mb}(i) = N(i)$:
$$p(x_i \mid x_{\mathcal{V}\setminus\{i\}}) = p(x_i \mid x_{N(i)}) \quad\Leftrightarrow\quad x_i \perp x_{\mathcal{V}\setminus(\{i\}\cup N(i))} \mid x_{N(i)}$$
Clique: a set of mutually neighboring nodes.
Maximal clique: a clique that is not contained in any other clique; $\mathcal{C}(G)$ denotes the set of maximal cliques of $G$.
Examples (for the undirected graph drawn on the slide, with nodes 1–7): $\{1,2\}$ is a (non-maximal) clique; $\{1,2,3,5\}$ is not a clique (1 and 5 are not neighbors); $\{1,2,3\}$ and $\{4,5,6,7\}$ are maximal cliques.

Hammersley-Clifford Theorem and Gibbs Distributions
Let $p(x) = p(x_1,\ldots,x_V)$ be such that $p(x) > 0$ for all $x$. Then, $G$ is an I-map for $p(x)$ if and only if
$$p(x) = \frac{1}{Z}\prod_{C \in \mathcal{C}(G)}\psi_C(x_C) = \frac{1}{Z}\exp\Big(-\sum_{C \in \mathcal{C}(G)} E_C(x_C)\Big)$$
where
$$Z = \sum_x \prod_{C \in \mathcal{C}(G)}\psi_C(x_C)$$
is the partition function (with integration, rather than summation, in the case of continuous variables).
$\psi_C$ is called a clique potential; $E_C$ is called a clique energy. This is known in statistical physics as a Gibbs distribution.
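A minimal sketch of a Gibbs distribution on a small binary MRF, with the partition function computed by brute-force enumeration; the graph (maximal cliques $\{1,2,3\}$ and $\{3,4\}$) and potential tables are illustrative, not from the lecture.

```python
import numpy as np
from itertools import product

rng = np.random.default_rng(1)
psi_123 = rng.random((2, 2, 2)) + 0.1   # clique potential ψ_{1,2,3} > 0
psi_34 = rng.random((2, 2)) + 0.1       # clique potential ψ_{3,4} > 0

def unnormalized(x):
    x1, x2, x3, x4 = x
    return psi_123[x1, x2, x3] * psi_34[x3, x4]

Z = sum(unnormalized(x) for x in product((0, 1), repeat=4))      # partition function
p = {x: unnormalized(x) / Z for x in product((0, 1), repeat=4)}  # p(x) = (1/Z) Π_C ψ_C(x_C)
print(sum(p.values()))   # 1.0 up to rounding
```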

Local Conditionals in Gibbs Distributions
Consider a Gibbs distribution over a graph $G$:
$$p(x) = \frac{1}{Z}\prod_{C \in \mathcal{C}(G)}\psi_C(x_C) = \frac{1}{Z}\exp\Big(-\sum_{C \in \mathcal{C}(G)} E_C(x_C)\Big)$$
The local conditional distribution involves only the cliques that contain $i$:
$$p(x_i \mid x_{N(i)}) = \frac{1}{Z(x_{N(i)})}\prod_{C:\, i \in C}\psi_C(x_C) = \frac{1}{Z(x_{N(i)})}\exp\Big(-\sum_{C:\, i \in C} E_C(x_C)\Big)$$

Auto Models
Auto models: based on 2-node cliques:
$$E_C(x_C) = \sum_{D:\, |D| = 2,\, D \subseteq C} E_D(x_D), \qquad \text{equivalently} \qquad \psi_C(x_C) = \prod_{D:\, |D| = 2,\, D \subseteq C}\psi_D(x_D)$$
The joint distribution has the form
$$p(x) = \frac{1}{Z}\prod_{C \in \mathcal{C}(G)}\ \prod_{D:\, |D| = 2,\, D \subseteq C}\psi_D(x_D) = \frac{1}{Z}\exp\Big(-\sum_{C \in \mathcal{C}(G)}\ \sum_{D:\, |D| = 2,\, D \subseteq C} E_D(x_D)\Big)$$

Auto Models: Gaussian Markov Random Fields (GMRF)
$X \in \mathbb{R}^V$, with Gaussian
$$p(x) \propto \exp\Big(-\tfrac{1}{2}(x-\mu)^T A\,(x-\mu)\Big) \propto \exp\Big(-\tfrac{1}{2}\sum_{i=1}^{V}\sum_{j=1}^{V} A_{ij}(x_i-\mu_i)(x_j-\mu_j)\Big)$$
where $A$, the inverse of the covariance matrix, is symmetric.
Each pairwise clique $\{i, j\}$ has energy
$$E_{\{i,j\}}(\{x_i, x_j\}) = \begin{cases} A_{ij}(x_i-\mu_i)(x_j-\mu_j) & \text{if } i \neq j \\ \frac{A_{ii}}{2}(x_i-\mu_i)^2 & \text{if } i = j \end{cases}$$
Neighborhood system: $N(i) = \{j : A_{ij} \neq 0\}$.
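A sketch checking the GMRF Markov property numerically: the conditional of $x_i$ given the rest depends only on the neighbors $N(i) = \{j : A_{ij} \neq 0\}$, via $\mathbb{E}[x_i \mid x_{\text{rest}}] = \mu_i - A_{ii}^{-1}\sum_{j\neq i} A_{ij}(x_j - \mu_j)$ and $\mathrm{var} = A_{ii}^{-1}$, which we compare against the standard covariance-based Gaussian conditioning formula. The 4-node chain precision matrix below is an illustrative choice.

```python
import numpy as np

A = np.array([[ 2.0, -0.8,  0.0,  0.0],
              [-0.8,  2.0, -0.8,  0.0],
              [ 0.0, -0.8,  2.0, -0.8],
              [ 0.0,  0.0, -0.8,  2.0]])   # precision matrix (chain: A_13 = A_14 = 0)
mu = np.array([1.0, 0.0, -1.0, 0.5])
Sigma = np.linalg.inv(A)

rng = np.random.default_rng(0)
x = rng.normal(size=4)                     # an arbitrary configuration of the other nodes
i = 0
rest = [j for j in range(4) if j != i]

# Precision-based conditional mean: only neighbors of i contribute (A_ij = 0 elsewhere).
mean_prec = mu[i] - (A[i, rest] @ (x[rest] - mu[rest])) / A[i, i]

# Covariance-based conditional mean and variance (textbook Gaussian conditioning).
S_ri = np.linalg.solve(Sigma[np.ix_(rest, rest)], Sigma[np.ix_(rest, [i])])
mean_cov = mu[i] + (x[rest] - mu[rest]) @ S_ri
var_cov = Sigma[i, i] - Sigma[np.ix_([i], rest)] @ S_ri

print(mean_prec, mean_cov[0])        # the two conditional means agree
print(1.0 / A[i, i], var_cov[0, 0])  # the two conditional variances agree
```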

Auto Models: Ising and Potts Fields
$X \in \{-1, +1\}^V$, with energy function
$$E_{\{i,j\}}(x_i, x_j) = \begin{cases} -\beta & x_i = x_j \\ \beta & x_i \neq x_j \end{cases}$$
where $\beta > 0$ (ferromagnetic interaction) or $\beta < 0$ (anti-ferromagnetic interaction).
Computing $Z$ is NP-hard in general.
Generalization to $K$ states: the Potts model, $X \in \{1,\ldots,K\}^V$, with energy function
$$E_{\{i,j\}}(x_i, x_j) = \begin{cases} -\beta & x_i = x_j \\ 0 & x_i \neq x_j \end{cases}$$
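A minimal Gibbs-sampling sketch for the Ising model on a small 2D grid with 4 nearest neighbors; for $\pm 1$ spins the case-form energy above equals $E_{\{i,j\}} = -\beta x_i x_j$, so $p(x) \propto \exp(\beta\sum_{i\sim j} x_i x_j)$ and the local conditional is a sigmoid. Grid size, $\beta$ and the number of sweeps are illustrative choices.

```python
import numpy as np

L, beta, n_sweeps = 32, 0.6, 100
rng = np.random.default_rng(0)
x = rng.choice([-1, 1], size=(L, L))

def neighbor_sum(x, i, j):
    # free (non-periodic) boundary: missing neighbors contribute 0
    s = 0
    if i > 0:     s += x[i - 1, j]
    if i < L - 1: s += x[i + 1, j]
    if j > 0:     s += x[i, j - 1]
    if j < L - 1: s += x[i, j + 1]
    return s

for _ in range(n_sweeps):
    for i in range(L):
        for j in range(L):
            # local conditional: P(x_ij = +1 | neighbors) = sigmoid(2 * beta * S)
            S = neighbor_sum(x, i, j)
            p_plus = 1.0 / (1.0 + np.exp(-2.0 * beta * S))
            x[i, j] = 1 if rng.random() < p_plus else -1

print("mean magnetization:", x.mean())
```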

Illustration of Potts Fields
Samples of a Potts model with $K = 10$. The graph is a 2D grid; the neighborhood of each node is the set of its 4 nearest neighbors.
(Figure: three samples, for $\beta = 1.42$, $\beta = 1.44$, and $\beta = 1.46$.)

DGM and MRF: The Easy Case
Problem: how to write a DGM as an MRF?
In some cases, there is a trivial relationship: a Markov chain $p(x) = p(x_1)\, p(x_2 \mid x_1)\cdots p(x_N \mid x_{N-1})$ can obviously be written as an MRF (with $Z = 1$):
$$p(x) = \frac{1}{Z}\,\underbrace{\psi_{\{1,2\}}(x_1,x_2)}_{p(x_1)\,p(x_2\mid x_1)}\;\underbrace{\psi_{\{2,3\}}(x_2,x_3)}_{p(x_3\mid x_2)}\cdots\underbrace{\psi_{\{N-1,N\}}(x_{N-1},x_N)}_{p(x_N\mid x_{N-1})}$$

DGM and MRF: The General Case
Procedure:
1. Insert undirected edges between all pairs of parents of each node ("moralization").
2. Make all edges undirected.
3. Initialize all clique potentials to 1.
4. Take each factor in the DGM and multiply it into the potential of a clique that contains all the involved nodes.
Example:
$$p(x) = \underbrace{p(x_1)\,p(x_2)\,p(x_3)\,p(x_4 \mid x_1,x_2,x_3)}_{\psi_{\{1,2,3,4\}}(x_1,x_2,x_3,x_4)}$$

From DGM to MRF: Another Example
DGM: $p(x) = p(x_1)\, p(x_2)\, p(x_3 \mid x_1,x_2)\, p(x_4 \mid x_3)\, p(x_5 \mid x_3)\, p(x_6 \mid x_4,x_5)$.
Maximal cliques (after moralization): $\mathcal{C} = \{\{1,2,3\}, \{3,4,5\}, \{4,5,6\}\}$.
MRF:
$$p(x) = \underbrace{\psi(x_1,x_2,x_3)}_{p(x_1)\,p(x_2)\,p(x_3\mid x_1,x_2)}\;\underbrace{\psi(x_3,x_4,x_5)}_{p(x_4\mid x_3)\,p(x_5\mid x_3)}\;\underbrace{\psi(x_4,x_5,x_6)}_{p(x_6\mid x_4,x_5)}$$

Efficient Inference
Motivating example: compute a marginal $p(x_n)$ in a chain graph,
$$p(x) = p(x_1,\ldots,x_N) = \psi(x_1,x_2)\,\psi(x_2,x_3)\cdots\psi(x_{N-1},x_N)$$
Naïve solution (suppose $x_1,\ldots,x_N \in \{1,\ldots,K\}$ and $Z = 1$):
$$p(x_n) = \sum_{x_1=1}^{K}\cdots\sum_{x_{n-1}=1}^{K}\ \sum_{x_{n+1}=1}^{K}\cdots\sum_{x_N=1}^{K} p(x_1,\ldots,x_N)$$
has cost $O(K^N)$ (just tabulating $p(x_1,\ldots,x_N)$ has cost $O(K^N)$)...the structure of $p(x_1,\ldots,x_N)$ is not being exploited.

Efficient Inference (cont.)
Reorder the summations and use the structure:
$$p(x_n) = \sum_{x_{n+1}}\cdots\sum_{x_N}\ \sum_{x_{n-1}}\cdots\sum_{x_1}\psi(x_1,x_2)\cdots\psi(x_{N-1},x_N)$$
$$= \underbrace{\Big[\sum_{x_{n+1}}\psi(x_n,x_{n+1})\cdots\Big[\sum_{x_N}\psi(x_{N-1},x_N)\Big]\cdots\Big]}_{\mu_\beta(x_n)}\ \underbrace{\Big[\sum_{x_{n-1}}\psi(x_{n-1},x_n)\cdots\Big[\sum_{x_1}\psi(x_1,x_2)\Big]\cdots\Big]}_{\mu_\alpha(x_n)} = \mu_\beta(x_n)\,\mu_\alpha(x_n)$$
Cost $O(N K^2)$ (versus $O(K^N)$; e.g., 2000 versus $10^{20}$).
The key is the distributive property: $ab + ac = a(b + c)$.

Efficient Inference (cont.)
This can be seen as message passing:
$$\mu_\beta(x_n) = \sum_{x_{n+1}}\psi(x_n,x_{n+1})\cdots\sum_{x_{N-1}}\psi(x_{N-2},x_{N-1})\underbrace{\sum_{x_N}\psi(x_{N-1},x_N)}_{\mu_\beta(x_{N-1})}$$
(the inner sum is $\mu_\beta(x_{N-1})$, the next one $\mu_\beta(x_{N-2})$, and so on).
Formally, right-to-left messages: $\mu_\beta(x_N) = 1$, for $x_N \in \{1,\ldots,K\}$;
$$\mu_\beta(x_j) = \sum_{x_{j+1}=1}^{K}\psi(x_j, x_{j+1})\,\mu_\beta(x_{j+1}), \quad \text{for } x_j \in \{1,\ldots,K\}$$

Efficient Inference (cont.)
This can be seen as message passing:
$$\mu_\alpha(x_n) = \sum_{x_{n-1}}\psi(x_{n-1},x_n)\cdots\sum_{x_2}\psi(x_2,x_3)\underbrace{\sum_{x_1}\psi(x_1,x_2)}_{\mu_\alpha(x_2)}$$
Formally, left-to-right messages: $\mu_\alpha(x_1) = 1$, for $x_1 \in \{1,\ldots,K\}$;
$$\mu_\alpha(x_j) = \sum_{x_{j-1}=1}^{K}\psi(x_{j-1}, x_j)\,\mu_\alpha(x_{j-1}), \quad \text{for } x_j \in \{1,\ldots,K\}$$
Known as the sum-product algorithm, a.k.a. belief propagation.
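A minimal sketch of these chain messages, computing every marginal from the $\alpha$ and $\beta$ recursions and checking against brute-force enumeration; the pairwise potential tables are random positive values, chosen only for illustration (the code normalizes at the end, so $Z = 1$ need not hold).

```python
import numpy as np
from itertools import product

rng = np.random.default_rng(0)
N, K = 6, 3
psi = [rng.random((K, K)) + 0.1 for _ in range(N - 1)]   # psi[j][a, b] = ψ(x_j = a, x_{j+1} = b), 0-indexed

# Forward messages: alpha[0] = 1;  alpha[j](x_j) = Σ_{x_{j-1}} ψ(x_{j-1}, x_j) alpha[j-1](x_{j-1})
alpha = [np.ones(K)]
for j in range(1, N):
    alpha.append(psi[j - 1].T @ alpha[-1])

# Backward messages: beta[N-1] = 1;  beta[j](x_j) = Σ_{x_{j+1}} ψ(x_j, x_{j+1}) beta[j+1](x_{j+1})
beta = [np.ones(K) for _ in range(N)]
for j in range(N - 2, -1, -1):
    beta[j] = psi[j] @ beta[j + 1]

marg_mp = [alpha[n] * beta[n] for n in range(N)]          # p(x_n) ∝ μ_α(x_n) μ_β(x_n)
marg_mp = [m / m.sum() for m in marg_mp]

# Brute-force check (feasible only because N and K are tiny).
joint = np.zeros((K,) * N)
for x in product(range(K), repeat=N):
    joint[x] = np.prod([psi[j][x[j], x[j + 1]] for j in range(N - 1)])
joint /= joint.sum()
marg_bf = [joint.sum(axis=tuple(a for a in range(N) if a != n)) for n in range(N)]

print(np.allclose(marg_mp, marg_bf))   # True
```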

Efficient Inference (cont.)
Another example: compute the MAP configuration, $\max_x p(x)$, with
$$p(x) = p(x_1,\ldots,x_N) = \psi(x_1,x_2)\,\psi(x_2,x_3)\cdots\psi(x_{N-1},x_N)$$
Naïve solution (suppose $x_1,\ldots,x_N \in \{1,\ldots,K\}$ and $Z = 1$):
$$\max_x p(x) = \max_{x_1}\max_{x_2}\cdots\max_{x_N} p(x_1,\ldots,x_N)$$
has cost $O(K^N)$ (just tabulating $p(x_1,\ldots,x_N)$ has cost $O(K^N)$)...the structure of $p(x_1,\ldots,x_N)$ is not being exploited.

Efficient Inference (cont.)
Exploit the structure:
$$\max_x p(x) = \max_{x_1}\max_{x_2}\cdots\max_{x_N}\psi(x_1,x_2)\cdots\psi(x_{N-1},x_N) = \max_{x_1}\Big[\max_{x_2}\psi(x_1,x_2)\cdots\Big[\max_{x_{N-1}}\psi(x_{N-2},x_{N-1})\underbrace{\Big[\max_{x_N}\psi(x_{N-1},x_N)\Big]}_{\mu(x_{N-1})}\Big]\cdots\Big]$$
(the inner max is $\mu(x_{N-1})$, the next one $\mu(x_{N-2})$, and so on).
Formally, right-to-left messages: $\mu(x_N) = 1$, for $x_N \in \{1,\ldots,K\}$;
$$\mu(x_j) = \max_{x_{j+1}}\psi(x_j, x_{j+1})\,\mu(x_{j+1}), \quad \text{for } x_j \in \{1,\ldots,K\}$$
This can also be done from left to right, using the reverse ordering.

Efficient Inference (cont.)
Exploit the structure (as on the previous slide):
$$\max_x p(x) = \max_{x_1}\Big[\max_{x_2}\psi(x_1,x_2)\cdots\underbrace{\Big[\max_{x_N}\psi(x_{N-1},x_N)\Big]}_{\mu(x_{N-1})}\cdots\Big]$$
Cost $O(N K^2)$ (versus $O(K^N)$; e.g., 2000 versus $10^{20}$).
The key is the distributive property: $\max\{ab, ac\} = a\max\{b, c\}$ (for $a \geq 0$).
Equivalently, for $\max \log p(x)$, use $\max\{a+b, a+c\} = a + \max\{b, c\}$.
To compute $\arg\max_x p(x)$ we need a backward and a forward pass; why and how?

Efficient Inference (cont.)
MAP inference via message passing (backward messages followed by forward backtracking):
$$\hat{x}_1 = \arg\max_{x_1}\Big(\max_{x_2,\ldots,x_N} p(x_1, x_2,\ldots,x_N)\Big) = \arg\max_{x_1}\mu(x_1)$$
$$\hat{x}_2 = \arg\max_{x_2}\psi(\hat{x}_1, x_2)\,\mu(x_2), \quad \ldots, \quad \hat{x}_N = \arg\max_{x_N}\psi(\hat{x}_{N-1}, x_N)\underbrace{\mu(x_N)}_{1} = \arg\max_{x_N}\psi(\hat{x}_{N-1}, x_N)$$
Cost $O(N K^2)$ (versus $O(K^N)$; e.g., 2000 versus $10^{20}$).
This is similar to dynamic programming and the Viterbi algorithm, and is known as the max-product (or max-sum, with logs) algorithm.
How to extend this to more general graphical structures? Easy for DGMs! A general algorithm is more conveniently written for factor graphs.
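A minimal max-product (Viterbi-style) sketch for the same chain: a backward pass computes the messages $\mu$, a forward pass backtracks the argmax, and the result is checked by brute force. The potential tables are random positive values, chosen only for illustration.

```python
import numpy as np
from itertools import product

rng = np.random.default_rng(1)
N, K = 6, 3
psi = [rng.random((K, K)) + 0.1 for _ in range(N - 1)]   # psi[j][a, b] = ψ(x_j = a, x_{j+1} = b), 0-indexed

# Backward pass: mu[N-1] = 1;  mu[j](x_j) = max_{x_{j+1}} ψ(x_j, x_{j+1}) mu[j+1](x_{j+1})
mu = [np.ones(K) for _ in range(N)]
for j in range(N - 2, -1, -1):
    mu[j] = (psi[j] * mu[j + 1]).max(axis=1)

# Forward pass (backtracking): x̂_1 = argmax μ(x_1), then x̂_j = argmax ψ(x̂_{j-1}, x_j) μ(x_j)
x_hat = [int(np.argmax(mu[0]))]
for j in range(1, N):
    scores = psi[j - 1][x_hat[-1], :] * mu[j]
    x_hat.append(int(np.argmax(scores)))

# Brute-force check of the MAP configuration.
def unnorm(x):
    return np.prod([psi[j][x[j], x[j + 1]] for j in range(N - 1)])
best = max(product(range(K), repeat=N), key=unnorm)
print(tuple(x_hat) == best)   # True (up to ties in the random potentials)
```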

Factor Graphs
The joint pdf/pmf of $X = (X_1,\ldots,X_V)$ (discrete, continuous, or hybrid) is a product
$$p(x) = \prod_{s \in S} f_s(x_s), \quad \text{where } S \subseteq 2^{\{1,\ldots,V\}}, \text{ i.e., each } s \subseteq \{1,\ldots,V\}$$
Each factor $f_s$ only depends on a subset $x_s$ of components of $x$.
Example: $p(x_1, x_2, x_3) = f_a(x_1, x_2)\, f_b(x_1, x_2)\, f_c(x_2, x_3)\, f_d(x_3)$.
Seen as an MRF, $\mathcal{C} = \{\{1,2\}, \{2,3\}\}$, thus $p(x) \propto \psi_{\{1,2\}}\,\psi_{\{2,3\}}$, with $\psi_{\{1,2\}} = f_a(x_1,x_2)\, f_b(x_1,x_2)$ and $\psi_{\{2,3\}} = f_c(x_2,x_3)\, f_d(x_3)$.

Factor Graphs
Factor graphs are bipartite: two disjoint subsets of nodes (factors and variables), with no edges between nodes in the same subset.
The mapping from an MRF to a factor graph is not unique.
The mapping from a Bayesian network to a factor graph is not unique.
Neighborhood of a variable node $x$: $\mathrm{ne}(x) = \{s \in S : x \in s\}$, the set of factors in which $x$ participates.

Sum-Product Algorithm on Factor Graphs
Working example: compute the marginal $p(x)$ of a single variable $x$, by summing the joint over all the other variables, $p(x) = \sum_{\mathbf{x}\setminus x} p(\mathbf{x})$.
Assume the graph is a tree. Group the factors as
$$p(\mathbf{x}) = \prod_{s \in S} f_s(\mathbf{x}_s) = \prod_{s \in \mathrm{ne}(x)} F_s(x, X_s)$$
where $X_s$ is the set of variables in the subtree connected to $x$ via factor $s$, and $F_s(x, X_s)$ is the product of all the factors in that subtree.

Sum-Product Algorithm on Factor Graphs
Rewrite the marginalization:
$$p(x) = \sum_{\mathbf{x}\setminus x}\ \prod_{s \in \mathrm{ne}(x)} F_s(x, X_s) = \prod_{s \in \mathrm{ne}(x)}\ \underbrace{\sum_{X_s} F_s(x, X_s)}_{\mu_{f_s \to x}(x)}$$
With $F_s(x, X_s) = f_s(x, x_1,\ldots,x_M)\, G_1(x_1, X_{s1})\cdots G_M(x_M, X_{sM})$ (where $x_1,\ldots,x_M$ are the variables connected to $f_s$ other than $x$),
$$\mu_{f_s \to x}(x) = \sum_{x_1}\cdots\sum_{x_M} f_s(x, x_1,\ldots,x_M)\prod_{m \in \mathrm{ne}(f_s)\setminus x}\ \underbrace{\sum_{X_{sm}} G_m(x_m, X_{sm})}_{\mu_{x_m \to f_s}(x_m)}$$

Sum-Product Algorithm on Factor Graphs
Factor-to-variable (FtV) messages:
$$\mu_{f_s \to x}(x) = \sum_{x_1}\cdots\sum_{x_M} f_s(x, x_1,\ldots,x_M)\prod_{m \in \mathrm{ne}(f_s)\setminus x}\underbrace{\mu_{x_m \to f_s}(x_m)}_{\text{variable-to-factor (VtF) messages}}$$
Computing the FtV message from $f_s$ to $x$:
- compute the product of the VtF messages coming from all variables except $x$;
- multiply by the local factor;
- marginalize w.r.t. all variables except $x$.

Sum-Product Algorithm on Factor Graphs
Variable-to-factor (VtF) messages:
$$\mu_{x_m \to f_s}(x_m) = \sum_{X_{sm}} G_m(x_m, X_{sm})$$
But $G_m(x_m, X_{sm}) = \prod_{l \in \mathrm{ne}(x_m)\setminus f_s} F_l(x_m, X_{ml})$, hence
$$\mu_{x_m \to f_s}(x_m) = \prod_{l \in \mathrm{ne}(x_m)\setminus f_s}\sum_{X_{ml}} F_l(x_m, X_{ml}) = \prod_{l \in \mathrm{ne}(x_m)\setminus f_s}\mu_{f_l \to x_m}(x_m)$$
Each VtF message is the product of the FtV messages that the variable receives from the other factors.

Sum-Product Algorithm on Factor Graphs
Variable-to-factor (VtF) messages from leaf variables:
$$\mu_{x_m \to f_s}(x_m) = \prod_{l \in \mathrm{ne}(x_m)\setminus f_s}\mu_{f_l \to x_m}(x_m) = 1$$
Factor-to-variable (FtV) messages from leaf factors:
$$\mu_{f_s \to x}(x) = \sum_{x_1}\cdots\sum_{x_M} f_s(x, x_1,\ldots,x_M)\prod_{m \in \mathrm{ne}(f_s)\setminus x}\mu_{x_m \to f_s}(x_m) = f_s(x)$$

Sum-Product on Factor Graphs: Detailed Example
$p(x) = f_a(x_1, x_2)\, f_b(x_2, x_3)\, f_c(x_2, x_4)$
$$p(x_2) = \mu_{f_a \to x_2}(x_2)\,\mu_{f_b \to x_2}(x_2)\,\mu_{f_c \to x_2}(x_2) = \Big[\sum_{x_1} f_a(x_1, x_2)\Big]\Big[\sum_{x_3} f_b(x_2, x_3)\Big]\Big[\sum_{x_4} f_c(x_2, x_4)\Big] = \sum_{x_1}\sum_{x_3}\sum_{x_4} f_a(x_1, x_2)\, f_b(x_2, x_3)\, f_c(x_2, x_4) = p(x_2)$$
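A quick numerical check of this example: the product of the three factor-to-variable messages into $x_2$ equals the (unnormalized) marginal of $x_2$. The factor tables are random, for illustration only.

```python
import numpy as np

rng = np.random.default_rng(2)
K = 3
f_a = rng.random((K, K))   # f_a(x1, x2)
f_b = rng.random((K, K))   # f_b(x2, x3)
f_c = rng.random((K, K))   # f_c(x2, x4)

# Factor-to-variable messages into x2 (the other variables are leaves).
mu_a = f_a.sum(axis=0)     # Σ_{x1} f_a(x1, x2)
mu_b = f_b.sum(axis=1)     # Σ_{x3} f_b(x2, x3)
mu_c = f_c.sum(axis=1)     # Σ_{x4} f_c(x2, x4)
marg_msgs = mu_a * mu_b * mu_c

# Brute-force marginal of x2 from the unnormalized joint.
joint = np.einsum('ab,bc,bd->abcd', f_a, f_b, f_c)   # indices: x1, x2, x3, x4
marg_bf = joint.sum(axis=(0, 2, 3))

print(np.allclose(marg_msgs, marg_bf))   # True
```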

Max-Sum Algorithm on Factor Graphs
Message passing for MAP: $\max_x \log p(x) = \max_x \sum_{s \in S}\log f_s(x_s)$.
Distributive property: $\max\{a+b, a+c\} = a + \max\{b, c\}$.
Max-sum messages:
$$\mu_{f \to x}(x) = \max_{x_1,\ldots,x_M}\Big[\log f(x, x_1,\ldots,x_M) + \sum_{m \in \mathrm{ne}(f)\setminus x}\mu_{x_m \to f}(x_m)\Big]$$
$$\mu_{x \to f}(x) = \sum_{l \in \mathrm{ne}(x)\setminus f}\mu_{f_l \to x}(x)$$
At leaf variables and factors: $\mu_{x \to f}(x) = 0$ and $\mu_{f \to x}(x) = \log f(x)$.

Recommended Reading
C. Bishop, Pattern Recognition and Machine Learning, Springer, 2006 (this lecture was very much based on Chapter 8 of this book). Chapter 8 is freely available at http://research.microsoft.com/en-us/um/people/cmbishop/prml/pdf/bishop-prml-sample.pdf
K. Murphy, Machine Learning: A Probabilistic Perspective, MIT Press, 2012 (Chapters 10 and 19).