Statistical Learning, Lecture 5: Bayesian Networks and Graphical Models. Mário A. T. Figueiredo, Instituto Superior Técnico & Instituto de Telecomunicações, University of Lisbon, Portugal. May 2018.
Bayesian Networks and Graphical Models

Bayes nets in a nutshell:
- Structured probability (density/mass) functions f_X(x; θ)
- Provide a graph-based language/grammar to express conditional independence
- Allow formalizing the problem of inferring a subset of the components of X from another subset thereof
- Allow formalizing the problem of learning θ from observed i.i.d. realizations of X: x_1, ..., x_n

Bayes nets are one type of graphical model, based on directed graphs. Other types (more on them later):
- Markov random fields (MRF), based on undirected graphs
- Factor graphs
Bayesian Networks: Introduction

Notation: we use the more compact p(x) instead of the more correct f_X(x).
For X ∈ R^n, the pdf/pmf p(x) can be factored by the chain rule (repeated application of Bayes' law):
p(x) = p(x_1 | x_2, ..., x_n) p(x_2, ..., x_n)
     = p(x_1 | x_2, ..., x_n) p(x_2 | x_3, ..., x_n) p(x_3, ..., x_n)
     = p(x_1 | x_2, ..., x_n) ··· p(x_{n-1} | x_n) p(x_n)
Of course, this can be done in n! different ways; e.g.,
p(x) = p(x_n | x_{n-1}, ..., x_1) p(x_{n-1}, ..., x_1)
     = p(x_n | x_{n-1}, ..., x_1) p(x_{n-1} | x_{n-2}, ..., x_1) p(x_{n-2}, ..., x_1)
     = p(x_n | x_{n-1}, ..., x_1) ··· p(x_2 | x_1) p(x_1)
Bayesian Networks: Introduction

For X ∈ R^n, the pdf/pmf p(x) can be factored by the chain rule:
p(x) = p(x_n | x_{n-1}, ..., x_1) ··· p(x_2 | x_1) p(x_1)
In general, this is not more compact than p(x) itself.
Example: if x_i ∈ {1, ..., K}, a general p(x) has K^n − 1 ≈ K^n parameters.
But p(x_n | x_{n-1}, ..., x_1) alone has (K − 1) K^{n−1} ≈ K^n parameters!
... unless there are some conditional independencies.
Example: X is a Markov chain, p(x_i | x_{i-1}, ..., x_1) = p(x_i | x_{i-1}); in this case,
p(x) = p(x_n | x_{n-1}) p(x_{n-1} | x_{n-2}) ··· p(x_2 | x_1) p(x_1)
has n K(K − 1) + K ≈ n K^2 parameters: linear in n, rather than exponential!
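The counting argument above is easy to check numerically. The sketch below (not part of the original slides; function names are illustrative) tallies the free parameters of a full joint table versus a first-order Markov chain factorization over n K-ary variables:

```python
# Compare the number of free parameters of a general joint pmf over
# n K-ary variables with that of a first-order Markov chain.

def general_params(n, K):
    # A full table over K^n configurations, minus 1 for normalization.
    return K**n - 1

def markov_chain_params(n, K):
    # p(x_1): K - 1 free parameters; each conditional p(x_i | x_{i-1}):
    # K rows of K - 1 free parameters, for i = 2, ..., n.
    return (K - 1) + (n - 1) * K * (K - 1)

print(general_params(10, 2))       # 1023
print(markov_chain_params(10, 2))  # 19
```

The gap widens quickly: for n = 20 binary variables, the full table needs over a million parameters, the chain only 39.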
Conditional Independence

Bayes nets are built on conditional independence. Random variables X and Y are conditionally independent, given Z, if
f_{X,Y|Z}(x, y | z) = f_{X|Z}(x | z) f_{Y|Z}(y | z)
Naturally, X, Y, and Z can be groups of random variables.
Notation: X ⊥ Y | Z or X ⊥⊥ Y | Z
Equivalent relationship (in short notation):
p(x | y, z) = p(x, y | z) / p(y | z) = p(x | z) p(y | z) / p(y | z) = p(x | z)
Factorization: if X = (X_1, X_2, X_3) and X_1 ⊥⊥ X_3 | X_2, then
p(x) = p(x_3 | x_2, x_1) p(x_2 | x_1) p(x_1) = p(x_3 | x_2) p(x_2 | x_1) p(x_1)
Graphical Models

Graph-based representations of the joint pdf/pmf p(x). Each node i represents a random variable X_i. The conditional independence properties are encoded by the presence/absence of edges in the graph.
Example: X_1 ⊥⊥ X_3 | X_2 is represented by one of the following factorizations (each corresponding to a chain over the three nodes):
p(x) = p(x_3 | x_2) p(x_2 | x_1) p(x_1)
p(x) = p(x_3 | x_2) p(x_1 | x_2) p(x_2) = p(x_3 | x_2) p(x_2 | x_1) p(x_1)
p(x) = p(x_1 | x_2) p(x_2 | x_3) p(x_3) = p(x_1 | x_2) p(x_3 | x_2) p(x_2) = p(x_3 | x_2) p(x_2 | x_1) p(x_1)
(Directed) Graph Concepts

Directed graph G = (V, E), where:
- set of nodes or vertices V = {1, ..., V}
- set of edges E ⊆ V × V; i.e., each element of E has the form (s, t), with s, t ∈ V
- in this context, we assume (v, v) ∉ E, for all v ∈ V
[figure: example graph with nodes 1-5 and edges 1→2, 1→3, 2→4, 3→4, 3→5, used in the examples below]
Note: in an undirected graph, each edge is a 2-element multiset; i.e., it has the form {u, v}, where u, v ∈ V.
Parents of a node: pa(v) = {s ∈ V : (s, v) ∈ E}. Example: pa(4) = {2, 3}.
Children of a node: ch(v) = {t ∈ V : (v, t) ∈ E}. Example: ch(3) = {4, 5}.
(Directed) Graph Concepts (cont.)

Root: a node v s.t. pa(v) = ∅. Example: 1 is a root.
Leaf: a node v s.t. ch(v) = ∅. Example: 4 and 5 are leaves.
Reachability: node t is reachable from s if there is a sequence of edges ((v_{s_1}, v_{t_1}), ..., (v_{s_n}, v_{t_n})) s.t. v_{s_1} = s, v_{s_i} = v_{t_{i−1}} for i = 2, ..., n, and v_{t_n} = t.
Example: 5 is reachable from 1; 2 is not reachable from 5.
Ancestors of a node: anc(v) = {s : v is reachable from s}. Examples: anc(5) = {1, 3}; anc(4) = {1, 2, 3}; anc(1) = ∅.
Descendants of a node: desc(v) = {t : t is reachable from v}. Examples: desc(1) = {2, 3, 4, 5}; desc(2) = {4}; desc(5) = ∅.
(Directed) Graph Concepts (cont.)

Neighborhood of a node: nbr(v) = {u : (u, v) ∈ E or (v, u) ∈ E}. Example: nbr(3) = {1, 4, 5}.
In-degree of node v: the cardinality of pa(v). Example: the in-degree of 4 is 2.
Out-degree of node v: the cardinality of ch(v). Example: the out-degree of 3 is 2.
Cycle (or loop): a sequence (v_1, v_2, ..., v_n) with v_1 = v_n and (v_i, v_{i+1}) ∈ E. Example: the graph shown above has no loops/cycles.
Directed acyclic graph (DAG): a directed graph with no loops/cycles.
Directed tree: a DAG where each node has 1 or 0 parents.
Subgraph of G = (V, E) induced by a subset of nodes S ⊆ V: G_S = (S, E_S), where E_S = {(u, v) ∈ E : u, v ∈ S}. Example: G_{1,3,5} = ({1, 3, 5}, {(1, 3), (3, 5)}).
Directed Graphical Models (DGM)

DGM, a.k.a. Bayesian networks, belief networks, causal networks.
Consider X = (X_1, ..., X_V) with pdf/pmf p(x) = p(x_1, ..., x_V), and a graph G = (V, E), where V = {1, ..., V}.
G is a DGM for X if (with x_S = {x_v, v ∈ S})
p(x) = ∏_{v=1}^V p(x_v | x_{pa(v)})
where each factor is a conditional probability distribution (CPD).
Example (graph with edges 1→2, 1→3, 2→4, 3→4, 3→5):
p(x) = p(x_1) p(x_2 | x_1) p(x_3 | x_1) p(x_4 | x_2, x_3) p(x_5 | x_3)
The DGM is not unique (example in slide 6).
Directed Graphical Models: Examples

Naïve Bayes classification (generative model): class variable Y ∈ {1, ..., K}, with prior p(y); class-conditional pdf/pmf p(x | y):
p(y, x) = p(x | y) p(y) = p(y) ∏_{v=1}^V p(x_v | y)
[figure: Y with children X_1, X_2, X_3, X_4]
Tree-augmented naïve Bayes (TAN) classification: class variable Y ∈ {1, ..., K}, with prior p(y); class-conditional pdf/pmf p(x | y):
p(y, x) = p(y) ∏_{v=1}^V p(x_v | x_{pa(v)}, y)
where the DAG over X_1, ..., X_V is a tree.
Directed Graphical Models: More Examples

First-order Markov chain: p(x) = p(x_1) p(x_2 | x_1) ··· p(x_V | x_{V−1})
Second-order Markov chain: p(x) = p(x_1, x_2) p(x_3 | x_2, x_1) ··· p(x_V | x_{V−1}, x_{V−2})
Hidden Markov model (HMM): (Z, X) = (Z_1, ..., Z_T, X_1, ..., X_T), with
p(z, x) = p(z_1) p(x_1 | z_1) ∏_{v=2}^T p(x_v | z_v) p(z_v | z_{v−1})
Inference in Directed Graphical Models

Visible and hidden variables: x = (x_v, x_h), with joint pdf/pmf p(x_v, x_h | θ).
Inferring the hidden variables from the visible ones:
p(x_h | x_v, θ) = p(x_h, x_v | θ) / p(x_v | θ) = p(x_h, x_v | θ) / Σ_{x_h} p(x_h, x_v | θ)
... with an integral instead of a sum, if x_h has real components.
Sometimes, only a subset x_q of x_h is of interest, x_h = (x_q, x_n):
p(x_q | x_v, θ) = Σ_{x_n} p(x_q, x_n | x_v, θ)
... x_n are sometimes called nuisance variables.
Learning in Directed Graphical Models

Observe samples x_1, ..., x_N of N i.i.d. copies of X ~ p(x | θ).
MAP estimate of θ:
θ̂ = arg max_θ ( log p(θ) + Σ_{i=1}^N log p(x_i | θ) )
Plate notation: a single node X_i inside a plate labeled N stands for the collection of i.i.d. copies X_1, ..., X_N, all depending on θ.
If there are hidden variables, x should be understood as denoting only the visible ones.
Learning in Directed Graphical Models (cont.)

For N i.i.d. observations D = (x_1, ..., x_N),
p(D | θ) = ∏_{i=1}^N p(x_i | θ) = ∏_{i=1}^N ∏_{v=1}^V p(x_{iv} | x_{i,pa(v)}, θ) = ∏_{v=1}^V p(D_v | θ_v)
where D_v is the data associated with node v and θ_v the corresponding parameters.
If the prior factorizes, p(θ) = ∏_v p(θ_v), the posterior also factorizes:
p(θ | D) ∝ p(θ) p(D | θ) = ∏_{v=1}^V p(θ_v) p(D_v | θ_v) ∝ ∏_{v=1}^V p(θ_v | D_v)
Learning in Directed Graphical Models: Categorical CPD

Each x_v ∈ {1, ..., K_v}. The number of configurations of x_{pa(v)} is C_v = ∏_{s ∈ pa(v)} K_s.
Abusing notation, write x_{pa(v)} = c, for c ∈ {1, ..., C_v}, to denote that x_{pa(v)} takes the c-th configuration.
Parameters: θ_{vck} = P(x_v = k | x_{pa(v)} = c) (of course, Σ_{k=1}^{K_v} θ_{vck} = 1). Denote θ_{vc} = (θ_{vc1}, ..., θ_{vcK_v}).
Counts: N_{vck} = Σ_{i=1}^N 1(x_{iv} = k, x_{i,pa(v)} = c)
Maximum likelihood estimates: θ̂_{vck} = N_{vck} / Σ_{j=1}^{K_v} N_{vcj}
MAP estimate w/ Dirichlet(α_{vc}) prior: θ̂_{vck} = (N_{vck} + α_{vck}) / Σ_{j=1}^{K_v} (N_{vcj} + α_{vcj})
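A minimal sketch of these estimators for one node v (the array layout and helper name are illustrative, not from the slides): each row of the count matrix corresponds to one parent configuration c, each column to one state k.

```python
import numpy as np

def estimate_cpd(N_vc, alpha=None):
    """ML (alpha=None) or Dirichlet-smoothed estimate of theta_vc.
    N_vc: array of shape (C_v, K_v) holding the counts N_vck."""
    N = np.asarray(N_vc, dtype=float)
    if alpha is not None:
        N = N + alpha  # add pseudo-counts alpha_vck
    return N / N.sum(axis=1, keepdims=True)

# Counts for a node with K_v = 2 states and C_v = 2 parent configurations.
counts = np.array([[8, 2],
                   [0, 5]])
print(estimate_cpd(counts))             # ML: rows (0.8, 0.2) and (0.0, 1.0)
print(estimate_cpd(counts, alpha=1.0))  # smoothed: rows (0.75, 0.25) and (1/7, 6/7)
```

Note how the second row illustrates why smoothing matters: the ML estimate assigns probability exactly 0 to the unseen event (c = 2, k = 1), while the Dirichlet pseudo-counts keep it strictly positive.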
Conditional Independence Properties

Consider X = (X_1, ..., X_V) ~ p(x), and V = {1, ..., V}.
Let x_A ⊥⊥ x_B | x_C be a true conditional independence (CI) statement about p(x), where A, B, C are disjoint subsets of V.
Let I(p) be the set of all (true) CI statements about p(x).
A graph G = (V, E) expresses a collection I(G) of CI statements x_A ⊥⊥_G x_B | x_C (explained below).
G is an I-map (independence map) of p(x) if I(G) ⊆ I(p).
Example: if G is fully connected, I(G) = ∅, thus I(G) ⊆ I(p), for any p.
Conditional Independence Properties (cont.)

What type of conditional independence (CI) statements are expressed by a graph G?
On a path through some node m, there are three possible orientation structures:
- tail to tail
- head to tail
- head to head
Before proceeding to the general statement, we next exemplify which type of CI corresponds to each of these structures.
Conditional Independence Structures

Tail to tail (X ← Z → Y): conditioning on Z makes X and Y independent:
p(x, y | z) = p(x, y, z) / p(z) = p(x | z) p(y | z) p(z) / p(z) = p(x | z) p(y | z)
Head to tail (X → Z → Y): conditioning on Z also makes X and Y independent:
p(x, y | z) = p(x) p(z | x) p(y | z) / p(z) = p(x | z) p(y | z)
Conditional Independence Structures (cont.)

Head to head (X → Z ← Y): in general, p(x, y | z) ≠ p(x | z) p(y | z), even though X and Y are marginally independent.
Classical example: the noisy fuel gauge. Binary variables: X = battery OK, Y = full tank, Z = fuel gauge on.
P(X = 1) = P(Y = 1) = 0.9

x  y  P(Z = 1 | x, y)
1  1  0.8
1  0  0.2
0  1  0.2
0  0  0.1

The gauge is off (Z = 0); is the tank empty? P(Y = 0 | Z = 0) ≈ 0.257.
... but if the battery is also dead, P(Y = 0 | Z = 0, X = 0) ≈ 0.111.
The dead battery "explains away" the empty tank.
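The two conditional probabilities above can be reproduced by brute-force enumeration of the joint p(x, y, z) = P(x) P(y) P(z | x, y); a small sketch:

```python
from itertools import product

# Priors P(X=1) = P(Y=1) = 0.9, and the noisy-gauge CPD P(Z=1 | x, y).
pX = {1: 0.9, 0: 0.1}
pY = {1: 0.9, 0: 0.1}
pZ1 = {(1, 1): 0.8, (1, 0): 0.2, (0, 1): 0.2, (0, 0): 0.1}

def joint(x, y, z):
    # p(x, y, z) = P(x) P(y) P(z | x, y)
    pz = pZ1[(x, y)] if z == 1 else 1 - pZ1[(x, y)]
    return pX[x] * pY[y] * pz

# P(Y=0 | Z=0): marginalize over the battery state X.
num = sum(joint(x, 0, 0) for x in (0, 1))
den = sum(joint(x, y, 0) for x, y in product((0, 1), repeat=2))
print(round(num / den, 3))  # 0.257

# P(Y=0 | Z=0, X=0): observing the dead battery explains away the off gauge.
print(round(joint(0, 0, 0) / sum(joint(0, y, 0) for y in (0, 1)), 3))  # 0.111
```

Observing X = 0 more than halves the probability that the tank is empty, which is exactly the "explaining away" effect: conditioning on the common child Z couples its two parents.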
D-Separation

Graph G = (V, E) and three disjoint subsets of V: A, B, and C.
An (undirected) path from a node in A to a node in B is blocked by C if it includes a node m such that either
- the arrows meet head to tail or tail to tail at m, and m ∈ C; or
- the arrows meet head to head at m, and neither m nor any of its descendants is in C (m ∉ C and desc(m) ∩ C = ∅).
C d-separates A from B, and x_A ⊥⊥_G x_B | x_C, if every path from a node in A to a node in B is blocked by C.
[figure: example DAG with nodes 1-7]
Examples:
- x_4 ⊥⊥_G x_5 | x_1 (tail to tail)
- x_1 ⊥⊥_G x_2 | x_4 (head to head in 5, with 5 ∉ {4} and desc(5) ∩ {4} = ∅)
- x_5 ⊥⊥_G x_6 | x_{2,3} (tail to tail in 2 and head to tail in 3)
- x_1 ⊥⊥_G x_2 | x_7 is false (head to head in 5, but 7 ∈ desc(5))
Markov Blanket

The Markov blanket of node m, mb(m), is the set of nodes, conditioned on which x_m is independent of all other nodes. Which nodes belong to mb(m)?
p(x_m | {x_j : j ≠ m}) = ∏_{i=1}^V p(x_i | x_{pa(i)}) / Σ_{x_m} ∏_{i=1}^V p(x_i | x_{pa(i)})
Every factor p(x_j | x_{pa(j)}) with j ≠ m and m ∉ pa(j) does not involve x_m; it can be pulled out of the sum in the denominator and cancels with the same factor in the numerator. What remains involves only p(x_m | x_{pa(m)}) and the factors p(x_j | x_{pa(j)}) of the children j ∈ ch(m).
... thus mb(m) = pa(m) ∪ ch(m) ∪ pa(ch(m)), the last term being the coparents.
Undirected Graphical Models (Markov Random Fields)

MRFs are based on undirected graphs G = (V, E), where each edge {u, v} is a sub-multiset of V of cardinality 2.
Conditional independence statements result from simple separation (simpler than d-separation).
Definition: let A, B, C be disjoint subsets of V; if every path from a node in A to a node in B goes through C, it is said that C separates A from B.
Graph G is an I-map for X = (X_1, ..., X_V) ~ p(x) if
C separates A from B ⇒ x_A ⊥⊥ x_B | x_C
... and a perfect I-map if ⇒ can be replaced by ⇔.
A complete graph is an I-map for any p(x).
Markov Random Fields (cont.)

Neighborhood: N(i) = {j : {i, j} ∈ E}.
In an MRF, the Markov blanket is simply the neighborhood, mb(i) = N(i):
p(x_i | x_{V\{i}}) = p(x_i | x_{N(i)}), i.e., x_i ⊥⊥ x_{V\({i} ∪ N(i))} | x_{N(i)}
Clique: a set of mutually neighboring nodes.
Maximal clique: a clique that is not contained in any other clique; C(G) denotes the set of maximal cliques of G.
[figure: example undirected graph with nodes 1-7]
Examples: {1, 2} is a (non-maximal) clique; {1, 2, 3, 5} is not a clique (1 and 5 are not neighbors); {1, 2, 3} and {4, 5, 6, 7} are maximal cliques.
Hammersley-Clifford Theorem and Gibbs Distributions

Let p(x) = p(x_1, ..., x_V) be such that p(x) > 0, for all x. Then, G is an I-map for p(x) if and only if
p(x) = (1/Z) ∏_{C ∈ C(G)} ψ_C(x_C) = (1/Z) exp( − Σ_{C ∈ C(G)} E_C(x_C) )
where
Z = Σ_x ∏_{C ∈ C(G)} ψ_C(x_C)
is the partition function (with integration, rather than summation, in the case of continuous variables).
ψ_C is called a clique potential; E_C is called a clique energy.
This is known in statistical physics as a Gibbs distribution.
Local Conditionals in Gibbs Distributions

Consider a Gibbs distribution over a graph G:
p(x) = (1/Z) ∏_{C ∈ C(G)} ψ_C(x_C) = (1/Z) exp( − Σ_{C ∈ C(G)} E_C(x_C) )
The local conditional distribution involves only the cliques containing i:
p(x_i | x_{N(i)}) = (1/Z(x_{N(i)})) ∏_{C : i ∈ C} ψ_C(x_C) = (1/Z(x_{N(i)})) exp( − Σ_{C : i ∈ C} E_C(x_C) )
Auto Models

Auto models are based on 2-node cliques:
E_C(x_C) = Σ_{D : |D| = 2, D ⊆ C} E_D(x_D)
or, equivalently,
ψ_C(x_C) = ∏_{D : |D| = 2, D ⊆ C} ψ_D(x_D)
The joint distribution has the form
p(x) = (1/Z) ∏_{D : |D| = 2, D ⊆ C ∈ C(G)} ψ_D(x_D) = (1/Z) exp( − Σ_{D : |D| = 2, D ⊆ C ∈ C(G)} E_D(x_D) )
Auto Models: Gaussian Markov Random Fields (GMRF)

X ∈ R^V, with Gaussian pdf
p(x) ∝ exp( −(1/2) (x − μ)^T A (x − μ) ) ∝ exp( −(1/2) Σ_{i=1}^V Σ_{j=1}^V A_{ij} (x_i − μ_i)(x_j − μ_j) )
where A, the inverse of the covariance matrix, is symmetric.
Each pairwise clique {i, j} has energy
E_{i,j}({x_i, x_j}) = A_{ij} (x_i − μ_i)(x_j − μ_j), if i ≠ j;  (A_{ii}/2) (x_i − μ_i)^2, if i = j.
Neighborhood system: N(i) = {j : A_{ij} ≠ 0}.
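The neighborhood structure has a concrete consequence that is easy to verify numerically: for a Gaussian with precision matrix A, the conditional mean of x_i given all other variables is μ_i − (1/A_ii) Σ_{j≠i} A_ij (x_j − μ_j), which involves only the neighbors {j : A_ij ≠ 0}. The sketch below (variable names are illustrative, not from the slides) checks this local formula against standard covariance-based Gaussian conditioning:

```python
import numpy as np

rng = np.random.default_rng(0)

# Precision matrix of a chain GMRF on 5 nodes (tridiagonal A => chain graph).
A = np.diag([2.0] * 5) + np.diag([-0.8] * 4, 1) + np.diag([-0.8] * 4, -1)
mu = np.zeros(5)
x = rng.normal(size=5)  # an arbitrary configuration of the other variables

i = 2
r = [j for j in range(5) if j != i]  # "the rest"

# Local (precision-based) formula: only neighbors {1, 3} of node 2 contribute,
# since A[2, j] = 0 for all non-neighbors j.
m_local = mu[i] - (A[i, r] @ (x[r] - np.asarray(mu)[r])) / A[i, i]

# Standard Gaussian conditioning via the covariance S = A^{-1}:
# E[x_i | x_r] = mu_i + S_ir S_rr^{-1} (x_r - mu_r).
S = np.linalg.inv(A)
m_cond = mu[i] + S[i, r] @ np.linalg.solve(S[np.ix_(r, r)], x[r] - np.asarray(mu)[r])

print(np.isclose(m_local, m_cond))  # True
```

The local formula needs only the row A_i and the neighboring values, which is what makes Gibbs sampling and local inference in GMRFs cheap even when V is large.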
Auto Models: Ising and Potts Fields

X ∈ {−1, +1}^V, with energy function
E_{i,j}(x_i, x_j) = −β, if x_i = x_j;  +β, if x_i ≠ x_j  (equivalently, E_{i,j}(x_i, x_j) = −β x_i x_j)
where β > 0 (ferromagnetic interaction) or β < 0 (anti-ferromagnetic interaction).
Computing Z is NP-hard in general.
Generalization to K states: the Potts model, X ∈ {1, ..., K}^V, with energy function
E_{i,j}(x_i, x_j) = −β, if x_i = x_j;  0, if x_i ≠ x_j
Illustration of Potts Fields

[figure: samples of a Potts model with K = 10; the graph is a 2D grid, and the neighborhood of each node is the set of its 4 nearest neighbors; the three panels show β = 1.42, β = 1.44, and β = 1.46]
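Samples like those described above can be drawn with a Gibbs sampler that repeatedly resamples each site from its local conditional (slide 26): with the Potts energy −β per agreeing neighbor pair, p(x_i = k | neighbors) ∝ exp(β · #{neighbors in state k}). A minimal sketch (grid size, sweep count, and function name are arbitrary illustrative choices, not from the slides):

```python
import numpy as np

def gibbs_potts(shape, K, beta, sweeps, seed=0):
    """Gibbs sampling of a K-state Potts model on a 2D grid
    with 4-nearest-neighbor interactions."""
    rng = np.random.default_rng(seed)
    x = rng.integers(K, size=shape)  # random initial configuration
    H, W = shape
    for _ in range(sweeps):
        for i in range(H):
            for j in range(W):
                # For each state k, count how many of the 4 neighbors have it.
                counts = np.zeros(K)
                for di, dj in ((-1, 0), (1, 0), (0, -1), (0, 1)):
                    ni, nj = i + di, j + dj
                    if 0 <= ni < H and 0 <= nj < W:
                        counts[x[ni, nj]] += 1
                # Local conditional p(x_ij = k | neighbors) ∝ exp(beta * counts[k]).
                p = np.exp(beta * counts)
                x[i, j] = rng.choice(K, p=p / p.sum())
    return x

sample = gibbs_potts((32, 32), K=10, beta=1.44, sweeps=20)
```

For β near the values in the figure, larger and larger single-color patches emerge as β grows; note that the sampler only ever evaluates local conditionals, never the intractable partition function Z.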
DGM and MRF: The Easy Case

Problem: how to write a DGM as an MRF?
In some cases, there is a trivial relationship: a Markov chain
p(x) = p(x_1) p(x_2 | x_1) ··· p(x_N | x_{N−1})
can obviously be written as an MRF (with Z = 1):
p(x) = (1/Z) ψ_{1,2}(x_1, x_2) ψ_{2,3}(x_2, x_3) ··· ψ_{N−1,N}(x_{N−1}, x_N)
with ψ_{1,2}(x_1, x_2) = p(x_1) p(x_2 | x_1), ψ_{2,3}(x_2, x_3) = p(x_3 | x_2), ..., ψ_{N−1,N}(x_{N−1}, x_N) = p(x_N | x_{N−1}).
DGM and MRF: The General Case

Procedure:
1. Insert undirected edges between all pairs of parents of each node ("moralization");
2. Make all edges undirected;
3. Initialize all clique potentials to 1;
4. Take each factor in the DGM and multiply it into the potential of a clique that contains all the involved nodes.
Example:
p(x) = p(x_1) p(x_2) p(x_3) p(x_4 | x_1, x_2, x_3) = ψ_{1,2,3,4}(x_1, x_2, x_3, x_4)
(moralization connects the parents 1, 2, 3 of node 4, yielding the single clique {1, 2, 3, 4}).
From DGM to MRF: Another Example

DGM: p(x) = p(x_1) p(x_2) p(x_3 | x_1, x_2) p(x_4 | x_3) p(x_5 | x_3) p(x_6 | x_4, x_5)
Cliques (after moralization): C = { {1, 2, 3}, {3, 4, 5}, {4, 5, 6} }
MRF: p(x) = ψ(x_1, x_2, x_3) ψ(x_3, x_4, x_5) ψ(x_4, x_5, x_6), with
ψ(x_1, x_2, x_3) = p(x_1) p(x_2) p(x_3 | x_1, x_2), ψ(x_3, x_4, x_5) = p(x_4 | x_3) p(x_5 | x_3), ψ(x_4, x_5, x_6) = p(x_6 | x_4, x_5)
Efficient Inference

Motivating example: compute a marginal p(x_n) in a chain graph
p(x) = p(x_1, ..., x_N) = ψ(x_1, x_2) ψ(x_2, x_3) ··· ψ(x_{N−1}, x_N)
Naïve solution (suppose x_1, x_2, ..., x_N ∈ {1, ..., K} and Z = 1):
p(x_n) = Σ_{x_1=1}^K ··· Σ_{x_{n−1}=1}^K Σ_{x_{n+1}=1}^K ··· Σ_{x_N=1}^K p(x_1, ..., x_N)
has cost O(K^N) (just computing p(x_1, ..., x_N) for all configurations has cost O(K^N))
... the structure of p(x_1, ..., x_N) is not being exploited.
Efficient Inference (cont.)

Reorder the summations and use the structure:
p(x_n) = [ Σ_{x_{n+1}} ψ(x_n, x_{n+1}) ··· [ Σ_{x_{N−1}} ψ(x_{N−2}, x_{N−1}) [ Σ_{x_N} ψ(x_{N−1}, x_N) ] ] ]
       × [ Σ_{x_{n−1}} ψ(x_{n−1}, x_n) ··· [ Σ_{x_2} ψ(x_2, x_3) [ Σ_{x_1} ψ(x_1, x_2) ] ] ]
       = μ_β(x_n) μ_α(x_n)
Cost O(N K^2) (versus O(K^N); e.g., 2000 versus 10^20).
The key is the distributive property: a b + a c = a (b + c).
Efficient Inference (cont.)

This can be seen as message passing:
μ_β(x_n) = Σ_{x_{n+1}} ψ(x_n, x_{n+1}) ··· [ Σ_{x_{N−1}} ψ(x_{N−2}, x_{N−1}) [ Σ_{x_N} ψ(x_{N−1}, x_N) ] ]
where the innermost bracket is μ_β(x_{N−1}) and the next one is μ_β(x_{N−2}).
Formally, right-to-left messages:
μ_β(x_N) = 1, for x_N ∈ {1, ..., K}
μ_β(x_j) = Σ_{x_{j+1}=1}^K ψ(x_j, x_{j+1}) μ_β(x_{j+1}), for x_j ∈ {1, ..., K}
Efficient Inference (cont.)

Similarly, for the left-to-right messages:
μ_α(x_n) = Σ_{x_{n−1}} ψ(x_{n−1}, x_n) ··· [ Σ_{x_2} ψ(x_2, x_3) [ Σ_{x_1} ψ(x_1, x_2) ] ]
where the innermost bracket is μ_α(x_2) and the next one is μ_α(x_3).
Formally:
μ_α(x_1) = 1, for x_1 ∈ {1, ..., K}
μ_α(x_j) = Σ_{x_{j−1}=1}^K ψ(x_{j−1}, x_j) μ_α(x_{j−1}), for x_j ∈ {1, ..., K}
Known as the sum-product algorithm, a.k.a. belief propagation.
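The two message recursions take only a few lines of code. The sketch below (0-indexed; function and variable names are illustrative, not from the slides) computes a chain marginal with the α/β messages and checks it against brute-force enumeration:

```python
import numpy as np
from itertools import product

def chain_marginal(psis, n):
    """Normalized marginal of x_n (0-indexed) in p(x) ∝ prod_j psi_j(x_j, x_{j+1}).
    psis: list of N-1 arrays of shape (K, K)."""
    K = psis[0].shape[0]
    alpha = np.ones(K)                         # mu_alpha(x_0) = 1
    for j in range(n):                         # left-to-right messages
        alpha = psis[j].T @ alpha              # sum over x_j of psi(x_j, x_{j+1}) alpha(x_j)
    beta = np.ones(K)                          # mu_beta(x_{N-1}) = 1
    for j in range(len(psis) - 1, n - 1, -1):  # right-to-left messages
        beta = psis[j] @ beta                  # sum over x_{j+1} of psi(x_j, x_{j+1}) beta(x_{j+1})
    p = alpha * beta
    return p / p.sum()                         # divide by Z

rng = np.random.default_rng(1)
K, N, n = 3, 5, 2
psis = [rng.uniform(0.1, 1.0, size=(K, K)) for _ in range(N - 1)]

# Brute-force check: enumerate all K^N configurations.
pfull = np.zeros(K)
for x in product(range(K), repeat=N):
    pfull[x[n]] += np.prod([psis[j][x[j], x[j + 1]] for j in range(N - 1)])
pfull /= pfull.sum()

print(np.allclose(chain_marginal(psis, n), pfull))  # True
```

The message passes cost O(N K^2), while the enumeration loop costs O(K^N); for this tiny chain both are feasible, which is precisely what makes the check possible.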
Efficient Inference (cont.)

Another example: compute the MAP configuration, max_x p(x), with
p(x) = p(x_1, ..., x_N) = ψ(x_1, x_2) ψ(x_2, x_3) ··· ψ(x_{N−1}, x_N)
Naïve solution (suppose x_1, x_2, ..., x_N ∈ {1, ..., K} and Z = 1):
max_x p(x) = max_{x_1} max_{x_2} ··· max_{x_N} p(x_1, ..., x_N)
has cost O(K^N) (just computing p(x_1, ..., x_N) for all configurations has cost O(K^N))
... the structure of p(x_1, ..., x_N) is not being exploited.
Efficient Inference (cont.)

Exploit the structure:
max_x p(x) = max_{x_1} max_{x_2} ··· max_{x_N} ψ(x_1, x_2) ··· ψ(x_{N−1}, x_N)
           = max_{x_1} [ max_{x_2} ψ(x_1, x_2) ··· [ max_{x_{N−1}} ψ(x_{N−2}, x_{N−1}) [ max_{x_N} ψ(x_{N−1}, x_N) ] ] ]
where the innermost bracket is μ(x_{N−1}) and the next one is μ(x_{N−2}).
Formally, right-to-left messages:
μ(x_N) = 1, for x_N ∈ {1, ..., K}
μ(x_j) = max_{x_{j+1}} ψ(x_j, x_{j+1}) μ(x_{j+1}), for x_j ∈ {1, ..., K}
Can also be done from left to right, using the reverse ordering.
Efficient Inference (cont.)

Exploit the structure:
max_x p(x) = max_{x_1} [ max_{x_2} ψ(x_1, x_2) ··· [ max_{x_{N−1}} ψ(x_{N−2}, x_{N−1}) [ max_{x_N} ψ(x_{N−1}, x_N) ] ] ]
where the innermost bracket is μ(x_{N−1}) and the next one is μ(x_{N−2}).
Cost O(N K^2) (versus O(K^N); e.g., 2000 versus 10^20).
The key is the distributive property: max{a b, a c} = a max{b, c}, for a ≥ 0.
Equivalently, for max_x log p(x), use max{a + b, a + c} = a + max{b, c}.
To compute arg max_x p(x), we need a backward and a forward pass; why and how?
Efficient Inference (cont.)

Computing the MAP configuration via message passing (backward messages μ, then a forward pass):
x̂_1 = arg max_{x_1} ( max_{x_2, ..., x_N} p(x_1, x_2, ..., x_N) ) = arg max_{x_1} μ(x_1)
x̂_2 = arg max_{x_2} ψ(x̂_1, x_2) μ(x_2)
...
x̂_N = arg max_{x_N} ψ(x̂_{N−1}, x_N) μ(x_N) = arg max_{x_N} ψ(x̂_{N−1}, x_N)  (since μ(x_N) = 1)
Cost O(N K^2) (versus O(K^N); e.g., 2000 versus 10^20).
This is similar to dynamic programming and the Viterbi algorithm, and is known as the max-product (or max-sum, with logs) algorithm.
How to extend this to more general graphical structures? Easy for DGMs! A general algorithm is more conveniently written for factor graphs.
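A sketch of the backward max-product pass followed by the forward arg-max pass (Viterbi-style backtracking), checked against exhaustive search; as before, names and the 0-indexed layout are illustrative:

```python
import numpy as np
from itertools import product

def chain_map(psis):
    """MAP configuration of p(x) ∝ prod_j psi_j(x_j, x_{j+1}) by max-product:
    a backward pass of max-messages, then a forward arg-max pass."""
    N = len(psis) + 1
    K = psis[0].shape[0]
    # Backward pass: mu[j][a] = max over x_{j+1},...,x_{N-1} of the product
    # of all factors to the right of x_j, with x_j = a.
    mu = [None] * N
    mu[N - 1] = np.ones(K)
    for j in range(N - 2, -1, -1):
        mu[j] = (psis[j] * mu[j + 1]).max(axis=1)
    # Forward pass: fix each variable given the previously fixed one.
    xhat = [int(np.argmax(mu[0]))]
    for j in range(1, N):
        xhat.append(int(np.argmax(psis[j - 1][xhat[-1]] * mu[j])))
    return xhat

rng = np.random.default_rng(2)
psis = [rng.uniform(0.1, 1.0, size=(3, 3)) for _ in range(4)]  # N = 5, K = 3

# Brute-force check over all 3^5 configurations.
def score(x):
    return np.prod([psis[j][x[j], x[j + 1]] for j in range(len(psis))])

best = max(product(range(3), repeat=5), key=score)
print(chain_map(psis) == list(best))  # True
```

The backward pass answers "what is the best achievable value to my right, for each state of x_j?"; the forward pass then commits to states one at a time, which is why both passes are needed to recover the arg max and not just the max.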
Factor Graphs

The joint pdf/pmf of X = (X_1, ..., X_V) (discrete, continuous, or hybrid) is a product
p(x) = ∏_{s ∈ S} f_s(x_s), where S ⊆ 2^{{1, ..., V}}, i.e., each s ⊆ {1, ..., V}
Each factor f_s only depends on the subset x_s of the components of x.
Example: p(x_1, x_2, x_3) = f_a(x_1, x_2) f_b(x_1, x_2) f_c(x_2, x_3) f_d(x_3)
Seen as an MRF, C = { {1, 2}, {2, 3} }, thus p(x) ∝ ψ_{1,2} ψ_{2,3}, with ψ_{1,2} ∝ f_a(x_1, x_2) f_b(x_1, x_2) and ψ_{2,3} ∝ f_c(x_2, x_3) f_d(x_3).
Factor Graphs (cont.)

Factor graphs are bipartite: two disjoint subsets of nodes (factors and variables); no edges between nodes in the same subset.
The mapping from an MRF to a factor graph is not unique.
The mapping from a Bayesian network to a factor graph is also not unique.
Neighborhood of a variable x: ne(x) = {s ∈ S : x ∈ s}.
Sum-Product Algorithm on Factor Graphs

Working example: compute a marginal p(x) = Σ_{x\x} p(x). Assume the graph is a tree.
Group the factors in the following way:
p(x) = ∏_{s ∈ S} f_s(x_s) = ∏_{s ∈ ne(x)} F_s(x, X_s)
where X_s is the subset of variables in the subtree connected to x via factor s, and F_s(x, X_s) is the product of all the factors in the subtree connected to s.
Sum-Product Algorithm on Factor Graphs (cont.)

Rewrite the marginalization:
p(x) = Σ_{x\x} ∏_{s ∈ ne(x)} F_s(x, X_s) = ∏_{s ∈ ne(x)} [ Σ_{X_s} F_s(x, X_s) ] = ∏_{s ∈ ne(x)} μ_{f_s→x}(x)
With F_s(x, X_s) = f_s(x, x_1, ..., x_M) G_1(x_1, X_{s1}) ··· G_M(x_M, X_{sM}),
μ_{f_s→x}(x) = Σ_{x_1} ··· Σ_{x_M} f_s(x, x_1, ..., x_M) ∏_{m ∈ ne(f_s)\x} [ Σ_{X_{sm}} G_m(x_m, X_{sm}) ]
where each bracketed sum is the variable-to-factor message μ_{x_m→f_s}(x_m).
Sum-Product Algorithm on Factor Graphs (cont.)

Factor-to-variable (FtV) messages:
μ_{f_s→x}(x) = Σ_{x_1} ··· Σ_{x_M} f_s(x, x_1, ..., x_M) ∏_{m ∈ ne(f_s)\x} μ_{x_m→f_s}(x_m)
where the μ_{x_m→f_s} are variable-to-factor (VtF) messages.
Computing the FtV message from f_s to x:
- compute the product of the VtF messages coming from all variables except x;
- multiply by the local factor;
- marginalize w.r.t. all variables except x.
Sum-Product Algorithm on Factor Graphs (cont.)

Variable-to-factor (VtF) messages:
μ_{x_m→f_s}(x_m) = Σ_{X_{sm}} G_m(x_m, X_{sm})
But G_m(x_m, X_{sm}) = ∏_{l ∈ ne(x_m)\f_s} F_l(x_m, X_{ml}), thus
μ_{x_m→f_s}(x_m) = ∏_{l ∈ ne(x_m)\f_s} [ Σ_{X_{ml}} F_l(x_m, X_{ml}) ] = ∏_{l ∈ ne(x_m)\f_s} μ_{f_l→x_m}(x_m)
Each VtF message is the product of the FtV messages that the variable receives from the other factors.
Sum-Product Algorithm on Factor Graphs (cont.)

VtF messages from leaf variables (empty product):
μ_{x_m→f_s}(x_m) = ∏_{l ∈ ne(x_m)\f_s} μ_{f_l→x_m}(x_m) = 1
FtV messages from leaf factors:
μ_{f_s→x}(x) = Σ_{x_1} ··· Σ_{x_M} f_s(x, x_1, ..., x_M) ∏_{m ∈ ne(f_s)\x} μ_{x_m→f_s}(x_m) = f_s(x)
Sum-Product on Factor Graphs: Detailed Example

p(x) = f_a(x_1, x_2) f_b(x_2, x_3) f_c(x_2, x_4)
p(x_2) = μ_{f_a→x_2}(x_2) μ_{f_b→x_2}(x_2) μ_{f_c→x_2}(x_2)
       = [ Σ_{x_1} f_a(x_1, x_2) ] [ Σ_{x_3} f_b(x_2, x_3) ] [ Σ_{x_4} f_c(x_2, x_4) ]
       = Σ_{x_1} Σ_{x_3} Σ_{x_4} f_a(x_1, x_2) f_b(x_2, x_3) f_c(x_2, x_4) = p(x_2)
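This example is small enough to verify directly. In the sketch below (array-based factors with illustrative names), x_1, x_3, and x_4 are leaf variables, so their VtF messages are identically 1 and each FtV message into x_2 reduces to a single sum:

```python
import numpy as np
from itertools import product

rng = np.random.default_rng(3)
K = 3
# Random positive factors for p(x) ∝ f_a(x1, x2) f_b(x2, x3) f_c(x2, x4).
fa, fb, fc = (rng.uniform(0.1, 1.0, size=(K, K)) for _ in range(3))

# Factor-to-variable messages into x2 (leaf VtF messages are all 1).
mu_a = fa.sum(axis=0)  # sum over x1 of f_a(x1, x2)
mu_b = fb.sum(axis=1)  # sum over x3 of f_b(x2, x3)
mu_c = fc.sum(axis=1)  # sum over x4 of f_c(x2, x4)
p2 = mu_a * mu_b * mu_c
p2 /= p2.sum()

# Brute force: p(x2) by summing the joint over x1, x3, x4.
q2 = np.zeros(K)
for x1, x2, x3, x4 in product(range(K), repeat=4):
    q2[x2] += fa[x1, x2] * fb[x2, x3] * fc[x2, x4]
q2 /= q2.sum()

print(np.allclose(p2, q2))  # True
```

Message passing replaces the K^4-term enumeration with three K-term sums and a pointwise product, mirroring the derivation on the slide line by line.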
Max-Sum Algorithm on Factor Graphs

Message passing for the MAP problem: max_x log p(x) = max_x Σ_{s ∈ S} log f_s(x_s)
Distributive property: max{a + b, a + c} = a + max{b, c}
Max-sum messages:
μ_{f→x}(x) = max_{x_1, ..., x_M} [ log f(x, x_1, ..., x_M) + Σ_{m ∈ ne(f)\x} μ_{x_m→f}(x_m) ]
μ_{x→f}(x) = Σ_{l ∈ ne(x)\f} μ_{f_l→x}(x)
At leaf factors and leaf variables:
μ_{f→x}(x) = log f(x)
μ_{x→f}(x) = 0
Recommended Reading

- C. Bishop, Pattern Recognition and Machine Learning, Springer, 2006 (this lecture was very much based on chapter 8 of this book). Chapter 8 is freely available at http://research.microsoft.com/en-us/um/people/cmbishop/prml/pdf/bishop-prml-sample.pdf
- K. Murphy, Machine Learning: A Probabilistic Perspective, MIT Press, 2012 (chapters 10 and 19).