Statistical Learning


1 Statistical Learning
Lecture 5: Bayesian Networks and Graphical Models
Mário A. T. Figueiredo
Instituto Superior Técnico & Instituto de Telecomunicações, University of Lisbon, Portugal
May 2018

2 Bayesian Networks and Graphical Models
Bayes nets in a nutshell:
- Structured probability (density/mass) functions $f_X(x; \theta)$
- Provide a graph-based language/grammar to express conditional independence
- Allow formalizing the problem of inferring a subset of the components of $X$ from another subset thereof
- Allow formalizing the problem of learning $\theta$ from observed i.i.d. realizations of $X$: $x_1, \ldots, x_n$
Bayes nets are one type of graphical model, based on directed graphs. Other types (more on them later):
- Markov random fields (MRF), based on undirected graphs
- Factor graphs

3 Bayesian Networks: Introduction
Notation: we use the more compact $p(x)$ notation instead of the more correct $f_X(x)$.
For $X \in \mathbb{R}^n$, the pdf/pmf $p(x)$ can be factored by Bayes law:
  $p(x) = p(x_1 \mid x_2, \ldots, x_n)\, p(x_2, \ldots, x_n)$
  $\phantom{p(x)} = p(x_1 \mid x_2, \ldots, x_n)\, p(x_2 \mid x_3, \ldots, x_n)\, p(x_3, \ldots, x_n)$
  $\phantom{p(x)} = \cdots = p(x_1 \mid x_2, \ldots, x_n) \cdots p(x_{n-1} \mid x_n)\, p(x_n)$
Of course, this can be done in $n!$ different ways; e.g.,
  $p(x) = p(x_n \mid x_{n-1}, \ldots, x_1)\, p(x_{n-1}, \ldots, x_1)$
  $\phantom{p(x)} = p(x_n \mid x_{n-1}, \ldots, x_1)\, p(x_{n-1} \mid x_{n-2}, \ldots, x_1)\, p(x_{n-2}, \ldots, x_1)$
  $\phantom{p(x)} = \cdots = p(x_n \mid x_{n-1}, \ldots, x_1) \cdots p(x_2 \mid x_1)\, p(x_1)$

4 Bayesian Networks: Introduction
For $X \in \mathbb{R}^n$, the pdf/pmf $p(x)$ can be factored by Bayes law:
  $p(x) = p(x_n \mid x_{n-1}, \ldots, x_1) \cdots p(x_2 \mid x_1)\, p(x_1)$
In general, this is not more compact than $p(x)$.
Example: if $x_i \in \{1, \ldots, K\}$, a general $p(x)$ has $K^n - 1 \approx K^n$ parameters.
But $p(x_n \mid x_{n-1}, \ldots, x_1)$ alone has $(K-1)K^{n-1} \approx K^n$ parameters!
...unless there are some conditional independencies.
Example: $X$ is a Markov chain: $p(x_i \mid x_{i-1}, \ldots, x_1) = p(x_i \mid x_{i-1})$; in this case,
  $p(x) = p(x_n \mid x_{n-1})\, p(x_{n-1} \mid x_{n-2}) \cdots p(x_2 \mid x_1)\, p(x_1)$
has $(n-1)\,K(K-1) + (K-1) \approx nK^2$ parameters: linear in $n$, rather than exponential!
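To make the counting concrete, here is a minimal sketch (the values of $n$ and $K$ are arbitrary illustrative choices) comparing the two parameter counts:

```python
# Illustrative sketch: number of free parameters for a full joint table
# versus a first-order Markov chain, following the counting argument above.
def full_table_params(n, K):
    # joint table over K^n configurations, minus 1 for normalization
    return K**n - 1

def markov_chain_params(n, K):
    # p(x_1): K-1 free values; each p(x_i | x_{i-1}), i = 2..n: K*(K-1) free values
    return (K - 1) + (n - 1) * K * (K - 1)

if __name__ == "__main__":
    n, K = 10, 4
    print(full_table_params(n, K))    # 1048575
    print(markov_chain_params(n, K))  # 111
```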

5 Conditional Independence
Bayes nets are built on conditional independence.
Random variables $X$ and $Y$ are conditionally independent, given $Z$, if
  $f_{X,Y \mid Z}(x, y \mid z) = f_{X \mid Z}(x \mid z)\, f_{Y \mid Z}(y \mid z)$
Naturally, $X$, $Y$, and $Z$ can be groups of random variables.
Notation: $X \perp Y \mid Z$ or $X \perp\!\!\!\perp Y \mid Z$
Equivalent relationship (in short notation):
  $p(x \mid y, z) = \dfrac{p(x, y \mid z)}{p(y \mid z)} = \dfrac{p(x \mid z)\, p(y \mid z)}{p(y \mid z)} = p(x \mid z)$
Factorization: if $X = (X_1, X_2, X_3)$ and $X_1 \perp X_3 \mid X_2$, then
  $p(x) = p(x_3 \mid x_2, x_1)\, p(x_2 \mid x_1)\, p(x_1) = p(x_3 \mid x_2)\, p(x_2 \mid x_1)\, p(x_1)$

6 Graphical Models
Graph-based representations of the joint pdf/pmf $p(x)$.
Each node $i$ represents a random variable $X_i$.
The conditional independence properties are encoded by the presence/absence of edges in the graph.
Example: $X_1 \perp X_3 \mid X_2$ is represented by one of the following factorizations (each corresponding to a different DAG over the same three nodes):
  $p(x) = p(x_3 \mid x_2)\, p(x_2 \mid x_1)\, p(x_1)$
  $p(x) = p(x_3 \mid x_2)\, p(x_1 \mid x_2)\, p(x_2) = p(x_3 \mid x_2)\, p(x_2 \mid x_1)\, p(x_1)$
  $p(x) = p(x_1 \mid x_2)\, p(x_2 \mid x_3)\, p(x_3) = p(x_1 \mid x_2)\, p(x_3 \mid x_2)\, p(x_2) = p(x_3 \mid x_2)\, p(x_2 \mid x_1)\, p(x_1)$

7 (Directed) Graph Concepts
Directed graph $G = (V, E)$, where
- set of nodes or vertices $V = \{1, \ldots, V\}$
- set of edges $E \subseteq V \times V$; i.e., each element of $E$ has the form $(s, t)$, with $s, t \in V$
- in this context, we assume $(v, v) \notin E$, $\forall v \in V$
Note: in an undirected graph, each edge is a 2-element multiset; i.e., it has the form $\{u, v\}$, where $u, v \in V$.
Running example (from the figure): $V = \{1, \ldots, 5\}$, $E = \{(1,2), (1,3), (2,4), (3,4), (3,5)\}$.
Parents of a node: $\mathrm{pa}(v) = \{s \in V : (s, v) \in E\}$. Example: $\mathrm{pa}(4) = \{2, 3\}$.
Children of a node: $\mathrm{ch}(v) = \{t \in V : (v, t) \in E\}$. Example: $\mathrm{ch}(3) = \{4, 5\}$.
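As a quick illustration, a minimal Python sketch of these definitions on the running example graph (the edge set below is the one reconstructed above from the figure):

```python
# Parents and children of a node, given a directed edge set.
edges = {(1, 2), (1, 3), (2, 4), (3, 4), (3, 5)}

def pa(v, E=edges):
    """Parents of v: all s with an edge s -> v."""
    return {s for (s, t) in E if t == v}

def ch(v, E=edges):
    """Children of v: all t with an edge v -> t."""
    return {t for (s, t) in E if s == v}

print(pa(4))  # {2, 3}
print(ch(3))  # {4, 5}
```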

8 (Directed) Graph Concepts (cont.)
Root: a node $v$ s.t. $\mathrm{pa}(v) = \emptyset$. Example: 1 is a root.
Leaf: a node $v$ s.t. $\mathrm{ch}(v) = \emptyset$. Example: 4 and 5 are leaves.
Reachability: node $t$ is reachable from $s$ if there is a sequence of edges $\big((v_{s_1}, v_{t_1}), \ldots, (v_{s_n}, v_{t_n})\big)$ s.t. $v_{s_1} = s$, $v_{s_i} = v_{t_{i-1}}$ for $i = 2, \ldots, n$, and $v_{t_n} = t$.
Ex: 5 is reachable from 1; 2 is not reachable from 3, 4, or 5.
Ancestors of a node: $\mathrm{anc}(v) = \{s : v \text{ is reachable from } s\}$. Examples: $\mathrm{anc}(5) = \{1, 3\}$; $\mathrm{anc}(4) = \{1, 2, 3\}$; $\mathrm{anc}(1) = \emptyset$.
Descendants of a node: $\mathrm{desc}(v) = \{t : t \text{ is reachable from } v\}$. Examples: $\mathrm{desc}(1) = \{2, 3, 4, 5\}$; $\mathrm{desc}(2) = \{4\}$; $\mathrm{desc}(5) = \emptyset$.

9 (Directed) Graph Concepts (cont.)
Neighborhood of a node: $\mathrm{nbr}(v) = \{u : (u, v) \in E \vee (v, u) \in E\}$. Example: $\mathrm{nbr}(3) = \{1, 4, 5\}$.
In-degree of node $v$ is the cardinality of $\mathrm{pa}(v)$. Example: the in-degree of 4 is 2.
Out-degree of node $v$ is the cardinality of $\mathrm{ch}(v)$. Example: the out-degree of 3 is 2.
Cycle (or loop): $(v_1, v_2, \ldots, v_n)$ such that $v_1 = v_n$ and $(v_i, v_{i+1}) \in E$. Example: the graph shown above has no loops/cycles.
Directed acyclic graph (DAG): a directed graph with no loops/cycles.
Directed tree: a DAG where each node has 1 or 0 parents.
Subgraph of $G = (V, E)$ induced by a subset of nodes $S \subseteq V$: $G_S = (S, E_S)$, where $E_S = \{(u, v) \in E : u, v \in S\}$. Example: $G_{\{1,3,5\}} = (\{1, 3, 5\}, \{(1, 3), (3, 5)\})$.

10 Directed Graphical Models (DGM)
DGM, a.k.a. Bayesian networks, belief networks, causal networks.
Consider $X = (X_1, \ldots, X_V)$ with pdf/pmf $p(x) = p(x_1, \ldots, x_V)$.
Consider a graph $G = (V, E)$, where $V = \{1, \ldots, V\}$.
$G$ is a DGM for $X$ if (with $x_S = \{x_v,\ v \in S\}$)
  $p(x) = \prod_{v=1}^{V} p(x_v \mid x_{\mathrm{pa}(v)})$,
where each factor $p(x_v \mid x_{\mathrm{pa}(v)})$ is a conditional probability distribution (CPD).
Example (for the graph of slide 7): $p(x) = p(x_1)\, p(x_2 \mid x_1)\, p(x_3 \mid x_1)\, p(x_4 \mid x_2, x_3)\, p(x_5 \mid x_3)$
The DGM is not unique (example in slide 6).
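To make the factorization concrete, a minimal sketch that evaluates this joint for the five-node example with binary variables; the CPD tables are arbitrary illustrative numbers, not from the slides:

```python
# Evaluate p(x) = p(x1) p(x2|x1) p(x3|x1) p(x4|x2,x3) p(x5|x3) for binary variables.
import numpy as np

p1 = np.array([0.6, 0.4])                              # p(x1)
p2_1 = np.array([[0.7, 0.3], [0.2, 0.8]])              # p(x2 | x1), rows indexed by x1
p3_1 = np.array([[0.5, 0.5], [0.1, 0.9]])              # p(x3 | x1)
p4_23 = np.random.default_rng(0).dirichlet(np.ones(2), size=(2, 2))  # p(x4 | x2, x3)
p5_3 = np.array([[0.9, 0.1], [0.4, 0.6]])              # p(x5 | x3)

def joint(x1, x2, x3, x4, x5):
    """Product of the CPDs: one factor per node, conditioned on its parents."""
    return (p1[x1] * p2_1[x1, x2] * p3_1[x1, x3]
            * p4_23[x2, x3, x4] * p5_3[x3, x5])

# sanity check: the factorization defines a valid joint (sums to 1)
total = sum(joint(a, b, c, d, e)
            for a in (0, 1) for b in (0, 1) for c in (0, 1)
            for d in (0, 1) for e in (0, 1))
print(round(total, 10))  # 1.0
```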

11 Directed Graphical Models: Examples
Naïve Bayes classification (generative model): class variable $Y \in \{1, \ldots, K\}$, with prior $p(y)$; class-conditional pdf/pmf $p(x \mid y)$:
  $p(y, x) = p(x \mid y)\, p(y) = p(y) \prod_{v=1}^{V} p(x_v \mid y)$
(graph: $Y$ is the single parent of $X_1, X_2, X_3, X_4$)
Tree-augmented naïve Bayes (TAN) classification: class variable $Y \in \{1, \ldots, K\}$, with prior $p(y)$; class-conditional pdf/pmf $p(x \mid y)$:
  $p(y, x) = p(y) \prod_{v=1}^{V} p(x_v \mid x_{\mathrm{pa}(v)}, y)$,
where the DAG over $X_1, \ldots, X_V$ is a tree.

12 Directed Graphical Models: More Examples
First-order Markov chain:
  $p(x) = p(x_1)\, p(x_2 \mid x_1) \cdots p(x_V \mid x_{V-1})$
Second-order Markov chain:
  $p(x) = p(x_1, x_2)\, p(x_3 \mid x_2, x_1) \cdots p(x_V \mid x_{V-1}, x_{V-2})$
Hidden Markov model (HMM): $(Z, X) = (Z_1, \ldots, Z_T, X_1, \ldots, X_T)$
  $p(z, x) = p(z_1)\, p(x_1 \mid z_1) \prod_{v=2}^{T} p(x_v \mid z_v)\, p(z_v \mid z_{v-1})$
(graphs: chains $x_1 \to x_2 \to x_3 \to \cdots$; hidden chain $z_1 \to z_2 \to \cdots \to z_T$ with emissions $z_t \to x_t$)

13 Inference in Directed Graphical Models
Visible and hidden variables: $x = (x_v, x_h)$; joint pdf/pmf $p(x_v, x_h \mid \theta)$.
Inferring the hidden variables from the visible ones:
  $p(x_h \mid x_v, \theta) = \dfrac{p(x_h, x_v \mid \theta)}{p(x_v \mid \theta)} = \dfrac{p(x_h, x_v \mid \theta)}{\sum_{x_h} p(x_h, x_v \mid \theta)}$
...with an integral instead of a sum, if $x_h$ has real components.
Sometimes, only a subset $x_q$ of $x_h$ is of interest, $x_h = (x_q, x_n)$:
  $p(x_q \mid x_v, \theta) = \sum_{x_n} p(x_q, x_n \mid x_v, \theta)$
...$x_n$ are sometimes called nuisance variables.

14 Learning in Directed Graphical Models
Observe samples $x_1, \ldots, x_N$ of $N$ i.i.d. copies of $X \sim p(x \mid \theta)$.
MAP estimate of $\theta$:
  $\hat{\theta} = \arg\max_{\theta} \Big( \log p(\theta) + \sum_{i=1}^{N} \log p(x_i \mid \theta) \Big)$
Plate notation: a compact graphical notation for a collection of i.i.d. copies of some variable (a plate containing $X_i$, $i = 1, \ldots, N$, all with the common parent $\theta$).
If there are hidden variables, $x$ should be understood as denoting only the visible ones.

15 Learning in Directed Graphical Models (cont.)
For $N$ i.i.d. observations $\mathcal{D} = (x_1, \ldots, x_N)$,
  $p(\mathcal{D} \mid \theta) = \prod_{i=1}^{N} p(x_i \mid \theta) = \prod_{i=1}^{N} \prod_{v=1}^{V} p(x_{iv} \mid x_{i,\mathrm{pa}(v)}, \theta) = \prod_{v=1}^{V} p(\mathcal{D}_v \mid \theta_v)$,
where $\mathcal{D}_v$ is the data associated with node $v$ and $\theta_v$ the corresponding parameters.
If the prior factorizes, $p(\theta) = \prod_v p(\theta_v)$, the posterior also factorizes:
  $p(\theta \mid \mathcal{D}) \propto p(\theta)\, p(\mathcal{D} \mid \theta) = \prod_{v=1}^{V} p(\theta_v)\, p(\mathcal{D}_v \mid \theta_v) \propto \prod_{v=1}^{V} p(\theta_v \mid \mathcal{D}_v)$

16 Learning in Directed Graphical Models: Categorical CPD
Each $x_v \in \{1, \ldots, K_v\}$.
The number of configurations of $x_{\mathrm{pa}(v)}$ is $C_v = \prod_{s \in \mathrm{pa}(v)} K_s$.
Abusing notation, write $x_{\mathrm{pa}(v)} = c$, for $c \in \{1, \ldots, C_v\}$, to denote that $x_{\mathrm{pa}(v)}$ takes the $c$-th configuration.
Parameters: $\theta_{vck} = P(x_v = k \mid x_{\mathrm{pa}(v)} = c)$ (of course $\sum_{k=1}^{K_v} \theta_{vck} = 1$). Denote $\theta_{vc} = (\theta_{vc1}, \ldots, \theta_{vcK_v})$.
Counts: $N_{vck} = \sum_{i=1}^{N} 1(x_{iv} = k,\ x_{i,\mathrm{pa}(v)} = c)$
Maximum likelihood estimates: $\hat{\theta}_{vck} = \dfrac{N_{vck}}{\sum_{j=1}^{K_v} N_{vcj}}$
MAP estimate w/ Dirichlet($\alpha_{vc}$) prior: $\hat{\theta}_{vck} = \dfrac{N_{vck} + \alpha_{vck}}{\sum_{j=1}^{K_v} (N_{vcj} + \alpha_{vcj})}$
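A minimal sketch of these two estimators for a single node $v$ and a single parent configuration $c$; the counts and the Dirichlet parameters are illustrative values:

```python
# ML and Dirichlet-smoothed estimates from the count vector N_vc = (N_vc1, ..., N_vcKv).
import numpy as np

N_vc = np.array([12, 3, 0, 5])            # counts N_vck, k = 1..K_v
alpha_vc = np.ones_like(N_vc)             # e.g., a uniform Dirichlet(1,...,1) prior

theta_ml = N_vc / N_vc.sum()                              # N_vck / sum_j N_vcj
theta_map = (N_vc + alpha_vc) / (N_vc + alpha_vc).sum()   # (N_vck + alpha_vck) / sum_j (N_vcj + alpha_vcj)

print(theta_ml)   # [0.6  0.15 0.   0.25]
print(theta_map)  # avoids the zero estimate for the unseen value k = 3
```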

17 Conditional Independence Properties
Consider $X = (X_1, \ldots, X_V) \sim p(x)$, and $V = \{1, \ldots, V\}$.
Let $x_A \perp x_B \mid x_C$ be a true conditional independence (CI) statement about $p(x)$, where $A, B, C$ are disjoint subsets of $V$.
Let $I(p)$ be the set of all (true) CI statements about $p(x)$.
Graph $G = (V, E)$: $G$ expresses a collection $I(G)$ of CI statements $x_A \perp_G x_B \mid x_C$ (explained below).
$G$ is an I-map (independence map) of $p(x)$ if $I(G) \subseteq I(p)$.
Example: if $G$ is fully connected, $I(G) = \emptyset$, thus $I(G) \subseteq I(p)$, for any $p$.

18 Conditional Independence Properties (cont.)
What type of conditional independence (CI) statements are expressed by some graph $G$?
In a path through some node $m$, there are three possible orientation structures:
- tail-to-tail
- head-to-tail
- head-to-head
Before proceeding to the general statement, we next exemplify which type of CI corresponds to each of these structures.

19 Conditional Independence Structures
Tail-to-tail ($X \leftarrow Z \rightarrow Y$):
  $p(x, y \mid z) = \dfrac{p(x, y, z)}{p(z)} = \dfrac{p(z)\, p(x \mid z)\, p(y \mid z)}{p(z)} = p(x \mid z)\, p(y \mid z)$
Head-to-tail ($X \rightarrow Z \rightarrow Y$):
  $p(x, y \mid z) = p(y \mid z)\, \dfrac{p(z \mid x)\, p(x)}{p(z)} = p(y \mid z)\, p(x \mid z)$

20 Conditional Independence Structures
Head-to-head ($X \rightarrow Z \leftarrow Y$): in general, $p(x, y \mid z) \neq p(x \mid z)\, p(y \mid z)$, although $X \perp Y$ marginally.
Classical example: the noisy fuel gauge.
Binary variables: $X$ = battery OK, $Y$ = full tank, $Z$ = fuel gauge on.
$P(X = 1) = P(Y = 1) = 0.9$; the CPD $P(Z = 1 \mid x, y)$ is given by a table (the gauge is most likely on when both the battery is OK and the tank is full).
Gauge is off ($Z = 0$); is the tank empty? $P(Y = 0 \mid Z = 0) \approx 0.26$
...but, if the battery is also dead, $P(Y = 0 \mid Z = 0, X = 0) = 0.11$.
Dead battery explains away the empty tank.
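The explaining-away numbers can be checked directly; the sketch below assumes the gauge CPT from Bishop's PRML fuel-gauge example (which this lecture follows), since the table's values did not survive in this transcription:

```python
# Explaining away in the head-to-head structure X -> Z <- Y.
# CPT values taken from Bishop's PRML fuel-gauge example (assumed, see lead-in).
from itertools import product

P_X1, P_Y1 = 0.9, 0.9                     # P(battery OK), P(full tank)
P_Z1 = {(1, 1): 0.8, (1, 0): 0.2,         # P(Z = 1 | x, y): gauge on
        (0, 1): 0.2, (0, 0): 0.1}

def p_joint(x, y, z):
    px = P_X1 if x == 1 else 1 - P_X1
    py = P_Y1 if y == 1 else 1 - P_Y1
    pz = P_Z1[(x, y)] if z == 1 else 1 - P_Z1[(x, y)]
    return px * py * pz

p_z0 = sum(p_joint(x, y, 0) for x, y in product((0, 1), repeat=2))
p_y0_z0 = sum(p_joint(x, 0, 0) for x in (0, 1)) / p_z0
p_y0_z0_x0 = p_joint(0, 0, 0) / sum(p_joint(0, y, 0) for y in (0, 1))

print(round(p_y0_z0, 3))      # 0.257: gauge off raises P(empty tank) above the prior 0.1
print(round(p_y0_z0_x0, 3))   # 0.111: a dead battery explains away the empty tank
```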

21 D-Separation
Graph $G = (V, E)$ and three disjoint subsets of $V$: $A$, $B$, and $C$.
An (undirected) path from a node in $A$ to a node in $B$ is blocked by $C$ if it includes a node $m$ such that
- $m \in C$ and the arrows meet head-to-tail or tail-to-tail at $m$, or
- the arrows meet head-to-head at $m$, $m \notin C$, and $\mathrm{desc}(m) \cap C = \emptyset$.
$C$ D-separates $A$ from $B$, and $x_A \perp_G x_B \mid x_C$, if every path from a node in $A$ to a node in $B$ is blocked by $C$.
Examples (with respect to the graph in the figure):
- $x_4 \perp_G x_5 \mid x_1$ (tail-to-tail)
- $x_1 \perp_G x_2 \mid x_4$ (head-to-head in 5, with $5 \notin \{4\}$)
- $x_5 \perp_G x_6 \mid x_{\{2,3\}}$ (tail-to-tail in 2 and head-to-tail in 3)
- $x_1 \perp_G x_2 \mid x_7$ is false (head-to-head in 5, but $7 \in \mathrm{desc}(5)$)
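For completeness, a hedged sketch of a d-separation test using the standard equivalent criterion: $A$ and $B$ are d-separated by $C$ iff they are disconnected in the moralized ancestral graph of $A \cup B \cup C$ with the nodes of $C$ removed. The toy DAG below is a hypothetical example, not the slide's figure:

```python
from collections import deque

def ancestors(dag, nodes):
    """All ancestors of `nodes` (including the nodes themselves). dag: {child: set_of_parents}."""
    result, stack = set(nodes), list(nodes)
    while stack:
        v = stack.pop()
        for p in dag.get(v, set()):
            if p not in result:
                result.add(p)
                stack.append(p)
    return result

def d_separated(dag, A, B, C):
    keep = ancestors(dag, set(A) | set(B) | set(C))
    # moralize: undirected parent-child edges, plus edges between co-parents
    adj = {v: set() for v in keep}
    for v in keep:
        parents = [p for p in dag.get(v, set()) if p in keep]
        for p in parents:
            adj[v].add(p); adj[p].add(v)
        for i in range(len(parents)):
            for j in range(i + 1, len(parents)):
                adj[parents[i]].add(parents[j]); adj[parents[j]].add(parents[i])
    # remove conditioning nodes, then test connectivity from A to B by BFS
    blocked = set(C)
    frontier, seen = deque(a for a in A if a not in blocked), set(A)
    while frontier:
        v = frontier.popleft()
        if v in B:
            return False
        for u in adj[v] - blocked:
            if u not in seen:
                seen.add(u); frontier.append(u)
    return True

# toy DAG: 1 -> 3 <- 2, 3 -> 4  (represented as child: parents)
dag = {1: set(), 2: set(), 3: {1, 2}, 4: {3}}
print(d_separated(dag, {1}, {2}, set()))   # True: head-to-head at 3 blocks the path
print(d_separated(dag, {1}, {2}, {4}))     # False: conditioning on a descendant of 3 unblocks it
```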

22 Markov Blanket
The Markov blanket of node $m$, $\mathrm{mb}(m)$, is the set of nodes, conditioned on which $x_m$ is independent of all other nodes.
Which nodes belong to $\mathrm{mb}(m)$?
  $p(x_m \mid \{x_j : j \neq m\}) = \dfrac{\prod_{i=1}^{V} p(x_i \mid x_{\mathrm{pa}(i)})}{\sum_{x_m} \prod_{i=1}^{V} p(x_i \mid x_{\mathrm{pa}(i)})} = \dfrac{p(x_m \mid x_{\mathrm{pa}(m)}) \prod_{j:\, m \in \mathrm{pa}(j)} p(x_j \mid x_{\mathrm{pa}(j)})}{\sum_{x_m} p(x_m \mid x_{\mathrm{pa}(m)}) \prod_{j:\, m \in \mathrm{pa}(j)} p(x_j \mid x_{\mathrm{pa}(j)})}$
(all factors $p(x_i \mid x_{\mathrm{pa}(i)})$ with $i \notin \{m\} \cup \mathrm{ch}(m)$ do not involve $x_m$ and cancel)
...thus $\mathrm{mb}(m) = \mathrm{pa}(m) \cup \mathrm{ch}(m) \cup \mathrm{pa}(\mathrm{ch}(m))$, the last set being the coparents.

23 Undirected Graphical Models (Markov Random Fields)
MRFs are based on undirected graphs $G = (V, E)$, where each edge $\{u, v\}$ is a sub-multiset of $V$ of cardinality 2.
Conditional independence statements result from a simple separation (simpler than D-separation) definition.
Let $A, B, C$ be disjoint subsets of $V$. If every path from a node in $A$ to a node in $B$ goes through $C$, it is said that $C$ separates $A$ from $B$.
Graph $G$ is an I-map for $(X_1, \ldots, X_V) = X \sim p(x)$ if
  $C$ separates $A$ from $B$ $\;\Rightarrow\;$ $x_A \perp x_B \mid x_C$
...a perfect I-map if $\Rightarrow$ can be replaced by $\Leftrightarrow$.
A complete graph is an I-map for any $p(x)$.

24 Markov Random Fields (cont.)
Neighborhood: $N(i) = \{j : \{i, j\} \in E\}$
In an MRF, the Markov blanket is simply the neighborhood, $\mathrm{mb}(i) = N(i)$:
  $p(x_i \mid x_{V \setminus \{i\}}) = p(x_i \mid x_{N(i)})$, i.e., $x_i \perp x_{V \setminus (\{i\} \cup N(i))} \mid x_{N(i)}$
Clique: a set of mutually neighboring nodes.
Maximal clique: a clique that is not contained in any other clique; $\mathcal{C}(G)$ denotes the set of maximal cliques of $G$.
Examples (for the graph in the figure): $\{1, 2\}$ is a (non-maximal) clique; $\{1, 2, 3, 5\}$ is not a clique (1 and 5 are not neighbors); $\{1, 2, 3\}$ and $\{4, 5, 6, 7\}$ are maximal cliques.

25 Hammersley-Clifford Theorem and Gibbs Distributions
Let $p(x) = p(x_1, \ldots, x_V)$, such that $p(x) > 0$, $\forall x$. Then, $G$ is an I-map for $p(x)$ if and only if
  $p(x) = \dfrac{1}{Z} \prod_{C \in \mathcal{C}(G)} \psi_C(x_C) = \dfrac{1}{Z} \exp\Big( -\sum_{C \in \mathcal{C}(G)} E_C(x_C) \Big)$,
where $Z = \sum_x \prod_{C \in \mathcal{C}(G)} \psi_C(x_C)$ is the partition function (with integration, rather than summation, in the case of continuous variables).
$\psi_C$ is called a clique potential; $E_C$ is called a clique energy.
This is known in statistical physics as a Gibbs distribution.

26 Local Conditionals in Gibbs Distributions
Consider a Gibbs distribution over a graph $G$:
  $p(x) = \dfrac{1}{Z} \prod_{C \in \mathcal{C}(G)} \psi_C(x_C) = \dfrac{1}{Z} \exp\Big( -\sum_{C \in \mathcal{C}(G)} E_C(x_C) \Big)$
Local conditional distribution:
  $p(x_i \mid x_{N(i)}) = \dfrac{1}{Z(x_{N(i)})} \prod_{C:\ i \in C} \psi_C(x_C) = \dfrac{1}{Z(x_{N(i)})} \exp\Big( -\sum_{C:\ i \in C} E_C(x_C) \Big)$

27 Auto Models
Auto models: based on 2-node cliques:
  $E_C(x_C) = \sum_{D:\ |D| = 2,\ D \subseteq C} E_D(x_D)$;
equivalently,
  $\psi_C(x_C) = \prod_{D:\ |D| = 2,\ D \subseteq C} \psi_D(x_D)$
The joint distribution has the form
  $p(x) = \dfrac{1}{Z} \prod_{D:\ |D| = 2,\ D \subseteq C \in \mathcal{C}(G)} \psi_D(x_D) = \dfrac{1}{Z} \exp\Big( -\sum_{D:\ |D| = 2,\ D \subseteq C \in \mathcal{C}(G)} E_D(x_D) \Big)$

28 Auto Models: Gaussian Markov Random Fields (GMRF)
$X \in \mathbb{R}^V$, with Gaussian
  $p(x) \propto \exp\Big( -\tfrac{1}{2} (x - \mu)^T A (x - \mu) \Big) \propto \exp\Big( -\tfrac{1}{2} \sum_{i=1}^{V} \sum_{j=1}^{V} A_{ij} (x_i - \mu_i)(x_j - \mu_j) \Big)$,
where $A$, the inverse of the covariance matrix, is symmetric.
Each pairwise clique $\{i, j\}$ has energy
  $E_{\{i,j\}}(\{x_i, x_j\}) = \begin{cases} A_{ij}\, (x_i - \mu_i)(x_j - \mu_j) & \text{if } i \neq j \\[2pt] \tfrac{A_{ii}}{2}\, (x_i - \mu_i)^2 & \text{if } i = j \end{cases}$
Neighborhood system: $N(i) = \{j : A_{ij} \neq 0\}$

29 Auto Models: Ising and Potts Fields
$X \in \{-1, +1\}^V$, with energy function
  $E_{\{i,j\}}(x_i, x_j) = \begin{cases} -\beta & x_i = x_j \\ \;\;\beta & x_i \neq x_j \end{cases} \;=\; -\beta\, x_i x_j$,
where $\beta > 0$ (ferromagnetic interaction) or $\beta < 0$ (anti-ferromagnetic interaction).
Computing $Z$ is NP-hard in general.
Generalization to $K$ states: the Potts model, $X \in \{1, \ldots, K\}^V$, with energy function
  $E_{\{i,j\}}(x_i, x_j) = \begin{cases} -\beta & x_i = x_j \\ \;\;0 & x_i \neq x_j \end{cases}$

30 Illustration of Potts Fields
Samples of the Potts model, with $K = 10$. The graph is a 2D grid; the neighborhood of each node is the set of its 4 nearest neighbors.
(Figure: three sample images, for $\beta = 1.42$, $\beta = 1.44$, and $\beta = 1.46$.)
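Samples like the ones in the figure can be drawn with a Gibbs sampler that repeatedly resamples each site from its local conditional (slide 26). The sketch below is an illustrative implementation, not the code used for the figure; grid size, number of sweeps, and the seed are arbitrary choices:

```python
# Gibbs sampling from a Potts model on a 2D grid with 4-nearest-neighbor cliques
# and pairwise energy -beta when two neighboring labels are equal.
import numpy as np

def gibbs_potts(height=64, width=64, K=10, beta=1.44, n_sweeps=200, seed=0):
    rng = np.random.default_rng(seed)
    x = rng.integers(0, K, size=(height, width))
    for _ in range(n_sweeps):
        for i in range(height):
            for j in range(width):
                # count, for each label k, how many of the 4 neighbors equal k
                counts = np.zeros(K)
                for di, dj in ((-1, 0), (1, 0), (0, -1), (0, 1)):
                    ni, nj = i + di, j + dj
                    if 0 <= ni < height and 0 <= nj < width:
                        counts[x[ni, nj]] += 1
                # local conditional: p(x_ij = k | neighbors) ∝ exp(beta * counts[k])
                logits = beta * counts
                probs = np.exp(logits - logits.max())
                probs /= probs.sum()
                x[i, j] = rng.choice(K, p=probs)
    return x

sample = gibbs_potts(height=32, width=32, n_sweeps=50)
print(sample.shape)  # (32, 32)
```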

31 DGM and MRF: The Easy Case
Problem: how to write a DGM as an MRF?
In some cases, there is a trivial relationship: a Markov chain
  $p(x) = p(x_1)\, p(x_2 \mid x_1) \cdots p(x_N \mid x_{N-1})$
can obviously be written as an MRF (with $Z = 1$):
  $p(x) = \dfrac{1}{Z}\, \underbrace{\psi_{\{1,2\}}(x_1, x_2)}_{p(x_1)\, p(x_2 \mid x_1)}\, \underbrace{\psi_{\{2,3\}}(x_2, x_3)}_{p(x_3 \mid x_2)} \cdots \underbrace{\psi_{\{N-1,N\}}(x_{N-1}, x_N)}_{p(x_N \mid x_{N-1})}$

32 DGM and MRF: The General Case
Procedure:
1. Insert undirected edges between all pairs of parents of each node ("moralization");
2. Make all edges undirected;
3. Initialize all clique potentials to 1;
4. Take each factor in the DGM and multiply it into the potential of a clique that contains all the involved nodes.
Example:
  $p(x) = \underbrace{p(x_1)\, p(x_2)\, p(x_3)\, p(x_4 \mid x_1, x_2, x_3)}_{\psi_{\{1,2,3,4\}}(x_1, x_2, x_3, x_4)}$

33 From DGM to MRF: Another Example
DGM: $p(x) = p(x_1)\, p(x_2)\, p(x_3 \mid x_1, x_2)\, p(x_4 \mid x_3)\, p(x_5 \mid x_3)\, p(x_6 \mid x_4, x_5)$
Cliques (after moralization): $\mathcal{C} = \{ \{1, 2, 3\}, \{3, 4, 5\}, \{4, 5, 6\} \}$
MRF: $p(x) = \underbrace{\psi(x_1, x_2, x_3)}_{p(x_1)\, p(x_2)\, p(x_3 \mid x_1, x_2)}\; \underbrace{\psi(x_3, x_4, x_5)}_{p(x_4 \mid x_3)\, p(x_5 \mid x_3)}\; \underbrace{\psi(x_4, x_5, x_6)}_{p(x_6 \mid x_4, x_5)}$

34 Efficient Inference
Motivating example: compute a marginal $p(x_n)$ in a chain graph
  $p(x) = p(x_1, \ldots, x_N) = \psi(x_1, x_2)\, \psi(x_2, x_3) \cdots \psi(x_{N-1}, x_N)$
Naïve solution (suppose $x_1, x_2, \ldots, x_N \in \{1, \ldots, K\}$ and $Z = 1$):
  $p(x_n) = \sum_{x_1 = 1}^{K} \cdots \sum_{x_{n-1} = 1}^{K}\ \sum_{x_{n+1} = 1}^{K} \cdots \sum_{x_N = 1}^{K} p(x_1, \ldots, x_N)$
has cost $O(K^N)$ (just tabulating $p(x_1, \ldots, x_N)$ has cost $O(K^N)$)
...the structure of $p(x_1, \ldots, x_N)$ is not being exploited.

35 Efficient Inference (cont.)
Reorder the summations and use the structure:
  $p(x_n) = \sum_{x_{n+1}} \cdots \sum_{x_N}\ \sum_{x_{n-1}} \cdots \sum_{x_1} \psi(x_1, x_2) \cdots \psi(x_{N-1}, x_N)$
  $\phantom{p(x_n)} = \underbrace{\sum_{x_{n+1}} \psi(x_n, x_{n+1}) \cdots \sum_{x_{N-1}} \psi(x_{N-2}, x_{N-1}) \sum_{x_N} \psi(x_{N-1}, x_N)}_{\mu_\beta(x_n)}\; \underbrace{\sum_{x_{n-1}} \psi(x_{n-1}, x_n) \cdots \sum_{x_2} \psi(x_2, x_3) \sum_{x_1} \psi(x_1, x_2)}_{\mu_\alpha(x_n)}$
  $\phantom{p(x_n)} = \mu_\beta(x_n)\, \mu_\alpha(x_n)$
Cost $O(N K^2)$ (versus $O(K^N)$; e.g., $2000$ operations instead of an exponentially large number).
The key is the distributive property: $ab + ac = a(b + c)$

36 Efficient Inference (cont.)
Can be seen as message passing:
  $\mu_\beta(x_n) = \sum_{x_{n+1} = 1}^{K} \psi(x_n, x_{n+1}) \cdots \underbrace{\sum_{x_{N-1} = 1}^{K} \psi(x_{N-2}, x_{N-1}) \underbrace{\sum_{x_N = 1}^{K} \psi(x_{N-1}, x_N)}_{\mu_\beta(x_{N-1})}}_{\mu_\beta(x_{N-2})}$
Formally, right-to-left messages:
  $\mu_\beta(x_N) = 1$, for $x_N \in \{1, \ldots, K\}$
  $\mu_\beta(x_j) = \sum_{x_{j+1} = 1}^{K} \psi(x_j, x_{j+1})\, \mu_\beta(x_{j+1})$, for $x_j \in \{1, \ldots, K\}$

37 Efficient Inference (cont.)
Can be seen as message passing:
  $\mu_\alpha(x_n) = \sum_{x_{n-1} = 1}^{K} \psi(x_{n-1}, x_n) \cdots \underbrace{\sum_{x_2 = 1}^{K} \psi(x_2, x_3) \underbrace{\sum_{x_1 = 1}^{K} \psi(x_1, x_2)}_{\mu_\alpha(x_2)}}_{\mu_\alpha(x_3)}$
Formally, left-to-right messages:
  $\mu_\alpha(x_1) = 1$, for $x_1 \in \{1, \ldots, K\}$
  $\mu_\alpha(x_j) = \sum_{x_{j-1} = 1}^{K} \psi(x_{j-1}, x_j)\, \mu_\alpha(x_{j-1})$, for $x_j \in \{1, \ldots, K\}$
Known as the sum-product algorithm, a.k.a. belief propagation.
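A minimal sketch of these two recursions on a chain, with 0-indexed variables and arbitrary pairwise potentials; the product of the two messages is normalized at the end, which absorbs the $1/Z$ factor:

```python
import numpy as np

def chain_marginal(psi, n):
    """Marginal p(x_n) of p(x) ∝ prod_j psi[j][x_j, x_{j+1}], j = 0..N-2 (0-indexed)."""
    N = len(psi) + 1
    K = psi[0].shape[0]
    # left-to-right messages
    mu_alpha = np.ones(K)
    for j in range(n):                      # accumulate over x_0, ..., x_{n-1}
        mu_alpha = psi[j].T @ mu_alpha      # sum_{x_j} psi(x_j, x_{j+1}) mu_alpha(x_j)
    # right-to-left messages
    mu_beta = np.ones(K)
    for j in range(N - 2, n - 1, -1):       # accumulate over x_{N-1}, ..., x_{n+1}
        mu_beta = psi[j] @ mu_beta          # sum_{x_{j+1}} psi(x_j, x_{j+1}) mu_beta(x_{j+1})
    p = mu_alpha * mu_beta
    return p / p.sum()                      # normalize (absorbs 1/Z)

rng = np.random.default_rng(0)
K, N = 3, 6
psi = [rng.random((K, K)) for _ in range(N - 1)]
print(chain_marginal(psi, n=2))
```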

38 Efficient Inference (cont.)
Another example: compute the MAP configuration, $\max_x p(x)$:
  $p(x) = p(x_1, \ldots, x_N) = \psi(x_1, x_2)\, \psi(x_2, x_3) \cdots \psi(x_{N-1}, x_N)$
Naïve solution (suppose $x_1, x_2, \ldots, x_N \in \{1, \ldots, K\}$ and $Z = 1$):
  $\max_x p(x) = \max_{x_1} \max_{x_2} \cdots \max_{x_N} p(x_1, \ldots, x_N)$
has cost $O(K^N)$ (just tabulating $p(x_1, \ldots, x_N)$ has cost $O(K^N)$)
...the structure of $p(x_1, \ldots, x_N)$ is not being exploited.

39 Efficient Inference (cont.)
Exploit the structure:
  $\max_x p(x) = \max_{x_1} \max_{x_2} \cdots \max_{x_N} \psi(x_1, x_2) \cdots \psi(x_{N-1}, x_N)$
  $\phantom{\max_x p(x)} = \max_{x_1} \max_{x_2} \psi(x_1, x_2) \cdots \underbrace{\max_{x_{N-1}} \psi(x_{N-2}, x_{N-1}) \underbrace{\max_{x_N} \psi(x_{N-1}, x_N)}_{\mu(x_{N-1})}}_{\mu(x_{N-2})}$
Formally, right-to-left messages:
  $\mu(x_N) = 1$, for $x_N \in \{1, \ldots, K\}$
  $\mu(x_j) = \max_{x_{j+1}} \psi(x_j, x_{j+1})\, \mu(x_{j+1})$, for $x_j \in \{1, \ldots, K\}$
Can also be done from left to right, using the reverse ordering.

40 Efficient Inference (cont.)
Exploit the structure:
  $\max_x p(x) = \max_{x_1} \max_{x_2} \psi(x_1, x_2) \cdots \underbrace{\max_{x_{N-1}} \psi(x_{N-2}, x_{N-1}) \underbrace{\max_{x_N} \psi(x_{N-1}, x_N)}_{\mu(x_{N-1})}}_{\mu(x_{N-2})}$
Cost $O(N K^2)$ (versus $O(K^N)$; e.g., $2000$ operations instead of an exponentially large number).
The key is the distributive property: $\max\{ab, ac\} = a \max\{b, c\}$ (for $a \geq 0$).
Equivalently, for $\max \log p(x)$, use $\max\{a+b, a+c\} = a + \max\{b, c\}$.
To compute $\arg\max_x p(x)$ we need a backward and a forward pass; why and how?

41 Efficient Inference (cont.)
Inference on a general DGM via message passing; on the chain, the MAP configuration is recovered by backtracking:
  $\hat{x}_1 = \arg\max_{x_1} \Big( \max_{x_2, \ldots, x_N} p(x_1, x_2, \ldots, x_N) \Big) = \arg\max_{x_1} \mu(x_1)$
  $\hat{x}_2 = \arg\max_{x_2} \psi(\hat{x}_1, x_2)\, \mu(x_2)$
  $\vdots$
  $\hat{x}_N = \arg\max_{x_N} \psi(\hat{x}_{N-1}, x_N)\, \underbrace{\mu(x_N)}_{1} = \arg\max_{x_N} \psi(\hat{x}_{N-1}, x_N)$
Cost $O(N K^2)$ (versus $O(K^N)$; e.g., $2000$ operations instead of an exponentially large number).
This is similar to dynamic programming and the Viterbi algorithm.
This is known as the max-product (or max-sum, with logs) algorithm.
How to extend this to more general graphical structures? A general algorithm is more conveniently written for factor graphs.
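A minimal sketch of the max-product recursion and the backtracking pass on a chain (0-indexed, arbitrary pairwise potentials), checked against brute force on a tiny example:

```python
import numpy as np
from itertools import product

def chain_map(psi):
    """argmax_x prod_j psi[j][x_j, x_{j+1}] via right-to-left max-messages + forward backtracking."""
    N = len(psi) + 1
    K = psi[0].shape[0]
    # mu[j][k] = max over x_{j+1}, ..., x_{N-1} of the product of potentials from j on, given x_j = k
    mu = [np.ones(K) for _ in range(N)]
    for j in range(N - 2, -1, -1):
        mu[j] = (psi[j] * mu[j + 1][None, :]).max(axis=1)
    # forward backtracking
    x_hat = np.empty(N, dtype=int)
    x_hat[0] = int(np.argmax(mu[0]))
    for j in range(1, N):
        x_hat[j] = int(np.argmax(psi[j - 1][x_hat[j - 1], :] * mu[j]))
    return x_hat

rng = np.random.default_rng(1)
K, N = 3, 5
psi = [rng.random((K, K)) for _ in range(N - 1)]
x_hat = chain_map(psi)
# brute-force check on this small example
best = max(product(range(K), repeat=N),
           key=lambda x: np.prod([psi[j][x[j], x[j + 1]] for j in range(N - 1)]))
print(x_hat, best)  # should agree
```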

42 Factor Graphs
The joint pdf/pmf of $X = (X_1, \ldots, X_V)$ (or hybrid) is a product
  $p(x) = \prod_{s \in S} f_s(x_s)$, where $S \subseteq 2^{\{1, \ldots, V\}}$, i.e., each $s \subseteq \{1, \ldots, V\}$
Each factor $f_s$ only depends on a subset $x_s$ of the components of $x$.
Example: $p(x_1, x_2, x_3) = f_a(x_1, x_2)\, f_b(x_1, x_2)\, f_c(x_2, x_3)\, f_d(x_3)$
Seen as an MRF, $\mathcal{C} = \{ \{1, 2\}, \{2, 3\} \}$, thus $p(x) \propto \psi_{\{1,2\}}\, \psi_{\{2,3\}}$, with $\psi_{\{1,2\}} = f_a(x_1, x_2)\, f_b(x_1, x_2)$ and $\psi_{\{2,3\}} = f_c(x_2, x_3)\, f_d(x_3)$.

43 Factor Graphs
Factor graphs are bipartite: two disjoint subsets of nodes (factors and variables); no edges between nodes in the same subset.
Mapping an MRF to a factor graph is not unique (figure: the same MRF expressed with different factorizations).
Mapping a Bayesian network to a factor graph is not unique (figure: the same DGM expressed with different factorizations).
Neighborhood: $\mathrm{ne}(x) = \{s \in S : x \in s\}$

44 Sum-Product Algorithm on Factor Graphs
Working example: compute a marginal $p(x) = \sum_{\mathbf{x} \setminus x} p(\mathbf{x})$.
Assume the graph is a tree.
Group the factors in the following way:
  $p(\mathbf{x}) = \prod_{s \in S} f_s(x_s) = \prod_{s \in \mathrm{ne}(x)} F_s(x, X_s)$,
where $X_s$ is the subset of variables in the subtree connected to $x$ via factor $s$, and $F_s(x, X_s)$ is the product of all the factors in that subtree.

45 Sum-Product Algorithm on Factor Graphs
Rewrite the marginalization:
  $p(x) = \sum_{\mathbf{x} \setminus x}\ \prod_{s \in \mathrm{ne}(x)} F_s(x, X_s) = \prod_{s \in \mathrm{ne}(x)} \underbrace{\sum_{X_s} F_s(x, X_s)}_{\mu_{f_s \to x}(x)}$
Each subtree factor has the form
  $F_s(x, X_s) = f_s(x, x_1, \ldots, x_M)\, G_1(x_1, X_{s1}) \cdots G_M(x_M, X_{sM})$,
thus
  $\mu_{f_s \to x}(x) = \sum_{x_1} \cdots \sum_{x_M} f_s(x, x_1, \ldots, x_M) \prod_{m \in \mathrm{ne}(f_s) \setminus x} \underbrace{\sum_{X_{sm}} G_m(x_m, X_{sm})}_{\mu_{x_m \to f_s}(x_m)}$

46 Sum-Product Algorithm on Factor Graphs
Factor-to-variable (FtV) messages:
  $\mu_{f_s \to x}(x) = \sum_{x_1} \cdots \sum_{x_M} f_s(x, x_1, \ldots, x_M) \prod_{m \in \mathrm{ne}(f_s) \setminus x} \underbrace{\mu_{x_m \to f_s}(x_m)}_{\text{variable-to-factor (VtF) messages}}$
Computing the FtV message from $f_s$ to $x$:
- compute the product of the VtF messages coming from all variables except $x$;
- multiply by the local factor;
- marginalize w.r.t. all variables except $x$.

47 Sum-Product Algorithm on Factor Graphs
Variable-to-factor (VtF) messages:
  $\mu_{x_m \to f_s}(x_m) = \sum_{X_{sm}} G_m(x_m, X_{sm})$
But $G_m(x_m, X_{sm}) = \prod_{l \in \mathrm{ne}(x_m) \setminus f_s} F_l(x_m, X_{ml})$, thus
  $\mu_{x_m \to f_s}(x_m) = \prod_{l \in \mathrm{ne}(x_m) \setminus f_s} \sum_{X_{ml}} F_l(x_m, X_{ml}) = \prod_{l \in \mathrm{ne}(x_m) \setminus f_s} \mu_{f_l \to x_m}(x_m)$
Each VtF message is the product of the FtV messages that the variable receives from the other factors.

48 Sum-Product Algorithm on Factor Graphs
Variable-to-factor (VtF) messages from leaf variables (no other neighboring factors):
  $\mu_{x_m \to f_s}(x_m) = \prod_{l \in \mathrm{ne}(x_m) \setminus f_s} \mu_{f_l \to x_m}(x_m) = 1$
Factor-to-variable (FtV) messages from leaf factors (no other neighboring variables):
  $\mu_{f_s \to x}(x) = \sum_{x_1} \cdots \sum_{x_M} f_s(x, x_1, \ldots, x_M) \prod_{m \in \mathrm{ne}(f_s) \setminus x} \mu_{x_m \to f_s}(x_m) = f_s(x)$

49 Sum-Product on Factor Graphs: Detailed Example
  $p(x) = f_a(x_1, x_2)\, f_b(x_2, x_3)\, f_c(x_2, x_4)$
  $p(x_2) = \mu_{f_a \to x_2}(x_2)\, \mu_{f_b \to x_2}(x_2)\, \mu_{f_c \to x_2}(x_2)$
  $\phantom{p(x_2)} = \Big[ \sum_{x_1} f_a(x_1, x_2) \Big] \Big[ \sum_{x_3} f_b(x_2, x_3) \Big] \Big[ \sum_{x_4} f_c(x_2, x_4) \Big]$
  $\phantom{p(x_2)} = \sum_{x_1} \sum_{x_3} \sum_{x_4} f_a(x_1, x_2)\, f_b(x_2, x_3)\, f_c(x_2, x_4) = p(x_2)$
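A small numerical check of this example: the product of the three factor-to-variable messages arriving at $x_2$ equals the brute-force marginal (the factor tables below are arbitrary illustrative values):

```python
import numpy as np

K = 3
rng = np.random.default_rng(2)
f_a = rng.random((K, K))   # f_a(x_1, x_2)
f_b = rng.random((K, K))   # f_b(x_2, x_3)
f_c = rng.random((K, K))   # f_c(x_2, x_4)

# factor-to-variable messages into x_2 (leaf variables send constant-1 messages)
mu_a = f_a.sum(axis=0)     # sum over x_1
mu_b = f_b.sum(axis=1)     # sum over x_3
mu_c = f_c.sum(axis=1)     # sum over x_4
p_x2_messages = mu_a * mu_b * mu_c

# brute-force marginal of the (unnormalized) product of factors
joint = np.einsum('ij,jk,jl->ijkl', f_a, f_b, f_c)   # indices: x_1, x_2, x_3, x_4
p_x2_brute = joint.sum(axis=(0, 2, 3))

print(np.allclose(p_x2_messages, p_x2_brute))  # True
```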

50 Max-Sum Algorithm on Factor Graphs
Message passing for MAP:
  $\max_x \log p(x) = \max_x \sum_{s \in S} \log f_s(x_s)$
Distributive property: $\max\{a + b, a + c\} = a + \max\{b, c\}$
Max-sum messages:
  $\mu_{f \to x}(x) = \max_{x_1, \ldots, x_M} \Big[ \log f(x, x_1, \ldots, x_M) + \sum_{m \in \mathrm{ne}(f) \setminus x} \mu_{x_m \to f}(x_m) \Big]$
  $\mu_{x \to f}(x) = \sum_{l \in \mathrm{ne}(x) \setminus f} \mu_{f_l \to x}(x)$
At leaf variables and factors:
  $\mu_{x \to f}(x) = 0$, $\qquad \mu_{f \to x}(x) = \log f(x)$

51 Recommended Reading
C. Bishop, Pattern Recognition and Machine Learning, Springer, 2006 (this lecture was very much based on chapter 8 of this book). Chapter 8 is freely available at um/people/cmbishop/prml/pdf/bishop-prml-sample.pdf
K. Murphy, Machine Learning: A Probabilistic Perspective, MIT Press, 2012 (chapters 10 and 19).
