Graphical models. Sunita Sarawagi, IIT Bombay

Probabilistic modeling
Given: several variables x_1, ..., x_n, with n large.
Task: build a joint distribution function Pr(x_1, ..., x_n).
Goal: answer several kinds of projection queries on the distribution.
Basic premise: the explicit joint distribution is dauntingly large, while the queries are simple marginals (sum or max) over the joint distribution.

Example
Variables are attributes of people: Age (10 ranges), Income (7 scales), Experience (7 scales), Degree (3 scales), Location (30 places).
An explicit joint distribution over all columns is not tractable: the number of combinations is 10 x 7 x 7 x 3 x 30 = 44,100.
Queries: estimate the fraction of people with Income > 200K and Degree = Bachelors; or with Income < 200K, Degree = PhD, and Experience > 10 years. Many, many more.

Alternatives to an explicit joint distribution
- Assume all columns are independent of each other: a bad assumption.
- Use data to detect pairs of highly correlated columns and estimate their pairwise frequencies.
  - Many highly correlated pairs: income and age, income and experience, age and experience.
  - Ad hoc methods of combining these into a single estimate.
- Go beyond pairwise correlations to conditional independencies:
  - income and age are dependent, but income is independent of age given experience;
  - experience and degree are dependent, but experience is independent of degree given income.
Graphical models make these independencies explicit and use them to build an efficient joint distribution.

Graphical models
Model the joint distribution over several variables as a product of smaller factors that is:
1. Intuitive to represent and visualize.
   - Graph: represents the structure of dependencies.
   - Potentials over subsets: quantify the dependencies.
2. Efficient to query: given values of any variable subset, reason about the probability distribution of the others.
   - Many efficient exact and approximate inference algorithms.
Graphical models = graph theory + probability theory.

Graphical models in use
- Roots in statistical physics for modeling interacting atoms in gases and solids (circa 1900).
- Early usage in genetics for modeling properties of species (circa 1920).
- AI: expert systems (1970s-80s).
- Now many new applications:
  - Error-correcting codes: turbo codes, an impressive success story (1990s).
  - Robotics and vision: image denoising, robot navigation.
  - Text mining: information extraction, duplicate elimination, hypertext classification, help systems.
  - Bioinformatics: secondary structure prediction, gene discovery.
  - Data mining: probabilistic classification and clustering.

Part I: Outline
1. Representation: directed graphical models (Bayesian networks); undirected graphical models.
2. Inference: queries; exact inference on chains; variable elimination on general graphs; junction trees.
3. Approximate inference: generalized belief propagation; sampling (Gibbs, particle filters).
4. Constructing a graphical model: graph structure; parameters in potentials.
5. General framework for parameter learning in graphical models.
6. References.

Representation
Structure of a graphical model: Graph + Potentials.
Graph:
- Nodes: variables x = x_1, ..., x_n.
  - Continuous: sensor temperatures, income.
  - Discrete: Degree (one of Bachelors, Masters, PhD), levels of age, labels of words.
- Edges: direct interaction.
  - Directed edges: Bayesian networks.
  - Undirected edges: Markov random fields.
(Figure: a directed and an undirected example graph over Location, Age, Degree, Experience, Income.)

Representation
Potentials ψ_c(x_c): scores for assignments of values to subsets c of directly interacting variables.
- Which subsets? What do the potentials mean? Different for directed and undirected graphs.
Probability factorizes as a product of potentials:
$$\Pr(x = x_1, \dots, x_n) \propto \prod_S \psi_S(x_S)$$

Directed graphical models: Bayesian networks
- Graph G: directed acyclic.
- Parents of a node: Pa(x_i) = set of nodes in G pointing to x_i.
- Potentials: defined at each node in terms of its parents, $\psi_i(x_i, \mathrm{Pa}(x_i)) = \Pr(x_i \mid \mathrm{Pa}(x_i))$.
Probability distribution:
$$\Pr(x_1, \dots, x_n) = \prod_{i=1}^{n} \Pr(x_i \mid \mathrm{Pa}(x_i))$$

Example of a directed graph
Variables: Location (L), Age (A), Degree (D), Experience (E), Income (I).
- ψ_1(L) = Pr(L): a table over {NY, CA, London, Other}.
- ψ_2(A) = Pr(A): a table over age ranges, or a Gaussian distribution with (µ, σ) = (35, 10).
- ψ_3(E, A) = Pr(E | A): a table of experience ranges conditioned on age.
- ψ_4(I, E, D) = Pr(I | D, E): a 3-dimensional table, or a histogram approximation.
Probability distribution:
$$\Pr(x = (L, D, I, A, E)) = \Pr(L)\,\Pr(D)\,\Pr(A)\,\Pr(E \mid A)\,\Pr(I \mid D, E)$$
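The factored form above can be evaluated directly once each conditional table is specified. Below is a minimal sketch, with made-up illustrative numbers (the slide's actual table entries are not reproduced) and with Age, Experience, and Income coarsened to two levels each.

```python
# Hypothetical CPTs for Pr(L), Pr(D), Pr(A), Pr(E|A), Pr(I|D,E); values are illustrative only.
pr_L = {"NY": 0.3, "CA": 0.3, "London": 0.2, "Other": 0.2}
pr_D = {"Bachelors": 0.6, "Masters": 0.3, "PhD": 0.1}
pr_A = {"young": 0.5, "old": 0.5}
pr_E_given_A = {"young": {"low": 0.8, "high": 0.2},
                "old":   {"low": 0.3, "high": 0.7}}
pr_I_given_DE = {("Bachelors", "low"): {"low": 0.8, "high": 0.2},
                 ("Bachelors", "high"): {"low": 0.5, "high": 0.5},
                 ("Masters", "low"): {"low": 0.7, "high": 0.3},
                 ("Masters", "high"): {"low": 0.4, "high": 0.6},
                 ("PhD", "low"): {"low": 0.6, "high": 0.4},
                 ("PhD", "high"): {"low": 0.2, "high": 0.8}}

def joint(L, D, A, E, I):
    """Pr(L, D, A, E, I) = Pr(L) Pr(D) Pr(A) Pr(E|A) Pr(I|D,E)."""
    return pr_L[L] * pr_D[D] * pr_A[A] * pr_E_given_A[A][E] * pr_I_given_DE[(D, E)][I]

print(joint("NY", "PhD", "old", "high", "high"))  # probability of one full assignment
```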

Conditional independencies
Given three sets of variables X, Y, Z: X is conditionally independent of Y given Z, written $X \perp Y \mid Z$, iff $\Pr(X \mid Y, Z) = \Pr(X \mid Z)$.
Local conditional independencies in a BN: for each $x_i$, $x_i \perp \mathrm{ND}(x_i) \mid \mathrm{Pa}(x_i)$, where ND(x_i) denotes the non-descendants of x_i.
In the example graph over Location, Age, Degree, Experience, Income:
- $L \perp E, D, A, I$
- $A \perp L, D$
- $E \perp L, D \mid A$
- $I \perp A \mid E, D$

CIs and factorization
Theorem. Local CI implies factorization.
Proof. Let x_1, x_2, ..., x_n be topologically ordered (parents before children).
Local CI: $\Pr(x_i \mid x_1, \dots, x_{i-1}) = \Pr(x_i \mid \mathrm{Pa}(x_i))$.
Chain rule: $\Pr(x_1, \dots, x_n) = \prod_i \Pr(x_i \mid x_1, \dots, x_{i-1}) = \prod_i \Pr(x_i \mid \mathrm{Pa}(x_i))$.

Global CIs in a BN
Given three sets of variables X, Y, Z: if Z d-separates X from Y in the BN, then $X \perp Y \mid Z$.
In a directed graph H, Z d-separates X from Y if every path P from any node of X to any node of Y is blocked by Z. A path P is blocked by Z when it contains a node $x_i$ such that one of the following holds:
1. $\dots \rightarrow x_i \rightarrow \dots$ (head-to-tail) and $x_i \in Z$
2. $\dots \leftarrow x_i \leftarrow \dots$ (head-to-tail) and $x_i \in Z$
3. $\dots \leftarrow x_i \rightarrow \dots$ (tail-to-tail) and $x_i \in Z$
4. $\dots \rightarrow x_i \leftarrow \dots$ (head-to-head) and $x_i \notin Z$ and $\mathrm{Desc}(x_i) \cap Z = \emptyset$
Theorem. The d-separation test identifies the complete set of conditional independencies that hold in all distributions that conform to a given Bayesian network.

Popular Bayesian networks
- Hidden Markov Models: speech recognition, information extraction.
  (Figure: a chain of state variables y_1, ..., y_7 with observation variables x_1, ..., x_7.)
  - State variables: discrete (phoneme, entity tag).
  - Observation variables: continuous (speech waveform) or discrete (word).
- Kalman filters: state variables are continuous. Discussed later.
- Topic models for text data:
  1. A principled mechanism to categorize multi-labeled text documents while incorporating priors in a flexible generative framework.
  2. Application: news tracking.
- QMR (Quick Medical Reference) system.
- PRMs: probabilistic relational networks.

Undirected graphical models
- Graph G: arbitrary undirected graph.
- Useful when variables interact symmetrically, with no natural parent-child relationship. Example: labeling the pixels of an image.
- Potentials ψ_C(y_C) defined on arbitrary cliques C of G.
- ψ_C(y_C): any arbitrary non-negative value; it cannot be interpreted as a probability.
Probability distribution:
$$\Pr(y_1, \dots, y_n) = \frac{1}{Z} \prod_{C \in G} \psi_C(y_C), \qquad Z = \sum_y \prod_{C \in G} \psi_C(y_C) \;\text{(the partition function)}$$

Example
Pixels y_1, ..., y_9 on a 3 x 3 grid; y_i = 1 if pixel i is part of the foreground, 0 otherwise.
- Node potentials: ψ_1(0) = 4, ψ_1(1) = 1; ψ_2(0) = 2, ψ_2(1) = 3; ...; ψ_9(0) = 1, ψ_9(1) = 1.
- Edge potentials, the same for all edges: ψ(0,0) = 5, ψ(1,1) = 5, ψ(1,0) = 1, ψ(0,1) = 1.
Probability:
$$\Pr(y_1, \dots, y_9) \propto \prod_{k=1}^{9} \psi_k(y_k) \prod_{(i,j) \in E(G)} \psi(y_i, y_j)$$
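A small sketch of this grid model, assuming a 3 x 3 lattice of binary pixels; the node potentials the slide elides with "..." are filled in with placeholder values, and the partition function is computed by brute force, which is feasible only at this toy size.

```python
import itertools

# Node potentials psi_k(0), psi_k(1); entries after psi_2 are placeholders for the elided values.
node_pot = [(4, 1), (2, 3)] + [(1, 1)] * 7
# Shared edge potential psi(y_i, y_j).
edge_pot = {(0, 0): 5, (1, 1): 5, (0, 1): 1, (1, 0): 1}
# Horizontal and vertical neighbor edges of the 3x3 grid (pixels indexed 0..8).
edges = [(r * 3 + c, r * 3 + c + 1) for r in range(3) for c in range(2)] + \
        [(r * 3 + c, (r + 1) * 3 + c) for r in range(2) for c in range(3)]

def unnormalized(y):
    """Product of all node and edge potentials for the assignment y."""
    score = 1.0
    for k, yk in enumerate(y):
        score *= node_pot[k][yk]
    for i, j in edges:
        score *= edge_pot[(y[i], y[j])]
    return score

# Brute-force partition function Z over all 2^9 assignments, then one normalized probability.
Z = sum(unnormalized(y) for y in itertools.product((0, 1), repeat=9))
print(unnormalized((0,) * 9) / Z)   # Pr(all pixels labeled background)
```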

Conditional independencies (CIs) in an undirected graphical model
Let V = {y_1, ..., y_n}.
1. Local CI: $y_i \perp V \setminus (\mathrm{ne}(y_i) \cup \{y_i\}) \mid \mathrm{ne}(y_i)$.
2. Pairwise CI: $y_i \perp y_j \mid V \setminus \{y_i, y_j\}$ if the edge (y_i, y_j) does not exist.
3. Global CI: $X \perp Y \mid Z$ if Z separates X and Y in the graph.
The three are equivalent when the distribution is positive. On the 3 x 3 grid:
1. $y_1 \perp \{y_3, y_5, y_6, y_7, y_8, y_9\} \mid \{y_2, y_4\}$
2. $y_1 \perp y_3 \mid \{y_2, y_4, y_5, y_6, y_7, y_8, y_9\}$
3. $\{y_1, y_2, y_3\} \perp \{y_7, y_8, y_9\} \mid \{y_4, y_5, y_6\}$

Relationship between Local-CI and Global-CI
Let G be an undirected graph over nodes V = {x_1, ..., x_n} and let P(x_1, ..., x_n) be a distribution. If P satisfies the global CIs of G, then P also satisfies the local CIs of G, but the reverse is not always true. We show this with an example.
Consider a distribution over 5 binary variables P(x_1, ..., x_5) where x_1 = x_2, x_4 = x_5, and x_3 = x_2 AND x_4. Let G be the chain x_1 - x_2 - x_3 - x_4 - x_5. All 5 local CIs of the graph, e.g. $x_1 \perp \{x_3, x_4, x_5\} \mid x_2$, hold in P. However, the global CI $x_2 \perp x_4 \mid x_3$ does not hold.

Relationship between Local-CI and Pairwise-CI
Let G be an undirected graph over nodes V = {x_1, ..., x_n} and let P(x_1, ..., x_n) be a distribution. If P satisfies the local CIs of G, then P also satisfies the pairwise CIs of G, but the reverse is not always true. We show this with an example.
Consider a distribution over 3 binary variables P(x_1, x_2, x_3) where x_1 = x_2 = x_3; that is, P(x_1, x_2, x_3) = 1/2 when all three are equal and 0 otherwise. Let G be the graph with the single edge x_1 - x_2 and x_3 disconnected. Both pairwise CIs of the graph, $x_1 \perp x_3 \mid x_2$ and $x_2 \perp x_3 \mid x_1$, hold in P. However, the local CI $x_3 \perp \{x_1, x_2\}$ (x_3 has no neighbors in G) does not hold.

Factorization implies Global-CI
Theorem. Let G be an undirected graph over nodes V = {x_1, ..., x_n} and let P(x_1, ..., x_n) be a distribution. If P can be factorized as per the cliques of G, then P also satisfies the global CIs of G.
Proof. Discussed in class.

Global-CI does not imply factorization
Consider a distribution over 4 binary variables P(x_1, x_2, x_3, x_4). Let G be the 4-cycle x_1 - x_2 - x_3 - x_4 - x_1. Let P(x_1, x_2, x_3, x_4) = 1/8 when (x_1, x_2, x_3, x_4) takes a value from the set {0000, 1000, 1100, 1110, 1111, 0111, 0011, 0001}, and 0 in all other cases. One can painfully check that all the global CIs of the graph, e.g. $x_1 \perp x_3 \mid \{x_2, x_4\}$, hold in P. Now look at factorization. The factors correspond to the edges of G, e.g. ψ(x_1, x_2). For a factorization to reproduce the eight assignments above, each of the four possible assignments of every edge factor must get a positive value, but then the product cannot represent the zero probability of cases like (x_1, x_2, x_3, x_4) = 0101.

Factorization and CIs
Theorem (Hammersley-Clifford). If a positive distribution P(x_1, ..., x_n) conforms to the pairwise CIs of a UDGM G, then it can be factorized over the cliques C of G as
$$P(x_1, \dots, x_n) \propto \prod_{C \in G} \psi_C(x_C)$$
Proof. Skipped.

Popular undirected graphical models
- Interacting atoms in gases and solids (circa 1900).
- Markov random fields in vision, for image segmentation.
- Conditional random fields, for information extraction.
- Social networks.
- Bioinformatics: annotating active sites in protein molecules.

Comparing directed and undirected graphs
- Expressiveness: some distributions can only be expressed in one and not the other. (Figure: example graphs.)
- Potentials. Directed: conditional probabilities, more intuitive. Undirected: arbitrary scores, easy to set.
- Dependence structure. Directed: the more complicated d-separation test. Undirected: graph separation, $A \perp B \mid C$ iff C separates A and B in G.
- Often the application makes the choice clear. Directed: causality. Undirected: symmetric interactions.

Inference queries
1. Marginal probability queries over a small subset of variables, e.g. find Pr(Income = High & Degree = PhD), or find Pr(pixel y_9 = 1):
$$\Pr(x_1) = \sum_{x_2, \dots, x_n} \Pr(x_1, \dots, x_n)$$
2. Most likely labels of the remaining variables (MAP queries), e.g. find the most likely entity labels of all words in a sentence, or the likely temperatures at sensors in a room:
$$x^* = \arg\max_{x_1, \dots, x_n} \Pr(x_1, \dots, x_n)$$

Exact inference on chains
Given: a chain graph with potentials ψ_i(y_i, y_{i+1}), so that $\Pr(y_1, \dots, y_n) = \prod_i \psi_i(y_i, y_{i+1})$.
Find: Pr(y_i) for any i, say Pr(y_5 = 1).
Naive exact method: $\Pr(y_5 = 1) = \sum_{y_1, \dots, y_4} \Pr(y_1, \dots, y_4, 1)$ requires an exponential number of summations. A more efficient alternative follows.

Exact inference on chains
$$\begin{aligned}
\Pr(y_5 = 1) &= \sum_{y_1, \dots, y_4} \Pr(y_1, \dots, y_4, 1) \\
&= \sum_{y_1} \sum_{y_2} \sum_{y_3} \sum_{y_4} \psi_1(y_1, y_2)\,\psi_2(y_2, y_3)\,\psi_3(y_3, y_4)\,\psi_4(y_4, 1) \\
&= \sum_{y_1} \sum_{y_2} \psi_1(y_1, y_2) \sum_{y_3} \psi_2(y_2, y_3) \sum_{y_4} \psi_3(y_3, y_4)\,\psi_4(y_4, 1) \\
&= \sum_{y_1} \sum_{y_2} \psi_1(y_1, y_2) \sum_{y_3} \psi_2(y_2, y_3)\,B_3(y_3) \\
&= \sum_{y_1} \sum_{y_2} \psi_1(y_1, y_2)\,B_2(y_2) \\
&= \sum_{y_1} B_1(y_1)
\end{aligned}$$
An alternative view: a flow of beliefs B_i(.) from node i+1 to node i.
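The same right-to-left summation can be written as a few matrix-vector products. A minimal sketch, assuming binary labels and made-up potential tables (the slide does not specify numeric values):

```python
import numpy as np

# Pairwise potentials psi_1..psi_4 as 2x2 tables; the values here are illustrative only.
psi = [np.array([[5.0, 1.0], [1.0, 5.0]]) for _ in range(4)]

def unnormalized_marginal_y5(v):
    """Sum out y_4, y_3, y_2, y_1 right to left, passing beliefs B_i(.) backwards."""
    belief = psi[3][:, v]            # start from the last factor: psi_4(y_4, v)
    for i in (2, 1, 0):              # B_i(y_i) = sum_{y_{i+1}} psi_i(y_i, y_{i+1}) B_{i+1}(y_{i+1})
        belief = psi[i] @ belief
    return belief.sum()              # finally sum over y_1

u0, u1 = unnormalized_marginal_y5(0), unnormalized_marginal_y5(1)
print(u1 / (u0 + u1))                # Pr(y_5 = 1)
```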

Adding evidence
Given fixed values of a subset of variables x_e (the evidence), find:
1. Marginal probability queries over a small subset of variables, e.g. find Pr(Income = High | Degree = PhD):
$$\Pr(x_1) = \sum_{x_2, \dots, x_m} \Pr(x_1, \dots, x_n \mid x_e)$$
2. Most likely labels of the remaining variables (MAP queries), e.g. find the likely temperatures at sensors in a room given readings from a subset of them:
$$x^* = \arg\max_{x_1, \dots, x_m} \Pr(x_1, \dots, x_n \mid x_e)$$
Evidence is easy to add: just change the potentials.

Inference in HMMs
Given:
- the HMM graph with state variables y_1, ..., y_n and observation variables x_1, ..., x_n;
- potentials Pr(y_i | y_{i-1}) and Pr(x_i | y_i);
- evidence variables x = x_1, ..., x_n = o_1, ..., o_n.
Find the most likely values of the hidden state variables:
$$y^* = y_1^*, \dots, y_n^* = \arg\max_y \Pr(y \mid x = o)$$
Define $\psi_i(y_{i-1}, y_i) = \Pr(y_i \mid y_{i-1}) \Pr(x_i = o_i \mid y_i)$. The reduced graph is a single chain over the y nodes. The algorithm is the same as before, just replace Sum with Max. This is the well-known Viterbi algorithm.
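A compact Viterbi sketch for this reduced chain, using max-product in log space with backpointers; the transition and emission tables and the observation sequence are made up for illustration.

```python
import numpy as np

trans = np.array([[0.7, 0.3], [0.4, 0.6]])   # Pr(y_i | y_{i-1}), two hidden states
emit  = np.array([[0.9, 0.1], [0.2, 0.8]])   # Pr(x_i | y_i), two observation symbols
prior = np.array([0.5, 0.5])                 # Pr(y_1)
obs = [0, 1, 1, 0, 1]                        # o_1 ... o_n

def viterbi(obs):
    n, k = len(obs), len(prior)
    score = np.log(prior) + np.log(emit[:, obs[0]])   # best log score of each state at position 1
    back = np.zeros((n, k), dtype=int)                # backpointers
    for i in range(1, n):
        cand = score[:, None] + np.log(trans) + np.log(emit[:, obs[i]])[None, :]
        back[i] = cand.argmax(axis=0)                 # best previous state for each current state
        score = cand.max(axis=0)
    path = [int(score.argmax())]                      # best final state, then trace back
    for i in range(n - 1, 0, -1):
        path.append(int(back[i][path[-1]]))
    return path[::-1]

print(viterbi(obs))   # most likely hidden state sequence
```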

Variable elimination on general graphs
Given: an arbitrary set of potentials ψ_C(x_C), C = cliques of a graph G.
Find: $Z = \sum_{x_1, \dots, x_n} \prod_C \psi_C(x_C)$.
Algorithm:
  x_1, ..., x_n = a good ordering of the variables
  F = {ψ_C(x_C) : C a clique of G}
  for i = 1 ... n do
    F_i = factors in F that contain x_i
    M_i = product of the factors in F_i
    m_i = Σ_{x_i} M_i
    F = (F \ F_i) ∪ {m_i}
  end for

Example: variable elimination
Given: ψ_12(x_1, x_2), ψ_24(x_2, x_4), ψ_23(x_2, x_3), ψ_45(x_4, x_5), ψ_35(x_3, x_5).
Find: $Z = \sum_{x_1, \dots, x_5} \psi_{12}(x_1, x_2)\,\psi_{24}(x_2, x_4)\,\psi_{23}(x_2, x_3)\,\psi_{45}(x_4, x_5)\,\psi_{35}(x_3, x_5)$.
1. Eliminate x_1: {ψ_12(x_1, x_2)} gives M_1(x_1, x_2); m_1(x_2) = Σ_{x_1} M_1.
2. Eliminate x_2: {ψ_24(x_2, x_4), ψ_23(x_2, x_3), m_1(x_2)} gives M_2(x_2, x_3, x_4); m_2(x_3, x_4) = Σ_{x_2} M_2.
3. Eliminate x_3: {ψ_35(x_3, x_5), m_2(x_3, x_4)} gives M_3(x_3, x_4, x_5); m_3(x_4, x_5) = Σ_{x_3} M_3.
4. Eliminate x_4: {ψ_45(x_4, x_5), m_3(x_4, x_5)} gives M_4(x_4, x_5); m_4(x_5) = Σ_{x_4} M_4.
5. Eliminate x_5: {m_4(x_5)} gives M_5(x_5); Z = Σ_{x_5} M_5.
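A small sketch of this elimination run, assuming binary variables and random positive tables for the five pairwise potentials (their values are not given on the slide):

```python
import itertools
import numpy as np

rng = np.random.default_rng(0)
K = 2  # each variable takes K values
factors = {frozenset(s): rng.random((K, K)) + 0.1
           for s in [(1, 2), (2, 4), (2, 3), (4, 5), (3, 5)]}

def eliminate(factors, order):
    """Sum out the variables one by one; each factor is (sorted scope, assignment -> value dict)."""
    work = [(tuple(sorted(s)), {a: t[a] for a in itertools.product(range(K), repeat=2)})
            for s, t in factors.items()]
    for v in order:
        touching = [f for f in work if v in f[0]]
        rest = [f for f in work if v not in f[0]]
        new_scope = tuple(sorted(set().union(*[f[0] for f in touching]) - {v}))
        table = {}
        for a in itertools.product(range(K), repeat=len(new_scope)):
            asg, total = dict(zip(new_scope, a)), 0.0
            for xv in range(K):            # sum over the eliminated variable
                asg[v] = xv
                prod = 1.0
                for scope, t in touching:  # multiply all factors that touch v
                    prod *= t[tuple(asg[u] for u in scope)]
                total += prod
            table[a] = total
        work = rest + [(new_scope, table)]
    return work[0][1][()]                  # the final empty-scope factor is Z

print(eliminate(factors, order=[1, 2, 3, 4, 5]))
```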

Choosing a variable elimination order
- Complexity of VE is O(n m^w), where w is the maximum number of variables in any intermediate factor.
- A wrong elimination order can give rise to very large intermediate factors. Example: in the previous graph, eliminating x_2 first gives a factor over 4 variables. Can you give an example where the penalty is really severe?
- Choosing the optimal elimination order is NP-hard for general graphs.
- A polynomial-time algorithm exists for chordal graphs. A graph is chordal, or triangulated, if every cycle of length greater than three has a chord (shortcut).
- Optimal triangulation of graphs is NP-hard (many heuristics exist).

Junction tree algorithm
- An optimal general-purpose algorithm for exact marginal/MAP queries.
- Simultaneous computation of many queries; efficient data structures.
- Complexity: O(m^w N), where w = size of the largest clique in the (triangulated) graph and m = number of values of each discrete variable in the clique; linear for trees.
- The basis for many approximate algorithms.
- Many popular inference algorithms are special cases of junction trees: the Viterbi algorithm of HMMs, the forward-backward algorithm of Kalman filters.

Junction tree
A junction tree JT of a triangulated graph G with nodes x_1, ..., x_n is a tree where:
- Nodes = cliques of G.
- Edges ensure that if any two nodes contain a variable x_i, then x_i is present in every node on the unique path between them (the running intersection property).
Constructing a junction tree: efficient polynomial-time algorithms exist for creating a JT from a triangulated graph.
1. Enumerate a covering set of cliques.
2. Connect the cliques to get a tree that satisfies the running intersection property.
If the graph is not triangulated, triangulate it first using heuristics; optimal triangulation is NP-hard.

Creating a junction tree from a graphical model
1. Start with the graph.
2. Triangulate the graph.
3. Create clique nodes.
4. Create tree edges so that shared variables stay connected (running intersection property).
5. Assign each potential to exactly one clique node that subsumes it.
(Figure: the five steps illustrated on the graph over x_1, ..., x_5, yielding cliques {x_1, x_2}, {x_2, x_3, x_4}, {x_3, x_4, x_5} with potentials Ψ_1, Ψ_23, Ψ_24, Ψ_45, Ψ_35 assigned to them.)

Finding cliques of a triangulated graph
Theorem. Every triangulated graph has a simplicial vertex, that is, a vertex whose neighbors form a complete set.
Input: graph G; n = number of vertices of G.
for i = 1, ..., n do
  π_i = pick any simplicial vertex in G
  C_i = {π_i} ∪ Ne(π_i)
  remove π_i from G
end for
Return the maximal cliques among C_1, ..., C_n.
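A direct sketch of this elimination procedure, assuming the input graph is already triangulated and represented as an adjacency-set dictionary; the example graph is the triangulated 5-node graph used earlier.

```python
import itertools

def cliques_of_triangulated(adj):
    adj = {v: set(ns) for v, ns in adj.items()}   # work on a copy
    candidates = []
    while adj:
        # a simplicial vertex: its remaining neighbors are pairwise connected
        v = next(v for v, ns in adj.items()
                 if all(b in adj[a] for a, b in itertools.combinations(ns, 2)))
        candidates.append({v} | adj[v])
        for u in adj[v]:                           # remove v from the graph
            adj[u].discard(v)
        del adj[v]
    # keep only the maximal cliques among the candidates
    return [c for c in candidates if not any(c < other for other in candidates)]

# The triangulated 5-node example used earlier: cliques {1,2}, {2,3,4}, {3,4,5}.
adj = {1: {2}, 2: {1, 3, 4}, 3: {2, 4, 5}, 4: {2, 3, 5}, 5: {3, 4}}
print(cliques_of_triangulated(adj))
```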

Connecting cliques to form a junction tree
Separator variables = the intersection of the variables in the two cliques joined by an edge.
Theorem. A clique tree that satisfies the running intersection property maximizes the total number of separator variables.
Input: cliques C_1, ..., C_k.
- Form a complete weighted graph H with the cliques as nodes and edge weights = size of the intersection of the two cliques an edge connects.
- T = maximum-weight spanning tree of H.
- Return T as the junction tree.
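A minimal sketch of this construction using Kruskal's algorithm with a union-find structure; the clique list is the one from the running example.

```python
import itertools

def junction_tree_edges(cliques):
    parent = list(range(len(cliques)))           # union-find for Kruskal
    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i
    # candidate edges sorted by decreasing separator size
    edges = sorted(((len(cliques[i] & cliques[j]), i, j)
                    for i, j in itertools.combinations(range(len(cliques)), 2)),
                   reverse=True)
    tree = []
    for w, i, j in edges:
        ri, rj = find(i), find(j)
        if ri != rj and w > 0:                   # join two different components
            parent[ri] = rj
            tree.append((i, j, w))
    return tree

cliques = [frozenset({1, 2}), frozenset({2, 3, 4}), frozenset({3, 4, 5})]
print(junction_tree_edges(cliques))   # [(1, 2, 2), (0, 1, 1)]
```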

Belief propagation on junction trees
Each node c sends a belief B_{c->c'}(.) to a neighbor c' once it has received beliefs from every other neighbor in N(c) \ {c'}. B_{c->c'}(.) is the belief that clique c has about the distribution of the shared variables s = c ∩ c':
$$B_{c \to c'}(x_s) = \sum_{x_{c \setminus s}} \psi_c(x_c) \prod_{d \in N(c) \setminus \{c'\}} B_{d \to c}(x_{d \cap c})$$
Replace sum with max for MAP queries.
Compute the marginal probability of any variable x_i as:
1. c = a clique in the JT containing x_i.
2. $\Pr(x_i) \propto \sum_{x_c \setminus x_i} \psi_c(x_c) \prod_{d \in N(c)} B_{d \to c}(x_{d \cap c})$

Example
On the junction tree with cliques {1,2}, {2,3,4}, {3,4,5}: the clique potentials are $\psi_{234}(y_{234}) = \psi_{23}(y_{23})\,\psi_{24}(y_{24})$ and $\psi_{345}(y_{345}) = \psi_{35}(y_{35})\,\psi_{45}(y_{45})$, while the clique {1,2} keeps $\psi_{12}(y_{12})$.
1. Clique 12 sends belief $B_{12 \to 234}(y_2) = \sum_{y_1} \psi_{12}(y_{12})$ to its only neighbor.
2. Clique 345 sends belief $B_{345 \to 234}(y_{34}) = \sum_{y_5} \psi_{345}(y_{345})$ to 234.
3. Clique 234 sends belief $B_{234 \to 345}(y_{34}) = \sum_{y_2} \psi_{234}(y_{234})\,B_{12 \to 234}(y_2)$ to 345.
4. Clique 234 sends belief $B_{234 \to 12}(y_2) = \sum_{y_3, y_4} \psi_{234}(y_{234})\,B_{345 \to 234}(y_{34})$ to 12.
Finally, $\Pr(y_1) \propto \sum_{y_2} \psi_{12}(y_{12})\,B_{234 \to 12}(y_2)$.

Why approximate inference
Exact inference is NP-hard. Complexity: O(m^w), where w = tree width = (size of the largest clique in the triangulated graph) - 1, and m = number of values of each discrete variable in the clique.
Many real-life graphs produce large cliques on triangulation:
- An n x n grid has a tree width of n.
- A Kalman filter with K parallel state variables influencing a common observation variable has a tree width of K + 1.

Generalized belief propagation
Approximate the junction tree with a cluster graph where:
1. Nodes = arbitrary clusters, not cliques of the triangulated graph; only ensure that all potentials are subsumed.
2. Separator nodes on edges = a subset of the intersecting variables.
Special case: factor graphs.
(Figure: the starting graph over x_1, ..., x_5, its junction tree, and an example cluster graph built from pairwise clusters.)

Belief propagation in cluster graphs
- The graph can have loops, so the tree-based two-phase method is not applicable.
- Many variants on the scheduling order of propagating beliefs: simple loopy belief propagation, tree-reweighted message passing, residual belief propagation.
- Many have no guarantee of convergence; specific tree-based orders do.
- Works well in practice, the default method of choice.

MCMC (Gibbs) sampling
- Useful when all else fails; guaranteed to converge to the correct answer in the limit of an infinite number of samples.
- Basic premise: it is easy to compute the conditional probability Pr(x_i | fixed values of the remaining variables).
Algorithm:
- Start with some initial assignment, say $x^1 = [x_1, \dots, x_n] = [0, \dots, 0]$.
- For several iterations:
  - For each variable x_i: obtain the next sample $x^{t+1}$ by replacing the value of x_i with a new value sampled according to $\Pr(x_i \mid x_1^t, \dots, x_{i-1}^t, x_{i+1}^t, \dots, x_n^t)$.
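A small Gibbs-sampling sketch for the 3 x 3 grid MRF from the earlier example (node and edge potentials repeated here, with the elided node potentials again filled in as placeholders); it estimates Pr(y_9 = 1) from the samples.

```python
import random

random.seed(0)
node_pot = [(4, 1), (2, 3)] + [(1, 1)] * 7               # psi_k(0), psi_k(1); placeholders after psi_2
edge_pot = {(0, 0): 5, (1, 1): 5, (0, 1): 1, (1, 0): 1}
edges = [(r * 3 + c, r * 3 + c + 1) for r in range(3) for c in range(2)] + \
        [(r * 3 + c, (r + 1) * 3 + c) for r in range(2) for c in range(3)]
neighbors = {k: [j for i, j in edges if i == k] + [i for i, j in edges if j == k] for k in range(9)}

def prob_one(y, k):
    """Pr(y_k = 1 | all other pixels), computed from the local potentials only."""
    score = []
    for v in (0, 1):
        s = node_pot[k][v]
        for j in neighbors[k]:
            s *= edge_pot[(v, y[j])]
        score.append(s)
    return score[1] / (score[0] + score[1])

y, count, iters = [0] * 9, 0, 5000
for t in range(iters):
    for k in range(9):                       # one Gibbs sweep: resample every pixel
        y[k] = 1 if random.random() < prob_one(y, k) else 0
    count += y[8]                            # pixel y_9 (0-indexed as 8)
print(count / iters)                         # Monte Carlo estimate of Pr(y_9 = 1)
```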

Other approximate inference methods
- Combinatorial algorithms for MAP.
- Greedy algorithms: relaxation labeling.
- Variational methods such as mean-field and structured mean-field.
- LP- and QP-based approaches.

Graph structure
1. Manual: designed by a domain expert.
   - Used in applications where the dependency structure is well understood.
   - Examples: QMR systems, Kalman filters, vision (grids), HMMs for speech recognition and IE.
2. Learned from examples.
   - NP-hard to find the optimal structure.
   - Widely researched, mostly posed as a branch-and-bound search problem.
   - Useful in dynamic situations.

Parameters in potentials
1. Manual: provided by a domain expert.
   - Used for infrequently constructed graphs, e.g. QMR systems.
   - Also where the potentials are an easy function of the attributes of the connected nodes, e.g. vision networks.
2. Learned from examples.
   - More popular, since it is difficult for humans to assign numeric values.
   - Many variants of parameterizing the potentials:
     1. Table potentials: each entry is a parameter, e.g. HMMs.
     2. Potentials as a combination of shared parameters and data attributes, e.g. CRFs.

Learning potentials
Given a sample D = {x^1, ..., x^N} of data generated from a distribution P(x) represented by a graphical model with known structure G, learn the potentials ψ_C(x_C). Two dimensions:
1. All variables observed or not.
   1. Fully observed: each training sample x^i has all n variables observed.
   2. Partially observed: only a subset of the variables is observed.
2. Potentials coupled with a log-partition function or not.
   1. No: closed-form solutions.
   2. Yes: potentials attached to arbitrary overlapping subsets of variables in a UDGM, e.g. the edge potentials in a grid graph; iterative solution, as in the case of learning with shared parameters. Discussed later.

Easy case: fully observed, decoupled table potentials
1. Potentials in a Bayesian network: $P(x) = \prod_i \Pr(x_i \mid \mathrm{Pa}(x_i))$.
2. Potentials attached to the maximal cliques of a UDGM: $\Pr(x) = \frac{\prod_{C \in \mathrm{Cliques}} \Pr(x_C)}{\prod_{S \in \mathrm{Separators}} \Pr(x_S)}$.
Maximum likelihood estimation, with constraints on the potentials to make them behave like probabilities:
$$\Pr(x_C) = \frac{\sum_{i=1}^{N} [[x^i_C = x_C]]}{N}, \qquad \Pr(x_j \mid \mathrm{pa}(x_j)) = \frac{\sum_{i=1}^{N} [[x^i_j = x_j,\ x^i_{\mathrm{Pa}(j)} = \mathrm{pa}(x_j)]]}{\sum_{i=1}^{N} [[x^i_{\mathrm{Pa}(j)} = \mathrm{pa}(x_j)]]}$$
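These estimates are just normalized counts. A small sketch for one CPT, Pr(Income | Degree, Experience), on a made-up fully observed dataset:

```python
from collections import Counter

# Tiny, fully observed, made-up samples: (degree, experience, income) per person.
samples = [
    ("PhD", "high", "high"), ("PhD", "high", "high"), ("PhD", "low", "low"),
    ("Bachelors", "low", "low"), ("Bachelors", "high", "high"),
    ("Bachelors", "low", "low"), ("Masters", "high", "high"),
]

joint_counts = Counter(((d, e), inc) for d, e, inc in samples)   # counts of (pa(x_j), x_j)
parent_counts = Counter((d, e) for d, e, _ in samples)           # counts of pa(x_j)

def cpt(income, degree, experience):
    """Empirical Pr(Income = income | Degree = degree, Experience = experience)."""
    pa = (degree, experience)
    return joint_counts[(pa, income)] / parent_counts[pa]

print(cpt("high", "PhD", "high"))   # 2/2 = 1.0 on this toy data
```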

The framework
Conditional distribution Pr(y | x, θ); the potentials are functions of x and of the parameters θ to be learned. y = y_1, ..., y_n forms a graphical model, directed or undirected.
Undirected:
$$\Pr(y_1, \dots, y_n \mid x, \theta) = \frac{\prod_c \psi_c(y_c, x, \theta)}{Z_\theta(x)} = \frac{1}{Z_\theta(x)} \exp\Big(\sum_c F_\theta(y_c, c, x)\Big)$$
where $Z_\theta(x) = \sum_y \exp\big(\sum_c F_\theta(y_c, c, x)\big)$ and the clique potential is $\psi_c(y_c, x) = \exp(F_\theta(y_c, c, x))$.

Local conditional probability for a BN
$$\Pr(y_1, \dots, y_n \mid x, \theta) = \prod_j \Pr(y_j \mid y_{\mathrm{Pa}(j)}, x, \theta) = \prod_j \frac{\exp\big(F_\theta(y_{\mathrm{Pa}(j)}, y_j, j, x)\big)}{\sum_{y'=1}^{m} \exp\big(F_\theta(y_{\mathrm{Pa}(j)}, y', j, x)\big)}$$

Forms of F_θ(y_c, c, x)
- A log-linear model over user-defined features, e.g. CRFs, Maxent models. Let K be the number of features and denote a feature by f_k(y_c, c, x). Then
$$F_\theta(y_c, c, x) = \sum_{k=1}^{K} \theta_k f_k(y_c, c, x)$$
- An arbitrary function, e.g. a neural network that takes y_c, c, x as input and transforms them, possibly non-linearly, into a real value; θ are the parameters of the network.

Example: Named Entity Recognition (figure).

Named Entity Recognition: features (figure).

Training
Given N input-output pairs D = {(x^1, y^1), (x^2, y^2), ..., (x^N, y^N)} and the form of F_θ, learn the parameters θ by maximum likelihood:
$$\max_\theta LL(\theta, D) = \max_\theta \sum_{i=1}^{N} \log \Pr(y^i \mid x^i, \theta)$$

Training for a BN
$$\begin{aligned}
LL(\theta, D) &= \sum_{i=1}^{N} \log \Pr(y^i \mid x^i, \theta) = \sum_{i=1}^{N} \log \prod_j \Pr(y^i_j \mid y^i_{\mathrm{Pa}(j)}, x^i, \theta) \\
&= \sum_i \sum_j \log \Pr(y^i_j \mid y^i_{\mathrm{Pa}(j)}, x^i, \theta) \\
&= \sum_i \sum_j \Big[ F_\theta(y^i_{\mathrm{Pa}(j)}, y^i_j, j, x^i) - \log \sum_{y'=1}^{m} \exp\big(F_\theta(y^i_{\mathrm{Pa}(j)}, y', j, x^i)\big) \Big]
\end{aligned}$$
This is like a normal classification task: the graphical model poses no extra challenge during training, and the normalizer is easy to compute.

Training an undirected graphical model
$$LL(\theta, D) = \sum_{i=1}^{N} \log \Pr(y^i \mid x^i, \theta) = \sum_{i=1}^{N} \log \frac{1}{Z_\theta(x^i)} \exp\Big(\sum_c F_\theta(y^i_c, c, x^i)\Big) = \sum_i \Big[ \sum_c F_\theta(y^i_c, c, x^i) - \log Z_\theta(x^i) \Big]$$
The first part is easy to compute, but the second term requires invoking an inference algorithm to compute Z_θ(x^i) for each i. Computing the gradient of this objective with respect to θ also requires inference.

Training via gradient descent
Assume a log-linear model as in CRFs, where $F_\theta(y^i_c, c, x^i) = \theta \cdot f(x^i, y^i_c, c)$. For brevity write $f(x^i, y^i) = \sum_c f(x^i, y^i_c, c)$. Then
$$LL(\theta) = \sum_i \log \Pr(y^i \mid x^i, \theta) = \sum_i \big(\theta \cdot f(x^i, y^i) - \log Z_\theta(x^i)\big)$$
Add a regularizer to prevent over-fitting:
$$\max_\theta \sum_i \big(\theta \cdot f(x^i, y^i) - \log Z_\theta(x^i)\big) - \|\theta\|^2 / C$$
The objective is concave in θ, so gradient-based methods will work.

Gradient of the training objective
$$\begin{aligned}
\nabla L(\theta) &= \sum_i \Big[ f(x^i, y^i) - \sum_{y'} f(x^i, y') \frac{\exp\big(\theta \cdot f(x^i, y')\big)}{Z_\theta(x^i)} \Big] - 2\theta/C \\
&= \sum_i \Big[ f(x^i, y^i) - \sum_{y'} f(x^i, y') \Pr(y' \mid \theta, x^i) \Big] - 2\theta/C \\
&= \sum_i \Big[ f(x^i, y^i) - E_{\Pr(y' \mid \theta, x^i)} f(x^i, y') \Big] - 2\theta/C
\end{aligned}$$
where
$$E_{\Pr(y' \mid \theta, x^i)} f_k(x^i, y') = \sum_{y'} f_k(x^i, y') \Pr(y' \mid \theta, x^i) = \sum_{y'} \sum_c f_k(x^i, y'_c, c) \Pr(y' \mid \theta, x^i) = \sum_c \sum_{y'_c} f_k(x^i, y'_c, c) \Pr(y'_c \mid \theta, x^i)$$

Computing $E_{\Pr(y \mid \theta^t, x^i)} f_k(x^i, y)$
Three steps:
1. $\Pr(y \mid \theta^t, x^i)$ is represented as an undirected model whose nodes are the components of y, that is y_1, ..., y_n. The potential $\psi_c(y_c, x, \theta)$ on clique c is $\exp(\theta^t \cdot f(x^i, y_c, c))$.
2. Run a sum-product inference algorithm on this UGM and compute, for each c and y_c, the marginal probability $\mu(y_c, c, x^i)$.
3. Using these µ's, compute $E_{\Pr(y \mid \theta^t, x^i)} f_k(x^i, y) = \sum_c \sum_{y_c} \mu(y_c, c, x^i) f_k(x^i, c, y_c)$.

Training algorithm
1: Initialize θ^0 = 0
2: for t = 1 ... T do
3:   for i = 1 ... N do
4:     g_{k,i} = f_k(x^i, y^i) - E_{Pr(y | θ^t, x^i)} f_k(x^i, y)   for k = 1 ... K
5:   end for
6:   g_k = Σ_i g_{k,i}   for k = 1 ... K
7:   θ_k^t = θ_k^{t-1} + γ_t (g_k - 2 θ_k^{t-1} / C)
8:   Exit if g is zero
9: end for
The running time of the algorithm is O(I N n (m^2 + K)), where I is the total number of iterations.
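A compact sketch of this loop for a tiny chain CRF with indicator features; all data and dimensions are made up, and the expectations are computed by brute-force enumeration over label sequences instead of sum-product, which is feasible only at this toy size.

```python
import itertools
import numpy as np

L, A = 2, 2                                   # labels per position, attributes per position

def features(x, y):
    """f(x, y): node (label x attribute) and transition indicator features summed over the chain."""
    f = np.zeros(L * A + L * L)
    for i, yi in enumerate(y):
        for a in range(A):
            f[yi * A + a] += x[i][a]
    for i in range(len(y) - 1):
        f[L * A + y[i] * L + y[i + 1]] += 1
    return f

def expected_features(x, theta):
    """E_{Pr(y | theta, x)} f(x, y), by enumerating all label sequences."""
    ys = list(itertools.product(range(L), repeat=len(x)))
    scores = np.array([theta @ features(x, y) for y in ys])
    p = np.exp(scores - scores.max())
    p /= p.sum()
    return sum(pi * features(x, y) for pi, y in zip(p, ys))

# made-up training data: x is a list of per-position attribute vectors, y the label sequence
data = [([np.array([1.0, 0.0]), np.array([0.0, 1.0]), np.array([1.0, 0.0])], (0, 1, 0)),
        ([np.array([0.0, 1.0]), np.array([0.0, 1.0]), np.array([1.0, 0.0])], (1, 1, 0))]

theta, C, gamma = np.zeros(L * A + L * L), 10.0, 0.5
for t in range(200):                          # gradient ascent on the regularized log-likelihood
    g = sum(features(x, y) - expected_features(x, theta) for x, y in data)
    theta += gamma * (g - 2 * theta / C)
print(theta.round(2))
```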

Partially observed, decoupled potentials: the EM algorithm
Input: graph G; data D with an observed subset of variables x and hidden variables z.
Initially (t = 0): assign random values to the parameters Pr(x_j | pa(x_j)).
for t = 1, ..., T do
  E-step:
    for i = 1, ..., N do
      Use inference in G to estimate the conditionals $\Pr_i(z_c \mid x^i)^t$ for all variable subsets (j, Pa(j)) involving any hidden variable.
    end for
  M-step:
    $$\Pr(x_j \mid \mathrm{pa}(x_j) = z_c)^t = \frac{\sum_{i=1}^{N} \Pr_i(z_c \mid x^i)\, [[x^i_j = x_j]]}{\sum_{i=1}^{N} \Pr_i(z_c \mid x^i)}$$
end for

More on graphical models
- D. Koller and N. Friedman. Probabilistic Graphical Models: Principles and Techniques. MIT Press.
- Wainwright's article in Foundations and Trends in Machine Learning.
- Kevin Murphy's brief online introduction to graphical models.
- M. I. Jordan. Graphical models. Statistical Science (Special Issue on Bayesian Statistics), 19.
Other textbooks:
- R. G. Cowell, A. P. Dawid, S. L. Lauritzen and D. J. Spiegelhalter. Probabilistic Networks and Expert Systems. Springer-Verlag.
- J. Pearl. Probabilistic Reasoning in Intelligent Systems: Networks of Plausible Inference. Morgan Kaufmann.
- S. L. Lauritzen. Graphical Models. Oxford Science Publications.
- F. V. Jensen. Bayesian Networks and Decision Graphs. Springer.


More information

Statistical Approaches to Learning and Discovery

Statistical Approaches to Learning and Discovery Statistical Approaches to Learning and Discovery Graphical Models Zoubin Ghahramani & Teddy Seidenfeld zoubin@cs.cmu.edu & teddy@stat.cmu.edu CALD / CS / Statistics / Philosophy Carnegie Mellon University

More information

Bayesian Learning in Undirected Graphical Models

Bayesian Learning in Undirected Graphical Models Bayesian Learning in Undirected Graphical Models Zoubin Ghahramani Gatsby Computational Neuroscience Unit University College London, UK http://www.gatsby.ucl.ac.uk/ Work with: Iain Murray and Hyun-Chul

More information

Linear Dynamical Systems

Linear Dynamical Systems Linear Dynamical Systems Sargur N. srihari@cedar.buffalo.edu Machine Learning Course: http://www.cedar.buffalo.edu/~srihari/cse574/index.html Two Models Described by Same Graph Latent variables Observations

More information

Graphical Models and Independence Models

Graphical Models and Independence Models Graphical Models and Independence Models Yunshu Liu ASPITRG Research Group 2014-03-04 References: [1]. Steffen Lauritzen, Graphical Models, Oxford University Press, 1996 [2]. Christopher M. Bishop, Pattern

More information

Notes on Markov Networks

Notes on Markov Networks Notes on Markov Networks Lili Mou moull12@sei.pku.edu.cn December, 2014 This note covers basic topics in Markov networks. We mainly talk about the formal definition, Gibbs sampling for inference, and maximum

More information

Massachusetts Institute of Technology Department of Electrical Engineering and Computer Science Algorithms For Inference Fall 2014

Massachusetts Institute of Technology Department of Electrical Engineering and Computer Science Algorithms For Inference Fall 2014 Massachusetts Institute of Technology Department of Electrical Engineering and Computer Science 6.438 Algorithms For Inference Fall 2014 Problem Set 3 Issued: Thursday, September 25, 2014 Due: Thursday,

More information

Probabilistic Graphical Models for Image Analysis - Lecture 1

Probabilistic Graphical Models for Image Analysis - Lecture 1 Probabilistic Graphical Models for Image Analysis - Lecture 1 Alexey Gronskiy, Stefan Bauer 21 September 2018 Max Planck ETH Center for Learning Systems Overview 1. Motivation - Why Graphical Models 2.

More information

Alternative Parameterizations of Markov Networks. Sargur Srihari

Alternative Parameterizations of Markov Networks. Sargur Srihari Alternative Parameterizations of Markov Networks Sargur srihari@cedar.buffalo.edu 1 Topics Three types of parameterization 1. Gibbs Parameterization 2. Factor Graphs 3. Log-linear Models with Energy functions

More information

Random Field Models for Applications in Computer Vision

Random Field Models for Applications in Computer Vision Random Field Models for Applications in Computer Vision Nazre Batool Post-doctorate Fellow, Team AYIN, INRIA Sophia Antipolis Outline Graphical Models Generative vs. Discriminative Classifiers Markov Random

More information

Message Passing and Junction Tree Algorithms. Kayhan Batmanghelich

Message Passing and Junction Tree Algorithms. Kayhan Batmanghelich Message Passing and Junction Tree Algorithms Kayhan Batmanghelich 1 Review 2 Review 3 Great Ideas in ML: Message Passing Each soldier receives reports from all branches of tree 3 here 7 here 1 of me 11

More information

Brief Introduction of Machine Learning Techniques for Content Analysis

Brief Introduction of Machine Learning Techniques for Content Analysis 1 Brief Introduction of Machine Learning Techniques for Content Analysis Wei-Ta Chu 2008/11/20 Outline 2 Overview Gaussian Mixture Model (GMM) Hidden Markov Model (HMM) Support Vector Machine (SVM) Overview

More information

Machine Learning for Data Science (CS4786) Lecture 19

Machine Learning for Data Science (CS4786) Lecture 19 Machine Learning for Data Science (CS4786) Lecture 19 Hidden Markov Models Course Webpage : http://www.cs.cornell.edu/courses/cs4786/2017fa/ Quiz Quiz Two variables can be marginally independent but not

More information

Bayesian Networks: Representation, Variable Elimination

Bayesian Networks: Representation, Variable Elimination Bayesian Networks: Representation, Variable Elimination CS 6375: Machine Learning Class Notes Instructor: Vibhav Gogate The University of Texas at Dallas We can view a Bayesian network as a compact representation

More information

12 : Variational Inference I

12 : Variational Inference I 10-708: Probabilistic Graphical Models, Spring 2015 12 : Variational Inference I Lecturer: Eric P. Xing Scribes: Fattaneh Jabbari, Eric Lei, Evan Shapiro 1 Introduction Probabilistic inference is one of

More information

Part I. C. M. Bishop PATTERN RECOGNITION AND MACHINE LEARNING CHAPTER 8: GRAPHICAL MODELS

Part I. C. M. Bishop PATTERN RECOGNITION AND MACHINE LEARNING CHAPTER 8: GRAPHICAL MODELS Part I C. M. Bishop PATTERN RECOGNITION AND MACHINE LEARNING CHAPTER 8: GRAPHICAL MODELS Probabilistic Graphical Models Graphical representation of a probabilistic model Each variable corresponds to a

More information

Probabilistic Graphical Networks: Definitions and Basic Results

Probabilistic Graphical Networks: Definitions and Basic Results This document gives a cursory overview of Probabilistic Graphical Networks. The material has been gleaned from different sources. I make no claim to original authorship of this material. Bayesian Graphical

More information

CPSC 540: Machine Learning

CPSC 540: Machine Learning CPSC 540: Machine Learning Undirected Graphical Models Mark Schmidt University of British Columbia Winter 2016 Admin Assignment 3: 2 late days to hand it in today, Thursday is final day. Assignment 4:

More information

University of Washington Department of Electrical Engineering EE512 Spring, 2006 Graphical Models

University of Washington Department of Electrical Engineering EE512 Spring, 2006 Graphical Models University of Washington Department of Electrical Engineering EE512 Spring, 2006 Graphical Models Jeff A. Bilmes Lecture 1 Slides March 28 th, 2006 Lec 1: March 28th, 2006 EE512

More information

The Ising model and Markov chain Monte Carlo

The Ising model and Markov chain Monte Carlo The Ising model and Markov chain Monte Carlo Ramesh Sridharan These notes give a short description of the Ising model for images and an introduction to Metropolis-Hastings and Gibbs Markov Chain Monte

More information

Pattern Recognition and Machine Learning

Pattern Recognition and Machine Learning Christopher M. Bishop Pattern Recognition and Machine Learning ÖSpri inger Contents Preface Mathematical notation Contents vii xi xiii 1 Introduction 1 1.1 Example: Polynomial Curve Fitting 4 1.2 Probability

More information

Outline. Spring It Introduction Representation. Markov Random Field. Conclusion. Conditional Independence Inference: Variable elimination

Outline. Spring It Introduction Representation. Markov Random Field. Conclusion. Conditional Independence Inference: Variable elimination Probabilistic Graphical Models COMP 790-90 Seminar Spring 2011 The UNIVERSITY of NORTH CAROLINA at CHAPEL HILL Outline It Introduction ti Representation Bayesian network Conditional Independence Inference:

More information

Intelligent Systems (AI-2)

Intelligent Systems (AI-2) Intelligent Systems (AI-2) Computer Science cpsc422, Lecture 18 Oct, 21, 2015 Slide Sources Raymond J. Mooney University of Texas at Austin D. Koller, Stanford CS - Probabilistic Graphical Models CPSC

More information

Alternative Parameterizations of Markov Networks. Sargur Srihari

Alternative Parameterizations of Markov Networks. Sargur Srihari Alternative Parameterizations of Markov Networks Sargur srihari@cedar.buffalo.edu 1 Topics Three types of parameterization 1. Gibbs Parameterization 2. Factor Graphs 3. Log-linear Models Features (Ising,

More information

Clique trees & Belief Propagation. Siamak Ravanbakhsh Winter 2018

Clique trees & Belief Propagation. Siamak Ravanbakhsh Winter 2018 Graphical Models Clique trees & Belief Propagation Siamak Ravanbakhsh Winter 2018 Learning objectives message passing on clique trees its relation to variable elimination two different forms of belief

More information

Variational Inference. Sargur Srihari

Variational Inference. Sargur Srihari Variational Inference Sargur srihari@cedar.buffalo.edu 1 Plan of discussion We first describe inference with PGMs and the intractability of exact inference Then give a taxonomy of inference algorithms

More information

Graphical Models. Outline. HMM in short. HMMs. What about continuous HMMs? p(o t q t ) ML 701. Anna Goldenberg ... t=1. !

Graphical Models. Outline. HMM in short. HMMs. What about continuous HMMs? p(o t q t ) ML 701. Anna Goldenberg ... t=1. ! Outline Graphical Models ML 701 nna Goldenberg! ynamic Models! Gaussian Linear Models! Kalman Filter! N! Undirected Models! Unification! Summary HMMs HMM in short! is a ayes Net hidden states! satisfies

More information

Probabilistic and Bayesian Machine Learning

Probabilistic and Bayesian Machine Learning Probabilistic and Bayesian Machine Learning Day 4: Expectation and Belief Propagation Yee Whye Teh ywteh@gatsby.ucl.ac.uk Gatsby Computational Neuroscience Unit University College London http://www.gatsby.ucl.ac.uk/

More information

13: Variational inference II

13: Variational inference II 10-708: Probabilistic Graphical Models, Spring 2015 13: Variational inference II Lecturer: Eric P. Xing Scribes: Ronghuo Zheng, Zhiting Hu, Yuntian Deng 1 Introduction We started to talk about variational

More information

Chapter 16. Structured Probabilistic Models for Deep Learning

Chapter 16. Structured Probabilistic Models for Deep Learning Peng et al.: Deep Learning and Practice 1 Chapter 16 Structured Probabilistic Models for Deep Learning Peng et al.: Deep Learning and Practice 2 Structured Probabilistic Models way of using graphs to describe

More information

Probabilistic Graphical Models

Probabilistic Graphical Models Probabilistic Graphical Models Lecture 9: Variational Inference Relaxations Volkan Cevher, Matthias Seeger Ecole Polytechnique Fédérale de Lausanne 24/10/2011 (EPFL) Graphical Models 24/10/2011 1 / 15

More information

Probabilistic Graphical Models

Probabilistic Graphical Models 2016 Robert Nowak Probabilistic Graphical Models 1 Introduction We have focused mainly on linear models for signals, in particular the subspace model x = Uθ, where U is a n k matrix and θ R k is a vector

More information

Lecture 8: PGM Inference

Lecture 8: PGM Inference 15 September 2014 Intro. to Stats. Machine Learning COMP SCI 4401/7401 Table of Contents I 1 Variable elimination Max-product Sum-product 2 LP Relaxations QP Relaxations 3 Marginal and MAP X1 X2 X3 X4

More information

Introduction to Graphical Models. Srikumar Ramalingam School of Computing University of Utah

Introduction to Graphical Models. Srikumar Ramalingam School of Computing University of Utah Introduction to Graphical Models Srikumar Ramalingam School of Computing University of Utah Reference Christopher M. Bishop, Pattern Recognition and Machine Learning, Jonathan S. Yedidia, William T. Freeman,

More information

Lecture 6: Graphical Models: Learning

Lecture 6: Graphical Models: Learning Lecture 6: Graphical Models: Learning 4F13: Machine Learning Zoubin Ghahramani and Carl Edward Rasmussen Department of Engineering, University of Cambridge February 3rd, 2010 Ghahramani & Rasmussen (CUED)

More information

Chapter 8 Cluster Graph & Belief Propagation. Probabilistic Graphical Models 2016 Fall

Chapter 8 Cluster Graph & Belief Propagation. Probabilistic Graphical Models 2016 Fall Chapter 8 Cluster Graph & elief ropagation robabilistic Graphical Models 2016 Fall Outlines Variable Elimination 消元法 imple case: linear chain ayesian networks VE in complex graphs Inferences in HMMs and

More information

The Origin of Deep Learning. Lili Mou Jan, 2015

The Origin of Deep Learning. Lili Mou Jan, 2015 The Origin of Deep Learning Lili Mou Jan, 2015 Acknowledgment Most of the materials come from G. E. Hinton s online course. Outline Introduction Preliminary Boltzmann Machines and RBMs Deep Belief Nets

More information

Applying Bayesian networks in the game of Minesweeper

Applying Bayesian networks in the game of Minesweeper Applying Bayesian networks in the game of Minesweeper Marta Vomlelová Faculty of Mathematics and Physics Charles University in Prague http://kti.mff.cuni.cz/~marta/ Jiří Vomlel Institute of Information

More information

Logistic Regression: Online, Lazy, Kernelized, Sequential, etc.

Logistic Regression: Online, Lazy, Kernelized, Sequential, etc. Logistic Regression: Online, Lazy, Kernelized, Sequential, etc. Harsha Veeramachaneni Thomson Reuter Research and Development April 1, 2010 Harsha Veeramachaneni (TR R&D) Logistic Regression April 1, 2010

More information

Variable Elimination (VE) Barak Sternberg

Variable Elimination (VE) Barak Sternberg Variable Elimination (VE) Barak Sternberg Basic Ideas in VE Example 1: Let G be a Chain Bayesian Graph: X 1 X 2 X n 1 X n How would one compute P X n = k? Using the CPDs: P X 2 = x = x Val X1 P X 1 = x

More information

Cheng Soon Ong & Christian Walder. Canberra February June 2018

Cheng Soon Ong & Christian Walder. Canberra February June 2018 Cheng Soon Ong & Christian Walder Research Group and College of Engineering and Computer Science Canberra February June 2018 Outlines Overview Introduction Linear Algebra Probability Linear Regression

More information

Chris Bishop s PRML Ch. 8: Graphical Models

Chris Bishop s PRML Ch. 8: Graphical Models Chris Bishop s PRML Ch. 8: Graphical Models January 24, 2008 Introduction Visualize the structure of a probabilistic model Design and motivate new models Insights into the model s properties, in particular

More information