Introduction to Probabilistic Graphical Models
Kyu-Baek Hwang and Byoung-Tak Zhang, Biointelligence Lab, School of Computer Science and Engineering, Seoul National University, Seoul 151-742, Korea. E-mail: kbhwang@bi.snu.ac.kr, btzhang@bi.snu.ac.kr. http://bi.snu.ac.kr
Overview I
Graphical models are a marriage between probability theory and graph theory. They provide a natural tool for dealing with two problems that occur throughout applied mathematics and engineering -- uncertainty and complexity -- and in particular they are playing an increasingly important role in the design and analysis of machine learning algorithms. Fundamental to the idea of a graphical model is the notion of modularity: a complex system is built by combining simpler parts. Probability theory provides the glue whereby the parts are combined, ensuring that the system as a whole is consistent and providing ways to interface models to data.
(c) 2004 SNU CSE Biointelligence Lab 2
Overview II
The graph-theoretic side of graphical models provides both an intuitively appealing interface by which humans can model highly interacting sets of variables and a data structure that lends itself naturally to the design of efficient general-purpose algorithms. Many of the classical multivariate probabilistic systems studied in fields such as statistics, systems engineering, information theory, pattern recognition, and statistical mechanics are special cases of the general graphical model formalism -- examples include mixture models, factor analysis, hidden Markov models, Kalman filters, and Ising models.
Overview III
The graphical model framework provides a way to view all of these systems as instances of a common underlying formalism. This view has many advantages -- in particular, specialized techniques that have been developed in one field can be transferred between research communities and exploited more widely. Moreover, the graphical model formalism provides a natural framework for the design of new systems. --- Michael Jordan, 1998.
Three Points of View
Representation: probabilistic graphical models (PGMs) represent the probabilistic relationships among a set of random variables -- for example, the relationship between symptoms and diseases.
Inference: given a PGM, one can calculate a conditional probability of interest.
Learning: given a set of examples (data), one can find a plausible PGM describing the underlying process.
Two Main Classes of PGMs
Undirected models: edges have no direction. Markov random fields, Markov networks, etc. Used in, e.g., image analysis.
Directed models: edges have a direction. Bayesian networks, belief networks, etc. Intuitive; suited to causal analysis.
Contents
Causal networks
Bayesian networks
Inference in Bayesian networks
Learning Bayesian networks from data
Applications
Concluding remarks
Bibliography
Causal Networks
Node: an event. Arc: a causal relationship between two nodes. A → B: A causes B.
Causal network for the car start problem [Jensen 01]. (Figure: nodes Fuel, Clean Spark Plugs, Fuel Meter Standing, and Start, with arcs Fuel → Fuel Meter Standing, Fuel → Start, and Clean Spark Plugs → Start.)
Reasoning with Causal Networks
1. My car does not start. This increases the certainty of no fuel and of dirty spark plugs, and increases the certainty of the fuel meter standing at 'empty'.
2. The fuel meter stands at 'half'. This decreases the certainty of no fuel and increases the certainty of dirty spark plugs.
d-separation: the Set of Rules for Reasoning
Connections in causal networks: serial, diverging, and converging.
Definition [Jensen 01]: Two nodes in a causal network are d-separated if, for all paths between them, there is an intermediate node V such that either the connection at V is serial or diverging and the state of V is known, or the connection at V is converging and neither V nor any of V's descendants has received evidence.
If A and B are d-separated, then changes in the certainty of A have no impact on the certainty of B, and vice versa.
d-separation in the Car Start Problem
1. Start and Fuel are dependent on each other.
2. Start and Clean Spark Plugs are dependent on each other.
3. Fuel and Fuel Meter Standing are dependent on each other.
4. Fuel and Clean Spark Plugs are marginally independent (d-separated), but become conditionally dependent on each other given the value of Start.
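The blocking rule above can be checked mechanically. The following is a minimal sketch (not from the slides): it enumerates the undirected paths in the car-start network and applies the serial/diverging/converging test to each intermediate node. The node names are my own encoding of the figure.

```python
# Car-start DAG, encoded as a parent dictionary (my own encoding).
parents = {
    "Start": ["Fuel", "CleanSparkPlugs"],
    "FuelMeterStanding": ["Fuel"],
    "Fuel": [],
    "CleanSparkPlugs": [],
}

def children(node):
    return [c for c, ps in parents.items() if node in ps]

def descendants(node):
    result, stack = set(), [node]
    while stack:
        for c in children(stack.pop()):
            if c not in result:
                result.add(c)
                stack.append(c)
    return result

def all_paths(a, b, visited=()):
    """Enumerate undirected paths from a to b without repeated nodes."""
    if a == b:
        yield visited + (a,)
        return
    for nxt in set(parents[a]) | set(children(a)):
        if nxt not in visited and nxt != a:
            yield from all_paths(nxt, b, visited + (a,))

def path_blocked(path, evidence):
    """Blocked if some intermediate V is serial/diverging and observed,
    or converging with neither V nor its descendants observed."""
    for i in range(1, len(path) - 1):
        prev, v, nxt = path[i - 1], path[i], path[i + 1]
        converging = prev in parents[v] and nxt in parents[v]
        if converging:
            if v not in evidence and not (descendants(v) & set(evidence)):
                return True
        elif v in evidence:
            return True
    return False

def d_separated(a, b, evidence=frozenset()):
    return all(path_blocked(p, evidence) for p in all_paths(a, b))
```

With this sketch, `d_separated("Fuel", "CleanSparkPlugs")` holds with no evidence, but fails once `Start` is observed, matching item 4 above.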
Probability for the Certainty in Causal Networks
Basic axioms:
P(A) = 1 iff A is certain.
Σ_A P(A) = 1 (the summation is taken over all possible values of A).
P(A ∪ B) = P(A) + P(B) if A and B are mutually exclusive.
Conditional probability:
P(A | B) = P(A, B) / P(B) = P(B | A)P(A) / P(B).
Each event in the causal network becomes a random variable. If A and B are d-separated, then P(A | B) = P(A): A and B are independent; when the separation requires evidence on an intermediate node C, A and B are conditionally independent given the value of C.
Definition: Bayesian Networks
A Bayesian network consists of the following:
A set of n variables X = {X_1, X_2, ..., X_n} and a set of directed edges between variables. The variables (nodes) and the directed edges form a directed acyclic graph (DAG); directed cycles are not allowed.
To each variable X_i with parents Pa(X_i) is attached a conditional probability table for P(X_i | Pa(X_i)).
Continuous variables can also be modeled.
Bayesian Network Represents the Joint Probability Distribution
By the d-separation property, a Bayesian network over the n variables X = {X_1, X_2, ..., X_n} represents P(X) as follows:
P(X_1, X_2, ..., X_n) = Π_{i=1}^{n} P(X_i | Pa(X_i)).
Given the joint probability distribution, any conditional probability can be calculated in principle.
Bayesian Network for the Car Start Problem
P(Fu = Yes) = 0.98, P(CSP = Yes) = 0.96.

P(FMS | Fu):
              Fu = Yes   Fu = No
FMS = Full    0.39       0.001
FMS = Half    0.60       0.001
FMS = Empty   0.01       0.998

P(St | Fu, CSP), given as (St = Yes, St = No):
             Fu = Yes       Fu = No
CSP = Yes    (0.99, 0.01)   (0, 1)
CSP = No     (0.01, 0.99)   (0, 1)
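The chain-rule factorization can be exercised directly on these tables. Below is a sketch encoding the CPTs as dictionaries (the variable names and the reading of the P(St | Fu, CSP) pairs as (St = Yes, St = No) follow the tables above; the dictionary layout is my own):

```python
p_fu = {"Yes": 0.98, "No": 0.02}          # P(Fu)
p_csp = {"Yes": 0.96, "No": 0.04}         # P(CSP)
p_fms_given_fu = {                        # P(FMS | Fu), key (FMS, Fu)
    ("Full", "Yes"): 0.39,  ("Full", "No"): 0.001,
    ("Half", "Yes"): 0.60,  ("Half", "No"): 0.001,
    ("Empty", "Yes"): 0.01, ("Empty", "No"): 0.998,
}
p_st_given_fu_csp = {                     # P(St | Fu, CSP), key (St, Fu, CSP)
    ("Yes", "Yes", "Yes"): 0.99, ("No", "Yes", "Yes"): 0.01,
    ("Yes", "No", "Yes"): 0.0,   ("No", "No", "Yes"): 1.0,
    ("Yes", "Yes", "No"): 0.01,  ("No", "Yes", "No"): 0.99,
    ("Yes", "No", "No"): 0.0,    ("No", "No", "No"): 1.0,
}

def joint(fu, csp, fms, st):
    """P(Fu, CSP, FMS, St) as the product of local probabilities."""
    return (p_fu[fu] * p_csp[csp]
            * p_fms_given_fu[(fms, fu)]
            * p_st_given_fu_csp[(st, fu, csp)])
```

Summing `joint` over all 24 assignments yields 1, as the factorization requires.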
The Car Start Problem Revisited
1. No start: P(St = No) = 1 (evidence 1). Update the conditional probabilities P(Fu | St = No), P(CSP | St = No), and P(FMS | St = No).
2. The fuel meter stands at 'half': P(FMS = Half) = 1 (evidence 2). Update the conditional probabilities P(Fu | St = No, FMS = Half) and P(CSP | St = No, FMS = Half).
Calculation of the Conditional Probabilities
P(CSP | St, FMS) is calculated as follows:
P(CSP | St, FMS) = P(CSP, St, FMS) / P(St, FMS) = Σ_Fu P(Fu, CSP, St, FMS) / Σ_{Fu, CSP} P(Fu, CSP, St, FMS).
The summations in the above equation are taken over all possible values of the variables. In general, calculating a conditional probability by this kind of brute-force marginalization is intractable.
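For this four-variable network, the brute-force marginalization is still easy to carry out. The sketch below (my own code, with the CPT numbers copied from the car-start tables) computes P(CSP | St = No, FMS = Half) by summing out Fu in the numerator and both Fu and CSP in the denominator:

```python
p_fu = {"Yes": 0.98, "No": 0.02}
p_csp = {"Yes": 0.96, "No": 0.04}
p_fms = {("Full", "Yes"): 0.39,  ("Full", "No"): 0.001,   # P(FMS | Fu)
         ("Half", "Yes"): 0.60,  ("Half", "No"): 0.001,
         ("Empty", "Yes"): 0.01, ("Empty", "No"): 0.998}
p_st = {("Yes", "Yes", "Yes"): 0.99, ("No", "Yes", "Yes"): 0.01,  # P(St | Fu, CSP)
        ("Yes", "No", "Yes"): 0.0,   ("No", "No", "Yes"): 1.0,    # key (St, Fu, CSP)
        ("Yes", "Yes", "No"): 0.01,  ("No", "Yes", "No"): 0.99,
        ("Yes", "No", "No"): 0.0,    ("No", "No", "No"): 1.0}

def joint(fu, csp, fms, st):
    return p_fu[fu] * p_csp[csp] * p_fms[(fms, fu)] * p_st[(st, fu, csp)]

def posterior_csp(st, fms):
    """P(CSP | St = st, FMS = fms): sum out Fu in the numerator,
    Fu and CSP in the denominator (the normalizer)."""
    num = {csp: sum(joint(fu, csp, fms, st) for fu in p_fu) for csp in p_csp}
    z = sum(num.values())
    return {csp: v / z for csp, v in num.items()}
```

Running `posterior_csp("No", "Half")` gives P(CSP = Yes | St = No, FMS = Half) ≈ 0.196, i.e., the 'half' fuel-meter reading shifts the blame toward dirty spark plugs, as described in the reasoning slide.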
Initial State: P(Fu), P(CSP), P(St), and P(FMS).
No Start: P(Fu | St = No), P(CSP | St = No), and P(FMS | St = No).
Fuel Meter Stands at Half: P(Fu | St = No, FMS = Half) and P(CSP | St = No, FMS = Half).
Causal Networks vs. Bayesian Networks
Certainty vs. probability calculus.
'A causes B' vs. 'B depends on A', i.e., the conditional probability P(B | A).
Impact vs. dependence: d-separation corresponds to conditional independencies.
Causality implies probabilistic dependence, but probabilistic dependence does not imply causality.
Equivalent Bayesian Network Structures
A Bayesian network structure corresponds to a set of probability distributions.
Informal definition of equivalence: two Bayesian network structures are equivalent if the set of distributions that can be represented using one of the DAGs is identical to the set of distributions that can be represented using the other.
Example: Two Equivalent DAGs
X → Y and X ← Y. Both DAGs say that X and Y are dependent on each other; together they form one equivalence class.
Verma and Pearl's Theorem
Theorem [Verma and Pearl 90]: Two DAGs are equivalent if and only if they have the same skeleton and the same v-structures.
A v-structure X → Z ← Y: X and Y are parents of Z and are not adjacent to each other.
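The theorem gives a direct equivalence test. Below is a sketch (my own helper names; DAGs given as parent dictionaries) that compares skeletons and v-structures:

```python
def skeleton(parents):
    """Undirected edge set of the DAG."""
    return {frozenset((u, v)) for v, ps in parents.items() for u in ps}

def v_structures(parents):
    """Triples (x, z, y) with x -> z <- y and x, y non-adjacent."""
    vs, skel = set(), skeleton(parents)
    for z, ps in parents.items():
        for x in ps:
            for y in ps:
                if x < y and frozenset((x, y)) not in skel:
                    vs.add((x, z, y))
    return vs

def equivalent(g1, g2):
    return skeleton(g1) == skeleton(g2) and v_structures(g1) == v_structures(g2)

# X -> Y and X <- Y are equivalent; X -> Z <- Y is not equivalent to
# X -> Z -> Y, because only the former contains a v-structure.
g_a = {"X": [], "Y": ["X"]}
g_b = {"X": ["Y"], "Y": []}
g_c = {"X": [], "Y": [], "Z": ["X", "Y"]}
g_d = {"X": [], "Z": ["X"], "Y": ["Z"]}
```

Note that g_c and g_d share a skeleton, so the v-structure check is what separates them.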
PDAG Representations
Minimal PDAG representation of an equivalence class: the only directed edges are those that participate in v-structures.
Completed PDAG representation: every directed edge corresponds to a compelled edge, and every undirected edge corresponds to a reversible edge.
Example: PDAG Representations
(Figure: an equivalence class of two DAGs over the variables X, Y, Z, W, V, together with its minimal PDAG and completed PDAG.)
Inference in Bayesian Networks
Infer the probability of an event given some observations [Frey 98].
The exact distribution over small groups of variables can be inferred in singly-connected networks: probability propagation.
A multiply-connected network can be converted to a singly-connected one, but this is not practical, especially for large networks.
Approximate inference methods: Monte Carlo approaches, variational methods, and Helmholtz machines.
Singly-Connected Networks
A singly-connected network has only a single path (ignoring edge directions) connecting any two vertices.
(Figure: a singly-connected factor graph with variable nodes s, u, v, w, x, y, z and function nodes f_A, f_B, f_C, f_D, f_E.)
Factorization of the Global Distribution and Inference
The example network represents the joint probability distribution as follows:
P(s, u, v, w, x, y, z) = f_A(s, u, v) f_B(v, w) f_C(u, x) f_D(u, y) f_E(y, z).
The probability of s given the value z = z' is calculated as
P(s | z = z') = P(s, z = z') / P(z = z'),
where, grouping the summations efficiently,
P(s, z = z') = Σ_u Σ_v f_A(s, u, v) {[Σ_w f_B(v, w)] [Σ_x f_C(u, x)] [Σ_y f_D(u, y) f_E(y, z')]}.
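Pushing the sums inside the product in this way changes the cost but not the answer. The sketch below (my own code, with random nonnegative binary factors standing in for f_A through f_E) checks the grouped expression against the naive sum over all assignments:

```python
import random

random.seed(0)
B = (0, 1)
# Random factors with the same argument structure as in the text.
fA = {(s, u, v): random.random() for s in B for u in B for v in B}
fB = {(v, w): random.random() for v in B for w in B}
fC = {(u, x): random.random() for u in B for x in B}
fD = {(u, y): random.random() for u in B for y in B}
fE = {(y, z): random.random() for y in B for z in B}

def naive(s, z):
    """Unnormalized P(s, z): sum the full product over all assignments."""
    return sum(fA[(s, u, v)] * fB[(v, w)] * fC[(u, x)] * fD[(u, y)] * fE[(y, z)]
               for u in B for v in B for w in B for x in B for y in B)

def grouped(s, z):
    """Same quantity with the sums pushed inside, as in the grouped formula."""
    return sum(sum(fA[(s, u, v)] * sum(fB[(v, w)] for w in B) for v in B)
               * sum(fC[(u, x)] for x in B)
               * sum(fD[(u, y)] * fE[(y, z)] for y in B)
               for u in B)
```

For binary variables the difference is negligible, but for larger state spaces the grouped form avoids the exponential blow-up of the naive sum.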
The Generalized Forward-Backward Algorithm
The generalized forward-backward algorithm is one flavor of probability propagation:
1. Convert the Bayesian network into a factor graph.
2. Arrange the factor graph as a horizontal tree with an arbitrarily chosen root vertex.
3. Beginning at the left-most level, pass messages level by level forward to the root.
4. Pass messages level by level backward from the root to the leaves.
Messages represent the probability propagated through the edges of the graphical model.
Convert a Bayesian Network into the Factor Graph
(Figure: a Bayesian network over z_1, ..., z_10 and the corresponding factor graph, with one function node for each conditional probability P(z_i | Pa(z_i)).)
Message Passing in the Graphical Model
Two types of messages: variable-to-function messages (e.g., μ_{x→A}) and function-to-variable messages (e.g., μ_{A→x}).
(Figure: a variable node x connected to function nodes f_A, f_B, and f_C, with neighboring variables y and z.)
Calculation of the Message
The variable-to-function message: if x is unobserved, then
μ_{x→A}(x) = μ_{B→x}(x) μ_{C→x}(x).
If x is observed as x', then μ_{x→A}(x') = 1 and μ_{x→A}(x) = 0 for all other values.
The function-to-variable message:
μ_{A→x}(x) = Σ_y Σ_z f_A(x, y, z) μ_{y→A}(y) μ_{z→A}(z).
Computation of the Conditional Probability
After the generalized forward-backward algorithm ends, each edge in the factor graph carries its calculated message values. The probability of x given the observations v is
P(x | v) = β μ_{A→x}(x) μ_{B→x}(x) μ_{C→x}(x),
where β is a normalizing constant.
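These message formulas can be traced on a tiny chain. The sketch below (my own factors f1, f2 over a chain x1 - f1 - x2 - f2 - x3, all names hypothetical) computes the marginal of x2 from the two incoming function-to-variable messages and normalizes with β, then checks the result against direct marginalization:

```python
# Arbitrary nonnegative factors on a chain x1 - f1 - x2 - f2 - x3.
f1 = {(0, 0): 0.9, (0, 1): 0.1, (1, 0): 0.4, (1, 1): 0.6}
f2 = {(0, 0): 0.7, (0, 1): 0.3, (1, 0): 0.2, (1, 1): 0.8}
B = (0, 1)

# Leaf variables are unobserved, so they send the all-ones message;
# function-to-variable messages sum the factor against incoming messages.
mu_x1_f1 = {v: 1.0 for v in B}
mu_x3_f2 = {v: 1.0 for v in B}
mu_f1_x2 = {x2: sum(f1[(x1, x2)] * mu_x1_f1[x1] for x1 in B) for x2 in B}
mu_f2_x2 = {x2: sum(f2[(x2, x3)] * mu_x3_f2[x3] for x3 in B) for x2 in B}

# P(x2) = beta * mu_{f1->x2}(x2) * mu_{f2->x2}(x2).
unnorm = {x2: mu_f1_x2[x2] * mu_f2_x2[x2] for x2 in B}
beta = 1.0 / sum(unnorm.values())
p_x2 = {x2: beta * v for x2, v in unnorm.items()}

# Brute-force check: marginalize the full product directly.
brute = {x2: sum(f1[(x1, x2)] * f2[(x2, x3)] for x1 in B for x3 in B)
         for x2 in B}
z = sum(brute.values())
```

On a tree, the message product at each variable reproduces the exact marginal; the brute-force check confirms this here.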
Inference in Multiply-Connected Networks
Probabilistic inference in Bayesian networks (and also in Markov random fields and factor graphs) is NP-hard in general.
Approximate inference: probability propagation applied directly to the multiply-connected network, Monte Carlo methods, variational inference, and Helmholtz machines.
Learning Bayesian Networks
Parametric learning: learn the local probability distribution for each node, given a DAG structure:
P(X_1, X_2, ..., X_n) = Π_{i=1}^{n} P(X_i | Pa(X_i)).
Structural learning: learn the DAG structure.
Bayesian network learning = structural learning + parametric learning.
Four Possible Situations
Given structure, complete data: ML, MAP, and Bayesian learning.
Given structure, incomplete data: EM algorithm, variational methods, and Markov chain Monte Carlo (MCMC) methods.
Unknown structure, complete data: greedy search, genetic algorithms, MCMC, and Bayesian learning.
Unknown structure, incomplete data: structure search + EM or MCMC.
Parametric Learning
Learning the local probability distributions.
Complete data: maximum likelihood learning, or Bayesian learning [Heckerman 96] with a Dirichlet prior
P(θ_ij) = Dir(θ_ij | α_ij1, ..., α_ijr_i),
whose posterior is
P(θ_ij | D) = Dir(θ_ij | α_ij1 + N_ij1, ..., α_ijr_i + N_ijr_i).
Incomplete data: the EM (expectation-maximization) algorithm [Heckerman 96], or Markov chain Monte Carlo methods.
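With complete data, both estimates reduce to counting. The sketch below (my own helper names; a single node/parent configuration with hypothetical counts) contrasts the ML estimate with the Dirichlet posterior mean, which simply adds the pseudo-counts α_ijk to the data counts N_ijk:

```python
def ml_estimate(counts):
    """Maximum-likelihood parameters: normalized counts."""
    n = sum(counts.values())
    return {k: v / n for k, v in counts.items()}

def dirichlet_posterior_mean(counts, alphas):
    """Posterior mean under Dir(alpha): (N_ijk + alpha_ijk) / (N_ij + alpha_ij)."""
    total = sum(counts.values()) + sum(alphas.values())
    return {k: (counts[k] + alphas[k]) / total for k in counts}

# Hypothetical counts for one binary node X_i under one parent configuration.
counts = {"Yes": 8, "No": 2}
alphas = {"Yes": 1, "No": 1}   # uniform Dirichlet prior
```

With these numbers the ML estimate is 0.8/0.2, while the posterior mean is pulled toward uniform, 0.75/0.25; the prior's influence vanishes as the counts grow.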
Structural Learning
Metric approach: use a scoring metric to measure how well a particular structure fits an observed set of cases; a search algorithm finds high-scoring structures. Find a canonical form of an equivalence class.
Independence approach: an independence oracle (approximated by some statistical test) is queried to identify the equivalence class that captures the independencies in the distribution from which the data was generated. Search for a PDAG.
Scoring Metrics for Bayesian Networks
Likelihood: L(G, θ_G; C) = P(C | G^h, θ_G), where G^h is the hypothesis that the data C was generated by a distribution that can be factored according to G.
The maximum likelihood metric of G:
M_ML(G, C) = max_{θ_G} L(G, θ_G; C).
This metric prefers the complete graph structure.
Information Criterion Scoring Metrics
The Akaike information criterion (AIC) metric:
M_AIC(G, C) = log M_ML(G, C) − Dim(G).
The Bayesian information criterion (BIC) metric:
M_BIC(G, C) = log M_ML(G, C) − (1/2) Dim(G) log N.
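As a concrete illustration, the BIC score can be computed from counts. The sketch below (my own toy dataset and helper names) evaluates log M_ML for the structure X → Y by plugging in the count-based ML parameters, then subtracts the BIC penalty; Dim(G) = 3 here (one free parameter for binary X, two for binary Y given X):

```python
import math
from collections import Counter

# Toy complete data for two variables (X, Y).
data = [("a", 0), ("a", 0), ("a", 1), ("b", 1), ("b", 1), ("b", 0)]

def log_ml(data):
    """log M_ML for X -> Y: log-likelihood at the ML (count-based) parameters."""
    cx = Counter(x for x, _ in data)
    cxy = Counter(data)
    n = len(data)
    ll = 0.0
    for (x, y), c in cxy.items():
        ll += c * math.log(cx[x] / n)   # log P(X = x) terms
        ll += c * math.log(c / cx[x])   # log P(Y = y | X = x) terms
    return ll

def bic(data, dim):
    """M_BIC(G, C) = log M_ML(G, C) - (1/2) Dim(G) log N."""
    return log_ml(data) - 0.5 * dim * math.log(len(data))
```

Because the penalty grows with log N, BIC punishes dense structures more heavily on large datasets than AIC does.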
MDL Scoring Metrics
The minimum description length (MDL) metric 1:
M_MDL1(G, C) = log P(G) + M_BIC(G, C).
The minimum description length (MDL) metric 2:
M_MDL2(G, C) = log M_ML(G, C) − |E_G| log N − c Dim(G),
where |E_G| is the number of edges of G and c is a constant.
Bayesian Scoring Metrics
A Bayesian metric:
M(G, C | ξ) = log P(G^h | ξ) + log P(C | G^h, ξ) + c.
The BDe (Bayesian Dirichlet and likelihood equivalence) metric [Heckerman et al. 95]:
p(C, G) = p(G) p(C | G) = p(G) Π_{i=1}^{n} Π_{j=1}^{q_i} [Γ(α_ij) / Γ(α_ij + N_ij)] Π_{k=1}^{r_i} [Γ(α_ijk + N_ijk) / Γ(α_ijk)],
where α_ij = Σ_k α_ijk and N_ij = Σ_k N_ijk, and Γ(1) = 1, Γ(x + 1) = xΓ(x).
The α_ijk encode the prior; the N_ijk are sufficient statistics calculated from the data D.
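The Gamma-function product is best computed in log space. The sketch below (my own function name and example numbers) evaluates the per-node factor of the metric with `math.lgamma`, indexing the pseudo-counts α and data counts N as [j][k] by parent configuration j and value k:

```python
import math

def log_bde_node(alphas, counts):
    """Log of prod_j [Gamma(a_ij)/Gamma(a_ij + N_ij)]
                     prod_k [Gamma(a_ijk + N_ijk)/Gamma(a_ijk)]
    for one node, computed with lgamma for numerical stability."""
    score = 0.0
    for a_j, n_j in zip(alphas, counts):
        score += math.lgamma(sum(a_j)) - math.lgamma(sum(a_j) + sum(n_j))
        for a, n in zip(a_j, n_j):
            score += math.lgamma(a + n) - math.lgamma(a)
    return score

# Example: a binary node with two parent configurations, uniform prior.
alphas = [[1.0, 1.0], [1.0, 1.0]]
counts = [[3, 1], [0, 4]]
```

With no data the factor is exactly 1 (log score 0), and with a single observation under a uniform Dirichlet(1, 1) prior it is 1/2, which is a convenient sanity check.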
Greedy Search Algorithm for Bayesian Network Learning
1. Generate an initial Bayesian network structure G_0.
2. For m = 1, 2, 3, ... until convergence: among all possible local changes (insertion of an edge, reversal of an edge, and deletion of an edge) to G_{m−1}, perform the one that leads to the largest improvement in the score. The resulting graph is G_m.
Stopping criterion: Score(G_{m−1}) == Score(G_m).
At each iteration (when learning Bayesian networks of n variables), O(n^2) local changes must be evaluated to select the best one. Random restarts are usually adopted to escape local maxima.
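The loop above can be sketched for binary variables with a BIC score. This is my own minimal version (all helper names hypothetical): for brevity it tries only edge insertions and deletions, omitting reversals, and accepts any score-improving move until no move helps:

```python
import math
import itertools
from collections import Counter

def bic_score(parents, data, names):
    """BIC of a DAG on complete binary data: count-based log-likelihood
    minus (1/2) * Dim(G) * log N, with one free parameter per parent
    configuration for each binary node."""
    n = len(data)
    score = 0.0
    for i, name in enumerate(names):
        idx = [names.index(p) for p in parents[name]]
        cfg = Counter(tuple(row[j] for j in idx) for row in data)
        joint = Counter((tuple(row[j] for j in idx), row[i]) for row in data)
        for (c, _), cnt in joint.items():
            score += cnt * math.log(cnt / cfg[c])
        score -= 0.5 * (2 ** len(idx)) * math.log(n)
    return score

def creates_cycle(parents, child, parent):
    """Adding parent -> child cycles iff parent is reachable from child."""
    stack, seen = [child], set()
    while stack:
        v = stack.pop()
        if v == parent:
            return True
        seen.add(v)
        stack.extend(c for c, ps in parents.items() if v in ps and c not in seen)
    return False

def greedy_search(data, names):
    parents = {v: [] for v in names}
    best = bic_score(parents, data, names)
    improved = True
    while improved:
        improved = False
        for a, b in itertools.permutations(names, 2):
            cand = {v: list(ps) for v, ps in parents.items()}
            if a in cand[b]:
                cand[b].remove(a)                    # try deleting a -> b
            elif not creates_cycle(parents, b, a):
                cand[b].append(a)                    # try inserting a -> b
            else:
                continue
            s = bic_score(cand, data, names)
            if s > best:
                best, parents, improved = s, cand, True
    return parents, best
```

On data where Y closely tracks X, the search adds a single edge between X and Y and then stops, since the BIC penalty blocks further additions.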
Other Approaches to Structural Learning
Genetic algorithms; Markov chain Monte Carlo sampling.
Bayesian learning: sum over all possible structures. Because the space of possible structures grows at least exponentially in the number of variables, approximation is required.
Applications
Classification (neural networks vs. PGMs).
Text mining: topic extraction.
Motion tracking.
Bioinformatics: gene-regulatory network construction, gene-drug dependency analysis.
Gene-Regulatory Network Construction
Eran Segal et al., Module networks: identifying regulatory modules and their condition-specific regulators from gene expression data, Nature Genetics 34(2):166-176, 2003.
Gene-Drug Dependency Analysis
Concluding Remarks
Probabilistic graphical models: probability theory (uncertainty) + graph theory (complexity).
A framework of thought for artificial intelligence, machine learning, and data mining.
Representation, inference, and learning: further work is needed on these topics.
From an engineering viewpoint: implement an established theory for specific applications.
Bibliography
[Jensen 96] Jensen, F.V., An Introduction to Bayesian Networks, Springer-Verlag, 1996.
[Jensen 01] Jensen, F.V., Bayesian Networks and Decision Graphs, Springer-Verlag, 2001.
[Heckerman 96] Heckerman, D., A tutorial on learning with Bayesian networks, Technical Report MSR-TR-95-06, Microsoft Research, 1996.
[Pearl 88] Pearl, J., Probabilistic Reasoning in Intelligent Systems: Networks of Plausible Inference, Morgan Kaufmann, 1988.
[Spirtes et al. 00] Spirtes, P., Glymour, C., and Scheines, R., Causation, Prediction, and Search, 2nd edition, MIT Press, 2000.
[Frey 98] Frey, B.J., Graphical Models for Machine Learning and Digital Communication, MIT Press, 1998.
[Friedman and Goldszmidt 99] Friedman, N. and Goldszmidt, M., Learning Bayesian networks with local structure, in Learning in Graphical Models, pp. 421-460, MIT Press, 1999.
[Heckerman et al. 95] Heckerman, D., Geiger, D., and Chickering, D.M., Learning Bayesian networks: the combination of knowledge and statistical data, Technical Report MSR-TR-94-09, Microsoft Research, 1995.
[Verma and Pearl 90] Verma, T. and Pearl, J., Equivalence and synthesis of causal models, in Proceedings of UAI 90, pp. 220-227, 1990.
http://www.ai.mit.edu/~murphyk/bayes/bnintro.html