Introduction to Probabilistic Graphical Models

Introduction to Probabilistic Graphical Models
Kyu-Baek Hwang and Byoung-Tak Zhang
Biointelligence Lab, School of Computer Science and Engineering, Seoul National University, Seoul 151-742, Korea
E-mail: kbhwang@bi.snu.ac.kr, btzhang@bi.snu.ac.kr
http://bi.snu.ac.kr

Overview I
Graphical models are a marriage between probability theory and graph theory. They provide a natural tool for dealing with two problems that occur throughout applied mathematics and engineering -- uncertainty and complexity -- and in particular they are playing an increasingly important role in the design and analysis of machine learning algorithms. Fundamental to the idea of a graphical model is the notion of modularity: a complex system is built by combining simpler parts. Probability theory provides the glue whereby the parts are combined, ensuring that the system as a whole is consistent and providing ways to interface models to data.

Overview II
The graph-theoretic side of graphical models provides both an intuitively appealing interface by which humans can model highly interacting sets of variables and a data structure that lends itself naturally to the design of efficient general-purpose algorithms. Many of the classical multivariate probabilistic systems studied in fields such as statistics, systems engineering, information theory, pattern recognition, and statistical mechanics are special cases of the general graphical model formalism -- examples include mixture models, factor analysis, hidden Markov models, Kalman filters, and Ising models.

Overview III
The graphical model framework provides a way to view all of these systems as instances of a common underlying formalism. This view has many advantages -- in particular, specialized techniques that have been developed in one field can be transferred between research communities and exploited more widely. Moreover, the graphical model formalism provides a natural framework for the design of new systems. --- Michael Jordan, 1998

Three Points of View
Representation: probabilistic graphical models (PGMs) represent the probabilistic relationships among a set of random variables, for example the relationship between symptoms and a disease.
Inference: given a PGM, one can calculate a conditional probability of interest.
Learning: given a set of examples (data), one can find a plausible PGM describing the underlying process.

Two Main Classes of PGMs
Undirected models: edges have no direction. Examples: Markov random fields, Markov networks, etc. Used in image analysis.
Directed models: edges have a direction. Examples: Bayesian networks, belief networks, etc. Intuitive, and suited to causal analysis.

Contents
Causal networks
Bayesian networks
Inference in Bayesian networks
Learning Bayesian networks from data
Applications
Concluding remarks
Bibliography

Causal Networks
Node: an event. Arc: a causal relationship between two nodes; A -> B means A causes B.
Causal network for the car start problem [Jensen 01], with nodes Fuel, Clean Spark Plugs, Fuel Meter Standing, and Start (figure).

Reasoning with Causal Networks
1. My car does not start: this increases the certainty of "no fuel" and "dirty spark plugs", and increases the certainty that the fuel meter stands on empty.
2. The fuel meter stands on half: this decreases the certainty of "no fuel" and increases the certainty of "dirty spark plugs".

d-separation: the Set of Rules for Reasoning
Connections in causal networks: serial, diverging, and converging.
Definition [Jensen 01]: Two nodes in a causal network are d-separated if, for all paths between them, there is an intermediate node V such that either the connection is serial or diverging and the state of V is known, or the connection is converging and neither V nor any of V's descendants have received evidence.
If A and B are d-separated, then changes in the certainty of A have no impact on the certainty of B, and vice versa.
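
As a concrete illustration (not part of the slides), the following sketch tests d-separation with the moralized ancestral graph criterion, which is equivalent to the path-based definition above: keep only the ancestors of the queried and conditioning nodes, connect ("marry") the parents of every remaining node, drop edge directions, delete the conditioning nodes, and check whether the two query nodes are still connected. The edge list encodes the car start network; the function and variable names are my own.

```python
from collections import defaultdict
from itertools import combinations

def d_separated(edges, x, y, given):
    """Return True if x and y are d-separated given the set `given`
    in the DAG described by `edges` (a list of (parent, child) pairs)."""
    parents = defaultdict(set)
    for p, c in edges:
        parents[c].add(p)

    # 1. Keep only the ancestors of x, y, and the conditioning set.
    relevant = {x, y} | set(given)
    stack = list(relevant)
    while stack:
        n = stack.pop()
        for p in parents[n]:
            if p not in relevant:
                relevant.add(p)
                stack.append(p)

    # 2. Moralize: connect the parents of every remaining node, drop directions.
    undirected = defaultdict(set)
    for c in relevant:
        ps = parents[c] & relevant
        for p in ps:
            undirected[p].add(c)
            undirected[c].add(p)
        for p1, p2 in combinations(ps, 2):
            undirected[p1].add(p2)
            undirected[p2].add(p1)

    # 3. Remove the conditioning nodes and test reachability from x to y.
    blocked = set(given)
    stack, seen = [x], {x}
    while stack:
        n = stack.pop()
        if n == y:
            return False          # a path survives, so not d-separated
        for m in undirected[n]:
            if m not in seen and m not in blocked:
                seen.add(m)
                stack.append(m)
    return True

# Car start network: Fuel -> FuelMeter, Fuel -> Start, CleanSparkPlugs -> Start.
edges = [("Fuel", "FuelMeter"), ("Fuel", "Start"), ("CleanSparkPlugs", "Start")]
print(d_separated(edges, "Fuel", "CleanSparkPlugs", set()))       # True: marginally independent
print(d_separated(edges, "Fuel", "CleanSparkPlugs", {"Start"}))   # False: dependent given Start
```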

d-separation in the Car Start Problem
1. Start and Fuel are dependent on each other.
2. Start and Clean Spark Plugs are dependent on each other.
3. Fuel and Fuel Meter Standing are dependent on each other.
4. Fuel and Clean Spark Plugs are conditionally dependent on each other given the value of Start.

Probability for Certainty in Causal Networks
Basic axioms:
P(A) = 1 iff A is certain.
Σ_A P(A) = 1, where the summation is taken over all possible values of A.
P(A or B) = P(A) + P(B) iff A and B are mutually exclusive.
Conditional probability: P(A | B) = P(A, B) / P(B) = P(B | A) P(A) / P(B).
An event in the causal network corresponds to a variable. If A and B are d-separated, then P(A | B) = P(A), i.e. A and B are independent; if they are d-separated given C, then A and B are (conditionally) independent given the value of C.

Definition: Bayesian Networks
A Bayesian network consists of the following:
A set of n variables X = {X_1, X_2, ..., X_n} and a set of directed edges between variables. The variables (nodes) together with the directed edges form a directed acyclic graph (DAG); directed cycles are not modeled.
To each variable X_i with parents Pa(X_i) is attached a conditional probability table for P(X_i | Pa(X_i)).
Modeling of continuous variables is also possible.

A Bayesian Network Represents the Joint Probability Distribution
By the d-separation property, a Bayesian network over the n variables X = {X_1, X_2, ..., X_n} represents P(X) as

P(X_1, X_2, ..., X_n) = Π_{i=1}^{n} P(X_i | Pa(X_i)).

Given the joint probability distribution, any conditional probability can in principle be calculated.

Bayesian Network for the Car Start Problem
P(Fu = Yes) = 0.98, P(CSP = Yes) = 0.96.

P(FMS | Fu):
              Fu = Yes   Fu = No
FMS = Full      0.39      0.001
FMS = Half      0.60      0.001
FMS = Empty     0.01      0.998

P(St | Fu, CSP), given as (P(St = Yes), P(St = No)):
              Fu = Yes       Fu = No
CSP = Yes    (0.99, 0.01)    (0, 1)
CSP = No     (0.01, 0.99)    (0, 1)
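
The following sketch (my own, not part of the slides) encodes these tables as Python dictionaries and evaluates the joint distribution through the factorization P(X_1, ..., X_n) = Π_i P(X_i | Pa(X_i)); the abbreviations Fu, CSP, FMS, St follow the slide, while the dictionary layout and function name are assumptions.

```python
# Conditional probability tables for the car start network,
# indexed as cpt[variable][parent values][variable value].
cpt = {
    "Fu":  {(): {"Yes": 0.98, "No": 0.02}},
    "CSP": {(): {"Yes": 0.96, "No": 0.04}},
    "FMS": {("Yes",): {"Full": 0.39, "Half": 0.60, "Empty": 0.01},
            ("No",):  {"Full": 0.001, "Half": 0.001, "Empty": 0.998}},
    "St":  {("Yes", "Yes"): {"Yes": 0.99, "No": 0.01},
            ("Yes", "No"):  {"Yes": 0.01, "No": 0.99},
            ("No",  "Yes"): {"Yes": 0.0,  "No": 1.0},
            ("No",  "No"):  {"Yes": 0.0,  "No": 1.0}},
}
parents = {"Fu": (), "CSP": (), "FMS": ("Fu",), "St": ("Fu", "CSP")}

def joint(assignment):
    """P(assignment) = product over all variables of P(X_i | Pa(X_i))."""
    p = 1.0
    for var in cpt:
        parent_vals = tuple(assignment[q] for q in parents[var])
        p *= cpt[var][parent_vals][assignment[var]]
    return p

# e.g. P(Fu=Yes, CSP=Yes, FMS=Half, St=Yes) = 0.98 * 0.96 * 0.60 * 0.99
print(joint({"Fu": "Yes", "CSP": "Yes", "FMS": "Half", "St": "Yes"}))
```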

The Car Start Problem Revisited
1. No start: P(St = No) = 1 (evidence 1). Update the conditional probabilities P(Fu | St = No), P(CSP | St = No), and P(FMS | St = No).
2. The fuel meter stands on half: P(FMS = Half) = 1 (evidence 2). Update the conditional probabilities P(Fu | St = No, FMS = Half) and P(CSP | St = No, FMS = Half).

Calculation of the Conditional Probabilities
P(CSP | St = No, FMS = Half) is calculated as

P(CSP | St, FMS) = P(CSP, St, FMS) / P(St, FMS)
                 = Σ_Fu P(Fu, CSP, St, FMS) / Σ_Fu Σ_CSP P(Fu, CSP, St, FMS),

where the summations are taken over all possible values of the variables. In general, calculating a conditional probability by this kind of brute-force marginalization is intractable.
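
Continuing the previous sketch (again illustrative, and reusing the cpt, parents, and joint definitions above), the conditional probability can be obtained for this small network by brute-force enumeration over the unobserved variables; the query function and domains dictionary are my own names.

```python
from itertools import product

# Reuses cpt, parents, and joint() from the sketch above.
domains = {"Fu": ["Yes", "No"], "CSP": ["Yes", "No"],
           "FMS": ["Full", "Half", "Empty"], "St": ["Yes", "No"]}

def query(target, evidence):
    """P(target | evidence) by summing the joint over the hidden variables."""
    hidden = [v for v in domains if v != target and v not in evidence]
    dist = {}
    for t_val in domains[target]:
        total = 0.0
        for combo in product(*(domains[h] for h in hidden)):
            assignment = dict(zip(hidden, combo))
            assignment.update(evidence)
            assignment[target] = t_val
            total += joint(assignment)
        dist[t_val] = total
    norm = sum(dist.values())                      # = P(evidence)
    return {k: v / norm for k, v in dist.items()}

print(query("CSP", {"St": "No"}))                    # after evidence 1
print(query("CSP", {"St": "No", "FMS": "Half"}))     # after evidence 2
```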

Initial State
The marginal distributions P(Fu), P(CSP), P(St), and P(FMS) before any evidence (figure).

No Start
The updated distributions P(Fu | St = No), P(CSP | St = No), and P(FMS | St = No) (figure).

Fuel Meter Stands on Half
The updated distributions P(Fu | St = No, FMS = Half) and P(CSP | St = No, FMS = Half) (figure).

Causal Networks vs. Bayesian Networks
Certainty calculus vs. probability calculus.
"A causes B" vs. "B depends on A", expressed by the conditional probability P(B | A).
Impact vs. dependence: d-separation corresponds to conditional independencies.
Causality implies probabilistic dependence, but probabilistic dependence does not imply causality.

Equivalent Bayesian Network Structures
A Bayesian network structure corresponds to a set of probability distributions.
Informal definition of equivalence of Bayesian network structures: two Bayesian network structures are equivalent if the set of distributions that can be represented using one of the DAGs is identical to the set of distributions that can be represented using the other.

Example: Two Equivalent DAGs
X -> Y and X <- Y: both DAGs say only that X and Y are dependent on each other, so they belong to the same equivalence class.

Verma and Pearl's Theorem
Theorem [Verma and Pearl 90]: Two DAGs are equivalent if and only if they have the same skeleton and the same v-structures.
A v-structure X -> Z <- Y: X and Y are parents of Z and are not adjacent to each other.
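
A small illustration of this criterion (mine, not the slides'): represent each DAG as a list of directed edges, extract the skeleton as a set of unordered pairs and the v-structures as (parent pair, child) entries with non-adjacent parents, and compare.

```python
from itertools import combinations

def skeleton(edges):
    return {frozenset(e) for e in edges}

def v_structures(edges):
    parents = {}
    for p, c in edges:
        parents.setdefault(c, set()).add(p)
    adj = skeleton(edges)
    vs = set()
    for child, ps in parents.items():
        for a, b in combinations(sorted(ps), 2):
            if frozenset((a, b)) not in adj:       # parents must not be adjacent
                vs.add((frozenset((a, b)), child))
    return vs

def equivalent(edges1, edges2):
    """Verma-Pearl criterion: same skeleton and same v-structures."""
    return (skeleton(edges1) == skeleton(edges2)
            and v_structures(edges1) == v_structures(edges2))

# X -> Y and Y -> X are equivalent; X -> Z <- Y is not equivalent to X -> Z -> Y.
print(equivalent([("X", "Y")], [("Y", "X")]))                          # True
print(equivalent([("X", "Z"), ("Y", "Z")], [("X", "Z"), ("Z", "Y")]))  # False
```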

PDAG Representations
Minimal PDAG representation of an equivalence class: the only directed edges are those that participate in v-structures.
Completed PDAG representation: every directed edge corresponds to a compelled edge, and every undirected edge corresponds to a reversible edge.

Example: PDAG Representations
(Figure: an equivalence class of DAGs over the variables X, Y, Z, W, V, together with its minimal PDAG and its completed PDAG.)

Inference in Bayesian Networks
Infer the probability of an event given some observations [Frey 98].
The exact distribution over small groups of variables can be inferred in singly-connected networks by probability propagation.
A multiply-connected network can be converted into a singly-connected one, but this is not practical, especially for large networks.
Approximate inference methods: Monte Carlo approaches, variational methods, Helmholtz machines.

Singly-Connected Networks
A singly-connected network has only a single path (ignoring edge directions) connecting any two vertices.
(Figure: an example singly-connected factor graph over the variables s, u, v, w, x, y, z with factor vertices f_A, f_B, f_C, f_D, f_E.)

Factorization of the Global Distribution and Inference
The example network represents the joint probability distribution as

P(s, u, v, w, x, y, z) = f_A(s, u, v) f_B(v, w) f_C(u, x) f_D(u, y) f_E(y, z).

The probability of s given an observed value z = z' is calculated as

P(s | z = z') = Σ_{u,v,w,x,y} P(s, u, v, w, x, y, z = z') / P(z = z')
             ∝ Σ_u Σ_v f_A(s, u, v) [Σ_w f_B(v, w)] [Σ_x f_C(u, x)] [Σ_y f_D(u, y) f_E(y, z')].
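
To make the sum-pushing concrete, here is an illustrative sketch (not from the slides) with binary variables and randomly generated factor tables; it confirms that distributing the sums over the factors gives the same answer as the full summation.

```python
from itertools import product
import random

random.seed(0)
vals = (0, 1)

# Arbitrary positive factor tables over binary variables (illustrative only).
fA = {k: random.random() for k in product(vals, repeat=3)}   # f_A(s, u, v)
fB = {k: random.random() for k in product(vals, repeat=2)}   # f_B(v, w)
fC = {k: random.random() for k in product(vals, repeat=2)}   # f_C(u, x)
fD = {k: random.random() for k in product(vals, repeat=2)}   # f_D(u, y)
fE = {k: random.random() for k in product(vals, repeat=2)}   # f_E(y, z)

z_obs = 1

def brute_force(s):
    # Full sum over u, v, w, x, y of the product of all factors.
    return sum(fA[s, u, v] * fB[v, w] * fC[u, x] * fD[u, y] * fE[y, z_obs]
               for u, v, w, x, y in product(vals, repeat=5))

def pushed_sums(s):
    # Sums pushed inside the products, as in the rearranged expression above.
    return sum(fA[s, u, v]
               * sum(fB[v, w] for w in vals)
               * sum(fC[u, x] for x in vals)
               * sum(fD[u, y] * fE[y, z_obs] for y in vals)
               for u in vals for v in vals)

unnorm = [pushed_sums(s) for s in vals]
norm = sum(unnorm)
print([p / norm for p in unnorm])                      # P(s | z = z_obs)
print(abs(brute_force(0) - pushed_sums(0)) < 1e-12)    # the two orderings agree
```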

The Generalized Forward-Backward Algorithm
The generalized forward-backward algorithm is one flavor of probability propagation:
1. Convert the Bayesian network into a factor graph.
2. Arrange the factor graph as a horizontal tree with an arbitrarily chosen root vertex.
3. Beginning at the left-most level, pass messages level by level forward to the root.
4. Pass messages level by level backward from the root to the leaves.
Messages represent the probability propagated through the edges of the graphical model.

Convert a Bayesian Network into the Factor Graph
(Figure: a Bayesian network over the variables z_1, ..., z_10 and the corresponding factor graph.)

Message Passing in the Graphical Model
Two types of messages: variable-to-function messages and function-to-variable messages.
(Figure: a variable x connected to function vertices f_A, f_B, f_C and neighbouring variables y and z, with the messages μ_{x→A}(x) and μ_{A→x}(x) marked.)

Calculation of the Messages
The variable-to-function message: if x is unobserved, then

μ_{x→A}(x) = μ_{B→x}(x) μ_{C→x}(x).

If x is observed to have the value x', then

μ_{x→A}(x) = 1 for x = x', and 0 for all other values of x.

The function-to-variable message:

μ_{A→x}(x) = Σ_y Σ_z f_A(x, y, z) μ_{y→A}(y) μ_{z→A}(z).

Computation of the Conditional Probability
After the generalized forward-backward algorithm ends, each edge in the factor graph has its calculated message values. The probability of x given the observations v is

P(x | v) = β μ_{A→x}(x) μ_{B→x}(x) μ_{C→x}(x),

where β is a normalizing constant.
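
Here is a compact sketch of these message updates (my own illustration) on a two-factor chain x -- f_A -- y -- f_B -- z with z observed; the factor tables are arbitrary numbers chosen for the example, and the result is checked against direct enumeration.

```python
import numpy as np

# Two factors on a chain x -- f_A -- y -- f_B -- z (all variables binary).
fA = np.array([[0.6, 0.4],    # f_A(x, y), rows indexed by x, columns by y
               [0.1, 0.9]])
fB = np.array([[0.7, 0.3],    # f_B(y, z)
               [0.2, 0.8]])

z_obs = 1

# Forward pass (from the observed leaf z towards the root x).
mu_z_to_B = np.zeros(2); mu_z_to_B[z_obs] = 1.0      # observed-variable message
mu_B_to_y = fB @ mu_z_to_B                           # sum over z of f_B(y, z) * mu
mu_y_to_A = mu_B_to_y                                # y has no other neighbouring factors
mu_A_to_x = fA @ mu_y_to_A                           # sum over y of f_A(x, y) * mu

p_x = mu_A_to_x / mu_A_to_x.sum()                    # beta normalizes the product
print(p_x)                                           # P(x | z = z_obs)

# Check against direct enumeration of sum_y f_A(x, y) * f_B(y, z_obs).
direct = np.array([sum(fA[x, y] * fB[y, z_obs] for y in range(2)) for x in range(2)])
print(direct / direct.sum())
```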

Inference in the Multiply-Connected Network
Probabilistic inference in Bayesian networks (and also in Markov random fields and factor graphs) is, in general, very hard.
Approximate inference: probability propagation applied directly to the multiply-connected network, Monte Carlo methods, variational inference, Helmholtz machines.

Learning Bayesian Networks
Parametric learning: learn the local probability distribution for each node given a DAG structure,

P(X_1, X_2, ..., X_n) = Π_{i=1}^{n} P(X_i | Pa(X_i)).

Structural learning: learn the DAG structure itself.
Bayesian network learning = structural learning + parametric learning.

Four Possible Situations
Given structure, complete data: ML, MAP, and Bayesian learning.
Given structure, incomplete data: the EM algorithm, variational methods, and Markov chain Monte Carlo (MCMC) methods.
Unknown structure, complete data: greedy search, genetic algorithms, MCMC, and Bayesian learning.
Unknown structure, incomplete data: structure search combined with EM or MCMC.

Parametric Learning
Learning the local probability distributions.
Complete data: maximum likelihood learning, or Bayesian learning [Heckerman 96] with Dirichlet priors,

P(θ_ij) = Dir(θ_ij | α_ij1, ..., α_ijr_i),
P(θ_ij | D) = Dir(θ_ij | α_ij1 + N_ij1, ..., α_ijr_i + N_ijr_i).

Incomplete data: the EM (expectation-maximization) algorithm [Heckerman 96], or Markov chain Monte Carlo methods.
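
A minimal sketch (not in the slides) of the complete-data case for a single variable and a single parent configuration j: the maximum likelihood estimate normalizes the counts N_ijk, and the Bayesian posterior simply adds the Dirichlet pseudo-counts α_ijk. The example numbers are made up.

```python
import numpy as np

# Counts N_ijk for one (variable, parent configuration) pair, k = 1..r_i.
counts = np.array([30, 10, 5])          # observed frequencies of the r_i = 3 states
alpha  = np.array([1.0, 1.0, 1.0])      # Dirichlet hyperparameters alpha_ijk

theta_ml    = counts / counts.sum()                 # maximum likelihood estimate
post_alpha  = alpha + counts                        # Dir(alpha + N) posterior
theta_bayes = post_alpha / post_alpha.sum()         # posterior mean of theta_ijk

print(theta_ml)      # [0.667, 0.222, 0.111]
print(theta_bayes)   # [0.646, 0.229, 0.125]
```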

Structural Learning
Metric approach: use a scoring metric to measure how well a particular structure fits an observed set of cases, together with a search algorithm; search over canonical forms of the equivalence classes.
Independence approach: an independence oracle (approximated by some statistical test) is queried to identify the equivalence class that captures the independencies in the distribution from which the data was generated; search for a PDAG.

Scoring Metrics for Bayesian Networks
Likelihood: L(G, θ_G, C) = P(C | G^h, θ_G), where G^h is the hypothesis that the data C was generated by a distribution that can be factored according to G.
The maximum likelihood metric of G:

M_ML(G, C) = max_{θ_G} L(G, θ_G, C).

The maximum likelihood metric prefers the complete graph structure.

Information Criterion Scoring Metrics
The Akaike information criterion (AIC) metric:

M_AIC(G, C) = log M_ML(G, C) - Dim(G).

The Bayesian information criterion (BIC) metric:

M_BIC(G, C) = log M_ML(G, C) - (1/2) Dim(G) log N.
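
As an illustration (not from the slides), the BIC metric can be computed from complete discrete data, using the fact that the maximized log-likelihood equals Σ_ijk N_ijk log(N_ijk / N_ij) and taking Dim(G) = Σ_i q_i (r_i - 1); the function and variable names, and the tiny dataset, are assumptions made for the example.

```python
import math
from collections import Counter

def bic_score(data, parents, domains):
    """BIC score of a DAG (given as a parent dict) for complete discrete data.
    data: list of dicts mapping variable name -> value."""
    N = len(data)
    score = 0.0
    for var, pa in parents.items():
        # Sufficient statistics N_ijk and N_ij.
        n_ijk = Counter((tuple(row[p] for p in pa), row[var]) for row in data)
        n_ij = Counter(tuple(row[p] for p in pa) for row in data)
        # Maximized log-likelihood term for this family.
        score += sum(n * math.log(n / n_ij[j]) for (j, _), n in n_ijk.items())
        # Dimension penalty: q_i * (r_i - 1) free parameters for this variable.
        q_i = math.prod(len(domains[p]) for p in pa)
        r_i = len(domains[var])
        score -= 0.5 * q_i * (r_i - 1) * math.log(N)
    return score

# Tiny made-up binary dataset over A and B.
data = [{"A": a, "B": b} for a, b in [(0, 0), (0, 0), (0, 1), (1, 1), (1, 1), (1, 0)]]
domains = {"A": [0, 1], "B": [0, 1]}
print(bic_score(data, {"A": (), "B": ("A",)}, domains))   # structure A -> B
print(bic_score(data, {"A": (), "B": ()}, domains))       # empty structure
```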

MDL Scoring Metrics
The minimum description length (MDL) metric 1:

M_MDL1(G, C) = log P(G) + M_BIC(G, C).

The minimum description length (MDL) metric 2:

M_MDL2(G, C) = log M_ML(G, C) - |E_G| log N - c Dim(G),

where |E_G| denotes the number of edges in G and c is a constant.

Bayesian Scoring Metrics
A Bayesian metric:

M(G, C, ξ) = log P(G^h | ξ) + log P(C | G^h, ξ) + c.

The BDe (Bayesian Dirichlet & likelihood equivalence) metric [Heckerman et al. 95] scores a structure by P(G) P(C | G), with

P(C | G) = Π_{i=1}^{n} Π_{j=1}^{q_i} [ Γ(α_ij) / Γ(α_ij + N_ij) ] Π_{k=1}^{r_i} [ Γ(α_ijk + N_ijk) / Γ(α_ijk) ],

where α_ij = Σ_k α_ijk and N_ij = Σ_k N_ijk; the α_ijk come from the prior, the N_ijk are sufficient statistics calculated from the data D, and Γ(1) = 1, Γ(x + 1) = x Γ(x).
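
A sketch (my own) of the log of this marginal likelihood for a single variable, using log-Gamma functions for numerical stability; scoring a whole network sums the same expression over all variables and their parent configurations. The counts and hyperparameters below are arbitrary example values.

```python
from math import lgamma

def log_bde_family(counts, alpha):
    """log marginal likelihood contribution of one variable.
    counts[j][k] = N_ijk for parent configuration j and state k,
    alpha[j][k]  = corresponding Dirichlet hyperparameters alpha_ijk."""
    total = 0.0
    for n_j, a_j in zip(counts, alpha):
        n_ij, a_ij = sum(n_j), sum(a_j)
        total += lgamma(a_ij) - lgamma(a_ij + n_ij)
        total += sum(lgamma(a + n) - lgamma(a) for a, n in zip(a_j, n_j))
    return total

# One variable with 3 states and 2 parent configurations, uniform prior alpha_ijk = 1.
counts = [[5, 2, 1], [0, 3, 4]]
alpha = [[1.0, 1.0, 1.0], [1.0, 1.0, 1.0]]
print(log_bde_family(counts, alpha))
```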

Greedy Search Algorithm for Bayesian Network Learning
Generate an initial Bayesian network structure G_0.
For m = 1, 2, 3, ... until convergence: among all possible local changes to G_{m-1} (insertion of an edge, reversal of an edge, and deletion of an edge), perform the one that leads to the largest improvement in the score; the resulting graph is G_m.
Stopping criterion: Score(G_{m-1}) == Score(G_m).
At each iteration (when learning a Bayesian network over n variables), O(n^2) local changes must be evaluated to select the best one.
Random restarts are usually adopted to escape local maxima.
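
A schematic of this hill-climbing loop, assuming a `score` function that evaluates an edge set (for example, a wrapper around the BIC or BDe scores sketched above) and a simple reachability test to keep candidate graphs acyclic; it is illustrative rather than an implementation from the slides.

```python
from itertools import permutations

def creates_cycle(edges, new_edge):
    """True if adding new_edge = (u, v) would create a directed cycle."""
    u, v = new_edge
    stack, seen = [v], set()
    while stack:
        n = stack.pop()
        if n == u:
            return True
        seen.add(n)
        stack.extend(c for p, c in edges if p == n and c not in seen)
    return False

def neighbours(edges, variables):
    """All edge sets reachable by one edge insertion, deletion, or reversal."""
    edges = set(edges)
    for e in edges:                                   # deletions and reversals
        yield edges - {e}
        rev = (e[1], e[0])
        if not creates_cycle(edges - {e}, rev):
            yield (edges - {e}) | {rev}
    for u, v in permutations(variables, 2):           # insertions
        if (u, v) not in edges and (v, u) not in edges and not creates_cycle(edges, (u, v)):
            yield edges | {(u, v)}

def greedy_search(variables, score, init=frozenset()):
    current, current_score = set(init), score(init)
    while True:
        best, best_score = None, current_score
        for cand in neighbours(current, variables):
            s = score(cand)
            if s > best_score:
                best, best_score = cand, s
        if best is None:                               # no local change improves the score
            return current, current_score
        current, current_score = best, best_score
```

In practice the loop is restarted from several random initial structures, and only the score terms of the families touched by a local change need to be re-evaluated.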

Other Approaches to Structural Learning
Genetic algorithms.
Markov chain Monte Carlo sampling.
Bayesian learning: sum over all possible structures; since the space of structures is exponential in the number of variables, approximation is required.

Applications
Classification: neural networks vs. PGMs.
Text mining: topic extraction.
Motion tracking.
Bioinformatics: gene-regulatory network construction, gene-drug dependency analysis.

Gene-Regulatory Network Construction
Eran Segal et al., "Module Networks: Identifying Regulatory Modules and their Condition Specific Regulators from Gene Expression Data," Nature Genetics 34(2):166-76, 2004.

Gene-Drug Dependency Analysis
(Figure.)

Concluding Remarks
Probabilistic graphical models: probability theory (uncertainty) + graph theory (complexity).
A framework of thought for artificial intelligence, machine learning, and data mining.
Representation, inference, and learning: further work is needed on all of these topics.
From an engineering viewpoint: implement an established theory for specific applications.

Bibliography
[Jensen 96] Jensen, F.V., An Introduction to Bayesian Networks, Springer-Verlag, 1996.
[Jensen 01] Jensen, F.V., Bayesian Networks and Decision Graphs, Springer-Verlag, 2001.
[Heckerman 96] Heckerman, D., A tutorial on learning with Bayesian networks, Technical Report MSR-TR-95-06, Microsoft Research, 1996.
[Pearl 88] Pearl, J., Probabilistic Reasoning in Intelligent Systems: Networks of Plausible Inference, Morgan Kaufmann Publishers, 1988.
[Spirtes et al. 00] Spirtes, P., Glymour, C., and Scheines, R., Causation, Prediction, and Search, 2nd edition, MIT Press, 2000.
[Frey 98] Frey, B.J., Graphical Models for Machine Learning and Digital Communication, MIT Press, 1998.
[Friedman and Goldszmidt 99] Friedman, N. and Goldszmidt, M., Learning Bayesian networks with local structure, in Learning in Graphical Models, pp. 421-460, MIT Press, 1999.
[Heckerman et al. 95] Heckerman, D., Geiger, D., and Chickering, D.M., Learning Bayesian networks: the combination of knowledge and statistical data, Technical Report MSR-TR-94-09, Microsoft Research, 1995.
[Verma and Pearl 90] Verma, T. and Pearl, J., Equivalence and synthesis of causal models, in Proceedings of UAI '90, pp. 220-227, 1990.
http://www.ai.mit.edu/~murphyk/bayes/bnintro.html