Probabilistic Graphical Networks: Definitions and Basic Results

This document gives a cursory overview of Probabilistic Graphical Networks. The material has been gleaned from different sources, and I make no claim to original authorship of it. Bayesian Graphical Models are not really models, but are in fact a tool for describing conditional independencies between random variables in a joint probability distribution, and a tool for visualizing inference algorithms for a particular probability distribution; examples of such algorithms are dynamic programming and Markov chain Monte Carlo (MCMC). Exploiting the properties of the graph allows for efficient inference, which is the main reason why PGMs are useful. One can go from a specified acyclic graph to a joint probability distribution, or from a known distribution to a directed acyclic graph. I begin with a few definitions.

Theorem 0.1 (General Multiplication Rule ("multi-mult")) For any $n$ events $A_1, A_2, \ldots, A_n$ on a probability space $\Omega$,

$$P\!\left(\bigcap_{i=1}^{n} A_i\right) = P(A_1 \cap A_2 \cap \cdots \cap A_n) = P(A_1)\,P(A_2 \mid A_1)\,P(A_3 \mid A_1 \cap A_2)\cdots P(A_n \mid A_1 \cap A_2 \cap \cdots \cap A_{n-1}).$$

That is, the probability that multiple events occur jointly can be computed as the product of a series of conditional probabilities.

Definition 0.2 (Probabilistic Graphical Network (PGN)) A PGN (sometimes called a Bayesian network, belief network, or probabilistic graphical model, PGM) $B$ is an annotated directed acyclic graph (DAG) that represents a joint probability distribution over a set of random variables $V$. The network is defined by a graph and a parameter set, $B = \langle G, \Theta \rangle$, where $G$ is the DAG whose nodes $X_1, X_2, \ldots, X_n$ represent random variables and whose edges (links between nodes) represent the dependencies between variables. The graph $G$ encodes independence assumptions: each variable $X_i$ is independent of its non-descendants given its parents in $G$. The second component, $\Theta$, is the set of parameters. It contains a parameter

$$\theta_{x_i \mid \pi_i} = P_B(x_i \mid \pi_i)$$

for every realization $x_i$ of $X_i$ conditioned on $\pi_i$, the set of parents of $X_i$ in graph $G$. $B$ defines a unique joint probability distribution (JPDF) over $V$ such that

$$P_B(X_1, X_2, \ldots, X_n) = \prod_{i=1}^{n} P_B(X_i \mid \pi_i) = \prod_{i=1}^{n} \theta_{X_i \mid \pi_i}.$$

A Bayes network is so called not because it necessarily uses Bayesian statistics (although it can), but because it uses Bayes' theorem to calculate conditional probabilities (and thus requires summation or integration). In general, the full summation (integration) over discrete (continuous) variables is called exact inference and is known to be an NP-hard problem. An example of a PGM is given by the following probability distribution and graph:

$$P(X_1, \ldots, X_6) = \prod_{i=1}^{6} P_B(X_i \mid \pi_i) = p(x_1)\,p(x_2 \mid x_1)\,p(x_3 \mid x_1)\,p(x_4 \mid x_2, x_3)\,p(x_5 \mid x_2)\,p(x_6 \mid x_4)$$

Figure 1: An example DAG

*Note that the arrow structure does not necessarily imply dependence; only independence can be read off a graph. Thus, there are often multiple ways to graph the same probability distribution.

Approximate inference, used to estimate the various probabilities of interest, is accomplished via stochastic sampling methods. A variety of Markov chain Monte Carlo (MCMC) techniques, including Gibbs sampling and the Metropolis-Hastings algorithm, are commonly used for approximate inference. There are other methods for approximate inference as well, such as loopy belief propagation and variational methods.
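To make the factorization concrete, the following is a minimal Python sketch (not part of the original notes) that evaluates the Figure 1 joint for binary variables. The DAG structure is read off the factorization above; every CPT number is a hypothetical placeholder chosen only for illustration.

```python
# Minimal sketch: evaluate the Figure 1 factorization
#   P(x1,...,x6) = p(x1) p(x2|x1) p(x3|x1) p(x4|x2,x3) p(x5|x2) p(x6|x4)
# for binary variables. All CPT numbers below are hypothetical placeholders.
from itertools import product

# Each CPT maps parent values to P(X_i = 1 | parents); P(X_i = 0 | .) is the complement.
p_x1 = 0.6                                    # P(X1 = 1)
p_x2 = {0: 0.3, 1: 0.8}                       # P(X2 = 1 | X1)
p_x3 = {0: 0.5, 1: 0.1}                       # P(X3 = 1 | X1)
p_x4 = {(0, 0): 0.1, (0, 1): 0.4,
        (1, 0): 0.7, (1, 1): 0.9}             # P(X4 = 1 | X2, X3)
p_x5 = {0: 0.2, 1: 0.6}                       # P(X5 = 1 | X2)
p_x6 = {0: 0.05, 1: 0.75}                     # P(X6 = 1 | X4)

def bernoulli(p_one, value):
    """Return P(X = value) given P(X = 1) = p_one."""
    return p_one if value == 1 else 1.0 - p_one

def joint(x1, x2, x3, x4, x5, x6):
    """Joint probability as the product of the local conditionals."""
    return (bernoulli(p_x1, x1)
            * bernoulli(p_x2[x1], x2)
            * bernoulli(p_x3[x1], x3)
            * bernoulli(p_x4[(x2, x3)], x4)
            * bernoulli(p_x5[x2], x5)
            * bernoulli(p_x6[x4], x6))

print(joint(1, 1, 0, 1, 0, 1))   # probability of one particular assignment

# Sanity check: summing over all 2^6 assignments (brute-force exact inference)
# recovers a total probability of 1; this exhaustive summation is exactly what
# becomes intractable for large networks.
total = sum(joint(*assignment) for assignment in product([0, 1], repeat=6))
print(round(total, 10))          # ~1.0
```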

Types of graphical connections:

- Direct connection
- Serial connection
- Diverging connection (common cause)
- Converging connection (common effect)

Figure 2: An example of a PGN for Bayesian Linear Regression

Another example of PGN use is Bayesian naïve Bayes classification. A quick digression: a standard naïve Bayes classifier is built from observed categorical labels $Y_i$ taking values in some finite set, with each observation having a corresponding $d$-dimensional feature vector $X^{(i)} = \{X^{(i)}_1, X^{(i)}_2, \ldots, X^{(i)}_d\}$. This is a basic machine learning technique for classification. The reason it is "naïve" is that the model assumes each component $X^{(i)}_j$ of the feature vector is independent of the other components, conditional on $Y_i$. This means that for a single observation tuple $(Y, X)$ with fixed observation index $i$,

$$P_\theta(Y_i \mid X^{(i)}) \;=\; \frac{P_\theta(Y_i)\,P_\theta(X^{(i)} \mid Y_i)}{P_\theta(X^{(i)})} \quad \text{(the "Bayes" part)} \quad \propto\; P_\theta(Y_i) \prod_{j=1}^{d} P_\theta\!\left(X^{(i)}_j \mid Y_i\right) \quad \text{("naïvely").}$$

Classification proceeds by choosing a probability model for $Y$ and for each component of $X$, estimating $\theta$ from the data, and finding the maximum a posteriori (MAP) estimate of $Y$: that is, the value of $Y$ that maximizes the above expression. Turning the subscript $\theta$ into a conditioning term, we can put a prior on $\theta$ and obtain a posterior distribution. For a categorical scenario with $m$ classes, denote $P(y \mid \theta) = \pi(y)$, with $\pi = (\pi(1), \ldots, \pi(m))$, and $P(x_j \mid Y = y, \theta) = r_{jy}(x_j)$, so that $\theta = (\pi, \{r_{jy}\})$ ranging over features $j = 1, \ldots, d$ and classes $y = 1, \ldots, m$. The next step is to choose priors. Starting with $\pi$, we take $P(\pi) = \mathrm{Dirichlet}(\pi \mid \alpha)$, a multivariate generalization of the Beta distribution. Then $P(r_{j,y=k}) = \mathrm{Dirichlet}(r_{j,y=k} \mid \beta)$. Assuming marginal independence among the components of $\theta$, we have

$$P(\theta) = P(\pi) \prod_{j=1}^{d} \prod_{k=1}^{m} P(r_{j,y=k}).$$

We would like to know the appropriate $Y$ for a new $X$, so for new data we want the predictive distribution of $Y$ given $X$ and the observed data $D$. The PGN for this model can be written as:

Figure 3: An example of a PGN for Bayesian Naïve Bayes Classification
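Because the Dirichlet prior is conjugate to the categorical likelihood, the predictive distribution of $Y$ for a new $X$ can be computed in closed form from counts. The sketch below is a minimal illustration of that computation, assuming symmetric hyperparameters $\alpha$ and $\beta$ and fully observed training data; the data values and all names in the code are hypothetical, not taken from the notes.

```python
import numpy as np

# Minimal sketch of the Dirichlet-categorical posterior predictive for
# Bayesian naive Bayes. With symmetric priors Dir(alpha) on pi and Dir(beta)
# on each r_{j,y=k}, conjugacy gives
#   P(Y=k | D)            proportional to  (alpha + n_k)
#   P(X_j = v | Y=k, D)   =  (beta + n_{k,j,v}) / (V_j * beta + n_k)
# where n_k counts class k and n_{k,j,v} counts value v of feature j in class k.

alpha, beta = 1.0, 1.0          # hypothetical hyperparameters

# Hypothetical training data: rows are observations, columns are categorical features.
X = np.array([[0, 1], [0, 1], [1, 0], [1, 1], [1, 0]])
y = np.array([0, 0, 1, 1, 1])

classes = np.unique(y)
n_features = X.shape[1]
n_values = [int(X[:, j].max()) + 1 for j in range(n_features)]  # V_j per feature

def log_posterior_predictive(x_new):
    """Return log P(Y=k | x_new, D) up to an additive constant, for each class k."""
    scores = []
    for k in classes:
        Xk = X[y == k]
        n_k = len(Xk)
        score = np.log(alpha + n_k)                       # class (pi) term
        for j in range(n_features):
            n_kjv = np.sum(Xk[:, j] == x_new[j])          # count of this feature value
            score += np.log(beta + n_kjv) - np.log(n_values[j] * beta + n_k)
        scores.append(score)
    return np.array(scores)

scores = log_posterior_predictive([1, 0])
probs = np.exp(scores - scores.max())
probs /= probs.sum()
print(dict(zip(classes.tolist(), probs.round(3))))   # predictive class probabilities
```

The MAP classification is then simply the class with the largest predictive probability.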

Definition 0.3 (Ancestral Set) Let $X$ be a set of nodes in a network $N$. Then the ancestral set $\mathrm{an}(X)$ of $X$ consists of all nodes in $X$ together with all of their ancestors. If $\mathrm{an}(X) = X$, then we say $X$ is ancestral.

Definition 0.4 (Leaf Node) A leaf node is a node without any child nodes.

Definition 0.5 (Complete DAG) A complete DAG has a directed edge between every pair of vertices.

Definition 0.6 (Blocked Path) A path between nodes $X$ and $Y$ is blocked by a set $Z$ if

1. the path contains a node $W$ that is in $Z$ and the connection at $W$ is either serial or diverging, or

2. the path contains a node $W$ such that neither $W$ nor any of its descendants is in $Z$ and the connection at $W$ is a converging connection ($W$ is a collision node).

Definition 0.7 (d-separation) Two nodes $X$ and $Y$ are d-separated by a set $Z$ if all paths between $X$ and $Y$ are blocked by $Z$. If this is so, then $X \perp Y \mid Z$.

*Note: the proof that d-separation implies $X \perp Y \mid Z$ (given the usual set of assumptions: acyclicity, etc.) is shown in the N.L. Zhang slides, Introduction to Bayesian Networks, Lecture 3, pp. 25-26. This is the so-called Global Markov Property of Bayes networks.

For a given graph $G$ we want to know whether $X_i$ is independent of $X_j$ given a conditioning set $C$. In the example below, we ask whether any two given nodes are independent given $X_3$, the shaded node. (An algorithmic version of this check is sketched after Table 1.)

Figure 4: An example of a PGN conditioned on node 3 (red dots are annotated blockages)

Table 1: Are nodes $i$ and $j$ conditionally independent given node 3?

 i   j   d-sep?
 1   4   yes
 1   2   no
 4   5   yes
 4   7   no
 4   8   yes
 4   6   no
 2   7   yes
 2   5   yes
 5   8   yes
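A d-separation query like those in Table 1 can be checked algorithmically with the moralized ancestral graph: restrict to the ancestral set of $X \cup Y \cup Z$, marry co-parents and drop edge directions, delete $Z$, and test whether $X$ and $Y$ remain connected. Below is a minimal Python sketch of this check (the helper names are mine, not from the notes); since the edge list of Figure 4 is not reproduced here, the example queries use the Figure 1 DAG instead.

```python
from itertools import combinations

# Example DAG: the Figure 1 network; an edge (i, j) means i -> j.
edges = [(1, 2), (1, 3), (2, 4), (3, 4), (2, 5), (4, 6)]

def parents(node):
    return {u for (u, v) in edges if v == node}

def ancestral_set(nodes):
    """All nodes in `nodes` plus all of their ancestors (Definition 0.3)."""
    result, frontier = set(nodes), set(nodes)
    while frontier:
        new = set().union(*(parents(n) for n in frontier)) - result
        result |= new
        frontier = new
    return result

def d_separated(x, y, z):
    """True if node x and node y are d-separated by the set z."""
    keep = ancestral_set({x, y} | set(z))
    # Moralize: undirected skeleton of the ancestral subgraph plus edges
    # between co-parents of a common child.
    undirected = {frozenset(e) for e in edges if set(e) <= keep}
    for child in keep:
        for a, b in combinations(parents(child) & keep, 2):
            undirected.add(frozenset((a, b)))
    # Delete the conditioning set and test reachability from x to y.
    reachable = frontier = {x} - set(z)
    while frontier:
        nxt = {w for e in undirected for w in e
               if (e - set(z)) == e and e & frontier and w not in reachable}
        reachable |= nxt
        frontier = nxt
    return y not in reachable

print(d_separated(2, 3, {1}))      # True: X1 blocks the diverging path, the collider X4 blocks the other
print(d_separated(2, 3, {1, 4}))   # False: conditioning on the collider X4 opens the path X2 -> X4 <- X3
```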

Definition 0.8 (Local Markov Property) In a PGM, a variable $X$ is independent of all of its non-descendants, given its parents.

*Note: when constructing a causal network in the fashion of a PGM, we are implicitly assuming that causality implies the local Markov property.

Definition 0.9 (Markov Blanket) The Markov blanket (MB) of a node $X$ is the set consisting of the parents of $X$, the children of $X$, and the parents of the children of $X$. A corollary of this definition is that in a PGM, a variable $X$ is conditionally independent of all remaining variables given the values of its Markov blanket (since the MB d-separates $X$ from all other nodes in the network).
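As a final small illustration (again not from the original notes), here is a minimal Python sketch that reads the Markov blanket of each node directly off a DAG's edge list, using the Figure 1 network as the example.

```python
# Minimal sketch: compute the Markov blanket of a node in a DAG given as an
# edge list of pairs (u, v) meaning u -> v. Example edges are the Figure 1 network.

edges = [(1, 2), (1, 3), (2, 4), (3, 4), (2, 5), (4, 6)]

def markov_blanket(node, edges):
    """Parents of `node`, children of `node`, and parents of those children."""
    parents = {u for (u, v) in edges if v == node}
    children = {v for (u, v) in edges if u == node}
    co_parents = {u for (u, v) in edges if v in children and u != node}
    return parents | children | co_parents

for node in sorted({n for e in edges for n in e}):
    print(node, sorted(markov_blanket(node, edges)))
# e.g. the blanket of X2 is [1, 3, 4, 5]: its parent X1, its children X4 and X5,
# and X3, the other parent of X4.
```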