Directed Graphical Models or Bayesian Networks

Directed Graphical Models or Bayesian Networks
Le Song
Machine Learning II: Advanced Topics, CSE 8803ML, Spring 2012

Bayesian Networks
- One of the most exciting recent advancements in statistical AI
- Compact representation for exponentially large probability distributions
- Fast marginalization algorithms that exploit conditional independencies
- Different from undirected graphical models!
(Figure: example network with nodes Allergy, Flu, Sinus, Nose, Headache)

Handwriting recognition
(Figure: handwritten word example)

Handwriting recognition
(Figure: the letters of a word modeled as a chain of variables X_1, X_2, X_3, X_4, X_5)

Summary: basic concepts for random variables
- Outcome: assign values x_1, ..., x_n to X_1, ..., X_n
- Conditional probability: P(X, Y) = P(X) P(Y | X)
- Bayes rule: P(Y | X) = P(X | Y) P(Y) / P(X)
- Chain rule: P(X_1, ..., X_n) = P(X_1) P(X_2 | X_1) ... P(X_n | X_1, ..., X_{n-1})

Summary: conditional independence
- X is independent of Y given Z if P(X = x | Y = y, Z = z) = P(X = x | Z = z) for all x in Val(X), y in Val(Y), z in Val(Z)
- Shorthand: (X ⊥ Y | Z); for (X ⊥ Y | ∅), write X ⊥ Y
- Proposition: (X ⊥ Y | Z) if and only if P(X, Y | Z) = P(X | Z) P(Y | Z)

Representation of Bayesian Networks
- Consider P(X_i): assign a probability to each x_i in Val(X_i). If |Val(X_i)| = k, how many independent parameters? k − 1
- Consider P(X_1, ..., X_n): how many independent parameters if |Val(X_i)| = k? k^n − 1
- Bayesian networks can represent the same joint probability with fewer parameters
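As a quick check of these counts, here is a minimal sketch (helper names are my own, not from the slides) comparing a full joint table over n variables with k values each against a fully factorized, all-independent model:

```python
# Compare parameter counts: full joint table vs. fully independent variables.
def full_joint_params(n, k):
    """Full table P(X_1, ..., X_n): k**n entries, minus 1 for normalization."""
    return k ** n - 1

def fully_independent_params(n, k):
    """If all X_i are independent: n separate P(X_i), each with k - 1 parameters."""
    return n * (k - 1)

print(full_joint_params(5, 2))         # 31
print(fully_independent_params(5, 2))  # 5
```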

Simple Bayesian Networks

What if all variables are independent?
- Is it enough to have (X_i ⊥ X_j) for all i, j? Not enough!
- Must assume (X ⊥ Y) for all disjoint subsets X, Y of {X_1, ..., X_n}, e.g. ({X_1, X_3} ⊥ {X_2, X_4}), (X_1 ⊥ {X_7, X_8, X_9}), ...
- Then we can write P(X_1, ..., X_n) = ∏_{i=1}^{n} P(X_i)
- This is a Bayesian network with no edges: nodes X_1, X_2, X_3, ..., X_n and no arrows
- How many independent parameters now? n(k − 1)

Conditional parameterization: two nodes
Grade (G) is determined by intelligence (I): I → G

P(I):
  I = VH: 0.85    I = H: 0.15

P(G | I):
            I = VH   I = H
  G = A      0.9      0.5
  G = B      0.1      0.5

P(I = VH, G = B) = P(I = VH) P(G = B | I = VH) = 0.85 × 0.1 = 0.085
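A minimal sketch of this two-node network, with the CPTs stored as plain dictionaries (the data-structure choice is mine, not from the slides):

```python
# Two-node network I -> G with the CPTs from the slide.
P_I = {"VH": 0.85, "H": 0.15}
P_G_given_I = {"VH": {"A": 0.9, "B": 0.1},
               "H":  {"A": 0.5, "B": 0.5}}

def joint_IG(i, g):
    """P(I = i, G = g) = P(I = i) * P(G = g | I = i)."""
    return P_I[i] * P_G_given_I[i][g]

print(joint_IG("VH", "B"))  # 0.85 * 0.1 ≈ 0.085
```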

Conditional parameterization: three nodes
Grade (G) and SAT score (S) are determined by intelligence (I): G ← I → S, with CPTs P(I), P(G | I), P(S | I)
Assume (G ⊥ S | I), so P(S | G, I) = P(S | I).
P(I, G, S) = P(I) P(G | I) P(S | I). Why?
By the chain rule, P(I, G, S) = P(I) P(G | I) P(S | G, I); using the conditional independence, this equals P(I) P(G | I) P(S | I).

The naïve Bayes model
- Class variable: C; evidence variables: X_1, ..., X_n, with edges C → X_i
- Assume (X ⊥ Y | C) for all disjoint subsets X, Y of {X_1, ..., X_n}
- P(C, X_1, ..., X_n) = P(C) ∏_i P(X_i | C). Why?
- Chain rule: P(C, X_1, ..., X_n) = P(C) P(X_1 | C) P(X_2 | C, X_1) ... P(X_n | C, X_1, ..., X_{n-1})
- P(X_2 | C, X_1) = P(X_2 | C) since (X_2 ⊥ X_1 | C); ...; P(X_n | C, X_1, ..., X_{n-1}) = P(X_n | C) since (X_n ⊥ {X_1, ..., X_{n-1}} | C)
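A minimal sketch of the naïve Bayes factorization P(C, x_1, ..., x_n) = P(C) ∏_i P(x_i | C); the class values and CPT numbers below are illustrative assumptions, not from the slides:

```python
# Naive Bayes joint: product of the class prior and one CPT per evidence variable.
P_C = {"spam": 0.3, "ham": 0.7}
P_X_given_C = [
    {"spam": {1: 0.8, 0: 0.2}, "ham": {1: 0.1, 0: 0.9}},  # P(X_1 | C)
    {"spam": {1: 0.6, 0: 0.4}, "ham": {1: 0.3, 0: 0.7}},  # P(X_2 | C)
]

def naive_bayes_joint(c, xs):
    """P(C = c, X_1 = x_1, ..., X_n = x_n) under the naive Bayes assumption."""
    p = P_C[c]
    for cpt, x in zip(P_X_given_C, xs):
        p *= cpt[c][x]
    return p

print(naive_bayes_joint("spam", [1, 0]))  # 0.3 * 0.8 * 0.4 ≈ 0.096
```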

More Complicated Bayesian Networks and an Example

Causal structure
We will learn the semantics of Bayesian networks (BNs) and relate them to the independence assumptions encoded by the graph.
Suppose we know the following:
- The flu (F) causes sinus inflammation (S)
- Allergies (A) cause sinus inflammation
- Sinus inflammation causes a runny nose (N)
- Sinus inflammation causes headaches (H)
How are these variables connected?
(Figure: Flu → Sinus ← Allergy; Sinus → Nose; Sinus → Headache)

Possible queries
- Probabilistic inference, e.g. P(A = t | H = t, N = f)
- Most probable explanation, e.g. max_{F,A,S} P(F, A, S | H = t, N = t)
- Active data collection: what is the next best variable to observe?

Car starts BN
- 18 binary variables, joint P(Alternator, FanBelt, ..., Spark, ..., Starts)
- Inference: P(BatteryAge | Starts = f) requires summing the joint over 2^16 terms. Why is the BN so fast? It uses the sparse graph structure.
- For the HailFinder BN in JavaBayes: more than 3^54 = 58149737003040059690390169 terms

Factored joint distribution (preview)
(Figure: the flu network with CPTs P(A), P(F), P(S | F, A), P(N | S), P(H | S) attached to the nodes)
P(F, A, S, H, N) = P(F) P(A) P(S | F, A) P(H | S) P(N | S)
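A minimal sketch of this factored joint. P(F), P(A) and P(S | F, A) use the numbers that appear on later slides; P(N | S) and P(H | S) are made-up values for illustration:

```python
# Factored joint for the flu network: P(F) P(A) P(S|F,A) P(H|S) P(N|S).
P_F_t = 0.1                      # P(F = t)
P_A_t = 0.3                      # P(A = t)
P_S_t = {(True, True): 0.9, (True, False): 0.7,
         (False, True): 0.8, (False, False): 0.2}   # P(S = t | F, A)
P_N_t = {True: 0.8, False: 0.1}                     # P(N = t | S), assumed
P_H_t = {True: 0.7, False: 0.05}                    # P(H = t | S), assumed

def bern(p_true, value):
    return p_true if value else 1.0 - p_true

def flu_joint(f, a, s, h, n):
    """P(F, A, S, H, N) as the product of the five CPT entries."""
    return (bern(P_F_t, f) * bern(P_A_t, a) * bern(P_S_t[(f, a)], s)
            * bern(P_H_t[s], h) * bern(P_N_t[s], n))

print(flu_joint(True, False, True, True, False))  # P(F=t, A=f, S=t, H=t, N=f)
```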

Number of parameters
P(F): 1 parameter; P(A): 1 parameter
P(S | F, A): 4 parameters

  F, A =      tt    tf    ft    ff
  S = t       0.9   0.7   0.8   0.2
  S = f       0.1   0.3   0.2   0.8

P(N | S): 2 parameters; P(H | S): 2 parameters
The Bayesian network has 1 + 1 + 4 + 2 + 2 = 10 parameters.
The full probability table P(F, A, S, H, N) explicitly has 2^5 − 1 = 31 parameters.
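The same count can be read off the parent sets. A minimal sketch (helper of my own) for a network of binary variables:

```python
# Count CPT parameters of a binary BN from its parent sets, vs. the full table.
flu_parents = {"F": [], "A": [], "S": ["F", "A"], "N": ["S"], "H": ["S"]}

def bn_param_count(parents):
    """Each binary node contributes (2 - 1) * 2**(number of parents) parameters."""
    return sum(2 ** len(pa) for pa in parents.values())

def full_table_param_count(n_vars):
    return 2 ** n_vars - 1

print(bn_param_count(flu_parents))               # 1 + 1 + 4 + 2 + 2 = 10
print(full_table_param_count(len(flu_parents)))  # 2**5 - 1 = 31
```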

Key: independence assumptions
- Headache and Nose are not marginally independent, but (H ⊥ N | S)
- Allergy and Nose are not marginally independent, but (A ⊥ N | S)
- Knowing Sinus separates the symptom variables from each other

Marginal independence
Flu and Allergy are (marginally) independent: (F ⊥ A), so P(F, A) = P(F) P(A).
More generally, if (X ⊥ Y) for all disjoint subsets X, Y of {X_1, ..., X_n}, then P(X_1, ..., X_n) = ∏_i P(X_i).

  P(Flu = t) = 0.1, P(Flu = f) = 0.9; P(Al = t) = 0.3, P(Al = f) = 0.7

  P(F, A)        Flu = t            Flu = f
  Al = t     0.1 × 0.3 = 0.03   0.9 × 0.3 = 0.27
  Al = f     0.1 × 0.7 = 0.07   0.9 × 0.7 = 0.63

Conditional independence
- Flu and Headache are not (marginally) independent: P(F | H) ≠ P(F), P(F, H) ≠ P(F) P(H)
- Flu and Headache are independent given Sinus infection: (F ⊥ H | S), i.e. P(F | H, S) = P(F | S) and P(F, H | S) = P(F | S) P(H | S)
- More generally: if X_1, X_2, ..., X_n are independent of each other given C, then P(X_1, X_2, ..., X_n | C) = ∏_i P(X_i | C)

The conditional independence assumption
Local Markov assumption: a variable X is independent of its non-descendants given its parents, and only its parents: (X ⊥ NonDescendants_X | Pa_X)
- Flu: Pa(Flu) = ∅, NonDescendants(Flu) = {A}, so (F ⊥ A)
- Nose: Pa(Nose) = {S}, NonDescendants(Nose) = {F, A, H}, so (N ⊥ {F, A, H} | S)
- Sinus: Pa(Sinus) = {F, A}, NonDescendants(Sinus) = ∅, so the local Markov assumption makes no assertion about S given F, A

Explaining away
Local Markov assumption: a variable X is independent of its non-descendants given its parents, and only its parents: (X ⊥ NonDescendants_X | Pa_X). The local Markov assumption does not imply (F ⊥ A | S).
- XOR example: P(A = t) = 0.5, P(F = t) = 0.5, S = F XOR A. Then S = t, A = t implies F = f, i.e. P(F = t | S = t, A = t) = 0.
- Another example: P(F = t) = 0.2, P(F = t | S = t) = 0.5, P(F = t | S = t, A = t) = 0.3, so P(F = t) < P(F = t | S = t, A = t) < P(F = t | S = t). Knowing A = t lowers the probability of F = t (A = t explains away F = t).
- This depends on P(S | F, A); it could also be that P(F = t | S = t, A = t) > P(F = t | S = t).
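The explaining-away effect can be checked numerically by brute-force enumeration over the v-structure F → S ← A. A minimal sketch, with CPT numbers of my own choosing that reproduce the effect (they are not the slide's hidden parameters):

```python
# Explaining away: P(F=t | S=t, A=t) drops below P(F=t | S=t).
from itertools import product

P_F = {True: 0.2, False: 0.8}
P_A = {True: 0.3, False: 0.7}
P_S_given_FA = {(True, True): 0.9, (True, False): 0.7,
                (False, True): 0.8, (False, False): 0.05}  # P(S=t | F, A)

def joint(f, a, s):
    p_s = P_S_given_FA[(f, a)]
    return P_F[f] * P_A[a] * (p_s if s else 1.0 - p_s)

def conditional(query, evidence):
    """P(query | evidence) by summing the joint over the unobserved variables."""
    def total(fixed):
        return sum(joint(f, a, s)
                   for f, a, s in product([True, False], repeat=3)
                   if all({"F": f, "A": a, "S": s}[k] == v for k, v in fixed.items()))
    return total({**evidence, **query}) / total(evidence)

print(conditional({"F": True}, {"S": True}))             # P(F=t | S=t)      ~0.41
print(conditional({"F": True}, {"S": True, "A": True}))  # P(F=t | S=t, A=t) ~0.22
```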

Why can we decompose the joint distribution?
Chain rule and local Markov assumption:
- Pick a topological ordering of the nodes, e.g. Flu = 1, Allergy = 2, Sinus = 3, Headache = 4, Nose = 5
- Interpret the BN using the chain rule in that order: P(N, H, S, A, F) = P(F) P(A | F) P(S | F, A) P(H | S, F, A) P(N | S, F, A, H)
- Use the conditional independence assumptions: (A ⊥ F), (H ⊥ {F, A} | S), (N ⊥ {F, A, H} | S)
- Result: P(N, H, S, A, F) = P(F) P(A) P(S | F, A) P(H | S) P(N | S)

General Bayesian Networks

A general Bayes net
- Set of random variables X_1, ..., X_n
- Directed acyclic graph (DAG): loops (undirected cycles) are ok, but no directed cycles
- Local Markov assumption: a variable X is independent of its non-descendants given its parents, and only its parents: (X ⊥ NonDescendants_X | Pa_X)
- Conditional probability tables (CPTs) P(X_i | Pa_{X_i}) for each X_i
- Joint distribution: P(X_1, ..., X_n) = ∏_i P(X_i | Pa_{X_i})
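A minimal sketch (my own data structure, not a library API) of a general discrete Bayes net: a DAG given by parent sets, one CPT per node, and the joint evaluated as the product of P(X_i | Pa_{X_i}):

```python
# Generic discrete Bayesian network with joint evaluation.
class BayesNet:
    def __init__(self):
        self.parents = {}  # node -> tuple of parent names
        self.cpts = {}     # node -> {parent-value tuple: {node value: prob}}

    def add_node(self, name, parents, cpt):
        self.parents[name] = tuple(parents)
        self.cpts[name] = cpt

    def joint(self, assignment):
        """P(x_1, ..., x_n) = prod_i P(x_i | pa_i) for a full assignment dict."""
        p = 1.0
        for name, pa in self.parents.items():
            pa_vals = tuple(assignment[q] for q in pa)
            p *= self.cpts[name][pa_vals][assignment[name]]
        return p

# Tiny usage example with two binary nodes F -> S (numbers are illustrative).
bn = BayesNet()
bn.add_node("F", [], {(): {True: 0.1, False: 0.9}})
bn.add_node("S", ["F"], {(True,): {True: 0.7, False: 0.3},
                         (False,): {True: 0.2, False: 0.8}})
print(bn.joint({"F": True, "S": True}))  # 0.1 * 0.7 ≈ 0.07
```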

Questions
- What distributions can be represented by a Bayesian network?
- What Bayesian networks can represent a given distribution?
- What are the independence assumptions encoded by a BN, in addition to the local Markov assumptions?
Example (chain A → B → C → D):
- Local Markov: (A ⊥ C | B), (D ⊥ {A, B} | C)
- Derived independence: (A ⊥ D | B)

Conditional independence in the problem
- World, data, reality: a true distribution P, which contains conditional independence assertions I(P)
- Bayesian network: a graph G that encodes local independence assumptions I_l(G)
- Key representational assumption: I_l(G) ⊆ I(P)

The representation theorem: true conditional independence ⟹ BN factorization
The BN encodes local conditional independence assumptions I_l(G).
If the local conditional independencies of the BN are a subset of the conditional independencies of P, i.e. I_l(G) ⊆ I(P),
then the joint probability P can be written as P(X_1, ..., X_n) = ∏_i P(X_i | Pa_{X_i}).

The representation theorem: BN factorization ⟹ true conditional independence
The BN encodes local conditional independence assumptions I_l(G).
If the joint probability P can be written as P(X_1, ..., X_n) = ∏_i P(X_i | Pa_{X_i}),
then the local conditional independencies of the BN are a subset of the conditional independencies of P, i.e. I_l(G) ⊆ I(P).

Example: naïve Bayes (true conditional independence ⟹ BN factorization)
- Independence assumptions: the X_i are independent of each other given C, i.e. (X ⊥ Y | C) for all disjoint subsets X, Y of {X_1, ..., X_n}, e.g. ({X_1, X_3} ⊥ {X_2, X_4} | C), (X_1 ⊥ {X_7, X_8, X_9} | C), ... This is the same as the local conditional independence.
- To prove: P(C, X_1, ..., X_n) = P(C) ∏_i P(X_i | C)
- Use the chain rule and the local Markov property: P(C, X_1, ..., X_n) = P(C) P(X_1 | C) P(X_2 | C, X_1) ... P(X_n | C, X_1, ..., X_{n-1}); then P(X_2 | C, X_1) = P(X_2 | C) since (X_2 ⊥ X_1 | C), ..., and P(X_n | C, X_1, ..., X_{n-1}) = P(X_n | C) since (X_n ⊥ {X_1, ..., X_{n-1}} | C)

Example: naïve Bayes (BN factorization ⟹ true conditional independence)
- Assume P(C, X_1, ..., X_n) = P(C) ∏_i P(X_i | C)
- To prove: the X_i are independent of each other given C, i.e. (X ⊥ Y | C) for all disjoint subsets X, Y of {X_1, ..., X_n}, e.g. ({X_1, X_3} ⊥ {X_2, X_4} | C), (X_1 ⊥ {X_7, X_8, X_9} | C), ...
- E.g. for n = 4:
  P(X_1, X_2 | C) = P(X_1, X_2, C) / P(C)
                  = ( Σ_{x_3, x_4} P(X_1, X_2, x_3, x_4, C) ) / P(C)
                  = (1 / P(C)) Σ_{x_3, x_4} P(C) P(X_1 | C) P(X_2 | C) P(x_3 | C) P(x_4 | C)
                  = P(X_1 | C) P(X_2 | C) ( Σ_{x_3} P(x_3 | C) ) ( Σ_{x_4} P(x_4 | C) )
                  = P(X_1 | C) P(X_2 | C) · 1 · 1

How about the general case?
The BN encodes local conditional independence assumptions I_l(G).
- If I_l(G) ⊆ I(P), i.e. the BN is an I-map of P, then the joint probability can be written as P(X_1, ..., X_n) = ∏_i P(X_i | Pa_{X_i}): P factorizes according to the BN.
- If the joint probability can be written as P(X_1, ..., X_n) = ∏_i P(X_i | Pa_{X_i}), then I_l(G) ⊆ I(P): we can read independencies of P from the BN structure G.
- Every P has at least one BN structure G.

Proof from I-map to factorization
- Topological ordering of X_1, X_2, ..., X_n: number the variables so that every parent has a lower number than its child (X_i → X_j implies i < j), i.e. each variable has a lower number than all its descendants. (Example: Flu = 1, Allergy = 2, Sinus = 3, Nose = 4, Headache = 5)
- Use the chain rule: P(X_1, ..., X_n) = P(X_1) P(X_2 | X_1) ... P(X_n | X_1, ..., X_{n-1})
- For each factor P(X_i | X_1, ..., X_{i-1}): Pa_{X_i} ⊆ {X_1, ..., X_{i-1}}, and there is no descendant of X_i in {X_1, ..., X_{i-1}}
- By the local Markov assumption (X ⊥ NonDescendants_X | Pa_X), P(X_i | X_1, ..., X_{i-1}) = P(X_i | Pa_{X_i})
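The topological ordering used in this proof can be computed mechanically. A minimal sketch using standard Kahn's algorithm (my own helper, not from the slides), applied to the flu network:

```python
# Topological ordering of a DAG given as parent sets (Kahn's algorithm).
from collections import deque

def topological_order(parents):
    """parents: dict node -> iterable of parent names. Parents come before children."""
    children = {v: [] for v in parents}
    indegree = {v: len(set(parents[v])) for v in parents}
    for v, pas in parents.items():
        for p in set(pas):
            children[p].append(v)
    queue = deque(v for v, d in indegree.items() if d == 0)
    order = []
    while queue:
        v = queue.popleft()
        order.append(v)
        for c in children[v]:
            indegree[c] -= 1
            if indegree[c] == 0:
                queue.append(c)
    if len(order) != len(parents):
        raise ValueError("graph has a directed cycle")
    return order

print(topological_order({"F": [], "A": [], "S": ["F", "A"], "N": ["S"], "H": ["S"]}))
# e.g. ['F', 'A', 'S', 'N', 'H']
```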

Adding edges doesn't hurt
- Let G be an I-map for P; any DAG G' that includes all the directed edges of G is also an I-map of P. G' is strictly more expressive than G.
- If G is an I-map for P, then adding edges still results in an I-map.
- Example: add an edge A → N to the flu network. The CPT becomes P(N | S, A); since (N ⊥ A | S) holds in P, P(N | S, A = t) = P(N | S, A = f), i.e. P(N | S) = P(N | S, A).

Minimal I-maps
G is a minimal I-map for P if deleting any edge from G makes it no longer an I-map.
E.g., if (A ⊥ B) holds but (A ⊥ B | C) does not, a minimal I-map (for the ordering A, B, C) is A → C ← B; removing either edge makes it no longer an I-map.
Obtaining a minimal I-map given a set of variables and conditional independence assertions:
- Choose an ordering on the variables X_1, ..., X_n
- For i = 1 to n:
  - Add X_i to the network
  - Define the parents of X_i, Pa_{X_i}, as the minimal subset of {X_1, ..., X_{i-1}} such that the local Markov assumption holds
  - Define/learn the CPT P(X_i | Pa_{X_i})
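A minimal sketch of this ordering-based construction. The conditional-independence test `ci` is a placeholder oracle (an assumption on my part); in practice it would query the distribution P or use a statistical test:

```python
# Ordering-based minimal I-map construction with a CI oracle.
from itertools import combinations

def minimal_imap(order, ci):
    """order: variable names in the chosen order.
    ci(x, others, given): True iff x is independent of `others` given `given`;
    it should return True when `others` is empty.
    Returns a dict node -> parent set."""
    parents = {}
    for i, x in enumerate(order):
        predecessors = order[:i]
        done = False
        # Try candidate parent sets from smallest to largest; keep the first
        # one that makes x independent of its remaining predecessors.
        for size in range(len(predecessors) + 1):
            for pa in combinations(predecessors, size):
                rest = [v for v in predecessors if v not in pa]
                if ci(x, rest, set(pa)):
                    parents[x] = set(pa)
                    done = True
                    break
            if done:
                break
    return parents
```

Searching candidate sets by increasing size guarantees a minimal parent set for the chosen ordering, but it is exponential in the number of predecessors, so practical implementations restrict the candidates.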

Read conditional independence from BN structure G

Conditional independence encoded in a BN
- Local Markov assumption: (X ⊥ NonDescendants_X | Pa_X)
- There are other conditional independencies, e.g. explaining away (in the flu network, (A ⊥ F) holds but (A ⊥ F | S) does not) and derived independencies (in the chain A → B → C → D, local Markov gives (A ⊥ C | B) and (D ⊥ {A, B} | C), from which (A ⊥ D | B) can be derived)
- What other derived conditional independencies are there?

Three-node cases
- Causal effect (chain): Allergy → Sinus → Headache. A is not independent of H, but (A ⊥ H | S).
- Common cause: Nose ← Sinus → Headache. N is not independent of H, but (N ⊥ H | S).
- Common effect (v-structure): Allergy → Sinus ← Flu. (A ⊥ F), but A is not independent of F given S.

How about other derived relations?
(Figure: an example DAG over nodes A through K. The slide works through a list of queries on this graph, e.g.: is (F ⊥ {B, E, G, J})? is ({A, C, F, I} ⊥ {B, E, G, J})? is (E ⊥ B | J)? is (E ⊥ F | K)? is (E ⊥ F | {K, I})? is (F ⊥ G | D)? is (F ⊥ G | H)? is (F ⊥ G | {H, K})? is (F ⊥ G | {H, A})? The answers follow from the active-trail rules on the next slides.)

Active trails
A trail X_1 – X_2 – ... – X_k is an active trail, when variables O ⊆ {X_1, ..., X_n} are observed, if for each consecutive triplet in the trail one of the following holds:
- X_{i-1} → X_i → X_{i+1} and X_i is not observed (X_i ∉ O)
- X_{i-1} ← X_i ← X_{i+1} and X_i is not observed (X_i ∉ O)
- X_{i-1} ← X_i → X_{i+1} and X_i is not observed (X_i ∉ O)
- X_{i-1} → X_i ← X_{i+1} (v-structure) and X_i is observed (X_i ∈ O), or one of its descendants is
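These rules can be turned directly into a d-separation check. A brute-force sketch (my own code, not the efficient Bayes-ball/reachability algorithm): it enumerates every simple trail between two nodes in the underlying undirected graph and tests each consecutive triplet against the rules above; the graph is given as child lists, and the flu network is used to confirm independencies from earlier slides:

```python
# Brute-force d-separation via active-trail checking on small DAGs.
def descendants(graph, node):
    """graph: dict node -> list of children."""
    out, stack = set(), list(graph.get(node, []))
    while stack:
        c = stack.pop()
        if c not in out:
            out.add(c)
            stack.extend(graph.get(c, []))
    return out

def is_active_trail(graph, trail, Z):
    for a, b, c in zip(trail, trail[1:], trail[2:]):
        collider = b in graph.get(a, []) and b in graph.get(c, [])  # a -> b <- c
        if collider:
            if b not in Z and not (descendants(graph, b) & Z):
                return False   # v-structure blocked: neither b nor a descendant observed
        elif b in Z:
            return False       # chain or fork blocked by observing b
    return True

def d_separated(graph, x, y, Z):
    """True if every trail between x and y is blocked given Z."""
    nbrs = {}
    for u, cs in graph.items():
        for v in cs:
            nbrs.setdefault(u, set()).add(v)
            nbrs.setdefault(v, set()).add(u)
    def paths(cur, target, visited):
        if cur == target:
            yield list(visited)
            return
        for n in nbrs.get(cur, ()):
            if n not in visited:
                yield from paths(n, target, visited + [n])
    return all(not is_active_trail(graph, p, Z) for p in paths(x, y, [x]))

# Flu network: F -> S <- A, S -> N, S -> H
g = {"F": ["S"], "A": ["S"], "S": ["N", "H"]}
print(d_separated(g, "N", "H", {"S"}))   # True:  (N ⊥ H | S)
print(d_separated(g, "F", "A", set()))   # True:  (F ⊥ A)
print(d_separated(g, "F", "A", {"H"}))   # False: observing a descendant of S activates F -> S <- A
```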

Active trails and conditional independence
- Variables X_i and X_j are independent given Z ⊆ {X_1, ..., X_n} if there is no active trail between X_i and X_j when the variables in Z are observed: (X_i ⊥ X_j | Z)
- We say that X_i and X_j are d-separated given Z (dependency separation)
- (Figure: the example DAG over nodes A through K; e.g. checking whether (F ⊥ G | K))

Soundness of d-separation
- Given a BN with structure G, the set of conditional independence assertions obtained by d-separation is I(G) = {(X ⊥ Y | Z) : d-sep_G(X; Y | Z)}
- Soundness of d-separation: if P factorizes over G, then I(G) ⊆ I(P) (not only I_l(G) ⊆ I(P))
- Interpretation: d-separation only captures true conditional independencies
- For most P's that factorize over G, I(G) = I(P) (G is a P-map)

Bayesian Networks are not enough

Inexistence of P-maps, example 1
- XOR: A = B XOR C, with B and C independent fair coins
- Independencies in P: (A ⊥ B) but not (A ⊥ B | C); (B ⊥ C) but not (B ⊥ C | A); (A ⊥ C) but not (A ⊥ C | B)
- Minimal I-map (for the ordering A, B, C): A → C ← B
- This is not a P-map: we cannot read (B ⊥ C) or (A ⊥ C) from the graph

Inexistence of P-maps, example 2
- "Swinging couples" of variables (X_1, Y_1) and (X_2, Y_2), arranged in a loop X_1 – Y_1 – X_2 – Y_2 – X_1
- Independencies in P: (X_1 ⊥ X_2 | Y_1, Y_2) and (Y_1 ⊥ Y_2 | X_1, X_2)
- No Bayesian network is a P-map for this distribution
- Need undirected graphical models!