Directed Graphical Models


Directed Graphical Models Instructor: Alan Ritter Many Slides from Tom Mitchell

Graphical Models Key Idea: Conditional independence assumptions are useful, but Naïve Bayes is extreme! Graphical models express sets of conditional independence assumptions via graph structure. Graph structure plus associated parameters define a joint probability distribution over a set of variables. Two types of graphical models: directed graphs (aka Bayesian Networks) and undirected graphs (aka Markov Random Fields).

Graphical Models Why Care? Among the most important ML developments of the decade. Graphical models allow combining: prior knowledge in the form of dependencies/independencies, prior knowledge in the form of priors over parameters, and observed training data. Principled and fairly general methods for probabilistic inference and learning. Useful in practice: diagnosis, help systems, text analysis, time series models, ...

Conditional Independence Definition: X is conditionally independent of Y given Z if the probability distribution governing X is independent of the value of Y, given the value of Z. We often write this as P(X | Y, Z) = P(X | Z). E.g., P(Thunder | Rain, Lightning) = P(Thunder | Lightning).

Marginal Independence Definition: X is marginally independent of Y if P(X | Y) = P(X); equivalently, if P(Y | X) = P(Y); equivalently, if P(X, Y) = P(X) P(Y).

Bayes Nets represent a joint probability distribution over a set of variables and describe a network of dependencies. Bayes Nets define the joint probability distribution in terms of this graph, plus parameters. Benefits of Bayes Nets: represent the full joint distribution with fewer parameters, using prior knowledge about dependencies; algorithms for inference and learning.

Bayesian Networks Definition: A Bayes network represents the joint probability distribution over a collection of random variables. A Bayes network is a directed acyclic graph together with a set of conditional probability distributions (CPDs). Each node denotes a random variable; edges denote dependencies. For each node X_i, its CPD defines P(X_i | Pa(X_i)), where Pa(X_i) denotes the immediate parents of X_i in the graph. The joint distribution over all variables is defined to be P(X_1, ..., X_n) = ∏_i P(X_i | Pa(X_i)).

Bayesian Network example. Nodes = random variables. A conditional probability distribution (CPD) is associated with each node N, defining P(N | Parents(N)). [Figure: StormClouds is the parent of Lightning and Rain; Lightning is the parent of Thunder; Lightning and Rain are the parents of WindSurf. A CPT gives P(WindSurf | Lightning, Rain) for each of the four parent configurations.] The joint distribution over all variables: P(S, L, R, T, W) = P(S) P(L | S) P(R | S) P(T | L) P(W | L, R).

Bayesian Network. What can we say about conditional independencies in a Bayes net? One thing is this: each node is conditionally independent of its non-descendants, given only its immediate parents. [Same StormClouds network and WindSurf CPT as above.]

Some helpful terminology: Parents = Pa(X) = immediate parents. Antecedents = parents, parents of parents, ... Children = immediate children. Descendants = children, children of children, ...

Bayesian Networks. The CPD for each node X_i describes P(X_i | Pa(X_i)). The chain rule of probability says that in general P(X_1, ..., X_n) = ∏_i P(X_i | X_1, ..., X_{i-1}). But in a Bayes net, P(X_1, ..., X_n) = ∏_i P(X_i | Pa(X_i)).

How Many Parameters? [Same StormClouds network and WindSurf CPT as above.] How many parameters are needed to define the joint distribution in general? How many to define the joint distribution for this Bayes net?
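
As a rough illustration (a sketch, not part of the original slides), the two counts can be computed directly from the graph when all variables are binary: a full joint over n binary variables needs 2^n − 1 free numbers, while each Bayes net node needs one free number per configuration of its parents.

```python
# Sketch: count free parameters for the StormClouds network (all variables binary).
parents = {
    "StormClouds": [],
    "Lightning": ["StormClouds"],
    "Rain": ["StormClouds"],
    "Thunder": ["Lightning"],
    "WindSurf": ["Lightning", "Rain"],
}

n = len(parents)
full_joint_params = 2 ** n - 1                                   # arbitrary distribution over n binary variables
bayes_net_params = sum(2 ** len(pa) for pa in parents.values())  # one free number per parent configuration

print(full_joint_params)  # 31
print(bayes_net_params)   # 1 + 2 + 2 + 2 + 4 = 11
```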

Inference in Bayes Nets. [Same StormClouds network and WindSurf CPT as above.] P(S=1, L=0, R=1, T=0, W=1) = P(S=1) P(L=0 | S=1) P(R=1 | S=1) P(T=0 | L=0) P(W=1 | L=0, R=1).

Learning a Bayes Net. [Same StormClouds network and WindSurf CPT as above.] Consider learning when the graph structure is given and the data = { <s, l, r, t, w> }. What is the MLE solution? The MAP solution?

Algorithm for Constructing a Bayes Network. Choose an ordering over variables, e.g., X_1, X_2, ..., X_n. For i = 1 to n: add X_i to the network and select its parents Pa(X_i) as a minimal subset of X_1, ..., X_{i-1} such that P(X_i | Pa(X_i)) = P(X_i | X_1, ..., X_{i-1}). Notice this choice of parents assures P(X_1, ..., X_n) = ∏_i P(X_i | X_1, ..., X_{i-1}) (by the chain rule) = ∏_i P(X_i | Pa(X_i)) (by construction).
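
A minimal sketch of this construction, assuming access to a hypothetical conditional-independence oracle `is_ci(x, others, given)` for the target distribution; the function names and the oracle are illustrative, not from the original slides.

```python
from itertools import combinations

def build_bayes_net(order, is_ci):
    """For each X_i, pick a minimal parent subset of X_1..X_{i-1}.

    is_ci(x, others, given) is a hypothetical oracle returning True iff
    x is conditionally independent of the set `others` given the set `given`.
    """
    parents = {}
    for i, x in enumerate(order):
        predecessors = order[:i]
        chosen, done = list(predecessors), False   # fall back to all predecessors
        for size in range(len(predecessors) + 1):  # smallest subsets first => minimal
            for pa in combinations(predecessors, size):
                rest = set(predecessors) - set(pa)
                # P(X_i | pa) = P(X_i | X_1..X_{i-1}) iff X_i is CI of the rest given pa
                if not rest or is_ci(x, rest, set(pa)):
                    chosen, done = list(pa), True
                    break
            if done:
                break
        parents[x] = chosen
    return parents

# Example usage with a trivially independent "distribution" (everything CI of everything):
# build_bayes_net(["A", "B", "C"], is_ci=lambda x, others, given: True)  # -> all parent sets empty
```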

Example: Bird flu and Allergies both cause Nasal problems; Nasal problems cause Sneezes and Headaches.

What is the Bayes network for X1, ..., X4 with NO assumed conditional independencies?

What is the Bayes Network for Naïve Bayes?

Naïve Bayes (same as a Gaussian Mixture Model with diagonal covariance). [Figure: Y is the parent of X1, X2, X3, X4.]

What do we do if the variables are a mix of discrete and real-valued?

Bayes Network for a Hidden Markov Model. Implies the future is conditionally independent of the past, given the present. Unobserved state: S_{t-2}, S_{t-1}, S_t, S_{t+1}, S_{t+2}. Observed output: O_{t-2}, O_{t-1}, O_t, O_{t+1}, O_{t+2}. [Figure: each S_t is the parent of S_{t+1} and of O_t.]
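
Spelled out (a standard fact, not shown on the slide): for a length-T chain this structure factorizes the joint as P(S_1, ..., S_T, O_1, ..., O_T) = P(S_1) P(O_1 | S_1) ∏_{t=2}^{T} P(S_t | S_{t-1}) P(O_t | S_t), which is exactly the transition/emission parameterization of an HMM and is where the "future independent of the past, given the present" reading comes from.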

Inference in Bayes Nets. In general, intractable (NP-complete). For certain cases, tractable: assigning probability to a fully observed set of variables; or if just one variable is unobserved; or for singly connected graphs (i.e., no undirected loops), using variable elimination or belief propagation. For multiply connected graphs: junction tree. Sometimes use Monte Carlo methods: generate many samples according to the Bayes net distribution, then count up the results. Variational methods give tractable approximate solutions.

Example: Bird flu and Allergies both cause Sinus problems; Sinus problems cause Headaches and a runny Nose.

Probability of a joint assignment: easy. Suppose we are interested in the joint assignment <F=f, A=a, S=s, H=h, N=n>. What is P(f, a, s, h, n)? By the Bayes net factorization, P(f, a, s, h, n) = P(f) P(a) P(s | f, a) P(h | s) P(n | s). (Let's use P(a, b) as shorthand for P(A=a, B=b).)

Probability of marginals: not so easy. How do we calculate P(N=n)? We have to sum the joint over all the other variables: P(N=n) = Σ_{f,a,s,h} P(f, a, s, h, n). (Let's use P(a, b) as shorthand for P(A=a, B=b).)
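
A small sketch of both computations for the Flu/Allergy/Sinus/Headache/Nose network; all CPT numbers below are made-up placeholders, not values from the lecture.

```python
from itertools import product

# Hypothetical CPTs for binary variables (numbers are illustrative placeholders).
P_F1 = 0.10                                                      # P(F=1)
P_A1 = 0.20                                                      # P(A=1)
P_S1 = {(1, 1): 0.90, (1, 0): 0.70, (0, 1): 0.60, (0, 0): 0.05}  # P(S=1 | F, A)
P_H1 = {1: 0.80, 0: 0.10}                                        # P(H=1 | S)
P_N1 = {1: 0.70, 0: 0.05}                                        # P(N=1 | S)

def bern(p1, v):
    """P(V=v) for a binary variable with P(V=1) = p1."""
    return p1 if v == 1 else 1.0 - p1

def joint(f, a, s, h, n):
    # Bayes net factorization: P(f,a,s,h,n) = P(f) P(a) P(s|f,a) P(h|s) P(n|s)
    return (bern(P_F1, f) * bern(P_A1, a) * bern(P_S1[(f, a)], s) *
            bern(P_H1[s], h) * bern(P_N1[s], n))

# Probability of a full assignment: a single product of CPT entries.
print(joint(1, 0, 1, 1, 0))

# Marginal P(N=1): sum the joint over all settings of the remaining variables.
print(sum(joint(f, a, s, h, 1) for f, a, s, h in product([0, 1], repeat=4)))
```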

Generating a sample from the joint distribution: easy. How can we generate random samples drawn according to P(F, A, S, H, N)? Hint: to draw a random sample of F according to P(F=1) = θ_{F=1}, draw a value r uniformly from [0, 1]; if r < θ_{F=1}, output F=1, else F=0. Solution: draw a random value f for F using its CPD; then draw values for A, for S given A and F, for H given S, and for N given S.

Generating a sample from the joint distribution: easy. Note we can estimate marginals like P(N=n) by generating many samples from the joint distribution, then counting the fraction of samples for which N=n. Similarly for anything else we care about, e.g., P(F=1 | H=1, N=0): a weak but general method for estimating any probability term.
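
A sketch of this ancestral (forward) sampling idea, reusing the same hypothetical CPT numbers as in the previous sketch: sample each variable given its already-sampled parents, then estimate any probability by counting.

```python
import random

# Same hypothetical CPTs as in the previous sketch (illustrative numbers only).
P_F1, P_A1 = 0.10, 0.20
P_S1 = {(1, 1): 0.90, (1, 0): 0.70, (0, 1): 0.60, (0, 0): 0.05}
P_H1 = {1: 0.80, 0: 0.10}
P_N1 = {1: 0.70, 0: 0.05}

def sample_once():
    """Draw one joint sample (f, a, s, h, n) in topological order."""
    f = int(random.random() < P_F1)
    a = int(random.random() < P_A1)
    s = int(random.random() < P_S1[(f, a)])
    h = int(random.random() < P_H1[s])
    n = int(random.random() < P_N1[s])
    return f, a, s, h, n

samples = [sample_once() for _ in range(100_000)]

# Monte Carlo estimate of the marginal P(N=1): fraction of samples with N=1.
print(sum(s[4] for s in samples) / len(samples))

# Weak-but-general estimate of a conditional, here P(F=1 | H=1, N=0).
match = [f for f, a, s, h, n in samples if h == 1 and n == 0]
print(sum(match) / len(match) if match else float("nan"))
```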

Learning of Bayes Nets. Four categories of learning problems: the graph structure may be known or unknown, and the variable values may be fully observed or partly unobserved. Easy case: learn the parameters when the graph structure is known and the data are fully observed. Interesting case: graph known, data partly observed. Gruesome case: graph structure unknown, data partly unobserved.

Learning CPTs from Fully Observed Data. Example: consider learning the parameter θ_{s|f,a} = P(S=1 | F=1, A=1) in the Flu/Allergy/Sinus/Headache/Nose network. The maximum likelihood estimate is θ̂_{s|f,a} = Σ_k δ(F_k=1, A_k=1, S_k=1) / Σ_k δ(F_k=1, A_k=1), where k indexes training examples and δ(x) = 1 if x is true, 0 if x is false. Remember why? (Let's use P(a, b) as shorthand for P(A=a, B=b).)

MLE estimate of θ_{s|f,a} from fully observed data. The maximum likelihood estimate is θ̂ = argmax_θ log P(data | θ). In our case [Figure: Flu and Allergy are parents of Sinus; Sinus is the parent of Headache and Nose] the log-likelihood decomposes into a separate term for each CPT, so each entry can be estimated from its own counts.
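
A sketch of this counting estimate for one CPT entry, θ_{S=1|F=1,A=1}, from fully observed records; the tiny dataset and the smoothing option are illustrative, not from the lecture.

```python
# Each record is a fully observed (f, a, s, h, n) tuple; the data are illustrative.
data = [
    (1, 1, 1, 1, 1),
    (1, 1, 1, 0, 1),
    (1, 1, 0, 0, 0),
    (0, 1, 1, 1, 0),
    (0, 0, 0, 0, 0),
]

def mle_s_given_fa(records, f_val, a_val, alpha=0.0):
    """MLE of P(S=1 | F=f_val, A=a_val); alpha > 0 gives a simple smoothed (MAP-style) estimate."""
    matching = [r for r in records if r[0] == f_val and r[1] == a_val]
    num = sum(r[2] for r in matching) + alpha
    den = len(matching) + 2 * alpha
    return num / den if den > 0 else 0.5

print(mle_s_given_fa(data, 1, 1))             # 2/3 from the counts above
print(mle_s_given_fa(data, 1, 1, alpha=1.0))  # smoothed estimate: (2+1)/(3+2) = 0.6
```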

Estimate θ from partly observed data. What if F, A, H, N are observed, but not S? Then we can't calculate the MLE. Let X be all observed variable values (over all examples) and Z be all unobserved variable values. We can't calculate the MLE θ̂ = argmax_θ log P(X, Z | θ) because Z is unobserved. What to do? EM seeks* to estimate θ̂ = argmax_θ E_{Z|X,θ}[log P(X, Z | θ)]. (*EM is guaranteed to find a local maximum.)

EM seeks the estimate θ̂ = argmax_θ E_{Z|X,θ}[log P(X, Z | θ)]. Here the observed variables are X = {F, A, H, N} and the unobserved variable is Z = {S}. [Figure: Flu and Allergy are parents of Sinus; Sinus is the parent of Headache and Nose.]

EM Algorithm, precisely. EM is a general procedure for learning from partly observed data. Given observed variables X and unobserved variables Z (here X = {F, A, H, N}, Z = {S}), define Q(θ' | θ) = E_{Z|X,θ}[log P(X, Z | θ')]. Iterate until convergence: E Step: use X and the current θ to calculate P(Z | X, θ). M Step: replace the current θ by θ ← argmax_{θ'} Q(θ' | θ). Guaranteed to find a local maximum; each iteration increases E_{Z|X,θ}[log P(X, Z | θ)].

E Step: use X and θ to calculate P(Z | X, θ). Observed X = {F, A, H, N}, unobserved Z = {S}. How? This is a Bayes net inference problem: P(S | f, a, h, n) ∝ P(S | f, a) P(h | S) P(n | S). (Let's use P(a, b) as shorthand for P(A=a, B=b).)

EM and estimating θ_{S|F,A}. Observed X = {F, A, H, N}, unobserved Z = {S}. E step: calculate P(Z_k | X_k; θ) for each training example k. M step: update all relevant parameters; for example θ_{s|f,a} ← Σ_k E[δ(F_k=1, A_k=1, S_k=1)] / Σ_k δ(F_k=1, A_k=1), where the expectation is taken under P(S_k | X_k; θ). Recall the MLE was the same ratio with observed counts in place of expected counts.

EM and estimating θ, more generally. Given an observed set X and an unobserved set Z of boolean values: E step: calculate, for each training example k, the expected value of each unobserved variable. M step: calculate estimates similar to the MLE, but replacing each count by its expected count. [Figure: Flu and Allergy are parents of Sinus; Sinus is the parent of Headache and Nose.]
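
A minimal EM sketch for this setup, with S unobserved in every record. To keep it short it learns only the P(S=1 | F, A) table and keeps the other CPTs fixed (a full implementation would update those with the same expected counts); all numbers are illustrative, not from the lecture.

```python
# Records are (f, a, h, n) with S unobserved; the data and CPT values are illustrative.
data = [(1, 1, 1, 1), (1, 1, 0, 1), (1, 0, 1, 0), (0, 1, 0, 0), (0, 0, 0, 0)]

P_H1 = {1: 0.80, 0: 0.10}                              # P(H=1 | S), kept fixed in this sketch
P_N1 = {1: 0.70, 0: 0.05}                              # P(N=1 | S), kept fixed in this sketch
theta = {(f, a): 0.5 for f in (0, 1) for a in (0, 1)}  # P(S=1 | F, A), to be learned

def bern(p1, v):
    return p1 if v == 1 else 1.0 - p1

for _ in range(50):  # iterate E and M steps until (approximate) convergence
    expected = {k: 0.0 for k in theta}   # expected count of S=1 for each (f, a)
    totals = {k: 0.0 for k in theta}     # count of examples with each (f, a)
    # E step: posterior P(S=1 | f, a, h, n) for each record (a Bayes net inference problem).
    for f, a, h, n in data:
        w1 = theta[(f, a)] * bern(P_H1[1], h) * bern(P_N1[1], n)        # S = 1
        w0 = (1 - theta[(f, a)]) * bern(P_H1[0], h) * bern(P_N1[0], n)  # S = 0
        post = w1 / (w1 + w0)
        expected[(f, a)] += post
        totals[(f, a)] += 1.0
    # M step: MLE-style update with expected counts in place of observed counts.
    theta = {k: (expected[k] / totals[k] if totals[k] else 0.5) for k in theta}

print(theta)
```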

Using Unlabeled Data to Help Train a Naïve Bayes Classifier. Learn P(Y | X). [Figure: Naïve Bayes graph, Y is the parent of X1, X2, X3, X4.]

Y  X1 X2 X3 X4
1  0  0  1  1
0  0  1  0  0
0  0  0  1  0
?  0  1  1  0
?  0  1  0  1

EM and estimating θ. Given an observed set X and an unobserved set Y of boolean values: E step: calculate, for each training example k, the expected value of each unobserved variable Y. M step: calculate estimates similar to the MLE, but replacing each count by its expected count. (Let's use y^(k) to denote the value of Y on the k-th example.) With fully observed Y the MLE would be θ̂_{Y=1} = (1/N) Σ_k δ(y^(k) = 1); with Y unobserved, replace δ(y^(k) = 1) by its expected value E[y^(k)] = P(Y=1 | X^(k); θ).

From [Nigam et al., 2000]

Experimental Evaluation. Newsgroup postings: 20 newsgroups, 1000 documents per group. Web page classification: student, faculty, course, project; 4199 web pages. Reuters newswire articles: 12,902 articles, 90 topic categories.

20 Newsgroups

Conditional Independence Properties. "A is independent of B given C." I(G) is the set of all such conditional independence assumptions encoded by G. G is an I-map for P iff I(G) ⊆ I(P), where I(P) is the set of all CI statements that hold for P. In other words, G doesn't make any assertions that are not true about P.

Conditional Independence Properties (cont). Note: a fully connected graph is an I-map for all distributions. G is a minimal I-map of P if: G is an I-map of P, and there is no G' ⊊ G (obtained by removing edges from G) which is an I-map of P. Question: how do we determine whether A ⊥ B | C holds in G? Easy for undirected graphs; kind of complicated for DAGs (Bayesian Nets).

D-separation. Definition: an undirected path P is d-separated by a set of nodes E (containing the evidence) iff at least one of the following conditions holds: P contains a chain s → m → t or s ← m ← t where m is in the evidence; P contains a fork s ← m → t where m is in the evidence; P contains a v-structure s → m ← t where neither m nor any descendant of m is in the evidence.

D-separation (cont). A set of nodes A is d-separated from a set of nodes B given a third set of nodes E iff every undirected path from every node in A to every node in B is d-separated by E. Finally, define the CI properties of a DAG as follows: I(G) = { A ⊥ B | E : A is d-separated from B given E }.

Bayes Ball Algorithm. A simple way to check if A is d-separated from B given E: 1. Shade all nodes in E. 2. Place balls at each node in A and let them bounce around according to certain rules (note: balls can travel in either direction). 3. Check whether any ball from A reaches a node in B.
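
A toy d-separation checker in the spirit of this test, applied to the Flu/Allergy/Sinus/Headache/Nose network used earlier. It enumerates undirected paths and applies the three blocking rules directly, so it only suits small graphs; the function names are illustrative, not the actual Bayes ball implementation.

```python
def d_separated(edges, A, B, E):
    """True iff every node in A is d-separated from every node in B given evidence set E.

    edges: iterable of directed (parent, child) pairs.
    """
    children, parents = {}, {}
    for p, c in edges:
        children.setdefault(p, set()).add(c)
        parents.setdefault(c, set()).add(p)

    def descendants(x):
        out, stack = set(), [x]
        while stack:
            for c in children.get(stack.pop(), ()):
                if c not in out:
                    out.add(c)
                    stack.append(c)
        return out

    def undirected_paths(s, t, path):
        if s == t:
            yield list(path)
            return
        for nxt in children.get(s, set()) | parents.get(s, set()):
            if nxt not in path:
                path.append(nxt)
                yield from undirected_paths(nxt, t, path)
                path.pop()

    def blocked(path):
        for s, m, t in zip(path, path[1:], path[2:]):
            collider = s in parents.get(m, set()) and t in parents.get(m, set())
            if collider:
                if m not in E and not (descendants(m) & set(E)):
                    return True   # v-structure with no evidence at or below it blocks the path
            elif m in E:
                return True       # chain or fork through an evidence node blocks the path
        return False

    return all(blocked(p)
               for a in A for b in B
               for p in undirected_paths(a, b, [a]))

edges = [("Flu", "Sinus"), ("Allergy", "Sinus"), ("Sinus", "Headache"), ("Sinus", "Nose")]
print(d_separated(edges, {"Flu"}, {"Allergy"}, set()))          # True: marginally independent
print(d_separated(edges, {"Flu"}, {"Allergy"}, {"Headache"}))   # False: explaining away
print(d_separated(edges, {"Headache"}, {"Nose"}, {"Sinus"}))    # True: fork blocked by evidence
```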

Bayes Ball Rules

Explaining Away (inter-causal reasoning) Example: Toss two coins and observe their sum

Example. [Figure: Battery is the parent of Radio and Ignition; Ignition and Gas are the parents of Starts; Starts is the parent of Moves.] Are Gas and Radio independent? Given Battery? Ignition? Starts? Moves?

Bent Coin Bayesian Network. Each coin flip is conditionally independent of the others given Θ: P(x_1, x_2, ..., x_n, Θ) = P(Θ) P(x_1 | Θ) P(x_2 | Θ) ⋯ P(x_n | Θ).

Bent Coin Bayesian Network (Plate Notation)

Learning Bayes-net structure. Given data, which model is correct? Two candidate structures over X and Y, e.g., model 1 with no arc between X and Y, and model 2 with an arc X → Y.

Bayesian approach. Given data, which model is correct (more likely)? Model 1: p(m1) = 0.7; after seeing data d, p(m1 | d) = 0.1. Model 2: p(m2) = 0.3; p(m2 | d) = 0.9.

Bayesian approach: Model averaging. Given data, which model is correct (more likely)? Model 1: p(m1) = 0.7, p(m1 | d) = 0.1. Model 2: p(m2) = 0.3, p(m2 | d) = 0.9. Average the predictions of the models, weighted by their posteriors.

Bayesian approach: Model selection. Given data, which model is correct (more likely)? Model 1: p(m1) = 0.7, p(m1 | d) = 0.1. Model 2: p(m2) = 0.3, p(m2 | d) = 0.9. Keep the best model, for explanation, understanding, and tractability.

To score a model, use Bayes' theorem. Given data d, the model score is p(m | d) ∝ p(m) p(d | m), where the likelihood p(d | m) is the marginal likelihood p(d | m) = ∫ p(d | θ, m) p(θ | m) dθ.

Thumbtack example. X is heads/tails with a conjugate (Beta) prior. With #h heads and #t tails in d, p(d | m) = ∫ θ^{#h} (1 − θ)^{#t} p(θ | m) dθ = [Γ(α_h + α_t) / (Γ(α_h) Γ(α_t))] · [Γ(α_h + #h) Γ(α_t + #t) / Γ(α_h + α_t + #h + #t)].
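
A small sketch of this computation in log space, assuming a Beta(α_h, α_t) prior; lgamma avoids overflow for large counts. The function name and default hyperparameters are illustrative.

```python
from math import lgamma, exp

def log_marginal_likelihood(n_heads, n_tails, alpha_h=1.0, alpha_t=1.0):
    """log p(d | m) for the thumbtack with a Beta(alpha_h, alpha_t) prior."""
    return (lgamma(alpha_h + alpha_t) - lgamma(alpha_h) - lgamma(alpha_t)
            + lgamma(alpha_h + n_heads) + lgamma(alpha_t + n_tails)
            - lgamma(alpha_h + alpha_t + n_heads + n_tails))

print(exp(log_marginal_likelihood(1, 0)))  # 0.5: a single heads has probability 1/2 under Beta(1,1)
print(exp(log_marginal_likelihood(7, 3)))  # marginal likelihood of 7 heads, 3 tails
```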

More complicated graphs: X → Y, with X and Y each heads/tails. This is 3 separate thumbtack-like learning problems: one for X, one for Y given X = heads, and one for Y given X = tails. The marginal likelihood p(d | m) is the product of three factors of the same Beta form as above, one per problem, each with its own hyperparameters (α_h, α_t) and counts (#h, #t).

Model score for a discrete Bayes net: p(d | m) = ∏_{i=1}^{n} ∏_{j=1}^{q_i} [Γ(α_ij) / Γ(α_ij + N_ij)] ∏_{k=1}^{r_i} [Γ(α_ijk + N_ijk) / Γ(α_ijk)], where N_ijk = number of cases in which X_i takes its k-th value and Pa_i takes its j-th configuration; r_i = number of states of X_i; q_i = number of configurations of the parents of X_i; α_ij = Σ_{k=1}^{r_i} α_ijk; and N_ij = Σ_{k=1}^{r_i} N_ijk.
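
A sketch of one node's contribution to this score, computed from a table of counts N_ijk and assuming a constant hyperparameter α for every α_ijk (an assumption of this sketch; other Dirichlet choices are possible).

```python
from math import lgamma

def log_bd_node_score(counts, alpha=1.0):
    """log of one node's factor in p(d | m).

    counts[j][k] = N_ijk, the number of cases with X_i in state k and the
    parents of X_i in configuration j; every alpha_ijk is set to `alpha`.
    """
    score = 0.0
    for n_jk in counts:                   # loop over parent configurations j
        a_ij = alpha * len(n_jk)          # alpha_ij = sum_k alpha_ijk
        n_ij = sum(n_jk)                  # N_ij     = sum_k N_ijk
        score += lgamma(a_ij) - lgamma(a_ij + n_ij)
        for n in n_jk:                    # loop over states k of X_i
            score += lgamma(alpha + n) - lgamma(alpha)
    return score

# Example: a binary X_i with one binary parent, so two parent configurations.
print(log_bd_node_score([[8, 2], [1, 9]]))
# The full model score is the sum of these log factors over all nodes i.
```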

Computation of the marginal likelihood. There is an efficient closed form if: local distributions are from the exponential family (binomial, Poisson, gamma, ...); parameter independence holds; conjugate priors are used; and there is no missing data (including no hidden variables).

Structure search. Finding the BN structure with the highest score among those structures with at most k parents is NP-hard for k > 1 (Chickering, 1995). Heuristic methods: greedy search, greedy search with restarts, MCMC methods. Greedy search loop: initialize the structure; score all possible single changes; if any change improves the score, perform the best change and repeat; otherwise return the saved structure.
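
A sketch of that greedy loop over edge sets; `score` is a placeholder for any structure score (for example, the log BD score summed over nodes) and is assumed to be supplied by the caller.

```python
def is_acyclic(edges, nodes):
    """Topological-sort based acyclicity check for a set of directed edges."""
    indeg = {v: 0 for v in nodes}
    for _, c in edges:
        indeg[c] += 1
    frontier = [v for v in nodes if indeg[v] == 0]
    seen = 0
    while frontier:
        v = frontier.pop()
        seen += 1
        for p, c in edges:
            if p == v:
                indeg[c] -= 1
                if indeg[c] == 0:
                    frontier.append(c)
    return seen == len(nodes)

def neighbors(dag, nodes):
    """All DAGs reachable by one edge addition, deletion, or reversal."""
    edges, out = set(dag), []
    for x in nodes:
        for y in nodes:
            if x == y:
                continue
            if (x, y) in edges:
                out.append(edges - {(x, y)})               # delete
                out.append((edges - {(x, y)}) | {(y, x)})  # reverse
            elif (y, x) not in edges:
                out.append(edges | {(x, y)})               # add
    return [e for e in out if is_acyclic(e, nodes)]

def greedy_search(nodes, score, start=frozenset()):
    current, current_score = set(start), score(start)
    while True:
        best, best_score = None, current_score
        for cand in neighbors(current, nodes):  # score all possible single changes
            s = score(cand)
            if s > best_score:
                best, best_score = cand, s
        if best is None:                        # no change improves the score
            return current, current_score
        current, current_score = best, best_score  # perform the best change

# Example usage with a toy score that just prefers fewer edges (illustration only):
# best, s = greedy_search(["X", "Y", "Z"], score=lambda e: -len(e))
```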

Structure priors: 1. All possible structures equally likely. 2. Partial ordering, required / prohibited arcs. 3. Prior(m) ∝ Similarity(m, prior BN).

Parameter priors All uniform: Beta(1,1) Use a prior Bayes net

Parameter priors. Recall the intuition behind the Beta prior for the thumbtack: the hyperparameters α_h and α_t can be thought of as imaginary counts from our prior experience, starting from "pure ignorance". Equivalent sample size = α_h + α_t. The larger the equivalent sample size, the more confident we are about the long-run fraction.

Parameter priors. [Figure: a prior Bayes net over x1, ..., x9.] A prior network plus an equivalent sample size gives an imaginary count for any variable configuration; by parameter modularity, this yields parameter priors for any Bayes net structure over X_1, ..., X_n.

Combining knowledge & data. A prior network plus an equivalent sample size, combined with data, yields improved network(s). [Figure: prior network over x1, ..., x9; a table of true/false data values for x1, x2, x3, ...; and the resulting improved network(s).]

Example: College Plans Data (Heckerman et al., 1997). Data on 5 variables that might influence a high school student's decision to attend college: Sex: male or female. SES: socioeconomic status (low, lower middle, upper middle, high). IQ: discretized into low, lower middle, upper middle, high. PE: parental encouragement (low or high). CP: college plans (yes or no). 128 possible joint configurations. Heckerman et al. computed the exact posterior over all 29,281 possible five-node DAGs, except those in which Sex or SES have parents and/or CP has children (prior knowledge).

Bayes Nets: What You Should Know. Representation: Bayes nets represent a joint distribution as a DAG plus conditional probability distributions; d-separation lets us decode the conditional independence assumptions. Inference: NP-hard in general; for some graphs and some queries, exact inference is tractable; approximate methods too, e.g., Monte Carlo methods. Learning: easy for a known graph with fully observed data (MLEs, MAP estimates); EM for partly observed data with a known graph.