Directed and Undirected Graphical Models

Directed and Undirected Graphical Models
Adrian Weller, MLSALT4 Lecture, Feb 26, 2016
With thanks to David Sontag (NYU) and Tony Jebara (Columbia) for use of many slides and illustrations.
For more information, see http://mlg.eng.cam.ac.uk/adrian/

High level overview of our 3 lectures:
1. LP relaxations for MAP inference (last week)
2. Directed and undirected graphical models (today)
3. Junction tree algorithm for exact inference, belief propagation, variational methods for approximate inference (Monday)

Further reading / viewing:
Murphy, Machine Learning: a Probabilistic Perspective
Barber, Bayesian Reasoning and Machine Learning
Bishop, Pattern Recognition and Machine Learning
Koller and Friedman, Probabilistic Graphical Models; https://www.coursera.org/course/pgm
Wainwright and Jordan, Graphical Models, Exponential Families, and Variational Inference

Background: the marginal polytope (all valid marginals)
[Figure: illustration of the marginal polytope for a Markov random field with three binary nodes X_1, X_2, X_3 (Wainwright & Jordan, 2003). Each vertex is the stacked vector µ of node assignments and edge assignments (X_1X_2, X_1X_3, X_2X_3) for one joint configuration; convex combinations such as µ = ½µ' + ½µ'' are also valid marginal probabilities.]

Stylized illustration of polytopes
[Figure: nested relaxations. The marginal polytope M = L_n enforces global consistency; the triplet polytope L_3 enforces triplet consistency; the local polytope L_2 enforces pairwise consistency. Tighter polytopes (toward M) are more accurate but more computationally intensive; looser ones (toward L_2) are less accurate but cheaper.]

When is MAP inference (relatively) easy?
When the graphical model is a tree. [Figure: example tree-structured model.]

Key ideas for graphical models:
1. Represent the world as a collection of random variables X_1, ..., X_n with joint distribution p(X_1, ..., X_n)
2. Learn the distribution from data
3. Perform inference (typically MAP or marginal)

Challenges:
1. Represent the world as a collection of random variables X_1, ..., X_n with joint distribution p(X_1, ..., X_n)
   How can we compactly describe this joint distribution?
   Directed graphical models (Bayesian networks); undirected graphical models (Markov random fields, factor graphs)
2. Learn the distribution from data
   Maximum likelihood estimation, other methods? How much data do we need? How much computation does it take?
3. Perform inference (typically MAP or marginal)
   Exact inference: junction tree algorithm
   Approximate inference: belief propagation, variational methods, ...

How can we compactly describe the joint distribution? Example: medical diagnosis.
Binary variable for each symptom (e.g. fever, cough, fast breathing, shaking, nausea, vomiting).
Binary variable for each disease (e.g. pneumonia, flu, common cold, bronchitis, tuberculosis).
Diagnosis is performed by inference in the model: p(pneumonia = 1 | cough = 1, fever = 1, vomiting = 0)
One famous model, Quick Medical Reference (QMR-DT), has 600 diseases and 4000 symptoms.

Representing the distribution
Naively, we could represent the distribution with a big table of probabilities for every possible outcome.
How many outcomes are there in QMR-DT? 2^4600.
Learning the distribution would require a huge amount of data.
Inference of conditional probabilities, e.g. p(pneumonia = 1 | cough = 1, fever = 1, vomiting = 0), would require summing over exponentially many values.
Moreover, this gives no way to make predictions with previously unseen observations.
We need structure.


Structure through independence
If X_1, ..., X_n are independent, then p(x_1, ..., x_n) = p(x_1) p(x_2) ··· p(x_n).
For binary variables, probabilities for the 2^n outcomes can be described by how many parameters? n.
However, this is not a very useful model: observing a variable X_i cannot influence our predictions of X_j.
Instead: if X_1, ..., X_n are conditionally independent given Y, denoted X_i ⊥ X_{-i} | Y, then
p(y, x_1, ..., x_n) = p(y) ∏_{i=1}^n p(x_i | y)
This is a simple yet powerful model.


Example: naive Bayes for classification
Classify e-mails as spam (Y = 1) or not spam (Y = 0).
Let i ∈ {1, ..., n} index the words in our vocabulary; X_i = 1 if word i appears in an e-mail, and 0 otherwise.
E-mails are drawn according to some distribution p(Y, X_1, ..., X_n).
Suppose that the words are conditionally independent given Y. Then,
p(y, x_1, ..., x_n) = p(y) ∏_{i=1}^n p(x_i | y)
Easy to learn the model with maximum likelihood. Predict with:
p(Y = 1 | x_1, ..., x_n) = p(Y = 1) ∏_{i=1}^n p(x_i | Y = 1) / Σ_{y ∈ {0,1}} p(Y = y) ∏_{i=1}^n p(x_i | Y = y)
Is conditional independence a reasonable assumption? A model may be wrong but still useful.
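To make the learning and prediction formulas concrete, here is a minimal Bernoulli naive Bayes sketch in Python (not from the lecture); the toy vocabulary, labels and the add-alpha smoothing are assumptions for illustration.

```python
import numpy as np

def fit_naive_bayes(X, y, alpha=1.0):
    """Estimate p(Y) and p(X_i = 1 | Y) by counting (with add-alpha smoothing)."""
    X, y = np.asarray(X), np.asarray(y)
    prior = np.array([np.mean(y == c) for c in (0, 1)])                  # p(Y = c)
    cond = np.array([(X[y == c].sum(axis=0) + alpha) /
                     ((y == c).sum() + 2 * alpha) for c in (0, 1)])      # p(X_i = 1 | Y = c)
    return prior, cond

def predict_spam_prob(x, prior, cond):
    """p(Y=1 | x) = p(Y=1) prod_i p(x_i | Y=1) / sum_y p(Y=y) prod_i p(x_i | Y=y)."""
    x = np.asarray(x)
    likes = np.array([np.prod(np.where(x == 1, cond[c], 1 - cond[c])) for c in (0, 1)])
    joint = prior * likes
    return joint[1] / joint.sum()

# Toy example: 4 e-mails over a 3-word vocabulary, labels 1 = spam.
X = [[1, 1, 0], [1, 0, 0], [0, 1, 1], [0, 0, 1]]
y = [1, 1, 0, 0]
prior, cond = fit_naive_bayes(X, y)
print(predict_spam_prob([1, 0, 0], prior, cond))  # high probability of spam
```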

Directed graphical models = Bayesian networks
A Bayesian network is specified by a directed acyclic graph DAG = (V, E) with:
1. One node i ∈ V for each random variable X_i
2. One conditional probability distribution (CPD) per node, p(x_i | x_Pa(i)), specifying the variable's probability conditioned on its parents' values
The DAG corresponds 1-1 with a particular factorization of the joint distribution:
p(x_1, ..., x_n) = ∏_{i ∈ V} p(x_i | x_Pa(i))
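A small sketch of how the factorization is used in practice: the probability of a full assignment is just the product of local CPD entries. The three-node network (A → C ← B) and its CPT numbers below are hypothetical.

```python
# Joint probability of a full assignment in a Bayesian network:
# p(x_1, ..., x_n) = prod_i p(x_i | x_Pa(i)).

parents = {"A": [], "B": [], "C": ["A", "B"]}

# CPDs as tables: key = (value of node, tuple of parent values) -> probability.
cpds = {
    "A": {(1, ()): 0.3, (0, ()): 0.7},
    "B": {(1, ()): 0.6, (0, ()): 0.4},
    "C": {(1, (0, 0)): 0.1, (0, (0, 0)): 0.9,
          (1, (0, 1)): 0.5, (0, (0, 1)): 0.5,
          (1, (1, 0)): 0.5, (0, (1, 0)): 0.5,
          (1, (1, 1)): 0.9, (0, (1, 1)): 0.1},
}

def joint_probability(assignment):
    """Multiply each node's CPD entry given its parents' values in the assignment."""
    p = 1.0
    for node, pa in parents.items():
        pa_vals = tuple(assignment[q] for q in pa)
        p *= cpds[node][(assignment[node], pa_vals)]
    return p

print(joint_probability({"A": 1, "B": 0, "C": 1}))  # 0.3 * 0.4 * 0.5 = 0.06
```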

Bayesian networks are generative models
[Figure: the naive Bayes network, with label Y as the parent of features X_1, X_2, X_3, ..., X_n; evidence is denoted by shading in a node.]
Can interpret a Bayesian network as a generative process. For example, to generate an e-mail, we
1. Decide whether it is spam or not spam, by sampling y ~ p(Y)
2. For each word i = 1 to n, sample x_i ~ p(X_i | Y = y)
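The generative reading suggests ancestral sampling: sample each variable from its CPD once its parents have values. Below is a sketch; the naive-Bayes-style network and its probabilities are made up for illustration.

```python
import random

# Ancestral sampling: sample each node from its CPD after its parents have been
# sampled (topological order). Network Y -> X1, X2, X3 with binary variables.

order = ["Y", "X1", "X2", "X3"]            # a topological order of the DAG
parents = {"Y": [], "X1": ["Y"], "X2": ["Y"], "X3": ["Y"]}

# p(node = 1 | parent values); all values are illustrative.
p_one = {
    "Y":  {(): 0.4},
    "X1": {(0,): 0.1, (1,): 0.8},
    "X2": {(0,): 0.3, (1,): 0.7},
    "X3": {(0,): 0.2, (1,): 0.9},
}

def ancestral_sample():
    x = {}
    for node in order:
        pa_vals = tuple(x[q] for q in parents[node])
        x[node] = 1 if random.random() < p_one[node][pa_vals] else 0
    return x

print(ancestral_sample())  # e.g. {'Y': 1, 'X1': 1, 'X2': 0, 'X3': 1}
```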

Bayesian network structure implies conditional independencies
Generalizing the earlier example, one can show that a variable is independent of its non-descendants given its parents.
Three local structures recur:
Common parent (tail-to-tail): A ← B → C. Fixing B decouples A and C.
Cascade (head-to-tail): A → B → C. Knowing B decouples A and C.
V-structure (head-to-head): A → C ← B, with p(A, B, C) = p(A) p(B) p(C | A, B). Knowing C couples A and B; this important phenomenon is called explaining away.

D-separation ("directed separation") in Bayesian networks
Bayes Ball algorithm to determine whether X ⊥ Z | Y by looking at the graph ("d-separation"):
Look to see if there is an active path between X and Z when the variables Y are observed.
[Figures: the canonical cases, with Y observed or unobserved on common-parent, cascade and v-structure paths between X and Z.]
If there is no such path, then X and Z are d-separated with respect to Y.
d-separation reduces statistical independencies (hard) to connectivity in graphs (easy).
Important because it allows us to quickly prune the Bayesian network, finding just the relevant variables for answering a query.
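One way to implement such a check, sketched below under an assumption of my own: rather than the Bayes Ball traversal itself, it uses the classical equivalent ancestral-moral-graph test for d-separation. The example network is the v-structure from the previous slide.

```python
from collections import deque

def d_separated(parents, X, Z, Y):
    """Test X ⊥ Z | Y in a DAG given as {node: [parents]}.

    Equivalent test to d-separation: restrict to ancestors of X ∪ Y ∪ Z,
    moralize (marry co-parents, drop directions), delete Y, and check
    whether X and Z are disconnected.
    """
    X, Z, Y = set(X), set(Z), set(Y)

    # 1. Ancestral subgraph of X ∪ Y ∪ Z.
    keep, stack = set(), list(X | Y | Z)
    while stack:
        v = stack.pop()
        if v not in keep:
            keep.add(v)
            stack.extend(parents[v])

    # 2. Moralize: undirected parent-child edges plus edges between co-parents.
    nbrs = {v: set() for v in keep}
    for v in keep:
        ps = [p for p in parents[v] if p in keep]
        for p in ps:
            nbrs[v].add(p); nbrs[p].add(v)
        for i in range(len(ps)):
            for j in range(i + 1, len(ps)):
                nbrs[ps[i]].add(ps[j]); nbrs[ps[j]].add(ps[i])

    # 3. Remove observed nodes Y, then BFS from X; separated iff no Z node is reached.
    seen, queue = set(X) - Y, deque(set(X) - Y)
    while queue:
        v = queue.popleft()
        if v in Z:
            return False
        for u in nbrs[v] - Y:
            if u not in seen:
                seen.add(u); queue.append(u)
    return True

# V-structure A -> C <- B: A ⊥ B marginally, but not given C ("explaining away").
parents = {"A": [], "B": [], "C": ["A", "B"]}
print(d_separated(parents, {"A"}, {"B"}, set()))   # True
print(d_separated(parents, {"A"}, {"B"}, {"C"}))   # False
```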

D-separation example 1
[Figure: an example Bayesian network on X_1, ..., X_6 used to work through d-separation queries.]

D-separation example 2
[Figure: a second example Bayesian network on X_1, ..., X_6 used to work through d-separation queries.]

The 2011 Turing Award (Judea Pearl) was for Bayesian networks.


Example: hidden Markov model (HMM)
[Figure: hidden chain Y_1 → Y_2 → ... → Y_6 with observations X_1, ..., X_6, each X_t a child of Y_t.]
Frequently used for speech recognition and part-of-speech tagging.
Joint distribution factors as:
p(y, x) = p(y_1) p(x_1 | y_1) ∏_{t=2}^T p(y_t | y_{t-1}) p(x_t | y_t)
p(y_1) is the initial distribution of the starting state, p(y_t | y_{t-1}) is the transition probability between hidden states, and p(x_t | y_t) is the emission probability.
What are the conditional independencies here? Many, e.g. Y_1 ⊥ {Y_3, ..., Y_6} | Y_2.
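A short sketch evaluating the HMM joint probability exactly as factorized above; the two-state, three-symbol parameters are invented for illustration.

```python
import numpy as np

# p(y, x) = p(y_1) p(x_1 | y_1) * prod_{t=2}^T p(y_t | y_{t-1}) p(x_t | y_t)

pi = np.array([0.6, 0.4])                 # p(y_1)
A = np.array([[0.7, 0.3],                 # A[i, j] = p(y_t = j | y_{t-1} = i)
              [0.2, 0.8]])
B = np.array([[0.5, 0.4, 0.1],            # B[i, k] = p(x_t = k | y_t = i)
              [0.1, 0.3, 0.6]])

def hmm_joint(y, x):
    p = pi[y[0]] * B[y[0], x[0]]
    for t in range(1, len(y)):
        p *= A[y[t - 1], y[t]] * B[y[t], x[t]]
    return p

print(hmm_joint(y=[0, 0, 1], x=[0, 1, 2]))  # 0.6*0.5 * 0.7*0.4 * 0.3*0.6 = 0.01512
```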

Summary
A Bayesian network specifies the global distribution by a DAG and local conditional probability distributions (CPDs) for each node.
Can interpret as a generative model, where variables are sampled in topological order.
Examples: naive Bayes, hidden Markov models (HMMs), latent Dirichlet allocation.
Conditional independence via d-separation.
Compute the probability of any assignment by multiplying CPDs.
Maximum likelihood learning of CPDs is easy (decomposes, so each CPD can be estimated separately).

Undirected graphical models
An alternative representation for a joint distribution is an undirected graphical model.
As for directed models, we have one node for each random variable.
Rather than CPDs, we specify (non-negative) potential functions over sets of variables associated with cliques C of the graph:
p(x_1, ..., x_n) = (1/Z) ∏_{c ∈ C} φ_c(x_c)
Z is the partition function and normalizes the distribution:
Z = Σ_{x̂_1, ..., x̂_n} ∏_{c ∈ C} φ_c(x̂_c)
Like a CPD, φ_c(x_c) can be represented as a table, but it is not normalized.
Also known as Markov random fields (MRFs) or Markov networks.

Undirected graphical models
p(x_1, ..., x_n) = (1/Z) ∏_{c ∈ C} φ_c(x_c),   Z = Σ_{x̂_1, ..., x̂_n} ∏_{c ∈ C} φ_c(x̂_c)
Simple example on the triangle A - B - C (each edge potential function encourages its variables to take the same value):
φ_{A,B}(a, b) = φ_{B,C}(b, c) = φ_{A,C}(a, c) = 10 if the two arguments are equal, 1 otherwise.
p(a, b, c) = (1/Z) φ_{A,B}(a, b) φ_{B,C}(b, c) φ_{A,C}(a, c),
where Z = Σ_{â, b̂, ĉ ∈ {0,1}^3} φ_{A,B}(â, b̂) φ_{B,C}(b̂, ĉ) φ_{A,C}(â, ĉ) = 2 · 1000 + 6 · 10 = 2060.
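The value Z = 2060 is easy to verify by brute force, since there are only 2^3 joint assignments; a sketch:

```python
from itertools import product

# Each edge potential of the triangle MRF is 10 when its endpoints agree, 1 otherwise.
def phi(u, v):
    return 10.0 if u == v else 1.0

Z = sum(phi(a, b) * phi(b, c) * phi(a, c)
        for a, b, c in product([0, 1], repeat=3))
print(Z)  # 2060.0  (= 2 * 1000 + 6 * 10)

def p(a, b, c):
    return phi(a, b) * phi(b, c) * phi(a, c) / Z

print(p(0, 0, 0), p(0, 0, 1))  # ~0.485 vs ~0.005: agreeing configurations are far more likely
```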

Markov network structure implies conditional independencies
Let G be the undirected graph where we have one edge for every pair of variables that appear together in a potential.
Conditional independence is given by graph separation:
X_A ⊥ X_C | X_B if there is no path from a ∈ A to c ∈ C after removing all variables in B.
[Figure: three node sets X_A, X_B, X_C, with X_B separating X_A from X_C.]
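Graph separation itself is just connectivity after deleting the conditioning set; a minimal sketch (toy graph assumed):

```python
from collections import deque

def separated(neighbors, A, C, B):
    """Undirected separation: True iff no path connects A to C once all nodes in B are removed."""
    A, C, B = set(A), set(C), set(B)
    seen, queue = set(A) - B, deque(set(A) - B)
    while queue:
        v = queue.popleft()
        if v in C:
            return False
        for u in neighbors[v] - B:
            if u not in seen:
                seen.add(u)
                queue.append(u)
    return True

# Chain a - b - c: a and c are separated by {b}, but not by the empty set.
nbrs = {"a": {"b"}, "b": {"a", "c"}, "c": {"b"}}
print(separated(nbrs, {"a"}, {"c"}, {"b"}))  # True
print(separated(nbrs, {"a"}, {"c"}, set()))  # False
```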

Markov blanket
A set U is a Markov blanket of X if X ∉ U and U is a minimal set of nodes such that X ⊥ (V \ {X} \ U) | U, where V denotes the set of all variables.
In undirected graphical models, the Markov blanket of a variable is precisely its neighbors in the graph.
[Figure: a node X together with its immediate neighbors.]
In other words, X is independent of the rest of the nodes in the graph given its immediate neighbors.

Directed and undirected models are different
[Figures: example graphs with fewer than 3 edges, illustrating independence structures that one family of models can represent and the other cannot.]

Example: Ising model
Invented by the physicist Wilhelm Lenz (1920), who gave it as a problem to his student Ernst Ising.
Mathematical model of ferromagnetism in statistical mechanics.
The spin of an atom is influenced by the spins of atoms nearby on the material.
[Figure: a grid of spins, each either +1 or -1.]
Each atom has a spin X_i ∈ {−1, +1}, whose value is the direction of the atom's spin.
If a spin at position i is +1, what is the probability that the spin at position j is also +1?
Are there phase transitions where spins go from disorder to order?

Example: Ising model
Each atom has a spin X_i ∈ {−1, +1}, whose value is the direction of the atom's spin.
The spin of an atom is influenced by the spins of atoms nearby on the material.
p(x_1, ..., x_n) = (1/Z) exp( (1/T) ( Σ_{i<j} w_{i,j} x_i x_j + Σ_i θ_i x_i ) )
When w_{i,j} > 0, adjacent atoms are encouraged to have the same spin (attractive or ferromagnetic); w_{i,j} < 0 encourages x_i ≠ x_j.
Node potentials θ_i encode the bias of the individual atoms.
Varying the temperature T makes the distribution more or less spiky.
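A common way to explore such questions empirically is Gibbs sampling. Below is a minimal sketch for a small grid; the conditional update follows from the joint above, while the grid size, coupling, bias and temperature are illustrative choices, not values from the lecture.

```python
import math, random

# Gibbs sampling for an Ising model on an L x L grid with uniform coupling w and bias theta.
# Conditional update: p(x_i = +1 | rest) = sigmoid((2/T) * (w * sum of neighbour spins + theta)).

L, w, theta, T = 10, 1.0, 0.0, 2.0
spins = [[random.choice([-1, 1]) for _ in range(L)] for _ in range(L)]

def neighbour_sum(i, j):
    total = 0
    for di, dj in ((1, 0), (-1, 0), (0, 1), (0, -1)):
        ni, nj = i + di, j + dj
        if 0 <= ni < L and 0 <= nj < L:
            total += spins[ni][nj]
    return total

def gibbs_sweep():
    for i in range(L):
        for j in range(L):
            p_up = 1.0 / (1.0 + math.exp(-2.0 / T * (w * neighbour_sum(i, j) + theta)))
            spins[i][j] = 1 if random.random() < p_up else -1

for _ in range(100):
    gibbs_sweep()

magnetisation = sum(sum(row) for row in spins) / (L * L)
print(magnetisation)  # near ±1 at low T (ordered), near 0 at high T (disordered)
```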

Supplementary material
Extra slides for questions or further explanation.

Basic idea
Suppose we have a simple chain A → B → C → D and we want to compute p(D).
p(D) is a set of values, {p(D = d), d ∈ Val(D)}: the algorithm computes sets of values at a time, i.e. an entire distribution.
The joint distribution factors as p(A, B, C, D) = p(A) p(B | A) p(C | B) p(D | C).
In order to compute p(D), we have to marginalize over A, B, C:
p(D) = Σ_{a,b,c} p(A = a, B = b, C = c, D)

How can we perform the sum efficiently?
Our goal is to compute
p(D) = Σ_{a,b,c} p(a, b, c, D) = Σ_{a,b,c} p(a) p(b | a) p(c | b) p(D | c)
We can push the summations inside to obtain:
p(D) = Σ_c p(D | c) Σ_b p(c | b) Σ_a p(b | a) p(a)
Let's call ψ_1(A, B) = p(A) p(B | A). Then τ_1(B) = Σ_a ψ_1(a, B).
Similarly, let ψ_2(B, C) = τ_1(B) p(C | B). Then τ_2(C) = Σ_b ψ_2(b, C).
This procedure is dynamic programming: computation is inside out instead of outside in.

Inference in a chain
Generalizing the previous example, suppose we have a chain X_1 → X_2 → ... → X_n, where each variable has k states.
For i = 1 up to n − 1, compute (and cache)
p(x_{i+1}) = Σ_{x_i} p(x_{i+1} | x_i) p(x_i)
Each update takes k^2 time (why?). The total running time is O(nk^2).
In comparison, naively marginalizing over all latent variables has complexity O(k^n).
We did inference over the joint without ever explicitly constructing it!
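A sketch of this recursion with randomly generated CPDs as placeholders: each step is a k × k matrix-vector product, so the total cost is O(nk^2) rather than O(k^n).

```python
import numpy as np

# Exact marginal inference in a chain X_1 -> ... -> X_n with k states per variable:
# repeatedly apply p(x_{i+1}) = sum_{x_i} p(x_{i+1} | x_i) p(x_i).

rng = np.random.default_rng(0)
n, k = 6, 3

p_x1 = rng.dirichlet(np.ones(k))                   # p(x_1)
transitions = [rng.dirichlet(np.ones(k), size=k)   # row a of each matrix is p(next | current = a)
               for _ in range(n - 1)]

marginal = p_x1
for T in transitions:
    marginal = T.T @ marginal                      # p(x_{i+1}) = sum_{x_i} p(x_{i+1}|x_i) p(x_i)

print(marginal, marginal.sum())                    # final marginal p(x_n); sums to 1
```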

ML learning in Bayesian networks
Maximum likelihood learning: max_θ ℓ(θ; D), where
ℓ(θ; D) = log p(D; θ) = Σ_{x ∈ D} log p(x; θ) = Σ_i Σ_{x̂_pa(i)} Σ_{x ∈ D: x_pa(i) = x̂_pa(i)} log p(x_i | x̂_pa(i))
In Bayesian networks, we have the closed-form ML solution:
θ^ML_{x_i | x_pa(i)} = N_{x_i, x_pa(i)} / Σ_{x̂_i} N_{x̂_i, x_pa(i)}
where N_{x_i, x_pa(i)} is the number of times that the (partial) assignment x_i, x_pa(i) is observed in the training data.
We can estimate each CPD independently because the objective decomposes by variable and parent assignment.
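A minimal counting implementation of this closed-form solution; the two-node network and the data are made up for illustration.

```python
from collections import Counter, defaultdict

# theta_{x_i | x_pa(i)} = N(x_i, x_pa(i)) / sum_{x_i'} N(x_i', x_pa(i))

parents = {"A": [], "B": ["A"]}
data = [{"A": 1, "B": 1}, {"A": 1, "B": 0}, {"A": 1, "B": 1}, {"A": 0, "B": 0}]

def fit_cpds(parents, data):
    counts = {node: Counter() for node in parents}
    for x in data:
        for node, pa in parents.items():
            counts[node][(x[node], tuple(x[q] for q in pa))] += 1
    cpds = {node: {} for node in parents}
    for node, cnt in counts.items():
        totals = defaultdict(int)
        for (_, pa_vals), n in cnt.items():
            totals[pa_vals] += n
        for (val, pa_vals), n in cnt.items():
            cpds[node][(val, pa_vals)] = n / totals[pa_vals]
    return cpds

cpds = fit_cpds(parents, data)
print(cpds["B"][(1, (1,))])  # p(B=1 | A=1) = 2/3
```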

Parameter learning in Markov networks
How do we learn the parameters of an Ising model?
p(x_1, ..., x_n) = (1/Z) exp( Σ_{i<j} w_{i,j} x_i x_j + Σ_i θ_i x_i )

Bad news for Markov networks
The global normalization constant Z(θ) kills decomposability:
θ^ML = arg max_θ log ∏_{x ∈ D} p(x; θ)
     = arg max_θ Σ_{x ∈ D} ( Σ_c log φ_c(x_c; θ) − log Z(θ) )
     = arg max_θ ( Σ_{x ∈ D} Σ_c log φ_c(x_c; θ) ) − |D| log Z(θ)
The log-partition function prevents us from decomposing the objective into a sum over terms for each potential.
Solving for the parameters becomes much more complicated.
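To see the problem concretely, here is a brute-force sketch for a tiny Ising-style model: evaluating the log-likelihood requires log Z(θ), which sums over all 2^n configurations (the path structure and parameter values are illustrative assumptions).

```python
import math
from itertools import product

# Tiny Ising model on a path of n nodes.
n = 4
w = {(i, i + 1): 1.0 for i in range(n - 1)}   # couplings on the path edges
theta = [0.2] * n                              # node biases

def score(x):
    """The exponent sum_{i<j} w_ij x_i x_j + sum_i theta_i x_i for x in {-1,+1}^n."""
    return (sum(wij * x[i] * x[j] for (i, j), wij in w.items())
            + sum(theta[i] * x[i] for i in range(n)))

def log_Z():
    # The only generic way to get Z: sum over all 2^n configurations.
    return math.log(sum(math.exp(score(x))
                        for x in product([-1, 1], repeat=n)))

def log_likelihood(data):
    # sum_x score(x) - |D| log Z, matching the non-decomposable objective above.
    return sum(score(x) for x in data) - len(data) * log_Z()

print(log_Z())
print(log_likelihood([(1, 1, 1, 1), (-1, -1, 1, 1)]))
```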