CSC 412/2506 Winter 2018 Probabilistic Learning and Reasoning Lecture 3: Directed Graphical Models and Latent Variables Based on slides by Richard Zemel

Learning outcomes: What aspects of a model can we express using graphical notation? Which aspects are not captured in this way? How do independencies change as a result of conditioning? Reasons for using latent variables; common motifs such as mixtures and chains; how to integrate out unobserved variables.

Joint Probabilities The chain rule implies that any joint distribution equals p(x_{1:D}) = p(x_1) p(x_2 | x_1) p(x_3 | x_1, x_2) p(x_4 | x_1, x_2, x_3) ... p(x_D | x_{1:D-1}). A directed graphical model implies a restricted factorization.

Conditional Independence Notation: x_A ⊥ x_B | x_C. Definition: two (sets of) variables x_A and x_B are conditionally independent given a third x_C if P(x_A, x_B | x_C) = P(x_A | x_C) P(x_B | x_C) ∀ x_C, which is equivalent to saying P(x_A | x_B, x_C) = P(x_A | x_C) ∀ x_C. Only a subset of all distributions respect any given (nontrivial) conditional independence statement. The subset of distributions that respect all the CI assumptions we make is the family of distributions consistent with our assumptions. Probabilistic graphical models are a powerful, elegant and simple way to specify such a family.

Directed Graphical Models Consider directed acyclic graphs over N variables. Each node has a (possibly empty) set of parents π_i. We can then write P(x_1, ..., x_N) = Π_i P(x_i | x_{π_i}). Hence we factorize the joint in terms of local conditional probabilities: exponential in the fan-in of each node, instead of in N.
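
As a quick illustration (not from the slides), here is a minimal sketch that evaluates a DAG-factorized joint for a tiny chain x_1 → x_2 → x_3 over binary variables; the CPT values are made up, and the point is that we only ever store tables that are exponential in fan-in.

```python
# A minimal sketch (not from the slides): evaluating a DAG-factorized joint
# p(x1, x2, x3) = p(x1) p(x2|x1) p(x3|x2) for binary variables.
# The CPT values below are made up for illustration.
import numpy as np

p_x1 = np.array([0.6, 0.4])                    # p(x1)
p_x2_given_x1 = np.array([[0.7, 0.3],          # p(x2 | x1 = 0)
                          [0.2, 0.8]])         # p(x2 | x1 = 1)
p_x3_given_x2 = np.array([[0.9, 0.1],          # p(x3 | x2 = 0)
                          [0.5, 0.5]])         # p(x3 | x2 = 1)

def joint(x1, x2, x3):
    """Product of local conditionals: storage is exponential in fan-in, not in N."""
    return p_x1[x1] * p_x2_given_x1[x1, x2] * p_x3_given_x2[x2, x3]

# The factorized joint still normalizes to 1 over all 2^3 configurations.
total = sum(joint(a, b, c) for a in range(2) for b in range(2) for c in range(2))
print(total)  # ~1.0
```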

Conditional Independence in DAGs If we order the nodes in a directed graphical model so that parents always come before their children in the ordering, then the graphical model implies the following about the distribution: {x_i ⊥ x_{π̃_i} | x_{π_i}} ∀ i, where x_{π̃_i} are the nodes coming before x_i that are not its parents. In other words, the DAG is telling us that each variable is conditionally independent of its non-descendants given its parents. Such an ordering is called a topological ordering.

Example DAG Consider this six-node network over x_1, ..., x_6 (with edges x_1 → x_2, x_1 → x_3, x_2 → x_4, x_3 → x_5, and x_2, x_5 → x_6). The joint probability is now: p(x_1, ..., x_6) = p(x_1) p(x_2 | x_1) p(x_3 | x_1) p(x_4 | x_2) p(x_5 | x_3) p(x_6 | x_2, x_5)

Missing Edges Key point about directed graphical models: missing edges imply conditional independence. Remember that by the chain rule we can always write the full joint as a product of conditionals, given an ordering: P(x_1, x_2, ...) = P(x_1) P(x_2 | x_1) P(x_3 | x_2, x_1) P(x_4 | x_3, x_2, x_1) ... If the joint is represented by a DAGM, then some of the conditioned variables on the right hand sides are missing. This is equivalent to enforcing conditional independence. Start with the "idiot's graph": each node has all previous nodes in the ordering as its parents. Now remove edges to get your DAG. Removing an edge into node i eliminates an argument from the conditional probability factor P(x_i | x_1, x_2, ..., x_{i-1}).

D-Separation D-separation, or directed-separation, is a notion of connectedness in DAGMs in which two (sets of) variables may or may not be connected conditioned on a third (set of) variables. D-connection implies conditional dependence and d-separation implies conditional independence. In particular, we say that x_A ⊥ x_B | x_C if every variable in A is d-separated from every variable in B conditioned on all the variables in C. To check if an independence is true, we can cycle through each node in A, do a depth-first search to reach every node in B, and examine the path between them. If all of the paths are d-separated, then we can assert x_A ⊥ x_B | x_C. Thus, it will be sufficient to consider triples of nodes. (Why?) Pictorially, when we condition on a node, we shade it in.

Chain Q: When we condition on y, are x and z independent? P(x, y, z) = P(x) P(y | x) P(z | y), which implies P(z | x, y) = P(x, y, z) / P(x, y) = P(x) P(y | x) P(z | y) / (P(x) P(y | x)) = P(z | y), and therefore x ⊥ z | y. Think of x as the past, y as the present and z as the future.

Common Cause Q: When we condition on y, are x and z independent? P(x, y, z) = P(y) P(x | y) P(z | y), which implies P(x, z | y) = P(x, y, z) / P(y) = P(y) P(x | y) P(z | y) / P(y) = P(x | y) P(z | y), and therefore x ⊥ z | y.

Explaining Away Q: When we condition on y, are x and z independent? P(x, y, z) = P(x) P(z) P(y | x, z). x and z are marginally independent, but given y they are conditionally dependent. This important effect is called explaining away (Berkson's paradox). For example, flip two coins independently; let x = coin 1, z = coin 2. Let y = 1 if the coins come up the same and y = 0 if different. x and z are independent, but if I tell you y, they become coupled!
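
A small numerical check of the two-coin example described above (the probabilities follow directly from that setup; the code itself is just an illustration, not part of the slides):

```python
# Numerical check of the two-coin example: x, z are independent fair coins and
# y = 1 exactly when they match. Marginally x and z are independent, but
# conditioning on y couples them.
import itertools

def p_joint(x, z, y):
    p_xz = 0.25                                  # independent fair coins
    p_y_given_xz = 1.0 if y == int(x == z) else 0.0
    return p_xz * p_y_given_xz

# Marginal independence: p(x=1, z=1) equals p(x=1) p(z=1) = 0.25
p_x1z1 = sum(p_joint(1, 1, y) for y in (0, 1))
print(p_x1z1)                                    # 0.25

# Conditional dependence: p(x=1, z=1 | y=1) != p(x=1 | y=1) p(z=1 | y=1)
p_y1 = sum(p_joint(x, z, 1) for x, z in itertools.product((0, 1), repeat=2))
p_x1z1_given_y1 = p_joint(1, 1, 1) / p_y1        # 0.5
p_x1_given_y1 = sum(p_joint(1, z, 1) for z in (0, 1)) / p_y1  # 0.5
print(p_x1z1_given_y1, p_x1_given_y1 ** 2)       # 0.5 vs 0.25: coupled given y
```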

Bayes-Ball Algorithm To check if x_A ⊥ x_B | x_C we need to check if every variable in A is d-separated from every variable in B conditioned on all variables in C. In other words, given that all the nodes in x_C are clamped, when we wiggle nodes x_A can we change any of the nodes in x_B? The Bayes-Ball Algorithm is such a d-separation test. We shade all nodes x_C, place balls at each node in x_A (or x_B), let them bounce around according to some rules, and then ask if any of the balls reach any of the nodes in x_B (or x_A).

Bayes-Ball Rules The three cases we considered give us the rules:

Bayes-Ball Boundary Rules We also need the boundary conditions. Here's a trick for the explaining-away case: if y or any of its descendants is shaded, the ball passes through. Notice balls can travel opposite to edge directions.

Canonical Micrographs

Examples of Bayes-Ball Algorithm

Examples of Bayes-Ball Algorithm Notice: balls can travel opposite to edge direction

Plates

Plates & Parameters Since Bayesian methods treat parameters as random variables, we would like to include them in the graphical model. One way to do this is to repeat all the iid observations explicitly and show the parameter only once. A better way is to use plates, in which repeated quantities that are iid are put in a box.

Plates: Macros for Repeated Structures Plates are like macros that allow you to draw a very complicated graphical model with a simpler notation. The rules of plates are simple: repeat every structure in a box a number of times given by the integer in the corner of the box (e.g. N), updating the plate index variable (e.g. n) as you go. Duplicate every arrow going into the plate and every arrow leaving the plate by connecting the arrows to each copy of the structure.

Nested/Intersecting Plates Plates can be nested, in which case their arrows get duplicated also, according to the rule: draw an arrow from every copy of the source node to every copy of the destination node. Plates can also cross (intersect), in which case the nodes at the intersection have multiple indices and get duplicated a number of times equal to the product of the duplication numbers on all the plates containing them.

Example: Nested Plates

Example DAGM: Markov Chain Markov Property: Conditioned on the present, the past and future are independent

Unobserved Variables Certain variables Q in our models may be unobserved, either some of the time or always, either at training time or at test time. Graphically, we will use shading to indicate observation.

Partially Unobserved (Missing) Variables If variables are occasionally unobserved they are missing data, e.g., undefined inputs, missing class labels, erroneous target values. In this case, we can still model the joint distribution, but we define a new cost function in which we sum out or marginalize the missing values at training or test time: ℓ(θ; D) = Σ_complete log p(x^c, y^c | θ) + Σ_missing log p(x^m | θ) = Σ_complete log p(x^c, y^c | θ) + Σ_missing log Σ_y p(x^m, y | θ). Recall that p(x) = Σ_q p(x, q).
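
For concreteness, a hedged sketch of this marginalized likelihood for a hypothetical two-class class-conditional model p(x, y) = p(y) N(x | μ_y, 1); all parameter values and data points below are made up.

```python
# Hedged sketch of the marginalized log likelihood above, for a hypothetical
# two-class class-conditional model p(x, y) = p(y) N(x | mu_y, 1); all numbers
# are made up for illustration.
import numpy as np
from scipy.stats import norm

prior = np.array([0.3, 0.7])                 # p(y)
means, std = np.array([-1.0, 2.0]), 1.0      # class-conditional Gaussians

def log_p_xy(x, y):
    return np.log(prior[y]) + norm.logpdf(x, means[y], std)

def log_p_x(x):
    # marginalize the missing label: log sum_y p(x, y)
    return np.logaddexp(log_p_xy(x, 0), log_p_xy(x, 1))

complete = [(0.5, 1), (-1.2, 0)]             # (x, y) pairs with observed labels
missing = [1.8, -0.3]                        # inputs whose labels are missing

loglik = sum(log_p_xy(x, y) for x, y in complete) + sum(log_p_x(x) for x in missing)
print(loglik)
```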

Latent Variables What to do when a variable z is always unobserved? It depends on where it appears in our model. If we never condition on it when computing the probability of the variables we do observe, then we can just forget about it and integrate it out. E.g., given y, x fit the model p(z, y | x) = p(z | y) p(y | x, w) p(w). (In other words, if it is a leaf node.) But if z is conditioned on, we need to model it: e.g., given y, x fit the model p(y | x) = Σ_z p(y | x, z) p(z).

Where Do Latent Variables Come From? Latent variables may appear naturally, from the structure of the problem, because something wasn t measured, because of faulty sensors, occlusion, privacy, etc. But also, we may want to intentionally introduce latent variables to model complex dependencies between variables without looking at the dependencies between them directly. This can actually simplify the model (e.g., mixtures).

Latent Variables Models & Regression You can think of clustering as the problem of classification with missing class labels. You can think of factor models (such as factor analysis, PCA, ICA, etc.) as linear or nonlinear regression with missing inputs.

Why is Learning Harder? In fully observed iid settings, the probability model is a product, thus the log likelihood is a sum where terms decouple. (At least for directed models.) ℓ(θ; D) = log p(x, z | θ) = log p(z | θ_z) + log p(x | z, θ_x). With latent variables, the probability already contains a sum, so the log likelihood has all parameters coupled together via z: ℓ(θ; D) = log Σ_z p(x, z | θ) = log Σ_z p(z | θ_z) p(x | z, θ_x). (Just as with the partition function in undirected models.)

Why is Learning Harder? The likelihood ℓ(θ; D) = log Σ_z p(z | θ_z) p(x | z, θ_x) couples the parameters. We can treat this as a black-box probability function and just try to optimize the likelihood as a function of θ (e.g., gradient descent). However, sometimes taking advantage of the latent variable structure can make parameter estimation easier. Good news: soon we will see how to deal with latent variables. Basic trick: put a tractable distribution on the values you don't know. Basic math: use convexity to lower bound the likelihood.

Mixture Models Most basic latent variable model with a single discrete node z. Allows different submodels (experts) to contribute to the (conditional) density model in different parts of the space. Divide & conquer idea: use simple parts to build complex models (e.g., multimodal densities, or piecewise-linear regressions).

Mixture Densities Exactly like a classification model but the class is unobserved and so we sum it out. What we get is a perfectly valid density: p(x | θ) = Σ_{k=1}^K p(z = k | θ_z) p(x | z = k, θ_k) = Σ_k α_k p_k(x | θ_k), where the mixing proportions add to one: Σ_k α_k = 1. We can use Bayes' rule to compute the posterior probability of the mixture component given some data: p(z = k | x, θ) = α_k p_k(x | θ_k) / Σ_j α_j p_j(x | θ_j); these quantities are called responsibilities.

Example: Gaussian Mixture Models Consider a mixture of K Gaussian components: p(x | θ) = Σ_k α_k N(x | μ_k, Σ_k), p(z = k | x, θ) = α_k N(x | μ_k, Σ_k) / Σ_j α_j N(x | μ_j, Σ_j), ℓ(θ; D) = Σ_n log Σ_k α_k N(x^(n) | μ_k, Σ_k). Density model: p(x | θ) is a familiarity signal. Clustering: p(z | x, θ) is the assignment rule, ℓ(θ) is the cost.
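
A minimal sketch of the mixture density and responsibilities above, for a 1-D Gaussian mixture; the formulas are the ones on this slide, and the parameter values are made up for illustration.

```python
# Minimal sketch: 1-D Gaussian mixture density and responsibilities.
import numpy as np
from scipy.stats import norm

alphas = np.array([0.5, 0.3, 0.2])           # mixing proportions, sum to 1
mus = np.array([-2.0, 0.0, 3.0])
sigmas = np.array([1.0, 0.5, 1.5])

def density(x):
    # p(x | theta) = sum_k alpha_k N(x | mu_k, sigma_k^2)
    return np.sum(alphas * norm.pdf(x, mus, sigmas))

def responsibilities(x):
    # p(z = k | x, theta) is proportional to alpha_k N(x | mu_k, sigma_k^2)
    unnorm = alphas * norm.pdf(x, mus, sigmas)
    return unnorm / unnorm.sum()

x = 0.3
print(density(x), responsibilities(x))       # the responsibilities sum to 1
```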

Example: Mixtures of Experts Also called conditional mixtures. Exactly like a class-conditional model but the class is unobserved and so we sum it out again: p(y | x, θ) = Σ_{k=1}^K p(z = k | x, θ_z) p(y | z = k, x, θ_k) = Σ_k α_k(x) p_k(y | x, θ_k), where Σ_k α_k(x) = 1 ∀x. Harder: we must learn α(x) (unless we choose z independent of x). We can still use Bayes' rule to compute the posterior probability of the mixture component given some data: p(z = k | x, y, θ) = α_k(x) p_k(y | x, θ_k) / Σ_j α_j(x) p_j(y | x, θ_j); this function is often called the gating function.

Example: Mixtures of Linear Regression Experts Each expert generates data according to a linear function of the input plus additive Gaussian noise: p(y | x, θ) = Σ_k α_k N(y | β_k^T x, σ_k^2). The gate function can be a softmax classification machine: α_k(x) = p(z = k | x) = e^{η_k^T x} / Σ_j e^{η_j^T x}. Remember: we are not modeling the density of the inputs x.
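
A short sketch of this conditional mixture, assuming the β_k / η_k parameterization written above with a scalar output y and 2-D input x; the weights and noise levels are hypothetical.

```python
# Sketch of a mixture of linear regression experts with a softmax gate,
# assuming the beta_k / eta_k parameterization above; all values hypothetical.
import numpy as np
from scipy.stats import norm

betas = np.array([[1.0, 0.5], [-2.0, 1.5]])   # expert regression weights (K x D)
etas = np.array([[0.8, -0.3], [-0.8, 0.3]])   # gating weights (K x D)
sigmas = np.array([0.5, 1.0])                 # expert noise standard deviations

def gate(x):
    # alpha_k(x) = exp(eta_k . x) / sum_j exp(eta_j . x)
    logits = etas @ x
    logits = logits - logits.max()            # shift for numerical stability
    w = np.exp(logits)
    return w / w.sum()

def p_y_given_x(y, x):
    # sum_k alpha_k(x) N(y | beta_k . x, sigma_k^2)
    return np.sum(gate(x) * norm.pdf(y, betas @ x, sigmas))

print(p_y_given_x(1.2, np.array([1.0, 2.0])))
```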

Gradient Learning with Mixtures We can learn mixture densities using gradient descent on the likelihood as usual. The gradients are quite interesting: ℓ(θ) = log p(x | θ) = log Σ_k α_k p_k(x | θ_k), so ∂ℓ/∂θ = (1 / p(x | θ)) Σ_k α_k ∂p_k(x | θ_k)/∂θ = Σ_k α_k (1 / p(x | θ)) p_k(x | θ_k) ∂ log p_k(x | θ_k)/∂θ = Σ_k α_k (p_k(x | θ_k) / p(x | θ)) ∂ℓ_k/∂θ_k = Σ_k α_k r_k ∂ℓ_k/∂θ_k. In other words, the gradient is the responsibility-weighted sum of the individual log likelihood gradients.
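
A quick finite-difference check of this responsibility-weighted gradient, using a hypothetical 1-D two-component Gaussian mixture with unit variances (so ∂ log N(x | μ_k, 1)/∂μ_k = x − μ_k):

```python
# Finite-difference check of the responsibility-weighted gradient for a
# hypothetical 1-D two-component Gaussian mixture with unit variances.
import numpy as np
from scipy.stats import norm

alphas, mus = np.array([0.4, 0.6]), np.array([-1.0, 2.0])
x = 0.5

def loglik(m):
    return np.log(np.sum(alphas * norm.pdf(x, m, 1.0)))

post = alphas * norm.pdf(x, mus, 1.0)
post = post / post.sum()                      # alpha_k r_k, the responsibilities
analytic = post * (x - mus)                   # responsibility-weighted per-component gradients

eps = 1e-6
numeric = np.array([(loglik(mus + eps * np.eye(2)[k]) - loglik(mus)) / eps
                    for k in range(2)])
print(analytic, numeric)                      # agree to ~1e-6
```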

Parameter Constraints If we want to use general optimizers (e.g., conjugate gradient) to learn latent variable models, we often have to make sure parameters respect certain constraints (e.g., Σ_k α_k = 1, α_k ≥ 0, Σ_k positive definite). A good trick is to re-parameterize these quantities in terms of unconstrained values. For mixing proportions, use the softmax: α_k = exp(q_k) / Σ_j exp(q_j). For covariance matrices, use the Cholesky decomposition: Σ^{-1} = A^T A, |Σ|^{-1/2} = Π_i A_ii, where A is upper triangular with positive diagonal: A_ii = exp(r_i) > 0, A_ij = a_ij (j > i), A_ij = 0 (j < i).
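
A sketch of this re-parameterization with arbitrary unconstrained values: the softmax gives valid mixing proportions, and an upper-triangular A with positive diagonal gives a valid precision matrix Σ^{-1} = A^T A.

```python
# Sketch of the re-parameterization trick above, with arbitrary unconstrained values.
import numpy as np

q = np.array([0.2, -1.0, 3.0])                 # unconstrained
alphas = np.exp(q) / np.exp(q).sum()           # positive and sums to 1

r = np.array([0.1, -0.5])                      # unconstrained log-diagonal
a12 = 0.7                                      # unconstrained off-diagonal entry
A = np.array([[np.exp(r[0]), a12],
              [0.0,          np.exp(r[1])]])   # upper triangular, positive diagonal
precision = A.T @ A                            # symmetric positive definite by construction

print(alphas.sum(), np.linalg.eigvalsh(precision))  # 1.0, all eigenvalues > 0
```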

Logsumexp Often you can easily compute b_k = log p(x | z = k, θ_k), but it will be very negative, say -10^6 or smaller. Now, to compute ℓ = log p(x | θ) you need to compute log Σ_k e^{b_k} (e.g., for calculating responsibilities at test time or for learning). Careful! Do not compute this by doing log(sum(exp(b))). You will get underflow and an incorrect answer. Instead, shift all the values b_k by a constant B so that the largest value equals zero: B = max(b), then compute log(sum(exp(b - B))) + B. Example: if log p(x | z = 1) = -120 and log p(x | z = 2) = -120, what is log p(x) = log[p(x | z = 1) + p(x | z = 2)]? Answer: log[2e^{-120}] = -120 + log 2. Rule of thumb: never use log or exp by itself.
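
A minimal implementation of the recipe above (in practice, scipy.special.logsumexp does the same thing):

```python
# Numerically stable log-sum-exp, following the shift-by-the-max recipe above.
import numpy as np

def logsumexp(b):
    B = np.max(b)                              # shift so the largest exponent is zero
    return np.log(np.sum(np.exp(b - B))) + B

b = np.array([-1.0e6, -1.0e6])                 # very negative log p(x | z = k)
print(logsumexp(b))                            # -1e6 + log 2, no underflow
print(np.log(np.sum(np.exp(b))))               # naive version underflows to -inf
```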

Hidden Markov Models (HMMs) A very popular form of latent variable model. Z_t: hidden states taking one of K discrete values. X_t: observations taking values in any space. Example: discrete observations with M observation symbols, B ∈ R^{K×M}, p(x_t = j | z_t = k) = B_kj.

Inference in Graphical Models x_E: observed evidence variables (subset of nodes). x_F: unobserved query nodes we'd like to infer. x_R: remaining variables, extraneous to this query but part of the given graphical representation.

Inference with Two Variables Table look-up: p(y | x = x̄). Bayes' rule: p(x | y = ȳ) = p(ȳ | x) p(x) / p(ȳ).

Naïve Inference Suppose each variable takes one of k discrete values. p(x_1, x_2, ..., x_5) = Σ_{x_6} p(x_1) p(x_2 | x_1) p(x_3 | x_1) p(x_4 | x_2) p(x_5 | x_3) p(x_6 | x_2, x_5). This costs O(k) operations to update each of O(k^5) table entries. Use the factorization and the distributive law to reduce complexity: = p(x_1) p(x_2 | x_1) p(x_3 | x_1) p(x_4 | x_2) p(x_5 | x_3) Σ_{x_6} p(x_6 | x_2, x_5).
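
A small sketch contrasting the naive sum with the pushed-in sum for the six-node network above, using randomly generated binary CPTs (the CPT values are arbitrary; only the graph structure matters).

```python
# Pushing the sum over x6 inside the factorization: since sum_{x6} p(x6|x2,x5) = 1,
# the x6 factor drops out entirely. CPT values are random and arbitrary.
import numpy as np

rng = np.random.default_rng(0)

def cpt(n_parents):
    t = rng.random((2,) * (n_parents + 1))
    return t / t.sum(axis=-1, keepdims=True)   # normalize over the child variable

p1, p2, p3, p4, p5, p6 = cpt(0), cpt(1), cpt(1), cpt(1), cpt(1), cpt(2)

def joint(x1, x2, x3, x4, x5, x6):
    return (p1[x1] * p2[x1, x2] * p3[x1, x3] *
            p4[x2, x4] * p5[x3, x5] * p6[x2, x5, x6])

# Naive: sum the full joint over x6 for one setting of x1..x5.
naive = sum(joint(0, 1, 0, 1, 1, x6) for x6 in range(2))

# Pushed in: the x6 factor sums to 1 and disappears.
pushed = p1[0] * p2[0, 1] * p3[0, 0] * p4[1, 1] * p5[0, 1]
print(np.isclose(naive, pushed))               # True
```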

Inference in Directed Graphs

Inference in Directed Graphs

Inference in Directed Graphs

Learning outcomes: What aspects of a model can we express using graphical notation? Which aspects are not captured in this way? How do independencies change as a result of conditioning? Reasons for using latent variables; common motifs such as mixtures and chains; how to integrate out unobserved variables.

Questions? Thursday: Tutorial on automatic differentiation This week: Assignment 1 released