Chapter 16. Structured Probabilistic Models for Deep Learning


Peng et al., Deep Learning and Practice

Structured Probabilistic Models

A structured probabilistic model is a way of using graphs to describe a probability distribution, with an emphasis on visualizing which random variables interact with each other directly.
- Each node represents a random variable
- Each edge represents a direct interaction

[Figure: the same six variables x_0, ..., x_5 drawn as a directed model (Bayesian net) and as an undirected model (Markov net)]

They are also known as probabilistic graphical models, or simply graphical models.

Learning, Sampling, and Inference

Things we will be concerned with around graphical models:
- Learning the model p(x), i.e., its structure and parameters θ: θ* = arg max_θ p(x; θ)
- Drawing samples from the learned model: x ~ p(x; θ*), or x_2 ~ p(x_2 | x_1; θ*)
- Doing exact or approximate inference: arg max_{x_2} p(x_2 | x_1; θ*), or arg max_{x_2} q(x_2 | x_1; w) with an approximating distribution q

Directed Graphical Models

A directed model defined on x is specified by
1. a directed acyclic graph G, with nodes denoting the elements x_i of x
2. a set of local conditional probability distributions p(x_i | Pa_G(x_i)), where Pa_G(x_i) gives the parent nodes of x_i in G

and factorizes the joint distribution of the node variables as

p(x) = ∏_i p(x_i | Pa_G(x_i))

Such graphical models are also known as Bayesian (belief) networks.

They are most naturally applicable in situations where there is clear causality between the variables. For convenience, we sometimes introduce plate notation to represent repeated substructures compactly.

As an example, for the following graph

[Figure: directed graph over x_0, ..., x_5 with edges x_0 → x_1, x_1 → x_2, x_0 → x_3, x_0 → x_4, x_1 → x_4, x_4 → x_5]

we have

p(x_0, x_1, x_2, x_3, x_4, x_5) = p(x_0) p(x_1 | x_0) p(x_2 | x_1) p(x_3 | x_0) p(x_4 | x_1, x_0) p(x_5 | x_4)

When compared to the chain rule of probability,

p(x) = ∏_i p(x_i | x_{i−1}, x_{i−2}, ..., x_0),

the graph factorization implies certain conditional independencies, e.g.,

p(x_2 | x_1, x_0) = p(x_2 | x_1)
p(x_3 | x_2, x_1, x_0) = p(x_3 | x_0)

Note, however, that the graph only specifies which variables are allowed to appear as arguments of each factor; there is no constraint on how we define each conditional probability distribution. In the present example, we may as well specify

p(x_1 | x_0) = f_1(x_1, x_0) = p(x_1)
p(x_2 | x_1) = f_2(x_2, x_1) = p(x_2)
p(x_3 | x_0) = f_3(x_3, x_0) = p(x_3)
p(x_4 | x_1, x_0) = f_4(x_4, x_1, x_0) = p(x_4)
p(x_5 | x_4) = f_5(x_5, x_4) = p(x_5)

to arrive at a fully factorized distribution

p(x_0, x_1, x_2, x_3, x_4, x_5) = p(x_0) p(x_1) p(x_2) p(x_3) p(x_4) p(x_5)
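To make the implied independence concrete, here is a minimal sketch (hypothetical code, not from the slides) that parameterizes the example graph with random binary conditional probability tables and checks p(x_2 | x_1, x_0) = p(x_2 | x_1) by brute-force enumeration.

```python
import itertools
import numpy as np

rng = np.random.default_rng(0)

def rand_cpt(n_parents):
    """Random CPT for a binary variable: one distribution per parent configuration."""
    t = rng.random((2,) * n_parents + (2,))
    return t / t.sum(axis=-1, keepdims=True)

# CPTs matching p(x0) p(x1|x0) p(x2|x1) p(x3|x0) p(x4|x0,x1) p(x5|x4)
p0, p1, p2, p3, p4, p5 = (rand_cpt(k) for k in (0, 1, 1, 1, 2, 1))

def joint(x):
    x0, x1, x2, x3, x4, x5 = x
    return (p0[x0] * p1[x0, x1] * p2[x1, x2] * p3[x0, x3]
            * p4[x0, x1, x4] * p5[x4, x5])

def prob(fixed):
    """Marginal probability of a partial assignment `fixed` (variable index -> value)."""
    return sum(joint(x) for x in itertools.product((0, 1), repeat=6)
               if all(x[i] == v for i, v in fixed.items()))

# The graph implies p(x2 | x1, x0) = p(x2 | x1); verify it for every value combination.
for x0, x1, x2 in itertools.product((0, 1), repeat=3):
    lhs = prob({0: x0, 1: x1, 2: x2}) / prob({0: x0, 1: x1})
    rhs = prob({1: x1, 2: x2}) / prob({1: x1})
    assert np.isclose(lhs, rhs)
```

The assertion holds for any parameterization of the CPTs, because the independence comes from the graph structure rather than from particular numbers.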

As such, there can be many distributions that satisfy a given graph factorization; it is helpful to think of a directed graph as a filter, where DF denotes the set of distributions that satisfy the factorization described by the graph. To be precise, for any given graph, DF includes every distribution that has additional independence properties beyond those described by the graph.

Extreme case I: a fully connected graph accepts any possible distribution over the given variables.

[Figure: fully connected directed acyclic graph over x_0, x_1, x_2, x_3]

p(x_0, x_1, x_2, x_3) = p(x_0) p(x_1 | x_0) p(x_2 | x_1, x_0) p(x_3 | x_2, x_1, x_0)

(simply the chain rule of probability)

Extreme case II: a fully disconnected graph accepts only a fully factorized distribution.

[Figure: edgeless graph over x_0, x_1, x_2, x_3]

p(x_0, x_1, x_2, x_3) = p(x_0) p(x_1) p(x_2) p(x_3)

It is also straightforward to see that a fully factorized distribution will pass through any graph.

In general, modeling n discrete variables, each taking k values, requires a table of size O(k^n); the conditional independence implied by the graph reduces the table sizes to O(k^m), where m is the maximum number of conditioning variables over all x_i. This suggests that as long as each variable has few parents in the graph, the distribution can be represented with very few parameters.
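To make the counting concrete, here is a small hypothetical sketch (the function and numbers are ours, not from the slides) comparing the entries of the full joint table with the per-node conditional tables of the six-variable example.

```python
def table_entries(k, parents):
    """k: values per variable; parents: dict mapping each variable to its parent list."""
    full_joint = k ** len(parents)                           # O(k^n) entries
    factored = sum(k ** (len(pa) + 1) for pa in parents.values())
    return full_joint, factored

# The six-variable example graph with binary variables (k = 2).
parents = {0: [], 1: [0], 2: [1], 3: [0], 4: [0, 1], 5: [4]}
print(table_entries(2, parents))   # (64, 26): 2 + 4 + 4 + 4 + 8 + 4 = 26
```

The gap widens quickly: with more variables the full joint grows exponentially, while the factored representation grows only with the number of nodes and their parent counts.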

Undirected Graphical Models

An undirected graphical model is defined on an undirected graph G and factorizes the joint distribution of its node variables as a product of potential functions φ(C) over the maximal cliques C of the graph,

p(x) = (1/Z) ∏_{C ∈ G} φ(C)

where
- p̃(x) = ∏_{C ∈ G} φ(C) is an unnormalized distribution, so that p(x) = (1/Z) p̃(x)
- Z is a normalization constant (called the partition function)
- φ(C) is a clique potential and is non-negative

They are also known as Markov random fields or Markov networks.

A clique is a subset of the nodes in a graph G such that there exists a link between every pair of nodes in the subset. A maximal clique C is a clique to which no other node of the graph can be added without it ceasing to be a clique.

As an example, for the following graph

[Figure: undirected graph over x_0, ..., x_5 whose maximal cliques are {x_0, x_3}, {x_0, x_1, x_4}, {x_1, x_2}, and {x_4, x_5}]

we have

p(x) = (1/Z) φ_a(x_0, x_3) φ_b(x_0, x_1, x_4) φ_c(x_1, x_2) φ_d(x_4, x_5)
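As a sanity check on these definitions, the following sketch (hypothetical: arbitrary random non-negative potentials over binary variables) evaluates the unnormalized distribution, the partition function Z, and the normalized p(x) for exactly this clique structure by brute-force enumeration.

```python
import itertools
import numpy as np

rng = np.random.default_rng(1)

# One non-negative potential table per maximal clique of the example graph.
cliques = [(0, 3), (0, 1, 4), (1, 2), (4, 5)]
phis = [rng.random((2,) * len(c)) for c in cliques]

def p_tilde(x):
    """Unnormalized distribution: product of the clique potentials."""
    val = 1.0
    for clique, phi in zip(cliques, phis):
        val *= phi[tuple(x[i] for i in clique)]
    return val

states = list(itertools.product((0, 1), repeat=6))
Z = sum(p_tilde(x) for x in states)                  # partition function by enumeration
p = lambda x: p_tilde(x) / Z
assert np.isclose(sum(p(x) for x in states), 1.0)    # normalized distribution sums to one
```

Enumeration is only viable for toy graphs; the exponential cost of computing Z is exactly why the partition function is called intractable below.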

The clique potential φ measures the affinity of its member variables in each of their possible joint states. One choice for φ is the energy-based model (Boltzmann distribution),

φ(C) = exp(−E(x_C))

where x_C denotes the variables in that clique.

The choice of φ needs some attention; not every choice results in a legitimate probability distribution, e.g., φ(x) = exp(−βx²) with x ∈ ℝ and β < 0 cannot be normalized.

In the present case, the unnormalized joint distribution is also a Boltzmann distribution, with a total energy given by the sum of the energies of all the maximal cliques:

p̃(x) = exp(−E(x)), with E(x) = Σ_{C ∈ G} E(x_C)

Each energy term imposes a particular soft constraint on the variables.

Example: image de-noising
- y_i ∈ {−1, +1}: observed image pixels
- x_i ∈ {−1, +1}: hidden noise-free image pixels

[Figure: lattice-structured MRF where each hidden pixel x_i is connected to its observation y_i and to its neighbouring hidden pixels x_j]

The maximal cliques of the graph are seen to be {x_i, y_i} and {x_i, x_j}. The joint distribution is given by

p(x, y) = (1/Z) exp(−E(x, y))

The (complete) energy function is assumed to be

E(x, y) = Σ_i E(x_i, y_i) + Σ_{i,j} E(x_i, x_j) = −η Σ_i x_i y_i − β Σ_{i,j} x_i x_j + h Σ_i x_i

Z is an (intractable) function of the model parameters η, β, and h:

Z = Σ_{x,y} exp(−E(x, y))

De-noising can be cast as an inference problem:

arg max_x p(x | y)

As shown, the partition function Z often does not have a tractable form; approximate algorithms are needed when estimating the model parameters, e.g., under the maximum likelihood principle.
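As one concrete (hypothetical) illustration of this inference problem, the sketch below runs iterated conditional modes, a greedy coordinate-wise minimization of E(x, y) for the 4-neighbour grid energy above; the parameter values are arbitrary, and this is not the method prescribed in the slides.

```python
import numpy as np

def icm_denoise(y, eta=2.1, beta=1.0, h=0.0, n_sweeps=10):
    """Greedy minimization of E(x, y) = -eta*sum x_i y_i - beta*sum x_i x_j + h*sum x_i."""
    x = y.copy()                      # pixels in {-1, +1}; start from the noisy image
    H, W = x.shape
    for _ in range(n_sweeps):
        for i in range(H):
            for j in range(W):
                # Sum over the 4-neighbour pairwise cliques containing pixel (i, j).
                nb = sum(x[a, b] for a, b in ((i-1, j), (i+1, j), (i, j-1), (i, j+1))
                         if 0 <= a < H and 0 <= b < W)
                # Energy contribution of x_ij = +1 vs. x_ij = -1; keep the lower one.
                e_plus = -eta * y[i, j] - beta * nb + h
                e_minus = +eta * y[i, j] + beta * nb - h
                x[i, j] = 1 if e_plus < e_minus else -1
    return x

# Usage: y is a {-1, +1} array of noisy pixels; x_hat = icm_denoise(y)
```

ICM only finds a local minimum of the energy; it is chosen here because it needs neither Z nor any sampling, which keeps the example self-contained.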

D-Separation

We often want to know which subsets of variables are conditionally independent given the values of other sets of variables.

[Figure: the directed graph over x_0, ..., x_5 from the earlier example]

Is the set of variables {x_1, x_2} conditionally independent of the variable x_5, given the values of {x_0, x_4}? Equivalently, does

p(x_1, x_2, x_5 | x_0, x_4) = p(x_1, x_2 | x_0, x_4) p(x_5 | x_0, x_4),

or

p(x_1, x_2 | x_0, x_4, x_5) = p(x_1, x_2 | x_0, x_4)

hold?

The key rules can be deduced from three simple examples:

[Figure: three three-node graphs on a, s, b: head-to-tail (a → s → b), tail-to-tail (a ← s → b), and head-to-head (a → s ← b)]

Head-to-tail: a and b are independent (d-separated) given s,

p(a, b | s) = p(a) p(s | a) p(b | s) / p(s) = p(a | s) p(b | s)

Tail-to-tail: a and b are independent (d-separated) given s,

p(a, b | s) = p(s) p(a | s) p(b | s) / p(s) = p(a | s) p(b | s)

Head-to-head: a and b are in general dependent given s,

p(a, b | s) = p(a) p(b) p(s | a, b) / p(s) ≠ p(a | s) p(b | s)

The head-to-head rule generalizes to the case where a descendant of s is observed:

[Figure: a → s ← b, with s → c]

p(a, b | c) ≠ p(a | c) p(b | c) in general
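The head-to-head ("explaining away") behaviour is easy to verify numerically. The sketch below uses arbitrary made-up numbers for a binary a → s ← b network: a and b come out exactly independent marginally, but dependent once s is observed.

```python
import numpy as np

p_a = np.array([0.7, 0.3])                    # p(a)
p_b = np.array([0.6, 0.4])                    # p(b)
p_s = np.array([[[0.9, 0.1], [0.4, 0.6]],     # p(s | a, b), indexed as [a, b, s]
                [[0.3, 0.7], [0.2, 0.8]]])

joint = p_a[:, None, None] * p_b[None, :, None] * p_s   # p(a, b, s)

# Marginally, p(a, b) = p(a) p(b): a and b are independent.
p_ab = joint.sum(axis=2)
assert np.allclose(p_ab, np.outer(p_a, p_b))

# Conditioned on s = 1, p(a, b | s) no longer factorizes in general.
p_ab_given_s1 = joint[:, :, 1] / joint[:, :, 1].sum()
p_a_given_s1 = p_ab_given_s1.sum(axis=1)
p_b_given_s1 = p_ab_given_s1.sum(axis=0)
print(np.allclose(p_ab_given_s1, np.outer(p_a_given_s1, p_b_given_s1)))  # False
```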

To summarize, given three non-intersecting sets of nodes A, B, and C, A and B are conditionally independent given C if every path from any node in A to any node in B satisfies one of the following:
- it meets head-to-tail or tail-to-tail at a node in C, or
- it meets head-to-head at a node, and neither that node nor any of its descendants is in C

In other words, all such paths are blocked (inactive).

These rules tell us only the independencies implied by the graph; recall, however, that not all independencies of a distribution are captured by the graph (cf. the filter interpretation).

Separation

Separation refers to the conditional independencies implied by an undirected graph. Given three non-intersecting sets of nodes A, B, and C, A and B are conditionally independent (separated) given C if all paths from any node in A to any node in B pass through one or more nodes in C.

Conversion between Directed and Undirected Models

Some independencies can be represented by only one of the two model families.

Conversion from a directed model D to an undirected model U:
1. Add an edge to U for any pair of nodes a, b if there is a directed edge between them in D
2. Add an edge to U for any pair of nodes a, b if they are both parents of a third node in D

[Figure: directed graph a → c ← b and its moralized undirected counterpart with edges a–c, b–c, and a–b]

The result is called the moralized graph.

In the present case, the potential function φ is given by

φ(a, b, c) = p(a) p(b) p(c | a, b)

Conversion from an undirected model U to a directed model D is much less common and, in general, presents problems due to the normalization constraints (left as self-study).

Restricted Boltzmann Machines (RBM)

An energy-based model with binary visible and hidden units

[Figure: bipartite graph with hidden units h_1, ..., h_4 fully connected to visible units v_1, v_2, v_3]

E(v, h) = −b^T v − c^T h − v^T W h

There is no direct interaction between visible units or between hidden units (the graph is bipartite). From the separation rules, we have

p(h | v) = ∏_i p(h_i | v)
p(v | h) = ∏_i p(v_i | h)

which are both factorial. By the definition of E(v, h), p(h_i = 1 | v) and p(v_i = 1 | h) are evaluated to be

p(h_i = 1 | v) = σ(v^T W_{:,i} + c_i)
p(v_i = 1 | h) = σ(W_{i,:} h + b_i)

The hidden units h, although not directly interpretable, act as features that describe the visible units v and can be inferred via p(h_i = 1 | v). Samples of the visible units v can be generated by alternately sampling all of h given v and all of v given h (block Gibbs sampling).
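A minimal numpy sketch of these factorial conditionals and of block Gibbs sampling, assuming a tiny untrained RBM with arbitrary parameters (the sizes and initialization are our choices, not from the slides):

```python
import numpy as np

rng = np.random.default_rng(0)
sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))

n_v, n_h = 3, 4
W = 0.1 * rng.standard_normal((n_v, n_h))   # visible-hidden weights
b = np.zeros(n_v)                           # visible biases
c = np.zeros(n_h)                           # hidden biases

def sample_h_given_v(v):
    """p(h_i = 1 | v) = sigmoid(v^T W_{:,i} + c_i); factorial over hidden units."""
    p = sigmoid(v @ W + c)
    return (rng.random(n_h) < p).astype(float), p

def sample_v_given_h(h):
    """p(v_i = 1 | h) = sigmoid(W_{i,:} h + b_i); factorial over visible units."""
    p = sigmoid(W @ h + b)
    return (rng.random(n_v) < p).astype(float), p

# Block Gibbs sampling: alternate between all of h given v and all of v given h.
v = (rng.random(n_v) < 0.5).astype(float)
for _ in range(1000):
    h, _ = sample_h_given_v(v)
    v, _ = sample_v_given_h(h)
# v is now (approximately) a sample from the model's marginal p(v).
```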

It is also possible to sample part of v given the values of the other visible units, for applications such as image completion (the RBM is a fully probabilistic model).

[Figure: training inputs and the corresponding image-completion results]

Estimating the model parameters W, b, c is done with the maximum likelihood principle,

arg max_{W,b,c} p(v; W, b, c)

where the marginal distribution of the visible units is given by

p(v; W, b, c) = (1/Z) Σ_h exp(−E(v, h))

Note, however, that the partition function

Z = Σ_{v,h} exp(−E(v, h))

is intractable and is itself a function of the model parameters W, b, c. Specialized training techniques involving sampling are therefore needed.
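The slides leave the training technique unspecified; one widely used sampling-based approximation is contrastive divergence (CD-1). The sketch below reuses the hypothetical sample_h_given_v / sample_v_given_h helpers from the previous block and only illustrates the shape of the update rule.

```python
def cd1_step(v_data, W, b, c, lr=0.01):
    """One CD-1 update, an approximation to the maximum-likelihood gradient."""
    # Positive phase: clamp the data and infer hidden probabilities.
    h_data, p_h_data = sample_h_given_v(v_data)
    # Negative phase: a single step of block Gibbs gives a "reconstruction".
    v_model, _ = sample_v_given_h(h_data)
    _, p_h_model = sample_h_given_v(v_model)
    # Data statistics minus (approximate) model statistics; update the arrays in place.
    W += lr * (np.outer(v_data, p_h_data) - np.outer(v_model, p_h_model))
    b += lr * (v_data - v_model)
    c += lr * (p_h_data - p_h_model)

# Usage: for each training vector v_example in {0, 1}^n_v, call cd1_step(v_example, W, b, c).
```

Running the Gibbs chain for only one step is what makes CD cheap; longer chains (CD-k) or persistent chains give closer approximations to the true gradient.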

Deep Boltzmann Machines (DBM)

Introducing multiple layers of hidden units to the RBM

[Figure: layered undirected graph with visible units v_1, v_2, v_3 and hidden layers h^(1), h^(2), h^(3)]

E(v, h^(1), h^(2), h^(3)) = −v^T W^(1) h^(1) − h^(1)T W^(2) h^(2) − h^(2)T W^(3) h^(3)

From the graph, the posterior distribution is no longer factorial,

p(h^(1), h^(2), h^(3) | v) ≠ p(h^(1) | v) p(h^(2) | v) p(h^(3) | v)

Approximate inference (based on variational inference) is needed,

p(h^(1), h^(2), h^(3) | v) ≈ q(h^(1) | v) q(h^(2) | v) q(h^(3) | v)

Layer-wise unsupervised pre-training is also common.
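For intuition, here is a minimal mean-field sketch for a DBM with two hidden layers (the slides use three, but the update pattern is the same); W1, W2 and all sizes are hypothetical. Each layer's variational parameters are refreshed from its neighbouring layers until a fixed point is reached.

```python
import numpy as np

sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))

def mean_field(v, W1, W2, n_iters=50):
    """Factorial approximation q(h1, h2 | v) = prod_j q(h1_j) prod_k q(h2_k)."""
    mu1 = np.full(W1.shape[1], 0.5)          # q(h1_j = 1)
    mu2 = np.full(W2.shape[1], 0.5)          # q(h2_k = 1)
    for _ in range(n_iters):
        # Each hidden layer receives input from both of its neighbouring layers.
        mu1 = sigmoid(v @ W1 + W2 @ mu2)
        mu2 = sigmoid(mu1 @ W2)
    return mu1, mu2
```

Unlike the RBM's exact factorial conditionals, these updates are coupled across layers, which is precisely why the DBM posterior needs an iterative approximation.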

More Examples: Label Noise Model

Another deep learning approach to graphical models is to approximate their conditional distributions with deep neural networks.

Objective: to infer ground-truth labels for images

Visible variables (noisy data)
- x: image
- ỹ: noisy label (one-hot vector)

Latent variables
- y: true label (one-hot vector)
- z: label noise type

Graphical model

[Figure: directed graph in which the image x feeds two neural networks, one (weights θ_1) producing y and one (weights θ_2) producing z; y and z are the parents of ỹ]

p(ỹ, y, z | x) = p(ỹ | y, z) · p(y | x; θ_1) · p(z | x; θ_2)

where p(ỹ | y, z) is hand-designed, while p(y | x; θ_1) and p(z | x; θ_2) are modeled by neural networks.

Label noise type and the conditional distribution p(ỹ | y, z):

Noise free (z = 1): ỹ = y
  p(ỹ | y, z) = ỹ^T I y

Random noise (z = 2): ỹ takes any value other than the true y
  p(ỹ | y, z) = (1/(L − 1)) ỹ^T (U − I) y
  where U is a matrix of ones and L is the number of possible labels

Confusing noise (z = 3): ỹ takes a value easily confused with the true y
  p(ỹ | y, z) = ỹ^T C y
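To make the three regimes concrete, the sketch below builds p(ỹ | y, z) as an L×L matrix for each z and evaluates the bilinear forms above; L and the confusing-noise matrix C are made-up values for illustration only.

```python
import numpy as np

L = 4                                        # number of possible labels (assumed)
I = np.eye(L)
U = np.ones((L, L))                          # matrix of ones
C = np.full((L, L), 0.1) + 0.6 * I           # made-up "confusing noise": mass near the diagonal
assert np.allclose(C.sum(axis=0), 1.0)       # each column is a valid distribution over y_tilde

# p(y_tilde | y, z) as a matrix M_z, so the probability is y_tilde^T M_z y for one-hot vectors.
M = {1: I,                                   # noise free: y_tilde = y
     2: (U - I) / (L - 1),                   # random noise: uniform over the wrong labels
     3: C}                                   # confusing noise

def p_noisy_label(y_tilde, y, z):
    """y_tilde and y are one-hot vectors of length L; z in {1, 2, 3}."""
    return y_tilde @ M[z] @ y

y_true = np.eye(L)[2]
print([p_noisy_label(np.eye(L)[k], y_true, 2) for k in range(L)])  # 1/3 everywhere except label 2
```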

Training of θ_1 and θ_2 is based on the EM algorithm (see the paper for details). At test time, only the neural network p(y | x; θ_1) is used. Note that, unlike in the RBM/DBM, the hidden variables here are interpretable, as is the case with most conventional graphical models.

Review
- Directed vs. undirected graphical models
- Probability distributions and their graph representations
- Training, sampling, and inference for graphical models
- Reading off conditional independencies: d-separation and separation
- Deep learning with graphical models