
Using Graphs to Describe Model Structure. Sargur N. Srihari, srihari@cedar.buffalo.edu

Topics in Structured PGMs for Deep Learning
0. Overview
1. Challenge of Unstructured Modeling
2. Using graphs to describe model structure
3. Sampling from graphical models
4. Advantages of structured modeling
5. Learning about Dependencies
6. Inference and Approximate Inference
7. The deep learning approach to structured probabilistic models
   7.1 Example: The Restricted Boltzmann Machine

Topics in Using Graphs to Describe Model Structure
1. Directed Models
2. Undirected Models
3. The Partition Function
4. Energy-based Models
5. Separation and D-separation
6. Converting between Undirected and Directed Graphs
7. Factor Graphs

Graphs to describe model structure
We describe model structure using graphs: each node represents a random variable, and each edge represents a direct interaction. These direct interactions imply other, indirect interactions, but only the direct interactions need to be modeled explicitly.

Types of graphical models
There is more than one way to describe the interactions in a probability distribution using a graph. We describe some of the most popular approaches. Graphical models can be largely divided into two categories: models based on directed acyclic graphs and models based on undirected graphs.

1. Directed Models
One type of structured probabilistic model is the directed graphical model, also known as a belief network or Bayesian network. The term Bayesian is used because the probabilities can be judgmental: they usually represent degrees of belief rather than frequencies of events.

Example of a Directed Graphical Model
Relay race example: Bob's finishing time t_1 depends on Alice's finishing time t_0, and Carol's finishing time t_2 depends on Bob's finishing time t_1.

Meaning of directed edges
Drawing an arrow from a to b means that we define the conditional probability distribution (CPD) over b with a as one of the variables on the right side of the conditioning bar, i.e., the distribution over b depends on the value of a.

Formal directed graphical model
A directed graphical model defined on variables x is specified by a directed acyclic graph G, whose vertices are the random variables in the model, together with a set of local CPDs p(x_i | Pa_G(x_i)), where Pa_G(x_i) gives the parents of x_i in G. The probability distribution over x is given by

p(x) = ∏_i p(x_i | Pa_G(x_i))

In the relay race example, p(t_0, t_1, t_2) = p(t_0) p(t_1 | t_0) p(t_2 | t_1).
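
To make the factorization concrete, here is a minimal Python sketch (the CPD tables below are made-up toy numbers, not from the slides) that evaluates the relay-race joint p(t_0, t_1, t_2) = p(t_0) p(t_1 | t_0) p(t_2 | t_1) with each finishing time discretized into three bins:

```python
# Sketch (hypothetical toy tables): the relay-race factorization
# p(t0, t1, t2) = p(t0) * p(t1 | t0) * p(t2 | t1), times binned as fast=0, medium=1, slow=2.
import numpy as np

p_t0 = np.array([0.3, 0.5, 0.2])            # p(t0)
p_t1_given_t0 = np.array([                  # rows: t0, columns: t1
    [0.6, 0.3, 0.1],
    [0.2, 0.5, 0.3],
    [0.1, 0.3, 0.6],
])
p_t2_given_t1 = np.array([                  # rows: t1, columns: t2
    [0.7, 0.2, 0.1],
    [0.2, 0.6, 0.2],
    [0.1, 0.2, 0.7],
])

def joint(t0, t1, t2):
    """p(t0, t1, t2) via the directed factorization."""
    return p_t0[t0] * p_t1_given_t0[t0, t1] * p_t2_given_t1[t1, t2]

# Sanity check: the factorized joint sums to 1 over all 27 states.
total = sum(joint(a, b, c) for a in range(3) for b in range(3) for c in range(3))
print(joint(0, 0, 0), total)                # e.g. 0.126, 1.0
```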

Savings achieved by the directed model
If t_0, t_1 and t_2 are each discrete with 100 possible values, a single table over all three variables would require 999,999 values. By making tables only for the conditional probabilities, we need only 19,899 values. To model n discrete variables each having k values, the cost of a single table is O(k^n). If m is the maximum number of variables appearing on either side of the conditioning bar in a single CPD, then the cost of the tables for the directed PGM is O(k^m). As long as each variable has few parents in the graph, the distribution can be represented with very few parameters.
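
A quick back-of-the-envelope check of these counts (a sketch under the same assumptions as the slide: three variables, 100 values each):

```python
# Sketch: parameter counts with and without the directed structure.
k = 100                                  # values per variable

# One big table over (t0, t1, t2): k^3 entries, minus 1 because they must sum to 1.
full_table = k**3 - 1                    # 999,999

# Directed factorization p(t0) p(t1|t0) p(t2|t1):
#   p(t0): k-1 free entries; p(t1|t0) and p(t2|t1): k rows of k-1 free entries each.
directed = (k - 1) + 2 * k * (k - 1)     # 19,899

print(full_table, directed)              # 999999 19899
```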

2. Undirected Models
Directed graphical models give us one language for describing structured probabilistic models. Another language is that of undirected models, also known as Markov random fields (MRFs) or Markov networks, which use graphs whose edges are undirected. Directed models are most useful when there is a clear directionality to the interactions, typically when we understand the causality and the causality flows in only one direction. When the interactions have no clear directionality, it is more appropriate to use an undirected graph.

Models without clear direction
When the interactions have no clear direction, or operate in both directions, it is appropriate to use an undirected model. An example with three binary variables is described next.

Example: an Undirected Model for Health
A model over three binary variables: whether or not you are sick, h_y; whether or not your coworker is sick, h_c; and whether or not your roommate is sick, h_r. Assuming your coworker and roommate do not know each other, it is very unlikely that one of them will give a cold to the other directly; this event is so rare that we do not model it. There is no clear directionality either. This motivates using an undirected model.

The health undirected graph
You and your roommate may infect each other with a cold, and you and your work colleague may do the same. Assuming your roommate and colleague do not know each other, one can only be infected by the other indirectly, through you.

Undirected graph definition
If two variables directly interact with each other, the corresponding nodes are connected; the edge has no arrow and is not associated with a CPD. An undirected PGM is defined on a graph G. For each clique C in the graph (a clique is a subset of nodes that are all connected to each other), a factor φ(C), also called a clique potential, measures the affinity of the variables in that clique for being in each of their joint states. Together the factors define an unnormalized distribution

p̃(x) = ∏_{C ∈ G} φ(C)

Efficiency of the Unnormalized Distribution
The unnormalized probability distribution is efficient to work with as long as the cliques are small. It encodes the idea that states with higher affinity φ(C) are more likely. However, since there is little structure to the definition of the cliques, there is no guarantee that multiplying the clique potentials together will yield a valid (normalized) probability distribution.

Reading factorization information from an undirected graph
This graph (with five cliques) implies that

p(a,b,c,d,e,f) = (1/Z) φ_{a,b}(a,b) φ_{b,c}(b,c) φ_{a,d}(a,d) φ_{b,e}(b,e) φ_{e,f}(e,f)

for an appropriate choice of the φ functions. An example of clique potentials is shown next.

Example: Clique potentials
Let h_y be your health, h_r the health of your roommate, and h_c the health of your colleague. One clique is between h_y and h_c. The factor for this clique can be defined by a table φ(h_y, h_c), where a state of 1 indicates good health and a state of 0 indicates poor health. Both people are usually healthy, so the corresponding state has the highest affinity. The state of only one of them being sick has the lowest affinity, and the state of both being sick has a higher affinity than the state of only one being sick. A similar factor is needed for the other clique, between h_y and h_r.
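
A minimal sketch of such clique potentials in Python, using hypothetical affinity values chosen only to match the qualitative pattern described above (both healthy highest, exactly one sick lowest):

```python
# Sketch: clique potentials for the health example. The numeric affinities are
# hypothetical; they only follow the qualitative pattern described in the text.
import itertools

# phi[(h_y, h_other)] with 1 = healthy, 0 = sick
phi_y_c = {(1, 1): 10.0, (1, 0): 1.0, (0, 1): 1.0, (0, 0): 2.0}   # you & colleague
phi_y_r = {(1, 1): 10.0, (1, 0): 1.0, (0, 1): 1.0, (0, 0): 2.0}   # you & roommate

def p_tilde(h_y, h_r, h_c):
    """Unnormalized probability: product of the two clique potentials."""
    return phi_y_c[(h_y, h_c)] * phi_y_r[(h_y, h_r)]

for state in itertools.product([0, 1], repeat=3):
    print(dict(zip(("h_y", "h_r", "h_c"), state)), p_tilde(*state))
```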

3. The Partition Function
The unnormalized probability distribution is guaranteed to be non-negative everywhere, but it is not guaranteed to sum or integrate to 1. To obtain a valid probability distribution we must use the normalized (or Gibbs) distribution

p(x) = (1/Z) p̃(x),   where   p̃(x) = ∏_{C ∈ G} φ(C)

and Z is the value that causes the distribution to sum or integrate to one:

Z = ∫ p̃(x) dx

(with the integral replaced by a sum for discrete x). Z is a constant when the φ functions are held constant; if the φ functions have parameters, then Z is a function of those parameters, though it is commonly written without its arguments. Z is known as the partition function, a term borrowed from statistical physics.
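
For a model this small, Z can be computed by brute force; the sketch below (same hypothetical potentials as above) simply enumerates all 2^3 joint states. In realistic models this enumeration is exactly what becomes intractable.

```python
# Sketch: brute-force partition function for the tiny (hypothetical) health model.
import itertools

phi_y_c = {(1, 1): 10.0, (1, 0): 1.0, (0, 1): 1.0, (0, 0): 2.0}
phi_y_r = {(1, 1): 10.0, (1, 0): 1.0, (0, 1): 1.0, (0, 0): 2.0}

def p_tilde(h_y, h_r, h_c):
    return phi_y_c[(h_y, h_c)] * phi_y_r[(h_y, h_r)]

states = list(itertools.product([0, 1], repeat=3))
Z = sum(p_tilde(*s) for s in states)        # sum over all 2^3 joint states

def p(h_y, h_r, h_c):
    """Normalized (Gibbs) distribution."""
    return p_tilde(h_y, h_r, h_c) / Z

print(Z, sum(p(*s) for s in states))        # the normalizer, and 1.0
```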

Intractability of Z
Since Z is an integral or sum over all possible joint values of x, it is often intractable to compute. To be able to compute the normalized probability of an undirected model, the model structure and the definitions of the φ functions must be conducive to computing Z efficiently. In deep learning applications Z is typically intractable, and we must resort to approximations.

Choice of factors
When designing undirected models, it is important to know that for some choices of factors Z does not exist!
1. If there is a single scalar variable x ∈ ℝ and we choose the single clique potential φ(x) = x², then Z = ∫ x² dx. This integral diverges, so there is no probability distribution corresponding to this choice.
2. The choice of the parameters of the φ functions can also determine whether the distribution exists. For φ(x; β) = exp(−βx²), the parameter β determines whether Z exists: a positive β yields a Gaussian distribution over x, while all other values of β make φ impossible to normalize.
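
A small numerical illustration of the second point (a sketch; the integration grid and interval widths are arbitrary choices of mine): for β > 0 the integral of exp(−βx²) settles to √(π/β), while for β < 0 it keeps growing as the interval widens.

```python
# Sketch: numerically probe Z for phi(x; beta) = exp(-beta * x^2).
import numpy as np

def numeric_Z(beta, half_width, n=200001):
    """Riemann-sum approximation of the integral over [-half_width, half_width]."""
    x = np.linspace(-half_width, half_width, n)
    dx = x[1] - x[0]
    return np.sum(np.exp(-beta * x**2)) * dx

# beta > 0: the value stabilizes at sqrt(pi / beta), so a valid Gaussian normalizer exists.
print(numeric_Z(1.0, 10.0), numeric_Z(1.0, 20.0), np.sqrt(np.pi))
# beta < 0: the value keeps growing with the interval -- the integral diverges, no valid Z.
print(numeric_Z(-1.0, 10.0), numeric_Z(-1.0, 20.0))
```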

Key difference between BNs and MNs
Directed models are defined directly in terms of probability distributions from the start, whereas undirected models are defined more loosely in terms of φ functions that are then converted into probability distributions. This changes the intuitions needed to work with these models. One key idea to keep in mind when working with MNs is that the domain of the variables has a dramatic effect on the kind of probability distribution a given set of φ functions corresponds to. We will see next how the same φ functions yield different distributions over different domains.

What distribution does an MN give?
Consider an n-dimensional random variable x = {x_i}, i = 1,..,n, and an undirected model parameterized by a vector of biases b. Suppose we have one clique potential for each x_i:

φ^(i)(x_i) = exp(b_i x_i)

(figure: the nodes x_1, ..., x_i, ..., x_n). What kind of probability distribution is modeled? We have

p̃(x) = ∏_i φ^(i)(x_i) = exp(b_1 x_1 + ... + b_n x_n),   p(x) = (1/Z) p̃(x),   Z = ∫ p̃(x) dx

The answer is that we do not have enough information, because we have not specified the domain of x. Three example domains are:
1. x ∈ ℝ^n, an n-dimensional vector of real values
2. x ∈ {0,1}^n, an n-dimensional vector of binary values
3. The set of elementary basis vectors {[1,0,...,0], [0,1,...,0], ..., [0,0,...,1]}

Effect of the domain of x on the distribution
We have n random variables x = {x_i}, i = 1,..,n, with φ^(i)(x_i) = exp(b_i x_i) for each x_i, so that p̃(x) = exp(b_1 x_1 + ... + b_n x_n) and p(x) = (1/Z) p̃(x). What kind of probability distribution is modeled?
1. If x ∈ ℝ^n, then Z = ∫ p̃(x) dx diverges and no probability distribution exists.
2. If x ∈ {0,1}^n, then p(x) factorizes into n independent distributions with p(x_i = 1) = σ(b_i), where σ(z) = e^z / (1 + e^z) = 1 / (1 + e^(−z)). Each independent distribution is a Bernoulli with parameter σ(b_i).
3. If the domain of x is the set of elementary basis vectors {[1,0,...,0], [0,1,...,0], ..., [0,0,...,1]}, then p(x) = softmax(b), so a large value of b_i reduces the probability p(x_j = 1) for all j ≠ i; this is the multiclass case.
Often, by a careful choice of the domain of x, we can obtain complicated behavior from a simple set of φ functions.
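
The binary case is easy to verify by brute force. The sketch below (toy bias values of my own choosing) enumerates all states of x ∈ {0,1}^n and checks that the normalized marginals match σ(b_i):

```python
# Sketch: with phi_i(x_i) = exp(b_i * x_i) and x in {0,1}^n, brute-force
# normalization recovers p(x_i = 1) = sigmoid(b_i). The biases are made up.
import itertools
import numpy as np

b = np.array([-1.0, 0.5, 2.0])
states = np.array(list(itertools.product([0, 1], repeat=len(b))), dtype=float)

p_tilde = np.exp(states @ b)                   # exp(b . x) for every joint state
p = p_tilde / p_tilde.sum()                    # normalize over the 2^n states

marginal = (states * p[:, None]).sum(axis=0)   # p(x_i = 1) for each i
print(marginal)
print(1.0 / (1.0 + np.exp(-b)))                # sigmoid(b_i): matches
```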

4. Energy-based Models (EBMs)
Many interesting theoretical results about undirected models depend on the assumption that ∀x, p̃(x) > 0. A convenient way to enforce this is to use an energy-based model, where

p̃(x) = exp(−E(x))

and E(x) is known as the energy function. Because exp(z) > 0 for all z, no energy function will result in a probability of zero for any x. If we were to learn the clique potentials of p̃(x) = ∏_{C ∈ G} φ(C) directly, we would need to impose constraints to guarantee a minimum probability value. By learning the energy function instead, we can use unconstrained optimization; the resulting probabilities can approach zero but never reach it.
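
A minimal sketch of this point: the energies below are arbitrary, unconstrained real numbers (randomly drawn here, not from the slides), yet every state receives a strictly positive probability.

```python
# Sketch of the energy-based parameterization: unconstrained real energies
# still give p~(x) = exp(-E(x)) > 0, so no state is ever assigned zero probability.
import numpy as np

rng = np.random.default_rng(0)
energies = rng.normal(size=8)            # arbitrary (unconstrained) energies for 8 states
p_tilde = np.exp(-energies)              # always strictly positive
p = p_tilde / p_tilde.sum()              # normalized Boltzmann/Gibbs distribution

print(p.min() > 0, p.sum())              # True 1.0
```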

Boltzmann Machine Terminology
Any distribution of the form p̃(x) = exp(−E(x)) is referred to as a Boltzmann distribution. For this reason, many energy-based models are called Boltzmann machines. There is no consensus on when to call a model an energy-based model and when to call it a Boltzmann machine. The term Boltzmann machine first referred only to models with binary variables, but today models such as the mean-covariance restricted Boltzmann machine deal with real-valued variables as well. The term Boltzmann machine is now most often used for models with latent variables, while those without latent variables are more often referred to as MRFs or log-linear models.

Cliques, factors and energy
Cliques in the undirected graph correspond to factors of the unnormalized probability function. Because exp(a)exp(b) = exp(a+b), they also correspond to different terms of the energy function: an energy-based model is just a special kind of Markov network in which exponentiation makes each term of the energy function correspond to the factor for a different clique. Reading the form of the energy function off an undirected graph is shown next.

Graph and Corresponding Energy
This graph (with five cliques) implies that

E(a,b,c,d,e,f) = E_{a,b}(a,b) + E_{b,c}(b,c) + E_{a,d}(a,d) + E_{b,e}(b,e) + E_{e,f}(e,f)

We can obtain the φ functions by setting each φ to the exponential of the corresponding negative energy term, e.g., φ_{a,b}(a,b) = exp(−E_{a,b}(a,b)).

Energy-based Models as a Product of Experts
An energy-based model with multiple terms in its energy function can be viewed as a product of experts. Each term corresponds to a factor in the probability distribution and determines whether a particular soft constraint is satisfied. Each expert may enforce only one constraint, concerning a low-dimensional projection of the random variables, but when combined by multiplication of probabilities, the experts together enforce a high-dimensional constraint.

Role of the negative sign in the energy
The negative sign in p̃(x) = exp(−E(x)) serves no functional purpose from a machine learning perspective; it could be incorporated into the definition of the energy function. It is there mainly for compatibility with the physics literature. Some machine learning researchers omit the negative sign and refer to the negative energy as harmony.

Free Energy instead of Probability
Many algorithms that operate on probabilistic models do not need to compute p_model(x), but only log p̃_model(x). For energy-based models with latent variables h, these algorithms are phrased in terms of the negative of this quantity, called the free energy:

F(x) = −log Σ_h exp(−E(x,h))

Deep learning tends to prefer this formulation.
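
A small sketch of the free-energy computation for a toy model with one visible configuration x and a three-state latent h (the energy values are made up); the log-sum-exp trick keeps the computation numerically stable.

```python
# Sketch: free energy F(x) = -log sum_h exp(-E(x, h)) for a toy discrete latent h.
import numpy as np

def free_energy(energies_over_h):
    """Numerically stable -log sum_h exp(-E(x,h)) via the log-sum-exp trick."""
    e = np.asarray(energies_over_h)
    m = (-e).max()
    return -(m + np.log(np.sum(np.exp(-e - m))))

E_xh = np.array([0.5, 2.0, -1.0])        # hypothetical E(x, h) for h = 0, 1, 2
print(free_energy(E_xh))                 # equals -log p~(x) after marginalizing out h
```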

5. Separation and D-Separation
The edges in a graphical model tell us which variables directly interact. We often also need to know which variables interact indirectly. Some of these indirect interactions can be enabled or disabled by observing other variables. More formally, we would like to know which sets of variables are conditionally independent of each other, given the values of other sets of variables.

Separation in undirected models
Identifying conditional independences is very simple in the case of undirected models. In this context, conditional independence implied by the graph is called separation. A set of variables A is separated from a set of variables B given a third set of variables S if the graph structure implies that A is independent of B given S. If two variables a and b are connected by a path involving only unobserved variables, then those variables are not separated. If no path exists between them, or all paths contain an observed variable, then they are separated.

Separation in undirected graphs
In the figure, b is shaded to indicate that it is observed. Observing b blocks the path from a to c, so a and c are separated given b. There is still an active path from a to d, so a and d are not separated given b.
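
Separation can be checked mechanically by a path search that refuses to pass through observed nodes. Below is a minimal sketch (the graph encoding and function are my own) for the a–b–c / a–d example from this slide:

```python
# Sketch: separation in an undirected graph as path search that avoids observed nodes.
from collections import deque

edges = {("a", "b"), ("b", "c"), ("a", "d")}     # the example graph: a-b, b-c, a-d
neighbors = {}
for u, v in edges:
    neighbors.setdefault(u, set()).add(v)
    neighbors.setdefault(v, set()).add(u)

def separated(x, y, observed):
    """True if every path from x to y passes through an observed node."""
    seen, queue = {x}, deque([x])
    while queue:
        node = queue.popleft()
        if node == y:
            return False                          # reached y along unobserved nodes only
        for nxt in neighbors.get(node, ()):
            if nxt not in seen and nxt not in observed:
                seen.add(nxt)
                queue.append(nxt)
    return True

print(separated("a", "c", {"b"}))   # True: observing b blocks the only path
print(separated("a", "d", {"b"}))   # False: the edge a-d is an active path
```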

Separation in Directed Graphs
In the context of directed graphs, these separation concepts are called d-separation, where the 'd' stands for 'dependence'. D-separation is defined the same way as separation for undirected graphs: a set of variables A is d-separated from a set of variables B given a third set of variables S if the graph structure implies that A is independent of B given S.

Examining Active Paths
Two variables are dependent if there is an active path between them, and d-separated if no active path between them exists. In directed networks, determining whether a path is active is more complicated. A guide to identifying active paths in a directed model is given next.

All active paths of length two
(Figure: the active paths between random variables a and b.)
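
The figure itself is not reproduced in this transcript, but the standard length-two rules can be summarized in a small sketch (the structure names and function below are my own, encoding the conventional d-separation rules): chain and common-cause paths are active unless the middle variable is observed, while a common-effect (collider) path is active only when the middle variable, or one of its descendants, is observed.

```python
# Sketch: conventional rules for whether a length-2 path a - s - b in a
# directed graph is active, given what is observed.
def path_active(structure, s_observed, s_descendant_observed=False):
    """Return True if the path between a and b through s is active."""
    if structure in ("chain", "common_cause"):      # a -> s -> b   or   a <- s -> b
        return not s_observed                       # observing s blocks the path
    if structure == "common_effect":                # a -> s <- b  (collider / V-structure)
        return s_observed or s_descendant_observed  # observing s (or a descendant) activates it
    raise ValueError(f"unknown structure: {structure}")

for structure in ("chain", "common_cause", "common_effect"):
    print(structure,
          "s unobserved:", path_active(structure, s_observed=False),
          "s observed:", path_active(structure, s_observed=True))
```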

Reading properties from a graph (figure).