Using Graphs to Describe Model Structure Sargur N. Srihari, srihari@cedar.buffalo.edu 1
Topics in Structured PGMs for Deep Learning 0. Overview 1. Challenge of Unstructured Modeling 2. Using graphs to describe model structure 3. Sampling from graphical models 4. Advantages of structured modeling 5. Learning about Dependencies 6. Inference and Approximate Inference 7. The deep learning approach to structured probabilistic models 1. Ex: The Restricted Boltzmann machine 2
Topics in Using Graphs to Describe Model Structure 1. Directed Models 2. Undirected Models 3. The Partition Function 4. Energy-based Models 5. Separation and D-separation 6. Converting between Undirected and Directed Graphs 7. Factor Graphs 3
Graphs to describe model structure We describe model structure using graphs, where Each node represents a random variable Each edge represents a direct interaction These direct interactions imply other indirect interactions But only direct interactions need be explicitly modeled 4
Types of graphical models More than one way to describe interactions in a probability distribution using a graph We describe some of the most popular approaches Graphical models can be largely divided into two categories Models based on directed acyclic graphs Models based on undirected graphs 5
1. Directed Models One type of structured probabilistic model is the directed graphical model Also known as a belief network or a Bayesian network The term Bayesian is used since the probabilities can be judgmental They usually represent degrees of belief rather than frequencies of events 6
Example of Directed Graphical Model Relay race example: Bob's finishing time t1 depends on Alice's finishing time t0, and Carol's finishing time t2 depends on Bob's finishing time t1 7
Meaning of directed edges Drawing an arrow from a to b means we define a conditional probability distribution (CPD) over b via a conditional distribution with a as one of the variables on the right side of the conditional bar i.e., distribution over b depends on the value of a 8
Formal directed graphical model A directed graphical model defined on variables x consists of a directed acyclic graph G, whose vertices are the random variables in the model, and a set of local CPDs p(x_i | Pa_G(x_i)), where Pa_G(x_i) gives the parents of x_i in G. The probability distribution over x is given by p(x) = Π_i p(x_i | Pa_G(x_i)). In the relay race example, p(t0,t1,t2) = p(t0) p(t1|t0) p(t2|t1) 9
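The relay-race factorization above can be sketched directly in code. This is a minimal illustration: the three-bucket discretization and all CPD numbers below are made-up values, not from the slides.

```python
import itertools

# Hypothetical toy CPDs for the relay-race example, with each finishing
# time discretized into three buckets (0 = fast, 1 = medium, 2 = slow).
# All probability values are illustrative assumptions.
p_t0 = {0: 0.3, 1: 0.5, 2: 0.2}          # p(t0)
p_t1_given_t0 = {                         # p(t1 | t0): rows sum to 1
    0: {0: 0.6, 1: 0.3, 2: 0.1},
    1: {0: 0.2, 1: 0.5, 2: 0.3},
    2: {0: 0.1, 1: 0.3, 2: 0.6},
}
p_t2_given_t1 = p_t1_given_t0             # reuse the same table for brevity

def joint(t0, t1, t2):
    """p(t0, t1, t2) = p(t0) p(t1|t0) p(t2|t1), per the directed model."""
    return p_t0[t0] * p_t1_given_t0[t0][t1] * p_t2_given_t1[t1][t2]

# A joint built this way automatically sums to 1 over all 27 states.
total = sum(joint(a, b, c) for a, b, c in itertools.product(range(3), repeat=3))
```

Because each local CPD is normalized, the product is a valid joint distribution with no global normalization step.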
Savings achieved by directed model If t0, t1 and t2 are discrete with 100 values each, then a single joint table would require 999,999 values. By making tables for only the conditional probabilities we need only 19,899 values. To model n discrete variables each having k values, the cost of a single table is O(k^n). If m is the maximum number of variables appearing on either side of the conditioning bar in a single CPD, then the cost of the tables for a directed PGM is O(k^m). So long as each variable has few parents in the graph, the distribution can be represented with few parameters.
2. Undirected models Directed graphical models give us one language to describe structured probabilistic models. Another language is that of undirected models. Synonyms: Markov Random Fields, Markov Nets. These use graphs whose edges are undirected. Directed models are useful when there is clear directionality, often when we understand the causality and the causality flows in only one direction. When interactions have no clear directionality, it is more appropriate to use an undirected graph. 11
Models without clear direction When interactions have no clear direction, or operate in both directions, it is appropriate to use an undirected model. An example with three binary variables is described next. 12
Ex: Undirected Model for Health A model over three binary variables: whether or not you are sick, h_y; whether or not your coworker is sick, h_c; and whether or not your roommate is sick, h_r. Assuming your coworker and roommate do not know each other, it is very unlikely that one of them will give a cold to the other; the event is so rare that we do not model it. There is no clear directionality either. This motivates using an undirected model. 13
The health undirected graph You and your roommate may infect each other with a cold, and you and your work colleague may do the same. Assuming your roommate and colleague do not know each other, they can only infect each other through you 14
Undirected graph definition If two variables directly interact with each other, then the corresponding nodes are connected by an edge. An edge has no arrow and no CPD. An undirected PGM is defined on a graph G. For each clique C in the graph, a factor φ(C), or clique potential, measures the affinity of the variables in C for being in each of their joint states. A clique is a subset of nodes all connected to each other. Together the factors define an unnormalized distribution p̃(x) = Π_{C∈G} φ(C) 15
Efficiency of Unnormalized Distribution Unnormalized probability distribution is efficient to work with so long as the cliques are small It encodes that states with higher affinity ϕ(c) are more likely Since there is little structure to the definition of the cliques, there is no guarantee that multiplying them together will yield a probability distribution 16
Reading factorization information from an undirected graph This graph (with five cliques) implies that p(a,b,c,d,e,f) = (1/Z) φ_{a,b}(a,b) φ_{b,c}(b,c) φ_{a,d}(a,d) φ_{b,e}(b,e) φ_{e,f}(e,f) for an appropriate choice of the φ functions. An example of clique potentials is shown next 17
Ex: Clique potential h_y = health of you, h_r = health of roommate, h_c = health of colleague. One clique is between h_y and h_c. The factor for this clique can be defined by a table φ(h_y, h_c). A state of 1 indicates good health, while a state of 0 indicates poor health. Both are usually healthy, so the corresponding state has the highest affinity. The state of only one being sick has the lowest affinity. The state of both being sick has higher affinity than that of only one being sick. A similar factor is needed for the other clique, between h_y and h_r. 18
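The health model can be made concrete with a small table of clique potentials. The numbers below are assumptions chosen only to match the ordering described above (both healthy has the highest affinity, exactly one sick the lowest, both sick in between); the slides do not give exact values.

```python
import itertools

# Illustrative clique potential table, shared by both cliques.
# States: 1 = healthy, 0 = sick.  Values are assumed for illustration.
phi = {
    (1, 1): 2.0,   # both healthy: highest affinity
    (1, 0): 0.2,   # exactly one sick: lowest affinity
    (0, 1): 0.2,
    (0, 0): 1.0,   # both sick: higher affinity than only one sick
}

def p_tilde(hy, hr, hc):
    """Unnormalized distribution: product over cliques (hy,hr) and (hy,hc)."""
    return phi[(hy, hr)] * phi[(hy, hc)]

# Normalizing by brute force over the 8 joint states.
Z = sum(p_tilde(*s) for s in itertools.product((0, 1), repeat=3))

def p(hy, hr, hc):
    return p_tilde(hy, hr, hc) / Z
```

Note that the raw product p̃ is not a distribution on its own; dividing by Z (the partition function of the next slide) makes it one.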
3. The partition function The unnormalized probability distribution p̃(x) is guaranteed to be non-negative everywhere, but it is not guaranteed to sum or integrate to 1. To obtain a valid probability distribution we must use the normalized (or Gibbs) distribution p(x) = (1/Z) p̃(x), where Z causes the distribution to sum to one: Z = ∫ p̃(x) dx, with p̃(x) = Π_{C∈G} φ(C) (a sum replaces the integral for discrete x). Z is a constant when the φ functions are held constant. If the φ functions have parameters, then Z is a function of those parameters, though it is commonly written without arguments. Z is known as the partition function in statistical physics.
Intractability of Z Since Z is an integral or sum over all possible values of x it is intractable to compute In order to compute a normalized probability of an undirected model: Model structure and definitions of ϕ functions must be conducive to computing Z efficiently In deep learning applications Z is intractable and we must resort to approximations 20
Choice of factors When designing undirected models, it is important to know that for some choices of factors, Z does not exist! 1. If there is a single scalar variable x ∈ R and we choose a single clique potential φ(x) = x², then Z = ∫ x² dx. This integral diverges, so there is no probability distribution for this choice. 2. The choice of the parameters of the φ functions can also determine whether the distribution exists. For φ(x; β) = exp(-βx²), the β parameter determines whether Z exists: positive β defines a Gaussian over x, while other values of β make φ impossible to normalize. 21
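The Gaussian case above can be checked numerically: for β > 0 the partition function converges to the closed form Z = sqrt(π/β), while for β ≤ 0 the integrand does not decay and Z diverges. A crude trapezoidal sketch (interval and step size are arbitrary choices):

```python
import math

def approx_Z(beta, lo=-30.0, hi=30.0, n=120_000):
    """Trapezoidal approximation of Z = integral of exp(-beta * x^2) dx.

    Only meaningful for beta > 0; for beta <= 0 the true integral diverges
    and widening [lo, hi] would grow this value without bound.
    """
    h = (hi - lo) / n
    s = 0.5 * (math.exp(-beta * lo * lo) + math.exp(-beta * hi * hi))
    s += sum(math.exp(-beta * (lo + i * h) ** 2) for i in range(1, n))
    return s * h

beta = 0.5
closed_form = math.sqrt(math.pi / beta)   # the Gaussian normalizer sqrt(pi/beta)
```

The numerical estimate matches sqrt(π/β) closely, confirming that positive β yields a (Gaussian) distribution.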
Key difference between BN & MN Directed models are defined directly in terms of probability distributions from the start. Undirected models are defined more loosely, in terms of φ functions that are then converted into probability distributions. This changes the intuitions needed to work with these models. One key idea to keep in mind when working with MNs: the domain of the variables has a dramatic effect on the kind of probability distribution a given set of φ functions corresponds to. We will see how we can define distributions for different domains. 22
What distribution does an MN give? Consider an n-dimensional random variable x = {x_i}, i = 1,..,n, and an undirected model parameterized by biases b. Suppose we have one clique for each x_i: φ^(i)(x_i) = exp(b_i x_i). What kind of probability distribution is modeled? p(x) = (1/Z) p̃(x), where p̃(x) = Π_i φ^(i)(x_i) = exp(b_1 x_1 + .. + b_n x_n) and Z = ∫ p̃(x) dx. The answer is that we do not have enough information, because we have not specified the domain of x. Three example domains are: 1. x ∈ R^n, an n-dimensional vector of real values 2. x ∈ {0,1}^n, an n-dimensional vector of binary values 3. The set of elementary basis vectors {[1,0,..,0], [0,1,..,0], .., [0,0,..,1]}
Effect of domain of x on distribution We have n random variables x = {x_i}, i = 1,..,n, with, for each x_i: φ^(i)(x_i) = exp(b_i x_i), so p(x) = (1/Z) p̃(x), p̃(x) = Π_i φ^(i)(x_i) = exp(b_1 x_1 + .. + b_n x_n), Z = ∫ p̃(x) dx. What kind of probability distribution is modeled? 1. If x ∈ R^n, then Z = ∫ p̃(x) dx diverges and no probability distribution exists. 2. If x ∈ {0,1}^n, then p(x) factorizes into n independent distributions with p(x_i = 1) = σ(b_i), where σ(x) = e^x / (1 + e^x) = 1 / (1 + e^{-x}); each independent distribution is a Bernoulli with parameter σ(b_i). 3. If the domain of x is the set of basis vectors {[1,0,..,0], [0,1,..,0], .., [0,0,..,1]}, then p(x) = softmax(b): a large value of b_i drives p(x_i = 1) toward 1 and p(x_j = 1) toward 0 for j ≠ i, i.e., a multiclass distribution. Often, by careful choice of the domain of x, we can obtain complicated behavior from a simple set of φ functions. 24
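Case 2 above can be verified by brute force: with x ∈ {0,1}^n and φ^(i)(x_i) = exp(b_i x_i), the marginal p(x_i = 1) computed from the normalized joint equals σ(b_i). A sketch with arbitrary illustrative bias values:

```python
import itertools
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

b = [0.5, -1.0, 2.0]          # illustrative biases (assumed values)
n = len(b)

def p_tilde(x):
    """Unnormalized product of single-variable potentials exp(b_i * x_i)."""
    return math.exp(sum(bi * xi for bi, xi in zip(b, x)))

states = list(itertools.product((0, 1), repeat=n))
Z = sum(p_tilde(x) for x in states)

# Marginal p(x_0 = 1), summing the normalized joint over all other variables.
marginal = sum(p_tilde(x) for x in states if x[0] == 1) / Z
```

Algebraically Z factorizes as Π_i (1 + e^{b_i}), which is why the marginal collapses to e^{b_0}/(1 + e^{b_0}) = σ(b_0), independently of the other variables.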
4. Energy-based Models (EBMs) Many interesting theoretical results about undirected models depend on the assumption that ∀x, p̃(x) > 0. We can enforce this using an EBM, where p̃(x) = exp(-E(x)), with p̃(x) = Π_{C∈G} φ(C) as before. E(x) is known as the energy function. Because exp(z) > 0 for all z, no energy function will result in a probability of zero for any x. If we were to learn the clique potentials directly, we would need to impose constraints to guarantee a minimum probability value. By learning the energy function instead, we can use unconstrained optimization: probabilities can approach 0 but never reach it.
Boltzmann Machine Terminology Any distribution of the form p̃(x) = exp(-E(x)) is referred to as a Boltzmann distribution. For this reason, many energy-based models are referred to as Boltzmann machines. There is no consensus on when to call a model an energy-based model versus a Boltzmann machine. The term Boltzmann machine first referred only to models with binary variables, but today models such as the mean-covariance restricted Boltzmann machine deal with real-valued variables as well. Today, the term Boltzmann machine usually refers to models with latent variables, while those without latent variables are referred to as MRFs or log-linear models. 26
Cliques, factors and energy Cliques in the undirected graph correspond to factors in the unnormalized probability function. Because exp(a)exp(b) = exp(a+b), cliques in the undirected graph also correspond to different terms of the energy function; i.e., an energy-based model is a special kind of Markov network. Exponentiation makes each term of the energy function correspond to a factor for a different clique. Reading the form of the energy function from an undirected graph is shown next. 27
Graph and Corresponding Energy This graph (with five cliques) implies that E(a,b,c,d,e,f) = E_{a,b}(a,b) + E_{b,c}(b,c) + E_{a,d}(a,d) + E_{b,e}(b,e) + E_{e,f}(e,f). We can obtain the φ functions by setting each φ to the exponential of the negative of the corresponding energy term, e.g., φ_{a,b}(a,b) = exp(-E_{a,b}(a,b)) 28
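The correspondence between energy terms and clique potentials follows from the identity exp(-(E1 + E2)) = exp(-E1) exp(-E2). A two-clique sketch with arbitrary illustrative energy values:

```python
import math

# Two energy terms for cliques {a,b} and {b,c}; the numbers are assumed
# for illustration, evaluated at some fixed joint state.
E_ab = 0.7
E_bc = -1.2

# Each clique potential is the exponential of the negative energy term.
phi_ab = math.exp(-E_ab)
phi_bc = math.exp(-E_bc)

# The unnormalized probability is the same whether computed from the
# summed energy or from the product of factors.
p_tilde_from_energy = math.exp(-(E_ab + E_bc))
p_tilde_from_factors = phi_ab * phi_bc
```

This is why an additive decomposition of the energy over cliques is exactly a multiplicative factorization of p̃ over the same cliques.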
Energy-based Model as Experts An energy based model with multiple terms in its energy function can be viewed as a product of experts Each term corresponds to a factor in the probability distribution Each term determines whether a soft constraint is satisfied Each expert may impose only one constraint that concerns a low-dimensional projection of the random variables When combined by multiplication of probabilities, the experts together enforce a high-dimensional constraint
Role of negative sign in energy The negative sign in p̃(x) = exp(-E(x)) serves no functional purpose from a ML perspective. This sign could be incorporated into the definition of the energy function. It is there mainly for compatibility with the physics literature. Some ML researchers omit the negative sign and refer to the negative energy as harmony 30
Free Energy instead of Probability Many algorithms that operate on probabilistic models do not need to compute p_model(x), but only log p̃_model(x). For energy-based models with latent variables h, these algorithms are phrased in terms of the negative of this quantity, called the free energy: F(x) = -log Σ_h exp(-E(x,h)). The deep learning literature tends to prefer this formulation 31
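The free energy above can be computed by enumerating the latent states for a small model. This is a toy sketch: the energy function below is an arbitrary illustrative choice (not from the slides), with two binary latent variables, and the sum is done with a log-sum-exp for numerical stability.

```python
import itertools
import math

def energy(x, h):
    """Assumed toy energy E(x, h) for scalar x and binary h = (h0, h1)."""
    return -(0.5 * x * h[0] + 1.5 * x * h[1]) + 0.3 * (h[0] + h[1])

def free_energy(x, n_hidden=2):
    """F(x) = -log sum_h exp(-E(x, h)), via the log-sum-exp trick."""
    neg_energies = [-energy(x, h)
                    for h in itertools.product((0, 1), repeat=n_hidden)]
    m = max(neg_energies)  # shift by the max to avoid overflow
    return -(m + math.log(sum(math.exp(v - m) for v in neg_energies)))
```

For models with many latent variables, this brute-force sum is intractable; the point of the formulation is that for structured models (e.g., RBMs) the sum over h factorizes and F(x) is cheap.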
5. Separation and D-Separation Edges in a directed graph tell which variables directly interact We often need to know which variables indirectly interact Some of these interactions can be enabled or disabled by observing other variables More formally we would like to know which variables are conditionally independent of each other given the values of other sets of variables 32
Separation in undirected models Identifying conditional independences is very simple in the case of undirected models In this case conditional independence implied by the graph is called separation Set of variables A is separated from variables B given third set of variables S if the graph implies that A is independent of B given S If two variables a and b are connected by a path involving only unobserved variables then those variables are not separated If no path exists between them, or all paths contain an observed variable then they are separated 33
Separation in undirected graphs b is shaded to indicate it is observed b blocks path from a to c, so a and c are separated given b There is an active path from a to d, so a and d are not separated given b 34
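Separation in an undirected graph can be checked mechanically: A and B are separated given observed S iff every path between them passes through a node in S, so a graph search that refuses to step through observed nodes decides it. A sketch; the adjacency list below is an assumed reconstruction of the slide's example, with edges a-b, b-c and a-d.

```python
from collections import deque

def separated(adj, src, dst, observed):
    """True iff src and dst are separated given the observed node set."""
    if src in observed or dst in observed:
        return True
    seen, queue = {src}, deque([src])
    while queue:
        u = queue.popleft()
        if u == dst:
            return False  # reached dst along a path avoiding observed nodes
        for v in adj.get(u, ()):
            if v not in seen and v not in observed:
                seen.add(v)
                queue.append(v)
    return True

# Assumed graph for the slide's example.
adj = {'a': ['b', 'd'], 'b': ['a', 'c'], 'c': ['b'], 'd': ['a']}
```

With b observed, the path a-b-c is blocked, so a and c are separated; the direct edge a-d remains an active path, so a and d are not.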
Separation in Directed Graphs In the context of directed graphs, these separation concepts are called d-separation d stands for dependence D-separation is defined the same as separation for undirected graphs: A set of variables A is d-separated from a set of variables B given a third set of variables S if the graph structure implies that A is independent of B given S 35
Examining Active Paths Two variables are dependent if there is an active path between them. They are d-separated if no active path exists between them. In directed nets, determining whether a path is active is more complicated. A guide to identifying active paths in a directed model is given next 36
All active paths of length 2 Active paths between random variables a and b 37
Reading properties from a graph 38