
BOLTZMANN MACHINES

Generative Models

Graphical Models

A graph consists of a set of nodes (vertices) connected by links (edges or arcs). In a probabilistic graphical model, each node represents a random variable, and the links represent probabilistic dependencies between random variables.

There are two main types of graphical models:
1) Bayesian networks, also known as directed graphical models: the links have a particular directionality, indicated by arrows.
2) Markov random fields, also known as undirected graphical models: the links carry no arrows and have no directional significance.

Hybrid graphical models, such as Deep Belief Networks, combine directed and undirected links.

Bayesian Networks (Directed Graphical Models)

Directed graphs are useful for expressing causal (parent-child) relationships between random variables. Consider an arbitrary joint distribution $p(a, b, c)$ over three random variables a, b and c, each of which can be discrete or continuous. Applying the product rule of probability gives

$$p(a, b, c) = p(c \mid a, b)\, p(b \mid a)\, p(a),$$

which is only one of several possible decompositions, since the variables can be taken in any order. This joint distribution can be represented by a simple graphical model with one node per variable and a directed link into each node from every variable it is conditioned on (here a -> b, a -> c and b -> c).

If there is a link going from node a to node b, then:
- node a is a parent of node b;
- node b is a child of node a.

If each node has incoming links from all lower-numbered nodes, the graph is fully connected: there is a link between every pair of nodes. The absence of links conveys information about the properties of the class of distributions that the graph represents.

Factorization property: the joint distribution defined by a directed graph is the product, over all nodes, of the conditional distribution of each node given its parents:

$$p(\mathbf{x}) = \prod_{k=1}^{K} p(x_k \mid \mathrm{pa}_k),$$

where $\mathrm{pa}_k$ denotes the set of parents of node $x_k$.

Important restriction: there must be no directed cycles. Such graphs are called directed acyclic graphs (DAGs).
- Reason: if the graph contains a cycle, the product of conditionals is not guaranteed to be a valid (normalized) joint distribution.

In probabilistic graphical models, random variables are denoted by open circles and deterministic parameters by smaller solid circles. When we apply a graphical model to a machine learning problem, we set some of the variables to specific observed values (e.g. we condition on the data).

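Ancestral Sampling

Our goal is to draw a sample from the joint distribution defined by the graph.
- Start at the top (the nodes without parents) and sample the nodes in order, drawing each node from its conditional distribution given the already-sampled values of its parents.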
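As a minimal sketch of ancestral sampling, the following assumes a hypothetical three-node network a -> b, a -> c, b -> c over binary variables. The probability tables are made-up numbers chosen only for illustration; they do not come from the slides.

```python
import random

# Hypothetical network a -> b, a -> c, b -> c with binary variables.
# All probabilities below are made-up values for illustration only.
P_A = 0.3                                   # p(a = 1)
P_B_GIVEN_A = {0: 0.2, 1: 0.7}              # p(b = 1 | a)
P_C_GIVEN_AB = {(0, 0): 0.1, (0, 1): 0.5,   # p(c = 1 | a, b)
                (1, 0): 0.4, (1, 1): 0.9}


def bernoulli(p, rng):
    """Return 1 with probability p, else 0."""
    return 1 if rng.random() < p else 0


def ancestral_sample(rng):
    """Sample (a, b, c), visiting every node after all of its parents."""
    a = bernoulli(P_A, rng)
    b = bernoulli(P_B_GIVEN_A[a], rng)
    c = bernoulli(P_C_GIVEN_AB[(a, b)], rng)
    return a, b, c


rng = random.Random(0)
samples = [ancestral_sample(rng) for _ in range(20000)]
freq_a = sum(s[0] for s in samples) / len(samples)   # estimate of p(a = 1)
```

Because each node is drawn only after its parents, every tuple in `samples` is an exact draw from the joint distribution, and empirical frequencies such as `freq_a` approach the corresponding marginals.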

Conditional Independence

An important feature of graphical models is that conditional independence properties of the joint distribution can be read directly from the graph, without any analytical manipulation.

Example (i.i.d. data): (a) given mu, all the x's are independent; (b) if we integrate out mu, the x's are no longer independent.

Markov Blanket in Directed Models

- The Markov blanket of a node is the minimal set of nodes that must be observed to make this node independent of all other nodes.
- In a directed model, the Markov blanket consists of the parents, the children and the co-parents (i.e. all the other parents of the node's children); the co-parents are needed because of explaining away.

The Markov blanket property is useful in graphs with many random variables: the conditional distribution of each variable depends only on its Markov blanket, so variables that do not lie in each other's blankets can be updated in parallel. This property underlies Markov chain (Gibbs) sampling and inference in belief networks.

Markov Random Fields (Undirected Graphical Models)

Directed graphs are useful for expressing causal relationships between random variables, whereas undirected graphs are useful for expressing soft constraints between random variables.

The joint distribution defined by the graph is the product of non-negative potential functions over the maximal cliques (fully connected subsets of nodes):

$$p(\mathbf{x}) = \frac{1}{Z} \prod_{C} \psi_C(\mathbf{x}_C), \qquad Z = \sum_{\mathbf{x}} \prod_{C} \psi_C(\mathbf{x}_C),$$

where the normalizing constant $Z$ is called the partition function.

Clique: a subset of nodes such that there is a link between every pair of nodes in the subset.
Maximal clique: a clique to which no other node can be added without it ceasing to be a clique.

The example graph has 7 cliques, two of which are maximal cliques.
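The two definitions can be checked mechanically. The sketch below uses an assumed four-node graph with five links; the original figure is not available, so this graph is a stand-in that happens to have 7 cliques of two or more nodes, two of them maximal.

```python
from itertools import combinations

# Assumed example graph (not recovered from the slide): four nodes, five links.
nodes = [1, 2, 3, 4]
edges = {(1, 2), (1, 3), (2, 3), (2, 4), (3, 4)}


def linked(u, v):
    """True if there is an undirected link between u and v."""
    return (u, v) in edges or (v, u) in edges


def is_clique(subset):
    """A subset is a clique if every pair of its nodes is linked."""
    return all(linked(u, v) for u, v in combinations(subset, 2))


# Enumerate cliques with at least two nodes (single nodes are trivially cliques).
cliques = [set(s) for r in range(2, len(nodes) + 1)
           for s in combinations(nodes, r) if is_clique(s)]

# A clique is maximal if it is not a proper subset of another clique.
maximal = [c for c in cliques if not any(c < other for other in cliques)]
```

Brute-force enumeration is exponential in the number of nodes, so this is only meant to make the definitions concrete on a toy graph.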

Each potential function maps the joint configurations of the random variables in a clique to non-negative real numbers. In contrast to the conditional distributions of directed graphs, potential functions have no specific probabilistic interpretation, which gives greater flexibility in choosing them. Potential functions are often represented as exponentials:

$$\psi_C(\mathbf{x}_C) = \exp\{-E(\mathbf{x}_C)\},$$

where $E(\mathbf{x}_C)$ is called an energy function. The joint distribution is then a Boltzmann distribution whose total energy is the sum of the clique energies.
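To make the roles of the potentials and the partition function concrete, here is a small sketch on an assumed toy model: a chain x1 - x2 - x3 of binary variables with a hypothetical pairwise energy that favours equal neighbours. Neither the graph nor the energy comes from the slides.

```python
import math
from itertools import product


# Assumed toy model: a chain x1 - x2 - x3 of binary variables.
# Its maximal cliques are {x1, x2} and {x2, x3}.
def energy_pair(xa, xb, coupling=1.0):
    """Hypothetical clique energy: low when the two neighbours agree."""
    return -coupling if xa == xb else coupling


def unnormalized(x):
    """Product of the clique potentials psi_C = exp(-E_C)."""
    x1, x2, x3 = x
    return math.exp(-energy_pair(x1, x2)) * math.exp(-energy_pair(x2, x3))


states = list(product([0, 1], repeat=3))
Z = sum(unnormalized(x) for x in states)          # partition function
p = {x: unnormalized(x) / Z for x in states}      # normalized joint distribution
```

The sum defining `Z` runs over all 2^3 joint states; the same sum over 2^N states is what makes the partition function intractable for large models.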

For many interesting real-world problems we need to introduce hidden (latent) variables, so the random variables comprise both visible and hidden variables, $\mathbf{x} = (\mathbf{v}, \mathbf{h})$. The model then assigns to a visible vector the marginal probability

$$p(\mathbf{v}) = \sum_{\mathbf{h}} p(\mathbf{v}, \mathbf{h}) = \frac{1}{Z} \sum_{\mathbf{h}} \exp\{-E(\mathbf{v}, \mathbf{h})\}.$$

In general, computing both the partition function and the summation over the hidden variables is intractable, except in special cases. Parameter learning therefore becomes a very challenging task.

Conditional Independence in Undirected Graphs
- It is easier to determine than in directed models.
- Two sets of nodes are conditionally independent if the observed nodes block all paths between them.

Markov Blanket in Undirected Graphs
- This is simpler than in directed models, since there is no explaining away: the Markov blanket of a node is just the set of its neighbours.
- The conditional distribution of a node $x_i$, conditioned on all the other variables in the graph, depends only on the variables in its Markov blanket.

Because the conditional independence properties of the random variables and the factorization of the joint distribution are closely related, one can ask whether the distributions of MRFs have a general factorized form.

Hammersley-Clifford theorem: for strictly positive distributions, the following two sets of distributions are the same:
- the set of distributions consistent with the conditional independence relationships defined by the undirected graph;
- the set of distributions consistent with the factorization defined by potential functions on the maximal cliques of the graph.

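Restricted Boltzmann Machines

RBMs are undirected bipartite graphical models with a maximal clique size of 2.
- Stochastic binary visible variables: $\mathbf{v} = (v_1, \dots, v_m) \in \{0, 1\}^m$
- Stochastic binary hidden variables: $\mathbf{h} = (h_1, \dots, h_n) \in \{0, 1\}^n$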

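The energy of a joint configuration $(\mathbf{v}, \mathbf{h})$ is given as

$$E(\mathbf{v}, \mathbf{h}) = -\sum_{i=1}^{m} \sum_{j=1}^{n} w_{ij} v_i h_j - \sum_{i=1}^{m} b_i v_i - \sum_{j=1}^{n} a_j h_j,$$

where $w_{ij}$ is the weight between visible unit $v_i$ and hidden unit $h_j$, and $b_i$ and $a_j$ are the visible and hidden biases.

Note that the graph of an RBM has connections only between the hidden layer and the visible layer, not between two variables of the same layer. In terms of probability, this means that the hidden variables are independent given the state of the visible variables, and vice versa:

$$p(\mathbf{h} \mid \mathbf{v}) = \prod_{j=1}^{n} p(h_j \mid \mathbf{v}), \qquad p(\mathbf{v} \mid \mathbf{h}) = \prod_{i=1}^{m} p(v_i \mid \mathbf{h}).$$

We can also show (see the derivation in Fischer and Igel's paper) that

$$p(h_j = 1 \mid \mathbf{v}) = \sigma\Big(\sum_{i=1}^{m} w_{ij} v_i + a_j\Big), \qquad p(v_i = 1 \mid \mathbf{h}) = \sigma\Big(\sum_{j=1}^{n} w_{ij} h_j + b_i\Big),$$

where $\sigma(x) = 1 / (1 + e^{-x})$ is the logistic sigmoid.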
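The following NumPy sketch evaluates the standard binary RBM energy and the sigmoid conditionals, and checks one conditional against the energies directly. The convention assumed here is a weight matrix `W` of shape (visible, hidden), visible biases `b` and hidden biases `a`; the sizes and random parameter values are illustrative only.

```python
import numpy as np


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


def energy(v, h, W, b, a):
    """E(v, h) = -v^T W h - b^T v - a^T h, with W of shape (m, n)."""
    return -(v @ W @ h) - b @ v - a @ h


def p_h_given_v(v, W, a):
    """Vector of p(h_j = 1 | v) for all hidden units j."""
    return sigmoid(v @ W + a)


def p_v_given_h(h, W, b):
    """Vector of p(v_i = 1 | h) for all visible units i."""
    return sigmoid(W @ h + b)


rng = np.random.default_rng(0)
m, n = 4, 3                                   # illustrative layer sizes
W = rng.normal(scale=0.5, size=(m, n))
b = rng.normal(scale=0.1, size=m)
a = rng.normal(scale=0.1, size=n)
v = np.array([1.0, 0.0, 1.0, 1.0])

# Check p(h_0 = 1 | v) against the energies. The other hidden units can be
# fixed arbitrarily because their contribution cancels in the ratio.
h_on = np.array([1.0, 0.0, 1.0])
h_off = np.array([0.0, 0.0, 1.0])
e_on = np.exp(-energy(v, h_on, W, b, a))
e_off = np.exp(-energy(v, h_off, W, b, a))
ratio = e_on / (e_on + e_off)
```

That `ratio` matches `p_h_given_v(v, W, a)[0]` is a direct consequence of the absence of hidden-to-hidden links: each hidden unit sees only the visible layer.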

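RBM learning algorithms are based on gradient ascent on the log-likelihood.

Log-Likelihood Gradient of MRFs with Latent Variables

For an RBM with parameters $\theta$, the log-likelihood given a single training example $\mathbf{v}$ is

$$\ln \mathcal{L}(\theta \mid \mathbf{v}) = \ln p(\mathbf{v} \mid \theta) = \ln \frac{1}{Z} \sum_{\mathbf{h}} e^{-E(\mathbf{v}, \mathbf{h})} = \ln \sum_{\mathbf{h}} e^{-E(\mathbf{v}, \mathbf{h})} - \ln \sum_{\mathbf{v}, \mathbf{h}} e^{-E(\mathbf{v}, \mathbf{h})},$$

and its gradient is

$$\frac{\partial \ln \mathcal{L}(\theta \mid \mathbf{v})}{\partial \theta} = -\sum_{\mathbf{h}} p(\mathbf{h} \mid \mathbf{v}) \frac{\partial E(\mathbf{v}, \mathbf{h})}{\partial \theta} + \sum_{\mathbf{v}, \mathbf{h}} p(\mathbf{v}, \mathbf{h}) \frac{\partial E(\mathbf{v}, \mathbf{h})}{\partial \theta}.$$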

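In the last step we used that the conditional probability can be written in the following way:

$$p(\mathbf{h} \mid \mathbf{v}) = \frac{p(\mathbf{v}, \mathbf{h})}{p(\mathbf{v})} = \frac{\frac{1}{Z} e^{-E(\mathbf{v}, \mathbf{h})}}{\frac{1}{Z} \sum_{\mathbf{h}} e^{-E(\mathbf{v}, \mathbf{h})}} = \frac{e^{-E(\mathbf{v}, \mathbf{h})}}{\sum_{\mathbf{h}} e^{-E(\mathbf{v}, \mathbf{h})}}.$$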

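The first term of the gradient can be computed efficiently because it factorizes nicely. For example, w.r.t. the parameter $w_{ij}$ (the weight between visible unit $v_i$ and hidden unit $h_j$) we get:

$$-\sum_{\mathbf{h}} p(\mathbf{h} \mid \mathbf{v}) \frac{\partial E(\mathbf{v}, \mathbf{h})}{\partial w_{ij}} = \sum_{\mathbf{h}} p(\mathbf{h} \mid \mathbf{v})\, v_i h_j = \sum_{h_j} p(h_j \mid \mathbf{v})\, v_i h_j = p(h_j = 1 \mid \mathbf{v})\, v_i.$$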

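Similarly, the second term can also be written as

$$\sum_{\mathbf{v}, \mathbf{h}} p(\mathbf{v}, \mathbf{h}) \frac{\partial E(\mathbf{v}, \mathbf{h})}{\partial \theta} = \sum_{\mathbf{v}} p(\mathbf{v}) \sum_{\mathbf{h}} p(\mathbf{h} \mid \mathbf{v}) \frac{\partial E(\mathbf{v}, \mathbf{h})}{\partial \theta}.$$

Therefore, w.r.t. the parameter $w_{ij}$ we get (using the outer summation over $\mathbf{v}$):

$$\sum_{\mathbf{v}, \mathbf{h}} p(\mathbf{v}, \mathbf{h}) \frac{\partial E(\mathbf{v}, \mathbf{h})}{\partial w_{ij}} = -\sum_{\mathbf{v}} p(\mathbf{v})\, p(h_j = 1 \mid \mathbf{v})\, v_i.$$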

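Thus, the derivative of the log-likelihood of a single training pattern $\mathbf{v}$ w.r.t. the weight $w_{ij}$ becomes

$$\frac{\partial \ln \mathcal{L}(\theta \mid \mathbf{v})}{\partial w_{ij}} = p(h_j = 1 \mid \mathbf{v})\, v_i - \sum_{\mathbf{v}} p(\mathbf{v})\, p(h_j = 1 \mid \mathbf{v})\, v_i.$$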

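For the mean of this derivative over a training set $S = \{\mathbf{v}_1, \dots, \mathbf{v}_\ell\}$, the following notations are often used:

$$\frac{1}{\ell} \sum_{\mathbf{v} \in S} \frac{\partial \ln \mathcal{L}(\theta \mid \mathbf{v})}{\partial w_{ij}} = \langle v_i h_j \rangle_{p(\mathbf{h} \mid \mathbf{v})\, q(\mathbf{v})} - \langle v_i h_j \rangle_{p(\mathbf{h}, \mathbf{v})},$$

with $q$ denoting the empirical distribution. This gives the often stated rule:

$$\sum_{\mathbf{v} \in S} \frac{\partial \ln \mathcal{L}(\theta \mid \mathbf{v})}{\partial w_{ij}} \propto \langle v_i h_j \rangle_{\text{data}} - \langle v_i h_j \rangle_{\text{model}}.$$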

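Analogously, for the visible bias $b_i$ and the hidden bias $a_j$ we obtain

$$\frac{\partial \ln \mathcal{L}(\theta \mid \mathbf{v})}{\partial b_i} = v_i - \sum_{\mathbf{v}} p(\mathbf{v})\, v_i, \qquad \frac{\partial \ln \mathcal{L}(\theta \mid \mathbf{v})}{\partial a_j} = p(h_j = 1 \mid \mathbf{v}) - \sum_{\mathbf{v}} p(\mathbf{v})\, p(h_j = 1 \mid \mathbf{v}).$$

In all these equations, the first term can be computed analytically because of the conditional independence of the variables. The second term, however, runs over all possible configurations of the visible variables; it requires an exponential number of terms and is therefore intractable. To avoid this computational burden, the second expectation can be approximated using samples drawn from the model distribution with MCMC techniques.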
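For a very small RBM the exact gradient can still be computed by enumeration, which makes the cost of the second term visible. The sketch below assumes the standard binary RBM energy with weights `W` of shape (visible, hidden), visible biases `b` and hidden biases `a`; the sizes and random values are made up, and the result is checked against a finite-difference estimate.

```python
import numpy as np
from itertools import product


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


def log_likelihood(v, W, b, a):
    """ln p(v) for a tiny RBM, normalizing over every visible state."""
    m = W.shape[0]

    def log_unnorm(u):
        # ln sum_h exp(-E(u, h)); the sum over h has a closed form.
        return b @ u + np.sum(np.log1p(np.exp(u @ W + a)))

    all_v = [np.array(u, dtype=float) for u in product([0, 1], repeat=m)]
    log_Z = np.log(sum(np.exp(log_unnorm(u)) for u in all_v))
    return log_unnorm(v) - log_Z


def exact_grad_W(v, W, b, a):
    """Data term minus model term of d ln p(v) / d w_ij."""
    m = W.shape[0]
    positive = np.outer(v, sigmoid(v @ W + a))
    negative = np.zeros_like(W)
    for u in product([0, 1], repeat=m):        # 2**m terms: the bottleneck
        u = np.array(u, dtype=float)
        p_u = np.exp(log_likelihood(u, W, b, a))
        negative += p_u * np.outer(u, sigmoid(u @ W + a))
    return positive - negative


rng = np.random.default_rng(1)
W = rng.normal(scale=0.5, size=(3, 2))         # illustrative parameters
b = rng.normal(scale=0.1, size=3)
a = rng.normal(scale=0.1, size=2)
v = np.array([1.0, 0.0, 1.0])
grad = exact_grad_W(v, W, b, a)

# Finite-difference check of one gradient entry.
eps = 1e-6
W_plus = W.copy()
W_plus[0, 1] += eps
W_minus = W.copy()
W_minus[0, 1] -= eps
numeric = (log_likelihood(v, W_plus, b, a)
           - log_likelihood(v, W_minus, b, a)) / (2 * eps)
```

With m visible units the loop in `exact_grad_W` visits 2^m configurations, which is fine for m = 3 and hopeless for image-sized inputs.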

Contrastive Divergence
1. Start the sampling chain at a training example $\mathbf{v}^{(0)}$.
2. Obtain the point $\mathbf{v}^{(k)}$ by $k$ steps of Gibbs sampling (usually $k = 1$).
3. Replace the expectation under the model by a point estimate at $\mathbf{v}^{(k)}$.

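The independence between the variables in one layer makes Gibbs sampling especially easy: instead of sampling new values for all variables one after another, the states of all variables in one layer can be sampled jointly. This is also referred to as block Gibbs sampling. Each step $t$ consists of sampling $\mathbf{h}^{(t)}$ from $p(\mathbf{h} \mid \mathbf{v}^{(t)})$ and then sampling $\mathbf{v}^{(t+1)}$ from $p(\mathbf{v} \mid \mathbf{h}^{(t)})$. The gradient w.r.t. $\theta$ of the log-likelihood for one training pattern $\mathbf{v}^{(0)}$ is then approximated by

$$\mathrm{CD}_k(\theta, \mathbf{v}^{(0)}) = -\sum_{\mathbf{h}} p(\mathbf{h} \mid \mathbf{v}^{(0)}) \frac{\partial E(\mathbf{v}^{(0)}, \mathbf{h})}{\partial \theta} + \sum_{\mathbf{h}} p(\mathbf{h} \mid \mathbf{v}^{(k)}) \frac{\partial E(\mathbf{v}^{(k)}, \mathbf{h})}{\partial \theta}.$$

Note that, since the energy function is linear in the parameters, these derivatives are easy to compute.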
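A minimal NumPy sketch of a CD-k update with block Gibbs sampling follows. It assumes a binary RBM with weights `W` of shape (visible, hidden), visible biases `b` and hidden biases `a`; the function name `cd_k_update`, the learning rate and the toy data are hypothetical choices for illustration.

```python
import numpy as np


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


def cd_k_update(v0, W, b, a, rng, k=1, lr=0.1):
    """One CD-k update (in place) for a batch v0 of shape (batch, m)."""
    ph0 = sigmoid(v0 @ W + a)                  # p(h = 1 | v0): positive phase
    vk = v0
    for _ in range(k):                         # block Gibbs sampling
        h = (rng.random(ph0.shape) < sigmoid(vk @ W + a)).astype(float)
        vk = (rng.random(v0.shape) < sigmoid(h @ W.T + b)).astype(float)
    phk = sigmoid(vk @ W + a)                  # p(h = 1 | vk): negative phase
    batch = v0.shape[0]
    W += lr * (v0.T @ ph0 - vk.T @ phk) / batch
    b += lr * (v0 - vk).mean(axis=0)
    a += lr * (ph0 - phk).mean(axis=0)
    return vk


rng = np.random.default_rng(0)
# Made-up data: two binary patterns repeated, purely for illustration.
data = np.array([[1, 1, 0, 0], [0, 0, 1, 1]] * 10, dtype=float)
W = 0.01 * rng.normal(size=(4, 3))
b = np.zeros(4)
a = np.zeros(3)
for epoch in range(200):
    vk = cd_k_update(data, W, b, a, rng, k=1)
```

Each layer is sampled in a single vectorized step, which is exactly the block structure described above; the hidden probabilities (rather than samples) are used in the update to reduce variance, a common but optional choice.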

These equations say that for positive samples (the training data) we want to decrease the energy, and for negative samples (samples drawn from the model, which may lie away from the true data) we want to increase the energy.


Persistent Contrastive Divergence

The algorithm corresponds to standard CD learning, except that the visible units of the Markov chain are not reinitialized with a training sample each time we want to draw a sample $\mathbf{v}^{(k)}$ approximately from the RBM distribution. Instead, one keeps persistent chains that are run for $k$ Gibbs steps after each parameter update (i.e. the initial state of the current Gibbs chain is equal to $\mathbf{v}^{(k)}$ from the previous update step). The fundamental idea underlying PCD is that the chains can be assumed to stay close to the stationary distribution if the learning rate is sufficiently small, so that the model changes only slightly between parameter updates.
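The difference from CD can be sketched in a few lines: the negative-phase chains are passed from one update to the next instead of being restarted at the data. The code assumes a binary RBM with weights `W` of shape (visible, hidden); the helper names, the number of chains and the toy data are hypothetical.

```python
import numpy as np


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


def gibbs_step(v, W, b, a, rng):
    """One block Gibbs step v -> h -> v' for a batch of chains."""
    h = (rng.random((v.shape[0], W.shape[1])) < sigmoid(v @ W + a)).astype(float)
    return (rng.random(v.shape) < sigmoid(h @ W.T + b)).astype(float)


def pcd_update(v_data, chains, W, b, a, rng, k=1, lr=0.05):
    """One PCD update; `chains` persists across calls, it is not reset to data."""
    for _ in range(k):
        chains = gibbs_step(chains, W, b, a, rng)
    ph_data = sigmoid(v_data @ W + a)
    ph_model = sigmoid(chains @ W + a)
    W += lr * (v_data.T @ ph_data / len(v_data) - chains.T @ ph_model / len(chains))
    b += lr * (v_data.mean(axis=0) - chains.mean(axis=0))
    a += lr * (ph_data.mean(axis=0) - ph_model.mean(axis=0))
    return chains                              # fed back in at the next update


rng = np.random.default_rng(0)
data = np.array([[1, 1, 0, 0], [0, 0, 1, 1]] * 10, dtype=float)  # made-up data
W = 0.01 * rng.normal(size=(4, 3))
b = np.zeros(4)
a = np.zeros(3)
chains = rng.integers(0, 2, size=(8, 4)).astype(float)           # 8 persistent chains
for step in range(200):
    chains = pcd_update(data, chains, W, b, a, rng)
```

The small learning rate matters here: it is what keeps the persistent chains close to the slowly changing model distribution.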


Gaussian-Bernoulli RBM

GB-RBMs are a variant of RBMs that can be used for modeling real-valued vectors such as pixel intensities and filter responses. To do this, we only need to modify the energy function so that each visible unit corresponds to a Gaussian-distributed random variable, instead of a Bernoulli-distributed one as in the binary RBM. To obtain Gaussian-distributed units, one adds quadratic terms to the energy. Adding $\sum_i v_i^2 / (2\sigma_i^2)$ gives rise to a diagonal covariance matrix between units of the same layer, where $v_i$ is the continuous value of a Gaussian unit and $\sigma_i^2$ is its variance.

It is recommended to normalize the training set by:
- subtracting the mean of each input;
- dividing each input by the training set standard deviation.

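Consider modeling visible real-valued units $\mathbf{v} \in \mathbb{R}^m$ and let $\mathbf{h} \in \{0, 1\}^n$ be binary stochastic hidden units. The energy of the state $\{\mathbf{v}, \mathbf{h}\}$ of the Gaussian-Bernoulli RBM is defined as

$$E(\mathbf{v}, \mathbf{h}) = \sum_{i=1}^{m} \frac{(v_i - b_i)^2}{2\sigma_i^2} - \sum_{i=1}^{m} \sum_{j=1}^{n} w_{ij} h_j \frac{v_i}{\sigma_i} - \sum_{j=1}^{n} a_j h_j.$$

The density that the model assigns to a visible vector $\mathbf{v}$ is given by

$$p(\mathbf{v}) = \frac{1}{Z} \sum_{\mathbf{h}} \exp\{-E(\mathbf{v}, \mathbf{h})\}, \qquad Z = \int \sum_{\mathbf{h}} \exp\{-E(\mathbf{v}, \mathbf{h})\}\, d\mathbf{v}.$$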

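Similar to the standard RBM, the conditional distributions factorize. With weights $w_{ij}$, visible biases $b_i$, hidden biases $a_j$ and standard deviations $\sigma_i$:

$$p(v_i = x \mid \mathbf{h}) = \mathcal{N}\Big(x;\; b_i + \sigma_i \sum_{j} w_{ij} h_j,\; \sigma_i^2\Big), \qquad p(h_j = 1 \mid \mathbf{v}) = \sigma\Big(\sum_{i} w_{ij} \frac{v_i}{\sigma_i} + a_j\Big),$$

where $\sigma(\cdot)$ without a subscript is the logistic sigmoid. Observe that, conditioned on the states of the hidden units, each visible unit is modeled by a Gaussian distribution whose mean is shifted by the weighted combination of the hidden unit activations. Given a set of observations, the derivative of the log-likelihood with respect to the model parameters takes a form very similar to that of binary RBMs.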
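As a rough sketch, one block Gibbs step in a Gaussian-Bernoulli RBM looks as follows. It assumes the common parameterization in which the hidden units see the visible values scaled by 1/sigma_i and the Gaussian mean is b_i + sigma_i * sum_j w_ij h_j; the sizes, parameters and stand-in data are made up.

```python
import numpy as np


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


def sample_h_given_v(v, W, a, sigma, rng):
    """h_j ~ Bernoulli(sigmoid(sum_i W_ij v_i / sigma_i + a_j))."""
    p = sigmoid((v / sigma) @ W + a)
    return (rng.random(p.shape) < p).astype(float)


def sample_v_given_h(h, W, b, sigma, rng):
    """v_i ~ Normal(b_i + sigma_i * sum_j W_ij h_j, sigma_i ** 2)."""
    mean = b + sigma * (h @ W.T)
    return mean + sigma * rng.normal(size=mean.shape)


rng = np.random.default_rng(0)
m, n = 5, 3                                    # illustrative layer sizes
W = rng.normal(scale=0.3, size=(m, n))
b = np.zeros(m)
a = np.zeros(n)
sigma = np.ones(m)                             # unit variance after normalization
v = rng.normal(size=(10, m))                   # stand-in for normalized data
h = sample_h_given_v(v, W, a, sigma, rng)
v_new = sample_v_given_h(h, W, b, sigma, rng)
```

Setting `sigma` to ones corresponds to the normalization recommended above, after which the variances are often kept fixed rather than learned.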

Replicated Softmax Model

The Replicated Softmax model is useful for modeling sparse count data, such as the word-count vectors of documents. Again, we only need to modify the energy function so that each visible unit corresponds to a multinomial-distributed random variable, instead of a Bernoulli-distributed one as in the binary RBM. Specifically, let $K$ be the dictionary size, $M$ the number of words appearing in a document, and $\mathbf{h} \in \{0, 1\}^F$ the binary stochastic hidden topic features. Let $\mathbf{V}$ be an $M \times K$ observed binary matrix with $v_i^k = 1$ iff the multinomial visible unit $i$ takes on the $k$-th value (meaning that the $i$-th word in the document is the $k$-th dictionary word).

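The energy of the state $\{\mathbf{V}, \mathbf{h}\}$ can be defined as

$$E(\mathbf{V}, \mathbf{h}) = -\sum_{i=1}^{M} \sum_{j=1}^{F} \sum_{k=1}^{K} W_{ij}^{k} h_j v_i^k - \sum_{i=1}^{M} \sum_{k=1}^{K} v_i^k b_i^k - \sum_{j=1}^{F} h_j a_j,$$

where $W_{ij}^{k}$ is a symmetric interaction term between visible unit $i$ that takes on value $k$ and hidden feature $j$, $b_i^k$ is the bias of unit $i$ that takes on value $k$, and $a_j$ is the bias of hidden feature $j$.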


Restricted Boltzmann Machines Collaborative Filtering

Restricted Boltzmann Machines Local vs. Distributed Representations

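Deep Belief Networks

A DBN is a generative model that mixes undirected and directed connections between variables. In a network with hidden layers $\mathbf{h}^{(1)}, \dots, \mathbf{h}^{(L)}$, the distribution over the top two layers is an RBM, and the other layers form a Bayesian network with conditional distributions

$$p(h_i^{(l-1)} = 1 \mid \mathbf{h}^{(l)}) = \sigma\Big(b_i^{(l-1)} + \sum_{j} W_{ij}^{(l)} h_j^{(l)}\Big), \qquad l = 1, \dots, L - 1,$$

where $\mathbf{h}^{(0)} = \mathbf{v}$ is the visible layer. This is not a feed-forward neural network.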

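The joint distribution of a DBN with visible layer $\mathbf{v} = \mathbf{h}^{(0)}$ and hidden layers $\mathbf{h}^{(1)}, \dots, \mathbf{h}^{(L)}$ is as follows:

$$p(\mathbf{v}, \mathbf{h}^{(1)}, \dots, \mathbf{h}^{(L)}) = p(\mathbf{h}^{(L-1)}, \mathbf{h}^{(L)}) \prod_{l=1}^{L-1} p(\mathbf{h}^{(l-1)} \mid \mathbf{h}^{(l)}),$$

where $p(\mathbf{h}^{(L-1)}, \mathbf{h}^{(L)})$ is the top-level RBM.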

Layer-wise Pretraining
- Improve the prior on the last layer by adding another hidden layer.
- Keep all lower layers fixed when training the upper layers.

Variational Bound

Stacking layers helps because each added layer can increase the evidence lower bound (ELBO), which is a lower bound on the log-likelihood.

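For any distribution $q(\mathbf{h}^{(1)} \mid \mathbf{v})$ over the first hidden layer,

$$\ln p(\mathbf{v}) \ge \sum_{\mathbf{h}^{(1)}} q(\mathbf{h}^{(1)} \mid \mathbf{v}) \big[\ln p(\mathbf{v} \mid \mathbf{h}^{(1)}) + \ln p(\mathbf{h}^{(1)})\big] - \sum_{\mathbf{h}^{(1)}} q(\mathbf{h}^{(1)} \mid \mathbf{v}) \ln q(\mathbf{h}^{(1)} \mid \mathbf{v}).$$

This inequality is called the variational bound, and its right-hand side is the ELBO. If $q(\mathbf{h}^{(1)} \mid \mathbf{v})$ is equal to the true conditional $p(\mathbf{h}^{(1)} \mid \mathbf{v})$, then we have an equality: the bound is tight. In fact, the difference between the left and right terms is the KL divergence between $q(\mathbf{h}^{(1)} \mid \mathbf{v})$ and $p(\mathbf{h}^{(1)} \mid \mathbf{v})$. The ELBO can be rewritten as

$$\sum_{\mathbf{h}^{(1)}} q(\mathbf{h}^{(1)} \mid \mathbf{v}) \ln p(\mathbf{v} \mid \mathbf{h}^{(1)}) + \sum_{\mathbf{h}^{(1)}} q(\mathbf{h}^{(1)} \mid \mathbf{v}) \ln p(\mathbf{h}^{(1)}) - \sum_{\mathbf{h}^{(1)}} q(\mathbf{h}^{(1)} \mid \mathbf{v}) \ln q(\mathbf{h}^{(1)} \mid \mathbf{v}).$$

Thus, if we increase the term involving the prior $p(\mathbf{h}^{(1)})$ while keeping $p(\mathbf{v} \mid \mathbf{h}^{(1)})$ and $q$ fixed, we improve the bound. This is the basis for DBNs: we take $p(\mathbf{h}^{(1)})$ and model it with another network, so that this term can be maximized. Thus, layer-wise pretraining improves the variational lower bound.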
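The relation between the log-likelihood, the ELBO and the KL divergence can be verified numerically on a model small enough to enumerate. The sketch below assumes a tiny binary RBM with random, made-up parameters as the joint p(v, h), and an arbitrary distribution `q` standing in for the approximate posterior.

```python
import numpy as np
from itertools import product

rng = np.random.default_rng(0)
m, n = 3, 2                                    # tiny illustrative model
W = rng.normal(size=(m, n))
b = rng.normal(size=m)
a = rng.normal(size=n)

vs = [np.array(u, dtype=float) for u in product([0, 1], repeat=m)]
hs = [np.array(u, dtype=float) for u in product([0, 1], repeat=n)]

# Joint p(v, h) of a small RBM, obtained by brute-force normalization.
unnorm = np.array([[np.exp(v @ W @ h + b @ v + a @ h) for h in hs] for v in vs])
joint = unnorm / unnorm.sum()

v_idx = 5                                      # an arbitrary visible configuration
log_p_v = np.log(joint[v_idx].sum())
posterior = joint[v_idx] / joint[v_idx].sum()  # true p(h | v)

q = rng.random(len(hs))                        # an arbitrary approximation q(h | v)
q /= q.sum()

# ELBO = sum_h q(h|v) [ln p(v, h) - ln q(h|v)], using ln p(v,h) = ln p(v|h) + ln p(h).
elbo = np.sum(q * (np.log(joint[v_idx]) - np.log(q)))
kl = np.sum(q * (np.log(q) - np.log(posterior)))
elbo_tight = np.sum(posterior * (np.log(joint[v_idx]) - np.log(posterior)))
```

Here `log_p_v` equals `elbo + kl` for any `q`, and `elbo_tight` (computed with the true posterior) coincides with `log_p_v`, which is the tightness statement above.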


Sampling from DBNs

To sample from the DBN model:
1. Sample the top two layers using alternating Gibbs sampling in the top-level RBM.
2. Sample the lower layers top-down using the sigmoid belief network.
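The two steps can be sketched as follows. The layer sizes, the random weights and the number of Gibbs steps are assumptions made for illustration; `weights[l]` is taken to connect layer l (below) to layer l + 1 (above), with layer 0 the visible layer.

```python
import numpy as np


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


def bernoulli(p, rng):
    """Sample binary units with activation probabilities p."""
    return (rng.random(p.shape) < p).astype(float)


def sample_dbn(weights, biases, rng, gibbs_steps=100):
    """Draw one visible sample from a DBN.

    weights[l] has shape (size of layer l, size of layer l + 1); biases[l] is
    the bias vector of layer l. The top two layers form the RBM.
    """
    W_top = weights[-1]
    h_below = bernoulli(np.full(W_top.shape[0], 0.5), rng)   # random start
    for _ in range(gibbs_steps):               # alternating Gibbs in the top RBM
        h_top = bernoulli(sigmoid(h_below @ W_top + biases[-1]), rng)
        h_below = bernoulli(sigmoid(W_top @ h_top + biases[-2]), rng)
    state = h_below
    for l in range(len(weights) - 2, -1, -1):  # top-down pass, directed layers
        state = bernoulli(sigmoid(weights[l] @ state + biases[l]), rng)
    return state


rng = np.random.default_rng(0)
sizes = [6, 5, 4, 3]                           # visible, h1, h2, h3 (illustrative)
weights = [rng.normal(scale=0.5, size=(sizes[l], sizes[l + 1])) for l in range(3)]
biases = [np.zeros(s) for s in sizes]
x = sample_dbn(weights, biases, rng)
```

Only the top RBM needs an iterative Markov chain; once its lower layer is sampled, the directed layers require a single ancestral pass each.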

Note that if we replace the top RBM of a DBN with a Gaussian prior, i.e. if p(h2) is a Gaussian, the generative model has the same structure as that of a VAE; if the RBM is instead replaced with independent Bernoulli variables, the system becomes a sigmoid belief network (the generative part of a Helmholtz machine).


DBNs for Classification

After layer-by-layer unsupervised pretraining, discriminative fine-tuning by backpropagation achieves an error rate of 1.2% on MNIST. SVMs get 1.4% and randomly initialized backprop gets 1.6%. In this comparison, unsupervised pretraining helps generalization. The intuition is that, without unsupervised pretraining, each training case constrains the model only through the small amount of information in its label, whereas with pretraining the model also learns from the much richer information in the input itself.

Deep Boltzmann Machines

DBMs are undirected graphical models with multiple layers of hidden variables.


DBMs combine bottom-up and top-down inference, which allows them to model the input better than DBNs (or conventional neural networks), whose inference is bottom-up only.


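The conditional distributions can be given as follows. For example, for a DBM with two hidden layers, weights $W^{(1)}$ and $W^{(2)}$, and biases omitted:

$$p(h_j^{(1)} = 1 \mid \mathbf{v}, \mathbf{h}^{(2)}) = \sigma\Big(\sum_{i} W_{ij}^{(1)} v_i + \sum_{k} W_{jk}^{(2)} h_k^{(2)}\Big),$$
$$p(h_k^{(2)} = 1 \mid \mathbf{h}^{(1)}) = \sigma\Big(\sum_{j} W_{jk}^{(2)} h_j^{(1)}\Big), \qquad p(v_i = 1 \mid \mathbf{h}^{(1)}) = \sigma\Big(\sum_{j} W_{ij}^{(1)} h_j^{(1)}\Big).$$

In RBMs, the hidden units are independent of each other given the visible units, which allowed us to calculate the data-dependent expectation analytically. In DBMs, the hidden units are no longer independent of each other given the visible units, so we need an approximation technique for the data-dependent expectation as well.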

Given the data, the hidden units are no longer independent of each other; hence both expectations (data-dependent and model) are intractable.

Variational inference: minimize the KL divergence between the approximating distribution and the true posterior distribution with respect to the variational parameters.
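One common way to carry out this minimization is a fully factorized (mean-field) approximation, whose fixed-point updates are sketched below. The mean-field form, the two-hidden-layer architecture without biases, and the random weights are all assumptions made for illustration.

```python
import numpy as np


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


def mean_field(v, W1, W2, n_iters=50):
    """Fixed-point updates for q(h1, h2 | v) = prod_j q(h1_j) prod_k q(h2_k).

    mu1[j] and mu2[k] are the variational parameters q(h1_j = 1) and
    q(h2_k = 1). Biases are omitted to keep the sketch short.
    """
    mu1 = np.full(W1.shape[1], 0.5)
    mu2 = np.full(W2.shape[1], 0.5)
    for _ in range(n_iters):
        mu1 = sigmoid(v @ W1 + W2 @ mu2)       # bottom-up plus top-down input
        mu2 = sigmoid(mu1 @ W2)
    return mu1, mu2


rng = np.random.default_rng(0)
W1 = rng.normal(scale=0.3, size=(6, 4))        # visible-to-h1 weights (made up)
W2 = rng.normal(scale=0.3, size=(4, 3))        # h1-to-h2 weights (made up)
v = np.array([1.0, 0.0, 1.0, 1.0, 0.0, 0.0])
mu1, mu2 = mean_field(v, W1, W2)
```

The update for `mu1` receives input from both the visible layer and the second hidden layer, which is the bottom-up plus top-down inference that distinguishes DBMs from DBNs.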


Sampling from a two-hidden-layer DBM is done by running a Markov chain.

In practice, we simulate several Markov chains in parallel to generate M samples.
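A sketch of such parallel chains for a two-hidden-layer DBM is given below, with one row of each array per chain. The architecture without biases, the random weights, and the numbers of chains and steps are assumptions for illustration.

```python
import numpy as np


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


def bernoulli(p, rng):
    """Sample binary units with activation probabilities p."""
    return (rng.random(p.shape) < p).astype(float)


def dbm_gibbs(W1, W2, rng, n_chains=16, n_steps=200):
    """Run n_chains Gibbs chains in parallel for a two-hidden-layer DBM."""
    v = bernoulli(np.full((n_chains, W1.shape[0]), 0.5), rng)
    h2 = bernoulli(np.full((n_chains, W2.shape[1]), 0.5), rng)
    for _ in range(n_steps):
        # h1 depends on both of its neighbouring layers.
        h1 = bernoulli(sigmoid(v @ W1 + h2 @ W2.T), rng)
        # v and h2 are conditionally independent given h1.
        v = bernoulli(sigmoid(h1 @ W1.T), rng)
        h2 = bernoulli(sigmoid(h1 @ W2), rng)
    return v, h1, h2


rng = np.random.default_rng(0)
W1 = rng.normal(scale=0.3, size=(6, 4))        # made-up weights
W2 = rng.normal(scale=0.3, size=(4, 3))
v, h1, h2 = dbm_gibbs(W1, W2, rng)
```

Each row of `v` is the final visible state of one chain, so running the chains as a batch yields several approximate samples for the cost of a few matrix products per step.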


References

- Fischer, Asja, and Christian Igel. "An introduction to restricted Boltzmann machines." Iberoamerican Congress on Pattern Recognition. Springer, Berlin, Heidelberg, 2012.
- Russ Salakhutdinov: https://www.youtube.com/watch?v=fdloqrlx8hs&list=plpixoj-hndsosl Buy7_UEVQkyfhHapa
- Hugo Larochelle: https://www.youtube.com/watch?v=sgz6btthmpw&list=pl6xpj9i5qxyecohn7tqghaj6naprnmubh
