Robust Classification using Boltzmann Machines

Vasileios Vasilakakis

The scope of this report is to propose an architecture of Boltzmann machines that can be used for classification, in order to increase the robustness of classification under noisy or otherwise adverse conditions.

Boltzmann machines are probability distributions on high-dimensional binary vectors, analogous to Gaussian Markov random fields in that they are fully determined by first- and second-order moments. A key difference, however, is that augmenting Boltzmann machines with hidden variables enlarges the class of distributions that can be modeled, so that in principle it is possible to model distributions of arbitrary complexity. (Marginalizing over the hidden variables of a Gaussian distribution, on the other hand, merely gives another Gaussian.)

A Boltzmann machine is a probability distribution on binary vectors x of the form

P(x) = \frac{1}{Z} e^{-E(x)},

where, splitting x = (v, h) into visible and hidden units and omitting bias terms, the energy function has the form

E(x) = -h^T W v.

The normalizing constant Z is referred to as the partition function.

Restricted Boltzmann machines are Boltzmann machines with a visible layer and a hidden layer, and with no hidden-to-hidden or visible-to-visible connections. By the Markov property, both Q(h|v) and P(v|h) factorize:

Q(h|v) = \prod_j Q(h_j|v), \qquad P(v|h) = \prod_i P(v_i|h).

Thus there is no need for variational Bayes, and Gibbs sampling can be implemented efficiently by alternating between the hidden and the visible layers (blocked Gibbs sampling).

Restricted Boltzmann machines can be modified to model Gaussian visible variables and binary hidden variables, using the energy function

E(v, h) = \frac{1}{2}(v - b)^T (v - b) - c^T h - v^T W h.

The conditional distribution P(v|h) is then Gaussian with mean b + Wh and identity covariance matrix. As for the posterior Q(h|v), it factorizes as

Q(h|v) = \prod_j Q(h_j|v), \qquad Q(h_j = 1|v) = \mathrm{sigmoid}(c_j + v^T W_{\cdot j}).
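To make blocked Gibbs sampling concrete, here is a minimal NumPy sketch for the Bernoulli-Bernoulli case, with the Gaussian-visible variant noted in a comment. The shapes, seed, and helper names are illustrative assumptions, not part of the report:

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed, illustrative only

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gibbs_step(v, W, b, c):
    """One blocked Gibbs step for a Bernoulli-Bernoulli RBM with
    energy E(v, h) = -b^T v - c^T h - v^T W h (W: n_visible x n_hidden)."""
    # Sample the whole hidden layer at once given the visible layer.
    q_h = sigmoid(c + v @ W)                      # Q(h_j = 1 | v)
    h = (rng.random(q_h.shape) < q_h).astype(float)
    # Sample the whole visible layer at once given the hidden layer.
    p_v = sigmoid(b + h @ W.T)                    # P(v_i = 1 | h)
    v = (rng.random(p_v.shape) < p_v).astype(float)
    # Gaussian-Bernoulli variant: v = b + h @ W.T + rng.standard_normal(b.shape)
    return v, h

# Run a short chain from a random initial visible state:
v = rng.integers(0, 2, size=20).astype(float)
W = 0.01 * rng.standard_normal((20, 10)); b = np.zeros(20); c = np.zeros(10)
for _ in range(100):
    v, h = gibbs_step(v, W, b, c)
```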
We can write

P(v) = \sum_h P(h) P(v|h),

so P(v) can be interpreted as a Gaussian mixture with a prodigious number of components, where the mixture weights are evaluated in a complicated way and each component has the identity matrix as its covariance matrix. Of course, the assumption of an identity covariance matrix is unreasonable unless the data has been appropriately preprocessed, e.g. by whitening it so that the mean of the data is zero and the global covariance matrix is the identity matrix.

Gaussian-Bernoulli and Bernoulli-Bernoulli restricted Boltzmann machines have recently been used to learn higher-level features from the given data. Greedy pretraining of a stack of restricted Boltzmann machines is usually performed with the following steps (a sketch of steps a and b appears after the architecture description below):

a) Train each restricted Boltzmann machine separately.
b) Use the activations of the trained features as if they were data, and learn features in a second hidden layer.
c) It can be proved that each time we add another layer (with the same number of features), we improve a variational lower bound on the log probability of the training data.
d) Fine-tune the stack of pretrained RBMs with backpropagation, using the cross-entropy error function.

The scope of this report is to change the structure of the higher-layer Boltzmann machine so that it performs classification directly, without the need for slow fine-tuning. The architecture that extends the classic Bernoulli-Bernoulli RBM to perform classification is shown in the following graph: our Boltzmann machine is now comprised of visible units x, which are our data or the activations of our data in the nth layer; hidden units h; and class units y, the class labels of our data.
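As a concrete rendering of pretraining steps a and b, the following sketch assumes a hypothetical single-RBM trainer train_rbm(x, n_hidden) -> (W, b, c); the top-layer activations it produces are exactly what the classification RBM described next would see as its visible units:

```python
import numpy as np

sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))

def pretrain_stack(data, layer_sizes, train_rbm):
    """Greedy layer-wise pretraining (steps a-b): train one RBM per layer,
    then feed its hidden activations to the next layer as if they were data.

    `train_rbm(x, n_hidden) -> (W, b, c)` is a stand-in for any single-RBM
    trainer, e.g. one built around contrastive divergence."""
    params, x = [], data                   # data: (n_samples, n_visible)
    for n_hidden in layer_sizes:
        W, b, c = train_rbm(x, n_hidden)   # (a) train this RBM separately
        params.append((W, b, c))
        x = sigmoid(c + x @ W)             # (b) activations become next "data"
    return params, x                       # x: activations at the top layer
```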
In the new RBM the energy function is modified to

E(y, x, h) = -h^T W x - b^T x - c^T h - d^T y - h^T U y,

with parameters (W, b, c, d, U), where y is a one-hot class vector, e.g. y = (0, 0, 0, 1, 0, 0, 0)^T, in which the single 1 indicates the class of the specific data vector. The length of the class vector should match the number of classes, in order to keep the binary nature of the class units and make learning under the Bernoulli-Bernoulli-Softmax BM easier. As we can see from the figure, there are no connections between visible units and class units, and no connections among the hidden units, among the visible units, or among the class units. Therefore we directly have:

P(x|h) = \prod_i P(x_i|h), \qquad P(x_i = 1|h) = \mathrm{sigmoid}(b_i + h^T W_{\cdot i}),

P(y|h) = \frac{e^{d_y + h^T U_{\cdot y}}}{\sum_{y'} e^{d_{y'} + h^T U_{\cdot y'}}},

P(h|y, x) = \prod_j P(h_j|y, x), \qquad P(h_j = 1|y, x) = \mathrm{sigmoid}(c_j + W_{j \cdot} x + U_{j y}).

The hidden units are now devised not only to model the visible vector but also to capture information about the target class.
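The three conditionals above translate directly into code. A minimal sketch, under the shape conventions W of size n_h x n_x and U of size n_h x n_classes (the function name is illustrative):

```python
import numpy as np

sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def bbs_conditionals(x, y, h, W, b, c, d, U):
    """Conditionals of the Bernoulli-Bernoulli-Softmax RBM with energy
    E(y, x, h) = -h^T W x - b^T x - c^T h - d^T y - h^T U y.
    Shapes: W (n_h, n_x), U (n_h, n_classes), y one-hot."""
    p_x = sigmoid(b + W.T @ h)        # P(x_i = 1 | h)
    p_y = softmax(d + U.T @ h)        # P(y | h): softmax over the classes
    p_h = sigmoid(c + W @ x + U @ y)  # P(h_j = 1 | y, x); U @ y picks U_{.y}
    return p_x, p_y, p_h
```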
The contrastive divergence algorithm used to train the classical Bernoulli-Bernoulli restricted Boltzmann machine can also be used to train the Bernoulli-Bernoulli-Softmax RBM. During the positive phase of contrastive divergence, the data are clamped onto the visible units, giving x_0, and the class information is clamped according to the class of the visible vector. The hidden units are then estimated from P(h|y, x), which is our initial estimate of the hidden-unit probabilities for the specific visible vector. In the negative phase, we perform n steps of the traditional contrastive divergence procedure in order to estimate the distribution P(x, y, h) under the BM distribution: initially we sample the hidden vector given the clamped visible and target units; then we resample the class units and the visible units based on the estimated hidden units; finally, we re-estimate the hidden units based on the newly sampled visible and class units. Usually a single step of contrastive divergence already gives good results, but additional steps bring the sampled joint distribution closer to the equilibrium distribution and train the model better. The number of contrastive divergence steps should be tuned manually during the experiments.

During the update phase, the energy of the joint configuration of visible/hidden/class units is differentiated with respect to the parameters of the model. The gradients of the weights, the visible biases, the hidden biases, the class biases, and the class weights are therefore all calculated in order to perform gradient descent on the negative log-likelihood of the generative model.

Under this configuration, classification can be performed with the following formula:

P(y|x) = \frac{e^{-F(x, y)}}{\sum_{y'} e^{-F(x, y')}},

where F(x, y) is the free energy of the joint configuration:

F(x, y) = -d_y - \sum_j \mathrm{softplus}(c_j + U_{j y} + W_{j \cdot} x).

In this setting the goal of the B-B-S machine is to build a generative model of the data distribution and use it for classification. However, there is no guarantee that a good generative model of p(x) is also good in terms of classification, and therefore the model can be extended to directly optimize p(y|x, h) instead of the joint p(y, x, h) optimized by the previous configuration. By changing the optimization objective we obtain a discriminative criterion, directly optimizing the probability that each visible vector belongs to a specific class rather than the generative probability. We can therefore devise a hybrid model that combines the strengths of discriminative and generative training of the B-B-S Boltzmann machine, with the only tuning parameters being the learning rate of the classification BM and the relative contribution of each objective.
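To tie the pieces together, here is a minimal NumPy sketch of one CD-1 update for the B-B-S RBM and of free-energy classification, under the same shape assumptions as the earlier snippets. The function names and learning rate are illustrative; a discriminative or hybrid variant would replace or mix the generative gradient below with the gradient of log p(y|x):

```python
import numpy as np

rng = np.random.default_rng(0)
sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def cd1_step(x0, y0, W, b, c, d, U, lr=0.01):
    """One CD-1 update of the B-B-S RBM, modifying parameters in place.
    Positive phase: clamp (x0, y0); negative phase: one blocked Gibbs step."""
    q_h0 = sigmoid(c + W @ x0 + U @ y0)                 # P(h | y0, x0)
    h0 = (rng.random(q_h0.shape) < q_h0).astype(float)  # sample hidden units
    p_x1 = sigmoid(b + W.T @ h0)                        # resample visibles
    x1 = (rng.random(p_x1.shape) < p_x1).astype(float)
    p_y1 = softmax(d + U.T @ h0)                        # resample class units
    y1 = np.eye(len(d))[rng.choice(len(d), p=p_y1)]
    q_h1 = sigmoid(c + W @ x1 + U @ y1)                 # re-estimate hiddens
    # Gradient ascent on the log-likelihood: positive minus negative statistics.
    W += lr * (np.outer(q_h0, x0) - np.outer(q_h1, x1))
    U += lr * (np.outer(q_h0, y0) - np.outer(q_h1, y1))
    b += lr * (x0 - x1)
    c += lr * (q_h0 - q_h1)
    d += lr * (y0 - y1)

def predict_proba(x, W, c, d, U):
    """P(y | x) from the free energy
    F(x, y) = -d_y - sum_j softplus(c_j + U_{jy} + W_{j.} x).
    The b^T x term is independent of y and cancels in the softmax."""
    act = c + W @ x                                     # (n_h,)
    # np.logaddexp(0, z) == softplus(z); broadcast over the classes.
    F = -(d + np.logaddexp(0.0, act[:, None] + U).sum(axis=0))
    return softmax(-F)   # low free energy -> high class probability
```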
Having created the B-B-S Boltzmann machine, the following experiments could be performed:

a) Train a Gaussian-Binary Boltzmann machine on the initially Gaussian-distributed data.
b) Train a stack of Binary-Binary Boltzmann machines, using as data the activations from the previous layer.
c) Add a B-B-S Boltzmann machine at the top layer, using the hybrid training procedure described above, in order to perform classification.

Apart from this obvious classification experiment, a series of further experiments could be conducted in order to investigate the efficiency and contribution of the specific BM:

a) Train a stack of B-B-S Boltzmann machines directly and investigate the classification performance.
b) Extend the restricted B-B-S architecture to a general Boltzmann machine, with connections also among the visible units, among the hidden units, and among the class units. In that case simple contrastive divergence is not sufficient, because the posterior over the hidden variables no longer factorizes; variational Bayes / a mean-field approximation should be used instead (see the sketch after this list). This configuration is more promising than the previous one, as it captures more correlations among the units and therefore learns a better generative model, though it relies on the strong factorizability assumption of variational Bayes, which may not be appropriate for training the Boltzmann machine.
c) Compare the previous configurations with:
   1. a classic multilayer perceptron;
   2. a classic multilayer perceptron whose weights have been initialized using a stack of RBMs.
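Regarding experiment b, a minimal sketch of the mean-field fixed-point iteration, under the assumption of added lateral hidden-to-hidden weights J (a symmetric matrix with zero diagonal, hypothetical here):

```python
import numpy as np

sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))

def mean_field_hidden(x, y, W, c, U, J, n_iters=20):
    """Mean-field estimate of Q(h | x, y) when lateral hidden-to-hidden
    weights J are added, so the exact posterior no longer factorizes.
    Fixed-point update: mu_j <- sigmoid(c_j + W_{j.} x + U_{j.} y + J_{j.} mu)."""
    mu = sigmoid(c + W @ x + U @ y)     # restricted-case posterior as init
    for _ in range(n_iters):
        mu = sigmoid(c + W @ x + U @ y + J @ mu)
    return mu
```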