Robust Classification using Boltzmann machines by Vasileios Vasilakakis

The scope of this report is to propose an architecture of Boltzmann machines that could be used in the context of classification, in order to increase the robustness of classification under possibly noisy/adverse conditions.

Boltzmann machines are probability distributions on high-dimensional binary vectors which are analogous to Gaussian Markov random fields in that they are fully determined by first- and second-order moments. A key difference, however, is that augmenting Boltzmann machines with hidden variables enlarges the class of distributions that can be modeled, so that in principle it is possible to model distributions of arbitrary complexity. (On the other hand, marginalizing over hidden variables in a Gaussian distribution merely gives another Gaussian.)

A Boltzmann machine is a probability distribution on binary vectors x of the form

P(x) = (1/Z) e^{-E(x)}

where the energy function E(x) has the form

E(x) = -x^T W x - b^T x

The normalizing constant Z is referred to as the partition function.

Restricted Boltzmann machines are Boltzmann machines with one visible layer and one hidden layer, and no hidden-to-hidden or visible-to-visible connections. By the Markov property, both Q(h|v) and P(v|h) factorize:

Q(h|v) = Π_j Q(h_j|v)
P(v|h) = Π_i P(v_i|h)

Thus there is no need for variational Bayes, and Gibbs sampling can be implemented efficiently by alternating between the hidden and the visible layers (blocked Gibbs sampling).

Restricted Boltzmann machines can be modified to model Gaussian visible variables and binary hidden variables, with the following energy function:

E(v, h) = (1/2) (v - b)^T (v - b) - c^T h - v^T W h

The conditional distribution P(v|h) is Gaussian with mean b + Wh and identity covariance matrix. As for the posterior Q(h|v), it factorizes as Q(h|v) = Π_j Q(h_j|v), where

Q(h_j = 1|v) = sigmoid(c_j + v^T W_{.j})
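Since both conditionals factorize over units, blocked Gibbs sampling only needs elementwise sigmoids and independent Bernoulli (or Gaussian) draws. The following is a minimal NumPy sketch of the two conditionals and one Gibbs alternation; the shapes, function names and the gaussian_visible flag are illustrative assumptions, not something specified in the report.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def sample_h_given_v(v, W, c, rng):
    # Q(h_j = 1 | v) = sigmoid(c_j + v^T W_{.j}); W has shape (n_visible, n_hidden)
    p_h = sigmoid(c + v @ W)
    return p_h, (rng.random(p_h.shape) < p_h).astype(float)

def sample_v_given_h(h, W, b, rng, gaussian_visible=False):
    # Binary visibles: P(v_i = 1 | h) = sigmoid(b_i + W_{i.} h).
    # Gaussian visibles (Gaussian-Bernoulli RBM): P(v | h) = N(b + W h, I), as in the text.
    mean = b + h @ W.T
    if gaussian_visible:
        return mean, mean + rng.standard_normal(mean.shape)
    p_v = sigmoid(mean)
    return p_v, (rng.random(p_v.shape) < p_v).astype(float)

# One blocked Gibbs alternation v -> h -> v on toy parameters
rng = np.random.default_rng(0)
n_visible, n_hidden = 6, 4
W = 0.01 * rng.standard_normal((n_visible, n_hidden))
b, c = np.zeros(n_visible), np.zeros(n_hidden)
v0 = rng.integers(0, 2, size=n_visible).astype(float)
_, h0 = sample_h_given_v(v0, W, c, rng)
_, v1 = sample_v_given_h(h0, W, b, rng)
```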

We can write P(v) = Σ_h P(h) P(v|h), so we can interpret P(v) as a Gaussian mixture with a prodigious number of components, where the mixture weights are evaluated in a complicated way and each component has the identity matrix as its covariance matrix. Of course, the assumption of an identity covariance matrix is unreasonable unless the data has been appropriately preprocessed (e.g. by whitening it, so that the mean of the data is zero and the global covariance matrix is the identity matrix).

Gaussian-Bernoulli and Bernoulli-Bernoulli restricted Boltzmann machines have recently been used to learn higher-level features from the given data. Greedy pretraining of restricted Boltzmann machines is usually performed with the following steps (a minimal sketch of the stacking loop follows the list):

a) Train each restricted Boltzmann machine separately.
b) Use the activations of the trained features as data and learn features in a second hidden layer.
c) It can be proved that each time we add another layer (of the same size), we improve a variational lower bound on the log probability of the training data.
d) Fine-tune the stack of pretrained RBMs with backpropagation, using the cross-entropy error function.
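A minimal sketch of the stacking loop in steps a) and b): each RBM is trained on the hidden activations produced by the layer below, which are then treated as data for the next layer. The function train_rbm below is a placeholder with an assumed name and signature; in practice it would run contrastive divergence.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def train_rbm(data, n_hidden, rng):
    """Placeholder RBM trainer returning (W, b, c). A real implementation would
    run contrastive divergence; random weights are returned here only so that
    the stacking loop below is runnable."""
    n_visible = data.shape[1]
    W = 0.01 * rng.standard_normal((n_visible, n_hidden))
    return W, np.zeros(n_visible), np.zeros(n_hidden)

def greedy_pretrain(data, layer_sizes, rng):
    """Train a stack of RBMs, feeding each layer's hidden activations
    (step b in the text) to the next layer as if they were data."""
    stack, layer_input = [], data
    for n_hidden in layer_sizes:
        W, b, c = train_rbm(layer_input, n_hidden, rng)
        stack.append((W, b, c))
        layer_input = sigmoid(c + layer_input @ W)  # activations become the next layer's data
    return stack

rng = np.random.default_rng(0)
X = rng.integers(0, 2, size=(100, 20)).astype(float)  # toy binary data
stack = greedy_pretrain(X, layer_sizes=[16, 12], rng=rng)
```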

The scope of this report is to change the structure of the higher-layer Boltzmann machine so that it performs classification directly, without the need for slow fine-tuning. The architecture that extends the classic Bernoulli-Bernoulli RBM to perform classification is shown in the following graph: our Boltzmann machine is now comprised of the visible units x, which are our data (or the activations of our data in the n-th layer), the hidden units h, and the class units y, the class labels of our data.

In the new RBM the energy function is modified; the new formula is

E(y, x, h) = -h^T W x - b^T x - c^T h - d^T y - h^T U y

with parameters (W, b, c, d, U) and y = (0,0,0,1,0,0,0), where the 1 indicates the class of the specific data vector. The size of the class vector should be set according to the number of classes, in order to keep the binary nature of the class units and make learning under the Bernoulli-Bernoulli-Softmax BM easier. As the figure shows, there are no connections between visible units and class units, and there are no connections among the hidden units, among the visible units, or among the class units; therefore we directly have:

P(x|h) = Π_i P(x_i|h)
P(x_i = 1|h) = sigmoid(b_i + h^T W_{.i})
P(y|h) = e^{d_y + h^T U_{.y}} / Σ_{y*} e^{d_{y*} + h^T U_{.y*}}
P(h|y, x) = Π_j P(h_j|y, x)

The hidden units in this case are devised not only to binarize our visible vector, but also to capture information about the target class each time.
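The factorized conditionals above translate directly into code. A minimal sketch, assuming W has shape (n_visible, n_hidden), U has shape (n_hidden, n_classes) and y is a one-hot vector; these shapes and names are illustrative assumptions.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def softmax(a):
    e = np.exp(a - a.max())
    return e / e.sum()

def p_h_given_xy(x, y, W, U, c):
    # P(h_j = 1 | x, y) = sigmoid(c_j + x^T W_{.j} + U_{j,y})
    return sigmoid(c + x @ W + U @ y)

def p_x_given_h(h, W, b):
    # P(x_i = 1 | h) = sigmoid(b_i + W_{i.} h)
    return sigmoid(b + W @ h)

def p_y_given_h(h, U, d):
    # P(y | h) = softmax over classes of d_y + h^T U_{.y}
    return softmax(d + h @ U)

# Toy usage with assumed sizes
rng = np.random.default_rng(0)
nv, nh, nc = 8, 5, 3
W, U = 0.01 * rng.standard_normal((nv, nh)), 0.01 * rng.standard_normal((nh, nc))
x, y = rng.integers(0, 2, nv).astype(float), np.eye(nc)[1]
h = p_h_given_xy(x, y, W, U, np.zeros(nh))
```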

The contrastive divergence algorithm that is used to train the classical Bernoulli-Bernoulli restricted Boltzmann machine can also be used to train the Bernoulli-Bernoulli-Softmax RBM. During the positive phase of contrastive divergence, our data are clamped onto the visible units, giving x^(0), and the class information is clamped according to the class of the visible vector. The hidden units are estimated from the formula for P(h|x, y), which is our initial estimate of the probability of the hidden units for the specific visible vector. In the negative phase, we perform n steps of the traditional contrastive divergence procedure in order to estimate the distribution P(x, y, h) under the BM distribution. Initially, we sample our hidden vector given the clamped visible and target units. Then we resample the class units based on the estimated hidden units, and we also resample the visible units based on the estimated hidden units. Finally, we re-estimate our hidden units based on the newly sampled visible and class units. Usually a single step of contrastive divergence already gives good results, but including more steps makes the joint distribution approximate the equilibrium distribution more closely and trains the model better. The number of contrastive divergence steps should be tuned manually during the experiments.

During the update phase, the energy of the joint visible/hidden/class configuration is used to compute its gradients with respect to the parameters of our model. Therefore the gradients of the weights, the visible biases, the hidden biases, the class biases and the class weights should be calculated in order to perform gradient descent on the negative log likelihood of the generative model. Under this configuration, classification can be performed using the following formula:

P(y|x) = e^{-F(x, y)} / Σ_{y*} e^{-F(x, y*)}

where F(x, y) is the free energy of the joint configuration:

F(x, y) = -d_y - Σ_j softplus(c_j + U_{jy} + x^T W_{.j})

In this setting, however, the goal of the B-B-S machine is to build a generative model of the data distribution and then use it to perform classification. There is no guarantee that a good generative model of p(x) is also good in terms of classification, and therefore the model can be extended to directly optimize p(y|x, h) instead of the joint p(y, x, h) optimized by the previous configuration. By changing the optimization objective we obtain a discriminative criterion, directly optimizing the probability that each visible vector belongs to a specific class rather than the generative probability. We can therefore devise a hybrid model that combines the strengths of discriminative and generative training of the B-B-S Boltzmann machine, having as its only tuning parameters the learning rate and the relative contribution of each criterion.
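Below is a minimal, self-contained sketch of one CD-1 update following the phases described above, together with classification via the free energy F(x, y). The shapes, function names and the plain gradient update are illustrative assumptions, not a reference implementation from this report.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def softplus(a):
    return np.logaddexp(0.0, a)  # log(1 + e^a)

def cd1_step(x, y, W, U, b, c, d, lr, rng):
    """One CD-1 update for the Bernoulli-Bernoulli-Softmax RBM.
    Assumed shapes: W (n_visible, n_hidden), U (n_hidden, n_classes); x binary, y one-hot."""
    # Positive phase: clamp x and y, compute P(h | x, y).
    ph_pos = sigmoid(c + x @ W + U @ y)
    # Negative phase: sample h, resample y and x from h, then re-estimate h.
    h = (rng.random(ph_pos.shape) < ph_pos).astype(float)
    logits = d + h @ U
    py = np.exp(logits - logits.max()); py /= py.sum()
    y_neg = np.eye(len(d))[rng.choice(len(d), p=py)]
    px = sigmoid(b + W @ h)
    x_neg = (rng.random(px.shape) < px).astype(float)
    ph_neg = sigmoid(c + x_neg @ W + U @ y_neg)
    # Update phase: positive statistics minus negative statistics
    # (ascent on the log likelihood, i.e. descent on the negative log likelihood).
    W += lr * (np.outer(x, ph_pos) - np.outer(x_neg, ph_neg))
    U += lr * (np.outer(ph_pos, y) - np.outer(ph_neg, y_neg))
    b += lr * (x - x_neg)
    c += lr * (ph_pos - ph_neg)
    d += lr * (y - y_neg)
    return W, U, b, c, d

def predict_proba(x, W, U, c, d):
    """P(y | x) from F(x, y) = -d_y - sum_j softplus(c_j + U_{jy} + x^T W_{.j})."""
    scores = d + softplus(c[:, None] + U + (x @ W)[:, None]).sum(axis=0)  # -F(x, y)
    p = np.exp(scores - scores.max())
    return p / p.sum()
```

For CD-k, the negative phase (sampling h, then y and x, then h again) would simply be repeated k times before the update, as described in the text.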

Having created the B-B-S Boltzmann machine, the following experiments could be performed:

a) Train a Gaussian-Binary Boltzmann machine on the initial, Gaussian-distributed data.
b) Train a stack of Binary-Binary Boltzmann machines, using as data the activations from the previous layer.
c) Add a B-B-S Boltzmann machine at the highest layer, using the hybrid training procedure described before, in order to perform classification.

Apart from this obvious experiment on classification with Boltzmann machines, there is a series of experiments that could be conducted to investigate the efficiency and contribution of the specific BM:

a) Train a stack of B-B-S Boltzmann machines directly and investigate the classification performance.
b) Extend the restricted B-B-S machine to a general Boltzmann machine, where there are also connections between the visible units, between the hidden units and between the class units. In that case simple contrastive divergence is not sufficient, because the posterior over the hidden variables does not factorize; variational Bayes / mean-field approximation should be used instead (a sketch of the mean-field updates follows this list). This configuration is more promising than the previous one, as it captures more correlations among the units and therefore learns a better generative model, though it rests on the strong factorization assumption of variational Bayes, which may not be appropriate for training the Boltzmann machine.
c) Compare the previous configurations with 1) a classic multilayer perceptron and 2) a classic multilayer perceptron whose weights have been initialized using a stack of RBMs.
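For experiment b), where lateral connections make the hidden posterior non-factorial, the exact P(h|x, y) can be replaced by a mean-field estimate. A minimal sketch of the usual damped fixed-point iteration, assuming a hidden-to-hidden weight matrix J with zero diagonal; the name J, the damping value and the iteration count are illustrative assumptions.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def mean_field_h(x, y, W, U, J, c, n_iters=20, damping=0.5):
    """Mean-field approximation of P(h | x, y) when hidden units have lateral
    connections J (zero diagonal): each mu_j is updated from the current
    mean-field estimates of the other hidden units, iterated to a fixed point."""
    mu = sigmoid(c + x @ W + U @ y)  # start from the factorial (restricted) estimate
    for _ in range(n_iters):
        mu_new = sigmoid(c + x @ W + U @ y + J @ mu)
        mu = damping * mu + (1.0 - damping) * mu_new
    return mu

# Toy usage with assumed sizes
rng = np.random.default_rng(0)
nv, nh, nc = 8, 5, 3
W, U = 0.01 * rng.standard_normal((nv, nh)), 0.01 * rng.standard_normal((nh, nc))
J = 0.01 * rng.standard_normal((nh, nh)); J = (J + J.T) / 2; np.fill_diagonal(J, 0.0)
mu = mean_field_h(rng.integers(0, 2, nv).astype(float), np.eye(nc)[1], W, U, J, np.zeros(nh))
```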